• Open Access

Development of an expressed gene catalogue and molecular markers from the de novo assembly of short sequence reads of the lentil (Lens culinaris Medik.) transcriptome

Authors


Correspondence (Tel +91 11 26735159; fax +91 11 26741658; email sabhyatabhatia@nipgr.ac.in)

Summary

Genomic resources such as ESTs, molecular markers and linkage maps are essential for crop improvement. However, these resources are still limited in important legumes such as lentil (Lens culinaris Medik.), which is valued worldwide as a rich source of dietary protein. In this study, de novo transcriptome assembly of 119 855 798 short reads, generated by Illumina paired-end sequencing, was performed using various assembly programs. This resulted in 42 196 nonredundant high-quality transcripts with an average length of 810 bases, an N50 value of 1 432 and an average expression per transcript of 26.21 reads per kilobase per million (RPKM). Similarity searches against the unigenes and protein sequences of other plants showed maximum similarity with soybean. A total of 20 009 nonredundant transcripts showed similarity with the UniProtKB database and, of these, 18 064 transcripts were grouped into three main GO categories, that is, biological process (15 126), molecular function (15 505) and cellular component (9 434). Annotated transcripts were mapped to 289 predicted Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and 8 893 transcripts were classified into 24 functional categories based on Cluster of Orthologous Groups (COG) of proteins. Mining the data set for SSRs yielded 8 722 SSRs at a frequency of one SSR per 3.92 kb. From these, 5 673 SSR primer pairs were designed, and a subset of these was utilized for diversity analysis. This study, which provides a large data set of annotated transcripts and gene-based SSR markers, will serve as a foundation for various applications in lentil breeding and genetics.

Introduction

Lentil (Lens culinaris Medik.) is one of the most ancient crops of the Mediterranean region and is grown in many countries spread across all continents, covering over 3.5 million hectares with a total production of over 3 million metric tons. India produces about 0.9 million tons, a third of the worldwide production of lentils, most of which is consumed in the domestic market, while Canada leads with 1.5 million tons and is the largest exporter (FAOSTAT http://faostat.fao.org/site/567/DesktopDefault.aspx?PageID=567#ancor). With 26% protein, lentils are among the most protein-rich plant-based foods. The genus Lens contains the cultivated lentil, Lens culinaris (syn. Lens esculenta Moench.; Muehlbauer et al., 1980), and three wild species, that is, L. nigricans, L. ervoides and L. lamottei (Alo et al., 2011). The yield of this important legume is hampered by several biotic (Ascochyta blight, Fusarium wilt and Anthracnose) and abiotic (drought, heat and salinity) factors (Muehlbauer et al., 2006). Despite the advent of recent genomic technologies, limited progress has been made in lentil, a self-pollinating annual diploid (2n = 14) with a relatively large genome of 4063 Mbp (Arumuganathan and Earle, 1991), which makes whole genome sequencing difficult. The currently available linkage maps in lentil have been constructed using RAPD and AFLP markers (Durán et al., 2004; Rubeena et al., 2003), a few SSR markers (Hamwieh et al., 2005) and, more recently, SNPs (Sharpe et al., 2013). Only about nine thousand lentil ESTs were available (as of Sep 2012) in the NCBI database when this study was undertaken. Therefore, in order to accelerate research in this nonmodel species, there was an urgent need to enrich the repertoire of genomic resources available for this legume.

Next generation sequencing (NGS) technologies, especially the massively parallel short read strategy of DNA sequencing on the Illumina platform, have provided unprecedented opportunities for high-throughput functional genomics research. Transcriptome analysis for global gene discovery and expression studies is an ideal application of NGS. However, de novo assembly of transcriptome data in organisms for which no auxiliary knowledge is available still remains a challenge. Recently, the short reads obtained from Illumina have been used efficiently for de novo transcriptome assemblies of many important plants such as sweet potato (Tao et al., 2012), pigeon pea (Kudapa et al., 2012), wheat (Duan et al., 2012), chickpea (Garg et al., 2011), sesame (Wei et al., 2011), rubber plant (Xia et al., 2011) and carrot (Iorizzo et al., 2011), as well as of animals such as Chinese sika deer (Yao et al., 2012), silver carp (Fu and He, 2012), seal (Gao et al., 2012), snail (Feldmeyer et al., 2011) and whitefly (Wang et al., 2010a). These studies have clearly established that Illumina's sequencing technology produces a deep coverage of expressed sequences and reliable quantitative data (Metzker, 2010) and is therefore advantageous for transcriptome analysis.

In this study, Illumina paired-end sequencing was utilized to characterize the transcriptome of lentil for the development of a high-quality expressed gene catalogue and gene-derived functional SSR markers. A paired-end library sequencing strategy was applied not only to increase the sequencing depth, but also to improve the efficiency of de novo assembly. To this end, de novo assembly was carried out using various assembly programs, and the most appropriately assembled set of nonredundant transcripts was identified. These were subjected to functional annotation, pathway analysis and identification of different protein families. Further, the GC content and expression levels of the unigenes were determined. Additionally, the mining of SSRs, their validation and their application in the analysis of genetic diversity were demonstrated. Our results clearly establish the utility of NGS for transcriptome analysis, especially with short read technologies applied to nonmodel organisms. This data set will be a major addition to the repertoire of existing genomic resources in lentil.

Results

Illumina paired-end sequencing and optimization of de novo assembly

cDNA libraries prepared from RNA isolated from leaf, root and seedlings of lentil were sequenced using the Illumina GAII platform. Paired-end sequencing resulted in a total of 119 855 798 reads comprising 8 880 660 006 nucleotide bases. To improve the quality of the data set, high stringency filtering was performed: reads containing adaptor and/or vector sequence were removed, and reads in which 30% of the bases had a phred quality ≤20 were discarded. Finally, 91 282 242 (76.16%) high-quality reads remained, in which the average quality score at each base position was above 30 (except for the last five bases), and these were used for the assembly.
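The quality criterion described above can be illustrated with a minimal Python sketch. It is not the filtering pipeline used in this study: the function names are hypothetical, Phred+33 quality encoding is assumed (older Illumina pipelines often used Phred+64), the rule is read as "discard when at least 30% of bases have quality ≤20", and adaptor/vector removal is not shown.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_quality_filter(quality_string, min_quality=20,
                          max_low_fraction=0.30, offset=33):
    """Return True if the read should be kept.

    A read is discarded when the fraction of bases with Phred quality
    <= min_quality reaches max_low_fraction (assumed reading of the rule).
    """
    scores = phred_scores(quality_string, offset)
    if not scores:
        return False
    low = sum(1 for q in scores if q <= min_quality)
    return low / len(scores) < max_low_fraction


# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
good_read = "I" * 10
bad_read = "I" * 6 + "#" * 4  # 40% low-quality bases
kept = [q for q in (good_read, bad_read) if passes_quality_filter(q)]
```

In this toy example only the first read is retained; the second exceeds the 30% threshold.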

The quality of a de novo assembly depends on many factors, the foremost being the choice of assembler, followed by parameters such as k-mer length, N50 value and coverage. Hence, the filtered reads were de novo assembled using the Velvet software at k-mer lengths of 35, 39, 43, 47, 51, 55 and 59 to determine the total number of contigs, N50 value, average contig length and longest contig length as a function of k-mer (Figure S1A; Table S1). The analysis showed that the total number of contigs decreased as the k-mer length increased, that is, the two were inversely related. However, there was no direct correlation of N50 and average contig size with k-mer; on the basis of these two metrics, k-mer 47 was adjudged the most optimal, resulting in 32 330 contigs with an N50 value of 1 454 and an average contig length of 846 bp. This assembly utilized 78 155 671 (85.62%) of the high-quality reads.
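For readers unfamiliar with the metric, N50 is taken here in its usual sense: the length of the shortest contig in the smallest set of longest contigs that together contain at least half of the assembled bases. The sketch below is illustrative only; the function name is hypothetical, and the "N50 index" reported in Table 1 is assumed to be the number of contigs in that set.

```python
def n50(contig_lengths):
    """Return (N50 length, N50 index) for a collection of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for index, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, index
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (bp), not data from the study.
lengths = [100, 200, 300, 400, 500]
n50_length, n50_index = n50(lengths)     # 500 + 400 = 900 >= 750
average_length = sum(lengths) / len(lengths)
```

Comparing `n50_length` and `average_length` across assemblies produced at different k-mer values mirrors the selection procedure described in the text.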

The quality of the transcriptome assembly may also be determined by the assembly software that is employed to assemble the short read data. Therefore, various other assembly programs such as ABySS and SOAPdenovo were utilized for assembling the lentil reads with different k-mers. These assemblies revealed N50 values and average contig lengths lower than those obtained using the Velvet assembler (Figure S1B). Moreover, total contigs formed were unexpectedly high (>55 000 and >65 000, respectively; Figure S1B). Hence, the Velvet assembly seemed most promising. Further, another software called Oases, which is specifically recommended for the assembly of short reads to generate transcript isoforms, was also used. This resulted in 42 196 nonredundant transcripts that were represented by 55 463 transcript isoforms comprising 65.20% of the total reads (Table 1). This assembly had N50 value of 1 432 and an average contig length of 810 bp with the maximum number of transcripts (30 026) being 100–1000 bp in size and a significantly lower number (12 170) being >1000 bp (Table 1 and Figure 1).

Table 1. Statistics of final assembly of lentil transcriptome sequence data assembled using Velvet and Oases
Total no. of raw reads                        119 855 798
Total no. of filtered reads                   91 282 242 (76.16%)
No. of reads assembled                        78 155 671 (85.62%)
Total no. of transcript isoforms              55 463
No. of nonredundant transcripts               42 196
N50 length (bp)                               1432
N50 index                                     7605
No. of bases comprising nr transcripts (bp)   34 188 311
Avg. contig length (bp)                       810
Avg. no. of reads per transcript              1409.15
Average coverage (RPKM)                       26.21
Number of transcripts between 100 and 500     22 004
Number of transcripts between 500 and 1000    8022
Number of transcripts >1000                   12 170
Figure 1.

Size distribution of lentil transcripts.

Sequence similarity of lentil transcripts with other plants

Sequence conservation between the nonredundant transcripts of lentil and other plants was analysed by searching against the available unigene data sets of plants, namely Medicago, Lotus, soybean, Pisum, chickpea, Phaseolus, Arabidopsis, coffee and rice using TBLASTX homology search with a cut-off threshold E-value of ≤1E-05. The similarity search revealed that about 32 746 (77.60%) of lentil transcripts showed similarity with at least one unigene of these plants and had maximum similarity with soybean (68.74%) and Medicago (68.67%). We further characterized the transcripts by high stringency BLASTX analysis with E-value ≤1E-05 to proteome databases of seven different reference genomes, namely Arabidopsis, soybean, Jatropha, rice, Populus, Sorghum and Zea. About 30 031 (71.17%) of lentil transcripts showed significant similarity with the predicted proteins of the sequenced plants with the maximum number being with soybean (69.87%), as expected, followed by Populus (64.59%), Jatropha (63.68%) and Arabidopsis (62.49%; Figure 2). The analysis revealed that 33 358 (79.05%) of total transcripts showed similarity with at least one known protein or known unigene of other plant species, whereas 8838 (20.95%) of transcripts did not show similarity to any of the known proteins or unigenes, and some of them may represent putative novel or lentil-specific genes.
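The proportion of transcripts with a significant hit can be tallied from BLAST results along the following lines. This is a sketch under stated assumptions rather than the procedure used here: it assumes the searches were saved in the default 12-column tabular format of BLAST+ (the output format is not stated in the text), and the function name and example records are hypothetical.

```python
import csv
import io


def queries_with_hits(tabular_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Expects the 12 default tab-separated columns of BLAST+ tabular output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    hits = set()
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= max_evalue:
            hits.add(row[0])
    return hits


# Hypothetical example records, not data from the study.
example = io.StringIO(
    "tx1\tGlyma01\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180\n"
    "tx2\tGlyma02\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
    "tx1\tGlyma03\t70.0\t150\t45\t0\t1\t450\t1\t150\t1e-20\t95\n"
)
annotated = queries_with_hits(example)
total_transcripts = 3  # hypothetical total, including one transcript with no hits
percent_with_hits = 100 * len(annotated) / total_transcripts
```

Each query is counted once regardless of how many subject sequences it matches, which corresponds to the "at least one unigene" criterion used above.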

Figure 2.

Similarity of lentil transcripts with proteins of various plant species. The similarity of lentil transcripts with proteins of different plant species was performed using BLASTX (E-value ≤1E-05).

The GC content of lentil was determined to be 39.97%, which was comparable to other legume plants whose GC content varied from 42.14% in Lotus to 39.16% in Medicago but was contrastingly much lower than that of rice (51.64%; Figure S2).
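As a simple illustration of how such a figure can be obtained, the sketch below computes a pooled GC percentage over a set of sequences. Whether the published value was pooled across all bases or averaged per transcript is not stated, so pooling is an assumption, and the function name is hypothetical.

```python
def gc_content(sequences):
    """Return the pooled GC percentage across sequences, ignoring non-ACGT symbols."""
    gc = at = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    if gc + at == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100 * gc / (gc + at)


# Hypothetical sequences, not data from the study.
example_gc = gc_content(["ATGC", "GGCC"])  # 6 G/C bases out of 8
```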

Functional annotation and assigning of Gene Ontology (GO) categories

Lentil transcripts were searched against the nonredundant protein sequences available in the UniProtKB database using BLASTX with a threshold E-value of ≤1E-05 in order to assign putative functions. Of the 42 196 transcripts, 20 009 (47.42%) showed significant hits with proteins in the UniProtKB data set, indicating overall gene conservation in lentil. Further, Gene Ontology (GO) terms were assigned to 18 064 annotated transcripts, which were grouped into three main GO categories, that is, biological process, molecular function and cellular component. In this annotation, one gene may be assigned to more than one category; therefore, the sum of the transcripts across the three categories exceeded the total number of annotated transcripts. In all, 15 505 transcripts were grouped under molecular function, making it the largest class of GO assignments, followed by 15 126 under biological process and 9 434 under cellular component. In the molecular function category, 27.99% of the genes were associated with binding, followed by catalytic activity (23.74%), transferase activity (9.29%) and hydrolase activity (8.75%). Under the biological process category, the majority of the genes were involved in cellular processes (27.58%) and metabolic processes (24.06%), followed by macromolecule metabolic process (11.5%) and response to stimulus (10.35%). Within the cellular component category, the most highly represented genes belonged to the intracellular component (21.09%), followed by cell surface (15.39%), extracellular space (10.97%) and chromosome (8.50%; Figure 3).

Figure 3.

Gene Ontology classification of lentil transcripts. Gene ontology terms were assigned to the 18 064 annotated transcripts and were grouped in three main categories: biological process (15 126), cellular component (9434) and molecular function (15 505).

Further analysis involved the identification of transcription factors (TFs) by sequence comparison of the nonredundant lentil transcripts with the known plant transcription factor families (Perez-Rodriguez et al., 2010) using BLASTX with a threshold E-value of ≤1E-05. About 17.69% of the lentil transcripts had significant hits, thereby identifying 7 463 putative transcription factors distributed across 82 families (Figure 4). Some of the largest TF families identified in lentil included FAR1, C3H, NAC, bHLH, MADS, PHD and MYB (and related), followed by WRKY, C2H2, SNF2, TRAF, AP2-EREB, FHA and HB. Further, the putative lentil TFs were compared with the TFs of four legumes (Medicago, Lotus, soybean and common bean) that fall in the same taxonomic group as lentil. Many of the lentil TF genes were highly similar (almost identical) to those of other legumes, especially Medicago and Lotus. However, when the number of hits in each TF family was compared across legumes, many of the lentil TF families were found to be larger (C3H, PHD, MADS, NAC, SNF2, etc.) than those of Medicago, Lotus and soybean, whereas some families were smaller (AP2-EREB, bZIP, C2H2, CCAAT, etc.) than those of soybean (Table S2). This variation in the number of TFs among legumes, despite genome-wide colinearity among them, may be a result of contractions or expansions during evolution.

Figure 4.

Transcription factor families identified in lentil transcriptome. Transcription factors (TFs) were identified by sequence comparison of lentil transcripts with the plant transcription factor database using BLASTX (E-value of ≤1E-05). 7 463 putative TFs were identified, which were distributed to 82 families.

The distribution of lentil transcripts into various biological pathways was carried out by mapping the unigenes to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG). Only 5 227 (26.12%) of annotated transcripts were assigned to 289 KEGG pathways of which the top 20 pathways with the greatest number of sequences are shown in Figure 5. The ‘metabolic pathways’ was the most abundant category with 748 members. ‘Biosynthesis of secondary metabolites’ and ‘microbial metabolism in diverse environment’ were also well represented with 340 and 117 members, respectively.

Figure 5.

KEGG pathway analysis. Top 20 active biological pathways obtained in lentil transcriptome among a total of 289 predicted pathways.

In order to functionally classify the transcripts at the protein level, all the transcripts were aligned to the Cluster of Orthologous Groups of proteins (COG) database. COG is based on the phylogenetic classification of the proteins encoded in the complete genome where each COG consists of individual proteins or groups of paralogs from at least three lineages and thus corresponds to an ancient conserved domain. Of 10 703 nonredundant hits, 8 893 transcripts had a COG classification and were divided into 24 functional categories (Figure 6). Among the 24 COG categories, the cluster for ‘general function and prediction’ represented the largest group (1 996, 22.44%), followed by ‘post-translational modification, protein turnover, chaperones’ (846, 9.51%) and ‘carbohydrate transport and metabolism’ (649, 7.29%).

Figure 6.

Cluster of Orthologous Groups (COG) annotations of putative proteins. Of 10 703 nonredundant hits, 8 893 transcripts were aligned to COG database and classified functionally into 24 categories.

Mapping of reads to lentil transcripts for identification of highly expressed genes

Gene expression analysis was performed by mapping the NGS reads onto the nonredundant set of transcripts, which revealed two kinds of information, that is, the level of expression and the tissue-specific expression of the transcripts. To determine the level of expression of lentil genes, the coverage of each transcript was determined as reads per million (RPM). The number of reads corresponding to each transcript ranged from 6 (0.07 rpm) to 145 098 (1 766.21 rpm) with an average of 1 409 (23.92 rpm; Table S3). This wide range indicated that even transcripts with very low expression levels were represented in the lentil assembly. The RPKM values ranged from 0.46 to 6655.65 with an average of 26.21 (Table S3). Expression levels were classified based on RPKM values as very low (1–3), low (3–10), moderate (10–50), high (50–100) and very high (>100). In all three tissue samples, the largest fraction of transcripts (26.03%–34.64%) showed very low expression (RPKM <1–3), followed by low (27.33%–29.42%) and moderate (19.01%–22.45%) expression (Figure 7a). The smallest fraction of transcripts (3.47%–4.44%) was expressed at a high level (RPKM >50–100), while very high expression (RPKM >100) was observed in 4.05%–4.83% of transcripts. Furthermore, preferential expression of transcripts in specific tissues (seedling, leaf and root) was analysed based on their RPKM values in each tissue sample (>3-fold difference in expression). Analysis of the number of genes with significant preferential expression between two tissues (Figure 7b) revealed that the largest number of transcripts showed preferential expression in root as compared with seedling (6719), followed by root in comparison with leaf (5930). The smallest number was observed for seedling in comparison with leaf (2636).
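The RPM and RPKM measures follow the standard definitions, RPM = reads × 10^6 / total mapped reads and RPKM = reads × 10^9 / (transcript length in bp × total mapped reads). The sketch below applies them together with the expression classes and the threefold criterion described above. It is illustrative only: the function names are hypothetical, the treatment of class boundaries and of values below 1 is an assumption, and the example numbers are not taken from the study.

```python
def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads."""
    return read_count * 1e6 / total_mapped_reads


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def expression_class(value):
    """Bin an RPKM value into the classes named in the text.

    Assumption: values below 3 (including those below 1) count as very low,
    and each upper boundary belongs to the lower class only for 'high'.
    """
    if value < 3:
        return "very low"
    if value < 10:
        return "low"
    if value < 50:
        return "moderate"
    if value <= 100:
        return "high"
    return "very high"


def preferentially_expressed(rpkm_a, rpkm_b, fold=3.0):
    """True when expression in tissue A is at least `fold` times that in tissue B."""
    return rpkm_a > 0 and rpkm_a >= fold * rpkm_b


# Hypothetical transcript: 1000 reads, 2000 bp, 50 million mapped reads.
example_rpkm = rpkm(1000, 2000, 50_000_000)
example_class = expression_class(example_rpkm)
```

For the hypothetical transcript, the RPKM is 10 and the transcript falls in the moderate class.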

Figure 7.

Preferential expression analysis of the lentil transcriptome. (a) Differential expression analysis of the lentil transcriptome: number of transcripts with different expression abundances in the various tissue samples based on the RPKM method. (b) Transcripts preferentially expressed in each tissue sample as compared with the others in a tissue-by-tissue comparison. Transcripts with threefold or greater RPM as compared with the other tissue sample are given. For the transcripts in each cell, expression is higher in the column tissue than in the row tissue.

The very highly expressed transcripts (RPKM >100) from the three tissues were functionally annotated by GO annotation, COG classification and KEGG pathway analysis. GO analysis showed that the maximum number of transcripts in seedling were associated with ribonucleoprotein complex (219, GO:0030529), whereas in leaf and root they were associated with metal ion binding (GO:0046872; 218 and 229, respectively). Among COG categories, the cluster for ‘translation, ribosomal structure and biogenesis’ represented the largest group in all three tissues. KEGG pathway analysis revealed that the highly expressed transcripts were assigned to 218 pathways in seedling, 221 in leaf and 234 in root. Analysis of the transcription factors within the highly expressed transcripts in each tissue revealed that the C3H transcription factor family was the most represented.

Mining of gene-derived SSRs

Genic SSR markers have proven to be valuable tools for various applications in genetics and breeding. Therefore, to develop a novel set of functional SSRs, the 42 196 lentil transcripts were mined for microsatellite motifs using the MIcroSAtellite (MISA) tool (http://pgrc.ipk-gatersleben.de/misa). A total of 6 656 sequences containing 8 722 SSRs were identified, with a frequency of one SSR per 3.92 kb. The most abundant repeat motifs were the trinucleotide repeats (54.72%) followed by tetranucleotide repeats (20.06%), with pentanucleotide repeats being the least abundant (6.63%; Table 2). Classified by the number of repeat units, SSRs with four tandem repeats (39.23%) were the most common (Table 2). Among individual motifs, the trinucleotide motifs were the most abundant (Figure 8) [AAG/CTT (13.12%), ATC/ATG (10.15%), AAC/GTT (9.69%), ACC/GGT (7.64%)], followed by AG/CT (7.25%) and AAAG/CTTT (3.76%). The sequences flanking the SSRs were used to design 5673 primer pairs (Table S4) using the Primer3 software tool with default parameters (http://sourceforge.net/projects/primer3/develop).
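To make the mining step concrete, the following sketch detects perfect SSRs with regular expressions. It is a simplified stand-in and does not reproduce MISA: compound and imperfect repeats are ignored, the function names are hypothetical, and the minimum repeat numbers (six for di-, four for tri- and three for tetra- to hexanucleotide motifs) are inferred from the repeat classes populated in Table 2 rather than stated in the text.

```python
import re

# Minimum repeat counts per motif length, inferred from Table 2 (assumption).
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return (start, motif, repeat_count) tuples for perfect SSRs, sorted by start."""
    seq = sequence.upper()
    found = []
    occupied = [False] * len(seq)  # positions already assigned to an SSR
    for size, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            span = range(match.start(), match.end())
            if not is_primitive(unit) or any(occupied[i] for i in span):
                continue
            for i in span:
                occupied[i] = True
            found.append((match.start(), unit, len(match.group(0)) // size))
    return sorted(found)


# Hypothetical transcript fragment containing (AG)6 and (AAG)4.
fragment = "GGC" + "AG" * 6 + "TTC" + "AAG" * 4 + "C"
ssrs = find_ssrs(fragment)

# SSR frequency from the figures reported above: total bases / number of SSRs.
kb_per_ssr = 34_188_311 / 8_722 / 1000
```

The last line reproduces the reported frequency of roughly one SSR per 3.92 kb from the total length of the nonredundant transcripts (Table 1) and the number of SSRs identified.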

Table 2. Frequency of occurrence of SSRs in the lentil transcriptome
Motif length  Repeat numbers                                                 Total  %
              3      4      5      6      7      8      9      10     >10
Di            -      -      -      259    133    74     57     39     229    791    9.07
Tri           -      3096   957    380    170    73     41     13     43     4773   54.72
Tetra         1568   144    31     6      1      0      0      0      0      1750   20.06
Penta         496    63     14     5      0      0      0      0      0      578    6.63
Hexa          684    119    17     6      3      0      1      0      0      830    9.52
Total         2748   3422   1019   656    307    147    99     52     272
%             31.51  39.23  11.68  7.52   3.52   1.69   1.14   0.6    3.12
Figure 8.

Frequency distribution of SSRs based on motif types. Among the 8 722 SSRs identified, the trinucleotides were the most abundant.

The SSRs identified in this study were associated with coding regions; therefore, the SSR-containing transcripts were subjected to BLASTX analysis, and functions were assigned to 4020 transcripts (Table S7). GO analysis showed that the maximum number of transcripts (562) were associated with integral to membrane (GO:0016021), followed by ATP binding (487, GO:0005524), metal ion binding (485, GO:0046872), transferase activity (466, GO:0016740) and regulation of transcription (460, GO:0045449). SSRs associated with transcription factor families were also identified, which revealed that 1293 transcripts belonged to 70 transcription factor families. The highest number of transcripts (109) encoded the C3H transcription factor family, followed by Myb (82), C2H2 (77), bHLH (73) and MADS (70).

Validation of genic SSRs and analysis of genetic diversity

Of the large number of genic SSR markers identified, 96 primer pairs were randomly selected for validation, of which 82 (85.4%) showed successful amplification in the parent genotype. Of these, 54 were further used for the analysis of genetic diversity within 24 genotypes (Table S5), including the parents of a mapping population (Precoz × L-830) and other legume genera such as Medicago, Glycine and Vigna. A total of 23 (42.59%) SSRs were polymorphic, amplifying 2–4 alleles with an average of 2.30 alleles per locus. The polymorphic information content (PIC) of these markers ranged from 0.06 to 0.88 with an average of 0.47 (Table S6). The observed heterozygosity (Ho) ranged from 0.43 to 1.00 with an average of 0.33 per locus, while the expected heterozygosity (He) ranged from 0.34 to 0.67 with an average of 0.20 per locus. The cross-genera transferability was 42.85%. Further, the genetic relationship among the 24 genotypes was determined using a dendrogram based on the unweighted pair-group method with arithmetic averages (UPGMA; Figure 9). The coefficient of genetic similarity ranged from 0.29 to 0.97. Each of the accessions could be clearly delineated, and all the accessions were grouped into two major clusters (Figure 9). Cluster I consisted of all the lentil accessions, whereas the three other legume genera (Medicago, Vigna and Glycine) clustered separately (cluster II).
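The diversity statistics reported above can be illustrated with a short sketch. The formulas used to obtain the published values are not stated in this section, so the code assumes the commonly used definitions: expected heterozygosity He = 1 − Σp_i² and PIC = 1 − Σp_i² − ΣΣ2p_i²p_j² (summed over allele pairs i < j), where p_i is the frequency of allele i at a locus. Function names and the example allele calls are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(alleles):
    """Return the frequency of each distinct allele in a list of allele calls."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over pairs i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1 - homozygosity - cross


# Hypothetical allele sizes (bp) scored at one SSR locus.
calls = [200, 200, 204, 204]
freqs = allele_frequencies(calls)
locus_he = expected_heterozygosity(freqs)
locus_pic = pic(freqs)
```

For a locus with two equally frequent alleles, as in this example, He is 0.5 and PIC is 0.375.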

Figure 9.

Dendrogram showing the genetic similarity of 24 accessions based on UPGMA algorithm.

Discussion

Transcriptome analysis, which is now facilitated by the development of NGS technologies, is essential for understanding the functional complexity of genomes. Characterization of transcriptomes is especially important for plant species such as lentil, which have very large genomes that present a great challenge for whole genome sequencing. Hence, this study, in which ~120 million reads obtained from the Illumina Genome Analyzer II platform were used to assemble the transcriptome of lentil, is one of the most comprehensive of its kind and will serve as a foundation for functional genomics of this legume. Initially, Roche/454 was the method of choice for transcriptome analysis owing to its longer read length and ease of assembly, and it was used in many organisms such as olive (Alagna et al., 2009), pine (Parchman et al., 2010), coral larvae (Meyer et al., 2009) and chestnut (Barakat et al., 2009). Illumina sequencing, on the other hand, was limited to transcriptome analysis in species for which reference genomes were available, such as mouse (Rosenkranz et al., 2008) and yeast (Nagalakshmi et al., 2008). However, with recent improvements in assembly programs that can effectively assemble relatively short reads, coupled with the great advantage of paired-end sequencing, the short read sequence data generated for transcriptomes or whole genomes have been assembled de novo very successfully for species such as maize (Li et al., 2011), soybean (Libault et al., 2010), giant panda (Li et al., 2010), carrot (Iorizzo et al., 2011), rubber tree (Li et al., 2012), peanut (Zhang et al., 2012a), safflower (Lulin et al., 2012) and sweet potato (Xie et al., 2012). In the present study of the lentil transcriptome, a strategy for de novo assembly of the short reads was successfully demonstrated. Various assemblers and parameters were analysed, and finally Velvet, followed by Oases, was found to be the most suitable and resulted in 42 196 nonredundant transcripts.
As the quality of an assembly depends on the program parameters, the effect of varying k-mer length on assembly quality was investigated. Using N50 values and average contig length as indicators of a good assembly, a k-mer value of 47 was identified as optimal. Earlier studies have demonstrated that a higher k-mer length generates a more contiguous assembly of highly expressed RNAs, while a lower k-mer length results in a greater abundance of poorly expressed transcripts; optimization of the transcriptome assembly using various k-mer lengths, especially for de novo assemblies, is therefore highly desirable (Surget-Groba and Montoya-Burgos, 2010; Zerbino and Birney, 2008). Our assembly (at k-mer 47) using Velvet and Oases incorporated 85.62% of the high-quality reads, which was higher than the values reported for other transcriptome assemblies such as 48% in pine (Parchman et al., 2010) and 50% in sika deer (Yao et al., 2012), comparable to the 85% reported in switchgrass (Wang et al., 2012a), and lower than the 88% in eucalyptus (Novaes et al., 2008) and 90% in coral (Meyer et al., 2009). Another distinguishing feature of our lentil assembly was the high average length of unigenes (810 bp). This was much higher than the unigene length in assemblies of whitefly (265 bp; Wang et al., 2010a,b), sweet potato (581 bp; Wang et al., 2010a,b), sesame (629 bp; Wei et al., 2011) and bamboo (425 bp; Zhang et al., 2012a,b) generated using the Illumina platform. In fact, it also exceeded the unigene lengths reported in transcriptome studies based on the Roche 454 platform (which generates much longer reads), such as pigeon pea (273 bp; Dubey et al., 2011), garden pea (324 bp; Franssen et al., 2011) and switchgrass (535 bp; Wang et al., 2012a,b), which supports the validity of the lentil transcriptome assembly. One of the reasons for the longer average unigene length in lentil may be the greater depth of sequencing coverage provided by the Illumina platform.

Until now, no general criteria have been proposed as standards for evaluating the quality of a transcriptome assembly, which may depend upon several factors, annotation and gene coverage being the most important. In the current study, lentil transcripts were compared with the unigene and protein sets of different plants. Assembled sequences of different lengths matched sequences in the databases with variable efficiency. Sequences ranging from 500 bp to 2000 bp constituted the maximum percentage that showed BLAST hits (56.66%), while a much lower number of sequences (28.14%) up to 500 bp in size showed hits. Longer sequences (>2000 bp) accounted for the lowest percentage (8.59%). This analysis also resulted in 79.05% of the lentil transcripts showing similarity with at least one known unigene or protein, wherein the highest similarity was with the soybean unigenes (68.74%) followed by the Medicago unigenes (68.67%) and soybean proteins (69.87%). Further, about 71.17% of lentil transcripts showed significant similarity with the known predicted proteins, which was higher than in previous studies of whitefly (16.20%), sweet potato (46.21%) and sesame (53.91%), and very close to the value obtained in mustard stem (72%). This confirmed the validity of our de novo assembly as well as the high gene conservation within legumes. A recent study of the lentil transcriptome by Sharpe et al. (2013) helped in identifying a large number of orthologous genes in Medicago and soybean (62.9%). Genome-wide synteny and colinearity in legumes, especially between those belonging to the same taxonomic group, have been previously observed (Libault et al., 2009) and have proven to be advantageous for transferring established knowledge within related legumes, especially from the models M. truncatula and L. japonicus.
Additionally, sequence comparison of lentil transcripts with other plant unigene and protein data sets also revealed that 20.95% of the lentil transcripts did not show any similarity to known plant sequences. ‘Non-BLASTable’ sequences have been reported in all transcriptomes, varying from 13% to 80% depending upon the sequencing depth and the parameters of the BLAST search (Blanca et al., 2011; Ness et al., 2011; Wang et al., 2010a,b). These transcripts may largely correspond to novel genes or lentil-specific genes or to noncoding RNAs from untranslated regions. Taken together, such a large number of sequences and great depth of coverage can provide sufficient transcriptome sequence information for discovering novel genes. Similar BLAST-based approaches have been utilized earlier, and legume-specific genes (Graham et al., 2004) as well as chickpea-specific genes (Garg et al., 2011) have been identified, which may prove useful in evolutionary studies (Campbell et al., 2007).

Several methods of functional annotation were used to predict potential genes and their biological functions at the whole transcriptome level. Firstly, GO terms were assigned to about 42.81% of the total lentil transcripts, which was helpful in understanding the distribution of gene functions at the macro level. Further, KEGG pathway analysis was performed to analyse gene functions related to specific metabolic and cellular processes and thereby gain insights into genetically and biologically complex behaviours (Kanehisa and Goto, 2000). This revealed more than 39.67% of lentil transcripts to be enrichment factors involved in known pathways. Further, the transcripts were annotated using the COG database, which contains orthologous gene products evolved from ancestral proteins. This classified the lentil transcripts into 24 different COG categories. Annotation of the lentil transcripts using these methods proved helpful in predicting potential genes and their diverse functions, thereby suggesting that the lentil transcriptome was widely sampled and provided a valid assembly. Moreover, the pattern of gene distribution across functional categories in the lentil transcriptome was similar to that in other plant transcriptomes (Garg et al., 2011; Li et al., 2012).

Differential expression analysis was performed based on the RPKM values, which helped in the identification of highly expressed genes. Pathway-based analysis of the highly expressed transcripts in each tissue sample revealed that some pathways were represented only in the leaf tissue, while others were specific to the root tissue. Pathways specific to the root were galactose metabolism (ko00052), DNA replication (ko03030) and mismatch repair (ko03430), while C5-branched dibasic acid metabolism (ko00660) and fatty acid biosynthesis (ko00061) were specific to the leaf. As photosynthesis occurs in the aerial tissues, pathways related to photosynthesis were found to be enriched in them. In leaf and seedling tissues, 26 and 35 transcripts, respectively, showed hits to the photosynthesis-related enzymes (ko00195). In the pathway ‘photosynthesis-antenna proteins’ (ko00196), 10 transcripts from leaf tissue and nine from seedling showed hits to the enzymes associated with this pathway. Similar observations have been made in an earlier study comparing the peanut aerial and subterranean young pods (Chen et al., 2013).

The ubiquity of microsatellites or simple sequence repeats in eukaryotic genomes and their usefulness as genetic markers have been well established over the last decade, as they have been used extensively for analysis of genetic diversity, population genetics, linkage mapping, comparative genomics and association analysis. However, in modern genetic analysis, both SSRs and SNPs predominate. Even though SNPs serve as excellent markers, especially for high-throughput mapping and studying complex genetic traits, SSRs provide a number of advantages over other marker systems. SSRs, with their moderate density, still serve as the best codominant marker system for construction of framework linkage maps. Moreover, the high PIC value of SSRs, up to threefold higher than that of SNPs, coupled with high heterozygosity values makes them useful for assessment of genetic relatedness and association mapping (Yang et al., 2011). Hence, in our study, SSR identification was undertaken in order to immediately make a resource available to lentil geneticists and breeders, especially given the unavailability of SNPs in lentil until very recently (Sharpe et al., 2013). In this study, 8 722 genic SSRs were identified in 15.77% of the transcripts. The frequency of occurrence of genic SSRs depends not only on the genomic composition but also on the parameters used in mining microsatellites, such as the repeat length threshold (Toth et al., 2000). In general, the frequency of SSRs is expected to decrease progressively from di- to tri-, tetra-, penta- and hexanucleotide repeats (Zeng et al., 2010). However, in lentil, trinucleotides were found to be the most abundant, constituting up to 54.72% of the total SSRs, which is in agreement with other studies based on the identification of EST-SSRs (Varshney et al., 2005; Xie et al., 2012). Such dominance of triplets over other repeats in coding regions may be explained by the suppression of nontrimeric SSRs, possibly because an increase or decrease in the number of repeat units would change the reading frame. Among trinucleotides, AAG/CTT (23.97%) was found to be the most frequently occurring motif, which was consistent with the situation in many other plants (Li et al., 2004, 2012; Wang et al., 2012a,b; Zeng et al., 2010). However, in many cereals, CCG has been reported to be the most commonly occurring triplet (Cordeiro et al., 2001; Thiel et al., 2003; Varshney et al., 2002). Moreover, mining of SSRs in several other transcriptomes has instead shown a higher occurrence of dinucleotide repeats (Triwitayakorn et al., 2011; Wang et al., 2010a,b; Wei et al., 2011).

Genetic markers, especially SSRs, are of great importance for the understanding of genetic variation, which may be utilized for various applications in molecular breeding. Until now, some genomic SSRs (Gupta et al., 2012; Hamwieh et al., 2005) as well as genic SSR markers (Kaur et al., 2011) were available for lentil; the 5 673 SSR markers developed in this study would therefore serve to enrich this genomic resource. Further, the analysis of genetic diversity among the lentil genotypes clearly established the efficacy and utility of the developed markers because fairly high PIC values were obtained and even the closely related lentil genotypes could be distinguished. The average number of alleles per locus observed in our study (2.30) was comparable to previous studies in peanut (2.1, Liang et al., 2009) and chickpea (2.6, Choudhary et al., 2009). The difference between the average observed heterozygosity (0.33) and expected heterozygosity (0.20) may be indicative of high self-pollination rates within the population. Moreover, transferability of SSRs among the subspecies as well as across genera was also quite significant (42.85%), which was expected as the markers were designed from the highly conserved genic regions (Cho et al., 2000; Eujayl et al., 2002). SSRs located in the coding regions are under strong selection pressure and accumulate few mutations (Li et al., 2004; Varshney et al., 2005). However, despite their lower polymorphism, genic SSRs are preferable to genomic SSRs as they are associated with the coding regions of the genome and therefore represent ‘true genetic diversity’ that would directly assist in ‘perfect’ marker trait associations (Eujayl et al., 2002; Thiel et al., 2003).

In summary, this study was undertaken with the objective of enriching the existing set of genomic resources in lentil, in order to provide the necessary boost for its accelerated improvement. Towards this, an advanced genomic resource of annotated genes/transcripts and SSR markers was developed in an under-resourced nonmodel legume crop. Expression analysis of the transcriptome was carried out, which provided a deep insight into the gene content, biological processes and pathways conserved in lentil, thereby laying the foundation for future functional genomics studies. Further, genic SSR resources were developed for utilization in linkage mapping and comparative genomics, which would facilitate improvement of the lentil crop.

Experimental procedures

Plant material and DNA isolation

Lens culinaris cv. Precoz seeds were obtained from CSK Himachal Pradesh Agricultural University, Palampur. The seeds were sown in a mixture of sterile agropeat and vermiculite (1:1) in pots. Root and leaf tissue samples were harvested every five days for up to 50 days. Seedlings were collected after 20 days of germination. Three independent biological replicates were collected for each sample.

For validation and analysis of genetic diversity, 24 genotypes (obtained from CSK HPKV, Palampur) of L. culinaris were analysed along with three other legume genera (Table S5). Genomic DNA was isolated from fresh young leaves of the 24 genotypes using the modified CTAB protocol, as described by Khanuja et al. (1999). The quality and quantity of genomic DNA were analysed on the NanoDrop and on a 0.8% agarose gel.

cDNA library preparation and Illumina sequencing

In order to ensure inclusion of the maximum number of expressed transcripts from all stages of developing plants, equal amounts of leaf and root tissues of L. culinaris cv. Precoz were collected at five-day intervals and pooled. The seedling sample consisted of 20-day-old seedlings. Total RNA was isolated from the root, leaf and seedling samples using the RNA plant mini kit (Qiagen, Germany). The quality as well as the quantity of RNA samples was assessed on the 2100 Bioanalyzer (Agilent) using the Agilent Plant RNA 6000 nano kit. RNA samples having a 260/280 ratio of 1.9–2.0 and an RNA integrity number (RIN) of 8.0 and above were included in the analysis. cDNA libraries were generated from the tissues of leaf, root and seedlings. Paired-end mRNA-Seq library was prepared according to the Illumina protocol. Adapters were used for bar coding the pooled samples. Sequencing was carried out in two lanes to generate 71- and 75-bp PE reads using the mRNA-Seq assay on the Illumina Genome Analyzer II platform. Illumina reads were passed through two quality control checks; high-quality reads were selected based on phred score, and reads containing adaptor/vector sequences were removed using NGS QC Toolkit (Patel and Jain, 2012). The sequence data obtained have been deposited at the NCBI in the Short Read Archive (SRA) database under the accession number SRA037767.

De novo assembly of sequence data

The entire data set of leaf, root and seedling was assembled on a server with eight cores and 48 GB random access memory (RAM). Several assembly programs such as ABySS 1.2.0 (http://www.bcgsc.ca/platform/bioinfo/software/abyss; Simpson et al., 2009), SOAPdenovo 1.04 (http://soap.genomics.org.cn/soapdenovo.html), Velvet 1.0.13 (http://www.ebi.ac.uk/~zerbino/velvet/; Zerbino and Birney, 2008) and Oases Version 0.1.8 (http://www.ebi.ac.uk/~zerbino/oases) were used. Various assembly metrics (N50 length, average contig length and number of contigs) were optimized to obtain the best assembled data with high coverage.
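As an illustration of the assembly metrics mentioned above, the following is a minimal sketch of how N50, average contig length and contig count could be computed from a list of contig lengths. The function names `n50` and `assembly_stats` are hypothetical and are not taken from any of the assemblers used in this study.

```python
def n50(lengths):
    """Return the N50: the length of the shortest contig in the smallest
    set of longest contigs that together cover at least half of the
    total assembled bases."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembly_stats(lengths):
    """Summarise an assembly by contig count, mean contig length and N50."""
    return {
        "contigs": len(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(assembly_stats([100, 200, 300, 400]))
```

Comparing such summaries across assemblers and k-mer settings is one simple way to rank candidate assemblies, although these metrics alone do not measure correctness.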

Similarity analysis and functional annotation of assembled contigs

The assembled contigs obtained using Velvet were further resolved into their alternative transcripts using Oases 0.1.18 software (http://www.ebi.ac.uk/~zerbino/oases). For comparing lentil transcripts with other plant unigenes, the largest transcript from each group of isoforms was selected to generate a nonredundant set of transcripts. ESTs of different plants (Medicago, chickpea, soybean, Arabidopsis, rice, Pisum, coffee, Phaseolus and Lotus) were downloaded from the NCBI EST database (as of March 2012). Vector and adaptor sequences were trimmed off using Seqclean software, and the ESTs were assembled using the CAP3 program with default parameters to form unigenes. The nonredundant set of lentil transcripts was analysed by searching against the unigenes generated for each of the above-mentioned species using TBLASTX with a threshold E-value of ≤1E-05.
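The selection of the largest transcript per isoform group can be sketched as follows. This is only an illustrative sketch: it assumes that transcript headers follow the `Locus_<n>_Transcript_<i>/<k>_...` naming pattern commonly produced by Oases, and the function name `longest_isoform_per_locus` is hypothetical.

```python
import re

# Assumed header pattern, e.g. "Locus_12_Transcript_2/3_Confidence_0.500_Length_840".
LOCUS_PATTERN = re.compile(r"(Locus_\d+)_Transcript_")


def longest_isoform_per_locus(records):
    """Keep the longest sequence for each locus.

    `records` is an iterable of (header, sequence) pairs. Headers that
    do not match the assumed pattern are treated as their own locus.
    """
    best = {}
    for header, sequence in records:
        match = LOCUS_PATTERN.match(header)
        locus = match.group(1) if match else header
        if locus not in best or len(sequence) > len(best[locus][1]):
            best[locus] = (header, sequence)
    return best


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    demo = [
        ("Locus_1_Transcript_1/2_Confidence_0.5_Length_6", "ATGCAT"),
        ("Locus_1_Transcript_2/2_Confidence_0.5_Length_9", "ATGCATGCA"),
        ("Locus_2_Transcript_1/1_Confidence_1.0_Length_4", "GGCC"),
    ]
    for locus, (header, sequence) in longest_isoform_per_locus(demo).items():
        print(locus, header, len(sequence))
```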

The lentil transcripts were further subjected to BLASTX analysis against the nonredundant protein database of UniProtKB to assess the quality of the de novo assembly and to deduce putative functions. Proteomes of Arabidopsis thaliana (http://www.plantgdb.org/download/Download/xGDB/AtGDB/ATpepTAIR9), Glycine max (soybean; ftp://ftp.jgi-sf.org/pub/JGI_data/phytozome/v5.0/Gmax/annotation/Glyma1_), Zea mays (ftp://ftp.plantgdb.org/download/Genomes/ZmGDB/), Oryza sativa (rice; ftp://ftp.plantgdb.org/download/Genomes/OsGDB/previous_version/OSpepV6.1), Jatropha (ftp://ftp.kazusa.or.jp/pub/jatropha/JAT_r3.0), Populus trichocarpa (ftp://ftp.plantgdb.org/download/Genomes/PtGDB/) and Sorghum bicolor (ftp://ftp.jgi-psf.org/pub/JGI_data/Sorghum_bicolor/v1.0/Sorbi1_) were downloaded, and the lentil transcripts were searched against these plant proteomes using the BLASTX program with an E-value of ≤1E-05. The unigenes that showed significant BLAST hits with UniProtKB were used for functional annotation based on the Gene Ontology categories and were distributed into the three main functional categories of biological process, cellular component and molecular function.
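The E-value threshold described above can be applied to BLAST results with a short filter such as the one below. This is a sketch under stated assumptions: it assumes the standard 12-column tab-separated BLAST output (query, subject, identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), which may differ from the output format actually used in this study, and the function name `best_hits` is hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return the best-scoring hit per query that passes the E-value cut-off.

    Each line is assumed to hold 12 tab-separated columns, with the
    E-value in column 11 and the bit score in column 12.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    demo = [
        "t1\tP1\t90.0\t100\t10\t0\t1\t300\t1\t100\t2e-30\t120.0",
        "t1\tP2\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-40\t150.0",
        "t2\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t25.0",
    ]
    print(best_hits(demo))
```

Transcripts absent from the returned dictionary correspond to those with no significant hit at the chosen threshold.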

The lentil transcripts were further aligned with the KEGG and COG (Cluster of Orthologous Group) protein databases using BLASTX. The KEGG orthology (KO) assignments and the KEGG pathway reconstruction were performed in KAAS (Automatic Annotation Server Ver. 1.6a; http://www.genome.jp/tools/kaas/) with the default parameters. Alignment with the COG database (http://www.ncbi.nlm.nih.gov/COG) was performed to predict and classify the transcripts into functional categories.

Lentil transcripts were also searched against all transcription factor protein sequences at the Plant Transcription Factor Database (http://plntfdb.bio.uni-potsdam.de/v3.0/downloads.php) Version 3.0 (Perez-Rodriguez et al., 2010) for the identification of transcription factors in lentil, using BLASTX with an E-value of ≤1E-05.

GC content

The GC content of the lentil transcripts was analysed using an in-house Perl script. For comparison of the GC contents of different plant species, the ESTs of Arabidopsis, chickpea, soybean, Lotus, Medicago, rice, Pisum, Phaseolus and coffee were downloaded from the NCBI database, and the unigene sets of the respective plant species were used for the analysis of the GC content.
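The in-house script is not reproduced here. As a rough sketch of the calculation, GC content can be computed as the percentage of G and C among the unambiguous bases of a sequence. The function names below are hypothetical, and the handling of ambiguous bases (ignored) is an assumption.

```python
def gc_content(sequence):
    """Return the GC percentage of a sequence, ignoring ambiguous bases."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


def mean_gc_content(sequences):
    """Return the mean of the per-sequence GC percentages."""
    values = [gc_content(seq) for seq in sequences]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    print(gc_content("ATGCGC"))
    print(mean_gc_content(["ATAT", "GCGC"]))
```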

Determination of transcript abundance

The sequence reads from the three tissues were mapped on the nonredundant lentil transcripts to quantify sequence depth for predicting putative expression levels using Maq 0.7.1 software (http://maq.sourceforge.net/index.shtml), and the mapping was viewed using the Tablet software (http://bioinf.scri.ac.uk/tablet/download.shtml). The number of mapped reads was normalized and expressed as reads per kilobase per million (RPKM) in order to determine the transcript abundance. Unigene expression levels were calculated as RPKM(A) = (10^9 × C)/(N × L), where RPKM(A) refers to the expression of gene A, C to the number of reads that uniquely aligned to gene A, N to the total number of reads that uniquely aligned to all genes, and L to the length of gene A in bases.
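The RPKM formula above translates directly into code. The following minimal sketch assumes that per-transcript counts of uniquely aligned reads and transcript lengths in bases are already available as dictionaries; the function names `rpkm` and `rpkm_table` are hypothetical.

```python
def rpkm(read_count, transcript_length, total_mapped_reads):
    """RPKM = (10^9 * C) / (N * L), with L in bases."""
    return (1e9 * read_count) / (total_mapped_reads * transcript_length)


def rpkm_table(read_counts, lengths):
    """Compute RPKM for every transcript in `read_counts`.

    N is taken as the sum of the uniquely aligned reads over all
    transcripts, following the definition given in the text.
    """
    total = sum(read_counts.values())
    return {
        name: rpkm(count, lengths[name], total)
        for name, count in read_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical counts and lengths, for illustration only.
    counts = {"transcript_a": 50, "transcript_b": 150}
    lengths = {"transcript_a": 500, "transcript_b": 1500}
    print(rpkm_table(counts, lengths))
```

In this example the two transcripts receive the same RPKM because the threefold difference in read count is matched by a threefold difference in length, which is the normalization that RPKM is meant to provide.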

SSR identification, validation and diversity analysis

SSRs were searched using the program MIcroSAtellite (MISA; http://pgrc.ipk-gatersleben.de/misa). The search criteria for SSR identification were based on the minimum number of repeat units: dinucleotides ≥6, trinucleotides ≥4, and tetranucleotides, pentanucleotides and hexanucleotides ≥3. The primers were designed from the regions flanking the SSR motifs using the Primer3 software with default parameters (http://sourceforge.net/projects/primer3/develop).
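To illustrate how these repeat thresholds operate, the sketch below scans a sequence for perfect SSRs with a regular expression. It is a simplified illustration and not a reimplementation of MISA: compound SSRs are not handled, the function names are hypothetical, and motifs that are themselves repetitions of a shorter unit (for example AGAG or AAA) are simply skipped.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    for size in range(1, n):
        if n % size == 0 and motif[:size] * (n // size) == motif:
            return False
    return True


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((match.start(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    demo = "TTGC" + "AG" * 6 + "CCTA" + "AAG" * 4 + "GTC"
    for start, motif, repeats in find_ssrs(demo):
        print(start, motif, repeats)
```

Lowering or raising the values in `MIN_REPEATS` changes the number of SSRs reported, which is why SSR frequencies from different studies are only comparable when the mining criteria match.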

In order to validate the SSRs, PCR amplification was carried out using 25–30 ng of genomic DNA according to the protocol described earlier (Choudhary et al., 2009). The amplified products were resolved on the high-throughput electrophoresis system LabChipGX® (Caliper Life Sciences). Allelic data obtained for all the genotypes were analysed using POPGENE Version 1.32 (Yeh and Boyle, 1997). Polymorphism information content (PIC) of each locus was deduced using the equation $\mathrm{PIC}_i = 1 - \sum_j P_{ij}^2$, where $P_{ij}$ is the frequency of the jth allele for the ith locus (Anderson et al., 1993). The binary data were used to compute pair-wise similarity coefficients (Jaccard), and the similarity matrix thus obtained was subjected to cluster analysis using the unweighted pair-group method with arithmetic averages (UPGMA) algorithm on NTSYS-PC, Version 2.1 software (Rohlf, 1998).
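The PIC equation and the Jaccard coefficient can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the input formats (a list of observed alleles per locus, and 0/1 band-presence vectors per genotype) are assumptions, and the UPGMA clustering step performed in NTSYS-PC is not shown.

```python
from collections import Counter


def pic(alleles):
    """PIC of one locus: 1 minus the sum of squared allele frequencies."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def jaccard(profile_a, profile_b):
    """Jaccard similarity between two binary (0/1) marker profiles."""
    shared = sum(1 for x, y in zip(profile_a, profile_b) if x and y)
    either = sum(1 for x, y in zip(profile_a, profile_b) if x or y)
    return shared / either if either else 0.0


if __name__ == "__main__":
    # Hypothetical allele calls and band profiles, for illustration only.
    print(pic(["180bp", "180bp", "186bp", "186bp"]))
    print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))
```

A locus with a single allele gives a PIC of 0, while the value approaches 1 as more alleles occur at similar frequencies.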

Acknowledgements

This research was supported by the Department of Biotechnology (DBT), Government of India by means of a project grant (BT/01/08/NRC-PBI/01). We are thankful to Dr. T. R. Sharma (CSK, Himachal Pradesh University, Palampur) for providing the lentil genotypes.
