Keywords:

  • Panicum;
  • gene expression;
  • RNA-Seq;
  • transcriptome;
  • gene atlas;
  • biofuels

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Results
  5. Discussion
  6. Experimental procedures
  7. Acknowledgements
  8. References
  9. Supporting Information

Panicum hallii is an emerging model for genetic studies of agronomic traits in Panicum, presenting a tractable diploid alternative study system to the tetra- or octaploid biofuel crop switchgrass (Panicum virgatum). To characterize the gene complement in P. hallii var. filipes and enable gene expression analysis in this system we sequenced, assembled, and annotated the transcriptome. Over 300 Mb of normalized cDNA prepared from multiple tissues and treatments was sequenced using 454-Titanium, producing an annotated assembly including 15 422 unique gene names. Comparison with other grass genomes identified putative P. hallii homologs for >14 000 previously characterized genes. We also developed an atlas of gene expression across tissues and stages using RNA-Seq (the quantitative analysis of short cDNA reads). SOLiD sequencing and quantitative analysis of more than 40 million cDNA tags identified substantial variation in expression profiles among tissues, consistent with known functional differences. Putative homologs were found for all enzymes in the phenylpropanoid pathway leading to lignin biosynthesis, including genes with known effects on biomass conversion efficiency. The resources developed here will enable studies of the genes underlying variation in cell wall composition, drought tolerance, and biomass production in Panicum.


Introduction

Although the majority of biofuel production has relied on sucrose and starch feedstocks (Somerville, 2007), lignocellulosic feedstocks are receiving increased attention as abundant and energy-efficient alternatives (Rubin, 2008; Pauly and Keegstra, 2010). Plant cell walls composed of cellulose, hemicellulose and lignin are degraded to sugars either enzymatically or chemically for further processing into ethanol or other hydrocarbon fuels (Carroll and Somerville, 2009). Perennial C4 grasses are promising candidates for biomass production, including sugarcane (already widely grown as a biofuel crop), Miscanthus × giganteus (Somerville et al., 2010) and switchgrass (Panicum virgatum) (Bouton, 2007). Substantial progress has been made in developing genomic resources for P. virgatum, including expressed sequence tag (EST) analysis (Tobias et al., 2008) and a genetic linkage map (Okada et al., 2010), and genome sequencing is under way at the DOE Joint Genome Institute (JGI, http://www.jgi.doe.gov/genome-projects). However, like other perennial grass crops, the switchgrass cultivars currently in production have large, polyploid, and highly heterozygous genomes (Costich et al., 2010) that complicate genetic and genomic analysis in this system (Bouton, 2007).

The genus Panicum also includes diploid species closely related to switchgrass (Zhang et al., 2011) in which these analyses would be more straightforward, including the emerging model Panicum hallii (Anderson et al., 2011). Like the well-characterized upland and lowland switchgrass types, P. hallii also presents distinct ecotypes, with the upland var. hallii extending into more xeric conditions than the lowland var. filipes (Waller, 1976). This species is well suited as a laboratory model, with a short generation time, high fecundity, and a small diploid genome. The JGI has undertaken whole genome sequencing of multiple P. hallii accessions to aid in subsequent assembly of the more challenging P. virgatum genome (http://www.jgi.doe.gov/genome-projects). Despite this species’ strong potential as a model for Panicum research, the sequence information needed for studying genes underlying agronomic traits in P. hallii is not available. The present study endeavors to fill this gap by cataloging the genes expressed in multiple stages and tissues of P. hallii, providing a starting point for functional analysis of gene expression in future studies.

Sequencing and annotation of the transcriptome provides a rapid, cost-effective route to developing sequence resources for systems in which the assembly of a complete genome sequence remains out of reach. Taking advantage of the decreased cost and increased throughput of next-generation sequencing technologies (NGS), transcriptome sequencing is now widely used for gene discovery and comparative genomics in non-model systems (Ekblom and Galindo, 2011). An annotated collection of expressed sequences matching known genes enables functional genomic analysis in organisms lacking any prior sequence information. For example, these resources have been used for identification of genetic polymorphisms (Novaes et al., 2008) and comparative sequence analysis to uncover signatures of natural selection (Baldo et al., 2011). Transcriptome sequencing has opened non-model organisms to gene expression profiling using quantitative (q)PCR (Toth et al., 2007) or microarrays (Vera et al., 2008). RNA-Seq, the quantitative analysis of short cDNA reads (Mortazavi et al., 2008), can also take advantage of de novo assembly for expression profiling (Meyer et al., 2011).

Comparing expression levels across tissues and stages to produce an atlas of gene expression provides an important biological context for gene expression patterns that cannot always be inferred from sequence comparisons alone. This ‘gene atlas’ approach has been adopted in numerous model and crop plants, typically using microarray platforms for expression profiling (Schmid et al., 2005; Benedito et al., 2008; Wang et al., 2010; Sekhon et al., 2011). More recently, NGS technologies have been used to develop gene atlas resources based on RNA-Seq (Libault et al., 2010; Zenoni et al., 2010). In addition to annotating the expressed genes with tissue-specific expression patterns, these analyses have produced important functional insights, identifying genes involved in the specialization of nitrogen-fixing nodules in legumes (Benedito et al., 2008), berry ripening in grapes (Zenoni et al., 2010), and inflorescence development (an important determinant of yield) in rice (Wang et al., 2010).

In this study, we sequenced and annotated the transcriptome of P. hallii, and used this resource as a reference for RNA-Seq analysis of tissue- and stage-specific gene expression profiles. Our findings provide a reasonably complete catalog of the genes expressed in P. hallii, functional annotation of those sequences, and evidence of tissue-specific expression differences. This provides proof of concept for RNA-Seq analysis in P. hallii, and will enable future studies targeting specific genes underlying important agronomic traits such as drought tolerance, biomass production, and cell wall characteristics.

Results

Transcriptome sequencing and assembly

Normalized cDNA libraries prepared from multiple tissues and treatments were sequenced using 454-FLX Titanium, producing 1.26 million reads (Table 1). The sequences obtained from these libraries showed a broad size distribution (Figure 1a), averaging 242 bp (shorter than the expected approximately 400 bp), but also including a substantial number of long reads (>400 000 reads ≥300 bp).

Table 1.   Sequencing and assembly of the Panicum hallii transcriptome using 454-Titanium

                         Mb       Number
Adaptor-trimmed reads    306.0    1 262 088
Reads assembled          266.9    1 088 073
Contigs                  19.7     23 474
Unique singletons        15.8     64 179
Assembly                 35.5     87 653

Figure 1.  Distribution of length and sequencing coverage in the transcriptome assembly. (a) Length distribution for trimmed reads (gray bars) and contigs (black bars). (b) Distribution of sequencing coverage among contigs. Note the logarithmic x-axes on both plots.

Adaptor-trimmed 454 reads were assembled using the Roche De Novo Assembler, with 86% of reads included in the assembly and 113 353 remaining as singletons. This produced a set of 23 474 contigs with an average length of 837 bp (Figure 1a), with half of the total assembly length in contigs >1.3 kb (N50 = 1308 bp). More than half (55%) of contigs were at least 500 bp in length, and more than a quarter (28%) were at least 1 kb. The overall coverage was 13.5×, with most contigs (75%) containing 10 or more reads and approximately half (49%) containing ≥20 (Figure 1b). To preserve rare transcripts, we also included a filtered set of non-redundant singletons (n = 64 179) for annotation and expression analysis (filtering described in Experimental Procedures).
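As an illustration of the N50 statistic quoted above, the following minimal Python sketch computes N50 from a list of contig lengths. The function name and the toy lengths are invented for illustration; this is not the assembler's own code, and it assumes the common definition of N50 as the contig length at which the cumulative length of the longest contigs first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths (bp).

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths supplied")


# Toy contig lengths (bp), invented for illustration only.
toy_lengths = [2000, 1500, 1300, 900, 600, 400, 300]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Because N50 weights contigs by length, it is typically larger than the mean contig length in a fragmented assembly, which is the pattern reported here (mean 837 bp, N50 = 1308 bp).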

Rarefaction analysis was used to evaluate the completeness of our sequencing efforts. The results (Figure 2) indicate that the sequencing depth was saturating with respect to gene discovery, since nearly all contigs detected in the complete dataset (1.26 million reads) were also detected at substantially lower coverage (96% of contigs detected at 0.5 million reads).
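The idea behind rarefaction can be sketched in a few lines of Python. This is a hypothetical illustration rather than the procedure used in this study: it assumes each read has already been assigned to a contig, draws random subsets of reads without replacement, and counts how many distinct contigs are hit at each depth. The function name, the toy data, and the random seeds are all assumptions.

```python
import random


def rarefaction_curve(read_to_contig, depths, seed=0):
    """Count distinct contigs detected in random subsamples of reads.

    read_to_contig: list with one contig identifier per read.
    depths: subsample sizes (each must not exceed the number of reads).
    Returns a list of (depth, contigs_detected) pairs.
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        sample = rng.sample(read_to_contig, depth)  # without replacement
        curve.append((depth, len(set(sample))))
    return curve


# Toy data: 1000 reads spread over at most 50 hypothetical contigs,
# with a skewed abundance distribution (a few contigs get most reads).
toy_rng = random.Random(1)
toy_reads = [min(int(toy_rng.expovariate(0.1)), 49) for _ in range(1000)]
toy_curve = rarefaction_curve(toy_reads, [100, 250, 500, 1000])
for depth, detected in toy_curve:
    print(depth, detected)
```

A curve that flattens well before the full depth, as described above for the 454 data, suggests that additional sequencing would add few new contigs.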

Figure 2.  Rarefaction analysis of gene discovery as a function of 454 sequencing depth.

Functional annotation of the transcriptome

The assembly was annotated with gene names and Gene Ontology (GO) terms based on sequence comparisons between P. hallii transcripts and the UniProt database. Approximately two-thirds of contigs were assigned gene names (64%; Table 2), and 28% of singletons. More than half the contigs (56%) were assigned GO terms, and 26% of singletons. Altogether 15 422 unique gene names were identified in this analysis.

Table 2.   Annotation of the Panicum hallii transcriptome assembly based on sequence comparisons with public sequence databases

BLAST match                             Contigs          Singletons       Unique matches
                                        Number (%)       Number (%)
UniProt records
  With gene names                       15 047 (64.1)    18 163 (28.3)    15 422 (a)
  With Gene Ontology annotation         13 031 (55.5)    16 766 (26.1)    18 973 (b)
Expressed sequence tags
  Switchgrass (Panicum virgatum)        18 135 (77.3)    21 813 (34.0)    23 483 (b)
Gene models
  Foxtail millet (Setaria italica)      15 429 (65.7)    14 689 (22.9)    15 031 (b)
  Rice (Oryza sativa)                   14 999 (63.9)    13 111 (20.4)    14 119 (b)
  Maize (Zea mays)                      14 971 (63.8)    13 406 (20.9)    16 432 (b)
  Thale cress (Arabidopsis thaliana)    13 051 (55.6)    9141 (14.2)      10 053 (b)

(a) Number of unique gene names.
(b) Number of unique best matches.
Percentages are relative to total number of contigs and singletons.

Putative homologs were identified by comparing the assembly with gene models from other species (Table 2). A majority of contigs (77%) and 34% of singletons matched ESTs from P. virgatum, and a smaller fraction matched gene models from more distantly related plant genomes (56–66% of contigs; Table 2). Pairwise nucleotide sequence alignments of P. hallii transcripts with P. virgatum ESTs revealed strong similarity (92.9% average global sequence identity), while the most closely related grass with a sequenced genome (foxtail millet, Setaria italica) showed weaker similarity (86.5%). A considerable fraction of contigs (23–36%) lacked matches in other grass genomes or the UniProt database. While some sequences without matches (including most singletons) were too short for BLAST to reliably identify conserved regions, some longer sequences also lacked matches: 106 contigs at least 500 bp in length lacked matches in any database. These putatively P. hallii-specific transcripts are listed in File S1 in the Supporting Information.

Comparison with plastid and mitochondrial genomes revealed negligible contributions from organelles, with sequences derived from mitochondrial, plastid, and ribosomal RNAs accounting for <2% of the assembly length and <2% of reads (File S2). In the context of the high abundance of organelles within the cell, this indicates that cDNA normalization was highly effective in reducing the original abundance of these transcripts. Putative transposable elements (TE) were more abundant, accounting for over 6% of the assembly length and nearly 8% of reads. The TE sequences were highly diverse, including 4262 unique sequences spanning 37 different superfamilies (File S3). Gypsy- and copia-like elements were especially abundant, comprising 27 and 17% of putative TEs, respectively.

We found little evidence of biological contamination in the assembly. Among the P. hallii transcripts matching one or more records in NCBI’s nr database (76% of assembled sequences), nearly all (93%, representing 99% of reads) match records from plants more closely than any other taxon (File S4). Most putative contaminants matched metazoan records, including plausible sources of contamination (e.g. 51% best matched sequences from an aphid, Acyrthosiphon pisum). The overall low contamination observed here (1% of the total reads) supports the continued use of similar greenhouse grown specimens in gene expression analysis. Putative contaminant sequences were excluded for all subsequent analyses.

RNA-Seq expression profiling

The annotated transcriptome assembly served as a reference for RNA-Seq profiling of tissue- and stage-specific expression. Clones of the same accession used for 454 sequencing (FIL2) were sequenced at approximately 1 million mapped reads per sample (n = 5 replicates for each of eight tissues or stages), producing over 40 million mapped reads altogether (Table 3). These reads were archived at the Sequence Read Archive (SRA) (accession number SRA048047.1).

Table 3.   The RNA-Seq (quantitative analysis of short cDNA reads) analysis of tissues and stages in Panicum hallii. Overview of sampling and sequencing yield

Sample type      Aligned reads (millions)
                 Per replicate    Per tissue
Inflorescence    1.4              7.0
Leaf             1.0              5.1
Seedling         1.1              5.3
Node             0.9              4.3
Root             1.2              6.0
Seed             0.8              3.8
Stem             0.8              4.1
Crown            1.0              4.9
Total                             40.5

Five replicate samples per tissue.

The core set of 23 831 transcripts reproducibly detected by RNA-Seq (two or more reads in three or more samples) included 13 844 contigs and 9987 singletons from the 454 assembly. Because the assembly remained fragmented and RNA-Seq tags were derived from the 3′ end of transcripts, any contigs not detected in this analysis probably correspond to 5′ transcript fragments. The RNA-Seq reads mapping to contigs were more abundant on average (62 reads per million, rpm) than reads mapping to singletons (14 rpm), consistent with the idea that singletons in transcriptome assemblies include valid transcripts expressed at low levels. Most transcripts were expressed at low levels on average across samples: 63% of transcripts at ≤10 rpm, and 38% at ≤2 rpm (File S5).
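The detection threshold and the rpm scaling described above can be expressed as a short Python sketch. It assumes raw read counts are available as one dictionary per sample; the function names, the transcript identifiers, and the toy counts are hypothetical and do not reproduce the authors' pipeline.

```python
def reads_per_million(counts):
    """Scale one sample's raw counts {transcript: reads} to reads per million."""
    total = sum(counts.values())
    return {t: c * 1_000_000 / total for t, c in counts.items()}


def core_transcripts(samples, min_reads=2, min_samples=3):
    """Return transcripts with >= min_reads reads in >= min_samples samples."""
    hits = {}
    for counts in samples:
        for transcript, reads in counts.items():
            if reads >= min_reads:
                hits[transcript] = hits.get(transcript, 0) + 1
    return {t for t, n in hits.items() if n >= min_samples}


# Toy counts for four samples; identifiers are invented for illustration.
toy_samples = [
    {"contig_a": 40, "contig_b": 2, "single_c": 1},
    {"contig_a": 55, "contig_b": 0, "single_c": 3},
    {"contig_a": 38, "contig_b": 5, "single_c": 2},
    {"contig_a": 61, "contig_b": 2, "single_c": 0},
]
toy_core = core_transcripts(toy_samples)
toy_rpm = reads_per_million(toy_samples[0])
print(sorted(toy_core))
```

In this toy example "single_c" reaches two reads in only two samples and is therefore excluded, mirroring how weakly supported sequences fall outside the core set.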

Each tissue-specific expression profile included approximately 20 000 transcripts on average, ranging from 18 299 in leaf to 21 466 in inflorescence (File S6). Nearly half the transcripts detected in any tissue were ubiquitously expressed in all tissues (n = 11 549). An additional 37–45% of transcripts in each profile were expressed in multiple, but not all, tissues. Only a small fraction of transcripts were truly tissue-specific (detected at least once in only one tissue). In most tissues only 0.04–0.9% of transcripts were specific to that tissue, while seed profiles included a slightly higher fraction of tissue-specific transcripts (2.1%, File S6). The complete list of 741 tissue-specific transcripts is given in File S7.
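The three categories used above (ubiquitous, expressed in several but not all tissues, and tissue-specific) follow from simple set operations. The Python sketch below assumes that the set of detected transcripts is already known for each tissue; the function name and toy data are hypothetical.

```python
def classify_by_tissue(detected):
    """Split transcripts into ubiquitous and tissue-specific sets.

    detected maps tissue -> set of transcripts detected in that tissue.
    Returns (ubiquitous, specific), where specific maps each tissue to the
    transcripts detected in that tissue and in no other.
    """
    tissues = list(detected)
    ubiquitous = set.intersection(*(detected[t] for t in tissues))
    specific = {}
    for tissue in tissues:
        others = set().union(*(detected[o] for o in tissues if o != tissue))
        specific[tissue] = detected[tissue] - others
    return ubiquitous, specific


# Toy detection sets; transcript names are invented for illustration.
toy_detected = {
    "leaf": {"t1", "t2", "t3"},
    "root": {"t1", "t3", "t4"},
    "seed": {"t1", "t5"},
}
toy_ubiquitous, toy_specific = classify_by_tissue(toy_detected)
print(toy_ubiquitous, toy_specific)
```

Transcripts that are neither ubiquitous nor specific to one tissue (here "t3") correspond to the intermediate category expressed in multiple, but not all, tissues.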

Organelles and biological contaminants comprised a substantially higher fraction of the non-normalized libraries used for RNA-Seq than in the normalized libraries sequenced by 454 (Figure 3). Clear differences in organelle contributions were apparent; e.g. tags derived from the plastid genome accounted for 29% of reads in leaf tissue, but only 8% of reads in roots. The chloroplast 16S and 23S rRNA genes remained highly stable across tissues [P > 0.05, coefficient of variation (CV) <10%], while components of the major photosynthetic complexes (PSII, PSI, cytochrome b6/f complex, and the light-harvesting complex) were strongly up-regulated in leaf tissue. This suggests that differential expression of chloroplast genes rather than simply differences in plastid abundance drives the observed difference in plastid contributions. Conversely, nuclear-encoded rRNA accounted for 38% of tags in root samples but only 13% in leaf tissue. Mitochondrial and TE sequences were relatively stable across tissues, accounting for 2–4% and 5–7% of reads, respectively. Sequences flagged as putative contaminants during transcriptome annotation were rare in these libraries, comprising 3–5% of root and seed samples and <1% of reads in all other samples.

Figure 3.  Contribution of organelle or contaminant to RNA-Seq (quantitative analysis of short cDNA reads) in different tissues. For each tissue, each bar represents the percentage of SOLiD reads matching a contig derived from organelles, transposable elements (TE), or biological contaminants.

Rarefaction analysis of all tissues, with replicate samples pooled by tissue, was used to estimate minimum sequencing requirements (Figure 4). Although each tissue-specific expression profile included most transcripts detected in any tissue (range: 77–90% of core transcripts), the minimal sequencing depth required to saturate transcript detection varied substantially between tissues. For example, 90% of core transcripts were detected in crown, node, and inflorescence with fewer than 4 million aligned reads, while 10.5 million aligned reads would be required to detect 90% of core transcripts in leaf tissue (Figure 4). Considered from the perspective of each tissue-specific transcriptome rather than the total, rarefaction analyses revealed a similarly wide range of sequencing requirements across tissues. For example, 90% of the transcripts detected in the complete stem dataset were detected with 1.6 million mapped reads, while 3.4 million reads were required to detect 90% of leaf transcripts. Although the low coverage employed in our analysis (approximately 1 million reads per sample) was beyond the point of diminishing returns (i.e. doubling the sequencing investment would detect an additional 11.5% of transcripts), this sequencing depth was clearly not saturating for any tissue, detecting 81.7% of the tissue-specific transcriptome on average (ranging from 71.4% in leaf to 85.5% in seed). These relationships suggest that detecting all genes would require impractically high sequencing depths, while the majority of transcripts in a tissue can be detected at reasonably low depths.

Figure 4.  Rarefaction analysis of genes detected by RNA-Seq (quantitative analysis of short cDNA reads) as a function of SOLiD sequencing depth.

Differential gene expression among tissues and stages

The Z-score distributions illustrate differences in gene expression dynamics among tissues (Figure 5). Stem-associated tissues (including stem, node, inflorescence, and crown) shared similar expression profiles, with most genes in these tissues expressed at similar levels to the overall average across tissues. The other tissues (seed, leaf, root, and seedling) deviated farther from the overall mean, indicating substantial tissue-specific expression differences in these tissues (Figure 5).
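A Z-score of this kind can be sketched in Python as follows. The exact formula behind Figure 5 is not spelled out in this section, so the sketch rests on an assumption: for one gene, each tissue's mean expression is compared with the mean over all samples and scaled by the standard deviation over all samples. The function name and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def tissue_z_scores(expr):
    """Z-score of each tissue's mean expression for a single gene.

    expr maps tissue -> list of replicate expression values.
    Assumption: z = (tissue mean - overall mean) / overall standard deviation.
    """
    all_values = [value for reps in expr.values() for value in reps]
    overall_mean = mean(all_values)
    overall_sd = pstdev(all_values)
    if overall_sd == 0:
        # A gene with identical values everywhere does not deviate at all.
        return {tissue: 0.0 for tissue in expr}
    return {
        tissue: (mean(reps) - overall_mean) / overall_sd
        for tissue, reps in expr.items()
    }


# Toy expression values (rpm) for one gene, invented for illustration.
toy_gene = {"leaf": [90.0, 110.0], "root": [10.0, 10.0], "stem": [40.0, 40.0]}
toy_z = tissue_z_scores(toy_gene)
print(toy_z)
```

Under this definition, a tissue whose scores cluster near zero resembles the overall average, while a broad distribution of scores indicates many genes expressed far above or below the average in that tissue.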

Figure 5. Z-score distributions of gene expression in different tissues. Scores are calculated for each gene among replicate samples within each tissue type.

Statistical comparisons based on a negative binomial model for count data revealed that a large proportion of transcripts (n = 10 844) were differentially expressed among tissues (P ≤ 0.05; File S8). These differentially expressed genes (DEG) included 4939 transcripts that lack sequence similarity with known genes (UniProt), providing biological context for a set of transcripts that could not be annotated based on sequence similarity alone. The remaining transcripts (58%) showed stable expression across all tissues and stages (P > 0.05), including 1629 highly stable transcripts with CV ≤10%.
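The CV criterion for highly stable transcripts can be illustrated with a short Python sketch. It covers only the CV filter and does not implement the negative binomial test; it also assumes the CV is the sample standard deviation divided by the mean, expressed as a percentage. Function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def stable_transcripts(expr, max_cv=10.0):
    """Return names of transcripts whose CV across samples is <= max_cv."""
    return sorted(
        name
        for name, values in expr.items()
        if coefficient_of_variation(values) <= max_cv
    )


# Toy normalized expression values across five samples (invented).
toy_expr = {
    "steady": [100.0, 102.0, 98.0, 101.0, 99.0],
    "variable": [5.0, 80.0, 20.0, 150.0, 45.0],
}
toy_stable = stable_transcripts(toy_expr)
print(toy_stable)
```

Applying the filter after the differential expression test, rather than instead of it, keeps only transcripts that are both statistically indistinguishable among tissues and numerically tight across samples.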

Principal components analysis (PCA) of DEG identified distinct tissue-specific clusters. Altogether, the first three components in this analysis explain 53% of variance among samples (PC1 = 23%, PC2 = 21%, PC3 = 9%). The tight clustering of tissue types in a three-dimensional plot (Figure 6) indicates that tissue type can be reliably predicted from expression profiles. Expression profiles differed substantially between seed, leaf, and root relative to the other tissues.

Figure 6.  Principal components analysis of expression patterns identifies tissue-specific clusters. Each symbol represents a single sample (n = 5 replicate samples per tissue type). Tissue types are indicated by color.

We used hierarchical clustering to identify sets of DEG associated with particular tissues. Tissue-specific patterns were readily apparent in these data, with perfect clustering by sample type (Figure 7). Seven gene expression clusters were selected, three of which comprised genes preferentially expressed in one tissue (cluster 2 = seedling, cluster 4 = root, cluster 5 = seed). Cluster 1 was most highly expressed in stem-associated tissues (crown, inflorescence, node, and stem), cluster 3 in root, seedling, crown, and node; cluster 6 in leaf and seedling; and cluster 7 in leaf and seed.

Figure 7.  Heatmap showing relative expression of differentially expressed genes among tissues. Colors indicate relative expression (yellow = high expression, black = intermediate, blue = low). Numbers in the right margin identify tissue-specific expression clusters.

Functional analysis of the gene sets identified by clustering DEG (Figure 7) revealed enrichment of biological processes in each set consistent with known functional differences (File S9). Expression cluster 6 (genes preferentially expressed in leaf and seedling) was enriched for photosynthesis (GO 0015979; P = 3 × 10^-40) and energy metabolism (GO 0006091; P = 2 × 10^-14), including components of all major photosynthetic complexes: photosystem II (eight transcripts), photosystem I (six transcripts), cytochrome b6/f complex (seven transcripts), and the light-harvesting complex (eight transcripts). Certain genes associated with the dark reactions were also included in this category (e.g. three transcripts matching ribulose bisphosphate carboxylase).

The DEG expressed at high levels in stem-associated tissues (cluster 1) were enriched for intracellular transport (GO 0046907; P = 2 × 10^-8), organic acid metabolism (GO 0006082; P = 0.02), and carbohydrate metabolism (GO 0005975; P = 1 × 10^-6). Carbohydrate metabolism included genes associated with cellulose biosynthesis (cellulose synthase subunits, 10 transcripts) and degradation (exoglucanase, two transcripts; endoglucanase, two transcripts; and beta-glucosidase, six transcripts).

Genes expressed at high levels in root, crown, and seedling (cluster 3) were enriched for macromolecule biosynthesis (GO 0009059; P = 2 × 10^-26) and metabolism (GO 0044260; P = 5 × 10^-31), protein metabolism (GO 0019538; P = 4 × 10^-24), and amino acid metabolism (GO 0006519; P = 0.03). Protein metabolism genes included ribosomal proteins (13 transcripts), elongation factors (five transcripts), and aminoacyl-tRNA synthetases (five transcripts). Ribosomal RNA was also highly abundant in these tissues (22–38% of reads; Figure 3), supporting the interpretation of these profiles as indicating elevated levels of protein synthesis.

Analysis of cell wall biosynthesis pathways

As a starting point for studies of cell wall biosynthesis in diploid Panicum, and to demonstrate the utility of our transcriptome resource for pathway analysis, we compiled expression data for all genes associated with lignin biosynthesis pathways (Figure 8). Each gene was represented by multiple sequences in the transcriptome assembly, ranging from 2 to 11 sequences per gene; additional experiments would be required to clarify whether these represent different alleles, homeologs, splice variants, or simply incomplete assembly. This analysis revealed the ubiquitous expression of all pathway genes in all tissue types, with many genes showing differential expression among tissues (asterisks in Figure 8). These include caffeoyl-CoA 3-O-methyltransferase (CCoAOMT), which was expressed at substantially higher levels in root and stem than other tissues, and caffeic acid O-methyltransferase (COMT), which was specifically elevated in stems.

Figure 8.  Tissue-specific expression of genes involved in the phenylpropanoid pathway leading to lignin biosynthesis. Sequences included in this analysis best matched the corresponding genes from Arabidopsis, rice, maize, or sequences from some other species annotated with that gene name. Tissue labels: F, inflorescence; G, seedling; L, leaf; N, node; R, root; S, seed; T, stem; W, crown. Asterisks indicate differential expression (false discovery rate <0.05). Pathway redrawn from a previous description (Humphreys and Chapple, 2002). PAL, phenylalanine ammonia-lyase; C3H, p-coumarate 3-hydroxylase; F5H, ferulate 5-hydroxylase; C4H, trans-cinnamate 4-hydroxylase; 4CL, 4-coumarate CoA ligase; CCR, cinnamoyl CoA reductase; HCT, hydroxycinnamoyl CoA:shikimate/quinate hydroxyl-cinnamoyltransferase; CCoAOMT, caffeoyl-CoA 3-O-methyltransferase; COMT, caffeic acid O-methyltransferase; CAD, cinnamyl alcohol dehydrogenase.

Discussion

A transcriptome assembly could be considered finished when all genes are represented, with each transcript entirely covered by a single contig. In species lacking a reference genome for comparison, the fragmentation and completeness of a transcriptome assembly can only be estimated by comparison with the gene complement of related species. The P. hallii transcriptome clearly remains fragmented, with an average contig length of 837 bp, or 404 bp if unique singletons are included. This is substantially shorter than the average transcript lengths in related grasses with completed genome sequences (1.5 kb in Sorghum bicolor, 1.4 kb in Setaria italica, and 1.6 kb in Brachypodium distachyon). This fragmentation obviously precludes estimating the number of genes expressed in P. hallii from contig numbers.

Considering the redundancy among annotated sequences in the assembly (15 422 gene names among 33 210 annotated transcripts) and assuming a similar level of redundancy among the remaining non-annotated transcripts (54 443) suggests that approximately 40 704 unique transcripts are present in the assembly. This is similar to the total number of gene models (approximately 41 000) in S. italica, the most closely related grass with a fully sequenced genome. The RNA-Seq analysis independently confirmed the expression of 23 831 transcripts at sufficiently high levels for expression analysis (two or more reads in three or more samples). The reproducible detection of these sequences across multiple samples and sequencing platforms supports their interpretation as valid transcripts. Most of these matched ESTs from P. virgatum (76%), accounting for 15 631 different sequences from that assembly. Assuming a similar level of redundancy among the putatively P. hallii-specific transcripts that remained (5729) suggests that 20 577 unique transcripts are represented among these core sequences. Although the transcriptome assembly remains fragmented and some genes expressed at low levels may not be represented, these comparisons suggest that the resource provides a reasonably complete catalog of the genes expressed in P. hallii.
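The two extrapolations in this paragraph follow the same arithmetic, which the Python sketch below reproduces. It assumes that sequences without a match are as redundant as those with one, as stated in the text. The function name is hypothetical, and the number of core transcripts matching P. virgatum ESTs (18 102) is inferred here as 23 831 minus 5729 rather than quoted from the text.

```python
def extrapolate_unique(unique_matched, matched_total, unmatched_total):
    """Estimate the number of unique transcripts in a redundant sequence set.

    Assumes unmatched sequences collapse to unique transcripts at the same
    ratio observed among matched sequences.
    """
    ratio = unique_matched / matched_total
    return unique_matched + ratio * unmatched_total


# Whole assembly: 15 422 gene names among 33 210 annotated transcripts,
# plus 54 443 non-annotated transcripts.
whole_assembly = extrapolate_unique(15_422, 33_210, 54_443)

# Core RNA-Seq set: 15 631 distinct switchgrass ESTs matched by the core
# transcripts other than the 5729 putatively P. hallii-specific ones.
core_set = extrapolate_unique(15_631, 23_831 - 5_729, 5_729)

print(int(whole_assembly), int(core_set))
```

Both estimates are only as good as the equal-redundancy assumption; short unmatched fragments may well be more redundant than annotated sequences, in which case the true numbers would be lower.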

The number of transcripts quantified by RNA-Seq in our analysis (23 831) compares favorably with the number of genes (approximately 20 000–25 000) detected by existing microarray platforms for other plants (Schmid et al., 2005; Wang et al., 2010; Sekhon et al., 2011). Previous RNA-Seq analyses of plant transcriptomes have detected a similarly high proportion of genes; e.g. 75% of genes were detected in one or more tissues in the soybean gene atlas (Libault et al., 2010). This suggests that tag-based RNA-Seq analysis of the P. hallii transcriptome assembly profiles a comparable fraction of the transcriptome as other platforms in common use (Meyer et al., 2011).

Sequencing is the major cost in RNA-Seq analysis, making it important to optimize sequencing coverage relative to gene detection and accurate quantification. Previous RNA-Seq studies have used relatively high coverage, ranging from 1.4 to 4.1 million mapped reads per sample in soybean (Libault et al., 2010) to 14–18 million in grape (Zenoni et al., 2010) and 26–31 million in rice (Lu et al., 2010). We sequenced at a substantially lower depth of approximately 1 million mapped reads per sample. Tag-based RNA-Seq is expected to produce a single tag per transcript and thus use sequencing coverage more efficiently than sequencing of multiple fragments spanning the entire transcript. Rarefaction analysis of pooled samples revealed that the low sequencing depth used in our study was not saturating, capturing only 82% of the transcripts detected at the maximum coverage achieved for each tissue type. This analysis identifies sequencing depth requirements for future expression profiling studies in P. hallii, i.e. 90% of transcripts can be detected with approximately 4 million mapped reads per sample for leaf tissues, and with approximately 2 million for other tissues (Figure 4).

Differences in minimal sequencing requirements among tissues probably result in part from different expression dynamics, with a few highly expressed transcripts displacing rarer transcripts in sampling. Ribosomal RNA is the dominant component of cellular RNA (80%), with an additional 15% comprising tRNA, and mRNA accounting for only 5% (Warner, 1999). Although the oligo-dT specificity of our library preparation method substantially reduced rRNA abundance, rRNA remained in our samples and contributed different amounts in each tissue (ranging from 13 to 38%, Figure 3). Since expression levels were normalized to sequencing effort rather than any absolute measure (e.g. number of cells), profiles with high rRNA abundance (e.g. root and seed) indicate elevated rRNA to mRNA ratios that could result either from overall down-regulation of mRNA expression, or from increased expression of rRNA. The Z-score distributions for root and seed tissues indicate down-regulation of many genes in those tissues (Figure 5), supporting the former explanation, while the up-regulation of genes associated with protein synthesis in roots supports the latter (File S9). Resolving these alternative possibilities is outside the scope of the current study, but would be generally interesting for the interpretation of RNA-Seq analysis. Hybridization-based methods are commercially available for removing rRNA from plant samples (e.g. Invitrogen’s RiboMinus and Epicentre’s Ribo-Zero), and our analysis indicates that certain tissues like root and seed might benefit from this treatment. The relatively high per-sample cost of these procedures suggests that this modification would be most cost-effective at high coverage, while at lower coverage (approximately 1 million reads per sample) it would be more cost-effective to simply sequence additional reads to account for rRNA.

Approximately half the genes measured by RNA-Seq were expressed at equivalent levels across tissues, allowing the empirical identification of stable housekeeping genes that will be useful for normalizing qPCR data in future studies. The 12 986 transcripts that were not differentially expressed among tissues (adjusted P-values > 0.05) included 1629 highly stable transcripts (CV ≤ 10%). We selected the most stable transcripts from this set (n = 12 with CV ≤ 2.5%) for use as qPCR reference genes in this system, and verified the specificity and amplification efficiency for a subset of these (File S10). Our analysis identified stable transcripts that would not have been expected a priori, including homologs of 3′ histone mRNA exonuclease, histone H3, and the regulatory kinase CHK1. Although some stable transcripts (CV ≤ 10%) matched commonly used reference genes [actin, glyceraldehyde 3-phosphate dehydrogenase (GAPDH), and ubiquitin], other transcripts with similar annotation were differentially expressed (P < 0.05) among tissues (including homologs of actin, tubulin, GAPDH, and ubiquitin).

We found distinct expression profiles in all tissues and stages (Figures 6 and 7), matching the general picture emerging from gene atlas analyses in other plants (Schmid et al., 2005; Benedito et al., 2008; Libault et al., 2010; Wang et al., 2010; Zenoni et al., 2010; Sekhon et al., 2011). A large fraction of transcripts (42%) were differentially expressed in our study. This is consistent with the scale of expression differences found in previous studies; for example, 69% of genes vary among stages and tissues in rice (Wang et al., 2010) and 73% in the model legume Medicago truncatula (Benedito et al., 2008). We found that nearly half of the transcripts detected were expressed ubiquitously in all tissues and stages (48%), in agreement with previous reports ranging from 23 to 55% (Benedito et al., 2008; Libault et al., 2010; Wang et al., 2010; Sekhon et al., 2011). Similarly, our observation that only 3.1% of genes are specific to a single tissue is consistent with these previous reports (2.9–5.2%). Expression profiles in P. hallii showed broad similarities among stem-associated tissues, while root, seed, and leaf samples diverged substantially from this core profile. This finding is similar to the maize gene atlas, where most tissue-specific genes were expressed in leaf, endosperm, and root (Sekhon et al., 2011). Different patterns were reported in Arabidopsis, where seeds and roots showed extensive tissue-specific expression but not leaves (Schmid et al., 2005). In contrast with the aerial versus underground clustering of tissue profiles reported in the M. truncatula gene atlas (Benedito et al., 2008), relationships among tissue types in our data can be explained by adjacency and tissue specialization. Because our tissue dissections were relatively coarse (e.g. node, crown, and inflorescence samples also included portions of stem internode, and seedlings included both root and leaf tissues), these similarities between adjacent tissue types are as expected. 
While detailed comparisons between gene atlas studies are complicated by differences in tissue sampling, our findings generally agree with previous reports of distinct tissue-specific expression profiles, and are consistent with transcription playing a major role in tissue differentiation and function.

To demonstrate the utility of the annotated transcriptome assembly produced in this study, and because cell wall biosynthesis is the subject of ongoing research in biofuel crops (Faik, 2010; Pauly and Keegstra, 2010; Bosch et al., 2011; Carpita, 2011), we compiled a list of cell wall biosynthesis related genes in the P. hallii transcriptome. In grasses the primary and secondary cell walls are composed almost entirely of cellulose and hemicelluloses (Vogel, 2008), and our transcriptome assembly includes the major genes associated with synthesis of these cell wall polysaccharides. RNA-Seq detected the expression of multiple transcripts matching cellulose synthase (Ces, 17 transcripts) and cellulose synthase like genes (Csl, seven transcripts), with clear tissue-specific differences in expression within each gene family. The biosynthesis of xylan, the dominant hemicellulose in grass secondary cell walls (Vogel, 2008), has not been described as completely as other cell wall polysaccharides. Numerous candidates with possible roles in this process were identified in the core transcripts, including glycosyl transferase (GT) homologs (41 transcripts), and many of these were preferentially expressed in stem-associated tissues (14 transcripts in cluster 1; Figure 7). These included two well-characterized Arabidopsis genes associated with xylan biosynthesis (IRX9 and IRX10). Other GT transcripts showed different tissue specificity; for example, a transcript homologous to the maize glycosyl transferase (GT47) gene GRMZM2G059825, previously shown to be down-regulated in elongating internodes (Bosch et al., 2011), was expressed at low levels in all tissues except root in our RNA-Seq data (cluster 4 in Figure 7). While a comprehensive description of the genes and pathways underlying xylan biosynthesis in grasses remains unavailable (Faik, 2010), the sequences and expression profiles presented here provide valuable tools for further studies of xylan metabolism in Panicum.

Secondary cell walls in grasses also include a substantial proportion of lignins (Vogel, 2008), and lignin content is an important determinant of the efficiency with which lignocellulosic feedstocks are converted to sugars for biofuel production (Simmons et al., 2010). The biochemical and genetic bases of lignin production have been widely studied (Humphreys and Chapple, 2002), and the saccharification efficiency of lignocellulosic feedstocks has been improved by manipulating lignin biosynthetic pathways (Chen and Dixon, 2007; Fu et al., 2011). Our analysis of the P. hallii transcriptome identified putative homologs for all genes in the phenylpropanoid pathway leading to lignin biosynthesis, and demonstrated their expression in all tissues tested here (Figure 8). This includes the COMT gene targeted to improve efficiency of biomass degradation in P. virgatum (Fu et al., 2011). Our results are broadly similar to those obtained from a comparable analysis in the recently published maize gene atlas (Sekhon et al., 2011), demonstrating that the transcriptome assembly produced in this study enables detailed study of metabolic pathways relevant for biofuel research.

Analysis of cDNA from multiple tissues and stages of P. hallii using next-generation sequencing technologies produced an extensive catalog of expressed sequences, each annotated based on both sequence similarity and tissue-specific expression patterns. Our application of RNA-Seq to profile expression across tissues demonstrates the utility of this transcriptome resource, and highlights tissue-specific differences in expression dynamics and minimal sequencing coverage. These resources will make it possible for future studies to target genes underlying important agronomic traits such as drought tolerance, cell wall characteristics, and biomass production in this emerging diploid model.

Experimental procedures

Transcriptome sequencing

To maximize the number of genes included in the transcriptome, we used RNA from a variety of tissues and stress treatments. Multiple tissues were sampled (approximately 1 g tissue per sample) from an individual greenhouse-grown genotype (FIL2 accession): leaf, inflorescence (developing fruits, florets, spikelets, and panicle material), stem, node, crown, root, and seeds. Additional samples were prepared from seedlings grown from surface-sterilized seeds (scarified with sandpaper to enhance germination) on sterile agar plates (4.4 g L^−1 MS basal salts, 2% sucrose, 1.2 mM carbenicillin, 0.7% agar) for 2 weeks (24°C, 50% relative humidity, 12 h:12 h light–dark cycle). Additional leaf tissue samples were collected from other clones (FIL2) exposed to crude stress treatments intended to induce changes in gene expression: heat and drought stress (full sun, 32°C, 4 h in drying soil), chilling (17°C for 5 h), and high light levels (leaves held in contact with a fluorescent bulb, 17 W Philips Alto 368126 for 4 h). Other samples were collected approximately 4 h after mechanical damage (hole punch) to simulate herbivory. Another sample was from excised leaves desiccated on the bench top for approximately 1 h.

Tissue samples were ground in liquid nitrogen using disposable plastic pestles, then extracted using the Spectrum Plant Total RNA kit (Sigma, http://www.sigmaaldrich.com/) according to the manufacturer’s instructions, including the on-column DNase treatments. Isolation of intact RNA from root samples required an additional step prior to the kit. Ground root tissue was first suspended in pre-heated (65°C) cetyltrimethyl ammonium bromide (CTAB) buffer [2% CTAB, 2% polyvinylpyrrolidone, 100 mM 2-amino-2-(hydroxymethyl)-1,3-propanediol (TRIS) pH 8.0, 25 mM EDTA, 2% β-mercaptoethanol] and immediately combined with an equal volume of phenol–chloroform–isoamyl alcohol mixture (25:25:1). After incubation at 65°C for 10 min, samples were extracted using two rounds of a standard phenol–chloroform–isoamyl alcohol (25:24:1) procedure. The RNA was precipitated using ethanol, and then purified using the Spectrum Plant Total RNA kit.

Normalized cDNA libraries were prepared for 454 sequencing as previously described (Meyer et al., 2009). Complementary DNA from each sample was combined in equal amounts prior to normalization to maximize transcript diversity. An initial library was prepared from all samples except root, seed, and seedling, and those three cDNAs were combined to produce a second library. Each pooled cDNA sample was normalized using double-strand-specific nuclease (Evrogen, http://www.evrogen.com/). Three fragment libraries were prepared from each pool of normalized cDNA, corresponding to 5′ ends, internal fragments, and 3′ ends, then combined in a 1:3:1 ratio to standardize coverage across the length of each transcript. These pooled libraries were sequenced on a 454-FLX Titanium sequencer (Roche, http://my454.com/) at the Genome Sequencing and Analysis Facility (GSAF) (University of Texas, Austin, TX, USA).

Transcriptome assembly and annotation

Raw 454 reads were trimmed using BLASTN and custom Perl scripts to exclude adaptors used in library preparation, then assembled using Roche De Novo Assembler (version 2.5p1) with the –cdna option and a minimum contig length of 100 bp. To preserve rare transcripts, we identified a set of unique singletons by clustering singletons at 99% nucleotide sequence identity and comparing the cluster consensus sequences against the contigs (bit-score ≥50) to eliminate redundant sequences. Contigs and unique singletons were combined to produce the final assembly used for annotation and gene expression profiling.

Functional annotation of the assembly was based on comparisons with UniProt (Release 2010_09, http://www.uniprot.org/) using blastx (significance threshold: e-value ≤ 10^−6). Annotation was conducted as previously described (Meyer et al., 2009) using custom Perl scripts drawing heavily on BioPerl (Stajich et al., 2002). Gene names were assigned based on annotation of the closest UniProt match, with uninformative descriptions excluded. The GO annotation of the UniProt database (http://www.geneontology.org) was used to assign GO terms to each assembled sequence matching a record with GO annotation.

The assembly was compared with other plant species to identify putative homologs. An assembled collection of P. virgatum ESTs (version 181a) was obtained from the Plant Genome Database (http://www.plantgdb.org/), and compared with P. hallii sequences using BLASTN. Protein sequences of gene models for foxtail millet (version 2.1), rice (version 6.0), and sorghum (version 1.0) were obtained from Phytozome (http://www.phytozome.net/), and for Arabidopsis (version TAIR9) from TAIR (http://www.arabidopsis.org/); these were compared with P. hallii sequences using blastx. Sequence divergence was estimated for each pair of putative homologs based on global sequence similarity using fsa software (Bradley et al., 2009). To account for differences in database size, a bit-score threshold of 60 (equivalent to the 10^−6 e-value threshold used in larger databases) was used for these searches.

To evaluate the contributions of organelles and biological contaminants we compared the assembly with the P. virgatum chloroplast genome (accession HQ731441) and the S. bicolor mitochondrial genome (DQ984518) (TBLASTX bit-scores ≥50). The contribution of rRNA was estimated from comparisons with S. bicolor rRNA genes (accessions ABXC01000775 and ABXC01009962) (BLASTN bit-score ≥100). To account for sequence similarity between chloroplast and nuclear rRNA genes, these assignments were further confirmed using BLASTN comparisons against a database containing both. Putative TEs were identified by comparison with eukaryotic TEs from RepBase version 14.12 (http://www.girinst.org/repbase/) (TBLASTX bit-score ≥50). To evaluate the contribution of any biological contaminants that might arise from the use of non-sterile greenhouse-grown samples we screened the assembly against NCBI’s nr database (BLASTN, e-value ≤ 10^−6) and scored the taxonomic identity of each transcript’s best match.

Profiling tissue-specific gene expression with RNA-Seq

An atlas of gene expression across multiple tissues and stages was produced using samples from additional clones of the same accession (FIL2) used for transcriptome sequencing. Five replicate RNA samples (individual plants) were prepared from inflorescence, leaf, node, root, seed, stem, and crown tissues, and from 2-week-old seedlings grown on sterile agar. To minimize any confounding effects of circadian rhythm on expression comparisons, all tissues were sampled at midday (10 a.m.–2 p.m.) and immediately flash-frozen on liquid nitrogen. Roots were sampled at midday on a subsequent day, seeds were sampled from dry storage (originally intended for propagating plant material), and seedlings were sampled from a growth chamber.

One microgram of total RNA from each sample was used to prepare a cDNA fragment library for expression profiling, as previously described (Meyer et al., 2011). This tag-based method sequences a single cDNA tag derived from the 3′ end of each transcript. The cDNA constructs were prepared from fragmented RNA, labeled with sample-specific oligonucleotide barcodes, then pooled and sequenced. Each of 40 samples (eight tissues or stages with five replicates each) was sequenced at approximately 4 million raw reads per sample on the SOLiD System (version 3.0) (Applied Biosystems, http://www.appliedbiosystems.com/) at GSAF. We subsequently generated additional sequence data from leaf samples (4.3 million additional mapped reads) to evaluate the benefits of increased sequencing depth.

SOLiD reads (50 bp) were trimmed to remove non-template bases introduced in library preparation, then processed to exclude low-quality reads, homopolymers, and reads matching adaptors used in library construction. The high-quality (HQ) reads passing these filters were aligned in color-space against the annotated transcriptome assembly. Alignment (‘mapping’ of reads to transcripts) was accomplished using the Shrimp software package (Rumble et al., 2009) (version 2.1.1b) with default settings. We required unique mappings at least 35 bp in length with P < 0.05 (obtained using Shrimp’s probcalc program). Expression levels were compiled for each sample based on the number of reads mapping to each transcript.

Minimal sequencing requirements for complex templates such as cDNA depend on the distribution and diversity of the pool. To evaluate coverage requirements in P. hallii we applied rarefaction analysis to sequence data from multiple tissues. Custom Perl scripts were used to simulate lower sequencing depths by randomly sampling without replacement from the read-to-transcript associations, with n = 10 simulations per sequencing depth. In the normalized cDNA assembly (454), the relationship between reads and contigs was resampled. Associations between SOLiD reads and reference sequences were resampled in the same way, to estimate the relationship between sequencing depth and number of genes detected in RNA-Seq.
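The resampling procedure can be sketched in Python as follows. This is an illustration of the general approach, not the custom Perl scripts used in this study: the function name `rarefy`, the fixed random seed, and the toy read list are hypothetical, and a transcript is counted as detected if at least one sampled read maps to it, which is an assumption about the detection criterion.

```python
import random

def rarefy(read_assignments, depth, n_sim=10, seed=0):
    """Mean number of distinct transcripts detected at a simulated depth.

    read_assignments: list with one transcript ID per mapped read.
    depth: number of reads drawn at random without replacement.
    n_sim: number of independent simulations to average over.
    """
    if depth > len(read_assignments):
        raise ValueError("depth exceeds the number of available reads")
    rng = random.Random(seed)
    detected = []
    for _ in range(n_sim):
        subsample = rng.sample(read_assignments, depth)  # without replacement
        detected.append(len(set(subsample)))
    return sum(detected) / n_sim

# Toy data set: one abundant, one moderate, and one rare transcript.
reads = ["t1"] * 90 + ["t2"] * 9 + ["t3"]
for depth in (10, 50, 100):
    print(depth, rarefy(reads, depth))
```

Plotting the mean number of detected transcripts against depth gives a rarefaction curve; a curve that flattens suggests that additional sequencing would detect few new transcripts.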

Raw reads used for this assembly were archived in NCBI’s SRA database (accession number SRA046309.1), and the assembly in TSA (accession number GDSub17051). The annotated transcriptome assembly is also available for download and BLAST at our laboratory website (http://w3.biosci.utexas.edu/juenger_lab).

Statistical analysis of gene expression

Raw counts data were used to compare gene expression profiles among samples. Expression comparisons focused on a set of ‘core transcripts’ detected reproducibly across multiple samples (two or more reads mapped in each of three or more samples). Statistical comparisons used DESeq (Anders and Huber, 2010), a gene expression analysis package for the R software environment (R Development Core Team, 2008). This method is based on the negative binomial distribution, with parameters of the mean–variance relationship estimated empirically to account for overdispersion. False discovery rate (FDR) was controlled at 0.05 as previously described (Benjamini and Hochberg, 1995). For subsequent analyses, all expression data were normalized using a variance-stabilizing transformation based on the mean–variance relationship estimated by DESeq (Anders and Huber, 2010).
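Two steps of this analysis lend themselves to a short illustration: the ‘core transcript’ filter and the Benjamini–Hochberg adjustment. The Python sketch below is not DESeq and does not perform the negative binomial test; it only shows, with hypothetical function names and toy data, how the filter and the FDR adjustment operate.

```python
def core_transcripts(counts, min_reads=2, min_samples=3):
    """Return transcripts with >= min_reads mapped reads in >= min_samples samples."""
    return sorted(
        transcript
        for transcript, sample_counts in counts.items()
        if sum(1 for c in sample_counts if c >= min_reads) >= min_samples
    )

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw counts (one value per sample) and raw P-values.
toy_counts = {"a": [2, 3, 0, 5], "b": [1, 1, 10, 0], "c": [2, 2, 1, 1]}
print(core_transcripts(toy_counts))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Transcripts whose adjusted P-value falls below 0.05 would then be treated as differentially expressed at the FDR used here.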

Z-scores from normalized gene expression data for all genes and tissues were obtained for an overview of expression dynamics. Scores were calculated as $Z = (X - X_{av})/SD$, where $X$ is the average expression in each tissue, and $X_{av}$ and SD are the average and standard deviation, respectively, across all tissues.
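As a minimal sketch of this calculation for a single gene, the Python function below standardizes per-tissue average expression against the mean and standard deviation across tissues. The function name and expression values are hypothetical, and the use of the sample standard deviation is an assumption, because the estimator is not specified in the text.

```python
from statistics import mean, stdev

def z_scores(tissue_means):
    """Standardize one gene's per-tissue average expression across tissues.

    tissue_means: dict mapping tissue name -> average normalized expression.
    """
    values = list(tissue_means.values())
    x_av = mean(values)
    sd = stdev(values)  # sample standard deviation (assumed)
    if sd == 0:
        raise ValueError("Z-scores are undefined when expression is constant")
    return {tissue: (x - x_av) / sd for tissue, x in tissue_means.items()}

# Hypothetical normalized expression values for one gene.
print(z_scores({"leaf": 8.0, "root": 4.0, "seed": 6.0}))
```

A negative score indicates expression below the across-tissue average for that gene, and a positive score indicates expression above it.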

A PCA was used to quantify global expression trends among DEG, using R’s prcomp command without scaling. To identify sets of genes associated with tissue-specific expression patterns, we used hierarchical clustering of expression levels based on Pearson product–moment correlation, with Ward’s minimum variance method.
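Clustering on Pearson correlation groups transcripts by the shape of their expression profile rather than by absolute expression level. The Python sketch below illustrates only the correlation-based dissimilarity; it does not implement Ward’s method or reproduce the R analysis, and the conversion to a distance as 1 − r, the function names, and the toy profiles are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)

def correlation_distance(x, y):
    """Dissimilarity in [0, 2]: 0 for identical shapes, 2 for opposite shapes."""
    return 1.0 - pearson(x, y)

# Toy profiles across three tissues: same shape at different scales, then opposite.
print(correlation_distance([1, 2, 3], [2, 4, 6]))
print(correlation_distance([1, 2, 3], [3, 2, 1]))
```

A matrix of such pairwise distances is the kind of input that an agglomerative clustering routine would take to build the tree from which clusters are cut.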

The GO enrichment analysis was accomplished by comparing GO annotation (biological process terms, level 4) with that of the complete set. The proportion of genes associated with each term was compared using Fisher’s exact test with the FDR controlled at 0.05.
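For a single GO term, the comparison of proportions can be illustrated with a one-sided Fisher’s exact test computed from the hypergeometric distribution. The Python sketch below is an assumption-laden illustration: the text does not state whether a one- or two-sided test was used, and the function name and counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact P-value for over-representation of a term.

    k: genes in the test set annotated with the term
    n: genes in the test set
    K: genes in the complete set annotated with the term
    N: genes in the complete set
    Returns the probability of observing k or more annotated genes by chance.
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical example: 5 of 5 genes in a cluster carry a term held by 5 of 10 genes overall.
print(fisher_enrichment_p(5, 5, 5, 10))
```

The resulting P-values, one per GO term, would then be adjusted for multiple testing to control the FDR at 0.05.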

Acknowledgements

Panicum hallii var. filipes germplasm was obtained from the United States Department of Agriculture (USDA) Kika de la Garza Plant Materials Center (Kingsville, TX, USA). This work was funded by NSF grant number IOS-0922457 to TEJ.

References

Supporting Information

File S1. Transcripts putatively specific to Panicum hallii.

File S2. Relative contributions of organelles and transposable elements (TE) in the transcriptome assembly.

File S3. Summary of transposable elements in the assembly.

File S4. Relative contributions of biological contaminants in the transcriptome assembly.

File S5. Distribution of expression levels by transcript.

File S6. Tissue specificity of gene expression.

File S7. Transcripts specifically expressed in a single tissue.

File S8. Transcripts showing differential expression among tissues.

File S9. Gene Ontology analysis of tissue-specific gene expression.

File S10. Reference genes for quantitative PCR identified from stable transcripts in RNA-Seq (the quantitative analysis of short cDNA reads) expression profiles.

Filename                                  Format  Size    Description
TPJ_4938_sm_supp-info-legends.doc         doc     20K     Supporting info item
TPJ_4938_sm_Supplementary-File-1.xls      xls     21K     Supporting info item
TPJ_4938_sm_Supplementary-file-10.xls     xls     28K     Supporting info item
TPJ_4938_sm_Supplementary-File-2.xls      xls     27K     Supporting info item
TPJ_4938_sm_Supplementary-File-3.xls      xls     28K     Supporting info item
TPJ_4938_sm_Supplementary-File-4.xls      xls     27K     Supporting info item
TPJ_4938_sm_Supplementary-File-5.xls      xls     21K     Supporting info item
TPJ_4938_sm_Supplementary-file-6.eps      eps     30497K  Supporting info item
TPJ_4938_sm_Supplementary-file-7.xls      xls     114K    Supporting info item
TPJ_4938_sm_Supplementary-file-8.xls      xls     2544K   Supporting info item
TPJ_4938_sm_Supplementary-file-9.eps      eps     13K     Supporting info item
