Serial analysis of gene expression (SAGE) was applied for profiling expressed genes in rice seedlings. In the SAGE method, a 9–11 bp fragment (tag) represents each transcript, and frequency of a tag in the sample directly reflects the abundance of the respective mRNA. We studied 10 122 tags derived from 5921 expressed genes in rice (Oryza sativa L.) seedlings, among which only 1367 genes (23.1%) matched the rice cDNA or EST sequences in the DNA database. SAGE showed that most of the highly expressed genes in rice seedlings belong to the category of housekeeping genes (genes encoding ribosomal proteins or proteins responsible for metabolism and cell structure). Unexpectedly, the most highly expressed gene in rice seedlings was a metallothionein (MT) gene, and together with three other messages for MT, it accounts for 2.7% of total gene expression. To our knowledge, this is the first quantitative study of global gene expression in a higher plant. We further applied the SAGE technique to identify differentially expressed genes between anaerobically treated and untreated rice seedlings. Additionally, we show that a longer cDNA fragment can be easily recovered by PCR using the SAGE tag sequence as a primer, thereby facilitating the analysis of unknown genes identified by tag sequence in SAGE. In combination with micro-array analysis, SAGE should serve as a highly efficient tool for the identification and isolation of differentially expressed genes in plants.
Owing to the recent development of genome projects in various organisms, sequence information for genes is accumulating at an unprecedented pace. However, our understanding of the biological functions of genes is being left far behind. Expression analysis of genes comprises one of the most important parts of gene-function studies. High-throughput, quantitative methods for gene expression analysis are currently required to meet this goal. Conventional techniques such as Northern blotting, RNA dot blotting and RT–PCR methods allow the evaluation of only a limited number of genes at one time. To identify genes that are differentially expressed in two samples, differential screening ( Kiyosue et al. 1994 ) or differential display ( Liang & Pardee 1992) techniques have been increasingly applied. Although useful for scaling-up, these methods are not very accurate in quantitatively monitoring differences in gene expression.
Serial analysis of gene expression (SAGE), developed by Velculescu et al. (1995) , allows a highly accurate, quantitative analysis of expression of thousands of genes at a time, and recommends itself as an efficient tool for gene expression studies in the post-genomic era.
Briefly, a 9–11 bp DNA fragment (tag) adjacent to the NlaIII site located closest to the 3′ end of each cDNA is extracted by a series of linker ligations and restriction digestions. Tags are then amplified by PCR, concatenated, cloned into a plasmid vector and sequenced. The frequency of each tag in the sequence directly reflects the abundance of the corresponding mRNA in the tissue. A tag as short as 9 bp contains sufficient information to uniquely identify a gene, provided that it is extracted from a defined position in the transcript. The maximum possible number of different 9 bp tag sequences (4^9 = 262 144) far exceeds the estimated number of genes in plants (15 000–33 000 in Arabidopsis, Cooke et al. 1996 ), making unique identification of a gene by its tag sequence feasible. SAGE analysis has been successfully applied for transcript profiling in yeast ( Velculescu et al. 1997 ), and the identification of differentially expressed transcripts in normal and cancer cells of humans ( Zhang et al. 1997 ). A comparison of SAGE profiles between cancer cells and p53-introduced cancer cells identified genes that are presumably involved in p53-mediated apoptosis ( Polyak et al. 1997 ), demonstrating the power of SAGE for elucidating the genetic cascades controlled by transcription factors.
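The positional rule for a tag can be illustrated in silico. The following is a minimal sketch, not part of the SAGE software: it assumes a cDNA given as a sense-strand string, takes the 9 bases following the 3′-most CATG, and evaluates the number of possible 9 bp tags. The function name and the example transcript are hypothetical.

```python
from typing import Optional

ANCHOR_SITE = "CATG"   # NlaIII recognition sequence
TAG_LENGTH = 9         # bases taken downstream of the site, as in the text


def extract_sage_tag(cdna: str, tag_length: int = TAG_LENGTH) -> Optional[str]:
    """Return the tag adjacent to the 3'-most CATG, or None if unavailable."""
    seq = cdna.upper()
    pos = seq.rfind(ANCHOR_SITE)          # NlaIII site closest to the 3' end
    if pos == -1:
        return None                       # transcript has no NlaIII site
    start = pos + len(ANCHOR_SITE)
    tag = seq[start:start + tag_length]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_length else None


# Hypothetical transcript with two NlaIII sites; only the 3'-most one is used.
example_cdna = "GG" + "CATG" + "AAACCCGGGTTT" + "CATG" + "GCACTCGTT" + "A" * 8
tag = extract_sage_tag(example_cdna)

# Number of distinct sequences of length 9 over the alphabet A, C, G, T.
possible_tags = 4 ** TAG_LENGTH
```

In this sketch the upstream CATG is ignored because only the site nearest the 3′ end defines the tag, which is what makes the tag position reproducible for a given transcript.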
Here we report the use of the SAGE technique to study rice gene expression. To our knowledge this is the first application of SAGE for studying plant genes. As a first attempt, we studied gene expression profiles of rice seedlings by extracting a total of 10 122 tags. Differential gene expression between anaerobically treated and untreated rice seedlings was studied by extracting >2000 tags from each sample, which allowed the identification of 24 differentially expressed genes. We further show that a longer cDNA sequence can be easily recovered by PCR using the 13 bp SAGE tag sequence as a primer for 3′-RACE. This will allow rapid isolation of the entire cDNA of interest, even if the SAGE tag shows no match in the pre-existing EST database.
Results and Discussion
As the material for the first application of SAGE analysis to rice, we used 5-day-old etiolated seedlings. PolyA+ RNA was isolated from whole seedlings and subsequently converted to double-stranded cDNA. The original SAGE protocol ( Velculescu et al. 1995 ) was followed except for three changes: (i) primer sets different from those of the original protocol were used for efficient PCR amplification of the ditags; (ii) the PCR primers were biotinylated, and linker fragments were removed with streptavidin-coated magnetic beads before concatenation ( Powell 1998); and (iii) concatenated ditags were size-fractionated (>400 bp) before cloning into a plasmid vector. In our hands these alterations were essential for successful cloning of concatemers larger than 400 bp (>26 tags).
DNA sequences of the concatenated ditags were analysed by SAGE analysis software ( Velculescu et al. 1997 ). After the duplicated ditags were eliminated, the 9 bp sequence adjacent to each NlaIII site (CATG) was extracted; together with the CATG site, each 'tag' comprises 13 bp.
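The logic of this step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the SAGE analysis software itself: the concatemer is split at CATG, ditags outside an assumed length window (`min_len`, `max_len`) are skipped, repeated ditags are counted once, and the two tags are read from the two ends of each ditag. All sequences in the example are hypothetical.

```python
from typing import List

SITE = "CATG"      # NlaIII site separating ditags in a concatemer
TAG_LENGTH = 9

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer: str,
                         min_len: int = 18,
                         max_len: int = 26) -> List[str]:
    """Split at CATG, drop repeated ditags, return 13 bp tags (CATG + 9 bp)."""
    seen = set()
    tags: List[str] = []
    for ditag in concatemer.upper().split(SITE):
        if not (min_len <= len(ditag) <= max_len):
            continue                      # flanking sequence or malformed ditag
        if ditag in seen or reverse_complement(ditag) in seen:
            continue                      # duplicated ditag: keep one copy only
        seen.add(ditag)
        left = ditag[:TAG_LENGTH]                          # read on this strand
        right = reverse_complement(ditag[-TAG_LENGTH:])    # read on the other strand
        tags.extend([SITE + left, SITE + right])
    return tags


# Hypothetical ditags: two tags joined tail to tail, the second as its reverse complement.
ditag_a = "GCACTCGTT" + "AC" + "CCCGGGTTT"
ditag_b = "TTGACCTAG" + "ATTGGATCC"
concatemer = SITE + SITE.join([ditag_a, ditag_a, ditag_b]) + SITE
tags = tags_from_concatemer(concatemer)
```

Discarding repeated ditags guards against PCR amplification bias, since two independent transcripts are unlikely to form the same ditag by chance.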
After sequencing 650 plasmid clones, a total of 10 122 tags were obtained, of which 5593 tags were encountered more than once. The rest of the tags were unique, making the number of different tags 5921 ( Table 1). We define these different tags as different genes, and of these only 1367 genes (23.1%) matched the rice cDNA or EST sequences in the DNA database.
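The bookkeeping behind such a summary is simple counting. The sketch below assumes a flat list of extracted tags and uses hypothetical function names and toy data; only the final percentage uses figures quoted in the text (1367 of 5921).

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_tags(tags: Iterable[str]) -> Dict[str, int]:
    """Count total tags, different tags, and different tags seen more than once."""
    counts = Counter(tags)
    return {
        "total_tags": sum(counts.values()),
        "different_tags": len(counts),
        "different_tags_seen_more_than_once": sum(1 for n in counts.values() if n > 1),
    }


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Toy data (hypothetical): three different tags, two of them repeated.
toy_tags = ["GCACTCGTT"] * 3 + ["AAACCCGGG"] * 2 + ["TTGACCTAG"]
summary = summarize_tags(toy_tags)

# Share of the toy library taken by the most frequent tag.
top_tag_percent = percent(3, summary["total_tags"])

# Share of different tags with a database match, using the figures in the text.
matched_percent = percent(1367, 5921)
```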
Table 1. Summary of SAGE analysis in rice seedlings
No. of tags matched with sequences in the database b (%)
No. of tags appearing more than once
Tag was extracted from the sequence data as 9 bp sequence adjacent to an NlaIII site (CATG).
Determined by searching previously known rice cDNA and EST databases with the 13 bp tag sequence.
On the other hand, 80% of the most abundant sequences matched known rice cDNA or EST sequences ( Fig. 1). This is similar to observations for SAGE analysis of human cells ( Zhang et al. 1997 ) in which 98% of highly expressed genes (>500 copies per cell) matched GenBank entries, whereas only 54% of the total genes had GenBank matches. The higher rate of database matching of abundantly expressed genes is reasonable, as highly expressed genes are more likely to be isolated and registered in the DNA database by large-scale cDNA or EST analysis.
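Database matching with a 13 bp tag amounts to an exact substring search on either strand. The sketch below is a simplified illustration, assuming the database is available as a dictionary of names and sequences; the entry names and sequences are hypothetical, and a real search would also have to consider whether the match lies at the 3′-most NlaIII site of the entry.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def annotate_tags(tags: List[str], est_db: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each 13 bp tag to the database entries containing it on either strand."""
    hits: Dict[str, List[str]] = {}
    for tag in tags:
        rc = reverse_complement(tag)
        hits[tag] = [name for name, seq in est_db.items()
                     if tag in seq.upper() or rc in seq.upper()]
    return hits


# Hypothetical database: one entry on the sense strand, one on the antisense strand.
est_db = {
    "EST_hypothetical_1": "TTTT" + "CATGGCACTCGTT" + "AAAA",
    "EST_hypothetical_2": "GG" + "CCCGGGTTTCATG" + "TT",
}
query_tags = ["CATGGCACTCGTT", "CATGAAACCCGGG", "CATGTTGACCTAG"]
hits = annotate_tags(query_tags, est_db)
matched = sum(1 for names in hits.values() if names)
```

An entry sequenced only from its 5′ end would often not contain the tag at all, which is one reason a tag from a known gene can remain unmatched.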
Profiles of highly expressed genes in rice seedlings
In Table 2, the 30 most frequently expressed genes are listed. Additionally, the 100 most abundantly transcribed genes in rice seedlings were categorized by function and are presented in Fig. 2. To our knowledge this is the first quantitative study of global gene expression in higher plants. Three genes in the top 30 genes, and 20 genes in the top 100 genes, did not match database entries. Ten genes in the top 100 genes had unknown functions (non-annotated).
Table 2. The 30 most abundantly expressed genes in rice seedlings
Tags are presented as a 9 bp sequence excluding the NlaIII site (CATG).
Percentage of gene (tag) expression was calculated as the number of occurrences of each tag per 10 122 tags.
19 kDa globulin
salT (salt stress inducible gene)
glycine-rich cell wall protein
EST similar to metallothionein
EST similar to barley amylase inhibitor
70 kDa heat-shock protein
EST similar to metallothionein
EST similar to allergenic protein
EST similar to ribosomal protein
EST similar to ribosomal protein
13 kDa prolamin
We hypothesized that genes essential for living cells (genes encoding proteins with metabolic functions) should account for the most abundantly expressed gene class in rice seedlings, as shown in yeast ( Velculescu et al. 1997 ) and human ( de Waard et al. 1999 ) SAGE studies. As expected, our results also showed that most of the top 100 genes belonged to categories of potential housekeeping genes, i.e. genes encoding proteins involved in sugar, nucleic acid or amino acid metabolism, cell structure formation and maintenance, ribosomal functions, protein folding, water channelling or translation. However, quite unexpectedly the most abundantly expressed gene in rice seedlings was not a housekeeping gene, but a type-2 metallothionein (MT) gene. Its expression level was almost twice as high as that of the second most abundantly expressed gene (19 kDa globulin), and accounted for more than 1% of total gene expression. Together with three other MT-like protein genes found among the top 30 genes, MT genes amounted to 2.7% of the total gene expression in rice seedlings. Although an MT gene was observed among the highly expressed genes in yeast ( Velculescu et al. 1997 ), the amount of expression and the number of expressed MT genes in rice seedlings are remarkable. Generally, MT is suggested to function in metal binding and detoxification of heavy metals ( Hsieh et al. 1995 ). It is known that genes encoding MT or MT-like proteins constitute a gene family in rice ( Hsieh et al. 1995 ; Hsieh et al. 1996 ), and ESTs of MT genes are abundant in the rice EST database. Why are so many MT genes expressed so abundantly in rice seedlings? We speculate that MTs have essential roles in plant growth other than metal detoxification, such as delivering metal cofactors to anionic flax peroxidase, which is possibly involved in cell wall lignification and cell elongation ( Omann et al. 1994 ; Yu et al. 1998 ); or reducing the concentration of free metal ions in the cell to prevent the increase of reactive oxygen species ( Batt et al. 1998 ).
Expression of a globulin gene was the second highest ( Table 2), and other storage protein genes (prolamin) were also found among the highly expressed genes. Since the expression of these genes was reported to be specific to maturing seeds ( Liu et al. 1995 ), their expression in the seedlings seems unusual. Vegetative storage proteins, expressed in young leaves, are known in soybean ( DeWald et al. 1992 ) and Arabidopsis ( Utsugi et al. 1998 ); however they show protein phosphatase activity ( DeWald et al. 1992 ), and are thought to be a different type of protein from rice prolamin or glutelin. Further analysis of the role of abundant storage proteins in the rice seedlings is required.
Several genes designated as stress-inducible protein genes were observed among the highly expressed genes. For example, the salT gene, known to be inducible by salt treatment or drought ( Claes et al. 1990 ), and the zinc-induced protein gene were found among the top 30 genes. These genes have been characterized only in terms of their expression patterns, and their functions are largely unknown. Three chitinase genes were also highly expressed. Although chitinases are widely regarded as antifungal enzymes, a certain class of chitinase is known to rescue the development of an embryogenesis mutant in carrot ( De Jong et al. 1992 ). Highly expressed chitinase genes in rice plants under non-stress conditions may have a potential role in plant development.
Differential gene expression in response to anaerobic stress
In order to evaluate the power of the SAGE technique for identifying differentially expressed genes in rice, gene expression was compared between anaerobically treated and untreated rice seedlings. Response to anaerobic stress in higher plants has been extensively studied at the physiological and molecular biological levels. It was previously reported that overall gene expression patterns in plants were greatly changed by anaerobic stress ( Sachs et al. 1980 ). It is well known that genes for anaerobic metabolism, such as alcohol dehydrogenase, are anaerobically induced in many plants including rice ( Matsumura et al. 1998 ).
RNA was extracted from shoots of 6-day-old seedlings, which were either anaerobically treated for 1 day or untreated before harvest, and their transcripts were quantitatively profiled. A total of 2094 and 2205 tags (corresponding to 1496 and 1551 different genes) were obtained from anaerobically treated and untreated samples, respectively. The 10 most highly expressed genes in each sample are shown in Table 3. In both samples, metallothionein and globulin genes accounted for the two most highly expressed genes, as observed in the results for SAGE in the whole seedlings ( Table 2). Furthermore, most of the genes listed in Table 3 were also among the 30 most abundantly expressed genes shown in Table 2. This result indicates that independent SAGE experiments give reproducible results insofar as the developmental stages or growth conditions of plant materials are the same.
Table 3. Highly expressed genes in anaerobically treated and untreated rice shoots
Tags are presented as a 9 bp sequence excluding the NlaIII site (CATG).
EST similar to metallothionein
EST similar to metallothionein
EST similar to barley amylase inhibitor
Comparison of expression patterns between anaerobically treated and untreated shoots revealed that most transcripts were expressed at similar levels (green bars in Fig. 3). Among the genes induced or repressed more than fourfold, 24 transcripts were expressed at significantly different levels (P < 0.05; Table 4). Eighteen genes, including six with no match in rice EST sequences, were anaerobically induced, and six genes were repressed. The most highly induced gene (14-fold) was a prolamin gene. Prolamin is a type of storage protein, and is usually expressed in seed but not in leaf or shoot tissues ( Masumura et al. 1990 ; Shyur et al. 1992 ). A stress response of prolamin gene expression has not previously been described, and the prolamin gene identified here may have a novel function in anaerobiosis. Expression of a transcript similar to the expansin gene was also anaerobically induced (fivefold increase). Expansin is necessary for cell wall extension, and its expression was promoted by submergence in deepwater rice ( Cho & Kende 1997). Genes for a glycine-rich cell-wall protein, involved in cell elongation ( Xu et al. 1995 ), were also anaerobically induced (sixfold increase). Shoot elongation of rice plants usually occurs under anaerobic conditions, so that high expression of the genes relevant to cell elongation may be necessary. Although we expected that the genes for anaerobic metabolism (glycolytic or fermentative enzymes) would be strongly induced by anaerobic treatment, they were not identified among the differentially expressed genes. The expression levels of these genes are far less prominent than those of constitutively expressed genes such as metallothionein.
Table 4. Differentially expressed genes in the anaerobically treated seedlings
Most of the genes shown in Table 4 are not previously known to respond to anaerobic stress. Expression study and functional analysis of these genes may reveal hitherto unidentified mechanisms of anaerobic response in rice. Our results demonstrate that a wealth of information may be obtained by the application of the SAGE technique in plant science.
RACE analysis using a 13 bp tag sequence
A large number of rice genes identified by SAGE did not match rice cDNA entries in the database. To further exploit the information obtained by SAGE, it is necessary to recover the longer DNA sequences flanking the tag site. In the first report of SAGE, a cDNA library was screened with 13 bp oligonucleotides, and the entire cDNA sequences were obtained ( Velculescu et al. 1995 ). As an alternative and faster approach, we attempted the 3′-RACE technique by using the 13 bp tag sequence as a PCR primer. Using a lambda phage lysate of a rice cDNA library as template, PCR was carried out using the 13 bp tag sequence primer and phage vector primers (M13 forward or T7 promoter). As a test case, we amplified the elongation factor 1 alpha (EF-1α) cDNA using this method. The complete cDNA sequence of EF-1α was already determined (Accession no. D63583). A 0.2 kbp fragment amplified by PCR ( Fig. 4) was cloned into a plasmid vector and sequenced. The sequence of the amplified fragment was identical to the original EF-1α cDNA, and contained the 13 bp tag sequence and polyA tail.
Application of SAGE for plant transcriptomics
The most important application of SAGE is the identification of differentially expressed genes. It is an effective method for isolating genes that are expressed in specific tissues or in response to stress. Since SAGE is a technique for analysing the amount of transcripts, we can characterize the functions of isolated genes by overexpression or knockout experiments. A wealth of useful new genes will be isolated through this approach. Another, and equally important, application of SAGE is the isolation of new gene promoters to drive transgenes.
Micro-array analysis is another high-throughput method for gene expression studies ( Ruan et al. 1998 ; Schena et al. 1995 ). It is of interest to compare the advantages and disadvantages of SAGE and micro-array. The major advantage of SAGE is that the expression of unknown (thus not cloned) genes can be analysed. For the production of micro-arrays, the clones of cDNA must be spotted onto the chips. However, even in Arabidopsis or rice, cDNA sequencing projects are far from complete, and most of the expressed genes are still undetermined. In this situation there is a higher chance of uncovering a useful novel gene by SAGE than by micro-array. Ten unidentified genes were expressed in rice seedlings at significantly different levels between anaerobically treated and untreated samples ( Table 4). Another advantage of SAGE over micro-array is that no special device, other than a DNA sequencer, is required to carry out the analysis. The disadvantage of SAGE is that it cannot be used for the analysis of small tissue samples. However, recent modification of the method (MicroSAGE, Datson et al. 1999 ) allows starting with 500–5000-fold less material than was required for the original protocol. Advantages and disadvantages are inherent in each method, thus the combined use of SAGE and micro-array offers an ideal tool for gene expression studies.
Progress in large-scale cDNA analysis (EST analysis) in many organisms, including rice ( Sasaki et al. 1994 ), is a prerequisite for the meaningful application of SAGE, as the annotation of SAGE tags is based on pre-existing EST databases. As many rice ESTs have been obtained from the 5′ end of mRNA, it will be possible to annotate more tags if the number of 3′-end rice ESTs increases.
The uniqueness of SAGE is that it allows transcript profiles to be given as digital data. Accordingly, they are suitable for the construction of gene expression databases on computer networks. Yeast and cancer transcriptome databases based on SAGE are already accessible via the Internet. The accumulated results of SAGE in plants will be useful for the construction of plant transcript databases which are widely accessible to the scientific community.
Rice (Oryza sativa L. cv. Sasanishiki) seeds were sterilized by treatment with 10% NaOCl, and germinated on moist filter paper in a sterilized box at 28°C. After 5 days’ incubation in the dark, etiolated seedlings were harvested and used for RNA isolation. For anaerobic treatment of the seedlings, approximately 500 rice seedlings (cv. Nipponbare) were prepared as described above. Half of the seedlings were completely submerged in a bottle filled with sterilized water, through which nitrogen gas was passed to remove oxygen. Seedlings under anaerobic or aerobic conditions were incubated in the dark at 28°C, and their shoots were harvested after 1 day for RNA extraction.
RNA isolation and cDNA synthesis
Total RNA was extracted from 200 whole seedlings, including grain, or from shoots, by an SDS–phenol protocol. From more than 1 mg total RNA, mRNA was obtained using an mRNA Purification Kit (Amersham Pharmacia Biotech). PolyA+ RNA (5 μg) was used for double-stranded cDNA synthesis with a cDNA synthesis kit (Gibco–BRL). Protocols supplied with the kits were followed, except for the use of biotinylated oligo-dT.
SAGE procedures were performed according to the original protocol ( Velculescu et al. 1995 , Velculescu et al. 1997 ) with some modifications. The cDNA was digested with NlaIII and captured with streptavidin-coated magnetic beads (Promega). Linkers were ligated to the captured cDNA ends, and SAGE tags adjacent to the linkers were released from the beads by BsmFI digestion. Two pools of tags were blunt-ended, and ditags were formed by ligating the two pools. Ditags were amplified by PCR using AmpliTaq Gold (Perkin-Elmer) and biotinylated linker-specific primers (5′-TCTAACGATGTACGGGGACA-3′ and 5′-TACAACTAGGCTTAATAGGGACA-3′; T. Imai, personal communication). Amplified fragments were digested with NlaIII and separated by PAGE. Ditag fragments (26–30 bp) released from the linkers were cut out from the gel and purified. Subsequently, contaminating linker fragments were removed by binding to the streptavidin beads. Tag concatemers were obtained by ligation of the purified ditags. Before cloning into the pZero vector (Invitrogen), concatemers were size-fractionated (>400 bp) on a cDNA sizesep column (Pharmacia Biotech). Escherichia coli strain Top 10F′ was transformed with the plasmids. Colonies formed on LB-Zeocin (Invitrogen) plates were screened by colony PCR for plasmids with inserts larger than 600 bp (400 bp concatemers plus vector fragments). PCR products were purified and directly sequenced with an ABI373 or ABI377 autosequencer (Perkin-Elmer). Sequence data were analysed with the sage program supplied by Dr Kinzler.
Gene expression patterns represented by SAGE tags in two samples were compared, and the significance (P value) of the difference in expression was calculated using the same sage program.
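For readers without access to that program, the comparison of a tag's counts in two libraries can be approximated with a standard test. The sketch below is not the method of the sage program: it swaps in a two-sided Fisher exact test on the 2 × 2 table of tag counts and adds a fold-change helper with an assumed pseudocount of 0.5 to avoid division by zero. The library sizes (2094 and 2205) are those reported in the text; the tag counts are hypothetical.

```python
from math import comb


def fold_change(count_a: int, total_a: int, count_b: int, total_b: int,
                pseudocount: float = 0.5) -> float:
    """Ratio of tag frequencies (A over B); the pseudocount is an assumption."""
    return ((count_a + pseudocount) / total_a) / ((count_b + pseudocount) / total_b)


def fisher_exact_two_sided(count_a: int, total_a: int,
                           count_b: int, total_b: int) -> float:
    """Two-sided Fisher exact P value for one tag counted in two libraries."""
    k = count_a + count_b                 # occurrences of the tag overall
    denom = comb(total_a + total_b, k)

    def prob(x: int) -> float:
        # Hypergeometric probability that x of the k occurrences fall in library A.
        return comb(total_a, x) * comb(total_b, k - x) / denom

    observed = prob(count_a)
    lo, hi = max(0, k - total_b), min(k, total_a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1))
            if q <= observed * (1 + 1e-9))
    return min(1.0, p)


TREATED_TOTAL, UNTREATED_TOTAL = 2094, 2205   # library sizes from the text

# Hypothetical tag seen 14 times in treated and once in untreated shoots.
fc_induced = fold_change(14, TREATED_TOTAL, 1, UNTREATED_TOTAL)
p_induced = fisher_exact_two_sided(14, TREATED_TOTAL, 1, UNTREATED_TOTAL)

# Hypothetical tag seen 5 times in each library.
p_unchanged = fisher_exact_two_sided(5, TREATED_TOTAL, 5, UNTREATED_TOTAL)
```

P values from this test need not agree numerically with those produced by the original software, so the sketch should be read only as an illustration of how tag counts and library sizes enter the comparison.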
For the isolation of larger cDNA fragments containing a SAGE tag, 3′-RACE was performed ( Fig. 4). The lysate of a cDNA library derived from rice seedlings (cv. Kakehashi) was used as template, and the first PCR reaction was performed using a lambda phage vector primer (M13 forward primer) and 13 bp SAGE tag sequence (5′-CATGGCACTCGTT-3′) primer. An aliquot of the first PCR mixture was used as template for the second PCR, using another phage vector primer located in the nested position (T7 promoter primer), and the SAGE tag sequence primer. The second PCR product was cloned into the pCR2.1 vector using a TA cloning kit (Invitrogen).
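The insert-derived part of the expected amplicon can be predicted when the cDNA sequence is known, as in the EF-1α test case. The following is a minimal sketch under the assumption that the tag primer has the sense-strand sequence and that the vector primer lies beyond the 3′ end of the insert; vector-derived flanking sequence is ignored, and the cDNA shown is hypothetical rather than the EF-1α sequence.

```python
from typing import Optional


def race_insert_product(cdna: str, tag_primer: str) -> Optional[str]:
    """Return the cDNA region from the 3'-most primer site to the 3' end."""
    seq = cdna.upper()
    primer = tag_primer.upper()
    pos = seq.rfind(primer)        # 3'-most exact match of the 13 bp primer
    if pos == -1:
        return None
    return seq[pos:]


def primer_site_count(cdna: str, tag_primer: str) -> int:
    """Number of non-overlapping exact primer matches on the sense strand."""
    return cdna.upper().count(tag_primer.upper())


TAG_PRIMER = "CATGGCACTCGTT"       # tag primer sequence given in the text

# Hypothetical cDNA: upstream region, tag site, 3' region, polyA tail.
toy_cdna = "ATGGCTAAGG" + TAG_PRIMER + "GGTACCTTGA" + "A" * 15
product = race_insert_product(toy_cdna, TAG_PRIMER)
sites = primer_site_count(toy_cdna, TAG_PRIMER)
```

Because a 13 bp primer can match more than one clone in a complex library, checking the number of exact matches in candidate sequences is a reasonable precaution before interpreting a band.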
We appreciate the provision of the SAGE protocol and SAGE analysis program by Dr Kenneth W. Kinzler (Johns Hopkins University, Baltimore). We also thank Dr Takashi Imai (National Institute of Radiological Sciences, Chiba) for advising alterations of ditag PCR condition and concatemer cloning. Thanks are due to Prof. Günter Kahl (University of Frankfurt, Frankfurt) and Dr Peter Matthews (National Museum of Ethnology, Osaka), who kindly gave us invaluable suggestions to improve the manuscript.