The application of transcriptomics to the study of host–pathogen interactions has already brought important insights into the mechanisms of pathogenesis, and it is expanding further, keeping pace with the accumulation of genomic sequences of host organisms (humans and economically important organisms such as food crops) and their pathogens (viruses, bacteria, fungi and protozoa). In this review, we introduce SuperSAGE, a substantially improved variant of serial analysis of gene expression (SAGE), as a potent tool for the transcriptomics of host–pathogen interactions. Notably, the generation of 26 bp tags in the SuperSAGE procedure makes it possible to decipher the ‘interaction transcriptome’, i.e. to monitor quantitative gene expression of a host and one of its eukaryotic pathogens simultaneously. The potential of SuperSAGE tags for a rapid functional analysis of target genes is also discussed.
The analysis of gene expression during host–pathogen interactions
Infectious diseases cost the lives of over 14 million people each year (WHO, 2003). Similarly, diseases reduce crop production by more than 10% worldwide (ISPP, 1998). Therefore, it is of primary importance to control infectious diseases effectively, from both medical and agricultural points of view. The study of host–pathogen interactions is instrumental for such purposes. Host eukaryotes are constantly exposed to attacks by microbes seeking to colonize and propagate in host cells. To counteract them, host cells utilize a whole battery of defence systems. In turn, however, successful microbes have evolved sophisticated systems to evade host defence. As such, interactions between hosts and pathogens are perceived as evolutionary arms races between genes of the respective organisms (Bergelson et al., 2001; Kahn et al., 2002; Woolhouse et al., 2002). Any interaction between a host and its pathogen involves alterations in cell signalling cascades in both partners, which may be mediated by transcriptional or post-translational changes. Here, the major challenge for researchers is how to select target genes to be studied in detail from among the thousands of genes encoded in the genome. Transcriptomics is one of the methodologies that serve this purpose. Analytical techniques for transcriptomics include differential display (DD; Liang and Pardee, 1992), cDNA-AFLP (Bachem et al., 1996), random EST sequencing (Kamoun et al., 1999), microarray (Schena et al., 1995), serial analysis of gene expression (SAGE; Velculescu et al., 1995) and massively parallel signature sequencing (MPSS; Brenner et al., 2000). Among them, microarrays are currently used more frequently than the other platforms. Several excellent reviews are available on the use of microarrays for studying host–pathogen interactions (Cummings and Relman, 2000; Diehn and Relman, 2001; Kato-Maeda et al., 2001; Wan et al., 2002; Bryant et al., 2004).
In this context, it is quite remarkable that most of the gene expression studies addressing host–pathogen interactions have in reality examined either the host or the pathogen separately. What is really at stake, however, especially in the field of plant–microbe interactions, is the simultaneous monitoring of gene expression of both host and pathogen (the ‘interaction transcriptome’), preferably during the infection process and in situ, as already advocated by Birch and Kamoun (2000). In our view too, this approach is necessary to elucidate the host–pathogen interplay in molecular detail, although the techniques available so far cannot discriminate between the transcriptomes of the two organisms. In this article, we present a novel method called SuperSAGE, which has proven potential for an analysis of the interaction transcriptome.
Serial analysis of gene expression as developed by Velculescu et al. (1995) is a high-throughput method to determine the absolute abundance of every transcript in a population of cells (Fig. 1). Messenger RNAs isolated from cells are converted to double-stranded DNA. After digestion with a 4 bp cutter ‘anchoring enzyme’ NlaIII, the poly-A proximal ends are collected and ligated to a linker fragment. This linker fragment harbours a 5′-GGGAC-3′ sequence, which is the recognition site of the Type IIS restriction endonuclease BsmFI that cleaves the cDNA 15 bp away in the 3′ direction from the recognition site. Treatment of the linker-ligated cDNA with BsmFI therefore releases a 15 bp fragment called ‘tag’ from a defined position of each cDNA. After removal of the linker fragment, tags are concatenated, cloned into a plasmid vector, and sequenced. Usually, 10 000–100 000 tags are analysed in total for a given sample. The numbers of each tag in the total sample (tag count) faithfully represent the abundance of the transcript corresponding to the tag. The next step is to identify the gene corresponding to the tag (tag annotation). The 15 bp tag sequence is used as query to search expressed sequence tags (ESTs) or cDNA databases of the organism of interest by blast (Altschul et al., 1990). Results of tag counts and tag annotation are finally combined into a gene expression profile. By comparing gene expression profiles of two samples that are differently treated, we can tell which gene is up- or downregulated in response to the particular treatment. SAGE has been widely applied to profile gene expression in yeast (Velculescu et al., 1997), humans (Polyak et al., 1997; Zhang et al., 1997; Polyak and Riggins, 2001; Patino et al., 2002), protozoa (Cummings and Wirth, 2001; Patankar et al., 2001), plants (Matsumura et al., 1999, 2003a) and pathogenic fungi (Steen et al., 2002; Thomas et al., 2002; Irie et al., 2003). 
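The tag extraction and counting steps described above can be sketched in silico. The following Python fragment is a minimal illustration, not the published protocol: the function names and the two toy cDNA sequences are hypothetical, and it assumes, purely for illustration, that the tag starts at the 3′-most NlaIII site (CATG) and that the site itself is counted within the tag length.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # recognition site of the anchoring enzyme NlaIII


def extract_tag(cdna, tag_len=15):
    """Return the tag at the 3'-most anchoring site, or None if no full tag fits.

    Assumption for illustration: the tag starts at the anchoring site and the
    site itself is counted within tag_len.
    """
    cdna = cdna.upper()
    pos = cdna.rfind(ANCHOR_SITE)  # 3'-most (poly-A proximal) site
    if pos == -1 or pos + tag_len > len(cdna):
        return None
    return cdna[pos:pos + tag_len]


def count_tags(cdnas, tag_len=15):
    """Count how often each tag occurs in a collection of cDNA molecules."""
    tags = (extract_tag(c, tag_len) for c in cdnas)
    return Counter(t for t in tags if t is not None)


# Toy cDNAs (made-up sequences): gene_a is present three times, gene_b once.
gene_a = "TTGACATGCCGTAGGCTAACGTTAGCAAGGCTAAAAAAAAAA"
gene_b = "ACATGTTTCATGGATCCGATTACAGGCTTAAAAAAAA"
tag_counts = count_tags([gene_a, gene_a, gene_a, gene_b])
```

In this toy sample the tag count of gene_a is three times that of gene_b, mirroring the relative abundance of the two transcripts; note that gene_b contains two CATG sites and only the 3′-most one defines its tag.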
In contrast to the analogue data generated by microarray analysis, which relies on the relative strength of the hybridization signal of each spot, SAGE data, i.e. transcript counts, are digital and suitable for a comparison of different data sets (see Sagemap at NCBI: http://www.ncbi.nlm.nih.gov/SAGE/) and for bioinformatics in general. Furthermore, in any microarray experiment, expression changes can be studied only for the genes spotted on the array, whereas SAGE can theoretically discover and address all transcripts, so that SAGE can be regarded as an ‘open architecture’ technique. Another advantage of SAGE over microarrays is that SAGE does not require special and costly equipment (such as DNA spotting machines and microarray readers). SAGE can be performed in any laboratory equipped with basic molecular biology facilities. This low equipment requirement of the SAGE procedure contrasts with MPSS (Brenner et al., 2000), another powerful but facility-demanding high-throughput expression profiling technique. On the other hand, the disadvantage of SAGE is its high demand on labour and time. For the analysis of two samples, starting from isolated RNA, 10–14 days are required for the full protocol. Therefore, SAGE is not suitable for studying many samples at a time.
Although SAGE is doubtless a useful technique for transcriptomics, it nevertheless has a shortcoming. The size of the SAGE tag, 15 bp, is frequently too short to identify the gene of origin unequivocally. This tag size does not cause much of a problem in organisms with lower gene numbers (e.g. yeast), but it does in organisms with more complex genomes. If entered as a query for a blast search, the same tag frequently matches two or more gene sequences, which confounds further analysis. If SAGE is applied to organisms for which no DNA database is available, it is necessary to recover a longer DNA sequence adjacent to the tag experimentally, and to annotate this longer fragment by blast search. Previously, several polymerase chain reaction (PCR) techniques were reported for recovering cDNA fragments from 15 bp SAGE tags (van den Berg et al., 1999; Chen et al., 2000). However, it always proved difficult to determine appropriate conditions for the specific amplification of cDNAs from each gene, because the SAGE tag primers were too short. Nor is the 15 bp DNA fragment suitable as an oligonucleotide probe for screening a cDNA library. To overcome this obvious disadvantage of conventional SAGE, Saha et al. (2002) replaced the tagging enzyme BsmFI with another Type IIS enzyme, MmeI, to isolate 20 bp tags. This modified version of SAGE, dubbed LongSAGE, was a step forward. However, digestion of a DNA fragment with MmeI generates a two-nucleotide recessed 5′ terminus, which is difficult to fill in. The SAGE principle requires blunting of the termini of the tags, and blunting in this case reduces the LongSAGE tag size to 18 bp. In practice, LongSAGE is performed without blunting the ends of the tags, which allows the isolation of 20 bp tags at the cost of faithful gene expression profiling. Therefore, the resulting 20 bp tag information is only used to help annotate the 15 bp tags obtained by the original SAGE procedure.
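A rough way to see why tag length matters is to count the number of possible sequences of a given length. The sketch below is a back-of-the-envelope calculation under strong simplifying assumptions that are ours, not the source's: a database of random sequence with equal base frequencies, a hypothetical database size of 10^10 bp, and no allowance for the constant anchoring-site bases within each tag. The function names are illustrative.

```python
def possible_tags(tag_len):
    """Number of distinct DNA sequences of length tag_len (4 bases per position)."""
    return 4 ** tag_len


def expected_chance_matches(tag_len, db_size_bp):
    """Approximate expected number of chance occurrences of one fixed tag
    in db_size_bp of random sequence with equal base frequencies."""
    return db_size_bp / possible_tags(tag_len)


DB_SIZE_BP = 10 ** 10  # hypothetical database size, for illustration only

for n in (15, 18, 20, 26):
    print(n, possible_tags(n), expected_chance_matches(n, DB_SIZE_BP))
```

Under these assumptions a 15 bp tag is expected to occur by chance several times in such a database, whereas chance occurrences of a 26 bp tag are expected to be very rare. Real genomes are not random sequence, so the figures indicate orders of magnitude only.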
To circumvent the problems of SAGE and LongSAGE, we have developed a method named ‘SuperSAGE’ (Matsumura et al., 2003b). In SuperSAGE, the Type III restriction endonuclease EcoP15I (Meisel et al., 1992; Mücke et al., 2001) is used as the tagging enzyme. Among the restriction enzymes, EcoP15I has the longest distance so far reported between the recognition and cleavage sites. The use of EcoP15I allowed us to isolate 26-bp tags from transcripts (Fig. 1). EcoP15I digestion generates a two-nucleotide recession at the 3′ terminus, which is easy to fill in. Therefore, with SuperSAGE, we can obtain faithful gene expression profiles based on the 26 bp tags. We applied SuperSAGE to profile rice gene expression. Rice is the first crop species for which the whole genomic sequence became available (Goff et al., 2002; Yu et al., 2002). To demonstrate that the 26 bp tag size of SuperSAGE allows a highly reliable tag-to-gene annotation, the following in silico experiment was performed. Fifty SuperSAGE tags (26 bp) were randomly selected from rice SuperSAGE data (Matsumura et al., 2003b). These DNA sequences were truncated from the 3′ ends so that the tag sizes were 20, 18 and 15 bp long respectively (Table 1). A tag size of 15 bp corresponds to that produced in the conventional SAGE protocol. The 18 or 20 bp tags are equivalent to LongSAGE tags, when the linker-tag fragments were ligated to each other with or without end-blunting respectively. The tags of different sizes were blasted against the entire body of DNA sequences deposited in GenBank representing DNA sequences from more than 130 000 species. The number of species containing DNA sequences with a perfect match to a tag of a given size were counted, and the average and maximum numbers of species were obtained across the 50 tag sequences. The conventional SAGE tag (15 bp) matched DNA sequences of 7.5 species on average, with a maximum of 23 species. All of the 50 tags corresponded to two or more species (Table 1). 
The 18 bp tags matched 2.4 species on average, with a maximum of 14 species. Twenty-six tags out of 50 corresponded to two or more species. The 20 bp tags matched 1.3 species on average, with a maximum of four species. Only seven tags out of 50 still corresponded to two or more species, indicating a great improvement over the original SAGE tag length (15 bp). However, note that LongSAGE with 20 bp tags does not necessarily produce accurate gene expression profiles (see above). The 26 bp tags of our SuperSAGE method matched 1.1 species on average, with a maximum of only two species. As few as three tags out of 50 corresponded to the DNA sequences of two species. These results clearly show that the information content of the 26 bp tag sequence greatly improves the efficiency of tag-to-gene annotation. In most cases, the 26 bp tags matched DNA sequences of only one species (which in fact was rice, Oryza sativa) and identified a single gene of that species. Thus, the annotation of the tag sequence can be carried out almost perfectly. Tag annotation in SuperSAGE can be performed against EST sequence databases as well as against whole genome sequences.
Table 1. Summary of blast search of 50 rice SAGE tags for the entire body of GenBank data.
Tag size (bp)    Average number of species matched    Maximum number of species matched    Number of tags matching two or more species
15               7.5                                  23                                   50
18               2.4                                  14                                   26
20               1.3                                  4                                    7
26               1.1                                  2                                    3
A species is counted as matched when it has a DNA sequence perfectly matching the tag.
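The logic of the truncation experiment summarized in Table 1 can be sketched on toy data. The fragment below is only an illustration: it replaces the blast search against GenBank with exact substring matching on one strand, and the database, species labels and sequences are made up. Tags are truncated from the 3′ end, as in the text.

```python
def species_matching(tag, database):
    """Return the set of species with at least one sequence containing the tag exactly."""
    return {sp for sp, seqs in database.items() if any(tag in s for s in seqs)}


def truncation_summary(tags, database, sizes=(26, 20, 18, 15)):
    """For each tag size, truncate every tag from its 3' end and summarize
    how many species show a perfect match."""
    summary = {}
    for size in sizes:
        hits = [len(species_matching(tag[:size], database)) for tag in tags]
        summary[size] = {
            "average": sum(hits) / len(hits),
            "maximum": max(hits),
            "tags_with_two_or_more": sum(h >= 2 for h in hits),
        }
    return summary


# Toy data: one made-up 26 bp tag and three made-up "species".
toy_tag = "CATGGCTAGCTTAGGCATCGATCCGA"
toy_database = {
    "Oryza sativa (toy)": ["TT" + toy_tag + "AAAA"],
    "species B (toy)": ["GG" + toy_tag[:18] + "TTTTTTTT"],  # shares the first 18 bp
    "species C (toy)": ["AC" + toy_tag[:15] + "GGGGGGGG"],  # shares the first 15 bp
}
summary = truncation_summary([toy_tag], toy_database)
```

In this toy example the full 26 bp tag and its 20 bp truncation match one species only, the 18 bp truncation matches two, and the 15 bp truncation matches all three, which reproduces the trend in Table 1 in miniature.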
Simultaneous monitoring of host–pathogen gene expression by SuperSAGE
The high information content of the 26 bp SuperSAGE tag allows the simultaneous gene expression analysis of two or more eukaryotic organisms. As an example, we applied SuperSAGE to study the gene expression profiles of both rice plants infected with blast disease and the causative fungus Magnaporthe grisea (Matsumura et al., 2003b). A draft of the whole genome sequence of M. grisea is also available (Martin et al., 2002). A total of 12 119 tags were isolated from blast-infected rice leaves, and each tag was annotated by blast search against the genome sequences of rice and M. grisea. As expected, the majority of the tags were annotated to rice genes, while 74 tags matched M. grisea sequences but not rice sequences. By increasing the total number of tags analysed (e.g. 10-fold) in a future study, we will be able to address the details of the rice–M. grisea interaction transcriptome.
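The assignment of tags to host or pathogen can be sketched as follows. This is a simplified illustration, not the analysis pipeline used in the cited study: exact substring matching on one strand stands in for the blast search, and the function names, the shortened toy tags and the reference sequences are hypothetical.

```python
from collections import Counter


def assign_origin(tag, host_seqs, pathogen_seqs):
    """Classify a tag by the reference set(s) in which it has a perfect match."""
    in_host = any(tag in s for s in host_seqs)
    in_pathogen = any(tag in s for s in pathogen_seqs)
    if in_host and in_pathogen:
        return "ambiguous"
    if in_host:
        return "host"
    if in_pathogen:
        return "pathogen"
    return "unassigned"


def split_profile(tag_counts, host_seqs, pathogen_seqs):
    """Split one mixed tag-count table into per-origin expression profiles."""
    profile = {k: Counter() for k in ("host", "pathogen", "ambiguous", "unassigned")}
    for tag, n in tag_counts.items():
        profile[assign_origin(tag, host_seqs, pathogen_seqs)][tag] += n
    return profile


# Toy data (made-up sequences; tags shortened for readability).
host_seqs = ["GGCATGAAACCCGGGTTTACGT"]
pathogen_seqs = ["TTCATGTTGGCCAAGGTTCCAA"]
mixed_counts = {"CATGAAACCCGG": 120, "CATGTTGGCCAA": 3, "CATGCCCCCCCC": 1}
profile = split_profile(mixed_counts, host_seqs, pathogen_seqs)
```

Keeping an explicit ‘ambiguous’ class is a design choice of this sketch: tags that match both organisms cannot be attributed to either transcriptome and are better set aside than counted for one partner.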
The first contact between pathogen and host occurs in a limited number of host cells. It is therefore important to obtain gene expression profiles of host and pathogen at this front line of the battle that ensues from any infection. Taking the example of rice blast disease, we will explain in some detail what could be done with SuperSAGE (Fig. 2). Conidial spores of M. grisea, an ascomycete fungus, germinate on rice leaves. At the end of the so-called germination tube, an oval structure called the appressorium develops. Extremely high turgor pressure inside the appressorium drives the emerging infection hypha through the plant's rigid cell wall, a prerequisite for the colonization of the host by the fungus. Two types of interaction between M. grisea and rice exist. In the compatible interaction, fungal mycelia keep developing inside host cells, and infection is established. In this case, the host is susceptible and the pathogen is defined as virulent. In the incompatible interaction, on the other hand, host cells perceive the invasion of the pathogen by some means, and rapid host cell death is induced around the attempted infection site to confine the pathogen. Here, the host is resistant and the pathogen is called avirulent. A careful and in-depth study of the changes in the transcriptional activity of host and pathogen genes in compatible and incompatible interactions is necessary to understand the molecular mechanisms underlying these interactions. It is now possible to isolate even a single cell by laser microdissection (LMD; Emmert-Buck et al., 1996; Nakazono et al., 2003). RNA extracted from LMD-isolated tissue can be amplified by five to six orders of magnitude with high fidelity using a T7 RNA polymerase-based amplified RNA (aRNA) protocol (Wang et al., 2000), and the amplified RNA can then be subjected to SAGE (Vilain et al., 2003).
Thus, starting from 5 ng of total RNA extracted from tissue collected by LMD, we expect 3 µg of aRNA suitable for SuperSAGE. SuperSAGE would allow separation of rice and M. grisea transcripts. By conducting SuperSAGE experiments for compatible and incompatible interactions, we would be able to compare gene expression profiles of susceptible and resistant rice cells as well as virulent and avirulent fungi. By doing so, we should be able to identify host and pathogen genes that are specifically up- and downregulated in compatible and incompatible interactions respectively. From among these identified genes, we can select those that we want to functionally analyse in detail. The very same strategy is applicable to any interactions between eukaryotic organisms and between eukaryotic host and viruses (e.g. permissive and non-permissive interactions).
A substantial part of mammalian infectious diseases is caused by bacteria. Although some bacterial mRNAs have poly(A) sequences at the 3′ end, these sequences are only short-lived and serve as a signal for mRNA degradation. Therefore, it is difficult to monitor bacterial gene expression with the SAGE protocol, in which the 3′ end of the mRNA is collected on the basis of its poly(A) tail. Technical improvement is needed for the simultaneous analysis of gene expression of eukaryotic and prokaryotic organisms.
Use of SuperSAGE for organisms without DNA database
In organisms for which no DNA database is yet available, the 26 bp SuperSAGE tag can immediately be used as a specific 3′-RACE (rapid amplification of cDNA ends) primer to recover a longer cDNA fragment rapidly. This cDNA fragment, in turn, can serve as a query for a blast search to identify the gene of origin. Using the SuperSAGE tag as the 3′-RACE primer, we successfully carried out gene expression profiling in Nicotiana benthamiana, a non-model organism without extensive DNA data sets, after treatment with the protein elicitor INF1 from Phytophthora infestans (Matsumura et al., 2003b).
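What the tag-primed 3′-RACE recovers can be illustrated in silico. The sketch below simply returns the part of a cDNA that lies between the tag and the 3′ end; it ignores the adapter sequence and all PCR conditions, and the function name and toy sequences are hypothetical.

```python
def race_fragment(cdna, tag):
    """Return the cDNA from the start of the tag to the 3' end,
    or None if the tag does not occur in the cDNA."""
    cdna = cdna.upper()
    start = cdna.find(tag.upper())
    return None if start == -1 else cdna[start:]


# Toy example with a made-up 26 bp tag embedded in a made-up cDNA.
toy_tag = "CATGGCTAGCTTAGGCATCGATCCGA"
toy_cdna = "GGGTTTACC" + toy_tag + "TTGACCGTAGGATCAAAAAAAAAA"
fragment = race_fragment(toy_cdna, toy_tag)
```

The recovered fragment begins with the tag itself and extends to the poly(A) end, so it is longer than the tag and therefore a more informative blast query.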
Possible application of SuperSAGE
As mentioned above, the major drawback of SAGE as compared with microarrays is that SAGE can be applied to only a few samples at a time. However, we sometimes wish to compare gene expression profiles of multiple samples. For such purposes, it is possible to design a microarray spotted with cDNA fragments or oligonucleotides that were selected on the basis of SuperSAGE information. In the case of a host–pathogen interaction, for instance, we perform SuperSAGE at a given time after the inoculation of the pathogen. By comparing the obtained profile with that of an appropriate control (e.g. mock-inoculated), we can identify SuperSAGE tags that are differentially represented in the two samples. cDNA fragments of the genes corresponding to this subset of tags could be PCR-amplified and spotted onto the array. Alternatively, 26 bp oligonucleotides corresponding to the tag sequences could be immobilized directly on the array. By using these arrays, we would be able to study the kinetics of gene expression during host–pathogen interaction(s). By filtering with SuperSAGE, we can reduce the number of genes subjected to array analysis from, for instance, ∼30 000 for human (Crollius et al., 2000) or 30 000–50 000 for rice (Goff et al., 2002) to ∼1000. This approach seems especially helpful in non-model organisms for which infrastructure such as DNA databases and cDNA clones is not available.
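The filtering step, i.e. picking tags that are differentially represented between an inoculated and a control sample, might look like the following sketch. The thresholds (a twofold change, at least five tags in total) and the pseudocount are arbitrary assumptions chosen for illustration, no statistical test is applied, and the function name and toy counts are hypothetical.

```python
def differential_tags(treated, control, min_fold=2.0, min_count=5, pseudocount=1):
    """Return {tag: ratio} for tags whose library-size-normalized abundance
    differs by at least min_fold in either direction between two samples."""
    total_t = sum(treated.values())
    total_c = sum(control.values())
    selected = {}
    for tag in set(treated) | set(control):
        t = treated.get(tag, 0)
        c = control.get(tag, 0)
        if t + c < min_count:
            continue  # too few tags to judge
        # The pseudocount avoids division by zero for tags absent from one sample.
        ratio = ((t + pseudocount) / total_t) / ((c + pseudocount) / total_c)
        if ratio >= min_fold or ratio <= 1 / min_fold:
            selected[tag] = ratio
    return selected


# Toy tag counts (made-up values; tag names stand in for 26 bp sequences).
inoculated = {"tag_A": 40, "tag_B": 10, "tag_C": 50, "tag_D": 1}
mock = {"tag_A": 10, "tag_B": 10, "tag_C": 80, "tag_D": 1}
candidates = differential_tags(inoculated, mock)
```

Only the tags returned by such a filter would then be represented on the array, which is how the number of spotted genes could be reduced to a manageable subset.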
Gene expression profiling gives only a hint as to which genes to select for further functional analysis. After target genes are selected, their function is usually tested by reverse genetics. At the moment, the reverse-genetics method with the highest throughput is RNAi, or gene knockdown. It is known that RNAi is mediated by 21–23 nucleotide long double-stranded (ds) RNA (short interfering RNA; siRNA), and administration of 20–28 mer dsRNA to cells can trigger RNAi in Caenorhabditis elegans (Kamath et al., 2003), Drosophila melanogaster (Lum et al., 2003) and human (Paddison et al., 2002). In this regard, the size of a SuperSAGE tag (26 bp) is relevant. It is longer than the 21–23 bp of an siRNA and thus long enough to cause specific RNAi, so that SuperSAGE tag sequences could be used directly for the synthesis of dsRNA to be delivered to the cells to knock down the gene corresponding to the tag (Fig. 3). SuperSAGE analysis of host–pathogen interaction(s) would identify 26 bp tags that are of interest to researchers because of their expression information. Oligonucleotides corresponding to the tags are synthesized and cloned into an appropriate vector (knockdown vector), so that short-hairpin (sh) RNA can be expressed in the cells from a strong promoter such as the U6 snRNA promoter. After transformation of the host or the eukaryotic pathogen with the knockdown vector, disease resistance (host) or virulence (pathogen) could be tested within a short time. We have coined the term ‘SuperSAGE-RNAi’ for this direct use of SuperSAGE tag sequences for RNAi. Furthermore, the use of SuperSAGE tags for RNAi would be most useful in the ‘large-scale RNA-interference-based screen’ (Berns et al., 2004; Paddison et al., 2004). In this screening strategy, a library harbouring a total of 5000–10 000 different shRNAs is generated in a mammalian expression vector, and this library is transfected into mammalian cells en masse. After selection with a selectable marker, only the cells carrying the vector can survive.
Each surviving cell should express at least one of the 5000–10 000 shRNAs, whereby the gene corresponding to the particular shRNA is knocked down. After that, an appropriate regime is applied under which wild-type cells can no longer survive. Cells surviving such a screen are those expressing particular shRNAs derived from genes that are necessary for cell death in the screening scheme. Recovery of the vector from surviving cells followed by its sequencing reveals the shRNA sequence. From this, we can infer which gene is necessary for cell death in the screen. SuperSAGE would allow an effective selection of the shRNA population for library construction. For instance, we treat cells with a particular screening regime and perform SuperSAGE to identify the tags that are over- or under-represented in the cells under the treatment. Oligonucleotides corresponding to such SuperSAGE tags are synthesized and cloned into the vector to establish the library for the ‘RNA-interference-based screen’.
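How an oligonucleotide encoding a short hairpin might be derived from a tag can be sketched as follows. This is an illustration only, not a design described in the source: the loop sequence TTCAAGAGA and the run of five T residues as a Pol III terminator are commonly used choices assumed here, cloning overhangs are omitted, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def hairpin_insert(tag, loop="TTCAAGAGA", terminator="TTTTT"):
    """Build a hairpin-encoding insert: sense tag + loop + antisense tag + terminator.

    The loop and terminator defaults are illustrative assumptions.
    """
    tag = tag.upper()
    if not tag or set(tag) - set("ACGT"):
        raise ValueError("tag must be a non-empty string of A, C, G, T")
    return tag + loop + reverse_complement(tag) + terminator


toy_tag = "CATGGCTAGCTTAGGCATCGATCCGA"  # made-up 26 bp tag
insert = hairpin_insert(toy_tag)
```

For a library, the same function would simply be applied to every selected tag; whether a given tag-derived hairpin silences its target efficiently would still have to be tested experimentally.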
SuperSAGE is a modification of the conventional SAGE procedure, whereby the tag size of 15 bp of the latter is increased to 26 bp owing to the Type III endonuclease EcoP15I as the tagging enzyme. Yet, this increase of tag size by 11 bp has a tremendous impact on the utility and versatility of the SAGE technology, most notably for the analysis of host–pathogen (or host–parasite) interactions. The information content of the 26 bp SuperSAGE tag is sufficient to allow simultaneous gene expression analysis of a host and its pathogen. Also, a SuperSAGE tag is long enough to cause RNAi in eukaryotic cells. We predict that this technology will be an important bridge between transcriptome and gene functional analysis of host–pathogen interactions.
R.T. was in part supported by the ‘Program for Promotion of Basic Research Activities for Innovative Biosciences’ (Japan) and by the ‘Research for the Future Program of the Japan Society for the Promotion of Science’. Research of P.W. and G.K. was partly financed through the ‘Bundesministerium für Wirtschaftliche Zusammenarbeit’ (BMZ, Germany) under research contract 2001.7860.8-001.00. Work of M.R. and D.H.K. is supported by the Deutsche Forschungsgemeinschaft (Grant KR 1293/4-1).