Keywords:

  • Reverse serial analysis of gene expression;
  • Human embryonic stem cells;
  • Transcriptome;
  • Antisense transcription;
  • POU5F1;
  • SOX2;
  • NANOG

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Materials and Methods
  5. Results
  6. Discussion
  7. Disclosures
  8. Acknowledgements
  9. References

Serial analysis of gene expression (SAGE) is a powerful technique for the analysis of gene expression. A significant portion of SAGE tags, designated as orphan tags, however, cannot be reliably assigned to known transcripts. We used an improved reverse SAGE (rSAGE) strategy to convert human embryonic stem cell (hESC)-specific orphan SAGE tags into longer 3′ cDNAs. We show that the systematic analysis of these 3′ cDNAs permitted the discovery of hESC-specific novel transcripts and cis-natural antisense transcripts (cis-NATs) and improved the assignment of SAGE tags that resulted from splice variants, insertion/deletion, and single-nucleotide polymorphisms. More importantly, this is the first description of cis-NATs for several key pluripotency markers in hESCs and mouse embryonic stem cells, suggesting that the formation of short interfering RNA could be an important regulatory mechanism. A systematic large-scale analysis of the remaining orphan SAGE tags in the hESC SAGE libraries by rSAGE or other 3′ cDNA extension strategies should unravel additional novel transcripts and cis-NATs that are specifically expressed in hESCs. Besides contributing to the complete catalog of human transcripts, many of them should prove to be a valuable resource for the elucidation of the molecular pathways involved in the self-renewal and lineage commitment of hESCs.


Introduction

Pluripotent human embryonic stem cell (hESC) lines are derived by isolating the inner cell mass from supernumerary 5-day-old blastocysts and serially propagating it for extended periods on fibroblast feeder layers [1–3]. They promise to revolutionize regenerative medicine by providing novel cell replacement therapies for a variety of debilitating diseases, such as myocardial infarcts, diabetes, and Parkinson's disease [4, 5]. The molecular mechanisms controlling pluripotency and self-renewal in hESCs are presently not well understood [6]. Transcriptome profiling studies using DNA microarrays [7–11], serial analysis of gene expression (SAGE) [12], expressed sequence tag (EST) enumeration [13], and massively parallel signature sequencing (MPSS) [14, 15] have elucidated gene networks and putative signaling pathways that are believed to be essential in the maintenance of the hESC phenotype. Recent studies have implicated the WNT and transforming growth factor-β/activin/nodal pathways in the maintenance of pluripotency in hESCs [16, 17]. Transcriptome studies have shown that key components of these two pathways are active or highly expressed in hESCs. In addition, SAGE and other gene expression profiling studies have suggested that stem cells, in particular hESCs, express numerous uncharacterized or novel transcripts, many of which are likely to represent novel genes [12–15, 18, 19].

SAGE is a sequence-based transcriptome profiling approach that provides qualitative and quantitative assessment of gene expression [20]. The underlying principle assumes that a short nucleotide sequence, or SAGE tag, located at the last (C-most) anchoring enzyme site contains sufficient information to represent a specific transcript. Often the NlaIII restriction enzyme is used, and the length of the SAGE tag can range from 14 (SAGE) to 21 (LongSAGE) or 26 base pairs (bp) (SuperSAGE), depending on the tagging enzymes used [20–22]. The digital nature of SAGE tags means that cumulative SAGE data can easily be merged, allowing large-scale comparisons between independent libraries. The sequencing of concatemerized SAGE tags also permits a high-throughput determination of the transcriptome compared with EST sequencing. Besides being a robust method that accurately reflects the relative levels of mRNA transcripts, SAGE also allows transcripts that are expressed at low levels to be efficiently detected [23, 24]. However, the reliance on short sequence tags imposes limitations on the precision and accuracy of gene identification. For instance, a SAGE tag may match multiple mRNA transcripts, making gene assignment difficult, although with the advent of LongSAGE and SuperSAGE, this problem has been largely solved. A more daunting problem is that many SAGE tags do not appear to match known mRNA transcripts or genes. In poorly characterized transcriptomes, such as those from hESCs [12] and hematopoietic stem cells [18, 19], such orphan SAGE tags can account for as much as 40% of the tags. A recent study has shown that approximately 70% of orphan SAGE tags are indeed derived from bona fide transcripts [24], reinforcing the view that SAGE is a powerful method for novel gene discovery. This suggests that a large number of the orphan SAGE tags that we have uncovered in the hESC transcriptome are true representatives of novel genes, transcripts, or splice variants [12], although the total number of genes present in the human genome is estimated at a conservative 30,000–40,000 [25, 26].
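The tag principle described above can be sketched in a few lines of Python. This is an illustration only, not software used in the study: the function name and the toy transcript are hypothetical, and the sketch assumes that the 14-bp and 21-bp tag lengths quoted above include the 4-bp NlaIII site (CATG).

```python
from typing import Optional

ANCHOR = "CATG"  # NlaIII recognition sequence (the anchoring enzyme site)


def c_most_tag(transcript: str, tag_bases: int = 10) -> Optional[str]:
    """Return the anchor site plus up to `tag_bases` bases downstream of the
    3'-most anchor site, or None if the transcript has no anchor site."""
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)  # last (C-most) occurrence of CATG
    if pos == -1:
        return None
    return seq[pos:pos + len(ANCHOR) + tag_bases]


# Toy transcript (hypothetical) with two NlaIII sites; only the last one counts.
toy = "GGA" + "CATG" + "TTTTCCCCGGGGAAAA" + "CATG" + "TCATTACGAT" + "TTAAAAAAAA"
short_tag = c_most_tag(toy)               # CATG + 10 bases
long_tag = c_most_tag(toy, tag_bases=17)  # CATG + 17 bases (LongSAGE-like)
```

Under these assumptions a transcript without any CATG site yields no tag at all, which is one reason a different anchoring enzyme is sometimes needed.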

Another major source of uncertainty in SAGE tag-to-transcript assignment lies in the widespread presence of single-nucleotide polymorphisms (SNPs) within the human genome; SNPs occur as frequently as once every 100–300 bases [27, 28]. Occurrence of SNPs within the SAGE tag sequence or within the tagging restriction enzyme site will result in the assignment of an alternative SAGE tag. In a recent large-scale study of the SAGE database, at least one SNP-associated alternative SAGE tag was observed for 8.6% of all known human genes when the influence of SNPs and small insertion/deletion polymorphisms on SAGE tags was taken into consideration [29]. Indeed, the presence of this class of alternative SAGE tags has led to an underestimation of the expression of certain genes (e.g., GAL) and erroneously identified others (e.g., BTF3) as being specific to hESCs [12].

Naturally occurring antisense transcripts (NATs) have been recently reported in a variety of metazoan species [30, 31], and it is likely that a significant portion of the hESC orphan SAGE tags are derived from NATs. There are two main classes of NATs. The cis-encoded NAT (cis-NAT) is transcribed from the opposite strand of the same genomic locus and has the potential to form a long complementary duplex with the sense RNA transcript. In contrast, the trans-encoded NAT (trans-NAT) is transcribed from another distinct genomic locus, possibly a pseudogene [31], and is generally short and forms an imperfect duplex with its sense transcript. The human genome has been shown to express NATs widely [32–34], with as many as 20% of human genes forming sense-antisense (SA) transcript pairs [35]. For instance, hESCs have been reported to express a unique set of microRNAs, which belong to a class of trans-NATs [36]. A recent large-scale EST project has provided an important resource of full-length cDNAs for hESCs [13]. But like the >5 million ESTs that are available [37], they are difficult to use to verify the expression of NATs because many ESTs have not been directionally cloned [31, 32]. In contrast, SAGE tags are directionally reliable, as they are generated from well-defined restriction sites at the 3′ end of each RNA transcript. Thus, large SAGE datasets contain latent information on both sense and antisense transcription [38]. Interestingly, tags matching mRNAs or ESTs in antisense orientation were first observed in SAGE libraries constructed from Plasmodium falciparum [39, 40].
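Because SAGE tags are strand-specific, the orientation of a tag relative to a known transcript can be checked by comparing the tag with the transcript and with its reverse complement. The following is a minimal sketch of that idea; the function names and the toy sequence are hypothetical, and real tag-to-gene mapping would work against a transcript database rather than a single string.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tag_orientation(tag: str, sense_transcript: str) -> str:
    """Classify a tag as 'sense', 'antisense', 'both', or 'none' relative to
    a transcript given in its sense (mRNA-like) orientation."""
    tag = tag.upper()
    sense_seq = sense_transcript.upper()
    in_sense = tag in sense_seq
    # A tag from an antisense transcript is a substring of the reverse
    # complement of the sense transcript.
    in_antisense = tag in revcomp(sense_seq)
    if in_sense and in_antisense:
        return "both"
    if in_sense:
        return "sense"
    if in_antisense:
        return "antisense"
    return "none"


toy_transcript = "AAACATGTCATTACGATCCC"  # hypothetical sense transcript
sense_call = tag_orientation("CATGTCATTACGAT", toy_transcript)
antisense_call = tag_orientation(revcomp("CATGTCATTACGAT"), toy_transcript)
```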

Without additional sequence information, it is difficult to characterize orphan SAGE tags from hESCs and identify the transcripts they represent. Several polymerase chain reaction (PCR)-based strategies have been developed, including reverse SAGE (rSAGE) [41, 42], generation of longer cDNA fragments from SAGE tags for gene identification (GLGI) [43, 44], and rapid analysis of unknown SAGE-tag-PCR [45]. In this report, we have modified the original rSAGE protocol [41, 42], which is also similar to the GLGI [43, 44], and used it to obtain additional 3′ cDNA sequence information for a select group of orphan SAGE tags that are expressed specifically in hESCs. Our results identified novel transcripts unique in their expression to hESCs, transcripts that displayed alternative polyadenylation, and novel splice variants of known genes. More importantly, we found NATs for several pluripotency genes, including POU5F1 and NANOG. Collectively, the unique 3′ ESTs derived from orphan hESC SAGE tags (HESTs) will be an important resource in downstream functional analyses and the concerted dissection of molecular pathways critical to the pluripotent phenotype of hESCs.

Materials and Methods

Culture of hESCs

hESCs (HES3 line, passages 19–25; ES Cell International, Singapore, http://www.escellinternational.com) were cultured on a feeder layer of mitomycin-C inactivated mouse embryonic fibroblasts (MEFs) as described previously [2]. HES3 cell colonies were passaged by mechanically cutting small clumps of undifferentiated HES3 (UD-HES3) cells and transferring these fragments to fresh MEF feeders at 7–8-day intervals [2, 46]. Differentiated HES3 (D-HES3) cells were obtained by prolonged (20-day) high-density culture on MEFs [12].

Total RNA Isolation

Total RNA was extracted from hESCs using TRIZOL (Invitrogen, Carlsbad, CA, http://www.invitrogen.com), whereas total RNA from the various somatic and fetal tissues was obtained commercially (Clontech, Palo Alto, CA, http://www.clontech.com). Prior to rSAGE library construction or reverse transcription (RT)-PCR, total RNA was treated with DNase I (Ambion, Austin, TX, http://www.ambion.com) to remove any residual genomic DNA contamination. PCR using β-actin primers (forward, 5′-GATGCAGAAGGAGATCACTGC-3′; reverse, 5′-CACCTTCACCGTTCCAGTTT-3′), designed to span the last intron-exon boundary of the gene, was then carried out to confirm the absence of genomic DNA.

cDNA Synthesis, NlaIII Digestion, and Linker Ligation

A schematic for the rSAGE library construction with all primer and linker sequences is depicted in Figure 1. cDNA synthesis was carried out using the Superscript II double-stranded cDNA synthesis kit (Invitrogen) with 10 μg of total RNA from HES3 cells and a biotinylated primer (5′-biotin-ATTGGCGCGCCGCGAGCACTGAGTCAATACGAT30VN-3′; Integrated DNA Technologies, Coralville, IA, http://www.idtdna.com). Double-stranded cDNA was digested with NlaIII (New England Biolabs, Ipswich, MA, http://www.neb.com) to generate 3′ overhangs. The biotinylated cDNAs were immobilized on streptavidin-magnetic beads (Invitrogen). Annealed linkers, A1 (5′-AAGCAGTGGTATCAACGCAGAGTCATG-3′) and A2 (5′-phosphate-ACTCTGCGTTGATACCACGCTT-aminoC7-3′), were ligated to the 5′ end of NlaIII-digested cDNA before AscI (New England Biolabs) digestion was performed to release the 3′ cDNA fragments from the streptavidin-magnetic beads.

PCR Scale-Up of rSAGE Library

Amplification of the primary rSAGE library was performed with 1 μl of the NlaIII-digested cDNAs, 5 U of Platinum Taq Polymerase (Invitrogen), and rSAGEF1 (5′-AAGCAGTGGTATCAACGCAGAGT-3′) and rSAGER1 (5′-GCGAGCACTGAGTCAATACGC-3′) primers (350 ng each). After an initial denaturation at 94°C for 2 minutes, PCR was carried out for 25 cycles at 94°C for 45 seconds, 57°C for 1 minute, and 72°C for 1 minute, with a final extension at 72°C for 5 minutes.

Selection of Orphan SAGE Tags and Design of Tag-Specific rSAGE Primers

The 200 orphan SAGE tags selected for rSAGE were identified through a pairwise comparison of HES3 SAGE data against pooled data from 21 human SAGE libraries [12]. The SAGE tag-to-gene database used for gene identification was based on UniGene Build 160 (http://www.ncbi.nih.gov/SAGE/). The majority of the orphan SAGE tags selected were upregulated in HES3 compared with the pooled human SAGE libraries (p < .001; fold difference >4). A table describing the SAGE tags, the sequences of the SAGE tag-specific rSAGE primers (TSRPs), and their respective frequencies in tags per million (tpm) in the pooled human, HES3, and HES4 SAGE libraries is provided as supplemental online Table 1. For those HES SAGE tags for which LongSAGE tags were available through comparison with a HES3 LongSAGE library, the TSRPs were designed using the Primer3 software (http://frodo.wi.mit.edu) [46]. Typically, they included either the entire 21 bases of the LongSAGE tag or an additional four to eight bases of the common linker (CGCAGAGT) and up to 19 bp of the LongSAGE tag. If no appropriate LongSAGE tag was available (Tag IDs 1–77), the TSRPs were designed with seven bases of the common linker sequence (GCAGAGT) and the entire 14 bases of the SAGE tag, with the exception of Tag IDs 30 and 72.
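The primer layout for tags without a LongSAGE counterpart can be illustrated as follows. This is a sketch under stated assumptions, not the authors' design script: it assumes that the "14 bases of the SAGE tag" are the NlaIII site (CATG) followed by the 10-base tag, which is consistent with linker A1 ending in ...GCAGAGTCATG, and the function names are hypothetical. The GC-fraction helper is included only because primer composition is a practical concern for short tag-derived primers.

```python
LINKER_TAIL = "GCAGAGT"  # seven linker bases quoted in the text
ANCHOR = "CATG"          # NlaIII site, assumed to be part of the 14-base tag


def tsrp_from_sage_tag(tag10: str) -> str:
    """Build a 21-nt tag-specific primer: linker tail + CATG + 10-base tag."""
    tag10 = tag10.upper()
    if len(tag10) != 10 or set(tag10) - set("ACGT"):
        raise ValueError("expected a 10-base tag containing only A, C, G, T")
    return LINKER_TAIL + ANCHOR + tag10


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Example with the NANOG cis-NAT tag reported in the Results (TCATTACGAT).
primer = tsrp_from_sage_tag("TCATTACGAT")
primer_gc = gc_fraction(primer)
```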

rSAGE Amplification Reaction and Characterization of 3′ rSAGE Fragments

Touchdown PCRs were performed using an initial denaturation cycle at 94°C for 2 minutes, followed by four cycles at 94°C for 45 seconds, 63°C for 1 minute, and 72°C for 1 minute; four cycles at 94°C for 45 seconds, 60°C for 1 minute, and 72°C for 1 minute; 25 cycles at 94°C for 45 seconds, 58°C for 1 minute, and 72°C for 1 minute; and a final extension step at 72°C for 5 minutes. The reaction setup for rSAGE PCR was as follows: 1 μl of amplified rSAGE library, 1 U of Platinum Taq Polymerase, and 350 ng each of TSRP and rSAGER1 primer. The PCR products were run on a 1.2% TAE agarose gel, and the bands were excised and purified using the QIAquick Gel Extraction Kit (Qiagen, Valencia, CA, http://www.qiagen.com). Purified PCR products (2–4 μl) were ligated into the pGEM-T Easy Vector (0.5 μl) (Promega, Madison, WI, http://www.promega.com) using T4 DNA ligase. The ligation reaction was incubated overnight at 16°C and resuspended in 8 μl of sterile water. Electroporation was performed using 1 μl of the ligated products and 25 μl of pTOP10 cells (Invitrogen). The transformants were plated on selective media, and two to four clones were picked for each rSAGE PCR product. Plasmid DNA was extracted using the QIAprep Spin Miniprep Kit (Qiagen). Sequencing reactions were carried out with Big Dye v3.1 (Applied BioSystems, Foster City, CA, http://www.appliedbiosystems.com) and the M13 Forward primer. The sequenced products were analyzed on an ABI 3100 DNA Sequencer (Applied BioSystems).
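For readers who program thermocyclers or keep protocols in machine-readable form, the touchdown program above can be written down as data. The sketch below is an illustration only; the class and function names are hypothetical, and the time total counts programmed hold times without ramping.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    cycles: int
    steps: Tuple[Tuple[int, int], ...]  # (temperature in deg C, seconds)


# Touchdown program as described in the text.
TOUCHDOWN = (
    Stage(1, ((94, 120),)),                      # initial denaturation
    Stage(4, ((94, 45), (63, 60), (72, 60))),    # annealing at 63 deg C
    Stage(4, ((94, 45), (60, 60), (72, 60))),    # annealing at 60 deg C
    Stage(25, ((94, 45), (58, 60), (72, 60))),   # annealing at 58 deg C
    Stage(1, ((72, 300),)),                      # final extension
)


def amplification_cycles(program) -> int:
    """Count cycles in multi-step (denature/anneal/extend) stages only."""
    return sum(s.cycles for s in program if len(s.steps) > 1)


def hold_time_seconds(program) -> int:
    """Sum of programmed hold times, ignoring ramp times."""
    return sum(s.cycles * sum(sec for _, sec in s.steps) for s in program)


total_cycles = amplification_cycles(TOUCHDOWN)
total_seconds = hold_time_seconds(TOUCHDOWN)
```

Under these assumptions the program comprises 33 amplification cycles and about 98 minutes of hold time.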

Sequence Analysis and Identification of Genuine rSAGE PCR Products

A bona fide 3′ rSAGE product was defined as possessing the entire SAGE tag sequence, the rSAGER1 primer sequence and a poly(A) tract of >10 adenine residues. Sequences that lacked any one of the three were considered nonspecific amplification artifacts and omitted from further analysis. The rSAGE 3′ EST sequences were searched against the GenBank Database (NR, dbEST, and human genome) using BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/), the University of California Santa Cruz human genome browser database (May 2004 build) using the BLAT program (http://genome.ucsc.edu/cgi-bin/hgBlat) and the EMBL database using a web interface-based batch BLAST program (http://biomedicum.csc.fi:8010/cgi-bin/batchblast.cgi) [20].
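The three-part definition of a bona fide product lends itself to a simple sequence filter. The following is a minimal sketch, not the authors' pipeline: it assumes the read is in the sense orientation, accepts the rSAGER1 primer in either orientation, and uses the rSAGER1 sequence from the Methods with the apparent line-break hyphen removed. The function names and the toy read are hypothetical.

```python
import re

RSAGER1 = "GCGAGCACTGAGTCAATACGC"  # from the Methods, hyphen removed
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_bona_fide(read: str, tag: str, primer: str = RSAGER1,
                 min_a_run: int = 11) -> bool:
    """True if the read has the SAGE tag, the universal primer (either
    orientation), and a poly(A) run of more than 10 residues."""
    read = read.upper()
    has_tag = tag.upper() in read
    has_primer = primer in read or revcomp(primer) in read
    has_polya = re.search("A{%d,}" % min_a_run, read) is not None
    return has_tag and has_primer and has_polya


def toy_read(n_adenines: int) -> str:
    """Hypothetical read: tag, short 3' sequence, poly(A), primer site."""
    return ("CATGTCATTACGAT" + "GGCTTAATAAAGTC" + "A" * n_adenines
            + revcomp(RSAGER1))
```

Reads sequenced from the opposite strand would show a poly(T) run instead and would need to be reverse-complemented first.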

An rSAGE sequence was classified as novel if no matches to a transcript sequence (known gene, mRNA, or EST) were found. A sequence was considered to represent a known gene if it matched a full-length transcript sequence with >95% similarity in the same orientation. A sequence was classified as a known EST if it matched an EST or open reading frame (ORF) with >95% similarity in the same orientation. A sequence was classified as an SNP alternative tag if it contained a single-bp mismatch within the SAGE tag sequence or NlaIII site. A sequence was classified as an insertion/deletion if it contained an insertion or deletion of fewer than three nucleotides within the SAGE tag sequence. A sequence was classified as an antisense transcript if it matched known transcripts with high similarity in the opposite orientation. A sequence was classified as poly(A) if the SAGE tag lay so close to the poly(A) tract that extension yielded essentially only poly(A) sequence. Finally, a sequence was considered an alternative isoform if it matched the middle of known full-length transcripts in the same orientation and contained a poly(A) tract immediately downstream of the matched region. Genomic coordinates of the 3′ SAGE ESTs were annotated based on the University of California Santa Cruz genome browser annotation database (http://genome.ucsc.edu/).
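The classification rules above amount to a small decision procedure, sketched below. This is an illustration under explicit assumptions: the order in which the rules are tested is not stated in the text and is chosen here for convenience, the >95% threshold is applied to antisense matches although the text says only "high similarity", and the `Evidence` fields are hypothetical summaries of BLAST/BLAT results rather than real program output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    best_hit: Optional[str] = None    # "full_length", "est_or_orf", or None
    identity: float = 0.0             # percent similarity of the best hit
    same_orientation: bool = True
    tag_single_mismatch: bool = False  # 1-bp mismatch in tag or NlaIII site
    tag_small_indel: bool = False      # indel of fewer than 3 nt in the tag
    internal_with_polya: bool = False  # mid-transcript match followed by poly(A)
    only_polya: bool = False           # extension gave only poly(A) sequence


def classify(e: Evidence) -> str:
    """Apply the rules in an assumed order; returns a category label."""
    if e.only_polya:
        return "poly(A)"
    if e.best_hit is None:
        return "novel"
    if not e.same_orientation:
        return "antisense" if e.identity > 95 else "unclassified"
    if e.tag_single_mismatch:
        return "SNP alternative tag"
    if e.tag_small_indel:
        return "insertion/deletion"
    if e.internal_with_polya:
        return "alternative isoform"
    if e.identity > 95:
        return "known gene" if e.best_hit == "full_length" else "known EST"
    return "unclassified"
```

For example, `classify(Evidence(best_hit="full_length", identity=99.0, same_orientation=False))` returns `"antisense"` under these assumptions.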

RT-PCR Confirmation of Novel 3′ cDNAs

First-strand synthesis was performed using the SuperScript first-strand synthesis system (Invitrogen). One μl of first-strand reaction was used for each PCR together with 50 pmol of forward and reverse primers. Initial denaturation was carried out at 94°C for 2 minutes, followed by 30 cycles of PCR (94°C for 30 seconds, 55°C for 30 seconds, 72°C for 1 minute), and a final extension cycle at 72°C for 5 minutes. PCR products were loaded on a 1.5% agarose gel and size fractionated. In instances where the 3′ cDNA sequence obtained was short and no suitable primer pairs could be found, additional 5′ genomic sequences were used to anchor the forward primers. In all cases, the reverse primer primed from the rSAGE 3′ cDNA sequence. Primers used were as follows. ACTB: product 400 bp, 5′-TGGCACCACACCTTTCTACAATGAGC-3′, 5′-GCACAGCTTCTCCTTAATGTCACGC-3′; POU5F1: product 247 bp, 5′-CGRGAAGCTGGAGAAGGAGAAGCTG-3′, 5′-CAAGGGCCGCAGCTTACACATGTTC-3′; HEST97: product 160 bp, 5′-CCTTTGTCATGAGCCCTTGT-3′, 5′-GGAATGAAAGAATGGTTGCTC-3′; HEST101: product 119 bp, 5′-AAGAGCCTGCTACGGAACTG-3′, 5′-TCACTAGAGGTTTCCAACACACTT-3′; HEST120: product 159 bp, 5′-AAATTTGGTGCTGTGACTCG-3′, 5′-GCGGGCTGAGTCGGATTT-3′; HEST123: product 200 bp, 5′-GGGTTATGTGTAGAAACCAAGTGA-3′, 5′-TCTTAGAACTTATGATACACCCAGTTG-3′; HEST127: product 218 bp, 5′-GGGAAAAGATGGCAAGGTTA-3′, 5′-AATATATTCGAGTCACATCATGACA-3′; HEST146: product 171 bp, 5′-GATGCCATCACTCAAACTAGACC-3′, 5′-GACGTCCTATGCAGGCATTT-3′; HEST147: product 205 bp, 5′-GGGGATTCGAGGTTCCTGTA-3′, 5′-CATTTCAAGGCACAATTTTAATAGC-3′; HEST149: product 196 bp, 5′-CCCAGGCTGAAGTGTAGTGA-3′, 5′-CATTTACAATGGTACAAGGAGCA-3′. The universal reference RNA sample was obtained from Stratagene (La Jolla, CA, http://www.stratagene.com), and somatic tissue RNA samples were obtained from Clontech.

Orientation-Specific RT-PCR

To detect the NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, orientation-specific first-strand cDNA synthesis was carried out with the appropriate sense primers. Thereafter, Superscript II RT was heat-inactivated at 95°C for 15 minutes. PCR was performed with 3 μl of the 20-μl first-strand mix as described. Control experiments without reverse transcription (−RT controls) were performed for each of the antisense primers to detect genomic DNA contamination. The primers used were as follows. POU5F1 NAT: product 184 bp, 5′-AGTTTGTGCCAGGGTTTTTG-3′, 5′-TGTGTCCCAGGCTTCTTTATTT-3′; NANOG NAT: product 278 bp, 5′-TCGGTATTGTTTGGGATTGG-3′, 5′-TCATCGAAACACTCGGTGAA-3′; LIN28 NAT: product 178 bp, 5′-GGAGGCCAAGAAAGGGAATA-3′, 5′-CCGCCCCATAAATTCAAGAT-3′; TALE NAT: product 80 bp, 5′-TTTTCAGACTGTGCAATACTTAGAGAA-3′, 5′-TTAGACAGTATGTGGGCATCC-3′; TERF1 NAT: product 169 bp, 5′-TGCGGAGTAGATGAGATGGA-3′, 5′-AAGGCAATGGAAAACAGGTAAA-3′; TERA NAT: product 131 bp, 5′-TTTTGGCTGCAGTATTGGTG-3′, 5′-CATCCTACAGGCAAAGAGAGG-3′.

Results

rSAGE Amplification, Specificity, Efficiency, and Size Distribution of 3′ cDNAs

The original rSAGE (Kinzler/Vogelstein laboratories) [41, 42], the GLGI-SAGE protocol [43, 44], and our modified rSAGE strategy share several key features (Fig. 1). However, we have made several modifications to increase the efficiency of 3′ cDNA conversion. For instance, changes in the design of the universal primers allowed the rSAGE library scale-up and the subsequent TSRP PCR amplification to be carried out at an increased melting temperature (Tm). The introduction of a longer poly(T) tract (T30) and the inclusion of VN dinucleotides in the first-strand RT primer allowed better capture and synthesis of full-length mRNAs at their 3′ ends, compared with the shorter poly(T) tract (T10) used in the GLGI strategy, which might allow the primer to bind to internal poly(A) stretches within mRNA transcripts. Finally, increasing the Mg2+ concentration when no distinct rSAGE band was observed in the first round of PCR could occasionally enhance the specificity of the rSAGE amplification reaction.

Of the 200 HES3 orphan SAGE tags that were selected for rSAGE conversion (supplemental online Table 1), 168 (84.0%) yielded PCR amplification products (Fig. 2A). The conversion rate of orphan LongSAGE tags into longer 3′ cDNA fragments was much higher (93.4%) than that of the SAGE tags (69.2%). We attributed these improvements to the availability of additional sequences from the LongSAGE tags for the design of TSRPs, as well as better-designed universal primers (rSAGEF1 and rSAGER1) in our strategy (Fig. 1). In particular, we found the universal M13 primer used as the antisense primer in the original rSAGE strategy [41, 42] was unsatisfactory for rSAGE because of its low Tm.

A representative agarose gel showing the rSAGE products is shown in Figure 2B. It is noteworthy that the majority (∼90%) of the TSRPs yielded only a single distinct rSAGE band. Our results also support the notion that there is no strict correlation between the efficiency of target template amplification and the abundance of the SAGE tag [29], unlike earlier reports on GLGI-SAGE [44, 45]. Other variables, such as SAGE tag length and primer sequence, may be equally important parameters influencing the efficiency of target amplification. As shown in Figure 2B, the rSAGE amplification generally generated intense bands that were easily gel-purified, although amplification of SAGE tags with a lower copy number (<20 tpm) yielded less PCR product and in some cases (Tag IDs 156 and 169) gave one or more faint bands that were difficult to gel purify; these bands were not analyzed. When two or more distinct rSAGE bands were obtained (Tag IDs 126, 141, and 148), they usually turned out to be discrete 3′ cDNA fragments. In most GLGI reports, conversion to 3′ cDNAs is usually attempted for SAGE tags with a high copy number [18, 19]. In contrast, a large proportion (68%) of the orphan SAGE tags we attempted to convert to 3′ cDNAs were present at lower frequencies (≤50 tpm). We also managed to obtain genuine rSAGE products for SAGE tags with frequencies as low as 5 tpm, which is equivalent to the detection of a singleton in the HES3 SAGE library (HESTs 79, 147, and 174; supplemental online Table 1). In conclusion, our modified rSAGE protocol appears to improve on the original rSAGE protocol [41, 42] and was as efficient as GLGI-SAGE [43, 44] and GLGI-MPSS [47].
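The relationship between raw tag counts and tags per million (tpm) is simple arithmetic, sketched below. The library size is not given in this section; the value computed here is only what the statement "5 tpm is equivalent to a singleton" would imply if the 5 tpm figure were exact, so it should be read as an approximation. The function names are hypothetical.

```python
def tags_per_million(count: int, library_size: int) -> float:
    """Normalize a raw tag count to tags per million (tpm)."""
    return count * 1_000_000 / library_size


def implied_library_size(count: int, tpm: float) -> int:
    """Library size implied by a given count and its reported tpm value."""
    return round(count * 1_000_000 / tpm)


# A singleton (count = 1) reported at 5 tpm implies roughly 200,000 tags.
approx_size = implied_library_size(1, 5)
singleton_tpm = tags_per_million(1, approx_size)
```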

From the 168 SAGE tags that yielded PCR amplification products, a total of 196 rSAGE products were cloned and sequenced. Of these, 148 (75.5%) were confirmed as specific rSAGE products following DNA sequencing, BLAST and BLAT confirmation (supplemental online Table 2). These 148 rSAGE 3′ cDNA fragments have been submitted to GenBank (accession numbers DN604327–DN604453), and we will refer to these cDNA sequences hereafter as HESTs. When TSRPs were designed using the LongSAGE tags, the overall amplification specificity reached 80.5% compared with GLGI-SAGE specificities that varied between 60% for low-copy SAGE tags and 80% for high-copy SAGE tags [43, 44]. Many of the nonspecific rSAGE fragments lacked a poly(A) tract and the rSAGER1 primer and were generated mainly because of mispriming at the 3′ ends (supplemental online Table 3). Finally, although the hESC lines used in our earlier SAGE study [12] and for the present rSAGE library construction were grown on MEF feeders, we did not find contaminating murine RNA transcripts a significant problem in our 3′ rSAGE conversion attempts.

Overall, 16.0% of rSAGE reactions failed to give distinct amplification products. Taken together with the nonspecific rSAGE results, our main conclusion is that a SAGE tag does not always provide an ideal sequence for the design of thermodynamically favorable TSRPs for the efficient amplification of 3′ cDNA by rSAGE. Thus, orphan SAGE tags that were AT-rich or contained sequences that were self-complementary often failed to generate specific rSAGE 3′ cDNA fragments. Although it is possible that when the expression level of targeted templates is very low, partial annealing of the TSRPs with other highly expressed templates may result in nonspecific amplification [44], the availability of additional sequences through the generation of LongSAGE or even SuperSAGE tags [22] would allow most of the remaining orphan SAGE tags to be converted into longer 3′ cDNA fragments for gene identification.
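The conversion, specificity, and failure rates quoted in this section follow directly from the reported counts, as the short check below shows. The variable names are hypothetical; the counts are those stated in the text.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


tags_attempted = 200       # orphan SAGE tags selected for rSAGE
tags_amplified = 168       # tags that yielded PCR products
products_sequenced = 196   # rSAGE products cloned and sequenced
products_specific = 148    # products confirmed as specific

conversion_rate = pct(tags_amplified, tags_attempted)
specificity = pct(products_specific, products_sequenced)
failure_rate = pct(tags_attempted - tags_amplified, tags_attempted)
```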

Analysis of 3′ HESTs Generated from HES3 Orphan SAGE Tags

The size distribution of the 148 HESTs ranged from 36 to 538 bp, with 56.7% of them longer than 100 bp, which agreed well with the reported data from GLGI-SAGE studies [18, 19, 43, 44]. A small number of the TSRPs [14] gave two or more distinct rSAGE bands. The majority of them were mapped to distinct transcripts (HEST31, 52, 53, 65, 98, 99, 148, and 170; supplemental online Table 2), whereas those for HEST126 and 141 were the result of alternative polyadenylation sites. Previous GLGI-SAGE reports have relied on BLAST searches to determine the identity of the 3′ cDNA fragments [18, 19, 43, 44]. We used both BLAT and BLAST searches to establish the identity of rSAGE cDNA sequences (Fig. 3A). Indeed, the BLAT transcript viewer made it easier to visualize and quickly identify NATs, novel introns, and new splice variants of known transcripts and to confirm SNPs within the SAGE tags. For several SAGE tags, rSAGE extension resulted only in poly(A) sequences because the NlaIII site lay immediately adjacent to the poly(A) tract; these tags would require the use of a different tagging enzyme to reveal their true identity. More importantly, our rSAGE results have clearly identified 59 of these rSAGE 3′ cDNA fragments as novel rSAGE 3′ ESTs and 30 as NATs, all of which are identified for the first time (Fig. 3A).

The majority of the novel rSAGE 3′ ESTs that mapped to specific chromosomal locations also contained the canonical polyadenylation signal, AATAAA or its functional variant [48], and are likely to represent bona fide transcripts from previously undescribed human genes. As shown in Table 1, the majority of these 18 novel rSAGE 3′ ESTs are underrepresented in SAGE libraries not derived from human embryonic stem (ES) cells and, where present, are found mainly in SAGE libraries constructed from cancer cell lines or carcinomas. They are likely to represent transcripts that are expressed specifically in hESCs. For instance, HEST94 and 147 are represented only in hESCs and could turn out to be excellent markers for the “stemness” phenotype of hESCs. To confirm the validity of these rSAGE 3′ cDNAs and to determine whether they were indeed restricted to undifferentiated hESCs, RT-PCR was performed for several selected HESTs (97, 146, 147, and 149) across a selected tissue panel (testis, brain, heart, skeletal muscle, fetal brain, and stomach), undifferentiated hESCs (HES3 and HES4), and differentiated HES3 cells (Fig. 3B). Like the well-established hESC marker POU5F1, the RT-PCR products for these four novel rSAGE 3′ ESTs were detected only in the hESC lines and were absent in the other somatic tissues examined. The expression of HEST149 was also completely abrogated in differentiated HES3 cells (Fig. 3B) and could, like Oco90 [12], prove to be a reliable marker for monitoring the early differentiation of hESCs. Indeed, HEST149 expression was undetectable in the universal reference RNA sample, which is an RNA pool from several cancer tissues (Fig. 3B), and was absent in several embryonal carcinoma lines such as GCT-27C4, GCT-27X1, and GCT-44 (unpublished results).

In addition, a number of the novel 3′ rSAGE cDNA fragments (e.g., HESTs 73, 92, 102, and 126) could not be matched reliably to the human genome and were also not the products of contaminating MEF cDNAs. Perhaps these HESTs represent transcripts from novel hybrid RNAs with a regulatory function or as yet undiscovered genes. The presence of consensus polyadenylation sites on several of these HESTs (e.g., 92, 102, and 126) is a good indication that these are authentic transcripts.

Interestingly, four HESTs (112, 120, 128, and 170) showed high sequence similarity to the WiCell hESC ESTs [13]. HEST2 and 146, classified as novel sequences, did not overlap with known hESC ESTs but mapped to genomic regions proximal to chromosomal sites from which several WiCell hESC ESTs appear to be transcribed. Obtaining 3′ cDNA sequences that matched WiCell ESTs [13] indicated that our modified rSAGE protocol was working well. In addition, our RT-PCR data confirmed that the expression of HEST120, 127, and 146 was confined to hESCs, although HEST120 (and to a lesser extent HEST127) was also detected in the fetal brain (Fig. 3B). Unfortunately, although these ESTs are highly restricted in their expression to hESCs, as demonstrated either by RT-PCR or by their representation in human ESC SAGE libraries [12], their exact functional role is unknown.

The impact of SNPs on the correct assignment of SAGE tags to specific transcripts [29] is also illustrated by our rSAGE results. For instance, HEST49 matched CHD8 with almost 100% sequence similarity and is the result of an SNP that created a new NlaIII restriction site upstream of the AATAAA polyadenylation site. The full-length cDNA sequence of CHD8 is 8,160 bp long, and this SNP would generate the C-most SAGE tag. The original C-most SAGE tag for CHD8 is GGCCCCATTG (nts 7311–7320), which is also represented in the HES3 SAGE library (5 tpm). We also detected an SNP within the C-most SAGE tag of GJA1, which encodes the gap junction protein connexin 43. The putative C-most SAGE tag is TGTTCTGGAG (nts 2916–2925). The rSAGE conversion of the orphan SAGE tag, TGTTTTGGAG, resulted in HEST113, which displayed a 97% sequence similarity to the 3′ terminal region of the GJA1 coding region. Careful examination of corresponding EST and genomic DNA sequences indicated that this orphan tag most likely represented an SNP in the canonical GJA1 SAGE tag and not the hypothetical protein FLJ10407 as suggested by the predicted tag-to-gene mapping of SAGEGenie. The GJA1 SNP was verified using 6-carboxyfluorescein (FAM)- and VIC-labeled Taqman probes that were specific to the polymorphism (Fig. 3C).
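The GJA1 example can be made concrete by comparing the canonical and orphan tags base by base. The sketch below is illustrative only; the function name is hypothetical, and the canonical tag is read as TGTTCTGGAG on the assumption that any hyphen inside the printed tag is a line-break artifact.

```python
def mismatch_positions(tag_a: str, tag_b: str) -> list:
    """1-based positions at which two equal-length tags differ."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    return [i for i, (a, b) in enumerate(zip(tag_a.upper(), tag_b.upper()),
                                         start=1) if a != b]


canonical_gja1 = "TGTTCTGGAG"  # putative C-most tag (nts 2916-2925)
orphan_tag = "TGTTTTGGAG"      # orphan tag converted to HEST113

differences = mismatch_positions(canonical_gja1, orphan_tag)
is_single_mismatch = len(differences) == 1
```

The two tags differ at a single position (a C-to-T change at the fifth base), which is what the SNP alternative tag category in the Methods is meant to capture.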

The generation of longer 3′ cDNA sequences by rSAGE has also helped to resolve some of the ambiguities in tag-to-gene assignments, at least in HES3 cells. For example, HEST119 (AGTGAGGATA) matched the hypothetical protein FLJ35155 (C3orf21), which is restricted in expression to hESC lines and tissues of cancerous origin. In addition, the SAGE tag for HEST114 (CATCCAAAAA) was incorrectly assigned to NPY and CEP2 by SAGEGenie and SAGEMap, respectively. Instead, rSAGE conversion confirmed that HEST114 matched FLJ10884, a hypothetical protein restricted in its expression to the testis, placenta, and hESC lines.

Antisense Transcription in hESCs

BLAT and BLAST searches revealed that many of the HESTs were the products of antisense transcription. Interestingly, cis-NATs for several important ES-specific genes, such as NANOG (HEST16), POU5F1 (HEST88), and LIN28 (HEST168), were identified by our rSAGE results (supplemental online Table 2). Analyzing the chromosomal location of these cis-NATs and the corresponding sense tags from the HES3 library revealed the presence of sense-antisense (SA) gene pairs [34, 35, 38]. Table 2 lists 18 SA SAGE tag pairs and the corresponding antisense HESTs that were experimentally obtained with rSAGE. Although several SA SAGE tag pairs mapped in trans to remote genomic loci, other pairs mapped in cis on contiguous, oppositely oriented DNA strands (Fig. 4A). Besides POU5F1, NANOG, and LIN28, a number of other highly expressed hESC-specific genes, such as TGIF/TALE (HEST109), ERH (HEST151), TERA (HEST155), and TERF1 (HEST193.2), also expressed cis-NATs. Furthermore, the representation of many of these coexpressed SA SAGE tag pairs decreased upon differentiation of the hESCs (Table 2). The SAGE tags for the NANOG (TCATTACGAT) and POU5F1 (ATGTGGGATT) cis-NATs were found only in hESC SAGE libraries, indicating that the expression pattern of the cis-NATs for NANOG and POU5F1 is even more restricted than that of their sense transcript counterparts.
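The distinction between cis and trans SA pairs can be sketched as a simple test on mapped loci. This is only an illustration under stated assumptions: the `Locus` type, the `classify_pair` function, and all coordinates are hypothetical, the inputs are assumed to be tag pairs already known to be complementary, and "cis" is reduced here to an overlap on opposite strands of the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def overlaps(a, b):
    """True if two loci share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_pair(sense, antisense):
    """Label a sense-antisense pair as 'cis' or 'trans'.

    Simplifying assumption: cis means overlapping loci on opposite
    strands; any other arrangement is treated as trans.
    """
    if sense.strand != antisense.strand and overlaps(sense, antisense):
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
sense = Locus("chr12", "+", 1000, 5000)
nearby_antisense = Locus("chr12", "-", 4500, 6000)
remote_antisense = Locus("chr3", "-", 200, 900)

print(classify_pair(sense, nearby_antisense))
print(classify_pair(sense, remote_antisense))
```

A real analysis would also need a rule for adjacent but non-overlapping loci, which this sketch does not attempt.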

To validate that the cis-NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA were specifically expressed in hESCs, orientation-specific RT-PCR [33, 49] was carried out using total RNA isolated from HES3, a universal reference RNA sample (Stratagene), testis, and stomach (Fig. 4B). First-strand cDNAs were prepared using primers specific to POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, respectively. Specific RT-PCR products for these cis-NATs were detected only when RT was included, thus confirming that the cis-NATs were specifically expressed in hESCs and were not due to spurious PCR amplification.

HEST115 and 168 appeared to represent spliced SA transcripts from ILF2 and LIN28, respectively. Nucleotides (nts) 1–42 of HEST115 matched the ILF2 coding region in the antisense orientation (Chr1[+]: 150447872–150447913), whereas nts 24–222 matched the sense orientation (Chr1[−]: 150447587–150447785). Likewise, nts 1–133 of HEST168 matched the LIN28 coding region in the antisense orientation (Chr1[−]: 26439918–26440050), whereas nts 131–171 matched the sense orientation (Chr1[+]: 26440310–26440350). This novel sense-antisense RNA hybrid structure was originally reported for the cardiac troponin I gene in rat hearts [50]. The structure of the cardiac troponin I “hybrid RNA,” which the authors themselves have tentatively concluded to be formed from the transcription of the troponin mRNA in the cytoplasm, is very similar to what we have described for ILF2 and LIN28. The functional significance of these hybrid RNAs is currently unknown.
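As an arithmetic check on the alignments quoted above, the short sketch below compares the length of each HEST segment with the length of the genomic interval it is reported to match. The coordinates are copied from the text; the `Segment` type and helper names are hypothetical, and both coordinate systems are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    hest: str
    nt_start: int    # position within the HEST cDNA
    nt_end: int
    strand: str      # strand label as given in the text
    chr_start: int   # Chr1 coordinates as given in the text
    chr_end: int


SEGMENTS = [
    Segment("HEST115", 1, 42, "+", 150447872, 150447913),
    Segment("HEST115", 24, 222, "-", 150447587, 150447785),
    Segment("HEST168", 1, 133, "-", 26439918, 26440050),
    Segment("HEST168", 131, 171, "+", 26440310, 26440350),
]


def span(start, end):
    """Length of a 1-based inclusive interval."""
    return end - start + 1


def lengths_consistent(seg):
    """True if the cDNA segment and genomic interval have equal length."""
    return span(seg.nt_start, seg.nt_end) == span(seg.chr_start, seg.chr_end)


for seg in SEGMENTS:
    print(seg.hest, seg.strand,
          span(seg.nt_start, seg.nt_end),
          span(seg.chr_start, seg.chr_end),
          lengths_consistent(seg))
```

Under these assumptions, all four segments have matching lengths (42, 199, 133, and 41 nt), so the reported cDNA and genomic ranges are internally consistent.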

Discussion

Unlike DNA microarrays, SAGE does not require prior knowledge of the sequences to be analyzed. Hence, SAGE libraries provide discrete and unbiased directional gene expression data that are ideally suited for gene discovery and SA expression analysis [35, 38]. Although MPSS [51] is capable of deeper coverage of the gene expression profile, it requires specialized reagents and equipment, and this has restricted the availability of MPSS libraries for various human tissues and cell types, including those for hESCs. On the other hand, SAGE comprises several standard molecular biology techniques and can be adapted for microanalysis [52, 53]. This has resulted in the construction of SAGE libraries from a large variety of human cell types and tissues, and they are an important resource for the discovery of novel genes and NATs [38, 54, 55].

Although the human transcriptome is necessarily less complex than the human genome, it is quite apparent that transcriptome complexity has been underestimated [34, 35, 38, 44]. Noncoding RNA, regulatory RNA, NATs, and novel splice variants add to the multifaceted nature of the transcriptome. In the present study, we have used a modified rSAGE strategy to convert selected orphan SAGE tags from hESCs into longer 3′ cDNAs. This strategy has facilitated the identification of isoforms due to splicing, alternative polyadenylation, and SNPs. A large number of novel hESC-specific genes have also been identified, indicating that the hESC transcriptome is indeed poorly characterized [12]. This is also the first description of cis-NATs from several key pluripotency genes that are involved in the maintenance of hESC self-renewal, suggesting that SA transcript pairing might be a key regulatory mechanism [31].

A recent study reported that 41.5% of SA transcript overlaps occurred in the last exon or untranslated region (UTR) of the coding sequence [34]. We have found that overlaps between the cis-NATs of LIN28, NANOG, and POU5F1 and their corresponding sense transcripts occurred in the 3′ UTR of the coding sequence as well. Although the exact significance of this positional overlap is unknown, UTRs are believed to contribute toward the localization, stability, and translational control of mRNA transcripts. Indeed, the finding that >30% of vertebrate mRNAs show orthologue-specific conservation of 3′ UTRs suggests a possible functional or regulatory role for UTR sequences [56]. The recent finding that many of the human SA gene pairs are also detected in mouse, rat, and fugu and are probably conserved throughout the course of vertebrate evolution [57] lends some support to the notion that cis-NATs are not due to a “leakage” of the transcriptional apparatus but rather that their abundance is the result of active transcription. For POU5F1 and NANOG, we have ruled out the possibility that their cis-NATs are due to the insertion of an L1 retrotransposon [58]. However, because there are several pseudogenes for POU5F1 and NANOG, the possibility of trans-NATs from these genomic loci remains to be determined.

Several reports have hinted that the contribution of NATs in the human genome has been underestimated [34, 35] and that up to 25% of human transcripts might form natural SA pairs. Although initial studies indicated that there was no correlation between NATs and their function or localization [34], a more recent survey of SA pairs found that they are predominant for genes involved in translation regulator activity, DNA damage response, and cell growth, whereas non-SA transcripts were found to have a significantly different functional distribution [35]. Several of the hESC NATs and SA gene pairs we have identified are representative of genes that code for transcription factors and RNA-binding proteins, whereas SA gene pairs for ubiquitously expressed genes, such as glyceraldehyde-3-phosphate dehydrogenase and ACTB, were not present in the HES3 SAGE library. The finding that SA transcripts have a significantly higher probability of involvement in translation regulator activity and are more frequently located in both the nucleus and cytoplasm [35] is compatible with a role in antisense-mediated gene regulation occurring in both the nucleus and cytoplasm and at the transcription and translation levels [31].

Although certain human miRNAs (miR-1 and miR-124) have recently been demonstrated to influence and define tissue-specific gene expression profiles in HeLa cells [59], the functional roles of cis-NATs in a similar context have not been previously reported. Since cis-NATs are also capable of regulating gene expression through RNA masking, transcriptional interference, or RNA interference [31, 32], the identification of cis-NATs for POU5F1 and NANOG prompted us to determine whether cis-NATs might be commonly expressed for other key regulators that are involved in the maintenance of pluripotency in hESCs. Both the mouse and the human SAGE libraries were searched for the presence of SAGE tags representing the cis-NATs for ES-specific genes [12, 60]. We failed to find SAGE tags representing UTF1, REX1, LEFTB, and GDF3 cis-NATs in human and mouse SAGE libraries. However, we detected cis-NATs for a number of key ES-specific genes (e.g., FGFR1, FGFR2, TDGF1, SOX2) in the HES3 library and in SAGE libraries constructed from other hESC lines (Table 3). In addition, SAGE tags representing cis-NATs for pou5f1, nanog, tera, and lin28 were also detected in mouse embryonic stem cells (mESCs). In summary, cis-NATs for a number of ES-specific genes, such as POU5F1 and NANOG, were shown to be expressed in both hESCs and mESCs, and it is possible that some of these cis-NATs might have a role in maintaining the “stemness” phenotype of ES cells.
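One way such a library search could be set up is sketched below in Python. This is an assumption-laden illustration, not the procedure used in this study: it reverse-complements a sense sequence, collects every 10-nt tag that follows a CATG site on the antisense strand (since the 3′ end of a cis-NAT is generally not known in advance), and looks those tags up in a tag-count table. The function names, the toy sequence, and the tag counts are hypothetical; the two tag strings are the NANOG and POU5F1 cis-NAT tags quoted in the Results.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_tags(seq, site="CATG", tag_length=10):
    """Return every full-length tag found immediately 3' of a site."""
    tags = []
    pos = seq.find(site)
    while pos != -1:
        start = pos + len(site)
        tag = seq[start:start + tag_length]
        if len(tag) == tag_length:
            tags.append(tag)
        pos = seq.find(site, pos + 1)
    return tags


def antisense_hits(sense_seq, library):
    """Map each candidate antisense tag present in the library to its count."""
    antisense = reverse_complement(sense_seq)
    return {tag: library[tag] for tag in candidate_tags(antisense) if tag in library}


# Toy sense-strand sequence (hypothetical, not a real transcript), built so
# that its antisense strand carries CATG followed by the tag TCATTACGAT.
toy_sense = "GGAGG" + reverse_complement("CATG" + "TCATTACGAT") + "TTCCA"

# Toy library; the counts are invented for illustration.
toy_library = {"TCATTACGAT": 3, "ATGTGGGATT": 7}

print(antisense_hits(toy_sense, toy_library))  # {'TCATTACGAT': 3}
```

Because CATG is its own reverse complement, every NlaIII site on the sense strand is also a site on the antisense strand, and an antisense tag corresponds to the reverse complement of the 10 nt lying immediately 5′ of that site on the sense strand.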

Our study further underscores the importance of obtaining longer 3′ cDNAs from orphan SAGE tags and the versatility of rSAGE as a complementary tool to SAGE expression libraries for gene discovery. Lastly, the hESC-specific transcripts that we have described are clear targets for further study. Converting the remaining orphan SAGE tags from HES3 and other hESC lines would likely yield additional novel transcripts and uncover additional cis-NATs, providing valuable resources for the in-depth functional dissection of the molecular pathways involved in the self-renewal of pluripotent hESCs and their subsequent lineage commitment to differentiated progeny.

Table 1. Chromosomal location and SAGE library representation of 18 novel 3′ reverse SAGE expressed sequence tags with authentic polyadenylation signal

Table 2. Sense-antisense SAGE tag pairs of antisense HESTs

Table 3. Occurrence of SAGE tags of cis-natural antisense transcripts of selected embryonic stem-specific genes in human and mouse embryonic stem cell SAGE libraries

Figure 1. Schematic diagram of the modified rSAGE protocol. Briefly, mRNA was isolated, and cDNA synthesis was performed with an anchored biotin-labeled RT primer. cDNAs were digested with NlaIII to reduce complexity of the library. An rSAGE linker was next ligated to cleaved 3′ cDNAs bound to streptavidin beads, following which AscI digestion was performed to release the cDNAs. rSAGE library scale-up amplification was performed with the rSAGEF1 and rSAGER1 primers. An aliquot of the amplified rSAGE library was used in rSAGE amplifications with a serial analysis of gene expression tag-specific primer and the common Rev1 reverse primer. Abbreviations: HEST, human embryonic stem cell serial analysis of gene expression tag; PCR, polymerase chain reaction; rSAGE, reverse serial analysis of gene expression; RT, reverse transcription; TSP, tag-specific primer.

Figure 2. Results of reverse serial analysis of gene expression (rSAGE) amplification for 200 orphan serial analysis of gene expression (SAGE) tags. (A): Pie chart shows the distribution of rSAGE products. (B): rSAGE reactions were carried out using the tag-specific rSAGE and rSAGER1 primers; the products were analyzed on an agarose gel, and the bands were visualized with ethidium bromide. Most lanes show a single distinct amplified rSAGE band. A 100-bp ladder (M) was used as a molecular weight marker. The numbers at the top of the gel represent the SAGE Tag ID. Abbreviations: EST, expressed sequence tag; M, molecular weight marker; PCR, polymerase chain reaction.

Figure 3. Identity of the 148 rSAGE 3′ cDNA fragments. (A): The distribution of the various categories of rSAGE products is summarized as a pie chart. (B): Human embryonic stem cell (hESC)-specific expression of eight HESTs was verified with semiquantitative reverse transcription-polymerase chain reaction (PCR) using total RNAs prepared from several peripheral adult tissues and fetal brain, universal reference RNA (Stratagene), undifferentiated HES3 and HES4 hESC lines, and D-HES3 cells. (C): Quantitative real-time PCR results for GJA1 SNP analysis. Abbreviations: bp, base pairs; CT, threshold cycle; EST, expressed sequence tag; FAM, 6-carboxyfluorescein; INDEL, insertion/deletion; rSAGE, reverse serial analysis of gene expression; SNP, single-nucleotide polymorphism.

Figure 4. Confirmation of natural antisense transcription in HES3 cells. (A): Illustration of the concept of cis and trans AS tag pairs in serial analysis of gene expression. (B): Expression of POU5F1, NANOG, LIN28, TALE, TERA, and TERF1 cis-natural antisense transcripts (NATs). For amplification of cis-NATs, sense-specific primers were used for reverse transcription (RT) instead of an oligo(dT) primer. During the subsequent polymerase chain reaction amplification, sense and antisense primers were used. Total RNA that had not been reverse-transcribed was used as a template control for genomic DNA contamination (−RT). Abbreviations: AS, antisense; bp, base pairs.

Acknowledgements

This study was supported by Embryonic Stem Cell International Pte. Ltd. grant R-174-000-081-592 and National University of Singapore Academic Research Fund grant R-154-000-179-112.

References
