rSAGE Amplification, Specificity, Efficiency, and Size Distribution of 3′ cDNAs
The original rSAGE (Kinzler/Vogelstein laboratories) [41, 42], the GLGI-SAGE protocol [43, 44], and our modified rSAGE strategy share several key features (Fig. 1). However, we have made several modifications to increase the efficiency of 3′ cDNA conversion. For instance, changes in the design of the universal primers allowed the rSAGE library scale-up and the subsequent TSRP PCR amplification to be carried out at an increased melting temperature (Tm). The introduction of a longer poly(T) tract (T30) and the inclusion of VN dinucleotides in the first-strand RT-PCR primer allowed better trapping and synthesis of full-length mRNAs at their 3′ ends than the shorter poly(T) tract (T10) used in the GLGI strategy, which might allow the primer to bind to internal poly(A) stretches within mRNA transcripts. Finally, increasing the Mg2+ concentration when no distinct rSAGE band was observed in the first round of PCR could occasionally enhance the specificity of the rSAGE amplification reaction.
Of the 200 HES3 orphan SAGE tags that were selected for rSAGE conversion (supplemental online Table 1), 168 (84.0%) yielded PCR amplification products (Fig. 2A). The conversion rate of orphan LongSAGE tags into longer 3′ cDNA fragments was much higher (93.4%) than that of the SAGE tags (69.2%). We attributed these improvements to the availability of additional sequences from the LongSAGE tags for the design of TSRPs, as well as better-designed universal primers (rSAGEF1 and rSAGER1) in our strategy (Fig. 1). In particular, we found the universal M13 primer used as the antisense primer in the original rSAGE strategy [41, 42] was unsatisfactory for rSAGE because of its low Tm.
Figure 2B shows a representative agarose gel of the rSAGE products. It is noteworthy that the majority (∼90%) of the TSRPs yielded only a single distinct rSAGE band. Our results also support the notion that there is no strict correlation between the efficiency of target template amplification and the abundance of the SAGE tag, in contrast to earlier reports on GLGI-SAGE [44, 45]. Other variables, such as SAGE tag length and primer sequence, may be equally important parameters influencing the efficiency of target amplification. As shown in Figure 2B, the rSAGE amplification generally generated intense bands that were easily gel-purified, although amplification of SAGE tags with a lower copy number (<20 tpm) yielded less PCR product and in some cases (Tag IDs 156 and 169) gave one or more faint bands that were difficult to gel-purify; these bands were not analyzed. When two or more distinct rSAGE bands were obtained (Tag IDs 126, 141, and 148), they usually turned out to be discrete 3′ cDNA fragments. In most GLGI reports, conversion to 3′ cDNAs is usually attempted for SAGE tags with a high copy number [18, 19]. In contrast, a large proportion (68%) of the orphan SAGE tags we attempted to convert to 3′ cDNAs were present at lower frequencies (≤50 tpm). We also managed to obtain genuine rSAGE products for SAGE tags with frequencies as low as 5 tpm, which is equivalent to the detection of a singleton in the HES3 SAGE library (HESTs 79, 147, and 174; supplemental online Table 1). In conclusion, it appears that our modified rSAGE protocol offers some improvements over the original rSAGE protocol [41, 42] and is as efficient as GLGI-SAGE [43, 44] and GLGI-MPSS.
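The statement that a 5 tpm tag corresponds to a singleton can be illustrated with the standard tags-per-million normalization. The following is a minimal sketch; the function names are hypothetical, and the library size it derives (about 200,000 tags) is only what the 5 tpm singleton figure implies, not a value stated in the text.

```python
def tags_per_million(count: int, library_size: int) -> float:
    """Normalize a raw tag count to tags per million (tpm)."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return count * 1_000_000 / library_size


def implied_library_size(count: int, tpm: float) -> float:
    """Library size implied by a raw count that is reported at a given tpm."""
    if tpm <= 0:
        raise ValueError("tpm must be positive")
    return count * 1_000_000 / tpm


# Assumption: a singleton (count = 1) is reported as 5 tpm, as stated above.
# The library size below is inferred from that figure only.
INFERRED_LIBRARY_SIZE = implied_library_size(count=1, tpm=5.0)
```

Under this normalization, any tag seen once in a library of that inferred size would be reported at 5 tpm, which is why such tags sit at the detection limit of the SAGE library.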
From the 168 SAGE tags that yielded PCR amplification products, a total of 196 rSAGE products were cloned and sequenced. Of these, 148 (75.5%) were confirmed as specific rSAGE products following DNA sequencing and BLAST and BLAT confirmation (supplemental online Table 2). These 148 rSAGE 3′ cDNA fragments have been submitted to GenBank (accession numbers DN604327–DN604453), and we will refer to these cDNA sequences hereafter as HESTs. When TSRPs were designed using the LongSAGE tags, the overall amplification specificity reached 80.5%, compared with GLGI-SAGE specificities that varied between 60% for low-copy SAGE tags and 80% for high-copy SAGE tags [43, 44]. Many of the nonspecific rSAGE fragments lacked a poly(A) tract and the rSAGER1 primer and were generated mainly because of mispriming at the 3′ ends (supplemental online Table 3). Finally, although the hESC lines used in our earlier SAGE study and for the present rSAGE library construction were grown on MEF feeders, we did not find contaminating murine RNA transcripts to be a significant problem in our 3′ rSAGE conversion attempts.
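The conversion and specificity rates quoted so far follow directly from the reported counts. The short sketch below recomputes them; the helper name `percent` is hypothetical, and only counts stated in the text are used (the LongSAGE and SAGE subtotals are not given, so those rates are not recomputed).

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


# Counts taken from the text.
TAGS_ATTEMPTED = 200        # orphan SAGE tags selected for rSAGE conversion
TAGS_WITH_PRODUCT = 168     # tags that yielded PCR amplification products
PRODUCTS_SEQUENCED = 196    # rSAGE products cloned and sequenced
PRODUCTS_SPECIFIC = 148     # products confirmed as specific

conversion_rate = percent(TAGS_WITH_PRODUCT, TAGS_ATTEMPTED)
specificity_rate = percent(PRODUCTS_SPECIFIC, PRODUCTS_SEQUENCED)
```

Note that the two rates use different denominators: conversion is counted per tag, whereas specificity is counted per cloned product, because some tags gave more than one band.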
Overall, 16.0% of rSAGE reactions failed to give distinct amplification products. Taken together with the nonspecific rSAGE results, our main conclusion is that a SAGE tag does not always provide an ideal sequence for the design of thermodynamically favorable TSRPs for the efficient amplification of 3′ cDNA by rSAGE. In particular, orphan SAGE tags that were AT-rich or contained self-complementary sequences often failed to generate specific rSAGE 3′ cDNA fragments. Although it is possible that when the expression level of targeted templates is very low, partial annealing of the TSRPs with other highly expressed templates may result in nonspecific amplification, the availability of additional sequences through the generation of LongSAGE or even SuperSAGE tags would allow most of the remaining orphan SAGE tags to be converted into longer 3′ cDNA fragments for gene identification.
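The two failure modes noted above, AT-richness and self-complementarity, could be screened for before primer synthesis. The following is a minimal sketch and not the procedure used in this study: the function names and both thresholds (`min_gc`, `min_self_stretch`) are hypothetical values chosen for illustration, and a real primer design tool would also model Tm and secondary structure.

```python
# Translation table for complementing an uppercase DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_self_complementary_stretch(seq: str) -> int:
    """Length of the longest substring whose reverse complement also occurs in seq."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    best = 0
    for i in range(len(seq)):
        # Only try substrings longer than the best found so far.
        for j in range(i + best + 1, len(seq) + 1):
            if seq[i:j] in rc:
                best = j - i
            else:
                break
    return best


def screen_tag(tag: str, min_gc: float = 0.3, min_self_stretch: int = 6) -> list:
    """Return a list of reasons a tag may make a poor TSRP (empty if none).

    Both thresholds are hypothetical defaults, not values from the study.
    """
    reasons = []
    if gc_fraction(tag) < min_gc:
        reasons.append("AT-rich")
    if longest_self_complementary_stretch(tag) >= min_self_stretch:
        reasons.append("self-complementary")
    return reasons
```

A tag that returns an empty list passes this crude screen; a flagged tag would be a candidate for redesign using the extra bases available from a LongSAGE tag.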
Analysis of 3′ HESTs Generated from HES3 Orphan SAGE Tags
The 148 HESTs ranged in size from 36 to 538 bp, with 56.7% of them longer than 100 bp, which agreed well with the reported data from GLGI-SAGE studies [18, 19, 43, 44]. A small number of the TSRPs gave two or more distinct rSAGE bands. The majority of them were mapped to distinct transcripts (HEST31, 52, 53, 65, 98, 99, 148, and 170; supplemental online Table 2), whereas those for HEST126 and 141 were the result of alternative polyadenylation sites. Previous GLGI-SAGE reports have relied on BLAST searches to determine the identity of the 3′ cDNA fragments [18, 19, 43, 44]. We used both BLAT and BLAST searches to establish the identity of rSAGE cDNA sequences (Fig. 3A). Indeed, the BLAT transcript viewer made it easier to visualize and quickly identify NATs, novel introns, and new splice variants of known transcripts and to confirm SNPs within the SAGE tags. For several SAGE tags, rSAGE extension resulted only in poly(A) sequences because the NlaIII site lay immediately adjacent to the poly(A) tract; revealing the identity of these tags would require a different tagging enzyme. More importantly, our rSAGE results have clearly identified 59 of these rSAGE 3′ cDNA fragments as novel rSAGE 3′ESTs and 30 as NATs, all of which are identified for the first time (Fig. 3A).
The majority of the novel rSAGE 3′ESTs that mapped to specific chromosomal locations also contained the canonical polyadenylation signal, AATAAA, or its functional variant, and are likely to represent bona fide transcripts from previously undescribed human genes. As shown in Table 1, the majority of these 18 novel rSAGE 3′ESTs are underrepresented in the nonhuman embryonic stem (ES) SAGE libraries and are found mainly in SAGE libraries constructed from cancer cell lines or carcinomas. They are likely to represent transcripts that are expressed specifically in hESCs. For instance, HEST94 and 147 are represented only in hESCs and could turn out to be excellent markers for the “stemness” phenotype of hESCs. To confirm the validity of these rSAGE 3′ cDNAs and to test whether they were indeed restricted to undifferentiated hESCs, RT-PCR was performed for several selected HESTs (97, 146, 147, and 149) across a selected tissue panel (testis, brain, heart, skeletal muscle, fetal brain, and stomach), undifferentiated hESCs (HES3 and HES4), and differentiated HES3 cells (Fig. 3B). Like the well-established hESC marker POU5F1, the RT-PCR products for these four novel rSAGE 3′ESTs were detected only in the hESC lines and were absent in the other somatic tissues examined. The expression of HEST149 was also completely abrogated in differentiated HES3 cells (Fig. 3B) and could, like Oco90, prove to be a reliable marker for monitoring the early differentiation of hESCs. Indeed, HEST149 expression was undetectable in the universal reference RNA sample, which is an RNA pool from several cancer tissues (Fig. 3B), and was absent in several embryonal carcinoma lines such as GCT-27C4, GCT-27X1, and GCT-44 (unpublished results).
In addition, a number of the novel 3′ rSAGE cDNA fragments (e.g., HESTs 73, 92, 102, and 126) could not be matched reliably to the human genome and were also not the products of contaminating MEF cDNAs. Perhaps these HESTs represent transcripts from novel hybrid RNAs with a regulatory function or as yet undiscovered genes. The presence of consensus polyadenylation sites on several of these HESTs (e.g., 92, 102, and 126) is a good indication that these are authentic transcripts.
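Checking a 3′ cDNA fragment for a polyadenylation signal upstream of its poly(A) tail can be sketched as follows. This is a minimal illustration, not the analysis pipeline used here: the function name, the 40-nt search window, and the minimum tail length are hypothetical, and ATTAAA is assumed as the variant signal because it is the most common alternative hexamer (the text does not name the variant).

```python
import re

# Canonical signal from the text, plus ATTAAA as an assumed common variant.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def find_polya_signals(cdna: str, window: int = 40, min_tail: int = 8) -> list:
    """Return (signal, 0-based start) pairs found within `window` nt
    upstream of a 3' terminal poly(A) tail of at least `min_tail` A's.

    Returns an empty list if the sequence has no such tail. A signal that
    directly abuts the tail is partly absorbed into the A-run and is missed.
    """
    seq = cdna.upper()
    tail = re.search(r"A{%d,}$" % min_tail, seq)
    if tail is None:
        return []
    offset = max(0, tail.start() - window)
    upstream = seq[offset:tail.start()]
    hits = []
    for signal in POLYA_SIGNALS:
        # Lookahead so that overlapping occurrences are all reported.
        for match in re.finditer("(?=%s)" % signal, upstream):
            hits.append((signal, offset + match.start()))
    return sorted(hits, key=lambda hit: hit[1])
```

For example, a synthetic fragment with AATAAA placed about 20 nt upstream of a 20-nt A-run would return one hit, whereas a fragment lacking a terminal A-run returns nothing, which mirrors the observation that many nonspecific products lacked a poly(A) tract.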
Interestingly, four HESTs (112, 120, 128, and 170) showed high sequence similarity to the WiCell hESC ESTs. HEST2 and 146, classified as novel sequences, did not overlap with known hESC ESTs but mapped to genomic regions proximal to chromosomal sites from which several WiCell hESC ESTs appear to be transcribed. Obtaining 3′ cDNA sequences that matched WiCell ESTs indicated that our modified rSAGE protocol was working well. In addition, our RT-PCR data confirmed that the expression of HEST120, 127, and 146 was confined to hESCs, although HEST120 (and to a lesser extent HEST127) was also detected in the fetal brain (Fig. 3B). Unfortunately, although these ESTs are highly restricted in their expression to hESCs, as demonstrated either by RT-PCR or by their representation in human ESC SAGE libraries, their exact functional role is unknown.
The impact of SNPs on the correct assignment of SAGE tags to specific transcripts is also illustrated by our rSAGE results. For instance, HEST49 matched CHD8 with almost 100% sequence similarity and is the result of an SNP that created a new NlaIII restriction site upstream of the AATAAA polyadenylation site. The full-length cDNA sequence of CHD8 is 8,160 bp long, and this SNP would generate the C-most SAGE tag. The original C-most SAGE tag for CHD8 is GGCCCCATTG (nts 7311–7320), which is also represented in the HES3 SAGE library (5 tpm). We also detected an SNP within the C-most SAGE tag of GJA1, which encodes the gap junction protein connexin 43. The putative C-most SAGE tag is TGTTCTGGAG (nts 2916–2925). The rSAGE conversion of the orphan SAGE tag, TGTTTTGGAG, resulted in HEST113, which displayed 97% sequence similarity to the 3′ terminal region of the GJA1 coding region. Careful examination of corresponding EST and genomic DNA sequences indicated that this orphan tag most likely represented an SNP in the canonical GJA1 SAGE tag and not the hypothetical protein FLJ10407, as suggested by the predicted tag-to-gene mapping of SAGEGenie. The GJA1 SNP was verified using 6-carboxyfluorescein (FAM)- and VIC-labeled Taqman probes that were specific to the polymorphism (Fig. 3C).
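The relationship between the orphan tag and the canonical GJA1 tag can be made explicit with a position-by-position comparison. The sketch below uses the two 10-nt tag sequences quoted above; the function name is hypothetical, and it simply reports mismatched positions rather than reproducing the EST and genomic analysis described in the text.

```python
def tag_mismatches(tag_a: str, tag_b: str) -> list:
    """Return (1-based position, base in tag_a, base in tag_b) for each mismatch."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    pairs = zip(tag_a.upper(), tag_b.upper())
    return [(i + 1, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Tag sequences as given in the text.
CANONICAL_GJA1_TAG = "TGTTCTGGAG"
ORPHAN_TAG = "TGTTTTGGAG"

differences = tag_mismatches(CANONICAL_GJA1_TAG, ORPHAN_TAG)
```

The two tags differ at a single position (C to T at position 5 of the tag), which is consistent with a single-nucleotide polymorphism rather than a tag from an unrelated transcript.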
The generation of longer 3′ cDNA sequences by rSAGE has also helped to resolve some of the ambiguities in tag-to-gene assignments, at least in HES3 cells. For example, HEST119 (AGTGAGGATA) matched the hypothetical protein FLJ35155 (C3orf21), which is restricted in expression to hESC lines and tissues of cancerous origin. In addition, the SAGE tag for HEST114 (CATCCAAAAA) was incorrectly assigned to NPY and CEP2 by SAGEGenie and SAGEMap, respectively. Instead, rSAGE conversion confirmed that HEST114 matched FLJ10884, a hypothetical protein restricted in its expression to the testis, placenta, and hESC lines.
Antisense Transcription in hESCs
BLAT and BLAST searches revealed that many of the HESTs were the products of antisense transcription. Interestingly, cis-NATs for several important ES-specific genes, such as NANOG (HEST16), POU5F1 (HEST88), and LIN28 (HEST168), were identified by our rSAGE results (supplemental online Table 2). Analyzing the chromosomal location of these cis-NATs and the corresponding sense tags from the HES3 library revealed the presence of sense-antisense (SA) gene pairs [34, 35, 38]. Table 2 lists 18 SA SAGE tag pairs and the corresponding antisense HESTs that were experimentally obtained with rSAGE. Although several SA SAGE tag pairs can be mapped in trans to remote genomic loci, other pairs mapped in cis on contiguous, oppositely oriented DNA strands (Fig. 4A). Besides POU5F1, NANOG, and LIN28, a number of other highly expressed hESC-specific genes, such as TGIF/TALE (HEST109), ERH (HEST151), TERA (HEST155), and TERF1 (HEST193.2), also expressed cis-NATs. Furthermore, the representation of many of these co-expressed SA SAGE tag pairs decreased upon differentiation of the hESCs (Table 2). The SAGE tags for the NANOG (TCATTACGAT) and POU5F1 (ATGTGGGATT) cis-NATs were found only in hESC SAGE libraries, indicating that the expression pattern of the cis-NATs for NANOG and POU5F1 is even more restricted than that of their sense transcript counterparts.
To validate that the cis-NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA were specifically expressed in hESCs, orientation-specific RT-PCR [33, 49] was carried out using total RNA isolated from HES3, a universal reference RNA sample (Stratagene), testis, and stomach (Fig. 4B). First-strand cDNAs were prepared using primers specific to POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, respectively. Specific RT-PCR products for all three cis-NATs were detected only when RT was included, thus confirming that these cis-NATs were specifically expressed in hESCs and were not due to spurious PCR amplification.
HEST115 and 168 appeared to represent spliced SA transcripts from ILF2 and LIN28, respectively. Nucleotides (nts) 1–42 of HEST115 matched the ILF2 coding region in the antisense orientation (Chr1[+]: 150447872–150447913), whereas nts 24–222 matched the sense orientation (Chr1[−]: 150447587–150447785). Likewise, nts 1–133 of HEST168 matched the LIN28 coding region in the antisense orientation (Chr1[−]: 26439918–26440050), whereas nts 131–171 matched the sense orientation (Chr1[+]: 26440310–26440350). This novel sense-antisense RNA hybrid structure was originally reported for the cardiac troponin I gene in rat hearts. The structure of the cardiac troponin I “hybrid RNA,” which the authors themselves tentatively concluded to be formed from the transcription of the troponin mRNA in the cytoplasm, is very similar to what we have described for ILF2 and LIN28. The functional significance of these hybrid RNAs is currently unknown.
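As a consistency check on the mappings quoted above, the length of each HEST segment can be compared with the length of the genomic interval it is reported to match. The sketch below assumes inclusive, 1-based coordinates and ungapped alignments; the function names and segment labels are hypothetical, while the coordinates are copied from the text.

```python
def span_length(start: int, end: int) -> int:
    """Length of an inclusive, 1-based interval."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# (label, HEST nucleotide range, genomic range on chromosome 1), from the text.
SEGMENTS = [
    ("HEST115 antisense", (1, 42), (150447872, 150447913)),
    ("HEST115 sense", (24, 222), (150447587, 150447785)),
    ("HEST168 antisense", (1, 133), (26439918, 26440050)),
    ("HEST168 sense", (131, 171), (26440310, 26440350)),
]


def check_segments(segments: list) -> dict:
    """Map each label to True if the cDNA and genomic spans have equal length."""
    return {
        label: span_length(*cdna) == span_length(*genomic)
        for label, cdna, genomic in segments
    }


segment_check = check_segments(SEGMENTS)
```

Under these assumptions all four segments have matching lengths (42, 199, 133, and 41 nt), so each block of the HEST aligns to its genomic interval without gaps. The sense and antisense blocks of each HEST also share a few nucleotides (nts 24–42 of HEST115 and nts 131–133 of HEST168), which is where the orientation switches.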