More Than 40,000 Transcripts, Including Novel and Noncoding Transcripts, in Mouse Embryonic Stem Cells



To study the transcriptome of embryonic stem cells, we used a new gene expression profiling method that can measure the expression levels of unknown and rarely expressed transcripts precisely. We detected a total of 33,136 signal peaks representing transcripts in mouse embryonic stem cells, E14. Subsequent random cloning of the peaks suggests that mouse embryonic stem cells express at least 40,000 transcripts, of which about 2,000 are still unknown. In addition, we identified 1,022 noncoding transcripts, several of which change in expression depending on differentiation. Our database provides a high-resolution expression profile of E14 cells and is applicable to other mouse embryonic stem cell analyses. It includes most transcription regulation factor-encoding genes and a significant number of unknown and noncoding transcripts.


The identification of unknown transcripts is crucial for understanding a transcriptome, and it is also necessary to be able to efficiently identify low-abundance transcripts, because most transcription regulating factor-encoding transcripts are of this type. But existing hybridization-based methods are unable to detect unknown transcripts and ineffective for detecting low-abundance transcripts. Expressed sequence tag (EST) analysis has been used in the past to identify novel transcripts [1, 2]. Using this method, however, efficient detection of low-abundance transcripts incurs a huge cost, and it is unrealistic to analyze many different types of cells. Serial analysis of gene expression (SAGE) overcomes some of the limitations of EST analysis, but limitations on quantitative measurement and assignment of tags still remain [2].

There are polymerase chain reaction (PCR)-based methods that can detect rarely expressed genes and novel transcripts such as cDNA-amplified fragment length polymorphism (AFLP), arbitrarily primed-PCR and so on, but their coverage is not very high. Several improved methods that overcome previous technical difficulties have been developed, but these still suffer from a substantial rate of false positives caused by misannealing during the PCR step [3, 4]. Each peak does not necessarily correspond to a single transcript, making these methods unsuitable for genome-wide profiling.

High-coverage gene expression profiling (HiCEP) features substantial improvements over cDNA-AFLP, especially in the selective PCR step [5]. It can detect unknown transcripts and is sensitive enough to detect low-abundance transcripts. It features an extremely low rate of false positives (less than 4%), enabling the user to assign each peak to a specific gene unequivocally. In addition, it is noteworthy that highly expressed transcripts do not in general mask moderately or rarely expressed ones the way they do with other procedures such as EST collection and SAGE, since the DNA fragments (derived from transcripts) are usually different in molecular size and easily separated by capillary electrophoresis.

HiCEP can detect approximately 70%–80% of human and mouse transcripts. The dynamic range of the analysis is 10³, from 1 transcript copy per cell to approximately 1,000 transcript copies per cell, and its resolution for expression difference is approximately 1.2-fold [5].

We used HiCEP to analyze embryonic stem (ES) cells. Many unknown or rarely expressed transcripts are thought to be expressed in stem cells; the implication of such transcripts for the pluripotency of ES cells has been discussed elsewhere [6, 7].

Materials and Methods

Preparation of Total RNA from Mouse ES Cells

E14 cells were cultured on 0.1% gelatin in knockout Dulbecco's modified Eagle's medium (KO-DMEM) (Invitrogen, Carlsbad, CA) supplemented with Knockout Serum Replacement (Invitrogen), 4 mM L-glutamine, 500 units/ml leukemia inhibitory factor (LIF) (Chemicon, Temecula, CA), and 0.05 mM 2-mercaptoethanol (2-ME) according to the instructions given in previous reports [8, 9]. R1 cells were cultured on 0.3% gelatin in DMEM (Invitrogen) supplemented with 20% fetal calf serum (SAFC Biosciences, Lenexa, KS), nonessential amino acids (Invitrogen), 1,000 units/ml LIF (Chemicon), 6 mM L-glutamine, and 0.1 mM 2-ME [10]. Total RNA was isolated using the RNeasy mini kit (Qiagen, Hilden, Germany). In addition, we performed RNase-free DNaseI treatment (Qiagen).

HiCEP Reaction

The HiCEP reaction was performed according to a previous report [5]. An outline is shown in supplemental online Figure 1. Briefly, RNA is converted to cDNA by reverse transcriptase with biotinylated oligo(dT), and double-strand cDNA is prepared, digested by a specific restriction enzyme, and trapped by avidin bound to magnetic beads. After the digested fragments are washed off, leaving the 3′-most fragments bearing oligo(dT)-biotin, a synthetic adaptor is ligated, and the trapped templates are digested by another restriction enzyme. Most of the steps are the same as in standard AFLP except for the primers and annealing temperature in selective PCR. We found that there is a temperature region, 70–72°C, where we can remove almost all false positives with specific primers whose GC content is 55%–60%, resulting in a decrease in the total number of peaks. This decrease enables us to use four-nucleotide instead of six-nucleotide restriction enzymes for high-coverage detection. HiCEP analysis consists of two reactions, one for detecting 5′-MspI-MseI-3′ fragments and the other for 5′-MseI-MspI-3′ fragments. Using the same two enzymes twice in opposite order tends to maximize coverage while minimizing the number of fragments that are detected twice.

Fractionation and Sequencing of HiCEP Peaks

One microliter of each selective PCR product, 2.7 μl of formamide, 0.3 μl of GeneScan 500 ROX (Applied BioSystems, Foster City, CA), and 2.0 μl of 10× loading buffer were mixed, denatured by incubation at 95.0°C for 2 minutes, and loaded on a denaturing gel (20 cm × 40 cm of slab gel): 4%, 6%, or 10% polyacrylamide containing 7.0 M urea, followed by electrophoresis at 1,500 V for 4 hours. Fluorescence from the products was detected by Typhoon 9210 (GE Healthcare, Uppsala, Sweden). The portions of the gel containing bands were cut out and suspended in 60 μl of TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0) buffer. After 30 minutes of incubation at −20°C, 2.0 μl was used for PCR and then forwarded to cloning using the pGEM-T Easy Vector System (Promega, Madison, WI). To confirm the reliability of the cloning step, we compared the electrophoretic migration distance of each peak in the HiCEP analysis with that of the fluorescent product resynthesized using the corresponding clone as a template. The reliability of our criterion was also supported by competitive PCR using a gene-specific primer [11]. T7 primer and the BigDye Terminator v3.1 cycle sequencing kit (Applied BioSystems) were used for sequencing reactions.

Assigning HiCEP Peaks to Genes

To determine the UniGene cluster, we did a BLAST search of the mRNA sequences in mouse UniGene (Build 141) and then searched for ESTs in UniGene. We looked for sequences with at least 90% sequence matching and homology. We checked whether the hit sequence had MspI and MseI sites and the correct selective sequence to correspond to a pattern of selective PCR. To determine the genome region from which the transcripts detected by HiCEP are expressed, we searched the mouse genome (NCBI Mouse Build 33.1) with the BLAST Like Alignment Tool (BLAT) followed by Washington University BLAST (WU-BLAST).

Reverse Transcription-Polymerase Chain Reaction Analysis

We randomly selected 50 low-abundance transcripts with peak height less than 500 from category IIIb to validate the HiCEP analysis. The transcripts in category IIIb are strong candidates for unknown transcripts, because they are expressed from outside known gene loci. In our system, intensity 500 corresponds to approximately 10 transcript copies per cell [5]. Primers for the candidates were designed using the program Primer3. Reverse transcription-polymerase chain reaction (RT-PCR) was performed using the QuantiTect SYBR Green RT-PCR kit (Qiagen). Each reaction mixture contained 20 ng of E14 total RNA, and the signals were monitored by PRISM 7700 (Applied BioSystems). The reaction conditions were 50°C for 30 minutes and 95°C for 15 minutes for cDNA synthesis, followed by 50 cycles at 94°C for 10 seconds, 55°C for 30 seconds, and 72°C for 40 seconds.


Gene Expression Profiling of E14

We show the flow of the HiCEP reaction in brief in supplemental online Figure 1. Basically, each peak corresponds to one transcript because the false-positive rate is less than 4%, and the height of each peak reflects its expression level (Fig. 1A). We prepared the E14 cells at least three times and performed gene expression profiling analysis twice for each preparation to obtain reproducible results (supplemental online Fig. 2), and only the reproducibly detected peaks were forwarded to the next analysis step. We conclusively identified 33,136 peaks in E14 cells.

Figure 1.

Content of the ES HiCEP peak database. (A): Examples of HiCEP results. (B): Content of the database. Abbreviations: AA-AA, MspI-AA primer and MseI-AA primer; BLAST, Basic Local Alignment Search Tool; BLAT, the BLAST Like Alignment Tool; EST, expressed sequence tag; WU-BLAST, Washington University BLAST.

The biggest concern about our result was reliability: how many peaks represent transcripts in ES cells? To find out, we attempted to clone peaks randomly and successfully recovered transcripts from 14,383 peaks. Sequencing revealed that the average length of the cloned peaks was 151 bases, which was sufficient to assign each peak to a specific gene or genomic region.

Content of Peaks

Our AFLP-based method divides huge numbers of cDNA molecules into 512 groups using two nucleotides at either end of the fragments generated by digestion with two restriction enzymes, MspI and MseI. What we call the “predicted region” of a transcript is the portion of the transcript that should be amplified in the HiCEP selective PCR step; that is, the 3′-most fragment that has the second enzyme recognition site on the 3′ side and the first enzyme recognition site on the 5′ side (if both recognition sites exist; otherwise the predicted region is undefined and the transcript is not covered by HiCEP analysis) (supplemental online Fig. 1).
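The grouping and the predicted region described above can be illustrated with a minimal Python sketch. The recognition sequences CCGG (MspI) and TTAA (MseI) are the standard sites for these enzymes. Everything else is an assumption made for illustration: the function names, the toy sequence, the reading of 512 as 16 × 16 selective dinucleotide pairs in each of the two enzyme orders, and the reading of the predicted region as the stretch from the 3′-most first-enzyme site to the next second-enzyme site downstream of it.

```python
from itertools import product

# Standard recognition sequences of the two four-base cutters.
SITES = {"MspI": "CCGG", "MseI": "TTAA"}


def count_primer_groups(selective_nt=2, orders=2):
    """Number of selective-PCR groups.

    Assumption: every dinucleotide at one end is paired with every
    dinucleotide at the other end, in each of the two enzyme orders.
    """
    per_end = len(list(product("ACGT", repeat=selective_nt)))  # 16 dinucleotides
    return per_end * per_end * orders


def predicted_region(cdna, first, second):
    """Return the 3'-most fragment bounded by `first` (5' side) and `second` (3' side).

    `cdna` is the sense-strand sequence written 5'->3'. For simplicity the
    returned string includes both full recognition sites. Returns None when
    the required sites are missing, i.e. the transcript is not covered in
    this enzyme order.
    """
    site5, site3 = SITES[first], SITES[second]
    start = cdna.rfind(site5)                    # 3'-most site of the first enzyme
    if start == -1:
        return None
    end = cdna.find(site3, start + len(site5))   # next second-enzyme site downstream
    if end == -1:
        return None
    return cdna[start:end + len(site3)]


toy = "AAACCGGTTGCATCCGGACGTAGCTTAAGGGTTAAAAAA"   # hypothetical sequence
print(count_primer_groups())                  # 512
print(predicted_region(toy, "MspI", "MseI"))  # CCGGACGTAGCTTAA
print(predicted_region(toy, "MseI", "MspI"))  # None
```

In the toy sequence the transcript is covered by the MspI-MseI reaction but not by the MseI-MspI reaction, which is the kind of case the second reaction is meant to recover for other transcripts.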

A total of 16,873 kinds of fragments were found in the 14,383 peaks, meaning that several peaks were composed of multiple fragments. An in silico study revealed that 14,332 (86%) of these fragments have a match in the UniGene mRNA or EST databases (Fig. 1B). We divided them into four categories: completely known transcripts (category I), partially known transcripts (category II), unknown transcripts (category III), and a few fragments too short or repetitive to assign to a specific region in the genome (category IV) (Fig. 1B).

We subsequently subdivided the categories further. Category Ia includes predicted regions of completely known transcripts. Ib includes parts of the predicted regions of completely known transcripts, and Ic includes portions outside the predicted regions of completely known transcripts. Categories Ib and Ic contain unknown forms of alternative transcripts and transcripts that exhibit poly(A) site heterogeneity. Note that first strand synthesis sometimes occurs in A-rich regions within the transcripts and that these artifacts can be also classified into these categories. Fragments in category II were found in the UniGene EST sequences database, but their full lengths are not available. Category IIIa includes transcripts from the intronic region, whereas Category IIIb includes transcripts from fully novel gene loci. All this information is shown in supplemental online Tables 1–6.

Candidates for Novel Transcripts in the E14 Cells

A total of 698 (546 in category IIIb + 152 in IIIc) transcripts are derived from a genomic region in which no open reading frame has been identified or predicted (Fig. 1B). These can be assumed to be transcripts expressed from novel gene loci. An additional 243 in category Ib are unregistered transcripts expressed from known gene loci, making a total of 941 (698 + 243) unregistered transcripts.

Categories Ic and IIIa also can contain novel transcripts, but we did not count these as novel, because these categories can also contain transcripts that include new poly(A) sites in known transcripts, intronic regions from immature forms of known transcripts, or artifacts caused by misannealing as mentioned above.

Our analysis suggests that approximately 2,000 (941 × 33,136/14,383) novel transcripts remain in E14 cells and that these novel transcripts tended to produce low-intensity peaks. RT-PCR confirmed their expression, as mentioned below.
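The extrapolation above can be checked with a few lines of Python. This is only a sketch of the arithmetic stated in the text; the variable names are illustrative, and the scaling assumes that the cloned peaks are a representative sample of all detected peaks.

```python
# Counts reported in the text.
novel_loci = 546 + 152            # categories IIIb + IIIc
unregistered_known_loci = 243     # category Ib
novel_in_cloned = novel_loci + unregistered_known_loci

peaks_detected = 33_136
peaks_cloned = 14_383

# Assumption: the randomly cloned peaks represent all detected peaks.
novel_estimate = novel_in_cloned * peaks_detected / peaks_cloned
print(novel_in_cloned, round(novel_estimate))   # 941 2168
```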

We confirmed that no peaks, except for a few mechanical artifacts, were detected in a HiCEP analysis with genomic DNA only as a template, and no difference was observed between the analyses using total RNA with and without DNaseI treatment (data not shown), indicating that no signals come from genomic DNA, which can contaminate the RNA fraction.

The novel transcripts detected by our analysis are very interesting, because they may be ES-specific. We examined whether or not these novel transcripts are expressed in an ES-specific manner using randomly selected transcripts. A decrease in expression after the removal of LIF from the culture medium was investigated (Fig. 2A). Two ES-specific transcripts were found in category IIIa, and these were assigned to intronic regions of known genes whose functions were unknown. Four decreasing transcripts were identified in category IIIb and one in category IV. These transcripts are derived from regions of the genome where no open reading frame has been predicted.

Figure 2.

Expression change of novel and noncoding transcripts depending on cell differentiation. (A): Representatives of the newly identified transcripts. Transcripts identified by HiCEP whose expression decreases depending on differentiation after removal of LIF from culture medium. The y- and x-axes indicate peak height and days after the removal of LIF, respectively. (B): Noncoding transcripts. The arrows indicate the peaks of interest. Note that except for the ones indicated by arrows, the peaks do not exhibit any changes in height after LIF removal. Ia_8003 (fibroblast growth factor 4) and Ia_699 (NADH dehydrogenase 1α subcomplex 8) are shown as positive and negative controls, respectively (supplemental online Table 1). The y-axis indicates peak height. Abbreviation: LIF, leukemia inhibitory factor.

Candidates for Noncoding Transcripts in ES Cells

Recently more than 34,000 noncoding transcripts were reported in the mouse genome [12]. We checked how many were detected in our analysis and found a total of 1,022 candidates: 416 in category Ia, 10 in Ib, 499 in Ic, 26 in II, 15 in IIIa, 30 in IIIb, and 25 in IV. An additional 530 transcripts detected by HiCEP may be noncoding, because no open reading frame was found in or near them. Because the coverage of HiCEP is incomplete, as we will discuss below, there are probably a few thousand more noncoding transcripts in ES cells, supporting a previous report. Our database includes the measured expression levels of such low-abundance noncoding transcripts.

It would be interesting to know whether the noncoding transcript candidates play roles in the mechanism underlying the pluripotency of ES cells. Among the candidates detected, we found two transcripts whose expression changes when LIF is removed from the culture medium, 3_1463 (supplemental online Table 5) and 1c2_6130 (supplemental online Table 3), corresponding to B930044J18 and 4833429C02, respectively, in the FANTOM noncoding transcript library. The former was suppressed, and the latter was induced (Fig. 2B). B930044J18 is located in the intronic region of A630034I12Rik, whose function is unknown, and 4833429C02 is located 1 kilobase pair (kbp) downstream from the 3′ end of the paternally expressed 3 gene (Peg3). No functional information has been obtained yet, so additional studies on these transcripts are needed.


Total Number of Transcripts in E14

Since the false-positive rate is extremely low in our database and in most cases a signal peak corresponds to one transcript, we can estimate the total number of transcripts expressed in E14 cells as follows: 33,136 (the number of peaks detected in HiCEP) × 0.964 (1 − false-positive rate) × 1.17 (transcripts per peak) × 1.17 (1/coverage) = 43,727.

If misannealing occurs during the selective PCR step, the sequences at both ends of the product fragments reflect the sequence of primers used. Therefore, the sequences at both ends of all products amplified by selective PCR using a given set of primers are the same, regardless of whether or not misannealing occurs. So we can tell whether misannealing has occurred by comparing the HiCEP fragment with the sequence registered in the public database. Analyzing the 16,873 cloned fragments revealed that the false-positive rate is 3.8%. Incidentally, one particular primer set was responsible for most of the false positives (manuscript in preparation).

Each peak corresponds roughly to one transcript, but this correspondence depends on the complexity of the transcriptome. Even if selective PCR works perfectly (i.e., without misannealing), the amplified fragments may overlap by chance. The incidence of overlapping depends on the complexity of the transcriptome. We estimated the transcripts-per-peak value through detailed analysis of the 14,383 peaks isolated and characterized as described above; the ratio is 1.17. Some peaks are still overlapping peaks corresponding to multiple transcripts in mice, and this value will only increase with more extensive cloning. The effect is more severe in species having a complex genome, although the use of additional restriction enzyme sets will overcome this difficulty. Coverage was estimated in silico to be 85.7% using information from 33,434 full-length cDNA sequences in the public database. Taken together, these data lead us to conclude that more than 40,000 kinds of transcripts are expressed in E14 cells.
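The estimate of the total number of transcripts can be reproduced with the following Python sketch. It uses the rounded factors exactly as printed in the text; the variable names are illustrative.

```python
peaks_detected = 33_136
false_positive_rate = 0.036   # implied by the printed factor 0.964 (the measured rate is 3.8%)
transcripts_per_peak = 1.17   # about 16,873 fragments / 14,383 cloned peaks
coverage = 0.857              # in silico estimate from full-length cDNA sequences

# The text uses 1/coverage rounded to two decimals.
inverse_coverage = round(1 / coverage, 2)

total = (peaks_detected
         * (1 - false_positive_rate)
         * transcripts_per_peak
         * inverse_coverage)
print(inverse_coverage, round(total))   # 1.17 43727
```

Using the unrounded ratios (16,873/14,383 and 1/0.857) or the measured false-positive rate of 3.8% shifts the result by only a few hundred transcripts, so the conclusion of more than 40,000 transcripts does not depend on the rounding.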

We estimated the total number of transcripts using the information obtained by random cloning of more than 16,000 transcripts. However, we could not totally exclude the possibility that our sample is biased, because the expression level of the transcripts isolated by random cloning tends to be rather high. On this matter, additional studies will be needed. Nevertheless, we are confident that at least 40,000 kinds of transcripts are expressed, since the transcript-per-peak ratio can only increase with further cloning: the 14,383 peaks almost certainly contain more than the 16,873 transcripts identified so far (Fig. 1B).

On the other hand, annealing between the oligo(dT) primer and A-rich regions within mRNA sometimes occurs. This means that the predicted total number of transcripts could be an overestimate, since multiple tags can be generated from one transcript. We use oligo(dT) primer for cDNA synthesis, so this is an unavoidable problem. In addition, it is hard to discriminate between such artifacts and novel forms of transcripts whose polyadenylation occurs at unknown sites. We experimented with primer-annealing temperatures of 42°C and 50°C, and few differences were observed in the peak pattern, so we performed the reaction at 50°C to minimize this effect.

Finally, we want to emphasize that we detected approximately 4,000 signals with HiCEP analysis in Saccharomyces cerevisiae, which is the simplest eukaryote. This number is nearly equal to that estimated by another method under similar culture conditions [5]. In addition, the prediction of peak positions using public databases was quite efficient for most signals due to the presence of few exons and thus few alternative transcripts, almost all of which are already identified. These results mean that few artifacts generated by partial digestion by restriction enzymes or inefficient washing were observed.

Ability of HiCEP to Quantitatively Measure Rarely Expressed Transcripts

HiCEP analysis is sensitive enough to detect 1 transcript per cell [5]. This is a critical point for studies such as gene network analysis because substantial numbers of gene-expression regulation factor-coding transcripts are low-abundance transcripts.

Here, we add more information to indicate the potential of our method by comparing the results of HiCEP and those of quantitative PCR analysis. We statistically analyzed the expression of unknown transcripts and found that significant numbers of unknown transcripts exhibit low levels of expression (Fig. 3A). Next, we attempted to design primers for 50 unknown transcripts whose HiCEP peaks exhibited low intensity, and for 46 of the 50, our primers worked well. Real-time PCR detected clear bands for 43 of the 46 (Fig. 3B) and revealed low expression similar to that exhibited by HiCEP analysis (supplemental online Fig. 3B). We confirmed that these products were not amplified from genomic DNA contaminating the RNA preparation (supplemental online Fig. 3A). HiCEP technology has been validated by quantitative PCR in other studies as well [5, 13–15].

Figure 3.

Validation of the high-coverage gene expression profiling (HiCEP) analysis with reverse transcription-polymerase chain reaction (RT-PCR). (A): Distributions of peak heights of unknown transcripts from unknown gene loci (left) and known transcripts (right). The x-axis shows the numbers of fragments in each peak height range. The y-axis shows peak height range with 400 width. (B): Detection of unknown transcripts identified by HiCEP with reverse transcription-PCR. The photographs show electropherograms of PCR products using 2% agarose. The molecular weight marker is the 100-bp DNA ladder. Lanes 1–46: 3_1431, 3_1448, 3_1458, 3_1467, 3_1473, 3_1489, 3_1494, 3_1501, 3_1507, 3_1522, 3_1676, 3_1680, 3_1689, 3_1702, 3_1703, 3_1705, 3_1759, 3_1793, 3_1797, 3_1803, 3_1807, 3_1817, 3_1827, 3_1837, 3_1855, 3_1861, 3_1862, 3_1863, 3_1866, 3_1878, 3_1883, 3_1891, 3_1907, 3_1928, 3_1929, 3_1934, 3_1936, 3_1946, 3_1954, 3_1955, 3_1964, 3_1996, 3_1999, 3_2003, 3_2012, and 3_2033, respectively (supplemental online Table 3). Oct3/4 in lane 47 and Nanog in lane 48 were used as positive controls. No genome contamination was shown in supplemental online Figure 3A.

Furthermore, we compared peak height in HiCEP to frequency in EST collection (92,000 ESTs from E14; this EST collection from E14 contains 4,800 singletons) (Fig. 4A). Among the peaks amplified by TG-AT primers, the peaks in category I (Fig. 1B) were compared with the results of EST collection. A close correlation between peak height and frequency in EST collection was observed for Mm.88212, Mm.24886, Mm.22575, and Mm.94371. The peaks for Mm.258568 and Mm.138512 seem somewhat high, suggesting that they are overlapping peaks. It is noteworthy that HiCEP detected transcripts that were not detected by EST collection, Mm.295480, Mm.271555, and Mm.258568. In addition, to gauge the sensitivity of our method, we focused on the ES markers in the reproducibility figure of HiCEP (Fig. 4B). It is clear that the ES markers are quite abundant and that HiCEP has the ability to detect transcripts whose expression is less than 1/100th that of these ES markers. The ability of HiCEP to perform quantitative analysis of low-abundance transcripts is a crucial advantage.

Figure 4.

Comparison of the sensitivity of high-coverage gene expression profiling (HiCEP) and of other procedures. (A): Detection of low-abundance transcripts in HiCEP. Comparison between the height in HiCEP and frequency in EST collection is shown. Results of two independent HiCEP analyses are shown. The figures (times) indicate the frequency found in our EST collection. (B): Peak height in HiCEP analysis for known ES marker genes. A scatter-plot analysis of lot 1 and lot 2 for E14 is shown. This figure shows the relationship between the height of peaks and reproducibility. The lines around the diagonal indicate differences in gene expression (1.2-, 1.5-, 2.0-, and 3.0-fold). Abbreviations: EST, expressed sequence tag; Fgf, fibroblast growth factor; Oct, octamer binding transcription factor; Rex, reduced expression; Sox, SRY-box-containing gene; UTF, undifferentiated embryonic cell transcription factor.

Usefulness of the Peak Database for Other Cells

A peak database that indicates the relationship between a peak in HiCEP and a transcript or a genomic region registered in a public database would be very helpful, because it would allow us to perform extensive analyses without peak isolation. But each cell line has its own transcriptome. Would we need a different database for each one?

We examined another ES cell line, R1, to see whether the database for a given cell type is applicable to cells of a related type, and we found that the positions of most peaks were identical (Fig. 5). However, a significant number of peaks were different in height (supplemental online Fig. 4). We analyzed the R1 peaks to determine whether the peaks appearing in both R1 and E14 lines were derived from identical transcripts. We cloned 43 randomly selected signals using the R1 cell line with the MspI-CA and MseI-GG primer set, sequenced them, and found that the transcripts corresponding to these peaks were identical to those of the E14 peaks (data not shown). This implies that most of the shared peaks are derived from identical transcripts. These results show that the HiCEP peak database of E14 is applicable to R1 cell analysis and strongly suggest that the database is useful for most murine ES cell lines.

Figure 5.

Usefulness of the high-coverage gene expression profiling (HiCEP) peak database for other embryonic stem cells. Representative gene expression profiling obtained by HiCEP analysis of E14 and R1 cells. Results of two independent analyses are shown as lot1 and lot2.

We also observed similarities between peaks in E14 cells and in the mouse embryonic fibroblast (MEF) cell line MEF3T3. This seems reasonable since there are a lot of housekeeping genes, most of which are common over most cell lineages; however, more precise information is necessary before the database is used with other cell types.

We tried to compare the expression of some genes whose products play critical roles in the mechanism underlying pluripotency, Nanog, Oct3/4, and Sox2. We found that Oct3/4 and Sox2 are expressed clearly in E14 cells, but no expression could be observed in MEFs (supplemental online Fig. 5). HiCEP cannot be used to analyze the Nanog transcript, because it contains no recognition sites for the restriction enzyme MspI, which HiCEP uses. These results were confirmed by quantitative PCR, and the ES-specific Nanog expression was also confirmed.

It is also noteworthy that we found many (1,858) genomic regions from which transcripts are expressed in both sense and antisense orientations. This suggests that there are more than 4,000 such transcripts in E14 cells. Our database will provide quantitative information on these transcripts. The next question is what their role is.

It has been suggested that ES cells are relatively unstable in in vitro culture and often partially differentiate under standard culture conditions [16]. Whether or not our database really represents the transcriptome of the immature stem cell is an important point. The following observations suggest that most ES cells used in the present study maintain their pluripotency. (a) The ES cells injected into blastocysts developed into intact mice. (b) Most of them were alkaline phosphatase-positive. (c) SSEA1 products were positive [17]. Nanog, Oct3/4, Sox2, Rex1, fibroblast growth factor 4 (Fgf4), and Utf1 were also positive [18–22]. (d) Three mouse ES cell lines (E14, R1, and TT2) share quite similar expression profiles (Fig. 5) (manuscript in preparation). (e) We identified a substantial number of transcripts whose expression decreases drastically upon removal of LIF in culture (Fig. 2). However, at the same time, slight expression of H19 and Fgf5 transcripts, which are known differentiation markers, was also observed (data not shown).

There have been several studies of the transcriptome of ES cells. SAGE suggested 44,569 unique tags, including 31,184 that were identified once, in mouse ES cell line R1 [23]. We attempted to compare their results with ours. However, direct comparison of tag sequences between SAGE and HiCEP was impossible, since the tags of these two types of analyses are derived from different regions of the transcript. Only tags assigned to genes whose full-length sequence or genome organization had already been determined could be used for the comparison. Even in this case, the comparison would not be accurate, since alternative transcripts in a gene locus could not be discriminated.

Focusing on known gene loci, we compared the gene symbols corresponding to the results of the two analyses and found that 5,895 tags were detected in both analyses, 4,712 were detected by SAGE only, and 1,727 were detected by HiCEP only. This comparison has several limitations. Alternative transcripts from the same gene locus were counted as one. The tags assigned to novel gene loci could not be included, because their full-length cDNA sequences and genomic organization have not yet been determined. Furthermore, most SAGE tags match more than two regions on the genome and could not be assigned to specific genes, resulting in overestimation of the number of gene loci.
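A comparison of this kind amounts to set operations on gene symbols. The following Python sketch shows one way it could be done; the function name and the toy symbol lists are hypothetical, and only the three reported counts come from the text.

```python
def compare_gene_symbols(sage_symbols, hicep_symbols):
    """Count gene symbols found by both methods or by only one of them."""
    sage, hicep = set(sage_symbols), set(hicep_symbols)
    return {
        "both": len(sage & hicep),
        "sage_only": len(sage - hicep),
        "hicep_only": len(hicep - sage),
    }


# Hypothetical toy input, for illustration only.
toy_counts = compare_gene_symbols(
    ["Pou5f1", "Sox2", "Nanog"],
    ["Pou5f1", "Sox2", "Utf1", "Fgf4"],
)
print(toy_counts)   # {'both': 2, 'sage_only': 1, 'hicep_only': 2}

# Totals implied by the counts reported in the text.
reported = {"both": 5895, "sage_only": 4712, "hicep_only": 1727}
sage_total = reported["both"] + reported["sage_only"]
hicep_total = reported["both"] + reported["hicep_only"]
union_total = sum(reported.values())
print(sage_total, hicep_total, union_total)   # 10607 7622 12334
```

Collapsing each locus to a single symbol, as done here, is exactly why alternative transcripts from one locus cannot be distinguished in this comparison.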

In general, rarely expressed genes are detected as singletons in SAGE analysis, and this discourages further statistical study since the assignment of singletons is prone to sequencing errors and single nucleotide polymorphisms. Our results contain accurate gene expression information for over 70% of the transcripts and annotations for 45%. With this database and HiCEP analysis, quantitative gene expression profiling of mouse ES cells becomes possible. This system enables us to measure quantitatively a number of noncoding transcripts, as well as known and unknown protein-encoding transcripts, with high resolution. Currently, we are developing a system using computers to compare the expression change of every peak detected in HiCEP analysis automatically and comprehensively [24]. We will release updated information about the HiCEP database on our homepage.

“How many transcripts are expressed in mice?” is quite an important question, and answering it is one of our goals. Using the whole body to count the total number of transcripts is not realistic because analysis would suffer from the complexity of cell types, leading analysts to miss a substantial number of low-abundance transcripts. We must integrate results using every tissue in the body at every stage of differentiation. Although we report only on a database of embryonic stem cells here, we plan to set up a system in which HiCEP analysis results will be registered and available to the scientific community, enabling us to address this important issue.

We used oligo(dT) primer for the cDNA synthesis, so we were measuring only RNA molecules with poly(A) tails. Recently it has become clear that there are several types of RNA molecules without poly(A) tails, but our system does not detect such molecules [25].


  1. We detected a total of 33,136 signal peaks representing transcripts in mouse E14 embryonic stem cells, suggesting a total of more than 40,000 transcripts in ES cells.

  2. We provide a high-resolution transcriptome database of E14 cells that is applicable to other mouse ES cell analyses. This database contains 16,873 transcripts, including 1,552 candidates for noncoding transcripts.

  3. Some ES-specific noncoding transcripts were identified.

  4. A delineation of low-abundance transcripts in E14 cells was made.


M.A. and T.S. own stock in Messenger Scape Co., Ltd. N.S. and T.S. served as officers or members of the board of Messenger Scape Co., Ltd. within the last 2 years.


We are most grateful to M. Hooper and A. Nagy for providing us with the E14 and R1 cell lines, respectively. We also thank J.J. Rodrigue for helpful discussions and editing of the English manuscript. We are grateful to A. Nifuji and K. Kadota for discussions and to K. Nakamura, A. Uotsu, M. Ohtani, K. Nishikawa, A. Ishibashi, Y. Harada, S. Ohki, K. Mori, S. Ando, R. Fujii, and Y. Hoki-Fujimori for technical assistance. This work was partly supported by Research Grants from the Japanese Science and Technology Agency. R.A., R.F., and N.S. contributed equally to this work.