Total Number of Transcripts in E14
Since the false-positive rate is extremely low in our database and in most cases a signal peak corresponds to one transcript, we can estimate the total number of transcripts expressed in E14 cells as follows: 33,136 (the number of peaks detected in HiCEP) × 0.964 (1 − false-positive rate) × 1.17 (transcripts per peak) × 1.17 (1/coverage) = 43,727.
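As a quick check of the arithmetic, the estimate above can be reproduced in a few lines. This is only a sketch: the function and parameter names are illustrative and not part of any HiCEP software, and the factors are simply the values printed in the text.

```python
def estimate_total_transcripts(peaks_detected, true_positive_fraction,
                               transcripts_per_peak, inverse_coverage):
    """Multiply the four factors used in the text's estimate.

    All names are hypothetical; the formula is the plain product
    stated in the paragraph above.
    """
    return (peaks_detected * true_positive_fraction
            * transcripts_per_peak * inverse_coverage)


# Values as printed in the text.
estimate = estimate_total_transcripts(
    peaks_detected=33_136,         # peaks detected in HiCEP
    true_positive_fraction=0.964,  # 1 - false-positive rate
    transcripts_per_peak=1.17,     # transcripts per peak
    inverse_coverage=1.17,         # 1 / coverage
)
print(round(estimate))
```

Because the estimate is a plain product, a change of a few percent in any single factor shifts the total by the same percentage.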
If misannealing occurs during the selective PCR step, the sequences at both ends of the product fragments reflect the sequences of the primers used. Therefore, the sequences at both ends of all products amplified by selective PCR using a given set of primers are the same, regardless of whether or not misannealing occurs. We can thus tell whether misannealing has occurred by comparing the HiCEP fragment with the sequence registered in the public database. Analysis of the 16,873 cloned fragments revealed that the false-positive rate is 3.6%. Incidentally, one particular primer set was responsible for most of the false positives (manuscript in preparation).
Each peak corresponds roughly to one transcript, but this correspondence depends on the complexity of the transcriptome. Even if selective PCR works perfectly (i.e., without misannealing), the amplified fragments may overlap by chance. We estimated the transcripts-per-peak value through detailed analysis of the 14,383 peaks isolated and characterized as described above; the ratio is 1.17. Some peaks in mice are still overlapping peaks that correspond to multiple transcripts, and this value is expected to increase with more extensive cloning. The effect is more severe in species with a more complex genome, although the use of additional restriction enzyme sets should overcome this difficulty. Coverage was estimated in silico to be 85.7% using information from 33,434 full-length cDNA sequences in the public database. Taken together, these data lead us to conclude that more than 40,000 kinds of transcripts are expressed in E14 cells.
We estimated the total number of transcripts using the information obtained by random cloning of more than 16,000 transcripts. However, we cannot completely exclude the possibility that our sample is biased, because the expression level of the transcripts isolated by random cloning tends to be rather high. Additional studies will be needed on this matter. Nevertheless, we are confident that at least 40,000 kinds of transcripts are expressed, since the transcripts-per-peak ratio is expected to increase: with more extensive cloning, the 14,383 peaks would almost certainly be found to contain more than 16,873 transcripts (Fig. 1B).
On the other hand, annealing between the oligo(dT) primer and A-rich regions within mRNA sometimes occurs. This means that the predicted total number of transcripts could be an overestimate, since multiple tags can be generated from one transcript. We use an oligo(dT) primer for cDNA synthesis, so this problem is unavoidable. In addition, it is hard to discriminate between such artifacts and novel forms of transcripts whose polyadenylation occurs at unknown sites. We tested primer-annealing temperatures of 42°C and 50°C and observed few differences in the peak pattern, so we performed the reaction at 50°C to minimize this effect.
Finally, we want to emphasize that we detected approximately 4,000 signals with HiCEP analysis in Saccharomyces cerevisiae, one of the simplest eukaryotes. This number is nearly equal to that estimated by another method under similar culture conditions. In addition, the prediction of peak positions using public databases was quite efficient for most signals, because yeast genes have few exons and thus few alternative transcripts, almost all of which have already been identified. These results indicate that few artifacts were generated by partial restriction enzyme digestion or inefficient washing.
Ability of HiCEP to Quantitatively Measure Rarely Expressed Transcripts
HiCEP analysis is sensitive enough to detect 1 transcript per cell. This is a critical point for studies such as gene network analysis, because a substantial number of transcripts encoding regulators of gene expression are low-abundance transcripts.
Here, we provide additional evidence of the potential of our method by comparing the results of HiCEP with those of quantitative PCR analysis. We statistically analyzed the expression of unknown transcripts and found that a significant number of them exhibit low levels of expression (Fig. 3A). Next, we attempted to design primers for 50 unknown transcripts whose HiCEP peaks exhibited low intensity; for 46 of the 50, our primers worked well. Real-time PCR detected clear bands for 43 of the 46 (Fig. 3B) and revealed low expression similar to that shown by HiCEP analysis (supplemental online Fig. 3B). We confirmed that these products were not amplified from genomic DNA contaminating the RNA preparation (supplemental online Fig. 3A). HiCEP technology has been validated by quantitative PCR in other studies as well [5, 13–15].
Figure 3. Validation of the high-coverage gene expression profiling (HiCEP) analysis with reverse transcription-polymerase chain reaction (RT-PCR). (A): Distributions of peak heights of unknown transcripts from unknown gene loci (left) and known transcripts (right). The x-axis shows the numbers of fragments in each peak height range. The y-axis shows peak height range with 400 width. (B): Detection of unknown transcripts identified by HiCEP with reverse transcription-PCR. The photographs show electropherograms of PCR products using 2% agarose. The molecular weight marker is the 100-bp DNA ladder. Lanes 1–46: 3_1431, 3_1448, 3_1458, 3_1467, 3_1473, 3_1489, 3_1494, 3_1501, 3_1507, 3_1522, 3_1676, 3_1680, 3_1689, 3_1702, 3_1703, 3_1705, 3_1759, 3_1793, 3_1797, 3_1803, 3_1807, 3_1817, 3_1827, 3_1837, 3_1855, 3_1861, 3_1862, 3_1863, 3_1866, 3_1878, 3_1883, 3_1891, 3_1907, 3_1928, 3_1929, 3_1934, 3_1936, 3_1946, 3_1954, 3_1955, 3_1964, 3_1996, 3_1999, 3_2003, 3_2012, and 3_2033, respectively (supplemental online Table 3). Oct3/4 in lane 47 and Nanog in lane 48 were used as positive controls. The absence of genomic DNA contamination is shown in supplemental online Figure 3A.
Furthermore, we compared peak height in HiCEP with frequency in our EST collection (92,000 ESTs from E14, including 4,800 singletons) (Fig. 4A). Among the peaks amplified by the TG-AT primers, those in category I (Fig. 1B) were compared with the results of the EST collection. A close correlation between peak height and frequency in the EST collection was observed for Mm.88212, Mm.24886, Mm.22575, and Mm.94371. The peaks for Mm.258568 and Mm.138512 seem somewhat high, suggesting that they are overlapping peaks. It is noteworthy that HiCEP detected transcripts that were not detected in the EST collection: Mm.295480, Mm.271555, and Mm.258568. In addition, to gauge the sensitivity of our method, we focused on the ES markers in the reproducibility figure of HiCEP (Fig. 4B). It is clear that the ES markers are quite abundant and that HiCEP can detect transcripts whose expression is less than 1/100th that of these ES markers. The ability of HiCEP to perform quantitative analysis of low-abundance transcripts is a crucial advantage.
Figure 4. Comparison of the sensitivity of high-coverage gene expression profiling (HiCEP) and of other procedures. (A): Detection of low-abundance transcripts in HiCEP. Comparison between the height in HiCEP and frequency in EST collection is shown. Results of two independent HiCEP analyses are shown. The figures (times) indicate the frequency found in our EST collection. (B): Peak height in HiCEP analysis for known ES marker genes. A scatter-plot analysis of lot 1 and lot 2 for E14 is shown. This figure shows the relationship between the height of peaks and reproducibility. The lines around the diagonal indicate differences in gene expression (1.2-, 1.5-, 2.0-, and 3.0-fold). Abbreviations: EST, expressed sequence tag; Fgf, fibroblast growth factor; Oct, octamer binding transcription factor; Rex, reduced expression; Sox, SRY-box-containing gene; UTF, undifferentiated embryonic cell transcription factor.
Usefulness of the Peak Database for Other Cells
A peak database that indicates the relationship between a peak in HiCEP and a transcript or a genomic region registered in a public database would be very helpful, because it would allow us to perform extensive analyses without peak isolation. But each cell line has its own transcriptome. Would we need a different database for each one?
We examined another ES cell line, R1, to see whether the database for a given cell type is applicable to cells of a related type, and we found that the positions of most peaks were identical (Fig. 5). However, a significant number of peaks were different in height (supplemental online Fig. 4). We analyzed the R1 peaks to determine whether the peaks appearing in both R1 and E14 lines were derived from identical transcripts. We cloned 43 randomly selected signals using the R1 cell line with the MspI-CA and MseI-GG primer set, sequenced them, and found that the transcripts corresponding to these peaks were identical to those of the E14 peaks (data not shown). This implies that most of the shared peaks are derived from identical transcripts. These results show that the HiCEP peak database of E14 is applicable to R1 cell analysis and strongly suggest that the database is useful for most murine ES cell lines.
Figure 5. Usefulness of the high-coverage gene expression profiling (HiCEP) peak database for other embryonic stem cells. Representative gene expression profiling obtained by HiCEP analysis of E14 and R1 cells. Results of two independent analyses are shown as lot1 and lot2.
We also observed similarities between peaks in E14 cells and in the mouse embryonic fibroblast (MEF) cell line MEF3T3. This seems reasonable, since many housekeeping genes are expressed in common across most cell lineages; however, more precise information is necessary before the database is used with other cell types.
We compared the expression of genes whose products play critical roles in the mechanism underlying pluripotency: Nanog, Oct3/4, and Sox2. We found that Oct3/4 and Sox2 are clearly expressed in E14 cells, whereas no expression was observed in MEFs (supplemental online Fig. 5). HiCEP cannot be used to analyze the Nanog transcript, because it contains no recognition site for the restriction enzyme MspI, which HiCEP uses. These results were confirmed by quantitative PCR, which also confirmed the ES-specific expression of Nanog.
It is also noteworthy that we found many (1,858) genomic regions from which transcripts are expressed in both sense and antisense orientations. This suggests that there are more than 4,000 such transcripts in E14 cells. Our database will provide quantitative information on these transcripts. The next question is what role they play.
It has been suggested that ES cells are relatively unstable in in vitro culture and often partially differentiate under standard culture conditions. Whether or not our database really represents the transcriptome of the immature stem cell is an important point. The following observations suggest that most ES cells used in the present study maintain their pluripotency. (a) The ES cells injected into blastocysts developed into intact mice. (b) Most of them were alkaline phosphatase-positive. (c) SSEA1 was positive. Nanog, Oct3/4, Sox2, Rex1, fibroblast growth factor 4 (Fgf4), and Utf1 were also positive [18–22]. (d) Three mouse ES cell lines (E14, R1, and TT2) share quite similar expression profiles (Fig. 5) (manuscript in preparation). (e) We identified a substantial number of transcripts whose expression decreases drastically upon removal of LIF in culture (Fig. 2). At the same time, however, slight expression of H19 and Fgf5 transcripts, which are known differentiation markers, was also observed (data not shown).
There have been several studies of the transcriptome of ES cells. SAGE suggested 44,569 unique tags, including 31,184 that were identified once, in mouse ES cell line R1. We attempted to compare these results with ours. However, direct comparison of tag sequences between SAGE and HiCEP was impossible, since the tags of these two types of analyses are derived from different regions of the transcript. Only tags assigned to genes whose full-length sequence or genome organization had already been determined could be used for the comparison. Even in this case, the comparison would not be accurate, since alternative transcripts from a gene locus could not be discriminated.
Focusing on known gene loci, we compared the gene symbols corresponding to the results of the two analyses and found that 5,895 tags were detected in both analyses, 4,712 were detected by SAGE only, and 1,727 were detected by HiCEP only. This comparison has several limitations. Alternative transcripts from the same gene locus were counted as one. Tags assigned to novel gene loci could not be included, because their full-length cDNAs and genomic structures have not yet been determined. Furthermore, most SAGE tags match more than two regions on the genome and could not be assigned to specific genes, resulting in overestimation of the number of gene loci.
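The three counts above fully determine the overlap between the two methods. The sketch below derives the per-method totals and overlap fractions from them; the variable names are illustrative only, and the derived fractions inherit every limitation listed above.

```python
# Counts reported in the text (known gene loci only).
detected_by_both = 5_895
sage_only = 4_712
hicep_only = 1_727

# Totals implied by the three counts.
sage_total = detected_by_both + sage_only
hicep_total = detected_by_both + hicep_only
union_total = detected_by_both + sage_only + hicep_only

# Share of each method's entries that the other method also detected.
hicep_shared_fraction = detected_by_both / hicep_total
sage_shared_fraction = detected_by_both / sage_total

print(f"SAGE total: {sage_total}, HiCEP total: {hicep_total}, union: {union_total}")
print(f"HiCEP entries also found by SAGE: {hicep_shared_fraction:.1%}")
print(f"SAGE entries also found by HiCEP: {sage_shared_fraction:.1%}")
```

These fractions describe agreement at the level of gene symbols only. They say nothing about alternative transcripts or novel loci, which the comparison excludes.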
In general, rarely expressed genes are detected as singletons in SAGE analysis, and this discourages further statistical study, since the assignment of singletons is prone to sequencing errors and single nucleotide polymorphisms. Our results contain accurate gene expression information for over 70% of the transcripts and annotations for 45%. With this database and HiCEP analysis, quantitative gene expression profiling of mouse ES cells becomes possible. This system enables us to measure quantitatively a number of noncoding transcripts, as well as known and unknown protein-encoding transcripts, with high resolution. Currently, we are developing a computer-based system to compare the expression change of every peak detected in HiCEP analysis automatically and comprehensively. We will release updated information about the HiCEP database on our homepage (http://220.127.116.11/english/index.html).
“How many transcripts are expressed in mice?” is quite an important question, and answering it is one of our goals. Using the whole body to count the total number of transcripts is not realistic because analysis would suffer from the complexity of cell types, leading analysts to miss a substantial number of low-abundance transcripts. We must integrate results using every tissue in the body at every stage of differentiation. Although we report only on a database of embryonic stem cells here, we plan to set up a system in which HiCEP analysis results will be registered and available to the scientific community, enabling us to address this important issue.
We used an oligo(dT) primer for the cDNA synthesis, so we were measuring only RNA molecules with poly(A) tails. Recently it has become clear that there are several types of RNA molecules without poly(A) tails, but our system does not detect such molecules.