Microarray technology provides a unique opportunity to examine gene expression patterns in human embryonic stem cells (hESCs). We performed a meta-analysis of 38 original studies reporting on the transcriptome of hESCs. We determined that 1,076 genes were found to be overexpressed in hESCs by at least three studies when compared to differentiated cell types, thus composing a “consensus hESC gene list.” Only one gene was reported by all studies: the homeodomain transcription factor POU5F1/OCT3/4. The list comprised other genes critical for pluripotency such as the transcription factors NANOG and SOX2, and the growth factors TDGF1/CRIPTO and Galanin. We show that CD24 and SEMA6A, two cell surface protein-coding genes from the top of the consensus hESC gene list, display a strong and specific membrane protein expression on hESCs. Moreover, CD24 labeling permits the purification by flow cytometry of hESCs cocultured on human fibroblasts. The consensus hESC gene list also included the FZD7 WNT receptor, the G protein-coupled receptor GPR19, and the HELLS helicase, which could play an important role in hESC biology. Conversely, we identified 783 genes downregulated in hESCs and reported in at least three studies. This “consensus differentiation gene list” included the IL6ST/GP130 LIF receptor. We created an online hESC expression atlas, http://amazonia.montp.inserm.fr, to provide easy access to this public transcriptome dataset. Expression histograms comparing hESCs to a broad collection of fetal and adult tissues can be retrieved with this web tool for more than 15,000 genes.
Disclosure of potential conflicts of interest is found at the end of this article.
In the preimplantation mammalian embryo, the inner cell mass is able to differentiate into any cell type of the embryo proper. It has been recognized in mice since 1981 that embryonic stem cells (ESCs) with a prolonged proliferative capacity in vitro can be derived from the inner cell mass. ESC line derivation from human embryos was reported in 1998. ESCs are pluripotent cells that can contribute to all tissues in vivo, and to the three primary germ layers as well as extraembryonic tissues in vitro. Because pluripotency is maintained even after prolonged periods of culture, human ESCs (hESCs) have a great therapeutic potential in regenerative medicine. Careful molecular characterization of this unique cellular model of pluripotency should help to optimize and scale up in vitro production of hESCs for clinical applications. Some genes specific to the very early stages of development and expressed in hESC lines such as POU5F1/OCT3/4, NANOG, REX1, SOX2, FGF4, and FOXD3 have already been identified [3–8]. However, the picture is far from complete, and the molecular mechanisms involved in self-renewal and pluripotency are still under tight scrutiny [9, 10]. Moreover, extensive knowledge of hESCs should help in designing protocols for the isolation and production of other multi-/pluripotent stem cells derived from tissues of adult individuals.
Microarrays are a major technical breakthrough that can monitor the expression of a whole genome in one experiment. Application of this technology to hESCs has largely contributed to our knowledge of the mechanisms underlying the maintenance of pluripotency of hESCs and their in vitro differentiation. Unfortunately, the datasets generated are heterogeneous both in accessibility (public databases or supplemental online data) and in the techniques used (variability in microarray design, sample labeling techniques, choice of control samples, and computational tools). Although we found 38 original publications reporting hESC transcriptome analyses, the disparities between datasets, the variety of sources, and the substantial know-how needed for transcriptome data mining discourage most nonspecialists from consulting this information. Thus, large amounts of data are very much underused, and there is a lack of alternative interpretations because only the conclusions reflecting the analysis carried out by the authors are presented. This situation has led to initiatives such as ONCOMINE in the field of cancer. We present here the first effort in compiling all publicly available microarray data relating to hESCs. From the 38 original publications studying the hESC transcriptome, we identified genes that were consistently overexpressed in hESCs when compared to differentiated samples (“consensus hESC gene list”) and underexpressed in hESCs (“consensus differentiation gene list”) in different studies. These lists will further deepen our knowledge of this unique cell model of developmental biology. Concurrently, we created an online database, Amazonia!, which provides easy access to this public transcriptome dataset.
Materials and Methods
Lists of Genes Differentially Expressed
From 38 original studies that used transcriptome analysis to study hESCs, we were able to collect 20 lists of transcripts that were upregulated in hESCs compared to differentiated cell types and 11 lists of transcripts that were downregulated (supplemental online Tables 1–3). We selected only transcript lists that provided a fold ratio of the mean expression in hESCs to that in differentiated cells. Each list was mapped to UniGene build 176. When the mean value of expression in differentiated cells was 0, which occurred with expressed sequence tags (ESTs) and serial analysis of gene expression (SAGE), the hESC/differentiation ratio was arbitrarily set at 50. Only genes with a fold ratio greater than or equal to 2 (hESC genes), or lower than or equal to 0.5 (differentiation genes) were selected. The Gene Ontology (GO) annotation analysis was carried out using the Fatigo+ tool on the Babelomics website (http://babelomics.bioinfo.cipf.es) using gene symbols. Only annotations with a false discovery rate-adjusted p value below .05 were considered significant.
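The fold-ratio selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (ratio of at least 2, ratio of at most 0.5, and the arbitrary value of 50 when the differentiated mean is 0). The function names, gene labels, and expression values are hypothetical and are not taken from the compiled studies.

```python
# Minimal sketch of the fold-ratio filter; names and values are hypothetical.
ZERO_DENOMINATOR_RATIO = 50.0  # arbitrary ratio used when differentiated mean is 0


def fold_ratio(hesc_mean, diff_mean):
    """Return the hESC/differentiated expression ratio."""
    if diff_mean == 0:
        return ZERO_DENOMINATOR_RATIO
    return hesc_mean / diff_mean


def classify(hesc_mean, diff_mean):
    """Label a transcript from its fold ratio, or return None if not selected."""
    ratio = fold_ratio(hesc_mean, diff_mean)
    if ratio >= 2:
        return "hESC gene"
    if ratio <= 0.5:
        return "differentiation gene"
    return None


# Toy input: gene label -> (mean in hESCs, mean in differentiated cells)
example = {
    "GENE_A": (800.0, 100.0),  # ratio 8    -> hESC gene
    "GENE_B": (50.0, 400.0),   # ratio 0.125 -> differentiation gene
    "GENE_C": (120.0, 100.0),  # ratio 1.2  -> not selected
    "GENE_D": (30.0, 0.0),     # zero denominator -> ratio set to 50
}
labels = {gene: classify(h, d) for gene, (h, d) in example.items()}
```

Both thresholds are inclusive, as stated in the text, so a transcript with a ratio of exactly 2 or exactly 0.5 is retained.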
Integrating Affymetrix GeneChip Datasets Obtained from Distinct Studies
To compare the transcriptome of hESCs to that of differentiated cell populations, we built an expression compendium by combining the U133A (Affymetrix, Santa Clara, CA, http://www.affymetrix.com) microarray data from eight publications [13–20]. Indeed, we and others have shown that the GeneChip system (Affymetrix) allows direct comparison between datasets obtained in different centers, provided that the same chip and the same normalization are used [10, 21, 22]. The number of samples amounted to 217, including 24 hESC samples (11 different hESC lines). All samples were normalized before analysis with the GCOS 1.2 software (Affymetrix), using the global scaling method, with a target intensity (TGT) value set to 100. The detection call can be “present,” when the perfect match probes are significantly more hybridized than the mismatch probes; “absent,” when both perfect match and mismatch probes display a similar fluorescent signal; or “marginal,” when the probe set complies with neither the present nor the absent call criteria. When several probe sets measured the same gene, only the probe set with the maximal number of present detection calls across all samples was selected. This step reduced the list of probe sets to 14,074. This dataset is available as supplemental online Table 4 and can be accessed on our website, http://amazonia.montp.inserm.fr.
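The probe set selection step can be illustrated as follows. This is a sketch under stated assumptions: the probe set identifiers and calls are invented, calls are abbreviated as "P", "A", and "M", and ties are resolved in favor of the first probe set encountered, a choice the text does not specify.

```python
# Sketch: keep, for each gene, the probe set with the most "present" calls.
# Probe IDs, gene labels, and calls below are hypothetical.


def select_probe_sets(probe_to_gene, calls):
    """Return {gene: probe set with the maximal number of 'P' calls}.

    probe_to_gene: dict mapping probe set ID -> gene symbol
    calls: dict mapping probe set ID -> list of calls ("P", "A", or "M")
    Ties keep the first probe set encountered (an assumption).
    """
    best = {}
    for probe, gene in probe_to_gene.items():
        n_present = sum(1 for call in calls[probe] if call == "P")
        if gene not in best or n_present > best[gene][1]:
            best[gene] = (probe, n_present)
    return {gene: probe for gene, (probe, _) in best.items()}


probe_to_gene = {
    "probe_1": "GENE_A",
    "probe_2": "GENE_A",
    "probe_3": "GENE_B",
}
calls = {
    "probe_1": ["P", "A", "A", "M"],  # 1 present call
    "probe_2": ["P", "P", "A", "P"],  # 3 present calls
    "probe_3": ["A", "A", "A", "A"],  # 0 present calls, still the only candidate
}
selected = select_probe_sets(probe_to_gene, calls)
```

A gene measured by a single probe set keeps that probe set even if it is never called present, which matches a rule that only chooses among competing probe sets.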
Table 1. hESCs: Laboratory of origin and karyotype
Table 2. Forty-three original studies or reviews analyzing the hESC transcriptome
Table 3. Forty-eight genes overexpressed in hESCs compared to differentiated cell types in at least 10 studies
Table 4. Forty genes specifically expressed in hESCs
Hierarchical clustering was carried out on the detection call data with the CLUSTER and TREEVIEW software packages. The value 1 was assigned to present calls, −1 to absent calls, and 0 to marginal calls. This matrix was clustered without further mathematical transformation. Only genes were clustered (i.e., the order of the samples is the order used on the Amazonia! website and is based on grouping according to the embryonic germ layer origin of the sample).
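The numeric encoding of detection calls described above can be sketched as below. Only the encoding step is shown; the clustering itself was performed with the CLUSTER package and is not reproduced here. Gene labels and calls are hypothetical.

```python
# Sketch: encode detection calls as 1 (present), -1 (absent), 0 (marginal).
CALL_SCORES = {"present": 1, "absent": -1, "marginal": 0}


def encode_calls(calls_by_gene):
    """Convert {gene: [call, ...]} into {gene: [score, ...]}.

    Sample order is preserved, since only genes (rows) are clustered.
    """
    return {
        gene: [CALL_SCORES[call] for call in calls]
        for gene, calls in calls_by_gene.items()
    }


# Toy input: three samples per gene, in a fixed sample order
calls_by_gene = {
    "GENE_A": ["present", "absent", "absent"],
    "GENE_B": ["present", "marginal", "present"],
}
matrix = encode_calls(calls_by_gene)
```

The resulting rows can be passed unchanged to a clustering tool, in keeping with the statement that no further mathematical transformation was applied.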
hESC Lines and Karyotype
The hESC lines used and their respective karyotypes are listed in Table 1.
After approval from the French Ministry of Research, the French Ministry of Health, and the Agence de la Biomédecine, HUES1 and HUES3 hESCs were imported from Douglas Melton's laboratory (Harvard University, Cambridge, MA) and cultured as described previously. Cells were passaged every 3–4 days enzymatically with 0.25% trypsin/EDTA (Invitrogen, Cergy Pontoise, France, http://www.invitrogen.com) and cultured in knockout Dulbecco's modified Eagle's medium (DMEM; Invitrogen) without plasmanate, with 10% KO-SR (Invitrogen), 2 mM l-glutamine, 1× nonessential amino acids, 0.05 mM β-mercaptoethanol, and 10 ng/ml fibroblast growth factor (FGF)-2 (Abcys, Paris, France, http://www.abcysonline.com). Medium was replaced daily. HUES1 and HUES3 were cultured on murine embryonic fibroblasts (MEFs) obtained from E13 ICR mouse embryos (Harlan, Gannat, France, http://www.harlan.com) or on human foreskin fibroblasts (hFFs) in porcine skin gelatin-coated (Sigma-Aldrich, St. Quentin Fallavier, France, http://www.sigmaaldrich.com) six-well dishes. The hFF cell lines SM1 and SM3 were derived from, respectively, 75- and 3-year-old patients undergoing foreskin reduction. Informed consent was obtained from the patient or the patient's parents. MEFs were cultured in DMEM with 10% fetal calf serum (FCS) and hFFs in DMEM with 20% FCS. MEFs and hFFs were mitotically inactivated by mitomycin-C (2 hours at 10 μg/ml). These hESCs expressed POU5F1, TRA-1-60, and TRA-1-81, displayed alkaline phosphatase activity, and were able to differentiate into embryoid bodies that expressed differentiation markers of astrocytic lineage (glial fibrillary acidic protein [GFAP]) or endodermal lineage (α-fetoprotein; supplemental online Figure 1 and data not shown).
The HS293 and HS235 hESC lines were cultured in the Department of Obstetrics and Gynecology (CLINTEC) at the Karolinska University Hospital as described previously. Briefly, hESCs were cultured on hFF (CRL-2429; American Type Culture Collection, Manassas, VA, http://www.atcc.org) mitotically inactivated by irradiation (40 Gy), in KO-DMEM with 20% knockout SR, 2 mM GlutaMAX, 0.5% penicillin-streptomycin, 1% nonessential amino acids, 0.5 mM β-mercaptoethanol (all from Gibco, Paisley, U.K., http://www.invitrogen.com); 1% insulin-transferrin-selenium (Sigma-Aldrich); and 8 ng/ml of basic FGF (R&D Systems, Oxford, U.K., http://www.rndsystems.com).
For karyotype analysis, hESCs were treated with 75 ng/ml final Karyomax Colcemid (Invitrogen) for 1 hour, trypsinized, incubated in 0.0375 M KCl for 20 minutes, and fixed in fresh 3:1 methanol/acetic acid solution. One-two spreads were counted for chromosome number, and 12–16 banding patterns were analyzed at a resolution of 300–500 bands.
Flow Cytometry Analysis
hESCs and fibroblasts were dissociated with trypsin (0.25%)-EDTA (1 mM) (Gibco) for 3 minutes. Cells were then washed with phosphate-buffered saline (PBS) and incubated for 30 minutes at 4°C in PBS with the corresponding monoclonal antibody (MAb): anti-CD24 MAb conjugated to phycoerythrin (dilution 1:50; clone ALB9, Immunotech, Marseille, France, http://www.immunotech.com) and/or anti-CD44 MAb conjugated to fluorescein isothiocyanate (dilution 1:50; clone J-173; Immunotech). After PBS washes, cells were suspended in FACSFlow (Becton Dickinson, San Jose, CA, http://www.bdbiosciences.com) and fluorescence was analyzed with a FACSCalibur flow cytometer (Becton Dickinson) or sorted with a FACSAria cell sorter (Becton Dickinson). Appropriate isotype controls were included in all analyses.
hESCs cultured on coverslips were fixed for 20 minutes in 4% paraformaldehyde and washed three times in PBS. Cells were permeabilized with 0.1% Triton X-100 (Sigma). After blocking at room temperature for 60 minutes in PBS with 5% donkey serum (S30; Chemicon International, Temecula, CA, http://www.chemicon.com), cells were incubated for 1 hour at room temperature with primary antibody diluted in PBS with 5% donkey serum: POU5F1/OCT3/4 (sc 9081 1:300; Santa Cruz Biotechnology, Santa Cruz, CA, http://www.scbt.com) and SEMA6A (AF1146; 1:50; R&D). Cells were washed three times in PBS and incubated for 1 hour at room temperature with Alexa Fluor 488 donkey anti-rabbit (A-11034; 1:1,000; Molecular Probes, Eugene, OR, http://probes.invitrogen.com) and Cy3 donkey anti-goat (1:400; Jackson ImmunoResearch, West Grove, PA, http://www.jacksonimmuno.com) secondary antibodies, for POU5F1 and SEMA6A, respectively. Unbound antibodies were removed by three washes in PBS. Hoechst staining was added to the first wash (5 μg/ml; Sigma-Aldrich).
Compiling hESC Expression Profiles
As of October 1, 2006, we identified 38 original studies, one protocol description, and four reviews analyzing the transcriptome of hESCs (Table 2). The original studies used various hESCs, control cells, and gene expression analysis techniques (summarized in supplemental online Table 1). Twenty-eight different hESC lines were used, and transcriptome techniques included microarrays (seven different types of chips), EST scanning, SAGE, massively parallel signature sequencing (MPSS), and Illumina beads (Illumina, San Diego, CA, http://www.illumina.com). One study compared chromatin immunoprecipitation chip data with transcriptome data in hESCs. Nevertheless, some common features emerged such as the frequent investigation of the H1, H9, and BG01 hESCs (in 12, 12, and 11 studies, respectively), and the use of the GeneChip (Affymetrix) microarray system (in 16 studies; supplemental online Table 1).
Meta-Analysis of Genes Differentially Expressed Between hESCs and Nonpluripotent Cells
One main objective of large-scale gene expression analyses of hESCs is to identify the set of genes that are overexpressed in this unique cell type (hESC genes) or underexpressed (differentiation genes). We reasoned that bona fide hESC or differentiation genes would be repeatedly uncovered by independent groups, regardless of the hESC lines, the control cells, the assay format, or the statistical method that had been used. We collected, from these 38 original studies, 20 lists of transcripts overexpressed in hESCs (hESC genes), and 11 lists of genes underexpressed in hESCs (differentiation genes).
The 20 lists of hESC genes comprised 5,567 different genes. As illustrated in Figure 1A, we observed a marked heterogeneity between these lists, with only 1,076 genes found to be overexpressed in hESCs by three or more independent studies, 48 genes by 10 or more studies, and only one gene by all 20 lists (supplemental online Table 2). The 48 genes found to be overexpressed in at least 10 studies are listed in Table 3. Of note, the pivotal ES transcription factor POU5F1/OCT3/4 is the one gene found by all 20 lists, whereas genes found in at least 10 studies include the transcription factors NANOG and SOX2 and the growth factors TDGF1/CRIPTO and Galanin (GAL) that are known to be highly expressed by hESCs. Thus, according to hESC transcriptome analyses published to date, this list of 1,076 genes found to be overexpressed in hESCs by three or more studies can be viewed as a consensus hESC gene list.
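The consensus rule, in which a gene is retained when it appears in at least three independent lists, amounts to counting list membership per gene. The sketch below illustrates this with toy lists; the gene labels are placeholders, not data from the compiled studies, and the function names are hypothetical.

```python
# Sketch: build a consensus gene list from several published gene lists.
from collections import Counter


def count_studies(gene_lists):
    """Count, for each gene, the number of lists in which it appears."""
    counts = Counter()
    for genes in gene_lists:
        counts.update(set(genes))  # a gene counts at most once per list
    return counts


def consensus(gene_lists, min_studies=3):
    """Return the sorted genes reported by at least min_studies lists."""
    counts = count_studies(gene_lists)
    return sorted(gene for gene, n in counts.items() if n >= min_studies)


# Toy input: four hypothetical lists
gene_lists = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_A", "GENE_B", "GENE_B"],  # duplicate within a list counts once
    ["GENE_A", "GENE_C", "GENE_D"],
    ["GENE_A", "GENE_C"],
]
consensus_list = consensus(gene_lists, min_studies=3)
found_by_all = consensus(gene_lists, min_studies=len(gene_lists))
```

Raising `min_studies` to the number of lists reproduces the strict intersection, which in the meta-analysis contained a single gene.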
To get further insights into this hESC list, we built an expression compendium by combining the data from five publications using the U133A GeneChip microarray to analyze the hESC transcriptome and three publications providing the transcriptome of various normal fetal and adult tissues (see Materials and Methods; supplemental online Table 4). This compendium included the gene profiling of 24 hESC samples and more than 190 fetal and adult tissue samples. A heat map was generated for the consensus hESC gene list in this expression compendium based on the detection call provided by the GCOS 1.2 software. The detection call is a way to evaluate whether a gene is expressed in a given sample. Hierarchical clustering (Fig. 1C) delineated four major clusters of genes: cluster A was a group of 40 genes specifically detected in hESCs (hESC-specific genes; Table 4), including expected genes such as POU5F1/OCT3/4, NANOG, TDGF1, LIN28, CLDN6, GDF3, and DNMT3A, but also genes such as CYP26A1, HELLS, or GPR19; cluster B featured genes that were detected in both hESCs and central nervous system samples such as the GABA receptors GABRB3 and GABRA5, and the growth factor FGF13; cluster C comprised genes detected in samples characterized by a high mitotic index such as SKP2, MYC, the cyclin CCNA2, and the MCM genes MCM2, MCM5, MCM6, and MCM7; and cluster D comprised genes overexpressed in hESCs but also expressed in a majority of the tissues included in this dataset such as PGK1, HSPA9B, and the ribosomal genes RPLP0, RPL6, RPL7, and RPL24. The complete lists of genes composing these clusters are available as supplemental online Table 5. The expression histograms of the 40 hESC-specific genes are shown in supplemental online Figure 2.
These results show that, although all 1,076 genes of the consensus hESC gene list have been found to be overexpressed in hESCs compared to non-hESC samples by at least three different studies, only 40 genes are indeed hESC specific (cluster A); most of the others are also expressed in adult tissues to various extents.
Table 5. Thirty selected genes underexpressed in hESCs compared to differentiated cell types and found by at least six studies
The 11 lists of differentiation genes totaled 4,798 genes, and we noted a similar heterogeneity as in the hESC lists (supplemental online Table 3). Of these 4,798 genes, only 783 were found to be underexpressed in hESCs by at least 3 different studies, composing a consensus differentiation gene list, 3 genes (lumican and collagen 1A1 and 3A1), by 9 studies, and none by all 11 studies (Fig. 1B). Table 5 shows 30 selected genes found by at least 6 studies, which include the bone morphogenetic proteins (BMP) 1 and 4, keratin 7 and keratin 18, insulin-like growth factor 2 (IGF2), heart and neural crest derivatives expressed 1 (HAND1), and the transducing chain IL6ST/GP130. The complete consensus differentiation gene list can be found in supplemental online Table 3.
We compared functional GO annotations of the hESC genes to the differentiation genes. Several functional annotations were more represented in each category (Fig. 1D). There were significantly more genes involved in “metabolism,” “mitosis,” “RNA splicing,” “nuclear pore,” and “DNA repair” in the hESC gene list, reflecting the intense proliferation, DNA replication, and DNA remodeling taking place in these cells. Conversely, GO annotations such as “organ development,” “skeletal development,” “extracellular matrix,” “cell adhesion,” “cell communication,” “integral to plasma membrane,” and “signal transduction,” were significantly more frequent among genes upregulated in differentiated tissues, in agreement with the idea that hESC differentiation mimics early organogenesis, which is associated with the development of complex cell-cell communication and cell-extracellular matrix interactions.
CD24 and SEMA6A Are hESC Markers
We next inquired whether we could find, among the hESC genes, new markers that may be useful to identify, isolate, and qualify hESCs in vitro. We focused on two cell surface hESC genes: CD24 and SEMA6A. These two genes have been found to be overexpressed in hESCs by, respectively, 9 and 15 studies, and are, therefore, good candidates as new hESC markers.
CD24 is a sialoglycoprotein known to be expressed on mature granulocytes and B-cell subpopulations. In addition to these hematopoietic cell types, microarray data from the expression compendium suggested a high CD24 mRNA expression in keratinocytes, pancreas, and thyroid, whereas most neural tissues, muscle, liver, and testis did not express CD24 (Fig. 2A). Upon differentiation of hESCs into embryoid bodies (EBs), CD24 expression is markedly downregulated (Fig. 2B). Importantly, hFF samples did not express CD24 (Fig. 2C, 2F). Hence, we investigated whether CD24 could discriminate hESCs from fibroblasts in culture. We analyzed by flow cytometry the hESC lines HUES1, HUES3, HS293, and HS235 [24, 25] cultured on hFF and evidenced two distinct cell populations, one CD24+ and one CD24− (Fig. 2E). To demonstrate that the CD24+ population corresponded to hESCs and the CD24− population to hFFs, we took advantage of CD44, a strong fibroblast marker not expressed by hESCs (Fig. 2D, 2G). Double staining of hESCs cultured on hFF showed that these two markers were expressed in a mutually exclusive manner and delimited a hESC CD24+ CD44− population that did not overlap with the fibroblast CD24− CD44+ population (Fig. 2H). Using CD24/CD44 labeling, we were able to separate hESCs from fibroblasts and obtain pure hESC populations that recovered from the cell sorting procedure and grew in vitro while retaining expression of POU5F1 and cardinal cell surface markers of pluripotency (supplemental online Fig. 3 and data not shown).
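The mutually exclusive CD24/CD44 staining pattern suggests a simple two-marker gate, sketched below. This is an illustration only: the fluorescence values and thresholds are hypothetical (in practice thresholds would be set from the isotype controls), and the function is not the sorting procedure used on the FACSAria.

```python
# Sketch: assign events to hESC (CD24+ CD44-) or fibroblast (CD24- CD44+) gates.
# Thresholds and intensities are hypothetical placeholders.


def classify_event(cd24, cd44, cd24_threshold, cd44_threshold):
    """Return 'hESC', 'fibroblast', or 'ungated' for one event."""
    cd24_pos = cd24 > cd24_threshold
    cd44_pos = cd44 > cd44_threshold
    if cd24_pos and not cd44_pos:
        return "hESC"
    if cd44_pos and not cd24_pos:
        return "fibroblast"
    return "ungated"  # double-positive or double-negative events


# Toy events: (CD24 intensity, CD44 intensity)
events = [(950.0, 12.0), (8.0, 700.0), (5.0, 6.0), (400.0, 500.0)]
gates = [classify_event(cd24, cd44, 100.0, 100.0) for cd24, cd44 in events]
```

Events falling in neither gate are left unassigned rather than forced into a population, which keeps the sorted fraction restricted to the CD24+ CD44− phenotype.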
SEMA6A is a class 6 semaphorin (i.e., transmembrane with cytoplasmic domain) known to be expressed in developing neural tissue. We recently reported the expression of this semaphorin in cumulus oophorus cells. Comparison of RNA expression in hESCs and normal adult tissues showed that, in addition to hESCs, SEMA6A was also expressed at a high level in adult samples from the central and peripheral nervous system and placenta (Fig. 2I). As for CD24, SEMA6A mRNA is downregulated on EB differentiation and is not detected in human fibroblasts (Fig. 2J, 2K). Immunofluorescence analysis showed a membrane localization of SEMA6A on hESCs, in contrast to POU5F1/OCT3/4, which had a strict nuclear localization (Fig. 2L, 2M). In summary, we showed that CD24 and SEMA6A indeed have a preferential expression in hESCs, that this expression is confirmed at the protein level, and that it declines on EB differentiation. Thus, these markers can be used to distinguish hESCs from feeder fibroblasts.
hESC Transcriptome Data Visualization Through an Open-Access Web Interface
We developed a website, Amazonia! (http://amazonia.montp.inserm.fr), to allow the scientific community access to these public data. This website is dedicated to the visualization of large, publicly available human transcriptome data (T. Le Carrour, C. Berthenet, S. Assou, L. Lhermitte, G. Palenzuela, N. Lamb, T. Rème, B. Klein, J. De Vos, manuscript submitted). A main topic of this website is human embryonic stem cells. Data are visualized as expression histograms with a color code facilitating the recognition of cell type. Genes are accessed either by keywords or through lists of genes. Most interestingly, when data were obtained using the same platform format, sample labeling, and data normalization, it was possible to combine different experiments in a single virtual experiment. The U133A expression compendium comprises, for example, more than 200 different samples from eight publications including hESCs and normal adult tissues. Thus, Amazonia! provides the expression profile of approximately 15,000 different genes in approximately 100 different tissue types or purified cell populations, including 11 hESCs (H1, H9, HS181, HS235, HS237, FES21, FES22, FES29, FES30, I6, and HES-2). Figure 3A illustrates this feature of our website with the expression histograms of five hESC-specific genes (POU5F1, NANOG, GPR19, Helicase lymphoid-specific [HELLS], and the cytochrome P450 CYP26A1), two genes expressed in non-lineage-differentiated cells (HAND1 and IGF2), two factors highly expressed by human fibroblasts and smooth muscle that may contribute to the supporting properties of fibroblasts in hESC culture (Gremlin [GREM1] and matrix metalloproteinase 1 [MMP1]), one hematopoietic marker (CD45), one central nervous system marker (GFAP), and one ubiquitously expressed gene (ribosomal protein L3 [RPL3]).
Another important feature of Amazonia! is the possibility to compare, on the same web page, the expression of a gene of interest in various datasets. Figure 3B shows that the expression of Frizzled 7 (FZD7) was markedly upregulated in hESCs, was downregulated during nonlineage differentiation into EBs, and was also highly expressed in embryonal carcinoma and yolk sac carcinoma samples. The combination of these three histograms evidences a preferential FZD7 mRNA expression in normal and malignant embryonic cells, suggesting that FZD7 may play a major role in these pluripotent cell types.
To facilitate access for the scientific community to the lists of genes from published transcriptome analyses, we implemented a list manager in our Amazonia! website. Thus, one can access these lists straightforwardly and obtain an expression histogram for each gene in various public transcriptome collections. This feature is of particular interest to challenge a list of genes, such as an hESC gene list, with other hESC datasets or even with non-hESC datasets such as cancer datasets. Indeed, once a gene is selected, the user can navigate between various thematic pages, switching for instance from the “stem cells” page to the “leukemia and lymphoma,” “lung cancer,” or other pages. Using this feature of the Amazonia! website, we could observe that the sialoglycoprotein CD24 was also highly expressed in acute lymphoblastic leukemia, lung cancer, and glioma samples when compared to the corresponding normal samples (Fig. 3C and data not shown).
Human ESCs are remarkable in their ability to both self-renew and generate virtually any cell type, hence carrying many hopes for cell therapy. It is anticipated that genome-wide expression analyses, by providing an extensive molecular taxonomy, will help in understanding this unique cell model of stem cell pluripotency. Transcriptome results can be viewed as a nonbiased, genome-wide expression catalog. Many groups have published the transcriptome of hESCs, but, as a direct consequence of the massive data generated, access to these data on a routine basis was precluded for most researchers. Therefore, the construction of a database collecting publicly available hESC transcriptome data and accessible through a user-friendly interface is of utmost interest for the hESC research community. We found 38 original studies or reviews analyzing the transcriptome of hESCs. Expression data and gene lists extracted from these studies were included in Amazonia! and are now readily accessible. Interestingly, the frequent use of the U133A oligonucleotide microarrays allowed us to construct a virtual expression dataset of approximately 15,000 different genes in more than 200 tissue samples from various origins, including 24 hESC samples. Hence, the expression of each gene in hESCs is directly contrasted with that of normal fetal and adult tissues, as illustrated in Figure 3A.
Unearthing the hESC gene panoply may help to define what makes hESCs unique. To achieve this goal, most studies compared hESC transcriptome data to that of more differentiated cell types and obtained lists of genes over- or underexpressed in hESCs. Comparison of different transcriptome surveys of hESCs gave us the unique opportunity to identify genes that were reported by several authors as being differentially expressed in hESCs. However, a striking heterogeneity was observed among the 20 lists of hESC genes, and only 48 genes were found to be highly enriched in hESCs in at least 10 publications, among 5,567 genes found in at least one publication. Some of these differences may be explained by platform-to-platform or lab-to-lab variability, but this is likely not the main explanation, as suggested by transcriptome platform comparisons. Rather, the differences between the hESC lines used, the control samples, the specific caveats of each transcriptome analysis technique, and the statistical methodologies likely contribute to these disparities (supplemental online Table 2). For instance, the homeobox transcription factor NANOG, which is universally expressed in hESCs, was not reported by Sperger et al. because no probe for NANOG was present on their microarray, nor was it reported by Brandenberger et al., because the differential regulation of NANOG in their in vitro differentiation model did not meet their stringent statistical criteria. ZFP42 (the human homolog of murine rex1) was never listed by MPSS studies because its MPSS signature has repeat sequences. Another pitfall influencing differentially expressed gene list comparisons and contributing to the “small intersection” problem is that, to be at the intersection of 20 lists, a gene must have passed the statistical filter 20 times, which it does with a probability equal to the product of the probabilities of each test.
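The arithmetic behind the “small intersection” problem can be made concrete with a short calculation. The sketch below assumes that the 20 tests are independent, which is what taking the product of the probabilities implies, and it uses a hypothetical per-study probability of 0.9 that is not estimated from the studies analyzed here.

```python
# Sketch: probability that a gene passes the filter in every study,
# assuming independent tests and a hypothetical per-study probability.
import math


def intersection_probability(per_study_probs):
    """Return the product of the per-study probabilities of passing."""
    return math.prod(per_study_probs)


# A truly overexpressed gene with a 90% chance of passing each of 20 filters
p_all = intersection_probability([0.9] * 20)  # about 0.12
```

Even with a high chance of detection in each individual study, such a gene would appear in all 20 lists only about one time in eight, which is why a threshold of three studies was used for the consensus lists rather than the strict intersection.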
A way to circumvent this difficulty would be to obtain the raw data from these studies and to apply specific statistical tests. However, the raw data were not available for many studies analyzed here, which prevented us from applying this approach in our study. Nevertheless, the 1,076 hESC gene list provides the opportunity to the scientific community to examine the genes that have been found to be over- or underexpressed in hESC by several authors. These lists provide further molecular insight into the biology of this unique stem cell model and are now starting points for many new research directions in the field of hESC. In the future, it will be interesting to extend this list by investigating additional hESC lines. The collection of additional transcriptome data is ongoing, and we will update the database according to new publications on the hESC genome expression.
As one can easily notice by browsing the consensus hESC gene list in the U133A “embryonic and adult samples” dataset, we found very few genes that are completely hESC specific. Indeed, most genes found to be overexpressed in hESCs are also expressed in other tissues (Fig. 1C), and only 40 genes, grouped in cluster A, were not detected in most other tissues (supplemental online Figure 2). Of note, genes can be labeled as hESC specific only with respect to the different tissues and cell populations that have been tested. As new cell types are investigated, some of these genes may be found to be expressed in these new samples and, thereby, cease to be specific to hESCs. Specificity also depends on the sensitivity of the assay investigating expression of a given gene. For example, POU5F1/OCT3/4 has been reported to be expressed in germinal cells and even in bone marrow by reverse transcriptase polymerase chain reaction assays [34, 35]. By contrast, microarray analysis shows that this gene, clearly expressed at a high level in hESCs and teratocarcinoma samples, is not detected by this technique in testis, ovary, or bone marrow samples, nor in a pure oocyte population (Fig. 3A and unpublished data). These observations suggest that the properties of hESCs, comprising self-renewal, unrestrained proliferation, and pluripotency, are mediated by the expression of few specific genes, if any, together with genes that, individually, are not hESC specific, but whose combined expression is specific to hESCs. The consensus hESC gene list should encompass many of those genes contributing to the embryonic stem cell characteristics.
Transcriptome approaches have several limitations. Each technique has its own technical limits, but this meta-analysis partly circumvented these by the simultaneous analysis of different, complementary methods. Another drawback is that by looking at the transcriptome, we consider gene expression at the RNA level only, thereby missing all forms of post-transcriptional regulation. It will, therefore, be important in the future to look at the differential expression of the genes described in the hESC and differentiation lists at the protein level. For that matter, this meta-analysis provides a list of pertinent genes for further protein validation. In line with this proposition, we chose to investigate more thoroughly two cell surface protein-encoding genes, which may serve as new hESC markers: CD24 and SEMA6A. Although these two genes have been found to be overexpressed in hESCs by 9 and 15 studies, respectively, we found that hESCs shared this RNA expression with several adult tissues, as substantiated by microarray results (Fig. 2). However, purification of hESCs requires only distinguishing them from differentiating stem cells and from the cocultured feeder cells. CD24 expression is low in differentiating hESCs, and is absent in human fibroblasts. We were thus able to isolate pure hESC populations by flow cytometry. This provides a new tool to isolate highly enriched populations of hESCs for subsequent experiments, including microarray analysis. We observed that CD24 was also highly expressed on various malignant cell types, as previously reported [36, 37]. Because CD24 is a ligand for P-selectin, it has been suggested that CD24 could be important in the dissemination of tumor cells by facilitating the interaction with endothelial cells. The role of CD24 in hESCs is less clear because P-selectin is not expressed by human fibroblasts or by hESCs themselves (data not shown), suggesting that CD24 may have other molecular functions.
Semaphorins were initially identified for their role in neuronal guidance as chemorepellents, but it is becoming clear that this large family of genes also plays important roles in organogenesis, vascularization, angiogenesis, and B-lymphocyte signaling. We show here clear protein expression of semaphorin 6A in hESCs, with overexpression in hESCs compared to many other cell types, suggesting a functional role for this transmembrane molecule in cell-to-cell interaction or signaling in hESCs.
Clustering of the consensus hESC gene list based on the detection call identified a cluster of 40 genes with high specificity for hESCs, with no expression in most samples from more than 100 different fetal and adult tissues (Table 4). In addition to clearly expected genes such as POU5F1/OCT3/4, NANOG, or TDGF1, this cluster included genes whose hESC specificity had been overlooked. For example, the expression of the G protein-coupled receptor 19 (GPR19) is restricted to hESCs and, if functional, may offer a possibility for in vitro intervention on proliferation or pluripotency of hESCs. Other hESC-specific genes include the cytochrome P450 CYP26A1, which is responsible for retinoic acid degradation, and the helicase HELLS, which is expressed at a moderate level in a few lymphoid samples but most prominently in hESCs, and could be involved in processes requiring DNA strand separation, including replication, repair, recombination, and transcription (Fig. 3A). Our meta-analysis also spotted additional interesting genes such as FZD7, which was identified as the frizzled receptor preferentially expressed on hESCs. On the basis of this expression, we hypothesize that FZD7 could be a major WNT receptor in hESCs. Thus, FZD7 could contribute to the pluripotency signal mediated by WNT previously reported in hESCs. Regarding the consensus differentiation gene list, many genes, such as collagen and keratin genes, are clearly related to differentiation. We note that the transducing chain IL6ST/GP130, which is necessary to convey the signals from the interleukin-6 growth factor family, including leukemia inhibitory factor (LIF), is expressed at a very low level in hESCs compared to most differentiated adult tissues. This is in line with the largely accepted view that LIF signaling is dispensable for pluripotency in hESCs.
We analyzed 38 publications studying the hESC transcriptome. We propose a consensus hESC gene list and a consensus differentiation gene list that identify the genes found to be, respectively, up- or downregulated in hESCs compared to differentiated samples by at least three publications. We provide the first tool to directly visualize the expression of most human genes in hESCs and to compare it directly with their expression in many normal and malignant tissues. This tool may be considered the first online hESC expression atlas. By providing easy access to this large public dataset, we hope that Amazonia! will help boost the translation of this invaluable expression information into biological applications.
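The "at least three publications" criterion can be illustrated with a minimal sketch. The function name `consensus_gene_list`, the study labels, and the toy gene sets below are hypothetical and are not the data or code used in this meta-analysis; the sketch also assumes that gene symbols have already been harmonized across studies (for example, that OCT3/4 has been mapped to POU5F1).

```python
from collections import Counter


def consensus_gene_list(study_gene_sets, min_studies=3):
    """Return {gene: number_of_studies} for genes reported by at least
    `min_studies` studies. `study_gene_sets` maps a study label to the
    collection of genes that study reported (hypothetical input format)."""
    counts = Counter()
    for genes in study_gene_sets.values():
        # set() ensures a study contributes at most one vote per gene
        counts.update(set(genes))
    return {gene: n for gene, n in counts.items() if n >= min_studies}


# Toy input, invented for illustration only (not real study results).
toy_up = {
    "study_A": {"POU5F1", "NANOG", "SOX2", "CD24"},
    "study_B": {"POU5F1", "NANOG", "CD24"},
    "study_C": {"POU5F1", "SOX2", "CD24"},
    "study_D": {"POU5F1", "NANOG"},
}

consensus_up = consensus_gene_list(toy_up)

# Rank genes by how many studies reported them, ties broken alphabetically.
ranked = sorted(consensus_up.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

The same function could be applied to the sets of genes reported as downregulated in hESCs to obtain a differentiation list under the same threshold.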
Disclosures of Potential Conflicts of Interest
The authors indicate no potential conflicts of interest.
We thank the various labs that gave free access to their complete transcriptome data, in agreement with the Minimum Information About a Microarray Experiment recommendations; Ned Lamb and Cyril Berthenet for support with informatics (Institut de Génétique Humaine de Montpellier), and Hassan Boukhaddaoui for cell imaging (Montpellier RIO Imaging); Isabelle Rodde-Astier, Antoine Héron, and Valérie Duverger (MacoPharma) for decisive support for this project; and Geneviève Lefort, Marie Ponset, and Franck Pellestor for assistance in hESC karyotyping.