Housekeeping genes (HKGs) are involved in basic functions needed for the sustenance of the cell and are assumed to be constitutively expressed at a constant level. Based on these features, HKGs are frequently used for normalization of gene expression data. In the present study, we used the CodeLink Gene Expression Bioarray system to interrogate changes in gene expression occurring during differentiation of human ESCs (hESCs). Notably, in the three hESC lines used for the study, we observed that the RNA levels of 56 frequently used HKGs varied to a degree that rendered them inappropriate as reference genes. Therefore, we defined a novel set of HKGs specifically for hESCs. Here we present a comprehensive list of 292 genes that are stably expressed (coefficient of variation <20%) in differentiating hESCs. These genes were further grouped into high-, medium-, and low-expressed genes. The expression patterns of these novel HKGs show very little overlap with results obtained from somatic cells and tissues. We further explored the stability of this novel set of HKGs in independent, publicly available gene expression data from hESCs and observed substantial similarities with our results. Gene expression was confirmed by real-time quantitative polymerase chain reaction analysis. Taken together, these results suggest that differentiating hESCs have a unique HKG signature and underscore the necessity to validate the expression profiles of putative HKGs. In addition, this novel set of HKGs can preferentially be used as controls in gene expression analyses of differentiating hESCs.
Housekeeping genes (HKGs) are required for basal cellular function and maintenance and are assumed to be expressed at relatively stable levels across different cell types and experimental conditions. Therefore, they have been used for normalization of gene expression data. With the advent of genome-wide expression profiling, HKG mRNA levels were observed to vary extensively among different cell types. This led investigators to turn to other strategies, such as statistical methods, for normalizing large-scale gene expression data [4, 5]. These methods assume that up- and downregulated genes with similar average intensities roughly cancel out or, alternatively, that most genes remain unchanged. This assumption is usually correct in large-scale genome studies but not for smaller focused arrays, which still require carefully selected and validated HKGs for normalization [6, 7]. Commonly used HKGs in studies of somatic cells include glyceraldehyde-3-phosphate dehydrogenase (GAPDH), albumin, actins, tubulins, cyclophilin, and 18S rRNA [2, 3, 8, 9].
Human ESCs represent populations of pluripotent undifferentiated cells with unlimited replication capacity that can be coaxed to differentiate into a variety of specialized cells. As a result, there is great hope that hESCs will be extremely useful by providing platforms for various in vitro applications (e.g., in drug discovery) as well as for future use of hESCs and their differentiated progeny in cell replacement therapies [11, 12]. To realize the potential of hESCs, it is necessary to gain much deeper knowledge about the processes that govern differentiation of these cells.
In recent years, significant progress toward understanding cellular differentiation has been fueled, in part, by studying gene expression using microarrays [13–18]. In lower throughput analyses, RNA levels in hESCs are also measured using reverse transcription polymerase chain reaction (PCR), requiring normalization of the gene expression data to adequately correct for intersample variation. In general, investigators have used the traditional HKGs (e.g., GAPDH, β-tubulin, β-actin) in studies of hESCs [19, 20]. However, it is well known that the expression of several of these genes varies considerably in adult tissues, and their suitability as HKGs in hESCs remains to be proven. In this regard, the RNA levels of hypoxanthine phosphoribosyltransferase and β-tubulin were shown to vary substantially in differentiating mouse ESCs.
Microarrays are very useful for identification and evaluation of HKGs, and many adult and fetal tissues have been analyzed [2, 8, 9]. However, to date no study has specifically investigated stably expressed genes in hESCs or their early derivatives. In the present study, we performed whole-genome expression profiling of undifferentiated and differentiating hESCs using CodeLink Bioarrays targeting approximately 57,000 transcripts and expressed sequence tags (ESTs). Of 56 genes that have previously been used as HKGs, we observed that the vast majority varied considerably in differentiating hESCs. Hence, we have identified a novel set of stably expressed genes that can be used for normalization of gene expression data from hESCs. We propose that these novel HKGs are more reliable as reference genes in studies of early differentiation of hESCs compared with many of the previously used HKGs.
Materials and Methods
Human ESC Culture and In Vitro Differentiation
The hESC lines SA001, SA002, and SA002.5 (Cellartis AB, Göteborg, Sweden, http://www.cellartis.com) were propagated as previously described [22, 23] on mitomycin-C (Sigma-Aldrich, St. Louis, http://www.sigmaaldrich.com) inactivated mouse embryonic fibroblast (MEF) feeder layers using VitroHES medium (Vitrolife AB, Kungsbacka, Sweden, http://www.vitrolife.com) supplemented with 4 ng/ml basic fibroblast growth factor (bFGF). SA002 is trisomic for chromosome 13, whereas SA002.5 is a euploid subclone derived from SA002. Undifferentiated hESCs were passaged every 4–5 days by mechanical dissociation using a Stem Cell Cutting Tool (Swemed Lab International AB, Billdal, Sweden, http://www.swemed.com).
Spontaneously differentiating hESC cultures were obtained using two different protocols (Fig. 1), both with serum-containing medium (KnockOut-Dulbecco's modified Eagle's medium, 20% fetal calf serum, 1% penicillin-streptomycin (PEST), 1% GlutaMAX, 1% nonessential amino acids, and 0.1 mM β-mercaptoethanol) (Invitrogen, Carlsbad, CA, http://www.invitrogen.com). In protocol 1 (the high density [HD] protocol), the hESCs were maintained on MEF for spontaneous differentiation; harvested at days 5, 11, and 25 after passage; and rapidly frozen at −80°C for subsequent RNA extraction. In protocol 2 (the embryoid body [EB] protocol), undifferentiated hESC colonies were transferred from the MEF to suspension cultures at day 5 after passage, using a Stem Cell Cutting Tool. After an additional 6 days, the suspended EBs were either harvested for RNA extraction or plated onto gelatin-coated culture dishes to allow further differentiation. At day 14 after plating of the EBs, the cells were harvested for RNA extraction.
Total RNA was extracted from undifferentiated or differentiated hESCs using the Qiagen RNeasy Mini Kit (Hilden, Germany, http://www1.qiagen.com) according to the manufacturer's instructions. DNase treatment was performed on-column using the RNase-free DNase Kit (Qiagen).
Microarray Analysis Using the CodeLink Bioarrays Platform
Total RNA was used to generate cRNA using the CodeLink Expression Assay Reagent kit (GE Healthcare, Little Chalfont, Buckinghamshire, U.K., http://www.gehealthcare.com). Total RNA and final cRNA concentrations and quality were assessed using an Agilent Bioanalyzer and UV spectrophotometry (PerkinElmer Life and Analytical Sciences, Boston, http://www.perkinelmer.com).
Fragmented cRNA was hybridized at 37°C for 18 hours to CodeLink Human Whole Genome Bioarrays (GE Healthcare) targeting approximately 57,000 transcripts and ESTs. All experiments were performed in triplicate. For detection and quantification, an Axon GenePix 4000B scanner (Molecular Devices Corp., Union City, CA, http://www.moleculardevices.com) and the CodeLink Scanning and Expression Analysis software (version 4.2) (GE Healthcare) were used.
Identification of Stably Expressed Genes in Differentiating hESCs
All subsequent analyses were based on average intensity measurements from three technical replicates, where only probes flagged as “good” in all arrays were included. For each hESC line, genes expressed at a stable level were identified by computing the coefficient of variation (CV) using the mean expression from undifferentiated and differentiated hESCs. The threshold was set to CV = 20%, since the number of probes below this threshold matched the number of probes that were within 1.5-fold change in all pairwise comparisons. It has been shown that the CodeLink platform can detect a fold change of 1.5 with 90% power using three technical replicates. The final set of HKGs includes genes that display a CV below 20% in all cell lines and are represented in the UniGene database. Furthermore, the probes were subdivided into three categories representing high- (H), medium- (M), and low- (L) expressed genes. The ranges were defined as follows: L, log2 intensity from −1 to 0; M, log2 intensity from 0 to 2; H, log2 intensity >2. The low expression threshold was selected based on the expression range of the positive control gene LEUB. The threshold of 2 was selected for high expression as it is close to the median of the good-flagged probes. For all bioinformatics analyses, the R software package (http://www.r-project.org) was used (scripts are available upon request).
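The selection procedure described above can be sketched in a few lines. The following is a minimal illustration in Python, not the authors' R scripts. It assumes that the CV is the sample standard deviation divided by the mean of linear-scale intensities, and that the boundary values 0 and 2 fall into the M category; neither detail is stated in the text. The function names and input layout are hypothetical.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent (assumption: sample SD divided by the mean)."""
    return 100.0 * stdev(values) / mean(values)


def expression_category(log2_intensity):
    """Bin a probe into H, M, or L using the log2 ranges quoted in the text.

    Assumption: the boundary values 0 and 2 are assigned to M.
    Returns None for probes below the low-expression range.
    """
    if log2_intensity > 2:
        return "H"
    if log2_intensity >= 0:
        return "M"
    if log2_intensity >= -1:
        return "L"
    return None


def stable_probes(profiles, good_flags, cv_threshold=20.0):
    """Return {probe: (cv, category)} for good-flagged probes with CV below the threshold.

    profiles:   probe -> list of positive mean intensities, one per condition
    good_flags: probe -> True if the probe was flagged good on all arrays
    """
    selected = {}
    for probe, intensities in profiles.items():
        if not good_flags.get(probe, False):
            continue  # only probes flagged good everywhere are considered
        cv = coefficient_of_variation(intensities)
        if cv < cv_threshold:
            category = expression_category(math.log2(mean(intensities)))
            selected[probe] = (cv, category)
    return selected
```

Running `stable_probes` once per cell line yields one candidate set per line; the final HKG list would then be the genes present in every set.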
Real-Time Quantitative PCR Analysis
In separate experiments, hESC lines SA001 and SA002 were maintained and differentiated according to the HD protocol. The cells were harvested from the MEF at days 5, 14, and 21 after passage. Total RNA was extracted using the RNeasy Mini Kit (Qiagen). Gene-specific primer pairs were designed for the following genes: NRPS998, FBXL12, RNF7, SRP72, SLC4A1AP, NUBP1, RND1, ELN, CREBBP, and GAPDH. The conditions for quantitative PCR (QPCR) were optimized, and the assays, including primer sequences, are available from TATAA Biocenter (http://www.tataa.com). The RNA levels of each gene were determined using real-time QPCR as described in detail previously. Quantitation of gene expression was based on the cycle threshold (Ct) value of each sample.
To obtain populations of differentiating hESCs in vitro, we used two different protocols (Fig. 1). In the first, hESCs were maintained in the absence of bFGF on MEF without passaging of the cells (HD protocol). In the second, the hESCs were differentiated through EB formation (EB protocol). Cells were harvested at matched time points and used for subsequent gene expression analysis. To verify that differentiation of the hESCs occurred under the experimental conditions used in this study, the RNA levels of known pluripotency and differentiation markers were monitored. Figure 2 summarizes the results obtained from cells cultured according to the HD protocol and shows the kinetics of the downregulation of OCT4, SOX2, TERT, and DNMT3B and upregulation of FN1, ACTC, and ISLT1 RNA levels during hESC differentiation. Notably, these marker genes cover the range of low-expressed genes (DNMT3B) up to highly expressed genes (FN1). Similar results were obtained for cells differentiated through EB formation (data not shown).
Using published reports, we assembled a list of 56 commonly used HKGs and investigated the stability of these genes in differentiating hESCs. The CV was calculated for each gene and hESC line (Table 1). The threshold for detection of differentially expressed genes using CodeLink Bioarrays is a fold change of 1.5, and this corresponds to a CV of 20% in the experiments used in the present study. Strikingly, only 4 of the 56 genes were stably expressed (CV <20%) in all the cell lines tested. However, none of these four genes were flagged as good in all experiments. For CodeLink Bioarrays, probes are flagged as good (i.e., above the noise level) if the mean spot intensity is 1.5 standard deviations above the background signal. Thus, we included only good-flagged probes in our subsequent analysis. Notably, both GAPDH and ACTB, which are among the most frequently used HKGs, display variation that is unacceptably high for reference genes, with CVs ranging from approximately 60% to almost 170% in our experimental settings (Table 1).
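As a rough plausibility check on the link between a 1.5-fold change and a CV of 20%, consider the simplest case of two expression values that differ by exactly 1.5-fold. The sketch below is an illustration only and is not presented as the authors' derivation; in the study the correspondence was established empirically from probe counts. With the population standard deviation, such a pair gives a CV of 20%; with the sample standard deviation the value is about 28%. The helper name is hypothetical.

```python
from statistics import mean, pstdev, stdev


def cv_percent(values, population=True):
    """CV in percent, using the population SD by default or the sample SD otherwise."""
    sd = pstdev(values) if population else stdev(values)
    return 100.0 * sd / mean(values)


# Two conditions differing by exactly 1.5-fold (arbitrary units).
pair = [1.0, 1.5]

print(f"population-SD CV: {cv_percent(pair):.1f}%")
print(f"sample-SD CV:     {cv_percent(pair, population=False):.1f}%")
```

With more than two conditions, the CV depends on how the intermediate values are distributed, so a fixed fold change no longer maps to a single CV.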
Table 1. CVs for 56 frequently used housekeeping genes (HKGs) in undifferentiated and early differentiating human ESCs (hESCs)
This observation prompted us to perform whole-genome investigations to define a novel set of genes that could be used as HKGs in differentiating hESCs. For each probe and cell line, we used only probes that were good-flagged in all replicates and calculated the CV. Figure 3 shows the number of probes identified and the overlap among the different cell lines. For SA001, SA002, and SA002.5, we obtained 2,308, 3,100, and 1,873 unique genes, respectively, that were good-flagged and displayed a CV below 20%, indicating a stable expression during differentiation. Of these, only 292 genes were common to all cell lines, which illustrates that there are substantial differences in the gene expression profiles among different lines. We further divided these 292 genes into three groups (high-, medium-, and low-expressed genes) based on their absolute expression levels. Table 2 shows a subset of the eight most stable genes from each of the three groups (total of 24 genes), and a complete list of all of the 292 HKGs is available in supplemental online Table 1.
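The reduction from per-line candidate lists to a shared set is a plain set intersection. A minimal sketch follows, assuming each cell line's stable genes are available as a set of gene symbols. The function name and the placeholder symbols in the usage example are hypothetical and are not data from this study.

```python
def common_stable_genes(stable_by_line):
    """Return the genes that are stable in every cell line.

    stable_by_line: cell line name -> iterable of stable gene symbols
    """
    gene_sets = [set(genes) for genes in stable_by_line.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Toy input with placeholder symbols (not real results).
example = {
    "SA001": {"GENE_A", "GENE_B", "GENE_C"},
    "SA002": {"GENE_B", "GENE_C", "GENE_D"},
    "SA002.5": {"GENE_C", "GENE_B", "GENE_E"},
}
shared = common_stable_genes(example)
print(sorted(shared))
```

Because the intersection can only shrink as lines are added, the size of the shared set is a conservative estimate of the genes that are stable across hESC lines in general.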
Table 2. Selection of stably expressed genes in undifferentiated and early differentiating human ESCs (hESCs)
Our initial analysis (Table 1) suggested that a majority of the commonly used HKGs were not stably expressed in differentiating hESCs. To further explore differences and similarities across tissues and cell types in terms of stably expressed genes, we compared our results with previously published data. Eisenberg and Levanon presented a comprehensive panel of 575 HKG candidates for non-stem cell studies, selected from 47 different adult tissues and cell lines. However, only three of these genes (PPP1R11, PPP2CB, and RASSF1) were also observed to be stably expressed in the differentiating hESCs studied here.
To validate our results, we accessed independently generated publicly available hESC gene expression data [27, 28]. We identified stably expressed genes in these data sets and investigated the overlap with our novel group of HKGs. Importantly, these studies used different technical platforms and hESC lines. Skottman et al. used Affymetrix GeneChips for analyzing the gene expression profiles of seven hESC lines and their differentiated progenies. Of the 44,928 probes on the GeneChips, 7,832 unique genes displayed a Present Call in all experiments and were considered to be expressed. A subgroup representing 1,354 genes was stably expressed according to our criteria (CV <20%). Furthermore, 67 of the 292 stably expressed genes listed in supplemental online Table 1 were given a Present Call in the study by Skottman et al., and 17 of these (25%) also displayed CV <20% in the cross-comparisons between undifferentiated and differentiated hESCs. A similar analysis using the data reported by Xu et al. identified 2,109 stably expressed genes (CV <20%). However, this data set was generated using cDNA arrays, and the detection limit of this particular system is unclear. For the purpose of our investigations, we did not filter out genes that were expressed below a certain threshold. Of the 292 stably expressed genes listed in supplemental online Table 1, 77 were included on the cDNA arrays used by Xu et al. Of these 77 genes, 23 (30%) displayed CV <20% and thus overlapped between the two studies. Considering all three studies, 33 of the 292 stably expressed genes are included on both the Affymetrix GeneChips and the cDNA arrays used by Xu et al. Finally, the intersection of stably expressed genes among all three studies, comprising a total of 11 different hESC lines, contained a total of six genes (18%), which are listed in Table 3.
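The overlap percentages in this comparison are simple ratios of the number of genes stable in both data sets to the number of our candidates that could be assessed on the other platform. The short sketch below recomputes them from the counts quoted in the text; the function name is hypothetical.

```python
def percent_overlap(shared, total):
    """Share of `total` assessable genes that are also stable in the other study, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * shared / total


# Counts quoted in the text: (stable in both, candidates assessable on that platform).
comparisons = {
    "Skottman et al.": (17, 67),
    "Xu et al.": (23, 77),
    "all three studies": (6, 33),
}

for name, (shared, total) in comparisons.items():
    print(f"{name}: {shared}/{total} = {percent_overlap(shared, total):.1f}%")
```

Note that the denominators are the assessable candidates rather than the full list of 292 genes, so the percentages describe agreement among comparable genes only.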
Table 3. Intersection of stably expressed genes in this study and previous work of undifferentiated and early differentiating human ESCs (hESCs)
Further validation of a subgroup of the genes listed in Tables 2 and 3 using QPCR demonstrated that most of these genes are indeed relatively stably expressed during early differentiation of hESCs (Fig. 4). These experiments were performed using RNA collected from hESC lines SA001 and SA002 cultured and differentiated independently of the cells used to derive the original set of genes (supplemental online Table 1). The genes NRPS998 and ELN appeared to be the most stable ones showing a ΔCt of approximately 0.4 when comparing cells harvested at day 5 and day 21 after passage. For comparison, we also measured the RNA levels of GAPDH, a commonly used HKG. It is clear that the expression of GAPDH varies substantially, showing a ΔCt of approximately 3, thus making it inappropriate to use as a reference gene in similar experimental settings.
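To put the ΔCt values in perspective, they can be converted to approximate fold differences in template amount. The sketch below assumes ideal amplification, in which the product doubles every cycle (efficiency of 1.0); real assays often deviate from this, and the function name is hypothetical. Under that assumption, a ΔCt of 0.4 corresponds to roughly a 1.3-fold difference, whereas a ΔCt of 3 corresponds to an 8-fold difference.

```python
def fold_change_from_delta_ct(delta_ct, efficiency=1.0):
    """Approximate fold difference implied by a Ct difference.

    Assumption: each cycle multiplies the product by (1 + efficiency),
    so efficiency=1.0 means perfect doubling per cycle.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return (1.0 + efficiency) ** abs(delta_ct)


for label, delta_ct in [("most stable genes", 0.4), ("GAPDH", 3.0)]:
    fold = fold_change_from_delta_ct(delta_ct)
    print(f"{label}: delta Ct {delta_ct} -> about {fold:.2f}-fold")
```

Seen this way, the most stable genes stay below the 1.5-fold detection threshold used for the microarray data, while GAPDH exceeds it by a wide margin.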
To explore the biology of the novel potential HKGs for differentiating hESCs, we used FatiGO and available Gene Ontology (GO) annotations to group the genes according to function, localization, and biological process. The results are shown in supplemental online Figure 1. Briefly, of the 292 genes, 151 have a GO annotation for biological process. The majority are involved in metabolic functions. Molecular function annotation was available for 121 genes, of which 22% are involved in transition metal ion binding, 15% in adenyl nucleotide binding, and 10% in transcription. Of the 113 genes with GO cellular location annotation, 42% are located in the nucleus, 37% are integral to membrane, and only 3% are ribosomal. The GO annotation distribution was compared between our novel group of HKGs and the previously reported HKGs. Using FatiGO, we identified those GO terms that are significantly (p < .05) over- or under-represented between the two groups of genes. For example, the set of stably expressed cation-binding genes is significantly larger, and genes associated with metal ion binding are overrepresented in our novel list, compared with somatic cells. In addition, the fractions of genes annotated as RNA-binding, as well as the fractions of stable ribosomal and cytoplasmic proteins, are significantly smaller than in the HKG set derived from other tissues. Supplemental online Figure 2 summarizes the results from these comparisons and illustrates the differences between the two sets of HKGs in terms of biological function.
Global gene expression analysis has become a widely used tool for assessing the molecular state of various cells and tissues. Most investigators report the differential up- and downregulation of genes in relation to a control or basal state. Less reported but equally important is the identification of genes that remain constant during the experimental conditions used. Together, these genes can provide information on the basal activities and states of the cells. In addition, stably expressed genes (i.e., HKGs) represent markers that can be used for normalization of gene expression data across various samples. However, previous studies have shown that the expression patterns of commonly used HKGs can vary extensively [3, 21, 30, 31]. This suggests that the use of HKGs as normalization controls without appropriate validation might lead to systematic errors in the calculation of fold change in gene expression levels.
In the present study, we investigated for the first time the stability of commonly used HKGs in differentiating hESCs. We observed that most of them are inappropriate to use as reference genes because of their highly variable expression levels (Table 1). Using global gene expression data obtained from three hESC lines, including a clonally derived line, we identified a new group of genes (supplemental online Table 1) that remained unchanged during early differentiation of the cells. Independent QPCR analysis confirmed the expression profiles of a subset of these genes (Fig. 4). One important aspect of this study is the design of the algorithm for identification of stable genes. Notably, the variation in gene expression is dependent on the absolute RNA level. Taking this into consideration, we divided the expression range into three groups and selected stable genes from each of these groups (details given in Materials and Methods). Thus, these genes are novel candidate reference genes for normalization of gene expression data obtained from hESCs and their early progenies.
On a molecular level, the variation among individual hESC lines appears to be due to the genetic diversity in the source material and differences in the initial culture of each line. Our results support this observation. In the cell lines investigated, only 292 genes were stably expressed in all three lines across the differentiation protocols. Since we used stringent inclusion criteria together with a complex matrix of experiments and cell lines, we believe that the novel group contains valid, stably expressed genes. Based on the profiles of stably expressed genes, hESC lines SA001 and SA002 appear to be more similar than SA002 and the clonally derived SA002.5 (Fig. 3). A possible explanation for this is that SA001 and SA002 were analyzed in passages 37 and 41, respectively, whereas SA002.5 was analyzed in passage 155 + 55. In this regard, genetic and epigenetic changes occur in hESCs, and there is a certain degree of culture adaptation leading to altered gene expression in hESCs maintained in vitro [33, 34]. In the present study, we took steps to minimize the contribution of these and other possible variables and thus selected only genes that are stable in different cell lines and passages. However, it is still important to validate the expression profile of any HKG before using it for normalization in any given experimental setup.
Previous studies have identified ubiquitously expressed genes in human cells and tissues [2, 8, 9]. We observed a very poor overlap between these and our novel list of candidate HKGs in differentiating hESCs. This is not very surprising since, to the best of our knowledge, no study so far has actually included hESC-derived material. We also compared the groups of genes with respect to GO annotation and observed significant differences (supplemental online Fig. 2), suggesting that some of the basal maintenance functions in differentiating hESCs may be slightly different from those in specialized cell types. Additional studies are necessary to evaluate the biological significance of these observations.
To validate our HKG set in other independent hESC lines, we analyzed publicly available gene expression data obtained from eight additional hESC lines [27, 28]. Although these data were generated under considerably different conditions (e.g., different laboratories, cell lines, culture conditions, and array platforms), we observed interesting overlaps with our results. The intersection of all three studies contained a total of six stably expressed genes (Table 3). Among these genes, RNF7 and FBXL12 are associated with ubiquitin ligase activities and are involved in cell-cycle progression and development [35, 36]. Another gene, NUBP1, displays sequence similarity to the MinD gene from Escherichia coli, which regulates cell division. Two genes associated with protein transport and ion exchange (SRP72 and SLC4A1AP) are also among the stably expressed genes. Although potentially interesting, the exact functions of these genes in hESCs remain to be investigated.
In summary, the results from this study demonstrate that hESCs and early progenies thereof are quite different from other cell types in terms of their HKG expression. Furthermore, the importance of validating candidate reference genes for subsequent normalization of gene expression data is also highlighted. Our novel group of 292 HKGs for differentiating hESCs represents an important step toward the identification of reliable reference genes for early differentiating hESCs.
The authors indicate no potential conflicts of interest.
This work was supported by Cellartis AB (Göteborg, Sweden), GE Healthcare (Niskayuna, NY), and the Information Fusion Research Program (University of Skövde, Sweden) under Grant 2003/0104 from the Knowledge Foundation. Cellartis is the recipient of NIH Grant R24RR019514-01.