Potential conflict of interest: Nothing to report.
Hepatocyte nuclear factor 4 alpha (HNF4α), a member of the nuclear receptor superfamily, is essential for liver function and is linked to several diseases including diabetes, hemophilia, atherosclerosis, and hepatitis. Although many DNA response elements and target genes have been identified for HNF4α, the complete repertoire of binding sites and target genes in the human genome is unknown. Here, we adapt protein binding microarrays (PBMs) to examine the DNA-binding characteristics of two HNF4α species (rat and human) and isoforms (HNF4α2 and HNF4α8) in a high-throughput fashion. We identified ∼1400 new binding sequences and used this dataset to successfully train a Support Vector Machine (SVM) model that predicts an additional ∼10,000 unique HNF4α-binding sequences; we also identify new rules for HNF4α DNA binding. We performed expression profiling of an HNF4α RNA interference knockdown in HepG2 cells and compared the results to a search of the promoters of all human genes with the PBM and SVM models, as well as published genome-wide location analysis. Using this integrated approach, we identified ∼240 new direct HNF4α human target genes, including new functional categories of genes not typically associated with HNF4α, such as cell cycle, immune function, apoptosis, stress response, and other cancer-related genes. Conclusion: We report the first use of PBMs with a full-length liver-enriched transcription factor and greatly expand the repertoire of HNF4α-binding sequences and target genes, thereby identifying new functions for HNF4α. We also establish a web-based tool, HNF4 Motif Finder, that can be used to identify potential HNF4α-binding sites in any sequence. (HEPATOLOGY 2009.)
Hepatocyte nuclear factor 4α, HNF4α (HNF4A), is a member of the nuclear receptor superfamily of ligand-dependent transcription factors (NR2A1) and a liver-enriched transcription factor (TF) that is also expressed in the kidney, pancreas, intestine, colon, and stomach.1 Originally identified based on its ability to bind DNA response elements in the human apolipoprotein C3 (APOC3) and mouse transthyretin (Ttr) promoters,2 HNF4α has since been shown to play a critical role in both the development of the embryo and the adult liver.3, 4 Mutations in the HNF4A coding sequence and promoter regions are linked to Maturity Onset Diabetes of the Young 1 (MODY1),5 and mutations in HNF4α response elements have been directly linked to disease, most notably in genes encoding blood coagulation factors in hemophilia and in HNF1α in MODY3.6–8 Through classical promoter analysis, functional HNF4α-binding sites have been identified in >140 genes, including those involved in the metabolism of glucose, lipids, and amino acids, as well as xenobiotics and drugs1, 4, 9 (see Supporting Table 1A for a listing of those genes). Recent genome-wide location analyses suggest that the number of HNF4α targets may be much greater (>1000) based on widespread binding of HNF4α to promoter regions,10–12 although it is not known how many of those are functional targets. A more comprehensive list of direct HNF4α targets was recently made even more critical with our finding that HNF4α binds an exchangeable ligand and hence may be a potential drug target.13
HNF4α binds DNA exclusively as a homodimer.14, 15 The canonical HNF4α consensus sequence consists of the half site AGGTCA with one nucleotide spacer (referred to as a DR1, AGGTCAxAGGTCA).16 Whereas the number of experimentally verified HNF4α binding sequences is sizable (>217) (Supporting Tables 1A and 1B), they were derived in a biased fashion building on the first HNF4α-binding sites,2 and subsequently on the direct repeat rules for nuclear receptor DNA binding.16 Furthermore, the total number of 13-base oligomer (13-mer) permutations is much greater than 217 (4^13 ≈ 67 million), and whereas HNF4α will certainly not bind all potential 13-mers, the total number of DNA sequences that will bind HNF4α is anticipated to be in the tens of thousands. Because the presence of one or more HNF4α response elements in the promoter region of a gene is a prerequisite for classification as a direct HNF4α target, it is desirable to accurately predict all the HNF4α-binding sites throughout the genome in an unbiased fashion.
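The size of the 13-mer search space quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the figure of 217 is the count of verified sites cited in the text.

```python
BASES = "ACGT"

def count_kmers(k: int) -> int:
    """Number of distinct DNA sequences of length k (4 choices per position)."""
    return len(BASES) ** k

def dr1_variants(half_site: str = "AGGTCA") -> list[str]:
    """DR1 elements: two copies of the half site separated by one spacer base."""
    return [half_site + spacer + half_site for spacer in BASES]

total_13mers = count_kmers(13)
verified_sites = 217  # lower bound on verified sites, as cited in the text

print(total_13mers)                   # 67108864
print(verified_sites / total_13mers)  # fraction of 13-mer space covered
print(dr1_variants())                 # the four perfect DR1 13-mers
```

Even the four perfect DR1 elements are a vanishing fraction of the space, which is why an unbiased, high-throughput survey is needed.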
Recent genome-wide technologies, most notably genome-wide location analysis (i.e., chromatin immunoprecipitation [ChIP] followed by tiling arrays, known as “ChIP-chip”) and expression profiling, have greatly accelerated the identification of target genes for many TFs, including HNF4α. However, as powerful as those technologies are, they provide information only about the state of the cells used in the assay, not about any other physiological or pathological state. Furthermore, expression profiling cannot indicate whether a gene is a direct or an indirect target, and ChIP does not provide any information about whether expression of the gene is regulated by the bound TF. Moreover, neither assay allows one to precisely identify the sequence to which the TF binds. The third tool in the genomic arsenal, computational prediction of target genes, is curiously less developed than the other two. Although many attempts have been made at predicting TF binding sites, including our own for HNF4α,17 this approach still suffers from a lack of sizable datasets of verified binding sites.
To improve the prediction of potential HNF4α target genes, we adapted the protein binding microarray (PBM) technology to rank thousands of HNF4α sequences based on their relative binding affinities using full-length protein expressed in mammalian cells. We compare two species of HNF4α (rat and human) and two tissue-specific isoforms (HNF4α2 and HNF4α8). Additionally, we use a Support Vector Machine (SVM), a powerful machine learning model, to predict additional HNF4α-binding sequences with high accuracy. Finally, we combine the PBM and SVM binding site searches with expression profiling performed here and ChIP-chip performed by others to identify ∼240 new direct target genes of HNF4α in cells of hepatic origin (see Fig. 1A for an overview).
See Supporting Materials and Methods for additional details.
Preparation of HNF4α Proteins in COS-7 Cells.
Nuclear extracts were prepared from COS-7 cells transiently transfected with HNF4α expression vectors as previously described.15 Mock-transfected samples contained no DNA. Crude nuclear extracts were filtered and concentrated using Microcon Ultracel YM-30 filters (Millipore, Bedford, MA) and applied directly to the PBM (Fig. 1B), except for purified samples that were immunoprecipitated from the crude extracts with the α445 antibody2 (Fig. 2A) and then peptide-eluted.
Protein Binding Microarrays.
Custom 8 × 15k arrays of single-stranded 42-mer to 51-mer oligonucleotides (Agilent Technologies, Santa Clara, CA) were extended on the slide in the presence of Cy3 deoxyuridine triphosphate (dUTP) using a universal primer (Fig. 1C–E) as described in Bulyk.18 Both PBM1 and PBM2 contained 3000 unique sequences replicated five times, including random controls, sequences collected from the literature, sequences mined from ChIP-chip datasets,11, 19 and sequences derived from variations on the consensus 5′-AGGTCAaAGGTCA-3′. PBM2 contained sequences derived from PBM1 and sequences predicted by SVM1 on human promoter regions and the regions reported in ChIP-chip11 (for a complete list of sequences on PBM1 and PBM2, see Supporting Tables 2A and 2B, respectively). Briefly, PBMs were premoistened, incubated with HNF4α protein for 1 hour, washed, and then incubated with the indicated antibodies. All washes and incubations were performed at room temperature (27°C). PBMs were scanned using a GenePix Axon 4000B scanner (Molecular Devices, Sunnyvale, CA) at 543 nm (Cy3 dUTP) and 633 nm (Cy5-conjugated secondary antibody). Signals were gradient-corrected using Micro-Array NORmalization of array–Comparative Genomic Hybridization data (MANOR) implemented in R.20 Cross-array and intra-array normalization was performed using quantile normalization,21 enabling comparison between independent experiments. Replicates for each probe were averaged, and only probes with a coefficient of variation less than 0.3 were used to train the SVM.
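The normalization and filtering steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which ran in R and included MANOR gradient correction, not reproduced here). The function names are hypothetical, ties in quantile normalization are broken by position, and the coefficient of variation is assumed to be the sample standard deviation divided by the mean.

```python
from statistics import mean, stdev

def quantile_normalize(arrays: list[list[float]]) -> list[list[float]]:
    """Give every array the same distribution: the value at each rank is
    replaced by the mean of the values holding that rank across arrays."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

def filter_probes(replicates: dict[str, list[float]],
                  max_cv: float = 0.3) -> dict[str, float]:
    """Average the replicate spots of each probe and keep only probes whose
    coefficient of variation (stdev / mean) is below max_cv."""
    kept = {}
    for probe, values in replicates.items():
        avg = mean(values)
        if avg <= 0:
            continue  # CV is undefined for a non-positive mean
        if stdev(values) / avg < max_cv:
            kept[probe] = avg
    return kept

# Toy intensities for two arrays and two probes with five replicates each.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
print(filter_probes({"probe_1": [10.0, 10.5, 9.5, 10.0, 10.0],
                     "probe_2": [1.0, 10.0, 2.0, 20.0, 5.0]}))
```

In the toy data only `probe_1` survives the filter, because its five replicates agree closely.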
SVM Training and Binding Sequence Analysis.
The kernel-based SVM function (ksvm) from the kernlab package in R with the Laplace dot kernel was used to train the model (SVM1) in the classification mode22 using results averaged from independent PBM1 experiments. SVM1 was then used to generate sequences for PBM2. Another SVM model in the regression mode was trained on the results of the PBM2 experiments (SVM2). For a complete list of sequences in the SVM1 and SVM2 training data, see Supporting Tables 4A and 4B, respectively. The human genome (University of California Santa Cruz [UCSC] Human Genome Browser, UCSC hg18) was searched with the binding sequences from PBM2 and the predicted binding sequences from SVM2 using the sliding window approach.
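A sliding window search of this kind can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, the window width of 13 is taken from the 13-mer binding sequences, and checking the reverse strand is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_sequence(sequence: str, binding_sites: set[str],
                  width: int = 13) -> list[tuple[int, str, str]]:
    """Slide a window of the given width along the sequence and report every
    window that matches a known binding site on either strand.
    Returns (0-based start, strand, matched site) tuples."""
    hits = []
    seq = sequence.upper()
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if window in binding_sites:
            hits.append((start, "+", window))
        rc = reverse_complement(window)
        if rc in binding_sites:
            hits.append((start, "-", rc))
    return hits

# Toy example using the DR1 consensus as the only "known" site.
sites = {"AGGTCAAAGGTCA"}
print(scan_sequence("TTTAGGTCAAAGGTCATTT", sites))  # forward-strand hit
print(scan_sequence("CCTGACCTTTGACCTGG", sites))    # reverse-strand hit
```

Storing the binding sequences in a set makes each window lookup constant time, so scanning cost grows linearly with sequence length.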
RNA Interference and Expression Profiling Analysis.
RNA interference (RNAi) against HNF4α2 was performed in HepG2 cells using small, interfering RNAs (siRNAs) corresponding to nucleotides +179 to +197 of human HNF4A (NM_178849, sense siRNA: 5′-UGUGCAGGUGUUGACGAUGdTdT-3′, antisense siRNA 5′-CAUCGUCAACACCUGCACAdTdT-3′) (Dharmacon, Lafayette, CO). Total RNA was extracted with Trizol (Life Technologies, Carlsbad, CA) and reverse transcribed with the Reverse Transcription System (Promega, Madison, WI). Polymerase chain reaction (PCR) amplification was performed in the linear range (see Supporting Table 3B for a list of PCR primers). Expression profiling analysis was performed with Affymetrix oligonucleotide arrays (HGU133 Plus 2.0) using RNA from control (PGL3 siRNA) or treated (HNF4α siRNA) HepG2 cells, and analyzed as previously described.13
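As a quick consistency check on the siRNA duplex listed above, the following Python sketch confirms that the antisense strand is the reverse complement of the sense strand and that the 19-nucleotide core matches the stated target span (+179 to +197). The 3′ dTdT overhangs are omitted, and the helper name is hypothetical.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def rna_reverse_complement(seq: str) -> str:
    """Reverse complement of an RNA sequence."""
    return seq.translate(RNA_COMPLEMENT)[::-1]

# siRNA cores as given in the text, without the dTdT overhangs.
SENSE = "UGUGCAGGUGUUGACGAUG"
ANTISENSE = "CAUCGUCAACACCUGCACA"

print(len(SENSE) == 197 - 179 + 1)                  # True: 19 nucleotides
print(rna_reverse_complement(SENSE) == ANTISENSE)   # True: strands pair fully
```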
Chromatin Immunoprecipitation and ChIP-Chip Analysis.
ChIP for HNF4α from HepG2 cells on the Ninjurin 1 (NINJ1) promoter was carried out as previously described.23 HNF4α ChIP-chip data from primary human hepatocytes11 were extracted from ArrayExpress database, reanalyzed with the Bioconductor package LIMMA and ACME,24, 25 and subsequently visualized using Integrated Genome Browser (IGB; Affymetrix, Santa Clara, CA).
Protein-Binding Microarrays Using Full-Length HNF4α in Crude Nuclear Extracts.
PBMs are a high-throughput in vitro DNA binding assay that allows for the examination of TF binding to thousands of unique sequences in a single experiment.26 Recently, PBMs have been used to define the DNA-binding specificity of large classes of TFs27, 28 and have been shown to correlate well with gel shift results.29 Whereas others have pioneered the technology using the DNA-binding domain (DBD) of TFs purified from bacteria, here we adapt the PBM technology to more closely approximate physiological conditions. Because HNF4α has a very strong dimerization domain outside of the DBD and a very low affinity for DNA when expressed in bacteria,14, 30, 31 we ectopically expressed full-length, native HNF4α in COS-7 cells and prepared minimally processed nuclear extracts (Fig. 1B) that we then applied directly to a PBM specifically designed for HNF4α (Fig. 1C,D). The PBM was developed with a highly specific antibody to the C-terminus of HNF4α (Supporting Fig. 1), allowing us to examine a completely native TF. The full-length HNF4α protein in the crude extracts yielded an excellent signal with a range of intensities, whereas extracts from mock-transfected cells yielded no reproducible signals (Fig. 1E).
Reproducibility and Utility of Adapted Protein-Binding Microarrays.
We compared two species (rat and human) and two isoforms of HNF4α (HNF4α2 and HNF4α8), as well as antibodies that recognized different regions of HNF4α (Fig. 2A). There was an excellent correlation between replicate arrays in the first-generation PBM (PBM1) using crude nuclear extracts, regardless of antibody used (R² = 0.78), and results with affinity-purified protein were very similar to those with crude extracts (R² = 0.68) (Fig. 2B). In a second generation of the PBM (PBM2), different HNF4α isoforms (HNF4α2 versus HNF4α8) and species (human versus rat) also produced excellent correlations (R² > 0.9), indicating that these isoform and species differences do not influence the binding of HNF4α to DNA. This is not surprising considering that the DBD is identical in these constructs (Fig. 2A).
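For readers who want to reproduce this kind of replicate comparison, a minimal Python sketch follows. It assumes that the reported correlation is the square of the Pearson correlation coefficient between probe intensities on two arrays, which the text does not state explicitly. The function name and the intensities are hypothetical.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """Square of the Pearson correlation between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical normalized intensities for the same probes on two arrays.
array_1 = [0.2, 1.1, 2.9, 4.2, 5.0]
array_2 = [0.4, 0.9, 3.1, 3.8, 5.3]
print(round(r_squared(array_1, array_2), 3))
```

Values near 1 mean that the two arrays rank and scale the probes almost identically.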
Accuracy of PBM and SVM.
PBM1 identified ∼500 new HNF4α binding sequences with the DR1-derived sequences exhibiting the best binding affinities relative to negative controls (P < 8.274 × 10⁻¹²) (Fig. 3A). Sequences derived from ChIP-chip analysis bound roughly as well as the DR1 variants. In PBM2, an additional ∼1000 novel sequences that strongly bind HNF4α were identified, including sequences identified by SVM1. The signal-to-noise ratio (literature-derived versus random sites) was also significantly improved in PBM2 due to optimization of the binding conditions (P < 2.6 × 10⁻¹¹ versus P < 2.6 × 10⁻¹⁶, respectively, using the Student t test) (Fig. 3B). The PBM2 results also correlated very well with gel shift results (Fig. 3C). Additionally, SVM2 derived from PBM2 predicted binding sequences with a high degree of accuracy (R² = 0.76) (Fig. 3D).
Identification of New “Rules” for HNF4α DNA Binding by PBM.
Even though position weight matrices (PWMs) do not capture the interdependence between the positions in a motif as do PBMs and SVMs, they are useful for describing motifs. Interestingly, the PWM of the ∼450 sequences that yielded the greatest binding intensity in PBM2 (“strong binders”) did not strictly follow the DR1 rule of AGGTCAxAGGTCA. Rather, a core sequence of CAAAG is the most prominent feature, with the classical AGGTCA half-site evident only on the 3′ side (Fig. 4A), a finding supported by the recent crystallographic structure of the HNF4α DBD on DNA in which fewer hydrogen bonds were observed between the HNF4α protein and the 5′ half site.32 In the PWMs for the medium and weak binding motifs, the three A's in the core appeared less frequently.
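To make the PWM description concrete, the following Python sketch builds a position frequency matrix from a set of aligned 13-mers and reads off the most frequent base at each position. A full PWM would usually convert these frequencies to log-odds scores against a background model, which is omitted here. The three input sequences are toy examples chosen to contain the CAAAG core, not sequences from PBM2, and the function names are hypothetical.

```python
BASES = "ACGT"

def position_frequency_matrix(sequences: list[str]) -> list[dict[str, float]]:
    """Per-position base frequencies for a set of aligned, equal-length sequences."""
    width = len(sequences[0])
    if any(len(s) != width for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(width):
        column = [s[pos] for s in sequences]
        matrix.append({b: column.count(b) / len(sequences) for b in BASES})
    return matrix

def consensus(matrix: list[dict[str, float]]) -> str:
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(max(col, key=col.get) for col in matrix)

# Toy aligned 13-mers (illustrative only).
toy_binders = ["AGGGCAAAGTCCA", "TGGACAAAGTTCA", "AGGTCAAAGGTCA"]
pfm = position_frequency_matrix(toy_binders)
print(consensus(pfm)[4:9])  # core positions 5-9 of the 13-mer
```

Because each column is summarized independently, this representation cannot express dependencies between positions, which is the limitation of PWMs noted above.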
Using ∼1400 strong HNF4α-binding sequences obtained from PBM2, we determined the distribution of potential HNF4α-binding sites in the human genome and found a broad distribution of sites with an enrichment within ∼1 kilobase (kb) of the transcription start site (+1) (Fig. 4B). This is in contrast to profiles of sites for some other TFs, such as Sp1 and ELK1, that are found more exclusively near +1,33 but is consistent with the fact that there are many well-characterized HNF4α sites far from +1. We also found a small percentage (<1%) of sites that bound HNF4α well in PBM2 but did not contain the CAAAG core (see Supporting Fig. 7 for the PWM and gel shift assay), but the biological relevance of these sequences remains to be verified.
Expression Profiling of an HNF4α RNAi Knockdown in Hepatic Cells.
To identify functional HNF4α target genes, we used RNAi to knock down HNF4α2 expression in HepG2 cells, a human hepatocellular carcinoma cell line that expresses endogenous HNF4α and many liver-specific genes (Fig. 5A, top panels and Supporting Fig. 5). Using the SVM2 model, we predicted several other potential HNF4α target genes and determined that they were also down-regulated by reverse transcription PCR (APOC4, RDH16, APOM, APOH, SPSB2, UBD, ZDHHC11) (Fig. 5A, bottom panel). Whole-genome expression profiling identified ∼1500 additional genes that were down-regulated (see Supporting Table 3A for a complete list). Interestingly, the gene that was down-regulated the most—Ninjurin 1 (NINJ1) (12.5-fold)—is not a gene typically associated with HNF4α function (i.e., intermediary metabolism); rather, it is involved in regulating the cell cycle. In order to determine whether NINJ1 is a direct target of HNF4α, we used SVM2 to identify a potential HNF4α binding site within the NINJ1 promoter region (Fig. 5B) and subsequently verified that it was bound by HNF4α in vivo using a ChIP assay (Fig. 5C) and in vitro using a gel shift assay (Fig. 5D); these results suggest that NINJ1 is indeed a direct target of HNF4α.
Gene Ontology Analysis Reveals Complementary Nature of PBM, Expression Profiling, and ChIP Analysis.
To compare the different methods of predicting target genes, we performed Gene Ontology (GO) analysis on the HNF4α targets predicted by RNAi expression profiling and the PBM2 search (−2 kb to +1 kb), as well as on published HNF4α ChIP-chip results from primary human hepatocytes11 (Fig. 6). In general, six broad biological processes contained significant GO terms for all three assays (metabolism, transport, development, regulation of signal transduction, protein modification, and apoptosis), showing the overlapping nature of the three assays. There were three additional categories (inflammatory response, cell cycle, and nucleic acid metabolism) in which genes from at least one but not all three assays were overrepresented. The most notable difference between the PBM2 search and the other assays was an enrichment of genes involved in developmental processes. This is consistent with the known role of HNF4α in early development,34 and could be explained by the fact that the cells used in the ChIP-chip and RNAi assays are from adult stages, not embryonic stages. In general, the ChIP assay yielded more significant GO terms in all categories, which is most likely a reflection of the more specific nature of this assay and the stringent cutoff values used.
Identification of New HNF4α Target Genes and New Functions.
In order to more closely compare the three methods of identifying potential target genes, we cross-referenced the PBM2 search results with the HNF4α RNAi and ChIP-chip results. We identified 198 genes that were positive in all three categories, i.e., bound by HNF4α in ChIP-chip, down-regulated by HNF4α in HepG2 RNAi, and containing one or more verified HNF4α-binding sites in the −2 kb to +1 kb region of the promoter (Fig. 7A). A similar analysis with the SVM2 search yielded 135 genes (Fig. 7B). Between these two sets, there were ∼260 nonredundant genes, of which ∼240 were not in the original list of HNF4α target genes from the literature (Supporting Table 1A). Several of these genes are new targets within known categories of HNF4α targets (e.g., homeostasis: solute carrier protein [SLC] genes; lipid metabolism: ABCC6, DGAT2, hydroxysteroid dehydrogenase [HSD] genes), or more recently identified targets of HNF4α (e.g., CREB3L3, NR1I2, NR1H4, DO1).35–38 There were also many genes that, like NINJ1, are in completely new categories of genes not typically associated with HNF4α (e.g., signal transduction, immune response, stress response, apoptosis, cancer related, and cell structure) (Fig. 7C), several of which are reminiscent of the new functional categories identified by GO (Fig. 6). In order to determine whether the ChIP signal overlapped with the PBM or SVM sites in these new targets, all three datasets were visualized using Integrated Genome Browser. Although not all ChIP signals aligned exactly with the PBM or SVM sites, a very large number did; a sampling of these is shown in Fig. 8.
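The cross-referencing logic described above amounts to set intersection followed by a union and a difference. The Python sketch below illustrates it on toy data. The gene names are placeholders, the function name is hypothetical, and the threshold of four or more SVM sites follows the requirement stated for Fig. 7B in the Discussion.

```python
def triple_positive(chip_bound: set[str], rnai_down: set[str],
                    site_counts: dict[str, int], min_sites: int = 1) -> set[str]:
    """Genes that are bound in ChIP-chip, down-regulated in the RNAi knockdown,
    and carry at least min_sites predicted sites in the promoter window."""
    with_sites = {gene for gene, n in site_counts.items() if n >= min_sites}
    return chip_bound & rnai_down & with_sites

# Placeholder gene sets; real inputs would come from the three assays.
chip_bound = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
rnai_down = {"GENE_A", "GENE_B", "GENE_D", "GENE_E"}
pbm_sites = {"GENE_A": 1, "GENE_B": 2, "GENE_E": 3}   # sites per promoter
svm_sites = {"GENE_A": 2, "GENE_B": 5, "GENE_D": 4}
known_targets = {"GENE_B"}                            # literature-derived list

pbm_set = triple_positive(chip_bound, rnai_down, pbm_sites)
svm_set = triple_positive(chip_bound, rnai_down, svm_sites, min_sites=4)
combined = pbm_set | svm_set            # nonredundant triple-positive genes
new_targets = combined - known_targets  # not previously reported
print(sorted(new_targets))              # ['GENE_A', 'GENE_D']
```

Keeping the site counts as a mapping rather than a plain set lets the same function serve both the PBM search (one or more sites) and the stricter SVM search.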
Identification of TF binding sites and target genes can be a laborious process. Recent genome-scale technologies such as expression profiling and genome-wide location analysis can greatly expand the repertoire of potential targets with relative ease, although the question remains as to which are direct targets that contain bona fide binding sites. PBMs allow for a high-throughput identification of DNA binding sequences that can then be integrated with the other techniques, and can also be used to predict potential new targets in additional tissues or developmental stages.
Here, we successfully adapt the PBM technology to assess HNF4α DNA binding under conditions that more closely approximate physiological conditions (i.e., native full-length receptor in a crude nuclear extract) (Fig. 1). We show that the PBM results are highly reproducible across different species (human and rat) and isoforms (α2 and α8) of HNF4α under a variety of conditions (Figs. 2 and 3). We identify new rules for DNA binding and develop an SVM model to predict additional sites (Figs. 3 and 4). We compare the PBM and SVM results to RNAi expression profiling (Fig. 5) as well as to published ChIP-chip results in order to develop an integrated approach for the identification of human HNF4α target genes. We show that all three systems yield similar overrepresented categories of target genes (Fig. 6), supporting the notion that specific TF binding sites in promoter regions are a major factor in driving gene expression. Using this integrated approach, we identified ∼240 new, direct targets of HNF4α, many of which are in new functional categories (Figs. 7 and 8). To our knowledge, this is the first such integration of extensive PBM, ChIP-chip, and expression profiling data for any TF. Finally, to facilitate future HNF4α target gene research, we have developed a publicly available web-based tool (HNF4 Motif Finder) based on our PBM results that can be used to search any DNA sequence for potential HNF4α-binding sites (http://nrmotif.ucr.edu).
We define direct targets as genes that meet three criteria: they contain a functional binding site in a regulatory region (PBM/SVM search), are bound by HNF4α in vivo at the promoter (ChIP), and are down-regulated when HNF4α expression is knocked down (RNAi). Applying these criteria, we expand upon the classical roles of HNF4α by identifying additional target genes involved in metabolism (e.g., APOM, LIPC, LPIN1), solute carrier transport (e.g., SLC7A2, SLC12A7, SLC25A20), protein transport and secretion (e.g., COPA, GOLGB1, GOLGA1), as well as transcription regulation (e.g., HDAC6, MED14).
The integrated approach also identified new HNF4α targets in pathways not previously associated with HNF4α, such as regulation of signal transduction (e.g., TAOK3, NGEF, PRKCZ, FNTB), and inflammation and immune response (e.g., IL32, BRE, LEAP2, IFITM2, BAT3). Perhaps the most intriguing new categories of HNF4α target genes are those involved in apoptosis, DNA repair, and cancer. HNF4α has long been considered a key factor in hepatocyte differentiation,3, 4 but there are an increasing number of reports indicating that HNF4α may act as a tumor suppressor.39, 40 This view is supported by the new target genes identified here, such as NINJ1 (Fig. 5), which may play a role in regulating cellular senescence by inducing the expression of p21, a cell cycle inhibitor gene,41 and is consistent with our previous findings that the p21 gene (CDKN1A) itself is a direct target of HNF4α.23 Other new HNF4α target genes related to anti-growth effects are: CIDEC, which induces fragmentation of DNA upon apoptosis; ATPIF1, which inhibits an adenosine triphosphatase involved in angiogenesis; and STEAP3, which is induced by tumor suppressor p53 and whose down-regulation is associated with a transition from cirrhosis to hepatocellular carcinoma.42 There were also genes involved in stress responses, such as the DNA repair gene FANCF (Fanconi anemia, complementation group F) and USP1, a ubiquitin-specific protease.
In addition to the genes that meet the three criteria mentioned above, our analysis also revealed thousands of additional genes that met only one or two of the three criteria. While technical considerations (e.g., missing tiles in the ChIP-chip, malfunctioning probes in the expression arrays, false positives in the ChIP assay, etc.) are sure to account for some of those genes, other explanations are also possible. For example, the genes present only in the expression profiling could be indirect targets of HNF4α and hence yield no PBM/SVM or ChIP signal. Genes present in ChIP-chip alone could contain as-yet unidentified HNF4α-binding sites or recruit HNF4α in a nondirect fashion; it should also be noted that in Fig. 7B, we imposed a fairly stringent requirement of four or more SVM sites for a gene to be included in that analysis. Genes identified only in the PBM/SVM searches could contain bona fide HNF4α-binding sites but are simply not expressed in the hepatocellular carcinoma cell line (HepG2) used in the expression profiling nor in the particular set of primary human hepatocytes used in the ChIP-chip. It could also be that in adult hepatocytes the promoter regions of those genes are not available for binding (and hence activation) due to the structure of the chromatin. Genes found only in the PBM/SVM searches could also represent nonhepatic targets that are expressed in other HNF4α-expressing tissues such as kidney, pancreas, intestine, and colon. Finally, it is also possible that there may be potential HNF4α-binding sites in the human genome that are never used by HNF4α.
Whatever the reasons for the incomplete overlap between the three assays, the use of the PBM/SVM results presented here, as well as the web-based HNF4 Motif Finder, should greatly facilitate any future investigation of potential HNF4α target genes. Additionally, our approach of integrating data from multiple genome-wide assays, including PBMs, provides a powerful new framework for identifying direct targets of TFs.
This work was funded by grants to F.M.S. (National Institutes of Health [NIH] DK053892), T.J. (National Science Foundation IIS-0711129), F.M.S. and T.J. (University of California Riverside Institute for Integrative Genome Biology, NIH R21MH087397), E.B. (PhRMA Foundation predoctoral fellowship), and W.H.-V. (University of California Toxic Substance Training Grant). We would also like to thank the following for help: A. Karatzoglou (ksvm), S. Davis (ACME), and J. Schnabl (Supporting Table 1A).