Henk G. Stunnenberg, Nijmegen Center for Molecular Life Sciences, Department of Molecular Biology, M850/3, 79, Radboud University Nijmegen, Geert Grooteplein 30, Nijmegen, the Netherlands. e-mail: firstname.lastname@example.org
Despite the availability of several completely sequenced genomes, we are still, for the most part, ignorant about how genes interact and regulate each other within a given cell type to specify identity, function and cellular memory. A realistic model of cellular regulation based on current knowledge indicates that many interacting networks operate at the epigenetic, transcriptional, translational and post-translational levels, with feedback between the various levels. Protein–protein and protein–DNA interactions help to define which genes may be activated in a particular cell, and determine whether external cues cause activation or repression. New technologies, e.g. proteomics using mass spectrometry, high-density DNA or oligonucleotide microarrays (chips), and chromatin immunoprecipitation (ChIP), provide new and exciting tools for deciphering the pathways and proteins controlling gene expression. Analysis of these pathways offers new insight that aids targeted drug development.
SILAC, stable isotope labelling with amino acids in cell culture
TAP, tandem affinity purification
MBD, methyl CpG binding protein
TBP, TATA-element binding protein.
Knowledge about the human genome has increased exponentially since the description of the structure of DNA by Watson and Crick 50 years ago. The four bases forming the structure of DNA have been identified, and the genetic code deciphered. Basic mechanisms for transcription and translation have been described. Techniques were developed to sequence both individual genes and the genome as a whole. The sequence information has, in turn, helped identify open reading frames and intron/exon boundaries. However, the process by which expression of individual genes or groups of genes actually occurs in different cell types is less clear.
In a human cell, ≈ 2 m of DNA are packed into a nucleus as a superstructure called chromatin. Microscopically, dense and light staining nuclear regions can be distinguished and have been classically defined as heterochromatin and euchromatin. Although these are cytological terms, heterochromatin and euchromatin are generally regarded as closed and open, or transcriptionally inactive and active, respectively. Chromatin structure not only stores the genetic material but also has a very important functional role in regulating gene expression. Although many different DNA-associated proteins are present, it is the nucleosome, an octamer of four different histones around which the DNA is wrapped, that represents the basic repeat unit within the chromatin. This packaging of the DNA creates a barrier for reading and interpreting the stored DNA-sequence information. Access to this information is controlled at the level of the histones: post-translational modifications especially at the histone N-termini direct transcription processes. Different combinations of phosphorylation, methylation, acetylation, and ubiquitination ‘marks’ determine whether nucleosomes will be remodelled to establish a transcription-permissive chromatin structure. As such, the pattern of modifications within the chromatin acts as a ‘barcode’ that provides information about the transcriptional state of a gene. This is known as the ‘histone code theory’. Histone-modifying enzymes such as histone acetyltransferases, histone deacetylases and histone methyltransferases ‘write’ this barcode, whereas other protein complexes ‘read’ and interpret the code. Apart from the modifications present on histones, the DNA itself is also subject to modification. DNA is subjected to methylation of cytosine bases within CpG dinucleotides.
The information encoded by the palette of modifications is ‘epigenetic’, because it provides a heritable source of information that is passed on independently of DNA sequence. Gene expression is therefore regulated not only by DNA sequence, but also by epigenetic determinants, histone-modifying proteins, and proteins that interpret these marks. Such proteins are typically constituents of multisubunit protein complexes. As such, protein–protein interaction networks determine which genes are activated in a particular cell as well as whether a molecule will act as an agonist, an antagonist, or a neutral agent in a given cell pathway. This additional level of complexity allows many cell type-specific ‘epigenomes’ to exist within a single organism.
The influence of the epigenetic state on gene expression is apparent in the process of X inactivation, which ensures that only one of the two X chromosomes in the female cell is transcriptionally active. This occurs through epigenetic modifications, and results in silencing of one of the two X chromosomes (Fig. 1). Analysis of the inactive (Xi) and active (Xa) human X chromosomes, by our group and others, revealed that the two alleles of a given X-linked gene have different patterns of histone modifications [3–5]. Thus, although these two X chromosomes share the same DNA sequence, their epigenetic state determines whether and how DNA-encoded sequence information is used. Although genomic sequences serve as blueprints for the biology of a cell, DNA sequence alone does not provide enough information to decipher the mechanisms that underlie processes such as cell specialization and reprogramming of cancer cells. Clearly, large-scale and genome-wide analysis of epigenetic patterns is necessary to understand these processes. New technologies, including proteomics, microarrays, chromatin immunoprecipitation (ChIP), and combinations of these techniques, provide unique and exciting tools to investigate chromatin structure and the proteins that control gene expression through chromatin structure. In this review we describe recent progress and discuss how these novel technological advances can be used to gain insight into the complex mechanisms underlying regulation of gene expression and to support the targeted development of therapeutic agents.
ANALYSIS OF THE TRANSCRIPTOME
In the last decade technological advances have enabled large-scale gene expression analysis or ‘expression profiling.’ These emerging technologies have proved to be highly useful in elucidating novel regulatory pathways associated with a variety of diseases. The availability of such expertise is crucial to correlate gene expression status with associated epigenetic patterns.
Fluctuations of mRNA levels in a cell are mostly caused by changes in the transcriptional rate of genes, but they also involve post-transcriptional events. The strategy for analysing the abundance of mRNAs on a large scale, expression profiling using DNA microarrays, has become widely used. In short, RNA is converted to cDNA, labelled with Cy3 or Cy5 fluorophores, and hybridized to a DNA microarray (‘gene chip’). The current generation of DNA microarrays consists of a glass slide containing tens of thousands of DNA target molecules of known identity. Labelled cDNAs that specifically hybridise with their complementary target DNA molecule are retained on the microarray and can be quantified. Typically, a single experiment compares pools of reference cDNA labelled with Cy3 with a cDNA preparation labelled with Cy5 from cells triggered by an external stimulus such as a ligand. Both pools are hybridized on the microarray, and the intensity of each fluorophore is measured at two different wavelengths. The ratio of the two intensities is determined to identify up-regulated or down-regulated genes.
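The ratio-based readout of a two-colour experiment can be illustrated with a minimal sketch. It assumes background-corrected intensities and skips the normalization that real analyses require; the function names, the gene labels, the intensity values and the two-fold cut-off are hypothetical choices made for illustration only.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(Cy5/Cy3) for one spot.

    Positive values mean a stronger signal in the Cy5-labelled (stimulated)
    sample than in the Cy3-labelled reference.
    """
    return math.log2(cy5 / cy3)

def classify_spots(spots, threshold=1.0):
    """Label each gene as 'up', 'down' or 'unchanged'.

    spots: dict mapping gene name -> (cy3_intensity, cy5_intensity).
    threshold: cut-off in log2 units (assumption: 1.0, i.e. two-fold).
    """
    calls = {}
    for gene, (cy3, cy5) in spots.items():
        m = log2_ratio(cy5, cy3)
        if m >= threshold:
            calls[gene] = "up"
        elif m <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Made-up intensities for three hypothetical genes.
example = {
    "geneA": (500.0, 2000.0),   # four-fold higher in the Cy5 sample
    "geneB": (1600.0, 400.0),   # four-fold lower in the Cy5 sample
    "geneC": (800.0, 900.0),    # essentially unchanged
}
calls = classify_spots(example)
```

Working on a log2 scale makes up- and down-regulation symmetric around zero, which is why a single threshold can serve both directions.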
Gene expression profiling using DNA microarrays and other variants of the microarray technique has been used successfully to elucidate changes that occur during tumour development and progression. For example, oligoarrays have been used to identify up-regulation of cytokine genes in biopsies from human renal allografts undergoing acute rejection. Oligonucleotide microarray analysis of 6416 genes and expressed sequence tags showed the effects of the proto-oncogene myc on cell growth, cell cycle progression, adhesion, and cytoskeletal organization. Use of cDNA microarray analysis combined with gene-specific analysis showed up-regulation of DD3, a gene specific to prostate cancer, and down-regulation of IGF binding protein-3 [8,9]. Microarray analysis also showed that p300/CREB-binding protein-associated factor, a coactivator of the tumour suppressor p53 gene, was induced by p53 in breast tumour cell lines.
An alternative approach to expression profiling is to use high-density oligoarrays spanning entire chromosomes (so-called ‘tile path’ arrays). One application of this type of array uses microarrays containing oligonucleotides placed every 50 bp, allowing high-resolution analysis of transcription of an entire chromosome or the whole genome. In an in-depth analysis of human chromosomes 21 and 22, this technique identified 126 novel transcripts. In addition, the authors identified many novel noncoding RNA transcripts with as yet unknown function, showing the power of unbiased approaches.
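How a tile path array with probes every 50 bp could translate into transcribed segments is sketched below. The 50 bp spacing comes from the text; the 25-nt probe length, the intensity threshold, the requirement of at least two adjacent positive probes and all function names are assumptions for illustration and do not reproduce the published analysis.

```python
def tile_probe_starts(region_start, region_end, probe_len=25, spacing=50):
    """Start coordinates of probes tiled every `spacing` bp across a region.

    probe_len is an assumed oligonucleotide length; only probes that fit
    entirely inside the region are returned.
    """
    return list(range(region_start, region_end - probe_len + 1, spacing))

def call_transcribed_segments(probe_signals, threshold, spacing=50,
                              probe_len=25, min_probes=2):
    """Merge runs of adjacent above-threshold probes into (start, end) segments.

    probe_signals: list of (probe_start, intensity), sorted by probe_start.
    A run must contain at least `min_probes` neighbouring probes to be kept.
    """
    segments = []
    run = []
    for start, intensity in probe_signals:
        positive = intensity >= threshold
        if positive and (not run or start - run[-1] == spacing):
            run.append(start)
            continue
        # The current run ends here: keep it if it is long enough.
        if len(run) >= min_probes:
            segments.append((run[0], run[-1] + probe_len))
        run = [start] if positive else []
    if len(run) >= min_probes:
        segments.append((run[0], run[-1] + probe_len))
    return segments

# Made-up intensities for a 400 bp toy region.
starts = tile_probe_starts(0, 400)
intensities = [10, 200, 250, 220, 15, 12, 300, 11]
segments = call_transcribed_segments(list(zip(starts, intensities)), threshold=100)
```

Requiring neighbouring probes to agree is one simple way to suppress isolated, possibly cross-hybridizing probes, at the cost of missing very short transcripts.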
ANALYSIS OF THE PROTEOME
The challenge of proteomic analysis lies in the size and complexity of the entire set of proteins expressed by a cell. The human genome contains > 25 000 genes. Alternative splicing of mRNA, proteolytic cleavage, and post-translational modifications of proteins increase the complexity of the human proteome up to ≈ 100 million different polypeptide chains. Classic proteome analysis approaches use two-dimensional gel electrophoresis followed by isolation and characterization of individual protein spots by mass spectrometry (MS). This is labour-intensive and time-consuming, hampering rapid and complete analysis of the proteome. The time required for comprehensive proteomic analyses has recently decreased significantly owing to improved methods and instruments. These include separation of complex protein mixtures by microcapillary liquid chromatography, electrospray ionization, tandem mass spectrometry (MS/MS), Fourier transform ion cyclotron resonance MS, and greatly improved database search algorithms. The high accuracy and sensitivity now allow the unequivocal identification of peptides, including their post-translational modifications, at the femtomole level. Using this approach, we have analysed the proteomes of asexual blood stages, gametocytes and gametes of the human malaria parasite Plasmodium falciparum. By separating complex peptide sample mixtures on a reverse-phase column and direct on-line injection into the mass spectrometer, we identified 1289 proteins.
Due to the increased sensitivity of MS, stable isotope labelling with amino acids in cell culture (SILAC) has provided another technical advance that holds great promise for large-scale quantitative proteomic analysis. In this method, replacement of amino acids by ‘heavy’ isotope-containing amino acids, such as 13C6-Arg, is achieved by simply culturing a cell population in medium supplemented with the heavy amino acid. Labelled and unlabelled cells can be treated with different stimuli, such as agonist or antagonist, and are subsequently combined before further purification [13,14]. Because of the presence of a light and a heavy isotope, a qualitative and quantitative comparison of the peptide complement in the two cultures through MS has become feasible. The ratio of peak heights between heavy and light peptides reflects the relative abundance of a protein in the two cultures, and thus its presence, absence or change in level as a consequence of drug treatment. Using the SILAC approach, de Hoog et al. identified new cell adhesion components by comparing labelled attached cells with unlabelled detached cells. The results showed the unexpected involvement of RNA and RNA-binding proteins in contacts between cells and the matrix. As modulation of cell adhesion is a prerequisite for tumour metastasis, these new results help to pinpoint novel targets for drug development. The SILAC approach has also been used to analyse microsomal fractions of highly metastatic prostate tumours, leading to the identification of proteins that were higher or lower in metastasising cells than in non-metastasising cells.
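The arithmetic behind a SILAC comparison can be sketched briefly. The example assumes a 13C6-Arg label, so that each labelled arginine adds six times the 13C/12C mass difference, and summarises a protein by the median of its peptide ratios, which is one common choice rather than a method stated in the text. Function names and intensity values are hypothetical.

```python
import statistics

# Mass difference between a 13C and a 12C atom, in Da.
C13_MINUS_C12 = 1.0033548

def heavy_mz(light_mz, charge, n_labelled_arg=1, carbons_per_label=6):
    """Expected m/z of the heavy partner of a light peptide ion.

    Assumes each labelled arginine carries six 13C atoms (13C6-Arg).
    The mass shift is divided by the charge state of the ion.
    """
    shift = n_labelled_arg * carbons_per_label * C13_MINUS_C12
    return light_mz + shift / charge

def protein_ratio(peptide_pairs):
    """Median heavy/light ratio over the peptide pairs of one protein.

    peptide_pairs: list of (heavy_intensity, light_intensity).
    """
    return statistics.median(h / l for h, l in peptide_pairs)

# A doubly charged peptide with one arginine: the partner peak sits ~3.01 m/z higher.
partner = heavy_mz(500.0, charge=2)

# Made-up peak heights for three peptides of one hypothetical protein.
ratio = protein_ratio([(200.0, 100.0), (180.0, 100.0), (220.0, 100.0)])
```

A ratio near 1 would indicate no change between the two cultures, whereas a ratio far from 1 points to a protein whose level differs between the treatments.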
Recently, high-throughput experimental approaches have been established in yeast and other organisms to gain insights into the network and composition of protein complexes inside a cell. These approaches involve fusing an epitope, for which a high-affinity antibody is available, to a protein of interest, a procedure termed tagging. Through homologous recombination in yeast, the wild-type gene is replaced with its tagged version. The tagged protein of interest is subsequently purified along with putative interacting partners via a technique called tandem affinity purification (TAP). The purified protein complex can be visualized by silver staining and analysed by MS or directly analysed by liquid chromatography-MS/MS. Newly identified subunits of a particular complex are validated by tagging these novel subunits and repeating the purification and analysis. Using this procedure, one can ‘walk’ from one protein complex to the next and visualize the ‘interactome’ to provide a functional context of a certain group of proteins. A combination of TAP and MS of the Saccharomyces cerevisiae (baker's yeast) proteome led to the identification of 232 distinct multiprotein complexes and the assignment of possible functions to these complexes.
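The idea of ‘walking’ through the interactome by repeatedly tagging newly found subunits resembles a breadth-first traversal of a graph. The sketch below makes that analogy concrete under strong simplifications: each purification is replaced by a lookup in a hypothetical table of bait and co-purified proteins, and the protein names A to E are placeholders, not real data.

```python
from collections import deque

def walk_interactome(start_bait, purify, max_rounds=3):
    """Breadth-first 'walk' from one tagged bait through its partners.

    purify: callable returning the proteins co-purified with a given bait
            (a stand-in for one tagging, purification and MS experiment).
    max_rounds: how many successive rounds of re-tagging to simulate.
    Returns the set of proteins seen and the set of bait-prey edges.
    """
    seen = {start_bait}
    edges = set()
    queue = deque([(start_bait, 0)])
    while queue:
        bait, depth = queue.popleft()
        if depth >= max_rounds:
            continue  # proteins found in the last round are not re-tagged
        for prey in purify(bait):
            edges.add(tuple(sorted((bait, prey))))
            if prey not in seen:
                seen.add(prey)
                queue.append((prey, depth + 1))
    return seen, edges

# Hypothetical co-purification table with placeholder protein names.
toy_results = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
proteins, interactions = walk_interactome(
    "A", lambda bait: toy_results.get(bait, []), max_rounds=2
)
```

In this toy run, protein D is reached through C in the second round, but E stays undiscovered because D itself is never used as a bait within two rounds.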
TAP approaches have been successfully applied to mammalian cells using a retroviral gene delivery or stable inducible system, and were applied in our laboratory for purifying protein complexes involved in gene silencing [19–21]. Purification of a TAP-tagged methyl CpG-binding protein (MBD3) resulted in purification of the Mi-2 complex, a complex containing nucleosome remodelling as well as histone deacetylase activity (Fig. 2). MBD3 is known to be a part of the Mi-2 complex. The purified MBD3 complex had potent histone deacetylase activity (data not shown), showing that the retrovirally transduced MBD3 protein assembles into a functional complex that can be purified.
ANALYSIS OF PROTEIN–DNA INTERACTIONS IN VIVO
A recent development that has enabled the study of protein–DNA interactions in vivo is ChIP. Since its introduction 10 years ago for the analysis of polycomb-group proteins binding to the Drosophila BX-C locus, it has been widely used. In short, adding formaldehyde to living cultured cells introduces covalent cross-links between proteins interacting with DNA, RNA, or other proteins, resulting in ‘freezing’ of protein–protein and protein–DNA interactions in vivo. Fragmentation of the chromatin and immunoprecipitation of the protein–DNA complexes using a specific antiserum facilitates purification of a particular protein and copurification of cross-linked DNA fragments (Fig. 3). In a conventional ChIP experiment, the presence of particular DNA sequences in the precipitate is subsequently analysed using quantitative PCR. Amplification of the target site (A) in parallel with an unrelated sequence (B) reveals whether the immunoprecipitated protein was bound to its presumed target site in vivo. Thus, the association of a protein with its binding site in vivo can be measured by ChIP, if antibodies against the protein are available. ChIP has been used to investigate the in vivo binding sites of a multitude of transcription factors including nuclear hormone receptors (RXR, v-ErbA, oestrogen receptor, PPARγ), myc and the p53 family [23–30]. In addition, the availability of antisera against specific histone modifications or methylated DNA allows one to study the distribution of epigenetic modifications on particular loci. ChIP can also be used to analyse cell-type and ligand-specific cofactor recruitment. Experiments on oestrogen receptor-mediated activation showed that many chromatin-modifying complexes are sequentially and independently recruited [26,31]. In addition, repression mechanisms induced by tamoxifen-loaded oestrogen receptors could be delineated.
Such studies, when performed on a global scale, provide insight into the mechanisms that dictate cellular responses to hormone and could provide a rationale for the development of drugs that target specific pathways.
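The comparison of a target site (A) with an unrelated sequence (B) in a conventional ChIP experiment is often expressed as a fold enrichment calculated from quantitative PCR threshold cycles (Ct). The sketch below shows one common way of doing so, assuming that both amplicons are measured in the immunoprecipitate and in the input chromatin and that PCR efficiency is perfect (a doubling per cycle). The function name and the Ct values are hypothetical.

```python
def fold_enrichment(ct_ip_target, ct_input_target,
                    ct_ip_control, ct_input_control, efficiency=2.0):
    """Enrichment of the target site over an unrelated control sequence.

    Each amplicon is first normalised to its own input (delta Ct), then the
    target is compared with the control (delta-delta Ct).
    efficiency: amplification factor per PCR cycle (assumption: 2.0).
    """
    delta_target = ct_ip_target - ct_input_target
    delta_control = ct_ip_control - ct_input_control
    return efficiency ** (delta_control - delta_target)

# Made-up Ct values: the target site is recovered far better than the control.
enrichment = fold_enrichment(
    ct_ip_target=25.0, ct_input_target=22.0,     # target site (A)
    ct_ip_control=30.0, ct_input_control=22.0,   # unrelated sequence (B)
)
```

An enrichment close to 1 would suggest that the protein was not bound to the presumed target site, whereas values well above 1 support binding in vivo.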
CHIP-ON-CHIP IN THE ANALYSIS OF PROTEIN–PROTEIN AND PROTEIN–DNA INTERACTIONS
ChIP approaches provide valuable information about the involvement and chronology of transcription factor and cofactor recruitment during activation or repression of a locus. Furthermore, ChIP provides a means to accurately determine the factor composition and epigenetic status of a gene and its regulatory regions [33,34]. In a few cases, complete loci or large chromosomal regions have been investigated, but applying the gene-by-gene ChIP approaches to large sets of genes or even to the whole genome is cumbersome and labour intensive [35–37]. Furthermore, without prior knowledge, such experiments would be very costly ‘fishing expeditions’. To facilitate large-scale analyses, ChIP has recently been combined with microarray technology. In ‘ChIP-on-chip’ experiments, DNA obtained by conventional ChIP is amplified, labelled and hybridized on a DNA microarray. This allows the identification of target sites for a certain protein on a genome-wide level. ChIP-on-chip was initially applied to yeast, where novel target sites for Gal4 and Ste12 were identified through the use of a microarray that contained all intergenic sequences. Such genome-wide studies are much more complex when dealing with higher eukaryotes. Due to their genome size, the presence of repetitive sequences and the widespread distribution of regulatory elements, developing suitable microarrays for ChIP-on-chip is a demanding task. The several types of DNA arrays currently in use are as follows: (i) tailored/designed target gene arrays, consisting of pre-selected PCR-amplified genomic sequences; (ii) CpG-island arrays, generated by cloning of GC-rich DNA fragments purified from the human genome; (iii) oligo tile path arrays spanning whole chromosomes or even the whole genome; and (iv) ChIP-cloning arrays consisting of DNA fragments obtained by cloning precipitated ChIP DNA.
Although CpG-island microarrays have the limitation of being restricted to GC-rich DNA sequences, they have been successfully used to identify novel targets for E2F, polycomb group (PcG) complexes, and MBD proteins, as well as histone H3K9 methylation [38–41]. A recent and more unbiased ChIP-on-chip study used oligonucleotide tiling arrays completely covering human chromosomes 21 and 22. This resulted in the identification of c-Myc, p53 and Sp1 target sites in a chromosome-wide manner. An attractive alternative to generating microarrays is ChIP cloning. In this approach, DNA that is brought down in a ChIP is purified and cloned, yielding a library of target sites. Subsequently, the cloned fragments are printed on a microarray. Because ChIP cloning does not need pre-existing knowledge, a genuine target within the genome will be present within the resulting library of DNA fragments. ChIP cloning has been applied to E2F and CTCF [43,44]. In our laboratory, we have recently adopted it to construct a TATA element binding protein (TBP)-binding site microarray by using a highly specific antibody against TBP. Because TBP is a central transcription factor for RNA polymerase I, II and III promoters, we obtained a comprehensive library of presumed promoter sequences. In addition to identifying novel promoters, this approach has facilitated the analysis of the basal transcription factor composition of various promoter classes. Furthermore, the epigenetic signature and cofactor recruitment to distinct promoters can now be determined. The technology allows determination of the targets as well as the pathways for recruitment. The scatter plot in Figure 4 shows data from a ChIP-on-chip experiment in which two different DNA samples, obtained by ChIP using antibodies against serine 5-phosphorylated RNA polymerase II and against BDP1 (a subunit of an RNA polymerase III initiation factor), were hybridized onto the TBP target site microarray.
Displaying the signal intensities for the two ChIP samples (Cy3 and Cy5) in a scatterplot reveals two distinct populations, one containing polymerase II promoters and one containing polymerase III promoters associated with tRNA genes. A similar ChIP-on-chip strategy was used for target-site analysis of p73, a close relative of p53 that plays a role in oncogenesis and transformation (Fig. 4C). A stable cell line expressing FLAG-tagged p73 was used for ChIP, after which the immunoprecipitated DNA fragments were hybridized onto the TBP target site microarray. The circled area in Fig. 4C represents p73 target sites. Some of the identified target genes were validated using conventional single-gene ChIP (Fig. 4D). A similar ChIP-cloning approach for p53 using a specialized dedicated array will allow genome-wide localization of target sites.
COMBINING THE BEST OF TWO WORLDS: PROTEOMICS AND CHIP-ON-CHIP ARE COMPLEMENTARY TECHNIQUES
Although dedicated microarrays generated by ChIP cloning are valuable tools in target-site analysis, they require the availability of a highly specific and ‘ChIP-grade’ antiserum against the protein of interest. Because such requirements cannot always be fulfilled for each protein, TAP-tagging approaches used in MS can also be used for ChIP cloning. This involves generating a stable cell line expressing epitope tags fused to the protein of interest. These cells can then be subjected to ChIP or protein complex purifications using antibodies against the tags. We are currently screening and selecting such dual-purpose ‘generic’ tags that will allow purification of protein complexes as well as ChIP cloning to determine where in the genome the factor exerts its function (Fig. 5).
When combined, proteomics and ChIP-on-chip may help decipher pathways and gene networks regulated by transcription factors such as nuclear receptors. Subtypes and cell-type specificity of nuclear receptors can be determined as well as ligand-dependent recruitment of factors. Analysis of the composition of corepressors or coactivators recruited by the different nuclear hormone receptors by liquid chromatography-MS/MS can be combined with an analysis of DNA target sites using stable cell lines expressing generic tagged nuclear receptors. Epigenetic marking, including cell-type specific chromatin modifications, DNA methylation, and locus-specific susceptibility to activation will soon be within reach and will add a new dimension to targeted drug development.