A small fraction of the typical animal genome (<5% in humans) codes for the organism's collection of proteins, yet the study of protein-coding sequences dominated the early years of genomics research. In the decade since the sequencing of complete eukaryotic genomes, however, attention has turned to the non-coding “junk” DNA that makes up the remainder. The ability to move beyond the scant few large genomic regions studied prior to that point has shed a great deal of light on the complex interactions between DNA and nuclear proteins that control genomic regulatory processes. These advances would not have been possible, of course, without complete genome sequences and resulting technologies such as the DNA hybridization microarray. But a single molecular technique, Chromatin Immuno-Precipitation (ChIP), has more than any other come to dominate the field and make possible the study of an incredible range of biology. As one measure of the success of this technique, the number of scientific publications utilizing some form of ChIP has increased roughly 15-fold in under a decade, from 67 articles in 2000 to 1,025 in 2008 (Fig. 1). This issue of The Journal of Cellular Biochemistry aims to put into context advancements made possible by the ChIP-location revolution, while at the same time highlighting some of the most important technical aspects and challenges. Finally, it gives us an idea of the exciting work yet to come: most would agree that we have only seen the very tip of the ChIP-location iceberg.
Wacker and Kim [this issue] open with a detailed historical overview of ChIP techniques, starting from the initial experiments by Alexander Varshavsky and colleagues using formaldehyde to chemically cross-link chromosomal DNA to proteins (Fig. 2), and proceeding all the way through modern incarnations that use microarray (ChIP-on-chip) and massively parallel sequencing (ChIP-seq) technologies to read out the locations of the bound regions. They develop a conceptual framework for understanding the three major processes under investigation: (1) interactions between sequence-specific transcription factor complexes and their direct DNA binding targets and core RNA polymerase machinery, (2) epigenetic marks, most importantly histone tail modifications critical for the regulation of chromatin structure and function, and (3) localization of protein complexes that mediate long-range intra- and inter-chromosomal interactions and interact with the structural architecture of the nucleus.
Mouse and human embryonic stem cells are among the systems most extensively studied by ChIP-location, and Wacker and Kim [this issue] and Zechini and Mills [this issue] outline how it has substantially added to our understanding of the regulatory processes involved in pluripotency and lineage specification. Importantly, studying transcriptional control and epigenetic changes using ChIP-chip and ChIP-seq has aided in the development of artificially induced pluripotent stem cells (iPS) in mammals, a development that holds great promise for regenerative medicine. Some of the most interesting stem cell findings challenge older concepts about how global epigenetic marks and the recruitment of RNA polymerase II to promoters influence transcription. The past several years of intense study have established that histone marks once considered “active” and “repressive” actually co-occur at many promoters during stem cell development and interact in ways that we still do not fully understand. We have also learned that RNA polymerase II is present at the promoters of many inactive genes, possibly maintaining a state of competency to allow for future activation and elongation [Muse et al., 2007; Zeitlinger et al., 2007; Wu and Snyder, 2008]. ChIP-location has recently been used to identify a large number of long non-coding RNAs expressed in stem cells [Guttman et al., 2009], some of which may play critical roles in the establishment of these chromatin domains [Rinn et al., 2007; Zhao et al., 2008].
ChIP-location studies are also facilitating new insights into the biology of cancer. Zechini and Mills [this issue] describe how the study of genomic regulatory processes involving nuclear hormone receptors has led to a number of important insights, including the computational identification of key DNA-binding motifs and novel transcriptional co-factors such as FoxA1 [Carroll et al., 2005], as well as the connection between transcription factor DNA binding and the recruitment of both histone modifying enzymes and their associated histone tail modifications [Strahl and Allis, 2000; Bernstein et al., 2007; Heintzman et al., 2007]. Identification and interrogation of regulatory control elements lying downstream of key signaling pathways such as these will prove invaluable for interpreting the results of genome-wide association studies (GWAS), which are increasingly implicating the non-coding sequences of the genome in disease risk. Examples are the role of a common, non-protein-coding regulatory variant in the RET gene that accounts for much of Hirschsprung's disease risk [Sancandi et al., 2003; Fitze et al., 2003a,b], and more recently, the association of non-protein-coding variation in chromosome 8q24 with prostate and other cancers [Amundadottir et al., 2006; Freedman et al., 2006; Gudmundsson et al., 2007; Haiman et al., 2007; Yeager et al., 2007], and in the FGFR2 intron with breast cancer [Meyer et al., 2008]. Other unexpected observations are made possible by global ChIP-location, for instance that most known translocation breakpoints in T-cell cancer are located in regions of active and open chromatin based on histone profiling [Barski et al., 2007].
Cancer research also underscores the need to reduce the amount of starting material necessary to perform ChIP analyses: current protocols require millions of cells, but new techniques requiring 10,000 cells or fewer [Acevedo et al., 2007] will be necessary to move from cell lines to primary tumors and cell-sorted subpopulations of cancer stem cells.
Two articles focus on technical aspects of emerging ChIP techniques. Barski and Zhao [this issue] discuss ChIP-seq, where ChIP-isolated fragments are directly sequenced using high-density (“2nd generation”) sequencers capable of generating hundreds of millions of sequence tags in a single experiment. ChIP-seq has a number of advantages over its predecessor, ChIP-on-chip. First, it does not suffer from the hybridization artifacts that confound all microarray-based assays. Second, while microarray approaches can typically cover roughly the half of the typical mammalian genome that does not cross-hybridize to oligonucleotide probes, ChIP-seq can currently cover over 70% [Mikkelsen et al., 2007] and will very quickly improve to cover the vast majority, including many transposable elements. While some consider repetitive sequences to be “selfish” DNA that does not contribute to genomic regulation, there is little direct evidence for this. To the contrary, significant evidence exists that transposable elements can have a direct effect on regulatory processes [Bejerano et al., 2006] and contribute significantly to the evolutionary processes that shape cis-regulatory sequences [Feschotte, 2008; Xie et al., 2006]. Another important benefit of ChIP-seq is its ability to detect allele-specific binding [Mikkelsen et al., 2007].
Distinguishing signal from noise in whole-genome ChIP-seq and ChIP-chip data is a difficult statistical problem, and Barski and Zhao [this issue] provide a substantive discussion of the issues involved. While a number of techniques using sliding window averages [Jothi et al., 2008] and hidden Markov models [Ji and Wong, 2005; Du et al., 2006] have been developed, all have substantial error rates and most are poorly suited to the extended domains characteristic of many epigenetic marks. Zechini and Mills [this issue] highlight this problem in a discussion of the troubling lack of replicability in whole-genome ChIP-on-chip experiments when performed by different labs. While some of this variability might be truly biological, it is clear that at least some is technical in nature [Johnson et al., 2008] and is highly dependent on selecting an appropriate underlying statistical model and enrichment cutoff. Another emerging factor is that ChIP assays seem to pick up bona fide low-level binding events that do not appear to have any physiological significance. A recent study [Li et al., 2008] made the interesting observation that the most highly bound regions were a great deal better at predicting gene expression patterns than the more weakly bound regions, even within a set of highly statistically significant binding regions that were replicable using independent antibodies. This suggests that while a ChIP-location experiment for a single factor in isolation might not be an extremely good predictor of regulatory function, the addition of ChIP data for a second related protein, such as a co-factor or a related epigenetic mark, may greatly increase the predictive power. Indeed, when the epigenetic activation mark for histone H3 acetylation was added to ChIP data for the sequence-specific transcription factor AR, correlation with regulatory function was substantially better [Jia et al., 2008].
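As a rough illustration of the sliding-window idea mentioned above, the following Python sketch counts sequence tags in overlapping windows and flags windows whose counts are improbable under a uniform Poisson background. It is a minimal sketch and not the method of any study cited here; the function names, the window and step sizes, and the p-value cutoff are all illustrative assumptions.

```python
from math import exp


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by direct summation.

    Precision is limited for very small tail probabilities, which is
    acceptable here because the value is only compared to a cutoff.
    """
    if k <= 0:
        return 1.0
    term = exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):     # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(tag_positions, genome_length,
                          window=200, step=50, p_cutoff=1e-5):
    """Flag windows with more tags than a uniform background predicts.

    Assumption: tags are spread uniformly over the genome under the
    null model, so the expected count per window is
    n_tags * window / genome_length.
    Returns a list of (start, end, count, p_value) tuples.
    """
    tags = sorted(tag_positions)
    lam = len(tags) * window / genome_length
    hits = []
    lo = hi = 0
    for start in range(0, genome_length - window + 1, step):
        end = start + window
        # Advance two pointers so that tags[lo:hi] lies in [start, end).
        while lo < len(tags) and tags[lo] < start:
            lo += 1
        while hi < len(tags) and tags[hi] < end:
            hi += 1
        count = hi - lo
        p_value = poisson_tail(count, lam)
        if p_value < p_cutoff:
            hits.append((start, end, count, p_value))
    return hits
```

Because a single genome-wide background rate is assumed, a sketch like this would tend to miss broad, low-enrichment domains and to overcall regions with locally elevated background, which is one reason the choice of statistical model and cutoff matters so much.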
Another confounding factor in the analysis of ChIP-seq data is how to deal with copy number changes and other chromosomal aberrations common in cancer genomes. Most ChIP-chip study designs deal with this by normalizing against a control sample consisting of ChIP input DNA or a ubiquitous mark such as unmodified histone H3. We believe it is always a good idea to follow such a strategy with ChIP-seq, because it is now well established that ChIP-seq tends to enrich non-specifically for many nucleosome-free hypersensitive sites in the genome [Giresi et al., 2007] (particularly at active promoters). Because cancer genomes often contain many regions with increased copy number, they present an even larger background problem, and many more sequencing tags will be necessary to control for these effects and prevent false positives. While this may not be a major concern when ChIPping a factor with very high enrichment over a relatively small fraction of the genome (like NRSF [Johnson et al., 2007] or STAT1 [Robertson et al., 2007]), it could present serious issues when trying to identify domains with low enrichment covering large stretches of the genome, for example the histone marks H3K36me3 and H3K27me3.
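The control normalization described above can be sketched as follows. This is a simplified illustration under stated assumptions: tags have already been counted in fixed genomic bins for both the ChIP and the input sample, the two libraries are scaled by total tag count, and a pseudocount (an arbitrary choice here) guards against division by zero. The function name is hypothetical.

```python
def fold_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Per-bin ChIP/input ratio after scaling input to the ChIP depth.

    Assumptions (illustrative only): both lists hold tag counts for the
    same fixed-width genomic bins, and total tag count is an adequate
    proxy for sequencing depth.
    """
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input must cover the same bins")
    chip_total = sum(chip_counts)
    input_total = sum(input_counts)
    if chip_total == 0 or input_total == 0:
        raise ValueError("both samples need at least one tag")
    # Scale the input library to the sequencing depth of the ChIP library.
    scale = chip_total / input_total
    return [
        (chip + pseudocount) / (ctrl * scale + pseudocount)
        for chip, ctrl in zip(chip_counts, input_counts)
    ]
```

Under these assumptions, a bin whose tag count is elevated four-fold in both the ChIP and the input sample (as an amplified region would be) yields a ratio near or below 1, while a bin elevated only in the ChIP sample stands out. With few input tags per bin the ratio becomes noisy, which illustrates why deeper control sequencing is needed for amplified genomes and for broad, low-enrichment marks.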
Barski and Zhao [this issue] highlight the data processing requirements of generating and analyzing high-throughput ChIP-location data. Development of a bioinformatics infrastructure for storage, analysis, and exchange of these data will be a crucial challenge for the genomics community. While repositories for proteins, genes, and promoters (GenBank) and gene expression microarray data (GEO) have existed for a number of years, ChIP-seq datasets are several orders of magnitude bulkier than their microarray analogs. Central resources such as NCBI's new Short Read Archive [Wheeler et al., 2008] will be critical for warehousing such experiments, but we will need analysis tools that make it possible to compare new or private ChIP datasets (at base pair resolution) to previously published ones. The UCSC and Ensembl genome browser groups have provided some early tools for these types of analysis, but they do not yet meet the needs of modern ChIP-based research. Large collaborative projects such as the ENCODE Consortium will undoubtedly help to spur such developments [Thomas et al., 2007].
In the final article, Fullwood and Ruan [this issue] describe the newest frontier in ChIP-location, and how it is being used to study the regulatory mechanisms involved in long-range chromosomal interactions. Chromosome Conformation Capture (3C) is a method for identifying distinct genomic regions bound by a common protein or protein complex, which brings them into close spatial proximity within the 3D space of the nucleus [Simonis et al., 2007]. The combination of 3C with ChIP has led to the elucidation of chromosomal looping events that bring together distant sites into central hubs or interchromatin granules that are critical for gene activation [Hu et al., 2008]. While these techniques seem poised to emerge as the next major step in the evolution of ChIP, they are fraught with numerous sources of noise and potential bias. Fullwood and Ruan [this issue] describe an emerging methodology that combines optimized sonication-based chromatin fragmentation, ChIP-based enrichment, chromatin proximity ligation, and Paired-End Tag ultra-high-throughput sequencing as the winning strategy to establish DNA/protein interactomes in three-dimensional space.
ChIP-location is also helping define the architecture of the nucleus by mapping the nuclear organizing proteins themselves. Binding sites for the insulator protein CTCF [Cuddapah et al., 2009] and Nuclear Lamina attachment proteins [Guelen et al., 2008] have led to unexpected connections between gene regulation and sub-nuclear chromatin organization. Broad disruption of chromatin looping has been recently implicated in the progression of breast cancer [Han et al., 2008], and it will be exciting to determine what other roles nuclear organization plays in cancer and other diseases. ChIP-location will undoubtedly play a role in this discovery process.
ChIP-location techniques have the potential to help unravel some of the most fundamental and vexing questions in molecular biology, for instance whether simple cis-acting regulatory motifs or more indirect, trans-acting factors in the nuclear environment are more central to the regulation of complex transcriptional networks. Zechini and Mills [this issue] dive into this debate head-on, highlighting a recent groundbreaking study [Wilson et al., 2008] involving a trisomic mouse model that carries most of human chromosome 21. Using ChIP-on-chip to determine the location of key liver-specific transcription factors and epigenetic histone marks in liver cells, the authors showed that despite the very different cellular environment, regulatory regions on the heterologous human chromosome behave almost identically to their native human cell counterparts, but very differently from the homologous mouse regions. Gene expression from the transplanted chromosome follows the same pattern, seemingly unaffected by the differences of the mouse cellular environment and arguing for local cis-acting sequences playing a dominant role. While this stunning result addresses the question of cis-acting sequences and the trans-acting nuclear milieu, a related question is the primacy of transcription factors binding their short recognition motifs versus larger scale epigenomic structural events. Does the former determine the latter via the histone modifying activities of transcription factors, or does epigenomic organization precede and make possible site-specific binding? While the number of well-understood cases is still too small to say for sure, the ability to profile transcription factor occupancy and markers of chromatin structure in parallel as a function of time and development will facilitate better understanding of this relationship.
Recent studies in yeast and Drosophila [Segal et al., 2008; Gertz et al., 2009] lend support to the notion that a complex arrangement of short protein binding sites can explain much of the complexity observed in transcriptional regulation. How such sequences encode complex animal regulatory programs is an ongoing pursuit, but it is clear that we are only in the early stages of being able to identify and interpret the code within these cis-regulatory sequences [Hare et al., 2008]; ChIP-location will play the key role in the pursuit of this cis-regulatory “grammar,” by allowing the identification of many thousands of new functional cis-regulatory regions.
In the span of a short decade, ChIP-location techniques have allowed genomic maps to progress from one dimension to richly populated 2D and even 3D landscapes. Although we are still learning the basics of how to navigate in this dynamic chromatin world, it is certainly an exciting time to be a genome explorer.