Cracking the ENCODE: From transcription to therapeutics

  • Potential conflict of interest: Nothing to report.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57-74. (Reprinted with permission.)

Abstract

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

As the world debates the wisdom of excess regulation in other aspects of life, it is becoming increasingly clear that the genome is under highly complex regulatory control. The Human Genome Project (HGP) not only characterized the protein-coding genes of the genome, but also ushered in an era of personalized medicine, where patients are beginning to receive targeted therapies based on genomic sequence. An immediate example in hepatology is the use of IL28B genotyping in hepatitis C therapy.1 However, expectations of advances in the pathobiology and treatment of complex diseases have not been fulfilled, since the majority of the genome remains a mystery—nonprotein coding and labeled as “junk DNA.” The aim of the ENCODE project (Fig. 1) was to address this gap in knowledge.

Figure 1.

The ENCODE project describes the functional genomic elements that regulate the expression of the human genome. The regulatory processes include DNA methylation and histone modifications that influence the rate of transcription, long-range chromatin interactions that alter the relative proximity of DNA elements in three dimensions, and the presence of sites sensitive to DNase I, which allow access of transcription factors and transcription machinery. More direct control of transcription occurs through gene-regulatory DNA elements, including promoter regions and more distant regulatory elements. RNA transcripts are also part of the regulatory process, through alternative transcripts, which may be nonfunctional, or through noncoding RNAs (such as microRNAs) with regulatory roles. This knowledge allows the expansion of molecular therapeutic targets, such as the use of FXR agonists to increase transcription of pro-endothelial genes such as DDAH-1 through enhancer elements.

The approach to applying the wealth of genetic information from the HGP to determining susceptibility to complex diseases has thus far been through genome-wide association studies (GWAS). Over 1,500 GWAS have been conducted since the first was reported in 2005 (www.genome.gov/gwastudies/), and several hundred disease-associated genetic variants have been found.2 Disappointingly, however, the majority of these are single nucleotide polymorphisms (SNPs) with only a small effect on the trait or disease being studied. The implication is that a large part of the heritability of these complex diseases remains unexplained.

There appear to be two reasons why GWAS have yielded a lower “hit rate” for biological targets than expected. First, the GWAS targets are occasionally in linkage disequilibrium with the specific causative locus, thereby obscuring the true causative gene product.2 More commonly, however, the locus associated with the disease phenotype is not related to a coding region of genomic DNA. Indeed, from the National Institutes of Health (NIH) GWAS catalog, it is apparent that the vast majority (88%) of disease-associated SNPs are not related to coding regions: 45% are intronic and 43% are intergenic.2 The implication is that variation in the regulatory elements of the genome carries a large burden of the risk of complex diseases. Indeed, several GWAS variants in diabetes,3 colon cancer,4 and cardiovascular disease5 reside in enhancer elements. Moreover, these results imply that large-scale sequencing studies focusing on protein-coding sequences (the “exome”) risk missing crucial parts of the transcribed genome (the “transcriptome”) and consequently the ability to identify true causal variants.

From Proteomics to Transcriptomics

An international collaborative effort was established to determine the functional importance of noncoding DNA, generating an encyclopedia of DNA elements (ENCODE).6 This followed a 4-year pilot study initiated in 2003, which demonstrated significant functionality of noncoding elements in 1% of the human genome,7 after which the project was scaled up to annotate the entire genomic sequence. A by-product of these efforts was the development of “next-generation” sequencing technologies, including the first ChIP-plus-sequencing assays (ChIP-seq) for transcription factors and histone modifications,8, 9 as well as pioneering RNA sequencing assays (RNA-seq).10 The findings were published in the above flagship article in September 2012, as well as 30 other simultaneously published research papers. ENCODE demonstrated, using a variety of methodologies, that 80% of the genome, much of it noncoding DNA once dismissed as “junk,” contains elements with biochemical function.

The cornerstone of ENCODE is the recognition of biochemical signatures that characterize certain types of noncoding functional DNA elements. Examples include promoter regions that are rich in predictable binding sites for DNA-binding proteins, which can be experimentally verified by site-specific occupancy assays such as ChIP.11, 12 Promoter regions also have alterations in chromatin structure giving rise to nuclease hypersensitivity of the underlying DNA.13 Further characteristics of functional elements are histone modifications suggesting transcription factor occupancy of adjacent DNA, and DNA methylation as an epigenetic modulator of gene expression.11, 14 All of these biochemical signatures were experimentally assayed in the ENCODE project.

To identify regions of DNA-protein interaction, the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components were assayed in 72 cell types using ChIP-seq. Overall, 636,336 binding regions covering 231 megabases (8.1% of the genome) were enriched for regions bound by DNA-binding proteins across all cell types. The ENCODE consortium has made the information associated with each transcription factor available in FactorBook (http://www.factorbook.org), a freely available public resource.
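
The coverage figure quoted above (231 megabases, or 8.1% of the genome, which implies a denominator of roughly 2.85 billion bases) is the kind of number obtained by merging overlapping binding regions and summing their lengths. The following is a minimal Python sketch of that calculation on toy data. The function names, the half-open (start, end) coordinate convention, and the single-chromosome simplification are assumptions made for illustration and are not taken from the ENCODE pipeline.

```python
def covered_bases(intervals):
    """Count bases covered by at least one half-open (start, end) interval.

    Overlapping or touching intervals are merged so that no base is
    counted twice. Toy, single-chromosome version (an assumption).
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def covered_fraction(intervals, genome_size):
    """Fraction of a genome of `genome_size` bases covered by the intervals."""
    return covered_bases(intervals) / genome_size


# Hypothetical binding regions on a 1,000-base toy "genome".
toy_regions = [(0, 100), (50, 150), (300, 400)]
print(covered_fraction(toy_regions, 1000))
```

In this toy case the first two regions overlap and merge into one 150-base block, so 250 of 1,000 bases are covered. A real analysis would repeat the calculation for each chromosome.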

The accessibility of chromatin to DNase I was assessed by mapping 2.89 million unique, nonoverlapping DNase I hypersensitive sites (DHSs) by DNase-seq in 125 cell types, occupying 15.2% of the genome. Moreover, 98.5% of the occupancy sites of transcription factors previously mapped by ChIP-seq lie within accessible chromatin defined by DNase I hotspots, reaffirming their likely cell-specific regulatory role. Histone modifications associated with regulatory elements (e.g., methylation, acetylation) were also assayed by ChIP-seq, and were found to be common in the genome (56.1%).
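
The 98.5% figure above is an interval-overlap statistic: the share of transcription factor occupancy sites that fall within accessible chromatin. A minimal Python sketch of such a calculation is given below, assuming half-open (start, end) coordinates on a single chromosome. The function names and toy coordinates are hypothetical, and the simple pairwise comparison stands in for the indexed interval methods used by dedicated tools such as bedtools.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(peaks, accessible):
    """Fraction of peaks that overlap at least one accessible interval.

    Pairwise comparison, quadratic in the worst case; adequate for a
    toy example only.
    """
    if not peaks:
        return 0.0
    hits = sum(
        1 for peak in peaks
        if any(overlaps(peak, site) for site in accessible)
    )
    return hits / len(peaks)


# Hypothetical ChIP-seq peaks and DNase I hypersensitive sites.
toy_peaks = [(120, 130), (290, 310), (400, 450), (590, 700)]
toy_dhs = [(100, 200), (150, 300), (500, 600)]
print(fraction_overlapping(toy_peaks, toy_dhs))
```

Here three of the four toy peaks touch an accessible site, giving 0.75. The sketch counts any overlap as a hit; a stricter criterion, such as requiring the whole peak to lie within the site, would give a lower figure.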

Finally, one of the principal purposes of ENCODE was to determine what proportion of this noncoding genome is transcribed, and in which cell/tissue types. Djebali et al.15 demonstrate with ultra-deep RNA sequencing that about 75% of the genome is transcribed to RNA at some point in certain cell types. Therefore, the majority of RNA in a cell is never translated to protein, but may play important regulatory roles.

Moreover, the expression of RNA transcripts from genes is not uniform—most genes express more than one isoform of a transcript, with an average of 10-12 expressed isoforms per gene per cell line. This remarkable finding has forced a re-think of our nomenclature of genomic organization, and in particular the gene as the fundamental building block of the genome. On the basis of the ENCODE data, it can be argued that the transcript is the basic unit of genomic organization, describing genes which are transcribed in different cellular environments under specific conditions.

Limitation or Opportunity?

The ENCODE project has demonstrated that the vast majority of the human genome, although not coding for proteins, does contain important regions that bind proteins and RNA molecules, which cooperate to regulate the function and expression of protein-coding genes. Additionally, it seems that transcription is far more widespread than previously thought, with large numbers of noncoding RNA molecules with potential regulatory roles.

The immediate implications of these findings are that genome-wide approaches to determining disease risk and finding targets for therapy require reevaluation in this light. ENCODE demonstrates that noncoding regions must be considered when interpreting GWAS findings, and provides a strong basis for reinterpreting previous GWAS results. Furthermore, as mentioned above, the results of ENCODE suggest that exome-sequencing studies focusing on protein-coding sequences risk missing crucial parts of the genome and the ability to identify true causal variants.

Although the prospect of characterizing and validating this new tier of genomic control is daunting, it does provide opportunity in terms of both technologies and therapeutics. Just as ENCODE disseminated technologies such as ChIP-seq and RNA-seq over the last decade, so gene-editing technologies such as zinc-finger nucleases and transcription activator-like effector nucleases (TALENs) are now scalable, and thus functional elements can be validated on a large scale.16, 17

Implications for the Treatment of Liver Diseases

More immediately relevant for the liver community is the prospect of transcriptional modulation as a therapeutic strategy. Knowledge of regulatory elements will point us toward new therapeutic approaches and expand the “druggable genome.” A specific example is the use of farnesoid X receptor (FXR) agonists to augment the transcription of FXR-responsive genes such as dimethylarginine dimethylaminohydrolase in portal hypertension, a target with no alternative pharmacological agonist.18

ENCODE also opens the door to therapies targeted at regulatory elements themselves. Functional elements, including DNA sequences, transcription factors, and noncoding RNAs, have been widely considered “undruggable” targets, mostly because of the incomplete molecular understanding of these complex systems. MicroRNAs (miRs), key RNA molecules regulating gene expression, show that this need not be so: anti-miR oligonucleotide therapies directed to the liver have been shown to modulate cholesterol metabolism and hepatitis C viral kinetics, and phase 2 clinical studies are in progress.19 Thus, the paradigm shift in genomic data provided by ENCODE, along with improved chemistry for the delivery of nucleic acid-based therapies to the liver, has provided the opportunity for novel genome- and epigenome-targeted therapies. As William Ford Gibson famously said, “the future already exists, it's just not very evenly distributed.”
