Resolving the variable genome and epigenome in human disease


Julian C. Knight, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK (fax: +44 1865 287533).


Abstract.  Knight JC (University of Oxford, Oxford, UK). Resolving the variable genome and epigenome in human disease (Review). J Intern Med 2012; 271: 379–391.

The individual human genome and epigenome are being defined at unprecedented resolution by current advances in sequencing technologies with important implications for human disease. This review uses examples relevant to clinical practice to illustrate the functional consequences of genetic and epigenetic variation. The insights gained from genome-wide association studies are described together with current efforts to understand the role of rare variants in common disease, set in the context of recent successes in Mendelian traits through the application of whole exome sequencing. The application of functional genomics to interrogate the genome and epigenome, build up an integrated picture of the regulatory genomic landscape and inform disease association studies is discussed, together with the role of expression quantitative trait mapping and analysis of allele-specific gene expression.


We live in remarkable times in the field of human genetics. Our ability to define variation at the sequence and structural level in the human genome is unprecedented, with whole genome sequencing now being performed on thousands of individuals [1]. Indeed, the pace of technological advances in sequencing is such that the target of the $1000 genome [2] now appears likely to be achieved within 3–5 years. For medicine, the opportunities afforded by genomic science are wide-ranging and potentially paradigm shifting, but amid the scientific optimism there is some justifiable concern from clinicians and patients about whether ‘genomic medicine’ will deliver. The investments from public and private funding in this field of research over the last 20 years have been enormous, and deliverables in terms of tangible change or benefit in the clinic are yet to be widely appreciated. In the current economic climate, translational research output is increasingly sought and should be more actively considered by those engaged in genomic research. This will be facilitated by the growing involvement of other professionals in this field, notably from public health and the pharmaceutical industry. The potential is, however, undoubtedly present: enormous strides have been made in our understanding of the heritable component of common multifactorial disease, while research into the genetics of rare diseases showing a Mendelian pattern of inheritance is undergoing a renaissance as new sequencing technologies offer the opportunity to tackle diseases where linkage and other approaches had previously been unsuccessful [3].

In this review, I will briefly outline the historical context of current progress in genomic research as applied to human disease, illustrating how much of this has been made possible by radical technological advances. This has allowed us to resolve sequence level and structural genomic variation and apply this in a high-throughput manner to study thousands of individuals, for example, using genotyping arrays comprising hundreds of thousands of common biallelic single nucleotide substitutions [4] and more recently massively parallel ‘next generation’ high-throughput DNA sequencing [5, 6]. I will discuss the successful application of genome-wide association studies (GWAs) to common disease, the insights this has provided but also the challenges that lie ahead as we seek to define the substantial proportion of the estimated heritable risk remaining unexplained in multifactorial traits. Current efforts to define the role of rarer variants, structural genomic variants, gene–gene and gene–environment interactions, epigenetic and other factors are being actively pursued to try and address this problem.

A fundamental question that remains unanswered for the majority of disease associations in common traits, as well as many Mendelian diseases, is the identity and functional consequence of the causal genetic variants for gene expression, the nature and function of the encoded protein, and disease pathogenesis [7]. As we understand more about the remarkably complex processes by which gene expression is regulated, proteins synthesized and cellular systems operate, this challenge to define functional variants appears increasingly daunting. However, advances in functional genomics, notably taking advantage of new sequencing technologies to allow genome-wide resolution of transcription and the regulatory chromatin landscape, offer exciting new opportunities to do so, particularly when combined with proteomic, metabolomic and other approaches. This review discusses some of the approaches that can be taken to define functional variants, current limitations and future directions.

Linkage, GWAs and rare variants

For Mendelian disorders, the application of linkage-based approaches has been extremely successful with thousands of gene loci and specific mutations identified over the last 30 years [8]. By contrast, progress in common multifactorial disease without a clear Mendelian pattern of inheritance was slow [9]. Linkage did yield some notable successes such as the role of NOD2 in Crohn’s disease [10] but was not a tractable approach in the vast majority of cases. Similarly, candidate gene analysis, while fruitful in some instances such as APOE ε4 in Alzheimer’s disease [11, 12] or factor V Leiden in venous thrombosis [13], was in most cases unsuccessful or yielded associations that failed to replicate [14]. The common disease: common variant hypothesis became tractable to test with the advent of affordable high-throughput genotyping, and growing insights into the nature and coinheritance of genetic variation across different populations through large collaborative studies such as the International HapMap Project [15]. This set the scene for GWAs in which informative common biallelic genetic markers could be genotyped in thousands of cases and controls to look for evidence of association [9]. One minor note in terms of terminology: single nucleotide variants (SNVs) include single nucleotide substitutions, which when present in the human population with a frequency of both alleles of >1% are also referred to as single nucleotide polymorphisms (SNPs). Rare SNVs may be defined based on a minor allele frequency (MAF) <1% but of note, GWAs typically involve genotyping SNPs with a MAF >5% which means both rare and less common variants (MAF 1–5%) are not well captured.
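The frequency classes used in this terminology can be written down as a small function. The following is a minimal sketch only: the function name is hypothetical, and the handling of values falling exactly on the 1% and 5% boundaries is an assumption, as the text does not specify it.

```python
def classify_by_maf(allele_count: int, total_alleles: int) -> str:
    """Classify a biallelic variant by its minor allele frequency (MAF).

    allele_count is the number of copies of one allele observed among
    total_alleles chromosomes (two per diploid individual).
    """
    if total_alleles <= 0 or not 0 <= allele_count <= total_alleles:
        raise ValueError("need total_alleles > 0 and 0 <= allele_count <= total_alleles")
    frequency = allele_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    maf = min(frequency, 1.0 - frequency)
    if maf < 0.01:
        return "rare"          # MAF < 1%
    if maf <= 0.05:
        return "less common"   # MAF 1-5% (boundary handling is an assumption)
    return "common"            # MAF > 5%, the range typically genotyped in GWAs
```

For instance, an allele seen 30 times among 1000 chromosomes has a MAF of 3% and would fall into the less common class that typical genotyping arrays do not capture well.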

Genome-wide association studies

By June 2011, 1449 genome-wide associations had been reported at a P value of <5 × 10^−8 for 237 traits (Fig. 1). The results have been striking in terms of the strength of association, with many variants implicated with considerable statistical confidence for diseases ranging from type I diabetes [16, 17] to leprosy [18, 19] and cancer [20]. However, the magnitude of effect of individual disease-associated variants was in almost all cases very modest, typically 1.2-fold, and the proportion of the estimated heritability explained by such variants was relatively low, for example, ranging from 5% to 10% in type II diabetes [21] to 25% in Crohn’s disease [22]. It would be wrong, however, to interpret this as meaning that GWAs have been unsuccessful: for the first time, we have a substantial number of robustly replicated associations with common genetic markers which are starting to be translated into risk modelling and prediction of clinical utility, notably in cancer [23, 24]. More evident are the new insights into disease pathogenesis which GWAs are providing, ranging from the role of complement factor H in age-related macular degeneration [25] to Crohn’s disease where the significance of autophagy [26–29] and IL23 signalling [28, 30] has been highlighted and is providing new targets for therapeutic intervention [31, 32].
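To illustrate why effects of around 1.2-fold require thousands of cases and controls to reach a threshold of 5 × 10^−8, the sketch below computes an allelic odds ratio and a Pearson chi-square test (1 degree of freedom) from a 2 × 2 table of allele counts. The allele counts and the function name are hypothetical, and this simple allelic test is an assumption made for illustration; published GWAs typically use regression models with adjustment for covariates and population structure.

```python
import math

# Genome-wide significance threshold quoted in the text.
GENOME_WIDE_ALPHA = 5e-8


def allelic_association(case_risk, case_other, control_risk, control_other):
    """Return (odds_ratio, chi_square, p_value) for a 2x2 table of allele counts."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    if min(a, b, c, d) <= 0:
        raise ValueError("all four allele counts must be positive")
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return odds_ratio, chi_square, p_value


# Hypothetical data: risk allele frequency 0.33 in cases and 0.29 in controls.
large_study = allelic_association(3300, 6700, 2900, 7100)  # 5000 cases, 5000 controls
small_study = allelic_association(330, 670, 290, 710)      # 500 cases, 500 controls

large_is_significant = large_study[2] < GENOME_WIDE_ALPHA
small_is_significant = small_study[2] < GENOME_WIDE_ALPHA
```

Both hypothetical studies give the same odds ratio of about 1.2, but only the larger one passes the genome-wide threshold, which is the practical reason modest effects went undetected before large case-control collections became available.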

Figure 1.

Genome-wide association studies. (a) Number of published genome-wide association studies (GWAs) reporting at least one significant single-nucleotide polymorphism (SNP)–trait association by year to October 2011, as catalogued by the National Human Genome Research Institute (NHGRI) GWAS Catalog [146] [Hindorff LA, MacArthur J, Wise A, Junkins HA, Hall PN, Klemm AK, and Manolio TA. A Catalog of Published Genome-Wide Association Studies. Accessed 10/30/2011]; (b) Schematic showing location of associated marker SNPs from GWAs by frequency [147]; (c) Reported GWAs loci by disease classification [147].

The ‘missing heritability’ in common disease following GWAs has been the subject of much debate [33–35]. Dubbed by some investigators as the ‘dark matter’, which underlined the elusive nature of resolving the basis of this heritable risk, current views highlight the potential role of rarer variants with moderate or high magnitude of effect. GWAs have not interrogated such variants to date, and their analysis has become achievable through the increasing application of massively parallel sequencing as costs continue to fall (Fig. 2). Anticipated results over the next 12 months for ongoing studies involving large numbers of cases will be highly informative in the context of common disease. There may also be much potentially useful information within GWAs data sets involving associated variants just below the selected thresholds for statistical significance: mining such data and sifting the wheat from the chaff is challenging and may be facilitated by increasing sample sizes (with a note of caution in terms of the cost benefit of doing so) and using functional genomics and other approaches to try and inform this process. There are many other potentially relevant contributors to this phenomenon of unexplained heritability: the estimations of heritable risk may be overinflated, and epigenetic factors are increasingly recognized to be significant contributors to heritable risk (including parent of origin effects and environmental modulators), while gene–gene and gene–environment interactions have not yet been well characterized.

Figure 2.

DNA sequencing cost of the human genome. The dramatic fall in sequencing costs is illustrated by data arising from sequencing centres funded by the NHGRI [Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Accessed 10/30/2011].

Rare variants

The relentless pace of technological advances in our ability to detect and quantify genetic variation in a high-throughput manner now makes the analysis of rarer variants a feasible option at the whole exome and increasingly the whole genome level. While for common disease, the jury remains out on the relative importance of such variants, there is growing optimism that for rare ‘orphan’ diseases with very robust phenotypes such as primary immunodeficiencies and metabolic disorders, the potential is very great, while for Mendelian diseases, considerable success has already been reported [36–40].

The potential of whole exome sequencing was underlined in 2009 by data from sequencing four unrelated individuals with the rare autosomal dominant disorder Freeman–Sheldon syndrome that resolved the known causal gene [41]. Whole exome sequencing has since been successfully used to determine the genetic basis of a number of unresolved Mendelian disorders. This includes autosomal dominant traits where, for example, sequencing 10 unrelated probands resolved mutations of MLL2 as a major cause of Kabuki syndrome [38], while for Schinzel–Giedion syndrome, SETBP1 was implicated following whole exome sequencing of four unrelated individuals which revealed de novo mutations involving this gene [42]. For autosomal recessive diseases, success has also been achieved as illustrated by work involving whole exome sequencing of four individuals from three families with Miller syndrome which defined DHODH as the disease gene [39], and for hyperphosphatasia mental retardation syndrome where mutations in PIGV were resolved following whole exome sequencing of three siblings with validation in additional families [43]. This latter work also highlighted the power of filtering regions based on identity by descent.

The value of whole exome sequencing using a family-based approach for sporadic disease was illustrated for 10 cases of unexplained severe mental retardation where case–parent trios were sequenced and likely pathogenic de novo nonsynonymous SNVs were identified in seven of the affected individuals [44]. For very specific phenotypes where extensive biochemical and functional data and validation are possible, whole exome sequencing of a single individual may be informative as illustrated for a mitochondrial respiratory chain disorder where a mutation involving ACAD9 (encoding acyl-CoA dehydrogenase 9) was identified and causally implicated in disease, with other mutations in the same gene identified in further cases [45].
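The core logic of the trio design can be sketched very simply: a candidate de novo variant is one called in the affected child but in neither parent. The example below is a minimal illustration with a hypothetical function name and made-up variant records; it is not the pipeline used in the cited study, and in practice such calls also depend on sequencing depth, genotype quality and independent validation.

```python
def de_novo_candidates(child, mother, father):
    """Return variants called in the child but in neither parent.

    Each argument is a set of (chromosome, position, ref, alt) tuples.
    """
    return sorted(child - (mother | father))


# Made-up variant calls for one case-parent trio.
child_calls = {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("X", 300, "G", "A")}
mother_calls = {("1", 100, "A", "G")}
father_calls = {("2", 200, "C", "T")}

candidates = de_novo_candidates(child_calls, mother_calls, father_calls)
```

Here only the variant on chromosome X remains, because the other two are inherited. Any remaining candidates would still need filtering by predicted functional consequence, for example by restricting to nonsynonymous changes.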

The utility of whole exome sequencing for clinical diagnosis is also increasingly recognized. This is illustrated by a patient referred with renal disease in whom Bartter syndrome was suspected: sequencing revealed a mutation in SLC26A3, leading to the diagnosis of congenital chloride diarrhoea [36]. A further case involving a young patient with intractable and atypical inflammatory bowel disease illustrates how the therapeutic implications can be significant. In this instance, whole exome sequencing revealed a mutation in XIAP, knowledge of which contributed to a clinical decision to carry out stem cell transplantation [46, 47].

The optimal strategic approach to apply whole exome and whole genome sequencing in Mendelian disease is still in the process of being resolved for different scenarios, while for common multifactorial traits, how to apply high-throughput sequencing approaches is a source of considerable debate. If families can be identified, then sequencing distantly related individuals within the pedigree, looking for cosegregation and testing specific implicated variants in large cohorts may be fruitful, while for other traits, studying individuals in the extreme tails of the phenotype distribution, particularly where there is younger age of onset, is advocated [48]. As costs continue to fall, however, whole genome sequencing of hundreds or thousands of cases and controls to identify all variants will be carried out.

The bioinformatic and analytical challenges such data sets represent should not be underestimated [49]. The mapping and accurate calling of sequence and structural variants remain a very active area of development and research, in which progress is being made but more is urgently needed [50–54]. The amounts of data involved are prodigious and on a scale more commonly encountered in astrophysics. Accepting that these challenges can be overcome, the subsequent analysis to narrow down the lists of potentially deleterious variants causing disease is challenging enough in Mendelian traits. Recent data from the 1000 Genomes Project have highlighted how on average, each of us has 250–300 loss-of-function variants in annotated genes and 50–100 variants previously associated with inherited disease [1].

There are several examples of common multifactorial traits where rare variants have been shown to play a significant role. Resequencing candidate genes identified through GWAs has been productive, with rare variants with large effects resolved in hypertriglyceridaemia [55], Crohn’s disease [56] and type I diabetes [57]. The latter highlighted rare variants in IFIH1, a gene encoding interferon induced with helicase domain 1, which is important in the recognition of RNA from picornaviruses, and may be highly relevant given the link between enteroviruses and development of diabetes [57]. Other candidate genes resolved through animal studies such as SIAE (encoding the enzyme sialic acid acetylesterase) have revealed several functionally important rare variants associated with autoimmune disease [58]. Further examples from autoimmune disease include the association of rare variants in the DNA exonuclease gene TREX1 with systemic lupus erythematosus [59].

For common traits, we can anticipate that if rare variants play a role, lessons should be learned from Mendelian diseases such that analysing association with rare variants present at a given gene or locus may be of value with many different mutations resulting in a common phenotype. Various analytical strategies are being advocated and have been reviewed elsewhere [49]. A blurring of the distinction between common and Mendelian disease is apparent as we also appreciate the role of modifier genetic variants and the environment in observed penetrance and phenotypic heterogeneity in Mendelian disease, such that conditions such as sickle cell disease are viewed as complex multigenic disorders rather than monogenic disease [60]. The role of modifier variants is highlighted by recent work in cystic fibrosis, where genome-wide association and linkage analysis have highlighted variation at chromosome 11p13 and 20q13.2, respectively, in modulating observed variation in the severity of lung disease among patients with two copies of loss-of-function CFTR alleles [61].
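One simple way to analyse rare variants at the level of a gene, rather than one variant at a time, is to collapse them: each individual is scored as a carrier if they have any rare variant in the gene, and carrier counts are compared between cases and controls. The sketch below uses a one-sided Fisher's exact test for that comparison. It illustrates only one of the strategies that have been proposed, chosen here as an assumption for illustration; the function names and input layout are hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(at least `a` carriers among cases) for the table [[a, b], [c, d]].

    Rows are cases and controls; columns are carriers and non-carriers.
    """
    n_cases, n_carriers, n_total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_carriers)
    favourable = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n_total, n_carriers)


def burden_test(rare_variants_by_sample, cases, controls):
    """Collapse rare variants in one gene to carrier status and test for excess in cases.

    rare_variants_by_sample maps a sample ID to the set of rare variants
    that individual carries in the gene (missing or empty means non-carrier).
    """
    def count_carriers(samples):
        return sum(1 for s in samples if rare_variants_by_sample.get(s))

    a = count_carriers(cases)
    c = count_carriers(controls)
    b, d = len(cases) - a, len(controls) - c
    return (a, b, c, d), fisher_one_sided(a, b, c, d)
```

Because carriers of different mutations in the same gene are pooled, this design mirrors the Mendelian situation in which many distinct mutations produce a common phenotype. It loses power when the pooled variants include many neutral ones or act in opposite directions.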

Functional genomics and epigenomics of the individual

Advances in our ability to sequence DNA have had important ramifications beyond the identification and screening of genetic and genomic variants. Application of the technologies for high-throughput sequencing to analyse RNA (RNA-seq) represents a significant advance on microarray-based approaches in terms of the dynamic range that can be achieved with single-base-pair resolution [62]. Being able to interrogate the transcriptome at this level of resolution using increasingly small amounts of input RNA is revolutionizing our ability to understand the function of the genome and more particularly in the context of this review, to appreciate how genetic variation may modulate the critical processes of alternative splicing [63–65] and the generation of noncoding RNAs which have profound implications for gene regulation [66–68].

In parallel, sequencing technologies are radically advancing our understanding of the broader transcriptional landscape at genome-wide resolution, for example, in terms of DNA methylation, chromatin accessibility, specific histone modifications and transcription factor binding (Fig. 3). Such data, notably through international collaborative studies such as the ENCODE (ENCyclopedia Of DNA Elements) Project [69], are publicly available for a range of cell lines allowing investigators to interrogate specific loci or integrate data sets using genome browsers [70] such that markers from disease GWAs may be overlaid onto RNA-seq and ChIP-seq data, together with other information on sequence conservation and putative regulatory elements, to help generate hypotheses and prioritize disease-associated variants. It is important, however, to consider the disease-relevant context for such analyses, as any effects of specific variants are increasingly recognized to be cell or tissue type specific [71–73]. This means that we need to continue to expand such data sets for a range of cell and tissue types, including primary cells from healthy individuals as well as patients with disease.

Figure 3.

Resolving the transcriptional landscape of allele-specific gene expression. Allele-specific differences in gene expression may arise because of sequence variation, for example, in distant enhancer regions. Such sites may be indicated by allelic differences in chromatin accessibility, specific histone modifications and DNA methylation, as well as recruitment of specific transcription factors.

We are also beginning to understand the three-dimensional structure of the transcribing genome, in particular how different genomic regions interact, through experimental approaches based on chromosome conformation capture which can now be performed at genome-wide resolution. These take advantage of new sequencing technologies to sequence the products of proximity-based ligation (Hi-C) [74] and chromatin interaction analysis by paired end tag sequencing (ChIA-PET) [75]. Other approaches and analyses can also be highly informative, for example, based on systems biology to define interactions and biological pathways. By using information from many different sources now available at genome-wide resolution, we can adopt an integrated approach to understanding the functional genomic context of genetic and epigenetic variation [76] and do so in a disease-relevant manner.

Such integrative approaches should facilitate the generation of specific hypotheses regarding the mechanism of action and location of putative functional variants in a more systematic and high-throughput manner than has previously been possible. In some instances, this will relate to structural changes in the encoded protein with consequences for function, which may be profound; in other cases, it may involve variants modulating levels of gene expression in many different ways [7]. Infectious disease has provided striking examples of such events, notably malaria. Here, the protective role of sickle cell trait was established as arising from a glutamic acid to valine substitution that converts normal adult haemoglobin (HbA) to haemoglobin S (HbS, sickle variant haemoglobin) [77] and arises because of an A to T single nucleotide substitution in the HBB gene [78] that protects against the development of severe cerebral malaria caused by Plasmodium falciparum. The molecular mechanisms underlying this include the induction of heme oxygenase-1, suppression of circulating free heme by carbon monoxide and independent immunoregulatory effects on pathogenic CD8+ T cells [79]. By contrast, a G to A single nucleotide substitution in the promoter region of the DARC gene (Duffy blood group, chemokine receptor) was found to modulate transcription factor binding by GATA-1, dramatically reducing the levels of gene expression in a cell-type-specific manner and rendering red blood cells resistant to invasion by Plasmodium vivax [80]. Structural variants may also be highly significant, as noted for a 32-bp deletion in the CCR5 gene encoding the major host coreceptor for HIV-1, the CC chemokine receptor CCR5. The deletion resulted in a frameshift, prematurely terminating translation and truncating the protein and rendering cells resistant to invasion by HIV-1 when present in the homozygous state [81–83]. Copy number variation also plays a critical role. Indeed, copy number of a segmental duplication spanning CCL3L1, encoding chemokine (C-C motif) ligand 3-like 1, which is the most significant ligand for CCR5 and is a potent HIV-1 suppressive chemokine, varies by individual with most people having 1–6 copies: when present at lower than the population average, copy number of CCL3L1 was associated with increased HIV-1/AIDS susceptibility [84].

Expression quantitative trait mapping

A powerful approach aiding the interpretation of GWAs is based on mapping gene expression as a quantitative trait [85, 86]. Gene expression is recognized to vary widely between and within populations and to be heritable [87]. The application of ‘genetical genomic’ approaches in model systems and humans has highlighted that expression quantitative trait loci (eQTL) and more specifically in the context of GWAs, expression-associated SNVs, are common and informative [88, 89]. Many such studies have been carried out in EBV-transformed lymphoblastoid cell lines (LCLs), while more recent eQTL analyses in humans are being carried out in specific cell types and tissues [72, 90–96]. This is important, as association with differential gene expression is recognized to be context specific – for example, over 50% of cis-eQTL defined in LCLs or T cells were cell type specific [72].
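At its simplest, testing a single SNV for association with the expression of a single gene amounts to regressing expression level on genotype dosage (0, 1 or 2 copies of one allele). The sketch below shows this under an additive model using ordinary least squares. The function name and the data are hypothetical, and the choice of a plain additive model is an assumption made for illustration; published analyses typically also include covariates, expression normalization and correction for the very large number of SNV-gene pairs tested.

```python
def eqtl_regression(dosages, expression):
    """Least-squares fit of expression on genotype dosage.

    Returns (slope, intercept, r_squared), where slope is the estimated
    change in expression per additional copy of the counted allele.
    """
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched dosage and expression values for >= 3 samples")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Made-up data: six individuals, two per genotype class.
slope, intercept, r_squared = eqtl_regression(
    [0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
)
```

Repeating such a fit in a different cell type with the same genotypes is one way to see the context specificity described above, since the slope may be present in one cell type and absent in another.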

An elegant early example of integrating GWAs and eQTL data was provided by the work of Cookson and colleagues [97] who found that SNVs associated with childhood onset asthma at chromosome 17q21 were also significantly associated with expression of the neighbouring gene ORMDL3 in LCLs established from children in the asthma family panel. This cis association was subsequently also noted in peripheral blood leucocytes [98], and the locus is of broad interest given significant association with other autoimmune diseases including type 1 diabetes [99], Crohn’s disease [100] and primary biliary cirrhosis [101]. ORMDL3 encodes a protein involved in regulating endoplasmic reticulum-mediated calcium signalling, in turn modulating the unfolded protein response which is proposed to provide a link with inflammation [102]. This situation is complex with a number of variants implicated and likely several genes involved, with evidence of allele-specific chromatin remodelling in the region and involvement of the insulator factor CTCF [103]. More recently, other investigators have shown the value of eQTL mapping in interpreting GWAs for a range of traits including coeliac disease [104], body mass index [105] and psoriasis [106].

When considering eQTL data, it is important to note that the vast majority of genome-wide data sets to date have been generated using expression microarrays and that observed associations for specific oligonucleotide probes may be confounded in many cases by sequence variation falling within the sequence bound by the probe – this may lead to spurious results apparently suggesting an eQTL is present [107].
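A practical safeguard is to flag any expression probe whose target sequence contains a known variant before interpreting an apparent eQTL for that probe. The sketch below shows one way this check could be written. The function name, the probe and variant records, and the use of 1-based inclusive coordinates are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def probes_overlapping_variants(probes, variants):
    """Return {probe_id: [variant positions]} for probes containing a known variant.

    probes maps probe ID -> (chromosome, start, end), 1-based inclusive.
    variants is an iterable of (chromosome, position).
    """
    positions_by_chrom = {}
    for chrom, pos in variants:
        positions_by_chrom.setdefault(chrom, []).append(pos)
    for positions in positions_by_chrom.values():
        positions.sort()

    flagged = {}
    for probe_id, (chrom, start, end) in probes.items():
        positions = positions_by_chrom.get(chrom, [])
        # Binary search for the variants lying within [start, end].
        first = bisect_left(positions, start)
        last = bisect_right(positions, end)
        if last > first:
            flagged[probe_id] = positions[first:last]
    return flagged


# Hypothetical 50-mer probes and known variant positions.
example_probes = {"probe_A": ("17", 1000, 1049), "probe_B": ("17", 5000, 5049)}
example_variants = [("17", 1020), ("17", 4999), ("7", 1020)]
flagged_probes = probes_overlapping_variants(example_probes, example_variants)
```

Associations involving flagged probes are not necessarily false, but they would warrant confirmation by an independent method that does not depend on hybridization to the variant-containing sequence.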

Recently published work involving type II diabetes and metabolic traits further highlights the informativeness of an integrated approach to following up GWAs signals [108]. A striking example was noted at chromosome 7q32.3 of significant association with type II diabetes [21] and HDL cholesterol [109], with a parent of origin effect involving the maternal allele and differential expression of the KLF14 gene [110]. A local likely cis-acting eQTL was defined in adipose tissue for the expression of KLF14, a gene that encodes the transcription factor Kruppel-like factor 14. Strikingly, the same associated variants show strong trans-eQTL with at least 10 genes in adipose tissue which were found to be enriched for KLF binding sites and whose expression correlates with metabolic phenotypes and themselves harbour metabolic trait-associated variants [108]. These data underline the complexity of how particular variants may modulate function in a given tissue, dependent on epigenetic mechanisms and in turn helping resolve further disease associations within existing GWAs data sets as well as providing new insights into disease process.

Transcript profiling using RNA-seq

RNA-seq has been successfully applied to eQTL mapping. This is exciting given the inherent advantages of the technology, which does not rely on hybridization but directly sequences the transcripts. Data for LCLs established from individuals of European and African ancestry have highlighted the increased resolution that can be achieved, notably including associations with the expression of alternatively spliced isoforms and long noncoding RNAs, as well as highlighting effects on transcriptional termination [95, 96]. With greater read length and coverage, analysis arising from RNA-seq will be increasingly informative while falling reagent costs and opportunities for multiplexing should ensure broad application across the field [111]. There are, however, significant bioinformatic challenges remaining in the analysis of RNA-seq data, not least relating to read mapping, quantification of expression at transcript isoform resolution and differential expression [112]. Potential bias introduced by the reference genome used for mapping is a particular issue for allele-specific quantification using RNA-seq as discussed later in this review.

Transcription landscape and epigenetics

The analysis of chromatin immunoprecipitation experiments using high-throughput sequencing (ChIP-seq) is providing genome-wide resolution of binding by specific transcription factors in a variety of contexts, as well as the presence of specific histone modifications allowing interrogation of epigenetic mechanisms. ChIP-seq analysis of 10 LCLs for NF-κB and RNA polymerase II binding underlined how often any two individuals differ in observed binding regions (7.5% and 25%, respectively), while human/chimp comparison indicated differences in 32% of sites [113]. Strong correlation with SNVs and structural variants was noted.

Understanding the regulatory transcriptional landscape is highly informative when considering GWAs, as illustrated by our work mapping binding by the ligand-activated vitamin D receptor (VDR) to DNA which demonstrated significant enrichment in GWAs intervals for a variety of autoimmune diseases as well as cancer and other specific traits related to vitamin D [114]. This provides a route-map for GWAs signals that may relate to genes modulated by vitamin D and provides a link with growing epidemiological evidence implicating vitamin D in autoimmune disease susceptibility. Other data for specific loci such as HLA-DRB1 illustrate how disease risk haplotypes may be associated with allele-specific recruitment of VDR, providing a potential link between genetic and environmental risk factors for disease [115].

Specific histone marks such as H3K4me1 are often associated with more distant enhancer elements [116]. Such sites may also be characterized by open chromatin as revealed by DNase I hypersensitivity (DHS) mapping. While conventionally analysed by Southern blotting, DHS experiments can also be analysed using high-throughput sequencing (DNase-seq) [117]. Active chromatin sites based on DNase-seq and ChIP-seq for the transcription factor CTCF showed evidence of heritability when analysed in LCLs from family pedigrees with 10% of sites individual specific [118].

A further, experimentally more straightforward method to assay open chromatin is FAIRE (Formaldehyde-Assisted Isolation of Regulatory DNA Elements) [119], which proved very informative in understanding GWAs signals for type II diabetes [120]. FAIRE-seq analysis in pancreatic islet cells resolved that a variant associated with type II diabetes in the TCF7L2 gene (encoding the transcription factor 7-like 2) was not only located in a region of open chromatin but showed allelic differences in accessibility and enhancer activity with evidence of tissue specificity [121]. Such work underlines the future importance of characterizing the function of disease-associated variants in the most disease-relevant cell/tissue type and context, making use of a variety of approaches to resolve associations. The recent generation of publicly available ChIP-seq, DNase-seq and FAIRE-seq data sets for a number of different cell types [122, 123] is an important next step in such work.

The power of analysis defining chromatin accessibility and modifications in combination with gene expression is illustrated by the work of Higgs and colleagues in the α globin locus. Careful study of this region in the context of thalassaemia has been highly informative in understanding fundamental processes in gene regulation and the impact of particular mutations [124, 125]. These can be dramatic, as found for a gain-of-function intergenic regulatory variant identified in individuals with α-thalassaemia which was associated with allele-specific histone acetylation, recruitment of transcription factors and Pol II binding resulting in a new transcriptionally active region, ‘stealing’ transcriptional activity from downstream α globin genes whose expression was significantly reduced [126].

DNA methylation is a critical epigenetic alteration modulating gene expression that involves addition of a methyl group to the 5 position of the pyrimidine ring of cytosine residues in CpG dinucleotides [127]. DNA methylation is known to be dependent on cell type, developmental stage and environmental factors. Recently, it was found that allele-specific differences in DNA methylation at nonimprinted loci are common across the genome [128] and that CpG methylation can be mapped as a quantitative trait [129]. A number of techniques are available for analysing DNA methylation at genome-wide resolution, based primarily on restriction enzyme digestion as seen with Methyl-seq, or affinity enrichment, for example, with antibodies specific for methylated cytosines (Me-DIP-seq) [130]. Whole genome bisulphite sequencing remains technically challenging but would offer significant advantages including single-base resolution.

High-throughput sequencing is also dramatically advancing our understanding of the role of microRNAs in gene expression acting through posttranscriptional mechanisms, a process thought to be critical in 30% of genes [131]. A striking example of how underlying sequence variation may modulate this process was seen for HLA-C, where an insertion/deletion polymorphism in the 3′UTR affected binding by miR-148 [132]. This provided a mechanism for the observed association of a linked SNP 35 kb upstream of HLA-C that was previously strongly associated with HIV control [133].

Allele-specific gene expression

The analysis of allele-specific gene expression has proved a powerful approach to try to resolve functional variants. When present in the heterozygous state in an individual, a biallelic genetic variant such as a single nucleotide substitution can provide a useful marker of the allelic origin of a transcript and of allele-specific expression. For example, when located in coding sequence, transcript abundance specific to each allele can be quantified based on the presence of the variant in the transcribed RNA [134]. For genes without transcribed genetic markers, relative allelic expression can be assessed by haplotype-specific chromatin immunoprecipitation (haploChIP) for RNA polymerase II using antibodies specific for the phosphorylated serine residues in the C-terminal domain characteristic of actively transcribing RNA polymerase II [135]. This approach can also be used to resolve allele-specific recruitment of specific transcription factors such as activated B cell factor-1, which we demonstrated was recruited to the LTA (encoding lymphotoxin alpha) gene [136] in the presence of a genetic variant subsequently associated with susceptibility to leprosy [137].
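The idea of quantifying transcript abundance for each allele at a heterozygous transcribed variant can be sketched as follows. This is an illustrative example only, not the method used in the cited studies: it assumes that reads have already been counted for each allele, and it uses an exact two-sided binomial test against an expected allelic fraction, which is one common choice. The function name, the parameter `expected_ref_fraction` and the read counts are hypothetical.

```python
from math import exp, lgamma, log


def allelic_imbalance(ref_reads: int, alt_reads: int,
                      expected_ref_fraction: float = 0.5):
    """Return (observed reference fraction, two-sided exact binomial p-value).

    expected_ref_fraction is the fraction expected with no allelic
    difference: 0.5 for a balanced heterozygote, or another value if the
    analyst has estimated a systematic bias (an assumption of this sketch).
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover the heterozygous site")
    p = expected_ref_fraction
    if not 0.0 < p < 1.0:
        raise ValueError("expected_ref_fraction must lie strictly between 0 and 1")

    def pmf(k: int) -> float:
        # Binomial probability computed in log space to avoid overflow.
        log_prob = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1.0 - p))
        return exp(log_prob)

    observed = pmf(ref_reads)
    # Sum the probabilities of all outcomes no more likely than the one seen.
    tail = sum(q for q in (pmf(k) for k in range(n + 1))
               if q <= observed * (1.0 + 1e-9))
    return ref_reads / n, min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical read counts at one heterozygous coding SNP.
    fraction, p_value = allelic_imbalance(ref_reads=60, alt_reads=40)
    print(f"reference allele fraction = {fraction:.2f}, p = {p_value:.3g}")
```

In practice, counts from several heterozygous sites or individuals would usually be combined, and overdispersion in read counts may make a simple binomial model too optimistic.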

For a small number of imprinted genes, monoallelic expression is seen dependent on the parental origin of alleles [138]. Genome-wide surveys have shown that smaller differences in allelic expression are common, involving an estimated 20% of autosomal genes, with differences in relative allelic expression typically of 1.5-fold [139–141]. Genome-wide allele-specific discrimination is now possible at high resolution using RNA-seq [95, 96], but there is potential bias arising from the reference sequence to which reads are mapped, which may or may not match a particular read from a given individual [142]. Analysis of allele-specific gene expression is highly complementary to eQTL data, and the two can be integrated to facilitate mapping of likely regulatory variants [95, 96, 143].

The analysis of individuals homozygous for particular genomic regions can also be informative. We recently analysed gene expression for LCLs established from individuals homozygous for autoimmune disease risk haplotypes spanning 3.5 Mb of the classical Major Histocompatibility Complex (MHC) on chromosome 6p21 and demonstrated that allelic differences were common and often involved alternative splicing [144]. This was made possible by using a custom microarray that combined a strand-specific tiling path probe set with probes specific to known and predicted splice junctions and included alternate allele probes for sequence variants identified by resequencing as part of the MHC Haplotype Project [145].


There is no doubt that recent advances in genomics, currently driven by new high-throughput sequencing techniques, are taking us to remarkable new levels in our understanding of the human genome, and the genetic and epigenetic variation that exists, with important implications for our understanding of human disease. As this knowledge grows, so too does our appreciation of the complexity with which we are faced. For common multifactorial traits, GWAs have been very informative but leave much heritable risk unresolved. Rarer variants may prove important but, in general, more integrated approaches are needed in which environmental risk factors are considered and combined with functional genomic analyses. Moreover, we need to derive functional genomic data in a disease-relevant setting, as the consequences of underlying genetic and epigenetic diversity are increasingly recognized to be highly context specific.

Current technologies that can interrogate the whole genome carry with them significant caveats: these tools are new, and successful application to important biological problems requires careful experimental design and consideration of the limitations inherent in such approaches. The data sets involved are highly complex, and analysis remains extremely challenging with significant risks of false positive and negative results until the field matures. High-throughput sequencing is not a panacea but a critical tool in current genomics. Used wisely, it is resolving the individual genome and epigenome, at a structural and functional level, and will radically advance our understanding of disease. For Mendelian traits, the impact is already being felt. For common multifactorial diseases, this may take a little longer.


The work of the author is funded by the Wellcome Trust [grant number 074318 /075491/Z/04], the Medical Research Council [grant ID 98082] and the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement number 281824.

Conflicts of interest statement

No conflicts of interest to declare.