Systems biology for hepatologists


  • José M. Mato,

    Corresponding author
    1. CIC bioGUNE, Ciberehd, Parque Tecnológico de Bizkaia, Bizkaia, Spain
    • Address reprint requests to: Prof. José M. Mato, CIC bioGUNE, Parque Tecnológico de Bizkaia, 48160 Derio, Bizkaia, Spain. Fax: +34-944-0611301.

  • M. Luz Martínez-Chantar,

    1. CIC bioGUNE, Ciberehd, Parque Tecnológico de Bizkaia, Bizkaia, Spain
  • Shelly C. Lu

    1. Division of Gastroenterology and Liver Diseases, USC Research Center for Liver Diseases, The Southern California Research Center for Alcoholic and Pancreatic Diseases & Cirrhosis, Keck School of Medicine USC, Los Angeles, CA, USA

  • Potential conflict of interest: Dr. Mato consults for and owns stock in Owl. He consults for Abbott.

  • Supported by NIH RO1AT1576 (to M.L.M-C., S.C.L., J.M.M.), RO1DK051719 (to S.C.L., J.M.M.), Spanish Plan Nacional I+D SAF 2011-29851 (to J.M.M.), ETORTEK-2010 Gobierno Vasco (to M.L.M.-C, J.M.M.), PI11/01588, Sanidad Gobierno Vasco 2008, Educación Gobierno Vasco 2011 (to M.L.M.-C), 2012 (J.M.M.). Ciberehd is funded by ISCiii.


Medicine is expected to benefit from combining usual cellular and molecular studies with high-throughput methods (genomics, transcriptomics, proteomics, and metabolomics). These methods, collectively known as omics, permit the determination of thousands of molecules (variations within genes, RNAs, proteins, metabolites) within a tissue, cell, or biological fluid. The use of these methods is very demanding in terms of the design of the study, acquisition, storage, analysis, and interpretation of the data. When carried out properly, these studies can reveal new etiological pathways, help to identify patients at risk for disease, and predict the response to specific treatments. Here we review these omics methods and mention several applications in hepatology research. (Hepatology 2014;60:736–743)


Abbreviations: DCA, deoxycholic acid; DMRs, differentially methylated regions; ENCODE, Encyclopedia of DNA Elements; FA, fatty acids; GC, gas chromatography; GWAS, genome-wide association study; HCC, hepatocellular carcinoma; MAT1A, methionine adenosyltransferase 1A; MS, mass spectrometry; NAFLD, nonalcoholic fatty liver disease; NASH, nonalcoholic steatohepatitis; SNP, single nucleotide polymorphism; TMAO, trimethylamine N-oxide; UPLC, ultraperformance liquid chromatography.

There is a nautical chart attributed to Christopher Columbus, evidently drawn before he set sail on the voyage that would lead to the discovery of America, that stretches from the south of Scandinavia to the mouth of the river Congo and shows all the Mediterranean ports of Europe and Africa in detail. The enormous space that Columbus dedicated to the Atlantic Ocean is conspicuously lacking in detail. In all probability, this huge blank space served not only to mark the frontier of the known world, and therefore the potential expansion of world knowledge; it also opened up a route for the imagination and the adventure of sailing through it, a route traveled by numerous sixteenth- and seventeenth-century explorers who, although destined in most cases to remain anonymous, changed the world forever.

In the same way, sequencing the human genome opened up a new era in biomedical sciences that is being explored by a legion of scientists. Biomedical research has evolved from the analysis of the effects of individual genes to a more integrated view that examines whole ensembles of genes as they interact during a biological process. This has changed the way we look at human disease and helps us better understand why specific therapies do or do not work. An example in hepatology is the use of a genetic variation near the IL28B gene that predicts the response to hepatitis C therapy.[1] This way of thinking has, however, given excessive value to a style of research in the biosciences that consists of measuring everything (genes, proteins, metabolites) in a biological system in the hope that, upon analyzing this huge amount of information, new properties of the system will emerge that allow an integral understanding. This holistic approach often forgets that in biology the interactions between molecules (DNA, RNA, proteins, and metabolites) are characterized not by exclusivity, but by the multiplicity of possible interactions between some molecules and others. The problem is that it is not possible to discover how an organism works from a model that incorporates hundreds of thousands of measurements of its internal components, simply because there is no single solution, no predefined design, and no unique 3D structure. From this perspective, health or disease cannot be viewed as the result of the fulfillment of a linear program, but as the result of an open process in which a specific biological state springs from certain genetic information interacting with other information existing at that moment.

This vision helps to understand how multiple phenotypes can be formed from a single genome, and how environment and chance select, at each moment, one from among all possible phenotypes (Fig. 1). Even in diseases caused by mutations in a single gene, as in the case of phenylketonuria, the genotype predicts the biochemical phenotype well (the concentration of phenylalanine in the blood) but does not predict the clinical phenotype (the appearance of intellectual disability).[2] It is important to emphasize that clinical reasoning is basically Bayesian. In other words, the predictive value of a diagnostic test changes when the test is applied to a population in which the prevalence of the condition differs markedly from that in the population where the test was evaluated. For example, in a person diagnosed with iron overload the presence of a mutation in the HFE gene is a highly reliable predictor of hereditary hemochromatosis. However, in a population that has not been preselected for iron overload, the presence of the same mutation confers only a slight risk of developing clinical symptoms.[3] These results underscore the importance of interpreting studies of genetic variation within an adequate medical context.
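The dependence of predictive value on prevalence can be made concrete with Bayes' theorem. The sketch below is illustrative only: the sensitivity, specificity, and prevalence values are hypothetical numbers chosen to show the effect, not data on HFE testing, and the function name is our own.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Bayes' theorem: probability of disease given a positive test."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics (assumptions, not from the article).
SENSITIVITY = 0.90
SPECIFICITY = 0.95

# Hypothetical prevalences: a preselected clinic population versus
# an unselected general population.
populations = {
    "preselected patients": 0.50,
    "unselected population": 0.005,
}

for label, prevalence in populations.items():
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"{label}: prevalence={prevalence:.3f}, PPV={ppv:.2f}")
```

With the same test characteristics, the positive predictive value falls sharply as prevalence falls, which is the Bayesian point made above.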

Figure 1.

Metabolic fluxes are the best representation of the phenotype of an organism. Health or disease cannot be viewed as the result of the fulfillment of a linear program, but as the result of an open process in which a specific biological state or phenotype springs from certain genetic information interacting with other information existing at that moment. This vision helps to understand how multiple phenotypes can be formed from a single genome, and how environment and chance select, at each moment, one from among all possible phenotypes.


When carried out within an adequate medical context, genetic screens are powerful tools for identifying new genes and variations within genes that are involved in specific pathophysiological processes. For example, a single nucleotide polymorphism (SNP) that is consistently associated with nonalcoholic fatty liver disease (NAFLD) is a nonsynonymous substitution (a mutation in which a single nucleotide change results in a codon that codes for a different amino acid) in the PNPLA3 gene.[4] To identify this gene variant, a genome-wide association study (GWAS) of 9,299 nonsynonymous sequence variations was carried out in a population of 2,111 individuals from three different ethnic groups, in which hepatic fat content was measured by proton magnetic resonance spectroscopy.[5] The substitution of isoleucine by methionine at position 148 (I148M) of PNPLA3 was found to associate strongly with the accumulation of fat across the three ethnic groups studied, with an overall P-value of 5.9 × 10⁻¹⁰. More recently, this variant was associated with NAFLD progression to nonalcoholic steatohepatitis (NASH), alcoholic fatty liver disease, and hepatocellular carcinoma (HCC).[6-8] The experience of identifying PNPLA3 teaches us that this hypothesis-free approach to the identification of new genes and variations within genes involved in a pathological process needs to be statistically sound and requires a large sample of clinically well-characterized patients.
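To give a sense of what "statistically sound" means when thousands of variants are tested, the following sketch applies a Bonferroni correction to the numbers quoted above. The family-wise error rate of 0.05 and the choice of Bonferroni are assumptions made for illustration; the cited study may have used a different correction.

```python
N_VARIANTS = 9_299      # sequence variations tested (from the text)
OBSERVED_P = 5.9e-10    # reported P-value for PNPLA3 I148M (from the text)
ALPHA = 0.05            # assumed family-wise error rate (not from the source)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_VARIANTS)
print(f"Per-variant threshold: {threshold:.2e}")
print(f"Observed P-value below threshold: {OBSERVED_P < threshold}")
```

Under these assumptions the reported association clears the corrected threshold by several orders of magnitude.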

Through December 2013, 2,034 GWAS had been published, leading to the identification of several hundred disease-associated gene variants.[9] GWAS aim to identify potentially associated genes, at the whole-genome level, based on the statistical significance of the differential occurrence of common SNPs when comparing populations with distinct traits, such as disease and health or drug responders and nonresponders.[10] Interestingly, the majority of these SNPs are not in gene coding regions, which suggests that these variants affect regulatory elements of the genome that have key functions in the development of complex diseases, such as those of the liver. This agrees with the results of the Encyclopedia of DNA Elements (ENCODE) consortium, an ambitious project that aims to identify and characterize all functional elements in the human genome,[11] whose principal conclusion is that the majority of the genome, although not coding for proteins, is active and performs important regulatory functions.[12]

Transcriptomics and Proteomics

Genomic technologies have made it feasible to investigate the expression of thousands of genes at a time using large sets of samples. This technology has often been used with the aim of developing tests that more reliably diagnose diseases and predict the response to specific treatments. However, the clinical application of these diagnostic and prognostic gene expression signatures has been delayed for three main reasons. First, complex diseases, like NASH, cirrhosis, or HCC, likely involve a large number of different genes and biological pathways and are very heterogeneous in terms of clinical manifestations, genomic alterations, and gene expression patterns.[13-15] Therefore, large cohorts of well-characterized patients are necessary to obtain genomic signatures of clinical relevance. Second, diagnostic and prognostic gene signatures contain a large number of genes, and the prediction algorithms are complex and not easy to use routinely in the clinic.[16] Third, the development of complex molecular tests based on DNA, RNA, protein, or metabolite profiles carries a series of problems inherent to all high-throughput techniques in which large datasets are analyzed.[17] Selecting the statistically significant results from a large dataset that also contains nonsignificant data is challenging because, when multiple significance tests are calculated, the probability that at least one reaches significance by chance increases with the number of tests performed.[18] It is therefore critical to control this multiplicity problem as well as to use one or more model validation techniques to assess how the results of a statistical analysis will generalize to an independent dataset.[18] However, in the race to apply genomics technology, genomics studies that handle massive quantities of data and contain avoidable errors are published too frequently.[17] Yet when used correctly, microarray technologies may be translated into score systems that can reproducibly predict clinical outcomes. 
An example in hepatology is the development of a simple risk score classifier based on the expression of a small number of genes that can predict in a reproducible manner overall survival of patients after surgical resection for HCC.[19] In a recent report, the Institute of Medicine identified best practices for future research and development of omics-based tests.[20] These practices include the use of rigorous statistical methods, bioinformatics and data management, and open access to the datasets and algorithms used to develop the test. The application of these best practices should be reinforced by all research organizations.[21]
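The multiplicity problem described above can be illustrated numerically. The sketch below first computes the probability of at least one false positive across independent tests, and then applies the Benjamini-Hochberg procedure, one common way of controlling the false discovery rate. The P-values are invented for illustration, and the choice of Benjamini-Hochberg is ours; it is not necessarily the method used in the cited studies.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the tests declared significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k such that p_(k) <= fdr * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


for n in (1, 10, 100, 1000):
    print(f"{n:>5} tests at alpha=0.05: FWER={family_wise_error_rate(0.05, n):.3f}")

# Hypothetical P-values for ten genes (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, fdr=0.05)
print("Genes retained after FDR control:", significant)
```

Five of the ten hypothetical P-values fall below 0.05 when taken one at a time, but fewer survive once the number of tests is taken into account.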

Another major objective of genomic research is to associate transcriptomic data with the molecular pathways that underlie disease. However, gene expression changes in complex diseases, such as those of the liver, often reflect processes that are secondary to the pathological process. To overcome this problem, transcriptional networks have been developed based on the assumption that gene products that are causative of a disease process, and whose expression is altered in a pathological condition, have similar expression patterns (coexpressed genes).[22] An impressive example of this approach is the use of an integrative genomic method, based on the analysis of transcriptional networks in human brain, to identify a new molecular pathway linking late-onset Alzheimer's disease, aging, and APOE4.[23] The same principle applies to proteomic research, where the concentration of hundreds to thousands of proteins is determined simultaneously in a cell type or tissue. An application of this method in hepatology is the demonstration that knocking out the liver-specific prohibitin-1 gene (Phb1) in mice results in the spontaneous development of severe liver disease and HCC,[24] which followed the proteomic finding that liver PHB1 content was decreased in an experimental model of NASH.[25]
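The idea of coexpression can be sketched in a few lines: genes whose expression profiles correlate strongly across samples are linked by an edge. This is a deliberately simplified illustration using thresholded Pearson correlation; the gene names, expression values, and threshold are hypothetical, and published network methods are considerably more elaborate.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Link every pair of genes whose absolute correlation reaches the threshold."""
    edges = []
    for gene_1, gene_2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene_1], expression[gene_2])
        if abs(r) >= threshold:
            edges.append((gene_1, gene_2, round(r, 3)))
    return edges


# Hypothetical expression values across five samples (illustration only).
expression = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

edges = coexpression_edges(expression)
for gene_1, gene_2, r in edges:
    print(f"{gene_1} -- {gene_2} (r={r})")
```

In this toy dataset only gene_A and gene_B are linked; gene_C varies independently of both and remains unconnected.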

One of the best ways to learn about the function of a gene is to generate a mouse with a deletion in that specific gene. The function of around 7,300 mouse genes has been described using this approach.[26] In general, these knockout mice have been generated for genes that had been studied previously and for which a phenotype was anticipated, as in the case of Phb1 mentioned above. Interestingly, on many occasions no obvious phenotype is observed, probably because only the expected traits are investigated. As a result, the full biological function of many genes for which knockout mice are available is not known. Genome-wide mouse programs aiming to generate knockout mice with mutations in all protein-coding genes are under way.[26, 27] Over 900 knockout mice, many with phenotypic data, have been made openly available for further analysis.[27] It is important to note that even when the deletion of a known gene is associated with a phenotype, the molecular mechanism by which mutation of that gene leads to that phenotype is not obvious at first glance, and elucidating it requires extensive experimental work.

The coordinated changes in gene expression patterns associated with the molecular pathways that underlie disease are controlled at multiple levels. Examples include nucleosome remodeling, noncoding RNAs, histone variants, and histone modifications (e.g., by acetylation or methylation).[28] MicroRNAs and RNA-binding proteins play a critical role in the posttranscriptional regulation of global changes in gene expression.[11, 12] DNA methylation is another key epigenetic modulator of gene expression that is generally associated with transcriptional repression.[11, 12] Large-scale DNA methylation mapping studies have provided important insights into gene regulation and the development of various diseases, particularly in tumorigenesis.[29] One of the most important objectives of DNA methylation mapping research is to link DNA methylation changes with the expression of the gene pathways that underlie disease. Here, as in genome research, a variety of software has been released to facilitate the identification of differentially methylated regions (DMRs), the classification of DMRs into enriched genomic regions, and the comparison of DNA methylation and gene expression changes. Recently, this technique has been applied elegantly to identify differences in DNA methylation that could distinguish patients with advanced versus mild NAFLD,[30] and it led to the identification of MAT1A (methionine adenosyltransferase 1A) as one of the principal down-regulated genes in NASH.[31] These findings agree with earlier work demonstrating that MAT1A expression and MAT activity are markedly reduced in human liver cirrhosis,[32, 33] and that MAT1A expression is silenced in human HCC.[34] Furthermore, deletion of Mat1a in mice causes NASH and HCC,[35, 36] which supports the concept that MAT1A may be a therapeutic target from NASH to HCC.[37]


Metabolomics, the high-throughput identification and quantification of small molecules (<1,500 Da), is the most recent branch of omics-based technology to be incorporated into biomedical research. While in other omics fields thousands of targets are routinely analyzed at a time, until recently few studies had identified and quantified more than 100 metabolites at a time in a large set of samples. Two factors have made it feasible to determine the concentration of hundreds of metabolites at a time using large sets of samples. The first is the release of electronic databases equivalent to GenBank or UniProt, such as the Human Metabolome Database. The second is the development of modern high-resolution nuclear magnetic resonance (NMR) spectroscopy and of mass spectrometry (MS) technologies, such as ultraperformance liquid chromatography-MS (UPLC-MS) and gas chromatography-MS, for the identification and quantification of thousands of metabolites at a time in as little as a few minutes per sample.

The human serum metabolome, as resolved by today's technology, comprises around 4,200 metabolites, half of which are phospholipids and over a thousand of which are glycerolipids (triglycerides [TG], diglycerides, and monoacylglycerols).[38] In other words, around three-quarters of the known human serum metabolome consists of lipids. Amino acids, peptides, carbohydrates, amines, and carboxylic acids complete the list. Thousands of different lipids seem far more than the four bases used by DNA to encode the genetic information of an organism, far more than the 20 amino acids that are the building blocks of proteins, and far more than the hundreds of carbohydrates and carboxylic acids that form the central carbon metabolism. But ultimately, these many thousands of lipids make sense if we consider, for instance, that an average car has over 10,000 moving parts. From the storage of energy and the establishment of the permeability barrier for cells and cell organelles, to the regulation of membrane-associated processes such as oxidative phosphorylation, intracellular trafficking, cell growth, and apoptosis, and the facilitation of membrane protein folding in a manner similar to protein molecular chaperones, lipids perform essential biological functions.

Metabolic dysfunction has been implicated in a wide variety of human diseases, such as obesity, NAFLD, diabetes, inborn errors of metabolism, and cancer, to mention just a few.[39] The results are consistent with an important contribution of metabolic imbalance, that is, the rerouting of the metabolic fate of lipids, carbohydrates, and amino acids through intermediary metabolism, to the initiation and/or progression of these and other diseases. Consequently, there is increased interest in understanding which metabolic differences between normal and diseased tissues could lead to the development of more selective and effective treatments. Cellular metabolism consists of a multitude of enzymatic reactions, inextricably interconnected, that serve two functions: one, the conversion of thousands of molecules into building blocks for macromolecular biosynthesis; and two, the constant supply of energy by way of adenosine triphosphate (ATP) and redox equivalents (NADPH). The concentrations of the metabolites in a cell are the result of the fluxes in the metabolic reactions, which ultimately depend on the conditions of the moment, such as the available nutrients, hormonal and neural factors, the properties of the enzymes involved, and the levels of the metabolites themselves, as they exert important feedback and feedforward regulation on the system.[40] Notably, the liver parenchyma shows a zonal distribution of key metabolic enzymes and metabolism, which indicates that there are different types of hepatocytes in the liver. For instance, oxidative phosphorylation, glucose output, urea synthesis, and bile acid synthesis are higher in the periportal area, whereas glucose uptake, glutamine formation, and xenobiotic metabolism are greater in the perivenous area.[41] Metabolic zonation is altered in liver steatosis,[42] but whether this reflects processes that are secondary to the pathological process is an open question. 
From this perspective, it is clear that the metabolic fluxes represent the final outcome of cellular regulation at many different levels, and hence they are the best representation of the phenotype of an organism (Fig. 1).

A consequence of this convoluted network of enzymatic reactions that make up cellular metabolism is that it is not possible to conclude that a certain metabolic pathway is altered in a cell or tissue simply by measuring the steady-state concentration of its metabolites at a single timepoint. For example, the three major sources of fatty acids (FA) used by the liver to synthesize TG are the diet, de novo lipogenesis, and the adipose tissue; and the four major fates of hepatic FA are mitochondrial β-oxidation, biosynthesis of other lipid classes, esterification and storage as TG in lipid droplets, and assembly as TG into very low-density lipoproteins and export into blood (Fig. 2). Processes that lead to an imbalance between the intake and biosynthesis of TG and the export and catabolism of TG cause NAFLD. Elucidating which of these potential mechanisms are responsible for hepatic TG accumulation under a specific condition requires careful measurements of metabolic fluxes, using labeled tracers, as well as the determination of the content of dozens to hundreds of metabolites and the activities of key enzymes. Unfortunately, studies are too frequently published in which a pathological process is associated with changes in a certain metabolite or group of metabolites based simply on their steady-state concentration, or on the quantification of mRNA and/or protein of specific enzymes. Moreover, it is important to remember that changes in the concentration of metabolites often reflect processes that are secondary to the pathological process. However, when used correctly, metabolic studies may lead to the identification of the rate-limiting step responsible for a pathological process. 
An example in hepatology is the identification that an excess of hepatic S-adenosylmethionine (SAMe), which occurs in the setting of impaired glycine N-methyltransferase-mediated catabolism, reroutes phosphatidylethanolamine (PE) metabolism towards the biosynthesis of phosphatidylcholine (PC), by way of activation of the enzyme PE methyltransferase (Fig. 2). The excess PC thus generated is used by the hepatocyte to synthesize TG that accumulate into lipid droplets, causing NAFLD.[43]
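Why a single steady-state concentration cannot reveal flux, and why labeled tracers can, is easy to show with a one-pool model. The sketch below assumes a constant influx and first-order efflux, so the pool obeys dC/dt = influx - k·C. The model, the parameter values, and the function names are all illustrative assumptions, not measurements from the studies cited.

```python
from math import exp


def steady_state_pool(influx, rate_constant):
    """Pool size at which constant influx equals first-order efflux (k * pool)."""
    return influx / rate_constant


def labeled_fraction(rate_constant, time):
    """Fraction of the pool that is labeled at `time` after the input is
    switched to fully labeled tracer, with the pool already at steady state."""
    return 1.0 - exp(-rate_constant * time)


# Two hypothetical conditions: (influx, efflux rate constant), arbitrary units.
conditions = {
    "slow turnover": (1.0, 0.1),
    "fast turnover": (10.0, 1.0),
}

for name, (influx, k) in conditions.items():
    pool = steady_state_pool(influx, k)
    enrichment = labeled_fraction(k, time=1.0)
    print(f"{name}: pool={pool:.1f}, flux={influx:.1f}, labeled at t=1: {enrichment:.2f}")
```

Both conditions give the same pool size, yet the flux through the pool differs tenfold; only the rate of tracer incorporation distinguishes them.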

Figure 2.

FA metabolism in liver. The three major sources of FA used by the liver to synthesize triglyceride (TG) are the diet, de novo lipogenesis, and the adipose tissue; and the four major fates of hepatic FA are mitochondrial β-oxidation, biosynthesis of other lipid classes (such as phospholipids, cholesterol esters, and sphingolipids), esterification, and storage as TG into lipid droplets (LD), and assembly as TG into very low-density lipoproteins (VLDL) and export into blood. Hepatic TG can be synthesized by desaturation, elongation, and esterification of FA, or by the phosphatidylethanolamine N-methyltransferase (PEMT) pathway that converts phosphatidylethanolamine (PE) to phosphatidylcholine (PC). TG export by way of VLDL requires incorporation of PC synthesized by the PEMT pathway.

The human metabolome is an ocean full of biomarkers. Accordingly, a central objective in metabolomics research is the discovery of specific metabolic profiles (in serum, urine, feces, sweat, tears, saliva, tissues) that associate with disease or with the response to specific treatments. The development of metabolomics-based diagnostic and prognostic tests shares the problems inherent to all high-throughput techniques, namely, the detection of statistically significant relationships between a group of metabolites and disease while minimizing the risk of false-positive associations.[18, 44] An additional complication in metabolomics, as compared to other omics-based methods, is the preparation and storage of the samples, due to large differences in solubility and stability among metabolites. When used correctly, metabolomics is a powerful novel approach for biomarker identification. For example, a serum lipidomic signature associated with NAFLD progression has been identified.[45] To obtain this signature, 540 serum metabolites were determined by UPLC-MS in a population of 467 biopsied individuals with different body mass indexes.[45, 46]


In addition to the 22,000 or so protein-coding genes of the human genome, the collective genome of the human gut flora is estimated to contain 100 to 200 times that number.[47] This collective genome, the microbiome, provides us with an additional and extraordinary metabolic capacity that modulates host energy and lipid metabolism,[48] and whose importance in health and disease we are only beginning to understand. Gut microbiota alterations have been associated with susceptibility to certain diseases such as obesity, diabetes, celiac disease, cardiovascular disease, and NASH.[48] An example of this complex relationship between the gut microbes and host metabolism is the discovery of a new pathway for gut flora-mediated generation of the pro-atherosclerotic metabolite trimethylamine N-oxide (TMAO) from dietary PC.[49] Another example, which illustrates the complex relationship between gut microbiota and liver disease, is the demonstration that bile acid metabolism by intestinal bacteria has a key role in obesity-associated HCC development.[50] The authors analyzed the serum metabolites of mice fed a high-fat or a normal diet by UPLC-MS and observed an increase in the levels of deoxycholic acid (DCA), a secondary bile acid produced solely by gut bacterial 7α-dehydroxylation of primary bile acids. DCA is known to cause DNA damage and enhance liver cancer. Interestingly, lowering DCA levels in obese mice treated with the carcinogen dimethylbenz(a)anthracene decreased HCC development.[50] These results underscore the complex relationship between the gut flora and host metabolism, and the importance of assessing medical risks and of monitoring, diagnosing, and treating patients according to their specific metabolic phenotype.

In conclusion, the ultimate aim of omics-based research in hepatology is to translate this knowledge into useful results that improve our understanding of complex biological processes, make reliable predictions in silico of human liver drug toxicity, and provide clinically relevant tests (Fig. 3). However, several problems need to be overcome to ensure the successful translation of these technologies. One is adopting protocols that yield consistent results in different laboratories so that data can be built into a single repository. Another problem is the integration of all the data generated by omics-based screens (such as RNAs, proteins, metabolites, protein-protein interactions, protein-lipid interactions, protein-nucleic acid interactions, and so on). Once these two problems are solved, the translation of omics-based results into clinically useful products will be within reach.

Figure 3.

Omics-based medicine. The ultimate aim of omics-based medicine is to translate human genomics, transcriptomics, proteomics, and metabolomics results into clinically useful products. To help researchers achieve this goal, several freely accessible initiatives have been established, such as the Genome Sequencing Program (GSP), the Encyclopedia of DNA Elements (ENCODE), the Genetic Variation Program (GVP), or the Genome-Wide Association Studies (GWAS) of the National Human Genome Research Institute. In transcriptomics, the Gene Expression Omnibus (GEO) provides a public repository that archives and freely distributes microarray and other functional genomics data. In proteomics, the Human Proteome Organization (HUPO) sponsors several initiatives such as the Human Liver Proteome Project (HLPP) or the Human Antibody Initiative (HAI); and in metabolomics, the Human Metabolome Database (HMDB) and other related resources such as KEGG, LipidMaps, and MassBank contain freely available information about metabolites found in the human body.