Potential conflict of interest: Nothing to report.
The wealth of human genome sequence information now available, coupled with technological advances in robotics, nanotechnology, mass spectrometry, and information systems, has given rise to a method of scientific inquiry known as functional genomics. By using these technologies to survey gene expression and protein production on a near global scale, functional genomics aims to assign biological function to genes with currently unknown roles in physiology. This approach carries particular appeal in disease research, where it can uncover the function of previously unknown genes and molecular pathways that are directly involved in disease progression. With this knowledge may come improved diagnostic techniques, prognostic capabilities, and novel therapeutic approaches. In this regard, the continuing evolution of proteomic technologies has steadily increased the impact of proteome studies in many areas of research, and hepatology is no exception. Our laboratory has been extremely active in this area, applying both genomic and proteomic technologies to the analysis of virus-host interactions in several systems, including the study of hepatitis C virus (HCV) infection and HCV-associated liver disease. Since proteomic technologies are foreign to many hepatologists (and to almost everyone else), this article will provide an overview of proteomic methods and technologies and describe how they are being used to study liver function and disease. (HEPATOLOGY 2006;44:299–308.)
The flow of biological information—from DNA to RNA to protein—immediately suggests that any view of a biological system that stops at the tier of nucleic acids will be incomplete. This is evidenced by the often poor to moderate correlation (e.g., <40% concordance) between the relative expression abundance of a gene and its biologically active protein product.1–4 Factors contributing to this disparity may include differences in the rates of synthesis and half-lives for an mRNA and the protein it encodes. Additionally, mRNA measurements cannot predict phenotypic protein variations resulting from downstream regulatory events (e.g., modification, interaction with other proteins, subcellular distribution, and activity) that confer or modify protein function.5, 6 Moreover, the limited presence of mRNA in body fluids restricts identification of clinically relevant disease biomarkers to the measurement of secreted proteins in these samples. It therefore appears essential that the proteins expressed in a cell or tissue also be analyzed and related to gene expression measurements at the mRNA level to provide the full picture of a biological process.
A comprehensive proteomic analysis requires characterization of the many diverse properties of a protein, all of which can impact cellular function and contribute to altered cellular states. Much of the initial proteomics effort has centered on cataloging the proteins expressed in a cell or tissue and identifying alterations in protein levels that occur in response to physiologic cues or environmental perturbations. This approach, termed “global comparative quantitative protein profiling”, can be utilized for a variety of research applications. For example, studies integrating gene and protein expression profiling data have uncovered important molecular events (e.g., posttranscriptional regulatory mechanisms) contributing to processes such as response to drug treatment and human cancer.1, 4 A related important application of quantitative proteomics is the identification of protein expression patterns that distinguish between control and disease samples.7–9 The identification of such biomarkers has tremendous potential to improve patient outcome by increasing the availability of molecular indicators for diagnosis, prognosis, and evaluation of therapeutic efficacy. By contrast, functional-based proteomics studies aim to characterize phenotypic changes in individual proteins that alter protein function and contribute to different cellular states. This encompasses experiments for defining the composition of protein complexes, identifying post-translational modifications that affect biological function, and activity-based studies that analyze the functional state of enzyme families.6, 7, 10–12 What follows is an overview of commonly used proteomics technologies and a description of recent accomplishments in hepatology research.
Mass Spectrometry (MS) Based Proteomic Strategies for Protein Identification
Recent success in the proteomics field has been driven in large part by major advances in effective separations coupled with mass spectrometry (MS), and by the computational tools that now permit rapid identification of hundreds to thousands of proteins in a single experiment. These technological advances have fueled the emergence of a variety of MS-based proteomics platforms to address the challenges of analyzing large complex biological systems.6, 10, 11, 13, 14 An overview of the most commonly used approaches for large-scale protein identification efforts is presented in Fig. 1; each of these strategies integrates sample fractionation methods (e.g., electrophoretic or liquid chromatographic separation) with MS analysis for protein identification/characterization. Up-front fractionation methods are employed to reduce the complexity of the peptide/protein mixture, thus increasing the ability to analyze deeper into the proteome and include lower abundance proteins. Physically interfacing these separation approaches with MS analysis is typically performed using either a solid phase-based matrix-assisted laser desorption ionization (MALDI) or a liquid phase-based electrospray ionization (ESI) approach. Multiple MS platforms exist (see Fig. 2 for a basic scheme) that are utilized for proteomic approaches. These platforms can all be used either alone or in conjunction with a tandem MS (MS/MS) approach to assist in fragmentation and peptide sequence identification. MS/MS-based approaches first measure the mass of intact “parent” peptides, then isolate a single “parent” peptide ion for fragmentation and subsequent mass analysis of the fragmentation spectra.
This fragmentation spectrum is then used in conjunction with automated database search tools, which compare the molecular weight information of the fragment ions to theoretically predicted fragments for all possible peptide sequences in a database, to identify the exact peptide sequence, and thus the parent protein, giving rise to the spectrum (Table 1).15, 16 For a more comprehensive description of mass spectrometry instrumentation and MS-based proteomic technologies we recommend several excellent reviews.10, 11, 13
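As an illustrative sketch of the underlying idea (not any particular search engine), the following Python toy compares an observed fragment spectrum against theoretical b- and y-ion masses computed for candidate peptide sequences. The residue mass table is deliberately abbreviated, and real search tools use far richer scoring than this simple matched-peak count:

```python
# Toy illustration of MS/MS database searching: score an observed spectrum
# against theoretical singly charged b- and y-ion masses of candidate peptides.
# Simplified on purpose: abbreviated monoisotopic mass table, no modifications,
# no charge states beyond +1, naive matched-peak counting.

MONO = {  # monoisotopic residue masses (Da), abbreviated set
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'L': 113.08406, 'K': 128.09496, 'R': 156.10111,
}
PROTON, WATER = 1.00728, 18.01056

def fragment_masses(peptide):
    """Theoretical singly charged b- and y-ion m/z values for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def match_score(observed_peaks, peptide, tol=0.5):
    """Count theoretical fragments matched by an observed peak within tol Da."""
    theo = fragment_masses(peptide)
    return sum(any(abs(t - p) <= tol for p in observed_peaks) for t in theo)

def best_candidate(observed_peaks, candidates):
    """Return the candidate peptide whose theoretical fragments best match."""
    return max(candidates, key=lambda pep: match_score(observed_peaks, pep))
```

For example, a spectrum consisting of the six fragment ions of the (hypothetical) peptide PVSK scores perfectly against PVSK and poorly against an unrelated sequence, so `best_candidate` recovers the correct sequence.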
Table 1. Proteomics Software Programs and Websites Where Additional Information Can Be Obtained
NOTE. Representative examples of the various types of software programs and resources available for proteomics research are provided as described here and in the text.
One of the most widely used platforms for separating and identifying proteins is two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) coupled with MALDI-TOF analysis. Proteins are first separated based upon both isoelectric point (pI) and molecular weight, followed by staining for spot visualization, thus allowing excision of the gel spot for identification. Identification is generally performed using a peptide-mass fingerprinting approach, which is based upon tryptic protein digestion followed by MALDI-TOF analysis of the resulting peptide mass signatures.17, 18 The 2D gel approach offers the advantage that the added information on intact molecular weight and pI facilitates efforts to distinguish between protein isoforms (e.g., splice or sequence variants) and phenotypic variants (e.g., arising from co/post-translational modification or proteolytic processing events) present in the original sample. Nevertheless, several limitations of 2D gel technology (e.g., co-migration of proteins on the gel, bias against membrane proteins and proteins of extreme pI, limited detection sensitivity) have led to the investigation of alternative fractionation methods for large-scale protein identification efforts. This has resulted in the emergence of a highly popular approach referred to as “shotgun proteomics” (Fig. 1).10, 11
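The principle of peptide-mass fingerprinting can be sketched in a few lines of Python. The cleavage rule and mass table below are simplified assumptions (trypsin cleaves after K or R, not before P; no missed cleavages or modifications), so this is a conceptual illustration rather than a production tool:

```python
import re

# Abbreviated monoisotopic residue masses (Da); illustrative subset only.
MONO = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'V': 99.06841,
        'L': 113.08406, 'K': 128.09496, 'R': 156.10111, 'F': 147.06841}
WATER = 18.01056

def tryptic_digest(protein):
    """In silico trypsin digestion: cleave after K or R, but not before P."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', protein) if p]

def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide (residues plus one water)."""
    return sum(MONO[aa] for aa in peptide) + WATER

def fingerprint_score(observed_masses, protein, tol=0.2):
    """Fraction of the protein's theoretical tryptic peptide masses matched
    by the observed MALDI-TOF peptide masses within tol Da."""
    theo = [peptide_mass(p) for p in tryptic_digest(protein)]
    hits = sum(any(abs(t - m) <= tol for m in observed_masses) for t in theo)
    return hits / len(theo)
```

A candidate database protein whose theoretical digest masses all appear in the observed peak list scores 1.0; unrelated proteins score lower, providing the basis for ranking identifications.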
In a fashion analogous to shotgun sequencing of genomic DNA, shotgun proteomics involves proteolytic digestion of the protein mixture into shorter peptide fragments prior to fractionation. This overcomes the technical difficulties inherent to separation and detection of intact proteins via 2D gels. However, the resulting peptide digest exhibits even greater complexity than the original mixture of intact proteins, thus necessitating multiple dimensions of separation to effectively resolve the peptides for high proteome coverage. The most commonly employed strategy involves multidimensional separation using strong cation exchange fractionation followed by reversed-phase high performance liquid chromatography (HPLC). Enrichment of targeted peptide classes (e.g., phosphopeptides, glycopeptides) can further reduce complexity and improve the detection of lower abundance proteins.9 Since ESI ionizes peptides and proteins in solution, the columns are typically interfaced with an electrospray ionization tandem mass spectrometer (LC-ESI-MS/MS) for mass analysis and peptide sequencing.
While the high-throughput nature of shotgun proteomics has earned it significant popularity, it should be noted that protein identification by this method becomes more challenging for several reasons.15, 19 Large-scale proteomics using LC-MS/MS and automated database searching is prone to an increase in the number of incorrect (false positive) peptide identifications. Additionally, the peptide-centric nature of this proteolysis-based approach (e.g., “bottom up” analysis) requires that the identified peptides be mapped back to their parent proteins. Insufficient protein sequence coverage and sequence redundancy (e.g., the same peptide sequence can be present in multiple different proteins) often preclude discrimination between protein isoforms, phenotypic variants, or closely related proteins in the absence of information about the mature form(s) originally present in the sample. This has significant implications for biological interpretation of shotgun proteomics data, where, for example, incorrect peptide (and thus protein) identification can mislead investigators searching for protein biomarkers. Moreover, since gross or subtle changes in protein phenotype can dramatically alter function, failure to identify the correct protein species originally present in a sample can hinder efforts to unravel the molecular basis of disease processes. Fortunately, the degree of stringency required for analysis, interpretation, and dissemination of proteomics data is becoming more widely appreciated, and an increasing number of strategies that minimize false positives and assign confidence measures to peptide and protein identifications are being reported.20–23 Additionally, new LC-MS methods are emerging for the separation and analysis of proteins in their native state (e.g., “top down” analysis). Top-down analysis, alone or in combination with “bottom up” analysis, should facilitate efforts to more completely characterize proteins and protein mixtures.24, 25
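One widely used safeguard against false positive identifications is target-decoy searching, in which spectra are also matched against a reversed (“decoy”) database; decoy hits above a score threshold then estimate the number of false positives among the target hits. A minimal Python sketch, with arbitrary illustrative scores, is:

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Target-decoy false discovery rate estimate: the number of decoy hits
    at or above the score threshold approximates the number of false
    positives among the target hits at or above the same threshold."""
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return d / t if t else 0.0

def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays within max_fdr."""
    for cut in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, cut) <= max_fdr:
            return cut
    return None
```

For example, with target scores [10, 9, 8, 7, 3, 2] and decoy scores [3, 2, 1], a cutoff of 7 retains four target identifications with no decoy hits, giving an estimated FDR of zero.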
MS, mass spectrometry; MALDI, matrix assisted laser desorption ionization; ESI, electrospray ionization; 2D PAGE, two dimensional polyacrylamide gel electrophoresis; FTICR, Fourier transform ion cyclotron resonance; MS/MS, tandem mass spectrometry; LC-MS/MS, liquid chromatography tandem mass spectrometry; LC-MS, liquid chromatography mass spectrometry; AMT, accurate mass and time.
Quantitative Proteomic Analysis
The ability to globally profile changes in protein abundance is essential for elucidating the events that occur during cellular processes. The use of 2D gels for quantitative proteomic studies remains popular since it is more amenable to monitoring changes in the abundance of phenotypic variants (e.g., protein isoforms, post-translationally modified or proteolytically processed proteins) that may be of functional importance.8, 13 However, conventional 2D-PAGE is plagued by the need to separate, stain, and quantify protein samples on individual gels, which must then be overlaid for comparison. Accurate matching and comparison of superimposed images is a tedious and time-consuming process prone to error. A refinement of the method known as 2D difference gel electrophoresis (2D DIGE) offers the advantage of two-sample comparison on one gel, thus minimizing experimental variation (Fig. 3A). However, problems such as co-migration and limited sensitivity remain a significant challenge.
In response to the limitations of 2D gels, a multitude of quantitative methods employing shotgun proteomics have evolved. These include several isotope-coded labeling strategies using either chemical [e.g., isotope-coded affinity tag (ICAT)], metabolic [e.g., stable isotope labeling by amino acids in cell culture (SILAC), 15N-enriched media], or enzymatic [e.g., proteolytic digestion in 18O water (H218O)] methods.26, 27 Briefly, peptide mixtures from two different samples are labeled with the light and heavy versions of an isotope tag, respectively, and then combined and analyzed by LC-ESI-MS/MS (Fig. 3B). Since the heavy labeled peptides have the same physicochemical properties as their light labeled counterparts, they co-elute during chromatographic separation and are identified in the mass spectrometer as a peptide pair differing in mass by the number of incorporated heavy isotopes. The relative abundance of the peptide is determined using software tools that calculate the peptide ion current ratio from the MS-derived ion chromatograms.15, 28 Multiple peptide measurements can then be combined to obtain an average protein expression ratio. In this regard, the open-source software tool ProteinProphet (Table 1) is helpful for first generating the simplest list of proteins sufficient to explain the identified peptides.21 This facilitates efforts to provide a measure of the relative abundance of proteins in the samples.19
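The ratio calculation described above can be sketched as follows. This is a simplified illustration only (real software integrates reconstructed ion chromatograms and models measurement error); the median is used here as one robust way to combine peptide-level ratios into a protein-level ratio:

```python
from statistics import median

def peptide_ratio(light_intensities, heavy_intensities):
    """Relative abundance for one peptide pair: the ratio of the summed
    ion current of the heavy channel to that of the light channel."""
    return sum(heavy_intensities) / sum(light_intensities)

def protein_ratio(peptide_pairs):
    """Combine several peptide-level measurements into one protein-level
    expression ratio; the median resists occasional outlier peptides."""
    return median(peptide_ratio(l, h) for l, h in peptide_pairs)
```

For a protein observed through three peptide pairs whose heavy channels each carry twice the ion current of their light counterparts, the protein-level ratio is 2.0, i.e., a two-fold abundance increase in the heavy-labeled sample.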
Finally, the ease and limited expense associated with approaches that rely on mass spectral peak intensities or the number of MS/MS spectra assigned to a given protein have made these label-free methods popular for protein quantitation.29, 30 The complementary nature of stable isotope labeling and label-free methods suggests that a combination of these approaches may provide improved validation and increase the likelihood for detecting protein changes.
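As a sketch of one common label-free approach, the function below implements a normalized spectral abundance factor (NSAF)-style calculation: each protein's spectral count is scaled by its length and then normalized across all proteins in the sample so that larger proteins, which yield more peptides, are not overweighted. The counts and lengths shown in the test are hypothetical:

```python
def nsaf(spectral_counts, lengths):
    """Label-free quantitation via normalized spectral abundance factors.

    spectral_counts: protein -> number of MS/MS spectra assigned to it
    lengths:         protein -> protein length in residues
    Returns protein -> NSAF value; values sum to 1 across the sample.
    """
    # Length-normalized counts correct for larger proteins producing
    # more observable tryptic peptides.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}
```

Comparing NSAF values for the same protein across two samples then gives a simple label-free relative abundance estimate.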
Extracting Biologically Meaningful Information
The field of proteomics continues to expand at a very rapid pace and the identification of disease biomarkers has received considerable attention owing to the potential to vastly improve patient diagnosis, treatment and outcome. In the sections that follow we address some of the practical considerations necessary for success in such proteomic endeavors.
Sample Quality and Experimental Design.
The observational nature of many proteome studies requires careful consideration of potentially confounding variables that might lead to artifacts in the assignment of a phenotypic response to the disease process under study.31 Failure to consider variations in sample collection, storage, and quality can compromise the ability to identify consistent protein expression profiles that correlate with specific histological or clinical phenotypes. Perhaps this is best exemplified in proteomic studies of serum, where, for example, variations in clotting time, storage time, and storage temperature cause changes in the mass spectra profiles, thus compromising their predictive value when analyzing artificially created differentiable sample groups.32 Similarly, the presence of Optimal Cutting Temperature compound (OCT), a cryoprotectant commonly used for tissue banking, has been shown to interfere with the detection of liver biopsy peptides by MS analysis (Fig. 4B). These findings point to the need for new guidelines for tissue collection and storage, where samples destined for proteome analysis are best preserved by flash freezing (Fig. 4A).
Additional considerations include sample size and heterogeneity. Although proteomics is considered high-throughput, owing to its ability to simultaneously track a large number of proteins, time and cost considerations often result in the analysis of a limited number of samples. This under-sampling (e.g., fewer biological or technical replicates) can further compromise the ability to extract biologically meaningful information owing to the lack of statistical power. Moreover, failure to consider sample heterogeneity (e.g., disease etiology, tumor heterogeneity) precludes the ability to determine whether changes in protein abundance correlate with specific histological or clinical phenotypes. Taken together, these observations demonstrate the need for routine interaction among a multidisciplinary group of scientists with expertise in proteomics, biology, clinical medicine, and statistics in order to ensure the careful study design and high-quality sample preparation necessary for success in proteomics studies.
Ultra-Sensitive Proteomic Analysis of Clinical Samples.
The analysis of complex in vivo systems provides the best opportunity to decipher and understand human liver biology and disease. In this regard, much of the proteomics effort in hepatology has centered on the study of human liver cancer using diseased tissue.33–36 Similarly, serial liver biopsy specimens acquired from HCV-infected liver transplant recipients offer the unique opportunity to study the early effects of HCV infection that lead to liver disease. However, unlike samples collected from surgically resected liver specimens, where milligram amounts of starting material are obtained, the low protein yields (often less than 50 μg total protein) associated with much smaller clinical specimens present a significant challenge for proteomic analysis. In our experience, the use of conventional methods (e.g., ICAT + LC-ESI-MS/MS) precluded attempts to detect a broad abundance range of proteins from human liver biopsies. Successful analysis of such limited amounts of material depends on the use of an ultra-sensitive nanoproteomics platform. To this end, the combination of high resolution Fourier transform ion cyclotron resonance (FTICR) mass spectrometry together with the accurate mass and time (AMT) tag strategy37 provides the ultra-high sensitivity necessary for identification of thousands of proteins from μg rather than mg amounts of starting material by eliminating the need to repeat time-consuming LC-MS/MS peptide identification analyses on every sample. An illustration of the approach is provided in Fig. 5. Applying the AMT tag approach, over 1,500 proteins have been identified from only 2 μg of a protein digest obtained from a liver biopsy sample.38 This represents a significant advancement in clinical proteomics, making it possible to track physiologically relevant protein abundance changes in vivo using small patient samples.
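The essence of the AMT tag strategy, assigning identities to LC-MS features by matching accurate mass and normalized elution time (NET) against a previously built tag database rather than re-acquiring MS/MS spectra for every sample, can be sketched as below. The peptide names, masses, and tolerances are hypothetical illustrations:

```python
def match_amt_tags(features, tag_db, ppm_tol=5.0, net_tol=0.02):
    """Assign peptide identities to LC-MS features via an AMT tag lookup.

    features: list of (accurate_mass, net) tuples observed in an LC-MS run
    tag_db:   peptide -> (tag_mass, tag_net) from prior LC-MS/MS campaigns
    A feature matches a tag when its mass agrees within ppm_tol parts per
    million and its normalized elution time agrees within net_tol.
    """
    matches = {}
    for mass, net in features:
        for peptide, (tag_mass, tag_net) in tag_db.items():
            if (abs(mass - tag_mass) / tag_mass * 1e6 <= ppm_tol
                    and abs(net - tag_net) <= net_tol):
                matches[(mass, net)] = peptide
    return matches
```

The tight mass tolerance afforded by FTICR instruments is what makes this lookup discriminating: two tags only a few millidaltons apart are still resolved, and the elution-time dimension disambiguates the remainder.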
Data Storage, Analysis, Interpretation and Dissemination.
A key issue that remains to be discussed is the need for significant bioinformatics and computational biology resources to support proteomics technologies. The massive volumes of data generated using high-throughput proteomic technologies can be overwhelming, and investigators are left with the daunting challenge of exploiting this information to gain a better understanding of complex biological processes. The marked variation in proteomics platforms has precluded the establishment of a complete or universal “solution” for effective analysis of the large quantities of data originating from various studies, and a multitude of proteomic analysis resources have emerged. What follows is an overview of the types of software programs available for proteomics data analysis and management (Table 1).
The use of different mass spectrometers with output data in a variety of proprietary formats requires software programs for conversion to a standard format (e.g., mzXML or mzData), allowing utilization of software tools that support peptide and protein identification and quantitation (Table 1). A number of integrated proteomics analysis pipelines exist that incorporate various combinations of these software programs together with a secure database repository for data storage and mining (Table 1). An example of one such integrated system is shown in Fig. 6.39, 40 Briefly, data processed via peptide and protein identification and quantitation software (PRISM-MTS and VIPER) are transferred to a secure database repository (EAM) for storing and managing functional genomics data and accompanying experimental information. The integration of clinical and histopathological data aids efforts to identify protein expression patterns or clinical variables of relevance to disease progression. Coupling to commercial data analysis software such as Elucidator (Rosetta Biosoftware) provides error modeling, analysis, and visualization tools for mining proteomic data. Additional links to commercial and publicly available software programs containing prediction and characterization tools for protein analysis (e.g., ExPASy), biological and functional annotation tools (e.g., SpotFire, Babelomics), and in-depth pathway and network oriented analysis tools (e.g., Ingenuity Pathways Analysis, MetaCore, Cytoscape) can further augment data analysis capabilities (Table 1, Fig. 6). Finally, implementation of data sharing between internal and external investigators is commonly achieved via web-based access to proteomic repositories. In this regard, it is worth noting that various public repositories now exist for integration of data from the proteomics community at large (Table 1).
A key goal of many functional genomics studies is to utilize these tools to begin integrating and analyzing high-throughput proteomics and genomics datasets. By collecting information at both the proteomic and genomic level, one can maximize the chance of identifying biomarkers of the conditions being studied and uncovering cellular mechanistic pathways that contribute to disease processes. Although such endeavors present several challenges, including the lower proportion of proteome coverage (e.g., hundreds to thousands of proteins) relative to genome coverage (e.g., tens of thousands of genes), even limited analyses can significantly contribute to the understanding of complex biological systems. Consistent with previous reports,1–4 our initial studies using an in vitro model of human immunodeficiency virus (HIV) infection demonstrated that mRNA levels had little predictive power for protein abundances, and in-depth pathway oriented analyses have provided insights into potential posttranscriptionally regulated pathways during virus infection.
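The degree of mRNA-protein concordance referred to above is often summarized with a rank correlation computed over matched measurements. The following self-contained sketch computes a Spearman coefficient (without tie handling, which real analyses would need) for paired mRNA and protein abundance values:

```python
def rank(values):
    """0-based ranks of a list of distinct values (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation between matched measurement lists,
    e.g., mRNA abundances (xs) and protein abundances (ys)."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Perfectly concordant gene/protein pairs yield a coefficient of 1.0 and perfectly discordant pairs -1.0; values near zero reflect the poor predictive power of mRNA levels for protein abundance noted in the text.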
Application of Proteomics to the Study of Liver Function and Disease
Proteomics has enormous potential to enhance our understanding of fundamental aspects of how biological systems operate, as well as providing practical insights that will impact medical practice. Here we describe a number of studies highlighting the use of proteomics to study liver function and disease.
Global Comparative Quantitative Protein Profiling.
Given that hepatocellular carcinoma is one of the most frequently occurring cancers worldwide, it is not surprising that numerous investigators are searching for proteomic signatures of liver cancer. A consistently emerging theme from the analysis of tumor tissue from patients with a variety of different disease etiologies, including HCV infection, is the apparent decline in abundance of fatty acid oxidation enzymes.33–36, 41 Similar findings have been observed in two independent proteomic studies of HCV replication in vitro38, 42 and more recently in liver biopsy specimens obtained from HCV-infected individuals who have progressed to various stages of fibrosis (Diamond et al., manuscript in preparation). While the widespread nature of fatty acid oxidation down-regulation suggests it may not be useful for etiological diagnosis, it remains to be determined whether the molecular basis underlying these perturbations is conserved among the various causative agents leading to chronic liver damage and hepatocellular carcinoma. The apparent link with progression of liver injury and fibrosis43 and increased risk for developing liver cancer makes these cellular metabolic pathways attractive targets for further investigation of HCV-specific disease processes and potential discovery of new anti-fibrotic drugs.
Proteomics is also receiving considerable attention in the area of liver toxicity, where protein abundance changes that precede the onset of biochemically detectable liver damage represent potential early markers of compounds prone to inducing hepatotoxicity.44 Other studies have provided insights into potential mechanisms of hepatoprotection. For example, several highly expressed chaperone and proteasome components have been identified in mice resistant to acetaminophen-induced liver damage.45 These may function as hepatoprotective factors by facilitating the folding or degradation, respectively, of proteins denatured or otherwise harmfully altered during toxicity. Similarly, analysis of a sheep model of copper-associated liver toxicity (e.g., Wilson disease) suggests that an adaptive response associated with the upregulation of glutathione synthetase and glutathione S-transferase mu may protect against oxidative liver damage during moderate levels of copper loading.46 This adaptive response is abrogated at higher levels of copper loading, where oxidative stress-induced liver injury occurs.
While the studies described above focus on characterization of liver tissue lysates, the observation that mitochondrial protein expression clusters into functional modules (e.g., oxidative phosphorylation, steroidogenesis, heme biosynthesis) exhibiting tightly correlated, tissue-specific patterns of gene expression highlights the importance of organelle-based proteomic studies in human health and disease.47 Two recent in-depth proteomic surveys of several mouse organs and organelles, including liver, represent major advancements in this area.48, 49 These studies provide molecular compendiums upon which other information (e.g., gene expression data, cis- and trans-regulatory elements) have been overlaid to gain a better understanding of the regulatory organization of the human genome. The availability of these resources will greatly aid efforts to dissect the pathways involved in human disease and identify organelle-specific targets for disease monitoring and drug development.
Proteomic strategies that characterize the functional state of proteins are also an important component of efforts to understand liver function and disease. This is elegantly demonstrated in a study employing chemical tagging strategies to identify differentially expressed enzyme activities in the liver of lean and obese (ob/ob) mice.50 The upregulation of hydroxypyruvate reductase in ob/ob mice suggests that non-classical glucose biosynthetic pathways contribute to the hyperglycemic phenotype in states of obesity. By contrast, oxidative modifications that impair activity of the detoxifying enzyme aldehyde dehydrogenase are proposed to contribute to mitochondrial dysfunction and liver injury in a rat model of chronic alcohol consumption.51 Alternative mechanisms of impaired mitochondrial function have been reported in an S-adenosylmethionine knockout mouse model for nonalcoholic steatohepatitis.52 Finally, the identification of a novel methyltransferase activity and several arginine methylated proteins in the rat liver Golgi complex represents an exciting avenue of research into a potential role for Golgi protein methylation in human disease.53
Perspectives and Future Directions
As proteomic studies move increasingly towards in vivo model systems exhibiting significant heterogeneity, the success of such endeavors will depend on the ability to identify consistent protein expression profiles that correlate with specific histological or clinical phenotypes. The challenge for proteomics is to provide the throughput, sensitivity, and dynamic range necessary for proper clinical proteomic studies on the order of several hundred to thousands of samples; current proteomic approaches and technologies are not yet sufficient for this volume. Moreover, the implementation of proteomics standards, similar to the Minimum Information About a Microarray Experiment (MIAME) standards required for reporting microarray data,54 will go a long way toward better enabling researchers to properly interpret, independently verify, and compare large-scale proteomic datasets.
Moving forward, an area of significant potential is the discovery of surrogate markers of liver disease in easily collected body fluids such as human blood plasma or serum. Although several challenges exist, most notably the difficulty of detecting low abundance proteins, several studies point to the potential utility of this approach for disease diagnosis and monitoring.9 In this regard, the wealth of knowledge obtained from quantitative and functional proteomic studies of the liver will provide valuable guidance in the effort to identify circulating markers likely resulting from leakage, secretion, or shedding of proteins from diseased liver tissue.
Ultimately, the data generated from proteomics studies will be most useful when viewed in the context of an overall and integrated picture.55, 56 The utilization of bioinformatics tools to integrate gene expression data with both quantitative and functional proteomics data from multiple independent datasets will go a long way toward providing a systems level understanding of liver function and disease associated processes. In this regard, Thorgeirsson et al.55 describe an elegant model for performing such integrative functional genomics endeavors to study human liver cancer. It is our hope that the continued adoption of proteomic technologies among translational investigators will serve to stimulate further technological and methodological innovations that ultimately lead to better patient care.
We would like to thank Dr. Marcus Korth and Dr. Kathie Walters for critically reading the manuscript. We thank the Environmental Molecular Sciences Laboratory at PNNL for use of the instrumentation applied in this research. Portions of this research were supported by the NIH National Center for Research Resources (RR18522). Work was performed in the Environmental Molecular Science Laboratory, a U. S. Department of Energy (DOE) national scientific user facility located on the campus of Pacific Northwest National Laboratory (PNNL) in Richland, Washington. PNNL is a multiprogram national laboratory operated by Battelle Memorial Institute for the DOE under contract DE-AC05-76RLO-1830.