Colour Online: See the article online to view Figs. 1–4 in colour.
In this review, we present an update on the progress of the Human Protein Atlas, with an emphasis on strategies for validating immunohistochemistry-based protein expression patterns and on the possibilities to extend the map of protein expression patterns for cancer research projects. The objectives underlying the Human Protein Atlas include (i) generating validated antibodies toward a major isoform of all proteins encoded by the human genome, (ii) creating an information database of protein expression patterns in normal human tissues, in cells, and in cancer, and (iii) utilizing the generated antibodies and protein expression data as tools to identify clinically useful biomarkers. The success of such an effort depends on the validity of antibodies as specific binders of their intended targets in the applications used to map protein expression patterns. The development of strategies to support specific target binding is crucial and remains a challenge, as a large fraction of the proteins encoded by the human genome is poorly characterized, including the approximately one-third of all proteins that lack evidence of existence. Conceivable methods for validation include the use of paired antibodies, i.e. two independent antibodies targeting different and nonoverlapping epitopes on the same protein, as well as comparative analysis of mRNA expression patterns and the expression patterns of the corresponding proteins.
With the completion of the Human Genome Project, the next major challenge is to systematically map the building blocks of man encoded therein, i.e. the human proteome. Antibody-based proteomics provides one powerful strategy for the systematic exploration of the proteome, and the Human Protein Atlas project was initiated in 2003 to pursue a systematic high-throughput generation of affinity-purified polyclonal antibodies. Subsequent immunohistochemistry (IHC)- and immunofluorescence (IF)-based profiling of protein expression patterns in tissues and cell lines is used to generate an expression map that is made publicly available at http://www.proteinatlas.org [2, 3]. Antibodies generated within the Human Protein Atlas project have also been used for detailed mapping of the expression of 89 proteins in 25 selected rat brain areas, and a database portal (http://www.proteinatlas.org/rodentbrain) has been set up for visualization of the images and results.
The global setup for the analysis within the Human Protein Atlas project allows for addressing basic questions: what fraction of the genome is expressed at the protein level in a given cell type; how large a fraction of our genes encodes cell type specific proteins, housekeeping proteins, and proteins differentially expressed across cell types; and how this correlates with defined cell functions, molecular pathways, and morphological phenotypes. Results from an earlier study based on 5934 antibodies, targeting 4842 proteins, suggested that cellular phenotype and function are not primarily determined by the expression of cell type specific proteins, i.e. proteins expressed in only one cell type, but rather by a tight regulation of the levels of proteins expressed from a large proportion of the genome (Fig. 1). In fact, only a few proteins appear to be expressed exclusively in a limited number of cell types. The details of the high-throughput strategy and workflow of the Human Protein Atlas project, as well as descriptions of the publicly available Human Protein Atlas portal (http://www.proteinatlas.org), have recently been reviewed elsewhere [6, 7].
IHC offers the benefit of immediate and intuitive visual presentation of the localization and relative abundance of proteins in complex tissues at cellular and, to a certain extent, subcellular resolution. However, one apparent drawback of using antibodies for mapping the proteome is that antibodies specifically targeting all putative human proteins are not yet available. This bottleneck is increasingly being addressed in efforts such as the European Union (EU) initiatives “Proteome binders” and “Affinomics” (http://www.affinomics.org/), and in projects aiming to generate antibodies targeting a defined set of cancer-associated proteins (NCI), proteins that contain SH2 domains, and all nonredundant proteins (Human Protein Atlas).
The spatial resolution offered by IHC enables a broader insight into the association between protein expression and morphology, phenotype, and function of cell types in complex tissues. This is an important advantage compared to other available methods for large-scale analyses of expression, such as microarray-based or mass-spectrometric expression analyses, performed on tissue lysates or homogenates. As most tissues are comprised of additional components other than the cell types underlying the specific function of a given tissue, such as extracellular matrix, fibroblasts, blood vessels, nerves, and inflammatory cells, the read-out from an assay performed on a homogenate will not represent solely tissue-specific cells of interest (Fig. 2A and B). This issue is of particular importance in cancer research since the tumor stroma and microenvironment form an integral component of solid tumors (Fig. 2C and D).
IHC constitutes a qualitative rather than quantitative method. The output is interpreted subjectively and the observed intensity levels are directly coupled to the titer of the antibodies used. The lack of appropriate reference standards, i.e. defined samples containing known amounts of the target protein, remains an obstacle and hinders accurate quantification through manual as well as image analysis based evaluation of IHC stainings. Resulting intensities are known to vary between individual experiments and laboratories, as exemplified by the level of reproducibility in scoring HER2 positivity as a prognostic biomarker for tumor aggressiveness and a predictive biomarker for response to trastuzumab (Herceptin), which has been reported to be as low as 20%. For this reason, the tissue microarray (TMA) technology has revolutionized the evaluation of IHC markers. This technology enables a multitude of tissue specimens to be stained simultaneously in a single experiment, under the same conditions, which allows for the relative comparison of staining intensities across different cell types or tissues within the TMA. The trade-off of employing TMAs for tissue analyses is that only small, selected regions of large tumors are represented, which makes it difficult to fully assess subpopulations of tumor cells with heterogeneous expression patterns.
2 Validating antibody-based protein profiles
In any antibody-based assay, the specificity with which each antibody recognizes and binds its intended target needs to be validated in order to determine the reliability of the assay. Since antibodies have affinities not only toward the intended protein (on-target) but also to some degree to a number of undesired and generally unknown proteins (off-targets), unwanted binding events are likely to occur in any assay. Depending on relative levels of the “on-target” protein and other “off-target” proteins expressed in the analyzed tissue sample, and on the interaction kinetics of the off- and on-target antibody binding, this will inevitably obscure the validity and interpretation of protein expression data (exemplified in Fig. 3).
2.1 Technical antibody validation
There are several ways to validate the “on-target” specificity of an antibody. For known proteins with available experimental data from other sources, possible strategies include comparing IHC staining results with previously published data and using established negative as well as positive controls. These approaches are obviously not applicable to proteins for which no previous data exist. For such proteins, bioinformatic models can be useful for obtaining information on predicted molecular weight and probable subcellular localization, which can then be compared with observed results in Western blot (WB), IHC, or IF assays. Another approach, which does not require any previous information, is the protein array setup. However, antibody binding to the correct protein fragment on a protein array does not guarantee appropriate binding to denatured forms of the “on-target” protein present in other applications, nor does it exclude binding to epitopes on other proteins present in the denatured sample that is analyzed.
The above-mentioned methods are all useful for basic characterization of an antibody, but more functional assays are required to actually prove that the antibody is “on-target.” This can be achieved for instance by preabsorbing the antibody with antigen, or by transiently knocking down the expression using siRNA transfection and subsequently measuring the loss of signal, or conversely by gain of expression experiments to measure increase in signal. Another strategy is to perform epitope mapping to identify the sequence of amino acids to which the antibody binds. Epitope mapping followed by affinity capture has also been shown to be a strategy for generating epitope-specific antibodies, each targeting a single epitope, from polyclonal antibodies. The same study also described a dramatic difference in functionality across different applications (WB, IHC, and IF) for the individual epitope-specific antibodies. Although such functional assays can provide compelling evidence of on-target binding, they are time consuming and not easily incorporated in a high-throughput workflow.
2.2 Validation using paired antibodies
An attractive path for high-throughput on-target validation is the use of “paired antibodies,” i.e. the use of two or more antibodies directed toward nonoverlapping epitopes on the same protein. In this setup, paired antibodies can be used to cross-validate each other by, e.g., generating similar WBs or IHC staining patterns on consecutive sections (Fig. 3). The strategy to manually compare and score stainings with paired antibodies, provided that more than one antibody expression pattern for a given protein is available, has recently been implemented in the Human Protein Atlas project. The result is denoted an “annotated protein expression” (APE) score, and represents an effort to derive a best estimate of true on-target protein expression (Fig. 3, right panel). Previously published literature or expression data, bioinformatic predictions, etc., are also taken into account when generating the APE score, by awarding the most reliable antibody(ies) more weight in the comparison. At present, approximately one-quarter of all protein-coding genes on the Human Protein Atlas portal are presented with an APE, and the long-term ambition is to provide APEs for all protein-coding genes in the human genome, so that the portal becomes a knowledge-based database of protein expression rather than merely a repository for immunohistochemical images. Paired antibodies also enable staining of protein expression using the proximity ligation assay, a technology that extends the specificity of traditional immunoassays by requiring both antibodies to bind in order for a signal to be generated. The method not only allows for increased specificity in the detection of a particular protein, but also makes it possible to detect protein–protein interactions or phosphorylation of a particular protein. The methodology of epitope mapping with subsequent affinity capture may provide a possibility to generate “paired antibodies” for the entire proteome. For proteins for which two polyclonal antibodies cannot be generated due to extensive homology, paired antibodies targeting different single epitopes of approximately six amino acids are a plausible alternative.
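As an illustration of how agreement between paired antibodies could be summarized numerically, the following minimal Python sketch compares two staining profiles across tissues. It is not the APE procedure, which is described above as a manual, knowledge-based annotation; the four-step intensity scale, the function name `paired_concordance`, the tolerance of one step, and the example annotations are all assumptions made for the purpose of the example.

```python
# Hypothetical ordinal scale for manually annotated staining intensity.
LEVELS = {"negative": 0, "weak": 1, "moderate": 2, "strong": 3}


def paired_concordance(stain_a, stain_b, tolerance=1):
    """Return the fraction of shared tissues in which two antibodies
    agree within `tolerance` steps on the ordinal intensity scale."""
    shared = sorted(set(stain_a) & set(stain_b))
    if not shared:
        raise ValueError("the two antibodies have no tissues in common")
    agree = sum(
        abs(LEVELS[stain_a[t]] - LEVELS[stain_b[t]]) <= tolerance
        for t in shared
    )
    return agree / len(shared)


# Invented annotations for two antibodies raised against the same protein.
antibody_1 = {"liver": "strong", "kidney": "moderate",
              "colon": "negative", "skin": "weak"}
antibody_2 = {"liver": "moderate", "kidney": "moderate",
              "colon": "strong", "skin": "negative"}

score = paired_concordance(antibody_1, antibody_2)
print(score)  # 0.75: the antibodies disagree only in colon
```

A low score would flag the pair for manual review; it cannot by itself tell which of the two antibodies is off-target.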
2.3 Validation using data on RNA expression
Another strategy for validating protein expression patterns is to compare the results with levels of RNA in corresponding cells (Fig. 3, bottom panel). Cell lines are suitable samples for comparative analysis of mRNA and protein levels since each cell line constitutes a homogeneous collection of cells and, unlike tissues, does not represent a range of different phenotypes. Antibody validity can be assessed by comparing WB results or IHC staining to known levels of the corresponding transcript. Different cell lines, or, e.g., one cell line grown under different conditions, can be assembled in a cell microarray format, sectioned and stained analogously to tissue samples in tissue microarrays. IHC on cell lines enables the implementation of unbiased automated image analysis for quantification of IHC positivity. Although the levels cannot reflect more than relative measurements of protein expression in the cells included in the cell microarray (CMA), the image analysis based expression data offer the possibility to evaluate to what extent IHC results and RNA levels correlate. It should be noted that IHC as well as MS are limited in sensitivity and resolution. This is exemplified by a previous study, in which mRNA levels were compared with protein levels, as detected using SILAC-based MS and antibody-based confocal microscopy, showing that low-abundance transcripts, as exemplified by the functional group of G protein-coupled receptors (GPCRs), were often not detected at the protein level. Within the Human Protein Atlas project, the three cell lines routinely analyzed using IHC as well as IF have also been analyzed by next-generation RNA sequencing using the Illumina system (Illumina Inc., San Diego, CA, USA). RPKM (reads per kilobase of exon model per million mapped reads) data from the three cell lines along with the IHC and IF images are available on the Human Protein Atlas portal. On-going work within the project aims to collect and analyze RNA expression from all cell lines and normal tissues presented in the Human Protein Atlas. These data will provide important validation of how our genes are expressed in both cell lines and human tissues.
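For readers less familiar with the RPKM unit, the short Python sketch below shows how the value is commonly computed: the read count of a gene is normalized by the length of its exon model in kilobases and by the total number of mapped reads in millions. The function name and the example numbers are hypothetical and are not taken from the Human Protein Atlas data.

```python
def rpkm(gene_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return gene_reads * 1e9 / (exon_length_bp * total_mapped_reads)


# Invented example: 500 reads on a 2000 bp exon model in a library
# of 10 million mapped reads.
value = rpkm(gene_reads=500, exon_length_bp=2000,
             total_mapped_reads=10_000_000)
print(value)  # 25.0
```

Because both normalizations are linear, doubling the sequencing depth while the read count for the gene also doubles leaves the RPKM unchanged, which is what makes values comparable across the cell lines.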
Although the complex machinery of regulatory mechanisms controlling protein levels is likely to include regulation of translation, RNA and protein turnover kinetics, micro-RNAs, posttranslational modifications, etc., data on RNA abundance are a valuable complement to the validation scheme. Numerous transcript-profiling studies have been performed on the assumption that a particular phenotype is the result of which proteins are expressed and at what levels, and that transcriptomic data can serve as a blueprint for protein expression. The correlation between RNA and protein levels is therefore highly relevant, and several studies have shown that although protein levels appear to be regulated by means other than transcript abundance, a significant correlation between RNA and protein levels can be shown for a large portion of the proteome [22, 23]. Most previous studies have used MS analysis for obtaining protein data, and although MS has the capability of providing quantitative data, it is limited in the number of gene products that can be measured. Emerging global antibody-based proteomic data now allow for comparative analysis of the correlation between mRNA and protein levels as measured using IHC.
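Because IHC read-outs are at best semi-quantitative, a rank-based measure is a natural choice when comparing them with RNA levels. The following Python sketch computes a Spearman rank correlation (with average ranks for ties) between RNA abundance and an ordinal IHC score across a set of cell lines. The choice of Spearman correlation, the function names, and the example values are assumptions for illustration and do not describe the analysis used in the cited studies.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented data for one gene in five cell lines.
rna_rpkm = [0.5, 12.0, 48.0, 3.1, 150.0]
ihc_score = [0, 1, 3, 1, 3]  # 0 = negative ... 3 = strong

rho = spearman(rna_rpkm, ihc_score)
print(round(rho, 3))  # 0.949
```

With only a handful of cell lines the estimate is noisy, so such a coefficient is better read as a screening aid than as evidence of antibody specificity on its own.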
2.4 Evidence map
A gene-centric Human Proteome Project has been proposed to characterize the human protein-coding genes. On-going efforts for global characterization include antibody-based and MS-based projects to map the distribution and abundance of proteins in organelles, cells, tissues, and organs [24, 25]. Other global efforts aim to characterize protein interactions to enable interactomics maps for better understanding of protein function. Establishing the actual existence of a protein is, however, one vital step in the basic characterization of the proteome. All experimental evidence of protein existence, including information on transcript abundance and various assays aiming to verify and determine protein expression, is needed to create an information matrix covering all levels of evidence for the existence of protein expression, i.e. an “evidence map.” Ultimately such an evidence map will facilitate the work of defining the actual protein-expressing genes in our genome. By discriminating true protein-coding genes from pseudo-genes and adding data from various platforms, a genome-wide matrix including the status for all protein-encoding genes regarding subcellular localization, tissue distribution, and molecular characterization of the corresponding proteins can be envisioned. Several hurdles exist, as sensitive unbiased methods for determining the existence of an unknown protein have yet to be developed. The development of more sensitive MS technologies will be important to detect proteins expressed at very low levels. Although cell type specific protein expression appears rare, it is possible that several proteins exist that have a highly restricted pattern of expression. Highly specialized cell types, e.g. those present in the inner ear, retina, and olfactory system, may be difficult to sample in sufficient amounts and in suitable states for analyses of protein expression. Other complicating factors include the possibility that some proteins are only transiently expressed in certain cell populations during development, rendering sampling of relevant human tissues difficult. Despite the difficulties of reaching full coverage and certainty regarding which genes are expressed at the protein level, it is of high value to generate an information database where all relevant experimental data on protein existence can be compiled. An outline for such an evidence map has been suggested for a subset of the human genes, defined as all proteins encoded from chromosome 21 (Fig. 4A and B).
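One way to picture the evidence map described above is as a gene-by-assay matrix in which each cell records whether a given type of experiment supports the existence of the protein. The Python sketch below is a minimal data structure along those lines. The evidence categories, the class and method names, and the placeholder gene identifiers are hypothetical; the layout suggested for chromosome 21 in Fig. 4 may differ.

```python
from dataclasses import dataclass, field

# Hypothetical evidence categories, one matrix column each.
EVIDENCE_TYPES = ("transcript", "mass_spectrometry", "western_blot",
                  "ihc", "immunofluorescence")


@dataclass
class EvidenceMap:
    """Gene-centric matrix of experimental support for protein existence."""
    rows: dict = field(default_factory=dict)

    def record(self, gene, evidence_type, supported):
        """Store whether one assay type supports expression of `gene`."""
        if evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"unknown evidence type: {evidence_type}")
        self.rows.setdefault(gene, {})[evidence_type] = bool(supported)

    def support_count(self, gene):
        """Number of assay types with positive evidence for `gene`."""
        return sum(self.rows.get(gene, {}).values())

    def as_matrix(self):
        """Rows sorted by gene; None marks an assay not yet performed."""
        return {gene: [evidence.get(t) for t in EVIDENCE_TYPES]
                for gene, evidence in sorted(self.rows.items())}


# Placeholder gene identifiers, not real annotations.
evidence_map = EvidenceMap()
evidence_map.record("GENE_A", "transcript", True)
evidence_map.record("GENE_A", "ihc", True)
evidence_map.record("GENE_A", "western_blot", False)
evidence_map.record("GENE_B", "transcript", True)

print(evidence_map.support_count("GENE_A"))  # 2
print(evidence_map.as_matrix()["GENE_B"])    # [True, None, None, None, None]
```

Keeping "not assayed" (None) distinct from "assayed and negative" (False) matters here, since absence of data is the common case for poorly characterized proteins.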
3 Antibody-based proteomics and biomarker discovery
With the advances in molecular biology, enabling high-throughput techniques and large-scale analyses of a multitude of samples simultaneously, the interest in identifying biomarkers that can be used in clinical oncology and pathology has increased dramatically. The vision is to develop personalized medicine in which novel therapeutic agents and biomarkers are the fundamental pillars: biomarkers would be used for diagnostics, for detection of pre- or early stages of cancer, for subclassification of cancer, for monitoring drug response, for detecting recurring disease, and for predicting response to specific therapies and patient outcome. In its most advanced form, personalized medicine will require such biomarkers to tailor the optimal therapeutic intervention based on stratification of each diagnosed cancer [29, 30].
In addition to protein expression in normal tissues, the Human Protein Atlas contains IHC-based expression data for the 20 most common forms of cancer, with 12 individual tumors representing each tumor type. This allows for efforts to identify tumor type specific expression patterns and also to identify proteins that are differentially expressed in different tumors of a given type. Using this strategy to mine the Human Protein Atlas database, several new potential cancer biomarkers have been identified, leading to on-going efforts to further characterize the identified proteins and to investigate their clinical usefulness by analyzing expression patterns in large clinical patient cohorts. The two main biomarker discovery strategies within the Human Protein Atlas project (one focusing on cancer using TMAs and IHC, and the other focusing on blood-based analyses using a suspension bead antibody array) have recently been reviewed elsewhere.
3.1 Antibodies as biomarker tools in IHC
For IHC biomarkers, most progress to date has been made in the area of lymphohematopoietic tumors, for which antibody-based molecular characterization constitutes an invaluable tool for diagnostics and stratification of lymphoma and leukemia. Noteworthy is the identification of CD markers through a series of international workshops based on characterizing antibodies targeting proteins expressed on the surface of human blood cells, and how this continuous effort has provided a greater understanding and improved treatment of lymphoma and leukemia. However, outside the realm of diagnostics and with the exception of hormone receptors and HER2 expression in breast cancer, there has been little progress in identifying clinically useful prognostic or predictive IHC biomarkers for solid tumors. Several promising biomarkers have been described, yet few have been validated for clinical use. One reason for this is the difficulty of providing sufficient evidence that alterations observed in an experimental setting are of significant value for patients in the clinic. Another plausible reason for this “failure” is the inherent complexity and heterogeneity of cancer as a disease. Rather than systematically searching for “one shot biomarkers,” an effective path forward more likely involves identifying panels of proteins that, when analyzed in combination, can provide useful and clinically relevant information. For this purpose, large-scale and high-throughput techniques such as RNA-expression profiling are of great value and may serve as important tools for biomarker discovery (Fig. 5).
The complexity of tissue composition renders manual inspection in the microscope the gold standard for correct segmentation of cell types and structures, and as such, the assay remains inherently subjective and qualitative. Although efforts are on-going to enable automated image analysis assessment of IHC data from tissue sections [35, 36], even the most sophisticated image analysis software can only offer a semi-quantitative measurement of IHC positivity, even when it reports numeric data along a continuous scale.
The difficulties of using IHC as a means of evaluating biomarkers lie not only in the method itself but also in the fact that no consensus has been reached regarding the preparation of histological samples, which causes preanalytical variability both within and between laboratories. Despite the downsides of IHC, stainings on tissue biopsies are widely used in clinical routine today and are a crucial complementary tool for the pathologist in setting a diagnosis, since no other method provides expression data in a histological context (Fig. 5).
3.2 IHC versus transcriptional profiling
In biomedical research, an “omics-approach,” using, e.g., microarrays, has been used for transcriptional profiling to screen for differentially expressed genes in various forms of cancer. The usefulness of the methodology is exemplified by studies performed in breast cancer, for which the identification and molecular characterization of clinically relevant cancer subtypes have been facilitated by microarray-based analysis to complement a microscopical pathological diagnosis [37, 38] as well as provided screening data for potential multigene prognostic and predictive biomarkers. However, although transcriptional profiling enables the reading of numerous data points simultaneously, the method suffers from the unresolved question of how well specific transcripts reflect the level of corresponding proteins and from the inherent lack of spatial resolution. Expression profiling using IHC is limited in its output (Fig. 5). Nevertheless, provided that antibodies of high quality are available, and that the panel of markers to be analyzed is limited, expression profiling using IHC provides a realistic medium-throughput biomarker discovery path. Antibody panels can also provide an important tool to be used in molecular pathology for more precise classification of cancer.
The above-mentioned methodologies should be viewed as important tools for cross-validation, and strategies used for identifying biomarkers will differ depending on the intended purpose of the marker(s), as well as the assay in which they are intended to be used. Moreover, in situ hybridization allows for detection of mRNA in cells within a tissue context and can be used for validation on a transcriptional level, without losing spatial information. A high-throughput strategy based on comparing mRNA and corresponding proteins on consecutive TMA sections was recently described in a study where IHC and in situ hybridization were used.
3.3 Antibody arrays versus MS in blood-based assays
Samples from blood are the most commonly used type of specimen for diagnostics. Contrary to tissue samples, in which proteins are fixed to a location, a serum or plasma analysis reveals a systemic view that is more difficult to assess. In addition, in serum or plasma samples, there is a major intrinsic challenge posed by the total protein composition, the ranges of abundances of different proteins, as well as variations in protein concentrations stemming from dietary or lifestyle conditions and from sampling at different times of the day. Nevertheless, the minimally invasive procedure to collect serum and plasma, and the less-cumbersome procedures to store these types of specimens in biobanks, make them a very attractive sample source for biomarker discovery projects.
From a technical perspective, a high-throughput analysis of blood can be achieved by methods using mass spectrometric (MS) read-out or by means of affinity-based detection such as antibody-based arrays (Fig. 5). Key aspects of both strategies are (i) how much of the sample is available for the analysis, (ii) the sensitivity of the assay, and (iii) how many different types of proteins should be analyzed in parallel. Although MS still requires larger amounts of sample for depletion, enrichment, or fractionation, the development of multiplexed and affinity-based methods has enabled sensitive analyses from just a few microliters of sample. Beyond this, today's antibody array based methods can analyze many samples simultaneously, and can be further enhanced and automated by liquid handling systems. Although antibody arrays have a high sample throughput, the method requires an operator to select target proteins (i.e. the capturing antibodies) beforehand. MS, on the other hand, lacks the sample throughput, but the information generated reaches deeper into the proteome since specific target protein(s) are not selected prior to analysis. Seen from a biomarker discovery perspective, both MS and affinity-based methods can benefit from complementing and validating each other's data, but neither method is likely to be the sole solution for future biomarker discovery efforts. Rather, the preferred discovery strategy needs to be determined by the limitations in either availability of samples (MS) or of affinity reagents (antibody array).
3.4 Biomarker antibodies as clinical tools
In addition to fluorescence-activated cell sorting (FACS), which is used in clinical routine in the diagnostics of hematological malignancies, there are two main antibody-based diagnostic assays: ELISA and IHC (Fig. 5). Common to both is the relatively low number of measurements (data points) that can be obtained from one sample at a time, but high throughput is not needed since the assays in a clinical setting rely on well-validated antibodies and on experience in interpreting the results. Conversely, high-throughput and information-dense arrays are currently not commonly used in a clinical setting even though there are on-going commercial efforts to promote assays such as MammaPrint®, Oncotype DX®, and ColoPrint® for the purpose of prognosis and treatment stratification in breast and colorectal cancers [43–46].
High-throughput methods are still used primarily for research purposes, and the likely development is that such methods will identify sets of biomarkers useful in combination as panels for specific diseases. Following appropriate validation, such panels can subsequently be incorporated into the well-established, fast, and relatively cheap ELISA or IHC platforms (depending on the preferred sample origin), provided that standardized and renewable affinity reagents (e.g. monoclonal antibodies) are available.
4 Concluding remarks
Over the past two decades, the scientific efforts aimed at mapping the human genome, proteome, and transcriptome have made rapid progress, producing vast amounts of data. One of the most important scientific endeavors of the future will be to interpret and transform these data into tools that can be introduced into clinical medicine. The objectives of the Human Protein Atlas form one aspect of this challenging undertaking by the inclusion of human tissues and clinical material from cancer patients as the analytes and read-out for determining global protein expression patterns. The usefulness and importance of this effort will ultimately depend on the reliability of antibody specificity and the quality of the tissues used for generating a map of protein expression patterns. The hope is to gain an insight, albeit with a reductionist view, into what constitutes a human blueprint of healthy homeostasis, as well as to increase our understanding of what underlies the different forms of cancer, so that new tools for diagnostics and treatment can be developed.
The authors would like to acknowledge the entire staff of the Human Protein Atlas project. This work was supported by grants from the Knut and Alice Wallenberg Foundation.
The authors have declared no conflict of interest.