Approaching clinical proteomics: Current state and future fields of application in cellular proteomics

Authors


Abstract

Recent developments in proteomics technology offer new opportunities for clinical applications in hospital or specialized laboratories, including the identification of novel biomarkers, the monitoring of disease, and the detection of adverse drug effects and environmental hazards. Advanced spectrometry technologies and the development of new protein array formats have brought these analyses to a standard that now has the potential to be used in clinical diagnostics. Besides standardization of methodologies and distribution of proteomic data into public databases, the nature of the human body fluid proteome, with its high dynamic range in protein concentrations, its quantitation problems, and its extreme complexity, presents enormous challenges. Molecular cell biology (cytomics) with its link to proteomics is a new, fast-moving scientific field that addresses functional cell analysis and bioinformatic approaches to search for novel cellular proteomic biomarkers, or their release products into body fluids, that provide better insight into the enormous biocomplexity of disease processes and are suitable for patient stratification, therapeutic monitoring, and prediction of prognosis. Experience from studies of in vitro diagnostics, and especially from clinical chemistry, has shown that the majority of errors occur in the preanalytical phase and in the setup of the diagnostic strategy. This is also true for clinical proteomics, where similar preanalytical variables, such as inter- and intra-assay variability due to biological variation or proteolytic activity in the sample, will most likely also influence the results of proteomics studies. Therefore, before complex proteomic analysis can be introduced at a broader level into the clinic, the preanalytical phase, including patient preparation, sample collection, sample preparation, sample storage, measurement, and data analysis, has to be standardized.
In this report, we discuss the recent advances and applications that fulfill the criteria for clinical proteomics with the focus on cellular proteomics (cytoproteomics) as related to preanalytical and analytical standardization and to quality control measures required for effective implementation of these technologies and analytes into routine laboratory testing to generate novel actionable health information. It will then be crucial to design and carry out clinical studies that can eventually identify novel clinical diagnostic strategies based on these techniques and validate their impact on clinical decision making. © 2009 International Society for Advancement of Cytometry

The proteome of an organism is the protein complement of its genomic functionality. The proteome is highly dynamic and varies according to cell type and functional state of the cell. The proteome may reflect immediate and characteristic changes in response to disease processes and external stimulation. As the actual proteome, i.e., the set of all proteins present in a body fluid, cell, tissue, or organism at a certain point of time, represents only a subset of all possible gene products, the proteome cannot be directly predicted from genome information. Any protein may exist in multiple forms that vary within a particular cell or between different cells via post-translational modifications or degradation processes that affect protein structure, localization, function, and turnover. Like the proteome itself, its proteolytic degradation products, the so-called low-molecular-weight (LMW) range proteome, may also contain disease-specific information. The identified disease-specific peptides appear to be fragments of endogenous high-abundance proteins, such as transthyretin, or fragments of low-abundance cellular and tissue proteins, such as BRCA-2 (1, 2). Also, catabolic pathways lead to degradation of proteins, and the extent of this degradation may depend on preanalytical variables like temperature or time of storage of the clinical sample.

Beyond protein target identification, emerging novel nanotechnology strategies that make use of these LMW biomarkers in vivo or ex vivo may enhance our ability to discover and characterize molecules for early disease detection or stratification and will expand the prognostic capability of current proteomic modalities. However, they may also pose new problems related to accuracy and variation. In addition, pathophysiological processes involving proteolytical activities, e.g., tumor proteases, are detectable in the plasma peptidome (1). The discovery of sensitive and specific panels of biomarkers or single marker proteins holds the key to future diagnostic and therapeutic monitoring of complex diseases such as metabolic and vascular disease, cancer, and neurodegenerative disorders.

Clinical proteomics focuses on the clinical and analytical validation and implementation of novel therapy- or diagnosis-related markers that originate from preclinical studies designed to identify hits and leads in analogy to drug screening studies (fundamental research during the discovery phase is not discussed in this document). It also includes the selection, validation, and Standard Operating Procedure (SOP)-assessment of the most appropriate and robust method that can be integrated into the workflow of available analytical platforms in clinical laboratories. Hit picking of potential targets for clinical proteomics may follow a top-down approach, applying high-throughput technologies to population-based studies or to selected cohorts such as case/control studies or twin registers, to identify in an unbiased manner novel markers that have to be further characterized for sensitivity, specificity, and function. Alternatively, the bottom-up approach follows search strategies such as protein–protein or metabolite interaction within already identified pathways or around targets therein to find additional targets with higher sensitivity, disease specificity, or stability.

Advances in proteomic analysis with regard to high-throughput and high-content analysis were needed for clinical proteomics. They have been achieved both in whole-cell multiparameter flow cytometry and high-content microscopic imaging and in the field of affinity binding technologies and liquid phase technologies like MS/MS applied to cellular extracts or fluid samples. High-resolution liquid phase separation methods, along with progress in chemometry and biometry for large-scale data analysis, now offer the possibility to introduce these research tools into actual laboratory diagnostics to screen for risk factors, identify new disease-specific or stage-specific biomarkers, and find novel cellular or fluid markers for therapeutic drug monitoring or new therapeutic targets. As a result, clinical proteomics has the potential to complement cytomics, genomics, metabolomics, lipidomics, glycomics, and transcriptomics, including splice variant analysis, in gaining a better understanding of disease processes and translating this complex knowledge into diagnostic tools for clinicians.

While metabolomics has already been established in clinical diagnosis for years, e.g., in newborn screening, clinical proteomics is only now on the verge of entering the hospital, and just as in the case of metabolomics, essential criteria for its successful use in a clinical environment have to be fulfilled. First, high-throughput analysis platforms must provide reproducible protein patterns in a clinically acceptable turnaround time and have to reach the instrumental stability of laboratory analysers that can be operated at the technician level with a minimum of academic supervision; platforms must also fit into the workflow of a clinical laboratory. Second, bioinformatics algorithms must include chemometry, data reduction, and conversion into actionable health information, and they have to be robust and integrated into standard laboratory information systems. Third, preanalytical conditions have to be standardized and optimized for the development of clinically applicable tests. Before any work can start to establish novel biomarkers, a patient cohort that can be used to address a specific clinical question and that is well defined by all necessary anamnestic and physiologic parameters, including age, sex, hormonal status, treatment, and hospitalization status, must be identified and made available for study. For biomarker development, SOP-driven biobanks have to be integrated into the diagnostic workflow with validated storage conditions for all proteomic applications. The standardized, SOP-driven preparation of patient samples is one of the most urgent challenges for reproducible and clinically useful results. Regulations must be established addressing medical-legal issues such as patient consent, commercial use of samples, and intellectual property. This must be achieved at an international level to allow the large-scale multicenter studies that are a prerequisite for the task at hand.

Cytomics deals with the study of “cytomes” or cell systems at a single-cell level to understand the functions and molecular architecture of the cytome (3, 4). Using various cytometric procedures, including fluorescence microscopic imaging, multiparameter flow cytometry, and novel planar and bead-based transcriptomic and proteomic arrays, the various components of a cell become accessible to high-throughput and high-content analysis.

In this review, we focus on cellular proteomics (cytoproteomics) and suggest accepted procedures for sample preparation and standardization of protocols. We summarize the preparative methods in the field of cytoproteomics and emphasize key clinical applications in the field. We conclude with bioinformatics approaches for the evaluation of proteomic data. Another review focussing on clinical proteomics in body fluids is in preparation and will also be published as a consensus document from the authors.

PREANALYTICAL PHASE

Sources of technical or sample variation are mainly found within the preanalytical phase; most prominent here is lack of standardized procedures for sample collection, including patient preparation, like fasting, specimen acquisition, handling, and storage, which accounts for more than 90% of the errors within the entire diagnostic process (5). Advances in genomics and proteomics have led to high expectations of clinical biomarker discovery; yet, for the successful generation of validated biomarkers, more attention must be directed towards the preanalytical stage such as sample collection, transport, preparation, and processing (6) but also at standardization and quality management. Automation of preanalytical sample processing using robotics and central sample storage and management helps to eliminate or minimize variations in sample quality (Fig. 1).

Figure 1.

Automation of preanalytical sample processing and sample stabilization from primary tubes to high-throughput platform-specific secondary tube formats containing stabilization reagents.

Furthermore, for cellular analysis, the sample material has to be preserved prior to consecutive analysis. There are a variety of reagents and methods available that facilitate proper handling, isolation without preactivation, and stabilization of the cellular material. Figure 1 summarizes the principles of cell harvesting by magnetic separation and the reagents available for stabilization. The choice of the reagents depends on the downstream application of either transcriptomic, genomic, metabolomic, lipidomic, or proteomic analysis.

The preclinical discovery phase may yield interesting biomarkers that require validation before any translation into clinical proteomics is feasible. The types of samples needed and their processing could be quite different for the two phases. In general, there is a major difference between the requirements of high-throughput proteomic profiling as a clinical proteomics approach and those of protein identification and in-depth characterization of single protein samples (low throughput) in the preclinical phase. Another point is the preparation of the patient, e.g., for parameters that depend on metabolic status, where fasting before sample collection is essential; other influencing factors must also be accounted for, such as diurnal rhythms for parameters like peptide hormones or cytokines.

CELLULAR PROTEOMICS (CYTOPROTEOMICS)

Blood cells offer unique insights into disease processes. Therefore, erythrocytes, granulocytes, monocytes, lymphocytes, and platelets are of special interest for clinical proteomics (7). Blood is a liquid organ and isolated blood cells reflect the environment and genome of the individual (Fig. 2).

Figure 2.

Cytomics of the blood compartment.

Flow cytometry is currently widely used as an analytical tool for clinical cell analysis directly from anticoagulated whole blood and also for cell sorting to generate pure populations of cells from heterogeneous and highly integrated mixtures, as are found in the majority of biological environments. ELISpot, slide-based cytometry, and tissue arrays, together with high-content screening microscopy, are further upcoming techniques in cytoproteomics. The major challenge for preanalytical standardization is related to the use of fresh samples, either for direct multiparameter analysis of cellular proteomics in whole blood or body fluids without preseparation or for cell sorting and enrichment strategies for subsequent proteomic and functional genomic analysis.

SAMPLE PREPARATION AND STANDARDIZATION FOR CLINICAL CYTOPROTEOMICS

For clinical analysis, the samples should be analyzed rapidly, because transportation and storage lead to artifacts such as selective damage or aggregation of specific cell subpopulations or shedding of cell surface markers. Especially for activation markers or functional tests, the samples should be analyzed within 3 h. Immunophenotyping, including that for intracellular antigens, however, can be performed for up to 7 days following recently developed methods for the stabilization of cells without destruction of epitopes (8). Most analyses in clinical flow cytometry are performed on EDTA-blood, especially surface parameters for immunophenotyping. The EDTA-tube can be used in parallel for routine hematology and is available in all clinical settings. Many applications can also be performed on blood anticoagulated with heparin or citrate. However, the current anticoagulants have major drawbacks for functional cell analysis, and the Ca2+-capturing anticoagulants disturb phosphoproteomics (9) because they interfere with cell signaling networks. In addition, lysis procedures release cytosolic products that coat and activate or inhibit other cell types. As an example, hemoglobin/haptoglobin complexes in plasma bind to the cysteine-rich scavenger receptor CD163 of circulating monocytes and alter signal transduction (10). Therefore, novel anticoagulants and elimination of intact red blood cells (RBC) are necessary (11–14). For the investigation of parameters whose functions depend on calcium, such as certain receptor–ligand interactions, oxidative burst, or phagocytosis, heparinized blood has to be used. For cell preparation, gradient centrifugation has been almost completely replaced by erythrocyte lysis for routine applications before cytometric analysis (15). Highly automated and standardized preparation robotics are available for lysis and in part for complete sample handling (www.beckmancoulter.com, www.bdbioscience.com).
For the analysis, highly standardized multiparameter flow cytometers with special software application for routine diagnosis (www.beckmancoulter.com, www.bdbioscience.com, www.miltenyibiotec.com, www.partec.com) are on the market. Up to eight parameters (FSC, SSC, and six different antibodies) are already implemented in routine diagnosis and CE-certified kits are available. Higher complexities using more than 15 parameters are available for more scientific approaches. Obviously, there is more information with high-content analysis of single cells, but there are also more problems due to antibody interactions like unspecific binding or cross-linking of antibodies. With increasing numbers of different fluorescent dyes, the compensation of the fluorescence signals in different detector channels is an important aspect. Most up-to-date analysers use digital compensation methods, which are highly standardized due to software driven compensation algorithms (16).
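
The idea behind fluorescence compensation can be illustrated for the simplest case of two dyes. The sketch below is not taken from any vendor software; it assumes a linear spillover model in which a fixed fraction of each dye's signal appears in the other detector, and the function name, parameter names, and numerical values are hypothetical.

```python
def compensate_two_colors(observed, spill_ab, spill_ba):
    """Recover dye-specific signals from two detector readings.

    Assumed linear spillover model (an illustration, not a vendor algorithm):
        a_obs = a_true + spill_ba * b_true
        b_obs = b_true + spill_ab * a_true
    where spill_ab is the fraction of dye A signal seen in detector B and
    spill_ba is the fraction of dye B signal seen in detector A.
    """
    a_obs, b_obs = observed
    det = 1.0 - spill_ab * spill_ba
    if det == 0:
        raise ValueError("spillover model is singular; signals cannot be separated")
    # Solve the 2x2 linear system for the true signals.
    a_true = (a_obs - spill_ba * b_obs) / det
    b_true = (b_obs - spill_ab * a_obs) / det
    return a_true, b_true


# Hypothetical event: true signals 1000 (dye A) and 200 (dye B),
# with 15% spill of A into detector B and 2% spill of B into detector A.
a_obs = 1000 + 0.02 * 200
b_obs = 200 + 0.15 * 1000
a_true, b_true = compensate_two_colors((a_obs, b_obs), spill_ab=0.15, spill_ba=0.02)
print(round(a_true, 3), round(b_true, 3))
```

With more than two dyes, the same model becomes a spillover matrix that is inverted once and applied to every event, which is why the procedure lends itself to software-driven standardization.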

The strategy to validate antibody-producing cell lines in so-called “Cluster of Differentiation” workshops (CD-nomenclature) has generated more than 300 defined antibody specificities, which are of great help for standardized clinical test development. A main topic of the working group “flow cytometry and quantitative microscopy” of the DGKL is quality control and standardization. This working group is involved in the European Society for Clinical Cell Analysis (ESCCA), which has already started to review and validate recommendations for flow cytometric multiparameter analysis. However, especially with regard to phenotyping and isolation of cells for genomic or proteomic analysis, standardization of techniques is still missing (17). Absolute quantification of the expression level of surface molecules is partly realized with stabilized cell lines (e.g., Daudi) or beads carrying defined amounts of epitopes that are recognized by the specific antibodies used for cell staining (18). None of these approaches has yet reached good interlaboratory comparability, and all need further standardization. The availability of antibodies against phosphorylated structures in signal transduction molecules has recently expanded the applicability of flow cytometry to the multiplex characterization of signaling patterns, e.g., during the activation of cells of the immune system or during malignant transformation (9).

Combining analytical flow cytometry with magnetic isolation of specific cell populations for consecutive functional proteomic, transcriptomic, and metabolomic analysis will ultimately need robotics that harvest cells from primary tubes and transfer the separated cells into different preservation solutions for the subsequent technology platforms (Fig. 1).

PREPARATIVE METHODS

Fluorescence-Activated Cell Sorting

Besides cell analysis, flow cytometry is also used for cell isolation using fluorescence-activated cell sorting (FACS) (19, 20). Cell sorting allows the investigator to quantify several fluorescence and light-scattering parameters of individual particles and purify those events with the desired characteristics for further functional study. This technology can separate a heterogeneous cell suspension into purified fractions containing a single cell type with high speed and accuracy. Different methods of capturing a particle of interest are implemented in different cytometers. Some systems use a so-called catcher tube, which is located in the upper portion of the flow cell. It moves in and out of the sample stream to collect a population of desired cells at a rate of up to 300 cells per second. The particles and cells can be physically separated and deposited onto a defined location for further analysis or culture, even under sterile conditions. The advantage of this sorting method is the gentle handling of cells, but the sorting speed is rather low. High-speed sorters use techniques like “stream in the air,” which allow much higher throughput of up to 100,000 cells per second, but cells are more stressed and may be less suitable for further analysis (21, 22). Applications for high-speed cell sorting are commonly grouped into two categories: bulk sorting and rare event sorting. In bulk sorting, the sort operator is given a large number of cells, of which a certain percentage expresses a given phenotype, and this population is desired for further study. The marker can be a surface antigen bound to a fluorescently labeled antibody, a transient transfection expressing a given protein of interest, activated leukocytes, or any detectable measure of cell state or phenotype. Whatever the measured parameter, the operator must screen through a large number of cells to collect a sufficient amount of material for the next step of analysis.
For instance, an investigator may be interested in the top 5% of cells expressing a given intracellular cytokine to perform proteomic analysis. Typically, a fairly large number of cells are needed for these studies. With readily available mass-spectrometry methods, in the order of 10 million cells would be needed to detect a protein expressed at roughly 1000 copies/cell, corresponding to a total starting population of ∼200 million cells. At a sort rate of 40,000 cells per second, this translates to 5000 s or around an hour and a half of sorting. This does not take into account material losses due to sample handling (e.g., centrifugation, etc.) or time losses due to technical issues.
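
The arithmetic in this example can be reproduced with a few lines of code. The sketch below uses only the figures given in the text (10 million target cells, a target fraction of 5%, a sort rate of 40,000 cells per second); the function name and its parameters are hypothetical, and, as in the text, losses from sample handling or technical downtime are ignored.

```python
def bulk_sort_estimate(cells_needed, target_fraction, sort_rate_per_s):
    """Estimate starting population and sorting time for a bulk sort.

    cells_needed    -- number of target cells required downstream
    target_fraction -- fraction of the input that carries the phenotype (0..1]
    sort_rate_per_s -- cells interrogated per second by the sorter
    Returns (starting_population, sorting_time_in_seconds).
    """
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    if sort_rate_per_s <= 0:
        raise ValueError("sort_rate_per_s must be positive")
    starting_population = cells_needed / target_fraction
    sorting_time_s = starting_population / sort_rate_per_s
    return starting_population, sorting_time_s


# Figures from the text: 10 million cells, top 5%, 40,000 cells/s.
start, seconds = bulk_sort_estimate(10_000_000, 0.05, 40_000)
print(f"starting population: {start:.3g} cells")
print(f"sorting time: {seconds:.0f} s (about {seconds / 60:.0f} min)")
```

Under these assumptions the calculation gives a starting population of about 200 million cells and about 5000 s of sorting, that is, a little under an hour and a half.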

Magnetic Bead-Based Separation of Cells

Isolation techniques have to be considered before the analysis of the cell proteome. Magnetic bead-based techniques for the isolation of cell subsets according to the expression profile of membrane antigens provide a rapid and convenient approach to cell isolation from blood, bone marrow, and other sources of cells in suspension to select or deplete specific cell populations. These procedures can be performed without major technical instrumentation for low-throughput applications, e.g., with Dynal magnetic bead systems (http://www.invitrogen.com/dynal). One disadvantage of this system is the use of “large” (>5 μm) beads, which grossly activate some cells, for example monocytes, and interfere with subsequent analysis of the cells by flow cytometry. A second major disadvantage is that there is no automation for high-throughput sample handling. In clinical proteomics, cell-based analysis requires high-throughput and rapid techniques for cell harvesting and collection, which are realized by two automated systems for cell isolation from whole blood without prior density gradient centrifugation. One system (RoboSep®; https://www.stemcell.com/product_catalogue/robosep.asp) uses a lysing protocol prior to magnetic leukocyte preparation. This system has the major disadvantage that some activation pathways are influenced by the lysed erythrocyte membranes. Furthermore, yield and purity are inferior to those of the manual Dynal or automated Miltenyi systems. The Miltenyi autoMACS™ Pro and multiMACS™ Separator® Systems need no lysing or other manipulation prior to magnetic bead separation (www.miltenyibiotec.com). It is mandatory that the separation procedure does not induce major changes in either the expression or the cellular localization of proteins. On the other hand, it is also necessary that the separation techniques do not result in unwanted cell loss, as may occur under pathological conditions when cells float outside the density gradient fraction.
In this regard, two general approaches can be distinguished: (1) positive selection uses one or more cell-specific markers for identification and separation, and therefore the purity is usually very high, but it is prone to alter cell function, whereas (2) negative selection depletes contaminating cells and leaves the target cells untouched, but may lack accuracy and purity and may result in poor recovery. Moreover, negative isolation has the drawback of frequent contamination with immature cells, e.g., reticulocytes or reticulated platelets. Based mostly on specific membrane-associated antigens, various cell subsets can be isolated. A major drawback of most available antibodies against cell surface proteins (Table 1), which are classified in the CD-nomenclature series, is their effect on cell activation. Therefore, there is a need to identify other, non-activating cell-specific targets with stable surface expression for affinity isolation. Table 1 summarizes common antibodies used for magnetic isolation of different cell types and the putative cell, RNA, protein, and lipid yield if the cells are separated from 10 ml EDTA-blood. Sufficient cell counts for genomic, proteomic, or lipidomic analysis can be obtained from 10 to 20 ml of peripheral blood, depending on the cell count of the individual subset. Ready-to-use kits for positive or negative isolation of various cell populations are now commercially available. Acceptable results in terms of recovery, yield, and purity (≥90%) of the final cell population are not yet reached with all systems (23).
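
How purity and recovery of an isolated fraction might be checked can be sketched as follows. The definitions used here (purity as the share of target cells in the isolated fraction, recovery as the share of input target cells that end up in that fraction) are common working definitions and are assumptions of this sketch, as is applying the ≥90% criterion to purity; the function name and the example counts are hypothetical.

```python
def separation_quality(target_in_input, target_in_fraction, total_in_fraction,
                       min_purity=0.90):
    """Compute purity and recovery for a cell separation run.

    target_in_input    -- target cells present in the starting sample
    target_in_fraction -- target cells found in the isolated fraction
    total_in_fraction  -- all cells found in the isolated fraction
    min_purity         -- acceptance threshold for purity (assumed 0.90)
    """
    if target_in_input <= 0 or total_in_fraction <= 0:
        raise ValueError("cell counts must be positive")
    if target_in_fraction > total_in_fraction:
        raise ValueError("target cells cannot exceed total cells in the fraction")
    purity = target_in_fraction / total_in_fraction
    recovery = target_in_fraction / target_in_input
    return {
        "purity": purity,
        "recovery": recovery,
        "purity_acceptable": purity >= min_purity,
    }


# Hypothetical monocyte isolation from 10 ml of blood: 2.5e6 monocytes in the
# input, 2.1e6 cells isolated, of which 2.0e6 are monocytes.
result = separation_quality(2.5e6, 2.0e6, 2.1e6)
print(result)
```

Reporting both figures together makes the trade-off between positive and negative selection visible: a run can pass the purity criterion and still lose a substantial part of the target population.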

Table 1. Magnetic antibody separation of blood cells from 10 ml of whole blood

| CELL FAMILY | CELL SUBSET | CAPTURE ANTIBODY/METHODS | CELL NUMBER | RNA (mg) | PROTEIN (mg) | LIPID (nmol) |
|---|---|---|---|---|---|---|
| Red blood cells | All | CD235a (Glycophorin A) | 5 × 10^10 | 0.3 | 500 | 250,000 |
| | Immature (Reticulocytes) | FACS | 5 × 10^8 | 0.3 | 5 | 2,500 |
| Platelets | All | CD42b (vWF-Receptor) | 2.5 × 10^9 | 0.1 | 25 | 12,500 |
| | Immature (Retic. Platelets) | FACS | 5 × 10^7 | 0.1 | 2.5 | 250 |
| Myeloid cells | Neutrophils | CD 15 (Fucosyltransferase 4) | 3 × 10^7 | 3.0 | 3.0 | 15,000 |
| | Basophils/Mast cells | Negative selection | 2.5 × 10^5 | 0.25 | 0.25 | 125 |
| | Eosinophils | Negative selection | 1.5 × 10^6 | 1.5 | 1.5 | 750 |
| | Monocytes/Phagocytes | CD 14 or negative selection | 2.5 × 10^6 | 2.5 | 2.5 | 1,250 |
| | Monocytes/APC | FACS | 0.5 × 10^5 | 0.5 | 0.5 | 250 |
| Lymphoid cells | T-cells | CD 3+ | 1 × 10^7 | 1.0 | 1.0 | 500 |
| | T reg. | CD 4+/CD 25+ | 5 × 10^5 | 5 | 0.5 | 250 |
| | NK | CD 56 (NCAM-1) | 2 × 10^6 | 2.0 | 2.0 | 1,000 |
| | B-cells | CD 19+ | 2 × 10^6 | 2.0 | 2.0 | 1,000 |
| Stem cells/Progenitors | Hematopoietic progenitors | CD133 (Prominin-1) | 5 × 10^4 | 0.5 | 0.05 | 25 |
| | Endothelial progenitor cells | CD34+/KDR+ (VEGFR) | 80 | 0.0008 | 0.00008 | 0.04 |
| Non-hematopoietic cells | Tumor cells | Variable | Variable | Variable | Variable | Variable |
| | Fetal cells | FACS | 50 | 0.0005 | 0.00005 | 0.02 |

The different methods for cell fractionation by size and density and highly selective affinity-based technologies including affinity chromatography, fluorescence-activated cell sorting (FACS), and magnetic cell sorting were recently reviewed (24).

Microdissection of Cells

Microdissection offers the advantage of directly analyzing the pathologically relevant cell type, which is often grossly underrepresented in the tissue samples investigated. For example, in a heterogeneous tumor tissue, tumor cells can be isolated specifically by microdissection, and the dissected cells can be analyzed quantitatively using, for instance, DIGE saturation labeling (25) or LC-MS (26). Applying DIGE saturation labeling in combination with two-dimensional gel electrophoresis and mass spectrometry, it has been shown for the first time that quantitative proteome analysis of 1,000 microdissected cells is feasible (25). With this technique, a number of candidate biomarkers were identified and confirmed in pancreatic cancer cells compared to the cells in precursor lesions.

Laser-based microdissection offers one opportunity to separate functional tissue (27), but for proteomic analysis fine-needle microdissection is preferable, because fewer artifacts, especially thermal denaturation dependent on ablation energy, are introduced. The quality and quantity of tissue molecules critically depends on upstream tissue preparation (28). Snap-frozen specimens, for example, give a higher and better yield than fixed and embedded tissue specimens. Nevertheless, each preparation protocol has its own advantages and disadvantages, so the choice also depends on the consecutive type of analysis. Microdissection can be used to design proteome-based studies in combination with serology or in parallel to genomic and/or expression analysis for the identification of biomarkers and novel targets. Different systems for microscopic microdissection are available (www.leica-microsystems.com, www.arryx.com, www.qiagen.com/applications/microdissection, www.palm-microlaser.com). The standardization and automatization of these techniques is still in progress. Most preparative steps are still operator-dependent (29).

PREPARATION OF CELLULAR SAMPLES AND APPLICATIONS

Mononuclear Cells (Lymphocytes, Monocytes, and Dendritic Cells)

Monocytes are a target of proteomics, especially concerning clinical diagnosis and monitoring of atherosclerosis. Seong et al. (30) analyzed the effect of oxidative stress generated at sites of inflammation and injury on the proteomic profiles of monocytes. As a result, 28 identified proteins, mainly involved in energy metabolism, translation, and mediation of protein folding, were overexpressed. Dupont et al. (31) elaborated 2-DE reference maps of the human macrophage proteome and secretome to elucidate the macrophage dysfunctions involved in inflammatory, immunological, and infectious diseases. They showed that macrophages are involved in a wide array of biological functions, including cytoskeletal machinery, carbohydrate metabolism, apoptosis, and protein metabolism. Combined oligonucleotide microarray and proteomic approaches have been used to study genes associated with dendritic cell differentiation. Dendritic cells are antigen-presenting cells essential for the initiation of primary immune responses; they mostly derive from human monocytes. Protein analysis of these cells showed that about 4% of the protein spots separated by 2-DE exhibited quantitative changes during differentiation and maturation. The differentially expressed proteins were identified by MS and represent proteins with calcium binding, fatty acid binding, or chaperone activities, as well as cell motility functions (32).

Several studies report on the proteome of lymphocytes. From the current literature (PubMed), we created a database (Table 2, Supporting Information, www.LipidomicNet.org) summarizing the literature in which human blood cells have been studied with high-throughput transcriptomic (arrays, etc.) or proteomic (MALDI-TOF/SELDI-TOF/2D gel electrophoresis, etc.) technologies. As shown in Table 2, transcriptomic and proteomic analyses have been performed for a variety of cells, thus linking genotype to phenotype, but there are still cells that have not yet been analyzed in depth at the proteomic level. While Table 2 gives the total number of transcriptomic and proteomic publications related to individual cells, the detailed table in the Supporting Information displays the individual references and the type of analysis, which involved transcriptomic investigations, proteomic investigations, or both.

Table 2. Current status of literature on high-throughput transcriptomic and proteomic analysis of cells from the blood compartment

| CELL TYPE | SUBTYPE | TRANSCRIPTOME | PROTEOME | BOTH |
|---|---|---|---|---|
| Hematopoietic stem cells | Embryonic | 4 | 6 | 2 |
| | CFU-E | 1 | | |
| | T-cell lineage | 3 | | |
| | B-cell lineage | | 1 | |
| | Hematopoietic | 6 | 4 | |
| | Mesenchymal | 3 | 1 | |
| Red blood cells | Reticulocytes | 3 | 1 | |
| | Red blood cells | | 6 | |
| Platelets | Megakaryocytes | 7 | | |
| | Platelets | 3 | 11 | |
| | Platelet microparticles | | 1 | |
| Monocytes | Monocytes | 40 | 7 | 2 |
| | Macrophages | 19 | 6 | 1 |
| | Dendritic cells | 20 | 4 | 2 |
| Granulocytes | Neutrophils | 6 | 6 | 2 |
| | Eosinophils | 1 | 2 | |
| | Basophils | 1 | | |
| Lymphocytes | T-cells | 4 | 3 | 1 |
| | B-cells | 4 | 4 | |

The total number of publications per cell type in the field of transcriptomics and proteomics is shown. A more detailed list is given in the Supporting Information and in LipidomicNet (www.lipidomicnet.org).

Proteome databases of human helper T-cells were established using classical proteomics (33). The proteins differentially expressed in Th1 and Th2 cells are described in the article by Rautajoki et al. (34). Detailed proteomic studies have also been published on lymphoblastoid cells, and strategies for studying signaling pathways in lymphocytes, combining proteomics and genomics, have been proposed (35). Signaling via immunoreceptors is orchestrated at specific plasma membrane microdomains, referred to as lipid rafts. Lipid rafts are dynamic assemblies floating freely in the surrounding membranes of living cells. The proteins participating in lipid rafts in T lymphocytes have been studied with proteomics, and the subject was recently reviewed by Wollscheid et al. (36). MS was used by Li et al. (37) to specifically detect proteins depleted from rafts by cholesterol-disrupting drugs. The authors detected a large number of signaling molecules in lipid rafts and provided evidence for a connection between cytoskeletal proteins and lipid rafts.

Granulocytes

Neutrophil activation manifests as the production and release of inflammatory mediators, which induce inflammatory cascades leading to cell damage and dysfunction. The mediators involved in inflammatory processes are mainly proteins, making proteomics the ideal approach to investigate them. Functional proteomics also provides a good tool to elucidate the complex signal transduction network of inflammation, to identify disease-associated targets, and to improve current therapies. Genomic and proteomic studies have been performed on activated neutrophils. The central pathway in the regulation of neutrophil function is p38 mitogen-activated protein kinase signal transduction, as shown by Singh et al. (38). In a recent study, Burlak et al. used subcellular proteomics to identify proteins associated with human neutrophil phagosomes. Proteins were identified by MALDI-TOF MS and/or LC-MS/MS analysis. They unexpectedly found enzymes typically associated with mitochondria in the phagosome fractions and concluded that neutrophil phagosomes have a previously unrecognized complexity (39).

Erythrocytes and Reticulocytes

Because of their high abundance, ready availability, and easy purification, red blood cells (RBCs) have been the subject of many proteomic studies. In the last 6 years, more than 750 red blood cell proteins have been identified (40). RBCs can be pelleted by centrifugation, the supernatant discarded, and the RBCs resuspended. The purity of erythrocytes can be further enhanced by removing the top RBC layers after centrifugation. Further density centrifugation, e.g., on CL5020 (Cedarlane Laboratories, Hornby, ON, Canada), elimination of granulocytes by passing the RBC fraction through nylon nets, and wash steps to eliminate plasma proteins should follow. The RBCs must then be used immediately to prepare membrane and cytoplasmic fractions. The quality of the preparations has to be determined by assessing the purity of the RBC samples: the packed RBC samples must be counted for white blood cells (WBCs), granulocytes, monocytes, platelets, reticulocytes, and RBCs, e.g., using a routine hematology analyser. No reticulocytes or other cell types should be found. Subproteomes of RBCs can be analyzed by membrane preparations and membrane extractions, soluble protein preparation, or cytoskeleton extractions (41). The study of Kakhniashvili et al. describes the proteome of red blood cells (42), in which erythrocyte membranes as well as cytoplasm were analyzed by LC-MS/MS. The authors identified a total of 181 unique protein sequences in the membrane fractions and in the cytoplasmic fractions.

Because of their low frequency in peripheral blood, reticulocytes must be enriched for proteomic analysis. In contrast, for proteomic analysis of mature red blood cells (RBCs), reticulocytes should be allowed to mature, e.g., by storing peripheral whole blood for 72 to 96 h at 4°C without shaking, although it has been suggested that the maturation of reticulocytes in vitro is limited at 4°C (43). Recent data on the analysis of the final material clearly show that the procedures used to eliminate reticulocytes are effective (41), but protein degradation may be a problem during this time. Leukocytes must be eliminated, for example with filters (Plasmodipur, Euro-diagnostica, Arnhem, The Netherlands).

Platelets and Reticulated Platelets

Platelets are easily and rapidly activated after blood drawing. Therefore, collected blood should be processed immediately. For platelets in particular, recent studies have shifted away from global profiling to the analysis of subfractions of the proteome and the identification of changes induced upon cell activation. These studies identified many more platelet proteins than can be achieved by global profiling, giving a more complete view of the platelet proteome, and the biological information obtained may be of greater relevance. The two important platelet subproteomes that have been studied extensively are the phosphoproteome and the secretome (44–46). Many signaling pathways in platelets are regulated by differential phosphorylation upon platelet activation. The application of new proteomic approaches to the protein phosphorylation events that occur during platelet activation has led to the discovery of new phosphorylated target proteins and new possible phosphorylation sites in already known phosphoproteins (47–49).

Like many other cells, platelets secrete proteins from preformed storage granules in response to stimuli. The platelet secretome has been analyzed in the supernatants isolated by differential centrifugation of low-dose thrombin-activated platelets. The supernatant contains secreted proteins but no membrane proteins. Using a complementary multidimensional chromatography approach, the secreted protein fraction may be digested with trypsin and the resulting peptides separated using strong cation exchange and reverse-phase chromatography before introduction into an ion-trap MS. Analysis of the secretome from thrombin-activated platelets has identified over 300 proteins that are secreted upon activation. For a number of these proteins, including secretogranin III, cyclophilin, and calumenin, it was not previously known whether they are present in, or secreted by, platelets. These three secreted proteins have been identified in atherosclerotic lesions, suggesting a potential role in atherothrombosis (50).
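
The tryptic digestion step mentioned above can be mimicked in silico to see which peptides would reach the chromatography and MS stages. The following is a minimal sketch, not the procedure of the cited study: it assumes the commonly used simplified rule that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P), and both the function name and the example sequence are hypothetical.

```python
def tryptic_digest(sequence: str) -> list:
    """Split a protein sequence (one-letter code) into tryptic peptides.

    Simplified rule (assumption): cleave after K or R, but not before P.
    Missed cleavages and modifications are ignored.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Remaining C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical toy sequence, chosen only to show the proline exception.
print(tryptic_digest("MKRPAKLLR"))  # ['MK', 'RPAK', 'LLR']
```

The R in position 3 is followed by P and is therefore not cleaved, which is why the second peptide begins with R.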

Circulating Microparticles

Cell-derived microparticles have been shown to be relevant in a number of diseases like thrombosis, cardiovascular disease, antiphospholipid syndrome, and systemic inflammation (SIRS, sepsis) (51–53). Microparticles have been shown to be a valuable tool in the field of red blood cell shape anomalies and dysfunctions. Furthermore, it is likely that microparticle release is part of the aging process of erythrocytes or platelets. It may coincide with apoptotic blebbing, shape change, or loss of surface protein modifications. In this sense, the microparticle release is thought to be an innovative marker for chronic vascular diseases like atherosclerosis or other aging disorders.

Microparticles are defined as shed membranous fragments or vesicles with a diameter of less than 1 μm that bear surface markers of the parent cell (51). They are surrounded by a phospholipid bilayer that is mainly composed of phosphatidylcholine, sphingomyelin, and phosphatidylethanolamine. Microparticles differ from their parental cells with regard to lipid composition and distribution between the two membrane leaflets. The asymmetrical phospholipid distribution, with anionic phospholipids confined to the inner leaflet, usually changes during microparticle formation. Activation, as well as apoptosis, leads to calcium-dependent swelling, budding, and microparticle release (54). Finally, both cell activation and apoptosis result in disruption of the membrane skeleton structure, a step that is necessary for surface blebbing and particle release. The principal sites of microparticle release seem to be cell protrusions that, depending on the cell type, resemble microvilli, pseudopodia, filopodia, or proteopodia.

Microparticles are derived from a number of different cell types (55). In addition to the well-established role of platelet-derived microparticles in haemostasis, an important function has been proposed for microparticles from a variety of additional blood cells (e.g., red blood cells, monocytes, and lymphocytes) and from other cell types such as endothelial and stromal progenitor cells. Red blood cells have long been known to shed membrane microparticles under stimulation. As for other microparticles, this process was shown to depend on a rise in intracellular calcium (54). Additionally, microparticle release was found to coincide with the formation of echinocytes, which is characterized by loss of the natural discoid shape in favor of a more spherical form with regularly distributed cell membrane protrusions. The shape change as well as the microparticle release may resemble the in vivo events during cell aging. Circulating microparticles of leukocyte origin modulate cellular interactions through the upregulation of cytokines and cytoadhesins in endothelial cells and monocytes.

Antigenic markers measured by flow cytometry are mainly used to classify and further subdivide membrane microparticles (55, 56). Alternative approaches include ELISA and solid phase capture assays. The advantage of flow cytometry lies in its capacity to distinguish different microparticle populations and subpopulations according to the expression of surface and intracellular antigens in a multiparameter approach. The sensitivity and specificity of flow cytometry could, therefore, be used to improve the analysis of blood and other cell types. In this sense it should be combined with the well-established technique of conventional analysers as outlined above.

Organelles (Phagosome, Proteasome, Nucleosome)

The introduction of proteomic analysis, such as mass-spectrometry-based identification of proteins, has created new opportunities for the study of organelle composition, processing of transport intermediates, and large subcellular structures (Anderson and Mann, 2006). Traditional cell-biology techniques such as sucrose density gradient centrifugation are used to enrich these structures for proteomic analysis, and such analysis provides insights into their biology and functions. Besides good upfront purification of the sample, validation of sample purity by electron microscopy, immunoblotting, or other orthogonal methods is obligatory to exclude significant cross-contamination during organelle separation. Standardization of purification procedures is therefore urgently required for successful clinical application (57–60). A major challenge in controlling the purity of the organelle fraction is the lack of specific markers for each organelle (61). The study of cell compartments provides a unique way to access and identify low-abundance and organelle-specific proteins in a biological sample. Immunoisolation on magnetic beads allows the isolation of highly enriched fractions of certain organelles. A single analysis by MS enables the identification of thousands of peptides, leading to the formal identification of several hundred proteins (62).

The analysis of the proteome of extracellular integral membrane proteins in living cells is complicated by their hydrophobic nature, their low abundance, and the difficulty of obtaining sufficient purity. Therefore, a high-throughput platform has been developed that is based on proteolytic digestion of whole intact cells (63). The resulting peptide fragments are subjected to liquid chromatography and tandem MS to generate the proteome of integral plasma membrane proteins. This method has been called PROCEED (PROteome of Cell Exposed Extracellular Domains).

Within the surface plasma membrane of cells, morphologically distinct regions or domains, such as microvilli, cell–cell junctions, clathrin-coated pits, and lipid raft domains, can be distinguished. Each of these domains is specialized for a particular function, such as nutrient absorption, cell–cell communication, and endocytosis. Lipid raft domains include caveolae, characterized by a distinctive membrane coat composed of caveolin-1, and rafts (64–69). Both have a high content of cholesterol and sphingolipids, have a light buoyant density, and function in endocytosis and cell signaling.

Classical proteomics does not allow the detection of human endothelial caveolae and raft proteins, as they represent less than 0.5% of the total cellular protein pool and less than 2% of total plasma membrane proteins. Therefore, fractionation techniques such as cationic silica enrichment of caveolae or detergent solubilization with Triton X-100 at 4°C are needed to enable subproteome analysis of these membrane microdomains (70). The analysis of these distinct regions can be advanced by high-throughput proteomic platforms together with modern techniques such as stable isotope labeling with amino acids in cell culture (SILAC) (71, 72).

Proteomic analysis of purified caveolae has identified a wide assortment of proteins that are localized to these structures. The first detailed proteomic analysis of caveolae was carried out in 1994. On the basis of their buoyancy and resistance to detergent solubilization, Lisanti et al. used sucrose density ultracentrifugation to purify caveolae-rich membrane domains from murine lung tissue (73). This procedure allowed for the exclusion of ≥98% of an integral plasma membrane protein marker, while retaining ∼85% of total caveolin and ∼55% of GPI-linked marker proteins. This initial proteomic analysis of caveolae allowed for the first time large-scale characterization of caveolae-enriched protein constituents and provided the basis for many follow-up investigations into the functional significance of these proteins (74). For clinical proteomics, it is interesting that a long list of diseases could be associated with lipid rafts and raft associated proteins (75).

Phagosomes are key organelles for the innate ability of macrophages to participate in tissue remodeling, clear apoptotic cells, and restrict the spread of intracellular pathogens. Using a proteomic approach, >140 proteins associated with latex bead-containing phagosomes were identified. The elaboration of a two-dimensional gel database of >160 identified spots made it possible to analyze how phagosome composition is modulated during phagolysosome biogenesis. The systematic characterization of phagosome proteins provided new insights into phagosome functions and into the proteins or groups of proteins involved in and regulating these functions. Different types of phagocytosis can be distinguished. For example, type I phagocytosis is primarily mediated by Fcγ-receptors, while type II phagocytosis is mediated by complement receptors, especially CR3. Another type of phagocytosis is associated with deep tubular invaginations. So far, more than 500 proteins have been identified in different phagosome preparations using MALDI-TOF MS and nano-electrospray MS/MS (57, 76).

The secretion of proteins by exocytosis is an important subset of protein-trafficking events. Secretion can be either constitutive (occurring continuously) or regulated (occurring on demand) as a result of an extracellular signal. Both genetic and biochemical approaches have been combined to produce our current understanding of eukaryotic protein secretion, although many questions clearly remain unanswered (77). Concerning clinical proteomics, it is of interest that constituents of the exocytosis pathways are associated with multiple diseases. For example, members of the AP-3 pathway are involved in disorders of lysosome-related organelles such as the Hermansky-Pudlak syndrome complex, Chediak-Higashi syndrome, and the ceroid lipofuscinoses. This provides new opportunities to understand AP-3 pathway-related disorders and their relation to membrane phospholipid processing. Mutations in the ABCA1 gene, one member of the AP-3 pathway, are involved in dysregulated vesicular trafficking from the trans-Golgi compartment to the plasma membrane (78).

The nucleolus/nucleosome is a key organelle that coordinates the synthesis and assembly of ribosomal subunits and forms in the nucleus around the repeated ribosomal gene clusters. A quantitative analysis of the proteome of human nucleoli was successfully performed using mass-spectrometry-based organellar proteomics and stable isotope labeling. In vivo fluorescent imaging techniques are directly compared to endogenous protein changes (79, 80). The flux of 489 endogenous nucleolar proteins in response to three different metabolic inhibitors that each affect nucleolar morphology could be registered (81, 82). Mitochondria are not only the major suppliers of ATP, but also significantly involved in different cellular processes, e.g., metabolism or apoptosis. The proteomic investigation of mitochondria, e.g., after stress exposure, offers the opportunity to specify their role in physiologic and pathologic processes (83). Both the nuclear- and mitochondrial-encoded proteins and their genes are summarized in MITOP (http://www.mitop.de:8080/mitop2/), a comprehensive database for genetic and functional information. The “Human disease catalogue” contains tables with a total of 110 human diseases related to mitochondrial protein abnormalities, sorted by clinical criteria and age of onset. MITOP should contribute to the systematic genetic characterization of the mitochondrial proteome in relation to human disease (81, 84, 85).

CLINICAL APPLICATIONS IN CYTOPROTEOMICS

Multiparameter Flow Cytometry

Multiparameter flow cytometry is widely used in the routine laboratory to generate clinical diagnostic information from complex heterogeneous mixtures such as human blood for multiple indications (23). The main indications for clinical flow cytometry can be classified according to the material used and divided into two major principles: immunophenotyping and functional analysis of cells.

Immunophenotyping

Immunophenotyping has become a routine practice in the diagnosis and classification of leukaemias. For non-Hodgkin lymphoma, flow cytometry is one of several complementary techniques (86). Further important examples include the analysis of lymphocyte subpopulations, e.g., in the diagnosis of primary or acquired immunodeficiency, the quantitation of haematopoietic stem cells and of residual leukocytes in erythrocyte preparations (15), and the analysis of platelets (87). Diagnostic haematopathology depends on the application of flow cytometric immunophenotyping (FCI) combined with immunohistochemical immunophenotyping. Selected cases may require additional cytogenetic and molecular studies for diagnosis. FCI offers the sensitive detection of antigens for which antibodies may not be available for paraffin-embedded immunohistochemical immunophenotyping. However, paraffin immunohistochemical immunophenotyping offers preservation of architecture and evaluation of the expression of some proteins that may not be accessible by flow cytometric immunophenotyping. These techniques should be used as complementary tools in diagnostic haematopathology. Types of specimens suitable for FCI include peripheral blood, bone marrow (BM) aspirates and core biopsies, fine-needle aspirates (FNAs), fresh tissue biopsies, and all types of cell-containing body fluids. FCI has many advantages compared to immunohistochemistry. For example, dead cells may be gated out of the analysis, and weakly expressed surface antigens are detectable. Multicolour (2-, 3-, or 4-colour) analysis can be performed, allowing an accurate definition of the surface antigen profile of specific cells. Two simultaneous hematologic malignancies may be detected within the same tissue site. Tissue biopsy may be replaced by the relatively non-invasive diagnostic evaluation of body fluids.

Disadvantages of FCI include the following. Sclerotic BM or a markedly hypercellular or “packed” BM may yield too few cells for adequate analysis. Sclerotic tissue may be difficult to suspend for individual cellular analysis. Architectural relationships are lost. T-cell lymphomas that do not have an aberrant immunophenotype may not be detected; on the other hand, an aberrant T-cell immunophenotype (i.e., absence or down-regulation of pan-T-cell antigens, particularly CD7) does not necessarily indicate malignancy and may be observed in infectious mononucleosis or inflammatory disorders. In some situations, this requires correlation with light microscopy as well as clinical data and additional ancillary studies (e.g., molecular/cytogenetic analysis) (88). A “Kompetenznetz Leukämien” has been established in Germany to support standardization of clinical diagnosis (http://www.kompetenznetz-leukaemie.de). The diagnosis of many primary immunodeficiency diseases requires the use of several laboratory tests. Flow cytometry is applicable in the initial workup and subsequent diagnostic management of primary immunodeficiency diseases related to the B-, T-, and NK-cell compartment (89). Further indications for immunophenotyping are listed in Table 3 (15).

Table 3. Immunophenotyping in peripheral blood

PARAMETER | INDICATION
Lymphocyte subpopulations (T-, B-, NK-cells) | Primary immunodeficiency syndromes, HIV, transplantation
Malignant lymphocytes | Non-Hodgkin lymphoma
CD34+ stem cells | Stem cell transplantation
Lymphocyte activation (CD38; HLA-DR) | HIV
T-cell repertoire | Omenn syndrome
CD16, CD66b on granulocytes | Paroxysmal nocturnal hemoglobinuria
CD11 a/b on granulocytes | Leukocyte adhesion defects
HLA-DR on monocytes | Sepsis
LDL-receptor | Familial hypercholesterolemia

For the analysis of surface molecules, EDTA-blood is usually used and stored at room temperature (see also “sample preparation”).

For certain samples and functional assays, specific sample preparation procedures have to be considered as summarized in Table 4. In conclusion, multiparameter flow cytometry is established and widely distributed in clinical routine analysis.

Table 4. Sample preparation for different material and functional analysis in clinical diagnosis

SPECIMEN/BLOOD CELL TYPE | FUNCTIONAL TEST | INDICATION | SAMPLE PREPARATION
Reticulated platelets | - | Thrombocytopenia | Freshly drawn citrated blood, CTAD-anticoagulated blood
Bone marrow | - | Leukemia and lymphoma diagnosis | EDTA- or heparin-anticoagulated sample; erythrocyte lysing methods are preferable to density centrifugation
Bronchoalveolar lavage | - | Inflammatory, autoimmune lung diseases, sarcoidosis, collagenosis, fibrosis | Heparinized sample, preferably stored at 4°C
Cerebrospinal fluid | - | Inflammatory, autoimmune diseases of the CNS | Preferably stored at 4°C
Neutrophils | Oxidative burst | Chronic granulomatosis, sepsis, SIRS | Heparinized blood mandatory
Neutrophils | Phagocytosis | Hereditary or transient defects in phagocytosis | Heparinized blood mandatory
Basophils | Basophil degranulation | Allergic or pseudoallergic reactions | EDTA- or heparinized blood
Microparticle analysis | - | Thrombosis, cardiovascular disease, systemic inflammation (SIRS, sepsis) | Enrichment from anticoagulated blood by two differential centrifugation steps; flow cytometric measurement

Slide Based Cytometry

Slide based cytometry with fluorescence microscope based instruments is an analytical technique that allows rapid quantitative analysis of a high number of individual cells in suspension, cell culture, or in tissue sections. The specimens are tagged with fluorescent dyes and immobilized on a slide. There is a broad range of clinical applications for slide based cytometry systems (90) and flow cytometry and slide based cytometry analysis yield comparable results (91). Two important features of slide based cytometry that cannot be covered by flow cytometry are: (1) the quantitative analysis of tissue sections (92) and (2) the reanalysis of identical cells following restaining (93, 94). Regarding (1), approaches have been proposed towards 3D-cytometry of tissue sections and in tissue cultures (95). Regarding (2), this approach enables quantification of up to 100 antigens per individual cell (96).

In conclusion, slide based cytometry instruments have been standard in pharmaceutical R&D for some years, but to date only a few of them are used for clinical routine analysis. However, given the rapid development in this field, they can be expected to play a substantial role in the near future.

High-Content Image Screens

Automated microscopy and advanced automated image analysis software offer the possibility to quantitatively correlate multiple markers to each other from a single cell within a large analyzed cell population (97–99). This could be useful for discovering new pathways for target protein identification and for gaining a better understanding of complex intracellular mechanisms. The development of novel algorithms for image analysis, new data analysis tools, and the use of automated confocal microscopy will shed new light on the field of High-Content Analysis (HCA) (100). Potential applications of such technologies in HCA screening lie in intracellular high-definition imaging, e.g., for G-protein receptor analysis (GPR ligand-dependent internalization) and for the analysis of signalomes (cytosol–PM translocation, cytosol/ER–nuclear translocation) (101).

In conclusion, high-content image screens appear to be a good candidate for combining high throughput with sensitivity and low imprecision, and they have high potential for discovering new clinically relevant pathways for drug development.

Tissue Arrays

Tissue arrays are assemblies of multiple patient tissue samples (e.g., tumors) prepared for multiplexed immunohistological serial analysis of tissue specimens (102–104). This technology promises to have a great potential for clinical research and diagnostics, especially in clinical oncology (105). It offers the following advantages: amplification of a scarce resource, experimental uniformity (tissues of multiple patients are treated in an identical manner), decreased assay volume, preservation of original blocks. Depending on the shape of the tissue sample and the method used to obtain it, multi-tissue array techniques may be classified into two different groups: rod-shaped tissue techniques and core tissue techniques. Some technical aspects should be considered when deciding which technique should be used: the number, size, and origin of tissue samples; the quality of paraffin wax, the distance between samples, and the depth in the receptor block; antigenicity preservation and block sectioning. European initiatives are established to design and standardize an infrastructure for a networked tumor tissue bank. The samples are collected according to an SOP used at different institutes and the data are collected and distributed in a central database [TuBaFrost 4 (106); http://genome.tugraz.at (107)]. Antibody-based tissue profiling allows a streamlined approach for generating expression data both for normal and disease tissues. It is also possible to generate protein expression data on many different individual patients to evaluate heterogeneity of tissue profiles. In addition, specific antibodies directed to a particular target protein allow numerous functional assays to be performed ranging from conventional ELISA assays to detailed localization studies using fluorescent probes and protein capture experiments (“pull-down”) for purification of specific proteins and their associated complexes for structural and biochemical analyses. 
The challenge for antibody-based proteomics is to move from a conventional protein-by-protein approach to a high-throughput mode that allows chromosome-wide analysis. Technical challenges involve both antigen production and the subsequent generation and characterization of the antibodies. In addition, methods for systematic protein profiling at the whole-proteome level are lacking. Recently, Nilsson et al. showed that antibodies specific to human proteins can be generated in a high-throughput manner involving stringent affinity purification using recombinant protein epitope signature tags (PrESTs) as immunogens and affinity ligands. The specificity of the generated affinity reagents, here called mono-specific antibodies (msAb), was validated with a novel protein microarray assay (108). The use of tissue microarrays (TMAs) generated from multiple biopsies combined into single paraffin blocks enabled high-throughput analysis of protein expression in various tissues and organs. Recently, Kampf et al. showed that high-throughput analysis of protein expression can be performed with a standard set of tissue microarrays representing both normal and cancer tissues (109). TMA technology provides an automated array-based high-throughput technique in which as many as 1000 paraffin-embedded tissue samples can be brought into one paraffin block in an array format. A comprehensive atlas of human protein expression patterns in normal and cancer tissues can be created by combining the methods mentioned earlier. A set of standardized TMAs can be produced to allow rapid screening of a multitude of different tissues and cell types using immunohistochemistry. This approach could also be used quite effectively to generate expression data for model animals such as mouse, rat, and chimpanzee. A valuable tool for medical and biological research can thus be envisioned as a complement to genome and transcript profiling data (110–113).

In conclusion, tissue arrays of varying complexity are already in use in clinical proteomics. The main applications of this technique are screening, quality control, diagnosis, biomarker validation, and teaching (103, 114). In autoimmune diagnosis, for example, commercially distributed tissue arrays comprising from two to more than 20 different tissue samples for incubation with the serum of a single patient are available (http://www.euroimmun.de). In cancer studies, multiplexed tissue arrays are used to screen for potential therapeutic antibodies, to predict outcome and prognosis, and for biomarker screening and validation (114).

BIOINFORMATICS APPROACHES IN PROTEOMICS

Although there are numerous examples of single laboratory parameters that provide good clinical support, such as troponin I alone for the diagnosis of myocardial infarction, procalcitonin with a good predictive value for bacterial sepsis, or TSH in the diagnosis of thyroid dysfunction, most proteomic studies have indicated that a single biomarker may not be adequate for reliable diagnosis, staging, or prognosis of a disease. This immediately raises the question of how to combine several biomarkers to provide a diagnostic or predictive pattern. While a definitive answer is probably still far away, a number of approaches have emerged.

Hierarchical decision tree-based classification methods, such as CART (115), were among the first algorithms to utilize the available information on multiple biomarkers. Another approach is heuristic clustering (116, 117). However, empirical observations suggested that these approaches were not successful, because the number of incorrect predictions made by the classification algorithm increases with the complexity of the decision tree. Furthermore, the number of datasets available for training the decision tree was quite low, resulting in a lack of statistical significance beyond the second or third nodes of the tree.
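
The sample-size problem described above can be made concrete with a small calculation. The sketch below assumes, purely for illustration, a perfectly balanced binary tree and a hypothetical training set of 60 samples; neither assumption comes from the studies cited, and real trees split unevenly.

```python
def samples_per_node(n_samples: int, depth: int) -> float:
    """Average number of training samples reaching one node at a given depth
    of a binary tree, assuming every split divides its samples evenly."""
    return n_samples / (2 ** depth)


# Hypothetical training set of 60 patient samples (illustrative value only).
for depth in range(5):
    print(f"depth {depth}: about {samples_per_node(60, depth):.1f} samples per node")
```

Under these assumptions, fewer than eight samples remain per node at the third level, which is one way to see why decisions deeper in the tree rest on too few observations to be statistically meaningful.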

Support Vector Machines (SVM) [for an example see (118)] appeared to be an excellent way to overcome some of these limitations due to the theoretical principles upon which they are based. Indeed, excellent empirical performance of SVM has been reported in a number of diverse applications. These approaches provided superior cross-validated predictive performance, but mixed results were obtained with blinded datasets. Reliable results have been obtained when the number of variables was less than 20 and substantial differences between the datasets existed. However, when the differences were more subtle, the number of support vectors had to be increased. This, in turn, resulted in the classifiers over-fitting (also referred to as “memorizing,” a term often employed in Artificial Intelligence research) to the training set and thus in poor classification of blinded datasets (Mischak et al., unpublished data). To avoid such memorizing effects, the number of variables and dimensions has to be lowered.
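
The memorizing effect can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions, not a reanalysis of the data referred to above: it swaps the SVM for a 1-nearest-neighbour classifier (chosen only because it needs no external library and memorizes its training set by construction), and it uses random numbers in place of real peak intensities, with labels that are unrelated to the variables. All names and dataset sizes are hypothetical.

```python
import random


def nearest_neighbour_predict(train_x, train_y, sample):
    """Return the label of the training sample closest to `sample`
    (squared Euclidean distance); this classifier memorizes its training set."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], sample)))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(nearest_neighbour_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def noise_dataset(n_samples, n_variables, rng):
    """Random 'intensities' with class labels that are unrelated to them."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_variables)] for _ in range(n_samples)]
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    return x, y


rng = random.Random(0)
train_x, train_y = noise_dataset(30, 200, rng)    # 30 samples, 200 variables
blind_x, blind_y = noise_dataset(200, 200, rng)   # blinded set, never used for fitting

train_accuracy = accuracy(train_x, train_y, train_x, train_y)
blind_accuracy = accuracy(train_x, train_y, blind_x, blind_y)
print(f"training set: {train_accuracy:.2f}, blinded set: {blind_accuracy:.2f}")
```

The classifier reproduces every training label, yet on the blinded set it can do no better than chance because the data contain no signal. Evaluating only on the training set would hide this completely.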

One very important aspect of using combinations of biomarkers to make a predictive diagnosis with a classification algorithm is to have a properly calibrated indication of the level of confidence in the predictions being made. In other words, a classification such as “this urine sample has been drawn from an individual with type II diabetes” should also carry a numeric score indicating how probable it is that the classification is correct, e.g., “with 90% confidence this urine sample has been drawn from an individual with type II diabetes.” Clearly, a prediction with 90% confidence is more reliable than one with 50% confidence, especially if there are only two alternatives to be considered, disease presence versus disease absence (in which case 50% confidence indicates little more than random guessing). Having such confidence levels (also referred to as probabilities) attached to a classification also enables unbalanced costs of misclassification to be taken into account in an optimal manner. For example, there may be far more serious consequences to incorrectly predicting the absence of a disease than to incorrectly predicting its presence. Although SVMs provide very encouraging classification performance on a range of difficult problems, they are devoid of any probabilistic semantics and are thus unable to provide levels of confidence attached to any classification, leaving the clinician with no information as to how much the predictions should be trusted.
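
How a calibrated probability allows unbalanced misclassification costs to be handled can be shown with a short worked example. The sketch below assumes a two-class problem and a standard expected-cost argument: calling "disease present" is the cheaper decision whenever p × cost(false negative) exceeds (1 − p) × cost(false positive), where p is the predicted probability of disease. The cost values and function names are hypothetical.

```python
def decision_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability of disease above which calling 'disease present' has the
    lower expected cost: p * cost_FN > (1 - p) * cost_FP."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(probability_of_disease: float,
             cost_false_positive: float,
             cost_false_negative: float) -> str:
    """Choose the label with the lower expected misclassification cost."""
    threshold = decision_threshold(cost_false_positive, cost_false_negative)
    return "disease present" if probability_of_disease > threshold else "disease absent"


# Hypothetical costs: a missed disease is taken to be 9 times worse than a false alarm.
print(decision_threshold(1.0, 9.0))   # 0.1
print(classify(0.30, 1.0, 9.0))       # disease present
print(classify(0.30, 1.0, 1.0))       # disease absent (threshold is 0.5 for equal costs)
```

A classifier that returns only a label, without a probability, offers no handle for shifting the threshold in this way.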

A promising probabilistic classification method that shares many of the positive characteristics of the SVM, but in addition provides the important level of confidence with each classification prediction, is based on the Gaussian Process. A general-purpose and computationally efficient Gaussian Process-based classification method has recently been developed (119) and has been successfully applied to the problem of correctly predicting BRCA1 and BRCA2 heterozygous genotypes (120). The probabilistic nature of Gaussian Process-based classification methods provides a means of inferring optimally weighted combinations, and possibly a selection, of biomarkers; a detailed study of this capability is currently under way.

No matter which of these approaches is being utilized, two basic considerations apply: (1) the number of independent variables should be kept minimal and should certainly be below the number of samples investigated, and (2) any such approach only becomes valid if applied towards an entirely blinded validation set, and it should be imperative to include such a blinded dataset in any report on potential biomarkers.
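The two considerations above can be sketched in a few lines. The sketch assumes that samples are identified by arbitrary IDs and that the blinded set is set aside before any variable selection or model training takes place; the function names, the 30% validation fraction, and the fixed random seed are hypothetical choices made for illustration only.

```python
import random


def split_blinded(sample_ids, validation_fraction=0.3, seed=0):
    """Set aside a blinded validation set before any model building.

    Returns (training_ids, blinded_ids). The blinded IDs must not be used
    for variable selection, parameter tuning, or training.
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_blinded = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_blinded:], shuffled[:n_blinded]


def variable_count_is_acceptable(n_variables, n_training_samples):
    """Consideration (1): fewer independent variables than samples."""
    return n_variables < n_training_samples
```

Splitting first and selecting variables afterwards, on the training portion only, is what keeps the validation set truly blinded. Selecting variables on the full dataset and splitting later leaks information into the validation result.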

DATA ANALYSIS STRATEGIES

Mass-spectrometry-based proteomics has become an important component of biological and clinical research. The analysis of rather complex protein or peptide mixtures, most importantly those derived from body tissues or body fluids, challenges data acquisition, data handling, and data processing capabilities. When proteomic data are combined with patient or clinical data, they are merged, compiled, and integrated into very complex datasets. Thus, very heterogeneous types of data are generated that need to be managed properly. As data handling, interpretation, validation, storage, and dissemination are critical for ensuring proper use of the data, it is indispensable to develop formats as well as minimal requirements to ensure data quality.

Another challenge is the sheer size of proteomic datasets. Current genomic and proteomic analytical methods, while highly developed and powerful, easily generate gigabyte-sized datasets. The informatics infrastructure, however, has to retrieve information from these data within a reasonable time and with adequate quality. For the various tasks, including for example the identification of labeled and unlabeled peptide species or the elimination of background signal in the mass spectrum and chromatogram, a plethora of tools and approaches has been developed, although mature data standards are still missing (121). In addition, the wealth of genomic or proteomic information accumulated from prior studies is almost never adequately used for the initial planning of experiments or the interpretation of data from a given study, as the integration of external data remains challenging and tedious. One consequence of this situation is that results are rediscovered in every proteomic experiment (122). Using external resources in one's own research pipeline often causes problems, as remote access to external sources may be difficult and the integration of the retrieved data laborious or hardware-dependent. In many cases, the required external information has to be compiled manually. Here, transparency, as well as efforts to generate the means to integrate previous data into a current dataset or experimental plan, can avoid costly repetition of a study. As suggested by Mewes (personal communication, Workshop Clinical Proteomics, Martinsried 9/2006), it would be preferable to combine resources with interfaces (in any desirable programming language) and to use these interfaces to convert generic data formats into standardized ones, send standardized information over the Web, and thereby disseminate standardized and structured information to any interested clients located anywhere via Web Services.

The HUPO-PSI has already started these standardization efforts for proteomic datasets, and the ProDaC consortium (www.fp6-prodac.eu) will develop practical software tools to bring datasets into these standardized formats. However, once data have been generated, analyzed, and converted into standardized file formats, an export from local database systems into a central data repository such as PRIDE (http://www.ebi.ac.uk/pride) will be mandatory for private datasets as well as public datasets (after the results have been published in scientific journals). These central repositories will be a fundamental infrastructure required to avoid rediscovering already known results.

In many studies it is necessary to analyze a large number of individual proteomes and to compare the results obtained. In biomarker discovery or clinical studies, large cohorts of patients and normal individuals are required to detect protein patterns that consistently associate with a specific condition within a large background of proteins that may randomly fluctuate within the population tested. The risk of misinterpreting correlating protein patterns as biomarkers is enormous. This is particularly true for the many types of cancer, but also for other diseases such as type 2 diabetes or heart failure, where useful and adequate diagnostic markers with high specificity and sensitivity are lacking. Thus, the success of mass-spectrometry-based discovery of candidate biomarkers is highly dependent on the ability to handle and interpret these data in a statistically sound manner, using decoy databases and a defined false discovery rate. Besides the sheer volume of proteomic datasets, the heterogeneity of patients, sample processing, and data acquisition in multi-center studies, as well as the heterogeneity of the data formats, processing methods, software tools, and databases involved in translating spectral data into relevant and actionable information for scientists, can greatly limit the value of such data. In the past, biomarker discovery has suffered from inconsistent data acquisition, statistical handling, and validation. Thus, clinical proteomics needs standard operating procedures and guidelines for thorough data generation (123, 124), consistent data analysis (125), and distinct training and validation datasets (126).

These standards include:

  • Standards and data formats for data acquisition (raw data), storage (mzXML/mzData), and exchange (PSI: http://www.psidev.info)

  • Public data repositories (PeptideAtlas, PRIDE, GPMDB, Swiss-Prot/UniProt)

  • Standards for quality assessment (e.g., FDR, composite decoy databases, power analysis)

  • Integration towards a “linear pipeline” (Aebersold, ProDaC)

  • Integration in a complex database including biological information (systems oriented approach)
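The false discovery rate (FDR) estimate based on decoy databases, listed among the quality standards above, can be sketched as follows. This is a simplified illustration, assuming a search against a combined target and decoy database of equal size, a score for which higher is better, and the plain ratio of decoy to target hits as the estimator; other estimators are in use, and the function names are hypothetical.

```python
def estimate_fdr(psms, score_threshold):
    """Estimate the FDR among peptide-spectrum matches above a threshold.

    psms is a list of (score, is_decoy) pairs. The estimate is the number
    of accepted decoy hits divided by the number of accepted target hits.
    """
    n_target = sum(1 for score, is_decoy in psms
                   if score >= score_threshold and not is_decoy)
    n_decoy = sum(1 for score, is_decoy in psms
                  if score >= score_threshold and is_decoy)
    if n_target == 0:
        # No accepted targets: report 0 if nothing passed, otherwise 1.
        return 0.0 if n_decoy == 0 else 1.0
    return min(1.0, n_decoy / n_target)


def threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is <= max_fdr.

    Returns None if no threshold satisfies the requested FDR.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best
```

Reporting the score threshold together with the FDR it corresponds to makes identification lists from different laboratories comparable, which is the purpose of defining such a standard.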

With respect to the interpretation of complex data from clinically oriented studies, rules of conduct have to be defined for linking profiling results with clinical phenotypes and for allowing proper statistical evaluation and interpretation of such protein profiles. Here, standards for data analysis and quality management have recently been established, standards for data reporting (HUPO-PSI, MCP, and other journal guidelines) have been set up, and databases for proteomics data are being structured as a task within large HUPO initiatives. Proteomic search algorithms and a public repository of peptide identifications for the proteomic community (PRIDE: http://www.ebi.ac.uk/pride) have been developed.

Data have to be qualified with respect to selectivity and specificity for a pathological process. Clustering variables, statistics, reproducibility, standards and exchange formats, and study design have to be carefully considered as a basis for proper data analysis. Here, sample retrieval, preservation, storage, and handling are some of the most important parameters to be strictly defined. In addition, protocols and standards have to be in place to generate and secure reproducible data. The recently established HUPO standards and SOPs for MS-based data acquisition and analysis can be considered an important step in that direction. A remaining challenge is data heterogeneity, which can negatively affect the proper integration of proteomics data with other biomedical data.

Future efforts in data management in the field of clinical proteomics have to consider that data and results are likely to be application- and disease-specific.

A remaining problem with respect to handling clinically relevant data will be to allow utmost transparency while at the same time ensuring data protection and protecting intellectual property. Early teamwork between clinicians, bioinformaticians, medical informaticians, and proteomic scientists is needed to overcome the current limitations.

Although grouped analysis of proteomics data is of high importance for the development of efficient therapies as well as for scientific purposes, it is equally essential to use such data to assess the therapy-dependent disease progression of individual patients in advance, that is, at diagnosis, disease relapse, or other time points of interest. This will favor the individualized treatment of patients in stratified patient groups as an important prerequisite to maximize therapeutic success and to minimize adverse drug reactions (personalized/individualized medicine) (127). One key question to address in this context relates to our ability to draw the right conclusions for (short-, mid-, or long-term) therapeutic approaches from the highly dynamic proteome patterns that are influenced by various states of disease.

With these aims, clinical proteomics studies have to comprise diseased patients as well as a reference group of healthy or otherwise defined individuals. An unknown validation set of diseased and reference individuals, including patients with unrelated diseases, is required to prove the robustness of classification for unknown patient samples.
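How robustness on such a validation set might be quantified can be sketched with the two measures used throughout this article, sensitivity and specificity. The sketch assumes one true label and one predicted label per validation sample and treats every label other than the disease of interest (healthy or unrelated disease) as negative; the label strings and the function name are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="disease"):
    """Compute sensitivity and specificity on a blinded validation set.

    Any label other than `positive` (e.g. healthy, unrelated disease)
    counts as negative. Returns (sensitivity, specificity).
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1  # diseased, correctly flagged
            else:
                fn += 1  # diseased, missed
        else:
            if predicted == positive:
                fp += 1  # reference sample, wrongly flagged
            else:
                tn += 1  # reference sample, correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Including patients with unrelated diseases among the negatives matters here: a marker pattern that merely reflects general illness would show acceptable specificity against healthy controls but not against this group.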

CONCLUSION

Current approaches to understand the functional diversity of an organism preferentially strive for a systems approach whereby first the phenotypic classification of a specific cytome as the cellular systems and subsystems and functional components of the organism is achieved prior to an attempt to perform proteomic analysis. The fundamental basis for this approach has been well established in studies of cellular systems over many years. When combined with advanced proteomic approaches, this approach can achieve rapid and specific identification of a direct link between biomarkers and their functional roles in complex organisms (128). Integration of proteomics and cell-based technologies will allow the description of the molecular setup of normal and abnormal cell systems within a relational knowledge system, permitting the standardized discrimination of abnormal cell states in disease.

At the current stage, these methods are mostly qualitative and must be seen as exploratory approaches, aimed at advancing scientific knowledge within clinical studies rather than at routine use. In the future, this may gradually change, with some of the methods as well as applications being established as clinical routine assays. As one of the consequences, individualized predictions of the further disease course in patients (predictive medicine by cytomics) based on characteristic discriminatory data patterns will then permit individualized therapies, identification of new pharmaceutical targets, and establishment of a standardized framework of relevant molecular alterations in disease (129). Most important for clinical proteomics and cytoproteomics on the verge of routine laboratory diagnosis is the control of preanalytical aspects. This is best met by a high degree of standardization, including SOPs and automated workstations for high-throughput sample preparation. Under these conditions, clinical proteomics and cytoproteomics have the potential to complement genomics, cytomics, metabolomics, lipidomics, glycomics, and transcriptomics, including splice variant analysis, in gaining a better understanding of disease processes.
