Prof Robert Rees The John Van Geest Cancer Research Centre School of Science and Technology Nottingham Trent University Clifton Lane Nottingham NG11 8NS UK e-mail: email@example.com
Companion animals are exposed to environmental conditions and carcinogens similar to those encountered by humans. In some animal cancers, the associated genetic changes also appear to be the same as in humans. However, little work has been carried out on cancer biomarker identification in animals. The recent dramatic advances in molecular medicine, genomics, proteomics and translational research will allow biomarker identification, which may provide the best strategies for veterinarians and clinicians to combat disease by early diagnosis and administration of effective treatments. Proteomics may have important applications in cancer diagnosis, prognosis and prediction of clinical outcome that could directly change clinical practice by affecting critical elements of care and management. This review summarizes the advances in proteomics that have propelled us to this exciting age of clinical proteomics, and highlights the future work that is required for this to become a reality. We discuss the available proteomic technologies and their limitations, and highlight the key areas of research and how they have been used to discover cancer biomarkers. The principles described here are equally applicable to human and animal disease, but implementation of ‘omic’ technologies requires stringent guidelines for collection of clinical material, the application of analytical techniques and interpretation of the data.
Cancer is an intractable disease and the risk of human and companion animal cancers increases dramatically with age. It is estimated that cancer kills over 6 million people per year worldwide, and over 10 million new cases are diagnosed every year. Cancer has also increased in companion animals in recent years, probably owing to increased life expectancy as a result of advances in pet nutrition, vaccination for common infectious diseases and overall advances in veterinary care, and it is well known that certain breeds are more prone to develop cancer. Mortality is high mainly because the majority of patients attending oncology clinics present too late, when the primary cancer has already disseminated to distant organs as micro- or macro-metastases. There are no effective treatment options for this stage of the disease.
The use of single cancer-specific biomarkers for cancer diagnosis and/or monitoring of progression has recently been challenged because they do not reach the levels of specificity, sensitivity and accuracy that are required for routine clinical practice.1–5 According to recent literature, this is mainly because of the molecular heterogeneity of tumours from patient to patient.6–8 Individuals with the same type of cancer vary in their tumour location, size, histology, grade and stage. Moreover, an individual organ can harbour a tumour that contains several stages in the same tissue (for example, in situ and invasive cancer). The specificity of cancer biomarkers is further reduced because of epidemiological heterogeneity, including differences in age, sex and genetic background. Although an individual biomarker candidate might be specific and sensitive only for a certain stage or molecular aetiology, combinations of many markers, screened for sensitivity and specificity concomitantly, could achieve a higher level of specificity and sensitivity. Markers for general population screening for rare diseases would have to approach 100% specificity to be accepted. However, markers that are used for high-risk screening or for relapse monitoring can have much lower specificity but require high sensitivity.
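To illustrate why population screening for a rare disease demands near-perfect specificity, the following minimal sketch computes the positive predictive value (PPV) from sensitivity, specificity and prevalence using Bayes' rule. The function name and the example figures (a prevalence of 1 in 2500 and a sensitivity of 98%) are illustrative assumptions, not values taken from the studies cited here.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of positive test results that are true positives (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical screening scenario: disease prevalence of 1 in 2500.
prevalence = 1 / 2500
for spec in (0.95, 0.99, 0.999):
    ppv = positive_predictive_value(0.98, spec, prevalence)
    print(f"specificity={spec:.3f}  PPV={ppv:.3f}")
```

Under these assumed figures, fewer than one in three positive results corresponds to a true case even at 99.9% specificity, whereas the same marker applied to a high-risk group (higher prevalence) would give a far higher PPV.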
Clinical proteomics (the analysis of blood serum/plasma, urine and/or tissue) aims to provide clinicians with tools to accurately diagnose and treat patients in an individualized manner. To diagnose early stage disease accurately and to monitor disease progression, clinicians and veterinarians need the means to identify precisely when an individual patient is at risk of developing a specific disease. There is also a need to identify the best therapeutic strategy to suit an individual patient and to assess treatment efficacy and susceptibility to adverse events.
A genomic analysis of DNA or RNA alone cannot provide information on the biologically functional proteins because of the presence of post-translational modifications, such as phosphorylation and glycosylation, all of which affect protein function.7 Potentially several hundred thousand to several million protein forms could exist as a result of post-translational processing and modification of the products of the 30 000–40 000 genes which make up the human genome. The scientific community has been empowered by this knowledge to engage in large-scope proteomic experiments to determine the function of the proteins these genes encode. Previously unidentifiable ‘peaks’ and ‘bands’ can now be directly assigned to the proteins that represent the source of these signals. It is the proteins and their actions that are responsible for the functional diversity of cells, and it is proteins that perform most biological processes. The analysis of the ‘proteome’, that is, the proteins expressed by a cell or tissue type in a given biosystem, provides exciting opportunities for early detection of disease using protein patterns of body fluid samples, and applications to diagnosis to complement histopathology, individualized selection of therapeutic combinations that optimize treatment, real-time assessment of therapeutic efficacy and toxicity and change of therapy dictated by drug resistance.9–11
The field of proteomics encompasses tools, technologies and approaches targeted at studying variations within the proteome. Through proteomics, scientists link differences in protein expression with underlying biological processes, including those that govern disease. Our rapidly growing repertoire of techniques allows us to identify and study protein expression and modification either singly or collectively. The aims of proteomic research may be classified into four main categories: protein mining, the identification of as many proteins as possible in a sample; protein expression profiling, the comparison of specific proteins/peptides in healthy and diseased states; protein network mapping, the identification of interacting proteins; and protein modification mapping, the identification of protein post-translational modifications.
Proteomics is a discovery tool for research into animal and human cancers, although its use in veterinary medicine has lagged behind human medicine and only a handful of studies have been reported in the literature.12–15 New innovations in proteomic technology, however, now offer the potential for proteomic profiling to become standard practice in both human and veterinary clinical laboratories through the development of better diagnostic tests, the identification of new therapeutic targets and, ultimately, ‘personalized’ patient therapy. Unfortunately, the early promise of proteomic strategies was followed by concern about the reproducibility and robustness of some studies.16 It is now generally recognized that it is essential, when utilizing these new technologies, to implement standard operating procedures (SOPs) with stringent validation and quality control of sample collection, pretreatment, mass spectral and statistical analyses. This review aims to highlight the different proteomic methods available for the identification of cancer biomarkers in human and veterinary oncology. It also addresses many of the concerns, strengths and weaknesses of proteomics with regard to both experimental design and proteomics-based analyses. A number of technical obstacles remain before routine proteomic analysis can be achieved in the clinic or veterinary oncology practice; however, the standardization of methodologies and dissemination of proteomic data into publicly available databases are starting to overcome these hurdles.
The proteome is responsive to physiological and diseased states, as well as external stimuli, and is dynamic in nature, which makes it a ‘real-time’ indicator of physiological processes.17,18 Studies that profile proteomic patterns in body fluids are particularly challenging because of the complexity and dynamic nature of a constantly changing system, where the vast number of proteins may be present in many forms because of post-translational modifications. Protein concentrations in serum span 10 to 12 orders of magnitude (<1 pg mL−1 to >1 mg mL−1), which presents analytical and other challenges. In proteomic analysis, high-abundance proteins such as albumin can severely interfere with and suppress the detection of low-abundance proteins that exist at pg mL−1 levels, such as currently known human cancer biomarkers (e.g. prostate-specific antigen [PSA] and carcinoembryonic antigen [CEA]). This means that, to be able to study the low-abundance proteins, the first step in any proteomic analysis is to decrease the complexity of the biological sample to make it easier to identify individual proteins.
This can be carried out in a number of ways: (i) purifying the sample by affinity chromatography (antibody-based purification, the presence of metal ions within proteins [metal affinity binding] and post-translational modifications [glycosylation or phosphorylation]); (ii) fractionating the proteins by gel-based or chromatographic methods that separate them according to their size or mass (by denaturing one-dimensional polyacrylamide gel electrophoresis [1D-PAGE]), isoelectric point (pI; isoelectric focusing), charge (anion or cation exchange liquid chromatography) or hydrophobicity (binding of the protein to a hydrophobic matrix in a column or tip and elution of the bound protein with a high proportion of organic solvent); (iii) enzymatically digesting the proteins into peptides using, for example, trypsin, which cleaves proteins at specific and predictable sites. The next stage is the analysis of the sample by mass spectrometry (MS) after conversion of the proteins or peptides into gas-phase desolvated ions, which can be carried out using either the electrospray ionization (ESI) or matrix-assisted laser desorption ionization (MALDI) technique described below, to measure the masses of the peptide fragments. The peptide masses can then be searched against one of the many protein databases to determine the peptide sequences; the search returns a match, with confidence intervals, to the protein present in the original sample. The order in which a proteomic workflow or experiment is carried out can vary depending on which protein is under investigation, the main goal of the studies and the availability of the relevant technologies and expertise.
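The enzymatic digestion step lends itself to a short illustration. The sketch below performs an in silico tryptic digest using the commonly stated rule that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P), and then computes the monoisotopic mass of each peptide. The function names and the example sequence are illustrative assumptions; missed cleavages and modifications, which real search software must handle, are ignored.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O is added per free peptide


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Arbitrary example sequence, used only for illustration.
for pep in tryptic_digest("MKWVTFISLLRAKPGR"):
    print(pep, round(peptide_mass(pep), 4))
```

The list of theoretical masses produced in this way for every protein in a database is what measured peptide masses are compared against.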
The application and limitations of two-dimensional gel electrophoresis
Proteins in biological samples can be separated based on their size by 1D-PAGE, where the smaller proteins move faster through the gel than the larger proteins. The gels can then be stained and bands can be viewed after destaining. This technique has a major flaw in that it is only able to resolve a few hundred proteins and cannot separate proteins of very similar size. An important development, two-dimensional gel electrophoresis (2DE-PAGE), known as the ‘gold standard’ method, separates proteins using two different properties. In the first dimension, proteins are separated by isoelectric focusing in a flat gel strip according to their pI, that is, the specific pH at which the net charge of the protein is zero. The proteins are then separated in a second dimension by placing the gel strip on top of a standard sodium dodecyl sulphate–PAGE gel, which separates the proteins according to size. The proteins can then be visualized as spots with different stains, for example, silver staining,19 Coomassie blue dye, fluorescent dyes20 or radiolabels. A good 2DE gel can resolve approximately 3000 proteins depending on the sample used and the sensitivity of the staining technique.21
The main advantage of using the 2DE method is that it allows the simultaneous separation and visualization of thousands of proteins.22 Some of the limitations of 2DE-PAGE include the poor resolution of hydrophobic and membrane proteins, poor quantification, lack of identification and the fact that it is very labour intensive.23,24 It also remains a rather low-throughput approach, being able to process at best two samples at a time, and requires relatively large amounts of sample. The main limitation, however, is the reproducibility of the gels, as the method relies on comparing the intensity of spots between two gels from different proteomes. Gel-to-gel variations have been overcome with some success with the development of new software packages; however, the introduction of a technique known as differential in-gel electrophoresis (2DE-DIGE) in 1997 by Unlu et al.25 has provided the best tool for overcoming problems associated with gel-to-gel variations. The method involves the labelling of complex protein mixtures with fluorescent dyes before 2DE electrophoretic separation. The advantage of this technique is that up to three different samples can be mixed together and separated on a single gel. Each sample is labelled with a different cyanine dye; for example, a control sample may be labelled with Cy3 dye and a disease state sample with Cy5 dye. These cyanine dyes can be differentially viewed because they absorb and emit light at different wavelengths, and they are matched for charge and molecular weight, ensuring that the same protein in each sample will migrate to the same location in the 2DE gel. Although 2DE-DIGE allows the protein expression ratio to be observed from a single gel and offers the option of using internal standards to reduce intergel variation, the method is still considered low throughput and requires relatively large amounts of sample.
Protein spot identity cannot be gained from 2DE gels alone, so the sample spots that show a difference in expression have to be cut out of the gel and digested with trypsin (a protease enzyme that cleaves proteins at specific and predictable sites), which results in smaller peptide fragments. These peptides are then subjected to MS analysis that produces a peptide mass fingerprint (PMF) to identify biomarker proteins as described below. The PMF obtained is then compared with the theoretical peptide masses held in extensive human protein databases to determine the protein/peptide identity.
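The comparison of a measured PMF with theoretical peptide masses can be sketched as a simple tolerance match. The example below counts how many observed masses fall within a parts-per-million (ppm) window of each candidate protein's theoretical masses and ranks the candidates by that count. The protein names, masses and the 50 ppm tolerance are hypothetical; real search engines use probabilistic scoring rather than a raw count.

```python
def count_matches(observed, theoretical, tol_ppm=50.0):
    """Count observed masses lying within tol_ppm of any theoretical mass."""
    hits = 0
    for obs in observed:
        if any(abs(obs - theo) / theo * 1e6 <= tol_ppm for theo in theoretical):
            hits += 1
    return hits


def rank_candidates(observed, database, tol_ppm=50.0):
    """Return (protein, matched_peaks) pairs, best match first."""
    scores = [(name, count_matches(observed, masses, tol_ppm))
              for name, masses in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical theoretical peptide masses (Da) for two made-up proteins.
database = {
    "protein_A": [842.510, 1045.564, 1479.796, 2211.104],
    "protein_B": [927.493, 1163.631, 1567.743],
}
observed = [842.52, 1479.80, 2211.10, 1300.00]
print(rank_candidates(observed, database))
```

A tighter tolerance reduces chance matches but requires a well-calibrated instrument, which is one reason mass accuracy matters so much for identification.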
The 2DE approach outlined above is useful when comparing two similar samples to find specific protein differences, and is particularly suitable for the analysis of cancerous tissue because protein expression in normal and diseased tissue from the same patient can be compared. The types of human cancers that have been investigated by this technique include lung, thyroid, colon, kidney and bladder.26–30 Proteomic techniques are making their mark in veterinary medicine too, and a recent article by McCaw et al.13 used a combination of 2DE-PAGE and MALDI-MS to compare protein expression in lymph nodes from dogs with B-cell lymphoma with that in lymph nodes from normal dogs. In their study, several proteins that showed differential expression were identified; of these, prolidase, triosephosphate isomerase and glutathione S-transferase were downregulated, whereas macrophage capping protein was upregulated in the lymphoma samples. This study highlights the discovery and diagnostic utility of proteomic strategies in veterinary oncology.
Until recently, gel-based techniques such as 2DE-PAGE have dominated proteomic studies,31 but the new non-gel-based MS technologies, which profile thousands of proteins, are ideal methods for identifying new diagnostic targets. The development of modern separation techniques coupled with advanced MS has ushered in a new paradigm that overcomes some of the limitations of 2DE and can be adapted for high-throughput processing. A mass spectrometer consists of three main components: an ionization source, a mass analyser and an ion detector. The function of the ion source is to produce ions from the sample. The two most common ion sources used in proteomics are MALDI and ESI. Peptides are converted to ions in the gas phase by the addition or loss of one or more protons in a so-called ‘soft’ ionization technique that still maintains sample integrity. The function of the mass analyser is to separate ions with different mass-to-charge ratios (m/z). The numbers of different ions are then measured by the detector, which presents these as a mass spectrum, a chart with a series of peaks, each representing an ion or charged protein fragment present in a given sample.
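As a brief worked example of the mass-to-charge ratio, the sketch below computes the m/z at which a molecule would be detected after gaining one or more protons (positive-ion mode). The function name and the 1500 Da example mass are illustrative assumptions; ions formed by proton loss are not covered.

```python
PROTON = 1.007276  # mass of a proton in daltons


def mz(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    return (neutral_mass + charge * PROTON) / charge


# The same hypothetical 1500 Da peptide appears at different m/z values
# depending on how many protons it carries.
for z in (1, 2, 3):
    print(z, round(mz(1500.0, z), 3))
```

MALDI produces predominantly singly charged ions, so peaks appear close to the molecular mass plus one proton, whereas ESI typically yields a series of multiply charged ions of the same molecule.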
The most commonly used techniques for the expression analysis of proteins are MALDI and surface enhanced laser desorption/ionization (SELDI) combined with time-of-flight mass spectrometry (TOF-MS) and ESI combined with liquid chromatography–(tandem) MS (LC–MS/MS). These techniques may also be combined with quantitative techniques such as isotope-coded affinity tags (ICAT) and isobaric tags for relative and absolute quantification (iTRAQ).32–35 These MS techniques are particularly important for the analysis of the low-molecular-weight (LMW) fraction of the proteome because, in this compartment of the proteome, the use of immunological assays such as enzyme-linked immunosorbent assay is limited owing to the difficulty in producing antibodies for LMW proteins. The following sections provide a brief overview of each methodology but focus mainly on the use of MALDI- and SELDI-MS, highlighting the path of this technology from the discovery of biomarkers in human and animal cancers to the clinic, and the hurdles and challenges that need to be overcome.
In MALDI, proteins or peptides are mixed with an excess of an organic matrix that absorbs light at the wavelength of a pulsed laser and spotted onto a plate. Once dried, the matrix forms crystals within which the proteins are embedded. The mixture is then subjected to a laser pulse, which leads to rapid heating and the ejection of a plume of matrix and proteins that are desorbed from the plate and converted into gas-phase ions. The charged protein particles (ions) are accelerated into the vacuum flight tube of a TOF mass analyser and travel through the device until they impact the detector. Lighter ions have higher velocities and therefore shorter flight times than heavier ions. This relationship between the mass of the ion and its TOF is used to determine the m/z ratio of the ions (Fig. 1A). The TOF mass analyser can be operated in two modes, linear and reflectron (Fig. 1A,B). In linear mode, the analyser simply measures the TOF of an ion as it flies from one end of the flight tube to the other. In reflectron mode, however, when the ion reaches the reflectron, it is reflected back towards a detector located near the MALDI ion source, resulting in a lengthened flight path. The role of the reflectron is to focus the ions with the same m/z value, increasing the resolution of the spectrometer, which allows the protein or peptide ions to be separated better and their mass to be measured more accurately. The detector produces a response for each ion that is represented in the mass spectrum as a plot of m/z on the x-axis against intensity (i.e. the number of ions of a particular m/z detected) on the y-axis. In this way, a protein profile or PMF is generated.
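The relationship between mass and flight time in linear mode can be sketched from first principles. An ion of charge z accelerated through a potential V gains kinetic energy zeV = ½mv², so its flight time over a field-free tube of length L is t = L·√(m / 2zeV). The sketch below is an idealised model only: the 20 kV accelerating voltage and 1 m tube length are assumed values, and initial velocity spread, delayed extraction and the reflectron are ignored.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg


def flight_time(mass_da, charge, accel_voltage, tube_length):
    """Flight time (s) in an idealised linear TOF tube.

    From z*e*V = 0.5*m*v**2, so t = L * sqrt(m / (2*z*e*V)).
    """
    mass_kg = mass_da * DALTON
    return tube_length * math.sqrt(
        mass_kg / (2 * charge * E_CHARGE * accel_voltage)
    )


# Assumed instrument settings: 20 kV acceleration, 1 m flight tube.
for mass in (1000.0, 4000.0, 16000.0):
    t = flight_time(mass, 1, 20_000.0, 1.0)
    print(f"{mass:8.0f} Da  {t * 1e6:6.1f} microseconds")
```

Because flight time scales with the square root of m/z, a fourfold increase in mass only doubles the flight time, which is why peaks for heavier ions crowd together and are harder to resolve.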
MALDI-TOF-MS has proven particularly useful in the discovery of biomarkers because of the technology’s high sensitivity (down to fmol), wide dynamic range (it can detect proteins over several orders of magnitude of concentration in a single run) and suitability for high-throughput screening.36–41 Additionally, samples can be readily separated, purified and applied to target plates using rapid automated procedures. MALDI-MS has also been used for generating patterns of proteins from clinical samples such as serum and plasma that do not rely on protein identity and can be used as a ‘diagnostic fingerprint’. MALDI has also been used for molecular imaging of tissue sections.42,43
Another MS-based technology for sample clean-up and analysis, SELDI-TOF-MS, has been used successfully to detect several disease-associated proteins in complex biological samples, such as cell lysates, seminal plasma and serum. SELDI uses a chip (the ProteinChip) with an affinity-based surface (the affinity surface can be antibody-based, cationic, hydrophobic or metal-binding) that selectively binds proteins of interest, while non-bound proteins and impurities are readily removed by washing the surface with buffer. Matrix is then added to the bound proteins on the surface as in MALDI, so that the energy-absorbing matrix molecules co-crystallize with the proteins in the sample. The PMF can then be generated in exactly the same way as for MALDI as described above.
Mass accuracy is the key element in any MS-based methodology as this is the only variable that allows for the future identification of the protein/peptide; the sharper the peak, the more accurately its mass can be determined. SELDI-MS instrumentation has lower sensitivity compared with modern MALDI-MS and cannot resolve the proteins into sharp mass spectral peaks.44 Both SELDI- and MALDI-TOF-MS have the capability to detect high-molecular-weight proteins (>100 kDa) when used in conjunction with linear TOF analysers, but typically a mass range of up to 30 kDa is used for biomarker identification (Fig. 1A). However, recent developments in MALDI-TOF instrumentation have meant that analysis of peptides and LMW proteins can be carried out with high resolution up to 10 kDa in reflectron mode45 (Fig. 1B). An added characteristic of some MALDI instrumentation using a reflectron TOF analyser is the ability to provide partial amino acid sequence data using its post-source decay (PSD) selection on ions with m/z values of up to 3 kDa, a feature not found with earlier SELDI-MS analysers. PSD using MALDI-TOF is, therefore, an alternative approach to the more conventional MS/MS using ESI for the elucidation of the primary structure of biomolecules.46 This is primarily because MALDI instrumentation is simple and highly sensitive, and has a higher tolerance to buffer and salt contaminants than ESI.
The MALDI and SELDI technologies underpin high-throughput proteomic approaches that allow protein expression profiling of large sample sets.47 The two approaches have been applied successfully to human cancer detection with reported high sensitivity and specificity using a variety of statistical pattern-recognition tools, for the early detection of ovarian,16 breast,40,48 prostate,36,37,49 astrocytoma/glioblastoma38 and melanoma41,50 cancers. Identification of serum biomarkers for canine B-cell lymphoma using SELDI has also been reported.12 Sera were collected from 29 dogs with B-cell lymphoma and 87 control dogs (approximately equal numbers of healthy dogs, dogs with malignant cancers other than B-cell lymphoma and dogs with various non-neoplastic diseases or conditions). Serum samples were fractionated chromatographically and analysed through SELDI-TOF MS. Using classification trees, the investigators reported three biomarker protein peaks with a sensitivity of 97% and specificity of 91% for the classification of B-cell lymphoma. This study provides proof-of-principle that SELDI combined with bioinformatic tools has the potential to support the development of a diagnostic test for B-cell lymphoma in dogs. However, the investigators do state that further investigation is needed to determine whether these biomarkers are useful for screening susceptible dog populations or for monitoring disease status during treatment and remission of B-cell lymphoma in dogs. Canine tears have also recently been examined14 using a combination of 2DE-PAGE and MALDI. This study reported that expression of actin, albumin and an unidentified protein in the tear film of dogs with cancer could serve as potential markers for the diagnosis and/or management of canine cancers.50
Analysis of proteins and peptides by MS is carried out using one of two strategies: the first is called a ‘top-down’ approach and the second a ‘bottom-up’ approach. Top-down refers to the determination of the structure of the protein directly without first breaking it into pieces by enzymatic digestion. Bottom-up refers to the reconstruction of the primary structure of the proteins from the sequences of the peptide fragments after digestion, either by comparison with databases or derived from the analysis of their mass spectra. Although the majority of these ‘top-down’ approaches have been successful in generating proteomic profiles that distinguish between disease states, many have failed to provide identification of the proteins associated with the biomarker ions detected. Previous reports have been successful in protein identification from enzymatic digests of serum or plasma proteins.51–53 This method overcomes the sensitivity and molecular weight limitations of protein detection by MALDI-MS, and adds protein identification capability through peptide amino acid sequence determination by a technique called tandem MS (MS/MS).
Our laboratory has tested a variety of clean-up methods to prepare the proteins or peptides for mass spectrometric analysis, as summarized in Fig. 2. We have developed an integrated ‘bottom-up’ approach involving the tryptic digestion of proteomic samples for rapid screening, the identification of peptide fragment biomarker ions and the generation of sequence information allowing the associated proteins to be identified41 (Fig. 3). Studies utilizing a variety of proteomic technologies for cancer biomarker identification are summarized in Table 1A–D.
Table 1. Summary of cancer biomarker studies and technologies used to study and identify the biomarkers
SELDI-TOF, C16 hydrophobic interaction protein chip. Genetic algorithms combined with cluster analysis
The discriminatory pattern correctly identified all 50 ovarian cancer cases in the masked set, including all 18 stage I cases. Of the 66 cases of nonmalignant disease, 63 were recognized as not cancer. This result yielded a sensitivity of 100% (95% CI: 93–100), specificity of 95% (87–99) and positive predictive value of 94% (84–99). There has been subsequent comment on the experimental design and statistical processing of the data.90,43 Proteins associated with the discriminating mass spectral ions were not identified
Early detection – canine lymphoma, one of the most common neoplasms in dogs, is similar to human non-Hodgkin’s lymphoma
2DE-PAGE separation and MALDI/ionisation TOF analysis
Thirteen lymph nodes from normal dogs and 11 lymph nodes from dogs with B-cell lymphoma. A total of 93 differentially expressed spots was subjected to MALDI/ionisation TOF–MS/MS analysis, and several proteins that showed differential expression were identified. Of these, prolidase (proline dipeptidase), triosephosphate isomerase, and glutathione S-transferase were downregulated in lymphoma samples, whereas macrophage capping protein was upregulated in the lymphoma samples
SELDI-TOF, WCX2 weak cation exchanger protein chip, decision tree algorithm
A preliminary training set of spectra derived from 31 primary ovarian cancer patients, 16 patients with benign ovarian diseases and 25 healthy women was used to develop a proteomic model that discriminated cancer from noncancer effectively. A four-peak model was established in the training set that discriminated cancer from noncancer with sensitivity of 90.8% and specificity of 93.5%. A sensitivity of 87.0% and a specificity of 95.0% for the blind test were obtained compared with 60.7%, 55% for CA125 for the same samples
SELDI-TOF, hydrophobic H4 chips, Kruskal–Wallis nonparametric test
Three protein peaks were identified in the serum of men with prostate cancer and BPH, but not in controls, with relative molecular masses of 15.2, 15.9 and 17.5 kDa. These three proteins were significantly associated with BPH and prostate cancer when compared with controls (P = 0.001, 0.004 and 0.011, respectively, Kruskal–Wallis test)
Combination of laser capture microdissection (LCM) with 2DE-PAGE
Twenty-three proteins were consistently differentially expressed between both the LMP and three invasive ovarian tumours in the limited study set. The 52 kDa FK506-binding protein, Rho G-protein dissociation inhibitor (RhoGDI), and glyoxalase I are found to be uniquely overexpressed in invasive human ovarian cancer when compared with the LMP form of this cancer
A 2-D liquid separation mass mapping method followed by identification of proteins using ESI-TOF MS for intact protein mapping and MALDI-TOF MS for peptide mass fingerprinting
Proteins associated with the metastatic phenotype included osteopontin and extracellular matrix protein 1, whereas the matrix metalloproteinase-1 and annexin 1 proteins were associated with the nonmetastatic phenotype
Imaging MALDI to examine sections of human glioblastoma compared with healthy tissue. Protein peaks were identified using ESI-MS/MS
An increase in expression of several proteins in the proliferating area was found compared with healthy tissue. LC–MS and MS/MS were used to identify thymosin β4, a 4964 Da protein found only in the outer proliferating zone of the tumour (Stoeckli M 2001). One of the known activities of this peptide is its ability to sequester cytoplasmic monomeric actin
Imaging MALDI was used on a few hundred cells from frozen sections and identification of key ions was carried out using electrospray quadrupole TOF MS (Q-Star)
They could accurately distinguish normal lung, adenocarcinoma, squamous cell carcinoma and large-cell carcinoma. Using biostatistical selection of differentially expressed peaks, they also identified 15 distinct MS peaks that were indicative of patients with good or poor prognosis. Three of the peaks were reported to be identified as SUMO protein, a ubiquitin-like protein that is conjugated to cellular regulatory proteins, including oncogenes and tumour suppressor genes
MALDI-MS imaging of glioma tumour biopsies; identification was carried out using a MALDI TOF/TOF mass spectrometer or a ThermoLTQ ion trap mass spectrometer
Protein patterns were identified that accurately classify glioma subtypes and distinguish patients into two prognostic groups, a short-term survival group and a long-term survival group. In addition, a well-characterized subset of patients with grade IV gliomas was identified whose protein patterns could predict differential survival
Non-small-cell lung cancer (NSCLC) treatment with epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors (TKIs)
MALDI-MS and a k-nearest neighbour (KNN) algorithm
A classification algorithm based on MALDI MS analysis of pretreatment serum and plasma was developed that could identify subgroups of NSCLC patients with improved time to progression and overall survival after treatment with the EGFR TKIs gefitinib and erlotinib
Protein profiles from mammary tumour virus/HER2 transgenic mouse frozen tumour sections were analysed after treatment with the erbB receptor inhibitors OSI-774 and Herceptin. These investigators were able to demonstrate that early changes in tumour protein profiles predict for dose- and time-dependent effects of signalling inhibitors. Drug-induced changes in the proteome also predicted for therapeutic synergy and drug resistance. Finally, in drug-sensitive tumours, the spatial distribution of the erbB tyrosine kinase inhibitor OSI-774 mapped with biomarkers of anti-tumour activity
CEM lymphoblastic leukaemia cells following treatment with bohemine, a cyclin-dependent kinase inhibitor
2D liquid phase separation coupled to MS
These authors reported that there was downregulation of three histone variants in response to bohemine indicating that antimitotic and anticancer activities of this compound may be associated with epigenetic regulation at the level of chromatin structure. Furthermore, crk-like adaptor scaffolding protein represents a new important protein family affected by bohemine
Identifying peptide/protein expression patterns that correlate with particular pathological/clinical phenotypes is contingent not only upon high-throughput proteomic screening technologies for data acquisition but also upon the implementation and coordinated integration of bioinformatic algorithms that are (1) capable of analysing large data sets and (2) endowed with sophisticated pattern recognition/detection capabilities. A candidate biomarker must be evaluated to assess its potential to provide an acceptable measure of the biological process to be monitored.
Preprocessing of data
Bioinformatic analysis of m/z and ion intensity data obtained using MS-based proteomics is complicated by many factors, such as technical and biological variability and also chemical and electronic noise that can result in baseline shifts and ‘noise’ in the data (Table 2). These may influence the data and show differences between sample groups that have no true biological meaning. For example, mass accuracy in MS is critically dependent on the instrumental configuration and calibration protocols used, and peak alignment may also have to be carried out when comparing multiple profiles over many samples (Fig. 3). It is therefore usually necessary to baseline correct, smooth, normalize and align mass spectral profiles as outlined in Fig. 3 before any disease-associated factors can be extracted by computational analysis. A recent and more detailed discussion of the pretreatment of mass spectral profiles can be found in Arneberg et al.54
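The preprocessing steps named above (baseline correction, smoothing and normalization) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Fig. 3: the rolling-minimum baseline, the moving-average smoother, the total-ion-current normalization, the window sizes and all function names are assumptions chosen for simplicity, and peak alignment is not shown.

```python
def subtract_baseline(intensities, window=5):
    """Estimate the baseline as a rolling minimum and subtract it.

    A rolling minimum is a crude baseline estimate (an assumption of this
    sketch); the window is truncated at the ends of the spectrum.
    """
    n = len(intensities)
    half = window // 2
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        corrected.append(intensities[i] - min(intensities[lo:hi]))
    return corrected


def smooth(intensities, window=3):
    """Moving-average smoothing; the window is truncated at the edges."""
    n = len(intensities)
    half = window // 2
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        segment = intensities[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def normalize_tic(intensities):
    """Scale a spectrum so that its intensities sum to 1."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [x / total for x in intensities]


def preprocess(intensities, baseline_window=5, smooth_window=3):
    """Baseline-correct, smooth and normalize one spectrum, in that order."""
    corrected = subtract_baseline(intensities, baseline_window)
    return normalize_tic(smooth(corrected, smooth_window))


if __name__ == "__main__":
    # Hypothetical intensities on a fixed m/z grid, with an offset of about 10.
    spectrum = [10, 10, 11, 30, 12, 10, 10]
    print(preprocess(spectrum))
```

Applying the same function with the same parameters to every spectrum in a study keeps the preprocessing itself from becoming a source of differences between sample groups.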
Table 2. The different bioinformatic approaches and their respective advantages and disadvantages
Advantages: Deals with different measurements associated with a distribution (1). Verbal terms can be used in the rules
Disadvantages: Not suitable for highly dimensional data systems
Principal components analysis (PCA)
Advantages: Data reduction technique (quality control); can be used to remove noise. Allows visual display of the data in 2D or 3D
Disadvantages: Cannot resolve nonlinearity. Visual display of high-dimensional data is difficult
Cluster analysis
Advantages: Can classify a data set into subsets that present similar characteristics, so that the closest are clustered together. Visualization of data structure. Simple. Results quite easy to understand
Disadvantages: Does not indicate how and why a group is clustered, or which parameter(s) drove the clustering. Copes poorly with noise
Artificial neural networks
Advantages: Deal with nonlinearity, high dimensionality and noisy data. Can be used for classification. Able to generalize a solution to other data sets
Disadvantages: Appear as a black box to biologists. This problem can be overcome by visualization
If patterns in spectral profiles are to have broader utility, reproducibility has to be assured across time, instruments and indeed platforms.55 Reproducibility can be assessed, and confidence in the profiles being compared increased, through the use of technical and experimental replicates, with filtering and averaging of samples among the commonly used methods. Lack of reproducibility decreases the validity of markers and makes validation and ultimately clinical use difficult.56 Low replication and poor data quality can introduce features that are representative not of disease but of the sample run or of sample collection, storage and preparation, and can also introduce random features into the data.
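As a rough illustration of how technical replicates can be used, the sketch below averages replicate spectra and computes a per-peak coefficient of variation (CV), keeping only peaks whose CV falls below a cut-off. It assumes the replicates have already been aligned so that position i is the same peak in every replicate; the function names and the 20% cut-off are hypothetical choices, not values taken from the studies cited here.

```python
from statistics import mean, stdev


def average_profile(replicates):
    """Average aligned replicate spectra peak by peak."""
    return [mean(values) for values in zip(*replicates)]


def replicate_cv(replicates):
    """Per-peak coefficient of variation (%) across replicates.

    Needs at least two replicates; peaks with a mean of zero get NaN.
    """
    cvs = []
    for values in zip(*replicates):
        m = mean(values)
        cvs.append(100.0 * stdev(values) / m if m else float("nan"))
    return cvs


def reproducible_peaks(replicates, max_cv=20.0):
    """Indices of peaks whose CV is at or below max_cv (hypothetical cut-off)."""
    return [i for i, cv in enumerate(replicate_cv(replicates)) if cv <= max_cv]


if __name__ == "__main__":
    # Three hypothetical technical replicates with three aligned peaks each.
    replicates = [
        [100, 50, 10],
        [110, 52, 30],
        [90, 48, 20],
    ]
    print(average_profile(replicates))
    print(replicate_cv(replicates))
    print(reproducible_peaks(replicates))
```

Peaks that fail such a filter are more likely to reflect the sample run than the disease, which is the concern raised in the paragraph above.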
Dimensionality and complexity of the data
One of the major challenges in bioinformatic analysis is the ability to handle highly dimensional (having a large number of parameters/variables) and complex MS data. Developing a model with good predictive performance for new blind cases (cases not used in the modelling process) would potentially require thousands of cases (as determined by power analysis). Clearly, this is not feasible because of limited sample availability in biomedical or veterinary settings. In practice, this problem is addressed by repeatedly testing the model on blind cases throughout the model development process. The modelling process is stopped when the model's predictions for these blind data are optimal. The model is then tested on further blind sample sets to validate it further.
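The stopping rule described above, in which model development halts once predictions on blind cases stop improving, is commonly implemented as early stopping. The sketch below is a generic version under stated assumptions: `train_step` and `blind_error` are hypothetical callbacks standing in for whatever model is being developed, and the `patience` parameter (how many rounds without improvement are tolerated) is not something specified in the text.

```python
def train_with_early_stopping(train_step, blind_error, max_rounds=100, patience=5):
    """Run training rounds until the error on blind cases stops improving.

    train_step:  callable that performs one round of model fitting.
    blind_error: callable that returns the current error on blind cases.
    Returns (best_round, best_error).
    """
    best_error = float("inf")
    best_round = 0
    for round_number in range(1, max_rounds + 1):
        train_step()
        error = blind_error()
        if error < best_error:
            best_error, best_round = error, round_number
        elif round_number - best_round >= patience:
            break  # no improvement for `patience` rounds: stop modelling
    return best_round, best_error


if __name__ == "__main__":
    # Simulated blind-set error curve: improves, then worsens (overfitting).
    simulated = iter([0.50, 0.40, 0.30, 0.25, 0.27, 0.30, 0.35, 0.40, 0.50, 0.60])
    result = train_with_early_stopping(lambda: None, lambda: next(simulated), patience=3)
    print(result)
```

Because the blind cases used for stopping have influenced the model, performance should still be confirmed on further blind sample sets that played no part in development, as the text notes.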
Another problem with the analysis of this data type is caused by the high dimensionality of the data (number of parameters) masking the real influence of the important markers. This has been termed ‘the curse of dimensionality’57 and can lead to irrelevant components being given high importance, invalidating the model. In practice, this problem is addressed by analysing one parameter/variable at a time and assessing its ability to predict the blind sample cohort. Recently, a gold standard for this analysis has been proposed by Michiels et al.58
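Analysing one variable at a time and scoring it on the blind cohort can be sketched as follows. This is a deliberately simple illustration, not the procedure of Michiels et al.: each variable gets a single threshold (the midpoint between the class means in the modelling set, an assumption of this sketch), and variables are ranked by the accuracy of that threshold on the blind samples. All function names are hypothetical.

```python
from statistics import mean


def fit_threshold(values, labels):
    """Threshold for one variable: midpoint between the two class means.

    Returns (threshold, higher_is_positive). Labels are 1 (disease) or 0.
    """
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    mean_pos, mean_neg = mean(positives), mean(negatives)
    return (mean_pos + mean_neg) / 2.0, mean_pos >= mean_neg


def predict(value, threshold, higher_is_positive):
    """Classify one value as 1 or 0 using the fitted threshold."""
    return int((value > threshold) == higher_is_positive)


def rank_variables(train_X, train_y, blind_X, blind_y):
    """Score each variable on its own by its accuracy on the blind cohort.

    Returns (variable_index, blind_accuracy) pairs, best first.
    """
    results = []
    for j in range(len(train_X[0])):
        threshold, direction = fit_threshold([row[j] for row in train_X], train_y)
        hits = sum(
            predict(row[j], threshold, direction) == y
            for row, y in zip(blind_X, blind_y)
        )
        results.append((j, hits / len(blind_y)))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: variable 0 is informative, variable 1 is not.
    train_X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.8], [3.3, 1.1]]
    train_y = [0, 0, 1, 1]
    blind_X = [[0.9, 1.0], [3.1, 5.0], [1.1, 4.9], [2.9, 1.2]]
    blind_y = [0, 1, 0, 1]
    print(rank_variables(train_X, train_y, blind_X, blind_y))
```

Scoring on samples that were not used to fit the threshold is what keeps an irrelevant variable from looking important by chance.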
Methodological challenges of SELDI-MS
Several investigators55,59,60,61 have re-analysed the publicly available data posted by Petricoin et al.,16 who co-authored a landmark paper that appeared in the British medical journal The Lancet in February 2002. The paper reported that MS, coupled with an artificial intelligence algorithm, could distinguish ovarian cancer from normal controls with 100% sensitivity (in other words, no false negatives) and 95% specificity (5% false positives). Diamandis55 has raised concerns about whether the SELDI/MALDI-based approaches are reproducible, whereas Sorace and Zhan59 and Baggerly et al.61 have raised concerns over study design bias. These criticisms can be grouped into three classes: (1) questions regarding the experimental design and the statistical processing of the data; (2) questions regarding the technical limitations of SELDI with linear TOF MS; and (3) concerns with respect to the biological meaning of the data obtained. To address these issues, future studies will need to be rigorously designed, to include assessment of platform reproducibility and to standardize analytical and other variables; these should be important elements of all discovery, prevalidation and validation efforts. Recognition of the enormity of the problem and of the potential benefits of success has brought international cooperation and coordination within the research community. For example, the Human Proteome Organisation is attempting to provide a comprehensive analysis of the proteins of human plasma and serum, annotate the entire human proteome and make the data publicly accessible.
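For readers less familiar with the terms, the sensitivity and specificity figures quoted above are computed from true and predicted class labels as in the sketch below. The counts in the example are invented so as to reproduce 100% sensitivity and 95% specificity; they are not the counts from the cited study, and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of cancers that are detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    return sensitivity, specificity


if __name__ == "__main__":
    # Illustrative only: 50 cancers, all detected; 100 controls, 5 misclassified.
    y_true = [1] * 50 + [0] * 100
    y_pred = [1] * 50 + [1] * 5 + [0] * 95
    print(sensitivity_specificity(y_true, y_pred))
```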
Validation protocols for MS-based proteomics
In our study, we have undertaken extensive investigations to ensure reproducibility (Fig. 4A) and remove bias in our clinical biomarker identification studies by investigating how pre-analytical, analytical and postanalytical factors (such as processing time, clotting times, centrifugation speeds, pre-aliquoting, storage temperature, the number of freeze–thaw cycles and serum sample preparation) affect protein/peptide biomarker profiles. Separation procedures, including ZipTip (a hydrophobic separation methodology in a pipette tip) reproducibility, MALDI sample crystallization and laser irradiation, all affect biomarker discovery studies, as reported in Matharoo-Ball et al.41 From blood ageing studies, we observed a correlation between the length of time the blood sample was left at room temperature and changes (degradation) in the serum protein and peptide profiles (Fig. 4B). In fact, as little as 4 h of incubation of blood at room temperature was seen to cause degradation of both proteins and peptides, as shown by the boxed areas in Fig. 4B. As a result of these studies, we have introduced standardized protocols for prospective sample collection and for the use of retrospective samples that have been subjected to no more than three freeze–thaw cycles, which were shown not to have detrimental effects on the protein or peptide profiles (data not shown). The flow diagram in Fig. 5 shows the stringent procedures followed in our laboratory to implement fully optimized and standardized protocols for MALDI-MS analysis for biomarker discovery in cancer and their subsequent validation. The major goal of plasma and serum proteome analysis is to obtain the most reliable information possible for diagnostic and therapeutic purposes.
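Pre-analytical criteria of this kind lend themselves to an automated acceptance check at sample intake. The sketch below is a hypothetical illustration only: the record fields and function names are invented, and the two limits are taken loosely from the observations above (degradation seen after 4 h at room temperature, and no more than three freeze–thaw cycles), not from a published acceptance protocol.

```python
from dataclasses import dataclass


@dataclass
class SerumSample:
    """Hypothetical intake record for one serum sample."""
    sample_id: str
    hours_at_room_temp: float
    freeze_thaw_cycles: int


def is_acceptable(sample, max_hours=4.0, max_freeze_thaw=3):
    """Accept a sample only if it stayed under both handling limits.

    The strict '<' on hours reflects that 4 h was already seen to cause
    degradation; both limits are assumptions of this sketch.
    """
    return (
        sample.hours_at_room_temp < max_hours
        and sample.freeze_thaw_cycles <= max_freeze_thaw
    )


def accepted_ids(samples):
    """IDs of the samples that pass the acceptance check."""
    return [s.sample_id for s in samples if is_acceptable(s)]


if __name__ == "__main__":
    samples = [
        SerumSample("S1", 1.0, 1),
        SerumSample("S2", 4.0, 2),  # too long at room temperature
        SerumSample("S3", 0.5, 4),  # too many freeze-thaw cycles
    ]
    print(accepted_ids(samples))
```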
Beyond identification, MS also facilitates quantification, an important requirement in clinical proteomics for comparing two or more samples for discriminating features. The identification of low-abundance proteins may often be hindered by the presence of abundant proteins and much of the improvement in sensitivity for proteomic analyses has come both from new MS instruments and from sample fractionation to reduce the complexity of proteins.
Isotope-coded affinity tagging
ICAT uses MS for protein separation and different isotope tags for distinguishing populations of proteins. ICAT reagents label cysteine residues contained within the protein during sample preparation. The labelled proteins are then either digested enzymatically or chemically, or separated using a gel-based system before digestion. The prepared sample is then subjected to MS/MS, which allows the amino acid sequence of the peptides to be identified and the proteins contained within complex mixtures to be quantified accurately relative to one another.
The fundamental principles of ICAT protocols are outlined in Fig. 6. In brief, two protein mixtures (protein mix 1 and protein mix 2) are labelled respectively with heavy (H) and light (L) biotinylated stable isotope labels that have identical chemical properties but different masses. The two protein samples are then combined, enzymatically digested and captured using the biotin tag present on the ICAT reagent. This reduces the complexity of the sample before LC-MS/MS analysis. LC-MS/MS analysis of the extracted peptide mixture allows the detection of peak pairs corresponding to peptides labelled with either the heavy or light ICAT reagents, which are easily distinguished by MS. The difference in mass observed between the heavy and light reagents means that identical peptides belonging to the same protein, but originating in different samples can be identified because of their difference in mass. ICAT technology has been used to identify biomarkers for a number of diseases and applications including cancer,41,62–64 Alzheimer’s disease,65,66 toxicity and neurotoxicity,11,33 neurotrauma,35 and analysing the plant proteome.67,68 However, ICAT technology is not without its limitations and these are listed in Table 3.
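The pairing of heavy and light peaks described above can be sketched as a search for peaks separated by the mass difference between the two labels, followed by an intensity ratio for each pair. This is a simplified illustration: it works on neutral peptide masses, assumes one label per peptide, and takes the mass shift as a parameter because the actual value depends on the reagent used. The function name, tolerance and example peaks are hypothetical.

```python
def find_icat_pairs(peaks, mass_shift, tolerance=0.01):
    """Find light/heavy peak pairs and their heavy/light intensity ratios.

    peaks:      list of (neutral_mass, intensity) tuples.
    mass_shift: heavy-minus-light mass difference for one label (reagent
                dependent; supplied by the caller, not fixed here).
    Returns a list of (light_mass, heavy_mass, heavy_over_light_ratio).
    """
    ordered = sorted(peaks)
    pairs = []
    for i, (light_mass, light_intensity) in enumerate(ordered):
        if light_intensity <= 0:
            continue  # cannot form a ratio against an empty light channel
        for heavy_mass, heavy_intensity in ordered[i + 1:]:
            if abs((heavy_mass - light_mass) - mass_shift) <= tolerance:
                ratio = heavy_intensity / light_intensity
                pairs.append((light_mass, heavy_mass, ratio))
    return pairs


if __name__ == "__main__":
    # Hypothetical peak list; 9.0 is used purely as an example mass shift.
    peaks = [
        (1000.50, 200.0),
        (1009.50, 400.0),
        (1500.75, 90.0),
        (1509.75, 30.0),
        (1800.00, 50.0),  # no partner: unlabelled or unpaired peptide
    ]
    for light, heavy, ratio in find_icat_pairs(peaks, mass_shift=9.0):
        print(light, heavy, ratio)
```

With protein mix 1 carrying the heavy label and protein mix 2 the light label, as in the text, a ratio above 1 suggests the peptide is more abundant in mix 1.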
Table 3. Limitations of ICAT technology
• ICAT reagents tend to be costly
• Because of the nature of the labelling procedure, proteins that do not contain cysteine residues or do not contain tractable cysteine-containing peptides upon enzymatic digestion will not be detected
• Coverage of differentially expressed proteins may be limited because of the high sample complexity and the data-acquisition rate of the mass spectrometer. Some of these difficulties are being overcome with advances in sample preparation, improved data acquisition schemes and better reagents
References: Moseley,99 MacCoss and Yates,100 Li and Zeng,101 Smolka et al.102
Isobaric tags for relative and absolute quantification
In addition to ICAT, other stable isotope coding techniques applied to quantitative proteomics have been reported, including a more recent approach that is analogous to ICAT, termed iTRAQ. iTRAQ reagents consist of a set of isobaric tags, each carrying a reporter group of a different mass, that attach at the N-terminus and the lysine side chains and allow for the simultaneous identification and quantification of up to four different samples. The iTRAQ methodology is similar to that of ICAT and a typical experimental protocol is outlined in Fig. 7. In brief, up to four different samples, or alternatively sample replicates, are reduced, alkylated and tryptically digested. One of the four available iTRAQ reagents is added to each sample before all the samples are combined. High-resolution LC fractionation is performed to separate the peptides before LC-MS/MS analysis. Protein identification and quantification are ultimately performed using ProQUANT software (Applied Biosystems, Foster City, CA, USA). Quantification of the peptides is made possible by fragmentation of the iTRAQ tag attached to the peptides. Upon fragmentation in MS/MS, a reporter ion (m/z 114–117) that is unique to the tag used to label each of the digests is generated, and its intensity can be measured, enabling relative quantification of the peptides in each digest. To date, a number of studies have used iTRAQ reagents, including studies in toxicology,69 inflammation,70 cancer62 and neurodegenerative disorders.71,72 iTRAQ overcomes a number of the problems associated with ICAT. Firstly, the method of labelling means that all peptides in a digest mixture should be labelled, so labelling is not restricted to cysteine-containing peptides. In addition, no loss of information will occur because of the post-translational modification of proteins.73 Finally, the multiplexing capacity means that replication can be performed within LC-MS/MS experiments, thereby providing statistical validation of the experiment.74
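Reporter-ion quantification as described above amounts to reading the intensities near m/z 114, 115, 116 and 117 in each MS/MS spectrum and expressing them relative to a chosen reference channel. The sketch below illustrates this under stated assumptions: it uses the nominal reporter m/z values given in the text with a generous tolerance, the function names are hypothetical, and it is not a description of how ProQUANT performs the calculation.

```python
REPORTER_MZ = (114.0, 115.0, 116.0, 117.0)  # nominal values, as in the text


def reporter_intensities(msms_peaks, tolerance=0.2):
    """Sum the intensity found near each nominal reporter m/z.

    msms_peaks: list of (mz, intensity) tuples from one MS/MS spectrum.
    Returns a dict keyed by the nominal channel (114, 115, 116, 117).
    """
    channels = {}
    for target in REPORTER_MZ:
        channels[int(target)] = sum(
            intensity for mz, intensity in msms_peaks
            if abs(mz - target) <= tolerance
        )
    return channels


def relative_to(channels, reference=114):
    """Express every channel as a ratio to the reference channel."""
    reference_intensity = channels[reference]
    if reference_intensity == 0:
        raise ValueError("reference channel has zero intensity")
    return {ch: value / reference_intensity for ch, value in channels.items()}


if __name__ == "__main__":
    # Hypothetical MS/MS peak list: four reporter ions plus sequence ions.
    spectrum = [
        (114.1, 1000.0),
        (115.1, 2000.0),
        (116.1, 500.0),
        (117.1, 1000.0),
        (175.1, 300.0),
        (244.2, 800.0),
    ]
    print(relative_to(reporter_intensities(spectrum)))
```

Ratios from all peptides assigned to the same protein would then be combined to give a protein-level estimate; how they are combined is a further design choice not covered here.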
Although iTRAQ is a relatively new technique, it is gaining in popularity compared with ICAT because of the advantages it offers over ICAT as outlined previously. A comparative study of these two techniques along with 2DE gel-based systems is given in a review by Wu et al.75
Once a suitable reagent and protocol have been defined for peptide analysis, the choice of mass spectrometric equipment needs to be addressed. For biological samples to be analysed by MS, they must first be converted to gas-phase desolvated ions in the ion source. The two most common ion sources for this are MALDI (as discussed earlier) and ESI. Automated LC systems are usually connected online with the mass spectrometer and separate the sample before its introduction into the ion source. Following ionization, mass spectral (MS) and tandem mass spectral (MS/MS) data may be obtained. Cost is often the key factor in equipment purchase, although some instruments lend themselves to certain types of analysis; for example, quadrupole ion traps are widely used for peptide sequencing and relative quantification studies, while triple quadrupole instruments are preferred for absolute quantification studies.76,77 A comprehensive background on the applicability of MS to proteomics can be found in Liebler et al.78 More detailed information about the different types of mass analysers can be found in other comprehensive reviews and papers.78–80 The LC-MS/MS approach has been utilized in combination with other techniques to identify a number of categories of peptides, including glycopeptides79–81 and phosphopeptides,82,83 as well as to identify tryptic components of proteomes and protein mixtures.84–87 More recently, there has been a surge in the number of published papers that use this technique in combination with reagents such as ICAT and iTRAQ, or that combine it with 2DE gel-based systems, to identify disease biomarkers.85
Clinical proteomics: from discovery to the clinic
Clinical proteomics is an interdisciplinary field and requires the involvement of clinicians, statisticians/bioinformaticians, epidemiologists, clinical and analytical chemists and biologists/biochemists from the beginning, with the different responsibilities clearly stated in any subsequent report. A clinical proteomic study would first need to address a well-defined clinical question or problem, followed by good experimental design, including the appropriate size of the study population, sample collection SOPs and the type of technology needed to analyse the samples. The protein discovery phase has to be seen as separate from the clinical proteomic validation phase, and the types and numbers of samples required may differ for these (Fig. 8). The explosion of interest in clinical proteomics is often associated with the completion of the human genome project. Proteomic technologies have opened the doors to techniques like MS, which within a decade has provided the means to analyse and identify large numbers (sometimes hundreds) of proteins within a single MS run with high confidence, whereas in the past the identification of a single protein was a challenge.
Although proteomics has proved its promise for biomarker discovery, further work is still required to enhance the performance and reproducibility of established proteomic tools before they can be used routinely in human or veterinary clinical laboratories. A number of technical obstacles remain before routine proteomic analysis can be achieved in the clinic. However, the standardization of methodologies and the dissemination of proteomic data into publicly available databases are beginning to overcome these hurdles. Furthermore, cost is also a critical factor for the widespread use of proteomics in clinical laboratories because of the expensive and complex instrumentation and consumables, and the substantial computing power required for data analysis. The process for MS-based biomarker or drug target discovery follows a long and complicated pipeline (Fig. 8). The challenge is that many factors may affect reproducibility, and biases can be introduced that affect the final results, for example, through the collection, processing and storage of samples, the type of instrumentation, and the limitations of bioinformatic tools and databases for protein identification, as summarized in Table 4. These challenges are increased by the inherent instability of proteins and the difficulty of detecting proteins that are present in small amounts in complex proteomes. A collective effort to define protocols for sample collection, to optimize the pre-analytical techniques and to further develop and improve bioinformatic tools and databases is underway.88,89 With rigorous and extensive replication of runs, MS is now the most robust component in the pipeline. Table 4 highlights the hurdles and challenges that will need to be addressed before MS-based biomarker discovery can even have a hope of being introduced into the clinic.
Table 4. Recommended criteria for clinical applications of protein/peptide profiling by MS
• Selection of anticoagulant for plasma (sodium heparin, potassium ethylenediaminetetraacetic acid or sodium citrate)
• Identification of optimum procedures for specimen collection and processing (e.g. transportation and processing time from clinic to laboratory; clotting time; centrifugation speeds; prealiquoting; storage temperature, e.g. −20°C; inclusion of protease inhibitors and minimizing freeze–thaw cycles)
• Criteria for specimen acceptability
• Instrumentation: SELDI, MALDI or ESI (spectral resolution depends on the platform: SELDI provides low-resolution spectra; MALDI generates high-resolution spectra because it can operate in both linear and reflectron modes; ESI provides MS/MS data but gives more complex spectral patterns because of multiply charged ions)
• Ensure no sample bias by randomization of samples before any manipulations
• Optimize type of sample fractionation and separation using multiple steps
• Sample and matrix deposition (dry droplet, sandwich or thin-layer methods)
• Prepare calibrators for mass, resolution and detector sensitivity (inclusion of internal or external calibrants, develop calibration compared with sample)
• Evaluate and optimize instrument parameters (automated or manual analysis of samples, number of laser irradiation shots; raster versus autoquality [hot spots]; number of profiles captured and number of replicates run to minimize spectral variation)
• Instrument variables: removal of bias by randomization of samples; reproducibility of spectra; manual versus automated acquisition of spectra; calibration shifts; baseline shifts because of deterioration of laser or detectors; optic contamination
• Acceptance criteria for QC and sample spectra (check QC spectra and cluster analysis; inspect and remove spectra below the accepted signal-to-noise ratio; if problems arise, check instrument parameters, operator bias [differences between operators in the preparation of samples], reagents and samples)
• Inclusion of QA procedures (e.g. documentation of reagent preparation, operator, sample processed and number of freeze–thaw cycles, QC checks, instrumentation parameters used, date of calibrants)
• Inclusion of QC (e.g. inclusion of blanks to check for contamination, QC to ensure preanalytical and analytical procedures are working correctly and BSA samples to ensure efficiency of digestion protocol)
• Evaluation of reproducibility (variation [calculation of CV] of spectral m/z and intensity within run and between runs; reproducibility of the same experiment in the same laboratory at different times and, more importantly, in another independent laboratory using different instrumentation)38
• Evaluate effects of sample storage and temperature41
• Evaluate limits of detection
• Evaluation of biological variation (Age, gender, socioeconomic background, diurnal rhythm, i.e. time of day blood is taken, use of tourniquet technique, type of collection [venipuncture, arterial puncture and skin puncture], and quantity of blood withdrawn)
• Develop programs for internal and external QA/QC procedures (we are participating with other laboratories to evaluate our procedures in conjunction with Durham University)
• Develop acceptance criteria for each QC and sample spectrum
• Develop analytical procedures for acceptance and assurance for QC (e.g. we have developed a cluster analysis for each run which ensures an early alert to any problems because of preanalytical, analytical or postanalytical procedures)
• Awareness of the limitations of the mass range of the instrument: for SELDI profiling, use only m/z above 3000; for MALDI-MS, use m/z above 1000 in linear mode and m/z above 800 in reflectron mode
• Introduction of modelling of data in a standard format using a defined algorithm
• Standard way for preprocessing of data, for example, baseline corrections; spectral alignment
• Ensure no bias from analysis (e.g. overfitting of data) by ensuring a sample set large enough to provide separate modelling, test and validation sets
• Provide evidence of ROC curves
• Application of predicted ions to the general population to give the same sensitivity and specificity as the original model
• Validation on an independent population with different demographics
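One of the criteria in Table 4 is evidence from ROC curves. As a minimal sketch of what that involves, the code below sweeps a decision threshold over a set of classifier scores, records the false-positive and true-positive rates at each threshold, and integrates the curve with the trapezoidal rule. The function names, scores and labels are hypothetical, and the sketch assumes both classes are present in the data.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (false_positive_rate, true_positive_rate).

    scores: classifier outputs, higher meaning more likely disease.
    labels: 1 for disease, 0 for control (both classes must be present).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical model scores for three patients and three controls.
    scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40]
    labels = [1, 1, 0, 1, 0, 0]
    curve = roc_points(scores, labels)
    print(curve)
    print(area_under_curve(curve))
```

Reporting the curve computed on an independent validation population, rather than on the modelling set, is what makes it evidence of the kind the final criteria in the table call for.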
Most proteomic studies have been based on the idea that a subset of proteins present in the blood reflects, reproducibly and specifically, a single disease at a particular stage. It is also assumed that we can reproducibly identify the biomarker in a complex biological fluid like blood. Unless we have knowledge of the composition and the dynamics of the blood proteome in healthy individuals, then important biomarkers will possibly be missed. Finally, one of the biggest hurdles in biomarker discovery studies is the lack of means to gauge the significance of the generated results and to measure success. Without independent, large validation studies in clinical trials, it is very difficult to determine whether the expression of the markers that are found to discriminate between groups will have clinical utility. Proteins identified in the discovery phase may not necessarily turn out to be the best diagnostic or therapeutic biomarkers. It is vital to precisely define a clinical problem and to focus the experimental design around appropriate study populations and samples. We have to accept that the move of this technology from the research laboratory to the human or veterinary clinic is a long process that will take time. Moreover, new targets or biomarkers will require rigorous study and validation before regulatory approval is granted. Unreasonable timescales to achieve this goal will be damaging to proteomics-based clinical studies.
The authors thank the European Commission for support of ENACT, an EU-funded 6th Framework programme (proposal no: 2004-503306), and The John and Lucille van Geest Foundation.