Elucidation of a protein signature discriminating six common types of adenocarcinoma

Authors

  • Gregory C. Bloom,

    1. Biostatistics Program, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
  • Steven Eschrich,

    1. Biostatistics Program, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
  • Jeff X. Zhou,

    1. Large Scale Biology Corporation, Germantown, MD
    2. National Institutes of Health, Rockville, MD
  • Domenico Coppola,

    1. Biostatistics Program, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
  • Timothy J. Yeatman

    Corresponding author
    1. Department of Interdisciplinary Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
    2. Department of Surgery, University of South Florida College of Medicine, Tampa, FL
    3. Department of Pathology, University of South Florida College of Medicine, Tampa, FL
    • H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Drive, SRB-2, Tampa, FL 33612, USA

    • Fax: +813-745-1433.


Abstract

Pathologists commonly face the problem of identifying the site of origin of a metastatic cancer when no primary tumor has been found, yet few markers have been identified to date. Multitumor classifiers based on microarray-based RNA expression have recently been described. Here we describe the first approximation of a tumor classifier based entirely on protein expression quantified by two-dimensional gel electrophoresis (2DE). 2DE was used to analyze the proteomic expression pattern of 77 histomorphologically similar adenocarcinomas encompassing 6 types or sites of origin: ovary, colon, kidney, breast, lung and stomach. Discriminating sets of proteins were identified and used to train an artificial neural network (ANN). A leave-one-out cross validation (LOOCV) method was used to test the ability of the constructed network to predict the single held-out sample in each iteration, yielding a maximum predictive accuracy of 87% and an average predictive accuracy of 82% over the range of proteins chosen for its construction. These findings demonstrate the use of proteomics to construct a highly accurate ANN-based classifier for the detection of an individual tumor type, as well as for distinguishing between 6 common tumor types in an unknown primary diagnosis setting. © 2006 Wiley-Liss, Inc.

Precise tumor diagnosis is the first step in cancer management since therapy generally stems from the initial tumor classification. While many tumor biopsies are diagnostic and form the cornerstone of cancer therapy, classification of tumor type and site of origin is a significant clinical challenge that is often underestimated. Distinguishing the most common metastatic adenocarcinomas (ovary, colon, kidney, breast, lung and stomach) from each other is one of the most vexing problems clinicians face today. In fact, it is estimated that up to 10% of all metastatic tumors have no defined primary site of origin.1 Moreover, adenocarcinomas represent 60% of all unknown primary tumor types.2 The current standard of pathologic practice, using morphologic criteria and semi-quantitative immunohistochemical (IHC) analyses, is often limited in its capacity to define tumor type or site of origin. Thus, there is a clear need for the identification and validation of a classification model that will cleanly distinguish these histologically similar tumor types and improve our capacity to direct therapy.

Gene expression profiling is a powerful tool that has shown promise in its capacity to discriminate subpopulations of tumors from heterogeneous groups.3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 We recently developed a prototype multitumor classifier capable of interrogating up to 21 different tumor types with an accuracy of ∼88%.14 The success with this approach led us to test the hypothesis that similar classifiers could be developed based on protein expression.

Using 2-D gel analysis combined with MALDI mass spectrometry to simultaneously assess 1,400 protein spots, we developed global protein expression profiles for 77 primary adenocarcinomas representing 6 different organ sites. We used a series of Wilcoxon rank-sum tests to generate 6 lists of proteins that effectively separated their associated tumors from the other 5 tumor types. A neural network was then constructed to develop a classifier to identify all 6 tumor types with a high degree of overall accuracy in a leave-one-out cross validation (LOOCV). Proteins have been identified by mass spectrometry that may serve as novel biomarkers for each disease site.

Material and methods

Clinical samples

Human tumor samples were obtained from the Moffitt Cancer Center Tumor Bank under IRB-approved protocols. Seventy-seven primary tumor samples were obtained from 6 different sites of origin. Tumors were not selected by subtype, and tumors of all differentiation grades were included to help ensure that any potential classifier would serve in a realistic clinical setting. All samples were obtained within 15 min of surgical extirpation and snap frozen in liquid nitrogen. Under frozen section control, tumor samples were then microdissected to >80% purity prior to protein extraction. Laser capture microdissection (LCM) was not applied to the samples, as we wanted to include stromal elements that we believe add critical signature information derived from the tumor cells interacting with their environment. Samples were distributed among 6 organ sites as follows: 10 ovary; 9 breast; 20 colon; 10 kidney; 10 lung and 18 stomach (Table I).

Table I. Organ Site, Histological Diagnosis, Tumor Subtype and Grade for Tumors Used in this Study
Organ site | Histological diagnosis | Subtype | Grade
Stomach | Adenocarcinoma | Signet ring | Poorly differentiated
Stomach | Adenocarcinoma | Diffuse | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Diffuse | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Diffuse | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Signet ring | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Diffuse | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Signet ring | Poorly differentiated
Stomach | Adenocarcinoma | Diffuse | Poorly differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Stomach | Adenocarcinoma | Intestinal type | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Mucinous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Well differentiated
Ovary | Adenocarcinoma | Mucinous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Ovary | Adenocarcinoma | Papillary serous cystadeno | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Kidney | Adenocarcinoma | Clear cell type | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Poorly differentiated
Lung | Adenocarcinoma | Non small cell | Moderately differentiated
Lung | Adenocarcinoma | Non small cell | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Moderately differentiated
Lung | Adenocarcinoma |  | Poorly differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Infiltrating ductal carcinoma | Moderately differentiated
Breast | Adenocarcinoma | Lobular carcinoma | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated
Colon | Adenocarcinoma | Invasive | Moderately differentiated

Sample preparation

Tumor samples were homogenized in 8 volumes of 9 M urea, 2% CHAPS, 0.5% dithiothreitol (DTT) and 2% carrier ampholytes pH 8–10.5. The homogenates were centrifuged at 420,000g at 22°C for 40 min (Optima™ L70-K ultracentrifuge, Type 50.4 Ti rotor, 50,000 rpm; Beckman Instruments, Palo Alto, CA). The supernatant was removed, divided into 4 aliquots and stored at −80°C until analysis.

Two-dimensional gel electrophoresis

Sample proteins were resolved using the ISO-200 and the DALT-100 components of LSBC's fully automated ProGEx™ system. The protein concentration of the tumor samples was measured using the BCA method in the absence of ampholytes. About 200 μg of solubilized sample was applied to each gel, and the gels were run in groups of 25 for 25,050 V hr using a logarithmically increasing voltage with a high-voltage programmable power supply. An Angelique™ computer-controlled gradient-casting system (Large Scale Biology Corporation, Germantown, MD) was used to prepare the second-dimension SDS slab gels. The top 5% of each gel was 8%T acrylamide and the lower 95% of the gel varied linearly from 8% to 15%T. The IEF gels were loaded directly onto the slab gels using an equilibration buffer with a blue tracking dye and were held in place with a 1% agarose overlay. Second-dimension 20 × 25 cm slab gels were run in groups of 25, with a run time of 2 hr at 600 V in cooled DALT tanks (20°C) with buffer circulation, and were taken out when the tracking dye reached the bottom of the gel. Following SDS electrophoresis, the slab gels were fixed overnight in 1.5 l/10 gels of 50% ethanol/3% phosphoric acid and then washed 3 times for 30 min in 1.5 l/10 gels of temperate DI water. They were transferred to 1.5 l/10 gels of 34% methanol/17% ammonium sulfate/3% phosphoric acid for 1 hr, and after the addition of 1 g powdered Coomassie Blue G-250 the gels were stained for 3 days to achieve equilibrium intensity.

Quantitative gel pattern analysis

Stained slab gels were digitized in red light at 100 μm resolution using an Ektron 1412 scanner, and images were processed using the Kepler® software system. A master pattern (USF209M2) was constructed from one of the best quality patterns and edited to include spots observed from all of the tissues. Three gels were run for each of the 77 tissue samples. The gel chosen to represent each sample was the one with the least horizontal and/or vertical streaking, little or no staining background, and no broken pieces. An experiment package was constructed using the best two-dimensional gel electrophoresis (2DE) pattern of each tissue sample, and each pattern was matched to the USF209M2 master to establish the correspondence of spots between patterns and to assign master numbers to spots. The pattern matching process included manual and automatic procedures. The single master gave adequate representation for all of the tissues and allowed all of the patterns to be matched together as a single unit, greatly simplifying the analysis. To correct for differences in loading and staining, the 77 patterns were scaled together by a linear procedure in which the summed abundance of a selected set of spots was set equal to a constant (linear scaling).
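The linear scaling step can be illustrated with a minimal sketch. It assumes each gel pattern is available as a mapping from master spot number to raw abundance; the function name, the reference spot set and the target constant are hypothetical and are not taken from the Kepler software.

```python
def scale_pattern(pattern, reference_spots, target_sum=1_000_000.0):
    """Linearly rescale one gel pattern so that the summed abundance of the
    reference spots equals target_sum (a hypothetical constant)."""
    ref_total = sum(pattern[s] for s in reference_spots)
    if ref_total <= 0:
        raise ValueError("reference spots have no measurable abundance")
    factor = target_sum / ref_total
    return {spot: abundance * factor for spot, abundance in pattern.items()}


# Two toy patterns of the same material, one loaded twice as heavily.
gel_a = {101: 200.0, 102: 300.0, 103: 50.0}
gel_b = {101: 400.0, 102: 600.0, 103: 100.0}
reference = [101, 102]

scaled_a = scale_pattern(gel_a, reference, target_sum=1000.0)
scaled_b = scale_pattern(gel_b, reference, target_sum=1000.0)
```

After scaling, the two toy patterns carry the same abundance for every spot, which is the effect a correction for loading and staining differences is meant to have.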

Protein spot analysis by mass spectrometry

Sample preparation for mass spectrometry.

Protein spots were excised from Coomassie-stained gels using an LSBC-designed, fully automated proprietary spot cutter and placed in a 96-well polypropylene microtiter plate for further processing. Sample preparation of gel plugs (destaining, reduction and alkylation, trypsin digestion) was carried out on a TECAN Genesis Workstation 200 (Tecan, Durham, NC) equipped with a carousel tower, a ROMA microtiter plate transport arm, a LIHA 8-tip liquid handler arm and 4 hotels for incubation in the dark at room temperature (RT) and at 37°C. The TECAN was controlled by 2 interactive pieces of software: Gemini software controlled the liquid handling and FACTS software controlled the scheduling. Samples were prepared using a previously described method.15 Briefly, gel plugs were destained by two 45-min cycles of 0.1 M NH4HCO3 (AmBic) in 50% CH3CN, and the wash was discarded. Reduction and alkylation were accomplished by dispensing 400 nmol DTT in 0.1 M AmBic and incubating at 37°C for 30 min in the dark. After cooling, 2.2 μmol iodoacetamide in 0.1 M AmBic was added and incubated at RT in the dark for 30 min. The supernatant was removed, and spots were washed with diHOH, then 100% MeCN was added and discarded after 15 min to dehydrate gel plugs. After a 5-min air dry, 62.5 ng of trypsin was added, plates were heat sealed and incubated overnight at RT. Peptides were manually extracted from the gel plugs and spotted onto MALDI target plates using the 96-tip CyBi-Well robot (CyBio, Woburn, MA). A fraction of each sample volume was deposited onto a 384-format Bruker 600 μm AnchorChip MALDI target followed by α-cyano-4-hydroxycinnamic acid matrix. Samples plus matrix were allowed to dry, followed by a wash with 1% TFA. The remainder of the samples was prepped for LC-MS/MS analysis using a Packard Multiprobe II EX liquid handling system (Perkin Elmer, Boston, MA). The remaining sample was transferred to narrow 96-well MTPs (220 μl), fresh extraction solution was added to the gel plugs for 30 min and the supernatant was transferred to the narrow 96-well MTP, leaving a final volume of 10 μl.

MALDI-TOF analysis.

MALDI targets were automatically run on a Bruker Biflex or Autoflex mass spectrometer. Both instrument models were equipped with delayed ion extraction, pulsed nitrogen lasers (10 Hz Biflex, 20 Hz Autoflex), dual microchannel plates and 2 GHz transient digitizers. All mass spectra represent signal averaging of 120 laser shots. The mass spectrometers provided sufficient mass resolution to produce isotopic multiplets for each ion species below m/z 3,000. Spectra were internally calibrated using 2 spiked peptides (angiotensin II and ACTH18–39) and database searched with a mass tolerance of 50 ppm.
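As a small illustration of what a 50 ppm tolerance means in a peptide mass search, the sketch below checks whether an observed m/z falls within a parts-per-million window around a theoretical value. The function name and the example m/z values are hypothetical and do not come from the search engines used in this study.

```python
def within_ppm(observed_mz, theoretical_mz, tolerance_ppm=50.0):
    """Return True when the relative mass error is inside the ppm window."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tolerance_ppm


# Hypothetical peptide at m/z 1500.00: an error of 0.06 is about 40 ppm,
# an error of 0.09 is about 60 ppm.
close_match = within_ppm(1500.06, 1500.00)
far_match = within_ppm(1500.09, 1500.00)
```

At m/z 1,500 the 50 ppm window corresponds to an absolute error of 0.075, so the first hypothetical peak is accepted and the second is rejected.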

LC-MS/MS analysis.

Samples that did not yield positive identifications by MALDI were subjected to LC-MS/MS analysis using an LCQ mass spectrometer. A proprietary microelectrospray interface similar to an interface described previously16 was employed. Briefly, the interface utilized a PEEK micro-tee (Upchurch Scientific, Oak Harbor, WA), into one stem of which a 0.025 in. platinum–iridium wire (Surepure Chemetals, Florham Park, NJ) was inserted to supply the electrical connection. Spray voltage was 1.8 kV. A 15-μm i.d. PicoTip spray needle (New Objectives, Cambridge, MA) was inserted into another stem of the tee and aligned with the MS orifice. A 10-cm microcapillary column packed with 5 μm reversed phase C18 Zorbax material was plumbed into the last stem of the tee. A 20 μl/min flow from a Microtech UltraPlus II 3-pump solvent delivery system (Microtech Scientific, Vista, CA) was reduced using a splitting tee to achieve a column flow rate of ∼400 nl/min. Samples were injected from an Endurance autosampler (Spark-Holland, The Netherlands) onto a trapping cartridge (CapTrap, Michrom BioResources, Auburn, CA) with pump C. Seven-minute reversed phase gradients from pumps A and B eluted peptides off the trap and capillary-LC column and into the MS. Spectra were acquired in automated MS/MS mode with a relative collision energy (RCE) preset to 35%. To maximize data acquisition efficiency, the additional parameters of dynamic exclusion, isotopic exclusion and “top 3 ions” were incorporated into the auto-MS/MS procedure. The scan range for MS mode was set at m/z 375–1,400. A parent ion default charge state of +2 was used to calculate the scan range for acquiring tandem MS.

MS data analysis

MS data were automatically registered, analyzed and searched against the appropriate public protein/genome databases using RADARS, a separate relational database provided by Proteometrics (acquired by Harvard Biosciences, Holliston, MA) and optimized in-house. For MALDI peptide mapping, the Mascot (Matrix Science, London, UK) and Profound (Harvard Biosciences) search engines were employed. Identifications were noted when search results were above the 95th percentile of significance in both Profound and Mascot. Mascot was used for peptide sequence searching of LC-MS/MS data, and scores above the 95th percentile were noted.

Identification of discriminating proteins

Identification of a relatively small number of proteins that can distinguish between different tumor types is a great challenge that is inherent in all large-scale biological assays. To avoid selecting a large list of proteins for the classifier in which many or all of the highly significant proteins distinguish only 2 or a few tumor types, the following approach was used. A series of 6 Wilcoxon rank-sum tests was performed, each comparing a single tumor type vs. the 5 remaining tumor types. This resulted in 6 lists of proteins that were subsequently sorted by p-value. To construct a classifier with n proteins, we simply chose the top-rated protein from each of the 6 lists, then continued to the second-rated protein from each list. This process was repeated until n proteins were chosen. This general method was used to choose any number of proteins needed in classifier construction.
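The one-vs.-rest ranking and round-robin selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the rank-sum p-value uses a normal approximation without tie correction of the variance, proteins already chosen from another list are skipped (the text does not say how duplicates were handled), and all function and variable names are hypothetical.

```python
import math
from itertools import zip_longest


def rank_sum_p(group, rest):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, midranks
    for ties, no tie correction of the variance)."""
    pooled = sorted([(v, 0) for v in group] + [(v, 1) for v in rest])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tie block
        i = j + 1
    n1, n2 = len(group), len(rest)
    w = sum(r for r, (_, label) in zip(ranks, pooled) if label == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2))


def ranked_lists(abundance, labels):
    """abundance: {protein: [value per sample]}; labels: tumor type per sample.
    Returns {tumor_type: [protein, ...]} sorted by ascending p-value."""
    lists = {}
    for tumor in sorted(set(labels)):
        pvals = {}
        for protein, values in abundance.items():
            inside = [v for v, lab in zip(values, labels) if lab == tumor]
            outside = [v for v, lab in zip(values, labels) if lab != tumor]
            pvals[protein] = rank_sum_p(inside, outside)
        lists[tumor] = sorted(pvals, key=pvals.get)
    return lists


def round_robin_select(lists, n):
    """Take the top protein of every list, then the second of every list, and
    so on until n distinct proteins are chosen (skipping repeats is an
    assumption of this sketch)."""
    chosen = []
    for tier in zip_longest(*lists.values()):
        for protein in tier:
            if protein is not None and protein not in chosen and len(chosen) < n:
                chosen.append(protein)
    return chosen


# Toy data: 3 tumor types, 3 samples each, 4 hypothetical spots.
labels = ["colon"] * 3 + ["lung"] * 3 + ["kidney"] * 3
abundance = {
    "spot_c": [9, 8, 7, 1, 2, 3, 2, 1, 3],  # high in colon
    "spot_l": [1, 2, 3, 9, 8, 7, 2, 3, 1],  # high in lung
    "spot_k": [2, 1, 3, 3, 1, 2, 9, 8, 7],  # high in kidney
    "spot_n": [4, 5, 6, 5, 4, 6, 6, 5, 4],  # uninformative
}
lists = ranked_lists(abundance, labels)
selected = round_robin_select(lists, 3)
```

In the toy data each tumor type contributes its own best-separating spot before any list contributes a second one, which is the property that keeps a single easily separated pair of tumor types from dominating the selected set.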

Artificial neural network construction

To understand the influence of different artificial neural network (ANN) architectures, we constructed an automated script that allowed us to easily create a series of ANN architectures based on user-supplied input parameters. For this work, we chose to start with 60 input nodes and sequentially increase the number of input nodes by 30 until 600 input nodes were reached. This range was chosen for 2 reasons. The lower boundary limited the number of input nodes at the beginning to relatively few, so that the effect of the most useful proteins would not be obscured by the noise of proteins included by random chance. The upper boundary allowed a large number of proteins to be used in the classifier in the event that many proteins each contributed a relatively small amount to the overall ability of the classifier to select the correct class. In addition, the problem of choosing the “right” number of hidden nodes for an ANN is intractable. Therefore, to determine the effect of the number of hidden nodes on classification accuracy, we evaluated 3 different formulas for calculating the number of hidden nodes. Each formula simply divides the number of input nodes by a given value (5, 10 or 20) to determine the number of hidden nodes used in the ANN construction.
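The sweep described above can be enumerated in a few lines. The sketch below is only an illustration of the stated ranges: the function name is hypothetical, and rounding the hidden-node count down to an integer is an assumption, since the text does not state how non-integer quotients were handled.

```python
def ann_architectures(start=60, stop=600, step=30, divisors=(5, 10, 20)):
    """Enumerate (input_nodes, hidden_nodes) pairs for the described sweep."""
    configs = []
    for n_inputs in range(start, stop + 1, step):
        for divisor in divisors:
            # Integer division is an assumption of this sketch.
            configs.append((n_inputs, n_inputs // divisor))
    return configs


architectures = ann_architectures()
n_architectures = len(architectures)  # 19 input sizes x 3 divisors
```

Nineteen input sizes combined with 3 divisors give 57 configurations, which agrees with the number of network architectures reported in the Results.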

Leave-one-out cross validation

Owing to the limited number of samples in this study, we used LOOCV to assess the accuracy of any constructed classifiers. LOOCV can in some cases be slightly optimistic, and an independent test set will be needed for any further validation. It should be noted, however, that we performed a “complete” analysis for each sample, meaning that both the protein selection procedure and the subsequent ANN training steps were performed for each fold.
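The point of a “complete” LOOCV is that the held-out sample influences neither feature selection nor training. A minimal sketch of that loop is given below. The selector and the nearest-centroid classifier are simple stand-ins chosen to keep the example self-contained; the study itself used rank-sum lists and an ANN, and all names here are hypothetical.

```python
def loocv_accuracy(samples, labels, select, fit):
    """Leave-one-out cross validation in which feature selection and model
    fitting both see only the training fold.

    select(train_x, train_y) -> list of feature indices
    fit(train_x, train_y)    -> function mapping a feature vector to a label
    """
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        keep = select(train_x, train_y)  # never sees sample i
        reduced_train = [[row[j] for j in keep] for row in train_x]
        predict = fit(reduced_train, train_y)
        held_out = [samples[i][j] for j in keep]
        correct += predict(held_out) == labels[i]
    return correct / len(samples)


def select_all(train_x, train_y):
    """Stand-in selector: keep every feature."""
    return list(range(len(train_x[0])))


def fit_nearest_centroid(train_x, train_y):
    """Stand-in classifier: assign the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
        )

    return predict


toy_x = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [5.0, 4.8], [5.2, 5.1], [4.9, 5.0]]
toy_y = ["kidney", "kidney", "kidney", "lung", "lung", "lung"]
accuracy = loocv_accuracy(toy_x, toy_y, select_all, fit_nearest_centroid)
```

Running the selector inside the loop is the design choice that matters: selecting features once on all samples and then cross-validating only the classifier would leak information from the held-out sample and inflate the estimate.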

Results

Protein separation and identification

A total of 77 primary adenocarcinoma tissue samples were analyzed in this study. For each sample, at least 3 gels were run and the gel with the best quality was chosen to represent that sample in the subsequent analysis. The reproducibility of the gels for each sample was examined by measuring the coefficient of variation (CV) of 100 selected protein spots from those gels. A distribution of CV was plotted for each sample. For most samples, more than 80% of the spots had CV ≤ 10%. About 1,400 protein spots were visualized following staining of 2D gels. Figure 1 shows a typical 2DE image produced from a human kidney cancer specimen. The edited Master Pattern USF209M2 contains 1,420 protein spots covering the protein spots across all 77 patterns. Using MALDI or LC-MS/MS, 650 protein spots were identified. On the basis of the pI and molecular weight (MW) information from the identified protein spots, each 2DE pattern was calibrated for its pI (4–7) and MW ranges. Mass spectrometry analysis produced identifications for 69 of the 173 unique proteins selected from the ANNs with between 330 and 570 inputs. Only protein abundances were assessed. Occasionally, multiple identifications of the same protein were observed; these could represent identical isoforms, subtle differences arising from posttranslational modifications such as glycosylation or phosphorylation, or even partial degradation.
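The reproducibility check can be expressed as a short calculation. The sketch below assumes the CV is the sample standard deviation divided by the mean across replicate gels, expressed as a percentage; whether the sample or population standard deviation was used is not stated, and the spot names and abundances are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent across replicate measurements of one spot."""
    return stdev(values) / mean(values) * 100.0


def fraction_within(replicates_by_spot, threshold=10.0):
    """Share of spots whose CV is at or below the threshold (in percent)."""
    cvs = [coefficient_of_variation(v) for v in replicates_by_spot.values()]
    return sum(cv <= threshold for cv in cvs) / len(cvs)


# Hypothetical abundances of four spots on three replicate gels.
replicates = {
    "spot_a": [100.0, 104.0, 96.0],
    "spot_b": [50.0, 52.0, 48.0],
    "spot_c": [20.0, 30.0, 10.0],
    "spot_d": [10.0, 10.5, 9.5],
}
share_reproducible = fraction_within(replicates)
```

In this toy set three of the four spots have a CV at or below 10%, so the reproducible share is 0.75.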

Figure 1.

A representative 2-DE gel used as the master gel for spot matching in this study. Protein sample from kidney adenocarcinoma was prepared and 2-DE was performed as described in the Material and Methods section. The pI (4–7) and the molecular weight were calibrated using Kepler® software. About 1,400 protein spots were analyzed for each gel using Kepler software.

ANN results

Figure 2 summarizes the accuracies of all 57 ANN architectures. As can be seen from the figure, the accuracy of the network on the left-out sample trends upward as the number of inputs increases, until the number of inputs approaches 540. The increase in accuracy reflects the increase in information contained in the proteins as they are added to the ANN. Once the amount of noise in the additional proteins outweighs the information content, the network performance suffers. It is also important to note that the number of hidden units as a function of input nodes seemed to have little effect on the overall network accuracy. The accuracy of the ANN, when given enough proteins for classification, reaches 87%, with a mean accuracy of 82% across all configurations. Because of the limited sample numbers, an independent test approach could not be used for this study. As can be seen from the confusion matrix (Table II), the incorrectly classified samples were distributed fairly evenly across the tumor types, demonstrating that the network performed comparably well at classifying all 6 tumor types. This result is comparable to that in a previous work by us using microarray technology14 and represents a potential new approach to tumor classification of unknown primary cancers.

Figure 2.

ANN accuracy across various network configurations. Y-axis is percentage accuracy for the network and X-axis is the number of input nodes, i.e., the number of proteins used in the ANN. The number of hidden nodes for each of the network constructs was derived using the formula hidden nodes = (input nodes)/x, where x = 5, 10 or 20.

Table II. Confusion Matrix of Classification for the ANN Showing the Highest Overall Accuracy From the 57 Network Architectures
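Classified as: | (a) | (b) | (c) | (d) | (e) | (f)
(a) Ovary | 8 | 0 | 1 | 0 | 1 | 0
(b) Breast | 1 | 7 | 0 | 0 | 1 | 0
(c) Colon | 0 | 0 | 18 | 0 | 0 | 2
(d) Kidney | 0 | 0 | 0 | 10 | 0 | 0
(e) Lung | 0 | 2 | 0 | 0 | 7 | 1
(f) Stomach | 0 | 0 | 1 | 0 | 0 | 17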
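
The overall accuracy can be recomputed from the entries of Table II. The sketch below assumes that rows are the actual tumor types and columns the assigned types; this reading is supported by the row sums, which match the sample counts per organ site given in the Clinical samples section. Variable names are hypothetical.

```python
classes = ["ovary", "breast", "colon", "kidney", "lung", "stomach"]

# Rows: actual tumor type; columns: assigned type, in the order of `classes`.
confusion = [
    [8, 0, 1, 0, 1, 0],
    [1, 7, 0, 0, 1, 0],
    [0, 0, 18, 0, 0, 2],
    [0, 0, 0, 10, 0, 0],
    [0, 2, 0, 0, 7, 1],
    [0, 0, 1, 0, 0, 17],
]

total = sum(map(sum, confusion))
correct = sum(confusion[i][i] for i in range(len(classes)))
overall_accuracy = correct / total
per_class = {
    name: confusion[i][i] / sum(confusion[i]) for i, name in enumerate(classes)
}
```

With 67 of 77 samples on the diagonal, the overall accuracy is about 87%, in agreement with the maximum accuracy reported in the text.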

Description of selected proteins

Analysis was performed to gain insight into the proteins selected by our approach for each of the 6 classes. Table III details, for each tumor type, the percentage of its selected proteins that were also selected for each of the other tumor types. Surprisingly, all of the proteins selected for distinguishing breast cancer from all other tumor types were also present in the colon vs. all set. Not surprising, however, was the relatively low overlap between kidney and the other 5 tumor types, reflecting the uniqueness of kidney tumors in general. Of the 6 types of adenocarcinomas represented in this study, kidney adenocarcinomas were the most histologically distinct. For the majority of tumor types, the percentage of proteins shared with another tumor type ranged from 3% to 10%. Figure 3 is a histogram summary of the number of discriminating proteins identified vs. the number of those proteins that were unique to that tumor.

Figure 3.

Histogram of total vs. unique proteins used per tumor type in the ANN unknown primary classifier. Y-axis is the number of proteins, X-axis is the tumor type. Breast had no unique proteins; all were found in the set selected for colon.

Table III. Percentage of Overlap Between Proteins Selected for Use in the ANN
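% Overlap | Colon | Breast | Kidney | Lung | Ovary | Stomach
Colon | 100.0 | 82.2 | 4.3 | 0.0 | 6.4 | 4.3
Breast | 100.0 | 100.0 | 5.7 | 0.0 | 5.7 | 5.7
Kidney | 4.1 | 4.1 | 100.0 | 2.0 | 6.1 | 0.0
Lung | 0.0 | 0.0 | 3.7 | 100.0 | 3.7 | 3.7
Ovary | 9.1 | 6.1 | 9.1 | 3.0 | 100.0 | 15.2
Stomach | 5.6 | 5.6 | 0.0 | 2.8 | 13.9 | 100.0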
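
An overlap table of this kind is not symmetric, because each row is expressed relative to the size of its own protein set. The sketch below shows one way such percentages could be computed; it assumes each entry is the share of the row tumor's proteins that also appear in the column tumor's set, which is an interpretation rather than a stated definition, and the protein identifiers are hypothetical.

```python
def overlap_percent(selected):
    """selected: {tumor_type: set of protein ids}.
    table[a][b] is the percentage of a's proteins that also appear in b's set."""
    return {
        a: {b: 100.0 * len(pa & pb) / len(pa) for b, pb in selected.items()}
        for a, pa in selected.items()
    }


toy_sets = {
    "colon": {"p1", "p2", "p3", "p4"},
    "breast": {"p1", "p2"},
    "kidney": {"p4", "p9"},
}
table = overlap_percent(toy_sets)
```

In the toy sets every breast protein is also a colon protein, so the breast row shows 100% overlap with colon while the colon row shows only 50% overlap with breast, mirroring the asymmetry between the breast and colon rows of Table III.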

A total of 1,412 proteins were used in the selection process. Proteins were selected based on differences in protein abundance rather than based on other potential differences such as those induced by posttranslational modification, which were not measured. We identified 227 total proteins when selecting those ANN architectures containing between 330 and 570 input nodes. Of these, 173 proteins were unique to individual tumor types. From this set, 69 have had their identities confirmed via mass spectrometry. Supplementary Tables 1–6 list identified proteins for each of their respective classes with the unique proteins for each class shown in bold. These proteins represent candidate tumor biomarkers for each tumor type.

Discussion

A number of adenocarcinomas still pose significant diagnostic challenges to pathologists and practicing clinicians. The unknown primary cancer, often an adenocarcinoma metastatic to sites like the liver or lungs, represents a substantial number of cases worldwide and is highly problematic. Similarly, discriminating primary tumors from organs such as the ovary versus the colon can be difficult. Since therapy still stems from the organ site-based diagnosis, correct identification of the site of origin of a cancer is clinically valuable. To address these problems, we have profiled a significant number of human tumors using a global protein approach. We have constructed a protein-based ANN classifier that, as assessed by LOOCV, is very accurate in the classification of 6 common tumor types: lung, kidney, breast, colon, ovary and stomach. A series of Wilcoxon rank-sum tests was used to identify a discriminating set of proteins. The use of this one vs. all approach coupled with our selection method, while simple in application, was key to our ability to construct a successful classifier and allowed us to test for proteins that separated a single group from all remaining classes. Most importantly, this approach prevented the selection of a group of features that, although the most statistically significant, contained a large number of proteins that discriminated one type of tumor from only one or a few of the other types.

Analysis of the selected proteins revealed the utility of this approach, as most of the tumor classes shared only a small percentage of the proteins selected. The exception was breast and colon, where all breast proteins were also contained in the colon set. The multitumor classifier, however, was still able to successfully separate these 2 groups. The most likely explanation is that, although the 2 groups shared the same proteins, the protein levels differed enough for correct classification. Ultimately, this approach should prove useful in future data analysis involving high dimensional multiclass data, a global problem in bioinformatics and proteomics in particular.

In addition to our novel feature selection approach, we examined the effect of ANN architecture on its ability to correctly classify the 6 tumor types. This was done by using an automated scripting approach that generated a series of networks with increasing numbers of input nodes and differing numbers of hidden nodes. In all, 57 networks were constructed, trained and tested using a LOOCV approach. This analysis allowed us to determine the correct architecture for use in our classifier. As can be seen in Figure 2, network architecture had influence on classification accuracy, though not to a great extent. Additionally, the number of hidden units, calculated as a fraction of the input nodes, seemed to have little effect.

Whether this classifier will perform as well on an independent testing set of tumors remains to be answered. While 2DE gel analysis was used to identify discriminating proteins, it is not easily adaptable to a testing environment where many samples can be assessed. However, with these initial protein identifications in our prototype classifiers, it will now be feasible to validate their performance with complementary platforms such as tissue microarrays, which will be the subject of future investigations. Of the identified proteins, some may prove to have benefit as protein biomarkers. From the published literature, we can find evidence validating a few of the identified proteins as biomarkers for specific tumor types. For example, cytokeratin 7 was identified as unique to lung cancer and this, in fact, is a commonly used biomarker to discriminate lung from colon cancer.17 Our finding also suggests that this biomarker discriminates lung cancer from all 5 other types assessed. Similarly, cytokeratin 19 and vimentin have both been previously associated with kidney cancer.18 These studies support the potential for the proteins we have identified to serve as novel candidate protein biomarkers.

An additional possible limitation of this work is the ability of a classifier necessarily constructed from known primaries to accurately predict unknown primary tumors, i.e., metastases. In our work using microarray data, we were able to identify metastatic tumors starting with primaries, as well as to identify primaries when the classifier was built using metastatic tumors. In addition, Tothill et al. were able to classify metastases of unknown origin using a cDNA-based SVM classifier constructed with data from 13 primary tumor types.19 These results from the microarray field, while not conclusive, lend support to the idea of using proteomic data from primary tumors to develop an accurate classifier to address the important unknown primary problem in cancer diagnosis.

Acknowledgements

J.X.Z. was an employee of Large Scale Biology Corp., whose ProGEx™ system was used in the work presented in this article. The authors certify that they have not entered into any agreement that could interfere with their access to the data on the research, nor upon their ability to analyze the data independently, to prepare manuscripts, and to publish them. The authors thank Ms. Lindsay J. Rodzwicz and Ms. Anita C. Bruce for their editorial assistance in preparing this manuscript.
