Keywords:

  • transcriptional networks;
  • gene expression;
  • disease diagnosis;
  • lung cancer

Abstract

  1. Top of page
  2. Abstract
  3. MATERIALS AND METHODS
  4. RESULTS
  5. DISCUSSION
  6. CONFLICT OF INTEREST DISCLOSURES
  7. REFERENCES

BACKGROUND:

Transcriptional networks play a central role in cancer development. The authors described a systems biology approach to cancer classification based on the reverse engineering of the transcriptional network surrounding the 2 most common types of lung cancer: adenocarcinoma (AC) and squamous cell carcinoma (SCC).

METHODS:

A transcriptional network classifier was inferred from the molecular profiles of 111 human lung carcinomas. The authors tested its classification accuracy in 7 independent cohorts, for a total of 422 subjects of Caucasian, African, and Asian descent.

RESULTS:

The model for distinguishing AC from SCC was a 25-gene network signature. Its performance on the 7 independent cohorts achieved 95.2% classification accuracy. Even more surprisingly, 95% of this accuracy was explained by the interplay of 3 genes (KRT6A, KRT6B, KRT6C) on a narrow cytoband of chromosome 12. The role of this chromosomal region in distinguishing AC and SCC was further confirmed by the analysis of another group of 28 independent subjects assayed by DNA copy number changes. The copy number variations of bands 12q12, 12q13, and 12q12-13 discriminated these samples with 84% accuracy.

CONCLUSIONS:

These results suggest the existence of a robust signature localized in a relatively small area of the genome, and show the clinical potential of reverse engineering transcriptional networks from molecular profiles. Cancer 2011. © 2010 American Cancer Society.

Lung cancer is the leading cause of cancer mortality, with >1.3 million deaths per year worldwide.1 More than 80% of lung cancers are nonsmall cell lung carcinoma (NSCLC). Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are the 2 most common subtypes of NSCLC and, together, account for >60% of lung cancer cases.2 AC and SCC are categorized together in NSCLC because of the similar microscopic appearance of their tumor cells and their similar treatment options in the clinic. Nonetheless, AC and SCC are heterogeneous in many clinical aspects. AC responds to chemotherapy better than SCC,3 but it has a greater tendency to relapse in the form of distant metastases than SCC.4 After surgical resection, AC has higher rates of recurrence and mortality than SCC5 in Western countries, but in East Asia AC has a better prognosis.6 Unfortunately, the histological identification of tumor cells with a recognizable morphological pattern is partly subjective,7 and can become particularly difficult for small-sized tumors at early stages8 or for patients who suffer from multiple types of primary lung carcinomas.9 Even more importantly, the emergence of individualized therapeutic strategies for NSCLC based on defect-targeted drugs, such as gefitinib,10 requires the creation of molecular profiles to categorize tumors according to their underlying molecular characteristics rather than their histology or location. Targeted therapy in Asian nonsmoking women has been shown to be more effective for AC than for SCC,6 and personalized medicine is expected to develop more therapeutic strategies specific to these carcinomas.11

Over the past decade, high-throughput gene expression analysis has delivered on its promise to revolutionize our understanding of cancer12 through the identification of new tumor classes, the development of genomic prognostic models, and the discovery of new therapeutic targets. In more recent years, advances in systems biology have used the comprehensive transcriptional landscape offered by microarrays to go beyond the phenomenological signatures of cancer tissues and to identify the transcriptional networks that coordinate the expression of tumor genes.13, 14 These transcriptional networks capture regulatory interactions between genes and explain the processes underpinning tumorigenesis,15, 16 rather than revealing signatures of a particular phenotype. But the 2 approaches are not as antithetic as they may appear. Here we reconcile them by describing a systems biology approach to cancer classification based on the reverse engineering of the transcriptional network that discriminates AC from SCC. Intuitively, we can regard such a transcriptional network classifier as a gene network perturbed by the presence of the phenotype. The phenotype is treated as a binary perturbation of the overall transcriptional network so that, to reconstruct the classifier from expression profiles, we only need to infer the transcriptional network surrounding the phenotype.

To model this classifier, we use a multivariate analysis method known as Bayesian networks. Bayesian networks have been extensively used to analyze several types of genomic data, including gene regulation,17, 18 protein-protein interactions,19, 20 single nucleotide polymorphisms,21 and pedigrees.22 The application of our network classifier to clinical data will show its superior performance in classifying lung AC and SCC.

MATERIALS AND METHODS

Gene Expression Data

This research analyzed the gene expression data of primary lung tumors. The training data comprised 58 ACs and 53 SCCs (Gene Expression Omnibus [GEO] accession number GSE3141). The independent validation data consisted of the following: (1) 58 AC samples from Italy (GEO GSE10072); (2) 27 AC samples of Taiwanese origin (GEO GSE7670); and (3) 5 American populations (GEO GSE12667, GSE4824, GSE2109, GSE4573, GSE6253) for a total of 147 ACs (132 Caucasians, 9 African descent, 2 Asian descent, 4 other) and 190 SCCs (167 Caucasians, 3 African descent, 20 other). Except for the Michigan data, for which only preprocessed intensity levels were available, all data sets had raw CEL files available. We adopted the Affymetrix MAS 5.0 algorithm to process the CEL files. The raw expression intensities were scaled to 500 and log transformed. The data sets from Duke, Washington University, and the International Genomics Consortium (http://expo.intgen.org) were collected with the Affymetrix HG-U133Plus2.0 platform, whereas the remaining data sets were collected with the Affymetrix HG-U133A platform. We treated the HG-U133A platform as the base and used the batch query tool provided by Affymetrix to match the probe identifiers of the HG-U133Plus2.0 platform to those of HG-U133A.
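The scaling and log-transformation step can be sketched as follows. This is only an illustration under stated assumptions: the text says intensities were "scaled to 500 and log transformed" but does not give the details, so the 2% trimmed mean, the base-2 logarithm, the function name `scale_and_log`, and the sample intensities are all assumptions made for the example.

```python
import math

def scale_and_log(intensities, target=500.0, trim=0.02):
    """Scale one array so its trimmed mean equals `target`, then take log2.

    Assumptions (not stated in the text): a 2% trim on each tail and a
    base-2 logarithm. All intensities must be positive.
    """
    ordered = sorted(intensities)
    k = int(len(ordered) * trim)            # number of values dropped per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    trimmed_mean = sum(kept) / len(kept)
    factor = target / trimmed_mean          # per-array scaling factor
    return [math.log2(x * factor) for x in intensities]

# Hypothetical raw probe-set intensities for a single array.
raw = [120.0, 250.0, 500.0, 1000.0, 4000.0]
logged = scale_and_log(raw)
```

Scaling each array to a common target before the log transform puts arrays from different cohorts on a comparable intensity range, which matters when a classifier trained on one cohort is applied to others.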

Transcriptional Network Construction

We modeled the transcriptional network classifier with the Bayesian networks framework,23 which started with gene selection followed by gene network learning. Gene selection relied on a statistical score, called the Bayes factor, which evaluated for each gene the ratio of its likelihood of being dependent on the phenotype to its likelihood of being independent of the phenotype. When the Bayes factor was >1, the gene was selected because it was more likely to be dependent on the phenotype than to be independent of it. The gene network learning step entailed searching for the most likely modulator of each gene, where each gene is modulated by another gene or by the phenotype. Figure 1 depicts the resulting network representing the training data, where the rectangular node denotes the subtype variable, the elliptic nodes denote genes, and the directed arcs encode the conditional probabilities of the target nodes given the source nodes.
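The gene selection rule can be illustrated with a small sketch. The text does not specify the likelihood model behind the Bayes factor, so this example swaps in a common stand-in: Gaussian expression models compared through the BIC approximation to the log Bayes factor. The function names and the toy expression values are hypothetical.

```python
import math

def gaussian_loglik(values):
    """Maximized Gaussian log-likelihood of a list of values."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def log_bayes_factor(expr, labels):
    """Approximate log Bayes factor: 'dependent on subtype' vs 'independent'.

    Assumption: BIC approximation, log BF ~ (BIC_independent - BIC_dependent) / 2.
    The dependent model fits one Gaussian per subtype, the independent
    model fits a single Gaussian to all samples.
    """
    n = len(expr)
    groups = {}
    for x, y in zip(expr, labels):
        groups.setdefault(y, []).append(x)
    ll_dep = sum(gaussian_loglik(g) for g in groups.values())
    ll_ind = gaussian_loglik(expr)
    bic_dep = -2 * ll_dep + 2 * len(groups) * math.log(n)  # mean + variance per subtype
    bic_ind = -2 * ll_ind + 2 * math.log(n)                # one mean + one variance
    return 0.5 * (bic_ind - bic_dep)

labels = ["AC"] * 4 + ["SCC"] * 4
marker = [2.0, 2.1, 1.9, 2.2, 8.0, 8.1, 7.9, 8.2]   # hypothetical subtype-specific gene
flat = [5.0, 5.1, 4.9, 5.2, 5.0, 5.1, 4.9, 5.2]     # hypothetical uninformative gene

# A gene is kept when its Bayes factor exceeds 1, i.e. its log is positive.
selected = [name for name, expr in (("marker", marker), ("flat", flat))
            if log_bayes_factor(expr, labels) > 0]
```

Working on the log scale avoids overflow for strongly discriminating genes, and the threshold of 1 on the Bayes factor becomes a threshold of 0.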

Figure 1. The Bayesian network model encoding the dependence relations among the subtype variable and the genes is shown. For each gene, its likelihood of dependence on the subtype variable or on another gene was evaluated, and its parent node was then determined by the highest likelihood. The subtype variable's first-tier child nodes, which are colored in green, lie within its Markov blanket and form a signature to discriminate between adenocarcinoma (AC) and squamous cell carcinoma (SCC).

Subtype Recognition by the Transcriptional Network Classifier

In the transcriptional network shown in Figure 1, the signature genes are the first-tier child nodes directly modulated by the subtype variable. Given a tumor sample's expression levels of the signature genes, we can use the network model to compute the probability of the sample being AC or SCC and then assign to the sample the subtype with the higher probability.
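The classification rule can be written out explicitly. The notation here is an assumption introduced for illustration: S is the subtype variable, g_1, ..., g_m are the observed expression levels of the m signature genes, and each signature gene has S as its only parent, as described for the first-tier child nodes.

```latex
\[
P(S = s \mid g_1, \dots, g_m)
  = \frac{P(S = s)\,\prod_{i=1}^{m} p(g_i \mid S = s)}
         {\sum_{s' \in \{\mathrm{AC},\,\mathrm{SCC}\}} P(S = s')\,\prod_{i=1}^{m} p(g_i \mid S = s')},
\qquad
\hat{s} = \arg\max_{s \in \{\mathrm{AC},\,\mathrm{SCC}\}} P(S = s \mid g_1, \dots, g_m).
\]
```

Under this assumed structure, genes further downstream add no information about S once the signature genes are observed, which is why only the first-tier child nodes enter the product.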

Statistical Differential Analysis

We used the Limma package24 in the R programming language and environment (www.r-project.org) to conduct the differential analysis.

Classification Accuracy

The discrimination accuracy of the model was determined by calculating receiver operating characteristic (ROC) curves. The estimation of each ROC curve started with creating the convex hull using the Qhull algorithm, followed by optimally smoothing the curve. We adopted the area under the ROC curve as the measure of classification accuracy.
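A minimal sketch of the ROC computation follows. It deliberately departs from the text in two ways: the convex hull is built with a plain monotone-chain scan rather than Qhull, and the smoothing step is omitted because its details are not given. The scores and labels are hypothetical.

```python
def roc_points(scores, labels, positive):
    """ROC points (false-positive rate, true-positive rate) over all thresholds."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == positive:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Upper convex hull of the ROC points (monotone chain, not Qhull)."""
    hull = []
    for p in sorted(set(points)):
        # Drop the previous point while it lies on or below the new edge.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def auc(points):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores (probability of SCC) for 4 samples.
scores = [0.9, 0.8, 0.7, 0.6]
labels = ["SCC", "AC", "SCC", "AC"]
points = roc_points(scores, labels, positive="SCC")
raw_auc = auc(points)
hull_auc = auc(upper_hull(points))
```

The hull can only raise the area, since it replaces concavities in the empirical curve with straight segments; on this toy input the raw area is 0.75 and the hull area is 0.875.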

Cross Validation

To assess the robustness of the network to sampling variability, we used 10-fold cross validation: the original training data were partitioned into 10 nonoverlapping subsets, and in each round 9 of the subsets were used for learning the network dependency and re-estimating the model parameters. Each resulting network was then used to classify the lung carcinoma subtypes of the individuals in the held-out subset, who were not included in the learning process.
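The partitioning can be sketched as below. This is an illustrative split only: the text does not say whether the folds were stratified by subtype or how samples were shuffled, so the random shuffle, the fixed seed, and the function name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k nonoverlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumed: simple random shuffle
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# The training set in the text has 111 tumors (58 AC and 53 SCC).
splits = list(k_fold_indices(111, k=10))
```

Each sample appears in exactly one test fold, so every tumor is classified once by a network that never saw it during learning.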

Comparisons of Classification Performance With Other Methods

We further contrasted our classification results with 3 other popular methods: principal component analysis with linear discriminant analysis (PCA-LDA); prediction analysis for microarray, which uses nearest shrunken centroids for tissue classification; and weighted voting, which weighs the significance of genes by signal-to-noise ratios to classify samples. PCA-LDA yielded a smaller signature of 13 genes but produced only 91.2% accuracy. Prediction analysis for microarray resulted in a signature of 77 genes and generated 91.0% accuracy. These analyses show that the superiority of our method to PCA-LDA and prediction analysis for microarray is statistically significant (P = .0047 and .0014, respectively). The classification by weighted voting reached 93.4%; although the difference between our transcriptional network classifier and weighted voting is not statistically significant (P = .6240), our classifier achieved higher accuracy with a much more compact signature than the 800-gene signature required by weighted voting.

Comparative Genomic Hybridization Data and Processing

The comparative genomic hybridization (CGH) data in our study were available from GEO with accession number GSE7878, which included 13 ACs and 15 SCCs. On chromosome 12, the CGH data contained 25, 207, and 18 genes occupying bands q12, q13, and q12-q13, respectively, whose average copy number changes were considered as 3 individual features of each tissue sample. Each feature was modeled by a Gaussian distribution. We built a naive Bayes classifier by treating the features as conditional only on the subtype variable and by learning the parameters of the conditional probabilities from the data. When classifying a sample, we evaluated its probability of being AC or SCC using Bayes' theorem and then assigned to the sample the subtype with the higher probability.
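The classifier described above can be sketched with a small Gaussian naive Bayes implementation. The copy number values below are invented for illustration and are not the GSE7878 data; the function names and the variance floor are likewise assumptions.

```python
import math

def fit_gaussian_nb(samples, labels):
    """Learn, per subtype, a prior plus a mean and variance for each feature."""
    model = {}
    for cls in sorted(set(labels)):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)  # floor avoids zero variance
                     for col, m in zip(columns, means)]
        model[cls] = (n / len(samples), means, variances)
    return model

def posterior(model, x):
    """Posterior probability of each subtype for one sample, via Bayes' theorem."""
    log_joint = {}
    for cls, (prior, means, variances) in model.items():
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_joint[cls] = total
    top = max(log_joint.values())               # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return {cls: math.exp(v - top) / norm for cls, v in log_joint.items()}

# Hypothetical average copy number changes for bands 12q12, 12q13, 12q12-q13.
train_x = [[0.0, -0.1, 0.1], [0.1, 0.0, 0.0], [-0.1, 0.1, -0.1],
           [0.5, 0.6, 0.4], [0.6, 0.5, 0.5], [0.4, 0.4, 0.6]]
train_y = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]

model = fit_gaussian_nb(train_x, train_y)
probs = posterior(model, [0.45, 0.5, 0.5])
predicted = max(probs, key=probs.get)
```

With 3 Gaussian features the log densities are combined nonlinearly in the feature values whenever the per-subtype variances differ, which is one way a curved decision surface can arise from such a simple model.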

RESULTS

Lung Carcinomas Classification

Figure 1 shows the transcriptional network inferred from a set of 111 tumor samples (58 ACs and 53 SCCs) from Duke University.25 Of the 22,283 gene probes in the microarray, 77 probes are dependent, directly or indirectly, on the carcinoma subtypes. Of these 77 probes, 25 are directly modulated by the cancer subtype and are by themselves sufficient to identify it. An enrichment study shows that there are 23 unique genes in this signature, listed in Table 1. All 25 probes are differentially expressed between AC and SCC with high statistical significance (P < 10⁻⁵) and >2-fold change. Notably, 18 have >5-fold change. False discovery rates for the 25 probes are <10⁻⁵.

Table 1. Signature Genes in the Network Classification Model (a)

| Gene Symbol | Gene Title | Cytoband | Pathway |
|---|---|---|---|
| ABCC3 | ATP-binding cassette, sub-family C (CFTR/MRP), member 3 | 17q22 | ABC transporters |
| BICD2 | Bicaudal D homolog 2 (Drosophila) | 9q22.31 | |
| CDA | Cytidine deaminase | 1p36.2-p35 | Pyrimidine metabolism, drug metabolism |
| CLDN3 | Claudin 3 | 7q11.23 | Cell adhesion molecules, tight junction, leukocyte transendothelial migration |
| DPP4 | Dipeptidyl-peptidase 4 | 2q24.3 | |
| HGD | Homogentisate 1,2-dioxygenase (homogentisate oxidase) | 3q13.33 | Tyrosine metabolism, styrene degradation |
| ITPKA | Inositol 1,4,5-trisphosphate 3-kinase A | 15q14-q21 | Inositol phosphate metabolism, calcium signaling pathway, phosphatidylinositol signaling system |
| KRT14 | Keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner) | 17q12-q21 | Cell communication |
| KRT6A, KRT6B, KRT6C | Keratin 6A, keratin 6B, keratin 6C | 12q12-q13 | Cell communication |
| MUC3B | Mucin 3B, cell surface associated | 7q22 | |
| MUC5B | Mucin 5B, oligomeric mucus/gel-forming | 11p15.5 | |
| NMNAT2 | Nicotinamide nucleotide adenylyltransferase 2 | 1q25 | Nicotinate and nicotinamide metabolism |
| NTRK2 | Neurotrophic tyrosine kinase, receptor, type 2 | 9q22.1 | MAPK signaling pathway |
| RHCG | Rh family, C glycoprotein | 15q25 | |
| SERPINB13 | Serpin peptidase inhibitor, clade B (ovalbumin), member 13 | 18q21.3-q22 | |
| SOX2 | SRY (sex determining region Y)-box 2 | 3q26.3-q27 | |
| SPINK1 | Serine peptidase inhibitor, Kazal type 1 | 5q32 | |
| SPRR1A | Small proline-rich protein 1A | 1q21-q22 | |
| TJP3 | Tight junction protein 3 (zona occludens 3) | 19p13.3 | Tight junction |
| TOX3 | TOX high mobility group box family member 3 | 16q12.1 | |
| VSNL1 | Visinin-like 1 | 2p24.3 | |

(a) Enrichment shows that there are 23 unique genes in the signature.

We tested the classification accuracy of the network on 7 independent study populations, for a total of 422 samples, 232 AC and 190 SCC, from subjects of Caucasian, Asian, and African descent representing 84.6%, 6.9%, and 2.8% of the data, respectively. On these independent samples, the transcriptional network classifier achieved an accuracy of 95.2%.
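The cohort totals and ancestry percentages quoted above can be re-derived from the counts given in the Gene Expression Data subsection. The tally below is a simple arithmetic check; it assumes that the 58 Italian samples are counted as Caucasian and the 27 Taiwanese samples as Asian, which the text implies but does not state.

```python
# Sample counts taken from the Gene Expression Data subsection.
ac_counts = {"Italy": 58, "Taiwan": 27, "US cohorts": 147}
scc_counts = {"US cohorts": 190}

total_ac = sum(ac_counts.values())
total_scc = sum(scc_counts.values())
total = total_ac + total_scc

# Assumed grouping: Italian samples as Caucasian, Taiwanese samples as Asian.
by_descent = {
    "Caucasian": 58 + 132 + 167,
    "Asian": 27 + 2,
    "African": 9 + 3,
}
shares = {name: round(100 * n / total, 1) for name, n in by_descent.items()}
```

Under that grouping the tally gives 232 AC and 190 SCC samples, 422 in total, with shares of 84.6%, 6.9%, and 2.8%, matching the figures in the text.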

Uniqueness of the 25-Gene Signature

To confirm that the set of 25 signature genes cannot be exchanged with other downstream genes, we performed a stochastic analysis by randomly selecting 25 genes in the data to construct a transcriptional network classifier. After 10,000 random trials, the mean classification accuracy on the independent samples was 64.7% (standard deviation, 9.7). We further investigated whether any single signature gene could classify well on its own. None of the signature genes by itself could reach accuracy >90.0% in both cross-validation and the independent samples.

Discrimination by Chromosome 12q12-13

It is worth noting that KRT6A, KRT6B, and KRT6C together represent a narrow cytoband on chromosome 12q12-q13. Surprisingly, these genes alone were able to achieve a classification accuracy of 90.2%, accounting for 95% of the accuracy of the entire signature. To understand how the interplay of the expression levels of these 3 genes affects AC-SCC discrimination, we assembled them into a signature and simulated their possible expression values using our network model. Figure 2 shows that the discriminative surface generated by these 3 genes was nonlinear and concave, and it accurately discriminated AC and SCC in all 8 populations considered in this study.

Figure 2. The adenocarcinoma (AC)-squamous cell carcinoma (SCC) discriminative surface obtained with KRT6A, KRT6B, and KRT6C as a signature is shown. The classification accuracy achieved by this signature was 90.2%, accounting for 95% of the accuracy of the entire 25-gene signature. Simulating the possible expression levels of the 3 genes generated a nonlinear discriminative surface, in which the region below it belonged to AC, and the region above belonged to SCC.

To test the structural role of this region, we analyzed the copy number variations of another independent group of 28 subjects, assayed by CGH microarrays. We found that copy number variations of bands 12q12, 12q13, and 12q12-13 define a nonlinear surface (Fig. 3) that discriminates these 28 new samples with 83.9% accuracy. These findings are consistent with the results of a recent analysis of DNA copy number alterations in a large number of AC and SCC samples evaluated by CGH arrays, which showed that a gain of 12q13 appears more frequently in SCC than in AC.26

Figure 3. The adenocarcinoma (AC)-squamous cell carcinoma (SCC) discriminative surface generated by the comparative genomic hybridization data is shown. The discriminative surface is a saddle, in which the region below it belongs to AC, and the region above belongs to SCC. This surface can recognize the lung cancer samples with 83.9% accuracy.

DISCUSSION

The 25-gene signature identified by the transcriptional network classifier is unique in discriminating AC and SCC with high accuracy. Furthermore, most of these genes are consistent with what the literature has reported. In the signature, ABCC3, CLDN3, DPP4, MUC3B, MUC5B, NTRK2, SPINK1, and TJP3 are specific markers of lung AC. The role of ABCC3 is to mediate the elimination of toxic compounds, for example, carcinogens in tobacco smoke,27 and a recent discovery revealed that ABCC3 is 1 of the few genes up-regulated in early lung AC.28 CLDN3 and TJP3 are involved in tight junctions and are found preferentially expressed in AC.29 DPP4 functions as a tumor suppressor, and its down-regulation may result in the progression of cancer. Among all the lung cancer subtypes, only AC retains the same level of expression as normal tissue, so DPP4 is a good marker to recognize AC.30 MUC3B and MUC5B are in the family of mucins that are important for tumor invasiveness and metastasis. An intestinal mucin, MUC3B is absent in normal lung but exhibits an increased mRNA level particularly in AC.31 MUC5B is naturally abundant in lung and airway tissues, and its presence is elevated in AC.32 A tyrosine kinase gene, NTRK2, is a newly identified proto-oncogene because of its mutations in lung AC.33 SPINK1 has been associated with prostate and pancreatic cancers, but it is found highly expressed in lung AC.34

KRT6A, KRT6B, KRT6C, KRT17, RHCG, SPRR1A, and VSNL1 are unique to squamous cells. KRT6A, KRT6B, KRT6C, and KRT17 are members of the keratin protein family and are related to epidermalization of squamous epithelium, so their expression surges in SCC.35, 36 RHCG is specific to squamous epithelia in many organs,37 and our classifier uses its high expression in lung SCC to discriminate from AC. SPRR1A is frequently amplified in SCC and predominantly expressed in squamous epithelium, where it contributes to the formation of the insoluble cornified cross-linked envelope that limits permeability and provides structural integrity.35 VSNL1, also known as VILIP-1, acts as a tumor suppressor gene specific to SCC, with higher expression in early stage than in advanced stage; in contrast, its expression pattern in AC is mild.38

BICD2, CDA, NMNAT2, SERPINB13, and TOX3 are not specific to either AC or SCC but to lung cancer in general. BICD2 is involved in the epidermal growth factor receptor (EGFR) signaling pathway.39 Because the percentage of EGFR amplification in SCC is about twice that in AC,2 it is not surprising that our analysis uses the higher expression of BICD2 in SCC to distinguish it from AC. CDA has been associated with alterations in enzymatic activity and may change sensitivity to widely used chemotherapy drugs.40 Because the NSCLC subtypes respond differently to chemotherapy, our study exploits the different expression levels of CDA to characterize AC and SCC. NMNAT2 has been shown to be up-regulated in current smokers,41 so it is correlated with both AC and SCC. SERPINB13 is overexpressed in both AC and SCC,42 but our study infers that its higher expression in SCC than in AC can distinguish these NSCLC subtypes. TOX3 has been shown to be a biomarker for breast cancer,43 and a recent study suggests that it is a good prognostic marker for NSCLC.44

The roles of the remaining genes (HGD, ITPKA, SOX2) in lung carcinomas have not been reported. HGD is involved in tyrosine metabolism, whose alteration is involved in lung carcinoma progression. ITPKA regulates inositol phosphate metabolism, and SOX2 is in the SOX family of transcription factors crucial for cell differentiation. The linkage of these 2 genes with breast cancer has been reported.45

Lung tumor subtypes exhibit diversity in the molecular physiology.46 Although the association with tumor subtypes of molecular markers has been proposed, there is currently no widely accepted molecular-based tool to help identify the different histological subtypes. Two markers, thyroid transcription factor-1 (TTF1) and TP63, are regularly used by the surgical pathologist as an adjunct to morphological diagnosis. TTF1 stains tumors with adenodifferentiation, whereas TP63 stains SCC.47-49 However, TTF1 and TP63 together have a low sensitivity for a particular histological type, as they are not necessarily specific to AC and SCC. TTF1 has been reported in a minority of SCCs, and TP63 has been noted to be expressed in a minority of ACs, resulting in these markers in combination often both staining a single tumor or not staining at all, and therefore failing to classify a large fraction of lung carcinomas.50-52 Our analysis confirmed these reports; unlike the 25-gene signature whose expression levels differ between AC and SCC by 13-fold on average, TTF1 and TP63 differ by only 7-fold, so they were excluded from the signature in the transcriptional network model. Conversely, our 25-gene signature along with the computational model was evaluated by its sensitivity and specificity, achieving 95.2% classification accuracy. The high accuracy suggests that a new combination of multiple molecular markers is necessary to accurately discriminate lung tumor subtypes.

The actual subtypes of the NSCLC samples used in this research were identified by histology. The high AC-SCC discrimination accuracy resulting from our gene expression microarray analysis suggests that gene expression profiling is a powerful alternative to histology. When the morphological patterns of tumor cells are not recognizable, when small-sized tumors in early stage are difficult to distinguish, or when patients present with both primary AC and SCC, a microarray assay focused on the limited number of signature genes defined in the present study could be devised to objectively subclassify NSCLC samples.

An interesting topic for future research would be to investigate the impact of race on the gene signature. In this paper, our data consisted of >90% Caucasians. The small numbers of samples of African and Asian descent made it infeasible to investigate how race plays a role in AC-SCC recognition. However, if additional African and Asian patients can be recruited, this analysis can be extended to identify race-specific signature genes.

In summary, this study shows the existence of a small functional network modulating the differences between the 2 most common types of lung cancer, confirmed by the high predictive accuracy of this network on a very large number of subjects. The ability of this small functional network to pinpoint a small region of chromosome 12 accounting for a large proportion of the differences between AC and SCC suggests the possibility of developing high-throughput screening methods to identify candidates for defect-targeted drugs. At the same time, the reliability of this network signature also suggests the potential of these network analyses to develop systemic molecular profiles for personalized therapeutic strategies.

CONFLICT OF INTEREST DISCLOSURES

Supported in part by the National Institutes of Health/National Human Genome Research Institute (R01HG003354).
