Transcriptional expression profiles of oral squamous cell carcinomas




Currently, the classification of oral squamous cell carcinoma (OSCC) depends heavily on the clinical and pathologic examination of tissue. This system can lead to the classification of potentially heterogeneous tumors into single groups when they may have different degrees of aggressiveness. No system to date has incorporated genetic changes as a factor by which to classify OSCC tumors.


To test the hypothesis that OSCC has a genome-wide genetic expression profile that differs from normal oral tissue and that transcriptional expression profiling can be used to characterize the heterogeneity among tumors, the authors examined the genetic expression profiles of 26 invasive squamous cell carcinomas of the oral cavity and oropharynx, 2 premalignant lesions, and 18 normal oral tissue samples using oligonucleotide arrays that contained probes representing approximately 7000 full-length human genes.


Using hierarchical clustering analysis, the data show that oral carcinomas are distinguishable from normal oral tissue based on genome-wide transcriptional expression patterns. However, there is genetic expression profile heterogeneity among tumors of a particular histopathologic grade and stage. In addition, using a statistical approach that integrated normalization and regression analysis, the authors found 314 genes that were expressed differentially in the OSCC samples with statistical significance (P ≤ 0.05). Of these, 239 genes were overexpressed in the OSCC samples, whereas 75 genes were down-regulated.


No statistically significant differences in gene expression were found between early-stage disease and late-stage disease or between metastatic tumors and nonmetastatic tumors. The implications of these findings for the prediction of clinical outcome and for the discovery of new OSCC tumor markers are discussed. Cancer 2002;95:1482–94. © 2002 American Cancer Society.

DOI 10.1002/cncr.10875

Oral and oropharyngeal squamous cell carcinoma (OSCC) is a considerable public health problem. It is the sixth most frequently diagnosed malignancy worldwide. In the United States, there were approximately 30,300 new diagnoses of OSCC in 1999. The incidence of oral carcinoma has risen since the early 1970s.1 Despite considerable advances in surgical techniques and the addition of adjuvant treatment modalities, the overall prognosis for patients with OSCC has not improved in the past 2 decades.1 The 5-year survival rate is only 55% for white patients in the United States and only 34% for black patients in the United States.2 In addition, treatment often results in loss of speech, chewing and swallowing dysfunction, shoulder pain and dysfunction, cosmetic deformity, and psychological distress. One of the impediments to the effective management of patients with OSCC is our limited ability to predict the natural history of individual lesions. Although prognosis, to some extent, is correlated with anatomic site, tumor thickness, tumor grade, and lymph node involvement, there remains much unexplained variability in the clinical course of patients with OSCC. Currently, we are unable to classify tumors based on their potential to metastasize or to resist radiation treatment. Moreover, wound margins only are examined histologically for evidence of microscopic disease. However, there is not enough scientific evidence for clinicians to screen for molecular changes at the wound margins that would predispose patients to malignancy and contribute to disease recurrence. With these problems at hand, it is imperative to study oral carcinomas at the genetic level and to characterize the genetic changes responsible for carcinogenesis and tumor behavior.

Past studies that attempted to characterize and predict tumor behavior in head and neck carcinoma, including oral carcinoma, placed great emphasis on the examination of loss of heterozygosity or microsatellite instability in a number of genetic loci that are thought to be associated with tumor suppressor genes.3–11 There also have been examinations of individual genes or proteins, such as p53, p16, c-erb-B2 and H-ras, K-ras, and N-ras.12, 13 Although those studies contributed greatly to our current understanding, they did not explain the complexities of such malignancies. With the advent of DNA microarrays, genomic-scale differential gene expression profiles can be obtained. This technology allows for the simultaneous assessment of intracellular transcripts of thousands of genes at one time. We now have the opportunity to correlate clinical outcomes with patterns of gene expression on a genomic scale, possibly yielding a more accurate predictor of tumor behavior. Because phenotypic changes are heralded by genotypic changes, it also may be possible to detect differentially expressed genes that may serve as markers for the early detection of premalignant lesions. More than just focusing on the expression of a handful of genes, genomic-scale expression profiles allow the investigator to look at genetic expression variability in the context of genetic themes and pathways, such as proliferation, invasion, carcinogen metabolism, apoptosis, radiosensitivity, etc. Tissue samples with similar profiles can be clustered on the basis of gene expression, or, conversely, tissues that are assumed similar can be analyzed for genetic expression variability. Genes with expression that is up-regulated or down-regulated strongly during the progression to neoplasia may serve as candidate genes for future studies on their potential roles in the development of oral carcinoma or as points for potential prevention trials.

Many reports have grouped OSCC with tumors from many other sites of the head and neck in their analyses. We suspect that tumors from some of these sites (e.g., the larynx) may have different features and that mixing OSCC with other head and neck malignancies would make it difficult to detect OSCC-related associations. Therefore, we focused our study on OSCC for a number of reasons: OSCC tumors are easy to examine and follow; and, because they can be detected early, it is possible to access tissue at all stages of neoplasia, which is helpful for studies in which disease progression from the premalignant lesion to the aggressive metastatic neoplasia can be followed and studied.

In the current study, we generated genomic-scale gene expression profiles using fresh oral cavity/oropharyngeal squamous cell carcinomas taken at the time of resection. We obtained mRNA from the samples, labeled the transcripts, and arrayed them against oligonucleotide arrays (GeneChip®; Affymetrix, Inc., Santa Clara, CA). We addressed whether we could 1) use gene expression profiles to obtain a molecular portrait of OSCC and 2) determine which genes are expressed differentially in OSCC and normal oral tissue with statistical significance.


Tissue Collection

Surgically excised tumor samples were obtained from patients who were diagnosed with oral cavity/oropharyngeal carcinomas with the approval of the Human Subjects Division at the University of Washington. Whenever possible, a biopsy sample was obtained prior to devascularization to prevent any ischemia-induced genetic expression changes. Tumor tissue was immersed immediately in RNAlater® solution to ensure the stability of RNA. Pathologic confirmation was then obtained from frozen sections by the study pathologist and on permanent paraffin embedded sections of the resected surgical specimen by a pathologist who was not involved in the study. The final paraffin embedded sections also were reviewed by the study pathologist. No discrepancy between the pathology reviews was detected.

With patients' informed consent and when healing was not a concern, normal tissue was collected either from oral epithelium at the surgical wound margin in the patient's oral cavity after tumor resection or from the side of the oral cavity opposite the tumor. Normal tissue specimens also were collected from consenting patients who were scheduled for oral surgery but who did not have a diagnosis of oral carcinoma. Normal samples were treated in the same fashion as the tumor samples. In addition, to enrich the content of epithelial tissue in the normal control samples and cancerous cells in the tumor samples, dissection of underlying soft tissue was performed. Histologic examination confirmed that all normal control samples that were excised from surgical margins of patients with OSCC were histologically normal. Pathologic examination of hematoxylin and eosin-stained sections indicated that the OSCC samples were composed of at least 60–70% tumor.

RNA Isolation

Under RNase-free conditions, the resected specimens were submerged in approximately 2–3 mL of Trizol® solution and homogenized using a rotor-stator homogenizer (Ultra-Turrax T25; IKA Labortechnik, Germany). Total RNA was extracted from the samples using the Trizol extraction protocol (Life Technologies, Gaithersburg, MD). After dissolving the total RNA in RNase-free water, ethanol precipitation was carried out by adding a 1:10 volume of 3 M sodium acetate, pH 5.2, and 2.5 volumes of 100% ethanol. The sample was stored at −20 °C for at least 1 hour and centrifuged at ≥ 12,000 × g at 2–8 °C for 20 minutes. The pellet was washed twice with 80% ethanol and resuspended in RNase-free water. The RNA was purified further using an RNeasy® kit (Qiagen, Chatsworth, CA).

Target Sample Preparation

We used the SuperScript™ Choice System (Life Technologies) to synthesize double-stranded DNA (dsDNA) from sample-derived total RNA. Five to 25 μg of total RNA were mixed with 100 pmol of T7-(dT)24 primer [5′-GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG AGG CGG-(dT)24-3′] in a total volume of 11 μL. The primer and RNA were heat denatured at 70 °C for 10 minutes, centrifuged, and quickly put on ice. After addition of 4 μL of 5 × first-strand reaction buffer, 2 μL of 0.1 M dithiothreitol, and 1 μL of 10 mM deoxyribonucleoside triphosphate (dNTP) mix, the reaction mixture was incubated for 2 minutes at 42 °C. The specified amount of SuperScript II enzyme (200 U/μL) was then added according to the total amount of RNA starting material, and the mixture was incubated at 42 °C for 1 hour for first-strand cDNA synthesis. For second-strand synthesis, 30 μL of 5 × second-strand synthesis buffer (100 mM Tris-HCl, pH 6.9; 450 mM KCl; 23 mM MgCl2; 50 mM [NH4]2SO4; and 0.75 mM β-nicotinamide adenine dinucleotide [β-NAD+]), 200 μM each of dNTP, 4 μL DNA polymerase I (10 U/μL), 1 μL E. coli RNase H (2 U/μL), 1 μL E. coli DNA ligase (1 U/μL), and 91 μL RNase-free water were added to the first-strand reaction mixture, and the sample was incubated at 16 °C for 2 hours. Two microliters of T4 DNA polymerase (5 U/μL) were then added, and the sample was incubated for an additional 5 minutes at 16 °C. Ten microliters of 0.5 M ethylenediamine tetraacetic acid were added to stop the reaction. The resulting dsDNA was purified with phenol-chloroform (1:1 dilution) and precipitated by adding 0.5 volume of 7.5 M ammonium acetate and 2.5 volumes of 100% ethanol. The DNA was pelleted immediately by centrifugation at 12,000 × g for 20 minutes at room temperature. The pellet was washed twice with 80% ethanol by centrifugation, as described above, and resuspended in 12 μL of RNase-free water.

We generated biotin-labeled cRNA targets from the dsDNA by T7 RNA polymerase linear amplification using the Enzo BioArray™ High-Yield RNA Transcript Labeling System (Enzo, New York, NY). We purified biotinylated cRNA using an RNeasy kit and determined the quantity and purity of the cRNA spectrophotometrically. We fragmented the cRNA by adding 2 μL of 5 × RNA fragmentation buffer (200 mM Tris-acetate, pH 8.1; 500 mM KOAc; and 150 mM MgOAc) for every 8 μL of cRNA plus water and incubated the mixture at 94 °C for 35 minutes.

Gene Expression Profiles

We generated genomic-scale gene expression profiles using oligonucleotide arrays purchased from Affymetrix, Inc. Affymetrix synthesizes the arrays using light-directed combinatorial chemistry, as described previously.14 We used both the Test-1 and the HuGeneFL probe arrays. Test-1 probe arrays contain 23 probe sets that represent bacterial, viral, yeast, murine, and human housekeeping and ribosomal genes. The HuGeneFL probe arrays contain probe sets representing approximately 7000 full-length human genes. These probes correspond to human genes that are registered in UniGene, GenBank®, and the Institute for Genomic Research (TIGR) data base. Each probe set is composed of 20 probe pairs. Each probe pair contains a perfect-match probe, which is an oligonucleotide 25 bases in length that is identical to a sequence of the gene, and a mismatch probe, which contains a homomeric mismatch at the central base position of the oligomer. The mismatch probe is used to test for cross hybridization of the oligomer. Probes are selected with a bias toward the 3′ region of each gene's sequence. Probe sets representing human genes, such as glyceraldehyde 3-phosphate dehydrogenase (GAPDH), β-actin, transferrin receptor, and the transcription factor ISGF-3, serve as internal controls for RNA integrity. In addition, the probe arrays contain oligonucleotides that represent sequences of bacterial genes (BioB, BioC, BioD) and one phage gene (Cre) as quantitative standards.

Fragmented, biotin-labeled cRNA targets were hybridized to GeneChip® probe arrays in an Affymetrix hybridization oven (model 320; Affymetrix, Inc.). After hybridization, the sample was removed, and the probe array was washed and stained with a streptavidin-linked fluor using GeneChip® Fluidic Station 400. The degree of gene expression was determined by monitoring the fluor intensity of a given Affymetrix probe feature using a Hewlett-Packard GeneArray Scanner.

Preprocessing of Microarray Data

The Affymetrix GeneChip® software uses differences in the fluorescence signal intensities between the perfect-match and mismatch probe sets to represent the quantity of each transcript. Given the Affymetrix probe pair design, if a transcript is not detected, then the mismatch probe set and the perfect-match probe set should report similar intensities with the average difference close to 0. Unfortunately, due to the limitations of the technology, the probe pairs complementary to some genes sometimes generate a negative average difference that cannot be used in data analysis. To exclude such genes from our analysis, we adopted the following procedures. The GeneChip® software uses the number of positive and negative probe pairs to generate an absolute call (present, marginal, or absent) for each gene. We included in our analysis only those genes that were called present by the GeneChip® software in at least one sample. After applying this criterion, there were still samples in which the average difference value for a particular gene was reported as negative. In the second step, we excluded the genes with median expression levels (average differences) that were below an arbitrarily low threshold value of 100 in both the experimental group (i.e., OSCC samples) and the normal control group (median of expression levels for all genes across all samples, 1519.5; standard deviation, 5407; range, from −17,094 to +77,508).
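The two filtering steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' software: the function name `filter_genes`, the data layout (dictionaries keyed by gene), the call codes "P", "M", and "A", and the toy values are all assumptions made for the example.

```python
from statistics import median

def filter_genes(avg_diff, calls, is_tumor, threshold=100.0):
    """Return the genes that pass both filtering steps.

    avg_diff: dict gene -> list of average-difference values, one per sample
    calls:    dict gene -> list of absolute calls ("P", "M", or "A"), one per sample
    is_tumor: list of bools, True where the sample is an OSCC sample
    """
    kept = []
    for gene, values in avg_diff.items():
        # Step 1: keep only genes called present in at least one sample.
        if "P" not in calls[gene]:
            continue
        tumor = [v for v, t in zip(values, is_tumor) if t]
        normal = [v for v, t in zip(values, is_tumor) if not t]
        # Step 2: drop genes whose median is below the threshold in BOTH groups.
        if median(tumor) < threshold and median(normal) < threshold:
            continue
        kept.append(gene)
    return kept

# Hypothetical toy data: two tumor samples followed by two normal samples.
is_tumor = [True, True, False, False]
avg_diff = {
    "geneA": [500, 650, 40, 30],   # present, high in tumors
    "geneB": [20, -15, 35, 10],    # present once, but low in both groups
    "geneC": [150, 90, 80, 120],   # never called present
}
calls = {
    "geneA": ["P", "P", "A", "A"],
    "geneB": ["A", "P", "A", "A"],
    "geneC": ["A", "A", "A", "A"],
}
kept_genes = filter_genes(avg_diff, calls, is_tumor)
print(kept_genes)
```

In this toy input only geneA survives: geneC is never called present, and geneB has medians below 100 in both groups.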


The goal of our analysis was to find genes with expression levels that were increased or decreased consistently in carcinoma tissues compared with normal tissues. To compare expression levels of the same gene assessed by different arrays, we acknowledged the fact that there are systematic variations associated with each individual oligonucleotide array. Because these variations are systematic across one chip, we postulated sample specific heterogeneity factors: additive factor δk and multiplicative factor λk. We assumed that the genome-wide expression levels between two chips should be comparable. Therefore, we estimated δk and λk from the data by using a least-squares method and normalized the data with the estimated heterogeneity factors.15
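One simple way to picture the additive and multiplicative heterogeneity factors is to regress each array on a common reference profile by least squares and then invert the fitted line. The sketch below does exactly that. It is an assumption-laden illustration: the use of a reference profile, the function names, and the toy numbers are hypothetical, and the estimation procedure in the cited method15 may differ in detail.

```python
def fit_heterogeneity(array_values, reference):
    """Least-squares fit of array_values ~ delta + lam * reference.

    Returns (delta, lam): the additive and multiplicative factors for one array.
    The reference profile (for example, gene-wise means across arrays) is an
    assumption of this sketch.
    """
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_y = sum(array_values) / n
    sxx = sum((r - mean_ref) ** 2 for r in reference)
    sxy = sum((r - mean_ref) * (y - mean_y)
              for r, y in zip(reference, array_values))
    lam = sxy / sxx
    delta = mean_y - lam * mean_ref
    return delta, lam

def normalize(array_values, delta, lam):
    """Remove the estimated array-specific offset and scale."""
    return [(y - delta) / lam for y in array_values]

# Hypothetical example: an array that reads twice the reference plus an offset of 10.
reference = [100.0, 200.0, 400.0, 800.0]
chip = [10.0 + 2.0 * r for r in reference]
delta_k, lambda_k = fit_heterogeneity(chip, reference)
normalized = normalize(chip, delta_k, lambda_k)
print(delta_k, lambda_k, normalized)
```

After normalization, the toy array matches the reference profile, which is the sense in which expression levels from different chips become comparable.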

Regression Model

We constructed a statistical modeling framework to analyze the expression profiles.15 We conceptualized an array of gene expression profiles as a vector of outcomes. Let Yk = (Y1k, Y2k, …, YJk) denote array k, where Yjk denotes the expression of the jth gene on array k (j = 1,2,…,J; k = 1,2,…,K). We introduced a covariate xk to represent the disease status associated with each sample. Usually, xk is coded as 0 for the control group and 1 for the experimental group. For example, xk = 1 for carcinoma tissue and xk = 0 for normal tissue. We proposed a regression model for the expression level of the jth gene in the kth sample as follows:

$$Y_{jk} = \delta_k + \lambda_k \left( a_j + b_j x_k \right) + \epsilon_{jk}$$

in which δk and λk are the sample specific additive and multiplicative heterogeneity factors, respectively; aj and bj are gene specific regression coefficients; and ϵjk is a random variable that reflects variations from sources other than those identified by the known covariate and the heterogeneity factors. Because xk is binary, aj measures the mean expression level of the jth gene in normal samples (xk = 0), and bj measures the difference of averaged expression levels of the jth gene between the carcinoma samples and the normal tissue samples. We then estimated the gene specific regression coefficients associated with each gene j (âj, b̂j) by least squares. Because the properties of the random variations typically are unknown, they are unlikely to follow any familiar distributions, such as the normal distribution. Therefore, we estimated each gene's parameters using a method of estimating equations that was independent of any distributional assumptions.16, 17 The standard error was estimated by using a bootstrap procedure that was designed to acknowledge the loss of degrees of freedom in normalizing microarray data.
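For data that have already been normalized (so that the heterogeneity factors can be ignored), the least-squares estimates for one gene reduce to ordinary simple linear regression on the binary covariate. The sketch below illustrates that special case only; it is not the authors' integrated procedure, and the function name and toy values are hypothetical.

```python
def fit_gene(expr, x):
    """Ordinary least squares for expr = a + b * x on one gene.

    expr: normalized expression values of the gene, one per sample
    x:    disease status per sample (0 = normal, 1 = carcinoma)
    Returns (a_hat, b_hat).
    """
    n = len(expr)
    mean_x = sum(x) / n
    mean_y = sum(expr) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expr))
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    return a_hat, b_hat

# Hypothetical gene: three normal samples followed by three carcinoma samples.
x = [0, 0, 0, 1, 1, 1]
expr = [100.0, 120.0, 110.0, 300.0, 340.0, 320.0]
a_hat, b_hat = fit_gene(expr, x)
print(a_hat, b_hat)
```

With a binary covariate, the intercept equals the mean of the normal samples (110 here) and the slope equals the difference between the group means (210 here), which matches the interpretation of aj and bj given in the text.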

Statistical Inference

We determined the statistical inference by testing the null hypothesis that there is no difference in the expression level for each gene between two groups (i.e., cancerous tissues and normal tissues). Statistically, it is expressed as: H0: bj = 0, j = 1,2,…,J. We tested the null hypothesis by calculating the Z-score: b̂j/SE(b̂j), where SE(b̂j) is the standard error of b̂j. Then we sorted the genes according to the absolute values of Z-scores in descending order: The higher the absolute Z-score, the more likely the null hypothesis should be rejected. In other words, a gene with a high absolute Z-score was very likely to be expressed differentially in the experimental samples (i.e., in OSCC samples). The sign of the Z-score indicates the directionality of the differential expression. Conventionally, microarray data are first normalized, and the normalized data are used for further downstream analysis. However, we have found that the normalization step may introduce additional variations and may influence the calculation of Z-scores. To correct this, we combined the normalization and regression analysis into one step and used a bootstrap resampling scheme to obtain a more accurate estimate of the standard error for the calculation of Z-scores.
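The idea of a bootstrap-based Z-score can be illustrated with a simplified sketch. Note the assumptions: the version below resamples already-normalized values within each group and does not repeat the normalization inside each bootstrap replicate, which the authors' one-step scheme is designed to account for. The function name, the number of replicates, the seed, and the toy values are hypothetical.

```python
import random
from statistics import mean, stdev

def bootstrap_z(tumor, normal, n_boot=2000, seed=0):
    """Z-score for one gene: estimated group difference over its bootstrap SE."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    b_hat = mean(tumor) - mean(normal)
    replicates = []
    for _ in range(n_boot):
        # Resample with replacement within each group (stratified bootstrap).
        t = [rng.choice(tumor) for _ in tumor]
        n = [rng.choice(normal) for _ in normal]
        replicates.append(mean(t) - mean(n))
    se = stdev(replicates)  # bootstrap estimate of the standard error
    return b_hat / se

# Hypothetical normalized expression values for two genes.
z_up = bootstrap_z([300, 340, 320, 310, 360], [100, 120, 110, 90, 130])
z_down = bootstrap_z([40, 55, 50, 45, 60], [400, 380, 420, 410, 390])
print(z_up, z_down)
```

A positive score indicates higher expression in the tumor group and a negative score indicates lower expression; genes would then be ranked by the absolute value of the score.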

Multiple Comparisons

To determine the threshold of the Z-score for statistical significance, the Z-score was transformed through an asymptotic distribution into a P value, and the threshold was set at a prespecified significance level, such as 0.05. However, because we were performing a large number of comparisons on a relatively small number of samples, the false positive rate may be high. This is called the multiple comparison phenomenon and is well recognized in the literature.18 To ensure that the significance level was applicable on a genomic scale, we raised the statistical threshold for declaring that a transcript was expressed differentially. A conservative method for adjusting the significance is the Bonferroni correction, which divides the desired significance (e.g., 5% or P = 0.05) by the total number of statistical tests performed. In the current study, we calculated the significance value (i.e., P value) for each probe set using a modified Bonferroni correction, as proposed by Hochberg.19
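The difference between the plain Bonferroni correction and a Hochberg-type adjustment can be made concrete with a short sketch. It assumes that the cited procedure19 is Hochberg's step-up method and that the asymptotic distribution of the Z-score is the standard normal (two-sided test); both are assumptions of this example, as are the function names and the toy P values.

```python
import math

def z_to_p(z):
    """Two-sided P value for a Z-score under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0 for every test with p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def hochberg_reject(pvals, alpha=0.05):
    """Hochberg step-up: find the largest rank k with p_(k) <= alpha / (m - k + 1)
    and reject the k hypotheses with the smallest P values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (m - rank + 1):
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

# Hypothetical P values for four genes.
pvals = [0.01, 0.02, 0.03, 0.04]
print(bonferroni_reject(pvals))  # only the smallest P value passes 0.05 / 4
print(hochberg_reject(pvals))    # the largest P value passes 0.05 / 1, so all are rejected
```

The example shows why the step-up procedure is less conservative than Bonferroni while still controlling the family-wise error rate under its assumptions.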

Hierarchical Clustering

To perform hierarchical clustering, the expression level for any given gene (average difference) in any given sample (array) was compared with the mean expression level of that gene across all samples. To avoid division by 0 or a negative number in transcripts that were not detected, expression values ≤ 0 were changed to 20. The expression values were then log-transformed, and both dimensions of the data matrix, the genes (rows) and the arrays (columns), were centered about the mean and then normalized. We then used a two-way, pairwise, average-linkage clustering algorithm to group genes that had similar expression profiles across all tissue samples and, conversely, to group tissue samples that had similar patterns of gene expression. This analysis was performed with Cluster software (version 2.02), and the resulting expression map was visualized with Treeview software (version 1.45; both are shareware programs available on the Internet). A dendrogram is shown in a matrix format that resembles a phylogenetic tree. How far up the tree one has to go to find a branch point that connects two samples is a measure of how different the two samples are. Each row represents the expression levels of a particular gene across all samples, whereas each column represents the expression levels of all the tested genes for each sample. To visualize the results, the expression level of any given gene in any given tissue sample (relative to its mean expression level across all samples) is represented by a color, with red representing expression levels above the mean, green representing expression levels below the mean, and the color intensity indicating the magnitude of the deviation from the mean.
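The preprocessing and the average-linkage step can be sketched in plain Python. This is an illustration under stated assumptions, not a reimplementation of the Cluster software: it centers genes (rows) only, uses 1 minus the Pearson correlation as the distance between samples, clusters samples but not genes, and uses hypothetical sample names and expression values.

```python
import math

def preprocess(matrix, floor=20.0):
    """Floor non-positive values, log2-transform, and mean-center each gene (row)."""
    logged = [[math.log2(v if v > 0 else floor) for v in row] for row in matrix]
    return [[v - sum(row) / len(row) for v in row] for row in logged]

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def average_linkage(columns):
    """Agglomerative clustering of samples; returns merges as (a, b, distance)."""
    clusters = {name: [name] for name in columns}  # cluster label -> member samples

    def dist(p, q):
        return 1.0 - pearson(columns[p], columns[q])

    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                a, b = labels[i], labels[j]
                # Average of all pairwise distances between the two clusters.
                d = sum(dist(p, q) for p in clusters[a] for q in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# Hypothetical data: 4 genes (rows) x 4 samples (columns).
names = ["CA1", "CA2", "N1", "N2"]
raw = [
    [800, 900, 100, 120],
    [600, 700, 50, -5],    # the negative value is floored to 20 before the log
    [100, 90, 700, 800],
    [50, 60, 400, 500],
]
centered = preprocess(raw)
columns = {name: [row[i] for row in centered] for i, name in enumerate(names)}
merges = average_linkage(columns)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

In this toy input the two carcinoma samples join each other and the two normal samples join each other before the two groups are merged, which mirrors how the height of a branch point in the dendrogram reflects dissimilarity.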

Confirmation of Differential Gene Expression by Real-Time Quantitative Reverse Transcriptase-Polymerase Chain Reaction

We confirmed the differential expression of a subset of genes by quantitative reverse transcriptase-polymerase chain reaction (QRT-PCR) using SYBR® Green I technology and melting-point dissociation curve analyses. This included the up-regulation of urokinase-type plasminogen activator (uPA), cathepsin L, secreted protein, acidic and rich in cysteine (SPARC), and matrix metalloproteinase 14 (MMP14) and the down-regulation of cytokeratin 4. Six hundred nanograms each of total RNA extracted from 6 normal tissue samples and 12 tumor tissue samples were used as templates in RT-PCR reactions (using an RT-PCR kit; Applied Biosystems Inc. [ABI], Foster City, CA) to generate cDNA. These samples were selected because they had been used for the GeneChip® analysis and because there was an abundance of RNA available for experimentation. Each cDNA sample was divided into five wells for the QRT-PCR reactions: three wells for the gene of interest and two wells for the endogenous control, glyceraldehyde 3-phosphate dehydrogenase (GAPDH). QRT-PCR analyses were performed on an ABI 5700 Sequence Detector using 10 ng of cDNA and appropriate, gene specific primers in 1 × SYBR® Green I PCR Master Mix (ABI) in a 50-μL reaction. Cycling parameters were 50 °C for 2 minutes, 95 °C for 10 minutes, and 40 cycles at 95 °C for 15 seconds and at 60 °C for 1 minute. GAPDH was chosen as the endogenous control because its expression, as tested by the Affymetrix FL68 GeneChip® arrays, was comparable in all tumor tissue and normal tissue samples.
Primer sequences for uPA and cathepsin L were as follows: uPA forward, 5′-GCA CCA TCA AAC AAA CCC CCT TAC-3′; uPA reverse, 5′-CAG ACA GAA AAA CCC CTG CCT G-3′; cathepsin L forward, 5′-CAG TGT GGT TCT TGT TGG G-3′; and cathepsin L reverse, 5′-CTT GAG GCC CAG AGC AGT CTA-3′.20 Primer sequences for SPARC, cytokeratin 4, and GAPDH, which were designed using PE/ABI Primer Express® software and were checked for specificity against the National Center for Biotechnology Information nucleotide data base, were as follows: SPARC forward, 5′-CGG CTT TGT GGA CAT CCC TA-3′; SPARC reverse, 5′-GGA AGG ACT CAT GAC CTG CAT C-3′; cytokeratin 4 forward, 5′-CGC GAA CAG ATC AAG CTC CT-3′; cytokeratin 4 reverse, 5′-AGG TTC CAT TTG GTC TCC AGG-3′; GAPDH forward, 5′-TTG GTA TCG TGG AAG GAC TCA-3′; and GAPDH reverse, 5′-TGT CAT CAT ATT TGG CAG GTT T-3′. A melting-point dissociation curve analysis protocol was run immediately after cycling to determine the specificity of the reaction. Transcripts were quantified by choosing a fluorescence threshold at which the amplification of the target gene was exponential in both tumor samples and normal samples. The PCR cycle number at which the amplification curve intercepted the threshold is termed the threshold cycle (CT). The threshold cycle decreases as the initial copy number of the target template increases. Relative fold changes were calculated as $2^{-\Delta\Delta C_T}$, where ΔΔCT = [average CT, gene j − average CT, GAPDH] (tumor tissue) − [average CT, gene j − average CT, GAPDH] (normal tissue).
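The relative quantification can be illustrated with a short calculation that assumes the standard comparative threshold-cycle form, in which the fold change equals 2 raised to the power −ΔΔCT. The function name and the CT values below are hypothetical and are not data from the study; the well counts (three for the gene of interest, two for GAPDH) follow the layout described above.

```python
from statistics import mean

def fold_change(ct_gene_tumor, ct_gapdh_tumor, ct_gene_normal, ct_gapdh_normal):
    """Relative fold change (tumor vs. normal) by the comparative CT method."""
    delta_tumor = mean(ct_gene_tumor) - mean(ct_gapdh_tumor)     # delta CT, tumor
    delta_normal = mean(ct_gene_normal) - mean(ct_gapdh_normal)  # delta CT, normal
    delta_delta = delta_tumor - delta_normal                     # delta-delta CT
    return 2.0 ** (-delta_delta)

# Hypothetical CT values: three wells for the gene of interest, two for GAPDH.
fc = fold_change(
    ct_gene_tumor=[22.0, 22.2, 21.8],
    ct_gapdh_tumor=[18.0, 18.0],
    ct_gene_normal=[25.0, 25.1, 24.9],
    ct_gapdh_normal=[18.0, 18.0],
)
print(round(fc, 2))
```

Here the gene crosses the threshold about three cycles earlier in tumor tissue than in normal tissue after normalization to GAPDH, which corresponds to roughly an 8-fold up-regulation.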


For this study, we analyzed 26 invasive oral tumors, 2 premalignant oral lesions (1 carcinoma in situ [CIS] and one hyperplastic lesion), and 18 normal oral tissue samples. The samples were coded with numbers and were labeled CA for invasive disease or N for normal, according to the histologic examination. For the premalignant samples, the CIS sample and the hyperplastic sample were labeled CIS and HYP, respectively. The CIS22 sample was taken from a patient with a history of oral carcinoma who developed oral erythroplakia several months after undergoing resection of the primary tumor. The hyperplastic lesion, HYP40, represented the biopsy of one lesion in a patient with multiple verrucous tongue growths. Of the 18 normal oral tissue samples, 10 were taken from patients with oral carcinoma. These samples were named NCA and were given the identification number of the patient from whom they were obtained. A full description of the clinical data, including diagnosis, stage, and grade, is provided in Table 1.

Table 1. Clinical Data of the Study Participants

| Sample no. | Diagnosis | Stage (a) | Grade (differentiation) |
|---|---|---|---|
| 1 | Tongue SCC | T4N2bM0 | Well to moderate |
| 3 | Alveolar ridge SCC/leukoplakia | T1N0M0 | Moderate |
| 4 | Chronic tonsillitis | N/A | N/A |
| 5 | Chronic tonsillitis | N/A | N/A |
| 6 | Alveolar ridge SCC | T4N0M0 | Moderate |
| 7 | Tonsil SCC | T2N0M0 | Moderate |
| 8 | Recurrent tongue SCC | T2N0M0 | Moderate |
| 9 | Floor of mouth SCC | T2N0M0 | Well |
| 10 | Alveolar ridge SCC | T1N0M0 | Well |
| 11 | Tongue SCC | T2N0M0 | Well |
| 14 | Obstructive sleep apnea | N/A | N/A |
| 15 | Recurrent tonsillitis | N/A | N/A |
| 16 | Obstructive sleep apnea | N/A | N/A |
| 17 | Tongue/gingiva/palate SCC | T4N0M0 | Well |
| 18 | Recurrent floor of mouth SCC | T1N0M0 | Well |
| 19 | Obstructive sleep apnea | N/A | N/A |
| 21 | Tongue SCC | T2N0M0 | Moderate |
| 22 | Retromolar trigone erythroplakia | TisN0M0 | Carcinoma in situ |
| 23 | Retromolar trigone SCC | T4N2bM0 | Moderate to poor |
| 25 | Recurrent tongue SCC | T2N0M0 | Moderate |
| 26 | Recurrent soft/hard palate/retromolar trigone SCC | T2N0M0 | Well to moderate |
| 27 | Tongue SCC | T2N2bM1 | Well |
| 28 | Floor of mouth SCC | T3N2cM0 | Moderate to poor |
| 29 | Recurrent tonsillar SCC | T2N0M0 | Moderate to poor with perineural invasion |
| 30 | Tonsil SCC | T2N0M0 | Poor |
| 31 | Anterior floor of mouth SCC | T2N0M0 | Moderate |
| 32 | Tonsillar fossa; soft palate, tongue base SCC | T2N2bM0 | Moderate |
| 33 | Tonsil SCC | T2N0M0 | Moderate to poor |
| 34 | Chronic tonsillitis | N/A | N/A |
| 35 | Tongue/base SCC | T2N0M0 | Well |
| 36 | Retromolar trigone SCC | T4N2bM0 | Moderate |
| 37 | Chronic tonsillitis | N/A | N/A |
| 38 | Recurrent tongue SCC | T2N2bM0 | Moderate |
| 39 | Recurrent floor of mouth/tongue SCC | T2N2bM0 | Moderate to poor |
| 40 | Proliferative verrucous leukoplakia | N/A | Proliferative hyperplasia |
| 41 | Mandible SCC | T4N0M0 | Moderate |

SCC: squamous cell carcinoma; N/A: not available.
(a) According to the TNM staging system of the American Joint Committee on Cancer.

Hierarchical Clustering Analysis

After filtering the raw data, as described above, 4617 genes were left for analysis. We used a hierarchical clustering algorithm to study the changes in gene expression in the oral carcinomas on a genome-wide level. The resultant data matrix is shown in Figure 1a. Each row represents the expression levels of a particular gene across all samples, as described above, and each column represents the expression level of all of the genes tested for each sample. To test the reproducibility of the oligonucleotide array and of our algorithm, two samples, N5 and CA1, were arrayed twice. The resultant expression profiles were named N5a and CA1a, respectively. In each case, the same samples were clustered with a high degree of relatedness (Fig. 1a). This shows that the protocol is reproducible enough so that repeated analyses of the same specimen are more similar to one another than they are to analyses of other specimens.

Figure 1.

(a) Hierarchical clustering of the gene expression data. The 4617 genes that remained after filtering were combined into a single matrix that was clustered as described in the text. The dendrogram at the top lists all of the samples arrayed and measures their degree of relatedness in gene expression. All samples were coded with numbers, as shown in Table 1. The samples of invasive oral squamous cell carcinoma (OSCC) were labeled CA, whereas the samples of carcinoma in situ and hyperplastic lesions were labeled CIS and HYP, respectively. The oral tissue samples from patients without oral carcinoma were identified by the letter N, whereas the surgical margins from patients with oral carcinoma were labeled NCA and were assigned codes corresponding to the patient from whom they were obtained. The color bar underneath the sample identifiers marks the experimental samples with orange and the normal samples with blue. Each column represents the expression levels for all genes in a particular sample, whereas each row represents the relative expression of a particular gene across all samples. The expression level of any given gene in any given sample (relative to the mean expression level of that gene across all tissue samples) is reported along a color scale in which red represents transcriptional up-regulation, green represents down-regulation, and the color intensity indicates the magnitude of deviation from the mean. (b) Expanded view of Clusters 1 and 2 from a. Clusters 1 and 2 represent the genes that were up-regulated (red) and down-regulated (green), respectively, in the OSCC samples.

At a glance, our genome-wide analysis showed that, with the exception of sample CA31, the oral carcinoma samples were grouped into one cluster, whereas the normal control samples were clustered separately. This shows that genome-wide transcript profiling can be used to distinguish oral malignancy from normal oral tissue. However, within the carcinoma cluster, tumors from a particular pathologic grade or clinical stage were not clustered reliably with other tumors of the same grade or stage, indicating that there is significant molecular heterogeneity among tumors classified within a particular pathologic grade or clinical stage (Fig. 1a). Closer inspection of the data matrix reveals two main clusters of genes that were expressed differentially in oral carcinoma samples compared with normal control samples (Fig. 1a). These clusters were labeled Cluster 1 for genes that were up-regulated in the oral carcinoma samples and Cluster 2 for genes that were down-regulated. An expanded view of these clusters is provided in Figure 1b. It is noteworthy that sample CA31 was the only tumor that clustered among the normal samples. However, in general, the same genes that were up-regulated or down-regulated in the carcinoma samples (Clusters 1 and 2) also were expressed differentially in sample CA31 (Fig. 1b, Clusters 1 and 2). CIS22 and HYP40 (both premalignant lesions) were clustered with the invasive oral carcinoma samples (Fig. 1a), indicating that the genome-wide gene expression levels in these tissues were similar to the expression levels in the invasive oral carcinoma samples. It is interesting to note that, 4 months after the date of biopsy, the patient from whom sample HYP40 was obtained developed verrucous squamous cell carcinoma of the tongue. In particular, a closer look at Cluster 1 reveals that these two preneoplastic tissues generally up-regulate the same genes that were up-regulated in the invasive oral carcinoma samples (Fig. 1b, Cluster 1). Consistent with this pattern, samples NCA21 and NCA29 (both wound margins that were taken after resection of CA21 and CA29, respectively) clustered with CA31 with a high degree of relatedness (Fig. 1a). In these two samples, the pathology reports concluded that there was extensive dysplastic disease adjacent to the areas from which the NCA21 and NCA29 samples were taken. It is possible that certain changes in gene expression associated with malignancy are present in premalignancy. However, this was not the case for genes that were down-regulated in oral carcinoma (Fig. 1b, Cluster 2). In samples CIS22 and HYP40, the pattern of expression of these genes was similar to that seen in normal control samples. A similar pattern was observed for samples NCA21 and NCA29.
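
The clustering and color scale described for Figure 1 can be illustrated with a minimal sketch. It is not the procedure used in the study: the distance metric (correlation), the linkage rule (average), the function name cluster_samples, and the synthetic data are all assumptions made for illustration. The only element taken from the text is that each gene is expressed relative to its mean across all samples.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def cluster_samples(expr, method="average", metric="correlation"):
    """Mean-center each gene and return the dendrogram leaf order of samples.

    expr: genes x samples array of expression values (hypothetical input).
    method/metric: assumed settings, not taken from the study.
    """
    expr = np.asarray(expr, dtype=float)
    # Express each gene (row) relative to its mean across all samples.
    centered = expr - expr.mean(axis=1, keepdims=True)
    # linkage expects one observation per row, so cluster the transposed matrix.
    tree = linkage(centered.T, method=method, metric=metric)
    return centered, leaves_list(tree)

# Synthetic toy data: 6 genes, samples 0-2 "normal" and 3-5 "tumor".
rng = np.random.default_rng(0)
toy = rng.normal(100.0, 5.0, size=(6, 6))
toy[:3, 3:] += 60.0  # genes 0-2 higher in the tumor columns
toy[3:, 3:] -= 60.0  # genes 3-5 lower in the tumor columns
centered, order = cluster_samples(toy)
```

In this toy case the leaf order keeps the three normal columns together and the three tumor columns together, which is the kind of separation reported for Figure 1a. Positive entries of the centered matrix would be drawn in red and negative entries in green.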

Differential Gene Expression in Oral Carcinoma

Clustering analysis, although it is appropriate as a discovery tool, can find similarities in gene expression only by grouping either genes or samples into clusters independent of clinical data. Another major disadvantage of hierarchical clustering analysis is the inability to estimate the statistical significance of the results. Although it is possible to detect a group of genes that cluster with some arrays of interest (i.e., the oral carcinoma group), it is primarily a qualitative tool. To determine those genes associated with malignancy in a rigorous statistical fashion, we compared the expression profiles of the invasive carcinoma samples with those of the normal oral tissue samples from patients who were without malignant disease by using a statistical modeling approach that integrated normalization and regression analysis. Comparison of normal tissue from patients with carcinoma and normal tissue from patients without carcinoma did not reveal any statistically significant changes in gene expression. However, given the possibility of field cancerization21 for the surgical margins (NCA samples) from patients with oral carcinoma, we excluded the NCA samples from our regression analysis. The regression was performed by taking the disease status (i.e., tumor tissue vs. normal tissue) as an explanatory variable and using the expression level of each gene as a dependent variable (for details, see Materials and Methods). This statistical modeling approach is analogous to the Student t test. We identified 239 genes that were up-regulated and 75 genes that were down-regulated in the oral carcinoma samples (P ≤ 0.05) compared with oral tissue samples from individuals without OSCC. Tables 2 and 3 show the list of genes that were up-regulated or down-regulated, respectively, along with the fold changes in the gene expression in tumor samples compared with normal samples. To limit the size of the tables, we included only genes with P values ≤ 0.001. 
A full description of the data set can be found in a supplementary data file (Internet address, HYPERLINK
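
The statement that regression on disease status is analogous to the Student t test can be checked numerically. The sketch below is an illustration only: it omits the normalization step that the authors integrated into their model, the expression values are invented, and the helper name status_regression is hypothetical. It regresses one gene's expression on a 0/1 tumor indicator and compares the slope P value with the pooled-variance two-sample t test.

```python
import numpy as np
from scipy import stats

def status_regression(normal, tumor):
    """Regress one gene's expression on a 0/1 disease indicator.

    Returns the slope (difference in group means) and its two-sided P value.
    """
    y = np.concatenate([normal, tumor]).astype(float)
    x = np.concatenate([np.zeros(len(normal)), np.ones(len(tumor))])
    fit = stats.linregress(x, y)
    return fit.slope, fit.pvalue

# Invented expression values for a single gene (not study data).
normal = np.array([210.0, 190.0, 205.0, 198.0, 202.0])
tumor = np.array([480.0, 530.0, 610.0, 455.0, 520.0, 575.0])

slope, p_regression = status_regression(normal, tumor)
# Pooled-variance (Student) two-sample t test on the same values.
t_stat, p_ttest = stats.ttest_ind(tumor, normal, equal_var=True)
```

With a single binary explanatory variable, the slope equals the difference between the tumor mean and the normal mean, and the two P values coincide. That is the sense in which the two approaches are analogous.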

Table 2. Genes Up-Regulated in Oral Squamous Cell Carcinoma (P ≤ 0.001): Fold Changes

uPA: urokinase-type plasminogen activator; TGF: transforming growth factor; NF-IL6: nuclear factor-interleukin 6; IGF: insulin-like growth factor; SPARC: cysteine-rich acidic secreted protein; epith.: epithelial.

a The expression values of these genes in at least one sample were negative and thus were replaced with 20 in the calculation of fold change.

| Accession no. | Description | Fold change |
| --- | --- | --- |
| D15050 | Transcription factor AREB6 | 2.6 |
| D21255 | OB-cadherin 2 | > 10a |
| D31762 | KIAA0057 | 8.1a |
| D31887 | KIAA0062 | 3.4 |
| D38521 | KIAA0077 | 1.9 |
| D43950 | KIAA0098 | 1.5 |
| D64110 | tob family | 1.8 |
| D78132 | Ras homologue enriched in brain | 1.7 |
| D78151 | 26S proteasome subunit p97 | 1.6 |
| D78577 | 14-3-3 protein eta chain | 1.7 |
| D83174 | Collagen-binding protein 2 | 3.5 |
| D83777 | KIAA0193 | 3.2 |
| D87258 | Serine protease with IGF-binding motif | 3.9 |
| HG2197 | Collagen, Type Vii, Alpha 1 | 2.9 |
| HG2743 | Gamma-glutamyl transferase 1 (Gb:J04131) | 4.3 |
| HG2743 | Caldesmon 1, alt. splice 3, nonmuscle | 5.9 |
| HG3494 | Nuclear factor Nf-Il6 | 3.9 |
| HG4480 | Collagen Type VI alpha 2, N-terminal domain | > 10a |
| J03040 | SPARC/osteonectin | 12.9 |
| J03764 | Plasminogen activator inhibitor-1, exons 2–9 | 7.0 |
| J04102 | Erythroblastosis virus oncogene homolog 2 (ets-2) | > 10a |
| L00205 | K6b (epidermal keratin, type II) | 5.3 |
| L12350 | Thrombospondin 2 | 8.9 |
| L13210 | Mac-2 binding protein | 2.9 |
| L13698 | Gas1 | 5.4 |
| L25081 | GTPase (rhoC) | 2.0 |
| L32137 | Germline oligomeric matrix protein (COMP) | 12.7 |
| L77886 | Protein tyrosine phosphatase | 2.3 |
| M11718 | Alpha-2 Type V collagen, 3′ end | 14.5 |
| M11749 | Thy-1 glycoprotein | 3.8 |
| M17219 | brain G-binding protein alpha-1 subunit, 5′ end | 4.2a |
| M19989 | platelet-derived growth factor (PDGFA) A chain, exon 7 | 3.8a |
| M20777 | Alpha-2 (VI) collagen | 28.7 |
| M23294 | Beta-hexosaminidase beta subunit (HEXB) | 3.1 |
| M24766 | Alpha-2 collagen Type IV, 3′ end | > 10a |
| M26576 | Alpha-1 collagen Type IV, exon 52 | 4.3 |
| M55210 | Laminin B2 chain | 2.9 |
| M55998 | Alpha-1 collagen Type I, 3′ end | 13.2 |
| M61916 | Laminin B1 chain | 4.4 |
| M64673 | Heat-shock factor 1 (TCF5) | 7.1a |
| M77349 | TGF-beta induced gene product (BIGH3) | 7.1 |
| M83667 | NF-IL6-beta protein | 3.2 |
| M86757 | Psoriasin | > 10a |
| S54005 | Thymosin beta-10 | 1.9 |
| S62539 | Insulin receptor substrate-1 | 3.8a |
| S69115 | Granulocyte colony-stimulating factor induced | > 10a |
| S72493 | keratin-keratin 16 homolog | 10.2 |
| U01062 | Type 3 inositol 1,4,5-trisphosphate receptor | 2.8 |
| U0169 | Annexin V (ANX5) gene, 5′-untranslated region | 2.3 |
| U03057 | Actin-bundling protein (HSN) | 5.3 |
| U15131 | p126 (ST5) | > 10a |
| U16306 | Chondroitin sulfate proteoglycan versican V0 | > 10a |
| U17760 | Laminin S B3 chain | 5.0 |
| U21128 | Lumican | 9.8 |
| U22431 | Hypoxia-inducible factor 1 alpha | 3.0 |
| U22970 | Interferon-inducible peptide (6–16) | > 10a |
| U26173 | bZIP protein NF-IL3A (IL3BP1) | 2.8 |
| U30521 | P311 HUM-3.1 | 4.8 |
| U31201 | Laminin gamma 2 chain | > 10a |
| U31383 | G protein gamma-10 subunit | 2.1 |
| U32114 | Caveolin-2 | 3.9 |
| U41060 | Breast carcinoma, estrogen regulated LIV-1 protein | 3.9 |
| U44754 | PSE-binding factor PTF gamma subunit | 6.31 |
| U51478 | Sodium/potassium-transporting ATPase beta-3 subunit | 2.1 |
| U59877 | Low-Mr GTP-binding protein (RAB31) | > 10a |
| U72263 | Multiple exostoses type II protein EXT2.1 | 8.8a |
| U72882 | Interferon-induced leucine zipper protein (IFP35) | 3.9 |
| U73377 | p66shc | 3.5 |
| U86602 | Nucleolar protein p40 | 2.3 |
| U89505 | Hlark | 3.6a |
| U96629 | Chromosome 8 BAC clone CIT987SK-2A8 | 1.6 |
| V00594 | Metallothionein from cadmium-treated cells | 12.9 |
| X02419 | uPA gene | 9.4 |
| X06700 | Proalpha 1 (III) collagen, 3′ region | 13.6 |
| X07834 | Manganese superoxide dismutase (EC | |
| X07979 | Fibronectin receptor beta subunit | 2.5 |
| X14787 | Thrombospondin | > 10a |
| X15880 | Collagen VI alpha-1 C-terminal globular domain | 4.9 |
| X15882 | Collagen VI alpha-2 C-terminal globular domain | 5.4 |
| X52022 | Type VI collagen alpha3 chain | 22.5 |
| X52947 | Cardiac gap junction protein | 6.2 |
| X53416 | Actin-binding protein (filamin) (ABP-280) | 3.9 |
| X53586 | Integrin alpha 6 | 5.3 |
| X53587 | Integrin beta 4 | > 10a |
| X54941 | Cks1 protein homologue | 2.4 |
| X57351 | 1-8D gene from interferon-inducible gene family | 2.2 |
| X63629 | P cadherin | > 10a |
| X64330 | ATP-citrate lyase | 2.0 |
| X65965 | SOD-2 gene for manganese superoxide dismutase | 3.1 |
| X70683 | SOX-4 protein | > 10a |
| X78565 | Tenascin-C, 7560bp | 14.1 |
| X80692 | ERK3 mRNA | 2.2 |
| X83416 | PrP gene, exon 2 | 3.4 |
| X84373 | Nuclear factor RIP140 | 4.8a |
| X87160 | Epith. amiloride-sensitive Na channel, gamma subunit | > 10a |
| X89750 | TGIF protein | 2.9 |
| Y00264 | Amyloid A4 precursor of Alzheimer disease | 3.0 |
| Y00282 | Ribophorin II | 1.7 |
| Z19574 | Cytokeratin 17 | 26.1 |
| Z22534 | ALK-2 mRNA | 8.4a |
| Z24724 | PolyA site DNA | 2.8 |
| Z25521 | Integrin-associated protein | 2.4 |
| Z29083 | 5T4 gene for 5T4 oncofetal antigen | 6.9 |
| Z74615 | Preproalpha1 (I) collagen | 4.4 |
| Z74616 | Preproalpha2(I) collagen | 7.0 |

Table 3. Genes Down-Regulated in Oral Squamous Cell Carcinoma (P ≤ 0.001): Fold Changes

a The expression values of these genes in at least one sample were negative. The negative values were replaced with 20 in the calculation of fold change.

| Accession no. | Description | Fold change |
| --- | --- | --- |
| D14710 | ATP synthase alpha subunit | 1.7 |
| D38524 | 5′ Nucleotidase | 3.3 |
| D87735 | Ribosomal protein L14 | 2.0 |
| HG3928 | Surfactant protein SP-A1 delta | > 10a |
| J04794 | Aldehyde reductase | 2.0 |
| L19605 | 56 K autoantigen annexin XI | 2.5 |
| L25080 | GTP-binding protein (rhoA) | 1.7 |
| M17885 | Acidic ribosomal phosphoprotein P0 | 1.7 |
| M77232 | Ribosomal protein S6 | 1.7 |
| U31903 | CREB-RP (creb-rp) | 3.0 |
| U47414 | Cyclin G2 | 2.0 |
| X03342 | Ribosomal protein L32 | 2.0 |
| X07695 | Cytokeratin 4 C-terminal region | 10.0a |
| X57351 | 1-8D gene from interferon-inducible gene family | 2.5 |
| X67683 | Keratin 4 | 25 |
| X76013 | Glutaminyl-tRNA synthetase | 2.0 |
| X76223 | MAL, exon 4 | > 10.0a |
| X83218 | ATP synthase | 1.7 |
| Z50749 | Sds22-like mRNA | 1.7 |
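
Footnote a of Tables 2 and 3 states that negative expression values were replaced with 20 before fold changes were calculated. A minimal sketch of such a calculation follows. The footnote fixes only the replacement value. The ratio of group means, the function name fold_change, and the example numbers are assumptions, and for down-regulated genes the ratio would presumably be inverted so that reported values exceed 1.

```python
import numpy as np

FLOOR = 20.0  # replacement value given in the table footnotes

def fold_change(tumor, normal, floor=FLOOR):
    """Ratio of group means after replacing negative values with `floor`.

    Returns (ratio, flagged); flagged is True when any value was replaced,
    which corresponds to the footnote marker "a" in the tables.
    Using a ratio of means is an assumption of this sketch.
    """
    tumor = np.asarray(tumor, dtype=float)
    normal = np.asarray(normal, dtype=float)
    flagged = bool((tumor < 0).any() or (normal < 0).any())
    tumor = np.where(tumor < 0, floor, tumor)
    normal = np.where(normal < 0, floor, normal)
    return tumor.mean() / normal.mean(), flagged

# Invented values: one normal sample has a negative (background-subtracted) signal.
ratio, flagged = fold_change([400.0, 600.0], [-15.0, 60.0])
```

Here the negative normal value becomes 20, the normal mean becomes 40, and the ratio is 500 / 40 = 12.5 with the flag set. Without a floor, the ratio would be inflated or would change sign, which is presumably why such entries are marked separately in the tables.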

Many of the differentially expressed genes in our data set are known to encode for cytoskeletal and extracellular proteins or are involved in a wide array of cellular processes, such as cell proliferation, angiogenesis, and tumor invasion. Genes that encode for cytoskeletal and extracellular matrix proteins included filamin A; keratin 4 and 19; collagen Type III α1 and Type IV α1–α3; integrin α6, α3, β1, 4, and 7; laminin α4, β1, β3, and γ1; nidogen; heparan sulfate proteoglycan (HSPG); fibronectin; vinculin; actinin; ras homologue gene family, member C; FAT tumor suppressor (Drosophila) homolog; and cadherins 3 and 11. Genes that may be involved in tissue remodeling and fibrosis included lumican22 and colligin 2.23 Among the genes that may participate in tumor invasion and matrix morphology were cadherin 11; tenascin-C; Rho C; matrix metalloproteinases (MMPs), such as MMP2, MMP3, MMP12, and MMP14; osteonectin; caveolin 1 and 2; and tetranectin. It has been shown that these genes are expressed differentially in other types of carcinoma.24–27 Other proteinases that were up-regulated in the oral carcinoma samples included the serine proteinase uPA; its endogenous inhibitor, plasminogen activator inhibitor-1; and the cysteine proteinases cathepsin L, C, and K.

Genes that have been shown to be involved in angiogenesis,28–41 such as cysteine-rich angiogenic inducer 61 (CYR61), melanocyte growth stimulatory activity α (GRO1), HSPG2, hypoxia-inducible factor (HIF)-1α, interleukin 8, ephrin-A1, thymosin-β10, and thrombospondin 1 and 2, were overexpressed, whereas vascular endothelial growth factor receptor expression was down-regulated in the oral carcinoma samples. We also observed the differential expression of genes that were involved in cell survival and proliferation. These included the up-regulation of apoptotic markers annexin V and annexin VII and the down-regulation of annexin XI and cullin 3, which is involved in the ubiquitin degradation of cyclin E.42 We also observed the up-regulation of cell cycle regulation genes (ERK3 [microtubule-associated protein kinase 6], CDC2-associated protein, cyclin-dependent kinase 4, c-myc, jun B proto-oncogene, v-ets avian erythroblastosis virus E26 oncogene homolog 2, and v-fos FBJ murine osteosarcoma viral oncogene homolog [c-fos]) as well as protein kinases and phosphatases (protein-tyrosine kinase 7; protein tyrosine phosphatase, receptor-type, κ; protein tyrosine phosphatase, type 4A, 1; and dual specificity phosphatase 6). In addition, evidence of oncogenic or growth factor induction in OSCC was seen with the overexpression of certain immediate-early genes encoding for connective tissue growth factor (CYR61); insulin-like growth factor (IGF)-binding protein 10 (IGFBP10); protease, serine 11 (IGF binding); IGFBP7 (Mac25); and ETR101 immediate-early protein.

5T4 oncofetal antigen and HIF-1α have been associated with malignancy and aggressive disease43–48 and were overexpressed in our OSCC samples with statistical significance. Fourteen of 75 down-regulated genes in the OSCC samples encoded for ribosomal proteins. Among these, ribosomal protein L14 (RPL14) was localized to chromosomal region 3p21.3, a site that is deleted consistently in oral carcinoma.3, 4, 6, 49 The rhoA gene, which has been mapped to chromosomal region 3p21.3,50 also was down-regulated in the OSCC samples. Cytokeratin 4, which is a differentiation marker, also was down-regulated in OSCC.

Assessment of Correlation with Clinical Stage and Lymph Node Status

We explored whether the expression of the 314 genes that were expressed differentially between normal tissues and carcinoma tissues differed between tumors of different clinical stage and metastatic potential. Because of our small sample size, we grouped patients with Stage I and II disease into an early-stage disease category and grouped patients with Stage III and IV disease into a late-stage disease category. We performed statistical regression analysis on the 314 genes to compare 15 Stage I and II tumors with 11 Stage III and IV tumors. Our results showed that none of the 314 genes were expressed differentially at a significance level of 5%. The same analysis was performed to compare 16 nonmetastasized tumors (N0M0) with 8 metastasized tumors (N1–N3 or M1). We found that 1 of 314 genes had significant differential expression between the two groups (P = 0.02): ribosomal protein S13. The same results were obtained when these comparisons were made using all of the genes that were interrogated by the GeneChip® array.
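
To put a single finding at P = 0.02 among 314 tested genes in context, a short calculation may help. It is a sketch under stated assumptions and not part of the authors' analysis: it treats the 314 tests as independent and the reported P value as unadjusted (the text does not say whether it was adjusted), and it uses the Bonferroni rule only as a familiar reference point.

```python
n_genes = 314  # genes compared between the tumor subgroups (from the text)
alpha = 0.05   # significance level used in the text

# Number of genes expected to reach P <= alpha by chance if no gene truly differs.
expected_by_chance = n_genes * alpha

# Bonferroni per-gene threshold (an assumed illustration, not the authors' method).
bonferroni_cutoff = alpha / n_genes

observed_p = 0.02  # ribosomal protein S13, from the text
survives_bonferroni = observed_p <= bonferroni_cutoff
```

Under these assumptions, roughly 16 of 314 genes would be expected to reach P ≤ 0.05 by chance alone, and the per-gene Bonferroni threshold is about 0.00016, so a lone result at P = 0.02 would not stand out from chance.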

Validation of the Gene Expression Data Generated by the Oligonucleotide Arrays

The SYBR® Green I QRT-PCR assays confirmed the results we obtained using the array technology and regression analysis on the differential expression of four up-regulated genes (uPA, SPARC/osteonectin, cathepsin L, and MMP14) and one down-regulated gene (cytokeratin 4). The dissociation curves in each assay showed a single melting-point peak, indicating a single amplified product (data not shown). For each of these five genes, the relative average fold changes obtained by QRT-PCR paralleled the trends of up-regulation and down-regulation seen with the GeneChip® arrays: uPA, 2.7 versus 10.7; cathepsin L, 1.7 versus 1.6; SPARC/osteonectin, 21.0 versus 3.9; MMP14, 17.0 versus 2.8; and cytokeratin 4, 0.003 versus 0.06.
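
The agreement in direction between the two platforms can be checked directly from the fold changes listed above. The sketch below assumes that, in each pair, the first value is the QRT-PCR result and the second is the GeneChip result. The text does not state the order explicitly, but the direction check does not depend on it. The helper name same_direction is hypothetical.

```python
import math

# (first value, second value) for each gene, as listed in the text.
# Assumed order: QRT-PCR fold change, then GeneChip fold change.
fold_changes = {
    "uPA": (2.7, 10.7),
    "cathepsin L": (1.7, 1.6),
    "SPARC/osteonectin": (21.0, 3.9),
    "MMP14": (17.0, 2.8),
    "cytokeratin 4": (0.003, 0.06),
}

def same_direction(first, second):
    """True when both ratios lie on the same side of 1 (no change)."""
    return math.log2(first) * math.log2(second) > 0

concordant = {gene: same_direction(*pair) for gene, pair in fold_changes.items()}
log2_ratios = {gene: (math.log2(a), math.log2(b))
               for gene, (a, b) in fold_changes.items()}
```

All five genes agree in direction: the four up-regulated genes have positive log2 ratios on both platforms, and cytokeratin 4 has negative ratios on both. The magnitudes differ considerably, so the listed values support agreement in trend rather than in size of effect.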


The current results based on clustering analysis show that OSCC can be distinguished from normal oral tissue based on genome-wide transcriptional expression profiles and that OSCC tumors of the same clinical stage and grade showed considerable dissimilarities in their gene expression profiles. This latter observation suggests that the current TNM staging system used to classify OSCC may not be adequate. The genes we identified that distinguished OSCC from normal tissues have been associated by others with a wide array of cellular processes and pathways and merit exploration for their roles in oral carcinogenesis. Because our observations were based on a relatively small number of study participants, the conclusions would have to be confirmed or refuted by much larger studies.

Although our OSCC samples contained at least 60–70% tumor cells, an issue to address when analyzing any solid tumor sample is bystander cell contamination. OSCC tumors usually are exophytic or macroscopically apparent, and an effort was made to remove tissue that was not involved by tumor. However, even if microdissection is done, it is difficult to obtain a sample that contains only tumor cells. Ohyama et al. used laser-capture microdissection (LCM) to generate highly pure tumor samples for interrogation with oligonucleotide arrays.51 However, recognizing the limited quantities of RNA that can be obtained by using the LCM procedure, those authors used multiple rounds of linear transcript amplification to generate enough cRNA for hybridization. Not only is this extremely cumbersome for high-throughput analysis, but it also is unclear whether this approach would ensure linear amplification of low copy number transcripts among all samples tested. Future studies will be needed to directly compare gene expression profiles of LCM samples with gene expression profiles of biopsy samples to determine whether there are any significant differences.

Our cluster analysis revealed some interesting features of the CIS22 and HYP40 premalignant samples, which clustered with a high degree of relatedness (Fig. 1a). Although the sample size was too small to draw any conclusion, it seemed that the transcriptional expression profiles of these premalignant lesions had striking similarities compared with the invasive lesions. In particular, the genes that were up-regulated in the invasive carcinomas also were up-regulated in these two samples (CIS22 and HYP40). This finding raises the possibility that the up-regulation of these genes is an important early event in the development of OSCC and one that is sustained throughout malignancy. The same was not true for the genes that were down-regulated in the invasive lesions. For this cluster of genes, the pattern of expression in the CIS22 and HYP40 samples was similar to the pattern of expression in the normal controls. Perhaps the down-regulation of this set of genes is important for transition to malignancy, as discussed above. Clearly, studies of much larger sample size comparing premalignant lesions with invasive disease will be necessary to affirm or refute this finding.

It is interesting to note that the expression of the 314 genes that showed differential expression between oral carcinoma and normal tissue did not differ between clinically early-stage disease (Stage I–II) and late-stage disease (Stage III–IV) or between metastatic tumors and nonmetastatic tumors (with the exception of ribosomal protein S13). This observation is consistent with our finding that tumor samples did not cluster according to stage or grade, and that the premalignant lesions clustered with the tumor samples (see above, Hierarchical Clustering Analysis). Although our current sample size was not adequate to draw definitive conclusions, these results are compatible with the hypothesis that changes in genetic expression occur relatively early in carcinogenesis. This implies that early tumors already may contain genetic changes that allow more advanced tumors to grow, invade, and metastasize.

Publications on the study of gene expression in OSCC on a genome-wide scale have been extremely limited. A recent article reported results that were obtained from cells collected by LCM from paired normal and malignant oral epithelial tissue from five patients.20 Those authors found that 39 genes (16 genes that were up-regulated and 23 genes that were down-regulated) were expressed differentially between tumor tissue and normal tissue. Of the 16 up-regulated genes in that study, 11 genes also were up-regulated in our samples, and 6 of their 23 down-regulated genes also were down-regulated in our study. It remains to be determined whether their list of genes can discriminate between normal tissue and oral carcinoma tissue in a much larger study. A direct comparison between that study and ours is difficult, because 1) the sample sizes were significantly different; 2) the sample processing procedures used in the two studies were significantly different; 3) the GeneChip® analysis software and self-organizing maps algorithms used in that study to determine differential gene expression differed significantly from the regression-based algorithm used in our study; and 4) our statistical approach accounted for the potential false positive findings from making comparisons between a large number of genes in a small number of samples. Another study,52 a retrospective analysis of 17 patients with head and neck carcinoma, identified 375 genes with expression that distinguished patients into two groups with distinct clinical outcomes. Given the prospective nature of our study, as we follow our patients for their disease outcome, we will be able to assess whether the 314 differentially expressed genes between OSCC tissues and normal tissues are the same genes that differentiate patients according to their disease outcome.