• Open Access

Gene expression signature-based prognostic risk score in patients with glioblastoma


To whom correspondence should be addressed.

E-mail: ryaman@cmt.kpu-m.ac.jp


The present study aimed to identify genes associated with patient survival to improve our understanding of the underlying biology of gliomas. We investigated whether the expression of genes selected using random survival forests models could be used to define glioma subgroups more objectively than standard pathology. RNA from 32 non-treated grade 4 gliomas was analyzed using the GeneChip Human Genome U133 Plus 2.0 Expression array (which contains approximately 47 000 genes). Twenty-five genes whose expression was strongly and consistently related to patient survival were identified. In the survival analyses, the prognosis prediction score derived from these genes was the most significant of the variables examined. A prognosis prediction score based on three genes, combined with age as a classifier, also showed strong prognostic value among grade 4 gliomas. These results were validated in an independent sample set (n = 488). Our method was effective for objectively classifying grade 4 gliomas and was a more accurate prognosis predictor than histological grading.

Glioblastomas are pathologically the most aggressive form of glioma, with a median survival range of only 9–15 months.[1, 2] Even advances in cancer biology, surgical techniques, chemotherapy and radiotherapy have led to little improvement in survival rates of glioblastoma patients.[1] Poor prognosis is attributable to difficulties in early detection and to a high recurrence rate after initial treatment. Therefore, more effective therapeutic approaches, a clearer understanding of the biological features of glioblastoma and the identification of novel target molecules are needed for improved diagnosis and therapy of this disease.

Several histological grading schemes exist. The World Health Organization (WHO) system is currently the most widely used; a high WHO grade correlates with clinical progression and decreased survival rate.[3] However, individual fates vary within diagnostic categories, even in grade 4 glioma,[1, 2] indicating the need for additional prognostic markers. The inadequacy of histopathological grading is evidenced, in part, by the inability to recognize these patients prospectively.

Microarray technology has permitted the development of multiorgan cancer classification including gliomas, the identification of glioma subclasses, the discovery of molecular markers and predictions of disease outcomes.[4-14] Unlike clinicopathological staging, molecular staging can predict long-term outcomes of individuals based on gene expression profiles of tumors at diagnosis, enabling clinicians to make optimal clinical decisions. The analysis of gene expression profiles in clinical materials is an essential step towards clarifying the detailed mechanisms of oncogenesis and the discovery of target molecules for the development of novel therapeutic drugs.

In the present study, we describe an expression profiling study of a panel of 32 patients with grade 4 gliomas for the identification of genes that predict overall survival (OS) using random survival forests models, with validation in independent data sets.

Materials and Methods


Tissues were snap-frozen in liquid nitrogen within 5 min of harvesting and stored thereafter at −80°C. The clinical stage was estimated from accompanying surgical pathology and clinical reports. Samples were specifically re-reviewed by a board-certified pathologist at Niigata University, Niigata, Japan according to WHO criteria, by observing sections of paraffin-embedded tissues that were adjacent or in close proximity to frozen samples from which the RNA was subsequently extracted. The histopathology of each collected specimen was reviewed to confirm the adequacy of the sample (i.e. minimal contamination with non-neoplastic elements) and to assess the extent of tumor necrosis and cellularity. Informed consent was obtained from all patients for the use of the samples, in accordance with the guidelines of the Ethical Committee on Human Research, Niigata University Medical School (Protocol #70). Overall survival was measured from the date of diagnosis. Survival end-points corresponded to the dates of death or last follow up.

RNA extraction and array hybridization

Approximately 100 mg of tissue from each tumor was used to extract total RNA using the Isogen method (Nippongene, Toyama, Japan) following the manufacturer's instructions. The quality of RNA obtained was verified with the Bioanalyzer System (Agilent Technologies, Tokyo, Japan) using RNA Pico Chips. Only samples with 28S/18S ratios >0.7 and with no evidence of ribosomal peak degradation were included in the present study. One microgram of each RNA was processed for hybridization using GeneChip Human Genome U133 Plus 2.0 Expression arrays (Affymetrix, Inc., Tokyo, Japan), which comprised approximately 47 000 genes. After hybridization, the chips were processed using a Fluidics Station 450, a High-Resolution Microarray Scanner 3000 and a GCOS Workstation Version 1.3 (Affymetrix, Inc).

Validation of differential expression using real-time quantitative PCR

The quantitative PCR (QPCR) was performed using a StepOne Real-Time PCR System (Applied Biosystems, Tokyo, Japan) and TaqMan Universal PCR Master Mix (Applied Biosystems) according to the manufacturer's protocol. The Assays-on-Demand probe/primer sets (Applied Biosystems) used were as follows: ANGPTL1, Hs00559786_m1; ARHGAP39, Hs00286798_m1; ASF1A, Hs00204044_m1; CASP8, Hs01018151_m1; C11orf71, Hs00535489_s1; EFNB2, Hs00187950_m1; GAPDH, Hs99999905_m1; GPNMB, Hs01095679_m1; ITGA7, Hs00174397_m1; LDHA, Hs00855332_g1; LMAN2L, Hs01091681_m1; LOXL3, Hs01046945_m1; MED29, Hs00378316_m1; and MGMT, Hs01037698_m1.

Total RNA (1 μg) was reverse transcribed into cDNA using SuperScript II (Invitrogen, Tokyo, Japan) and 1 μL of the resulting cDNA was used for QPCR. Validation was performed on a subset of tumors that were part of the original tumor data set assessed. Assays were carried out in duplicate. The raw QPCR data were the number of cycles required for each reaction to reach the exponential phase (the threshold cycle, CT). Expression of GAPDH was used to normalize the QPCR data. Mean expression fold change differences between tumor groups were calculated using the 2^(−ΔΔCT) method.[15]
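The 2^(−ΔΔCT) calculation can be illustrated with a minimal Python sketch. The function and argument names are hypothetical, and the sketch assumes that duplicate CT values have already been averaged and that GAPDH is the reference gene, as stated above. The CT values in the usage note are invented for illustration and are not study data.

```python
def fold_change(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of a target gene by the 2^(-ddCT) method.

    ct_target_* : mean threshold cycle of the gene of interest
    ct_ref_*    : mean threshold cycle of the reference gene (e.g. GAPDH)
    *_test      : group of interest (hypothetical label)
    *_ctrl      : comparison group (hypothetical label)
    """
    # Normalize each group to the reference gene
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    # Difference of the normalized values between groups
    delta_delta = delta_test - delta_ctrl
    # One cycle corresponds to a two-fold change in template amount
    return 2.0 ** (-delta_delta)
```

For example, `fold_change(24.0, 18.0, 26.0, 18.0)` gives 4.0: the target crosses the threshold two cycles earlier in the test group after normalization, which corresponds to a four-fold higher expression.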


Immunohistochemistry

Five-micron sections from formalin-fixed, paraffin-embedded tissue specimens were used for immunohistochemistry (IHC). Endogenous peroxidase was blocked with 0.3% H2O2 in methanol. Antigen retrieval was performed by autoclaving at 120°C for 10 min in 50 mM citrate buffer (pH 6.0). The IHC for anti-O6-methylguanine-methyltransferase (MGMT; antibody dilution 1:50; clone MT3.1; Millipore, Billerica, MA, USA) was performed as described previously.[16] Immunoreactivity (MGMT staining index [SI]) was quantified by counting stained tumor nuclei in >1000 cells and was expressed as a percentage of positive cells. An MGMT SI >30% was considered positive for MGMT. Averages of three independent measurements were calculated to the first decimal place. Observers were not aware of case numbers.

Analysis of the isocitrate dehydrogenase 1 (IDH1) codon 132 mutation

A 129-bp fragment of IDH1 that included codon 132 was amplified using IDH1f, 5′-CGGTCTTCAGAGAAGCCATT-3′ as the sense primer and IDH1r, 5′-GCAAAATCACATTATTGCCAAC-3′ as the antisense primer. A PCR was performed on 20 ng of DNA with Taq DNA Polymerase (Takara, Tokyo, Japan) and standard conditions of 35 cycles were used. The PCR amplification product was sequenced using a BigDyeTerminator v3.1 Sequencing Kit (Applied Biosystems) using the sense primer IDH1f and antisense primer IDH1rc, 5′-TTCATACCTTGCTTAATGGGTGT-3′. Sequences were determined using the semiautomated sequencer (ABI 3100 Genetic Analyzer; Applied Biosystems) and Sequence Pilot version 3.1 software (JSI-Medisys, Kippenheim, Germany) as described previously.[17]

Bioinformatics analysis

All statistical analyses were performed using R software[18] and Bioconductor.[19] The Affymetrix GeneChip probe-level data were preprocessed using MAS 5.0 (Affymetrix Inc.) for background adjustment and log-transformation (base 2). Each array was normalized using quantile normalization to impose the same empirical distribution of intensities on each array. Genes that passed the filter criteria below were considered for further analysis. To select predictors (genes) for OS, we took the filtered gene expression values as input and applied the random survival forests–variable hunting (RSF-VH) algorithm.[20] Among the algorithm parameters, the number of Monte Carlo iterations (nrep) and the value that controls the step size used in the forward process (nstep) were set as nrep = 100 and nstep = 5, respectively, following the method of Ishwaran et al.[20] For other parameters, such as the number of trees and the number of variables selected randomly at each node, we used the default settings for the varSel function within the randomSurvivalForest package before selection. We classified samples into two survival groups using Ward's minimum variance cluster analysis, inputting the ensemble cumulative hazard functions for each individual at all unique death time-points, estimated from the random survival forests model fitted to the selected genes.
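The quantile normalization step can be sketched in a few lines of Python. This is an illustrative stand-in, not the Bioconductor routine used in the study: the function name is hypothetical, each array is assumed to be a list of log2 intensities of equal length, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one list of intensities per array.
    Returns new lists; the inputs are not modified.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value across arrays
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays)
                  for k in range(n)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of its rank
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization, every array contains exactly the same set of values, and only the assignment of those values to probes differs between arrays.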

The two classified survival groups were used to compute the prognosis prediction score (PPS) in a simple form (a linear combination of gene expressions). To do this, we used principal component analysis and receiver operating characteristic analysis. Briefly, we computed the first principal component of the expression of the genes selected by the RSF-VH algorithm as a risk score and then searched for the optimal cut-off value to predict survival groups with maximum accuracy using the Youden index.[21] The method was validated in the validation set (n = 488; Table 1), which was derived from glioblastoma patients in four external data sets.[8, 10, 12, 22]
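The cut-off search by the Youden index (J = sensitivity + specificity − 1) can be sketched as follows. The function name and the decision rule "score above the cut-off means poor prognosis" are assumptions made for illustration, and the principal component step that produces the risk score is not reproduced here.

```python
def youden_cutoff(scores, is_poor):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1.

    scores  : risk score per patient
    is_poor : True if the patient belongs to the poor-survival group
    Rule assumed here: score > cutoff predicts the poor-survival group.
    """
    n_pos = sum(1 for y in is_poor if y)
    n_neg = len(is_poor) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both survival groups must be represented")

    best_cutoff, best_j = None, -1.0
    for c in sorted(set(scores)):  # every observed score is a candidate
        true_pos = sum(1 for s, y in zip(scores, is_poor) if y and s > c)
        true_neg = sum(1 for s, y in zip(scores, is_poor) if not y and s <= c)
        j = true_pos / n_pos + true_neg / n_neg - 1.0
        if j > best_j:  # keep the lowest cut-off among equal maxima
            best_cutoff, best_j = c, j
    return best_cutoff, best_j
```

With perfectly separated groups, the returned J equals 1; in real data the maximum is lower and the cut-off marks the best available trade-off between sensitivity and specificity.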

Table 1. Patient characteristics of grade 4 glioma
Variable | Test set (n = 32) | Validation set (n = 488) | P
Age (years) | | |
Survival time (days) | 411 | 364 | 0.41

The survival tree method[23] was used to construct prognostic groups based on PPS and age among patients with grade 4 glioma. This method recursively partitions the PPS and age values, splitting patients into subsets at each step. The final output is a set of patient groups with similar prognoses, each defined by a combination of binarized PPS and age. This was executed using the rpart package of the R software.

The Kaplan–Meier method was used to estimate the survival distribution for each group. A log-rank test was used to test differences between survival groups. The association of the PPS with OS was evaluated using multivariate analyses with clinical characteristics and with other predictors using the Cox proportional hazards regression model. P < 0.05 was considered statistically significant.
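The Kaplan–Meier product-limit estimate can be written out in a short Python sketch. The function name and the input encoding (event = 1 for death, 0 for censoring at last follow up) are assumptions for illustration; the survival times in the test are invented and are not study data.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  : follow-up time per patient (e.g. days)
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving time t
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set
        at_risk -= leaving
    return curve
```

Censored patients contribute to the number at risk up to their last follow up but never trigger a step in the curve, which is why the estimate uses information from all patients.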


Results

Patient characteristics

Thirty-two non-treated primary glioblastomas (WHO grade IV) came from patients who underwent surgical resections between 2000 and 2005 (Table 1). The median age of patients was 54.5 years (range, 18–80 years). Twenty patients were male and 12 were female. The preoperative Karnofsky performance status (KPS) was at least 70 in 25 (78%) patients. The IDH1 mutation was negative in 31 cases, but was detected in one patient who remains alive 2365 days after the onset of disease. The MGMT IHC was positive in 21 cases and negative in 11 cases. After maximum surgical tumor resections, patients received external beam radiation therapy (standard dose of 60 Gy to the tumor with a 2-cm margin) and first-line chemotherapy with nimustine and temozolomide at recurrence. Patients were monitored for tumor recurrence during initial and maintenance therapy using MRI or computed tomography. Treatments were carried out at the Department of Neurosurgery, Niigata University Hospital. The median survival time was 13.7 months.

Selection of predictive genes

Microarray data were deposited in the Gene Expression Omnibus (accession number GSE43378) and 25 genes were selected as predictors. Table 2 shows a list of the genes with their variable importance values. The scatter plot in Supporting Information Figure S1 shows the relationships between the estimated ensemble mortalities and expression for six selected genes (AFTPH, ARHGAP39, CASP8, ITGA7, LDHA and LOXL3). Validation of the microarray results was accomplished using QPCR. These 10 genes were also found to be differentially expressed between short-term (survival time, ≤1.5 years) and long-term (survival time, ≥2.5 years) survivors (Table S1). The heat map (Fig. S2) shows patients clustered by estimated ensemble mortalities (columns) and genes clustered by their expression levels (rows). For patients with low survival (blue bar), the lower genes are overexpressed while the upper genes are underexpressed. For patients with improved survival (red bar), these patterns are reversed; thus, the indicated genes might be effective in distinguishing between patients with different survival rates.

Table 2. Identification of 25 survival-related genes
Probe set | Gene symbol | Gene title | VI
225708_at | MED29 | Mediator complex subunit 29 | 0.0202
227876_at | ARHGAP39 | Rho GTPase activating protein 39 | 0.0101
228821_at | ST6GAL2 | ST6 beta-galactosamide alpha-2,6-sialyltransferase 2 | 0.0101
200650_s_at | LDHA | Lactate dehydrogenase A | 0.0081
220260_at | TBC1D19 | TBC1 domain family, member 19 | 0.0060
218981_at | ACN9 | ACN9 homolog (S. cerevisiae) | 0.0060
231773_at | ANGPTL1 | Angiopoietin-like 1 | 0.0060
201141_at | GPNMB | Glycoprotein (transmembrane) nmb | 0.0040
228255_at | ALS2CR4 | Amyotrophic lateral sclerosis 2 (juvenile) chromosome region, candidate 4 | 0.0040
203427_at | ASF1A | ASF1 anti-silencing function 1 homolog A (S. cerevisiae) | 0.0020
222108_at | AMIGO2 | Adhesion molecule with Ig-like domain 2 | 0.0020
1562527_at | LOC283027 | Hypothetical protein LOC283027 | 0.0000
218789_s_at | C11orf71 | Chromosome 11 open reading frame 71 | 0.0000
219240_s_at | C10orf88 | Chromosome 10 open reading frame 88 | −0.0020
213373_s_at | CASP8 | Caspase 8, apoptosis-related cysteine peptidase | −0.0020
225126_at | MRRF | Mitochondrial ribosome recycling factor | −0.0020
209663_s_at | ITGA7 | Integrin, alpha 7 | −0.0040
223222_at | SLC25A19 | Solute carrier family 25 (mitochondrial thiamine pyrophosphate carrier), member 19 | −0.0040
214271_x_at | RPL12 | Ribosomal protein L12 | −0.0040
229648_at | ARHGAP32 | Rho GTPase activating protein 32 | −0.0040
228253_at | LOXL3 | Lysyl oxidase-like 3 | −0.0060
206172_at | IL13RA2 | Interleukin 13 receptor, alpha 2 | −0.0101
221274_s_at | LMAN2L | Lectin, mannose-binding 2 like | −0.0141
VI, variable importance.

Identification of a PPS associated with survival

The gene expression predictor PPS was computed from a linear combination of the 25 genes and was calculated for each tumor as follows:

display math

The Z1 score of the expression value for each individual gene was adapted in this formula. The Z1 scores ranged from −4.91 to 4.28, with high scores associated with poor outcomes. The optimal cut-off was a Z score of −1.17. As expected, the predictor performed well in terms of patient prognosis; the improved prognosis group (Z ≤ −1.17) had a median survival time of 721 days, while the poor prognosis group (Z > −1.17) had a significantly lower median survival time of 335 days (P < 0.0001; Fig. 1a).

Figure 1.

Survival analyses using the selected 25-gene classifiers show the prognostic value for glioblastoma. Kaplan–Meier curves that compare groups classified using the Z1 prognosis prediction score with the 25-gene model in the test (a) and validation (b) sets.

Identification of a PPS with a three-gene set associated with survival

For more practical purposes, the gene expression predictor PPS was computed from a linear combination of three genes and was calculated for each tumor as follows:

display math

The Z2 score of the expression value for each individual gene was adapted in this formula. The Z2 scores ranged from −2.53 to 2.27, with high scores associated with poor outcomes. The optimal cut-off was a Z score of −0.76. As expected, the predictor performed well in terms of patient prognosis; the improved prognosis group (Z ≤ −0.76) had a median survival time of 721 days, while the poor prognosis group (Z > −0.76) had a significantly lower median survival time of 335 days (P < 0.0001; Fig. 2a).

Classification using cell-of-origin is associated with survival

We classified our cases into proneural, neural, classical and mesenchymal subtypes using a gene expression-based method according to Verhaak et al.[13] (Fig. S3A). These four groups differed significantly in survival rates (P = 0.0093; Fig. S3B) and classification by cell-of-origin was found to be significantly associated with patient survival.

Figure 2.

Survival analyses using the selected three-gene classifiers show the prognostic value for glioblastoma. Kaplan–Meier curves that compare groups classified using the Z2 prognosis prediction score with the three-gene model in the test (a) and validation (b) sets.

The gene expression predictor is the most significant feature

The Z PPS results were compared with traditional individual indicators. As shown in Table 3, Z1, Z2, age, KPS and subtype were significantly associated with OS in univariate analyses. Table 4 shows the results of the multivariate analyses, which found that the gene expression predictor Z1 was significantly associated with OS. The PPS was the most significant feature among these clinical parameters.

Table 3. Prognostic value of clinical factors stratified by overall survival (OS) in patients with grade 4 glioma
Variable n Median OS (days) P
  1. CL, classical; IDH1, isocitrate dehydrogenase 1; IHC, immunohistochemistry; KPS, Karnofsky performance status; MES, mesenchymal; MGMT, O6-methylguanine-methyltransferase; ND, not determined; NL, neural; PN, proneural.

Age (years)
Z1 Score
Z2 Score

Table 4. Multivariate analysis: prognosis prediction score and clinical and therapeutic variables associated with overall survival in patients with grade 4 glioma
Entire series (n = 32)
Variable | Subgroup | Hazard ratio | 95% CI | P
Z1 | Continuous variable | 1.34 | 1.03–1.77 | 0.026
Z2 | Continuous variable | 1.48 | 0.95–2.38 | 0.081
MGMT IHC | Positive/Negative | 1.72 | 0.70–4.30 | 0.228
Age (years) | ≥60, <60 | 2.22 | 0.95–5.37 | 0.065
KPS | ≥70, <70 | 2.76 | 0.88–8.50 | 0.078
CI, confidence interval; IHC, immunohistochemistry; MGMT, anti-O6-methylguanine-methyltransferase.

The PPS formula was validated in the independent sample set

The PPS formula was validated in the validation set (n = 488; Table 1), which is derived from glioblastoma patients in four external data sets.[8, 10, 12, 22] The Z1 scores ranged from −5.43 to 5.33. As expected, the OS was significantly higher in the improved prognosis group (Z ≤ −1.17) than in the poor prognosis group (Z > −1.17; P = 0.0016; Fig. 1b). The Z2 scores ranged from −3.98 to 2.66. As expected, the OS was significantly higher in the improved prognosis group (Z ≤ −0.76) than in the poor prognosis group (Z > −0.76; P = 0.028; Fig. 2b).

Survival analyses using the PPS with a three-gene set and age classifiers show a prognostic value for patients with grade 4 glioma

Even among grade 4 gliomas in both the test (n = 32) and validation (n = 488) sets, the OS ranged between 0 and 3880 days. Fifty-two patients (10%) survived for longer than 1000 days. As predicted by the survival tree, the OS differed significantly between the improved prognosis group (Z2 ≤ −0.76, or Z2 > −0.76 with age <57 years) and the poor prognosis group (Z2 > −0.76 with age ≥57 years) in the test and validation sets (P = 0.0006 and P < 0.0001, respectively; Fig. 3). The median OS using the test and validation data sets was 641 and 490 days, respectively, for the improved prognosis group and 347 and 302 days, respectively, for the poor prognosis group. The two-year survival rates were 36.3% and 30.8% in the improved prognosis group and 4.7% and 11.8% in the poor prognosis group, using the test and validation data sets, respectively.
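The grouping produced by the survival tree reduces to a two-threshold rule, sketched here in Python. The cut-offs (−0.76 for Z2 and 57 years for age) are taken from the text; the function and constant names are hypothetical.

```python
Z2_CUTOFF = -0.76   # PPS cut-off for the three-gene score
AGE_CUTOFF = 57     # age threshold in years

def prognosis_group(z2, age_years):
    """Assign a patient to the improved or poor prognosis group.

    Improved: Z2 <= -0.76, or Z2 > -0.76 with age < 57 years.
    Poor:     Z2 > -0.76 with age >= 57 years.
    """
    if z2 <= Z2_CUTOFF or age_years < AGE_CUTOFF:
        return "improved"
    return "poor"
```

Only patients who have both a high Z2 score and an age of 57 years or more fall into the poor prognosis group; a low score or a younger age alone is enough for the improved group.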

Figure 3.

Survival analyses using the Z2 prognosis prediction score (PPS) and age classifiers reveal a prognostic value for glioblastoma. Kaplan–Meier curves compare groups classified using the Z2 PPS and age in the test (a) and validation (b) sets.


Discussion

We assessed relationships between gene expression and survival time using a random survival forests model. This is a tree-based method, which aids the detection of interactions. As discussed by Cordell, the functional form should contain gene-by-gene interaction terms.[24] The model was developed for use with data sets in which the variables (genes, in the present case) greatly outnumber the patients; a random forests framework is needed for such an analysis. Genes were selected using the RSF-VH algorithm, which eliminates the need to screen the genes.[20]

Many studies of microarray data use univariate analyses for screening in which potential genes that interact with other genes may be dropped from the analyses. However, the RSF-VH algorithm is more appropriate for our application and we previously reported its usefulness[20] in the identification of a gene-expression signature that predicts outcomes in patients with malignant glioma and primary central nervous system lymphoma.[14, 25]

Although our predictor was mainly based on cases from first-line nitrosourea-based chemotherapy, results from the combined four external data sets,[8, 10, 12, 22] in which first-line temozolomide-based chemotherapy was used, support the universal performance of the predictor, irrespective of the chemotherapeutic regimen. Survival benefit by chemotherapy is relatively small in most grade 4 gliomas, so it is important to elucidate the differences in the intrinsic biological characteristics of the tumors. Genetic differences within malignant gliomas also underscore the heterogeneity of these tumor types. Three of the 25 genes in the present study (GPNMB, LOXL3 and IL13RA2) were also identified in our previous report of oligodendrocytic tumor patients.[14]

The value of gene expression-based predictors in estimating the prognosis of malignant glioma patients will not be fully realized until more efficacious therapies are available for those in whom current treatment is less successful. In this regard, although the biological investigation of these genes is important, expression profiles might predict long-term survival as well as yielding clues about individual genes involved in tumor development, progression and response to therapy. Moreover, the ability to distinguish between histologically ambiguous gliomas will enable appropriate therapies to be tailored to specific tumor subtypes. Class prediction models based on defined molecular profiles allow malignant gliomas to be classified in a manner that correlates better with clinical outcomes than standard pathology does. Glioblastomas have wide-ranging survival times, which require a more precise prognostic scoring system to study novel therapeutic approaches. Therefore, the identification of molecular subclasses could greatly facilitate prognosis and our ability to develop effective treatment protocols. As our PPS involves a small number of genes, quantitative reverse transcriptase PCR assays or customized DNA microarrays could be developed for clinical applications. Molecular targeted therapies that specifically target disabled pathways might then be tailored for those patients with poor prognoses.

In summary, we identified gene signatures associated with outcome in patients with glioblastoma. Adaptation of subsets of these genes for use in clinical assays could result in improved outcome prediction. We have extended our observations to validate these signatures using independent data sets from other institutions. Our profiling results should help construct a new classification scheme that better assesses clinical malignancies compared with the conventional histological classification system.


This work was supported in part by JSPS KAKENHI grant number 21700312 to A.K. and 17390394 to R.Y.

Disclosure Statement

The authors have no conflict of interest.