Gene regulation network analysis reveals core genes associated with survival in glioblastoma multiforme

Abstract Glioblastoma multiforme (GBM) is a very serious mortality of central nervous system cancer. The microarray data from GSE2223, GSE4058, GSE4290, GSE13276, GSE68848 and GSE70231 (389 GBM tumour and 67 normal tissues) and the RNA‐seq data from TCGA‐GBM dataset (169 GBM and five normal samples) were chosen to find differentially expressed genes (DEGs). RRA (Robust rank aggregation) method was used to integrate seven datasets and calculate 133 DEGs (82 up‐regulated and 51 down‐regulated genes). Subsequently, through the PPI (protein‐protein interaction) network and MCODE/ cytoHubba methods, we finally filtered out ten hub genes, including FOXM1, CDK4, TOP2A, RRM2, MYBL2, MCM2, CDC20, CCNB2, MYC and EZH2, from the whole network. Functional enrichment analyses of DEGs were conducted to show that these hub genes were enriched in various cancer‐related functions and pathways significantly. We also selected CCNB2, CDC20 and MYBL2 as core biomarkers, and further validated them in CGGA, HPA and CCLE database, suggesting that these three core hub genes may be involved in the origin of GBM. All these potential biomarkers for GBM might be helpful for illustrating the important role of molecular mechanisms of tumorigenesis in the diagnosis, prognosis and targeted therapy of GBM cancer.

Many studies have illustrated the numerous candidate hub genes involved in GBM from RNA-seq data. Biomarkers could combine with different diseases and display a high value which lead to the development of a robust, effective take on GBM therapy. 5 Numerous recent studies have discovered the biomarkers in GBM from molecular biology to proteomics, such as circulating tumour DNA (ctDNA), DNA, microRNA (miRNA), lncRNA (long non-coding RNA) and protein. [6][7][8] ctDNAs in cerebrospinal fluid better reflected the sequential change of those tumour drivers than in plasma, which served as biomarkers to improve patients' outcomes. 9 GFAP (Glial Fibrillary Acidic Protein) and EGFR (Epidermal Growth Factor Receptor) were considered to be increased and as potential therapeutic markers in GBM patients. 10,11 CDK1 (Cyclin Dependent Kinase 1) and BUB1 (BUB1 Mitotic Checkpoint Serine/Threonine Kinase) were significantly connected with carcinogenesis of GBM. 12,13 Hsa-miR-21 and hsa-miR-10b were discovered as GBM-specific miRNAs, which lead to the development of a robust take on GBM therapy. 5 The blood biomarkers (LRG1 (Leucine Rich Alpha-2-Glycoprotein 1), CRP (C-Reactive Protein) and C9 (Complement C9)) revealed significant positive correlations with tumour size. 14 Our previous study also reported that HMG-box family and related ceRNA (competing endogenous RNA) established the significance of SOX6 (SRY-Box Transcription Factor 6) in the malignant progression of glioblastoma. 6 Taking into account the individual differences, in this study, we selected the microarray data from the Gene Expression Omnibus (GEO) database and the RNA-seq data from TCGA-GBM (The Cancer Genome Atlas Glioblastoma Multiforme) dataset, using the RRA (RobustRankAggreg) method to identify differentially expressed genes (DEGs) between GBM tissues and normal tissues. Through network analysis and functional analyses, it is possible to predict the pathways and interactions of DEGs. In addition, hub genes related lncRNA, miRNA and transcription factor (TF) were also explored. All of these bioinformatics methods were used to elucidate the comprehensive molecular mechanisms responsible for the development and progression of GBM and to provide potential biomarkers as a treatment for patients in different subgroups of GBM.

| Data preparation and RRA method analyses
The microarray data of human tissue from GSE2223, 15

| Function and network analyses
Protein-protein interaction (PPI) network was obtained from STRING v11. 29 We used the Cytoscape v 3.7.2 plugin MCODE and cytoHubba to select the hub genes. 30 We used Cytoscape to visualize the PPI networks, and MCODE plugin to screen a significant module from the PPI network with a degree cutoff = 2, node score cutoff = 0.2, node density cutoff = 0.1, Max depth = 100 and K-core = 2. Then, cytoHubba plugin was used to determine the hub genes when the degrees were ≥ 10.
The analyses of GO (Gene Ontology) enrichment, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway were performed via clusterProfiler package. 31 Gene set enrichment analysis (GSEA) was conducted by clusterProfiler package as well and drawn as emapplot and heatplot to better understand the results for GSEA. Survival analysis was drawn based on GlioVis database (http://www.gliov is.bioin fo.cnio.es). Tumour-infiltrating immune cells were inferred using TIMER (Tumor Immune Estimation Resource). 32 Functional associations of the hub genes (with TF, miRNAs and lncRNAs) were analysed using NetworkAnalyst. 33

| Validation of core hub genes
We downloaded the RNA-seq data of human tissues from CGGA, after batching all mRNA-seq matrix and removing the incomplete and duplicate data, we analysed these data by R packages: (a) Prognostic accuracy of the three core hub genes evaluated by ROC  The RRA method was used to identify genes that are ranked consistently better and to explore the robust DEGs in different seven datasets. Finally, we determined 133 significantly DEGs in these datasets, including 82 up-regulated and 51 down-regulated DEGs (Table S1)

| PPI network and hub gene detection
These 133 DEGs were input to the STRING database for PPI networks, for their potential biological functions. The clusters of subnetworks that were obtained from STRING database with 133 nodes and 547 edges (PPI enrichment P-value < 1.0e-16) ( Figure 3).

The top one biological processes (BP), molecular functions (MF)
and cellular components (CC) were listed in Table S2 and marked as red, purple and green, respectively. The highest enriched BP, MF, CC terms were 'developmental process', 'protein binding', 'secretory vesicle'. The top one KEGG pathway was 'AGE-RAGE signalling pathway in diabetic complications' marked as yellow. In the glioma pathway (hsa05214), included in CDK4 (Cyclin Dependent Kinase 4), PRKCG (Protein Kinase C Gamma) and EGFR. The highest enriched Reactome pathway was 'Extracellular matrix organization' marked as blue ( Figure 3 and Table S2).
Next, Cytoscape plugin MCODE were used to find out the hub genes. We found five types from MCODE results, the cluster one with score = 9.7, obtained 21 DEGs, and the other four clusters with score = 6.588, 4, 3 and 3, respectively ( Figure  Subunit)) ( Figure 4F).

| Functional and survival analyses
Gene ontology (GO) functional enrichment analyses were used to determine the potential molecular mechanisms employed by these   Figure S4.

| Tumour-infiltrating immune cells analyses and target interactive analyses
Subsequently, we predicted the target of ten hub genes and predicted their network interaction with lncRNA, miRNA and transcription factors (TF) (Figure 9 and Table S3). A total of 88 lncRNAs could target the ten hub genes (Table S3), such as HNF1A-AS1 could target EZH2 and MYC. In Figure 9A, CDK4 was targeted by hsa-miR-21-5p and hsa-miR-193b-3p. We also found TOP2A could regulate RRM2, TOP2A, CDC20 and EZH2 expression ( Figure 9B). FOXM1 could interact with PAX2 (Paired Box 2) and E2F4 (E2F Transcription Factor 4). Combined the results of survival analysis and lncRNA-, miRNAand TF network analysis, we were interested to further illustrate the molecular mechanism that were regulated these ten hub genes.

| Validation in the CGGA, HPA and CCLE database: CDC20, CCNB2, MYBL2
We further validated these ten central genes in the CGGA database and detected that CCNB2, CDC20 and MYBL2 expression was regular in different tissues ( Figure 10, Figures S5 and S6). Subsequently, Because of the IDH1 mutation status and WHO classification are important for the prognosis of GBM, so it is necessary to determine whether our risk score is an independent prognostic factor for overall survival. The prognostic significance of IDH1 mutation status and 1p19q codeletion status in GBM in the CGGA database was shown to be highly significant. When GBM patients in the CGGA dataset were classified into two groups according to CDC20, CCNB2 and MYBL2 expression for survival analysis. For core hub genes, the expression levels of CCNB2, CDC20 and MYBL2 were also significantly higher in GBM with IDH1 mutations than in those of IDH1 wild-type ( Figure 10H and Figures S5H and S6H). CCNB2, CDC20 and MYBL2 can classify patients into high-risk and low-risk groups according to different 1p19q_codeletion states ( Figure 10D and Figures S5D and   S6D), age ( Figure 10E and Figures S5E and S6E) and chemical status ( Figure 10F and Figures S5F and S6F), WHO ranks ( Figure 10G and Figures S5G and S6G), IDH1 states ( Figure 10H and Figures S5H and   S6H) and PRS (main, recurrent, second category) Type ( Figure 10I and Figures S5I and S6I), which p-value < 0.05. After univariate and multivariate cox analysis of core hub genes (CDC20, CCNB2 and MYBL2), we found these three genes independently indicated unfavourable prognosis in CGGA database, the expression of CCNB2, CDC20 and MYBL2 in different types could be independent prognostic marker in GBM.
We selected CCLE and HPA database to further validate the core hub genes. In glioma cells from CCLE, we could see the expression of core gene both in RNA-seq expression and Achilles shRNA knockdown ( Figure 11). In GBM U251-MG cells from HPA, immunofluorescent staining of human cell line U251-MG showed us the gene location, green represents antibody, red means microtubules.
CDC20 detected in the nucleoplasm and cytosol ( Figure 11A), compared with other RNA cell lines, although the gene expression is not cell-dependent, the expression of CDC20 in U251-MG is the highest ( Figure S7). CCNB2 localized to the cytosol and the Golgi apparatus, cell cycle dependent gene expression according to correlation analysis ( Figure 11B). The localization of MYBL2 is the nucleoplasm ( Figure 11C).

| D ISCUSS I ON
Recent bioinformatic studies aimed to analyse differentially expressed genes, miRNAs and lncRNAs demonstrated the robustness of the results obtained through the integration of different GEO and TCGA datasets, 37,38 encouraged researchers to carry out the realtime analysis of in-motion big data, while protecting privacy and security. 39 With the development of sequencing and bioinformatics technology, more and more gene datasets are released on the public platform, we need to sort them out, and analysed them from more angles. 27,40,41 Compared with low-grade disease, hsa-miR-506-514 cluster, hsa-miR-592, hsa-miR-199a-5p were related to the overall survival of the uveal melanoma patients. 42 Up-regulated hsa-miR-183-5p and down-regulated hsa-miR-195-5p were directly related to colorectal cancer in the cancer development. 43 Increased C-MYC F I G U R E 9 TF and miRNA with seven hub gene and their regulatory network. Blue shows hub genes, circle red shows miRNAs, square red means TF associated with glucocorticoid and was resistant in acute lymphoblastic leukaemia. 44 In this study, we selected six GEO datasets (389 Activation of the MYC signalling pathway in normal astrocytes exposed to GBM-EV may be the mechanism by which GBM acquires a phenotype that promotes tumour progression, 51 MYC was enriched in developmental process and protein binding in this study, was consistent with the predictions of this study. The miRNAs found altered in GBM have been widely reported, hsa-miR-21 was overexpressed in the development and progression of GBM. 52 Our previous study has established ceRNA network and identified miRNA, lncRNA and TF were glioma-related molecules in GBM. 6 In current study, hsa-miR-21-5p was involved in the regulation of CDK4 and MYC, which known to be detected in the GBM development. Six lncRNA-related ceRNA combined with four small molecule compounds were considered to help identify the regulatory functions of lncRNAs in the pathogenesis of GBM. 53 Via recruiting EZH2, LncRNA HOTAIR modulated the chromatin architecture. 54 We also found that lncRNA APTR and lncRNA H19 were up-regulated in GBM which interacted with EZH2, played a role in inhibiting the cell proliferation and promoting the tumorigenesis. LncRNA HOTAIR could also down-regulate EZH2 promoting cell cycle, cell proliferation and cell invasion. E2F4 was increased in GBM, which stimulated the GBM growth. 55 FOXM1 could also interact with E2F4 in this study. It was confirmed that the miRNAs here identified are able to target the hub genes here identified (specially CDK and cycline families).
Ten hub genes were all overexpressed in GBM, functional enrichment results focused on these up-regulated genes, there were In our study, CCNB2, CDC20 and MYBL2 were up-regulated in GBM compared with control brain tissues, with poor prognosis in GBM patients. We found that mRNA expression of CCNB2, CDC20 and MYBL2 was significantly different in primary, recurrent and secondary GBM (primary > recurrent>secondary), suggesting that CCNB2, CDC20 and MYBL2 might be involved in the origin of GBM. The results described above indicated that CCNB2, CDC20 or MYBL2 was independent prognostic marker for overall survival and might play a significant role in determining glioma prognosis, and the association between CCNB2, CDC20, MYBL2 and GBM should be investigated further.

| CON CLUS IONS
In summary, bioinformatics analysis in the context of big data identified key roles for ten hub genes FOXM1, CDK4, TOP2A, RRM2, MYBL2, MCM2, CDC20, CCNB2, MYC and EZH2 in the development, progression, diagnosis, treatment and prognosis of GBM.
These results provide candidate gene for molecular targeting therapy and biomarker for radiotherapy of GBM cancer, and further in vivo and in vitro experiments are needed to validate the role of these screened genes and pathways. At the same time, further research is needed for the synergistic interaction between CDC20, CCNB2 and MYBL2.

ACK N OWLED G EM ENT
We thank for our anonymous reviewers for their helpful comments and guidance throughout the review process.

CO N FLI C T S O F I NTE R E S T
The authors confirm that there are no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The microarray data of human tissue from GSE2223, GSE4058, GSE4290, GSE13276, GSE68848 and GSE70231 were downloaded from the Gene Expression Omnibus (GEO) database.