Top‐down stepwise refinement identifies coding and noncoding RNA‐associated epigenetic regulatory maps in malignant glioma

Abstract With the emergence of the molecular era and retreat of the histology epoch in malignant glioma, it is becoming increasingly necessary to research diagnostic/prognostic/therapeutic biomarkers and their related regulatory mechanisms. While accumulating studies have investigated coding gene‐associated biomarkers in malignant glioma, research on comprehensive coding and noncoding RNA‐associated biomarkers is lacking. Furthermore, few studies have illustrated the cross‐talk signalling pathways among these biomarkers and mechanisms in detail. Here, we identified DEGs and ceRNA networks in malignant glioma and then constructed Cox/Lasso regression models to further identify the most valuable genes through stepwise refinement. Top‐down comprehensive integrated analysis, including functional enrichment, SNV, immune infiltration, transcription factor binding site, and molecular docking analyses, further revealed the regulatory maps among these genes. The results revealed a novel and accurate model (AUC of 0.91 and C‐index of 0.84 in the whole malignant gliomas, AUC of 0.90 and C‐index of 0.86 in LGG, and AUC of 0.75 and C‐index of 0.69 in GBM) that includes twelve ncRNAs, 1 miRNA and 6 coding genes. Stepwise logical reasoning based on top‐down comprehensive integrated analysis and references revealed cross‐talk signalling pathways among these genes that were correlated with the circadian rhythm, tumour immune microenvironment and cellular senescence pathways. In conclusion, our work reveals a novel model where the newly identified biomarkers may contribute to a precise diagnosis/prognosis and subclassification of malignant glioma, and the identified cross‐talk signalling pathways would help to illustrate the noncoding RNA‐associated epigenetic regulatory mechanisms of glioma tumorigenesis and aid in targeted therapy.


| INTRODUC TI ON
Malignant glioma, the most common primary malignant tumour of the central nervous system, usually occurs in the sixth through eighth decades of life, 1 has an annual incidence of 5.26 cases per 100,000 people and usually results in quick fatality. [2][3][4] Malignant glioma basically consists of low-grade glioma (LGG, WHO grades II-III) and glioblastoma multiform (GBM, WHO grade IV). 3,5,6 Nearly, all LGGs will ultimately progress to GBM. 7 The current therapy for newly diagnosed malignant glioma is surgical removal of the maximum safe amount of tumour followed by adjuvant radiation therapy and temozolomide chemotherapy. 8 This approach has a 2-year survival rate of 27% for newly diagnosed GBM, but the overall prognosis remains poor. 1,2 Therefore, it is urgent to elucidate the mechanisms of malignant glioma and find new treatments. Once upon a time, malignant gliomas have ever been classified, diagnosed and therapized based on histological characteristics and resemblance with a supposed cell type of origin. 9 In the past decade, however, the classification, diagnosis and therapy of glioma have dramatically changed. Deletions involving chromosomes 1p/19q, IDH mutations, and so on are closely related to prognosis. [10][11][12][13] The latest guide incorporates molecular parameters, in addition to histology, into the classification, diagnosis and therapy of gliomas, contributing to a profound and in-depth molecular era. 3 While accumulating studies have investigated coding geneassociated biomarkers in malignant glioma, research on comprehensive coding and noncoding RNA (ncRNA)-associated biomarkers is lacking. Noncoding RNAs, which are an emerging class of transcripts that are encoded by the genome but are mostly not translated into proteins, consist of housekeeping RNAs, small ncRNAs (sncRNAs, including microRNAs, miscRNAs, circular RNAs and piRNAs, etc.), and long ncRNAs (lncRNAs). 14,15 sncRNAs and lncRNAs were once indicated as byproducts of the splicing procedure, until some researchers found that they may function in post-transcriptional modification, the organization of nuclear domains, and the regulation of proteins or RNA molecules, indicating that the biological functions of ncRNAs are more pervasive than previously suspected. [16][17][18][19][20][21] Therefore, more research on these ncRNAs will lead to a greater understanding of cancer cell functions and may lead to novel clinical applications in oncology. 15 The competing endogenous RNA (ceRNA) mechanism is one of the most important mechanisms of ncRNAs and indicates that endogenous coding and noncoding RNAs may share common microRNA-binding sites and thus indirectly regulate the expression of each other by competing for microRNA binding. 22 Researchers have focused on lncRNA/miRNA/mRNA-associated ceRNA regulatory networks. However, ceRNA networks are not limited to lncRNA-associated networks; they apply to all ncRNAassociated networks. 22 Therefore, it is necessary and valuable to mine the comprehensive ncRNA-associated regulatory networks in glioma.
The emerging and accumulating applications of high-throughput sequencing make it easier to research and identify differentially expressed genes (DEGs) or biomarkers of various tumours. Based on bioinformatics methods, DEGs can be used to further mine critical signalling pathways or molecular mechanisms to provide guidance or illuminate tumour research questions. However, previous bioinformatics studies on glioma usually identified biomarkers or performed functional enrichment analyses. Few studies have illustrated the cross-talk signalling pathways among these biomarkers and mechanisms in detail. Illustration of the cross-talk signalling pathways among biomarkers and mechanisms is beneficial for understanding glioma tumorigenesis and malignant progression, as well as promoting glioma therapy.
'Stepwise refinement' was proposed by Swedish computer scientist Wirth in the 1970s and refers to compiling executable programs not in one step but in several steps and gradual refinement, where the program compiled in the first step is the most abstract, the program compiled in the second step is less abstract, and the program compiled in the last step is executable. 23 Here, we used two databases, The Cancer Genome Atlas (TCGA) database, whose samples were derived mainly from America (698 malignant glioma samples and 5 normal brain samples), and the Chinese Glioma Genome Atlas (CGGA) database, whose samples were derived mainly from Asia (1211 malignant glioma samples and 5 normal brain samples), to produce, validate and mine the ncRNAassociated networks. After the DEGs were identified, ncRNAassociated ceRNA networks were constructed. Subsequently, Cox and Lasso regression models were used to further identify the most valuable genes in glioma tumorigenesis and then validate them in LGG and GBM. 'Top-down' refers to dividing complex and large problems into small problems, determining the key points, and then qualitatively and quantitatively describing the problems with accurate thinking. Here, logical deduction based on references and top-down comprehensive integrative analysis, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment, single nucleotide variation (SNV), immune infiltration, transcription factor binding site and molecular docking analyses, revealed that these genes were correlated with the circadian rhythm, tumour immune microenvironment, and cellular senescence pathways ( Figure 1A,B). The epigenetic regulatory maps and signalling pathways among these genes were deduced and are discussed in the discussion section.
This research aimed to determine the potential genes involved

| Extraction of DEGs and construction of the ncRNA-miRNA-mRNA network
The sequenced genes were mapped to the human genome reference (Homo_sapiens.GRCh38.101.chr) from the Ensembl database (https://asia.ensem bl.org/index.html). The mRNAs were recognized and extracted using the keyword 'protein_coding', while the ncRNAs (non-miRNA) were recognized and extracted using the following keywords (see supplementary material section Ⅰ).
The differentially expressed mRNAs (DEmRNAs) and differentially expressed ncRNAs (DEncRNAs) (non-miRNA) were identified and depicted using the Linear Models for Microarray and RNA-Seq Data (LIMMA) package in R software (version 3.5.1). [24][25][26] The differentially expressed miRNAs (DEmiRNAs) were identified and depicted using the Empirical Analysis of Digital Gene Expression Data (edgeR) package in R software. 27 The DEGs were identified using |log2(fold change)| (|log2FC|), which was set to 2, and the false discovery rate (FDR), which was set to 0.05.
The DEncRNAs (non-miRNA) and DEmiRNAs were mapped to the ncRNA-miRNA network in the miRcode database (http:// www.mirco de.org/) to identify the DEncRNAs (non-miRNA) and DEmiRNAs in the ceRNA network. The target mRNAs were identified through the miRDB database, 28 miRTarBase database 29 and TargetScan database. 30,31 The ncRNA-miRNA-mRNA network was depicted using Cytoscape software 32 (version 3.6.1).

| Cox and Lasso regression models
First, the DEGs in the ncRNA-miRNA-mRNA network were added to the univariate Cox proportional hazards model (Cox) using the 'survival' package in R software (version 3.5.1). The Cox model was constructed as follows: where h 0 (t) is the benchmark risk equation, which can be any nonnegative equation for time; X i is the eigenvector of instance i; and β is the parameter vector, which is obtained by maximizing the Cox partial approximation.
Subsequently, the results were incorporated into the least absolute shrinkage and selection operator (Lasso) regression model using the 'glmnet' and 'survival' packages in R software (version 3.5.1).
Given the objective function the Lasso regression model would be constructed as follows: where only β is penalized, while α is free to take any allowed value. and represents a worse prognosis, then this gene would be recognized as a useful gene in the model, and the patients with this gene expression higher than the control would obtain a positive number score, while the patients with this gene expression lower than the control would obtain a negative number score. Finally, after testing each gene expression, every patient would obtain a risk score which is calculated by the summation of the scores of each gene. Half of the patients (whose risk scores are higher than the median) would be categorized into the high-risk group, and the others would be categorized into the low-risk group.
The final model was evaluated through the internal (TCGA) and external (CGGA) datasets by Kaplan-Meier survival analysis, which was performed and depicted using the 'survival' and 'survminer' packages in R software (version 3.5.1). The TDROC curve was generated and depicted using the 'survivalROC' package in R software (version 3.5.1), and the C-index was determined using the 'survcomp' package in R software (version 3.5.1). The area under the TDROC curve (AUC) was calculated as follows: where 1 f t 0 < f t 1 denotes an indicator function that returns 1 if

| SNV analysis
The median was used to divide patients into the high-risk group and low-risk group based on gene expression in the model through the 'predict' function of the 'survival' package in R software (version 3.5.1). SNV analysis was performed and depicted using the 'maftools' package (https://bioco nduct or.org/packa ges/relea se/ bioc/html/mafto ols.html) in R software (version 3.5.1).

| Functional enrichment analysis
The functional enrichment analyses via the GO and KEGG databases were performed and depicted using the 'clusterProfiler', 'org. Hs.eg.
db', 'enrichplot' and 'ggplot2' packages in R software (version 3.5.1) to identify the main biological functions of the genes in the model.
The enrichment functions were identified using the adjusted pvalue, which was set to 0.05.

| Molecular docking analysis
RNA sequences were obtained from the NIH genetic sequence database (GenBank), 33 while protein sequences were obtained from the UniProt database. 34 The minimum free energy (MFE) secondary structures of the RNAs were constructed using a loop-based energy model and the dynamic programming algorithm. 35 The three-dimensional structures of the RNAs were constructed based on the MFE secondary structure using the two-step procedure 36 and distance geometry (DG) method. 37 The three-dimensional structures of the RNAs were evaluated using all-heavy-atom knowledge-based statistical potential (3dRNAscore). 38 The three-dimensional structures of the proteins were obtained from the worldwide Protein Data Bank (PDB) database (https://www.wwpdb.org/) if they were known; otherwise, they were constructed using remote homology detection methods. 39 The interactions between RNAs and proteins were evaluated using the hybrid docking algorithm of template-based modelling and free docking (HDOCK). 40 The interactions between RNAs or proteins were evaluated using the docking energy score in HDOCK. The score used to evaluate the interaction between RNAs was set to −200, which means that a docking energy less than −200 could be recognized as a stabilized interactive conformation. Similarly, the docking energy score used to evaluate the interaction between proteins or RNAs and proteins was set to −300. The results were depicted using PyMOL software (The PyMOL Molecular Graphics System, Version 2.3.0 Open-Source, Schrodinger LLC).

| Transcription factor binding site analysis
The 2000 bp sequences upstream of the 5'-untranslated region (5'-UTR) are recognized as the potential sequences where promoters are located. The potential sequences where promoters are located were obtained from the National Center of Biotechnology Information (NCBI). Transcription factor binding sites were predicted using JASPAR software. 41 The possibility of binding (relative score) was set to 0.9.

| Tissue samples collection
The glioma tissues and corresponding adjacent tissues (NC) were collected from the patients who underwent surgery at

| Immunohistochemistry (IHC)
The slides were retrieved by the method of heat-induced epitope

| Statistical analysis
The interactions between RNAs or proteins were evaluated using the docking energy score in HDOCK. The score used to evaluate the interaction between RNAs was set to −200, which means that a docking energy less than −200 could be recognized as a stabilized interactive conformation. Similarly, the docking energy score used to evaluate the interaction between proteins or RNAs and proteins was set to −300. The DEGs were identified using |log2(fold change)| (|log2FC|), which was set to 2, and the FDR, which was set to 0.05.
The enrichment functions were identified using the adjusted p-value, which was set to 0.05. Patients were grouped into the high-risk group and low-risk group using the median. All statistical analyses were performed using R software (version 3.5.1).

| Ethics approval
The glioma tissues and corresponding adjacent brain tissues were  Figure 2E).
Third, the target mRNAs were identified through the miRDB database, miRTarBase database, and TargetScan database based on the remaining 23 miRNAs. Only the mRNAs that were recognized by all three databases were considered candidate target mRNAs.

| Construction of the Cox and Lasso regression models
The

| SNV analysis in malignant glioma
Before we could further determine and illustrate the molecular mechanisms among these genes, we must first examine the mutations in Furthermore, the samples were grouped into two categories, a high-risk group and a low-risk group, based on the expression of the 19 genes to examine the incidence rate of mutations in various genes ( Figure 4D,E). IDH1, TP53 and ATRX were the top 3 mutated genes in both the high-risk group and the low-risk group. The incidence rate of IDH1 mutation in the low-risk group (89%) was much higher than that in the high-risk group (65%), and simultaneously, the incidence rate of ATRX mutation in the low-risk group (40%) was much higher than that in the high-risk group (35%), while the incidence rate of TP53 mutation in the low-risk group (49%) was not significantly different from that in the high-risk group (47%).    Figure 5A), thus promoting PER2 and CRY2 expression. Moreover, CRY2 heterodimerizes with PER2 ( Figure 5B) to bind to the transcription factors NR4A2, HNF4A, PPARA and NR1D1 to facilitate their activation.  (Table S2).

| Clinical significance of the circadian rhythm, tumour immune microenvironment, and cellular senescence pathways in malignant glioma
The eleven genes within the circadian rhythm pathway (AC020907.

| Experiment validation
The PCR assays were performed to verify the 19-gene model preliminarily ( Figure Figure 3D,E). The results showed that the circadian rhythm and secretory granule organization were the most important biological processes. First, we will discuss the circadian rhythm.
Interestingly, Ben Franklin's aphorism 'early to bed, early to rise, makes a man healthy, wealthy, and wise' also suggests the potential importance of the circadian rhythm in health. Regardless of bacteria or eukaryotes, circadian rhythms play an important role in controlling a variety of physiological processes, and disruptions to normal circadian biology can cause many diseases. 73 Many physiological processes, including hormone secretion, drug and xenobiotic metabolism, glucose homeostasis, cell cycle progression, and tumorigenesis, are regulated by the circadian rhythm, 49,74,75 . Accumulating evidence indicates that circadian rhythm disorders cause many diseases. [76][77][78][79][80][81] Therefore, understanding the correlation between the circadian rhythm and disease is critical and important for understanding the underlying mechanisms and potential therapies. Our work reveals that glioma tumorigenesis is significantly correlated with the circadian rhythm. We strongly suggest not staying up late if you can go to bed early.

| Tumour immune microenvironment
During the process of illustrating the circadian rhythm pathway, among the 6 coding genes, we exhibited the biological functions of TBPL1, NPAS2, CRY2, and ETNK1; therefore, what are the roles of VPS33B and C9ORF40 in malignant glioma? As GO and KEGG functional enrichment analyses revealed that these genes are also related to secretory granule organization ( Figure 3D,E), we are going to discuss secretory granule organization.
Accumulating evidence indicates that VPS33B plays a critical role in vesicle-mediated transport and organization. 53 Immune analyses revealed that as the risk score increased, the pro-  Figure 1B).
Interestingly, for a long time, the immune system was considered to play a protective role in tumour development. 89 .
Accumulating evidence, however, indicates that various types of immune and inflammatory cells are frequently present within tumours and are strongly correlated with tumours. 90

| Cellular senescence
After illustrating the circadian rhythm pathway and tumour immune microenvironment pathway, only one coding gene, C9ORF40, and three ncRNAs, AC008738.2, AC102941.1 and AL136115.1, remain to be illustrated. Although C9ORF40 has no relationship with either the circadian rhythm or secretory granule organization, it plays a critical role in cellular senescence by activating the p21 pathway. 61 p21, which is encoded by CDKN1A, is a key protein in cellular senescence and causes G1 phase arrest by interacting with CDK1 or CDK2 to disrupt the interaction between substrates and CDKs, thus inhibiting cell cycle progression. 62-64 p21 can directly bind to PCNA and inhibit its activation, which inhibits DNA polymerase activity, transcription, and excision repair functions [65]. Moreover, p21 induces apoptosis by activating the caspase-3-mediated signalling pathway. 62 Figure 5D), AC008738.2 ( Figure S2G), and AC102941.1 ( Figure S2H). In eukaryotes, the PCNA trimer forms a hexagon-like structure with a cavity where DNA passes through PCNA and binds to DNA polymerase δ (Pol δ) to replicate the lagging strand ( Figure 5E) and cooperates with flap endonuclease 1 (FEN1) to process the Okazaki fragments for ligation [66].
However, the DNA binding cavity for PCNA would be occupied by the ncRNAs AC008738.2, AC102941.1 and AL136115.1, hindering and impeding DNA replication and thus suppressing proliferation and other cell cycle processes in malignant glioma.
In particular, the process of cellular senescence is considered a state of irreversible growth arrest. Cancer cells, however, do not enter this state and begin to proliferate infinitely. 94 Cellular senescence was found to be mediated by the two main tumour suppressor pathways of the cell: the ARF/p53 and INK4a/RB pathways. [94][95][96] Overall, overwhelming evidence indicates that cellular senescence is a critical feature of mammalian cells to suppress tumorigenesis. In our work, we also revealed that cellular senescence is significantly correlated with glioma tumorigenesis ( Figure 1B).

| CON CLUS IONS
• Nineteen genes ( • TBPL1 would bind to the NPAS2 promoter to promote its expression, and NPAS2 would interact with BMAL1 to promote CRY2 and PER2 expression and suppress C-MYC expression. CRY2, which complexes with PER2 would bind to the promoters of C9ORF40, ETNK1, and VPS33B to promote their expression. Low ETNK1 expression would directly promote malignant glioma proliferation.
• VPS33B would promote the organization and release of exosomes that deliver CT62, DPY19L2P1, and KCNH1-IT1 to immune cells, suppressing the differentiation of M1 macrophages into M2 macrophages.