Identification of a glioma functional network from gene fitness data using machine learning

Abstract Glioblastoma multiforme (GBM) is an aggressive form of brain tumours that remains incurable despite recent advances in clinical treatments. Previous studies have focused on sub‐categorizing patient samples based on clustering various transcriptomic data. While functional genomics data are rapidly accumulating, there exist opportunities to leverage these data to decipher glioma‐associated biomarkers. We sought to implement a systematic approach to integrating data from high throughput CRISPR‐Cas9 screening studies with machine learning algorithms to infer a glioma functional network. We demonstrated the network significantly enriched various biological pathways and may play roles in glioma tumorigenesis. From densely connected glioma functional modules, we further predicted 12 potential Wnt/β‐catenin signalling pathway targeted genes, including AARSD1, HOXB5, ITGA6, LRRC71, MED19, MED24, METTL11B, SMARCB1, SMARCE1, TAF6L, TENT5A and ZNF281. Cox regression modelling with these targets was significantly associated with glioma overall survival prognosis. Additionally, TRIB2 was identified as a glioma neoplastic cell marker in single‐cell RNA‐seq of GBM samples. This work establishes novel strategies for constructing functional networks to identify glioma biomarkers for the development of diagnosis and treatment in clinical practice.


| INTRODUC TI ON
Glioblastoma (GBM) remains the most common and aggressive (Grade IV) central nervous system (CNS) tumour 1,2 with median overall survival of up to 14-16 months. [3][4][5] Current GBM treatment regimens constitute a combination of radiotherapy with adjuvant Temozolomide (TMZ) chemotherapy, which could expand the life expectancy by 1.8 years on average. 6,7 Since prognoses and therapy responses vary dramatically among GBM patients, there remains the need to identify early diagnostic GBM biomarkers. One consensus was recently reached that IDH (Isocitrate Dehydrogenase 1) could be one biomarker based on which GBM can be divided as IDH-wild type and IDH-mutant. 2 The IDH-wild type tends to affect older people (mean age of 62) as the primary tumour and accounts for most of GBMs (~90%), while the IDH-mutant presents in the secondary GBM, which progresses from lower-grade glioma. However, the association between IDH status and GBM prognosis remains poorly understood.
In this study, we proposed a novel strategy to identify biomarkers by constructing a landscape of co-functional associations in the context of glioma, termed as Glioma Functional Network (GFN).
Unlike previously published biological networks, such as proteinprotein interaction networks, 8 gene co-expression networks, 9 the functional networks revealed gene-gene associations that do not necessarily physically interact or share similar expression patterns.
Initially, functional networks were constructed using double gene knockouts on a genome-wide scale. 10 However, this strategy is not feasible in the human genome given that the combination space would increase tremendously, making experimental and computational approaches rather challenging. With recent advances in genome-wide CRISPR-Cas9 functional screening, there are accumulating studies [11][12][13][14] focusing on genome-wide single-gene knockouts via CRISPR based techniques, and generating gene fitness data correlating the extent of cell proliferation to gene perturbations. While such data were generated across large pools of cell lines, computing pairwise functional similarities and inference genetic interactions can be implemented. Several studies have demonstrated the functional networks inferred from gene fitness screen data could recapitulate protein complexes 15 and functional modules. 16 Given that, there still lacked functional networks focusing on gliomas, we sought to fill the gap by implementing a novel systematic strategy to identify gene co-functional networks using machine learning algorithms.

| Predicting glioma functional network from CRISPR screening data using machine learning
Gene fitness data (version 21Q3) from CRISPR screening and RNA-seq expression data were downloaded from the Depmap portal (https://depmap.org/porta l/) and CCLE (Cancer Cell Line Encyclopedia, https://porta ls.broad insti tute.org/ccle) project respectively. To construct glioma specific networks, we retrained data from a total of 67 glioma cell lines ( Figure 1A). Several steps were applied to pre-process the data to select informative genes for predicting the network. Firstly, it was suggested that not all genes are expressed in cell lines, due to the inherent nature of genomic alterations in cancer cells. 15 Therefore, in each cell line, genes with less than 0 TPM (Transcript Per Million) were eliminated. Then, genes with fitness scores less than −0.5 in the most dependent cell lines were retained, as drastic fitness effects upon genetic depletion would facilitate functional relationships in the network construction. Lastly, genes with high variations in fitness data were retained.
The filtration criterion is 1 MAD (median absolute deviation) greater than the population MAD. Finally, a total of 959 genes were selected to prepare the training data ( Figure 1A).
To generate feature data as input for the machine learning pipeline, a series of similarity metrics, including Pearson correlation coefficient, Spearman's rank correlation coefficient, Euclidean distances, Dice's coefficient, Manhattan distance, Minkowski distance, Chebyshev distance, Harmonic mean, Jaccard index and mutual information, were computed among 959 genes in pairwise combinations. The R package, philentropy (version 0.5.0) 17 was used to compute these similarities, and 10 sets of feature data for a total of 459,361 gene pairs were generated (Table S1).
To generate reference data for machine learning, co-functional gene pairs reported from at least two out of three previous studies 15  with Radial Basis Function Kernel (svmRadial) and Weighted k-Nearest Neighbor Classifier (kknn). 21 The performances of the algorithms were benchmarked using receiver operating characteristic (ROC) analysis. The area under the ROC curves (AUROC) were computed for each algorithm, and the best performances were achieved by the MARS algorithm with AUROC of 0.94 ( Figure 1B). The optimal threshold was determined using the coords function from the R package, pROC (version 1.18.0). 22 A total of 47,475 gene pairs with MARS scores greater than 0.02 were selected as associated interactions for the glioma functional network ( Figure 1C).

| Detecting glioma functional modules
To detect modules from the glioma functional network, the ClusterONE 23 algorithm was used to predict modules from the functional network. Briefly, the ClusterONE algorithm aims to increase dense regions from randomly selected genes from the network and then identify groups of high cohesiveness as modules. 23 For this study, the program was downloaded from the CluterONE website (https://pacca narol ab.org/static_conte nt/clust erone/ clust er_one-1.0.jar), and a total of 88 modules were detected from the glioma functional network.

| Differential gene expression analysis in gliomas
To identify differentially expressed genes in glioma samples, the microarray intensity files (*.CEL) of three brain tumour studies, including the Repository for Molecular Brain Neoplasia Data (Rembrandt), 24 were downloaded from the NCBI GEO database with accession numbers: GSE68848 (Rembrandt), GSE16011 25 and GSE50161. 26 The raw data were normalized using the Bioconductor package, affy 27 and annotated using the custom CDF files (version: 25). 28 The Bioconductor package, limma, 29 was used to fit the linear models for differential gene expression analysis by comparing brain tumour and normal healthy samples. where h(t) is the expected hazard at time t, h 0 (t) is the baseline hazard, X i represents the expression levels of β-catenin target genes predicted in this study, and b i is the regression coefficient coefficient for gene i.

| Survival analysis of β-catenin target genes
For each cohort ( Figure 6), the Cox proportional hazards regression modelling was implemented using the coxph function from the R package, survival. The predict.coxph function was used to compute the risk scores. Then, samples were ranked based on the score and divided into two groups with high and low risks with a cut-off at the median value of the population scores. The significance of the difference of overall survival outcomes was evaluated using log-rank tests.

| GBM single-cell RNA-seq data analysis
Single-cell RNA-seq of GBM data were retrieved from three independent cohorts including single-cell suspensions from untreated IDH-wild type glioblastomas, 35 IDH-mutant astrocytomas and oligodendrogliomas. 36 Fresh tissues were subjected to dropletbased single-cell RNA-seq pipeline. The raw gene counts data were retrieved from NCBI GEO database with accession numbers: GSE89567 and GSE138794, the deposited website (https://github. com/mbour gey/scRNA_GBM). The Bioconductor packages, scater and scran, were used for data normalization, dimension reduction and clustering. To identify glioma neoplastic cells, the SingleR package was used to annotate the cells by correlating gene expression profiles with a previous published study. 37 Briefly, a total of 3589 cells were sorted from four GBM patients and subjected to RNAseq. Using differential gene expression analysis, seven types of cells were identified including astrocytes, immune cells, neoplastic cells, neurons, oligodendrocytes, OPC (oligodendrocyte precursor cells) and vascular cells. The normalized gene counts, cell type assignments and reduced dimension data were downloaded from the website (http://www.gbmseq.org/).

| Validation of GFN using published proteinprotein interaction networks and protein complexes
To validate GFN, protein-protein interaction (PPI) networks were retrieved from previous studies, including InBio_Map, 38 STRING (version: 11.5) 39 and BioGRID (version: 4.4.202). 40 For the STRING data, PPIs were filtered with confidence scores greater than 500. The GO semantic similarities among interacting protein pairs were computed using the Bioconductor, GOSemSim, package. 41 The manually Generation of the glioma functional network. (A) Diagram of the computational framework for generating glioma functional networks. (B) ROC analysis to benchmarking machine learning algorithms for predicting cofunctional gene associations. The AUC (area under curve) values were each algorithm was computed as shown in the brackets. (C) Landscape of the glioma functional network. The size of the node reflects the degree of each node. The grey lines denote predicted functional associations. The identified hub genes were highlighted in red curated complexes were retrieved from the comprehensive resource of mammalian protein complexes (CORUM, version 3.0). 42

| A glioma functional network (GFN) generated from gene fitness screening data
Genome-wide CRISPR screening of gene fitness in cancer cell lines has provided abundant data to generate functional networks 43 and elucidated the landscape of gene regulations in an unprecedented systematic manner. However, methodologies involved in these studies were limited to Pearson's correlation coefficient 43 or linear modelling. 16 We sought to implement a novel systematic strategy by incorporating similarity metrics with machine learning approaches to generate functional scores ( Figure 1A). After data preprocessing, we first applied ten similarity metrics (see Methods) to pairwise combinations of 959 candidate genes. Then, the resulting feature data were fitted with four machine learning models, including random forest (RF), 18 Figure 2B), which suggested as a novel resource with high biological relevances.
We then hypothesised that GFN may involve the pathogenesis of gliomas. Test this, each gene in the GFN was ranked by the Kleinberg's hub centrality scores, 44  were up-regulated in gliomas, while CDPIT and MFN2 were downregulated. MRE11 is engaged in DNA damage repair pathways, and it was previously reported to be involved in the breast cancer progression, 45 and played a role in the response of drug treatment in glioma. 46 METTL14 promoted differentiation of embryonic stem cells 47 and may regulate genes involved in cell proliferation, differentiation and DNA damage. 48 On the contrary, MFN2 was known as a tumour suppressor and exhibited lower expression in cancers. 49 Taken together, as central players in the network, dysfunctions of these genes would suggest GFN identified from this study played roles in glioma tumorigenesis.

| GFN modules significantly enriched in biological pathways
We next implemented a clustering algorithm, ClusterOne, to identify a total of 88 functional modules from GFN ( Figure 4A,   50 It also played roles in neurodegenerative disease, 51 and pontocerebellar hypoplasia. 52 Deregulations of pathway members including AIMP1, AIMP2 and AIMP3 were observed in gastric and colorectal cancer. 53 One previous study showed that the Terpenoid backbone biosynthesis pathway was down-regulated in glioblastoma cells due to the knock-down of lncRNA HULC, which was involved in cell proliferation. 54 Glioma progression associated genes identified from clustering and differential expression analysis were significantly enriched in the vibrio cholerae infection pathway. 55 The SNARE interactions in vesicular transport involved in the fusion of multivesicular body and cell membranes. 56 Knockdown of one of SNARE proteins, Stx1, could inhibit cell growth and invasion in glioblastoma. 57 In summary, pathway analysis revealed the GFN modules significantly enriched in glioma tumorigenesis, which could assist investigating glioblastoma biology.
Previous studies demonstrated that co-functional networks could recapulated protein complexes. 15 Consistent with these findings, GFN modules were also significantly overlapped with protein complexes, including 55S mitochondrial ribosome (CID-02,

| Prediction of β-catenin targets from glioma functional modules
Accumulating evidence suggests that one of the embryonic stem cell signalling pathways, Wnt/β-catenin pathway, is involved in the proliferation 58,59 and prognosis 60 of gliomas, which prompts this pathway as potential therapeutic target. 61 Therefore, β-catenin dysregulations served as a hallmark in cancer progression. 62 Alongside with the concept of glioma stem cells (GSCs), the Wnt/β-catenin pathway has gained interest from the research community in recent decades. 63 Thus, we sought to further predict β-catenin potential targets inferred by the glioma functional modules, since

TA B L E 1 List of 12 predicted β-catenin targets
the regulator and its targets may be functionally associated. We identified that β-catenin is among the members of module CID-02 ( Figure 4B) and was functionally associated with 12 genes includ-  Table 1). Some of predicted targets are in line with previous studies, as Wnt/β-catenin pathway has potential affect in mediator complexes and SWI/SNF complexes. 64 Additionally, HOXB5 was one of homeobox genes and interacted conservely with Wnt/β-catenin pathway. 65 Experimental data suggested that HOXB5 involved in the progression of breast cancer through Wnt/β-catenin pathway. 66 In summary, various pieces of evidence suggested biological functions of β-catenin targets, which could play roles in glioma pathogenesis.
Notably, we compared predicted scores of β-catenin targets from our study with previously established methods based on Pearson correlation coefficient (PCC). The PCC scores of the predicted targets ranged from −0.06 to 0.12 ( Figure 5), while the GFN scores exceeded the cut-off threshold. This suggested that the machine learning strategy implemented in our study was able to reveal nonlinear co-functional associations, which could be neglected based on previously established methods. 15

| β-catenin targets significantly associated with glioma prognosis
While β-catenin dysregulations may involve in glioma progressions, we sought to fit expression levels of the predicted β-catenin targets in the proportional hazards regression models 34

| Identification of a glioblastoma neoplastic cell marker from GFN modules
For decades, great challenges remained in glioblastoma treatment due to tumour heterogeneity. 67 Nevertheless, the recent advanc-

F I G U R E 5
Fitness profiles in cancer cell lines (n = 1032) for β-catenin and its predicted targets from the glioma functional modules. Functional associated scores between β-catenin and its predicted targets were computed using PCC and retrieved from the GFN. Rows (genes) are hierarchically clustered based on PCC scores gliomas, one recent study demonstrated that its combined elevated expression with MAP3K1 was significantly associated with survival and chemoresistance. 71 Our study showed specificity of TRIB2 expression in glioblastoma neoplastic cells. While these data were at single-cell resolution and the elevated patterns are consistent in three independent cohorts, this could shed light on the possibility of becoming a drug target.

| DISCUSS ION
In this study, we presented a systems biology approach to identify GFN by applying machine learning algorithms on multiple similarities of genome-wide fitness screening data. We demonstrated the networks involved in glioma tumorigenesis and predicted potential targets of WNT/β-catenin pathways. These targets are significantly associated with glioma overall survival prognosis and also could be used as cell type-specific markers for the scRNA-seq data analysis. While most gene co-functional associations have not been reported before, they were significantly enriched biological pathways. This could serve a novel resource for studying tumour biology in gliomas. Additionally, we have demonstrated our computational strategy could capture gene co-functional associations that may be lost in previously established methods. 15 We reasoned that functionally associated gene pairs may share similar fitness patterns in a non-linear manner, which could be utilized by machine learning algorithms. This has provided another approach to identify gene co-functional associations in addition to linear methods, such as PCC and PCA.
From the GFN, we further identified a total of 88 glioma functional modules, which are densely connected in the GFN. Consistent with previous findings, 15 these modules are significantly enriched in biological pathways, or protein complexes. From one of the modules, we predicted β-catenin targets from its direct functional associations. Statistical modelling of expression levels of these targets was significantly associated with glioma overall survival prognosis.
However, these findings need to be verified in ChIP-seq for binding F I G U R E 6 Survival analysis of β-catenin target genes in four glioma studies. Left panel: samples were divided into high-and low-risk groups based on the risk scores generated from prognostic modelling using expression levels of β-catenin target genes. Right panel: Kaplan-Meier estimated the two groups associated with overall survival. P value was computed using the log-rank test sites and loss-of-function experiments. Lastly, we showed the identification of a glioma neoplastic cell marker, TRIB2, from single-cell RNA-seq data analysis, which could potentially become a drug target to tackle tumour heterogeneity challenges.
Taken together, we anticipate that the outcome of this study will significantly advance the understanding of tumour biology and the molecular attributes of glioma progression, but also facilitate the development of diagnostic assays for clinical applications as a complementary to the traditional histopathological assessments. We have also demonstrated the powerful capacity of the systems biology approach implemented in this project to elucidate biomarkers in various types of cancer. As the wealth of multi-omics data grows, the robustness of biomarkers could be improved by optimizing data from various sources, which could be expanded to a wider range of aspects, such as drug repurposing and personalized treatments in cancer.

ACK N OWLED G EM ENT
Not applicable.

CO N FLI C T O F I NTE R E S T
Not applicable. Project administration (lead).

PATI E NT CO N S E NT FO R PU B LI C ATI O N
Not applicable.

DATA AVA I L A B I L I T Y S TAT E M E N T
Not applicable.