Machine‐learning based radiogenomics analysis of MRI features and metagenes in glioblastoma multiforme patients with different survival time

Abstract Background This study aimed to examine how well multi‐dimensional MRI features predict survival outcome, and how they are associated with differentially expressed genes (RNA sequencing), in groups of glioblastoma multiforme (GBM) patients. Methods Radiomics features were extracted from segmented lesions in T2‐FLAIR MRI data of 137 GBM patients. Intensity, shape and textural features in seven classes were included in the analysis. Patients were divided into two groups according to survival time (shorter or longer than 1 year). Four machine learning algorithms were implemented to construct prediction models. Features with top importance (importance >0.04) were selected to construct the final prediction model using the best-performing algorithm. The interactions between image features and genomics were then analysed with Pearson's correlation analysis. Results The GBDT model built on the 72 features with the highest importance had the highest accuracy of 0.81 on both the short and long survival time classes, and the areas under the receiver operating characteristic (ROC) curve (AUC) for the short and long survival time classes were 0.79 and 0.81, respectively. Six metagenes showed significant interactive effect (P < 0.05), and Pearson's correlation analysis revealed that three of these metagenes (TIMP1, ROS1 and EREG) showed moderate (0.3 < |r| < 0.5) or high (|r| > 0.5) correlation with image features. Conclusion Radiogenomics analysis shows that MRI features are predictive of survival outcomes and that image features are highly associated with selected metagenes. Radiogenomics analysis is a useful method for optimizing clinical diagnosis and selecting effective treatments.


| INTRODUCTION
Glioblastoma multiforme (GBM), one of the most invasive and fatal brain tumours, develops from glial cells and can severely affect the central nervous system and general health [1]. The 5-year survival rate was estimated to be 33.2% between 2008 and 2014 according to statistics from the Surveillance, Epidemiology and End Results (SEER) database and the Centers for Disease Control and Prevention's National Center for Health Statistics (https://seer.cancer.gov/csr/1975_2015/). Owing to the heterogeneous nature of GBM, the relatively high age of disease onset and the migration of malignant cells to surrounding tissue, treatment outcomes for GBM are highly variable, with an average survival time of 12.6 months [2]. Current clinical practice for treating GBM mostly involves tumour resection and chemotherapy [3].
Genomics is an essential approach to studying GBM, as it examines alterations in genomic pathways and identifies relevant biomarkers. Gene studies involving tissues, plasma or cell lines have used protein expression data to reveal that common alterations in GBM include mutations of specific genes and proteins such as RTKs, TP53 and RB1, and increased expression of EGFR and PDGFRA [4,5]. However, tissue samples are usually acquired through biopsy and may not be obtainable for all patients, especially for early diagnosis.
Neuroimaging of GBM is a non-invasive tool for diagnosing disease and monitoring treatment outcome. A wide range of MR techniques, including T1, T2 and FLAIR imaging, are used to capture GBM characteristics. Typically, GBM appears as a heterogeneous enhancing region with non-enhancing necrosis in the centre [6]. FLAIR sequences have the advantage of showing abnormalities more clearly [7]. MRI-based features were shown to be highly predictive of tumour grading in GBM [8]. Textural image features were associated with CD3 T cell infiltration status in GBM [9].
In recent years, the emergence of radiogenomics, which combines radiomics image features with genomics, has allowed GBM to be studied more comprehensively. For example, MRI parameters revealed that haemodynamic abnormalities were associated with the expression level of the mTOR-EGFR pathway [10]. Based on previous findings, we aimed to investigate machine learning-based methods in combination with radiogenomics to study the associations among MRI features, genomics and survival in GBM patients. Computer-assisted methods allow a more comprehensive characterization of imaging data and a more sophisticated way of predicting disease outcome.
We hypothesize that radiomics features of FLAIR imaging data can be predictive of patients' survival, and that radiogenomics analysis can reveal the linkage between image features and known genes in previously defined molecular pathways.

| Dataset
MRI data were obtained from The Cancer Imaging Archive (TCIA) (https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM), and corresponding genomics data were acquired from the Genomic Data Commons (GDC) Data Portal. A total of 137 patients with MRI data and 129 patients with known genomic data were included in the analysis; 46 patients were in the intersection of the MRI and gene data sets. Patient characteristics are summarized in Table 1. Because the average survival time of GBM patients was reported to be 12.6 months [2], and all the patients in our cohort had died during follow-up, we used 1 year as the threshold for classification and divided the patients into short (<1 year) and long (>1 year) survival groups. Figure 1 shows the workflow of this study.

| Image preprocessing and lesion segmentation
Lesion segmentation is required before feature extraction. Lesion volumes were manually delineated by an experienced radiologist using 3D Slicer (https://www.slicer.org/). All original MRI images were in DICOM format. After loading the MRI data into 3D Slicer, we used the Segment Editor module to segment the lesion.

| Feature extraction
Feature extraction was performed using the Python software package PyRadiomics [11]. First-order and multi-dimensional features were extracted. The number of features in each class is listed in Table S1.
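A minimal sketch of how such an extraction could look with PyRadiomics is given below. The file paths are hypothetical, and the seven class names are PyRadiomics' standard feature classes, which we assume (the text does not confirm it) correspond to the seven classes used in the study. PyRadiomics reads images through SimpleITK, so a DICOM series would typically be exported from 3D Slicer to a single volume file first; that step is also an assumption.

```python
# Sketch only: the enabled classes and the file names are assumptions,
# not the study's actual configuration.
FEATURE_CLASSES = ("firstorder", "shape", "glcm", "glrlm", "glszm", "gldm", "ngtdm")


def extract_features(image_path, mask_path, feature_classes=FEATURE_CLASSES):
    """Return a dict of radiomics features for one image/mask pair."""
    # Imported lazily so this module can be loaded without PyRadiomics installed.
    from radiomics import featureextractor

    extractor = featureextractor.RadiomicsFeatureExtractor()
    extractor.disableAllFeatures()
    for name in feature_classes:
        extractor.enableFeatureClassByName(name)

    result = extractor.execute(image_path, mask_path)
    # Drop the "diagnostics_*" entries, which describe the run rather than the lesion.
    return {k: v for k, v in result.items() if not k.startswith("diagnostics_")}


# Example call with hypothetical file names:
# features = extract_features("patient001_flair.nrrd", "patient001_lesion.nrrd")
```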

| Machine learning model construction and evaluation
The MRI dataset was divided into training and testing sets at a ratio of 7:3. Four machine learning algorithms were tested: GBDT (gradient boosting decision tree), logistic regression, support vector machine (SVM) and KNN (k-nearest neighbours). Each of these four methods is representative of its own category. The gradient boosting decision tree is a tree-based ensemble model that can achieve state-of-the-art accuracy in classification and regression. Logistic regression is a classic probabilistic model. The support vector machine is another widely used model, characterized by the kernel trick [12]. K-nearest neighbours is a typical lazy-learning method and is frequently treated as a benchmark in predictive modelling [13].
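The comparison described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, the hyperparameters are library defaults, and the stratified split and feature scaling are our choices rather than details given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the radiomics matrix (137 patients); not the study data.
X, y = make_classification(
    n_samples=137, n_features=50, n_informative=10, random_state=0
)

# 7:3 split; stratification is an assumption so both classes appear in each set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Distance- and margin-based models are scaled first; trees do not need it.
models = {
    "GBDT": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each model on the training set and record test-set accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```

The printed accuracies come from synthetic data and carry no information about the results reported in this paper.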

| Relevant gene selection
Differentially expressed genes (DEGs) analysis was performed in R with the DESeq2 package. A gene is declared differentially expressed if the difference or change observed in its read counts or expression is statistically significant. Fold change and the t test are widely used methods to estimate gene variances in practice [14]. The criteria we applied for screening DEGs were |log2(fold change)| > 1 and adjusted P < 0.05. The same DEG analysis was applied to the gene dataset and to the dataset with both MRI and gene data. After screening out DEGs, the number of samples was reduced while individual differences among groups were enhanced. To screen DEGs efficiently, we selected the DEGs in the intersection of the gene dataset and the dataset containing both MRI and gene data.
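The screening rule and the intersection step can be illustrated with a short sketch. The gene names and values below are invented placeholders, and the tuple layout is an assumption standing in for a DESeq2 results table exported from R.

```python
import math

# Hypothetical DESeq2-style results: (gene, log2 fold change, adjusted P value).
results = [
    ("GENE_A", 1.8, 0.001),
    ("GENE_B", -1.3, 0.020),
    ("GENE_C", 0.4, 0.0005),         # significant but small effect
    ("GENE_D", 2.5, 0.300),          # large effect but not significant
    ("GENE_E", -2.1, float("nan")),  # adjusted P value unavailable
]


def select_degs(rows, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Keep genes with |log2FC| > lfc_cutoff and adjusted P < padj_cutoff."""
    selected = []
    for gene, lfc, padj in rows:
        if math.isnan(padj):
            continue  # treat a missing adjusted P value as not significant
        if abs(lfc) > lfc_cutoff and padj < padj_cutoff:
            selected.append(gene)
    return selected


# DEGs from the gene-only cohort.
gene_only_degs = set(select_degs(results))

# Hypothetical DEG set from the cohort that has both MRI and gene data.
paired_degs = {"GENE_A", "GENE_F"}

# Final candidates are the genes found in both analyses.
shared_degs = sorted(gene_only_degs & paired_degs)
print(shared_degs)
```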

| Correlations between image features and genomics
To

| Risk stratification of metagenes
To assess the prognostic power of the identified metagenes, we used maximally selected rank statistics [16], implemented in the R package maxstat, to find the optimal cut point for risk stratification on the basis of the expression value of each metagene. Afterwards, we used the Kaplan-Meier (KM) estimator to measure the patients' survival rates in the high and low gene expression strata and plotted the results with the R package survminer.
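The idea behind this stratification can be sketched in plain Python. The sketch below is a simplified stand-in, not the maxstat algorithm itself: it scans candidate cut points and keeps the one with the largest two-group log-rank chi-square, without the P value adjustment for multiple cut points that maxstat provides. The expression values, survival times and the `min_group` parameter are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    curve = []
    surv = 1.0
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        at_risk = sum(1 for ti in times if ti >= t)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
    return curve


def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({ti for ti, ei in zip(times, events) if ei == 1}):
        n = sum(1 for ti in times if ti >= t)
        n_high = sum(1 for ti, h in zip(times, in_high) if ti >= t and h)
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        d_high = sum(
            1 for ti, ei, h in zip(times, events, in_high)
            if ti == t and ei == 1 and h
        )
        obs_minus_exp += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_cutpoint(expression, times, events, min_group=2):
    """Return (cut, statistic); the high-expression group is expression > cut."""
    best = (None, -1.0)
    for cut in sorted(set(expression))[:-1]:
        in_high = [x > cut for x in expression]
        n_high = sum(in_high)
        if min(n_high, len(in_high) - n_high) < min_group:
            continue  # skip cut points that leave a group too small
        stat = logrank_chi2(times, events, in_high)
        if stat > best[1]:
            best = (cut, stat)
    return best


# Hypothetical data: expression level, survival in months, death observed.
expression = [1, 2, 3, 4, 5, 6, 7, 8]
months = [20, 22, 19, 21, 5, 6, 4, 7]
died = [1] * 8

cut, stat = best_cutpoint(expression, months, died)
high = [(t, e) for x, t, e in zip(expression, months, died) if x > cut]
km_high = kaplan_meier([t for t, _ in high], [e for _, e in high])
```

For a real analysis the maxstat and survminer packages named above remain the appropriate tools; the sketch only makes the underlying computation explicit.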

| Selected radiomics features
Thresholding on feature importance (importance index >0.04) yielded a total of 72 features for constructing the final prediction model. The threshold was chosen after manually inspecting the distribution of feature importance (Figure S1A). Table 2 lists the full names and abbreviations of the 72 features in the model.
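As a small illustration of this selection step, the sketch below keeps the features whose importance exceeds the cut-off. The feature names follow PyRadiomics naming conventions, but the scores are invented, and the text does not state how the importance index is scaled, so the scale used here is an assumption.

```python
# Hypothetical importance scores; values and scale are invented for illustration.
importances = {
    "original_glcm_Contrast": 0.112,
    "original_firstorder_Entropy": 0.087,
    "original_shape_Sphericity": 0.041,
    "original_glrlm_RunEntropy": 0.040,  # equal to the threshold, so excluded
    "original_ngtdm_Busyness": 0.013,
}

THRESHOLD = 0.04  # importance index cut-off quoted in the text


def select_by_importance(scores, threshold=THRESHOLD):
    """Return feature names with importance strictly above threshold, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


selected = select_by_importance(importances)
print(selected)
```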

| Model comparison
Among the four machine learning algorithms, GBDT had the highest accuracy, 0.81, for discriminating patients with short or long survival in the testing set, while the accuracies of logistic regression, SVM and KNN were 0.69, 0.76 and 0.79, respectively. Figure 2 shows the performance of the GBDT classifier. Figure 2A is the confusion matrix, showing the proportion of correct and wrong predictions in each survival class. Figure 2B shows the ROC curves for predicting patients with short and long survival, yielding an AUC of 0.79 for the short-survival class and 0.81 for the long-survival class.
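The three quantities reported here (accuracy, confusion matrix and AUC) can be computed as in the following sketch. The labels and predicted probabilities are invented, the 0/1 coding of short/long survival is an assumption, and the AUC is computed through its rank interpretation rather than by tracing the ROC curve.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts indexed as matrix[true][predicted] for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 0 = short survival, 1 = long survival.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
prob_long = [0.2, 0.4, 0.6, 0.7, 0.9, 0.5, 0.8, 0.1]  # predicted P(long survival)
y_pred = [int(p >= 0.5) for p in prob_long]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)
auc_long = roc_auc(y_true, prob_long, positive=1)
```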

| Metagenes selection
Six metagenes, WDR72, C14orf39, TIMP1, CHIT1, ROS1 and EREG, were found to have significantly different expression levels between patients with short and long survival times (Figure 3). The difference analysis of these six genes was conducted between the long and short survival groups, and the result is shown in Table 3. Figure
FIGURE 3 Gene expressions of the six genes. For all the subplots, 'group 1', coloured yellow, stands for the higher-expression group at the optimal cut point identified by maximally selected rank statistics.

| Associations between image features and survival outcome
Our results indicate that prediction models using radiomics features

| Differentially expressed genes in different survival groups
We identified six genes (WDR72, C14orf39, TIMP1, CHIT1, ROS1 and EREG) with significantly different levels of expression between the short and long survival groups. To reveal the relationship between the expression levels of these six genes and patient prognosis, a survival analysis was performed. In this study, we used the Kaplan-Meier (KM) estimator to measure the patients' survival rates in the high and low gene expression strata [19]. Figure 6 shows survival time, elevated levels of EREG expression have been found [20]. EREG can initiate the signalling cascade, and in gastric cancer, EREG is up-regulated [21]. Previous studies have shown that epiregulin, an EGFR ligand, stabilizes receptors and affects breast cancer cell functions associated with differentiation [22].
Altered TIMP-1 expression has been identified as a biomarker in GBM, with decreased TIMP-1 linked to longer survival in GBM [23]. ROS1, which belongs to a subfamily of insulin receptor kinase genes, is a proto-oncogene that is highly expressed in a variety of tumour cells. This gene is often altered in lung cancer, but its effects on the progression of GBM remain to be elucidated [24].

| Associations between image features and genes
Associating genes and microRNAs with high FLAIR volumes enables researchers to screen for molecular cancer subtypes and the genomic relationships of cellular invasion [25]. We found that TIMP-1 and EREG showed similar correlations with textural features ( were most predictive of GBM subtypes and overall survival [27].
Other relevant genes, such as POSTN, were found to play important roles in regulatory pathways through radiogenomics analysis [25].
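To make the correlation step concrete, the sketch below computes Pearson's r between one gene's expression and one image feature and labels its strength with the cut-offs quoted in the abstract (moderate: 0.3 < |r| < 0.5; high: |r| > 0.5). The data values are invented, the "weak" label for smaller values is our own wording, and treating |r| equal to 0.5 as moderate is an assumption, since the text leaves that boundary undefined.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Label |r| using the cut-offs quoted in the abstract."""
    magnitude = abs(r)
    if magnitude > 0.5:
        return "high"
    if magnitude > 0.3:
        return "moderate"  # also covers |r| == 0.5, which the text leaves undefined
    return "weak"


# Hypothetical values for one gene and one textural feature across six patients.
gene_expression = [2.1, 3.4, 1.8, 4.0, 2.9, 3.1]
texture_feature = [0.52, 0.61, 0.47, 0.70, 0.49, 0.66]

r = pearson_r(gene_expression, texture_feature)
print(f"r = {r:.2f} ({strength(r)})")
```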

| Limitations and suggestions
In this study, we used MRI data from 137 patients to identify radiomics features, but only a subset of them (46) also had genomics data. In future analyses, a larger sample of patients with both imaging and genomics data may help to detect more correlated genes. In addition to FLAIR data, other sequences and imaging modalities could be combined for multimodal analysis, which would allow the different methods to be compared.
We selected 72 features to construct the prediction model. More advanced dimensionality reduction methods could be applied to potentially improve classification performance.
Our study validates radiogenomics analysis as a method for studying the correlations among gene variables, imaging features and survival outcome in GBM. Our findings provide useful information for further examination of the corresponding genes, which may potentially serve as biomarkers for GBM diagnosis and as treatment indicators.

ACKNOWLEDGEMENT
None.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest with the contents of this article.

AUTHOR CONTRIBUTION
XL performed the research and wrote the paper, BC designed the research study and wrote the paper, YL(Luo) and WS analysed the data, YL(Li) contributed essential reagents and analysed the data.