Robust prognostic model based on immune infiltration‐related genes and clinical information in ovarian cancer

Abstract Immune infiltration of ovarian cancer (OV) is a critical factor in determining patient's prognosis. Using data from TCGA and GTEx database combined with WGCNA and ESTIMATE methods, 46 genes related to OV occurrence and immune infiltration were identified. Lasso and multivariate Cox regression were applied to define a prognostic score (IGCI score) based on 3 immune genes and 3 types of clinical information. The IGCI score has been verified by K‐M curves, ROC curves and C‐index on test set. In test set, IGCI score (C‐index = 0.630) is significantly better than AJCC stage (C‐index = 0.541, p < 0.05) and CIN25 (C‐index = 0.571, p < 0.05). In addition, we identified key mutations to analyse prognosis of patients and the process related to immunity. Chi‐squared tests revealed that 6 mutations are significantly (p < 0.05) related to immune infiltration: BRCA1, ZNF462, VWF, RBAK, RB1 and ADGRV1. According to mutation survival analysis, we found 5 key mutations significantly related to patient prognosis (p < 0.05): CSMD3, FLG2, HMCN1, TOP2A and TRRAP. RB1 and CSMD3 mutations had small p‐value (p < 0.1) in both chi‐squared tests and survival analysis. The drug sensitivity analysis of key mutation showed when RB1 mutation occurs, the efficacy of six anti‐tumour drugs has changed significantly (p < 0.05).


| BACKG ROU N D
Ovarian cancer is one of the most lethal gynaecological cancers, with an overall 5-year survival rate of 48%, and is in the top five cancer types associated with death in women. 1 Due to the unobtrusive nature of the symptoms, nearly 75% of patients present at an advanced stage, leading to a 5-year survival rate of only 29% in the advanced stage. The first-line therapy is tumour-debulking surgery followed by platinum-based chemotherapy; however, recurrence occurs in nearly 75% of patients. 2 Moreover, there is no confirmed effective drug for patients following platinum-resistant recurrence.
Although recent clinical trials have shown that poly ADP-ribose polymerase (PARP) inhibitors extend progression-free survival (PFS) of patients with the BRCA1/2 mutation, 3,4 homogenous repair deficiency (HRD), 5 or platinum-sensitive recurrence regardless of BRCA mutation, 6 the overall survival (OS) rate is still low.
Immunotherapy is highlighted in the search for new strategies for maintenance therapy in OV. Nonetheless, both active immunotherapy (e.g. ovarian cancer vaccine) and positive immunotherapy (e.g. adoptive T-cell therapy) have been studied, although the response rates are not ideal. 7 To identify novel drug targets and select patients that respond effectively to immunotherapy, it is essential to establish methods to predict the basic immune status of patients.
The first step in this process is to analyse differentially expressed genes (DEGs) and immune-related pathways associated with prognosis. Due to the variety of gene clusters identified in different types of statistical analysis, it is necessary to select an appropriate scoring strategy in relation to tumour immunity. The tumour microenvironment, which consists of immune cells, tumour cells and other components, 8 is a promising target for the exploration of novel biomarkers.
ESTIMATE is an algorithm designed by Yoshihara et al. to calculate an immune score based on differential gene signatures between stromal cells and tumour cells. 9 This tool has been used to identify microenvironment-related markers in several solid tumours. [10][11][12] In the present study, we combined the ESTIMATE algorithm with other bioinformatic analysis tools to build an immunomicroenvironment-related prognosis model (IGCI score) for OV. Two mutations RB1 and CSMD3, which are closely related to immune invasion and prognosis, were identified in OV and found that in patients with RB1 mutation, the sensitivity of some drugs has changed. Due to the lack of normal tissue samples, we downloaded the RNA sequencing data of 88 normal ovarian tissues from GTEx 13 (The Genotype-Tissue Expression, http://commo nfund.nih.gov/GTEx/) database and performed FPKM standardization. When combining expression data obtained from TCGA, the duplicate expression of a same gene was processed to get average value as the final expression value using the 'avereps' function in limma package in R.

| Screening of differentially expressed genes
DEGs between cancerous tissues and normal tissues may contain key information about disease development, so we then try to screen out the DEGs. Genes with expression were less than 0.3 in all samples were removed firstly. Then, the screening process for DEGs used the Wilcoxon signed rank test. 14 Genes with a false discovery rate (FDR) <0.05 and |log 2 fold change| > 1 were identified as DEGs.

| Weighted gene correlation network analysis (WGNCA) of DEGs
WGCNA was applied to analyse DEGs in an attempt to find key gene modules and key genes related to the development of ovarian cancer. 15 Research shows that the gene regulatory network in the organism obeys the basic structure of the scale-free network. To achieve this goal, the correlation coefficients between genes were weighted as follows: A topological overlap matrix (TOM) was then constructed, and the distance between genes was defined by considering other genes related to these two genes. Dynamic clustering methods were used to determine the final gene modules. Genes clustering within the same module often have similar functions. Correlation analysis was performed between the first principal component of the gene modules and the tumour phenotypes (for discrete variables, 0 represents no occurrence and 1 represents occurrence), and we obtained the

| DEGs related to immune infiltration
The ESTIMATE algorithm 9 is a method of gene set analysis to evaluate the purity of tumour tissue. The ESTIMATE algorithm first performs whole-genome sequencing data on known immune cells and tumour cells and then performs screening of DEGs. Such DEGs are selected as the background. After that, other tumour tissue sequencing data can be analysed by GSEA 17 in this genetic background, and the score based on the degree of enrichment (ImmuneScore) can be used to evaluate the immune cell content in this tumour tissue.
The StromalScore is calculated via a similar process. In this study, the ESTIMATE algorithm was used to calculate the StromalScore and ImmuneScore values in all tumour samples to clarify the degree of immune infiltration in samples. According to the median ImmuneScore, tumour samples were divided into high-and low-score groups. DEGs related to ImmuneScore between low-score and high-score group were identified using the criteria: FDR <0.05 and | log2 fold change | > 1; DEGs related to StromalScore were identified in the same way.
The final DEGs identified from the intersection of these two groups of DEGs were considered to be key genes related to immunity during the development of ovarian cancer. These genes and key genes from WGCNA were used in the construction of subsequent patient prognosis models. The ESTIMATE algorithm was applied using the 'estimate' package (1.0.13) in R.

| Survival analysis and SNP analysis
The genes located at the intersection of the DEGs obtained by WGCNA and the ESTIMATE were regarded as being closely related to disease development and immune processes. We use patient clinical information and DEGs to build a prognostic model. We excluded samples that lacked clinical or gene expression information in the TCGA database, and ultimately obtained 258 patient data. We group patients according to 1:1 ratio randomly. The training data set has 130 samples and the test data set has 128 samples. We then built a prognostic model using the training set. Lasso regression 18 was used to eliminate collinearity between different factors, with 10fold cross-validation performed 1000 times. The penalty coefficient lambda selection criterion was used to obtain the smallest partial likelihood deviance. Multivariate Cox proportional hazards regression analysis 19 was then used to build an effective prognostic model.
The variable selection method is the forward-backward selection method. By adding weight to factors, we obtained the risk score for each patient according to the following formula: where β is the coefficient of the factor in the Cox regression model, and Valuei is the factor level. We divided patients into high-and lowrisk groups according to the median risk score in the training set, and plotted the Kaplan-Meier survival curves using a log-rank test in the training and test set. In addition, we predicted the survival of patients (1.26.0) packages in R were used in these analyses.
Then, we performed a survival analysis in relation to the presence or absence of SNPs. After excluding patients lacking SNP data, mRNA data or clinical data, data of 272 patients from the TCGA database were analysed. We performed survival analysis on all genes with mutations identified in at least 15 patients using log-rank tests. 21 In addition, we also performed chi-squared tests on the SNPs and ImmuneScore groups of patients to identify the key gene mutation related to the immune process of ovarian cancer. The mutations of no less than 5 patients were included in the chi-squared test.
The SNPs with P-value of less than 0.1 in both survival analysis and SNP analysis can be regarded as critical SNPs in ovarian cancer. We used the Genomics of Drug Sensitivity in Cancer (GDSC, https://www. cance rrxge ne.org/) database for drug sensitivity analysis of key SNPs. 'digest' (0.6.27) packages in R were used in these analyses.

| Functional and pathway enrichment analysis
The overall workflow of this study is shown in Figure 1A.

| Identification of DEGs in ovarian cancer process and immune invasion process
We aimed to identify DEGs that are closely related to the occurrence of ovarian cancer and the immune infiltration process.
Compared with 88 normal samples, 2908 genes were up-regulated and 3162 genes were down-regulated in 379 tumour samples ( Figure   S1A). These 6070 (2908 + 3162) genes could be regarded as closely related to the occurrence of OV and were used in subsequent WGCNA analysis.
It is generally believed that the degree of tumour immune infiltration is closely related to the content of stromal cells and immune cells in tumour samples. Compared with the low StromalScore group, a total of 734 genes were up-regulated and 398 genes were downregulated in the high StromalScore group ( Figure S1B). Compared with the low ImmuneScore group, 629 genes were up-regulated and 520 genes were down-regulated genes in the high ImmuneScore group ( Figure S1C). From the intersection of StromalScore and ImmuneScore, the up-(420 genes, Figure 1B) and down-regulated genes (263 genes, Figure 1C) were identified, and these genes can be regarded as genes that play a key role in the progress of tumour immune infiltration.

| Results of DEGs GO and KEGG gene enrichment analysis
To verify the method of grouping according to the scores assigned by the ESTIMATE algorithm was indeed applicable to our investigations and achieve a better understanding of the roles of the identified 683

| WGCNA of DEGs related to ovarian cancer
To make the connectivity of the gene regulatory network obey the power law distribution, we exponentially weighted the correlation coefficients of genes. A soft threshold (weight) of beta = 8 better meets the requirements of scale-free networks ( Figure 3A). We obtained 14 gene modules through dynamic clustering and then performed a correlation analysis between the gene modules and the occurrence of tumours ( Figure 3B). Based on previous reports, 26 we assumed that when the correlation coefficient >0.65, the module was the key gene module in the process of disease, and the hub genes were selected from these modules. As shown in Figure 3B, the red, blue, turquoise, black, green, purple, pink, grown, magenta and yellow modules had a coefficient >0.65 and were included in the subsequent analysis. In addition, the correlation analysis between modules ( Figure 3C) showed that the similarity among red, blue and turquoise gene modules was high; the similarity among the brown, magenta and yellow gene modules was high; and the similarity among the tan, black and pink gene modules was high. The scatter plot of gene importance is shown in Figure 3D  of these gene modules. 27 A total of 46 genes were obtained from the intersection of these genes with the DEGs obtained using the ESITIMATE algorithm ( Figure 3E). These genes were regarded as related both to the occurrence of ovarian cancer and the degree of immune infiltration.

| The establishment of the IGCI score
The degree of immune infiltration is often closely related to the tumour recurrence and the amount of tumour stem cells. The tumour recurrence is related to the patient's final survival status. Therefore, 46 genes from the intersection of the key genes of WGCNA and the DEGs by ESTIMATE algorithm were used, and then, Lasso regression was performed on these genes and 7 clinical factors (Age; Asian, 1 means yes, 0 means not; Black, 1 means yes, 0 means not; White,1 means yes, 0 means not; Stage; Pharmaceutical. Therapy, 1 means yes, 0 means not; Radiation. Therapy, 1 means yes, 0 means not) in the training set. The summary of the patient's clinical information is shown in Table 1. The criteria of selecting clinical factors are that the ration of samples with blank values are no more than 90%. Then, to prove our random grouping of patients is reasonable, we used the t-test for continuous variables and the chi-squared test for discrete variables to compare the clinical information of the training set and the test set. As shown in Table 1, all clinical information has no significant difference between the training set and the test set (p > 0.05), indicating this grouping can be used in subsequent studies.
In lasso regression, by minimizing partial likelihood deviance, we selected the penalty coefficient lambda ( Figure S2A,B), which was 0.007; the remaining 10 factors were included in the subsequent study. The levels of these 10 factors in the training set are shown in Table S1. Through Multivariate Cox proportional hazards regression analysis, we constructed a prognostic score based on immune genes and clinical information (IGCI) for patients with ovarian cancer: The high level of Age, FGF7 and CD14 was found to be disadvantageous for patients' prognosis, while high level of White, Pharmaceutical. Therapy and CCR1 was favourable factor for patients' prognosis. We calculated the hazard radio (HR) of each gene and the corresponding 95% confidence interval, as shown in Figure 4A. More details of the multivariate Cox proportional hazards regression are shown in Table S2.
The patient groups were subdivided according to median IGCI score in the training set. As the IGCI score gradually increased, the survival time decreased, and the proportion of patients' deaths gradually increased in both training and test set ( Figure S2C,D), which indicated the accuracy of IGCI score. In addition, analysis of patient survival status based on IGCI score groups showed significant differences ( Figure 4B,C). The p-value in the training set is less than 0.001 ( Figure 4B), and the p-value in the test set is less than 0.05 ( Figure 4C). In the training set, the 1-, 3-and 5-year OS rates in Then, to explore the relationship of factors and IGCI score, we used Wilcoxon test to see difference in a chromosomal instability score (CIN25) 20 in low-and high-risk groups. It was obvious that IGCI score was not associated with CIN25 score ( Figure S4A,B), which indicated that the IGCI score is not related to chromosomal instability. IGCI score was associated with age significantly but the Spearman coefficient was limited (0.61, Figure S4C,D). This may be due to that IGCI can provide additional information beyond ageing. In addition, through chi-squared test (Table S3), we found that IGCI is not related to pharmaceutical therapy, but this factor is significant in Cox regression.
In order to explore the content of prognostic genes at the protein level, we used data on immunohistochemistry (IHC) datasets (the Human Protein Atlas database, http://www.prote inatl as.org/) to explore the content of FGF7, CCR1 and CD14 in protein levels in ovarian cancer and control groups. The results are shown in Figure 5A,B.
The corresponding sample information is shown in the Table S4. To assist the clinical work of ovarian cancer, we established a nomogram based on the IGCI score ( Figure 5C). To test our IGCI score, we collected popular OV signatures to perform as baseline models.
At present, AJCC stage is often used to predict the prognostic status of OV patients. In addition, Carter et al. 20 proved that a signature of chromosomal instability containing 25 genes (CIN25) can also effectively predict the prognostic status of OV patients. Zhang et al. 20 proved the glycolysis and m5A RNA methylation processes can precisely reflect the prognosis state of OV patients; then, they developed related signatures (GRG score and m6A score), respectively, which were test in independent cohorts. In order to verify the validity of the IGCI score and the nomogram, we calculated the C-index of the IGCI score and compared with the C-index of the AJCC stage, CIN25, GRG score and m6A score. The results are shown in Table 2.
The C-index of the IGCI score in the training and test set were significantly higher than the results of the other scores. In addition, we have plotted the 3-year and 5-year survival rate calibration curves to evaluate the predictive power of the IGCI score. In the training ( Figure S3D,E) and test ( Figure 5D,E) set, our prediction results are close to the ideal results (red lines), and the errors are within the standard error range. This shows that the prediction results of the IGCI score are accurate.

| Genetic mutation analysis
To understand the types of gene mutations that are closely related to ovarian cancer, we first generated a waterfall map of the type of genetic mutation ( Figure 6). From TCGA database, we obtained 409 samples with complete information for types of genetic mutations and clinical data and the top 10 genes with the most mutations were selected for display ( Figure 6). The waterfall map showed no clear relationships among the stage of cancer, patient age and gene mutations ( Figure 6). Compared with other genes, TP53 had a variety of mutation types, while TTN and DST were found to be more prone to mutations in the intron regions.
The survival analysis of gene mutations revealed that the mutation of five genes significantly affect the prognosis of patients  We performed a chi-squared test ( Table 3) Figure 7B). On this basis, the use of drugs with enhanced efficacy for these patients may get better treatment results.

| Key factors in the IGCI score
In the IGCI score, age is a poor prognostic factor, which is closely related to the decline in physical function of the elderly. In addition, though 'Pharmaceutical. Therapy' factor cannot be identified at the diagnosis state, the clinician can use this nomogram with this factor to predict the survival rate change after these two kinds of treatments.
The FGF7 gene has the largest absolute coefficient (0.4089,  The CD14 antigen is a 365-amino acid phosphatidylinositolbinding glycoprotein, which is mainly expressed on monocytes and macrophage membranes in body tissues. 33 In tumour tissues, macrophages mainly infiltrate the pericarcinoma and cancer interstitial tissues. 34 The detection rate of macrophages represents the immune status of local tumour tissues. Therefore, CD14 can effectively reflect the level of immune infiltration in patients with ovarian cancer. In our prognostic model, the HR of CD14 is =0.879 <1 ( Figure 4A), which is a favourable factor for prognosis. Cancer patients with a large degree of immune infiltration often have better prognosis, 35 which is consistent with the results of our model. TRRAP is a subunit of histone acetyltransferase and a key cofactor for c-Myc, which is an oncogenic DNA-binding transcription activator. By recruitment of TRRAP, c-Myc activates RNA polymerases I and III to control ribosome biogenesis and cell growth. It has been confirmed that TRRAP positively regulates the accumulation of mutant p53 in lymphoma, and TRRAP inhibition by histone deacetylases decreases mutant p53 levels. 45 In addition, TRRAP depletion leads to down-regulation of TOP2A, which is consistent with our results and indicates that the association between these two genes is worthy of exploration in ovarian cancer.

| Key genes in the gene mutation analysis
The SNP of RB1 is closely related to immune process ( Table 3).
This evidence strongly illustrates the potential of RB1 as an immune and prognostic marker in ovarian cancer. A previously reported model indicated that RB1 functions as an essential tumour suppressor, which physically interacts with RBAK. 46 In ovarian cancer, the concurrent inactivation of P53 and RB1 is adequate forcarcinogenesis, 47 and in addition, RB1 may promote chemotherapy resistance. 48 Based on comprehensive studies on the BRCA state and DNA repair, a model has been proposed to illustrate the relationship between P53 and RB1, as well as homologous repair deficiency. 49 RBAK was computationally predicted as a downstream target of miR-155 in lymphoma, 50 although the function of RBAK in solid tumours remains poorly understood. Our analysis revealed differential regulation of RBAK and RB1 in patients, which implies the importance of interactions between these two genes in ovarian cancer and their potential as drug targets. Further studies are required to confirm the involvement of RBAK in this process following its interaction with RB1.
CSMD3 is a member of CSMD gene family. The nonsynonymous mutation of CSMD3 has been identified in familial colorectal cancer but not in healthy controls. 51 Whole-exome sequencing has revealed that CSMD3 is the second most frequently mutated gene after TP53 in non-small cell lung carcinoma, and loss of CSMD3 causes proliferation of airway epithelial cells. 52 Additionally, CRISPR/Cas9-mediated knockout of CSMD3 inhibits the death of PDX tumour cells, which also suggests that CSMD3 is an important tumour suppressor. 53 HMCN1, which is a conserved extracellular member of the immunoglobulin superfamily, manages epithelial cell attachments. As a cell polarity regulatory gene, HMCN1 is significantly up-regulated in gastric carcinoma. 54 Moreover, mutation of HMCN1 is associated with metastasis in breast cancer. 55 In ovarian cancer, HMCN1 may promote invasiveness by regulating cancer-associated fibroblasts. 36 Newly generated fibrocytes act as a 'wall' that prevents the entry of immune cells into the ovarian cancer site.
As a plasma glycoprotein, von Willebrand factor (vWF) mediates the attachment of platelets confronted with damaged endothelium. 56 A large population-based study demonstrated the association between coagulation, inflammation and survival of cancer patients, 57 indicating that increased mortality in cancer survivors is dependent on high vWF levels. Our data also show that vWF expression is related to immune scores and patient survival, suggesting vWF as a novel biomarker of the basic immune state and predicts long-term survival in cancer patients.
For the first time, we report the immune-related mutations of ZNF462, ADGRV1 and FLG2 in ovarian cancer. As a zinc-finger protein, ZNF462 may take part in embryonic development and is associated with neurodevelopmental abnormalities. 58 ZNF462 may be the target of miR-210, 59 which could be induced by hypoxia-inducible factor-1alpha in pancreatic cancer. ADGRV1 is one of the biallelic pathogenic identification markers of Usher syndrome, 60 which is characterized by congenital bilateral sensorineural hearing loss. 61 The deficiency of FLG2 causes defective adhesion between cornified cells, which leads to peeling skin syndrome. 62 However, the roles of these two genes in cancer remain unclear, and further studies are required.

| CON CLUS ION
In the present study, data from TCGA and the GTEx database were combined with the ESTIMATE algorithm to identify 46 genes closely related to OV occurrence and immune infiltration process. Using genes and clinical information together, we established the IGCI score containing six essential factors. The IGCI score and the corresponding nomogram have been effectively verified by the ROC curves, C-index and calibration curves on the test set. The prediction ability of the IGCI score is better than AJCC stage (p < 0.05) and CIN25 (p < 0.05). In addition, by analysing, the gene mutations related to the process of ovarian cancer, the gene mutations that are closely related to the patient's prognosis and the degree of immune infiltration were identified. We conducted drug sensitivity analysis on key gene mutation, which provides a reference for subsequent research.

CO N FLI C T O F I NTE R E S T
The authors declare that there are no conflict of interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data involved in this study can be obtained through the TCGA (https://portal.gdc.cancer.gov/) and GTEx (https://www.gtexp ortal. org/home/index.html) databases.