Novel gene signatures for prognosis prediction in ovarian cancer

Abstract Ovarian cancer (OV) is one of the leading causes of cancer deaths in women worldwide. Late diagnosis and heterogeneous treatment result to poor survival outcomes for patients with OV. Therefore, we aimed to develop novel biomarkers for prognosis prediction from the potential molecular mechanism of tumorigenesis. Eight eligible data sets related to OV in GEO database were integrated to identify differential expression genes (DEGs) between tumour tissues and normal. Enrichment analyses discovered DEGs were most significantly enriched in G2/M checkpoint signalling pathway. Subsequently, we constructed a multi‐gene signature based on the LASSO Cox regression model in the TCGA database and time‐dependent ROC curves showed good predictive accuracy for 1‐, 3‐ and 5‐year overall survival. Utility in various types of OV was validated through subgroup survival analysis. Risk scores formulated by the multi‐gene signature stratified patients into high‐risk and low‐risk, and the former inclined worse overall survival than the latter. By incorporating this signature with age and pathological tumour stage, a visual predictive nomogram was established, which was useful for clinicians to predict survival outcome of patients. Furthermore, SNRPD1 and EFNA5 were selected from the multi‐gene signature as simplified prognostic indicators. Higher EFNA5 expression or lower SNRPD1 indicated poorer outcome. The correlation between signature gene expression and clinical characteristics was observed through WGCNA. Drug‐gene interaction was used to identify 16 potentially targeted drugs for OV treatment. In conclusion, we established novel gene signatures as independent prognostic factors to stratify the risk of OV patients and facilitate the implementation of personalized therapies.

by OV in all malignancies in the United States. 3 The main reason for these observations is that more than 70% cases with OV are not diagnosed until the tumour has progressed to advanced stages (stage III-IV; International Federation of Gynecology and Obstetrics, FIGO). 4 At present, effective screening test for early OV detection has not been accessible. Biological markers such as the carbohydrate antigen 125 (CA125) and human epididymis protein (HE4) are widely used in clinical diagnosis. [5][6][7] However, the serum CA125 level is not specific for OV because its elevation may result from menstruation, benign gynaecological diseases and other cancers in spite of high sensitivity. 8 On the other hand, HE4 has reliable specificity but poor sensitivity. 5,9 What's more, the prognosis cannot be predicted although the combination of these biomarker levels improves diagnostic accuracy. Therefore, it is necessary to explore gene signatures associated with prognostic prediction from the potential mechanism of OV progression.
The G2/M DNA damage checkpoint serves to prevent the cell with DNA damage from entering mitosis (M-phase) during cell cycle. 10 In most tumours, upstream G1/S checkpoint is inactivated due to the loss of function of tumour suppressor genes, which strengthens their survival ability. Meanwhile, it means that tumour cells mainly rely on the G2/M checkpoint to avoid factors that disrupt genome stability. Furthermore, previous researches have shown robust correlations between G2/M cell cycle arrest and prognosis for multiple cancers, including OV. [11][12][13] Nevertheless, survival varies by category of OV. Epithelial cancers are the most common OV types. 3 Serous carcinoma, the most common epithelial subtype by histological classification, is mainly diagnosed at late stage and possesses aggressive nature of high grade. 14 Both advanced stage and high grade are important factors associated with worse prognosis. 15,16 Prognostic predictors need to be further developed, especially for patients with these poor outcome indicators. Previous studies have identified several potential genes for predicting the prognosis of OV but their comprehensiveness and clinical application remain limited. [17][18][19] In this study, we discovered that differential expression genes (DEGs) between tumour and normal tissues were most significantly enriched in G2/M checkpoint signalling pathway based on the several data sets in the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) data sets. The multi-gene and single-gene signatures were constructed on genes related to G2/M checkpoint and validated in cohorts of OV patients.

| Data collection
We searched for data sets related to OV from the GEO database (http://www.ncbi.nlm.nih.gov/geo/) with the Mesh terms 'ovary neoplasms' and 'human'. A further filter was performed with organism 'Homo sapiens', study type 'Expression profiling by array' and samples count 'Higher than ten'. According to the systematic screening strategy, a total of eleven data sets were eventually included in this study. Eight data sets were used to screen DEGs, including GSE105437, GSE54388, GSE69428, GSE14407, GSE12470, GSE4122, GSE10971 and GSE26712. GSE23554, GSE14764 and GSE63885 were applied at the validation stage. Twenty-six samples from GSE63885 with incomplete survival data were removed.
Detailed information was shown in Table 1. Raw data were processed with robust multi-array average expression measure (RMA) background correction, log2 transformation and normalization.
Moreover, expression profiling and clinical information of the samples with complete prognostic data were downloaded from the TCGA-OV data set (https://cance rgeno me.nih.gov/). The gene list of hallmark gene sets 'HALLMARK G2M CHECKPOINT' was downloaded from the Gene Set Enrichment Analysis (GSEA) database (http://softw are.broad insti tute.org).

| DEGs identification
GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/) was used to detect DEGs in each GEO data set. P values and log fold change (FC) of duplicate genes were averaged based on the 'limma' package in R. 20 Significant DEGs were defined as those with adjusted P < .05 and |log FC| ≥ 1 and were ranked by the logFC in each microarray data set. The results of eight series accessions were integrated through the 'RobustRankAggreg (RRA)' R package to select the most significant DEGs. 21 tool for gene function analyses. 22 We also performed enrichment analysis using the hallmark gene sets as the reference gene set. In addition, protein-protein interaction (PPI) enrichment analysis was carried out. The Molecular Complex Detection (MCODE) algorithm was applied to identify densely network components.

| Gene set enrichment analysis (GSEA)
We utilized javaGSEA 4.0.3 to perform GSEA within the above microarray data to analyse the difference between tumour and normal tissues. The most significant hallmark gene set in the enrichment analysis was selected as the reference gene set.

| Construction of the prognostic gene signature
The genes in the hallmark gene set 'HALLMARK G2M CHECKPOINT' in which the DEGs were enriched most significantly were considered as candidate biomarkers. LASSO (least absolute shrinkage and selection operator) Cox regression model was used to construct multigene signature for predicting OV prognosis from these candidate biomarkers. 23 Based on the 'glmnet' R package, the model was applied to the expression matrix of candidate genes and the optimal value of the penalty parameter λ was selected to calculate the coefficient of each gene constituting prognostic signature. The sum of the product of these coefficients and gene expression for each sample, defined as the risk score of the prognostic gene signature, was used to evaluate the prognostic risks.

| Prognostic value estimation of the multigene signature
TCGA-OV cohort was considered internal set, and GSE26712, GSE63885 and GSE14764 were deemed external sets for prognos-

| Multivariate Cox regression analysis
Clinicopathological variables and risk score were included in multivariate Cox regression to determine which were significant prognostic factors. The result was shown in a forest plot using the 'forestplot' package in R. According to the regression coefficient, every independent variable corresponded to a point at each value.

| Prognostic values estimation of singlegene signatures
From the multi-gene signature, we selected the genes with prognostic significance and closely related to risk stratification in the TCGA cohort as simplified signatures for prognosis prediction. Differences in expression levels between OV and normal tissues were investi-

| Correlation between signature genes and clinical characteristics
To further investigate the correlation between signature genes and clinical characteristics, we combined expression profiles of

| Drug-signature gene interaction
We searched for potential drugs response to promising targets G2/M checkpoint signalling pathway to which all genes in the multi-gene signature were related. Drug-Gene Interaction Database (DGIdb; http://www.dgidb.org) was used to explore interactions between drugs and eight signature genes. The interaction network was constructed by the online tool STITCH (http:// stitch.embl.de).

| Statistical analysis
Statistical analyses were conducted by online resources and R software 3.6.1. In brief, limma procedure was used to investigate differences in gene expression in GEO2R and the accumulative hypergeometric distribution was applied to carry out pathway and process enrichment analysis in Metascape. Terms with a P-value <.01, a minimum count of 3 and an enrichment factor >1.5 (the enrichment factor was the ratio between the observed counts and the counts expected by chance) were collected and grouped into clusters based on their membership similarities. The P values obtained in the above two steps were adjusted by the Benjamini-Hochberg procedure. Student's t test or one-way ANOVA was performed to compare mRNA expression if the data were normally distributed; otherwise, Wilcoxon or Kruskal-Wallis test was conducted for comparisons. The two-sided log-rank tests were performed to analyse survival differences between the high-risk and low-risk groups when KM survival curves were drawn based on the 'survival' and 'survminer' package in R. Univariate and multivariate Cox proportional hazard models were built to estimate the hazard ratios of prognostic factors. P < .05 was considered as statistical significance (*, P < .05).

| Identification of integrated DEGs by the RRA method
The workflow for construction and validation of novel gene signatures for prognosis prediction in OV was shown in Figure S1. Eight eligible GEO data sets were included in the subsequent RRA analysis.
F I G U R E 1 Identification of robust DEGs by RRA method. Heatmap shows the top 20 up-regulated and down-regulated DEGs in GEO series accessions. Each row denotes one DEG, and each column represents one data set. The colour changes from red to green indicates regulation from up to down. The numbers in the box stand for logarithmic fold change The DEGs of each data set were sorted by logFC. The RRA method synthesized the ranking of genes across all data sets to determine which were selected for integrated DEGs based on the assumption that each gene in each data set was randomly arranged. In accordance with the results of RRA analysis, a total of 478 significant DEGs were identified. The top 20 up-regulated and down-regulated DEGs were depicted on a heatmap ( Figure 1).

| Enrichment analyses of DEGs
GO annotation and KEGG pathway enrichment analysis were performed on the overall integrated DEGs. We detected that GO terms (such as cell division, regulation of mitotic cell cycle and muscle structure development) were most significantly enriched.

| Prognostic values of the multi-gene signature
The risk score was ranked from low to high. In the internal data  (Table S2).

| Validation of prognostic value in subgroups
Subgroup analysis was performed to explore the applicability of our multi-gene signature in predicting survival outcomes for patients with specific clinicopathological characteristics. GSE14764, GSE23554, GSE26712 and GSE63885 were integrated into a whole.
Notably, all tumour samples from GSE23554 and GSE26712 were diagnosed at the advanced stage. Other detailed information of patients from these data sets was described in Table S1. According to residual tumour, patients were stratified into residual tumour <1 cm

| Multivariate Cox regression analysis
The result of multivariate Cox regression analysis revealed age, stage and risk score was independent factors for prognosis prediction ( Figure S3). A nomogram was constructed to visualize the relationship between these independent prognostic factors and survival probability ( Figure 5A). Clinicians were able to predict prognosis of patients based on their total points. Patients with higher number of total points had poorer survival outcomes.
The C-index of the nomogram was 0.695 (95% CI, 0.670-0.727) and corrected to be 0.689 through bootstrapping validation.
Furthermore, calibration curves also showed a good predictive power of the nomogram for 3-year and 5-year overall survival ( Figure 5B,C).

| Prognostic values of single-gene signatures
It could be inferred from the high ranking of LASSO coefficients that SNRPD1 and EFNA5 played important roles in the multi-gene  Figure S4). In the GEPIA set, SNRPD1 expression showed significant association with overall survival and disease-free survival ( Figure 7A,B), which confirmed its F I G U R E 5 Nomogram for predicting survival probability in the TCGA data set. A, Nomogram to predict survival probability at 1, 3, 5 y. B, Calibration curve for the nomogram predicting 3-y overall survival. C, Calibration curve for the nomogram predicting 5-y overall survival prognostic value. Similar results were obtained from performance of the same analyses on EFNA5 expression, and the prognostic value was validated ( Figure 7C,D and Figure S5). Notably, it was higher EFNA5 expression and lower SNRPD1 that predicted poorer prognosis. At last, univariate and multivariable Cox regression model identified both single-gene signatures as independent prognostic factors for patients with OV (Table 2).

| Correlations between signature genes expression and clinical traits
Clinical information of OV samples, such as stage, age, living sta- Besides, EFNA5, CDKN1B and KATNA1 were identified as members of turquoise module which was significantly related to living status ( Figure S6).

| Drug-gene interaction
CDKN1B and SLC7A1 were identified as promising targets for potential drug reactions based on the results of drug-gene interaction exploration using DGIdb (Table 3)  Type II tumours are considered high grade, diagnosis at advanced stage and low survival, wherefore result in the major fraction of OV deaths. One of the important factors in elevating the mortality of OV patients is that effective screening tests remain blank to date.

| D ISCUSS I ON
A recent large randomized trial combining transvaginal ultrasound with changes in CA125 has observed a reduction in mortality after long-term follow-up but screening strategies based on secondary analysis remain controversial. 33,34 In addition, current treatment of OV is limited to radical surgery and chemotherapy, which prolongs the interval between recurrences but does not benefit overall survival. 35    However, applying the multi-gene signature for prognosis prediction costs more medical expenses. Since it has been reported that many single-molecule biomarkers are also related to clinical outcomes, we simplified the model into two single-gene markers at the cost of prediction accuracy to improve practicality.
Both SNRPD1 and EFNA5 are members of the 'HALLMARK G2M CHECKPOINT' gene set, so they participate in cell proliferation and tumour progression via cell cycle arrest at the G2/M-phase.
Furthermore, SNRPD1 plays an important role in manipulating the regulation of pluripotency-specific spliceosome assembly and the acquisition and maintenance of pluripotency. 40 SNRPD1 has also been reported to be involved in osteogenic differentiation of mesenchymal stem cell. 41 Regarding the function of EFNA5, recent studies have shown its significant expression alterations in prostate cancer, gastric cancer and colorectal cancer compared to conventional normal tissues. [42][43][44] Notably, EFNA5 is likely to be one of novel candidate genes that contribute to human Mendelian disorders. 45 The above findings also indicate that the change in EFNA5 expression seems to be applicable to a variety of genetic diseases and has low specificity for OV.
Previous studies have applied pathological stage as an indicator to guide treatment choice but little direct evidence suggested it could be regarded as an accurate factor for predicting the prognosis of OV patients. 16,28 Nevertheless, tumour stage was identified as an independent predictor in our nomogram. Interestingly, the prognostic prediction ability of tumour stage decreased in the multivariate Cox regression combined with two single-gene signatures, which possibly resulted from the correlation between SNRPD1 expression and tumour stage, as revealed by the process of WGCNA.
In conclusion, our study constructed novel gene signatures for prognosis prediction in OV based on the G2/M checkpoint signalling pathway enrichment. Prognostic value of the multi-gene signature was validated in the internal, external and entire sets. Independence from other clinical factors was determined through subgroup analysis. By incorporating this signature with age and pathological tumour stage, a visual predictive nomogram was established, which was convenient for predicting survival outcomes of OV patients.
Two single-gene signatures were also built as simplified independent prognostic factors to satisfy diversified clinical requirements.
However, these models require further verification in different clinical centres in the future.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data sets used and/or analysed during the current study are available from the GEO database (http://www.ncbi.nlm.nih.gov/ geo/) and TCGA database (https://cance rgeno me.nih.gov/).