Gene signatures predict biochemical recurrence‐free survival in primary prostate cancer patients after radical therapy

Abstract Background This study evaluated the predictive value of gene signatures for biochemical recurrence (BCR) in primary prostate cancer (PCa) patients. Methods Clinical features and gene expression profiles of PCa patients were attained from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) datasets, which were further classified into a training set (n = 419), a validation set (n = 403). The least absolute shrinkage and selection operator Cox (LASSO‐Cox) method was used to select discriminative gene signatures in training set for biochemical recurrence‐free survival (BCRFS). Selected gene signatures established a risk score system. Univariate and multivariate analyses of prognostic factors about BCRFS were performed using the Cox proportional hazards regression models. A nomogram based on multivariate analysis was plotted to facilitate clinical application. Kyoto Encyclopedia of Gene and Genomes (KEGG) and Gene Ontology (GO) analyses were then executed for differentially expressed genes (DEGs). Results Notably, the risk score could significantly identify BCRFS by time‐dependent receiver operating characteristic (t‐ROC) curves in the training set (3‐year area under the curve (AUC) = 0.820, 5‐year AUC = 0.809) and the validation set (3‐year AUC = 0.723, 5‐year AUC = 0.733). Conclusions Clinically, the nomogram model, which incorporates Gleason score and the risk score, could effectively predict BCRFS and potentially be utilized as a useful tool for the screening of BCRFS in PCa.


| INTRODUCTION
The second most common male malignancy is prostate cancer (PCa) in the world. 1 In 2020, estimated new cases and deaths of PCa in the United States will account for 21% and 10%, respectively. 2 Primary PCa is usually managed with radical prostatectomy (RP) or radical radiotherapy (RT). 3 Unfortunately, 30%-50% of patients with RT and 20%-40% of patients with RP will develop BCR within ten years. 4,5 BCR is defined as two consecutive rising prostate-specific antigen (PSA) values >0.2 ng/ml following RP or >2 ng/ml higher than the PSA nadir value following RT. 6 It is well known that BCR contributes to distant metastasis. Generally, 24%-34% of men with BCR will progress to metastasis, 7,8 who should be carefully monitored and endured salvage therapy. Most earlier studies have focused exclusively on the outcomes of PCa following RP or RT. Accordingly, more accurate rapid methods are eagerly needed to identify BCR of primary PCa patients after radical therapy (including RP and RT).
In recent years, notable improvement has been made in precision oncology that applies molecular and medical imaging information to improve the diagnosis and therapy of urological malignancies. 9,10 In particular, molecular information has outstanding interpretability and discriminative power. For example, gene signatures exhibit an excellent discrimination power for BCR. [11][12][13] Accumulating evidence has suggested gene deregulation related to the prognosis of PCa, such as CRTC2, MYC, and PTEN. [14][15][16][17][18] However, gene signatures in the early identification of patients at highrisk BCR of primary PCa after radical therapy have rarely been reported. Therefore, it is necessary to decipher gene signatures together with underlying molecular mechanisms predicting BCRFS based on genomic information from different platforms.
The current study applied four microarray datasets, which were obtained from GEO and TCGA. Three GEOs were merged as a training set and one dataset from TCGA as a validation set. Afterward, LASSO-Cox was applied to identify prognostic gene signatures to predict BCRFS and to establish a risk score. Accordingly, the prognostic value of the risk score in both sets was verified. Then, a nomogram was built up to estimate BCRFS time. Finally, GO and KEGG on gene signatures were performed to explore molecular mechanisms and crucial genes.

| Data preprocessing
In this study, eligible datasets were selected based on the following inclusion criteria: (a) the dataset must include patients with primary prostate cancer (PCa) following radical therapy and (b) patients with clear clinical and pathological information (i.e., gene expression values, Gleason score, BCR event, time to BCR, total follow-up time). Exclusion criteria were as follows: (a) datasets with a small sample size (n < 50) and (b) datasets without complete data for analysis. Gene expression and complete clinical data from 822 (419 samples from GEO and 403 from TCGA) PCa samples that met the inclusion and exclusion criteria were downloaded from the three GEO (GSE70768, GSE70769, GSE11 6918) and TCGA datasets, serving as the training Conclusions: Clinically, the nomogram model, which incorporates Gleason score and the risk score, could effectively predict BCRFS and potentially be utilized as a useful tool for the screening of BCRFS in PCa. and validation sets, respectively. The main characteristics of the datasets are shown in Table 1. To ensure data integrity for each indicator, incomplete raw information (i.e., age in the training set) was excluded for further COX analysis. In addition to Gleason score and follow-up BCR information, the training set included preoperative PSA, clinical T (cT) stage, and radical therapy (RP = 196, RT = 223); meanwhile, the validation set included radical therapy (RP = 403). The characteristics of patients with prostate cancer in the training set and validation set are shown in Table 2. We have made subgroup analyses for every variable in the training set, and the results are shown in Figure S1. Three GEO datasets were merged and applied function "Normalize Between Array" from the R package "limma" for standardization.

| Identification of gene signatures
The batch influence was adjusted for GEO and TCGA by R package "sva." Gene expression profiling was merged with clinical information for analyses. To select gene signatures with predictive value, LASSO-Cox regression was applied using the R package "glmnet." 19,20 The risk score was founded by weighting individual normalized expression value of gene signature and LASSO coefficient.

| Validation of gene signatures
According to the median value of risk score in the training set, both training and external validation sets were classified into high-risk and low-risk groups. Kaplan-Meier (K-M) survival curves were drawn by R packages "survival" and "survminer." Then, logarithmic rank (log-rank) tests were performed to compare differences in BCRFS time between the high-and low-risk groups. To visualize BCRFS differentially, a heatmap was constructed using the R package "pheatmap." Multivariate and univariate Cox regression models were established using the R package "survival." R package "timeROC" was applied to build a t-ROC curve, which was used to assess the predictive accuracy of the risk score system for BCRFS. Afterward, based on the results of multivariate models, a nomogram was depicted using the R package "rms." To assess the performance of the nomogram, calibration plots and C-index were used in both training and validation sets.

| Bioinformatical analysis
To estimate the potential functions of DEGs in low-risk versus (vs.) high-risk groups, the KEGG pathway and GO annotation were performed using the R package "cluster-Profiler." 21 Briefly, GO and KEGG annotation sets were derived from the R package "org. Hs.eg.db." GO reveals the catalogs of biological process (BP), cellular component (CC), and molecular function (MF). All visualizations were produced using R packages "ggplot2" and "GOplot." After multiple-test correction, KEGG pathways and GO terms with corrected P (P.adjust) value <0.05 were considered to be prominently enriched in DEGs.

| Statistical analysis
Data analysis was implemented using the R program (version 3.6.3, https://www.r-proje ct.org) with the following libraries: base-package, survival-package, glmnet-package, survminer-package, timeROC-package, limma-package, rms-package, and clusterProfiler-package. Support Vector Machine (SVM) and Random Forest(RF) models were carried out in Python, using the Scikit-learn (version 0.24.0). BCRFS curves were depicted by Kaplan-Meier plots, and the difference in BCRFS was assessed by the log-rank test.
Multivariate and univariate Cox regression models were used to ascertain independent prognostic factors. Time-dependent ROC curves were constructed, and AUCs were used to predict the performance of BCRFS in 3, and 5 years, respectively. Nomogram was validated with C-index. DEGs were defined as differential expression for |logFC| > 0.5 and an

| Prognostic gene signatures identification in the training set
Gene expression variables in the training set were submitted to high-throughput LASSO-Cox proportional hazards regression analysis. All gene variables were reduced to the most useful potential predictors for BCRFS. The optimal λ value was chosen by "Leave-one-out" cross-validation, and the λ value of 0.11517381 with log(λ) = −2.1613129 was selected ( Figure 1A). Six BCRFS-associated gene signatures (NOX4, F12, TPX2, PHYHD1, AURKA, and YIPF1) were identified by LASSO-COX models. ( Figure 1B).

| Construction of the risk score
The risk score was established by the summating of every gene signature expression value multiplied by its corresponding coefficient, as follows: risk score = (0.046043 × NOX4) The expression value of every gene was log2-transformed and standardized. The distributions of risk scores for the training set and validation set were shown, respectively (Figure 2A,B). The distributions of BCRFS and BCR status for both sets are shown in Figure 2C,D. The risk score was ranked and the high-risk score indicates poor BCRFS. The median risk score of the training set was used to classify all patients into high-risk (>0.183427) versus low-risk (<0.183427) groups. The heatmaps of six prognostic genes expression values were presented in Figure 2E,F.

| Validation of the risk score
Based on multivariate and univariate Cox regression(CR) models, the risk score, which was adjusted by the clinical variables in both sets, was an independent prognostic factor for BCRFS (p < 0.05). Gleason score was prominently associated with BCR in the training set (n = 419) (p < 0.05) ( Table 3). Conversely, no apparent association was observed between BCR and preoperative PSA or cT stage (Table S1; Figure S1). Similarly, in the validation set (n = 403), a high Gleason score was notably associated with BCR(p < 0.05).
Based on Kaplan-Meier survival curves, there were meaningful differences between high-risk and low-risk groups for both sets (p < 0.001) ( Figure 3A,B). According to t-ROC, the risk score was a strong prognostic factor for BCRFS in the training set The best-performing model was the CR model, which was selected to calculate the risk score (Table S2).

| Establishment of the nomogram
Following the results of multivariate and univariate Cox analyses, Gleason score and risk score were used to draw a nomogram in the training set. The nomogram predicted the probability of BCRFS in patients with PCa for 3 and 5 years, while the risk score was a dominant factor ( Figure 4). The likelihood of BCRFS decreased with an increase in risk score, revealing that our gene signatures might hold promising predictive value for BCRFS. The calibration plots exhibited outstanding conformity between the actual observation and the nomogram prediction for 3-and 5-years BCRFS in the training set ( Figure 5A,B) and validation set ( Figure 5C,D). The C-index of the constructed nomogram for estimating BCRFS was 0.793 in the training set. Compared to the  Gleason score (C-index of 0.588), risk score (C-index of 0.790), the nomogram showed better predictive accuracy. For the validation set, the constructed nomogram had a C-index of 0.722 that was also finer to Gleason score (C-index of 0.676) and risk score (C-index of 0.710) for BCRFS (Table 4).

| Bioinformatics analysis
Above all, three DEGs (PHYHD1, AURKA, and TPX2) between low-risk (n = 210) and high-risk cases (n = 209) were identified using the R package "limma" in the training set, under cut-off criteria of an adjusted P value < 0.05 and |logFC| > 0.5 (Table 5). According to bioinformatics analysis, 145 enriched considerably GO terms belong to the molecular function (MF), biological process (BP), and cellular component (CC) categories (P adjusted < 0.05) ( Table S3). The most enriched BP terms were associated with mitotic spindle organization, spindle assembly, and microtubule cytoskeleton organization involved in mitosis. The three most dominant terms in CC were mitotic spindle, spindle pole, and spindle. In the MF category, histone kinase activity was the most abundant term, followed by protein serine/ threonine/tyrosine kinase activity and dioxygenase activity (Figure 6a). Moreover, two significantly enriched GO terms belong to KEGG categories (P adjust < 0.05). As shown in Figure 6B, the notably enriched KEGG pathways of the DEGs were "progesterone-mediated oocyte maturation" and "Oocyte meiosis." Ultimate, a chord diagram, was created to measure the relationship between DEGs and GO terms.

F I G U R E 3 Gene signatures can predict BCRFS in both cohorts. (A) K-M survival curves for the training set indicated that better BCRFS
was associated with significantly lower risk score; (B) K-M survival curves for the validation set indicated that better BCRFS was associated with significantly lower risk score; (C) Time-dependent ROC revealed that the risk score was an excellent predictor for BCRFS in the training set; (D) Time-dependent ROC revealed that the risk score was an excellent predictor for BCRFS in the validation set. BCRFS, biochemical recurrence-free survival; K-M, Kaplan-Meier; ROC, receiver operating characteristic Figure 6C summarizes the top three pathways enriched in the BP, CC, and MF.

| DISCUSSION
The current study focuses on appraising the potential prognostic values of gene signatures in BCR using public datasets.
Three GEOs associated with BCR are integrated as a training set to obtain optimal gene signatures. Besides, the TCGA dataset serves as an external validation set. The risk score system consisting of 6-gene signatures is significantly associated with BCRFS by a series of bioinformatical and statistical analyses, which is consistently observed in the validation set. These results indicate that gene signatures have promising predictive value for BCRFS of primary PCa patients after Based on uni-and multivariate Cox regression models, risk score and Gleason score can predict prognosis in both sets. In contrast, no significant association is observed between BCR and preoperative PSA in the training set. These findings are consistent with previous reports that a high Gleason score was appreciably related to early BCR,whereas factors (i.e., age at diagnosis, preoperative PSA) were not associated with BCRFS. 22,23 However, our consequences were inconsistent with other reports. 24,25 This inconsistency may be related to race and radical therapy.
Although radical therapy was a potential prognostic factor for the BCRFS in univariate and subgroup analyses, it was no longer a prognostic factor with multivariate analysis ( Figure S1; Table S1). Similar results have been reported in low-intermediate risk patients with PCa. 26,27 Nevertheless, this was inconsistent with some previous studies that the therapeutic effect of RP was better than RT, and the probability of BCR after RP was lower than RT. 4,5 The reason may be that the data came from different datasets and the sample size was small. The experimental results need to be further verified by larger sample size. BCR mainly arises from PCa process itself or as a result of the side effects during treatment. For instance, positive surgical margins (PSM) and lymph node metastases were associated with BCRFS. 28,29 However, some clinical and pathological parameters (i.e., surgical margin and extracapsular extension) were missing in the training set, so they could not be added for further analysis. In future experiments, more clinical and pathological parameters need to be analyzed. This article focus on the biological characteristics of the disease itself rather than on the therapeutic effect. Furthermore, the predictive contribution of the therapeutic effect was much smaller than the risk score in this article. Thus, the therapeutic effect was not that substantial and did not affect the correctness and reliability of our conclusions.
Several studies highlighted different gene signatures associated with BCR following RP. In a case-control study, a 10-gene molecular signature(HDDA10) showed superior performance for predicting BCR in PCa patients with RP (AUC = 0.65). 12 Meanwhile, an original gene signature model predicted 3-years BCRFS in PCa patients after RP (AUC = 0.836). 11 In addition, CDO1 promoter methylation was proposed as a feasible predictive biomarker for BCRFS in PCa patients following RP, even though it flunked to reach statistical significance in multivariate analysis. 13 Our gene signatures may offer a broader range of possibilities for clinical application.
A few biomarkers of our gene signatures have previously been studied in PCa. For example, TPX2, a risk biomarker in our study, positively associated with the BCR of PCa and played an essential role in the proliferation and aggression of PCa. 30 TPX2 depletion led to the growth inhibition of PCa cells and reduced tumorigenesis. 31 AURKA, another essential risk biomarker in our study, was correlated with poor prognosis in lethal treatment-related neuroendocrine prostate cancer. 32 Also, the inhibition of TPX2 and AURKA stimulated mitotic catastrophe (MC) or apoptosis in PCa cells, and the possible mechanism might be the Glioma pathogenesis-related protein 1 (GLIPR1) through heat shock cognate protein 70 (Hsc70)-mediated suppression of TPX2 and AURKA. 33 Our conclusions show excellent agreement with these results.
Notably, PHYHD1, which has not been studied in PCa, may be involved in the process of BCR. PHYHD1 had been investigated in other tumors. For instance, one research had shown that the DNA methylation level of PHYHD1 was related to the invasion of non-functioning pituitary adenoma. 34 However, the underlying mechanism of its action in PCa remains to be established.
There have been several studies that investigated the possible mechanisms of prostate cancer progression. For instance, the centrosome was associated with cell mitosis, and its defects contributed to the change in cellular and gene that accompany the progression, dissemination, and lethality of prostate cancer. 35 Another study demonstrated that spindle orientation controls cell fate of PCa. 36 These results resonate well with GO and KEGG results where "mitotic spindle organization" and "Oocyte meiosis" have been the most significantly enriched in prominent GO terms, suggesting their roles as significant progressive pathway signatures in BCR. T A B L E 5 DEGs between low-risk cases and high-risk cases in the training set, under cut-off criteria of |logFC| > 0. 5   This study has the following restrictions. First, this study is restricted by its retrospective proposal and validation. A prospective evaluation would improve the reliability of our findings. Second, experimental evidence to support this conclusion is not yet available and is worthy of further assessment.
In conclusion, the gene signatures in our study have a good fit and discrimination, so does risk score classification, indicating excellent predictive values for BCRFS. Besides, based on risk score and Gleason score, the nomogram can predict 3 and 5-year BCRFS rates precisely, thus providing evidence of treatment for PCa patients. It is worthy of wider clinical application.