Comprehensively analysis of splicing factors to construct prognosis prediction classifier in prostate cancer

Abstract Splicing factors (SFs) are proteins that control the alternative splicing (AS) of RNAs, which have been recognized as new cancer hallmarks. Their dysregulation has been found to be involved in many biological processes of cancer, such as carcinogenesis, proliferation, metastasis and senescence. Dysregulation of SFs has been demonstrated to contribute to the progression of prostate cancer (PCa). However, a comprehensive analysis of the prognosis value of SFs in PCa is limited. In this work, we systematically analysed 393 SFs to deeply characterize the expression patterns, clinical relevance and biological functions of SFs in PCa. We identified 53 survival‐related SFs that can stratify PCa into two de nove molecular subtypes with distinct mRNA expression and AS‐event expression patterns and displayed significant differences in pathway activity and clinical outcomes. An SF‐based classifier was established using LASSO‐COX regression with six key SFs (BCAS1, LSM3, DHX16, NOVA2, RBM47 and SNRPN), which showed promising prognosis‐prediction performance with a receiver operating characteristic (ROC) >0.700 in both the training and testing datasets, as well as in three external PCa cohorts (DKFZ, GSE70769 and GSE21035). CRISPR/CAS9 screening data and cell‐level functional analysis suggested that LSM3 and DHX16 are essential factors for the proliferation and cell cycle progression in PCa cells. This study proposes that SFs and AS events are potential multidimensional biomarkers for the diagnosis, prognosis and treatment of PCa.

biological processes such as cell proliferation, embryonic morphogenesis, immune response and cancer progression. 1,4 AS includes seven types of alternative splicing: alternate acceptor site (AA), alternate donor site (AD), alternate promoter (AP), alternate terminator (AT), exon jump (ES), exon mutual exclusion (ME) and intron retention (RI). 5 It has been proposed that dysregulation or mutation of SFs can promote the occurrence of cancers and confer advantages for the tumour cells, including uncontrolled proliferation, stemness, antiapoptosis and evasion of immune surveillance. 2,6 Researchers have found that some genes can switch between 'oncogenic' and 'tumoursuppressive' states via AS. 7 These findings suggest that SFs could serve as biomarkers or therapeutic targets for cancer management.
Prostate cancer (PCa) is one of the major life-threatening diseases worldwide. 8,9 Complex coordinated molecular and cellular events are known to contribute to the carcinogenesis, proliferation and metastasis of PCa. 10 Recently, accumulated studies suggest that SFs are involved in many aspects of PCa. The dysregulation of several key SFs, such as HNRNPL (Heterogeneous Nuclear Ribonucleoprotein L), MALAT1 (Metastasis Associated Lung Adenocarcinoma Transcript 1), SRRM3 (Serine/Arginine Repetitive Matrix 3) and SRRM4 (Serine/ Arginine Repetitive Matrix 4), has been shown to contribute to the progression of PCa. [11][12][13] Isoform switching of the androgen receptor (AR) has also been found deeply involved in the progression of PCa. 14 Alternative splicing of NF-YA (Nuclear Transcription Factor Y Subunit Alpha) has been demonstrated to promote PCa aggressiveness and represents a new molecular marker for the clinical stratification of patients. 15 However, a comprehensive analysis of the prognostic value of SFs in PCa is limited.
In this study, we systematically analysed 393 SFs to deeply characterize their expression patterns, prognostic value and biological functions in PCa. We identified two de novo molecular subtypes with distinct mRNA expression patterns, signalling pathway activity, AS patterns and prognoses. Furthermore, we constructed an SF- org/study/ summa ry?id=prost ate_dkfz_2018). 16

| Analysis of the expression and survival landscapes of SFs and AS-events
The gene list of SFs was obtained from a previous pan-cancer study conducted by Seiler et al. 3 SFs from the list that did not match TCGA-PRAD data were omitted from this study. A total of 393 SFs, shown in Table S2, were subjected to the downstream analysis. For differential expression analysis and univariable COX analysis of SFs, the TPM data were log2 transformed, the survival R package was used to perform the univariable COX analysis, and a strict cut-off (p < 0.001) was implemented to define the survival-significant SFs. For differential expression analysis of SFs between tumour and tumour tissues, the RAW count value of mRNA expression was used and analysed using the DESeq2 package. 19 For AS events, the PSI value of AS data was directly subjected to differential expression calling between subtypes. For univariable COX analysis of AS events, the PSI values were Z-normalized before analysis. The selected key SFs and AS events were also subjected to multivariable-COX analysis using the survival package.

| ConsensusClusterPlus analysis and principal component analysis
The ConsensusClusterPlus package was used to explore the molecular subtypes in TCGA-PRAD dataset using the 53 survival-significance SFs. 20 The parameters for the analysis were set as follows: reps = 50, clusterAlg = km, distance = euclidean. The optimal cluster number was chosen by manually checking the consensus matrix and consensus cumulative distribution function (CDF) output, with two clusters selected for downstream analysis. For principal component analysis (PCA), the mRNA expression data were utilized (log2-transformed TPM for TCGA-PRAD; PSI value for AS-events) and analysed using the pca3d package in R.

| Construction and validation SFs based-PIF-classifier
For the construction of a splicing factors-based risk classifier for the prediction of PSI, we used the TCGA-PRAD dataset for the LASSO-COX model construction and then the classifier was verified by three external PCa cohorts. In brief, patients were randomly divided into training and testing groups. The 53 survival-significant SFs were subjected to LASSO regression (using the glmnet R package) to eliminate the false-positive parameters caused by overfitting. Finally, a multivariable Cox analysis was used to calculate the Hazard ratio and coefficients, thereby generating the prognosis model. The calculation formula is as follows: In this formula, the SFi represents the identifier of the ith se- and an area under the curve (AUC) >0.700 were considered to have acceptable predictive power.

| Gene set variant analysis and immune cell infiltration analysis
The gene set variation analysis (GSVA) was performed using TPM value of TCGA-PRAD RNAseq data using GSVA package. The significantly enriched signalling pathways between the two identified SF subtypes were identified using the limma package, which used the GSVA score of each TCGA-PRAD case. The HALLMARKS50 and KEGG Pathway gene sets were used in the analyses.
The infiltration levels of 22 types of immune cells were estimated using the CIBERSORTx program (https://ciber sortx.stanf ord.edu/), with bulk tumour RNA-Seq data as the input. 21 CIBERSORTx was run in relative mode, with batch effect correction applied in S-mode, using RNA-Seq TPM data and the built-in LM22 signature matrix.

| Identification of key AS-event and construction of AS-risk model
The key AS events were identified following a previously described method. 5 Briefly, 57 AS events were selected based on the following criteria 1 : AS events that were dysregulated in the 2 SF subtypes with a Wilcoxon test FDR <0.01 and |log2FC| > 0.3 2 ; AS events that were significantly associated with PFI with a univariable COX p < 0.001.
These 57 AS events were then subjected to LASSO-COX regression analysis to screen for the most prognostically significant AS events.
Six key AS events were identified, leading to the construction of an AS-risk model as follows: The ASi represents the identifier of the ith selected AS event.
The coefficient (ASi) is the Cox-regression coefficient estimated from the Cox proportional risk regression analysis for a specific ASi.
The predictive power of the AS-risk model was then analysed using bovine serum (HyClone), 100 U/mL penicillin (HyClone) and streptomycin (HyClone). The cells used in this work were routinely checked by morphological observation and tested for mycoplasma contamination.

| Small interfering RNA transfection and quantitative real-time PCR analysis
The small interfering RNA (siRNA) against NOVA2, LSM3 and DHX16 were synthesized by GenePharma (Tianjin, China). The siRNA transfection was performed using Lipofectamine RNAiMax (Thermo Fisher Scientific) following the manufacturer's instructions.
The siRNA sequences used in this study are shown as follows (5′-

| Cell proliferation assay colony formation assay, cell cycle analysis and cell apoptosis analysis
The cell proliferation of PCa cell lines was examined by using Cell Count Kit 8 (CCK8) assay as previously described. 22  were analysed using a flow cytometer, and the data were processed by FlowJo V10 (for apoptosis assay) and Modfit (for cell cycle analysis) software.

| CRISPR/CAS9 screen data analysis
The CRISPR/CAS9 screen data were analysed using the Project Score database (https://score.depmap.sanger.ac.uk). We used the fitness score, a quantitative measure of the cell viability effect elicited by CRISPR-Cas9 mediated cell inactivation, as the metric to evaluate the significance of genes in cell survival and proliferation.
This is based on Bayes Factor values computed using BAGEL on CRISPRcleanR corrected gene depletion fold changes. 23 Values are scaled to a 5% false discovery rate threshold from classifying reference essential and non-essential genes.

| Western blot analysis
The western blot analysis was carried out as previously described. 24 The antibodies used in this study were summarized as follows: Cyclin

| Statistical analysis
All statistical analyses were performed using R software (ver-

| Flowchart of this Study
The design, data processing and analysis in this study are depicted in Figure 1. First, we acquired TCGA-RNA-Seq, TCGA-Somatic mutation, TCGA-SplicingSeq and GEO-mRNA expression data for PCa.
These datasets underwent preprocessing and quality control measures and were analysed according to the various modules shown in We initially performed a COX analysis combined with differential expression analysis to identify the most clinically significant splicing factors (SFs). We then subjected these genes to functional annotation and enrichment analyses. Following this, we conducted consensus clustering analyses to identify consensus molecular subtypes (defined as SF subtypes) based on the selected splicing factor genes.
We further delineated the differences in molecular characteristics of these SF subtypes through PCA and GSVA analyses. Key alternative splicing events were then identified based on the differentially expressed AS events between the two SF subtypes.
The SF-classifier (risk model) was built using survival-significant SFs in the TCGA internal training and testing dataset and further validated using three independent PCa cohorts. Somatic mutation analysis was carried out to identify key mutation events in the classifier-predicted high-and low-risk groups based on TCGA data.
Finally, cell-level experiments, including proliferation assay, cell cycle assay, cell apoptosis assay and western blotting, were conducted to confirm the biological functions of the identified SFs (the SFs used to construct the classifier).

| Survival-related SFs predict two molecular subtypes with different prognoses
To screen out the survival-related SFs, we first calculated the COX hazard ratio of all involved SFs using the TCGA-PRAD dataset.
Interestingly, we found most of the survival-significance SFs have a >1 hazard ratio, which suggests that high levels of specific SFs may promote the progression of PCa (Figure 2A suggested that cell cycle-related genes ( Figure 2C) and spliceosome structural proteins ( Figure 2D) were over-represented in the survival-related SFs. We conducted a differential expression analysis of all 393 SFs and found that a small fraction (10 genes) were differentially expressed between normal and cancerous PCa tissues (Figure S1A-C). However, only one of these differentially expressed genes showed significance in the COX-survival analysis. These results and previous reports led us to speculate that heterozygosity among tumour cells, rather than the general difference between tumour and normal tissue, may play a more important role in SFrelated tumour promotion effects. 25 To explore the biological significance of the survival-related SFs in large cohorts of PCa patients, we subjected these genes to Interestingly, these two distinct subtypes emerged with the highest stability ( Figure 2E). The PCA analysis further suggested that the two subtypes of the PCa sample exhibit a clear difference in mRNA expression ( Figure 2G,H). Notably, these two subtypes showed significant separation on the Kaplan-Meier plot, which suggests these SFs may play a significant role in the progression of PCa ( Figure 2I,J).

F I G U R E 1
Flowchart of this study. The workflow of this study. The clinicopathological information, mRNA expression, splicing and somatic mutation data were subjected to the following analyses as the figure mentioned.

| The two SF subtypes have distinct pathway activation and immune modulation patterns
Further, GSVA was carried out on the two PCa subtypes using the HALLMARKS50 gene set, and the results showed that some cancer-promoting pathways, such as oestrogen response, epithelialmesenchymal transition and KRAS signalling, significantly enriched the subtypes with worse clinical outcomes ( Figure 3A). In addition, GSVA using the KEGG gene set also showed enrichment of wellknown tumour-associated metabolic-related signalling pathways in the subtypes with worse clinical outcomes ( Figure 3B), such as arginine signalling, indicating that dysregulation of SFs may trigger metabolic reprogramming and thus facilitate tumour progression.
We further explored the infiltration patterns of immune cells in the two subtypes of PCa. We observed that 10 out of the 21 analysed immune cell types were differentially enriched in the two SFs-based PCa subtypes ( Figure 3C), including three prognosis-related immune cell types: the regulatory T cells (Hazard, p = 0.014, Figure 3D), resting memory T cells (protective, p = 0.04, Figure 3E) and plasma B cells (protective, p = 0.002, Figure 3F). Collectively, these results suggest that intrinsic alterations of SFs exist in PCa, creating two distinct molecular subtypes with different molecular characteristics, prognosis indices and immune microenvironments. This indicates that the SF subtypes may be used as a supportive strategy to facilitate the diagnosis and prognosis of PCa.

| The SF subtypes have significant differences in the general alternative splicing pattern
The SF subtypes show significant differences in the general pattern of alternative splicing. As the ConsensusCluster result predicted two consensus PCa subtypes, and considering that alternative splicing events are controlled and influenced by SFs, we next examined the pattern of alternative splicing in the two identified SF subtypes.
Differential expression analysis suggested that 1251 out of 7704 (16.2%) AS events were differentially expressed in the two PCa subtypes ( Figure 4A). This result was consistent with the PCA results, which showed that the overall pattern of AS events was distinct in the two SF subtypes ( Figure 4B). Univariable-COX analysis identified 298 survival-related AS events that were subjected to enrichment analysis ( Figure 4C, Table S2). GO enrichment analysis showed that signalling pathways such as adhering junction, cell leading edge, focal adhesion and cell-substrate junctions were significantly enriched.
This suggests that the dysregulated AS events may be linked with the metastatic potential or/and mobility of PCa cells ( Figure 4D).
Meanwhile, HALLMARKS50 and KEGG enrichment analysis showed that cell cycle progression, MYC targets and HIFα signalling pathway may also be controlled by these AS events ( Figure 4E).
Furthermore, we evaluated the clinical significance of the identified 57 AS events that were both dysregulated in the two SF subtypes and significantly associated with PFI. COX-lasso regression analysis was used to construct a survival prediction model. The results showed that a risk prediction model with promising survival predicting power was successfully constructed. ROC analysis ( Figure 4F) followed by Kaplan-Meier analysis ( Figure 4G) showed that the constructed risk prediction model has good predictive power for the PFI of PCa, with AUC >0.750 in both the training and testing datasets. The multivariable COX analysis also identified six survival-associated AS events with a p value <0.05, and the detailed results are shown in Figure 4H. These data suggest that the identified two SF subtypes of PCa patients have significant differences in the general alternative splicing pattern and the identified key AS events may be linked with the prognosis and diagnosis of PCa.

| Construction of survival prediction classifier using the survival-related SFs
As the bioinformatics analysis showed that dysregulation of specific SFs was linked with the progression of PCa, a survival prediction classifier was constructed using the LASSO-COX method, as previously described. Initially, the survival-related SFs with a COX p < 0.01 were subjected to LASSO regression followed by multivariable COX regression analysis to calculate the coefficients using the TCGA-PRAD training dataset (Table S2) Figure 5E) and testing datasets ( Figure 5F). Furthermore, the multivariable COX analysis showed that the six key SFs were independent factors that correlated with the PFI of PCa patients ( Figure 5G). Additionally, Kaplan-Meier analysis showed that all six genes were significantly associated with the PFI of PCa patients ( Figure 5H). The expression of the six key SFs and the calculated RiskScores are shown in Table S3.

| Verification of the SF-classifier in three independent external datasets
Next, we verified the predictive power of the classifier in three ex-  the details are provided in Table S3. Additionally, we analysed the mutation spectrum of the high-and low-risk patients predicted by the classifier. The results indicated that many key genes were more frequently mutated in the high-risk group compared to the low-risk group, such as TP53 (Tumour Protein P53), SPOP (Speckle-Type POZ Protein), FOXA1 (Forkhead Box A1) and so forth ( Figure 6J).
Interestingly, TP53 showed the highest mutation rate in the highrisk group, which suggests that dysregulation of the six identified SFs may contribute to the gain of mutation of TP53 in PCa ( Figure 6K). The expression of the six key SFs and the calculated RiskScores are shown in Table S3.

| Comparison of the prognosis prediction power between SF-classifier and other clinicopathologic parameters
We also examined the prognosis prediction power of the SF-classifier in comparison with other well-established clinicopathological parameters, including PSA value, Gleason Score, tumour stage and age, according to the available information in each dataset. In the TCGA cohort, the RiskScore showed the highest AUC value and a significant multivariable-COX hazard ratio ( Figure 7A). In the DKFZ, GSE70769 and GSE21035 datasets, RiskScores showed very promising prediction power with AUC >0.700, but they exhibited less robustness compared with some other parameters, especially the Gleason score ( Figure 7B-D). Although multivariable-COX analysis indicated that the calculated RiskScores were independent prognosis factors in the TCGA ( Figure 7A) and GSE21035 ( Figure 7D) datasets with a p value less than 0.05, results from DKFZ ( Figure 7B) and GSE70769 ( Figure 7C) suggested that the RiskScore should be combined with other clinicopathological parameters such as Gleason score (GS) and Stage to accurately predict BCR in PCa patients.
Collectively, these analyses suggested that a six-SF-based classifier was successfully constructed with promising prognosis prediction power in PCa patients.

| Key SFs are involved in the progression of PCa
The above in silico analyses have demonstrated that dysregulation of the six SFs is associated with poor clinical outcomes, as well as the progression of PCa. Next, we aimed to understand the biological significance of the six identified key SFs in PCa. We first analysed the high-throughput CRISPR screening data using the Project Score database (https://score.depmap.sanger.ac.uk/). Notably, the results showed that LSM3 and DHX16 were significantly associated with the fitness of prostate cancer cells, suggesting that these two SFs might be involved in controlling key processes for tumour cell survival and proliferation ( Figure 8A).
To further assess the biological significance of NOVA2, LSM3 and DHX16 in PCa cells, in vitro functional studies were carried out. The siRNAs that specifically targeted NOVA2, LSM3 and DHX16 were designed and transfected into PC3 and DU145 cells ( Figure 8B). We found that the knockdown of LSM3 and DHX16, but not NOVA2, significantly inhibited the proliferation of two PCa cell lines, PC3 and DU145, as revealed by the CCK8 assay ( Figure 8C) and the colony formation assay ( Figure 8D). Consistently, we observed that silencing LSM3 and DHX16 altered the cell cycle progression of PCa cells (Figures 6E and S2). Interestingly, although the knockdown of NOVA2 had no significant influence on cell proliferation, we found that down-regulation of both LSM3, DHX16 and NOVA2 induced cell apoptosis in DU145 cells ( Figure 8F,G). However, NOVA2 did not affect the apoptosis of PC3 cells, and LSM3 did not associate with the apoptosis in PC3 cells ( Figure 8F,G). Furthermore, western blot analyses were carried out to examine the key proteins that regulated cell cycle progression upon silencing LSM3 and DHX16. The results showed that the protein levels of Cyclin B1, Cyclin D1 and Cyclin E1 were influenced by the knockdown of LSM3 and DHX16 ( Figure 8H). Collectively, these results strongly indicated that LSM3 and DHX16 were involved in controlling key checkpoints in cell cycle progression, and the function of NOVA2 and LSM3 may be context-dependent.

| DISCUSS ION
Tumour progression is a multi-step process that involves intricate  In this work, we constructed an SF classifier using the survival- screening data, as well as the cell-level functional analyses, showed that NOVA2 was not associated with the proliferation and cell cycle of PCa cells. As for DHX16, it has been suggested to be involved in multiple biological processes, such as innate antiviral immunity and pre-mRNA splicing, but it has not received attention in the field of cancer. 38,39 We found its high expression significantly associated with a worse prognosis of PCa patients. Notably, both the CRISPR/ CAS9 screening data and our in vitro experiment indicated that it played important roles in facilitating the proliferation and survival of PCa cells. The LSM family of proteins has been found to have potential therapeutic and prognostic value in breast cancer. 40 We have previously reported that LSM3 can serve as a subtype-specific biomarker in basal-like breast cancer. 5 Here, we found that its high expression strongly associates with a shorter biochemical recur-  Ruifang Niu: Conceptualization (equal); data curation (equal); formal analysis (equal); funding acquisition (equal); investigation (equal); methodology (equal); project administration (equal); supervision (equal); validation (equal); visualization (equal); writing -original draft (equal); writing -review and editing (equal).

ACK N O WLE D G E M ENTS
The results presented here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no potential conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
All datasets used in this study are publicly available from the corresponding database or provided as Supplementary Materials.