Identification of prognosis‐related alternative splicing events in kidney renal clear cell carcinoma

Abstract Alternative splicing (AS) contributes to protein diversity by modifying most gene transcriptions. Cancer generation and progression are associated with specific splicing events. However, AS signature in kidney renal clear cell carcinoma (KIRC) remains unknown. In this study, genome‐wide AS profiles were generated in 537 patients with KIRC in the cancer genome atlas. With a total of 42 522 mRNA AS events in 10 600 genes acquired, 8164 AS events were significantly associated with the survival of patients with KIRC. Logistic regression analysis of the least absolute shrinkage and selection operator was conducted to identify an optimized multivariate prognostic predicting mode containing four predictors. In this model, the receptor‐operator characteristic curves of the training set were built, and the areas under the curves (AUCs) at different times were >0.88, thus indicating a stable and powerful ability in distinguishing patients' outcome. Similarly, the AUCs of the test set at different times were >0.73, verifying the results of the training set. Correlation and gene ontology analyses revealed some potential functions of prognostic AS events. This study provided an optimized survival‐predicting model and promising data resources for future in‐depth studies on AS mechanisms in KIRC.


| INTRODUC TI ON
Alternative splicing (AS) modifies over 90% of human genes by removing most introns and selectively including or excluding specific exons. [1][2][3] It regulates the specificity of gene expression, 4 plays an important role in the diversity of mRNA isoforms with a limited set of genes and down-regulates the translation of mRNA isoforms. [5][6][7] AS is also an essential mechanism in physiological processes, such as hematopoiesis and muscle function, 8,9 and in cancer-causing pathological processes, including proliferation, apoptosis, hypoxia, angiogenesis, immune escape and metastasis. 10,11 Furthermore, changes and defects in splicing patterns and generation of peculiar mRNA isoforms can trigger cancer. [11][12][13] Thus, AS participates in oncogenesis, and the profiling of AS events may provide novel potential biomarkers for cancer prognosis, diagnosis and treatment.
In 2012, the estimated number of new cases of kidney cancer worldwide was 338 000. 14 Currently, kidney cancer is the 9th most common carcinoma in men and the 14th most common in women globally. Renal cell carcinoma comprises more than 90% of this malignant tumour, and clear cell carcinoma accounts for a large proportion of approximately 70%. 15 The relevance of AS events in kidney cancer is gradually being revealed. The importance of the AS of enhancer of zeste 2 polycomb pre-mRNA in the tumorigenic potential of kidney cancer has been reported. 16 The AS variants of doublecortin-like kinase 1 are overexpressed in kidney renal clear cell carcinoma (KIRC) and may be implicated in immune response. 17 In addition, differentially splicing isoforms between KIRC and nontumour tissues have been studied, and a cassette exon differentially skipped in DAB adaptor protein 2 in KIRC has been identified. 18 The importance of the systemic profiling of survival-associated AS has been emphasized in lung, ovarian and prostate cancers. [19][20][21] However, the survival-associated AS events in KIRC have yet to be systematically studied.
Bioinformatics analysis is a scientific method to examine genetic information and employ approaches to acquire, store, visualize and interpret medical or biological data. 22 Bioinformatics is widely used in cancer studies. 23,24 Machine learning, which is an important bioinformatics method, is applied to considerable clinical data sets to develop robust risk models and redefine patient classes. 25 The least absolute shrinkage and selection operator (LASSO) regression is one of the machine learning methods suitable for the regression of massive and multivariate variables. 26 These methods provide an optimized approach for the systematic study of AS in KIRC.
The Cancer Genome Atlas (TCGA) is a project that provides the detailed mRNA expression data and clinical information of patients with cancer. 27 In the current study, we comprehensively profiled the genome-wide AS events in 537 patients with KIRC from TCGA. We discovered a number of survival-associated AS events through bioinformatics analysis. Using LASSO regression, we built a high-power prognostic model based on the per cent spliced in (PSI) value of AS for patients with KIRC. We also revealed interesting function pathways in correlative genes and the potential mechanisms of AS influencing the prognosis of such patients. We aimed to construct an optimized survival-predicting model and provide useful data resources for future in-depth studies on AS mechanisms in KIRC.

| Data acquisition and processing
First, the RNA-seq data and clinical information of the KIRC cohort were downloaded from the TCGA data portal (https ://tcgadata.nci.nih.gov/tcga/) by using the GDC tool. mRNA expression counts were converted into an expression value by using DEseq R package, and only the expression counts of more than 2 were included. 28 Furthermore, only patients with at least 30 days of overall survival (OS) were retained in the study. Second, AS events with PSI value of KIRC were obtained from the TCGA SpliceSeq data portal (https ://bioin forma tics.mdand erson.org/TCGAS plice Seq/PSIdo wnload.jsp). AS events were divided into seven different types, so the intersections between these types and the quantitative analysis of these interactive sets were presented by the UpSet plot. 29 PSI value is a visualized ratio for quantifying an AS event from 0 to 1, and it is calculated for the seven types of AS events. A PSI value of 0 or 1 was then filtered, and the corresponding splicing patterns whose exon was NA were deleted. Third, a matrix with the seven types of AS events was built. The PSI value of the AS events was combined with the OS information and survival stage by gene symbols.

| Univariate survival analysis
COX regression allows us to calculate a special form of rate ratios known as hazard ratios (HRs) and investigate the relationship of predictors and time to event. Using COX regression, we can obtain a P value provided by the log-rank test and estimate an effect with its confidence intervals (CIs). To determine the survivalassociated AS events, we performed univariate COX regression analysis between AS and OS by using 'survival' and 'survminer' R package. Furthermore, we divided the survival-associated AS events into high-or low-risk groups by using the surv_cutpoint function and applied a log-rank test to verify the cut-point accuracy. Kaplan-Meier (K-M) curves were built to demonstrate the survival probability variation with time in high-and low-risk groups. P < .05 was considered significant, and all reported P values were two-sided.

| LASSO regression for multivariate prognostic model construction
To gain the final highly important prognostic predictors with less errors, we applied LASSO regression rather than multivariate COX regression, which may not be that suitable for this type of data, among survival-associated AS events. 30 This process, which is one of the machine learning methods adopted in several studies, was performed using glmnet package in R. 31,32 A multivariate regression formula was constructed on the basis of the PSI value of AS events.
Lastly, several notable predictors with non-zero LASSO coefficients were obtained.

| Training and testing for multivariate prognostic models
The tumour samples obtained from the TCGA data portal were randomly distributed into two parts, namely training and test groups, by using classification and regression training (caret) packages. In the training set, a risk score was calculated for each F I G U R E 1 AS events in KIRC. A, AS events were divided into seven types. Blue columns represent AS number, and red columns represent gene number. B, Interactions between the seven types of AS events in KIRC presented by UpSet plot. Red dots represent splicing type, yellow columns represent AS number, and blue columns represent the interactive number of AS events patient, and the data were divided into high-and low-risk groups based on the cut-point of a risk score. The K-M curve was then applied to verify whether the prognostic model could distinguish patients with long or short OS. 33 Receiver operator characteristic (ROC) curves in different years were built by using the survival ROC package to assess the efficiency of the prognostic model. In the test group, the same processes were performed to validate the prognostic model of this group.

| Correlation analysis for the source genes of AS
After several final important prognostic AS predictors were obtained, the correlation between the expression levels of AS source genes and human protein-code genes was analysed through Pearson correlation analysis. P < .05 was identified to be correlative genes.

| Gene ontology (GO) analysis
Gene ontology (GO) analysis includes three categories: molecular function (MF), biological process (BP) and cellular component (CC). In this study, we selected BP to perform GO analysis through an enrich GO function in the clusterProfiler package(version 3.5) and obtained the original database from 'org.Hs.eg.db' package. 34 To speculate the potential functions of survival-associated AS events leading to KIRC, we chose the source genes of notable AS events whose univariate regression P < .01 to perform GO analysis. In addition, to speculate the potential functions of the final AS predictors, we selected the top 500 correlative genes to perform GO analysis. For the AS events in GO analysis, we considered the adjusted P < .05 significant.

| Statistical analysis
R-studio platform (v. 3.5.1) was used for UpSet plot, univariate Cox regression, LASSO regression, K-M curves, ROC analysis, Pearson correlation and GO analysis. P < .05 indicated statistically significant difference. Regression coefficient (R) >0.5 was the cut-off of correlation analysis.  (Figure 1 A). Nearly half of the AS events was ES, demonstrating that ES was the prevalent type. In Figure 1B, most of the AS events were from one gene, which could have several types of AS events. For example, one gene might contain up to six AS types, such as ES, AP, AT, AA, AD and RI. Furthermore, ES was the most common type of AS, whereas ME was the least.

| Survival-associated AS events and GO analysis
To study the prognostic value of mRNA splicing events, we performed univariate regression analysis to identify survival-associated AS events (both prognostic P value and log-rank P < .005 were considered significant). Consequently, we detected 12 888 survival-associated AS events in KIRC. The top 10 HR >1 and the top 10 HR <1 survival-associated AS events are presented in Figure 2A, which contains information on gene ID and symbol, splicing type, exon site, HR, 95% CI and P value. We discovered that the AT of C4orf19 in exon site 6 was a favourable prognostic predictor, whereas the AT of C4orf9 in exon site 5 was a poor prog-

| Multivariate AS prognostic model
After conducting univariate regression analysis, we obtained 12 888 survival-associated AS events (P < .05). We then selected 8632 significant survival-associated AS events (P < .01) as candidates in identifying the final prognostic predictors for patients with KIRC.
As shown in ( Figure 3A and B), LASSO logistic regression was performed to the 8632 candidate AS events. Certain coefficients were accurately reduced to zero by forcing the total absolute value of the regression coefficients to be less than the constant value, and the most powerful prognostic predictors were selected. The final prognostic model of LASSO regression analysis is presented in Figure 3C as follows: first, the AP of KIAA0930 (HR = 1.9e + 01, 95%

| Training for the multivariate AS prognostic model
Initially, a risk score was calculated for each patient in the training set by combining the PSI value of AS and the corresponding LASSO coefficient. The cut-off point generated from the optimal sensitivity and specificity based on the ROC curve was then used to divide the patients into high-or low-risk groups ( Figure 4A and C).
Patients with risk scores of ≥ −17.13 were allocated to the high-risk group, whereas the remaining patients were in the low-risk group ( Figure 4A). Patients with high-risk scores likely expressed a high PSI value of the risky AS events (HR >1), whereas patients with low-risk   Figure 4A). To study the relationship between risk score and survival status, we performed K-M curves and log-rank test on the training sets. As shown in Figure 4B, patients with high-risk scores likely had a low survival probability, whereas those with lowrisk scores had a high survival probability (HR = 7.7, 95% CI: 5.5-11; P < .001). As shown in Figure 4C, we used AUCs at 1, 3 and 5 years to assess the prognostic power of the final model. The AUCs of the ROC curves at 1, 3 and 5 years were >0.88, revealing a powerful prognostic ability.

| Testing for the multivariate AS prognostic model
To verify the results of the training set, we performed the same analyses on the patients in the test set. As shown in Figure 5A,

| Potential functions of genes from AS predictors in the multivariate AS prognostic model
We analysed the co-expression between genes from the final AS predictors and other protein-code genes to obtain a co-expression relationship. (Figure 6A UACA correlative genes indicated that extracellular structure and oxidative phosphorylation were associated with ATP pathways ( Figure 6B). The significant GO terms of RPS24-related genes indicated viral gene expression, including transcription and translation ( Figure 6C). The significant GO terms of BCCIP-related genes corresponded to pathways related to ATP metabolism ( Figure 6D).
Most of the abovementioned GO terms were closely associated with oncogenesis, and this finding revealed the potential functions of the final prognostic model influencing the outcome of patients.

| D ISCUSS I ON
In the present study, we first demonstrated the comprehensive fea-   studies. This study discovered massive important AS sites and presented scientific data for further mechanism studies.

FU N D I N G S TATEM ENT
This work was supported by the NSFC (Grant no. 81270805) and the Science and Technology Department of Sichuan Province (Grant no. 2018SZ0378).

CO N FLI C T O F I NTE R E S T
On behalf of all authors, the corresponding author states that there is no conflict of interest.

AUTH O R CO NTR I B UTI O N S
WT designed the experiments; WT and WH.T. analysed the data; and YZ and LZ wrote the manuscript. All authors read and approved the final manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data sets generated/analysed for this study are included in the manuscript and the supplementary files.