RNA processing genes characterize RNA splicing and further stratify colorectal cancer

Abstract Objectives Due to the limited evaluation of the prognostic value of RNA processing genes (RPGs), which are regulators of alternative splicing events (ASEs) that have been shown to be associated with tumour progression, this study sought to determine whether colorectal cancer (CRC) could be further stratified based on the expression pattern of RPGs. Materials and Methods The gene expression profiles of CRCs were collected from TCGA (training set) and three external validation cohorts, representing 1060 cases totally. Cox regression with least absolute shrinkage and selection operator (LASSO) penalty was used to develop an RNA processing gene index (RPGI) risk score. Kaplan‐Meier curves, multivariate Cox regression and restricted mean survival (RMS) analyses were harnessed to evaluate the prognostic value of the RPGI. Results A 22‐gene RPGI signature was developed, and its risk score served as a strong independent prognostic factor across all data sets when adjusted for major clinical variables. Moreover, ASEs for certain genes, such as FGFR1 and the RAS oncogene family, were significantly correlated with RPGI. Expression levels of genes involved in splicing‐ and tumour‐associated pathways were significantly correlated with RPGI score. Furthermore, a combination of RPGI with age and tumour stage resulted in significantly improved prognostic accuracy. Conclusions Our findings highlighted the prognostic value of RPGs for risk stratification of CRC patients and provide insights into specific ASEs associated with the development of CRC.


| INTRODUC TI ON
Colorectal cancer (CRC) is the third most prevalent malignant tumour, accounting for 9% of all cancer-related fatalities worldwide in men and 8% in women. 1 Aberrant gene expression profiles play an essential role in the progression of CRC. 2 A series of processes involved in post-transcriptional RNA processing can mediate gene expression, including removal of introns through alternative RNA splicing, as well as 5′-cap and 3′-end formation. 3 RNA processing, a determinant factor in translating genotype to phenotype, is pivotal for the DNA damage response, cancer development and chemo-resistance. [4][5][6][7] Given that dysregulated expression of RNA processing genes may contribute to abnormalities in RNA expression profiles in CRC patients, systematic examination of the roles that RNA processing factors play in CRC is warranted.
RNA processing factors function in intron removal and regulate alternative splicing events (ASEs) of eukaryotic genes. Aberrant selective RNA processing, especially alternative splicing (AS), facilitates cancer development and progression via alterations in protein structure, nonsense-mediated mRNA decay, DNA repair defects and genome instability. 8,9 In CRC, several RNA processing factors, including HNRNPLL, SRSF1, SRSF3 and SRSF6, have been shown to actively participate in tumour progression and have demonstrated prognostic value, indicating that genetic alterations affecting RNA splicing are associated with CRC pathogenesis. [10][11][12][13] Recently, Xiong et al analysed seven kinds of ASEs in CRC and linked a selection of ASEs to patient clinical outcomes. 14 However, to date, the prognostic significance of RNA processing genes serving as regulators of ASEs has not been clearly elucidated.
In this study, we systematically investigated the capability of RNA processing gene expression profiling for the prediction of overall survival in a total of 1060 CRC patients. RNA-sequencing data from The Cancer Genome Atlas (TCGA) and microarray data from the Gene Expression Omnibus (GEO) database were utilized for the construction and validation of the RNA processing-related signature. The association between this signature and both AS profiles and clinicopathological variables of CRC patients were further analysed. Eventually, ingenuity pathway analysis (IPA) and gene set enrichment analysis (GSEA) identified that a higher-risk score in the RNA processing-related signature was involved in several aspects of tumour progression in CRC patients, including RNA damage and repair, cell death and cell cycle regulation. These results provide novel insights into CRC progression and RNA processing.

| Study population
Molecular data from patients diagnosed with colorectal cancer were retrieved from TCGA. Transcriptome HTSeq-count data from the TCGA-COAD (colon adenocarcinoma) and TCGA-READ (rectum adenocarcinoma) projects were downloaded from the Genomic Data Commons using R package "TCGAbiolinks", 15 including 591 fresh-frozen samples with primary malignancies. Somatic mutation data and patient survival information were downloaded from PanCanAtlas and were filtered for COAD and READ tumour types. Of these TCGA tumour samples, 43 samples whose overall survival (OS) time was less than three months were excluded to enhance the robustness of downstream analyses; corresponding clinicopathological information of the remaining 548 samples was retrieved from cBioPortal (http:// www.cbiop ortal.org/datasets). Another three independent cohorts downloaded from the GEO, including GSE17536, 16 GSE17538 17 and GSE38832, 18 comprising a total of 512 CRCs with known gene expression matrix and corresponding clinicopathological information were utilized to confirm the performance of the prognostic signature. Of these external validation cohorts, gene expression matrices were profiled using the Affymetrix Human Genome U133 Plus 2.0 Array; the same exclusion criteria of OS were followed.

| Data pre-processing for gene expression profiles
For raw data from high-throughput sequencing, Ensembl IDs for mRNAs were transformed to gene symbols with GENCODE27.
The number of fragments per kilobase of non-overlapped exon per million fragments mapped (FPKM) was computed first and transformed into transcripts per kilobase million (TPM) values, which showed greater similarity to those generating from microarray analysis and were more comparable between samples. 19 The mRNAs with TPM values less than 1 in over 90% of samples were considered to be noise and removed. For microarray data retrieved from the GEO database, we performed RMA normalization and processing using default settings for background correction and normalization with the R package "affy". 20 Affymetrix probe ID was annotated to gene symbols according to the GPL570 platform. For multiple probes that mapped to one gene, the mean value of expression was considered.

| Collection of RNA processing genes
We collected a total of 929 genes that participated in any procedure engaged in the conversion of at least one primary RNA transcript into at least one mature RNA molecule by searching GO:0006396 term in the AmiGO online database (http://amigo.geneo ntolo gy.org/ amigo). We ultimately collated a total of 774 genes shared in both the TCGA and GEO data sets with sufficiently reliable expression for further analyses.

| Identification of the prognostic signature
Univariate Cox regression analysis was performed on the expression matrix of RNA processing genes (RPGs) to first determine genes that were associated with prognosis of CRC patients in the TCGA data set with a relatively loose threshold of P < .1. To enhance robustness of the risk signature, the TCGA cohort of 548 samples was randomized into two subsets based on 5-fold sampling; the training set included 4-fold CRC samples, and the internal testing set included the rest. Least absolute shrinkage and selection operator (LASSO) penalty was applied to multivariate Cox regression analyses to build an optimal prognostic signature with the minimum number of RPGs. Ten-fold cross validation was conducted to tune the optimal value of the penalty parameter λ, which gives the minimum partial likelihood deviance. Finally, a set of RPGs (ie the prognostic signature) and their non-zero coefficients were identified and used to build an RPG index (RPGI). An RPGI risk score was calculated for each sample via a linear combination of the selected features, weighted by the corresponding coefficients based on the following formula: where C i is the coefficient, E i is the normalized expression value of each selected gene by log 2 and z-score transformations, and R RPGI represents the risk score for RPGI. Patients were dichotomized into high-risk (HRisk) and low-risk (LRisk) groups using the cohort-specific median RPGI risk score as the cut-off for each data set.

| Bioinformatics analyses
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were utilized for gene set annotation, and GSEA was further used to investigate the functional enrichment of risk scoreassociated genes using the R package "clusterProfiler". 21 Differential expression analysis based on TPM values was conducted by the twosample Mann-Whitney U test. The Benjamini-Hochberg method was used to adjust nominal P values (false discovery rate, FDR) for multiple testing. We divided the mean expression of the treatment group (ie HRisk group) by the control group (ie LRisk group) to obtain the fold change value. Differentially expressed genes between the two groups were uploaded into IPA software (Qiagen) for core analysis, which described possible disease and bio-functions enriched in the data set. The biological significance of the IPA was defined as an absolute value of z-score > 2. The presence of infiltrating stromal cells in tumour was estimated with the R package "ESTIMATE". 22 The population abundance of tissue-infiltrating immune and stromal cell populations was estimated with the R package "MCPcounter" per sample in the TCGA cohort. 23 The mutation landscape was analysed with the R package "maftools" following initial removal of 100 FLAGS genes, 24,25 and differentially mutated genes were identified by using the mafCompare() function where genes mutated in greater than 5% of CRC samples in the TCGA cohort were considered. Individual consensus molecular subtype (CMS) was predicted with the R package "CMScaller" with an FDR threshold of 0.05 by default. 26 Eight signal transduction pathways related to colorectal carcinogenesis were analysed based on the published literature, 27 and we referred to a previous report to establish a signature of these eight oncogenic pathways. 28 We then used the single sample GSEA (ssGSEA) method on these gene sets to generate enrichment scores for each pathway per sample for the TCGA cohort by using the R package "GSVA." Subsequently, we compared the ssGSEA score of each pathway between the two risk groups.

| Construction of regulatory network between RNA processing genes and ASEs
We retrieved RNA splicing data from an online archive (http://bioin forma tics.mdand erson.org/TCGAS pliceSeq). The percent spliced in (PSI) value, which represents the ratio of included transcript reads in the total transcript reads, was used to quantify the ASEs. 29 To generate as strongly reliable a set of ASEs as possible, we implemented a series of stringent filters (percentage of samples with PSI value ≥ 75 and average of PSI value ≥ 0.05). RPGs with significant changes in expression levels were used to investigate potential association of the differential PSI levels of ASEs between CRCs with lower-risk (first quartile) and higher-risk (fourth quartile) scores. In this context, we measured the Pearson correlation coefficient for each RPG-ASE pair; those pairs with absolute correlation coefficients greater than 0.5 and an FDR less than 0.05 were considered significantly correlated. The potential regulatory network was constructed via each significantly correlated pair and visualized via Cytoscape. 30

| Development and verification of a composite Processing-Clinical prognostic index
Based on the results derived from multivariate analyses, we inte- The prognostic value of the PCPI score was compared with that of the RGPI in continuous form according to the concordance index (C-index) and given by the restricted mean survival (RMS) curve. 31 The RMS represents the life expectancy at 120 months (10 years) for patients with different risk scores. The performance of risk groups determined by the RGPI risk score was assessed with reference to the RMS time ratio between the HRisk and LRisk groups. 32 The higher the RMS value, the greater the prognostic difference.

| Immunohistochemical analysis
Protein expression data were obtained from the Human Protein Atlas (HPA) (www.prote inatl as.org). These immunohistochemical staining images were used to determine protein expression of the 22 selected genes in both normal and CRC tissues.

| Statistical analyses
All statistical analyses were conducted by R3.6.2 using Mann-Whitney testing for continuous data and Fisher's exact testing for categorical data. Correlation between two continuous variables was measured via Pearson's correlation coefficient. Kaplan-Meier curves were generated for survival rates of patients, with difference detection based on log-rank testing. A Cox proportional hazard regression model was used to calculate the hazard ratios (HRs) and 95% confidence intervals (CI) regarding OS. The C-index was calculated with "survcomp" and compared with the "compareC" R packages. 33 The RMS curve and RMS time ratio were estimated with the "survival" and "survRM2" R packages. For all statistical analyses, a two-tailed P value less than .05 was considered statistically significant.

| Overview of study design
A total of 1060 patients diagnosed with CRC from four independent data sets were ultimately included in this study; demographic and clinical characteristic descriptions of the different data sets are summarized in Table 1. The entire workflow of this study, including the filtration of RPGs, development and validation of a prognostic signature (ie RPGI), the analyses of RPGI-associated alteration of the ASEs and RNA expression profiles, and the construction of a composite Processing-Clinical prognostic index (ie PCPI), are delineated in Figure 1. A schematic view of RNA processing gene selection and prognostic signature development is depicted in Figure S1.

| Prognostic value of RPGs and their biological function in CRCs
We evaluated the prognostic effect of the 774 RPGs and identified 127 genes that were associated with CRC patient OS (Table   S1). Among these, 70 RPGs were risk-associated because the corresponding HRs were greater than 1, while the remaining 57 genes were considered protection-associated. Since these RPGs represent a grouping of genes that participate in any step involved with the conversion of at least one primary RNA transcript into at least one mature RNA molecule, we used GO analysis to identify the more explicit biological processes that these prognosis-related RPGs are enriched in. We found that they were relevant to such key biological functions as RNA splicing, RNA 3'-end processing, regulation of RNA splicing and regulation of mRNA metabolic process, among others (all FDR < 0.001; Figure 2A). We further used KEGG analysis for annotation, and the results indicated that pathways involved in the spliceosome, mRNA surveillance pathway, RNA transport and aminoacyl-tRNA biosynthesis were closely associated with these RPGs (all FDR < 0.05; Figure 2A). a Sum of frequency numbers may not equal to the total sample size due to missing or unpredictable values.

TA B L E 1 Demographic and clinic characteristic descriptions for colorectal cancer patients in different data sets
b Median survival time is incalculable because the mortality at the last follow-up time is less than 50%.
c Age is represented as mean ± standard deviation.

| Feature selection and prognostic signature building
To readily and efficiently categorize clinical outcomes of CRC patients via RPGs, we applied a LASSO penalty with multivariate Cox regression analysis to the TCGA training set and identified 22 features with non-zero coefficients ( Figure 2B, Figure S2A,B). These LASSO-selected features were used to build a prognostic signature, the RPG index (RPGI), and corresponding RPGI risk scores were computed for all data sets. All 1060 CRC samples were further dichotomized into high-risk (HRisk) and low-risk (LRisk) groups, and each sample was predicted to be one of the four CMS ( Figure S3A-D). Interestingly, the HRisk group of the TCGA cohort had more tumour protein 53 (TP53) mutations (P = .024), and CMS4 especially (P < .001; Figure S4A). We then pooled all 1060 CRCs samples together and found that almost all 22 LASSO-selected features were significantly differentially expressed between the two risk groups ( Figure S4B-C). Moreover, advanced tumour stage (ie stage III and stage IV) was enriched in the HRisk group (57.4% vs 41.5%, P < .001; Figure 2C, Figure S5A). We further verified that CMS4, featuring stromal activation, 34 was dramatically enriched in the HRisk group (46.8% vs 20.3%, P < .001; Figure 2C, Figure S5B), which was consistent with the higher enrichment identified in a previously reported stromal score 22 (P < .001; Figure S5C).
In all data sets, we found that the LRisk group had a significantly more favourable prognosis than the HRisk regarding OS (TCGA F I G U R E 1 Flow chart of the study design. Using 774 RNA processing genes derived from Gene Ontology (GO: 0006396), we constructed a 22-gene risk signature in the TCGA cohort that was subsequently validated in three external validation cohorts from GEO. Furthermore, we identified differential splicing events and underlying splicing networks between first and fourth quartiles of risk score.  Table 2). To further investigate the prognostic performance of the RPGI risk score with adjustment for major clinical variables, including tumour stage and patient age (an exception for validation 3 due to a lack of age records), we performed multivariate Cox regression analysis and found that RPGI risk score was a significant independent prognostic factor for CRC patients (Table 3).
We further performed immunohistochemical analysis of the 22 identified genes in The Human Protein Atlas, and we found that the protein products of the risk-associated genes showed higher expression levels in CRC samples compared with adjacent normal tissues ( Figure   S6A). In contrast, protein expression of the protection-associated genes showed the opposite trend ( Figure S6B). These results may support the functional relevance of the identified 22 RPGs in CRC patients.

| Correlation of RPGI risk score with immunity and oncogenic pathways
We used the MCPcounter algorithm to compare tumour immune microenvironments (TIMEs) between the HRisk group and the LRisk group ( Figure S7A). We found significant elevations in the proportion of endothelial cells and fibroblasts in the HRisk group (both P < .01), whereas the proportions of CD8 + T cells and NK cells were comparable between the two groups ( Figure S7B). We then estimated the enrichment score of eight oncogenic pathways ( Figure S7C). We Individual stromal score, predicted CMS, TNM stage and RPGI risk score are also annotated above the heatmap. Kaplan-Meier overall survival curves with difference detection of log-rank testing for patients from the TCGA training set, TCGA internal testing set, and three external validation sets are portrayed in (D)-(H), respectively. Patients were divided into different risk groups based on a cohort-specific median cut-off value of RPGI risk score pathway was significantly downregulated in the HRisk group (P < .01), although this might be due to the frequent mutations in TP53 in this group. These results indicated an activation of stromal components in TIME of high-risk patients together with activated oncogenic pathways based on the proposed signatures, which likely contributed at least partially to the poorer prognosis in these patients.

| Functional enrichment of genes that were associated with RPGI risk score
Given that RPGs are the primary elements manipulating the life cycle of RNAs in eukaryotes, we subsequently assessed how the RPGI could mediate RNA expression profiles. In this context, we correlated the RPGI risk score with all robustly expressed mRNAs and generated a pre-ranked list sorted by Pearson correlation coefficient. We then performed GSEA and found that RPGI risk score was closely associated with dysregulation of the cell cycle, wound healing, angiogenesis and protein serine/threonine kinase activity based on GO terms (all FDR < 0.01; Figure 3A). GSEA of KEGG also revealed dysregulation of the extracellular matrix (ECM)-receptor interaction, MAPK and p53 signalling pathways, spliceosomes and RNA transport (all FDR < 0.01; Figure 3B). IPA indicated that several biological functions were significantly associated with RPGI risk score, including RNA damage and repair, cell cycle, cell death and RNA trafficking (all P < .001; Table 4). and retained intron (RI), were identified per CRC sample; the proportion of these ASE categories in CRCs varied dramatically, from 0.3% to 45.8% ( Figure 3C). Although the proportional pattern of each ASE type was similarly shared for all CRCs, the amount of each of the ASEs showed significant positive correlation with the RPGI (ρ = 0.22, P < .001 by Spearman's correlation analysis; Figure 3C), and the amount of detected ASEs was significantly higher in CRCs with a higher-risk (fourth quartile, n = 137) score compared to those with a lower-risk (fourth quartile, n = 137) score (P < .001, Figure 3D).

| Association between RPGI risk score and ASEs
We further identified differentially expressed RPGs (absolute fold change > 1.5 and FDR < 0.05;  Figure 4C). For these ASEs with markedly different PSIs, we found that the frequency of all ASE types (except for ME, which has the lowest proportion) was significantly altered (P < .001 for AP, AT, AD and ES; P < .05 for AA and RI; Figure S8) compared to background ASEs, which suggested that the presence of altered ASEs is relevant for the prognosis of CRC patients.
Subsequently, we examined potential regulatory networks involved among the significantly altered 36 RPGs and 743 ASEs, and constructed a network with 453 pairwise correlations that ultimately involved 25 differential RPGs and 164 associated differential ASEs ( Figure 5A,  Figure 5B).

| Combining RPGI with clinical characteristics
In addition to RPGI risk score, we also affirmed that clinical characteristics (ie age and tumour stage) served as independent prognostic factors, which could have complementary values (Table 3). To further improve the prognostic accuracy, we combined RPGI risk scores

F I G U R E 3
Risk score-related functional pathways and alternative splicing profile analysis in CRCs with lower-(first quartile) or higherrisk (fourth quartile) scores. A, GSEA of GO for risk scores based on pre-ranked Pearson's correlation coefficients of risk score-associated mRNAs. B, GESA of KEGG analysis for risk scores based on pre-ranked Pearson's correlation coefficients of risk score-associated mRNAs.   48 and TERT has been found to be overexpressed in the right colon compared to the left, 49 indicating worse prognosis in CRC. Both papillary thyroid carcinoma and right colon cancer are characterized by BRAF V600E mutation, which might interact with TERT expression and TERT mutation for the promotion of tumour invasion. 50,51 The elongator complex protein 6 (ELP6) gene encodes an elongator subunit, which is reportedly capable of controlling cell migration and melanoma tumorigenesis. 52 PTB-associated splicing factor (PSF), also known as splicing factor proline-and glutamine-rich (SFPQ), is a PPARγ-interacting protein involved in RNA processing and DNA repair process, 53 and serves as a regulator of apoptosis in colon cancer cell lines. 54 These results indicate that our study protocol may be able to identify novel carcinogenesis-associated RPGs; future studies of these prognostic RPGs may identify novel mechanisms underlying RNA processing and overall CRC progression.

| D ISCUSS I ON
Several strengths of this study warrant specific focus. First, we analysed a large sample size of 1060 CRC patients with either RNA-Seq or microarray data, which indicates that our analysis conclusions are likely highly reliable, robust and independent of specific expression quantitative platform. This also suggests the possibility of future verification of the risk signature in additional cohorts. Second, we used restricted mean survival time, an alternative summary measure of survival time distributions that does not rely on the proportional hazards assumption to demonstrate the clinical utility of the RPG risk signature, which is robust and more clinically interpretable. 55 However, as with all investigations, certain limitations are present as well. Both validation cohorts were derived from microarray platforms, and additional validation in a prospective cohort using an RNA-Seq platform is warranted. Additionally, further experimental results regarding these prognosis-related RPGs are required to elucidate the mechanisms underlying RNA processing and CRC tumorigenesis.
In summary, we identified the prognostic value of specific genes associated with RNA processing in CRC and propose a 22-RPG signature for risk classification of CRC patients. We further identified differential ASEs, bio-functions, signalling pathways, and clinical features underlying the RPG risk signature. Combining data concerning age, tumour stage and risk signature could further improve prognosis prediction in CRC patient samples.

ACK N OWLED G EM ENTS
We greatly appreciate the patients and investigators who partici-

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest.

E TH I C A L S TATEM ENT
As the data (TCGA and GEO data sets) are publicly available, no ethical approval is required.

AUTH O R ' S CO NTR I B UTI O N
Conceptualization, X-FL and FY; Methodology, X-FL, YZ and JM; Software, X-FL and LJ; Formal analysis, X-F.L, YZ, JM, and LJ; Investigation, X-FL, YC, YW, BZ, and HY; Writing-original draft, X-FL and JG; Visualization, X-FL and YZ; Funding acquisition, X-BL and FY; Supervision, X-BL and FY

DATA AVA I L A B I L I T Y S TAT E M E N T
Raw data for this study were generated at TCGA with cancer type of COAD and READ, and GEO database with Series ID of GSE17536,