A stroma‐related lncRNA panel for predicting recurrence and adjuvant chemotherapy benefit in patients with early‐stage colon cancer

Abstract The heterogeneity in prognoses and chemotherapeutic responses of colon cancer patients with similar clinical features emphasized the necessity for new biomarkers that help to improve the survival prediction and tailor therapies more rationally and precisely. In the present study, we established a stroma‐related lncRNA signature (SLS) based on 52 lncRNAs to comprehensively predict clinical outcome. The SLS model could not only distinguish patients with different recurrence and mortality risks through univariate analysis, but also served as an independent factor for relapse‐free and overall survival. Compared with the conventionally used TNM stage system, the SLS model clearly possessed higher predictive accuracy. Moreover, the SLS model also effectively screened chemotherapy‐responsive patients, as only patients in the low‐SLS group could benefit from adjuvant chemotherapy. The following cell infiltration and competing endogenous RNA (ceRNA) network functional analyses further confirmed the association between the SLS model and stromal activation‐related biological processes. Additionally, this study also identified three phenotypically distinct colon cancer subtypes that varied in clinical outcome and chemotherapy benefits. In conclusion, our SLS model may be a significant determinant of survival and chemotherapeutic decision‐making in colon cancer and may have a strong clinical transformation value.

could not only distinguish patients with different recurrence and mortality risks through univariate analysis, but also served as an independent factor for relapse-free and overall survival. Compared with the conventionally used TNM stage system, the SLS model clearly possessed higher predictive accuracy. Moreover, the SLS model also effectively screened chemotherapy-responsive patients, as only patients in the low-SLS group could benefit from adjuvant chemotherapy. The following cell infiltration and competing endogenous RNA (ceRNA) network functional analyses further confirmed the association between the SLS model and stromal activation-related biological processes. Additionally, this study also identified three phenotypically distinct colon cancer subtypes that varied in clinical outcome and chemotherapy benefits. In conclusion, our SLS model may be a significant determinant of survival and chemotherapeutic decision-making in colon cancer and may have a strong clinical transformation value.

K E Y W O R D S
chemotherapy responsiveness, colon cancer, prognosis, stroma-related lncRNA signature assesses tumour invasion depth, lymph node metastatic status and remote metastasis, 4 is still the most commonly used indicator for assessing the recurrence risk of patients with colon cancer and whether they need ADJC or not. However, the predictive ability of this system is considered insufficient, as it does not predict the outcome of the patients precisely enough. Owing to high levels of heterogeneity found in colon cancer, prognoses and chemotherapeutic responses may vary widely between patients with similar clinical features. Therefore, it is necessary to develop novel biomarkers to help clinical workers tailor therapies more rationally and precisely.
Accumulating evidence has suggested that genetic difference of tumours is the major cause of heterogeneous anti-cancer drug response. 5,6 However, previous efforts to develop models for predicting chemotherapy response based on expression and mutation profiling of protein-coding genes (PCGs) have been unsuccessful. 7,8 Similar to PCGs, long non-coding RNAs (lncRNAs), which used to be regarded as 'transcript junk', 9,10 also act as key regulators that participate in multiple biological processes involved in tumour development, progression and cancer therapy response. 6,11 LncRNA is a transcript longer than 200 nucleotides that cannot be translated into proteins 10 and is among the most prevalent transcriptional changes in tumour. 11 In colon cancer, several prognostic predictive models have been developed based on lncRNAs. [12][13][14][15] However, as none of them has been reported to provide potential treatment guidance, these models may not meet clinical needs. Nevertheless, the clinical significance of lncRNAs in colon cancer still needs further exploration.
Here, we downloaded five datasets of colon cancer derived from Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih. gov/geo/), which consisted of the transcriptome profile data of 988 samples. To establish a lncRNAs-based model that could be used for improving the relapse risk prediction and tailoring therapies, we specially identified lncRNAs that were significantly associated with both cancer prognosis and biological processes including angiogenesis, hypoxia, TGFβ signalling and epithelial-mesenchymal transformation (EMT), which have been well-studied in multiple solid tumours, and defined as important stroma-related factors mediating tumour metastasis and drug resistance, including in colon cancer. [16][17][18] 2 | ME THODS

| Data source, study population and clinicopathological variables
The workflow chart is shown in Figure S1. A summary of all dataset information used in this study is provided in Table S1. Sample tissues were excluded from the study if they came from a normal colon, stage IV patients, or patients without survival information on relapse. Then, all eligible samples were randomly separated into the training and validation (7:3) set groups using the 'caret' package. In addition, for gastric cancer exploration, the 'GSE62254' dataset was also downloaded and analysed. The demographic information and clinical information were retrieved using the 'GEOquery' package for GEO datasets. The end-points analysed in this study were relapsefree survival (RFS), defined as the interval between the date of diagnosis and date of tumour relapse, and overall survival (OS), defined as the interval between the date of diagnosis and death.

| Microarray data processing and lncRNA profile mining
As all samples involved in the downloaded colon cancer datasets were hybridized to Affymetrix HG-U133 Plus 2.0, the raw microarray data were all renormalized using a robust multiarray averaging method with 'affy' and 'simpleaffy' packages. The 'ComBat' algorithm was applied to reduce the likelihood of batch effects from non-biological technical biases. The lncRNA annotations were collected from three sources: (a) Based on the 'Comprehensive gene annotation' file provided by the GENCODE website, we screened the official gene annotation documents of the Affymetrix HG-U133 Plus 2.0 platform for the lncRNA gene; (b) all GPL570 platform probes were mapped to the 'Transcript sequences' file downloaded from the GENCODE website using SeqMap and re-annotated with ensembl ID. 19 Then, the lncRNA probes were extracted based on a gene transfer format file for 'Long non-coding RNA gene annotation'. Of note, the probes were dropped if the lncRNA gene type was marked as a TEC gene; and (c) lncRNA annotation files generated by Zhang et al were also downloaded and checked as a supplement. 20 Finally, 4037 distinct annotated lncRNA transcripts with corresponding Affymetrix probe IDs were generated.

| Determination of lncRNA function
lncRNA function was explored using the triple competing endogenous RNA (lncRNA-miRNA-mRNA) network 21 constructed through the following steps: First, we predicted the miRNA target of each lncRNA using the miRCODE website; then, the corresponding PCGs were identified using miRDB, miRTarBase and TargetScan; the correlation between lncRNA and PCGs was further tested by calculating the Pearson correlation coefficients. The PCGs significantly positively correlated with lncRNA (correlation coefficient > .5, adjusted P value < .001) were considered lncRNA-related PCGs, and these genes were entered into the Gene Ontology (GO) enrichment analysis to determine the lncRNA function. Finally, we chose 50 lncRNArelated PCGs with the highest correlation as representatives to draw the triple ceRNA network using Cytoscape software, to show the interaction between genes more clearly.

| Gene set variation analysis (GSVA)
Gene set variation analysis is the most often used method to estimate biological process activity. 22 In the present study, gene sets for F I G U R E 1 The survival impact of SLS model. A, Circos plot shows the survival impact of 52 lncRNAs selected by LASSO Cox regression analysis (risk factor, red; protect factor, green); B-C, Kaplan-Meier curves (left) and TDROC curves (right) of relapse-free survival according to SLS groups in the training cohort; (B) and validation cohort (C); D, Kaplan-Meier curves (left) and TDROC curves (right) of overall survival according to SLS groups; E-F, Forest plots of the associations between SLS and relapse-free survival (E) and the associations between SLS and overall survival (F) in various subgroups. Unadjusted HRs (boxes) and 95% confidence intervals (horizontal lines) are depicted. AUC, area under ROC curve; CI, confidence interval; CMS, consensus molecular subtypes; HR, hazard ratio; SLS, stroma-related lncRNA signature 'EMT', 'angiogenesis', 'hypoxia' and 'TGFβ signalling' were retrieved from the 'hallmark gene sets' collection of the 'molecular signatures database' and were employed for GSVA using the 'GSVA' package.
The sum of the GSVA score obtained from the above four gene sets was defined as the stromal activation level of the corresponding sample.

| Characterization of tumour microenvironment
To quantify the infiltration levels of stromal cells located in tumour microenvironment, the single-sample gene set enrichment analysis (ssGSEA) method was applied using the 'GSVA' package. 23 The

| LASSO Cox regression and survival analysis
The least absolute shrinkage and selection operator (LASSO) regression is a commonly used method for feature selection and has been applied to the Cox proportional hazard regression model. 27 As previously described, 28,29 only lncRNAs which passed the 1000-times bootstrapping robustness test were selected for LASSO regression analysis and all lncRNA expression values were dichotomized and respectively assigned as 1(represents lower expression) and 2 (represents higher expression). The genes represented by the minimum penalty parameter, λ, would be chosen to establish the prognostic risk score formula via Cox regression analysis in the training cohort. The optimal cut-off values for each gene were calculated using the 'survminer' R package. The Kaplan-Meier method was applied to calculate the survival rate, and the log-rank test was performed to assess the statistical significance. Uni-and multivariate analyses were performed using the Cox proportional hazard models with a stepwise 'LR forward' method. The time-dependent receiver operating characteristic curve (TDROC) and Harrell's concordance index (c-index) analyses

| Other statistical analyses
The 'CancerSubtypes' R package was used for identifying molecular cancer subtypes based on aggregating multiple genomic platform data. 30 The chemotherapy response of each sample was predicted by a predefined FOLFIRI (a chemotherapy regimen combination of irinotecan, 5-fluorouracil and leucovorin deployed in first-line treatment of patients with metastatic colon cancer) response signature 31 using the nearest template prediction (NTP) algorithm 32 module from GenePattern as described by Sadanandam et al. 33 The unpaired Student's t test (two groups) and one-way ANOVA test (more than two groups) were used for normally distributed data comparison; otherwise, the Mann-Whitney U test (two groups) and Kruskal-Wallis test (more than two groups) were performed. Nomogram construction and validation were performed following Iasonos' guide. 34 All statistical analyses were conducted using R software (version 3.5.0) and SPSS software (version 25.0), and P values were two-tailed. Statistical significance was set at P < .05.

| Identification of robust prognostic stromarelated lncRNAs
The patient cohort of this research included 988 patients. Pearson's correlation test identified a total of 2654 lncRNAs significantly related with the stromal activation level (adjusted P value < .05).
Among them, 372 lncRNAs were significantly correlated with RFS via the Cox univariate regression analysis in the entire cohort. Next, we used 1000-times bootstrapping to test the predictive robustness of the above 372 genes as we mentioned in 'Methods' section.
Finally, 82 lncRNAs, the expression levels of which were stably and significantly correlated with prognosis, were identified and defined as robust prognostic stroma-related lncRNAs (see Table S2).

| The stroma-related lncRNA signature (SLS) generation and validation
The 988 colon cancer samples were randomly regrouped into training (n = 692) and validation cohorts (n = 296) as described in the 'Methods' section. Comparison of patient characteristics between the two groups showed no significant differences (P > .05; see Table   S3). Through LASSO analysis (see Figure S2), 52 lncRNAs were screened out as predictors to generate the SLS model and their survival impact is shown in Figure 1A. The cut-off values and risk coefficients of these 52 lncRNAs used for creating the risk score formula are listed in Table S4.  found for the analysis of 678 patients with documented OS information ( Figure 1D, Tables 1 and 2). In addition, we also conducted subgroup analyses to further validate the prognostic role of the SLS model. The forest plots indicated that regardless of whether it was referring to RFS ( Figure 1E) or OS ( Figure 1F), a higher SLS score was significantly associated with a poorer prognosis in all subgroups but the stage I patients.

| SLS and the therapeutic benefit of adjuvant chemotherapy
Several studies have reported that EMT, TGFβ signalling, angiogenesis and hypoxia are all important biological processes influencing chemotherapy efficacy. Correspondingly, through analysing the GSE39582 dataset, we found that patients with a high stromal activation level could not benefit from adjuvant chemotherapy (ADJC) ( Figure 2A). As the SLS model consisted of stromal activation level-related lncRNAs, we were interested in knowing whether the SLS model also helps to find patients who could potentially benefit from chemotherapy. As expected, ADJC could only significantly reduce the mortality risk of patients in the low-SLS group, but did not confer survival benefits to patients in the high-SLS group ( Figure 2B).
Moreover, the survival benefit of ADJC was more obvious for patients with a low-SLS value in both stage II and stage III subgroups, although the P value did not reach a statistical significance in stage II patients ( Figure 2C). We then conducted multivariate analyses in the GSE39582 dataset (see Table S5) and obtained the independent prognostic factors for RFS and OS, respectively, to construct the nomograms ( Figure 2D-E). The calibration plots showed that both nomograms performed well compared with the performance of the ideal models ( Figure 2F). Decision curve analysis revealed that the clinical usefulness of the nomograms significantly overwhelmed that of the TNM stage system ( Figure 2G).

| The relationship between SLS signature, tumour microenvironment, clinical parameters and biological processes
The correlation between SLS signature and tumour microenvironment (TME) landscape was evaluated. Using unsupervised clustering, we identified four distinct TME infiltrated cell clusters ( Figure 3A), and the survival impact of each cells is exhibited in Figure 3B. As shown in the heat map, the cells whose infiltration levels were significantly positively associated with SLS scores mainly fell into cell clus-

| Evaluation of novel colon cancer subtypes by aggregating lncRNA and mRNA data
In a previous study, we identified 100 robust prognostic TME genes and found that a panel consisting of these genes could not only precisely predict the relapse and mortality risks in patients with colon cancer, but also discriminate patients who would potentially benefit from ADJC. 28 However, it has been widely believed that aggregating multiple types of genomic data could help to reflect the complexity of the tumour genetic characteristics. Therefore, we attempted to construct a novel molecular subtype based on both the 52 stroma-related lncRNAs identified in this study and 100 TME genes. We chose three as the optimal clustering number (see Figure S3) and samples were then divided into three distinct subtypes based on SNF-CC approach (combination of 'similarity network fusion' method and 'consensus clustering' method) 30 using 'CancerSubtypes' R package ( Figure 4A).
The survival analyses revealed a significant survival difference between the three subtypes: The relapse and death risks were highest in subtype 3, whereas subtype 2 showed the best prognosis ( Figure 4B). The chemotherapy treatment effect analysis ( Figure 4C) showed that only patients from subtype 2 could significantly benefit from ADJC, whereas subtype 1 and subtype 3 patients could not. In particular, for subtype 3 patients, the ADJC conduction was a risk factor for an unpromising prognosis. We then, respectively, compared the FOLFIRI response rate ( Figure 4D) and stroma score (calculated by ESTIMATE algorithm, Figure 4E) between the different patient subtypes. The result showed that both the non-response rate and the stroma score were also highest in subtype 3 patients and lowest in subtype 2 patients. Referring to cell infiltration, both the heat map ( Figure 4F) and triangle plot ( Figure 4G) showed that it was cell cluster D, including multiple stromal cells, that exhibited the significant difference of infiltration between the three subtypes.

| The role of SLS in gastric cancer
We also analysed the role of the SLS model in the GSE62254 database to verify whether it was also correlated with relapse and ADJC treatment effect in patients with gastric cancer. The clinical characteristics of this cohort were also listed (see Table S6). The cut-off values for each lncRNA were re-calculated because of the potential existence of heterogeneity between different tumour types. As shown in Figure 5A the SLS model also was significantly associated with RFS as a continuous variable in the GSE62254 cohort (see Table S7). Finally, we compared the SLS scores of gastric cancer patients with different Asian Cancer Research Group subtypes. The results showed that the SLS scores were significantly higher in the 'EMT' and 'MSS/TP53-' subtypes compared with the 'MSI' and 'MSS/TP53+' subtypes ( Figure 5C). The following scatter plot also confirmed that SLS scores were significantly positively associated with stromal activation level in patients with gastric cancer ( Figure 5D).

| D ISCUSS I ON
The rapid development of high-throughput technology has enabled researchers to study cancer heterogeneity more deeply and comprehensively, and identify ideal molecular biomarkers that could be used to precisely predict prognosis and guide treatment. Among them, lncRNAs, which possess tissue-specific and cancer-specific expression patterns and play essential roles in tumorigenesis and cancer progression, 11  and 1 year after initial primary resection) and found that the prognostic ability of the model was stronger than that of the TNM stage system. 12 However, the current available signatures only provide prognosis-related information; none of these studies discussed the role of these signatures in treatment guidance. The clinical practicality of these signatures was therefore limited.
To construct an lncRNA-based model that could provide information on patient prognosis and therapeutic benefit, we first used correlation analysis to obtain 2654 lncRNAs that were significantly associated with the biological processes that trigger tumour metastasis and chemotherapy resistance, including angiogenesis, hypoxia, EMT and TGFβ signalling activation. [16][17][18] 42 This result is consistent with our findings. However, the functions of most lncRNAs enrolled in the signature have not been investigated. As our research has shown that the expression of these lncRNAs was correlated with stroma activation level and robustly associated with increased recurrence risk, they deserve further clinical and basic investigation.
Limitations to our study include the following: First, as this study was conducted based on publicly available datasets, some important clinicopathological features were not available for analysis and could lead to potential error or bias. Second, all colon cancer transcriptome profiling collected in the present study was produced by the GPL570 platform; the lncRNA candidates identified here may not represent the complete lncRNA populations underlying colon cancer biological behaviour, owing to the fact that the GPL570 platform only contains a small part of all possible lncRNAs. Moreover, the lncRNAs that were identified by re-annotation of probes should also be validated by further studies. Finally, as mentioned above, the biological function of some lncRNAs involved in SLS signatures still requires experimental verification.

| CON CLUS IONS
In conclusion, our study developed a novel lncRNA signature that can be used as a reliable tool for personalized prognosis prediction and for treatment decision-making in patients with colon cancer.
Further prospective clinical trials are warranted to validate our findings.

ACK N OWLED G EM ENTS
The authors thank the members of W. Liao's laboratory for advice and discussion.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests.