A prognostic 10‐lncRNA expression signature for predicting the risk of tumour recurrence in breast cancer patients

Abstract Breast cancer is one of the most frequently diagnosed malignancies and a leading cause of cancer death among females. Multiple molecular alterations are observed in breast cancer. LncRNA transcripts have been shown to play important roles in tumorigenesis. In this study, we aimed to identify an lncRNA expression signature that can predict breast cancer patient survival. We developed a 10‐lncRNA signature‐based risk score, which was used to separate patients into high‐risk and low‐risk groups. Patients in the low‐risk group had significantly better survival than those in the high‐risk group. Receiver operating characteristic analysis indicated that this signature exhibited excellent diagnostic efficiency for 1‐, 3‐ and 5‐year disease‐relapse events. Moreover, multivariate Cox regression analysis demonstrated that this 10‐lncRNA signature was an independent risk factor after adjusting for several clinical factors such as age, tumour size and lymph node status. The prognostic value of the risk score was confirmed in the validation sets. In addition, a nomogram was established, and calibration plots indicated good performance and clinical utility of the nomogram. In conclusion, our results demonstrated that this 10‐lncRNA signature effectively grouped patients at low and high risk of disease recurrence.

suffer from locoregional or distant tumour recurrence months or years later. 2,3 Breast cancer is a heterogeneous disease, and it is widely acknowledged that heredity plays an important role in the initiation and progression of breast cancer. Multiple molecular alterations are observed in breast cancer. It was reported that 5%-10% of breast cancer cases resulted from hereditary and genetic factors, such as inherited mutations and family history. 1 BRCA mutations occur in 20% of triple-negative breast cancer patients, whereas in the general population BRCA mutations are less common.
BRCA1 and BRCA2 mutations are currently tested to assess the risk of inherited breast cancer. 4 To predict recurrence and mortality of breast cancer, previous studies stratified patients into high- and low-risk groups based on their histopathological features, including tumour size, lymph node status and grade. 5 However, because of molecular differences, clinical outcomes differ widely even among patients with histologically similar tumours. 6 During the past decade, molecular studies demonstrated that there were at least four molecular subtypes of breast cancer: luminal, basal, human epidermal growth factor receptor 2 (HER2)-enriched and normal-like. These subtypes exhibit different histopathological features and treatment sensitivities. 7 Patients with luminal breast cancer often have a better prognosis, whereas those with HER2-enriched or basal-like types have a poorer prognosis. For HER2-positive breast cancers, the monoclonal antibody trastuzumab and the dual tyrosine kinase inhibitor lapatinib have been approved. [8][9][10][11] Because of the heterogeneity of breast cancer, multigene prognostic signatures could provide further prognostic information, and several molecular prognostic profiles have been validated for clinical use. 12 The 21-gene score (Oncotype DX) calculates a recurrence score and divides breast tumours into low-, intermediate- and high-risk groups to estimate the likelihood of distant recurrence in tamoxifen-treated patients with oestrogen receptor-positive breast cancer. [13][14][15] The Amsterdam 70-gene signature accurately grouped patients into low or high risk to predict distant metastases and deaths. 16,17 Detection of these biomarkers, alone or in combination, assists early diagnosis, selection of therapeutic strategies and prediction of prognosis after treatment.
Analysis of mammalian transcriptomes demonstrated that more than 50% of transcripts have no protein-coding potential. Long non-coding RNAs (lncRNAs) are the subset of these non-coding transcripts longer than 200 nucleotides. 18 Accumulating evidence indicates that lncRNAs are involved in cancer progression. In breast cancer, several lncRNAs have been associated with prognosis, indicating their potential role in predicting clinical outcome.
In the present study, we constructed a multi-lncRNA-based signature and developed a nomogram to predict the relapse-free survival (RFS) of patients with breast cancer. Our findings suggested that this multi-lncRNA-based signature could be used as an effective prognostic predictor for patients with breast cancer.

| Data processing and differentially expressed lncRNAs screening
The GSE21653 data set was downloaded from the GEO database

| Construction of the lncRNA-based prognostic signature
After screening out the DELs, we carried out univariate Cox regression analysis to identify prognostic lncRNAs. A P value <.05 was considered significant. LASSO-penalized Cox regression was then performed to narrow down the lncRNAs for prediction of RFS. 19 The LASSO Cox regression model was fitted using the 'glmnet' package. LASSO shrinks all regression coefficients towards zero and sets the coefficients of many irrelevant features exactly to zero, depending on the regularization weight λ. The optimal λ was chosen as the value giving the minimum error in 10-fold cross-validation. Finally, a multivariate Cox regression analysis was conducted to assess the contribution of each lncRNA as an independent prognostic factor for patient survival. A stepwise method was employed to select the best model, and a risk score was calculated for each patient using the coefficients from the penalized Cox model fitted in the training set. The optimal cut-off of the risk score was obtained using the 'survminer' package in R. All patients were classified into either the high-risk or the low-risk group based on this optimal cut-off.
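The scoring and grouping step described above can be sketched as follows. This is a minimal illustration, assuming the risk score is a coefficient-weighted sum of lncRNA expression values and that scores above the cut-off are labelled high risk. The lncRNA names, coefficients, expression values and cut-off are hypothetical placeholders, not the fitted values from this study.

```python
from typing import Dict

def risk_score(expression: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Coefficient-weighted sum of expression values over the signature lncRNAs."""
    return sum(coef * expression[name] for name, coef in coefficients.items())

def risk_group(score: float, cutoff: float) -> str:
    """Assumption: scores strictly above the cut-off are labelled high risk."""
    return "high" if score > cutoff else "low"

# Hypothetical coefficients and cut-off, for illustration only.
COEFS = {"lncRNA_A": 0.8, "lncRNA_B": -1.2, "lncRNA_C": 0.5}
CUTOFF = -5.0

# Hypothetical expression profiles for two patients.
patients = {
    "P1": {"lncRNA_A": 2.0, "lncRNA_B": 8.0, "lncRNA_C": 1.0},
    "P2": {"lncRNA_A": 6.0, "lncRNA_B": 3.0, "lncRNA_C": 4.0},
}

groups = {
    pid: risk_group(risk_score(expr, COEFS), CUTOFF)
    for pid, expr in patients.items()
}
print(groups)
```

In practice the coefficients would come from the fitted Cox model and the cut-off from the training set, and both would be applied unchanged to any validation cohort.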

| Construction of the nomogram
A nomogram was constructed using the 'rms' R package. Calibration plots were generated to assess the prognostic accuracy of the nomogram. The predicted and observed outcomes of the nomogram were presented in the calibration curve, in which the 45° line represents perfect prediction.

| External data validation
To further validate the predictive value of the signature, we analysed the data sets GSE19615 and GSE20685, with a total of 115 and 327 cases, respectively. These two data sets were based on platform GPL570 ([HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array).

| Statistical analysis
To investigate the prognostic accuracy of the multi-lncRNA-based classifier, time-dependent receiver operating characteristic (ROC) analysis was performed using the 'survivalROC' R package. Relapse-free survival was analysed using the Kaplan-Meier method, and the log-rank test was performed to assess the statistical significance of the differences between groups. Cox regression models were used for multivariable survival analysis. Hazard ratios (HR) with their respective 95% confidence intervals were obtained. A P value <.05 was considered statistically significant, and all tests were two-sided. All statistical tests were performed with R software (Version 3.5.0).
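For readers unfamiliar with the Kaplan-Meier method mentioned above, the product-limit estimator can be sketched in a few lines. This is an illustrative standard-library implementation, not the R code used in the study; the function name and the follow-up data are hypothetical.

```python
from typing import List, Tuple

def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if relapse was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_total = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_total += 1
            i += 1
        if n_events > 0:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Both relapsed and censored subjects leave the risk set after t.
        at_risk -= n_total
    return curve

# Hypothetical follow-up times (years) and relapse indicators.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Computing one such curve per risk group and comparing them with a log-rank test corresponds to the group comparison described in the text.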

| Gene set enrichment analysis
A total of 227 breast cancer samples in GSE21653 were divided into two groups (high risk vs low risk) according to the optimal cut-off of the risk score. To identify significantly altered Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, we performed gene set enrichment analysis (GSEA) between the high-risk and low-risk groups using the Java GSEA implementation. The annotated gene set c2.cp.kegg.v6.2.symbols.gmt (Version 6.2 of the Molecular Signatures Database) was chosen as the reference gene set. FDR <0.05 was chosen as the cut-off criterion.

| Analysis of DELs
A flow chart of the analysis procedure is shown in Figure 1. In the present study, 71 disease-relapse samples and 156 disease-relapse-free samples in the GSE21653 data set were analysed. Based on the cut-off criteria of P value <.05 and |log2 fold-change (FC)| > 2, a total of 30 DELs were identified, including nine up-regulated and 21 down-regulated DELs. Univariate Cox regression analysis was performed to identify prognostic lncRNAs.
The patients were stratified into high-expression and low-expression groups according to the optimal cut-off of each lncRNA. The 19 lncRNAs significantly associated with RFS were considered prognostic lncRNAs for further analysis.
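The DEL screening criteria stated above (P value <.05 and |log2 FC| > 2) amount to a simple filter over per-lncRNA test results. The sketch below assumes that a positive log2 FC means higher expression in relapse samples, which the text does not state. The record type, function name and example values are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple

class DiffResult(NamedTuple):
    name: str
    log2_fc: float   # assumed direction: relapse vs relapse-free
    p_value: float

def screen_dels(results: Iterable[DiffResult],
                p_cutoff: float = 0.05,
                lfc_cutoff: float = 2.0) -> Dict[str, List[str]]:
    """Keep lncRNAs with P < p_cutoff and |log2 FC| > lfc_cutoff, split by direction."""
    up: List[str] = []
    down: List[str] = []
    for r in results:
        if r.p_value < p_cutoff and abs(r.log2_fc) > lfc_cutoff:
            (up if r.log2_fc > 0 else down).append(r.name)
    return {"up": up, "down": down}

# Hypothetical differential-expression results.
results = [
    DiffResult("lncRNA_A", 2.6, 0.001),    # passes, up-regulated
    DiffResult("lncRNA_B", -3.1, 0.020),   # passes, down-regulated
    DiffResult("lncRNA_C", 2.4, 0.300),    # fails the P value criterion
    DiffResult("lncRNA_D", -1.5, 0.001),   # fails the fold-change criterion
]
dels = screen_dels(results)
print(dels)
```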

| Patient characteristics
The clinicopathologic characteristics of patients in the training set are shown in Table S1. The median follow-up in the training set was 5.04 years (low-risk group) and 3.02 years (high-risk group).
In the validation set GSE19615, median follow-up was 5.9 years (low-risk group) and 4.3 years (high-risk group). In the validation set GSE20685, median follow-up was 8.1 years (low-risk group)

| Identification of a multi-lncRNA-based signature
After primary filtering by univariate Cox regression, which identified 19 lncRNAs significantly associated with RFS, a LASSO-penalized Cox analysis with 10-fold cross-validation was performed to narrow down the lncRNAs for prediction of RFS. As a result, 17 lncRNAs were identified. Subsequently, a stepwise multivariate Cox regression analysis was conducted, and 10 lncRNAs were finally identified as prognostic lncRNAs to build a predictive model. (Table S2). When the patients were stratified by clinicopathological risk factors, the 10-lncRNA signature remained a statistically significant prognostic model, with patients in the high-risk group having a poorer prognosis (Figure 4).

| Validation of the signature
To further assess the predictive value of this 10-lncRNA signature, two external validation sets (GSE19615 and GSE20685) were used to validate our results. According to the 10-lncRNA-based signature identified above, patients with breast cancer in these two validation sets were divided into high- and low-risk groups (based on the threshold of −6.63). Significantly higher survival rates were observed in the low-risk group than in the high-risk group (Figure 5), which was consistent with the results from the training set. ROC curves indicated good prognostic performance in both GSE19615 and GSE20685. In GSE19615, the AUC at 3 years was the same as that at 5 years, and no patients relapsed during the intervening 2 years. Multivariate Cox proportional hazards regression analysis also demonstrated that the 10-lncRNA signature was an independent risk factor (Table S2).

| Nomogram development
To predict the recurrence probability of patients with breast cancer using a quantitative method, we constructed a nomogram that in-

| Gene set enrichment analysis
To identify significant changes in biological pathways between the high- and low-risk groups, GSEA was performed. Based on the cut-off criterion of FDR <0.05, three significantly altered pathways were selected: the cell cycle pathway, the oxidative phosphorylation pathway and the JAK/STAT signalling pathway (Figure 7).
HAGLR, also known as HOXD-AS1, has been implicated in the occurrence and progression of various types of human tumours, including bladder cancer, hepatocellular carcinoma, prostate cancer, gastric cancer, neuroblastoma and lung cancer. [25][26][27][28][29][30][31] In prostate cancer, HOXD-AS1 recruited WDR5 to mediate histone H3 lysine 4 tri-methylation, thus promoting cell proliferation, chemo-resistance and castration resistance. 28 In ovarian cancer, HOXD-AS1 was reported to
Several predictors, such as radiotherapy and Ki-67 index, were not analysed. In addition, the biological functions of the 10 lncRNAs in breast cancer progression remain to be elucidated. Our study only included data sets based on the GPL570 platform, which does not represent all possible lncRNAs. The underlying mechanisms of the lncRNAs in our signature remain largely unclear. Further in vivo and in vitro studies are required to confirm the exact molecular mechanisms of these lncRNAs.

| DISCUSSION
In conclusion, our results demonstrated that the 10-lncRNA signature effectively grouped patients at low and high risk of disease relapse. It may therefore be a useful predictive tool with good prospects for clinical application in patients with node-positive breast cancer.

ACKNOWLEDGEMENT
We thank Zishu Zhan for assistance in revising the statistical method.

CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS
JT, ML, YG and GW reviewed relevant literature and drafted the manuscript. DZ, DK, XL, JR and QC conducted all statistical analyses.
All authors read and approved the final manuscript.

ETHICAL APPROVAL
The research was carried out according to the World Medical Association Declaration of Helsinki and was approved by the Ethics Committee at Zhongnan Hospital of Wuhan University. Patient data used in this manuscript were extracted from the GEO database. Informed consent was not applicable.

DATA ACCESSIBILITY
Data sharing is not applicable to this article as no new data were created or analyzed in this study.