Identification of a 4‐mRNA metastasis‐related prognostic signature for patients with breast cancer

Abstract Metastasis‐related mRNAs have showed great promise as prognostic biomarkers in various types of cancers. Therefore, we attempted to develop a metastasis‐associated gene signature to enhance prognostic prediction of breast cancer (BC) based on gene expression profiling. We firstly screened and identified 56 differentially expressed mRNAs by analysing BC tumour tissues with and without metastasis in the discovery cohort (GSE102484, n = 683). We then found 26 of these differentially expressed genes were associated with metastasis‐free survival (MFS) in the training set (GSE20685, n = 319). A metastasis‐associated gene signature built using a LASSO Cox regression model, which consisted of four mRNAs, can classify patients into high‐ and low‐risk groups in the training cohort. Patients with high‐risk scores in the training cohort had shorter MFS (hazard ratio [HR] 3.89, 95% CI 2.53‐5.98; P < 0.001), disease‐free survival (DFS) (HR 4.69, 2.93‐7.50; P < 0.001) and overall survival (HR 4.06, 2.56‐6.45; P < 0.001) than patients with low‐risk scores. The prognostic accuracy of mRNAs signature was validated in the two independent validation cohorts (GSE21653, n = 248; GSE31448, n = 246). We then developed a nomogram based on the mRNAs signature and clinical‐related risk factors (T stage and N stage) that predicted an individual's risk of disease, which can be assessed by calibration curves. Our study demonstrated that this 4‐mRNA signature might be a reliable and useful prognostic tool for DFS evaluation and will facilitate tailored therapy for BC patients at different risk of disease.

medication, while others may be faced with recurrence or metastasis due to lack of appropriate treatment. 12,13 These limitations have prompted a search for new biomarkers for discrimination of cancer patients to improve precision cancer treatment.
In recent years, detailed information regarding prognosis evaluation for cancer patients can be effectively provided by genome-wide expression profiling detection. [14][15][16][17] Numerous studies have evaluated the prognostic roles of array-based gene expression signatures acquired from tumours. 14,18,19 Several gene signatures have also been established to distinguish the prognosis of patients beyond the BC clinicopathologic features; however, most of them are not used clinically. 17,19,20 Thus, identifying a novel and practical gene signature to predict patients' prognosis is urgently needed and of great clinical significance.
Several studies have identified that a number of mRNAs are differentially expressed in nasopharyngeal carcinoma, 21 lung cancer 22 and colorectal cancer 23 and so on, 24 which are associated with survival prognosis. As mRNAs related to survival are usually associated with the development and metastasis of some cancers, it is vital to develop a robust BC prognosis-related mRNAs signature. Therefore, to identify mRNAs that might serve as potentially accurate markers for predicting clinical outcome, we used the Gene Expression Omnibus (GEO) database to characterize the mRNAs profiling on large cohorts of BC patients, and finally identified a 4-mRNAs signature which can be validated.

| Patient cohorts
To overcome the bias in microarray platforms, this study only included patients from BC-related gene expression datasets measured by the same platform (GPL570-55999, the Affymetrix HU133 Plus 2.0 microarray), including GSE102484, GSE20685, GSE21653 and GSE31448. All these gene expression and corresponding clinical data were obtained from the GEO database (https://www.ncbi.nlm. nih.gov/geo), so the approval of ethics committee was not needed.
To evaluate the correlations of mRNAs expression with survival status for BC patients, we screened those datasets that included more than 200 patients with disease-free survival (DFS) for model development and validation in this study. In total, 1496 samples (683 from GSE102484, 319 from GSE20685, 248 from GSE21653 and 246 from GSE31448) were obtained ( Table 1). The GSE102484 cohort was used as the discovery cohort, and then the GSE20685 cohort was applied for the training cohort. In addition, we randomly consider GSE21653 and GSE31448 as independent external validation cohorts. A summary of these cohorts and procedures for data processing are provided in Supplementary Methods.

| Discovery cohort
Differential mRNAs expression analysis by array was used to identify mRNAs differentially expressed in primary tumour tissues of patients with BC that have developed distance metastases or not.
To screen mRNAs expression profiles, 683 tumour samples were obtained from BC patients, which included 101 patients with distance metastasis and 582 patients without distance metastasis ( Figure 1, Table 1).

| Model development and validation cohorts
A total of 319 BC specimens from GSE20685 were obtained for the training cohort to develop mRNAs signature model. Fifty-six candidates differentially expressed mRNAs screened from discovery set were used to construct mRNAs signature. Firstly, we analysed the prognostic role of the candidate mRNAs and found 26 mRNAs were

| Statistical analysis
All the statistical analysis was performed with R software (version

| Development of survival-associated mRNAs signature for patients with BC
Samples in discovery cohort were divided into metastasis group and non-metastasis group. Fifty-six mRNAs were found to be differentially expressed between the two groups (P < 0.05, fold change ≥1.25; Figure 2A). To identify the survival-associated mRNAs, we conducted Cox regression analysis and found 26 of these differentially expressed genes were associated with MFS in the training cohort. We further proceeded collinear analysis and found collinearity among some mRNAs, which may prejudice the accuracy of traditional Cox regression analysis ( Figure 2B). Therefore, we used LASSO Cox regression method and finally identified four mRNAs from the 26 differentially expressed mRNAs with the highest fre-  MYBPC1  NPY1R  DEPDC1B  FLJ13744  ANLN  BUB1B  TPX2  KIF14  MCM10  TREM1  MMP1  C5orf46  KRT81  C2orf54  ST8SIA6  KRT6A  KRT80  PRAME  CRISP3  HOTAIR  THBS2  GREM1  INHBA  PLAU  KCNN2 Figure 3A,C,E, which suggested that patients with lower risk scores generally had better survival than those with higher  Figure 3B). We then did the same analyses for DFS and overall survival (OS), 5-year DFS and OS was 59.3% and 68.7% for the high-risk group and 87.6% and 93.6% for the lowrisk group (HR: 4.69, 95% CI: 2.93-7.50, P < 0.001, Figure 3D; HR: 4.06, 95% CI: 2.56-6.45, P < 0.001, Figure 3F).  Figure 4A).

| Nomogram combined mRNAs signature and clinical-related variables predicts patients' DFS
In the training cohort, univariate and multivariate analysis all revealed that T stage, N stage and mRNAs signature were significantly associated with DFS (Table 2). Then based on the above analysis results, we developed a mRNAs nomogram that combined the clinical-related factors (T stage and N stage) and mRNAs signature ( Figure 5A; Table 2). The calibration plots for the 5-year DFS were predicted well in the training cohort, the GSE21653 validation cohorts and the GSE31448 validation cohort ( Figure 5B-E).

| DISCUSSION
An array-based database to identify survival-associated mRNAs for prognostic prediction is significant and urgently needed to guide tailored therapy for patients with BC. 20,21,23 In our study, we used GEO array data to screen differential mRNAs in a discovery set and  and may increase angiogenesis, which is associated with poorer prognosis. 30 The expression of CRISP3 was associated clinical outcome in prostate cancer. 31 However, several limitations are still existed in this study as well.
First of all, our study is entirely retrospective and inherent biases may influence results. In the second place, clinical-related factors, such as molecular classification and treatment related information, has not been analysed due to lack of relevant information in the training and validation sets. In addition, the sample size is too small and array data were all obtained from single platform, which may disturb the application of our constructed model. Last but not the least, although mRNAs signature and nomogram showed good predictive accuracy in the training cohort, their performance in the two external validation cohorts is still low and remains to be improved.
Therefore, more markers should be mined and incorporated into our prediction model in future.

| CONCLUSION
Taken together, we filtered specific mRNAs differentially expressed between patients with or without distant metastasis and successfully constructed a prognostic associated mRNAs signature which may aid our prognosis prediction and tailored therapy for BC. Importantly, we developed a 4-mRNA nomogram that incorporated both mRNA signature and clinical-related risk factors to predict patients' prognosis. We confirmed this signature could service as potential specificity biomarkers in the prognosis prediction for BC patients.