Correlation of immune infiltration with clinical outcomes in breast cancer patients: The 25‐gene prognostic signatures model

Abstract Purpose Breast cancer is the most common cancer in women. The aim of this study was to build a prognostic signatures model based on the immune score of the ESTIMATE algorithm to predict survival of breast cancer patients. Methods The RNA‐seq expression data and clinical characteristics of patients were derived from TCGA and GSE88770 of GEO. The ESTIMATE algorithm was used to calculate the patients' immune scores and to obtain DEGs. The LASSO Cox regression model was applied to select prognostic genes. Survival analysis and the ROC curve were used to evaluate the predictive efficacy of the prognostic signatures model. Independent prognostic factors of breast cancer were assessed using the Cox regression analyses, and a nomogram was constructed to enhance the clinical value. Results Based on the immune score, we found that the high‐score group showed better clinical outcomes than the low‐score group. Twenty‐five (25) genes of 616 DEGs were confirmed as prognostic signatures through the LASSO Cox regression. The risk score for each patient was calculated according to the prognostic signatures. Survival analysis showed that the low‐risk group had longer overall survival than the high‐risk group. We also found that the risk score was an independent prognostic factor. To improve the clinical application value, a nomogram combing the risk score according to the 25‐gene prognostic signatures and several clinicopathological prognostic factors was constructed. Conclusions This study revealed the significance of immune infiltration and constructed a 25‐gene prognostic signatures model, that has a strong prognostic value for patients with breast cancer.


| INTRODUCTION
Breast cancer is the second most common cancer in the world (1.7 million cases, 11.9%), and it is the most frequently seen cancer among women. 1 Breast cancer is a highly heterogeneous disease, with a slow growth rate, and, while it can be highly invasive, there is a good prognosis for some patients. 2 At present, the application of surgery, radiotherapy, chemotherapy, hormone therapy, and HER2-targeted therapy have greatly improved the survival of breast cancer patients. However, some patients still relapsed and have distant metastasis after classical therapies. It remains important to continue looking for novel and effective treatments for breast cancer patients.
With recent developments and applications of new findings in immunology, immunotherapy is coming to play a critical role in cancer treatment. It has become the standard-of-care treatment strategy for some malignancies, such as melanoma, lung cancer, and bladder cancer. 3 In breast cancer, previous studies have confirmed that in patients with early stage triple-negative and HER2-positive disease, high levels of lymphocytic infiltration were consistently associated with a better prognosis, and these infiltrations reflect a good host anti-tumor immune response, suggesting potential benefit of immune activation to improve prognosis. 4 There have been clinical reports of immune checkpoint blockade monotherapy for several molecular subtypes of breast cancer, and persistent responses have been observed in a significant number of women with chemotherapy-resistant diseases. 3 In addition, Atezolizumab (targeted drugs for PD-L1) in combination with chemotherapy has been recently approved for the treatment of advanced triple-negative breast cancer. 5 All these indicate that immunotherapy is expected to achieve encouraging results in the treatment of breast cancer. However, as breast cancer is a heterogeneous disease and most breast cancers exhibit limited innate immunogenicity, it is important to identify relevant biomarkers to distinguish immunotherapy responsive tumors. 3 In this study, differentially expressed genes (DEGs) were selected based on the immune score using the ESTIMATE algorithm, and a 25-gene prognostic signatures model was established and validated using The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases. We then investigated the survival value of this model, and constructed a more meaningful nomogram combining clinicopathological prognostic factors.

| Data source
The RNA-seq expression and clinical characteristic data of breast cancer tumors came from two databases and formed three cohorts. The training cohort was downloaded from TCGA (n = 1070) databases (https://portal.gdc.cancer.gov/ repos itory, Table 1). Among patients from the training cohort, we conducted multiple random sampling to establish internal validation cohorts, each containing 100 patients. The external validation cohort data set was the GSE88770 (n = 117) from GEO databases (https://www.ncbi.nlm.nih. gov/geo/). Patients with unknown overall survival (OS) were excluded from both data sets.

| ESTIMATE algorithm of breast cancer
The estimate R package was used to implement the ESTIMATE algorithm, which calculated the immune score, stromal score and ESTIMATE score. And patients were divided into two groups, high-score or low-score, using the median score of the immune score (stromal score or ESTIMATE score) as the cutoff.

| Differential gene expression analysis
The differential gene expression analysis was performed by the limma R package. 6 The raw data were normalized and transformed to log2-counts per million (log CPM). The Benjamini-Hochberg method was adopted to correct the significance p value. Finally, the fold change (FC) and the adjusted p value (False Discovery Rate, FDR) were adopted as key indexes for screening DEGs. DEGs were selected with |log2(FC)| > 1 and FDR < 0.05.

| Construction of the prognostic signatures model
The LASSO Cox regression model was built using the glmnet R package. The optimal λ value was chosen through 10 times cross-validations. Prognostic signatures were selected and the coefficient of each signature was calculated according to the LASSO Cox regression. The same formula was used to calculate the risk score for each patient. According to the risk score, patients were divided into low-risk and highrisk group with the median risk score as the cutoff. Survival analysis and the receiver operating characteristic (ROC) curve were used to test the performance of the prognostic signatures model.

| Independent prognosis factors and the nomogram
The Cox regression analyses were used to analysis clinicopathologic characteristics and find independent prognostic factor, p < 0.05 was considered statistically significant. To improve the application value, a nomogram was built with the rms R package and its performance was evaluated by C-index.

| Functional enrichment and immune cell infiltration analysis
The Gene Ontology (GO) analysis and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed through the clusterProfiler R package. 7 Biological process, cell component, molecular function and KEGG pathway analysis were performed to enrich the function associated with the DEGs. In addition, Tumor IMmune Estimation Resource (TIMER2.0, http://timer.comp-genom ics.org/) was the online tool that provided us the correlation coefficient of prognostic gene expression and immune cells. 8 Immune cells included B cells, CD8+T cells, CD4+T cells, macrophages, neutrophils, and dendritic cells. And two of the prognostic genes screened by us, JCHAIN and TMEM273, were not included in the TIMER2.0 analysis.

| Statistical analysis
Statistical analyses were conducted with the R program version 4.0.0, using the R studio version software 1.2.5042. The survival R package was used to perform the Kaplan-Meier survival analysis, the Cox regression analysis and define the C-index. The survival curve was constructed by the survminer R package. The volcano plot was formed through an R package called ggplot2. The pheatmap R package was used to draw the heatmap. The ROC curve was drawn using the survivalROC R package.

| The acquisition of DEGs based on the immune score
According to the ESTIMATE algorithm, we obtained the immune score, stromal score, and ESTIMATE score of each breast cancer patient from TCGA. To explore whether a high score was associated with a better prognosis, we performed the survival analysis of two groups that grouped around the median scores. The median immune score was 24.57032, the median stromal score was 404.0199, and the median ESTIMATE score was 465.7355. Based on the immune score, there was a distinct difference between the two groups, and the high-score group showed longer OS than the lowscore group (p = 0.038, Figure 1A). However, in terms of the T A B L E 1 Clinicopathological characteristics for breast cancer patients in TCGA (n = 1070).

Characteristics n (%)
Gender stromal score, no remarkable difference in survival was found between the two group of patients (p = 0.680, Figure 1B). Even though there was no significant difference between the two ESTIMATE score groups, the higher score contributed to a slightly better prognosis (p = 0.230, Figure 1C). To further analyze the difference between immune score groups, we compared the high-score group with the low-score group by differential gene expression analysis. Finally, we found 616 DEGs that included 65 down-regulated genes and 551 up-regulated genes ( Figure 1D).

| Construction of the 25-gene prognostic signatures model
To identify potential gene prognostic signatures in DEGs, we built the LASSO Cox regression model, and the value of λ was 0.01534 ( Figure 1E). A final prognostic signature model was consisted including 25 genes (Table 2), and the model was used to evaluate the prognosis of the training, external validation, and internal validation cohorts. The mRNA expression levels of these 25 prognostic signatures in TCGA is shown in Figure 1F. Red and blue respectively represented high and low risk score group in the risk item (divided by the median risk score).

| Survival validation of the 25gene prognostic signatures model and subgroup analysis
Based on the 25-gene prognostic signatures model, we calculated each patient's risk score and grouped them according to the median risk score. In the training cohort (TCGA), there was a significant difference between the OS of the low-risk and high-risk score group, and the low-risk group demonstrated a longer OS. Furthermore, similar to the training cohort, lower scores were associated with better prognosis in the external (GSE88770) and internal validation cohorts.
To further evaluate the survival validation of the model, we constructed ROC curves. In the training, external validation and one of the internal validation cohorts, the area under the curve (AUC) values of the 3-year cohort were 0.777, 0.843, and 0.820, the 5-year AUC cohort values were 0.741, 0.717, and 0.856, and the 10-year AUC cohort values were 0.771, 0.709, and 0.779, respectively. (Figure 2A-I, Table S1). In addition, the subgroups analysis of patients with different PAM50 subtypes in TCGA indicated that a lower risk score predicted better OS. Good predicting function was found in the Basal-like (p = 0.0038), Luminal A (p < 0.0001), Luminal B (p = 7e-04) and Normal-like (p = 0.041) breast cancer patients. Although there were no significant differences between the two groups of the HER2-enrich subtype (p = 0.070), the low-risk score group still had a better OS than the high-risk score group ( Figure 3A-E).

| A nomogram with better predictive value
To determine whether the risk stratification formed by the 25-gene prognostic signatures model is an independent prognostic determinant, and to look for other clinicopathological independent prognostic factors, we performed univariate and multivariate Cox regression analyses in the TCGA cohort. The clinicopathological factors included gender, age, primary site of tumor (T), regional lymph node (N), metastasis (M), and TNM stage (Table 3). We confirmed that age, TNM stage and the risk score were independent prognostic factors for breast cancer. To enhance the clinical value of the 25-gene prognostic signatures model, we constructed a nomogram combining the risk score and clinicopathological independent prognostic factors. The nomogram was used to predict patient OS at 3-year, 5-year, and 10-year. We found that a lower score means a longer OS ( Figure 4A). The 3-year, 5-year, and 10-year calibration curves coincided well with the median line ( Figure 4B-D), and the C-index of the nomogram was 0.82824083. This indicates that the nomogram has a strong predictive value for patient OS. What is more, compared to the 25-gene prognostic signatures model or other clinicopathological independent prognostic factors, this nomogram was a better predictor (Table 4).

| Functional enrichment analysis correlated with DEGs
To further explore the function of DEGs, we conducted GO analysis and KEGG pathway analysis. We listed the top 10 terms among biological processes, cell components, molecular functions and KEGG pathway ( Figure 5A-D). We found that biological processes of DEGs were significantly associated with an immune response, such as T cell activation, regulation of lymphocyte activation, and lymphocyte proliferation. In addition, there was also a degree of immune correlation in the cellular component, molecular function and KEGG pathway, including immunological synapse, immune receptor activity, cytokine−cytokine receptor interaction and so on.

| Association between prognostic signatures and tumor immune infiltrating cells
We cells, dendritic cells and neutrophils, especially dendritic cells and neutrophils. In contrast, some genes had limited connections to immune cells, including NKAIN1, RIMS1, ELOVL2, NPY1R, IGFALS, and NXNL2. Moreover KLRB1, BIRC3, GPR55, MATK, S100B, KLHDC7B, CHI3L1, and CXCL1 were associated with certain immune cells, such as CD4+T cells, neutrophils, and dendritic cells. But in other immune cells, such as macrophage, there was a lower correlation. (Figure 6).

| DISCUSSION
The tumor microenvironment is extremely complex, including not only epithelial cells, vascular cells, stromal cells, but also a large number of infiltrating immune cells. Infiltrating immune cells, which function in an environmentally dependent manner, have been shown to play an antitumor role in some tumors and are closely related to tumor growth, invasion, and metastasis. 9 In breast cancer, many studies have been conducted in terms of the correlation between the tumor immune cell infiltration and clinical outcome, and it has been reported that high levels of immune cell infiltration are associated with better outcomes. [10][11][12] In addition, the infiltration of lymphocytes in breast cancers is highly correlated with the sensitivity to chemotherapy. 13 Therefore, here we focused on the infiltration immune cells to build a prognostic model for patients with breast cancer. The ESTIMATE algorithm is a way to use gene expression characteristics to infer the proportion of stromal cells and immune cells in a tumor sample. Based on the immune signature (141 genes) and stromal signature (141 genes), the immune score and stromal score were obtained for each sample to predict the infiltration of stromal and immune cells in tumor tissues. And the ESTIMATE score, that combined immune and stromal scores, was calculated to infer tumor purity in tumor tissue. The predictive power of the ESTIMATE algorithm has been demonstrated in a variety of tumors. 14 Our study focused on the immune score of the ESTIMATE algorithm. The TCGA cases were grouped according to their immune scores, and the results showed that patients with high scores had better OS compared to those with the low scores. In addition, the related DEGs were obtained. The 25gene prognostic signatures model was constructed with the LASSO Cox regression model, and the risk score calculated by this prognostic model was negatively correlated with survival. Furthermore, we found that the risk score could be used as an independent prognostic factor for breast cancer patients.
Some of the prognostic signatures that showed a remarkable correlation coefficient with immune cells have been shown in prior studies to play an important role not only in the occurrence and development of tumors, but also in the regulation of the immune system. BCL2A1 is an apoptosis modulator that is overexpressed as a nuclear factor kappa B target gene in many cancer cells and contributes to tumor progression. Combined with the results of our analysis (Figure 6), BCL2A1 showed a chemotactic role for T lymphocytes, dendritic cells, and neutrophils. In fact, BCL2A1 plays a crucial role in lymphocyte development, mast cell mediated allergic reactions, and lymphocyte and macrophage activation, especially the up-regulation of BCL2A1 to regulate the CD40 survival signaling pathway of B lymphocytes. 22,23 BIRC3 is a multifunctional protein that regulates immunity, apoptosis, metastasis, and other functions. It not only has an obvious chemotaxis effect on immune cells, such as B lymphocytes, T lymphocytes, and neutrophils, but also participates in the TNFR2/ BIRC3-TRAF1 signaling pathway, that is a novel NK cell immune checkpoint for cancer. 24 Il-10 is a cytokine that induces an immune response and an anti-inflammatory microenvironment that promotes tumor growth by helping tumor cells evade immune surveillance. It can be used as an indicator to evaluate the efficacy of neoadjuvant chemotherapy in patients with HER2-enrich breast cancer. 25 The CHI3L1 gene product is a glycoprotein that can be expressed and secreted by both tumor cells and immune cells. CHI3L1 can promote tumor progression by upregulation of pro-inflammatory mediators, like CCL2, CXCL2, and MMP-9. Tumor-recruited M2 macrophages can secret CHI3L1 to promote the metastasis of gastric cancer and breast cancer, and elevated serum levels of CHI3L1 glycoprotein are associated with poor prognosis in patients with metastatic breast cancer. 26,27 CXCL1 is significantly correlated with T lymphocyte infiltration, during the progression of breast cancer CXCL1 is up-regulated by Th17 cells and can promote the growth and development of breast cancer. 28 GBP2 contributes to better clinical outcomes in rapidly proliferating breast tumors and may serve as a marker for an effective T cell response. 29 TOX is closely related to depletion of CD8(+) T cells. It is expressed in most circulating effector cell memory CD8(+) T cell subsets, and the knockout of TOX in human tumor-infiltrating CD8(+) T

F I G U R E 4
The nomogram combining the risk score and clinicopathological prognostic factors from breast cancer patients from TCGA. A, A nomogram to predict 3, 5, and 10 years survival for breast cancer patients. Calibration curves of the nomogram for 3 (B), 5 (C), and 10 (D) years survival. cells can lead to down-regulation of PD-1, TIM-3, TIGIT, and CTLA-4, suggesting that the TOX can promote T cell failure by up-regulating immune checkpoint proteins in tumors. In addition, TOX expression in tumor-infiltrating T cells can be used to stratify patients during antitumor therapy, including anti-PD-1 immunotherapy. 30,31 Other genes in our model are also associated with physiological and pathological processes in tumors. According to our study, there are unique correlations between various genetic prognostic signatures and types and numbers of immune cells. This may indicate the unique functions of these genes in the tumor immune microenvironment, suggesting that immune infiltration may influence the occurrence and development of tumors through specific pathways. Consideration of these results may provide new directions for immunotherapy of breast cancer patients.

Model C-index
Of many prognostic indicators of breast cancer have been reported, the Oncotype DX Recurrence Score is the most widely used test to clinically assess the risk of recurrence in patients receiving endocrine therapy. 32 Similar tests include Prosigna, 33 Breast Cancer Index, 34 and EndoPredict. 35 In addition, MammaPrint has been shown to improve the prediction of clinical outcomes in patients with early breast cancer. 36 These indicators are more limited to a certain subtype or stage of breast cancer and are used to screen patients who can be exempted from chemotherapy. However, the data of our prediction model is based on the total population of breast cancer patients, from which the high-risk population is screened. The selection of prognostic signatures is closely related to immune infiltration, which is helpful to find the target of immunotherapy.
We have identified the following limitations of this study. First of all, our study was mainly based on the overall data of TCGA for breast cancer, although it had an optimistic predictive effect in most molecular subtypes of breast cancer, its predictive effect in breast cancer subtypes needed to be confirmed by increasing the sample size of each subtype. Second, the study was performed as a retrospective analysis, and prospective cohort studies are required to verify our findings. Furthermore, our study needs to be validated by clinical specimens and we will further explore it in our future work. Finally, in the process of clinical application, choosing an appropriate cutoff of risk score is a problem that needs to be further discussed. In summary, our study constructed a 25-gene prognostic signatures model that associated with immune infiltration. And it can be used as an independent prognostic factor for predicting clinical outcomes in patients with breast cancer.

INFORMED CONSENT
This study was approved by Ethics Committee of Fujian Medical University Union Hospital and in accordance with the Declaration Helsinki and its later amendments or comparable ethical standards. All data in this study were obtained from published researches which contained informed consent.