Development and validation of a novel 15‐CpG‐based signature for predicting prognosis in triple‐negative breast cancer

Abstract DNA methylation is an important biological regulatory mechanism that changes gene expression without altering the DNA sequence. Increasing studies have revealed that DNA methylation data play a vital role in the field of oncology. However, the methylation site signature in triple‐negative breast cancer (TNBC) remains unknown. In our research, we analysed 158 TNBC samples and 98 noncancerous samples from The Cancer Genome Atlas (TCGA) in three phases. In the discovery phase, 86 CpGs were identified by univariate Cox proportional hazards regression (CPHR) analyses to be significantly correlated with overall survival (P < 0.01). In the training phase, these candidate CpGs were further narrowed down to a 15‐CpG‐based signature by conducting least absolute shrinkage and selector operator (LASSO) Cox regression in the training set. In the validation phase, the 15‐CpG‐based signature was verified using two different internal sets and one external validation set. Furthermore, a nomogram comprising the CpG‐based signature and TNM stage was generated to predict the 1‐, 3‐ and 5‐year overall survival in the primary set, and it showed excellent performance in the three validation sets (concordance indexes: 0.924, 0.974 and 0.637). This study showed that our nomogram has a precise predictive effect on the prognosis of TNBC and can potentially be implemented for clinical treatment and diagnosis.

DNA methylation is an important biological regulatory mechanism that changes gene expression without altering the DNA sequence. 7 Increasing studies have revealed that DNA methylation data play a vital role in the field of oncology. 8,9 In addition, the emergence of high-throughput technology makes it possible to identify reliable markers. Several studies have reported that DNA methylation may play a key role in predicting prognosis in various cancers. [10][11][12] For example, a DNA methylation signature was identified by Sandoval et al that improved prognostic accuracy beyond standard staging in non-small-cell lung cancer, 12 and Lasseigne et al proposed novel methylation-related biomarkers. In their study, DNA methylation profiles were used to distinguish between tumours and benign adjacent tissues of renal cell carcinoma. 11 In BC research, Monika Lesicka et al showed that some circadian genes with abnormal methylation patterns may be novel indicators and may play an important role in BC aetiology. 13 Bin Xiao et al identified several major methylation sites for predicting the prognosis of BC in the luminal subtype. 14 However, to the best of our knowledge, few studies have investigated the prognostic value of methylation sites in TNBC. Therefore, this study was designed to investigate overall survival (OS)-related CpGs in TNBC. First, we selected 86 candidate CpGs in 120 training samples; we then validated those CpGs in 38 testing samples and 37 external validation samples. Finally, a nomogram incorporating the prognostic risk model and clinicopathological features was developed. Overall, our nomogram has a precise predictive effect on the prognosis of TNBC and may potentially be implemented for clinical treatment and diagnosis.

| Study design
In our study, the inclusion criteria of samples were as follows: (a) both methylation level and survival data were available; (b) OS time was more than 1 month; and (c) histologically confirmed invasive TNBC. A total of 158 TNBC samples with complete survival information were obtained. Following the methods of random sampling at a ratio of 70:30, the 158 TNBC samples were separated into the training set (n = 120) and test set (n = 38). To avoid the reduction in statistical test efficiency and the bias caused by the direct exclusion of missing values, we used multivariate interpolation to estimate the missing values 16 (Appendix S1).
Three phases were used to investigate the OS-related methylation sites in TNBC patients. In the discovery phase, LIMMA package has been used to carry out normalization and the Mann-Whitney U-test was used to compare methylation differences between the TNBC and normal samples, sites with DNA methylation levels with false discovery rate (FDR) <0.05 and |log 2 fold change (FC)| ≥ 0 were defined as differentially methylated sites (DMSs). Then, univariate Cox proportional hazards regression (CPHR) analysis was performed to select significant DMSs correlated with OS. Finally, the 86 DMSs most related to OS with P < 0.01 were selected for least absolute shrinkage and selector operator (LASSO) Cox regression in the training set to narrow down the candidate CpGs using the R package glmnet.
The risk score was calculated as follows: 15 methylation sites were found to have nonzero coefficients in the model, and the optimal cut-off value of −11.6 was derived from the time-dependent receiver operating characteristic (ROC) curve using the Youden index. The samples with a risk score greater than −11.6 were divided into the high-risk group, and the remaining samples were divided into the low-risk group. Finally, the 15-CpG-based clinicopathological nomogram was built according to the results of the multivariate Cox analyses.
In the validation phase, we validated our nomogram in three different cohorts. The area under the curve (AUC) based on the time-dependent ROC analysis 17 was calculated to assess the risk scoring system. We also performed Kaplan-Meier survival curve analysis to identify its prognostic value. In addition, stratified analysis was carried out to identify whether the CpG-based signature was correlated with OS regardless of different clinical features. The calibration curve was plotted by the rms package of R software to estimate the consistency between the prediction outcomes of the model and the actual outcomes. Harrell's C-index was calculated to measure the goodness of fit of the CpG-based signature nomogram.

| Functional enrichment analysis of the differentially expressed genes between the two groups
According to the risk scores based on the 15-CpG signature, 158 TNBC samples were divided into high-and low-risk groups. The limma and DESeq2 R packages were used for differentially expressed gene

| Statistical analysis
All statistical analyses were conducted using R version 3.6.1.
The Mann-Whitney U-test and the Pearson chi-square test were performed to compare the associations of continuous and categorical variables, respectively, between the training set and testing set. Univariate and multivariate CPHR analyses were used to identify the predominant prognostic factors of OS (P < 0.05).
Kaplan-Meier survival curves were compared using the log-rank test. The limma R package was used to nominalize the data, and the DESeq2 R package was used to identify differentially expressed genes. The ggplot2 R package was used to plot the volcano plot and heat map. P < 0.05 (two-sided) was considered statistically significant.

| Patient characteristics
Our research flow chart is shown in Figure . Six samples were excluded because the OS time was <30 days. Therefore, a total of 158 TNBC samples from 888 BC samples from TCGA were included. The primary cohort was divided into a training set and a testing set at a ratio of 70:30 with the method of random sampling. The detailed baseline clinical features of the training and testing sets are shown in Table S1. There were no statistically significant differences between the two independent sets, as shown in Table S1 (all P > 0.05). The detailed baseline clinical features of the TNBC and normal samples are also shown in Table S2.

| Candidate OS-related methylation sites were found in the training set
The DNA methylation levels were compared between 98 adjacent normal breast tissues and 158 TNBC samples using the limma R/ Bioconductor package. A total of 225153 DMSs (FDR < 0.05 and |log 2 FC| > 0) were identified (Appendix S2). Next, these DMSs were subjected to univariate CPHR analysis in the 158 TNBC samples.
Then, we observed 86 methylation sites that were significantly related to OS (P < 0.01) (Appendix S3), and these candidate methylation sites were subsequently selected for LASSO Cox regression in the training set. Finally, 15 methylation sites were found to have nonzero coefficients in the model (Figure 2A,B).

| Establishment of a 15-CpG-based prognostic model
A risk score was generated to better identify the prediction efficiency of the 15-CpG-based signature ( Figure 2B and Appendix S4). The samples with a risk score greater than −11.6 were divided into the high-risk group, and the remaining samples were divided into the low-risk group.
The features and coefficients of these methylation sites are shown in Table S3.

| Assessment of the 15-CpG signature in clinical characteristic subgroups
The results of univariate and multivariate CPHR analyses are shown in Table S4, and the 15-CpG signature and AJCC stage were

| Building a predictive nomogram
To identify the best prognostic nomogram, three models were built to compare their predictive accuracies (Table S5). As a result, model 1 (including risk features and AJCC stage) had a significantly better predictive performance than the other two models (C-index: 0.918). To provide a clinically applicable method that could predict a patient's OS probabilities, these independently associated risk features were used to build a risk estimation nomogram ( Figure 6A

| Gene expression differences between the high-and low-risk score groups based on the 15-CpG signature
According to their risk scores, the 158 TNBC samples were divided into high-and low-risk groups based on the 15-CpG signature (61 samples were in high-risk group and 97 samples were in low-risk group). The differentially expressed genes were selected by the limma R package. Hierarchical clustering analysis was used to show the expression levels of the genes most related to the risk scores as heatmaps ( Figure 7A). A total of 191 genes that displayed significant differential expression (FDR < 0.05 and |log 2 FC| ≥ 1) in the different groups were found ( Figure 7B). The results showed that 72 genes were positively correlated with the risk scores and 119 genes were negatively correlated with the risk scores (Appendix S8). To estimate the potential function of these genes, GO and KEGG pathway analyses were conducted ( Figure 7C,D). The overall results of these analyses indicated that these genes may be related to material metabolism and transport.
There are limitations in our study. First, it has been reported that the prevalence of TNBC is different in different races. 33 However, our main research data were downloaded from the TCGA database, so most of the patients were white women. Whether our predictive model can be applied to non-white female patients needs further study. Second, some methylation sites might be difficult to use as clinical diagnoses because they may not be easy to detect in serum.
Third, although this study was validated using GEO data, further studies are needed to validate our research.
In summary, we built a nomogram including the 15-CpG signature and AJCC stage to predict prognosis in TNBC patients. The performance of the nomogram was verified in different validation sets. Therefore, our nomogram may potentially be implemented to predict the prognosis of patients with TNBC.

CO N FLI C T O F I NTE R E S T
On behalf of all authors, the corresponding author states that there is no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data sets analysed for this study are included in the manuscript and the supplementary files.