Prognostic biomarker identification and tumor classification in breast cancer patients by methylation and transcriptome analysis

Breast cancer is one of the most common and heterogeneous malignancies. Although the prognosis of breast cancer has improved with the development of early screening, the mechanisms underlying tumorigenesis and progression remain incompletely understood. DNA methylation has been implicated in tumorigenesis and tumor development and, in the present study. we screened methylation‐driven genes and explored their prognostic values in breast cancer. RNA‐sequencing (RNA‐Seq) transcriptome data and DNA methylation data of the TCGA‐BRCA dataset were obtained from The Cancer Genome Atlas. Differentially expressed genes and differentially methylated genes were identified separately. The intersected 783 samples with both RNA‐Seq data and DNA methylation data were selected for further analysis. Fifty‐six methylation‐driven genes were identified using the MethylMix r package and 10 prognosis methylation‐driven genes (CDO1, CELF2, ITPAIPL1, KCNH8, PTK6, RAB25, RIC3, USP44, ZSCAN1 and ZSCAN23) were further screened by combined methylation and gene expression analysis. Based on the methylation data of the screened 10 methylation‐driven genes, six subgroups were identified with the ConsensusClusterPlus r package. The protein levels of the 10 prognostic methylation‐driven genes were detected by immunohistochemical experiments. Moreover, based on the RNA‐Seq data, a signature calculating the risk score of each patient was developed with stepwise regression. The risk score and other clinical features (age and stage) were confirmed to be independent prognostic factors by univariate and multivariate Cox regression analyses. Finally, a prognostic nomogram incorporating all the significant factors was integrated to predict the 3‐, 5‐ and 7‐year overall survival. Taken together, the methylation‐driven genes identified here may be potential biomarkers of breast cancer.

Breast cancer is one of the most common and heterogeneous malignancies. Although the prognosis of breast cancer has improved with the development of early screening, the mechanisms underlying tumorigenesis and progression remain incompletely understood. DNA methylation has been implicated in tumorigenesis and tumor development and, in the present study. we screened methylation-driven genes and explored their prognostic values in breast cancer. RNA-sequencing (RNA-Seq) transcriptome data and DNA methylation data of the TCGA-BRCA dataset were obtained from The Cancer Genome Atlas. Differentially expressed genes and differentially methylated genes were identified separately. The intersected 783 samples with both RNA-Seq data and DNA methylation data were selected for further analysis. Fifty-six methylation-driven genes were identified using the MethylMix R package and 10 prognosis methylation-driven genes (CDO1, CELF2, ITPAIPL1, KCNH8, PTK6, RAB25, RIC3, USP44, ZSCAN1 and ZSCAN23) were further screened by combined methylation and gene expression analysis. Based on the methylation data of the screened 10 methylation-driven genes, six subgroups were identified with the ConsensusClusterPlus R package. The protein levels of the 10 prognostic methylation-driven genes were detected by immunohistochemical experiments. Moreover, based on the RNA-Seq data, a signature calculating the risk score of each patient was developed with stepwise regression. The risk score and other clinical features (age and stage) were confirmed to be independent prognostic factors by univariate and multivariate Cox regression analyses. Finally, a prognostic nomogram incorporating all the significant factors was integrated to predict the 3-, 5-and 7-year overall survival. Taken together, the methylation-driven genes identified here may be potential biomarkers of breast cancer.
Breast cancer (BC) is one of the most common and heterogeneous malignancy and has become the main cause of cancer-associated death among females in the world [1,2]. Clinically, the subtypes of BC are commonly considered to include ER + , PR + , HER2 + and triple negative breast cancer (TNBC) [3]. Although the therapeutic methods and prognosis of BC have been significantly improved with the development of early screening, molecular genetics and targeted therapies, the mechanisms of the tumorigenesis and progression still remain relatively unknown.
Genomic instability has a very important role in the tumorigenesis and development of carcinogenesis [4]. DNA methylation is one type of epigenetic modification associated with gene expression and genomic stability [5]. The alterations in DNA methylation comprise early events in carcinogenesis, which is of great clinical interest for potential biomarkers with respect to diagnosis, prognosis, therapeutic classification and follow-up after treatment [6]. For example, methylation of the promoter for O 6 -methylguanine-DNA methyltransferase has been well studied and included in the NCCN clinical practice guidelines to determine the therapeutic method in glioblastoma [7]. The clinical value of plasma septin9 for the detection of asymptomatic colorectal cancer was revealed and promoted an early diagnosis [8]. The methylation level of serum biomarker EFC#93 was identified as a new way of diagnosing and managing BC [9]. To date, our understanding of methylation-driven genes in BC is still lacking.
MethylMix is a bioinformatic tool applied to screen hyper and hypo methylated genes [10]. According to a beta mixture model, it screens methylation status and compares this with the normal DNA methylation state [11]. The negative correction between DNA methylation and RNA expression is also considered and calculated to improve the accuracy. ConsensusClusterPlus also comprises a bioinformatic tool widely used in studies of tumor classification [12].
In the present study, the MethylMix R package was used to identify the methylation-driven genes. Then, combined methylation and gene expression analyses were performed to screen prognosis biomarkers. Consensus clustering was applied with the ConsensusClusterPlus R package to identify subgroups. Finally, the protein levels of the prognosis methylation-driven genes were detected by immunohistochemical experiments. In total, 56 methylationdriven genes were identified, of which 10 survivalrelated genes were further explored in the molecular tumor classification. Moreover, based on the RNAsequencing (RNA-Seq) data, a signature calculating the risk score of each patient was developed with stepwise regression. The risk score and other clinical features (Age and Stage) were confirmed to be independent prognostic factors by univariate and multivariate Cox regression analyses. Finally, a prognostic nomogram incorporating all the significant factors was integrated to predict the 3-, 5-and 7-year overall survival. Taken together, the screened methylation-driven genes could be potential biomarkers of BC.

Materials and methods
Data resources and analysis RNA-Seq data (fragments per kilobase of transcript per million mapped reads values) of the TCGA-BRAC dataset and corresponding clinical information were obtained from The Cancer Genome Atlas (cancergenome.nih.gov), involving 113 normal and 1109 tumor samples. DNA methylation data (beta value) ranging from 0 to 1 (unmethylated to totally methylated) of the TCGA-BRCA dataset were also downloaded from The Cancer Genome Atlas, including 96 normal and 796 tumor samples. First, differentially expressed genes were screened on the basis of a Wilcox test in R, version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria) with a false discovery ratio (FDR) < 0.05 and absolute log2 fold change > 1. DNA methylation data were merged using Perl, version 5.32.0 (https:// www.perl.org/get.html) and differentially methylated genes were identified using a Wilcox test with FDR < 0.05 and absolute log 2 fold change > 0.5. Methylation-driven genes are those genes for which the DNA methylation levels are negatively correlated with the mRNA expression level after linear regression analysis. Then, the intersected 783 samples including both RNA expression data and DNA methylation data were selected. The MethylMix R package was used to identify the methylation-driven genes with Pearson correlation between the DNA methylation level and RNA expression < À0.3 and P < 0.05. The methylation mixture models were plotted via the MethylMix_PlotModel function within the MethylMix R package. Finally, a total of 56 methylation-driven genes were obtained. The heatmaps of RNA expression and the methylation level of 56 methylation-driven genes were plotted using the R pheatmap package in R, version 3.6.1.

Functional enrichment analysis
To reveal the function of the methylation-driven genes, Gene Ontology (GO) enrichment analysis was performed with the clusterProfiler R package and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was applied via ConsensusPathDB (http://cpdb.molge n.mpg.de). P < 0.05 was used to distinguish significantly enriched terms.

Combined gene expression and methylation survival analysis
Gene expression and methylation survival analyses were combined to identify potential biomarkers to predict the prognosis. In total, 754 samples with an overall survival of between 30 and 3650 days were selected into the survival analysis at beginning. According to the median levels of the Federation of European Biochemical Societies methylation data and RNA expression data of each gene, patients were divided into hyper methylation, hypo methylation, high expression and low expression groups. For each gene, only the patients with hyper methylation and low expression or hypo methylation and high expression were included into the combined survival analysis. Kaplan-Meier curve analysis was further performed to find prognosis genes with P < 0.05.

Molecular subtypes related to prognosis
Consensus clustering was conducted with the Consen-susClusterPlus R package in R, version 3.6.1 to screen BRCA subgroups according to the methylation data of the prognosis methylation-driven genes. The algorithm started with subsampling from the methylation data and every subsample was separated into k groups by k-means. This was repeated for 100 times to reach consensus values and get the stability of the screened clusters. After ConsensusClus-terPlus was applied, the cluster consensus and itemconsensus results were obtained.

Analysis of subgroups in survival and clinical features
Survival analysis of subgroups were performed with the survival R package. The differences among the clusters were indicated with Kaplan-Meier plots. Boxplots of methylation levels were performed using the reshape2 R package. Associations between clinical features (age, TNM, stage) were plotted and different methylation level among clusters were calculated by a Wilcox test with FDR < 0.05 and delta beta > 0.1. The difference was plotted using the pheatmap R package.

Immunohistochemistry (IHC)
For the purpose of revealing the different protein expression levels of the prognosis methylation-driven genes, immunohistochemical experiments were performed. In total, 42 tumor and adjacent normal tissues were collected from our hospital. Abcam) was conducted overnight at 4°C. Phosphate-buffered saline without primary antibody was used as the negative control. DAB chromogenic reagent was applied to develop the stain and hematoxylin was used to stain the nucleus. The sections were finally dehydrated and mounted with a neutral resin onto slides. Digital photomicrographs of sections were taken from representative areas at a fixed magnification of 2009. Positive staining in images was quantified as the integral optical density/area, which was expressed as the mean density using IMAGE-PRO PLUS, version 6.0 (Media Cybernetics, Inc., Bethesda, MD, USA). Then, the mean density values were analyzed with PRISM, version 8 (GraphPad Software Inc., San Diego, CA, USA). A paired t-test was conducted to compare the differential expressions between tumor and normal tissues.

Integration and evaluation of the prognostic nomogram
To further explore the prognostic value of the 10 identified survival-related and methylation-driven genes, multivariate Cox regression was performed to establish a nomogram. First, a signature estimating the risk score of each patient was constructed with stepwise regression based on the RNA-Seq data. Then, the independence of the signature and the other clinical features (Age and Stage) was confirmed by univariate and multivariate Cox regression analyses with the survival R package. The hazard ratio and Pvalues were plotted. Finally, a prognostic nomogram incorporating all the significant factors (P < 0.05) was integrated to predict the 3-, 5-and 7-year overall survival with the rms R package. The receiver operating characteristic (ROC) 3-, 5-and 7-year curves were plotted with the survivalROC R package and the calibration were also carried out using the rms R package to show the prognostic predictive accuracy of the nomogram.

Screening of methylation-driven genes in BRCA
In total, 56 genes were screened and found to be methylation-driven genes. Mix models were conducted and performed to determine differential methylation (log 2 fold change > 0.5, P < 0.05, Cor < À0.3). The details of 56 identified genes are shown in Table 1 (Fig. 1). The heatmaps of RNA expression ( Fig. 2A) and the methylation level (Fig. 2B) of the 56 methylationdriven genes were plotted with R pheatmap package in R 3.6.1.  Functional enrichment analysis GO enrichment analysis results (Fig. 2C) showed that gland development, growth hormone receptor signaling pathway, lactation, mammary gland development and the Janus kinase-signal transducer and activator of transcription cascade in the growth hormone signaling pathway were the most enriched functions. KEGG pathway enrichment analysis (Fig. 2D) showed that prolactin receptor signaling, nuclear signaling by ERBB4, signaling by ERBB4 and the estrogen signaling pathway were the most enriched pathways.

Combined gene expression and methylation survival analysis
Kaplan-Meier curves indicated 10 genes were related with the prognosis of BRCA (P < 0.05) (Fig. 3A). According to the results, the hypermethylation and low-expression survival rates of CDO1, CELF2, ITPRIPL1, KCNH8, RIC3, USP44, ZSCAN1 and ZSCAN23 were significantly lower. On the other hand, the hypomethylation and high-expression survival rates of PTK6 and RAB25 were significantly lower. Pearson correlation between the DNA methylation level and the RNA expression level revealed that there were significant negative correlations with Cor < À0.3 and P < 0.05 (Fig. 3B).

Identification and analysis of molecular subtypes
The variations among different clusters and average cluster consensus were estimated to determine the number of clusters. The criteria were followed with higher consistency inside the cluster, lower variations between different clusters and no obvious raise in the area under the cumulative distribution function curve.
The area under the cumulative distribution function curve started to stabilize after six clusters (Fig. 4A). The consensus matrix represented the consensus for k = 6 and showed a six-block structure (Fig. 4B). The Kaplan-Meier plot displayed significant differences among the six clusters (P < 0.001) and the clusters 5 and 6 had the worst outcome, whereas cluster 1 had the best survival rate (Fig. 4C). Different methylation levels of the six clusters were calculated and plotted based on the methylation data of the 10 screened methylation-driven genes (Fig. 4D). Clusters 6 and 5 showed a higher methylation level than the other clusters, whereas cluster 4 had the lowest level. Associations between clinical features (Age, TNM and Stage) and gene methylation level in  (Fig. 4E). Different gene methylation levels among clusters were calculated and plotted (Fig. 4F). USP44 and ZSCAN23 showed the greatest variant methylation levels, whereas ITPRIPL1 showed the lowest.

IHC analysis
The protein expression levels of the prognosis methylation-driven genes were demonstrated by IHC experiments. Representative images of IHC at 2009 were obtained (Fig. 5A-J). Clinicopathological characteristics in the present study are shown in Table 2.

Integration and evaluation of the prognostic nomogram
After the stepwise regression analysis, three genes were included in the signature calculating the risk score of each patient with the formula: risk score = (À1.01001 9 expression level of ITPRIPL1) + (À0.28248 9 expression level of ZSCAN1) + (À0.96202 9 expression level of ZSCAN23). All of the three variables (Risk score, Age and Stage) were indicated to be the significant factors (P < 0.001) in the univariate and multivariate Cox regression analyses (Fig. 6A,B). A prognostic nomogram incorporating all three significant factors was integrated to predict the 3-, 5-and 7-year overall survival  (Fig. 6D). The calibration curve of 5 years was plotted using the calibrate function and was well calibrated (Fig. 6E).

Discussion
Recently, DNA methylation has been widely recognized as one important epigenetic modification and is closely related to tumorigenesis. With the development of genomic detection technology, the understanding of the alterations in DNA methylation has greatly improved [13,14]. Tumorigenesis occurs with extensive DNA methylation changes [15]. Most of these changes happen early in carcinogenesis, which makes DNA methylation valuable for early screening, diagnosis, prognosis, therapeutic classification and follow-up after treatment [16]. Usually, DNA hypermethylation of tumor suppressor genes or DNA hypomethylation of oncogenes could lead to unfavorable outcomes in patients. In the present study, 56 methylation-driven genes in BC were screened and 10 of these were associated with the prognosis and tumor classification. CDO1 (cysteine dioxygenase 1), initiating several key metabolic pathways associated with pyruvate and sulfurate compounds, is an important regulator of cellular cysteine concentrations. In the present study, CDO1 was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. It has been implicated as a novel tumor suppressor gene that is silenced by promoter methylation in many cancers. By analyzing differential RNA expression profiles with or without treatment with 5aza-2 0 -deoxycytidine, the frequency of CDO1 promoter methylation was observed with a statistically significant difference between normal and tumor tissues [17]. In particular, Tanaka et al. [18] reported that CDO1 was silenced at the mRNA level in six types of BC cell lines. Overexpression of CDO1 decreased the growth capacity.
CELF2 (CUGBP Elav-like family member 2) is one type of RNA-binding protein [19]. In the present study, CELF2 was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. Piqu e et al. [20] demonstrated that CELF2 was targeted by promoter hypermethylation-associated transcriptional silencing in BC. The restoration of CELF2 could inhibit tumor growth and the epigenetic loss induced an aberrant downstream pattern of alternative splicing. Ramalingam et al. [21] found that the expression of CELF2 was consistently reduced with neoplastic transformation, indicating that it might be a potential tumor suppressor protein. Yeung et al. [22] revealed that CELF2 could interact with PREX2 and reduce the association of PREX2 with PTEN, playing a tumor suppressor role in PI3K signaling by antagonizing the oncogenic effect of PREX2. ITPRIPL1 (inositol 1,4,5-trisphosphate receptorinteracting protein-like 1) was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. It was identified as a methylation-driven gene and acted as an independent biomarker for the prognosis of lung adenocarcinoma by using bioinformatics methods [23]. However, little is known of the function and mechanism of ITPRIPL1 in cancer research.
KCNH8 (potassium voltage-gated channel subfamily H member 8) was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. It could exhibit RNA polymerase II cis-regulatory region sequence-specific DNA binding activity and voltage-gated potassium channel activity. Using quantitative MethyLight assays, KCNH8 was found to affect hypermethylation frequencies in lung tumor samples from 117 clinically well-characterized NSCLC patients [24]. In prostate cancer, KCNH8 was identified as a novel outlier gene with potential rearrangement and confirmed the association with primary  and metastatic prostate samples [25]. However, there are very few studies of KCNH8 in BC. PTK6 (protein tyrosine kinase 6) plays a role as an intracellular signal transducer in epithelial tissues. It was a hypomethylated-high expression gene in BC and this could result in the unfavorable survival of BC patients. In mammary epithelial cells, overexpression of PTK6 could cause sensitization of the cells to epidermal growth factor and lead to a partially transformed phenotype. Ito et al. [26] reported that PTK6 was expressed in approximately 70% of TNBCs and kinase-active PTK6 suppressed E-cadherin expression, promoted cell migration and played an important role in promoting an epithelial-mesenchymal transition. In ER + Luminal BC cells, enhanced expression of PTK6 promoted the growth of ER + BC cells, including tamoxifen-treated cells [27]. However, another study suggested that the BC cell growth was independent of PTK6 kinase activity. The tumor cell growth inhibition showed no correlation with PTK6 kinase activity inhibition, nor with total or activated PTK6 protein levels [28].
RAB25 is a member of the RAS oncogene family. In the present study, RAB25 was a hypomethylatedhigh expression gene in BC and this could result in the unfavorable survival of BC patients. Overexpression of RAB25 was correlated with poor prognosis and aggressiveness of renal, lung, breast, ovarian and other cancers [29]. Mitra et al. [30] reported that RAB25 was amplified and enhanced aggressiveness in luminal B cancers, whereas, in claudin-low tumors, RAB25 is lost, indicating possible anti-tumor functions. In a retrospective study, the expression of RAB25 was evaluated by IHC in 57 primary BC samples. The results obtained indicated that the expression of RAB25 was correlated with clinicopathologic variables and different molecular subtypes [31].
RIC3 (RIC3 acetylcholine receptor chaperone) was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. It could encode a member of the resistance to inhibitors of the cholinesterase 3-like family, which may have functions including inflammation control. In human lymphocytes and macrophages, immune activation could lead to dynamic changes in RIC3 expression and RIC3 was found to show a strong correction with inflammatory processes [32]. In another study, RIC3-TCRBC2 fusion was identified by RNA-Seq in T-cell lymphoblastic lymphoma, which might be a strong driver for neoplasia-associated mutations [33]. No study has yet focused on the relationships of RIC3 and any other cancer. USP44 (ubiquitin carboxyl-terminal hydrolase 44) is a protease that functions as a deubiquitinating enzyme. The present study indicated that it was a hypermethylated-low expression gene in BC and this could result in the favorable survival of BC patients. USP44 was identified as a key regulator of anaphasepromoting complex activation [34]. In BC, Liu et al. [35] reported that USP44 silencing induced spindle multipolarity, abated vasculogenic mimicry, reduced transendothelial migration and decreased interleukin-6 and interleukin-8 levels in BC stem cells. Lan et al. [36] indicated that USP44 contributed to N-CoR functions with respect to regulating gene expression and was required for the efficient invasiveness of TNBC cells. Sloane et al. [37] reported that the USP44 CpG Island was hypermethylated in colorectal cancer cell lines.
ZSCAN1 (zinc finger and SCAN domain-containing protein 1) and ZSCAN23 (zinc finger and SCAN domain-containing protein 23) belong to the same gene family. In the present study, these two genes were hypermethylated-low expression genes in BC and this could result in the favorable survival of BC patients. However, the protein level of ZSCAN1 was not found to demonstrate a significant difference. On the whole, the understanding of these two genes is quite limited. In a study identifying DNA methylation markers for the detection of high-grade cervical intraepithelial neoplasia, the SOX1/ZSCAN1 panel (84%, 167/200) had a higher sensitivity and specificity compared to the others [38]. The role of ZSCAN23 in cancer has not yet been reported.
With these 10 survival-related and methylationdriven genes, a signature estimating the risk score of each patient was constructed with stepwise regression based on the RNA-Seq data. It was confirmed to be an independent prognosis factor by univariate and multivariate Cox regression analyses. A prognostic nomogram including over 700 BC patients was further integrated to predict the 3-, 5-and 7-year overall survival. The area under the curve values of the 3-, 5and 7-year ROC curves were 0.801, 0.745 and 0.764, which demonstrated good prediction ability.
In conclusion, the DNA methylation levels of these 10 methylation-driven genes (CDO1, CELF2, ITPAIPL1, KCNH8, PTK6, RAB25, RIC3, USP44, ZSCAN1 and ZSCAN23) were negatively correlated with the mRNA expression level after linear regression analysis with the MethylMix R package. Using the ConsensusClusterPlus R package, six subgroups were identified with a significantly different prognosis based on the methylation data. The protein levels were confirmed by IHC and nine of 10 (i.e. except ZSCAN1) showed statistical differences. The prognosis values of the 10 identified methylation-driven genes in BC were explored. Finally, a prognostic nomogram including over 700 BC patients was further integrated to predict the 3-, 5-and 7-year overall survival with good prediction ability. Taken together, the screened methylation-driven genes could be potential biomarkers of BC.
However, there are also some limitations to the present study. For example, the results of the study are mainly based on the bioinformatic analysis. The DNA methylation changes, gene functions and mechanisms of the methylation-driven genes could be better revealed in additional experiments. Moreover, the prediction ability of the integrated prognostic nomogram requires more research to be validated for clinical practice.