A methylation‐driven gene panel predicts survival in patients with colon cancer

The accumulation of various genetic and epigenetic changes in colonic epithelial cells has been identified as one of the fundamental processes that drive the initiation and progression of colorectal cancer (CRC). This study aimed to explore functional genes regulated by DNA methylation and their potential utilization as biomarkers for the prediction of CRC prognoses. Methylation‐driven genes (MDGs) were explored by applying the integrative analysis tool (methylmix) to The Cancer Genome Atlas CRC project. The prognostic MDG panel was identified by combining the Cox regression model with the least absolute shrinkage and selection operator regularization. Gene set enrichment analysis was used to determine the pathways associated with the six‐MDG panel. Cluster of differentiation 40 (CD40) expression and methylation in CRC samples were validated by using additional datasets from the Gene Expression Omnibus. Methylation‐specific PCR and bisulfite sequencing were used to confirm DNA methylation in CRC cell lines. A prognostic MDG panel consisting of six gene members was identified: TMEM88, HOXB2, FGD1, TOGARAM1, ARHGDIB and CD40. The high‐risk phenotype classified by the six‐MDG panel was associated with cancer‐related biological processes, including invasion and metastasis, angiogenesis and the tumor immune microenvironment. The prognostic value of the six‐MDG panel was found to be independent of tumor node metastasis stage and, in combination with tumor node metastasis stage and age, could help improve survival prediction. In addition, the expression of CD40 was confirmed to be regulated by promoter region methylation in CRC samples and cell lines. The proposed six‐MDG panel represents a promising signature for estimating the prognosis of patients with CRC.

The accumulation of various genetic and epigenetic changes in colonic epithelial cells has been identified as one of the fundamental processes that drive the initiation and progression of colorectal cancer (CRC). This study aimed to explore functional genes regulated by DNA methylation and their potential utilization as biomarkers for the prediction of CRC prognoses. Methylation-driven genes (MDGs) were explored by applying the integrative analysis tool (METHYLMIX) to The Cancer Genome Atlas CRC project. The prognostic MDG panel was identified by combining the Cox regression model with the least absolute shrinkage and selection operator regularization. Gene set enrichment analysis was used to determine the pathways associated with the six-MDG panel. Cluster of differentiation 40 (CD40) expression and methylation in CRC samples were validated by using additional datasets from the Gene Expression Omnibus. Methylation-specific PCR and bisulfite sequencing were used to confirm DNA methylation in CRC cell lines. A prognostic MDG panel consisting of six gene members was identified: TMEM88, HOXB2, FGD1, TOGARAM1, ARHGDIB and CD40. The high-risk phenotype classified by the six-MDG panel was associated with cancer-related biological processes, including invasion and metastasis, angiogenesis and the tumor immune microenvironment. The prognostic value of the six-MDG panel was found to be independent of tumor node metastasis stage and, in combination with tumor node metastasis stage and age, could help improve survival Edited by Takashi Gojobori prediction. In addition, the expression of CD40 was confirmed to be regulated by promoter region methylation in CRC samples and cell lines. The proposed six-MDG panel represents a promising signature for estimating the prognosis of patients with CRC.
Colorectal cancer (CRC), which has heterogeneous outcomes and distinct underlying pathobiological and molecular features, ranks third in cancer incidence and second in cancer-related mortality worldwide [1]. Generalized screening of high-risk populations with precursor-initiating adenomas at age 50 years or older is an effective and durable strategy to detect early-stage cancers, reducing the incidence and mortality of CRC [2][3][4]. Surgical resection of the primary cancer and/or limited metastasis is the only approach for attempted cure, and additional chemoradiation may improve outcomes in some patients [4,5]. However, relapse or metachronous metastasis occurs in a subset of these patients, leading to increased mortality [6]. Therefore, robust diagnostic, prognostic and predictive biomarkers are urgently needed.
Currently, the American Joint Committee on Cancer tumor node metastasis (TNM) staging system is the only well-recognized stratification method used in clinical practice to guide therapeutic decisions and to predict the prognoses of patients with CRC [7,8]. However, the fact that the survival times in patients with CRC with the same TNM stage often vary highlights the need for more accurate strategies [9]. It is widely known that genetic changes, such as gene mutations, contribute to cancer formation and can be used to predict the outcomes of patients with CRC [10,11]. Recently, a consensus has been reached that epigenetic alterations, such as aberrant DNA methylation, abnormal histone modifications and altered expressions of noncoding RNA, occur early and manifest more frequently than genetic changes in CRC [12]. In addition, advances in genomic technology and bioinformatics have led to the identification of specific epigenetic alterations as potential clinical biomarkers in patients with CRC [12,13]. For example, with the availability of genomic platforms capable of broadly surveying gene expression and DNA methylation, such as The Cancer Genome Atlas (TCGA) project, we can now identify genomic subtypes of CRC [14,15], and the CpG island methylator phenotype (CIMP) has undoubtedly been one of the most promising epigenetic biomarkers for the prognostication of patients with CRC [12,16].
By applying an integrative analysis tool (METHYLMIX) to CRC samples from TCGA project, this study aimed to explore functional genes regulated by DNA methylation and the potential of these DNA methylation changes to become biomarkers for the prediction of CRC prognosis. We identified a prognostic methylation-driven gene (MDG) panel consisting of six gene members: transmembrane protein 88 (TMEM88); homeobox B2 (HOXB2); FYVE, RhoGEF and PH domain containing 1 (FGD1); TOG array regulator of axonemal microtubules 1 (TOGARAM1); RhoGDP dissociation inhibitor beta (ARHGDIB); and cluster of differentiation 40 (CD40). The high-risk phenotype classified by the six-MDG panel was associated with cancer-related biological processes, including invasion, metastasis, angiogenesis, tumor immune microenvironment, among others. We also confirmed the expression and methylation of CD40, a member of the six-MDG panel, in CRC samples and cell lines.

Data acquisition and preprocessing
The TCGA-Assembler was used to download level 3 DNA methylation data from colon adenocarcinoma (COAD) samples, measured by the Illumina Human Methylation 450 Beadchip (450 K array), from the TCGA data portal (https://portal.gdc.cancer.gov/) [17]. These data were preprocessed via TCGA pipelines and presented in the form of beta (β)-values, a ratio between methylated probe and total probe intensities. Probe-level data were condensed to a summary β-value by calculating the average methylation value for all CpG sites associated with a gene [18].
In total, 353 DNA methylation samples, including 315 COAD samples and 38 tumor-adjacent samples, were obtained. Methylation data were normalized using the LIMMA R package. Level 3 RNA sequencing data and clinical information were retrieved from the TCGA data portal. Of 521 transcriptome profiles, 41 cases were obtained from tumor-adjacent tissues, while the remaining 480 cases were COAD tissues. The transcriptome data were normalized and log 2 transformed with the functions of DEGList and calcNormFactors in the EDGER package [19]. The clinical data were preprocessed by exclusion of samples without survival status, and patients whose survival time was less than 30 days were also removed [20]. Two additional CRC profile datasets, GSE8671 and GSE42752, were downloaded from Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) and used to examine the expression and methylation of CD40, respectively. The GSE8671 dataset contains transcriptional data from 32 patients with COAD with adjacent normal mucosa, which was evaluated by Affymetrix Human Genome U133 Plus 2.0 Array [21]. The GSE42752 dataset includes a genomewide DNA methylation profile obtained from 22 COAD samples with corresponding adjacent normal colon mucosa and 20 samples from healthy colon mucosa using a 450 K array [22]. The earlier data are available for research with no restrictions, and this study was performed in accordance with the guidelines provided by TCGA and GEO.

Identification of MDGs
To identify MDGs, we used the METHYLMIX R package to perform an analysis integrating gene expression and DNA methylation data. In the METHYLMIX algorithm, the methylation state of a gene is established by a β-mixture model, and hypomethylated or hypermethylated genes are determined by comparing their differential methylation state in cancer versus normal tissues [false discovery rate (FDR) < 0.05] [23,24]. To be functionally relevant, MDGs should have a significant predictive effect on gene expression, implying that methylation is inversely associated with transcription (Pearson's coefficient < −0.3, P < 0.05) [23,24].

Construction of a prognostic model for survival prediction
Survival analysis was performed on 281 patients with COAD for whom both methylation and survival information [overall survival (OS) > 30 days] were available. First, we randomly designated 50% of the patients with COAD as the training set and the remaining 50% of patients with COAD as the testing set. Data matrices were generated by combining the methylation levels of the identified MDGs with corresponding follow-up data from the patients with COAD. Then univariate Cox regression analysis was performed to screen for MDGs that were significantly associated with OS (P < 0.05) based on their methylation β-value in the training set. Least absolute shrinkage and selection operator (LASSO) estimation, a well-suited approach when there is a large number of correlated covariates for model construction in the patient cohort [25], was performed to penalize the effect of multicollinearity using the GLMNET R package [26]. MDGs that survived the LASSO estimation were subsequently subjected to multivariate Cox regression to construct a best-fitting prognostic model, with the Akaike Information Criterion (AIC) indicating model fitness [27]. The SURVIVAL R package was used for the univariate and multivariate Cox regression analyses.

Risk score calculation
The risk score was calculated by a linear combination of the methylation β-value of the selected MDGs weighted by their estimated regression coefficient in the multivariable Cox regression analysis, as discussed previously [28]. Patients with COAD were classified into high-or low-risk groups, using the median risk score of the training set as the cutoff value.

Gene set enrichment analysis
Gene set enrichment analysis (GSEA) [29] was used to determine whether the members of a given gene set were generally associated with the risk score derived from the prognostic six-MDG panel. The risk score (high or low) was designated as the phenotype, and the analysis was conducted using the matched gene expression profile. Random sample permutations and the significant threshold were set at 1000 times and FDR < 0.01, respectively. GSEA was performed using the JAVA program (http://software.broad institute.org/gsea/index.jsp) using the MSigDB C2 CP: Kyoto Encyclopedia of Genes and Genomes (KEGG) gene set collection. The enriched KEGG pathways were ranked by normalized enrichment score, and if a gene set had a positive normalized enrichment score, the high expression level of the majority of its members was positively related to the high-risk score phenotype.

Experimental validation of CRC cell lines
A panel of six CRC cell lines (RKO, SW480, SW620, HCT116, DLD1 and LoVo) was included in this study. All cell lines were preserved at our institute (The First Medical Center, Chinese PLA General Hospital, Beijing, China) and were cultured in Roswell Park Memorial Institute 1640 medium supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin. mRNA expression of CD40 in CRC cell lines with or without 5-aza-2 0 -deoxycytidine (5-Aza; Sigma, St. Louis, MO, USA) treatment (2 μM for 96 h) was evaluated using semiquantitative RT-PCR as previously described [30]. Genomic DNA was prepared using the Proteinase K method. Bisulfite treatment, methylation-specific PCR (MSP) and bisulfite sequencing (BSSQ) were performed as previously described [31]. Genomic sequences around the transcriptional start site (TSS) were used as the template for CpG island prediction and the design of MSP and BSSQ primers using METHYL PRIMER EXPRESS software v1.0 (Thermo Fisher Scientific, Waltham, MA, USA). The primers for RT-PCR, MSP and BSSQ are listed in Table S1.
Total protein of CD40 in these CRC cell lines was measured by western blotting, as previously described [30], using β-actin as the loading control. The antibodies used for western blotting were purchased from Proteintech (Wuhan, China). We also examined the membrane expression of CD40 using flow cytometry. Cells were harvested using trypsin and were washed with phosphate-buffered solution before incubation with and without phycoerythrintagged mouse monoclonal antibody to human CD40 (Sino Biological, Beijing, China) at 4°C for 30 min. Then the cells were washed twice to remove unbound antibodies before being measured on a FACSCalibur flow cytometer (BD Biosciences, Franklin Lakes, NJ, USA).

Statistical analysis
The Mann-Whitney and Wilcoxon matched-pairs signedrank tests were used to analyze the differences in DNA methylation, gene expression and risk score in nonpaired and paired samples, respectively. The relationship between the risk score and clinicopathological characteristics was analyzed using the Chi-square or Fisher's exact test. Survival differences between the high-and low-risk groups were evaluated using Kaplan-Meier analysis, and the log rank test was used as a statistical method. Multivariate Cox regression and data stratification analyses were performed to determine whether the risk score derived from the prognostic MDG panel was independent of the clinicopathological features of the patients with COAD. A receiver operating characteristic (ROC) curve was used, and the area under the ROC curve (AUC) was calculated to compare the sensitivity and specificity of survival prediction based on age, TNM stage, the risk score derived from the prognostic six-MDG panel and a combination thereof. Statistical tests were conducted using PRISM8 (GraphPad Software, San Diego, CA, USA) or R 3.6.0 using the corresponding aforementioned R package.

Screening MDGs in COAD
We first prepared corresponding expression and methylation data, and three data matrices were acquired: a gene expression profile of 308 tumor tissues and two methylation profiles of 38 adjacent and 308 tumor tissues, respectively. These profiles were used as input data for the METHYLMIX R package, with which differential and correlation analyses between DNA methylation and gene expression were conducted. Based on the screening criteria, a total of 299 MDGs were identified (Table S2). The methylation profiles of the most significant 30 hypomethylated and hypermethylated MDGs (ranked by the β-value difference between tumor and adjacent tissues) are shown in Fig. 1A. The correlations between DNA methylation and gene expression and the methylation mixture models of the top three MDGs are shown in Fig. 1B,C, respectively.

Identification of a prognostic six-MDG panel from the training set
After preprocessing the methylation and clinical data, a total of 281 patients with COAD with adequate methylation and follow-up data were included in the survival analysis. The clinical information for these 281 patients is summarized in Table S3. The patients were randomly split into a training set (n = 141) and testing set (n = 140). To identify certain prognostic MDGs, we performed univariate Cox regression analysis on the training set, and 12 prognosis-related MDGs (P < 0.05; Table S4) were chosen for subsequent LASSO estimation. Ten MDGs survived the LASSO regularization ( Fig. 2A) after penalization of the multicollinearity effect and were further subjected to multivariate Cox regression analysis to construct a bestfitting prognostic model. The AIC was used to indicate the model fitness. Finally, a prognostic DNA methylation gene panel consisting of six MDGs (TMEM88, HOXB2, FGD1, TOGARAM1, ARHGDIB and CD40) was identified. Detailed information on the six MDGs is presented in Table 1. The methylation profile, correlations between gene expression and DNA methylation, and methylation mixture models of the six MDGs are shown in Fig. S1. The prognostic six-MDG panel included one gene (ARHGDIB) with a statistically nonsignificant P value (P = 0.071; Table 1); however, this six-MDG panel had the lowest AIC, representing the best model fitness, and the overall effect was significant (AIC = 202.86, global P [log rank] < 0.001).
Next, a risk score model for OS prediction was created based on the methylation β-values of these six MDGs, as follows: We then calculated the risk score for each patient with COAD and classified them into high-or low-risk subgroups using the median risk score of the patients in the training set as the cutoff value. Kaplan-Meier survival curve analysis of the training set showed that patients with COAD in the high-risk group had a significantly shorter median OS than those in the low-risk group (log rank P < 0.001; Fig. 2B). We also profiled the distribution of risk score, survival status and methylation β-values in the training set ( Fig. 2C-E). The risk scores of the patients in the training set ranged from −17.883 to −9.677, with a median risk score of −13.807 (Fig. 2C). Moreover, there were more patients alive in the low-risk than the high-risk group (χ 2 = 13.45, P = 0.0002; Fig. 2D). Interestingly, the methylation levels of all six MDGs were higher in low-risk than high-risk patients (Fig. 2E), indicating that hypermethylation of the six-MDG panel is a favorable prognostic factor for patients with COAD.

The six-MDG panel is predictive of survival in the testing and total sets
To further test the significance of the prognostic six-MDG panel in patients with COAD, we used the testing and total sets as validation groups. Using the same risk score cutoff value obtained from the training set, we divided the patients with COAD in the testing set into high-(n = 75) and low-risk (n = 65) groups. The results of Kaplan-Meier analysis demonstrated that the patients with COAD in the high-risk group had a lower OS than those in the low-risk group (log rank P = 0.0137; Fig. 3 A), and that there were more patients in the low-risk group than in the high-risk group (χ 2 = 4.514, P = 0.0336; Fig. 3B). We also performed the same analysis on the total set (training set plus testing set, n = 281), and the results were consistent with those of the training and testing sets individually (Fig. 3C,D). These results suggest that the selected six-MDG panel can predict survival in both the training and total sets.
The prognostic value of the six-MDG panel is independent of TNM stage TNM staging is a widely used and clinically useful classification system and is highly associated with the 5-year OS in CRC [32]. Therefore, we aimed to clarify whether the prognostic value of the six-MDG panel is independent of the TNM stage. We performed multivariate Cox regression and stratification analyses on the total set. After the exclusion of 10 patients who lacked adequate TNM staging information, we conducted multivariate Cox regression analysis on a total of 271 patients, with age, sex, TNM stage and risk score as covariates. The results showed that age, TNM stage and risk score remained independent prognostic factors (Fig. 4A). We then preformed data stratification analysis, with the patients stratified into four subgroups (stages I, II, III and IV). The results of the stratification analysis showed that the prognostic six-MDG panel could identify patients with different OSs in the TNM stage II (log rank P = 0.0450) and IV (log rank P = 0.0160) subgroups (Fig. 4B), but was unable to sufficiently clarify the patients in the TNM stage I (log rank P = 0.0750) and TNM stage III (log rank P = 0.0975) subgroups with significantly disparate survival (Fig. 4B). This may be attributed to the small sample size or truncated dataset. Therefore, we combined low (stage I plus II) and high TNM stages (stage III plus IV) and found that the risk score could significantly identify patients with different prognoses in these two subgroups (log rank P = 0.0083 and 0.0006, respectively; Fig. 4C). These results suggest that the prognostic value of the six-MDG panel is independent of the TNM stage.
Moreover, we performed ROC analysis to compare the sensitivity and specificity of OS prediction between the prognostic factors, including age, TNM stage, risk score derived from the six-MDG panel and a combination of these three factors. As shown in Fig. 4D, there was no significant difference when the AUCs of the three prognostic factors (age, TNM stage and risk score) alone were compared pairwise (all P > 0.05). However, when these three prognostic factors were combined, the AUC was significantly greater than that of each prognostic factor alone (all P < 0.05). These results indicate that the combination of the three prognostic factors (age, TNM stage and risk score) may help improve survival prediction in patients with COAD.

Assessment of biological pathways associated with the six-MDG panel
We performed GSEA to identify relevant pathways that the six-MDG panel may be involved in, using the risk score for phenotype classification. Gene sets significantly enriched (FDR < 0.01) for the high-risk phenotype are shown in Fig. 5A. High-risk scores were positively associated with the up-regulation of several cancer-related pathways, including invasion, metastasis, angiogenesis and tumor immune microenvironment. Vascular endothelial growth factor, for instance, a key regulator in the growth and maintenance of blood vessels, can directly modulate the vascular wall by loosening cell-cell contacts and increasing the permeability of blood vessels, which aids in the dissemination of tumor cells [33].
Next, we analyzed the relationship between clinicopathological features and the risk score derived from the six-MDG panel in patients with COAD (Table 2). Consistent with the pathway analysis, the results showed that patients with COAD in the high-risk group were more likely to have remote metastasis (χ 2 = 6.465, P = 0.011; Table 2 and Fig. 5B). We also evaluated the risk score as a continuous variable and found that patients with metastasis tended to have higher risk scores than those without metastasis (P = 0.0036; Fig. 5C). Collectively, these results suggest that the selected six-MDG panel is associated with cancer-related signaling pathways and acts as an indicator of tumor metastasis.

CD40 is universally hypermethylated in CRC tissues
CD40 is a member of the tumor necrosis factor (TNF) family and is a new immune-modulating target with great potential in cancer treatment [34]. The regulation of CD40 expression by DNA methylation has yet to be reported in the current literature and therefore deserves further investigation. We first examined the expression of CD40 in patients with CRC from the TCGA and GSE8671 datasets. The transcriptional expression of CD40 was significantly down-regulated in CRC tissues compared with the healthy colon mucosa in both datasets (Fig. 6A). Next, we analyzed the overall methylation level of CD40 in the TCGA and GSE42725 datasets, the results of which showed that CD40 was hypermethylated in CRC tissues compared with adjacent and/or healthy colon mucosa in the two datasets (Fig. 6B). We also observed a negative correlation between mRNA expression and overall DNA methylation level in patients with COAD from TCGA dataset (Pearson's r = −0.511, P < 0.001; Fig. S1B).
In addition, we analyzed the CpG site-specific methylation status of all 15 CpG sites of CD40, assessed by the 450 K array. The CpG sites located in or near the CpG island (island, N shore and S shore) covering the TSS of CD40 (12 CpG sites) were significantly hypermethylated in CRC tissues compared with the adjacent mucosa (Fig. 6C), and except for cg24575067, their methylation levels were negatively correlated with CD40 expression (Fig. 6D). Interestingly, we observed a similar CpG site-specific methylation pattern of CD40 in the GSE42725 dataset (Fig. 6E). These results suggest that CD40 is universally hypermethylated in CRC tissues, which may contribute to its transcriptional silencing.

The expression of CD40 is regulated by promoter methylation in CRC cell lines
To better understand the regulation of CD40 expression in CRC, we detected the levels of CD40 expression in a panel of CRC cell lines, the results of which indicated that CD40 mRNA expression was silenced in three of the six CRC cell lines (Fig. 7A). We confirmed the expression of CD40 by performing western blot and flow cytometry analyses on the total and membrane protein levels of these six cell lines (Fig. 7B,C). Next, MSP and BSSQ were used to evaluate the  Fig. 7D. MSP analysis revealed CD40 promoter methylation in the three cell lines with silenced CD40 expression (SW480, SW620 and DLD1) (Fig. 7E). BSSQ analysis of 19 CpG sites around the TSS showed dense methylation of the cell lines with silenced CD40 expression that were examined (SW480 and DLD1), but not in the CD40-expressing HCT116 cells (Fig. 7F). To test whether promoter methylation directly contributes to the transcriptional silencing of CD40, these six CRC cell lines were treated with 5-Aza, a demethylation reagent. Restoration of CD40 expression was induced using 5-Aza in the three CRC cell lines with silenced CD40 expression (Fig. 7G). These results indicate that CD40 is silenced in CRC cell lines by promoter region hypermethylation.

Discussion
Aberrant epigenetic changes drive carcinogenesis and subsequent tumor progression [35]. Of the various epigenetic modifications, DNA methylation is the key factor and is classically responsible for transcriptional silencing via the hypermethylation of CpG islands located in the promoter regions of tumor suppressor genes [36]. In addition, DNA hypomethylation has been implicated in the regulation of genome rearrangement and chromosomal instability, which may also contribute to carcinogenesis [36]. A plethora of genespecific studies have demonstrated that gene hypermethylation or hypomethylation can be used as an epigenetic biomarker to predict the behavior and prognosis of CRC [37]. There is also evidence of an association between the aberrant methylation of multiple genes and increased CRC aggressiveness [38]. For instance, Weisenberger et al. [] introduced the prevailing method used to identify CIMP in CRC, which is based on the methylation status of five genes: calcium voltage-gated channel subunit alpha-1 G (CACNA1G), insulin-like growth factor 2 (IGF2), Neurogenin 1 (NEUROG1), runt-related transcription factor 3 (RUNX3) and suppressor of cytokine signaling 1 (SOCS1). CIMP-positive tumors were found to exhibit unique clinicopathological and molecular features, correlating with an overall unfavorable prognosis [39]. The advancement and prevalence of high-throughput DNA methylation arrays have confirmed previously identified epigenetic changes and have also uncovered many new alterations, creating an opportunity to discover novel cancer-related epigenetic biomarkers. By applying an integrative analysis tool to TCGA project, we aimed to explore key genes regulated by DNA methylation and their potential use as prognostic biomarkers of CRC. A model-based algorithm (METHYLMIX) was used to identify MDGs, from which we developed a prognostic MDG panel consisting of six genes (TMEM88, HOXB2, FGD1, TOGARAM1, ARHGDIB and CD40) in the training set (50% of the TCGA cohort). The six-MDG panel exhibited favorable performance in OS prediction, which was validated through the test set (the remaining 50% of the TCGA cohort) and the total set (training and test sets). Multivariate Cox regression and data stratification analyses demonstrated that the prognostic value of the risk score derived from the six-MDG panel was independent of the TNM stage. Furthermore, through ROC curve analysis, we found that the combination of age, TNM stage and the six-MDG panel, the three independent prognostic factors revealed by the multivariate Cox regression analysis, may improve prognostication.
These six prognostic MDGs have different methylation values in tumors and their adjacent tissues, and their DNA methylation and mRNA expression levels are inversely correlated, indicating their potential roles in CRC. The GSEA pathway analysis we performed provided evidence that the six MDGs are involved in cancer-related biological processes, including invasion, metastasis, angiogenesis, tumor immune microenvironment, among others. Up-regulation of HOXB2 was found to be an adverse prognostic indicator for stage I lung adenocarcinoma, promoting invasion by transcriptional regulation of metastasis-related genes [40,41]. In this study, HOXB2 expression was negatively correlated with DNA methylation in CRC, and hypermethylation of HOXB2 was associated with prolonged OS.
However, Marsit et al. [42] revealed that increased promoter methylation of HOXB2 in bladder cancer is significantly and independently associated with increased cancer aggressiveness. Further studies are needed to clarify the functional role of HOXB2 in cancer. ARHG-DIB has been identified as a regulator of tumor metastasis, but its role in cancer remains unknown [43]. ARHGDIB has been found to function as a positive regulator of cancer progression in ovarian [44], breast [45], colorectal [43] and gastric cancers [46], and as a negative regulator in Hodgkin's lymphoma [47], bladder cancer [48,49] and lung cancer [50]. In this study, as in previous studies involving CRC, we found hypermethylation of ARHGDIB to be a favorable prognostic factor. TMEM88 is a transmembrane protein that functions as an inhibitor of Wnt signaling [51], and TMEM88 promoter hypomethylation is associated with platinum resistance in ovarian cancer [52]. The results of this study demonstrated that TMEM88 is hypomethylated in the high-risk group, which is associated with shorter OS in CRC. Therefore, we hypothesized that TMEM88 may modulate the prognosis of CRC by altering the sensitivity of cancer cells to chemotherapy through mediation of promoter methylation, although further investigation is needed to confirm this. Ayala et al. [53] revealed that FGD1 is central in the regulation of focal degradation of the extracellular matrices in invadopodia. They also demonstrated that FGD1 is highly expressed in prostate and breast cancers, potentially leading to aberrant growth, invasiveness and/or metastasis [53]. TOGARAM1 encodes a TOG domain array-containing protein that regulates the structure of cilia microtubules [54]. The regulation of TOGARAM1 expression by DNA methylation and its role in cancer have not yet been reported. CD40 belongs to the TNF receptor family and is crucial to the mediation of a variety of immune and inflammatory responses [55]. CD40 ligation provides essential activation signals for immune cells [55], although its function in the promotion or inhibition of tumorigenesis and progression via regulation of TNF alpha (TNFα)-induced apoptosis [56], angiogenesis [57], tumor cell migration and invasion [58], and chemoresistance [59] is unknown. Agonist CD40 antibodies have been developed and tested in clinical trials, in which impressive results have been noted, especially in pancreatic cancer [60]. We confirmed that the expression of CD40 is regulated by promoter region hypermethylation in CRC tissues and cell lines, which may provide new insights into the combination of epigenetic therapy and CD40-stimulating immunotherapy. Further investigation is needed to clarify the underlying mechanisms that potentiate MDGs as DNA methylation biomarkers for CRC.   This study had several limitations. First, no external validation was performed. We attempted to search for CRC cohorts with both methylation and follow-up data in multiple cancer databases, including GEO and the International Cancer Genome Consortium project, among others, but no relevant available datasets were found. However, considering the number of patients included in the processes of model construction and internal validation for this study, the identified prognostic signature is unlikely to be random noise of the methylome. Second, experimental information regarding the regulatory mechanisms of all six prognostic MDGs on the methylation signature was presented. Third, the specific functional role of these prognostic MDGs in CRC remains unexplored.

Conclusions
In summary, we identified an MDG-related signature that acts as an independent prognostic factor in CRC, and its combination with clinical characteristics, including age and TNM stage, could help improve prognostication. Our results also confirmed that CD40, a member of the prognostic six-MDG panel, is regulated by DNA methylation in CRC samples and cell lines. More testing is needed to obtain a complete picture of the regulatory mechanisms and functional roles of all six MDGs in CRC. In addition, clinical investigations of additional CRC patient cohorts are needed to validate our findings and to elaborate on their potential clinical utilization.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.  Table S1. Primers used in this study. Table S2. Methylation-driven genes identified in colon cancer patients. Table S3. Clinical characteristics of 281 colon cancer patients included in survival analysis. Table S4. 12 methylation-driven genes significantly associated with overall survival of colon cancer patients screened by univariate Cox regression analysis in the training set (n = 141).