Genome‐wide DNA methylome analysis reveals methylation subtypes with different clinical outcomes for acute myeloid leukemia patients

Abstract Leukemia is the second common blood cancer after lymphoma, and its incidence rate has an increasing trend in recent years. Acute myeloid leukemia (AML) is one of the prevalent forms of leukemia. Although previous studies have investigated the methylation profile for AML patients, the AML methylation subtypes based on the genome‐wide methylome are still unclear. In the present study, we identified three methylation subtypes for AML samples based on the methylation profiles at CGI, CGI shore, CGI shelf, and opensea genomic contexts. Analyzing the molecular characteristics and clinical factors of the three subtypes revealed different methylation patterns and clinical outcomes between them. Further analysis revealed subtype dependent marker genes and their promoter CpG sites with regulatory function. Finally, we found that combining the AML patient age and methylation pattern brought better clinical outcome classification. In conclusion, we identified AML methylation subtypes and their marker genes, these results may help to excavate potential targets for clinical therapy and the development of precision medicine for AML patients.


| 6297
GAO et Al. causative genomic alterations. 2,3 Further investigation found that an increasing number of genetic changes have been recognized in the new World Health Organization (WHO) classification of AML. 4 Apart from genetic alterations, epigenetic changes also reflect biological processes associated with disease progression and treatment response.
DNA methylation, one of the most important epigenetic regulatory mechanisms, was an epigenetic process of transformation of cytosine into 5-methyl cytosine. 5,6 DNA methylation regulated gene expression by altering chromatin structure, DNA conformation, and DNA stability. 7,8 CpG islands (CGI) are defined as sequence ranges where the Obs/ Exp value is greater than 0.6 and the GC content is greater than 50%. CGIs are typically located in the promoter region, 5′ to the TSS. CpG shores are contexts with lower CpG density that lie within the 2 kb up-and downstream of a CpG island. CpG shelves are defined as the 2 kb outside of a shore. CpG openseas are with low methylation and not characterized in any of the above. 9,10 The phenomenon of DNA chemical modification may occur in CGI, CGI shore, CGI shelf, and opensea genomic contexts. 11,12 Early studies have found an abundance of epigenetic alterations in various types of AML. 13,14 However, most studies limited their research in CGI while ignored the genome-wide methylome. The understanding of methylation pattern variation for AML samples remains incomprehensive. Therefore, we expected to decode the methylation pattern of AML patients on genome-scale, including CGI, CGI shore, CGI shelf, and opensea genomic contexts.
To decode the genome-wide methylation pattern for AML patients, we focused on the most variable CpG sites in CGI, CGI shore, CGI shelf, and opensea genomic contexts, respectively, and identified three subtypes based on corresponding four methylation profiles using cluster-of-cluster alignment (COCA) method. 15 This approach took into account genome-wide methylation patterns and identified the subtypes from a more general perspective compared with only CGIs. Survival analysis for the subtype samples derived from genome-wide methylation profiles showed more significant difference than that from single regional methylation profiles. Subsequent analyses for AML subtypes further confirmed the molecular and clinical difference between them. These result indicated that integrating genome-wide methylome to identify methylation subtypes was essential.

| Data collection
The level 3 CpG methylation in CGI, CGI shore, CGI shelf, and opensea genomic contexts of 140 AML patients detected by Illumina Infinium HumanMethylation450 BeadChip array were obtained from The Cancer Genome Atlas (TCGA) portal (http://cance rgeno me.nih.gov/). The level 3 mRNA expression profiles, level 4 mutational datasets, and level 3 Copy Number Variation (CNV) data were also derived from TCGA database. The expression of mRNAs was measured as fragments per kilobase of exon per million reads mapped (FPKM). In addition, we obtained clinical characteristics of the patients from TCGA, including survival time, survival state, age, and other information.

CGI shore, CGI shelf, and opensea genomic contexts of AML samples
We extracted the CpG sites in CGI, CGI shore, CGI shelf, and opensea genomic contexts and further constructed four methylation profiles for AML samples. CpG sites with missing values in more than 30% samples were removed and each of the remaining missing value was imputed by the KNN Imputation.

| The somatic mutations of AML samples
Somatic mutations of samples sequenced by whole-exome sequencing were downloaded from the TCGA database. We integrated all sequencing platforms and obtained the no-redundancy results. After removing silent mutations, we counted the number of somatic mutations to evaluate Tumor Mutational Burden (TMB) for each sample.

| Identifying the methylation-associated subtypes of AML samples
We classified the AML samples using COCA method 15 based on the genome-wide CpG methylation profiles. First, the CpG sites in each type of genomic contexts with top 10% variable methylation levels were kept to identify the regional clusters. Second, consensus clustering for the methylation profile in CGI, CGI shore, CGI shelf, and opensea was developed separately for each. Third, clusters defined from each methylation profile were coded into a series of indicator variables for each cluster to construct a Boolean matrix, and then taken as the input of second-level consensus clustering. Finally, we obtained the subtypes of AML samples based on the genome-wide methylation profiles. The consensus clustering was performed by ConsensusClusterPlus R-package. 16 ConsensusClusterPlus was run with 80% sample and 80% CpG site resampling and 1000 iterations of hierarchical clustering based on a Pearson correlation distance metric.

| Survival analysis
Survival analyses were performed to evaluate the difference in survival rate between more than two sample groups, such as AML subtypes and different AML clusters resulting from separate methylation profile in different genomic contexts. The Kaplan-Meier survival plots, log-rank tests, and multivariate Cox regression models were performed using the R package 'survival'.

| Functional enrichment analysis
Functional enrichment analysis at the GO and KEGG levels was performed using DAVID Bioinformatics Resources (https:// david.ncifc rf.gov/). 17 The DAVID enrichment analysis was limited to GO-FAT biological process (BP) terms and KEGG pathways with the whole human genome as background.

| Identification of the AML subtypes based on the genome-wide methylation profiles
Identification of cancer subtypes make us more understandable about the heterogeneity across cancer samples from the epigenetic perspective and provide potential individualized therapeutic basis. 18 Therefore, we attempted to identify the AML subtypes with the genome-wide methylation profiles. To characterize DNA methylation patterns across different genomic contexts in AML samples, we exacted the beta values of CpG sites at CGI, CGI shore, CGI shelf, and opensea contexts separately, and constructed four corresponding methylation profiles. After data pre-processing described in the 'Materials and Methods' section, we extracted the top 10% variable CpG sites, that is, 13 986, 9223, 3145 and 13 226 sites in CGI, CGI shore, CGI shelf and opensea contexts respectively. Next, we identified the AML subtypes based on the four preprocessed methylation profiles, using the method named COCA. 15 First, 3-4 clusters were obtained by consensus clustering with each methylation profile. Second, the clustering results from single level were integrated and finally three subtypes were identified as the AML methylation subtypes, with 41, 64 and 35 samples in subtypes 1, 2 and 3 respectively ( Figure 1A,B).

| Samples in subtype 1 were with better prognosis compared with subtypes 2 and 3
Many studies have shown that variances at the molecular level lead to the different survival rate and different clinical behaviors across samples. It is precisely because of the clinical differences that it is of significance for us to identify the cancer subtypes. Therefore, we compared the survival rate between different AML subtypes. As a result, the samples in subtype 1 showed significantly better prognosis, compared with the other two subtypes (P = 4.85e-05, P (1&2) = 6.65e-06, P (1&3) = 2.91e-04, Figure 2A). Moreover, we found the different clinical behaviors between these three subtypes. Patients in subtype 2 were the eldest, followed by the patients in subtype 3, while those patients in subtype 1 were the youngest (P (1&2) = 8.171e-07, P (1&3) = 5.318e-03, P (2&3) = 2.314e-02, Figure 2B). The result showed that elder patients had worse prognosis than younger patients, this was reasonable and consistent with the results of previous studies. [19][20][21] Expected for the survival rate and clinical behaviors, F I G U R E 1 Consensus clustering for AML samples. A, Second-level consensus clustering for AML samples when K = 3. B, The connection between clusters from consensus clustering of first-level and second-level. The matrix of first-level clusters was further clustered. Each location type of CpG sites is represented by a different color as shown in colorbar | 6299 we also compared the somatic mutations between the three subtypes. As a result, we found that patients in subtype 3 were with more frequent FLT3 mutations compared with subtypes 1 and 2 (P = 2.207e-02, 1.075e-04 respectively, Figure 2C). What's more, TMB of subtype 3 samples tended to be less than samples in subtypes 1 and 2 (P = .08, .06 respectively, Figure 2C). Then we evaluated the frequency of arm level amplifications and deletions in three subtypes ( Figure 2D). The result showed that there was little change in the overall copy number of the leukemia samples, but the samples in subtype 2 and subtype 1 have more amplified and deleted regions compared with subtypes 3. As somatic mutations and CNV represent the genome instability of samples, 22 we supposed that different methylation levels of samples led to the variant genome instability and finally caused different malignancy grade of AML. Taken together, patients of subtype 2 were eldest and had higher genome instability while patients of subtype 3 were with more FLT3 mutations. These factors were supposed to be the reasons of their poor prognosis. These results suggested that the progression of AML was a complicated procedure that influenced by multiple factors, such as mutation of cancer-related genes and age of patients.
To investigate whether the prognostic ability of the different methylation patterns was independent of other clinical variables, we performed the multivariate Cox regression analysis. The variables in the regression included subtypes derived from clustering of genome-wide methylation pattern, age, gender and FLT3 mutation activity. We found that methylation subtypes and age were independent prognostic factors (Table 1). Next, data stratification analysis was performed for age. Patients with a younger (age < 50) and elder age (age ≥ 50) were stratified into younger group and elder group respectively. For younger or elder patients, we further compared the survival rate between three subtypes. As a result, we found that methylation subtype could classify patients within each age stratum into three subtypes with significantly different survival rate (P (1&3, younger) = 4.75e-02, P (1&2, younger) = 5.36e-02, Figure 2E and P (1&3, elder) = 6.04e-03, P (1&2, elder) = 1.81e-03, Figure 2F). This result suggested that the prognostic ability of methylation subtypes was also age-independent.

| Integrating genome-wide methylation facilitated the identification of AML subtypes
For the analysis of methylation profiles, most studies focused on the CGIs while ignored the methylation pattern on other genome regions. [23][24][25] In this study, we used the COCA method to identify the AML subtypes based on the methylation profiles at different genomic contexts, including CGI, CGI shore, CGI shelf, and opensea. As described above, there was significant different survival rate between the three subtypes. Next, we investigated the survival analysis for the three clusters derived from consensus clustering of CGI methylation profile. The result showed that although survival rate were different between these three clusters (P = 3.82e-02), the difference was much less than the three subtypes derived from genome-wide methylation profiles ( Figure 3A).
In addition, log-rank tests for other clustering results derived from methylation profiles at CGI shore, CGI shelf, and opensea were all less significant than subtypes obtained based on genome-wide DNA methylome (P = 1.28e-04 for shore, 5.33e-05 for shelf and 1.45e-04 for opensea, Figure 3B,D). This result suggested that integrating genome-wide methylation profiles to identify the AML subtypes is necessary and effective. On the other hand, statistic for the samples in different subtypes showed that most samples in each subtype were clustered into same first-level clusters. This result suggested that samples in the same subtype usually exhibited similar methylation pattern across genome-wide CpG sites.

| Different methylation patterns between three subtypes
Next, we analyzed the methylation levels between three subtypes at different genomic contexts. The result showed that samples of subtype 3 were genome-wide hypomethylation compared with those samples of subtypes 1 and 2 ( Figure 4A-D). We also found that samples of subtype 2 showed the highest methylation level at CGI shore, CGI shelf, and opensea. In addition, the methylation values of CGI in subtype 1 samples were found to be higher than other samples ( Figure 4A). This result was consistent with other studies that patients with higher CGI methylation had better prognosis. 26 significantly different methylation CpG sites between any of the three subtypes and compared the related genes between four methylation contexts. The result showed that different methylation related genes (DMRGs) were highly specific ( Figure 4E). To investigate the functional implication of the DMRGs, we performed the functional enrichment analysis of GO and KEGG for protein-coding genes with different methylation in three subtypes. The DMRGs was enriched in cell differentiation and leukemia-related biological processes, such as hemopoiesis and myeloid cell differentiation ( Figure 5A). Despite of the difference of DMRGs between four types of genome locations, we found that DMRGs were all related to cancer pathways ( Figure 5A-D). In addition, we found that protein-coding genes which were differently methylated between three subtypes in opensea had a contact with diabetes mellitus The methylation of CpG sites and differential methylation CpG related genes. A-D, The density plots for methylation levels of AML subtypes, including CGI (A), CGI shore (B), CGI shelf (C) and opensea contexts (D). E, The overlap across protein-coding genes that related to significantly differential methylation CpG sites at different genome locations

F I G U R E 5
The methylation values of different methylated CpG sites between each of three subtypes and functional enrichment of DMRGs at four types of genomic contexts, including CGI (A), CGI shore (B), CGI shelf (C),and opensea (D) ( Figure 5D). A previous study had found that diabetes mellitus increased the risk of developing leukemia, 29 the result in our study also found the correlation between diabetes mellitus and leukemia. Lots of studies have validated that DNA methylation in promoter CGI regulated expression of transcripts. Meanwhile, abnormal expression of mRNA/lncRNA plays roles in the occurrence and development of cancers. 30,31 Therefore, we calculated the correlation between expression of mRNA/lncRNA and methylation of each related CpG site in intervals ± 1 kb around transcription start site (TSS) of this mRNA/lncRNA, which is included in the DMRGs at CGI context. In total, we identified 29 genes whose expression whose expression was closely related to CpG sites (R < −0.3, P < .01), including 2 lncRNAs (HOXB-AS3 and LINC01475) and 27 protein-coding genes ( Table 2). Among the 29 genes, 23, 4 and 2 genes showed highest methylation and least expression in subtype 1, 2, and 3 samples. We found that HOX family genes tend to be marker genes of subtype 1 samples with highest methylation levels, including HOXA7, HOXA9, HOXA10, HOXB3. Previous studies had revealed that high expression of HOX family genes contributed to the progression of AML. 32,33 In our study, samples of subtype 1 showed lowest expression level and best clinical prognosis. Combining these results, we supposed that aberrant promoter region hypermethylation lead to down-regulated expression of HOX genes and further result in the progression of AML and poor prognosis. Moreover, we found that lncRNA HOXB-AS3 showed the similar tendency of methylation and expression with HOX protein-coding genes, which suggested that HOXB-AS3 also may be AML marker gene ( Figure 6A-C). In addition, LINC01475 was found to be another potential AML marker lncRNA ( Figure 6D-F). Taken together, we identified a few AML marker genes with aberrant methylation pattern and regulation on transcript expression, which might be potential targets for clinical therapy.

| Integration of genome-wide methylation pattern and age improved the prognostic ability for AML patients
Many studies have indicated that clinical factors affect the prognosis of cancer patients. 34 In our study, we also found that despite of the different methylation pattern, age also influenced the survival rate of AML patients. Elder patients showed significant worse prognosis than younger patients. In our previous analysis, we found that patients of subtype 1 showed significant better prognosis. In addition, patients of subtypes 2 and 3 showed significant different methylation pattern and clinical behaviors, but not significant different prognosis. Therefore, we integrated the patient age and methylation pattern, and classified samples of subtypes 2 and 3 into elder (subtype 23 & elder) and younger (subtype 23 & younger) groups. The survival analysis between subtype 1, subtype 23 & elder and subtype 23 & younger samples showed more significant survival difference (P = 2.24e − 07, Figure 7). Taken together, these results indicated that cancer progression was related to both molecular changes and clinical factors. Combining the molecular patterns and clinical factors will achieve more precise clinical outcomes for patients.

| DISCUSSION
To investigate the role of methylation in cancers, most studies focused on the single genome context, such as CGI, while neglected CpG sites in other regions. [23][24][25] In the present study, we integrated the genome-wide CpG methylation profiles and identified the methylation subtypes for AML samples using COCA method. Analyzing the clinical outcomes for samples of the three subtypes, we found that subtype 1 showed significantly better prognosis and subtypes 2 and 3 have worse prognosis. Then we analyzed the survival rate of AML sample clusters derived from consensus clustering based on the CGI methylation profile. The result showed that although there was survival difference between the three

F I G U R E 7
The Kaplan-Meier survival curves for samples classified by methylation subtype and patient age sample clusters, the prognostic ability of CGI clusters was declined apparently. In addition, we found that methylation patterns on CGI shore, CGI shelf, and opensea contexts all showed significant difference between subtypes 1, 2, and 3. Further functional enrichments of protein-coding genes related to four contexts were all associated with cancer progression. These results suggested that integrating the methylation profiles on genome-wide to identify the cancer subtypes was necessary.
Through multivariate Cox regression model analysis, we found that not only methylation subtypes but also patient age had the prognostic ability. After stratifying patients into younger and elder groups, we tested the prognostic ability of methylation subtypes in the two groups with different ages. The result showed that although elder patients had worse survival rate than younger patients, [19][20][21] those samples classified into subtype 1 still had better prognosis than subtypes 2 and 3. Combining patient age and methylation subtypes, we found out a better classification for AML samples. Samples of subtype 1 were with the best prognosis, followed by samples of subtypes 2 and 3 with younger age, and other samples showed the worst prognosis. Together, these results indicated that combining clinical factors and molecular variables was better prognostic strategy than single factor.
Overall, we identified the AML methylation subtypes based on the genome-wide methylation profiles and further analyzed the different methylation patterns and clinical outcomes between these methylation subtypes. In addition, we found the different CpG sites across the three subtypes and their corresponding genes. Functional enrichment analysis revealed the cancer-related progresses of these genes. Further expression and methylation analysis revealed the marker genes of AML subtypes. These findings provide additional useful data for the development of clinical therapeutic targets against AML.