Predicting overall survival of patients with hepatocellular carcinoma using a three‐category method based on DNA methylation and machine learning

Abstract Hepatocellular carcinoma (HCC) is closely associated with abnormal DNA methylation. In this study, we analyzed 450K methylation chip data from 377 HCC samples and 50 adjacent normal samples in the TCGA database. We screened 47,099 differentially methylated sites using Cox regression as well as the SVM‐RFE and FW‐SVM algorithms, and constructed a model with three risk categories to predict overall survival (OS) based on 134 methylation sites. The model achieved a 10‐fold cross‐validation score of 0.95 and correctly classified 26 of 33 samples in a testing set obtained by stratified sampling from the high‐, intermediate‐ and low‐risk groups.

Methylation of the promoter inhibits gene expression, and abnormal methylation is associated with many human diseases, including cancer. 5 Genomic methylation can be analyzed in a high-throughput manner, which may facilitate disease diagnosis, prevention and treatment.
In a study of 61 HCC cases, methylation-specific PCR identified MLH1, PMS2, MSH2 and P16 as frequently methylated genes in advanced HCC. 6 An effective two-category classification model was generated for predicting early HCC recurrence based on at least three CpG methylation sites; this model was developed through analysis of 450K methylation chip data from 576 publicly available samples. 7 Analysis of 450K chip data also led to the identification of DNA methylation sites in the genome of peripheral blood mononuclear cells and T cells that were associated with HCC progression. 8 Bisulfite sequencing analysis in the Huh2 HCC cell line showed an association between abnormal DNA methylation and abnormal DLL3 expression. 9 The clinical potential of DNA methylation in HCC was demonstrated when DNA containing methylated SEPT9 promoter circulating in plasma was found to be a promising biomarker for the disease. 10 Several studies suggest that DNA methylation may help predict OS of HCC patients. Analysis of 63 HCC samples and 10 normal controls identified methylation sites potentially associated with poor prognosis, 11 and a study of 27K methylation chip data from 71 HCC patients identified 13 candidate methylation sites.
Unfortunately, both studies failed to develop a predictive model because of small sample size. 12 A larger study of 450K chip data from 304 HCC samples used machine learning to build a model to predict OS based on 36 methylation sites. 13 Xu  HCC patients and found that LINE-1 methylation level was significantly correlated with OS, and may be a promising predictor of OS of HCC patients. 15 The three models established by the above three studies are dichotomy-based (two categories), making their risk prediction relatively crude. A survival prediction tool based on patients' molecular information complements existing tumor staging methods, which are based on clinicopathologic variables. Combining such prediction tools with current grading and staging methods will further improve tumor assessment and guide clinicians toward better treatment plans, including molecular stratification and risk mitigation, and at
Study of the relationship between HCC and DNA methylation is still in its infancy, with relatively few methylation sites associated with HCC prognosis and few predictive models. Here, we used machine learning to analyze DNA methylation data from 450K chips in the TCGA database and to build a model with three risk categories for predicting OS of HCC patients. Our work has implications not only for HCC management but also for other methylation-associated conditions.

FIGURE 1 Schematic of the study method. Raw data on DNA methylation of 377 HCC samples and 50 adjacent normal tissue samples based on the Illumina Human Methylation 450 (450K) Bead Chip were downloaded from the TCGA database. Using the ChAMP tool in R, 47 099 sites methylated differently between HCC tissue and adjacent normal tissue were identified. Cox regression was then used to assess the potential correlation between OS and each CpG site differentially methylated between HCC and normal tissues, and 2785 sites significantly related to OS (P < 0.05) were retained. The SVM was then used as a classifier in the SVM-RFE algorithm to rank features (in our case, methylation sites) from most to least relevant for the training objective in an iterative process that removes the lowest-ranked features, and the best 243 were selected based on the 10-fold cross-validation score for the number of recursive features at each level. The forward-SVM (FW-SVM) method was then used to screen feature subsets emerging from the SVM-RFE analysis. In this process (shown in the right half of the figure), a model is constructed for each feature, the model with the highest cross-validation score is selected, and this feature is then combined with each of the others to construct two-feature models, the best of which is selected based on the cross-validation score. This process is iterated to build up multi-feature models. Finally, we built a predictive model containing the best 134 features, and the model was tested using the testing dataset. Of 33 cases, 26 were correctly classified (26/33 = 79%).

ChAMP was expressly designed for methylation chips and performs quality control, standardization and calculation of methylation sites and regions. 16 The beta value was used to estimate methylation levels at CpG loci.
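For readers unfamiliar with the beta value, the sketch below shows the conventional Illumina definition: the methylated signal divided by the sum of the methylated and unmethylated signals plus a small stabilizing offset. The function name `beta_value`, the example intensities and the offset of 100 are illustrative assumptions; the text does not state which offset was applied in this study.

```python
import numpy as np


def beta_value(methylated, unmethylated, offset=100.0):
    """Conventional Illumina beta value: M / (M + U + offset).

    `offset` (commonly 100) stabilizes the ratio when both signal
    intensities are low. Negative intensities are clipped to zero.
    The offset actually used in the study is not stated (assumption).
    """
    m = np.clip(np.asarray(methylated, dtype=float), 0.0, None)
    u = np.clip(np.asarray(unmethylated, dtype=float), 0.0, None)
    return m / (m + u + offset)


# Hypothetical probe intensities for three CpG loci (not real data).
meth = np.array([900.0, 0.0, 450.0])
unmeth = np.array([0.0, 900.0, 450.0])
betas = beta_value(meth, unmeth)
print(betas)
```

Values near 1 indicate a mostly methylated locus and values near 0 a mostly unmethylated one.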

| Grouping of patients based on OS
Patients with HCC obtained from the TCGA database were classified as "high-risk" (58 samples) if they were likely to die within 1 year after surgery; "low-risk" (41 samples) if they were likely to survive more than

| Screening of differentially methylated CpG sites
First, Cox regression was used to assess the potential correlation between OS and each CpG site differentially methylated between HCC and normal tissues. Sites significantly related to OS (P < 0.05) were retained. Second, these sites were screened using the Support Vector Machine (SVM)-Recursive Feature Elimination (RFE) algorithm. The SVM method finds an optimal plane in a multidimensional space that can divide all sample units into two classes, and this plane should maximize the distance be-

| Cross-validation during screening of methylation sites
In each step of the SVM-RFE and FW-SVM algorithms, the intermediate and final results were evaluated using the average score obtained from 10-fold cross-validation. Cross-validation repeats training and testing over different partitions of the data, and 10-fold means that the data are randomly divided into 10 batches. 18 Over the following 10 machine learning runs, each batch was used once for validation while the other nine were used for training. Because every sample is held out exactly once, cross-validation provides an estimate of generalization error and helps in selecting a model that generalizes better. The mean accuracy of the 10 validation runs was calculated as the 10-fold cross-validation score. The closer this score was to 1, the more effective the model was considered.
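The scoring procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic three-class data, not the study's pipeline: the linear kernel, the data generator and the random seeds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a samples x methylation-sites matrix with
# three risk classes (assumption: real data would be beta values).
X, y = make_classification(
    n_samples=130, n_features=20, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Randomly divide the data into 10 batches; each batch is used once
# for validation while the other nine are used for training.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(SVC(kernel="linear"), X, y,
                              cv=cv, scoring="accuracy")

# The 10-fold cross-validation score is the mean accuracy of the runs.
cv_score = fold_scores.mean()
print(f"10-fold cross-validation score: {cv_score:.2f}")
```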

| Model validation and evaluation
The 163 cases were divided by stratified sampling into a training set (130 cases, 80%) and a test set (33 cases, 20%). The SVM model was rebuilt using the training set and the final feature combination, and the test set was used to assess the effectiveness of the model.
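A stratified split of this kind can be sketched as below, using the group sizes reported in this paper (58 high-risk, 64 intermediate-risk and 41 low-risk cases). The sample identifiers and the random seed are placeholders, so the particular cases drawn here are not those of the study.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Risk labels with the group sizes reported in the paper (163 cases).
labels = np.array(["high"] * 58 + ["intermediate"] * 64 + ["low"] * 41)
sample_ids = np.arange(labels.size)  # placeholder identifiers

# Stratified sampling keeps the class proportions roughly equal in the
# training set (130 cases) and the test set (33 cases).
train_ids, test_ids, y_train, y_test = train_test_split(
    sample_ids, labels, test_size=33, stratify=labels, random_state=0
)

print("training:", Counter(y_train))
print("testing: ", Counter(y_test))

# Accuracy corresponding to 26 correct predictions out of 33 test cases.
reported_accuracy = 26 / 33
print(f"test accuracy: {reported_accuracy:.0%}")
```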

| Patient grouping
Raw 450K chip data from 377 HCC samples and 50 adjacent normal tissue samples were downloaded from the TCGA database. Patients who were still alive and for whom fewer than 5 years had passed since surgery were excluded from the analysis. Among the remaining patients, 58 were classified as high-risk, 64 as intermediate-risk and 41 as low-risk.

| Identification and screening of differentially methylated sites
Using ChAMP, we identified 47 099 sites that were differentially methylated between the 377 HCC samples and the 50 adjacent normal tissues (Figure 2A). Of these sites, Cox regression identified 2785 that correlated significantly with OS (P < 0.05). SVM-RFE was then applied to these 2785 sites, and the best 243 were selected based on the 10-fold cross-validation score for the number of recursive features at each level. The corresponding 10-fold cross-validation score was 0.50 (Figure 2B).
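The Cox screening step, in which each site is tested on its own and retained if P < 0.05, can be illustrated with the following minimal sketch. It fits a one-covariate Cox model by Newton-Raphson (Breslow handling of ties) and uses a Wald test. The function names, the synthetic survival data, the site labels `cg_informative` and `cg_noise`, and the assumed log hazard ratio are all hypothetical; the study itself does not specify its implementation.

```python
import math

import numpy as np


def score_and_information(beta, time, event, x):
    """Gradient and observed information of the Cox partial log-likelihood."""
    weights = np.exp(beta * x)
    grad, info = 0.0, 0.0
    for t, x_i in zip(time[event], x[event]):
        at_risk = time >= t  # patients still under observation at time t
        s0 = weights[at_risk].sum()
        s1 = (weights[at_risk] * x[at_risk]).sum()
        s2 = (weights[at_risk] * x[at_risk] ** 2).sum()
        grad += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return grad, info


def univariate_cox(time, event, x, max_iter=50, tol=1e-8):
    """Return (log hazard ratio, two-sided Wald P value) for one covariate."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # centering does not change beta
    beta = 0.0
    for _ in range(max_iter):
        grad, info = score_and_information(beta, time, event, x)
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(beta, time, event, x)
    z = beta * math.sqrt(info)  # Wald statistic, SE = 1 / sqrt(info)
    return beta, math.erfc(abs(z) / math.sqrt(2.0))


# Synthetic cohort: one site influences the hazard, one does not.
rng = np.random.default_rng(0)
n_patients = 200
sites = {
    "cg_informative": rng.uniform(0.0, 1.0, n_patients),
    "cg_noise": rng.uniform(0.0, 1.0, n_patients),
}
hazard = np.exp(3.0 * (sites["cg_informative"] - 0.5))  # assumed effect
survival = rng.exponential(1.0 / hazard)
censor_time = 3.0  # administrative censoring (assumption)
observed_time = np.minimum(survival, censor_time)
died = survival <= censor_time

results = {name: univariate_cox(observed_time, died, values)
           for name, values in sites.items()}
retained = [name for name, (_, p) in results.items() if p < 0.05]
print(results)
print("retained:", retained)
```

In practice, a dedicated survival-analysis package would normally be used in place of this hand-written fit.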
The SVM-RFE cross-validation score of 0.50 prompted us to perform further screening using the FW-SVM algorithm, which combines the SVM algorithm with forward selection, progressively building up feature subsets one feature at a time. In order to obtain the "best model with the fewest features", we built a predictive model containing the best 134 features, which gave a mean 10-fold cross-validation score of 0.95 (Figure 2C).
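The two SVM-based screening stages can be sketched as follows on synthetic data. scikit-learn's `RFECV` stands in for SVM-RFE, and `forward_select` is a hypothetical helper written for this illustration of the greedy forward step. The linear kernel, the cap of five features and the rule of stopping when the score no longer improves are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic three-class data standing in for the methylation matrix.
X, y = make_classification(
    n_samples=150, n_features=30, n_informative=6, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Stage 1: recursive feature elimination with a linear SVM, keeping the
# number of features that gives the best 10-fold cross-validation score.
rfe = RFECV(SVC(kernel="linear"), step=1, cv=cv, scoring="accuracy")
rfe.fit(X, y)
candidates = [int(i) for i in np.flatnonzero(rfe.support_)]


def forward_select(X, y, candidates, cv, max_features=5):
    """Greedy forward selection scored by mean cross-validated accuracy."""
    selected, best_score = [], 0.0
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature to the current subset.
        scores = {
            f: cross_val_score(SVC(kernel="linear"),
                               X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break  # assumed stopping rule: no further improvement
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = scores[best_feature]
    return selected, best_score


# Stage 2: forward selection among the features retained by stage 1.
selected, best_score = forward_select(X, y, candidates, cv)
print("RFE kept", len(candidates), "features")
print("forward selection chose", selected, f"(score {best_score:.2f})")
```

Forward selection after elimination can recover a compact subset whose cross-validation score exceeds that of the larger RFE subset, which is the motivation given in the text for the second stage.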

| Model validation
The SVM model was rebuilt using the training dataset and the final combination of 134 features, and the resulting model was tested using the testing dataset (Table 1). Of 33 cases, 26 were correctly classified (79%). These results suggest that the model can effectively predict OS of HCC patients on the basis of methylation status without over-fitting (Table 2). To further validate the predictive power of the model,

| DISCUSSION
Here foundly. Since the current models include only a limited set of variables, an exciting area for future research will be to construct a predictive model that not only takes full advantage of patients' clinicopathologic data but also incorporates multi-level molecular data.

ACKNOWLEDGEMENTS
This study was partially supported by grants from the National Natural Science Foundation of China: 81602513 (to JC), 81472840 (to GS), 81530077 (to JF) and 81672825 (to AK) and from Shanghai Municipal Natural Science Foundation: 18410720700, 17411951200, 14ZR1405800, 17ZR1405400.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.