Systematic analysis of differentially methylated expressed genes and site‐speciﬁc methylation as potential prognostic markers in head and neck cancer

Abstract Head and neck cancer (HNC) remains one of the most malignant tumors with a significantly high mortality. DNA methylation exerts a vital role in the prognosis of HNC. In this study, we try to screen abnormal differential methylation genes (DMGs) and pathways in Head–Neck Squamous Cell Carcinoma via integral bioinformatics analysis. Data of gene expression microarrays and gene methylation microarrays were obtained from the Cancer Genome Atlas database. Aberrant DMGs were identified by the R Limma package. We conducted the Cox regression analysis to select the prognostic aberrant DMGs and site‐speciﬁc methylation. Five aberrant DMGs were recognized that significantly correlated with overall survival. The prognostic model was constructed based on five DMGs (PAX9, STK33, GPR150, INSM1, and EPHX3). The five DMG models acted as prognostic biomarkers for HNC. The area under the curve based on the five DMGs predicting 5‐year survival is 0.665. Moreover, the correlation between the DMGs/site‐speciﬁc methylation and gene expression was also explored. The findings demonstrated that the five DMGs can be used as independent prognostic biomarkers for predicting the prognosis of patients with HNC. Our study might lay the groundwork for further mechanism exploration in HNC and may help identify diagnostic biomarkers for early stage HNC.


| INTRODUCTION
Head and neck cancer (HNC) remains the 10th largest primary cancer in the world and the seventh leading cause of cancer death.
There are approximately 400,000 oral and pharyngeal diseases, 160,000 laryngeal cancers and 300,000 deaths worldwide each year (Mehanna, Paleri, West, & Nutting, 2010;Siegel, Miller, & Jemal, 2018;Wozniak, Szyfter, Szyfter, & Florek, 2012). HNC is a heterogeneous tumor, which is characterized by lesions in the oral, pharynx, and larynx (Arantes, de Carvalho, Melendez, Carvalho, & Goloni-Bertollo, 2014) regions. It is documented that one out of 99 people born in the United States today experienced HNC in their lifetime. It is considered an important part of the global burden of cancer (Jemal et al., 2008;Marur & Forastiere, 2008); HNC accounts for 12% of all malignancies in the world. Despite progress in treatment, HNC's 5-year survival rate is still not favorable (Carvalho et al., 2008;Siegel et al., 2018). Early detection and risk classification of HNC is essential to improve prognosis and reduce the mortality and morbidity associated with HNC; useful diagnostic molecular biomarkers for HNC are vital for effective treatment selection.
Epigenetic changes in the development and progression of various cancers have been documented owing to rapid technological breakthroughs in whole-genome sequencing (Kordowski et al., 2018;Molnar et al., 2018;Sweeney et al., 2009). DNA methylation usually occurs at the cytosine-phosphate-guanine (CpG) site, regulating protein function, gene expression, and RNA processing (Baylin & Jones, 2011;Portela & Esteller, 2010;Stein, 2011). Although several studies have determined that the given genes have abnormal DNA hypermethylation or hypomethylation in HNC, a combined profile and pathway of regulatory networks is still rare (Misawa et al., 2016;Misawa et al., 2018).
In the study, the combination of gene expression and DNA methylation data was analyzed. By applying differential analysis and Cox regression analysis, five differentially methylated genes (DMGs) were identified as potential prognostic methylation genes for HNC patients. Moreover, these findings may provide new prospects for potential mechanisms based on exploring sitespecific methylation and DMGs in HNC.

| Data source and data processing
In the present study, RNA-Seq data, DNA methylation data, and clinical information related to HNC patients were obtained from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci. nih.gov/tcga/, August 28, 2018). The 44 adjacent nontumor samples and 502 HNC samples were included in the gene expression profiles, in which messenger RNA (mRNA) microarrays used IlluminaHiSeq RNA-Seq arrays, while 50 adjacent nontumor samples and 530 HNC samples were included in the DNA methylation data set, in which the methylation platform used a Illumina HumanMethylation450 BeadChip.

| Univariate cox analysis
To evaluate the impact of these genes on the HNC patient's prognosis, a univariate Cox regression analysis was used to screen the survival specific DMGs/DEGs. The prognostic genes were screened with the R package "Survival." With the threshold of P values < 0.05, the DMGs/DEGs were considered as prognostic genes in the univariate Cox regression analysis. Then, common genes in DMGs/DEGs were considered as prognostic methylation genes.

| Correlation analysis of DEGs and DMGs in HNC
The identified prognostic DMGs were selected to explore the relationship between mRNA and DNA methylation. It is widely accepted that a negative relationship between DNA methylation and mRNAs was detected. Therefore, we selected those DMGs with a negative association between DNA methylation and mRNAs, which had an R value in Pearson's correlation analysis < −0.3, and a P value < 0.05.

| Multivariate Cox analysis
The multivariate Cox regression analysis was used to further validate these DMGs as prognosis factors. Integration of gene expression levels weighted by the regression coefficient (β) was used to construct a risk score model. The formula for estimating the prognosis index (PI) for each patient is as follows: (β × expression level of EPHX3) + (β × expression level of STK33) + (β × expression level of GPR150) + (β × expression level of PAX9) + (β × expression level of INSM1) (Bao et al., 2014;Zhang et al., 2015).
According to the threshold of the median PI, HNC patients were stratified into high-risk and low-risk groups. The Kaplan-Meier survival curves were plotted. To further verify that the five DMGs were also the independent indicators of other clinical factors, univariate and multivariate Cox regression analysis were performed. We used timedependent receiver operating characteristic (ROC) curves within 5 years to assess the prognostic performance based on the risk score model. A P value < 0.05 was considered statistically significant.

| Functional enrichment analysis
We conducted differentially expressed analyses between low and high-risk groups based on risk score models. The top 200 DEGs (100 upregulated genes and 100 downregulated genes) were screened. To better understand the biological process of the five DMGs, functional enrichment analysis was analyzed by the "ClusterProfiler" package in R software (Yu, Wang, Han, & He, 2012). The gene set enrichment analysis (GSEA) was further performed (Subramanian et al., 2005).

| Screening of prognostic DNA methylation sites
The DNA methylation sites were downloaded through the identified prognostic five DMGs in the TCGA database. We further screened the prognostic DNA methylation sites in five DMGs through univariate Cox regression analysis and Kaplan-Meier survival analysis with the R "survival" package. A P value < 0.05 was defined as statistically significant.

| Correlation analysis of gene expression and DNA methylation sites in HNC
The selected prognostic DNA methylation sites in five DMGs were selected to explore the relationship between mRNAs and DNA methylation sites. Meanwhile, we selected these prognostic DNA methylation sites as our object with a cut-off threshold of the R value in the Pearson's correlation analysis <− 0.3 and the P value < 0.05.
Pearson's correlation analysis was conducted with the R "Cor" function.

| Survival analysis among prognostic DMGs
We selected the genes with hypermethylation/lower expression or genes with hypomethylation/higher expression in HNC. The Kaplan-Meier plot was used to explore the association between mRNA and the hypermethylation/hypomethylation genes. The logrank test was used to compare the survival difference between the HNC group and nontumor group in the overall survival (OS) analysis. The R "Survival" package was used to screen prognostic genes. A P value < 0.05 was regarded as statistically significant.

| Construction of the nomogram model
A nomogram model was constructed in view of the results of the multivariate Cox model. To reduce overfitting bias, the nomogram was bootstrapped by 1000 resampling and quantized by the concordance index (C-index) for verification in the TCGA HNC cohort. The C index value ranged from 0.5 to 1.0, 0.5 denoted no discrimination, and 1.0 denoted complete discrimination (Rao, 2003).
The R "rms" packages were performed for the establishment of the model of the nomogram.

| Validation of the DMGs with Gene Expression Omnibus (GEO) data
To validate the robustness of the hub DMGs from the TCGA data set, the DNA methylation profiles of HNC from the GEO database was searched. To determine eligible studies, we used the following search terms: "head and neck cancer" or "HNSC". We used GEO2R online software to examine the raw data of DNA methylation and screen DMGs. GEO2R is an interactive web tool that permits users to analyze different sample sets in the GEO datasets and select differentially expressed genes under specific conditions. The adj.P.Val < 0.05 and |t| > 2 were used as the cut-off threshold for screening DMG.

| Data
The HNC data set from the TCGA data portal was downloaded. Data on RNA-Seq, DNA methylation, and clinical profiles were collected.
Demographic features are exhibited in Table 1 the TCGA patient ID and column using the following matrix: normalized count (RNA-Seq) and beta value (DNA methylation).

| Correlation analysis between DMG levels and DEG expression
The independent prognostic DMGs were used to examine the correlation between gene expression and DNA methylation. Among

| Construction of five DMG prognostic models
We performed the multivariate Cox regression analysis to validate the above results, and the five DMGs (PAX9, STK33, GPR150, INSM1, and EPHX3) were proven to be prognostic factors for HNC (Table 2).
With the five prognostic DMG models, the risk score value was   3.6 | Functional enriched analysis based on the five DMG model

| Correlation analysis between abnormal mRNAs and DNA methylation sites in HNC
The key DNA methylation sites were used to explore the correlation between mRNAs and DNA methylation sites. The negative correlations between DNA methylation sites and mRNAs were a matter of concern. The 41 DNA methylation sites were identified to negatively correlate with mRNAs, and the r-value in

| Validation of hub genes with GEO data set
One study (GSE38532) was included in the GEO data set.
Analysis of mRNA expression profiles identified a total of 11481 DMGs. Among these genes, 5,686 genes were defined as hypermethylation genes, and 5,795 genes were regarded as hypomethylation genes. Four hub genes (EPHX3, STK33, GPR150, and PAX9) were detected in the GSE38532 data set.
The four DMG methylation level in the GSE38532 is observed in Figure 14.

| DISCUSSION
HNC is noted for its rapid clinical progress and poor prognosis. In the past 40 years, its survival rate has hardly improved. The primary common pathological subtype is squamous cell carcinoma  between OSCC and noncancerous samples (Wang, Ling, Wu, & Zhang, 2013). A study by Hwang et al. (2013) (Cai et al., 2007). The aberrant expression of INSM1 could be found in various cancers, such as head and neck tumors (Rooper, Bishop, & Westra, 2018), neuroendocrine carcinoma (Kuji et al., 2017), Merkel cell carcinoma (Lilo, Chen, & LeBlanc, 2018), and small cell lung cancer (McColl et al., 2017).
A study by Morandi et al. (2017) showed that hypermethylation
These findings help us to speculate on the prognosis of patients with HNC. In addition, these results provide new perspectives in exploring the molecular mechanisms of DMG and site-specific methylation of HNC development.

CONFLICT OF INTERESTS
The authors declare that there are no conflict of interests.

DATA AVAILABILITY
Data availability could be obtained from the TCGA website.