Deciphering molecular properties of hypermutated gastrointestinal cancer

Abstract Great mutational heterogeneity is observed both across cancer types (>1000‐fold) and within a given cancer type, with a fraction harboring >10 mutations per million bases, thus termed hypermutation. We determined the genome‐wide effects of high mutation load on the transcriptome and methylome of two cancer types; namely, colorectal cancer (CRC) and stomach adenocarcinoma (STAD). Briefly, hierarchical clustering of the expression and methylation profiles showed that the majority of CRC and STAD hypermutated samples were mixed and separated from their respective non‐hypermutated samples, exceeding the boundary of tissue specificity. Further in‐detailed exploration uncovered that the underlying molecular mechanism may be related to the perturbation of chromatin remodeling genes.


| INTRODUCTION
A tumor is largely caused by somatic mutations. 1,2 In general, mutation of a few genes is sufficient to initiate tumorigenesis. 1 However, great mutational heterogeneity is observed both across cancer types (>1000-fold) and within a given cancer type. 3 The wide range of mutation frequency within a given cancer type (eg, in melanoma and lung cancer, the frequency ranged between 0.1-100/million bases [Mb]) has prompted researchers to classify cancers into hypermutated and non-hypermutated forms. 4,5 Commonly, a sample is defined as hypermutated if its mutation rate is >10 mutations per Mb. 4,6 Hypermutation can be caused by in vitro (environmental mutagen exposure such as UV light and tobacco smoke) and in vivo factors (dysregulation of apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) family members and defects in DNA polymerase ε (POLE), Polδ1 (POLQ1), or the DNA mismatch repair system (MMR)). 4,7,8 Tumors with a large number of somatic mutations may be susceptible to immune checkpoint blockade (eg, PD-1). [9][10][11] Thus, microsatellite instability-high (MSI-H) status (commonly linked to high mutation burden) may be associated with a better prognosis. 12,13 However, the transcriptome and epigenetic profiles, such as the DNA methylome, in hypermutated samples are poorly understood. Here, we focused on two cancer types (CRC and STAD) with a relatively high proportion of intrinsic hypermutation.

| MATERIALS AND METHODS
2.1 | Hypermutation definition (11/03/2015). Silent mutations and RNA mutations were discarded from further analyses. The hypermutated criterion was defined, which was similar to the work by Campbell et al. 4 Briefly, the breakpoint at which a significant change in the slope occurred was determined to be the hypermutation threshold. Thus, the samples with over 10 mutations per Mb were defined as hypermutated.
2.2 | Gene expression data processing and normalization All level 3 tumor mRNA expression data sets (RNASeqV2) were obtained from the TCGA (October 2015). Known batch effects were corrected using the ComBat function in the Bioconductor sva package. 14 Analysis of differentially expressed mRNA between hypermutated and non-hypermutated was performed using the DEGSeq package for R/Bioconductor. 15 Genes with expression levels <1 (RNA-Seq by Expectation Maximization (RSEM)-normalized counts) in more than 50% of samples were removed. Significant differentially expressed mRNAs were selected according to a false discovery rate (FDR) adjusted P value <0.05 and fold change >2 conditions.

| HM450k data retrieval and process
CRC and STAD level three DNA methylation data (HumanMethyla-tion450) were downloaded from the TCGA data portal (11/13/ 2016). The methylation level of each probe was measured with a beta value ranging from 0 to 1; this is calculated as the ratio of the methylated signal to the sum of the methylated and unmethylated signals. Probes with an "NA" value in more than 10% of the CRC or STAD samples were discarded. Next, the limma Bioconductor package was used to identify differentially methylated sites (DMSs) in the remaining probes. 16 Significant DMSs were selected according to the FDR adjusted P value <1E-20 condition. All heatmaps were generated using the pheatmap package in R (64-bit, version 3.0.2).

| Survival analysis
Genes that correlated with patient survival time in the multivariate Cox regression analysis were determined using the least absolute shrinkage and selection operator (LASSO) method. The best λ was determined by 10-fold cross-validation using the glmnet package built-in function cv.glmnet. 17 For each group, we divided the patients into high-and low-risk groups by calculating the prognostic index (PI) as follows: where n is the number of survival correlated genes, β g is the regression coefficient of the Cox proportional hazard model for gene g, and m gk is the expression level of gene g in patient k.
Patients were then divided into high-and low-risk groups based on the median PI. The survival difference between the two groups (good-and poor-prognosis) was tested by the Kaplan-Meier method and analyzed with the log-rank test with functions survfit and survdiff in the survival package for R. 18 A P value <0.05 was considered significant.

| Hub gene definition
Co-expression network construction was performed as described in our previous work. 19 Hub genes were those with an extremely high level of connectivity in a given network. Connectivity reflects how frequently a node interacts with other nodes and the sum of the weights across all edges of a node. Because some modules (also known as networks) were rather large, we restricted the number of genes in the output of the module. Here, the top 50 genes with the highest connectivity in each network that were reasonable to display were defined as hub genes as previously described. 20 2.6 | Data and code availability    Figure S2). This result strongly suggested that consistent mutation-driven methylation patterns in CRC and STAD were based on an unknown molecular mechanism, irrespective of MLH1 methylation status.  Table 2). Particularly for RNH1

|
and TAGLN3, the discrimination power (C-index value) reached >0.7 with only a single variable ( Figure S3, Table 2). A well-conceived five-gene signature (DNAI2, EPHA5, GAS2L1, RNH1, TAGLN3) was identified that could predict prognosis of hypermutated STAD patients with high performance (C-index: 0.84), but the robustness should be validated in another independent dataset in the future.
These survival-related genes can be prognostic signatures used in 3.6 | Identification of candidate therapeutic target genes in STAD hypermutated samples A mutated gene that is differentially expressed, survival related and acts as a hub node in a regulatory network may serve as a potential therapeutic target in the treatment of hypermutated patients. Notably, we found only one gene, ANK2 (Ankyrin-2) that may serve as a potential therapeutic target in the treatment of hypermutated STADs ( Figure 3A). ANK2 was highly repressed in hypermutated samples compared with non-hypermutated and adjacent normal tissue (P < 0.05, Mann-Whitney test, Figure 3B). Co-expression network analysis revealed that ANK2 was a hub gene associated with some validated tumor suppressors ( Figure 3C). For example, ANGPTL1 and SPARCL1, crosstalking with ANK2, have been shown to attenuate colorectal cancer metastasis in our previous work. 26,27 Hypermutated patients with higher ANK2 expression levels had a significantly better clinical outcome (P = 0.00771, log-rank test, Figure 3D). Currently, there are no ANK2-targeting molecules available in the clinic or even pre-clinical research. But we hope we and someone else will F I G U R E 3 ANK2 is a candidate therapeutic target gene in stomach adenocarcinoma (STAD) hypermutated samples. A, Venn diagram of mutated, differentially expressed, survival related and act as the hub node in a regulatory network identified ANK2 as a potential therapeutic target in the treatment of hypermutated STAD patients. B, ANK2 was highly depressed in hypermutated samples compared with nonhypermutated and adjacent normal tissue (P < 0.05, Mann-Whitney test). C, WGCNA exploration uncovered ANK2 as a hub gene associated with some validated tumor suppressors. D, Higher ANK2 expression level is correlated with a significantly better clinical outcome (P = 0.00771, logrank test) verify this hypothesis using in vivo activation of endogenous target genes through trans-epigenetic remodeling in future work. 28 The overall 5-year survival rates were 68% (95% CI 54% to 85%) compared to 30% (95% CI 12% to 80%) based on the mean expression cutoff in hypermutated STADs.
3.7 | Chromatin remodeling genes may be responsible for the consistent expression and methylation pattern in hypermutated samples The intrinsic molecular similarity shown in high mutational load samples with different tissues of origin prompted us to seek potential drivers. Correlation analysis of mutational profiles, transcriptomes and methylomes showed that inactivation of chromatin remodeling genes (ARID1A and MLL1~4) may account for the consistent expression ( Figure 4A) and methylation ( Figure 4B) patterns in hypermutated samples. ARID1A fits the hypothesis well; it is mutated in 49% of CRC and 76% of STAD hypermutated samples but rarely exists in non-hypermutated cases. This result was shown for MLL2 as well.
We thus concluded that an MSI-H→chromatin remodeling genes (ARID1A and MLL1~4) inactivation→DNA methylation/expression variation axis ( Figure 4C) shared in CRC and STAD hypermutated samples may be associated with the consistent expression and methylation patterns identified in this study.

| DISCUSSION
The consistent expression and DNA methylation patterns in hypermutated samples uncover a somatic driver foregoing tissue specificity. In CRC and STAD, hypermutation is largely contributed by