Somatic mutations in colorectal cancer are associated with the epigenetic modifications

Abstract Colorectal cancer (CRC) mostly arises from progressive accumulation of somatic mutations within cells. Most commonly mutated genes like TP53, APC and KRAS can promote survival and proliferation of cancer cells. Although the molecular alterations and landscape of some specific mutations in CRC are well known, the presence of a somatic mutation signature related to genomic regions and epigenetic markers remain unclear. To find the signatures from a random distribution of somatic mutations in CRCs, we carried out enrichment analysis in different genomic regions and identified peaks of epigenetic markers. We validated that the mutation frequency in miRNA is dramatically higher than in flanking genomic regions. Moreover, we observed that somatic mutations in CRC and colon cancer cell lines are significantly enriched in CTCF binding sites. We also found these mutations are enriched for H3K27me3 in both normal sigmoid colon and colon cancer cell lines. Taken together, our findings suggest that there are some common somatic mutations signatures which provide new directions to study CRC.

tissue-specific driver mutations in non-coding regions. Examples of previous discoveries include recurrent mutations of the TERT promoter, which creates a binding motif for ETS transcription factors significantly increasing TERT transcriptional activity. 10 In addition, somatic mutations in T-cell acute lymphoblastic leukaemia introduce binding motifs for MYB, creating a super-enhancer upstream of the TAL1 oncogene. 11 In the non-coding cancer genome, CCCTC binding factor (CTCF)/cohesin's binding sites (CBSs) are major mutational hotspots. 12 Moreover, somatic mutations in miRNA exhibit potential role in tumorigenesis. 13 Mutations in the miRNA coding regions will alter the expression of the target gene as the sequence of α-miRNA is strictly complementary to the mature miRNA sequence. 14 Here, we carried out enrichment analysis of somatic mutations in different genomic regions and analysed epigenetic marker peaks using 970 560 somatic mutations (covering 948 975 genome loci, chrM and other non-canonical chromatin) from CRCs. The aim of this study was to survey the signature of somatic mutations in a diverse set of colorectal cancer genomes and obtain insights into the signature of somatic mutations in epigenetic markers of genomic regions.

| Data sources and collection
We included unbiased interpretation of somatic mutations from tu- cancer.gov/data/1c8cf e5f-e52d-41ba-94da-f15ea 1337efc). 15 To reduce the false-positive rate, we implemented two strategies to optimize driver detection and data quality. Briefly, we excluded hyper-mutated tumours because of artefact sensitivity to high background mutation rates. All mutations that passed the MC3 filter criteria were included.
Finally, we randomly selected 10 samples to do the following analysis by permutation test. Clinical information on TCGA was downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). The data of histone modifications and chromatin assay come from Cistrome Data Browser.

| RRBS sequencing data analysis
Reduced Representation Bisulfite Sequencing was download from GEO data set GSE95654. The raw paired-end FASTQ reads were trimmed to remove both the adapter sequences and low-quality bases. The alignment of bisulphite-treated short reads to the reference genome hg19 was conducted as described by Cai et al. 16 In brief, two read alignments were carried using the SOAP software to get the best hit for a given pair-end short read. A straightforward seed-and-extension algorithm was then employed for the alignment, with two mismatches allowed in the seed (30 bp) and five mismatches in the whole read. Uniquely aligned reads that contained MspI digestion sites at their ends were retained for further analysis. 17 Bisulphite conversion efficiency was calculated by using the C to T conversion rate for all cytosines in the CHH context (where H = A, T, or C). Even under the assumption that all 5mC in CHH nucleotides were products of conversion failure, the bisulphite conversion rate for each single-base resolution approach was >99%, which ensured that the false-positive rate was <1%.

| ChIP-seq sequencing data analysis
All ChIP-seq sequencing data were mapped to the hg19 genome for human by using Bowtie2 (v2-2.2.4) 18 with parameters '-q --phred33 --very-sensitive -p 10'. Then, we removed duplicated reads for both pair-end and single-end data using SAMtools (v1.5). 19 The bigwig files for IP/input ratio were generated from BAM files by using deepTools2

| Somatic mutation annotations
Somatic mutations were annotated and analysed by ANNOVAR v2018Apr16, 24  SNVs (single nucleotide variants), in-frame indels, as well as variants, predicted to have non-deleterious functional effects or population allele frequencies greater than 10% were not reported.
Mann-Whitney U test was used to calculate the mean mutation rate of International Cancer Genome Consortium (ICGC) and TCGA databases in miRNA regions.

| Motif discovery
De novo motifs were calculated with the HOMER findMotifsGenome.pl command with default parameters. Enrichment of de novo motifs was calculated using the findKnownMotifs.pl program in HOMER with default parameters.

| CpG OE calculation
CpG O/E (observed/expected for CpG) was defined as the ratio of the actual CpG density which represent the composition of nucleotide.
CpG O/E was calculated as follows: where N is the size of the sequence segment (window) in which total nucleotides were analysed. 25 A 400 bp window (N = 400) moving across the sequence at 1 bp intervals was chosen to monitor the characteristic variations.

| Somatic mutations are enriched in miRNA regions
To explore the distribution of somatic mutations, we examined the mutation regions in Homer v4.91 by mutation annotation. We annotated 528 087; 398 899; 12 613; 10 890; 9752; 6854; 1931; 814; 690; and 29 somatic mutations located in intergenic, intronic, exonic, promoter, TTS, 3'UTR, ncRNA, 5'UTR, pseudo-gene and miRNA regions, respectively. We analysed the distribution of mutation frequency around the region body and 5' flanking regions. We observed that the mutation frequency was much lower in the promoter of protein-coding genes than within the gene body ( Figure 1A-B).
The decline of mutation frequency in promoters was not apparent in all genes ( Figure 1C) or non-coding genes such as lncRNA genes ( Figure 1D). Interestingly, we found that the mutation frequency in miRNA regions was dramatically higher than in the flanking regions (Wilcoxon P = 1.46 × 10 −12 ) ( Figure 1E), which was not observed in other subsets of genes. The mutation sites were enriched in the miRNA regions than in random regions with P = 1.63e−9 and P < 2.2e−16, respectively (Figure 2A-B).
To explore whether the mutations of key CRC genes (ie APC, TP53 and KRAS) affect the differential enrichment of somatic mutations in different regions, we performed waterfall plots for top 30 mutated genes in CRC and found the top 10 genes were mutated in more than 10% of CRC tumours. These highly mutated CRC genes include APC, TP53, KRAS, PIK3CA, FAT4, FBXW7, CSMD3, BRAF, LRP1B and SMAD4 ( Figure 2C). Then, we separated somatic mutations into two groups with or without top 10 key CRC genes and performed the composite analysis. Finally, we found the somatic mutations were enriched in both key CRC genes and non-key CRC genes ( Figure 2D-E). However, we did not observe enrichment of mutations in CBS regions for these key CRC genes while mutations were enriched in non-key CRC genes ( Figure Figure S4). Therefore, our observations supported that not only SNPs but also InDels contribute to the specific patterns in the genomic and CBS regions.

| Somatic mutations are enriched in CTCF binding sites
Given genomic CTCF/cohesin-binding sites (CBSs) are frequently mutated hotspots in numerous malignancies, 12

| Somatic mutations are correlated with low CpG O/E value and high-CpG methylation
A high rate of CpG mutations should deplete the frequency of CpG sites so that CpG [O/E] decreases. 26 Therefore, we investigated the relationship between mutation occurrence and CpG content (CG), GC content (GC) and CpG[O/E] in 400 bp centred on each mutation. 16 This revealed the mutation frequency was significantly negatively correlated with OE rather than CG or GC ( Figure 4A). Methylated CpG dinucleotides can lead to 10-fold higher C→T mutation rate than at unmethylated sites, 25,27 less is known about whether and how the methylation level alters the mutation rate, in particular, at single-base resolution. Here, we calculated the mean CpG methylation level of 1k regions centred on each somatic mutation. We found methylation versus the mutation occurrence revealed CRC methylation was much lower than in normal adjacent tissues or human aberrant crypt foci (ACF) samples ( Figure 4B). However, the mean methylation was slightly elevated when the value of mutation occurrence is under 6 ( Figure 4B). Methylation was detected by RRBS (GEO data set GSE95654), which covered high-CpG islands and promoters, requiring additional validation via genome-wide MethylC-seq in CRC.

| Somatic mutations are enriched in poised enhancers marked by H3K27me3
By analysing H3K4me1, H3K4me3, H3K27me3 and H3K27ac in sigmoid colon, HCT116, Caco2 and LoVo cell lines, we plotted the while the mutation rate increases in body regions of predicted general enhancers (from JEME database, https://www.nature.com/artic les/ng.3950) compared with flanking regions ( Figure S6).

| Somatic mutations are oscillated in chromatin open regions
Chromatin organization contributes to regional variation in mutation rate, but differently among mutation types. In both germline mutations and somatic mutations, base substitutions are more abundant in regions of closed chromatin, perhaps reflecting error accumulation late in replication. 29 In contrast, a distinctive mutational state with very high levels of insertion or deletions (indels) and substitutions is enriched in regions of open chromatin. 29 In our study, we found regions of open chromatin show alternately higher and lower mutation frequency, compared to flanking regions ( Figure 6). We define this mutation distribution pattern as oscillation. Consistently, we found somatic mutation fluctuated in flanking regions between normal sigmoid colon and colon tumour cell lines HCT116 and Caco-2 ( Figure 6). Here, we confirmed that in CRC, mutation hotspots enriched at CBSs that disrupt CTCF binding, consistent with previous reports in gastrointestinal cancers (GC). 12 CTCF is a DNA-binding protein essential for the maintenance of genome architecture by mediating inter-chromosomal contacts. 32 Somatic mutations of CBSs may disrupt the CTCF binding leading to dysregulation of gene expression. Compared with gastrointestinal cancers, 25% of all gastric tumours are mutated in at least one of the 11 CBS hotspots. 12 We observed a high frequency of mutation hotspots in the CTCF binding regions of CRC and HCT116, compared to MCF7 or K562. In addition, in GC, microsatellite instability mutation profiles showed a positive association with heterochromatin and repressive domains.

| D ISCUSS I ON
Here, we first verified the relationship between the somatic mutations and histone modifications in CRC by comparing different epigenetic markers between CRC and normal tissues.
Epigenetic modifications, such as histone methylation and acetylation, can act as regulatory switches for gene transcription, and their dysfunction can give rise to developmental abnormalities 33 and carcinogenesis. 34 Previous studies have focused exclusively on the effects of cancer-associated mutations on histones themselves, 35 38 High mutation density of repressive histone mark-associated regions has been reported in previous research. 39 Comparing to active enhancers, poised enhancers may have limited accessibility to DNA repair complexes. In addition, active enhancers give rise to eRNAs, 40 which play active roles in transcriptional regulation. 41 When somatic mutations occur in active enhancers, they are unlikely to be accumulated due to the aberrant transcription. Relative to active enhancers, poised enhancers do not give rise to eRNAs. Thus, somatic mutations in poised enhancers can be possibly enriched. However, whether these mutational signatures or differential peak enrichment for H3K27me3 in normal versus cancerous tissues exists in other human cancers remains unknown.
The formal definition of CpG islands is a region with at least 200 bp, a GC percentage greater than 50%, and CpG O/E greater than F I G U R E 6 Distribution of somatic mutations in and ±5 kbps of peaks of DNase-seq in sigmoid colon. A, Sigmoid; B, HCT116 and (C) Caco-2. colon_CT (light green), colon_GA (yellow) and CRC (blue) somatic mutations represent C/T→G/A, G/A → C/T and all of the somatic mutations, respectively. 'PSS' indicates the peak's starting sites while PES means peak's end sites 60%. 42 This CpG content will change as somatic mutation of CpG islands rates increase. We validated that in CRC, mutation frequency was negatively correlated with CpG O/E value rather than CpG content or GC content. Moreover, we found that the average value of CpG O/E was much lower than in normal colon tissue, which meant the somatic mutation frequency CRC CpG islands was higher than normal colon tissue. Methylated cytosine within a gene can alter its expression levels. 43 In mammals, almost 80% of CpG cytosines are methylated. 44 However, we found that this methylation level was decreased in CRC compared with normal adjacent tissues, which suggested that somatic mutations and methylation of CpG islands may have an impact on CRC tumorigenesis. It is worth noting that allele-specific mutations and genomic imprinting are currently hot topics in research community. We tried to explore whether somatic mutations are enriched on the same allele or different allele of regions marked by high-CpG methylation and H3K27me3. However, we could not have the access to the raw sequencing data from TCGA or ICGC projects and failed to apply the authority of raw data deposited in dbGaP database. We hope we could elucidate their relationships for further research.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
Reduced Representation Bisulfite Sequencing was download from GEO data set GSE95 654.