Multi‐omic analysis in normal colon organoids highlights MSH4 as a novel marker of defective mismatch repair in Lynch syndrome and microsatellite instability

Abstract
Introduction: Lynch syndrome (LS) is a hereditary condition that increases the risk of colorectal cancer (CRC) and extracolonic cancers that exhibit microsatellite instability (MSI‐H). MSI‐H is driven by defective mismatch repair (dMMR), and approximately 15% of nonhereditary CRCs also exhibit MSI‐H. Here, we aimed to better define mechanisms underlying tumor initiation in LS and MSI‐H cancers through multi‐omic analyses of LS normal colon organoids and MSI‐H tumors.
Methods: Right (n = 35) and left (n = 23) colon organoids generated from normal colon biopsies at routine colonoscopy of LS and healthy individuals were subjected to Illumina EPIC array. Differentially methylated region (DMR) analysis was performed by DMRcate. RNA‐sequencing (n = 16) and bisulfite‐sequencing (n = 15) were performed on a subset of right colon organoids. CRISPR‐Cas9‐mediated editing of MMR genes in colon organoids of healthy individuals was followed by quantitative PCR of MSH4. The relationship between MSH4 expression and tumor mutational burden was further explored in three independent tumor data sets.
Results: We identified a hypermethylated region of MSH4 in both the right and left colon organoids of LS versus healthy controls, which we validated using bisulfite‐sequencing. DMR analysis in three gastrointestinal and one endometrial data set revealed that this region was also hypermethylated in MSI‐H versus microsatellite stable (MSS) tumors. MSH4 expression was increased in colon organoids of LS versus healthy subjects and in publicly available MSI‐H versus MSS tumors across four RNA‐seq and four microarray data sets. CRISPR‐Cas9 editing of MLH1 and MSH2, but not MSH6, in normal colon organoids significantly increased MSH4 expression. MSH4 expression was significantly associated with tumor mutational burden in three publicly available data sets.
Conclusions: Our findings implicate DNA methylation and gene expression differences of MSH4 as a marker of dMMR and as a potential novel biomarker of LS. Our study of LS colon organoids supports the hypothesis that dMMR exists in the colons of LS subjects prior to CRC.


| INTRODUCTION
Approximately 15% of colorectal cancers (CRCs) display high levels of microsatellite instability (MSI-H), while the remainder can broadly be deemed microsatellite stable (MSS). In CRC, the vast majority (~80%) of tumors are derived from nonhereditary mechanisms. These tumors evolve primarily as a result of DNA hypermethylation and inactivation of the mismatch repair (MMR) gene MutL homolog 1 (MLH1). 1,2 However, individuals with Lynch syndrome (LS) harbor mutations that lead to a hereditary predisposition to some cancers, including CRC. LS is an autosomal dominant disorder resulting from inherited mutations in the MMR genes MLH1, MutS homolog 2 (MSH2), MSH6, or PMS1 homolog 2, mismatch repair system component (PMS2), or from a rare deletion of the epithelial cell adhesion molecule (EPCAM) gene, which inactivates MSH2. 3 Mutations in MSH2 and MLH1 are responsible for approximately 70% of LS. 4 Inherited mutations in these genes lead to an increased risk of colonic and extracolonic (primarily endometrium and stomach) tumor development. 5 Individuals with LS often present with earlier onset cancers. Differences in lifetime risk estimates for LS subjects are thought to be driven by disease variant pathogenicity, the specific MMR gene involved, and other factors. [6][7][8] One challenge for the clinical management of LS is the need to identify additional molecular markers driving risk. Such markers have the potential to improve screening and provide insight into the early development of cancer.
The MMR system is a highly conserved, postreplicative process involved in maintaining the fidelity of genetic information passed from templates to daughter strands. 9 This process involves a staged, coordinated effort from various MMR proteins. First, surveillance heterodimers MutSα and MutSβ aim to identify specific mismatches. MutSα (MSH2-MSH6) primarily recognizes single-base mismatches or 1-2 nucleotide insertion/deletion loops (IDLs), 9 whereas MutSβ (MSH2-MSH3) has a higher recognition affinity for larger IDLs. 10 Following identification, the appropriate heterodimer will bind to the target site and recruit MutLα (MLH1-PMS2) 10 in a process involving Exonuclease 1 (EXO1). 11 Subsequently, DNA polymerases and replication factors augment re-synthesis at the target sites. Deficient MMR (dMMR) leads to a failure to correct replication errors at microsatellite repeats 12 and results in tumors that exhibit the hypermutator phenotype: MSI-H. Individuals with LS exhibit an increased risk of developing CRC with MSI-H given that they already harbor germline or de novo mutations in one copy of a specific MMR gene. Following a somatic mutation of the normal copy of that MMR gene ("second-hit hypothesis" 13 ), the resulting dMMR affects the cell's ability to repair DNA and results in increased mutational burden. 14 Aberrant levels of DNA methylation are a hallmark of CRC. However, most research into LS has focused on the identification of novel, inherited gene variants. There has been comparatively little research in LS subjects into the molecular mechanisms underlying tumor development that help to drive the dMMR/MSI-H tumor phenotype. Although some studies have attempted to address the role of DNA methylation in LS, these studies have primarily been focused on blood. 15 Given the tissue and cell-specific nature of DNA methylation, blood analyses may not provide appropriate insight into CRC disease pathology, which likely originates from the stem-cell compartment of the colon crypt. In addition, few studies have considered defining the relationship between LS and nonhereditary MSI-H.
In this study, we hypothesized that a multi-omic, comparative analysis of organoids generated from normal colons of LS versus healthy subjects would provide insight into epigenetic and transcriptomic differences occurring in the high-risk, LS population. Colon organoids are an ideal model system in which to study this as they are comprised predominantly of epithelial stem-cell niche cells, which have been hypothesized to be the origin for CRC. 7, 8 We further hypothesized that differences observed in LS organoids may be extended to non-hereditary MSI-H tumors. Our study led to the identification of the meiosis-associated gene MSH4, an MMR gene not previously implicated in LS, as a potential novel marker of dMMR/MSI-H/LS.

KEYWORDS
colon organoids, colorectal cancer, Lynch syndrome, microsatellite instability

| Subject recruitment and exclusion criteria
Subjects scheduled for screening or surveillance colonoscopies were enrolled after providing informed consent under an approved Institutional Review Board protocol at the University of Virginia (IRB-HSR #19710). Subjects were recruited between July 2017 and March 2019 and agreed to donate biopsies from both the right and left colon. Healthy control subjects were excluded from this study if they had a personal or family history of CRC or a personal history of inflammatory bowel disease. All procedures were performed in accordance with relevant guidelines and regulations and were consistent with those required by both the National Institutes of Health and the University of Virginia. Written informed consent has been obtained from the patient(s) to publish this paper and the study was conducted in accordance with the U.S. Common Rule.

| DNA extraction of colon organoids
Genomic DNA was extracted from colon organoids using the Qiagen UCP DNA kit (Catalog No: 56204; Qiagen; Hilden, Germany), with a few exceptions. For elution, a 5 min final incubation of Buffer AUE was preferred to increase yield. Further, the elution step was carried out twice using two volumes of 25 μL. DNA quality was assessed using gel electrophoresis to confirm that DNA was not heavily degraded.

| RNA Extraction and sequencing of colon organoids
Total RNA was extracted using the NucleoSpin RNA XS Kit (Takara Bio: 740990.250). All samples used for library preparation had RNA integrity numbers above 9.8, as measured by Agilent 4200 Tapestation (G2991BA). Library preparation and RNA-seq were carried out according to Illumina protocols following ribosomal depletion at the Northwest Genomics Center of the University of Washington. Paired-end, 100 bp sequencing was performed using the Illumina NovaSeq 6000. An average of 51.98 million reads per sample were uniquely mapped to the GRCh38 reference genome, with an efficiency of 67.78%, using STAR and RSEM. 17,18

| Preprocessing of DNA methylation microarray data in colon organoids
Bisulfite-converted DNA quantity and bisulfite conversion completeness were assessed for each sample using a panel of MethyLight-based real-time PCR quality control assays, as described previously. 19 DNA methylation data were generated using the Illumina Infinium MethylationEPIC Kit (herein EPIC array; Catalog No: 20042130; Illumina; San Diego, California, USA) for right (n = 35) and left (n = 23) colon organoids generated from LS and control individuals at the USC Norris Molecular Genomics Core Facility. For right colon organoids, data were generated in two independent batches and analyzed together. Stratified quantile normalization was preferred as the normalization method for case-control differences. The method was used under default settings while considering sample gender. 20 For this analysis, Sentrix chip and sample positions were used as adjustment factors 21 to account for technical variation prior to the analysis of differentially methylated regions (DMRs). 22

| General steps for processing EPIC array data across cohorts
Samples were excluded if they did not pass sex or ethnicity checks performed in SeSAMe. 23 Samples were processed in minfi using a detection P value threshold of 0.01 for probe removal. 24 Probes were also removed if they (1) contained an SNP at the CpG or at the single-base extension site, (2) were cross-hybridizing, (3) were on either sex chromosome, or (4) were in a non-CpG context. Further, a parallel beta processing of probes was performed using SeSAMe. 23 All additional probes that failed in at least one sample were removed prior to the fitting of the model. Beta values were adjusted for technical variation using the COMBAT function in ChAMP 21 prior to DMR analysis in DMRcate. 22 DMR annotation was also performed using DMRcate. Overlapping bed regions across different analyses were determined through the use of the R package, bedR. 25
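As an illustration only, the probe-level exclusion criteria described above can be written as a single filter. The sketch below is in Python rather than the R packages used in this study, and it is not the authors' pipeline: the `Probe` fields, the example probe identifiers, and the reading that a probe is kept only when its largest detection P value across samples is below 0.01 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical per-probe annotation; field names are illustrative."""
    probe_id: str
    chrom: str
    max_detection_p: float      # largest detection P value across all samples
    snp_at_cpg_or_sbe: bool     # SNP at the CpG or single-base extension site
    cross_hybridizing: bool
    is_cpg_context: bool


def keep_probe(p: Probe, detection_cutoff: float = 0.01) -> bool:
    """Return True if the probe passes every exclusion criterion."""
    if p.max_detection_p >= detection_cutoff:   # failed detection in a sample
        return False
    if p.snp_at_cpg_or_sbe or p.cross_hybridizing:
        return False
    if p.chrom in {"chrX", "chrY"}:             # sex chromosomes
        return False
    return p.is_cpg_context                     # drop non-CpG probes


# Made-up example probes, one per failure mode plus one that passes.
probes = [
    Probe("cg_ok", "chr1", 0.001, False, False, True),
    Probe("cg_lowq", "chr1", 0.20, False, False, True),
    Probe("cg_snp", "chr2", 0.001, True, False, True),
    Probe("cg_sex", "chrX", 0.001, False, False, True),
    Probe("ch_noncpg", "chr3", 0.001, False, False, False),
]
kept = [p.probe_id for p in probes if keep_probe(p)]
```

With these made-up inputs only the first probe survives filtering.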

| Analysis of publicly available DNA methylation data sets
We downloaded and processed data from The Cancer Genome Atlas (TCGA) -colon adenocarcinoma (-COAD) and -stomach adenocarcinoma (-STAD) cohorts [26][27][28] and -Uterine Corpus Endometrial Carcinoma (-UCEC). Raw IDAT files were downloaded from TCGAbiolinks. 29 Consensus purity estimates (CPE) 30 and mDNAsi scores 31 were downloaded for each sample. Numerous purity measures were missing for STAD, so this covariate was excluded. Functional normalization, which was designed for the analysis of cancer data sets, was applied using minfi 24 while specifying a gender. Principal component analysis was performed to identify initial outliers in technical replicates. In their absence, the replicate with the lowest mean DNA methylation level was excluded. Sentrix chips were excluded if they contained only data from either MSI-H or MSS/MSI-Low (MSS/MSI-L) tumors.

| Sample selection and preprocessing of external, publicly available DNA methylation datasets

| TCGA-COAD
For the analysis of TCGA-COAD DNA methylation data, a total of 11 Asian samples were removed to reduce heterogeneity, as they were deemed too small a subset. Samples were also excluded if colon location data were absent. A total of 149 samples were included in this analysis, with 36 originating from MSI-H tumors. Sentrix chip and position were used as adjustment covariates for the COMBAT function of the ChAMP package. 21

| TCGA-STAD
Samples without race, MSI status, mDNAsi, stage, age, or gender information were excluded. Samples were then checked for MSI status representation across each Sentrix chip. A total of 202 samples were considered for analysis.

| TCGA-UCEC
A total of 295 tumors were considered for inclusion. Samples were removed if they were (1) not of endometrial origin, (2) not classified as "endometrioid adenocarcinoma" or "serous cystadenocarcinoma," or (3) missing data for CPE, race, mDNAsi, MSI status, age, or tumor stage. Samples were then analyzed for adequate representation of MSI status across Sentrix IDs.

| GSE68060
Raw IDAT files containing red and green probe intensities were downloaded from the Gene Expression Omnibus (GEO), 32 accession GSE68060. Data were processed in a manner similar to TCGA cancer cohorts. To limit the effects of genetic ancestry on baseline DNA methylation levels, our analysis was limited to the study of Spanish patients. Samples were removed if they (1) presented as large outliers on PCA, (2) had missing information on MSI status, (3) failed sex checks, or (4) were placed on a Sentrix chip that contained only MSI-H or MSS/MSI-L data. This limited the final study population to 23 samples. As previously, beta values were adjusted for Sentrix chip prior to DMR analysis.

| Regression models used for DMR identification
DMRcate 22 was used under default settings, except that lambda = 1000, C = 4, and the minimum number of probes required for a DMR was set to 7. C was set to 2 for HM450 analysis. An absolute mean beta value shift of 5% was required for DMR reporting. Regression models were fitted for beta values for (1) right colon organoid, (2) left colon organoid, (3) TCGA-COAD, (4) TCGA-STAD, (5) TCGA-UCEC, and (6) GSE68060.
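The two reporting thresholds above (at least 7 probes per region and an absolute mean beta shift of at least 5%) amount to a simple post hoc filter on candidate regions. The following Python sketch shows that filter on made-up records; the dictionary keys and region names are hypothetical and do not correspond to DMRcate's output columns.

```python
def reportable_dmrs(dmrs, min_probes=7, min_abs_shift=0.05):
    """Keep regions with enough probes and a large enough mean beta shift.

    `dmrs` is a list of dicts with hypothetical keys:
      'region'         - label for the candidate region
      'n_probes'       - number of array probes in the region
      'mean_beta_diff' - mean beta difference (cases minus controls)
    """
    return [
        d for d in dmrs
        if d["n_probes"] >= min_probes
        and abs(d["mean_beta_diff"]) >= min_abs_shift
    ]


# Made-up candidates: one passes, one has too few probes,
# one has too small a shift, one is a strong hypomethylated region.
candidates = [
    {"region": "regionA", "n_probes": 12, "mean_beta_diff": 0.09},
    {"region": "regionB", "n_probes": 5, "mean_beta_diff": 0.20},
    {"region": "regionC", "n_probes": 9, "mean_beta_diff": 0.02},
    {"region": "regionD", "n_probes": 8, "mean_beta_diff": -0.11},
]
reported = [d["region"] for d in reportable_dmrs(candidates)]
```

Because the shift is taken as an absolute value, both hypermethylated and hypomethylated regions can be reported.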

| Bisulfite sequencing and data processing of the MSH4 locus
Paired-end reads were sequenced for each sample across the MSH4 locus (chr1:76,262,302-76,262,889; 587 bp). Given the length of the region, three overlapping libraries were constructed. For each, a methylated region of DNA obtained after treatment with the M.SssI methylase was also sequenced, and this served as a positive control for adequate bisulfite treatment at each site. Reads were processed using trim_galore. 33 Bases with a quality score of less than 30 were trimmed; 6 bp of the 3′ ends of both reads R1 and R2 were also trimmed. Samples were aligned to an in silico bisulfite-treated version of hg19 through Bismark 34 with the following parameters: --score_min L,0,-0.4 --non_directional. DNA methylation values were then extracted using Bismark. Data were then imported into methylKit for regression analysis. 35 In the case of overlapping sites across libraries, the library with the highest DNA methylation level in the M.SssI-treated sample at that site was considered. Differential DNA methylation was calculated using the chi-squared test in methylKit. Bisulfite conversion appeared to fail for one sample, which was removed prior to analysis.
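To make the per-cytosine comparison concrete, the sketch below computes a plain Pearson chi-squared test on a 2 x 2 table of methylated and unmethylated read counts pooled by group. This is a generic textbook calculation written in Python for illustration; it is an assumption that it resembles the test applied here, and methylKit's implementation may differ (for example, in how replicates or overdispersion are handled). The read counts are made up.

```python
import math


def chi2_2x2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Pearson chi-squared test (1 degree of freedom) on a 2 x 2 count table.

    Rows are groups (a, b); columns are methylated / unmethylated reads.
    Returns (statistic, p_value). Every row and column total must be non-zero.
    """
    table = [[meth_a, unmeth_a], [meth_b, unmeth_b]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up counts: 80% methylated reads in one group versus 50% in the other.
stat, p_value = chi2_2x2(80, 20, 50, 50)
```

A per-site test like this would be repeated for each cytosine, with multiple-testing correction applied afterward.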

| RNA-sequencing analysis of publicly available cohorts
Three TCGA data sets (COAD, STAD, and UCEC) were selected for RNA-sequencing (RNA-seq) analysis to measure differences in expression between MSI-H and MSS/MSI-L tumors. For TCGA-COAD, data analysis was performed on 294 samples, as previously described. 36 Unless otherwise stated, samples were excluded if they belonged to a duplicate already used within the study or if MSI information was missing. Cancer stages were broadly categorized into main hierarchical groupings (Stage I-V); subclasses (for example, Stage IIA) were grouped into the main class (Stage II). For TCGA, raw HT-Seq counts were downloaded from the R package TCGAbiolinks. 29 As with our previous study in TCGA-COAD, MSI-low and MSS samples were grouped together for comparison to MSI-H.
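The two recoding steps described above (collapsing stage subclasses into their main class, and pooling MSI-low with MSS) can be sketched as follows. This is an illustrative Python example only; the input label formats ("Stage IIA", "MSI-H", "MSI-L", "MSS") are assumed for the example and may not match the exact strings in the clinical files.

```python
import re

# Main class numeral followed by an optional subclass letter (A-C).
_STAGE_PATTERN = re.compile(r"\s*Stage\s+(IV|III|II|I|V)[A-C]?\s*", re.IGNORECASE)


def collapse_stage(stage: str) -> str:
    """Map a stage subclass such as 'Stage IIA' to its main class 'Stage II'."""
    match = _STAGE_PATTERN.fullmatch(stage)
    if match is None:
        raise ValueError(f"Unrecognized stage label: {stage!r}")
    return "Stage " + match.group(1).upper()


def msi_group(status: str) -> str:
    """Pool MSI-low with MSS for comparison against MSI-H."""
    if status == "MSI-H":
        return "MSI-H"
    if status in {"MSI-L", "MSS"}:
        return "MSS/MSI-L"
    raise ValueError(f"Unrecognized MSI status: {status!r}")


examples = [collapse_stage(s) for s in ["Stage IIA", "Stage IIIC", "Stage IV"]]
```

Raising an error on unrecognized labels, rather than guessing, mirrors the exclusion of samples with missing stage or MSI information.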

| TCGA-UCEC
Samples were excluded from TCGA-UCEC if they were not diagnosed with either endometrial adenocarcinoma or serous cystadenocarcinoma, were not resected from endometrium, did not have existing consensus purity estimates, or did not have sample age or race reported. A total of 440 samples were used in this analysis. Cell composition estimates for six cell populations (ciliated and unciliated epithelial cells, fibroblasts, endothelial cells, lymphocytes, and macrophages) were calculated using CIBERSORTx 37 with single-cell data derived from a previously published data set. 38

| TCGA-STAD
A total of 342 samples were included in the final regression. Single-cell deconvolution was performed using single-cell data generated by Zhang et al. 39 Cell scores were generated for T, epithelial, mast, macrophage, B, endothelial, fibroblast, endocrine, and parietal cells using a single-cell data matrix consisting of 8563 cells.

| Cell composition analysis
Raw counts were downloaded from TCGAbiolinks. 29 For TCGA-COAD, data analysis was performed as previously described. 36 Single-cell deconvolution performance was determined first by CIBERSORTx using transcripts-per-million values 37 and second in-house, through manual inspection of the correlation of gene expression markers with cell scores. TCGA-UCEC was processed under largely default parameters, except for the following: minimum expression = 0.6; barcode gene range = 100-600. Similarly, TCGA-STAD data had the following exceptions: minimum expression = 0.7; barcode gene range = 300-500. For both studies, Q values were set to 0.01, S-mode batch correction was applied, cell scores were quantified in absolute mode, and 500 permutations were used for cell score quantification.

| Regression analysis: RNA-seq data
Raw counts were imported into DESeq2 and the optimized FDR thresholding approach was used to determine gene significance. 40 Cell composition scores were used to adjust for the effects of cellular heterogeneity on the differentially expressed gene (DEG) reporting for TCGA-UCEC and TCGA-STAD data. 40 This method was previously used for our analysis of TCGA-COAD data. 36 For LS colon organoids, gene counts were generated for each sample and analyzed in DESeq2. 40 A summary of the covariates used in each regression was as follows:

| Microarray analysis of publicly available data
For microarray studies, data were processed in a manner dependent upon the array manufacturer. Data were downloaded either from Gene Expression Omnibus 32 or ArrayExpress. 41 For E-MTAB-8148 42 (Illumina), probes were removed if they had a detection P value of <0.05 in at least one sample. Background correction and quantile normalization were then carried out in limma 43 using negative control probes for correction and both negative and positive controls for normalization. Affymetrix array data were downloaded from E-GEOD-41258, 44 E-GEOD-26682, 45 and GSE13294. 46 E-GEOD-26682 45 contained data generated on two different array platforms. As such, independent regressions were performed for each subset. For each study, PCA was performed to identify potential outliers. Samples were also removed following a visual inspection of median probe intensities, if the array did not contain representation of both MSI-H and MSS tumors, or if they were missing relevant covariates that were used in the regression of each study. Probes that mapped to more than one unique identifier were removed from downstream analysis. Probes were also filtered based on a study-specific minimum cutoff for intensity levels. A total of 140 and 159 samples survived preprocessing in "batch33" and "batch44" for E-GEOD-26682, 45 respectively. A total of 211, 147, and 155 samples were used for E-MTAB-8148, 42 E-GEOD-41258, 44 and GSE13294, 46 respectively.

| Microarray regression analysis of publicly available data
Regression analysis was performed for each microarray data set independently using linear models in limma 43 on log-transformed expression values. A summary of the covariates used in each regression is as follows:
The R package metaSeq 47 was used for meta-analysis of microarray data using the other.oneside.pvalues() function. Genes were not filtered based on p-value prior to meta-analysis. Instead, genes were only considered if they displayed concordant directions of effect across each cohort. Independent weights were assigned as continuous variables which corresponded to the study size of each cohort included in the meta-analysis.
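The logic of the meta-analysis (require a concordant direction of effect in every cohort, then combine one-sided p-values with study-size weights) can be illustrated with a generic weighted Stouffer Z combination. The Python sketch below is not the metaSeq implementation, and it is an assumption that a weighted Z approach with sample sizes used directly as weights approximates what was computed here; the effect sizes and p-values are made up, while the two sample sizes are those reported for the two E-GEOD-26682 batches.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def concordant(effects):
    """True if every cohort shows the same (non-zero) direction of effect."""
    return all(e > 0 for e in effects) or all(e < 0 for e in effects)


def weighted_stouffer(one_sided_p, weights):
    """Combine one-sided p-values (each strictly between 0 and 1).

    Each p-value is converted to a Z score, the Z scores are averaged with
    the given weights, and the combined Z is converted back to a p-value.
    """
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in one_sided_p]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return 1.0 - _STD_NORMAL.cdf(numerator / denominator)


# Made-up per-cohort results for one gene (log fold changes and p-values).
log_fold_changes = [0.42, 0.31]
upper_tail_p = [0.01, 0.02]
study_sizes = [140, 159]

combined_p = (
    weighted_stouffer(upper_tail_p, study_sizes)
    if concordant(log_fold_changes)
    else None  # discordant genes are not considered
)
```

Genes with discordant directions are skipped before combination, matching the filtering rule described above.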

| Survival analysis of TCGA data sets
We downloaded data on overall survival (OS) from cBioPortal 48 for TCGA-COAD, TCGA-STAD, and TCGA-UCEC. Cox proportional hazards models were fitted to test for the association between MSH4 expression (tertile: low, medium, high) and OS using the coxph() function of the survival package (version 3.5-5). To account for potential differences between survival metrics and factors such as MSI status, 49 we adjusted for study-specific covariates:

| CRISPR-Cas9 editing of MMR genes and analysis of effects on target gene and MSH4 expression
Clustered regularly interspaced short palindromic repeats (CRISPR) guide RNAs (gRNAs) targeting candidate genes (at least two gRNAs per gene) were purchased from Synthego (Menlo Park, CA). Guide sequences were as follows. MSH2: gRNA1: 5′ UCA AAC UGA GAG AGA UUG CC 3′, gRNA2: 5′ GUU AAA AUG UCC GCA GUU GA 3′, gRNA3: 5′ GAU UCC AUA CAG AGG AAA CU 3′ (deletion size: ~0.135 kb); MSH6: gRNA1: 5′ AAC AGU UGU GAC UUC UCA CC 3′, gRNA2: 5′ AGG CUU UUA AAG CCA UAU AC 3′ (deletion size: ~0.200 kb); MLH1: gRNA1: 5′ ACU GAU AGA AAU UGG AUG UG 3′, gRNA2: 5′ CUU CAC UGA GUA GUU UGC AU 3′ (deletion size: ~0.06 kb). Chemical modifications (2′-O-methyl at the first three and last three bases, and 3′ phosphorothioate bonds between the first three and last two bases) were introduced into the gRNAs to improve editing in the organoid lines used (Synthego). No suitable gRNAs were available for PMS2. The Cas9 2NLS Nuclease was purchased from Synthego. Organoid lines were electroporated with gRNA and Cas9 (1:3 ratio) using the Neon Transfection System (Thermo Fisher, MPK5000S). Electroporated cells were allowed to grow for approximately 7 days prior to DNA and RNA harvesting. CRISPR editing of each MMR gene was performed in paired organoid lines derived from right and left colon biopsies of three different individuals. Genomic DNA was purified using the QIAamp DNA Mini Kit (Qiagen) and DNA deletions were confirmed by PCR amplification. MSH2: forward primer: 5′ CAG CTT CCA TTG GTG TTG TG 3′, reverse primer: 5′ GGG GAG AAA AGA TCT GAG GT 3′ (amplicon size ~0.45 kb); MSH6: forward: 5′ ATC TGA GGG GGA TTG GTT GC 3′, reverse: 5′ CAT GCC AGG CTG TTG ATG TC 3′ (amplicon size ~2.0 kb); MLH1: forward 5′ GAG GAC CTC AAA TGG ACC AA 3′, reverse 5′ AAC CAA ACT TTG CCA TGA GG 3′ (amplicon size ~0.39 kb). CRISPR-Cas9 editing of MSH4 was also attempted. However, we were unable to achieve successful editing of this gene using the gRNAs tested (data not shown).
The evaluation of MSH5 expression (partner to MSH4) was restricted to right and left colon organoids of one individual following MSH2 CRISPR editing and was considered as a negative control. This individual was chosen based on the availability of sufficient RNA in both left and right colon organoids for control and CRISPR-edited lines. Experiments were performed in triplicate to address reproducibility.
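As a worked illustration of how the PCR confirmation can be read, the approximate size of the edited band can be estimated by subtracting the expected deletion from the unedited amplicon size. The Python sketch below uses the approximate sizes reported in this section; it assumes that each deletion lies entirely within its amplicon, which is not stated explicitly, so the outputs are rough expectations only.

```python
# Approximate sizes (kb) as reported for each gene in this section.
AMPLICON_KB = {"MSH2": 0.45, "MSH6": 2.0, "MLH1": 0.39}
DELETION_KB = {"MSH2": 0.135, "MSH6": 0.200, "MLH1": 0.06}


def expected_edited_band_kb(gene: str) -> float:
    """Estimated band size after deletion, assuming the deleted segment
    falls entirely between the two PCR primers (an assumption)."""
    return round(AMPLICON_KB[gene] - DELETION_KB[gene], 3)


expected_bands = {gene: expected_edited_band_kb(gene) for gene in AMPLICON_KB}
```

Under that assumption, a successfully edited MSH2 line would be expected to show a band near 0.315 kb alongside, or instead of, the ~0.45 kb unedited band.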

| Quantitative Real-Time PCR
RNA was isolated using Trizol reagent (Thermo Fisher: 15596026) and cDNA was synthesized from 2 μg of total RNA using the High-Capacity Reverse Transcriptase cDNA kit (Thermo Fisher: 4368813). Quantitative real-time polymerase chain reaction (qRT-PCR) was performed using the Superscript III kit for RT-PCR (Thermo Fisher: 18080051) and amplified using TaqMan assays for the following genes: MSH2 (Hs00953527_m1), MSH4 (Hs00172489_m1), MSH5 (Hs00159268), MSH6 (Hs00943000_m1), and MLH1 (Hs00179866_m1), and the internal control, Glucuronidase Beta (GUSB; Hs00939627_m1). Statistical significance was determined by controlling for subject identifiers in a linear mixed effects regression model using the lmerTest package.
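The quantification formula is not stated in this section. A common way to express target gene expression relative to an internal control such as GUSB is the comparative Ct (2^-ΔΔCt) calculation, sketched below in Python as an assumption rather than a description of the analysis actually performed. The Ct values are made up.

```python
def relative_expression(ct_target_edited, ct_control_gene_edited,
                        ct_target_unedited, ct_control_gene_unedited):
    """Fold change of a target gene in edited versus unedited organoids.

    Uses the comparative Ct method: each target Ct is first normalized to
    the internal control gene (delta Ct), and the edited line is then
    compared with the unedited line (delta-delta Ct).
    """
    delta_ct_edited = ct_target_edited - ct_control_gene_edited
    delta_ct_unedited = ct_target_unedited - ct_control_gene_unedited
    delta_delta_ct = delta_ct_edited - delta_ct_unedited
    return 2.0 ** (-delta_delta_ct)


# Made-up Ct values: the target crosses threshold two cycles earlier
# in the edited line, with the internal control unchanged.
fold_change = relative_expression(28.0, 20.0, 30.0, 20.0)
```

A two-cycle decrease in normalized Ct corresponds to a fourfold increase in relative expression.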

| Outline of study to define cancer risk genes
Although much attention has been paid to the evaluation of omic markers of disease, a better understanding of gene-level changes occurring prior to disease onset may help in the identification of at-risk populations through simple, targeted screening approaches. An overview of our approach to define risk genes associated with MSI-H tumors and dMMR is outlined in Figure 1. Various drug combinations have been shown to be effective for MSS and MSI-H tumor biology (Table S1). Drug use was not considered a covariate in our regression analysis.

| Comparing DNA methylomes of organoids from normal colons of Lynch syndrome and healthy subjects
Aberrant levels of DNA methylation are a hallmark of CRC. 73 To determine whether organoids derived from the normal colons of LS versus healthy subjects displayed significant epigenetic profiles relevant to disease, we first performed an epigenome-wide analysis of DNA methylation (Illumina Infinium MethylationEPIC array, EPIC) in 58 organoid lines derived from 42 unique individuals (Table S2). Individual DMR analyses were carried out in right and left colon organoid subsets to identify differences between LS and healthy subjects. In our analysis of right colon organoids (n = 35), we identified 241 DMRs that survived false discovery rate (FDR) correction (Table S3). This included a number of genes not previously associated with LS, such as DNA hypermethylation of MSH4 (FDR = 2.31E-06) (Figure 2). Notably, MSH4 is a member of the MutS family of MMR genes. However, it has not previously been implicated in LS and has been reported to play a role exclusively in meiosis. 74 A similar analysis for LS versus healthy colon organoids was performed in a largely overlapping cohort of left colon organoids (n = 23). Here, we identified 202 DMRs (Table S4). Of these, only 25 were also significant in right colon organoids (Table 1), including LS-specific DNA hypermethylation of the MSH4 locus (FDR = 6.72E-04).

| Bisulfite sequencing of MSH4 locus
We bisulfite-sequenced a 587 bp region (chr1: 76262302-76262889) and generated data for 27 individual cytosines across 15 samples (Table 2) to technically validate the MSH4 locus in a subset of right colon organoids. Eighteen cytosines were significantly different between LS and healthy controls (n LS = 7; n CTL = 8). All but one of the 27 sites were hypermethylated in the LS colon. Of the eight sites that overlapped with cytosines on the Illumina EPIC array, six were significant and followed the same direction of effect.

| Analysis of DMRs in MSI-H tumor data sets
To define the relationship between LS DMRs and DNA methylation associated with MSI-H tumors, we first aimed to define consistent differences occurring between MSI-H and MSS/MSI-L tumors across three gastrointestinal cancer cohorts (see Methods). We identified 2519 DMRs in TCGA-COAD (n = 149), 3986 DMRs in GSE68060 75 (n = 23), and 3144 in TCGA-STAD (n = 202) data sets (Table S5). To determine the confidence of the association between these DMRs and tumor biology, DMRs were grouped into five categories (see Methods). Next, we related these tumor DMRs to those previously identified in our analysis of LS versus healthy colon organoids. Of the 241 DMRs identified in right colon organoids, 27 were found to be associated with MSI-H status, with MSH4 being the highest confidence DMR identified (Table 3). Of the 202 DMRs identified in our analysis of left colon organoids, 45 were found to be associated with MSI-H status (Table 3), including MSH4 as well as three additional highest confidence DMRs: chr14:24,779,793-24,780,926, which corresponded to multiple genes; chr17:37,123,638-37,124,209 (F-box protein 47 (FBXO47)); and chr15:51,973,083-51,973,591 (Secretogranin III (SCG3)).
FIGURE 1 Diagram to show the role of the various cohorts used within each of the various stages of the current study. Broadly, the study can be split into data generated from two omic layers (gene expression and DNA methylation) and three stages (identification, contextualization, and evaluation).
A subset of endometrial cancers also exhibits MSI-H, so we downloaded and processed DNA methylation data to determine if similarities exist with gastrointestinal tumors: TCGA-UCEC (see Methods). Eight right-side (Table 3, bold font) and eight left-side (Table 3, bold font) DMRs identified in our analysis of LS versus healthy colon organoids overlapped MSI-H-related DMRs in TCGA-UCEC, including MSH4. Together, these data reveal a consistent role for differential methylation of MSH4 in MSI-H tumors across tumor locations.
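Relating organoid DMRs to tumor DMRs is, at its core, an interval-overlap problem: a region is shared if it intersects a region from the other analysis on the same chromosome. The Python sketch below illustrates that idea with made-up coordinates; it is not the bedR-based procedure used in this study, and the tuple layout (chromosome, start, end) is an assumption for the example.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def shared_regions(query, reference):
    """Return the query regions that overlap any reference region."""
    return [q for q in query if any(overlaps(q, r) for r in reference)]


# Made-up coordinates for illustration only.
organoid_dmrs = [
    ("chr1", 1000, 1600),
    ("chr2", 5000, 5400),
    ("chr3", 900, 1200),
]
tumor_dmrs = [
    ("chr1", 1500, 2100),   # overlaps the first organoid region
    ("chr2", 6000, 6500),   # same chromosome, no overlap
    ("chr4", 900, 1200),    # same coordinates, different chromosome
]
shared = shared_regions(organoid_dmrs, tumor_dmrs)
```

The pairwise scan is quadratic in the number of regions, which is acceptable for a few hundred DMRs; sorted or indexed intervals would be preferable for larger sets.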

| Evaluation of expression of MSH4 and other DMR candidate genes in LS and tumor data sets
The governance of gene expression is a primary role of DNA methylation. To determine the relationship between gene expression and DNA methylation of MSH4 and other MSI-H-related DMRs, we performed the analysis of three publicly available TCGA cohorts with available RNA-seq and MSI-H data (TCGA-COAD, TCGA-STAD, and TCGA-UCEC) using DESeq2, 40 while adjusting for cell composition (see Methods). Our analysis in TCGA-COAD has previously been reported. 36 As our primary goal was to better define the role of LS DMRs in cancer risk, we limited our analysis to genes corresponding to DMRs identified in Table 3 (Table S6). We confirmed this MSH4 finding in a meta-analysis of MSI-H vs MSS tumors using five large (n > 100) microarray colon data sets (Table S7). 42,[44][45][46][47] Neither SCG3 nor FBXO47 was identified in our meta-analysis, though a nominal reduction for SCG3 was observed in E-MTAB-8148 (P = 0.031).
We also performed an RNA-seq analysis of a largely overlapping cohort of right-sided colon organoids of LS (n = 7) versus healthy subjects (n = 9). We identified 811 nominal DEGs (p < 0.05; Table S7). Limited overlap was found between these DEGs and DMRs associated with the MSI-H phenotype. However, our analysis did reveal an increase in MSH4 (p = 7.04E-03) expression in LS versus healthy subjects. RBAK downstream neighbor (RBAKDN, p = 0.027) expression was also increased in LS colon organoids. RBAKDN corresponds to a low confidence DMR that was also found to be hypomethylated (Table S5) and concomitant with increased gene expression (Table S6) specifically in TCGA-COAD MSI-H tumors.
FIGURE 2 DMR plot of MSH4 region found to be significantly different between LS and healthy subjects in right colon organoids and left colon organoids. Beta values were plotted following adjustment for technical variation by COMBAT.
TABLE 1 (A) Summary of DMRs identified in right colon organoids of LS versus healthy subjects that were also significant in left colon organoids. (B) Summary of DMRs identified in analysis of left colon organoids of LS versus healthy subjects that were also significant in right colon organoids.

| CRISPR deletion of MMR genes leads to increased expression of MSH4
To validate MSH4 as a marker of dMMR we performed CRISPR-mediated deletion of LS-related MMR genes (MSH2, MLH1, and MSH6) in matched right and left colon organoids from three healthy subjects (see Methods). PMS2 was not considered because of a highly homologous pseudogene. We used gRNAs to delete ap-

Individuals with MSI-H tumors have been shown to present with a high tumor mutational burden (TMB). We, therefore, also downloaded individual-level data on this measure from cBioPortal. 48 Linear regressions of MMR genes on TMB revealed that, among all MMR genes, MSH4 expression was the most significantly associated with TMB in TCGA-COAD (P = 9.15E-11) (Figure 4) and the second most (after MLH1) in TCGA-STAD (P = 5.72E-09). An association between MSH4 and nonsynonymous TMB was only identified in TCGA-UCEC following the removal of individuals defined by POLE subtype (P = 8.63E-03). Finally, Cox proportional hazards models were fitted to interrogate the relationship between survival outcomes and MSH4 expression (low, high as defined by median cutoff) while accounting for adjustment covariates. Here, we found that high MSH4 expression trended toward poorer OS in TCGA-COAD (p = 0.053), TCGA-STAD (p = 0.068), and TCGA-UCEC (p = 0.064).
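For readers unfamiliar with the regression step, the sketch below fits a simple least-squares line relating gene expression to TMB and reports the slope, intercept, and R-squared. It is a minimal, unadjusted Python illustration on made-up numbers: it treats TMB as the predictor and expression as the response, following the wording above, and it omits any covariates or transformations that may have been applied in the actual analysis.

```python
def ols_fit(x, y):
    """Simple linear regression of y on x.

    Returns (slope, intercept, r_squared). Requires at least two points
    and non-constant x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up values: TMB (mutations per megabase) and log-scale expression.
tmb = [1.0, 2.0, 3.0, 4.0]
expression = [3.0, 5.0, 7.0, 9.0]
slope, intercept, r_squared = ols_fit(tmb, expression)
```

In practice, a fit like this would be repeated for each MMR gene and the resulting P values compared across genes.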

| DISCUSSION
CRC is a highly heterogeneous disease and fundamental differences in CRC tumor biology exist between MSI-H and MSS CRC subgroups. 36 For example, MSI-H tumors are more mucin-rich and reveal a higher number of tumor-infiltrating lymphocytes than MSS tumors. 4 MSI-H tumors also display a higher tumor mutational burden (TMB) than MSS. 76 LS-related MSI-H tumors occur at similar rates in both the right and left colon. 77 Despite these differences, we have shown that several epigenetic and transcriptomic similarities are present between nonhereditary MSI-H tumors and epithelial cells derived from the normal colon of LS individuals. Understanding the molecular mechanisms contributing to MSI-H tumors may help improve personalized treatment options. Identifying disease-relevant genes that precede CRC onset in individuals at high risk of developing cancer such as LS may also lead to improved diagnoses, for example, in the absence of a known germline mutation, and improved outcomes. Therefore, we propose that the MSI-H-related differences in normal colon organoids of cancer-free LS subjects identified here have the potential to serve as useful clinical biomarkers for the at-risk LS population. Most notably, our findings support an important correlation between MSH4 DNA methylation and gene expression occurring in dMMR/MSI-H across different cancer types and in both hereditary/nonhereditary conditions. Our findings strongly support the role of MSH4 in dMMR.

Note: Positive beta differences indicated DNA hypermethylation of the cytosine at that site in LS versus healthy colon organoids. Probes were cross-referenced to their presence on the EPIC array (X) following initial filtering in the right colon organoid analysis.
Given the important role of dMMR in numerous cancers, future studies should aim to determine whether differences in MSH4 expression may be used in combination with other screening-based approaches to identify and support individuals in the general population with dMMR. Such a strategy may be particularly useful for individuals suspected of LS with no known causal variant. It may also be important given that our findings suggest a possible relationship between increased MSH4 expression and OS in each TCGA cohort. Though no finding reached statistical significance (p < 0.05), each trended in the same direction, with individuals whose MSH4 expression was above the median displaying poorer OS after adjustment for factors such as tumor stage and MSI status. However, given the high degree of missingness of survival data and the relatively limited number of MSI-H tumors in each cancer type, this analysis was limited to tumors across both MSI-H and MSS subgroups. It remains possible that the effects on survival may be greater in one subgroup, and larger studies should consider this when conducting future analyses.
The sample size of our primary analysis was relatively small. Further, cancer biology is highly heterogeneous. As such, we focused on differences that were consistent across data sets and omic layers. The most consistent association we found in LS and between MSI-H and MSS tumor status was MSH4. MSH4 was the only DMR identified in both right and left colon organoids of LS subjects that was of "highest confidence" in our subsequent analysis of MSI-H vs MSS/MSI-L tumors. This finding was validated using previously published data in MSI-H tumors and across omic layers. MSH4 has not previously been implicated in LS, and previous studies have considered it to play a role only in meiosis. 74 Interestingly, a recent study in bladder cancer provides some genetic evidence for an association between MSH4 and MSI-H. 78 To the best of our knowledge, a role for MSH4 in LS has yet to be defined.
To confirm the correlation between MSH4 expression and dMMR, we performed CRISPR-Cas9 deletion of three known LS-related MMR genes and measured the effect on MSH4 gene expression. To reduce the potential of other MMR genes driving any observed differences, these experiments were conducted in colon organoids derived from three healthy individuals. MSH4 expression was significantly increased following the CRISPR-mediated deletion of MLH1 and MSH2 in organoids derived from the right colon. A nonsignificant increase in MSH4 expression was also observed following MSH2 editing in all three matched left colon organoids. Although we did not observe a significant increase in MSH4 expression following CRISPR editing of all three LS-related genes, we note that gene expression is tightly regulated by many genetic and nongenetic factors. Our confirmation of a correlation between increased MSH4 expression and LS-related MMR genes has important biologic implications and strongly supports a role for MSH4 as a relevant marker of dMMR. Any causal relationship between MSH4 expression and dMMR has yet to be determined.
The strong relationship between MSH4 expression and dMMR also allows us to address early progression in LS. It has long been hypothesized that dMMR in LS patients was a secondary process in the development of CRC in LS. 6,79 It has been suggested that factors including the development of adenomatous polyposis coli (APC) mutations were responsible for adenoma initiation in LS patients and that dMMR occurred only after their initiation. This was partly based on findings that LS mutations did not increase the rate of adenoma initiation but instead accelerated the process of continued adenoma development. 79-81 However, the belief that dMMR is only a secondary event in CRC has been challenged by the identification of dMMR crypt foci in LS patients in the absence of CRC. 6,82,83 Thus, it has been proposed that dMMR may indeed initiate adenoma formation, which then either requires a "second hit" or may lead to CRC directly. 6,79 By identifying MSH4 as a robust marker of dMMR and determining its presence in organoids derived from normal colon biopsies of LS subjects, our data support the finding of dMMR as an early event in LS patients in the absence of CRC, providing some insight into the temporal order of events for CRC initiation.

FIGURE 4 Comparison of relative expression scores of genes related to MMR and TMB. TMB was grouped into four categories: low (≤5), intermediate (>5, ≤20), high (>20, ≤50), and very high (>50). Significance denoted by *(p < 0.05), **(p < 0.005), and ***(p < 0.0005).
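The TMB bins and significance thresholds given in the Figure 4 legend can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, and the cutoffs are copied from the legend rather than taken from the authors' analysis code.

```python
def tmb_category(tmb):
    """Assign a TMB value to one of the four bins given in the Figure 4 legend."""
    if tmb <= 5:
        return "low"
    if tmb <= 20:
        return "intermediate"
    if tmb <= 50:
        return "high"
    return "very high"


def significance_stars(p_value):
    """Map a p-value to the star notation given in the Figure 4 legend."""
    if p_value < 0.0005:
        return "***"
    if p_value < 0.005:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level
```

Note that the upper bound of each bin is inclusive, so a TMB of exactly 5 is "low" and a TMB of exactly 20 is "intermediate".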
Next, we were interested in better defining a clinical role for MSH4 gene expression across three TCGA cancer cohorts by examining the relationship between MSH4 gene expression and TMB. We found that MSH4 expression was significantly associated with TMB. It has been theorized that TMB leads to downstream activation of the adaptive immune response by producing increased levels of neoantigens 78,84 and may be an important biomarker for response to immune checkpoint inhibition (ICI). 85 Conversely, individuals with low TMB (<5) have displayed poor responses to ICI. 86 Nonsynonymous measures of TMB are somewhat challenging to define, requiring the number of somatic mutations to be quantified within a coding region of a tumor genome. That MSH4 expression is significantly associated with TMB implicates it as a novel biomarker for TMB, and its role in ICI should be considered. A previous report has shown that an individual with an MSH4 mutation and high TMB showed a complete response to ICI. 78 Despite the significant associations observed between TMB and MSH4 in all three data sets, a more complicated pattern emerged in TCGA-UCEC, whereby individuals with POLE subtypes were more strongly associated with TMB than those with MSI-H. Only after the removal of these individuals was the association between MSH4 expression and TMB identified in this data set. Proofreading defects in POLE have been implicated in the establishment of a hypermutator phenotype but do not lead to MSI-H phenotypes in the absence of dMMR. 78 Given the cancer-specific heterogeneity observed in our findings, more work is needed to better determine the relationship between the mutational landscape of a tumor, MSH4, and TMB. Future studies should also consider whether baseline expression of MSH4 in more accessible samples, such as blood or saliva, may be a useful predictive marker for the success of ICI.
There are limitations to this study. First, therapeutic drug use was not considered as a covariate in our regression analysis, and measures were taken from each individual at only one timepoint. We aimed to contextualize the DMRs identified between LS and healthy colon organoids within the framework of MSI-H versus MSS tumors. However, a number of important differences exist between hereditary and nonhereditary MSI-H tumor phenotypes. For example, the vast majority of nonhereditary MSI-H CRC tumors occur in the right colon and are driven by hypermethylation or somatic mutation of MLH1. This is in contrast to LS, where as many as 45% of CRC tumors may develop outside the right colon. 77 Nonhereditary MSI-H tumors predominantly develop along the serrated pathway, which is not true for LS tumors, which may arise from one of three distinct pathways. 79 This may have led to some LS-specific findings being overlooked. We compared results from organoids of LS versus healthy subjects to MSI-H versus MSS tumors to provide validation given the relatively small initial sample size. Future, larger studies should explore the possibility that differences exist between LS and nonhereditary forms of MSI-H tumors. Of note, some additional cancers such as bladder and breast cancers also present with MSI-H tumors. However, the relative numbers of these tumors within the TCGA database are nominal. 12 As such, the role of MSH4 as a marker of dMMR in these cancers was not considered. Second, our use of CRISPR-Cas9 editing technology to delete known MMR genes implicated in LS suggested a direct relationship between MSH4 expression and MMR status in the colon organoid model. However, we do not provide data on whether MSH4 directly alters MMR gene expression, or whether this occurs through MSH4 DNA methylation. Third, a positive relationship between DNA methylation at MSH4 and increased gene expression was observed.
Given this surprising, positive relationship, future studies should also look to better disentangle the DNA methylation signal observed here. Most studies of DNA methylation on the Illumina EPIC array look at the composite effects of 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC). However, the latter is present in most cell types at vastly lower levels. 87 Through the use of oxidative bisulfite treatment, the precise role of these two epigenetic marks on gene expression of MSH4 may be better defined. Importantly, at the gene body, differences in 5hmC have been previously associated with increased gene expression of expected targets. 87 Further research is needed to determine whether DNA hypermethylation of the MSH4 locus identified may serve as a relevant biomarker not only for LS and, more generally, MSI-H tumors, but also whether differences at this locus may have causal biologic and mechanistic relevance in MSI-H tumor development. This work should also consider whether colon organoid levels of DNA hypermethylation of the identified locus or increased MSH4 expression correlate well with other, more readily accessible tissues such as blood or buccal cells. However, these tissues were not collected for these samples as part of our organoid biorepository and are beyond the scope of this initial study. Verification of these changes in such tissues would improve the potential clinical relevance of our findings.
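To illustrate how paired bisulfite (BS) and oxidative bisulfite (oxBS) measurements could separate the two marks, the sketch below uses the commonly applied naive subtraction: the BS beta value reflects 5mC plus 5hmC, the oxBS beta value reflects 5mC alone, and their difference approximates 5hmC. The function name, the clipping of negative differences to zero, and the toy beta values are assumptions for illustration; no such measurements were made in this study.

```python
def estimate_5hmc(beta_bs, beta_oxbs):
    """Naive per-probe 5hmC estimate from paired BS and oxBS beta values.

    beta_bs:   beta values after standard bisulfite treatment (5mC + 5hmC)
    beta_oxbs: beta values after oxidative bisulfite treatment (5mC only)

    Assumption: negative differences are treated as measurement noise
    and clipped to zero.
    """
    if len(beta_bs) != len(beta_oxbs):
        raise ValueError("BS and oxBS inputs must cover the same probes")
    return [max(0.0, bs - ox) for bs, ox in zip(beta_bs, beta_oxbs)]


# Hypothetical beta values for two probes, not data from the study.
toy_5hmc = estimate_5hmc([0.80, 0.50], [0.60, 0.55])
```

Simple subtraction of this kind is sensitive to technical noise at low 5hmC levels, so model-based estimators would likely be preferable in a real analysis.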

| CONCLUSION
We have identified MSH4 as a consistent feature in colon organoids of LS versus healthy subjects and in MSI-H versus MSS/MSI-L tumors across different cancers, establishing MSH4 as a novel marker of LS and dMMR. This finding was also seen in normal LS colon organoids, adding weight to the hypothesis that dMMR occurs prior to tumorigenesis in LS. Further work will be needed to determine whether there is any causal relationship between MSH4 and MSI-H tumor development.