Targeted sequencing of 36 known or putative colorectal cancer susceptibility genes

Abstract Background Mutations in several genes predispose to colorectal cancer. Genetic testing for hereditary colorectal cancer syndromes was previously limited to single gene tests; thus, only a very limited number of genes were tested, and rarely those infrequently mutated in colorectal cancer. Next‐generation sequencing technologies have made it possible to sequence panels of genes known and suspected to influence colorectal cancer susceptibility. Methods Targeted sequencing of 36 known or putative CRC susceptibility genes was conducted for 1231 CRC cases from five subsets: (1) Familial Colorectal Cancer Type X (n = 153); (2) CRC unselected by tumor immunohistochemical or microsatellite stability testing (n = 548); (3) young onset (age <50 years) (n = 333); (4) proficient mismatch repair (MMR) in cases diagnosed at ≥50 years (n = 68); and (5) deficient MMR CRCs with no germline mutations in MLH1, MSH2, MSH6, or PMS2 (n = 129). Ninety‐three unaffected controls were also sequenced. Results Overall, 29 nonsense, 43 frame‐shift, 13 splice site, six initiator codon variants, one stop codon, 12 exonic deletions, 658 missense, and 17 indels were identified. Missense variants were reviewed by genetic counselors to determine pathogenicity; 13 were pathogenic, 61 were not pathogenic, and 584 were variants of uncertain significance. Overall, we identified 92 cases with pathogenic mutations in APC, MLH1, MSH2, MSH6, or multiple pathogenic MUTYH mutations (7.5%). Four cases with intact MMR protein expression by immunohistochemistry carried pathogenic MMR mutations. Conclusions Results across case subsets may help prioritize genes for inclusion in clinical gene panel tests and underscore the issue of variants of uncertain significance both in well‐characterized genes and those for which limited experience has accumulated.


Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer for both men and women in the United States with an estimated 132,700 new cases and more than 49,000 deaths in 2015 (Siegel et al. 2015). Approximately 10% of CRC cases are familial, with shared genetic and environmental factors both likely influencing the development of disease (Henrikson et al. 2015). Approximately 5% of CRC cases are considered hereditary, harboring an identified pathogenic single-gene alteration in genes established to be associated with a substantial increased risk of disease (Burt 2000;Lichtenstein et al. 2000;Chung and Rustgi 2003;Grady 2003;Lynch and de la Chapelle 2003). Several genes have been identified as CRC susceptibility genes, including those implicated in mismatch repair (MMR), responsible for Lynch Syndrome (Liu et al. 2000;Smith et al. 2001;Grady and Markowitz 2002;Suchy et al. 2010;Kastrinos and Syngal 2011;Lubbe et al. 2011;Palles et al. 2013). Additional genes suspected of being involved with CRC susceptibility include those involved in DNA repair (Myeroff et al. 1995;Eppert et al. 1996;Ilyas et al. 1997;Shin et al. 2004;Valle et al. 2008;Goto et al. 2009;Guda et al. 2009;Nahorski et al. 2010;Morak et al. 2011;Fleming et al. 2013;Lao et al. 2013;de Voer et al. 2013, 2015;Mazzoni and Fearon 2014).
Currently, several methods are being used to identify mutations in hereditary colorectal cancer (HCC) susceptibility genes. For evaluation of Lynch syndrome in particular, testing algorithms may include microsatellite instability (MSI) and immunohistochemistry (IHC) analysis, tumor and germline hypermethylation analysis, and germline sequencing and dosage analysis. Universal MSI or IHC testing of CRCs has also been advocated in order to identify individuals with Lynch Syndrome (Giardiello et al. 2014). For the evaluation of other HCC syndromes, genetic testing is typically limited to germline analysis. Previously, genetic testing for HCC syndromes was limited to single gene tests, performed in a cascade fashion when necessary; thus, only a very limited number of genes were tested, and rarely those infrequently mutated in CRC. As next-generation sequencing (NGS) technologies advance and costs decrease, sequencing panels of known HCC susceptibility genes are becoming increasingly common. These panels frequently include analysis of candidate CRC risk genes, for which little is known about the spectrum of pathogenic disease-associated variants. Additionally, in both well-established and candidate HCC risk genes, many rare variants occur that are not easily classified for pathogenicity (missense, synonymous, intronic, and intergenic variants) and many of these remain categorized as variants of uncertain significance (VUS). Determining the type and frequency of variations in these genes in CRC cases compared to unaffected controls may help in distinguishing pathogenic and benign variants and may help prioritize testing for family members of affected individuals.
In this study, we screened for germline mutations in 36 genes across five categories of CRC cases, including (1) Familial Colorectal Cancer Type X (FCCTX) cases, which meet the Amsterdam Criteria I for Lynch Syndrome but have normal mismatch repair function (microsatellite stable [MSS] and/or normal expression of four MMR proteins encoded by MLH1, MSH2, MSH6, and PMS2 by IHC) in the tumor (Lindor et al. 2005a), (2) unselected CRC cases with no prior IHC or MSI testing completed, (3) proficient MMR (pMMR) or unknown MMR status cases diagnosed <50 years of age, (4) proficient MMR (pMMR) cases, based on MSI or IHC testing, diagnosed ≥50 years, and (5) deficient MMR (dMMR) cases where no germline mutation has been previously identified by sequencing or multiplex ligation probe assay (MLPA) in the four main MMR genes (Lindor et al. 2005b;Boland and Goel 2010).

Subjects
Subjects were selected from the Colon Cancer Family Registry (Colon CFR) for mutation screening as part of the overall genetic characterization of this registry. The Colon CFR is an NCI-supported consortium established to create an infrastructure for interdisciplinary studies of the genetic and molecular epidemiology of CRC (Newcomb et al. 2007). Risk factor data, blood samples, and pathology reports were collected from participants using standardized protocols, and germline DNA was isolated from blood. Population- and clinic-based individuals chosen for germline DNA sequencing were divided into five case groups (persons with CRC), namely: (1) Familial Colorectal Cancer Type X cases, which meet the Amsterdam Criteria I for Lynch Syndrome (Vasen et al. 1991), but have normal mismatch repair function (microsatellite stable [MSS] and/or normal expression of four MMR proteins encoded by MLH1, MSH2, MSH6, and PMS2 by IHC) in the tumor ("FCCTX"; n = 153); (2) CRC cases with no prior IHC or MSI testing ("unselected"; n = 548); (3) proficient MMR (pMMR) or unknown MMR status cases diagnosed <50 years ("young onset"; n = 333); (4) cases diagnosed ≥50 years with proficient DNA mismatch repair based on MSI or IHC testing ("pMMR"; n = 68); and (5) cases with deficient MMR in tumor but with no germline mutation identified in the gene with lost protein expression by sequencing or multiplex ligation probe assay (MLPA) ("dMMR"; n = 129). Several samples could be classified into more than one case group; particularly high overlap was present in the FCCTX, young onset, and pMMR cases. For clarity in reporting, samples were only included in a single group, with priority for classification proceeding FCCTX > young onset > pMMR. In addition, we chose a sample of 93 persons without CRC from among the spouses of cases as "controls". Thirty-six genes known or suspected to be associated with CRC susceptibility were selected for targeted sequencing using Agilent's Custom Capture Kit.
All exons and ±30 bp of each exon/intron boundary of each gene were specifically targeted for capture and sequencing.
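The single-group assignment rule for samples that qualified for more than one case group (priority FCCTX > young onset > pMMR) can be illustrated with a minimal sketch. The function name and the handling of groups outside the stated priority list are assumptions made for illustration; this is not the study's actual code.

```python
# Priority order stated in the text: FCCTX > young onset > pMMR.
PRIORITY = ["FCCTX", "young onset", "pMMR"]


def assign_group(eligible_groups):
    """Return the single reporting group for a sample.

    `eligible_groups` is a set of group labels the sample qualifies for.
    (Hypothetical helper; the labels follow the group names used in the text.)
    """
    for group in PRIORITY:
        if group in eligible_groups:
            return group
    # Assumption: groups outside the priority list ("unselected", "dMMR")
    # do not overlap with each other, so exactly one label should remain.
    if len(eligible_groups) == 1:
        return next(iter(eligible_groups))
    raise ValueError(f"cannot resolve group from {sorted(eligible_groups)}")


print(assign_group({"pMMR", "young onset"}))  # young onset
```

A sample qualifying for all three overlapping groups would be reported as FCCTX under this rule.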

Custom capture and sequencing
Paired-end indexed libraries were prepared using the Agilent Bravo liquid handler following the manufacturer's protocol (Agilent). Briefly, 3 µg of target DNA in 120 µL TE buffer was fragmented using the Covaris E210 sonicator. The settings of duty cycle 10%, intensity 5, cycles 200, time 360 sec generated double-stranded DNA fragments with blunt or sticky ends with a fragment size mode of between 150-200 bp. The ends were repaired and phosphorylated using Klenow, T4 polymerase, and T4 polynucleotide kinase, after which an "A" base was added to the 3′ ends of double-stranded DNA using Klenow exo (3′ to 5′ exo minus). Paired-end Index DNA adaptors (Agilent) with a single "T" base overhang at the 3′ end were ligated and the resulting constructs were purified using AMPure SPRI beads from Agencourt. The adapter-modified DNA fragments were enriched by four cycles of PCR using SureSelect forward and SureSelect Pre-Capture Indexing reverse (Agilent) primers. The concentration and size distribution of the libraries were determined using an Agilent Bioanalyzer DNA 1000 chip.
Custom capture of 3.69 Mb was carried out using the Agilent Bravo liquid handler following the protocol for Agilent's SureSelect XT, such that 750 ng of the prepped library was incubated with whole-exon biotinylated RNA capture baits supplied in the kit for 24 h at 65°C. The captured DNA:RNA hybrids were recovered using Dynabeads MyOne Streptavidin T1 from Dynal. The DNA was eluted from the beads and purified using Ampure XP beads from Agencourt. The purified capture products were then amplified using the SureSelect Post-Capture Indexing forward and Index PCR reverse primers (Agilent) for 12 cycles. Libraries were validated and quantified on the Agilent Bioanalyzer.
Libraries were pooled at equimolar concentrations in batches of 96 samples and loaded onto paired-end flow cells at concentrations of 7-8 pM to generate cluster densities of 600,000-800,000/mm² following Illumina's standard protocol using the Illumina cBot and HiSeq Paired-end cluster kit version 3. Each pool of samples was run on five lanes of a flow cell to generate a minimum of 200x coverage per sample.
The flow cells were sequenced as 2 × 101 bp paired-end reads on an Illumina HiSeq 2000 using TruSeq SBS sequencing kit version 3 and HiSeq data collection version 1.4.8 software. Base calling was performed using Illumina's RTA version 1.12.4.2.
Variants most likely to disrupt protein expression or function [nonsense variants, frame-shift insertion/deletion/duplication variants, splice site variants (±2 bases from the exon-intron boundary), initiator codon variants, and stop-codon variants] were designated as Tier 1 variants. Missense variants and in-frame insertion/deletion/duplication variants were designated as Tier 2 variants and were evaluated as described below to determine classification as either pathogenic, benign, or VUS. Synonymous variants and those located in introns, untranslated regions, or intergenic regions were excluded from this study.
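The tiering scheme above amounts to a lookup from variant consequence to tier. The following is a minimal sketch of that mapping; the consequence labels are hypothetical strings chosen for illustration and do not correspond to the output of any particular annotation tool.

```python
# Hypothetical consequence labels; the grouping follows the tier definitions in the text.
TIER1 = {"nonsense", "frameshift", "splice_site", "initiator_codon", "stop_codon"}
TIER2 = {"missense", "inframe_indel"}
EXCLUDED = {"synonymous", "intronic", "utr", "intergenic"}


def assign_tier(consequence):
    """Map a variant consequence to 'Tier 1', 'Tier 2', or None (excluded)."""
    if consequence in TIER1:
        return "Tier 1"
    if consequence in TIER2:
        return "Tier 2"
    if consequence in EXCLUDED:
        return None
    raise ValueError(f"unrecognized consequence: {consequence}")


for c in ("nonsense", "missense", "synonymous"):
    print(c, "->", assign_tier(c))
```

Raising on unrecognized labels, rather than silently excluding them, is a design choice of this sketch so that unexpected annotations are not dropped unnoticed.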

Detection of large exonic deletions
All copy number variations (CNVs) were called using an updated version of PatternCNV, which uses all the samples to learn the pattern and variance of the coverage to better enable CNV calling (Wang et al. 2014). It computes the differences in observed coverage versus the common pattern, while penalizing regions associated with larger variability using a weighting scheme. Results from probe-level CNV are summarized using circular binary segmentation. Further CNV segmentation results were evaluated in three genes, MLH1, MSH2, and MSH6, using −0.5 and 0.5 as log2 ratio cutoff for deletion and amplification, respectively.
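As an illustration of how the log2 ratio cutoffs could be applied to a segment, consider the sketch below. It is not PatternCNV code; the function name is hypothetical, and treating the cutoffs as inclusive is an assumption, since the text does not say how values exactly at the boundary were handled.

```python
import math


def call_segment(log2_ratio, del_cutoff=-0.5, amp_cutoff=0.5):
    """Label a segment from its log2(observed / expected coverage) ratio.

    Assumption: cutoffs are inclusive.
    """
    if log2_ratio <= del_cutoff:
        return "deletion"
    if log2_ratio >= amp_cutoff:
        return "amplification"
    return "neutral"


# An idealized heterozygous deletion leaves one of two copies: log2(1/2) = -1.
print(call_segment(math.log2(1 / 2)))
# An idealized single-copy gain gives three of two copies: log2(3/2) is about 0.58.
print(call_segment(math.log2(3 / 2)))
```

Under these idealized copy ratios, both a one-copy loss and a one-copy gain fall outside the ±0.5 window, while noise around a ratio of 1 (log2 near 0) stays neutral.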

Missense variant review
Review and classification of missense variants were completed by three genetic counselors (authors JRB, AMP, and LAW) as outlined in Figure S1. Variants with a MAF ≥2% and <5% were considered benign. The remaining variants were assessed for available annotation information from a number of sources: Mayo's clinical Molecular Genetics laboratory, InSiGHT database (International Society for Gastrointestinal Hereditary Tumors) for MMR genes, ClinVar mutation database, and the Human Gene Mutation Database (HGMD) (Cooper et al. 1998;Ou et al. 2008;Landrum et al. 2016). Variants that had not been annotated by any of these groups were classified as VUS. For missense variants that had been annotated by InSiGHT, the InSiGHT classification was assigned. Variants in genes tested at Mayo's clinical Molecular Genetic laboratory were assessed for prior experience with Mayo's clinical laboratory. For variants that had been recently annotated (since 2015) per Mayo's clinical laboratory internal databases, the current Mayo classifications were assigned. Of note, for those variants annotated in both the internal Mayo databases and InSiGHT, the InSiGHT classification was used. For variants that had been most recently annotated by Mayo's clinical laboratory prior to 2015, additional review was performed as described below. For variants annotated in ClinVar with a ≥ 2 star rating (requiring multiple submitters and no conflicting interpretations) the ClinVar classification was assigned (Landrum et al. 2016). For variants that had been annotated by Mayo's clinical Molecular Genetics laboratory prior to 2015, those that were annotated in ClinVar but had a < 2 star rating, and those that were annotated only in HGMD, final classification was assigned based on available annotation data incorporated with additional genetic counselor variant review. This review process included assessment of available database annotations, literature review, and analysis using in silico predictive tools. 
Classifications were determined based on ACMG guidelines (Richards et al. 2015).
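The order in which annotation sources were consulted can be summarized in a short sketch. This covers only the annotation-source steps, not the allele-frequency filter or the counselor review itself. The dictionary keys and the function name are hypothetical, and the relative order of the Mayo and ClinVar checks follows the order of presentation in the text, which is an assumption.

```python
def classification_source(annotations):
    """Pick which source's classification applies to a missense variant.

    `annotations` is a hypothetical dict with optional keys:
    'insight', 'mayo', 'clinvar' (classification strings),
    'mayo_year' (int), 'clinvar_stars' (int), and 'hgmd' (bool).
    Returns a (source, classification) tuple.
    """
    insight = annotations.get("insight")
    mayo = annotations.get("mayo")
    clinvar = annotations.get("clinvar")
    hgmd = annotations.get("hgmd", False)

    # No annotation in any of the four sources: variant of uncertain significance.
    if insight is None and mayo is None and clinvar is None and not hgmd:
        return ("none", "VUS")
    # InSiGHT takes precedence, including over internal Mayo annotations.
    if insight is not None:
        return ("InSiGHT", insight)
    # Mayo annotations made since 2015 are used directly.
    if mayo is not None and annotations.get("mayo_year", 0) >= 2015:
        return ("Mayo", mayo)
    # ClinVar entries rated at least two stars are used directly.
    if clinvar is not None and annotations.get("clinvar_stars", 0) >= 2:
        return ("ClinVar", clinvar)
    # Older Mayo, low-star ClinVar, or HGMD-only annotations go to manual review.
    return ("genetic counselor review", None)


print(classification_source({"insight": "Class 1", "mayo": "VUS", "mayo_year": 2016}))
```

In the example call, the InSiGHT class is returned even though a recent Mayo annotation exists, mirroring the precedence rule stated in the text.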

REVEL scores for missense variants
Rare Exome Variant Ensemble Learner (REVEL) is a new ensemble method developed to help predict the pathogenicity of rare missense variants, such as those commonly identified using modern sequencing technologies. The REVEL random forest was trained on recently discovered disease and rare neutral variants, and incorporates scores from multiple individual tools, including: MutPred, VEST, FATHMM, Polyphen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL scores ranging from 0 to 1 were generated for all missense variants in our study, and were utilized to help determine their relative likelihood of pathogenicity. We chose a threshold of ≥0.5 to be considered likely damaging, corresponding to 75.4% sensitivity and 89.1% specificity (Ioannidis et al.).
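Applying the REVEL threshold is a simple comparison, sketched below. The function name is hypothetical; the cutoff of ≥0.5 is the one stated above, and the example score of 0.536 is the value reported in this study for MSH2 p.Gly322Asp.

```python
REVEL_THRESHOLD = 0.5  # cutoff stated in the text (scores of 0.5 or above)


def is_likely_damaging(revel_score, threshold=REVEL_THRESHOLD):
    """Return True if a REVEL score meets the likely-damaging cutoff."""
    if not 0.0 <= revel_score <= 1.0:
        raise ValueError("REVEL scores range from 0 to 1")
    return revel_score >= threshold


print(is_likely_damaging(0.536))  # True
```

A score just above the cutoff, as in this example, shows why a threshold-based call is treated as supporting evidence rather than a final classification.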

LSDB submission of identified variants
All variants reported in the article have been submitted to the corresponding Locus Specific Mutation Database (LSDB, http://grenada.lumc.nl/LSDB_list/lsdbs/).

Subject characteristics
After exclusions due to poor coverage or concordance (n = 6, 0.5%), a total of 1324 individuals were included in the study, including 1231 cases with CRC and 93 controls (Table 1). The majority of cases were Caucasian (76%), while the remaining were African American (11%), Asian (3%), or mixed (10%).

Tier 1 variants
Eight percent (n = 103) of cases harbored a Tier 1 (nonsense, frame-shift, splice site, initiator codon, and stop loss) variant. Ninety-two unique Tier 1 variants were identified in 101 individuals, with MSH6, MSH2, MLH1, and APC the most frequently mutated genes (20, 13, 13, and 12 unique mutations each, respectively, Tables 2 and S1). Of the 92 variants, four were present only in controls, one was present in both cases and controls, and the remaining 87 were present only in cases. The vast majority of variants were identified in only one individual (n = 84); the remaining eight variants were identified in two to five individuals. Individual variant results, for both the entire cohort as well as ethnic subgroups, are shown in Table S1.
Five Tier 1 mutations were identified in the controls (Table 2) including an MSH6 nonsense variant (p.Arg1005*) that was confirmed by Sanger sequencing, a nonsense variant in TGFBR1, frame-shift variants in BLM and CHEK2, and a splice-site variant in BMP4.
The unselected cases had the highest frequency of Tier 1 variants overall (11%, Table 2), both because it was the largest sample group and because neither tumor triage nor mutation screening was performed prior to study inclusion. Thirteen unselected cases (2%) carried a Tier 1 variant in MSH6, while 11 (2%), nine (2%), and five (1%) carried a Tier 1 mutation in MLH1, MSH2, and APC, respectively.
Variants were also found frequently in the young onset and FCCTX cases (9% and 7%, respectively; Table 2). APC mutations were the most common in the young onset cases (2%), but none were present in the FCCTX cases. Surprisingly, we identified four FCCTX cases with damaging MMR mutations, including one individual with a p.Arg711* mutation in MSH2, and three individuals with frame-shift mutations in MSH6 (p.Phe1037fs, p.Ala1320fs, and p.Phe1088fs) (Table S1). All four had IHC results indicating the protein expression of interest was present and normal. MSI testing was performed on two of the subjects' tumors and indicated microsatellite stable tumors; MSI testing was not completed on the other two. Thus, the available tumor data on these cases would not have led to triage for sequencing for MMR gene mutations. Of the 69 young onset cases without any prior IHC or MSI tumor testing, 16 harbored mutations in MMR genes (23%). Because of the lack of previous MSI or IHC testing and the early onset of disease in these cases, we expected several cases to have pathogenic MMR gene mutations.
Three Tier 1 variants were identified in the dMMR cases, one each in MSH2, MSH6, and TGFB1. The initiator codon of MSH2 was mutated in one case (c.1A>C), however, previous IHC indicated that MSH2 was present and normally expressed and this variant is classified as a VUS by InSiGHT. MLH1 expression was heterogeneous and PMS2 expression was lost and the tumor was also negative for MLH1 methylation. In the case with the MSH6 mutation (p.Phe1088fs), IHC studies indicated loss of MLH1 but normal expression of MSH6.

Large exonic deletions
Twelve cases were identified with large exonic deletions in MMR genes (MLH1, MSH2, or MSH6) (Table 3). Six large deletions were identified in MLH1, while there were five in MSH2 and two in MSH6. One unselected case had large deletions in both MLH1 and MSH2. Unselected cases had the largest proportion of large deletions (n = 8), with one case each of dMMR, FCCTX, pMMR, and young onset harboring a large deletion. Cases with large deletions were diagnosed young, with a median age of diagnosis of 46 (range: 24-60).

Tier 2 variants
A total of 658 missense and 17 in-frame indels were classified as Tier 2 variants, with 32 being found exclusively in controls. Most Tier 2 variants were present in one to three individuals, concordant with their low minor allele frequency in the public databases. After review of pathogenicity as outlined in the methods, 13 were considered to be pathogenic or likely pathogenic, 61 were considered not pathogenic, likely not pathogenic, or polymorphisms, and the remaining 584 were classified as variants of uncertain significance (VUS) (Table S2). Of the variants only found in controls, one was classified as likely not pathogenic, while the remaining 31 were classified as VUS. Because of the large number of variants classified as VUS in both cases and controls, we also utilized variant REVEL scores to assess pathogenicity, as described in the methods. For variants classified as VUS, we used a REVEL score of >0.5 to be likely damaging, corresponding to 75.4% sensitivity and 89.1% specificity (Ioannidis et al.). Overall, 25% of the VUS missense variants were considered likely damaging using this cutoff (n = 144) and these variants, as well as the ones classified as pathogenic or likely pathogenic, are discussed further below (Tables 4 and S2). MLH1 and MSH2 harbored the most predicted damaging variants (n = 16 and n = 15, respectively), while CHEK2 had 13 and APC and MSH3 both had 12 (Tables 3 and S2). Several genes had few predicted damaging variants, including those with one (AXIN2, BUB1, CDKN1B, CDKN2A, SMAD2, TGFB1, and TGFBR1), two (AXIN1, BLM, BMPR1A, FLCN, GALNT12, PTEN, and SMAD3), or three (BMP4, SMAD1, and SMAD4). No predicted damaging variants were found in CTNNB1, NUDT1, PALB2, STK11, or STK11IP.
The carrier rate of predicted damaging Tier 2 variants was highest in the young onset and unselected cases, followed by the pMMR cases (26%, 26%, and 21%, respectively) (Table 4). MUTYH had the highest percent carrier count of predicted damaging variants in three of the sample subsets (controls, FCCTX, and young onset), while in both the unselected and pMMR cases PMS1 had the highest percent carrier count. In the dMMR cases, RECQL5 had the highest percent carrier count and all of the cases with predicted damaging RECQL5 variants had loss of MLH1; however, the same variants were present in several FCCTX and young onset cases with normal tumor expression of MLH1. Thus, it is unlikely these variants are influencing the loss of MLH1. In the unselected cases, MSH3 also had a high percent carrier count, primarily due to three variants present in five (p.Ser490Tyr and p.His827Arg) or four (p.Leu911Trp) individuals each (Table S2). Two of these variants (p.Ser490Tyr and p.His827Arg) were not present in Caucasian public control populations; however, they are present in African American controls in both 1000 Genomes and the Exome Sequencing Project (ESP), with frequencies ranging from 0.38 to 0.91% (Table S2). Indeed, of the 10 individuals with these two variants in our study, seven were classified as African American and three were of mixed descent. Several other genes had variants that were predominantly present in African American or admixed individuals, including APC (p.Ser26Arg), CDKN2A (p.Ala127Ser and p.Arg144Cys), MLH3 (p.Asp1073Asn), and PMS1 (p.Gly501Arg) (Table 3).
In FCCTX cases, MUTYH harbored the most unique variants (n = 6). Eight individuals had at least one pathogenic or suspected pathogenic MUTYH variant; no previous screening for common MUTYH mutations had been completed for these individuals. Three individuals harbored two MUTYH mutations, one homozygous for p.Pro405Leu, one homozygous for p.Gly396Asp, and one individual was a suspected compound heterozygote for p.Tyr179Cys and p.Pro359Thr, however, we could not determine whether the two variants were in cis or in trans. The remaining five individuals had a single MUTYH mutation. Eight individuals also harbored a MSH2 variant (p.Gly322Asp) that met the random forest predicted damaging cutoff (0.536); however, this variant has been classified as not pathogenic by InSiGHT (Class 1).
In the unselected cases, MSH2 harbored the most unique variants (n = 11) (Tables 4 and S2). Within this group, 15 cases harbored more than one predicted damaging variant in a single gene, including three individuals with two variants in MUTYH and three individuals with two variants in PMS1 (Table 5). In the young onset cases, several genes had multiple predicted damaging Tier 2 variants, including CHEK2, MLH1, MSH2, MSH3, MSH6, and MUTYH. Three individuals in this subset had two heterozygous mutations in MUTYH and one individual had two heterozygous mutations in MLH1, although the phase of these alterations could not be determined (Table 5). In the pMMR cases, PMS1 harbored the most predicted damaging variants, while in the dMMR cases RECQL5 had the most individuals with predicted damaging variants. No cases in either the pMMR or dMMR subsets harbored homozygous or compound heterozygous variants that were predicted to be damaging.
Seventeen unique in-frame indel variants were identified in our samples, with 13 being present in a single case (Table S3). MLH1 p.Lys618Ala (c.1852_1853delinsGC), APC p.Glu1157del, AXIN2 p.His474_Ser475insHis, and STK11IP p.Ser739del were present in 13, seven, three, and two cases, respectively. Similar to what was seen with certain missense variants, the APC indel was present in African American and mixed descent cases, mirroring the higher frequency of this indel in individuals of African American descent in public control datasets. The majority of the indels were present in unselected cases (n = 12), likely due to the large sample size of the group. No in-frame indels were present in the controls, while the pMMR and young onset cases each had two unique indels. Of note, the MSH2 p.Asn596del indel present in one dMMR case is considered a Class 5 (Pathogenic) variant by InSiGHT, and the case harboring this variant demonstrated loss of MSH2 expression by IHC. The remaining in-frame indels in the MMR genes (p.Leu94del and p.Ile217del in MSH2 and p.Pro768del in MSH6) were not present in the InSiGHT database.

Cases with multiple predicted damaging variants
In total, 117 samples (9.5%) harbored more than one Tier 1 variant, a pathogenic or likely pathogenic Tier 2 variant, or a predicted damaging Tier 2 variant. Two individuals had four predicted damaging variants, 22 had three predicted damaging variants, and 93 had two predicted damaging variants. In most individuals with more than one predicted damaging variant, the variants were present in different genes. However, a few cases had more than one Tier 1 or likely damaging Tier 2 variant in the same gene, as discussed above (Table 5). Six individuals were homozygous for predicted damaging mutations: five individuals for MUTYH and one individual for PMS1. Six additional individuals carried two heterozygous MUTYH mutations. Of the 11 individuals with two MUTYH mutations, nine had multiple polyps while there was no information available for the remaining two cases. Multiple heterozygous mutations were also detected in APC, CDH1, CDKN2A, MLH1, MSH3, MSH6, and PMS1. Six of these individuals harbored a Tier 1 and predicted damaging Tier 2 variant. Additionally, three unselected cases harbored two Tier 1 variants in different genes. The first two harbored simple Tier 1 mutations: MLH1 (p.Met35fs) and MSH6 (p.Lys1358fs) in one individual, while another case harbored both APC (p.His2045fs) and MSH3 (p.Gln74fs). As discussed previously, the third had large exonic deletions in both MLH1 and MSH2 (Table 3).

Discussion
In this study, we sought to determine the scope and frequency of variants in 36 known or putative CRC susceptibility genes for five case subgroups (FCCTX, young onset, unselected, pMMR, and dMMR with no identified mutation) and an unaffected control group. We studied 18 genes known to be important in CRC susceptibility. In total, we identified 72 of 1231 cases with pathogenic nonsense, frame-shift, splice site, large deletions, or likely damaging missense variants in the MMR genes (5.8%), predominantly in the unselected and young onset subsets. Pathogenic mutations in APC were identified in 12 individuals (1.0%) and multiple pathogenic mutations in MUTYH were found in eight cases (0.6%). Overall, we identified pathogenic mutations in the MMR genes, APC, and MUTYH in 7.5% of our cases.
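The percentages quoted above follow directly from the case counts and the total of 1231 cases, as the short check below shows. It assumes the three categories do not overlap, which is consistent with the 92 cases reported in the abstract; the variable names are illustrative.

```python
TOTAL_CASES = 1231

# Case counts as reported in the text.
counts = {"MMR genes": 72, "APC": 12, "MUTYH (multiple mutations)": 8}


def pct(n, total=TOTAL_CASES):
    """Percentage of cases, rounded to one decimal place."""
    return round(100 * n / total, 1)


for label, n in counts.items():
    print(f"{label}: {n}/{TOTAL_CASES} = {pct(n)}%")

# Assumption: the three categories are mutually exclusive, so counts can be summed.
combined = sum(counts.values())
print(f"combined: {combined}/{TOTAL_CASES} = {pct(combined)}%")
```

The combined count of 92 cases reproduces the overall 7.5% figure.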
Large exonic deletions in the MMR genes, which can account for 15-45% of germline mutation in MLH1 and MSH2, were identified in 1% of the cases (Baudhuin et al. 2005). In addition to the 1231 cases and 93 controls, 10 samples with known large deletions were also sequenced to determine if the deletions would be detectable. Eight of the ten deletions were identified; two smaller deletions were not identified, one in MLH1 (~3 kb) and one in MSH2 (~100 bp).
Tier 1 and predicted damaging Tier 2 variants were detected in all subsets for four genes (MLH1, MSH6, MUTYH, and PMS1). No Tier 1 or predicted damaging Tier 2 variants in CTNNB1, PALB2, or STK11 were identified, while a single Tier 1 or damaging Tier 2 variant was detected in AXIN2, CDKN1B, CDKN2A, and SMAD2. Other genes with only a few Tier 1 or damaging Tier 2 variants include GALNT12, NUDT1, PTEN, STK11IP, TGFB1, and TGFBR1. While germline mutations in these genes may play a role in CRC pathogenesis, it is likely limited to a very small proportion of cases. Including these genes in clinical testing panels would help identify the rare individuals with pathogenic mutations in these genes; however, it would also likely result in more VUS being identified. Whether the difficulties in interpreting uncertain variants in relation to disease management and risk assessment are outweighed by the few clearly pathogenic variants identified will need further study.
In addition to the mutations found in cases, we also identified five Tier 1 variants present in control subjects. Mutations in two of the genes are responsible for autosomal dominant CRC and Loeys-Dietz Syndrome (MSH6 and TGFBR1, respectively). Results of all Tier 1 variants, regardless of case or control status, were reported to the site from where the affected individuals were recruited.
Eight genes had an additional 36 variants that would be present only in specific protein isoforms due to alternative splicing or alternative start sites, including APC, CDKN2A, FLCN, MUTYH, OGG1, RECQL5, STK11, and TP53 (Table S4). For example, the TP53 variant at chr17:7576541 is intronic in the predominant isoform (NM_000546.5); however, due to alternative splicing of exon 9, is considered a missense (p.Ser307Leu) in the gamma isoform (NM_001276695.1) of the protein. Several of these different isoforms of the genes have been found to be elevated in various types of cancer, but the impact on disease risk, progression, and outcome remains unclear.
While IHC testing can help determine which gene is implicated in the case of the MMR genes, we identified four cases with pathogenic nonsense or frame-shift mutations in these genes despite tumor expression of the affected protein. Because previous IHC testing did not demonstrate loss of protein expression, germline sequencing of the MMR genes was not indicated. While not unheard of, the prevalence of intact MMR protein staining in conjunction with pathogenic mutations remains unclear. Previous studies have reported similar findings, especially in regards to MSH6; tumors in these cases may instead be phenocopies, have heterogeneous expression of the protein being tested, or the patient may have undergone neoadjuvant therapy (Radu et al. 2011;Shia et al. 2013). Additionally, IHC results are not always clearly positive or negative; centers may interpret the results differently, impacting the decision to proceed with germline testing. Thus, caution should be taken in regards to IHC testing results as they may not always accurately reflect the biology of the tumor. Additional studies are warranted to better understand how often this occurs and how it may lead to incorrect diagnosis for patients.
The ability to identify genomic variants via multigene sequencing panels has surpassed the ability to accurately classify variants in terms of functional importance. While rare nonsense, frame-shift, and splice-affecting variants are generally considered pathogenic when occurring in genes for which loss of function is known to be associated with disease, missense variants are much more difficult to assess. In silico programs, such as SIFT and PolyPhen, are often used, however, the pathogenicity predictions are not sufficiently reliable to use as stand-alone evidence for pathogenicity and the programs are frequently not in agreement with one another; thus many missense variants remain classified as VUS. To aid in determining which variants were likely pathogenic, we used a new program, REVEL, utilizing a random forest score that incorporates multiple in silico prediction tools for use with rare variants (Ioannidis et al.). Of the 77 missense MMR variants in MLH1, MSH2, and MSH6, 53 had a REVEL score above our threshold of 0.5. Twenty of these were not present in InSiGHT; the remaining 33 were classified as Class 5 (pathogenic, n = 3), Class 4 (likely pathogenic, n = 1), Class 3 (uncertain, n = 17), Class 2 (likely not pathogenic, n = 6), or Class 1 (not pathogenic, n = 6). Using a more stringent threshold for the REVEL score would decrease the number of benign variants considered likely pathogenic, however, it may also decrease the chance of identifying true pathogenic variations.
Our study has several strengths. The large number of cases available through the Colon CFR allowed us to compare case subsets with varying characteristics. For example, all case categories contained damaging or likely damaging variants in MLH1, MSH6, MUTYH, and PMS1, whereas three genes (CTNNB1, PALB2, and STK11) had no damaging or likely damaging variants in any case subset, and damaging variants in five genes (AXIN2, CDKN1B, CDKN2A, NUDT1, and SMAD2) were present in only a single case category. Many of our cases had a family history of CRC, likely enriching for causative variants. Additional affected family members may also be used in cosegregation studies to better predict the pathogenicity of rare missense variants.
Our study has some weaknesses. The modest number of controls limited our ability to compare frequencies of variants to those in our case subgroups. The majority of our variants were very rare or novel, present only in a single case. Sequencing additional controls, in conjunction with utilizing available public control databases and information from functional studies, will be essential to help establish which variants contribute to CRC predisposition. This study did not include all currently known CRC susceptibility genes. PMS2 was excluded because it is difficult to target due to the presence of multiple pseudogenes. POLE, POLD1, and GREM1 were not included because they were identified as putative CRC susceptibility genes after completion of our custom capture design. Finally, while we were able to detect large deletions in eight samples with known deletions, we failed to find known deletions in two positive control samples. It is possible that coverage of the genes was not sufficient in these samples or that the deletions were difficult to detect due to their relatively small size. Additionally, verification of the deletions identified in cases has not yet been completed. The deletion data must therefore be interpreted with care; there may be additional deletions that were not detected, and some of the identified deletions may be false positives. We were only able to search for large deletions in MLH1, MSH2, and MSH6, because increased coverage of these regions was specifically designed into the targeted array. The remaining genes may also contain large deletions.
Yurgelun et al. completed a study similar to ours, in which they sequenced 1260 CRC cases with suspected Lynch Syndrome using a 25-gene panel (Yurgelun et al. 2015). Our study differs in several key respects. First, we categorized cases by tumor MMR status, allowing comparison of case subsets. All of the cases in the Yurgelun et al. study were from suspected Lynch Syndrome families, based on family history of Lynch Syndrome-associated cancers. We also included more than 500 CRC cases with no prior IHC or MSI testing, reflecting what is more likely to be seen in the general population. Additionally, we sequenced 93 unaffected individuals to help discriminate common, benign variants from those more likely to be truly involved in disease susceptibility. Of the 82 missense variants present in our controls, 50 were also present in cases. Only seven of these had predicted damaging REVEL scores, including one known pathogenic variant (MUTYH, Gly396Asp), one known benign variant (MSH2, Gly322Asp), and five variants of uncertain significance. The presence of these variants in controls at similar or higher frequencies than in cases favors a benign prediction.
In summary, we have utilized targeted sequencing to identify variants in known and suspected CRC susceptibility genes. We identified multiple pathogenic and likely pathogenic mutations in our cases. Cases with discordant MMR tumor testing and sequencing results were discovered, perhaps due to tumor heterogeneity or phenocopies, underscoring that IHC and MSI testing are not always accurate indicators of germline MMR status. As sequencing technologies improve and costs decrease, targeted sequencing of multiple CRC susceptibility genes is becoming an efficient method to screen individuals with suspected hereditary CRC, although a high frequency of VUSs must be anticipated for a long time to come.

Supporting Information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1. Missense variant classification. Table S1. Individual Tier 1 variants. Table S2. Individual Tier 2 variants. Table S3. Individual in-frame indels. Table S4. Alternative splicing variants.