Leiden open variation database of the MUTYH gene


  • Communicated by Rolf H. Sijmons


The MUTYH gene encodes a DNA glycosylase involved in base excision repair (BER). Biallelic pathogenic MUTYH variants have been associated with colorectal polyposis and cancer. The pathogenicity of a few variants is beyond doubt, including c.536A>G/p.Tyr179Cys and c.1187G>A/p.Gly396Asp (previously c.494A>G/p.Tyr165Cys and c.1145G>A/p.Gly382Asp). However, for a substantial fraction of the detected variants, the clinical significance remains uncertain, compromising molecular diagnostics and thereby genetic counseling. We have established an interactive MUTYH gene sequence variant database (www.lovd.nl/MUTYH) with the aim of collecting and sharing MUTYH genotype and phenotype data worldwide. To support standard variant description, we chose NM_001128425.1 as the reference sequence. The database includes records with variants per individual, linked to available phenotype and geographic origin data as well as records with in vitro functional and in silico test data. As of April 2010, the database contains 1968 published and 423 unpublished submitted entries, and 230 and 61 unique variants, respectively. This open-access repository allows all involved to quickly share all variants encountered and communicate potential consequences, which will be especially useful to classify variants of uncertain significance. Hum Mutat 31:1–11, 2010. © 2010 Wiley-Liss, Inc.

The Human MUTYH Gene and Colorectal Polyposis

The Human MUTYH Gene

The human MUTYH (mutY homolog [Escherichia coli]) gene is located on chromosome 1p34.3–p32.1 and spans 11.2 kb (MIM# 604933, Genomic sequence: NG_008189.1). It has an open reading frame (ORF) of ∼1.6 kb containing 16 exons. Gene aliases are: MYH, hMYH, CYP2C, MGC4416, and RP4-534D1. Different protein names are: A/G-specific adenine DNA glycosylase, mutY homolog (E. coli), and hMYH (NCBI Entrez GeneID: 4595, www.ncbi.nlm.nih.gov/gene, UniProtKB/Swiss-Prot ID: Q9UIF7, www.uniprot.org/uniprot/). MUTYH functions as a DNA glycosylase in the base excision repair (BER) pathway that repairs oxidative DNA damage. Oxidized guanine (8-oxoG) is a frequently occurring type of oxidative DNA damage and is liable to form 8-oxoG:A mispairs; MUTYH specifically excises the adenine (A) from these mispairs. After adenine excision, further downstream proteins such as RPA1 repair the mutagenic abasic site. The two main other BER enzymes are OGG1, which excises 8-oxoGs, and NUDT1 (or MTH1), which removes 8-oxodGTPs from the nucleotide pool. A lack of functional mutY in E. coli or MUTYH in humans leads to accumulation of G>T transversions throughout their DNA. MUTYH is evolutionarily strongly conserved [Parker and Eshleman, 2003].

Transcripts, Isoforms, and Functional Domains

MUTYH is transcribed from the minus strand. The TOE1 gene (target of EGR1, member 1 [nuclear]) is located on the opposite plus strand. TOE1 encodes a DNA binding protein that is a target for the tumor suppressor gene EGR1, encoding a transcription factor. TOE1 mediates the inhibitory growth effect of EGR1 and interacts with p53. The 5′ untranslated regions (UTRs) and exon 1 sequences of MUTYH and TOE1 overlap [Makalowska, 2008; Sperandio et al., 2009]. A core promoter region is predicted to be at the location of the three different 5′ UTRs of MUTYH (Fig. 1A and Supp. Fig. S1). MUTYH is expressed in more than 10 alternative splice variants encoding at least seven isoforms of the MUTYH protein (429–549 amino acids; Fig. 1 and Supp. Table S1). The isoforms differ in their N-terminus and the part encoded by exon 3. The N-terminus contains a mitochondrial localization signal (MLS), and putative nuclear localization signals (NLS) are located at both the N-terminus and the C-terminus. The functional significance of the MLS and NLS in MUTYH is not entirely clear. The MLS seems to be dominant over the NLS, but isoforms lacking the MLS have been detected in the mitochondria [Arai et al., 2006; Takao et al., 1998, 1999]. The function of the alternative transcripts and isoforms is also unclear. Possibly they possess different glycosylase activity levels and/or have different expression levels in different tissues (Table 1, Fig. 1, and Supp. Table S1) [Ma et al., 2004].

Figure 1.

MUTYH gene, transcripts, and isoforms (see also Table 1 and Supp. Table S1, for more details and references). A: Gene with UTRs in black for alpha/beta/gamma transcripts and an extra part of UTR for the beta5 transcript, exons alternately light and dark gray and introns as thin black lines. Genomic coordinates according to Chromosome 1, NCBI build 37; cDNA according to NM_001128425.1. B: Transcript variants. Exons are shown to scale, similar to A (without UTRs; exons 5–16 only shown for alpha5; introns not to scale). The part of exon 3 that is the same among transcripts is shown in dark gray and the part that differs in lighter gray (except for alpha, beta, and gamma4 and difficult to see for alpha and gamma2, which have only one amino acid in light gray). The short, probably noncoding, exon 3 of the alpha, beta, and gamma4 transcripts is shown with a striped pattern. The three putative translation initiation codons (AUG) are shown as black vertical bars. Stop codons are also shown as black vertical bars and indicated by “X.” C: “Mature mRNAs” (without UTRs) and the protein isoforms (not separated per different exon 3 splice variant). Exons are colored as in B. In isoforms 1 to 5 the alternatively spliced part encoded by exon 3 is shown as a gray box. The predicted protein domains are shown at the top and bottom of the largest isoform. In light gray the large regions homologous to other proteins are shown and in dark gray smaller functional domains. A putative NLS is located near the MLS and might be only active in the isoforms without MLS. In case of overlap between domains, the smaller domains are drawn on top of the larger. In the middle of the largest isoform, the known secondary structure is shown with light gray boxes for helices and dark gray boxes for beta strands. S, putative phosphorylated serines; A, adenine binding sites; G, 8-oxoG binding sites. D: Part of the MUTYH sequence from http://chromium.liacs.nl/LOVD2/colon_cancer/refseq/MUTYH_codingDNA.html, showing different splicing possibilities for exon 3 (0, 1, 11, or 14 “extra” amino acids). The last G-base of exon 2 can either encode an alanine or a glycine, depending on the two following nucleotides in exon 3. Notably, variants might have different effects on different MUTYH transcripts, like c.36+325G>C/p.(spl?) (c.-7+5C>G in beta transcripts), c.36+1G>A/p.(spl?) (probably causes skip of exon 1 in alpha transcripts, but less likely to affect beta or gamma transcripts), c.39C>T/p.(=) and c.42C>T/p.(=) (near second translation start site) and c.158C>G/p.(Ala53Gly) (at first nucleotide of exon 3 in alpha5, but c.158-42C>G in alpha3).

Table 1. Functional Domains in the MUTYH Protein and the Unique Coding and Splicing Variants Located in Them

| Domain, motif, or site | Amino acid range | No. of amino acids | fsX | X | M | = | Delins | Spl | Total | Variants/codon |
| Mitochondrial localization signal (MLS) | 1–14 | 14 | | | 1 | 2 | | | 3 | 0.21 |
| Replication Protein A1 (RPA1) binding | 8–31 | 24 | 1 | 2 | 5 | 2 | | | 10 | 0.42 |
| EndoIII-6-helix-barrel (catalytic) like | 110–231 | 122 | 1 | 5 | 26 | 3 | 1 | 7 | 43 | 0.35 |
| - DNA minor groove binding | 133–144 | 12 | 1 | 1 | | | | | 2 | 0.17 |
| - Pseudo helix-hairpin-helix (HhH) | 176–181 | 6 | | | 2 | | | | 2 | 0.33 |
| - HhH | 212–219 | 8 | | | 3 | | | 1 | 4 | 0.50 |
| EndoIII-iron-sulfur-cluster (FeS) (catalytic) like | 91–109, 232–314 | 102 | 3 | 6 | 23 | 5 | 1 | 2 | 40 | 0.39 |
| - MutS homolog 6 (E. coli) (MSH6) binding | 246–268 | 23 | | 2 | 3 | 2 | | 1 | 8 | 0.35 |
| - FeS cluster loop | 290–306 | 17 | 1 | | 2 | 1 | | | 4 | 0.24 |
| APEX nuclease 1 (APEX1) binding | 309–331 | 23 | | 1 | 3 | 2 | | 1 | 7 | 0.30 |
| HUS1 checkpoint homolog (S. pombe) (HUS1) binding | 309–364 | 56 | 2 | 2 | 10 | 3 | | 5 | 22 | 0.39 |
| Linker | 315–366 | 52 | 2 | 1 | 9 | 3 | | 4 | 19 | 0.37 |
| C-terminal mutT (or nudix) like | 367–498 | 132 | 10 | 5 | 27 | 5 | 3 | 5 | 55 | 0.42 |
| Nuclear localization signals (NLS) | 16–19 (a), 112–116, 519–523 | 14 | 1 | 2 | 3 | | | | 6 | 0.43 |
| Proliferating cell nuclear antigen (PCNA) binding | 523–541 | 19 | | | 6 | | | | 6 | 0.32 |
| Adenine binding | 109–114, 133–137, 142, 218, 236–238, 279–284 | 22 | | | 4 | 1 | | | 5 | 0.23 |
| OxoG binding | 138–140, 177–179, 396–398, 444–448 | 14 | | 1 | 2 | | | 2 | 5 | 0.36 |
| Putative phosphorylated serines | 6, 49, 99, 363, 508, 518 | 6 | 1 | | | | | 2 | 3 | 0.50 |
| Not localized to a domain | 32–90, 499–518, 542–549 | 87 | 1 | | 6 | 2 | | | 9 | 0.10 |

fsX, frameshift; X, nonsense; M, missense; =, silent; Delins, small deletion or insertion of amino acids; Spl, splicing. Rows marked with a leading hyphen are subdomains listed under the preceding domain. Frameshift and nonsense variants are mentioned at the first domain that is affected. Splicing variants included here affect a domain by a deletion or frameshift. (NCBI Conserved Domain Search service [CD-Search], http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) [Fromme et al., 2004; Guan et al., 1998; Marchler-Bauer and Bryant, 2004; Parker et al., 2003; Parker and Eshleman, 2003; Shi et al., 2006; Takao et al., 1998].

(a) Putative NLS that might be active in isoforms lacking the MLS [Takao et al., 1999].

Exon 3 is subject to alternative splicing at the 5′ end, with the possibilities of none, 1, or 11 extra codons in known mature mRNAs. Another three codons have been found in some transcripts, BM549153.1 and BM911251.1, derived from a human astrocytoma cell line, and are supported by a putative splice site. The splice acceptor sites of exon 3 are relatively weak, according to the splicing prediction programs (SpliceSiteFinder-like, MaxEntScan and GeneSplicer) of Alamut software (Interactive Biosoftware, Rouen, France). No known functional domain is located at this alternatively spliced part of exon 3. Another splice variant of exon 3 lacks 64 nucleotides at the 5′ end and is predicted to encode an isoform starting from a translation initiation site in exon 4 (Fig. 1 and Supp. Table S1).

Isoform 2 is most abundantly expressed (16 coding exons, 535 amino acids, alpha3 transcript), contains the MLS at the N-terminus (amino acids 1–14) and is found to be located in the mitochondria. Isoform 4 is the most abundantly expressed nuclear isoform, lacking the MLS (15 coding exons, 521 amino acids, beta3, beta5, and gamma3 transcripts). Isoform 1 is the longest proven transcript, including 11 extra amino acids encoded by the 5′ end of exon 3 (16 coding exons, 546 amino acids, transcript alpha1). Isoform 5 (16 coding exons, 549 amino acids, transcript alpha5) is the hypothetically longest possible isoform. It contains three additional codons in the 5′ region of exon 3, compared to the alpha1 transcript (Fig. 1 and Supp. Table S1).

To date, the MUTYH protein has 15 published functional domains involved in DNA binding, base flipping, catalysis, excision, 8-oxoG, and adenine detection/recognition, and interaction with other DNA replication and repair proteins (Fig. 1C and Table 1).

MUTYH-Associated Polyposis (MAP)

Biallelic MUTYH pathogenic germline variants lead to colorectal adenomatous polyposis (MIM# 608456, http://www.ncbi.nlm.nih.gov/omim/). MUTYH-associated polyposis (MAP) resembles familial adenomatous polyposis (FAP; MIM# 175100), which is caused by autosomal dominantly inherited pathogenic variants in the adenomatous polyposis coli (APC) gene (MIM# 611731). The first pathogenic MUTYH variants were discovered in 2002 in a British family with an autosomal recessive inheritance of polyps, without a detectable pathogenic APC germline variant, but many somatic G>T transversions in the APC gene in colorectal tumors. Germline variant screening for the three BER genes resulted in detection of the compound heterozygous c.536A>G/p.Tyr179Cys and c.1187G>A/p.Gly396Asp variants in MUTYH (previously c.494A>G/p.Tyr165Cys and c.1145G>A/p.Gly382Asp) [Al-Tassan et al., 2002].

Most MAP patients have <100 adenomas at diagnosis, at a mean age of ∼45 years, and develop colorectal cancer (CRC) at a mean age of ∼50 years. As such, MAP seems to be a relatively mild form of FAP [Nielsen et al., 2009b]. Biallelic pathogenic MUTYH variants have an estimated penetrance for CRC of ∼43% to ∼80% between the ages of 60 and 80 years, with an overall CRC risk elevated 28-fold [Lubbe et al., 2009]. Tumorigenesis in MAP patients is thought to be initiated by somatic G>T transversions in KRAS and/or APC [Jones et al., 2002; Lipton et al., 2003; Nielsen et al., 2009a]. MAP warrants colonoscopic screening every 2 years, starting from the age of 18–20 years. Treatment of MAP patients consists of colonoscopic polypectomy and/or (partial) colectomy according to the severity of the polyposis. Because duodenal and gastric polyps are relatively common, upper gastrointestinal endoscopy is also indicated starting at the age of 25–30 years [Vasen et al., 2008]. Pathogenic germline variants in MUTYH are also associated with extraintestinal features that have a slightly higher incidence in MAP patients compared to the general population [Vogt et al., 2009].

Of the population of European descent, 1–2% carries a heterozygous pathogenic MUTYH germline variant [Al-Tassan et al., 2002; Cleary et al., 2009; Croitoru et al., 2004]. Assuming Hardy-Weinberg equilibrium and a 2% heterozygote frequency, the prevalence of MAP can be estimated at ∼1:10,000. Less than 1% of CRC cases can be attributed to biallelic pathogenic germline variants in MUTYH [Cleary et al., 2009]. Heterozygous pathogenic variants seem to give only a modestly or no significantly increased CRC risk [Cleary et al., 2009; Jones et al., 2009; Lubbe et al., 2009]. Currently, the majority of identified MAP patients are of European descent. In this ethnic group the most common recurrent pathogenic variants are p.Tyr179Cys and p.Gly396Asp, responsible for ∼80% of pathogenic variants found. In other populations (e.g., South and East Asians) other pathogenic variants play a larger role (Fig. 2) [Dolwani et al., 2007; Kim et al., 2007; Tao et al., 2004].
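The prevalence estimate above follows from simple Hardy-Weinberg arithmetic, which can be sketched as below. This is an illustrative calculation only: the function name is hypothetical, and it assumes a single pooled pathogenic allele class with heterozygote frequency 2q(1−q) and biallelic frequency q².

```python
import math

def biallelic_frequency(heterozygote_freq: float) -> float:
    """Expected frequency of biallelic carriers (q^2) under Hardy-Weinberg.

    Solves 2*q*(1 - q) = heterozygote_freq for the smaller root q,
    treating all pathogenic alleles as one pooled allele (an assumption).
    """
    if not 0 <= heterozygote_freq <= 0.5:
        raise ValueError("heterozygote frequency must be between 0 and 0.5")
    q = (1 - math.sqrt(1 - 2 * heterozygote_freq)) / 2
    return q * q

# 2% heterozygote frequency, as assumed in the text
map_prevalence = biallelic_frequency(0.02)
print(f"q^2 = {map_prevalence:.2e} (about 1 in {round(1 / map_prevalence):,})")
```

With a 2% heterozygote frequency the pooled allele frequency is about 1%, so the expected biallelic frequency is close to 1 in 10,000, in line with the estimate in the text.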

Figure 2.

Recurrent variants present in the LOVD database and their geographic origin (variants reported twice or more in independent families; homozygotes counted as one; variants from relatives not counted). Variants reported in dbSNP only and variants reported in dbSNP with an allele frequency >∼2% are omitted from this figure. Black cells indicate that the variant has been detected at least once in a country. Dark gray cells indicate that a variant was reported in a group of individuals from multiple geographic origins. If both the current country and the country of original descent are known, the cells for both countries are filled. N/W/S/E-Eu(rope): Northern/Western/Southern/Eastern Europe; Am: (North) America; CN: China; JP: Japan; KR: South Korea; IN: India; PK: Pakistan; IR: Iran; TR: Turkey; CG: Congo; NG: Nigeria; MA: Morocco; TN: Tunisia; UK: United Kingdom; IE: Ireland; FI: Finland; SE: Sweden; DK: Denmark; NL: The Netherlands; BE: Belgium; DE: Germany; CH: Switzerland; FR: France; IT: Italy; PT: Portugal; ES: Spain; GR: Greece; YU: (former) Yugoslavia; PL: Poland; CZ: Czech Republic; CA: Canada; US: United States; AU: Australia; NZ: New Zealand. The variant spectrum might not be complete for all countries and biases may exist, due to differences in detection methods and numbers of tested patients.

The MUTYH Gene Variant Database in LOVD Format

The pathogenic effect of the common p.Tyr179Cys and p.Gly396Asp variants is beyond doubt (Table 2), but the effect of many other MUTYH variants remains uncertain. In contrast to pathogenic APC variants, which mostly result in a truncated protein, most pathogenic MUTYH variants are missense variants. The effect of missense variants on protein function is difficult to predict. For determining the pathogenicity of a variant, sufficient data about the functional effect and the context in which the variant has been found are essential. Since 2005, we have developed an interactive MUTYH Leiden Open access Variation Database (LOVD) (www.lovd.nl/MUTYH), to serve as an international resource for current and future MUTYH variant and phenotype data.

Table 2. Functional In Vivo and In Vitro Assays Performed for MUTYH Variants in Human Transcripts

We selected the LOVD software because it is user-friendly, freely available, and the software of choice for many other locus-specific databases (LSDBs) (www.hgvs.org, www.lovd.nl) [Fokkema et al., 2005]. Recently implemented software features are the links to view LOVD variants in the UCSC genome browser (Supp. Fig. S1) (http://genome.ucsc.edu/) and NCBI viewer (www.ncbi.nlm.nih.gov/projects/sviewer/) and the possibility of linking next generation sequencing data to LOVD data with the GAPSS pipeline (www.lgtc.nl/GAPSS).

Sequence Variant Nomenclature

Although a genomic reference sequence would provide a uniform nomenclature regardless of different transcripts and isoforms, it gives no information about positions in relation to RNA and protein. Furthermore, MUTYH is transcribed from the minus strand, which gives genomic DNA and RNA annotations in different directions and with complementary nucleotides. Therefore, we chose a coding reference sequence, which is more practical to work with. To facilitate annotation of all possible coding variants, the Human Genome Variation Society (HGVS) recommends using a coding reference sequence that includes the largest theoretically known transcript, containing as many exons as possible, preferably with only exons that do not disrupt the reading frame (www.hgvs.org/mutnomen/refseq.html#general) [den Dunnen and Antonarakis, 2000; den Dunnen and Paalman, 2003]. This reference sequence does not necessarily represent an existing natural transcript variant. Its main purpose is to be complete, and thereby facilitate the description of all possible variants. The MUTYH LOVD provides nomenclature according to the longest possible transcript alpha5, isoform 5 (NG_008189.1, NM_001128425.1, NP_001121897.1), created in collaboration with NCBI staff (www.ncbi.nlm.nih.gov/RefSeq/) [Pruitt et al., 2007]. The previously most frequently used annotation refers to transcript alpha3, isoform 2 (NM_001048171.1, NP_001041636.1). For clarity, the MUTYH LOVD provides descriptions in relation to alpha5 and alpha3, as well as the variants as reported in the original and many subsequent publications/submissions. The annotations of all variants in the LOVD have been generated or checked with the commercially available Alamut software or open-source Mutalyzer software [Wildeman et al., 2008].

The alpha5 transcript contains 42 extra coding nucleotides (c.158_199), corresponding to nucleotides c.158-42_158-1 (intron 2) in alpha3. Positions of variants after c.158/p.53 in alpha3 are shifted by 42 nucleotides and 14 amino acids in alpha5 (e.g., c.494A>G/p.Tyr165Cys becomes c.536A>G/p.Tyr179Cys) (Fig. 1D). Notably, frameshift variants before c.200 in alpha5 may have a different stop codon position compared to alpha3. Furthermore, any variant in the 5′ region and the start of exon 3 may have different consequences in the alpha, beta, or gamma transcripts and isoforms, which should be verified, at least in silico, to aid interpretation of pathogenicity (Fig. 1 and Supp. Table S1). Additionally, MUTYH variants, especially in the 5′ region, may have effects on TOE1 (Supp. Fig. S1).
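The numbering shift between the two reference transcripts can be illustrated with a small helper. This is a minimal sketch with hypothetical function names, not part of LOVD, Alamut, or Mutalyzer; it assumes simple exonic coding positions only (no intronic offsets such as c.158-42, no UTR positions) and treats alpha3 positions from c.158/p.53 onward as shifted. Codon 53 spans the exon 2/exon 3 junction, so variants there should be checked by hand.

```python
EXTRA_NUCLEOTIDES = 42                  # c.158_199, present in alpha5 only
EXTRA_CODONS = EXTRA_NUCLEOTIDES // 3   # 14 amino acids
FIRST_SHIFTED_CDNA = 158                # first alpha3 coding position that moves
FIRST_SHIFTED_CODON = 53                # junction codon; verify manually

def alpha3_to_alpha5_cdna(position: int) -> int:
    """Map an exonic alpha3 coding position to alpha5 numbering."""
    if position < 1:
        raise ValueError("only positive coding positions are handled")
    if position >= FIRST_SHIFTED_CDNA:
        return position + EXTRA_NUCLEOTIDES
    return position

def alpha3_to_alpha5_protein(position: int) -> int:
    """Map an alpha3 (isoform 2) amino acid position to alpha5 (isoform 5)."""
    if position < 1:
        raise ValueError("amino acid positions start at 1")
    if position >= FIRST_SHIFTED_CODON:
        return position + EXTRA_CODONS
    return position

# The two common variants named in the text
for cdna, codon in [(494, 165), (1145, 382)]:
    print(f"c.{cdna} / p.{codon} (alpha3) -> "
          f"c.{alpha3_to_alpha5_cdna(cdna)} / p.{alpha3_to_alpha5_protein(codon)} (alpha5)")
```

For real annotation work a dedicated nomenclature checker remains the safer choice, because intronic and frameshift descriptions cannot be converted by a fixed offset.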

Database Contents and Access

We aimed to collect all variants detected in patients, controls, or specific groups, as well as data from functional in vitro/in vivo assays or in silico analyses, linked to available phenotype data, submitter contact information and/or published reference. To establish the database, we added data from all relevant PubMed references with the keywords MUTYH, MYH, or MutY, from 2001 to 2009. Abstracts or full-text articles were screened for MUTYH variants or genotypes. Currently, all variant-containing references accessible to the authors and in the English language have been included in the database (n=121). In the future, we intend to add inaccessible data from libraries or publishers and, after translation, non-English publications. The database also contains 71 unique records from dbSNP (www.ncbi.nlm.nih.gov/projects/SNP/) [Sherry et al., 2001], including information about geographic or ethnic origin. In total, 1968 published entries are included. Last, submitters have registered and started to enter their unpublished findings (423 entries) (Fig. 3).

Figure 3.

Numbers of unique variants that are published only (dark gray), published as well as submitted (light gray), and submitted only (medium gray) and numbers of database entries (reports) and occurrences (individuals with or tests performed on a variant). n.a., not applicable (entries and occurrences are either published or submitted). Venn Diagram Generator: http://jura.wi.mit.edu/bioc/tools/venn.php.

The MUTYH gene LOVD is maintained in the same database environment as the mismatch repair (MMR) gene LOVDs (MLH1, MSH2, MSH6, PMS2, PMS1, MLH3, and EPCAM) and the new APC LOVD, supported by expert members of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) (www.insight-group.org/, www.LOVD.nl/InSiGHT). This combined environment facilitates collection of data from patients who are screened for multiple CRC-associated genes and with symptoms of overlapping phenotypes, helpful for finding modifier variants or multifactorially inherited CRC.

The database includes variant columns for current and reported DNA and protein annotation, variant pathogenicity, allele frequency, functional data, and references. A column “origin” shows if the variant was reported as a “germline,” “somatic,” or “experimental” variant (e.g., designed for and/or tested in a functional assay). For patient-related data, columns are included for age of diagnosis, polyp number, presence of CRC, submitter information, and a remark column for extracolonic features etc. Unique IDs are available per variant, patient, and family. Patient IDs are used across different CRC gene databases to facilitate linking of information.

Data from functional assays or in silico analysis are displayed in one record per test, similar to (recurrent) occurrences of variants per individual. This format has the advantage that different data types are viewable and searchable within the same database. Specific columns are available for the methods and results, like in the MMR Gene Unclassified Variants Database (Table 2) (www.mmruv.info/) [Ou et al., 2008]. To separate different data types for counting purposes (e.g., of only patient records) the filter options of the software can be used.

Most information is made public by the LOVD curators or submitters and available for all users. Nonpublic information is hidden, for example, in case of privacy issues or when under curation. However, nonpublic variants are searchable and the curator can be contacted for information.

Database Analysis

As of April 2010, the MUTYH LOVD contained 291 unique variants (Fig. 3). The unique variants represent 2,391 different entries, of which 1,943 concern individual patients or controls, 173 functional assays or in silico analyses, and 267 groups of carriers of the same genotype, for example, patients or controls without available data per individual. Variants are distributed throughout the entire gene and 97% are located in known functional domains (Table 1 and Fig. 4). Most unique variants are missense variants (36%) or variants for which no effect on the protein has been predicted (43%). A truncated protein is predicted for 13% of the variants (Table 3). Most MUTYH gene testing has been done in European, North American, Japanese, and Australian populations. Population-specific patterns of variant occurrence are visible, indicating possible founder effects (Fig. 2).

Figure 4.

Unique (upper graph) and total (lower inverted graph) coding variants present in the database, grouped per interval of five codons, along the protein image with indicated domains corresponding to Figure 1C. All LOVD variants, including intronic, are shown in Supp. Figure S1.

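Table 3. Counts of Types of Unique MUTYH Variants in the LOVD Database

| Location | DNA variant | fsX | X | M | = | Small Del | Small Ins | Spl | ? | Total |
| Near gene | Substitution | | | | 6 | | | | | 6 |
| | Small Dup | | | | 1 | | | | | 1 |
| 5′ UTR | Substitution | | | | 7 | | | | | 7 |
| Intron | Substitution | | | | 77 | | | 14 | 2 | 93 |
| | Small Del | | | | 3 | | | 1 | 1 | 5 |
| | Small Ins | | | | 3 | | | | | 3 |
| | Large Ins | | | | 2 | | | | | 2 |
| | Small Dup | | | | 4 | | | | | 4 |
| Exon | Substitution | | 19 | 104 | 20 | | | 3 | | 146 |
| | Small Del | 15 | | | | 3 | | | | 18 |
| | Small Indel | 1 | | | | | | | | 1 |
| | Small Dup | 3 | | | | | 2 | | | 5 |

Columns fsX through ? give the predicted protein effect. fsX, frameshift; X, nonsense; M, missense; =, silent or p.(=); Del, deletion; Ins, insertion; Spl, splicing; ?, unknown effect; Indel, insertion and deletion; Dup, duplication.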
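The missense and truncating proportions quoted in the Database Analysis text can be recomputed from the Table 3 counts. The snippet below is only an arithmetic sketch; the constant names are hypothetical, and the counts are the column totals as read from Table 3 (frameshift 15 + 1 + 3, nonsense 19, missense 104, out of 291 unique variants).

```python
# Column totals as read from Table 3 (unique variants)
FRAMESHIFT = 15 + 1 + 3   # exonic small deletions, indels, and duplications
NONSENSE = 19
MISSENSE = 104
TOTAL_UNIQUE = 291

def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole

missense_pct = percent(MISSENSE, TOTAL_UNIQUE)
truncating_pct = percent(FRAMESHIFT + NONSENSE, TOTAL_UNIQUE)

print(f"missense: {missense_pct:.1f}%")
print(f"predicted truncating (frameshift + nonsense): {truncating_pct:.1f}%")
```

Both values round to the figures given in the text (36% and 13%).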

Tumor types other than colorectal polyps and cancer that have been studied in relation to MUTYH variants include: gastric cancer [Goto et al., 2008; Kim et al., 2004; Tao et al., 2004; Zhang et al., 2006a], prostate cancer [Shin et al., 2007], bladder cancer [Figueroa et al., 2007; Huang et al., 2007], endometrial cancer [Barnetson et al., 2007], lung cancer [Al-Tassan et al., 2004], childhood leukemia [Akyerli et al., 2003], breast cancer [Beiner et al., 2009; Zhang et al., 2006b], hepatocellular carcinoma and cholangiocarcinoma [Baudhuin et al., 2006], and head and neck cancer [Görgens et al., 2007; Sliwinski et al., 2009]. So far, there are no reports of biallelic pathogenic MUTYH variants in the absence of a colorectal phenotype. Two biallelic carriers have been described for whom no colonoscopy data have been published, one with endometrial carcinoma and sebaceous gland carcinoma and one with gastric cancer [Barnetson et al., 2007; Tao et al., 2004]. Twenty-five patients with variants in multiple genes (in MUTYH as well as MSH6 [11], MLH1 and MSH6 [1], MSH2 [4] and APC [9]) are present in the database.

A summary of functional assay results from previously published data and from unpublished data recently submitted to the database is shown in Table 2. Five unpublished reverse transcriptase PCR (RT-PCR) test results have recently been submitted. Cloning assays or haplotype inference from homozygotes showed that some variants are probably (but not always) located on the same allele (in the LOVD alleles can be assigned to “parent #1” or “parent #2”).

Variant Pathogenicity

Like many other LSDBs, the MUTYH LOVD currently serves as a data source to aid decision making about pathogenicity by the clinical and/or molecular geneticists responsible for patients. In this database the pathogenicity is displayed as reported by the publication/submitter and as concluded by the curators, in a five-category system [Plon et al., 2008]. At present, the curators assign only variants with ample evidence from functional as well as clinical data to classes other than “variant of uncertain significance” (VUS). Fifty-seven variants have been classified as (likely) pathogenic (frameshift variants, nonsense variants, missense variants causing lack of function and proven splice variants). The remaining 234 are VUS, including variants published as (likely) neutral. Of note is that a variant classified as (likely) pathogenic will usually only give rise to MAP in a biallelic combination with a second (likely) pathogenic variant.

Ideally, classification of variant pathogenicity should be done in a standardized manner, by weighing clinical, functional in vitro, in vivo and in silico evidence, by multidisciplinary expert panels. The proposed role in this for LSDBs is twofold: (1) to collect all available evidence and (2) to display the current class for each variant with a summary of the supporting evidence and transparency about the decision process. The InSiGHT organization collaborates with the Human Variome Project (HVP, http://www.humanvariomeproject.org/) to improve the MMR gene databases and to organize variant classification according to the earlier mentioned recommendations, undertaken by an expert panel: the InSiGHT Classification Committee. It is hoped that the MMR-related project will serve as a model for other gene database and classification projects [Greenblatt et al., 2008].

Database Submission

Colleagues in MUTYH gene research and diagnostics are invited to submit all their genotype and phenotype data. To submit data, the submitter needs to register at www.LOVD.nl/InSiGHT. Submission is possible via forms, to the MUTYH gene database and the aforementioned other CRC gene databases at the same website. Alternatively, data can be sent by e-mail in a table or spreadsheet to the curators. The submitted records will become public after curation, if permitted by the submitter, with a link to the submitter's contact information. Submitters can access and modify their own data. The types of data that are welcome for submission include, but are not limited to, genotype and phenotype data from biallelic MAP index patients, monoallelic carriers, relatives, control subjects (with, e.g., ethnic origin), patients with other phenotypes than MAP, carriers of (likely) neutral variants, and functional or in silico test data.

Data concerning frequent variants like p.Tyr179Cys and p.Gly396Asp are likely to be plentiful. The significance of abundant reports of the same variant might be questioned. However, their reporting keeps the database contents representative and allows for reporting of rare phenotypical features, allele frequencies in different groups, geographic, and ethnic origins. The latter can be helpful for ethnicity dependent testing.

The LOVD system can store all phenotype data available, but discussions are currently ongoing to see how to best ensure privacy. Detailed phenotypical and family related data are crucial for pathogenicity assessment of variants. However, such information visible on the internet might conflict with privacy issues. We propose that in the future, the system should be able to host limited access phenotypical and family information, accessible for analysis models, but only visible for curator, submitter, and collaborators.

Published data can also be resubmitted or modified by the owners. Removal of duplicates is crucial to prevent overreporting. Furthermore, for publications concerning groups of patients or controls without information per individual, authors are invited to submit relevant individual data. The Human Mutation journal strongly recommends submission of variants to an LSDB or creation of an LSDB if not yet present, before an article is published (Human Mutation Author Guidelines, www3.interscience.wiley.com/journal/38515/home/ForAuthors.html).

Conclusion and Future Perspectives

This MUTYH gene LSDB in LOVD format provides a valuable resource for researchers and clinicians. It contains nearly all published MUTYH variants and one-fifth of the entries are submitted entries with unpublished data from DNA diagnostic and research laboratories around the world. This database has made it possible to instantly share unpublished MUTYH variants, functional test results, and phenotypes. Sharing of these data would otherwise have taken much longer or not have happened at all. Worldwide contributions to this database should help to further centrally catalog and promote exchange of information on rare variants, facilitating the classification of variant pathogenicity. The MUTYH LOVD is designed for ease of submission and consultation.

In the future, the database can be expanded with more in silico data, obtained by use of uniform software and parameters (e.g., evolutionary conservation and splice-site analysis). Also, the use of the database can be further optimized by controlled vocabularies. In addition, Laboratory Information Management Systems (LIMS) will be designed to facilitate direct submission of variants and local registration. A key value of the LOVD lies in ideas yet to be developed by researchers for whom unfettered access to phenotype and family data is vital. An important factor for assessing VUS pathogenicity can be the number of independent alleles in the dataset. This might be addressed by tracking short tandem repeats or single nucleotide polymorphisms (STRs/SNPs) for ambiguous alleles. Epidemiological studies or meta-analyses of the database contents might be possible in the future, but reporting biases should be taken into account. We welcome suggestions or collaborative initiatives from our colleagues.


We thank all involved clinicians, researchers, and anonymous patients for their contribution to the database contents, with special thanks to Renée Niessen, Dennis Dooijes, Jacopo Celli, Laura de Bes, Senay Ozturk, Marieke de Graaff, the InSiGHT group, and the Netherlands Foundation for the Detection of Hereditary Tumours for their contribution to this work. We also thank future submitters/collaborators in advance. AO was supported by the Dutch Cancer Society grant KWF-UL-2006-3601. IF and LOVD software are supported by the European Community's Seventh Framework Programme (FP7/2007–2013), grant agreement no. 200754—the GEN2PHEN project.