The first two authors contributed equally to this work.
Informatics
Novel tools for extraction and validation of disease-related mutations applied to Fabry disease†
Article first published online: 13 JUL 2010
DOI: 10.1002/humu.21317
© 2010 Wiley-Liss, Inc.
Additional Information
How to Cite
Kuipers, R., van den Bergh, T., Joosten, H.-J., Lekanne dit Deprez, R. H., Mannens, M. M. and Schaap, P. J. (2010), Novel tools for extraction and validation of disease-related mutations applied to Fabry disease. Hum. Mutat., 31: 1026–1032. doi: 10.1002/humu.21317
†Communicated by Alastair F. Brown
Publication History
- Issue published online: 27 AUG 2010
- Article first published online: 13 JUL 2010
- Manuscript Accepted: 23 JUN 2010
- Manuscript Received: 7 JAN 2010
Keywords:
- Fabry;
- GLA;
- database;
- 3DM;
- validator;
- Mutator;
- alpha-amylase
Abstract
Genetic disorders are often caused by nonsynonymous nucleotide changes in one or more genes associated with the disease. Specific amino acid changes, however, can lead to large variability of phenotypic expression. For many genetic disorders this results in an increasing number of publications describing phenotype-associated mutations in disorder-related genes. Keeping up with this stream of publications is essential for molecular diagnostics and translational research purposes but often impossible due to time constraints: there are simply too many articles to read. To help solve this problem, we have created Mutator, an automated method to extract mutations from full-text articles. Extracted mutations are cross-referenced to sequence data and a scoring method is applied to filter out false positives. To analyze stored and new mutation data for their (potential) effect we have developed Validator, a Web-based tool specifically designed for DNA diagnostics. Fabry disease, a monogenic disorder of the GLA gene, was used as a test case. A structure-based sequence alignment of the alpha-amylase superfamily was used to validate results. We have compared our data with existing Fabry mutation data sets obtained from the HGMD and Swiss-Prot databases. Compared to these data sets, Mutator extracted 30% additional mutations from the literature. Hum Mutat 31:1026–1032, 2010. © 2010 Wiley-Liss, Inc.
Introduction
Due to the ease of today's gene sequencing methods, the relation between genes and corresponding diseases has been unraveled for several genetic disorders. Moreover, the specific sequencing of disease-related genes in patients has enormously increased the available mutation data in the literature. For some extensively investigated genes, gene-specific mutation databases are generated by extraction of mutational information from the literature. Examples of such mutation databases are the IARC TP53 Mutation database [Petitjean et al., 2007] and the UMD p53 database [Olivier et al., 2002] for the tumor suppressor gene TP53. For molecular diagnostics and translational research these databases are used as a reference to distinguish between naturally occurring single nucleotide polymorphisms (SNPs) and (potentially) pathogenic mutations in patients. Populating these databases usually requires manual intervention, which makes it difficult to generate and maintain mutation databases. Therefore, up-to-date mutational databases are only available for a select number of disease-related genes.
In 2004, the tool MuteXt [Horn et al., 2004] was described for the automatic extraction of mutational information from the literature. This tool was specifically designed for populating the nuclear receptor [Van Durme et al., 2003] and GPCR [Horn et al., 2003] Molecular Class-Specific Information Systems with mutation data. We have used the MuteXt method as the basis for a new tool, Mutator, which can automatically extract and store mutational information from the literature for genes that are related to a genetic disorder.
Mutator was used to create a Fabry mutational database (FMDB). Fabry disease is an X-linked inborn error of glycosphingolipid catabolism that results from mutations in the alpha-galactosidase A gene (GLA; MIM♯ 300644) at Xq22.1. Currently, two main Fabry disease-related mutation data sets exist: the Human Gene Mutation Database (HGMD) [Stenson et al., 2009], and a collection of mutations automatically extracted from the UniProt databases [Yip et al., 2008]. The HGMD database is more complete because its mutational information is extracted from the literature. However, maintaining this database requires manual intervention. Our method extracts mutations from full-text publications in a fully automated manner. The result shows almost 100% coverage of the mutations listed in the combined UniProt and HGMD databases. Moreover, Mutator extracted 30% additional mutations from the literature, covering 25% additional amino acid positions.
Human alpha-galactosidase is a member of the alpha-amylase protein superfamily. In the past, it was shown that protein superfamily-derived data contextually stored in a Molecular Class-Specific Information System (MCSIS) can be used to describe individual functions of residues in proteins [Folkertsma et al., 2004]. This has led to the development of the 3DM suite, a new generation MCSIS builder, that can semiautomatically generate protein superfamily systems specifically designed for mutant prediction purposes [Joosten et al., 2008; Kuipers et al., 2009; Leferink et al., 2009; Narayanan et al., 2009]. A 3DM superfamily system is a knowledge base that contains and connects many different superfamily-related data types, such as structures, sequences, structure-based multiple sequence alignments, protein–ligand interactions, mutational data, correlated mutation analysis results, and residue conservation. Mutator is part of the 3DM suite. The 3DM mutational data that is extracted from literature is collected by Mutator.
3DM was used to collect alpha-amylase superfamily data and to generate the structure-based superfamily alignment (3D-MSA). Strong correlations were observed between the aggregated mutational data and 3D-MSA derived data, which suggested that alignment-derived data can in principle be used to predict the pathogenicity of individual mutations in GLA.
On these principles Validator, a 3DM Web-based graphical user interface, was developed for retrieval of literature-extracted mutations and for validation of (new) amino acid variants (see Supp. Fig. S1). For variant validation, Validator uses various information types that are stored in the 3DM database, such as alignment information (e.g., amino acid conservation) and structural information (e.g., solvent accessibility, secondary structure information). The predictive value of each information type is predetermined by examining how all known Fabry mutations relate to that information type. Furthermore, Validator generates a structure model for each variant in which bumps with neighboring amino acids that result from the variant are highlighted. These models can be viewed directly from the Validator website or can be downloaded, visualized, and analyzed in the state-of-the-art protein visualization tool YASARA. The newly developed Validator, the FMDB, and the 3DM structure-based superfamily alignment are freely available at http://3DMCSIS.systemsbiology.nl/FMDB/. The source codes of Validator and Mutator are currently an integral part of the 3DM commercial software suite. For other protein families commercial licenses can be obtained.
Materials and Methods
3DM Structure-Based Superfamily Alignment Generation
The structure-based superfamily alignment of the alpha-amylase protein superfamily was generated as outlined by Folkertsma et al. [2005] and Joosten et al. [2008]. This method was automated in the 3DM suite, extensively reviewed by Kuipers et al. [2010], and is only briefly described here. All structure files from the SCOP [Murzin et al., 1995] alpha-amylase family were extracted from the SCOP database to obtain a list of protein structure files with the alpha-amylase fold. The protein sequence of each distinct structure on this list was used as a query to BLAST [Altschul et al., 1990] against the PDB database [Berman et al., 2000] with a cutoff e-value of 0.005 to obtain a complete list of available structure files. For multidomain proteins, only the alpha-amylase domain of the sequence was used as the BLAST query to prevent inclusion of proteins that only contain a domain not related to the alpha-amylase superfamily. Identical BLAST search settings were used for searches performed against the Swiss-Prot and TrEMBL [Boeckmann et al., 2003] databases to collect sequences for which no structure is available. 3DM was used to generate a structure-based superfamily alignment from these sequences and structures in three steps:
- 1. The structure files were superimposed on the structure of the human GLA (pdb code 1R47) [Garman and Garboczi, 2004]. From the resulting superpositioning, a structure-based multiple sequence alignment was extracted composed of structurally equivalent residues (core). Structural equivalence is defined as three or more consecutive residues that have their C-alphas within a 2.5 Å sphere from the equivalent GLA residues.
- 2. The sequences of the resulting core alignment were divided into subgroups so that the sequences of each subgroup are no more than 80% identical to the next subgroup. For each subgroup a representative template structure is selected based on criteria such as the quality of the structure, the number of residues for which 3D coordinates are available in the structure, and the number of residues in the core as determined in step 1.
- 3. An iterative profile-based alignment procedure [Oliveira et al., 2002] (automated in 3DM) was used to separately generate subfamily alignments by aligning each superfamily sequence to the most similar template structure. These separate subfamily alignments were combined to generate the final superfamily alignment using the core alignment from step 1 as a guide.
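The structural-equivalence criterion used in the first step can be illustrated with a short sketch. It is not the 3DM implementation: it assumes that the structures are already superimposed and that residues have already been paired by index, and the function name `core_positions` and its parameters are hypothetical.

```python
from math import dist

def core_positions(ref_ca, other_ca, cutoff=2.5, min_run=3):
    """Return indices that belong to runs of at least `min_run` consecutive
    residue pairs whose C-alpha atoms lie within `cutoff` angstrom.

    ref_ca / other_ca: lists of (x, y, z) tuples, or None for a missing
    residue. Pairing by list index is an assumption of this sketch.
    """
    close = [
        a is not None and b is not None and dist(a, b) <= cutoff
        for a, b in zip(ref_ca, other_ca)
    ]
    core, run = [], []
    for i, ok in enumerate(close + [False]):  # sentinel flushes the last run
        if ok:
            run.append(i)
        else:
            if len(run) >= min_run:
                core.extend(run)
            run = []
    return core

# Toy coordinates (not taken from any PDB file).
reference = [(i * 3.8, 0.0, 0.0) for i in range(8)]
offsets = [1.0, 0.5, 2.0, 5.0, 0.1, 0.2, 3.0, 0.0]
other = [(x, d, 0.0) for (x, _, _), d in zip(reference, offsets)]
print(core_positions(reference, other))
```

In this toy case only the first three residues form a run that is long enough; the pair at indices 4 and 5 is close in space but too short to count as core.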
Mutator
An overview of the workflow of Mutator is given in Figure 1. To collect a large set of articles that potentially contain mutational information on GLA (or proteins homologous to GLA), a keyword list was created. This list is used by Mutator to query the PubMed database to obtain a list of full-text articles. Mutator collects mutations in four steps: (1) retrieval of keyword-selected (full-text) publications; (2) screening of the individual (full-text) publications for mutational data using regular expressions; (3) selection of sequences matching the wild-type subject protein sequence; (4) overall scoring of the combined features of individual (full-text) publications. For scoring of the mutations a Sequence Score (SQ-score) was used. Mutations extracted from publications that scored above the experimentally derived threshold levels were stored in a database. Details of the Mutator workflow (Fig. 1) are presented in Supp. Figure S2 and Supp. Workflow S1. Mutator was specifically designed to collect mutational information reported in proteins (or genes) related to certain diseases in patients. Therefore, in addition to the MuteXt method, a module was added to Mutator that can detect mutations reported in DNA sequences.
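The regular-expression screening of step 2 can be sketched as follows. The patterns below are illustrative assumptions and not Mutator's actual expressions, which are not given in the text; they only cover protein-level point mutations written in one-letter or three-letter notation, and the name `extract_mutations` is hypothetical.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "|".join(THREE_TO_ONE)

# e.g. G11V, p.A143P, R220X (X = stop); assumed notation, not exhaustive
ONE_LETTER = re.compile(rf"\b(?:p\.)?([{AA1}])(\d+)([{AA1}X])\b")
# e.g. Gly11Val, p.Ala143Pro
THREE_LETTER = re.compile(rf"\b(?:p\.)?({AA3})(\d+)({AA3})\b")

def extract_mutations(text):
    """Return sorted (wild_type, position, mutant) tuples found in `text`."""
    found = set()
    for wt, pos, mut in ONE_LETTER.findall(text):
        found.add((wt, int(pos), mut))
    for wt, pos, mut in THREE_LETTER.findall(text):
        found.add((THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]))
    return sorted(found, key=lambda m: (m[1], m[0], m[2]))

sample = "The p.A143P and Gly11Val variants, plus R220X, were found in 1R47."
print(extract_mutations(sample))
```

Note that the PDB code `1R47` in the sample is not picked up, because the pattern requires a word boundary before the wild-type letter. Patterns this permissive will still produce false positives on real articles, which is why the later sequence check and scoring steps matter.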
Validator
Validator is a graphical user interface, specifically designed for DNA-diagnostic purposes. It can be used for variant analysis and retrieval of literature-derived mutation data for a specific sequence of the 3DM database. After a mutation is provided, the tool returns the associated literature extracted by Mutator and a structural protein model visualizing the mutation, including potential bumps with surrounding amino acids (Fig. 2). In addition, it predicts the likelihood that the mutation is pathogenic based on superfamily alignment statistics such as (structural) conservation and amino acid distributions per alignment position (detailed in the Results section). For a given mutation Validator also presents the Grantham distance [Grantham, 1974], the Blosum62 substitution score [Henikoff and Henikoff, 1992], and the solvent accessibility, and provides links to the PolyPhen prediction tool [Sunyaev et al., 2000] and the SIFT classification [Ng and Henikoff, 2003].
Results and Discussion
Mutator Applied to Fabry Disease: Generation of the FMDB
This work presents a collection of GLA mutations retrieved from literature last accessed on 29 April 2009. It should be mentioned that the fully automated nature of Mutator enables continuous scanning of the literature and that the dataset presented here will soon be outdated.
Yip et al. [2008] recently described a method to retrieve single amino acid polymorphism data from the Swiss-Prot database. Specifically for GLA this dataset contains 137 mutations that cover 101 residues of the GLA sequence. The HGMD contains mutations that are both automatically and manually collected. Excluding splice-site mutations, insertions, deletions, stop codons, and frame shifts, the free section of the 2009 version of the HGMD database contains 256 unique point mutations that cover 166 residue positions of the GLA sequence. The restricted HGMD contains 301 unique GLA point mutations in total. HGMD describes a mutation only once. Mutator, however, stores references to all available literature for each specific mutation, providing access to disease-related metadata such as literature sources that contain variant phenotypic expression data in different patients. An overview of mutations available in the FMDB, HGMD, and UniProt is available in Supp. Table S1.
Mutator uses a four-step procedure to extract mutations from the literature: (1) retrieval of keyword-selected publications, (2) screening of the individual publications for mutational data using regular expressions, (3) evaluation of mutational data with respect to the corresponding subject sequence (here GLA), and (4) scoring of combined features above a set threshold. Supp. Table S2 shows the keyword list used by Mutator to query the PubMed database for Fabry disease-related publications. This Fabry list resulted in the retrieval of 12,847 full-text publications. From this set Mutator extracted and stored in the FMDB 1,781 mutations (371 unique mutations). All articles that Mutator selected for the first 100 GLA residues were manually examined for the presence of Fabry-related mutational data. For these first 100 residues, Mutator collected 338 mutations from exactly 100 articles. Of these 100 articles, only six could be considered false positives, because they described mutations not related to Fabry disease. Three of these six articles described mutations in a human protein (human coagulation factor X) that contains a domain abbreviated as Gla. The other three articles described the human matrix Gla protein. To cope with such inconsistencies caused by ambiguous keywords, extracted mutational data must also match the corresponding subject sequence. For example, when Mutator extracts the mutation G11V from a keyword-selected text file, the program verifies that residue number 11 of the GLA subject protein sequence is indeed a glycine. In theory, this step should reduce the false positive discovery rate for single extracted mutations to 5%. However, besides containing the right keywords, all six articles report mutations at positions for which the GLA sequence has the same residue type, such as the single G11V mutation described for the Gla domain of human coagulation factor X [Chafa et al., 2009].
Therefore, an option was added that enables the user to provide a black list of keywords. Rerunning Mutator using “matrix gla protein,” “human factor,” and “coagulation factor” as black list keywords removed these six false positive articles from the final set. Excluding splice-site mutations, insertions, deletions, stop codons, and frame shifts, this set contains 1,512 unique mutations and is highly (if not exclusively) populated with Fabry-related mutations. Comparison of this set of mutations with mutations stored in the HGMD and Swiss-Prot mutational databases showed that Mutator had collected 70 additional unique mutations. Six mutations were missed by Mutator because they were published in journals to which no subscription was available. This large set of Fabry mutations enabled us to find correlations between pathogenicity of mutations in GLA and other data types that are stored in a 3DM database. Although here we have focused on GLA, it must be noted that the alpha-amylase superfamily (see below) also includes human alpha-N-acetylgalactosaminidase (alpha-NAGA; alpha galactosidase B; MIM♯ 611458). Substitutions in alpha-NAGA can cause Schindler disease. Upon switching from GLA to alpha-NAGA (Swiss-Prot: P17050) Mutator extracted from the literature all alpha-NAGA mutations that are reported in the OMIM database.
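The two filters described above, the wild-type residue check and the keyword black list, can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the black list is applied as a simple case-insensitive substring match, and the short sequence is a toy string, not the GLA sequence.

```python
def matches_wild_type(mutation, sequence):
    """True if the wild-type residue of (wt, position, mutant) is found at
    the 1-based `position` in the subject protein sequence."""
    wt, pos, _ = mutation
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == wt

def is_blacklisted(text, blacklist):
    """True if any black-list phrase occurs in the article text
    (case-insensitive substring match; an assumption of this sketch)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in blacklist)

# Toy subject sequence with a glycine at position 11 (NOT the GLA sequence).
toy_sequence = "MKLSTAVRDEGWHY"
blacklist = ["matrix gla protein", "human factor", "coagulation factor"]

print(matches_wild_type(("G", 11, "V"), toy_sequence))   # residue 11 is G
print(matches_wild_type(("A", 11, "V"), toy_sequence))   # mismatch
print(is_blacklisted("Gly11Val in the Gla domain of coagulation factor X", blacklist))
```

With 20 amino acid types, a mutation extracted from an unrelated protein would pass the residue check by chance roughly once in 20 times if residues were uniformly distributed, which is presumably the reasoning behind the 5% figure; the black list catches the cases that slip through.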
Alpha-Amylase Superfamily Alignment
The GLA protein is a member of the alpha-amylase protein superfamily, and protein superfamily-derived data can be used to describe individual functions of residues in proteins [Folkertsma et al., 2005; Kuipers et al., 2010]. The structures available for the alpha-amylase superfamily can be divided into 41 sequentially distinct groups. The following structure files from the PDB database were chosen as representative structures to generate the superfamily alignment: 1A47A, 1AMYA, 1AQHA, 1B2YA, 1BAGA, 1BF2A, 1BLIA, 1BVZA, 1EA9C, 1EH9A, 1G5AA, 1GCYA, 1GJUA, 1GVIA, 1H3GA, 1HVXA, 1IZJA, 1LWJA, 1M53A, 1M7XA, 1MXGA, 1QHOA, 1R47B, 1UD2A, 1UOKA, 1W9XA, 1WZAA, 2AAAA, 2BHUA, 2DH3A, 2E8YA, 2FH8A, 2GUYA, 2VUYA, 2Z1KA, 2ZE0A, 2ZICA, 3BC9A, 3CC1A, 3CZGA, and 3DHUA. The resulting superfamily alignment contains 4,986 unique sequences and 217 structurally conserved positions (the core).
Figure 3 shows that 74% of the reported mutations are at structurally conserved positions (core). Two regions outside the core are highly populated with Fabry-related mutational data. These two regions are a helix in the middle of the GLA sequence and a beta-sheet at the C-terminal end of the protein (Fig. 3) and contain 77 and 72 mutations, respectively. These two regions are present in most alpha-amylase structures. However, due to positional variability within the superfamily structures the superposing of these regions was ambiguous (Fig. 4A). These two regions were therefore not included in the core. A more straightforward approach to determine structurally important positions would be to assign structural importance only to residues of secondary structural elements (e.g., helices and beta-sheets). The advantage of such a method is that only the structure of the target protein (here GLA) is needed. However, it should be noted that, even though the core mostly consists of secondary structural elements, using only secondary structural elements as a delimiter is no solution. For instance, the residue position with the highest number of extracted Fabry-related mutations (3D number 47; Fig. 3) is not part of any secondary structural element, but is positioned in a structurally highly conserved loop (Fig. 4B) located at the outside of the protein. Additionally, alignment position 48, which is also part of this loop, is a highly conserved glycine residue, which demonstrates that important residues are not exclusively located in secondary structural elements. If both core and secondary structural elements are considered to be structurally important positions, 84% of all Fabry-related mutations are linked to this group. This result suggests that it is five times more likely that a random mutation will result in manifestation of Fabry disease if this mutation involves a structurally important position.
The Validator tool (see below) therefore defines both core and secondary structural elements as structurally important positions.
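The 84% figure and the "five times" ratio can be reproduced with simple arithmetic. The sketch below is one reading of the reported numbers, not a calculation given by the authors: it assumes that the 1,512 mutations of the filtered set form the denominator, and that the structurally important group consists of the 1,117 core mutations plus the 77 and 72 mutations in the two noncore secondary structure regions.

```python
# Counts taken from the text; how they combine is an assumption of this sketch.
total_mutations = 1512      # filtered FMDB set
core_mutations = 1117       # mutations at core positions (Fig. 3)
helix_region = 77           # noncore helix in the middle of GLA
sheet_region = 72           # noncore beta-sheet at the C-terminal end

important = core_mutations + helix_region + sheet_region
share = important / total_mutations
ratio = share / (1 - share)

print(f"structurally important share: {share:.1%}")
print(f"important vs. other positions: {ratio:.1f} to 1")
```

The ratio compares the raw shares of mutations only; it does not normalize for how many residue positions fall in each group.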

Figure 3. Number of independently reported Fabry disease-related mutations per GLA residue position. Independently reported mutations detected at structurally conserved positions (core) are in light gray. Structurally nonconserved positions are in dark gray. More than 40 independently reported mutations were extracted for 3D positions 47, 124, and 172, corresponding with R112, N215, and R301, respectively, of the GLA amino acid sequence. Note that, although only 50% of the GLA residues are core positions, the large majority of the mutations (1,117) are observed at those positions.

Figure 4. YASARA ball-and-stick backbone representation of seven superimposed protein structures of different subfamilies of the alpha-amylase superfamily. A: In red, equivalent helices from the seven proteins. This helix is present in almost all superfamily members, but could not be included in the core due to variable positioning within the crystal structures. B: The yellow- and red-colored residues are part of a structurally highly conserved loop at 3D positions 47 and 48, respectively. 3D position 48 is a highly conserved glycine. 3D position 47 is in GLA an arginine (R112) and the most frequently reported mutated residue.
Mutation Analysis
Fabry disease-causing mutations and amino acid occurrences
Superfamily alignments can be considered as inventories of nature's successful mutagenesis experiments conducted during millions of years of evolution. In theory, the spectrum of residues present at a specific alignment position could be considered as allowed substitutions. Statistical analysis of superfamily alignments can therefore potentially be used to predict the pathogenicity of specific mutations. This idea was tested using the set of Fabry-related mutations collected in the FMDB. Figure 5 shows the relationship between the relative occurrence of amino acids at core positions and reported corresponding GLA mutations. For example, only 4% of the 1,117 reported GLA mutations in the core are mutations to an amino acid residue that is present at the corresponding alignment position in more than 26% of the aligned alpha-amylase sequences. Likewise, only 17% of mutations reported in structurally conserved residues are mutations to an amino acid present at the corresponding alignment position in more than 7% of the aligned alpha-amylase sequences. Thus, the introduction of a residue type that is infrequently observed at the particular alignment position in the complete superfamily alignment has a high probability of being pathogenic, implying that this correlation can in principle be used to predict the pathogenicity of an unclassified variant (UV) in GLA. For example, if a particular UV is a mutation to an amino acid that is present in more than 25% of the alpha-amylase sequences at the corresponding alignment position, then the analysis suggests a small probability of pathogenicity for this UV. On the other hand, when the amino acid introduced by the UV is present in less than 5% of the alpha-amylase sequences at the corresponding alignment position, then the analysis suggests a high probability of pathogenicity for this UV.

Figure 5. Correlation between the relative amino acid conservation (x-axis) and frequency of reported Fabry-related mutations (y-axis). The x-axis represents the percentage of sequences that contain the mutated residue. The y-axis represents the percentage of the total number of Fabry-related mutations collected by Mutator. This plot shows that Fabry disease is most often the result of a mutation in GLA that resulted in a residue that is not commonly observed at the corresponding alignment position. There is a clear relation between the frequency of reported Fabry-related mutations and the occurrence of amino acids at alignment positions.
This correlation is not valid for noncore positions. For these positions only the amino acid occurrences of the 77 sequentially related sequences of the GLA subfamily can be used. However, even within this small set, mutations at highly conserved positions are more likely to be pathogenic (see examples below).
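The occurrence-based reasoning above can be sketched as a small function that looks up how often the introduced residue occurs in one alignment column and applies the 5% and 25% cutoffs mentioned in the text. This is an illustration, not Validator's scoring: the function names are hypothetical, ignoring gap characters is an assumption, and the column is toy data rather than a real 3DM alignment position.

```python
from collections import Counter

def residue_frequency(column, residue):
    """Fraction of aligned sequences with `residue` at one alignment
    position. Gaps ('-') are ignored, which is an assumption here."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    return Counter(residues)[residue] / len(residues)

def occurrence_hint(frequency, low=0.05, high=0.25):
    """Map a frequency onto the qualitative reading used in the text."""
    if frequency < low:
        return "high probability of pathogenicity"
    if frequency > high:
        return "low probability of pathogenicity"
    return "inconclusive"

# Toy column of 100 aligned residues (not real alignment data).
column = "P" * 43 + "A" * 37 + "S" * 19 + "T" * 1

for new_residue in ("P", "S", "T"):
    freq = residue_frequency(column, new_residue)
    print(new_residue, f"{freq:.0%}", occurrence_hint(freq))
```

For a noncore position the same function could be applied to the column of the 77-sequence GLA subfamily alignment instead of the full superfamily column. As the p.A143P example later in the text shows, a frequent residue does not rule out pathogenicity, so such a hint is only one indicator among several.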
Fabry disease-causing mutations and solvent accessibility
Solvent accessibility is the degree to which a residue in a structure is exposed to solvent (i.e., how close it lies to the surface of the structure). Using a limited dataset of 278 missense mutations, Garman [2007] has shown that there is a strong correlation between solvent accessibility of residues and observed Fabry disease-causing mutations. The substantially larger set of mutational data collected in this study and the availability of the structural alignment make it possible to study the predictive value of solvent accessibility at both structurally conserved and nonconserved positions (Fig. 6). Two correlations are plotted: (1) the correlation between Fabry disease-causing mutations at structurally conserved core positions and their solvent accessibility and (2) the correlation between Fabry disease-causing mutations at structurally nonconserved positions and their solvent accessibility. The plot clearly shows that this strong correlation exists almost exclusively for core positions. This surprising observation is important because it suggests that solvent accessibility should be used as an indicator for pathogenicity only at core positions.

Figure 6. The correlation between Fabry disease-causing mutations at structurally conserved (core) and nonconserved positions and their solvent accessibility. X-axis: percentage of accessible side chain surface area for each residue in the human GLA protein. Y-axis: percentage of the total set of Fabry disease-causing mutations. Two plots are drawn: (1) the correlation between Fabry disease-causing mutations at structurally conserved core positions and their solvent accessibility (light gray line), and (2) the correlation between Fabry disease-causing mutations at structurally nonconserved positions and their solvent accessibility (dark gray line). The vertical black line indicates that only 7% of the 1,117 Fabry mutations located in the core are at positions with a solvent accessibility of >33%. In contrast, almost half (44%) of the Fabry mutations located outside the core are at positions with a solvent accessibility of >33%.
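The comparison summarized by the vertical line in Figure 6 amounts to computing, separately for core and noncore positions, the fraction of mutations that fall on residues above an accessibility cutoff. A minimal sketch, assuming each mutation record carries a core flag and a side chain accessibility percentage; the function name and the records are hypothetical toy data, not FMDB content.

```python
def exposed_fraction(mutations, threshold=33.0):
    """mutations: iterable of (is_core, accessibility_percent) records.
    Returns the fraction of mutations above `threshold`, per group."""
    totals = {"core": 0, "noncore": 0}
    exposed = {"core": 0, "noncore": 0}
    for is_core, accessibility in mutations:
        group = "core" if is_core else "noncore"
        totals[group] += 1
        if accessibility > threshold:
            exposed[group] += 1
    return {g: exposed[g] / totals[g] for g in totals if totals[g]}

# Toy records: ten core mutations (one exposed), four noncore (two exposed).
records = [(True, 2.0)] * 9 + [(True, 60.0)] + [(False, 10.0)] * 2 + [(False, 50.0)] * 2
print(exposed_fraction(records))
```

Applied to the real data set, the text reports 7% for the core and 44% for noncore positions at the 33% cutoff.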
Structural analysis of a specific amino acid change
Validator performs a conformational analysis based on an estimation of the steric hindrance between the mutated residue and neighboring residues in the 3D structure. For that it generates an in silico model of the protein highlighting the substituted position including its van der Waals surface.
These types of analysis are done by Validator for each mutation uploaded to the FMDB Website. The outcomes of other classifiers commonly used in DNA diagnostics (e.g., Grantham scores, BLOSUM62 scores) are also reported. Furthermore, amino acid-specific information is provided, such as domain interface residue, active site residue, and substrate contact information. Combining these indicators can lead to a better prediction.
For example, Fabry disease-associated publications often report p.A143P for 3D core position 76 (Ala143 in the GLA primary sequence). This mutation predisposes to a classical phenotype in males [Benjamin et al., 2009]. A statistical analysis of the complete superfamily alignment indicates that at this position proline is the most abundant amino acid residue, being present in 43% of the alpha-amylase sequences. The relatively high solvent accessibility of Ala143 in GLA also suggests a low probability of pathogenicity at this site. The in silico model structure, however, clearly indicates that specifically in GLA a proline at this position clashes with the neighboring aspartic acid with 3D number 32 (Asp93 in GLA) (Fig. 2). It is therefore likely that the p.A143P substitution is not allowed in the GLA protein, and it can be considered probably pathogenic. In contrast, Validator suggests that p.A143T would structurally be less damaging, and this variant has been reported to lead to a much milder form of Fabry disease [Benjamin et al., 2009]. In this case, the statistical analysis of the structural superfamily alignment suggests pathogenicity because a threonine is seldom present (0.8%) in other alpha-amylase protein superfamily members. The fact that p.A143T would structurally be less damaging fits well with a milder phenotype.
Performance of Validator tool on classical Fabry mutations
To test the performance of Validator predictions, mutations known to result in the classical form of Fabry disease were selected: p.M42V, p.H46Y, p.D92Y, p.R112C, p.C142R, p.W226R, and p.N320Y at core positions, and p.P40S and p.R100T at noncore positions. In addition, the special case p.D313Y is discussed.
For p.H46Y, Validator predicts a high probability of pathogenicity. In the superfamily alignment the occurrence of Y is only 4.1%, and Figure 5 shows that more than 75% of the recorded Fabry mutations are the result of such a substitution. Also, the solvent accessibility is 1.5%, indicating that H46 is buried inside the protein. Furthermore, the in silico model suggests that p.H46Y causes bumps with surrounding amino acids. Because histidine residues are hydrophilic, a buried histidine almost always has an important function. Although currently no weight is given to the various indicators, for this buried histidine solvent accessibility is probably the most important indicator.
The arguments listed above for p.H46Y also hold for p.D92Y. The fact that both positions H46 and D92 are reported in more than 10 independent Fabry disease-associated publications, with substitutions to a number of different amino acids, clearly matches the Validator predictions.
R112 is an almost completely buried hydrophilic residue. Almost all other sequences of the superfamily have hydrophobic residues at this position instead. This indicates that R112 has an important function that is specific for GLA, which suggests that p.R112C will most probably be pathogenic. Furthermore, a cysteine is not a common residue at position 112 (0.1%), and the high number of publications (81) that report this position in relation to Fabry disease again indicates a very high probability of pathogenicity.
The Validator tool indicates that C142 forms a cysteine bridge. The p.C142R mutation therefore disrupts the formation of this cysteine bridge. This type of information overrules all others, because disrupting a cysteine bridge will most probably always be pathogenic, independent of solvent accessibility, amino acid occurrences, or other factors. Finally, almost all information that the Validator tool returns for mutations p.W226R and p.N320Y indicates a very high probability of pathogenicity, again supported by a high number of publications reporting mutations to various amino acid types.
Mutations p.P40S and p.R100T are not included in the core, so only the 77 sequences of the GLA subfamily alignment can be used for statistics. In the subfamily both P40 and R100 are 100% conserved, which suggests a high probability for pathogenicity for both.
The only mutation that is predicted not to be pathogenic is p.M42V. Even after meticulous manual inspection of the protein model of p.M42V, no reasonable explanation can be given for the pathogenicity of this mutation. The only indication that this is a true pathogenic mutation is the high number of literature references that report mutations to different amino acids at this position.
In the literature, mutation p.D313Y is ambiguously linked with Fabry disease, and the predictions from the 3DM data are contradictory. Although tyrosine is not a common residue at this position (suggesting a high chance of pathogenicity), solvent accessibility indicates that the residue is located on the outside of the protein, and introducing a tyrosine residue does not cause any bumps with surrounding amino acids (suggesting a low probability of pathogenicity). The p.D313Y mutation has been tested for activity in vitro. Transient expression of the p.D313Y construct in COS-7 cells resulted in an active enzyme with >67% of the expressed wild-type activity [Froissart et al., 2003]. Mutator extracted 17 different publications from the literature all describing the single p.D313Y mutation, but remarkably, so far no other substitutions have been detected at this position. Could this then be a naturally occurring variant? There are 46 other residues in the GLA protein sequence for which more than 10 independent Fabry disease-related literature references are available. These are residues 34, 40, 42, 46, 49, 22, 65, 66, 89, 92, 93, 97, 100, 112, 113, 138, 142, 143, 148, 156, 162, 172, 183, 205, 215, 220, 223, 226, 227, 236, 259, 266, 272, 279, 287, 296, 298, 301, 317, 320, 328, 342, 356, 357, 358, and 409. In contrast with the reports for position D313, a range of amino acid changes is reported for all these positions except R220. For R220 all 21 available independent publications report a stop codon at position 220 (p.R220X). The fact that different amino acid substitutions at these positions have been reported to result in Fabry disease significantly increases the chance that mutations at these positions are pathogenic. Furthermore, this result also indicates that p.D313Y is probably a naturally occurring variant, because it is unlikely that only the introduction of a tyrosine results in Fabry disease.
The results for the A143T and D313Y mutations fit what is clinically observed. Authors who report D313Y should comment that it is unlikely (but possible) to be pathogenic.
In this article it is shown that a collection of superfamily data can be used to predict effects of mutations. It must be noted, however, that predicting the pathogenicity of specific mutations is still difficult and statistical analysis of large 3DM alignments should only be used as guidance. For example, if we take the seven core positions that are conserved in more than 95% of the aligned sequences (3D numbers 39, 73, 100, 102, 123, 145, and 146) we see that for two of these positions (73, 123) Mutator has not been able to extract from the literature any mutations causing Fabry disease. Is this unexpected result caused by a still limited set of mutations or do mutations at these positions not lead to Fabry disease?
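The check on the conserved core positions amounts to a set comparison, sketched below in Python. This is a minimal sketch under stated assumptions, not the 3DM method: the function names are hypothetical, alignment columns are represented as plain strings, gaps are ignored when computing conservation, and the toy columns are invented. It also assumes that the mutated positions have already been mapped to the same 3D numbering as the alignment.

```python
from collections import Counter


def conservation(column):
    """Fraction of aligned sequences sharing the most common residue.

    `column` is a string with one character per sequence; '-' marks a gap.
    Ignoring gaps is an assumption made for this sketch.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)


def conserved_positions(alignment_columns, threshold=0.95):
    """Positions whose conservation exceeds `threshold` (more than 95% by default)."""
    return {
        pos
        for pos, column in alignment_columns.items()
        if conservation(column) > threshold
    }


# Toy alignment columns (hypothetical), keyed by 3D number.
toy_columns = {
    39: "D" * 99 + "E",        # 99% conserved
    40: "A" * 60 + "G" * 40,   # 60% conserved
    73: "G" * 100,             # fully conserved
}

# Hypothetical set of positions for which disease-related mutations were extracted.
mutated_positions = {39, 40}

core = conserved_positions(toy_columns)
core_without_mutations = sorted(core - mutated_positions)
```

In this toy example position 73 is conserved but has no extracted mutation, which mirrors the situation described for 3D numbers 73 and 123.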
References
- 1990. Basic local alignment search tool. J Mol Biol 215:403–410.
- 2009. The pharmacological chaperone 1-deoxygalactonojirimycin increases alpha-galactosidase A levels in Fabry patient cell lines. J Inherit Metab Dis 32:424–440.
- 2000. The Protein Data Bank. Nucleic Acids Res 28:235–242.
- 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–370.
- 2009. Characterization of a homozygous Gly11Val mutation in the Gla domain of coagulation factor X. Thromb Res 124:144–148.
- 2005. The nuclear receptor ligand-binding domain: a family-based structure analysis. Curr Med Chem 12:1001–1016.
- 2004. A family-based approach reveals the function of residues in the nuclear receptor ligand-binding domain. J Mol Biol 341:321–335.
- Froissart et al. 2003. Fabry disease: D313Y is an alpha-galactosidase A sequence variant that causes pseudodeficient activity in plasma. Mol Genet Metab 80:307–314.
- 2007. Structure–function relationships in alpha-galactosidase A. Acta Paediatr Suppl 96:6–16.
- 2004. The molecular defect leading to Fabry disease: structure of human alpha-galactosidase. J Mol Biol 337:319–335.
- 1974. Amino acid difference formula to help explain protein evolution. Science 185:862–864.
- 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919.
- 2004. Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 20:557–568.
- 2003. MuteXt: an automated method to extract mutation data from the literature. Pacific Symp Biocomput.
- 2008. Identification of fungal oxaloacetate hydrolyase within the isocitrate lyase/PEP mutase enzyme superfamily using a sequence marker-based method. Proteins 70:157–166.
- 2010. 3DM: systematic analysis of heterogeneous super-family data to discover protein functionalities. Proteins 78:2101–2113.
- 2009. Correlated mutation analyses on super-family alignments reveal functionally important residues. Proteins 76:608–616.
- 2009. Identification of a gatekeeper residue that prevents dehydrogenases from acting as oxidases. J Biol Chem 284:4392–4397.
- 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540.
- 2009. Structure and function of 2,3-dimethylmalate lyase, a PEP mutase/isocitrate lyase superfamily member. J Mol Biol 386:486–503.
- 2003. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814.
- 2002. Correlated mutation analyses on very large sequence families. Chembiochem 3:1010–1017.
- 2002. The IARC TP53 database: new online mutation analysis and recommendations to users. Hum Mutat 19:607–614.
- 2007. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum Mutat 28:622–629.
- 2009. The Human Gene Mutation Database: 2008 update. Genome Med 1:13.
- 2000. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16:198–200.
- 2003. NRMD: Nuclear Receptor Mutation Database. Nucleic Acids Res 31:331–333.
- 2008. Annotating single amino acid polymorphisms in the UniProt/Swiss-Prot knowledgebase. Hum Mutat 29:361–366.
Supporting Information
Additional Supporting Information may be found in the online version of this article.
| Filename | Format | Size | Description |
|---|---|---|---|
| humu_21317_sm_SupplInfo1.pdf | PDF | 376K | Supplementary Materials |
Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.



