Performance of mutation pathogenicity prediction methods on missense variants


  • Communicated by Christophe Béroud


Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in humans. The number of SNPs identified in the human genome is growing rapidly, but attaining experimental knowledge about the possible disease association of variants is laborious and time-consuming. Several computational methods have been developed for the classification of SNPs according to their predicted pathogenicity. In this study, we have evaluated the performance of nine widely used pathogenicity prediction methods available on the Internet. The evaluated methods were MutPred, nsSNPAnalyzer, Panther, PhD-SNP, PolyPhen, PolyPhen2, SIFT, SNAP, and SNPs&GO. The methods were tested with a set of over 40,000 pathogenic and neutral variants. We also assessed whether the type of original or substituting amino acid residue, the structural class of the protein, or the structural environment of the amino acid substitution, had an effect on the prediction performance. The performances of the programs ranged from poor (MCC 0.19) to reasonably good (MCC 0.65), and the results from the programs correlated poorly. The overall best performing methods in this study were SNPs&GO and MutPred, with accuracies reaching 0.82 and 0.81, respectively. Hum Mutat 32:1–11, 2011. © 2011 Wiley-Liss, Inc.


Most human genetic variation is represented by single nucleotide polymorphisms (SNPs), and many of them are believed to cause phenotypic differences between individuals. Owing to the application of high-throughput sequencing methods, the number of identified variants in the human genome is growing rapidly, but identifying those variations responsible for specific phenotypes is a laborious task. The ability to discriminate between pathogenic and benign variants computationally could significantly aid targeting disease-causing mutations by helping in the selection and prioritization of likely candidates from a pool of data. A subset of SNPs occur at protein coding regions in the genome, and from a medical point of view particularly interesting ones are the nonsynonymous SNPs (nsSNPs) that lead to an amino acid substitution at the protein level (referred to here as missense variants). nsSNPs may affect gene function through their effect on the structure and/or function of the encoded protein.

Prediction of the possible disease-association of missense variants is a difficult problem because an amino acid substitution can affect the biological function of a gene product in a number of ways [Thusberg and Vihinen, 2009]. An amino acid substitution may disrupt sites that are critical in protein function, such as catalytic residues or ligand-binding pockets. A missense mutation may also lead to alterations in the structure, folding, or stability of the protein product, thereby altering or preventing the function of the protein. On the other hand, amino acid substitutions do not necessarily affect protein function. Effects of missense mutations are often the most difficult to predict, whereas the consequences of most deletions, insertions, and nonsense mutations are rather self-evident.

Many methods have been developed for the computational prediction of the phenotypic effect of nsSNPs. Some of them are for the study of very specific mechanisms, whereas others are developed to predict whether a variation is harmful or benign. All of the variation tolerance methods evaluated in this study follow a similar procedure in which a missense variant is first labeled with properties related to the damage it may cause to the protein structure or function. The resulting feature vector is then utilised to decide whether the variant is pathogenic or not. The methods differ in the properties of the variant they take into account in the prediction, as well as in the nature and possible training of the classification method used for decision making. The nine widely used methods evaluated in this study are based on evolutionary information (Panther [Thomas et al., 2003], PhD-SNP SVM-Profile [Capriotti et al., 2006], and SIFT [Ng and Henikoff, 2001]), or a combination of protein structural and/or functional parameters and multiple sequence alignment derived information (MutPred [Li et al., 2009], nsSNPAnalyzer [Bao et al., 2005], PolyPhen [Ramensky et al., 2002], PolyPhen2 [Adzhubei et al., 2010], SNAP [Bromberg and Rost, 2007], and SNPs&GO [Calabrese et al., 2009]). The machine-learning methods utilize neural networks (NN) (SNAP), random forests (RF) (MutPred, nsSNPAnalyzer), or support vector machines (SVMs) (PhD-SNP, SNPs&GO) for classification, whereas the other methods classify variants according to empirically derived rules (PolyPhen), Bayesian methods (PolyPhen2), or mathematical operations (SIFT, Panther) (Table 1).

Table 1. Summary of the Evaluated Methods

| Method | Based on | Training set | Conservation analysis | Structural attributes | Annotations |
|---|---|---|---|---|---|
| MutPred | RF | HGMD, Swiss-Prot | SIFT, Pfam, PSI-BLAST | Predicted attributes | |
| nsSNPAnalyzer | RF | Swiss-Prot | SIFT | Homologue mapping | |
| Panther | Alignment scores | | Panther library, HMMs | | |
| PhD-SNP | SVM | Swiss-Prot | Sequence environment, sequence profiles | | |
| PolyPhen | Empirical rules | | PSIC profiles | Homologue mapping/predictions | Swiss-Prot |
| PolyPhen2 | Bayesian classification | Swiss-Prot, neutral pseudo-mutations | PSIC profiles | Homologue mapping/predictions | Pfam domain |
| SIFT | Alignment scores | | MSAs | | |
| SNAP | NN | PMD, neutral pseudo-mutations | PSIC profiles, Pfam, PSI-BLAST | Predictions | |
| SNPs&GO | SVM | Swiss-Prot | Sequence environment, sequence profiles, Panther | | GO |

GO, Gene Ontology; HGMD, Human Gene Mutation Database; HMM, Hidden Markov model; NN, neural network; MSA, multiple sequence alignment; PMD, Protein Mutant Database; PSIC, position-specific independent counts; RF, random forest; SVM, support vector machine.

As mutation data and information about the genotypes of individuals accumulate, understanding the molecular level effects of variations and elucidating their possible disease association is an important research challenge [Karchin, 2009; Mooney, 2005; Ng and Henikoff, 2006; Steward et al., 2003; Thusberg and Vihinen, 2009]. Numerous locus-specific databases (LSDBs) have been established for the collection, analysis, and distribution of disease-related variation information in certain genes. Data for several genes is available, for example, in the protein knowledgebase SwissProt [Yip et al., 2004] and PhenCode [Giardine et al., 2007], which is a database that connects human variant data with phenotypic information from LSDBs with genomic data from the ENCODE project and other resources in the UCSC Genome Browser [Raney et al., 2011]. SNP information is available in dbSNP [Sherry et al., 2001], a genetic variation database. Several tools for the prediction of the phenotypic consequences of missense variants are available, but without knowledge about the quality of predictions, choosing the best method and evaluating the reliability of its outcome is impossible. We therefore performed the first comprehensive systematic evaluation of nine bioinformatics tools predicting the phenotypic effects of missense variants.

Materials and Methods


We built a positive dataset (referred to as pathogenic dataset) of 19,335 missense mutations from the PhenCode database [Giardine et al., 2007] (downloaded in June 2009), registries in IDbases [Piirilä et al., 2006] and from 18 individual LSDBs, and a negative (neutral) dataset of 21,170 human nonsynonymous coding SNPs with an allele frequency >0.01 and chromosome sample count >49 from the dbSNP database [Sherry et al., 2001] build 131. The SNP data was filtered so that none of the dbSNP entries included in our dataset contained OMIM links to minimize the number of disease-associated SNPs in the neutral dataset. Entries annotated as “putative” or “predicted” were also left out. In addition, the neutral dataset was searched against the pathogenic dataset in order to remove possible duplicates and further minimise the probability of having false negative cases in the set. The PhenCode data was filtered so that only SNPs annotated as disease causing in the SwissProt database were taken into our pathogenic dataset. Swiss-Prot provides high-quality hand-curated information about the possible disease-relation of nsSNPs, derived from literature [Yip et al., 2008]. The complementing LSDB data was retrieved manually from each database. The pathogenic and neutral datasets contained 1,190 and 9,011 proteins, respectively, of which 445 and 1,205 were found to have three-dimensional structure coordinates in the Protein Data Bank (PDB) [Berman et al., 2000]. The datasets are available for download at our Website (

Both datasets were run through all nine methods studied here. The number of results from nsSNPAnalyzer is much smaller than the original number of cases in the input data, because the program only accepts mutations in those sequences for which a homologous protein is found in the ASTRAL database [Chandonia et al., 2004]. A large number of proteins in our dataset did not match any entry in the database, thus limiting the number of cases that could be analysed by nsSNPAnalyzer.

Two kinds of subdatasets were constructed from the original pathogenic and neutral datasets. First, a structural subdataset was compiled from the part of both datasets for which structural data was available in the PDB, to study the effect of available structure data on prediction performance. Second, for probing the effect of using Swiss-Prot-derived data as part of the pathogenic testing set, we constructed a subdataset containing only pathogenic variants not present in Swiss-Prot. The corresponding neutral dataset was compiled by randomly selecting an equal number of variants from the original neutral test set.

To test whether the differences in method performance with these subdatasets were caused by the smaller testing set size, we constructed 100 sample datasets, each containing 1,000 pathogenic and 1,000 neutral variants randomly picked from the original datasets, and compared the average MCCs obtained with the MCCs from the subdatasets.
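The resampling step described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the fixed random seed, and the choice to sample without replacement within each sample dataset are assumptions, and variants are represented by arbitrary identifiers.

```python
import random
from statistics import mean


def draw_sample_datasets(pathogenic, neutral, n_samples=100, size=1000, seed=0):
    """Draw balanced sample datasets of `size` pathogenic and `size` neutral variants.

    Assumption: variants are drawn without replacement within one sample,
    while different samples are drawn independently and may overlap.
    """
    rng = random.Random(seed)
    return [
        (rng.sample(pathogenic, size), rng.sample(neutral, size))
        for _ in range(n_samples)
    ]


def average_mcc(samples, mcc_of_sample):
    """Average a caller-supplied MCC function over all sample datasets."""
    return mean(mcc_of_sample(p, n) for p, n in samples)
```

Here `mcc_of_sample` is a hypothetical callback that scores one method on one sample; the averaged value is what would be compared with the MCC obtained on a subdataset.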

The Pathogenic-or-not Pipeline (PON-P) [Thusberg and Vihinen, 2009] was used for the submission of sequences and variants into the analysis programs nsSNPAnalyzer, Panther, PhD-SNP, PolyPhen, PolyPhen2, SIFT, and SNAP. PON-P is a service that simultaneously submits the input data provided by the user to selected prediction methods. MutPred and SNPs&GO were run locally at the corresponding laboratories by the developers of the methods.

Prediction Methods

The effects of mutations and SNPs were predicted by the programs MutPred [Li et al., 2009], nsSNPAnalyzer [Bao et al., 2005], Panther [Thomas et al., 2003], PhD-SNP [Capriotti et al., 2006], PolyPhen [Ramensky et al., 2002], PolyPhen2 [Adzhubei et al., 2010], SIFT [Ng and Henikoff, 2001], SNAP [Bromberg and Rost, 2007], and SNPs&GO [Calabrese et al., 2009]. Key properties of the methods are listed in Table 1. The default parameters of all programs were applied, and only the protein sequence and missense variant were given as input information for each program, as in a normal user situation of unknown variant analysis.


MutPred is a Random Forest-based classification method that utilizes several attributes related to protein structure, function, and evolution. MutPred utilizes the SIFT method [Ng and Henikoff, 2003] for defining the evolutionary attributes, along with PSI-BLAST, transition frequencies [Bromberg and Rost, 2007], and Pfam profiles [Finn et al., 2010]. In MutPred, structural descriptors include prediction of secondary structure and solvent accessibility by the method PHD [Rost, 1996], transmembrane helix prediction by TMHMM [Krogh et al., 2001], coiled-coil structure prediction by MARCOIL [Delorenzi and Speed, 2002], stability prediction by I-Mutant 2.0 [Capriotti et al., 2005], B-factor prediction [Radivojac et al., 2004], and disorder prediction by DisProt [Peng et al., 2006]. Function-related attributes include predictions of DNA-binding residues [Ahmad et al., 2004], catalytic residues, calmodulin-binding targets [Radivojac et al., 2006], and posttranslational modification sites [Daily et al., 2005; Iakoucheva et al., 2004; Radivojac et al., 2010]. The MutPred method estimates effects of an amino acid substitution on the set of defined properties of a protein and based on those estimates, predicts whether an amino acid substitution is likely to have phenotypic effects.


nsSNPAnalyzer is a machine-learning method that integrates multiple sequence alignment (MSA) and protein structure analysis to classify missense variants. The input protein sequence is searched against the ASTRAL database [Chandonia et al., 2004] for homologous protein structures, and features of the environment of the substitution, namely the solvent accessibility, environmental polarity, and secondary structure, are extracted from the obtained structure. The SIFT method [Ng and Henikoff, 2003] is used for calculating the normalised probability of the substitution in the MSA, and the similarity and dissimilarity between the original (mutated) and mutant residues are also taken into account. The program then uses a Random Forest classifier trained on a dataset prepared from the Swiss-Prot database [Yip et al., 2004] to classify the variant as disease-associated or functionally neutral.


The Panther Evolutionary Analysis of Coding SNPs (referred to simply as Panther in this article) calculates substitution position-specific evolutionary conservation (subPSEC) scores based on alignments of evolutionarily related proteins to predict the pathogenicity. The alignments are obtained from the PANTHER library of protein families based on Hidden Markov Models (HMMs). The subPSEC score describes the amino acid probabilities at particular positions among evolutionarily related sequences, and the values range from 0 (neutral) to about −10 (most likely to be deleterious). The cutoff for classifying a missense variant as pathogenic can be defined by the user, but the authors of the method advise using a cutoff of −3 for classification [Thomas et al., 2003].


PhD-SNP is a prediction method based on single-sequence and sequence profile based support vector machines trained on Swiss-Prot variants [Yip et al., 2004]. The single-sequence SVM (SVM-Sequence) classifies the missense variant to be pathogenic or neutral based on the nature of the substitution and properties of the neighboring sequence environment. The profile-based SVM (SVM-Profile) utilizes sequence profile information taken from MSAs, and classifies the variant according to the ratio between the frequencies of the wild-type and substituted residue. A decision tree algorithm chooses which one of the two SVMs described above is to be used at each case based on the occurrence of wild-type and mutant amino acids at the given position.


PolyPhen (Polymorphism Phenotyping) uses a rule-based cutoff system to classify variants. It initially characterises the input missense variant by various sequence, structure, and phylogeny based descriptors. The sequence-based characterisation includes SWALL database [Johnson and Todd, 2000] annotations for sequence features, a transmembrane predictor TMHMM [Krogh et al., 2001] and PHAT [Ng et al., 2000] transmembrane-specific matrix score for substitutions at predicted transmembrane regions, the Coils2 program [Lupas et al., 1991] for prediction of coiled coil regions, and the SignalP [Nielsen et al., 1997] program to predict signal peptide regions. Phylogenetic information is derived by constructing a profile matrix from aligned sequences by the PSIC (Position-Specific Independent Counts) software [Sunyaev et al., 1999]. The structural descriptors are obtained by mapping the missense variant onto the corresponding or similar protein and then using the DSSP program [Kabsch and Sander, 1983] for secondary structure information, solvent-accessible surface, and φ–ψ dihedral angles. In addition, PolyPhen calculates the normalized accessible surface area and changes in accessible surface propensity resulting from the amino acid substitution, change in residue side chain volume, region of the Ramachandran map, normalized B factor, and loss of a hydrogen bond according to the Hbplus program [McDonald and Thornton, 1994]. The SWALL database annotations are utilized in the structure analysis such that the program checks whether the substitution site is in spatial contact with critical residues annotated to be involved in forming binding sites or active sites. Additionally, the contacts of the substituted residue with ligands or subunits of the protein molecule are checked. After characterising the variant, PolyPhen applies empirically derived rules based on the characteristics to predict whether a missense variant is damaging or benign.


PolyPhen2 utilizes a combination of sequence- and structure-based attributes for the description of an amino acid substitution, and the effect of mutation is predicted by a naive Bayesian classifier. The sequence-based features include PSIC scores and MSA properties, and position of mutation in relation to domain boundaries as defined by Pfam [Finn et al., 2010]. The structure-derived features are solvent accessibility, changes in solvent accessibility for buried residues, and crystallographic B-factor.


SIFT (Sorting Intolerant From Tolerant) makes inferences from sequence similarity using mathematical operations. SIFT constructs an MSA and considers the position of the missense variant and the type of the amino acid change. Based on the amino acids appearing at each position in the MSA, SIFT calculates the probability that a missense variant is tolerated conditional on the most frequent amino acid being tolerated.


SNAP (Screening for Nonacceptable Polymorphisms) is a neural network-based tool for the prediction of the effect of a missense variant. The method utilises evolutionary information from PSI-BLAST [Altschul et al., 1997] frequency profiles and PSIC [Sunyaev et al., 1999], transition frequencies for mutations, biophysical characteristics of the substitution, secondary structural information, and relative solvent accessibility values predicted by PROFsec/PROFacc [Rost, 1996; Rost and Sander, 1994], chain flexibility predicted by PROFbval [Schlessinger et al., 2006], protein family evolutionary information, and information about domain boundaries from Pfam [Finn et al., 2010], and Swiss-Prot annotations [Bairoch and Apweiler, 2000] to classify a missense variant. The training sets for the NN were constructed from Protein Mutant Database (PMD) [Kawabata et al., 1999] data complemented by a set of neutral pseudomutations generated by the authors of the method as described in Bromberg and Rost [2007].


SNPs&GO is an SVM classifier based on mutation type and sequence environment information, sequence profiles taken from MSAs, predictions from the program Panther [Thomas et al., 2003], and a function-based log-odds score describing information about protein function defined by Gene Ontology (GO) terms [Ashburner et al., 2000].

From the output of the programs, we only took the binary prediction (pathogenic/neutral) into consideration without taking into account any confidence values provided by some of the programs. Panther provides a numerical output (subPSEC score) rather than a binary classification, which we converted to a binary prediction using a cutoff point of −3 as recommended in [Thomas et al., 2003]. PolyPhen and PolyPhen2 classify the effects of a missense variant into three categories: “Probably pathogenic,” “Possibly pathogenic,” and “Benign.” We converted these into binary classifications in two ways, first by considering only the “Probably pathogenic” class as pathogenic and the “Possibly pathogenic” and “Benign” classes as neutral, and second, by considering both the “Probably pathogenic” and “Possibly pathogenic” classes as pathogenic, and the “Benign” class as neutral. These two ways of classifying the variants are referred to as PolyPhen(2)a and PolyPhen(2)b in this study, respectively.
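The conversion rules above can be illustrated with a short sketch. The function names and the lowercase label strings are hypothetical, and the treatment of a subPSEC score exactly equal to the cutoff (counted as pathogenic here) is an assumption, since the text does not say which side the boundary falls on.

```python
PANTHER_CUTOFF = -3.0  # cutoff recommended by Thomas et al. [2003]


def panther_to_binary(subpsec, cutoff=PANTHER_CUTOFF):
    """Convert a subPSEC score (0 = neutral, about -10 = most deleterious) to a binary call.

    Assumption: a score equal to the cutoff is counted as pathogenic.
    """
    return "pathogenic" if subpsec <= cutoff else "neutral"


def polyphen_to_binary(label, scheme):
    """Convert a three-class PolyPhen/PolyPhen2 label to a binary call.

    scheme "a": only "Probably pathogenic" counts as pathogenic.
    scheme "b": "Probably pathogenic" and "Possibly pathogenic" count as pathogenic.
    """
    pathogenic_labels = {
        "a": {"probably pathogenic"},
        "b": {"probably pathogenic", "possibly pathogenic"},
    }
    if scheme not in pathogenic_labels:
        raise ValueError("scheme must be 'a' or 'b'")
    key = label.strip().lower()
    if key not in {"probably pathogenic", "possibly pathogenic", "benign"}:
        raise ValueError(f"unknown label: {label!r}")
    return "pathogenic" if key in pathogenic_labels[scheme] else "neutral"
```

Under this sketch the same "Possibly pathogenic" prediction is neutral in the "a" variant and pathogenic in the "b" variant, which is the only difference between the two schemes.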

Determination of Secondary Structural Elements and Accessible Surface Areas

The 3D structure coordinates of proteins were obtained from the PDB. Secondary structural information and accessible surface area (ASA) values for each mutation site were assigned by the program STRIDE [Frishman and Argos, 1995]. We classified residues with ASAs ≤10% as buried and with ASAs ≥25% as exposed, similarly as in a previous study [Khan and Vihinen, 2010].
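The burial thresholds can be written as a small helper. This is only a sketch: the function name is hypothetical, the input is assumed to be a relative ASA expressed as a percentage, and the label "intermediate" for values between the two thresholds is an assumption, because the text names only the buried and exposed classes.

```python
def classify_burial(relative_asa_percent):
    """Classify a residue position by relative accessible surface area (in %)."""
    if relative_asa_percent <= 10:
        return "buried"
    if relative_asa_percent >= 25:
        return "exposed"
    # Assumption: positions between the thresholds belong to neither class.
    return "intermediate"
```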

Determination of Structural Classes of Proteins

The CATH database version 3.3 [Orengo et al., 1997] was used to group studied proteins according to their secondary and tertiary structure types.

Statistical Analyses

The quality of the predictions is described by six parameters: accuracy, precision, sensitivity, specificity, negative predictive value (NPV) and Matthews correlation coefficient (MCC). In the following equations, tp, tn, fp, and fn refer to the number of true positives, true negatives, false positives and false negatives, respectively.

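\[
\begin{aligned}
\text{Accuracy} &= \frac{tp + tn}{tp + tn + fp + fn} \\
\text{Precision} &= \frac{tp}{tp + fp} \\
\text{Sensitivity} &= \frac{tp}{tp + fn} \\
\text{Specificity} &= \frac{tn}{tn + fp} \\
\text{NPV} &= \frac{tn}{tn + fn} \\
\text{MCC} &= \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}
\end{aligned}
\]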

The MCC [Matthews, 1975] is a very important evaluation statistic as it is unaffected by the differing proportion of neutral and pathogenic datasets predicted by the different programs. Because of its insensitivity to differing test set sizes, it gives a more balanced assessment of performance than the other performance measures [Baldi et al., 2000].

To be able to correlate the quality parameters for different programs with different sizes of test sets containing different amounts of pathogenic and neutral cases, the numbers of neutral cases were normalized to be equal to the number of pathogenic cases for each program.
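As an illustration, the six quality measures and the normalization of neutral cases can be computed as below. This is a sketch using the standard definitions of the measures; the function names are hypothetical, and the normalization is assumed to be a simple rescaling of the neutral counts (tn and fp) so that their total equals the number of pathogenic cases (tp + fn).

```python
import math


def normalize_neutral(tp, fn, tn, fp):
    """Rescale neutral counts so that tn + fp equals tp + fn (assumed scheme)."""
    n_pathogenic = tp + fn
    n_neutral = tn + fp
    if n_neutral == 0:
        raise ValueError("no neutral cases to normalize")
    scale = n_pathogenic / n_neutral
    return tp, fn, tn * scale, fp * scale


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def quality_measures(tp, fn, tn, fp):
    """Compute accuracy, precision, sensitivity, specificity, NPV, and MCC."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "npv": _ratio(tn, tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For example, `quality_measures(*normalize_neutral(80, 20, 180, 20))` first halves the neutral counts to 90 and 10 and then evaluates the measures on the balanced counts. Sensitivity and specificity are unchanged by the rescaling, whereas accuracy, precision, NPV, and MCC depend on it.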

Substitution statistics for both the pathogenic and neutral datasets were analyzed by comparing the frequencies of the substitutions with the expected values that were calculated using the distribution of all amino acids in the datasets. For the original residues, the expected values were calculated with regard to their codon diversity thereby taking into account all possible amino acid substitutions. The chi-square test was used to determine the significance of the results and chi-square was calculated as:

\[
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
\]

where fo is the observed frequency and fe is the expected frequency for an amino acid. The p-values were estimated in a one-tailed fashion.
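A minimal sketch of the chi-square calculation follows. The function names are hypothetical. The p-value helper assumes one degree of freedom per residue, which the text does not state; for one degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).

```python
import math


def chi_square_term(observed, expected):
    """Contribution of one category: (fo - fe)^2 / fe."""
    return (observed - expected) ** 2 / expected


def chi_square(observed_counts, expected_counts):
    """Sum the per-category contributions over all categories."""
    return sum(
        chi_square_term(fo, fe) for fo, fe in zip(observed_counts, expected_counts)
    )


def p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic.

    Assumption: one degree of freedom (not stated in the text).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))
```

With hypothetical counts of 120 observed against 100 expected, the term is 4.0 and the corresponding p-value under the one-degree-of-freedom assumption is about 0.046.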

Correlations between the program outputs were calculated by counting all of the common cases and those predicted correctly, and using Spearman's rank correlation coefficient.
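The correlation step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: predictions are assumed to be stored as dictionaries mapping a variant identifier to 1 (pathogenic) or 0 (neutral), the function names are hypothetical, and ties are handled with average ranks, which matters because binary predictions are heavily tied.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant predictions")
    return cov / math.sqrt(var_x * var_y)


def spearman_on_common_cases(pred_a, pred_b):
    """Spearman correlation restricted to variants predicted by both programs."""
    common = sorted(set(pred_a) & set(pred_b))
    x = [pred_a[v] for v in common]
    y = [pred_b[v] for v in common]
    return pearson(average_ranks(x), average_ranks(y))
```

Restricting the comparison to common cases mirrors the fact that the programs returned predictions for different numbers of variants.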


Test Set Features

The distributions of mutated and mutant amino acids in both pathogenic and neutral datasets are biased (Table 2), and only a few residues occur as expected on the grounds of codon diversity. In the pathogenic dataset (mutation data), A, C, G, M, R, W, and Y are overrepresented among the original (mutated) amino acid residues, whereas E, F, I, K, L, N, Q, S, T, and V are significantly underrepresented. These results are in line with previous observations for distributions of disease-causing mutations in protein secondary structural elements [Khan and Vihinen, 2007], except for the overrepresentation of A and Y, and underrepresentation of L, N, S, and V in our data. In the neutral dataset, the distributions of many amino acids differ from the distributions in the pathogenic set. Most importantly, cysteines are highly underrepresented among the substituted positions, as opposed to their frequent mutation in the pathogenic dataset. This might be due to the important role of cysteines in folding of many proteins as they are capable of forming disulphide bonds, and therefore the substitution of cysteines in proteins transported through endoplasmic reticulum by any other residue can rarely be neutral in terms of protein structure and function. Other differences between the datasets are the underrepresentation of mutated glycine, tryptophan, and tyrosine residues in the neutral set as opposed to their frequent mutation in the pathogenic set, and the overrepresentation of isoleucine, asparagine, threonine, and valine residues in the neutral variation data, contrasting their underrepresentation in the mutation data.

Table 2. Amino Acid Distributions in the Pathogenic (Mutations) and Neutral (SNPs) Datasets

| Residues | Pathogenic variants: Observed | Expected | χ2 | P-value | Neutral variants: Observed | Expected | χ2 | P-value |
|---|---|---|---|---|---|---|---|---|
| Wild-type residues, all | 19335 | 19335 | | | 21170 | 21170 | | |
| Mutant residues, all | 19335 | 19335 | | | 21170 | 21170 | | |

The chi-square values in italics identify residues that are underrepresented and the values in bold identify overrepresented residues in comparison to random distributions derived from theoretical codon usage frequencies. Significance levels are *P<0.05; **P<0.01; ***P<0.001.

The distributions of mutant or substituting amino acids are also very biased in both pathogenic and neutral datasets, and the amino acid residues I, P, R, T, V, and Y have opposite distributions in the mutation and neutral sets. Interestingly, proline residues are highly overrepresented among the substituting residues in the mutation dataset, and underrepresented in the negative set. Proline is a known secondary structure breaker [Chou and Fasman, 1974] and therefore mutations to P are often pathogenic.

Performance of Prediction Methods

To evaluate the performance of the programs predicting the pathogenicity of missense variants, we used six measures: accuracy, precision (or positive predictive value, PPV), specificity, sensitivity, NPV, and MCC. The values for these measures are presented in Table 3 for all the missense variants. SNPs&GO performed best in terms of accuracy (0.82), precision (0.90), specificity (0.92), and MCC (0.65), but sensitivity was higher in six other methods, and MutPred, Panther, PolyPhen2b, and SNAP performed better in terms of NPV. nsSNPAnalyzer performed worst in terms of MCC (0.19), accuracy (0.60), NPV (0.60), and precision (0.59). The two versions of PolyPhen have very similar overall performance; however, PolyPhen2 is recommended because its quality measures are more balanced. The variant that treats only “Probably pathogenic” predictions as harmful, PolyPhen2a, is somewhat better than the other option.

Table 3. Performance of Prediction Methods

| Dataset | Cases (a) | MutPred | nsSNPAnalyzer | Panther | PhD-SNP | PolyPhen1a | PolyPhen1b | PolyPhen2a | PolyPhen2b | SIFT | SNAP | SNPs&GO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full data | + | 16336 | 7138 | 12548 | 18796 | 19278 | 19278 | 18909 | 18909 | 15320 | 18146 | 19223 |
| Full data | − | 20448 | 2262 | 11473 | 21165 | 20868 | 20868 | 19873 | 19873 | 19621 | 14577 | 18410 |
| 3D structure | + | 6142 | 4460 | 4943 | 7452 | 7637 | 7637 | 7656 | 7656 | 5632 | 7465 | 7633 |
| 3D structure | − | 1798 | 1096 | 1176 | 1844 | 1823 | 1823 | 1835 | 1835 | 1805 | 1477 | 1696 |
| Pathogenic dataset only from LSDBs, not in SwissProt | + | 3139 | 2037 | 2620 | 3594 | 3594 | 3594 | 3551 | 3551 | 3276 | 3532 | 3499 |
| Pathogenic dataset only from LSDBs, not in SwissProt | − | 3459 | 377 | 2009 | 3594 | 3538 | 3538 | 3362 | 3362 | 3341 | 2451 | 3157 |

(a) Total number of cases used by the given program (not normalized).
(b) Accuracy, precision, specificity, sensitivity, NPV, and MCC are calculated from normalised numbers.

In Table 3, the results are presented for the subset of cases for which structural information could be assigned. The performance of all methods was generally worse except for sensitivity, which is better for all methods. SNPs&GO performed best also in the structural subcategory considering accuracy, precision, specificity, and MCC, and MutPred was the best method in terms of sensitivity and NPV.

To test whether the poor performance was due to the smaller dataset size we sampled the full dataset results for those cases for which structural data was not available. We then compared the average MCC values of the samples to those obtained for the full dataset. The 100 sample datasets each contained randomly picked 1,000 neutral and 1,000 pathogenic variations. The average MCCs of the sample datasets were comparable to the MCCs of the full dataset in the case of Panther (average sample MCC 0.53), PhD-SNP (0.43), PolyPhen2b (0.39), and SNAP (0.47). For the other methods the MCC values were rather close when comparing the full dataset to the subdataset. We conclude that the large differences in the MCCs of the programs between the full dataset and the set for which structures were available (Table 3) were not due to the differences in the sizes of these datasets but were caused by some other factors, that is, differences in the performance of the methods when predicting on different types of data.

We also performed the analyses for a dataset that consisted only of LSDB-derived mutations not found in SwissProt (Table 3). This was done as some methods have been trained with Swiss-Prot disease-causing mutations. Because all methods (except SNPs&GO), and not only the ones trained on Swiss-Prot data, performed worse in this subcategory, we claim our results are not biased, even though we acknowledge that a perfectly fair comparison between methods trained on different datasets cannot be made.

To study the effect of residue types, the mutated and mutant amino acids were assigned into six groups according to their physicochemical properties: hydrophobic (C, F, I, L, M, V, W, and Y), positively charged (H, K, and R), negatively charged (D and E), conformational (G and P), polar (N, Q, and S), and A and T [Shen and Vihinen, 2004]. There were small differences in accuracy and precision of the methods for different types of wild-type or mutant amino acids, but their sensitivity and MCC were dependent on the physicochemical properties of the wild-type and mutant amino acids (Fig. 1). The methods were more sensitive to mutations at conformational, hydrophobic, and positively charged amino acids than mutations at polar residues or A and T (Fig. 1). MCC differed as well depending on the nature of the original residue position, and substitutions at hydrophobic positions were predicted best by most methods. Panther predicted mutations at hydrophobic and positively charged residues with equal performance, and MutPred and SNPs&GO performed better predicting conformational residues. Mutations affecting negatively charged residues had the lowest MCCs by most methods, except for PolyPhen1b, which predicted other classes better than the conformational class, and MutPred, nsSNPAnalyzer, and SNPs&GO, which had the lowest MCC when predicting the effects of mutations altering A and T residues (Fig. 1). The sensitivity and MCC of the methods also varied in predicting the effects of different types of mutant residues. All the methods performed best when the substituting residue was charged, and in the case of nsSNPAnalyzer, polar residues were predicted better than negatively charged residues, and SNAP predicted polar residues better than positively charged residues.
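The six residue groups can be expressed as a lookup table, which makes it straightforward to tabulate predictions by the group of the original or the substituting residue. The names below are hypothetical; only the group memberships are taken from the text.

```python
# Physicochemical groups as listed in the text [Shen and Vihinen, 2004].
RESIDUE_GROUPS = {
    "hydrophobic": "CFILMVWY",
    "positively charged": "HKR",
    "negatively charged": "DE",
    "conformational": "GP",
    "polar": "NQS",
    "A and T": "AT",
}

# Reverse lookup: one-letter residue code -> group name.
GROUP_OF = {
    residue: group
    for group, residues in RESIDUE_GROUPS.items()
    for residue in residues
}


def substitution_groups(original, mutant):
    """Return the groups of the original (mutated) and substituting (mutant) residues."""
    return GROUP_OF[original.upper()], GROUP_OF[mutant.upper()]
```

For instance, a cysteine-to-arginine substitution maps to the pair ("hydrophobic", "positively charged").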

Figure 1.

The values of the quality parameters, accuracy, precision, sensitivity, and Matthews correlation coefficient (MCC) for different classes of substituted amino acids. A: accuracy, B: precision, C: sensitivity, and D: MCC. Abbreviations: Charge +, positively charged. Charge −, negatively charged.

Differences in prediction sensitivity could also be seen at the level of individual amino acids. Predictions for substitutions at C, W, and Y were clearly more sensitive than at other residues by all methods (Fig. 2A). A similar trend was also seen when looking at mutant amino acids: mutations to the aforementioned residues were predicted with better sensitivity (Fig. 2A). The sensitivity of PolyPhen2b and SNAP varied less at individual residues than that of the other programs.

Figure 2.

The values of sensitivity and Matthews correlation coefficient (MCC) for different types of amino acid substitutions. A: Sensitivity in different amino acid residues. Left: mutated (original) amino acids, right: substituting (mutant) amino acids. B: Sensitivity (left) and MCC (right) for amino acid substitutions at different secondary structural elements. C: Sensitivity (left) and MCC (right) for amino acid substitutions according to the accessible surface area (ASA) of the position (buried ASA ≤10%, exposed ASA ≥25%). D: Sensitivity (left) and MCC (right) for amino acid substitutions at different protein structural classes.

The results for the substitutions in the secondary structural elements are shown in Figure 2B. All of the programs predicted the effects of substitutions at different secondary structures with almost equal accuracy and precision. Sensitivity and MCC values showed more variation with secondary structure. In terms of MCC, MutPred, nsSNPAnalyzer, PolyPhen1b, and PolyPhen2b predicted amino acid substitutions at strands best, whereas Panther, PolyPhen1a, SNAP, and SNPs&GO performed best at turns. PhD-SNP and SIFT predicted substitutions positioned at α-helices best, and PolyPhen2a at coils. The differences in MCC were not striking. Except for Panther, PhD-SNP, and SNPs&GO, all methods were most sensitive when predicting the effects of amino acid substitutions at strands. Solvent-accessible surface areas of the positions did not markedly affect prediction accuracy or precision, but all the methods were more sensitive when predicting the effects of substitutions at buried positions (Fig. 2C). MCC for most methods was better at exposed than buried positions, except for PolyPhen1a and PolyPhen2a, which performed better at buried positions. MCCs for PolyPhen1b and SNAP did not differ with solvent accessibility of the position. These results are not in line with a previous study [Mort et al., 2010], where a sequence conservation based method yielded results of lower accuracy when predicting the effects of solvent-exposed residues.
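The buried and exposed classes follow the thresholds given in the Figure 2 legend (ASA ≤10% and ASA ≥25%). The sketch below shows one way to apply them in Python; the function name and the "intermediate" label for positions between the two thresholds are assumptions made for illustration, as the text names only the two classes.

```python
def classify_accessibility(relative_asa_percent):
    """Classify a position by its relative accessible surface area (percent)."""
    if relative_asa_percent <= 10.0:
        return "buried"    # buried: ASA <= 10%
    if relative_asa_percent >= 25.0:
        return "exposed"   # exposed: ASA >= 25%
    # Hypothetical label: the text does not name the class between the thresholds.
    return "intermediate"
```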

CATH classifies proteins as mainly α-helical or β-stranded, mixed α- and β-structures (α–β), or as having few secondary structures. Interestingly, none of the proteins included in this analysis was assigned into the few secondary structures class. The predictions differed with respect to sensitivity and MCC depending on the protein class in which a mutation appeared (Fig. 2D). Most programs were more sensitive to amino acid substitutions in the α–β class of proteins, but SNPs&GO predicted substitutions best in the mainly β-class. nsSNPAnalyzer predicted those mutations occurring in α–β and α-helical proteins or domains with equal sensitivity. MCCs varied significantly with the structural class of proteins, especially in the predictions by nsSNPAnalyzer, PolyPhen1b, PolyPhen2a and 2b, and SNPs&GO. The results were generally better for the α–β class of proteins, but nsSNPAnalyzer predicted substitutions in α-helical proteins best and SNPs&GO performed best with proteins in the mainly β-class.

To further evaluate the performance of the programs we compared them in a pairwise fashion (Table 4). The numbers of cases that were shared by the programs varied because the number of cases that could be predicted by each program varied as described in the Materials and Methods section. The largest percentage of correctly predicted cases by two programs was 68.2% (for the combination of MutPred and SNPs&GO). On average, the fraction of correctly predicted cases between any two programs was 57.7%. The correlations between two programs were highest for MutPred and PhD-SNP (0.57), and for PolyPhen 1 and 2 (0.57 for the less stringent b versions, and 0.56 for the a versions) (without taking into account the higher correlation between PolyPhen1a or 2a and PolyPhen1b or 2b that are different forms of the same program). Correlation was lowest for nsSNPAnalyzer and SNPs&GO (0.25).

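Table 4. Pairwise Prediction Correlations

Upper table: the number of cases shared by two programs (upper right triangle) and the number of cases predicted correctly (lower left triangle).

| | MutPred | nsSNPAnalyzer | Panther | PhD-SNP | PolyPhen1a | PolyPhen1b | PolyPhen2a | PolyPhen2b | SIFT | SNAP | SNPs&GO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MutPred | | 8721 | 22645 | 36300 | 36522 | 36522 | 35198 | 35198 | 32705 | 29674 | 34066 |
| nsSNPAnalyzer | 4620 | | 7237 | 9225 | 9380 | 9380 | 9353 | 9353 | 8270 | 8609 | 9145 |
| Panther | 15296 | 3589 | | 23671 | 23869 | 23869 | 23406 | 23406 | 21540 | 20713 | 22555 |
| PhD-SNP | 23955 | 4389 | 14838 | | 39659 | 39659 | 38254 | 38254 | 34532 | 32203 | 37095 |
| PolyPhen1a | 22125 | 4386 | 13961 | 22756 | | 40146 | 38485 | 38485 | 34683 | 32533 | 37324 |
| PolyPhen1b | 22208 | 4965 | 14701 | 22170 | 23764 | | 38485 | 38485 | 34683 | 32533 | 37324 |
| PolyPhen2a | 22234 | 4777 | 14728 | 21871 | 22383 | 23156 | | 38782 | 33686 | 31790 | 36317 |
| PolyPhen2b | 20911 | 5012 | 14288 | 20042 | 19656 | 22412 | 24006 | | 33686 | 31790 | 36317 |
| SIFT | 18807 | 4302 | 12623 | 18879 | 18207 | 18985 | 18645 | 17833 | | 28726 | 32434 |
| SNAP | 18877 | 4750 | 13307 | 18004 | 17024 | 19811 | 19321 | 19945 | 16393 | | 30987 |

Lower table: the number of cases predicted correctly, reported as a percentage (upper right triangle), and the pairwise correlation (lower left triangle).

| | MutPred | nsSNPAnalyzer | Panther | PhD-SNP | PolyPhen1a | PolyPhen1b | PolyPhen2a | PolyPhen2b | SIFT | SNAP | SNPs&GO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MutPred | | 53.0 | 67.5 | 66.0 | 60.6 | 60.8 | 63.2 | 59.4 | 57.5 | 63.6 | 68.2 |
| nsSNPAnalyzer | 0.36 | | 49.6 | 47.6 | 46.8 | 52.9 | 51.1 | 53.6 | 52.0 | 55.2 | 51.1 |
| Panther | 0.54 | 0.37 | | 62.7 | 58.5 | 61.6 | 62.9 | 61.0 | 58.6 | 64.2 | 63.3 |
| PhD-SNP | 0.57 | 0.35 | 0.51 | | 57.4 | 55.9 | 57.2 | 52.4 | 54.7 | 55.9 | 62.9 |
| PolyPhen2a | 0.49 | 0.44 | 0.51 | 0.45 | 0.56 | 0.58 | | 61.9 | 55.3 | 55.3 | 60.7 |
| PolyPhen2b | 0.44 | 0.42 | 0.49 | 0.40 | 0.46 | 0.57 | 0.72 | | 52.9 | 62.7 | 56.6 |
| SIFT | 0.41 | 0.53 | 0.48 | 0.45 | 0.45 | 0.52 | 0.50 | 0.51 | | 57.0 | 55.9 |
| SNAP | 0.46 | 0.41 | 0.51 | 0.44 | 0.44 | 0.54 | 0.52 | 0.53 | 0.53 | | 60.8 |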
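The counts and percentages in Table 4 can be related with a short Python sketch. It assumes that "predicted correctly" means that both programs gave the correct label for a shared case; the function name and the dictionary-based input format are hypothetical. The correlation measure is not specified in this section, so it is not computed here.

```python
def pairwise_agreement(pred_a, pred_b, truth):
    """Compare two predictors on the variants that both could score.

    pred_a, pred_b: dicts mapping a variant identifier to a predicted label.
    truth: dict mapping a variant identifier to the true label.
    Returns (shared cases, cases both predicted correctly, percentage).
    """
    shared = pred_a.keys() & pred_b.keys() & truth.keys()
    both_correct = sum(
        1 for v in shared if pred_a[v] == truth[v] and pred_b[v] == truth[v]
    )
    percent = 100.0 * both_correct / len(shared) if shared else float("nan")
    return len(shared), both_correct, percent


# Check against Table 4: MutPred and nsSNPAnalyzer share 8721 cases,
# of which 4620 were predicted correctly.
print(round(100.0 * 4620 / 8721, 1))  # 53.0
```

The printed value matches the 53.0% reported for this pair in the table.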


In this study we evaluated how reliably the pathogenicity of missense mutants can be predicted, and whether selected features of the variant or the structural context affect prediction performance. The processing of the vast and increasing amount of genetic variation data requires the development of automatic annotation tools to determine the potential pathological character of a given variant. Prioritizing the most interesting and likely pathogenic cases for experimental analysis is another important application of the tested prediction methods.

To our knowledge, no comprehensive evaluation of the performance of missense variant pathogenicity predictors has been made outside the performance studies of individual methods in the context of their development. We selected test sets that have not been used as such in the training of the methods, but a subset of the pathogenic dataset consists of mutations from Swiss-Prot, and some methods (MutPred, nsSNPAnalyzer, PhD-SNP, PolyPhen2, and SNPs&GO) used Swiss-Prot mutations in their training. Testing a method with the same cases it was trained on would lead to biased results, so the methods trained on Swiss-Prot mutations would have an advantage over the other methods. However, because the pathogenic dataset includes a large number of LSDB variations not found in Swiss-Prot, we claim that the test set was not similar enough to the training sets to give an advantage to the methods trained on Swiss-Prot data. Further, we tested the methods with cases coming only from LSDBs. With this dataset the performance of all methods decreased, whether trained on Swiss-Prot data or not, except for SNPs&GO. This indicates that the good performance of SNPs&GO was not a result of previous exposure to the test dataset during its training phase. Furthermore, the poor performance of PhD-SNP indicates that the method did not benefit from possible identical cases in the data used for training and testing. However, it is impossible to construct a large testing dataset that would not share any cases with the original training sets of any of the methods, especially when the specific contents of the training sets are rarely published.

The neutral dataset was generated from dbSNP entries that had a frequency higher than 1% when data were available for at least 25 individuals (50 chromosomes). This way the number of false negatives in the test set could be minimized.
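The two selection criteria for the neutral dataset can be sketched as a simple filter. This is an illustrative Python sketch only: the function and parameter names are hypothetical, and it assumes that the frequency is expressed as a fraction and that the sample size is available as a chromosome count.

```python
MIN_FREQUENCY = 0.01   # "frequency higher than 1%"
MIN_CHROMOSOMES = 50   # "at least 25 individuals (50 chromosomes)"


def is_neutral_candidate(frequency, chromosomes_sampled):
    """Return True if a dbSNP-style entry meets both criteria from the text."""
    return frequency > MIN_FREQUENCY and chromosomes_sampled >= MIN_CHROMOSOMES
```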

There are still other pathogenicity predictors that we did not evaluate. SNPs3D [Yue et al., 2006] was not included in this study because it does not allow submission of user-defined amino acid substitutions. Similarly, LS-SNP [Karchin et al., 2005] is an annotated database of SNPs, not a prediction method for any user-provided variant, although it is often referred to as a prediction method for nsSNP pathogenicity. The Auto-Mute predictor of disease potential of human nsSNPs [Barenboim et al., 2008] was left out of the analysis because the program did not allow batch submission. PMut [Ferrer-Costa et al., 2005] could not be tested because the server did not return predictions.

Overall, we found SNPs&GO and MutPred to be clearly the most reliable predictors for our dataset of genetic variants. The accuracies of all the methods were in the range of 0.60–0.82, and precision ranged from 0.59 to 0.90. More variation among the methods was seen when considering the sensitivities and MCC values that ranged from 0.52 to 0.88 and 0.19 to 0.65, respectively. The local structural context of a mutated residue did not dramatically affect predictor performance in most cases but most methods showed variance in their prediction power at the level of protein tertiary structure classification and at different mutated positions.
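For reference, the quality parameters reported here can be computed from the four confusion-matrix counts. The Python sketch below uses the standard textbook definitions, which are assumed to match those given in the Materials and Methods section; the function names are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def quality_measures(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: pathogenic variants predicted as pathogenic/neutral.
    tn/fp: neutral variants predicted as neutral/pathogenic.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "npv": _ratio(tn, tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which is why it can separate methods whose accuracies look similar.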

Studies have shown that combining information obtained from the multiple sequence alignment and three-dimensional protein structure can increase prediction performance [Bromberg and Rost, 2007; Saunders and Baker, 2002]. According to our results, this is not always the case. Panther operates solely on sequence-based evolutionary information, and it is one of the best performing methods, outperforming all the methods incorporating structural information in the prediction, except for MutPred, which uses sequence-derived structural predictions as features in combination with evolutionary information. Furthermore, although nsSNPAnalyzer uses the SIFT method for the evolutionary analysis and also includes structure-derived features, its overall performance is below that for SIFT, except for an increase in specificity in the structure subset of data. However, the two best performing predictors include both protein structural or functional and MSA-derived information in the prediction.

It is very difficult to determine whether the notable differences in the performance of these methods are caused by differences in the features utilized by the methods or in the training datasets. For example, SNPs&GO uses GO annotations as a feature, and GO is biased toward genes involved in diseases. The PDB is biased as well, containing structures of mostly well-studied proteins, which include products of disease-related genes. Therefore, one would expect SNPs&GO to perform better in predicting the effects of missense variants in proteins that have structures in the PDB, as they are likely to have GO annotation as well; in fact, it performs worse. One factor that very probably affects prediction reliability is the quality of the multiple sequence alignment. Because all of the methods studied here use an MSA as input to the prediction, the quality of the provided MSA should be very carefully assessed. For many of the methods, we did not find documentation of how the MSA is constructed when the user provides just the query sequence as input. For example, an automatic BLAST search, often performed by the programs, may lead to the construction of an MSA that contains multiple versions of the same sequence or paralogous sequences, affecting the resulting conservation analysis. The MSA should contain a selection of closely and distantly related sequences in order to effectively yield a conservation signal.

In conclusion, those methods that performed best had high accuracy (reaching 0.82, SNPs&GO), precision (0.90, SNPs&GO), specificity (0.92, SNPs&GO), sensitivity (0.88, SNAP), and NPV (0.84, MutPred). Matthews correlation coefficient reached the value of 0.65 at best (SNPs&GO). There is no single method that could be rated as best by all parameters, so the user should consider what aspects would be most valuable considering the nature of the data analyzed. Furthermore, some methods require 3D structure coordinates, limiting the number of cases that can be analyzed (nsSNPAnalyzer), and some methods are at least currently too slow for high-throughput analyses (SNAP). Although some of the existing methods perform reasonably well, development of new, more reliable methods is certainly needed. Complementary methods could be combined in a metaserver to yield more reliable predictions.
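One simple way to combine the output of several methods is an unweighted majority vote. The Python sketch below is only an illustration of that idea: the study does not propose a specific combination scheme, and the function name, the label strings, and the handling of ties are assumptions.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Majority vote over per-method labels; ties are reported as 'undecided'.

    predictions maps a method name to "pathogenic", "neutral", or None
    (None meaning the method returned no prediction for this variant).
    """
    votes = Counter(label for label in predictions.values() if label is not None)
    if not votes:
        return "undecided"
    ranked = votes.most_common()
    # A tie between the two most frequent labels gives no consensus.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "undecided"
    return ranked[0][0]
```

Because the methods differ in reliability, a weighted scheme might be preferable in practice; the unweighted vote is used here only to keep the sketch short.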


The authors thank Pier Luigi Martelli and Rita Casadio from the University of Bologna, Biao Li from Indiana University, and Sean Mooney and Vidhya Krishnan from the Buck Institute for Age Research for their cooperation in running the data.