Hansa: An automated method for discriminating disease and neutral human nsSNPs†
Communicated by Marc S. Greenblatt
Variations are mostly due to nonsynonymous single nucleotide polymorphisms (nsSNPs), some of which are associated with certain diseases. Phenotypic effects of a large number of nsSNPs have not been characterized. Although several methods have been developed to predict the effects of nsSNPs as “disease” or “neutral,” there is still a need for development of methods with improved prediction accuracies. We, therefore, developed a support vector machine (SVM) based method named Hansa which uses a novel set of discriminatory features to classify nsSNPs into disease (pathogenic) and benign (neutral) types. Validation studies on a benchmark dataset and further on an independent dataset of well-characterized known disease and neutral mutations show that Hansa outperforms the other known methods. For example, fivefold cross-validation studies using the benchmark HumVar dataset reveal that at the false positive rate (FPR) of 20% Hansa yields a true positive rate (TPR) of 82% that is about 10% higher than the best-known method. Hansa is available in the form of a web server at http://hansa.cdfd.org.in:8080. Hum Mutat 33:332–337, 2012. © 2011 Wiley Periodicals, Inc.
Single nucleotide polymorphisms (SNPs) are single-nucleotide changes in DNA that account for approximately 90% of the genetic variation among individuals in a population [Collins et al., 1998]. The SNPs that lead to amino acid changes in the protein products are referred to as nonsynonymous SNPs (nsSNPs). Some amino acid changes are tolerated by proteins with no concomitant phenotypic effects, and the corresponding nsSNPs are referred to as benign or neutral nsSNPs. The nsSNPs that lead to amino acid changes not tolerated by protein structure and function, and that consequently lead to disease phenotypes, are referred to as pathogenic or disease nsSNPs [Bao and Cui, 2005; Cargill et al., 1999; Ng and Henikoff, 2006; Saunders and Baker, 2002; Stenson et al., 2003; Thorisson and Stein, 2003; Yue and Moult, 2006]. The importance of nsSNPs in humans is reflected in various databases, including dbSNP [Sherry et al., 1999], the Human Genome Variation Database (HGVbase) [Fredman et al., 2002], and Online Mendelian Inheritance in Man (OMIM) [Hamosh et al., 2005]. Although comparative genetic analyses of healthy and diseased individuals have led to the discovery of a number of rare missense mutations, as well as nsSNPs associated with diseases, the list may be far from complete. The uncharacterized nsSNPs discovered through the human genome project outnumber the characterized ones. In this postgenomic era, classifying nsSNPs as disease or neutral is therefore regarded as a first step before studies such as pharmacogenomics are attempted.
Several methods have been developed to classify human nsSNPs/rare missense mutations into benign/neutral or pathogenic/disease categories [Mooney, 2005; Ng and Henikoff, 2006; Thusberg and Vihinen, 2009]. These methods use a range of attributes, including sequence-based attributes [Ferrer-Costa et al., 2004; Ng and Henikoff, 2001, 2003; Thomas et al., 2004], evolution-based attributes [Fay et al., 2001], and a combination of structural and evolutionary information [Chasman and Adams, 2001; Ferrer-Costa et al., 2002; Sunyaev et al., 2001]. They also employ a variety of machine-learning techniques, including linear logistic regression [Li et al., 2006; Saunders and Baker, 2002], decision trees [Dobson et al., 2006; Krishnan and Westhead, 2003], random forest [Bao and Cui, 2005; Barenboim et al., 2008], neural networks [Ferrer-Costa et al., 2004, 2005], neuro-fuzzy classifier [Barenboim et al., 2008], Bayesian classifier [Adzhubei et al., 2010; Cai et al., 2004; Needham et al., 2006], ridge partial least square [Li et al., 2006], and support vector machines (SVMs) [Bao and Cui, 2005; Krishnan and Westhead, 2003; Tian et al., 2007; Torkamani and Schork, 2007; Yue et al., 2005; Yue and Moult, 2006]. Despite the availability of various methods, there remains a need for new methods with better prediction accuracies.
Recently, we developed an SVM-based method named “Hansa” which uses a novel set of features (referred to as nsSNP neutral-disease (nsSNPND) discriminatory features) associated with nsSNPs. This nsSNPND discriminatory feature set includes position-specific probabilities, local protein structural status, and the intrinsic properties of the wild-type and mutated residues. Validation studies on the HumVar dataset and an independent dataset of cancer-associated missense mutations reveal that Hansa outperforms the other known methods in the correct prediction of disease and neutral nsSNPs.
Materials and Methods
An SVM-based method such as Hansa requires two things: datasets of known disease and neutral nsSNPs for training and testing, and a set of features that distinguish disease nsSNPs from neutral ones (discriminatory features). The following sections give the details of the training and testing datasets and the discriminatory features used in the present study.
The Benchmark Dataset for Training and Testing
We used the HumVar dataset (ftp://genetics.bwh.harvard.edu/datasets/pph2/humvar-2.0.17.tar.gz) [Adzhubei et al., 2010], which is a curated dataset of known disease and neutral nsSNPs. This dataset comprises 13,032 “Disease” mutations from 1,111 proteins and 8,946 “polymorphisms (neutral)” from 3,484 proteins. This dataset has been used as a benchmark dataset by many reported studies [Adzhubei et al., 2010; Capriotti et al., 2006; Tian et al., 2007], and hence the same dataset was preferred in the present study so as to enable comparison of the results obtained in this study with those already reported in the literature.
Hansa uses 10 nsSNPND discriminatory features that include six position-specific features, two structure-based, and two amino acid residue-based features. The details of these features are given below (Table 1).
Table 1. The List of 10 nsSNPND Features Analyzed in the Present Study
(a) These scores were calculated using the perl script psap.pl available at http://www.mobioinfor.cn/parepro/ [Tian et al., 2007].
(b) These scores were calculated using the PROPHECY program available in the EMBOSS suite [Rice et al., 2000].
(c) Solvent accessibility calculated from ACCpro4.0 [Cheng et al., 2005].
(d) Secondary structure prediction calculated from SSpro v4.5 [Cheng et al., 2005].
(e) Transfer free energy values from inside to outside of a globular protein [Janin, 1979].
The Position-Specific Features
These features correspond to the position-specific preferences of amino acid residues at the mutation sites. One of the features is the position-specific probability score calculated using the 20-component Dirichlet mixture (DM) of priors [Sjölander et al., 1996; Tian et al., 2007], combined with the observed amino acid frequencies in the columns corresponding to the mutation sites in the multiple sequence alignment of the target human protein sequence with its homologs. The 20-component DMs were downloaded from http://compbio.soe.ucsc.edu/dirichlets/dist.20comp. The script psap.pl available at http://www.mobioinfor.cn/parepro/ was used for calculating scores.
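The way a Dirichlet mixture prior is combined with observed column counts can be sketched as follows. This is a minimal illustration of the posterior-mean estimate described by Sjölander et al. [1996], not a reproduction of psap.pl; the function names, the three-letter alphabet, and the two-component mixture are toy assumptions chosen to keep the example short, whereas the method itself uses the 20-component mixture over the 20 amino acids.

```python
from math import exp, lgamma, log

def log_beta(vec):
    """Log of the multivariate Beta function: sum(lgamma(v)) - lgamma(sum(v))."""
    return sum(lgamma(v) for v in vec) - lgamma(sum(vec))

def dm_posterior_probs(counts, mix_weights, alphas):
    """Posterior-mean residue probabilities for one alignment column.

    counts      : observed residue counts n_i in the column
    mix_weights : mixture coefficients q_j (must sum to 1)
    alphas      : one pseudocount vector alpha_j per mixture component
    """
    total = sum(counts)
    # log of q_j * B(n + alpha_j) / B(alpha_j) for every component j
    log_terms = []
    for q, alpha in zip(mix_weights, alphas):
        shifted = [n + a for n, a in zip(counts, alpha)]
        log_terms.append(log(q) + log_beta(shifted) - log_beta(alpha))
    # normalise in log space to obtain P(component j | counts)
    peak = max(log_terms)
    unnorm = [exp(t - peak) for t in log_terms]
    z = sum(unnorm)
    post = [u / z for u in unnorm]
    # mix the per-component posterior means (n_i + alpha_ji) / (|n| + |alpha_j|)
    probs = []
    for i, n_i in enumerate(counts):
        p = sum(w * (n_i + alpha[i]) / (total + sum(alpha))
                for w, alpha in zip(post, alphas))
        probs.append(p)
    return probs

# Toy mixture (illustrative values only, not the published 20-component parameters)
toy_weights = [0.5, 0.5]
toy_alphas = [[1.0, 1.0, 1.0], [5.0, 0.5, 0.5]]
empty_column = dm_posterior_probs([0, 0, 0], toy_weights, toy_alphas)
deep_column = dm_posterior_probs([10, 0, 0], toy_weights, toy_alphas)
```

With no observations the estimate falls back to the prior mean of the mixture, and as counts accumulate the observed frequencies dominate; this is the behavior that makes the prior useful for shallow alignments.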
The other position-specific feature is Gribskov's score G(a,b) [Gribskov et al., 1987], calculated for the amino acid residues at the column corresponding to the mutation site in the multiple sequence alignment of the target human protein sequence with its homologs. These scores were calculated using the PROPHECY program (available in the EMBOSS suite) [Rice et al., 2000].
We calculated the position-specific probability scores and the G(a,b) scores for both the wild-type and the mutant residues. In addition, we also considered the differences between the wild-type and mutant values of each score.
The multiple sequence alignments (MSAs) used for calculating the amino acid frequencies for both the position-specific probability scores and the G(a,b) scores were obtained using ClustalW-2 [Chenna et al., 2003], and no further manual curation of the alignments was done. The homologs of the target human sequence were identified by PSI-BLAST [Altschul et al., 1997] searches of the nonredundant (NR) database with an E-value cutoff of 1 × 10^−15 and 3–4 rounds of iteration until convergence was reached. From the PSI-BLAST hits, homologs shorter than 70% of the query sequence length were removed before multiple sequence alignment by ClustalW-2.
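The length filter applied to the PSI-BLAST hits can be sketched as below. The function name and the (name, sequence) tuple layout are hypothetical, and the sketch assumes that length is measured on the full hit sequence, which the text does not specify.

```python
def filter_by_length(query_len, hits, min_percent=70):
    """Keep hits whose length is at least min_percent of the query length.

    hits is a list of (name, sequence) pairs; integer arithmetic avoids
    floating-point surprises right at the cutoff.
    """
    return [(name, seq) for name, seq in hits
            if len(seq) * 100 >= min_percent * query_len]

# Illustrative hits for a query of length 100 (sequences are placeholders)
hits = [("hit_short", "A" * 69), ("hit_edge", "A" * 70), ("hit_long", "A" * 120)]
kept = filter_by_length(100, hits)
```

Only the retained sequences would then be passed on to the alignment step.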
While searching for homologs, we made sure that only the relevant domain containing the given mutation was used as the query. Domain boundaries were identified using ProDom [Servant et al., 2002] (http://prodom.prabi.fr/prodom/current/html/home.php).
Protein Structure Based Features
We considered the secondary structural status and the solvent accessibility status of the amino acid residues at the mutation sites. These structural features at the mutation sites were extracted for both wild-type and mutant amino acid residues from the predictions made on the target human protein using ACCpro (solvent accessibility) and SSpro v4.5 (secondary structural status) of the SCRATCH suite [Cheng et al., 2005]. All the residues with predicted solvent accessibility values <10% were considered as buried residues and those with higher solvent accessibility values were considered as exposed. Based on SSpro predictions, the amino acid residues were assigned to one of the three secondary structural classes, viz., helix, strand, and coil.
Amino Acid Based Features
In addition to the above-mentioned features, we also considered two intrinsic amino acid features. One of them is the difference in transfer free energy values of amino acid residues from the inside to the surface of the protein [Janin, 1979], and the other is the pairwise substitution score from the BLOSUM62 substitution matrix.
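To make the composition of the 10-feature vector concrete, a possible assembly step is sketched below. The class and field names, the integer coding of the secondary-structure classes, the 0/1 coding of buried versus exposed, and the sign convention of the differences (wild type minus mutant) are all assumptions made for illustration; the text does not state how Hansa encodes these values for the SVM, and the numbers in the example are invented.

```python
from dataclasses import dataclass

# Arbitrary integer codes for the three predicted classes (assumption)
SS_CODES = {"helix": 0, "strand": 1, "coil": 2}

@dataclass
class SiteScores:
    prob_wt: float               # position-specific probability, wild type
    prob_mut: float              # position-specific probability, mutant
    gribskov_wt: float           # Gribskov score G(a,b), wild type
    gribskov_mut: float          # Gribskov score G(a,b), mutant
    rel_accessibility: float     # predicted solvent accessibility, percent
    sec_struct: str              # "helix", "strand", or "coil"
    transfer_energy_diff: float  # difference in transfer free energy
    blosum62: int                # BLOSUM62 score for the substitution

def to_feature_vector(s):
    """Flatten one mutation site into a list of 10 numeric features."""
    buried = 1.0 if s.rel_accessibility < 10.0 else 0.0  # <10% counts as buried
    return [
        s.prob_wt, s.prob_mut, s.prob_wt - s.prob_mut,
        s.gribskov_wt, s.gribskov_mut, s.gribskov_wt - s.gribskov_mut,
        buried, float(SS_CODES[s.sec_struct]),
        s.transfer_energy_diff, float(s.blosum62),
    ]

# Invented values, for illustration only
example = SiteScores(prob_wt=0.5, prob_mut=0.25, gribskov_wt=4.0,
                     gribskov_mut=-1.0, rel_accessibility=5.0,
                     sec_struct="helix", transfer_energy_diff=0.5, blosum62=-2)
vector = to_feature_vector(example)
```

The first six entries are the position-specific features, the next two the structure-based features, and the last two the amino acid based features, matching the grouping given in the text.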
SVM Learning and Testing
We used SVM, a supervised machine-learning method first developed by Vapnik [1995] that is extensively used for classification and regression problems. All the SVM computations were carried out using LIBSVM [Chang and Lin, 2001] with the radial basis function (RBF) kernel. Training was done using different values of the cost parameter C = [2^11, 2^10, …, 2^−3] and the kernel parameter γ = [2^−3, 2^−2, …, 2^−11], and only those that gave the best results were retained.
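The parameter search described above amounts to a grid search over powers of two, which might be sketched as follows. The helper names and the `evaluate` callback are hypothetical stand-ins for a cross-validated LIBSVM run. The exponents for C follow the stated range (11 down to −3), but the γ list as printed is not monotonic, so the γ bounds used here (−2 down to −11) are an assumption.

```python
from itertools import product

def power_of_two_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs with C = 2**c and gamma = 2**g."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in product(c_exponents, gamma_exponents)]

def best_parameters(grid, evaluate):
    """Return the (C, gamma) pair with the highest evaluate(C, gamma) score."""
    return max(grid, key=lambda pair: evaluate(*pair))

c_exponents = range(11, -4, -1)       # 2^11 ... 2^-3, as stated in the text
gamma_exponents = range(-2, -12, -1)  # assumed bounds, see the note above
grid = power_of_two_grid(c_exponents, gamma_exponents)
```

In practice `evaluate` would train the SVM with the given pair and return a cross-validation score, and only the best-scoring pair would be retained.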
An n-fold cross-validation is generally used to test the generalization and the stability of a method [Bhasin and Raghava, 2004]. In this study, we performed fivefold cross-validation on the HumVar dataset, where the dataset was partitioned randomly into five equal-sized and nonoverlapping sets. The training and testing of our method were carried out five times, each time using one distinct set for testing and the other four sets for training.
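The partitioning step can be sketched in a few lines. The function names and the fixed random seed are assumptions, and the sketch does not stratify by class because the text does not say whether the original partition did. When the dataset size is not divisible by five, the folds differ in size by at most one example.

```python
import random

def k_fold_indices(n_examples, k=5, seed=0):
    """Shuffle example indices and deal them into k nonoverlapping folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_examples, k=5, seed=0):
    """Yield (train, test) index lists, using each fold once for testing."""
    folds = k_fold_indices(n_examples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# HumVar size as given in the text: 13,032 disease + 8,946 neutral mutations
folds = k_fold_indices(13032 + 8946)
```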
In order to compare the performance of Hansa with that of the other methods, true positive rates (TPRs) obtained for different false positive rates (FPRs) were calculated using the ROCR package for R [Sing et al., 2005], where TPR = TP/(TP + FN), FPR = 1 − TN/(TN + FP), TP = the number of true positives (TPs) (i.e., the number of correctly predicted pathogenic/disease mutations), TN = the number of true negatives (TNs) (i.e., the number of correctly predicted neutral mutations), FP = the number of false positives (i.e., the number of neutral mutations wrongly predicted as pathogenic), and FN = the number of false negatives (i.e., the number of pathogenic mutations wrongly predicted as neutral).
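The two rates, and the lookup of the TPR at a chosen FPR, can be illustrated with a short sketch. This is a plain Python stand-in for what ROCR computes, not the ROCR implementation; the function names and the toy labels and scores are assumptions, with label 1 denoting a disease mutation and 0 a neutral one.

```python
def rates(labels, predictions):
    """Return (TPR, FPR) for binary labels and predictions (1 = disease)."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)  # algebraically equal to 1 - TN/(TN + FP)
    return tpr, fpr

def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR reachable by thresholding scores while FPR <= max_fpr."""
    best = 0.0
    for threshold in sorted(set(scores)):
        preds = [1 if s >= threshold else 0 for s in scores]
        tpr, fpr = rates(labels, preds)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best

# Toy data: four disease and five neutral examples with decision scores
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
```

Scanning every distinct score as a threshold is quadratic in the number of examples, which is acceptable for a sketch but not for a dataset the size of HumVar.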
In addition to fivefold cross-validation tests, we also performed “Leave-one-out cross-validation” (LOOCV) to further assess the generalization and stability of our method. This involves using each of the examples in the dataset in turn as the test set and the remaining examples as the training set. We used svm-train of LIBSVM to calculate LOOCV.
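The LOOCV loop itself is independent of the classifier and can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in for the SVM; that substitution, the function names, and the toy data are assumptions. With LIBSVM, the same procedure would presumably be obtained from the n-fold cross-validation mode of svm-train (the -v option) with the number of folds set to the number of examples.

```python
def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict_centroid(model, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))

def loocv_accuracy(X, y, fit, predict):
    """Hold out each example once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy one-dimensional data with two well-separated classes
toy_X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_accuracy = loocv_accuracy(toy_X, toy_y, fit_centroids, predict_centroid)
```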
Performance Evaluation of Hansa Using the HumVar Dataset
As mentioned earlier, Hansa was evaluated by performing fivefold cross-validation on the HumVar dataset comprising 13,032 known disease mutations and 8,946 neutral mutations. To get a complete picture of the performance of Hansa, we computed TPRs for different FPRs. Table 2 gives the TPR and FPR values obtained for Hansa along with the TPR and FPR values reported by Adzhubei et al. [2010] for SIFT (Sorting Intolerant From Tolerant) [Ng and Henikoff, 2001] (http://sift-dna.org and http://sift.jcvi.org/), PolyPhen (Polymorphism Phenotyping) [Ramensky et al., 2002] (http://genetics.bwh.harvard.edu/pph), SNPs3D [Yue et al., 2006] (http://www.snps3d.org/), SNAP (Screening for nonacceptable polymorphisms) [Bromberg and Rost, 2007] (http://rostlab.org/services/snap/), and PolyPhen-2 [Adzhubei et al., 2010] (http://genetics.bwh.harvard.edu/pph2). Adzhubei et al. [2010] used the same HumVar dataset as the present study to calculate the TPR–FPR values, and hence these values can be compared with the values obtained for Hansa in this study. It can be seen from Table 2 that for any given FPR, Hansa achieves a better TPR than the other methods. For example, at an FPR of 0.2 (20%) Hansa gives a TPR of 0.82 (82%), which is about 0.1 (10 percentage points) higher than that given by the best of the other methods, PolyPhen-2. The improvement achieved by Hansa as compared to the other methods is statistically significant (paired t-test with the null hypothesis that any two methods compared perform equally; P-value < 10^−4).
Table 2. The True Positive Rates (TPRs) Obtained for Different False Positive Rates by Hansa, PolyPhen-2, SIFT, SNAP, SNPs3D, and PolyPhen
Performance Evaluation Using Independent Datasets
We also used an independent dataset for evaluating the classification performance of Hansa and compared its performance with that of the other known methods. The independent dataset comprises known and well-characterized missense mutations observed in four cancer-associated genes: BRCA1, MLH1, MSH2, and TP53 [Hicks et al., 2011] (Supp. Table S1). The BRCA1 dataset comprises 17 pathogenic and 16 neutral mutations, the MLH1 dataset comprises 39 pathogenic and 21 neutral mutations, the MSH2 dataset comprises 19 pathogenic and 11 neutral mutations, and the TP53 dataset comprises 140 pathogenic and 4 neutral variations. As these mutations have been well characterized, it has been suggested that they be considered a “gold standard” for an independent evaluation of different algorithms employed for the prediction of pathogenic variations [Greenblatt et al., 2008; Tavtigian et al., 2008].
Predictions were made for the above-mentioned datasets using Hansa after training it with the entire HumVar dataset. Predictions by SIFT, PolyPhen-2, and SNAP were made using their web servers with their default options. Table 3 gives the number of TPs and the number of TNs obtained by Hansa, SIFT, PolyPhen-2, and SNAP for the independent dataset. Hansa, in general, gives a higher number of TPs than the other methods, indicating that it predicts disease mutations better. On the other hand, the prediction of neutral mutations by Hansa, as indicated by the number of TNs, is nearly as good as that of the other methods.
Table 3. Comparison of Performance of Hansa with SIFT, PolyPhen-2, and SNAP
|(n = 16) neutral||(37.5)||(88.2)||(43.8)||(58.8)||(31.3)||(94.1)||(37.5)||(76.5)|
|(n = 17) pathogenic||(75)||(60)||(50)||(52.6)||(83.3)||(59.2)||(60)||(56.5)|
|(n = 11) neutral||(36.4)||(94.7)||(36.4)||(89.5)||(45.5)||(89.5)||(36.4)||(94.7)|
|(n = 19) pathogenic||(80)||(72)||(66.7)||(70.8)||(71.4)||(73.9)||(80)||(72)|
|(n = 21) neutral||(52.4)||(97.4)||(47.6)||(100.0)||(52.4)||(71.8)||(61.9)||(94.8)|
|(n = 39) pathogenic||(91.6)||(79.1)||(100)||(78)||(50)||(73.6)||(86.6)||(82.2)|
|(n = 4) neutral||(50.0)||(94.3)||(75.0)||(85.0)||(75.0)||(84.3)||(75.0)||(88.6)|
|(n = 140) pathogenic||(20)||(98.5)||(12.5)||(99.1)||(12)||(99.1)||(15.7)||(99.2)|
It has been argued that the number of homologs and the quality of the MSA are highly important for the prediction of disease/neutral mutations [Chun and Fay, 2009; Karchin et al., 2009]. We, therefore, tested the performance of Hansa when used with the MSAs produced by SIFT and PolyPhen-2 and, similarly, tested the performance of SIFT and PolyPhen-2 when used with Hansa alignments (we could not use SNAP as it does not accept externally supplied MSAs). The results obtained are given in Tables 4–6. The results indicate that Hansa predictions are not strongly affected by the type of MSA used.
Table 4. Comparison of Predictions Made by Hansa, SIFT, and PolyPhen-2 when Used with Hansa MSAs
|(n = 16) neutral||DOM2-30||(37.5)||(88.2)||(50.0)||(58.8)||(31.3)||(100.0)|
|(n = 17) pathogenic||DOM3-62||(75)||(60)||(53.3)||(55.5)||(100)||(60.7)|
|(n = 11) neutral||DOM2-234||(36.4)||(94.7)||(27.3)||(89.5)||(27.3)||(94.7)|
|(n = 19) pathogenic|| ||(80)||(72)||(60)||(68)||(75)||(69.2)|
|(n = 21) neutral||DOM2-19||(52.4)||(97.4)||(66.7)||(89.7)||(4.8)||(100.0)|
|(n = 39) pathogenic||DOM3-158||(91.6)||(79.1)||(77.7)||(83.3)||(100)||(66.1)|
|(n = 4) neutral||DOM2-67||(50.0)||(94.3)||(75.0)||(79.3)||(25)||(92.1)|
|(n = 140) pathogenic|| ||(20)||(98.5)||(9.3)||(99.1)||(8.3)||(97.7)|
Table 5. Comparison of Predictions Made by Hansa and SIFT when Used with SIFT MSAs
|(n = 16) neutral|| ||(31.3)||(94.1)||(37.5)||(94.1)|
|(n = 17) pathogenic|| ||(83.3)||(59.2)||(85.7)||(61.5)|
|(n = 11) neutral|| ||(45.5)||(89.5)||(18.2)||(94.7)|
|(n = 19) pathogenic|| ||(71.4)||(73.9)||(66.6)||(66.6)|
|(n = 21) neutral|| ||(52.4)||(71.8)||(52.4)||(69.2)|
|(n = 39) pathogenic|| ||(50)||(73.6)||(47.8)||(72.9)|
|(n = 4) neutral|| ||(75.0)||(84.3)||(25.0)||(93.6)|
|(n = 140) pathogenic|| ||(12)||(99.1)||(10)||(97.7)|
Table 6. Comparison of Predictions Made by Hansa and PolyPhen-2 when used with PolyPhen-2 MSAs
|(n = 16) neutral|| ||(43.8)||(58.8)||(37.5)||(47.0)|
|(n = 17) pathogenic|| ||(50)||(52.6)||(40)||(44.4)|
|(n = 11) neutral|| ||(36.4)||(89.5)||(9.1)||(94.7)|
|(n = 19) pathogenic|| ||(66.7)||(70.8)||(50)||(64.2)|
|(n = 21) neutral|| ||(47.6)||(100.0)||(42.9)||(100.0)|
|(n = 39) pathogenic|| ||(100)||(78)||(100)||(76.4)|
|(n = 4) neutral|| ||(75.0)||(85.0)||(50.0)||(92.9)|
|(n = 140) pathogenic|| ||(12.5)||(99.1)||(16.6)||(98.5)|
The significantly better performance of Hansa as compared to the other well-known methods can be attributed to the novel nsSNPND feature set, which judiciously combines position-specific features, local protein structural features, and intrinsic properties of the amino acid residues. This novel feature set has given rise to a very good discrimination between pathogenic and neutral mutations in the SVM.
Furthermore, we have also tested Hansa for its stability and generalization by performing the most rigorous LOOCV, which involves using each of the examples in the HumVar dataset in turn as the test set and the remaining examples as the training set. The LOOCV value of 83.2% indicates that Hansa is stable and generalizes well.
Some of the nsSNPND discriminatory features have been used by other methods. For example, SIFT uses the position-specific probability score as the sole discriminatory feature for predicting pathogenic and neutral mutations. However, for calculating this score SIFT uses 13-component DMs [Ng and Henikoff, 2001] that have been calculated using unweighted alignment columns from the BLOCKS database, whereas Hansa uses 20-component DMs [Tian et al., 2007] that have been calculated from MSAs of distantly related proteins with known structures. As these MSAs are structure based, they are more accurate than sequence-based MSAs and hence their columns represent the true distributions of amino acid probabilities. Hence, 20-component DMs are expected to give rise to better estimates of the position-specific probabilities than 13-component DMs.
Tavtigian et al. [2008] suggest that, in addition to the quality of the MSAs, alignment depth is critical for getting an accurate estimation of evolutionary preferences at the mutation sites. However, it is not possible to ensure good alignment depth for every protein that is used for prediction. The number of homologs identified by PSI-BLAST can vary from protein to protein (see Supp. Fig. S1 for the HumVar dataset). Some proteins have a large number of homologs while others have very few. However, for cases lacking a sufficient number of homologs, a good estimate of position-specific probabilities can be achieved by using DMs [Sjölander et al., 1996] as priors along with the observed distribution of amino acid residues at alignment columns. It has been shown in earlier studies [Sjölander et al., 1996] that prior information encapsulated in the form of n-component DMs gives rise to accurate estimates of the position-specific probabilities of amino acids. Hence, we believe that insufficient alignment depth is not an impeding factor for accurate predictions by Hansa.
In addition to the position-specific probability score, we have also used Gribskov's profile score to represent position-specific information at the missense mutation sites. Gribskov's profile score was initially developed for predicting the family of uncharacterized sequences [Gribskov et al., 1987] and has also been used for guiding motif-site detection in PROSITE [Sigrist et al., 2002]. To the best of our knowledge, Gribskov's score G(a,b) has never been investigated for its value in predicting pathogenic mutations.
Solvent accessibilities and secondary structures of amino acid residues give useful insight into the structure and function of proteins and are widely used in the prediction of pathogenic mutations [Bromberg and Rost, 2007; Chasman and Adams, 2001; Ferrer-Costa et al., 2004; Krishnan and Westhead, 2003; Saunders and Baker, 2002; Sunyaev et al., 2001; Vitkup et al., 2003; Yue et al., 2005]. A majority of pathogenic mutations are known to occur at solvent-inaccessible (buried) positions; hence, substitutions by incompatible amino acids at such positions would destabilize protein structure/function [Gong and Blundell, 2010; Sunyaev et al., 2000; Wang and Moult, 2001].
In addition to the above-mentioned position-specific and structure-based features, the nsSNPND feature set also includes two intrinsic amino acid features, viz., the transfer free energy values and the BLOSUM62 substitution scores. The correlation between pairwise amino acid substitution scores and pathogenic mutations is well known [Balasubramanian et al., 2005; Cargill et al., 1999]. Transfer free energy values represent one of the important properties of amino acids [Janin, 1979]. While testing various combinations of features, we found that prediction accuracies increase when the two amino acid based features are combined with the position-specific and the protein structural features (Supp. Table S2).
Currently, Hansa has been trained on all the mutations available in the HumVar dataset and is available as a web server at http://hansa.cdfd.org.in:8080.
The authors gratefully acknowledge Stephanie Hicks and Marek Kimmel for providing the independent datasets of missense mutations in cancer genes and Ivan Adzhubei for providing a version of PolyPhen-2 that accepts user specified MSAs. The authors also thank Anil Kumar for helping in the development of html pages and PHP scripts for the Hansa web server. HAN gratefully acknowledges the core funding from CDFD.