Statistical Analysis of Missense Mutation Classifiers


Correspondence to: Marek Kimmel, Rice University, Department of Statistics, MS 138, 6100 Main St. Houston, TX 77005. E-Mail:

A recent article [Acharya and Nagarajaram, 2012] describes a new method called Hansa, which classifies missense mutations into neutral and deleterious categories. However, the authors do not provide sufficient details about their algorithm, which raises concerns about the appropriateness and application of the statistical methods used to compare Hansa with existing algorithms.

Acharya and Nagarajaram (2012) state that their method outperforms other known methods, such as PolyPhen-2 [Adzhubei et al., 2010] and SIFT [Ng and Henikoff, 2006], by comparing the receiver operating characteristics (ROCs) of Hansa to the ROCs of the other algorithms. Their Table 2 makes a direct comparison of the ROCs on a benchmark data set called HumVar, originally described in Capriotti et al. (2006) and employed in Adzhubei et al. (2010), by comparing the true positive rates (TPRs) of the algorithms at a fixed false positive rate (FPR). ROCs require a probability or continuous score associated with each prediction, so that TPRs and FPRs can be computed as the discrimination threshold is varied [Pepe, 2003]. For example, the ROC of PolyPhen-2 was based on the naïve Bayes probability provided by the algorithm itself, and the ROC of SIFT was based on the SIFT score [Adzhubei et al., 2010]. As described in the publication, Hansa is based on a support vector machine (SVM) method, which uses a set of 10 discriminatory features to classify missense mutations as neutral or deleterious. SVMs are nonprobabilistic classifiers [Hastie et al., 2009]; consistent with this, Hansa provides no probability or continuous score with each prediction, and thus an ROC analysis does not seem obviously feasible for this algorithm. The publication does not mention what continuous score or probability was used to calculate the TPRs of Hansa at a fixed FPR. It is therefore unclear how the authors obtained various TPRs for a given FPR, because no varying threshold is defined.
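To illustrate why a continuous score is needed, the following minimal sketch sweeps a threshold over a set of scores and records the resulting (FPR, TPR) pairs. The scores, labels, and function name are hypothetical and are not taken from any of the cited algorithms. With continuous scores the sweep traces out a curve; with binary calls only one nontrivial operating point is available.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    labels: 1 = deleterious (positive), 0 = neutral (negative).
    A mutation is called deleterious when its score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Threshold above every score: nothing is called deleterious.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: six mutations with known status.
true_labels = [1, 1, 0, 1, 0, 0]
continuous_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # e.g. a probability-like score
hard_calls = [1, 1, 1, 0, 0, 0]                     # binary predictions only

curve = roc_points(continuous_scores, true_labels)
single_point = roc_points(hard_calls, true_labels)

print("continuous score:", curve)
print("binary calls:    ", single_point)
```

In this toy example the binary calls yield only the trivial corners (0, 0) and (1, 1) plus one operating point, so the TPR at an arbitrary fixed FPR cannot be read off without further assumptions.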

We compared Hansa with other algorithms using the independent data set of n = 267 mutations from cancer-associated genes [Hicks et al., 2011], which Acharya and Nagarajaram (2012) use as a validation data set independent of the HumVar data on which Hansa was trained. We originally used these well-characterized data to compare the TPRs and FPRs of several algorithms using their native protein sequence alignments and to evaluate the impact on the predictions when the algorithms were supplied with other alignments. Because Hansa does not provide a probability or continuous score associated with each prediction, we could not provide ROC curves and could only calculate the TPR and FPR of each algorithm. Hansa seems to perform comparably to the other algorithms (Fig. 1).
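For binary predictions, the TPR and FPR reduce to simple ratios of confusion-matrix counts. The sketch below shows the calculation on hypothetical calls; the data are invented for illustration and are not the 267 mutations analyzed here.

```python
def tpr_fpr(predicted, actual):
    """Compute (TPR, FPR) from binary calls; 1 = deleterious, 0 = neutral."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    # TPR = TP / (TP + FN); FPR = FP / (FP + TN)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical predictions and known status for six mutations.
predicted_calls = [1, 1, 0, 0, 1, 0]
known_status = [1, 1, 1, 0, 0, 0]

tpr, fpr = tpr_fpr(predicted_calls, known_status)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```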

Figure 1.

Predictions from four algorithms were taken from dbNSFP [Liu et al., 2011]: LRT (Likelihood Ratio Test), Mutation Taster, PolyPhen-2, and SIFT. Predictions from the remaining two algorithms, Hansa [Acharya and Nagarajaram, 2012] and Xvar [Reva et al., 2011], were produced on their respective web pages.

In addition, to compare the improvement in TPR of Hansa over the other algorithms, the authors inappropriately performed a paired t-test. Because they are comparing proportions, it would be preferable to use, for example, a test for a difference in proportions with a correction for multiple testing. Furthermore, to measure the performance of the SVM, the authors state that they use n-fold cross-validation and leave-one-out cross-validation (LOOCV) to “assess the generalization and stability” of their method. Unfortunately, they report neither the parameter estimates of the SVM nor the n-fold cross-validation error. They report only a LOOCV error, which makes it difficult to assess the validity of this analysis.
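One possible form of the suggested analysis is a two-sided z-test for a difference between two proportions, with a Bonferroni correction across the pairwise comparisons. The sketch below is illustrative only: the counts and algorithm labels are hypothetical, the pooled-variance z-test treats the two proportions as coming from independent samples (an assumption that would need checking for any real comparison), and Bonferroni is only one of several available corrections.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two independent proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: (true positives, number of deleterious mutations).
new_method = (90, 100)
others = {
    "algorithm A": (80, 100),
    "algorithm B": (85, 100),
    "algorithm C": (90, 100),
}

raw = [two_proportion_z_test(*new_method, *counts)[1] for counts in others.values()]
adjusted = bonferroni(raw)

for name, p_raw, p_adj in zip(others, raw, adjusted):
    print(f"{name}: raw p = {p_raw:.3f}, Bonferroni-adjusted p = {p_adj:.3f}")
```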