Best Practices for Evaluating Mutation Prediction Methods


  • Peter K. Rogan

    Corresponding author
    1. Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, Ontario, Canada
    2. Department of Computer Science, Faculty of Science, Western University, Ontario, Canada
    • Correspondence to: Peter K. Rogan, Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, ON N6A 2C1, Canada. E-mail:

  • Guang Yong Zou

    1. Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, Western University, Ontario, Canada

Discovery of novel and rare variations in the genome has driven the development of in silico variant prediction and analyses. The diverse functions encoded by these variants have led to a proliferation of algorithms and methods designed to predict distinct types of mutations. Inevitable comparisons are made between these approaches for assessing the same type of sequence changes, based on the scores or values produced by bioinformatic methods. The multitude of available methods leads to considerable confusion for biomedical scientists who simply want reliable tools to analyze their data. This is because the approaches used to assess pathogenicity, conservation, activity, or expression levels are heterogeneous, whereas at the same time, the underlying validation data themselves vary in precision, accuracy, and novelty.

Vihinen (2013) has suggested a comprehensive set of publication guidelines for both developers and consumers of prediction methods. These guidelines cover method description, choice of data, performance, and implementation. Users are advised to consider the appropriateness of the method, to perform analyses with multiple approaches, and to provide statistical measures of performance along with relevant citations. The guidelines are sufficiently general for any method, but contain criteria tailored for machine-learning approaches. We concur with these recommendations and propose additional best practices for consideration in evaluating bioinformatic mutation predictors.

The Critical Assessment of Genome Interpretation (CAGI) is a forum that attempts to “objectively assess computational methods for predicting the phenotypic impacts of genomic variation” through development of objective measures of performance for specific variants. Developers predict molecular, cellular, or organismal phenotypes of prescribed sets of unpublished, functionally verified gene variants using different algorithms and software. Results have been compared with well-curated, small experimental datasets and discussed at a conference [Calloway, 2010]. Such comparisons can reveal the limitations and strengths of different experimental and computational approaches, one of Vihinen's key recommendations.

The acute need for bioinformatic tools to interpret exome- or genome-scale sequence data has led to their early adoption [Robinson et al., 2011], despite the fact that each has different capabilities for mutation prediction, and may have been validated with a different set of proven mutations. These validation sets may not be representative of the range of mutations that are analyzed with these tools. For example, insight into cellular functions and biological pathways impacted by predicted mutations often depends on overrepresentation of mutated genes present in multiple samples to infer likely dysregulated pathways [Subramanian et al., 2005]. When based solely on exome data, this type of analysis could be susceptible to sampling errors, since only a subset of the variants present in the genome are actually considered in deciding which genes contain aberrations. Another common approach is to prioritize potential disease variants by allele frequency [Yandell et al., 2011], which biases against detecting common variants that may be subject to adaptive selection and may be modifiers of other, more penetrant mutations. These methods make assumptions about disease mechanisms that can render conclusions based on predictive tools either incomplete or inaccurate.

Published comparisons between different methods that assess the same type of mutation rarely address the underlying differences that explain why they do not produce the same results. Some analytic methods produce agnostic continuous distributions of unitless values [Shapiro and Senapathy, 1987], whereas others have empirically derived thresholds from specific validation sets [Desmet et al., 2009], and yet others are grounded in thermodynamics or other measurable criteria [Rogan et al., 1998]. Surprisingly, the underlying distributions on which these scores are based are often not available. These values are used to compute sensitivity, specificity, positive predictive value, accuracy, and the like. If the underlying metrics are not determined on a common scale, are not normalized, or have an unknown or poorly characterized underlying distribution, the performance of each may not be comparable for different mutation validation sets. If possible, scores should be transformed so that either comparisons are on the same scale or the data fulfill assumptions of the same statistical inference procedure. Only then can the confidence intervals on these values, as Dr. Vihinen suggests, determine if wild-type sequences significantly differ from suspected pathogenic gene variants.
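
As an illustration of the preceding point, the following Python sketch places raw scores on a common percentile-rank scale and attaches a confidence interval to sensitivity and specificity. It is a minimal example under stated assumptions, not a method prescribed here or by Vihinen (2013): the function names, the choice of percentile ranks as the transformation, and the Wilson score interval are all illustrative choices, and higher scores are assumed to indicate pathogenicity.

```python
import bisect
import math


def percentile_ranks(scores):
    """Map raw scores to (0, 1] as the fraction of scores <= each value.

    Assumed transformation: one simple way to compare tools whose scores
    have different units or ranges. Ties receive the same rank.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect.bisect_right(ordered, s) / n for s in scores]


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def confusion_counts(scores, is_pathogenic, threshold):
    """Count calls at a fixed threshold; a score >= threshold is called pathogenic."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, is_pathogenic):
        called = score >= threshold
        if label and called:
            tp += 1
        elif label:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity_specificity(scores, is_pathogenic, threshold):
    """Return (estimate, (lower, upper)) for sensitivity and for specificity."""
    tp, fn, tn, fp = confusion_counts(scores, is_pathogenic, threshold)
    sens = (tp / (tp + fn), wilson_interval(tp, tp + fn))
    spec = (tn / (tn + fp), wilson_interval(tn, tn + fp))
    return sens, spec
```

With small validation sets the intervals are wide, so overlapping intervals for two tools would not support a claim that one outperforms the other.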

We agree that datasets need to be sufficiently comprehensive so that they represent the range of results generated by a bioinformatic tool. Aside from illustrating a particular analytical approach, application of multiple methods to a single or few datasets may be of limited value in studying the performance of different methods. From a statistical viewpoint, a single dataset can only be regarded as one realization of unknown underlying phenomena. Monte Carlo simulations should be used to evaluate the performance of different methods under a variety of conditions and assumptions, as has been done by biostatisticians [Burton et al., 2006]. However, if a large number of studies have used a particular software tool appropriately, curation of a particular validation set may not be more robust than a meta-analysis of all of the peer-reviewed studies that have cited the tool.
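
The idea that a single dataset is only one realization can be sketched with a small Monte Carlo simulation. The example below is hypothetical and deliberately simple: it assumes that scores for pathogenic variants follow a normal distribution with unit variance, and the function names, effect size, and threshold are illustrative rather than taken from any published tool. It repeatedly draws validation sets of a given size and records the sensitivity estimated from each.

```python
import random
import statistics


def simulate_sensitivity(n_pathogenic, effect_size, threshold,
                         n_replicates=1000, seed=0):
    """Estimate sensitivity in many simulated validation sets.

    Assumption: pathogenic-variant scores ~ Normal(effect_size, 1).
    Each replicate is one hypothetical validation set of n_pathogenic variants.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    estimates = []
    for _ in range(n_replicates):
        scores = [rng.gauss(effect_size, 1.0) for _ in range(n_pathogenic)]
        detected = sum(score >= threshold for score in scores)
        estimates.append(detected / n_pathogenic)
    return estimates


def summarize(estimates):
    """Mean and empirical 2.5th/97.5th percentiles of the simulated estimates."""
    ordered = sorted(estimates)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "p2.5": ordered[int(0.025 * (n - 1))],
        "p97.5": ordered[int(0.975 * (n - 1))],
    }


if __name__ == "__main__":
    # Compare how much the estimate varies for small and large validation sets.
    for size in (20, 200):
        print(size, summarize(simulate_sensitivity(size, 2.0, 1.0)))
```

Under these assumptions, the spread of the estimates narrows as the validation set grows, which is one way to judge whether an observed difference between two tools could arise from sampling alone. Other conditions (different score distributions, class imbalance, label errors) could be explored by changing the sampling step.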

Known limitations in the precision of the validation data themselves may impact the sensitivity and specificity of bioinformatic comparisons. Examples include indeterminate structural and biophysical effects of protein coding mutations, paucity of validation data on deep intronic splicing mutations, unrecognized exonic variants that alter mRNA splicing, and narrowly defined studies of specific promoter mutations altering transcriptional regulation that may not be generalizable to other genes under the same regulatory control. These limitations also extend to sequencing errors, including misalignment and incorrect mutation detection [Neuman et al., 2013], which can affect both the sensitivity and specificity of bioinformatics tools. The use of large validation datasets, if available, can potentially mitigate the impact of individual data errors.

In the initial report of a bioinformatic method, it may not be feasible to fully address all of Dr. Vihinen's recommendations in the body of the article. Several journals that dedicate pages to publish descriptions of new bioinformatics resources impose severe limits on the lengths of these articles. New or significant modifications to existing bioinformatic resources, testing with multiple sets of validating mutations, and reproducibility studies should themselves merit separate publication (or inclusion as peer-reviewed supplementary documents), so that potential consumers of these tools can make informed decisions about the most appropriate situations for their use. These reports can provide insight into the performance of these tools, with discussion of the limitations and strengths for the particular sets of mutations and genes being evaluated.

The proposed guidelines will be essential for developing a set of minimum community standards that describe these tools along with an unbiased set of performance metrics. Reporting guidelines could emulate models such as the CONSORT statement for clinical trials [Schulz et al., 2010], STROBE for epidemiology [Gallo et al., 2011], STREGA for genetic association studies [Little et al., 2009], MIAME for microarray analysis [Brazma et al., 2001], or some combination of these. A common recommendation of all of these guidelines is that study design, including methods of data analyses, be stipulated prior to undertaking the study. This approach, if implemented for the evaluation of bioinformatic mutation predictors, would require that thresholds for scoring mutations be established prior to validation. Machine-readable XML versions could be produced to ensure portability of study designs and results for use with other predictive resources [Seringhaus and Gerstein, 2007]. Our increasing reliance on bioinformatic tools to predict the significance of and prioritize variants demands that we characterize these resources rigorously and use evaluation criteria that transcend any individual method or approach.
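
A pre-specified evaluation design of the kind described above could be recorded in a small XML document before any validation is run. The Python sketch below is purely illustrative: the element and attribute names (`evaluationDesign`, `tool`, `threshold`, `validationSet`), the tool name, and the values are hypothetical and do not correspond to any existing schema or community standard.

```python
import xml.etree.ElementTree as ET


def build_design(tool, version, threshold, direction, validation_set):
    """Build a hypothetical XML record of a pre-specified evaluation design."""
    root = ET.Element("evaluationDesign")
    ET.SubElement(root, "tool", name=tool, version=version)
    # direction states how the threshold is applied, e.g. "ge" for score >= threshold
    ET.SubElement(root, "threshold", direction=direction).text = repr(threshold)
    ET.SubElement(root, "validationSet").text = validation_set
    return root


def read_threshold(root):
    """Recover the threshold value and its direction from the record."""
    node = root.find("threshold")
    return float(node.text), node.get("direction")


# Serialize the design to text, as it might be deposited before validation.
design = build_design("example-predictor", "1.0", 0.5, "ge", "example-set-A")
xml_text = ET.tostring(design, encoding="unicode")

# A later analysis reads the threshold from the record instead of tuning it.
threshold, direction = read_threshold(ET.fromstring(xml_text))
```

Fixing the threshold in a record that predates the validation data makes it harder to adjust the cutoff, even unintentionally, to favor a particular result.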


Peter Rogan is a Canada Research Chair in Genome Bioinformatics and a founder of Cytognomix Inc. The issues raised in this paper are of general interest to the community of Human Mutation readers. There is no intent to endorse the company or its products.