Network-based Auto-probit Modeling for Protein Function Prediction
Article first published online: 6 DEC 2010
© 2010, The International Biometric Society
Volume 67, Issue 3, pages 958–966, September 2011
How to Cite
Jiang, X., Gold, D. and Kolaczyk, E. D. (2011), Network-based Auto-probit Modeling for Protein Function Prediction. Biometrics, 67: 958–966. doi: 10.1111/j.1541-0420.2010.01519.x
- Issue published online: 14 SEP 2011
- Received January 2009. Revised July 2010. Accepted September 2010.
- Bayesian hierarchical model
- Gene ontology annotation uncertainty
- MCMC algorithm
- Protein function prediction
Summary
Predicting the functional roles of proteins based on various genome-wide data, such as protein–protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latter allows protein–protein association network topologies, either binary or weighted, to be incorporated easily when modeling protein functional similarity. We use this framework to predict protein functions, where functions are defined as terms in the Gene Ontology (GO) database, a popular, rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. The performance of our method is evaluated and compared with standard algorithms on weighted yeast protein–protein association networks extracted from a recently developed integrative database, the Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.
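To illustrate how a latent conditional autoregressive (CAR) Gaussian process can link a protein's label to its network neighbors, the following minimal sketch computes a probit probability for one protein given the latent values of its neighbors. It assumes a standard proper CAR parameterization with precision matrix tau * (D - rho * W), where W holds the edge weights and D the weighted degrees. That parameterization, the function names, and the parameters `mu`, `rho`, and `tau` are illustrative assumptions and are not taken from the article, whose exact model specification and MCMC scheme are not given in this summary.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def conditional_probit_prob(weights, neighbor_z, mu=0.0, rho=0.9, tau=1.0):
    """P(y_i = 1 | latent values of neighbors) under an assumed proper CAR prior.

    weights    : edge weights w_ij between protein i and its neighbors
    neighbor_z : latent Gaussian values z_j of those neighbors
    mu, rho, tau : hypothetical mean, spatial dependence, and precision
                   parameters (assumptions, not values from the article)
    """
    if len(weights) != len(neighbor_z):
        raise ValueError("weights and neighbor_z must have the same length")
    degree = sum(weights)
    if degree <= 0:
        raise ValueError("protein must have positive total edge weight")
    # Under the assumed proper CAR prior, the conditional distribution of z_i is
    # Gaussian with a weighted-average mean and variance 1 / (tau * degree).
    cond_mean = mu + rho * sum(w * (z - mu) for w, z in zip(weights, neighbor_z)) / degree
    cond_sd = 1.0 / math.sqrt(tau * degree)
    # Probit link: y_i = 1 exactly when the latent z_i is positive.
    return normal_cdf(cond_mean / cond_sd)


# Example: a protein with two neighbors, one strongly and one weakly associated.
p = conditional_probit_prob(weights=[0.9, 0.4], neighbor_z=[1.2, -0.3])
```

In this sketch a strongly weighted neighbor with a large positive latent value pulls the probability above one half, which is the sense in which network topology informs the functional prediction. A full treatment would sample the latent values and hyperparameters jointly rather than conditioning on known neighbor values.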