Keywords:

  • Auto-probit;
  • Bayesian hierarchical model;
  • Gene ontology annotation uncertainty;
  • MCMC algorithm;
  • Protein function prediction;
  • STRING

Summary:

Predicting the functional roles of proteins from genome-wide data, such as protein–protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latent process allows protein–protein association network topologies, either binary or weighted, to be incorporated easily when modeling protein functional similarity. We use this framework to predict protein functions, where functions are defined as terms in the Gene Ontology (GO) database, a popular, rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. We evaluate the performance of our method and compare it with standard algorithms on weighted yeast protein–protein association networks extracted from a recently developed integrative database called Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.
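To make the ingredients named in the summary concrete, the following is a minimal generative sketch, not the authors' implementation and not their MCMC algorithm. It assumes a proper conditional autoregressive (CAR) prior with precision tau * (D - rho * W) on a toy weighted network, a probit link from the latent field to the true labels, and a simple label-masking step that stands in for false negatives. The function names, the parameters `rho`, `tau`, and `miss_rate`, and the toy weights are all hypothetical choices made for illustration.

```python
import math
import numpy as np

def car_precision(W, rho=0.9, tau=1.0):
    """Precision matrix tau * (D - rho * W) of a proper CAR prior (assumed form).

    W is a symmetric, non-negative weight matrix with zero diagonal;
    D is the diagonal matrix of weighted degrees. For |rho| < 1 and
    positive degrees the result is positive definite.
    """
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    return tau * (D - rho * W)

def sample_latent_field(Q, rng):
    """Draw z ~ N(0, Q^{-1}) using the Cholesky factor of the precision."""
    L = np.linalg.cholesky(Q)            # Q = L @ L.T
    e = rng.standard_normal(Q.shape[0])
    return np.linalg.solve(L.T, e)       # covariance of z is Q^{-1}

def probit_probability(z):
    """Standard normal CDF applied elementwise: P(y = 1 | z)."""
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])

def mask_positives(y_true, miss_rate, rng):
    """Hypothetical false-negative mechanism: each true positive is
    recorded as negative with probability miss_rate; negatives stay negative."""
    keep = rng.random(y_true.shape[0]) >= miss_rate
    return y_true * keep.astype(int)

# Toy weighted association network over four proteins (made-up weights).
W = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.4, 0.0],
              [0.0, 0.4, 0.0, 0.7],
              [0.0, 0.0, 0.7, 0.0]])

rng = np.random.default_rng(0)
Q = car_precision(W)
z = sample_latent_field(Q, rng)          # latent network-smoothed field
p = probit_probability(z)                # probability each protein has the function
y_true = rng.binomial(1, p)              # true binary labels
y_obs = mask_positives(y_true, miss_rate=0.3, rng=rng)  # labels as recorded
```

In this sketch, strongly connected proteins receive correlated latent values and therefore similar label probabilities, and `y_obs` can only understate `y_true`, which is the kind of asymmetry an extended model would need to account for.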