Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets

Authors

  • Chin-Sheng Yu,

    1. Department of Biological Science and Technology, National Chiao Tung University, Hsin Chu, Taiwan
  • Jung-Ying Wang,

    1. Department of Computer Science, National Taiwan University, Taipei, Taiwan
  • Jinn-Moon Yang,

    1. Department of Biological Science and Technology, National Chiao Tung University, Hsin Chu, Taiwan
  • Ping-Chiang Lyu,

    1. Department of Life Sciences, National Tsing Hua University, Hsin Chu, Taiwan
  • Chih-Jen Lin,

    Corresponding author
    1. Department of Computer Science, National Taiwan University, Taipei, Taiwan
    • Jenn-Kang Hwang, Department of Biological Science and Technology, National Chiao Tung University, Hsin Chu 300, Taiwan or Chih-Jen Lin, Department of Computer Science, National Taiwan University, Taipei, Taiwan
  • Jenn-Kang Hwang

    Corresponding author
    1. Department of Biological Science and Technology, National Chiao Tung University, Hsin Chu, Taiwan

Abstract

In the coarse-grained fold assignment of major protein classes, such as all-α, all-β, α + β, and α/β proteins, one can easily achieve high prediction accuracy from primary amino acid sequences. However, the fine-grained assignment of folds, such as those defined in the Structural Classification of Proteins (SCOP) database, presents a challenge because of the larger number of folds available. A recent study yielded a reasonable prediction accuracy of 56.0% on an independent set of the 27 most populated folds. In this communication, we apply the support vector machine (SVM) method to fine-grained fold prediction, using a combination of protein descriptors based on properties derived from n-peptide composition together with jury voting, and achieve an overall prediction accuracy of 69.6% on the same independent set, significantly higher than the previous result. On 10-fold cross-validation, we obtained a prediction accuracy of 65.3%. Our results show that SVM coupled with suitable global sequence-coding schemes can significantly improve fine-grained fold prediction. Our approach should be useful in structure prediction and modeling. Proteins 2003;50:531–536. © 2003 Wiley-Liss, Inc.
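
The abstract does not spell out the coding schemes or the voting rule, so the following is only a minimal sketch of the two general ideas it names: turning a sequence into an n-peptide composition vector, and combining the fold labels predicted by several classifiers through a jury (majority) vote. The function names, the plain 20-letter alphabet, and the tie-breaking rule are assumptions made for illustration and are not taken from the paper; the SVM training step itself is omitted.

```python
from collections import Counter
from itertools import product

# Assumption: the standard 20-letter amino acid alphabet. The paper's
# "generalized" schemes may use other (e.g. property-based) alphabets.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def npeptide_composition(sequence, n=2, alphabet=AMINO_ACIDS):
    """Return the relative frequency of every length-n peptide.

    The result has len(alphabet) ** n entries, ordered as produced by
    itertools.product over the alphabet. Windows that contain a
    character outside the alphabet are skipped.
    """
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    windows = (sequence[i:i + n] for i in range(len(sequence) - n + 1))
    counts = Counter(w for w in windows if all(c in alphabet for c in w))
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in keys]


def jury_vote(predictions):
    """Return the label predicted most often.

    Assumption: ties go to the tied label that appears first in the
    list, i.e. to the earliest classifier.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
```

In this sketch, each classifier would be trained on a different feature vector (for example n = 1 and n = 2 compositions), and `jury_vote` would be applied to their predicted fold labels for a given protein.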
