FiberID—A technique to identify fibrous protein subclasses



Fibrous proteins such as collagen, silk, and elastin play critical biological roles, yet few projects have used computational techniques to predict either their class or their structure. In this article, we present FiberID, a simple yet effective method for identifying and distinguishing three fibrous protein subclasses from their primary sequences. FiberID combines amino acid composition with fast Fourier measurements and classifies fibrous proteins belonging to these subclasses with high accuracy using two standard machine learning techniques (decision trees and naïve Bayes classifiers). After presenting our results, we highlight several fibrous sequences that FiberID regularly misclassifies, as sequences of potential interest for further study. Finally, we analyze the decision trees developed by FiberID for potential insights regarding the structure of these proteins. Proteins 2007. © 2006 Wiley-Liss, Inc.
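As a rough illustration of the two feature families named in the abstract, the sketch below computes an amino acid composition vector and a Fourier power spectrum from a primary sequence. It is not FiberID's implementation: the function names, the binary residue-indicator encoding, and the mean subtraction are assumptions made for this example, and a plain O(N²) discrete Fourier transform stands in for a fast Fourier transform (it yields the same coefficients). The classifier stage (decision trees, naïve Bayes) is omitted.

```python
import cmath
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard amino acid in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def power_spectrum(seq, residue):
    """Power spectrum of a binary indicator signal for one residue.

    Assumed encoding (hypothetical, for illustration): position t is 1.0
    if seq[t] == residue, else 0.0. The mean is subtracted so the
    spectrum reflects periodicity rather than overall abundance.
    Returns {k: power} for frequency indices k = 1 .. N // 2.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    signal = [1.0 if c == residue else 0.0 for c in seq]
    mean = sum(signal) / n
    signal = [v - mean for v in signal]
    spectrum = {}
    for k in range(1, n // 2 + 1):
        # Direct DFT coefficient at frequency index k.
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(signal)
        )
        spectrum[k] = abs(coeff) ** 2 / n
    return spectrum


def dominant_period(seq, residue):
    """Period (in residues) of the strongest spectral peak."""
    spectrum = power_spectrum(seq, residue)
    k = max(spectrum, key=spectrum.get)
    return len(seq) / k


if __name__ == "__main__":
    # Idealized collagen-like repeat: glycine at every third position.
    toy = "GPP" * 10
    print(composition(toy)["G"])
    print(dominant_period(toy, "G"))
```

On an idealized collagen-like repeat with glycine at every third position, the strongest peak of the glycine indicator falls at a period of three residues. Periodicity of this kind is something composition alone cannot capture, which is one plausible reason to pair the two feature types; the features FiberID actually uses are described in the body of the article.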