• protein structure;
• machine learning;
• vibrational spectroscopy;
• α-helix;
• β-sheet


Knowledge of the fold class of a protein is valuable because fold class gives an indication of protein function and evolution. Fold class can be accurately determined from a crystal structure or NMR structure, but these methods are expensive, time-consuming, and not applicable to all proteins. In contrast, vibrational spectra [infrared, Raman, or Raman optical activity (ROA)] can be obtained rapidly for a wide range of biological molecules under diverse experimental and physiological conditions. Here, we show that the fold class of a protein can be determined from Raman or ROA spectra by converting a spectrum into bins of 10 cm−1 width and applying the random forest machine learning algorithm. Spectral data between 605 and 1785 cm−1 were analyzed, as well as the amide I, II, and III regions in isolation and in combination. ROA amide II and III data gave the best performance, with 33 of 44 proteins assigned to the correct one of the four top-level Structural Classification of Proteins (SCOP) fold classes (all α, all β, α and β, and disordered). The method also shows which spectral regions are most valuable in assigning fold class.
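The binning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bin_spectrum`, the choice of averaging the intensities within each bin, and the use of 0.0 for empty bins are all assumptions made for illustration. Only the 10 cm−1 bin width and the 605 to 1785 cm−1 range come from the text.

```python
from statistics import mean


def bin_spectrum(wavenumbers, intensities, start=605.0, stop=1785.0, width=10.0):
    """Reduce a spectrum to one value per fixed-width wavenumber bin.

    Hypothetical helper: averaging within a bin and filling empty bins
    with 0.0 are illustrative assumptions, not details from the paper.
    """
    n_bins = int(round((stop - start) / width))
    buckets = [[] for _ in range(n_bins)]
    for w, y in zip(wavenumbers, intensities):
        # Keep only points inside the half-open range [start, stop).
        if start <= w < stop:
            idx = min(int((w - start) // width), n_bins - 1)
            buckets[idx].append(y)
    return [mean(b) if b else 0.0 for b in buckets]


# Tiny made-up spectrum: two points fall in the first bin, one in the
# second, one in the last; the points at 600 and 1785 lie outside the range.
features = bin_spectrum(
    [605, 607, 615, 1784, 1785, 600],
    [1.0, 3.0, 5.0, 7.0, 100.0, 100.0],
)
print(len(features))  # 118 bins for 605 to 1785 cm-1 at 10 cm-1 width
```

Each protein's spectrum then becomes a fixed-length feature vector (118 values for the full range), which is the kind of input a random forest implementation, for example scikit-learn's `RandomForestClassifier`, expects, with the SCOP fold class as the label.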