• Bioinformatics;
• Decision tree;
• Enzymes;
• Genetic algorithm;
• Neural network


Cytochrome P450 (CYP) is an important family of drug-metabolizing enzymes. Different CYPs often have different substrate preferences, and a single drug molecule may be preferentially metabolized by one or more CYP enzymes. Classifying and predicting the substrate specificity of CYP enzymes is therefore important for understanding drug metabolism and may help guide the development of new drugs. In this study, we used three different machine learning methods to classify CYP substrates and predict CYP-substrate specificity based solely on the structural and physicochemical properties of the substrates. We first built a simple decision tree model to classify substrates of four CYP enzymes (1A2, 2C9, 2D6 and 3A4) with more than 78 % classification accuracy. We then built a single-label eight-class model to classify substrates of eight CYP enzymes and a multilabel five-class model to classify substrates that can be metabolized by more than one CYP enzyme. Prediction accuracies above 90 % and above 80 % were achieved for the single-label and multilabel models, respectively. The main improvement of our models over existing ones is the automated and unbiased selection of descriptors by genetic algorithms, which makes our methods applicable to larger data sets and to larger numbers of CYP enzymes.
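
To illustrate the general idea of descriptor selection by a genetic algorithm, the following is a minimal sketch and not the implementation used in this study. Everything in it is an assumption made for illustration: the toy data set (two informative "descriptors" plus four noise columns), the leave-one-out nearest-centroid classifier used as the fitness function (swapped in here for the decision tree and neural network models named above, to keep the example dependency-free), the function names `make_toy_data`, `nearest_centroid_accuracy` and `ga_select`, and all GA parameters.

```python
import random


def make_toy_data(n_per_class=20, seed=1):
    """Hypothetical data: 2 informative columns followed by 4 pure-noise columns."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in (("A", 0.0), ("B", 3.0)):
        for _ in range(n_per_class):
            informative = [rng.gauss(centre, 1.0) for _ in range(2)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
            X.append(informative + noise)
            y.append(label)
    return X, y


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked columns."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty descriptor subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        # Accumulate per-class sums over every sample except the held-out one.
        sums, counts = {}, {}
        for k in range(len(X)):
            if k == i:
                continue
            s = sums.setdefault(y[k], [0.0] * len(cols))
            for c, j in enumerate(cols):
                s[c] += X[k][j]
            counts[y[k]] = counts.get(y[k], 0) + 1
        # Predict the class whose centroid is closest (squared Euclidean distance).
        best_label, best_dist = None, float("inf")
        for label, s in sums.items():
            d = sum((X[i][j] - s[c] / counts[label]) ** 2 for c, j in enumerate(cols))
            if d < best_dist:
                best_label, best_dist = label, d
        correct += best_label == y[i]
    return correct / len(X)


def ga_select(X, y, n_generations=15, pop_size=20, mutation_rate=0.1, seed=0):
    """Evolve boolean descriptor masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n_features = len(X[0])  # assumed to be at least 2
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:  # each distinct mask is evaluated only once
            cache[key] = nearest_centroid_accuracy(X, y, mask)
        return cache[key]

    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        children = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(children) < pop_size:
            # Tournament selection of two parents, each the best of 3 random masks.
            p1, p2 = (max(rng.sample(ranked, 3), key=fitness) for _ in range(2))
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)


X, y = make_toy_data()
mask, accuracy = ga_select(X, y)
print("selected descriptor mask:", mask)
print("leave-one-out accuracy:", accuracy)
```

Because the two best masks are carried over unchanged in every generation, the best fitness in the population never decreases. In a realistic setting the fitness function would be the cross-validated accuracy of whichever classifier is actually being trained, and the columns would be computed molecular descriptors.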