The Development of Novel Chemical Fragment-Based Descriptors Using Frequent Common Subgraph Mining Approach and Their Application in QSAR Modeling

Authors

  • Raed Khashan,

    Corresponding author
    1. Department of Pharmaceutical Sciences, College of Clinical Pharmacy, King Faisal University, Al-Ahsa, KSA 31982; phone: +966-13-581-7175; fax: +966-13-581-7174
  • Weifan Zheng,

    1. Biomanufacturing Research Institute & Technology Enterprise, NCCU, Durham, NC 27707, USA
  • Alexander Tropsha

    1. Laboratory for Molecular Modeling and Carolina Center for Exploratory Cheminformatics Research, School of Pharmacy, UNC, Chapel Hill, NC 27599, USA

Abstract

We present a novel approach to generating fragment-based molecular descriptors. Molecules are represented as labeled, undirected chemical graphs. Fast Frequent Subgraph Mining (FFSM) is used to find chemical fragments (subgraphs) that occur in at least a specified fraction of the molecules in a dataset. The collection of frequent subgraphs (FSG) forms a set of dataset-specific descriptors whose value for each molecule is the number of times the corresponding frequent fragment occurs in that molecule. We have employed the FSG descriptors to develop variable selection k Nearest Neighbor (kNN) QSAR models for several datasets with binary target properties, including Maximum Recommended Therapeutic Dose (MRTD), Salmonella Mutagenicity (Ames Genotoxicity), and P-Glycoprotein (PGP) data. Each dataset was divided into training, test, and validation sets to establish the statistical figures of merit reflecting the models' validated predictive power. The classification accuracies of the models for both training and test sets exceeded 75 % for all datasets, and the accuracy for the external validation sets exceeded 72 %. The model accuracies were comparable to or better than those reported earlier in the literature for the same datasets. Furthermore, the use of fragment-based descriptors affords mechanistic interpretation of validated QSAR models in terms of the essential chemical fragments responsible for the compounds' target property.
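The descriptor construction described above can be illustrated with a minimal sketch. This is not FFSM and not the authors' implementation: as a simplifying assumption, fragments are restricted to single labeled bonds (atom, bond, atom) so that no subgraph isomorphism search is needed. The function names `edge_fragments` and `fsg_descriptors`, the `min_support` parameter, and the four toy molecules (heavy atoms only) are all hypothetical and chosen for illustration.

```python
from collections import Counter

# Assumption: a molecule is a toy labeled undirected graph given as
# (atom_labels, bonds), where each bond is (atom_index, atom_index, bond_label).

def edge_fragments(atoms, bonds):
    """Count single-bond fragments, canonicalised so that C-O equals O-C."""
    counts = Counter()
    for i, j, bond in bonds:
        a, b = sorted((atoms[i], atoms[j]))
        counts[(a, bond, b)] += 1
    return counts

def fsg_descriptors(molecules, min_support):
    """Keep fragments present in at least `min_support` (a fraction) of the
    molecules, then describe each molecule by its per-fragment counts."""
    per_mol = [edge_fragments(atoms, bonds) for atoms, bonds in molecules]
    support = Counter()
    for counts in per_mol:
        support.update(counts.keys())  # a molecule contributes once per fragment
    n = len(molecules)
    frequent = sorted(f for f, s in support.items() if s / n >= min_support)
    matrix = [[counts[f] for f in frequent] for counts in per_mol]
    return frequent, matrix

# Hypothetical toy dataset (hydrogens omitted).
ethanol = (["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")])
acetaldehyde = (["C", "C", "O"], [(0, 1, "-"), (1, 2, "=")])
propane = (["C", "C", "C"], [(0, 1, "-"), (1, 2, "-")])
dimethyl_ether = (["C", "O", "C"], [(0, 1, "-"), (1, 2, "-")])

frequent, matrix = fsg_descriptors(
    [ethanol, acetaldehyde, propane, dimethyl_ether], min_support=0.5
)
print(frequent)  # fragments kept as descriptor columns
print(matrix)    # one row of occurrence counts per molecule
```

In this toy dataset the C=O fragment occurs in only one of four molecules, so it falls below the 50 % support threshold and is dropped, while C-C and C-O become descriptor columns. A real frequent subgraph miner would grow such fragments into larger connected subgraphs and would need a canonical form for each one.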
