• Chemical fragment-based descriptors;
  • Fast frequent subgraph mining;
  • Two-dimensional descriptors;
  • k-Nearest neighbor QSAR techniques


We present a novel approach to generating fragment-based molecular descriptors. The molecules are represented by labeled undirected chemical graph. Fast Frequent Subgraph Mining (FFSM) is used to find chemical-fragments (subgraphs) that occur in at least a subset of all molecules in a dataset. The collection of frequent subgraphs (FSG) forms a dataset-specific descriptors whose values for each molecule are defined by the number of times each frequent fragment occurs in this molecule. We have employed the FSG descriptors to develop variable selection k Nearest Neighbor (kNN) QSAR models of several datasets with binary target property including Maximum Recommended Therapeutic Dose (MRTD), Salmonella Mutagenicity (Ames Genotoxicity), and P-Glycoprotein (PGP) data. Each dataset was divided into training, test, and validation sets to establish the statistical figures of merit reflecting the model validated predictive power. The classification accuracies of models for both training and test sets for all datasets exceeded 75 %, and the accuracy for the external validation sets exceeded 72 %. The model accuracies were comparable or better than those reported earlier in the literature for the same datasets. Furthermore, the use of fragment-based descriptors affords mechanistic interpretation of validated QSAR models in terms of essential chemical fragments responsible for the compounds’ target property.