• Open data
• TRPV1
• GRIND
• Classification


Publicly available databases of small compounds have become an indispensable tool for chemoinformaticians collecting and preparing datasets suited to drug discovery questions. Because these databases comprise compounds from structure-activity relationship (SAR) studies performed by different research groups, they are very diverse with respect to the biological assays used. In the present study, we analyzed the applicability of a thoroughly curated dataset gathered from open sources to ligand-based studies, using the transient receptor potential vanilloid type 1 (TRPV1) as a use case. Thorough curation of compounds according to biological assay type and conditions led to a dataset of comparable bioactive compounds. Subsequent exhaustive analysis of this dataset using classification algorithms demonstrated that, in most cases, the resulting models are of reliable quality. Analysis of consistently misclassified compounds showed that they belong to local SAR series in which small changes in structure lead to different class labels. These small structural differences could not be captured by the classification algorithms. However, applying the 3D alignment-independent QSAR technique GRIND to local, structurally related series overcomes this problem.