Using Graph Indices for the Analysis and Comparison of Chemical Datasets

Authors

  • Denis Fourches,

    1. Laboratory for Molecular Modeling, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA
    Search for more papers by this author
  • Alexander Tropsha

    Corresponding author
    1. Laboratory for Molecular Modeling, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA
    • Laboratory for Molecular Modeling, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA

    Search for more papers by this author

Abstract

In cheminformatics, compounds are represented as points in multidimensional space of chemical descriptors. When all pairs of points found within certain distance threshold in the original high dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or Randic connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications.

Ancillary