Quality Issues with Public Domain Chemogenomics Data

Authors

  • Tuomo Kalliokoski,

    Corresponding author
    1. Computer-Aided Drug Design, Novartis Institutes for Biomedical Research, Postfach, 4002 Basel, Switzerland
    2. Current Address: Lead Discovery Center GmbH, Otto-Hahn-Straße 15, 44227 Dortmund, Germany

  • Christian Kramer,

    1. University of Innsbruck, Center for Chemistry and Biomedicine, Innrain 82, 6020 Innsbruck, Austria
  • Anna Vulpetti

    1. Computer-Aided Drug Design, Novartis Institutes for Biomedical Research, Postfach, 4002 Basel, Switzerland

Abstract

The key concept in chemogenomics is the similarity principle, which states that similar ligands should bind similar targets. Chemogenomic analysis requires large amounts of data as well as powerful algorithms and computers. Data used for chemogenomic analysis can either be compiled from open sources or produced in-house, as is often done in the pharmaceutical industry. The chemogenomic modeller often has to resort to mixing activity values from different laboratories and even assay types to facilitate chemogenomic analysis. The amount of chemogenomics data available in the public domain has dramatically increased in recent years, allowing fully traceable analysis on a continuously increasing scale. However, some warning flags about the data quality have been raised, and because the primary data determine the accuracy of chemogenomic analysis, the quality of the data is one of the key questions in chemogenomics. This mini-review discusses some of the most common issues with public domain biological data related to chemogenomic analysis. The errors in data can originate from problems with the experiments themselves and their interpretation, or from more mundane issues such as data extraction and annotation. These issues are not unique to any one database but are shared by all the public domain databases and can plague commercial and in-house bioactivity databases as well.
