Chemogenomic databases: Construction, search and analysis


Over the past 10–15 years, high-throughput technologies developed by the pharmaceutical industry and related disciplines have produced large databases of drug efficacy, gene and protein expression levels, mutational status, and molecular structures of potential drugs and drug targets. These data further our understanding of normal versus disease states and are useful for drug discovery. With the recent availability of large public databases, notably those from US government agencies such as the National Center for Biotechnology Information and the National Cancer Institute, the emerging field of chemogenomics offers abundant opportunities for new techniques in statistical analysis, data integration, and large-scale data mining. This is the first of two issues of Statistical Analysis and Data Mining to be published in 2009 focused on statistical methods for chemogenomics. The review by Liu and Verducci provides a good introduction to the field of chemogenomics as well as an overview of the many statistical, bio/chemo-informatic, and experimental subdisciplines used in modern drug discovery.

Several papers in this issue describe new statistical methods for relating molecular structure to drug efficacy. Because compounds with similar molecular structures often have similar biological properties, chemical similarity searching, based on features extracted from the molecular structure, has long been used to prioritize compounds for testing. The article by Liu and Verducci introduces methods of chemical similarity searching, which typically start with an active compound as a reference structure and aim to retrieve additional active compounds from a database. If the search is based on multiple active compounds instead of a single one, the proportion of active compounds retrieved from the database can be significantly enriched. Turbo similarity searching (TSS) assumes that the nearest neighbors of the single reference compound provided by the user are themselves active and automatically uses them to conduct a multiple-reference similarity search. Previous studies have shown that TSS can improve retrieval enrichment compared to a traditional similarity search. In this issue, Gardiner et al. compare the effectiveness of TSS over a variety of databases and use alternative structural descriptors for computing structural similarity.
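
To make the two search strategies concrete, the following minimal sketch contrasts a single-reference similarity search with a turbo similarity search on a toy database. Several choices here are assumptions made for illustration and are not taken from the papers in this issue: fingerprints are represented as Python sets of feature identifiers, similarity is measured with the Tanimoto coefficient, the scores from multiple references are fused by taking the maximum, and the compound identifiers and the parameter `k` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of feature ids."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_search(reference, database):
    """Rank compound ids by decreasing similarity to one reference fingerprint."""
    scores = {cid: tanimoto(reference, fp) for cid, fp in database.items()}
    # Ties are broken alphabetically by compound id so the ranking is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def turbo_similarity_search(reference, database, k=2):
    """Two-pass search: treat the k nearest neighbours as additional references.

    Assumption: scores from the references are combined with the MAX rule.
    """
    nearest = similarity_search(reference, database)[:k]
    references = [reference] + [database[cid] for cid in nearest]
    fused = {
        cid: max(tanimoto(ref, fp) for ref in references)
        for cid, fp in database.items()
    }
    return sorted(fused, key=lambda cid: (-fused[cid], cid))


# Toy data (hypothetical): each compound is a set of structural feature ids.
database = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3, 5},
    "c3": {3, 5, 6, 7},
    "c4": {8, 9, 10},
    "c5": {5, 6, 7, 11},
}
reference = {1, 2, 3, 4, 12}

single_ranking = similarity_search(reference, database)
turbo_ranking = turbo_similarity_search(reference, database, k=2)
print(single_ranking)
print(turbo_ranking)
```

In this toy case, compound c5 shares no features with the reference and so scores zero in the single-reference search, but it shares one feature with the neighbor c2, which lifts it above c4 in the turbo ranking. Whether such promotions enrich the retrieved set in practice depends on whether the nearest neighbors really are active, which is the assumption TSS makes.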

Scheiber et al. present a method for dealing with multiple activity values (i.e. from a series of assays) with the aim of explaining the assay differences from a chemical perspective. The key idea is to create meta-categories from the differences in assay values and use the new (ordinal) categories as dependent variables to identify chemical properties or molecular substructures associated with the difference in profiles. In the chemogenomics setting, this technique could be used to identify molecular features common to compounds for which activity levels are associated with gene/protein expression signatures.
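
The general idea of turning assay differences into ordinal meta-categories can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of Scheiber et al.: it considers just two assays, uses a hypothetical threshold of 1.0 activity units to separate the categories, and describes each compound by a hypothetical set of substructure labels. The compound records and all names are invented for the example.

```python
from collections import Counter, defaultdict


def meta_category(value_a, value_b, threshold=1.0):
    """Map the difference between two assay values to an ordinal category.

    Returns 1 if assay A exceeds assay B by more than the threshold,
    -1 if it falls short by more than the threshold, and 0 otherwise.
    """
    diff = value_a - value_b
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0


def feature_profile(compounds, categories):
    """Count, for every substructure label, how often each category occurs."""
    table = defaultdict(Counter)
    for compound in compounds:
        for feature in compound["features"]:
            table[feature][categories[compound["id"]]] += 1
    return table


# Toy records (hypothetical values and substructure labels).
compounds = [
    {"id": "m1", "assay_a": 7.5, "assay_b": 5.0, "features": {"nitro", "phenyl"}},
    {"id": "m2", "assay_a": 7.0, "assay_b": 5.5, "features": {"nitro"}},
    {"id": "m3", "assay_a": 6.0, "assay_b": 6.2, "features": {"phenyl"}},
    {"id": "m4", "assay_a": 5.0, "assay_b": 7.1, "features": {"amide", "phenyl"}},
    {"id": "m5", "assay_a": 6.1, "assay_b": 5.9, "features": {"amide"}},
]

# The ordinal categories act as the dependent variable.
categories = {
    c["id"]: meta_category(c["assay_a"], c["assay_b"]) for c in compounds
}
profile = feature_profile(compounds, categories)

for feature in sorted(profile):
    print(feature, dict(profile[feature]))
```

A substructure whose counts concentrate in one category, as the hypothetical "nitro" label does here, would be a candidate explanation for the difference between the assays. A real analysis would need far more compounds and a formal test of association.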

The effectiveness of a molecular descriptor set for identifying active compounds from a similarity search depends on the compound class. This dependency makes it difficult to select the most suitable descriptor set for a given search. Vogt et al. describe a method that combines a Bayesian scoring scheme for selection of structural and molecular property descriptors with an information-theoretic method for predicting recall rates. The procedure allows for the selection of search methods most likely to be successful for a given compound class and target database.
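
One way to see why descriptor suitability depends on the compound class is to compare how a descriptor is distributed among known actives with how it is distributed in the database as a whole. The sketch below is a stand-in chosen for illustration and should not be read as the scoring scheme of Vogt et al.: it assumes each descriptor is approximately Gaussian in both populations and ranks descriptors by the Kullback-Leibler divergence between the two fitted Gaussians. The descriptor names and values are hypothetical.

```python
import math
from statistics import mean, pstdev


def gaussian_kl(mu_p, sd_p, mu_q, sd_q):
    """Kullback-Leibler divergence KL(P || Q) between two univariate Gaussians."""
    return (
        math.log(sd_q / sd_p)
        + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sd_q ** 2)
        - 0.5
    )


def rank_descriptors(actives, database):
    """Rank descriptors by how strongly actives deviate from the database.

    Both arguments map a descriptor name to a list of values. Each list must
    contain at least two distinct values so the standard deviation is nonzero.
    """
    scores = {}
    for name, active_values in actives.items():
        mu_a, sd_a = mean(active_values), pstdev(active_values)
        mu_d, sd_d = mean(database[name]), pstdev(database[name])
        scores[name] = gaussian_kl(mu_a, sd_a, mu_d, sd_d)
    # Largest divergence first: these descriptors separate actives best.
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy descriptor values (hypothetical).
actives = {
    "logP": [3.0, 3.2, 2.8, 3.1],
    "mol_weight": [300.0, 420.0, 350.0, 380.0],
}
database = {
    "logP": [0.5, 1.5, 2.5, 3.5, 4.5],
    "mol_weight": [250.0, 320.0, 390.0, 450.0, 340.0],
}

ranked = rank_descriptors(actives, database)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy example the actives occupy a narrow logP range relative to the database, so logP receives the larger divergence and would be preferred for this hypothetical compound class. A different class of actives could reverse the ranking.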

Gardner-Lubbe et al. discuss the advantages of biplots for visualization and analysis of microarray data. Biplots are typically used in conjunction with principal component analysis (PCA). However, the authors show that PCA biplots do not provide optimal data separation when used for exploring the differences between treatments and differentially expressed genes. As an alternative, they suggest biplots based on analysis of distance and illustrate their effectiveness in separating gene expression samples from three treatment groups.

As a whole, the papers in this issue take the reader from the fundamentals of how chemogenomic databases are constructed and searched to clever applications of statistical analysis that draw insight from them. The works reflect the best spirit of data mining for ideas, not just facts. We thank and congratulate the authors.