The early phases of designing a new drug are characterized by an exhaustive exploration of related information available in public as well as in proprietary databases. Gathering such information for a given set of compounds is complicated due to the large number of data sources, the different data formats and query mechanisms used.1 Manual searches are tedious and time consuming, thus usually limited to individual compounds only. The process is error prone and the information collected this way is incomplete, of variable quality, and missing the link to the original data sources. Therefore data values can not necessarily be compared and it is difficult to identify compound duplicates.2 A recent approach to overcome these issues is the integration of data from different sources by means of semantic web technologies.3 Organizing the data in triples allows the representation of relationships between data points. Here, each triple has a subject-predicate-object structure, i.e. “sodium chloride (=subject) is a (=predicate) salt (=object)”. This way, related pieces of information can be connected semantically. Next to assembling the data itself, another challenging aspect is the representation of the data in a user-friendly and comprehensible manner. The constantly growing amount of data to be considered in a single drug discovery process poses new challenges to find a human accessible representation. The visualization of compound structures along with the related textual information should be designed for interactive use and at the same time provide access to important details. Commonly used cheminformatics tools for the handling of large sets of compounds include Spotfire (TIBCO),4 MS Excel,5 Accord (Accelrys),6 the Dotmatics Browser,7 Seurat (Schrödinger),8 CDD Vault,9 and Tableau.10 These tools allow exploring entire datasets, e.g. by plotting numerical properties against each other or by presenting them in a tabular form. However, the compound information cannot be augmented dynamically, but is instead restricted to the information contained in the original dataset.
Employing the Internet as a knowledge base entails different challenges. The available data is spread over numerous sources, which are updated frequently but not necessarily regularly. Due to this dynamic nature of data, downloading a copy and working with it locally quickly leads to outdated, false or simply missing information. Other, more technical challenges are the large amounts of data, the diverse and often non-standard formats along with the use of different molecule identifiers in most data sources. In the context of semantic web technology, this can be solved by so-called identity resolution services (IRS),11 which are entrusted with the mapping of different identifiers to one unique object representation.
In 2011 a new project, called OpenPHACTS (Open Pharmacological Concept Triple Store),12 has been initiated aiming to create an Open Pharmacological Space (OPS) by assembling data useful for drug discovery from public data sources using semantic web technologies. Currently, it includes datasets from the following databases: DrugBank,13 ChEMBL,14 SwissProt/UniProt,15 ChEBI,16 Gene Ontology,17 GOA,18 Wikipathways.19 Following an application-oriented approach, the project started with the definition of potential use cases in the form of research questions, formulated and prioritized by a consortium of scientists from both academia as well as various pharmaceutical companies.20 The development of the different software components is now led by these research questions.
Here we present a first prototype of the ChemBioNavigator (CBN), an OpenPHACTS exemplar service for navigating the chem-bio space with a focus on small molecules relevant in pharmaceutical research. This service allows to access large amounts of data originating from numerous public data sources available on the Internet and to merge this with proprietary compound information dynamically during runtime. The added information is taken directly from datasets included in the OPS or from external data sources which are referenced in the OPS data cache.
Taking a step beyond the use-case driven development, the CBN is based on the analysis of several interviews with potential users from pharmaceutical industry. The interviews gave invaluable insights not only about the day to day use-cases, but also deficiencies as well as highly valued features of the existing tools. While the requirements of users may be competing, thus can never all be satisfied, trends usually become apparent as the number of interviewed parties grows. This leads to an agile and target-oriented development, which is able to quickly adapt to changing requirements from the scientist’s daily work.
The CBN is realized using modern web technologies and state of the art cheminformatics software libraries. The user interface consists of two different regions: A large molecule visualization canvas and an information panel (see Figure 1).
A CBN-session starts by uploading molecules either as a SD file or a mol2 file or in the form of SMILES strings. These are processed and validated using the NAOMI22 software library. NAOMI also calculates certain base-properties, such as the molecular weight, the number of hydrogen bond donors and acceptors, or a calculated logP. Subsequently the compounds are annotated with information obtained from the OPS system. These additional information range from identifiers and one-dimensional compound representations, such as InChI23 and SMILES24 strings, over topological and physico-chemical properties, to target-relations and assay data. Compound data uploaded from file is displayed and available for analysis as well.
For data analysis purpose, the entire compound set can be drawn in a scatter-plot on the visualization canvas. The user can choose any numerical property from any of the incorporated data sources to sort-order the data points via the x- and y-axis of the scatter plot. The data is color-coded based on whether the system was able to identify the compound within OPS and was able to obtain additional information. As known from browser based map services, it is possible to move through the scatter-plot using the mouse as well as to zoom in and out. When zooming in, the level of detail increases and mere dots are replaced by the actual molecule depictions. Additionally, a tabular representation of structure diagrams and the detailed view of single compounds are available allowing a closer examination of selected molecules. From the set of compounds, subsets can be selected and saved as SD file for further use. The SD file covers all compound data including the information retrieved from the OPS platform. The information panel is subdivided into the following tabs: The details-tab shows the structure diagram of the currently selected compound and lists all available properties. The properties are grouped as base properties assigned during the initialization process, properties provided in the uploaded file and additional properties originating from the OPS platform. The selection of properties for the axes of the scatterplot is visually supported by histograms, generated on the fly showing the value range and distribution of the currently chosen property.
Using the OPS system as data source, the CBN provides homogenous access to heterogeneous data originating from various databases. The functionality described here constitutes a first prototype of the CBN, which will be extended with various additional features over the course of the OpenPHACTS project. The main focus of further development will be the inclusion of target- and biological data. Given the project-focus on pharmaceutical research, the relationship between compound and potential targets is of particular interest. However, rather than duplicating functionality, links into other exemplars shall be provided to address very detailed or more complex research questions from other domains. Another important feature for future releases is the possibility to extend a compound set based on similarity searches.
The incorporation of these and other new features will enable the average bench scientist to easily explore large amounts of data, augmenting their research results with new aspects. The CBN is made publicly available at http://cbn.zbh.uni-hamburg.de.