3D‐e‐Chem: Structural Cheminformatics Workflows for Computer‐Aided Drug Discovery

Abstract eScience technologies are needed to process the information available in many heterogeneous types of protein–ligand interaction data and to capture these data into models that enable the design of efficacious and safe medicines. Here we present scientific KNIME tools and workflows that enable the integration of chemical, pharmacological, and structural information for: i) structure‐based bioactivity data mapping, ii) structure‐based identification of scaffold replacement strategies for ligand design, iii) ligand‐based target prediction, iv) protein sequence‐based binding site identification and ligand repurposing, and v) structure‐based pharmacophore comparison for ligand repurposing across protein families. The modular setup of the workflows and the use of well‐established standards allows the re‐use of these protocols and facilitates the design of customized computer‐aided drug discovery workflows.


Introduction
There is a need for eScience technologies to process the large volumes of rapidly generated, heterogeneous [1] protein-ligand interaction data into computational models that enable the design of efficacious and safe medicines. [2] The ChEMBL database (version 23), for example, contains over 14 million data entries on 11 500 protein targets, of which 4600 human, covering 1.7 million unique compounds. [3] The Protein Data Bank (PDB, accessed October 21, 2017) contains more than 130 000 structures with nearly 24 000 small molecules covering 67 000 unique protein-ligand complexes. [4] Currently 20 000 human proteins have been deposited in Swiss-Prot [5] (version 2017_10), of which 3300 proteins are also present in ChEMBL. Comparison of the protein, ligand, and bioactivity data in ChEMBL, PDB, and UniProt indicates that structural information is lacking for more than 95 % of the protein-ligand pairs for which bioactivity data has been reported, and for more than 75 % of the human proteins for which sequence information is available. In silico chemogenomics [6] and computer-aided drug discovery methods can be used to predict protein-ligand interactions in order to fill these bioactivity-structure and sequence-structure gaps, identify new protein-ligand pairs, and design new ligands. [6b, 7] The success rate of such methods strongly depends on the efficient integration of chemical, pharmacological and structural data to train, optimize, and evaluate ligand- and protein-based models. [6b, 7a,b] An effective approach to accomplish this is through the development of scientific workflows [8] that facilitate the standardization of protocols, [7c] the integration of data and analyses, and re-use of parts of protocols to customize, extend, or design new workflows for different targets or applications. [9] KNIME [10] and Pipeline Pilot [11] are established workflow managers in the field of cheminformatics and computer-aided drug discovery, with a growing number of users. [8]
Several ligand-based workflows have been reported that combine chemical and biological data sources for ligand-based target prediction. [12] Few structure-based workflows have been reported, including protocols for pharmacophore screening, [13] structure-based ligand optimization, [14] as well as combined ligand- and protein-based ligand repurposing. [15] Several of the tools in the reported workflows, however, use commercial computer-aided drug discovery software that is not accessible without a paid license. [15b, 16] Most freely available cheminformatics tools [17] (nodes) that can be run within these workflows focus on small molecules [18] and the number of nodes that use freely available structure-based approaches is relatively scarce.
The current work describes the integration and analysis of several chemical, biological, and structural data types in workflows that can be used for: i) structure-based bioactivity data mapping, ii) structure-based identification of scaffold replacement strategies for ligand design, iii) ligand-based target prediction, iv) protein sequence-based binding site identification and ligand repurposing within a protein family, and v) structure-based pharmacophore comparison for ligand repurposing across protein families.
The flexible workflows and protocols presented here can be used as templates for the standardization of protocols, the integration of data and analyses, and can readily be reused or extended for the creation of new computer-aided drug discovery workflows for other protein targets and applications. The cases will focus on two of the pharmaceutically most relevant protein targets, namely G protein-coupled receptors (GPCRs) and kinases.

Structure-based bioactivity data mapping of kinase inhibitors
Protein-ligand crystal structures provide information regarding protein-ligand interactions and protein conformations, whereas bioactivity data provides insight into the binding affinity or functional effect. The integration of structural and bioactivity data allows one to interpret differences and similarities in bioactivity (e.g., affinity cliffs) in terms of ligand binding modes and specific protein-ligand interactions, and to extrapolate these insights to other protein targets. In the next workflow (Figure 2) we have combined bioactivity data from ChEMBL and (structural) kinase data from KLIFS to create a matrix of available bioactivity data on human kinases for all co-crystallized kinase ligands.

Protocol:
1) Collect protein information and the molecular structures of co-crystallized ligands (here from KLIFS)
2) Retrieve the available bioactivity data for the ligands (here from ChEMBL)
3) Clean, curate, and process the bioactivity data
Figure 2. The heatmap (B) shows the bioactivity profile for the top 100 co-crystallized kinase ligands with the largest amount of data available for the top 400 kinases. The kinomes, created with KinMap, [32] show the number of unique kinase-inhibitor complexes based on KLIFS (C) and the number of unique kinase inhibitors based on ChEMBL (D). The data accumulated in this workflow are summarized (E) for two well-known kinase inhibitors, namely Seliciclib and Dasatinib (indicated with a blue and green arrow, respectively, on the Y-axis of the heatmap). *Only human kinases are listed.
The molecular structures of all 2552 unique co-crystallized small molecule kinase inhibitors were collected via KLIFS nodes (KLIFS accessed August 18th, 2017) in SMILES format. The InChI-Keys of the inhibitors were subsequently used to retrieve the ChEMBL IDs for the compounds (1583 matches) including all corresponding bioactivity data (166 976 data points). Using the human kinase list from KLIFS all bioactivity data was reduced to solely the human kinome (86 601 data points for 432 kinases). The top 100 compounds with the largest number of available bioactivity data (excluding single concentration measurements) for kinases [30] was then selected together with the top 400 kinases and the median log value of the bioactivity data for each unique compound-kinase pair. The data was then transformed into a matrix and visualized as a heatmap using the JFreeChart HeatMap node.
The heat map shows clear differences in the bioactivity profiles between kinase inhibitors and highlights promiscuous and selective compounds as well as the gaps in the bioactivity matrix. This workflow illustrates a simple, yet powerful, method of complementing a structure-based view of kinase inhibitors with the available pharmacological data for more advanced structural chemogenomics applications (Figure 2).
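The matrix-building step described above (median log activity per compound-kinase pair, restricted to the most data-rich compounds and kinases) can be sketched outside KNIME in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_activity_matrix`, the tuple layout of the input records, and the tie-breaking by identifier are hypothetical and not part of the published workflow.

```python
from collections import defaultdict
from statistics import median


def build_activity_matrix(records, top_compounds=100, top_kinases=400):
    """Build a compound x kinase matrix of median log activities.

    records: iterable of (compound_id, kinase_id, p_activity) tuples
             (hypothetical layout, assumed to be already cleaned).
    """
    # Group all measurements per unique compound-kinase pair.
    by_pair = defaultdict(list)
    for compound, kinase, p_act in records:
        by_pair[(compound, kinase)].append(p_act)

    # Count data points per compound and per kinase to rank them.
    n_compound = defaultdict(int)
    n_kinase = defaultdict(int)
    for (compound, kinase), values in by_pair.items():
        n_compound[compound] += len(values)
        n_kinase[kinase] += len(values)

    # Keep the entries with the most data; ties are broken by identifier.
    rows = sorted(n_compound, key=lambda c: (-n_compound[c], c))[:top_compounds]
    cols = sorted(n_kinase, key=lambda k: (-n_kinase[k], k))[:top_kinases]

    # One median value per cell; None marks a gap in the bioactivity matrix.
    matrix = [
        [median(by_pair[(c, k)]) if (c, k) in by_pair else None for k in cols]
        for c in rows
    ]
    return rows, cols, matrix
```

The `None` cells correspond to the gaps in the bioactivity matrix that the heat map makes visible.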

Scaffold replacements for kinase ligand design
Scaffold hopping is a common approach in which a part of a known active compound is changed while trying to maintain the binding affinity and binding mode of the original compound in order to obtain better ADMET/PKPD or physicochemical properties or to escape patent infringement. [33] In the next workflow (Figure 3) protein-ligand interaction similarity [6a, 34] as well as chemical similarity is used to identify molecular pairs with a low chemical similarity but a high interaction similarity, thereby providing interesting starting points for the design of hybrid molecules that have a high probability of maintaining their binding mode.
* Visually evaluate the obtained binding modes, compare their interaction fingerprints, or perform another binding mode comparison technique.
Starting from the KLIFS nodes all structural information on human kinases (7552 unique monomers) was downloaded including the kinase-inhibitor interaction fingerprints (IFP) and the SMILES of the co-crystallized kinase inhibitors. Subsequently, a group loop is started that processes all structures per individual kinase. Within the loop, a pairwise interaction-based IFP [6a, 34] and ligand-based ECFP-4 [37] comparison is performed for all complexes of each kinase. The combinations are subsequently filtered for ligand pairs with a low chemical similarity (ECFP-4 Tanimoto score < 0.26) and a high interaction similarity (IFP Tanimoto score > 0.75), that is, all chemically distinct ligand pairs that do have similar interactions with the kinase target are selected. From the resulting list of pairs, an imidazopyridine inhibitor (PDB ID: 4DIT) [35] and a carboxamide inhibitor (PDB ID: 4PTG) [36] in complex with GSK3B with a very low ligand similarity (Tanimoto ECFP-4 = 0.188) and an identical protein-ligand interaction pattern (Tanimoto IFP = 1.0) were selected as an example for further inspection. From both structures, the KLIFS aligned full monomer and ligand were downloaded and subsequently visualized using the Ligands and Proteins Viewer showing the overlay of the ligands in the GSK3B binding site. These two kinase inhibitors were subsequently used to design a hybrid compound drawn in the MarvinSketch node. Finally, this design was docked into the GSK3B binding site (PDB ID: 4PTG) using the newly developed PLANTS [27] docking nodes. Upon visual inspection of the obtained binding modes within the Ligands and Proteins viewer, a highly conserved binding mode of both parts of the hybrid design is observed. Within this workflow the chemical dissimilarity is complemented with protein-ligand interaction patterns to identify distinct molecules with similar mechanisms of binding. This combination of techniques provides new opportunities for molecular design based on known ligands and the workflow could, for example, be rewired and extended for more advanced fragment-based replacement approaches.
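The pair filter at the core of this workflow (low ECFP-4 similarity combined with high IFP similarity) can be illustrated with a minimal Python sketch. It assumes that both fingerprints are available as sets of on-bit indices for the complexes of a single kinase; the dictionary keys and the function names are hypothetical, and the fingerprint generation itself (done by the KLIFS and cheminformatics nodes in the workflow) is not reproduced here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def scaffold_hop_pairs(complexes, max_ligand_sim=0.26, min_ifp_sim=0.75):
    """Select chemically distinct ligand pairs with similar interaction patterns.

    complexes: list of dicts with the (hypothetical) keys 'pdb', 'ligand_fp'
               and 'ifp', all belonging to ONE kinase.
    """
    hits = []
    for x, y in combinations(complexes, 2):
        ligand_sim = tanimoto(x["ligand_fp"], y["ligand_fp"])
        ifp_sim = tanimoto(x["ifp"], y["ifp"])
        # Thresholds follow the text: ligand similarity < 0.26, IFP similarity > 0.75.
        if ligand_sim < max_ligand_sim and ifp_sim > min_ifp_sim:
            hits.append((x["pdb"], y["pdb"], ligand_sim, ifp_sim))
    return hits
```

Running the function once per kinase mirrors the group loop of the workflow; both thresholds are exposed as parameters so that they can be varied.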

Ligand-based cross-reactivity prediction
The derivation of similarity measures between different protein receptors may be used to explore cross-reactivities and to explore the potential for compounds to display (useful) polypharmacology. The PP_GPCR (protein-protein association GPCR) workflow (Figure 4A) follows methodologies used in previous efforts [39] to explore the relationships between protein targets using ligand topology. This chemocentric approach involves describing the sets of ligands for each protein target by chemical fingerprint descriptors, [40] and comparing the sets with each other to derive similarities between protein targets. With this approach, one can derive protein-ligand and protein-protein associations ranging from biologically expected to less obvious.

Protocol:
* Collect available bioactivity data for a protein family or (full) set of proteins of interest
* Clean, curate, process, and filter the bioactivity data
* Calculate ligand-based fingerprint descriptors for each compound
* Goal 1: Protein-protein association prediction
  * Perform an all-against-all comparison of the fingerprints and select relevant hits based on a user-definable cutoff
  * Group the number of hits per protein target pair and calculate an E-value
  * Output of the results for visualization in, for example, Cytoscape [41] or flareplots. [38]
* Goal 2: Identification of potential protein targets for small molecules
  * User input of the small molecules of interest and calculate their ligand-based fingerprint descriptors
  * Perform a fingerprint comparison against the protein dataset and select hits based on a user-definable cutoff
  * Group the number of hits per protein target and calculate an E-value
The protocol is applicable to any combination of data sets with unknown distributions of structures and biological activity values; user intervention to vary thresholds, similarity measures, fingerprints and statistical approaches is made possible. The PP_GPCR workflow reads in data from a public data source, ChEMBL, for all non-olfactory GPCR receptors as derived from the GPCRdb. [23] Various filters for allowed activity type (EC50, IC50, AC50, Kb, KD, Ki) and threshold activity (pAct ≥ 5) are applied, a minimum compound set size of 5 is required, and a restriction on the number of calculated rotatable bonds (maximum of 18) is used to limit the number of very large, flexible compounds. The latter is performed as in our experience the presence of large numbers of peptide/peptoid compounds can lead to some targets being routinely overrepresented in later comparisons.
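The filtering criteria listed above can be summarized in a short Python sketch. It is a simplified illustration, not the KNIME implementation: the record keys (`target`, `compound`, `type`, `p_act`, `rot_bonds`), the spelling of the activity-type labels, and the function name are assumptions made for this example.

```python
from collections import defaultdict

# Activity types allowed in the text; the exact label spelling is an assumption.
ALLOWED_TYPES = {"EC50", "IC50", "AC50", "Kb", "KD", "Ki"}


def filter_ligand_sets(records, min_p_act=5.0, max_rot_bonds=18, min_set_size=5):
    """Return {target: set of compounds} after applying the filters from the text.

    records: iterable of dicts with the (hypothetical) keys
             'target', 'compound', 'type', 'p_act' and 'rot_bonds'.
    """
    ligand_sets = defaultdict(set)
    for r in records:
        if (r["type"] in ALLOWED_TYPES
                and r["p_act"] >= min_p_act
                and r["rot_bonds"] <= max_rot_bonds):
            ligand_sets[r["target"]].add(r["compound"])
    # Drop targets with fewer distinct compounds than the minimum set size.
    return {t: c for t, c in ligand_sets.items() if len(c) >= min_set_size}
```

Keeping every threshold as a keyword argument reflects the user-definable nature of the cutoffs in the workflow.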
Fingerprint descriptors (in this case RDKit: Daylight-like topological fingerprint) were calculated for each compound and the similarities between the receptor sets were determined using a user-definable threshold for similarity, here set to a minimum of 0.7. Use of the raw similarities and set size following Keiser [39a] allowed the calculation of E-values, used to rank the similarity between protein targets. The similarities between receptors are viewable as a KNIME Table and Excel File. To highlight some of the identified similarities the top 500 protein associations were visualized in a flareplot [38] (Figure 4D) and a heatmap (Figure 4B). The melanocortin receptors, for example, show links with opioid, endothelin, chemokine and somatostatin receptors. These associations have previously been explored by Quillan et al. [42] The PP_GPCR workflow may also be used to calculate potential targets/cross-activities for individual compounds. A compound may be entered into the workflow or, if already present in the data, simply extracted and compared with the fingerprints already present allowing the calculation of the statistical significance and ranking by E-values. To analyze the predictive ability of the PP_GPCR workflow, the workflow was applied to five reference structures taken from Keiser et al. [39b] with an experimentally validated GPCR affinity (Ki < 1000 nM). Using the default similarity cut-off of 0.7, for four of the five compounds (Sedalande, Dimetholazine, Xenazine and Fabhistine) previously predicted activities were reflected in the top-five nearest neighbors in the PP_GPCR workflow (Figure 4C). Lowering the similarity cut-off increases the likelihood of detecting further nearest neighbors at the expense of a larger number of hits.
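The all-against-all set comparison can be sketched as follows. The example only computes the raw ingredients (number of hits and summed similarity at or above the cutoff) for one pair of ligand sets; the conversion of these raw values and the set sizes into E-values following Keiser [39a] is deliberately not reproduced. Fingerprints are assumed to be sets of on-bit indices, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def raw_set_score(set_a, set_b, cutoff=0.7):
    """Compare two ligand sets all-against-all.

    Returns (number of pairs at or above the cutoff, sum of their similarities).
    The E-value statistics built on top of these raw values are not included.
    """
    n_hits, total = 0, 0.0
    for fp_a in set_a:
        for fp_b in set_b:
            similarity = tanimoto(fp_a, fp_b)
            if similarity >= cutoff:
                n_hits += 1
                total += similarity
    return n_hits, total
```

For the second goal of the protocol, `set_a` would contain only the fingerprint(s) of the query compound and `set_b` the ligand set of one protein target.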

Sequence-based ligand repurposing within a protein family
Sequence-based identification of key residues for a specific protein can help with the identification of binding site residues or residues that are linked to a specific receptor function. More importantly, this information can be exploited for ligand repurposing as proteins that share similarity for these key residues can potentially bind similar ligands. [43] In this workflow (Figure 5) we use a double entropy sequence analysis method (ss-TEA) to identify these key residues, and perform a sequence-based comparison for these residues to identify similar proteins (within the same protein family) as potential candidates for ligand repurposing. [44]

Protocol:
* Create or obtain a large sequence alignment for a protein family
* Selection of the protein subfamily of interest
* Perform the double entropy ss-TEA analysis for identification of key residues for the selected subfamily
* Extract the aligned key residues and perform a sequence comparison to identify nearest neighbors
* Collect additional ligand and bioactivity data for the nearest neighbors
Figure 4. The flareplot (D) [38] shows the top 500 associations between protein targets based on their shared ligand similarities (line thickness indicates the significance); the associations of the melanocortin receptors are highlighted in red.
The workflow begins by gathering a complete list of all class A GPCR families (300), all class A GPCRs (11 731), and the aligned and numbered protein residues for each GPCR (4 536 590 in total) using the GPCRdb [23] nodes. The structure-based residue numbering was then used to obtain a matrix with the position-based alignment of all GPCR residues. At this point, the user can inspect the table of GPCR families and highlight the GPCR receptor/subfamily of interest using an interactive table viewer. The user selection, in this case the Somatostatin receptor type 5 (SST5R), is then used to create a subfamily (i.e., reference group) as input for the double entropy analysis by the ss-TEA node.
All residue positions are scored according to the entropy within the subfamily (internal entropy) compared to the entropy outside the subfamily (external entropy). The 20 residue positions within the seven transmembrane helices with the lowest score (the residues with a low internal entropy, but a high external entropy) were selected for further processing. These residues have a high conservation of a residue within a subfamily but a low conservation outside a subfamily, which is an indication of the subfamily-specific relevance of the residue for, for example, ligand recognition or receptor function. For visualization of the results, a scatterplot is created displaying the internal versus the external entropy with all residue positions (each dot) colored according to their ss-TEA score (Figure 5B). Subsequently, an alignment of solely the selected 20 residues is generated and used to calculate the sequence identity of the human GPCR of the subfamily to all human GPCRs. The nearest 50 GPCRs based on this ss-TEA sequence alignment are selected and shown in an interactive table viewer as potential candidates for ligand repurposing and complemented by a list of available crystal structures in the PDB. Moreover, all ChEMBL bioactivities for each receptor are obtained and the number of active inhibitors annotated in ChEMBL is listed, including the number of known ligands that have both an affinity for the identified receptor as well as for the reference receptor. For the SST5R this selection of GPCRs logically contains the other somatostatin receptors and the closely related opioid receptors, but also the more distant dopamine as well as serotonin receptors (Figure 5).
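The two-entropy idea described above can be illustrated with a small Python sketch that computes, per alignment column, the Shannon entropy inside and outside the selected subfamily and then compares sequences over the selected positions only. Note the assumptions: the ranking used in `key_positions` (internal minus external entropy) is a simple stand-in chosen for illustration and is not the published ss-TEA score, and all function names and the dictionary-based alignment layout are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(residues):
    """Shannon entropy (in bits) of one alignment column."""
    n = len(residues)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(residues).values())


def two_entropy(alignment, subfamily):
    """Per-column (position, internal entropy, external entropy).

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    subfamily: set of sequence names forming the reference group.
    """
    inside = [s for name, s in alignment.items() if name in subfamily]
    outside = [s for name, s in alignment.items() if name not in subfamily]
    length = len(next(iter(alignment.values())))
    return [
        (i,
         shannon_entropy([s[i] for s in inside]),
         shannon_entropy([s[i] for s in outside]))
        for i in range(length)
    ]


def key_positions(entropies, n=20):
    """Pick n positions with low internal and high external entropy.

    Illustrative ranking only (internal - external); NOT the ss-TEA score.
    """
    ranked = sorted(entropies, key=lambda e: (e[1] - e[2], e[0]))
    return [pos for pos, _, _ in ranked[:n]]


def identity(seq_a, seq_b, positions):
    """Fraction of identical residues over the selected positions."""
    return sum(seq_a[i] == seq_b[i] for i in positions) / len(positions)
```

Ranking all family members by `identity` to the reference receptor over the selected positions gives the nearest-neighbor list used as repurposing candidates.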
This selection matches the known cross-reactivity of some SST5R inhibitors for the μ opioid receptor, as well as the dopamine D2 receptor (D2R) and the serotonin 2B receptor (5-HT2BR), which are also identified by the cross-reactivity assessment using the ChEMBL bioactivities of the known SST5R inhibitors (see Figure 5C,D). This is, for example, demonstrated by the cross-reactivity of the marketed drugs Fluspirilene (a D2R antagonist) and Loperamide (a μ opioid agonist) on SST5R. Vice versa, a series of benzoxazole SST5R inhibitors showed nanomolar affinities for 5-HT2BR (Figure 5F). All these receptors share the key ionic anchor D3.32 (Figure 5E) within the selected residues, which was deemed essential for the ligand recognition. [46]
Figure 5. Workflow (A) for the identification of ligand repurposing possibilities using a sequence-based double entropy analysis (ss-TEA). This example shows the identification of the opioid, serotonin, and dopamine receptors as potential repurposing targets for somatostatin type 5 inhibitors, which was retrospectively verified using ChEMBL data (C) and a literature search (F). The scatterplot (B) shows the internal entropy (X-axis) versus the external entropy (Y-axis) for each residue and is colored by the ss-TEA score (the lower the more significant). Part of the summarized analysis results are shown in (C) the interactive table viewer and (D) the identified nearest proteins for SST5R are shown in the phylogenetic tree of human GPCRs. (E) A sequence alignment of solely the residues (using the Ballesteros-Weinstein residue numbering scheme) [45] identified with ss-TEA for the somatostatin receptors and highlighted cross-reactivity targets.

Structure-based pharmacophore comparison for ligand repurposing across protein families
Ligand repurposing across protein families can be enabled through the comparison of known protein binding sites based on the available crystal structures. [47] The rationale is that proteins with similar binding sites can potentially bind similar ligands. [47,48] In this workflow (Figure 6) we compare the KRIPO binding site pharmacophores from all structures of a protein (family) of interest against the KRIPO pharmacophores of the full PDB to identify ligand-repurposing possibilities.

Protocol:
* Collect available PDB entries for the protein families of interest
* Obtain the KRIPO fragments information based on the PDB entries of the reference protein family and search for similar KRIPO fragments in the PDB
* Extract similar fragments that match with PDB entries from the query protein family
Figure 6. A structure-based ligand repurposing workflow (A) that searches for KripoDB [26] pharmacophore similarities between GPCRs and kinases. Two examples (B) of binding site similarities between the 5-HT2B receptor and MAPK14 kinase, and the adenosine A2A receptor and the TTK kinase are presented and described in the main text. The aligned kinase and GPCR structures based on the alignment of the KRIPO pharmacophores are shown in 3D using the Proteins and Ligands viewer (for clarity purposes the lipophilic pharmacophore features are hidden). Only residues within 3.5 Å of the ligands are depicted and labeled according to the Ballesteros-Weinstein [45] and KLIFS [24] numbering scheme for GPCRs and kinases, respectively. Complementary shape-based and pharmacophore-based assessment of the ligands using the KNIME-enabled Silicos-it [25] tools Shape-it and Align-it are performed and compared in the Ligands viewer and Pharmacophore viewer, respectively.
With the GPCRdb KNIME nodes, an overview of all GPCR crystal structures [49] is obtained and used to query the KripoDB [26b] for the available pharmacophore fragment information for these structures. For all full ligand KripoDB entries a similarity search is performed with the KripoDB similar fragments node. The results are then filtered using the KLIFS nodes with an overview of all kinase crystal structures yielding an overview of GPCR pharmacophore fragments that share similarity with a kinase pharmacophore fragment based on their KripoDB fingerprints. From this list, two examples were selected that identified a possible overlap between the KRIPO pharmacophores based on a kinase and a GPCR structure.
The first example is the match between the Sorafenib-bound MAPK14 protein kinase [50] (PDB ID: 3HEG) and the Ergotamine-bound 5-HT2B receptor [51] (PDB ID: 4IB4), consistent with studies showing that the FDA-approved kinase inhibitor Sorafenib has nanomolar affinity for 5-HT2BR. [52] The second example is the match between Reversine-bound TTK protein kinase [53] (PDB ID: 5LJJ) and the triazolecarboximidamide-bound A2A receptor [54] (PDB ID: 5UIG). Reversine shows weak binding affinity for the adenosine A2A receptor, and has sub-micromolar affinity for the homologous adenosine A3 receptor. [55] The KRIPO pharmacophores of each structure were downloaded and aligned using the KripoDB pharmacophore and Align Pharmacophores nodes, respectively. The rotational matrix obtained from the alignment was then used to align both pharmacophores as well as the complete PDB entries in the pharmacophore viewer. To compare the structure-based pharmacophore alignment of the molecules with a ligand-based approach both molecules were aligned using a ligand-based pharmacophore approach (Align-it) and a shape-based approach (Shape-it). The SMILES of both co-crystallized ligands were obtained from the PDB using the PDB Connector Custom Report node. Then the RDKit Add Conformers node was used to generate 30 conformations for each ligand as input for the Align-it and Shape-it nodes. The ligand-based alignments were again visualized with the Pharmacophores Viewer and the Ligands and Proteins viewer. Interestingly, the urea moiety of Sorafenib binding in the back pocket of MAPK14 is aligned with the basic amine in the fused tetracyclic head of Ergotamine. This ligand alignment originates from the KRIPO pharmacophore alignment as the negatively charged centers of the conserved glutamate (E71^αC.24) in the αC-helix of MAPK14 and the key aspartate D135^3.32 of 5-HT2BR are matched.
The volume-based Shape-it overlay shows a good overlap (Tanimoto score = 0.67) between the two compounds; however, most pharmacophore features are not aligned due to a 180-degree flip of the core scaffold to maximize the shape overlay. The ligand-based pharmacophore overlay using Align-it results in a poor score (Tanimoto score = 0.22) and an alignment in which the whole molecules are flipped 180 degrees, illustrating that the structure-based KRIPO pharmacophores were key for the elucidation of this off-target effect.
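Re-using the rotation obtained from the pharmacophore alignment to superpose the complete structures amounts to applying one rigid transformation to every coordinate. The sketch below shows this step in plain Python. How the Align pharmacophores node actually stores and exports its matrix is not specified here, so the 3 x 3 nested-list rotation, the separate translation vector, and the function name are assumptions.

```python
def apply_transform(points, rotation, translation=(0.0, 0.0, 0.0)):
    """Apply the rigid transformation p' = R p + t to a list of (x, y, z) points.

    rotation:    3 x 3 matrix as nested lists (assumed row-major).
    translation: 3-vector added after the rotation.
    """
    transformed = []
    for p in points:
        transformed.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return transformed


# Example: a 180-degree flip around the z-axis, comparable to the flipped
# overlays discussed in the text.
FLIP_Z = [[-1.0, 0.0, 0.0],
          [0.0, -1.0, 0.0],
          [0.0, 0.0, 1.0]]
```

Applying the same `rotation` and `translation` to the pharmacophore points, the ligand atoms, and the protein atoms keeps all three in one common frame, which is what allows the viewer to show them together.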

Conclusions
The presented structural cheminformatics tools and integrated workflows combine heterogeneous data analyses that enable the prediction of protein-ligand interactions and the identification of protein-protein relations. The reusable workflows provide general guidelines that can be used for the construction of automated computer-aided drug discovery protocols, or for the customization and extension to other targets and applications:
1) The use of well documented and amenable workflow management platforms like KNIME facilitates the construction of consistent, reproducible, [1] and transferable protocols. [7c] The workflows can be transferred between, for example, workstations, users, and sites, and can be re-run: i) as is, for example, when large data transfer is not feasible, or when new database versions are released; ii) with different configurations of the nodes, for example, changing ligand activity cut-offs (Figure 2), input ligands (Figures 3, 4, 6), protein targets (Figure 5); iii) with additional/modified nodes to obtain complementary information, for example, including annotations from other databases, further analyzing results, or performing machine learning [56] on the obtained data. Pre-configured meta nodes or workflow blocks can be easily reused because the same data collection, preparation, processing and analysis steps might be required in various workflows for different purposes.
2) KNIME contains a rich and continuously growing set of cheminformatics nodes to handle and process chemical and biological data in multiple formats. Custom nodes can be developed, such as the nodes presented in the current study, and scripts and external tools can be embedded to extend the functionalities of this toolkit in order to address a plethora of biochemical research questions, for example, structural protein-ligand interaction analysis and prediction functionalities.
3) Carefully annotated and standardized data resources are required to perform integrated cheminformatics analyses. [2a, 30, 57] However, it should be noted that the use of external databases can also present a potential pitfall as they can change content and format thereby disrupting the workflow or changing the outcome.
4) The infrastructure of a workflow management platform such as KNIME allows for interactive checks during execution of the workflow. Checking the input and output for each step during the development of a workflow makes for easy debugging, resulting in a more robust and less error-prone workflow. To enhance this process customized data visualization nodes, such as the proteins and ligands viewer and the pharmacophore viewer nodes presented in the current study, are also required to inspect the validity of, for example, docking studies, pharmacophore-based structure alignments, and binding mode similarity assessments.
5) Combining complementary techniques within the same workflow allows for the creation of more advanced or more accurate (consensus) [58] cheminformatics workflows, for example, by combining ligand-based and protein-ligand interaction-based similarity assessments [59] or by combining 2D and 3D ligand-based similarity [60] methods.

Experimental Section
Newly developed KNIME nodes: The KNIME workflows described in this article use a series of 3D-e-Chem KNIME nodes that have been newly developed in addition to a set of previously published 3D-e-Chem nodes. An overview of the new nodes is shown in the list below and the nodes themselves are discussed in more detail in the next few paragraphs.
* Pharmacophore: retrieval of the KRIPO pharmacophore based on the KripoDB fragment identifier.
* Ligands Viewer: visualization of (aligned) small molecules.
* Ligands and Proteins Viewer: the combined visualization of (aligned) small molecules and proteins.
* Proteins Viewer: visualization of (aligned) proteins.
* Pharmacophores Viewer: visualization of (aligned) pharmacophores, small molecules and proteins.
* Align pharmacophores: align the query pharmacophores to the reference pharmacophore.
* Extract pharmacophore points: extract the points of a pharmacophore as rows.
* Merge pharmacophore points: create pharmacophores from a table with x, y, z coordinates, pharmacophore type, alpha and optional directionality.
* Pharmacophore from molecule: create a pharmacophore from a molecule by mapping atoms to pharmacophore points.
* Pharmacophore to molecule: generate a molecule from a pharmacophore by mapping pharmacophore points to atoms.
* Strip-it: strips a given set of molecules to its scaffold based on a user-selected scaffold definition.
* Ss-TEA score: calculates the ss-TEA score for each residue position of a sequence alignment for a set of family members.
Most of the nodes are available under the permissive Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). The PLANTS binaries for docking (embedded within the PLANTS nodes) are freely available for academics, and the Silicos-it source is available under the GNU Lesser General Public License v3 (https://www.gnu.org/licenses/lgpl-3.0.en.html). A more detailed overview per node set and tool, including license information, dependencies, and their application, is given in Supporting Information Table S1.
GPCRdb nodes: The GPCRdb [23] is a specialized database focused on G protein-coupled receptors: the largest protein family that lies encoded within the human genome. Besides a comprehensive ontology, this database contains information on GPCR sequences, alignments, residue numbering schemes, crystal structures, interactions, and mutation data. The eight GPCRdb KNIME nodes, as previously described, [31] provide access to this information from within KNIME and enable the integration of this data in comprehensive chemogenomics workflows.
KLIFS nodes: KLIFS contains kinase-ligand interaction information derived from over 3900 structures covering more than 270 different kinases in complex with ≈2500 unique ligands (accessed August 2017). All kinase structures within KLIFS are curated, annotated, aligned, and processed in a systematic manner with automated weekly updates. All KLIFS content can be accessed from within KNIME using one or more of the nine KLIFS nodes from four different categories, as published in McGuire et al. [31]
KripoDB nodes: The pairwise pharmacophore similarity of more than half a million (sub)pockets extracted from structures in the Protein Data Bank is available in the KripoDB. KRIPO encodes pocket pharmacophores into fuzzy 3-point pharmacophore fingerprints that are subsequently used to assess this similarity. [26a] Besides the "Fragment information" and the "Similar fragments" KRIPO nodes that were previously published, [26a] a new KripoDB KNIME node has been added for the retrieval of the pharmacophores themselves that were used for the creation of the KRIPO fingerprints. This allows a user to obtain the pharmacophore of interest, and to align and visualize it in combination with the new set of "Pharmacophore" nodes as well as the "Pharmacophores Viewer".
Molviewer nodes: The freely available molecule viewers in KNIME are primarily oriented at visualization of small molecules. To enable displaying proteins, protein-ligand complexes, and pharmacophores in KNIME we created a set of visualization nodes. When a KNIME view of one of the new viewer nodes is opened, a web browser is launched with an interactive 3D canvas portraying the input molecule(s). There are four molecule viewer KNIME nodes: one to view a set of (aligned) small molecules (e.g., shape-it results), one to view a set of (aligned) small molecules and proteins (e.g., for visualizing PLANTS docking results), one to view a set of (aligned) proteins (e.g., obtained from KLIFS), and one to view a set of pharmacophores and their aligned proteins and/or ligands (e.g., from aligning KripoDB pharmacophores). The molecule viewer KNIME nodes support HiLiting, which means that a selection of molecules inside the viewer can be sent to other KNIME nodes and vice versa. The web-based molecule viewers use the NGL protein viewer [61] (https://github.com/arose/ngl) as their 3D canvas and use React, Redux, and Bootstrap for controls.
Pharmacophore nodes: The pharmacophore nodes are a set of KNIME nodes that enable the conversion and alignment of pharmacophores. The nodes support (directed) pharmacophore features of the following types: aromatic, H-bond donor, H-bond acceptor, lipophilic, positively charged, negatively charged, and exclusion. The set comprises nodes that read and write pharmacophores in the Silicos-it phar file format, nodes that convert a pharmacophore from or to a molecule by mapping the pharmacophore types from or to elements, nodes that convert 3D points with type information into a pharmacophore and vice versa, and finally a node that aligns pharmacophore(s) to a reference pharmacophore.
The pharmacophore alignment is performed by comparing all the point pair combinations the pharmacophores can have in common and then identifying the maximal point pair combinations using the Bron-Kerbosch [62] clique detection algorithm. It subsequently uses the Kabsch [63] algorithm to compute the optimal translation and rotation matrices using singular value decomposition, which are then applied to the probe pharmacophores to obtain the aligned probe pharmacophores for each point pair combination. The pharmacophore KNIME nodes are written in Java and depend on the ejml Java library (http://ejml.org/) for matrix operations. The alignment algorithm is based on the KRIPO [26] codebase.
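The clique-detection step can be illustrated with a minimal Python sketch. It is not the KRIPO or 3D-e-Chem Java implementation: the point representation (a type label plus coordinates), the `tolerance` parameter, and all function names are assumptions made for illustration. Pairs of equally typed points form the nodes of a correspondence graph, two pairs are connected when their intra-pharmacophore distances agree within the tolerance, and Bron-Kerbosch enumerates the maximal cliques. The resulting point pairs are what a Kabsch superposition would take as input; that step is omitted here.

```python
from itertools import combinations
from math import dist

def correspondence_graph(ref, probe, tolerance=1.0):
    """Nodes are (ref_index, probe_index) pairs of same-typed points.
    Two nodes are linked when the point-to-point distances agree
    within `tolerance` (hypothetical parameter, in angstrom)."""
    nodes = [(i, j) for i, (rtype, _) in enumerate(ref)
             for j, (ptype, _) in enumerate(probe) if rtype == ptype]
    adjacency = {node: set() for node in nodes}
    for (i1, j1), (i2, j2) in combinations(nodes, 2):
        if i1 == i2 or j1 == j2:
            continue  # each point may be matched at most once
        d_ref = dist(ref[i1][1], ref[i2][1])
        d_probe = dist(probe[j1][1], probe[j2][1])
        if abs(d_ref - d_probe) <= tolerance:
            adjacency[(i1, j1)].add((i2, j2))
            adjacency[(i2, j2)].add((i1, j1))
    return adjacency

def bron_kerbosch(adjacency, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    if p is None:
        p = frozenset(adjacency)
    if not p and not x:
        yield r
        return
    pivot = max(p | x, key=lambda n: len(adjacency[n] & p))
    for node in p - adjacency[pivot]:
        yield from bron_kerbosch(adjacency, r | {node},
                                 p & adjacency[node], x & adjacency[node])
        p = p - {node}
        x = x | {node}

# Toy pharmacophores: the probe is the reference shifted by (10, 10, 10),
# listed in a different order, plus one extra lipophilic point.
ref = [("HDON", (0, 0, 0)), ("HACC", (3, 0, 0)), ("AROM", (0, 4, 0))]
probe = [("HACC", (13, 10, 10)), ("HDON", (10, 10, 10)),
         ("AROM", (10, 14, 10)), ("LIPO", (20, 20, 20))]

best = max(bron_kerbosch(correspondence_graph(ref, probe)), key=len)
print(sorted(best))  # [(0, 1), (1, 0), (2, 2)]
```

Because only distances are compared, a mirror image of the reference would match equally well; a production implementation would have to handle that case, for example by inspecting the fitted rotation.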
PLANTS: PLANTS [27] is a free-for-academics docking tool that employs an ant-colony optimization algorithm for sampling potential ligand binding modes and uses a semi-empirical scoring function. The PLANTS KNIME nodes are: i) binding site node to calculate the binding site definition based on the ligand molecule or pocket atoms of the protein, ii) configuration reader to read PLANTS definition files which are used for configuration and to determine the docking output file names, iii) configuration generator to generate a PLANTS config file using the node's dialog with almost all PLANTS configuration fields divided into tabs, iv) runner, the node that executes the PLANTS executable, v) session builder, which takes the protein, binding site, and ligands from KNIME as input and writes them to a session directory as input files for the PLANTS executable, vi) virtual screening, which runs the PLANTS executable in screen mode and reads the files written by the session builder, and finally vii) the virtual screening results reader, which reads the output files generated by the virtual screening node into KNIME. The PLANTS runner and PLANTS configuration generator KNIME nodes are written in Java and use the Mustache template library [64] to write the PLANTS config file. All the other PLANTS nodes are implemented as KNIME meta nodes. A PLANTS executable for Windows, Linux and Mac OS X is bundled with the PLANTS KNIME nodes and is provided under a free academic license. The location of the PLANTS executable defaults to the bundled version, but can be overridden in the KNIME preferences. The initialization and combination of PLANTS KNIME nodes for docking runs requires great care. Therefore, an example docking workflow has been made available at https://github.com/3D-e-Chem/knime-plants/blob/master/examples/plants-virtual-screening-example.knwf.
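The binding site and configuration generator steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the KNIME nodes: the rule used here for the binding site (ligand centroid, radius equal to the largest centroid-to-atom distance plus a hypothetical `margin`) is not taken from the source, Python's `string.Template` stands in for the Mustache templates used by the Java nodes, and the configuration keywords are reproduced from memory of the PLANTS manual and should be checked against it before use.

```python
from math import dist
from string import Template

# Keywords as recalled from the PLANTS manual (assumption: verify before use).
CONFIG_TEMPLATE = Template("""\
scoring_function chemplp
search_speed speed1
protein_file $protein
ligand_file $ligand
output_dir $output_dir
bindingsite_center $cx $cy $cz
bindingsite_radius $radius
cluster_structures 10
cluster_rmsd 2.0
""")

def binding_site_from_ligand(coords, margin=5.0):
    """Centre = centroid of the ligand atoms; radius = largest distance
    from the centroid to an atom plus `margin` (hypothetical rule)."""
    n = len(coords)
    center = tuple(sum(atom[k] for atom in coords) / n for k in range(3))
    radius = max(dist(center, atom) for atom in coords) + margin
    return center, radius

def render_config(protein, ligand, output_dir, center, radius):
    """Fill the template with file names and the binding site definition."""
    return CONFIG_TEMPLATE.substitute(
        protein=protein, ligand=ligand, output_dir=output_dir,
        cx=f"{center[0]:.3f}", cy=f"{center[1]:.3f}", cz=f"{center[2]:.3f}",
        radius=f"{radius:.3f}")

# Toy ligand with four atoms placed symmetrically around the origin.
ligand_atoms = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
                (0.0, 2.0, 0.0), (0.0, -2.0, 0.0)]
center, radius = binding_site_from_ligand(ligand_atoms)
config_text = render_config("protein.mol2", "ligand.mol2", "results",
                            center, radius)
print(config_text)
```

Keeping the template separate from the values mirrors the division of labour described above, where one node derives the binding site and another writes the configuration file.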
Silicos-it nodes: Silicos-it [25] released several of their cheminformatics tools to the open source domain. These KNIME nodes bring their functionality to the KNIME environment. The nodes are: i) align-it, [65] which aligns molecules to a reference molecule based on their pharmacophore, ii) shape-it, [65b,c, 66] which aligns molecules to a reference molecule based on their shape, iii) filter-it, [67] which can filter molecules with undesired properties from a compound set, iv) strip-it, which generates the Murcko, [68] Oprea, [69] or Schuffenhauer [70] scaffold of a molecule, and v) Qed, [71] which calculates the Quantitative Estimation of Drug-likeness (QED) for a (set of) molecule(s). The Silicos-it executables are written in C++ and have OpenBabel as a dependency to read and write different molecule formats. The KNIME Silicos-it nodes come bundled with the align-it, filter-it, shape-it, and strip-it executables for Linux and Mac OS X. The location of the executables defaults to the bundled versions, but can be overridden in the KNIME preferences. All the Silicos-it KNIME nodes are implemented as KNIME meta nodes, except for the node that executes the actual Silicos-it executables. This Silicos-it execute node is implemented in Java and is used by all meta nodes. The align-it executable is wrapped into two KNIME nodes: a node to align SDF formatted molecules to a reference molecule and another node to generate pharmacophores from molecules. The align-it KNIME nodes are part of the Silicos-it KNIME nodes plugin. The shape-it executable aligns molecules to a reference molecule based on their shape. It is wrapped in a KNIME node, which aligns SDF formatted molecules to a reference molecule. The output of the node contains the aligned molecules and alignment scores.
ss-TEA: The ss-TEA score [28] is an abbreviation for subfamily-specific Two-Entropy Analysis score. The score is calculated for each residue position of a large sequence alignment based on a comparison of the level of conservation within a subset (i.e., a subfamily) of proteins (internal entropy) compared to all other proteins (external entropy). By identifying positions that are highly conserved within, but not outside of the subfamily, the ss-TEA score can identify residue positions specifically related to ligand binding or protein function for that specific subset. This methodology is, however, dependent on a high quality and large quantity sequence alignment as input. The ss-TEA algorithm has been implemented as a KNIME node, is written completely in Java and has no dependencies. The node requires a sequence alignment and a list of sequence identifiers, which will be used as the subfamily.
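The two entropies that underlie this analysis can be illustrated with a short Python sketch. It computes only the per-position Shannon entropy inside and outside the subfamily and does not reproduce the published ss-TEA scoring formula. The function names, the toy alignment, the entropy thresholds, and the treatment of gap characters as ordinary symbols are all assumptions made for illustration.

```python
from collections import Counter
from math import log

def shannon_entropy(residues):
    """Shannon entropy (natural log) of the symbols in one alignment column."""
    counts = Counter(residues)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def two_entropy_profile(alignment, subfamily_ids):
    """Return (position, internal entropy, external entropy) per column.
    `alignment` maps sequence identifiers to equally long aligned sequences;
    `subfamily_ids` lists the identifiers that define the subfamily."""
    inside = [seq for name, seq in alignment.items() if name in subfamily_ids]
    outside = [seq for name, seq in alignment.items() if name not in subfamily_ids]
    if not inside or not outside:
        raise ValueError("both the subfamily and the remainder must be non-empty")
    length = len(inside[0])
    profile = []
    for pos in range(length):
        internal = shannon_entropy([seq[pos] for seq in inside])
        external = shannon_entropy([seq[pos] for seq in outside])
        profile.append((pos, internal, external))
    return profile

# Toy alignment: position 0 is conserved in the subfamily (a1, a2) only.
alignment = {"a1": "DKL", "a2": "DRL", "b1": "EKL", "b2": "SRL", "b3": "AKL"}
profile = two_entropy_profile(alignment, {"a1", "a2"})

# Hypothetical cut-offs: conserved inside, variable outside the subfamily.
candidates = [pos for pos, internal, external in profile
              if internal < 0.1 and external > 1.0]
print(candidates)  # [0]
```

The toy alignment also shows why a large, high-quality alignment matters: with only a handful of sequences per group, a single substitution changes the entropy estimates substantially.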
Workflows: All KNIME workflows described in this article, including the source code for all 3D-e-Chem nodes, are available from the 3D-e-Chem GitHub repository (https://github.com/3D-e-Chem/workflows). The individual steps of each workflow are described in more detail in the main text. All 3D-e-Chem nodes used to perform the analyses described in the current work are available under community contributions in KNIME under "3D-e-Chem" (https://www.knime.com/3d-e-chem-nodes-for-knime).