IChem: A Versatile Toolkit for Detecting, Comparing, and Predicting Protein–Ligand Interactions

Abstract Structure‐based ligand design requires an exact description of the topology of molecular entities under scrutiny. IChem is a software package that reflects the many contributions of our research group in this area over the last decade. It facilitates and automates many tasks (e.g., ligand/cofactor atom typing, identification of key water molecules) usually left to the modeler's choice. It therefore permits the detection of molecular interactions between two molecules in a very precise and flexible manner. Moreover, IChem enables the conversion of intricate three‐dimensional (3D) molecular objects into simple representations (fingerprints, graphs) that facilitate knowledge acquisition at very high throughput. The toolkit is an ideal companion for setting up and performing many structure‐based design computations.

IChem is a suite of tools consisting of about 50 000 lines of computer code written in C++, decomposed into nine modules, for detecting and comparing molecular objects (proteins, ligand-binding cavities, ligands, protein-ligand and protein-protein complexes) frequently manipulated in structure-based computational chemistry (Table 1). Herein we describe four of the most frequent and important uses. For deeper investigation of all modules, the reader should refer to the publicly available user guide. [1]

Setting the scene: pdbconv. A reasonable start to any structure-based design project is the retrieval of experimentally determined protein structures from the Protein Data Bank (PDB), [2] a web resource that currently stores over 134 000 entries. Unfortunately, PDB structures cannot be used directly, as many important features (e.g., protonation and ionization states, atom types and bond orders for organic molecules) are missing. The pdbconv module of the IChem toolkit automates the preparation of ready-to-use protein-ligand structures. The process first assigns a specific class to each residue name (Table 2).
It then applies a correct atom type to every heavy atom, generates the corresponding covalent bonds, and selects strongly bound water molecules while removing bulk water. The process relies on a predefined list of all possible residues with the corresponding templates for every HET record of the PDB file. Correct atom types and 3D coordinates are provided for every template by converting, with Corina, [3] PDB SMILES strings into the corresponding MOL2 file.
The residue list (Table 2) assigns the encountered HET record to one of the 12 possible residue classes (cofactor, ion, ligand, metal, modified amino acid, nucleic acid, organometallic, prosthetic, standard amino acid, sugar, unwanted, water). Please note that molecules originating from the crystallization buffer ("unwanted" class) are automatically identified and discarded. Once atom types have been properly defined for every molecule type (protein and accessory molecules, solvent, ligand), any third-party tool (e.g., Protoss) [4] can be used to finally add the missing hydrogen atoms while optimizing both the ionization and tautomeric state of each molecule of the PDB entry.
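The class assignment described above can be pictured as a simple table lookup. The following Python sketch is illustrative only and is not IChem code: the 12 class names come from the text, whereas the `RESIDUE_LIST` excerpt, the class chosen for each example residue, and the function names are assumptions made for the example.

```python
from enum import Enum


class ResidueClass(Enum):
    """The 12 residue classes named in the text."""
    COFACTOR = "cofactor"
    ION = "ion"
    LIGAND = "ligand"
    METAL = "metal"
    MODIFIED_AMINO_ACID = "modified amino acid"
    NUCLEIC_ACID = "nucleic acid"
    ORGANOMETALLIC = "organometallic"
    PROSTHETIC = "prosthetic"
    STANDARD_AMINO_ACID = "standard amino acid"
    SUGAR = "sugar"
    UNWANTED = "unwanted"
    WATER = "water"


# Hypothetical excerpt of a residue list; the real list and its
# assignments are defined by the toolkit, not by this sketch.
RESIDUE_LIST = {
    "ALA": ResidueClass.STANDARD_AMINO_ACID,
    "MSE": ResidueClass.MODIFIED_AMINO_ACID,
    "HOH": ResidueClass.WATER,
    "NAD": ResidueClass.COFACTOR,
    "HEM": ResidueClass.PROSTHETIC,
    "GOL": ResidueClass.UNWANTED,  # glycerol, assumed here to be a buffer component
}


def classify(residue_name: str) -> ResidueClass:
    """Return the class of a residue name, or fail if it is not listed."""
    key = residue_name.strip().upper()
    if key not in RESIDUE_LIST:
        raise LookupError(f"residue {key!r} is missing from the residue list")
    return RESIDUE_LIST[key]


def is_kept(residue_name: str) -> bool:
    """Residues of the 'unwanted' class are discarded; all others are kept."""
    return classify(residue_name) is not ResidueClass.UNWANTED
```

A lookup of this kind treats every listed residue uniformly but raises an error for any residue name absent from the list, which is why such a list must be kept up to date.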
Working with predefined residue lists and molecular templates provides both advantages and drawbacks. The main advantage is a uniform treatment of all chemical components of a PDB entry with a presumably correct atom typing. As a main drawback, the procedure requires an updated residue list and thus fails in the case of a newly released PDB entry. We therefore propose regular updates along with every new release of the in-house developed sc-PDB database of druggable protein-ligand complexes. [5]

Detecting ligandable cavities: Volsite. Volsite is a tool to automatically detect cavities at the surface of a macromolecule of interest and to predict their structural druggability. [6] It can be run in two modes depending on whether coordinates of a bound ligand are given (ligand-restricted mode) or not (full unrestricted mode). In either case, the target is first placed in a 2 Å resolution grid lattice and each voxel is assigned a state according to whether its accessibility exceeds a user-defined threshold. Accessible voxels are then assigned a pharmacophoric property (hydrophobic, aromatic, hydrogen bond donor, hydrogen bond acceptor, positive ionizable, negative ionizable) complementary to that of the nearest protein atom according to a set of topological rules. [7] The pharmacophoric properties of all atoms are detected on the fly by the general IChem atom parser, which makes it possible to consider accessory molecules (Table 2) or not during cavity detection. Because every voxel has a fixed volume, the total number of pharmacophore-annotated voxels approximates the overall cavity volume. The method is fast (a few seconds) and delineates the cavity borders with very high precision (Figure 1).
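The relationship between voxel count and cavity volume stated above amounts to multiplying the number of annotated voxels by the volume of one cubic voxel. The Python sketch below illustrates this arithmetic only; the function name, the list-of-labels input, and the default grid spacing of 2.0 (read here as 2 Å) are assumptions for the example, not part of Volsite.

```python
# The six pharmacophoric properties listed in the text.
PROPERTIES = {
    "hydrophobic",
    "aromatic",
    "hydrogen bond donor",
    "hydrogen bond acceptor",
    "positive ionizable",
    "negative ionizable",
}


def cavity_volume(voxel_labels, spacing=2.0):
    """Approximate a cavity volume from annotated grid voxels.

    voxel_labels: one entry per voxel, either a property name or None
                  for voxels that carry no pharmacophoric annotation.
    spacing:      grid spacing (assumed to be in angstroms), so each
                  voxel is a cube of volume spacing**3.
    """
    annotated = sum(1 for label in voxel_labels if label in PROPERTIES)
    return annotated * spacing ** 3


# Three annotated voxels on a grid with spacing 2.0 -> 3 * 8 = 24 cubic units.
example = ["hydrophobic", None, "aromatic", "hydrogen bond donor"]
print(cavity_volume(example))
```

The estimate is only as fine as the grid: halving the spacing multiplies the number of voxels needed to fill the same cavity by eight.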
In addition, a set of 73 cavity descriptors is computed for each cavity and used as input to a support vector machine (SVM) classifier to predict the structural druggability (or ligandability) of the inspected cavity. In a standard benchmarking exercise consisting of 113 cavities of known druggability, Volsite presented the highest accuracy when compared with state-of-the-art tools. [6] In the case of multiple cavities, all druggable cavities are saved as readable MOL2 files, along with their predicted druggability score.
Interestingly, the similarity of two Volsite cavities can be estimated by analogy to classical ligand similarity measurements, using a companion tool (Shaper) [6] that uses a smooth Gaussian function to maximize the overlap of their volumes and pharmacophoric properties. High-throughput cavity comparisons are increasingly used in computational chemistry, notably to identify ligands for novel cavities, design inhibitors with precise selectivity patterns, and predict their possible side effects. [8]

Converting protein-ligand complexes into fingerprints and graphs: IFP, GRIM. A major feature of IChem is the possibility to generate diverse simplified representations (fingerprints, graphs) of protein-ligand interactions. For example, the IFP module lists all protein-ligand interactions occurring in a complex and outputs an interaction fingerprint as a bit string (Figure 2).
Several years ago, we [7] and other groups [9] proposed the use of interaction fingerprints (IFPs) to post-process docking data and pick poses producing IFPs similar to those of known actives. Computing IFPs from docking poses is a robust and very efficient way to predict ligand binding modes, [10] propose reliable scaffold hops, [11] and enrich virtual hits in true actives. [12] The success of this post-processing approach is based on the idea that true ligands of the same target often share key interactions with key anchoring residues and thereby produce relatively similar IFPs. However, a clear limitation is the strict dependence on the number of active-site residues, which prevents the comparison of interaction fingerprints across binding sites of different sizes.
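To make the comparison of bit-string fingerprints concrete, the sketch below computes a Tanimoto coefficient between two IFPs. The text does not state which similarity metric is used, so the choice of Tanimoto, the function name, and the example bit strings are assumptions made purely for illustration.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient between two equal-length bit strings.

    Each string contains only '0' and '1'. Fingerprints of different
    lengths are rejected, because bit positions are only comparable
    when both fingerprints describe the same set of residues.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a, b = int(fp_a, 2), int(fp_b, 2)
    common = bin(a & b).count("1")    # bits set in both fingerprints
    total = bin(a | b).count("1")     # bits set in at least one fingerprint
    return common / total if total else 0.0


# Hypothetical fingerprints for a docking pose and a reference complex.
pose = "1100"
reference = "1010"
print(tanimoto(pose, reference))  # 1 shared bit out of 3 set bits
```

The length check makes the limitation explicit: two binding sites with different numbers of residues yield bit strings of different lengths, which cannot be compared position by position.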
We therefore recently designed size-invariant descriptors conceptualized by a graph describing the exact protein-ligand interaction pattern. [13] The method, called GRIM (Graph Interaction Matching), defines three interaction pseudoatoms (IPAs) for every detected protein-ligand interaction: one on the ligand-interacting atom, one on the protein-interacting atom, and one at the barycenter of the latter two atoms (Figure 3).
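The placement of the three pseudoatoms can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GRIM implementation: the function name is hypothetical, and the barycenter is taken here as the unweighted midpoint of the two interacting atoms.

```python
from typing import Tuple

Point = Tuple[float, float, float]


def interaction_pseudoatoms(ligand_atom: Point, protein_atom: Point):
    """Return the three pseudoatom positions for one interaction.

    The first two sit on the interacting ligand and protein atoms; the
    third sits at their barycenter, assumed here to be the unweighted
    midpoint of the two coordinates.
    """
    center = tuple((l + p) / 2 for l, p in zip(ligand_atom, protein_atom))
    return ligand_atom, protein_atom, center


# Example: a ligand atom at the origin interacting with a protein atom.
ipas = interaction_pseudoatoms((0.0, 0.0, 0.0), (2.0, 4.0, 6.0))
print(ipas[2])  # (1.0, 2.0, 3.0)
```

Because each interaction contributes its own three points, the number of pseudoatoms depends on the number of interactions and not on the number of residues lining the binding site.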
The full set of IPAs defines an interaction pattern that is unique to every protein-ligand complex and that can be converted into a graph in which IPAs define the nodes. [13] A particular interaction pattern can be easily compared and aligned to another one by a simple graph matching technique that identifies the maximal common subgraph (clique). [13] The similarity of two interaction pattern graphs is measured by an empirically derived score (GRIMscore) that can be used, for example, to post-process docking poses and reward those corresponding to interaction patterns already observed in reference X-ray structures (Figure 4). In three consecutive international docking contests aimed at predicting ligand binding modes prior to the release of the corresponding X-ray structures, GRIM rescoring was always cited as one of the very best methods for generating near-native docking poses. [14] The same advantage over fast scoring functions was reported in virtual screening against diverse target families (e.g., G protein-coupled receptors, nuclear hormone receptors, protein kinases). [13]

GRIM presents several advantages over alternative knowledge-based rescoring strategies: 1) it can be coupled to any docking algorithm, 2) it does not constrain ligand docking but rewards interaction patterns already present among PDB templates, 3) it takes advantage of ligands with similar binding modes and not necessarily similar chemical structures, 4) it can be applied in a target family-biased pose selection process in which PDB templates from the same protein, but also from similar targets, can be used to store reference interaction patterns, and 5) it directly quantifies binding mode similarity between a predicted protein-ligand complex and any PDB template at very high throughput.
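The graph matching step mentioned above, finding a maximal common subgraph through clique detection, can be illustrated with a generic textbook approach. The sketch below is not the GRIM algorithm: the pairing rule (same interaction type), the distance tolerance `tol`, and all names are assumptions for the example, and the empirical GRIMscore is not reproduced because the text does not give its form.

```python
from itertools import combinations
from math import dist


def max_common_subgraph(pattern_a, pattern_b, tol=1.0):
    """Largest set of pseudoatom pairs that is geometrically consistent.

    Each pattern is a list of (interaction_type, (x, y, z)) tuples.
    A pair (i, j) matches pseudoatom i of pattern_a with pseudoatom j of
    pattern_b and is allowed only if both have the same type. Two pairs
    are compatible if the distance between their pseudoatoms differs by
    at most `tol` between the two patterns. The answer is a maximum
    clique of this compatibility graph (Bron-Kerbosch with pivoting).
    """
    nodes = [(i, j)
             for i, (type_a, _) in enumerate(pattern_a)
             for j, (type_b, _) in enumerate(pattern_b)
             if type_a == type_b]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a pseudoatom may be matched only once
        d_a = dist(pattern_a[i][1], pattern_a[k][1])
        d_b = dist(pattern_b[j][1], pattern_b[l][1])
        if abs(d_a - d_b) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))

    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        pivot = max(candidates | excluded,
                    key=lambda n: len(adj[n] & candidates))
        for n in list(candidates - adj[pivot]):
            expand(clique + [n], candidates & adj[n], excluded & adj[n])
            candidates = candidates - {n}
            excluded = excluded | {n}

    expand([], set(nodes), set())
    return sorted(best)
```

Because the match depends only on interaction types and on distances within each pattern, it is unaffected by how the two complexes are positioned in space and does not require the two binding sites to have the same number of residues.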
Detecting biologically relevant protein-protein interfaces: detectPPI. Protein-protein interfaces (PPIs) represent challenging but very promising targets for drug discovery. [15] PPIs describe a vast unexplored biological space for which small molecular weight modulators [16] are expected to offer very high potency and selectivity profiles. Although PPI modulators have mainly been discovered by biophysics-driven fragment-based approaches, computational chemistry is expected to play a major role in designing future PPI modulators, [17] notably by relying on the huge amount of structural information already available in the Protein Data Bank. To discriminate biologically relevant interfaces from crystallographic artifacts, computational methods are needed to rapidly detect PPIs and predict their biological relevance from a structural point of view. IChemPIC [18] was designed to address this need. The detectPPI module uses the general IChem functions (molecule reader, interaction detection) to detect the interface, identifies the corresponding IPAs, and generates a fixed-length property vector (Figure 5) as input for a Random Forest classifier previously trained on a set of 400 PPIs (200 biologically relevant, 200 irrelevant interfaces). IChemPIC detects both classes with the same accuracy, independently of the size of the PPI. [18] Interestingly, this new IChem module can be used at high throughput to detect biologically relevant PPIs at the PDB scale. Alternatively, the method can be used online (http://bioinfo-pharma.u-strasbg.fr/IChemPIC) by just specifying the four-character PDB code.

[Figure caption, beginning missing: ... [14b] to the human MAP4K4 by graph similarity (GRIMscore) to the X-ray structure of the same kinase with the inhibitor GNE-495 (PDB ID: 4ZK5). The top-ranked pose according to Surflex-Dock (cyan circle) is irrelevant. The two best poses according to GRIM (red circles) are < 1.0 Å RMSD away from the true X-ray pose. All poses are numbered from 1 to 20 according to the Surflex-Dock score.]
In conclusion, IChem is a suite of software dedicated to the analysis and comparison of three-dimensional molecular objects. It converts intricate three-dimensional information into much simpler fingerprints or graphs, thereby enabling high-throughput comparisons and fueling machine learning models for predicting important features such as protein-protein interfaces, druggable cavities, interaction patterns, and binding poses. IChem is available for nonprofit academic research at http://bioinfo-pharma.u-strasbg.fr/labwebsite/download.html.