A pharmacophore-based virtual screening method was developed and validated for use in predicting the function of a novel protein in terms of small metabolite binding. Five test cases were used for the validation study which spanned two different folds, four superfamilies, and three enzyme classes. Binding sites were predicted using a combination of two methods (CASTp and THEMATICS). The binding site was mapped with chemical probes representing hydrogen-bond donor, acceptor, negative ionizable, positive ionizable, and hydrophobe. The interaction maps were converted to three or four feature pharmacophore models and used to search a database containing 80 018 tautomers/protomers/conformers of 10 535 metabolites. The pharmacophore-based virtual screening eliminated >92% of the database as potential substrates and retrieved specific hits, which were ranked using a physics-based scoring function. The known substrate or product was ranked within the top 0.7% and substrate-like compounds within the top 1% of the metabolite database for all of the five test cases. The results suggest that using this pharmacophore-based virtual screening is a time-efficient strategy that can be applied to screen large databases to help predict the function of small metabolite binding proteins.
The Protein Structure Initiative (PSI) proposes to discover the structures of all possible protein folds in nature by experimental means (1). With the structure of at least one representative protein of each fold in hand, the structures of the remaining proteins can be predicted by comparative modeling. After completion of PSI-1 and PSI-2, each of which ran for 5 years, approximately 4910 protein structures had been solved by the PSI centers, of which 56.7% were novel (according to the PSI definition, ‘novel proteins’ are those that have <30% sequence similarity to proteins with known structures) (2). As of February 2012, the Protein Data Bank had 3360 structures with unknown function, of which 2382 structures have only 30% sequence identity to known crystal structures (3). The number of novel proteins is expected to increase. The function of some of these proteins can be unraveled from the metabolites with which they are found to be co-crystallized (4). Function assignment of proteins with >30% sequence similarity is often possible through sequence or structure comparison to proteins with known function. While this may help to assign the function of a number of proteins, the method fails when there are no proteins with similar sequence or structure. Furthermore, sequence/structural comparison for protein function assignment is not completely reliable, because some proteins, despite being structurally similar to a known protein, perform a very different function (5).
To overcome this problem, numerous non-homology-based function assignment methods are being studied, which predict the coarse grained function of a protein by predicting function based on genome context, predicting sub-cellular localization, and predicting function through networks of interaction (6). For a finer grained function prediction such as the chemistry being performed or substrates and products of the protein, other methods need to be used. One approach to the problem is to perform experimental high-throughput screening of the novel protein with various small metabolites (7). However, screening experimentally with tens of thousands of compounds is not only time-consuming, but also involves a considerable monetary investment and is feasible only if clues about the function can help in reducing the number of metabolites to be tested.
Three published virtual screening methods employ docking to help assign enzymatic function. Kalyanaraman et al. (8) performed a validation study on the enolase (ENL) superfamily of α-β barrel enzymes to determine whether virtual screening using docking and rescoring is feasible for finding the protein substrates. They found that for most test cases, their method predicted the known substrates within the top 1% of the results. The authors also performed a collaborative study in which experimental and computational approaches were applied independently and simultaneously to assign function to BC0371, a member of the ENL family that had been misannotated as an l-Ala-d/l-Glu epimerase (9). Homology modeling followed by docking with 420 possible substrates and physics-based rescoring gave results that matched closely with experiments. The true substrate N-succinyl-l-Arg was found at the top of the rank list, and the protein annotation was corrected to N-succinyl-l-Arg racemase.
In another virtual screening method, Hermann et al. (10) demonstrated that docking high-energy transition-state analogs of the substrates to the protein as compared to ground-state substrates improved the ranks of the true substrates. In a further collaborative study, the authors docked the protein Tm0936 (assigned to be in the aminohydrolase superfamily) with high-energy forms of 4207 metabolites containing functional groups recognized by aminohydrolases (11). Of the four top-ranked compounds chosen, three were found to be active substrates in experimental studies.
The above-mentioned studies provide considerable confidence that virtual screening can be used as an effective tool to help assign function to a protein of unknown function, not only through validation but also by finding the function of a novel protein. However, these studies have certain limitations. Each validated and demonstrated its method on a single superfamily of proteins (the ENL and aminohydrolase superfamilies, respectively). The method by Hermann et al. is applicable only when clues to function already exist, because it would be a Herculean task to convert the whole metabolite database into high-energy intermediates for every possible reaction. A further limitation is that the studies assume knowledge of the binding site. This luxury is not available for all proteins solved by PSI efforts. For example, the crystal structure of protein Aq1575 has been solved and classified under domain of unknown function with a novel fold (12). Sequence alignment with homologous sequences and mapping of conserved residues to the surface identified two hypothetical active sites. There are many other such examples in the recent literature. A recent study revealed that approximately 10% of the superfamilies have unknown functions (13). Thus, it is important to extend the virtual screening method to the prediction of function when no clues apart from the structure exist. Previous studies demonstrated the use of docking as a virtual screening tool. It was therefore of interest to test the use of pharmacophore modeling for retrieving substrates from a database of metabolites. A recent study presents a detailed comparison of the two methods, docking and pharmacophore screening, and concludes that each method has its own strengths (14). The strength of pharmacophore modeling is that it is a more intuitive and simplified procedure for finding compounds with the chemical features responsible for binding.
Here, we report a computational method that predicts the protein binding site from the protein structure alone, so that no prior knowledge of the binding site is necessary. This is followed by structure-based pharmacophore modeling and database screening to aid in the identification of unknown protein function. The aim was to develop a method that starts with any protein of unknown function, not necessarily belonging to a known family, and screens a database of metabolites, followed by scoring, to yield a narrow list of compounds that can be experimentally tested. Toward this, five test cases were chosen that span two folds, namely the TIM α/β barrel and arginase (ARG)/deacetylase; four superfamilies, namely metallo-dependent hydrolase, aldolase (ALD), ENL C-terminal domain-like, and ARG/deacetylase; and three enzyme classes, namely hydrolase, lyase, and isomerase. Results indicate that pharmacophore models can be successfully used to retrieve substrates from a large pool of metabolites. Ranking based on calculated binding energies placed all of the substrate-like compounds for all five test cases very near the top of the database, with known substrates ranked very highly (within the top 0.7%). Our method makes a timely contribution to the recently formed ‘Enzyme Function Initiative (EFI)’, whose goal is to assign function to enzymes of unknown function. The EFI is developing a multidisciplinary strategy to achieve this goal with the central aim of computational prediction of enzyme substrates, further enhancing the significance of our pharmacophore-based method (15).
Materials and Methods
Figure 1 illustrates the overall approach of the method reported here. The first step is predicting the binding site. As no single method has 100% predictivity, this was accomplished through the use of two different methods: a geometric method, CASTp (16), and a biophysical method, THEMATICS (17). Results from both methods were used to predict binding pockets. Once the binding pocket was chosen, the GRID (18) program was used to map energetically favorable positions for various functional groups complementary to the binding pocket. Based on results from the GRID mapping, binding positions for hydrogen-bond donors, acceptors, negative ionizable, positive ionizable, and hydrophobic groups complementary to the receptor were converted into pharmacophore features. Pharmacophore models were built using a combination of favorable pharmacophore features for each binding site and were used to screen a database of metabolites (80 018 conformers of 10 535 metabolites). Structure-based filters were used to eliminate metabolites that do not fit the binding site and to enrich the results with true substrate-like compounds. The final results were ranked by a physics-based scoring function.
Proteins and their co-ordinates
Five proteins were used as test cases. The structures 1ADD (19) (resolution 2.4 Å), 4ALD (20) (resolution 2.8 Å), 4CEV (21) (resolution 2.7 Å), 7ENL (22) (resolution 2.2 Å), and 1MDR (23) (resolution 2.1 Å) were used for adenosine deaminase (ADA), fructose-1,6-bisphosphate ALD, ARG, ENL, and mandelate racemase (MDR), respectively. All five are crystal structures of the holo enzymes, crystallized with substrate, product, or inhibitor and the corresponding metal cofactor, if any. 1ADD has been solved with Zn2+ as well as the ground-state analog 1-deazaadenosine in the active site. 4ALD has been solved with the open-chain form of the substrate fructose-1,6-bisphosphate in the active site. 4CEV has been solved with Mn2+ and the product ornithine in the active site. 7ENL has been solved with Mg2+ and the substrate 2-phosphoglycerate in the active site. 1MDR has been solved with the irreversible inhibitor atrolactic acid (the 2-methyl analog of the substrate) and Mg2+ in the active site.
Determine binding site of the enzyme
For all test cases, the protein was stripped of the substrate/inhibitor; however, any metal ion was retained. These co-ordinates were submitted to the CASTp (16) server for prediction. The results were analyzed for pockets on the surface of the protein. The protein co-ordinates (with metal ions) were also used to calculate the pKa’s for all of the side chains of the ionizable residues in the molecular systems using the pKa procedure implemented in the UHBD program (24). The calculations were performed with a dielectric constant of 80 for water and 20 for the protein. The ionic strength was 150 mM and the ionic radius (for the Stern layer) was 2.0 Å. Titration curves (charge as a function of pH) were calculated for all the side chains of the ionizable residues using the HYBRID program (25). According to the THEMATICS method (17), certain residues do not follow the standard Henderson-Hasselbalch curve and instead display a flat region where they remain partially protonated over a wide pH range; such residues correlate with being present in the binding site.
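To make the reference curve concrete, the following is a minimal sketch of the ideal Henderson-Hasselbalch titration of a single, non-interacting site, together with a crude measure of how wide the partially charged region is. The function names, the 10%/90% bounds, and the width measure are illustrative assumptions; this is not the statistical criterion used by THEMATICS, nor the UHBD/HYBRID calculation.

```python
def hh_protonated_fraction(pH, pKa):
    """Protonated fraction of an ideal, non-interacting titratable site."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def mean_charge(pH, pKa, is_acid):
    """Mean charge: acids go 0 -> -1 on deprotonation, bases go +1 -> 0."""
    f = hh_protonated_fraction(pH, pKa)
    return f - 1.0 if is_acid else f


def partially_charged_width(pHs, charges, lo=0.1, hi=0.9):
    """pH span over which |charge| lies strictly between lo and hi.

    The bounds are assumed for illustration. An ideal curve gives a span
    of about 1.9 pH units; a perturbed curve with a flat, partially
    protonated region would give a noticeably larger span.
    """
    inside = [p for p, q in zip(pHs, charges) if lo < abs(q) < hi]
    return (max(inside) - min(inside)) if inside else 0.0


# Ideal reference curve for a hypothetical acidic side chain with pKa 7.0
pHs = [i * 0.01 for i in range(1401)]  # pH 0.00 to 14.00
ideal = [mean_charge(p, 7.0, is_acid=True) for p in pHs]
ideal_width = partially_charged_width(pHs, ideal)
```

Comparing the span of a computed titration curve against `ideal_width` gives a rough, qualitative indication of whether a residue titrates over an unusually wide pH range.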
Database of metabolites
Metabolites (10 535) were obtained from the KEGG LIGAND database (26) (ftp://ftp.genome.ad.jp/pub/kegg/ligand/). ‘R’ groups in compounds were replaced with a methyl group, and polymers were removed, retaining only dimers and trimers. All possible stereoisomers and tautomers of the compounds were generated using an evaluation version of STERGEN and TAUTOMER from Molecular Networks GmbH. 3D structures for each compound were generated using CATALYST (27). Finally, a 3D searchable database of 80 018 compounds (∼8 conformers per compound) was obtained.
Mapping the binding site with chemical groups as probes
The GRID software (18) was used to evaluate energetically favorable positions in the binding site. The binding sites identified from CASTp and THEMATICS, in each of the five test cases, were subjected to GRID mapping. A grid was generated around each binding site such that it encompassed all atoms of the binding pocket. The following probes were used in the GRID mapping: amino, carbonyl, carboxyl, dry, hydroxyl, and methyl. A grid spacing of 0.33 Å was used. Lone pairs and tautomeric hydrogens were allowed to move in response to the probe; groups such as sp3 hydroxyls and sp2 amines were also allowed to move to make better hydrogen bonds with the probe.
Generation of pharmacophore models
Results from the hydroxyl probe were studied in the context of the binding site, and clusters complementary to any hydrogen-bond acceptors in the protein were converted into pharmacophore positions for hydrogen-bond donors. Similarly, results from the carbonyl and hydroxyl probes were studied to find clusters complementary to any hydrogen-bond donors in the protein binding site, and these positions were converted to pharmacophore positions for hydrogen-bond acceptors. Results from the methyl and dry probes were used to generate hydrophobic pharmacophore features. Results from the carboxyl probe were used to generate negative ionizable features, and the amino probe results were used to generate positive ionizable features. The radius of each feature was set to the maximum possible value such that the feature covers the cluster of probes from the GRID results and does not clash with any of the binding site atoms. These features were used in varying combinations to yield a pharmacophore model for each test case. For all five test cases, the pharmacophore models each had three or four features. The fit value quantifies how well a retrieved ligand matches the features of a given pharmacophore model. For three-feature models, a minimum fit value of 2.5 was defined, and for four-feature models, the minimum fit value was 3.0. This ensures that the compounds obtained from virtual screening reasonably represent the pharmacophore models.
Screening the database and filtering the results
The compounds obtained from virtual screening are based on a match of chemical features with the pharmacophore model, in the absence of actual protein atoms. To mimic the presence of protein atoms, three structure-based filters were used to sift through the pharmacophore model search results. For this purpose, the binding site atoms identified from CASTp and THEMATICS were used. The first filter was the length of the binding site. The length was measured as the distance between the two furthest atoms in the ligand binding site, using a Tcl script in VMD (28). Similarly, the length of each metabolite was calculated as the longest distance between any two of its atoms. Compounds longer than the binding site were eliminated. The second filter was the solvent accessible surface area (SASA) of the metabolites. The SASA of the protein binding site and of the metabolites was calculated with VMD (28). Metabolites with a SASA larger than that of the binding site were removed from the search results. The third filter eliminated metabolites that had steric overlap between atoms of the ligand and the binding site. This calculation was performed using a Tcl script in VMD. The distance between every atom of the ligand and every atom of the binding site was calculated; if even one of the distances was <1.4 Å, the compound was eliminated as having a vdW clash with the binding site.
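The logic of the three filters can be sketched as follows. This is an illustrative Python version, not the Tcl/VMD scripts used in the study; the function names are hypothetical, coordinates are assumed to be plain (x, y, z) tuples in Å, and the SASA values are assumed to have been computed elsewhere and are simply passed in.

```python
from itertools import combinations
from math import dist

CLASH_CUTOFF = 1.4  # Å, the clash threshold stated in the text


def max_extent(coords):
    """Longest distance between any two atoms (0.0 for fewer than two atoms)."""
    return max((dist(a, b) for a, b in combinations(coords, 2)), default=0.0)


def has_clash(ligand, site, cutoff=CLASH_CUTOFF):
    """True if any ligand atom lies closer than `cutoff` to any site atom."""
    return any(dist(a, b) < cutoff for a in ligand for b in site)


def passes_filters(ligand, site, ligand_sasa, site_sasa):
    """Apply the length, SASA, and steric-overlap filters in turn."""
    if max_extent(ligand) > max_extent(site):  # filter 1: too long
        return False
    if ligand_sasa > site_sasa:                # filter 2: too much surface
        return False
    if has_clash(ligand, site):                # filter 3: vdW clash
        return False
    return True


# Toy coordinates, chosen only to exercise each filter
site = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
small_ligand = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]
long_ligand = [(0.0, 0.0, 5.0), (20.0, 0.0, 5.0)]
clashing_ligand = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]
```

The all-pairs distance loops are quadratic in the number of atoms, which is acceptable for metabolite-sized ligands and a single binding pocket.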
Ranking by binding energy
There were multiple conformers of each compound; thus, a particular metabolite could be retrieved multiple times in various orientations. Binding energies were calculated for each of the orientations remaining after the filtering, using a physics-based scoring method developed by Kalyanaraman et al. (8).
Hydrogens and MMFF-type partial charges for the metabolites were added using tools from OpenEye. File formats compatible with CHARMM were generated using an in-house program. The protein was converted to Merck format using CHARMM. For each test case, all the protein-metabolite complexes, the protein, and the metabolites were energy minimized for 100 SD steps in a Born solvent using the Merck molecular force field in CHARMM. For energy minimization of the protein-metabolite complexes, all atoms of the binding site and the metabolites were flexible while the rest of the protein was held rigid; the energy was reported as Elig-prot. For energy minimization of the protein, all atoms of the binding site were mobile while the rest of the protein was held rigid; the energy was reported as Eprot. Energy minimization of the metabolites was performed with no constraints, and the energy was reported as Elig. To approximately account for the loss of ligand entropy upon binding, a penalty term proportional to the number of rotatable bonds in the metabolite was included. The best-ranked pose for each ligand was considered, and the percent rank was assigned with respect to the total number of metabolite conformers (80 018).
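The bookkeeping described above can be sketched as follows. The text does not spell out how the three energies are combined, so the usual difference Elig-prot - Eprot - Elig is assumed here, and the penalty coefficient is a hypothetical placeholder rather than the value used in the scoring function of reference (8).

```python
DB_SIZE = 80018  # total number of metabolite conformers in the database


def binding_energy(e_lig_prot, e_prot, e_lig, n_rot, penalty_per_bond=0.5):
    """Assumed form: complex minus separated parts, plus an entropy penalty.

    penalty_per_bond is a hypothetical coefficient; the text states only
    that the penalty is proportional to the number of rotatable bonds.
    """
    return e_lig_prot - e_prot - e_lig + penalty_per_bond * n_rot


def best_pose_per_ligand(poses):
    """Keep the lowest (most favorable) energy for each ligand name.

    `poses` is an iterable of (ligand_name, energy) pairs.
    """
    best = {}
    for name, energy in poses:
        if name not in best or energy < best[name]:
            best[name] = energy
    return best


def percent_rank(rank, db_size=DB_SIZE):
    """Express a 1-based rank as a percentage of the database size."""
    return 100.0 * rank / db_size
```

For example, a rank of 207 corresponds to `percent_rank(207)`, roughly 0.26% of the 80 018 conformers.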
Determining success of the method
For each test case, metabolites similar to the known substrates, with a Tanimoto coefficient ≥0.85, were identified as substrate-like. Success was measured as the ability of the method to rank as many of the substrate-like compounds as possible near the top of the database. Two further criteria were considered in evaluating the success of the method. The first is the ranking of compounds containing the functional group that is involved in catalysis, and the second is the number of top-ranked compounds sharing the major scaffold of the substrate.
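The similarity criterion can be illustrated with a minimal sketch. The Tanimoto coefficient is the number of shared fingerprint features divided by the number of features present in either molecule. The fingerprint type used in the study is not stated, so here a fingerprint is simply assumed to be a set of integer feature identifiers, and the names below are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of features."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def substrate_like(query_fp, library, threshold=0.85):
    """Names of library compounds whose similarity to the query meets the threshold.

    `library` maps compound name -> fingerprint (a set of feature ids).
    """
    return [name for name, fp in library.items()
            if tanimoto(query_fp, fp) >= threshold]


# Toy fingerprints, for illustration only
query = set(range(20))
library = {
    "close_analog": set(range(19)),    # 19 shared / 20 in union = 0.95
    "distant": set(range(10, 40)),     # 10 shared / 40 in union = 0.25
}
hits = substrate_like(query, library)
```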
Adenosine deaminase is a zinc metalloenzyme that catalyzes the irreversible hydrolysis of adenosine and 2′-deoxyadenosine to inosine and 2′-deoxyinosine, respectively. Analysis by CASTp revealed that the largest predicted pocket corresponds to the binding site found in the co-crystal structure of ADA with the ground-state analog 1-deazaadenosine (19). The titration results for ADA revealed nine residues with perturbed titration curve shape (Figure 2). THEMATICS identified residues such as Asp19, Asp296, Asp307, Cys262, Glu217, His238, Tyr240, and Tyr249 as having unusual titration behavior. Residues Glu217, His238, Asp19, and Asp296 have side chains that are part of the binding site surface. Residues Tyr240, Cys262, and Asp307 are second-shell residues of the binding site. Tyr249 is located much further away and is a false positive. Most predicted residues cluster in and around the true binding site. Thus, there is consensus between the two methods in the prediction of the binding site.
From the results of the GRID mapping of the ADA binding site, there are two positions for hydrogen-bond donation, one position for a hydrogen-bond acceptor, and one position for hydrophobic interaction. The pharmacophore model was built using all four features (Figure 3). Substrate-like compounds occupied 0.9% of the metabolite database (colored red in the pie chart of Figure 4). Thus, the pharmacophore model used for virtual screening must specifically retrieve substrate-like compounds even though these compounds represent a small percentage of the database. The pharmacophore model was used to search the database of metabolites (80 018) using a minimum fit value of 3.0 (maximum 4.0) and yielded a total of 20 626 metabolite conformations (Table 1). Approximately 74.3% of the database (80 018) was eliminated, while 31 of 38 (81.5%) substrate-like metabolites were retained. This demonstrates the success of the pharmacophore model in retrieving substrate-like compounds specific to ADA. The three structure-based filters were then applied (Table 2). There were 3836 compounds from the search results of the pharmacophore model whose SASA was larger than that of the binding site; these were eliminated. The calculated length of the ADA binding site revealed that compounds no longer than 20.4 Å can fit in the binding site. There were 4093 compounds from the search results whose length exceeded 20.4 Å and 6695 compounds that had steric overlap with the binding site atoms; these were also eliminated. After application of the three filters, 826 unique compounds remained, representing an elimination of 92.2% of the metabolite database. Thirty of the 31 substrate-like compounds were retained after the application of the structure-based filters (one was removed by the clash filter), a success rate of 96.8%.
Table 1. Compounds retrieved from pharmacophore-based database screening. The total number of hits represents unique compounds retrieved in multiple conformations. The number of substrate-like compounds retrieved is shown relative to the total number in the database. [Table body not recovered; surviving column header: Number of hits.]
Table 2. Compounds retrieved by a pharmacophore model were filtered through the application of structure-based filters. Structure-based filters eliminate compounds that do not fit the binding pocket. The total number of conformers retained as well as the number of unique compounds is given. The last column shows the number of substrate-like compounds retained out of those retrieved by the pharmacophore model. [Table body not recovered; surviving column headers: Results from pharmacophore search; Results after three filters; Unique compounds retained; Substrate-like compounds retained.]
The 826 unique compounds were found in various conformations by the pharmacophore model; thus, 6002 conformers were ranked by the binding energy method. According to the binding energy results (Table 3), all of the substrate-like compounds (31) ranked within the top 1% of the database. The true substrate, adenosine, is found in the top 0.2% and the product in the top 0.3% of the database.
Table 3. Ranks of substrate-like compounds, including known substrates, calculated using a physics-based scoring function. The known substrate or product is highlighted in bold. [Table body not recovered; surviving column headers: Name of compound; Percentage of database. Surviving row labels: Inosine pranobex (Product); Sedoheptulose 1,7-bisphosphate (Analog of open-chain form of substrate).]
The RMSD over common atoms between the predicted substrate binding pose and the 2-methyl analog of the co-crystallized substrate was 1.9 Å. The predicted pose has the carboxylate tilted and not appropriately placed for His297 or Lys166 to abstract a proton for the deamination reaction. The pharmacophore method was partially successful in predicting the correct orientation of the substrate in the active site (Figure S1). The structures of the top-ranked compounds have either the substrate-like amino functional group, which can undergo a de-amination reaction, or the product-like carbonyl or hydroxyl group. The common scaffold in most of the top-ranked compounds turned out to be a phosphorylated ribonucleoside (Table S1). The chemical similarity plot (Figure 4) reveals that the physics-based scoring function succeeded in enriching substrate-like compounds at the top of the database illustrating a very strong trend. The pharmacophore-based virtual screening successfully predicted the actual substrate moiety for the ADA test case.
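For reference, an RMSD over common atoms can be computed as in the following minimal sketch. It assumes the two poses are already in the same protein reference frame, so no superposition is performed, and that the atoms have been matched so that index i in one list corresponds to index i in the other. The function name and the input format are illustrative.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(total / len(coords_a))
```

Only atoms present in both molecules enter the sum, which is why a value can be reported even when the predicted ligand and the co-crystallized ligand differ by a substituent.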
Fructose-1,6-bisphosphate ALD catalyzes the reversible cleavage of fructose-1,6-bisphosphate to dihydroxyacetone phosphate and glyceraldehyde-3-phosphate. The largest pocket predicted by CASTp corresponds to the binding site indicated by the crystal structure (20). The titration results from THEMATICS for ALD reveal four residues with a perturbed titration curve shape (Figure 2). The residues exhibiting unusual titration are Cys177, Glu187, Glu189, and Lys146. Three of these residues form a cluster that is part of the binding site, while Cys177 is a false-positive result. Thus, there was consensus between the two methods in predicting the ALD binding site.
From the results of the GRID mapping of the ALD binding site, there were two positions for hydrogen-bond acceptors and one position for a negative ionizable probe. The pharmacophore model was built using three pharmacophore elements: two hydrogen-bond acceptors and a negative ionizable feature (Figure 3). Substrate (fructose-1,6-bisphosphate)-like compounds occupied 2.4% of the metabolite database (colored red in the pie chart of Figure 4). The pharmacophore model was used to search the database of metabolites (80 018) using a minimum fit value of 2.5 (maximum 3.0). The search retrieved a total of 18 446 compounds (Table 1), eliminating 77.0% of the database (80 018). The reduction nevertheless retained 95.4% (21 of the 22) of the substrate-like metabolites, demonstrating the success of the pharmacophore model in retrieving specific hits. Structure-based filters were applied to eliminate compounds that do not fit the binding site (Table 2). The computed length of the ALD binding site was 22.6 Å. There were 2597 compounds from the search results of the pharmacophore model whose length was larger than that of the binding site. One compound had a larger SASA than the binding site and was eliminated; very few compounds were eliminated by the SASA filter in this test case because the binding site of ALD is very large. A further 8987 compounds had vdW clashes with binding site atoms and were eliminated. After application of the three filters, 714 unique compounds were retained, representing a reduction of 93.2% of the metabolite database. Eighteen of 22 substrate-like compounds were retained.
The 714 unique compounds were identified in various conformations; thus, 4943 conformers were ranked using the binding energy method. All 18 substrate-like compounds ranked within the top 0.7% of the database (Table 3). The true substrate, fructose-1,6-bisphosphate, is at a rank of 207, or 0.3% of the database. The closest analog of the open-chain form of the substrate (sedoheptulose-1,7-bisphosphate) was ranked at 0.05% of the database. These results are very promising.
The closed form of the predicted substrate pose has the phosphate groups occupying the binding sub-site similar to that of the open-chain form as seen in the co-crystallized substrate (20). The C4 hydroxyl group of the substrate is within hydrogen-bonding distance from the catalytically important residue Lys146 (2.6 Å) (Figure S1). The predicted substrate pose was consistent with the proposed reaction mechanism (20).
Only one of the top ten compounds has a carbonyl or epoxy oxygen adjacent to a hydroxyl group that could participate in the aldol reaction (Table S1). A search of the database for compounds with a carbonyl or epoxy oxygen adjacent to a hydroxyl group retrieved 13 compounds, but only one was similar to the substrate. This is one of the limitations of the method: a significant number of the actual substrates and substrate-like compounds must be present in the database for the major scaffold of the true substrate to be identified from the top-ranked compounds. Nevertheless, we can infer from the pharmacophore-based virtual screening that the actual substrate is either a pentose or a hexose bisphosphate. Further, the compounds ranked first, sixth, and seventh are analogs of phosphoinositol, and studies show that ALD binds inositol (1,4,5)triphosphate (29). The second-ranked compound was dihydroneopterin, which is known to be cleaved by a related family member, dihydroneopterin ALD. The compound ranked fifth is a mannosamine analog known to bind N-acetyl-d-neuraminic acid ALD (30). The top-ranked compound with a substrate-like scaffold was found at position 38 (Table S1). Thus, the top-ranked compounds have a scaffold close either to the substrate or to compounds known to bind to ALD. The chemical similarity plot (Figure 4) illustrates a trend toward enrichment of substrate-like compounds at the top of the database.
Arginase catalyzes the Mn2+-dependent hydrolysis of arginine to ornithine and urea. Thirty-five pockets were identified on the surface of the ARG by CASTp. The known pocket is 7th largest; however, this pocket is not predicted completely and only part of the substrate is sticking into the pocket. By visual inspection, however, one can find more residues that surround the substrate and form a pocket. There were seven residues with perturbed titration curves for the ARG test case from the THEMATICS analysis (Figure 2). The residues exhibiting unusual titration are Asp98, Glu181, Glu268, Glu271, His99, His222, and Tyr160. Residues Glu181, Glu271, and His99 are part of the binding site surface. Residues Glu268 and Asp98 are part of second shell; His222 is a little further away. Tyr160 is a false positive. Thus, most predicted residues do indeed cluster in and around the binding site. However, in the case of ARG, CASTp points to a different pocket, while THEMATICS points to the correct pocket.
From the results of the GRID mapping of the ARG binding site, there are two positions for positive ionizable groups and one for a hydrogen-bond acceptor. The pharmacophore model was built using two positive ionizable features and one hydrogen-bond acceptor feature (Figure 3). A minimum fit value of 2.5 (maximum 3.0) was defined for the pharmacophore model. Searching the database of metabolite conformers (80 018) yielded a total of 3361 compounds (Table 1). This resulted in an elimination of 95.8% of the metabolite database (80 018). The reduction nevertheless retained 20 of the 23 (86.9%) substrate-like metabolites. The pharmacophore model successfully retrieved substrate (l-arginine)-like compounds, which occupy 0.1% of the database, demonstrating the specificity of the model. The ARG binding site length was calculated to be 15.4 Å; 1101 compounds exceeded the maximum length of the ARG binding site and were eliminated. There were 1841 compounds from the search results of the pharmacophore model whose SASA was larger than that of the binding site (Table 2). Sixty-eight compounds exhibited steric overlap with the binding site atoms and were eliminated. After application of the structure-based filters, 87 unique compounds were retained, representing a 99.1% reduction of the metabolite database. All 20 substrate-like compounds retrieved by the pharmacophore model were retained.
The 87 unique compounds were found in various conformations; thus, 351 conformers were ranked by the binding energy method. The actual substrate (l-arginine) is found at a rank of 51, which is in the top 0.06% of the database, and the product (l-ornithine) at rank 16, which is in the top 0.02% of the database. Indeed, all 22 substrate-like compounds were ranked within the top 0.08% of the database, which is remarkable. Most of the top-ranked compounds have the arginine scaffold (Table S1). Comparison of the substrate (l-arginine) binding pose derived from the alignment with the pharmacophore model against the product ornithine in the crystal structure reveals that the pharmacophore-based method was able to predict the substrate pose at the same position as the product (Figure S1). However, closer analysis of the orientation reveals that the substrate is rotated by 180°, with the guanidine group positioned away from the manganese ions. This orientation does not enable the guanidine group to have hydrogen-bonding interactions with Glu271. Instead, the guanidine group is in a favorable orientation to interact with Glu181 (1.8 Å). The hydrogen bonding is a crucial step for ARG to catalyze the removal of water and formation of the products ornithine and urea. The carboxylate group of the substrate might have favored an orientation closer to the manganese ions over the hydrogen bonding of the guanidine with Glu271. The ammonium group of the arginine is stabilized by hydrogen-bonding interactions with His139 (3.3 Å from ND1). The presence of multiple charged groups in the pharmacophore model, together with many complementary residues in the ARG binding site that favorably interact with the chemical moieties on the substrate, is the reason for the incorrect prediction of the binding pose.
The first two steps of our method, that is, pharmacophore-based screening followed by filtering using structure-based filters, retrieved compounds most of which have the arginine scaffold. We were left with a very small number of compounds, all of which were ranked within the top 0.08% of the database and most of which contain the substrate moiety (Table 3). Hence, the chemical similarity plot was not provided for this test case. The presence of two charged features was the reason that the pharmacophore model retrieved very specific hits. This demonstrates that the method will be very successful in retrieving specific hits for proteins with multiple charged complementary features in the binding site.
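The percentages quoted for the ARG test case follow from dividing a compound's rank by the size of the conformer database. The short sketch below reproduces that arithmetic; the function name is illustrative and not taken from the original workflow.

```python
# Size of the searched database (tautomers/protomers/conformers), from the text.
DATABASE_SIZE = 80_018


def rank_to_percent(rank: int, database_size: int = DATABASE_SIZE) -> float:
    """Express a rank as a percentage of the database size."""
    return 100.0 * rank / database_size


# Ranks reported for the ARG substrate and product.
for name, rank in [("l-arginine", 51), ("l-ornithine", 16)]:
    print(f"{name}: rank {rank} -> top {rank_to_percent(rank):.2f}% of the database")
```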
Enolase is involved in the dehydration of 2-phosphoglycerate to phosphoenol pyruvate. The largest pocket predicted by CASTp is the binding site as found in the co-crystal structure of ENL with the substrate 2-phosphoglycerate (22). The titration results for ENL reveal seven residues with perturbed titration curve shapes (Figure 2). The residues exhibiting unusual titration curves based on a THEMATICS analysis are Arg374, Asp396, Asp321, Glu168, His373, Lys345, and Tyr289. Tyr289 is far from the binding site and is a false positive. All of the other residues form a cluster and are part of the binding site surface. Thus, there is consensus between CASTp and THEMATICS in the prediction of the binding site.
From the results of the GRID mapping of the ENL binding site, there were two positions for hydrogen-bond acceptors and one position for a negative ionizable feature. The pharmacophore model was built using a combination of the three pharmacophore elements: two hydrogen-bond acceptors and one negative ionizable feature (Figure 3). A total of 22 632 compounds were retrieved (Table 1), eliminating 71.7% of the database (80 018) and retaining all five substrate-like compounds, which demonstrates specificity of the pharmacophore model. Inspection of the ENL binding site revealed that compounds of not more than 17.6 Å could fit in the binding site. There were 5612 compounds from the pharmacophore model search results whose length was larger than 17.6 Å and 1173 compounds whose SASA was larger than that of the binding site, which were eliminated. The clash filter revealed that 12 305 compounds had vdW clashes with binding site atoms and were removed. At the end of using three filters, there were 973 unique compounds in the search results, representing a 90.7% elimination of the metabolites in the database. All five of the substrate-like compounds were retained.
The 973 unique compounds were found in various conformations; thus, 3542 conformers were ranked by the binding energy method. The actual substrate 2-Phospho-d-glycerate is found at the top 0.7% of the database, while the product phosphoenol pyruvate is found at 0.3% of the database. All five substrate-like compounds were ranked within 0.8% of the database (Table 3).
The predicted substrate pose partially replicated the co-crystal structure with an RMSD of 2.9 Å (Figure S1). The phosphate group of the predicted pose overlays well with that in the crystal structure. The carboxyl group of the predicted pose is rotated by 90° compared to the crystal structure and is placed favorably to co-ordinate with the magnesium ion. The location of the carboxylate and hydroxymethyl groups in the crystal structure has been questioned by Larsen et al. (31). The argument is that the crystal structure lacked the other Mg2+ ion and exhibited poor electron densities for the two groups. The proposed mechanism involves co-ordination of the carboxylate group with the two magnesium ions, while the hydroxyl group interacts with a glutamate. The predicted pharmacophore pose is consistent with the proposed mechanism. The carboxylate of the pharmacophore pose was close to the magnesium, and the hydroxyl group was within hydrogen-bonding distance of Glu211.
For ENL to catalyze removal of water, the substrate is required to have a β-hydroxy carboxylate. We analyzed the ranking of all of the compounds having the β-hydroxy carboxylate group. All of the 61 compounds containing a β-hydroxy carboxylate were ranked within the top 0.9% of the database. The pharmacophore-based virtual screening approach was thus able to predict substrates with the required functional groups for the ENL test case. Analysis of the top ten-ranked compounds revealed that they contain one or more phosphates, are polyhydroxylated, and some contain a carboxylate group, all of which are consistent with the substrate for ENL, 2-phosphoglycerate. Compounds ranked at three, nine, and ten are sugar acids, and ENL is known to bind erythronic acid 3-phosphate (32). Other highly ranked compounds are not similar to the substrate and may be either ENL inhibitors or false positives. For instance, a recent study shows that the antimalarial drug mefloquine inhibits ENL although it does not resemble the substrate (33). Thus, the top-ranked compounds that are not similar to the substrate might potentially bind ENL. The top-ranked substrate-like compound was found at position 84 (Table S1). Enrichment of substrate-like compounds was evident from the chemical similarity plot, which displays a correlation between decreasing Tanimoto coefficient and ranking of the compound (Figure 4).
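For reference, the Tanimoto coefficient used in the chemical similarity plot is the ratio of shared to total features between two fingerprints. The sketch below is a minimal illustration on sets of "on" bit indices; the fingerprints are made-up values, and the fingerprint type actually used in the study is not stated in this section.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        # Two empty fingerprints: similarity is undefined; return 0.0 by convention here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints, for illustration only.
substrate_bits = {1, 4, 7, 9}
candidate_bits = {1, 4, 9, 12}

# 3 shared bits out of 5 distinct bits in total.
print(tanimoto(substrate_bits, candidate_bits))
```

A value of 1.0 means identical fingerprints, and values near 0 indicate little shared substructure.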
Mandelate racemase catalyzes the interconversion of R and S enantiomers of mandelic acid in the presence of Mg2+. The largest pocket predicted by CASTp is the true binding pocket as seen in the co-crystal structure with the inhibitor atrolactic acid (23). Lys164, Tyr54, and Tyr137 are the three residues with unusual titration curves according to THEMATICS, for the mandelate racemase test case (Figure 2). Residues Lys164 and Tyr54 are part of the binding site surface. Residue Tyr137 is a second shell residue. All predicted residues cluster in and around the binding site. Thus, there is consensus between the two methods for the prediction of the mandelate racemase binding site.
From the results of the GRID mapping of the mandelate racemase binding site, there was a position for a hydrogen-bond donor, a position for a negative ionizable feature, and one position for a hydrophobic feature. The pharmacophore model was built using a combination of three pharmacophore elements, namely one hydrogen-bond donor, one negative ionizable, and one hydrophobic feature. The pharmacophore model was used to search the database of the metabolites (80 018) with a minimum fit value cut off of 2.5, which yielded a total of 13 092 compounds (Table 1). This eliminated 83.6% of the database (80 018), and the pharmacophore model retrieved all of the substrate (S-Mandelate) compounds, which occupied the top 0.2% of the database (Figure 4). This demonstrates high specificity of the mandelate racemase pharmacophore model. Application of the length filter removed 5366 compounds as their lengths exceeded the maximum binding site length of 16.1 Å. There were 2615 compounds from search results of the pharmacophore model whose SASA was larger than that of the binding site, and 3245 compounds had steric overlap with binding site atoms and were eliminated. After application of the three filters, 580 unique compounds remained, which represents elimination of 94.5% of the metabolite database. For the mandelate racemase test case, all fourteen substrate-like compounds were retained by the three filters.
The 580 unique compounds were found in various conformations; thus, 1866 conformers were ranked by the binding energy method. The true substrate and product ranked in the top 0.5% of the database.
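The counts reported for the ARG, ENL, and mandelate racemase test cases can be cross-checked with simple bookkeeping. The sketch below assumes that the numbers removed by the length, SASA, and clash filters are disjoint (that is, the filters act one after another), which is an interpretation of the text rather than something stated explicitly; under that assumption the remainder equals the number of conformers passed to ranking in each case.

```python
DATABASE_SIZE = 80_018  # conformers in the metabolite database

# Counts taken from the text: pharmacophore hits, compounds removed by each
# structure-based filter, and conformers finally ranked.
CASES = {
    "ARG": {"hits": 3361, "length": 1101, "sasa": 1841, "clash": 68, "ranked": 351},
    "ENL": {"hits": 22632, "length": 5612, "sasa": 1173, "clash": 12305, "ranked": 3542},
    "MDR": {"hits": 13092, "length": 5366, "sasa": 2615, "clash": 3245, "ranked": 1866},
}


def remaining_after_filters(case: dict) -> int:
    """Hits left after subtracting the three (assumed disjoint) filter removals."""
    return case["hits"] - case["length"] - case["sasa"] - case["clash"]


def percent_eliminated(hits: int, database_size: int = DATABASE_SIZE) -> float:
    """Percentage of the database eliminated by the pharmacophore search alone."""
    return 100.0 * (1.0 - hits / database_size)


for name, case in CASES.items():
    left = remaining_after_filters(case)
    print(f"{name}: {percent_eliminated(case['hits']):.1f}% eliminated by the "
          f"pharmacophore search, {left} conformers left for ranking")
```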
The pharmacophore-based method was able to predict the substrate pose at the same position as the inhibitor, atrolactic acid, in the co-crystal structure. The carboxylate group is closer to the magnesium ion, which is consistent with the crystal structure, but it is tilted away and favors a hydrogen-bonding interaction with the side chain of Asn197 (2.3 Å from ND2). Thus, it is not appropriately placed for enabling His297 or Lys166 to abstract a proton (34). The hydroxyl group is appropriately placed to have a hydrogen-bonding interaction with Asn197 (2.9 Å from OD1). The pharmacophore pose was partially successful in predicting the correct orientation of the substrate in the active site (Figure S1). The results for the mandelate racemase test case can be compared to a docking study by Kalyanaraman et al., in which the predicted docked pose replicated the co-crystal pose of the inhibitor atrolactic acid. However, in their study, His297 (δ and ε nitrogens) and Glu317 were protonated, based on prior knowledge about the mandelate racemase catalytic mechanism. In our study, pKa prediction did not show any shift in pKa values for these two residues, and hence they were not protonated.
The relatively poor performance of the pharmacophore method for the mandelate racemase test case can be attributed to the nature of the substrate. The substrate (2-hydroxy-2-phenylacetic acid) for mandelate racemase is small and any slight alteration in the co-ordinates of the pharmacophore features has an impact on the disposition of its important functional groups. Lack of protein flexibility in the pharmacophore-based method is a drawback in such cases. This limitation in the method can be overcome by running molecular dynamics simulations, which may enable the substrate and enzyme to adjust. A slight reorientation of the substrate may appropriately place the substrate in the correct orientation.
Nevertheless, it is encouraging that all of the substrate-like compounds are within the top 0.6% of the database (Table 3). In addition to this, there are eleven compounds with an α-hydroxy carboxylate group required for catalysis by mandelate racemase. All eleven were ranked within 0.6% of the top of the database, three of which are ranked within top 10 compounds (Table S1). However, no major scaffold similar to known substrates or inhibitors was identified from the top ten-ranked compounds for the mandelate racemase test case. The top-ranked compound with a substrate-like scaffold was found at position 54 (Table S1). The chemical similarity plot resulted in clustering of the substrate-like compounds around 0.2–0.6% of the database and did not exhibit a linear trend (Figure 4).
A combination of two methods, namely CASTp and THEMATICS, was used for predicting the binding site. Which specific programs are used for the prediction is less important than the fact that a consensus prediction has a much higher chance of identifying the true pocket, because no individual method offers 100% predictivity. There are various other ways to detect a ligand binding site apart from the programs used in this study: geometry-based pocket prediction methods such as LIGSITE (35) and SURFNET-ConSurf (36); blind docking of small molecules into the protein structure (37); predicting electrostatically destabilized residues (38); analysis of spatial hydrophobicity (39); and binding site similarity based on threading, as in FINDSITE (40). In four of five (80%) test cases, the largest pocket identified by CASTp was the true binding site. In the case of ARG, the true binding site was the seventh largest pocket and only part of the true pocket was predicted. The reason this pocket was identified incompletely could be the shape of the binding site. CASTp defines a pocket as a solvent-accessible concavity for which, among the numerous possible cross-sections, at least one is larger than the mouth opening. The THEMATICS method exhibited 100% predictivity: in all five test cases, the cluster of residues with shifted titration curves belonged to the binding site. Thus, there is a good chance that, for a novel protein, multiple methods will point to the same, true pocket; in case of disagreement, more than one pocket should be considered for further study. Disagreement may arise either because of flaws in a particular method or because the protein may indeed have multiple binding sites performing various functions.
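The consensus idea can be illustrated as a set comparison between the residues lining a geometric pocket and the residues flagged by titration analysis. In the sketch below, the THEMATICS residues are those reported for ENL, whereas the pocket-lining list and the 0.5 overlap threshold are hypothetical values chosen only for illustration; neither program is called, and this is not the scoring used in the study.

```python
from typing import Set, Tuple

# Residues with perturbed titration curves reported for ENL in the text.
thematics_residues = {"Arg374", "Asp396", "Asp321", "Glu168", "His373", "Lys345", "Tyr289"}

# Hypothetical list of residues lining the largest geometric pocket (illustrative only).
pocket_residues = {"Arg374", "Asp396", "Asp321", "Glu168", "His373", "Lys345", "Glu211"}


def consensus(pocket: Set[str], flagged: Set[str], min_fraction: float = 0.5) -> Tuple[bool, Set[str], Set[str]]:
    """Return (agreement, shared residues, flagged residues outside the pocket).

    Agreement is declared when at least `min_fraction` of the flagged residues
    line the pocket; the threshold is an assumption, not a published cutoff.
    """
    shared = pocket & flagged
    outliers = flagged - pocket
    agree = bool(flagged) and len(shared) / len(flagged) >= min_fraction
    return agree, shared, outliers


agree, shared, outliers = consensus(pocket_residues, thematics_residues)
print("consensus:", agree)
print("shared residues:", sorted(shared))
print("flagged residues outside the pocket:", sorted(outliers))
```

With these inputs, Tyr289 is the only flagged residue outside the pocket, mirroring its description as a false positive.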
This study further validates the usage of structure-based pharmacophore models to retrieve substrate and substrate-like compounds from a database of metabolites. The biggest merit of the approach is its applicability to any protein structure. Typically, positions with most favorable energy or the positions with large cluster of probes reported by GRID (18) are considered for defining pharmacophore elements. This eliminated 71–95% of the decoys leading to enrichment of 84–100% of substrate-like compounds including all of the true substrates in multiple conformations.
At the end of the pharmacophore searches, there were still thousands of compounds remaining. Kalyanaraman et al. relied on intuitive judgment and rescored only the top 25% of the database, each compound in its best predicted pose. We instead developed three structure-based filters to further reduce the number of compounds in the search results that are not substrate-like. The search results from pharmacophore screening contained numerous compounds that have a large part sticking out of the opening of the binding site. It is possible that the true substrate is bound in this fashion and that the filter may remove it; however, the chemical moiety that binds into the binding site will still be found. The three filters employed to remove these types of compounds were a SASA filter, a length filter, and a clash filter. The idea was to impose binding site-based shape constraints because the virtual screening was carried out in the absence of the actual protein binding site. The SASA filter is only applicable to binding sites that are true pockets with an opening into which a substrate binds. It is not applicable to binding sites that are shallow depressions, because a large portion of the substrate would then necessarily be solvent accessible when bound to the protein. Thus, for an unknown protein, this judgment would have to be made by visual inspection. To remove compounds that had clashes with binding site atoms not captured by the SASA and length filters, a clash filter was applied. The structure-based filters were successful at retaining almost all substrate-like compounds from the pharmacophore searches for all five test cases while significantly reducing the number of compounds that are not substrate-like. Remarkable success was achieved for the ARG (87 unique hits) and mandelate racemase (580 unique hits) test cases, where a large reduction (≥94.5%) in the database was achieved by the pharmacophore-based virtual screening method.
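The logic of the three structure-based filters can be sketched as a simple predicate applied to each aligned conformer. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, the conformer values and the binding site SASA are invented, and only the 15.4 Å maximum length (ARG) comes from the text. How length, SASA, and clashes were actually computed is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BindingSite:
    max_length: float  # longest dimension of the site, in Å
    sasa: float        # solvent-accessible surface area of the site, in Å^2


@dataclass
class Conformer:
    name: str
    length: float      # longest dimension of the aligned conformer, in Å
    sasa: float        # solvent-accessible surface area of the conformer, in Å^2
    n_clashes: int     # number of steric overlaps with binding site atoms


def passes_filters(conf: Conformer, site: BindingSite) -> bool:
    """Apply the length, SASA, and clash filters in turn."""
    if conf.length > site.max_length:
        return False   # too long for the site
    if conf.sasa > site.sasa:
        return False   # larger surface than the pocket can bury
    return conf.n_clashes == 0


# 15.4 Å is the ARG site length from the text; the SASA value is hypothetical.
site = BindingSite(max_length=15.4, sasa=400.0)

# Hypothetical conformers, for illustration only.
hits: List[Conformer] = [
    Conformer("compact_hit", length=9.8, sasa=310.0, n_clashes=0),
    Conformer("too_long", length=18.2, sasa=350.0, n_clashes=0),
    Conformer("too_exposed", length=12.0, sasa=520.0, n_clashes=0),
    Conformer("clashing", length=10.1, sasa=300.0, n_clashes=3),
]

retained = [c.name for c in hits if passes_filters(c, site)]
print(retained)
```

For a shallow binding site, the SASA test would be skipped, in line with the caveat in the text.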
With the application of a binding energy to rank compounds, the ranks of the true substrates or products were found within 0.7% of the top of the database, and all of the substrate-like compounds were retrieved at ranks ranging between 0.01% and 1% of the database. According to a recent study (41), compounds with a Tanimoto of ≥0.85 only have a 30% chance of similar activity; nevertheless, it is still promising to have substrate-like compounds at such a high rank, because they may reveal insights about the true substrate and, therefore, the true function of the enzyme. In four of five cases, the physics-based scoring method produced a remarkably strong trend of enriching substrate-like compounds at the top of the ranking. The mandelate racemase test case exhibited a weaker trend, and clustering based on compound similarity can be applied in such cases to identify classes of compounds that potentially bind. Clustering can be very useful for a novel protein case, where no information about the substrate is available and when no inference can be drawn by merely looking at the top-ranked compound.
In the case of ADA, ENL, and ALD, the predicted binding poses were either close to the crystal structures or consistent with the catalytic mechanism (19,20,31). But, for ARG and mandelate racemase, the method could only partially predict the correct binding pose (21,23). Indeed, the ARG test case demonstrates that the presence of multiple charged groups helps in retrieving specific hits but is challenging to predict the correct binding pose, as mentioned earlier.
Because the method performed poorly in predicting the correct binding pose for some test cases, the fit of the crystal pose to the pharmacophore model was evaluated for each test case. The fit value of the 2-methyl substrate analog for ADA was 2.9/4.0, that of the cyclic form of the ALD substrate 2.9/3.0, the product of ARG 2.4/3.0, the substrate of ENL 1.9/3.0, and the inhibitor of MDR 2.8/3.0. The fit values were close to the cut offs (3.0 for ADA and 2.5 for ALD, ARG, ENL, and MDR) used in the database screening for four of five test cases. For the ENL test case, the poor fit to the pharmacophore model is potentially attributable to poor electron densities of the carboxyl and hydroxymethyl groups of the substrate, as discussed earlier (22,23). Overall, lack of flexibility in the pharmacophore method is a limiting factor in reproducing the crystal structure pose.
Mandelate racemase was also studied by Kalyanaraman et al. and was chosen as one of our test cases for the sake of comparison. The ranking of the substrate is similar (0.5% versus 0.4%) to that obtained with their method for the same substrate. However, the ranking by Kalyanaraman et al. was against the entire database (19 007 compounds), and only the top 25% of the database was analyzed. Our ranking was against a much larger number of compounds (80 018), and only the compounds retained after the application of structure-based filters were ranked.
Ranking based on compounds retrieved in each case was also calculated for the sake of comparison (Table S2). All of the substrate-like compounds including known substrates ranked within 4% of the top of the database for all five test cases. The overall trend remains the same irrespective of the ranking method. Further, each case has a different number of matching compounds, and to have common criteria for all test cases, we adopted the method of Kalyanaraman et al. in ranking compounds.
False positives are a major concern in virtual screening of huge libraries and might be the reason that compounds not known to bind to MDR were ranked high. Visual inspection of the highly scoring compounds is one way to eliminate false positives. Filters such as FlexPharm and Magnet screen ligands based on hydrogen bonding to specific residues, vdW contact, metal binding interactions, and occupancy of specific pockets and can be applied to remove false positives (42). A scoring function tailor-made for the target of interest, incorporating knowledge of the binding site of the target and the pharmacophore pattern, might also dramatically improve the results (43).
Kalyanaraman et al. applied docking rather than pharmacophore modeling and could successfully predict the correct binding pose with an RMSD of <1.5 Å compared to the crystal structure pose for five of six proteins, all belonging to the ENL family (8). However, their method failed to predict the correct binding pose for two of the apo enzyme test cases (alanine glutamate epimerase and muconate lactonizing enzyme-1). The pharmacophore-based method was moderately successful in predicting the correct pose. Nevertheless, it is remarkable that the pharmacophore method can successfully predict a binding pose close to the crystal structure for three proteins, each belonging to a different protein family.
We identified the following limitations for the pharmacophore-based method. The first is lack of an energy function to evaluate compounds in the presence of the protein binding site residues, unlike docking. The second is that compounds were only retrieved based on the defined chemical features in the pharmacophore model, which can be problematic if the model either has more or fewer features than the true substrate.
Nevertheless, the method is relatively time efficient: searching a database of 80 000 compounds took a maximum of 37 min on a 1.8-GHz GNU/Linux machine with 510 MB RAM, whereas docking takes ∼30 min per compound, or 40 000 CPU h for 80 000 compounds. This enabled screening of a much larger database (80 018 compounds) for each of the five test cases versus the 19 007 compounds screened by docking in an earlier, similar study (8). Kalyanaraman et al. validated the efficiency of the physics-based scoring function compared to ranking based on docked scores, and docking was only used to predict the binding pose. This study validates pharmacophore-based prediction of binding poses followed by ranking using the physics-based scoring function as a more time-efficient strategy in the identification of substrate-like compounds. Further, we have validated our method against proteins belonging to four different families, unlike earlier virtual screening methods (8,10).
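The docking cost quoted above is a direct multiplication, sketched below together with a rough ratio to the longest pharmacophore search. The ratio is only indicative: it compares CPU time for docking with wall-clock time for a single pharmacophore search and ignores the subsequent filtering and rescoring steps.

```python
DOCKING_MIN_PER_COMPOUND = 30      # approximate docking time per compound, from the text
N_COMPOUNDS = 80_000               # database size used in the estimate
PHARMACOPHORE_SEARCH_MIN = 37      # longest pharmacophore search reported, in minutes

docking_total_min = DOCKING_MIN_PER_COMPOUND * N_COMPOUNDS
docking_cpu_hours = docking_total_min / 60

# Indicative ratio of docking time to one pharmacophore search.
ratio = docking_total_min / PHARMACOPHORE_SEARCH_MIN

print(f"Docking estimate: {docking_cpu_hours:.0f} CPU h")
print(f"Docking / pharmacophore search time: ~{ratio:.0f}x")
```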
Each method has its own advantages and limitations. Docking has the advantage of application of a scoring function in the presence of protein residues, while a pharmacophore method is more time efficient and larger databases can be more easily screened. The other advantage of the pharmacophore method is that it starts with the identification of the protein binding site and can be applied to any novel protein unlike the method of Kalyanaraman et al. (8), where prior knowledge about the protein binding site was used in virtual screening. The pharmacophore method described herein was successful in retrieving and retaining substrate-like compounds as well as ranking most of the substrate-like compounds highly, therefore aiding in identification of putative enzyme function.
The goal of this work was to develop a protocol to help identify the putative function of a small metabolite binding protein of unknown function by identifying the binding site and shortlisting potential substrates through pharmacophore-based virtual screening and scoring. Our method was developed to consider the general case of a novel protein, with no clues as to its function or functionally important regions. The binding site is identified using a combination of methods. Once the binding pocket is predicted, the next task is to predict the compounds that potentially bind to the protein. Docking has been one of the popular methods to identify the probable compounds for binding to a protein binding site as well as to identify the most favorable ligand binding poses (8–11). In this study, we tested pharmacophore-based virtual screening as an alternative way to predict the potential ligand. The models were able to retrieve each substrate for all test cases and most of the substrate-like compounds (81.8–100%). To reduce the list of search results, we demonstrated that it is possible to use structure-based filters derived from the protein structure. Tanimoto coefficient analysis, which quantifies how similar each compound is to the substrate, clearly indicated the enrichment of substrate-like compounds after applying the structure-based filters for four of five test cases. In all cases, either the known substrate or product was ranked within the top 0.7% of the database. Several substrate-like compounds were ranked higher than the true substrate, ranging between 0.03% and 1% of the database. Moreover, by using our method, more than 92% of the database of metabolites can be eliminated as potential substrates. Study of the five test cases from four different superfamilies enabled the identification of advantages as well as disadvantages of the pharmacophore method.
The ADA and ARG cases were the most successful test cases with substrate-like compounds distinctly ranked high. The method failed to predict the correct binding pose in general (ADA) and especially for substrates with multiple charged groups (ARG) and those which are small with a fewer number of distinct pharmacophore features (mandelate racemase). Molecular dynamics simulations might correct the improper binding pose predicted for the ADA and mandelate racemase test cases. However, precise prediction of binding pose was not critically important as the goal of the method is to identify potential substrates from a large database of metabolites. Another limitation for the pharmacophore method is that the true substrate, or at least substrate-like compounds, should be present in the database of metabolites. The third limitation is, for proteins undergoing major structural changes in the binding site upon ligand binding, the method cannot be applied. This issue could be addressed by performing long time scale MD and normal mode analysis and checking whether the binding site is involved in motion; if so, it is possible to generate dynamic pharmacophore models as they have been demonstrated to work better than static ones in the case of inhibitor design (44).
The use of virtual screening in enzyme function identification is a more challenging application than for inhibitor searches. In our opinion, the goal of virtual screening in this context is to eliminate the possibility of thousands of metabolites as potential substrates and identify fewer compounds, or classes of compounds, to be experimentally tested. Toward this end, we demonstrated efficient usage of pharmacophore modeling and application of structure-based filters in retrieving, retaining, and ranking substrate-like compounds near the very top of a large database; however, more robust ranking methods remain to be developed and tested to enhance the yield of this and related methods.
Computing resources and support were made available from the University of Houston and the Texas Learning and Computation Center at the University of Houston. This research is based upon work supported by NASA under award No NNX08BA47A. Gratitude is also expressed to MOLNET GMBH, MESAAC and OpenEye Scientific software for their generous software support. The authors would like to thank Dr. Jerry Ebalunode for helpful discussions.
Conflicts of Interest
The authors are aware of no conflicts of interest in relation to the work described herein.