Keywords:

  • Crystallography;
  • Database;
  • Protein structure;
  • Protein-ligand interactions;
  • Structure-based drug design;
  • X-ray

Abstract

The Protein Data Bank is the most comprehensive source of experimental macromolecular structures. It can, however, be difficult at times to locate relevant structures with the Protein Data Bank search interface. This is particularly true when searching for complexes containing specific interactions between protein and ligand atoms. Moreover, searching within a family of proteins can be tedious. For example, one cannot search for some conserved residue as residue numbers vary across structures. We describe herein three databases, Protein Relational Database, Kinase Knowledge Base, and Matrix Metalloproteinase Knowledge Base, containing protein structures from the Protein Data Bank. In Protein Relational Database, atom–atom distances between protein and ligand have been precalculated allowing for millisecond retrieval based on atom identity and distance constraints. Ring centroids, centroid–centroid and centroid–atom distances and angles have also been included permitting queries for π-stacking interactions and other structural motifs involving rings. Other geometric features can be searched through the inclusion of residue pair and triplet distances. In Kinase Knowledge Base and Matrix Metalloproteinase Knowledge Base, the catalytic domains have been aligned into common residue numbering schemes. Thus, by searching across Protein Relational Database and Kinase Knowledge Base, one can easily retrieve structures wherein, for example, a ligand of interest is making contact with the gatekeeper residue.

The Protein Data Bank (PDB) (1) is the definitive source of experimentally determined biological macromolecular structures. As of July 2009, it contains almost 60 000 structures and is expanding by about 7000 structures per year, which equals the total number of structures on deposit in 1997.

The usefulness of any database is measured only in part by its data content. The ability to query on criteria important to the user and rapidly retrieve results in a usable format is just as critical to its usefulness. While the PDB Web site (footnote a) provides a rich search interface, it does not support queries based on the geometric relationship between bound ligands and their surrounding amino acid residues in the protein binding site. It can deliver structures containing a ligand or ligand substructure of interest but cannot narrow that list down to those in which the ligand makes contact with, for example, a serine hydroxyl group.

Many other methods (2) have been developed for filtering protein structures stored in databases, but they focus on properties of the protein and not protein–ligand geometries. Szabadka and Grolmusz (3,4) describe methodology for processing mmCIF (5,6) (macromolecular Crystallographic Information File) data from the PDB and storing the results in the relational database called the Rich Structure PDB, but it does not generate detailed geometric data. A system that does permit geometric querying is the commercial application Relibase+ (7,8). Very recently, the publicly available CREDO (9) database has appeared. It is a database of protein–ligand interactions and represents contacts as structural interaction fingerprints.

In this paper, we describe the design, population, and querying of the Protein Relational Database (PRDB). The database is easily extensible, and this will be illustrated with two Knowledge Bases developed for the protein kinase and matrix metalloproteinase families. Minimal information regarding our search interface is included herein to illustrate the types of queries that can be run against the databases.

Methods

Protein relational database (PRDB)

To mine the PDB structures for ligand geometries and other features not available for searching at the RCSB Web site (footnote a), we designed PRDB to contain new, enabling dimensions. The important tables are illustrated in Figure 1. GEOMETRIES, RESIDUE TRIPLET DISTANCE, and RESIDUE PAIR DISTANCE are the most significant: they permit geometric searching and contain atoms and residues derived from the entire asymmetric unit. PRDB does include nucleic acids whether or not complexed with proteins.

Figure 1.  The eight tables of PRDB most relevant to end user searching are shown. The lines between the tables are standard entity relationship connectors.

Geometries

Protein–ligand geometric searching is enabled through the GEOMETRIES table. This table contains distances and, in some cases, angles between a pair of objects. Four types of geometries are included:

  • Protein atom–ligand atom distance.

  • Protein atom–ligand ring centroid distance and angle.

  • Protein ring centroid–ligand atom distance and angle.

  • Protein ring centroid–ligand ring centroid distance and angles.

Protein atoms and ring centroids are included in the geometries table if they are within 8 Å of a ligand atom.
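
As a rough illustration of the precalculation described above, the following sketch enumerates protein atom–ligand atom pairs and keeps those within the 8 Å cutoff. The function name, the atom labels, and the coordinates are hypothetical; the actual PRDB loading code (written in Perl) is not shown in this paper.

```python
import math

def contacts_within_cutoff(protein_atoms, ligand_atoms, cutoff=8.0):
    """Return (protein_atom, ligand_atom, distance) rows for pairs within cutoff.

    Each atom is a (label, (x, y, z)) tuple; coordinates are in angstroms.
    """
    rows = []
    for p_label, p_xyz in protein_atoms:
        for l_label, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            if d <= cutoff:
                rows.append((p_label, l_label, round(d, 3)))
    return rows

# Hypothetical coordinates, for illustration only.
protein_atoms = [("CYS41:SG", (0.0, 0.0, 0.0)), ("ALA90:CB", (20.0, 0.0, 0.0))]
ligand_atoms = [("LIG:C1", (1.8, 0.0, 0.0)), ("LIG:N2", (0.0, 6.0, 0.0))]

rows = contacts_within_cutoff(protein_atoms, ligand_atoms)
for row in rows:
    print(row)
```

Storing every such row once is what makes a later distance query a simple indexed lookup rather than a geometric computation at query time.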

The current version of PRDB does not include water as a ligand in the geometries table or classify the interaction between pairs of objects as, for example, a hydrogen bond. Also, atom types (e.g., H-bond donor or acceptor) have not been assigned for ligands. However, while not explicitly identified in the database, this information is known for each atom of the 20 common amino acids.

Ring centroid generation

For proteins, ring centroids were generated for Tyr, His, Phe, and the 5- and 6-membered rings of Trp. For ligands, average ring planes and ring centroids were generated for all monocyclic rings and envelope rings of polycyclic systems up to an envelope ring size of ten (Figure 2A). This includes fused 6-6 aromatic ring systems and encompasses most rings of interest. Moreover, while a fused 6-7 ring would have a missing centroid in the center, the centroids located in the middle of the 6- and 7-membered rings could be used to find any interactions with the fused ring simply by relaxing the angle and distance criteria slightly. Otherwise, we were liberal in our choice of included rings so as not to miss opportunities to find novel geometric arrangements. To counter this, there is an aromatic flag associated with each centroid that can be used to narrow a search. In addition, a measure of ring flatness was also generated and stored in LIG RING FLATNESS. This value is calculated as 1 minus the RMSD (root mean square deviation) of the distances between the ring atoms and the average ring plane. The GEOMETRIES table contains both values. Ring size is also available in another table.
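
The flatness measure can be sketched as follows. The paper does not state how the average ring plane is fitted, so this example assumes a plane through the centroid whose normal is the normalized sum of cross products of consecutive centroid-to-atom vectors; all function names are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mean_plane_normal(points):
    """Unit normal of the average ring plane (assumed construction, see lead-in)."""
    c = centroid(points)
    rel = [tuple(p[i] - c[i] for i in range(3)) for p in points]
    total = [0.0, 0.0, 0.0]
    # Sum cross products of consecutive ring atoms, walking around the ring.
    for a, b in zip(rel, rel[1:] + rel[:1]):
        x = cross(a, b)
        for i in range(3):
            total[i] += x[i]
    length = math.sqrt(sum(t * t for t in total))
    return tuple(t / length for t in total)

def ring_flatness(points):
    """1 minus the RMSD of the atom-to-plane distances, as described in the text."""
    c = centroid(points)
    n = mean_plane_normal(points)
    dists = [sum((p[i] - c[i]) * n[i] for i in range(3)) for p in points]
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    return 1.0 - rmsd

def hexagon(radius, z_offsets):
    """Ring atoms listed in order around the ring (hypothetical test geometry)."""
    return [(radius * math.cos(math.radians(60 * k)),
             radius * math.sin(math.radians(60 * k)),
             z_offsets[k]) for k in range(6)]

flat_ring = hexagon(1.39, [0.0] * 6)                # idealized planar ring
puckered_ring = hexagon(1.50, [0.25, -0.25] * 3)    # atoms alternate above and below

print(ring_flatness(flat_ring))
print(ring_flatness(puckered_ring))
```

A perfectly planar ring scores 1, and the value drops as atoms move out of the average plane, so a threshold on this field can complement the aromatic flag when narrowing a search.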

Figure 2.  (A) Seven rings with marked centroids are generated from the hypothetical compound. Note that a 7,6 fused ring is not generated because its envelope ring would contain 11 atoms, which is one more than the maximum. (B) Edge-on view of protein and ligand average ring planes with the DISTANCE line connecting the ring centroids. In the GEOMETRIES and KIN GEOMETRIES tables, ∠1 and ∠2 correspond to ANGLE 1 and ANGLE 2, respectively. (C) Edge-on view of an average ring plane with the DISTANCE line connecting the ring’s centroid to an atom. In the GEOMETRIES and KIN GEOMETRIES tables, ∠x corresponds to ANGLE 1 when the ring is from the protein and the atom is from a ligand and ANGLE 2 for the reverse case.

Ring centroid geometries

The geometric arrangement between two rings requires the specification of three angles in addition to the distance between the ring centroids (Figure 2B). The situation is simpler when defining the arrangement between a ring and an atom and only requires, in addition to the distance, the specification of a single angle (Figure 2C). The angle’s value is stored in ANGLE 1 when the ring belongs to a protein and the atom is supplied by a ligand. In the reverse case, it is stored in ANGLE 2. In all cases, the angles stored are acute.
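
A minimal sketch of the ring–atom case follows. The text does not spell out the reference line for the angle (it is defined graphically in Figure 2C), so this example assumes the angle is measured between the ring-plane normal and the line from the ring centroid to the atom. Taking the absolute value of the dot product folds the result into the range 0 to 90 degrees, so the stored angle is always acute regardless of which face of the ring the atom sits on. The function name is hypothetical.

```python
import math

def ring_atom_geometry(normal, ring_centroid, atom):
    """Return (distance, acute angle in degrees) between a ring and an atom.

    Assumption: the angle is taken between the ring normal and the
    centroid-to-atom line; the paper defines it only in Figure 2C.
    """
    v = tuple(atom[i] - ring_centroid[i] for i in range(3))
    distance = math.sqrt(sum(x * x for x in v))
    norm_n = math.sqrt(sum(x * x for x in normal))
    dot = sum(normal[i] * v[i] for i in range(3))
    # abs() makes the angle independent of the normal's sign (always acute).
    cos_theta = min(1.0, abs(dot) / (norm_n * distance))
    return distance, math.degrees(math.acos(cos_theta))

normal = (0.0, 0.0, 1.0)          # ring lies in the xy-plane
ring_centroid = (0.0, 0.0, 0.0)

above = ring_atom_geometry(normal, ring_centroid, (0.0, 0.0, 3.5))
below = ring_atom_geometry(normal, ring_centroid, (0.0, 0.0, -3.5))
in_plane = ring_atom_geometry(normal, ring_centroid, (4.0, 0.0, 0.0))
oblique = ring_atom_geometry(normal, ring_centroid, (1.0, 0.0, -1.0))

print(above, below, in_plane, oblique)
```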

Residue triplets and pairs

Triplets of residues are stored in the RESIDUE TRIPLET DISTANCE table, and the distances between residues are measured between their C-alpha carbons. For a residue to be included, it must contain at least one atom within 5 Å of a ligand atom. For each triplet, LIG 3D ID identifies that ligand. Note that the residues of a triplet need not all come from the same chain. If they come from different chains, the field HOMOGENEOUS is set to N; otherwise, it contains Y. RESIDUE PAIR DISTANCE differs from RESIDUE TRIPLET DISTANCE only in that it contains two residues on a row instead of three.
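
The triplet rows described above could be generated along these lines. This is a sketch only: it assumes the input residues have already been filtered to those with an atom within 5 Å of the ligand, and the dictionary keys and output field names are hypothetical rather than the actual PRDB column names.

```python
import math
from itertools import combinations

def residue_triplets(residues):
    """Build one row per residue triplet with C-alpha distances and a chain flag.

    residues: list of dicts with 'chain', 'name', 'num', and 'ca' (x, y, z).
    """
    rows = []
    for a, b, c in combinations(residues, 3):
        chains = {r["chain"] for r in (a, b, c)}
        rows.append({
            "residues": tuple(f"{r['chain']}:{r['name']}{r['num']}" for r in (a, b, c)),
            "d12": round(math.dist(a["ca"], b["ca"]), 2),
            "d13": round(math.dist(a["ca"], c["ca"]), 2),
            "d23": round(math.dist(b["ca"], c["ca"]), 2),
            # Y when all three residues share a chain, N otherwise.
            "homogeneous": "Y" if len(chains) == 1 else "N",
        })
    return rows

# Hypothetical binding-site residues near one ligand.
site = [
    {"chain": "A", "name": "HIS", "num": 57, "ca": (0.0, 0.0, 0.0)},
    {"chain": "A", "name": "ASP", "num": 102, "ca": (3.0, 0.0, 0.0)},
    {"chain": "B", "name": "SER", "num": 195, "ca": (0.0, 4.0, 0.0)},
]

triplets = residue_triplets(site)
print(triplets)
```

The pair table follows from the same idea with `combinations(residues, 2)`.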

PRDB data generation and maintenance

There are two basic steps for keeping PRDB up-to-date (and for its initial population). The first acquires update data from the PDB Web site, and the second applies this data to PRDB. The first step is accomplished by a cron script that periodically probes the PDB status directory for a new directory of updated files. When it finds one, it downloads the ‘added.pdb’, ‘modified.pdb’, ‘obsolete.pdb’, and if necessary, the ‘reload.pdb’ files. These files list the new, changed, and obsolete models for that update period. The script then downloads the cif and pdb format files for each model mentioned in the added, modified, and reload files. It also downloads the current components.cif file. Because the network connection frequently breaks, it may take several sessions to gather the complete update set. Once complete, this set of files is marked with a ‘good’ tag file.

A second cron script periodically scans for successfully downloaded sets. The first step for the update is to move the new and changed model cif and pdb files to master store directories. All updated and obsolete files are tagged with a version number and moved to an obsolete directory. Next, the new, changed, and obsolete models are deleted from the PRDB database. Now, the components.cif file is processed for new and changed ligands. New ligands are those that do not currently exist in PRDB, and changed ligands are determined by comparing the CRC (cyclic redundancy checksum) of a string formed by concatenating all of the cif data for the ligand with a stored CRC value. The length of this string is also compared with its stored value. If either of these values has changed, the ligand is presumed to be updated and a determination is made of whether it is a cosmetic or substantive change. A substantive change is one that affects the use of the ligand data in the PRDB data model, such as a change in the connectivity of the ligand or atom name to atom mapping in the molecule. Because the PRDB data model enforces a foreign key relationship between models and ligands, all models that use a ligand must be removed before that ligand is updated. These models are added to a rework list.
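
The change-detection step can be illustrated with a short sketch. It uses CRC-32 from Python's standard `zlib` module as a stand-in; the paper does not say which CRC variant or string encoding the Perl scripts use, and the function names and the ligand text below are hypothetical.

```python
import zlib

def ligand_fingerprint(cif_text):
    """Return (crc, length) for the concatenated cif data of one ligand."""
    data = cif_text.encode("utf-8")
    return zlib.crc32(data), len(data)

def ligand_changed(cif_text, stored_crc, stored_length):
    """A ligand is presumed updated if either the CRC or the length differs."""
    crc, length = ligand_fingerprint(cif_text)
    return crc != stored_crc or length != stored_length

# Hypothetical, heavily abbreviated ligand records.
old_record = "ATP|C10 H16 N5 O13 P3|PG O1G SING"
new_record = "ATP|C10 H16 N5 O13 P3|PG O1G DOUB"

stored_crc, stored_length = ligand_fingerprint(old_record)
print(ligand_changed(old_record, stored_crc, stored_length))
print(ligand_changed(new_record, stored_crc, stored_length))
```

A detected change would then still need the separate cosmetic-versus-substantive check described above, which this sketch does not attempt.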

After the ligands have been updated, the new, modified, and reworked models are inserted into PRDB. This stage takes place in three steps for each model. A Perl script extracts the data of interest from the cif-format file for the model into tab-separated text files. A second Perl script computes the geometric data of interest for the model. A third Perl script loads all of the data for a model from the text files into the database in a single transaction window. To expedite processing, the models are split into ten groups and this stage is run in ten independent parallel threads. The Perl instance used for this processing has been augmented with the STAR cif parser from the PDB Web site and many modules from the CPAN site.
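
Loading a model in a single transaction means that a failure part-way through leaves no partial model behind. The sketch below shows the idea with Python's built-in `sqlite3` module and a two-table toy schema; the table names, columns, and model identifiers are hypothetical and far simpler than the PRDB schema, and the production loader is a Perl script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (model_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE atom (model_id TEXT NOT NULL, atom_name TEXT NOT NULL, "
    "x REAL, y REAL, z REAL)"
)
conn.commit()

def load_model(conn, model_id, atom_rows):
    """Insert a model and all of its atoms, or nothing at all."""
    # The connection context manager commits on success and rolls back
    # if any statement inside the block raises an exception.
    with conn:
        conn.execute("INSERT INTO model (model_id) VALUES (?)", (model_id,))
        conn.executemany(
            "INSERT INTO atom (model_id, atom_name, x, y, z) VALUES (?, ?, ?, ?, ?)",
            [(model_id, *row) for row in atom_rows],
        )

# A well-formed model loads completely.
load_model(conn, "demo1", [("CA", 0.0, 0.0, 0.0), ("SG", 1.5, 0.2, 0.1)])

# A model with a bad row (missing atom name) is rejected as a whole.
try:
    load_model(conn, "demo2", [("CA", 1.0, 1.0, 1.0), (None, 2.0, 2.0, 2.0)])
except sqlite3.IntegrityError:
    pass

model_count = conn.execute("SELECT COUNT(*) FROM model").fetchone()[0]
atom_count = conn.execute("SELECT COUNT(*) FROM atom").fetchone()[0]
print(model_count, atom_count)
```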

One hundred and fifty-two of the 1657 cif data element types found to be present in actual PDB cif model files (as of the Mar 13, 2009 release) are abstracted and used in PRDB. The REMARKS records are extracted from the PDB-formatted file for each model and are also loaded into PRDB.

Protein kinase knowledge base (KKB)

Drug discovery efforts against kinase targets have mainly focused on the catalytic (or ATP-binding) domain (10,11). Although a number of different types of inhibitors have been identified, including inactive form inhibitors (12) and allosteric inhibitors (13–16), they all interact with the catalytic domain of the kinase. To enhance internal access to kinase crystal structures and enable structural comparisons of catalytic domain features, we extended the PRDB to include a structural KKB that contains all catalytic domain structures deposited in the PDB.

KKB data generation

The KKB was constructed from the human catalytic domain sequences collated by Manning and co-workers, which can be accessed through http://www.kinase.com (17). Blast searches were conducted to identify kinase domain crystal structures in the PDB. A multiple sequence alignment was generated for the full human kinome, based on a HMMer profile of a structural alignment of 17 non-redundant kinase crystal structures (18). The full-kinome sequence alignment was used to develop a common numbering scheme. Their 3D structures were then aligned, and two versions of PDB files, identified by their PDB ID and chain letter, were generated using the original and common residue numbers for each kinase catalytic domain present in the parent PDB file. From these 1491 pairs of PDB files and a file mapping the original to common residue numbers, three tables in Figure 3 named KIN PDBNUM, KIN PDB, and KIN TRANSLATE were generated. These three tables contain all of the information necessary to generate the remaining tables and join the geometric and ligand data in the main PRDB database to the KKB. Thus, a substructure search of 2D ligands can be performed to retrieve kinase chains containing those ligands.
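
One way to picture the mapping from original to common residue numbers is to walk along a gapped, aligned sequence and record the alignment column of each residue. The sketch below assumes the common number is simply the 1-based alignment column, which the paper does not state explicitly; the function name, sequence, and residue numbers are hypothetical.

```python
def common_numbering(aligned_seq, original_numbers, gap="-"):
    """Map each original residue number to its alignment column.

    aligned_seq: one row of a multiple sequence alignment, gaps included.
    original_numbers: residue numbers from the PDB file, in sequence order.
    Assumption: common number = 1-based column index in the alignment.
    """
    residue_count = sum(1 for s in aligned_seq if s != gap)
    if residue_count != len(original_numbers):
        raise ValueError("aligned sequence and residue numbers disagree in length")
    mapping = {}
    numbers = iter(original_numbers)
    for column, symbol in enumerate(aligned_seq, start=1):
        if symbol != gap:
            mapping[next(numbers)] = column
    return mapping

# Hypothetical fragment: four residues numbered 10-13 in the PDB file.
mapping = common_numbering("M-KV--L", [10, 11, 12, 13])
print(mapping)
```

With such a table in place, a conserved position has the same common number in every structure even though its original number differs from file to file.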

Figure 3.  The eight tables of KKB most relevant to end user searching. The lines between the tables are standard entity relationship connectors. The PROT RESNUM COMMON field of KIN GEOMETRIES contains the common residue number making it easy to search for geometries involving important kinase residues. For example, distances to the gatekeeper residue can be searched simply by specifying 461 in this field.

To facilitate searching by name, the name used by Manning was used as the standard for each kinase. It was stored in the KIN_STANDARD_NAMES table and linked to a list of synonyms (footnote b).

Feature identification

Important structural features were annotated using the common numbering scheme to allow for the rapid identification and comparison of these elements. The hinge region residues, the gatekeeper residue, glycine-rich loop residues, DFG- and HRD-motif residues were all enumerated in addition to active site residues (footnote c). Structural classification schemes for DFG in/out and Helix C in/out conformations were also generated on a structure-by-structure basis. Finally, reactive cysteines in the active site were annotated according to the scheme recently published by Gray (19). Complete details of the 2D and 3D alignments and structural classification schemes have been published by Brooijmans (20).

Features and their descriptions are stored in KIN FEATURES, and the residues comprising them are stored in KIN FEATURE RESIDUES. Some features apply to all structures (e.g., HRD motif), while others pertain to a subset (e.g., DFG in). In the latter case, those structures are identified in the KIN FEATURES PDBNUM table. The identified features and their scope are listed in Table 1.

Table 1.  Kinase features

| Feature | Description | # Res | Scope (a) |
|---|---|---|---|
| ACTIVE SITE | Residues in the binding site that are capable of interacting with ATP | 27 | All |
| CATALYTIC LYSINE | Lysine in the catalytic loop | 1 | All |
| GATEKEEPER | The binding site residue that controls accessibility to the hydrophobic pocket | 1 | All |
| GLY RICH LOOP | Highly prevalent GxGxxG motif that connects beta strands 1 and 2 | 6 | All |
| HELIX C ACID | Highly conserved (99%) Glu in Helix C | 1 | All |
| HINGE REGION | The hinge region residues | 5 | All |
| HINGE REGION H-BOND DONOR | The hinge region H-bond donor residue | 1 | All |
| HRD MOTIF | The HRD motif in the catalytic loop | 3 | All |
| DFG LOOP | The active site DFG loop D1108, F1111, G1114 | 3 | All |
| DFG LOOP IN | Active site DFG loop where Asp is oriented toward ATP pocket and Phe pointing toward C-terminal and Helix C | 3 | Specific |
| DFG LOOP OUT | Inactive kinase DFG loop where Asp is pointing out of ATP pocket | 3 | Specific |
| DFG LOOP UNDEF | DFG loop in a partially active kinase or unable to define because of structural inconsistencies | 3 | Specific |
| HELIX C | The single alpha helix at the N terminus | 54 | All |
| HELIX C IN | Active kinase structure where Glu in Helix C interacts with the N terminal Lys and orients toward ATP pocket | 54 | Specific |
| HELIX C INTERMEDIATE OUT | Partially inactive kinase structure where Glu is out of plane for interaction with N terminal Lys but still points toward ATP pocket | 54 | Specific |
| HELIX C OUT | Inactive kinase structure where Glu in Helix C is rotated outwards from ATP pocket and does not form any interaction with Lys at the N terminus | 54 | Specific |
| HELIX C UNDEFINED | Unable to identify Glu orientation in Helix C | 54 | Specific |
| REACT CYS – GROUP 1 | Subgroups A–C. Cys residue in the glycine-rich loop (also known as p-loop) | 1 | Specific |
| REACT CYS – GROUP 2 | Subgroups A–C. Cys residue immediately following the glycine-rich loop | 1 | Specific |
| REACT CYS – GROUP 3 | Subgroups A–G. Cys within the hinge region | 1 | Specific |
| REACT CYS – GROUP 4 | Cys at the start of the activation loop, preceding the DFG motif | 1 | Specific |

(a) Features belonging to all structures have their scope set to All, while those found in specific structures have a scope value of Specific.

The matrix metalloproteinase knowledge base (MMPKB)

Drug discovery efforts against MMP (matrix metalloproteinase) targets usually focus on selective inhibition of the zinc-containing active site (21–24). The active site and the catalytic domain as a whole exhibit relatively strong structural conservation and sequence homology through all of the MMP variants whether they are in inhibited or active states (24). Several features are common to all catalytic domain structures. The basic shape of the catalytic domain is provided by the arrangement of a five-stranded beta sheet and three alpha helices. The active site is formed by the chelation of the three histidines from the motif HEXGHXXGXXH to a zinc ion, and the cavity is divided into six pockets, of which the S1′ pocket is the most important for selectivity (22,23). The catalytic domain also regularly chelates a structural zinc ion and a calcium ion. To enhance the ability to target MMPs, we extended the PRDB to include all human catalytic domain structures deposited in the PDB.

MMPKB data generation

The starting point for the MMP dataset was the set of 99 human MMP structures of the catalytic domain identified by Aureli and co-workers (21). A multiple sequence alignment was performed using ClustalW2 (BLOSUM, 10, 0.2). This alignment included the 24 canonical sequences for the catalytic domains of the MMP enzymes as obtained from UniProtKB (footnote d). The consensus sequence with a length of 452 residues was used to supply a common numbering scheme to the MMP family. Its length is more than doubled by the presence of the gelatinases (MMP2 and MMP9) because of their fibronectin type II modules. We considered an alternate method using the Align123 algorithm, which takes secondary structure into consideration for the alignment. The algorithm augments the standard ClustalW score with a secondary structure matching score. But we found that for this family of proteins, the standard ClustalW alignment already produces an excellent match for the five beta strands and three alpha helices. No additional benefit was found when testing the secondary structure-weighted alignment on a few samples.

All PDB structures were standardized programmatically using Discovery Script (Discovery Studio 2.5, Accelrys, CA, USA). To identify the important elements of each structure, they were subjected to a cleaning procedure in which water molecules were removed, and only the first ‘molecule’ data object was retained (relevant, for example, when an NMR (nuclear magnetic resonance) ensemble of conformations is present). If multiple protein chains were present, they were split into separate structures uniquely identified by the PDB ID and chain letter. Although this sometimes causes redundant information to be retained, there are instances of differing receptor–ligand complexes within the same structure. For example, the 1b3d structure contains two receptor chains where chain B has the S27 compound in the active site and chain A does not (in all, half of our multi-chain structures exhibit ligand differences). Thus from the 99 original structures, 162 receptor structures were obtained. For each small molecule found in the PDB structure, its parent receptor chain was determined as the one with the largest number of contacts (atom–atom distance ≤4 Å).
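
The parent-chain assignment at the end of this procedure amounts to counting contacts per chain and taking the maximum. A minimal sketch follows, assuming a contact is any protein atom–ligand atom pair at ≤4 Å as stated above; the function name and coordinates are hypothetical, and the actual work was done with Discovery Script.

```python
import math
from collections import Counter

def parent_chain(ligand_atoms, protein_atoms, cutoff=4.0):
    """Return the chain with the most atom-atom contacts to the ligand.

    ligand_atoms: list of (x, y, z); protein_atoms: list of (chain, (x, y, z)).
    Returns None when no protein atom lies within the cutoff.
    """
    counts = Counter()
    for chain, xyz in protein_atoms:
        for lig_xyz in ligand_atoms:
            if math.dist(xyz, lig_xyz) <= cutoff:
                counts[chain] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical two-atom ligand sitting closer to chain B than to chain A.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [
    ("A", (3.5, 0.0, 0.0)),
    ("A", (10.0, 0.0, 0.0)),
    ("B", (0.0, 3.0, 0.0)),
    ("B", (0.0, -3.0, 0.0)),
]

assigned = parent_chain(ligand, protein)
print(assigned)
```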

Feature identification

Table 2 provides a list of the features annotated in MMPKB. Because certain metal ions are regularly chelated by specific residues in the MMP sequence, we defined metal binding features in the database. For each of these metal ions, a list of contacting protein residues was generated for the annotations.

Table 2.  MMP features

| Feature | Description | # Res | Scope (a) |
|---|---|---|---|
| BETA STRAND 1 | Second strand in beta sheet | 6 | All |
| BETA STRAND 2 | First strand in beta sheet located near helix 1 | 4 | All |
| BETA STRAND 3 | Middle of beta sheet. Calcium 2 binds at N-sided end | 6 | All |
| BETA STRAND 4 | Last strand in beta sheet. Antiparallel. Close to active site ligand | 3 | All |
| BETA STRAND 5 | Fourth strand in beta sheet. Close to structural zinc | 4 | All |
| CATALYTIC MOTIF | Conserved pattern HEXGHXXGXXH | 11 | All |
| HELIX 1 | Longest alpha helix located behind the active site | 16 | All |
| HELIX 2 | Active site helix located near center of domain | 12 | All |
| HELIX 3 | Short alpha helix located at periphery of domain | 11 | All |
| SPECIFICITY LOOP | Forms one side of S1′ pocket | 9 | All |
| S1′ POCKET | Formed by space between helix 2 and specificity loop | 7–12 | Specific |
| CALCIUM 1 BINDING | Chelated by D,G,G,N residues in loop at end of strand 4 | 6–10 | Specific |
| CALCIUM 2 BINDING | Chelated by D,N,G,D residues near beta strands 3 and 5 | 4–6 | Specific |
| CALCIUM 3 BINDING | Chelated by D,D,E residues | 3–5 | Specific |
| CATALYTIC ZINC BINDING | Binds to three histidine residues from catalytic motif | 3–4 | Specific |
| STRUCTURAL ZINC BINDING | Behind beta sheet strands 4 and 5 and in front of a loop | 4–6 | Specific |
| DELETIONS | Residues removed from standard wild-type enzyme sequence | | Specific |
| INSERTIONS | Extra residues not found in standard enzyme | | Specific |
| MUTATIONS | E.g., sometimes the motif E is mutated to Q to disable the active site | | Specific |

(a) Features belonging to all structures have their scope set to All, while those found in specific structures have a scope value of Specific.

The most important metal ion is the catalytic zinc, which is co-ordinated by the three histidine residues in the catalytic motif HEXGHXXGXXH. The significance of the catalytic zinc ion is that it helps to bind the inhibitor compound. MMP structures also contain a structural zinc ion, which is chelated at well-conserved positions in the sequence. In 96% of our structures, the structural zinc ion is present and identified by its association with three histidines and an aspartic acid. It is situated between a loop (containing the aspartic acid and one histidine) on one side and beta strands 4 and 5 on the other. Features were created in MMPKB to track the residues in contact with the catalytic zinc and structural zinc ions.

Most structures also contain several calcium ions. In 96% of the structures, a primary calcium ion is found chelated in a loop region by Asp, Gly, Gly, and Asn residues. The ion is also in contact with Asp and Glu residues at one end of beta strand 5. A second calcium ion is present in 86% of the structures. It is associated with Asp, Asn, Glu, and Asp residues in loop regions near the end of beta strands 3 and 5. A third calcium ion is present in 63% of the structures chelated between two loops. Three features were created in MMPKB to track the residues in contact with these three calcium ions.

We also wanted to create a sequence feature that identifies residues associated with the S1′ pocket, because this pocket shows the greatest potential for selectivity within the active site. The S1′ pocket is not straightforward to define systematically. It is formed by the gap between the specificity loop and the active site helix (22,24), both of which are also annotated as features. In the 2ayk structure, the pocket is capped by R114 (24), which is located just anterior to the catalytic motif along the alpha helix. Our definition is based on a sample of five MMP1 structures (1cge, 2tcl, 3ayk, 966c, 2ayk) with and without ligands. We used Discovery Studio tools to identify and, if necessary, partition cavities, and then selected the cavity located between the specificity loop and the active site helix. The residues surrounding the cavity were assigned as putative S1′ residues, and in comparing the five structures a consensus was reached for the Arg, Val, and His residues in the active site helix and the Tyr and three Ser residues in the specificity loop.

Another attribute of interest to drug design that was recorded was the difference between the sequence found in the protein structure and the corresponding enzyme (wild type) sequence in UniProtKB. For each such pair, the sequences were aligned and the mutations, deletions, and insertions were determined. While these can be considered ‘features’ of the protein, the data need to be modeled differently than the rest of the features in Table 2 and are stored in their own database tables (Figure 4). The same is true for the metal binding residue features whose data are stored in MMP METAL BINDING RESIDUES and MMP METAL BINDING FEATURE (Figure 4).
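
Given a pairwise alignment of the wild-type and structure sequences, the three kinds of difference can be read off column by column. The sketch below assumes both sequences are supplied already aligned with `-` as the gap character and reports positions in wild-type numbering; the function name and sequences are hypothetical, and the paper does not describe its own implementation.

```python
def sequence_differences(wild_type, structure, gap="-"):
    """Classify alignment columns into mutations, deletions, and insertions.

    Both inputs are gapped sequences of equal length from a pairwise alignment.
    Positions are 1-based wild-type residue indices; an insertion is reported
    at the wild-type position it follows.
    """
    if len(wild_type) != len(structure):
        raise ValueError("aligned sequences must have equal length")
    mutations, deletions, insertions = [], [], []
    wt_pos = 0
    for wt, st in zip(wild_type, structure):
        if wt != gap:
            wt_pos += 1
        if wt == gap and st == gap:
            continue
        if wt == gap:
            insertions.append((wt_pos, st))       # residue absent from wild type
        elif st == gap:
            deletions.append((wt_pos, wt))        # wild-type residue missing
        elif wt != st:
            mutations.append((wt_pos, wt, st))    # substituted residue
    return mutations, deletions, insertions

# Hypothetical aligned fragments: E->Q mutation, one insertion, one deletion.
wild_type = "HELGH-SLG"
structure = "HQLGHA-LG"

mutations, deletions, insertions = sequence_differences(wild_type, structure)
print(mutations, deletions, insertions)
```

Because the three result lists have different shapes, it is natural for them to live in separate tables rather than alongside the residue-list features.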

Figure 4.  The 11 tables of MMPKB most relevant to end user searching. The lines between the tables are standard entity relationship connectors. The PROT RESNUM COMMON field of MMP GEOMETRIES contains the common residue number making it easy to search for geometries involving important MMP residues. There is no analog of KIN ALIAS as the standard names used in MMPKB are all of the format MMP-## which makes searching straightforward. MMP DELETION, MMP INSERTION, and MMP MUTATION do not have analogous tables in the KKB.

3D alignment

To facilitate three-dimensional comparisons of different parts of the consensus sequence, the receptor–ligand structures were all superimposed onto a reference structure using the consensus sequence alignment as a guide. Compared to kinases, MMP structures are fairly rigid and hence an excellent overlay to a common reference structure is possible. The superimposition algorithm contained in the Pipeline Pilot Protein Modeling Collection was employed in which the RMSD between main-chain atoms was optimized. As the reference, 1i76_A was chosen because it is the highest resolution structure (1.20 Å) from the most populous subfamily (collagenase) in our MMPKB.

Database updates

To demonstrate the extensibility of this database, we sought to add new MMP structures to the dataset without altering the original numbering scheme. Several approaches are suggested for automating the retrieval of new MMP structures. One approach is to search the description field in the Blast database of PDB sequences for MMP synonyms, but as expected, it misses some of the MMP structures. Another approach is to use the EC (Enzyme Commission) number that is filed with each PDB structure. Although endopeptidases have an EC number, there is no common EC number to extract structures of all MMP family proteins. Moreover, two of the MMPs (MMP13 and MMP16) do not have full EC numbers. We find the most reliable method to curate and maintain an up-to-date version of MMPKB is to perform a Blast search on the sequence database of PDB structures maintained by NCBI (footnote e). A query with all 24 of the MMP sequences returned 19 new PDB structures with an E-value less than 1e-36, including the first known structure for MMP20. The Blast search proved to be very effective at finding MMP structures. In addition to the new hits, all of the original structures were returned. A search of the ‘description’ field of the Blast database for MMP synonyms returned no additional structures.

Extension of the alignment to new PDB structures was done by a multiple ClustalW alignment of the original 99 sequences with each new sequence (one at a time). In practice, because of the weight and length of the reference sequences, none of the new sequences altered the original consensus. Moreover, each sequence aligned perfectly to its corresponding wild-type sequence. Alternatively, one can use a statistical profile generated from the original alignment as a template for new sequences. However, an attempt to do this using an HMM (Hidden Markov Model) profile failed to produce gapped sequences of a consistent length.

The data generated for MMPKB are similar to the KKB data and are stored in the tables shown in Figure 4. There is no analog of KIN ALIAS as the standard names used in MMPKB are all of the format MMP-## (e.g., MMP-12), which makes searching straightforward. MMP DELETION, MMP INSERTION, and MMP MUTATION do not have analogous tables in the KKB.

In its present form, our MMP dataset comprises 191 superimposed and renumbered receptor–ligand complexes derived from 118 individual PDB files. The superposition RMSD had a mean of 1.4 Å and a standard deviation of 0.9 Å.

Results and Discussion

Querying PRDB, KKB, and MMPKB

The PRDB database consists of 62 tables with an additional 21 holding the KKB and MMPKB. To date, only the tables described herein have been made available for searching from our user interface ADOpt (Analyze, Design, Optimize), and they probably cover 90% of common users’ queries. More complex queries can be run directly against the database; however, they require SQL expertise. The interface allows any retrieved PDB file to be saved to the hard drive or directly opened in the molecular visualization software, Benchware 3D Explorer (footnote f). Detailed use cases of the informatics system described herein have been published separately (25).

The data are presented to the user in a format that can, in some cases, be quite different than the database tables (Figures 1, 3 and 4). Figure 5 shows the sets of fields, or datasources, available to the user for query building. Datasources beginning with KIN or MMP are only available when searching the KKB or MMPKB, respectively. The remaining datasources are available for querying all three databases.

Figure 5.  The fields exposed to the user for constructing queries. Datasources beginning with KIN or MMP are available only when searching KKB or MMPKB, respectively. The rest are available for all queries. Sample data are shown for MMP FEATURES and single field. Unless prefixed with O (original) or C (common), residue numbers are original with the exception of MMP FEATURES: Deletions where the common number (and wild type residue name) is used. In that case, there is no original number that can be assigned to the residue.

The KIN FEATURES and MMP FEATURES datasources look markedly different than anything in the database tables (Figures 3 and 4). There are a number of database transformations (views and functions) that rearrange the data in the raw tables into the format seen in Figure 5. The data contained in the fields listed in these datasources are best described by the sample data in Figure 5. There is one row in KIN FEATURES for each of the 1491 pairs of kinase PDB files. Likewise, MMP FEATURES contains one row for each of its 191 pairs of PDB files. For the feature definitions, all of the residue names and numbers are concatenated into a single field. One can search any of these fields for a feature that contains a specific residue and position, but that is better done using the KIN RESIDUES or MMP RESIDUES datasources. The concatenated fields were primarily generated to present the data to the user in a single row and not for searching.

While KIN FEATURES and MMP FEATURES contain the details of features that apply to all structures in their respective databases, the KIN PROT FEATURES and MMP PROT FEATURES datasources contain only a list of PDB file identifiers and their features from Tables 1 and 2, respectively, that have a specific scope. Querying these datasources is the easiest way to search for structures containing any one of a list of features.

In addition to querying on kinase and MMP metadata, the datasources in Figure 5 allow for the generation of powerful queries based on protein–ligand geometries that cannot be duplicated at the PDB Website. A closer examination of the GEOMETRIES table and datasource makes this more apparent.

Each pair of objects described in GEOMETRIES consists of one protein component and one ligand component. In each case, there are several identifiers that can be used to specify the component of interest.

Atoms

For a protein atom, a unique atom identifier (PROT ATOM ID), generated by the data loading routines, serves as its primary identifier. Because it would be rare for a user to know and search on that identifier, the table also contains the name for that atom (PROT ATOM NAME; e.g., CA, SG) and its containing residue’s name and number (PROT RESIDUE NAME and PROT RESIDUE NUM, respectively). This allows the user to query for structures containing a ligand atom, for example, <2 Å from:

Query 1: any cysteine (PROT RESIDUE NAME = CYS).

Query 2: the sulfur of any cysteine (PROT ATOM NAME = SG [the sulfur of cysteine]).

Query 3: cysteine 41 (PROT RESIDUE NUM = 41 and PROT RESIDUE NAME = CYS).

Query 4: the sulfur of cysteine 41 (PROT RESIDUE NUM = 41 and PROT ATOM NAME = SG).

In the last two examples, one would have to know that the protein of interest contains a cysteine at position 41. If the criteria shown were the only ones in the query, it would likely return extraneous proteins. However, when searching for kinase or MMP geometries, the common residue numbering schemes allow for the generation of quite specific queries that would otherwise be nearly impossible. For example, it is trivial to search for structures containing ligands having an atom <2.5 Å from:

Query 5: the kinase catalytic lysine (PROT RESNUM COMMON = 189).

Query 6: the terminal nitrogen of the kinase catalytic lysine (PROT RESNUM COMMON = 189 and PROT ATOM NAME = NZ [the terminal nitrogen of lysine])a.
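Queries 1 to 6 can be pictured as simple filters on precalculated distance rows. The following is a minimal sketch only, using an in-memory SQLite table as a stand-in for GEOMETRIES. The lowercase column names, the `find_pdb_ids` helper, and all sample rows and PDB identifiers are hypothetical and chosen for illustration; the production system described here runs on Oracle behind the ADOpt interface.

```python
import sqlite3

# Hypothetical, simplified stand-in for the GEOMETRIES datasource.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE geometries (
        pdb_id             TEXT,
        prot_residue_name  TEXT,
        prot_residue_num   INTEGER,
        prot_resnum_common INTEGER,  -- only filled for aligned families
        prot_atom_name     TEXT,
        lig_atom_name      TEXT,
        distance           REAL      -- precalculated, in angstroms
    )
""")
# Made-up sample rows; the PDB identifiers are placeholders.
rows = [
    ("1AAA", "CYS", 41, None, "SG", "C7", 1.8),
    ("1BBB", "CYS", 41, None, "CB", "O1", 1.9),
    ("1CCC", "CYS", 97, None, "SG", "C2", 1.7),
    ("1DDD", "SER", 41, None, "OG", "C5", 1.5),
    ("1EEE", "LYS", 72, 189, "NZ", "O3", 2.4),
    ("1FFF", "LYS", 33, 189, "CE", "N1", 2.3),
    ("1GGG", "CYS", 41, None, "SG", "C9", 3.2),  # beyond a 2 A cutoff
]
conn.executemany("INSERT INTO geometries VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

ALLOWED = {
    "prot_residue_name", "prot_residue_num", "prot_resnum_common",
    "prot_atom_name", "lig_atom_name",
}

def find_pdb_ids(conn, max_distance, **criteria):
    """Return sorted PDB IDs with a protein-ligand atom pair closer than
    max_distance that also satisfies every column = value criterion."""
    clauses = ["distance < ?"]
    params = [max_distance]
    for column, value in criteria.items():
        if column not in ALLOWED:
            raise ValueError(f"unknown search field: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = ("SELECT DISTINCT pdb_id FROM geometries WHERE "
           + " AND ".join(clauses) + " ORDER BY pdb_id")
    return [row[0] for row in conn.execute(sql, params)]

query1 = find_pdb_ids(conn, 2.0, prot_residue_name="CYS")
query2 = find_pdb_ids(conn, 2.0, prot_residue_name="CYS", prot_atom_name="SG")
query3 = find_pdb_ids(conn, 2.0, prot_residue_num=41, prot_residue_name="CYS")
query4 = find_pdb_ids(conn, 2.0, prot_residue_num=41, prot_atom_name="SG")
query5 = find_pdb_ids(conn, 2.5, prot_resnum_common=189)
query6 = find_pdb_ids(conn, 2.5, prot_resnum_common=189, prot_atom_name="NZ")
```

In this toy data the two lysines carry different original residue numbers (72 and 33) but share the common number 189, which is why queries 5 and 6 need no knowledge of the numbering used in any individual PDB file.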

For ligand atoms, the unique atom identifier is LIG ATOM ID 3D. This identifies a specific atom of a specific instance of a ligand in a specific PDB file. That ligand’s identifier is LIG 3D ID. It is also unlikely that a user would search on either of these values. However, atoms also have a unique 2D identifier (LIG ATOM ID 2D) that is more useful as a search term. Thus, while it is unlikely that a user would search for a distance to the carbonyl oxygen of a particular instance of staurosporine in a single PDB file (LIG ATOM ID 3D search), they would benefit tremendously from being able to search for any structures containing some distance to the carbonyl oxygen of staurosporine (LIG ATOM ID 2D search).

LIG 2D ID is the identifier for the 2D ligand structure. One can search the geometries table by specifying this value and thus return structures where some distance criterion is met by one or more atoms of the ligand. In practice, this is most commonly done indirectly by performing an exact or substructure search. Finally, one can search for a particular ligand atom by using the LIG ATOM NAME field.

One can create very specific queries by combining search terms for the protein and ligand. They can be even more efficacious when searching the KKB or MMPKB. In addition to the residue numbers found in the original PDB files, their GEOMETRIES tables and datasources contain residue numbers in their common numbering schemes (PROT RESNUM COMMON). Thus, referring to the kinase inhibitor staurosporine (Figure 6) as a sample ligand, one can construct queries to locate protein–ligand complexes containing staurosporine under the following conditions:

Figure 6.  The natural product staurosporine.

Query 7: any staurosporine atom <3.5 Å from any leucine atom.

Query 8: any staurosporine atom <3.5 Å from leucine nitrogen.

Query 9: any oxygen in staurosporine <3.5 Å from leucine nitrogen.

Query 10: staurosporine’s carbonyl oxygen <3.5 Å from leucine nitrogen.
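Queries 7 to 10 narrow the same kind of search from the ligand side. The sketch below is illustrative only: it again uses an in-memory SQLite table with hypothetical lowercase column names and made-up rows. The values 11543 (LIG 2D ID of staurosporine) and 34 (LIG ATOM ID 2D of its carbonyl oxygen) are the ones listed in Table 3; the convention that LIG ATOM NAME holds "O" for any ligand oxygen is an assumption based on the parameters shown there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE geometries (
        pdb_id            TEXT,
        lig_2d_id         INTEGER,  -- identifier of the 2D ligand structure
        lig_atom_id_2d    INTEGER,  -- identifier of one atom in that 2D structure
        lig_atom_name     TEXT,
        prot_residue_name TEXT,
        prot_atom_name    TEXT,
        distance          REAL      -- precalculated, in angstroms
    )
""")
# Made-up rows; only the IDs 11543 and 34 come from Table 3.
rows = [
    ("2AAA", 11543, 34, "O", "LEU", "N",   2.9),  # carbonyl O near Leu N
    ("2BBB", 11543, 12, "O", "LEU", "N",   3.1),  # a different oxygen
    ("2CCC", 11543, 20, "N", "LEU", "N",   3.3),  # a ligand nitrogen
    ("2DDD", 11543,  5, "C", "LEU", "CD1", 3.4),  # contact with Leu side chain
    ("2EEE", 11543, 34, "O", "LEU", "N",   4.2),  # too far
    ("2FFF", 99999,  3, "O", "LEU", "N",   2.8),  # some other ligand
]
conn.executemany("INSERT INTO geometries VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

def pdb_ids(where, params):
    """Run one distance query and return the sorted, distinct PDB IDs."""
    sql = f"SELECT DISTINCT pdb_id FROM geometries WHERE {where} ORDER BY pdb_id"
    return [row[0] for row in conn.execute(sql, params)]

LEU_CUTOFF = "prot_residue_name = 'LEU' AND distance < 3.5"

# Query 7: any staurosporine atom near any leucine atom.
query7 = pdb_ids(f"lig_2d_id = ? AND {LEU_CUTOFF}", (11543,))
# Query 8: any staurosporine atom near a leucine nitrogen.
query8 = pdb_ids(f"lig_2d_id = ? AND prot_atom_name = 'N' AND {LEU_CUTOFF}",
                 (11543,))
# Query 9: any staurosporine oxygen near a leucine nitrogen.
query9 = pdb_ids(
    f"lig_2d_id = ? AND lig_atom_name = 'O' AND prot_atom_name = 'N' AND {LEU_CUTOFF}",
    (11543,))
# Query 10: the carbonyl oxygen specifically, addressed by its 2D atom ID.
query10 = pdb_ids(f"lig_atom_id_2d = ? AND prot_atom_name = 'N' AND {LEU_CUTOFF}",
                  (34,))
```

Each added term can only shrink the hit list, which mirrors the non-increasing hit counts one would expect across queries 7 to 10.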

The performance of queries 1–10 in this section is shown in Table 3.

Table 3.   Performance of protein–ligand geometries queries
(a) Queries 1–4 and 7–10 were performed against PRDB’s GEOMETRIES table, while queries 5 and 6 were run against KKB’s KIN GEOMETRIES table.

(b) Query 2 illustrates the value of partitioning the GEOMETRIES table by PROT RESIDUE NAME. The CYS search term is unnecessary because the PROT ATOM NAME of SG can only belong to cysteine. However, if the CYS term is omitted, the query time increases to 7 seconds.

| Query number (a) | 1 | 2 (b) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Distance (Å) | <2 | <2 | <2 | <2 | <2.5 | <2.5 | <3.5 | <3.5 | <3.5 | <3.5 |
| PROT RESIDUE NAME | CYS | CYS | CYS | CYS | | | LEU | LEU | LEU | LEU |
| PROT RESIDUE NUM | | | 41 | 41 | | | | | | |
| PROT RESNUM COMMON | | | | | 189 | 189 | | | | |
| PROT ATOM NAME | | SG | | SG | | NZ | | N | N | N |
| LIG ATOM NAME | | | | | | | | | O | |
| LIG 2D ID | | | | | | | 11543 | 11543 | 11543 | |
| LIG ATOM ID 2D | | | | | | | | | | 34 |
| Time (seconds) | 0.52 | 0.39 | 0.29 | 0.25 | 0.08 | 0.03 | 0.03 | 0.1 | 0.06 | 0.1 |
# PDB ID’s returned14901366155241932776

Performance

The performance of the queries in the previous section is provided in Table 3. Data are stored in Oracle, version 10.2.0.3, running on a Hewlett-Packard BL680G5 with 4× Quad-Core Xeon E7340 2.4 GHz CPUs with 2×4M L2 Cache and 64 GB of RAM under a Red Hat Enterprise Linux AS release 4 Update 6 operating system. The two largest tables are GEOMETRIES and RESIDUE TRIPLET DISTANCE, with 265 million and 161 million rows of data, respectively. GEOMETRIES is partitioned (footnote g) on PROT RESIDUE NAME, with one partition for each of the 20 naturally occurring amino acids and a 21st partition for all 552 remaining amino acids. It contains an average of 12.5 million rows for each natural amino acid and 14.3 million rows for all others combined. KIN GEOMETRIES and MMP GEOMETRIES are much smaller and contain 3.1 million and 2.5 million rows, respectively.
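The effect of partitioning on PROT RESIDUE NAME can be illustrated with a toy model. The sketch below is not how Oracle implements partitions; it is a hypothetical in-memory analogue (the `PartitionedGeometries` class, the `OTHER` bucket name, and all rows are invented for illustration) that shows why supplying the residue name lets a search skip 20 of the 21 partitions.

```python
from collections import defaultdict

NATURAL = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

def partition_key(residue_name):
    """One partition per natural amino acid, one shared by everything else."""
    return residue_name if residue_name in NATURAL else "OTHER"

class PartitionedGeometries:
    """Toy analogue of a table partitioned on PROT RESIDUE NAME."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, pdb_id, residue_name, atom_name, distance):
        row = (pdb_id, residue_name, atom_name, distance)
        self.partitions[partition_key(residue_name)].append(row)

    def search(self, atom_name, max_distance, residue_name=None):
        """Return (sorted PDB IDs, number of rows scanned)."""
        if residue_name is not None:
            keys = [partition_key(residue_name)]  # only one partition is read
        else:
            keys = list(self.partitions)          # every partition is read
        hits, scanned = set(), 0
        for key in keys:
            for pdb_id, res, atom, dist in self.partitions.get(key, []):
                scanned += 1
                if residue_name is not None and res != residue_name:
                    continue
                if atom == atom_name and dist < max_distance:
                    hits.add(pdb_id)
        return sorted(hits), scanned

table = PartitionedGeometries()
table.insert("3AAA", "CYS", "SG", 1.8)
table.insert("3BBB", "CYS", "CB", 1.9)
table.insert("3CCC", "LEU", "N", 1.7)
table.insert("3DDD", "LYS", "NZ", 1.9)
table.insert("3EEE", "MSE", "SE", 1.9)  # non-natural residue, lands in OTHER

with_cys = table.search("SG", 2.0, residue_name="CYS")
without_cys = table.search("SG", 2.0)
```

Both searches return the same structure, but the one that names CYS scans only the cysteine partition. As a consistency check on the figures above, 20 × 12.5 million + 14.3 million ≈ 264 million rows, in line with the stated total of about 265 million.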

Geometries

Performance is generally excellent, as all queries complete in <1 second (Table 3). All but queries 5 and 6 were run against the 265 million row GEOMETRIES table, while those two queries were run against the much smaller KIN GEOMETRIES table.

Triplets

As one cannot know which of the three residue columns in the RESIDUE TRIPLETS datasource contains the values of one’s query, all six combinations have to be tried. This was accomplished by making the RESIDUE TRIPLETS datasource a view that unions its parent table with itself six times (Figure 7). Thus, while the user arbitrarily assigns the values of the query to residues 1, 2, and 3, all combinations are searched. RESIDUE PAIRS was handled in an analogous way.

Figure 7.  A self-union of the RESIDUE TRIPLETS DISTANCE table obviates knowledge of the actual storage column.
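One way such a self-union could be written is sketched below in SQLite. The actual view definition is the one shown in Figure 7 and is not reproduced here, so the table layout, the lowercase column names, and the sample rows are all assumptions. The key point is that each of the six branches relabels the three residue columns and permutes the three distance columns consistently.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE residue_triplet_distance (
        pdb_id TEXT,
        res1 TEXT, res2 TEXT, res3 TEXT,
        dist12 REAL, dist13 REAL, dist23 REAL  -- angstroms
    )
""")

def dist_col(a, b):
    """Name of the stored column holding the distance between residues a and b."""
    lo, hi = sorted((a, b))
    return f"dist{lo}{hi}"

# One SELECT per ordering of the stored columns, six in total.
branches = []
for a, b, c in permutations((1, 2, 3)):
    branches.append(
        f"SELECT pdb_id, res{a} AS res1, res{b} AS res2, res{c} AS res3, "
        f"{dist_col(a, b)} AS dist12, {dist_col(a, c)} AS dist13, "
        f"{dist_col(b, c)} AS dist23 FROM residue_triplet_distance"
    )
conn.execute("CREATE VIEW residue_triplets AS " + " UNION ALL ".join(branches))

# Made-up triplets, stored in arbitrary residue order.
conn.executemany(
    "INSERT INTO residue_triplet_distance VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("4AAA", "ASP", "SER", "HIS", 5.0, 4.8, 3.7),
        ("4BBB", "SER", "HIS", "ASP", 3.8, 7.0, 5.0),  # Ser-Asp too long
        ("4CCC", "HIS", "ASP", "SER", 5.2, 3.9, 4.6),
    ],
)

CRITERIA = """
    WHERE res1 = 'SER' AND res2 = 'HIS' AND res3 = 'ASP'
      AND dist12 BETWEEN 3.5 AND 4.0
      AND dist13 BETWEEN 4.5 AND 5.5
      AND dist23 BETWEEN 4.5 AND 5.5
    ORDER BY pdb_id
"""

def run(source):
    """Apply the same Ser/His/Asp criteria to the raw table or to the view."""
    sql = f"SELECT DISTINCT pdb_id FROM {source} {CRITERIA}"
    return [row[0] for row in conn.execute(sql)]

raw_hits = run("residue_triplet_distance")
view_hits = run("residue_triplets")
```

Against the raw table the query finds nothing, because neither matching triplet happens to be stored in Ser, His, Asp order; against the view both are found. The cost is that the view presents six times as many rows as the parent table.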

Query performance can vary widely when searching RESIDUE TRIPLETS, as shown in Table 4. Queries 1–7 all perform well, with only two taking longer than 1 second. However, that is generally true only if the distances searched are about 6 Å or less. For larger distances, which encompass many more triplets, query time is significantly longer, as shown by queries 8 and 9. Three to four minutes is the typical search time when distance criteria exceed 6 Å. While these are the slowest queries in PRDB, their performance does not seem unreasonable, especially considering the alternatives for finding the same information.

Table 4.   Residue triplet query performance
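| Query number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RESIDUE NAME 1 | Ser | any | Ser | any | Ser or Asp | Asp | Asp | Gly | Tyr |
| RESIDUE NAME 2 | His | His | His | His | His | Asp | Asp | Ser | Ala |
| RESIDUE NAME 3 | Asp | Asp | Asp | Asp | Asp | His | His | Met | Gly |
| Distance 1–2 (Å) | <6 | <6 | 3.5–4 | 3.5–4 | 3.5–4 | 4–6 | 4–5 | 3–5 | 3–4 |
| Distance 1–3 (Å) | <6 | <6 | 4.5–5.5 | 4.5–5.5 | 4.5–5.5 | 4–6 | 4.5–5.5 | 6–8 | 6–7 |
| Distance 2–3 (Å) | <6 | <6 | 4.5–5.5 | 4.5–5.5 | 4.5–5.5 | 4–6 | 4.5–5.5 | 6–10 | 6–8 |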
Time (seconds)1240.98810.320.86398258
# PDB ID’s returned2152713132022772639658

Conclusion

We have shown that PRDB permits the rapid retrieval of experimental protein structures. Queries can include detailed geometric relationships between ligand atoms and their surrounding protein layer and still be quite fast; structures can be retrieved in <1 second by specifying a distance constraint between a specific pair of ligand and protein atoms. The precalculation and inclusion of protein and ligand ring centroids as atoms in the database makes searching for π-stacking interactions just as quick. Residue pairs and triplets enable the search for catalytic triads and other geometric arrangements within the protein and between the protein and ligand.

The data in KKB and MMPKB were generated independently of PRDB and with no knowledge of PRDB’s data model. However, as with all properly designed relational databases, they could be easily joined. Powerful ligand structure searches and ligand–protein geometric searches are possible in KKB and MMPKB even though none of those dimensions were included in the data processing that led to their creation. With the database combination and KKB’s common numbering scheme, one can query for ligand interactions with the kinase catalytic lysine without concern for how to identify that residue in the hundreds of kinase structures. Once hits are retrieved, prealignment of the 3D structures greatly simplifies their comparison whether they are viewed in Benchware 3D Explorer, which is integrated with our ADOpt front end, or any other protein visualization system.

Footnotes
  • a
  • b
  • c

    Active site residues were defined as those contained in a 4Å sphere around the average position of staurosporine after a 3D overlap of all the structures.

  • d

    The full length sequences were obtained from http://www.uniprot.org/ and the catalytic domains were sliced out of them. For all but one of the 24 known MMP’s, a human sequence is available, whereas for MMP18 the sequence from the African clawed frog was used.

  • e

    NCBI makes the latest Blast databases available on its website ftp://ftp.ncbi.nih.gov/blast/db/. Ours was updated on 8/13/09.

  • f

    Tripos, L.P. (2009), Benchware 3D Explorer, version 2.5. http://www.tripos.com.

  • g

    Partitioning a table can be thought of as if the data are in separate tables (partitions) but made to appear as if they are in a single, larger table. It provides the performance of smaller tables with the ease of querying and maintaining just a single, large table.

Acknowledgments

We are indebted to Ms Lynne Greenblatt for testing the application, preparing training materials, and conducting numerous training sessions at the four Wyeth research sites. We also thank Dr Jack Bikker for many useful discussions and input into the database front end.

References
