Querying PRDB, KKB, and MMPKB
The PRDB database consists of 62 tables, with an additional 21 holding the KKB and MMPKB. To date, only the tables described herein have been made available for searching from our user interface, ADOpt (Analyze, Design, Optimize); they probably cover about 90% of users’ common queries. More complex queries can be run directly against the database, but they require SQL expertise. The interface allows any retrieved PDB file to be saved to the hard drive or opened directly in the molecular visualization software Benchware 3D Explorer. Detailed use cases of the informatics system described herein have been published separately (25).
The data are presented to the user in a format that can, in some cases, be quite different from the database tables (Figures 1, 3 and 4). Figure 5 shows the sets of fields, or datasources, available to the user for query building. Datasources beginning with KIN or MMP are available only when searching the KKB or MMPKB, respectively. The remaining datasources are available for querying all three databases.
Figure 5. The fields exposed to the user for constructing queries. Datasources beginning with KIN or MMP are available only when searching KKB or MMPKB, respectively. The rest are available for all queries. Sample data are shown for MMP FEATURES and single field. Unless prefixed with O (original) or C (common), residue numbers are original, with the exception of MMP FEATURES: Deletions, where the common number (and wild-type residue name) is used because no original number can be assigned to a deleted residue.
The KIN FEATURES and MMP FEATURES datasources look markedly different from anything in the database tables (Figures 3 and 4). A number of database transformations (views and functions) rearrange the data in the raw tables into the format seen in Figure 5. The data contained in the fields of these datasources are best described by the sample data in Figure 5. There is one row in KIN FEATURES for each of the 1491 pairs of kinase PDB files. Likewise, MMP FEATURES contains one row for each of its 191 pairs of PDB files. For the feature definitions, all of the residue names and numbers are concatenated into a single field. One can search any of these fields for a feature that contains a specific residue and position, but that is better done using the KIN RESIDUES or MMP RESIDUES datasources. The concatenated fields were generated primarily to present the data to the user in a single row, not for searching.
While KIN FEATURES and MMP FEATURES contain the details of features that apply to all structures in their respective databases, the KIN PROT FEATURES and MMP PROT FEATURES datasources contain just a list of PDB file identifiers and their features from Tables 1 and 2, respectively, that have a specific scope. Querying these datasources is the easiest way to search for structures containing any one of a list of features.
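As an illustration of searching for structures that contain any one of a list of features, the following minimal sketch uses an in-memory SQLite table as a stand-in for KIN PROT FEATURES. The table name `kin_prot_features`, its two columns, the feature labels, and the PDB identifiers are all hypothetical placeholders; the real datasource layout and the feature names in Tables 1 and 2 may differ.

```python
import sqlite3

# Hypothetical stand-in for the KIN PROT FEATURES datasource:
# one row per (PDB file, feature) pair. All names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kin_prot_features (pdb_id TEXT, feature TEXT)")
conn.executemany(
    "INSERT INTO kin_prot_features VALUES (?, ?)",
    [
        ("TOY1", "FEATURE_A"),
        ("TOY1", "FEATURE_B"),
        ("TOY2", "FEATURE_B"),
        ("TOY3", "FEATURE_C"),
    ],
)


def structures_with_any_feature(conn, features):
    """Return sorted PDB IDs that carry at least one of the given features."""
    features = list(features)
    if not features:
        return []
    # One bound parameter per feature keeps the values out of the SQL text.
    placeholders = ", ".join("?" for _ in features)
    sql = (
        "SELECT DISTINCT pdb_id FROM kin_prot_features "
        f"WHERE feature IN ({placeholders}) ORDER BY pdb_id"
    )
    return [row[0] for row in conn.execute(sql, features)]


hits = structures_with_any_feature(conn, ["FEATURE_A", "FEATURE_C"])
print(hits)
```

Because each row holds a single PDB identifier and a single feature, an `IN` list is enough here; no parsing of concatenated fields is needed.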
In addition to querying on kinase and MMP metadata, the datasources in Figure 5 allow for the generation of powerful queries based on protein–ligand geometries that cannot be duplicated at the PDB Website. A closer examination of the GEOMETRIES table and datasource makes this more apparent.
The pair of objects described in GEOMETRIES comprises one protein component and one ligand component. In each case, there are several identifiers that can be used to specify the component of interest.
For a protein atom, a unique atom identifier (PROT ATOM ID), generated by the data loading routines, serves as its primary identifier. Because it would be rare for a user to know and search on that identifier, the table also contains the name for that atom (PROT ATOM NAME; e.g., CA, SG) and its containing residue’s name and number (PROT RESIDUE NAME and PROT RESIDUE NUM, respectively). This allows the user to query for structures containing a ligand atom, for example, <2 Å from:
Query 1: any cysteine (PROT RESIDUE NAME = CYS).
Query 2: the sulfur of any cysteine (PROT ATOM NAME = SG [the sulfur of cysteine]).
Query 3: cysteine 41 (PROT RESIDUE NUM = 41 and PROT RESIDUE NAME = CYS).
Query 4: the sulfur of cysteine 41 (PROT RESIDUE NUM = 41 and PROT ATOM NAME = SG).
In the last two examples, one would have to know that the protein of interest contains a cysteine at position 41. If the criteria shown were the only ones in the query, it would likely return extraneous proteins. However, when searching for kinase or MMP geometries, the common residue numbering schemes allow for the generation of quite specific queries that would otherwise be nearly impossible. For example, it is trivial to search for structures containing ligands having an atom <2.5 Å from:
Query 5: the kinase catalytic lysine (PROT RESNUM COMMON = 189).
Query 6: the terminal nitrogen of the kinase catalytic lysine (PROT RESNUM COMMON = 189 and PROT ATOM NAME = NZ [the terminal nitrogen of lysine]).
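The difference between searching on original and common residue numbers can be sketched with a toy table. This is an illustration only: the `geometries` table below is a simplified, hypothetical stand-in for the GEOMETRIES datasource, the PDB identifiers are invented, and the original residue numbers (72 and 33) and the common number 250 are made up. Only the common number 189 for the catalytic lysine and the atom name NZ come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE geometries (
           pdb_id TEXT,
           prot_residue_name TEXT,
           prot_residue_num INTEGER,    -- numbering from the original PDB file
           prot_resnum_common INTEGER,  -- common numbering scheme
           prot_atom_name TEXT,
           distance REAL                -- protein atom to ligand atom, in angstroms
       )"""
)
conn.executemany(
    "INSERT INTO geometries VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("TOY1", "LYS", 72, 189, "NZ", 2.3),  # catalytic lysine, numbered 72
        ("TOY2", "LYS", 33, 189, "NZ", 2.4),  # catalytic lysine, numbered 33
        ("TOY2", "LYS", 33, 189, "CE", 3.1),
        ("TOY3", "LYS", 72, 250, "NZ", 2.2),  # unrelated lysine that is also numbered 72
    ],
)


def pdb_ids(conn, where, params):
    """Run a WHERE clause against the toy table and return sorted PDB IDs."""
    sql = f"SELECT DISTINCT pdb_id FROM geometries WHERE {where} ORDER BY pdb_id"
    return [row[0] for row in conn.execute(sql, params)]


# Original numbering: misses TOY2 and picks up the unrelated lysine in TOY3.
by_original = pdb_ids(
    conn,
    "prot_residue_name = ? AND prot_residue_num = ? AND distance < ?",
    ("LYS", 72, 2.5),
)

# Common numbering (the pattern of Query 6): finds the catalytic lysine in both.
by_common = pdb_ids(
    conn,
    "prot_resnum_common = ? AND prot_atom_name = ? AND distance < ?",
    (189, "NZ", 2.5),
)

print(by_original, by_common)
```

In this toy data, the search on the original number returns one extraneous structure and misses one relevant structure, whereas the search on the common number returns exactly the two structures with a ligand atom near the catalytic lysine.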
For ligand atoms, the unique atom identifier is LIG ATOM ID 3D. This identifies a specific atom of a specific instance of a ligand in a specific PDB file. That ligand’s identifier is LIG 3D ID. It is also unlikely that a user would search on either of these values. However, atoms also have a unique 2D identifier (LIG ATOM ID 2D) that is more useful as a search term. Thus, while it is unlikely that a user would search for a distance to the carbonyl oxygen of a particular instance of staurosporine in a single PDB file (LIG ATOM ID 3D search), they would benefit tremendously from being able to search for any structures containing some distance to the carbonyl oxygen of staurosporine (LIG ATOM ID 2D search).
LIG 2D ID is the identifier for the 2D ligand structure. One can search the geometries table by specifying this value and thus return structures where some distance criterion is met by one or more atoms of the ligand. In practice, this is most commonly done indirectly by performing an exact or substructure search. Finally, one can search for a particular ligand atom by using the LIG ATOM NAME field.
One can create very specific queries by combining search terms for the protein and ligand. They can be even more efficacious when searching the KKB or MMPKB. In addition to the residue numbers found in the original PDB files, their GEOMETRIES tables and datasources contain residue numbers in their common numbering schemes (PROT RESNUM COMMON). Thus, referring to the kinase inhibitor staurosporine (Figure 6) as a sample ligand, one can construct queries to locate protein–ligand complexes containing staurosporine under the following conditions:
Query 7: any staurosporine atom <3.5 Å from any leucine atom.
Query 8: any staurosporine atom <3.5 Å from leucine nitrogen.
Query 9: any oxygen in staurosporine <3.5 Å from leucine nitrogen.
Query 10: staurosporine’s carbonyl oxygen <3.5 Å from leucine nitrogen.
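Queries 7–10 narrow progressively by adding one criterion at a time, which can be sketched as a small query builder. This is a hypothetical illustration, not the ADOpt implementation: the lower-case column names, the whitelist, the toy rows, and the PDB identifiers are assumptions. The ligand identifier 11543, the 2D atom identifier 34, and the field values LEU, N, and O are taken from Table 3.

```python
import sqlite3

# Columns a caller may filter on (hypothetical names mirroring the datasource fields).
ALLOWED = {
    "prot_residue_name", "prot_atom_name",
    "lig_2d_id", "lig_atom_id_2d", "lig_atom_name",
}


def build_query(max_distance, **criteria):
    """Build a parameterized query: a distance cutoff plus optional equality filters."""
    unknown = set(criteria) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    clauses, params = ["distance < ?"], [max_distance]
    for column, value in criteria.items():
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = (
        "SELECT DISTINCT pdb_id FROM geometries WHERE "
        + " AND ".join(clauses)
        + " ORDER BY pdb_id"
    )
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE geometries (
           pdb_id TEXT, prot_residue_name TEXT, prot_atom_name TEXT,
           lig_2d_id INTEGER, lig_atom_id_2d INTEGER, lig_atom_name TEXT,
           distance REAL)"""
)
conn.executemany(
    "INSERT INTO geometries VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("TOY1", "LEU", "N", 11543, 34, "O", 2.9),   # the carbonyl oxygen
        ("TOY2", "LEU", "N", 11543, 12, "O", 3.2),   # a different oxygen
        ("TOY3", "LEU", "N", 11543, 5, "C", 3.4),    # a ligand carbon
        ("TOY4", "LEU", "CD1", 11543, 7, "C", 3.3),  # leucine side chain only
        ("TOY5", "LEU", "N", 11543, 34, "O", 4.1),   # beyond the cutoff
        ("TOY6", "LEU", "N", 999, 3, "O", 3.0),      # a different ligand
    ],
)


def run(max_distance, **criteria):
    sql, params = build_query(max_distance, **criteria)
    return [row[0] for row in conn.execute(sql, params)]


q7 = run(3.5, prot_residue_name="LEU", lig_2d_id=11543)
q8 = run(3.5, prot_residue_name="LEU", prot_atom_name="N", lig_2d_id=11543)
q9 = run(3.5, prot_residue_name="LEU", prot_atom_name="N",
         lig_2d_id=11543, lig_atom_name="O")
q10 = run(3.5, prot_residue_name="LEU", prot_atom_name="N", lig_atom_id_2d=34)
print(q7, q8, q9, q10)
```

Restricting the filterable columns to a whitelist and binding all values as parameters keeps the generated SQL safe even when the criteria come from a user-facing form.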
The performance of queries 1–10 in this section is shown in Table 3.
Table 3. Performance of protein–ligand geometries queries
| Query number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PROT RESIDUE NAME | CYS | CYS | CYS | CYS | | | LEU | LEU | LEU | LEU |
| PROT RESIDUE NUM | | | 41 | 41 | | | | | | |
| PROT RESNUM COMMON | | | | | 189 | 189 | | | | |
| PROT ATOM NAME | | SG | | SG | | NZ | | N | N | N |
| LIG ATOM NAME | | | | | | | | | O | |
| LIG 2D ID | | | | | | | 11543 | 11543 | 11543 | |
| LIG ATOM ID 2D | | | | | | | | | | 34 |
| # PDB IDs returned | 1490 | 1366 | 15 | 5 | 24 | 19 | 32 | 7 | 7 | 6 |
Because one cannot know which of the three residue columns in the RESIDUE TRIPLETS datasource contains which value of one’s query, all six orderings have to be tried. This was accomplished by making the RESIDUE TRIPLETS datasource a view that unions its parent table with itself six times (Figure 7). Thus, while the user arbitrarily assigns the values of the query to residues 1, 2, and 3, all combinations are searched. RESIDUE PAIRS was handled in an analogous way.
Query performance can vary widely when searching RESIDUE TRIPLETS, as shown in Table 4. Queries 1–7 all perform well, with only two taking longer than 1 second. However, that generally holds only when the distances searched are about 6 Å or less. For larger distances, which encompass many more triplets, query time is significantly longer, as shown by queries 8 and 9. Three to four minutes is the typical search time when distance criteria exceed 6 Å. While these are the slowest queries in PRDB, the performance does not seem unreasonable, especially considering the alternatives for finding the same information.
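The permutation view described above can be sketched as follows. This is a minimal illustration under stated assumptions: the parent table `triplets_raw`, its column names, and the toy rows are hypothetical, and the distance columns that the real table must carry are omitted for brevity. The view is built as a UNION ALL of six SELECT statements, one per ordering of the three stored residue columns.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
# Hypothetical parent table: each triplet is stored once, in arbitrary column order.
conn.execute(
    "CREATE TABLE triplets_raw (pdb_id TEXT, res_a TEXT, res_b TEXT, res_c TEXT)"
)
conn.executemany(
    "INSERT INTO triplets_raw VALUES (?, ?, ?, ?)",
    [
        ("TOY1", "ASP", "HIS", "SER"),
        ("TOY2", "HIS", "SER", "ASP"),
        ("TOY3", "GLY", "SER", "MET"),
    ],
)

# One SELECT per ordering of the stored columns (3! = 6), exposed under fixed names.
selects = [
    f"SELECT pdb_id, {first} AS residue_name_1, {second} AS residue_name_2, "
    f"{third} AS residue_name_3 FROM triplets_raw"
    for first, second, third in permutations(("res_a", "res_b", "res_c"))
]
conn.execute("CREATE VIEW residue_triplets AS " + " UNION ALL ".join(selects))


def find_triplet(conn, name_1, name_2, name_3):
    """Return PDB IDs containing the triplet, whatever order it was stored in."""
    sql = (
        "SELECT DISTINCT pdb_id FROM residue_triplets "
        "WHERE residue_name_1 = ? AND residue_name_2 = ? AND residue_name_3 = ? "
        "ORDER BY pdb_id"
    )
    return [row[0] for row in conn.execute(sql, (name_1, name_2, name_3))]


hits = find_triplet(conn, "SER", "HIS", "ASP")
view_rows = conn.execute("SELECT COUNT(*) FROM residue_triplets").fetchone()[0]
print(hits, view_rows)
```

The view exposes six rows for every stored triplet, so the user can assign query values to residues 1, 2, and 3 in any order at the cost of scanning six times as many rows.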
Table 4. Residue triplet query performance
| Query number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| RESIDUE NAME 1 | Ser | any | Ser | any | Ser or Asp | Asp | Asp | Gly | Tyr |
| RESIDUE NAME 2 | His | His | His | His | His | Asp | Asp | Ser | Ala |
| RESIDUE NAME 3 | Asp | Asp | Asp | Asp | Asp | His | His | Met | Gly |
| # PDB IDs returned | 215 | 2713 | 13 | 202 | 27 | 72 | 6 | 396 | 58 |