Keywords:

  • binding sites;
  • clustering;
  • distance;
  • OPTICS;
  • PDB;
  • sequence

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Results and Discussion
  5. Methods
  6. Conclusions
  7. Acknowledgements
  8. References

The Protein Data Bank contains the description of approximately 27 000 protein–ligand binding sites. Most of the ligands at these sites are biologically active small molecules that affect the biological function of the protein. The classification of their binding sites may lead to relevant results in drug discovery and design. Clusters of similar binding sites were created here by a hybrid approach, based on both sequence and spatial structure, using the OPTICS clustering algorithm. A dissimilarity measure was defined: a distance function on the amino acid sequences of the binding sites. All the binding sites in the Protein Data Bank were clustered according to this distance function, and it was found that the clusters corresponded well to the Enzyme Commission numbers of the entries. The results, color coded by the Enzyme Commission numbers of the proteins and containing the 20 967 binding sites clustered, are available as html files in three parts at http://pitgroup.org/seqclust/.


Abbreviations

EC, Enzyme Commission; gp, gap penalty; OPTICS, Ordering Points to Identify the Clustering Structure; PDB, Protein Data Bank

Introduction

In recent years, the exploration of the human genome has received wide publicity. Although somewhat less emphasized, another important bioinformatics resource is the exponentially growing, publicly available Protein Data Bank (PDB) [1], containing more than 55 000 biological structures at the present time.

The three-dimensional structures of small molecules, e.g. drug molecules, can usually be calculated from their chemical composition. Several databases exist that contain millions of ligands. An example of this is the freely available ZINC database [2], created from the catalogues of compound manufacturers. In contrast with ligands, the three-dimensional structures of proteins cannot be calculated easily; therefore, the importance of the rapid growth of the PDB cannot be overestimated.

Most antimicrobial drug molecules act as enzyme inhibitors. Inhibitors need to bind more strongly to the enzyme than the substrate of the enzyme does; consequently, the chemical and geometrical properties of the binding sites are of utmost importance in drug discovery and design.

The PDB contains the three-dimensional structures of more than 55 000 entries. In a separate study [3], we collected, verified and cleaned the list of approximately 27 000 binding sites found in the PDB. During the process of the identification of these binding sites, we filtered out crystallization artifacts and covalently bound small molecules, and also considered broken peptide chains, modified amino acids and incorrectly labeled HET groups. The resulting cleaned, strictly structured RS-PDB database [3] can serve as an input for different data mining algorithms. One such technique of classification is clustering. By the clustering of binding sites it is possible to create binding site similarity classes. These classes can be useful for the classification of protein–ligand interaction.

In this article, we present a fast, sequence-based method for binding site clustering that takes into account amino acid sequences in the close neighborhood of binding sites. Our method is a hybrid, in the sense that it uses the sequence information together with steric data from the PDB in a clearly structured manner.

Previous work

There is a very rich literature describing techniques for the identification of biological functions from structural protein information by the application of highly nontrivial mathematical tools [4,5]. Some of these tools have been applied to determine or analyze protein–protein interaction network topology [6–10] or binding sites [6,11]. A considerable amount of work has also been performed to devise structural properties that are independent of the polypeptide sequence order [12–14]. Unlike other binding site clustering solutions in the literature [15–18], we used a hybrid of order-independent methods that analyzes the three-dimensional structure of the binding site together with an order-analysis method; one of the main features of our order-analysis method is that it is capable of handling multiple polypeptide chains in the same binding site (Fig. 1).

Figure 1.  A binding site with four protein chains (PDBID: 1CT8). Each chain is colored differently.

Results and Discussion

Our main result was the OPTICS (Ordering Points to Identify the Clustering Structure)-based clustering of the 20 967 binding sites found. In order to verify the capabilities of the clustering method, we need to compare the clusters found with verified biological functions.

Verification of results: biological relevance

Ideally, proteins of the same or closely related functions ought to be assigned in the same cluster. We considered the Enzyme Commission (EC) number classification of enzymes [19], and color coded the EC numbers such that closely related functions were given similar colors, as provided in http://pitgroup.org/seqclust/bsites_AAcodes/EC_colour.html.

The color-coded clusters, together with the ordinal number of the binding site, the PDB ID, the cluster ID and the EC number can be found in three large html files (Page1, Page2, Page3) under http://pitgroup.org/seqclust/. The clusters correspond to concave regions in the figure.

The deviations of the EC numbers in all the clusters were also computed, and are given in the online table http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt. In most of the clusters, the deviation is zero; the average deviation is 1.71%.

We believe that the validation of the enzymatic functions through EC numbers shows that our clustering method is an adequate solution for binding site clustering and classification.

Parameter settings and examples

We present here, as examples, four binding sites from the largest cluster (element count: 448) (see Fig. 2). All four proteins are blood clotting factors. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases).

Figure 2.  Four binding sites (PDB IDs: 1ZPB, 1RXP, 1C5Z, 2BZ6) from the same cluster. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases). More analysis on the homogeneity of the clusters is given in http://pitgroup.org/seqclust/EC_deviation.txt.

From the second largest cluster (element count: 188), three binding sites were visualized (Fig. 3). The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored deep violet, and almost all members of the cluster (between line numbers 1224 and 1411) have EC numbers 3.4.23.16 (HIV-1 retropepsins). More detailed analysis of the homogeneity of the clusters is given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.

Figure 3.  Three binding sites from the same cluster (one site from PDB ID 1BDL and two sites from PDB ID 1W5V); these are HIV-1 proteases. The whole cluster is given in the online figure http://www.pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored deep violet, and almost all the members of the cluster (between line numbers 1210 and 1435) have EC numbers of the form 3.4.23.16 (HIV-1 retropepsins). More analysis on the homogeneity of the clusters is given in http://www.pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.

Clustering quality measurement

The quality of clustering depends on several parameters. These include the distance function used to determine the similarity or distance of the objects, and the parameters of the clustering algorithm. In order to obtain appropriate feedback about the quality of clustering with a given parameter setting, a quality metric needs to be defined. For this purpose, we used the ‘silhouette coefficient’ [20]. The advantage of the silhouette coefficient is that it is completely independent of the type of data being clustered; it uses only object distances and cluster membership assignments for its determination. Informally, the silhouette coefficient measures how distinct the clusters are, by comparing how close each object lies to its own cluster with how close it lies to the neighboring clusters. More exactly, the silhouette coefficient is defined as the average of the silhouettes taken over all the objects, where the silhouette of object i is defined as (b_i – a_i)/max(a_i, b_i); here, a_i is the average distance of object i to the other points of its own cluster, and b_i is the minimum, taken over the other clusters, of the average distance of object i to the points of that cluster. It should be noted that, typically, a_i < b_i, and so the silhouette is equal to 1 – (a_i/b_i). Clearly, for good clustering, the typical a_i value is much less than b_i; therefore, the silhouettes of the objects and the silhouette coefficient are close to unity.
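The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the list-of-lists distance matrix and the convention of assigning a silhouette of zero to an object that is alone in its cluster are choices made here, not details taken from [20] or from the software used in this study.

```python
from statistics import mean

def silhouette_coefficient(dist, labels):
    """Average silhouette (b_i - a_i) / max(a_i, b_i) over all objects.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each object, in the same order as dist
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: a_i or b_i is undefined, use silhouette 0.
            values.append(0.0)
            continue
        a_i = mean([dist[i][j] for j in own])
        b_i = min(mean([dist[i][j] for j in members])
                  for other, members in clusters.items() if other != lab)
        denom = max(a_i, b_i)
        values.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return mean(values)

# Four objects on a line, forming two well-separated clusters.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]
print(round(silhouette_coefficient(dist, [0, 0, 1, 1]), 4))  # 0.8997
```

Because only the distance matrix and the labels enter the computation, the same sketch applies to any distance function, which is the property emphasized in the text.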

The data contained in Table 1 are based on empirical measurements. The values of the silhouette coefficient are strongly dependent on the applied distance function. Therefore, it is questionable whether clusters can be classified into rigid quality categories on the basis of the silhouette coefficient value. However, it is undoubtedly useful for comparing the quality of the clusters.

Table 1.   Cluster quality descriptions based on silhouette coefficient values, from [20].

Silhouette coefficient    Clustering quality
0.00–0.25                 Clusters cannot be adequately identified; cluster borders are not obvious
0.25–0.50                 Clusters can be identified, but there are numerous unclassifiable points (‘noise’)
0.50–0.70                 Most of the data/points can be classified
0.70–1.00                 Excellent distinguishable clusters

By definition, the silhouette coefficient requires every binding site to be assigned to a cluster. Thus, the silhouette coefficient value also reflects the amount of noise contained in the database. The clustering algorithm used in this study is the OPTICS algorithm (see later). This algorithm allows some binding sites to be marked as ‘noise’ (thus not assigning them to any cluster). It does not seem reasonable for binding sites that are ‘noise’ to be taken into account twice (once when the OPTICS algorithm marks them, and once during the calculation of the silhouette coefficient). Therefore, binding sites marked as ‘noise’ were not taken into account when calculating the silhouette coefficient. Nevertheless, for completeness, we show (Fig. 4) how the value of the silhouette coefficient would change if binding sites marked as ‘noise’ were taken into consideration with a silhouette value of zero.
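The two ways of treating ‘noise’ can be compared with a short sketch. The function name and its arguments are hypothetical; it assumes that the silhouettes of the clustered binding sites have already been computed and that only the number of noise sites is known.

```python
def average_silhouette(clustered_values, n_noise=0, count_noise=False):
    """Average silhouette, optionally counting each noise object as 0.

    clustered_values -- silhouettes of the objects assigned to a cluster
    n_noise          -- number of objects marked as noise
    count_noise      -- if True, every noise object contributes a 0
    """
    n = len(clustered_values) + (n_noise if count_noise else 0)
    if n == 0:
        raise ValueError("no objects to average over")
    return sum(clustered_values) / n

values = [0.9, 0.8, 0.7]
print(round(average_silhouette(values, n_noise=1), 3))                    # 0.8
print(round(average_silhouette(values, n_noise=1, count_noise=True), 3))  # 0.6
```

Counting noise with a silhouette of zero can only lower (or leave unchanged) a positive average, which is why the two variants are reported separately.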

Figure 4.  Silhouette coefficient dependence on parameter MINPTS when unclustered binding sites are also taken into account at silhouette coefficient determination (gp = 1/10). The color coding is given in Table 2.

Table 2.   Colors assigned to different OPTICS cut-off levels.

Color      Cut-off level (%)
Red        20
Green      30
Blue       40
Cyan       50
Magenta    60
Yellow     70

Effects of parameters on the quality of clustering and cluster size distribution

Within our binding site model, the distance function and clustering algorithm, three main parameters affected the properties of clustering: OPTICS MINPTS, OPTICS cut-off level and gap penalty (gp) of the distance function. We examined how these parameters affected the quality of clustering measured by the silhouette coefficient. The results are given in Figs 4 and 5.

Figure 5.  Number of binding sites contained in clusters as a function of the number of clusters allowed to be used (gp = 1/10). The color coding is given in Table 2.

  • Effect of gp. Increasing gp slightly improved the quality of clustering. This is understandable if we consider that a less strict (lower) gp automatically decreases the average distance between the clusters, making them less distinct.
  • Effect of MINPTS. On increasing MINPTS, two main effects were observed. An increase in MINPTS yields better quality clustering. However, it also yields many more binding sites classified as ‘noise’. The main cause of the latter effect is that the clusters that exist in the database, but contain fewer points than MINPTS, are not recognized; they are marked as ‘noise’. On the basis of this observation, it can be stated that our binding site database contains numerous small clusters.
  • Effect of OPTICS cut-off level. Increasing the cut-off level decreases the quality of clustering, and also the number of binding sites marked as ‘noise’. The application of an extremely high cut-off level places almost all binding sites into the same cluster; the quality of such clustering can by no means be considered as high.

In conclusion, low MINPTS and low cut-off levels yield the best clustering quality (whilst covering 70–80% of the binding sites found in the PDB). In Figs 4 and 5, we represent the dependence of clustering quality on these parameters.

Methods

Binding site representation

As a first step, an exact definition of a binding site must be provided. For easy algorithmic handling, we stored the binding sites found in the PDB in a compact data structure.

The definition of binding sites

A binding site is defined as a set of atom pairs; the first atom of the pair belongs to the protein, and the second atom to the bound ligand, such that their distance is equal to the sum of the van der Waals’ radii, calculated differently for different atom types. That is, only pairs within noncovalent binding distances are included in the list. Binding sites containing covalently bound ligands are not considered in this work, as our main motivation was to review pharmacologically significant binding sites.

A ‘binding amino acid (or residue)’ is an amino acid with at least one of its atoms in the binding atom pair. A ‘binding amino acid sequence’ is an amino acid sequence that contains at least one binding amino acid. Basically, binding sites are represented by storing all the binding amino acid sequences of all the protein chains that are present at the particular binding site.

Binding sites were extracted from the RS-PDB database described in [21] and [3]. By using this definition for binding sites, all amino acids from a given amino acid sequence that have at least one atom contained in an atom pair set (describing a binding site) can be identified.

Residue sequence representation

An amino acid sequence refers here to a chain of amino acids connected by peptide bonds that is of maximal length (i.e. it cannot be continued with further amino acids at either end).

It should be noted that multiple amino acid sequences might occur in the immediate vicinity of a single binding site, making binding site distance/similarity determination fairly complicated. An example of a binding site with four neighboring polypeptide chains can be seen in Fig. 1.

Binding amino acid sequences were first extracted from the binding sites of the RS-PDB database [3,21] and then simplified as follows.

A string was assigned to each amino acid sequence in a binding site. In this string, residues participating in the bond were indicated by their one-character code; nonbinding amino acids were indicated by ‘-’. As our purpose was to deal with only the binding sections, the pre- and postfixes consisting of purely nonbinding amino acids (or, in our notation, ‘-’) were deleted. Hence, all the strings constructed in this way start and end with a binding amino acid.
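The construction of such a string can be sketched as follows. The function name and the input format (one-letter codes plus one Boolean flag per residue) are assumptions made for this illustration; they are not taken from the RS-PDB tools.

```python
def binding_string(residues, binding_flags):
    """Build the simplified string of one amino acid sequence.

    residues      -- one-letter codes of the chain, in sequence order
    binding_flags -- True where the residue takes part in the binding
    """
    if len(residues) != len(binding_flags):
        raise ValueError("one flag is required per residue")
    marked = "".join(aa if flag else "-"
                     for aa, flag in zip(residues, binding_flags))
    # Remove the purely nonbinding prefix and postfix.
    return marked.strip("-")

flags = [False, False, True, False, False, True, True, False, True]
print(binding_string("MKHAGTTLD", flags))  # H--TT-D
```

A chain without any binding residue yields the empty string under this sketch, which matches the zero-length sequences used later when two binding sites contain different numbers of chains.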

A binding amino acid sequence constructed and transformed in this way (from PDB entry 2BZ6) is as follows: H.......................................................................................................................................................................TT--D.................................................................................................................................................................................P....................................DSCK....S................................................VSWGQGC......................G.

Distance function

In order to use a clustering algorithm, we need to define a distance function. The binding sites are represented by all amino acid sequences that participate in the bond with the ligand. Consequently, we need to define the distance of the sequence sets situated in the binding sites. This is accomplished first by defining the distance of two sequences (described in the next section), and then by defining the distance of the sequence sets. The reason for this complexity is the fact that more than one binding sequence can be present in a binding site (see Fig. 1).

Sequence comparison algorithm

To measure the distances of the binding sections of amino acid sequences constructed in this way, we used a modified version of the algorithm employed to calculate the Levenshtein distance (denoted as L). The modifications involved the assignment of different costs to gaps depending on where they were inserted, whereas amino acid mismatches were simply penalized by the value unity.

The costs of aligned binding and nonbinding amino acids were as follows:

  •  The cost of two aligned, different amino acids is unity.
  •  The cost of aligned, matching amino acids is zero.

Gaps were penalized as follows:

  •  The insertion of a gap with a length of one unit (one amino acid) costs gp if the gap is aligned with a nonbinding amino acid in the other sequence. If a gap is aligned with a binding amino acid, its cost is unity.
  •  The insertion of gaps at the end of sequences is only penalized if they are aligned with binding amino acids. Gaps inserted at either end of a sequence have a zero cost if they are aligned with nonbinding amino acids.

It can be shown that the Levenshtein distance (and also our modified version) fulfills the required properties for being a metric. Non-negativity and symmetry can be seen directly from the definition (assuming non-negative costs). It is also obvious that a zero distance can only be achieved by comparing the same objects: L(x,y) = 0 if, and only if, x = y (assuming that every compared sequence starts and ends with a binding amino acid). What is left to prove is the triangle inequality: for every s, t, r strings (binding amino acid sequences), L(s,t) ≤ L(s,r) + L(r,t).

In other words, the triangle inequality asserts that changing s to t via r cannot cost less than changing s to t directly. As the Levenshtein distance (by definition) is the minimum possible total cost of operations transforming s into t, and the sequence of operations that transforms s into r and then r into t is also an allowed sequence of operations, it cannot have a lower total cost than L(s,t), as this would contradict the optimality of L(s,t). (What we may need to prove at this point is that the algorithm used indeed calculates the defined distance L.) This reasoning is also applicable to our modified version of the Levenshtein distance; the only difference is that we have a somewhat more sophisticated set of costs for the insertion, deletion and changing of the characters. We assume that the costs are non-negative, and that any binding amino acid sequence compared with our distance function starts and ends with a binding amino acid. We can now reformulate the costs defined above in terms of the ‘insert’, ‘delete’ and ‘change’ operations.

Costs for insertion
  •  Insertion of ‘-’ to the end of the sequence: 0.
  •  Insertion of ‘-’ between the first and last binding amino acids of the sequence: gp.
  •  Insertion of a one-letter code of a binding amino acid: 1.
Costs for deletion
  •  Deletion of ‘-’ from the end of the sequence: 0.
  •  Deletion of ‘-’ between the first and last unchanged binding amino acids of the sequence: gp.
  •  Deletion of a one-letter code of a binding amino acid: 1.
Costs for character change
  •  For matching characters: 0.
  •  For nonmatching characters: 1.

If we want to transform a binding amino acid sequence s into t using the above operations, we cannot expect to obtain a lower total cost by first transforming s to an arbitrary r and then r to t (compared with the direct transformation of s to t). This means that the triangle inequality holds.
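One possible reading of the cost scheme above is the following dynamic program. It is a sketch rather than the implementation used in this study: the function names are hypothetical, and it assumes that a gap counts as an ‘end’ gap when it lies before the first or after the last character of the sequence into which it is inserted.

```python
def sequence_distance(s, t, gp=0.1):
    """Modified Levenshtein distance between two binding strings.

    s, t -- strings of one-letter codes and '-' (nonbinding residues)
    gp   -- cost of an interior gap aligned with a nonbinding residue
    """
    n, m = len(s), len(t)

    def gap_cost(ch, other_pos, other_len):
        # ch is aligned with a gap placed at other_pos in the other string.
        if ch != "-":
            return 1.0   # gap aligned with a binding amino acid
        if other_pos == 0 or other_pos == other_len:
            return 0.0   # end gap aligned with a nonbinding amino acid
        return gp        # interior gap aligned with a nonbinding amino acid

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap_cost(s[i - 1], 0, m)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap_cost(t[j - 1], 0, n)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = d[i - 1][j - 1] + (0.0 if s[i - 1] == t[j - 1] else 1.0)
            gap_in_t = d[i - 1][j] + gap_cost(s[i - 1], j, m)
            gap_in_s = d[i][j - 1] + gap_cost(t[j - 1], i, n)
            d[i][j] = min(change, gap_in_t, gap_in_s)
    return d[n][m]

print(sequence_distance("H-D", "H--D"))  # 0.1 (one interior gap)
print(sequence_distance("HD", "HE"))     # 1.0 (one mismatch)
print(sequence_distance("H--D", ""))     # 2.0 (two binding residues)
```

The table has (|s| + 1)(|t| + 1) entries, so a single comparison takes time proportional to the product of the two string lengths.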

Binding site comparison method

The input of the distance function described above is two strings that represent amino acid sequences extracted from binding sites. However, our aim is to measure the distance of the binding sites, not just of single amino acid sequences. We have seen in Fig. 1 (section ‘Previous work’) that multiple amino acid sequences might occur in the immediate vicinity of a binding site. Therefore, we also need to define the distance of the sequence sets representing binding sites.

For this purpose, a complete bipartite graph is defined. This is a graph in which the set of vertices can be divided into two disjoint sets, A and B, such that no edge has both of its endpoints in the same set and every vertex of A is connected to every vertex of B, so that the number of edges is |A|·|B|. In our construction, |A| = |B|.

  •  Points of the vertex sets A and B correspond to the amino acid sequences of the first and second binding sites, respectively. If the numbers of amino acid sequences are not equal in the two binding sites, amino acid sequences with zero length are added to the smaller set.
  •  Weights are assigned to all edges of this graph that correspond to the distance of the two amino acid sequences connected by the edge. By ‘distance’, we mean the distance defined in the previous section.

The distance of the sequence sets A and B is then defined as the total weight of the minimum weight perfect matching [22] in the graph defined above.

It should be noted that, by the definition of the previous section, the distance of an arbitrary residue sequence A to a zero-length sequence B is the binding amino acid count of sequence A.
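The set distance can be sketched as follows. For brevity, the sketch searches all perfect matchings by brute force, which is only feasible for the handful of chains found at a typical binding site; a polynomial-time assignment algorithm would normally be used instead. The names site_distance and toy_seq_dist are hypothetical, and toy_seq_dist is merely a stand-in for the sequence distance of the previous section, chosen so that the distance to a zero-length sequence equals the binding amino acid count.

```python
from itertools import permutations

def site_distance(site_a, site_b, seq_dist):
    """Weight of the minimum weight perfect matching of two sequence sets.

    site_a, site_b -- lists of binding strings, one per chain
    seq_dist       -- distance function on two binding strings
    """
    a, b = list(site_a), list(site_b)
    size = max(len(a), len(b))
    # Pad the smaller set with zero-length sequences.
    a += [""] * (size - len(a))
    b += [""] * (size - len(b))
    weights = [[seq_dist(x, y) for y in b] for x in a]
    # Brute force: every permutation is one perfect matching.
    return min(sum(weights[i][p[i]] for i in range(size))
               for p in permutations(range(size)))

def toy_seq_dist(x, y):
    # Stand-in only: position-wise comparison of the binding residues.
    bx, by = x.replace("-", ""), y.replace("-", "")
    mismatches = sum(1 for p, q in zip(bx, by) if p != q)
    return mismatches + abs(len(bx) - len(by))

print(site_distance(["HD", "ST"], ["ST", "HE", "G"], toy_seq_dist))  # 2
```

In the example, "ST" is matched with "ST" (cost 0), "HD" with "HE" (cost 1) and the padded zero-length sequence with "G" (cost 1).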

Binding site distance normalization

The expected distance of two randomly generated binding sites will be proportional to the sum of the binding amino acids occurring at the binding sites. The maximum achievable distance is always less than the sum of the binding amino acids.

The distance of two binding sites calculated using the function described in the previous section does not describe the binding site dissimilarity alone. If the distance of two binding sites is three, it may be that they have three binding amino acids each, and hence they may be completely different. However, a distance of three between two binding sites with 30 binding residues each is approximately a 10% difference, and so these binding sites might be almost the same.

Therefore, it is necessary to ‘normalize’ the distances. We did this by dividing all distances by the sum of the binding amino acids of the two binding sites being compared. The result of this operation yields a value between zero and unity that can also be interpreted as a percentage of the absolute maximum possible distance of the two binding sites.
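As a small worked example of this normalization, the sketch below divides a raw distance by the sum of the binding amino acid counts. The function name and the treatment of two empty binding sites (distance zero) are assumptions made for illustration.

```python
def normalized_distance(raw_distance, residues_a, residues_b):
    """Raw binding site distance divided by the total binding residue count."""
    total = residues_a + residues_b
    if total == 0:
        return 0.0  # assumed convention for two empty binding sites
    return raw_distance / total

print(normalized_distance(3, 3, 3))    # 0.5
print(normalized_distance(3, 30, 30))  # 0.05
```

The same raw distance of three therefore maps to very different normalized values for small and large binding sites.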

Clustering algorithm

For data clustering, we wanted to use an algorithm that was not biased towards even-sized and regular-shaped clusters.

One algorithm with these properties is DBSCAN [23], which is a density-based algorithm. The density of objects is defined with a radius-like ε parameter and an object-count lower limit (MINPTS): the neighborhood of a certain object ‘o’ is considered to be dense if there exist at least MINPTS objects within a distance of less than ε. Therefore, MINPTS and ε are input parameters of the algorithm.
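The density condition can be written as a one-line test. This sketch is not a full DBSCAN implementation; the function name is hypothetical, and it assumes that the object itself is counted among its own neighbors.

```python
def is_dense(index, objects, dist, eps, minpts):
    """True if at least minpts objects lie closer than eps to objects[index].

    The object itself is counted, since its distance to itself is zero
    (an assumed convention).
    """
    center = objects[index]
    count = sum(1 for other in objects if dist(center, other) < eps)
    return count >= minpts

points = [0.0, 0.1, 0.2, 5.0]
absolute = lambda p, q: abs(p - q)
print(is_dense(0, points, absolute, eps=0.25, minpts=2))  # True
print(is_dense(3, points, absolute, eps=0.25, minpts=2))  # False
```

With a single global ε, the isolated point at 5.0 can never become dense, however the rest of the data are distributed; this is the limitation that motivates OPTICS.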

Unfortunately, the clustering structure of many real datasets cannot be characterized by global density parameters, as quite different local densities may exist in different areas of the data space. The OPTICS algorithm [24] overcomes these difficulties by ordering the objects contained in the database, creating a so-called ‘reachability plot’. The reachability plot is a very clever visualization of high-dimensional clusters. It is basically generated by assigning a value, called the ‘reachability distance’, to all the objects of the database, whilst going through the database points in a specific order. The reachability distance is given on the y axes, and the objects (i.e. binding site representations) are numbered on the x axes. Clusters correspond to concave regions in the plot. After the creation of the reachability plot, cluster membership assignments can be created by cutting the reachability plot with a horizontal line referred to as the ‘cut-off level’.
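Cutting a reachability plot with a horizontal line can be sketched as follows. This is a simplified illustration, not the extraction procedure of [24]: it ignores core distances, it assumes that the cut-off is expressed in the same units as the reachability distances, and the function name and the min_size parameter are hypothetical.

```python
def cut_reachability(reach, cutoff, min_size=2):
    """Assign cluster labels by cutting a reachability plot at `cutoff`.

    reach -- reachability distances in OPTICS order (first is usually inf)
    Returns one label per object; -1 marks 'noise'.
    """
    clusters, current = [], []
    for idx, r in enumerate(reach):
        if r > cutoff:
            # Not reachable from its predecessors: close the open group
            # and let this object start a candidate group of its own.
            if len(current) >= min_size:
                clusters.append(current)
            current = [idx]
        else:
            current.append(idx)
    if len(current) >= min_size:
        clusters.append(current)

    labels = [-1] * len(reach)
    for cluster_id, members in enumerate(clusters):
        for member in members:
            labels[member] = cluster_id
    return labels

reach = [float("inf"), 0.1, 0.1, 0.5, 0.6, 0.15, 0.1, 0.12]
print(cut_reachability(reach, cutoff=0.2))  # [0, 0, 0, -1, 1, 1, 1, 1]
```

Each valley below the line becomes one cluster together with the object that opens it, and objects that open no valley are left as ‘noise’. Raising the cut-off merges valleys and reduces the amount of ‘noise’.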

The reachability plot of a small database consisting of binding sites that contain NAD as the ligand is shown in Fig. 6.

Figure 6.  OPTICS reachability plot of a database consisting of 800 binding sites.

Database parameters and further settings used in the OPTICS algorithm

The parameters used for clustering were as follows: OPTICS MINPTS, 2; OPTICS cut-off level, 20%; gp, 1/10.

The OPTICS algorithm was run on a database consisting of 20 967 binding sites. Indistinguishable binding sites, which were assigned exactly to the same binding amino acid sequence sets and ligand identifiers, were contained only once. (The original database without this kind of redundancy filtering consisted of 27 208 binding sites.) The distance of the binding sites was measured with the distance function described above.

Figure 7.  A representative of cluster 85 in the online table http://www.pitgroup.org/seqclust/bsites_pseudocenters/bsites_optics_M04_No001.html. Cluster 85 contains PDB entries 3B9J, 1FFU, 1JRP, 1T3Q, 2E3T, 1JRO, 1RM6, 1WY6, 1N5X; all of these contain an Fe2/S2 cluster (FeS) bond.

Using labeling encoding binding types

Following the suggestion of an anonymous referee, we modified the labeling of the binding residues as follows: using the approach first described in [25], we replaced the one-letter abbreviation of each amino acid with one of five characters (‘A’, ‘D’, ‘H’, ‘C’, ‘P’), depending on the assumed type of interaction between the given amino acid and the ligand. As several atoms of an amino acid can be located within the ‘binding distance’ (defined to be more than 1.25 times the sum of the covalent radii of the protein and ligand atoms, but less than 1.05 times the sum of the van der Waals’ radii of these atoms), we only considered, for a given amino acid, its atom closest to the ligand. Five types of interaction were used: ‘hydrogen-bond acceptor’ (denoted by ‘A’); ‘hydrogen-bond donor’ (denoted by ‘D’); ‘mixed hydrogen-bond donor/acceptor’ (denoted by ‘H’, e.g. hydroxyl groups or side-chain nitrogen atoms in histidine); hydrophobic aliphatic interaction (denoted by ‘C’); and aromatic (denoted by ‘P’); all are described in [25].
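The ‘binding distance’ criterion quoted above can be checked with a short function. The sketch is illustrative only: the function name is hypothetical, and the radii in the example (in Å, for a carbon and an oxygen atom) are commonly tabulated values inserted here for demonstration, not necessarily those used in this study or in [25].

```python
def within_binding_distance(distance, cov_a, cov_b, vdw_a, vdw_b):
    """Noncovalent contact test for one protein atom and one ligand atom.

    The distance must exceed 1.25 times the sum of the covalent radii
    and stay below 1.05 times the sum of the van der Waals' radii.
    """
    lower = 1.25 * (cov_a + cov_b)
    upper = 1.05 * (vdw_a + vdw_b)
    return lower < distance < upper

# Illustrative radii in angstroms (assumed values): C and O.
C_COV, O_COV = 0.76, 0.66
C_VDW, O_VDW = 1.70, 1.52

print(within_binding_distance(3.0, C_COV, O_COV, C_VDW, O_VDW))   # True
print(within_binding_distance(1.43, C_COV, O_COV, C_VDW, O_VDW))  # False
print(within_binding_distance(3.5, C_COV, O_COV, C_VDW, O_VDW))   # False
```

With these example radii, the accepted window is roughly 1.78 to 3.38 Å, so a typical covalent C–O bond length falls below it and a distant pair falls above it.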

Using this labeling, we applied the OPTICS algorithm, exactly as described above. The resulting clusters are given in the second set of online supporting figures at http://pitgroup.org/seqclust, in four html files, together with a statistical analysis.

It is easy to see that, for the large clusters, the amino acid labeling gives better results.

Conclusions

In this article, we have presented a fast, sequence-based method capable of classifying the binding sites contained in the publicly available PDB. We determined the parameter settings yielding a classification with the best quality (measured by the silhouette coefficient). Our main result was a sequence-based approach, derived from three-dimensional structures, used for binding site clustering (rather than three-dimensional binding site structure), that allows multiple sequences to occur at each binding site. We also evaluated our clustering results with a large, colored diagram (given at the URL http://pitgroup.org/seqclust), where the colors correspond to the EC numbers of the proteins containing the binding sites. As witnessed by the colored diagram, and also by the numerical deviations given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt, our method has a clear-cut biological significance. The method presented in this work may help to reveal evolutionarily related binding sites, and may also be used to filter redundancies (i.e. binding sites occurring multiple times) from the PDB. A possible step for further research could be the creation of aggregate sequence set profiles for each binding site cluster, generating binding site families similar to the Protein Families Database [26,27].

Acknowledgements

This work was supported by the Hungarian Scientific Research Fund (NK-67867, CNK-77780), and by the Hungarian National Office for Research and Technology (OMFB-01295/2006 and OM-00219/2007).

References
