Activity cliffs are formed by structurally similar compounds with significant differences in potency and represent an extreme form of structure–activity relationships discontinuity. By contrast, regions of structure–activity relationships continuity in compound data sets result from the presence of structurally increasingly diverse compounds retaining similar activity. Previous studies have revealed that structure–activity relationships information extracted from large compound data sets is often heterogeneous in nature, containing both continuous and discontinuous structure–activity relationships components. Structure–activity relationships discontinuity and continuity are often represented by different compound series, independent of each other. Here, we have searched different compound data sets for the presence of structure–activity relationships continuity within the vicinity of prominent activity cliffs. For this purpose, we have designed and implemented a computational approach utilizing particle swarm optimization to examine the structural neighborhood of activity cliffs for continuous structure–activity relationships components. Structure–activity relationships continuity in the structural neighborhood of activity cliffs was observed relatively rarely. However, in a number of cases, notable structure–activity relationships continuity was detected in the vicinity of prominent activity cliffs. Exemplary local structure–activity relationships environments displaying these characteristics were analyzed in detail. Thus, the structure–activity relationships environment of activity cliffs need not be discontinuous in nature, and local structure–activity relationships continuity and discontinuity can occur in a concerted manner in series of structurally related compounds.
Large-scale analysis of structure–activity relationships (SAR) information contained in compound data sets from various sources has in recent years become increasingly popular to complement SAR investigations on individual compound series (1,2). Therefore, concepts such as activity landscapes and activity cliffs have been put forward to aid in the extraction of SAR information from compound data (2–5). For the characterization and extraction of SAR information, both numerical functions and graphical methods have been developed (6–9). From SAR profiling of many different compound data sets, the picture has emerged that SAR information contained in large and structurally diverse data sets is globally most often heterogeneous in nature, containing multiple continuous and discontinuous local SAR environments (5,6). In this context, SAR discontinuity refers to the presence of structurally similar compounds with marked differences in potency against a given target, whereas continuity refers to compounds with gradually increasing structural diversity that display similar activity (1,2,6). For the characterization of such SAR features, the interplay between numerical and graphical analyses is often of critical importance. For example, the numerical SAR Index (SARI) makes it possible to assign a global SAR phenotype to a data set (6) (i.e., as mostly continuous, discontinuous, or heterogeneous in nature) and network-like similarity graphs (8) can be utilized to identify the subsets of compounds that form continuous or discontinuous local SARs and to study the relationships between local and global SAR features. In addition, variant (local) SARI scores can be calculated for individual compounds to quantify their individual contributions to SAR features (8). Different relationships between SAR continuity and discontinuity have been observed in data sets (5,6). 
In structurally diverse compound sets, continuous and discontinuous SAR components are typically represented by different compound series and coexist independently of each other. By contrast, continuous and discontinuous SAR elements might not always be independent of each other at the level of individual compound series, for example, when stringent binding constraints must be met in one region of a molecule and less critical ones in another, giving rise to some continuity (5,6). However, such possible relationships have not yet been systematically explored.
Despite the intuitive nature and attractiveness of combining numerical and graphical analyses, efforts have also been made to automatically extract local SAR components from compound data sets on a large scale, thereby alleviating the need for multilevel (numerical and graphical) SAR analysis of one compound set at a time. For example, an SAR pathway model has been adapted to extract a form of locally continuous SAR from compound data sets (9,10). On the basis of this model, series of pairwise similar compounds are selected that follow a pseudolinear potency gradient. Once extracted, such pathways can also be graphically analyzed. Furthermore, in a recent study, we have for the first time applied particle swarm optimization (PSO) (11), a numerical optimization approach, to large-scale SAR analysis to systematically extract locally discontinuous SARs from different compound sets (12). Particle swarm optimization is an algorithm originally developed by Kennedy and Eberhart (11) that models coordinated ‘social behavior’, with flocking of birds or schooling of fishes providing paradigmatic examples. Because of its simple arithmetic structure and favorable convergence behavior compared with other population-based heuristic methods (13), PSO is increasingly applied to solve various optimization problems in different areas of computational science including computer-aided drug design (14–16).
An aspect that has thus far only marginally been considered in SAR information extraction is whether, and to what extent, SAR continuity might also be observed in the immediate environment of prominent activity cliffs. There is some preliminary evidence. For example, SAR pathways can be found that lead to activity cliff-forming compounds (9), but these pathways usually do not originate in the structural neighborhood of activity cliffs. Here, we generalize the search for SAR continuity in the vicinity of activity cliffs through a PSO-based approach. The method screens the immediate structural environments of prominent activity cliffs for compound subsets that represent locally continuous SARs.
Materials and Methods
We first introduce the PSO approach and then present its adaptation to search for SAR continuity in the vicinity of activity cliffs.
The population (‘swarm’) of the PSO algorithm is initialized with N random solutions called ‘particles’ in a D-dimensional search space. Each particle i is characterized by a position vector x_i = (x_i1, ..., x_iD) and a velocity vector v_i = (v_i1, ..., v_iD), where d = 1, 2, ..., D indexes the dimensions of each particle and i = 1, 2, ..., N indexes the particles in the swarm. In each iteration (‘generation’), particle positions in the swarm are evaluated by a fitness function and moved toward the ultimately best solution. During each generation (t), for each particle, its new velocity is calculated according to eqn (1) and its position is updated according to eqn (2) based on its own ‘best experience’ (pbest) and the best experience of the neighboring particles (nbest). Thus, the particles in the swarm ‘communicate’ and share information with each other in the search space defined by an optimization problem.
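Eqns (1) and (2) are not reproduced in this text. A reconstruction in the widely used inertia-weight form of PSO, which is consistent with the symbols defined for these equations (w, c1, c2, rand(), pbest, and nbest), would read as follows. The component notation x_id and v_id is an assumption.

```latex
\begin{align}
v_{id}(t+1) &= w\, v_{id}(t)
  + c_1\, \mathrm{rand}()\, \bigl(pbest_{id} - x_{id}(t)\bigr)
  + c_2\, \mathrm{rand}()\, \bigl(nbest_{id} - x_{id}(t)\bigr) \tag{1} \\
x_{id}(t+1) &= x_{id}(t) + v_{id}(t+1) \tag{2}
\end{align}
```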
Here, w is the so-called inertia weight, a crucial parameter for particles to exploit information inherited from the previous generation. Furthermore, c1 and c2 are so-called cognitive and social confidence coefficients and represent acceleration constants that modify the velocity of a particle toward (pbest) and (nbest), respectively. In addition, rand() is a random function uniformly distributed between values of 0 and 1. Standard PSO calculations are outlined in Figure 1.
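As an illustration, a single velocity and position update for one particle can be sketched in Python as follows. The sketch assumes the common inertia-weight form of eqns (1) and (2), which are not shown in this text; the function name pso_step and its argument layout are hypothetical, and the default parameter values are those reported later in this section.

```python
import random


def pso_step(x, v, pbest, nbest, w=0.721348, c1=1.193147, c2=1.193147, rng=random):
    """One real-valued PSO update for a single particle.

    x, v, pbest and nbest are equally long lists of floats. The update rule is
    the common inertia-weight formulation (an assumption about eqns 1 and 2).
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        # Inertia term plus cognitive (pbest) and social (nbest) attraction,
        # each scaled by an independent uniform random number in [0, 1).
        vd = (w * v[d]
              + c1 * rng.random() * (pbest[d] - x[d])
              + c2 * rng.random() * (nbest[d] - x[d]))
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

When a particle already sits on both pbest and nbest, the two attraction terms vanish and the particle simply continues along its previous velocity, damped by w.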
The standard PSO approach cannot be directly applied to solve binary-encoded feature selection problems. Accordingly, for these types of discrete search problems, Kennedy and Eberhart (17) have extended the standard PSO method by introducing a discrete binary version of the PSO algorithm. In binary PSO, particles update their individually best positions and reach the best global solution as in the standard case. By contrast, the major difference between binary PSO and real-valued standard PSO is that particles in the swarm are updated by the application of sigmoidal transformations to calculate the velocity of each particle according to eqn (3) and its position according to eqn (4).
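Eqns (3) and (4) are not reproduced in this text. In the discrete binary PSO formulation of Kennedy and Eberhart, the sigmoidal transformation and the resulting positional update are commonly written as follows; the component notation is an assumption, and v_id(t+1) denotes the updated velocity.

```latex
\begin{align}
S\bigl(v_{id}(t+1)\bigr) &= \frac{1}{1 + e^{-v_{id}(t+1)}} \tag{3} \\
x_{id}(t+1) &=
  \begin{cases}
    1 & \text{if } \mathrm{rand}() < S\bigl(v_{id}(t+1)\bigr) \\
    0 & \text{otherwise}
  \end{cases} \tag{4}
\end{align}
```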
This transformation of the velocity vector yields the positional vector of particles in the swarm with values of 0 or 1. Thus, compared with the standard PSO approach in Figure 1, the sigmoidal transformation step is carried out after the velocity update and prior to the positional update.
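The order of operations described above (velocity update, sigmoidal transformation, positional update) can be sketched as follows. This is a minimal illustration under the assumption that the velocity update keeps its standard inertia-weight form; the function names are hypothetical, and the numerically stable sigmoid is an implementation choice not taken from the text.

```python
import math
import random


def sigmoid(v):
    """Logistic function, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def binary_pso_step(x, v, pbest, nbest, w=0.721348, c1=1.193147, c2=1.193147, rng=random):
    """One binary PSO update: x, pbest and nbest are lists of 0/1 values."""
    new_x, new_v = [], []
    for d in range(len(x)):
        # Velocity update as in real-valued PSO (assumed inertia-weight form).
        vd = (w * v[d]
              + c1 * rng.random() * (pbest[d] - x[d])
              + c2 * rng.random() * (nbest[d] - x[d]))
        new_v.append(vd)
        # The sigmoid maps the velocity to the probability of setting the bit.
        new_x.append(1 if rng.random() < sigmoid(vd) else 0)
    return new_x, new_v
```

In this scheme the velocity no longer displaces the particle directly; it only controls how likely each bit position is to be set to 1.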
For SAR analysis, as reported herein, binary fingerprint representations are calculated for active compounds to define the dimensionality of particles in the swarm. We have utilized the MACCS structural keys fingerprint (footnote a) that consists of 166 bit positions, each of which accounts for the presence or absence of a cataloged structural fragment or feature in a molecule. In large-scale SAR analysis, MACCS keys have often proven to be robust and reliable indicators of structural similarity relationships (6,8). Thus, the use of MACCS as a molecular representation for our optimization task required the application of binary PSO.
Another critically important component of the PSO approach is the fitness function used to guide the optimization. To detect SAR continuity in the vicinity of activity cliffs, the continuity score component of SARI (6) was utilized. SARI comprises two score components accounting for SAR continuity and discontinuity, respectively, according to eqns (5) and (6).
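Eqns (5) to (7) are not reproduced in this text. Based on the symbols defined for them and on the form in which SARI is generally described (6), they can be reconstructed as follows. This reconstruction is an assumption; in particular, the similarity and potency difference thresholds that select compound pairs for the discontinuity score are written as placeholders (θ_sim, θ_pot) because their values are not stated here.

```latex
\begin{align}
raw_{cont} &= \frac{\sum_{i \neq j} w_{ij}\, \dfrac{1}{1 + sim(i,j)}}{\sum_{i \neq j} w_{ij}},
  \qquad w_{ij} = \frac{pot(i)\, pot(j)}{1 + potdiff(i,j)} \tag{5} \\
raw_{disc} &= \operatorname*{mean}_{\{i,j \,\mid\, sim(i,j) > \theta_{sim},\; potdiff(i,j) > \theta_{pot}\}}
  \bigl( potdiff(i,j) \cdot sim(i,j) \bigr) \tag{6} \\
SARI &= \tfrac{1}{2} \bigl( score_{cont} + (1 - score_{disc}) \bigr) \tag{7}
\end{align}
```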
Here, wij is the potency-weighted factor for each pair of compounds i and j, pot(i) and pot(j) are their potency values, and potdiff(i,j) is the absolute potency difference between these values. In addition, sim(i,j) is the conventional Tanimoto (Tc) similarity of MACCS fingerprints calculated for i and j. The raw continuity score (rawcont) and discontinuity score (rawdisc) are converted into Z-scores using an external reference panel of compound data sets (5) from which score means and standard deviations are derived. Assuming a normal distribution, the cumulative probability for each Z-score is then calculated and mapped onto the value range [0,1]. Utilizing the normalized scorecont and scoredisc, the final SARI score is calculated according to eqn (7).
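The score calculation described here can be sketched in Python. The sketch assumes that the raw continuity score is the potency-weighted mean of 1/(1 + sim(i,j)) over all compound pairs, as SARI is generally described (6); fingerprints are represented as sets of on-bit indices, and the reference mean and standard deviation passed to normalize are hypothetical inputs standing in for the external reference panel.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def raw_continuity(fps, pots):
    """Potency-weighted mean of 1 / (1 + sim) over all compound pairs (assumed form)."""
    num = den = 0.0
    for i, j in combinations(range(len(fps)), 2):
        # Weight favors pairs of potent compounds with similar potency.
        w = pots[i] * pots[j] / (1.0 + abs(pots[i] - pots[j]))
        num += w / (1.0 + tanimoto(fps[i], fps[j]))
        den += w
    return num / den if den else 0.0


def normalize(raw, ref_mean, ref_std):
    """Z-score against reference statistics, mapped to [0, 1] via the normal CDF."""
    z = (raw - ref_mean) / ref_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sari(score_cont, score_disc):
    """Combine the normalized continuity and discontinuity scores."""
    return 0.5 * (score_cont + (1.0 - score_disc))
```

Under this form, a pair of identical fingerprints contributes 0.5 to the raw score and a pair without shared bits contributes 1.0, so structurally diverse subsets of similarly potent compounds score highest.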
For the identification of continuous local SAR components in search space, the continuity score component was utilized as a fitness function for binary PSO, as stated earlier. This score component emphasizes the presence of structurally divergent compounds having similar activity. The application of this fitness function directs the particles of binary PSO toward compounds representing continuous local SARs within the structural neighborhood of activity cliffs, as further discussed later.
Choosing the number of independent iterations and other parameter value settings for PSO is of critical importance for its convergence behavior and the quality of the obtained results with respect to the chosen fitness function (18,19). Preferred parameter values are typically dependent on specific applications. Accordingly, we have carried out preliminary calculations with different parameter values and ultimately selected parameter settings following Clerc (19): inertia weight w = 0.721348; confidence coefficients c1 and c2 = 1.193147; a swarm size of 35 particles; and a maximum number of 20 000 iterations. In each case, 20 independent optimization runs were carried out, and the compound subset from the best individual run yielding the highest continuity score was selected.
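The reported inertia weight and confidence coefficients agree to six decimal places with the closed-form constants w = 1/(2 ln 2) and c = 1/2 + ln 2. Reading them this way is an assumption, since the text only cites Clerc (19) for these values. The short check below reproduces the numbers and collects the remaining settings in a dictionary whose name and keys are hypothetical.

```python
import math

# Closed-form constants that reproduce the reported values (assumed origin).
w = 1.0 / (2.0 * math.log(2.0))
c = 0.5 + math.log(2.0)

# Settings as reported in the text; the dictionary layout is illustrative only.
PSO_SETTINGS = {
    "inertia_weight": round(w, 6),
    "c1": round(c, 6),
    "c2": round(c, 6),
    "swarm_size": 35,
    "max_iterations": 20000,
    "independent_runs": 20,
}
```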
In our application, the objective of particles in the swarm is to identify the subsets of compounds representing high SAR continuity in the structural neighborhood of prominent activity cliffs. Such activity cliffs were defined here as pairs of structurally similar compounds (MACCS Tanimoto similarity ≥0.8) with at least two orders of magnitude difference in potency (i.e., 100-fold) and a minimum potency of 100 nM for the highly potent cliff-forming compound. An activity cliff-forming compound was permitted to have more than one cliff partner. In addition, the chemical neighborhood of an activity cliff was defined as all compounds with a MACCS Tanimoto similarity of ≥0.6 compared with at least one of the cliff-forming compounds. The choice of these similarity cutoff values ensured that only compounds displaying a high degree of structural similarity were considered for activity cliff formation and that a well-defined structural neighborhood around each cliff was generated.
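The cliff and neighborhood criteria can be expressed compactly in code. In the following sketch, potency values are assumed to be negative decadic logarithms of molar potency, so that a 100-fold difference corresponds to 2 units and 100 nM to a value of 7; fingerprints are sets of on-bit indices, and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_cliffs(fps, pots, sim_min=0.8, pot_diff_min=2.0, pot_min=7.0):
    """Return (potent, weak) index pairs that meet the activity cliff criteria.

    pots are assumed to be on a negative log10 molar scale (7.0 = 100 nM).
    """
    cliffs = []
    for i, j in combinations(range(len(fps)), 2):
        hi, lo = (i, j) if pots[i] >= pots[j] else (j, i)
        if (tanimoto(fps[i], fps[j]) >= sim_min
                and pots[hi] - pots[lo] >= pot_diff_min
                and pots[hi] >= pot_min):
            cliffs.append((hi, lo))
    return cliffs


def neighborhood(cliff, fps, sim_min=0.6):
    """Indices of compounds with Tanimoto >= sim_min to at least one cliff compound."""
    return [k for k in range(len(fps))
            if any(tanimoto(fps[k], fps[c]) >= sim_min for c in cliff)]
```

Because every compound is identical to itself, the neighborhood returned here contains the cliff-forming compounds as well. Since one compound may appear in several returned pairs, multiple cliff partners are handled naturally.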
Each iteration of the optimization process included the following steps:
• Step 1: Calculate the Euclidean distance between each particle position and data set compounds.
• Step 2: Sort the compounds in the order of increasing Euclidean distance from particle positions.
• Step 3: From the ranking, select activity cliff-forming compounds (meeting the cliff criteria).
• Step 4: Select the n top-ranked compounds falling into the structural neighborhood of the activity cliff (selected in step 3).
• Step 5: Calculate the SARI continuity score for the top-ranked n compounds.
If no qualifying compounds are identified in step 3 or 4 (following the ranking top-down), the continuity score value of each particle is set to zero in step 5 and another search direction is explored. Thus, improving the continuity score guides particles to move in the search space toward compounds forming continuous local SARs in the vicinity of the activity cliff.
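Steps 1 to 5 can be sketched as a fitness evaluation for a single particle. This is one possible reading of the protocol, not the implementation used in the study: the sketch takes the first cliff-forming compound encountered in the ranking as the anchor, expects the cliff members and their neighborhoods to be precomputed, and receives the continuity score as a callable. All names are hypothetical.

```python
import math


def evaluate_particle(position, fps, cliff_members, neighbors_of, n, score_fn):
    """Return (fitness, selected indices) for one particle.

    position      : list of 0/1 values, the particle position in fingerprint space
    fps           : list of 0/1 lists, one binary fingerprint per compound
    cliff_members : set of indices of compounds that take part in an activity cliff
    neighbors_of  : dict mapping a cliff compound index to the set of indices
                    in the structural neighborhood of its cliff
    n             : required subset size
    score_fn      : callable returning the continuity score for a list of indices
    """
    # Steps 1 and 2: rank compounds by increasing Euclidean distance to the particle.
    ranking = sorted(range(len(fps)), key=lambda k: math.dist(position, fps[k]))

    # Step 3: walk the ranking top-down until a cliff-forming compound is found.
    anchor = next((k for k in ranking if k in cliff_members), None)
    if anchor is None:
        return 0.0, []

    # Step 4: keep the n top-ranked compounds inside the cliff neighborhood.
    subset = [k for k in ranking if k in neighbors_of[anchor]][:n]
    if len(subset) < n:
        return 0.0, []

    # Step 5: score the selected subset.
    return score_fn(subset), subset
```

Returning a fitness of zero when no cliff or no sufficiently large neighborhood is found mirrors the reset described above, so that such positions are never retained as pbest or nbest over positions with a positive score.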
Compound data sets for our analysis were extracted from ChEMBL (footnote b). Subsets of 5 and 10 compounds were selected from each data set by PSO. Each calculation was repeated 20 times such that 20 independent trials were available for compound selection.
Results and Discussion
Structure–activity relationships continuity and discontinuity are helpful concepts to describe the distribution of SAR features in compound data sets of any source and characterize both global (at the level of entire data sets) and local (at the level of compound subsets/series) SAR phenotypes (1,2,5). Continuous and discontinuous SAR components often coexist in compound data sets, giving rise to global SAR heterogeneity. The analysis of SAR discontinuity is generally thought to yield most information about activity determinants in compound series because in this case, small chemical changes have large biological effects and one can thus often identify sites in compounds that are of critical importance for their bioactivity. Activity cliffs represent an extreme form of SAR discontinuity, and hence, the study of large-magnitude cliffs usually is a primary focal point of SAR analysis. On the other hand, the analysis of SAR continuity is highly relevant, for example, to explore the permissiveness of targets toward small molecules. In addition, the notion of SAR continuity provides the conceptual basis for scaffold hopping in medicinal chemistry and virtual screening (2,5). The SARI scoring scheme was designed to quantitatively account for the presence of global and local SAR continuity and discontinuity and quantitatively describe SAR phenotypes.
Despite the qualitative and quantitative characterization of SAR features in many different compound sets, it has thus far largely remained unclear how SAR continuity and discontinuity might be linked in structurally related compounds. Structure–activity relationships continuity and discontinuity are typically considered distinct characteristics, but there is at least some evidence that SAR continuity might evolve in compound series when critical binding constraints are met (5,6). For example, one might consider the case of strong carbonic anhydrase inhibitors that usually must contain a sulfonamide group to complex a zinc cation in the active site of the enzyme. If this critical interaction is present, structural variations in molecular regions of inhibitors distant from the sulfonamide group are often tolerated, thus giving rise to a form of limited and constrained SAR continuity within series of carbonic anhydrase inhibitors. Here, we have asked the general question whether SAR continuity might be detectable in the structural neighborhood of centers of discontinuity, i.e., activity cliffs. Because we intended to address the question systematically through compound data mining, it was necessary to develop a generally applicable approach to search for this type of SAR information.
For the purpose of our analysis, PSO was considered to provide a meaningful methodological basis. As an optimization approach guided by a specific fitness function, PSO is readily applicable for compound selection. Previously, we have successfully applied PSO to extract compound series from various compound data sets that represented discontinuous local SARs (12). To adapt PSO for the potential identification of local SAR environments that combine SAR discontinuity and continuity in a defined manner, we have now implemented a PSO-based search protocol to first identify activity cliffs among compounds proximal to initialized particle positions and then constrain the search space to structural neighbors of such activity cliff-forming compounds. This has made it possible to utilize the SARI continuity score as the only required fitness function to solve this, at first glance, relatively complex optimization problem. Through resetting of continuity score values to 0 for a particle if no suitable compound was matched, the particle swarm search of the structural environment of an activity cliff was iteratively redirected until convergence was reached.
For our analysis, 32 sets of inhibitors of target enzymes belonging to different families were selected, as summarized in Table 1. In addition to target diversity, three other selection criteria were considered. First, large compound sets with more than 100 (and up to more than 1500) inhibitors were chosen. Second, compound data sets were selected to cover wide potency ranges. Third, and most importantly, compound data sets were selected to cover the entire spectrum of SAR phenotypes, which were assigned on the basis of global SARI scores (6), as also reported in Table 1. Structure–activity relationships phenotypes covered by these compound sets ranged from overall mostly discontinuous SARs with low SARI scores smaller than 0.3 (e.g., caspase-1 inhibitors) over globally heterogeneous SARs with scores around 0.5 (e.g., matrix metalloproteinase-1 inhibitors) to predominantly continuous SARs with scores >0.7 (e.g., cyclooxygenase-1 inhibitors). Because the majority of large compound data sets are globally heterogeneous in their SAR character, as revealed by SARI profiling (6), this global SAR phenotype also dominated our selections.
Table 1. Compound data sets and PSO results. For each of the 32 sets of inhibitors investigated herein, the target name, number of compounds, and their potency range are given. In addition, the global structure–activity relationships index (SARI) score is reported for each compound data set. Also reported are the continuity scores for the top-ranked 5- and 10-compound subsets obtained by particle swarm optimization (PSO), under the column heading "Subset continuity score". Continuity scores above 0.70 are highlighted in boldface.
Targets listed: Ser/Thr kinase Chk1; Tyr phosphatase 1B; Carbonic anhydrase I; Dipeptidyl peptidase IV; Tyr kinase SRC; Ser/Thr kinase AKT; Coagulation factor X; Tyr kinase TIE-2; Cytochrome P450 2D6; Protein kinase C α; Carbonic anhydrase XII; Tyr kinase LCK; c-Jun N-terminal kinase 1; Protein kinase C β; MAP kinase p38 α.
The 32 data sets were subjected to PSO analysis. In different calculations, we searched for subsets of 5 or 10 compounds representing continuous SARs in the structural neighborhood of automatically identified activity cliffs. For each data set, the highest continuity scores from 20 independent trials are reported in Table 1 for the differently sized compound subsets. For subsets of 10 compounds, only low to intermediate continuity scores were observed, ranging from scores very close to 0 (reflecting the absence of SAR continuity around cliffs) to a maximum score of 0.52 (for inhibitors of matrix metalloproteinase-9, reflecting the presence of limited continuity). Thus, for compound subsets of this size, significant SAR continuity in cliff environments was not observed. When reducing the subset size to five compounds, a different picture emerged. In this case, examples of activity cliffs with notable SAR continuity in their structural neighborhoods were detected in seven compound data sets (sets 1–7 in Table 1), with best continuity scores ranging from 0.70 to 0.80, which are considered significant on the basis of previous experience (5). These compound sets included kinase and phosphatase inhibitors as well as inhibitors of different types of proteases and also of carbonic anhydrase (as discussed earlier). The global SAR phenotype of these compound sets where combinations of local SAR continuity and discontinuity were detected was mostly heterogeneous, with the exception of the tyrosine phosphatase 1B set that was overall continuous in nature (yielding a global SARI score of 0.73). Thus, although larger series of compounds representing SAR continuity around activity cliffs were not found, in seven of 32 test cases, subsets of five compounds were identified that displayed notable continuity. 
These findings confirmed that combinations of strongly discontinuous and continuous SAR components in small sets of structurally related compounds were in principle possible and detectable in different data sets, although the occurrence of these types of SAR microenvironments was fairly limited, as one might expect.
SAR continuity in activity environments
We then studied exemplary microenvironments in more detail. To display selected activity cliffs and their structural neighborhood where SAR continuity was detected, chemical neighborhood graphs (CNGs) (20) were generated, as shown in Figure 2. Chemical neighborhood graphs provide a radial view of the structural neighborhood of a given active compound, here a highly potent activity cliff-forming molecule. This reference compound is placed as a node in the center of the graph and structurally similar compounds are displayed as nodes in an organized manner, revealing similarity and potency relationships to the reference molecule. Compounds belonging to the neighborhood including the weakly potent activity cliff partner are arranged in layers of decreasing Tanimoto similarity (given in percent relative to the reference) around the central compound. For example, if a compound displays at least 70% similarity to the reference (i.e., MACCS Tanimoto similarity of at least 0.7) but <80%, it is placed on the 70% layer. The compounds are represented as colored nodes. The color code reflects the compound potency distribution of the entire data set from which the compounds are selected using a continuous color gradient from green to red, corresponding to the lowest and highest potency values in the data set, respectively. Nodes representing compounds with potency lower than the central compound are positioned on the left and more potent compounds on the right half of the graph. Hence, in this case, all neighborhood compounds are placed on the left because the highly potent activity cliff marker represents the most potent compound.
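The placement rule for nodes in a chemical neighborhood graph can be written as a small helper. This is an illustrative sketch only: the function name is hypothetical, the tolerance of 1e-9 guards against floating-point rounding at layer boundaries, and placing compounds of equal potency on the left is an assumption, since that case is not specified.

```python
import math


def cng_placement(sim, pot, ref_pot):
    """Return (layer in percent, side) for a neighbor of the reference compound.

    sim is the MACCS Tanimoto similarity to the reference; pot and ref_pot are
    the potency values of the neighbor and of the reference compound.
    """
    # Similarity in [0.7, 0.8) maps to the 70% layer, [0.8, 0.9) to 80%, and so on.
    layer = int(math.floor(sim * 10 + 1e-9)) * 10
    # More potent compounds go to the right, all others to the left (assumed for ties).
    side = "right" if pot > ref_pot else "left"
    return layer, side
```

With the reference chosen as the most potent compound of its neighborhood, as in the examples discussed here, every neighbor is assigned to the left half of the graph.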
In Figure 2A, the structural neighborhood of a large-magnitude activity cliff from a serine/threonine kinase inhibitor data set is shown involving one of the most potent compounds in this set (i.e., with node color red). Here, the presence of SAR continuity surrounding the activity cliff becomes readily apparent. The compounds falling into the structural neighborhood of the cliff share substructures with the cliff-forming compounds and are related to them by replacements of similar ring systems. However, these compounds gradually structurally depart from the cliff compounds and two of them display comparably high and the three others comparably low potency. Thus, in this case, discontinuous and continuous SAR features are closely combined. In Figure 2B, the environment of an activity cliff formed by matrix metalloproteinase-9 inhibitors is shown that involves multiple compounds. Here, the highly potent activity cliff-forming compound has four weakly potent cliff partners. The compounds falling into the structural neighborhood of the cliff are only weakly potent. They are structurally related to activity cliff compounds in different ways and also gradually depart from them. Furthermore, the three-compound activity cliff from the dipeptidyl peptidase IV inhibitor set shown in Figure 2C reflects similar relationships. These inhibitors are comparably small, and the activity cliff-forming compound in the center of the graph is paired with two weakly potent cliff partners. The neighboring compounds representing continuity around the cliff include molecules that are very similar or only remotely similar to the activity cliff compounds and weakly to moderately potent. Thus, taken together, these examples show how SAR continuity around large-magnitude activity cliffs is formed by structurally diverse classes of active compounds and illustrate structural and potency relationships between cliff compounds and compounds populating their structural neighborhood.
The identification of regions of SAR continuity in the close vicinity of activity cliffs is first and foremost of scientific interest because it provides a conceptual link between these a priori distinct SAR features. Clearly, this phenomenon is not anticipated to be generally observed in compound data sets. It should be an exception, rather than the rule. However, the approach introduced herein makes it possible to systematically search for this SAR data structure, and our findings demonstrate that SAR discontinuity and continuity can in principle occur in small subsets of similar compounds or, in other words, be closely coupled in narrow regions of activity landscapes. While these insights are of considerable interest from an SAR-theoretic viewpoint, it would be difficult to translate them into practical medicinal chemistry recipes to generate novel active compounds. For this purpose, the identification and categorization of local SAR features is generally insufficient, although these efforts might well provide some insights into activity determinants within compound series.
In this study, we have searched for special SAR microenvironments in large and structurally diverse compound sets with activity against a variety of target enzymes. These local environments were defined to consist of centers of SAR discontinuity characterized by the presence of significant SAR continuity in their structural neighborhood, a rather unusual SAR data structure. However, we reasoned that detecting SAR continuity around prominent activity cliffs would provide evidence for combined discontinuous and continuous SAR components beyond mutual coexistence. Our analysis has been carried out with a specifically designed PSO protocol guided by an SAR fitness function. This approach has made it possible to select subsets of compounds from large data sets that met predefined SAR criteria. We have detected notable SAR continuity around large-magnitude activity cliffs in seven of 32 compound data sets that we screened. Thus, although the occurrence of SAR continuity around activity cliffs was relatively rare and limited to small compound subsets, our findings revealed that highly discontinuous and continuous SAR components are not mutually exclusive or always independent of each other. Rather, they might also occur in combination in series of structurally related compounds, a previously largely unobserved phenomenon, which further illustrates the complexity of SARs across different compound classes.
Footnote a: MACCS Structural Keys, Symyx Software, San Ramon, CA, USA, 2005.