Electrostatic recognition in substrate binding to serine proteases

Abstract Serine proteases of the Chymotrypsin family are structurally very similar but have very different substrate preferences. This study investigates a set of 9 different proteases of this family comprising proteases that prefer substrates containing positively charged amino acids, negatively charged amino acids, and uncharged amino acids with varying degrees of specificity. Here, we show that differences in electrostatic substrate preferences can be predicted reliably by electrostatic molecular interaction fields employing customized GRID probes. Thus, we are able to directly link protease structures to their electrostatic substrate preferences. Additionally, we present a new metric that measures similarities in substrate preferences focusing only on electrostatics. It efficiently compares these electrostatic substrate preferences between different proteases. This new metric can be interpreted as the electrostatic part of our previously developed substrate similarity metric. Consequently, we suggest that electrostatics and shape complementarity are rather orthogonal aspects of substrate recognition. This is in line with a 2-step mechanism of protein-protein recognition suggested in the literature.

variations within the binding cleft of an individual protease. [12][13][14] Obviously, evolution had to tackle quite different tasks: on the one hand, designing proteases that are able to digest more or less every peptide that they encounter, and on the other hand, designing proteases within a signaling cascade that should specifically recognize the subsequent member of the signaling chain to ensure the proper transmission of the signal. 15 This evolutionary pressure led to proteases ranging from highly promiscuous to extremely specific.
Surprisingly, these extremes can occur within the same family of evolutionarily related proteases, eg, the Chymotrypsin family of serine proteases. 16 The promiscuity and specificity of substrate recognition are very often not spread evenly along the binding cleft. Proteases often prefer certain substrate amino acids at given distances from their catalytic center. Half a century ago, Schechter and Berger 17 suggested a convention to denote the peptide substrate amino acids (P4 to P4′) and the subpockets (S4 to S4′) within the binding cleft around the scissile bond (cf. Figure 1).
Several methods have been published to describe and localize promiscuity and specificity of the protease binding interface, thus facilitating a comparison of individual proteases. [18][19][20][21] Our cleavage entropy metric 14,22 is based on substrate data deposited in the MEROPS database 23 and quantifies the specificity of peptide recognition in each subpocket. To compare proteases based on their substrate recognition, we developed a metric that considers the positional abundance of individual amino acids. 24

Research on snake venom metalloproteases revealed strong hints that their promiscuity is linked to their flexibility. 25 Likewise, we found that Caspases 26 and Thrombin 27 display strong correlations between flexibility and promiscuity. For Thrombin, this correlation translates into ordering processes of water molecules in the binding interface.
Regions of specificity show ordered water molecules in the interface, whereas regions of promiscuity tend to have more disordered water molecules in the first solvation sphere (Figure 2). On the other hand, enthalpic contributions to hydration of the S1 and S4 to S6 subpockets are almost identical. Thus, dynamics, and therefore the entropy of hydration, contribute strongly to the recognition of substrates in Thrombin.
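To make the idea of a per-subpocket specificity measure (such as the cleavage entropy mentioned above) more concrete, the following minimal sketch computes a normalized Shannon entropy over the residues observed at a single substrate position. The logarithm base of 20, the function name, and the toy residue lists are assumptions made for illustration only; the published cleavage entropy is defined in refs 14,22 and may differ in its normalization.

```python
import math
from collections import Counter

def positional_entropy(residues, alphabet_size=20):
    """Normalized Shannon entropy of the residues seen at one substrate position.

    Returns 0.0 for a fully specific position and 1.0 when all
    `alphabet_size` residue types occur equally often (assumed normalization).
    """
    counts = Counter(residues)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log(p, alphabet_size)
    return entropy

# Hypothetical toy data: residues observed at P1 for two imaginary proteases.
specific_p1 = list("RRRRRRRRRR")               # only arginine is accepted
promiscuous_p1 = list("ACDEFGHIKLMNPQRSTVWY")  # every residue seen once

print(positional_entropy(specific_p1))
print(positional_entropy(promiscuous_p1))
```

Low values indicate a specific position and values near 1 a promiscuous one, under the assumptions stated above.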
Electrostatic interactions are quite different from other contributions to substrate recognition as they are long-range interactions that change little with small differences in distance. 28,29 This has several important consequences. Obviously, a more continuous distance dependence varies less with conformational changes, much in contrast to shape-dependent recognition like van der Waals interactions and recognition that relies on precise exit vectors like hydrogen bonds.
On the other hand, due to the long-range character of electrostatic interactions, assigning them to specific subpockets is more challenging.
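The weak distance dependence of electrostatics can be illustrated with a small numerical sketch. It assumes an idealized Coulomb term proportional to 1/r with a distance-independent dielectric, and an attractive dispersion term proportional to r^-6; the contact distance of 4 Å and the displacement of 0.5 Å are hypothetical values chosen only for illustration.

```python
def coulomb_scale(r_old, r_new):
    """Factor by which a 1/r Coulomb term changes when r_old -> r_new."""
    return r_old / r_new

def dispersion_scale(r_old, r_new):
    """Factor by which an r**-6 dispersion term changes when r_old -> r_new."""
    return (r_old / r_new) ** 6

# Hypothetical displacement of 0.5 Å at a 4 Å contact distance.
r_old, r_new = 4.0, 4.5
coulomb_factor = coulomb_scale(r_old, r_new)
dispersion_factor = dispersion_scale(r_old, r_new)

print(f"Coulomb term retains {coulomb_factor:.0%} of its magnitude")
print(f"Dispersion term retains {dispersion_factor:.0%} of its magnitude")
```

In this idealized picture, the same small displacement removes roughly a tenth of the Coulomb term but about half of the dispersion term, which is why small conformational changes matter far less for electrostatics.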
Calculating differences in electrostatic molecular interaction fields (eMIFs) of proteins is a rather challenging task. Many different approaches exist, and the choice of approach has a significant impact on the result. Differences in the handling of the solvent and the solute, either implicit as a continuum or explicit, can yield highly different results. 30 When using an implicit model, it is also not trivial to assign each point on the grid a certain value for the dielectric constant. This problem is irrelevant at large distances from the solute but can yield errors for points close to it. 31 Furthermore, differences in handling multipoles will also introduce differences in the results. 32 The biggest error, however, is introduced when using different protonation states for the model, as introducing an extra charge, or removing one, changes the entire electrostatic field significantly.

FIGURE 1 Peptide substrate amino acid (Pi and Pi′) and protease subpocket enumeration (Si and Si′) with respect to the cutting position (vertical line). The N-terminal side of the substrate is located on the left

FIGURE 2 Correlation of substrate specificity with backbone flexibility and orientational ordering of water molecules in the non-prime site (S6-S1) of Thrombin's binding cleft (ranging from red: specific, rigid and ordered, via yellow to green: promiscuous, flexible and disordered) 27
In a previous study, we used GRID probes that test van der Waals interactions and electrostatics simultaneously. Even when taking into account conformations extracted from molecular dynamics trajectories, we could only achieve limited correlation with substrate recognition. 33 To predict the specificity of proteases, Pethe et al 34 used a structure-based approach that ranks possible substrates according to interaction energies and reorganization penalties. Their scheme outperforms conventional methods that focus solely on knowledge-based prediction of substrate preferences.
Okun and Chen compared proteases with a statistical model. They calculated electrostatic similarities using a volumetric overlay of isopotentials. 35 In PIPSA, 36

Various approaches are already available that compare binding sites, 38 often for the purpose of off-target prediction and drug repurposing. 39 Such methods rely on molecular interaction fields (MIFs), eg, BioGPS 40,41 and IsoMIF, 42 on shape and physicochemical properties of the surface, eg, protein functional surfaces, 43 on graphs representing the 3D atomic similarities, eg, IsoCleft, 44 or on fingerprints describing the binding sites, eg, PocketMatch. 45 In several databases, properties of binding sites are stored for comparison, such as pseudocenters with projected physicochemical properties in CavBase, [46][47][48] in CavSimBase 49 and in SiteEngine, 50 sequence and structural similarity in CPASS, 51 position of functional groups in SuMo, 52 or surface geometrics and electrostatics in eF-site. 53 However, most of these methods are not meant to compare structurally very similar cavities as found in our set of chymotrypsin-like proteases (Figure 3), or they define the binding site without ligand information. Therefore, we chose to implement our own method, which is optimized to compare similar binding interfaces and delimits these interfaces by a distance criterion to a ligand, as described in detail in the Methods section below.
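A distance criterion of this kind can be sketched in a few lines. The example below is a minimal illustration and not the implementation used in this study: it keeps only those grid points that lie within a cutoff distance of any ligand atom. The function name, the coordinates, and the cutoff of 2.5 Å are hypothetical.

```python
import math

def points_near_ligand(grid_points, ligand_atoms, cutoff):
    """Return the grid points lying within `cutoff` (in Å) of any ligand atom."""
    selected = []
    for point in grid_points:
        if any(math.dist(point, atom) <= cutoff for atom in ligand_atoms):
            selected.append(point)
    return selected

# Hypothetical toy coordinates in Å.
grid_points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand_atoms = [(1.0, 0.0, 0.0)]

interface_points = points_near_ligand(grid_points, ligand_atoms, cutoff=2.5)
print(interface_points)
```

Defining the interface relative to a bound ligand means that structurally aligned proteases are compared over the same region of space, regardless of how an automated pocket detection would delimit each cavity.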

| Electrostatic substrate preferences
We extracted and isolated the electrostatic contributions and the shape recognition contributions in the substrate preference similarity metric that we defined previously. 24 We achieved this goal by binning the amino acid residues according to their electrostatic properties into positively charged (K, R, H), negatively charged (D, E), and neutral amino acids (G, P, A, V, L, I, M, F, Y, W, S, T, C, N, Q). In this way, we split off shape-dependent and size-dependent aspects of substrate recognition and focus solely on electrostatic recognition and specificity. The shape-dependent aspects of substrate recognition can be studied individually for each of the 3 bins, especially within the bin of neutral amino acids. However, this aspect is beyond the scope of the current study. Considering histidine as a positively charged amino acid is somewhat arbitrary, but in line with the usual classification in sequence logos. 54 Moreover, histidine is rather underrepresented in substrate data (1.7%); thus, this choice does not influence the analysis significantly (correlation data are shown in the Supporting Information).
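The binning step can be illustrated with a minimal Python sketch. The residue-to-class assignment follows the text above; the function names and the three aligned toy substrates (read as P4 to P4′) are hypothetical and are not taken from MEROPS.

```python
from collections import Counter

POSITIVE = set("KRH")
NEGATIVE = set("DE")
NEUTRAL = set("GPAVLIMFYWSTCNQ")
CLASSES = ("positive", "negative", "neutral")

def charge_class(residue):
    """Map a one-letter amino acid code to its electrostatic bin."""
    if residue in POSITIVE:
        return "positive"
    if residue in NEGATIVE:
        return "negative"
    if residue in NEUTRAL:
        return "neutral"
    raise ValueError(f"unknown residue: {residue!r}")

def charge_profile(substrates):
    """Per-position fractions of the three bins for aligned, equal-length substrates."""
    length = len(substrates[0])
    profile = []
    for pos in range(length):
        counts = Counter(charge_class(seq[pos]) for seq in substrates)
        profile.append({c: counts[c] / len(substrates) for c in CLASSES})
    return profile

# Hypothetical aligned substrates, positions P4 P3 P2 P1 P1' P2' P3' P4'.
toy_substrates = ["AAGRSLDE", "GVPKAEDF", "LLSRIVGG"]
profile = charge_profile(toy_substrates)
print(profile[3])  # bin fractions at P1
```

Reducing the 20 amino acids to 3 bins discards shape and size information by construction, so any remaining preference in the profile reflects charge alone.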
For each of the 9 proteases under investigation, we extracted substrate data from the MEROPS database 23 The electrostatic substrate similarity of 2 proteases is calculated by
Using the program GRID, 69

Electrostatic substrate preferences are not only highlighted for the S1 subpocket, but also for all other subpockets. For example, the electrostatic substrate preferences of Granzyme M reveal a propensity for negatively charged residues in the subpockets S3, S3′, and S4′ that is hardly identifiable in the cleavage site sequence logos (Figure 3).
Among the proteases that prefer positively charged amino acids in S1, Factor VIIa and Kallikrein-1 are quite different in their substrate preferences in that subpocket, yet they share a preference for positively charged substrate amino acids in S3′. Granzyme B, favoring negatively charged substrate residues over large parts of its binding site, generally shows only minimal similarity with the proteases that prefer positive residues in the S1 subpocket. Among these, the largest electrostatic substrate similarity with Granzyme B is found for Trypsin. While Trypsin is very specific for positive amino acids in S1, it accepts negatively charged residues in remote subpockets. This peculiarity is highlighted when compared with Granzyme B.

| Electrostatic molecular interaction fields (eMIFs)
With the negatively and the positively charged GRID probes, the eMIFs of the proteases can be determined (Figure 5).
Trypsin shows favorable interactions with the positive probe in its S1 subpocket, while in the more peripheral S4 and S4′ subpockets, it prefers the negative probe. Factor VIIa, Factor Xa, Thrombin, and

For Elastase-1, the eMIFs reflect the electrostatic substrate preference very well. As the S1 subpocket is specific for neutral amino acids, practically no electrostatic interactions are visible. The prime site substrate self-similarities show preferences for the negative probe, whereas the non-prime site varies more in electrostatic preferences. The eMIFs correspond very well with this. In the S3 subpocket, which appears to be rather unspecific in terms of electrostatics, both interactions can be observed, although the positive eMIF at that position is barely visible because it is hidden behind the negative one.

| Electrostatic substrate preferences and electrostatic molecular interaction fields
In the S1 subpocket of Trypsin, positively charged amino acids are strongly favored, which is also visible in the eMIF. On the periphery of the binding site, the protease starts favoring negatively charged amino acids, which is also mirrored very well by the calculated eMIFs, where the negative eMIF starts to dominate around the S3′ and S4′ subpockets.

Furthermore, the similarity of the eMIFs at the binding site of the different proteases was also calculated using the webPIPSA server 36,74 and the APBS method 75 for calculation of the eMIF (Figure S8 in the Supporting Information). In webPIPSA, we defined the selection of the binding site via the coordinates centered between the Cα atoms of W215, the catalytic serine (S195), and E192, as well as a radius of 17 Å. In general, both approaches are in good agreement, with a few differences between them. The most striking discrepancy is the apparent dissimilarity of Factor VIIa in the webPIPSA results to most of the other proteases that primarily read positively charged amino acids in S1, whereas the presented approach finds that the eMIF of Factor VIIa is rather similar to all these proteases. This difference could be due to the different definition of the binding interface, or to the different method used to calculate the similarity.

| DISCUSSION
In a previous study, a conclusive correlation between the knowledge-based protease substrate specificity and physics-based enthalpic aspects of the binding cleft of proteases 33 was hampered by simultaneously considering van der Waals interactions and electrostatics.
With the improved approach presented in this work, where we separate electrostatics from van der Waals interactions, a high correlation is found, in contrast, between electrostatic substrate preferences and eMIFs.
This shows that electrostatic recognition is a major factor in protease substrate recognition in all proteases of the Chymotrypsin family investigated here. This is in line with previously published work focusing on Thrombin by Huntington 76 and on Chymotrypsin C by Batra et al. 77 The most important aspect of our advance here seems to stem from focusing only on electrostatics, both in characterizing protease readout and interaction profiles. Mechanisms of electrostatic substrate recognition seem to be inherently different from other mechanisms of substrate recognition. Due to their strong nature and their long-range behavior, electrostatic interactions behave quite differently from other aspects of substrate recognition, which are dominated by shape complementarity. 78,79 Electrostatic interactions are not only strong and long ranging, but, compared with van der Waals interactions, also vary relatively little with small changes in distance. Hence, it is not surprising that considering only static X-ray structures already yields a very high correlation between electrostatic substrate similarities and electrostatic interaction field similarities. Flexibility of the binding interface influences electrostatics only to a minor extent.
Electrostatic contributions would vary substantially solely with major conformational changes and, obviously, with differences in protonation or ion coordination. The long-range behavior and continuous nature of electrostatic interactions also impede allocating electrostatic contributions to subpockets of the binding clefts. By correlating the electrostatic substrate recognition with the electrostatic interaction field of the entire binding cleft, we avoid the non-trivial task of apportioning the binding cleft into subpockets. Still, for an efficient substrate prediction, which however is beyond the scope of this study, such a partitioning of electrostatic contributions to subpockets of the binding cleft would be highly desirable. This is in line with the notion that protein-protein recognition follows a 2-step mechanism. Firstly, an initial encounter complex forms when enzyme and substrate meet. The association rates for this initial encounter complex are largely governed by electrostatics. 80 An energy funnel pulls substrate and enzyme together and directs the substrate towards the binding site [81][82][83][84] (Figure 9). In a second step, conformational changes lead to the formation of a compatible binding interface. 85,86 Here, shape complementarity and flexibility are crucial to enable weak van der Waals interactions and to avoid clashes.

FIGURE 7 eMIFs, eMIF overlap and electrostatic substrate similarity for chymotrypsin and Kallikrein-1 (top) and thrombin and factor Xa (bottom): The eMIFs and their overlaps are depicted in blue (positive probe) and red (negative probe). The eMIF overlap was calculated at a cutoff of 0 kcal/mol and visualized at a cutoff of 50 (kcal/mol)². The eMIF overlap on the right is depicted without protein surfaces, revealing overlapping eMIFs deep in the S1 subpocket hidden by the protein surfaces on the left. Above the overlap eMIF, the substrate similarity for the proteases is depicted for the positively charged amino acids (blue), for the neutral ones (yellow) and for the negatively charged ones (red)
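The caption of Figure 7 reports the eMIF overlap in units of (kcal/mol)², calculated at an energy cutoff of 0 kcal/mol. One reading that is consistent with these units is a point-wise product of the two interaction energies at grid points where both fields are favorable. The sketch below illustrates this reading only; it is an assumption, not the definition used in this study, and the function name and energy values are hypothetical.

```python
def emif_overlap(field_a, field_b, cutoff=0.0):
    """Point-wise overlap of two eMIFs sampled on the same grid points.

    Assumed definition: where both energies are below `cutoff` (kcal/mol),
    the overlap is their product in (kcal/mol)**2; elsewhere it is zero.
    """
    if len(field_a) != len(field_b):
        raise ValueError("both fields must be sampled on the same grid")
    overlap = []
    for e_a, e_b in zip(field_a, field_b):
        if e_a < cutoff and e_b < cutoff:
            overlap.append(e_a * e_b)
        else:
            overlap.append(0.0)
    return overlap

# Hypothetical probe energies (kcal/mol) of two proteases at four grid points.
protease_a = [-8.0, -2.0, 1.0, -10.0]
protease_b = [-7.0, 3.0, -4.0, -0.5]

overlap = emif_overlap(protease_a, protease_b)
# Indices of grid points that would be displayed at a threshold of 50 (kcal/mol)**2.
visible = [i for i, value in enumerate(overlap) if value >= 50.0]
print(overlap, visible)
```

With a product of this kind, only regions where both proteases interact strongly with the same probe contribute appreciably, which suppresses weak background interactions.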
Electrostatics and shape complementarity in the context of substrate recognition can be considered rather orthogonal properties resulting in different aspects of substrate recognition. 87 Thus, we believe that these 2 aspects of substrate recognition can be separated very efficiently by our knowledge-based approach to analyze substrate readout data as presented in this study. Electrostatic substrate preferences can be characterized very well by binning substrate residues according to their charge. On the other hand, we expect that shape complementarity can be characterized by analyzing substrate recognition within the 3 bins, especially within the neutral bin comprising 15 different neutral amino acid residues. If we can describe the contributions of electrostatics and shape complementarity in a solely physics-based way, it will be possible to predict the localized

| CONCLUSIONS
A knowledge-based approach to characterize differences in electrostatic substrate preferences is introduced and applied to 9 homologous serine proteases of the chymotrypsin family. The approach bins known substrate residues into positively charged, negatively charged, and neutral amino acids. Thus, electrostatic preferences in substrate recognition are quantified within subpockets of the binding cleft of the 9 serine proteases and can be compared between different proteases. Similarities and differences in electrostatic preferences can easily be identified on a localized subpocket level but also globally for the complete binding cleft.
On the other hand, eMIFs are calculated in a physics-based way studying X-ray structures using the program GRID in combination with user-defined probes that focus on electrostatics. The binding cleft within the X-ray structures is delimited by a proximity criterion to known ligands. Calculating the overlap between eMIFs results again in similarities and differences in electrostatic preferences.
Comparing the knowledge-based and physics-based similarities and differences in electrostatic preferences, a high correlation between the 2 totally different approaches is found. This implies that the electrostatic part of substrate recognition and substrate specificity can be explained very well by eMIFs.
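Such a comparison of two pairwise similarity matrices can be sketched as follows. The example is a minimal illustration that assumes a Pearson correlation over the off-diagonal protease pairs; the choice of correlation measure, the function names, and the two 3 × 3 toy matrices are hypothetical and do not reproduce data from this study.

```python
import math
from itertools import combinations

def upper_triangle(matrix):
    """Off-diagonal entries of a symmetric matrix, one value per protease pair."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical similarity matrices for three proteases.
substrate_similarity = [[1.0, 0.8, 0.2],
                        [0.8, 1.0, 0.3],
                        [0.2, 0.3, 1.0]]
emif_similarity = [[1.0, 0.7, 0.3],
                   [0.7, 1.0, 0.2],
                   [0.3, 0.2, 1.0]]

r = pearson(upper_triangle(substrate_similarity), upper_triangle(emif_similarity))
print(f"Pearson r = {r:.2f}")
```

Excluding the diagonal avoids inflating the correlation with the trivial self-similarities of 1.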
Due to the long-range nature of electrostatics, we assume that these electrostatic molecular interaction fields determine the formation of an initial encounter complex between substrates and proteases.

FIGURE 9
The binding interface of trypsin depicted with the eMIFs of the positive (blue) and negative (red) probe. An energy cutoff of −0.5 kcal/mol was used for the visualization of far-reaching electrostatic interactions. The eMIF forms a funnel-like long-range interaction profile that presumably guides substrates towards an initial encounter complex