Ligand-protein interactions in lysozyme investigated through a dual-resolution model

A fully atomistic modelling of biological macromolecules at relevant length- and time-scales is often cumbersome or not even desirable, both in terms of computational effort required and it a posteriori analysis. This difficulty can be overcome with the use of multi-resolution models, in which different regions of the same system are concurrently described at different levels of detail. In enzymes, computationally expensive atomistic detail is crucial in the modelling of the active site in order to capture e.g. the chemically subtle process of ligand binding. In contrast, important yet more collective properties of the remainder of the protein can be reproduced with a coarser description. In the present work, we demonstrate the effectiveness of this approach through the calculation of the binding free energy of hen egg white lysozyme (HEWL) with the inhibitor di-N-acetylchitotriose. Particular attention is posed to the impact of the mapping, i.e. the selection of atomistic and coarse-grained residues, on the binding free energy. It is shown that, in spite of small variations of the binding free energy with respect to the active site resolution, the separate contributions coming from different energetic terms (such as electrostatic and van der Waals interactions) manifest a stronger dependence on the mapping, thus pointing to the existence of an optimal level of intermediate resolution.

A fully atomistic modelling of biological macromolecules at relevant length-and time-scales is often cumbersome or not even desirable, both in terms of computational effort required and a posteriori analysis. This difficulty can be overcome with the use of multi-resolution models, in which different regions of the same system are concurrently described at different levels of detail. In enzymes, computationally expensive atomistic detail is crucial in the modelling of the active site in order to capture e.g. the chemically subtle process of ligand binding. In contrast, important yet more collective properties of the remainder of the protein can be reproduced with a coarser description. In the present work, we demonstrate the effectiveness of this approach through the calculation of the binding free energy of hen egg white lysozyme (HEWL) with the inhibitor di-N-acetylchitotriose. Particular attention is posed to the impact of the mapping, i.e. the selection of atomistic and coarsegrained residues, on the binding free energy. It is shown that, in spite of small variations of the binding free energy with respect to the active site resolution, the separate contributions coming from different energetic terms (such as electrostatic and van der Waals interactions) manifest a stronger dependence on the mapping, thus pointing to the existence of an optimal level of intermediate resolution.

I. INTRODUCTION
One of the most relevant challenges of computational biochemistry and biophysics is the accurate calculation of binding free energies [1][2][3], which represents one of the key steps in the identification of pharmacological targets as well as in the development of new drugs [4][5][6]. However, the large sizes of the molecules under examination (often above the hundred of residues), as well as the necessity to screen through large datasets of potential candidate molecules, make this effort onerous in terms of time and computational resources.
A promising way to mitigate these limitations is the use of multiple-resolution models of the protein, that is, representations in which different parts of the molecule are concurrently described at different levels of resolution [7][8][9][10][11][12][13][14][15]. The chemically relevant part of the protein, e.g. the active site, is modelled at level of detail, typically atomistic. For the remainder, on the contrary, a simplified representation is used, where several atoms are lumped together in effective interaction sites. The working hypothesis underlying these methods is that only a relatively small part of the molecule requires an explicitly atomistic treatment; the remainder, in fact, is mainly responsible for large-scale, collective fluctuations whose function-oriented role is well recognised and prominent [15][16][17][18][19], however also prone to be accurately reproduced by lower-resolution representations [20][21][22][23][24][25]. Hence, the resulting model favourably joins the accuracy of an atomistic (AT) description where needed and the computational efficiency of a coarse-grained (CG) one where possible. * raffaello.potestio@unitn.it In order to take full advantage of the dual-resolution approach to protein modelling, though, one has to solve a few key open issues: first, the definition of the appropriate coarse-grained model to employ in the lowresolution part [25][26][27][28][29][30][31][32][33]; second, the coupling between high-and low-resolution models, which has to be performed so as to guarantee that the appropriate observables are reproduced with respect to the reference provided for example by a fully atomistic simulation. This issue entails a further one, namely the identification of the correct observables apt to quantify the fidelity with which the behaviour of the system is reproduced by the dual-resolution model; third, the selection of the subpart of the molecule that requires a high-resolution modelling. In the present work we will focus specifically on this third aspect.
Various methods and approaches have been developed in the past few years to describe proteins in dual resolution [7,[10][11][12][13][14]. In general, the high-resolution part is modelled at the all-atom level, making use of one of the several atomistic force fields available. The coarsegrained representations range from simple bead-spring elastic networks [15,20,23] to more sophisticated Gōtype models [11]. Recently, we have proposed a dualresolution model [15] where, in the CG part, only the C α carbons of the protein chain are retained and connected one with the other by harmonic bonds. This model has been employed in the present work with the aim of assessing the accuracy of a hybrid atomistic/coarse-grained description of a protein for binding free energy calculations. The system under examination is hen egg-white lysozyme in explicit water, bound to a sugar substrate, di-N-acetylchitotriose. We carried out calculations of the binding free energy of the ligand in the active site, with a twofold objective. In fact, not only we aimed at verifying that the computed quantity in the dual-resolution model matches a reference, all-atom calculation; but rather we also investigated the impact of different choices in the definition of the high-resolution subdomain. This aspect bears the highest prominence, as it is becoming increasingly more evident that a crucial component in the construction of accurate and effective low-resolution models for biological and soft matter systems is represented by the mapping [15,32,33], that is, the particular selection of collective variables employed to describe the system. Here, we provide novel evidence of this general property in the context of a dual-resolution model of a biomolecule, and describe a transferable strategy to tackle this issue.

II. METHODS
The system under examination in the present work is hen egg-white lysozyme (HEWL) in aqueous solution. In this model, the binding site of the enzyme and the substrate molecule, the inhibitor di-N-acetylchitotriose, are represented with atomistic detail. The protein model employed is not adaptive, that is, the resolution of a given residue is fixed -either atomistic or coarse-grained-and does not change throughout a simulation. However, at difference with other works [8,9,11], several values of the number of protein residues treated at high resolution have been explored and employed in independent calculations. The impact of choosing different numbers of active site residues to model at the atomistic level is a central aspect of this study. The coarse-grained model employed to describe the low-resolution part of the protein is a simple bead-spring representation where the selected sites (namely the C α atoms) are connected by elastic bonds penalising the deviations from the distances that interacting atoms have in the reference conformation. Two values of elastic constants employed, one for C α 's along the chain, and one for all other bonds. Water molecules are described in atomistic detail throughout the whole simulation box: the interaction with the high-resolution part of the protein takes place through the standard allatom force field, while the interaction with the coarsegrained beads is mediated by a purely repulsive potential acting on the sole oxygen atom.
Hereafter we provide a detailed description of the model. We first discuss the calculation of the binding free energy ∆G bind , then we outline the dual-resolution model and its coupling to the atomistic part, and finally report information about the simulation setup. Further details are made available in the Supporting Information.

A. Binding Free Energy calculation
One of the key points of this work is the calculation of the protein-ligand binding free energy ∆G bind , which quantifies the affinity of a molecule towards a protein [1][2][3]. As such, it plays a prominent role in the investigation of the biochemical function and activity of enzymes and similar biomolecules, and in the development of effective drugs.
∆G bind is defined as the difference between the free energy of the system in the configuration in which the ligand is bound to the active site (G b ) and the corresponding value when the ligand is absent (G ub ): This value, in the specific case under examination, varies according to the number of active site residues modelled with atomistic resolution, as we will see in Sect. III.
The free energy difference between two states is here computed by means of thermodynamic integration (TI) [34]. Specifically, a scalar λ ∈ [0, 1] is defined which parametrises the potential energy of the system as U λ (r) = λU A (r) + (1 − λ)U B (r) connecting the states A and B. The sought quantity is given by: Since the free energy is a state function, the nature of the path is unimportant, and one can choose a thermodynamic cycle that connects the bound and unbound states through several intermediate ones, as illustrated in Fig.1. In particular, we can identify two main terms: the insertion of the ligand from vacuum to water ∆G lig , and the decoupling from the protein ∆G compl . A further step is the removal of the restraints that keep the ligand in proximity of the protein during the damping of the ligand-protein interactions, ∆G r of f ; this latter calculation can be carried out analytically without the need to run simulations. Hence, ∆G bind is the algebraic sum of the previous three terms: According to the previous definitions of each term, neither ∆G lig nor ∆G r of f changes with the protein resolution: indeed, the former corresponds to the solvation free energy of the ligand, which is always treated at the atomistic level; likewise, the calculation of the restraint removal free energy is analytic [3]. The unique term that varies depending on the number of active site residues modelled in high resolution is the free energy change of the protein-ligand complex between the bound state and the state where the ligand is removed, that is, the variation of ∆G bind is equal to the variation of ∆G compl .
The alchemical change in the calculation of ∆G compl is performed in three steps (in the following, the subscripts c and stand for complex and ligand, respectively). First, one adds a set of restraints between protein we decouple the ligand from the protein (∆G compl , which also includes a set of restraints between ligand and protein) and subsequently introduce it in water (∆G lig ). A further step is the restraints removal (∆G r of f ) whose calculation is analytical.
and ligand (∆G r on ) in order to avoid the problem of the ligand leaving the binding pocket when interactions are being removed. The presence of restraints is indicated in the cycle scheme of Fig.1 with a red circle: it represents the fact that the ligand is confined in a certain volume. For this work we use the set of restraints described by Boresch [3]. Second, Coulomb interactions are switched off (∆G coul,c ); third, the Lennard-Jones potentials modelling van der Waals interactions are removed (∆G LJ,c ). Likewise, the alchemical change in the ligand free energy ∆G lig is performed in two steps: first switching on Coulomb interaction (∆G coul, ), and then Lennard-Jones (∆G LJ, ). The last contribution to the binding free energy, ∆G r of f , derives from restraint removal: its calculation is analytical and therefore it does not require alchemical changes. These transformations are summarised in Fig. 1 and Tab. I. Further details can be found in the Supporting Information in the section relative to the thermodynamic cycle. The calculation of ∆G compl can be carried out in two different ways, namely decoupling and annihilation. Decoupling refers to turning off the interaction between the molecule and its environment, while maintaining the potentials among atoms constituting the molecule; annihilation, on the other hand, implies turning off the interac- tion between the molecule and the environment as well as the intramolecular interaction. Here we consider the values of ∆G obtained through ligand decoupling, since this process is more intuitive with respect to annihilation; furthermore, the ligand is always treated at fully atomistic detail, therefore it is not involved in the change of free energy while varying the protein resolution. In Tab. III and Fig. 6 (and with greater detail in the Supporting Information, annihilation section) we provide data showing that the values of binding free energy obtained using decoupling and annihilation are consistent within the error bars.

B. Dual-Resolution protein model
In this work the solvent is treated with all-atom detail, while the protein has a fixed (i.e. position-and time-independent) dual-resolution. The binding site is modelled with atomistic resolution, whereas the rest of the protein is coarse-grained. To describe the lowerresolution part we employ an elastic network model (ENM) [15,20], in which each residue is mapped onto a bead whose position corresponds to the C α atom in the atomistic description. These beads are connected by harmonic springs as shown in Fig. 2.
The potential energy is given by: with spring constants k ij , equilibrium distance r 0 ij , a cutoff distance r c , i and j are the node index, and θ(r) is a Heaviside theta function taking value 1 if r > 0 and 0 otherwise. In this model we made use of two different elastic constants: a very stiff spring (k b ) for consecutive beads, represented in blue in Fig. 2; and a weaker spring k nb for not consecutive beads whose distance in the reference (native) conformation lies below a fixed cutoff (in green).
The ENM used here is parametrised to reproduce the conformational fluctuations of the reference all-atom model, these being quantified by the root mean square fluctuations (RMSF) of the all C α atoms of the system [15]. The residues in direct contact (H-bonding or hydrophobic contact) with the substrate are modelled with all-atom detail; in order to select the other binding site residues to be described at the atomistic level, we sorted them by increasing distance of their the center of mass from the closest ligand atom. The water-CG protein interaction consists in a simple excluded volume, modelled via a Weeks-Chandler-Anderson (WCA) potential [35]. The details about the procedure followed to determine the ENM elastic constants and the excluded volume interaction are provided in the Supporting Information, while the numerical values of the resulting parameters are reported hereafter.

C. Simulation details
The reference model is given by the 2 ns equilibrated PDB structure 1HEW in the NPT ensemble (the Parrinello-Rahman barostat [36] with a time constant of 2.0 ps and 1 bar was used). Both fully atomistic and dual-resolution models of HEWL are solvated in water and placed in a cubic simulation box of 7.06 nm side. The force field employed is Amber99SB [37], whereas the water model is TIP3P [38]. The inhibitor, which was always atomistic, had GLYCAM forcefield parameters consistent with Amber99SB [39]. The TI binding free energy calculation consists of 3 different steps: ∆G compl , ∆G r of f , ∆G lig : 1. The protein-ligand complex free energy (ΔG compl ) calculation uses 11 λ values per ∆G restr on,c , 5 evenly spaced λ values per ∆G LJ,c (with separation 0.20) and 15 λ values per ∆G coul,c , with 600 ps of simulation per λ in the fully atomistic case, and 4000 ps in the dual-resolution case to improve the statistics.
2. The restraint removal free energy (ΔG r off ) calculation is analytical (details on Supporting Information).
In the thermodynamic integration we employ the softcore potential of Ref. [40] with parameters α = 0.5 and p = 1.0 to avoid possible singularities in the Lennard-Jones terms from atoms overlapping during the alchemical change. The temperature is kept constant at 298 K by means of a Langevin thermostat with a friction constant γ = 15 ps −1 . The integration step is 1 fs. The calculation of electrostatic interaction is performed using the reaction field method with a dielectric constant = 80 and a cutoff of 1.2 nm. These parameters are a good compromise between speed and accuracy, as verified in Ref. [41]. The SETTLE [42] and RATTLE [43] algorithms for rigid water and rigid bonds to hydrogen have been used. Each system is prepared using fully atomistic minimisation with steepest descent and 6 ns of equilibration in NVT (for both ligand-free and ligand-bound systems). All simulations (both fully atomistic and dual-resolution) are carried out with the ESPResSo++ simulation package [44,45], in which we have implemented TI (except in case of annihilation, for which all steps are performed in both ESPResSo++ and GROMACS [46]). Some preliminary fully atomistic equilibration simulations use GRO-MACS. The error bars shown are calculated using the Student t at 95% confidence limit [47], via standard deviations obtained using block averaging in which all trajectories are divided into four blocks of equal length.
The parametrization of the dual-resolution model is consistent with the work in Ref. [15]: the spring constant between consecutive C α nodes along the backbone (k b ) has a stiff value of 5 · 10 4 kJ · mol −1 · nm −2 , whilst all the other ones (k nb ) have a value of 160 kJ · mol −1 · nm −2 , until 1.2 nm as cutoff, parametrised by minimising the average root mean square error in the C α RMSF. Moreover, a WCA interaction is applied between C α nodes and all solvent molecules center of mass. In the WCA potential, has a value of 0.34 kJ · mol −1 arbitrarily chosen as the value for carbon in the atomistic forcefield, and σ i = R g,i · c where R g,i is the radius of gyration of a given residue i where c is the same for all amino acids. The value of c is tuned to give the correct bulk water density of reference for a protein-water system. The c value found is 0.658. Further explanations about c can be found in the Supporting Information.

III. RESULTS AND DISCUSSION
We performed the calculation of ∆G b of lysozyme modelled in dual-resolution, varying the number of atomistic residues constituting the binding site and comparing the results with a fully atomistic reference simulation. Recall that the binding free energy calculation consists of three steps: restraint removal, ligand ∆G, and ligand-complex ∆G; of these, only the latter depends on protein resolution, that is, only ∆G compl assumes different values for different numbers of active site residues described at the all-atom level.
As explained in the previous section, the contribution coming from the restraints can be analytically computed and amounts to ∆G r of f = −31.3 kJ · mol −1 . Likewise, the Coulomb and Lennard-Jones contributions to the ligand free energy ∆G lig are the following: Hence: The final step is the calculation of ∆G compl , whose results, including the comparison between dual-resolution model and fully atomistic reference, are shown in Tab. II and illustrated in Fig. 3.
The first three columns of the table describe the Coulomb, Lennard-Jones, Restraints contributions to free energy, respectively, while the last one corresponds to the value of the total ligand-protein complex free energy. All the values are expressed in kJ ·mol −1 . In Fig. 3, the atomistic reference is represented with a dash black line with its error bar. In particular, panels (a), (b) and (c) show the three components that contribute to the total complex free energy, reported in panel (d). Looking at these values as a function of the number of all-atom active site residues, we notice that there are important deviations of the free energy from the reference, especially in the case of 3 and 4 atomistic residues. On the contrary, the total value of the binding free energy agrees with the reference within the error bar in all cases.
Furthermore, we observe that the trend of free energy values, in comparison to the reference, is essentially the same: starting from 3 amino acids it approaches the reference until reaching 6, both in its components and in total. In contrast, going from 6 to 8 atomistic residues the value deviates from the reference, even though the total remains close to it. Finally, from 8 to 10, ∆G converges again. Hence, increasing the number of atomistic residues does not introduce necessarily an improvement of the computed free energy, at least as long as the various free energy components are considered separately.
In order to gain further, quantitative insight into these results, we computed the the quadratic deviation from the reference, δ 2 , defined as: where the index i = 3...10 runs over atomistic residues. Fig. 4 reports δ 2 as a function of the number of active site amino acids modelled with atomistic detail.
The plot shows that the binding free energy computed in the dual-res model approaches the reference as the number of atomistic active site residues increases, and most importantly this approach takes place for each component up 6 residues. Beyond this value, though, the trend stops and the deviation becomes larger, peaking at 8 residues and decreasing when further atomistic amino acids are added. These results highlight a non-monotonic dependence of the free energy on the mapping, that is, the number of retained atomistic residues. If, on the one hand, the overall value of the binding free energy (Fig. 3 panel d) levels to the reference with as few allatom residues as 4, the separate components oscillate and reach the plateau only for larger numbers. The existence of a minimum in the standard deviation of all three contributions pinpoints a particular number of atomistic active site residues for which the accuracy of the computed free energy is the highest and the economy of the highresolution subpart the largest. Including more than 6 atomistic residues counterintuitively worsens the result -when the various contributions are looked at-and the previous accuracy is only recovered when more residues are included. This behaviour suggests that the total free energy undergoes an error cancellation which hides the deviations of the separate terms.
A possible explanation for this nontrivial behaviour is that when 6 active site residues are modelled with allatom accuracy (Fig. 5b) the ligand is stable in the catalytic site, namely it is surrounded by a complete shell of atomistic residues. The addition or deletion of other residues (Figs. 5c and 5a respectively) leads to a worsening of ∆G: in the first case, the two added residues (in pink and grey) are located behind the first shell of amino acids (far away from the ligand) and start to form a second, incomplete shell; in the second case, only three atomistic amino acids take part in the direct interaction with the ligand: therefore, the first layer is still incomplete and important interactions are missing; in order to improve the free energy value one has to add further amino acids in order to complete the second shell. We emphasise that the impact on the deviation from the reference is inversely proportional to the distance of the added/removed amino acid. Thus, the farther the atomistic amino acid is from the ligand, the more negligible its effect is. In the Supporting Information we provide detail about the other numbers of all-atom residues not reported here. Finally, the values of binding free energy (also for the case of annihilation whose calculations are reported in the Supporting Information) are summarised in Tab. III and illustrated in Fig. 6.

IV. CONCLUSIONS
In this work we have shown how the dual resolution model employed, constituted by an all-atom subregion coupled to an elastic network model remainder, can be used to calculate the binding free energy of an enzymesubstrate complex with atomistic accuracy. Furthermore, and most importantly, we have highlighted the impact that different choices of the model resolution can have. Specifically, we have computed the total value of the binding free energy as well as that of its various energetic components, and quantitatively inspected how these change when different selections are performed for the subgroup of amino acids, ranging from 3 to 10 in total, to be modelled at the fully atomistic level.
At first sight, one can appreciate that the binding free energy value rapidly converges to the atomistic reference when as few as 4 amino acids constituting the active site are described all-atom. This comforting result, however, unveils a greater complexity when the different terms constituting the free energy are looked at separately. These show an oscillating behaviour as the number of all-atom residues in the active site is increased, with a decreasing difference from the reference followed by a sudden jump to larger values, which dampens upon further addition of atomistic amino acids. The rationale in this behaviour is identified in the structure of the active site, which is constituted by a first shell of the six residues exposed to the solvent and closest to the ligand; when further amino acids beyond these are modelled with atomistic resolution, they interact with the substrate affecting the binding free energy components and shifting them away from the reference, with a steadily lowering impact as the model's resolution is increased -as one can expect. Surprisingly, very little if no signal of this behaviour is observed in the value of the binding free energy as a whole, rather it becomes visible only upon inspection of its separate contributions.
The results of this work thus highlight the importance of mapping in the construction of multi-scale and multiresolution models, as a higher degree of detail does not necessarily correlate with a higher accuracy of the quantities of interest. The implications of these observations should serve as a warning in the realisation of coarsegrained models concurrently employing various levels of detail for different regions of the same system, whose range of application spans from fundamental understating of a molecule's properties to real-life pharmaceutical applications.