Comprehensive 3D‐RISM analysis of the hydration of small molecule binding sites in ligand‐free protein structures

Abstract Hydration is a critical factor in the ligand binding process. Herein, to examine the hydration states of ligand binding sites, the three‐dimensional distribution function for the water oxygen site, gO(r), is computed for 3,706 ligand‐free protein structures based on the corresponding small molecule–protein complexes using the 3D‐RISM theory. For crystallographic waters (CWs) close to the ligand, gO(r) reveals that several CWs are stabilized by interaction networks formed between the ligand, CW, and protein. Based on the gO(r) for the crystallographic binding pose of the ligand, hydrogen bond interactions are dominant in the highly hydrated regions while weak interactions such as CH‐O are dominant in the moderately hydrated regions. The polar heteroatoms of the ligand occupy the highly hydrated and moderately hydrated regions in the crystallographic (correct) and wrongly docked (incorrect) poses, respectively. Thus, the gO(r) of polar heteroatoms may be used to distinguish the correct binding poses.

in incorporating the effect of hydrogen bonding between the protein and water molecules and the effect of the excluded volume (EV), even though both are important in hydration. One reported approach for studying the hydration states of ligand-free structures is the WaterMap. 3,4 In the WaterMap method, a molecular dynamics (MD) simulation is first performed for the ligand-free structure in explicit solvent. The hydration sites are defined by 1 Å spheres that are very frequently occupied by water molecules, and then the free energy change upon releasing a water molecule at the hydration site is computed. However, the high computational cost of WaterMap hinders its application to a large number (e.g., thousands) of proteins.
Another method is the WaterDock proposed by Ross et al., 5 which is a docking-based method and uses AutoDock Vina to identify the water binding sites. The short computation time of WaterDock enabled the analysis of 14 proteins in their holo structures, and the water molecules were classified as "conserved" and "displaced" with 75% accuracy. Nevertheless, the scoring function implemented is empirical and lacks both the EV effect and the interactions between water molecules.
Here, we employed the three-dimensional reference-interaction site model (3D-RISM) theory, 6 which is a statistical mechanical theory of solvation, to analyze the hydration states of ligand-free structures.
This theory enabled us to compute the three-dimensional water site distribution functions around a ligand-free protein structure with the force fields used in MD simulations. The EV effect and the interactions between water molecules are explicitly included. The usefulness of the 3D-RISM theory for analyzing hydration sites has been demonstrated before. For example, the positions of CWs inside the cavity of hen eggwhite lysozyme were successfully reproduced. 7 In that case, the water binding sites are not exposed to the surface, and MD-based methods such as WaterMap cannot be easily applied because the duration of the MD simulation is shorter than the time scale for the water molecules to penetrate the protein surface and reach the binding site inside.
The other advantage of 3D-RISM is that its computation time is much shorter than that of MD. Thus, this approach allows one to analyze the hydration states for thousands of ligand-free structures.
In this study, the 3D-RISM theory was used to compute the threedimensional distribution functions of water around the ligand-free structures of 3706 proteins, which were obtained from the protein-ligand complexes in the PDBbind refined set (v. 2017). [8][9][10][11][12][13] The 3D-RISM theory has been previously employed to calculate the hydration free energies for a large number of SMs [14][15][16] and a large number of conformations of a single protein obtained through MD simulations. 17,18 However, to the best of our knowledge, its application to the hydration states of thousands of different proteins has not been reported.

| 3D-RISM theory
The three-dimensional water site distribution function around the ligand-free protein structures was computed using the 3D-RISM theory implemented in the AmberTools18 suite. 19 Here, we briefly outline the computation process, while the details can be found in reference books. 6,20 The process is split into two steps for treating the bulk water (step 1) and the aqueous solution of a ligand-free protein structure at infinite dilution (step 2). In step 1, the water site-site correlation functions were computed with the dielectrically consistent RISM (DRISM) theory 21 combined with the Kovalenko-Hirata (KH) closure. 22 Then, in step 2, the three-dimensional water site distribution function around the ligand-free protein was obtained from the 3D-RISM theory, using the water site-site correlation functions from step 1 as input and the KH closure. For the water site α = H (hydrogen) or O (oxygen), the distribution function at position r is denoted by g α (r). Here, the analysis was carried out using g O (r).
As described in the introduction, the interactions between water molecules as well as the interactions between water molecules and protein atoms are explicitly treated in the 3D-RISM theory. Owing to the interactions between water molecules, g O (r) has multiple peaks. 6,20 While the peak nearest to the protein surface primarily arises from the interactions between water molecules and protein atoms, the other peaks primarily arise from the interactions between water molecules.
The following force fields and parameters were used for the calculations with the 3D-RISM theory. Amber ff99SB 23 was used for the proteins and ions, while the coincident SPC/E model 24 was employed for water. The histidine residue was set with a hydrogen on the delta nitrogen, HID. In step 1, the values of the dielectric constant and the bulk density were set at 78.497 and 0.03332 Å −3 , respectively.
The temperature was set at 310 K. The other parameters required for the computation were set at the default values implemented in the AmberTools18 suite. In step 2, a water box was prepared, and its size was set so that the minimum distance between the protein and the edge of the box was 14 Å. The linear grid spacings for the x, y, and z coordinates were set to 0.5 Å, and the maximum number of steps for convergence was 20,000. The default values implemented in the AmberTools18 suite were employed for the other parameters.

| Data set
Protein-ligand complexes in the PDBbind refined set (v. 2017) [8][9][10][11][12][13] consisting of 4,154 complexes were utilized as the data set for analysis. In the PDBbind data set, hydrogen atoms were already added to the heavy atoms other than those of the CWs. For each complex, the ligand and the CWs were removed, and the remainder (proteins, ions, and other ligands such as crystallization agents) was used in the following 3D-RISM calculation as a ligand-free structure. The ligand was separately saved in the Tripos Mol2 format. When the crystallographic protein structure contained multiple protein chains, the chain closest to the ligand was selected. Hereinafter, the resulting molecules are simply referred to as "proteins." Before the 3D-RISM calculations, the following preprocessing was applied to the proteins. First, CAPs (the acetyl group and N-methylamide, denoted respectively by ACE and NME in Amber) were added to the N-and C-terminals. ACE and NME were added to the residues before and after any missing fragments. For example, in chain H of thrombin (PDB code: 1a4w), residues 147-149 were missing. In this example, ACE and NME were added to the C-terminal side of Glu146 and to the N-terminal side of Gly150, respectively. Namely, residues 147 and 149 were ACE and NME, respectively. The same treatment was performed for the missing fragments in the other proteins. Second, the tleap command in the AmberTools18 suite was executed to assign the force field parameters for each atom. Proteins that produced errors under this command were excluded from the subsequent calculations. Then, a minimization of the protein structure was performed using the AmberTools18 suite with the constraint of 10.0 kcal mol −1 Å −2 for the heavy atoms to optimize the positions of the hydrogen atoms. During the minimization, the generalized Born model was employed for the solvent. Again, proteins that produced any error during the minimization were excluded. Finally, a total of 3,706 proteins were used for the subsequent computations with the 3D-RISM theory.

| Analysis of water oxygen distribution function with crystallographic waters
To analyze the distribution of solvent water, the distribution function Next, the number of contacts with the protein heavy atoms was counted for each CW in order to discuss the characteristics of the g O (r CW ) values. Such contacts were defined by a distance of 3.9 Å or less between the CW and the protein heavy atom, using the threshold value from the program HBPLUS. 25 To analyze the elements for the protein heavy atoms in contact with the CWs, the minimum distance from the protein heavy atom to the CW, denoted by r Min CW P , was obtained for each CW. The element of this closest protein heavy atom (nitrogen, oxygen, carbon, sulfur, etc.) was used to investigate the relationships between g O (r CW ) and the contacted elements on the protein, which were visualized using pie charts.

| Analysis of water oxygen distribution function with ligand heavy atoms
For each ligand heavy atom (LHA) located at r LHA , the distribution function of water oxygen at its position, g O (r LHA ), was computed. For background information, ρg O (r LHA )ΔV (where ρ is the density of the bulk solvent) represents the number of water molecules within a small volume ΔV around r LHA . 20 Thus, a higher g O (r LHA ) value indicates that more water molecules at r LHA are replaced upon ligand binding.
g O (r LHA ) was computed using the same procedure as for g O (r CW ). During this calculation, the LHAs were categorized according to the SYBYL atom types, which were obtained using the information listed in the ligand file in the Tripos Mol2 format. The SYBYL atom types are given in Table 1, where those shown in boldface were the focus of this study because there were enough data points for the analysis.
To investigate the elements of the protein heavy atoms in contact with the ligands, the minimum distance from the protein heavy atom to each LHA, denoted by r Min LHA P , was defined. For each LHA with a g O (r LHA ) value, the element of the closest protein heavy atom at r Min LHA P was assigned and visualized using a pie chart. Fluorine F

Chlorine Cl
Other halogens and metals -Note: The probability of g O (r) was analyzed for the atom types shown in boldface.
Two ligand poses were considered for each protein: a correct one and an incorrect one. The former was based on the crystallographic ligand structure (converted to the Tripos Mol2 format in the PDBbind database). The latter was selected from the ligand poses generated using AutoDock Vina for each of the 3,706 proteins. 26 The following process was used to select a pose that was dissimilar to the correct pose.

| Definitions of terms
In the subsequent sections, the terms "highly hydrated" and "moderately hydrated" are used. A CW, region, state, or binding site that is "highly hydrated" is defined by g O (r) > 4, whereas one that is "moderately hydrated" is defined by 1 < g O (r) < 4.

| Analysis of crystallographic waters with g O (r)
To check the results of our 3D-RISM calculations, the overlap between g O (r) and r CW for dihydrofolate reductase (PDB code: 1dhi) was examined visually in Figure 1a. Regions where g O (r) ≥ 4.0 (colored in red) strongly overlap with the CWs. In the red regions, the probability for the existence of solvent water molecules is at least four times higher than that in bulk water. Thus, the observation of CWs in the red regions is reasonable, and it indicates that the 3D-RISM theory successfully reproduced the positions of these CWs. Figure 1a also shows that the regions with high g O (r) are tube-shaped, suggesting that the movement of water molecules is easy within the tube but relatively hard outside of it.
The distribution function of water molecules at each CW, g O (r CW ), was computed for the comprehensive analysis of the hydration state at the CWs. In the histogram in Figure 1b, g O (r CW ) > 1 at most of the CW positions. Therefore, these CWs are located in regions with a higher probability for the existence of water than that in the bulk, which is also a reasonable result. It was observed that the histogram has a maximum at g O (r CW ) $ 5, and the range of We also counted the number of contacts with the protein heavy atoms, N C , so as to understand the value of g O (r CW ) for each CW at the atomic level. Then, the probability of g O (r CW ) for the CWs with , was obtained. Figure 1c shows Þfor N C = 1, 5, 10, 15, and 20. Clearly, the highest peak of is strongly correlated with N C : the position of this peak is shifted toward the right (larger g O (r CW )) when N C is increased. Thus, a region with very high g O will appear as almost completely surrounded by the protein. Figure 1d shows an example CW with N C = 20 (g O (r CW )= 14.4), which is surrounded by Ser65, Phe66, His94, Phe95, and Trp97, and half of its contacts are formed with the heavy atoms in Phe66.
As described earlier, the range of g O (r CW ) in the histogram of close to ions. One example is depicted in Figure 2. In this case, the CW was coordinated to Zn 2+ , and the electrostatic interactions between the ion and water molecules led to very high g O (r CW ) values.
Another case of g O (r CW ) ≥ 30 is the situation shown in Figure 1d, where the CW is deeply buried inside the protein.

| Analysis of hydration states close to ligands
To investigate the effects of ligands on g O (r CW ), we analyzed g O (r CW ) at CW positions that are at most 4.0 Å away from any of the LHAs.
The number of these CWs is approximately 45,000. In Figure 3a, the histogram of g O (r CW ) for these CWs close to the ligand (red line) is compared to that for all CWs (black line). Both histograms are normalized to one because they contain different numbers of CWs. The normalized probability (Y-axis) is denoted by P g ( g O (r CW )) hereinafter.
While the shapes of both P g ( g O (r CW )) profiles were almost the same for g O (r CW ) ≥ 10, when g O (r CW ) < 10, the probabilities appeared very different for CWs close to the ligands and all CWs. Only one peak at g O (r CW )$4.8 was observed for all CWs (black line), while there were two peaks for the CWs closest to the ligands (red line) at g O (r CW )$2.8 and $4.8, and the height of the latter peak was slightly decreased compared to its counterpart for all CWs.
To investigate the atomic origin of the two peaks in P g (g O (r CW )) for CWs close to the ligands, the minimum distance from the protein heavy atom to the CW, r Min CW P , was computed for each CW involved in either peak. The probabilities of r Min CW P , hereinafter denoted by P r r Min shown in Figure 3b for the CWs with g O (r CW ) = 2.8 (peak 1) and 4.8 (peak 2). The P r r Min CW P À Á profiles were significantly different. For peak 1 (black solid line), P r r Min CW P À Á had two peaks at r Min CW P $ 2.8 and 3.6 Å, with the latter being the major peak. For peak 2 (black dashed line), there was no peak in P r r Min CW P À Á at r Min CW P $ 3.6 Å, and the height of the major peak at r Min CW P $ 2.8 Å was substantially increased. The CWs belonging to peaks 1 and 2 differed in the type of nearest protein heavy atom elements (Figure 3c). For the CWs with g O (r CW ) = 4.8 (peak 2), the nearest-neighbor protein heavy atoms were mainly nitrogen and oxygen, while for half of the CWs with g O (r CW ) = 2.8 (peak 1), the nearest-neighbor protein heavy atoms were carbon.
From the results described above, the atomic-level origin of the two peaks in P g ( g O (r CW )) was given as follows. First, it was suggested the CWs belonging to peak 2 formed hydrogen bonds with the nitrogen and oxygen atoms of the proteins because these CWs had a maximum P r r Min CW P À Á at r Min CW P $ 2.8 Å (a typical distance for hydrogen bonds such as NHÁÁÁO C and OHÁÁÁO C), and the nearestneighbor protein heavy atom was mainly nitrogen and oxygen. An example is shown in Figure 4a, where the CW was stabilized by the hydrogen bond between it and the hydroxy group of Ser195. Another hydrogen bond was also formed between the CW and the amino group of the ligand. The resultant hydrogen-bonding network of Ser195, the CW, and the ligand is thought to increase the stability of the CW.
For the CWs with g O (r CW ) = 2.8 (peak 1), the major peak of

| Analysis of hydration states at ligand heavy atoms
To examine the characteristics of water molecules to be replaced by the ligand at the protein binding site, g O (r LHA ) was examined for the correct and incorrect poses of the ligand. For some other atom types, these characteristics were different between the poses. Figure 6 shows that the value of P(g O (r LHA )) at high g O (r LHA ) (highly hydrated states) was higher for the correct poses  Figure 6). At point 3 (g O (r LHA )= 7.95, that is, highly hydrated states), P ( g O (r LHA )) was higher for the correct pose than for the incorrect one.
In contrast, the peak of P( g O (r LHA )) near point 1 (g O (r LHA )= 1.35, that is, moderately hydrated states) was higher for the incorrect pose. This was also true for the following atom types (Here "moderately hydrated" and "highly hydrated" are denoted by "MH" and "HH", respectively. Although most of the selected points correspond to the positions of peaks in Figure 6, some points were selected at non-peak positions to allow comparisons between the P(g O (r LHA )) profiles for

| Analysis of binding site hydration and ligandprotein interactions
To further investigate the selected points on g O (r LHA ) shown in Figure 6, which as previously mentioned correspond to the peaks in the correct poses as well as some other points in the highly hydrated regions, we analyzed the detailed origin of the ligand-protein and water-protein interactions. The average of the minimum distance from the protein heavy atom to each LHA, r Min LHA P , for all the LHAs of atom type X in the ligand at point i in Figure 6 was defined as r Min LHA P X i . Its value represents the average interaction distance between the LHA of atom type X and the closest protein heavy atom. The g O (r LHA ) and Regarding the neutral carbon atoms of the ligand, which include C.2, C.3, and C.ar (sp2, sp3, and aromatic carbon, respectively), no peak with an interaction distance r Min LHA P X i smaller than the hydrogen bonding distance (3.2 Å) was observed. This was also true for the points in the highly hydrated region in Figure 6  Using the results in Figures 6 and 7, we propose the following picture of ligand binding. In highly hydrated regions, P( g O (r LHA )) was higher for the correct poses and lower for the incorrect ones. It was also observed that g O (r) was generally larger at positions closer to the protein surface. Thus, the correct binding pose might be the one that Regarding the hydration state of the LHAs, when the LHA was a heteroatom, the peak height in the distribution of g O (r LHA ) at high g O (r LHA ) values, e.g.,$8, was higher for the correct poses, whereas that at low g O (r LHA ) values, e.g.,$1, was higher for the incorrect poses.
These observations suggested that the correct pose may be distinguished from incorrect ones by examining the overlap between the polar heteroatoms of the ligand and the highly hydrated regions (i.e., with high g O (r LHA ) values) of the binding site. When the distance between the LHAs in proteins and their interaction partners (r Min LHA P ) was within the distance of a hydrogen bond, g O (r LHA ) mostly ranged from 7 to 8 and their partners were oxygen or nitrogen in the proteins. Hence, the protein environment (in this case, the oxygen and nitrogen atoms at the protein surface) seems to determine the hydration states around the protein and facilitates the formation of hydrogen bonds. In contrast, for 3.2 Å ≤ r Min LHA P ≤ 4 Å, the g O (r LHA ) values ranged mostly from 3 to 6 and the major interaction partner was carbon. Therefore, compared with oxygen and nitrogen atoms, carbon atoms at the protein surface make the surface less hydrated. Consequently, the protein surface is more suitable for weak interactions such as CH O, CH F, CH π, and hydrophobic interactions.
It is suggested that binding-pose prediction would be possible based on the following results. Water molecules in highly hydrated regions are likely to be replaced by the polar heteroatoms of the LHAs. However, the water molecules in moderately hydrated regions are more likely to be replaced by the hydrophobic atoms of the LHAs than by the polar heteroatoms of the LHAs. In the WaterMap method, unfavorable hydration sites are amenable to binding. 4 This idea is consistent with our result that water molecules in moderately hydrated regions are more likely to be replaced by the hydrophobic atoms of the LHAs.
One remaining issue is to elucidate the effects of the structural dynamics of proteins on their hydration state. In the present study, we analyzed the hydration states of ligand binding sites with the assumption that no structural change occurred upon ligand binding. This is because apo structures were not always available, and thus it is unclear how the protein conformation changes upon ligand binding. In future, an analysis of the hydration states must be performed with the structural changes of proteins upon ligand binding taken into consideration using MD simulations. A bottleneck for this analysis would be that comprehensive calculations of the hydration states for multiple conformations (e.g., generated by MD simulations) of a large number of proteins still requires huge computation time. Thus, the speed of hydration state computation using the 3D-RISM theory must be improved. Another issue is the force field. While the Amber ff99SB force field and the coincident SPC/E water model were found to be reasonable in this analysis, the quality of force field calculations is not comparable to that of quantum mechanics calculations. The adoption of other currently available force fields could be helpful, and more accurate new ones may have to be devised in the future.
The analysis in this study can be extended to other systems such as protein-protein complexes or mixtures of water and a fragment of a drug molecule. 33 The hydration state at the proteinprotein interface would be different from that at the protein-ligand interface because the shape and properties of the interfaces are different. For instance, while the ligand binding site is a groove or a deep pocket in the protein, the protein-protein interface is often flat. Thus, water may play different roles in the formation of these two types of protein complexes. Extending the present method to protein-protein complexes would shed light on the roles of water in their formation.
For the mixtures of water and a fragment of a drug molecule, the solvation state of the fragment at the protein binding site, which was obtained using MD simulations of a protein in a mixture of water and the fragment, has been used for virtual screening 34 and the identification of hot spots on the protein surface. 35 However, only a few proteins have been considered for such applications. Using the 3D-RISM theory to analyze the solvation state of the fragment at the binding sites would allow a comprehensive characterization of the hot spots for thousands of proteins. These extensions will be reported in further publications.

ACKNOWLEDGMENTS
This work was supported by a grant from the Medical Science Innovation Hub of RIKEN. Most of the computations in this study were performed using the HOKUSAI supercomputer at RIKEN.