Conformational response to ligand binding of TMPRSS2, a protease involved in SARS‐CoV‐2 infection: Insights through computational modeling

Thanks to the considerable research which has been undertaken in the last few years to improve our understanding of the biology and mechanism of action of SARS‐CoV‐2, we know how the virus uses its surface spike protein to infect host cells. The transmembrane prosthesis, serine 2 (TMPRSS2) protein, located on the surface of human cells, recognizes the cleavage site in the spike protein, leading to the release of the fusion peptide and entry of the virus into the host cells. Because of its role, TMPRSS2 has been proposed as a drug target to prevent infection by the virus. In this study, we aim to increase our understanding of TMPRSS2 using long scale microsecond atomistic molecular dynamics simulations, focusing on the conformational changes over time. The comparison between simulations conducted on the protein in the native (apo) and inhibited form (holo), has shown that in the holo form the inhibitor stabilizes the catalytic site and induces rearrangements in the extracellular domain of the protein. In turn, it leads to the formation of a new cavity in the vicinity of the ligand binding pocket that is stable in the microsecond time scale. Given the low specificity of known protease inhibitors, these findings suggest a new potential drug target site that can be used to improve TMPRSS2 specific recognition by newly designed inhibitors.

β-coronavirus genus of Coronaviridae family of the order Nidovirales. 6 The virion is defined by four structural proteins: spike, envelope, membrane, and nucleocapsid. The latter covers the genome, which in turn is enclosed by a lipid bilayer formed by the other three proteins.
All of these are involved in the mechanism of virus entry into the host cell. The recognition and the binding to the host cell are performed by the spike protein, which binds to the target cell through specific interactions with cellular receptors. 7,8 From a structural point of view, the spike protein is an elongated homotrimer and added to the spherical nucleocapsid gives the virus the typical crown-like appearance. Each subunit of the spike protein is composed of two domains, S1 and S2 ( Figure 1A). 7,8 The S1 domain contains the Receptor-Binding Domain (RBD), which interacts closely with the receptor on the host cell, while the second subunit (S2) is responsible for membrane fusion, an event that allows effective access. The spike protein, however, must be processed before fulfilling its task. This occurs by involving the S1 and S2 subunits, which become the site of cleavages. 8 The first proteolytic cleavage is made at the border between S1 and S2 (S1/S2 site, Figure 1A) and involves a conformational change in the S2 domain separating the two subunits. This operation is made by furin during virus maturation. Another cleavage site, termed S2 0 , is located in the S2 domain ( Figure 1A) and allows the fusion of the viral and cellular membranes to facilitate the release of the N-coated RNA genome into the cytoplasm. 8 Once the spike protein is activated, it interacts with the cellular receptors of the target cell, specifically the angiotensinconverting enzyme 2 (ACE2), which is a carboxypeptidase of 805 amino acids that hydrolyses angiotensin II. 8 There are two ways for the virus to enter the cell, and depending on the route taken, the cleavage on S2 0 is made by a different protease. In one of these cases, the transmembrane prosthesis, serine 2 (TMPRSS2) protein is involved, and the cleavage occurs at the cell surface ( Figure 1B). 8 The cleavage causes the division of S1 and S2 and the exposure of the fusion peptide, which will be pushed toward the target membrane.
The fusion of the two membranes, viral and cellular, leads to the formation of a fusion pore through which the RNA is released into the cytoplasm. 8 TMPRSS2 belongs to the type 2 transmembrane serine protease (TTSP) family that is constituted by 19 surface-expressed trypsin-like serine proteases. 9 These proteins participate in the remodeling of the extracellular matrix, 10 in the proteolytic activation of F I G U R E 1 (A) SARS-CoV-2 schematic representation and detail of the spike protein. The proteolytic S1/S2 and S2 0 cleavage sites are highlighted. (B) Schematic representation of SARS-CoV-2 binding to the human ACE2 receptor and position in the membrane of TMPRSS2. (C) Domain organization in the TMPRSS2 sequence. (D) Starting TMPRSS2-membrane assembly. The protein skeleton is shown as a ribbon. The membrane is in light gray, while the protein ribbons are in red, blue, cyan and light green for THD, LDLRA, SRCR and SPD, respectively. The acylated Ser441* residue and the phenylguanidino acyl moiety are reported as spheres colored according the the atom type, while the nine structure stabilizing disulphide bonds are in sticks highlighted with a "X" character. (E) Detail of the phenylguanidino acyl moiety covalently bound to Ser441* (ball-and-stick) along with residues involved in the catalytic triad (Asp345 and His296) and interacting residues within 3.5 Å. Hydrogen atoms are ignored for clarity. membrane proteins 11 and in other epithelial homeostasis roles. 12 All the members of the TTSP family are disulphide-rich, experience complex post-translational regulation and proteolytic cleavage activation. 9 Cancers and tumor cell proliferation, invasiveness and metastasis causes a dysregulation in TTSP activity, [13][14][15] while basal activity levels of TMPRSS2 in normal tissues can be utilized by several viruses (such as influenza A and B viruses 16,17 or coronaviruses 18 ) for infection at the host cell membrane. Consequently, TMPRSS2 represents an interesting pharmacological target to prevent SARS-CoV-2 from infecting the human cells. 19,20 Human TMPRSS2 is composed of three domains ( Figure 1C based on a model of TMPRSS2 SPD (residues 256-491) 21 in complex with the hydrolytic product of two known protease inhibitors (namely Nafamostat and Camostat) was published. 22 The simulations, conducted in the tens of-hundreds of microsecond time scale, suggested that in the design of new covalent TMPRSS2 inhibitors one should consider the position of the catalytic serine (Ser441) with respect to Asp435. It was also noted that the phenylguanidino acyl moiety resulting from the inhibitors hydrolysis is slightly shorter than the size of the TMPRSS2 modeled cavity. It was also suggested that a putative new inhibitor should be size-compatible to a hydrophobic patch found in the catalytic pocket. 22 The x-ray crystal structure of the human TMPRSS2 ED has been recently released (PDB ID 7MEQ, residues 148-491, solved at 1.95 Å resolution) ( Figure 1D). 23 The structure reports the TMPRSS2 structure after its reaction with the potent nonselective trypsin-like serine protease inhibitor Nafamostat. 23 This inhibitor reacts with Ser441 and is cleaved into a phenylguanidino acyl moiety thar remains covalently bound to TMPRSS2 Ser441 (Ser441* hereafter) ( Figure 1E) and a molecule of 6-amidino-2-napthol. The phenylguanidino acyl moiety also forms a salt-bridge with Asp435 and H-bonds with Ser436 and Gly439 by using its guanidium group. The tertiary structure of the ED is also stabilized by nine disulphide bridges located in the ED (cysteine  21,22 there is no indication that the disulfide bonds should be considered in order to achieve a reliable model. Here we took advantage of the TMPRSS2 experimental x-ray structure, while the unsolved parts of TMPRSS2 were modeled through AlphaFold2 and made available through the AlphaFold Protein Structure Database. 24 AlphaFold2 is a recent machine learning algorithm that incorporates physical and biological knowledge about protein structure for the calculation of reliable structural models. 25 Despite the extremely good results obtained in the recent 14th Critical Assessment of protein Structure Prediction (CASP14), 26 AlphaFold2 still shows some problems in the modeling of membrane proteins such as the one at the center of this study. 27 In fact, while TMPRSS2 ED and THD are modeled with good accuracy, the CD is unfolded and localized in regions where the membrane should be or even outside the cell. For this reason, our study focuses only on THD and ED (residues 71-492) ( Figure 1). Here we report the results of atomistic molecular dynamics simulations in explicit solvent and membrane performed on the phenylguanidino acyl-bound TMPRSS2 (holo form hereafter) and the inhibitor-deprived protein (apo form). For each system, two 2 μs-long sets of simulations were run. Comparisons between the dynamics of the protein and the active site in the presence or absence of the inhibitor are then discussed with a view to developing new TMPRSS2-specific drugs.  To do so, the last 1.8 μs of each replica of each system were joined together and superimposed before being analyzed in order to distinguish between different conformational states. Figure 3 shows the F I G U R E 2 Calculated residue averaged Cα root mean square fluctuations (RMSF, A) and time evolution of the weighted Cα root mean square deviations (wRMSD, B) of TMPRSS2 in the apo (black) and holo (orange) form. The wRMSD distributions over the last 1.8 μs are also reported. Dashed lines represent data for replica simulations. In panel A the first and the last three residues were excluded from the RMSF calculation due to high fluctuations, while the inset shows the region (±10 residues) flanking Ser441* (arrow in magenta); while horizontal bars report the position of each domain in the sequence and are colored according to the domain coloration in Figure 1C. C3 is observed mainly in the holo system but also in the initial stages of the first replica of the apo protein. This is probably because the apo simulations started from the holo system depleted of the phenylguanidino acyl moiety. Then, the apo system took some time to relax to conformations typical of the real apo state and, at the beginning of one of the two simulations, it retained the initial holo conformation.
In order to highlight the differences between each cluster of the ED obtained through PCA, the centroid of each cluster was determined. Subsequently, the 1000 snapshots closest to the centroid structure were chosen for each cluster and the RMSF per residue was calculated on these four structural subsets (Figure 4). In general, in the clusters observed during simulations on the holo system, greater rigidity is observed compared to clusters referring to the apo system, together with a greater similarity between the cluster centroid and its immediate neighbors. Also, based on previous analyses, it is not surprising to note that in apo system there is greater conformational variability in the active site region. However, in C1 there is considerable flexibility in the LDLRA region. In particular, the LDLRA and the SRCR

| Volumetric and geometric analyses
As a next step, attention was paid to the consequences of the motions determined through the PCA of the ED. The cavities on the surface of the ED with a lifetime of at least 60% of the simulation time were determined and the results are shown in Figure 5A. In the apo form there is a pocket in the correspondence of the active site, as expected.
However, this pocket appears to be more flexible than other clefts on F I G U R E 4 Structural mapping of residue level fluctuations. The representative snapshot of each of the four ED most populated clusters is depicted in worm representation whose thickness is proportional to RMSF values estimated over 1000 closest snapshots to the respective centroids. The phenylguanidino acyl moiety is highlighted in yellow for cluster 3 and 4 that are mainly constituted by frames of holo system. the ED surface, that are stable for 75% of the simulation time, suggesting a large flexibility in the active site, which is in agreement with the previous analysis of the trajectories and a phenomenon also observed for other proteases. [29][30][31] From the cavity analysis, it is also possible to notice the formation of some small cavities in the holo form of the protein that are not present in the apo form plus a large cavity overhanging the active site and close to the phenylguanidino acyl moiety (cavity C1 in Figure 5A). The latter is formed by residues in the 251-261 loop (namely, Val257, Gly258, Glu260, and Ser261) as well as in 386-401 loop (Glu388, Glu398, Gly391, Lys392, Thr393, Ser394, Glu395, Leu397, and Asn398) as reported in Figure S8. In other words, the presence of the ligand appears to stabilize the binding pocket and consequently to release the movement of the 386-401 and 251-261 loops. Interestingly, cavity C1 is present in both replicas (see Figure S9), suggesting that this structure is very stable in the microsecond time scale. A second large cavity is present in both the apo and holo form but appears to be larger in the holo form (cavity C2 in Figure 5). This second cavity is located between the LDLRA/ SRCR and the SPD domains. In order to get an idea of the size of the cavity opening, the distance between selected residues in the center of the SRCR and the SPD domains was measured as a function of the simulation time ( Figure 5B). In the apo system, the two domains remain much closer (average distance ca. 33 Å) than in the holo system, where the average distance is ca. 35 Å. Interestingly, in the apo form the domains composing the ED appears to assume a single con- Finally, the correct characterization of the residues involved in the active site cavity is made complex due to conformational variability of the cavity itself. To compare the active site in the apo and holo form of TMPRSS2, the normalized solvent accessibility of the residues (N SASA ) identified within varying radius from the active site was calculated. Such an analysis considers the solvent exposed surface of all residues within an increasing cut-off value and the resulting area is then normalized over the number of residues within the cut-off. In the absence of a cavity, the N SASA value is constant, while it tends to grow F I G U R E 5 Volumetric and geometric analyses. Formation of clefts in protein was identified by volumetric analysis (A). Iso surfaces of clefts sustaining for 60 (yellow) to 75 (magenta) percent of the simulation time are depicted for apo (left) and holo (right) systems. In the holo system, the ligand is shown in purple and dashed blue arrows highlight additional clefts visible in the SPD (the largest cavity, C1, is discussed in the text). Red arrows indicate the elongation of the cleft between LDLRA/SRCR (cyan) and SPD (light green) domains (cavity C2), which is further characterized by distance between respective domain centroids (B) and cleft volume (C). (D) shows the normalized solvent accessibility of residues identified within varying radius from Ser441* for the apo (black) and holo (orange) systems.
with a trend that increases to an asymptote in the presence of a pocket. Such an analysis (Figures 5D and S9) shows that, in agreement with the previous analysis, the solvent exposed surface of the protein in the apo form is larger and more variable than in the holo form, where the presence of the phenylguanidino acyl moiety tightens and stabilizes the active site region.

| CONCLUSIONS
In conclusion, in this work we performed atomistic MD simulations on the most complete TMPRSS2 structure used so far. The inclusion of the THD also allows to study the motion of the extracellular parts of the proteins with respect to the membrane. The latter domain was absent in the already cited study by Hempel and co-workers performed prior to the release of the experimental structure of the pro-  Figure S12). This result could be used in designing new specific TMPRSS2 inhibitors. Indeed, one of the biggest challenges in designing new and specific protease inhibitors is the high similarity among the active sites of this family of proteins. Because of this, inhibitors are often able to inhibit the activity of more than a single protease. Indeed, it has been reported that the 115 annotated human protease inhibitors are responsible for regulating the activity of the 612 known human proteases. 32 It is likely that these numbers will change as protease and inhibitor families become more refined.
However, the ratio of approximately one protease inhibitor to five proteases is likely to remain constant. One potential solution for optimizing the specificity of protease inhibitors, which has been explored extensively in recent years, is to identify new sites in the protein that can be used as potential targets for the designed inhibitors, such as

| Modeling and simulation
The x-ray structure of human TMPRSS2 (PDB ID: 7MEQ) 23 is resolved only for the extracellular region (residues 148-491) in complex with a covalently bound inhibitor fragment (phenylguanidino acyl deriving from Nafamostat hydrolysis) and contains several unsolved segments.
The starting structure for the present simulations was obtained from the AlphaFold (AF) Structure Database (entry O15393) 24 and terminal regions with low prediction accuracy were removed. Thus, the starting model included residues Ser71-Gly492 ( Figure 1D). For the extracellular region, the root mean square deviation (RMSD) between the x-ray and AlphaFold2 models was 1.4 Å over 325 Cα atom pairs, indicating that the latter model closely represents the holo state. In particular, the model of the SPD is in excellent agreement with the experimental structure (Cα RMSD ca. 0.4 Å, Figure S13). The phenylguanidino acyl moiety was covalently incorporated into the starting model after being superimposed on the x-ray structure to maintain its bound conformation and chemical environment ( Figure 1E), while Ser441 was left unmodified in the apo system. The starting models were embedded in a membrane bilayer of 600 lipid molecules consisting of POPE and POPG in a 7:3 ratio using the CHARMM-GUI server. [33][34][35][36][37] The assembly was immersed in a TIP3P 38 water box extending up to 15 Å from the protein surface. To achieve electroneutrality, the system was neutralized, and physiological ionic strength was mimicked through the addition of 0.1 M KCl. Ligand partial charges were obtained by fitting the electrostatic potential derived at the HF/6-31G* theory of level with Gaussian16 39 via the restrained electrostatic potential (RESP) procedure 40 using R.E.D. tools. 41 Ligand parameters were adapted from the Amber ff10 forcefield following best practices explained elsewhere. 42,43 The minimization, equilibration and production runs were performed using GROMACS software (v2021.5), [44][45][46] with the Amber ff14SB force field for the protein 47 and Lipid14 for the lipids. 48 Systems were energy-minimized using the steepest-descent algorithm for 5000 steps and then equilibrated with an NVT simulation using the v-rescale thermostat 49 at 303.15 K for 100 ps. This was followed by an NPT equilibration at 1 atm pressure using the Parrinello-Rahman barostat 50 and at 303.15 K using the Nose-Hoover thermostat 51 for a further 1 ns. Finally, 2 μs NPT production runs were performed for both systems, including two replicate runs which were initiated from the same equilibration configurations but with different random velocity distributions. In all simulations, a time step of 2 fs was employed and the LINCS algorithm 52 was used to impose constraints on the hydrogen-containing bonds. Periodic boundary conditions (PBC) were applied, and the Particle Mesh Ewald (PME) method 53 was used to calculate long-range electrostatic interactions. The equilibration and production runs were executed on the CINECA MAR-CONI100 and the ENI HPC5 clusters. These supercomputers have similar architectures consisting of nodes equipped with 4Â NVIDIA V100 GPUs per node so the optimal setup for both machines, with this version of GROMACS, was to use 1 node per run, resulting in performances of about 170 ns/day for the holo systems and 190 ns/ day for the apo runs.

| Trajectory analysis
The simulated trajectories were analyzed with the MDAnalysis, 54 MDTraj 1.9.8 55 and mdciao 0.0.5 56 Python packages, using in-house Jupyter notebooks. 57 Plots were generated with matplotlib and seaborn libraries. UCSF Chimera 1.16 58 was used for molecular figures.
Analyses requiring structural alignments were performed based on Cα coordinates using equilibrated snapshots as the reference structure.
All analyses except RMSD were performed on a single trajectory obtained by combining the final 1.8 μs sampling of replicate runs, allowing statistics of analyzed observables to be obtained over 3.6 μs of data. Weighted RMSD (wRMSD) was estimated using a Gaussian weighting scheme. 28 The method involves an iterative fit, where the weights for the atoms used in the alignment (we used Cα atoms) are assigned with a Gaussian function, defined as: where r i is the distance of the ith atom between the two aligned conformations, and C is an arbitrary scaling factor. At each iteration, the weights are updated and the fitting proceeds until convergence in