SARS‐CoV‐2, an evolutionary perspective of interaction with human ACE2 reveals undiscovered amino acids necessary for complex stability

Abstract The emergence of SARS‐CoV‐2 has resulted in nearly 1,280,000 infections and 73,000 deaths globally so far. This novel virus acquired the ability to infect human cells using the SARS‐CoV cell receptor hACE2. Because of this, it is essential to improve our understanding of the evolutionary dynamics surrounding the SARS‐CoV‐2 hACE2 interaction. One way theory predicts selection pressures should shape viral evolution is to enhance binding with host cells. We first assessed evolutionary dynamics in select betacoronavirus spike protein genes to predict whether these genomic regions are under directional or purifying selection between divergent viral lineages, at various scales of relatedness. With this analysis, we determine a region inside the receptor‐binding domain with putative sites under positive selection interspersed among highly conserved sites, which are implicated in structural stability of the viral spike protein and its union with human receptor ACE2. Next, to gain further insights into factors associated with recognition of the human host receptor, we performed modeling studies of five different betacoronaviruses and their potential binding to hACE2. Modeling results indicate that interfering with the salt bridges at hot spot 353 could be an effective strategy for inhibiting binding, and hence for the prevention of SARS‐CoV‐2 infections. We also propose that a glycine residue at the receptor‐binding domain of the spike glycoprotein can have a critical role in permitting bat SARS‐related coronaviruses to infect human cells.


| INTRODUC TI ON
The recent emergence of the novel SARS coronavirus 2 (SARS-CoV-2) marked the third introduction of a highly pathogenic coronavirus into the human population in the twenty-first century, following the severe acute respiratory syndrome coronavirus (SARS-CoV) (Drosten et al., 2003;WHO, 2004). MERS-CoV was the second emergence and was first detected in Saudi Arabia in 2012 and resulted in nearly 2,500 human infections and 858 deaths in 27 countries (Fehr, Channappanavar, & Perlman, 2017;Zaki, Boheemen, Bestebroer, Osterhaus, & Fouchier, 2012). In December 2019, SARS-CoV-2, a previously unknown coronavirus capable of infecting humans was discovered in the Chinese city of Wuhan, in the Hubei Province (Huang et al., 2020;Zhu et al., 2020). SARS-CoV-2 is associated with an ongoing pandemic of atypical pneumonia, now termed coronavirus disease 19 (Covid-2019) that has affected over 1,280,000 people with 72,614 fatalities as of April 7, 2020(WHO, 2020. Both SARS-CoV and MERS-CoV are thought to have originated in colonies of bats, eventually transmitted to humans, putatively facilitated by intermediate hosts such as palm civets and dromedary camels, respectively (Cui, Li, & Shi, 2019). The genome of SARS-CoV-2 shares about 80% nucleotide identity with that of SARS-CoV and is 96% identical to the bat coronavirus BatCoV RaTG13 genome, reinforcing the probable bat origin of the virus . However, better assessing the evolutionary dynamics of SARS-CoV-2 is an active research priority worldwide.

SARS-CoV, MERS-CoV, and SARS-CoV-2 belong to the genus
Betacoronavirus within the subfamily Coronavirinae of the family Coronaviridae. Members of this family are enveloped viruses containing a single positive-strand RNA genome of 27-32 kb in length, the largest known RNA virus genome. The coronavirus spherical virion consists of four structural proteins: the spike glycoprotein (S-protein), the envelope protein, membrane protein, and nucleocapsid. The transmembrane trimeric S-protein plays a critical role in virus entry into host cells (Gallagher & Buchmeier, 2001;Tortorici & Veesler, 2019). It comprises two functional subunits: S 1 subunit, where the receptor-binding domain (RBD) is found, is responsible for binding host cell surface receptors and S 2 subunit mediates subsequent fusion between the viral and cellular membranes (Kirchdoerfer et al., 2016;Yuan et al., 2017). Both SARS-CoV and SARS-CoV-2 interact directly with angiotensin-converting enzyme 2 (ACE2) to enter host target cells (Hoffmann et al., 2020;Li et al., 2003;Walls et al., 2020;Yan et al., 2020). In the case of SARS-CoV, ACE2 binding was found to be a critical determinant for the range of hosts the virus can infect, and key amino acid residues in the RBD were identified to be essential for ACE2-mediated SARS-CoV infection and adaptation to humans (Li et al., 2005(Li et al., , 2006. Understanding the dynamics that permits a virus to shift hosts is of considerable interest and can further be an essential preliminary step toward facilitating the development of vaccines and the discovery of specific drug therapies. We employ a multidisciplinary approach to look for evidence of diversifying selection on the S-protein gene and model the interactions between human ACE2 (hACE2) and the RBD of selected coronavirus strains, which ultimately afforded us novel insights detailing virus and host cell interactions. Given the rapid pace of discovery, we aim to add clarity to evolutionary dynamics of diseases strains by more precisely understand the dynamics at the S-protein and its interaction with hACE2.

| Phylogenetic reconstruction and analysis testing for evidence of positive/purifying selection in the coronavirus S-protein region
The most similar genomes to SARS-CoV-2 MN908947 were retrieved using BLASTp (Altschul et al., 1997) versus the NR database of GenBank (Table 1). Genomes were then aligned using MAUVE (Darling, Mau, Blattner, & Perna, 2004), and the S-protein gene was trimmed. The extracted genomic sections were aligned using the translation align option of Geneious (Kearse et al., 2012) with a MAFFT plugin (Katoh & Standley, 2013). The phylogenetic reconstruction of S-protein genes was performed with PhyML (Guindon et al., 2010), using a GTR + I + G model, using with 100 nonparametric bootstrap replicates. Both, the alignment and the tree were used as inputs for PAML Codeml (Yang, 2007). The presence of sites under positive selection was tested by the comparison of M2 model (which it allows for a proportion of positive, neutral, and negative selection sites in the alignment) versus the M1 model (it which only allows a proportion of neutral and negative selection sites in the alignment) and M8 (ω follows a beta distribution plus a proportion of sites with ω > 1) versus M7 (ω follows a beta distribution) models using the ETE toolkit 3.0 (Huerta-Cepas,  (Weaver et al., 2018) was used to perform the HyPhy analyses.

| Assessing molecular structure and hostspecific interactions
The crystal structure of the SARS-CoV S-protein RBD (GeneBank ID NC_004718) in complex with hACE2 was retrieved from the Protein Data Bank (code 2AJF) (Berman et al., 2000). Homology models were constructed using this structure as template for the RBDs of SARS-CoV-2 (SARS2, GeneBank ID MN908947), the Bat SARS-like coronavirus isolate Rm1 (Rm1, GeneBank ID DQ412043), and the Bat SARS-like coronavirus isolate Rs4231 (Rs4231, GeneBank ID KY417146). One additional homology model for the G496D mutant of the SARS-CoV-2 RBD (SARS2-MUT) was constructed. Homology models were built with Modeller v. 9 (Webb & Sali, 2016) using its UCSF Chimera interface (Pettersen et al., 2004). Five models were constructed for each target sequence and the one with the lowest Discrete Optimized Protein Energy (DOPE) score was selected for the final model.
All nonamino acidic residues were removed from the SARS-CoV RBD-hACE2 complex to obtain a clean complex. The homology models of the SARS2, Rm1, Rs4231 RBDs, and SARS2-MUT were superimposed into the SARS-CoV RBD to obtain their initial complexes with hACE2. These complexes were then subject to molecular dynamics (MD) simulations and estimation of their free energies of binding using Amber 18 (Case et al., 2018). For the later, ACE2 was considered as the receptor and the RBDs as ligands. The protocol described below was employed for all complexes and otherwise noted default software parameters were employed.
Systems preparation was performed with the tleap program of the Amber 18 suite. Each complex was enclosed in a truncated octahedron box extending 10 Å from any atom. Next, the boxes were solvated with TIP3P water molecules and Na+ ions were added to neutralize the excess charge. Systems were minimized in two steps, the first of which consisted in 500 steps of the steepest descent algorithm followed by 500 cycles of conjugate gradient with protein atoms restrained using a force constant of 500 kcal/mol.Å 2 .
The PME method with a cutoff of 12 Å was used to treat long-range electrostatic interactions. During the second minimization step, the PME cutoff was set to 10 Å and it proceeded for 1,500 steps of the steepest descent algorithm followed by 1,000 cycles of conjugate gradient with no restrains. The same PME cutoff of 10 Å was used in all simulation steps from here on. Both minimization stages were performed at constant volume.
The minimized systems were heated from 0 to 300 K at constant volume constraining all protein atoms with a force constant of 10 kcal/mol.Å 2 . The SHAKE algorithm was used to constrain all bonds involving hydrogens and their interactions were omitted from this step on. Heating took place for 10,000 steps, with a time step of 2 fs and a Langevin thermostat with a collision frequency of 1.0 ps −1 was employed. All subsequent MD steps utilized the same TA B L E 1 List of coronavirus isolates used for positive selection analysis (closer dataset) thermostat settings. Afterward, the systems were equilibrated for 100 ps at a constant temperature of 300 K and a constant pressure of 1 bar. Pressure was controlled with isotropic position scaling with a relaxation time of 2 ps. The equilibrated systems were used as input for 10 ns length production MD simulations.
The free energies of binding were computed under the MM-PBSA approach implemented in AmberTools 18 (Case et al., 2018).
A total of 100 MD snapshots were evenly selected, one every 50 ps, from the last 5 ns of the production run for MM-PBSA calculations.
The ionic strength was set to 100 mM and the solute dielectric factor was set to 4 for all systems.

| Selective constraints over spike protein in betacoronaviruses
In order to detect branches and sites under positive/negative selection, two datasets were explored. The first ("closer" dataset) harbors the most similar genomes to Wuhan-Hu-1 coronavirus (SARS-CoV-2) (MN908947). For this dataset, several genomes were excluded from the analysis because they showed minimal variation to other sequences. We used a preliminary phylogeny to select a representative isolate of each clade (Table 1)  In both datasets, we observed evidence of purifying selection in the majority of nodes of the tree. Specifically, in the "closer" dataset, we identified 38 nodes with evidence of negative selection, and 4 under positive selection when free ratios model of CODEML model was applied. To confirm the four nodes under positive selection, we use LTR test for contrasting hypothesis using branch free, branch neutral, and M0 models of CODEML. Using these approximations, any node predicted by free ratios model with ω > 1 was significantly different to the purifying (ω < 1) or neutral (ω = 1) models. An equivalent analysis was performed using aBSREL of HyPhy, observing episodic diversifying selection in at least 8 of 41 nodes of the phylogenetic tree reconstructed with the "closer" dataset ( Figure 1).
Interestingly, one of the divisions detected with diversifying selection was the branch that contains SARS-CoV-2, Pangolin coronavirus isolate MP789 and bat coronavirus RaTG13 (called SARS-CoV-2 group) but not the specific branch that contains SARS-CoV-2. and 500 (using SARS-COV-2 S-protein as a reference) have drastic amino acid changes for alpha-helical tendencies. In addition, the section between residues 448 and 485 shows radical changes in amino acids implicated in the equilibrium constant (ionization of COOH). In the structural analysis we performed, the section between 472 and 486 forms a loop that is not present in certain S-proteins of coronavirus isolated in bats. This loop extends the interaction area between RBD of S-protein and human ACE2; in fact, the lack of this loop decreases the negative energy of interaction (increasing the binding) among these two molecules (see Table 2).
These results, obtained from independent analyses, strongly highlight the importance of 439-508 region. Additionally, important hACE2-binding residues in the RBD from SARS-COV-2 obtained from the crystallography and structure determination performed by Shang et al. (2020) are also present in the section we highlight here. We propose that this region is the most probable to contain the sites under positive selection due to predictions by our CODEML and FUBAR models. In that sense, we refer to this section as region under positive selection (RPS). It is important to additionally clarify that even inside the RPS we found at least 20 aa highly conserved between coronaviruses, several of them are predicted as sites under purifying selection. This shows that it is necessary to maintain conserved sites which are located around polymorphic sites, probably to maintain the protein structure and at the same time to have the ability to colonize more than one host.

| Structural analysis of spike protein and human ACE2
With a list of broader observations related to the role of selection across viral genomes, we aimed to specifically understand how these regions could affect virus/host interactions. To understand more in more detail the importance of RPS in the evolution of SARS-COV-2, we quantified the relative importance of this region in the interaction between the RBDs and hACE2. In that sense, MD simulations were run for five complexes (listed in methods). Simulations were initially performed for four RBDs corresponding to the SARS2, SARS, Rs4231 and Rm1 coronaviruses. As discussed below in this section, the G496D mutation is predicted to have a large negative influence in the stability of the RBD-hACE2 complexes. To further clarify this influence, we added as fifth system a G496D SARS2 mutant RBD-hACE2 complex in our studies.
In all cases, the systems were stable with root mean square deviations (RMSD) of their backbones between 1.11 and 3.30 Å relative  The contribution of each residue in the studied coronaviruses that interact with the hACE2 receptor are shown in Table 3. Rows are presented in such a way that each of them contains the residues occupying the same position in the viral RBDs structures as in the SAR2 RBD structure. From here on, residues numeration will take that of SARS2 as reference. To better interpret the influence of the key interactions between the studied coronaviruses RBDs and the hACE2 receptor, their complexes were visually analyzed. The predicted RBD-hACE2 complexes for SARS2, SARS, and SARS2-MUT are depicted in Figure 3. For each complex, the structure used to create Figure 3 is the representative one of the largest cluster formed by the MD snapshots previously used in MM-PBSA calculations.
Many studies have focused on coronaviruses mutations that favor adaptations for infecting human host infections. For F I G U R E 3 Predicted interaction of SARS2 (top), SARS2-MUT (middle) and SARS (bottom) with the human ACE2 receptor. hACE2 in depicted in gray and the RBD of coronaviruses in cyan. Oxygen atoms are colored red and nitrogen blue. The numbering of the residues corresponds to that of the sequence of each spike glycoprotein example, it has been shown that specific substitutions at positions 455, 486, 4983, 494, and 501 (442, 472, 479, 480, and 487 in SARS) of the RBD of SARS favors the interaction between the RBD of SARS and hACE2 (Cui et al., 2019). Likewise, homology modeling studies found favorable interactions between the residues occupying these positions in the SARS2 RBD and the human receptor (Wan, Shang, Graham, Baric, & Li, 2020). The cornerstone of these favorable interactions is the complementarity of the RBDs with hot spots 31 and 353. These are salt bridges between K31 and E35 and between D38 and K353 of ACE2 which are buried in a hydrophobic environment (see Figure 2). In the cases of SARS2  Table 3) respectively, do not interact with any hot spot residue. Instead, they interact with D30 of hACE2 in the SARS2 complex and with E329 of the human receptor in the SARS complex. This could indicate that interactions additional to those previously identified with the hACE2 hotspots could be critical for the stabilization of the RDB-human receptor complexes. Finally, we analyzed the possible reasons for the predicted negative impact that the G496D mutation has on the predicted free energies of binding of the RBD to hACE2. As depicted in Figure 3, G496 directly interacts with K353 in hot spot 353 and its mutation interferes with the D38-K353 salt bridge. Specifically, D496 of the RDB point to D38 of hACE yields a high electric repulsion between these amino acids.
Consequently, this portion of the RBD is pushed to a position further from hACE2 than that observed in the wild type receptor, resulting in the reduction of its network of contacts with K353. As a result, the binding of the RBD to hACE2 is considerably inhibited and unlikely to occur. propose that blocking its interaction with the receptor D30 could be a promising strategy for future drug discovery efforts.

ACK N OWLED G M ENTS
The authors would like to thank Daniela Santander for her valuable comments on the manuscript. This work was supported by the

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The phylogenetic tree presented in this manuscript was uploaded to DRYAD (https://doi.org/10.5061/dryad.w3r22 80mp). The full networks of interactions between the coronaviruses and the hACE2 receptor are provided as Supporting Information.