Evolutionary analysis of SARS‐CoV‐2 spike protein for its different clades

Abstract The spike protein of severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) has become the main target for antiviral and vaccine development. Despite its relevance, information about its evolutionary traces is scarce. The aim of this study was to investigate the diversification patterns of the spike for each clade of SARS‐CoV‐2 through different approaches. Two thousand and one hundred sequences representing the seven clades of SARS‐CoV‐2 were included. Patterns of genetic diversification and the nucleotide evolutionary rate were estimated for the spike genomic region. The haplotype networks showed a star shape, where multiple haplotypes with few nucleotide differences diverge from a common ancestor. Four hundred seventy‐nine different haplotypes were defined in the seven analyzed clades. The main haplotype, named Hap‐1, was the most frequent for clades G (54%), GH (54%), and GR (56%), and a different haplotype (named Hap‐252) was the most frequent for clades L (63.3%), O (39.7%), S (51.7%), and V (70%). The evolutionary rate for the spike protein was estimated as 1.08 × 10⁻³ nucleotide substitutions/site/year. Moreover, the nucleotide evolutionary rate after nine months of the pandemic was similar for each clade. In conclusion, the present evolutionary analysis is relevant because the spike protein of SARS‐CoV‐2 is the target for most therapeutic candidates; in addition, changes in this protein could have consequences for viral transmission, response to antivirals, and efficacy of vaccines. Moreover, the evolutionary characterization of clades improves knowledge of SARS‐CoV‐2 and deserves to be assessed in more detail, as re‐infection by different phylogenetic clades has been reported.


| Phylogenetic and genetic characterization
Patterns of genetic diversification for both genomic regions (S and RBD) were analyzed for each clade using the median-joining reconstruction method in PopART v1.7.2. 16 Haplotypes shared among all clades were analyzed in Arlequin v3.5.2.2. 17 Polymorphism indices were calculated separately for each clade with DnaSP v6.12.01. 18
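As an illustration of the polymorphism indices reported for each clade, the following minimal sketch computes haplotype diversity (Hd) and nucleotide diversity (π) from a toy alignment using the standard definitions of Nei. It is not the DnaSP implementation: the function names and the toy sequences are hypothetical, and the sketch assumes gap-free, equal-length aligned sequences.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)  # identical sequences form one haplotype
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(seqs):
    """pi = mean number of pairwise differences per site."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) / 2
    return total_diffs / (n_pairs * length)


# Hypothetical toy alignment (not data from the study).
toy = ["ACGT", "ACGT", "ACGA", "TCGT"]
print(haplotype_diversity(toy), nucleotide_diversity(toy))
```

A combination of moderate-to-high Hd with low π, as reported in the results, corresponds to many distinct haplotypes that each differ from the others at only a few sites.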

| Nucleotide evolutionary rate
The nucleotide evolutionary rate for the entire S-coding region datasets was estimated with the BEAST v1.8.4 program package 19 at the CIPRES Science Gateway server. 20 The temporal calibration was established from the sampling date of each sample. The best nucleotide substitution model was selected according to the Bayesian information criterion in IQ-TREE v1.6.12. 21 The analysis was performed under a relaxed (uncorrelated lognormal) molecular clock model, as recommended previously by Duchene et al., 22 with an exponential demographic model. 23 Analyses were run for 8 × 10⁶ generations and sampled every 8 × 10⁵ steps. The convergence of the "meanRate" and "allMus" parameters (effective sample size [ESS] ≥ 200, burn-in 10%) was verified with Tracer v1.7.1. 24 The obtained substitution rate was tested against 10 independent replicates of the analysis in which the time calibration information (date of sampling) was randomized, as described by Rieux and Khatchikian. 25

| RESULTS

| Datasets
Three hundred sequences were randomly selected for each clade, so that a total of 2100 sequences were curated and included in the analysis. Table 1 shows the SARS-CoV-2 sequences included for every month and clade.
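The per-clade selection can be sketched as a simple random draw of a fixed number of sequences from each clade. This is only an illustration under stated assumptions: the function name, the record layout, and the identifiers are hypothetical, and the sketch does not reproduce the curation steps or the month-by-month temporal structure summarized in Table 1.

```python
import random
from collections import defaultdict


def subsample_per_clade(records, per_clade=300, seed=0):
    """Randomly pick `per_clade` sequence IDs from each clade.

    `records` is an iterable of (sequence_id, clade) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_clade = defaultdict(list)
    for seq_id, clade in records:
        by_clade[clade].append(seq_id)

    selected = {}
    for clade, ids in sorted(by_clade.items()):
        if len(ids) < per_clade:
            raise ValueError(f"clade {clade} has only {len(ids)} sequences")
        selected[clade] = rng.sample(ids, per_clade)  # without replacement
    return selected


# Hypothetical identifiers: 400 candidate sequences for each of the 7 clades.
clades = ["G", "GH", "GR", "L", "O", "S", "V"]
records = [(f"{c}_{i}", c) for c in clades for i in range(400)]
selected = subsample_per_clade(records)
print(sum(len(ids) for ids in selected.values()))
```

With seven clades and 300 sequences per clade, the draw yields the 2100 sequences mentioned above.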

| Phylogenetic and genetic characterization
TABLE 1 Number of SARS-CoV-2 sequences from the GISAID database in September 2020, by month and clade, as per the selection criteria (temporal structure)

The haplotype networks (Figure 1) reflect the diversity index results: they show a star shape, with multiple haplotypes that diverge from a common ancestor by a few nucleotide differences. Table 3 shows the frequency of each haplotype with amino acid changes.
The haplotype diversity was moderate to high in every clade, ranging from Hd = 0.507 to 0.793 (Table 2). In contrast, nucleotide diversity was relatively low for each clade, ranging between π = 0.0018 for V and π = 0.0040 for O [...] (Table 4). A date-randomization analysis showed no overlap between the 95% HPD substitution-rate intervals obtained from the real data and from the date-randomized datasets for all clades (Figure 2).
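The logic of the date-randomization test can be sketched in two steps: shuffle the sampling dates among the sequences to build randomized replicates, then check that the 95% HPD interval of the rate from the real data does not overlap any interval from the randomized replicates. The sketch below is an assumption-laden illustration, not the authors' pipeline: the function names are hypothetical, the rate estimation itself (done in BEAST) is not reproduced, and all numeric values are illustrative placeholders rather than results from the study.

```python
import random


def shuffle_sampling_dates(sampling_dates, n_replicates=10, seed=0):
    """Return replicates in which dates are permuted among sequence IDs."""
    rng = random.Random(seed)
    ids = list(sampling_dates)
    dates = [sampling_dates[i] for i in ids]
    replicates = []
    for _ in range(n_replicates):
        shuffled = dates[:]
        rng.shuffle(shuffled)  # same dates, reassigned at random
        replicates.append(dict(zip(ids, shuffled)))
    return replicates


def intervals_overlap(a, b):
    """True if closed intervals a = (low, high) and b = (low, high) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def has_temporal_signal(real_hpd, randomized_hpds):
    """True if the real 95% HPD overlaps none of the randomized ones."""
    return not any(intervals_overlap(real_hpd, r) for r in randomized_hpds)


# Placeholder inputs (decimal years and substitutions/site/year).
toy_dates = {"seq1": 2020.05, "seq2": 2020.21, "seq3": 2020.48, "seq4": 2020.66}
replicates = shuffle_sampling_dates(toy_dates)
real_hpd = (8.0e-4, 1.4e-3)
randomized_hpds = [(1.0e-5, 2.0e-4), (3.0e-5, 5.0e-4)]
print(has_temporal_signal(real_hpd, randomized_hpds))
```

Each randomized replicate would be analyzed with the same model settings as the real data set; only the assignment of dates to sequences changes.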

| Nucleotide evolutionary rate
The data set for the clade L did not reach convergence (ESS < 200).
To verify the reliability of the result, 10 independent runs were performed. All of them converged in a similar posterior distribution.
Likewise, for many of the random sample datasets, convergence was not achieved (ESS between 100 and 200). For those datasets that did not reach convergence, two independent runs were carried out and concatenated. 26 When the evolutionary rate was analyzed according to the emergence of each clade, founding clades (L, O, S, and V) tended to present slightly lower evolutionary rates than the more recent clades (G, GH, and GR), although the difference was not statistically significant (p = .157).
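Concatenating two independent runs, as described above for the datasets that did not converge, amounts to pooling the posterior samples of both chains. A minimal sketch follows; it assumes that the 10% burn-in stated in the methods is discarded from each run before pooling, which the text does not state explicitly, and the function name and sample values are hypothetical.

```python
def combine_runs(runs, burnin_fraction=0.10):
    """Drop the initial burn-in of each run and pool the remaining samples."""
    combined = []
    for samples in runs:
        n_burn = int(len(samples) * burnin_fraction)
        combined.extend(samples[n_burn:])
    return combined


# Placeholder posterior samples of a rate parameter from two runs.
run_a = [1.00e-3 + i * 1e-6 for i in range(10)]
run_b = [1.05e-3 + i * 1e-6 for i in range(10)]
pooled = combine_runs([run_a, run_b])
print(len(pooled))
```

The ESS of the pooled sample would then be re-checked against the convergence threshold.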

| DISCUSSION
The evolutionary characterization of the spike genomic region of SARS-CoV-2 is crucial to estimate the course that re-infections, vaccines, and therapeutics may take in the future of the pandemic. In this study, the evolutionary rate of the most important SARS-CoV-2 protein for vaccine development was estimated both overall and separately for each genetic clade described in GISAID. In this context, the spike haplotype network showed a central founding parental haplogroup from which multiple sequences with modest changes derived. Overall, the nucleotide evolutionary rate after 9 months of the pandemic was similar for each clade.
At the beginning of the pandemic, the most prevalent clades were L, O, V, and S. Later, with the appearance of the D614G mutation in the S protein, clade G emerged and maintained a high and stable prevalence. Subsequently, the GR clade emerged and grew until it became the most prevalent. Finally, the GH clade peaked at 30% in May 2020 and then began to decrease. 3 In this sense, it is important to highlight that clades with the D614G mutation in the S protein (clades G, GH, and GR) have been suggested to present a higher transmission efficiency, although they would not be associated with more severe pathogenesis. 27 Therefore, to describe the evolution of the S protein variants, the haplotype networks were studied in all seven clades and for both regions (S and RBD alone). This analysis showed several identical sequences grouped together, resulting in a star-shaped network, which is characteristic of viral outbreaks. 28 [...] were associated with the binding affinity of RBD. 30,31 Additionally, the mutation L5F in the signal peptide was present in 3.3% of the members of clade V. 27 Other changes associated with relevant [...]
The evolutionary characterization of the wide spectrum of haplotypes contributes to determining the significance of each haplotype and its association with disease severity, response to antivirals, development of vaccines, and host genetic factors.
FIGURE 2 Test of temporal structure. Comparison of the evolutionary rates estimated for the original data set versus the date-randomized ones. This analysis was performed for the spike-coding region (3822 nt) of each clade. s.s.y, substitutions/site/year

The evolutionary rate of the S protein estimated for all clades together was significantly higher than that previously reported by analyzing the entire genome. 14,28 This is expected as the complete genome includes several genomic regions with a high degree of