Phylogenetic analysis of SARS‐CoV‐2 in the first few months since its emergence

Abstract During the first few months of severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) evolution in a new host, contrasting hypotheses have been proposed about the way the virus has evolved and diversified worldwide. The aim of this study was to perform a comprehensive evolutionary analysis to describe the human outbreak and the evolutionary rate of different genomic regions of SARS‐CoV‐2. The molecular evolution of nine genomic regions of SARS‐CoV‐2 was analyzed using three different approaches: phylogenetic signal assessment, emergence of amino acid substitutions, and Bayesian evolutionary rate estimation in eight successive fortnights since the virus emergence. All observed phylogenetic signals were very low, and tree topologies were in agreement with those signals. However, after 4 months of evolution, it was possible to identify regions revealing an incipient viral lineage formation, despite the low phylogenetic signal since fortnight 3. Finally, the SARS‐CoV‐2 evolutionary rate for regions nsp3 and S, the ones presenting greater variability, was estimated as 1.37 × 10⁻³ and 2.19 × 10⁻³ substitutions/site/year, respectively. In conclusion, the results of this study on the variable diversity of crucial viral regions and the determination of the evolutionary rate are decisive to understand essential features of viral emergence. In turn, the findings may allow the first‐time characterization of the evolutionary rate of the S protein, which is crucial for vaccine development.

its interspecies jump. 10 Most studies published to date have characterized the viral genome and its evolution by analyzing complete genome sequences. [11][12][13][14] Despite this, the viral genomic region providing the most accurate information to characterize SARS-CoV-2 has not yet been established. This lack of information hinders the investigation of its molecular evolution and the monitoring of its biological features, affecting the development of antiviral drugs and vaccines. Therefore, the aim of this study was to perform a comprehensive viral evolutionary analysis to describe the human outbreak and the molecular evolutionary rate of different genomic regions of SARS-CoV-2.

| Phylogenetic signal
To determine the phylogenetic signal of each of the nine generated alignments, Likelihood Mapping analyses were carried out, 15

| Bayesian coalescence and phylogenetic analysis
To study the relationship among SARS-CoV-2 sequences, nine regions of the viral genome were investigated by Bayesian analyses.

| Evolutionary rate
The nucleotide evolutionary rate was estimated with the BEAST v1.10.4 program package. 20 Analyses were run at the CIPRES Science Gateway server. 21 19 Additionally, to verify the obtained results, 15 independent replicates of the analysis were performed with the time calibration information (date of sampling) randomized, as described by Rieux and Khatchikian. 24 Finally, the parameters obtained for the real data and for the randomized replicates were compared.
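The date-randomization step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the acceptance criterion (the 95% highest posterior density interval of the real rate overlapping none of the randomized ones) is an assumption, since the text only states that the parameters were compared.

```python
import random


def randomize_dates(dates, n_replicates=15, seed=0):
    """Return n_replicates copies of the sampling dates, each shuffled
    among the sequences so that the temporal signal is destroyed."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        shuffled = list(dates)
        rng.shuffle(shuffled)
        replicates.append(shuffled)
    return replicates


def intervals_overlap(a, b):
    """True if the closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def passes_randomization_test(real_hpd, randomized_hpds):
    """Assumed criterion: the rate interval estimated from the real dates
    overlaps none of the intervals estimated from the randomized dates."""
    return not any(intervals_overlap(real_hpd, r) for r in randomized_hpds)
```

Each shuffled date list would be used to run a separate rate estimation; only the resulting intervals are compared here.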

| Phylogenetic signal
A phylogenetic signal study was carried out with bioinformatics tools to identify the most informative SARS-CoV-2 genomic regions.
The likelihood mapping analysis showed that most genes have a very poor phylogenetic signal, with high values in the central region, which represents the area of unresolved quartets (Figure 1). Accordingly, genes could be separated into three groups: the first group with little or no phylogenetic signal (E, Orf6, Orf8, nsp1, and nsp14), the second group with a low phylogenetic signal (Orf3a and N), and the last group with a relatively stronger phylogenetic signal (S and nsp3), although still too low to be considered robust (unresolved quartets >40%).
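The way a likelihood map yields a percentage of unresolved quartets can be illustrated with a short sketch. It assumes the usual formulation in which the three likelihoods of a quartet's possible topologies are normalized to weights that sum to 1, and each quartet is assigned to the nearest of seven attractor points (three corners, three edge midpoints, and the center). The names below are hypothetical, and the exact region boundaries used by the authors' software are not stated in the text.

```python
import math

# Attractor points of the likelihood-mapping triangle (assumed partition).
ATTRACTORS = {
    "resolved_1": (1.0, 0.0, 0.0),
    "resolved_2": (0.0, 1.0, 0.0),
    "resolved_3": (0.0, 0.0, 1.0),
    "partly_12": (0.5, 0.5, 0.0),
    "partly_13": (0.5, 0.0, 0.5),
    "partly_23": (0.0, 0.5, 0.5),
    "unresolved": (1 / 3, 1 / 3, 1 / 3),
}


def posterior_weights(logliks):
    """Normalize the three topology log-likelihoods of one quartet."""
    top = max(logliks)
    weights = [math.exp(value - top) for value in logliks]
    total = sum(weights)
    return tuple(w / total for w in weights)


def classify_quartet(logliks):
    """Assign a quartet to the region of its nearest attractor point."""
    p = posterior_weights(logliks)

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(p, ATTRACTORS[name]))

    return min(ATTRACTORS, key=squared_distance)


def unresolved_fraction(quartet_logliks):
    """Fraction of quartets that fall in the central (star-like) region."""
    labels = [classify_quartet(q) for q in quartet_logliks]
    return labels.count("unresolved") / len(labels)
```

A region whose `unresolved_fraction` exceeds 0.40 would correspond to the ">40% unresolved quartets" situation described above for S and nsp3.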

| Analysis of amino acid substitutions
The analysis of amino acid substitutions by fortnights was useful to study the viral evolutionary dynamics in the context of the beginning (Table 2). Particularly, in the Orf8 region, early selection of two amino acid substitutions (V62L and L84S) was observed in FN2.
TABLE 2 Amino acids selected by region and fortnight. The number indicates the amino acid location in its protein
However, in the S region, the D614G substitution started with <2% in FN3 and FN4 and reached 88% in the last fortnight. In a similar way, the Q57H (Orf3a) substitution increased from 6% to 34%,
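The fortnight-by-fortnight frequencies reported here can be computed along the following lines. This is only a sketch: the record layout (collection date plus the residue observed at one position), the function names, and the start date used to number the fortnights are assumptions, as the text does not state them.

```python
from collections import Counter
from datetime import date


def fortnight_index(sample_date, start_date):
    """1-based index of the 14-day window containing sample_date."""
    return (sample_date - start_date).days // 14 + 1


def variant_frequency_by_fortnight(records, variant, start_date):
    """records: iterable of (collection_date, residue) for one position.
    Returns {fortnight: fraction of sequences carrying `variant`}."""
    totals = Counter()
    hits = Counter()
    for sample_date, residue in records:
        fn = fortnight_index(sample_date, start_date)
        totals[fn] += 1
        if residue == variant:
            hits[fn] += 1
    return {fn: hits[fn] / totals[fn] for fn in sorted(totals)}


# Toy data for position 614 of S (illustrative values, not from the study).
START = date(2019, 12, 24)  # hypothetical start of fortnight 1
toy_records = [
    (date(2020, 1, 25), "D"),
    (date(2020, 1, 26), "G"),
    (date(2020, 1, 27), "D"),
    (date(2020, 1, 28), "D"),
    (date(2020, 4, 10), "G"),
    (date(2020, 4, 11), "G"),
]
frequencies = variant_frequency_by_fortnight(toy_records, "G", START)
```

Fortnights with no sequences are simply absent from the result, which avoids reporting a frequency of zero where there is no data.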

| Bayesian coalescence analysis
In this study, trees were analyzed by Bayesian analysis instead of distance, likelihood, or parsimony methods. Consistent with the phylogenetic signal analysis, trees for nsp1, E, and Orf6 showed a star-like topology. Nevertheless, different degrees of clade formation could be observed in the trees of the Orf8, nsp14, Orf3a, N, S, and nsp3 regions (Figure 2). Among these regions, nsp3 and S showed the best clade constitution. This analysis made it possible to differentiate regions displaying a diversification process (nsp3, nsp14, Orf3a, S, Orf8, and N) from those that, even after 4 months, showed only an incipient one (nsp1, E, and Orf6). Furthermore, this nucleotide analysis is complemented by the previous study of amino acid variations in each region. However, it is important to note that, due to the low phylogenetic signal observed for each region, results can only be considered preliminary.

| Evolutionary rate
Nsp3 and S sequences were selected to perform the evolutionary rate analysis, as both regions provided the best phylogenetic in-  (Figure 3).

| DISCUSSION
The phylogenetic characterization of an emerging virus is crucial to understand the way the virus and the pandemic will evolve. Thus, a detailed study of the SARS-CoV-2 genome allows, on the one hand, to contribute to the knowledge of viral diversity to detect the most sui- Thus, despite being a virus with an RNA genome, the short time elapsed since its emergence and, possibly, genetic restrictions have led to a constrained evolution of SARS-CoV-2 in these months. For this reason, trees generated from SARS-CoV-2 partial sequences in the first months of the pandemic are expected to be unreliable for defining clades. Therefore, they should be analyzed with caution.
As Bayesian analysis allows phylogenetic patterns to be inferred from tree distributions, it represents a more reliable tool to compare different evolutionary behaviors. Bayesian analysis helps to obtain a tree topology that is closer to reality in the current conditions of the SARS-CoV-2 pandemic. 25 The phylogenetic analysis for nsp1, E, and it can be clearly seen that sequences are separated into two large groups. Although the clusters observed for nsp3 and S showed high support values, these results should be interpreted with caution, and longer periods should be considered to obtain more accurate phylogenetic data. However, even when data are not the most accurate to study the spread or clade formation, 28,29 they provide a good representation of the way the virus is evolving.
The analysis of amino acid frequencies allowed the identification of different degrees of region conservation throughout the viral genome due to positive and negative selection pressures. In particular, nsp3, S, Orf8, and N showed some substitutions at high frequencies. This would indicate, as other authors have previously reported, the frequent circulation of polymorphisms due to a significant positive pressure. 12,26,30 Additionally, as S and N are among the candidates to be used in the formulation of vaccines and antibody treatments, it will be important to monitor these substitutions in different geographic regions to improve treatment and vaccination efficacy. [31][32][33] In particular, the appearance of the D614G variant in the third fortnight and its rapid increase until reaching an 88% prevalence in the eighth fortnight could reflect an improvement in viral fitness, as previously reported. 34 This is supported by studies on SARS-CoV showing that predicted S protein domains underwent the most extensive amino acid substitutions and the strongest positive selection. 35 In contrast, in regions nsp1, nsp14, E, and Orf6, no substitutions were selected during the first 4 months of the pandemic. This would suggest that these regions are constrained by a strong negative selection pressure, as recently reported. 12
In the present study, the evolutionary rate for SARS-CoV-2 genes was estimated by analyzing a large number of sequences, which were carefully curated and had a good temporal and spatial structure. Additionally, the most phylogenetically informative regions of the genome (nsp3 and S) were used for analysis, reinforcing confidence in the results. Previous studies on SARS-CoV-2 have reported similar rates, ranging from 1.79 × 10⁻³ to 6.58 × 10⁻³ s/s/y, for the complete genome. 6,36 However, in both articles, small data sets of complete genomes were used (N = 32 and 54, respectively). As those studies were performed early in the outbreak, the temporal structure of their data sets could have led to less precise estimates of the evolutionary rate. 22
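To put rates expressed in substitutions/site/year (s/s/y) on a more intuitive scale, they can be multiplied by the length of the region to give an expected number of substitutions per year. The sketch below uses the nsp3 and S rates stated in the abstract. The region lengths are assumptions taken from the reference genome annotation (nsp3 about 5835 nt, S about 3822 nt including the stop codon); the alignments used in the study may differ.

```python
def expected_substitutions_per_year(rate_per_site_per_year, region_length_nt):
    """Expected substitutions per year across a region of the given length."""
    return rate_per_site_per_year * region_length_nt


# Rates from the abstract (s/s/y); lengths are assumed reference values (nt).
RATES = {"nsp3": 1.37e-3, "S": 2.19e-3}
ASSUMED_LENGTHS_NT = {"nsp3": 5835, "S": 3822}

per_region = {
    region: expected_substitutions_per_year(RATES[region], ASSUMED_LENGTHS_NT[region])
    for region in RATES
}

for region, value in per_region.items():
    print(f"{region}: about {value:.1f} substitutions per year")
```

Under these assumed lengths, both regions would accumulate on the order of eight substitutions per year along a lineage.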

CONFLICT OF INTERESTS
The authors declare that there is no conflict of interest.

AUTHOR CONTRIBUTIONS
Data curation, acquisition of data, analysis and interpretation of data,

DATA AVAILABILITY STATEMENT
Data derived from public domain resources.