Toward unified molecular surveillance of RSV: A proposal for genotype definition.

Abstract Background Human respiratory syncytial virus (RSV) is classified into antigenic subgroups A and B. Thirteen genotypes have been defined for RSV‐A and 20 for RSV‐B, without any consensus on genotype definition. Methods We evaluated clustering of RSV sequences published in GenBank until February 2018 to define genotypes by using maximum likelihood and Bayesian phylogenetic analyses and average p‐distances. Results We compared the patterns of sequence clustering of complete genomes; the three surface glycoproteins genes (SH, G, and F, single and concatenated); the ectodomain and the 2nd hypervariable region of G gene. Although complete genome analysis achieved the best resolution, the F, G, and G‐ectodomain phylogenies showed similar topologies with statistical support comparable to complete genome. Based on the widespread geographic representation and large number of available G‐ectodomain sequences, this region was chosen as the minimum region suitable for RSV genotyping. A genotype was defined as a monophyletic cluster of sequences with high statistical support (≥80% bootstrap and ≥0.8 posterior probability), with an intragenotype p‐distance ≤0.03 for both subgroups and an intergenotype p‐distance ≥0.09 for RSV‐A and ≥0.05 for RSV‐B. In this work, the number of genotypes was reduced from 13 to three for RSV‐A (GA1‐GA3) and from 20 to seven for RSV‐B (GB1‐GB7). Within these, two additional levels of classification were defined: subgenotypes and lineages. Signature amino acid substitutions to complement this classification were also identified. Conclusions We propose an objective protocol for RSV genotyping suitable for adoption as an international standard to support the global expansion of RSV molecular surveillance.


| BACKGROUND
Human respiratory syncytial virus (RSV) is the commonest viral cause of acute lower respiratory tract infections in children worldwide and the main infectious reason for pediatric hospitalizations. 1 There is as yet neither an effective vaccine nor antiviral therapy, and treatment remains supportive, although passive antibody prophylaxis (palivizumab) is available in developed countries to protect high-risk infants. Progress is being made in vaccine development, with encouraging results reported for live-attenuated and subunit approaches. 2 RSV is a member of the Orthopneumovirus genus within the family Pneumoviridae. 3 It is an enveloped virus with a negative-sense ssRNA genome of ~15 200 nucleotides (nt) in length. Its genome encodes 11 proteins (Figure 1A). Two antigenic subgroups (A and B) are distinguished by polyclonal and monoclonal antibodies.
Both subgroups are evolutionary lineages that diverged approximately 350 years ago 4 with considerable genotypic variability within them. The major differences are found in the attachment glycoprotein G, which has only 53% amino acid sequence conservation across strains and has been used historically for molecular characterization. 5 Currently, 13 RSV genotypes have been defined among the subgroup A strains (GA1-7, 6,7 SAA1, 8 NA1-4, 9,10 and ON1-2 11,12 ) and 20 genotypes among the subgroup B strains (GB1-4, 6 SAB1-4, 8 URU1-2, 13 and BA1-10 14,15 ), but the criteria used to define a genotype vary and are based on phylogenetic analyses inferred using different methods (maximum likelihood, maximum parsimony, neighbor-joining, or Bayesian inference). Most definitions focus on clustering of phylogenetic clades with significant bootstrap values (>70%) in trees built from alignments encompassing the 2nd hypervariable region (HR) of the G gene (~270 nt in length). To support these definitions, the average genetic distance (p-distance) has been used, sometimes as an informative tool, sometimes as an arbitrarily selected cut-off value (<0.07) 8 or similarity value (>96%). 6 Overall, there is a lack of consensus regarding the criteria to be used to allocate genotypes. The presence of a duplicated segment in the 2nd HR of the G gene (a 72-nt duplication in the ON genotypes of RSV-A and a 60-nt duplication in the BA genotypes of RSV-B) has been used as an added criterion to define new genotypes. 11,16 A more recent proposal for genotyping RSV-A strains used phylogenetic analysis of the G-ectodomain and reevaluated historical genotypes using average p-distances within and among genotypes with a cut-off value of 0.049, based on the average p-distance of the oldest RSV-A genotype, GA1. 17
Unification of the nomenclature and phylogenetic classification of viruses with high impact on human and animal health, such as highly pathogenic H5N1 avian influenza virus (https://www.who.int/influenza/gisrs_laboratory/h5n1_nomenclature/en/), Newcastle disease virus, 18 measles virus, 19 and HCV, 20 is an important underpinning principle for unambiguous tracking of virus evolution, which may have significant public health consequences.
Reaching consensus on a unified criterion for RSV genotype definition is essential to explore the association of genotype with disease severity, or geographic or temporal restriction of virus circulation. The aim of this work is to reach a new genotype definition based on both phylogenetic analyses and average p-distances that would bring uniformity to strains designations and thereby facilitate conclusions about viral evolution based on data from surveillance studies.

| Sequence datasets
RSV complete genome sequences from human clinical samples were retrieved from GenBank up to February 2018 (718 RSV-A and 348 RSV-B sequences). Curation consisted of removing sequences with "NNN" regions, ambiguous nucleotides, and/or 1-2 nucleotide deletions causing frameshifts.
Sequences with incorrect RSV subgroup allocation were identified and reassigned to the correct subgroup. After curation, a total of 689 complete genomes of RSV-A and 344 of RSV-B were obtained (Figure 1B).
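The sequence-level part of this curation can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `passes_curation` and `curate` are hypothetical, sequences are assumed to be plain strings keyed by accession, and the frameshift check (1-2 nucleotide deletions) is deliberately left out because it requires an alignment against a reference.

```python
# Characters accepted as unambiguous; '-' is tolerated as an alignment gap.
UNAMBIGUOUS = set("ACGT-")

def passes_curation(seq: str) -> bool:
    """Return True if the sequence has no 'NNN' run and no ambiguous bases."""
    s = seq.upper().replace("U", "T")
    if "NNN" in s:
        return False
    # Any IUPAC ambiguity code (N, R, Y, ...) falls outside UNAMBIGUOUS.
    return set(s) <= UNAMBIGUOUS

def curate(records: dict) -> dict:
    """Keep only the records whose sequence passes the checks above."""
    return {name: seq for name, seq in records.items() if passes_curation(seq)}
```

For example, `curate({"a": "ACGT", "b": "ACNNNGT", "c": "ACRGT"})` keeps only record `a`.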
Each genomic dataset was aligned with MUSCLE (multiple sequence comparison by log-expectation). 21 These were further trimmed to produce alternative datasets encompassing different regions of the RSV genome to be analyzed: the concatenated SH-G-F genes (including their intergenic regions) as a single stretch; the three surface glycoprotein genes as separate ORFs; and the ectodomain and the 2nd HR of the G gene (Figure 1A).

KEYWORDS
average genetic distance, genotypes, global molecular surveillance, human orthopneumovirus, human respiratory syncytial virus, lineages, phylogenetic analysis, subgenotypes

A second analysis was performed with all the G-ectodomain sequences published in GenBank up to February 2018 (3362 RSV-A and 1742 RSV-B sequences). Alignment curation was performed as described for genome sequences. In addition, identical nucleotide sequences were detected and removed with IQ-TREE software and only one representative non-identical sequence was kept. 22 After curation, a total of 2481 G-ectodomain sequences for RSV-A and 1259 for RSV-B were obtained.
The final alignments were visually checked for artifacts produced during the alignment procedure.
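The removal of identical sequences was done with IQ-TREE in this study. As a plain-Python stand-in for the same idea, the sketch below keeps the first record seen for each distinct sequence. The function name `drop_identical` is hypothetical, and the comparison is assumed to be case-insensitive and exact over the full aligned string.

```python
def drop_identical(records: dict) -> dict:
    """Keep one representative (the first encountered) per distinct sequence."""
    seen = set()
    unique = {}
    for name, seq in records.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique[name] = seq
    return unique
```

With `{"s1": "ACGT", "s2": "acgt", "s3": "ACGA"}` as input, `s2` is dropped as a duplicate of `s1`.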

| Phylogenetic inferences
The most suitable nucleotide substitution models were selected with IQ-TREE software according to the Akaike information criterion (AIC). 23 GTR + G was the most suitable model for most of the alignments, with the exception of the complete genome and the three surface glycoprotein alignments, for which GTR + I + G was selected for both subgroups. In addition, TIM + G was the model selected for both SH alignments.
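For readers unfamiliar with AIC-based model selection, the criterion is AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood; the model with the lowest AIC is preferred. The sketch below only illustrates that rule. The model names and numbers in the usage note are invented placeholders, not values from this study, and IQ-TREE performs this comparison internally.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def best_model(fits: dict) -> str:
    """fits maps a model name to (log-likelihood, number of free parameters)."""
    return min(fits, key=lambda name: aic(*fits[name]))
```

With the placeholder input `{"model_a": (-1000.0, 10), "model_b": (-990.0, 12)}`, the AIC values are 2020 and 2004, so `model_b` is selected despite its two extra parameters.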
Maximum likelihood trees were inferred using IQ-TREE v1.6.7 software with 1000 ultrafast bootstrap replicates plus the SH-like approximate likelihood ratio test as statistical support (values ≥80% were defined as well-supported). 24 Bayesian trees were inferred with the BEAST v1.10.2 package. 25 Demographic and molecular clock model selections were performed by estimating the model marginal log-likelihood through the path sampling method, and an uncorrelated relaxed molecular clock and the Skyride tree prior were selected. The number of generations was between 50 and 100 million, and an appropriate sampling frequency was used to obtain 10 000 trees. Convergence was assessed by estimating the effective sample size after a 10% burn-in using Tracer v1.7.1 (http://tree.bio.ed.ac.uk/software/tracer/). TreeAnnotator was used to summarize the posterior trees from BEAST into a maximum clade credibility tree (MCCT). Nodes were considered well-supported when posterior probabilities were ≥0.8.
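The sampling arithmetic and support thresholds above can be made concrete with a short sketch. The helper names are hypothetical, and the calculation is approximate: it ignores the initial state that BEAST also logs.

```python
def sampling_interval(chain_length: int, n_samples: int = 10_000) -> int:
    """States between logged trees so that about n_samples trees are kept."""
    return chain_length // n_samples

def after_burn_in(samples: list, burn_in_percent: int = 10) -> list:
    """Discard the first burn_in_percent of the samples."""
    start = len(samples) * burn_in_percent // 100
    return samples[start:]

def well_supported(bootstrap_percent: float, posterior: float) -> bool:
    """Support thresholds stated in the text: >=80% bootstrap and >=0.8 posterior."""
    return bootstrap_percent >= 80 and posterior >= 0.8
```

Chains of 50 and 100 million generations therefore correspond to logging every 5000 and 10 000 states, respectively, and a 10% burn-in leaves about 9000 of the 10 000 trees for summarizing.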
A web tool (www.phylo.io) was used for comparison of tree topologies. 26

| Evaluation of phylogenetic signal
The loss of phylogenetic signal due to substitution saturation was evaluated with DAMBE software. The level of saturation was studied by plotting the pairwise number of observed transitions and transversions versus genetic distance. 28
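The quantities plotted in such a saturation analysis are per-pair counts of transitions (purine to purine or pyrimidine to pyrimidine changes) and transversions (purine to pyrimidine changes). The sketch below shows how they could be counted for one pair of aligned sequences. It is an illustration only, not DAMBE's implementation; the function name `count_ts_tv` is hypothetical, and sites with gaps or ambiguous bases are assumed to be skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq1: str, seq2: str) -> tuple:
    """Return (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip identical sites and anything that is not a plain nucleotide.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1  # change within the same chemical class
        else:
            tv += 1  # change between purine and pyrimidine
    return ts, tv
```

Under saturation, the counts stop increasing linearly with genetic distance, which is what the plot is meant to reveal.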

| Genetic distances and amino acids analyses
The average genetic distance within and among clades was estimated from alignments of unique sequences with MEGA7 software using the simplest method, the p-distance, defined as the proportion of nucleotide sites at which two compared sequences differ. 29 Pairwise deletion was used to treat alignment gaps. The standard errors of the estimates were determined by the bootstrap method with 1000 replicates.
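As an illustration of the distance measure described above, the following sketch computes a p-distance with pairwise deletion and averages it within one group or between two groups. It is a simplified stand-in for the MEGA7 calculation, not a reimplementation: the function names are hypothetical, only A, C, G, and T are treated as comparable sites, and bootstrap standard errors are omitted.

```python
from itertools import combinations, product

VALID = set("ACGT")

def p_distance(s1: str, s2: str) -> float:
    """Proportion of differing sites, skipping gaps/ambiguities per pair."""
    compared = diffs = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in VALID or b not in VALID:
            continue  # pairwise deletion: drop the site for this pair only
        compared += 1
        diffs += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared

def mean_within(group: list) -> float:
    """Average p-distance over all pairs inside one group (needs >= 2 sequences)."""
    pairs = list(combinations(group, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

def mean_between(group1: list, group2: list) -> float:
    """Average p-distance over all pairs drawn from two different groups."""
    pairs = list(product(group1, group2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

For instance, `p_distance("ACGT-A", "ACGAAA")` compares five sites (the gapped site is dropped) and finds one difference, giving 0.2.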
The treesub program (https://github.com/tamuri/treesub) was used to analyze signature amino acids that support the defined levels of classification. It estimates ML trees using RAxML, followed by branch annotation of amino acid substitutions. AliView v1.25 was used to visualize and check signature amino acids. 30

| RESULTS AND DISCUSSION
Phylogenies constructed from the F gene, G gene, and G-ectodomain showed very similar topologies, albeit with lower resolution than complete genome trees. Given that vaccine and prophylaxis strategies are mostly targeted to the F protein, analysis of F sequences would provide useful data; however, G-ectodomain sequences are more widely available, both in number and in geographical representation. Therefore, we focused on the G-ectodomain. A comparison of the tree topology of the complete genome and the G-ectodomain phylogenies is shown in Figure 2A. Even though several clusters are not identical between the complete genome and G-ectodomain phylogenies, the detailed analysis by taxa shows that the inconsistencies occur largely in terminal nodes with low statistical support. The ancestor nodes of the main clades of sequences remain unchanged in their topology and statistical support, suggesting that the potential genotype assignment of the sequences would not be influenced.
Based on these results, we conclude that the G-ectodomain should be considered the minimum region of the genome to be analyzed to obtain reliable phylogenies and genotype designation, and it can be used to obtain robust phylogeographic analyses allowing inferences about the origin of viral introductions in a particular geographical region.
Next, to classify genotypes, we built on the strategy of using average p-distances described for the RSV-A subgroup in 2015. 17 Briefly, in that work the average p-distances among individual genotypes, as well as within each genotype, were calculated. The highest intragenotype average p-distance was found for the GA1 genotype. This value was taken as the threshold for sorting viruses into different genotypes.
Using these criteria, we reevaluated and redefined all previous genotypes as explained below and summarized in Table S1 (see also Tables 1A and 2). As a result, three genotypes were identified for RSV-A (GA1-GA3) and seven genotypes for RSV-B (GB1-GB7). This reduces the number of genotypes from 13 to three for RSV-A and from 20 to seven for RSV-B.
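The genotype definition stated in the abstract (a monophyletic cluster with ≥80% bootstrap and ≥0.8 posterior probability, an intragenotype p-distance ≤0.03, and an intergenotype p-distance ≥0.09 for RSV-A or ≥0.05 for RSV-B) can be written as a simple check. The sketch below is one possible encoding under stated assumptions: the function and constant names are hypothetical, monophyly is assumed to have been established from the tree beforehand, and the inputs are average p-distances already computed for the candidate cluster.

```python
INTRA_MAX = 0.03                    # maximum intragenotype p-distance, both subgroups
INTER_MIN = {"A": 0.09, "B": 0.05}  # minimum intergenotype p-distance per subgroup

def meets_genotype_definition(subgroup: str, bootstrap_percent: float,
                              posterior: float, intra_distance: float,
                              min_inter_distance: float) -> bool:
    """Check a monophyletic candidate cluster against the stated thresholds."""
    supported = bootstrap_percent >= 80 and posterior >= 0.8
    return (supported
            and intra_distance <= INTRA_MAX
            and min_inter_distance >= INTER_MIN[subgroup])
```

Here `min_inter_distance` is the smallest average distance from the candidate to any other genotype, so a cluster that is too close to even one neighbor fails the check.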
For a standardized genotype denomination, we propose to adopt the nomenclature defined by Peret et al (1998) where genotypes are designated as GAX and GBX; G stands for G-based genotype, A or B designate RSV subgroups, and X corresponds to a first-order ascending numbering system. This is a straightforward definition which does not allude to geographic references, avoiding potential stigmatizing labeling of RSV clades. 6 We renamed all genotypes in ascending order according to their first detection date, except GA2.
We maintained its designation despite having emerged more recently than GA3, because it is a current widespread-circulating genotype and renaming this genotype could create confusion in the RSV community.
Small numbers of independent sequences not fitting the genotype definition were not classified and will remain undefined until future sequences cluster with them and meet the definition of genotype. It is also possible that these sequences represent extinct viral genotypes.
Genotypes GA2, GA3, and GB5 are currently the most frequently detected and the largest ones ( Figure 2B,C). Within these genotypes, further levels of classification were defined.
The next level of classification, defined as subgenotypes, encompassed well-supported dichotomies (≥80% bootstrap and ≥0.8 posterior probability values) with an average intersubgenotype p-distance smaller than the minimum intergenotype p-distance described (divergence: 9% for RSV-A and 5% for RSV-B).
The subgenotype nomenclature was defined as GAX.Y or GBX.Y, where X is the genotype and Y the subgenotype and corresponds to a second-order ascending numbering system. Within GA2, three subgenotypes were identified (GA2.1-GA2.3; Figure 2B) with average p-distances among them as shown in Table 1B. GA3 and GB5 did not show well-supported dichotomies within them. However, we cannot rule out the possibility that an increasing number of available sequences in the future will allow improved resolution of subgenotypes. For this reason, we propose to designate the level related to subgenotypes with a 0 (zero) until further resolution is possible.   GB5.0.4c lineages. Once two or more particular lineages were defined with a letter, subsequent lineages were also defined with the same letter to denote the same ancestral origin, such as GB5.0.4a and GB5.0.5a ( Figure 3C).
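The subgenotype criterion and the naming scheme can be sketched as follows. All names in the block are hypothetical. The third field of a label (for example the `4a` in GB5.0.4a) is handled here as a lineage number with an optional letter, which is an assumption inferred from the example labels in the text rather than a stated rule.

```python
import re

# Minimum intergenotype p-distance per subgroup (9% for RSV-A, 5% for RSV-B).
MIN_INTERGENOTYPE = {"A": 0.09, "B": 0.05}

def is_subgenotype_split(subgroup: str, bootstrap_percent: float,
                         posterior: float, inter_sub_distance: float) -> bool:
    """Well-supported dichotomy whose distance stays below the genotype cut-off."""
    supported = bootstrap_percent >= 80 and posterior >= 0.8
    return supported and inter_sub_distance < MIN_INTERGENOTYPE[subgroup]

def format_name(subgroup: str, genotype: int, subgenotype: int = 0) -> str:
    """Build GAX.Y / GBX.Y; subgenotype 0 means 'not yet resolved'."""
    return f"G{subgroup}{genotype}.{subgenotype}"

# Assumed label shape: G + subgroup + genotype [. subgenotype [. lineage]]
LABEL = re.compile(r"^G([AB])(\d+)(?:\.(\d+)(?:\.(\d+[a-z]?))?)?$")

def parse_label(label: str) -> tuple:
    """Split a label into (subgroup, genotype, subgenotype, lineage)."""
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognized label: {label}")
    subgroup, genotype, sub, lineage = m.groups()
    return subgroup, int(genotype), (int(sub) if sub is not None else None), lineage
```

`format_name("B", 5)` returns `GB5.0`, matching the use of 0 for genotypes without resolved subgenotypes.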

TABLE 1 Estimates of average genetic distances among and within RSV-A genotypes (A) and subgenotypes (B)
To complement the classification levels defined by phylogenetic clustering, we searched for signature amino acids, an approach also used for influenza. 31 Signature amino acids are ideally defined by substitutions that are uniquely present in all sequences of a given clade. Table 3 and Figures S5 and S6 show amino acid substitutions across genotypes, subgenotypes, and lineages. Although most RSV-A and B genotypes, subgenotypes, and lineages showed substitutions fitting this definition (defined as "main" in Table 3), there were other signature amino acids present in most but not all sequences within a clade, or present in more than one clade across the trees (defined as "secondary" in Table 3). Our analysis shows that all genotypes and subgenotypes in RSV-A and B were clearly defined by a collection of characteristic amino acid substitutions. For some lineages, there were no unique signature substitutions, or at least two concomitant substitutions needed to be present for a sequence to be properly classified into that lineage. Thus, we recommend that signature amino acids be analyzed as a collection of substitutions (or haplotypes) defining each genotype/subgenotype/lineage and used as a combined tool together with phylogenetic methods to further confirm the classification of sequences into given genetic clades.
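The "main" signature definition above (a residue present in every sequence of a clade and in no sequence outside it) can be expressed as a direct comparison over an amino acid alignment. The sketch below is a simplified illustration, not the treesub method, which annotates substitutions on tree branches. The function name `main_signatures` is hypothetical, and the input is assumed to be equal-length aligned protein sequences grouped by clade.

```python
def main_signatures(clades: dict, target: str) -> list:
    """Return (1-based position, residue) pairs unique to the target clade."""
    inside = clades[target]
    outside = [seq for name, seqs in clades.items() if name != target
               for seq in seqs]
    signatures = []
    for i in range(len(inside[0])):
        residues = {seq[i] for seq in inside}
        if len(residues) != 1:
            continue  # not shared by all sequences of the clade
        (residue,) = residues
        if all(seq[i] != residue for seq in outside):
            signatures.append((i + 1, residue))
    return signatures
```

With `{"X": ["MKTL", "MKSL"], "Y": ["MRTI", "MRTI"]}`, clade `X` is characterized by K at position 2 and L at position 4. Position 3 is excluded because it varies within `X`, which corresponds to the kind of case the text treats as "secondary".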
With this new classification, the number of genotypes was reduced from 13 to three genotypes for RSV-A (GA1-GA3) and from 20 to seven genotypes for RSV-B (GB1-GB7). Two further levels of classification were added: subgenotypes and lineages (summarized definitions in Table S2). The timeline of detection of genotypes and subgenotypes is shown in Figure 2C. In addition, the information about period and regions of circulation, and the first detected strain for all the proposed taxonomic groups is listed in  15 The schematic diagram of genotypes and subgenotypes circulation ( Figure 2C) and the information listed in Table 4

| Strengths and weaknesses of this RSV strain classification proposal
All of the analyses in the present study were carried out with sequences downloaded from GenBank, which is not a dedicated and curated database. A variety of issues with the sequences were found that might have affected tree topologies and the calculation of average p-distances. Therefore, a reliable, well-maintained database with curated sequences will be essential to standardize RSV molecular surveillance.
Our study shows that full genome sequences are the most informative and desirable dataset for genotyping purposes. As technology progresses and NGS methodologies become widespread in the near future, costs will eventually fall, enabling more laboratories to implement protocols to generate full viral genomes for their molecular analyses. A strength of this proposal is that the genotype, subgenotype, and lineage definitions proposed here are scalable to the complete genome. For complete genomes, the cut-off value for the average p-distance should be recalculated as shown in Table S5, and patristic distances should be evaluated for the identification of lineages. The strategy proposed in this work relies on the current availability of partial RSV sequences. The results of this study and other publications support the idea that G-ectodomain sequences can be effectively used for genetic characterization of RSV strains. [32][33][34] It is important to highlight the economic benefit of a low-cost methodology, especially for low- and middle-income countries that may be expanding their sequencing activity: the G-ectodomain region can be sequenced by the Sanger method with only two reactions, allowing surveillance laboratories to monitor molecular diversity easily and in real time. In this effort to standardize the molecular classification of RSV strains, we consider the use of a cut-off value for genotype definition to be essential.

TABLE 3 Signature amino acids characterizing genotypes/subgenotypes/lineages for RSV-A (A) and B (B)

Defining the average intragenotype p-distance of GA1 and GB1 as the cut-off is useful because these clades include the oldest strains detected for each subgroup and are no longer detected in the population. Although unlikely, we cannot rule out the possibility that strains from these two genotypes might still be detected in the near future. Genetic distance measures are sensitive to the diversity of the sampled sequences; thus, if more sequences from old strains were released in the future, this could alter the average intra- and interclade p-distances and even slightly modify the cut-off value. We strongly emphasize the need for periodic reevaluation of genotype definitions by a global consortium of RSV experts.

| CONCLUSION
With many new vaccine candidates in prospect, widespread adoption of a unified criterion for RSV genotyping will enable standardized, comparable analyses across the scientific research and public health communities. Thus, our proposal of RSV strain classification will facilitate the analysis of strains in clinical and epidemiological studies and would be a fitting start to a joint global RSV surveillance.

ACKNOWLEDGEMENTS
We would like to thank Burcu Ermetal for assistance with the use of the treesub program and interpretation of data.

CONFLICT OF INTEREST
The authors declare that they have no competing interests.

DEDICATION
We would like to dedicate this manuscript to the memory of Dr José Antonio Melero, who was an outstanding researcher and warm and generous individual who encouraged us to work on this topic, with whom we began to work on the unification of the RSV genotype definition shortly before his untimely death.