Novel and extendable genotyping system for human respiratory syncytial virus based on whole‐genome sequence analysis

Abstract Background Human respiratory syncytial virus (RSV) is one of the leading causes of respiratory infections, especially in infants and young children. Previous RSV sequencing studies have primarily focused on partial sequencing of G gene (200–300 nucleotides) for genotype characterization or diagnostics. However, the genotype assignment with G gene has not recapitulated the phylogenetic signal of other genes, and there is no consensus on RSV genotype definition. Methods We conducted maximum likelihood phylogenetic analysis with 10 RSV individual genes and whole‐genome sequence (WGS) that are published in GenBank. RSV genotypes were determined by using phylogenetic analysis and pair‐wise node distances. Results In this study, we first statistically examined the phylogenetic incongruence, rate variation for each RSV gene sequence and WGS. We then proposed a new RSV genotyping system based on a comparative analysis of WGS and the temporal distribution of strains. We also provide an RSV classification tool to perform RSV genotype assignment and a publicly accessible up‐to‐date instance of Nextstrain where the phylogenetic relationship of all genotypes can be explored. Conclusions This revised RSV genotyping system will provide important information for disease surveillance, epidemiology, and vaccine development.


| INTRODUCTION
Human respiratory syncytial virus (RSV) is a major cause of acute lower respiratory tract infection worldwide in infants and young children (<5 years of age), as well as in the elderly and patients who are immunocompromised. 1 Despite the clinical significance and the burden of RSV infection, we lack an understanding of the patterns of virus emergence, evolution, and spread. Phylogenetic studies of RSV evolution are in need, especially on a global scale due to the limited availability of whole-genome sequence (WGS) data and strongly asynchronous sampling in time and space. 2 RSV is an enveloped virus with a negative-sense, single-stranded, non-segmented RNA genome of $15,200 nucleotides (nt) in length and belongs to the family Pneumoviridae. This genome encodes for 11 proteins, including the polymerase (L), nucleocapsid (N), phosphoprotein (P), transcriptional regulators (M2-1 and M2-2), matrix (M), small hydrophobic surface protein (SH), non-structural proteins (NS1, NS2), and two major surface glycoproteins (F and G). 3 This virus has been classified as subtype A (RSV-A) or subtype B (RSV-B) according to reactivity with monoclonal antibodies. 4 Both subtypes typically cocirculate during epidemic seasons. Within the RSV-A and RSV-B subtypes, different genotypes have been further classified mainly based on genetic differences in the second hypervariable region (HR) located at the G glycoprotein. 5 Like other respiratory viruses, RSV has diverse circulation patterns. Several genotypes can cocirculate within the same community, whereas novel RSV genotypes with high genomic diversity may arise and potentially replace the previous dominant genotypes. 6 Fourteen genotypes among RSV-A (GA1-7, SAA1, CB-A, NA1-4, and ON1) and 24 genotypes in RSV-B (GB1-GB5, SAB1-SAB4, URU1-2, BA1-12, and CB1) have been identified. 1 The most notable genotype change observed in recent years is the emergence of RSV-A (ON) and RSV-B (BA) strains with a partial duplication of the distal third of the G gene and have since become the dominant strains in many regions. 7,8 With the emergence of novel genotypes, a potential association of RSV genotype with disease severity or geographic and temporal restriction of virus circulation has been reported 9 . 10 Moreover, RSV genetic diversity has been considered as an important factor that allows for reinfections to occur and needs to be considered in vaccine development. 9 Therefore, a genotyping system that could reflect RSV genetic diversity is needed. Previous RSV sequencing has largely focused on complete 11 or partial G gene (200-300 nt) 3,12 for genotype characterization or diagnostics. However, the evolutionary signals from other gene regions should also be taken into account as the phylogeny inferred from other genes might conflict with the phylogeny inferred from complete or partial G gene alone. Furthermore, the novel identified G gene duplication signature should be considered as a single evolutionary event, whereas current widely used phylogenetic models only account for residue substitution events. In addition, most of the current RSV genotyping methods are based on the pairwise distance (p-distance) matrix by specifying a cutoff value below which individuals are assigned to the same cluster. 13,14 It is important to note that several factors affect p-distance calculation, including the length of the sequences and the number of sequences used in the analysis. The p-distance defined genotype system also needs to be updated frequently due to the accumulated viral diversity within genotypes over an increased circulation period, and new genotypes are likely to be defined within the previously defined genotypes.
Despite the need to easily recognize RSV genotypes for molecular epidemiology, vaccine design, and control efforts, the delineating criteria of RSV genotypes are not agreed upon. 11,[14][15][16] There is a need for a robust system to define RSV genotypes and to resolve inconsistencies present in the literature arising from previous genotyping methods.
Our study proposed a novel and extendable RSV genotyping system based on a more complete RSV phylogeny. After evaluating the phylogeny inferred from different RSV gene datasets, we concluded that the WGS is the most informative and desirable dataset for RSV genotyping purpose. We categorized RSV into two classification levels, the genotype and subclade, mainly with phylogenetic analysis and detection year of sequences, which provides a convenient and sustainable way to refer to the emergence of RSV strains.

| Phylogenetic inference
Phylogenetic analysis for different gene datasets was conducted with the maximum likelihood (ML) approach using RAXML v8.0, 21 which has the advantage of partitioned analysis. We applied the autoMRE option embedded in RAXML for an efficient convergence of bootstrapping process, where the bootstrapping value is one of the criteria for the genotype assignment. We implemented an indel coding method to code gene duplication and deletion region of the G gene as separate binary partitions. The rest of the nucleotides were set as a separate partition with the GTR + Gamma substitution model. The temporal signal of the WGS datasets was diagnosed using TEMPEST, and temporal outliers were removed. 22 Tree topology tests for the phylogenies inferred from different RSV gene datasets were performed using IQ-TREE with the Shimodaira-Hasegawa (SH) test and approximately unbiased (AU) test. 23 The evolutionary rates for different genes were estimated using the program TREETIME. 24

| Genotypes and subclades assignment for RSV
We aim to classify RSV strains into two levels in our analysis. The groups of strains that have potential to circulate are further defined as subclades under the classification level of genotypes. The genotype assignment is based on pair-wise node distance. Pair-wise node distances, which are the distances between the most common ancestors of groups of sequences in a phylogeny, were calculated between all nodes in a phylogenetic tree based on the alignment of the RSV WGSs, using the RRPHYLO v2.5.7 package. 25 We further employed time pipeline, 28 which rapidly determines cladistic information for sequences using support vector machines (SVMs) without the need for time-consuming sequence alignment, phylogenetic tree construction, or manual annotation. Sequences with an annotated genotype or subclade were used to create a training data library. Training data for each genotype were subsampled in an ad hoc manner using PDA v1.0.3. 29 (Table S2). The classification module (available at https:// github.com/JianiC/rsv-genotype/tree/master/LABEL/RSV) was then built with training sequences and the custom scripts that are implemented in the LABEL program. Both WGS and partial sequences of RSV can be automated to a genotype or subclade, and no further information is required.

| Phylogenetic analysis with different RSV gene datasets
We first characterized the indels of RSV sequences in our dataset. In addition to the previously defined RSV-A genotype ON with a 72-nt duplication in the second HR of G gene and RSV-B genotype BA with a 60-nt duplication in a similar region, we also observed a 6-nt deletion in the recent circulating RSV-B strains at the G gene region ( Figure 1A).
The RSV phylogenies were constructed for each gene as well as WGS using the ML approach and the G gene duplication and deletion region have been further coded as a separate binary partition in our analysis. We also scored the likelihood of phylogenies inferred from different gene datasets given by the WGS dataset to compare the topologies of different phylogenetic trees. The SH and AU tests suggested the phylogenetic trees inferred from individual RSV genes have significantly different likelihood scores compared with WGS, and we did not observe the significant difference with the phylogeny where G gene indels were simply considered as multiple substitution events ( Figure 1B). The tree topology differences inferred from the WGS and individual gene sequences (L, G, and F that have close likelihood scores with that of WGS) have been further demonstrated using a tanglegram approach as demonstrated in Figure S1. The mean nucleotide substitution rates for RSV-A and RSV-B estimated from WGS are 3.72 Â 10 À4 and 5.73 Â 10 À4 substitutions/site/year, respectively. The rate estimation of the G gene was approximately 2.5-fold faster than other genes ( Figure 1C). Overall, our results indicate the phylogenetic analysis based on WGS can provide more valuable insight on RSV genetic diversity and evolution.

| Novel RSV genotype system with WGS
The following criteria were used to build a standardized RSV

| Spatial and temporal distribution of RSV genotypes
We provide a description of the spatial and temporal distribution of

| DISCUSSION
In this study, we highlight the importance of WGS for RSV genotype assignment. We statistically compare the phylogeny derived from different RSV gene datasets with the likelihood score test and evolutionary rate estimation from different gene datasets. Our results indicate WGS should be used for genotype assignment. Our analysis is based on a recombination-free dataset to avoid inferential biases. Twelve RSV sequences in our initial dataset showed some evidence of potential recombination. Because genomic recombination in RSV is believed to be extremely rare, it is most likely that these recombinants arose as a result of PCR or sequencing artifacts. 31 Even though we did not observe a significant statistical difference in our analysis, the G gene duplication should be considered as a single biological event, and we implement an indel coding method to improve phylogenetic resolution.
There are several expectations for a widely acceptable RSV genotyping system. First, since more than 30 genotypes have been identified, we expect to have a reasonable number of RSV genotypes to simplify the study with different RSV strains. Secondly, genotype assignment should be able to capture the genetic diversity, thereby providing valuable insights into the ongoing evolution of the virus and playing an important role in its mitigation and control. 32 Finally, we expect the novel emergent strains could be classified and easily added into the revised nomenclature system. Competing RSV genotype systems have been proposed (Table 2), including an influenza-like system for RSV genotype classification based on the highest intra-genotypic p-distance as the minimum threshold to define a genotype, 14,15,33 F I G U R E 2 Maximum likelihood phylogeny of respiratory syncytial virus (RSV) RSV-A (A) and RSV-B (B) inferred from whole-genome sequence (WGS) with the genotyping assignment. The genotype assignments are indicated with vertical black bars and are labeled on the right. The subclade assignments are indicated with vertical gray bars and are labeled on the left. Tip point colors represent the previously defined genotype names based on complete or partial G gene sequences. The nodes that define the genotype and subclade are indicated with black and gray node points, respectively. Bootstrap of each ancestral genotype/subclade node is marked. Colored columns on the right side represent G gene duplication and indels. Scale bars indicate 0.01 nucleotide substitution per site which are highly sensitive to sampling bias. Another recent systematic RSV genotype study attempted to use patristic distance (the shortest distance between two tips) instead of p-distance to propose a new classification system. 16 Both p-distance and patristic distance are sensitive to the sequencing error and the length of sequences. In addition, a cutoff value of either genetic distance or patristic distance is always needed to standardize the molecular classification of RSV strains, which is likely to be problematic due to sampling bias in RSV, and delineation criteria may need to be reevaluated with continued surveillance of RSV strains. 15,16 We calculate pair-wise node distance to assign genotype in our analysis. Our approach relies on the genetic divergence calculated from the tree tips to the most recent common ancestor for each genotype, which is less sensitive to the individual sequence quality and has advantages to the undersampled RSV sequences. In addition to the genetic differences between RSV genotypes, previous studies have suggested some RSV genotypes may have an advantage in transmission and circulation. 34 Instead of identifying every lineage or strain, one of the major goals in this manuscript T A B L E 1 List of previously defined genotype name and detection time of new RSV genotype assignment. It is crucial to share our updated genotype assignment so that new sequences can be easily added. We deploy our genotype assignment as well as the genotype assignment from previously published studies with Nextstrain, which allows a comparison and a continually updated visualization. 30 We also provide a tool that enables the automated classification of newly generated RSV sequences. By using the platform provided, RSV sequences can be assigned with genotypes and subclades based on the similarity of the sequences that are included in our system. 28 This fast and accurate RSV genotyping assignment tool will be valuable for the classification of novel sequences in future phylogenetic or diagnostic settings.
There are several limitations that need be addressed with any molecular systematic revision of RSV genotypes. In particular, RSV genotype assignment using WGS is subject to sampling bias. Only a limited number of sequences are currently available in GenBank, especially among older samples that were sequenced prior to the widespread and routine use of WGS, which may affect our genotype assignment. In addition, most samples sequenced were isolated from the regions in the Americas. With more RSV sequencing effort, we would expect the geographic distribution of sequence could be captured and effectively used for future updates to this genotype system.
In summary, we propose a revised RSV genotyping assignment that reflects the genetic diversity and circulation pattern of RSV. WGS should be used for future RSV genotyping revisions. In addition, the G gene duplication and other indels should be taken into account for the phylogenetic analysis as a single evolutionary event rather than multiple substitution patterns. Overall, a robust RSV genotype assignment based on WGS will greatly assist those working in clinical identification, epidemiological studies, and vaccine development.

PEER REVIEW
The peer review history for this article is available at https://publons. com/publon/10.1111/irv.12936.

DATA AVAILABILITY STATEMENT
Data are publicly available on GenBank.