Influenza Classification Suite: An automated Galaxy workflow for rapid influenza sequence analysis.

Influenza viruses continually evolve to evade population immunity, and the different lineages are assigned into clades based on shared mutations. We have developed a publicly available computational workflow, the Influenza Classification Suite, for rapid clade mapping of sequenced influenza viruses. This suite provides a user-friendly workflow implemented in Galaxy to automate clade calling and antigenic site extraction. Workflow input includes clade definition and amino acid index array files, which can be customized to identify any clades of interest. The Influenza Classification Suite provides rapid, high-resolution understanding of circulating influenza strain evolution to inform influenza vaccine effectiveness and the need for potential vaccine reformulation.

originally developed for swine influenza strains in North America, classify lineages and clades based on phylogenetic analysis alone using an extensively curated reference gene data set. 5 In the context of rapidly evolving seasonal human influenza, this reference data set would require constant updating.
To modernize the labor-intensive approaches previously involved in assigning clade designations and interpreting antigenic site relatedness between circulating viruses and the vaccine strain, and to address the limitations of currently available pipelines, we have developed the "Influenza Classification Suite," an automated Python-based workflow. It is publicly available through Galaxy, a popular Web-based platform used for the development and distribution of bioinformatic pipelines. 6 Automated and documented workflows are critical for generating reproducible results 7 and for increasing near real-time availability of virological surveillance information. The Influenza Classification Suite can be easily implemented and modified by influenza researchers to automate the process of clade calling in their analysis pipelines.
We describe our workflow and demonstrate its application to HA gene sequences using a test data set from influenza A(H3N2)-positive specimens collected by SPSN sites across Canada during the 2016/17 epidemic. 8

| IMPLEMENTATION
Sanger sequencing of hemagglutinin (HA) genes is routinely performed by SPSN researchers for influenza-positive specimens collected from patients presenting with an influenza-like illness. We created Python scripts to automate the labor-intensive manual clade calling and antigenic site identification process, and validated output from each script against manually obtained results.
Each script is defined as a stand-alone tool in Galaxy (Table 1) and is also combined into a standardized analysis workflow. All tools can also be run from the command line for integration into existing pipelines. The source code of the tools is publicly available on GitHub (https://github.com/Public-Health-Bioinformatics/flu_classification_suite), with the option to install the tools easily using the Conda package manager. A workflow was created by combining tools in a pipeline to automate a series of tasks in a standardized, user-friendly manner (Figure 1). Our comprehensive user's manual, template files, and test data can be found in our GitHub README.md.

| Tool: Change FASTA Deflines
"Change FASTA Deflines" was included as a tool to automate changing or de-identifying sequence names in FASTA files. It uses a two-column text file (comma- or tab-separated) containing existing sequence names in column 1 and desired sequence names in column 2. The program searches the target FASTA file for definition lines matching those in column 1 and, if found, changes them as specified in column 2, allowing fast and accurate renaming.

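The renaming step performed by "Change FASTA Deflines" can be illustrated with a minimal Python sketch. This is not the suite's own code: the function names, the example strain names, and the assumption that a definition line must match column 1 exactly (after the leading ">") are ours, chosen only for illustration.

```python
import csv
import io


def load_name_map(handle, delimiter=","):
    """Read a two-column table: existing name in column 1, new name in column 2."""
    reader = csv.reader(handle, delimiter=delimiter)
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def rename_deflines(fasta_lines, name_map):
    """Yield FASTA lines, replacing each definition line found in name_map."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old_name = line[1:].strip()
            # Names without an entry in the mapping are left unchanged.
            yield ">" + name_map.get(old_name, old_name)
        else:
            yield line


# Hypothetical mapping file content and FASTA records.
mapping = load_name_map(
    io.StringIO("A/Example/01/2016,Sample_001\nA/Example/02/2016,Sample_002\n")
)
fasta = [">A/Example/01/2016", "QKLPGNDNST", ">A/Other/03/2016", "QKIPGNDNST"]
renamed = list(rename_deflines(fasta, mapping))
```

For a tab-separated mapping file, the same sketch would be called with `delimiter="\t"`.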
| Tool: Assign Clades
A clade definition file is used to assign and append clade designations to influenza sequence files. Viral clades are inherently nested, with child clades evolving from parent clades over time (Figure 2). To determine whether sequences matched a clade, we required clade definitions to (a) contain the clade name(s), (b) contain respective clade-defining AA and position numbers, and (c) be easily modified with common software. An Excel template is provided in our GitHub repository that allows researchers to easily edit and define clades in comma-separated value (CSV) format (Table 2). We further incorporated a "depth" parameter into each clade definition to resolve situations in which a sequence is an exact match to more than one clade (eg, parent and child clade). The depth parameter is an integer greater than 0, defining the relative ancestry of the clades (eg, parent clade depth = 2 and child clade depth = 3). The European Centre for Disease Prevention and Control (ECDC) Influenza Virus Characterization Reports (https://www.ecdc.europa.eu) and Nextstrain can be used in identifying and naming clade-defining substitutions. 2
To assign clades to sequence results, the Galaxy tool "Assign Clades" reads in the clade definition file (CSV) and the aligned AA sequence files (FASTA). In the clade definition file, each clade is represented by a tuple (an ordered list) consisting of a clade name,

| Galaxy tool name | Workflow description |
| --- | --- |
| Assign Clades | Uses a clade definition file to assign and append clade designations to sequence names in influenza FASTA files. |
| Antigenic Site Extraction | Uses an influenza subtype-specific amino acid index array to extract antigenic amino acids from influenza sequences and output to FASTA. |
| Line List | Transforms FASTA files of influenza antigenic maps into line lists. |
| Aggregate Line List | Transforms FASTA files of influenza antigenic maps into line lists, summarizing occurrences of each sequevar. |
| Change FASTA Deflines | Changes sequence names in a FASTA file, according to old and new names specified in a text (.csv or .txt) file. |
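The matching logic described for "Assign Clades" (an exact match at every clade-defining position, with the depth parameter breaking ties between parent and child clades) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the suite's implementation: the `CladeDef` structure, the 1-based positions, the underscore used to append the clade name, and the clade names and sequences are all hypothetical.

```python
from typing import NamedTuple


class CladeDef(NamedTuple):
    name: str
    depth: int   # integer > 0; larger values denote more recent (child) clades
    sites: dict  # assumed layout: 1-based AA position -> clade-defining AA


def matches(sequence, clade):
    """True only if every clade-defining position carries the required AA."""
    return all(
        pos <= len(sequence) and sequence[pos - 1] == aa
        for pos, aa in clade.sites.items()
    )


def assign_clade(sequence, clades):
    """Return the deepest matching clade name, or "No_Match" if none match."""
    hits = [clade for clade in clades if matches(sequence, clade)]
    if not hits:
        return "No_Match"
    return max(hits, key=lambda clade: clade.depth).name


# Hypothetical nested clades: the child shares the parent's defining site.
clades = [
    CladeDef("parent", 2, {3: "L"}),
    CladeDef("child", 3, {3: "L", 5: "G"}),
]
records = {"seq1": "QKLPGND", "seq2": "QKLPSND", "seq3": "QKIPGND"}

# Append the clade designation to each sequence name.
labelled = {
    f"{name}_{assign_clade(seq, clades)}": seq for name, seq in records.items()
}
```

Here `seq1` matches both clades and is resolved to the child by depth, `seq2` matches only the parent, and `seq3` is flagged as "No_Match" for manual review.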

| Tool: Antigenic Site Extraction
Because the antigenic sites of influenza A/H1 and A/H3 subtypes and B/Victoria and B/Yamagata lineages comprise different AA positions, [9][10][11] the positions to be extracted programmatically from the full-length HA sequence are specified separately for each subtype or lineage. This is done using CSV files containing an array of the respective antigenic site positions; the files are available in our GitHub repository.
To extract antigenic AAs from sequences, the Galaxy tool "Antigenic Site Extraction" reads in the AA index array (CSV) and the "Assign Clades" output file (FASTA). The AA index array is the list of subtype/lineage-specific indices to extract from the sequence FASTA files. The full-length FASTA sequences, whose names contain the assigned clades, are read into SeqRecords. The specified antigenic AAs are then extracted from each sequence and written to FASTA or CSV format under the clade-labelled sequence names.
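The extraction step can be sketched as follows. This minimal example uses only the Python standard library rather than SeqRecords, and it assumes that the index array holds 1-based positions into the aligned AA sequence; the positions, sequence names, and function names are hypothetical.

```python
import csv
import io


def read_index_array(handle):
    """Read antigenic site positions from a CSV file into a flat list of ints."""
    return [int(cell) for row in csv.reader(handle) for cell in row if cell.strip()]


def extract_sites(sequence, positions):
    """Concatenate the AAs found at the given 1-based positions."""
    return "".join(sequence[pos - 1] for pos in positions)


# Hypothetical index array and clade-labelled sequences.
index_array = read_index_array(io.StringIO("2,4,5\n"))
records = [("seq1_child", "QKLPGND"), ("seq2_parent", "QKLPSND")]

antigenic = [(name, extract_sites(seq, index_array)) for name, seq in records]

# Write the antigenic maps in FASTA format, keeping the clade-labelled names.
fasta_text = "\n".join(f">{name}\n{sites}" for name, sites in antigenic)
```

Because positions are read from the aligned sequence, the index array and the alignment must share the same numbering for the extracted sites to be meaningful.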

| Tool: Line List
The Galaxy tool "Line List" reads in FASTA files of a reference strain (eg, substitutions. The Galaxy tool "Aggregate Line List" is used to aggregate identical antigenic maps when there are too many sequences to view in the available space. It performs all of the functions of "Line List" and additionally collapses identical sequences within each clade, enumerating them and displaying the count in a separate column.
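The collapsing step of "Aggregate Line List" can be illustrated with a short Python sketch. It assumes that each record has already been reduced to a clade name and an antigenic map; the function name, row layout, and example data are hypothetical and do not reproduce the suite's output format.

```python
from collections import Counter


def aggregate(records):
    """Collapse identical antigenic maps within each clade.

    records: iterable of (clade, antigenic_sequence) pairs.
    Returns sorted rows of (clade, antigenic_sequence, count).
    """
    counts = Counter(records)
    return sorted((clade, seq, n) for (clade, seq), n in counts.items())


# Hypothetical antigenic maps; two sequences in the child clade are identical.
records = [("child", "KPG"), ("child", "KPG"), ("child", "KPS"), ("parent", "KPG")]
rows = aggregate(records)
```

Keying the counter on the (clade, sequence) pair keeps identical antigenic maps from different clades on separate rows, which matches the within-clade collapsing described in the text.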

| Downstream application
The output FASTA files can be used for downstream phylogenetic analysis and constructing dendrograms using a variety of programs.

| Influenza Classification Suite validation
Output from each programmatic analysis step was compared using a test set of 574 clinical influenza A(H3N2)-positive patient sequences from the SPSN 2016/17 HA data set and corresponding manually obtained results. 8,12 With a stringent clade assignment cutoff of 100% sequence match to a given clade, 557/574 sequences (97%) matched. Of the 17 "No_Match" sequences (3%), 10 had incomplete sequences at a specific AA position used in the clade definition and seven had a mutation in a clade-defining AA. "No_Match" sequences were analyzed individually, and clades were manually assigned.
Insertions/deletions (indels) are reflected in the aligned FASTA input files, and frequent indels are accounted for when designing the clade definition files. Rare one-off indels are not captured by the software and will also be flagged as "No_Match" for manual analysis. Using the Influenza Classification Suite Galaxy pipeline, manual analysis time was reduced by hours, with increased reliability and reproducibility.
The pipeline is now in routine use by the SPSN sequencing team. Rapid and more automated processing of influenza HA sequences is important for real-time tracking of influenza strain evolution. This information is important for understanding emerging circulating influenza escape variants and their relatedness to the vaccine virus, and it helps guide vaccine strain reformulation.