Rapid alignment updating with Extensiphy

High‐throughput sequencing has become commonplace in evolutionary studies. Large, rapidly collected genomic datasets are used to capture biodiversity and for monitoring global and national scale disease transmission patterns, among many other applications. Updating homologous sequence datasets with new samples is cumbersome, requiring excessive program runtimes and data processing. We describe Extensiphy, a bioinformatics tool to efficiently update multiple sequence alignments with whole‐genome short‐read data. Extensiphy performs reference based sequence assembly and alignment in one process while maintaining the alignment length of the original alignment. Input data‐types for Extensiphy are any multiple sequence alignment in fasta format and whole‐genome, short‐read fastq sequences. To validate Extensiphy, we compared its results to those produced by two other methods that construct whole‐genome scale multiple sequence alignments. We measured our comparisons by analysing program runtimes, base‐call accuracy, dataset retention in the presence of missing data and phylogenetic accuracy. We found that Extensiphy rapidly produces high‐quality updated sequence alignments while preventing alignment shrinkage due to missing data. Phylogenies estimated from alignments produced by Extensiphy show similar accuracy to other commonly used alignment construction methods. Extensiphy is suitable for updating large sequence alignments and is ideal for studies of biodiversity, ecology and epidemiological monitoring efforts.

From a human health perspective, rapidly updated phylogenies are pivotal to tracing and understanding pathogen outbreaks (Hadfield et al., 2018). With sequencing rates producing more genomic data than ever before, the barrier for studies of ecology, evolution and biodiversity is now the process of organizing and manipulating data prior to estimating phylogenies (Hodcroft et al., 2021).
Adding new data to a phylogeny first requires that the new data be incorporated into a key underlying data structure, the homologous sequence alignment. Homologous sequence alignments, also known as multiple sequence alignments, capture the shared evolutionary origin of any number of sequences arranged with pairwise awareness of sequence homology (Chenna et al., 2003;Swofford et al., 1996). Alignment as a procedure is the process of finding homology between two or more DNA sequences (Kim et al., 2015;Vasimuddin et al., 2019). The procedure of multiple sequence alignment is computationally challenging and must be repeated when new data are added to existing alignments (Chenna et al., 2003;Field et al., 2018;Liu et al., 2012;Treangen et al., 2014;Wang & Jiang, 1994). While recent methods have improved the efficiency of aligning datasets of many taxa and long sequences, the continuing expansion of empirical genomic datasets makes the necessary data processing cumbersome (Eddy, 2009;Grad et al., 2016;Hadfield et al., 2018;Leebens-Mack et al., 2019;Liu et al., 2012;NCBI, 2020;Nguyen et al., 2015). The National Center for Biotechnology Information (NCBI) pathogen database contains 14,915 Neisseria gonorrhoeae samples along with other pathogens with more than 340,000 samples (NCBI, 2020). The task of assembling these genomes, extracting loci-of-interest and aligning the updated datasets, while not intractable, will be formidable and highlights why novel methods for updating genomic datasets are necessary.
An additional problem when updating an existing MSA from large, rapidly growing genomic databases is the probability of introducing missing or incomplete data. 'Missing data' may be due to biological reality, such as the evolutionary process of insertions and deletions, or can be a bioinformatic artefact such as low sequencing coverage or read quality in some genomic regions. It has been demonstrated that biological reality and bioinformatic artefacts can interact in driving patterns of missing data across the genome, as rapidly evolving regions are more likely to have reads fail to map, resulting in the appearance of missing data (Huang & Knowles, 2016). Researchers have studied the effect of missing data in evolutionary analyses for decades (Driskell et al., 2004;Huang & Knowles, 2016;Lemmon et al., 2009;Molloy & Warnow, 2018;Wilkinson, 1995;Xi et al., 2016). As such, the effect of missing data on evolutionary analyses has been hotly debated (Capella-Gutiérrez et al., 2009;Castresana, 2000;Talavera & Castresana, 2007;Treangen et al., 2014;Xi et al., 2016).
Methods of alignment trimming are based on cutoffs of the number of taxa which are missing a particular locus, removing the locus for all taxa (Capella-Gutiérrez et al., 2009;Castresana, 2000;Criscuolo & Gribaldo, 2010;Treangen et al., 2014). Alignment trimming programs often include strict default settings but allow for user specified inputs in order to tailor datasets for the question at hand (Castresana, 2000;Treangen et al., 2014). In general, missing data tend to be less problematic for phylogenetic estimation when it is randomly distributed across the phylogeny, and more problematic when there is a correlation between phylogeny and missingness (Huang & Knowles, 2016;Lemmon et al., 2009;Streicher et al., 2016). Wholesale removal of these regions from analyses can therefore bias estimates of evolutionary rate, affecting branch lengths, topology and bootstrap support (Huang & Knowles, 2016;Streicher et al., 2016). This bias can shorten branch lengths if predominantly variable regions are removed (Huang & Knowles, 2016), or lengthen branch lengths if invariant characters are dropped from the analysis (Felsenstein, 1992;Leaché et al., 2015;Lewis, 2001). Moreover, trimming alignment regions with high proportions of missing data can preclude potentially informative downstream analyses. Analyses of sequence selection and adaptation, often assessed using ratios of synonymous and non-synonymous mutations between taxa, also rely on multiple sequence alignments as statements of orthology (Briggs et al., 2009;Huerta-Cepas et al., 2016;Rocha et al., 2006). Studies in various biological fields describe removing missing data from selection analyses, either by the removal of any missing data or by cutoff values for the number of taxa with missing data at a site (Hodgins et al., 2016;Murolo & Romanazzi, 2015;Williamson et al., 2014). 
While these methods may be appropriate for within-locus missing data, the automated removal of sequences flanking missing data sites could bias investigations of adaptation. Simply put, if a locus has been removed from an alignment, no further analyses may be performed using it once new data are added to the alignment.
To address the problem of rapidly updating sequence alignments with unprocessed whole-genome sequence data while maintaining input alignment length, we introduce Extensiphy. Extensiphy uses efficient reference based sequence assembly to add homologous loci to existing multiple sequence alignments. Extensiphy performs sequence assembly, locus extraction and alignment of new data to the original dataset in a single process. The intended utility of Extensiphy is to incorporate new unassembled sequence data (e.g. raw reads) into existing alignments for phylogenetic analyses. Here we describe the Extensiphy method and compare its speed and accuracy to a standard de novo assembly workflow and to Snippy, a commonly used reference alignment method for calling single nucleotide polymorphisms (SNPs) (Bankevich et al., 2012;Seemann, 2021;Treangen et al., 2014).
We investigate Extensiphy's performance compared to these other methods by running each workflow on an empirical N. gonorrhoeae dataset as well as a simulated sequence dataset. Each method was assessed using metrics of program runtime, dataset retention, basecall comparison and phylogenetic distances.

| Overview of Extensiphy
A standard run of Extensiphy accepts a multiple sequence alignment (MSA) and any number of high-throughput read files for newly sequenced samples. The MSA may contain any number of concatenated loci, here referring to genes or lengths of DNA sequences appended together. Extensiphy can accept both paired-end and single-end high-throughput short-read files. An arbitrary reference sequence is chosen from the taxa in the alignment for read alignment.
After a reference is selected, all reads are aligned to the concatenated reference sequence. Following read alignment, nucleotides are called to create a consensus sequence that is homologous to all the sequences in the original MSA. All new consensus sequences are added to the multiple sequence alignment, completing assembly and sequence alignment as part of the same process. Finally, if the user opts to automate phylogeny estimation, a phylogeny based on the newly created and extended sequence alignment is estimated using a maximum-likelihood framework. A default run of Extensiphy is visually described in Figure 1. Alternative options for Extensiphy parameters and functionality are described in the following sections.

| File inputs, reference selection and read alignment
Extensiphy takes as input a single, concatenated MSA file or any number of unconcatenated single-locus MSA files with identical taxon labels. If multiple single-locus files are chosen, sequences corresponding to each taxon are concatenated into a single sequence and all sequences are combined into a single multiple sequence alignment containing all sequences for all taxa. Reference selection by default selects the first taxon in the alignment to use as the reference. The user may also specify the selection of a specific reference.
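The concatenation of single-locus files described above can be illustrated with a minimal sketch. It is not taken from the Extensiphy source; the function names `parse_fasta` and `concatenate_loci`, and the decision to record locus boundaries as half-open coordinate pairs, are assumptions made for illustration only.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of taxon label -> sequence (input order kept)."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].split()[0]
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def concatenate_loci(locus_alignments):
    """Join single-locus alignments (a list of dicts) into one concatenated MSA.

    Returns the concatenated alignment and the (start, end) coordinates of
    each locus within it. Recording boundaries is an illustrative choice.
    """
    taxa = list(locus_alignments[0])
    concatenated = {taxon: "" for taxon in taxa}
    boundaries = []
    offset = 0
    for alignment in locus_alignments:
        # Identical taxon labels are required across all single-locus files.
        if set(alignment) != set(taxa):
            raise ValueError("every locus must contain the same taxon labels")
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must share one length")
        locus_length = lengths.pop()
        for taxon in taxa:
            concatenated[taxon] += alignment[taxon]
        boundaries.append((offset, offset + locus_length))
        offset += locus_length
    return concatenated, boundaries
```

Taxa are matched by label rather than by position in the file, so the single-locus files may list taxa in different orders.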

| Variant calling and consensus sequence construction
Following read alignment, SAM files are passed to programs for variant calling. Reference sequence indexing is performed by Samtools Faidx (Li et al., 2009). SAM files are converted to binary alignment mapping (BAM) files by Samtools View (Li et al., 2009). Once SAM to BAM conversion is complete, BAM file organizing is performed by Samtools Index (Li et al., 2009). Variant nucleotide calling is performed by Mpileup from the Bcftools suite (Li et al., 2009). Mpileup produces a Variant Call File (VCF; Danecek et al., 2011). Following VCF production, insertions and deletions are removed as these events usually prevent shared synteny between aligned sequences.
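The indel-removal step can be sketched as a filter over VCF data lines. This is an illustration under stated assumptions, not the filter used inside Extensiphy: it treats a record as an indel if its INFO column carries the `INDEL` flag (which bcftools mpileup sets on indel records) or if any alternate allele differs in length from the reference allele. The function names `is_indel` and `drop_indels` are hypothetical.

```python
def is_indel(record_line):
    """Return True if a tab-separated VCF data line describes an indel."""
    fields = record_line.rstrip("\n").split("\t")
    ref, alt, info = fields[3], fields[4], fields[7]
    if "INDEL" in info.split(";"):
        return True
    # Ignore the missing-value marker and symbolic alleles such as <*>.
    alleles = [a for a in alt.split(",") if a != "." and not a.startswith("<")]
    return any(len(allele) != len(ref) for allele in alleles)


def drop_indels(vcf_lines):
    """Keep header lines and every record that is not an insertion or deletion."""
    return [
        line for line in vcf_lines
        if line.startswith("#") or not is_indel(line)
    ]
```

Removing these records keeps every retained call at a reference coordinate, so the consensus sequence stays the same length as the reference.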
The cleaned VCF is then converted to a fastq format file by vcfutils.pl and then to a fasta format file by seqtk (Danecek et al., 2011;Gordon & Hannon, 2021;Heng, 2021). Finally, gaps in the original reference sequence are added to the new consensus sequence to preserve synteny. The fully constructed consensus sequence is then appended to the updated alignment file.
Phylogenetic estimation is performed by RAxML with the GTRGAMMA model of nucleotide substitution (Stamatakis, 2014). Extensiphy can perform a de novo phylogenetic estimation or, when updating an existing phylogeny, may use a tree produced from the original MSA as a starting tree to improve the search of tree space. The purpose of the starting tree is to build on the evolutionary estimations of the original phylogeny.
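The gap re-insertion step that follows consensus construction can be illustrated as follows. The sketch assumes that reads were aligned to the reference with its alignment gaps stripped, so the consensus has exactly one base per non-gap reference position. The function name `restore_reference_gaps` is hypothetical and the code is not drawn from the Extensiphy implementation.

```python
def restore_reference_gaps(aligned_reference, consensus, gap="-"):
    """Copy the gap columns of the aligned reference into a new consensus.

    aligned_reference: the reference row as it appears in the input MSA.
    consensus: the gap-free consensus called against the ungapped reference.
    """
    ungapped_length = sum(1 for base in aligned_reference if base != gap)
    if len(consensus) != ungapped_length:
        raise ValueError("consensus must match the ungapped reference length")
    bases = iter(consensus)
    restored = []
    for ref_base in aligned_reference:
        # A gap column in the reference stays a gap in the new sequence;
        # every other column takes the next consensus base.
        restored.append(gap if ref_base == gap else next(bases))
    return "".join(restored)
```

The returned sequence has the same number of columns as the aligned reference, so it can be appended to the alignment without changing the alignment length.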
If the input was multiple single-locus alignment files, the user may also choose to split the final, updated alignment back into single-locus multiple sequence alignment files, for example, for the estimation of gene trees or a species tree by way of summary methods (Yin et al., 2019). RAxML using the GTRGAMMA model is the only option for phylogenetic estimation currently implemented within Extensiphy. However, as a default execution of Extensiphy outputs an updated alignment, users are free to apply any available method of phylogenetic estimation by using the output alignment as the input for an alternative method. For example, when updating multiple single-locus alignment files, a more appropriate method of estimation may be available for inferring a species tree from single-locus alignments. While Extensiphy does not automate running a placement algorithm, the updated alignment and original phylogeny can easily be used as inputs to software that places the new sequences without updating the input relationships (Matsen et al., 2010). Due to Extensiphy's focus on adding large amounts of new sequence data to existing alignments, users may specify removing intermediate output files used during consensus sequence production to reduce unnecessary on-disk storage. Phylogenetic inference may be skipped altogether if only an updated sequence alignment is desired.
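Splitting an updated, concatenated alignment back into single-locus alignments is straightforward once the column range of each locus is known. The sketch below assumes those ranges are available as half-open `(start, end)` pairs; how Extensiphy itself tracks locus positions is not described here, so both the `boundaries` argument and the function name `split_loci` are hypothetical.

```python
def split_loci(concatenated, boundaries):
    """Cut a concatenated alignment into one alignment per locus.

    concatenated: dict of taxon label -> aligned sequence (equal lengths).
    boundaries: list of half-open (start, end) column ranges, one per locus.
    """
    lengths = {len(seq) for seq in concatenated.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    total = lengths.pop()
    loci = []
    for start, end in boundaries:
        if not 0 <= start < end <= total:
            raise ValueError(f"locus range {(start, end)} is outside the alignment")
        loci.append({taxon: seq[start:end] for taxon, seq in concatenated.items()})
    return loci
```

Because every sequence keeps the length of the input alignment, the same boundaries apply to the original taxa and to the newly added ones.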

| Program comparison overview
Extensiphy produces an alignment of homologous sequence data. In order to assess Extensiphy's ability to produce useful data, we compared Extensiphy's alignment to similar alignments produced by contemporary programs and methodologies. In addition to comparing the alignments, we also compared phylogenies produced from alignments, and overall program runtimes. Based on previous literature, we identified two dominant approaches for constructing alignments with a focus on outputs used for evolutionary analyses: de novo sequence assembly followed by core genome alignment and read alignment to reference genome followed by SNP calling (Bush et al., 2020;Castresana, 2000;Seemann, 2021;Treangen et al., 2014). We chose the pipeline Snippy to represent read alignment and variant calling methodologies due to its results in program runtime and SNP calling accuracy (Bush et al., 2020).
Following light quality trimming with BBDUK (Bushnell, 2021), we chose to perform de novo sequence assembly with SPAdes and homologous locus selection with ParSNP (Bankevich et al., 2012;Treangen et al., 2014). SPAdes has been used to assemble genomic sequences in numerous studies for a variety of subject organisms. ParSNP is routinely cited in studies involving evolutionary analyses with topics on the microbial tree of life, the evolution of antibiotic resistance in Staphylococcus aureus and genomic analysis of antibiotic susceptibility in N. gonorrhoeae (Chen et al., 2020;Gernert et al., 2020;Shakya et al., 2020).
We ran each of these approaches on a simulated dataset and an empirical dataset and assessed the outputs. The simulated dataset was used to test all aspects of interest: program runtime, base-call accuracy, dataset retention and phylogenetic accuracy. The empirical dataset was used to test program runtime, and the resulting alignments and phylogenies produced by each method were compared to each other to note discrepancies. The comparison software was primarily written in Bash shell scripts and Python, and these scripts as well as the configuration files for TreeToReads are shared on GitHub at https://github.com/jtfield/phylo_comparison. There are two versions of the code, one each for analysing simulated and empirical sequence data. The empirical data comparison software requires whole-genome short-read sequences. The software for analysing simulated data requires the same input parameters with the addition of the phylogeny and genomes that were used to simulate the raw read sequences. Details on configuring the comparison software are available in the manual packaged with the software.

| Datasets
To construct our simulated high-throughput dataset with a known phylogenetic topology, we used TreeToReads (McTavish, Pettengill, et al., 2017). TreeToReads takes as input a phylogeny, evolutionary model parameters and a reference sequence that serves as the template for simulating all additional sequences. In order to generate an input phylogeny for simulation, we obtained 209 N. gonorrhoeae raw read files in fastq format from the CDC (Centers for Disease Control and Prevention, USA) used in a 2016 study of the evolutionary relationships of antibiotic resistant N. gonorrhoeae (Grad et al., 2016).
We replaced all isolate names with random identifiers before phylogenetic estimation. The resulting phylogeny was used as the input phylogeny for TreeToReads. We used a 51,924 bp segment of a complete N. gonorrhoeae genome (GenBank: NC_002946.2) as the reference sequence. The NC_002946.2 sample was also used as the reference in all instances of reference-based read alignment when processing the empirical dataset. To introduce sequence variation, 3,000 variant nucleotides were uniformly distributed throughout the reference genome and reads of 100 nucleotides were generated at an average of 20 reads per site. To simulate sequences and reads, we used the evolutionary rate model estimated by RAxML from the 2016 study isolates (Rambaut & Grassly, 1997).

| De novo sequence assembly and selection of loci
During the de novo assembly and automated locus selection pipeline, for both the empirical and simulated datasets, bases with a quality score of 10 or below were trimmed from the raw reads. We also removed any sequencing adapters included in the BBDUK default adapters file (Bushnell, 2021). De novo sequence assembly was performed on the trimmed read files to construct contigs for all taxa in the dataset. De novo sequence assembly was performed by SPAdes using default parameters with the exception of additional computing cores (Bankevich et al., 2012). Following assembly, the core genome for all assembled sequences was selected using ParSNP (Treangen et al., 2014). Core genomes are defined as sets of orthologous sequences that are conserved in all included taxa (Hodgins et al., 2016).
ParSNP identifies core genomes using maximal unique matches between sequences to capture conserved blocks of sequence in highly similar sets of genomes. Regions with missing data are not included in the final core genome, resulting in separate locus alignments.
The selected loci were concatenated into a single alignment while the separate locus alignments were retained for downstream base-call analyses. While ParSNP includes options to alter the sequence distance between acceptable matches used for identifying core genome sequences, all options were left as defaults for our analyses.

| Read alignment and SNP calling with Snippy
For both the empirical and simulated datasets, Snippy was run using the chosen reference sequence and the raw reads as inputs. Snippy aligned reads to the reference and replaced reference nucleotides with taxon-specific variants where appropriate. The output of the Snippy runs was alignments with sequence lengths matching the reference sequence. The empirical dataset used a contiguous N. gonorrhoeae genome sequence as a reference while the simulated dataset used the sequence input into TreeToReads for sequence simulation.

| Read alignment and SNP calling with Extensiphy
In order to create an input alignment for use with Extensiphy, we selected four random taxa and assembled their genomes in the same manner as the de novo assembly stage described above. We created a core genome alignment for these four taxa and the selected reference sequence using ParSNP (Treangen et al., 2014). This small set of taxa produced a set of loci that were influenced by the missing data found in the five included taxa. The homologous loci of this smaller dataset were concatenated and used as the input alignment for Extensiphy, along with raw read sequences corresponding to the rest of the taxa. Extensiphy processed the concatenated alignment and raw read input files and produced an updated multiple sequence alignment and a phylogeny based on the alignment. Once phylogenetic estimation was complete, the concatenated sequence alignment was split into individual locus alignments in preparation for base-call comparisons.

| Phylogenetic analysis
For all datasets, phylogenetic estimation was performed on the concatenated alignment using RAxML to produce a maximum-likelihood topology and a consensus topology based on 100 bootstrap replicates (Stamatakis, 2014). We used the GTRGAMMA model for all estimations as this model is the most flexible maximum-likelihood model, and the only one available in RAxML.

| Program output comparison overview
We assessed each methodology using three metrics: program runtime, base-call accuracy and phylogenetic accuracy. The methods of measuring program runtime were identical regardless of the dataset. We assessed individual time to assemble each single sequence and the total time for a program to assemble a complete alignment. The time required for phylogenetic estimation was not included for any program.
Base-call comparisons, when using the simulated dataset, benefit from comparing each program's outputs to the original TreeToReads sequences used to simulate the input data for each program. By using the original TreeToReads sequences, we collected an accurate description of which nucleotides were correctly and incorrectly called. The true base-calls of any empirical sequence are unknown. With this limitation in mind, we compared the sequence outputs of each program to their counterparts from each other program when assessing sequences produced from the empirical dataset. We assessed base-calls pairwise from any locus present in the output of any two programs. This conservative comparison was necessary due to the variation in the length of the sequences output by each program. Consequently, each sequence comparison was limited to the length of the shortest sequence. Phylogenies produced from the simulated dataset were compared to the original topology used by TreeToReads for sequence simulation. For the empirical dataset, the phylogeny produced by each program was compared to each other program's phylogeny. We compared majority-rule consensus phylogenies on bootstrapped data for all comparisons to account for stochastic variation in inferences of very short branches.

| Program runtime comparisons
We defined program runtime as two values: the time taken to assemble and output the sequence associated with a single taxon and the total program runtime for assembling all taxon sequences and outputting a complete sequence alignment. All three programs reported the time required for individual sequence alignment and assembly. The total program runtimes to produce a complete alignment were recorded.

| Program base-call comparisons
For simulated dataset base-call comparisons, each taxon's sequences were aligned to the original genomes produced by TreeToReads.
Extensiphy and de novo assembled sequences were separate loci for each taxon. Snippy sequences, being duplicates of the reference sequence with variant nucleotides inserted, were the same length as the reference sequence. A base-call comparison was made once two sequences were aligned by noting which nucleotides in one sequence were identical to the paired sequence produced from the other program. Identical nucleotides, non-identical nucleotides, non-identical degenerate nucleotides and gaps within the sequences were counted and summed for each locus. The lengths of all loci were also recorded for Extensiphy and the de novo pipeline. Additional metrics collected from the simulated data analyses were the total number of bases analysed, the per-base miscall and missing data rate for each program and, when comparing Extensiphy and de novo assembled sequences, the discrepancy in length between the sequences output by each program and the sequences produced by TreeToReads. For empirical dataset base-call comparisons, each taxon's sequences were aligned to the sequences produced by both other programs. Additional metrics collected from the empirical data analyses were the total number of bases analysed, the per-base disagreement between each sequence and, when comparing Extensiphy and de novo assembled sequences, the discrepancy in the length of the compared loci.
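The per-site tally described above can be sketched in a few lines. This is not the comparison code from the shared repository. The category names, the function name `compare_base_calls` and the order in which categories are checked (gap first, then identity, then degeneracy) are assumptions made for illustration.

```python
UNAMBIGUOUS = set("ACGT")
GAP = "-"


def compare_base_calls(seq_a, seq_b):
    """Tally site-by-site agreement between two already-aligned sequences.

    The comparison is limited to the length of the shorter sequence.
    """
    counts = {"identical": 0, "non_identical": 0, "degenerate": 0, "gap": 0}
    # zip() stops at the end of the shorter sequence.
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP or b == GAP:
            counts["gap"] += 1
        elif a == b:
            counts["identical"] += 1
        elif a not in UNAMBIGUOUS or b not in UNAMBIGUOUS:
            # Sites that differ only because one call is a degenerate code.
            counts["degenerate"] += 1
        else:
            counts["non_identical"] += 1
    counts["compared"] = min(len(seq_a), len(seq_b))
    return counts
```

A per-base miscall rate for a locus can then be taken as `counts["non_identical"] / counts["compared"]`, and the counts from individual loci summed per taxon or per program.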

| Phylogenetic comparisons
Phylogenies estimated from each program's alignment were compared using Robinson-Foulds (RF) distances, the symmetric difference of partitions between two phylogenies, calculated with the DendroPy Python library (Robinson & Foulds, 1981;Sukumaran & Holder, 2010).
All RF distances were calculated as unweighted, expressing only the symmetric differences in branches between topologies.
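To make the unweighted RF distance concrete, the following dependency-free sketch counts the bipartitions found in one topology but not the other. It is a stand-in for illustration, not the DendroPy routine used in this study. Trees are written as nested tuples of leaf labels rather than parsed from Newick, and the function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits induced by the internal edges of a tree."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Always store the side of the split that lacks the anchor leaf,
        # so rooted and unrooted drawings of a topology give equal sets.
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Unweighted RF distance: number of splits present in exactly one tree."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

For example, `robinson_foulds((("A", "B"), "C", ("D", "E")), (("A", "C"), "B", ("D", "E")))` counts the two splits that group A with B in one tree and A with C in the other, while the shared D and E split contributes nothing.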

| Runtime
Using Extensiphy, individual sequences were assembled at a mean rate of 4 s per sequence and the overall program runtime was completed in 6 min and 45 s (Table 1)

| Alignment length
Extensiphy returned 209 sequences at 51,157 nucleotides each for a total of 10,691,913 nucleotides in the final alignment, including the reference sequence (Table 1)

| Alignment accuracy
Extensiphy's sequences produced the lowest miscall rate at 15 nucleotides while the de novo pipeline's alignment contained 21 miscalled nu-

| Phylogenetic accuracy
Extensiphy produced a phylogeny with an RF distance to the true topology of 56 while the de novo pipeline's phylogeny received an RF distance of 55 and Snippy produced a phylogeny with an RF distance of 98 (Table 1).

| Runtime
When processing and analysing data from the empirical dataset, Extensiphy produced consensus sequences in a mean time of slightly over 6 min and produced a complete alignment in 38 hr ( Figure 2; Table 2). The de novo pipeline assembled sequences in a mean time

| Alignment length
Individual sequences produced by Extensiphy were all of 1,859,910 nucleotides in length for a total of 2.293 × 10^9 nucleotides in the final alignment (Table 3). Individual sequences produced by Snippy were 2,180,847 nucleotides in length for a total of 2.732 × 10^9 nucleotides in the final alignment. Locus values were not reported for Snippy as Snippy operates using whole-genome inputs and outputs.

| Alignment accuracy
We assessed empirical base-calls for the outputs of all three programs against each other as true base-calls cannot be described with certainty for empirical sequence data (Table S2)

| Missing data
We assessed empirical missing data in the same manner as empirical base-calls, that is, by comparing the outputs of each program against each other. The Extensiphy-de novo pipeline comparison contained 81,035 differing gaps or degenerate nucleotides from 31,909,017 analysed sites between both alignments (

| Phylogenetic accuracy
When analysing the RF distances between the phylogenies produced by each program, the Extensiphy-de novo pipeline comparison produced an RF distance of 687 and the Extensiphy-Snippy comparison produced an RF distance of 749 (Table 4). The de novo pipeline-Snippy comparison produced an RF distance of 676.

FIGURE 2 The time required by each method to assemble all sequences associated with each taxon in the empirical dataset

| DISCUSSION
Sequencing efforts are expanding for the collection of genomic data (Goodwin et al., 2016;Hodcroft et al., 2021;Mardis, 2017). Current methods for incorporating new data into sequence alignments exist but are inadequate for whole-genome datasets with thousands of taxa (Eddy, 2009;Nguyen et al., 2015). While combining new and previously analysed data during de novo alignment construction is a routinely performed workflow, this process can result in alignment trimming that can remove potentially useful data from a dataset (Huang & Knowles, 2016). To address issues of expanding existing sequence alignments, we introduced the Extensiphy program and compared its outputs to those of two workflows with comparable outputs.
Our results show that Extensiphy balances data retention, runtime efficiency and applicability to genomic datasets. Extensiphy returned alignments with sequence lengths matching those of the input alignment and containing a lower proportion of degenerate or gap sites than other methods. Extensiphy accommodated and returned an alignment with sequences of lengths comprising over 90% of the N. gonorrhoeae genome. All sequences were assembled in competitive times compared to other analysed methodologies. If the starting point of a study is an existing concatenated alignment or set of alignments for the same taxa and a set of whole-genome short-read data, and the goal is to rapidly add the new data to the alignment, Extensiphy will produce the desired results. Additionally, we argue that the analyses of both the simulated and empirical datasets demonstrate that Extensiphy performs equally well when updating alignments with any number of loci and inputs of either separate alignments or a single, pre-concatenated alignment. While these two features are simple in terms of modern bioinformatics tools, their presence expands the scope of studies for which Extensiphy may be appropriate. By accommodating any number of loci, Extensiphy is applicable to any scale of project, from inquiries with a single or a few loci to full-scale epidemiological monitoring efforts (Grad et al., 2016;Hadfield et al., 2018;Hodcroft et al., 2021). By accepting either individual locus alignments or a concatenated alignment, Extensiphy does not constrain the user to a specific method of phylogenetic estimation.
Extensiphy is designed to integrate new genomic data with existing datasets. The approach targets computational effort by assembling new loci aligned directly to existing loci, rather than to a reference genome. Extensiphy does not require a full reference genome, and can be applied to integrating sequences from whole-genome data into even single-locus datasets. These few or single-locus datasets form the phylogenetic backbone of our understanding of many taxa.
As part of this framework, Extensiphy also allows for the selection of a reference sequence already found in an existing alignment.
This provides an opportunity to assess the role of choice of reference sequence in consensus sequence inference. While reference-based read alignment is an excellent flexible method for many studies, the choice of reference sequence can inherently bias downstream analyses (Brandt et al., 2015;Günther & Nettelblad, 2019). Reference bias is a well-known potential influence on sequence structure during read alignment based on the structure of the reference (Günther & Nettelblad, 2019;Ros-Freixedes et al., 2018). The extent to which reference bias affects phylogenetic estimation is still ambiguous.
Extensiphy, paired with the methodologies of sequence and phylogenetic comparison we describe in this study, offers an excellent opportunity to repeatedly measure the effects of constructing alignments based on diverse reference sequences. By running the same analyses using different references with known phylogenetic relationships to each other, it is straightforward to use Extensiphy to assess if this bias is playing a role in one's own dataset.
Acknowledging and addressing missing data are key issues in modern phylogenomics. Current research argues for a case-by-case strategy on including or excluding missing data (Huang & Knowles, 2016;Streicher et al., 2016). The distribution of missing data throughout an alignment influences such decisions (Lemmon et al., 2009). Assuming a relatively even distribution of missing data, alignment trimming may not be necessary and such trimming could remove valuable variant nucleotides from future analyses. In the presence of an uneven distribution of missing data, perhaps due to sequencing bias, a study could benefit from judicious locus removal (Streicher et al., 2016). Extensiphy finds a 'middle ground' with respect to retaining full loci-of-interest while introducing a minimum of missing data. Using Extensiphy, all input loci are maintained while updating an alignment, preventing loci from fragmenting into smaller sequence segments as seen when using ParSNP in the de novo pipeline. Moreover, a smaller percentage of missing data was found in the Extensiphy alignment compared to the alignment produced by Snippy. While the Snippy alignment did contain more sites, expressed as the full length of the reference sequence for each taxon, the difference in size between the Snippy alignment and the Extensiphy alignment is modest compared to the amount of missing data found in the Snippy alignment. Such a percentage of missing data could affect inferred phylogenies by biasing branch lengths, potentially misleading conclusions based on those phylogenies. Extensiphy rapidly returns an updated alignment while minimizing missing data and enabling researchers to make decisions on the inclusion or excision of loci. Ultimately, all three methods tested here produced accurate estimates and useful alignments, and the choice among the approaches described here depends on the researchers' goal.

| CONCLUSIONS
Updating a multiple sequence alignment previously required trade-offs of program runtime, reference sequence availability and dataset trimming and fragmentation. We have introduced Extensiphy, a program that updates alignments of loci with new data, and compared it to two popular alternative methods. Extensiphy is applicable to any project with a starting alignment and new whole-genome short-read data. Alignments may be concatenated or separate single-locus alignments. Extensiphy offers an efficient and flexible solution to any study producing high volumes of whole-genome data, particularly for disease monitoring purposes. Projects where maintaining locus length and preventing alignment trimming due to missing data are important will find Extensiphy particularly useful. Extensiphy produces updated alignments suitable for multiple methods of phylogenetic estimation and base-call accuracy comparable to standard methods in the field of bioinformatics. Updating sequence alignments with Extensiphy removes the burden of data processing from researchers and enables them to focus on the purpose and applications of their research.

ACKNOWLEDGEMENTS
Research was supported by the grant 'Cultivating a sustainable

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13790.

DATA AVAILABILITY STATEMENT
Extensiphy is open source software utilizing software written by other developers. The Extensiphy pipeline itself is available on GitHub https://github.com/McTavishLab/extensiphy and on Zenodo https://doi.org/10.5281/zenodo.5770686 (Field, 2021b). The comparison pipelines are also open source software pipelines and are available on GitHub https://github.com/jtfield/phylo_comparison and on Zenodo https://doi.org/10.5281/zenodo.5770698 (Field, 2021c). All accession numbers for samples and alignments, as well as the simulated data files used in this study, are publicly available on Dryad Digital Repository https://doi.org/10.6071/M38T0T (Field, 2021a).