homologizer: Phylogenetic phasing of gene copies into polyploid subgenomes

Organisms such as allopolyploids and F1 hybrids contain multiple distinct subgenomes, each potentially with its own evolutionary history. These organisms present a challenge for multilocus phylogenetic inference and other analyses since it is not apparent which gene copies from different loci are from the same subgenome and thus share an evolutionary history. Here we introduce homologizer, a flexible Bayesian approach that uses a phylogenetic framework to infer the phasing of gene copies across loci into their respective subgenomes. Through the use of simulation tests, we demonstrate that homologizer is robust to a wide range of factors, such as incomplete lineage sorting and the phylogenetic informativeness of loci. Furthermore, we establish the utility of homologizer on real data, by analysing a multilocus dataset consisting of nine diploids and 19 tetraploids from the fern family Cystopteridaceae. Finally, we describe how homologizer may potentially be used beyond its core phasing functionality to identify non‐homologous sequences, such as hidden paralogs or contaminants.


| INTRODUC TI ON
Some individual organisms-such as allopolyploids and F1 hybridscontain multiple distinct subgenomes, each with its own evolutionary history (in the case of polyploids, the homology of these subgenomes is related to a genome duplication event and they are referred to as "homeologous"). Such organisms present a particular challenge for multilocus phylogenetic inference because a researcher must take care to avoid assuming that gene copies from different loci share an evolutionary history, that is, that they are from the same subgenome. For example, a diploid F1 hybrid individual will have one copy of a given nuclear locus from species A, and another from species B. If the copy of one locus from species A is treated as sharing an evolutionary history with the copy of another locus from species B (e.g. by mistakenly assigning them to the same haplotype in concatenated or multispecies coalescent analysis), the underlying bifurcating model will be violated and the inferences will be unreliable (McDade, 1990(McDade, , 1992Oxelman et al., 2017;Rothfels, 2021).
This issue is not limited to multilocus phylogenies of single-copy nuclear loci-the same problem applies when attempting to include organellar and nuclear loci in a common analysis, or when analysing ITS sequences with other loci.
Phasing copies across loci are related to the problem of haplotype assembly-the phasing of sequencing reads within a locus: in both cases, the goal is to avoid chimeric data that are a mix of multiple evolutionary histories. In the assembly problem, however, a researcher can rely on physical linkage to determine which reads belong to which haplotype (Kates et al., 2018;Majidian et al., 2020;Nauheimer et al., 2021;Schrinner et al., 2020;Tiley et al., 2021). This approach is not available in the locus-phasing case, where the loci are separated from each other by unsequenced regions; the only information available to determine whether two gene copies come from the same subgenome is in the phylogenetic history itself. Note that we use the term "gene copies" to refer both to copies that evolved in separate subgenomes (and thus do not share the same evolutionary history) and those that arose from allelic variation within the same subgenome (and thus do share the same species-level evolutionary history).
The need to phase loci is particularly prevalent in plants, since over one-third of extant species are inferred to be recent polyploids (Såstad, 2005;Wood et al., 2009). However, the issue is not restricted to plants. For example, both insects and diatoms have been reported to have an extensive history of polyploidy Parks et al., 2018); all salmonid fish are ancestrally polyploid (Alexandrou et al., 2013); and squamates and amphibians have clades with frequent allopolyploid lineage formation (Bogart, 1980;Bogart & Licht, 1986;Hedges et al., 1992;Lowe & Wright, 1966).
In theory, researchers could get around this locus-phasing problem by inferring individual gene trees, and then using the phylogenetic position of the copies to determine to which subgenome they belong (the "phase" of the copies). For example, if a polyploid always has two gene copies, one of which is closely related to diploid species A and the other to diploid species B, and this is true across all loci sampled, then one could confidently conclude that the "A" copies share one evolutionary history and the "B" copies another (e.g. Dauphin et al., 2018;Sessa et al., 2012b). However, given variable amounts of missing data (such as failure to recover one of the copies for a subset of the loci, or failure to sequence the related diploids for all loci) and the phylogenetic uncertainty inherent in inferring single-gene phylogenies, this method can be exceedingly difficult to apply, especially in datasets with many polyploids, or with many loci (Bertrand et al., 2015).
Beyond this by-eye approach, and approaches that rely upon existing reference sequences, such as that of Hénocq et al. (2020) and Nauheimer et al. (2021), there are, to our knowledge, three currently available methods for phasing copies across loci. First, Bertrand et al. (2015) developed an approach to phase two loci by finding the largest set of sequence pairs such that any incongruence between the two resulting gene trees could be due to stochastic or coalescent error. A somewhat similar approach was developed and refined by Oberprieler and colleagues (Lautenschlager et al., 2020;Oberprieler et al., 2017). Instead of explicitly testing for sources of incongruence, this method looks for the subgenome assignments of copies within a locus, and then across loci, that minimizes the number of inferred deep coalescent events in the context of the phylogeny of the diploid species. Both methods are relatively fast, although the complexity of the analysis increases quickly with the number of polyploid accessions and with the number of loci. Perhaps more fundamentally, these methods add in accessions sequentially, so the phylogenetic information in the full dataset is not available to inform the phasing of each polyploid copy.
In contrast to the two methods above, which require an iterative series of analyses and are based on particular test statistics, the third available method, alloPPnet (Jones, 2017;Jones et al., 2013) is a onestep parametric Bayesian model for the inference of polyploid species trees (more accurately, species networks) under the multispecies coalescent framework. As such, it explicitly models single polyploid formation events, takes into account all the data from the full sample, and, for example, requires that the post-polyploidization subgenome lineages have the same effective population size. The chief limitation of alloP-Pnet is that it cannot accommodate ploidy levels higher than tetraploid.
Here we introduce homologizer, a simple, flexible, tree-based method to infer the posterior probabilities of the phasing of gene copies across loci that can be performed on a fixed topology or while simultaneously estimating the phylogeny under a wide range of phylogenetic models, including the multispecies coalescent framework. Users can thus apply homologizer to infer the phylogeny while treating the phasing as a nuisance parameter, to infer the phasing while integrating out the phylogeny or for any combination of these goals. This method can utilize the full dataset of all loci jointly as opposed to phasing each locus individually (information about topology and subgenome identity from each locus is thus available to inform the phasing of the other loci), does not require any external data (such as a reference genome) and does not require that the progenitor diploids of the polyploid accessions be extant or sampled.
homologizer is implemented in the open-source Bayesian inference software RevBayes (Höhna et al., 2016). It consists of dedicated MCMC proposals that switch tip assignments among gene copies and MCMC monitors that log the sampled phasing assignments. These homologizer features can be included in conjunction with any of the other modular phylogenetic inference components that RevBayes offers, enabling the phasing to be estimated under a large variety of models (e.g. models that assume a shared gene tree for some or all loci, multispecies coalescent models that do or do not assume a constant population size, etc.).
We anticipate that homologizer will be of greatest interest to phylogeneticists, but its applications extend beyond phylogenetics to other cases where it is important to determine from which subgenome a particular gene copy originates, for example, studies of expression-level dominance and subfunctionalization (Edger et al., 2018).

| MATERIAL S AND ME THODS
homologizer makes use of multilocus sequence data, where a locus is a homologous genomic region present across the accessions sampled for the analysis (though the method allows for missing data). The output of the method is the posterior distribution of phased homeologs, that is, the posterior distribution of the assignments of each gene copy, for each locus, into each of the available subgenomes. If joint inference of the phasing and phylogeny is performed, the posterior distribution of the multilocus phylogeny is also inferred, along with all other parameters of the model.

| The homologizer model
The homologizer model represents phylogenies containing polyploids as multilabeled trees ("mul-trees"), where each polyploid accession is present multiple times, once for each subgenome (each distinct evolutionary history; Huber et al., 2006). Gene copies are phased into subgenomes for each locus by swapping which gene copy is assigned to each of that accession's tips in the mul-tree ( Figure 1). For each polyploid accession, there are (n!) g ways to phase the gene copies, where n is the number of gene copies per locus and g is the number of loci. The homologizer model thus adds a set of extra latent variablesgene-copy phase parameters-to the underlying phylogenetic model: where are the gene-copy phase parameters, is the tree, are all the other parameters of the model and X is the set of sequence alignments.
The user defines the set of gene-copy phase parameters , which for each accession and each locus consist of n − 1 parameters for the n gene copies to be phased. Thus, the number of such parameters increases with both the number of polyploid subgenomes and with the number of loci. A given gene-copy phase parameter can take n values, corresponding to the n possible assignments of sequences to a given tip in the mul-tree: these values are each assigned an equal prior probability. The prior probability of the tree P( ) is specified using a standard phylogenetic model such as a birth-death process or a uniform prior on the topology with exponential priors on branch lengths. The other parameters of the model, , represent a nucleotide substitution model. For example, one may specify the GTR substitution model (Tavaré, 1986) with exchangeability rates and stationary frequencies sampled from flat Dirichlet priors. The posterior probability P( , , | X) is then estimated by MCMC (Metropolis et al., 1953). P( , , | X) ∼ P(X| , , )P( , , ), F I G U R E 1 Phasing gene copies into polyploid subgenomes on a mul-tree. (a) A phylogenetic network representing a single reticulation, giving rise to an allopolyploid. (b) The mul-tree representation of this phylogenetic network has two leaves (green and blue) representing the two subgenomes of the allopolyploid. (c) Four loci were sequenced from the allopolyploid. Two copies (green and blue) of each locus were recovered. Loci 2 and 3 are incorrectly phased. (d) After phasing, each locus is assigned to the correct subgenome.

| Summarizing homologizer inferences
The phasing estimates from a homologizer MCMC analysis can be summarized and plotted using the R (R Core Team, 2013) package RevGadgets (Tribble et al., 2020), which displays the joint maximum

| Comparing homologizer models
Users interested in distinguishing gene copies that evolved in separate polyploid subgenomes from those that arose from allelic variation within the same subgenome (or that are otherwise nonhomeologous) can set up a series of homologizer models-which differ in the number of mul-tree tips available for phasing-and select the best fitting model using statistical model comparisons. To perform homologizer phasing model comparisons, the user can run a stepping-stone analysis (Xie et al., 2011) to compute the marginal likelihood of each model. The best fitting model is then selected using Bayes factors (Kass & Raftery, 1995).
Consider the example in Figure 2. Here, the polyploid accession has two subgenomes, was sequenced for four loci, and for each of those loci, two copies were recovered. It could be that each locus is represented by one copy from each of the subgenomes and thus the accession should be given two tips in the mul-tree (panel a in Figure 2). However, it is possible, instead, that the copies at one or more of the loci are haploid allelic variants from one of the subgenomes. In this case, three mul-tips are needed-two tips for one subgenome, and one tip for the other subgenome (panel b in Figure 2). To compare these two models (one with two mul-tree tips for this accession, and one with three) using Bayes factors, the data must be kept constant, and only the number of phaseassignment parameters changed. This condition is satisfied by running both models with a "third set" of blank sequences added for this accession, and assigned to a mul-tree tip, but with that tip phased only in the "three-tip" analysis. In the two-tip analysis, conversely, the phase of this tip is not estimated and is fixed to blank sequences; after the analysis, it can be pruned from the tree sample.

F I G U R E 2
Model testing for allelic variants and/or homeologs. In this example, two copies of each locus were sequenced from a single allotetraploid. (a) A phasing model that swaps the gene copies between two mul-tree tips (the green and the blue tips) assumes that the two gene copies of each locus are homeologous. (b) An alternative phasing model swaps gene copies between three mul-tree tips. This potentially allows for the gene copies of some loci to be inferred as allelic variation and the gene copies of other loci to be inferred as homeologs. In this example, the gene copies of locus 1 and locus 4 are homeologs and the gene copies of locus 2 and 3 are allelic variants. The fit of the data to these two models-mul-tree A and mul-tree B-can be compared through Bayes factors.

| Simulation tests
To evaluate the performance of homologizer, we performed two separate simulation tests. The first tested the impact of different evolutionary factors on phasing accuracy and the second focused on whether applying Bayes factors to compare homologizer phasing models can successfully distinguish between allelic versus homeologous variation.

| Experiment 1: Testing phasing accuracy
To explore the behaviour of homologizer under different evolutionary scenarios, we simulated datasets to test how the performance of homologizer was impacted by the following factors: (1) the phylogenetic informativeness of each locus, (2) the phylogenetic distance between subgenomes for each polyploid (the minimum distance in cases of more than two subgenomes), (3) the minimum phylogenetic distance from a polyploid subgenome to the nearest diploid and (4) the amount of incomplete lineage sorting (ILS) present. For each replicate, we first simulated a species tree under a constant rate birthdeath process with root age = 50.0, speciation rate = 0.2, extinction rate = 0.01, and the fraction of taxa sampled at the present = 0.01.
Each species tree was conditioned on having 20 extant tips, five of which were randomly selected to represent allopolyploid subgenomes. Two of the allopolyploid tips were then randomly selected to represent the subgenomes of a tetraploid, and the other three allopolyploid tips represented subgenomes of a hexaploid (making the species tree a mul-tree). The remaining 15 tips were considered diploid lineages that did not need phasing. Each dataset thus included two allopolyploids that we could use to assess the performance of homologizer.
Once the species tree was simulated, we used the multispecies coalescent framework to simulate four gene trees, representing four loci. We varied the effective population size of these simulations as described below. Finally, over the simulated gene trees, we simulated nucleotide sequences under an independent GTR substitution model with exchangeability rates and stationary frequencies sampled from flat Dirichlet priors. The clock rate of the GTR process was constant over all branches of the tree and fixed to 0.01. We varied the length of the simulated sequences as described below.
All the simulated nucleotide sequences were then analysed using homologizer. The phasing and the phylogeny were inferred simultaneously, linking the tree across loci (so these were "concatenated" analyses). Each locus was modelled by an independent GTR substi- in each simulated dataset with the summary statistic R g ∕ R s , where R g is the mean Robinson-Foulds distance (Robinson & Foulds, 1981) between each gene tree and the species tree, and R s is the mean Robinson-Foulds distance between each independently simulated species tree. This statistic provides an intuitive summary of the amount of gene discordance: when, for example, R g ∕ R s = 0.5 the amount of discordance between the gene trees and the species tree is 50% of what one would observe between completely unlinked trees. Therefore, simulation replicates with 0.0 < R g ∕ R s ≤ 0.1 represent low amounts of ILS, replicates with 0.1 < R g ∕ R s ≤ 0.3 represent moderate amounts of ILS and those with 0.3 < R g ∕ R s ≤ 0.5 represent high levels of ILS.

| Experiment 2: Testing Bayes factors to distinguish allelic variation from homeologs
Our second simulation test focused on testing whether Bayes factors could be used to compare homologizer phasing models and distinguish allelic variation from homeologs. For each replicate, we first simulated a species tree under a constant-rate birth-death process with root age = 50.0, speciation rate = 0.2, extinction rate = 0.01 and the fraction of taxa sampled at the present = 0.01. Each species tree was conditioned on having 20 extant tips, two of which were randomly selected to represent allopolyploid subgenomes of a tetraploid (making the species tree a mul-tree). The remaining 18 tips were considered diploid lineages that did not need phasing.
Once the species tree was simulated, we used the multispecies coalescent framework to simulate four gene trees, representing four loci. In half of the replicates, all gene trees were simulated with one gene-tree tip for each species-tree tip. These replicates represent the case in which a two mul-tree tip phasing model is appropriate because there is no allelic variation (all gene copies for the tetraploid are true homeologs). For the other half of the replicates, gene trees were simulated with one gene-tree tip for each species-tree tip for two loci, and with two gene lineages from subgenome A and none from subgenome B for the remaining two loci. These replicates represent the case in which a three mul-tree tip phasing model is needed because the gene copies for two of the loci are allelic variants from the same subgenome but for the other two loci the gene copies are homeologs from different subgenomes.
The gene trees were simulated using an effective population size For each simulated dataset, we ran stepping-stone analyses to compute marginal likelihoods of the two-mul-tree-tip and threemul-tree-tip homologizer phasing models. Our stepping-stone simulations used 50 stepping stones, sampling 1000 states from each step. We then compared the homologizer phasing models using Bayes factors. Rothfels et al. (2017) analysed a dataset of four single-copy nuclear loci (ApPEFP_C, gapCpSh, IBR3 and pgiC) for a sample of nine diploids and 19 tetraploids from the fern family Cystopteridaceae, using alloPPnet (Jones, 2017;Jones et al., 2013). We reanalyzed these data (

| Experiment 2: Testing Bayes factors to distinguish allelic variation from homeologs
Using Bayes factors to compare homologizer phasing models successfully distinguished allelic variation from homeologs ( Figure 4).
For the simulation replicates in which the three-mul-tree-tip phasing model was necessary to accommodate the presence of both homeologous gene copies and allelic variation, the Bayes factors for the three-mul-tree-tip phasing model compared to the two-mul-tree-tip phasing model were nearly always over 100, indicating "decisive" support ( Figure 4, grey dots; Kass & Raftery, 1995). When all gene copies from the tetraploid were homeologs such that three mul-tree tips were not necessary, then the Bayes factors were always less than zero (Figure 4, red dots), indicating there was no support for the three-mul-tree-tip phasing model over the two-mul-tree-tip phasing model.

| Insights from homologizer model selection and data exploration
The ability to compare among different homologizer models with Bayes factors, and thus to use homologizer as a data-exploration tool, resulted in key insights about the data for three of the accessions in our dataset; these insights subsequently resulted in significantly altered results of the downstream inferences. The first TA B L E 2 Cystopteridaceae phasing model tests. Bayes factors were used to compare phasing models for three polyploid accessions. Each model varied the number of mul-tree tips used to phase gene copies. The marginal log-likelihood was calculated four times independently for each phasing model; the mean and standard deviation is shown. The Bayes factor of the best supported model versus each other model is shown. The best supported model is indicated with -. Model comparisons with "strong" support according to Kass and Raftery (1995) are bold and marked with **.

Accession Number of mul-tree tips Marginal log-likelihood (mean) Marginal log-likelihood (SD) Bayes factor
Cystopteris tasmanica  Proportion correct case involves Cystopteris tasmanica, a tetraploid (Brownlie, 1958;Brownsey & Perrie, 2018;Tindale & Roy, 2002). From our C. tasmanica accession, we recovered two sequences for three of the four loci (we recovered a single copy from IBR3). This accession thus appears to be perfectly "well-behaved", with one copy of each of the two expected homeologs at most loci, which is how it was treated in the Rothfels et al. (2017) alloPPnet analyses. However, preliminary homologizer analyses phasing two mul-tree tips for C. tasmanica had an odd result: three of the loci phased with very high posterior support, but the phasing for the two gapCpSh copies was equivocal (each copy had ~50% posterior probability of phasing to each of the mul-tree tips). To test the possibility that this strange result was due to the two gapCpSh copies being allelic rather than homeologous (in which case we would expect each of the copies to fit equally to each mul-tree tip), we ran another homologizer model, this time with C. tasmanica allocated three mul-tree tips. This three-tip model strongly out-performed the two-tip one (Table 2), and as we had suspected, the two gapCpSh copies were resolved sister to each other, with a blank copy phased with high posterior probability to the other subgenome ( Figure 5).
The other two insights were related to each other, and involve the allotetraploid Gymnocarpium dryopteris and one of its diploid progenitors, G. disjunctum (Pryer & Haufler, 1993;Rothfels et al., 2014Rothfels et al., , 2017. In our dataset, G. dryopteris was represented by two sequences for two loci (as would generally be expected for an allotetraploid), but by three sequences for the other two loci. In the alloPPnet multispecies coalescent analyses of Rothfels et al. (2017), this accession was split into two "individuals" in order to accommodate its apparent heterozygosity. For our preliminary concatenationbased (instead of coalescent-based, as in alloPPnet) homologizer analyses, we ran a model that gave this accession three tips, in the thought that only one of the subgenomes had any allelic variation (i.e. the pattern shown in Figure 2b), and a four-tip model, in case both subgenomes were heterozygous at least one locus. The phasing for G. dryopteris in these preliminary analyses, however, had difficulty converging, and in addition, inspection of the pgiC gene tree showed that each of the three G. dryopteris copies were resolved in distinct places. Specifically, one copy grouped with each of G. appalachianum and G. disjunctum, as expected given these two diploids are the G. dryopteris progenitors (Pryer & Haufler, 1993;Rothfels et al., 2014Rothfels et al., , 2017; however, the third copy was in an isolated position. In response, we ran a five-tip model, and this model was strongly preferred over either of the three-or four-tip ones (Table 2).
Similarly, G. disjunctum appeared to be heterozygous at two loci, and in the work of Rothfels et al. (2017), it was split into two individuals; however, this (two-tip) model converged poorly, and a three-tip model was weakly preferred by Bayes factor comparison ( Table 2).
The final, integrated model (three tips for C. tasmanica, five tips for G. dryopteris, and three for G. disjunctum) resulted in a considerably revised understanding of our dataset. The C. tasmanica case was fairly straightforward but also had a dramatic effect: by allowing the gapCpSh copies to be allelic rather than forcing them onto different subgenomes, the three-tip model inferred that C. tasmanica is an allopolyploid, with one subgenome from each of the major F I G U R E 4 Effectiveness of model comparison tests to distinguish allelic variation from homeologs. Each dot represents a single simulated dataset as described in the main text. The y-axis shows the Bayes factor for the three mul-tree tip phasing model compared to the two multree tip phasing model. All replicates had two gene copies simulated for each locus. Replicates 1-50 (left panel; red dots) were simulated with each of the two tips representing a different polyploid subgenome. The Bayes factor for all of these replicates was less than 0 (dashed line), indicating the three-tip model was not supported over the two-tip phasing model. Replicates 51-100 (right panel; grey dots) were simulated with three total tips representing two different subgenomes of a polyploid in some loci and a single subgenome with two allelic variants in other loci. For nearly all these replicates the Bayes factor was over 100 (dotted line), indicating "decisive" support (Kass & Raftery, 1995) for the three-tip phasing model over the two-tip model. The columns of the heatmap each represent a locus, and the joint MAP phase assignment is shown as text within each box. Each box is coloured by the marginal posterior probability of the phase assignment. Adjacent to the heatmap is a column that shows the mean marginal probability across loci of the phasing assignment per tip, which summarizes the model's overall confidence in the phasing of that tip. In the sample labels, "A." = Acystopteris; "C." = Cystopteris; "G." = Gymnocarpium; the four-digit numbers are Fern* Labs Database accession numbers (https://fernl ab.biolo gy.duke.edu/); capital A, B, and so forth indicate subgenomes; sp1 and sp2 indicate undescribed cryptic species; and lowercase letters following the accession numbers indicate haploid "individuals" within the sampled diploids (i.e. those diploids are heterozygous). Copy names with a "BLANK" suffix indicate missing sequences (e.g. a subgenome that was present in some loci but not retrieved for others).
subclades of the core fragilis complex (i.e. the fragilis complex minus C. protrusa; Rothfels et al., 2013). This conclusion is in strong contrast to that of Rothfels et al. (2017), which limited this accession to two tips and inferred that it was possibly an autopolyploid (the two tips were inferred to be sister to each other, albeit relatively deeply divergent). For G. disjunctum, both sequences of the two loci that had two copies (gapCpSh, IBR3) were phased to tips consistent with their being "normal" allele variants in a diploid genome-these two loci are heterozygous. However, the lone pgiC sequence did not phase to either of those tips; instead, it was phased to its own tip, in an isolated portion of the phylogeny ( Figure 5). Multiple lines of evidence suggest that this tip does not represent a subgenome history: (1) it is inconsistent with what is known of the ploidy of G. disjunctum (it is a diploid; Pryer & Haufler, 1993;Rothfels et al., 2014); (2)  For G. dryopteris, while each locus had only one sequence from the subgenome closely related to G. appalachianum, the gapCpSh representative of this subgenome was sufficiently phylogenetically distinct (presumably due to coalescent variance) to be phased with high posterior probability to its own, closely related, tip. In the subgenome closely related to G. disjunctum, the results were more to be expected-all loci had a single copy, which phased to one of the tips, except for gapCpSh, which had two closely related alleles ( Figure 5).
However, the most important result was that one of the three pgiC copies was phased, with high posterior probability, to its own tip, phylogenetically distant from all other G. dryopteris tips. This pattern is very similar to that seen for pgiC in G. disjunctum (above), and we suspect these issues are related: G. dryopteris has inherited this putatively paralogous copy from G. disjunctum, and the G. dryopteris version in our dataset may additionally be a PCR recombinant. This conclusion is supported by the fact that G. dryopteris has two other pgiC copies that are cleanly phased to subgenomes corresponding to each of the putative progenitor diploids. By allowing this aberrant pgiC tip to be phased to its own branch, this three-tip homologizer model clarifies that the remaining G. disjunctum allele sequences are very closely related to each other, and provides strong support to the relationships within this area of the Gymnocarpium phylogeny ( Figure 5).

| MCMC behaviour for homologizer models on empirical data
The Cystopteridaceae analyses also demonstrated that inference under homologizer models, when applied to real datasets with a large number of polyploid accessions, involves complex interactions among the phasing estimates and between those estimates and those of the topology (and branch lengths). Particularly problematic is the case of two polyploids that have closely related subgenomes and can get their phasing for a subset of the loci "stuck" in the incorrect orientation. In the Cystopteridaceae dataset, for example, ×Cystocarpium and Cystopteris fragilis "sp 2" share two pairs of closely related subgenomes, each pair from a different major subclade of the fragilis complex ( Figure 5). If a locus got out-of-phase for both accessions, it could get trapped in that orientation, presumably because a proposal to the phasing for that locus for any one of the accessions would result in each pair of closely related tips containing very dissimilar sequences, which would have a low probability given the short phylogenetic distance between the two accessions.
While we experimented with more complex joint phase moves, we found that simply increasing the frequency of the phase moves relative to the tree topology moves was more effective. However, while any given homologizer run was reasonably rapid (on the order of a "regular" analysis in RevBayes), they frequently failed to converge.
Specifically, individual runs would get stuck in a suboptimal region of parameter space and appear to be sampling from the posterior distribution (effective samples sizes would be high, etc.), but multiple such runs would sample from different distributions. In our full Cystopteridaceae analysis, for example, we ran seven independent runs, only three of which converged to the posterior distribution.
As a result, we recommend that users run preliminary analyses and inspect the results for areas where two closely related accessions are potentially interfering with each other. In addition, multiple independent runs are necessary to ascertain the efficiency of MCMC mixing for a given dataset. Users can additionally experiment with

| A phased multilocus Cystopteridaceae phylogeny
Our inferred Cystopteridaceae phylogeny under the best-fitting model (see Table 2) is well supported (Figure 5), and generally consistent with the inferences of Rothfels et al. (2017), although with the addition of all the phasing information.
The primary differences relate to the insights gained from our data exploration/model comparison (see above). Specifically, the strongly supported resolution of Cystopteris tasmanica as an allopolyploid with two fairly phylogenetically distant progenitors ( Figure 5) is in strong contrast to the earlier inferences of this accession as a potential autopolyploid, with subgenomes that are sister to each other (Rothfels et al., 2017 where that locus was not recovered for the given sample, such as IBR for C. utahensis ( Figure 5).

| homologizer for inference
The core application of homologizer is to infer multilocus multilabeled phylogenies and/or the phasing of gene copies to subgenomes. For this application, our simulation tests show that homologizer is robust to a wide range of factors that can potentially affect accuracy (Figure 3). When the homologizer model does not have enough information to confidently phase gene copies due to (1) phylogenetically uninformative sequences, (2) the polyploid subgenomes being too closely related, or (3) (Sigel, 2016); recombination within a focal locus would be very rare (Chen et al., 2018).
Other datasets might require different analysis approaches. For example, genome-scale target enrichment and transcriptome datasets often contain hundreds of loci. With these large datasets joint inference of the phylogeny and gene-copy phasing as presented here may not be computationally feasible. In those cases, it may be reasonable to adopt a sequential Bayesian approach: first infer a species-level mul-tree while phasing a subsample of loci, and then condition the phasing of the remaining loci on that mul-tree. During the secondary phasing analyses one could optionally incorporate phylogenetic uncertainty in the mul-tree by integrating over the posterior distribution of mul-trees, as inferred in the first step. The RevBayes implementation of homologizer allows for a wide range of approaches, such as these, for scaling up phasing analyses that should be suitable for any sized dataset.

| homologizer for data exploration
The other core functionality of homologizer is as a data exploration tool. With increasing numbers of taxa and loci, it can be difficult to detect non-orthologous sequences, such as hidden paralogs, contaminants, or allelic variation that is erroneously modelled as homeologous. homologizer, in conjunction with the RevGadgets visualization tools, allows for the convenient detection of cases where most loci phase easily but some do not, indicating a likely problem in the homology of the underlying sequences. This application can be extended as a hypothesis-testing tool via model comparison (see "Experiment 2: Testing Bayes factors to distinguish allelic variation from homeologs"), which provides an objective, powerful, and statistically rigorous method of understanding these complex patterns of homology. In the case of our empirical Cystopteridaceae dataset, this data exploration approach revealed cases of alleles masquerading as homeologs (e.g. in Cystopteris tasmanica) and of hidden paralogy (in Gymnocarpium disjunctum and dryopteris). The identification and correction of these errors resulted in important changes to our downstream phylogenetic inference, demonstrating the impact that this form of data exploration can have.

| SUMMARY
homologizer provides a powerful and flexible method of phasing gene copies into subgenomes, allowing users to infer a multilocus phylogeny for lineages with polyploids (or F1 hybrids) while treating the phasing as a nuisance parameter, to infer the phasing while integrating out the phylogeny, or for any combination of these goals. Our simulations demonstrate that homologizer is able to confidently infer phasing even in cases of relatively weak phylogenetic signal (e.g. short loci); in such cases where the signal is insufficient, homologizer returns an equivocal result (all potential phasings have lower posterior probability). In contrast, our simulations also demonstrate that homologizer can be sensitive to model violations. Specifically, if the true gene trees differ strongly from each other (e.g. due to high levels of ILS) and yet the model assumes all loci share a single phylogeny, the method is forced to choose among a small set of more-or-less equally wrong phasing options, and may return the incorrect phasing with a high posterior probability (i.e. the best of a bad lot). Our analyses of the empirical Cystopteridaceae dataset further demonstrate the power of the method; by simultaneously leveraging the phylogenetic information available in the full taxon sample (diploids and polyploids alike), across all loci, homologizer can confidently infer the phylogeny and phasing in areas of the tree where there is no diploid representation, and for markers where there is extensive missing data. As such, homologizer opens up the application of multilocus phylogenetics to groups containing polyploids or other individual organisms that may contain subgenomes with divergent evolutionary histories-such groups have been historically understudied phylogenetically, which has limited our ability to explore evolutionary questions in general, and questions related to the macroevolution of polyploids specifically. Finally, homologizer may potentially be used beyond its core phasing functionality to identify non-homologous sequences more generally, with broad applications for phylogenetic and phylogenomic inference.

AUTH O R CO NTR I B UTI O N S
William A. Freyman and Carl J. Rothfels developed homologizer conceptually. William A. Freyman implemented homologizer in software.
William A. Freyman, Matthew G. Johnson and Carl J. Rothfels performed analyses and contributed to the manuscript.