Lpnet: Reconstructing phylogenetic networks from distances using integer linear programming

Neighbor‐net is a widely used network reconstructing method that approximates pairwise distances between taxa by a circular phylogenetic network. We present Lpnet, a variant of Neighbor‐net. We first apply standard methods to construct a binary phylogenetic tree and then use integer linear programming to compute an optimal circular ordering that agrees with all tree splits. This approach achieves an improved approximation of the input distance for the clear majority of experiments that we have run for simulated and real data. We release an implementation in R that can handle up to 94 taxa and usually needs about 1 min on a standard computer for 80 taxa. For larger taxa sets, we include a top‐down heuristic which also tends to perform better than Neighbor‐net. Our Lpnet provides an alternative to Neighbor‐net and performs better in most cases. We anticipate Lpent will be useful to generate phylogenetic hypotheses.


| INTRODUC TI ON
Neighbor-net (Bryant & Moulton, 2004) is a widely used distancebased method that approximates the input distance by (the distance induced by) a weighted circular split system on the same taxa set, which is then visualized as a splits graph. It has been used to analyse numerous real datasets, and it is a crucial step of other phylogenetic tools, for example, to construct coalescent-based phylogenetic networks (Allman et al., 2019), to detect and filter out homoplastic sites of a sequence alignment (Dress et al., 2008) and to align multiple sequences without guide tree (Kruspe & Stadler, 2007). Neighbor-net is the network analogue of the neighbor-joining (NJ) method (Saitou & Nei, 1987). First, a circular ordering of the taxa is computed heuristically, and then, a nonnegative least squares (NNLS) procedure is applied to find optimal weights for all splits that agree with that circular ordering. The heuristic consists of joining two clusters, using the same criterion as neighbor-joining, and then choosing one taxon from each cluster such that those two taxa will be adjacent in the final circular ordering. The first part of this agglomeration defines a phylogenetic tree on all taxa (Levy & Pachter, 2011) and the second part chooses a circular ordering that agrees with all splits of that tree. It follows from Semple and Steel (2004) that there are 2 n−3 circular orderings with that property for n taxa.
Here, we present Lpnet, a variant of Neighbor-net, that does not apply the second heuristic step of the agglomeration. Instead, we only construct a phylogenetic tree heuristically. Then, we use integer linear programming to find an optimal circular ordering, and finally, we use the same NNLS procedure as Neighbor-net to assign split weights. We have run experiments to compare Lpnet and 2 | MATERIAL S AND ME THODS

| Compatible splits and circular split systems
A split S divides a set X of taxa to two non-empty parts A and B and is denoted S = A | B. For a phylogenetic tree where the leaf nodes are labelled by the set X of taxa, every branch of the tree represents a split of X. Sets of splits that can be obtained from a single tree in this way are called compatible (Semple & Steel, 2003).
A circular ordering of a set X = x 1 , … , x n of n ≥ 3 taxa can be obtained by labelling all vertices of an ngon by the taxa (see Figure 1). A split A | B agrees with a circular ordering, if both A and B label consecutive paths on the circle. A circular ordering is defined by the permutation of X that we get by starting at an arbitrary taxon and then following all vertices of the cycle in a clock-like or anticlock-like fashion. Therefore, there are 2n different permutations associated with the same circular ordering. A circular split system of X is a set of splits of X such that all splits agree with a single circular ordering. It has long been known that circular split systems have up to ⎛ ⎜ ⎜ ⎝ n 2 ⎞ ⎟ ⎟ ⎠ splits (Bandelt & Dress, 1992).
Furthermore, compatible split systems are circular; thus, an unrooted phylogenetic tree can be considered a circular split system.
When non-negative weights are assigned to the splits, this weighted circular split system can be visualized by a planar splits graph, a network where the taxa are embedded as vertices, and every split A | B of weight w corresponds to a set of parallel edges which all have length w. Furthermore, removing all those edges decomposes the network into two connected components containing A and B respectively (Dress & Huson, 2004). Split graphs are commonly referred to as implicit phylogenetic networks, and SplitsTree4 (Huson & Bryant, 2006) can be used to draw such a network from an input weighted circular split system (Huson & Bryant, 2006).

| Quartets and weights
The support for a tree (without edge lengths) or a circular ordering can be quantified by summing up the weights of all supported quartets, and this number is maximized by a correct tree or ordering, if the input distance is induced by a tree or a circular split system respectively.

| Distance reduction and the order dependence of Neighbor-net
The first step of neighbor-joining and all other agglomerative algorithms discussed here is to identify two taxa u, v, such that ∑ x,y≠u,v w(uv� xy) is maximized. Note that the selection criterion of neighbor-joining and its variants can be formulated in terms of these quartet weights (Mihaescu et al., 2009). Then, the cluster {u, v} is considered a single taxon, where the distances between this new taxon and another taxon x are based on the distances between u , v, and x. This process is reiterated until there are only three taxa left, and the two steps are called the selection and the reduction step. It was pointed out by Bryant (2005) that, in order to reconstruct trees correctly from their induced metric, the selection step is unique, while the reduction step can give different weights to the two joined clusters. Neighbor-joining always gives the same weight to both clusters, while UNJ (Gascuel, 1997b) uses weights that are proportional to the cluster size, and BioNJ (Gascuel, 1997a) tries to minimize the variance of the reduced distances.
Neighbor-net does not reduce the distance for clusters of size two. Instead, it reduces three taxa to two, whenever a cluster of size two is merged with another cluster. If both clusters have size two, F I G U R E 1 A circular split system and three splits. this reduction step has to be performed twice. This distinguishes the taxon that is not included in the first reduction from the other three. For default parameters, the taxa that are first reduced each receive 2/9 and the remaining taxon 1/3 of the total weight of the new cluster. Since the choice which three taxa are reduced first is not determined by the input distances, the output of Neighbor-net can sometimes depend on the input order of the taxa, even if no ties occur. It therefore seems reasonable to give equal weights to all four clusters. We have implemented this variant of the Neighbor-net tree construction in Lpnet and refer to it as symmetric NNet tree. In addition, we use NNet tree to mimic the original weighting by randomly assigning 1/3 to one of the two candidate taxa that might receive that weight from the Neighbor-net algorithm.

| The Lpnet algorithm
Similar to NJ and Neighbor-net, Lpnet uses a matrix with pairwise distances between the taxa as its input. However, all of these methods can be interpreted as quartet-based. Then, all objective functions and selection criteria of the agglomerative part of the considered distance-based methods can be written as functions of the weights of some quartets. This has been noted by Mihaescu et al. (2009) and Levy and Pachter (2011). The Lpnet algorithm can be divided into three steps: 1. Construct a phylogenetic tree. We use several methods, including NJ.
2. Use Integer Linear Programming to find a circular ordering which is consistent with the tree to maximize the sum of all quartet weights contained in the circular ordering.
3. Estimate split weights from the circular ordering by using nonnegative least squares such that the least squares fit (LSFit) is maximized.
Neighbor-net combines the agglomeration of bottom-up clustering with a second selection step that essentially defines an ordering of the taxa in the newly added cluster. Every cluster can be considered a path with its taxa as vertices, and the path will be an interval in the final circular ordering. Crucially, the decision about the ordering of the new cluster is made based on the information available at the time of merging.
For Lpnet, we choose to delay the ordering of the clusters until the whole tree is known. We do so by solving an integer linear programming problem that finds a circular ordering of all taxa that maximizes the sum of the weights of its supported quartets among all orderings that agree with all splits of the tree.

| Constructing a tree
Neighbor-net constructs a circular ordering agglomeratively.
The process can be described as adding edges to a graph whose vertices are the taxa such that every component is a path (Grünewald et al., 2007).
As indicated in the previous subsection, the second selection step has some influence on the tree construction. It was already suggested by Levy and Pachter (2011) to change the distance reduction of Neighbor-net such that all splits of the neighbor-joining tree are always supported by the output ordering. The main idea of Lpnet is to skip the second selection altogether and choose a circular ordering after the tree construction. In order to observe how much of the performance difference between Lpnet and Neighbor-net is caused by this new strategy, we include the NNet tree which stays as close as possible to Neighbor-net. Noting the undesired order dependence of Neighbor-net, we also include the symmetric NNet tree.
The other tree reconstruction methods implemented by Lpnet are neighbor-joining and its variants, UNJ and BioNJ. It is also possible to input a user-defined tree, and Lpnet will compute a circular split system where all splits of the input tree agree with the circular ordering.

| Using linear programming to maximize quartet weights
When a phylogenetic tree is drawn in the plane, this embedding defines a circular ordering of the taxa which can be observed by traversing the tree such that every edge is visited once in each direction (see Figure 2). Given a binary unrooted tree T with n taxa and pairwise distances, we want to find a circular ordering that agrees with all splits of T, such that the sum of the weights of all supported quartets is maximized. Given an initial ordering that agrees with T, we can obtain another such ordering by choosing a non-trivial split A | B and reversing the order of A. Note that reversing the order of B yields the same circular ordering as reversing A, and reversing both yields the initial circular ordering. This process can be interpreted as flipping the edge that separates A from B, and it follows from Semple and Steel (2004)  F I G U R E 2 A phylogenetic tree is drawn in the plane. Traversing its edges gives the circular ordering (a,b,c,d,e,f).
For four taxa i 1 , i 2 , j 1 , j 2 , assume that the quartet i 1 i 2 | j 1 j 2 is displayed by T. Then, all edges corresponding to splits that separate i 1 and i 2 from j 1 and j 2 form a path in T. One of the end vertices of that path, say i , is on the path from i 1 to i 2 and the other one, say j, is on the path from j 1 to j 2 . If the initial circular ordering supports the quartet i 2 j 1 | j 2 i 1 , then another circular ordering also supports that quartet, if the number of flipped edges on the path from i to j is even, else it supports i 1 j 1 | i 2 j 2 . Let I 1 , I 2 , J 1 , J 2 be the sets of taxa that are in the same component of the graph obtained from T by removing i and j as i 1 , i 2 , j 1 , j 2 , respectively. Then, we define so c ij quantifies how much we would like to flip an odd number of edges on the path from i to j. Introducing binary variables, X ij for every pair {i, j} of interior vertices of T, we need to find a circular ordering such that X ij = 1, if and only if the number of flipped edges on the path from i to j is odd, and ∑ i,j c ij X ij is maximal. It turns out that a (0,1)-assignment to all variables corresponds to a circular ordering, if and only if the following four linear inequalities hold for every three interior vertices i, j, k of T: These conditions are necessary, because every edge of the smallest subtree of T containing i, j, k is contained in exactly two of the three paths between two of those vertices. This means that the sum X ij + X ik + X jk is even, so either all three variables are zero or there are two ones and one zero. To see that the conditions are sufficient, we note that the (0,1)-assignment to all those variables X ij where ij is an edge of T already defines a circular ordering. Now there is a single extension of this assignment to all variables such that all conditions hold: Let i and k be two interior vertices of T such that X ik is unknown while X ij and X jk have already been assigned for a vertex j. As before, X ij + X ik + X jk must be even, so there is only one allowed assignment for X ik . We give an example of a 5taxa tree with all four possible circular orderings in Figure 3, and we list the values of the variables X i,j .
In summary, we compute an optimal circular ordering by solving a binary linear programming problem with constraints and we can globally maximize this objective function using binary linear programming.

| Maximizing quartet weights heuristically
Even if an exact solution to maximize the quartet weights is not feasible, it has some advantages to first complete the tree construction and then compute a circular ordering. We include a top-down heuristic which iteratively fixes the variables X i,j for edges ij. In contrast to Neighbor-net, it uses all sets of four taxa. It also can choose between several candidate variables to be fixed first, and it revises its decisions in a post-processing step.
F I G U R E 3 A 5-taxa tree and the correspondence between circular orderings and variables X i,j .
The agglomeration process of NJ and its variants ends when there are only three clusters left, and for each of those clusters, a rooted binary tree has been constructed. This corresponds to a rooted tree where the root has outdegree 3 which we will denote T . We use that tree and an initial circular ordering to decide in a top-down fashion which of the edges should be flipped. We assume that there is a subtree U containing the root such that for all edges of U, a decision has been made. Initially, the subtree con- After this top-down procedure, we do some post-processing by reversing the decision whenever doing so increases the sum of the quartet weights. In order to guarantee polynomial running time, we stop this process, either when we find no edge that improves the score or when every edge has been checked n ∕ 2 times where n is the number of taxa.

| Estimate split weights by non-negative least squares
The last part of Lpnet is to compute weights for all splits that agree with the circular ordering that was constructed in the previous step. Here, we follow the same approach as Neighbor-net and use the NNLS algorithm by Lawson and Hanson (1995) to obtain split weights such that the difference between the input distance and the distance induced by the weighted circular split system is minimized. More precisely, we minimize the Euclidian norm of A⃗ s − � ⃗ d where A is the (0,1)-matrix with rows indexed by the pairs of taxa and columns indexed by the allowed splits, and an entry is one, if and only if the pair of taxa corresponding to its row is separated by the split corresponding to its column.
Furthermore, ⃗ s and � ⃗ d are the column vectors with the nonnegative split weights and the input distances respectively. In order to measure the goodness of distance approximation, we use the least squares fit (LSFit): where the pairwise distances induced by the weighted split system are p ij and the input distances are d ij .

| Consistency of Lpnet
Consistency is an important feature of phylogenetic reconstruction methods. It means that a method does not make mistakes for perfect input. Specifically, a method that reconstructs a circular split system from distances is consistent, if it returns, the correct weighted split system whenever the input distance is induced by a weighted circular split system. Neighbor-net was shown to be consistent by Bryant et al. (2007). A more general proof was given by Levy and Pachter (2011), where all tree construction methods used by Lpnet are included in their definition of neighbor-joining which allows a wide class of weighting schemes for the reduction step. Their result implies that all splits of the tree constructed by Lpnet agree with some circular ordering that agrees with all splits of the underlying split system. It is easy to see that such an ordering will also maximize the sum of all supported quartets. Finally, NNLS will be able to match the input distance exactly; thus, Lpnet is consistent for all used tree reconstruction methods.

| Implementation
We provide an R implementation of Lpnet that outputs a nexus file with a weighted circular split system. We recommend SplitsTree4 (Huson & Bryant, 2006)  In practice, the integer linear programming is the computationally most demanding part of Lpnet. The constraint matrix has O n 5 entries, and the required memory grows equally fast. There is a hard limit of 94 taxa, because the matrix is stored as a vector, and R only allows vectors of length at most 2 31 − 1.
We list the size of the constraint matrix and the average CPU time for running our Lpnet function in R using Gurobi with binary linear programming for different numbers of taxa in Table 1. Rglpk is much slower, and the problem was often not feasible for more than 50 taxa on our machine. For 40 taxa, we observed an average CPU time of 13.76 s. taxa, and with the Gurobi solver, a solution will usually take at most a few minutes. The Rglpk solver works well for smaller instances but will struggle for more than 50 taxa.
We report running times for our heuristic in Table 2. Using a computer with more memory (128 GB), an example with 500 taxa took 28 h.
All other CPU times reported in this article were obtained running Lpnet on a laptop with Windows 10 operating system, Intel Core i7-9750H 2.60 GHz CPU with six cores and 16 GB of RAM.

| RE SULTS
In this section, we mainly compare the performance of Neighbor-net and Lpnet for different input distances. In order to evaluate different networks from the same input, we use the LSFit, which is also available in SplitsTree4 (Huson & Bryant, 2006). Before the split weights are computed, Neighbor-net and Lpnet both try to maximize the sum of the weights of the quartets that agree with the chosen circular ordering. Therefore, we use this sum to evaluate the algorithms at that state and refer to it as sum of quartets. Within Lpnet, the result may depend on the choice of the tree reconstruction method. We use Neighbor-joining and its common variants BioNJ and UNJ, as well as two methods that try to mimic the internal tree building of Neighbor-net that we call NNet tree and symmetric NNet tree.
We run examples on four different kinds of input distances: First, we present a simple artificial example with only seven taxa and two reticulations that shows how the early ordering of clusters by Neighbor-net can cause problems. The second example uses random distances between taxa and represents the general approximation problem of an input metric by a circular split system, without any phylogenetic signal. The third example uses simulated sequences from random trees. The last example is a published dataset of Viburnum plants which contains a hypothesized hybrid and has been suggested as a benchmark for phylogenetic network methods.

| An artificial example
We start with an artificial example that demonstrates the disadvantage of ordering clusters locally. As visualized in Figure 4, we assume that seven taxa mainly evolved under a clock-like tree, with the exception of two gene transfer events I 1 and I 2 . We assume that 10% (for I 1 ) and 20% (for I 2 ), respectively, of the genome of the reticulation vertices are independently replaced along the reticulation arrows.
This means that the genome consists of four parts representing sequences affected by one or both or none of the reticulations. Each part follows its own tree, and the observed distance is a convex combination of those four tree distances where the coefficients are the fractions of the genome that follow the trees. This distance corresponds to 10 non-trivial splits which do not fit on any circular ordering and seven trivial splits. The non-trivial splits with the weights and the contributing trees are listed in Table 3, and the distance matrix representing the network is given in

| Random distances
We use random numbers between 0 and 1 from the uniform distribution as pairwise distances, and then add a large enough constant to all distances to guarantee the triangle inequality.
We generate 10,000 random distance matrices with 30 taxa.
Then, we compare Lpnet using different methods to construct phylogenetic trees, with Neighbor-net. We use SplitsTree4 (Huson & Bryant, 2006) with default setting to get the Neighbor-net circular ordering. Then, we compare the sum of quartets and the LSFit value for Lpnet and Neighbor-net (Table 5). We observe that for all five tree building methods, for the sum of quartets and LSFit, the Lpnet algorithm clearly tends to get better scores than Neighbor-net. While Lpnet achieves a higher LSFit for roughly 80% of the input metrics, this fraction is more than 98% for the sum of quartets. Comparing the Lpnets using different tree building methods, we find that we often get the same circular ordering. Nevertheless, Table 6 shows that UNJ performs significantly better than the other methods. Our heuristic, based on the UNJ tree, is a clear improvement compared to Neighbor-net, but it produces worse results than Integer Linear Programming.

| Simulated sequences
We randomly generate a tree for 30 taxa by using the function 'sim.
taxa' from the r package TreeSimGM (Hagen & Stadler, 2018). We let the parameter 'waiting time until speciation' for 'sim.taxa' be exponentially distributed with rate parameter = 1.2, and then normalize such that the longest pairwise distance is one. Then, we use the software Dawg (Cartwright, 2005) to simulate DNA sequences of length 10,000 bp from the random tree under the Jukes-Cantor model.
Finally, we use SplitsTree4 (Huson & Bryant, 2006) to compute Jukes-Cantor distances with default settings. We repeat this process 10,000 times and compare Lpnet and Neighbor-net in the same way as for the random metrics. Table 7 shows the result of comparing the LSFit value and the sum of all quartets between Lpnet and Neighbor-net. We see that for all five tree construction methods, the advantage of Lpnet compared to Neighbor-net increases. The sum of quartets is now always higher and the LSFit better for almost 95% of the datasets when we use Lpnet. The input datasets for this experiment can be interpreted as a tree metric plus some random noise, and the results show that in this situation, the strategy of Lpnet to complete the tree reconstruction before embedding the tree pays off. The various tree building methods yield the same circular ordering more often than for random metrics, but again UNJ achieves significantly better scores than the other variants of NJ (see Table 8).

TA B L E 3
Non-trivial splits and positive weights for the original phylogenetic tree and the three trees representing transfer events, the mixture of the input trees and for Neighbor-net (NNet) and Lpnet. Note: The contributing trees and their percentage of the genome are listed in the header. Every row corresponds to a non-trivial split. For every contributing tree, we represent its splits by the smaller split halves and list the split weights. The correct weight for every non-trivial split throughout the genome is given in the Tree mixture weight column.

TA B L E 4
The distance matrix for our artificial example.

| Analysis of a published dataset
As an example of an analysis of a real dataset, we choose a study of the genus Viburnum of flowering plants (Donoghue et al., 2004). The raw data are chloroplast trnK intron and nuclear ribosomal ITS DNA sequences from 43 species of Viburnum and 2 species of Sambucus.
We use the uncorrected P distance from the combined sequence alignment to compute the Neighbor-net ( Figure 5) and the Lpnet ( Figure 6). Following the previous results, we chose UNJ as the tree reconstruction method for Lpnet.
This dataset has been proposed as an example for testing phylogenetic networks method by an influential but now inactive blog (phylo netwo rks.blogs pot.com/p/datas ets.html). We focus on the position of Viburnum prunifolium (V. prunifolium) which was already hypothesized to be a hybrid between Viburnum lentago (V. lentago) and Viburnum rufidulum (V. rufidulum) in 1956 (Brumbaugh & Guard, 1956). From the differences between the trees obtained by analysing the two loci separately, Donoghue et al. (2004) conclude that their dataset supports that hypothesis. In Figure 5 (NNet) and From Donoghue's study (Donoghue et al., 2004)

| DISCUSS ION
Eighteen years after the release of Neighbor-net, our new variant Lpnet provides an alternative that approximates the input distance better for the clear majority of the datasets we have tried. The main disadvantage of Lpnet is that it is slower and needs more memory, but most datasets that have been analysed by Neighbor-net have less than 80 taxa and can therefore be handled by Lpnet as well.
The main application of split graphs and in particular Neighbornets is to find the main signals in an early stage of a data analysis (Huson & Bryant, 2006). In practice, a Neighbor-net often contains a few strong splits and many tiny ones which are usually interpreted as irrelevant noise. We expect that the clear signals will often be detected by both methods, while differences between the minor splits will cause a slightly higher LSFit value for Lpnet. In such cases, it does not matter much which method is used.
However, our realistic artificial example demonstrates that there can be significant differences. If reticulations are anticipated in a dataset, then the goal has to be to reconstruct an explicit phylogenetic network as shown in Figure 4. While it is generally hard to guess that network from a splits graph, this task would be easier for F I G U R E 5 Neighbor-net. Viburnum prunifolium (red) is the hybrid between Viburnum rufidulum and Viburnum lentago (blue).

F I G U R E 6
Lpnet. Viburnum prunifolium (red) is the hybrid between Viburnum rufidulum and Viburnum lentago (blue).
the Lpnet than for the Neighbor-net. We therefore anticipate that Lpnet will be useful for interpreting real datasets in the future.
We provide five different algorithms to construct phylogenetic trees. All of them turn out to yield the best LSFit occasionally, and we have no strong preference. In the average, UNJ performed best for our datasets, and it seems most consistent with the general approach taken by Lpnet that treats all pairs of taxa and all quartets equal. Therefore, we select UNJ as the default method, but we recommend to try other methods as well. Other tree building methods like minimum evolution or even not distance-based methods like maximum likelihood might be worth trying, and any binary tree can be input to Lpnet. It is generally interesting to see whether all splits of a reasonable tree will have positive weights in the Lpnet, and to compare the weights of the strongest other splits with the weights of the conflicting tree splits. However, users should be aware that we only have a consistency proof for the neighbor-joining variants that we provide.
Even though we replaced a heuristic part of Neighbor-net by an exact algorithm, Lpnet is still a heuristic method. It relies on a heuristic tree construction, and it optimizes the sum of quartet weights, while the final score function for a weighted split system is the LSfit value. The weights of the supported quartets indicate but do not guarantee that the distance can be approximated well by the allowed splits, and the discrepancy causes almost all cases where the LSFit of Lpnet is worse than Neighbor-net. It would be desirable to have a method that directly optimizes the least squares fit, but this would not allow any agglomerative construction and we are not aware of any such algorithm other than trying every possible ordering.
QNet (Grünewald et al., 2007) is the quartet analogue of Neighbor-net. It uses quartet weights directly obtained from the raw data instead of distances to reconstruct a weighted circular split system. The strategy of Lpnet to first construct a tree and then use linear programming to get a circular ordering can also be applied to modify QNet, and the approximation is expected to improve.

AUTH O R CO NTR I B UTI O N S
Mengzhen Guo and Stefan Grünewald conceived the ideas and designed methodology; Mengzhen Guo wrote the R package; Stefan Grünewald led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no competing interests.

PE E R R E V I E W
The peer review history for this article is available at https:// w w w.w e b o f s c i e n c e . c o m /a p i /g a t e w a y/ w o s /p e e r-r e v i e w/10.1111/2041-210X.14086.

DATA AVA I L A B I L I T Y S TAT E M E N T
The tree files, sequence alignments and distance matrices for the simulated datasets, as well as the distance matrices for the random distances are available at https://github.com/yukim ayuli -gmz/data