DBTree: Very large phylogenies in portable databases

Growing numbers of large phylogenetic syntheses are being published. Sometimes as part of a hypothesis testing framework, sometimes to present novel methods of phylogenetic inference, and sometimes as a snapshot of the diversity within a database. Commonly used methods to reuse these trees in scripting environments have their limitations. I present a toolkit that transforms data presented in the most commonly used format for such trees into a database schema that facilitates quick topological queries. Specifically, the need for recursive traversal commonly presented by schemata based on adjacency lists is largely obviated. This is accomplished by computing pre‐ and post‐order indexes and node heights on the topology as it is being ingested. The resulting toolkit provides several command line tools to do the transformation and to extract subtrees from the resulting database files. In addition, reusable library code with object–relational mappings for programmatic access is provided. To demonstrate the utility of the general approach I also provide database files for trees published by Open Tree of Life, Greengenes, D‐PLACE, PhyloTree, the NCBI taxonomy and a recent estimate of plant phylogeny. The database files that the toolkit produces are highly portable (either as SQLite or tabular text) and can readily be queried, for example, in the R environment. Programming languages with mature frameworks for object‐relational mapping and phylogenetic tree analysis, such as Python, can use these facilities to make much larger phylogenies conveniently accessible to researcher programmers.

. The topology shown is represented in the table, with one record for each node, by way of the following columns: • Name: the node label as it appears in the tree.
• Length: the branch length.
• Id: a primary key, generated as an autoincrementing integer.
• Parent: a foreign key referencing the primary key of the parent.
• Left: an index generated as an autoincrementing integer in preorder traversal: moving from root to tips, parent nodes are assigned the index before their children.
• Right: an index generated as an autoincrementing integer in a post-order traversal: moving from root to tips, child nodes are assigned the index before their parents. That is, "on the way back" in the recursion.
• Height: cumulative distance from the root.
In relational database implementations of trees that use adjacency list of this form, the children of n1 can be selected like so (returning C and D): select CHILD.* from node as PARENT, node as CHILD where PARENT.name='n1' and PARENT.id==CHILD.parent; The opposite, getting the parent of the input, should be readily apparent. Beyond direct adjacency, traversals that are normally recursive can be executed as single queries using the additional indexes. For example, to identify the most recent common ancestor MRCA of nodes C and F, we can formulate: select MRCA.* from node as MRCA, node as C, node as F where C.name= 'C' and F.name='F' and MRCA.left < min(C.left,F.left) and MRCA.right > max(C.right,F.right) order by MRCA.left desc limit 1; The query selects all nodes whose left index is lower, and whose right index is higher than that of either of the input nodes. This limits the results to nodes that are ancestral to both. By ordering these on the left index in descending order they are ranked from most recent to In this query, the final result is 3.3, that is, the sum of the heights of C and F, as the root has no height. Other calculations that take advantage of the extra indexes are also possible as single queries.
For example, several metrics capturing the tendency of nodes towards the tips (such that the tree is 'stemmy') or towards the root ('branchy') are used to summarize the mode of diversification in a clade (e.g., apparently accelerating or slowing down, respectively).
One of these metrics (Fiala & Sokal, 1985) iterates over all internal nodes and for each calculates the ratio of the focal node's branch length over the sum of descendent branch lengths plus the focal length, and then averages over these ratios. This can be expressed in a single query: These examples illustrate that access to large tree topologies indexed in this way is quite powerful, especially when integrated in scripting environments that provide additional functionality. The toolkit presented here provides such access.

| Database schema and object-relational mapping
A database schema that provides the functionality described in the Introduction is shown in to allow it to find matches more quickly, sort result sets, and enforce certain constraints (such as uniqueness). In other words, this is something else than the topological indexing described at greater length in this paper. Nevertheless, these B-Tree indexes also influence performance greatly so I note them here in the interest of any re-implementations by readers.
As the database consists of a single

| Populating databases
My approach for processing input parenthetical statements and emitting these as database records of the form discussed in the Introduction is described in the following prose algorithm.
1. Apply an auto-incrementing label to each node, that is, reading the tree statement from left to right, append a unique identifier to each closing parenthesis. Careful tokenization, considering the Newick rules (loosely observed as they are) for single and double quoting, spaces, underscores, and square bracketed comments, must be applied diligently here and throughout the algorithm.
2. Remove the closing semicolon of the parenthetical statement.
From here on, every nested taxon-including the entire tree-is syntactically self-same: it has a name, either tagged using the labelling scheme from step 1, or a previously provided one, and it may have a branch length (the last colon symbol followed by a number).
3. Emit the focal taxon to the database handle. In the root case, no parent of the focal taxon is in the tree, and so the default value for parent is used, that is, 1. The length and name are parsed out of the string. An id is generated as an auto-incrementing integer

| Performance benchmarking
To assess the performance of the approach, I compared subtree extraction as enabled by DBTree with a naive implementation based on Newick descriptions. The extraction of a subtree from a large, published phylogeny is a very common operation. This is done, for example, when trait data are only available for a subset of the taxa in the tree and these data need to be analysed in a phylogenetic comparative framework.
Such subtree extraction operation is much of the raison d'être for the Phylomatic toolkit (Webb & Donoghue, 2005) and the PhyloTastic project (Stoltzfus et al., 2013); likewise, NCBI provides a web service to extract the 'common tree' from the NCBI taxonomy (Federhen, 2012

| RE SULTS
The substantial results of this study comprise library code and scripts. The library code introduces two namespaces compatible with the standardized class hierarchy for Perl5: • Bio::Phylo::Forest::DBTree: a class containing factory methods for instantiating databases and utility methods for persisting and extracting trees. This subclasses the core tree class in Bio::Phylo and inherits its decorations.
• Bio::Phylo::Forest::DBTree::Result::Node: the generated object-relational mapping class, modified to inherit from the core tree node class of Bio::Phylo. In addition, this class contains several query methods of the sort described in the Introduction.
The scripts are: • Megatree-loader: Newick tree parser/loader • Megatree-ncbi-loader: parser/loader of NCBI taxonomy dump Applying the loader scripts to the trees listed in the Methods resulted in databases that can be queried in SQL (e.g. in the SQLite shell, a 3rd party database browser, or from a scripting environment via a database handle) or using the object library code presented here. I describe in Data Availability how to obtain these generated databases and the tools to make more. As an example of the time it takes to do the latter: indexing the largest tree in the set (and the largest published phylogeny I am aware of), the Open Tree of Life release, took approximately one hour on a current MacBook Pro. This is thus a somewhat costly operation that, mercifully, needs to be run only once.
The subtree extraction benchmarking (see Figure 2) demonstrates that such indexing is an operation that may be worth it. Tiny subtrees of a few dozen tips took DBTree about a second. For small subtrees (≤640 tips), the DBTree implementation returned results in less than 10 s where it took DendroPy over 13 min; for the largest subtree (40,960 tips), DendroPy took over an hour longer to complete than DBTree (~69 min vs. ~138 min). This is not to suggest that there are performance issues with DendroPy per se, which is a very well written, popular, and highly regarded toolkit, but simply to demonstrate the general problem with processing very large Newick strings and loading entire trees in memory.

| D ISCUSS I ON
The concepts, tools and data presented here are intended to make life easier for researchers in computational biology. I would therefore reassure the reader that there is no need to dust off any lingering knowledge of SQL or Perl to be able to take advantage of the outcomes of this study. The databases presented here can be navigated in R by accessing them as data frames and processing them with dbplyr and related tools.
I provide an R Markdown document on the git repository (see Data Availability) that provides a run through of how to operate on the databases, showing how to extract clades, MRCAs, and pairwise distances.
For programming languages where object-relational mapping is a common technique, the schema and databases presented here may form the basis for extending the functionality of some popular toolkits. For example, generating a mapping for Python results in a tiny SQLAlchemy class that, thanks to Python's multiple inheritance model, can subclass DendroPy's tree node model, making persistently databased trees accessible through the same programming interface as memory resident trees. I invite authors of libraries that could take advantage of this to consider this possibility.
The performance of the subtree extraction is such that this common operation is much easier supported on DBTree-indexed trees than on Newick tree files. The implication of this is twofold: (a) projects that release very large phylogenies periodically-such as F I G U R E 2 Tree pruning performance comparison. In this example, sets of taxa of varying sizes (as shown on the x-axis) are randomly sampled and extracted as subtrees from the Open Tree of Life topology. The comparison is between an implementation based on DendroPy that reads the published, Newick version of the tree as made available by the OpenTOL consortium, and an implementation that uses the DBTree-indexed version of the same tree. The latter implementation is made available as the megatree-pruner program in the software release. The running times for both implementations are recorded as the logarithm to base 10 of the real system time in seconds for the respective processes to complete. Values range from less than one second to about two and a half hours. See text for details