treeducken: An R package for simulating cophylogenetic systems

Cophylogenetic methods describe discordance between non‐independent phylogenies. Simulation is necessary for testing cophylogenetic methods, but few simulators exist that are capable of generating data under explicit and biologically meaningful models. We present an R package, treeducken, for simulating host–symbiont evolution and gene‐tree–species‐tree evolution. treeducken provides a straightforward and reproducible interface for simulating cophylogenetic data. This allows easier performance testing of methods and has potential applications in machine learning (ML) and approximate Bayesian computation (ABC) approaches.

received much attention as genomic data have become widely available. The processes of gene duplication, gene loss, lateral gene transfer (LGT) and incomplete lineage sorting (ILS) contribute significantly to the discordance between gene and species trees (Maddison, 1997).
Simulation tools are essential for validating the performance of methods for comparing phylogenetic trees. There are currently few existing cophylogenetic simulators available for generating hostsymbiont datasets. The simulation program, CoRE-PA (Keller-Schmidt et al., 2011), allows two types of events in host trees (cospeciation and host speciation with sorting of symbionts), and two types of events in symbiont trees (host-switching and symbiont speciation). Baudet et al. (2015) and Alcala et al. (2017) each introduced simulation methods that generate symbiont phylogenies along user-defined host trees under forward-time birth-death processes, though, neither approach simultaneously simulates the evolution of both the host and symbiont. There are a number of tools for generating discordant species trees and gene trees, including DLCoalSim (Rasmussen & Kellis, 2012) and SimPhy (Mallo et al., 2016). As noted above, these distinct biological processes of gene-tree/species-tree evolution and host-symbiont evolution produce similar patterns that warrant comparison. However, no existing tool can simulate both gene trees, species trees and cophylogenetic data while accounting for extinction.
Here we present an R package, treeducken, for simulating both cophylogenetic and gene-tree/species-tree data. treeducken currently generates data under two scenarios: the cophylogenetic birth-death model for simulating host-symbiont coevolution and a hierarchical model (Mallo et al., 2016;Rasmussen & Kellis, 2012) for simulating gene-tree/species-tree coevolution. The cophylogenetic birth-death model presented here extends the work of Keller-Schmidt et al. (2011) by allowing for extinction and host-shift or host-switch speciation and the tandem simulation of host and symbiont phylogenies. The hierarchical gene-family model used by treeducken simulates trees under processes implicated in gene-tree/ species-tree discordance. The modular simulation framework provides a user-friendly and reproducible workflow allowing interoperability with existing R packages.

| MODEL
treeducken simulates under two distinct types of processes: hostsymbiont evolution and gene-species evolution. Here, we present the cophylogenetic birth-death model for simulating the tandem evolution of interacting hosts and symbionts ( Figure 1). To simulate gene-species coevolution, we use a hierarchical three-tree model consisting of three levels: species, locus and gene ( Figure 2). These three levels model different causes of gene-tree and speciestree discordance including ILS, LGT, and gene duplication and loss (Rasmussen & Kellis, 2012).

| Cophylogenetic birth-death model
The cophylogenetic birth-death model simulates a pair of phylogenies, the host phylogeny and the symbiont phylogeny, and their ecological interactions or associations. The resulting two trees and extant associations are intended to mimic the data used in cophylogenetic analyses. Ecological interactions are represented by a presence-absence matrix, hereafter the association matrix, with rows representing hosts and columns representing symbionts.
The association matrix determines which hosts and symbionts can evolve together; for example, the model does not allow cospeciation in unassociated hosts and symbionts. The symbiont and host lineages can undergo speciation and extinction independently, and they can speciate together (cospeciation) or go extinct together (coextinction; Figure 1). The user can also set symbiont lineages' maximum number of host associations at any given time. For example, with a host limit of three, a symbiont is only able to be associated with, at the most, three hosts.
The symbiont speciation and extinction rates determine the independent evolutionary history of the symbiont taxa (Figure 1a,b).
Following a symbiont speciation event, both descendant lineages inherit their ancestral associations. The symbiont speciation event described here corresponds to the cophylogenetic event of duplication. The model also includes host-expansion speciation, a special case of symbiont speciation where one of the descendant lineages, chosen at random, gains a novel association in addition to the symbiont's ancestral host repertoire ( Figure 1c). The host-expansion event described here is similar to the spreading event described in Brooks et al. (1991). Users can use the hs_mode = TRUE argument to change the host-expansion rate to the host-switching rate. Here we define host-switching as a symbiont speciation where one descendant gains a randomly chosen novel association and the other descendant inherits the ancestral host repertoire. The symbiont tree has three parameters: symbiont speciation rate S , symbiont extinction rate S and host-expansion rate .
Cospeciation occurs when one host lineage bifurcates and a simultaneous speciation event occurs on one of the host's (randomly selected) symbiont lineages ( Figure 1d). Following cospeciation, each descendant host lineage is then associated with one of the new descendant symbiont lineages. The remainder of the associations of the ancestral lineages of host or symbiont are sorted at random among the host's or symbiont's descendants. The cospeciation rate, C , is a shared parameter between the host and the symbiont tree.
Host speciation-controlled by rate H -describes events where the host speciates independently from its symbiont, that is, no cospeciation occurs (Figure 1e). This is equivalent to the failure-to-diverge event used in many event-based methods (Conow et al., 2010). Following a host speciation event, ancestral associations are sorted randomly on either or both descendant lineages. Host extinction refers to events where the host goes extinct (Figure 1f). If there are symbiont lineages left without a host following host extinction, then coextinction occurs. Host The generating process is a birth-death model with parameters S , S , , C , H and H conditioned on time. This forward-time simulation terminates after a pre-specified amount of time, resulting in two phylogenies and an association matrix. The cophylogenetic birth-death process simulation is performed using a single R function sim_cophyloBD that takes as input all six parameters of the model, the pre-specified amount of time and the number of cophylogenetic datasets to simulate. The function outputs a list containing a host tree, a symbiont tree, the extant association matrix with hosts in rows and symbionts in columns, and a data frame containing all events that have occurred during the simulation.

| Three-tree model
We implemented a hierarchical model, the three-tree model, to allow simulation from the species level down to coalescent sites (Mallo et al., 2016;Rasmussen & Kellis, 2012). The three-tree model has three levels: the species tree, the locus tree and the gene tree (Rasmussen & Kellis, 2012). At the highest level is the The three-tree model in treeducken first simulates the species tree ( Figure 3a) under the birth-death process (Gernhard, 2008;Kendall, 1948) using the general sampling algorithm (GSA) given in Hartmann et al. (2010) for forward simulating a tree to a set number of tips with set birth and death rates. treeducken is also able to simulate species trees to a pre-specified time using the simple sampling algorithm (SSA; Stadler, 2011). For the species tree, users can use host and symbiont trees simulated from the cophylogenetic simulator, or use one of the species-tree simulation functions sim_stBD for the GSA simulation or sim_stBD_t for the SSA simulation.
Next, a locus tree is simulated within the full species tree-that is, the species tree containing records of extinct lineages (Figure 3b). This locus is simulated over the time spanned by the complete species tree under a birth-death process coupled with LGT. Within treeducken, the species tree is used as input along with a gene birth rate, gene death rate, LGT rate and a number of loci to the sim_ltBD function to simulate a set of locus trees under a birth-death process.
Finally, gene trees are simulated backwards in time along the locus tree using a multi-locus coalescent process (Figure 2c; Rasmussen & Kellis, 2012). Gene trees can also be simulated along species trees using the multispecies coalescent process (Rannala & Yang, 2003). In treeducken, locus trees or species trees are used as input for the multi-locus sim_mlc for the multispecies coalescent sim_msc functions. Users are able to set the generation time, mutation rate and effective population size. Each of these gene trees corresponds to a single coalescent independent site within its containing locus.

F I G U R E 2
Three-tree simulation model of treeducken (a) The species tree (black) produced via a forward-time birth-death process. (b) A locus tree (goldenrod) simulated within the species tree using a second forward-time birth-death process coupled with LGTs. Capital letters following the tip name denote the locus identity. (c) A portion of the gene tree is backward simulated using a coalescent process within part of the locus tree (indicated by the blue box). Note that multiple individuals, denoted by subscripts, have been sampled F I G U R E 3 An example of the final output of the cophylogenetic simulation 3 | USAG E

| Description of the R package
treeducken is implemented as an R package (R Core Team, 2020) and is available on CRAN (https://cran.r-proje ct.org/web/packa ges/ treed ucken/ index.html). Validation testing was conducted in R and is available on Github (https://github.com/waded ismuk es/ treed ucken Valid ation). This package makes extensive use of the Rcpp and RcppArmadillo packages to wrap C++ code and improve performance (Eddelbuettel & Sanderson, 2014). In addition to the simulation functions described above, the package builds on previous R phylogenetics packages to provide functions for assistance in determining various simulation parameters, calculating summary statistics and plotting host-symbiont tree sets (Harmon et al., 2007;Revell, 2012). All phylogenies are output in the format of the APE package as full phylogenies containing extinct tips (Paradis & Schliep, 2018). An example cophylogenetic plot is shown in Figure 3. Integration with R allows for straightforward simulation of parameters from statistical distributions, and intuitive integration with other macroevolutionary and cophylogenetic tools (e.g. phytools, PACo; Hutchinson et al., 2017;Revell, 2012).

| Simulating cophylogenetic datasets
To generate cophylogenetic data, we first set the cophylogenetic birth-death process parameters: host speciation and extinction rate, symbiont speciation and extinction rate, host-expansion rate, and cospeciation rate and the pre-specified simulation time. These rates are set relative to time; for example, a speciation rate of 0.1 corresponds to on average 1 speciation every 10 time units.
Cophylogenetic objects are output as a cophy object with many generic functions implemented including summary, plot and print. These objects can be used with functions within treeducken to perform the ParaFit global fit test, or with existing packages for cophylogenetics (Legendre et al., 2002). We provide an example of using treeducken with the paco package (Hutchinson et al., 2017).

| Simulating under the three-tree model
In addition to the cophylogenetic simulations, treeducken can simulate using the three-tree model to simulate gene-tree and speciestree discordance. For instance, we can use the host tree from the cophylogenetic example above to simulate a locus tree with gene birth and death rates. Then we use the locus tree and set mutation rate, generation time and population size to simulate under the multilocus coalescent. We can perform a similar simulation with the sym-

| CON CLUS ION
treeducken adds to the phylogenetic analysis and simulation toolbox available in the R programming language (a list of those existing on the CRAN R project can be found here: https://cran.rproj ect.org/ web/views/ Phylo genet ics.html). As a tree simulator, treeducken outputs phylogenetic trees in the format of the APE package, allowing interoperability with many other packages in the R ecosystem.
This package fills a needed gap in the available phylogenetic tools in R by providing a straightforward means of simulating cophylogenetic data under a variety of models. treeducken allows for a high degree of flexibility and complexity in simulations that make it straightforward to use for testing the performance of phylogenetic methods. We intend to improve this tool and increase its usability and functionality for empiricists and theoreticians using cophylogenetic methods.

PE E R R E V I E W
The peer review history for this article is available at https://publo ns. com/publo n/10.1111/2041-210X.13625.