Stochastic population processes may cause differences between species histories and gene histories. These processes are assumed to only influence the most recent divergences in the tree of life; however, there may be underappreciated potential for microevolutionary processes to impact deep divergences. I used multispecies coalescent models to determine the impact of stochastic processes on deep phylogenomic histories. Here I show phylogenomic discordance between gene histories and species histories is expected at deep divergences for many eukaryotic taxa, and the probability of discordance increases with population size, generation time, and the number of species in the tree. Five eukaryotic clades (angiosperms, birds, harpaline beetles, mammals, and nymphalid butterflies) demonstrate significant discordance potential at divergences over 50 million years old, and this discordance potential is independent of the age of divergence. These findings demonstrate population processes acting over very short timescales will leave a lasting impact on genomic histories, even for divergence events occurring tens to hundreds of millions of years ago.
The tree of life is largely a product of microevolutionary processes governing gene frequencies, which ultimately produce the patterns of species relationships. Most estimates of the tree of life implicitly assume the pathway of information transfer via genetic material is identical to the true species phylogeny (Dunn et al. 2008; Hackett et al. 2008; Regier et al. 2010). This assumption is based on the idea that successive sampled speciation events occur far enough apart in time to eliminate the potential for different histories among different genes (Maddison 1997; Hudson and Coyne 2002; Edwards 2009). With sufficient time between divergences, population-level stochastic processes will be unlikely to produce gene histories that differ from the actual species histories. However, considering the time between many speciation events is relatively short (Venditti et al. 2010), there is considerable potential for phylogenomic discordance across all time scales. This potential for discordance at ancient divergences has received little attention until recently (Chiari et al. 2012; Crawford et al. 2012; Faircloth et al. 2012; McCormack et al. 2012; Song et al. 2012).
Multispecies coalescent models accommodate population-level processes, and allow direct assessment of potential discord between species histories and gene histories (Degnan and Rosenberg 2009; Liu et al. 2009). Here I use multispecies coalescent models to determine the extent to which short-term, population-level processes affect gene histories in deep time. Based on neutral coalescent models, I first measure the impact population size, generation time, and sampling intensity have on the probability of discordance at divergences occurring over 50 million years ago. I compare these models with empirical estimates of population parameters to determine the biological relevance of model results. Finally, I investigate potential for ancient discordance in angiosperms, birds, beetles, mammals, and butterflies.
MODELING ANCIENT DISCORDANCE PROBABILITIES
To assess the potential for gene histories and species histories to conflict at deep timescales, I modeled species histories under a variety of conditions, simulated gene trees under each model, and compared gene histories to the known species histories. For each model, I first simulated species trees with 5000 contemporary species, then simulated 100 coalescent gene trees on each species tree in Mesquite (Maddison and Maddison 2011). Each model is characterized by three parameters: the effective population size (Ne), the generation time (in years), and the number of sampled species. To cover a broad range of empirical estimates of parameter values (Table S1), Ne took values of 5 × 10^1, 5 × 10^2, 5 × 10^3, 5 × 10^4, 5 × 10^5, or 5 × 10^6; generation times were 1 or 10 years per generation; and five, 50, or 500 species were sampled. I used a full factorial combination of these parameter values, for a total of 36 different models; each model was replicated 100 times. Each species history was simulated with a model of uniform speciation with a total tree depth of 100 million years; gene trees were simulated under the neutral coalescent model without migration (Hudson 1990).
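The factorial design described above can be sketched in a few lines of Python. This is only an illustration of how the 36 parameter combinations arise, not the Mesquite workflow itself; the function names are hypothetical, and the conversion to coalescent units assumes a diploid population (units of 2Ne generations), which the text does not state.

```python
from itertools import product

# Parameter values as listed in the text.
NE_VALUES = [5 * 10**k for k in range(1, 7)]  # 50, 500, ..., 5,000,000
GENERATION_TIMES = [1, 10]                    # years per generation
SAMPLED_SPECIES = [5, 50, 500]                # species sampled from 5000


def build_models():
    """Return every (Ne, generation time, sampled species) combination."""
    return [
        {"ne": ne, "gen_time": g, "n_sampled": n}
        for ne, g, n in product(NE_VALUES, GENERATION_TIMES, SAMPLED_SPECIES)
    ]


def years_to_coalescent_units(years, ne, gen_time):
    """Convert a duration in years to coalescent units.

    Assumption (not from the text): diploid organisms, so one
    coalescent unit equals 2 * Ne generations.
    """
    return years / (gen_time * 2 * ne)


models = build_models()
print(len(models))  # full factorial: 6 * 2 * 3 combinations
```

Expressing tree depth in coalescent units makes the effect of the parameters easier to see: for a fixed depth in years, larger Ne or longer generation times compress the tree into fewer coalescent units, leaving less opportunity for lineages to coalesce between divergences.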
Gene trees and species trees can conflict in a variety of ways (Maddison 1997), but I focus on differences that arise due to incomplete lineage sorting (ILS). Discordance due to ILS occurs when, from a time-backwards perspective, gene lineages from sister branches of the species tree fail to coalesce before the next (older) divergence event in the species history (“hemiplasy” of Avise and Robinson 2008). I have categorized ILS discordances into two types: temporal discordance and topological discordance (Fig. 1). Temporal discordance includes all discordances due to ILS, where the timing of divergences in the gene tree at a particular node actually predates earlier divergences in the species tree (Fig. 1A). Temporally discordant nodes in the gene tree may have a topology identical to the true species history. The second category, topological discordance (Fig. 1B), occurs when the topology of the gene tree differs from the topology of the species history. All topological discordances are by definition temporally discordant as well, but not all temporal discordances are topologically discordant.
I measured this potential for ancient discordances by estimating the probability that the most common gene history for a branch in the species tree differed from the true species history (“modal discordance” hereafter). A branch-specific gene history is a unique pattern of (1) the relationships among the gene lineages within the branch of the species tree, where a gene lineage is defined by the terminal taxa descendent of that lineage and (2) the order and position in the species tree (i.e., branch) where these gene lineages coalesce. For example, there is one branch-specific gene history defined by node X in the species tree with two descendent lineages, X1 and X2, that is concordant with the species tree: all the terminal descendants of X1 coalesce more recently than X (regardless of the order in which they coalesce), all the terminal descendants in X2 coalesce more recently than X (regardless of the order in which they coalesce), and X1 and X2 coalesce in the branch immediately subtending X in the species tree. If any branch-specific gene history that does not meet both criteria is more common than the history that matches the species tree, then that node is characterized by modal discordance. Note that because this discordance is measured at the scale of a single branch in the species tree, there can be multiple fully specified gene trees that count toward the same branch-specific gene history for a particular node in the species tree (Fig. S1). If a node in the species tree is characterized by modal discordance, it means a majority of the genome will have a history different from the history of the species descendent from that node. Measuring the probability of modal discordance, where the most common branch-specific gene history for a node is discordant from the true species history, requires enumeration of all observed coalescent histories for each node in the species tree.
Modules for counting the number of modal discordant nodes, and identifying nodes in the species tree where modal discordances occur are available in the AUGIST package for Mesquite (Oliver 2012).
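The modal discordance criterion can be illustrated with a short Python sketch. It assumes each simulated gene tree has already been reduced to a label for its branch-specific gene history at a node; the labels, function names, and data layout are hypothetical and do not reflect the AUGIST implementation.

```python
from collections import Counter


def is_modally_discordant(histories, concordant_label="concordant"):
    """Return True if any discordant branch-specific gene history is
    strictly more common than the history matching the species tree.

    `histories` holds one (hypothetical) history label per gene tree.
    """
    counts = Counter(histories)
    concordant = counts.get(concordant_label, 0)
    return any(
        count > concordant
        for label, count in counts.items()
        if label != concordant_label
    )


def count_modally_discordant_nodes(histories_by_node):
    """Count nodes of the species tree characterized by modal discordance."""
    return sum(is_modally_discordant(h) for h in histories_by_node.values())


# Illustrative input: 100 gene trees summarized at two nodes.
example = {
    "node_X": ["concordant"] * 40 + ["alt_A"] * 45 + ["alt_B"] * 15,
    "node_Y": ["concordant"] * 40 + ["alt_A"] * 30 + ["alt_B"] * 30,
}
print(count_modally_discordant_nodes(example))
```

In this example only `node_X` is counted: at `node_Y` the concordant history is still the single most common one, even though it accounts for fewer than half of the gene trees.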
To evaluate the relevance of these models, it is necessary to determine which models reflect natural populations. I first surveyed 112 published studies that provided 287 estimates of contemporary effective population size and generation time. These studies include examples from major lineages in crown Eukaryota, including vertebrates, invertebrates, plants, and fungi (Table S1). All studies used coalescent models of genetic diversity to estimate effective population size (Kuhner et al. 1998; Beerli and Felsenstein 2001; Nielsen and Wakeley 2001; Rannala and Yang 2003; Hey and Nielsen 2004, 2007; Kuhner 2006; Becquet and Przeworski 2007), although specific approaches varied among studies.
ANCIENT DISCORDANCE IN FIVE EUKARYOTE CLADES
As a further assessment of the potential for differences between gene histories and species histories in deep evolutionary time, I investigated discordance probabilities in five major clades of eukaryotes. I used published studies for divergence estimates in angiosperms (Bell et al. 2010), modern birds (Hackett et al. 2008), harpaline beetles (Ober and Heider 2010), mammals (Meredith et al. 2011), and nymphalid butterflies (Wahlberg et al. 2009). These five groups are ideal representatives because they have relatively well-sampled (over 100 species) phylogenomic histories with divergence time estimates, and they each originated approximately 100–200 million years ago. Taking these phylogenetic estimates as given, I used them as fixed species histories to simulate gene histories and measure the potential for discordance in these five groups. For each replicate, I simulated 100 gene trees using the neutral coalescent model without migration. For each of the five groups, I performed 100 simulation replicates; to reflect the variation in effective population size and generation time observed in empirical studies (Table S1), in each replicate, each branch of the species history was randomly assigned a pair of values for Ne and generation time, based on empirical values in Table S1. Parameter values for each of the five clades were only sampled for an appropriate subsample of taxa included in the table (i.e., modern bird parameter values were only sampled from bird species listed in Table S1).
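The per-branch parameter assignment described above can be sketched as follows. The branch names and the (Ne, generation time) pairs are placeholders chosen for illustration only; they are not values from Table S1, and the function name is hypothetical.

```python
import random


def assign_branch_parameters(branches, empirical_pairs, rng):
    """Assign each branch one (Ne, generation time) pair drawn at random,
    with replacement, from clade-appropriate empirical estimates.

    Ne and generation time are kept together as a pair, so the empirical
    association between the two parameters is preserved.
    """
    return {branch: rng.choice(empirical_pairs) for branch in branches}


# Placeholder inputs (NOT values from Table S1).
branches = ["branch_1", "branch_2", "branch_3", "branch_4"]
placeholder_pairs = [(1.0e4, 2.0), (2.5e5, 1.0), (8.0e5, 3.0)]

rng = random.Random(42)  # fixed seed so a replicate can be reproduced
for replicate in range(3):
    assignment = assign_branch_parameters(branches, placeholder_pairs, rng)
    print(replicate, assignment)
```

Each call produces one replicate's assignment; repeating it 100 times per clade, with a fresh draw for every branch, would mirror the replication scheme described in the text.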
To measure the probability of perfect concordance of individual gene trees at ancient nodes, I used the results of simulations above to count the number of gene trees in each replicate that had identical topologies to all ancient divergences in the species tree. The probability of perfect topological concordance at ancient divergences was estimated as the frequency of gene trees with topologies identical to all divergences in the species tree over 50 million years old (i.e., the mean number of gene trees with ancient topologies identical to the species tree divided by the number of replicates [in this case, 100]). A module for estimating the probability of complete gene tree concordance from a frequency distribution of gene trees for a given species tree is available in the AUGIST package for Mesquite (Oliver 2012).
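The estimate described above is a simple frequency, sketched here in Python. The function name and the input layout (one count of perfectly concordant gene trees per replicate) are assumptions for illustration, not the AUGIST module's interface. The divisor is 100, which in this design is both the number of gene trees per replicate and the number of replicates.

```python
def perfect_concordance_probability(concordant_counts, trees_per_replicate=100):
    """Estimate the probability that a gene tree matches the species tree
    at all ancient divergences.

    `concordant_counts` holds, for each replicate, the number of simulated
    gene trees whose topology matched every divergence over 50 million
    years old.
    """
    if not concordant_counts:
        raise ValueError("at least one replicate is required")
    mean_count = sum(concordant_counts) / len(concordant_counts)
    return mean_count / trees_per_replicate


# Illustrative counts for four replicates (not results from the study).
print(perfect_concordance_probability([0, 1, 2, 1]))
```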
Results and Discussion
MODELING ANCIENT DISCORDANCE
Over a range of population size, generation time, and sampling density values, I modeled gene tree evolution within a given species tree, and compared genomic history to the true species history. The probability of modal discordance at divergences over 50 million years ago increased with population size, generation time, and sampling density (Fig. 2A) and these relationships generally hold when topology alone is considered (Fig. 2B). Additional work is necessary to determine if these topological discordances are due to real “anomaly zones” (Degnan and Rosenberg 2006) in ancient divergences. For an explanation of the nonmonotonic response of topological modal discordance to increasing population size in some models (Fig. 2B), see Figure S2.
The models explored a broad distribution of parameter space, but empirical estimates of population size and generation time demonstrate eukaryotic diversity is not uniformly distributed across this parameter space. The joint frequency distribution of effective population size and generation time is characterized by a peak corresponding to an effective population size of 0.5–1 million individuals and a generation time of 1 year (gray circles in Fig. 2C). For this combination of population size and generation time, the mean number of temporally modal discordant nodes at divergences occurring over 50 million years ago in a clade of 5000 species originating 100 million years ago is 5.19, 1.76, and 0.09 for 500, 50, and five sampled species, respectively, and the mean number of topologically modal discordant nodes is 0.83, 0.3, and 0.02 for 500, 50, and five sampled species, respectively. Thus, for a majority of eukaryotes, population processes underlying gene histories will likely lead to some deep divergences characterized by a majority of the genome having a different history than that of the species, due to successive speciation events occurring in short periods of time.
An additional result, unexpected yet intuitive, is evident in model comparisons: increased sampling density increases the probabilities of ancient discordances. This result runs counter to the general philosophy of increasing sampling density to increase phylogenetic accuracy (Hillis 1996). Although increased taxon sampling can alleviate some problems in phylogenetic inference, such as long-branch attraction (Hendy and Penny 1989), increased sampling also has the potential to increase the chance of misleading inferences. As more species are sampled, times between sampled speciation events will become shorter (Fig. 2D), increasing the potential for discordance between species and gene histories. This phenomenon will have the greatest impact on accurately inferring phylogenies from sparse supermatrices, which are characterized by very dense taxon sampling along with many missing data (Sanderson et al. 2011).
The results also demonstrate how diversification rates will influence the probability of ancient discordances. Because all models were simulated under uniform speciation with a constant tree depth, comparing the different sampling efforts illustrates the impact diversification rate has on the probability of discordance; the models with five, 50, or 500 sampled species are equivalent to diversification rates of 0.0092, 0.0322, and 0.0552, respectively (eq. 3 of Magallón and Sanderson 2001). As expected, models with low diversification rates (e.g., the five-species models) show considerably less discordance than models with high diversification rates (e.g., the 500-species models) (Fig. 2A and B). This result is expected because low speciation or high extinction rates produce longer times between splitting events, decreasing the probabilities of discordance, whereas high speciation or low extinction rates will result in shorter times between divergences, increasing the probability of discordance. The distributions of branch lengths in Figure 2D illustrate how increasing diversification rate (i.e., increasing from five to 50 to 500 sampled species) will result in shorter times between divergence events. Conventional wisdom holds that background extinction rates will be high enough to eliminate short branches at deep divergences. However, in the simulations presented here, even the high diversification rate (corresponding to the 500 species model) is relatively low compared with empirical estimates (e.g., Magallón and Sanderson 2001). Because many real clades experience diversification rates as high or higher than those in the 500 species models (and thus they have a lower relative extinction rate than the 500 species models), there is little support for the assumption that extinction will be the salvation for deep phylogenetic inference in many empirical datasets.
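The three diversification rates quoted above can be reproduced with the pure-birth crown-group estimator r = (ln n − ln 2) / t, using n sampled species and a tree depth of t = 100 million years. That this estimator corresponds to the cited equation is an assumption based on the match with the reported values; the function name is hypothetical.

```python
import math


def crown_diversification_rate(n_species, crown_age_my):
    """Pure-birth (no extinction) net diversification rate for a crown
    group: r = (ln n - ln 2) / t, in events per lineage per million years.
    """
    if n_species < 2 or crown_age_my <= 0:
        raise ValueError("need at least 2 species and a positive age")
    return (math.log(n_species) - math.log(2)) / crown_age_my


for n in (5, 50, 500):
    print(n, round(crown_diversification_rate(n, 100.0), 4))
# 5 0.0092
# 50 0.0322
# 500 0.0552
```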
ANCIENT DISCORDANCE IN FIVE EUKARYOTE CLADES
Based on the potential for significant discordance between species and genome histories in deep time, I measured the potential for deep discordance in five eukaryote clades: angiosperms (Bell et al. 2010), modern birds (Hackett et al. 2008), harpaline beetles (Ober and Heider 2010), mammals (Meredith et al. 2011), and nymphalid butterflies (Wahlberg et al. 2009). All five show potential for discordances in divergences occurring prior to 50 million years ago (Fig. 3). Most of the differences between genome and species histories were due to coalescent time differences (temporal discordance, Fig. 1A) (Edwards 2009), although there are cases where the source of discordance was due to topological differences between species trees and the contained gene trees (topological discordance, Fig. 1B). The probability of perfect topological concordance at ancient divergences of any one gene tree with the species tree is less than 0.01 for most of the five clades (Table 1).
Table 1. Perfect concordance at deep divergences.
Values indicate the average estimated probability that an individual gene tree will be topologically identical to the species tree at all divergences occurring over 50 million years ago.
The causes of discordance are common among all five groups: short times separating divergence events lead to differences between gene histories and species histories (Fig. 4A) (Degnan and Rosenberg 2006). These results are expected, given the positive relationship between the probability of coalescence in any branch of the species tree and the length of that branch (Rosenberg 2002). Also expected, but less acknowledged, is that these discordances can occur at any point in these five clades’ histories (Fig. 4B). With the exception of the nymphalid dataset, all have at least some ancient successive divergence events close enough in time to expect over 25% of the genome to be topologically discordant from the true species history. For example, the lineage uniting bats with ungulates and whales is estimated to have only existed for 440,900 years (Meredith et al. 2011) before the split between Chiroptera and a clade of Perissodactyla and Cetartiodactyla. On average, the probability of gene trees with topologies alternative to this sister-group relationship is 0.36 (yellow point indicated by arrow in Fig. 4), and multispecies coalescent analyses do not support the Chiroptera + (Perissodactyla + Cetartiodactyla) clade (McCormack et al. 2012; Song et al. 2012). Additionally, even some nodes supported in phylogenetic analyses are characterized by nontrivial probabilities of topological discordance. For example, the avian clade uniting trogons with kingfishers and woodpeckers, to the exclusion of the cuckoo-roller, is supported in supermatrix analyses (Hackett et al. 2008), but over 27% of the gene trees are expected to be topologically discordant (red point indicated by arrow in Fig. 4). Given the potential for discordance, these inferred ancient divergences may be spurious byproducts of stochastic population processes, rather than reflections of true evolutionary history.
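The relationship between branch length and discordance can be illustrated with the standard three-taxon result from coalescent theory: for an internal branch of length T coalescent units, two lineages fail to coalesce within the branch with probability e^(-T), and the gene tree topology is discordant with probability (2/3)e^(-T). The sketch below is a textbook simplification, not the simulation procedure used here; it assumes diploid organisms (one coalescent unit = 2Ne generations), and the Ne and generation time values are hypothetical.

```python
import math


def coalescent_units(branch_years, ne, gen_time_years):
    """Branch length in coalescent units, assuming diploidy (2 * Ne)."""
    return branch_years / (gen_time_years * 2.0 * ne)


def p_no_coalescence(branch_years, ne, gen_time_years):
    """Probability that two lineages fail to coalesce within the branch."""
    return math.exp(-coalescent_units(branch_years, ne, gen_time_years))


def p_discordant_three_taxon(branch_years, ne, gen_time_years):
    """Probability that a three-taxon gene tree is topologically discordant.

    If the two sister lineages fail to coalesce in the internal branch,
    three lineages enter the ancestral branch and two of the three equally
    likely first coalescences yield a discordant topology.
    """
    return (2.0 / 3.0) * p_no_coalescence(branch_years, ne, gen_time_years)


# Branch duration from the text; Ne and generation time are hypothetical.
branch_years = 440_900
for ne in (1e4, 1e5, 1e6):
    p = p_discordant_three_taxon(branch_years, ne, gen_time_years=5.0)
    print(f"Ne={ne:.0e}: P(discordant) = {p:.3f}")
```

The probability approaches its maximum of 2/3 as the branch becomes short relative to 2Ne generations, which is why short internodes retain discordance potential regardless of how long ago they occurred.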
Although the potential for significant discordance between gene histories and species histories for recently diverged taxa is currently appreciated (Pollard et al. 2006), the results presented here demonstrate a significant portion of eukaryotic diversity will be characterized by phylogenomic discordance at some deep divergences in the tree of life. Thus, the implicit assumption that species histories and gene histories are identical at deep nodes in the tree of life is not supported. This discordance has the potential to mislead evolutionary inferences when microevolutionary processes are not accounted for, as in the common approach of concatenating data in large supermatrix analyses (Degnan and Rosenberg 2006; Kubatko and Degnan 2007). Such population-level processes may thus confound our understanding of evolutionary history if they are ignored, although other stochastic processes (e.g., mutation) may reduce the impact of gene tree discordance on phylogenetic inference (Huang and Knowles 2009; Huang et al. 2010). Even in cases where the most common gene-tree topology matches the true species history, temporal discordances, where ancestral polymorphisms persist through multiple speciation events, may influence divergence time estimation (Arbogast et al. 2002; Jennings and Edwards 2005; McCormack et al. 2010). Furthermore, the potential for discordance between gene trees and species histories due to ILS will increase as taxon sampling increases. Finally, the information transfer of genetic variation underlying biological diversity will have a considerably more complicated history than the canonical tree of life, and these complications promise to increase with the rapid accumulation of genomic data. Promising recent work on datasets with low taxon sampling illustrates the tractability of incorporating microevolutionary processes into analyses of deep divergences (Chiari et al. 2012; Crawford et al. 2012; Faircloth et al. 2012; McCormack et al. 2012; Song et al. 2012). Novel phylogenomic approaches for large datasets should provide reasonable estimates of deep evolutionary history while accommodating the underlying population processes giving rise to the tree of life.
The author thanks J. M. Beaulieu, M. J. Donoghue, A. Dornburg, and D. R. Maddison for helpful comments on previous versions of this work. L. S. Kubatko and four anonymous reviewers also provided helpful feedback on an earlier manuscript draft. Additionally, C. D. Bell, K. A. Ober, M. S. Springer, and N. Wahlberg kindly provided eukaryotic phylogenetic estimates for use in simulations.