Keywords:

  • Fossil;
  • genome;
  • macroevolution;
  • Neanderthal;
  • phylogeography;
  • polytomy

Abstract

  1. Top of page
  2. Abstract
  3. What Is Phylogeny, and How Do We Infer It from Sequence Data?
  4. A Brief History of Species Trees
  5. The Logic of the Species Tree Approach
  6. The Future: Simulations, Sampling, Species, and SNPs
  7. Conclusion—The Relevance of Species Trees
  8. ACKNOWLEDGMENTS
  9. LITERATURE CITED

The advent and maturation of algorithms for estimating species trees—phylogenetic trees that allow gene tree heterogeneity and whose tips represent lineages, populations and species, as opposed to genes—represent an exciting confluence of phylogenetics, phylogeography, and population genetics, and usher in a new generation of concepts and challenges for the molecular systematist. In this essay I argue that to better deal with the large multilocus datasets brought on by phylogenomics, and to better align the fields of phylogeography and phylogenetics, we should embrace the primacy of species trees, not only as a new and useful practical tool for systematics, but also as a long-standing conceptual goal of systematics that, largely due to the lack of appropriate computational tools, has been eclipsed in the past few decades. I suggest that phylogenies as gene trees are a “local optimum” for systematics, and review recent advances that will bring us to the broader optimum inherent in species trees. In addition to adopting new methods of phylogenetic analysis (and ideally reserving the term “phylogeny” for species trees rather than gene trees), the new paradigm suggests shifts in a number of practices, such as sampling data to maximize not only the number of accumulated sites but also the number of independently segregating genes; routinely using coalescent or other models in computer simulations to allow gene tree heterogeneity; and understanding better the role of concatenation in influencing topologies and confidence in phylogenies. By building on the foundation laid by concepts of gene trees and coalescent theory, and by taking cues from recent trends in multilocus phylogeography, molecular systematics stands to be enriched. Many of the challenges and lessons learned for estimating gene trees will carry over to the challenge of estimating species trees, although adopting the species tree paradigm will clarify many issues (such as the nature of polytomies and the star tree paradox), raise conceptually new challenges, or provide new answers to old questions.

The title of this essay is borrowed from one of the famous essays written by Stephen Jay Gould, “Is a new and general theory of evolution emerging?”, published in Paleobiology in 1980 (Gould 1980). Gould was speculating as to whether the constellation of observations and trends from the fossil record and developmental biology, collectively known as “macroevolution,” might constitute a genuinely new set of phenomena, a set that had not been covered adequately by the reigning paradigm of Darwinian microevolution. Of course, whether one answers Gould's question in the affirmative or negative depends on one's perspective; although Gould and others would not have raised the question unless one could answer “yes,” many evolutionary biologists have argued that the quantitative framework provided by microevolution can adequately account for the observations of punctuation, stasis, and apparent saltation that had suggested a new paradigm to some (Charlesworth et al. 1982; Smith 1983; Estes and Arnold 2007). Yet there is a pervasive feeling that the paradigms laid down by the Modern Synthesis still may not adequately capture the plethora of phenomena ushered in by modern evolutionary biology (Erwin 2000; Pigliucci 2007). Although the paradigm that I question is more limited in scope than Gould's, in a similar spirit I raise the question of whether molecular phylogenetics is experiencing an important conceptual shift, one that may affect the daily practice of phylogeny building as well as the relationship between systematics and other evolutionary disciplines. The developments I will review are indeed new in a practical sense, yet they mark a return to the goals and concepts that have been in the back of systematists' minds for many decades (Felsenstein 1981; Neigel and Avise 1986; Takahata 1989; Avise 1994; Maddison 1997; Yang 1997). Put simply, the response to Joe Felsenstein's oft-quoted complaint that “Systematists and evolutionary geneticists do not often talk to each other” (Felsenstein 1988: 445) is, I think, finally maturing and reaching fruition, and thus it is an opportune time to reflect on this new interdisciplinary dialogue and to forecast what might lie ahead.

What Is Phylogeny, and How Do We Infer It from Sequence Data?

One of my favorite essays in systematics, with one of my favorite essay titles, is the paper by Rod Page and the late Joe Slowinski, innocently entitled “How do we infer species phylogenies from sequence data?” (Slowinski and Page 1999). In it they argued cogently for a distinction between gene and species trees and outlined ways to estimate the latter. As obvious as the answer to these questions may seem to some, they are worth raising again, if only to reiterate an answer so simple that we sometimes overlook it. Phylogeny is the history of species and populations. It records the branching pattern of evolving lineages through time. One of the grand missions of systematics is to reconstruct and provide details on the great Tree of Life. As difficult as it may be for modern methodologies to reconstruct this history, and as fraught with reticulations, hybridization events, horizontal gene transfer, and other mechanisms that cloud the picture of organismal history, it is important to reiterate that, at the level of populations and species, there is only one such history, even when reticulate. With species and populations as the focus, there is no heterogeneity in this demographic history, because the history has happened only once.

In pursuit of the goal of reconstructing the history of life, the core approaches of phylogenetic systematics have evolved into a suite of methodologies that focus on amassing character data to build these trees (Nei and Kumar 2000; Felsenstein 2003; Delsuc et al. 2006), and the leading role of DNA sequences in providing these characters over the past several decades has helped to invigorate systematics and to provide many fruitful intersections with other biological disciplines, such as genomics and molecular ecology. Yet the use of DNA sequences has also led to challenges in the translation of histories of DNA sequence diversity—phylogenetic trees of genes and alleles—into the currency that is surely still the major focus of systematics—phylogenetic trees of species and populations, or species trees. Ultimately these challenges arise because genes and species are different entities, assuming different levels in the biological hierarchy (Avise and Wollenberg 1997; Doyle 1997; Maddison 1997; Avise 2000). The diversity recovered in our surveys of DNA sequence evolution within and between species is ultimately an indirect and incomplete window into the history of species, precisely because species are, by most definitions, evolving lineages that comprise many genes, each found in many individuals. The fact that species comprise a higher level of biological organization than do genes ensures that the program of systematics will be incomplete until phylogenetic methods make a clear distinction between gene trees and species trees and explicit reference to the phylogenetic relationships of species within which genes are embedded. The overwhelming dominance of molecular data in systematics and phylogenomics makes the development of methods for estimating species trees a key, if not the key, task for the years ahead.

CAUSES OF GENE TREE HETEROGENEITY AND THE UBIQUITY OF COALESCENT EFFECTS

The causes of gene tree heterogeneity and of gene tree/species tree conflicts are by now well known to molecular systematists and nicely summarized, for example, in Maddison's 1997 review (Maddison 1997). The three primary causes—gene duplication, horizontal gene transfer, and deep coalescence—have varying levels of importance depending on the taxa and genes under study. Horizontal gene transfer is a well-known and common cause of discordance in the microbial world—so much so that some microbial phylogeneticists have questioned whether a coherent Tree of Life exists for microbes (e.g., Bapteste et al. 2005; Doolittle and Bapteste 2007). Gene duplication is in some taxa also common and widespread, and can subvert phylogenetic analysis if it is not recognized; alternatively gene duplication can provide a rich source of information for phylogenetic analysis (e.g., Page and Charleston 1997; Rasmussen and Kellis 2007; Sanderson and McMahon 2007).

Deep coalescence, the third major cause of gene tree heterogeneity and gene tree/species tree conflicts, is distinct in so far as its occurrence is, in principle, much more widespread, depending not on specific, molecular events that occur only in some lineages (and whose consequences can be avoided by appropriate gene sampling), but on the intrinsic properties of every population. The root cause of deep coalescence is the rate of genetic drift—deep coalescence will be more prevalent when the rate is low (due to large populations) compared to the length of internodes in the species tree. Thus deep coalescence is in principle detectable in any taxonomic group, and for any gene, whether in an organelle or the nucleus, provided that the branch lengths in the underlying species tree are sufficiently short as measured in coalescent units (Maddison 1997; Hudson and Turelli 2003). (Coalescent units are calculated as the ratio of the length of internodes in the species tree as measured in generations over the effective population size, as measured in individuals, of ancestral species during those internodes.) Thus deep coalescence knows no taxonomic or gene bias, as do the other two phenomena, and thus holds a special place in the triumvirate of causes of gene tree heterogeneity. Indeed, empirical examples of deep coalescence or incomplete lineage sorting are now routine and taxonomically ubiquitous (for recent examples, see Satta et al. 2000; Jennings and Edwards 2005; Patterson et al. 2006; Pollard et al. 2006; Hobolth et al. 2007; Wong et al. 2007).
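
The coalescent-unit calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a published method: the function names and the example numbers are hypothetical, and the sketch assumes a diploid population, so internode length is divided by 2N rather than N (the simpler ratio in the text differs only by that constant factor). It then evaluates the standard three-taxon result that a gene tree matches the species tree with probability 1 - (2/3)exp(-T), where T is the internode length in coalescent units.

```python
import math

def coalescent_units(generations, effective_size, ploidy=2):
    """Internode length in coalescent units.

    Assumption: one coalescent unit equals ploidy * Ne generations
    (2N for a diploid nuclear locus).
    """
    return generations / (ploidy * effective_size)

def prob_concordant_three_taxa(t_units):
    """Probability that a three-taxon gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t_units)

# Hypothetical internode lengths (in generations) for Ne = 50,000.
for gens in (1_000, 100_000, 1_000_000):
    t = coalescent_units(gens, effective_size=50_000)
    print(gens, round(t, 3), round(prob_concordant_three_taxa(t), 3))
```

Short internodes or large ancestral populations both push T toward zero, where the probability of concordance falls toward one third, which is why deep coalescence shows no taxonomic or gene bias.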

Yet there is a fourth and even more widespread cause of gene tree heterogeneity, if not gene tree/species tree “conflict,” one that influences branch lengths only but causes heterogeneity nonetheless. I suggest branch length heterogeneity due to the coalescent process as a useful additional source of gene tree heterogeneity (Fig. 1). Branch length heterogeneity specifically highlights the heterogeneity of branch lengths of gene trees in situations when all gene trees are topologically identical, such as will occur when the underlying species tree has branch lengths that are long in coalescent units; by contrast, deep coalescence emphasizes the heterogeneity of gene tree topologies. Branch length heterogeneity is a useful concept for systematists because it highlights the fact that, even when gene trees are topologically identical—a situation in which most systematists would feel comfortable in combining data through concatenation and other traditional means—there can be significant and detectable heterogeneity in branch lengths, such that the gene trees are for practical purposes still heterogeneous. Such collections of trees that vary only in branch length have drawn the attention of systematists because of the related issue of heterotachy, the change in rate of characters or sites over time (Kolaczkowski and Thornton 2004, 2008), and also because they can generate unexpected phylogenetic signals in DNA sequence datasets (Kolaczkowski and Thornton 2004; Matsen and Steel 2007). In fact, branch length heterogeneity and deep coalescence are ends of a continuum, and the latter is really an expression of the former in the limit as gene tree topologies begin to depart from the species tree. But branch length heterogeneity is indeed ubiquitous, due to the finiteness of all populations. It will occur to varying extents in all taxa, genes, and contexts, even in situations in which deep coalescence is not occurring. Thus, branch length heterogeneity from gene to gene is probably the most common of all causes of gene tree heterogeneity. In addition, branch length heterogeneity could be a potent source of phylogenetic inconsistency in real datasets; like deep coalescence, it essentially introduces mixtures of gene trees into datasets, a situation that is known to mislead phylogenetic analysis (Mossel and Vigoda 2005). Matsen and Steel (2007) have recently shown that DNA sequences simulated from mixtures of topologically identical but branch-length variable trees can in some cases mimic signals from a topologically different tree very well; thus the problem of branch length heterogeneity is in principle a serious one for empiricists, although whether this is the case empirically is not known.

Figure 1. Distinction between deep coalescence and branch length heterogeneity as sources of gene tree heterogeneity and gene tree/species tree conflict. Example species trees are shown at the top, with constituent gene trees in the bottom row; taxa from which gene trees are sampled are given as A, B, C, and D. Whereas deep coalescence emphasizes topological differences between gene and species trees, branch length heterogeneity emphasizes branch length differences between gene and species trees and variation among genes in branch length, without topological variation. Branch length heterogeneity is ubiquitous and will be important for impacting site distributions in DNA sequences when effective population sizes and species tree branch lengths are large enough to permit substantial variation in coalescence times without deep coalescence. Heterogeneity in branch lengths among constituent gene trees is indicated by the dashed lines in the lower right panel.

The extent of variation in branch lengths due to branch length heterogeneity will be a function of the effective population size of ancestral populations, scaling with its square. θ is a measure of effective population size as measured in DNA substitutions; θ= 4Nμ, where N is the effective population size and μ the mutation rate, and can be easily calculated, at least for extant populations, from DNA sequence data. We expect that branch lengths will vary from gene to gene sometimes by hundreds of thousands of years, if not millions of years (Lynch and Jarrell 1993; Edwards and Beerli 2000), if estimates of θ from extant natural populations are any guide. Most researchers use a single topology when using sequence simulation packages such as SeqGen (Rambaut and Grassly 1997) and other approaches, thereby implicitly assuming that the underlying gene trees are identical in topology and branch lengths and, when multiple loci are simulated, the sequence datasets record only the mutational variance accumulated within the specified phylogenetic trees. (Rarely are multiple different trees used with DNA sequence simulators such as SeqGen to incorporate both mutational and coalescent variance). These simulated datasets will differ from the more realistic situation, embodied in packages such as MCMCcoal (Yang 2002), the coalescent module in Mesquite (Maddison 1997; Maddison and Maddison 2008), and serialsimcoal (Laval and Excoffier 2004; Anderson et al. 2005), in which, at minimum, branch lengths differ subtly among loci due to branch length heterogeneity. Differences between DNA datasets with and without coalescent variance will vary depending on the effective population sizes in the species tree used in the simulations and the extent of gene tree heterogeneity, although the precise pattern of differences is not well-studied (Fig. 2). 
We might expect that simulated DNA sequence datasets produced under models with only branch length heterogeneity will deviate from simulations on single trees less so than will datasets produced under deep coalescence; Figure 2 suggests that deep coalescence and single gene tree simulations can indeed produce very different distributions of site patterns in DNA sequences. Models traditionally used to simulate DNA sequence data for phylogenetic purposes essentially assume that the population sizes in the species tree are zero, and thus ignore the contribution of coalescent variance to molding the variation and signal present in DNA sequence datasets (Carstens et al. 2005).
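
To make the scaling of coalescent variance concrete, the following Python sketch computes θ = 4Nμ and simulates the waiting time to coalescence for two gene copies at many independent loci within a single ancestral population. It is only a sketch under stated assumptions: a diploid population, in which that waiting time is exponentially distributed with a mean of 2N generations, and a hypothetical mutation rate of 1e-8 per site per generation. The function names are illustrative and do not come from MCMCcoal, Mesquite, or serialsimcoal.

```python
import random
import statistics

def theta(effective_size, mutation_rate):
    """Population mutation parameter, theta = 4 * N * mu."""
    return 4 * effective_size * mutation_rate

def pairwise_coalescence_times(effective_size, n_loci, rng):
    """Coalescence times (generations) for two gene copies at n_loci loci.

    Assumption: diploid population, so each time is exponential with
    mean 2N generations.
    """
    mean_generations = 2 * effective_size
    return [rng.expovariate(1.0 / mean_generations) for _ in range(n_loci)]

rng = random.Random(1)
for ne in (10_000, 100_000):
    times = pairwise_coalescence_times(ne, 5_000, rng)
    # Report theta, then the mean and standard deviation of coalescence times.
    print(ne, theta(ne, 1e-8),
          round(statistics.mean(times)), round(statistics.pstdev(times)))
```

Because the standard deviation of an exponential waiting time equals its mean, a tenfold increase in N produces roughly a hundredfold increase in the variance of branch lengths among loci. A simulation on a single fixed tree corresponds to the limit in which this variance is zero.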

Figure 2. Example of the contribution of coalescent variance to the distribution of site patterns in DNA sequences. A species tree (top left and white bars in graphs) was used to simulate gene trees and DNA sequences of 500 bp using the Jukes Cantor model of DNA evolution using MCMCcoal (Yang 2002). A gene tree (top right and gray bars in graphs) with branch lengths and topology identical to the species tree, but without the effective population size parameter, was used to simulate gene trees and DNA sequences. One thousand gene trees were simulated from the species tree and one sequence per gene tree was then simulated; for the gene tree analysis, 1000 DNA sequences were simulated from the single gene tree. The frequency of two of the 15 possible site patterns for four taxa under the Jukes Cantor model was counted for each replicate. The two graphs show the frequency of two types of sites, one consistent with the species tree and gene tree (XXYY) and one apparently in conflict with the species tree and gene tree (XYYX). The species tree used was ((((H:0.05, C:0.05): 0.0025 #0.1, G:0.0525):0.0025 #0.1, O:0.055) #0.1); numbers after the pound sign indicate value of θ= 4Nμ. Approximately 68% of the gene trees simulated with this species tree are concordant with the species tree. The value of θ used in the species tree simulations is admittedly high and primarily for illustrating the situation with a high mutation rate; but substantial tails to the distribution of site frequencies are achieved with species trees an order of magnitude shorter and thinner. The gene tree used was (((H:0.05, C:0.05): 0.0025, G:0.0525):0.0025, O:0.055). DNA sequences simulated from species trees are likely to contain a higher number of sites that “conflict” with the species tree, even though from a species tree perspective they are not really in conflict with it. 
For example, over 20% of the gene trees simulated from the species tree (n= 215) gave rise to DNA sequences in which greater than 1 percent of sites (>5 of the 500 sites) had pattern XYYX, which is naively in conflict with the true tree. But in fact, such sites are consistent with the species tree when a clear distinction between gene and species trees is made. By contrast, no sequences simulated from the single gene tree had this many sites with pattern XYYX, and less than 20% (n= 192) of sequences had more than one site of this type.
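
The site-pattern tally described in this caption can be sketched as follows. The Python below is an illustrative reconstruction, not the script used to produce the figure: the function names are hypothetical and the five-site alignment is a made-up toy example with taxa ordered H, C, G, O. Each alignment column is relabeled by order of first appearance, so that a column such as AAGG becomes XXYY and AGGA becomes XYYX.

```python
from collections import Counter
from itertools import product

LABELS = "XYZW"

def site_pattern(column):
    """Relabel a column by order of first appearance, e.g. 'AGGA' -> 'XYYX'."""
    seen = {}
    out = []
    for base in column:
        if base not in seen:
            seen[base] = LABELS[len(seen)]
        out.append(seen[base])
    return "".join(out)

def count_site_patterns(sequences):
    """Count site patterns across the columns of equal-length sequences."""
    return Counter(site_pattern(col) for col in zip(*sequences))

# Enumerate every possible four-taxon column to list the distinct patterns.
all_patterns = {site_pattern(col) for col in product("ACGT", repeat=4)}
print(len(all_patterns))

# Toy alignment (hypothetical data), taxa in the order H, C, G, O.
toy = ["AAGTC", "AAGTT", "AGGTT", "AGATC"]
counts = count_site_patterns(toy)
print(counts["XXYY"], counts["XYYX"])
```

Enumerating all columns recovers the 15 patterns mentioned in the caption, and applying `count_site_patterns` to each simulated replicate would give the per-replicate frequencies of XXYY and XYYX.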

ORIGIN AND CONSEQUENCES OF THE CONCATENATION PARADIGM

The current paradigm under which molecular phylogenetics operates—one characterized by the accumulation of many genes that are then concatenated into large supermatrices before analysis—arose in part from a need to amass larger datasets, and in part from debates in the early 1990s spurred by Arnold Kluge's call for “total evidence”—a philosophical mandate to include all available information into phylogenetic analyses (Kluge 1989, 2004). Concatenation—the practice of combining different genes or data partitions into a single supermatrix and analyzing this matrix such that all genes conform to the same topology—provided a convenient means of implementing Kluge's call. It soon became clear, however, that although total evidence might have substantial philosophical justification, the practice could clash with some of the practical nuances of molecular systematics and with the growing appreciation of heterogeneity in gene trees, which grew separately from observations of the behavior of gene trees in natural populations (Wilson et al. 1985; Avise et al. 1987; Doyle 1992; Avise 1994). There were generally two practical arguments against total evidence. The first was the demonstration that in computer simulations, DNA sequences evolving under substantially different substitution rates and patterns could give erroneous results when analyzed with currently available software and models of phylogenetic reconstruction (Bull et al. 1993). This first concern has largely been addressed in the past decade with the development of efficient likelihood and Bayesian algorithms permitting different data partitions to evolve under different models (e.g., Nylander et al. 2004). 
Although it is widely appreciated that the most commonly employed models of DNA substitution do not adequately describe the complexities of DNA sequence evolution (not only because of their simplicity but also because of their frequent reliance on the assumption of stationarity through time), applying different models to different genes or data partitions is well known to dramatically improve phylogenetic inference. Variation in substitution patterns among genes was sometimes considered a benefit to phylogenetic analysis, provided that it was not too great. For example, combining many genes encompassing both fast and slow rates of evolution was suggested as a better means of improving phylogenetic analysis as compared with using genes having similar rates (Cummings et al. 1995; Otto et al. 1996).

The second argument against total evidence was the suggestion that different genes should not be combined if they can be shown to have different topological histories. By the early 1990s, gene tree heterogeneity had been observed frequently in real datasets. Yet in roughly the same time it has taken systematics to embrace sophisticated mixture models of the substitution variation across partitions, the challenge of heterogeneity in gene trees has not received commensurate attention. For example, Felsenstein's recent survey of phylogenetic methods (Felsenstein 2003) contains only a single chapter on species trees and the potential variability of their constituent gene trees (chapter 28), and only recently have a few phylogenetic methods incorporating gene tree heterogeneity, with the ability to analyze large datasets, been available. This relative inattention to dealing with gene tree heterogeneity—even in the knowledge that such heterogeneity does not necessarily conflict with the unique species history in question—was, I think, partly due to the perceived success of the concatenation approach in delivering high confidence in phylogenetic trees, and the suggestion that more genes could improve this resolution. However, as I describe next, it was not so much the multiplicity of genes that was deemed responsible for the success of combining information via concatenation, but rather the multiplicity of characters or sites.

By now it is routine for phylogenetics and phylogenomics projects to amass multiple genes, sometimes hundreds of them, in pursuit of phylogenetic rigor. Yet the current justification for collecting multiple genes is, I suggest, somewhat out of sync with their real service in phylogenetics. When asked why collecting multiple genes is useful in phylogenetic analysis, many systematists might answer “In order to capture a diversity of mutation rates, so as to resolve deep and shallow branches in the tree.” (This answer is partly a legacy of the influential paper by Cummings et al. (1995), which specifically prescribed sampling many, short (mitochondrial) genes with varying mutation rates, rather than a few longer genes.) In addition, our imaginary systematist would probably prefer to sample widely throughout the genome, rather than from one chromosomal segment, even if one could assure her that a single segment contained as much mutation rate variation as genes spread across the genome. Yet, as attractive as this protocol sounds, the demand for genomically widespread markers—tantamount to demanding a measure of genealogical independence among markers due to recombination between them—does not reconcile easily with the concatenation or supermatrix approaches that have become the norm in phylogenomics, because these approaches do not allow for genealogical independence of different genes. I therefore argue that the motivation for sampling many markers in modern phylogenomics is not due to an explicit desire to sample many (phylo)genetically independent markers, but rather to sample many sites, perhaps with varying rates; and that the goal of sampling many genes is favored only in so far as it might bring some measure of rate heterogeneity among loci that might resolve both deep and shallow nodes in phylogenetic trees. 
Missing from this justification for sampling many genes is any reference to the possibility that the sample of gene trees will increase with the sample of genes, and will thereby better portray the statistical tendencies of genomes and populations that comprise the biological levels above the sampled entities. Few systematists today would say they prefer sampling many genes “So as to obtain a diversity of gene trees.” It is this answer, however, that underlies the sampling properties of the species tree approach.

GENE TREE PHYLOGENETICS: A LOCAL OPTIMUM

The molecular biology revolution drastically changed phylogenetics in key ways. In addition to the obvious advances allowing collection of vastly more characters for phylogenetic analysis, the revolution in restriction enzymes, and eventually rapid DNA sequencing via PCR, allowed researchers to collect molecular data that could be directly analyzed phylogenetically (Avise 1994). For example, allozymes proved extremely useful in advancing phylogenetics and biogeography, yet the molecular data themselves—alleles and allele frequencies—could not easily be analyzed phylogenetically without first transforming the data in some way, such as by estimating a genetic distance. By contrast, data from restriction enzymes, or DNA or protein sequences, are easily and almost effortlessly amenable to phylogenetic analysis, because they come to the researcher already in the form of a character matrix (Hillis et al. 1993; Swofford et al. 1996).

Producing phylogenies directly from gene sequences essentially in one step, without additional transformations, is now the dominant mode of phylogenetic analysis and indeed it has advanced the field enormously. Nonetheless, I suggest that the very success of this paradigm and the ease with which phylogenies could be produced directly from DNA matrices led to a comfort zone in phylogenetics. If we can imagine systematic methods themselves as a likelihood surface, I suggest that the current paradigm is a local optimum in that surface, an optimum that is useful but ultimately incomplete in so far as it has failed to model the potential for gene tree/species tree discordance even cursorily (Fig. 3).

Figure 3. A fictitious likelihood plot illustrating the idea that gene trees represent a local optimum historically in the field of systematics. The plot also alludes to the greater explanatory power of species tree models over models without gene tree heterogeneity. I showed a plot similar to this at the symposium on species trees at the 2008 meetings of the Society for the Study of Evolution in Minneapolis, and received a number of chuckles from the audience, and it is presented here in a similarly irreverent spirit. In fact, models that allow for gene tree heterogeneity do have significantly more explanatory power for those DNA datasets that have been tested than do concatenation or supermatrix models. However, the extra parameters of some species tree approaches are a disadvantage.

Recent phylogenomic analyses have begun to enshrine the concatenation paradigm by amassing hundreds of genes to unravel the Tree of Life (e.g., Rokas et al. 2003; Delsuc et al. 2006; Dunn et al. 2008; reviewed in Delsuc et al. 2005). At the same time, other recent results are beginning to question the suitability of concatenation for all data types and time scales, in particular those from DNA sequences sampled from rapidly diverging clades (Edwards et al. 2007; Kubatko and Degnan 2007). For example, Degnan and Rosenberg (2006) showed that for any species tree of five or more taxa, there exist branch lengths in species trees for which gene trees that do not match the species tree are more common than gene trees matching the species tree—so-called anomalous gene trees. In such situations, or even slightly outside of this zone, phylogenetic analysis of concatenated sequences can positively mislead inference of species relationships (Kubatko and Degnan 2007); by contrast, some of the new species tree approaches appear promising in or near the anomaly zone (e.g., Edwards et al. 2007; Liu and Edwards 2008; L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards, unpubl. ms.). Although the parameter space of species trees that produce anomalous gene tree topologies is probably not large (we do not yet know of any empirical examples of this phenomenon), it stands to reason that concatenation will under many circumstances be a worse approximation of the underlying diversity of gene trees than will approaches that allow for gene tree heterogeneity, because we know, as stated above, that gene trees will always differ from one another subtly, even when topologically congruent. The few statistical comparisons that have been done suggest that species tree approaches that allow for gene tree heterogeneity are significantly better explanations of multilocus sequence data than is concatenation, even in situations in which gene tree heterogeneity is moderate (Liu and Pearl 2007; Edwards et al. 2007; Belfiore et al. 2008). The cost of the species tree approach, however, can sometimes be a substantial increase in the number of parameters to be estimated. For example, in addition to the usual nucleotide substitution parameters for each gene (partition), species tree analysis can involve parameters for the relative mutation rates of different genes, branch length and tree length parameters for each gene, as well as branch lengths, effective population sizes, and topologies of the species tree.

CONCATENATION, PHYLOGENETIC CONFIDENCE, AND POLYTOMIES

Concatenation has many implications beyond whether recovered tree topologies are correct or incorrect. As stated before, in all likelihood the topologies generated by concatenation are reasonable approximations of reality, and in many cases it is not concatenation per se that might derail a phylogenetic analysis but some other detail, such as specification of the substitution model, inhomogeneous base compositions, vagaries of the molecular clock, etc. Yet a serious and still unanswered question is whether concatenation itself can strongly influence confidence values, if not the topologies, of phylogenetic trees. Many papers have been devoted recently to understanding the type I and II error rates of phylogenetic inference methods, most recently with Bayesian inference, and several researchers have suggested that confidence values on branches can be strongly overestimated under a variety of circumstances frequently encountered in routine data analysis. Some researchers, particularly those working with simulations, have often attributed this overconfidence to the inference method as encoded in various software programs (Suzuki et al. 2002; Misawa and Nei 2003; Simmons et al. 2004), or to misspecifications of the model of evolution (Huelsenbeck et al. 2002; Yang and Rannala 2005; reviewed in Alfaro and Holder 2006). In such cases, DNA sequences are indeed simulated on trees that lack coalescent variance, and so such a conclusion may be reasonable. Yet the source of the often high posterior probability values seen in empirical trees has a less obvious explanation. Misspecifications of the substitution model may often be to blame, but concatenation itself—a type of model misspecification, given the coalescent process—represents a major unexplored source of such overconfidence. 
An example of this is the extreme case in which a polytomy in a species tree is used as a model to generate gene trees and DNA sequences, and the phylogeny is then reconstructed from these simulations. Despite the polytomy in the species tree, we expect the gene trees generated by this species tree to be dichotomous except in extreme circumstances (Slowinski 2001). (As discussed below, I believe that species trees clarify many aspects of polytomies, and associated concepts such as the “star tree paradox” (Lewis et al. 2005; Kolaczkowski and Thornton 2006), that have been confused in the literature due to a gene tree perspective.) Figure 4 shows how we can expect three distinct dichotomous gene trees from a single polytomous species tree. In this situation, whereas species tree analysis gives a reasonable estimate of confidence in the species tree, providing fairly even support for all three constituent trees underlying the species tree polytomy, concatenation unrealistically places high confidence on one or another gene tree (depending on the details of the replicate), to the exclusion of the remaining two trees. Thus, because something approximating a coalescent process generates DNA sequences in nature, yet we analyze them as if coalescence did not exist, it is worth exploring this source of misestimation further, and the brief example in Figure 4 is by no means the last word. A separate but important issue is the fact that, until recently, most explorations of phylogenetic accuracy and overcredibility of phylogenetic methods have been performed on gene trees, not species trees, and it is unclear to what extent these conclusions will translate to the higher level embodied in species trees (e.g., Douady et al. 2003; Taylor and Piel 2004).

Figure 4. Illustration of the utility of the species tree approach as a framework for studying polytomies (top); the mixture of dichotomous gene trees that are expected to result from a polytomy in the species tree (middle); and the tendency for concatenation to excessively favor one particular topology when presented with a mixture of gene trees that, together, should cause lower confidence in any one topology (bottom). In two simulations, the polytomous species tree at the top was used to generate 30 gene trees, which in turn were used to generate DNA sequences under the Jukes-Cantor model using MCMCcoal (Yang 2002). The three possible gene trees produced from a polytomous species tree are indicated in black, gray, and dotted lines. These sequences were either analyzed with the method Bayesian Estimation of Species Trees (BEST; Liu and Pearl 2007; Liu et al. 2008b; lower left corner) or were concatenated and analyzed using MrBayes (Huelsenbeck and Ronquist 2001). This procedure was repeated 10 times (replicates). The optimal distribution of posterior probabilities would be even at ∼0.33 across all replicates and trees; given the finite nature of the simulation, the observed probabilities are expected to vary from this optimum somewhat. Whereas BEST achieves moderately even posterior probabilities across trees and replicates, concatenation produces strongly uneven probabilities that favor one tree or another depending on the details of each replicate. This unevenness is likely a consequence of concatenation, rather than any idiosyncrasies in MrBayes, and illustrates that concatenation itself can be a major source of overconfidence in phylogenetic trees (see text).

I suggest, as have others (Slowinski 2001), that species trees are the more relevant entity when discussing polytomies (e.g., Braun and Kimball 2001), or related concepts such as the “star tree paradox” (Lewis et al. 2005; Kolaczkowski and Thornton 2006). (The star tree paradox is the finding that posterior probabilities of trees can be grossly overestimated when the true tree is a polytomy but when polytomies are not visited frequently or at all during the MCMC run.) Nonetheless, polytomies in gene trees have remained the focus of discussion and theoretical attention (Walsh et al. 1999; see Slowinski 2001 for an excellent review; Lewis et al. 2005; Steel and Matsen 2007). Polytomies in species trees are of real relevance to systematics and biogeography, and likely exist in nature, whereas polytomies in gene trees are expected to be rare on biological grounds, and in any case are not a necessary consequence of polytomies in species trees (Slowinski 2001; Fig. 4). For these reasons I suggest that studying the behavior of DNA sequences generated by polytomous gene trees will be less productive than studying the types of gene trees generated from polytomous species trees, and the sequences that arise from them.
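The expectation that a polytomous species tree yields a mixture of three dichotomous gene trees can be sketched in a few lines. The following is a minimal illustration, not the procedure used for Figure 4: it samples only gene tree topologies (no DNA sequences, no BEST or MrBayes analysis), and it assumes a hard three-species polytomy with one sampled lineage per species, so that the first pair of lineages to coalesce is drawn uniformly at random. The function names, the seed, and the species labels A, B, and C are hypothetical; the 30 loci and 10 replicates echo the design described in the Figure 4 legend.

```python
import random
from collections import Counter

# The three possible rooted topologies for species A, B, and C.
TOPOLOGIES = ("((A,B),C)", "((A,C),B)", "((B,C),A)")

def simulate_polytomy_gene_trees(n_loci, rng):
    """Draw one gene tree topology per locus under a three-species polytomy.

    Assumption: one lineage per species enters the shared ancestral
    population at the same moment, so the first pair to coalesce is
    chosen uniformly from the three possible pairs.
    """
    return Counter(rng.choice(TOPOLOGIES) for _ in range(n_loci))

def run_replicates(n_replicates=10, n_loci=30, seed=1):
    """Repeat the draw for several replicates, each with n_loci gene trees."""
    rng = random.Random(seed)
    return [simulate_polytomy_gene_trees(n_loci, rng) for _ in range(n_replicates)]

if __name__ == "__main__":
    for i, counts in enumerate(run_replicates(), start=1):
        # Print the number of loci supporting each of the three topologies.
        print(i, [counts[t] for t in TOPOLOGIES])
```

With only 30 loci the counts in any one replicate will rarely be exactly even, which is the finite-sample variation mentioned in the Figure 4 legend; the long-run frequency of each topology is one third.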

A Brief History of Species Trees

Species trees are, of course, synonymous with phylogeny and the Tree of Life. Methodologically, species trees can be defined as any phylogenetic approach that distinguishes gene trees or genetic variation from species trees, and explicitly estimates the latter. Species trees by this definition need not be derived from DNA sequence data, but they often involve a model of gene tree evolution—a model distinct from that for nucleotide substitution—that serves as a basis for evaluating the likelihood of the collected data under various candidate species trees. This model can explicitly capture biological processes, such as the coalescent process, or it can capture trends in gene tree heterogeneity without specifically modeling coalescence (e.g., Ané et al. 2007; Steel and Rodrigo 2008).

Species trees as distinct from gene trees are not a new idea. As early as the 1960s, Cavalli-Sforza (Cavalli-Sforza 1964), and in the late 1970s Joe Felsenstein (Felsenstein 1981), were applying simple drift models to tables of allele frequencies and using these models to evaluate competing hypotheses of population and species relatedness. In the 1980s John Avise brought species trees as distinct from gene trees to the forefront of the burgeoning field of phylogeography (Neigel and Avise 1986; Avise et al. 1987). Species trees are a realization of Doyle's characterization of gene trees as characters (Doyle 1992, 1997), or Maddison's (1997) “cloudogram.” Nonetheless, as I suggest in this section, the concept of species trees appears newer than it is in part because the use of DNA sequence data mass-produced a closely related entity, the gene tree, that systematists must now distinguish from it. The concept also appears new from a practical standpoint, since, until now, there have been few means to directly incorporate gene stochasticity into the phylogenetic analysis of moderately sized datasets with workable software (Table 1). Statistical methods for dealing with gene tree heterogeneity and coalescent stochasticity have already been in the mainstream of related fields, such as phylogeography and historical demography, for a number of years, as evidenced by a battery of software focused near the species level that deals with multilocus data. Examples of such software include MIGRATE, LAMARC, BEAST, IM, and other methods that treat the gene tree as a statistical quantity with associated errors in estimation and as a means for estimating parameters at the population level (Wakeley and Hey 1997; Yang 1997; Drummond and Rambaut 2003; Beerli 2006; Kuhner 2006). 
By estimating population parameters above the level of the gene, these models make reference to the species history in which gene histories are embedded, and indeed go so far as to integrate out gene trees as nuisance parameters. Hey and Machado (2003) captured the distinctive properties of this new view of phylogeography, as well as the spirit of the debates that accompanied the transition in perspective.

Table 1. Examples of methods for estimating species trees.*

| Method (References) | Methodological basis | Data required | Accounts for stochastic variation or gene tree error? | Yields species tree branch lengths? | Yields effective population sizes? | Applicable to many loci? | Applicable to many taxa? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gene tree distributions | | | | | | | |
| Probability of incongruence (Pamilo and Nei 1988; Wu 1991; Hudson 1992; Chen and Li 2001; Waddell et al. 2002) | Likelihood/Coalescent | Gene trees | No | Yes | Yes | Yes | No |
| Democratic Vote (Pamilo and Nei 1988; Satta et al. 2000) | Gene tree counts | Gene trees | No | No | No | Yes | No |
| SINE method (discordance) (Waddell et al. 2001) | Likelihood | Binary characters | No | No | Yes | Yes | No |
| Gene tree shapes or conflict minimization | | | | | | | |
| Genetree parsimony (Page and Charleston 1997) | Parsimony | Multigene family trees | No | No | No | Yes | Moderate |
| Deep coalescence (Maddison 1997; Maddison and Knowles 2006) | Parsimony | Gene trees | No | No | No | Yes | Yes |
| Species Trees Using Average Rank of Coalescence Time (STAR) (L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards, unpubl. ms.) | Ranks of pairwise coalescence times | Coalescence times/Gene trees | via bootstrapping | No | No | Yes | Yes |
| Species Trees Using Estimated Average Coalescence Time (STEAC) (L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards, unpubl. ms.) | Pairwise coalescence times | Coalescence Ranks/Gene trees | via bootstrapping | No | No | Yes | Yes |
| Minimum divergence (Takahata 1989); Maximum tree (Liu and Pearl 2006); GLASS (Mossel and Roch 2007) | Divergence in gene trees/coalescent | Gene trees | No | Yes (Assuming ultrametricity) | No | Yes (Maximum and Glass) | Yes |
| Joint Inference of Species and Tree (JIST) (O’Meara 2008) | Likelihood/Coalescent | Gene trees | No | No | No | Yes | Moderate |
| Allele frequencies, SNPs or Haplotype Configurations | | | | | | | |
| Drift model (Felsenstein 1981) | Likelihood/Brownian motion | Allele frequencies | Yes | Yes | No | Yes | Yes |
| Infinite sites model (Nielsen 1998) | Likelihood | Haplotypes | Yes | Yes | Yes | Yes | No |
| FST method (Nielsen et al. 1998) | Likelihood/Coalescent | Allele frequencies | Yes | Yes | Yes | Yes | No |
| Pruning Algorithm (RoyChoudhury et al. 2008) | Likelihood/Coalescent | SNPs | Yes | Yes | No | Yes | Moderate |
| Gene tree probabilities/likelihoods | | | | | | | |
| Gene tree probabilities (Carstens and Knowles 2007) | Likelihood/Coalescent | Gene trees | Partially | No | No | Yes | No |
| Bayesian Estimation of Species Trees (BEST) (Liu and Pearl 2007; Liu et al. 2008) | Bayesian | DNA sequences | Yes | Yes | Yes | Moderate | Moderate |
| Bayesian Concordance Factors (BCA) (Ané et al. 2007) | Bayesian | DNA sequences | Yes | No | No | Yes | Moderate |
| Sum and average criteria (Seo et al. 2005) | Likelihood | DNA sequences | Yes | No | No | Yes | Yes |
| Consensus and supertree approaches | | | | | | | |
| Likelihood supertrees (Steel and Rodrigo 2008) | Likelihood | Gene trees | Yes | No | No | Yes | Yes |
| Rooted triple consensus (Degnan et al. 2008; Ewing et al. 2008) | Consensus | Gene trees | No | No | No | Yes | Yes |
| Majority rule consensus/greedy consensus (Degnan et al. 2008) | Consensus | Gene trees | No | No | No | Yes | Yes |

*Modified and expanded from Table 1 of Brito and Edwards (2008). The table is not meant to be exhaustive (see text).

In stark contrast to the situation in phylogeography, phylogenetic inference itself still largely retains its focus on gene trees—if not philosophically then operationally, at the UNIX prompt or GUI menu. The thought of integrating out the gene trees from a phylogenetic analysis would likely seem paradoxical to practitioners of the current paradigm, and for this reason again, species trees may appear to be a new concept. Table 1 summarizes a number of approaches to estimating species trees that have been developed over the years, many in the last five years. All of these approaches make explicit the distinction between the underlying genetic variation—whether manifested as allele frequencies or as gene trees—and the species tree that is the object of estimation. Table 1 does not necessarily include all methods for combining data from multilocus datasets—for example, consensus trees, majority rule trees, supertrees and supermatrices have been suggested as ways of combining data from multiple genes (de Queiroz 1993; de Queiroz et al. 1995; Wiens 1998; Steel et al. 2000; Gadagkar et al. 2005; Holland et al. 2005, 2006). Although I do include some recent evaluations of these approaches for estimating the species tree under a coalescent model (Degnan et al. 2008), I do not consider these methods true species tree methods because they do not specifically acknowledge an overarching species tree in which gene trees are embedded, any sort of correlation among gene trees, or a model connecting the two, other than simply calling the consensus tree or supertree the species tree (for an exception see Steel and Rodrigo 2008). A complete review of species tree methods is beyond the scope of this Commentary (see Degnan and Rosenberg 2008 and Brito and Edwards 2008 for an introduction), but the following overview may be helpful.

Methods for inferring species trees have adopted likelihood, other parametric statistical, or model-free approaches and have met with varying degrees of success. For example, some of the most statistically robust methods are challenging to implement and are not generally available to empiricists (Nielsen 1998; Nielsen et al. 1998; chapter 28 of Felsenstein 2003). Other approaches, such as likelihood methods (Pamilo and Nei 1988; Wu 1991; Hudson 1992; Chen and Li 2001; Waddell et al. 2001, 2002), are generally not applicable to more than three species. Recent parsimony methods for inferring species trees, such as methods minimizing deep coalescence, appear promising, particularly given their implementation in powerful software packages such as Mesquite (Maddison and Knowles 2006). Likelihood approaches, such as direct evaluation and comparison of species trees via the likelihood of gene trees in the data (Seo et al. 2005; Carstens and Knowles 2007; Seo 2008), or constructing supertrees from gene trees via a summary likelihood function (Steel and Rodrigo 2008), also appear promising. Recently Liu and colleagues have proposed a promising Bayesian method (Liu and Pearl 2007; Liu et al. 2008), as well as several parametric methods (L. Liu, L. S. Kubatko, D. K. Pearl, and S. V. Edwards, unpubl. ms.), for estimating species trees, the latter of which are quick to compute on very large datasets. All of these methods assume a model that allows gene tree heterogeneity, and yet these methods each estimate a single species tree, and in some cases can handle multiple alleles per species (Maddison and Knowles 2006; Liu et al. 2008). They are distinct from traditional methods of phylogenetic analysis insofar as there is no assumption that the estimated gene tree is isomorphic with the species tree; instead, they perform additional computation, whether calculation of likelihoods or summary statistics, on the collected gene trees to derive a species tree.
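As one concrete illustration of "additional computation on the collected gene trees," the sketch below follows the general idea behind the minimum divergence family of methods listed in Table 1: because a gene divergence cannot postdate the species divergence in the absence of gene flow, the smallest pairwise coalescence time observed across loci bounds the species divergence time from above. This is a toy sketch under strong assumptions, not an implementation of any published program: gene tree coalescence times are treated as known without error and measured on a common ultrametric scale, and the data structure, function names, and numbers are all hypothetical.

```python
# Each locus maps a species pair to the coalescence time of their gene copies.
# Hypothetical toy values: loci 1 and 3 group A with B, locus 2 groups A with C.
EXAMPLE_LOCI = [
    {frozenset("AB"): 1.0, frozenset("AC"): 3.0, frozenset("BC"): 3.0},
    {frozenset("AB"): 4.0, frozenset("AC"): 2.5, frozenset("BC"): 4.0},
    {frozenset("AB"): 1.5, frozenset("AC"): 2.8, frozenset("BC"): 2.8},
]

def minimum_divergence(per_locus_times):
    """Return, for each species pair, the minimum coalescence time across loci."""
    pairs = per_locus_times[0].keys()
    return {pair: min(locus[pair] for locus in per_locus_times) for pair in pairs}

def closest_pair(min_times):
    """Return the species pair with the smallest minimum coalescence time."""
    return min(min_times, key=min_times.get)

if __name__ == "__main__":
    min_times = minimum_divergence(EXAMPLE_LOCI)
    print(sorted(closest_pair(min_times)))
```

In the toy data, locus 2 on its own would group A with C, yet the summary across loci still joins A and B first. A full method would go on to cluster the remaining taxa from the same matrix of minima.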

WHAT'S IN A NAME?

It is a legitimate question to ask, as a colleague of mine did recently, whether species trees have any validity if in fact the definition of species is still in limbo (as it is likely to be for a long time). This colleague suggested that the term “population tree” is better suited to the new paradigm, because it avoids the issue of species validity (notwithstanding the problem of defining populations in nature). I would be happy with this terminology, but defining it this way might seem to exonerate those working at higher taxonomic levels, for whom population processes are minor concerns. Phylogeneticists working on higher level questions tend not to concern themselves with populations, or their genetics. For this reason, “population trees” might become appropriated solely by phylogeographers and those working near the species level. This would be unfortunate, because gene tree heterogeneity and the species tree problem in principle affect all levels of phylogeny, even if the extent of deep coalescence or branch length heterogeneity is less among higher taxa or sparsely sampled clades. For this reason I suggest we simply exercise a verbal substitution and reserve the term “phylogeny” to refer to species trees. Phylogenies as they have been built in the last few decades would then be called gene trees, which is generally what they are, sensu stricto.

The Logic of the Species Tree Approach

From what little we know at this time, the species tree approach appears to derive its power from the accumulated signal of many gene trees, or independently segregating single nucleotide polymorphisms (SNPs), each with their own “tree” or bipartition. As such the approach leaves open the possibility that the collected DNA sequences may contain site patterns that are not directly mappable onto the resulting phylogeny. Complex signals and hidden support have been observed in combined and concatenated molecular datasets and have been suggested to arise from “discrepant patterns of homoplasy” (Gatesy and Baker 2005). Yet, notwithstanding these complex interactions among characters, ultimately there can be no site patterns in a concatenated dataset that are not present in the original partitions. By contrast, species tree approaches explicitly conduct additional computation on trees from individual partitions; the end result can sometimes derive from signals that are not specifically encoded in the site patterns of the original partitions (Fig. 2). A good illustration of this is the fact that species trees correctly estimated from gene trees in or near the anomaly zone differ from the most common gene tree, and by inference, from the signal in the most common site pattern in constituent partitions of the data (Edwards et al. 2007; Liu and Edwards 2008; L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards, unpubl. ms.). The additional signal not found in the original sequence data comes from the likelihood function of gene trees given a species tree (Maddison 1997; Rannala and Yang 2003; chapter 28 of Felsenstein 2003; Liu and Pearl 2007). This likelihood is distinct from the likelihood function modeling nucleotide substitution and its function is to provide probabilities of gene trees given a species tree. 
Such likelihoods have appeared in several forms recently and provide a solid foundation for developing new species tree methods (Rannala and Yang 2003; Degnan and Salter 2005; Steel and Rodrigo 2008).
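For the simplest case of three species with one sampled lineage each, the probability of a gene tree given a species tree has a well-known closed form, which may help make the idea concrete. The expressions below are the standard three-taxon coalescent result, offered as an illustration and not drawn from this essay; the symbols T, t, N_e, and n_1, n_2, n_3 are introduced here for the sketch, and the scaling by 2N_e assumes a diploid population. For a species tree ((A,B),C) whose internal branch lasts t generations:

```latex
T = \frac{t}{2N_e}, \qquad
P\bigl(((A,B),C)\bigr) = 1 - \tfrac{2}{3}\,e^{-T}, \qquad
P\bigl(((A,C),B)\bigr) = P\bigl(((B,C),A)\bigr) = \tfrac{1}{3}\,e^{-T}

% Likelihood of T given n_1 concordant gene trees and n_2, n_3 gene trees
% of the two discordant topologies, treating loci as independent:
L(T) \propto \Bigl(1 - \tfrac{2}{3}\,e^{-T}\Bigr)^{n_1}
             \Bigl(\tfrac{1}{3}\,e^{-T}\Bigr)^{n_2 + n_3}
```

As T approaches zero, which corresponds to a polytomy, all three topologies become equally probable at one third; as T grows, nearly every gene tree matches the species tree.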

SPECIES TREES: CONFIDENCE AND MISSING DATA

Although it is too early to tell clearly, I predict that statistical confidence in species trees when estimated with new multilocus approaches will in general be less than when estimated via concatenation, particularly when analyzing datasets of long-diverged clades, such as orders of mammals or birds. I suggest this prediction even though we know that in some instances the species tree approach is more efficient at extracting information from DNA sequences than the concatenation approach, such as the example from yeast (Edwards et al. 2007). This prediction stems from consideration of how signal is propagated in supermatrix and species tree approaches, and from a recent multilocus study on turtles that suggested that the effect of missing data was much stronger for species tree approaches than for concatenation approaches (Thomson et al. 2008).

It stands to reason that species tree approaches will be more sensitive to missing data than supermatrix approaches because, in species tree approaches, a missing gene for a given taxon means that that taxon's genealogy is unknown for that particular gene (although it could probably be estimated for that gene based on the information from other genes). By contrast, in supermatrix approaches, a missing gene for a given taxon can easily be compensated for by other genes for that taxon, although the ease of compensation will no doubt vary. Hence there may be less of a penalty for missing data in supermatrix approaches (although I confess my argument at this stage is not airtight). In the turtle study, the phylogeny of concatenated genes based on a dataset in which nearly a third of the taxon-by-gene matrix had empty cells nonetheless had high confidence, with most branches achieving high posterior probability (Thomson et al. 2008). Similar claims of high confidence from vastly undersampled supermatrices have been made for other taxa as well (Driskell et al. 2004). Both the statistical inference issues (species trees are, after all, a different and more complex entity to estimate than gene trees) and the effects of missing data may conspire to make species trees in general harder to estimate than trees obtained by concatenation. This no doubt could be frustrating; after all, the community has become comfortable with the levels of confidence delivered under the current paradigm. But on the other hand, this extra effort may be telling us something about species trees and their ease of inference from genetic data.

Concatenation also suffers from the problem of data “swamping,” in which one or a few partitions provide essentially all of the signal in a particular study, even in molecules-only analyses (Kluge 1983; Hillis 1987; Baker et al. 1998). I predict that the contribution to phylogenetic signal will be more evenly distributed among genes in species tree approaches, because in the end, each partition is only one gene, and extra signal comes from each gene independently as well as from additional sites within any one gene. Of course, low confidence in species trees could also be the result of violations of the model assumed, such as when gene tree discordance is generated not just by coalescent phenomena but by horizontal gene transfer, intragenic gene conversion, paralogous genes, or other processes (Eckert and Carstens 2008). In general, as we begin to compare the relative merits of species trees and the concatenation approach, we should bear in mind that the two are different entities; although not exactly apples and oranges, they are nonetheless distinct statistical quantities that are correlated with one another and yet will behave differently with regard to signal maximization.

The Future: Simulations, Sampling, Species, and SNPs

The species tree paradigm suggests a number of new directions that will impact future research. I choose three areas in particular – simulation practices, data sampling, and species delimitation – to complement the list of specific research questions outlined in Degnan and Rosenberg's recent review on related subjects (Degnan and Rosenberg 2008). First, I suggest that simulations of DNA sequences should from now on be conducted in a coalescent context, even if the simulated sequences are to be analyzed by traditional phylogenetic approaches. By this I mean DNA sequences should be simulated with a specific species tree in mind on which gene trees evolve, rather than through the traditional approach, which simply simulates DNA sequences on a static phylogenetic tree. For example, several simulation packages, such as MCMCcoal (Yang 2002), serialsimcoal (Laval and Excoffier 2004; Anderson et al. 2005) or Mesquite (Maddison and Maddison 2008) can simulate DNA sequences generated from gene trees that are in turn generated from explicitly specified species trees. By contrast, with other approaches it is often easy to forget the need to simulate from multiple different gene trees, and to inadvertently assume no coalescent stochasticity. The suggestion on simulation practices is independent of whether to concatenate or not. But simulating from coalescent gene trees would be an easy way to better approximate reality in ways that we do not now. One, of course, will be left with the choice of whether to simulate from long, thin species trees, which will generate a series of nearly identical gene trees (both in topology and branch lengths) or to simulate from short, fat species trees, which will generate substantial gene tree heterogeneity, and by extension, heterogeneity in phylogenetic signal of the underlying DNA sequences. 
This choice could essentially offer a “way out” for those researchers who are reluctant to adopt the species tree approach; simulating from long, thin species trees and then concatenating these sequences prior to analysis is tantamount to the current approach to simulation, because there could be few, if any, signals emanating from the DNA sequences that are not easily ascribed to the topology of the species tree generating them. For some clades there will be population genetic information on the values of θ for extant populations; these could be used as a guide to assign lineage widths to species trees used in simulations (Edwards and Beerli 2000).
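The contrast between "long, thin" and "short, fat" species trees can be made concrete with a small simulation. The sketch below is illustrative only and is not how MCMCcoal, serialsimcoal, or Mesquite work internally: it assumes a three-species tree ((A,B),C) with one lineage per species, a diploid population so that time in coalescent units is generations divided by 2N_e, and it tracks only whether each simulated gene tree matches the species tree. The function name and the numerical settings are hypothetical.

```python
import math
import random

def concordant_fraction(t_generations, n_e, n_loci, rng):
    """Fraction of simulated loci whose gene tree matches species tree ((A,B),C).

    The internal branch lasts t_generations and has diploid effective size
    n_e, so its length in coalescent units is T = t / (2 * n_e).
    """
    T = t_generations / (2.0 * n_e)
    concordant = 0
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce is Exp(1)
        # when time is measured in coalescent units.
        if rng.expovariate(1.0) < T:
            concordant += 1  # A and B coalesced within the internal branch
        elif rng.random() < 1.0 / 3.0:
            # All three lineages reached the root population; A and B are
            # the first pair to coalesce one time in three.
            concordant += 1
    return concordant / n_loci

if __name__ == "__main__":
    rng = random.Random(42)
    # "Long, thin": many generations, small population (T = 50).
    print("long, thin:", concordant_fraction(1_000_000, 10_000, 1000, rng))
    # "Short, fat": few generations, large population (T = 0.1).
    print("short, fat:", concordant_fraction(20_000, 100_000, 1000, rng))
    print("expected for short, fat:", 1 - (2 / 3) * math.exp(-0.1))
```

Under the long, thin setting essentially every gene tree matches the species tree, which is the situation implicitly assumed when sequences are simulated on a single static tree; under the short, fat setting fewer than half of the loci do.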

Second, I suggest that the new species tree paradigm will influence how we sample genomic data for phylogenetic analysis, and how confident we are of the results. As discussed above, for most purposes, sampling multiple genes for phylogenetic analysis has had as its most important consequence the accumulation of many sites for phylogenetic analysis. By contrast, the species tree approach places high value not only on the total number of sites, but also on the total number of independently segregating genes. I suggest, as have others (Maddison 1997; Avise 2000), that phylogenies are population phenomena and that the parameters of species trees and the means for estimating them from genetic data qualitatively are in the same class as recent models for estimating phylogeographic and demographic parameters within species, such as genetic diversity, rates of gene flow, or population divergence times. These phylogeographic methods derive their statistical power from combining the information from many genes while still treating gene trees as independent of each other conditional on the demographic history being estimated. Recent theoretical and empirical analyses have demonstrated the dependence of statistical confidence in phylogeographic parameter estimation on the number of sampled loci (Jennings and Edwards 2005; Felsenstein 2006; Lee and Edwards 2008); in many cases, the number of sampled loci appears to be more important in reducing variance of parameter estimates than the total number of base pairs (Carling and Brumfield 2007; Janes et al. 2008). In the same way, simulations have shown that confidence in species trees is also critically dependent on the number of sampled loci, although the contribution of the number of sites per locus to statistical confidence is still not known (Edwards et al. 2007; Liu et al. 2008). 
The number of alleles sampled per species has also been shown to be an important variable determining phylogenetic accuracy and confidence (Maddison and Knowles 2006). Fortunately, many recent phylogenetics and phylogenomics datasets have already focused heavily on sampling multiple loci, making extension to a species tree approach easier. Still, we do not yet know the optimal allocation of effort toward characterizing loci, individuals, and sequence length for phylogenetic analysis, if resources for a given project are limited.
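Why the number of independent loci matters can be seen in a deliberately simplified sketch. It assumes that the coalescence time of two lineages in a single population is exponentially distributed with mean 1 in coalescent units, that loci are independent, and that each locus's coalescence time is observed without error; it therefore isolates coalescent variance and says nothing about the separate contribution of sequence length. The function names and settings are hypothetical.

```python
import random
import statistics

def mean_coalescence_time(n_loci, rng):
    """Average pairwise coalescence time across n_loci independent loci.

    Each locus contributes one Exp(1) draw (time in coalescent units).
    """
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n_loci)])

def estimator_sd(n_loci, n_replicates=2000, seed=0):
    """Standard deviation of the multilocus average over repeated datasets."""
    rng = random.Random(seed)
    estimates = [mean_coalescence_time(n_loci, rng) for _ in range(n_replicates)]
    return statistics.stdev(estimates)

if __name__ == "__main__":
    for n_loci in (1, 4, 16, 64):
        print(n_loci, round(estimator_sd(n_loci), 3))
```

The spread of the estimate shrinks roughly with the square root of the number of loci. Under these assumptions, adding sites to a single locus would not reduce this component of the variance, because all sites in a nonrecombining locus share one coalescent history.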

Third, species and population delimitation will become fundamental to constructing species trees (O’Meara 2008). This suggestion comes from the fact that another key assumption, at least in this first generation of species tree approaches, is lack of gene flow between species in the tree. The absence of gene flow and of other mechanisms of lateral genetic transfer goes a long way toward satisfying the assumptions of many species tree approaches. (The impact of gene flow on species tree inference is likely to be substantial (Eckert and Carstens 2008), yet in many ways no more severe than for gene tree inference; in both cases, care is required in interpreting the resulting tree.) For this reason, a critical step in species tree analysis will be defining taxa in such a way that this assumption is met.

In fact, species trees are often compatible with a number of prominent species concepts, particularly those that emphasize reproductive isolation, genetic cohesion, and lineage isolation. For example, the “metapopulation lineage species concept” proposed by de Queiroz (2005) views species as sets of wholly or partially interbreeding units and subsumes many of the positive aspects of multiple species concepts. The growing appreciation of the multidimensionality of species and the variation in their embedded gene lineages (even by Willi Hennig, in his famous and frequently republished diagram of gene lineages in two diverging populations in his Phylogenetic Systematics) makes such species concepts attractive and increasingly compatible with the multilocus DNA sequence datasets that are becoming the norm. In addition, a battery of new definitions and methods for quantifying gene tree heterogeneity will greatly facilitate the species tree approach (Avise and Robinson 2008; Cummings et al. 2008; Ané et al. 2007; Baum 2007). By contrast, species tree approaches are less compatible with species concepts that focus on diagnosability via monophyly of gene trees. Although the usefulness of such monophyly as a criterion for recognizing species is often criticized, particularly with mitochondrial DNA, such a criterion is nonetheless used quite regularly (Zink 2006; Zink and Barrowclough 2008). Other species concepts based on multilocus genealogical distinctiveness, such as the genealogical species concept or Avise and Ball's (1990) genealogical concordance concept, in which ∼95% of gene lineages should be monophyletic under good species, are less useful in a species tree context, because the very nature of species trees acknowledges the possibility of distinct species despite rampant and ongoing incomplete lineage sorting (Edwards et al. 2005). 
In my view gene tree monophyly should be abandoned as a criterion for species, because, in addition to its conflation of patterns and criteria for diagnosability at the level of genes and species, it can easily split biodiversity far too narrowly, or lump taxa far too liberally, depending on a variety of accidents of population genetics, including allelic sampling, natural selection, founder effects, and other vagaries of population history (Rosenberg 2003, 2007).

A final issue that will be important to watch as species tree approaches diversify is recombination. Most species tree approaches (Table 1) tacitly assume that recombination is absent within genetic segments, but complete between such segments. This assumption allows each gene tree to be conditionally independent of the other trees, yet the signal of each gene to be internally consistent. There has been surprisingly little interest in studying the effects of recombination on phylogenetic analysis, in part because recombination can only occur among alleles in the same population. For this reason it is thought that recombination within diverging lineages that are not exchanging genes with other such lineages is unlikely to strongly affect higher level phylogenetics, because no information is exchanged between species. Yet under the species tree paradigm, recombination within loci, or lack of recombination between loci (linkage), is likely to have important effects, and these should be quantified; theory suggests that even small amounts of recombination between loci can quickly render their histories independent of one another in a species tree context (Slatkin and Pollack 2006). For these reasons individual unlinked SNPs may emerge as an important type of character for estimating phylogenies (species trees), and we are beginning to see efforts in this area (RoyChoudhury et al. 2008). Individual SNPs are a relief for those who worry about recombination within loci (because there is no recombination within a single SNP) and they can be collected rapidly on very large scales, as recent genome projects have shown. Again, phylogeographic methods might help show the way, as there are several methods tailored for within-species variation that extract useful information on population parameters from linked or unlinked SNPs (Falush et al. 2003, 2007; Pritchard Beerli 2006; Kuhner 2006). Some recent phylogeographic approaches that incorporate recombination within loci into the model for multilocus data appear promising (Kuhner 2006; Becquet and Przeworski 2007).
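The independence assumption described above can be illustrated with a small simulation. The following is a minimal sketch, not any published method: it assumes a three-species tree ((A,B),C), one allele sampled per species, no recombination within loci, and free recombination between loci, so that each locus is an independent draw from the coalescent given the species tree. The function names and the parameter `t_internal` (the length of the branch ancestral to A and B, in coalescent units) are hypothetical names chosen for illustration. The analytical value used for comparison, 1 - (2/3)e^(-T), is the standard three-species concordance probability.

```python
import math
import random


def simulate_concordance(t_internal, n_loci, seed=0):
    """Fraction of independent loci whose gene tree matches ((A,B),C).

    t_internal: length of the internal branch ancestral to A and B,
    in coalescent units. Each locus is simulated independently, which
    encodes the assumption of free recombination between loci.
    """
    rng = random.Random(seed)
    concordant = 0
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce is
        # exponential with rate 1 in coalescent units.
        if rng.expovariate(1.0) < t_internal:
            # Coalescence inside the A-B ancestor: gene tree matches.
            concordant += 1
        elif rng.randrange(3) == 0:
            # All three lineages reach the root population; each of the
            # three pairs is equally likely to coalesce first, and one
            # of those pairs (A,B) gives the species tree topology.
            concordant += 1
    return concordant / n_loci


def expected_concordance(t_internal):
    """Analytical probability of concordance for three species."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t_internal)


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 3.0):
        sim = simulate_concordance(t, n_loci=20000, seed=1)
        print(f"T={t}: simulated={sim:.3f}, expected={expected_concordance(t):.3f}")
```

With short internal branches a large fraction of independently segregating loci disagree with the species tree, which is one way to see why the number of unlinked loci, and not only the number of sites, matters under these assumptions. Modeling linkage between loci or recombination within them would require a different simulator and is outside this sketch.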

Conclusion—The Relevance of Species Trees

John Avise encapsulated the relationship between gene and species trees well in 1994: “Gene trees and species trees are equally ‘real’ phenomena, merely reflecting different aspects of the same phylogenetic process. Thus, occasional discrepancies between the two need not be viewed with consternation as sources of ‘error’ in phylogeny estimation. When a species tree is of primary interest, gene trees can assist in understanding the population demographies underlying the speciation process” (pp. 133 and 138 in Avise 1994). This essay is in part meant to reemphasize Avise’s perspective and to remind readers that species trees are in fact the “primary interest” of systematics.

My essay is not meant to champion any particular new software or statistical approach; but my polemic against concatenation and supermatrix approaches has no doubt been emboldened by the recent success of a new generation of species tree approaches in a wide variety of phylogenetic situations (Table 1). Despite the advent of these new and often promising approaches, there is still a great need for additional models and methods that can efficiently analyze the very large phylogenomics datasets that are becoming the norm. Thus my essay is instead meant to champion a perspective on phylogenetics that has had many conceptual ancestors, yet is still in need of new models by theoreticians and experimentation by empiricists. The call for embracing species trees does not derive from the success of particular methods in a slightly wider region of tree space (such as the anomaly zone) than traditional methods. Nor does it derive from a failure of concatenation approaches to deliver reasonable trees, although I have suggested several ways in which concatenation can mislead. Rather, a heightened focus on species trees arises from an awareness of the near ubiquity of gene tree heterogeneity (whether in topology or branch lengths); from a consideration of the basic goals of systematics, whose focus is on trees of species and lineages; and from the fact that we can now act on these goals given the availability of at least a few computationally feasible methods. In one sense the transition could be construed as trivial; after all, species tree approaches really represent just a different way of combining data in phylogenetic analysis. On the other hand, the array of new approaches that have already appeared and the renewed focus on lineages and populations that they provide allow us to state in hindsight that systematics has been overly “gene centric,” at least since entering the PCR era. 
This gene centrism has been an extremely valuable way station as many other issues with the analysis of DNA sequence data have been sorted out. I suggest, however, that the field has now matured enough that we can move on to the next phase in which species and populations regain their rightful place as the primary focus in phylogenetic analysis.

Species tree approaches will of course open up a plethora of new debates and challenges for the field, both for higher level systematics and for phylogenetic analysis near the species level. For example, virtually any debate that has already taken place in the modern era of molecular systematics can and will take place with species trees as the new focus. Such debates include issues on the molecular clock, taxon sampling, phylogenetic bias, rooting, incorporating fossil data, merging morphological and molecular data, and ways of achieving high levels of confidence. And yet in some cases, the consensus of the community may settle on an answer different from that proffered during the gene tree era of systematics. After all, the statistical quantities of species trees—topologies, branch lengths, times of divergence—are different from those for gene trees. For example, are more taxa or more sequence data better for estimation of species trees? This question has for the most part received the answer of “more taxa” or perhaps in some cases “both” (Graybeal 1998; Pollock et al. 2002; Zwickl and Hillis 2002; Hedtke et al. 2006), but we have already seen that it might have a wholly new answer of “more genes” in the case of species trees. Another example where species trees will usher in a new dialogue is the nature and sources of polytomies (discussed above), a debate that I feel has been fraught with confusion precisely because the community has failed to adequately distinguish polytomies in gene trees from polytomies in species trees. Ways of treating polymorphic characters in phylogenetic analysis, as well as optimal sampling of species for phylogenetic analysis, may also benefit from clearly distinguishing between gene and species trees (Wiens 1999; Geuten et al. 2007). 
We can look forward to a more seamless integration of phylogeography and phylogenetics, two fields that have been divided in the recent past due to methodological and conceptual differences (Hey and Machado 2003; Brito and Edwards 2008). I suspect that species tree approaches, along with the new and awesome power of modern sequencing and computational methods, will play an important role in creating a uniform methodological platform on which the diversity of genetic patterns emanating from diverse genomes can be interpreted and compared. They should be celebrated as a return to the genuine focus of systematics and will play an important role in helping build the Tree of Life, perhaps even facilitating the completion of this goal and a move beyond a focus on pattern to considerations of evolutionary mechanism and process.

Associate Editor: M. Rausher

ACKNOWLEDGMENTS

I thank Evolution Editor M. Rausher for inviting me to write a commentary and for his patience during writing. I thank my collaborators, L. Liu and D. Pearl, for inspiring many of the ideas in this essay, as well as B. Rannala, J. Wakeley, J. Felsenstein, L. Knowles, D. Baum, N. Rosenberg, B. Carstens and R. Nielsen for helpful discussion over the years that has helped clarify my thoughts. L. Liu performed the simulations in Figures 2 and 4. I received very helpful comments on the manuscript from B. O’Meara, G. Spellman, B. Arbogast, C. Marshall and T. Near. B. O’Meara, A. Rambaut, N. Rosenberg, A. RoyChoudhury and J. Degnan provided helpful discussion and extensive comments on Table 1. Thanks to N. Rosenberg, J. Degnan and A. RoyChoudhury for providing preprints of as yet unpublished material. This work was supported by NSF grant 0743616 with D. Pearl.

LITERATURE CITED
