Author for correspondence: Jeff J. Doyle Tel: +1 607 255 7972 Email: email@example.com
•Polyploidy is a widespread speciation mechanism, particularly in plants. Estimating the time of origin of polyploid species is important for understanding issues such as gene loss and changes in regulation and expression among homoeologous copies that coexist in a single genome owing to polyploidy.
•Polyploid species can originate in various ways; the effects of mode of origin, genetic system, and sampling on estimates of the age of polyploid origin using distances between alleles of polyploids and their diploid progenitors, or between homoeologous loci in a polyploid genome, are explored.
•Even in the simplest cases, simulations confirm that different loci are expected to give very different estimates of the date of origin. The time of polyploid origin is no older than the time estimated from comparison of an allele sampled from the polyploid with the most closely related allele in the diploid progenitor. The polyploidy literature often does not make clear the longstanding observation that the divergence of homoeologous copies in an allopolyploid tracks the divergence of diploid species, not the origin of the polyploid.
•Estimating the date of origin of a polyploid is difficult, and in some circumstances impossible. Skepticism about dates of polyploid origins is clearly warranted.
Polyploidy has been a ubiquitous genetic mechanism throughout the history of flowering plants (Masterson, 1994). It is a prevalent phenomenon in the chromosomal evolution of extant species and genera (Otto & Whitton, 2000), and its footprint is apparent in the genomes of chromosomally diploid plant species, including nearly all of those whose genomes have been fully sequenced (Arabidopsis, Populus, Vitis, Oryza; Fawcett et al., 2009). It has been hypothesized that polyploidy may have contributed to the origin of flowering plants (De Bodt et al., 2005), and that it is a potent source of morphological innovation (Freeling & Thomas, 2006). Polyploidy also occurs in many other groups of organisms, including animals (Mable, 2004). Polyploidy is known to have profound effects on genome structure and gene expression (Doyle et al., 2008). It is also a major speciation mechanism in plants. In a recent brief review of speciation, Hendry (2009) noted that one of the most hotly debated topics in this controversial area is sympatric speciation, and that there are many who reject the notion entirely – except for the formation of a polyploid, in which case genome doubling can impose reproductive isolation instantaneously.
Estimating the time of origin of polyploid species is important for understanding issues such as gene loss and changes in regulation and expression among homoeologous copies that coexist in a single genome owing to polyploidy (Doyle et al., 2008). With the exception of inferences from the fossil record (Masterson, 1994), dating of polyploid origins requires the use of molecular data, sometimes involving comparisons between alleles in a polyploid and its presumed diploid progenitor(s) (Senchina et al., 2003), and sometimes involving comparisons of duplicated genes within a polyploid genome (Lynch & Conery, 2000). Conventional molecular evolutionary methods are used for this purpose, such as the use of a molecular clock to estimate dates from the number of synonymous substitutions per synonymous site in protein-coding genes (Ks). When genome doubling and speciation are simultaneous, the polyploid species can be traced back to the production of a single polyploid individual. In this simple conceptualization of polyploid evolution there is also a firm date associated with speciation, because there is a singular origin of the polyploid species. There is, in theory, no uncertainty associated with this date, which can be obtained by measuring the distance between the polyploid and its diploid progenitor(s) at any locus in the genome.
Unfortunately, this model of polyploid origin is a drastic oversimplification, and estimating the date of origin of a polyploid species is not this straightforward. Issues such as multiple origins, origins involving unreduced gametes, differences between autopolyploid vs allopolyploid genetic systems, extinction of key taxa or lineages, making inferences directly from a polyploid’s genome without sampling diploids, and sampling standing variation of progenitor species all add complications that make it difficult to make assumptions even about what is being measured when an origin and its date are hypothesized. Here we discuss these complications in an effort to clarify what is being measured when molecular methods are used to identify and date the origin(s) of a polyploid species.
Background and definitions
Although all polyploids by definition have ‘doubled’ genomes, the ways in which this increase in genomic complexity is accomplished vary, and this, in turn, affects the interpretation of molecular data used to estimate time of origin (Gaut & Doebley, 1997). The terms ‘autopolyploid’ and ‘allopolyploid’ can be used in either a taxonomic or genetic sense (Ramsey & Schemske, 1998; Table 1). In the taxonomic definition, autopolyploids are polyploids formed from within a single species, whereas allopolyploids are formed by hybridization between two or more species. In the genetic definition, autopolyploids are plants with random association among four (in a tetraploid) homologous chromosomes, leading to tetrasomic segregation, whereas allopolyploids have two sets of homoeologous chromosomes that do not typically pair, leading to disomic segregation. The latter definition is most relevant for the issue of dating polyploid origins. This is because patterns of segregation determine what genetic variation is retained over time, and it is this variation that is used to infer the time of origin (Gaut & Doebley, 1997). Because a genetic allopolyploid shows diploid chromosomal behavior, it is a simpler case than genetic autopolyploidy. Therefore, throughout this discussion, genetic allopolyploidy will generally be discussed first, with genetic autopolyploidy introduced subsequently.
Table 1. Genetic (inheritance type) and taxonomic (number of progenitor species) definitions of autopolyploidy and allopolyploidy
Inheritance type | One progenitor species | Two progenitor species
Tetrasomic | Genetic autopolyploid; taxonomic autopolyploid | Genetic autopolyploid; taxonomic allopolyploid
Disomic | Genetic allopolyploid; taxonomic autopolyploid | Genetic allopolyploid; taxonomic allopolyploid
The term ‘homoeologous’ was originally created to describe chromosomes that were cytogenetically ‘partially homologous’ in the sense of having some rearrangements, relative to ancestral synteny, that could reduce pairing in a hybrid (Huskins, 1932). Thus, when an allopolyploid is formed by hybridization and doubling involving diploid species with somewhat rearranged and differentiated chromosomes, the two subgenomes comprise sets of homoeologous chromosomes. By extension, genes on homoeologous chromosomes are considered homoeologues. Homoeologous loci were orthologues in the diploid progenitor species – homologous sequences produced by a speciation event. In the polyploid’s genome they are paralogues – multiple homologous sequences ‘duplicated’ by genome merger (see Wendel & Doyle, 2004 for discussion).
A simple case: allopolyploidy, progenitors extant
We begin with a simple case, in which a polyploid species has originated a single time by hybridization between two genetically distinct, sexually reproducing, panmictic, non-sister diploid species, followed by genome doubling in the F1; the resulting polyploid exhibits disomic segregation at all loci. It is thus an allopolyploid in both the taxonomic and genetic senses. This allopolyploid and multiple individuals of its two diploid progenitor species are sampled immediately after the origin of the polyploid. Given these conditions, the polyploid will be a fixed hybrid, retaining the allelic contributions from both parents at most loci (exceptions would be caused by gene loss or homoeologous recombination from ‘genomic shock’ during early stages of polyploid formation; e.g. Gaeta et al., 2007). Thus, an allele at a given homoeologous locus in the polyploid will be more closely related to the progenitor allele in the diploid than to the other homoeologue in the polyploid (Fig. 1).
Clearly, if the polyploid has just formed, the age of the polyploid is zero years and, in theory, should be reflected in a zero distance between alleles of the polyploid and its diploid progenitor at each homoeologous locus. However, in a panmictic diploid species, each individual contains a random sample of the allelic variation at each locus. Therefore, unless the actual diploid progenitor genotype was included in the sample, alleles at many loci in the polyploid (hereafter called ‘polyploid alleles’) would not have a zero distance to any sampled diploid allele, although a network approach could permit a zero distance to an unsampled diploid allele to be inferred (Fig. 2a). Because of this, distance measurements at different loci would provide different estimates for the age of the polyploid. The technical issue of measuring small divergences is discussed separately, below.
The magnitude of the difference among loci depends on the degree of standing allelic variation in the diploid progenitor at the time of polyploidy (Gaut & Doebley, 1997), on the shapes of the allele trees at different loci, and on the depth of sampling of the diploid progenitor species. Estimates will range from zero (e.g. when the actual progenitor allele is sampled from the diploid) to a maximum possible value given by the deepest coalescence in the allele tree (Fig. 3). Across a group of loci, the maximum possible estimate of the polyploid’s origin is given by the most deeply coalescing locus. For neutrally evolving loci this can be modeled using coalescent simulation (e.g. in Mesquite: Maddison & Maddison, 2009), considering each randomly generated tree to be the tree for a different locus. In one such simulation, with population size set to 100 000 (the effective population size estimated for the model plant, Arabidopsis thaliana, based on average polymorphism levels from Nordborg et al., 2005), a sample of 100 alleles had an average deepest coalescence of c. 200 000 generations, with the deepest coalescence among the 1000 trees (loci) being over 700 000 generations. In this example, the estimate of a polyploid’s origin could range from 0 to 700 000 generations. For many perennial plants, the effective population size could be much lower than 100 000; a similar simulation with a population size of 10 000 produced a maximum difference of 80 000 generations. However, a shallower coalescence depth in generations could be even greater in years for a long-lived perennial, again leading to an extensive range of observed divergences between polyploid and diploid progenitor alleles.
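The kind of simulation described above can be approximated with a minimal sketch of the standard neutral (Kingman) coalescent. This is not the Mesquite implementation; the function names are hypothetical, and it assumes that the stated population size counts gene copies, so that each pair of lineages coalesces at rate 1/pop_size per generation. Under that assumption the expected deepest coalescence for 100 alleles is 2N(1 - 1/n), or 198 000 generations, close to the c. 200 000 reported; under a diploid scaling all times would double.

```python
import random

def deepest_coalescence(n_alleles, pop_size, rng):
    """Generations back to the common ancestor of a sample of alleles."""
    t = 0.0
    for k in range(n_alleles, 1, -1):
        # With k lineages there are k*(k-1)/2 pairs; each pair coalesces
        # at rate 1/pop_size per generation (assumption: pop_size = gene copies).
        rate = k * (k - 1) / (2.0 * pop_size)
        t += rng.expovariate(rate)
    return t

def simulate_loci(n_loci=1000, n_alleles=100, pop_size=100_000, seed=1):
    """Treat each independently simulated tree as a different locus."""
    rng = random.Random(seed)
    return [deepest_coalescence(n_alleles, pop_size, rng) for _ in range(n_loci)]

depths = simulate_loci()
mean_depth = sum(depths) / len(depths)
max_depth = max(depths)
print(f"mean deepest coalescence: {mean_depth:,.0f} generations")
print(f"deepest among all loci:   {max_depth:,.0f} generations")
```

Only the tree depth is simulated here, because that depth is what sets the upper limit on the divergence between a polyploid allele and any sampled diploid allele at a locus.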
Any process that increases the coalescence depth at a locus relative to the neutral expectation (e.g. balancing selection) would increase the potential variation in diploid–polyploid Ks among loci. Clearly, when the actual age of the polyploid is zero, the best estimate is given by the minimum coalescence date among all estimates. This is true not only for comparisons of loci in one diploid progenitor with alleles in the polyploid, but also when evaluating estimates of homoeologues with the two different diploid progenitors. For example, if one diploid progenitor underwent a severe bottleneck just before polyploid formation whereas the other diploid progenitor maintained a consistently large effective population size, most loci in the first diploid species would provide better estimates of the polyploid age than would orthologous loci in the second diploid.
Adding complexity: older polyploids, multiple origins, and autopolyploids
As the time since polyploidy increases, the expected variance in Ks between diploid and polyploid alleles among loci that results from standing variation in the diploid progenitor genome should decrease. This is because the range of variation among the sampled neutrally evolving loci is still what it was at the time of polyploid formation, from zero to the maximum coalescence depth, but as time since polyploidy increases, this value makes a decreasing fractional contribution to the total Ks estimate (Fig. 4). For example, in the coalescent simulation described above, at 500 000 generations after polyploid formation, estimates from different loci could vary by > 50% because of differences in coalescent depths among loci, whereas 5 million generations after origin the estimates would vary by only c. 10% because of initial coalescent depth differences. For this reason, Zhang et al. (2002) concluded that standing variation was not a major contributor to the high variation they observed in Ks among Arabidopsis homoeologue pairs at different loci duplicated simultaneously by polyploidy.
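The shrinking contribution of standing variation can be illustrated with simple arithmetic. The sketch below is one possible reading of the figures quoted above, not the authors' calculation: it assumes that the spread among loci is at most the deepest coalescence in the progenitor (700 000 generations in the simulation) and expresses it as a share of the largest possible total divergence. The function name is hypothetical.

```python
def standing_variation_share(generations_since_origin, max_coalescence_depth=700_000):
    """Share of the largest possible divergence that predates polyploid formation."""
    return max_coalescence_depth / (generations_since_origin + max_coalescence_depth)

for t in (500_000, 5_000_000):
    # Prints 0.58 for 500 000 generations and 0.12 for 5 million generations.
    print(t, round(standing_variation_share(t), 2))
```

Under these assumptions the shares are about 58% and 12%, of the same order as the > 50% and c. 10% given in the text.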
As time passes, the minimum polyploid age estimated by comparing a polyploid allele with one or more diploid alleles will increasingly over-estimate the actual age of polyploid origin because of the extinction of relevant allele lineages in the diploid. This is because lineage extinction can never decrease the estimate, but it can eliminate the progenitor allele or its close relatives.
Many polyploids have originated multiple times involving the same diploid genome or genomes but different genotypes (Soltis et al., 2004) and these origins could occur over a period of many years. When a polyploid formed more than once, it is possible to estimate the number of origins if alleles from the polyploid are nested within the diploid allele network at a given locus (Fig. 2b). In that case, the number of origins is estimated from the number of observed or inferred diploid alleles that have given rise directly to one or more polyploid alleles (Doyle et al., 2004a). The number of distinct alleles in the polyploid is not in itself a measure of the number of origins because this number should increase by divergence over time even in the case of a single origin.
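The counting rule can be made concrete with a small sketch. The allele labels and the dictionary layout are hypothetical; the sketch assumes that an allele network has already been built and that each polyploid allele has been assigned either the diploid allele (observed or inferred) from which it derives directly, or None if it derives from another polyploid allele.

```python
def estimate_origins(direct_diploid_parent):
    """Count the diploid alleles that gave rise directly to polyploid alleles."""
    parents = {p for p in direct_diploid_parent.values() if p is not None}
    return len(parents)

# Hypothetical locus: four distinct polyploid alleles (P1..P4).
network = {
    "P1": "D3",   # derived directly from diploid allele D3
    "P2": "D3",   # a second polyploid allele tracing to D3
    "P3": "D7",   # derived directly from diploid allele D7
    "P4": None,   # arose from P3 by divergence within the polyploid
}
n_origins = estimate_origins(network)
print(n_origins, "origins inferred from", len(network), "polyploid alleles")
```

In this example four polyploid alleles imply only two origins, which is why the number of distinct polyploid alleles is not itself a measure of the number of origins.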
Clearly, when there have been multiple origins there is not a single date for the origin of the polyploid, but unrecognized multiple origins will increase the variance in multilocus estimates of ‘the’ age of a polyploid (Fig. 5). Even if multiple alleles are sampled, gene flow in the polyploid can result in many loci possessing alleles from only one origin; other loci would retain alleles from both origins. A bimodal distribution of age estimates across multiple loci would indicate that two origins had occurred, but both the low and high age estimates would show variation caused by sampling effects: the low estimate (recent origin) because of the coalescent effect noted above and the high estimate (older origin) because of different patterns of diploid allele extinction at different loci. The youngest estimate should be the most accurate for both the recent and older origin.
The mechanism of polyploid origin is also important in inferring and dating origins. Although polyploids can be formed directly by chromosome doubling of a single plant that is either a diploid hybrid (producing a taxonomic allopolyploid) or a nonhybrid (producing a taxonomic autopolyploid), polyploidy may more commonly occur in a stepwise manner involving unreduced gametes (Harlan & De Wet, 1975). In this scenario, tetraploids may be formed through an intermediate triploid step. This complicates the definition of ‘polyploid origin’ for the eventual tetraploid species, because the triploid is also a polyploid. However, unless considerable time passes between the formation of the triploid and the eventual increase to the tetraploid level, the two dates will be similar. As noted by Watanabe et al. (1991), formation by unreduced gametes also complicates the estimation of the number of origins using nuclear markers, because an unreduced gamete can simultaneously incorporate two different alleles into the polyploid genome at a locus, eliminating the 1 : 1 correspondence between alleles and origins.
Counting and dating polyploid origins is further complicated by the potential for gene flow between a polyploid and its diploid progenitors mediated by unreduced gametes in the diploid (Arriola & Ellstrand, 1996). When multiple diploid–polyploid allele pairs are observed (e.g. Fig. 2b), and these have different divergence times, the usual interpretation is that these represent multiple independent origins, but it is also possible that only the pair with the greatest distance represents an ‘origin’ and the less diverged pairs represent evidence of subsequent gene flow. From a genetic perspective, multiple origins and diploid–tetraploid gene flow are identical in being sources of genetic diversity in the polyploid; the presence of such diversity is more significant than how it arrived in the tetraploid. Continued gene flow between a polyploid and a diploid progenitor also may blur the concept of ‘origin’ by calling into question whether speciation has occurred at all. This problem is ignored throughout the remainder of this paper.
Thus far, only allopolyploids have been discussed. Age estimates for genetic autopolyploids are complicated by multisomic inheritance, which leads to segregational loss of allelic variation. Whereas a genetic allopolyploid is a fixed hybrid, showing diploid segregational behavior at each pair of homoeologous loci across its genome, there are no homoeologous loci in a genetic autopolyploid. Instead, each locus in a genetic autotetraploid can have as many as four alleles if it is the product of hybridization between two genotypes that were both heterozygous at that locus. These alleles can be lost by random segregation, so that some loci will have alleles from only one parent whereas others will retain alleles from both. Sampling numerous individuals of the diploid relatives of a recently formed genetic autopolyploid that is also a taxonomic autopolyploid should lead to an estimate of polyploid origin with no greater variance than in the case of the allopolyploid discussed above. The same concerns apply for neutrally evolving loci in both cases, with the difference being that instead of two completely separate estimates at each locus in a genetic allopolyploid (one from each homoeologue to its diploid progenitor), each comparison in the genetic autopolyploid will be made to the set of all alleles available from a single diploid progenitor if it is also a taxonomic autopolyploid. The case of a genetic autopolyploid in which the parents are from different species is more complex because at loci for which the polyploid is homozygous only one parental allele or the other will be observed (Fig. 6). However, the same considerations again apply, though many loci will only provide one estimate rather than two as in the case of a taxonomic allopolyploid.
Polyploids whose diploid progenitors are extinct
As time passes since the origin of a taxonomic allopolyploid, it becomes increasingly likely that one or both diploid progenitor species have become extinct. If only one progenitor is extinct, then it is still possible to obtain an accurate estimate for the age of origin of the polyploid, within the limits described above, by comparison of one homoeologue to the extant diploid progenitor at each locus. The estimate from the other homoeologue at each locus will simply be older, coalescing with alleles from the most closely related extant (and sampled) diploid relative of the missing progenitor (Fig. 7). A large difference between estimates from two members of a homoeologue pair to their nearest diploid orthologue is an indication that one progenitor may not have been sampled. The best estimate of polyploid age will be given by the smaller of the two distances, which could still be larger than the true age of origin if both diploid progenitors are extinct. The situation is analogous to sampling within an extant progenitor, where it is never certain that the most closely related diploid allele has been sampled unless a diploid allele is sampled that is identical to the polyploid allele. Here the time-scale is greater, and extinction will affect all loci. The only defense against this overestimation is sampling multiple diploid taxa and populations: nesting of an allele from the polyploid within the allele tree of a diploid is the best evidence that the diploid progenitor has been identified, unless there are several closely related diploids whose alleles coalesce together. As time passes, it will become increasingly likely that alleles from a polyploid and its diploid progenitor will form sister monophyletic groups. This process is identical to the process by which diploid sister species progress from paraphyly to reciprocal monophyly (Rieseberg & Brouillet, 1994); in this case, one ‘species’ is one homoeologous subgenome of the allopolyploid. 
However, a reciprocally monophyletic allele relationship can also result from a more recent polyploid origin from a now-extinct diploid progenitor.
If both diploid progenitors are extinct, any estimate of the date of allopolyploid origin will be inaccurately high, measuring diploid speciation events rather than polyploidy. That both diploid progenitors are extinct is not apparent, however, until all diploid species in the clades to which the progenitors belong, and all species in any clades separating these progenitor clades, are extinct (Fig. 7c), at which point homoeologues will be sister to one another. This homoeologue topology mimics the expectation for a taxonomic autopolyploid.
Clearly, then, the distance between a pair of homoeologues provides a maximum estimate of the date of origin of an allopolyploid, as pointed out by Gaut & Doebley (1997, p. 6810):
‘With exclusive disomic inheritance, homologous loci from the two ancestral species remain distinct. Sequences from ‘duplicated’ loci have a coalescent time that corresponds to the date of the divergence of the two ancestral diploid species ... . The species’ divergence time predates tetraploid formation.’
Therefore, only if the polyploid originated simultaneously with the divergence of the diploid progenitors would the divergence time of a pair of homoeologues be identical to the time of polyploid origin. As Gaut & Doebley (1997) note, the distance between two homoeologues only ‘corresponds to’ the age of the polyploid. The Ks between two homoeologues has two components: (1) the distance between orthologues in the diploid progenitors before their being brought together as homoeologues in the polyploid genome, plus (2) the distance accumulated since the polyploid event.
In the absence of additional information, there is no minimum estimate for the time of origin from homoeologue distances – the polyploid could have originated at any time after the speciation of the progenitors, up to the present. However, if speciation subsequently occurs in the polyploid lineage, that speciation time marks the minimum date of polyploidy. For example, the divergence of homoeologue pairs in the genome of Glycine species is estimated to be c. 10 million years (MY), and the divergence of the two subgenera of Glycine is about half that (Innes et al., 2008; A. N. Egan & J. J. Doyle, unpublished), giving a range of 5–10 MY for the origin of the polyploid ancestor of modern Glycine (Fig. 8).
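The bracketing logic can be written down as a short sketch. The function and parameter names are hypothetical, and the Glycine values are the approximate figures quoted above rather than new estimates.

```python
def polyploidy_age_bounds(homoeologue_divergence, earliest_split_in_polyploid_lineage=0.0):
    """Return (minimum, maximum) age of an allopolyploid event.

    The maximum is the homoeologue (progenitor species) divergence time.
    The minimum is the oldest speciation within the polyploid lineage,
    or zero (the present) when no such speciation is known.
    """
    if earliest_split_in_polyploid_lineage > homoeologue_divergence:
        raise ValueError("a split within the polyploid lineage cannot predate homoeologue divergence")
    return earliest_split_in_polyploid_lineage, homoeologue_divergence

glycine = polyploidy_age_bounds(10.0, 5.0)   # ages in MY, from the text
no_speciation = polyploidy_age_bounds(10.0)  # no speciation within the polyploid lineage
print(glycine, no_speciation)
```

Without a speciation event in the polyploid lineage, the lower bound falls to the present, as stated in the text.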
Estimates of polyploid origin made solely from a polyploid’s genome
When extinction has removed diploid progenitors and their relatives from consideration, inferences must be made from pairs of homoeologues that are sisters in phylogenetic reconstructions (e.g. Fig. 7). But homoeologues are effectively sisters when polyploid genome(s) are studied without reference to diploid taxa (e.g. Blanc & Wolfe, 2004a; Schlueter et al., 2004). The occurrence of a polyploid duplication can be discerned in a species’ genome by analysing the distribution of distances between paralogous copies at many loci. The simultaneous duplication of all loci in the genome should be visible as a ‘peak’ in a plot of the number of gene pairs vs Ks, which otherwise would show only a large number of recent duplicate pairs that are rapidly purged from the genome over time (Lynch & Conery, 2000; Fig. 9). This insight has been exploited in several plant comparative genomics studies, for example in crop species (Blanc & Wolfe, 2004; Schlueter et al., 2004), in ‘basal’ angiosperms (Cui et al., 2006), among Asteraceae (Barker et al., 2008), and across flowering plants (Fawcett et al., 2009).
Many paralogous pairs comprise the peak inferred from the Ks distribution. Some are homoeologues, whereas others are paralogues resulting from simple duplications that occurred around the time of the polyploid event. In completely sequenced genomes such as Arabidopsis, rice, or poplar, it is possible to distinguish homoeologues from paralogues of similar age because the former occur in linked blocks (Zhang et al., 2002). Identifying polyploid peaks in a Ks distribution when linkage information is lacking can be very difficult, particularly for older events (Cui et al., 2006).
A Ks peak provides evidence for the existence of a polyploid event, but it does not give the age of this event, because, as noted above, the distance between homoeologues combines the divergence since polyploidy with the pre-existing divergence between diploid progenitor alleles. The peak therefore marks the mode of Ks values of all of the orthologues from the two diploid progenitors that become homoeologues at the time of hybridization leading to allopolyploidy, and which subsequently continue to diverge from one another (Fig. 9). As noted above for single homoeologue pairs, the date estimated for the Ks peak is thus the maximum age of allopolyploidy (Fig. 8).
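A minimal sketch of this reasoning follows. It locates the modal bin of a Ks distribution and converts it to a date with the standard clock relation T = Ks / (2r). The Ks values, bin width, search window, and substitution rate are all hypothetical and chosen only for illustration, and published studies use more elaborate peak-fitting methods. The point of the sketch is the return value: the date is an upper bound on the age of allopolyploidy, not the age itself.

```python
from collections import Counter

def ks_peak_mode(ks_values, bin_width=0.1, window=(0.3, 2.0)):
    """Midpoint of the most populated Ks bin inside the search window."""
    lo, hi = window
    counts = Counter(int(ks / bin_width) for ks in ks_values if lo <= ks < hi)
    modal_bin, _ = counts.most_common(1)[0]
    return (modal_bin + 0.5) * bin_width

def max_allopolyploid_age_years(ks_mode, rate_per_site_per_year):
    """Homoeologue divergence time, T = Ks / (2r): a maximum age for the polyploid."""
    return ks_mode / (2.0 * rate_per_site_per_year)

# Hypothetical paralogue-pair Ks values: recent duplicates plus one older peak.
ks_values = [0.02, 0.03, 0.05, 0.08, 0.12, 0.21,
             0.52, 0.55, 0.58, 0.61, 0.63, 0.64, 0.66, 0.67, 0.69,
             0.72, 0.75, 0.83, 1.15, 1.42]
mode = ks_peak_mode(ks_values)
# Illustrative rate only (substitutions per synonymous site per year).
max_age = max_allopolyploid_age_years(mode, rate_per_site_per_year=1e-8)
print(f"Ks mode = {mode:.2f}; allopolyploidy no older than {max_age / 1e6:.1f} MY")
```

The window excludes the pile of recent duplicates near Ks = 0, which would otherwise dominate the mode.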
The fact that the mode or median of a Ks peak is not the age of the polyploidy event is unlikely to be recognized by naive readers of the Ks/polyploidy literature. For example, although Table 2 of Blanc & Wolfe (2004) is titled ‘Estimated Ages of the Observed Secondary Peaks’ – which is entirely neutral as to the phenomenon that generated the peaks – the text cites this table as providing ‘absolute dates for the large-scale gene duplications’ (p. 1673). It is not that the authors are unaware of the issue, as is clear from their statement (p. 1672) that ‘ ...the ancestor of Medicago may have been an allotetraploid resulting from the merger of two diverged genomes (i.e. A and B). In this case, the date of divergence between these two subgenomes... is older than the actual polyploidy event’. However, there is reference throughout the paper to duplication events, and their Fig. 4 (p. 1674) is a phylogeny showing ‘dates of duplication and speciation events’, which a reader might mistakenly assume showed the dates of polyploid events rather than the date of divergence of the polyploid’s progenitors. This is also true elsewhere in the polyploidy literature. Schlueter et al. (2004, p. 869) state that ‘Determination of evolutionary distances between large numbers of pairs of duplicated genes allowed us to estimate when these polyploidization events may have occurred’, although they also state (p. 871) that ‘Coalescence estimates correspond to the point when groups of gene pairs began evolving independently’, which is correct but could be read incorrectly to mean independence of homoeologues from their progenitors rather than from one another. Maere et al. (2005) refer several times to ‘waves of duplication’ represented by peaks in the Ks distribution; in Cui et al. (2006), Ks peaks ‘correspond’ to polyploidy events. Table 1 of Fawcett et al. (2009) gives ‘Timing of WGD (whole genome duplication) events in plants’. By contrast, Barker et al. (2008) avoid all reference to dates of polyploidy when discussing the Ks distributions from which they infer whole-genome duplications in Asteraceae.
Some ambiguity is almost certainly traceable to the terminology of polyploidy. The process by which ‘duplicate’ loci of an allopolyploid arise is very different from the process by which a single gene produces two initially identical paralogues in a diploid genome. An allopolyploid ‘duplication’ is a ‘genome merger’ (Doyle et al., 2008) in which the duplicates (homoeologues) are already diverged from one another. As noted above, the term ‘homoeologue’ was initially coined to describe chromosomes in different diploid species (Huskins, 1932). The adoption of this term for genes as opposed to chromosomes, coupled with the differences between polyploid and nonpolyploid duplications, has led to the terminological ambiguity among homoeologue, paralogue and orthologue discussed earlier.
How significant is this issue? In the example of Glycine, the problem of dating from homoeologues produces a twofold range of possible dates of the polyploid event, and this clearly limits what can be said about the pace at which polyploid evolution has occurred in the genus. Could this range be narrowed if it could be assumed that the likelihood of hybridization (and hence allopolyploid formation) decreases with increasing time since divergence of diploid taxa? In Glycine, allopolyploids that formed within the last 50 000 yr combine the genomes of species estimated to have diverged c. 4 MYA and which belong to different ‘genome groups’ based on their inability to form fertile hybrids (Doyle et al., 2004b; unpublished data). Therefore it would presumably have been possible for the earlier polyploidy event to have occurred at nearly any time during the 5 MY period between homoeologue divergence and the radiation of extant Glycine species. If it is generally true that the propensity to form polyploid, rather than homoploid, hybrids increases with increasing genetic distance (Chapman & Burke, 2007), then homoeologue (progenitor species) divergence time will often seriously overestimate the date of allopolyploid origins.
Over time the contribution of this source of ambiguity will decrease as a proportion of the total range of divergence times. The ambiguity between homoeologue divergence and the actual date of polyploidy in Glycine leads to placing the polyploidy event between 5 Mya and 10 Mya; 10 My from now the date will lie between 15 Mya and 20 Mya – not nearly as significant as the current twofold range of possible dates.
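The arithmetic behind this point is sketched below, using the Glycine figures from the text. The function name is hypothetical, and the sketch assumes only that both bounds age by the same amount as time passes, so the absolute width of the interval stays fixed while its fold-range shrinks.

```python
def age_interval(min_age_now, max_age_now, elapsed=0.0):
    """Return (minimum age, maximum age, fold-range) after `elapsed` further time."""
    lo = min_age_now + elapsed
    hi = max_age_now + elapsed
    return lo, hi, hi / lo

today = age_interval(5.0, 10.0)                 # ages in MY
later = age_interval(5.0, 10.0, elapsed=10.0)   # 10 MY from now
print(today)   # twofold range
print(later)   # range falls to about 1.33-fold
```

The interval remains 5 MY wide in both cases; only its size relative to the total age changes.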
Homoeologue divergence times and diploidization in genetic autopolyploids
Estimating ages from homoeologues is still more complex for polyploids that begin their existence as genetic autopolyploids. It is generally assumed that genetic autopolyploids eventually shift their pattern of inheritance from tetrasomic to disomic as part of the process of ‘diploidization,’ though whether this is always true remains an open question and one that may be impossible to answer with certainty (Doyle et al., 2008). It is not known whether ancient polyploid events in species whose genomes have been sequenced, such as Arabidopsis or soybean (Glycine max), were originally autopolyploid or allopolyploid (Gill et al., 2009; Shoemaker et al., 2006), but at present these species are diploidized, with well-defined homoeologous blocks of genes (Innes et al., 2008; Thomas et al., 2006). If these species originated as genetic autopolyploids, different homoeologous loci may yield different synonymous distances because of random segregational fixation at different loci (Gaut & Doebley, 1997). This is best seen in a polyploid that originates as a genetic autopolyploid but was formed by hybridization between two diverged diploid species (i.e. a taxonomic allopolyploid). This situation, where diploid progenitors are extant, was discussed above (Fig. 6).
The potential for variance in estimates increases after genetic diploidization occurs, and the four tetrasomically segregating alleles are partitioned into two pairs of disomically segregating homoeologues. Figure 10 illustrates the consequences of this process at three different loci, each of which retains alleles with different coalescent times because of random segregation during the period of tetrasomy. The key point here is that none of these homoeologues will necessarily coalesce at the time of polyploid formation – there is nothing about that point that starts a new ‘divergence clock’. Some homoeologues will coalesce before polyploidy, at the time of divergence between alleles in the diploid progenitors (Fig. 10, locus 3), whereas other homoeologue pairs will coalesce at various points between the time of polyploidy and the onset of disomy (loci 1 and 2). Only a locus like locus 1, which was homozygous at the time of diploidization, ‘reflects roughly the time of onset of disomic inheritance’ (Gaut & Doebley, 1997, p. 6809).
Yet another complication is possible for genetic autopolyploids. As Gaut & Doebley (1997) point out (p. 6809):
‘The switch from tetrasomic to disomic inheritance is the key variable in this model. If the switch is coordinated among chromosomes, then all pairs of duplicated sequences are expected to coalesce at roughly the same time… If the switch from tetrasomic to disomic inheritance is not well coordinated among chromosomes, then coalescent times could be scattered over a broad range.’
This would not affect the deepest possible coalescences (e.g. locus 3 in Fig. 10) directly, though a change in the time between polyploidy and onset of disomy would increase or decrease the probability of losing allelic variation at all loci, including these deeply coalescing ones. The primary effect of variation in the onset of disomy at different loci would be on the minimum estimate of polyploid age. Polyploids with tetrasomic inheritance at some loci and disomic inheritance at others are termed ‘segmental allopolyploids.’
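The verbal model above can be explored with a minimal simulation sketch. This is not the procedure of Gaut & Doebley (1997) or of the present paper. It assumes, purely for illustration, that a homoeologue pair cannot coalesce during the disomic phase, that during tetrasomy the pair coalesces at a constant rate of 1/n_copies per generation, and that any pair still uncoalesced at polyploid formation is assigned the divergence time of the progenitor alleles. All function names and parameter values are hypothetical.

```python
import random

def homoeologue_coalescence(t_disomy, t_polyploidy, t_progenitor, n_copies, rng):
    """Coalescence time (generations before present) for one homoeologue pair.

    Illustrative assumptions: no coalescence during the disomic phase;
    coalescence at rate 1/n_copies per generation during tetrasomy; a pair
    that has not coalesced by the time of polyploid formation is assigned
    the progenitor allele divergence time t_progenitor.
    """
    waiting = rng.expovariate(1.0 / n_copies)  # mean waiting time = n_copies
    t = t_disomy + waiting
    return t if t < t_polyploidy else t_progenitor

def simulate_loci(disomy_onsets, t_polyploidy, t_progenitor, n_copies, seed=1):
    """Simulate one homoeologue pair per locus, given each locus's disomy onset."""
    rng = random.Random(seed)
    return [
        homoeologue_coalescence(t_d, t_polyploidy, t_progenitor, n_copies, rng)
        for t_d in disomy_onsets
    ]

# Hypothetical values, in generations before present.
T_POLY = 100_000   # polyploid formation
T_PROG = 500_000   # divergence of alleles in the diploid progenitors
N_COPIES = 40_000  # gene copies per locus in the tetrasomic population
N_LOCI = 1000

# Coordinated switch: every locus becomes disomic at the same time.
coordinated = simulate_loci([20_000] * N_LOCI, T_POLY, T_PROG, N_COPIES)

# Uncoordinated switch: each locus becomes disomic at its own time.
onset_rng = random.Random(2)
onsets = [onset_rng.uniform(0, T_POLY) for _ in range(N_LOCI)]
uncoordinated = simulate_loci(onsets, T_POLY, T_PROG, N_COPIES)

for label, times in (("coordinated", coordinated), ("uncoordinated", uncoordinated)):
    shallow = [t for t in times if t < T_POLY]
    deep = len(times) - len(shallow)
    if shallow:
        print(label, "youngest pair:", round(min(shallow)), "deep pairs:", deep)
```

In this sketch no locus is constrained to coalesce at the time of polyploid formation itself, and an uncoordinated switch lowers the youngest coalescence times observed, which is the sense in which variation in the onset of disomy affects the minimum estimate of polyploid age.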
Methodological issues: estimating dates
Throughout this discussion it has been assumed that distances can be estimated accurately, and that dates can be inferred from them reliably. Anyone familiar with the molecular systematics literature knows that these assumptions are, at best, dangerous. In the field of polyploid evolution, it has been observed that loci with homoeologous gene pairs tracing back to the same polyploidy event differ in Ks by as much as 13-fold in Arabidopsis (Zhang et al., 2002). What, then, is the date of the event that led to homoeologue divergence in this species? Similar levels of rate variation have been observed in other studies (e.g. in soybean; Schlueter et al., 2007; Innes et al., 2008). Using a molecular clock approach on individual gene pairs under these circumstances would lead to wildly different estimates. The solution adopted by Ks/polyploidy studies is to use the distribution of paralogue pairs (as in Fig. 9), assuming that the mode of this distribution captures a genome-wide average, effectively smoothing out variation caused by differences among genes. However, translating distances to ages using this method still requires knowledge of the global synonymous substitution rate; Blanc & Wolfe (2004) and Schlueter et al. (2004) used very different rates, and thus reported very different polyploidy dates for peaks with nearly identical Ks modes in Arabidopsis and soybean. Similarly, when rate variation occurs among lineages involving many genes, problems occur when the same synonymous substitution rate is used to estimate homoeologue divergence times from a Ks peak mode. For example, Ks peaks with different modal values were observed in two legumes, Glycine and Medicago, leading to estimates of 42 MY in the former and 55 MY in the latter by Schlueter et al. (2004). Phylogenetic testing of numerous gene families supported a single shared polyploid event in the common ancestor of these species (Pfeil et al., 2005; Cannon et al., 2006). Such problems led Fawcett et al. (2009) to use one of several available nonclock dating methods, penalized likelihood (PL: Sanderson, 2002), which required the construction of numerous gene phylogenies. However, using such methods is no guarantee that dates will be estimated with precision, let alone accuracy. Dates estimated using PL and Bayesian methods from gene pairs sequenced from two c. 1 Mb homoeologous segments in soybean and allies (Innes et al., 2008) still varied nearly as much as when simpler molecular clock methods were used (A. N. Egan & J. J. Doyle, unpublished).
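The sensitivity of a Ks-based date to the assumed substitution rate can be made concrete with a short sketch. It uses the relationship commonly applied in molecular clock dating, T = Ks / (2r), where r is the synonymous substitution rate per site per year along one lineage. The Ks mode and both rates below are hypothetical values chosen for illustration; they are not the values used in any of the studies cited above.

```python
def age_from_ks(ks_mode, rate_per_site_per_year):
    """Convert a Ks value to a divergence time in years under a strict clock.

    The two copies diverge along two lineages, so Ks accumulates at twice
    the per-lineage rate: T = Ks / (2 * r).
    """
    return ks_mode / (2.0 * rate_per_site_per_year)

# Hypothetical inputs (illustration only).
KS_MODE = 0.6
SLOW_RATE = 6.0e-9   # substitutions per synonymous site per year
FAST_RATE = 1.5e-8

for rate in (SLOW_RATE, FAST_RATE):
    age_my = age_from_ks(KS_MODE, rate) / 1e6
    print(f"rate {rate:.1e}: estimated age about {age_my:.0f} MY")
```

Under these assumed values the same Ks mode yields dates that differ by a factor of 2.5, simply because the assumed rates differ by that factor.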
When dating recent polyploidy events, a significant problem is accurate estimation of very low divergences from single gene sequences. Given rates c. 6 × 10⁻⁹ synonymous substitutions per synonymous site per year (see Fawcett et al., 2009 for rates used in recent studies of polyploidy), it would take around half a million years for a pair of alleles with 300 synonymous sites to average one synonymous difference. Thus, during the early history of a polyploid, most homoeologues will have an observed divergence of zero substitutions between a polyploid allele and the diploid progenitor allele. To discriminate between small distances with statistical confidence requires a larger number of synonymous sites, because the sampling variance is inversely proportional to that number. It is becoming increasingly feasible to sequence large numbers of sites as high-throughput sequencing methods become more widely available, but for recombining sequences (nuclear genes) this strategy means bringing together genes with potentially different coalescent histories. The degree to which the use of linked genes can circumvent this problem is limited by the extent of linkage disequilibrium in the taxa being studied. In practice, then, phylogenomic approaches to recent polyploid origin date estimates will be affected by the issue of standing variation.
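The arithmetic behind these figures can be sketched as follows. The sketch assumes that substitutions accumulate as a Poisson process, which is a simplifying assumption not stated in the text, and the `lineages` parameter is hypothetical: a value of 1 reproduces the calculation in the paragraph above, whereas a value of 2 would count substitutions along both diverging copies if the rate is interpreted as a per-lineage rate.

```python
import math

def expected_differences(rate, sites, years, lineages=1):
    """Expected number of synonymous differences between two sequences."""
    return rate * sites * years * lineages

def prob_zero_differences(rate, sites, years, lineages=1):
    """Probability of observing no differences, assuming a Poisson process."""
    return math.exp(-expected_differences(rate, sites, years, lineages))

RATE = 6e-9    # synonymous substitutions per synonymous site per year (from the text)
SITES = 300    # synonymous sites (from the text)

for years in (50_000, 100_000, 500_000):
    mean = expected_differences(RATE, SITES, years)
    p_zero = prob_zero_differences(RATE, SITES, years)
    print(f"{years:>7} yr: expected differences {mean:.2f}, P(no differences) {p_zero:.2f}")
```

With these inputs the expected number of differences approaches one only after roughly half a million years, and for much younger polyploids most loci of this length would be expected to show no differences at all.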
Limitations of the conventional molecular dating approaches to polyploid origins were noted by Jakobsson et al. (2006), who described a Bayesian coalescent approach for dating the origin of the allopolyploid, Arabidopsis suecica. This alternative to traditional procedures is in keeping with the general trend toward coalescent methods for addressing species-level questions in systematic biology (Edwards, 2009). However, their methods produce estimates for a single origin of this polyploid that range from 12 000 to 300 000 yr, and they note that these estimates are strongly dependent on assumptions about mutation rates and rates of population expansion following polyploid origins – parameters that are even more difficult to estimate in most species than in Arabidopsis.
It is wise to be skeptical about any divergence date based on molecular data (Graur & Martin, 2004), and in the case of polyploids all of the usual technical concerns apply. Furthermore, the diversity of mechanisms that can lead to polyploid formation (e.g. instantaneous doubling vs stepwise, through a triploid intermediate), coupled with the different segregational systems of polyploids (disomic vs tetrasomic), makes it difficult to know what biological event is being measured when the divergence time of a pair of sequences is estimated.
Disagreement among polyploid age estimates derived from multiple loci is another manifestation of the gene tree/species tree problem best known in the arena of phylogeny reconstruction (Pamilo & Nei, 1988). Divergence dates inferred from a gene tree are indirect estimates of the divergence times in the species tree (or network, in the case of taxonomic allopolyploids). For estimating the age of a polyploid relative to one of its extant diploid progenitors using neutrally evolving loci, standing genetic variation in the diploid at the time of origin, coupled with sampling of the diploid, can lead to significant disagreements among loci. For older polyploids, particularly when estimates are made from comparisons of homoeologue pairs in the polyploid, variation among loci may result from true rate differences among loci (Zhang et al., 2002), but the meaning of the dates themselves varies with the genetic system of the polyploid and with the histories of individual loci.
Polyploid speciation is a particular case of sympatric speciation, and estimating speciation times for polyploids is no different than estimating diploid speciation times: in the simplest case, at the time of formation, a founder individual has alleles at each locus with a zero distance from alleles at the orthologous locus in the single progenitor individual (diploid or taxonomic autopolyploid speciation) or individuals from two progenitor species (taxonomic allopolyploid speciation). With time and concomitant divergence, identifying the actual progenitor allele will be difficult for both the diploid and the polyploid cases. The allele tree/network for a polyploid and its progenitor(s) would also look very similar to the early stages of diploid allopatric speciation (Fig. 2). Indeed, diploid speciation by a single founder individual involving a common allele from the source population would produce a pattern identical to that of a polyploid speciation event involving a single individual. Similarities also exist between allopatric speciation involving larger numbers of individuals and polyploid speciation with multiple origins, and in both cases allele trees/networks would be produced in which alleles sampled from the two taxa are interspersed. In the case of diploid allopatric speciation, however, not all alleles in each new species need have a perfect match in the other, and in cases where they do not, the age estimated from the closest allele pair will be older than the speciation event. In the case of a polyploid, all alleles initially had a zero distance from an allele in the diploid progenitor. This difference may not be of practical significance, however, because of the uncertainty involved in sampling the standing variation in the diploid progenitor of a polyploid species (Fig. 3) and because of technical problems with estimating distances.
For older polyploid species, the date of origin may simply be unknowable owing to extinction of diploid progenitors (Fig. 7), despite its importance for testing evolutionary hypotheses (Fawcett et al., 2009). This is no different than the situation in a group of diploid species, where times of divergence and diversification of extant taxa can be estimated, but dates of actual speciation events cannot.
As molecular phylogenetics has moved from a reliance on single genes to a more phylogenomic approach, and as analyses and theory have become more rigorous, incongruence has become increasingly routine, whether among gene trees themselves or among the species hypotheses and key dates derived from them. As with any other inference about events that cannot be observed directly, there is uncertainty associated with the date of origin of a polyploid. Caveat emptor.
The authors acknowledge funding from the US National Science Foundation (grants DEB-0516673 and DBI-0822258). Two anonymous reviewers provided helpful comments. We thank members of the L. H. Bailey Hortorium molecular systematics discussion group for productive discussions.