Dating the origins of polyploidy events


Author for correspondence:
Jeff J. Doyle
Tel: +1 607 255 7972


  • Polyploidy is a widespread speciation mechanism, particularly in plants. Estimating the time of origin of polyploid species is important for understanding issues such as gene loss and changes in regulation and expression among homoeologous copies that coexist in a single genome owing to polyploidy.
  • Polyploid species can originate in various ways; the effects of mode of origin, genetic system, and sampling on estimates of the age of polyploid origin using distances between alleles of polyploids and their diploid progenitors, or between homoeologous loci in a polyploid genome, are explored.
  • Even in the simplest cases, simulations confirm that different loci are expected to give very different estimates of the date of origin. The time of polyploid origin is no older than the time estimated from comparison of an allele sampled from the polyploid with the most closely related allele in the diploid progenitor. The polyploidy literature often does not make clear the longstanding observation that the divergence of homoeologous copies in an allopolyploid tracks the divergence of diploid species, not the origin of the polyploid.
  • Estimating the date of origin of a polyploid is difficult, and in some circumstances impossible. Skepticism about dates of polyploid origins is clearly warranted.


Polyploidy has been a ubiquitous genetic mechanism throughout the history of flowering plants (Masterson, 1994). It is a prevalent phenomenon in the chromosomal evolution of extant species and genera (Otto & Whitton, 2000), and its footprint is apparent in the genomes of chromosomally diploid plant species, including nearly all of those whose genomes have been fully sequenced (Arabidopsis, Populus, Vitis, Oryza; Fawcett et al., 2009). It has been hypothesized that polyploidy may have contributed to the origin of flowering plants (De Bodt et al., 2005), and that it is a potent source of morphological innovation (Freeling & Thomas, 2006). Polyploidy also occurs in many other groups of organisms, including animals (Mable, 2004). Polyploidy is known to have profound effects on genome structure and gene expression (Doyle et al., 2008). It is also a major speciation mechanism in plants. In a recent brief review of speciation, Hendry (2009) noted that one of the most hotly debated topics in this controversial area is sympatric speciation, and that there are many who reject the notion entirely – except for the formation of a polyploid, in which case genome doubling can impose reproductive isolation instantaneously.

Estimating the time of origin of polyploid species is important for understanding issues such as gene loss and changes in regulation and expression among homoeologous copies that coexist in a single genome owing to polyploidy (Doyle et al., 2008). With the exception of inferences from the fossil record (Masterson, 1994), dating of polyploid origins requires the use of molecular data, sometimes involving comparisons between alleles in a polyploid and its presumed diploid progenitor(s) (Senchina et al., 2003), and sometimes involving comparisons of duplicated genes within a polyploid genome (Lynch & Conery, 2000). Conventional molecular evolutionary methods are used for this purpose, such as the use of a molecular clock to estimate dates from the number of synonymous substitutions per synonymous site in protein-coding genes (Ks). When genome doubling and speciation are simultaneous, the polyploid species can be traced back to the production of a single polyploid individual. In this simple conceptualization of polyploid evolution there is also a firm date associated with speciation, because there is a singular origin of the polyploid species. There is, in theory, no uncertainty associated with this date, which can be obtained by measuring the distance between the polyploid and its diploid progenitor(s) at any locus in the genome.
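The clock calculation mentioned above can be sketched as follows. This is a minimal illustration that assumes a strict molecular clock and the standard relation T = Ks / (2r) for two lineages diverging from a common ancestor. The function name and both input values are hypothetical and are not taken from this paper.

```python
def divergence_time(ks: float, rate_per_site_per_year: float) -> float:
    """Return divergence time in years under a strict molecular clock.

    Both lineages accumulate substitutions at the same rate, so the pairwise
    distance Ks grows at twice the per-lineage rate.
    """
    if ks < 0 or rate_per_site_per_year <= 0:
        raise ValueError("ks must be non-negative and rate must be positive")
    return ks / (2.0 * rate_per_site_per_year)


# Hypothetical inputs, chosen only for illustration.
ILLUSTRATIVE_RATE = 1.5e-8  # synonymous substitutions per site per year (assumed)
example_ks = 0.3            # assumed pairwise synonymous distance

age_years = divergence_time(example_ks, ILLUSTRATIVE_RATE)
print(f"{age_years / 1e6:.1f} million years")
```

Any date obtained this way inherits the uncertainty of the assumed rate, in addition to the sampling issues discussed in the rest of the paper.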

Unfortunately, this model of polyploid origin is a drastic oversimplification, and estimating the date of origin of a polyploid species is not this straightforward. Issues such as multiple origins, origins involving unreduced gametes, differences between autopolyploid vs allopolyploid genetic systems, extinction of key taxa or lineages, making inferences directly from a polyploid’s genome without sampling diploids, and sampling standing variation of progenitor species all add complications that make it difficult to make assumptions even about what is being measured when an origin and its date are hypothesized. Here we discuss these complications in an effort to clarify what is being measured when molecular methods are used to identify and date the origin(s) of a polyploid species.

Background and definitions

Although all polyploids by definition have ‘doubled’ genomes, the ways in which this increase in genomic complexity is accomplished vary, and this, in turn, affects the interpretation of molecular data used to estimate time of origin (Gaut & Doebley, 1997). The terms ‘autopolyploid’ and ‘allopolyploid’ can be used in either a taxonomic or genetic sense (Ramsey & Schemske, 1998; Table 1). In the taxonomic definition, autopolyploids are polyploids formed from within a single species, whereas allopolyploids are formed by hybridization between two or more species. In the genetic definition, autopolyploids are plants with random association among four (in a tetraploid) homologous chromosomes, leading to tetrasomic segregation, whereas allopolyploids have two sets of homoeologous chromosomes that do not typically pair, leading to disomic segregation. The latter definition is most relevant for the issue of dating polyploid origins. This is because patterns of segregation determine what genetic variation is retained over time, and it is this variation that is used to infer the time of origin (Gaut & Doebley, 1997). Because a genetic allopolyploid shows diploid chromosomal behavior, it is a simpler case than genetic autopolyploidy. Therefore, throughout this discussion, genetic allopolyploidy will generally be discussed first, with genetic autopolyploidy introduced subsequently.

Table 1.   Genetic (inheritance type) and taxonomic (number of progenitor species) definitions of autopolyploidy and allopolyploidy
Inheritance type | One progenitor species | Two progenitor species
Polysomic inheritance | Genetic autopolyploid; taxonomic autopolyploid | Genetic autopolyploid; taxonomic allopolyploid
Disomic inheritance | Genetic allopolyploid; taxonomic autopolyploid | Genetic allopolyploid; taxonomic allopolyploid

The term ‘homoeologous’ was originally created to describe chromosomes that were cytogenetically ‘partially homologous’ in the sense of having some rearrangements, relative to ancestral synteny, that could reduce pairing in a hybrid (Huskins, 1932). Thus, when an allopolyploid is formed by hybridization and doubling involving diploid species with somewhat rearranged and differentiated chromosomes, the two subgenomes comprise sets of homoeologous chromosomes. By extension, genes on homoeologous chromosomes are considered homoeologues. Homoeologous loci were orthologues in the diploid progenitor species – homologous sequences produced by a speciation event. In the polyploid’s genome they are paralogues – multiple homologous sequences ‘duplicated’ by genome merger (see Wendel & Doyle, 2004 for discussion).

A simple case: allopolyploidy, progenitors extant

We begin with a simple case, in which a polyploid species has originated a single time by hybridization between two genetically distinct, sexually reproducing, panmictic, non-sister diploid species, followed by genome doubling in the F1; the resulting polyploid exhibits disomic segregation at all loci. It is thus an allopolyploid in both the taxonomic and genetic senses. This allopolyploid and multiple individuals of its two diploid progenitor species are sampled immediately after the origin of the polyploid. Given these conditions, the polyploid will be a fixed hybrid, retaining the allelic contributions from both parents at most loci (exceptions would be caused by gene loss or homoeologous recombination from ‘genomic shock’ during early stages of polyploid formation; e.g. Gaeta et al., 2007). Thus, an allele at a given homoeologous locus in the polyploid will be more closely related to the progenitor allele in the diploid than to the other homoeologue in the polyploid (Fig. 1).

Figure 1.

 Single locus gene tree for four diploid species (Sp1–4) and an allopolyploid formed by hybridization between Sp1 and Sp4. The allopolyploid has two homoeologous loci, one from Sp1 (H1) and one from Sp4 (H2).

Clearly, if the polyploid has just formed, the age of the polyploid is zero years and, in theory, should be reflected in a zero distance between alleles of the polyploid and its diploid progenitor at each homoeologous locus. However, in a panmictic diploid species, each individual contains a random sample of the allelic variation at each locus. Therefore, unless the actual diploid progenitor genotype was included in the sample, alleles at many loci in the polyploid (hereafter called ‘polyploid alleles’) would not have a zero distance to any sampled diploid allele, although a network approach could permit a zero distance to an unsampled diploid allele to be inferred (Fig. 2a). Because of this, distance measurements at different loci would provide different estimates for the age of the polyploid. The technical issue of measuring small divergences is discussed separately, below.

Figure 2.

 Allele networks for a single homoeologous locus in a polyploid, showing alleles sampled from the diploid progenitor (large circles labeled ‘2x’), unsampled alleles from the diploid (small closed circles), unsampled alleles that could be from either a diploid or from a tetraploid (small open circles), and alleles from the polyploid (squares labeled ‘4x’). (a) Evidence exists for only a single origin of the polyploid. Although the polyploid’s allele is not identical to any allele sampled from the diploid, its position in the interior of the network means that it is identical to an (unsampled) allele from the diploid. (b) Evidence exists for three origins of the polyploid (dashed lines grouping polyploid alleles with nearest diploid allele). Each origin is inferred from the relationship of one or more polyploid alleles with a sampled or unsampled diploid allele. Multiple alleles from the polyploid at two of the origins are parsimoniously inferred to have diverged after polyploidy, rather than being independently incorporated in the polyploid’s genome by additional origins. Both scenarios assume that only one allele was contributed per polyploid event (i.e. there was not a heterozygous unreduced gamete involved in the origin).

The magnitude of the difference among loci depends on the degree of standing allelic variation in the diploid progenitor at the time of polyploidy (Gaut & Doebley, 1997), on the shapes of the allele trees at different loci, and on the depth of sampling of the diploid progenitor species. Estimates will range from zero (e.g. when the actual progenitor allele is sampled from the diploid) to a maximum possible value given by the deepest coalescence in the allele tree (Fig. 3). Across a group of loci, the maximum possible estimate of the polyploid’s origin is given by the most deeply coalescing locus. For neutrally evolving loci this can be modeled using coalescent simulation (e.g. in Mesquite: Maddison & Maddison, 2009), considering each randomly generated tree to be the tree for a different locus. In one such simulation, with population size set to 100 000 (the effective population size estimated for the model plant, Arabidopsis thaliana, based on average polymorphism levels from Nordborg et al., 2005), a sample of 100 alleles had an average deepest coalescence of c. 200 000 generations, with the deepest coalescence among 1000 simulated trees (loci) being over 700 000 generations. In this example, the estimate of a polyploid’s origin could range from 0 to 700 000 generations. For many perennial plants, the effective population size could be much lower than 100 000; a similar simulation with a population size of 10 000 produced a maximum difference of 80 000 generations. However, a shallower coalescence depth in generations could be even greater in years for a long-lived perennial, again leading to an extensive range of observed divergences between polyploid and diploid progenitor alleles.
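A simulation of this kind can be approximated with a few lines of code. The sketch below is not the Mesquite procedure. It assumes a standard neutral coalescent in which, with k lineages remaining, the waiting time to the next coalescence is exponential with rate k(k − 1)/2 per N generations. Under that assumed scaling the expected deepest coalescence is 2N(1 − 1/n), about 198 000 generations for N = 100 000 and n = 100, in line with the average quoted above. Other scaling conventions would shift all times by a constant factor. The function name and random seed are illustrative.

```python
import random


def deepest_coalescence(n_alleles: int, pop_size: int, rng: random.Random) -> float:
    """Simulate the time (in generations) back to the most recent common
    ancestor of n_alleles sampled at one neutral locus.

    Assumption: with k lineages, the waiting time to the next coalescence is
    exponential with rate k*(k-1)/2 per pop_size generations.
    """
    total = 0.0
    for k in range(n_alleles, 1, -1):
        rate = k * (k - 1) / 2.0 / pop_size
        total += rng.expovariate(rate)
    return total


rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Treat each simulated tree as a different locus, as in the text.
depths = [deepest_coalescence(100, 100_000, rng) for _ in range(1000)]
mean_depth = sum(depths) / len(depths)
max_depth = max(depths)
print(f"mean deepest coalescence: {mean_depth:,.0f} generations")
print(f"deepest among 1000 loci:  {max_depth:,.0f} generations")
```

The spread between the mean and the maximum across loci is the point of interest: it bounds how far single-locus age estimates for a newly formed polyploid can disagree.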

Figure 3.

 Effect of sampling and coalescence depth on estimating the date of polyploid origin. An allopolyploid has just evolved a single time by genome merger following hybridization between two diploid species. Ten alleles (indicated by closed arrows) are sampled from a neutrally evolving nuclear locus in one of the diploid progenitors; in this case the tree has a maximum coalescence depth of c. 580 000 generations, but the sampled diploid alleles coalesce at c. 200 000 generations. The same locus is then sampled from the polyploid (open arrows). Two examples are given. In neither case has the allele donated to the polyploid been sampled from the diploid, which would have had a divergence time of zero generations. In example 1, the allele from the polyploid is relatively closely related to several alleles sampled from the diploid, and would give an estimate of the polyploid origin of c. 10 000 generations. In example 2, by chance the allele contributed to the polyploid belongs to the deepest coalescing clade, a clade that was not sampled from the diploid; the estimate of polyploid origin would be c. 580 000 generations in this case. At loci with shallower maximum coalescence and similarly unbalanced topologies, the range of age estimates would be smaller; at loci with equally deep maximum coalescence but with more balanced topologies, the probability of sampling an allele from the diploid that also belonged to the deepest clade would be high, yielding a younger estimate for the origin of the polyploid.

Any process that increases the coalescence depth at a locus relative to the neutral expectation (e.g. balancing selection) would increase the potential variation in diploid–polyploid Ks among loci. Clearly, when the actual age of the polyploid is zero, the best estimate is given by the minimum coalescence date among all estimates. This is true not only for comparisons of loci in one diploid progenitor with alleles in the polyploid, but also when evaluating estimates of homoeologues with the two different diploid progenitors. For example, if one diploid progenitor underwent a severe bottleneck just before polyploid formation whereas the other diploid progenitor maintained a consistently large effective population size, most loci in the first diploid species would provide better estimates of the polyploid age than would orthologous loci in the second diploid.
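The rule that the smallest estimate is the best one can be sketched as a two-step calculation: at each locus, find the sampled diploid allele nearest to the polyploid allele, then take the minimum across loci. The example below uses a simple proportion of differing sites as a stand-in for Ks (estimating Ks itself requires a codon-based method), and the sequences and locus names are invented toy data.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def nearest_diploid_distance(polyploid_allele: str, diploid_alleles: list) -> float:
    """Distance from a polyploid allele to the closest sampled diploid allele."""
    return min(p_distance(polyploid_allele, d) for d in diploid_alleles)


# Hypothetical toy alignments: (polyploid allele, sampled diploid alleles).
loci = {
    "locus1": ("ACGTACGTAC", ["ACGTACGTAC", "ACGTTCGTAC"]),  # progenitor allele sampled
    "locus2": ("GGCATTCAGA", ["GGCTTTCAGT", "GACTTTCAGT"]),  # progenitor allele not sampled
}
per_locus = {name: nearest_diploid_distance(p, d) for name, (p, d) in loci.items()}

# For a newly formed polyploid, the smallest per-locus distance is the best estimate.
best_estimate = min(per_locus.values())
print(per_locus, best_estimate)
```

Here locus 2 alone would suggest a non-zero age only because the progenitor allele was not sampled; locus 1 recovers the true value of zero.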

Adding complexity: older polyploids, multiple origins, and autopolyploids

As the time since polyploidy increases, the expected variance in Ks between diploid and polyploid alleles among loci that results from standing variation in the diploid progenitor genome should decrease. This is because the range of variation among the sampled neutrally evolving loci is still what it was at the time of polyploid formation, from zero to the maximum coalescence depth, but as time since polyploidy increases, this value makes a decreasing fractional contribution to the total Ks estimate (Fig. 4). For example, in the coalescent simulation described above, at 500 000 generations after polyploid formation, estimates from different loci could vary by > 50% because of differences in coalescent depths among loci, whereas 5 million generations after origin the estimates would vary by only c. 10% because of initial coalescent depth differences. For this reason, Zhang et al. (2002) concluded that standing variation was not a major contributor to the high variation they observed in Ks among Arabidopsis homoeologue pairs at different loci duplicated simultaneously by polyploidy.

Figure 4.

 Effect of increasing age on the contribution of sampling to estimates of the date of polyploid origin. Two loci are shown at two time-points: (a) 1000 generations after origin, and (b) 1 million generations after origin. As in Fig. 3, at one locus (locus 1) the polyploid allele coalesces deeply (500 000 generations) with the nearest diploid allele, whereas at the second locus (locus 2) the progenitor diploid allele has been sampled. At 1000 generations after origin the two age estimates differ c. 500-fold; after 1 million generations the estimate from locus 1 is only c. 1.5 times that from locus 2.
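The arithmetic behind Fig. 4 can be checked directly. In this sketch each locus's estimate is taken to be the true age of the polyploid plus the coalescence depth separating the polyploid allele from the nearest sampled diploid allele at the time of origin. The function name is illustrative, and the depths are the ones given in the caption.

```python
def estimate_ratio(age: float, depth_locus1: float, depth_locus2: float = 0.0) -> float:
    """Ratio of the age estimates obtained from two loci.

    Each estimate is the true age of the polyploid plus the coalescence depth
    (in generations) between the polyploid allele and the nearest sampled
    diploid allele at the time of origin.
    """
    return (age + depth_locus1) / (age + depth_locus2)


# Locus 1 coalesces 500 000 generations deep; at locus 2 the progenitor
# allele itself was sampled (depth 0).
for age in (1_000, 1_000_000):
    print(age, estimate_ratio(age, 500_000))
```

The fixed offset contributed by standing variation stays the same while the shared age term grows, so the ratio falls towards 1 as the polyploid ages.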

As time passes, the minimum polyploid age estimated by comparing a polyploid allele with one or more diploid alleles will increasingly over-estimate the actual age of polyploid origin because of the extinction of relevant allele lineages in the diploid. This is because lineage extinction can never decrease the estimate, but it can eliminate the progenitor allele or its close relatives.

Many polyploids have originated multiple times involving the same diploid genome or genomes but different genotypes (Soltis et al., 2004), and these origins could occur over a period of many years. When a polyploid formed more than once, it is possible to estimate the number of origins if alleles from the polyploid are nested within the diploid allele network at a given locus (Fig. 2b). In that case, the number of origins is estimated from the number of observed or inferred diploid alleles that have given rise directly to one or more polyploid alleles (Doyle et al., 2004a). The number of distinct alleles in the polyploid is not in itself a measure of the number of origins because this number should increase by divergence over time even in the case of a single origin.
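The distinction between counting polyploid alleles and counting origins can be made concrete with a small example. The mapping below is hypothetical, loosely in the spirit of Fig. 2(b): each polyploid allele is assigned to the diploid allele (sampled or inferred) from which it descends in the network. The sketch assumes that only one allele was contributed per origin.

```python
# Hypothetical network summary: polyploid allele -> diploid source allele.
polyploid_to_diploid_source = {
    "4x_a": "2x_1",
    "4x_b": "2x_1",         # diverged from 4x_a after polyploidy
    "4x_c": "2x_inferred",  # source allele not sampled in the diploid
    "4x_d": "2x_inferred",
    "4x_e": "2x_7",
}


def count_origins(source_map: dict) -> int:
    """Parsimonious number of origins: the number of distinct diploid source alleles.

    Assumes one allele contributed per origin (no heterozygous unreduced gametes).
    """
    return len(set(source_map.values()))


n_alleles = len(polyploid_to_diploid_source)
n_origins = count_origins(polyploid_to_diploid_source)
print(n_alleles, "polyploid alleles,", n_origins, "inferred origins")
```

Five distinct polyploid alleles here imply only three origins, because two pairs of alleles are most simply explained by divergence after polyploidy.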

Clearly, when there have been multiple origins there is not a single date for the origin of the polyploid, but unrecognized multiple origins will increase the variance in multilocus estimates of ‘the’ age of a polyploid (Fig. 5). Even if multiple alleles are sampled, gene flow in the polyploid can result in many loci possessing alleles from only one origin; other loci would retain alleles from both origins. A bimodal distribution of age estimates across multiple loci would indicate that two origins had occurred, but both the low and high age estimates would show variation caused by sampling effects: the low estimate (recent origin) because of the coalescent effect noted above and the high estimate (older origin) because of different patterns of diploid allele extinction at different loci. The youngest estimate should be the most accurate for both the recent and older origin.

Figure 5.

 Effect of sampling individuals of a polyploid that has two origins. Gene trees for two loci are shown, each with several diploid alleles (D) and one polyploid allele (P) sampled; unsampled polyploid alleles are denoted by brackets and dashed branches. The polyploid has originated twice, with one ancient origin represented in each tree by a clade of polyploid alleles with no diploid members, and with a second, recent origin, whose single allele is nested within the diploid clade. The coalescence of the polyploid allele with the nearest diploid allele (circled node) is much older for locus 1 than for locus 2. Although this reflects the reality of the two origins, it would result in a large variance if interpreted as two estimates of a single polyploid origin.

The mechanism of polyploid origin is also important in inferring and dating origins. Although polyploids can be formed directly by chromosome doubling of a single plant that is either a diploid hybrid (producing a taxonomic allopolyploid) or a nonhybrid (producing a taxonomic autopolyploid), polyploidy may more commonly occur in a stepwise manner involving unreduced gametes (Harlan & De Wet, 1975). In this scenario, tetraploids may be formed through an intermediate triploid step. This complicates the definition of ‘polyploid origin’ for the eventual tetraploid species, because the triploid is also a polyploid. However, unless considerable time passes between the formation of the triploid and the eventual increase to the tetraploid level, the two dates will be similar. As noted by Watanabe et al. (1991), formation by unreduced gametes also complicates the estimation of the number of origins using nuclear markers, because an unreduced gamete can simultaneously incorporate two different alleles into the polyploid genome at a locus, eliminating the 1 : 1 correspondence between alleles and origins.

Counting and dating polyploid origins is further complicated by the potential for gene flow between a polyploid and its diploid progenitors mediated by unreduced gametes in the diploid (Arriola & Ellstrand, 1996). When multiple diploid–polyploid allele pairs are observed (e.g. Fig. 2b), and these have different divergence times, the usual interpretation is that these represent multiple independent origins, but it is also possible that only the pair with the greatest distance represents an ‘origin’ and the less diverged pairs represent evidence of subsequent gene flow. From a genetic perspective, multiple origins and diploid–tetraploid gene flow are identical in being sources of genetic diversity in the polyploid; the presence of such diversity is more significant than how it arrived in the tetraploid. Continued gene flow between a polyploid and a diploid progenitor also may blur the concept of ‘origin’ by calling into question whether speciation has occurred at all. This problem is ignored throughout the remainder of this paper.

Thus far, only allopolyploids have been discussed. Age estimates for genetic autopolyploids are complicated by multisomic inheritance, which leads to segregational loss of allelic variation. Whereas a genetic allopolyploid is a fixed hybrid, showing diploid segregational behavior at each pair of homoeologous loci across its genome, there are no homoeologous loci in a genetic autopolyploid. Instead, each locus in a genetic autotetraploid can have as many as four alleles if it is the product of hybridization between two genotypes that were both heterozygous at that locus. These alleles can be lost by random segregation, so that some loci will have alleles from only one parent whereas others will retain alleles from both. Sampling numerous individuals of the diploid relatives of a recently formed genetic autopolyploid that is also a taxonomic autopolyploid should lead to an estimate of polyploid origin with no greater variance than in the case of the allopolyploid discussed above. The same concerns apply for neutrally evolving loci in both cases, with the difference being that instead of two completely separate estimates at each locus in a genetic allopolyploid (one from each homoeologue to its diploid progenitor), each comparison in the genetic autopolyploid will be made to the set of all alleles available from a single diploid progenitor if it is also a taxonomic autopolyploid. The case of a genetic autopolyploid in which the parents are from different species is more complex because at loci for which the polyploid is homozygous only one parental allele or the other will be observed (Fig. 6). However, the same considerations again apply, though many loci will only provide one estimate rather than two as in the case of a taxonomic allopolyploid.

Figure 6.

 Effect of segregational loss of allelic variation at three loci in a genetic autopolyploid that is a taxonomic allopolyploid (Table 1), formed by hybridization between two diploid species (Sp1 and Sp3). At locus 1, tetrasomic segregation has led to the loss of the Sp3 allele, and only the Sp1 allele has been retained in the polyploid (P), whereas at locus 2 the Sp3 allele has been retained; only at locus 3 have alleles from both parents been retained. At locus 3 the autopolyploid looks like a genetic allopolyploid – a fixed hybrid – but this variation could be lost in subsequent generations unless the genetic system evolves from tetrasomic to disomic inheritance. Despite the loss of variation from one parent, estimates of date of origin could be obtained from either the Sp1 parent (locus 1) or the Sp3 parent (locus 2); as in a genetic allopolyploid, two estimates, one from each parent, could be obtained from variation at locus 3.

Polyploids whose diploid progenitors are extinct

As time passes since the origin of a taxonomic allopolyploid, it becomes increasingly likely that one or both diploid progenitor species have become extinct. If only one progenitor is extinct, then it is still possible to obtain an accurate estimate for the age of origin of the polyploid, within the limits described above, by comparison of one homoeologue to the extant diploid progenitor at each locus. The estimate from the other homoeologue at each locus will simply be older, coalescing with alleles from the most closely related extant (and sampled) diploid relative of the missing progenitor (Fig. 7). A large difference between estimates from two members of a homoeologue pair to their nearest diploid orthologue is an indication that one progenitor may not have been sampled. The best estimate of polyploid age will be given by the smaller of the two distances, which could still be larger than the true age of origin if both diploid progenitors are extinct. The situation is analogous to sampling within an extant progenitor, where it is never certain that the most closely related diploid allele has been sampled unless a diploid allele is sampled that is identical to the polyploid allele. Here the time-scale is greater, and extinction will affect all loci. The only defense against this overestimation is sampling multiple diploid taxa and populations: nesting of an allele from the polyploid within the allele tree of a diploid is the best evidence that the diploid progenitor has been identified, unless there are several closely related diploids whose alleles coalesce together. As time passes, it will become increasingly likely that alleles from a polyploid and its diploid progenitor will form sister monophyletic groups. This process is identical to the process by which diploid sister species progress from paraphyly to reciprocal monophyly (Rieseberg & Brouillet, 1994); in this case, one ‘species’ is one homoeologous subgenome of the allopolyploid. However, a reciprocally monophyletic allele relationship can also result from a more recent polyploid origin from a now-extinct diploid progenitor.

Figure 7.

 The effect of progenitor extinction on estimating dates of polyploid origin. A gene tree is shown for a single locus sampled in five species (Sp1–5); an allopolyploid with two homoeologous loci (H1, H2) was formed by hybridization between Sp1 and Sp5. (a) Immediately after allopolyploid formation, age of origin can be inferred from either homoeologous locus, with the same considerations as in Fig. 3, from coalescence of polyploid alleles (P) with the nearest diploid allele (circled nodes). (b) At a later date, diploid progenitor Sp1 is extinct (indicated by dotted lines); the date of origin based on H1 is inferred to be the age of divergence of Sp1 from Sp2 (circled node), whereas the age estimated from H2 is now younger, because Sp5 is extant. (c) Still later, all of the diploid species are extinct; the only estimate available for the age of the polyploid is the divergence of alleles at H1 and H2, which is the date of divergence of Sp1 and Sp5, a much more ancient date than that of the polyploid event.

If both diploid progenitors are extinct, any estimate of the date of allopolyploid origin will be inaccurately high, measuring diploid speciation events rather than polyploidy. That both diploid progenitors are extinct is not apparent, however, until all diploid species in the clades to which the progenitors belong, and all species in any clades separating these progenitor clades, are extinct (Fig. 7c), at which point homoeologues will be sister to one another. This homoeologue topology mimics the expectation for a taxonomic autopolyploid.

Clearly, then, the distance between a pair of homoeologues provides a maximum estimate of the date of origin of an allopolyploid, as pointed out by Gaut & Doebley (1997, p. 6810):

‘With exclusive disomic inheritance, homologous loci from the two ancestral species remain distinct. Sequences from ‘duplicated’ loci have a coalescent time that corresponds to the date of the divergence of the two ancestral diploid species ... . The species’ divergence time predates tetraploid formation.’

Therefore, only if the polyploid originated simultaneously with the divergence of the diploid progenitors would the divergence time of a pair of homoeologues be identical to the time of polyploid origin. As Gaut & Doebley (1997) note, the distance between two homoeologues only ‘corresponds to’ the age of the polyploid. The Ks between two homoeologues has two components: (1) the distance between orthologues in the diploid progenitors before their being brought together as homoeologues in the polyploid genome, plus (2) the distance accumulated since the polyploid event.
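The two components can be written out explicitly under a simplifying assumption that is not made in the text: a strict clock with the same per-lineage synonymous rate r before and after polyploidy. The symbols r, T_div (time since divergence of the diploid progenitors) and T_poly (time since the polyploid event) are introduced here only for illustration.

```latex
\begin{align*}
K_s^{(H1,H2)} &= K_s^{\mathrm{pre}} + K_s^{\mathrm{post}} \\
              &= 2r\,(T_{\mathrm{div}} - T_{\mathrm{poly}}) + 2r\,T_{\mathrm{poly}} \\
              &= 2r\,T_{\mathrm{div}}, \qquad 0 \le T_{\mathrm{poly}} \le T_{\mathrm{div}}
\end{align*}
```

Under these assumptions T_poly cancels from the total, which is why the distance between homoeologues dates the divergence of the progenitors and constrains the polyploid event only as an upper bound.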

In the absence of additional information, there is no minimum estimate for the time of origin from homoeologue distances – the polyploid could have originated at any time after the speciation of the progenitors, up to the present. However, if speciation subsequently occurs in the polyploid lineage, that speciation time marks the minimum date of polyploidy. For example, the divergence of homoeologue pairs in the genome of Glycine species is estimated to be c. 10 million years (MY), and the divergence of the two subgenera of Glycine is about half that (Innes et al., 2008; A. N. Egan & J. J. Doyle, unpublished), giving a range of 5–10 MY for the origin of the polyploid ancestor of modern Glycine (Fig. 8).
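The bracketing logic of this paragraph can be sketched as a small helper. The function name is hypothetical; the Glycine values are the approximate figures quoted above, in millions of years.

```python
def polyploidy_age_bounds(homoeologue_divergence: float,
                          earliest_polyploid_speciation: float = 0.0):
    """Return (minimum, maximum) age of the polyploid event.

    maximum: divergence of the homoeologues, i.e. of the diploid progenitors
    minimum: oldest speciation within the polyploid lineage, or 0 if none is known
    """
    if earliest_polyploid_speciation > homoeologue_divergence:
        raise ValueError(
            "speciation within the polyploid lineage cannot predate progenitor divergence"
        )
    return earliest_polyploid_speciation, homoeologue_divergence


# Approximate Glycine values from the text, in millions of years.
glycine_bounds = polyploidy_age_bounds(homoeologue_divergence=10.0,
                                       earliest_polyploid_speciation=5.0)
print(glycine_bounds)
```

With no known speciation in the polyploid lineage the lower bound defaults to zero, reflecting the statement that the polyploid could have formed at any time up to the present.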

Figure 8.

 Polyploid evolution in Glycine, illustrating key time-points in species (a,b) and gene (c,d) trees and the uncertainty in estimating these dates (MY, million years). The diploid progenitor genomes of Glycine diverged at point (1), which was followed by polyploidy at point (2), which led to the modern chromosome number of 2n = 40 in diploid Glycine species, including soybean; the divergence of Glycine species began at point (3) from an already-polyploid ancestor. The divergence of the now-extinct (‘x’) progenitor genomes occurred about twice as long ago as the divergence between the soybean lineage and the lineage of perennial Glycine species. The polyploidy event could have occurred anywhere between points (1) and (3). In species tree (a) (reflected in gene tree, c), polyploidy occurred very close to the time of divergence of progenitor genomes, possibly within a single species (taxonomic autopolyploidy); close relationship of progenitor genomes could permit pairing among their chromosomes (genetic autopolyploidy) and could lead to segregational loss of parental sequences or homogenization of repeats across parental genomes. In species tree (b) (reflected in gene tree d), polyploidy occurred considerably later than the divergence of parental genomes; parents likely would be differentiated genetically and taxonomically when hybridization occurred (taxonomic allopolyploidy), also leading to disomic pairing (genetic allopolyploidy). H1, H2 are the two homoeologues in Glycine. (a and b modified from Gill et al., 2009)

Estimates of polyploid origin made solely from a polyploid’s genome

When extinction has removed diploid progenitors and their relatives from consideration, inferences must be made from pairs of homoeologues that are sisters in phylogenetic reconstructions (e.g. Fig. 7). But homoeologues are effectively sisters when polyploid genome(s) are studied without reference to diploid taxa (e.g. Blanc & Wolfe, 2004a; Schlueter et al., 2004). The occurrence of a polyploid duplication can be discerned in a species’ genome by analysing the distribution of distances between paralogous copies at many loci. The simultaneous duplication of all loci in the genome should be visible as a ‘peak’ in a plot of the number of gene pairs vs Ks, which otherwise would show only a large number of recent duplicate pairs that are rapidly purged from the genome over time (Lynch & Conery, 2000; Fig. 9). This insight has been exploited in several plant comparative genomics studies, for example in crop species (Blanc & Wolfe, 2004; Schlueter et al., 2004), in ‘basal’ angiosperms (Cui et al., 2006), among Asteraceae (Barker et al., 2008), and across flowering plants (Fawcett et al., 2009).

Figure 9.

 Inferring allopolyploidy from the Ks distribution of paralogue pairs. Each figure shows the distribution of the number or percentage of paralogue pairs of a given synonymous distance (Ks) from one another. (a,b) Plots for the two diploid progenitors (AA and BB) of the polyploid. In each genome there is a large number of very recent duplicates; as time passes, marked by increasing Ks of paralogue pairs, most duplicates have been purged from the genome (Lynch & Conery, 2000). (c) Ks plot of orthologous gene pairs of AA and BB diploids; the orthologue peak gives the relative date of divergence of the two species. (d) Ks plot for the AABB allopolyploid at its time of origin by hybridization of AA and BB followed by chromosome doubling; the polyploid’s Ks plot combines figures (a)–(c), having both recent duplicates from each diploid and the set of orthologous (now homoeologous) gene pairs from the two diploids that constitutes the polyploid ‘peak.’ (e,f) During later stages of polyploid evolution, homoeologues in the polyploid continue to diverge from one another, increasing the synonymous distance between homoeologue pairs; as some genes return to single copy status through elimination of one homoeologue, the peak decreases in size (Cui et al., 2006).

Many paralogous pairs comprise the peak inferred from the Ks distribution. Some are homoeologues, whereas others are paralogues resulting from simple duplications that occurred around the time of the polyploid event. In completely sequenced genomes such as Arabidopsis, rice, or poplar, it is possible to distinguish homoeologues from paralogues of similar age because the former occur in linked blocks (Zhang et al., 2002). Identifying polyploid peaks in a Ks distribution when linkage information is lacking can be very difficult, particularly for older events (Cui et al., 2006).
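The idea of picking a polyploid peak out of a Ks distribution can be sketched with synthetic data. The following is a minimal illustration only: the exponential background, the Gaussian peak, the bin width, the `min_ks` cut-off and all function names are assumptions made for this example, not the procedure of any of the studies cited above, which generally fit more elaborate models.

```python
import random


def simulate_ks_values(n_background, n_peak, background_mean, peak_mean, peak_sd, seed=1):
    """Draw synthetic Ks values: a decaying background plus one polyploidy peak.

    The exponential background and Gaussian peak are illustrative assumptions,
    not shapes fitted to any real genome.
    """
    rng = random.Random(seed)
    background = [rng.expovariate(1.0 / background_mean) for _ in range(n_background)]
    peak = [rng.gauss(peak_mean, peak_sd) for _ in range(n_peak)]
    # Discard the rare negative draws from the Gaussian component.
    return [ks for ks in background + peak if ks >= 0.0]


def histogram(values, bin_width):
    """Count values per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    counts = {}
    for value in values:
        index = int(value / bin_width)
        counts[index] = counts.get(index, 0) + 1
    return counts


def secondary_peak(values, bin_width=0.05, min_ks=0.2):
    """Return the centre of the fullest bin whose lower edge is at or above min_ks.

    min_ks is a hypothetical cut-off used to skip the pile of very recent duplicates.
    """
    counts = histogram(values, bin_width)
    eligible = {i: c for i, c in counts.items() if i * bin_width >= min_ks}
    best = max(eligible, key=eligible.get)
    return (best + 0.5) * bin_width


ks_values = simulate_ks_values(
    n_background=3000, n_peak=2000, background_mean=0.1, peak_mean=0.6, peak_sd=0.1
)
print(f"secondary peak near Ks = {secondary_peak(ks_values):.3f}")
```

Without linkage information, nothing in such a histogram distinguishes homoeologues from single-gene paralogues of similar age; the sketch simply treats every pair in the peak alike.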

A Ks peak provides evidence for the existence of a polyploid event, but it does not give the age of this event, because, as noted above, the distance between homoeologues combines the divergence since polyploidy with the pre-existing divergence between diploid progenitor alleles. The peak therefore marks the mode of Ks values of all of the orthologues from the two diploid progenitors that become homoeologues at the time of hybridization leading to allopolyploidy, and which subsequently continue to diverge from one another (Fig. 9). As noted above for single homoeologue pairs, the date estimated for the Ks peak is thus the maximum age of allopolyploidy (Fig. 8).
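The argument can be made concrete with a small calculation. This is a sketch under stated assumptions: a strict molecular clock, the common convention that the distance between two sequences accumulates along both lineages (Ks = 2 × rate × time), an illustrative rate of 6 × 10⁻⁹ substitutions per synonymous site per year, and round-number dates loosely based on the Glycine example. The function names are hypothetical.

```python
def expected_homoeologue_ks(t_divergence, t_polyploidy, rate):
    """Expected Ks between two homoeologues sampled today.

    t_divergence: years since the diploid progenitor genomes diverged.
    t_polyploidy: years since the genomes were merged by allopolyploidy.
    rate: synonymous substitutions per synonymous site per year (assumed clock).
    """
    if not 0 <= t_polyploidy <= t_divergence:
        raise ValueError("polyploidy cannot predate progenitor divergence")
    # Divergence accumulated while the two genomes were in separate diploids.
    before_merger = 2 * rate * (t_divergence - t_polyploidy)
    # Divergence accumulated after the genomes were united in one nucleus.
    after_merger = 2 * rate * t_polyploidy
    return before_merger + after_merger


def age_from_ks(ks, rate):
    """Convert Ks to years under the same 2 * rate * time convention."""
    return ks / (2 * rate)


RATE = 6e-9  # illustrative value only
for t_poly in (5e6, 9.9e6):
    ks = expected_homoeologue_ks(t_divergence=10e6, t_polyploidy=t_poly, rate=RATE)
    print(f"polyploidy {t_poly / 1e6:.1f} MYA -> Ks = {ks:.3f}, "
          f"dated at {age_from_ks(ks, RATE) / 1e6:.1f} MYA")
```

Both scenarios give the same expected Ks, because the time of the merger cancels out of the sum; the date recovered from the peak is that of progenitor divergence in either case.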

The fact that the mode or median of a Ks peak is not the age of the polyploidy event is unlikely to be recognized by naive readers of the Ks/polyploidy literature. For example, although Table 2 of Blanc & Wolfe (2004) is titled ‘Estimated Ages of the Observed Secondary Peaks’ – which is entirely neutral as to the phenomenon that generated the peaks – the text cites this table as providing ‘absolute dates for the large-scale gene duplications’ (p. 1673). It is not that the authors are unaware of the issue, as is clear from their statement (p. 1672) that ‘...the ancestor of Medicago may have been an allotetraploid resulting from the merger of two diverged genomes (i.e. A and B). In this case, the date of divergence between these two subgenomes... is older than the actual polyploidy event’. However, there is reference throughout the paper to duplication events, and their Fig. 4 (p. 1674) is a phylogeny showing ‘dates of duplication and speciation events’, which a reader might mistakenly assume showed the dates of polyploid events rather than the date of divergence of the polyploid’s progenitors. This is also true elsewhere in the polyploidy literature. Schlueter et al. (2004, p. 869) state that ‘Determination of evolutionary distances between large numbers of pairs of duplicated genes allowed us to estimate when these polyploidization events may have occurred’, although they also state (p. 871) that ‘Coalescence estimates correspond to the point when groups of gene pairs began evolving independently’, which is correct but could be read incorrectly to mean independence of homoeologues from their progenitors rather than from one another. Maere et al. (2005) refer several times to ‘waves of duplication’ represented by peaks in the Ks distribution; in Cui et al. (2006), Ks peaks ‘correspond’ to polyploidy events. Table 1 of Fawcett et al. (2009) gives ‘Timing of WGD (whole genome duplication) events in plants’. By contrast, Barker et al. (2008) avoid all reference to dates of polyploidy when discussing the Ks distributions from which they infer whole-genome duplications in Asteraceae.

Some ambiguity is almost certainly traceable to the terminology of polyploidy. The process by which ‘duplicate’ loci of an allopolyploid arise is very different from the process by which a single gene produces two initially identical paralogues in a diploid genome. An allopolyploid ‘duplication’ is a ‘genome merger’ (Doyle et al., 2008) in which the duplicates (homoeologues) are already diverged from one another. As noted above, the term ‘homoeologue’ was initially coined to describe chromosomes in different diploid species (Huskins, 1932). The adoption of this term for genes as opposed to chromosomes, coupled with the differences between polyploid and nonpolyploid duplications, has led to the terminological ambiguity among homoeologue, paralogue and orthologue discussed earlier.

How significant is this issue? In the example of Glycine, the problem of dating from homoeologues produces a twofold range of possible dates of the polyploid event, and this clearly limits what can be said about the pace at which polyploid evolution has occurred in the genus. Could this range be narrowed if it could be assumed that the likelihood of hybridization (and hence allopolyploid formation) decreases with increasing time since divergence of diploid taxa? In Glycine, allopolyploids that formed within the last 50 000 yr combine the genomes of species estimated to have diverged c. 4 MYA and which belong to different ‘genome groups’ based on their inability to form fertile hybrids (Doyle et al., 2004b; unpublished data). Therefore it would presumably have been possible for the earlier polyploidy event to have occurred at nearly any time during the 5 MY period between homoeologue divergence and the radiation of extant Glycine species. If it is generally true that the propensity to form polyploid, rather than homoploid, hybrids increases with increasing genetic distance (Chapman & Burke, 2007), then homoeologue (progenitor species) divergence time will often seriously overestimate the date of allopolyploid origins.

Over time the contribution of this source of ambiguity will decrease as a proportion of the total range of divergence times. The ambiguity between homoeologue divergence and the actual date of polyploidy in Glycine leads to placing the polyploidy event between 5 MYA and 10 MYA; 10 MY from now the date will lie between 15 MYA and 20 MYA, a much narrower range in relative terms than the current twofold range of possible dates.
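The arithmetic behind this point is simple enough to write out. In the sketch below, the function name and its arguments are introduced only for illustration; the dates are the Glycine bounds given above.

```python
def fold_range(youngest_mya, oldest_mya, elapsed_my=0.0):
    """Ratio of the oldest to the youngest possible date after elapsed_my more years.

    The absolute width of the interval stays fixed; only its size relative
    to the age of the event shrinks as time passes.
    """
    return (oldest_mya + elapsed_my) / (youngest_mya + elapsed_my)


for elapsed in (0, 10, 50):
    print(f"{elapsed:>2} MY from now: {fold_range(5, 10, elapsed):.2f}-fold range")
```

The 5 MY of absolute uncertainty never disappears; it simply becomes a smaller fraction of the age being estimated.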

Homoeologue divergence times and diploidization in genetic autopolyploids

Estimating ages from homoeologues is still more complex for polyploids that begin their existence as genetic autopolyploids. It is generally assumed that genetic autopolyploids eventually shift their pattern of inheritance from tetrasomic to disomic as part of the process of ‘diploidization,’ though whether this is always true remains an open question and one that may be impossible to answer with certainty (Doyle et al., 2008). It is not known whether ancient polyploid events in species whose genomes have been sequenced, such as Arabidopsis or soybean (Glycine max), were originally autopolyploid or allopolyploid (Gill et al., 2009; Shoemaker et al., 2006), but at present these species are diploidized, with well-defined homoeologous blocks of genes (Innes et al., 2008; Thomas et al., 2006). If these species originated as genetic autopolyploids, different homoeologous loci may yield different synonymous distances because of random segregational fixation at different loci (Gaut & Doebley, 1997). This is best seen in a polyploid that originates as a genetic autopolyploid but was formed by hybridization between two diverged diploid species (i.e. a taxonomic allopolyploid). This situation, where diploid progenitors are extant, was discussed above (Fig. 6).

The potential for variance in estimates increases after genetic diploidization occurs, and the four tetrasomically segregating alleles are partitioned into two pairs of disomically segregating homoeologues. Figure 10 illustrates the consequences of this process at three different loci, each of which retains alleles with different coalescent times because of random segregation during the period of tetrasomy. The key point here is that none of these homoeologues will necessarily coalesce at the time of polyploid formation – there is nothing about that point that starts a new ‘divergence clock’. Some homoeologues will coalesce before polyploidy, at the time of divergence between alleles in the diploid progenitors (Fig. 10, locus 3), whereas other homoeologue pairs will coalesce at various points between the time of polyploidy and the onset of disomy (loci 1 and 2). Only a locus like locus 1, which was homozygous at the time of diploidization, ‘reflects roughly the time of onset of disomic inheritance’ (Gaut & Doebley, 1997, p. 6809).

Figure 10.

 Variation in coalescent time of alleles sampled from homoeologues at three loci in a plant that is initially a genetic autopolyploid. Evolution of three loci is tracked over times (a–d). In this simple case, the genetic autopolyploid is a taxonomic allopolyploid, whose alleles at each locus originate from two diploid progenitors, each of which is homozygous at all three loci. (a) Diploid orthologues coalesce in allele trees at time t_o; the alleles at each locus are brought together by polyploidy at time t_p. For convenience the coalescent time is shown as being the same for each locus, but for polyploidy involving closely related taxa in which alleles were not reciprocally monophyletic, or where the progenitors were two genotypes of a single taxon, various coalescent times would be expected for alleles sampled randomly from different loci. (b) At a later time, divergence creates allelic variation at each locus in the polyploid. (c) Still later, tetrasomic inheritance has led to random segregational loss of alleles, with deeper coalescing alleles from one of the two parents having been lost at loci 1 and 2, whereas the deepest coalescence at locus 3 is the orthologue divergence in the diploid progenitors (t_o). (d) At time t_d, the genetic system at all three loci simultaneously shifts from tetrasomic to disomic, creating two homoeologues (H1 and H2) at each locus, and fixing variation at these two loci; the plant is now a genetic allopolyploid, although it originated as a genetic autopolyploid. Disomic segregation can reduce variation within H1 and H2 at any of the three loci, but not between homoeologues at a given locus.
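The random segregational loss illustrated in Fig. 10 can be mimicked with a very simple drift simulation. This is a rough sketch, not the model of Gaut & Doebley (1997): it assumes neutral Wright-Fisher sampling, a constant pool of freely exchangeable gene copies at each locus during the tetrasomic period, unlinked loci, and an initial equal mix of copies from the two progenitors. The function name and the parameter values are hypothetical.

```python
import random


def fraction_losing_a_parent(n_loci, n_copies, generations, seed=0):
    """Fraction of loci at which every gene copy descends from one progenitor.

    Simplifying assumptions (not from the source): neutral drift, a constant
    pool of n_copies exchangeable gene copies per locus under tetrasomy,
    unlinked loci, and a 50:50 starting mix from the two progenitors.
    """
    rng = random.Random(seed)
    lost = 0
    for _locus in range(n_loci):
        from_parent_a = n_copies // 2
        for _generation in range(generations):
            if from_parent_a in (0, n_copies):
                break  # one progenitor's copies are already gone
            p = from_parent_a / n_copies
            # Binomial resampling of the next generation's gene copies.
            from_parent_a = sum(rng.random() < p for _copy in range(n_copies))
        if from_parent_a in (0, n_copies):
            lost += 1
    return lost / n_loci


for generations in (0, 10, 50, 2000):
    share = fraction_losing_a_parent(n_loci=200, n_copies=20, generations=generations)
    print(f"{generations:>4} generations of tetrasomy: {share:.2f} of loci retain one parent only")
```

Loci that keep copies from both progenitors until disomy begins can still coalesce at the progenitor divergence (as at locus 3), whereas loci that have lost one progenitor's copies (as at loci 1 and 2) can only coalesce more recently. The longer the tetrasomic period relative to the number of gene copies, the fewer loci of the first kind remain.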

Yet another complication is possible for genetic autopolyploids. As Gaut & Doebley (1997) point out (p. 6809):

‘The switch from tetrasomic to disomic inheritance is the key variable in this model. If the switch is coordinated among chromosomes, then all pairs of duplicated sequences are expected to coalesce at roughly the same time… If the switch from tetrasomic to disomic inheritance is not well coordinated among chromosomes, then coalescent times could be scattered over a broad range.’

This would not affect the deepest possible coalescences (e.g. locus 3 in Fig. 10) directly, though a change in the time between polyploidy and onset of disomy would increase or decrease the probability of losing allelic variation at all loci, including these deeply coalescing ones. The primary effect of variation in the onset of disomy at different loci would be on the minimum estimate of polyploid age. Polyploids with tetrasomic inheritance at some loci and disomic inheritance at others are termed ‘segmental allopolyploids.’

Methodological issues: estimating dates

Throughout this discussion it has been assumed that distances can be estimated accurately, and that dates can be inferred from them reliably. Anyone familiar with the molecular systematics literature knows that these assumptions are, at best, dangerous. In the field of polyploid evolution, it has been observed that loci with homoeologous gene pairs tracing back to the same polyploidy event differ in Ks by as much as 13-fold in Arabidopsis (Zhang et al., 2002). What, then, is the date of the event that led to homoeologue divergence in this species? Similar levels of rate variation have been observed in other studies (e.g. in soybean; Schlueter et al., 2007; Innes et al., 2008). Using a molecular clock approach on individual gene pairs under these circumstances would lead to wildly different estimates. The solution adopted by Ks/polyploidy studies is to use the distribution of paralogue pairs (as in Fig. 9), assuming that the mode of this distribution captures a genome-wide average, effectively smoothing out variation caused by differences among genes. However, translating distances to ages using this method still requires knowledge of the global synonymous substitution rate; Blanc & Wolfe (2004) and Schlueter et al. (2004) used very different rates, and thus reported very different polyploidy dates for peaks with nearly identical Ks modes in Arabidopsis and soybean. Similarly, when rate variation occurs among lineages involving many genes, problems occur when the same synonymous substitution rate is used to estimate homoeologue divergence times from a Ks peak mode. For example, Ks peaks with different modal values were observed in two legumes, Glycine and Medicago, leading to estimates of 42 MY in the former and 55 MY in the latter by Schlueter et al. (2004). However, phylogenetic testing of numerous gene families supported a single shared polyploid event in the common ancestor of these species (Pfeil et al., 2005; Cannon et al., 2006), so the differing estimates reflect rate variation between the two lineages rather than separate events. Such problems led Fawcett et al. (2009) to use one of several available nonclock dating methods, penalized likelihood (PL: Sanderson, 2002), which required the construction of numerous gene phylogenies. However, using such methods is no guarantee that dates will be estimated with precision, let alone accuracy. Dates estimated using PL and Bayesian methods from gene pairs sequenced from two c. 1 Mb homoeologous segments in soybean and allies (Innes et al., 2008) still varied nearly as much as when simpler molecular clock methods were used (A. N. Egan & J. J. Doyle, unpublished).
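The dependence of a date on the assumed rate is easy to show numerically. In the sketch below, the Ks mode and both rates are illustrative values chosen for this example, not the values used in any particular study, and the conversion assumes a strict clock with divergence accumulating along both lineages.

```python
def ks_to_age_my(ks, rate_per_site_per_year):
    """Convert a Ks peak mode to an age in MY, assuming Ks = 2 * rate * time."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


# Hypothetical inputs: one Ks mode interpreted under two assumed rates.
KS_MODE = 0.6
ILLUSTRATIVE_RATES = {"slower rate": 6.0e-9, "faster rate": 1.5e-8}

for label, rate in ILLUSTRATIVE_RATES.items():
    print(f"{label} ({rate:.1e}): Ks mode {KS_MODE} -> {ks_to_age_my(KS_MODE, rate):.0f} MY")
```

A 2.5-fold difference in the assumed rate translates directly into a 2.5-fold difference in the inferred date, even though the underlying Ks mode is identical.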

When dating recent polyploidy events, a significant problem is accurate estimation of very low divergences from single gene sequences. Given rates of c. 6 × 10⁻⁹ synonymous substitutions per synonymous site per year (see Fawcett et al., 2009 for rates used in recent studies of polyploidy), it would take around half a million years for a pair of alleles with 300 synonymous sites to average one synonymous difference. Thus, during the early history of a polyploid, most homoeologues will have an observed divergence of zero substitutions between a polyploid allele and the diploid progenitor allele. To discriminate between small distances with statistical confidence requires a larger number of synonymous sites, because the sampling variance is inversely proportional to that number. It is becoming increasingly feasible to sequence large numbers of sites as high-throughput sequencing methods become more widely available, but for recombining sequences (nuclear genes) this strategy means bringing together genes with potentially different coalescent histories. The degree to which the use of linked genes can circumvent this problem is limited by the extent of linkage disequilibrium in the taxa being studied. In practice, then, phylogenomic approaches to recent polyploid origin date estimates will be affected by the issue of standing variation.
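These orders of magnitude can be checked with a short calculation. The sketch below assumes that substitutions follow a Poisson process at a constant rate; the `lineages` parameter and all function names are introduced here for illustration. Counting substitutions along one lineage gives a waiting time of roughly 556 000 yr, which matches the figure of around half a million years; if differences are taken to accumulate along both lineages separating the two alleles, the waiting time halves.

```python
import math


def expected_differences(age_years, rate, n_sites, lineages=2):
    """Expected number of synonymous differences between two alleles."""
    return lineages * rate * age_years * n_sites


def waiting_time_for_one_difference(rate, n_sites, lineages=2):
    """Years needed for the expected number of differences to reach one."""
    return 1.0 / (lineages * rate * n_sites)


def prob_no_differences(age_years, rate, n_sites, lineages=2):
    """Poisson probability that two alleles show no synonymous difference."""
    return math.exp(-expected_differences(age_years, rate, n_sites, lineages))


def distance_sampling_sd(ks, n_sites):
    """Approximate standard deviation of an estimated small distance.

    Under the Poisson assumption the variance is roughly ks / n_sites,
    i.e. inversely proportional to the number of sites.
    """
    return math.sqrt(ks / n_sites)


RATE, SITES = 6e-9, 300  # rate from the text; 300 sites as in the text's example
print(f"one lineage:  {waiting_time_for_one_difference(RATE, SITES, lineages=1):,.0f} yr")
print(f"two lineages: {waiting_time_for_one_difference(RATE, SITES, lineages=2):,.0f} yr")
print(f"P(no differences) after 100 000 yr: {prob_no_differences(1e5, RATE, SITES):.2f}")
```

Quadrupling the number of synonymous sites halves the standard deviation of the estimated distance, which is why discriminating among very recent dates requires many more sites than a single gene provides.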

Limitations of the conventional molecular dating approaches to polyploid origins were noted by Jakobsson et al. (2006), who described a Bayesian coalescent approach for dating the origin of the allopolyploid, Arabidopsis suecica. This alternative to traditional procedures is in keeping with the general trend toward coalescent methods for addressing species-level questions in systematic biology (Edwards, 2009). However, their methods produce estimates for a single origin of this polyploid that range from 12 000 to 300 000 yr, and they note that these estimates are strongly dependent on assumptions about mutation rates and rates of population expansion following polyploid origins – parameters that are even more difficult to estimate in most species than in Arabidopsis.


It is wise to be skeptical about any divergence date based on molecular data (Graur & Martin, 2004), and in the case of polyploids all of the usual technical concerns apply. Furthermore, the diversity of mechanisms that can lead to polyploid formation (e.g. instantaneous doubling vs stepwise, through a triploid intermediate), coupled with the different segregational systems of polyploids (disomic vs tetrasomic), makes it difficult to know what biological event is being measured when the divergence time of a pair of sequences is estimated.

Disagreement among polyploid age estimates derived from multiple loci is another manifestation of the gene tree/species tree problem best known in the arena of phylogeny reconstruction (Pamilo & Nei, 1988). Divergence dates inferred from a gene tree are indirect estimates of the divergence times in the species tree (or network, in the case of taxonomic allopolyploids). For estimating the age of a polyploid relative to one of its extant diploid progenitors using neutrally evolving loci, standing genetic variation in the diploid at the time of origin, coupled with sampling of the diploid, can lead to significant disagreements among loci. For older polyploids, particularly when estimates are made from comparisons of homoeologue pairs in the polyploid, variation among loci may result from true rate differences among loci (Zhang et al., 2002), but the meaning of the dates themselves varies with the genetic system of the polyploid and with the histories of individual loci.

Polyploid speciation is a particular case of sympatric speciation, and estimating speciation times for polyploids is no different from estimating diploid speciation times: in the simplest case, at the time of formation, a founder individual has alleles at each locus with a zero distance from alleles at the orthologous locus in the single progenitor individual (diploid or taxonomic autopolyploid speciation) or individuals from two progenitor species (taxonomic allopolyploid speciation). With time and concomitant divergence, identifying the actual progenitor allele will be difficult for both the diploid and the polyploid cases. The allele tree/network for a polyploid and its progenitor(s) would also look very similar to the early stages of diploid allopatric speciation (Fig. 2). Indeed, diploid speciation by a single founder individual involving a common allele from the source population would produce a pattern identical to that of a polyploid speciation event involving a single individual. Similarities also exist between allopatric speciation involving larger numbers of individuals and polyploid speciation with multiple origins, and in both cases allele trees/networks would be produced in which alleles sampled from the two taxa are interspersed. In the case of diploid allopatric speciation, however, not all alleles in each new species need have a perfect match in the other, and in cases where they do not, the age estimated from the closest allele pair will be older than the speciation event. In the case of a polyploid, all alleles initially had a zero distance from an allele in the diploid progenitor. This difference may not be of practical significance, however, because of the uncertainty involved in sampling the standing variation in the diploid progenitor of a polyploid species (Fig. 3) and because of technical problems with estimating distances.

For older polyploid species, the date of origin may simply be unknowable owing to extinction of diploid progenitors (Fig. 7), despite its importance for testing evolutionary hypotheses (Fawcett et al., 2009). This is no different than the situation in a group of diploid species, where times of divergence and diversification of extant taxa can be estimated, but dates of actual speciation events cannot.

As molecular phylogenetics has moved from a reliance on single genes to a more phylogenomic approach, and as analyses and theory have become more rigorous, incongruence has become increasingly routine, whether among gene trees themselves or among the species hypotheses and key dates derived from them. As with any other inference about events that cannot be observed directly, there is uncertainty associated with the date of origin of a polyploid. Caveat emptor.


The authors acknowledge funding from the US National Science Foundation (grants DEB-0516673 and DBI-0822258). Two anonymous reviewers provided helpful comments. We thank members of the L. H. Bailey Hortorium molecular systematics discussion group for productive discussions.