A population genetic model to infer allotetraploid speciation and long-term evolution applied to two yarrow species


  • Yan-Ping Guo,

    1. Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, and College of Life Sciences, Beijing Normal University, Beijing, China
  • Xiao-Yuan Tong,

    1. Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, and College of Life Sciences, Beijing Normal University, Beijing, China
  • Lan-Wei Wang,

    1. College of Life Sciences, Henan University, Kaifeng, China
  • Claus Vogl

    Corresponding author
    1. Institute of Animal Breeding and Genetics, University of Veterinary Medicine, Vienna, Austria
    2. Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, and College of Life Sciences, Beijing Normal University, Beijing, China

Author for correspondence:

Claus Vogl

Tel: +43 1 250775631

Email: claus.vogl@vetmeduni.ac.at


  • Allotetraploid speciation, that is, the generation of a hybrid tetraploid species from two diploid species, and the long-term evolution of tetraploid populations and species are important in plants. We developed a population genetic model to infer population genetic parameters of tetraploid populations from data of the progenitor and descendant species.
  • Two yarrow species, Achillea alpina-4x and A. wilsoniana-4x, arose by allotetraploidization from the diploid progenitors, A. acuminata-2x and A. asiatica-2x. Yet, the population genetic process has not been studied in detail. We applied the model to sequences of three nuclear genes in populations of the four yarrow species and compared their pattern of variability with that in four plastid regions.
  • The plastid data indicated that the two tetraploid species probably originated from multiple independent allopolyploidization events and have accumulated many mutations since. With the nuclear data, we found a low rate of homeologous recombination or gene conversion and a reduction in diversity relative to the level of both diploid species combined.
  • The present analysis with a novel probabilistic model suggests a genetic bottleneck during tetraploid speciation, that the two tetraploid species have a long evolutionary history, and that they have a small amount of genetic exchange between the homeologous genomes.


Allopolyploid speciation is a process by which two divergent species hybridize and produce a polyploid species carrying full copies of the genomes of both progenitors (Grant, 1981; Stebbins, 1985; Otto, 2007). Subsequently, there can be long-term evolution of the two diploid homeologous genomes within the same nucleus, entailing genomic rearrangements, mutations and epigenetic and other regulatory changes in gene expression (Soltis et al., 2003, 2010; Adams & Wendel, 2005; Paun et al., 2010; Wendel et al., 2010). Selection during this phase of evolution must be intense, as hybrids of early generations are usually relatively unfit, but later often evolve to out-compete their parental species and spread to occupy new ecological niches (Grant, 1981; Chang et al., 2010; Schmickl & Koch, 2011).

The initial stages of the evolution of homeologous genomes may be tractable to experimental analysis with synthetic allopolyploids (Song et al., 1995; Kashkush et al., 2002; Madlung et al., 2002; Adams et al., 2004) or may be relatively accessible in recent natural allopolyploid species, for example of Tragopogon (reviewed in Soltis et al., 2004; Buggs et al., 2009; Chester et al., 2012; etc.), Spartina (Ainouche et al., 2004; Salmon et al., 2005; etc.) and Senecio (Abbott & Lowe, 2004; Hegarty et al., 2005). However, evolutionary processes necessarily take a long time and it is thus hard to study the accumulation of mutations, which may render one or the other duplicate copy of a gene dysfunctional (Chang et al., 2010), and recombination or gene conversion, which may lead to mixing of two homeologous gene copies (Gaeta & Pires, 2010; Salmon et al., 2010). As a result, little is known about the long-term evolution of natural populations of the many allopolyploid plant species.

In East Asia, there are two morphologically and genetically distinct and phylogenetically distant diploid yarrow species, Achillea acuminata-2x and A. asiatica-2x. There are also two tetraploid species, A. alpina-4x and A. wilsoniana-4x, intermediate between the diploids. In earlier work, we used amplified fragment length polymorphism (AFLP) data to identify A. acuminata-2x and A. asiatica-2x as the diploid progenitors of both A. alpina-4x and A. wilsoniana-4x (Guo et al., 2006). The allotetraploids form a species complex that has spread far beyond the range of the parental species (Shih & Fu, 1983; Meusel et al., 1992; Guo et al., 2005, 2006).

In our previous work, we assumed a coarse model of strictly disomic inheritance, that is, independent inheritance of two intact sets of progenitor chromosomes. The model ignores a number of known population genetic processes and forces. First, genomic rearrangements are expected in allotetraploids. Second, recombination or gene conversion may have led to the mixing of the genetic material (Lim et al., 2008; Gaeta & Pires, 2010; Kelly et al., 2010; Salmon et al., 2010). Finally, the genetic variability detected in the tetraploid populations seems to be below that of the two diploid populations combined (although this inference was made difficult by the dominant nature of the AFLP data), probably as a result of founder effects, that is, genetic bottlenecks during allopolyploid speciation. We now present a more sophisticated probabilistic genetic model of allopolyploid evolution that can account for these three phenomena. We apply it to nuclear sequences from populations of the two diploid progenitor and the descendant tetraploid populations. The new method for the analysis of allopolyploid evolution could be applied to other available datasets, thus helping to fill the gap in our understanding of polyploid hybridization and the long-term effects of allopolyploid evolution.

Materials and Methods

Probabilistic model for allopolyploid evolution

The population genetics of allotetraploid evolution in the nuclear genome are inferred on the basis of a probabilistic model. A brief verbal description is given here, with the formulae and technical details reserved for Supporting Information Notes S1.

Data include aligned haplotype sequences from homologous loci of individuals of both diploid species and the allotetraploid descendant. Homologous base pairs may show single nucleotide polymorphisms (SNPs) within or among species, or both. The model thus explicitly incorporates the possibility of incomplete lineage sorting among species. The time since the tetraploid speciation event is assumed to be relatively short, so that new mutations can be ignored (although the model could easily be extended to overcome this restriction).

The simplest model of inheritance for allotetraploids is disomic inheritance: the two diploid genomes are independently inherited within the allotetraploid nucleus. With perfectly disomic inheritance, each allotetraploid chromosome, and thus each allotetraploid haplotype in the dataset, would derive entirely from one or the other diploid species. The demographies of the two diploid populations may differ before the allopolyploidization event, but, thereafter, the two diploid homeologous genomes coexist in the same nucleus and thus necessarily share the same demography. There is likely to be a bottleneck event during or shortly after the polyploidization event; this may also be termed a founder effect. Hence, the genetic diversity of sequences within the allotetraploids is expected to be below that of the corresponding diploid populations. Furthermore, from the time of polyploidization, the now diploid and tetraploid populations will evolve independently and thus differentiate further. The population genetic force responsible for the differentiation is genetic drift, the random walk of allele frequencies in finite populations, which eventually leads to the stochastic loss of variation. Modeling of such population differentiation is straightforward in population genetics and is usually parametrized with Wright's FST or similar.

Complications may arise as a result of genetic exchange between homeologous chromosomes within the nuclei of the allotetraploids. This will lead to the mixing of diploid material on the same chromosome; and to deviations from a 1 : 1 proportion of genetic material from the two diploid parental species either by chance or because of selective differences between the two diploid genomes. Backcrossing to one or the other parent can also lead to a shift of the parental contributions. In our model, we treat mixing in a similar manner to the way in which recombination is handled with mapping, for example in the context of quantitative trait loci (QTL) mapping (Lander & Green, 1987) and population admixture mapping (Falush et al., 2003).

At each variable site, we assume biallelic loci, arbitrarily coded as zero and one, and ignore insertion–deletion polymorphism. Generalization to multi-allelic loci is straightforward and could easily be incorporated into the model. In the diploid progenitor populations, we assume linkage equilibrium, that is, the random mixing of gametes. In the tetraploids, we also assume linkage equilibrium within material from the same diploid ancestor (as in the diploids), but not between genetic material from different diploid ancestors, as this would require frequent recombination or gene conversion among homeologous chromosomes, which we assume to be rare. The present model does not take into account artifacts such as ‘PCR recombination’ (which could also be incorporated into an extended model). Furthermore, we assume that each haplotype in the tetraploid is independent of all others, as would be the case if the tetraploid populations grew rapidly and genetic drift within the populations is absent. If this assumption is violated, uncertainty of the estimates is underestimated because drift leads to repeated sampling of the same gene and therefore decreases the effective sample size.

Specifically, the method of inference employs a Markov chain Monte Carlo (MCMC; Gelman et al., 1995) algorithm by iterating through rounds of updating of the parameters given in Table 1. Figure 1 shows the conditional dependences of the parameters on one another and the data. The arrowhead indicates a variable, whose conditional distribution depends on the variable from which the arrow originates. The various steps in the cycle are as follows:

Table 1. A list of variables and their interpretation
x Tensor of allele frequencies in the diploid populations
π Matrix of allelic proportions in the diploid populations
l Index of the nucleotide or base pair, 1 ≤ l ≤ L
n Index of the diploid population, 1 ≤ n ≤ N
i Index of the individual in the diploid population, 1 ≤ i ≤ I
y Matrix of allele frequencies in the tetraploid population
φ Vector of allelic proportions in the tetraploid population
ω Drift coefficient in the tetraploid population
z Vector of indicator variables for each nucleotide of the diploid population of origin
ρ Scaled recombination rate
γ Vector of proportions of the diploid population of origin in the tetraploid population
Figure 1.

Directed graph showing the dependences among the variables and data. An arrow indicates that the distribution of the variable pointed to is conditional on the variable from which the arrow originates. A double-pointed arrow indicates that the conditional distributions of both variables depend on each other.

  1. The conditional distribution of the population allelic proportions of each of the diploid populations π is calculated conditionally on the observed allele frequencies x. This is a beta distribution (Formula 1 in Notes S1).
  2. For the tetraploid allelic proportions φ, we need to distinguish between the diploid populations of origin. Hence, we introduce the auxiliary variable z, which indicates the diploid population of origin for each individual and polymorphic site. Furthermore, we assume that the tetraploid allelic material originates from all diploid populations, but that the allelic proportions in the tetraploid population φ differ from the corresponding π through genetic drift. Although it is possible to model splitting populations, it is much more convenient to model genetic drift using an island model (Wright, 1931). The symbol FST is usually used for this parameter, but we prefer ω as it is shorter. We note that, in the infinite island model, FST = 1/(1 + 4Nem), where Ne is the effective population size and m is the migration rate per generation, from which it follows that 4Nem = (1 − ω)/ω. The allelic distributions of a splitting and a migration model are quite similar, such that we choose the latter for convenience. We note that the conditional distribution of φ, again a beta distribution, depends on the diploid allelic proportions π and the genetic drift coefficient ω, even before observing any of the polyploid allele frequency data y (Formula 2 in Notes S1). The likelihood of y, a binomial, depends on φ (Formula 3 in Notes S1). Hence, the conditional distribution of the allelic proportions φ depends on π, ω and y (Formula 4 in Notes S1). This distribution is again beta, as this is the conjugate prior distribution for a binomial likelihood.
  3. The drift coefficient ω depends on both the allelic proportions π in the diploid populations and φ in the polyploid populations. Its distribution is nonstandard (Formula 5 in Notes S1), such that sampling ω involves a Metropolis step instead of a Gibbs step, as detailed in Notes S1.
  4. The prior distribution of the proportions of the assignment of the polyploid material to the diploid populations γ is assumed to be Dirichlet distributed (Formula 6 in Notes S1). Its posterior depends on the linkage structure, which is captured in the auxiliary variable z. We use a hidden Markov model (HMM) to model linkage (see Notes S1). We allow for ‘recombination’ between the material of the two diploid progenitor populations at a rate ρ. Recombination may be caused by either meiotic recombination or gene conversion. Neighboring bases may have a different population genetic origin. The probability of such switches is given by the transition matrix (Formula 7 in Notes S1). To obtain the posterior distribution of z, we model the linkage structure with a forward–backward algorithm (Durbin et al., 1998; Vogl & Futschik, 2010), conditional on γ, the recombination rate ρ, the polyploid data y and the polyploid allelic proportions φ.
  5. The posterior distribution of the proportions of the assignment of the polyploid material to the diploid populations γ is a Dirichlet distribution that depends on the number of switches between the diploid genetic material obtained with the indicator variable z (Formula 8 in Notes S1). Similarly, the conditional posterior distribution of the recombination rate ρ is a beta distribution that also depends on the number of switches.
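
The conjugate update of step 2 and the forward–backward pass of step 4 can be illustrated with a minimal Python sketch. The formulae themselves are in Notes S1 and are not reproduced here, so the parametrization below is an assumption: the prior on φ is taken as Beta(θπ, θ(1 − π)) with θ = (1 − ω)/ω, and the probability of moving from origin a to origin b between neighboring sites is taken as (1 − ρ)·[a = b] + ρ·γ_b, as in common admixture linkage models. All function names are hypothetical.

```python
import random

def phi_posterior_params(pi, omega, k, n):
    """Beta parameters for phi at one site and one progenitor.
    Assumed prior: Beta(theta*pi, theta*(1-pi)), theta = (1-omega)/omega.
    k of the n tetraploid haplotypes assigned to this progenitor carry allele 1."""
    theta = (1.0 - omega) / omega
    return theta * pi + k, theta * (1.0 - pi) + (n - k)

def sample_phi(pi, omega, k, n):
    """One Gibbs draw of phi from its conditional beta distribution."""
    a, b = phi_posterior_params(pi, omega, k, n)
    return random.betavariate(a, b)

def forward_backward(y, phi, gamma, rho):
    """Posterior P(z_l = s | y) along one tetraploid haplotype.
    y[l]: observed allele (0/1); phi[l][s]: P(allele 1 | origin s) at site l;
    gamma[s]: proportion of origin s; rho: per-site switch probability."""
    n_states, n_sites = len(gamma), len(y)

    def emit(l, s):
        return phi[l][s] if y[l] == 1 else 1.0 - phi[l][s]

    def trans(a, b):  # assumed transition matrix, see lead-in
        return (1.0 - rho) * (a == b) + rho * gamma[b]

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every site to avoid underflow.
    fwd = [normalize([gamma[s] * emit(0, s) for s in range(n_states)])]
    for l in range(1, n_sites):
        prev = fwd[-1]
        fwd.append(normalize([
            emit(l, b) * sum(prev[a] * trans(a, b) for a in range(n_states))
            for b in range(n_states)]))

    # Backward pass, also rescaled.
    bwd = [[1.0] * n_states for _ in range(n_sites)]
    for l in range(n_sites - 2, -1, -1):
        bwd[l] = normalize([
            sum(trans(a, b) * emit(l + 1, b) * bwd[l + 1][b]
                for b in range(n_states))
            for a in range(n_states)])

    # Site-wise posterior of the origin indicator z.
    return [normalize([fwd[l][s] * bwd[l][s] for s in range(n_states)])
            for l in range(n_sites)]
```

In a full sampler the marginals returned by forward_backward would be replaced by a joint draw of z (forward filtering, backward sampling), after which the switches in z are counted to update γ and ρ as in step 5.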

Plant sampling

Nineteen populations of two diploid (A. acuminata and A. asiatica) and two tetraploid (A. alpina and A. wilsoniana) species were studied (Table 2). The plastid data were obtained from 104 individuals of the sampled populations and the nuclear genes were cloned and sequenced from a subset of c. 50 individuals of 16 of the 19 populations.

Table 2. Population sampling
| Species | Pop. code | Ploidy | Locality | Vouchers |
|---|---|---|---|---|
| Achillea acuminata | ARX2 | 2x | Arxan (Xing-an Mts.), Inner Mongolia, China: 47°17′39″N, 120°27′09″E; 865 m | GYR 2007.08.18 |
| A. acuminata | CB1 | 2x | Changbai Mt., China: 42°28′09″N, 128°08′50″E; 620–690 m | YPG & GYR 2002.07.24 |
| A. acuminata | TB5 | 2x | Taibai Mt., China: 34°01′17″N, 107°18′21″E; 1700 m | YPG & XYT 2006.09.09 |
| A. acuminata | YC1* | 2x | Yichun (Xing-an Mts.), Heilongjiang, China: 47°43′32.4″N, 128°51′06.2″E; 234 m | GYR 2007.08.12 |
| A. asiatica | AL | 2x | Altai Mt., Russia: 51°02′52″N, 85°36′47″E; 1100 m | MS 2002.07.30 |
| A. asiatica | ARX3 | 2x | Arxan (Xing-an Mts.), China: 47°17′39″N, 120°27′11″E; 1130 m | GYR 2007.08.18 |
| A. asiatica | NM | 2x | Daqing Mt., Inner Mongolia, China: 41°04′52″N, 112°35′56″E; 2010 m | GYR 2006.08.25 |
| A. asiatica | SHB* | 2x | Hebei, China: 42°26′N, 117°15′E; 1500 m | YPG 2007.07.27 |
| A. asiatica | XJ | 2x | Xinjiang, China: 43°45′52″N, 87°48′63″E; 1550 m | DYT 2002.06.30 |
| A. alpina | ARX1 | 4x | Arxan (Xing-an Mts.), Inner Mongolia, China: 47°17′N, 120°27′E; 860–1100 m | YPG 2007.10.08 |
| A. alpina | CB2 | 4x | Changbai Mt., China: 42°28′09″N, 128°08′50″E; 700–780 m | YPG & GYR 2002.07.24 |
| A. alpina | NM4 | 4x | Daqing Mt., Inner Mongolia, China: 41°04′52″N, 112°35′56″E; 2000 m | GYR 2006.08.25 |
| A. alpina | NM5 | 4x | Inner Mongolia, China: 40°43′29.4″N, 109°24′11.8″E; 1870 m | GYR 2006.08.27 |
| A. alpina | WT | 4x | Wutai Mt., China: 39°06′N, 113°34′E; 1820–2305 m | JXM 2006.07.31 |
| A. alpina | YC2* | 4x | Yichun (Xing-an Mts.), Heilongjiang, China: 47°44′26.2″N, 128°50′55.4″E; 265 m; 47°43′32.4″N, 128°51′06.2″E; 234 m | GYR 2007.08.12 |
| A. wilsoniana | HX | 4x | Huixian, Gansu, China: 33°59′56″N, 107°09′10″E; 2000 m | YPG & GYR 2003.08.27 |
| A. wilsoniana | GZ | 4x | Guizhou, China: 28°51′31″N, 107°25′10″E; 1750 m | ZYL 2001.08.06 |
| A. wilsoniana | ZD | 4x | Zhongdian, Yunnan, China: 27°51′36″N, 99°42′21″E; 3300 m; 27°48′04″N, 99°43′24″E; 3370 m | YPG & GYR 2006.08.01/03 |
| A. wilsoniana | TB | 4x | Taibai Mt., China: 34°01′17″N, 107°18′21″E; 1600–1720 m | YPG & YR 2006.09.01/09 |

Names of collectors: DYT, Dun-Yan Tan; GYR, G-Y. Rao; JXM, Jin-Xiu Ma; MS, M. Staudinger; XYT, Xiao-Yuan Tong; YPG, Y-P. Guo; YR, Yi Ren; ZYL, Zhen-Yu Liu. Populations marked with ‘*’ were sequenced only at the plastid loci. All vouchers are deposited in the herbarium of Beijing Normal University (BNU).

Chromosome numbers or DNA ploidy levels were checked for all the populations and individuals analyzed (Table 2). Usually one individual per population was checked by chromosome counting, with the remainder verified by flow cytometry. Chromosome counting was performed using young flower buds collected in the field and fixed in Carnoy's fluid (ethanol : acetic acid, 3 : 1). The fixed flower buds were stained and squashed in 4% acetocarmine and observed under a microscope. DNA ploidy levels were investigated with propidium iodide flow cytometry (Temsch & Greilhuber, 2000; Suda et al., 2006) from fresh or silica gel-dried leaves.

Voucher specimens are deposited in the herbarium of the College of Life Sciences, Beijing Normal University (BNU).

Data sampling

Total genomic DNA was extracted from c. 0.02 g of silica gel-desiccated leaf materials following the 2% hexadecyltrimethylammonium bromide (2 × CTAB) protocol (Doyle & Doyle, 1987).

Three nuclear genes were partially sequenced. They are the chloroplast-expressed glutamine synthase gene (ncpGS), the cytosolic phosphoglucose isomerase gene (PgiC) and the sedoheptulose bisphosphatase gene (SBP). These genes have proven to be informative for the phylogenetic study of Achillea species (Ma et al., 2010; Guo et al., 2012). Four plastid DNA loci, rpL16 intron, trnH-psbA, trnC-ycf6-psbM and trnY-rpoB, were also sequenced to compare their pattern of variability with that in the nuclear genes. The regions sequenced, the primers used for amplification and the PCR conditions are provided in Table S1.

Purified PCR products were either used for direct sequencing (for cpDNA) or ligated into a pGEM-T Vector (for nuclear genes) with a Promega Kit (Promega Corporation, Madison, WI, USA). For sequencing the nuclear genes, five to eight positive clones from each diploid and 10–15 from each tetraploid individual were randomly picked. The plasmid was extracted with an Axyprep Kit (Axygene Biotechnology, Hangzhou, China). Cycle sequencing was conducted using ABI PRISM® BigDye™ Terminator and the products were run on an ABI PRISM™ 3700 DNA Sequencer (Applied Biosystems, Foster City, CA, USA). All sequences were submitted to the National Center for Biotechnology Information (NCBI) GenBank under accession numbers HQ204041–HQ204185 (the nuclear genes) and JN224491–JN224810 (the plastid loci).

Sequences were assembled with the Contig Express program (Informax Inc. 2000, North Bethesda, MD, USA), aligned with ClustalX 1.81 (Thompson et al., 1997) and manually improved with BioEdit version 7.0.1 (Hall, 1999). To avoid sequencing errors in the nuclear gene dataset, we assumed that a variant base found in only a single clone (sequence) was probably generated during the cloning or sequencing processes, and thus excluded this variant from further analysis.
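
The singleton rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say whether the rule was applied per individual or across the whole alignment, nor how an excluded variant was recorded, so the hypothetical function below works on any list of aligned clone sequences and replaces a base seen in exactly one clone with the most common base in that column.

```python
from collections import Counter

def mask_singleton_variants(clones):
    """Return copies of the aligned clone sequences in which every base that
    occurs in exactly one clone at a column is replaced by the most common
    base of that column (assumed handling of 'excluded' variants)."""
    if not clones:
        return []
    cleaned = [list(seq) for seq in clones]
    for col in range(len(clones[0])):
        counts = Counter(seq[col] for seq in clones)
        major, major_count = counts.most_common(1)[0]
        # Skip invariant columns and columns without a repeated base,
        # where no majority state can be identified.
        if len(counts) < 2 or major_count < 2:
            continue
        for row in cleaned:
            if counts[row[col]] == 1:
                row[col] = major
    return ["".join(row) for row in cleaned]
```

A variant shared by two or more clones is left untouched, which is the property the rule relies on: independent cloning or sequencing errors rarely hit the same site with the same base twice.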

We also tried to counteract the influence of PCR-mediated recombinants (Cronn et al., 2002; Wu et al., 2007; Kelly et al., 2010; Scheen et al., 2012) on our estimate of the recombination rate of the nuclear gene sequences. To estimate the approximate frequency of PCR recombination and to account for it in data analysis, we re-amplified, cloned and sequenced two divergent homeologous alleles from the mixture of Escherichia coli clones that had already been sequenced. Recombinant sequences obtained during this procedure are definitely PCR mediated. We found that, even under optimized PCR conditions, in vitro chimeras represent 10–15% of the total (Wu et al., 2007; Ma et al., 2010; Guo et al., 2012; Scheen et al., 2012). In real datasets, an in vitro artifactual haplotype usually arises randomly and appears uniquely, whereas an in vivo recombinant haplotype may be identical by descent with other haplotypes from the same or different populations (Kelly et al., 2010; Ma et al., 2010; Guo et al., 2012). Hence, shared recombinant haplotypes are more likely to be natural in vivo recombinants than unique ones. Finally, we randomly checked 16 recombinant sequences through direct sequencing using recombinant-specific primer pairs. If the recombinant could be recovered, we reasoned that it was real; otherwise, it was an artifact. Our results suggested a low rate of in vivo recombination. The procedures guard against the inclusion of artifactual recombination, but lead to bias against in vivo recombinants, that is, our method is conservative with respect to the null hypothesis of no recombination.

Median-joining network analysis implemented in Network ver., available at http://www.fluxus-engineering.com/sharenet.htm (Bandelt et al., 1999), was applied to the cpDNA data based on shared haplotypes. Nuclear gene trees were constructed in MEGA 5.05 (Tamura et al., 2011) with neighbor joining (NJ) and in PAUP* 4.0b10a (Swofford, 2003) with maximum parsimony (MP) methods (results shown in Figs S1–S3). Gaps were treated as missing data in all analyses.

Population genetic analysis

With nuclear gene data, we have to consider problems with PCR cloning and Sanger sequencing of nuclear genes from an allopolyploid individual, even after the elimination of all PCR recombinants. To obtain all four alleles in the correct proportions requires considerable effort. As an example, in the case in which two haplotypes segregate within an individual and we need to decide between the ratios 1 : 3 and 2 : 2, at least 77 sequences are required for a power of 0.9 (calculated with the function power.prop.test in the statistical programming language ‘R’; www.r-project.org). As the possibility of a ratio of 3 : 1 also needs to be considered, 100 or more sequences are necessary to decide between all three possibilities. This is practically very difficult and costly. In the following population genetic analysis, we therefore consider a single sample per individual, which ignores some information, but avoids bias caused by incorrect assumptions.
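
The figure of 77 sequences can be checked with the usual normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided significance level of 0.05, which the text does not state explicitly, and treats the ratios 1 : 3 and 2 : 2 as the proportions 0.25 and 0.5; the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sequences_needed(p1, p2, power=0.9, alpha=0.05):
    """Sample size per group needed to distinguish proportions p1 and p2
    (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_power * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2)))
    return (numerator / (p1 - p2)) ** 2

print(ceil(sequences_needed(0.25, 0.5)))  # 77
```

Because the required number scales with 1/(p1 − p2)², halving the difference between the candidate ratios would roughly quadruple the sequencing effort.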

With DnaSP ver. 5.10.01 (Librado & Rozas, 2009), we quantified the levels of genetic diversity within each species using Nei's π and Watterson's θ for the combined plastid loci and for each nuclear locus. Genetic differentiation between populations was estimated with FST (Hudson et al., 1992b) and KST* (when the sample size is small; Hudson et al., 1992a; Morales-Hojas et al., 2008), and its significance was assessed using a permutation test with 1000 replicates.
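
For readers unfamiliar with the two diversity measures, the following Python sketch computes per-site values of Nei's π (mean number of pairwise differences) and Watterson's θ (number of segregating sites divided by the harmonic factor a_n). It is not DnaSP's implementation; the handling of gaps (columns containing any character other than A, C, G or T are skipped) and all function names are assumptions made for illustration.

```python
from itertools import combinations

def harmonic_factor(n):
    """a_n = 1 + 1/2 + ... + 1/(n - 1) for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))

def usable_columns(seqs):
    """Alignment columns in which every sequence has an unambiguous base."""
    return [c for c in range(len(seqs[0]))
            if all(s[c] in "ACGT" for s in seqs)]

def nucleotide_diversity(seqs):
    """Nei's pi per site: mean pairwise differences / number of usable sites."""
    cols = usable_columns(seqs)
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a[c] != b[c] for c in cols) for a, b in pairs)
    return diffs / len(pairs) / len(cols)

def watterson_theta(seqs):
    """Watterson's theta per site from the number of segregating sites."""
    cols = usable_columns(seqs)
    seg = sum(len({s[c] for s in seqs}) > 1 for c in cols)
    return seg / harmonic_factor(len(seqs)) / len(cols)

def watterson_from_counts(seg_sites, n_seqs, length_bp):
    """Same estimator computed from summary counts."""
    return seg_sites / harmonic_factor(n_seqs) / length_bp

print(round(watterson_from_counts(17, 24, 2486), 5))  # 0.00183
```

The last line uses the counts reported for the plastid data of A. acuminata-2x in Table 3 (S = 17, N = 24, 2486 bp) and recovers the tabulated θw.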


We used the ms program (Hudson, 2002) to simulate a dataset in which two diploid populations separated in the past and a tetraploid hybrid population was generated by combining material from the two diploid populations. The length of the locus was set to 10⁵ bp; the scaled mutation parameter 4Neμ was set to one and the scaled recombination rate to 10 within the populations; the split between the two diploid populations occurred at 4 × 4Ne generations. The ‘allotetraploidization event’ was placed at 0.5 × 4Ne generations. After this event, the evolution of the single tetraploid population is modeled by two diploid populations that evolve independently, but with the same parameters, because the two unlinked diploid genomes inside the allopolyploid nucleus behave like two descendant populations that split at the same time from the two progenitor populations. From each of the descendant populations, 30 haplotypes were sampled. In the first dataset, representing a strictly disomic situation, 30 haplotypes were chosen at random from these two sets of 30. In the second dataset, recombination at a rate of 10⁻⁵ per base (i.e. an average of one recombination event per haplotype) was also simulated. We note that the expected drift coefficient ω between populations separated by 0.5 × 4Ne generations is c. 1 − exp(−1) = 0.63. There is a trade-off with the accuracy of inference of different parameters: if the diploid ancestral populations have a low within-population diversity, that is, are almost monomorphic, the scaled recombination rate ρ is estimated accurately, but not the drift coefficient ω. Furthermore, the accuracy of estimation of the population proportions of alleles γ is higher with higher recombination rates ρ, because more independent events of switching among diploid populations are available for inference.
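
The two expectations quoted for the simulation follow from short calculations, sketched here in Python. The sketch relies on the standard result that, in an isolated diploid population, the drift coefficient grows as F(t) = 1 − exp(−t/(2Ne)) after t generations; the function names are hypothetical.

```python
from math import exp

def expected_drift(tau):
    """Expected drift coefficient after tau * 4Ne generations of isolation.
    F = 1 - exp(-t / (2 Ne)) with t = tau * 4 Ne, i.e. 1 - exp(-2 tau)."""
    return 1.0 - exp(-2.0 * tau)

def expected_switches(rate_per_base, length_bp):
    """Expected number of recombination events per haplotype."""
    return rate_per_base * length_bp

print(round(expected_drift(0.5), 2))              # 0.63
print(round(expected_switches(1e-5, 1e5), 6))
```

With a separation of 0.5 × 4Ne generations the exponent is −1, which gives the value of c. 0.63 stated above. One expected event per haplotype at a rate of 10⁻⁵ per base corresponds to a locus of 10⁵ bp.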

Simulation results

We present the analysis of two simulations, one without recombination and the other with a recombination rate of 10⁻⁵ per base. Results are based on runs with 10⁵ iterations after a ‘burn in’ phase of 10⁴ iterations. The estimates are based on a single locus and cannot be fully reliable. Nevertheless, assignments to the parental progenitors are exactly correct for the case without recombination, where the true assignments correspond to the inferred posterior. For the case with recombination, about five recombinations remain undetected by the algorithm (data not shown).

For the model without recombination, the estimated recombination rates ρ, drift coefficient ω and allelic proportions γ have their maxima as close to the true parameter values (vertical lines in Fig. 2a,c,e) as expected, given that the estimates are based on a single locus. Given the true assignments z (determined from the simulated data), the best-fitting posterior distribution curves for ρ and γ (solid lines in Fig. 2a,e) can be calculated; the posterior distributions estimated with the MCMC sampler almost coincide with these distribution curves. For the coefficient ω, which indicates drift within the tetraploid compared with its two diploid progenitors, we could not easily calculate the theoretical distribution, even when the true assignments z are given, and so we do not show a solid line in Fig. 2(c,d). The broad posterior distribution of ω is expected, as estimates of a drift coefficient (usually abbreviated as FST) from a single locus have a relatively large variance. In the dataset with recombination, the posterior assignment to the diploid parents z was only partially correct. Therefore, the posterior distribution of ρ is biased from that inferred using the true values by the five undetected recombinations (compare the solid line in Fig. 2b with the inferred posterior distribution). Nevertheless, the contribution of the first parent broadly overlaps the true parameter value (Fig. 2f).

Figure 2.

Simulation results. Plot of the marginal posterior frequencies in Markov chain Monte Carlo (MCMC) samplers of simulated data without recombination (a, c, e) and with recombination (b, d, f). (a, b) Plot of the posterior frequency of the recombination coefficient ρ. Its expectations, which are 0 in (a) and 1.0 × 10⁻⁴ in (b), are indicated by the vertical lines; the solid curves are the theoretically calculated posterior distributions given the true amount of ‘recombination’ between the diploid populations. (c, d) Plot of the posterior frequencies of the drift coefficients ω. The expectations of ω are c. 0.63 for both, again indicated by a vertical line. (e, f) Plot of the contributions of the first progenitor to the tetraploid γ. The expected proportions are 0.5 in both cases, again indicated by a vertical line. The solid curves are the posterior distributions given the true assignments z.


Nuclear and plastid haplotype variations and their relationship

In total, 252 sequences of the ncpGS gene from 54 individuals, 258 of PgiC from 49 individuals and 189 of SBP from 42 individuals are included (Figs S1–S3). The lengths of the ncpGS sequences vary from 721 to 749 bp. The aligned ncpGS data matrix contains 762 bp; a total of 65 SNPs generate 31 haplotypes. The lengths of the PgiC sequences vary from 595 to 628 bp. The aligned PgiC data matrix contains 630 bp; 56 SNPs generate 37 haplotypes. The lengths of the SBP sequences vary from 392 to 396 bp. The aligned SBP data matrix contains 397 bp; 46 SNPs generate 27 haplotypes. Figures S1–S3 show the gene trees constructed with the NJ distance method. The two diploid species share no polymorphism, mirroring their distant relationship (A. asiatica-2x may have retained ancestrally polymorphic SBP alleles (Fig. S3: clade II and II*) as discussed in Guo et al., 2012). However, the two tetraploid species each harbor divergent homeologous gene copies, reflecting their hybrid origins.

Four plastid loci (trnH-psbA, trnC-ycf6-psbM, trnY-rpoB and rpL16) were sequenced from 104 individuals of 19 populations. As chloroplasts do not recombine, these sequences were concatenated into sequences of 2497–2737 bp. The alignment contains 2789 nucleotide positions with 30 variable sites arranged in 21 haplotypes (Fig. 3).

Figure 3.

Median-joining network of 21 cpDNA haplotypes across 104 individuals of 19 populations of the two diploid species Achillea acuminata-2x (acu) and A. asiatica-2x (asi), and the two allotetraploid descendant species A. alpina-4x (alp) and A. wilsoniana-4x (wil). These haplotypes are generated from four noncoding plastid DNA regions: rpL16 intron and the intergenic spacers trnH-psbA, trnC-ycf6-psbM and trnY-rpoB. Short bars on branches of the network indicate the number of variable sites. Species, populations and individuals covered by each haplotype (H1–H21) are labeled as ‘taxon abbreviation (population code (number of individuals))’.

The two diploid species do not share a haplotype. Most of the haplotypes fall into two well-differentiated groups. The tetraploid species A. wilsoniana-4x and A. alpina-4x share the frequent haplotype H7 with A. asiatica-2x. The other private haplotypes of A. wilsoniana-4x and A. alpina-4x are apparently derived from H7 by a few mutations. In addition, A. alpina-4x has a rather divergent haplotype group containing the frequent H2, shared with A. acuminata-2x, and the rare, private H10, apparently derived from H2.

Corresponding to the diverse cpDNA haplotypes, values for Nei's π and Watterson's θ are high for both of the diploid species. The tetraploid species, especially A. wilsoniana-4x, however, show reduced plastid diversity. All four species show population differentiation (Table 3).

Table 3. Summary of nucleotide diversity and population differentiation of the studied Achillea species
| Locus | Species | bp | N | S | h | π | θw | KST* | FST |
|---|---|---|---|---|---|---|---|---|---|
| cpDNA | A. acuminata-2x | 2486 | 24 | 17 | 6 | 0.00296 | 0.00183 | 0.67023*** | 0.72269 |
| | A. asiatica-2x | 2696 | 21 | 9 | 3 | 0.00141 | 0.00093 | 0.77905*** | 0.95548 |
| | A. alpina-4x | 2696 | 36 | 14 | 11 | 0.00151 | 0.00125 | 0.27905*** | 0.49503 |
| | A. wilsoniana-4x | 2697 | 23 | 3 | 4 | 0.00015 | 0.00030 | 0.17692* | 0.23077 |
| ncpGS | A. acuminata-2x | 753 | 9 | 2 | 3 | 0.00118 | 0.00098 | 0.70392* | 0.80000 |
| | A. asiatica-2x | 726 | 13 | 31 | 4 | 0.01939 | 0.01376 | 0.15119 | 0.16905 |
| | A. alpina-4x | 737 | 17 | 50 | 4 | 0.02082 | 0.02007 | −0.04855 | −0.17499 |
| | A. wilsoniana-4x | 708 | 15 | 57 | 4 | 0.02646 | 0.02476 | 0.32676* | 0.47600 |
| PgiC | A. acuminata-2x | 625 | 11 | 8 | 4 | 0.00448 | 0.00437 | 0.28455* | 0.45455 |
| | A. asiatica-2x | 619 | 9 | 28 | 3 | 0.02037 | 0.01664 | −0.24991 | 0.54878 |
| | A. alpina-4x | 623 | 16 | 14 | 3 | 0.00538 | 0.00677 | 0.06864 | 0.23077 |
| | A. wilsoniana-4x | 620 | 13 | 30 | 3 | 0.01940 | 0.01559 | 0.49311* | 0.47059 |
| SBP | A. acuminata-2x | 397 | 8 | 1 | 2 | 0.00135 | 0.00097 | −0.24444 | 0.14286 |
| | A. asiatica-2x | 395 | 12 | 38 | 4 | 0.04077 | 0.03186 | 0.19014 | 0.15531 |
| | A. alpina-4x | 395 | 13 | 28 | 4 | 0.02941 | 0.02284 | −0.06002 | −0.02391 |
| | A. wilsoniana-4x | 395 | 9 | 21 | 3 | 0.02588 | 0.01956 | 0.15961 | 0.13986 |

  1. bp, sequence length in base pairs; N, number of sequences (= number of individuals); S, number of segregating (polymorphic) sites; h, number of haplotypes; π, nucleotide diversity from distribution of segregating sites; θw, Watterson's estimator from number of segregating sites; ncpGS, chloroplast-expressed glutamine synthase gene; PgiC, cytosolic phosphoglucose isomerase gene; SBP, sedoheptulose bisphosphatase gene. Probability (P) obtained by the permutation test with 1000 replicates: *, 0.01 < P < 0.05; ***, P < 0.001.
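As a reading aid for the θw column of Table 3, the following minimal sketch recomputes Watterson's estimator per site from S, N and the sequence length, using the standard formula θw = S / (a_n · L) with a_n = Σ 1/i for i = 1 … N − 1. The function name and the dictionary of rows are illustrative choices made here, not part of the original analysis, and the software used to produce Table 3 may treat gaps or missing data differently.

```python
def watterson_theta(num_segregating: int, num_sequences: int, length_bp: int) -> float:
    """Watterson's estimator per site: S / (a_n * L), a_n = sum_{i=1}^{n-1} 1/i."""
    if num_sequences < 2:
        raise ValueError("at least two sequences are required")
    if length_bp <= 0:
        raise ValueError("sequence length must be positive")
    # Harmonic-number correction for the sample size
    a_n = sum(1.0 / i for i in range(1, num_sequences))
    return num_segregating / (a_n * length_bp)


# (S, N, bp) copied from the cpDNA rows of Table 3
cpdna_rows = {
    "A. acuminata-2x": (17, 24, 2486),
    "A. asiatica-2x": (9, 21, 2696),
    "A. alpina-4x": (14, 36, 2696),
    "A. wilsoniana-4x": (3, 23, 2697),
}

for species, (s, n, bp) in cpdna_rows.items():
    print(f"{species}: theta_w = {watterson_theta(s, n, bp):.5f}")
```

For the four cpDNA rows, the recomputed values agree with the tabulated θw to the printed precision, which also serves as a consistency check on the S, N and bp columns.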

Population genetic analysis with the nuclear gene data

For the population genetic analysis, only one sequence per individual was considered to avoid possible bias, as detailed above. Assuming neutral evolutionary equilibrium, nuclear genes are expected to maintain about four times as much diversity as chloroplast loci because of their biparental inheritance and diploidy. (This is also true for allotetraploids, if inheritance is strictly disomic.) Nevertheless, compared with the chloroplast data, the nuclear genes of A. acuminata-2x show little diversity (Table 3), and all three loci are monomorphic in the population ARX2 (Figs S1–S3). Achillea asiatica-2x is more diverse than A. acuminata-2x (Table 3). The populations are well differentiated (Table 3).

With the two tetraploid species, we expect a high nuclear diversity because of the allopolyploid nature, that is, the fact that two distantly related species have combined their genomes. We find that the variability is indeed greater than that of the diploid species in some cases, for example in the ncpGS locus. Diversity in each of the two tetraploid species is, however, often lower than the theoretical diversity obtained by combining the two diploid species. Population differentiation within each tetraploid species is comparable with that in a diploid species (Table 3).

With the probabilistic model, we present the analytical results based on runs with 10 000 iterations after a ‘burn-in’ phase of 1000 iterations. We note that sample sizes are rather small for all genes and populations, such that the margins of inference are rather wide (Figs 4, S4, S5). Nevertheless, because of the large divergence between the diploid progenitor species, the assignment of haplotypic regions to diploid progenitors is stable and sensible given the data (Figs 4e,f, S4, S5E,F).

Figure 4.

Analysis of the nuclear locus of the cytosolic phosphoglucose isomerase gene (PgiC). Plot of the marginal posterior frequencies in Markov chain Monte Carlo (MCMC) samplers of three key parameters in Achillea alpina-4x (a, c, e) and A. wilsoniana-4x (b, d, f). (a, b) Scaled recombination rates per basepair ρ; (c, d) drift coefficients ω; and (e, f) the inferred proportion of the A. acuminata-2x progenitor in the polyploids γ.

For the locus PgiC, the most probable or maximum posterior assignment of genetic material in the tetraploids to the diploid progenitors (deduced from the indicator variable z) is given in Fig. 4. We see that the A. acuminata-2x haplotype predominates in both tetraploid species (Fig. 4e,f; Tables 4, 5, S2). In both species, the posterior mass hardly overlaps the theoretical midpoint of 0.5: with A. alpina-4x, > 99% of the posterior probability mass is > 0.5 and, with A. wilsoniana-4x, > 97%. With A. alpina-4x, three recombinations are inferred, two at identical positions and belonging to a single haplotype found in different individuals and populations (alp_CB2_2_6 and alp_WT1_1), and thus they should be identical by descent (Tables 4, 5). We note that this violates the assumption of independence of the observed haplotypes in the tetraploid population, which leads to an underestimation of error rates of some parameters. With A. wilsoniana-4x, no recombination is apparent (Table S2: Results of A. wilsoniana). Correspondingly, the mean of the inferred recombination rates between the diploid progenitors is bounded away from zero in A. alpina-4x and close to zero in A. wilsoniana-4x (Fig. 4a,b, respectively). In both tetraploid species, the drift coefficient ω is closer to one than to zero (Fig. 4c,d, respectively), that is, diversity is reduced compared with the diploid parents.
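Statements such as '> 99% of the posterior probability mass is > 0.5' can be read as the fraction of retained MCMC draws of γ that exceed 0.5. The sketch below illustrates that summary under stated assumptions: the function name is hypothetical, and the chain is synthetic (independent Beta draws standing in for sampler output), so the printed number carries no biological meaning.

```python
import random


def posterior_mass_above(samples, threshold=0.5, burn_in=1000):
    """Fraction of post-burn-in draws that lie strictly above the threshold."""
    kept = samples[burn_in:]  # discard the burn-in phase
    if not kept:
        raise ValueError("no samples left after discarding burn-in")
    return sum(1 for value in kept if value > threshold) / len(kept)


# Synthetic stand-in for a chain of gamma values (assumption, not real output):
# 1000 burn-in iterations followed by 10 000 retained iterations.
rng = random.Random(1)
chain = [rng.betavariate(12, 5) for _ in range(11_000)]

print(f"P(gamma > 0.5 | data) ~ {posterior_mass_above(chain):.3f}")
```

With real sampler output, successive draws are autocorrelated, so the effective number of independent draws behind such a fraction is smaller than the nominal 10 000.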

Table 4. Polymorphic alleles of the cytosolic phosphoglucose isomerase gene (PgiC) locus in (a) Achillea acuminata-2x, (b) A. asiatica-2x and (c) A. alpina-4x
  1. The first 16 positions are the code for the individual and clone. Then each column corresponds to a polymorphic position; the major (most frequent) allele in A. acuminata-2x is coded as ‘0’.

Table 5. Assignment of the cytosolic phosphoglucose isomerase gene (PgiC) clones of individuals of Achillea alpina-4x to the diploid parental species
  1. ‘0’ corresponds to A. acuminata-2x, and ‘1’ to A. asiatica-2x. The clones are identical to those in Table 4(c).
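Given the 0/1 coding used in Table 5, the number of inferred recombination (or gene conversion) breakpoints in a clone corresponds to the number of switches between '0' and '1' along its assignment string. The sketch below illustrates this reading; the function names and the example strings are hypothetical and are not taken from Table 5.

```python
def count_breakpoints(assignment: str) -> int:
    """Number of 0/1 switches along a per-site parental assignment string."""
    if not assignment or set(assignment) - {"0", "1"}:
        raise ValueError("assignment must be a non-empty string of '0' and '1'")
    # Compare each site with its right-hand neighbour
    return sum(1 for left, right in zip(assignment, assignment[1:]) if left != right)


def acuminata_fraction(assignment: str) -> float:
    """Share of sites assigned to A. acuminata-2x (coded '0')."""
    if not assignment or set(assignment) - {"0", "1"}:
        raise ValueError("assignment must be a non-empty string of '0' and '1'")
    return assignment.count("0") / len(assignment)


# Hypothetical clones: non-recombinant, one breakpoint, two breakpoints
for clone in ("00000000", "00001111", "00110000"):
    print(clone, count_breakpoints(clone), acuminata_fraction(clone))
```

Identical recombinant strings found in several individuals would be counted once per clone by this tally, which is the non-independence issue noted in the text for haplotypes that are identical by descent.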


For the locus ncpGS, the most probable posterior assignment to diploid populations is given in Fig. S4. As with PgiC, the A. acuminata-2x haplotype predominates (Fig. S4; Table S3). With A. alpina-4x, nearly 99% of the posterior probability mass is > 0.5 and, with A. wilsoniana-4x, > 95%. With A. alpina-4x, three recombinations are inferred, all belonging to one haplotype found in different individuals and populations (Table S3: Results of A. alpina); whereas, with A. wilsoniana-4x, a single sequence shows two recombinations (Table S3: Results of A. wilsoniana). Corresponding to the inferred assignments, the mean of the inferred recombination rate is very low for A. wilsoniana-4x (Fig. S4B), whereas the inferred values are higher and the distribution is bounded away from zero for A. alpina-4x (Fig. S4A).

For the locus SBP, the most probable posterior assignment to diploid populations is given in Fig. S5. We see that, again, the A. acuminata-2x haplotype predominates (Fig. S5; Table S4), although this time less strongly. With A. alpina-4x, c. 80% of the posterior probability mass is > 0.5 and, with A. wilsoniana-4x, c. 85%. With A. alpina-4x, four recombinations are inferred, three belonging to one haplotype found in different individuals and populations (Table S4: Results of A. alpina). With A. wilsoniana-4x, no recombination is inferred (Table S4: Results of A. wilsoniana). The posterior distributions of the recombination rates are shown in Fig. S5(A,B).

For all loci and species, the inferred genetic drift parameter is c. 0.7 on average (Figs 4c,d, S4, S5C,D). For the inferred contribution of the A. acuminata-2x genome, some variability around the mean is observed; nevertheless, the posterior distribution is clearly shifted from the equal prior towards a higher contribution of A. acuminata-2x (Figs 4e,f, S4, S5E,F). Taken together, these data show a low rate of homeologous recombination or gene conversion, a modest loss of diversity through genetic drift relative to the two diploids combined, and an increase in the A. acuminata-2x genome in the tetraploids in all three loci.


In this article, we present a probabilistic population genetic model to analyze sequence data from the nuclear genes of populations of allotetraploids and both their diploid progenitors. With this model, the amount of reduction in genetic variation in the allotetraploids relative to their diploid parents can be inferred. This reduction is probably caused by founder effects during the polyploid speciation event. Furthermore, the amount of recombination or gene conversion that leads to the mixing of genetic material between the homeologous copies of loci can also be inferred. During this process, the amount of genetic material contributing to the allotetraploid genome may shift between the two diploid parental genomes. We present simulations which show that we can recover the simulation parameters and apply the model to a dataset of two allotetraploid yarrow species and their diploid progenitors.

Plastid haplotype data indicate multiple and relatively old allopolyploid formation

The chloroplast data were analyzed to decide between a single or multiple formation of the allopolyploid species. These alternatives may lead to different evolutionary trajectories of populations of allopolyploids, even up to the differentiation of cryptic species (Soltis et al., 2010). According to the previous (Guo et al., 2006) and present (Fig. 3) chloroplast data, A. alpina-4x and A. wilsoniana-4x have a common maternal contribution from an A. asiatica-like progenitor. This could either have been caused by one common or multiple independent allopolyploidization events. Given the clear population genetic differentiation between the two tetraploid species shown by the AFLP data (Guo et al., 2006) and the nuclear sequences presented in this article, we consider that multiple independent events are more likely. With the extended plastid dataset in this study, we find that A. alpina-4x also has maternal contributions from A. acuminata-2x (Fig. 3), probably through another allopolyploid speciation event with reversed parental contribution. More complicated scenarios, such as an already established tetraploid A. alpina capturing plastids from A. acuminata-2x, or plastid introgression of A. alpina-4x into A. acuminata-2x, are unlikely as they should also have left traces in the nuclear genomes.

The chloroplast data were also analyzed to investigate the population structure. Among the four species, the tetraploids show relatively low variability. This was probably caused by a founder effect, that is, a genetic bottleneck during allopolyploid speciation. The presence of many private haplotypes within both tetraploid species indicates that they have evolved for a comparatively long time, which allowed for the accumulation of new mutations, and have maintained large enough population sizes not to lose the new mutations through genetic drift.

Genetic variability and long-term evolution of homeologous nuclear genes in a plant of allopolyploid origin

The two tetraploid species often show more nuclear genetic diversity than one of the diploids (Table 3). This is expected as two diverse diploid progenitors combine their genetic material in the tetraploids (Figs S1–S3). Nevertheless, the genetic diversity of the tetraploid species is below that of the two diploid species combined.

With a probabilistic model, we differentiated the genetic material of the polyploids with respect to contributions from one or the other progenitor species. As the two diploid species are well differentiated and do not share polymorphisms (Figs S1–S3), we could assign stretches of DNA to one or the other diploid parent with near certainty. Unlike the earlier inference of disomic inheritance of the two allotetraploid species based on the whole-genome AFLP analysis (Guo et al., 2006), we found here a modest amount of recombination or gene conversion between the homeologous genomes in the tetraploids. As we tried to avoid PCR artifacts, we might have biased against the detection of recombinants and underestimated the real rate of recombination. Recombination was also found by Salmon et al. (2010) in cotton, Kelly et al. (2010) in Nicotiana and Pelser et al. (2012) in Senecio. With expressed sequence tags, Salmon et al. (2010) detected c. 1.8–1.9% nonreciprocal homeologous exchanges (gene conversion) in Gossypium hirsutum (AD genome). They showed that such events have accumulated gradually throughout polyploid divergence and speciation, as opposed to saltationally at the time of allopolyploidization. However, these authors do not address the population-level distribution of such gene conversion tracts or the population genetic consequences of allopolyploidization as we do here. Pelser et al. (2012) reported recombination of parental internal transcribed spacer (ITS) copies in the allotetraploid species Senecio massaicus: more than half of the sequenced S. massaicus samples contain recombinant ITS sequences, some of which must be natural, that is, cannot be caused by PCR-mediated recombination. Chang et al. (2010) do not report such recombination in the allotetraploid Arabidopsis suecica. This may be a result of their use of short sequence reads that do not allow for the easy identification of gene conversion or homeologous recombination between the progenitors' genetic material. We therefore concur with Kelly et al. (2010) that intragenic recombination is sufficiently common that it needs to be taken into account when reconstructing hybrid relationships with nuclear gene sequence data.

After assigning the progenitors' contribution with the probabilistic model, we find that both allotetraploid species show a reduction in diversity of genetic material with respect to their diploid progenitors. This is also obvious for the plastid DNA. For the nuclear genes, the nucleotide diversity of the tetraploid species is often lower than that in both diploid progenitors combined, but nevertheless higher than that of each diploid species separately. The reduced diversity may be a sign of a genetic bottleneck at the time of polyploidization, or the result of selection after the polyploid hybridization event, or both. It is unlikely to be caused by genetic drift after the establishment of the allopolyploid species, as their census population sizes are much larger than those of the diploid species.

We find a slightly, but significantly, higher proportion of the A. acuminata-2x alleles in both tetraploids. The previous AFLP analysis, however, shows an equal contribution of both progenitor species (Guo et al., 2006). We note that sampling is genome-wide with the AFLP technique, which increases the power of inference by increasing the effective sample size, whereas the present sequence data are sampled locally at three loci. However, AFLP fragments are useless for the detection of recombination or gene conversion events between materials from the two diploid progenitor species in the genome. Such recombination may lead to local shifts in the proportion of parental contribution, either randomly or through selective differences among alleles. This may explain the finding of an excess of A. acuminata-2x alleles. Furthermore, we assume no genetic sampling after establishment of the tetraploids, such that all nuclear gene haplotypes found are counted as independent. However, we do observe some apparent recombination events that seem to be identical by descent and probably have risen to higher frequencies in the descendant populations through random genetic drift. This violation of the assumption of independent (recombinant) haplotypes makes the estimates of the biased proportions of parental contribution less reliable.

In summary, the sequences of the nuclear genes analyzed show that most of the tetraploid individuals and populations harbor homeologous gene copies. The plastid sequence data indicate that the maternal progenitor of A. wilsoniana-4x seems to be exclusively A. asiatica-2x (Guo et al., 2006), whereas A. alpina-4x seems to have maternal contributions from both diploid progenitor species via multiple independent and reciprocal allopolyploidization events. The large number of low-frequency plastid haplotypes that have probably arisen by mutation from the shared frequent haplotypes after polyploidization indicates that both allotetraploid species are relatively old. We find a small amount of genetic exchange between the parental genomes through recombination or gene conversion, and a reduction in diversity relative to the diploid parental species combined. Considering the huge extant population sizes of the tetraploid species, it is most likely that the two tetraploid species underwent a genetic bottleneck at the time of speciation.


YPG thanks the National Natural Science Foundation of China (Grant nos. 31170207 and 31121003) for financial support, and the College of Life Sciences, Beijing Normal University for the facilities provided. CV acknowledges the funding of the ‘Initiativkolleg Populationsgenetik’ from the University of Veterinary Medicine, Vienna, Austria, and thanks Andreas Futschik for discussions on the technical details of the probabilistic sampler. We are grateful to Friedrich Ehrendorfer for initiating our interest in evolution and speciation of the Achillea plants and for his continuing support. We thank Graham Tebb for critical reading and editing of the manuscript. We are particularly grateful to several anonymous reviewers whose comments have significantly improved the clarity of the model presented here and the presentation of the manuscript.