Recent progress and challenges in population genetics of polyploid organisms: an overview of current state-of-the-art molecular and statistical tools

Authors


Abstract

Despite the importance of polyploidy and the increasing availability of new genomic data, there remain important gaps in our knowledge of polyploid population genetics. These gaps arise from the complex nature of polyploid data (e.g. multiple alleles and loci, mixed inheritance patterns, association between ploidy and mating system variation). Furthermore, many of the standard tools for population genetics that have been developed for diploids are often not feasible for polyploids. This review aims to provide an overview of the state-of-the-art in polyploid population genetics and to identify the main areas where further development of molecular techniques and statistical theory is required. We review commonly used molecular tools (amplified fragment length polymorphism, microsatellites, Sanger sequencing, next-generation sequencing and derived technologies) and the challenges associated with their use in polyploid populations: that is, allele dosage determination, null alleles, difficulty of distinguishing orthologues from paralogues and copy number variation. In addition, we review the approaches that have been used for population genetic analysis in polyploids and their specific problems. These problems are in most cases directly associated with dosage uncertainty and the problem of inferring allele frequencies and assumptions regarding inheritance. This leads us to conclude that for advancing the field of polyploid population genetics, highest priority should be given to development of new molecular approaches that allow efficient dosage determination, and to further development of analytical approaches to circumvent dosage uncertainty and to accommodate ‘flexible’ modes of inheritance. In addition, there is a need for more simulation-based studies that test what kinds of biases could result from both existing and novel approaches.

Introduction

Polyploidy is a prominent feature of plant genomes (Tate et al. 2005). Although polyploidy is much rarer in the animal kingdom than in plants, there are numerous examples of polyploid invertebrates, fish and amphibians (Gregory & Mable 2005; Mable et al. 2011). Even organisms that are now genetically diploid often have a paleopolyploid history. In plants and yeast, early genome-sequencing projects revealed that numerous diploid species show signs of ancient genome duplications (Arabidopsis, Blanc et al. 2000; rice, Bowers et al. 2003; yeast, Kellis et al. 2004; poplar, Tuskan et al. 2006; grapevine, Jaillon et al. 2007). In animals, whole-genome duplication events have coincided with the origin of vertebrates, gnathostomes and teleosts (Holland et al. 1994; Postlethwait et al. 2000; Crow et al. 2006). A whole-genome duplication event is thought to have facilitated the survival of flowering plant lineages through the mass extinction at the Cretaceous-Tertiary transition (Fawcett et al. 2009). This has led to the generally accepted view that polyploidization plays an important role in evolution, in both plants and animals.

Despite the important role of polyploidization in evolution, our basic understanding of polyploids is still poor compared with diploids. This is largely due to the more complex nature of their genome evolution. Polyploids are typically classified as either autopolyploids or allopolyploids (Stebbins 1947). Autopolyploids originate after genome doubling within a single species, so that each chromosome is represented by more than two homologous copies. These homologous copies theoretically can at least initially pair in all possible combinations, leading to polysomic inheritance. However, even in autopolyploids divergence, neo-functionalization, or loss of duplicate copies over time (Lynch & Conery 2000) inevitably leads to disomic inheritance for at least some loci (Ohno 1970). Allopolyploids originate after hybridization of different species and subsequent genome doubling so that each chromosome is represented by two (or more) sets of divergent chromosomes, in which chromosomes within a set are termed homologues, and chromosomes from different sets (i.e. derived from different ancestral species) homoeologues (see Box 1). With sufficient divergence between homoeologues, meiotic pairing only takes place between chromosomes from the same parental origin, leading to disomic inheritance. In cases for which the homoeologous chromosomes can pair in meiosis and produce viable gametes, allopolyploids also may show a mixture of disomic and polysomic inheritance patterns. Moreover, inheritance patterns can vary across the genome within individuals, leading to disomic inheritance at some loci and polysomic at others. Due to the time and expense of assessing segregation within progeny arrays for every locus and every individual or species compared, it has not been quantified how frequently deviations from strictly disomic or strictly polysomic inheritance occur. However, where segregation has been tested, it is rare to find either extreme across all loci. 
For example, the family Salmonidae originated through polyploidization, but allozyme data originally suggested that inheritance patterns can vary between species, within species or even among tissue types within individuals (Danzmann & Bogart 1982; Allendorf & Danzmann 1997). Similar conclusions about deviations from strictly disomic or strictly polysomic inheritance have been described for plants (Jannoo et al. 2004; Stift et al. 2008; Kamiri et al. 2011; Koning-Boucoiran et al. 2012).

The existence of complex inheritance patterns complicates the genetic analysis of polyploids, because analytical frameworks normally assume a specific mode of inheritance. Assumptions about inheritance patterns are important because expected dosage of alleles (i.e. copy number of each allele) at individual loci will differ depending on the mode of segregation and models predicting the rate of loss or change of duplicate genes depend on the degree of redundancy of duplicate copies (Ohno 1970; Ferris & Whitt 1977; Allendorf 1978). This introduces both conceptual (e.g. how many alleles and gene copies are to be expected) and methodological (e.g., resolving allele and gene copy numbers) issues with obtaining markers for population genetic analyses. A major challenge for most existing markers used for population genetic analyses is reliably resolving dosage of alleles in polyploids and so enabling calculation of observed and expected allele frequencies, which is fundamental to many population genetic based inferences (Cockerham 1973; Kreitman 1987). Continuing advances in sequencing technology mean that it should soon be possible to consider genomewide variation in segregation patterns, but most population genetics models must currently be applied in the absence of knowledge about segregation, expected dosage, and allele or gene copy number.
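To make the dependence of expected dosage on segregation mode concrete, the following minimal sketch enumerates the gametes produced by a tetraploid carrying two copies each of alleles A and B. It is an illustration only: it assumes a single locus, random bivalent pairing with no double reduction for the tetrasomic case, and strictly preferential pairing within two subgenomes for the disomic case. The function names are hypothetical and do not correspond to any published software.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def tetrasomic_gametes(genotype):
    """Gamete frequencies for a tetraploid under tetrasomic inheritance.

    Assumes random pairing among all four homologues and no double
    reduction, so each of the six possible pairs of chromosome copies
    is equally likely to end up in a gamete.
    """
    pairs = list(combinations(range(4), 2))
    counts = Counter("".join(sorted(genotype[i] + genotype[j])) for i, j in pairs)
    return {g: Fraction(n, len(pairs)) for g, n in counts.items()}


def disomic_gametes(subgenome_1, subgenome_2):
    """Gamete frequencies under strict disomic inheritance.

    Each subgenome behaves like a diploid locus and contributes exactly
    one of its two copies to every gamete.
    """
    combos = list(product(subgenome_1, subgenome_2))
    counts = Counter("".join(sorted(a + b)) for a, b in combos)
    return {g: Fraction(n, len(combos)) for g, n in counts.items()}


# The same four allele copies (two A, two B) under three scenarios.
tetra = tetrasomic_gametes("AABB")
fixed_het = disomic_gametes("AA", "BB")    # subgenomes fixed for different alleles
segregating = disomic_gametes("AB", "AB")  # both subgenomes heterozygous

for label, result in [("tetrasomic", tetra),
                      ("disomic, fixed", fixed_het),
                      ("disomic, segregating", segregating)]:
    print(label, {g: str(f) for g, f in sorted(result.items())})
```

Under these assumptions, the AABB parent yields AA, AB and BB gametes in a 1:4:1 ratio with tetrasomic inheritance. With disomic inheritance, the same four copies yield only AB gametes when each subgenome is fixed for a different allele, or a 1:2:1 ratio when both subgenomes are heterozygous.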

In addition, variation in mode of segregation patterns can make it difficult to disentangle the effects of genome duplication from hybridization. For allopolyploids, analyses would be most robust if copies from each parent could be identified and treated separately during analysis of genetic variation. However, past genome duplication events make it difficult to distinguish true single nucleotide polymorphisms (SNPs) or orthologous allelic copies from fixed differences between homoeologous duplicate chromosomal regions and from tandemly-duplicated paralogous regions (Everett et al. 2011; Seeb et al. 2011b). This is confounded by the difficulty of resolving whether polyploid lineages have arisen through allo- or autopolyploidization.

Although many polyploid fish, amphibians (Bogart 1980; Otto & Whitton 2000) and plants (Suomalainen et al. 1987) reproduce sexually, an additional complexity arises due to the frequent association of specific reproductive systems with polyploidy. In the animal kingdom, the majority of polyploid invertebrates and reptiles reproduce asexually, and it has been estimated that 99% of apomictic plant species are polyploids (Suomalainen et al. 1987). In some cases, such as found in the planarian flatworm, Schmidtea polychroa, polyploid individuals can produce viable sperm that may lead to rare sexual processes (Sánchez-Navarro et al. 2013). As asexually reproducing plants and animals often have uneven ploidy levels (e.g. triploid) but coexist with even ploidy (e.g. diploid or tetraploid) individuals that reproduce sexually (Neiman et al. 2011), a substantial challenge is to include multiple ploidy levels with different expected heterozygosities (due to differences in both allelic dosage and mating system) into the same population genetic analyses, particularly for inferences that rely on accurate estimation of allele frequencies.

The main aim of this review is to provide an overview of the molecular and statistical tools that are currently available for polyploid population genetics, to provide examples of their application, and to identify the main areas where further development of molecular techniques and statistical theory is required to advance the field. Our review is organized into two sections. The first section deals with the issue of obtaining informative markers for polyploids. We first discuss the application of traditional markers [amplified fragment length polymorphism (AFLP), microsatellites, Sanger sequencing] in polyploids, and their pros and cons. We then show that new sequencing technologies still suffer from similar problems as traditional markers and introduce some of their own, but do hold promise for ultimately reducing these problems. The second section focuses on the analytical side and deals with the problem of extending standard methodologies for diploids to polyploid data. We discuss how classical approaches (allele frequency estimation, assignment and clustering methods, fixation indices, similarity/distance indices and multivariate analyses, custom models) can be used with polyploid data and identify priorities for further development of methodology and software. In particular, we conclude that there is a strong need for simulations to evaluate the appropriateness of the various creative solutions that have been proposed for analysing polyploid data.

Box 1. Glossary

Allelic dosage – Number of copies of each allele at a particular locus in a polyploid genotype.

Allopolyploid – Polyploid that has originated by genome doubling after hybridization, so that two homoeologous sets of the same chromosome exist. The dogma is that this generally leads to disomic inheritance, because there is preferential pairing between chromosomes from the same ancestral genome. However, polysomic inheritance is often still possible, at least at some loci or chromosomal regions.

Autopolyploid – Polyploid that has originated by genome doubling within a species, so that all variants of the same chromosome are homologous. The dogma is that this generally leads to polysomic inheritance, because there is no preferential pairing between certain chromosomes. However, as genome doubling inevitably leads to divergence among copies, specialization of function, or loss of copies, a return to disomic inheritance is predicted over time. Hybridization between closely related species or differentiated populations of the same species (sometimes referred to as segmental allopolyploidy) can be difficult to distinguish from autopolyploidy, but it is expected that there will be at least some disomic inheritance.

Disomic inheritance – Type of inheritance typical for allopolyploids due to preferential pairing between the chromosomes derived from the same ancestral species. This means that alleles derived from the same ancestral species segregate as for diploids, so offspring receive only one copy from a given parent. There is thus not expected to be recombination between the copies derived from the different parents (i.e. homoeologues).

Double reduction – Meiotic process in polyploids with polysomic inheritance in which recombination takes place between the locus and centromere and sister chromatids migrate to the same pole (i.e. segregate in the same gamete).

Homoeologues – Divergent loci or chromosomes in allopolyploid genomes that usually do not pair together during meiosis because they are derived from different parental lineages.

Homologues – Loci or chromosomes that usually pair together during meiosis because they are derived from the same parental lineage.

Orthologues – Gene copies that diverged after a speciation event.

Paralogues – Gene copies that diverged after a gene or genome duplication event.

Partial heterozygote – In diploids, genotypes for a given locus can be homozygotes (e.g. AA, BB, CC) or heterozygotes (e.g. AB, AC, BC, CD). In polyploids, genotypes can be homozygotes (e.g. AAAA, BBBB, CCCC), full heterozygotes (e.g. ABCD, ABFG, CDEF) or partial heterozygotes where one or more alleles are present multiple times (e.g. ABBC, ABFF, ABBB). Resolving partial heterozygotes is one of the biggest challenges for applying population genetics approaches to polyploids, for the majority of existing methods.

Null allele – An allele that fails to amplify using locus-specific primers or that is not observed due to incomplete sampling (e.g. not enough clones sequenced or not enough coverage during deep sequencing).

Polysomic (or multisomic) inheritance – Type of inheritance typical for autopolyploids, where all variants of the same chromosome can pair in meiosis. This means that parental alleles will be combined in the same gamete in all possible combinations. Depending on the position of the locus relative to the centromere, a maximum of one-sixth of the gametes can be the result of double reduction.

Stutter bands – Artefacts due to replication slippage during the PCR amplification of highly repetitive sequences (e.g. microsatellites), visible as one or more shadow bands, or one or multiple repeat lengths shorter or longer than the actual allele length.

Molecular genetic and genomic markers for polyploid population genetics

General caveats for genetic marker analysis in polyploids

Molecular markers that are standardly used for population genetics in diploids can in principle also be used in polyploids. However, one of the most important challenges when working with polyploid genomes is the difficulty of resolving the allelic constitution of individual loci (i.e. allelic dosage), which would be necessary to implement methods that rely on allele frequency-based inferences or those that require complete genotyping of individuals. Uncertainties in dosage can also compound problems associated with homoplasy due to null alleles or artefacts associated with either replication slippage (e.g. stutter bands) or unequal amplification of alleles of different lengths (e.g. allelic dominance) in markers requiring PCR amplification; as the number of alleles at a locus could vary from 1 to k in a k-ploid, detecting alleles that either do not amplify consistently or ‘extra’ alleles is not straightforward for ploidy levels higher than diploid (k = 2). Most tests for detecting such artefacts are based on Hardy–Weinberg (HW) equilibrium (MICROCHECKER, Van Oosterhout et al. 2004, 2006), but complete dosage information would be required to calculate expected allele and genotype frequencies. In addition, as many polyploids also show a shift to self-fertilization (Mable 2004a) or reproduce asexually (Stenberg & Saura 2013), tests that assume HW equilibrium also would not be useful for detecting homoplasy in these cases.
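The role of dosage in HW-based tests can be illustrated with a short sketch that computes expected genotype frequencies for a polyploid locus. It is a simplified illustration rather than the procedure used by any particular program: it assumes polysomic inheritance, random mating, no double reduction and known allele frequencies, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, prod


def expected_genotype_frequencies(allele_freqs, ploidy):
    """Expected genotype frequencies under random mating.

    Assumes polysomic inheritance and no double reduction, so genotype
    frequencies are the terms of the multinomial expansion of
    (p_1 + p_2 + ... + p_n) ** ploidy.
    """
    expected = {}
    for genotype in combinations_with_replacement(sorted(allele_freqs), ploidy):
        dosage = Counter(genotype)
        # Multinomial coefficient: number of orderings of this genotype.
        coefficient = factorial(ploidy) // prod(factorial(d) for d in dosage.values())
        probability = coefficient * prod(allele_freqs[a] ** d for a, d in dosage.items())
        expected["".join(genotype)] = probability
    return expected


# Two equally frequent alleles at a tetraploid locus.
freqs = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
expected = expected_genotype_frequencies(freqs, ploidy=4)

# All genotypes carrying both alleles show the same two-band phenotype.
two_band_phenotype = sum(p for g, p in expected.items() if len(set(g)) == 2)

for genotype, p in expected.items():
    print(genotype, p)
print("A and B both observed:", two_band_phenotype)
```

In this example, the genotypes AAAB, AABB and ABBB together account for 7/8 of individuals but are indistinguishable when only the presence of alleles A and B is scored. Observed genotype counts therefore cannot be compared with these expectations unless dosage is resolved.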

The presence of an uncertain number of allelic copies could also be problematic for sequence-based analyses; for example, in tests for selection where the relative frequency of particular alleles is informative or in calculation of inbreeding coefficients based on observed and expected heterozygosity (which would of course also apply to codominant markers). In addition, if there is sufficient divergence among duplicated copies that the different sets (homoeologues) segregate independently, then analyses that cannot distinguish between homoeologues could result in inaccurate inferences about population genetic structure and levels of genetic diversity.

In this section, we will discuss the implications of these general issues as well as specific problems or benefits associated with applying the most commonly used markers for population genetics to polyploid genomes. We have divided this into ‘traditional markers’ (AFLPs, microsatellites, Sanger sequencing) and ‘new markers’ (rapidly advancing deep sequencing approaches).

Traditional markers

AFLP

Amplified fragment length polymorphism fingerprinting has been popular in population genetics, especially in plants (Bensch & Åkesson 2005), where the frequency of polyploidy is high (Masterson 1994). It is attractive because a single fingerprint includes information for a large number of anonymous nuclear markers that are assumed to be scattered over the entire genome (Meudt & Clarke 2007). A disadvantage compared with codominant markers such as microsatellites (see below) is that AFLP markers are dominant (i.e. they contain no direct information on heterozygosity). In polyploids, however, this could actually be an advantage, because dominant scoring avoids problems with dosage uncertainty.

A further attractive feature of AFLPs is that fingerprints can in principle be simultaneously generated for diploids and polyploids, thus allowing interploidal comparisons. For this reason, AFLPs have frequently been used to reconstruct origins of allopolyploids (e.g. in Dactylorhiza, Hedrén et al. 2001; Achillea, Guo et al. 2005; and Ranunculus, Paun et al. 2006) and for the analysis of population structure and analysis of molecular variance (e.g. in polyploid Knautia, Kolář et al. 2012; and alpine Ranunculus, Burnier et al. 2009). However, these applications have revealed a potential drawback: polyploids tend to produce higher numbers of AFLP fragments than diploids (reviewed by Fay et al. 2005; Meudt & Clarke 2007). AFLP markers are prone to homoplasy (comigration of nonhomologous fragments), which increases in proportion to the total number of AFLP bands (Caballero & Quesada 2010).
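The link between band number and homoplasy can be illustrated with a deliberately simple toy calculation. The sketch below is not the model of Caballero & Quesada (2010); it assumes that nonhomologous fragments fall independently and uniformly into a fixed number of scorable size bins, which real fragment-size distributions do not strictly follow. The function name and the bin count of 400 are hypothetical.

```python
def comigration_probability(n_fragments, n_size_bins):
    """Probability that a given fragment shares its size bin with another.

    Toy model: each of the other n_fragments - 1 fragments falls
    independently and uniformly into one of n_size_bins bins.
    """
    if n_fragments < 1 or n_size_bins < 1:
        raise ValueError("need at least one fragment and one size bin")
    return 1.0 - (1.0 - 1.0 / n_size_bins) ** (n_fragments - 1)


# Compare profiles with increasing numbers of fragments per primer pair.
for n in (25, 50, 100, 200):
    print(n, round(comigration_probability(n, n_size_bins=400), 3))
```

Under this toy model the chance of comigration rises almost linearly with the number of fragments while profiles are sparse, which is consistent with the expectation that the richer profiles of polyploids carry a higher risk of homoplasy.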

AFLPs in species with larger genomes (higher ploidy levels) also frequently result in a small number of high-intensity fragments and many low-intensity fragments that are difficult to score, which effectively results in a relatively high frequency of null alleles. These phenomena have been attributed to repetitive elements related to retrotransposon activity (Fay et al. 2005), but it remains to be tested if they could cause any bias. Nevertheless, the sheer abundance of informative markers that AFLPs can generate appears to outweigh potential scoring issues. Hence, we conclude that AFLPs provide a powerful source of information for addressing questions related to origins of allopolyploids and population genetic structure.

Microsatellites (simple sequence repeats)

In population genetics, microsatellites are an attractive alternative to dominant AFLPs, because they are by nature codominant. This means that they allow heterozygotes and homozygotes to be distinguished directly (at least in diploids), which is important for inferring levels of inbreeding and using allele frequency-based inferences. Typical applications of microsatellites involve the analysis of population structure, genetic diversity and population differentiation (Sunnucks 2000). Moreover, if one is willing to assume certain models of repeat evolution, microsatellite data can be used to calculate migration rates or to reconstruct genealogies, which can be used to test models of demographic history based on coalescent models (e.g. Beaumont 1999). Next-generation sequencing (NGS) technologies now allow the efficient identification of large numbers of microsatellites at a fraction of the cost and effort of traditional approaches, so these markers will probably remain popular for population genetics studies, despite continuing advances in technology.

In polyploids, inability to reliably utilize codominant scoring reduces the usefulness of microsatellites relative to diploids and to AFLPs. The nature of the problem is best illustrated with an example. A tetraploid genotyped with three different alleles scored at a microsatellite locus could have three possible genotypes: AABC, ABBC or ABCC. If there is a null allele that does not amplify, the true genotype could be ABCX. Homoplasy could also result if there are stutter bands caused by replication slippage during the PCR process, which could make it look like the genotype was ABCD, when in fact D is not a true allele. Which genotype is correct would affect the allele frequency distribution of the alleles and in turn inferences about population genetic structure. Theoretically, allelic configurations for microsatellites could be resolved based on the ratios between peak intensities to determine the relative number of copies of each allele (MAC-PR method: Esselink et al. 2004), but in practice, this has only proved feasible in cases where segregation analyses within families were used to confirm dosage patterns; for example, in Rosa × hybrida (Esselink et al. 2004), Thymus praecox (Landergott et al. 2006) and Rorippa amphibia (Luttikhuizen et al. 2007). Such segregation data are essential to reliably resolve the exact allelic configuration based on peak intensities but are rarely collected in practice due to the extra samples, time and effort required to test families from each individual or even each population sampled. In addition, segregation data cannot be obtained in asexual polyploids. This effectively means that codominant microsatellite data have to be treated as dominant, which reduces the information content and precludes analyses that take into account observed heterozygosity of individuals or allele frequency distributions.
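The genotype ambiguity in the tetraploid example above can be reproduced with a few lines of code. The sketch below simply enumerates every genotype compatible with a set of observed alleles at a given ploidy, optionally allowing a non-amplifying null allele. It assumes single-character allele names, and the function name and the symbol X for the null allele are illustrative choices rather than part of any existing package.

```python
from itertools import combinations_with_replacement


def compatible_genotypes(observed_alleles, ploidy, allow_null=False, null_symbol="X"):
    """List all genotypes consistent with a set of observed alleles.

    Every observed allele must be present at least once; the remaining
    ploidy - len(observed) copies are filled with any observed allele or,
    if allow_null is True, with a non-amplifying null allele.
    """
    observed = sorted(set(observed_alleles))
    extra = ploidy - len(observed)
    if extra < 0:
        raise ValueError("more alleles observed than the ploidy allows")
    pool = observed + ([null_symbol] if allow_null else [])
    genotypes = set()
    for filler in combinations_with_replacement(pool, extra):
        genotypes.add("".join(sorted(observed + list(filler))))
    return sorted(genotypes)


print(compatible_genotypes("ABC", ploidy=4))
# ['AABC', 'ABBC', 'ABCC']
print(compatible_genotypes("ABC", ploidy=4, allow_null=True))
# ['AABC', 'ABBC', 'ABCC', 'ABCX']
print(len(compatible_genotypes("AB", ploidy=6)))
# 5
```

The number of compatible genotypes grows quickly with ploidy and with the number of unobserved copies, which is why dosage uncertainty becomes more severe above the tetraploid level.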

Null alleles are a further problem for use of microsatellites in polyploids. Null alleles of course form a general problem in population genetics for codominantly scored molecular markers (irrespective of ploidy), because they lead to an overestimation of homozygosity (e.g. see Dakin & Avise 2004). The risks could be magnified in polyploids (particularly allopolyploids) for several reasons. First, loci developed for one species may not amplify equally well in other species. This is a general problem regardless of ploidy level when distantly related taxa are compared with markers developed in only one of the taxa. However, allopolyploid taxa combine multiple diverged genomes in a single individual, so that even population genetic comparisons within a single species may be affected by null alleles. The severity of the problem depends on the degree of similarity between the homoeologues (Röder et al. 1995; McQuown et al. 2002). Second, polyploidization and hybridization often lead to increased transposon activity and sequence loss due to genomic rearrangements (Parisod et al. 2009), which could affect primer binding sites. Third, the presence of multiple alleles at each locus increases the chances of differential amplification of alleles (i.e. allelic dominance; Vergilino et al. 2009). The inability to test for the presence of null alleles is therefore especially problematic in polyploids, particularly when combined with dosage uncertainty.

Despite the complications associated with genotyping, microsatellites have been used to analyse population structure and address phylogeographic questions in polyploids. For example, dominantly scored microsatellites have been used to identify a cryptic invasive European lineage of hexaploid reed Phragmites australis in North America (Saltonstall 2003), to infer that multiple genotypes of the red alga Asparagopsis taxiformis have invaded the Mediterranean Sea (Andreakis et al. 2009) and that clonal diversity has increased in refugial island populations of octoploid prune tree Prunus lusitanica (García-Verdugo et al. 2013). In the relatively few cases where dosage has been determined reliably, microsatellites have provided powerful markers for polyploid population genetics and have the ability to include diploids and polyploids in the same analysis. For example, in a phylogeographic study of hawthorn (Crataegus), complete genotypes were resolved using peak ratios (Esselink et al. 2004) and used to show that diploid sexuals were more diverse than triploid apomicts (Lo et al. 2009). Codominantly scored microsatellites have also been used to show that Rorippa amphibia autotetraploid plants have higher genetic diversity than diploids, exactly matching predictions based on the larger effective population size of tetraploids (Luttikhuizen et al. 2007). In cases where resolving dosage is unrealistic (which is probably the case for ploidy levels higher than tetraploid), it is questionable if the increased information content per locus (i.e. multiple allelic states that can be identified) outweighs the loss of marker number compared with AFLPs and the increased risks of artefacts caused by null alleles and homoplasy. Although microsatellites are widely used, they cannot be used to their full potential in polyploids unless segregation is tested at each locus or until analytical solutions that can implement dosage uncertainty are adequately tested.
With future developments in NGS technologies, the sequencing of microsatellite alleles may someday replace current genotyping methods and allow the characterization of hundreds of individuals at thousands of loci (Guichoux et al. 2011). This would reduce the influence of homoplasy, provided that sequencing errors are minimized by bioinformatics treatment.

Sanger sequencing

A major advantage of using DNA sequences for population genetics compared with fragment-based analyses is that complex substitution models can be fit to the data (e.g. Swofford et al. 1996), which allows application of more rigorous tests of demographic history, genealogical relationships, migration rates, recombination and selection (e.g. Rozas & Rozas 1999). Different regions of DNA evolve at different rates and so can be used to address questions from relatedness among individuals to deep species relationships. For example, introns and noncoding sequences tend to evolve at a faster rate than coding regions and so can be useful for examining close relationships; analysis of SNPs across a wide range of genes has the potential to increase fine-scale resolution compared with focusing on single genes. In theory, models of evolution based on sequences can be extended to polyploids, as long as complete information can be obtained about nucleotide substitution patterns, heterozygosity and allele frequencies.

A disadvantage of using nuclear DNA sequences for analyses that rely on resolving patterns of allele sharing and observed heterozygosity is that even in diploids it is often difficult to resolve the phase of substitutions, meaning that labour-intensive cloning is required to determine the exact allelic composition in heterozygotes (Zhang & Hewitt 2003). Cloning is also required if heterozygotes include sequences of different lengths. Even for diploids it can also be difficult to distinguish paralogues (i.e. alleles arising from gene duplications) from orthologues (i.e. alleles that have arisen through common descent at a single locus) in gene families. These problems are exacerbated in polyploids due to the increase in the number of possible alleles at a locus, unknown copy number of genes, and reticulate evolution in allopolyploids.

As most polyploids undergo some degree of diploidization following the initial genome duplication event, there can be random losses of gene copies in different taxa or even in different individuals from the same taxa, leading to widespread presence–absence variation and copy number variation (CNV; Griffin et al. 2011). This makes resolution of phylogenetic trees and population genetic inferences difficult if orthologues cannot be reliably distinguished from paralogues. In allopolyploids, if there is high sequence conservation among parental copies, there is the added difficulty of identifying homoeologues, and origins through hybridization mean that assumptions of strictly bifurcating models of evolution are violated. One approach would be to focus on genes that do not remain duplicated in polyploids, but this in itself might be evidence that such genes are under selection, and so not strictly appropriate for population genetic tests that assume neutrality. Alternatively, network-based approaches that allow reticulation, such as SplitsTree (Huson & Bryant 2006), are frequently used to resolve origins and phylogenies of polyploids based on nuclear gene sequences (e.g. Schmickl et al. 2008; Brysting et al. 2011; Talavera et al. 2013). This approach can reduce problems associated with duplicate gene copies as well as hybridization if paralogues can be resolved based on phylogenetic clustering and then analysed separately by designing paralogue-specific primers (e.g. Evans et al. 2011).

Except for organelle DNA (mitochondria and chloroplasts) and ribosomal RNA repeats (which are both present in high copy number in each cell), traditional Sanger sequencing has required either a PCR or cloning step, with PCR the most popular since the early 1990s (Swofford et al. 1996). However, this means that DNA sequencing suffers from some of the same problems as PCR-based fragment analyses (e.g. microsatellites): lack of ability to determine allelic dosage; uneven amplification of alleles; and possibility of null alleles. In addition, increasing the number of alleles at a locus and/or the number of gene copies increases the risk of artefacts due to recombination during the PCR process, and cloning is nearly always required if there are more than two alleles at a locus. Although the proportion of clones of a particular allele could be used as an indication of its relative dosage, this would require even amplification of each allele; there is also a risk of missing alleles (i.e. null alleles) if some alleles amplify less strongly than others and if insufficient numbers of clones are sequenced. Particularly for polyploids arising through hybridization, a substantial challenge when using cloning is to distinguish real recombinants among parental copies from PCR-based artefacts (e.g. Jørgensen et al. 2012). However, with sufficient effort, even complex gene families can be resolved and interpreted in polyploids using segregation analyses and cloning (Mable et al. 2004). Thus, the problem is not as fundamentally insurmountable as for microsatellites.
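The risk of missing alleles when too few clones are sequenced can be quantified with a simple sampling calculation. The sketch below is an illustration under stated assumptions rather than an established protocol: clones are treated as independent draws from the pool of amplified products, the share of each allele in that pool is taken as known, and PCR recombinants and sequencing errors are ignored. The function names and the example proportions are hypothetical.

```python
from itertools import combinations


def prob_all_alleles_recovered(proportions, n_clones):
    """Probability that every allele appears at least once among n clones.

    proportions: share of each allele among the amplified products
    (should sum to 1). Clones are assumed to be independent draws.
    Uses inclusion-exclusion over the sets of alleles that are missed.
    """
    total = 0.0
    for r in range(len(proportions) + 1):
        for missed in combinations(proportions, r):
            total += (-1) ** r * (1.0 - sum(missed)) ** n_clones
    return total


def min_clones_needed(proportions, target=0.95, max_clones=1000):
    """Smallest number of clones that recovers all alleles with probability >= target."""
    for n in range(1, max_clones + 1):
        if prob_all_alleles_recovered(proportions, n) >= target:
            return n
    return None


# Tetraploid full heterozygote (ABCD) with even amplification of all alleles.
even = [0.25, 0.25, 0.25, 0.25]
# Same genotype, but one allele amplifies poorly (hypothetical shares).
uneven = [0.4, 0.3, 0.2, 0.1]

print(min_clones_needed(even))
print(min_clones_needed(uneven))
```

Comparing the two calls shows how uneven amplification increases the number of clones required, because the weakest allele dominates the probability of an incomplete genotype.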

Despite these caveats, DNA sequencing has revealed important insights into polyploid evolution and still holds the greatest potential for population genetic inferences. It was in allopolyploid cotton that it was first discovered that ribosomal gene arrays, which had been assumed to evolve under complete concerted evolution so that every copy in an individual is identical in sequence (Hillis & Davis 1988), could include multiple sequence types (Wendel et al. 1995). Furthermore, it was demonstrated that copies could be present from either parent or both and that this could vary by individual. There have been extensive studies investigating phylogeography in closely related diploids and polyploids using organelle sequences for both animals (e.g. Ptacek et al. 1994; Ludwig et al. 2001; Tsigenopoulos et al. 2002; Stenberg et al. 2003; Evans et al. 2004; Stöck et al. 2005; Culling et al. 2006; Lampert & Schartl 2008) and plants (e.g. Soltis et al. 1989; Brochmann et al. 1992; Van Dijk & Bakx-Schotman 1997; Segraves et al. 1999; Wu et al. 2010); because of their uniparental inheritance and lack of variation among copies within individuals, they can be treated as effectively equivalent in diploids and polyploids. Many studies have also combined nuclear and organelle sequence data to investigate complex evolutionary histories of polyploids in both plants (Soltis & Soltis 2000; Baumel et al. 2002; Huang et al. 2002; Schmickl et al. 2008; Ainouche et al. 2009; Krak et al. 2013) and animals (Evans et al. 2005; Holloway et al. 2006; Saitoh et al. 2010), and the combination of organelle and nuclear data can help to disentangle incomplete lineage sorting from past hybridization events (e.g. Vergilino et al. 2011). Some studies have combined organelle or nuclear genes with other types of markers such as AFLPs (e.g. Burnier et al. 2009; Ma et al. 2010) to resolve polyploid complexes.
Given the rapid developments in sequencing technology, resolution of complete genotypes in polyploids should be achievable in the near future, but the fundamental issues related to interpreting sequence variation in duplicated genes (i.e. assigning alleles to loci, distinguishing phase, resolving copy number and allelic dosage, inferring recombination) remain a substantial challenge.

New markers

Rapid advances in technology enabling whole-genome perspectives on genetic variation hold great promise for increasing the range of inferences possible using polyploid genomes (reviewed by Aversano et al. 2012; Buggs et al. 2012; Egan et al. 2012; Madlung 2013) but cannot yet solve all of the issues with previous markers and introduce some of their own challenges. Researchers working on polyploid genomes have been at the forefront of advanced genomic approaches for understanding changes in gene expression, epigenetics and genome shock associated with hybridization and gene duplication (Ainouche & Jenczewski 2010; Stöck & Lamatsch 2013). Although this is at least partly due to the fact that many economically important crop plants (reviewed by Edwards et al. 2013) and fish (reviewed by Mable et al. 2011) are polyploid, important genome-scale insights have also been obtained from nonmodel organisms with intriguing evolutionary histories of recent polyploidy, such as Spartina (Ainouche et al. 2004; Salmon et al. 2005; Chelaifa et al. 2010; de Carvalho et al. 2013), Senecio (Hegarty et al. 2006, 2008, 2009) and Tragopogon (Soltis et al. 2004; Buggs et al. 2009, 2010, 2012).

While there has as yet been little focus on implications of polyploidy for population genomics, it will still be critical to resolve issues associated with gene duplication, allelic dosage, copy number variation, resolution of homoeologues, and recombination. In addition, reliable assembly of duplicated genes, repetitive sequences and highly divergent regions of polymorphism remains one of the largest challenges for whole-genome reconstruction and annotation; even genomes that are considered well-resolved (e.g. Arabidopsis thaliana) retain uncertainty in these types of regions. In addition, most NGS methods currently suffer from higher error rates than traditional Sanger sequencing, which can introduce additional biases; while this problem applies equally to diploids, dosage uncertainties again make the problem potentially more difficult to solve in polyploids. However, the major advantage is the overwhelming number of sequence-based characters available for population genetics analyses of nonmodel species and being able to take a genomewide perspective on consequences of introgression through hybridization, fate of duplicate genes, and patterns of selection and recombination.

Below we outline some of the main types of characters that have been used in population genomic approaches and discuss current strategies for dealing with polyploid genomes. A major difference with NGS approaches is that technology and analyses are advancing so quickly that there is not a ‘stable state’ of issues and solutions that can be applied as easily as for the older methods. We expect that it will soon be possible to apply the same types of population genetic analyses developed for traditional Sanger sequencing at a whole-genome scale, but it is the sheer volume of data that will be the biggest challenge for implementation. We thus concentrate the review on where we think the major challenges currently lie in generating the data, rather than making specific recommendations for application of population genetic models to NGS data obtained from polyploids.

Genome-wide SNP markers

Development of microarray technology was the first major advance in making genome-scale approaches accessible to ecological and evolutionary questions (e.g. Gibson 2002; Shiu & Borevitz 2006). Although microarrays have been applied to interesting questions related to gene expression in polyploids (Chen et al. 2004; Slotte et al. 2007; Buggs 2008; Hegarty et al. 2008, 2009; Mavarez et al. 2009; Chelaifa et al. 2010; Flagel & Wendel 2010; Pignatta et al. 2010; de Carvalho et al. 2013), a major issue is with unknown copy number changes between the individuals compared on the array, which could lead to spurious conclusions about expression differences. Although this could theoretically be corrected using DNA arrays to estimate copy number (Auer et al. 2007), inability to distinguish sequence divergence (i.e. preventing hybridization on the arrays) from loss of duplicated copies, could affect such interpretations (e.g. Parkin et al. 2010). Expression changes in allopolyploids can also be highly complex. For example, detailed studies using cDNA-AFLP approaches have clearly demonstrated that not only changes in gene expression but stochastic loss or over-representation of parental copies occur frequently in newly synthesized polyploids (e.g. Wang et al. 2006; Gaeta et al. 2007; Buggs et al. 2009, 2010; Jackson & Chen 2010). Thus, differences in hybridization of paralogues have represented an important challenge for microarray-based studies of changes in gene expression following polyploidization.

Transcriptome analyses using RNA sequencing (RNA-seq) hold more promise for distinguishing the evolutionary dynamics of duplicate genes, because they should not be as sensitive to bias in the representation of paralogous copies. As for all analyses of polyploids, emerging results are complex but intriguing (de Carvalho et al. 2013). Large genome size, large gene families and high repetitive sequence content remain problematic for genome and transcriptome assembly, particularly in nonmodel organisms (e.g. Vijay et al. 2013), but new approaches are constantly being developed that could improve resolution of polyploid genomes. For example, following up on microarray-based experiments (Flagel et al. 2008; Flagel & Wendel 2010; Salmon et al. 2010), Yoo et al. (2013) used Illumina technology to sequence the transcriptomes of wild and cultivated cotton to distinguish between expression changes due to biases in which parental genome is expressed in an allopolyploid and ‘dominance’ in the expression patterns from one parent (i.e. where hybrids show similar expression patterns to those in one parent, rather than preferentially expressing the allelic copy from one parent; reviewed by Buggs 2013). Such complications emphasize that even with advanced technology, phylogenetic and population genetic analyses of polyploids could remain problematic due to their biology, rather than just methodological issues.

Despite these issues, SNP arrays based on transcriptome analyses have led to useful insights into the population genetics of polyploid organisms (e.g. Atlantic salmon: Bourret et al. 2012). For example, based on 454 transcriptome sequencing of polyploid wheat genomes, Lai et al. (2012) modified a tool developed for SNP detection in diploid crop plants (AutoSNPdb) to enable integration of SNP and gene annotation information with a graphical viewer even for such highly complex genomes. In polyploid sturgeons, Hale et al. (2009) applied a rarefaction approach taken from theoretical ecology to assess the relationship between sequence coverage and gene discovery and discussed whether normalization is a useful approach to reduce coverage of repetitive sequences such as rRNA subunits. Normalization could be particularly problematic for polyploids because relative levels of gene expression among homoeologues are often of particular interest for understanding evolutionary and functional processes in polyploids and so important information might be lost through the standardization. In addition, if diploids and polyploids are included in the same analyses, it might not be possible to apply a single normalization strategy to all individuals, due to differences in relative coverage. Although some success has been achieved using distant diploid relatives as references (e.g. Everett et al. 2011), the current lack of sequenced polyploids also hinders assembly and resolution of SNPs for most polyploid genomes.
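The rarefaction idea mentioned above can be illustrated with a minimal sketch. It uses the classical rarefaction formula from ecology (the expected number of distinct categories in a random subsample) and is not a reconstruction of the analysis of Hale et al. (2009); the function name and the read counts are hypothetical.

```python
from math import comb


def rarefied_gene_count(reads_per_gene, subsample_size):
    """Expected number of distinct genes seen in a random subsample of reads.

    Classical rarefaction: a gene with n_i reads is missed only if every
    one of the `subsample_size` reads is drawn from the other N - n_i reads.
    """
    total = sum(reads_per_gene)
    if not 0 <= subsample_size <= total:
        raise ValueError("subsample_size must lie between 0 and the total read count")
    denom = comb(total, subsample_size)
    expected = 0.0
    for n_i in reads_per_gene:
        # math.comb returns 0 when the subsample is larger than N - n_i.
        missed = comb(total - n_i, subsample_size) / denom
        expected += 1.0 - missed
    return expected


# Hypothetical read counts for five genes with very uneven expression.
example_reads_per_gene = [50, 30, 15, 4, 1]
for depth in (1, 10, 50, 100):
    print(depth, round(rarefied_gene_count(example_reads_per_gene, depth), 2))
```

A curve that flattens well before the full read count suggests that additional sequencing would mostly resample abundant transcripts, which is the situation in which normalization is usually considered.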

Continuing technological developments mean that genomic-based SNP generation is now also feasible, even in large polyploid genomes. However, problems with distinguishing between paralogous copies and the presence of high copy numbers of repetitive elements in many polyploids (Leitch & Leitch 2008; Koukalova et al. 2010; Buggs et al. 2012; Piednoël et al. 2012) mean that full-genome annotations remain challenging (e.g. Seeb et al. 2011a; Brenchley et al. 2012; Wang et al. 2012), reducing the potential to interpret population genomics patterns in the context of potential for selection. In some instances, duplicated genes are intentionally excluded to simplify genomic assembly, with linkage maps based on only the nonduplicated portion of the genome (Everett et al. 2011, 2012). As distinguishing what types of genes are retained in duplicate is often a critical goal to understand selection pressures following gene duplication (e.g. Birchler & Veitia 2007), this could be an important omission. Nevertheless, whole-genome-based population genetic inferences on polyploid genomes are starting to emerge. Hollister et al. (2012) resequenced 12 individual plants from four populations of tetraploid Arabidopsis arenosa and aligned them to reference sequences from two diploid relatives (Arabidopsis thaliana and Arabidopsis lyrata) and used the three-way comparisons to interpret patterns of selection in the tetraploid genome. The novelty was that they also tested the mode of inheritance using a simulation approach compared to the observed SNP frequency distribution. Although only a portion of the sequence space that was found at a threshold read depth in A. arenosa and aligned to both other genomes could be used, the study demonstrated the utility of implicitly considering the different types of allele-frequency spectra expected in polyploids into analyses of selection at a genomewide scale.

There have already been some developments in strategies for incorporating gene duplication into models of genome assembly, and we anticipate that continuing improvements in both sequencing technology and bioinformatics pipelines will result in generation of well-annotated and complete polyploid genomes in the near future. Increasing the stringency (e.g. allowing differentiation of two divergent sequences as two different loci and not two alleles from the same locus) when assembling genomes may help to eliminate combining paralogues during SNP discovery analyses and could help to differentiate homoeologous sequences from each other in allopolyploids (Hohenlohe et al. 2011). For example, the Stacks software (Table 1; Catchen et al. 2013, 2011), which operates by ordering matching reads into different short-read ‘stacks’, could allow differentiation of paralogous (or homoeologous) from homologous sequences. By increasing the number of ‘stacks’ per locus in the module USTACKS (Catchen et al. 2013) and modulating the mismatch parameter used to produce these ‘stacks’, the user should be able to differentiate alleles from duplicated genes as well as alleles from homoeologous loci in allopolyploids (depending on the divergence between homoeologous loci). However, increasing the stringency of the assembly risks separating polymorphic loci that include highly divergent alleles at single loci (e.g. immune genes at the Major Histocompatibility Complex, MHC) into multiple loci (Seeb et al. 2011a). Comparison with a completely resolved and annotated reference genome is needed to distinguish divergent alleles from duplicated loci (Wang et al. 2013). Thus, there remains the circular problem of initially resolving duplicated or highly divergent genomes.
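The trade-off between merging paralogues and splitting divergent alleles can be shown with a toy example. The sketch below is a simple greedy clustering by Hamming distance, written only to illustrate the effect of a mismatch threshold; it is not the algorithm implemented in USTACKS, and the function names and read sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatches):
    """Greedy clustering: a read joins the first cluster whose seed (first
    member) is within `max_mismatches`; otherwise it seeds a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if hamming(read, cluster[0]) <= max_mismatches:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Hypothetical reads: two alleles of one locus and one paralogous copy.
allele_a = "ACGTACGTACGT"
allele_b = "ACGTACGTACGA"   # 1 mismatch from allele_a
paralogue = "ACGTTCGTAGGA"  # 3 mismatches from allele_a
example_reads = [allele_a, allele_b, paralogue]

for threshold in (0, 1, 3):
    print(threshold, len(cluster_reads(example_reads, threshold)))
```

With a threshold of 1 the two alleles are grouped and the paralogue is kept apart; a threshold of 3 merges all three sequences into one putative locus, and a threshold of 0 splits the two alleles into separate loci.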

Table 1. Software and statistical packages used in population genetics and population genomics studies on polyploid or mixed-ploidy level populations, including the type of polyploids for which they are applicable, whether or not they support large datasets (i.e. for analysis of next-generation sequence data), what types of markers they have been developed for, and the operating systems on which they can be run

| Software | Type of polyploids | Supporting large datasets | Marker type | Operating system |
| --- | --- | --- | --- | --- |
| Assembly, SNP discovery and genotyping | | | | |
| CLC Bio genomic workbench (http://www.clcbio.com/products/clc-genomics-workbench/) | All | Yes | Sequences, SNP | Mac OS X, Windows, Unix |
| Genome Analysis Tool Kit (GATK) (http://www.broadinstitute.org/gatk/) | All | Yes | Sequences, SNP | Mac OS X, Windows, Unix |
| Stacks (http://creskolab.uoregon.edu/stacks/) | Diploids^a | Yes | Sequences, SNP | Mac OS X, Linux |
| fitTetra R package (http://www.wageningenur.nl/en/show/Software-fitTetra.html) | Tetraploids | Yes | Bi-allelic | Mac OS X, Windows, Unix |
| SuperMASSA (http://statgen.esalq.usp.br/SuperMASSA/) | All | Yes | SNP | Online |
| Distance-based methods | | | | |
| popdist (http://genetics.agrsci.dk/~bernt/popgen/) | Asexuals (mixed ploidies) | No | SSR | Mac OS X, Windows, Unix |
| Estimation of allele frequencies and F-statistics based methods | | | | |
| polySegratio/polySegratioMM (http://cran.r-project.org/web/packages/polySegratio/index.html; http://cran.r-project.org/web/packages/polySegratioMM/index.html) | Autopolyploids | Yes | SNP, AFLP, SSR | Mac OS X, Windows, Unix |
| atetra (http://www.vub.ac.be/APNA/ATetra.html) | Tetraploids | No | SSR | Windows |
| StAMPP R package (http://cran.rproject.org/web/packages/StAMPP/index.html) | Mixed ploidies | Yes | SNP | Mac OS X, Windows, Unix |
| Bayesian clustering methods | | | | |
| Structure (http://pritch.bsd.uchicago.edu/structure.html) | Autopolyploids | Yes | SNP, SSR | Mac OS X, Windows, Unix |
| InStruct (http://cbsuapps.tc.cornell.edu/InStruct.aspx) | Autopolyploids, Allopolyploids | Yes | SNP, SSR | Mac OS X, Windows, Linux, Online |
| Packages implementing multiple methods | | | | |
| adegenet R package (http://cran.r-project.org/web/packages/adegenet/) | Various but no mixed ploidies | Yes | All | Mac OS X, Windows, Linux |
| PolySat R package (http://openwetware.org/wiki/Polysat) | Polysomic inheritance (mixed ploidies) | No | SSR | Mac OS X, Windows, Linux |
| SPAGeDi (http://ebe.ulb.ac.be/ebe/Software.html) | Autopolyploids | Yes | Dominant, Codominant | Mac OS X, Windows, Linux/Unix |
| GenoType/GenoDive (www.patrickmeirmans.com/software/GenoDive.html) | Asexuals (mixed ploidies), Polysomic inheritance | Yes | Dominant, Codominant | Mac OS X, Windows |

AFLP, amplified fragment length polymorphism; SNP, single nucleotide polymorphisms; SSR, simple sequence repeats.

^a Although this software is primarily adapted for diploids, some studies have used this software more or less successfully to analyse SNPs in polyploids (Ogden et al. 2013; Wang et al. 2013).

Another important issue related to all current NGS-sequencing approaches has to do with error rates. While the scale of the problem varies by method, for all current methods heterozygote genotypes can be falsely produced by the incorporation of spurious mutations during the sequencing (or amplifying) steps, and heterozygotes can be missed with insufficient sequence coverage. Taking into account the sequencing error rate and the depth of coverage is critical for properly characterizing homozygote and heterozygote genotypes and estimating allele frequencies, even in diploid populations (Lynch 2009; Hohenlohe et al. 2010). However, as the depth of coverage used to sequence and detect variants has to be sufficient to sample all variants present at a given locus, it should be increased proportionately to the ploidy level to account for the possibility of increased number of alleles. Again, dosage uncertainty in polyploids means that a simple calculation of read number in relation to expected heterozygosity at a given locus cannot be used to predict whether there has been sufficient coverage, as has been used for diploids (Catchen et al. 2013). There would also be difficulties with combining different ploidy levels in the same analysis, as it would be difficult to completely normalize read depths.
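To give a sense of how the required depth scales with ploidy, the sketch below computes the chance of missing an allele altogether. It assumes unbiased sampling of reads from all chromosome copies, no sequencing error, and that a single read is enough to detect an allele; real variant callers require more evidence, so these figures are lower bounds. The function names are hypothetical.

```python
import math


def prob_allele_missed(dosage, ploidy, depth):
    """Probability that none of `depth` reads carries an allele present in
    `dosage` of the `ploidy` chromosome copies (unbiased sampling assumed)."""
    if not 0 < dosage <= ploidy:
        raise ValueError("dosage must be between 1 and ploidy")
    return (1.0 - dosage / ploidy) ** depth


def min_depth_to_detect(dosage, ploidy, target=0.95):
    """Smallest depth at which the allele is seen at least once with
    probability >= target, under the same assumptions."""
    if dosage == ploidy:
        return 1  # every read carries the allele
    miss_per_read = 1.0 - dosage / ploidy
    return math.ceil(math.log(1.0 - target) / math.log(miss_per_read))


for ploidy in (2, 4, 6):
    print(ploidy, min_depth_to_detect(1, ploidy))
```

Under these assumptions the depth needed to see a single-copy allele at least once with 95% probability is 5 reads in a diploid, 11 in a tetraploid and 17 in a hexaploid, which is roughly proportional to ploidy. Because the true dosage is unknown in advance, the single-copy case is the conservative one to plan for.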

Various genomic assemblers (see Table 1) such as the CLCbio genomic workbench and the Genome Analysis Tool Kit (GATK; McKenna et al. 2010; DePristo et al. 2011) can incorporate the ploidy level as a parameter to discover or estimate the presence of variants in polyploids. The CLCbio genomic workbench uses a modified version of Neighbourhood Quality Standard (Altshuler et al. 2000; Brockman et al. 2008) to detect variants, taking into account the quality of the sequences. GATK, an open-source community platform, uses a Bayesian framework, taking into account phred quality score (Ewing et al. 1998) to disentangle spurious mutations from real variants (McKenna et al. 2010; DePristo et al. 2011). However, these approaches still often consider true variants to have a frequency of 0.5 in heterozygous genotypes and so might not be directly applicable to assessing reliability of SNP calls in polyploids. Simulation studies are required to assess how sensitive such approaches might be to assuming diploid inheritance in polyploid genomes or to individual loci showing polysomic inheritance, and to predict what types of biases might result.
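Why an assumed allele frequency of 0.5 in heterozygotes matters can be seen from a minimal likelihood sketch. It is a plain binomial model with a symmetric per-base error rate and is not the model implemented in GATK or the CLC workbench; the function names and read counts are hypothetical.

```python
from math import comb


def phred_to_error(q):
    """Convert a phred quality score to an error probability."""
    return 10 ** (-q / 10)


def dosage_likelihoods(alt_reads, total_reads, ploidy, error_rate):
    """Binomial likelihood of the observed read counts for every possible
    dosage (0..ploidy copies) of the alternative allele."""
    likelihoods = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        # Chance that a read shows the alternative base, allowing miscalls
        # in both directions at the same rate (a simplifying assumption).
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        lik = (comb(total_reads, alt_reads)
               * p_alt ** alt_reads
               * (1 - p_alt) ** (total_reads - alt_reads))
        likelihoods.append(lik)
    return likelihoods


def most_likely_dosage(alt_reads, total_reads, ploidy, error_rate):
    """Dosage class with the highest likelihood."""
    liks = dosage_likelihoods(alt_reads, total_reads, ploidy, error_rate)
    return max(range(len(liks)), key=liks.__getitem__)


error = phred_to_error(20)  # hypothetical base quality of Q20
print(most_likely_dosage(10, 40, ploidy=4, error_rate=error))
print(most_likely_dosage(10, 40, ploidy=2, error_rate=error))
```

With 10 of 40 reads carrying the alternative base, the tetraploid model favours one copy out of four, whereas a diploid model can only call a heterozygote whose expected read fraction is 0.5, so information on dosage is lost.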

For high-throughput SNP-genotyping platforms, there are some analytical approaches that can incorporate partial heterozygosity (i.e. heterozygotes with different dosage patterns), and we suggest that this is an area where further analytical solutions should continue to be developed, not only for these rapid genotyping methods but also for assessing reliability of SNPs obtained from whole-genome sequences. Using mixture models, the fitTetra r package allows genotyping and estimation of partial heterozygote tetraploid individuals using data obtained from high-throughput SNP genotyper platforms (Voorrips et al. 2011). Serang et al. (2012) have provided a Bayesian algorithm to genotype individuals and estimate SNP frequencies in populations with complex mixed-ploidy levels, which is currently compatible with Illumina GoldenGate assays and the Sequenom iPlex MassARRAY®. This algorithm is implemented in the software SuperMASSA (see Table 1). Once again, the problem of uncertainty in allele dosage remains a challenge: both software packages assume that the intensity of hybridization is directly proportional to the copy number (i.e. allelic dosage) at a given SNP site, which has not been systematically tested. Simulation studies to assess the sensitivity of these types of analyses to deviations from the expected dosage should be conducted to evaluate the utility of such approaches and identify where improvements should be made.
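The sensitivity of dosage calls to the proportionality assumption can be illustrated with a toy calculation. The sketch below assigns each sample to the dosage whose expected signal ratio is nearest, and then applies a hypothetical allele-specific signal bias. It is far simpler than the mixture model of fitTetra or the Bayesian approach of SuperMASSA and is meant only to show the direction of the error; all names and the bias value are assumptions.

```python
def expected_ratio(dosage, ploidy, bias=1.0):
    """Expected fraction of the total signal coming from the alternative
    allele when its signal is scaled by `bias` (bias = 1 means the signal
    is exactly proportional to dosage)."""
    alt = dosage * bias
    ref = ploidy - dosage
    return alt / (alt + ref)


def call_dosage(ratio, ploidy):
    """Assign the dosage whose unbiased expected ratio d / ploidy is closest."""
    return min(range(ploidy + 1), key=lambda d: abs(ratio - d / ploidy))


ploidy = 4
for bias in (1.0, 2.0):
    calls = [call_dosage(expected_ratio(d, ploidy, bias), ploidy)
             for d in range(ploidy + 1)]
    print(bias, calls)
```

With no bias every dosage is recovered. When the alternative allele gives twice the signal per copy, a simplex genotype yields a ratio of 0.4 and is called as duplex, and a duplex genotype is called as triplex, so allele frequencies estimated from the calls would be biased upwards.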

Multiplex amplicon sequencing

High-throughput targeted sequencing approaches hold great promise for understanding the evolutionary history of polyploid organisms and for identifying patterns of genetic diversity at adaptively important genes. This method has been used, for example, as a ‘digital cloning’ approach to resolving complex gene families in autotetraploid plants (Jørgensen et al. 2012). However, although the approach is more efficient than cloning in terms of coverage of amplicon products and confidence in resulting genotyping, potential biases associated with PCR-based techniques are not completely solved by a deep-sequencing approach. Uneven representation of allelic products can still be apparent within and between individuals or between PCR runs, and PCR recombinants can remain difficult to distinguish from genuine recombinant alleles. Differences in annealing of the tagged primers in allopolyploids due to divergence between the parental sequences could also complicate the interpretation of parental genome contributions (e.g. Bundock et al. 2009). Nevertheless, tagged amplicon sequencing has been applied to allopolyploids to simultaneously investigate linkage of multiple homologues of candidate genes coding for important traits (e.g. Gholami et al. 2012) and to investigate phylogeography of polyploids using a combination of nuclear and organellar genes (Griffin et al. 2011). Lessons learned from the analysis of complex gene families in diploids (e.g. MHC: Sommer et al. 2013) will be a useful source of solutions to increasing genotype reliability using tagged amplicons, which can be applied to both diploids and polyploids. There has been a recent switch to using Illumina-based sequencing technology, which produces shorter sequences but with lower rates of error than for 454; the rapid advances in both the technology (e.g. read length) and analyses (e.g. methods for detecting chimeric sequences, Quince et al. 2011) of these types of data should further increase the utility of this approach to applying population genetics models to sequences obtained from duplicated sequences.

Targeted sequence capture

Another type of approach that is increasingly being applied and that holds great promise for isolating multiple whole genes for use in population genetic studies of polyploids is the enrichment or targeting of particular parts of the genome (targeted sequence capture). Salmon et al. (2012) analysed heterozygosity of hundreds of homoeologous genes in wild and domesticated cotton Gossypium hirsutum with the aid of custom hybridization probes (targeting 500 pairs of homoeologues from the transcriptome). A similar approach was used to sequence 56.5 Mb of genomic DNA from allohexaploid bread wheat (Winfield et al. 2012) to assess variation at 500 000 SNPs, not only among gene copies but also among varieties. Bundock et al. (2012) used information from Sorghum (Sorghum bicolor) to capture the sequences of two closely related sugarcane genotypes (Saccharum officinarum and a hybrid cultivar) and were able to develop SNPs using Agilent Sure Select arrays and Illumina sequencing. The approach has also been applied to highly complex gene families (plant resistance genes), not only to identify already known genes but also to find hundreds more copies than had been identified from scans of complete genome sequences (Jupe et al. 2013), and to pull out orthologous sequences from distantly related plant species (potato and tomato). O'Neill et al. (2013) applied parallel tagged amplicon sequencing to better resolve species boundaries in Ambystoma tigrinum, a species with a large and complex genome. EST information from two related species was used, and 95 PCR-targeted unlinked nuclear loci in 93 individuals were used to assign individuals to different geographical regions using the structure software (Pritchard et al. 2000). This combined sequencing and bioinformatics approach resulted in a genomewide data set with relatively low levels of missing data and a wide range of nucleotide variation.
The advantage of these types of methods for polyploids is that problems with unequal coverage across the genome due to large size and duplications would be reduced by focusing on a smaller number of target genes, for which read depth could be optimized to allow inference of number of alleles. Although it is not yet feasible to reliably infer copy number, given that this is also an area of concern for duplicated genes in diploids, we predict that creative solutions will appear in the near future.

Genotyping by sequencing

A currently expanding area of research is the use of complexity-reducing techniques to enable population-scale analyses of nonmodel organisms. ‘Genotyping by sequencing’ approaches are one such class of methods. Although there are a variety of approaches, restriction-site-associated DNA (RAD) sequencing (Baird et al. 2008) has been used the most frequently for population genetic applications (Hohenlohe et al. 2010, 2011; Rowe et al. 2011). RAD-Seq provides the ability to examine tens of thousands of genetic loci simultaneously in groups of individuals. The principle of this approach is similar to AFLPs in that genomic DNA is cut with restriction enzymes, but the digested fragments are then ligated to adapters and bar-coded to enable multiplex sequencing using NGS platforms. It yields two kinds of data: presence–absence of markers resulting from polymorphism in the restriction enzyme cut site, and substitutional (SNP, indel) markers in tagged sequences. For polyploids, the advantage is that, with sufficient coverage, it should be possible to obtain all four copies (in a tetraploid) at a given polymorphic site and so theoretically determine allelic dosage. However, this assumes no bias in representation of allelic copies and equal read coverage across all loci, so that sequences can be normalized to a standard. Currently, this is not feasible even in diploids but if possible, would lead to a major breakthrough in sequence-based analyses of polyploid genomes. Although phase of substitutions is limited to a relatively short fragment of DNA flanking each cut site, the use of paired-end sequencing with a reference genome or using more than one restriction enzyme (double digest RAD: Peterson et al. 2012) has the potential to distinguish between paralogues by considering patterns of nucleotide substitutions over a larger sequence fragment and so to enable multilocus haplotype-based analyses (e.g. structure analyses: Pritchard et al. 2000; Falush et al. 2007).
One important drawback of RAD sequencing is the fact that mutations at restriction sites will make it impossible to observe the associated SNP allele, resulting in allele dropout. In addition, if restriction digest sites are present in transposons, large numbers of reads will not be informative; thus, stringent data filters are required (Twyford & Ennos 2012). Simulation studies have shown that including loci with missing data can lead to an over-estimation of FST values (Arnold et al. 2013; Gautier et al. 2013). The ascertainment of sites with missing data will be even more important in polyploids, given their duplicated loci. Simulation studies are required to better assess the effects of allele dropout in both auto- and allopolyploid organisms. The major advantage compared with AFLPs and microsatellites is being able to apply a testable model of evolution to the data and so increase the scale of inference possible about evolutionary and demographic processes.
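The principle of RAD tags and the origin of allele dropout can be shown with a small in silico digest. This is a deliberately simplified sketch: it takes only the bases downstream of an intact recognition sequence, ignores the position of the cut within the site and the upstream flank, and uses short made-up haplotypes. The function names and sequences are hypothetical; the recognition sequence is that of SbfI.

```python
from collections import Counter

SITE = "CCTGCAGG"  # 8-bp recognition sequence of SbfI, used for illustration


def rad_tags(sequence, site=SITE, tag_length=6):
    """Return the `tag_length` bases downstream of every intact site."""
    tags = []
    start = sequence.find(site)
    while start != -1:
        tag_start = start + len(site)
        tag = sequence[tag_start:tag_start + tag_length]
        if len(tag) == tag_length:
            tags.append(tag)
        start = sequence.find(site, start + 1)
    return tags


def observed_alleles(haplotypes, site=SITE, tag_length=6):
    """Count the tags recovered across the chromosome copies of one individual."""
    counts = Counter()
    for hap in haplotypes:
        counts.update(rad_tags(hap, site, tag_length))
    return counts


# Hypothetical tetraploid carrying three distinct haplotypes at one locus.
hap_1 = "TTAACCTGCAGGACGTACGG"  # intact site, tag ACGTAC
hap_2 = "TTAACCTGCAGGACGAACGG"  # intact site, SNP in the tag (ACGAAC)
hap_3 = "TTAACCTGCAGAACGTTCGG"  # mutated site: no tag is produced
example_tetraploid = [hap_1, hap_1, hap_2, hap_3]

print(observed_alleles(example_tetraploid))
```

Only three of the four copies yield a tag, so the third allele is never observed and the apparent dosage is 2:1 among the recovered copies. With unknown dosage, such an individual cannot easily be told apart from one in which all four copies were sampled unevenly.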

So far, most studies that have used RAD sequencing for mapping have excluded potential paralogues in downstream analyses (e.g. sockeye salmon: Everett et al. 2012), but testing segregation of variants within families could help to distinguish how many copies are present at a particular RAD ‘locus’ (i.e. the contiguous sequence next to a cut site). For allopolyploids, if it is possible to separate reads into the diploid contributions from each parent, then data can be analysed as if it were effectively diploid. For example, Hohenlohe et al. (2011) distinguished candidate SNPs for differentiation between Oncorhynchus mykiss and native westslope cutthroat trout (Oncorhynchus clarkii lewisi) by detecting excessively high observed heterozygosity and deviations from HW equilibrium. However, they appear to have assumed strict disomic inheritance; again, uncertainties in segregation patterns at each locus would affect the model for expected genotype distributions and so could bias these types of analyses.
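A heterozygosity-based filter of this general kind can be sketched as follows. The code assumes a bi-allelic locus scored as diploid with strict disomic inheritance, which is exactly the assumption questioned above, and the threshold is an arbitrary illustrative value. It is not the procedure used by Hohenlohe et al. (2011), and the function names are hypothetical.

```python
def heterozygosity(n_aa, n_ab, n_bb):
    """Observed and HW-expected heterozygosity at a bi-allelic locus that
    has been scored as if it were diploid."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)
    return n_ab / n, 2 * p * (1 - p)


def flag_possible_paralogue(n_aa, n_ab, n_bb, max_excess=0.2):
    """Flag loci whose observed heterozygosity exceeds the diploid HW
    expectation by more than `max_excess` (an arbitrary threshold)."""
    observed, expected = heterozygosity(n_aa, n_ab, n_bb)
    return observed - expected > max_excess


print(flag_possible_paralogue(0, 20, 0))  # every individual 'heterozygous'
print(flag_possible_paralogue(5, 10, 5))  # matches HW proportions
```

A fixed difference between two collapsed paralogues makes every individual appear heterozygous, which gives the largest possible excess. Under polysomic inheritance the expected genotype proportions differ, so the same threshold could wrongly discard genuine loci.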

Reduced representation NGS techniques suffer from the fact that mutations in the restriction enzyme recognition sites, along with the random sequencing of genomic fragments, may result in a large number of missing orthologues. This is of particular concern in large complex genomes because the larger sequence length means that there is a higher probability of stochastic differences in which SNPs are sequenced in different individuals (O'Neill et al. 2013). Uncertainties in allelic and gene copy number also mean that errors remain more difficult to detect in polyploids than in diploids (as for the other NGS-based methods), but this is complicated by strategies for filtering data. The rediploidization process that occurs following genome duplication means that individuals could differ in which gene copies they retain. For genome-sampling approaches such as RAD sequencing, this means that filtering data to include only loci that are found in all individuals could omit important information on the fate of duplicate genes and could confound interpretation of paralogues. This would also be problematic when including multiple ploidy levels in the same analysis, as a uniform filtering strategy might lead to biases across ploidies.

Regardless of these cautions, complexity reduction approaches should in theory be easier to apply to polyploids than whole-genome approaches because of the reduced difficulties with ensuring sufficient coverage provided by sequencing only a targeted portion of the genome. There also should be no theoretical barrier to using assemblers and SNP genotypers developed for diploids. However, for very large and complex genomes, current methods might still be limited by uneven coverage across the genome. For example, in the complex case of sturgeon, where ploidy level can be as high as 2n = 8x, but there have been varying degrees of rediploidization, Ogden et al. (2013) were able to discover SNPs using a RAD tag sequencing technique on an Illumina HiSeq 2000 platform. However, they were unable to recover all of the polymorphisms expected from genotyping within a family (two parents and six offspring). A current but potentially transient benefit of complexity reduction approaches for polyploid genomes is that such approaches can be applied without assembly to a reference sequence, but inferences remain more powerful where this is possible. For example, in polyploid birch, paralogues were differentiated from homologues using the features of the Stacks assembler by comparing RAD sequences to a reference genome library, but not when comparing de novo RAD sequences to each other (Wang et al. 2013). While these approaches can reduce the cost of SNP discovery and genotyping by sequencing, the continued increase in data volumes at an ever-reducing cost may make whole-genome sequencing more efficient for SNP discovery in the future.

Combining methodologies

Even for diploids, there has been recognition that combining approaches has the greatest potential for resolving large and complex genomes. For example, long-read technologies that are prone to high error rates can be used to generate scaffolds where a reference genome is not available, with higher-accuracy short-read approaches used for detailed SNP identification. You et al. (2011) used such a combined approach for SNP discovery in the diploid ancestor of the D genome of polyploid wheat (Aegilops tauschii), which itself has a genome size of over 4 Gb, with 90% repetitive sequences, making de novo assembly difficult. They combined Roche 454 shotgun reads with low-genome coverage of one genotype to distinguish single copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes and then mapped shotgun reads from other genotypes generated with SOLiD or Solexa to the annotated Roche 454 reads to identify putative SNPs. Mayer et al. (2011) combined chromosome sorting, NGS, array hybridization, and synteny comparisons with model grasses to construct an ordered scaffold of barley (Hordeum vulgare). Seeb et al. (2011a) included a high-resolution melt curve analysis (HRMA; Wu et al. 2008) and Sanger sequencing, as additional stringency steps, to validate transcriptome-based SNPs in tetraploid chum salmon. Such combined approaches hold the most promise for identifying individual markers that could be used for population genetic inference in polyploid genomes, to allow resolution of the full complexity of the evolutionary process when changes in copy number are critical for understanding relationships among populations.

Extending population genetic tools used for diploids to polyploids

General caveats for genetic marker analysis in polyploids

Analysis of allele and genotype frequencies and the quantification of deviations from the HW equilibrium are a central aspect of population genetics. Although the concepts of population genetics theory have predominantly been developed for diploids (Wright 1943, 1951), the same core principles apply to polyploids. The HW equilibrium principle can be applied to the diploid subgenomes of allopolyploids with strict disomic inheritance, if one can reliably identify the homoeologous copies. The principle has also been extended to autopolyploids, where polysomic inheritance and double reduction complicate matters (Haldane 1930; Geiringer 1949; Parsons 1959; see Bever & Felber 1992 for a review). For a polyploid with polysomic inheritance (without double reduction), expected genotype frequencies for a bi-allelic locus in HW equilibrium are predicted by the formula $(p + q)^{2m}$, in which p and q represent the frequencies of both allelic states and m is the ‘haploid’ ploidy level (Haldane 1930). The main effect of double reduction is that it causes the expected frequencies of homozygous genotypes to increase (Bever & Felber 1992 and references therein), resembling the effect of inbreeding (see Geiringer 1949; Parsons 1959; Bennett 1968 for some formulae for predicting genotype frequencies of polyploids with double reduction). This relates to a more general theoretical issue with the use of HW equilibrium in autopolyploids. Compared with diploids, the random mating equilibrium is not reached as fast in autopolyploids (Haldane 1930; Geiringer 1949; Bever & Felber 1992) and depends on the frequency of double reduction (Parsons 1959; Bennett 1968). This questions whether any method that is based on deviation from HW equilibrium is actually appropriate for autopolyploids. To the best of our knowledge, there are no theoretical studies that have addressed this issue.
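The expected genotype frequencies under polysomic inheritance without double reduction follow from expanding (p + q) raised to the ploidy 2m, with q = 1 - p. The minimal sketch below evaluates each term of that binomial expansion; it assumes a bi-allelic locus at random mating equilibrium, and the function name is hypothetical.

```python
from math import comb


def hw_genotype_frequencies(p, ploidy):
    """Expected frequency of each dosage class (0..ploidy copies of allele A)
    from the binomial expansion of (p + q)^ploidy, with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be an allele frequency between 0 and 1")
    q = 1.0 - p
    return [comb(ploidy, k) * p ** k * q ** (ploidy - k)
            for k in range(ploidy + 1)]


# Diploid (2m = 2) versus tetraploid (2m = 4) at p = q = 0.5.
print(hw_genotype_frequencies(0.5, 2))
print(hw_genotype_frequencies(0.5, 4))
```

For p = q = 0.5 a tetraploid has five dosage classes with expected frequencies 1/16, 4/16, 6/16, 4/16 and 1/16, so only 1/8 of individuals are expected to be homozygous compared with 1/2 in a diploid. Double reduction would shift these expectations towards the homozygous classes and is not modelled here.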

In any case, the theoretical basis for population genetic analysis in polyploids frequently cannot be applied in practice. The reasons for this are mainly related to issues that have already been identified in the previous sections: (i) inheritance can deviate from strict disomic or polysomic and can vary from locus to locus and over time; (ii) dosage/copy number uncertainty and null alleles prevent reliable assessment of observed allele and genotype frequencies; and (iii) differences in ploidy level within a taxon or between closely related taxa included in the same analyses add an additional level of complexity to the population genetic analysis of polyploid species. It is of course possible to avoid difficulties with mixed ploidy (often referred to as mixed cytotypes) by analysing different ploidy levels separately, and to refrain from any interploidy comparison. This only seems reasonable in situations where different ploidy levels are indeed reproductively or spatially isolated. In Aster amellus, for example, diploids and hexaploids are completely reproductively isolated from each other, despite being morphologically indistinguishable and occurring in close vicinity (Münzbergová et al. 2013). However, although experimental crosses between ploidy levels tend to result in a much lower seed-set than crosses between plants with equal ploidy, reproductive isolation between ploidy levels can be incomplete (e.g. Hardy et al. 2001; Husband & Sabara 2003; Stift et al. 2010; Mraz et al. 2012). This means that gene flow between ploidy levels is possible and so population structure should be considered across cytotypes. Using molecular markers, gene flow across ploidy levels has, for example, been detected between diploids and tetraploids in Arabidopsis arenosa and A. lyrata (Jørgensen et al. 2011), and between diploids and apomictic triploids in Taraxacum (Menken et al. 1995).
Given the frequent genetic exchange between ploidy levels, and the fact that polyploids are often recently derived from ancestors with a lower ploidy level, it is clearly undesirable to analyse different ploidy levels separately.

In this section, we will discuss the most commonly used approaches for population genetic analysis in polyploids, and how assumptions related to the inheritance mode and dosage uncertainty may affect these approaches. This will provide a thorough evaluation of the approach-specific pros and cons and allow us to recommend the work that is most critically needed to advance the field. We discuss some of the main statistical packages that implement these methods in the main text, Table 1 and Boxes 2 and 3, and discuss creative solutions that are being suggested for extending analyses to polyploids. Some of the most exciting developments are being implemented in flexible programming environments that allow direct user additions, such as R (http://www.r-project.org/, R Development Core Team 2004). We anticipate that future advances will continue to use these platforms.

Estimating allele frequencies

Estimation of allele frequencies is of great importance in the study of demographic factors influencing population structure such as migration, population growth or bottlenecks. Accurate allele frequencies are a prerequisite for the calculation of expected heterozygosities and estimates of population differentiation and fixation indices. Unlike in diploids, allele frequencies in polyploids can rarely be calculated directly, because this is only possible when there is no uncertainty in allele copy number. A way around this problem is to incorporate dosage uncertainty into the inference of population genetic parameters. Unfortunately, there is not a single straightforward method for doing this. A first way is to estimate allele frequencies by considering that each allele in partial heterozygotes has an equal likelihood of being present in more than one copy (implemented in SPAGeDi assuming polysomic inheritance; Hardy & Vekemans 2002). This leads to an underestimation of common allele frequencies and an overestimation of rare allele frequencies (Clark & Jasieniuk 2011). A second method works by assigning the state of the unknown double dose allele based on the total sample or population allele frequencies (implemented in GenoDive assuming polysomic inheritance; Meirmans & Van Tienderen 2004). A problem arises here due to circularity: the uncertainty in allelic dosage means that accurate population allele frequencies cannot be calculated, and assigning a particular allelic state changes the allele frequencies that the assignment was based on. A third method is to calculate allele frequencies and levels of heterozygosity in polyploid populations only based on unambiguous genotypes and ignoring genotypes with missing data (StAMPP, Pembleton et al. 2013; atetra, Van Puyvelde et al. 2010; and TETRASAT, Markwith et al. 2006; the latter two assuming disomic inheritance). This may cause biased allele frequencies because partial heterozygotes are ignored.
In an extreme example of a tetraploid population of two individuals with genotypes ABCD and ABBB (which would be scored as ABX due to the dosage uncertainty), the true frequency of B is 1/2 but would be estimated as 1/4 if ABBB were excluded. Similarly, in a hexaploid population with ABBBBB and ABCDEF, the true frequency of B (1/2) would be estimated as 1/6. For ploidy levels above tetraploid, this method is probably obsolete anyway, as there will probably not be any unambiguous genotypes. Even in triploids and tetraploids, allele frequencies will be inaccurate for loci with limited variability and hence many ambiguous genotypes. This is because the frequency estimates for such loci can only be based on the limited number of individuals with unambiguous genotypes. A fourth way is to recalculate actual allele frequencies in the population from the ‘allelic phenotype’ frequencies (estimation of allele frequencies using phenotypes instead of genotypes) based on an iterative process. De Silva et al. (2005) developed a maximum-likelihood-based approach to do this, using the expectation maximization algorithm of Dempster et al. (1977), under the assumption of random mating and either disomic or polysomic inheritance without double reduction. This approach is implemented in PolySat (Clark & Jasieniuk 2011) and in GenoDive in a modified form (Meirmans & Van Tienderen 2004). A level of selfing can also be introduced in this estimate to improve the estimation of allele frequencies in inbred populations (not implemented in GenoDive). Detailed simulations of the consequences of implementing any of the approaches to circumvent dosage uncertainty have not yet been conducted, which would be required to assess what types of biases might result from the various strategies. Moreover, it should be realized that any method to assign an allelic state will obviously lead to a bias in cases for which the unknown allelic state is a null allele or an artefact arising during the PCR process.
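The two-individual examples above can be checked with a short Python sketch. All function names are hypothetical. The `equal_dosage` function is a deliberately simplified reading of the first strategy (each allele observed in an individual is given an equal share of that individual's copies) and is not the SPAGeDi implementation; it also assumes that all individuals have the same ploidy and that there are no null alleles.

```python
from collections import Counter
from fractions import Fraction


def true_frequencies(genotypes):
    """Allele frequencies from fully resolved genotypes (dosage known)."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {a: Fraction(n, total) for a, n in counts.items()}


def is_ambiguous(genotype):
    """Partial heterozygote: more than one allele, but fewer than the ploidy."""
    return 1 < len(set(genotype)) < len(genotype)


def unambiguous_only(genotypes):
    """Third strategy: discard genotypes whose dosage cannot be scored."""
    return true_frequencies([g for g in genotypes if not is_ambiguous(g)])


def equal_dosage(genotypes):
    """First strategy (simplified): observed alleles share the copies equally."""
    freqs = Counter()
    for g in genotypes:
        observed = set(g)  # only presence/absence is used, as for a scored phenotype
        for a in observed:
            freqs[a] += Fraction(1, len(observed))
    return {a: f / len(genotypes) for a, f in freqs.items()}


tetraploids = ["ABCD", "ABBB"]   # ABBB would be scored as the phenotype AB
hexaploids = ["ABBBBB", "ABCDEF"]

print(true_frequencies(tetraploids)["B"])   # 1/2
print(unambiguous_only(tetraploids)["B"])   # 1/4
print(equal_dosage(tetraploids)["B"])       # 3/8
print(unambiguous_only(hexaploids)["B"])    # 1/6
```

In this toy case, both shortcuts underestimate the frequency of the common allele B, in line with the biases described in the text.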

Box 2. Population differentiation indices

Analysis of population differentiation (F-statistics) is a key component of population genetic studies. Here, we list some of the FST analogues and interpopulation differentiation measures that have been developed for diploids but have been applied to polyploids and discuss what is expected under disomic and polysomic inheritance.

FST-related measures

The first and most widely used summary statistic in population genetics is Sewall Wright's FST (Wright 1943, 1965):

$$F_{ST} = \frac{\mathrm{Var}(p)}{\bar{p}\,(1 - \bar{p})}$$

where Var(p) is the variance of local allele frequencies among subpopulations and $\bar{p}$ is the mean allele frequency. The properties of this index have been well studied under island and isolation by distance models (Wright 1943; Wright 1946; Wright 1965; Slatkin & Barton 1989; Whitlock 2011). Under the finite island model and with a low mutation rate μ, FST is only dependent on the effective population size N and the migration rate m, such that:

$$F_{ST} \approx \frac{1}{4Nm + 1}$$

if m << 1 and μ << m (Wright 1951).
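The two expressions for FST can be illustrated with a small Python sketch. It assumes a single bi-allelic locus, equally weighted subpopulations and the classical diploid island-model approximation FST ≈ 1/(4Nm + 1); the function names are hypothetical.

```python
from statistics import mean, pvariance


def wright_fst(freqs):
    """F_ST = Var(p) / (p_bar * (1 - p_bar)) for one bi-allelic locus.

    freqs: frequency of one allele in each subpopulation (equal weights).
    Undefined (division by zero) if the allele is fixed or absent everywhere.
    """
    p_bar = mean(freqs)
    return pvariance(freqs, mu=p_bar) / (p_bar * (1.0 - p_bar))


def island_model_fst(n, m):
    """Equilibrium approximation F_ST ~ 1 / (4*N*m + 1), for m << 1 and mu << m."""
    return 1.0 / (4.0 * n * m + 1.0)


print(round(wright_fst([0.2, 0.8]), 3))       # 0.36
print(round(wright_fst([0.0, 1.0]), 3))       # 1.0 (alternative alleles fixed)
print(round(island_model_fst(100, 0.01), 3))  # 0.2 (one migrant per generation)
```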

Weir & Cockerham (1984) proposed θ as an estimate of FST using a simple analysis of variance (ANOVA) to calculate the variances within and among subpopulations. Similarly, different FST analogues have been proposed to analyse population structure by taking into account haplotype sequences, ϕST (Excoffier et al. 1992), or a stepwise mutation model in microsatellites, RST (Slatkin 1995). Nei's GST (Nei 1973, 1987) is equivalent to Wright's FST but defined in terms of heterozygosity within subpopulations (HS) and heterozygosity of the entire set of subpopulations under the assumption of HW equilibrium (HT). It has been designed to account for the analysis of loci with multiple alleles:

$$G_{ST} = \frac{H_T - H_S}{H_T}$$

As many authors have shown that GST has the undesirable property of being constrained to a maximum value of <1 when the mutation rate is high, G′ST was proposed by Hedrick (2005) to adjust for the number of alleles in a subpopulation as:

$$G'_{ST} = \frac{G_{ST}\,(d - 1 + H_S)}{(d - 1)(1 - H_S)}$$

with d being the number of subpopulations studied. Such standardization can be applied to other FST analogues (F′ST, ϕ′ST or θ′) by weighting by their maximum values (Meirmans & Hedrick 2011). Moreover, as Hedrick's G′ST (2005) may be biased when few subpopulations have been sampled, Meirmans & Hedrick (2011) have proposed a standardization to account for a small number of sampled subpopulations, G″ST.
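The constraint on GST and the effect of Hedrick's standardization can be seen in a small numerical sketch. It assumes that GST = (HT − HS)/HT and G′ST = GST(d − 1 + HS)/((d − 1)(1 − HS)), with known allele frequencies, equal subpopulation weights and no small-sample corrections; the function names are hypothetical.

```python
def expected_heterozygosity(freqs):
    """Gene diversity 1 - sum(p_i^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst_and_standardized(pops):
    """Return (G_ST, G'_ST) for one locus.

    pops: one dict per subpopulation, mapping allele -> frequency.
    """
    d = len(pops)
    alleles = set().union(*pops)
    h_s = sum(expected_heterozygosity(p.values()) for p in pops) / d
    mean_freqs = [sum(p.get(a, 0.0) for p in pops) / d for a in alleles]
    h_t = expected_heterozygosity(mean_freqs)
    g_st = (h_t - h_s) / h_t
    g_st_prime = g_st * (d - 1 + h_s) / ((d - 1) * (1.0 - h_s))
    return g_st, g_st_prime


# Two subpopulations that share no alleles, each with four equally common alleles.
pop1 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
pop2 = {"E": 0.25, "F": 0.25, "G": 0.25, "H": 0.25}
g_st, g_st_prime = gst_and_standardized([pop1, pop2])
print(round(g_st, 3), round(g_st_prime, 3))  # 0.143 1.0
```

Although the two subpopulations are completely differentiated, GST stays far below 1 because within-subpopulation diversity is high, whereas G′ST reaches its maximum.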

Each of these measures can be adapted to autopolyploids, but difficulties with inferring dosage again restrict their usage. In polyploid organisms with complete disomic inheritance, in which heterozygosity can be fixed, the expected heterozygosity (HS) may be overestimated even if complete genotypes can be resolved; hence, fixation indices such as FST, Nei's GST and G″ST would be underestimated (Meirmans & Van Tienderen 2013). An alternative measure, Rho, was proposed by Ronfort et al. (1998) after Tachida & Yoshimaru (1996) and Waller & Knight (1989). It has a theoretical background linked to Wright's FST, with the following equation:

display math

where FIS is the inbreeding coefficient of an individual within a subpopulation. Additionally, Ronfort et al. (1998) provided a method to estimate Rho using the ANOVA framework of Weir & Cockerham (1984). In their simulation studies (see 'Population differentiation'), Meirmans & Van Tienderen (2013) found that this measure was least sensitive to the ploidy level, selfing rate and double reduction rate (and therefore mode of inheritance) and recommended it as the population differentiation measure of choice for polyploids.

Jost's D

Jost (2008) proposed a summary statistic that accounts for the number of alleles in the population:

$$D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{d}{d - 1}$$

with HS, HT and d as defined above.

This summary statistic measures the departure from total differentiation, which should not be confused with fixation indices (e.g. FST) that measure the departure from panmixia, at least in the finite island model (Whitlock 2011). Jost's D is not informative about migration between populations or other demographic processes and is dependent on neutral genetic diversity and the mutation rate. The behaviour of D according to the mode of inheritance is hard to interpret and D should therefore be avoided for the analysis of polyploid organisms (Meirmans & Van Tienderen 2013). However, it is implemented for extension to polyploid data in GenoDive (Table 1).
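The difference between a measure of differentiation and a fixation index can be made concrete with a brief sketch. It assumes the usual heterozygosity-based form of Jost's D, D = ((HT − HS)/(1 − HS)) · (d/(d − 1)), where d is the number of subpopulations; the function name and the input values are illustrative only.

```python
def jost_d(h_s, h_t, d):
    """Jost's D from within-subpopulation (h_s) and total (h_t) heterozygosity.

    d is the number of subpopulations. Undefined when h_s equals 1.
    """
    return (h_t - h_s) / (1.0 - h_s) * d / (d - 1)


# Two subpopulations without shared alleles, four equally common alleles each:
# H_S = 0.75 and H_T = 0.875.
h_s, h_t = 0.75, 0.875
print(jost_d(h_s, h_t, 2))             # 1.0
print(round((h_t - h_s) / h_t, 3))     # 0.143, Nei's G_ST for the same data
```

For these values, D indicates complete differentiation while GST remains low, which illustrates why the two kinds of statistic should not be interpreted interchangeably.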

Population structure

Bayesian clustering methods, such as those implemented in STRUCTURE for the analysis of population structure (Falush et al. 2003) and in InStruct for simultaneous analysis of population structure and inbreeding rates (Gao et al. 2007), are popular in population genetics. The principle of Bayesian clustering is to assign individuals to one or more clusters such that deviation from HW equilibrium is minimized (Pritchard et al. 2000; Falush et al. 2003). Although initially developed for diploids, the programmes STRUCTURE and InStruct accommodate (auto)tetraploid data and allow joint analysis of different ploidy levels. Bayesian clustering has been used to infer the assignment of polyploid individuals to structured subpopulations (Lo et al. 2009; Shimizu-Inatsugi et al. 2009; Vanderpoorten et al. 2011; Tsuchimatsu et al. 2012). For example, Lo et al. (2009) used STRUCTURE on a data set of 13 microsatellite loci to investigate possible gene flow and evolutionary relationships between sexually reproducing diploid and polyploid (triploid and tetraploid) populations reproducing by pseudogamous gametophytic apomixis of two species of hawthorns (Crataegus suksdorfii and Crataegus douglasii; Rosaceae). Due to a lack of genetic structuring (supported by the absence of isolation by distance) in tetraploid apomictic populations of C. douglasii, they concluded that there was either substantial gene flow among populations or that the populations originated from the same set of founders. In contrast, populations of the mixed-ploidy species C. suksdorfii clustered according to the ploidy level, suggesting a reduction of gene flow between cytotypes in this species.

However, the application of Bayesian clustering based on HW equilibrium in polyploids comes with a number of potential problems. These arise mainly from violations of the basic assumption of random mating within clusters. This problem is not specific to polyploids, but polyploids frequently show a shift to inbreeding (reviewed in Mable 2004b) and asexual reproduction (Tomiuk & Loeschcke 1992; Dufresne & Hebert 1995; Stenberg et al. 2003; Aguilera et al. 2007; Lo et al. 2009; Vergilino et al. 2009; Neiman et al. 2011) and are often associated with novel habitats at range edges (e.g. Hijmans et al. 2007; Parisod et al. 2010). Because selfing, asexual reproduction and fast population growth cause departures from HW equilibrium, each of these cases represents a violation of the core assumptions of STRUCTURE, which may produce either spurious population clustering or a lack of population structuring, depending on the genetic variability (Pritchard et al. 2000). It is currently unknown how seriously inference of population structure can be affected by violation of the underlying assumptions. Again there is a strong need for simulations that explicitly test each of the potential causes of bias in polyploids, simultaneously addressing the potential effect of null alleles and departures from polysomic inheritance.

Population differentiation

Quantifying population differentiation is among the main goals of population genetic analysis. Measures of population differentiation and partitioning of variance such as FST (or analogs) are therefore routinely reported in diploids. The main principles of F-statistics are extendible to autopolyploids with polysomic inheritance (e.g. Hardy et al. 2001; Andreakis et al. 2009). However, the previously identified problem of dosage uncertainty often prevents calculation of accurate allele and genotype frequencies. As such frequencies are needed to assess fixation indices (Box 2), it is frequently impossible to calculate F-statistics for autopolyploids.

In rare cases, where allele and genotype frequencies can be inferred for polyploids, remaining issues with using FST or related indices in polyploids include potential violations of the assumptions of HW equilibrium and polysomic inheritance. GenoDive (Meirmans & Van Tienderen 2004) and adegenet (Jombart 2008; Jombart & Ahmed 2011) have options to test deviation from HW equilibrium in polyploids. Simulating tetraploid genotype data, Meirmans & Van Tienderen (2013) demonstrated that assuming tetrasomic inheritance for a marker that in reality is inherited disomically may lead to overestimation of the expected within-population heterozygosity and underestimation of the divergence between populations as measured by Nei's GST (Nei 1987), G″ST (Hedrick 2005) and Jost's D (Jost 2008). Rho-st (Tachida & Yoshimaru 1996; Ronfort et al. 1998) proved to be the only measure of population differentiation that was independent of the ploidy level, selfing rate and double reduction rate, and appeared unbiased by the type of inheritance (Meirmans & Van Tienderen 2013). This led the authors to recommend Rho-st as the preferred statistic for assessing population differentiation in polyploids with unknown segregation (see Box 2). However, they also warned that Rho is analogous to the ‘correlation between truly outcrossed mates’ defined for diploids in Tachida & Yoshimaru (1996) and cannot be interpreted as directly equivalent to an FST estimate because the values that Rho takes are comparable to expected FST values for haploid populations. Encouragingly, though, Meirmans & Van Tienderen (2013) also found that violating the assumption of tetrasomic inheritance does not bias other more standard FST measures too much, as long as there are sufficient intergenomic recombination events (around one event per generation).
It is this type of simulation work that holds most promise for assessing potential biases that might result from lack of knowledge of inheritance patterns and allelic dosage when making population genetic inferences based on polyploid data sets.

Box 3. Inter-individual and inter-population distance/similarity indices

In this box, we provide an overview of the formulae underlying the distance and similarity indices discussed in Genetic Distance–Based Analyses. They represent indices that are frequently applied to polyploids and that users are likely to encounter in software packages that accommodate polyploid data. For each of the indices below, it is important to realize that most have been formulated as similarities (Simple Match, Jaccard, Lynch, Kosman), but some as distances (Bruvo, and all the interpopulation measures). Similarities and distances are related in a relatively simple manner: similarity = 1 − distance. The general use of distance and similarity indices in population genetics has been reviewed by Kosman & Leonard (2005) and will not be dealt with in detail. Here, we focus predominantly on their extensions to polyploid data, and we indicate the software programs that implement these extensions.

Simple-match index (squared Euclidean distance)
Software allowing calculation for polyploid data: none

The simple match index (M) calculates the similarity between two (in principle haploid) individuals based on the multilocus presence/absence data. It is calculated as M(i1,i2) = (n−b−c)/n or M(i1,i2) = (a + d)/n (Sneath & Sokal 1973), in which n = a + b + c + d and is the length of the presence/absence (1,0) vector for all individuals under consideration, a is the number of shared band presences among i1 and i2, b is the number of bands present in i1 and absent in i2, c is the number of bands absent in i1 and present in i2, and d is the number of shared band absences. Because it is based on presence and absence, it can be applied independently of ploidy level. For both diploids and polyploids, the index should be calculated per locus and subsequently averaged over all loci. As shared absences contribute to similarity, the simple match index increases with marker diversity, making it mainly applicable to closely related individuals (Kosman & Leonard 2005).
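A minimal sketch of the simple-match calculation for two presence/absence vectors is given below. The function name and the band vectors are made up for illustration.

```python
def simple_match(v1, v2):
    """Simple-match similarity M = (a + d) / n for two presence/absence vectors.

    a = shared presences, d = shared absences, n = number of scored bands.
    """
    if len(v1) != len(v2):
        raise ValueError("band vectors must have equal length")
    matches = sum(1 for x, y in zip(v1, v2) if x == y)  # a + d
    return matches / len(v1)


i1 = [1, 1, 0, 0, 1]
i2 = [1, 0, 0, 1, 1]
# a = 2 (bands 1 and 5), d = 1 (band 3), b = 1, c = 1, n = 5
print(simple_match(i1, i2))  # 0.6
```

Because the shared absence at the third band counts as a match, adding further bands that are absent from both individuals would raise M without any additional shared alleles.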

Jaccard and Dice (Lynch) similarity indices
Software allowing calculation for polyploid data: PolySat

The Jaccard similarity index is calculated as J(i1,i2) = a/(a + b + c) and the Dice similarity index as D(i1,i2) = 2a/(2a + b + c) (Legendre & Legendre 1998), in which a corresponds to bands shared between individuals i1 and i2, b corresponds to the presence in i1 and absence in i2, and c corresponds to the absence in i1 and presence in i2. The main difference between the Jaccard and Dice indices lies in the weight given to shared bands. This is twice as large for the Dice index, which works out to be the equivalent of the similarity index independently developed by Lynch (1990). Both the Jaccard and Dice/Lynch index can be readily calculated for dominant and co-dominant diploid and polyploid data by calculating the index per locus and subsequently averaging over all loci. They mainly differ from the simple match index in that the shared absence of bands does not contribute to similarity. This makes their use unrestricted with regard to the expected relatedness of the analysed individuals, although with highly variable markers the risk of homoplasy may lead to overestimation of similarity (Kosman & Leonard 2005).
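The following sketch computes both indices for a single locus from the sets of alleles (bands) observed in two individuals, using J = a/(a + b + c) and D = 2a/(2a + b + c). The function names are hypothetical, allele dosage is deliberately ignored, and a multilocus value would be obtained by averaging over loci.

```python
from fractions import Fraction


def band_counts(i1, i2):
    """Return (a, b, c): shared bands, bands only in i1, bands only in i2."""
    s1, s2 = set(i1), set(i2)
    return len(s1 & s2), len(s1 - s2), len(s2 - s1)


def jaccard(i1, i2):
    """Jaccard similarity a / (a + b + c); undefined if neither has a band."""
    a, b, c = band_counts(i1, i2)
    return Fraction(a, a + b + c)


def dice(i1, i2):
    """Dice (Lynch) similarity 2a / (2a + b + c)."""
    a, b, c = band_counts(i1, i2)
    return Fraction(2 * a, 2 * a + b + c)


# Allelic phenotypes ABC and BCD share two of four observed alleles.
print(jaccard("ABC", "BCD"), dice("ABC", "BCD"))  # 1/2 2/3

# A diploid AB and a triploid scored as AB (third copy unknown) are identical
# under both indices, because only presence/absence is compared.
print(jaccard("AB", "AB"), dice("AB", "AB"))      # 1 1
```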

Kosman & Leonard's similarity index
Software allowing calculation for polyploid data: none

Kosman & Leonard (2005) questioned the consistency of the Jaccard/Dice/Simple Match indices when analysing diploid or higher ploidy data. They based this on an apparent inconsistency when more than two alleles are present at a single locus. For example, in a diploid case with three alleles at a single locus (A, B and C), the similarity between genotypes AC and CC gives a Jaccard-similarity of 1/2 (Dice: 2/3), whereas the similarity between AB and AC gives a similarity of 1/3 and 1/2, respectively. Kosman & Leonard (2005) argued that since in both comparisons one allele is shared, the genotype pairs should have the same similarity and proposed an index (which we dub the Kosman–Leonard index). This is calculated as a/x, in which a corresponds to the number of shared alleles and x to the ploidy. The Kosman–Leonard similarity between AC and CC thus equals 1/2, just like the similarity between AB and AC. For a tetraploid, the similarity between AAAA and AAAB equals 3/4 (three of four alleles shared), between AAAA and AABB 1/2 (two of four alleles shared), and between AAAA and ABBB 1/4 (one out of four alleles shared). One disadvantage is that the index can only be calculated for complete co-dominant genotypes, whereas determining the dosage is one of the main challenges in polyploids.
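The worked examples above can be reproduced with a few lines of Python. The sketch assumes that the number of shared alleles is counted with dosage, that is, as the size of the multiset intersection of the two genotypes; this interpretation matches every example given here, but the function name and the implementation are illustrative rather than taken from Kosman & Leonard (2005).

```python
from collections import Counter
from fractions import Fraction


def kosman_leonard(g1, g2):
    """Shared allele copies divided by ploidy, for complete genotypes."""
    if len(g1) != len(g2):
        raise ValueError("genotypes must have the same ploidy")
    shared = sum((Counter(g1) & Counter(g2)).values())  # multiset intersection
    return Fraction(shared, len(g1))


print(kosman_leonard("AC", "CC"))      # 1/2
print(kosman_leonard("AB", "AC"))      # 1/2
print(kosman_leonard("AAAA", "AAAB"))  # 3/4
print(kosman_leonard("AAAA", "AABB"))  # 1/2
print(kosman_leonard("AAAA", "ABBB"))  # 1/4
```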

Smouse and Peakall interpopulation distance
Software allowing calculation for polyploid data: GenoDive

Smouse & Peakall (1999) proposed a distance specifically designed for co-dominant markers in diploids that was adapted for polyploid individuals in the GenoDive software (Meirmans & Van Tienderen 2004). The distance is based on a geometric space with r vertices, where each vertex is represented by each homozygous genotype, the distance between them for diploid organisms equals 2, and heterozygotes are positioned midway between the respective homozygotes. So, using this framework for a locus with three alleles (A, B and C) in diploids, the distances between AA and BB and between AA and AB are equal to 2 and 1, respectively, and the distance between AA and BC is $\sqrt{3}$. For polyploids, the distances are difficult to summarize verbally, but the following matrix shows the Smouse and Peakall distances calculated by GenoDive for tetraploid genotypes, with ABCD as reference:

Genotype   ABCD  AABC  ABBC  ABCC  ABBB  AABB  AAAB  AAAA  AAAE  AAEF  AFGH  EFGH  EEEE
ABCD          0     1     1     1     3     2     3     6     4     3     3     4    10
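The tabulated values are all consistent with a simple rule: half the sum, over alleles, of the squared difference in copy number between the two genotypes. The sketch below applies that rule; it is an inference from the values shown here and not a description of GenoDive's code, and the function name is hypothetical. For diploids, the same rule returns the squares of the geometric distances described above (4, 1 and 3 for AA vs. BB, AA vs. AB and AA vs. BC).

```python
from collections import Counter


def smouse_peakall(g1, g2):
    """Half the sum of squared differences in allele copy number."""
    c1, c2 = Counter(g1), Counter(g2)
    total = sum((c1[a] - c2[a]) ** 2 for a in set(c1) | set(c2))
    return total / 2


reference = "ABCD"
others = ["ABCD", "AABC", "ABBC", "ABCC", "ABBB", "AABB", "AAAB",
          "AAAA", "AAAE", "AAEF", "AFGH", "EFGH", "EEEE"]
print([int(smouse_peakall(reference, g)) for g in others])
# [0, 1, 1, 1, 3, 2, 3, 6, 4, 3, 3, 4, 10]
```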

The Smouse and Peakall distance has been criticized for having a poor biological rationale even for relationships between diploid organisms (Kosman & Leonard 2005), and as this criticism also applies to relationships between polyploids (why would ABCD be more distant from AAAA than from EFGH?), its use should be avoided. Besides that, it is a disadvantage that the index can only be calculated for complete co-dominant genotypes.

Nei's interpopulation distance
Software allowing calculation for polyploid data: popdist, GenoDive, SPAGeDi, adegenet, atetra, StAMPP

Nei (1972) proposed an inter-population distance:

$$D = -\ln\left(\frac{J_{12}}{\sqrt{J_1 J_2}}\right)$$

with J1 and J2 corresponding to the arithmetic means of the probabilities of identity of two randomly chosen alleles in populations 1 and 2, respectively, and J12 corresponding to the arithmetic means of the probabilities of identity of a randomly chosen allele in population 1 and a randomly chosen allele in population 2. This measure has not been tested in simulation studies using polyploid populations and so it is not known whether the mode of inheritance assumed will bias the estimates. As it is based on estimation of allele frequencies, it will suffer from difficulties with resolving complete genotypes in polyploids.
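A minimal sketch of this calculation is given below, assuming the standard form D = −ln(J12/√(J1 J2)) and allele frequencies that are already known for each locus, which in polyploids is exactly the difficult part. The function name and the data layout are hypothetical.

```python
from math import log, sqrt


def nei_distance(pop1, pop2):
    """Nei's (1972) standard genetic distance.

    pop1, pop2: lists with one dict per locus, mapping allele -> frequency.
    Returns infinity when the populations share no alleles at any locus.
    """
    n_loci = len(pop1)
    j1 = sum(sum(p * p for p in locus.values()) for locus in pop1) / n_loci
    j2 = sum(sum(p * p for p in locus.values()) for locus in pop2) / n_loci
    j12 = sum(
        sum(f * l2.get(allele, 0.0) for allele, f in l1.items())
        for l1, l2 in zip(pop1, pop2)
    ) / n_loci
    if j12 == 0.0:
        return float("inf")
    return -log(j12 / sqrt(j1 * j2))


pop_a = [{"A": 0.5, "B": 0.5}]
pop_b = [{"A": 1.0}]
print(round(nei_distance(pop_a, pop_b), 4))  # 0.3466
```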

Tomiuk and Loeschcke interpopulation distance
Software allowing calculation for polyploid data: popdist

Tomiuk & Loeschcke (1991) proposed a distance DTLG based on the frequency of shared phenotype/genotype classes between populations with mixed ploidy level:

display math

with I1 and I2 corresponding to the genetic identities of two populations, 1 and 2, and their common ancestral population, following the equation:

display math

where n is the ploidy level of population X and zX1, zX2, zX3 and zX4 represent the observed frequencies of: (i) homozygotes found in population X whose alleles are found in both populations; (ii) heterozygotes found in population X whose alleles are found in both populations; (iii) heterozygotes that have at least one allele present in both populations and at least one allele that is not observed in the other population; and (iv) phenotypes/genotypes carrying exclusively allele(s) found only in population X. This distance is only useful for studies of populations where private alleles occur (i.e. an allele that is present in only one of the two populations being compared).

Tomiuk's band sharing measurement
Software allowing calculation for polyploid data: popdist

Tomiuk et al. (2009) proposed another interpopulation distance measure for polyploids, the Band Sharing Measurement (DBSM). It is largely inspired by the inter-population distance of Nei (1972) but does not take into account the redundancy of alleles in partial heterozygotes (in other words, it is based on allelic phenotypes rather than genotypes). It allows estimation of distances between subpopulations with different ploidy levels even if these subpopulations have no private alleles. However, this measure does not behave linearly with ploidy level increase or when populations are closely related (Tomiuk et al. 2009).

Bruvo distance
Software allowing calculation for polyploid data: GenoDive and PolySat

Bruvo et al. (2004) proposed a distance for microsatellites that takes the mutational process into consideration. It assumes a stepwise mutation model and calculates a matrix of distances between pairs of individuals, in which the distance (d) between two alleles is calculated as $d = 1 - 2^{-|x|}$, in which x is the number of repeat differences, so that the distance approaches 1 as the number of repeat differences increases. Its main advantage is that it can be used for calculating distances regardless of ploidy level. Although its application is much simpler when complete genotypes are available, it can deal with dosage uncertainty. It does so by averaging over all possible allelic constitutions. In cases of mixed ploidy, the method assumes autopolyploidization and assigns one or more ‘virtual alleles’ to the individuals with the lower ploidy level. For each paired comparison, it will do this in two steps. The first step represents a scenario of ‘genome loss’: that one or more alleles of the higher ploidy level were lost in the lower ploidy. Hence, it assigns the value of the virtual allele of the lower ploidy level to represent each of the different alleles of the higher ploidy level, calculates d for each situation and calculates the average d over each of the genome loss scenarios. The second step represents a scenario of ‘genome addition’: that one or more alleles of the lower ploidy level were duplicated. Hence, it assigns the value of the virtual allele of the lower ploidy level to represent each of the different alleles of the lower ploidy level, in all possible combinations and calculates the average d over each of the genome addition scenarios. Finally, the distance between the two individuals that differ in ploidy level is calculated as the sum of the average distance for the two scenarios, divided by the higher of the two ploidy levels (kmax): $d_{\text{different ploidy}} = (d_{\text{genome loss}} + d_{\text{genome addition}})/k_{\max}$.

The Bruvo distance is not implemented in the same way in GenoDive and PolySat. In GenoDive, the value of the ‘virtual allele’ can be set manually from 0 to infinity. In the PolySat package, the value of the ‘virtual allele’ is infinite by default, so that the geometric distance between any allele and a virtual allele is always 1.
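For the simplest situation (two complete genotypes of equal ploidy), the calculation can be sketched as follows. The sketch uses the allele-level distance d = 1 − 2^(−|x|) and, following our understanding of Bruvo et al. (2004), takes the pairing of alleles that minimizes the summed distance and divides by the ploidy. It does not handle dosage uncertainty, virtual alleles or mixed ploidy, it is not the GenoDive or PolySat implementation, and all names are hypothetical.

```python
from itertools import permutations


def allele_distance(x):
    """Distance between two alleles that differ by x repeat units."""
    return 1.0 - 2.0 ** (-abs(x))


def bruvo_same_ploidy(g1, g2):
    """Bruvo distance between two complete genotypes of equal ploidy.

    g1, g2: tuples of allele sizes expressed in repeat units.
    Brute force over all pairings, so only suitable for low ploidy levels.
    """
    if len(g1) != len(g2):
        raise ValueError("this sketch requires equal ploidy")
    best = min(
        sum(allele_distance(a - b) for a, b in zip(g1, pairing))
        for pairing in permutations(g2)
    )
    return best / len(g1)


A, B, C = 10, 11, 12  # alleles that differ by one mutational step each
print(bruvo_same_ploidy((A, A), (A, B)))  # 0.25
print(bruvo_same_ploidy((A, A), (B, B)))  # 0.5
print(bruvo_same_ploidy((A, A), (C, C)))  # 0.75
```

Under an infinite value for a virtual allele, any comparison that involves it would contribute an allele-level distance of 1, which is the limit of `allele_distance` for large x.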

Example: As a hypothetical example, we compared 10 diploid and triploid single-locus genotypes (AA, BB, CC, AB, AC, BC, AB−, AC−, BC− and ABC, in which A, B and C are alleles that differ by one mutational step, and ‘−’ corresponds to the unknown allele). Depending on the index (Jaccard or Bruvo, with an infinite value for the virtual allele) and software (GenoDive or PolySat), the relative distances between diploid and triploid genotypes change (Figure 1). This is due to the fact that the Jaccard distance does not consider allelic dosage in ambiguous polyploid genotypes, resulting in a Jaccard distance of 0 between diploid heterozygotes (e.g. AB) and corresponding triploid partial heterozygotes (AB−) (Fig. 1A,B). For the Bruvo distances in this specific example, PolySat differs from GenoDive, because PolySat accounts for allelic dosages in ambiguous polyploid genotypes (resulting in a distance >0 between diploids and triploids with the same alleles; Fig. 1C), whereas GenoDive does not (resulting in a distance of 0 between diploids and triploids with the same alleles; Fig. 1B). This simple theoretical experiment shows that the choice of distance index and software used to estimate the relationships between diploid and polyploid organisms can strongly affect the conclusions reached.

Figure 1.

Jaccard and Bruvo genetic distances between ten simulated diploid and triploid genotypes (AA, BB, CC, AB, AC, BC, AB−, AC−, BC− and ABC) and the corresponding Principal Coordinate Analyses. (a) Jaccard distance calculated by hand; (b) Bruvo distance with ploidy level variation as an infinitely large mutation (calculated by GenoDive); (c) Bruvo distance with ploidy level variation as an infinitely large mutation (calculated by PolySat). In the left panels, shaded cells show distances between diploid heterozygotes and the corresponding ambiguous triploid partial heterozygotes. In the right panels, large open symbols represent only diploid genotypes and small solid symbols represent only triploid genotypes; large solid symbols represent both diploid and triploid genotypes.

Genetic distance–based analyses

Distance or similarity indices are a common tool in population genetics; for example, to assess population differentiation, diversity within populations, isolation by distance and for clustering approaches (reviewed by Kosman & Leonard 2005). There are several distance/similarity measures (Box 3), most of which were not specifically developed for polyploids, but which can be applied to polyploid data.

For example, the simple-match similarity coefficient has been used to differentiate populations of the tetraploid marram grass based on AFLPs (Ammophila arenaria; Hol et al. 2008) and to assess isolation by distance patterns in hexaploid and enneaploid (9x) individuals of a tallgrass species (Andropogon gerardii; Rouse et al. 2011). Its calculation for polyploid data does not require modification of the formula for haploids and the index can include mixed ploidies (Kosman & Leonard 2005). Other distance measures, such as the Jaccard and Dice similarity indices (Lynch 1990; Legendre & Legendre 1998), Kosman and Leonard's similarity index (Kosman & Leonard 2005), and Smouse and Peakall's distance (Smouse & Peakall 1999) can also be applied to polyploid data and mixed-ploidy data. However, each of these distances, like any other summary statistic using phenotypes instead of genotypes (Obbard et al. 2006), suffers from a loss of information because it does not take into account the allele dosage in polyploid heterozygotes. In addition, they have a poor genetic rationale (Clark & Jasieniuk 2011) and could lead to biases in interpretation, especially in cases of mixed ploidy. The potential for such bias is best illustrated through a thought experiment. For a diploid and a triploid, both with dominant genotype AB (the triploid carrying a third allele copy of unknown state), the distance can vary from 0 to 0.33, depending on the method used to calculate the distances. Methods collapsing the allelic data into dominant phenotypes result in a distance of 0, as both the diploid and triploid will be treated as AB. At first sight, this may not be so bad, as all this reflects is that the diploid AB and triploid AB share all their alleles, which is true (regardless of whether the triploid is AAB or ABB). However, if the unknown allele in the triploid is not actually A or B, but a null allele, the zero distance would be an underestimate of the real distance.
As the probability of null alleles could increase with increasing ploidy, the estimated distances between diploids and higher ploidy levels would on average be underestimated. Some methods developed for diploids can take into account complete genotypes, such as Kosman and Leonard's Similarity Index (Kosman & Leonard 2005) and Nei's interpopulation distance (Nei 1972), but they retain the difficulty of resolving dosage in polyploids.
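
The thought experiment can be made concrete with a minimal Python sketch. It assumes that each individual is reduced to the set of distinct alleles observed at a single locus, and it uses the Jaccard index as one representative phenotype-based measure. The function name and the label "C" for a hypothetical detectable third allele are illustrative choices, not taken from any of the cited software. Under these assumptions, the value of 0.33 arises when the third copy of the triploid is a distinct allele rather than a second copy of A or B.

```python
def jaccard_distance(alleles_a, alleles_b):
    """Jaccard distance between two allele phenotypes (sets of distinct alleles)."""
    a, b = set(alleles_a), set(alleles_b)
    union = a | b
    if not union:
        return 0.0  # two empty phenotypes are treated as identical (assumption)
    return 1.0 - len(a & b) / len(union)


diploid = ["A", "B"]

# Triploid scored as phenotype AB: dosage (AAB or ABB) is invisible.
print(jaccard_distance(diploid, ["A", "A", "B"]))  # 0.0
print(jaccard_distance(diploid, ["A", "B", "B"]))  # 0.0

# If the third copy were a distinct, detectable allele (hypothetical "C"),
# the same index would give 1 - 2/3.
print(round(jaccard_distance(diploid, ["A", "B", "C"]), 2))  # 0.33
```

A null allele leaves no band or peak, so in practice the third case is scored like the first two and the distance is reported as 0, which is the underestimate described above.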

Some interpopulation indices have specifically been designed to study relationships between polyploid populations and between populations of different ploidy levels (see Box 3). The Tomiuk and Loeschcke distance (Tomiuk & Loeschcke 1991) estimates interpopulation distance based on proportions of different classes of genotypes (i.e. homozygotes vs. different types of heterozygotes, which is often reduced to phenotypes if the dosage is unknown). The Band Sharing Measurement (Tomiuk et al. 2009) does the same based on the sharing of common alleles. An alternative to the indices based on shared presence and/or absence is the Bruvo distance (Bruvo et al. 2004). It was specifically developed for polyploids and calculates distances between co-dominant microsatellite genotypes based on the assumption that slipped-strand mispairing is the main driver of length variation among alleles. It has been implemented in GenoDive and PolySat, which differ in the way allelic dosage in partial heterozygotes is assessed; this has a strong effect on the distance calculation (Box 3). The Bruvo distance has been used, for example, to differentiate between clonal tetraploid genotypes of hawthorns (Crataegus; Rosaceae; Lo et al. 2009) and between closely related octoploid subspecies of Atriplex sp. (Sampson & Byrne 2012). However, the Bruvo distance may lead to an overestimation of the genetic distance between individuals with different ploidy levels and may falsely group individuals with the same ploidy level together, especially in the case of autopolyploids or allopolyploids from closely related parents (Clark & Jasieniuk 2011). Hence, the parameters used to calculate the Bruvo distance, in particular those related to allele dosage, have to be set with caution (Meirmans & Van Tienderen 2004; Clark & Jasieniuk 2011), and the Bruvo distance should only be used for microsatellite loci for which it is reasonable to assume a stepwise mutation model.
Given the general difficulty of determining dosage in polyploids, the distance/similarity indices that do not require full genotypes are most useful for the analysis of polyploids, despite their suboptimal use of genetic information.
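
As an illustration of how a stepwise-mutation distance of the Bruvo type operates, the following Python sketch handles the simplest case only: two genotypes at one locus with equal ploidy and fully known dosage, with alleles coded as repeat counts. It uses the geometric transform 1 - 2^(-|x|) of the repeat-number difference x and the minimum-sum pairing of alleles, as described by Bruvo et al. (2004). The function name is hypothetical, and the sketch deliberately omits the treatment of partial heterozygotes and unequal ploidy, which is where the GenoDive and PolySat implementations differ.

```python
from itertools import permutations


def bruvo_distance(genotype_a, genotype_b):
    """Bruvo-style distance between two genotypes of equal ploidy.

    Alleles are given as repeat counts (integers). Dosage is assumed known,
    so each genotype lists exactly one value per chromosome copy.
    """
    if len(genotype_a) != len(genotype_b):
        raise ValueError("this sketch only handles genotypes of equal ploidy")
    ploidy = len(genotype_a)
    if ploidy == 0:
        raise ValueError("genotypes must contain at least one allele")

    def allele_distance(x, y):
        # 0 for identical alleles, approaching 1 as repeat counts diverge.
        return 1.0 - 2.0 ** (-abs(x - y))

    # Try every pairing of alleles and keep the smallest total distance.
    best = min(
        sum(allele_distance(x, y) for x, y in zip(genotype_a, perm))
        for perm in permutations(genotype_b)
    )
    return best / ploidy


print(bruvo_distance([10, 12, 14, 16], [16, 14, 12, 10]))  # 0.0
print(bruvo_distance([10, 10], [11, 12]))  # 0.625
```

Enumerating all pairings scales factorially with ploidy, which is acceptable for a sketch at the ploidy levels discussed here but would need a proper assignment algorithm for larger problems.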

Multivariate analyses

Multivariate and cluster analyses such as principal component analysis (PCA; Pearson 1901; Hotelling 1933) or principal coordinate analysis (PCoA; Gower 1966) can be used to visualize genetic distances among individuals. Their lack of any underlying population genetics-based assumptions, such as HW equilibrium, makes multivariate approaches independent of ploidy level, and therefore suitable to analyse polyploid data as well as mixed-ploidy data (reviewed in Jombart et al. 2009). In polyploids, multivariate analyses have been used to infer evolutionary and genetic relationships either between populations or between individual genotypes (Vergilino et al. 2011).
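
To show why such ordinations are indifferent to ploidy, here is a minimal sketch of classical PCoA in Python, assuming NumPy is available. The only input is a symmetric matrix of pairwise distances, so any of the indices discussed above could supply it. The function name, the `n_axes` parameter and the clipping of negative eigenvalues are choices made for this illustration, not a description of any particular package.

```python
import numpy as np


def pcoa(distances, n_axes=2):
    """Classical principal coordinate analysis of a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    if d.shape != (n, n):
        raise ValueError("distance matrix must be square")

    # Gower double-centring of the squared distances.
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring

    # eigh returns eigenvalues in ascending order, so re-sort descending.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Negative eigenvalues (non-Euclidean distances) are clipped to zero here.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


# Three individuals whose distances fit on a single axis.
d = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords = pcoa(d, n_axes=1)
print(coords.shape)  # (3, 1)
```

Because nothing in the calculation refers to genotypes or allele frequencies, the quality of the ordination depends entirely on the distance measure chosen upstream.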

Other multivariate analyses such as K-means clustering (Hartigan & Wong 1979) and Discriminant Analysis of Principal Components (DAPC; Jombart et al. 2010) allow clustering of polyploid and mixed-ploidy level populations using SNPs (including datasets from high-throughput genotyping) and SSR data. The DAPC method, which may use K-means as an a priori clustering algorithm and is implemented in adegenet (Jombart 2008; Jombart & Ahmed 2011), provides an interesting alternative to the STRUCTURE and InStruct software as it does not require that populations are in HW equilibrium and can handle a large amount of data (Jombart et al. 2010). However, as for the other multivariate analyses, the reduction of genetic information to interindividual or interpopulation distances represents a substantial loss of information. Therefore, methods that make use of genotype information are in principle more powerful, and to be preferred over multivariate approaches. Nevertheless, multivariate approaches can provide an attractive visual complement to other methods and are the method of choice in cases where other methods are inappropriate due to violation of assumptions (such as random mating). The main issue with using multivariate analyses in polyploids is the calculation of the underlying distance matrices, which should be chosen with caution according to the marker used and the type of multivariate analysis used (Jombart et al. 2009).

Custom model-based analyses to analyse complex scenarios

Population geneticists have provided custom models to test different hypotheses about demographic and evolutionary history, such as bottleneck events or change in mode of reproduction in diploid populations (Beaumont et al. 2010; Csilléry et al. 2010). Few simulation programs or coalescent models have included features specific to polyploid organisms, such as larger effective population size at the gene level or the possibility of polysomic inheritance (but see Arnold et al. 2012). The few studies that have analysed polyploid organisms with a custom model-based approach have taken advantage of particular features of the organisms studied, such as disomic inheritance or high selfing rate, to analyse their data and avoid the difficulties inherent to polyploidy. For example, using coalescent-based models on numerous microsatellite loci and nuclear sequences, Jakobsson et al. (2006) tested the hypothesis of a unique origin of the highly selfing allotetraploid species Arabidopsis suecica from hybridization between the highly selfing species A. thaliana and the obligate outbreeding species A. arenosa. They assessed different scenarios changing the number of founders and the time of origin, taking into account the mode of reproduction of the different species, and then accepted the model suggesting a unique origin of the polyploid species A. suecica, as previously proposed by Säll et al. (2003). St-Onge et al. (2012) used a coalescent-based model on the basis of nucleotide variation of 14 nuclear genes to test whether Shepherd's purse (Capsella bursa-pastoris), a polyploid species with disomic inheritance, had an allo- or autopolyploid origin. Sequences of each gene copy from the tetraploid species C. bursa-pastoris were amplified using homeolog-specific primers and compared with corresponding sequences from the diploid, and potentially parental, species C. grandiflora and C. rubella. St-Onge et al. (2012) first compared the number of fixed differences in the homeologous genomes A and B in C. bursa-pastoris and in the genomes of C. grandiflora and C. rubella, and the number of shared polymorphisms between them. The high number of fixed differences between the homeologous genomes of C. bursa-pastoris that are shared with the genome of C. grandiflora is consistent with a scenario of speciation by autopolyploidization of C. bursa-pastoris. They then used coalescent-based simulations and an Approximate Bayesian Computation (ABC) model to estimate that the gene copies in C. bursa-pastoris diverged before the speciation process leading to the formation of the diploids C. grandiflora and C. rubella, and so were not able to reject the hypothesis of autopolyploidization of C. bursa-pastoris, followed by the divergence of gene copies following polyploidization. Although Jakobsson et al. (2006) and St-Onge et al. (2012) used diploid-based models and simulation programmes to test their hypotheses, specific models including polysomic inheritance are under development. For example, a coalescent model for autotetraploid populations with tetrasomic inheritance, in which partial selfing (as well as double reduction) can be simulated, has recently been provided and may be useful to test different evolutionary scenarios (Arnold et al. 2012). Such approaches are promising, as polyploid or ploidy-mixed populations may have different modes of reproduction, rates of mutation or demographic history and these different parameters may be modelled and tested independently using custom-based models.

Conclusions

The evolution of polyploidy is a fascinating topic and many insights have been obtained from investigation of population genetic processes through the analysis of various types of molecular markers. However, dosage uncertainty and unknown segregation patterns result in difficulties in calculating observed and expected allele frequencies. This affects the ability to apply standard population genetic models to investigate population structure in polyploid organisms and additional custom-based models should be developed to take such factors into account. One of the major problems in the analysis of single copy markers remains our inability to reliably determine allelic configurations in polyploids. This prevents estimation of heterozygosity, which is at the heart of population genetic theory in diploids. New genomic tools offer great promise to unravel population genetics questions in polyploids given the astounding number of markers that are becoming available. While NGS approaches still bear some old (determining allelic dosage, gene copy number and paralogous sequences) and new (potential biases associated with PCR-based techniques, difficult assembly and annotation due to the presence of multiple gene copies, lack of models to filter errors that can incorporate dosage uncertainty and incomplete genotypes) problems when working with polyploid genomes, we anticipate that rapid developments in both sequencing technology and computational approaches to statistical inference should dramatically reshape the field of polyploid population genetics in the near future. For both new and old markers, we recommend that more simulation-based studies should be conducted to assess the sensitivity of population genetic analyses to potential biases caused by uncertainty in genotypes and modes of inheritance. 
We hope that this review will stimulate the development of new theory and practical approaches for analysing complex data sets involving extensive gene duplication and ‘flexible’ modes of inheritance.

Acknowledgements

FD acknowledges a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). RV is supported by a fellowship from the NSERC-CREATE program: Aquatic ecosystem health. We thank Karim Gharbi for many fruitful discussions on the current challenges of applying deep sequencing approaches to polyploid genomes. We also thank Maurine Neiman and two anonymous reviewers for valuable suggestions that have substantially improved the manuscript.

The initial plan for the review came from discussions between F.D., R.V. and B.K.M. at Evolution 2011 in Ottawa, with M.S. subsequently contributing valuable suggestions for improvement of content and structure. All authors contributed extensively to conceptual content, manuscript structure, and writing.
