Recent advances have highlighted the ubiquity of whole-genome duplication (polyploidy) in angiosperms, although subsequent genome size change and diploidization (returning to a diploid-like condition) are poorly understood. An excellent system to assess these processes is provided by Nicotiana section Repandae, which arose via allopolyploidy (approximately 5 million years ago) involving relatives of Nicotiana sylvestris and Nicotiana obtusifolia. Subsequent speciation in Repandae has resulted in allotetraploids with divergent genome sizes, including Nicotiana repanda and Nicotiana nudicaulis studied here, which have an estimated 23.6% genome expansion and 19.2% genome contraction from the early polyploid, respectively. Graph-based clustering of next-generation sequence data enabled assessment of the global genome composition of these allotetraploids and their diploid progenitors. Unexpectedly, in both allotetraploids, over 85% of sequence clusters (repetitive DNA families) had a lower abundance than predicted from their diploid relatives; a trend seen particularly in low-copy repeats. The loss of high-copy sequences predominantly accounts for the genome downsizing in N. nudicaulis. In contrast, N. repanda shows expansion of clusters already inherited in high copy number (mostly chromovirus-like Ty3/Gypsy retroelements and some low-complexity sequences), leading to much of the genome upsizing predicted. We suggest that the differential dynamics of low- and high-copy sequences reveal two genomic processes that occur subsequent to allopolyploidy. The loss of low-copy sequences, common to both allopolyploids, may reflect genome diploidization, a process that also involves loss of duplicate copies of genes and upstream regulators. In contrast, genome size divergence between allopolyploids is manifested through differential accumulation and/or deletion of high-copy-number sequences.
All angiosperms have experienced at least one, if not more, rounds of whole-genome duplication (WGD or polyploidy) in their ancestry (Vision et al., 2000; Bowers et al., 2003; Jaillon et al., 2007; Barker et al., 2009; Jiao et al., 2011). Subsequent to polyploidy, genomes undergo a process of diploidization, whereby duplicate copies of genes may be lost and chromosome number may decrease, so that over time the signature of ancestral polyploidy becomes more and more obscure. Although global analyses of genome size in angiosperms reveal a trend towards DNA loss subsequent to polyploidy (genome downsizing) (Leitch and Bennett, 2004; Leitch et al., 2008), increases in genome size are also known to occur. Such changes in genome size are thought to arise via accumulation of repetitive DNA (Hawkins et al., 2009; Renny-Byfield et al., 2011).
Evolution and change in the copy number of repetitive DNA may be analysed via ‘genome skimming’, in which short-read next-generation sequencing data are used to reconstruct and quantify the repetitive fraction of the genome (Macas et al., 2007, 2011; Swaminathan et al., 2007; Wicker et al., 2009; Hribova et al., 2010; Renny-Byfield et al., 2011). The approach efficiently characterizes sequences present in multiple copies, and read-depth analysis provides estimates of repeat abundance within the genome, allowing quantification and comparisons of repeats between species.
Recent studies have used these methods to better understand genome evolution following allopolyploidy. Comparisons of the young allotetraploid Nicotiana tabacum (<0.2 million years ago; Clarkson et al., 2005) with its diploid progenitors revealed a bias towards removal of paternally derived repeats in conjunction with a reduction in the repetitive fraction of the genome (Renny-Byfield et al., 2011). Patterns of sequence loss observed in N. tabacum are repeated in synthetic lines after only four generations (Renny-Byfield et al., 2012). The loss of DNA in synthetic (Petit et al., 2010) and natural (Petit et al., 2007) tobacco is also revealed using sequence-specific amplified polymorphisms (SSAPs) targeting Tnt1 and Tnt2 retroelement families. Likewise, repetitive DNA loss targeted to chromosomes and genomes has been reported in wheat (Triticum aestivum) (Ozkan et al., 2001; Salina et al., 2004).
Nicotiana section Repandae provides an ideal model group for dissecting the two phenomena of genome size divergence and diploidization following ancient polyploidy (approximately 5 million years ago). It is thought that Repandae formed from a single hybridization event between relatives of extant Nicotiana sylvestris (2637 Mbp/1C) and Nicotiana obtusifolia (1511 Mbp/1C), which, following speciation, produced four allopolyploids (Chase et al., 2003; Clarkson et al., 2004, 2005, 2010; Kelly et al., 2012).
In Nicotiana section Repandae (Parisod et al., 2012), there is considerable departure from additivity in seven transposable element families, typically through deletion, especially of transposable elements derived from the N. obtusifolia parent. In addition, there are also large numbers of new SSAP bands, probably derived from new element insertion sites. However, SSAP data cover only a small fraction of the genome, targeting a small number of repeats. In this paper, we use next-generation sequencing approaches to generate a global overview of repetitive DNA evolution in the context of allopolyploidy. We examine two species in section Repandae, Nicotiana repanda (5320 Mbp/1C) and Nicotiana nudicaulis (3477 Mbp/1C) (Leitch et al., 2008), to better understand the processes of diploidization and genome size change. We used genomic in situ hybridization (GISH), next-generation sequencing and a graph-based clustering pipeline to simultaneously identify, quantify and assess thousands of repeat families in the genomes of N. repanda (with genome upsizing) and N. nudicaulis (with genome downsizing), and compare these with repeats in close relatives of their diploid progenitors. Thus, this paper dissects the signatures of two phenomena: diploidization and genome size divergence.
Reconstructing ancestral genome size
We estimated genome size changes in lineages leading to N. nudicaulis and N. repanda by reconstructing ancestral genome sizes in Nicotiana section Repandae using Markov chain Monte Carlo reconstruction methods (Figure 1). Ancestral genome size estimates are in good agreement using sequence data from both allopolyploid sub-genomes (Figure 1), and indicate an increase in genome size in the lineages leading to N. repanda (+23.6%), Nicotiana nesophila (+14.7%) and Nicotiana stocktonii (+14.3%), but downsizing in the lineage leading to N. nudicaulis (−19.2%; Figure 1). These values are similar to a model that assumes simple additivity of genome sizes recorded in extant relatives of the diploid progenitors (Leitch et al., 2008).
Graph-based clustering was used to characterize, quantify and compare highly repetitive DNA sequences in the genomes of the diploid species N. sylvestris and N. obtusifolia and their derived allotetraploids N. repanda and N. nudicaulis. Clustering of 6 812 631 Illumina reads, each 95 bp long (equivalent to 5% coverage of each genome), produced 492 696 clusters, with the smallest comprising two reads and the largest comprising more than 80 000 reads. The majority (79%) of the largest clusters (defined as having more than ten reads from either or both of the progenitors) included reads from at least one allopolyploid. Furthermore, 66% of clusters contained reads from both allopolyploids. However, a small number of clusters were derived from only one of the four species included in the analysis.
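As a rough consistency check on the sampling depth, the number of 95 bp reads needed to represent 5% of each genome can be computed from the genome sizes quoted in this paper. The sketch below is illustrative only: the per-species read counts it produces are derived here and are not taken from the dataset itself (the actual values are given in Table S1).

```python
# Genome sizes (Mbp/1C) as quoted in the text.
GENOME_SIZE_MBP = {
    "N. sylvestris": 2637,
    "N. obtusifolia": 1511,
    "N. repanda": 5320,
    "N. nudicaulis": 3477,
}
READ_LENGTH_BP = 95   # read length after trimming
COVERAGE = 0.05       # 5% of each genome


def reads_for_coverage(genome_mbp, coverage=COVERAGE, read_len=READ_LENGTH_BP):
    """Number of reads of length read_len that together span coverage x genome."""
    return round(genome_mbp * 1_000_000 * coverage / read_len)


# Derived (not reported) read counts per species, and their total.
reads_per_species = {
    species: reads_for_coverage(size) for species, size in GENOME_SIZE_MBP.items()
}
total_reads = sum(reads_per_species.values())
```

Under these assumptions the total comes out within about 0.01% of the 6 812 631 reads reported above.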
Clusters correspond to families of repetitive DNA that may be assessed for similarity to known repeats as well as abundance in each genome. The largest cluster (CL1) comprised 81 004 reads, and sequence similarity searches indicate it is derived from a Ty3/Gypsy retroelement (examples of the resulting 3D networks produced from the clustering algorithm are shown in Data S1). Reads within the largest clusters were assembled into contiguous sequences using CAP3 (Huang and Madan, 1999) on a cluster-by-cluster basis. Resulting contigs ranged from 95 bp (the minimum possible, indicating complete overlap of more than one read) to several thousand nucleotides in length.
To investigate the nature of repetitive DNA in each species, we annotated repeat clusters using similarity to known repetitive DNA [using a RepBase library (Jurka et al., 2005) and RepeatMasker (Smit et al., 2010)]. Of the clusters we could identify as a known repeat, the majority were retroelements, contributing between 29.95 and 38.30% of the genome depending on the species (Table 1). Most of the retroelements were Ty3/Gypsy-like (between 24.31 and 33.32% of the genome), with Ty1/Copia-like elements being less abundant (between 3.31 and 4.71% of the genome; Table 1). The smallest genome analysed (N. obtusifolia) contained the smallest proportion of retroelements, whereas N. repanda (the largest genome) contained the most.
Table 1. Genome characterization in Nicotiana section Repandae
These clusters returned no matches to RepBase.
These clusters consist of fewer than ten reads from both or either of the progenitor diploids, and were not screened against RepBase.
Comparison of the major repetitive DNA component of the genome in two allopolyploids of section Repandae (N. repanda and N. nudicaulis) and close relatives of their diploid progenitors (N. sylvestris and N. obtusifolia). GR, genome representation; GP, genome proportion. The expected GR is the sum of GR in the progenitors, reflecting complete additivity in a nascent allotetraploid. Expectation is also given as a percentage of the genome (GP).
Although the repetitive fraction of all four Nicotiana genomes is dominated by retroelements, there are low levels of several other repeat types. For example, DNA transposons were estimated to contribute between 1.19 and 1.99% of the genome, with N. obtusifolia having the smallest genomic fraction and N. sylvestris the largest. We also identified a number of long interspersed elements (LINEs), short interspersed elements (SINEs), low-complexity and satellite repeat families in the dataset, but these showed low abundance in all four genomes (Table 1).
Comparing observed with expected values in allotetraploids
We compared the expected abundance of repeat clusters, assuming additivity with the parents, with the abundance observed in the two allotetraploid species (Figures 2-4). Of 3480 clusters where the expected number of reads was ≥ 10, 3170 and 3119 were found to have fewer reads than expected for N. repanda and N. nudicaulis, respectively. Many clusters (2997) were under-represented in both allotetraploids. For the majority of these (1605 clusters) under-representation was greatest in N. repanda, whereas only 515 clusters were most under-represented in N. nudicaulis. The remaining 877 clusters were equally under-represented in both allopolyploids. In contrast, 296 and 330 clusters were over-represented in N. repanda and N. nudicaulis, respectively.
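The cluster counts above can be cross-checked with simple arithmetic. The following sketch uses only the numbers reported in this paragraph; the variable names are illustrative.

```python
TOTAL_CLUSTERS = 3480  # clusters with an expected read count of at least ten

# Clusters with fewer reads than expected, per allotetraploid.
under_represented = {"N. repanda": 3170, "N. nudicaulis": 3119}

# Clusters under-represented in both allotetraploids, and how they split.
under_in_both = 2997
breakdown = {
    "greatest in N. repanda": 1605,
    "greatest in N. nudicaulis": 515,
    "equal in both": 877,
}

percent_under = {
    species: 100 * n / TOTAL_CLUSTERS for species, n in under_represented.items()
}
percent_under_both = 100 * under_in_both / TOTAL_CLUSTERS
```

The three sub-categories sum to 2997, and roughly 86% of the 3480 clusters are under-represented in both allotetraploids.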
We analysed all clusters for which we predicted fewer than ten reads in the allopolyploids (based on progenitor additivity). This analysis revealed an overall excess of 4869 reads derived from N. repanda. This small number represents a limited contribution to genome size change in this group of clusters. In comparison, N. nudicaulis exhibited a deficit of 112 528 reads among clusters predicted to contain fewer than ten reads.
A small number of clusters, with few or no reads derived from the diploids, are abundant in the allopolyploids. These are CL67, CL77, CL85, CL118, CL174 and CL237. Of these clusters, CL77 contributes the most to genome size change. For this cluster, we expected to see fewer than ten reads, but observed 13 520 and 5872 reads in N. repanda and N. nudicaulis, respectively. This cluster could not be classified into a repeat type, as it returned no hits to repeat databases.
Overall, N. repanda had a higher than expected repeat abundance: we classified 2 412 368 reads (229 174 960 bp) into repeat clusters, rather than the predicted 1 847 714 reads (175 532 830 bp). As we have analysed 5% of the genome, these additional reads equate to a 25.87% increase in genome size over that expected (4147 Mbp/1C). However, in N. nudicaulis, the total number of clustered reads was 1 526 997 (145 064 715 bp), which is less than expected (1 847 714 reads/175 532 830 bp), equivalent to a 14.69% decrease in genome size relative to expectation.
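The conversion from read counts to a percentage change in genome size can be made explicit. The sketch below uses the read counts, the 95 bp read length, the 5% sampling fraction and the expected genome size of 4147 Mbp/1C stated in the text; the function name is illustrative.

```python
READ_LENGTH_BP = 95
SAMPLED_FRACTION = 0.05       # 5% of each genome was analysed
EXPECTED_GENOME_MBP = 4147    # expected size under additivity, as given in the text


def genome_size_change_percent(observed_reads, expected_reads):
    """Scale a read-count difference in the 5% sample up to the whole genome
    and express it as a percentage of the expected genome size."""
    delta_bp_sampled = (observed_reads - expected_reads) * READ_LENGTH_BP
    delta_mbp_genome = delta_bp_sampled / SAMPLED_FRACTION / 1_000_000
    return 100 * delta_mbp_genome / EXPECTED_GENOME_MBP


repanda_change = genome_size_change_percent(2_412_368, 1_847_714)
nudicaulis_change = genome_size_change_percent(1_526_997, 1_847_714)
```

With these inputs the calculation gives about +25.87% for N. repanda and about -14.69% for N. nudicaulis, matching the values quoted above.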
In order to determine which clusters are associated with genome upsizing in N. repanda and genome downsizing in N. nudicaulis, we ranked the clusters by size and plotted the cumulative deviation in abundance from that expected in each of the allopolyploids (Figure 2a). This revealed that clusters with low abundance are under-represented in both N. repanda and N. nudicaulis [Figure 2a, from the origin to position (i)], with both allopolyploids following a similar pattern. At position (i), there is a step change where two repeat clusters are over-represented, and thereafter the trend continues for under-representation of reads derived from low-copy-number repeats [until position (ii); Figure 2a]. For clusters with higher expected values [from position (ii); Figure 2a], the repeats are predominantly over-represented in N. repanda, but generally under-represented in N. nudicaulis. For the largest 30 clusters [from position (iii) to maximum], the number of reads in each cluster is lower than expected for both N. repanda and N. nudicaulis.
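A minimal sketch of a cumulative deviation curve of this kind is given below. It assumes that clusters are ranked by their expected read count (smallest first) and that deviation is the observed minus the expected number of reads; the toy values are invented for illustration and are not data from this study.

```python
from itertools import accumulate


def cumulative_deviation(clusters):
    """clusters: list of (expected_reads, observed_reads) tuples.

    Returns the running sum of (observed - expected), with clusters
    ordered from the smallest to the largest expected read count.
    """
    ranked = sorted(clusters, key=lambda cluster: cluster[0])
    return list(accumulate(observed - expected for expected, observed in ranked))


# Invented example: two small clusters that lost reads, one mid-sized
# cluster that expanded and one large cluster that contracted.
toy_clusters = [(500, 650), (10, 6), (40, 25), (2000, 1800)]
curve = cumulative_deviation(toy_clusters)
```

A falling curve indicates clusters with fewer reads than expected, and a rising curve indicates over-represented clusters.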
In order to remove any effect of cluster size, the data were re-analysed to obtain cumulative proportional deviation scores. This score considers the observed number of reads divided by the expected number of reads for each cluster (see Figure 2b legend). A negative score indicates that the cluster size is smaller than expected, whereas a positive score indicates a larger cluster size. This analysis reveals that there are similar proportional losses in both allopolyploids for the majority of the range in cluster size [from the smallest to position (i) in Figure 2b]. However, there are exceptions; a few clusters are over-represented in both species, resulting in a considerable change to the cumulative deviation score [e.g. positions (i) and (ii) in Figure 2b].
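The exact scoring formula is given in the Figure 2b legend and is not reproduced here. As an illustration only, one form that is consistent with the description (negative for clusters smaller than expected, positive for larger ones) is the relative deviation, (observed - expected) / expected. This form is an assumption made for the sketch below.

```python
def proportional_deviation(expected, observed):
    """Relative deviation from expectation (an assumed form of the score).

    Negative values mean fewer reads than expected, positive values more.
    """
    if expected <= 0:
        raise ValueError("expected read count must be positive")
    return (observed - expected) / expected


# A small and a large cluster that have both lost a quarter of their
# reads receive the same score, so cluster size no longer dominates.
small_cluster_score = proportional_deviation(20, 15)
large_cluster_score = proportional_deviation(2000, 1500)
```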
To assess the impact of cluster abundance on the genome size discrepancy between N. repanda and N. nudicaulis, we compared the abundance of each cluster in the allopolyploids (Figure 2c). Most clusters have minimal or no effect on genome size differentiation (as indicated by falling on or near zero on the y axis). However, a number of clusters, the majority of which are inherited in high copy number, have a marked effect. Most of these clusters have a higher abundance in N. repanda and probably account for the majority of the genome size discrepancy between the two allopolyploids. The relationship between observed and expected cluster size in the allopolyploids is visualized in Figure S1.
Sequences causing genome size divergence in Repandae
For both allopolyploids, we summed the abundance (number of base pairs) for each repeat type (as identified by RepeatMasker, see above and Table 1) and compared this to expectation (progenitor additivity). We show deviation as a percentage of the expected genome size (Figure 3). For the majority of repeat types, deviation from additivity is minimal, e.g. DNA transposons and rDNA sequences. For N. nudicaulis, repeats of unknown origin (no similarity to known repetitive sequences) are under-represented and account for approximately 6% reduction in genome size, whereas these clusters are slightly over-represented in N. repanda.
In N. repanda, there is an over-abundance of Ty3/Gypsy-like retroelements relative to expectation, accounting for an approximately 13% increase in genome size. All clusters containing GAG and reverse transcriptase domains were further analysed in order to ascertain which families of Ty3/Gypsy elements are over-represented (Figure 3b). We found that most of the over-represented sequences, accounting for a 12.5% genome size increase in N. repanda, are chromovirus-like retroelements, while Tat-Ogre elements account for a 1% decrease in genome size. Low-complexity sequences and satellite repeats have also made a positive contribution to genome size change. For N. nudicaulis, there is a decrease in Ty3/Gypsy-like retroelements, contributing to an approximately 4% decrease in genome size.
Genome divergence in Repandae
To characterize the overall effect of diverging repeat abundance in the allopolyploids in comparison with the diploids, we performed a heatmap analysis, implemented in R. The heatmap reveals that most clusters are under-represented in both allopolyploids, mirroring the data displayed in Figure 3. In addition, the dendrograms in Figure 4(a,b) group species based on similarities in cluster abundance, and here the two allopolyploids differ: N. nudicaulis has repeat abundances that are closer to their expected values (i.e. additivity of abundance in diploids) and most similar to N. sylvestris, whereas N. repanda is a more divergent genome.
The greater similarity of N. nudicaulis to the diploids was further confirmed using GISH (Figure 5a), which revealed many sites of probe hybridization, including N. obtusifolia probe signal at sub-telomeric regions (red in Figure 5a) that may correspond to retention of a high-copy number satellite repeat called NPAL, inherited from N. obtusifolia (Koukalova et al., 2010). More uniform binding of the N. sylvestris probe was observed (green in Figure 5a). However, discrimination of probes is difficult, and it is not possible to resolve progenitor chromosome complements with any degree of certainty, although a few chromosomes are distinguishable as predominantly red or green.
In contrast, GISH to N. repanda using the same probes produced weaker binding of the N. sylvestris probe along most chromosomes, although there was stronger signal at sub-telomeric regions (green in Figure 5b). However, probe binding was weak compared with N. nudicaulis, particularly for the N. obtusifolia probe, for which little or no signal was detected for most metaphases. As in N. nudicaulis, it was not possible to resolve progenitor chromosome sets.
Recent advances have revealed that WGD is ubiquitous among angiosperms (Jiao et al., 2011), yet most flowering plants have a relatively small genome size (Bennett and Leitch, 2010) and appear to be functionally diploid (Soltis et al., 2005). This is thought to arise through loss of sequences after polyploidy, perhaps reflecting selection against large genome sizes (Leitch and Leitch, 2012). However, in Nicotiana section Repandae, polyploids show either genome upsizing or downsizing (Figure 1). This variation in genome size is likely to have arisen post-allopolyploidy, as the section is thought to have formed from a single origin approximately 5 million years ago (Clarkson et al., 2010).
Currently, we know relatively little about processes and patterns that lead to diploidization and genome downsizing in polyploid plants. To address this deficiency, we analysed differences in repetitive DNA content between the allopolyploids N. repanda and N. nudicaulis, and compared them with the extant diploids (N. sylvestris and N. obtusifolia) that are most closely related to their actual progenitors. Combining next-generation sequencing reads from these four Nicotiana species and subjecting the whole dataset to graph-based clustering allowed characterization, quantification and comparisons of repetitive DNA families between species (Figures 2-4, Table 1 and Data S1). Similar approaches have been used to characterize repeats in diploid (Macas et al., 2007, 2011; Novak et al., 2010) and allotetraploid species (Renny-Byfield et al., 2011).
Using extant diploids that are most closely related to the actual progenitors of section Repandae is the only option available for assessing changing patterns in repeat composition in association with polyploidy. Alterations in repeat abundance reflect changes that have occurred along branches leading to both the allopolyploids and diploids. However, our reconstruction of ancestral genome size using Markov chain Monte Carlo methods, coupled with our approaches assessing patterns of repeat divergence, both point towards genome downsizing in N. nudicaulis and upsizing in N. repanda. Furthermore, if the allopolyploids arose from a single event, as is likely (Clarkson et al., 2010), differences in repeat abundance between species of section Repandae must have occurred subsequent to allopolyploidy.
Genome diploidization through loss of low-copy number repeats
Most angiosperms fall within a narrow range of genome size (50% of species have a genome size less than 2500 Mbp/1C; Leitch and Leitch, 2012), despite multiple WGDs in their ancestry (Jiao et al., 2011). Furthermore, global analyses of genome sizes in angiosperms indicate that polyploid genomes tend to decrease in size subsequent to formation (Leitch and Bennett, 2004). For example, genome downsizing has been proposed in allotetraploid N. tabacum (Leitch and Bennett, 2004), and mechanisms leading to genome downsizing may have acted rapidly, as some repeats are lost even in synthetic lines that are just a few generations old (Skalicka et al., 2005; Renny-Byfield et al., 2012).
Despite genome upsizing in N. repanda and downsizing in N. nudicaulis, our analysis shows that repeats at low abundance are predominantly under-represented in both species (Figure 2a,b). Repeat reduction in the allopolyploids is also evident in the heatmap analysis, where the majority of repeat clusters (>85%) are under-represented (Figure 4, brown).
Mechanisms resulting in genome size change are poorly understood, although various recombination-based processes have been proposed (Kejnovsky et al., 2009; Grover and Wendel, 2010). For example, there is an approximately 80 Mbp difference in genome size between Arabidopsis thaliana and Arabidopsis lyrata, which is thought to be the result of differential rates of deletion. These deletions were often small, but numerous and common in non-coding and repetitive regions, including within transposable elements (Hu et al., 2011). In addition, unequal intra-strand homologous recombination and illegitimate recombination have been identified as mechanisms that remove transposable element insertions and thus contribute to genome size reduction (Devos et al., 2002; Kellogg and Bennetzen, 2004). In rice (Oryza sativa), removal of transposable elements has resulted in the loss of approximately 190 Mbp DNA, equivalent to a 38% change in genome size over five million years (Ma et al., 2004). It is possible that similar mechanisms affect the low-copy-number fraction of the genome in section Repandae.
A loss of low-copy sequences is potentially an integral part of the diploidization process, and is associated with loss of genes and upstream regulator regions. The loss of DNA may arise because of reduced selective constraints arising from genome duplication (Freeling et al., 2012). Such diploidization processes may be ubiquitous in early polyploid divergence.
Expansion of high-copy-number repeats in N. repanda
Despite the general trend that most clusters are under-represented in both allotetraploids (Figure 3), there is evidence for substantial expansion of a small number of repeats (Figure 2a,b). Indeed, over-representation of these repeats is equivalent to a 26.0% increase in genome size in N. repanda (Figure 2a–c), close to the 24% genome size change predicted here (Figure 1). Furthermore, these repeat families are predominantly Ty3/Gypsy retroelements (particularly chromovirus-like retroelements) and low-complexity sequences (Figure 3), which have been inherited from the diploid progenitors in high copy number (Figure 2a–c). In contrast, there is evidence for loss of high-copy-number repeats in the genome of N. nudicaulis (Figure 2a,b), some of which are Ty3/Gypsy retroelements (Figure 3), contributing to a reduction of approximately 14% in genome size, similar to the 19% estimated using Markov chain Monte Carlo approaches (Figure 1). These observations suggest that differential deletion and/or accumulation of high-copy-number repeats in these two allotetraploids is largely responsible for their varying genome size.
Transposable elements are often major contributors to angiosperm genomes (Kumar and Bennetzen, 1999), and Ty3/Gypsy-like retroelements are particularly prevalent (Macas and Neumann, 2007; Macas et al., 2011), being present in all four species examined here (Table 1). The excess of Ty3/Gypsy-like retroelements in N. repanda contributes most to the increased genome size (Figure 3). Similarly, differential accumulation of Ty3/Gypsy-like Gorge3 transposable elements produced a threefold variation in genome size in diploid Gossypium (Hawkins et al., 2006). We know there is potential for genome size change to occur rapidly in allopolyploids, as activation and integration of transposable elements may occur after only a few generations (Petit et al., 2010). Together with the data presented here, these observations suggest that transposable element dynamics play an important role in governing genome size after allopolyploidy.
The dual process of DNA loss in the low-copy-number fraction (diploidization) and large-scale changes in the high-copy-number fraction leads to ‘genome turnover’ (Lim et al., 2007). Analyses of retroelement insertions (Ramakrishna et al., 2002; Ma et al., 2004; Bennetzen, 2005) and nuclear integrants from the plastid genome (Matsuo et al., 2005) have suggested that genome turnover occurs at a rapid rate, with retroelement half-lives of only one to a few million years. Turnover of DNA results in the loss of GISH signal previously reported in N. nesophila (section Repandae) (Clarkson et al., 2005; Lim et al., 2007) and shown here, particularly in N. repanda (Figure 5). The loss of sub-genome discrimination by GISH indicates that genome turnover has acted to homogenize the genomes. The genomes of both allopolyploids are no longer compartmentalized, as in a nascent allopolyploid, and in this respect have returned to a more diploid-like state. Perhaps genome homogenization results from the loss of repeats with low abundance and the transfer of repeats between sub-genomes.
Differences in genome size between N. repanda and N. nudicaulis appear to be a consequence of differential deletion and/or accumulation of the high-copy-number fraction of the genome. It follows that evolution and amplification of de novo repetitive DNA sequences have had only minimal effects on genome size variation and genome divergence in these two species. On the other hand, diploidization of the genomes in both allopolyploids is associated with loss of low-copy-number nuclear sequences and blending of the two progenitor sub-genomes. As all angiosperms are paleopolyploids (Jiao et al., 2011), it is likely that the processes we describe here, i.e. genome size change and diploidization, have played key roles in their evolution.
Phylogenetic analysis and ancestral genome size reconstruction
As all previous analyses have yielded congruent results regarding the evolutionary relationships between members of section Repandae (Chase et al., 2003; Clarkson et al., 2004, 2005, 2010; Kelly et al., 2012), datasets were constructed from combined sequence data for each of the two parental sub-genomes. For the N. sylvestris-like sub-genome dataset, sequences from four plastid regions (Clarkson et al., 2004), the nuclear ribosomal internal transcribed spacer (Chase et al., 2003), and regions of the low-copy nuclear genes NUCLEAR ENCODED PLASTID EXPRESSED GLUTAMINE SYNTHASE (npGS; Clarkson et al., 2010), ALCOHOL DEHYDROGENASE (ADH) and LEAFY/FLORICAULA (LFY/FLO; Kelly et al., 2012) were used. For the N. obtusifolia-like sub-genome dataset, sequences from GS, ADH, LFY/FLO and the non-transcribed spacer of 5S nuclear ribosomal DNA (Clarkson et al., 2005) were used. All regions were aligned separately using PRANK+F (Loytynoja and Goldman, 2008), and then combined before further optimization by eye using Mesquite version 2.74 (Maddison and Maddison, 2008). Phylogenetic reconstruction by Bayesian inference was performed using MrBayes version 3.1.2 (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003). Separate partitions were used for different codon positions, introns, non-coding spacers and RNA coding regions, applying the best-fit model of evolution for each partition as selected using the Akaike information criterion in MrModelTest version 2.3 (Nylander, 2004). Methods S1 provides further detailed information.
The ancestral genome size at each node of the phylogenetic tree was reconstructed using BayesTraits version 1.1beta (http://www.evolution.reading.ac.uk/BayesTraits.html) by analysing genome sizes for the four extant species of section Repandae (genome sizes taken from Leitch et al., 2008) as continuously varying (Pagel, 1997, 1999), together with trees from the MrBayes analysis. Values for ancestral genome size for section Repandae as a whole, the most recent common ancestor of N. repanda, N. nesophila and N. stocktonii, and the most recent common ancestor of N. nesophila and N. stocktonii were calculated by averaging all estimates for these nodes from the 90 000 post-burn-in iterations. Methods S1 provides further detailed information.
We used Nicotiana obtusifolia (accession number 8947501/176) and N. nudicaulis (accession number 964750051) (both from the Botanical and Experimental Garden, Radboud, University of Nijmegen, The Netherlands), N. sylvestris (accession number ITB626) (from the Tobacco Institute, Imperial Tobacco Group, Bergerac, France), and N. repanda (accession number TW18) (from the United States Department of Agriculture, North Carolina State University, NC).
DNA extractions were performed as described by Fojtova et al. (2003). We sequenced a random sample of DNA from the genomes of N. sylvestris, N. obtusifolia, N. repanda and N. nudicaulis using an Illumina Genome Analyzer IIx (http://www.illumina.com/systems/genome_analyzer_iix.ilmn), at the Genome Centre, Queen Mary University of London, generating 108 bp reads. Raw sequence reads were deposited at the Sequence Read Archive at the National Center for Biotechnology Information under the study accession numbers SRA045794 and SRA051392. Resulting sequence reads were then screened for quality and removed if they contained more than five unidentified nucleotides or were shorter than 95 bp in length. All sequences that passed quality checks were trimmed to 95 bp and screened against plastid genomes, and reads with significant similarity were removed from further analysis.
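The quality screen and trimming step can be sketched as follows. This is an illustration rather than the script used in this study: it assumes that unidentified nucleotides are recorded as N, that they are counted before trimming, and that reads are trimmed from the 3' end. The plastid screen is not shown.

```python
MIN_LENGTH = 95          # reads shorter than this are discarded
MAX_UNIDENTIFIED = 5     # reads with more than this many N bases are discarded


def filter_and_trim(reads):
    """Keep reads that pass both quality criteria, trimmed to MIN_LENGTH bases."""
    kept = []
    for seq in reads:
        if len(seq) < MIN_LENGTH:
            continue  # too short
        if seq.upper().count("N") > MAX_UNIDENTIFIED:
            continue  # too many unidentified nucleotides
        kept.append(seq[:MIN_LENGTH])  # assumed: trim from the 3' end
    return kept


# Invented reads: two pass, one is too short, one has too many N bases.
example_reads = ["A" * 108, "A" * 90, "A" * 100 + "N" * 8, "ACGT" * 27]
clean_reads = filter_and_trim(example_reads)
```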
Clustering and repeat identification
A random sample of 5% of each genome was combined into a single dataset and subjected to a graph-based clustering procedure as described by Novak et al. (2010). Details of the data used in this analysis are provided in Table S1. This approach identifies repetitive DNA families using a ‘community’ approach by grouping high-throughput sequencing reads into clusters based on shared sequence similarity. Each sequence read was compared with all other reads in a pairwise analysis using MGBLAST (Altschul et al., 1990), whereby a hit required at least 90% sequence identity along 55% of the sequence read. Graph-based clustering was implemented in the R programming language, using an algorithm that detects sets of reads that are more densely connected to each other than to other reads. These groups are termed ‘clusters’, and correspond to families of repetitive DNA that were characterized further.
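The hit criterion and the idea of grouping reads by shared similarity can be illustrated with a deliberately simplified sketch. It is not the community-detection algorithm of Novak et al. (2010): connected components (via union-find) are used here as a simpler stand-in, and the input format (read pair, percent identity, aligned length) is hypothetical.

```python
READ_LENGTH = 95
MIN_IDENTITY = 90.0           # percent identity required for a hit
MIN_OVERLAP_FRACTION = 0.55   # fraction of the read that must be aligned


def is_hit(percent_identity, aligned_length, read_length=READ_LENGTH):
    """Apply the similarity thresholds described in the text."""
    return (percent_identity >= MIN_IDENTITY
            and aligned_length >= MIN_OVERLAP_FRACTION * read_length)


def group_reads(read_ids, alignments):
    """Group reads linked by hits into connected components.

    alignments: iterable of (read_a, read_b, percent_identity, aligned_length).
    This is a simplification; the published method detects densely connected
    communities rather than plain connected components.
    """
    parent = {read: read for read in read_ids}

    def find(read):
        while parent[read] != read:
            parent[read] = parent[parent[read]]  # path halving
            read = parent[read]
        return read

    for read_a, read_b, identity, length in alignments:
        if is_hit(identity, length):
            parent[find(read_a)] = find(read_b)

    groups = {}
    for read in read_ids:
        groups.setdefault(find(read), set()).add(read)
    return sorted(groups.values(), key=len, reverse=True)


# Invented alignments: only the first two satisfy both thresholds.
toy_reads = ["r1", "r2", "r3", "r4", "r5"]
toy_alignments = [
    ("r1", "r2", 95.0, 80),
    ("r2", "r3", 91.0, 60),
    ("r4", "r5", 85.0, 95),   # identity too low
    ("r3", "r4", 99.0, 40),   # overlap too short
]
toy_groups = group_reads(toy_reads, toy_alignments)
```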
Sequences within the largest clusters were analysed to produce a 3D network for each cluster, enabling visualization of similarity between individual reads. Sequence reads (nodes) were connected by edges, where edge weight is proportional to sequence similarity. Nodes were then positioned using a Fruchterman–Reingold algorithm by which reads with extensive similarity are placed close together and those that share little or none are placed further away. Subsequently, the 3D networks were viewed and inspected using the SeqGrapheR program (Novak et al., 2010).
Sequence similarity between reads may be interpreted in two ways: (i) the Illumina sequencer has read the same genomic region more than once, or (ii) the sequences cover regions within repetitive DNA. As each genome was skimmed to a depth of 5%, it is most probable that reads with sequence overlap arise from similar repetitive DNA rather than coverage of the same genomic region. As all sequences are the same length, it follows that the number of reads in each cluster is a measure of abundance within the original dataset. Therefore, a count of the number of sequence reads from each species within a cluster gives a quantitative measure of abundance in the genome of each species. For each cluster we counted the number of reads and calculated the genome proportion (a percentage of the genome) for all four species. Thus genome representation (total contribution of a cluster to the dataset, in bp) and genome proportion are reflective of the total contribution of a given cluster/repeat family to genome size, and are not measures of copy number per se. For example, 10 000 copies of an LTR retroelement 5 kb long have a smaller genome proportion than the same number of elements that are 12 kb in size.
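The two abundance measures can be written out as follows. This is a minimal sketch in which genome proportion is taken as the share of a species' analysed reads that fall in a cluster; the read counts in the example are invented for illustration.

```python
READ_LENGTH_BP = 95


def genome_representation_bp(reads_in_cluster):
    """Genome representation (GR): contribution of a cluster to the dataset, in bp."""
    return reads_in_cluster * READ_LENGTH_BP


def genome_proportion_percent(reads_in_cluster, total_reads_for_species):
    """Genome proportion (GP): percentage of a species' reads found in a cluster."""
    return 100 * reads_in_cluster / total_reads_for_species


# Invented example: 28 000 of 1 400 000 reads from one species fall in a cluster.
example_gr = genome_representation_bp(28_000)
example_gp = genome_proportion_percent(28_000, 1_400_000)
```

Because every read has the same length, both measures scale directly with read count, which is why they reflect the total DNA contributed by a repeat family and not the number of element copies.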
After graph-based clustering, reads were assembled using the TGICL (Pertea et al., 2003) version of CAP3 (Huang and Madan, 1999) on a cluster-by-cluster basis, requiring 80% sequence similarity along a 40 bp length. Clusters consisting of at least ten reads were assessed for sequence similarity to a database of known repetitive sequences (RepBase 16.03, Jurka et al., 2005) using RepeatMasker (Smit et al., 2010) (with the -s option, which invokes slower and more sensitive searches). To avoid spurious labelling of clusters, only those descriptions encompassing at least 10% of the reads or totalling 100 hits were considered. The number of reads in clusters with the same description was summed in order to calculate genome proportion for all repeat types.
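The labelling threshold can be illustrated with a short sketch. It assumes one repeat description per matching read and treats "totalling 100 hits" as at least 100 hits; the input format and the function name are hypothetical, and this is not the script used in this study.

```python
from collections import Counter


def accepted_annotations(hit_labels, reads_in_cluster,
                         min_fraction=0.10, min_hits=100):
    """Return the repeat descriptions that pass the labelling threshold.

    hit_labels: one description per read with a database match (assumed format).
    A description is kept if it covers at least min_fraction of the reads in
    the cluster, or if it has at least min_hits hits.
    """
    counts = Counter(hit_labels)
    return {
        label: n for label, n in counts.items()
        if n >= min_fraction * reads_in_cluster or n >= min_hits
    }


# Invented cluster of 2000 reads: only the description with 120 hits is kept.
toy_labels = ["Ty3/Gypsy"] * 120 + ["Ty1/Copia"] * 30 + ["LINE"] * 5
toy_accepted = accepted_annotations(toy_labels, reads_in_cluster=2000)
```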
Comparing deviation of repeat abundance in the allotetraploids
At the outset of allopolyploidy, the abundance of a given repeat is expected to equal the sum of its abundance in the two diploid progenitors. Using this logic, we assessed each cluster for deviation from expectation in the allotetraploids. We also calculated the cumulative deviation from additivity across the range of expected repeat abundance, including only those clusters where we expected ten or more reads based on what was observed in the diploids.
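The additivity expectation can be sketched as follows. Because each genome was sampled at the same 5% depth, the read counts of the two diploids can be summed directly to give the expected count in an allotetraploid. The cluster names and counts below are invented for illustration.

```python
def deviation_from_additivity(cluster_counts, min_expected=10):
    """Compare observed allotetraploid read counts with progenitor additivity.

    cluster_counts: {cluster: {"sylvestris": n, "obtusifolia": n,
                               "repanda": n, "nudicaulis": n}}
    Clusters with an expected count below min_expected are skipped.
    """
    result = {}
    for cluster, counts in cluster_counts.items():
        expected = counts["sylvestris"] + counts["obtusifolia"]
        if expected < min_expected:
            continue
        result[cluster] = {
            "expected": expected,
            "repanda": counts["repanda"] - expected,
            "nudicaulis": counts["nudicaulis"] - expected,
        }
    return result


# Invented counts for two hypothetical clusters.
toy_counts = {
    "cluster_a": {"sylvestris": 60, "obtusifolia": 40, "repanda": 130, "nudicaulis": 85},
    "cluster_b": {"sylvestris": 3, "obtusifolia": 2, "repanda": 9, "nudicaulis": 1},
}
toy_deviation = deviation_from_additivity(toy_counts)
```

In this toy example the first cluster is over-represented in N. repanda (+30 reads) and under-represented in N. nudicaulis (-15 reads), and the second is excluded because fewer than ten reads are expected.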
Cells at metaphase were accumulated in freshly harvested root-tip meristems by pre-treatment in saturated Gammexane® (hexachlorocyclohexane, Sigma, http://www.sigmaaldrich.com/united-kingdom.html) in water for 4 h. Subsequently root tips were fixed for 24 h in 3:1 absolute ethanol/glacial acetic acid, and stored in 100% ethanol at −20°C. Root-tip material was spread onto acid-cleaned glass slides following enzyme digestion as described by Lim et al. (1998), and checked for quality using phase-contrast microscopy.
We thank the Natural Environment Research Council for PhD studentship funding, and Dr Richard Buggs and Dr Ilia Leitch (Royal Botanic Gardens, Kew, Richmond Surrey, UK) for constructive comments on the manuscript. The work was partially supported by the Czech Science Foundation (P501/13/10057S and P501/12/G090). We thank Robert Horton and Christopher Walker for their help with the high-performance computing cluster at Queen Mary University of London.