Characterization of the LTR retrotransposon repertoire of a plant clade of six diploid and one tetraploid species




Comparisons of closely related species are needed to understand the fine-scale dynamics of retrotransposon evolution in flowering plants. Towards this goal, we classified the long terminal repeat (LTR) retrotransposons from six diploid and one tetraploid species of Orobanchaceae. The study species are the autotrophic, non-parasitic Lindenbergia philippensis (as an out-group) and six closely related holoparasitic species of Orobanche [O. crenata, O. cumana, O. gracilis (tetraploid) and O. pancicii] and Phelipanche (P. lavandulacea and P. ramosa). All major plant LTR retrotransposon clades could be identified, and appear to be inherited from a common ancestor. Species of Orobanche, but not Phelipanche, are enriched in Ty3/Gypsy retrotransposons due to a diversification of elements, especially chromoviruses. This is particularly striking in O. gracilis, where tetraploidization seems to have contributed to the Ty3/Gypsy enrichment and led to the emergence of seven large species-specific families of chromoviruses. The preferential insertion of chromoviruses in heterochromatin via their chromodomains might have favored their diversification and enrichment. Our phylogenetic analyses of LTR retrotransposons from Orobanchaceae also revealed that the Bianca clade of Ty1/Copia and the SMART-related elements are much more widely distributed among angiosperms than previously known.


In angiosperms, nuclear genome size varies 2400-fold (Pellicer et al., 2010), largely because of different proportions of non-coding DNA, especially repetitive DNA (Leitch, 2007). Apart from whole-genome duplications (polyploidization), the main cause of genome size increase is the accumulation of tandem-repeat DNA families and transposable elements (TEs). Variation in nuclear genome size is of major evolutionary importance because it determines key traits, such as the duration of the cell cycle, that directly impact fitness (Gregory and Hebert, 1999; Meagher and Vassiliadis, 2005; Gruner et al., 2010). The insertion and accumulation of TEs is therefore expected to be counter-selected and transposition activity suppressed, for example, by TE autoregulation (Simmons and Bucholz, 1985; Lohe and Hartl, 1996; Lohe et al., 1996), protein regulators (Adams et al., 1997), RNA silencing (Sarot et al., 2004; Aravin et al., 2007; Brennecke et al., 2007; Olivieri et al., 2010) and methylation (Verbsky and Richards, 2001; Bird, 2002; Slotkin and Martienssen, 2007). Under stable genomic conditions, transposition activity is therefore probably low.

Genomic stresses, however, can facilitate transposition (Zeh et al., 2009), and polyploidization (which often accompanies hybridization) is one such stress thought to promote the proliferation of TEs (Kashkush et al., 2002; Liu and Wendel, 2003; Shan et al., 2005; Chen and Ni, 2006; Renny-Byfield et al., 2011). The resulting temporary increase of genome size is sometimes counterbalanced by rapid genome ‘downsizing’ (Bennetzen, 2002; Leitch and Bennett, 2004; Skalická et al., 2005; Hawkins et al., 2008; Mun et al., 2009; Eilam et al., 2010; Renny-Byfield et al., 2011). Thus far, the evidence for such downsizing via the loss of repeat types, particularly Ty3/Gypsy retrotransposons, comes mainly from the allotetraploid Nicotiana tabacum (Renny-Byfield et al., 2011). A recent analysis of the repetitive DNA in nine species of Orobanchaceae of different life histories (seven holoparasitic species, one hemiparasitic species and one autotrophic species; Piednoël et al., 2012) also pointed to genome downsizing in the tetraploid species included in the sample. The genomic proportions of repetitive DNA varied greatly among the nine species, ranging from 25 to 60%, with long terminal repeat (LTR) retrotransposons making up most of the repetitive DNA; the tetraploid species differed substantially in Ty3/Gypsy families.

Retrotransposons, a TE class specific to eukaryotes, transpose via an RNA intermediate. Based on structural features and phylogenetic relationships, five orders of retrotransposons have been defined (Wicker et al., 2007): LTR retrotransposons; tyrosine recombinase-encoding retrotransposons (e.g. DIRS1-like elements); Penelope elements; long interspersed elements (LINEs); and short interspersed elements (SINEs). The LTR retrotransposons, related to retroviruses (Xiong and Eickbush, 1990), usually encode two open reading frames (ORFs): one called gag, which encodes a structural protein for virus-like particles, and another called pol, which encodes enzymatic domains involved in the transposition cycle, such as an aspartic protease (AP), a reverse transcriptase (RT), an RNase H (RH) and an integrase (INT). The two major superfamilies of plant LTR retrotransposons are Ty1/Copia and Ty3/Gypsy (see Velasco et al., 2010; table S6), which differ in their pol gene order (Capy et al., 1997; Wicker et al., 2007; Eickbush and Jamburuthugoda, 2008): the RT and RH genes are located upstream of the INT gene in Ty3/Gypsy, but downstream in Ty1/Copia.
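The domain-order criterion described above can be illustrated with a minimal sketch. The function name and the domain labels are hypothetical and serve only as an illustration; real classification, as in this study, relies on phylogenetic analysis rather than on domain order alone.

```python
def classify_superfamily(pol_domains):
    """Guess the superfamily from the 5'->3' order of pol domains.

    pol_domains: sequence of labels such as ["AP", "RT", "RH", "INT"].
    The labels are illustrative assumptions, not a fixed annotation format.
    """
    order = list(pol_domains)
    if "RT" not in order or "INT" not in order:
        return "unknown"  # cannot decide without both domains
    # Ty3/Gypsy: RT (and RH) lie upstream of INT; Ty1/Copia: downstream.
    if order.index("RT") < order.index("INT"):
        return "Ty3/Gypsy"
    return "Ty1/Copia"


print(classify_superfamily(["AP", "INT", "RT", "RH"]))
print(classify_superfamily(["AP", "RT", "RH", "INT"]))
```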

In the present study, we characterize the TE dynamics in Orobanchaceae at a finer scale than previously achieved in this or any other flowering plant clade. For this purpose, we analyzed the Ty1/Copia and Ty3/Gypsy elements phylogenetically, using seven of the nine species to represent closest relatives and one out-group (Lindenbergia philippensis, Orobanche crenata, Orobanche cumana, Orobanche gracilis, Orobanche pancicii, Phelipanche ramosa and Phelipanche lavandulacea; Figure 1). We wondered whether specific elements are responsible for the Ty3/Gypsy diversification, and we also wanted to know how the Ty1/Copia and Ty3/Gypsy families reacted to the tetraploidization in O. gracilis. Earlier studies have taken a similar approach, but were usually based either on a single species (Domingues et al., 2012) or on a single clade of elements (Gao et al., 2012). Comparative classifications of TEs from closely related species have also been performed on species in which TEs had previously been well characterized (Wicker and Keller, 2007; Estep et al., 2013). Our study is the first to exhaustively sample and classify all highly and moderately repeated Ty1/Copia and Ty3/Gypsy families based on next-generation sequencing data from species in which elements have not been previously well characterized. Among the unexpected results is the wide distribution of the Bianca clade and of SMART-related elements across angiosperms.

Figure 1.

Phylogenetic relationships and key genomic parameters of the seven Orobanchaceae species studied. Genomic proportions and species phylogeny from Piednoël et al. (2012), except for recalculated values for Phelipanche lavandulacea.


Clusters are hypothesized to be TE families, each consisting of related elements (based on all-to-all blast and graph-based clustering, see Experimental procedures). Within each cluster, contigs were obtained by read assembly, using an identity threshold of 80% over at least 40 bp. The effect of using a single contig per family was tested with a phylogenetic analysis of the largest Ty1/Copia family from O. pancicii, which we called Opan_CL2 (Figure 2). All individual contigs from the Opan_CL2 family formed a highly supported clade (99% bootstrap support). Given this result, we subsequently included only one representative contig per cluster (family).

Figure 2.

Phylogenetic relationships in the Opan_CL2 family, inferred from neighbour-joining analysis of the reverse transcriptase encoding domain. Contigs from the Opan_CL2 family are indicated in dark red. Statistical support (>70%) comes from parametric bootstrapping using 100 replicates.

Phylogenetic structure of Ty1/Copia elements in Orobanchaceae

Phylogenetic analyses were performed for each of the seven Orobanchaceae species using reference TEs, from the Ty1/Copia clades found in plants: Hopscotch, Tos17, SIRE1/Maximus, Tnt1, Angela, Tont1 and Bianca (Wicker and Keller, 2007; Llorens et al., 2009; Hribová et al., 2010). The tree composed of Ty1/Copia families from O. gracilis is shown in Figure 3 (the other species trees are shown in Figures S1–S6).

Figure 3.

Phylogenetic relationships in Ty1/Copia elements from Orobanche gracilis, inferred from neighbour-joining analysis of the reverse transcriptase encoding domain. Families from O. gracilis are indicated in dark red. Families that are widely distributed in Orobanche but are lost in O. gracilis are indicated in green. Statistical support (>70%) comes from parametric bootstrapping using 100 replicates.

For each species, most of the seven Ty1/Copia clades have a bootstrap support ≥80% (Figure 3). Only a few clades, such as Tos17 and Tont1, are weakly supported. Five of the Ty1/Copia clades (SIRE1/Maximus, Tnt1, Angela, Tont1 and Bianca) occur in all seven species. Hopscotch is restricted to Phelipanche species and Tos17 is restricted to P. ramosa, where it makes up 0.01% of the genome. The elements FRetro64 and OSCOPIA2, related to the small LTR retrotransposons (SMARTs), cluster together with the element Victim, either as the sister-group of the Hopscotch clade (e.g. Figure 3) or nested within it (e.g. Figure 2). We accordingly considered FRetro64, OSCOPIA2 and Victim to belong to the Hopscotch clade. The phylogenetic analyses also revealed that one element from P. ramosa and two elements from P. lavandulacea are closely related to the SMART retrotransposons.

In each species, most Ty1/Copia families could be included in the phylogenies because their contigs harbored reliable matches with the RT domain; fewer than 13 families per species could not be included. Most of these families were instead assigned to clades using a blast-based classification (see Experimental procedures); <0.2% of the genomes remained unassigned to TE clades (Figure 4). The SIRE1/Maximus families make up the largest proportions of Ty1/Copia in Orobanchaceae (Figure 4), ranging from 10.5% of the genome in O. cumana to 15.3% in P. ramosa. TEs from the Angela clade are also abundant, contributing between 2.2% of the genome in L. philippensis and 4.9% in O. crenata. Families from the Hopscotch clade make up 2.4 and 3.9% of the P. ramosa and P. lavandulacea genomes, respectively, whereas they are undetectable in three of the four Orobanche species, the exception being O. cumana, in which they make up 0.04% of the genome. The Tnt1, Tont1 and Bianca clades also occur in low genomic proportions (0.1–0.8%), except for Tont1, which makes up ~3.8% of the O. crenata genome.

Figure 4.

Ty1/Copia clade distribution (%) among the seven Orobanchaceae species. NC: not classified.

To date, Bianca elements have been reported from only a few species. Here we found, however, that L. philippensis contains two Bianca families, O. gracilis and O. cumana each contain four families, and the remaining species each have five Bianca families. To test whether the presence of Bianca elements in Orobanchaceae results from an underestimation of their distribution in angiosperms or from horizontal transfer(s), we performed similarity searches against the nr/nt database from NCBI. Low E-values were obtained for the Elote1 element from maize and for sequences from Arachis hypogaea, Beta vulgaris, Brassica rapa, Capsella rubella, Citrullus lanatus and Vitis vinifera, as well as from Ipomoea trifida and Solanum lycopersicum. When these high-similarity elements were included in a phylogenetic analysis, they clustered together in a single Bianca clade (Figure 5).

Figure 5.

Phylogenetic relationships in the Bianca clade, inferred from neighbour-joining analysis of the reverse transcriptase encoding domain. Families from Orobanche are indicated in dark red, families from Phelipanche are indicated in red, families from Lindenbergia philippensis are indicated in blue and the Bianca element is indicated in pink. Additional sequences are indicated in green. Statistical support (>70%) comes from parametric bootstrapping using 100 replicates.

Phylogenetic structure of Ty3/Gypsy elements in Orobanchaceae

Similar to our phylogenetic analyses of the seven known Ty1/Copia clades, we compared Orobanchaceae Ty3/Gypsy elements with the seven known plant clades from this TE superfamily: Tekay, Galadriel, CRM, Reina, Athila, Ogre and Tat (Llorens et al., 2009; Hribová et al., 2010). Figure 6 shows the resulting tree for O. gracilis (the other species trees are shown in Figures S7–S12). The element clades have bootstrap supports of ≥80%, except for Reina and Tat. Families that could not be included in the phylogenetic analyses (~20–30 in each species) were classified using the blast-based approach (see Experimental procedures). Only a few families, comprising less than ~1% of the genomes, could not be assigned. Two families (Ocre_CL223 and Lind_CL46), which turned out to be caulimoviruses instead of Ty3/Gypsy, make up 0.01 and 0.08% of the genomes in which they were found (O. crenata and L. philippensis, respectively). The genomic proportion of Ty3/Gypsy in L. philippensis is thus lower (1.85%) than previously calculated (1.93%; Piednoël et al., 2012).

Figure 6.

Phylogenetic relationships in Ty3/Gypsy elements from Orobanche gracilis, inferred from neighbour-joining analysis of the reverse transcriptase encoding domain. Families from O. gracilis are indicated in dark red, and species-specific families from this species are labeled with an ‘SPE’ tag in their name. Families that are widely distributed in Orobanche, but are lost in O. gracilis, are indicated in green. Statistical support (>70%) comes from parametric bootstrapping using 100 replicates.

Three of the seven plant Ty3/Gypsy clades (Tekay, CRM and Athila) are found in all seven species (Figure 7). Tekay is more abundant in Orobanche (>9.7%) than in Phelipanche (6.0–6.5%), and makes up 0.78% of the genome of the out-group L. philippensis. CRM elements make up a lower proportion in Phelipanche (<0.1%) than in Orobanche (from 0.3% in O. crenata up to 0.7% in O. gracilis), and the single CRM family in L. philippensis (Lind_CL124; 0.02%) was only detected using blast. Athila families, by contrast, are more abundant in Phelipanche (5.6%) than in Orobanche (<3.5%) and comprise a substantial proportion of the L. philippensis genome (0.43%). Tat appears absent from L. philippensis, but is ubiquitously distributed in the other species, where it makes up 2.8–7.1% of the genomes. The remaining clades are rarer, with Reina and Ogre restricted to O. gracilis (0.05 and 0.08% of the genome, respectively), and Galadriel restricted to O. gracilis (~0.7% of the genome), O. pancicii, O. crenata and L. philippensis (<0.1% of their genomes).

Figure 7.

Ty3/Gypsy clade distribution (%) among the seven Orobanchaceae species. NC: not classified.

LTR retrotransposon dynamics in the tetraploid species O. gracilis

To better understand the genome modifications that occurred in the tetraploid O. gracilis, we investigated in detail both its species-specific LTR retrotransposons and the TE families it has lost (compared with the remaining Orobanche). None of its species-specific families belong to Ty1/Copia, whereas 11 belong to Ty3/Gypsy. These 11 make up 5.85% of the O. gracilis genome (Figure 6), and comprise seven Tekay families, two Athila families and two Galadriel families. The seven Tekay families by themselves make up 5.12% of the O. gracilis genome. TE families that are lost in O. gracilis are two Ty3/Gypsy families (Tekay) and four Ty1/Copia families (one Angela, one SIRE1/Maximus and two Bianca; Figures 3 and 6, and blast-based classification).


This study is a fine-scale analysis of the Ty1/Copia and Ty3/Gypsy LTR retrotransposons in closely related species of Orobanchaceae, including a young tetraploid and a phylogenetically more distant species as an out-group. Most of the TE families belong to clades commonly found in plants (Hribová et al., 2010; Staton et al., 2012). Two families (Ocre_CL223 and Lind_CL46), which we previously classified as Ty3/Gypsy (Piednoël et al., 2012), are in fact caulimoviruses and make up 0.01 and 0.08% of the genomes in which they were found (O. crenata and L. philippensis, respectively). Before this study, Bianca elements had been reported from only a few species, including Arabidopsis thaliana, Lotus japonicus, Medicago truncatula, Oryza sativa (rice) and Triticeae (Schulman and Kalendar, 2005; Holligan et al., 2006; Wicker and Keller, 2007; Wang and Liu, 2008), which are distantly related to Orobanchaceae. It is now clear, however, that the Bianca clade of Ty1/Copia, which also comprises the Elote1 element from Zea mays (maize), is more widely distributed across angiosperms, occurring in Brassicales (Arabidopsis thaliana, Brassica rapa and Capsella rubella), Caryophyllales (Beta vulgaris), Cucurbitales (Citrullus lanatus), Fabales (Arachis hypogaea, Lotus japonicus and Medicago truncatula), Lamiales (Orobanchaceae spp.), Poales (O. sativa and Triticeae spp.), Solanales (Ipomoea trifida and Solanum lycopersicum) and Vitales (Vitis vinifera). This suggests that Bianca originated early during angiosperm evolution, which fits the hypothesis that Bianca may be the most ancient Ty1/Copia clade in angiosperms (Wicker and Keller, 2007).

Considering the wide distribution of Bianca and the poorly resolved phylogenetic relationships between elements from Orobanchaceae and the other species (Figure 5), we hypothesize that the Bianca clade is vertically inherited in the Orobanchaceae family. No uncorrupted full-length Bianca elements from Orobanchaceae are known, and these elements are therefore probably no longer active, even though the Elote1 element transposed ‘recently’ in inbred maize (Wang and Dooner, 2006). In Musa acuminata (banana; Hribová et al., 2010), Saccharum officinarum (sugar cane; Domingues et al., 2012) and Glycine max (soybean; Du et al., 2010), Bianca has not been detected, possibly because of low family number, judging from its low diversity in other angiosperms where it has been detected (Orobanchaceae, this study; Triticeae, rice and Arabidopsis, Wicker and Keller, 2007). Likewise, our study reveals the first elements (Plav_CL114 and Plav_180, Figure S4; Pram_CL107, Figure S5) related to SMARTs outside monocotyledons (Gao et al., 2012). As P. lavandulacea and P. ramosa only parasitize dicotyledon species (source accessed February 2013), horizontal transfer from a monocotyledon host is unlikely, and the presence of SMART-related elements in these two species probably results from an underestimation of the distribution of SMART elements among angiosperms.

Our LTR retrotransposon classification shows that several Ty1/Copia and Ty3/Gypsy clades (SIRE1/Maximus, Tnt1, Angela, Tont1, Bianca, Tekay, CRM, Athila and Tat) are widely distributed in Orobanchaceae (Figures 4 and 7), and may have been present in the family's most recent common ancestor. We previously found that Orobanche and Phelipanche are characterized by different TE dynamics (Piednoël et al., 2012). The present analysis further illustrates this. For example, Hopscotch is restricted to Phelipanche, whereas Tekay is overabundant in Orobanche. In addition, there are species-specific features. For example, the L. philippensis genome has a high proportion of Ty1/Copia (17.2%), but a very low proportion of Ty3/Gypsy (1.9%), and O. crenata is enriched in Tont1 elements (3.8%), compared with all other species (<0.8%). The two closely related Phelipanche resemble each other in their Ty1/Copia element proportions (21.1% in P. lavandulacea; 22.8% in P. ramosa), but diverge in their Ty1/Copia composition, with P. ramosa enriched for the SIRE1/Maximus elements and P. lavandulacea enriched for Hopscotch. This highlights the need to study TE dynamics at both large and fine scales, and in a comparative context. TE transposition can be activated by stress (Melayah et al., 2001; Fablet and Vieira, 2011) or the colonization of new environments (Vieira et al., 2002), and it has been suggested that the TE repertoire of a gene pool could promote, or be associated with, the emergence of evolutionarily separate lines (Oliver and Greene, 2009, 2011; Jurka et al., 2011).

The Ty3/Gypsy genome proportions are higher in Orobanche than in Phelipanche (Figure 1), which we have attributed to diversification rather than a burst of transposition (Piednoël et al., 2012). The present results fit that hypothesis. Firstly, Orobanche genomes comprise more diverse element clades than Phelipanche (Galadriel, Reina and Ogre were only detected in Orobanche). Three Galadriel families are present in low genomic proportions in the out-group L. philippensis (Lind_CL123, Lind_CL135 and Lind_CL137), and thus perhaps this clade, as well as Reina and Ogre, was lost in the common ancestor of Phelipanche. Secondly, Tat, Tekay and CRM are more abundant in Orobanche than in Phelipanche, with the Tekay clade (and to a lesser extent the Tat families) making up most of the Ty3/Gypsy enrichment in Orobanche (Figure 7). This enrichment is accompanied by an increase in the number of CRM and Tekay families in Orobanche (4–9 CRM and 25–33 Tekay families), compared with Phelipanche (1–2 CRM and 19–22 Tekay families).

The Tekay elements, as well as the CRM, Galadriel and Reina elements, are chromoviruses (Gorinsek et al., 2004; Llorens et al., 2009). Chromoviruses are the earliest-diverging branch of Ty3/Gypsy, and are found in plants, fungi and animals (Kordis, 2005). They have a high genomic turnover (Gorinsek et al., 2004), which may result from their ‘strategy’ to escape repression and elimination mechanisms (Kordis, 2005; Baucom et al., 2009; Novikov et al., 2012). Chromoviruses differ from other Ty3/Gypsy elements in harboring a chromodomain in their 3′ end, which is a structural domain commonly found in proteins associated with the remodeling and manipulation of chromatin (Gorinsek et al., 2004). Chromodomains are highly constrained (Novikov et al., 2012), and may promote the integration of TEs in heterochromatin regions (Gao et al., 2008; but see Novikov et al., 2012 for chromoviruses in euchromatin). In accordance with this, CRM families appear to be centromere-specific (Luo et al., 2012). The high genomic turnover and site-specific integration of chromoviruses probably both contribute to their survival and abundance, especially in plants where they often attain large numbers of young copies (Kordis, 2005). It is therefore not surprising that chromoviruses make up a high proportion of the Orobanchaceae Ty3/Gypsy. There is, however, an exception to this global pattern, with the Athila element enrichment in Phelipanche. Once again, this underlines the need to study TE dynamics at both large and fine scales.

We previously showed that O. gracilis (the sequenced plant was tetraploid, with 2n = 76) has a particular TE composition, probably related to its tetraploidization and subsequent genome downsizing (Piednoël et al., 2012). Although O. gracilis has one of the smallest genomes of the Orobanchaceae studied, it has the highest proportion of TEs, especially of Ty3/Gypsy retrotransposons (Figure 1). The present LTR retrotransposon classification allows a better understanding of the underlying dynamics. As is characteristic of Orobanche, Tekay elements are enriched in O. gracilis: Tekay families make up 17.82% of its genome, with seven Tekay families unique to O. gracilis and making up 5.12% of its genome. These seven families are seven of the 11 O. gracilis species-specific families, represent almost one-third of its Tekay content (5.12 of 17.82% of the genome), and account for 75% of the Ty3/Gypsy genomic enrichment compared with O. crenata and O. pancicii. Previous studies have shown that polyploidy can be associated with selective amplification of repetitive DNA (Parisod et al., 2010 for a review). The O. gracilis polyploidization could thus have promoted the proliferation of specific Tekay families.

Polyploidy may have been only one of the factors activating the Tekay elements in O. gracilis. In maize, only one of the LTR retrotransposon amplification bursts was initiated by polyploidy, whereas the other element activations were not (Estep et al., 2013). Additionally, the chromodomain of Tekay chromoviruses could have helped their persistence and thus accumulation (as described above). Like the CRM families, Tekay elements may be preferentially located in centromeric and pericentromeric regions of plant chromosomes (Theuri et al., 2005; Domingues et al., 2012). Interestingly, the tetraploid O. gracilis is the only species in our sample that harbors all four chromovirus clades (Tekay, Galadriel, CRM and Reina), with Galadriel elements being especially diverse and abundant (partially via the presence of two specific families; Figures 6 and 7).

The Ty3/Gypsy enrichment in O. gracilis might be related to the elimination of genes, especially redundant ones, as suggested for the triplicated Brassica rapa genome, compared with the diploid A. thaliana (Mun et al., 2009). This could be the case because Ty3/Gypsy elements (especially chromoviruses) are preferentially located in heterochromatin, in contrast to genes, which are preferentially located in euchromatin. However, this differential location cannot be the only explanation of the TE enrichment, because genome downsizing in O. gracilis also led to the loss of TE families from various LTR retrotransposon clades (Figures 3 and 6), including the presumably heterochromatin-located Tekay elements. This loss of entire TE families matches results from the allopolyploid Nicotiana tabacum, in which common repeats from the parental species are lost in the polyploid descendant (Renny-Byfield et al., 2011).

In conclusion, we classified the Ty1/Copia and Ty3/Gypsy LTR retrotransposons from a rapidly speciating clade of Orobanchaceae, as indicated by the numerous morphologically poorly separated species that differ mainly in flower color and host preferences (Carlón et al., 2008; Pusch and Günther, 2009). Most of the identified LTR retrotransposon clades appear to have been inherited from the most recent common ancestor of Orobanchaceae. Orobanche has the greatest Ty3/Gypsy element diversification, perhaps because chromoviruses (Tekay, Galadriel, CRM and Reina) targeted heterochromatin regions via their chromodomains, and thus were able to persist longer. The tetraploidization event in O. gracilis appears to have promoted the proliferation of seven species-specific Tekay families in this species. In contrast to the Ty3/Gypsy elements, the Ty1/Copia repertoire appears more homogeneous among Orobanchaceae species, although there are striking species-specific TE dynamics. Finally, this study revealed that the Bianca clade of Ty1/Copia elements and SMART-related elements are more widely distributed in angiosperms than previously known.

Experimental Procedures

Plant material

Lindenbergia philippensis (Cham. and Schltd.) Benth. (2n = 32) is a fully autotrophic species from Bangladesh, India, Burma, Thailand, Cambodia, Laos, Vietnam, tropical China and the Philippines. Orobanche cumana Wallr. (2n = 38) is distributed from the Mediterranean region to Central Asia, and its main hosts belong to Asteraceae. Orobanche gracilis Beck (2n = 76) is found in the Mediterranean, and northwards to southern Central Europe; its hosts are exclusively shrubby Fabaceae. Orobanche pancicii Beck (2n = 38) is distributed from the Balkan Peninsula northwards to the Eastern Alps, and its hosts are species of Knautia and Scabiosa. Orobanche crenata Forssk. (2n = 38) is found in the Mediterranean region and the Near East, and its hosts are legumes, mainly annual crop species. Phelipanche lavandulacea Pomel (2n = 24) is a Mediterranean species, and its sole host is Bituminaria bituminosa, a perennial Fabaceae. Finally, Phelipanche ramosa (L.) Pomel (2n = 24) has been introduced worldwide, but its native distribution is the Mediterranean region and the Near East. Its hosts are a broad range of annual species. Chromosome numbers of the individuals studied were reported in Schneeweiss et al. (2004) and Piednoël et al. (2012).

Sequencing data and assembly

The 454 pyrosequencing reads were obtained from the sequence read archives (accession no. SRA047928). Filtering for plastid contaminants resulted in 76–555 Mb of DNA sequence for each species. This amounts to ~23% of the O. cumana genome (1.42 Gb), ~20% of the O. gracilis genome (2.05 Gb), ~20% of the O. crenata genome (2.78 Gb), ~16% of the O. pancicii genome (3.17 Gb), ~12% of the P. ramosa genome (4.25 Gb) and ~11% of the P. lavandulacea genome (4.29 Gb). The corresponding genome sizes were reported in Weiss-Schneeweiss et al. (2006) and Piednoël et al. (2012).
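As a rough consistency check on these figures, the amount of sequence implied by each coverage value can be back-calculated from the genome size. The sketch below uses only the rounded genome sizes and percentages quoted above, so the results are approximate; the function and variable names are illustrative, and L. philippensis is omitted because its values are not given in this paragraph.

```python
# Genome size (Gb) and approximate fraction sequenced, as quoted in the text.
GENOMES = {
    "O. cumana": (1.42, 0.23),
    "O. gracilis": (2.05, 0.20),
    "O. crenata": (2.78, 0.20),
    "O. pancicii": (3.17, 0.16),
    "P. ramosa": (4.25, 0.12),
    "P. lavandulacea": (4.29, 0.11),
}


def implied_megabases(genome_gb, fraction):
    """Sequence amount (Mb) implied by a genome size (Gb) and a coverage fraction."""
    return genome_gb * 1000 * fraction


for species, (size_gb, fraction) in GENOMES.items():
    print(f"{species}: ~{implied_megabases(size_gb, fraction):.0f} Mb")
```

For O. crenata, for example, this gives about 556 Mb, which agrees with the upper bound of the 76–555 Mb range within rounding.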

The reads were processed as described in Piednoël et al. (2012). Briefly, they were clustered using a graph-based approach (Novák et al., 2010), in which vertices correspond to sequence reads and overlapping reads are connected by edges weighted by their similarity scores. Clusters of frequently connected nodes represent groups of similar sequences (here considered as families of genomic repeats). The number of reads in each family is proportional to its genomic abundance. Within each family, the reads were assembled into contigs, representing chimeric consensus sequences, using TIGR Gene Indices clustering tools (Pertea et al., 2003), with the -O '-p 80 -o 40' parameters, which specify the overlap percentage identity and minimal overlap length cut-offs for the cap3 assembler.
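The idea of graph-based read clustering can be sketched as follows. This is a deliberate simplification: it groups reads into connected components with a union-find structure, whereas the published pipeline (Novák et al., 2010) detects densely connected communities in the graph. The function names, the score threshold and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_reads(read_ids, overlaps, min_score=0.0):
    """Group reads into connected components of the overlap graph.

    overlaps: iterable of (read_a, read_b, similarity_score) tuples.
    Edges scoring below min_score are ignored.
    """
    parent = {r: r for r in read_ids}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for a, b, score in overlaps:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = defaultdict(list)
    for r in read_ids:
        clusters[find(r)].append(r)
    # Largest clusters first, mirroring the CL1, CL2, ... naming by size.
    return sorted(clusters.values(), key=len, reverse=True)


def genomic_proportions(clusters, total_reads):
    """Estimate each family's genomic proportion from its share of reads."""
    return [len(c) / total_reads for c in clusters]


reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
overlaps = [("r1", "r2", 90.0), ("r2", "r3", 85.0), ("r4", "r5", 95.0)]
clusters = cluster_reads(reads, overlaps, min_score=80.0)
print(clusters)
print(genomic_proportions(clusters, len(reads)))
```

The second function reflects the statement that read counts are proportional to genomic abundance, which holds only if reads sample the genome without strong bias.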

Family classification

For both Ty1/Copia and Ty3/Gypsy, we reconstructed phylogenetic trees including several previously classified TEs, representative of all LTR retrotransposon clades described from plants. Reference sequences were selected from the matrices used in Hribová et al. (2010), plus the ATCopia28 and OSCOPIA2 elements from Arabidopsis thaliana and rice, all deposited in Repbase (Jurka et al., 2005). Some other elements not represented in Hribová et al. (2010) were also added: (i) Bianca, Eninu, Opie and Victim from the maize TE database; (ii) Giepum and Ji from the Retrotransposon database; and (iii) Ale (HE774675), Araco (AC079131:14472-19329), FRetro64 (JN806224), Ivana (EF067844:429582-434664) and Kielia (EU195798) from GenBank. For each Orobanchaceae LTR retrotransposon family, one contig covering the RT domain was then included in the phylogenetic analyses as a representative of its entire family. The representative contigs were selected as the most conserved considering their similarity scores with known elements obtained using rps-blast (reversed position-specific blast; Altschul et al., 1997) and three alignment profiles: pfam07727, pfam00078 and cd01650.

The RT domains of the reference sequences and representative contigs were extracted using a custom Python script and the RT boundaries provided by the results of rps-blast. All RT domains were then translated using Traduit, and the corresponding sequences of amino acids were aligned using mafft (Katoh et al., 2009). Alignments were manually curated, and ambiguously aligned sites were filtered out using bmge (Criscuolo and Gribaldo, 2010). Phylogenetic analyses were carried out using the neighbour-joining method, 100 bootstrap replicates and the pairwise deletion of gaps option included in mega 5.0 (Tamura et al., 2011). The best-fitting model, JTT + G (Jones et al., 1992), was selected using topali 2 (Milne et al., 2009).
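The extraction and translation steps can be illustrated with a minimal sketch. This is not the custom script used in the study, which is not reproduced here; it assumes 1-based inclusive domain boundaries and a strand flag (both hypothetical conventions), and it uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_domain(contig, start, end, strand="+"):
    """Return the domain sequence; start/end are 1-based and inclusive."""
    sub = contig[start - 1:end].upper()
    if strand == "-":
        sub = sub.translate(COMPLEMENT)[::-1]  # reverse complement
    return sub


def translate(dna):
    """Translate in frame 1; incomplete trailing codons are dropped,
    and codons with ambiguous bases become 'X'."""
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


contig = "GGATGGCTTGGTAACC"  # toy sequence, not real data
rt_dna = extract_domain(contig, 3, 11)
print(rt_dna, translate(rt_dna))
```

The resulting amino acid sequences would then be passed to the aligner; that step is outside the scope of this sketch.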

Some families without any reliable hit on the RT domain could not be included in the phylogenies. We thus classified them considering their Blastx results on the Gypsy database (Llorens et al., 2011). A family was assigned to a particular clade using two criteria: (i) best hits obtained on a unique clade, and (ii) an E-value difference between these hits and the best hits obtained on other clades of at least 1E−5.
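The two assignment criteria can be sketched as a simple decision rule. The wording 'E-value difference of at least 1E−5' is open to interpretation; the sketch below assumes, as one possible reading, that the best E-value must be at least five orders of magnitude smaller than the best E-value on any other clade. The function name, the input format and this threshold interpretation are assumptions, not a description of the script actually used.

```python
def assign_clade(hits, min_fold=1e5):
    """Assign a family to a clade from its blastx hits, or return None.

    hits: iterable of (clade_name, e_value) pairs for one family.
    min_fold: ASSUMED interpretation of the 1E-5 criterion, i.e. the best
    E-value must be at least this many times smaller than the runner-up.
    """
    best = {}
    for clade, evalue in hits:
        if clade not in best or evalue < best[clade]:
            best[clade] = evalue  # keep the best E-value per clade
    if not best:
        return None  # no hits at all
    ranked = sorted(best.items(), key=lambda item: item[1])
    top_clade, top_e = ranked[0]
    if len(ranked) == 1:
        return top_clade  # criterion (i): hits on a unique clade
    second_e = ranked[1][1]
    # Criterion (ii): sufficient separation from the next-best clade.
    if second_e > 0 and top_e * min_fold <= second_e:
        return top_clade
    return None


print(assign_clade([("Tekay", 1e-40), ("CRM", 1e-10)]))
print(assign_clade([("Tekay", 1e-12), ("CRM", 1e-10)]))
```

Families for which the rule returns None correspond to the 'not classified' (NC) category in Figures 4 and 7.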

Bianca clade distribution

To determine the Bianca distribution among angiosperms, similarity searches were performed against the nr/nt database using the ATCopia28 element from Arabidopsis thaliana and elements from Orobanchaceae species. For this purpose, both blastn and blastx were used. Several candidates from Arachis hypogaea (HQ637177), Beta vulgaris (GU057342), Brassica rapa (AC232487), Capsella rubella (DQ103594), Citrullus lanatus (JX027061), Ipomoea trifida (AH013750), Lotus japonicus (AP009656), Medicago truncatula (AC161750), Oryza sativa japonica group (AC018929), Solanum lycopersicum (AF275345), Vitis vinifera (AM477556) and Zea mays (Elote1: DQ493648) were selected and included in a specific phylogenetic analysis.


This work was supported by the German Science Foundation (RE 603/9-1 and -2). We thank Jiri Macas from the Institute of Plant Molecular Biology in Budweis for the RepeatExplorer pipeline.