Characterization of LTR-retrotransposons in the soybean genome
A combination of structure-based and homology-based approaches (Ma and Bennetzen, 2006; Ma and Jackson, 2006) was employed to identify LTR-RTs in the soybean genome sequence (assembly version Glyma1.01), 950 Mb of which was mapped to the 20 soybean chromosomes (Schmutz et al., 2010). A total of 32 370 elements with clearly defined boundaries were identified and deposited in SoyTEdb, a comprehensive database of transposable elements in the soybean genome (Du et al., 2010a). Of these elements, 14 106 are intact elements and 18 264 are solo LTRs (Figure 1, Table S1 in Supporting Information). All of these elements were manually inspected and defined based on their structures as previously described (Ma et al., 2004). Because the present soybean pseudomolecules contain numerous sequence gaps within and around repetitive sequences, some truncated elements or fragments can be potential products of incomplete assembly or mis-assembly of the corresponding regions. Therefore, truncated elements without structurally defined termini were not further investigated. Of the 32 370 elements described above, 31 858 (98.4%) were anchored to the currently assembled 20 chromosome pseudomolecules (Schmutz et al., 2010).
Figure 1. Distribution of long terminal repeat-retrotransposons (LTR-RTs) along the 20 chromosome pseudomolecules (Gm01–Gm20) of soybean. Intact elements and solo LTRs are shown by the red and green bars, respectively. The recombination-suppressed region of each chromosome is represented by the gray-shadowed area within each box. Intact elements and solo LTRs are plotted along the soybean physical map using 1 Mb per unit.
Based on a unified classification system for eukaryotic transposable elements (Wicker et al., 2007), the 32 370 elements were classified into 510 distinct families, including 353 Gypsy-like families (19 052 elements) and 157 Copia-like families (13 318 elements), approximately 95% of which were reported for the first time (Table S1) (Jurka et al., 2005; Du et al., 2010a). The ratio of Gypsy-like to Copia-like elements in soybean is 1.4:1 (Table S1), slightly lower than in maize (1.6:1) (Baucom et al., 2009; Schnable et al., 2009), much lower than in rice (4.9:1) (International Rice Genome Sequencing Project 2005; Tian et al., 2009) and sorghum (3.7:1) (Paterson et al., 2009), but considerably higher than reported in Medicago (0.3:1) (Wang and Liu, 2008). Nevertheless, the ratio of Gypsy-like to Copia-like elements in Medicago may be a biased estimate, as only euchromatic portions of the Medicago genome have been sequenced and analyzed (Wang and Liu, 2008).
The length of intact elements in soybean varies from 1 to 20 kb, with LTRs ranging from 0.1 to 4 kb in size (Figure S1). The copy numbers of individual LTR-RT families in soybean vary greatly, ranging from 1 to 4724, with an average number of 63 (Table S1). The three largest families are Gmr9 [i.e. SNARE/GmOgre (Laten et al., 2009; Du et al., 2010b)], Gmr4 [i.e. GmGypsy10 (Laten et al., 2009)], and Gmr5, which have 4724, 3370, and 2925 copies of intact elements and solo LTRs, respectively (Table S1). Overall, the 32 370 intact elements and solo LTRs, together with numerous truncated fragments or remnants measured by the RepeatMasker program (http://www.repeatmasker.org), make up 401 Mb of repetitive DNA, accounting for approximately 42% of the soybean genome (Schmutz et al., 2010). This proportion is lower than estimated in the larger maize genome (79%) (Schnable et al., 2009) and sorghum genome (55%) (Paterson et al., 2009), but higher than in the smaller rice genome (26%) (International Rice Genome Sequencing Project 2005). It appears that the two rounds of whole-genome duplication that shaped the current soybean genome are mostly responsible for soybean having a larger genome than sorghum but a lower proportion of LTR-RT DNA.
Structural variation of LTR-RTs according to their ages and distribution in recombination-suppressed pericentromeric regions and chromosome arms
Of the 31 858 elements anchored to the assembled 20 chromosome pseudomolecules, 27 836 (approximately 87%) were found in the recombination-suppressed pericentromeric regions (Schmutz et al., 2010) (Figure 1, Table 1). This is probably an underestimate, given that a large number of assembled scaffolds predominantly composed of retrotransposon fragments and centromere satellite repeats (approximately 17.7 Mb) have not yet been integrated into the 20 chromosomes. By contrast, <18% (2292 out of 12 918) of LTR-RTs (intact elements and solo LTRs) in the rice genome are located in the recombination-suppressed pericentromeric regions (Tian et al., 2009). The densities of LTR-RTs in the recombination-suppressed pericentromeric regions and chromosome arms are 52 and 9 elements per Mb in soybean (Figure 1, Table 1), and 51 and 33 elements per Mb in rice (Tian et al., 2009), respectively. When all fragments are included, the proportions of retrotransposon DNA in the recombination-suppressed pericentromeric regions and chromosome arms are 63 and 11% in soybean, and 39 and 17% in rice (Tian et al., 2009), respectively.
Table 1. Distribution of long terminal repeat-retrotransposons (LTR-RTs) within and outside of recombination-suppressed pericentromeric regions
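|Chr.||No. of intact elements (pericentromeric)||No. of intact elements (arms)||No. of solo LTRs (pericentromeric)||No. of solo LTRs (arms)||S/I ratio (pericentromeric)||S/I ratio (arms)|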
|Total||12 323||1551||15 513||2471||1.26||1.62|
The formation of solo LTRs by unequal intra-element homologous recombination is thought to be a major process for removal of LTR-RT DNA in plants (Devos et al., 2002; Ma et al., 2004). Our data show that the ratio of solo LTRs to intact elements (S/I) in soybean is approximately 1.29:1 (Table S1), much higher than the ratio (approximately 0.12:1) suggested previously using limited bacterial artificial chromosome (BAC) sequences (Wawrzynski et al., 2008). This estimate is significantly lower than reported in rice (1.62:1) (Tian et al., 2009) (Fisher’s exact test, P < 10^−25), but significantly higher than that in Arabidopsis (0.50:1; Table S2) (Fisher’s exact test, P < 10^−24). In an attempt to shed light on the potential forces that facilitate the formation of solo LTRs, we investigated the structures of LTR-RTs along chromosomes (Figure 1). Our data reveal that the S/I ratio in pericentromeric regions (1.26:1) is significantly lower than in chromosome arms (1.62:1) (t-test, P < 0.001; Table 1). This observation, paralleling a recent study in rice, which reported significantly lower S/I ratios in pericentromeric regions (1.36:1) than in chromosome arms (1.68:1) (Tian et al., 2009), suggests that the mechanisms for suppression of genetic recombination in pericentromeric regions may reduce the frequency of formation of solo LTRs by unequal recombination.
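The regional S/I ratios can be recomputed from the totals in Table 1. The following is a minimal sketch of that arithmetic, not the authors' pipeline; the function and variable names are illustrative assumptions. It pools counts over all chromosomes, so the arm value it yields need not match the tabulated 1.62 exactly if that figure was derived differently (for example, averaged per chromosome).

```python
# Totals from Table 1 (soybean, anchored elements only).
TABLE1_TOTALS = {
    "pericentromeric": {"intact": 12323, "solo": 15513},
    "arms": {"intact": 1551, "solo": 2471},
}


def solo_to_intact(counts):
    """Return the solo-LTR to intact-element (S/I) ratio for one region."""
    return counts["solo"] / counts["intact"]


def region_share(totals, region):
    """Fraction of all anchored elements (intact + solo) found in `region`."""
    per_region = {name: c["intact"] + c["solo"] for name, c in totals.items()}
    return per_region[region] / sum(per_region.values())


for name, counts in TABLE1_TOTALS.items():
    print(f"{name}: S/I = {solo_to_intact(counts):.2f}, "
          f"share of elements = {region_share(TABLE1_TOTALS, name):.1%}")
```

The pericentromeric share computed this way corresponds to the 27 836 of 31 858 anchored elements (approximately 87%) reported above.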
Using an approach employed earlier (Ma and Bennetzen, 2004), we estimated the ages of intact elements in soybean. As shown in Figure 2(a), most of the elements (91%) were amplified in the last 3 Myr, and approximately 3248 elements were generated within the last 0.5 Myr. Despite recent amplification, it appears that many families were active and amplified within distinct evolutionary timeframes. For example, Gmr2/SIRE (Laten et al., 1998) has the greatest number of copies that arose within the past 0.5 Myr, and this family may be recently or even currently active, given that it contains 75 intact elements each having two identical LTRs (Figure 2b). In contrast, the majority of Gmr3, Gmr19/Diaspora (Yano et al., 2005), and Gmr25 elements were amplified within the last 0.5–1.0, 1.5–2.0, and 2.0–2.5 Myr, respectively.
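Dating of this kind rests on the fact that the two LTRs of an element are identical at insertion and diverge afterwards. The sketch below illustrates the general idea under stated assumptions: the Kimura two-parameter distance K between the two aligned LTRs is converted to time as T = K/(2r). The substitution rate r = 1.3 × 10^−8 per site per year is an assumed default that is not stated in this section, and the function names are hypothetical.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTR sequences.

    Columns containing a gap or an ambiguous base are skipped. Very
    divergent pairs make the logarithm undefined and raise ValueError.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    k = -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
    return k if k > 0 else 0.0


def insertion_time_years(ltr5, ltr3, rate=1.3e-8):
    """Estimate insertion time as T = K / (2 * rate).

    `rate` (substitutions per site per year) is an assumed value.
    """
    return k2p_distance(ltr5, ltr3) / (2 * rate)


# Toy example: 100 aligned sites differing by two transitions.
left = "ACGTACGTAC" * 10
right = "ACGTACGTAC" * 9 + "GCGTACGTAT"
print(f"{insertion_time_years(left, right) / 1e6:.2f} Myr")
```

Under this model an element with two identical LTRs receives an age of zero, which is why such elements are taken as signs of recent or ongoing activity.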
Figure 2. Timing and activities of recent amplification of long terminal repeat-retrotransposons in soybean. (a) Insertion time of intact elements (mya, million years ago). (b) Comparisons of activities of different families.
Similar to that previously described in a few other species (Wicker and Keller, 2007), the overall age distribution of intact elements in soybean fits an exponential decay curve (r = 0.99, P < 0.001). This pattern is expected, because unequal recombination and illegitimate recombination have been thought to be common mechanisms responsible for rapid elimination of retrotransposon DNA during the evolution of plant genomes. In particular, illegitimate recombination that generates small deletions has been documented to be a major mechanism for elimination of LTR-RT DNA in Arabidopsis (Devos et al., 2002). In this study, we identified numerous truncated LTR-RT fragments in soybean. However, because the assembled soybean genome sequence was generated by the whole genome shotgun (WGS) approach, and unavoidably contains many sequence gaps, it thus does not allow a precise assessment of the effectiveness of illegitimate recombination for the shrinkage of the soybean genome.
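One simple way to test whether an age distribution follows an exponential decay is to regress the logarithm of the per-bin counts on age. The sketch below assumes that approach; it is not necessarily the fitting procedure used in this study, and the bin midpoints and counts are synthetic values generated from a known curve, not data from Figure 2.

```python
import math


def fit_exponential_decay(ages, counts):
    """Fit counts ~ n0 * exp(-lam * age) by least squares on log(counts).

    Returns (n0, lam, r_squared). Every count must be positive.
    """
    if len(ages) != len(counts) or len(ages) < 3:
        raise ValueError("need at least three (age, count) pairs")
    logs = [math.log(c) for c in counts]
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    syy = sum((y - mean_y) ** 2 for y in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return math.exp(intercept), -slope, r_squared


# Synthetic illustration only: bin midpoints in Myr and counts drawn from
# an exact curve with n0 = 1000 and lam = 0.8 per Myr.
ages = [0.25, 0.75, 1.25, 1.75, 2.25, 2.75]
counts = [1000 * math.exp(-0.8 * a) for a in ages]

n0, lam, r_squared = fit_exponential_decay(ages, counts)
half_life = math.log(2) / lam  # time for the number of intact elements to halve
print(f"n0 = {n0:.0f}, lam = {lam:.2f} per Myr, half-life = {half_life:.2f} Myr")
```

The fitted decay constant can be read as a half-life for intact elements, which summarizes how quickly unequal and illegitimate recombination remove them.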
Our analysis reveals that the intact elements identified in pericentromeric regions and chromosome arms were amplified at different times in both soybean and rice (average, 1.29 and 1.64 Myr in soybean, and 2.12 and 1.43 Myr in rice). Since the frequencies for removal of LTR-RT DNA by solo LTR formation, as reflected by the S/I ratios in either euchromatic regions or heterochromatic regions, are strikingly similar between rice and soybean, the formation of such large proportions of recombination-suppressed pericentromeric regions in soybean may be mainly caused by preferential insertions of LTR-RTs in the regions, instead of biased removal in gene-rich euchromatic regions. It appears that this deduction reinforces our observation that soybean LTR-RTs in pericentromeric regions are, on average, younger than those in chromosome arms. However, this deduction needs to be made with the caveat that solo LTR formation may be the predominant process for the elimination of LTR-RT DNA in both species. This is true in rice (Ma et al., 2004; Tian et al., 2009), but is less clear in soybean.
Association of the S/I ratios with insertion times and LTR lengths
The S/I ratios vary among families. Among the 53 families each with >50 copies (Table S1), the lowest and highest S/I ratios are 0.11 (Gmr15) and 16.39 (Gmr24), whereas the average ages of the intact elements of families Gmr15 and Gmr24 are 1.60 and 0.87 Myr, respectively (Table S1). No significant correlation between the S/I ratios and the average ages of intact elements was detected among these 53 families (r = 0.12, P = 0.38; Figure 3a), although families with younger intact elements have a tendency toward lower S/I ratios.
Figure 3. Genetic factors associated with solo long terminal repeat (LTR) formation. (a) Ratios of solo LTRs to intact elements (S/I) versus average insertion times (mya, million years ago). (b) S/I ratio versus LTR sizes.
Theoretically, more solo LTRs would be formed from intact elements over evolutionary time. For example, the S/I ratio of elements amplified before the divergence of indica and japonica, two subspecies of rice, approximately 0.5 Mya, is twice the S/I ratio of elements amplified after the divergence of these two subspecies (Ma et al., 2004; Tian et al., 2009). The lack of significant correlation between the S/I ratios and the average ages of intact elements may be due to the distinct waves and scales of LTR-RT amplifications among individual families, as well as the variable degrees and magnitudes of unequal recombination and illegitimate recombination over evolutionary time (Tian et al., 2009). In contrast, our data reveal a significant positive correlation between the S/I ratios and LTR sizes among the 53 soybean LTR-RT families each with >50 copies (r = 0.63; P < 0.0001; Figure 3b). Such a correlation was also detected in rice (r = 0.39; P = 0.001) by analyzing the 66 largest rice LTR-RT families (Tian et al., 2009). One possible explanation is that larger LTRs can facilitate unequal homologous recombination for the formation of solo LTRs. Together, these data suggest that the formation of solo LTRs in soybean may be driven by multiple factors, such as chromosomal distribution, local genetic recombination, the lengths of LTRs, and probably also selection against insertions within or adjacent to genes (Tian et al., 2009).
Most evolutionary lineages of LTR-RTs are shared between monocots and eudicots, but one is extinct in soybean
To understand the evolutionary dynamics, history, and fates of LTR-RTs in flowering plants over evolutionary time, we performed comprehensive phylogenetic and comparative analyses of complete sets of LTR-RTs identified in soybean, rice (Tian et al., 2009), and Arabidopsis (Pereira, 2004; Table S2), and large sets of LTR-RTs identified in Medicago (Wang and Liu, 2008) and Lotus (Holligan et al., 2006). Rice is a monocot, while the other four are eudicots. The divergence time of these species and two other monocots (maize and sorghum) is shown in Figure S2 (Chaw et al., 2004; Choi et al., 2004; Swigonova et al., 2004; Lavin et al., 2005; Tuskan et al., 2006; The International Brachypodium Initiative 2010).
Because many families of LTR-RTs among species or within an individual genome are highly diverged, we only used the relatively conserved RT domains from individual elements for phylogenetic analysis. Of the 510 families identified in the soybean genome, 145 Copia and 284 Gypsy families each have at least one element that contains a conserved RT domain. The other 81 families are either all non-autonomous elements or contain deletions of RT domains. Five (Gmr324, Gmr40, Gmr28, Gmr190, and Gmr27) of these 81 families have more than 100 copies (Table S1) in soybean. Most of the non-autonomous families were each found to contain at least two elements with similar structures based on sequence alignments, suggesting that these families were capable of amplification.
A previous survey of 20 Copia families from barley and wheat and their homologous elements from Arabidopsis (22 families) and rice (46 families), defined six major common evolutionary Copia lineages (Wicker and Keller, 2007). Using a similar approach, we grouped all Copia families with conserved RT domains identified in Arabidopsis (33 families), rice (113 families), and soybean (145 families) into seven distinct lineages, Ivana, Maximus, Ale, Angela, TAR, GMR, and Bianca (Figure 4a). Six of the seven lineages are shared by these three species, except that the Bianca lineage was not found in soybean (Figure 4a). The Bianca lineage contains one and seven families that are composed of five and 32 elements in Arabidopsis and rice, respectively (Table 2). These data suggest that the Bianca lineage elements are now extinct in the soybean genome. Because this lineage was also identified in both Lotus and Medicago (Holligan et al., 2006; Wang and Liu, 2008; Figure S3), two model legume species that diverged from soybean about 51 Mya (Lavin et al., 2005), the Bianca lineage must have been lost in soybean within the last 51 Myr.
Figure 4. Phylogenetic relationships of long terminal repeat-retrotransposon (LTR-RT) families identified in soybean, rice, and Arabidopsis. (a) Copia families and (b) Gypsy families. The RT nucleotide sequences of individual families were used to construct the phylogenetic trees, which were rooted using the RT sequences of the Copia element DM (a) and Gypsy element INVIDER2 (b) identified in Drosophila melanogaster (DM). Individual LTR-RT families in soybean, rice, and Arabidopsis are indicated by red boxes, green boxes, and blue triangles, respectively.
Table 2. Numbers of families and elements of different evolutionary lineages across species
The Gypsy families from soybean, Arabidopsis, and rice fell into five previously defined evolutionary lineages: Reina, CR, Tekay, Athila, and Tat (Figure 4b, Table 2) (Wang and Liu, 2008). These five lineages all existed before the divergence of eudicots and monocots, and are still shared by two other legume species, Medicago and Lotus (Figure S4).
The spectrum of activity for proliferation of LTR-RTs is highly variable among lineages and species over evolutionary time
Although the 11 evolutionary lineages are shared by soybean, Arabidopsis, and rice, the scales and timeframes of activity for proliferation of LTR-RTs vary tremendously among lineages and species. The numbers of families within each lineage and the copy numbers of elements within each family identified in these three species are listed in Table 2. Among the six Copia lineages in soybean, Ivana has the largest number of LTR-RT families (63), accounting for 43.4% of the Copia families (145) analyzed. However, this lineage only contains 6.3% (788) of the 12 564 Copia elements. In contrast, Maximus is the Copia lineage that contains the highest number of Copia elements (8575, 68.3%), but these elements belong to only 15 families. More dramatic differences were observed among the five Gypsy lineages in soybean. For example, the lineage Reina contains 253 (89.1%) of the 284 Gypsy families analyzed, but these families comprise only 892 elements (4.8% of the 18 587 Gypsy elements). The largest copy-number family in soybean is Gmr9/SNARE/GmOgre, which belongs to the Tat lineage and accounts for 15% of all elements identified in the soybean genome. The copy numbers of LTR-RTs within individual families reflect the recent activities for LTR-RT amplification, while the numbers of families within individual lineages record the ancient activities. Hence, the above observations suggest that different lineages and families of LTR-RTs had distinct activities for amplification over evolutionary time.
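The contrast between family counts and copy numbers can be made explicit with a short calculation. The sketch below uses only the counts quoted in the paragraph above; the dictionary layout and function name are illustrative assumptions.

```python
# (families, elements) for the soybean Copia and Gypsy families analyzed,
# as quoted in the text.
COPIA_TOTAL = (145, 12564)
GYPSY_TOTAL = (284, 18587)

LINEAGES = {
    "Ivana": ((63, 788), COPIA_TOTAL),
    "Maximus": ((15, 8575), COPIA_TOTAL),
    "Reina": ((253, 892), GYPSY_TOTAL),
}


def lineage_summary(counts, totals):
    """Share of families, share of elements, and mean copies per family."""
    families, elements = counts
    total_families, total_elements = totals
    return {
        "family_share": families / total_families,
        "element_share": elements / total_elements,
        "elements_per_family": elements / families,
    }


for name, (counts, totals) in LINEAGES.items():
    s = lineage_summary(counts, totals)
    print(f"{name}: {s['family_share']:.1%} of families, "
          f"{s['element_share']:.1%} of elements, "
          f"{s['elements_per_family']:.1f} copies per family")
```

A lineage with a high family share but a low mean copy number per family (Ivana, Reina) points to many old, diversified families, whereas the reverse pattern (Maximus) points to recent amplification of a few families.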
A similar scenario was seen in rice (Figure 4, Table 2). Notably, the three lineages (Ivana, Ale, and Reina) that contain the highest number of families of LTR-RTs in rice are also the three lineages that contain the highest number of families of LTR-RTs in soybean, although the relative proportions of elements within these three lineages are higher in rice than in soybean. Nevertheless, either the proportions of elements within individual families or the proportions of families within individual lineages show considerable differences between soybean and rice. Compared with soybean and rice, Arabidopsis shows the overall lowest activities for LTR-RT amplification from ancient to recent times (Table 2). In particular, the lineage TAR may be facing extinction in Arabidopsis, as only a single intact element of this lineage was found in the entire genome. Our analysis suggests that the consistently low activities of LTR-RTs over evolutionary time are largely responsible for maintaining such a small genome, although illegitimate recombination for rapid accumulation of small deletions was found to be a primary mechanism for counteracting genome expansion during the recent evolution of the Arabidopsis genome (Devos et al., 2002).
‘Functional specification’ of centromere retrotransposons pre-dates the separation of eudicots and monocots
Among the five Gypsy lineages, CR (centromeric retrotransposon) shows the highest level of sequence conservation between the monocot and eudicot species (Figure 4b). This lineage contains three Arabidopsis families (Atr39, Atr47, and Atr48), four rice (Oryza sativa subspecies japonica) families (CRR1, CRR2, CRR4, and rn 417-136), and 10 soybean families (Gmr3, Gmr4/GmGypsy10, Gmr12/GmGypsy11, Gmr17, Gmr59, Gmr102, Gmr175, Gmr215, Gmr235, and Gmr362) (Figures 4b and 5).
Figure 5. Phylogenetic tree of the CR lineage constructed using the conserved reverse transcriptase (RT) nucleotide sequences. The putative soybean centromere-enriched retrotransposon families, Gmr12/GmGypsy11 and Gmr17, are marked by the bold branches. The tree was rooted using the RT sequence of the INVIDER2 element in Drosophila.
The plant CR elements were first discovered in cereals cytogenetically anchored to centromeres using fluorescence in situ hybridization (Aragon-Alcaide et al., 1996; Jiang et al., 1996; Miller et al., 1998; Presting et al., 1998). Then, two CR families (CRR1 and CRR2) in japonica rice and three CR families (CRM1, CRM2, and CRM3) in maize were isolated from the respective genomes and found to be enriched in the functional centromere by chromatin immunoprecipitation (ChIP)-based analysis with a CENH3-specific antibody (Cheng et al., 2002; Zhong et al., 2002; Nagaki et al., 2005; Sharma and Presting, 2008). CRR1 versus CRM3, CRR2 versus CRM2, and CRR3 versus CRM1 are three CR pairs that pre-date the divergence of maize and rice (Sharma and Presting, 2008). CRM1 is a high-copy-number family enriched in maize centromeres, but not a single orthologous copy of CRR3 was identified in japonica rice, and only an incomplete copy was detected in indica rice (Sharma and Presting, 2008). Other CR families in rice and maize analyzed previously (e.g. CRR4 and CRM4) are not located in centromeric regions. These observations suggest: (i) the association of CR lineages with centromeres was established before the divergence of maize and rice, (ii) some CR families are still associated with functional centromeres in both rice and maize, and (iii) some CR families have lost their roles as centromere components in these two organisms.
CR families were found to be present in the centromeric regions of most grasses that have been investigated (Aragon-Alcaide et al., 1996; Jiang et al., 1996), but absent in functional centromeres of Oryza brachyantha (Lee et al., 2005), a wild species that diverged from rice about 7–9 Mya (Dawe, 2005). Instead, FRetro3, an LTR-RT family that belongs to the Tekay lineage, was found to colonize the O. brachyantha centromeres (Gao et al., 2009), representing an exception to the general CR conservation of grasses.
Although three families belonging to the CR lineage exist in Arabidopsis (Figure 5), they comprise only six elements (Table 2). Thus, it is likely that none of these three families of elements are enriched in Arabidopsis centromeres. This observation seems to echo a previous observation that no significant enrichment of CR homologs was detected in functional centromeres of Arabidopsis by ChIP (Nagaki et al., 2003). However, no Arabidopsis centromeres have been fully sequenced; thus we are uncertain whether additional elements belonging to the three CR families exist in Arabidopsis. It is also possible that the copy number of CR elements in Arabidopsis is below the detection limits of the ChIP assay (Nagaki et al., 2003).
Since both soybean and Arabidopsis (eudicots) share the CR lineage with rice and many other grass species (monocots), a standing question that intrigued us was whether the CR lineage was functionally specified as a centromere component before the divergence of eudicots and monocots. To address this question, we developed a computational method to detect putative ‘CR’ families (defined here as LTR-RT families enriched in centromeres, e.g. CRR1 and CRR2 in rice) in plants. Because functional centromeres in most plants that have been investigated are mainly composed of large arrays of centromere satellite repeats and ‘CR’ elements (Ma et al., 2007), theoretically, ‘CR’ elements should show stronger physical association with centromere satellite repeats than non-‘CR’ elements in a particular genome.
To test this computational method for ‘CR’ identification, we first assessed the association of centromere satellite repeats with the CR families in indica rice and maize. Because the majority of centromeres have not been completely sequenced or accurately assembled by either the BAC-by-BAC approach (for japonica rice and maize) or the WGS approach (for soybean and indica rice), the assembled genome sequences were not used. Instead, we chose a set of random WGS clones from each genome that make up approximately 1× genome coverage for the association analysis. Association was measured as the percentage of clones containing both centromere satellite repeats and the terminal sequences of an LTR-RT family among all clones containing the terminal sequences of the same LTR-RT family (see Experimental procedures). We found that, as expected, the ‘CR’ elements, CRR1/CRM3, CRR2/CRM2, and CRM1, show strong association with CentO/CentC satellite repeats (5.2–9.7%) (Table 3). By contrast, CRR4 and CRM4 are not associated with the satellite repeats, although they are the largest and the second largest CR families in the rice and maize genomes, respectively (Table 3). These results validate the feasibility of the computational approach to prediction of ‘CR’ families.
Table 3. Physical association of long terminal repeat-retrotransposons (LTR-RTs) with centromere satellite repeats
|Species||Class||Family||No. of associated LTR-RTs||Total no. of LTR-RTs||Association rate (%)|
|Soybean (Williams 82)||Gypsy||Gmr1||14||3756||0.4|
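The association rate in Table 3 is a simple proportion, which can be sketched as follows. The representation of each clone as a set of feature names is a hypothetical data structure chosen for illustration, and the function names are assumptions; the actual sequence searches are described in Experimental procedures.

```python
def association_rate(n_with_satellite, n_total):
    """Percentage of clones carrying a family's LTR termini that also
    carry centromere satellite repeats."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_with_satellite <= n_total:
        raise ValueError("n_with_satellite must lie between 0 and n_total")
    return 100.0 * n_with_satellite / n_total


def association_rate_from_clones(clone_hits, family, satellites):
    """Compute the rate from per-clone hit sets (hypothetical input format).

    clone_hits: iterable of sets naming the features found in each clone.
    family:     name of the LTR-RT family whose termini were searched.
    satellites: set of satellite repeat names.
    """
    total = associated = 0
    for hits in clone_hits:
        if family in hits:
            total += 1
            if hits & satellites:
                associated += 1
    return association_rate(associated, total)


# Gmr1 row of Table 3: 14 associated clones out of 3756.
print(f"Gmr1: {association_rate(14, 3756):.1f}%")
```

For Gmr1 this gives about 0.4%, in line with the tabulated value and well below the 5.2–9.7% observed for the validated ‘CR’ families in rice and maize.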
The physical association of centromere satellite repeats, CentGm-1 and CentGm-2 (Gill et al., 2009), with all LTR-RT families in soybean was subsequently analyzed. We found that 16 LTR-RT families showed association with the satellite repeats, out of which Gmr12/GmGypsy11 and Gmr17 showed the strongest association (5.5 and 3.8%, respectively), and thus were considered as putative ‘CR’ families in soybean. Both Gmr12/GmGypsy11 and Gmr17 belong to the CR lineage. Additionally, the co-localization of centromere satellite repeats and Gmr12/GmGypsy11 was visualized by fluorescence in situ hybridization (FISH) (Figure 6). Small proportions of WGS sequences matching both LTR-RTs and satellites were found for nearly all large-copy-number families (Table 3), but these probably reflect random insertions of LTR-RTs instead of preferential enrichment of ‘CR’ in soybean centromeres. This is consistent with the observations in rice that non-CRR families were found in the functional domain of rice centromeres (Nagaki et al., 2004; Wu et al., 2004; Ma and Bennetzen, 2006). If indeed these two CR families are associated with the functional centromeres of soybean, then it would be reasonable to deduce that the functional specification of ‘CR’ as centromere components pre-dates the divergence of monocots and eudicots about 140–150 Mya (Chaw et al., 2004).
Figure 6. Co-localization of centromere satellite repeats and a putative centromere retrotransposon on mitotic chromosomes. (a) 4′,6-diamidino-2-phenylindole (DAPI)-stained chromosomes, blue channel. Bar = 5 μm. (b) Location of the centromeres using CentGm-1 and CentGm-2 as probes, red channel. (c) Gmr12/GmGypsy11 localized to chromosomes, green channel. (d) Merged image: CentGm-1 and CentGm-2 (red), with Gmr12/GmGypsy11 (green) on DAPI-stained chromosomes (blue).
As mentioned earlier, the CR lineage is the most conserved between the eudicot and monocot species (Figures 4b and 5). This suggests that the CR lineage evolves more slowly than other lineages, probably because the CR families have been selected for an important centromere function, are located in a more slowly evolving portion of the genome, or both (Ma et al., 2007). Similar to CRR4 and CRM4, the other eight CR families in soybean may have lost their functional roles as centromere components during evolution of the soybean genome. The CR lineage was also found in Medicago and Lotus (Figure 5), but whether its members are enriched in the centromeres of these two eudicots remains to be determined.
Presence and absence of the env-like genes in the putative plant endogenous retrovirus lineages/sublineages/families
LTR-RTs are thought to proliferate by reverse transcription, the molecular mechanism by which retroviruses replicate. Unlike retroviruses from vertebrates, typical LTR-RTs do not contain an env-like gene, which encodes a transmembrane coiled-coil glycoprotein that mediates retroviral infection. Many non-infectious mammalian endogenous retroviruses also encode a transmembrane coiled-coil protein, but functional expression of these genes has not been demonstrated. Infectious endogenous retroviruses were found in Drosophila melanogaster (Kim et al., 1994; Song et al., 1994; Malik et al., 2000), but whether they exist in plants remains unclear. Nevertheless, the discovery of putative env-like genes encoding hypothetical proteins with similar secondary structural elements immediately downstream of pol in some plant LTR-RTs was interpreted as ‘indirect’ evidence for the existence of endogenous plant retroviruses (Kumar, 1998; Peterson-Burch et al., 2000; Miguel et al., 2008). To understand the evolution of putative endogenous retroviruses in plants, we first reassessed the lineages/families/elements that have env-like genes in soybean, rice, and Arabidopsis in the context of their phylogenies, constructed based on the RT domains from intact elements (Figures 4 and 7).
Figure 7. Phylogenetic trees of the putative plant retrovirus lineages. (a) Copia families and (b) Gypsy families. The trees were constructed using the conserved reverse transcriptase (RT) nucleotide sequences, and rooted using the RT sequences of DM and INVIDER2 elements in Drosophila. The putative retrovirus families are marked by bold branches.
Maximus is the only Copia lineage that has putative endogenous retroviral elements. This lineage contains a sublineage, composed of soybean family Gmr2 [i.e. SIRE (Laten et al., 1998, 2003)] and Arabidopsis family Atr1 [i.e. Endovir1 (Laten, 1999; Peterson-Burch et al., 2000)], two putative endogenous retrovirus families previously reported. TBlastN searches using these two putative ENV-proteins as queries against genome sequences investigated in this study identified five additional families with significant matches (e-value < 10^−6), including two Arabidopsis families (Atr37 and Atr49) and three Lotus families (Lj1, Lj2, and Lj3) (Figure 7a). No rice or Medicago family was found in this sublineage based on this analysis.
Athila and Tat are the two Gypsy lineages that contain putative endogenous retroviral elements based on the presence of a gene encoding a predicted transmembrane protein (Figure 7b). Athila contains 10 soybean families, seven Arabidopsis families, and one rice family (Vicient et al., 2001). Of these 18 families, seven from soybean, five from Arabidopsis, and the one from rice were found to harbor an env-like gene, including the previously identified putative endogenous retrovirus families Calypso (i.e. Gmr11) in soybean (Peterson-Burch et al., 2000) and Athila4 (i.e. Atr9) in Arabidopsis (Wright and Voytas, 2002). In addition, putative endogenous retroviral elements belonging to this lineage were also found in Lotus (Lj18), Medicago (Mtr60 and Mtr64) (Figure 7b), and barley (Vicient et al., 2001). If one presumes that the env-like genes in this lineage have a common origin, the absence of the env-like genes in Gmr43, Gmr19/Diaspora and Gmr1 may be interpreted as the outcome of deletion of the gene, as has previously been proposed (Yano et al., 2005).
The Tat lineage is evolutionarily close to Athila and contains two putative soybean endogenous retrovirus families, Gmr9/SNARE/GmOgre and Gmr338. Gmr9/SNARE/GmOgre is the largest family in soybean, and its ENV-like protein shares approximately 31% identity with that of Gmr11/Calypso. The two largest OGRE families in Medicago (Mtr57 and Mtr59) were found to be evolutionarily close to Gmr9/SNARE/GmOgre, but these OGRE families lack the env-like gene, indicating that either the Gmr9/SNARE/GmOgre family captured an env-like gene or the OGRE families lost the env-like gene after the divergence of soybean and Medicago. Given that the two soybean families (i.e. Gmr9/SNARE/GmOgre and Gmr338) that contain the env-like genes are, on average, younger than many other families that do not contain this gene (Figure 7b), it is most likely that the former captured the env-like genes from the Athila lineage.
It should be mentioned that a large number of families within two putative endogenous retroviral lineages (Maximus and Tat) possess a third ORF between the pol and PPT regions. Although the hypothetical proteins encoded by these extra ORFs do not have significant matches with previously described plant ENV-like proteins, some do have strong signatures of transmembrane domains, similar to retrovirus env genes (Figure S5; Hofmann and Stoffel, 1993). The nature and origin of the third ORF remain unclear.
Origin of env-like genes in retrovirus-like families in plants
Given that the env-like genes were found in SIRE1 and Athila elements that belong to Copia and Gypsy superfamilies, respectively, it was proposed that the putative plant endogenous retroviruses have had at least two independent origins (Peterson-Burch et al., 2000). However, this hypothesis rests on the assumption that the putative Copia and Gypsy retrovirus lineages evolved independently. To shed light on the origin and evolution of the env-like genes, we performed phylogenetic analysis of the putative endogenous retroviral elements using the conserved RT proteins and ENV domains, respectively, and compared the PBS and PPT motifs from these elements. As expected, the Copia families and Gypsy families were separated into distinct lineages, and the soybean elements and Arabidopsis elements were clearly distinguished based on their RT domains (Figure 8a). Furthermore, the nucleotide similarities of the PBS and PPT motifs (Figure 8b) perfectly reflect the relationships among these families as revealed by the RT domains. By contrast, the ENV-like domains of the Copia and Gypsy families did not separate into two distinct groups (Figure 8c, Figure S6), indicating that the env-like genes may not have been components of the common ancestor of the Copia and Gypsy superfamilies. In other words, either Copia or Gypsy, or both superfamilies, may have captured the env-like genes after their bifurcation. Nevertheless, the RT and ENV-like domains from the putative Copia endogenous retrovirus families of soybean, Lotus, and Arabidopsis exhibit consistent phylogenetic relationships, suggesting that the env-like genes in these Copia elements have a common origin. The average similarity among the ENV-like domains included in Figure 8(c) is 21%.
If the phylogeny of the ENV domains shown in Figure 8(c) reflects the origin and evolution of the env-like genes, it would be reasonable to deduce that the putative Copia endogenous retroviral elements captured the env-like genes from the putative Gypsy endogenous retroviral elements. Together, these data are in favor of the hypothesis that the env-like genes in the Copia and Gypsy endogenous retroviral elements in plants may have a common origin.
Figure 8. Evolutionary relationships of the putative plant retroviruses. (a) Phylogenetic tree constructed using the reverse transcriptase (RT) protein domains. (b) The relationship of the putative retroviruses reflected by the primer-binding site (PBS) and polypurine tract (PPT) sites. (c) Phylogenetic tree constructed using the ENV-like protein domains, which was rooted using a putative retrovirus element in Drosophila (accession number gi140836).
Rigy2 is the only putative endogenous retrovirus family identified in rice from this analysis. However, neither the RT nor the ENV-like domains separate this family from the putative endogenous retrovirus families in soybean and Arabidopsis (Figures 7b and 8b). On the other hand, the relationships between Rigy2 and the putative Gypsy endogenous retrovirus families in soybean and Arabidopsis reflected by both domains appear to be consistent. Whether Rigy2 represents an ancient horizontal transfer from eudicots to rice remains unclear.