Evolution of the large genome in Capsicum annuum occurred through accumulation of single-type long terminal repeat retrotransposons and their derivatives

Authors

  • Minkyu Park,

    1. Interdisciplinary Program in Agriculture Biotechnology, Seoul National University, Seoul 151-921, Korea
    2. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Jongsun Park,

    1. Fungal Bioinformatics Laboratory, Seoul National University, Seoul 151-921, Korea
    2. Center for Fungal Pathogenesis and Department of Agricultural Biotechnology, Seoul National University, Seoul 151-921, Korea
  • Seungill Kim,

    1. Interdisciplinary Program in Agriculture Biotechnology, Seoul National University, Seoul 151-921, Korea
    2. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
    3. Fungal Bioinformatics Laboratory, Seoul National University, Seoul 151-921, Korea
  • Jin-Kyung Kwon,

    1. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Hye Mi Park,

    1. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Ik Hyun Bae,

    1. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Tae-Jin Yang,

    1. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Yong-Hwan Lee,

    1. Fungal Bioinformatics Laboratory, Seoul National University, Seoul 151-921, Korea
    2. Center for Fungal Pathogenesis and Department of Agricultural Biotechnology, Seoul National University, Seoul 151-921, Korea
  • Byoung-Cheorl Kang,

    1. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
  • Doil Choi

    Corresponding author
    1. Interdisciplinary Program in Agriculture Biotechnology, Seoul National University, Seoul 151-921, Korea
    2. Department of Plant Science and Plant Genomics and Breeding Institute, Seoul National University, Seoul 151-921, Korea
      (fax +82 2 873 2056; e-mail doil@snu.ac.kr).

Summary

Although plant genome sizes are extremely diverse, the mechanism underlying the expansion of huge genomes that did not experience whole-genome duplication has not been elucidated. The pepper, Capsicum annuum, is an excellent model for studies of genome expansion due to its large genome size (2700 Mb) and the absence of whole genome duplication. As most of the pepper genome structure has been identified as constitutive heterochromatin, we investigated the evolution of this region in detail. Our findings show that the constitutive heterochromatin in pepper was actively expanded 20.0–7.5 million years ago through a massive accumulation of single-type Ty3/Gypsy-like elements that belong to the Del subgroup. Interestingly, derivatives of the Del elements, such as non-autonomous long terminal repeat retrotransposons and long-unit tandem repeats, played important roles in the expansion of constitutive heterochromatic regions. This expansion occurred not only in the existing heterochromatic regions but also into the euchromatic regions. Furthermore, our results revealed a repeat of unit length 18–24 kb. This repeat was found not only in the pepper genome but also in the other solanaceous species, such as potato and tomato. These results represent a characteristic mechanism for large genome evolution in plants.

Introduction

Heterochromatin is the tightly condensed chromosome region that is visualized by staining of pachytene chromosomes during interphase of cell division. Heterochromatin can be divided into two types, constitutive and facultative heterochromatin, according to the sequence components in the heterochromatic region (Grewal and Jia, 2007). Constitutive heterochromatin mainly consists of dense repeat elements and is observed at flanking regions of centromeres and also near telomere or knob regions. In contrast, facultative heterochromatin can be found in gene-rich regions and the chromatin state can be changed in response to cellular signals and gene activity. Constitutive heterochromatin maintains a condensed structure throughout the cell cycle and comprises large parts of eukaryotic genomes.

Formation of constitutive heterochromatin structure occurs by epigenetic control via RNA interference of repetitive sequences (Volpe et al., 2002; Lippman et al., 2004). The repeat elements in the constitutive heterochromatin are largely divided into two types: class I retrotransposons that use an RNA transcript as an intermediate and class II transposons that use DNA as an intermediate (Wicker et al., 2007). In plant genomes, the class I retrotransposons, particularly the long terminal repeat (LTR) retrotransposons, are major factors that constitute the heterochromatin sequence (CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium, 2000; Mao et al., 2000; Meyers et al., 2001; Wang et al., 2006).

In general, heterochromatic regions are considered less important than euchromatic regions due to the extremely low gene density in the region, and thus most biological studies have focused on functional genes in euchromatic regions. Nevertheless, several important roles have been identified for heterochromatic regions. Heterochromatic regions are important for meiotic chromosome segregation (Dernburg et al., 1996) and also cause position-effect variegation on adjacent genes (Csink and Henikoff, 1996). In addition, recent studies have suggested that rapid evolution of heterochromatic regions can cause speciation (Bayes and Malik, 2009; Ferree and Barbash, 2009; Hughes and Hawley, 2009).

Intensive analysis of the 0.52-Mb knob region sequence of Arabidopsis thaliana (among a total of 1.17 Mb) revealed that constitutive heterochromatic regions of plants consist of a number of retrotransposons, tandem repeats, and DNA transposons (CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium, 2000). Additional analysis of six bacterial artificial chromosome (BAC) sequences harboring tomato heterochromatic regions revealed that tomato heterochromatic regions mainly contain Ty3/Gypsy-like retrotransposons (Wang et al., 2006). Comparative analysis of four Brassicaceae species demonstrated that the pericentromeric heterochromatin has undergone dynamic changes through the insertion of transposable elements, resulting in different heterochromatin structures between these closely related species (Hall et al., 2006).

Although constitutive heterochromatic regions comprise a major portion of plant genomes, how these regions have been expanded remains to be elucidated in many species. To examine the expansion of constitutive heterochromatic regions, the solanaceous species are a good model as almost all members share the same chromosome number (x = 12) (Wikström et al., 2001) but have very different genome sizes. For example, the genome sizes of Solanum tuberosum (potato), Solanum lycopersicum (tomato), Petunia hybrida (petunia), and Capsicum annuum (pepper) are 840, 950, 1200, and 2700 Mb, respectively. The gene contents and order among the solanaceous species are syntenic, and these species have not experienced whole genome duplication after speciation (Wang et al., 2008; Wu et al., 2009). Hence, the genome size diversity of the solanaceous species is probably mainly due to differential expansion in the constitutive heterochromatic regions.

Among the solanaceous species, the pepper genome may be the best model for the analysis of genome expansion through expansion of constitutive heterochromatic regions due to its large genome size. It is known that the constitutive heterochromatic regions in pepper are highly expanded compared with those in tomato, and most of the pepper genome regions consist of constitutive heterochromatin structure (Park et al., 2011). Furthermore, the pepper genome has not undergone a whole-genome duplication event since its speciation in the Solanaceae family (Wu et al., 2009). Although the maize genome is also large, at 2300 Mb in the diploid phase (Schnable et al., 2009), it underwent a whole-genome duplication event 12–5 million years ago (Ma) (Blanc and Wolfe, 2004; Swigoňová et al., 2004). Thus, pepper has a distinct history of genome expansion compared with that of other plant species, including maize.

In this study, we report the evolutionary history of pepper genome expansion. For this purpose, the expansion of the pepper constitutive heterochromatic regions was analyzed in detail. To analyze at the sequence level, we used seven full-contig BAC sequences harboring pepper heterochromatin segments and 125 randomly selected pepper BAC clone sequences that were generated as partial contigs (Yoo et al., 2003). In addition, 65× whole-genome shotgun sequences in draft state were also used to estimate the fraction of analyzed repeats in the whole genome. To determine the distribution of analyzed repeats in the genome, we used fluorescence in situ hybridization (FISH). Our findings provide deeper insight into plant genome diversity for a species that has not experienced whole genome duplication.

Results

The structure of the pepper heterochromatin sequence

To examine the repeat composition in the constitutive heterochromatic regions of pepper at the sequence level, we sampled seven pepper BAC clones that harbored repeat-rich regions and sequenced them by full-length contigs (Table S1). The BAC clones were selected based on FISH analyses that showed high-density FISH signals in pepper constitutive heterochromatic regions. The total size of the seven selected BAC sequences was 807 290 bp.

The sequence structure of the seven BAC sequences was analyzed in detail and was found to consist of a total of eight types of repetitive sequences (Figure 1a). The repetitive sequences were validated by nucleotide BLAST search against the pepper sequence database that contained 1235 pepper BAC sequences (Park et al., 2011). The sequence regions that exhibited high copy numbers (at least five) were considered repeats and were confirmed again by nucleotide BLAST search against the Repbase (Jurka et al., 1996). Two BAC sequences (CaCM403E16 and CaCM557C04) clearly showed dense FISH signals at the constitutive heterochromatic regions (Figure 1b,c).
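
The copy-number criterion described above (at least five copies) can be illustrated with a minimal Python sketch. The input layout (one tuple per BLAST hit giving the query region, the subject BAC, and the subject start position) and the function name are hypothetical; the BLAST parameters and the exact definition of a hit are not specified in this section and are therefore left out.

```python
from collections import Counter


def high_copy_regions(hits, min_copies=5):
    """Return the query regions that have at least `min_copies` distinct hits.

    `hits` is an iterable of (query_region, subject_id, subject_start) tuples,
    a hypothetical layout for parsed nucleotide BLAST results.
    """
    # Count each distinct (region, subject, start) hit once.
    counts = Counter(region for region, _subject, _start in set(hits))
    return sorted(region for region, n in counts.items() if n >= min_copies)


if __name__ == "__main__":
    example_hits = [("regionA", f"BAC{i}", 100) for i in range(6)]
    example_hits += [("regionB", "BAC1", 5), ("regionB", "BAC2", 9)]
    print(high_copy_regions(example_hits))
```

In this toy input, only the region with six distinct hits passes the threshold of five.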

Figure 1.

 Repeat analysis of the seven selected bacterial artificial chromosome (BAC) sequences.
(a) The horizontal bars depict each BAC sequence. Green bars, Ty3/Gypsy elements; gray bars, large retrotransposon derivative (LARD) elements; red bars, Ty1/Copia elements; orange bars, non-long terminal repeat (LTR) retrotransposon elements; blue circles, DNA transposon elements; purple arrows, tandem repeats; black diamonds, caulimovirus elements; sky blue bars, unknown repeat elements; white arrows, genes. Numbers over the repeats indicate the LARD elements. Capital letters and lower-case letters above the repeats indicate degenerated Ty3/Gypsy-like elements by nested insertion and intact elements, respectively. Greek letters indicate Ty1/Copia-like elements. The insertion time of each repeat is shown under the repeats. The nested insertions of the repeats are depicted stepwise over the sequence.
(b), (c) Fluorescence in situ hybridization (FISH) analyses of the selected BAC clones. The probes used in each analysis are shown at the bottom of each panel. The FISH result and pachytene chromosome structure are shown in left and right images of each panel, respectively. Heterochromatin and euchromatin structures are indicated by red and white arrows, respectively.

The Ty3/Gypsy-like elements were most abundant in the seven BAC sequences (Figure 1a, green bar). A total of 36 Ty3/Gypsy-like elements were identified from the seven BAC sequences. Interestingly, the second most abundant elements in the seven BAC sequences were large retrotransposon derivatives (LARDs; Figure 1a) (Kalendar et al., 2004). Nine LARD elements were found in these sequences. These elements exhibited the canonical structure of non-autonomous LTR retrotransposons that contain both LTR sequences without any functional genes for retrotransposition.

Dynamic accumulation of Ty3/Gypsy-like elements in constitutive heterochromatic regions

To investigate how the activity of the Ty3/Gypsy-like elements has affected pepper heterochromatin expansion, we generated a phylogenetic tree of the pepper Ty3/Gypsy-like elements. To generate such a tree without bias, we used the sequences of an additional 125 randomly selected pepper BAC clones, the total assembled sequence length of which was 17 846 790 bp. In these sequences, the reverse transcriptases (RTs) of the Ty3/Gypsy-like elements were identified (see Experimental Procedures for details). A total of 254 RTs were used for generation of the phylogenetic tree. Using the protein BLAST search of the RTs against the GyDB (http://gydb.org/) (Lloréns et al., 2008), the phylogenetic tree was divided into three major subgroups: Tat, Athila, and Del (Figure 2a).

Figure 2.

 Phylogenetic analysis and insertion time estimation of pepper Ty3/Gypsy-like elements.
(a) A total of 254 reverse transcriptases (RTs) of the pepper Ty3/Gypsy-like elements were used in generating the tree. Subgroups were identified with a BLAST search against the GyDB (http://gydb.org/), and the representative elements of each subgroup are also included in the three subgroups (green marks and letters). The Ty3/Gypsy-like elements found in the seven selected bacterial artificial chromosome (BAC) sequences are marked with small letters under the branches. The Del elements that are conserved for both LTRs are marked with red lines. Bootstrap values were generated by 1000 repetitions.
(b) Fluorescence in situ hybridization (FISH) analysis of a Ty3/Gypsy-like element in the Del subgroup. The Del element marked in (a) with ‘k’ was used as a probe. The FISH result and pachytene chromosome structure are shown in top and bottom images of the panel, respectively. Heterochromatin and euchromatin structures are indicated by red and white arrows, respectively.
(c) The insertion times of the Ty3/Gypsy-like elements (Del subgroup), large retrotransposon derivatives (LARDs), and Ty1/Copia-like elements are shown on the horizontal lines. The class marks are calculated in 2.5 Ma units. The vertical dotted line indicates the speciation time (19.2 Ma) between pepper and tomato.

Approximately 85% of the Ty3/Gypsy-like elements were Del elements (216 of 254 elements). Among the 36 Ty3/Gypsy-like elements in the seven BAC sequences, 13 elements could be included in the phylogenetic tree. Of the 13 elements, 12 belonged to the Del subgroup (Figure 2a, marked with lower-case letters). The FISH analysis using one of the Del elements (Figure 1a, marked ‘k’) revealed that the paralogs of this element are mainly distributed in the constitutive heterochromatic regions (Figure 2b). Hence, this analysis revealed that the Ty3/Gypsy-like elements in the Del subgroup are the major components of the repeats in the constitutive heterochromatic regions of pepper.

The expansion history of the pepper heterochromatic regions was investigated by estimating the insertion time of the Del elements. A total of 84 intact Del elements maintaining both LTR sequences were used to estimate insertion time (Figure 2a). Accumulation of the Del elements increased before the speciation time (19.2 Ma) and decreased after approximately 10.0 Ma. Approximately 76% of the Del elements were accumulated 20.0–7.5 Ma (Figure 2c), indicating that expansion of the heterochromatic regions mainly occurred during this period.
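
Insertion times of intact LTR retrotransposons are commonly estimated from the divergence between the two LTRs of one element, which are identical at the time of insertion, as T = K / (2r), where K is the per-site distance between the LTRs and r is the substitution rate per site per year. The Python sketch below illustrates this general approach under stated assumptions: the Kimura two-parameter distance is used for K, the LTR pair is assumed to be already aligned, and the substitution rate is a placeholder supplied by the caller. It is not a reproduction of the procedure used in this study, which is described in Experimental Procedures.

```python
import math


def k2p_distance(ltr_a, ltr_b):
    """Kimura two-parameter distance between two aligned LTR sequences."""
    purines = {"A", "G"}
    transitions = transversions = compared = 0
    for a, b in zip(ltr_a.upper(), ltr_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gap characters and ambiguous bases
        compared += 1
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable aligned positions")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_ma(distance, rate_per_site_per_year):
    """Insertion time in Ma from T = K / (2r); the rate is an assumed input."""
    return distance / (2 * rate_per_site_per_year) / 1e6


def fraction_in_window(times_ma, oldest=20.0, youngest=7.5):
    """Fraction of elements whose insertion time falls within a window (Ma)."""
    inside = [t for t in times_ma if youngest <= t <= oldest]
    return len(inside) / len(times_ma)
```

The `fraction_in_window` helper corresponds to the kind of summary reported above for the 20.0–7.5 Ma interval; its default bounds are taken from the text, while everything else in the sketch is illustrative.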

Large retrotransposon derivative elements in the pepper genome

In the seven pepper BAC sequences, LARDs were the second most abundant elements (Figure 1a). Among the nine LARD elements found in these sequences, eight elements that have no nested insertion were used for further analyses. Dot-plot analysis revealed no or low sequence similarity between the elements, except for the CaLARD-6 and CaLARD-8 sequences, which exhibited some similarity with 70–79% identity (Figure 3a). This result indicates that diverse types of LARDs exist in the pepper genome. The frequencies of each of these elements in the pepper whole genome shotgun (WGS) draft sequence were 0.04–0.47%, and the total frequency of these LARD elements was 2.21% (Figure 3b).
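
For readers unfamiliar with dot-plot comparison, the following minimal Python sketch shows the underlying idea: every position pair at which two sequences share an identical word is recorded as a dot, so that similar regions appear as diagonals and tandem repeats as parallel diagonals. The word length and the function name are arbitrary choices for illustration; the dot-plot software and settings used in this study are not stated in this section.

```python
from collections import defaultdict


def dotplot_matches(seq_x, seq_y, word=4):
    """Return (i, j) pairs where seq_x[i:i+word] equals seq_y[j:j+word]."""
    # Index every word of seq_y by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_y) - word + 1):
        index[seq_y[j:j + word]].append(j)
    # Look up each word of seq_x in the index.
    matches = []
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], ()):
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    # A two-copy tandem array compared with a single unit gives two dots.
    print(dotplot_matches("ACGTACGT", "ACGT", word=4))
```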

Figure 3.

 Analysis of large retrotransposon derivative (LARD) elements.
(a) The eight intact LARD elements were compared by dot-plot analysis. The vertical and horizontal axes represent the sequences of the eight intact LARD elements divided by green lines.
(b) The frequencies of each LARD element in the pepper draft whole genome shotgun sequences are shown as percentages.
(c) Fluorescence in situ hybridization (FISH) analysis of CaLARD-8, demonstrating that the FISH signals are mainly distributed in the heterochromatic regions.
(d) The origin of the three LARD elements was analyzed by comparison with the intact Del elements. The red bar and pink box indicate the coding region of the Del elements.

Distribution of the LARD elements in the pepper genome was investigated by FISH analysis using one of the eight LARD elements, CaLARD-6, as a probe. This analysis revealed that the LARD elements are distributed mainly in the constitutive heterochromatic regions (Figure 3c). Hence, we investigated whether the LARD elements originated from the Del elements via comparison of the 84 intact Del elements with the eight LARD elements. Three of the eight LARD elements contained regions that were similar to the intact Del elements, indicating that these elements originated from the Del elements (Figure 3d). The genes for retrotransposition of the Del elements were lost in the three LARD elements (Figure 3d, marked with red box); however, the regions in the Del elements that exhibited similarity to the LARD elements differed. CaLARD-2 contained the inner region between the LTR sequences of the Del elements, and CaLARD-5 comprised the inner region and both LTR sequences. In addition, CaLARD-8 comprised only the LTR sequences of the Del element. These results indicate that although the LARD elements originated from Ty3/Gypsy-like elements in the Del subgroup, the evolutionary process of the LARD element is different in each case.

Tandem repeats in pepper heterochromatin and their origin

Among the seven BAC sequences, CaCM403E16 showed a unique sequence structure (Figure 1a). This BAC contained two types of tandem repeats, which spanned approximately 63% of the sequence (Figure 4a). Interestingly, the two tandem repeats had long repeat units of 3.3 and 5.0 kb, respectively. Therefore, the names LUTR-s (long-unit tandem repeat-short) and LUTR-l (long) were used to denote the tandem repeats of unit length 3.3 and 5.0 kb, respectively. The distribution of LUTR-s and LUTR-l in the pepper genome was investigated by FISH analysis. Both LUTR-s and LUTR-l were distributed mainly in the constitutive heterochromatic regions with high copy numbers (Figure 4b,c). The LUTR-s signals were denser than those of LUTR-l.

Figure 4.

 Analysis of long-unit tandem repeats (LUTRs).
(a) The sequence structure of CaCM403E16, which contains LUTR-short (s) and LUTR-long (l), was analyzed by dot-plot analysis. The repeat contents are shown under the dot-plot (see Figure 1 legend for explanation). A dot-plot comparison of LUTR-s and LUTR-l with CaCM403E16 is shown below.
(b), (c) Fluorescence in situ hybridization analysis of LUTR-s (b) and LUTR-l (c) demonstrating that these elements are distributed mainly in the heterochromatic regions.
(d) The pepper bacterial artificial chromosome (BAC) sequence contigs that contain LUTR-s were compared with each other. The vertical black bar indicates the BAC sequence contigs. Black and red lines that link each BAC sequence contig indicate highly similar regions in the same and opposite direction, respectively. Purple bars, LUTR-s; green bars, Ty3/Gypsy-like elements; gray bars, large retrotransposon derivative (LARD). The braces under the contigs depict the type of LUTR-s paralog.
(e) Dot-plot comparison between a LUTR-s and Del element. The diagonal lines indicate similarity between both sequences. Dark and light green bars under the dot-plot indicate the long terminal repeat and inner region of the Del element, respectively.
(f) Comparison of the pepper BAC sequence contigs that contain LUTR-l. The yellow bar indicates rDNA. See legend to part (d) for explanation.
(g) Dot-plot comparison between a LUTR-l and Del element. See legend to part (e) for explanation.

To identify the origin of LUTR-s, we examined paralogous copies of the repeats in the sequence database of 1235 pepper BACs. A number of contigs that contained paralogous copies of LUTR-s were found. The 10 sequence contigs that contained paralogous copies of LUTR-s were analyzed in detail for repeat contents and compared with each other (Figure 4d). The paralogous copies of LUTR-s were in tandem array or singlet form in the sequences (Figure 4d, left brace). The comparison revealed that the sequence of LUTR-s is similar to the LTR sequence of Ty3/Gypsy-like elements (Figure 4d, right brace), which were identified as belonging to the Del subgroup. In addition, sequence comparison of LUTR-s with the 84 intact Del elements revealed partial matches in the LTR sequences (Figure 4e). These results indicate that LUTR-s originated from a derivative of the Del element, namely a solo-LTR.

In the same way, the origin of LUTR-l was investigated. Five contigs that contained paralogous copies of LUTR-l were found and also analyzed by the same method. The paralogous copies of LUTR-l were found in singlet forms (Figure 4f). Likewise, the sequence comparison of LUTR-l with the 84 intact Del elements showed partial matches in the LTR sequences (Figure 4g), indicating that the origin of LUTR-l is also a derivative of the Del element.

Our manual inspection of the pepper BAC sequences revealed two additional LUTRs (Figure S1). Interestingly, the origin of these two LUTRs was also the Del element. One was a paralog of LUTR-s with a copy number of 12 (Figure S1a,b) and the other was a paralog of CaLARD-2 with a copy number of five (Figure S1c).

Genes in the pepper heterochromatic regions

A total of five genes were identified in the three BAC sequences (CaCM278G16, CaCM642F15, and CaCM778H11; Figure 1a). To determine how these genes became located in the middle of the heterochromatic region, we compared the pepper sequences with their collinear tomato sequences. Tomato orthologs of these genes were identified by nucleotide BLAST search against the tomato genome database (http://solgenomics.net) (Mueller et al., 2005a,b). Four of the five tomato orthologous genes were identified as a single copy (Data S1). In addition, the five pepper genes in the heterochromatin sequences were also identified as single copy genes by nucleotide BLAST search against the pepper WGS draft sequence. The presence of these genes as single copy genes in both genomes supports the conclusion that the genes are orthologous.

The tomato collinear sequences that contained the orthologs of the pepper genes were analyzed in detail. Interestingly, the tomato collinear sequences were gene-rich regions (Figure 5, Data S1) and did not contain any repeat sequence except for a segment of a Ty3/Gypsy-like element in SL2.30ch02. Two tandem genes in CaCM278G16 (f′ and g′) and CaCM778H11 (x′ and y′) showed similarity with their collinear tomato sequences including the intergenic regions (Figure 5). The similarity in the intergenic regions indicates that the pepper genes in CaCM278G16 and CaCM778H11 were not transposed into the heterochromatic regions from other euchromatic regions. In addition, the tomato intergenic region in SL2.30ch02 (marked with α) also had an orthologous sequence in CaCM278G16 (Figure 5, marked with α′). The intergenic region between α′ and the adjacent gene, f′, in CaCM278G16 was expanded by insertion of LTR retrotransposons. Considering the well-conserved gene synteny between pepper and tomato (Wang et al., 2008; Park et al., 2011), these results indicate that some pepper euchromatic regions became constitutive heterochromatic regions through the accumulation of repeat elements.

Figure 5.

 Comparative analysis of the pepper heterochromatin sequence and its counterpart tomato sequence.
The upper horizontal bar of each pair indicates the tomato bacterial artificial chromosome (BAC) sequence, while the other bar indicates the pepper heterochromatin that contains genes. The predicted genes are indicated by arrows, and each of the genes is indicated by a small letter for tomato sequence and a small letter with prime for pepper sequence. The α and α′ indicate the orthologous intergenic regions of tomato and pepper, respectively. See legend to Figure 1 for explanation of the repeats.

Unknown repeats in the pepper genome

In-depth analysis of the seven pepper BAC sequences revealed an unknown type of repeat. The repeat was identified in CaCM642F15 (Figure 1a, sky blue bar) and two Ty3/Gypsy-like elements were inserted in the repeat. Excluding the two inserted Ty3/Gypsy-like elements, the length of the repeat reached 21 646 bp. Due to the long unit length, we called it CaLUR (Capsicum long-unit repeat). By nucleotide BLAST search against the sequence database of 1235 pepper BAC clones, we could find four paralogous copies of CaLUR that maintain long unit length (Figure 6a, sky blue bars). Distribution of CaLUR in the pepper genome was determined by FISH analysis. The FISH signals of CaLUR were randomly distributed (Figure 6b).

Figure 6.

 Analysis of Capsicum long-unit repeats (CaLURs).
(a) Comparative analysis of the five CaLUR paralogs (indicated by sky blue bars). See legend to Figure 1 for explanation of the other repeats.
(b) Fluorescence in situ hybridization (FISH) analysis of CaLURs. A segment of CaLUR-c was used as a probe. The FISH signals were randomly distributed in the pepper pachytene chromosomes.
(c) Dot-plot analysis of the four CaLUR paralogs. Red arrows indicate the tandem array structures.
(d) Proportion of CaLUR segments in pepper whole genome shotgun draft sequences. The CaLUR segments were generated by dividing the CaLUR sequence into fragments of 1 kb length. Red box, transcription site; green box, solo-long terminal repeat (LTR). Both end sequences of CaLUR are exhibited at the bottom.
(e) Reverse transcriptase-PCR analysis of the transcription site in CaLUR. RNA expression from the transcription sites was observed in all eight tested pepper tissues.
(f), (g) Analyses of the CaLUR homologs in the potato and tomato bacterial artificial chromosome (BAC) sequences. The CaLUR homologs were named TuLUR (Tuberosum long unit repeat) and LyLUR (Lycopersicum long unit repeat) with consecutive numbers. (f) Comparative analysis of the BAC clone sequences of potato and tomato containing TuLUR and LyLUR. Sky blue boxes, TuLUR and LyLUR; red arrows within sky blue boxes, sites similar to the transcription site in CaLUR. See legends to Figures 1 and 4 for explanation. (g) Dot-plot analysis of the four CaLUR homologs.

To investigate sequence structure, four CaLUR paralogs were compared by dot-plot analysis. The CaLUR paralogs commonly contained a satellite array of 160 bp unit length in the center (Figure 6c, red arrows). Using CaLUR-c, we investigated the frequency of the repeat in the pepper WGS draft sequence by dividing the repeat unit into 19 segments of 1 kb length. Each segment showed a minimum 0.03% frequency in the pepper genome and the total frequency of every segment was approximately 1% (Figure 6d). The sequence segment containing the satellite arrays showed a high frequency of 0.1%. A solo-LTR insertion was found and its components also showed high frequencies of 0.07 and 0.12% (Figure 6d, green box and light gray column). Interestingly, CaLUR did not contain any expected coding region and also did not exhibit any characteristic feature found in previously reported repeats. Nevertheless, nucleotide BLAST search against our pepper expressed sequence tag (EST) database (Figure S2a) revealed that this repeat has EST matches in its 5′ side (Figure 6d, red box). Transcription of the EST matching region in the CaLUR sequence was confirmed by RT-PCR in eight pepper tissues, and the results revealed that transcription occurs in all eight tissues (Figure 6e). The repeat segments containing the transcription site exhibited relatively high frequencies of 0.07 and 0.06% in the whole genome (Figure 6d, black column).
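
The segment-wise frequency estimate described above can be sketched in Python as follows. The sketch assumes that frequency is expressed as the percentage of whole genome shotgun bases that align to a given segment; this definition, the handling of a trailing segment shorter than 1 kb, and all function names are assumptions made for illustration, because the exact procedure is not given in this section.

```python
def split_into_segments(sequence, size=1000):
    """Split a repeat unit into consecutive segments of `size` bp.

    A final segment shorter than `size` is kept here; whether such a
    remainder was used in the original analysis is not stated.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def segment_frequencies(aligned_bases_per_segment, total_wgs_bases):
    """Percentage of WGS bases aligning to each segment (assumed definition)."""
    return [100.0 * n / total_wgs_bases for n in aligned_bases_per_segment]


if __name__ == "__main__":
    unit = "A" * 2500  # toy repeat unit, not a real CaLUR sequence
    print([len(s) for s in split_into_segments(unit)])
    freqs = segment_frequencies([3, 7], total_wgs_bases=1000)
    print(freqs, sum(freqs))
```

Summing the per-segment percentages gives the total frequency of the repeat, which is how a figure such as the approximately 1% reported above would be obtained under this assumed definition.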

Similar sequences of the CaLUR transcription site were also found in the other solanaceous species such as potato and tomato. The similarity between the CaLUR transcription site and its homologous sequences in potato and tomato was 65–68% (Figure S2b and Data S2). Interestingly, the potato and tomato BAC sequences containing this site also revealed long unit repeats with structure similar to CaLUR (Figure 6f,g). The four CaLUR homologs in the potato and tomato BAC sequences had unit lengths of 6–14 kb, which may be decreased due to degeneration. In addition, like CaLUR, they did not contain any expected coding regions. This result indicates that the unknown repeats found in the pepper genome also exist in other solanaceous species.

Discussion

Because constitutive heterochromatic regions comprise the majority of a large genome, the mechanism of genome diversity is often attributed to the expansion of these regions. To date, studies of genome diversity in plants have primarily involved monocot plants that have experienced whole genome duplication (SanMiguel et al., 1998; Vicient et al., 1999; Kalendar et al., 2000; Bruggmann et al., 2006; Wicker et al., 2009). Hence, our study on the expansion of constitutive heterochromatin in the pepper C. annuum provides deeper insight into plant genome diversity for a species that has not experienced whole genome duplication.

Del elements are the major component of the expansion of constitutive heterochromatin in pepper

Our analysis of the expansion history of pepper heterochromatin revealed that expansion mainly occurred 20.0–7.5 Ma, a time period that includes the estimated speciation of pepper and tomato (19.2 Ma) (Wang et al., 2008). During this period, the Ty3/Gypsy-like elements that belong to the Del subgroup massively accumulated in the heterochromatic regions. The Del subgroup belongs to the Chromoviruses, which contain a chromodomain (Malik and Eickbush, 1999) that directs the insertion of the Ty3/Gypsy-like elements into heterochromatic regions (Gao et al., 2008). Fluorescence in situ hybridization analysis of one of these Del elements also supported the preferential accumulation of these elements into the heterochromatic regions of the pepper genome (Figure 2b). In addition, the repeat elements have also been documented to accumulate in the euchromatic regions of the pepper genome (Park et al., 2011); however, the major repeat type identified was different. Whereas the pepper euchromatic regions have primarily been expanded through accumulation of the Tat and Athila elements, the heterochromatic regions have been expanded through accumulation of the Del elements.

According to our estimates of the insertion time of the Del elements, the pepper heterochromatic regions massively expanded after speciation and the degree of expansion has been reduced since 10.0 Ma (Figure 2c). Before the speciation between pepper and tomato, Del elements were also accumulating in the heterochromatic region of the ancestral species. Considering that the older Del elements were included to a lesser extent in the insertion time estimation due to degeneration, the accumulation of the Del elements before speciation would actually have been more frequent than presented in Figure 2(c). Recently, changes in heterochromatin have been shown to cause sterility between closely related species by altering chromosome segregation (Bayes and Malik, 2009; Hughes and Hawley, 2009; Malik and Henikoff, 2009). Thus, the initial expansion of heterochromatic regions in the ancestral species may have caused the speciation between pepper and tomato, and the more active accumulation of Del elements in the pepper following speciation may have formed the larger heterochromatic regions of the pepper genome.

Derivatives of the Del elements have important roles in pepper heterochromatin constitution

The second most abundant repeat elements in the pepper heterochromatic regions were the non-autonomous LTR retrotransposons, LARDs. The LARD elements were initially found in barley and related genomes (Kalendar et al., 2004). The LARD elements proliferate by the same mechanism as the LTR retrotransposons, but the genes required for proliferation are supplied from related LTR retrotransposons (Kalendar et al., 2004; Tomita et al., 2009). In our study, a total of nine LARD elements were found in the seven BAC sequences (Figure 1a).

The origin of the LARD elements was identified as the Del elements (Figure 3d). Although the origin of the LARD elements may, in fact, be a type of Ty3/Gypsy-like element, the part of the Ty3/Gypsy-like element used to generate the LARD elements has not been well studied. Our study revealed that the LARD elements are not only generated by simple deletion of the genes for retrotransposition in an LTR retrotransposon but are also generated from duplication of the internal region of the origin (Figure 3d).

Given that the pepper LARD elements are derived from the Del elements, the proliferation of the LARD elements is possible via complementation of the necessary genes from the Del elements. This possibility indicates that the distribution and proliferation of the LARD elements is affected by activation of the Del elements. Our FISH results demonstrate that the Del element and the LARD element are both found in the heterochromatic regions (Figure 3c). Therefore, active proliferation of the LARD elements in the pepper heterochromatic regions can be accounted for by the active proliferation of the Del elements.

The long-unit tandem repeats, LUTR-s and LUTR-l, were also found in the pepper heterochromatic regions as derivatives of the Del element. We found LUTR-s and LUTR-l to be tandemly repeated solo-LTRs of the Del element. In addition to the high copy number repeat array, we found two or more copies of the tandem array of LUTR-s (Figure 4d). Although LUTRs were frequently found in the pepper genome, few related studies have been reported (CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium, 2000). As a related example, 22.5 copies of a tandem repeat with a unit length of 1950 bp have been reported in the heterochromatic knob region of Arabidopsis (CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium, 2000). This report also suggested that this repeat was a degenerated transposable element. These observations indicate that such LUTRs are formed from degenerated repeat elements. In Drosophila, tandem arrays of transposons result in heterochromatin formation, and the effect is stronger with increasing copy numbers of the repeat (Dorer and Henikoff, 1994). Thus, the LUTR of the degenerated Del element may function to enhance formation of the heterochromatin structure in the highly expanded pepper heterochromatic regions.

Accumulation of Del elements converted pepper euchromatin into heterochromatin

Pepper heterochromatin is known to have a highly expanded structure and an obscure border compared with that of tomato (Park et al., 2011). The reason for the obscure heterochromatin border was revealed by our comparative analysis of the pepper heterochromatic regions and their tomato counterparts. This analysis showed that the pepper heterochromatic regions have invaded euchromatic regions (Figure 5). In the three BAC sequences, CaCM278G16, CaCM642F15, and CaCM778H11, a total of five genes were identified. All of these genes were identified as transcriptionally active genes (Figure S3), indicating that the genes would still be functional by avoiding position effect variegation. Interestingly, all of the LTR retrotransposons in these sequences were inserted after the estimated speciation time between pepper and tomato (Figure 1a). In contrast, the other BAC sequences that do not contain functional genes contained LTR retrotransposons that were inserted before the speciation (Figure 1a, blue numbers). Comparative analysis of the Arabidopsis heterochromatic region with those of closely related species revealed that expansion of heterochromatic regions can occur by spreading into euchromatic regions (Hall et al., 2006). Likewise, accumulation of the pepper Del elements expanded the pepper heterochromatic regions not only in the existing heterochromatic regions but also into the euchromatic regions.

The unknown repeat additionally expands the pepper genome

Besides the Del elements, the pepper genome contained a previously unknown repeat, CaLUR. CaLUR had an unusually long unit length (18–24 kb) and was randomly distributed in the pepper genome (Figure 6). A similar long unit repeat with a 10–17 kb unit length, Helitron, has been reported in Arabidopsis and other plant species (Kapitonov and Jurka, 2001). However, the structure and conserved motifs of Helitron and CaLUR were completely distinct. Whereas Helitron contains functional genes for replication and has conserved terminal sequences of 5′-TC and CTRR-3′, CaLUR did not contain any coding region for replication and the terminal sequences were 5′-GAGAAAA… and …AGCTCAA-3′ (Figure 6d). Therefore, CaLUR does not belong to the Helitron superfamily. This indicates that CaLUR is a previously unknown type of repeat that has an unusually long unit length. The existence of CaLUR homologs in the potato and tomato genomes corroborates its existence in the plant kingdom. However, the functional role and replication mechanism of CaLUR still remain to be elucidated.

Taken together, our results reveal that the expansion of the pepper genome has mainly occurred through massive accumulation of hyperactive single-type Ty3/Gypsy-like elements and derivative repeats within a specific time period. Our findings about the long unit repeat also suggest its additional role in plant genome diversity. This process represents a characteristic mechanism for genome expansion through expansion of constitutive heterochromatic regions in plant species that does not involve a genome-wide duplication event.

Experimental procedures

Sequencing of pepper BAC clones and the whole-genome draft

To determine the sequences of full-length pepper BACs, seven BAC clones were selected on the basis of FISH analyses that showed high-density signals in pepper constitutive heterochromatic regions, and each clone was sequenced using the 454 GS FLX (200-bp read length; 454 Life Sciences, Roche, http://www.roche.com/). Each clone was sequenced in one lane of a 454 GS FLX reaction plate that was divided into 16 lanes. After assembly, the gaps were filled manually. The 125 random BAC clones were pooled and sequenced in one lane of the 454 GS FLX-Titanium (400-bp read length) with the plate divided into two lanes. Raw sequences were produced at 20× coverage of the BAC clones, and the total assembled sequence length was 17 846 790 bp. The sequences were assembled by Newbler 2.0.1. The sequence information from the 1235 BAC sequences was previously reported by Park et al. (2011). All the sequence assembly and manual editing were carried out by the National Instrumentation Center for Environmental Management (NICEM; http://nicem.snu.ac.kr/) and the Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr/) (Park et al., 2008).

The whole-genome shotgun (WGS) sequence draft of the pepper genome was generated as an initial step of the pepper genome project. The sequence was produced using the Illumina Solexa GA with a 75-bp read length. Raw sequences representing 65× coverage of the pepper genome were assembled using NGS Cell software (CLC bio; http://www.clcbio.com/). The average assembled contig length was 1519 bp, and the N50 was 4628 bp. The total assembled contig length was 2.61 Gb, which is close to the expected pepper genome size of 2.7 Gb.
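As an illustration of the N50 statistic quoted above, the following minimal Python sketch computes it from a list of contig lengths. The function name `n50` and the toy lengths are hypothetical and are not taken from the pepper assembly. N50 is taken here as the length of the contig at which the cumulative length, summed from the longest contig downwards, first reaches half of the total assembled length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where at least half of the total is covered.
        if 2 * running >= total:
            return length


# Toy contig lengths in bp (illustrative only, not real assembly data).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("mean length:", sum(toy_contigs) / len(toy_contigs))
print("N50:", n50(toy_contigs))
```

Because N50 is weighted towards long contigs, it is normally larger than the mean contig length, as is the case for the values reported above (4628 bp versus 1519 bp).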

Analysis of the repeat elements and genes

Analysis of the repeat elements in the pepper BAC sequences was carried out by manual dot-plot (Sonnhammer and Durbin, 1996) analysis and BLAST search against the Repbase, GenBank, and pepper BAC sequence databases (Jurka et al., 1996; Park et al., 2008, 2011). From the results of the BLAST search against the pepper BAC sequence database, the sequence regions that exhibited high copy numbers were regarded as repeats. The estimated repeat regions were confirmed by BLAST searches against the Repbase and GenBank databases. Finally, the structure of the repeat was investigated by dot-plot analysis. The LTR retrotransposons and CaLURs identified in this study are provided in Data S3–S7. The repeat frequency in the draft pepper WGS sequences was estimated by BLAST search. The threshold of the e-value in the search was 10⁻⁵, and the minimum similarity at this threshold was approximately 79%. Visualization of the compared sequence contigs was carried out using the GATA 0.7 software (Nix and Eisen, 2005).

The RT sequences of the Ty3/Gypsy-like elements in the 125 random BAC sequences were found by HMMER 2.1 (Eddy, 1998) using the RT sequences reported in the Pfam (http://pfam.sanger.ac.uk/, accession no. PF00078) database as a training set. The phylogenetic tree was generated by MEGA 4.0 software (Tamura et al., 2007). The alignment was performed using ClustalW in the MEGA 4.0 software with the default setting (Data S8), and the tree was drawn by the neighbor-joining method with the Poisson correction model. The degenerated retrotransposons that were not used in the alignment were manually deleted.
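For readers unfamiliar with the Poisson correction model, the sketch below shows the distance it assigns to a pair of aligned amino acid sequences, d = −ln(1 − p), where p is the proportion of differing sites. This is only an illustration under stated assumptions: the function names and the toy sequences are hypothetical, and gaps are handled by simple pairwise deletion, which may differ from the settings used in the analysis described here.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned protein sequences.

    Sites with a gap in either sequence are skipped (pairwise deletion,
    an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance undefined when every site differs")
    return -math.log(1.0 - p)


# Toy aligned fragments (illustrative only): 1 difference in 10 sites.
print(round(poisson_distance("MKTAYIAKQR", "MKTAYIAKQK"), 4))
```

A matrix of such pairwise distances is the input that a neighbor-joining procedure would use to build the tree.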

The insertion time of the LTR retrotransposons was calculated by the method previously reported by SanMiguel et al. (1998). The substitution rate calculated from the intergenic region is faster than the synonymous substitution rate of the coding region (Ma and Bennetzen, 2004). Hence, we calculated the substitution rate between pepper and tomato from the concatenated orthologous intergenic sequences that were acquired from the sequences released by Park et al. (2011) (Data S9). Using the method of calculating the Kimura two-parameter distance in MEGA 4.0 (Li et al., 1985), the rate (r) was calculated as 5.178 × 10⁻⁹ substitutions per site per year. By the same method, the number of substitutions per site between the two LTR sequences of a single LTR retrotransposon (K) was calculated. The LTR sequences were obtained from Data S3–S6. The insertion time was estimated using the formula insertion time = K/(2r).
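The calculation described above can be sketched in a few lines of Python. The sketch below is a minimal illustration, not the procedure as run in MEGA: it computes the standard Kimura two-parameter distance, K = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitions and transversions, and then applies insertion time = K/(2r) with the rate reported in the text. The function names are hypothetical, the toy sequences are not real LTRs, and sites containing gaps or ambiguous bases are simply skipped, which is an assumption of this sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
RATE = 5.178e-9  # substitutions per site per year, as reported in the text


def k2p_distance(ltr_a, ltr_b):
    """Kimura two-parameter distance between two aligned LTR sequences."""
    if len(ltr_a) != len(ltr_b):
        raise ValueError("LTR sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(ltr_a.upper(), ltr_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P model")
    return -0.5 * math.log(term1 * math.sqrt(term2))


def insertion_time_years(k, rate=RATE):
    """Insertion time = K / (2r), in years."""
    return k / (2.0 * rate)


# Toy example (illustrative only): one transition in ten aligned sites.
k = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(f"K = {k:.4f}, insertion time = {insertion_time_years(k) / 1e6:.2f} Ma")
```

The factor of 2 reflects that the two LTRs of an element are identical at the time of insertion and then accumulate substitutions independently, so the divergence between them grows at twice the per-lineage rate.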

The genes contained in the pepper and tomato sequences were predicted by FGENESH using the trained data of tomato (Salamov and Solovyev, 2000). The predicted genes were confirmed by BLAST search against the GenBank database, and genes that had a score ≥100 and an e-value ≤10⁻²⁰ were considered. The copy number of each gene was estimated by BLAST search against the pepper and tomato draft WGS sequences (tomato sequence database, http://sgn.cornell.edu; CFGP, http://cfgp.snu.ac.kr/). Reverse transcriptase-PCR of the pepper genes identified in the heterochromatin sequences was performed for 28 cycles with the primer sets provided in Table S2.
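The score and e-value cut-offs above can be applied with a short filter such as the one sketched here. Several things are assumptions made only for illustration: that the BLAST results are available as tab-separated lines in the common 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), that "score" refers to the bit score, and the function name `filter_hits`. The example lines are invented and do not represent results from this study.

```python
def filter_hits(lines, min_score=100.0, max_evalue=1e-20):
    """Keep tabular BLAST hits with bit score >= min_score and e-value <= max_evalue.

    Assumes the 12-column tab-separated layout described in the lead-in.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if bitscore >= min_score and evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue, bitscore))
    return kept


# Invented example hits (not data from this study).
example = [
    "gene1\tsubjA\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-50\t250",
    "gene2\tsubjB\t80.0\t120\t24\t0\t1\t120\t1\t120\t1e-10\t90",
]
for query, subject, evalue, score in filter_hits(example):
    print(query, subject, evalue, score)
```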

Fluorescence in situ hybridization analysis

Fluorescence in situ hybridization analysis was carried out according to the method reported by Park et al. (2011). The probes of the repeat elements were produced by PCR using the primer sets provided in Table S2. The FISH analyses of CaCM403E16, CaCM388I15, and CaCM557C04 were performed by labeling the BAC clone DNA. The pachytene chromosomes were prepared from C. annuum ‘CM334’. Images of the 4′,6-diamidino-2-phenylindole (DAPI)-stained pachytene chromosomes were converted to black and white in order to clearly distinguish the heterochromatin and the euchromatin structures.

Acknowledgements

This research was supported by a grant from the Agricultural Genome Center of the BioGreen 21 Program (project no. PJ008199012011) and the National Research Foundation (project no. 2010-0015105) funded by the Ministry of Education, Science, and Technology of the Republic of Korea.

Accession numbers: Sequence data from this article can be found in the GenBank data libraries under the accession numbers listed in Table S1. CaCM403E16: GU048901, CaCM388I15: JF330776, CaCM557C04: GU048903, CaCM328A06: JF330775, CaCM278G16: JF330774, CaCM642F15: JF330777, CaCM778H11: JF330778.
