Characterizing the composition and evolution of homoeologous genomes in hexaploid wheat through BAC-end sequencing on chromosome 3B

Authors


*(fax +33 473 624453; e-mail catherine.feuillet@clermont.inra.fr).

Summary

Bread wheat (Triticum aestivum) is one of the most important crops worldwide. However, because of its large, hexaploid, highly repetitive genome it is a challenge to develop efficient means for molecular analysis and genetic improvement in wheat. To better understand the composition and molecular evolution of the hexaploid wheat homoeologous genomes and to evaluate the potential of BAC-end sequences (BES) for marker development, we have followed a chromosome-specific strategy and generated 11 Mb of random BES from chromosome 3B, the largest chromosome of bread wheat. The sequence consisted of about 86% of repetitive elements, 1.2% of coding regions, and 13% remained unknown. With 1.2% of the sequence length corresponding to coding sequences, 6000 genes were estimated for chromosome 3B. New repetitive sequences were identified, including a Triticineae-specific tandem repeat (Fat) that represents 0.6% of the B-genome and has been differentially amplified in the homoeologous genomes before polyploidization. About 10% of the BES contained junctions between nested transposable elements that were used to develop chromosome-specific markers for physical and genetic mapping. Finally, sequence comparison with 2.9 Mb of random sequences from the D-genome of Aegilops tauschii suggested that the larger size of the B-genome is due to a higher content in repetitive elements. It also indicated which families of transposable elements are mostly responsible for differential expansion of the homoeologous wheat genomes during evolution. Our data demonstrate that BAC-end sequencing from flow-sorted chromosomes is a powerful tool for analysing the structure and evolution of polyploid and highly repetitive genomes.

Introduction

The bread wheat genome holds the key to genetic improvements that will meet the growing demand for high-quality food, feed and fuels produced in an environmentally sensitive, sustainable and profitable manner. Moreover, because of its recent history it represents an excellent model for studying polyploidization, a common phenomenon in plant genomes. However, with a size of 17 Gb (five times the human genome and 40 times the rice genome), a repetitive DNA content of >80%, and the presence of three homoeologous subgenomes, the hexaploid wheat genome represents a challenge for molecular studies and for the development of genome-specific markers for genetic and physical mapping (Gill et al., 2004; Langridge et al., 2001). In the past 5 years a number of genomic tools have been produced in wheat, including the second largest expressed sequence tag (EST) collection (880 427) after rice (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html); >3000 microsatellite markers (http://wheat.pw.usda.gov/GG2/index.shtml); and a number of BAC libraries from diploid, tetraploid and hexaploid genomes (Akhunov et al., 2005; Allouis et al., 2003; Cenci et al., 2003; Lijavetzky et al., 1999). These advances have made wheat one of the most densely mapped plant genomes for which no genome sequence is available. This has also allowed the recent map-based cloning of several genes of agronomic interest (reviewed by Keller et al., 2005; Simons et al., 2005), as well as the first molecular studies of wheat genome composition and evolution through comparison of BAC clone sequences.

These analyses have confirmed previous renaturation kinetics studies suggesting that the amount of repetitive DNA in wheat is over 80% (Smith and Flavell, 1975), and have identified the major classes of transposable elements (TE) that compose the wheat genome (Wicker et al., 2002). To date, 25 complete (http://urgi.infobiogen.fr/projects/TriAnnot/GBrowse.php) and 58 partial (http://www.tigr.org/tdb/e2k1/tae1/info.shtml) BAC sequences, representing 3.2 and 7.2 Mb, respectively, and originating from the A-, B- and D-genomes of diploid, tetraploid and hexaploid wheat, have been analysed. However, these sequences are short (approximately 200 kb of contiguous sequences, on average) and non-random as they mainly originate from gene-rich and telomeric regions. The first random analyses of the wheat genome have been performed recently. Devos et al. (2005) have analysed the gene content of 416 kb of sequence corresponding to four BAC clones randomly chosen from the hexaploid wheat cv. Chinese Spring BAC library. Their results suggest that genes are likely to be more randomly distributed on the wheat genome than previously believed, and that about 50% of the sequence corresponds to repeats that have not yet been characterized. Akhunov et al. (2005) have sequenced 138.6 kb from the ends of BAC clones randomly picked in libraries of Triticum urartu (38 kb) and Aegilops tauschii (49.8 kb), the A- and D-genome donors of hexaploid, respectively. Sequences (50 kb) were also generated from Aegilops speltoides, a wild species the S- genome of which is closely related to the B-genome donor of hexaploid wheat (Feldman and Levy, 2005). About 5% of coding regions and 60% of known repeats were identified in the three genomes. However, due to the limited amount of sequences generated from each genome, no detailed comparison of the sequence composition between the homoeologous wheat genomes could be performed. 
An accurate assessment of the wheat genome sequence composition, and a better understanding of its evolution under polyploidization require comprehensive and unbiased data sets from each of the different homoeologous A-, B- and D-genomes. Recently, 2.9 Mb of sequence from a whole-genome shotgun library of the D-genome of A. tauschii, the wild diploid ancestor of hexaploid wheat, have been generated and analysed by Li et al. (2004). Their results indicated that >91% of the sequence consists of repetitive elements and 2.5% of genes, and have provided a first representative data set for one of the homoeologous wheat genomes.

BAC-end sequences (BES) can be very informative in assessing the content and organization of genomes when they represent a significant portion of a genome. Recently, Messing et al. (2004) generated 475 000 maize BES for a cumulative length of 307 Mb, representing one-eighth of the maize genome. This allowed them to estimate the ratio of the different types of repetitive element, as well as the number of genes (approximately 59 000) present in maize. Because of its large size and the presence of three homoeologous genomes, representative large-scale, random analyses of the hexaploid wheat genome are difficult to achieve. While the genomes of wild diploid relatives of wheat can be used as models to perform large random sequence analysis, they will not provide an accurate picture of the hexaploid wheat homoeologous genomes, as there is evidence for dramatic sequence rearrangements following polyploidization (Feldman and Levy, 2005). Therefore the most efficient approach for fully understanding the composition and evolution of the homoeologous genomes is to work on the individual genomes or chromosomes of hexaploid wheat.

Recent advances in flow-sorting techniques (Kubalakova et al., 2002) have allowed the isolation of DNA in sufficient amounts and quality to construct BAC libraries from single wheat chromosomes or chromosome arms. To date, three BAC libraries have been constructed from the 3B and 1D-4D-6D chromosomes of hexaploid wheat cv. Chinese Spring (Janda et al., 2004; Safar et al., 2004) as well as from the short arm of chromosome 1B from cv. Pavon 76 (Janda et al., 2006). Recently, we have initiated the construction of a physical map of chromosome 3B of hexaploid wheat using the chromosome 3B-specific BAC library. In the course of this project, approximately 19 400 BAC-end sequences were generated, representing a cumulative length of nearly 11 Mb (1.1% of the chromosome length) and the most comprehensive, quantitative data set of wheat random genomic sequence produced to date. Here, we present a detailed analysis of these sequences, which provides insight into the sequence composition of a wheat chromosome and offers new data on the composition and evolution of the B-genome of hexaploid wheat. This work also demonstrates that BAC-end sequences represent a valuable source of randomly distributed chromosome-specific markers for genetic and physical mapping in polyploid and repetitive genomes.

Results and discussion

11 Mb of BES distributed along the physical map provide a representative picture of the chromosome 3B sequence composition

With an approximate size of 6 Gb, the B-genome is the largest of the three homoeologous genomes of hexaploid wheat; chromosome 3B alone accounts for about 1 Gb (approximately 2.5 times the rice genome; Lee et al., 2004). We have fingerprinted and assembled the 67 968 BAC clones of the bread wheat cv. Chinese Spring chromosome 3B-specific BAC library (unpublished data). In a first phase of assembly, 3306 contigs were obtained and used to select clones for BAC-end sequencing. A total of 10 752 BAC clones were sequenced from both ends: 9410 of them originated from the 3306 contigs (three BAC clones per contig, on average) and 1342 BACs corresponded to randomly selected singletons (Table 1). In total, 21 504 sequencing reactions were performed, providing 19 399 high-quality reads (success rate = 90.2%; GenBank accession numbers: DX363346-DX382744). The average edited read length was 557 bp, providing a total of 10 803 302 bp of sequence. This represents approximately 1.1% of the estimated length of chromosome 3B. Five sequences per contig were obtained on average, corresponding to a theoretical distribution of one BES every 51.5 kb across the chromosome. To date, about 14 Mb of genomic sequences from wheat and its wild relatives are present in the databases, including 5 Mb from Triticum aestivum (http://www.tigr.org/tdb/e2k1/tae1/info.shtml). However, they mostly (4.2 Mb) correspond to Genome Survey Sequences (GSS), of which the chromosomal location is largely unknown. These sequences provide a general picture of the hexaploid wheat genome, but do not distinguish the specific features of the composition and evolution of the homoeologous A-, B- and D-genomes. Here, by using a chromosome-specific approach, we have been able to generate a significant amount of random sequence for analysing the composition of chromosome 3B from hexaploid wheat.
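The sampling figures quoted above can be re-derived from the raw counts. The following is a minimal sketch; the function and variable names are illustrative and not part of the original analysis, and the chromosome size is the approximate 1 Gb value used in the text.

```python
# Figures quoted in the text; names are illustrative only.
BAC_CLONES_SEQUENCED = 10_752          # clones sequenced from both ends
HIGH_QUALITY_READS = 19_399            # reads retained after editing
TOTAL_BASES = 10_803_302               # bp of edited sequence
CHROMOSOME_3B_SIZE = 1_000_000_000     # bp, approximate size of chromosome 3B


def sampling_summary(clones, reads, total_bases, chromosome_size):
    """Derive success rate, read length, coverage and BES spacing."""
    reactions = 2 * clones  # one reaction per BAC end
    return {
        "reactions": reactions,
        "success_rate_pct": 100 * reads / reactions,
        "mean_read_length_bp": total_bases / reads,
        "coverage_pct": 100 * total_bases / chromosome_size,
        "bes_spacing_kb": chromosome_size / reads / 1000,
    }


summary = sampling_summary(
    BAC_CLONES_SEQUENCED, HIGH_QUALITY_READS, TOTAL_BASES, CHROMOSOME_3B_SIZE
)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

The spacing value is a theoretical average that assumes an even distribution of BES along the chromosome.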

Table 1.   Origin and distribution of the BAC clones used for BAC-end sequencing
BACs             No. BACs   Average number of BACs per contig
Contigs          9410       3
 Size >200 kb    7264       4
 Size >500 kb    3667       7
Singletons       1342       1
Total            10 752

Composition and distribution of the repeat fraction of wheat chromosome 3B

The 11 Mb of BAC-end sequences from chromosome 3B were analysed sequentially for their repeat and gene content using a semi-automated annotation pipeline (see Experimental procedures). The results show that 76.3% of the sequences correspond to known repetitive elements (Table 2), which mostly belong to TE of class I (retroelements, 68.7% of the sequence) and class II (DNA transposons, 5.6% of the sequence). Long terminal repeat (LTR) retrotransposons were predominant, accounting for about 98% of the class I elements, whereas the CACTA family alone accounted for 86.8% of the DNA transposons.

Table 2.   Distribution of repetitive DNA in BAC-end sequences of the bread wheat chromosome 3B
Class, subclass, family                 No. hits   No. bases (bp)   % of chromosome 3B   % of wheat D-genome   % of maize genome   % of rice genome
Class I elements (retrotransposons)     16 167     7 427 147        68.7                 48.0                  63.4                18.8
LTR retrotransposons                    15 762     7 284 358        67.4                 47.0                  51.2                12.6
 gypsy-like                             8971       4 183 925        38.7                 24.4                  30.5                8.9
 copia-like                             3473       1 517 721        14.0                 13.7                  20.7                3.7
 athila-like                            3297       1 577 450        14.6                 8.8                   nd                  nd
 TRIM                                   20         5262             0.0                  0.1                   nd                  nd
Non-LTR retrotransposons                405        142 789          1.3                  1.1                   0.1                 0.5
 LINE                                   390        139 475          1.3                  1.1                   0.1                 0.1
 SINE                                   15         3314             0.0                  0.0                   0.0                 0.4
Class II elements (DNA transposons)     2025       610 236          5.6                  11.6                  1.3                 10.4
 CACTA                                  1567       529 672          4.9                  10.8                  0.3                 2.8
 Mutator                                31         7443             0.1                  0.1                   0.0                 0.5
 MITE                                   338        50 530           0.5                  0.4                   0.3                 4.0
 LITE                                   89         22 591           0.2                  0.3                   nd                  nd
Unclassified elements                   386        169 499          1.6                  2.8                   nd                  nd
Other known repeats                     254        34 816           0.3                  3.4                   1.3                 1.1
 High-copy-number genes (ribosomal)     12         1571             0.0                  1.1                   0.0                 0.1
 Simple repeats                         158        25 036           0.2                  2.2                   1.2                 1.0
 Tandem repeats                         84         8209             0.1                  0.2                   nd                  nd
Unknown repeats                         3595       1 036 509        9.6                  8.2                   nd                  nd
Total                                   22 427     9 278 207        85.9                 74.0                  66.0                30.7

  1. The number of hits corresponds to the total number of BES showing homology to a given element. The number of bases represents the cumulative number of bases matching a given element. The percentage of the chromosome corresponding to a given repetitive element was estimated by dividing the cumulative length of this element by the total length of the data set. For example, CACTA elements represent 529 672 bp of the total 10 803 302 bp, 4.9% of the data set and therefore 4.9% of the chromosome. The percentage of the wheat D-genome (Aegilops tauschii) was calculated with the data set of Li et al. (2004). The data for the maize and rice genomes originate from Haberer et al. (2005). nd, not determined.
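
The percentages in Table 2 follow directly from the cumulative base counts, as described in the table footnote. The short sketch below reproduces the calculation for a few rows; the dictionary layout and function name are illustrative only.

```python
TOTAL_DATASET_BP = 10_803_302  # total length of the BES data set

# Cumulative bases matching each category (values from Table 2).
CUMULATIVE_BP = {
    "Class I elements": 7_427_147,
    "Class II elements": 610_236,
    "CACTA": 529_672,
    "Unclassified elements": 169_499,
    "Other known repeats": 34_816,
    "Unknown repeats": 1_036_509,
}


def percent_of_dataset(bases, total=TOTAL_DATASET_BP):
    """Cumulative length of an element as a percentage of the data set."""
    return 100 * bases / total


known_bp = sum(
    CUMULATIVE_BP[name]
    for name in (
        "Class I elements",
        "Class II elements",
        "Unclassified elements",
        "Other known repeats",
    )
)
all_repeats_bp = known_bp + CUMULATIVE_BP["Unknown repeats"]

print(f"CACTA: {percent_of_dataset(CUMULATIVE_BP['CACTA']):.1f}%")
print(f"Known repeats: {percent_of_dataset(known_bp):.1f}%")
print(f"All repeats: {percent_of_dataset(all_repeats_bp):.1f}%")
```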

Recently, Sabot et al. (2005) have re-annotated 26 BAC clones (approximately 3.8 Mb) that originated from gene-containing regions of the A- (2.37 Mb), B- (0.58 Mb) and D- (0.83 Mb) genomes of diploid and polyploid wheat. Transposable elements represented 54.7% of the sequences, which is significantly different from the 76.3% and 65.8% of known repeats identified from the random sequences of chromosome 3B and of the D-genome, respectively (Table 2; Li et al., 2004). Similar discrepancies were observed in maize, where 53% of repeats were estimated from the analysis of 117 BAC clones selected for containing a gene, whereas 66% were found through the analysis of 100 random BAC clones (Haberer et al., 2005). This suggests that the distribution of TE varies along the chromosomes, particularly between gene-rich and gene-poor regions, and demonstrates the importance of generating large and randomly distributed sets of sequences to obtain an accurate picture of the composition of large and repetitive genomes.

Previous studies of sequences flanking transposable elements have indicated preferential associations between TE classes and genic or repeated sequences in the genome (Bennetzen, 2000a; Li et al., 2004). Here we have compared, for different TE families, the observed and expected frequencies of association of each element with other sequences on chromosome 3B (Table 3).

Table 3.   Frequency (%) of association between TE families and different fractions of the chromosome. Observed frequencies are expressed for each TE family as the percentage of association between a given TE and another sequence. The value in parentheses corresponds to the number of sequences for a given TE family found in association with another sequence (repeat family or coding). The expected frequency was determined as the percentage of chromosome covered by a given family of transposable elements. It corresponds to the likelihood of random association of a given element and any other element. Gypsy-, athila-, copia-like and LINE are class I elements; CACTA, MITE and LITE are class II elements
Group         Expected frequency   Observed frequencies
                                   gypsy-like (3080)   copia-like (1276)   athila-like (816)   LINE (281)   CACTA (816)   MITE (287)   LITE (76)
gypsy-like    38.7                 20.1*               43.9*               45.6*               9.3*         21.2*         6.6*         6.6*
copia-like    14.0                 12.5                18.8*               12.7                6.8*         6.0*          1.7*         0.0*
athila-like   14.6                 12.6                12.7                13.0                3.6*         5.4*          2.1*         0.0*
LINE          1.3                  1.0                 2.3                 1.2                 3.9*         1.3           1.0          1.3
CACTA         4.9                  5.5                 5.1                 5.3                 3.6          16.7*         25.8*        26.3*
MITE          0.5                  0.6                 0.6                 0.7                 1.8          10.7*         15.3*        13.2*
LITE          0.2                  0.2                 0.0                 0.0                 0.4          3.2*          3.5*         2.6*
Coding        1.2                  1.9                 1.8                 1.1                 2.8          4.0*          9.1*         1.3

  1. *Significant difference at P < 0.001.

For the gypsy-, copia- and athila-like class I LTR retrotransposons, a significant bias was found for their insertion into other gypsy-like elements. In contrast, class I non-LTR retrotransposons (LINEs) and class II DNA transposons (CACTA, MITEs and LITEs) showed significantly less association with class I LTR retrotransposons than with each other and with low copy coding sequences. These data indicate that DNA transposons and MITEs are preferentially inserted into each other and that, as previously reported in wheat (Wicker et al., 2003), barley, maize, rice and sorghum (Bennetzen, 2000a; Ramakrishna et al., 2002), they are found mainly in the vicinity of genes. Moreover, the favoured association of class II TE with other class II, rather than with class I elements suggests that DNA transposons tend to accumulate in gene-rich regions, whereas LTR retrotransposons are distributed more evenly throughout the genome. This hypothesis is supported by the analysis of BAC sequences originating from gene-containing regions (Sabot et al., 2005) that identified 66.3% of class I and 31.7% of class II elements, whereas 72.1% of class I and 9% of class II elements were found in the random data generated on 3B (Table 2). These results suggest that the lower copy number of class II DNA transposons, compared with the class I TE, is a consequence of their preferential insertion into gene-rich regions where greater selection pressure against unrestricted transposition occurs (Kidwell and Lisch, 2001).
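
The comparison of observed and expected association frequencies can be illustrated with a simple significance test. The statistical test behind Table 3 is not specified in this section, so the one-sample z-test for a proportion used below is an assumption chosen for illustration, and the function name is hypothetical. The two example cells are taken from Table 3.

```python
import math


def proportion_z_test(observed_pct, expected_pct, n):
    """Two-sided z-test of an observed proportion against an expected one.

    Assumption: a normal approximation to the binomial; the test actually
    used for Table 3 is not stated in the text.
    """
    p_obs = observed_pct / 100
    p_exp = expected_pct / 100
    standard_error = math.sqrt(p_exp * (1 - p_exp) / n)
    z = (p_obs - p_exp) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# MITE sequences (n = 287) associated with CACTA: 25.8% observed, 4.9% expected.
z_mite, p_mite = proportion_z_test(25.8, 4.9, 287)
# gypsy-like sequences (n = 3080) associated with copia-like: 12.5% vs 14.0%.
z_gypsy, p_gypsy = proportion_z_test(12.5, 14.0, 3080)

print(f"MITE/CACTA: z = {z_mite:.1f}, significant at 0.001: {p_mite < 0.001}")
print(f"gypsy/copia: z = {z_gypsy:.1f}, significant at 0.001: {p_gypsy < 0.001}")
```

Under this assumed test, the MITE/CACTA cell departs strongly from expectation whereas the gypsy/copia cell does not reach the 0.001 threshold, in line with the asterisks in Table 3.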

BAC-end sequencing allows the identification of new repeated elements among 10% of unknown repeats present on chromosome 3B

A significant portion of the repeated fraction of plant genomes remains unclassified (so-called ‘unknown repeats’) due to the limited amount of genomic sequences available. Given the size of our sequence sample, we considered that a sequence present in two or more copies corresponds to a putative unknown repeat. Using this criterion, 3595 BES that represent >700 putative novel repeats were identified. This corresponds to 9.6% of putative unknown repeats (Table 2) which, when added to 76.3% of known repeats previously identified, leads to a total repeat fraction of 85.9% for chromosome 3B of hexaploid wheat. Almost one-third of the putative repeats matched an EST sequence in the databases. This indicates that a number of as-yet unknown repetitive elements are expressed, and that automated gene discovery in wheat based on EST homologies needs to be conducted with caution as it may result in misidentification of novel genes.

BES are too short to reveal the full-length structure of most of the new elements, but they can be informative for short element discovery. Among the newly identified repeats, a highly conserved 500-bp element was found in more than 120 copies (0.6% of the data set), suggesting the presence of approximately 12 000 copies of this sequence on chromosome 3B. As the FatI restriction site is the most abundant in the 500-bp sequence, this element was named Fat. Analysis of the BES and three wheat BAC sequences (GenBank accession numbers AY188332, AY772735 and DQ157839) that contain a complete Fat sequence revealed that this element is clustered in direct tandem repeats separated by a consensus 9-bp spacer sequence (CYGRTTTDB where Y = C or T; R = A or G; D = A, T or G; B = C, T or G). Each unit of the Fat element displays two perfect direct repeats (GAGAAGCT) at both ends as well as two putative, autonomously replicating consensus sequences that determine the replication origin in yeast (Huang and Kowalski, 1996; Figure 1a), suggesting that Fat element amplification could occur through a rolling circle-like replication. Fluorescence in situ hybridization (FISH) with a Fat probe showed the presence of Fat sequences on all wheat chromosomes with different hybridization intensities (Figure 1b). To analyse the chromosome- or genome-specificity of this new repeat, the probe was co-hybridized with the B-genome specific probe pSc119.2 (Bedbrook et al., 1980) and with the D-genome-specific clone pAs1 (Rayburn and Gill, 1986). Double FISH experiments on diploid, tetraploid (data not shown) and hexaploid wheat species showed that A- and B-genomes are labelled poorly compared with the D-genome chromosomes, where the highest signals were observed on chromosome 1D, 4D, 6D and 7D (Figure 1b). Moreover, higher signal intensities were detected in the proximal regions compared with distal regions, where a Fat hybridization signal was lacking on many chromosomes (Figure 1b). 
These results suggest that Fat is present at high copy number in the D-genome chromosomes, with a preferential insertion in the proximal regions. This pattern differs from most of the previously characterized tandemly repeated DNA sequences of wheat, such as pSc119.2, pAs1, pSc200, pSc250 (Vershinin et al., 1995), Spelt1 and Spelt52 (Salina et al., 2004), which are located predominantly in subtelomeric or distal chromosomal regions.
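
The degenerate 9-bp spacer consensus and the copy-number extrapolation described above can be expressed compactly. The sketch below translates the IUPAC codes defined in the text into a regular expression and rescales the 0.6% data-set fraction to the whole chromosome; the helper names are hypothetical, and the 1 Gb chromosome size is the approximate value used in the text.

```python
import re

# IUPAC codes used in the spacer consensus, as defined in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",   # C or T
    "R": "[AG]",   # A or G
    "D": "[AGT]",  # A, T or G
    "B": "[CGT]",  # C, T or G
}


def iupac_to_regex(consensus):
    """Compile a degenerate consensus into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def estimate_copies(fraction_of_sequence, region_size_bp, element_size_bp):
    """Extrapolate a copy number from the fraction of sequence occupied."""
    return fraction_of_sequence * region_size_bp / element_size_bp


SPACER = iupac_to_regex("CYGRTTTDB")
print(bool(SPACER.fullmatch("CTGATTTAC")))   # compatible with the consensus
print(bool(SPACER.fullmatch("CTGATTTCC")))   # C is not allowed at position D
print(round(estimate_copies(0.006, 1_000_000_000, 500)))
```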

Figure 1.

 Structure and chromosomal distribution of the new tandem repeat element Fat.
(a) Consensus sequence of the Fat element. Direct repeats are in bold; putative autonomously replicated consensus sequences are underlined.
(b) Double FISH experiment with the Cy3-labelled (pink) D-genome-specific probe pAs1 and the fluorescein-labelled Fat (green) probe on chromosome preparations of three Triticum aestivum varieties: Elite, Erythrospermum 59 (Eryth.), L2166. Chromosomes are arrayed by genomes.

To study the evolutionary origin of Fat, FISH experiments were also performed on rye and barley chromosomes. Weak hybridization signals similar to those observed on the A- and B-chromosomes were detected on the R-chromosomes of rye, whereas no hybridization signal was detected on the H-chromosomes of cultivated barley (data not shown). These results suggest that Fat elements have been amplified in wheat and rye after the divergence from barley 6–10 million years ago. Interestingly, similar hybridization patterns were observed for the D-genomes of A. tauschii and T. aestivum, suggesting that Fat element amplification occurred specifically in the ancestral D-genome after divergence between the diploid genomes 2.5–4.5 million years ago, and before hexaploid wheat formation 8000 years ago (Huang et al., 2002). These results show that a number of repetitive elements remain to be discovered in the wheat genome, and demonstrate the importance of generating large sets of random sequences across wheat chromosomes for better characterization and annotation of the wheat genome.

6000 genes are present on chromosome 3B of hexaploid wheat

After masking the known and putatively new repetitive sequences, the remaining fraction of BES was used to estimate the gene content of chromosome 3B. Using tblastx and blastn algorithms, the sequences were compared successively with wheat EST from GenoPlante-Info (Samson et al., 2003) and GenBank, and with a local database containing Arabidopsis thaliana, oat, soybean, barley, rice, rye, sorghum and maize ESTs. We found 599 unique sequences showing similarity (e-value 10⁻³⁵) to an EST with a cumulative match length of 130 226 bp, representing about 1.2% of the data set. Thus, considering the size of chromosome 3B (1 Gb), the estimated transcribed fraction should represent 12 Mb. Preliminary analyses of wheat BAC sequences have indicated that the average wheat coding sequence size is about 2 kb (http://www.tigr.org/tdb/e2k1/tae1/info.shtml) and thus, with 1.2% of the sequence length, our results suggest the presence of 6000 genes on chromosome 3B. Interestingly, this corresponds roughly to the number of genes (6756) predicted in the first analysis of rice chromosome 1 (Sasaki et al., 2002), which has been revised recently to 4856 (International Rice Genome Sequencing Project, 2005). These data suggest that the overall gene content is conserved between wheat chromosome 3B and rice chromosome 1, supporting previous comparative analysis with ESTs indicating that wheat group 3 and rice 1 chromosomes show the best conservation in gene order and content despite >50 million years of independent evolution (Sorrells et al., 2003).
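
The gene-number estimate above is a two-step extrapolation: the coding fraction of the sample is scaled to the chromosome size, then divided by the average coding sequence size. A minimal sketch of this arithmetic follows; the function name and defaults are illustrative.

```python
def estimate_gene_number(coding_fraction, region_size_bp, mean_gene_size_bp=2_000):
    """Return (transcribed length in bp, estimated number of genes)."""
    transcribed_bp = coding_fraction * region_size_bp
    return transcribed_bp, transcribed_bp / mean_gene_size_bp


# Coding fraction measured in the BES sample (values from the text).
measured_fraction = 130_226 / 10_803_302
print(f"coding fraction: {100 * measured_fraction:.1f}%")

# Rounded fraction (1.2%) applied to chromosome 3B (about 1 Gb).
transcribed_3b, genes_3b = estimate_gene_number(0.012, 1_000_000_000)
print(f"transcribed: {transcribed_3b / 1e6:.0f} Mb, genes: {genes_3b:.0f}")
```

The same function can be applied to any region size. The result is only as reliable as the assumed 2-kb average coding sequence size and the representativeness of the sample.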

Comparisons between homoeologous B- and D- genome composition provide new insights into wheat genome evolution

Here, we have generated data that are representative of a single wheat chromosome – a key question is whether they also reflect the composition of the entire genome. In other words, is chromosome 3B representative of the B-genome of hexaploid wheat? In recent polyploids, such as bread wheat, homoeologous genomes retain specific features of their ancestral genomes that distinguish them from each other, in particular for the repetitive fraction that represents >85% of the genome sequence (Feldman and Levy, 2005; Lagudah and Appels, 1992). Thus, it is likely that sequence composition is generally more similar between all the chromosomes that belong to the same genomes (A- versus B- and D-chromosomes) than between homoeologous chromosomes (3A versus 3B and 3D chromosomes). Therefore, if a chromosome can provide a good representation of the sequence composition of its diploid genome, chromosome 3B should be representative of the B-genome of hexaploid wheat. In rice, the percentage of repeated sequences for each of the 12 pseudomolecules ranged from 22.94 for chromosome 3 to 32.06 for chromosome 4, while the average repeat content for the whole genome is 28.95% (R. Buell and S. Ouyang, personal communication), indicating that each chromosome reflects well the overall genome repeat content. Assuming that this is a general trend in plant genomes, we have used the information from chromosome 3B to deduce general features of the B-genome of hexaploid wheat.

We have estimated that 1.2% of the sequence corresponds to non-TE-related coding sequences. Thus, with a size of 6 Gb for the B-genome, the transcribed portion should represent 72 Mb and correspond to about 36 000 genes. This number is comparable with the 37 544 genes recently reported for rice (International Rice Genome Sequencing Project, 2005) and is slightly lower than in maize (42 000 to 56 000 genes), where a whole-genome duplication about 5 million years ago has contributed to increase the gene number and the genome size (Haberer et al., 2005). This suggests that the size difference between the wheat and rice genomes is mainly due to the expansion of repeat families in the wheat lineage, and not to massive gene amplification. Together, these data reinforce the idea that grass genomes contain a similar number of genes (Bennetzen, 2000b; Bennetzen et al., 2004). However, they contrast dramatically with a recent study that suggested the presence of 98 000 genes per sub-genome in hexaploid wheat as a result of massive amplification of pseudogenes after polyploidization (Rabinowicz et al., 2005). Such a high number of genes is not consistent with hybridization studies using ESTs, which have detected a mean of 2.8 loci per EST and have shown that duplications concerned <25% of the gene loci, indicating no massive gene amplification during the evolution of the wheat genome (Akhunov et al., 2003; Qi et al., 2003). To exclude the possibility that this discrepancy originates from the calculation methods, we have used our annotation procedure to estimate the gene number in hexaploid wheat from the 1597 whole shotgun sequences generated by Rabinowicz et al. (2005). A comparable number of approximately 320 000 genes for the whole genome (107 000 genes per subgenome) was found, indicating that differences probably originate from the data set used in the different studies, and that additional random sets of wheat sequences are needed to address this question.

Li et al. (2004) have recently analysed approximately 3 Mb of random sequence that represent 0.075% of the D-genome donor of hexaploid wheat, A. tauschii. We have re-analysed this data set using the same method as for the BES from 3B that is based on the cumulative length of sequences matching repetitive elements, and have compared the data from the D- and B-genomes. About 66% of the sequence length from the D-genome matched known repeat families (Table 2), which is significantly less than the 76.3% observed for the B-genome. The smaller genome size of A. tauschii (4 Gb) correlates well with its lower content in TE compared with the B-genome of T. aestivum (6 Gb), and is consistent with the idea that differences in genome size are related to differences in TE content (Bennetzen et al., 2005; Fedoroff, 2000). Interestingly, the portion of the genome not related to repetitive sequence was similar for the B- (0.85 Gb; 14% of 6 Gb) and D- (1 Gb; 26% of 4 Gb) genomes. This suggests that both have evolved from an approximately 1 Gb-ancestral genome, and that differential expansion has occurred through differential transposable element amplification and deletion during the evolution of the B- and D-genomes. A detailed comparison of the relative abundance of TE families in the A. tauschii D-genome and hexaploid wheat B-genome sequences supports this hypothesis and suggests which TE families have been involved in the differentiation between the two genomes (Figure 2).
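
The size of the non-repetitive portion of each genome follows from the genome size and the total repeat percentage in Table 2. The sketch below reproduces this calculation; the data structure and function name are illustrative only.

```python
# Genome sizes from the text and total repeat content from Table 2.
GENOMES = {
    "B": {"size_gb": 6.0, "repeat_pct": 85.9},
    "D": {"size_gb": 4.0, "repeat_pct": 74.0},
}


def non_repetitive_gb(size_gb, repeat_pct):
    """Genome portion (in Gb) not assigned to repetitive sequence."""
    return size_gb * (100 - repeat_pct) / 100


non_repetitive = {
    name: non_repetitive_gb(g["size_gb"], g["repeat_pct"])
    for name, g in GENOMES.items()
}
for name, size in non_repetitive.items():
    print(f"{name}-genome non-repetitive portion: {size:.2f} Gb")
```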

Figure 2.

 Relative abundance (expressed in percentage of the genome) of the main transposable element families in the B- and D-genomes of wheat. Gypsy-, athila-, copia-like and LINE are class I elements; CACTA, MITE and LITE are class II elements.

A more than fourfold difference was found between class I (48.0%) and class II elements (11.6%) in the D-genome, whereas there are 12 times more class I than class II elements (68.7% versus 5.6%) in the B-genome. This is mainly due to a higher number of gypsy- and athila-like LTR retrotransposons in the B-genome compared with the D-genome (Figure 2), suggesting that these families are responsible primarily for the size difference between the B- and D-genomes. Among these families, the gypsy-like Fatima and the athila-like Sabrina elements seem particularly to have contributed to the expansion of the B-genome (Table S1).

Genome specificities of TE families were also reflected by the discrepancy in the distribution of TE insertion observed between the two genomes. While in the D-genome copia- and gypsy-like LTR retrotransposons were found preferentially associated with CACTA elements (Li et al., 2004), gypsy-like elements were the preferred target of these two elements in the B-genome (Table 3).

The relative abundance of class I and class II TE observed here in wheat was different from maize and rice, where ratios of 63.4% versus 1.3% and 18.8% versus 10.4%, respectively, were found (Haberer et al., 2005) (Table 2). The correlation between the low amount of class II elements and the activity of these elements in the maize genome is intriguing when compared with the high amount of transposons with no detectable activity, under normal conditions, in wheat and rice. It is possible that the differences in the amount of class II elements between maize and the other genomes correlate with strong selection pressure against the amplification of active elements that might have a deleterious effect if too numerous. Non-LTR retroelements (LINEs and SINEs) also displayed genome-specific differences. They were two to 12 times more abundant in wheat (1.3% of the B-genome; 1.1% of the D-genome) than in maize (0.11%) and rice (0.5%).

BES are a valuable source of chromosome-specific markers for repetitive genomes

BES were analysed as a source of new molecular markers for genetic, cytogenetic and physical mapping in wheat. The complete set of sequences was first searched for simple sequence repeats (SSRs) that potentially could be converted into new markers. We identified 1942 perfect repeats of one to four nucleotide motifs, corresponding to a mean density of one SSR per 5.6 kb. Of these, 166 (8.5%) harboured mononucleotide motifs, whereas 324 (16.7%), 943 (48.6%) and 515 (26.5%) had dinucleotide, trinucleotide and tetranucleotide motifs, respectively. The most abundant motif (14.8%) was the trinucleotide AAG. Interestingly, previous FISH experiments with SSR had shown that the AAG microsatellite is highly represented on the seven chromosomes of the B-genome of hexaploid wheat (Cuadrado et al., 2000), supporting the assumption that the sequence data from chromosome 3B are representative of the B-genome. From the 1942 SSRs detected, 176 (41 with mononucleotide; 79 with dinucleotide; 33 with trinucleotide; 23 with tetranucleotide motifs), which originated mainly from low copy sequences, were selected for primer design. More than 80% of the primer pairs gave an amplification product on at least one of the five hexaploid wheat varieties that have been used as parents in different mapping populations (Chinese Spring, Courtot, Renan, Synthetic W7984 and Opata 85), and 45% of the products exhibited polymorphism between at least two varieties (data not shown). About one-third (36%) of the SSRs were mapped to a single locus in one of the 3B deletion bins (Endo and Gill, 1996) without ambiguity. These data indicate that BAC-end sequences represent a potential source of new SSR markers for mapping in wheat. 
However, BES-derived SSRs are less polymorphic than microsatellites originating from enriched genomic libraries (reviewed by Gupta and Varshney, 2000) or EST databases (Varshney et al., 2005), which are usually longer (18 and seven repeats for dinucleotides and trinucleotides, respectively) than those derived from BES (10.7 and 4.5 repeats for dinucleotides and trinucleotides, respectively).

Transposable elements are known to be nested in large plant genomes, where they display unique insertion sites that are highly polymorphic between varieties. This feature has been exploited to develop PCR-based marker systems for genetic analysis in a range of cereal grass and grain legume species (for review see Schulman et al., 2004). The retrotransposon-based insertion polymorphism technique (RBIP; Flavell et al., 1998) uses primers flanking retrotransposon insertions to score for the presence or absence of insertions at individual sites. Recently, Devos et al. (2005) described a method derived from RBIP that exploits knowledge of the sequence flanking a transposable element, and uses one primer designed in the TE and the other in the flanking DNA sequence. In that work, the authors exploited the method to identify the chromosomal location of BAC clones in hexaploid wheat, and suggested its potential use for genetic mapping, provided that amplification is not restricted to the cultivar from which the BACs originate. Here, we have systematically analysed the 20 000 BES for the presence of junctions between TEs and other sequences. About 3000 junctions were identified, and 58 primer pairs (cfp) covering all kinds of junction (26 ‘repeat/repeat’; 11 ‘repeat/coding sequence’; 21 ‘repeat/low-copy sequence’; Table S2) were designed to assess the potential of ‘insertion site-based polymorphism’ (ISBP) for mapping in wheat. PCR amplification was performed on genomic DNA from a 3B nullisomic–tetrasomic line (Sears, 1954); two 3B ditelosomic lines (Sears and Sears, 1978); six 3B deletion lines (3BS8, 3BS9, 3BS1, 3BL2, 3BL10, 3BL7; Endo and Gill, 1996); and five parents of mapping populations (Chinese Spring, Courtot, Renan, Synthetic W7984, Opata 85). PCR amplification was obtained for 50 junctions, and 39 ISBP markers (67.2%) were assigned to 3B deletion bins (Figure 3a), demonstrating that the amplification product was unique and specific.
No bias in the type of junction that provided positive results was observed. Among the 39 assigned markers, 21 (53.8%) were polymorphic between the different wheat varieties tested (Figure 3a). In 19 cases, the polymorphism corresponded to null alleles with an amplification product in at least two of the varieties tested, whereas size polymorphism was observed for only two markers.

Figure 3. Cytogenetic and genetic mapping of ISBP markers in wheat.
(a) Deletion bin assignment and polymorphism testing for the cfp26 marker. M, DNA ladder; 1, Control (H2O); 2, nullisomic–tetrasomic N3B-T3A line; 3, 3BS ditelosomic line; 4, 3BL ditelosomic line; 5, 3BS8 deletion line; 6, 3BS9 deletion line; 7, 3BS1 deletion line; 8, 3BL2 deletion line; 9, 3BL10 deletion line; 10, 3BL7 deletion line; 11, Chinese Spring; 12, Courtot; 13, Renan; 14, Synthetic W7984; 15, Opata 85.
(b) Segregation analysis in 61 individuals of the ITMI population for the cfp25 marker. S, Synthetic W7984; O, Opata 85.

Four ISBP markers (cfp7, cfp16, cfp25, cfp26), which were polymorphic (either null allele or size polymorphism) between Opata 85 and Synthetic W7984, were selected and subsequently used for genetic mapping in 93 recombinant inbred lines derived from the cross between these two varieties (generally referred to as the International Triticeae Mapping Initiative, ITMI population; Figure 3b). Linkage analysis showed that all ISBP markers map to chromosome 3B at a position corresponding to the deletion bin identified in the previous cytogenetic mapping experiment (data not shown). This shows that the amplification products are unique, genome-specific and allelic. Moreover, it demonstrates that genetic mapping with ISBP markers is not limited to populations that originate from a cross with the variety that was used for the BAC library construction. For this reason, ISBP markers also represent potential tools for phylogenetic studies and for studies of transposable element evolution in wheat.

Considering the number of junctions (approximately 3000) identified in the 20 000 BES and the results obtained with the first 58 ISBP markers, we estimate that approximately 2000 new ISBP markers could be generated for cytogenetic mapping in deletion bins, and approximately 1000 (5% of the original BES set) for genetic mapping on chromosome 3B. Currently, we are adapting the ISBP technique to capillary electrophoresis to increase the level of polymorphism detection and the efficiency of the analysis. The genetic redundancy of hexaploid wheat makes it difficult to follow specific loci in genetic mapping and breeding programmes and, so far, genome-specific PCR-based markers have been designed essentially from genic regions that were believed to be more suitable for primer design than the repetitive fraction of the genome (Blake et al., 2004). One drawback of this approach is that it limits genetic mapping to genic regions, which are not evenly distributed along the chromosomes in wheat but are organized in gene islands (Sandhu and Gill, 2002). Here we demonstrate that the 10% of the BES that contain junctions between TEs can be used very efficiently to develop new PCR-based chromosome-specific markers that are evenly distributed along the wheat chromosomes. This strategy offers new possibilities for the development of molecular markers in wheat as well as in other polyploid species that face similar problems of genetic redundancy.
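The projection of roughly 2000 bin-assignable and 1000 genetically mappable ISBP markers follows from scaling the junction count by the success rates of the 58-marker pilot. A minimal Python sketch of that arithmetic is given here; the variable names are ours, and the rounding to 2000 and 1000 is that of the text.

```python
# Counts reported in the text for the ISBP pilot experiment.
junctions_found = 3000       # junctions detected in the ~20 000 BES
primer_pairs_tested = 58     # ISBP primer pairs designed
assigned_to_bins = 39        # markers assigned to 3B deletion bins
polymorphic = 21             # assigned markers polymorphic between varieties
total_bes = 20000

assignment_rate = assigned_to_bins / primer_pairs_tested
polymorphism_rate = polymorphic / assigned_to_bins

# Scale the pilot rates up to all detected junctions.
expected_bin_markers = junctions_found * assignment_rate
expected_genetic_markers = expected_bin_markers * polymorphism_rate

print(f"assignment rate:   {assignment_rate:.1%}")
print(f"polymorphism rate: {polymorphism_rate:.1%}")
print(f"markers for deletion-bin mapping: ~{expected_bin_markers:.0f}")
print(f"markers for genetic mapping:      ~{expected_genetic_markers:.0f} "
      f"({expected_genetic_markers / total_bes:.1%} of BES)")
```

The unrounded values are slightly above the conservative figures quoted in the text.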

To assess the usefulness of the ISBP markers for the construction of physical maps, two markers (cfp7 and cfp8) originating from BAC clones that were located at the ends of different physical contigs on the physical map of chromosome 3B were used to screen the 3B BAC library. Cfp7 identified 10 clones, including its BAC of origin, which all belong to a single contig, thus confirming the accuracy of the FPC assembly. Cfp8 identified its BAC of origin (3B_005_I15) and four other BACs. One belonged to the same FPC contig as 3B_005_I15, whereas all three others were found in another contig. These data allowed us to confirm the results of a low-stringency FPC analysis and merge the two contigs with confidence, demonstrating the value of ISBP markers for contig assembly and physical mapping in wheat.
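The logic of using a marker to link contigs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contig labels and all BAC names other than 3B_005_I15 are hypothetical placeholders, and the function is ours, not part of FPC.

```python
def contigs_per_marker(marker_hits, bac_to_contig):
    """Map each marker to the sorted list of contigs holding its positive BACs."""
    return {
        marker: sorted({bac_to_contig[bac] for bac in bacs})
        for marker, bacs in marker_hits.items()
    }

# Hypothetical data mirroring the cfp8 result: the BAC of origin and one other
# clone lie in one contig, three further positive clones lie in a second contig.
bac_to_contig = {
    "3B_005_I15": "ctg_A",   # clone name from the text; contig label is made up
    "bac_2": "ctg_A",
    "bac_3": "ctg_B",
    "bac_4": "ctg_B",
    "bac_5": "ctg_B",
}
marker_hits = {"cfp8": ["3B_005_I15", "bac_2", "bac_3", "bac_4", "bac_5"]}

links = contigs_per_marker(marker_hits, bac_to_contig)
# A marker whose positive clones span more than one contig suggests a merge.
merge_candidates = {m: c for m, c in links.items() if len(c) > 1}
print(merge_candidates)
```

A marker that hits a single contig, as cfp7 did, supports the existing assembly rather than proposing a merge.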

In this work, we have demonstrated the potential of developing BAC-end sequencing programs on single chromosomes for analysing the composition and evolution of homoeologous genomes, as well as for efficient production of chromosome-specific markers for genetic and physical mapping in polyploid and repetitive genomes. Future BAC-end sequencing from sorted chromosomes of the A- and D-genomes of hexaploid wheat (J. Dolezel, personal communication), and from the recently constructed BAC libraries of T. urartu and A. speltoides (Akhunov et al., 2005), will enable further large-scale comparative analysis in order to obtain an overall inventory of the repeat and gene content of the homoeologous A-, B- and D-genomes of hexaploid wheat. When combined with the results of the ongoing US National Science Foundation-funded project for large random BAC sequencing from hexaploid wheat (http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0501814), which will provide information on the structural features of genes and intergenic regions and distribution of repeated elements, we should gain better insight into wheat genome organization and evolution, and accelerate the development of tools for establishing a physical map and sequencing the bread wheat genome.

Experimental procedures

BAC-end sequencing

Bacterial clones were cultivated in 96-well plates overnight in 1 ml chloramphenicol-supplemented 2YT medium. Preparation of BAC DNA for end sequencing was done using standard alkaline lysis miniprep techniques. Sequencing reactions were set up according to the manufacturer's instructions for Big Dye Terminator chemistry (Applied Biosystems, Foster City, CA, USA). Reactions were performed using M13 forward (5′TGTAAAACGACGGCCAGT3′) and M13 reverse (5′CAGGAAACAGCTATGACC3′) universal primers. Samples were loaded on a 3730xl DNA Analyser (Applied Biosystems). Sequence trimming was conducted by processing the traces using the base-calling software phred (Ewing and Green, 1998; Ewing et al., 1998). Reads with a Q20 length of 75 bases in a 100-base consecutive window were retained for further analysis. Clone tracking and sequence data processing were performed as described by Artiguenave et al. (2000).
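One possible reading of the read-retention criterion is that a read is kept if some window of 100 consecutive bases contains at least 75 bases with a phred quality of 20 or more. The Python sketch below implements that reading with a sliding count; the function name and parameters are ours, and the exact rule applied in the original pipeline may differ.

```python
def passes_q20_window(quals, window=100, min_q20=75, threshold=20):
    """Return True if any `window` consecutive bases include at least
    `min_q20` bases whose quality is >= `threshold`."""
    if len(quals) < window:
        return False
    flags = [1 if q >= threshold else 0 for q in quals]
    count = sum(flags[:window])          # Q20 bases in the first window
    if count >= min_q20:
        return True
    for i in range(window, len(flags)):
        # Slide the window one base to the right.
        count += flags[i] - flags[i - window]
        if count >= min_q20:
            return True
    return False

# A read with a poor start followed by 100 good bases is retained.
print(passes_q20_window([10] * 100 + [30] * 100))
```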

Semi-automated annotation of the sequences

Databases.  Three repeat databases were used to analyse the BAC-end sequences: TREPtotal (http://wheat.pw.usda.gov/ITMI/Repeats/index.shtml); RepBase (Jurka et al., 2005); and TIGR Plant Repeat Databases (http://www.tigr.org/tdb/e2k1/plant.repeats). EST sequences were retrieved from various sequence databases: 104 454 wheat EST contigs from GenoPlante-Info (Samson et al., 2003); 625 255 wheat ESTs and 3977 A. tauschii unfiltered genomic shotgun library sequences from GenBank; as well as 85 652 A. thaliana, 5388 Avena sativa, 86 729 Glycine max, 84 487 Hordeum vulgare, 116 515 Oryza sativa, 9196 Secale cereale, 41 845 Sorghum bicolor and 86 727 Zea mays ESTs from PlantGDB (Dong et al., 2005).

Repeat content analysis.  BES were screened for repeat sequences using repeatmasker (A. Smit, R. Hubley and P. Green, unpublished data; http://www.repeatmasker.org) with default parameters. Four successive searches were performed: first without a custom library, then against each of the three repeat databases described above (TREPtotal, RepBase, TIGR Plant Repeat Databases). A final tblastx search (e-value 1e−05) (Altschul et al., 1997) against the same databases was performed. Sequences matching known repeats were masked as N. Putative unknown repeats were searched for by aligning (blastn, 1e−05) the repeat-masked BES against each other. Sequences displaying a 50-bp alignment with 80% identity were selected and the match lengths were determined. The percentage of the genome represented by the different elements was estimated by dividing the cumulative length of sequence showing homology to an element by the size of the BES set. For example, 1567 BES showing homology to CACTA elements, with a cumulative match length of 529 672 bp, represent 4.9% of the 10.8-Mb sequence set. Thus, assuming that the BES data set is representative of the B-genome, the CACTA fraction was estimated to represent 4.9% of this genome.
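The CACTA example can be checked with a short Python sketch. The `genome_fraction` function reproduces the division described above. The `merged_length` helper is an assumption on our part: it shows one way to accumulate match lengths without counting overlapping hits twice, which the text does not describe.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def genome_fraction(cumulative_match_bp, dataset_bp):
    """Fraction of the data set (and, by assumption, of the genome)
    represented by an element family."""
    return cumulative_match_bp / dataset_bp

# Figures from the text: 529 672 bp of CACTA matches in a 10.8-Mb BES set.
cacta = genome_fraction(529_672, 10_800_000)
print(f"CACTA fraction: {cacta:.1%}")
```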

Gene content analysis.  The repeat-masked sequences were subjected to a batch homology search using tblastx versus GenoPlante-Info wheat EST contigs (e-value 1e−35) and blastn versus GenBank wheat ESTs and PlantGDB ESTs (e-value 1e−35). Cumulative non-TE-coding match lengths were used to calculate the coding fraction, as described for repetitive elements.

FISH analysis

Six varieties of hexaploid wheat (five Triticum aestivum L., Renan, Tumenskaya 80, Erythrospermum 59, Elite, #2166; and one Triticum spelta L., k-45368); one triticale variety (Central); three tetraploid wheat accessions [Triticum dicoccoides (Korn. ex Aschers. et Graebn.) Schweinf. (PI471750 and TTD-20) and Triticum aethiopicum Jakubz. (k-19217)]; six diploid wheat accessions (Triticum urartu Thum. ex Gandil 27019, Triticum boeoticum Boiss. 23733, Triticum monococcum L. ‘from Pays de Sault’, Aegilops tauschii Coss. TQ27, Aegilops speltoides #37 from the National Institute for Agronomic Research of Rennes (France) and Aegilops sharonensis TH01); as well as one Secale cereale L. (Dan nove) and one Hordeum vulgare L. (#25160) accession, were used for FISH analysis.

The Fat element probe was synthesized by PCR from the 3B_050_N05 BAC clone. Two primers (5′GGGGAGCTTCTCACAACAAGC3′ and 5′TATTTACCACGGCATGTCGGG3′) were designed from the 3B_050_N05_FM1 sequence. A 460-bp fragment was obtained after standard PCR amplification at 60°C for 30 cycles, and subsequently labelled by PCR with fluorescein-12-dUTP. Chromosome preparations and FISH were carried out as described by Badaeva et al. (1996). Detection was performed with rabbit anti-FITC and fluorescein-conjugated swine anti-rabbit Igs (DAKO, Glostrup, Denmark). The pSc119.2 (Bedbrook et al., 1980) and pAs1 (Rayburn and Gill, 1986) probes were labelled with biotin-16-dUTP according to the manufacturer's protocol (Roche Diagnostics, Mannheim, Germany) and detected with streptavidin-Cy3 (Amersham Biosciences, Piscataway, NJ, USA). The slides were examined on a Zeiss Axiom II Imaging microscope (Carl Zeiss, Oberkochen, Germany). Images were processed using Adobe Photoshop software.

Development of molecular markers from the BES

SSR markers.  BES were screened for the presence of perfect SSRs using the SSR search program (ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl). SSR locations within the sequences were extracted from the SSR search output files. Sequences containing SSRs were converted to GCG format and primer design was performed using the GCG program prime (Genetics Computer Group, Accelrys, Madison, WI, USA) with default settings. PCR reactions using the M13 protocol (Boutin-Ganache et al., 2001) were carried out in a final volume of 6.5 μl with 200 μM of each dNTP, 500 nM M13 primer, 50 nM of the forward M13-tailed primer, 500 nM of the reverse primer, 0.2 U Taq polymerase (Qiagen, Valencia, CA, USA) with 1× its appropriate buffer and 25 ng template DNA. PCR amplification was conducted using the following touchdown procedure: seven cycles (30 sec 95°C, 30 sec 62°C minus 1°C each cycle, 30 sec 72°C), 20 cycles (30 sec 95°C, 30 sec 55°C, 30 sec 72°C) and eight additional cycles (30 sec 95°C, 30 sec 56°C, 30 sec 72°C). Amplification products were visualized using an ABI PRISM 3100 Genetic Analyser (Applied Biosystems). Fragment sizes were calculated using genescan and genotyper software (Applied Biosystems), where different alleles are represented by different amplification sizes for tandem repeats. Two alleles are considered identical when they show the same fragment size. All cft primer sequences are available on the Graingenes database (http://wheat.pw.usda.gov/GG2/index.shtml).
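For illustration, a search for perfect SSRs with motifs of one to four nucleotides can be sketched in Python with regular expressions. This is not the ssr.pl script used here. The minimum repeat counts in `MIN_REPEATS` are hypothetical placeholders, because the thresholds applied in the original analysis are not stated in the text.

```python
import re

# Hypothetical minimum repeat counts per motif length (not from the text).
MIN_REPEATS = {1: 10, 2: 6, 3: 4, 4: 3}

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit
    (so 'ATAT' is rejected as a tetranucleotide motif)."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )

def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in min_repeats.items():
        # A k-mer followed by at least (min_rep - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

example = "CCGT" + "AAG" * 5 + "CTC" + "AT" * 7 + "GG"
print(find_perfect_ssrs(example))
```

With such hits in hand, the mean density reported in the Results is simply the total sequence length divided by the number of SSRs found.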

ISBP markers.  BES containing a junction between two different sequences were identified from the repeatmasker analysis. Sequences were converted to the GCG format, and primer design was performed using the GCG program prime with default settings. PCR reactions were carried out in a 5 μl final volume with 400 μM of each dNTP, 500 nM of each primer, 0.25 U Taq polymerase (Qiagen) with 1× its appropriate buffer, 1× Q-solution (Qiagen) and 20 ng template DNA. PCR amplification was conducted using the following touchdown procedure: seven cycles (30 sec 95°C, 30 sec 62°C minus 1°C each cycle, 30 sec 72°C), 31 cycles (30 sec 95°C, 30 sec 55°C, 30 sec 72°C) and 11 additional cycles (30 sec 95°C, 30 sec 56°C, 30 sec 72°C). Amplification products were visualized on 4% agarose gel.
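The identification of junctions from repeat annotation can be sketched as follows. This is a simplified illustration, not the perl scripts used in this study. It assumes that each read's annotation has already been reduced to (start, end, element) tuples, and the element names and the `min_flank` requirement (enough sequence on each side to place a primer) are hypothetical.

```python
def segment_read(seq_len, annotations):
    """Tile a read into labelled segments, filling gaps with 'unannotated'."""
    segments = []
    pos = 0
    for start, end, name in sorted(annotations):
        start = max(start, pos)          # clip overlap with the previous hit
        if start >= end:
            continue
        if start > pos:
            segments.append((pos, start, "unannotated"))
        segments.append((start, end, name))
        pos = end
    if pos < seq_len:
        segments.append((pos, seq_len, "unannotated"))
    return segments

def find_junctions(seq_len, annotations, min_flank=50):
    """Return (position, left_label, right_label) for each boundary between
    differently labelled segments that are both at least `min_flank` long."""
    segs = segment_read(seq_len, annotations)
    return [
        (left[1], left[2], right[2])
        for left, right in zip(segs, segs[1:])
        if left[2] != right[2]
        and left[1] - left[0] >= min_flank
        and right[1] - right[0] >= min_flank
    ]

# Hypothetical 600-bp read: two different TEs followed by unannotated sequence.
hits = [(0, 250, "TE_A"), (250, 480, "TE_B")]
print(find_junctions(600, hits))
```

One limitation of this sketch is that a few unannotated bases between two elements hide the repeat/repeat junction. A fuller implementation would tolerate small gaps.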

Acknowledgements

The authors would like to thank Delphine Boyer, Bouzid Charef, Cyril Saintenac and Sviatoslav Zoshchuk for their technical support. We also thank the Genoscope (Evry, France) and in particular Corinne Cruaud, Véronique de Berardinis, Carole Dossat and Patrick Wincker for their contribution to BAC-end sequencing, sequence trimming and data processing. We thank Philippe Leroy and Nicolas Guilhot for their support in developing the perl scripts that were used for sequence analysis. We are very grateful to Hélène Berges (CNRGV, Toulouse, France) who gave us access to the CNRGV facilities for rearraying the 3B library. SSR and ISBP development was done on the Genotyping platform of INRA, Clermont-Ferrand, France (http://www.clermont.inra.fr/I_inra_en_auvergne/grands_outils_de_biologie_1). The authors are grateful to Nabila Yahiaoui for critical reading of the manuscript. This work was supported by INRA grants.

GenBank accession numbers: DX363346–DX382744.
