High homologous gene conservation despite extreme autopolyploid redundancy in sugarcane


Author for correspondence:
Angélique D’Hont
Tel: +33 4 67 61 59 27
Email: dhont@cirad.fr


  • Modern sugarcane (Saccharum spp.) is the leading sugar crop and a primary energy crop. It has the highest level of ‘vertical’ redundancy (2n = 12x = 120) of all polyploid plants studied to date. It was produced about a century ago through hybridization between two autopolyploid species, namely S. officinarum and S. spontaneum.
  • In order to investigate the genome dynamics in this highly polyploid context, we sequenced and compared seven hom(oe)ologous haplotypes (bacterial artificial chromosome clones).
  • Our analysis revealed a high level of gene retention and colinearity, as well as high gene structure and sequence conservation, with an average sequence divergence of 4% for exons. Remarkably, all of the hom(oe)ologous genes were predicted as being functional (except for one gene fragment) and showed signs of evolving under purifying selection, with the exception of genes within segmental duplications. By contrast, transposable elements displayed a general absence of colinearity among hom(oe)ologous haplotypes and appeared to have undergone dynamic expansion in Saccharum, compared with sorghum, its close relative in the Andropogoneae tribe.
  • These results reinforce the general trend emerging from recent studies indicating the diverse and nuanced effect of polyploidy on genome dynamics.


Polyploidy (whole genome duplication) is recognized as a major driving force in plant genome evolution. Its footprint is apparent in the genome of all diploid plants that have been fully sequenced (Jaillon et al., 2007; Fawcett et al., 2009; Salse et al., 2009; Schmutz et al., 2010). Analyses of recent polyploids have shown that polyploidy can have profound genetic and epigenetic consequences, including chromosome rearrangements, gene loss, transposable element (TE) activation, DNA methylation and gene expression modifications (Doyle et al., 2008; Jackson & Chen, 2009). Most of these studies have focused on allopolyploids, whereas the impact of autopolyploidy on genome evolution has received very little attention. Autopolyploids are traditionally considered to arise within a single species through the doubling of structurally similar homologous genomes (e.g. AA becoming AAAA), whereas allopolyploids arise via interspecific hybridization (e.g. yielding an AB hybrid) and subsequent doubling of nonhomologous (i.e. homoeologous) genomes (AABB). A major distinction between auto- and allopolyploids is their mode of chromosome inheritance as a result of meiotic pairing. Typically, autopolyploids have more than two sets of homologous chromosomes that randomly pair, resulting in polysomic inheritance. In allopolyploids, homologous chromosomes pair together (and not with their homoeolog), resulting in disomic inheritance (Mather, 1936; De Silva et al., 2005). The very limited available data on the consequences of autopolyploidization concern gene expression and suggest less modification than after allopolyploidization during the first generations following genome doubling (Albertin et al., 2005; Martelotto et al., 2005; Lu et al., 2006; Stupar et al., 2007; Church & Spaulding, 2009; Parisod et al., 2010). However, little is currently known about the structural consequences and longer term evolution of genomes in autopolyploids relative to allopolyploids.

Sugarcane is one of the world’s most efficient crops in converting solar energy into chemical energy, which makes it the leading sugar crop and a primary energy crop. The domesticated sugar-producing species Saccharum officinarum is auto-octoploid (2n = 8x = 80) and has been domesticated from the wild autopolyploid species S. robustum (mainly 2n = 60, 80) (Brandes, 1956; Sreenivasan et al., 1987; Grivet et al., 2006). A century ago, breeders incorporated chromosomes from a wild relative species, that is S. spontaneum, by hybridization and backcrossing. This wild species is also considered to be of autopolyploid origin, with chromosome numbers ranging from 2n = 5x = 40 to 16x = 128, with many aneuploid forms (Panje & Babu, 1960; Sreenivasan et al., 1987). The resulting modern cultivars have a genome size of around 10 Gb and typically have around 120 chromosomes, 70–80% of which are entirely derived from S. officinarum, 10–20% from S. spontaneum and a few from interspecific recombinations (D’Hont et al., 1996; Cuadrado et al., 2004; D’Hont, 2005; Piperidis et al., 2010). The meiosis of modern sugarcane cultivars mainly involves bivalent pairing (Bremer, 1922; Price, 1963; Burner & Legendre, 1994) and chromosome assortment results from general polysomy with some cases of preferential pairing (Grivet et al., 1996; Hoarau et al., 2001; Jannoo et al., 2004).

Taxonomically, sugarcane belongs to Saccharum, which encompasses only polyploid species. Synteny is particularly well conserved between sugarcane and sorghum, its close relative in the Andropogoneae tribe (Grivet et al., 1994; Dufour et al., 1997; Guimaraes et al., 1997; Ming et al., 1998; Le Cunff et al., 2008), from which it diverged 8–10 million yr ago (Mya) (Jannoo et al., 2007). The monoploid genome size of sugarcane, that is 760–930 Mb (D’Hont & Glaszmann, 2001), is close to that of sorghum, that is 730 Mb (Paterson et al., 2009).

The general objective of this study was to investigate genome dynamics, in particular the fate of duplicated genes and TE content, in a highly polyploid context such as sugarcane. Previously, we have analyzed two homoeologous sequences (97 kb and 126 kb) co-existing in the modern sugarcane cultivar R570 (Jannoo et al., 2007). These sequences correspond to the region encompassing the Adh1 gene, which has already been studied in several Poaceae species (Ilic et al., 2003). Our findings revealed perfect colinearity at the gene level between the two sugarcane homoeologous haplotypes, as well as high gene structure conservation. Apart from the insertion of a few retrotransposable elements, high homology was also observed along the nontranscribed regions. This first analysis of sugarcane haplotype organization at the sequence level suggested that the high ploidy in sugarcane did not induce generalized reshaping of its genome, thus challenging the idea that polyploidy quickly induces generalized rearrangement of genomes.

The specific objectives of this study were to verify the high level of functional gene retention and colinearity and the absence of TE colinearity in a second region of the genome of R570 and across a larger number of haplotypes, including homologous haplotypes (as opposed to homoeologous haplotypes). In addition, we refined the analysis by studying the selection pressures under which the redundant genes were evolving. We took advantage of a physical map developed in the context of a map-based cloning approach (Le Cunff et al., 2008). Bacterial artificial chromosome (BAC) clones belonging to seven hom(oe)ologous haplotypes were sequenced, enabling a comparison between homoeologous (from S. spontaneum and S. officinarum) and also between homologous (from S. spontaneum or S. officinarum) haplotypes.

Materials and Methods

BAC clone selection and sequencing

Eight sugarcane BAC clones belonging to seven hom(oe)ologous haplotypes of a region bearing the rust resistance gene Bru1 and three sorghum BAC clones corresponding to the orthologous sorghum region were identified, as described in Le Cunff et al. (2008). The sugarcane BAC clones originated from a R570 sugarcane cultivar BAC library developed by Tomkins et al. (1999) (BAC clones named Sh) and from a library built by Le Cunff et al. (2008) with DNA of selfed progenies of R570 (BAC clones named ShCIR). The sorghum BAC clones were selected from the SB_BBc sorghum BAC library constructed with DNA from the bicolor Btx623 genotype by D. Begum et al., unpublished (Clemson University Genomics Institute, Clemson, USA, http://www.genome.clemson.edu/cgi-bin/orders?&page=productGroup&service=bacrc&productGroup=26). BAC clones were subcloned and sequenced with an average coverage of 14X using the Sanger method and ABI3730 (Applied Biosystems by Life Technologies, Carlsbad, California, USA) sequencers at Genoscope, France (http://www.cns.fr/spip/), and sequences were assembled using Phred/Phrap/consed (http://www.phrap.com/). Sequences were submitted to the EMBL sequence database under the following accession numbers (BAC clone names in parentheses): FN431669 (ShCIR9O20), FN431661 (ShCIR12E03), FN431668 (Sh253G12), FN431666 (Sh142J21), FN431664 (Sh53A11), FN431665 (Sh135P16), FN431663 (Sh15N23), FN431667 (Sh197G04) and FN431662 (SB_BBc24P17c).

Gene annotation

The gene finding software ‘EuGène’ (Schiex et al., 2001) was used to perform automatic prediction of the gene structure on the BAC sequences. The learning dataset of genes used to train ‘EuGène’ for its predictions was composed of 300 validated rice genes. Several sources of information were integrated in the prediction process, such as similarities with the sugarcane expressed sequence tags (ESTs) of the Gene Index database comprising 282 683 sequences (http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=s_officinarum). The polypeptide function was also predicted by integrating several pieces of evidence, such as protein similarities (NCBI-BLASTp on UniProt) (UniProt Consortium, 2009) and protein domains (InterproScan on InterPro) (Hunter et al., 2009). The sequence of the rice orthologous segment was recovered from the Rice Genome Annotation Project Database (http://rice.plantbiology.msu.edu/) and re-annotated. Predictions were manually curated using the Artemis annotation tool (Rutherford et al., 2000). All genes were checked and ambiguities were manually corrected.

TE annotation and classification

TEs were annotated using the REPET package containing two pipelines: TEdenovo and TEannot (http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET). Repeat sequences were first predicted with TEdenovo by an all-by-all comparison of BAC sequences, with the genes already annotated being masked. Repeat sequences, showing at least 80% identity on 80% of their length, were clustered and a consensus sequence was derived from each cluster. These consensus sequences were compared with the Repbase database (Jurka et al., 2005) and classified according to Wicker et al. (2007) in class, order, superfamily and family of TEs. In addition, the LTR_STRUC program (McCarthy & McDonald, 2003) was used to identify complete long terminal repeat (LTR) retrotransposons. LTR structures flanking retrotransposons were checked using dotplot analysis of each BAC sequence against itself. Redundancy between LTR_STRUC and TEdenovo was removed to obtain a nonredundant library of classified elements. Finally, this library was used with the TEannot pipeline to detect additional TE fragments. All large TEs (all TEs except miniature inverted-repeat TEs, MITEs) were then manually checked with the Artemis annotation tool (Rutherford et al., 2000).
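The clustering criterion described above (at least 80% identity over 80% of the length) can be illustrated with a minimal Python sketch. This is not REPET's actual algorithm: the function names, the tuple layout of the pairwise hits and the choice to measure coverage on the shorter sequence are all assumptions made for illustration, and the grouping is plain single-linkage.

```python
def passes_80_80(identity, aligned_len, len_a, len_b,
                 min_identity=80.0, min_coverage=0.8):
    """Apply the 80% identity / 80% length criterion to one pairwise hit.

    Assumption: coverage is measured relative to the shorter sequence.
    """
    coverage = aligned_len / min(len_a, len_b)
    return identity >= min_identity and coverage >= min_coverage


def cluster_repeats(names, hits):
    """Group sequences by single linkage over hits that pass the criterion.

    hits: iterable of (name_a, name_b, identity, aligned_len, len_a, len_b),
    a hypothetical layout chosen for this sketch.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len, len_a, len_b in hits:
        if passes_80_80(identity, aligned_len, len_a, len_b):
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical hits: te1/te2 pass the criterion, te2/te3 fail on identity.
example_hits = [
    ("te1", "te2", 92.0, 900, 1000, 1000),
    ("te2", "te3", 75.0, 950, 1000, 1000),
]
print(cluster_repeats(["te1", "te2", "te3"], example_hits))
```

In a real pipeline a consensus sequence would then be derived from each cluster, as described in the text; that step is not sketched here.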

Sequence analyses

All annotated genes were clustered using the NCBI-BLASTclust program to identify homologous and/or orthologous groups of genes (for rice and sorghum). Sequence alignments of hom(oe)ologous and orthologous genes were performed with ClustalW in the GENALYS (http://software.cng.fr) and BioEdit (Hall, 1999) interfaces, and manually corrected. The Nei–Gojobori method implemented in MEGA4 software (Kumar et al., 1994), with the Jukes–Cantor correction, was used to estimate synonymous (Ks) and nonsynonymous (Ka) substitution rates and Ka : Ks ratios. Positive or negative selection was assessed by testing hypotheses H1 (Ka > Ks) or H2 (Ka < Ks) against the null hypothesis of no selection (H0: Ka = Ks), using the Z-test for selection (Nei & Kumar, 2000) implemented in MEGA; the standard error was calculated by the bootstrap method using 1000 replicates. P < 0.05 was considered to be statistically significant. Genetic distances between sequences and phylogenetic trees were established with Kimura’s two-parameter model and the neighbor-joining method, respectively. Divergence and duplication times (T) were estimated using T = Ks/2k, where Ks is the estimated number of synonymous substitutions per site between homologous sequences and k is the average synonymous substitution rate, by assuming a rate of 6.5 × 10⁻⁹ mutations per synonymous site per year, as reported by Gaut et al. (1996) for Adh loci of grasses. However, it is important to note that current molecular dating approaches are limited as many variables influence the molecular evolutionary rate (Doyle & Egan, 2009).
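As a rough illustration of the dating formula in this paragraph, the following Python sketch applies the Jukes–Cantor correction to an observed proportion of differing synonymous sites and converts the resulting Ks into a divergence time with T = Ks/2k. The function names and the example input are hypothetical; only the rate of 6.5 × 10⁻⁹ is taken from the text, and the Nei–Gojobori counting of synonymous sites performed by MEGA4 is not reproduced.

```python
import math

# Synonymous substitution rate per site per year (Gaut et al., 1996), as used in the text.
K_GRASS_ADH = 6.5e-9


def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(ks, k=K_GRASS_ADH):
    """Return the divergence time in years, T = Ks / 2k."""
    return ks / (2 * k)


# Hypothetical example: 5% of synonymous sites differ between two alleles.
ks_example = jukes_cantor(0.05)
t_example_myr = divergence_time(ks_example) / 1e6
print(f"Ks = {ks_example:.4f}, T = {t_example_myr:.2f} Myr")
```

Under this rate, a Ks of 0.13 corresponds to a divergence time of 10 Myr, which gives a sense of scale for the values reported in the Results.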

The insertion times of LTR retrotransposons were estimated for complete elements with both LTRs and target site duplications (Table 1). LTR sequences were aligned using ClustalW and edited manually if needed. MEGA4 software was used to calculate the rate of substitutions. Insertion dates were estimated using Kimura’s two-parameter model, with a mutation rate of 1.3 × 10⁻⁸ substitutions per synonymous site per year, as described by Ma & Bennetzen (2004).
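The LTR dating procedure can be sketched in Python as follows. The sketch computes the Kimura two-parameter distance K between the two aligned LTRs of one element and converts it into an insertion time. The conversion T = K/2r is the usual convention for LTR dating and is an assumption here, as the text does not spell out the formula; the function names are hypothetical, and only the rate of 1.3 × 10⁻⁸ comes from the text.

```python
import math

# Substitution rate per site per year (Ma & Bennetzen, 2004), as used in the text.
LTR_RATE = 1.3e-8
PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time(distance, rate=LTR_RATE):
    """Insertion time in years, assuming T = K / 2r."""
    return distance / (2 * rate)


# Hypothetical pair of LTRs differing by 10 transitions over 100 sites.
ltr5 = "A" * 100
ltr3 = "G" * 10 + "A" * 90
k_example = k2p_distance(ltr5, ltr3)
print(f"K = {k_example:.4f}, T = {insertion_time(k_example) / 1e6:.2f} Myr")
```

Two identical LTRs give K = 0 and therefore an insertion time of 0, which is how an estimate of 0 Mya can arise for very recent insertions.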

Table 1.   Summary of bacterial artificial chromosome (BAC) content
Total number of sugarcane BACs analyzed | 8
Combined BAC length | 960 kb
Transposable elements | 337 kb (35%)
 Long terminal repeat (LTR) retrotransposons | 260 kb (27%)
 Non-LTR retrotransposons | 48 kb (5%)
 Transposons | 29 kb (3%)
Genes | 207 kb (22%)
Intergenic regions | 416 kb (43%)
Gene density | 1 gene per 9 kb
Number of colinear genes | 14
Number of genes that show colinearity with sorghum | 10
Number of genes that show colinearity with rice | 8

Haplotype networks were built with NETWORK v4.5 software (http://www.fluxus-technology.com/sharenet.htm) from coding sequences of hom(oe)ologous genes, using the median-joining method, with an equal weight for all sites. Both single nucleotide polymorphisms (SNPs) and insertions–deletions (INDELs) were considered. Phylogenetic trees, haplotype networks and divergence time calculations were used to schematize a haplotype differentiation pattern (Fig. 4). Comparisons were performed when at least four alleles could be compared and excluding gene fragment 10.

Results


Sequencing and annotation of seven hom(oe)ologous regions within the R570 sugarcane cultivar

Eight BAC clones representing seven haplotypes were sequenced and annotated, representing nearly 1 Mb of sugarcane DNA sequence. Fifteen genes with their allelic versions distributed in six or seven haplotypes and 66 large TEs were identified (Fig. 1, Table 1). The region analyzed corresponded to a gene-rich region with an average of one gene every 9 kb. Each hom(oe)ologous haplotype was represented by one BAC clone, except for haplotype VII, which was represented by two BAC clones genetically mapped on the same chromosome but separated on the current physical map by a gap of unknown size. All seven hom(oe)ologous regions partially overlapped, except for haplotypes IV and VI. We could therefore not rule out that these two BAC clones belonged to the same chromosome. The overlap between two hom(oe)ologous regions (excluding haplotype IV and VI) ranged from 9.6 to 73 kb, representing between two and seven genes.

Figure 1.

 Comparison of the physical organization between seven hom(oe)ologous sugarcane bacterial artificial chromosome (BAC) clones, and with sorghum and rice orthologous segments in the Bru1 region. Genes are indicated by black boxes, long terminal repeat (LTR) retrotransposons by gray boxes (insertion times are indicated inside boxes), non-LTR retrotransposons by gray striped boxes, transposons by gray dotted boxes and proteinase inhibitor by diamonds. Colinear genes are connected by shaded areas and colinear transposable elements (TEs) by dashed lines. Genes are numbered according to Table 2. TE family names are abbreviated as: Rh (Rhum), Ti (TiPunch), Da (Daiquiri), Ca (Caipirinha), Ch (Cachaca), Co (Colada), Pl (Planteur) and Mo (Mojito).

Comparison of gene content and structure within sugarcane haplotypes and with sorghum and rice orthologous regions

Fifteen genes were present in at least two hom(oe)ologous regions in sugarcane or orthologous regions in rice and sorghum. For all of these genes (except gene 10), the coding sequence could be translated into a complete protein sequence, and thus they were all predicted to be functional. In all hom(oe)ologous versions of gene 10, three exons were missing, and so this gene is referred to as a gene fragment. All hom(oe)ologous alleles of the 14 complete genes showed a conserved exon/intron structure. From the sugarcane TIGR database, highly homologous ESTs covering the full length of six of the genes were identified, whereas only partial coverage was found for the other genes (data not shown). Proteins from other species presenting high homology with the amino acid sequences deduced from the 15 genes were found in the database. A function could be tentatively attributed to 11 of these genes (Table 2).

Table 2.   List of syntenic genes on sugarcane bacterial artificial chromosome (BAC) clones
No. | Protein giving highest BLAST score | Best BLASTX E-value | Syntenic orthologous sorghum locus¹
1 | Oryza sativa conserved hypothetical protein (CHP) (Q2QZU3) | 3.E-18 | Sb04g001890
2 | Arabidopsis thaliana histone lysine N-methyltransferase ATXR6 (Q9FNE9) | 1.E-123 | Sb04g001900
3 | O. sativa putative arginine/serine-rich splicing factor (Q6Z725) | 1.E-80 | Sb04g001910
4 | A. thaliana dimethyladenosine transferase (O65090) | 7.E-89 | Sb04g001913
5 | O. sativa Ulp1 protease family, C-terminal catalytic domain containing protein (Q2QME9) | 1.E-142 | Sb04g001916
6 | O. sativa cell division control protein 2 homolog 2 (P29619) | 1.E-144 | Sb04g001920
7 | A. thaliana shrunken seed protein (Q85851) | 5.E-88 | Sb04g001930
8 | O. sativa conserved hypothetical protein (CHP) PO575F10.13 (Q62719) | 0.E+00 | Sb04g001940
9 | O. sativa tyrosine-specific protein phosphatase-like (Q6ZHH2) | 5.E-26 | –
10 | Triticum aestivum serine/threonine protein kinase pelle fragment (Q5BQ31) | 8.E-15 | –
11a | Malus domestica NADP-dependent d-sorbitol-6-phosphate dehydrogenase (P28475) | 3.E-123 | –
11b | Malus domestica NADP-dependent d-sorbitol-6-phosphate dehydrogenase (P28475) | 3.E-125 | Sb04g001950
12 | O. sativa endoglucanase 4 precursor (Q6Z715) | 1.E-152 | Sb04g001960
13 | O. sativa hypothetical protein (HP) PO662B01.10 (Q5VQ47) | 1.E-38 | –
14 | O. sativa hypothetical protein (HP) H0821G03.11 (Q25A05) | 2.E-04 | –

¹ Loci numbers are available at http://www.phytozome.net/, v1.0 release, comprising the Sbi1 assembly and Sbi1.4 gene set.

As the overlaps varied between hom(oe)ologous regions, the sugarcane hom(oe)ologous regions could be compared on the basis of a variable number of genes, ranging from two to seven, with an average of four. On this basis, the colinearity and transcriptional orientation were highly conserved between genes 1 and 14 (Fig. 1). A notable exception to strict colinearity involved the presence of two additional predictions corresponding to gene fragment 10 and a duplication of gene 11 on haplotypes II, V, VI and VII. Gene fragment 10 corresponded to a fragment of a serine/threonine kinase. Gene 11 corresponded to a sorbitol-6-phosphate dehydrogenase (S6PDH).

In addition, two structural arrangements were each unique to one haplotype: haplotype VII displayed additional genes between genes 11a and 11b, as the result of an insertion within this haplotype (Le Cunff et al., 2008), and haplotype VI displayed a segmental triplication of 22 kb, including part of gene 11b and genes 12, 13 and 14 (Fig. 1). Finally, a few additional gene predictions were unique for one haplotype, but none was found to interrupt the gene 1–14 series (Fig. 1). Several genes coding for proteinase inhibitor subfamily I13 were identified on five haplotypes (I, II, III, V and VI), thus possibly extending the colinear region.

Colinearity was also very high in the overlapping regions between sugarcane and sorghum and, to a lesser degree, rice. Three genes (genes 9, 13 and 14) were absent in sorghum when compared with sugarcane, but a trace of the first exon still remained for gene 9. In the rice orthologous region, five genes were absent when compared with sugarcane (genes 1, 5, 9, 13 and 14), whereas one additional gene was present between genes 7 and 8 and two were present between genes 11 and 12. As for sugarcane, several proteinase inhibitors of subfamily I13 were identified at a similar position in the sorghum and rice orthologous regions.

Haplotype divergence

The percentage of nucleotide sequence identity was calculated for each pair of hom(oe)ologous and orthologous genes (Table 3). Among sugarcane hom(oe)ologous alleles, sequence identity was very high, i.e. ranging from 78.8% to 100%, with an average of 95.9% (97.7% when not taking gaps into account), for the coding sequence, and from 39.3% to 100%, with an average of 87.5% (96.9% when not taking gaps into account), for the aligned part of the introns. Between sugarcane and sorghum, the average identity was high, with 91.6% for the coding sequence and 72.8% for the aligned part of introns (96.4% and 88.0% when not taking gaps into account, respectively). Between sugarcane and rice, the average identity was lower as expected, with 71% for the coding sequence and 38% for the aligned part of the intron.
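The two identity figures quoted above (with and without gaps taken into account) can be illustrated with a short Python sketch. How gap columns were counted in the original analysis is not stated, so the treatment below is an assumption: a column where only one sequence has a gap counts as a mismatch in the first mode and is dropped in the second. The function name is hypothetical.

```python
def percent_identity(seq1, seq2, count_gaps=True):
    """Percentage of identical columns in a pairwise alignment.

    count_gaps=True:  columns with a gap in one sequence count as mismatches.
    count_gaps=False: such columns are excluded from the comparison.
    Columns where both sequences have a gap are always ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    matches = columns = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" and b == "-":
            continue
        if "-" in (a, b):
            if count_gaps:
                columns += 1
            continue
        columns += 1
        matches += a == b
    if columns == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / columns


# Hypothetical alignment with a 2-bp indel and no substitutions.
allele_a = "ATGC--GT"
allele_b = "ATGCAAGT"
print(percent_identity(allele_a, allele_b, count_gaps=True))   # 75.0
print(percent_identity(allele_a, allele_b, count_gaps=False))  # 100.0
```

The example shows why intron identities drop much more than exon identities when gaps are counted: indels, which are frequent in introns, lower the first figure without affecting the second.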

Table 3.   Percentage of identity between hom(oe)ologous sugarcane genes (a) and orthologous sorghum–sugarcane genes (b)
Gene | No. of sequences | Alignment length (bp) | Exon length (bp) | Intron length (bp) | Exon identity (%), average | Exon identity (%), max | Exon identity (%), min | Intron identity (%), average | Intron identity (%), max | Intron identity (%), min

(a) Hom(oe)ologous sugarcane genes
2. Histone-lysine N-methyltransferase | 2 | 1195 | 1101 | 94 | 97.6 | 97.6 | 97.6 | 100.0 | 100.0 | 100.0
3. Arginine/serine-rich splicing factor | 2 | 3449 | 975 | 2474 | 96.7 | 96.7 | 96.7 | 89.8 | 89.8 | 89.8
4. Dimethyladenosine transferase | 2 | 3965 | 1002 | 2963 | 96.8 | 96.8 | 96.8 | 74.4 | 74.4 | 74.4
5. Putative ulp1 protease | 3 | 7944 | 2586 | 5358 | 98.4 | 98.5 | 98.3 | 90.6 | 97.7 | 86.6
6. Cell division control protein 2 homolog 2 | 4 | 4315 | 882 | 3433 | 99.5 | 100.0 | 99.2 | 91.6 | 98.7 | 85.2
7. Putative shrunken seed protein | 6 | 1872 | 966 | 906 | 99.2 | 99.9 | 98.6 | 97.6 | 99.9 | 95.8
8. Conserved hypothetical protein | 6 | 3021 | 2445 | 576 | 96.3 | 99.3 | 92.4 | 96.8 | 99.6 | 93.6
9. Putative tyrosine-specific protein phosphatase | 5 | 1662 | 1662 | 0 | 90.3 | 99.3 | 84.6 | – | – | –
10. Serine/threonine kinase (fragment) | 4 | 863 | 411 | 452 | 88.8 | 93.1 | 84.4 | 69.2 | 100.0 | 40.2
11a. NADP-dependent d-sorbitol-6-phosphate dehydrogenase | 5 | 1953 | 681 | 1272 | 92.8 | 99.7 | 87.4 | 66.5 | 98.2 | 39.3
11b. NADP-dependent d-sorbitol-6-phosphate dehydrogenase | 6 | 5113 | 957 | 4156 | 98.3 | 99.7 | 95.8 | 76.3 | 95.5 | 59.2
12. Endoglucanase 4 precursor | 7¹ | 2107 | 1757 | 350 | 90.0 | 99.8 | 78.8 | 85.4 | 99.4 | 71.1
13. Hypothetical protein | 4¹ | 2791 | 1818 | 973 | 99.0 | 99.8 | 98.0 | 99.4 | 99.9 | 99.3
14. Hypothetical protein | 4¹ | 638 | 552 | 86 | 99.0 | 100.0 | 98.0 | 100.0 | 100.0 | 100.0

(b) Orthologous sorghum–sugarcane genes
2. Histone-lysine N-methyltransferase | 3 | 1195 | 1101 | 94 | 94.
3. Arginine/serine-rich splicing factor | 3 | 3590 | 975 | 2615 | 94.4 | 95.6 | 93.5 | 84.9 | 87.7 | 82.1
4. Dimethyladenosine transferase | 3 | 4019 | 1032 | 2987 | 90.5 | 92.3 | 89.3 | 62.8 | 64.5 | 61.2
5. Putative ulp1 protease | 4 | 8058 | 2607 | 5451 | 89.7 | 89.8 | 89.7 | 80.5 | 83.5 | 74.5
6. Cell division control protein 2 homolog 2 | 5 | 4508 | 891 | 3617 | 95.2 | 95.4 | 95.1 | 74.3 | 75.7 | 70.7
7. Putative shrunken seed protein | 7 | 3696 | 967 | 2729 | 94.1 | 94.6 | 94.0 | 83.9 | 86.5 | 78.3
8. Conserved hypothetical protein | 7 | 3137 | 2445 | 692 | 92.7 | 93.4 | 91.1 | 87.6 | 88.7 | 85.4
9. Putative tyrosine-specific protein phosphatase | 6 | 210 | 210 | 0 | 82.4 | 82.9 | 81.9 | – | – | –
11b. NADP-dependent d-sorbitol-6-phosphate dehydrogenase | 7 | 5296 | 966 | 4330 | 94.0 | 94.9 | 92.2 | 52.2 | 58.5 | 47.5
12. Endoglucanase 4 precursor | 8¹ | 1852 | 1644 | 208 | 88.5 | 91.8 | 80.8 | 41.5 | 49.4 | 39.4

¹ With three paralogous genes in the Sh135P16 bacterial artificial chromosome.

Haplotype differentiation was analyzed by comparing sequences of hom(oe)ologous alleles (Table 3) and building phylogenetic trees and haplotype networks (data not shown). A synthetic representation of the results is shown in Fig. 4. The maximum divergence between two alleles within a locus ranged from 1.2 to 9.6 Myr for the segmental duplicated loci 11b and 11a, respectively. Among nonsegmental duplicated loci, this range was from 1.7 to 5.1 Myr. The most divergent alleles fell into haplotypes I, III and part of haplotype VII, and the most similar into haplotypes II, IV, V and VI. The simplest interpretation is that these two sets of haplotypes (II, IV, V and VI vs I and III) could originate from distinct genomes, that is S. officinarum and S. spontaneum. The variation along haplotype VII suggests a more complex origin. The most divergent haplotypes or haplotype fragments could tentatively be attributed to S. spontaneum on the basis of the much broader molecular differentiation observed between S. officinarum and S. spontaneum than within S. officinarum, as well as the greater diversity and heterozygosity within S. spontaneum than within S. officinarum (Lu et al., 1994; Jannoo et al., 1999). It is noteworthy that, with the exception of haplotype VII, the presence vs absence of gene 11a and the distribution of the few colinear TEs corroborate the sequence differentiation pattern.

On the basis of the average gene synonymous substitutions between all sugarcane haplotypes and the sorghum and rice orthologous sequences, we estimated that sugarcane and sorghum diverged around 6–9 Mya and sorghum and rice around 43 Mya, which is in agreement with the previously published dates of 8–9 Mya (Jannoo et al., 2007) and 40–50 Mya (Paterson et al., 2004; Bowers et al., 2005), respectively.

Selective constraints on hom(oe)ologous genes

The synonymous (Ks) and nonsynonymous (Ka) substitution rates, as well as the Ka : Ks ratios, were calculated for exons of all pairs of hom(oe)ologous alleles (Fig. 2). For nine of the 14 genes compared (genes 2, 3, 4, 5, 6, 7, 8, 11a and 14), Ka : Ks values under unity were obtained for all pairs of hom(oe)ologous alleles. For seven of these genes, significant P values for the Z-test (below 0.05 and down to below 0.001) were observed for several pairs of hom(oe)ologous alleles. In the other cases, the results obtained were not statistically significant because of the small number of compared sequences and/or a lack of substitutions. For genes 9 and 13, some Ka : Ks values were close to unity, but they were not statistically significant. These results indicated that the majority of hom(oe)ologous alleles were under purifying selection (Fig. 2). The only genes for which some pairs of hom(oe)ologous alleles showed a Ka : Ks ratio above unity with, in a few cases, a Z-test P value below 0.05, were gene fragment 10 and genes within segmental duplications: genes 11 (duplication) and gene 12 (triplication).
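The logic of the Z-test used to interpret these Ka : Ks values can be sketched in Python. This is a simplified illustration under stated assumptions, not MEGA's implementation: the statistic is taken as the difference between Ka and Ks divided by its standard error (obtained in the study by bootstrap), and the one-sided P value comes from the standard normal distribution. The function name and the example numbers are hypothetical.

```python
import math


def z_test_selection(ka, ks, se_diff, alternative="purifying"):
    """One-sided Z-test of H0: Ka = Ks.

    se_diff is the standard error of (Ka - Ks), for example estimated
    from bootstrap replicates.
    alternative="purifying" tests H2: Ka < Ks;
    alternative="positive"  tests H1: Ka > Ks.
    """
    if se_diff <= 0:
        raise ValueError("se_diff must be positive")
    if alternative == "purifying":
        z = (ks - ka) / se_diff
    elif alternative == "positive":
        z = (ka - ks) / se_diff
    else:
        raise ValueError("alternative must be 'purifying' or 'positive'")
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical allele pair: Ka = 0.01, Ks = 0.05, bootstrap SE of the difference = 0.02.
z_example, p_example = z_test_selection(0.01, 0.05, 0.02)
print(f"Ka:Ks = {0.01 / 0.05:.2f}, Z = {z_example:.2f}, P = {p_example:.4f}")
```

The sketch also shows why a Ka : Ks ratio below unity is not sufficient on its own: with few substitutions the standard error is large relative to the difference, and the test does not reach significance, as reported above for several allele pairs.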

Figure 2.

Ka : Ks values for each gene from bacterial artificial chromosome (BAC) pair comparisons. Pairwise comparisons with a Z-test gave P values of *, P ≤ 0.05; **, P ≤ 0.01; ***, P ≤ 0.001.

Gene 11 was duplicated on four of the hom(oe)ologous sugarcane haplotypes, whereas the other haplotypes contained only gene 11b. Sorghum and rice have only one copy of this gene and phylogenetic analyses showed that it had greater similarity to gene 11b (Fig. 3). Alleles of gene 11a displayed four- to five-fold more substitutions than alleles of the other genes in the sugarcane region. However, in spite of this excess of substitutions, a very high purifying selection pressure (with most P values below 0.001) was observed for gene 11a, whereas more relaxed selective pressures were observed for gene 11b.

Figure 3.

 Phylogenetic relationships of genes 11 encoding sorbitol-6-phosphate dehydrogenase (S6PDH) proteins. Gene 11 coding sequences from sugarcane bacterial artificial chromosomes (BACs), tentative consensus (TC) available in the TIGR sugarcane database, sorghum and rice were used to construct the tree by the neighbor-joining method implemented in MEGA4 software. The robustness of the tree topology was assessed with 1000 bootstrap replicates. The scale bar represents the relative genetic distance (number of substitutions per nucleotide). The coding sequences of S6PDH from Zea mays and Malus domestica (GenBank numbers EU965783 and AF057134, respectively) were used to root the tree. S6PDH genes were subdivided into at least two groups, namely 11a and 11b. ShCIR9O20g200 corresponds to a fragment of a second copy of gene 11a present on haplotype VII at the extremity of BAC ShCIR9O20, but interrupted by the cloning site of BAC.

Comparison of TE content within sugarcane haplotypes and with sorghum and rice

Sixty-six large TEs were annotated on BAC clones. They represented an average of 35% of all BAC sequences (between 15% and 54%). Most TEs had high similarities with previously reported families, but 21% were new. We attributed family names to the most abundant sugarcane elements, and their classification is described in Table 4. LTR retrotransposons were the most frequent elements, representing 65% of all TEs. They all belonged to two superfamilies, namely Ty3-Gypsy (58%) and Ty1-Copia (42%). Non-LTR retrotransposons represented 17% of all TEs; they all belonged to the LINE superfamily. DNA transposons (TE class II) represented 18% of all elements, with two-thirds belonging to the CACTA superfamily.

Table 4.   List and classification of large transposable elements (TEs) on sugarcane bacterial artificial chromosome (BAC) clones

Twenty percent of the TEs were complete, including 12 LTR retrotransposons and one transposon. Insertion times were calculated for all complete LTR retrotransposons and ranged from 0 to 1.58 Mya (Table 4). The vast majority of TEs were located in intergenic regions, except for eight that were inserted in introns of one allele of gene 11a and four alleles of gene 11b. No large TE was observed between genes 1 and 12 in the sorghum and rice orthologous regions. Most TEs were found in several hom(oe)ologous haplotypes, and no evident specific distribution of TEs between haplotypes was noted. They displayed hardly any colinearity, with their position not being conserved across haplotypes, except in three cases. The first case concerned an LTR retrotransposon of the Rhum family, which was present on haplotype V (two elements in tandem with one shared LTR), haplotype II (fragment) and haplotype VII (fragment) (Fig. 1). The second case concerned an LTR retrotransposon of the Colada family, which was complete on haplotype II and present as a fragment on haplotype IV. The third case concerned a fragment of a transposon found at the same position on haplotypes V and VI.

Discussion


Our study was based on BAC clones from a modern sugarcane cultivar. Typically, modern cultivars have a genome made up of 75–85% of chromosomes from the autopolyploid species S. officinarum and 15–25% of chromosomes from the autopolyploid species S. spontaneum (D’Hont et al., 1996; Piperidis et al., 2010). These two types of chromosome were juxtaposed very recently (a century ago) through breeding, after their evolution within the autopolyploid parental species. Dating the origins of polyploidy events is difficult (Doyle & Egan, 2009), particularly for sugarcane: autopolyploidy implies a lack of initial differentiation (as opposed to allopolyploidy); it is likely to be gradual in such a high polyploid and its diploid progenitors have become extinct. On the basis of the molecular clock of Adh gene sequences, we estimated that the closest sugarcane diploid relative identified so far, that is Narenga porphyrocoma (Al-Janabi et al., 1994), diverged from sugarcane at 2.5 Mya (A. D’Hont, unpublished), and that S. spontaneum and S. officinarum diverged at 1.5–2 Mya (Jannoo et al., 2007). Polyploidy probably occurred after this divergence as these species have different basic chromosome numbers (D’Hont et al., 1998). Both S. spontaneum and the wild progenitor of S. officinarum (S. robustum) feature an impressive series of high polyploids. Saccharum species are therefore considered to be relatively ancient polyploids. Between the evolution as autopolyploid natural species and the short history as interspecific breeding materials, limited to the last century and a few meioses (≤ 7), it is likely that differentiation among hom(oe)ologous haplotypes extracted from a modern sugarcane (here R570) essentially reflects the evolutionary dynamics within the two autopolyploids.

In Jannoo et al. (2007), we compared two homoeologous haplotypes (BAC clones bearing Adh1 genes) from sugarcane cultivar R570, one originating from S. officinarum and one from S. spontaneum. In the present study, we compared seven haplotypes from a second region of the same cultivar. The species of origin of the individual haplotypes was not determined precisely, but could tentatively be inferred (Fig. 4); the set probably includes haplotypes from both S. officinarum and S. spontaneum. Hence, the present comparisons involved homologous and homoeologous haplotypes.

Figure 4. Schematic representation of differentiation among sugarcane haplotypes based on hom(oe)ologous allele sequence comparison (see text). Each allele is represented by a square. For each locus, the most divergent allele is marked in black and its theoretical divergence time (highest estimate observed in Myr) is indicated in italics. All alleles that fall into groups (of at least three) with all values lower than one-third of this maximum divergence time are marked by white squares. When the phylogenetic trees were not degenerate, the alleles of the same branch (relating to the same internal node) were placed in vertical dotted boxes, providing clues for classification (loci 7, 11b and 12). The white triangle in the black square for locus 11a indicates an insertion. The ‘x’ mark indicates absence of the gene.
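The marking rule described in this legend can be sketched in Python as follows. This is only an illustration under stated assumptions: the legend does not specify how the 'most divergent' allele is chosen, so the sketch takes the allele with the largest mean pairwise divergence; the function name, the data layout and the divergence values in the usage example are hypothetical and are not taken from the study.

```python
from itertools import combinations


def classify_locus(alleles, div_time):
    """Mark alleles of one locus following the rule given in the legend.

    alleles  : list of allele (haplotype) labels
    div_time : dict mapping a pair (a, b) to a divergence time in Myr
    """
    def t(a, b):
        # Pairwise times are symmetric; accept either key order.
        return div_time[(a, b)] if (a, b) in div_time else div_time[(b, a)]

    max_time = max(div_time.values())  # highest estimate observed for the locus
    # Assumption: "most divergent" = largest mean divergence from all others.
    black = max(alleles,
                key=lambda a: sum(t(a, b) for b in alleles if b != a))
    threshold = max_time / 3.0
    white = set()
    # An allele belongs to a group of at least three mutually close alleles
    # exactly when it belongs to at least one such trio.
    for trio in combinations(alleles, 3):
        if all(t(a, b) < threshold for a, b in combinations(trio, 2)):
            white.update(trio)
    white.discard(black)
    return {"max_time": max_time, "black": black, "white": sorted(white)}


# Hypothetical divergence times (Myr) for five alleles of one locus.
example_times = {
    ("I", "II"): 0.10, ("I", "III"): 0.20, ("II", "III"): 0.15,
    ("I", "IV"): 0.50, ("II", "IV"): 0.60, ("III", "IV"): 0.55,
    ("I", "V"): 1.50, ("II", "V"): 1.40, ("III", "V"): 1.60, ("IV", "V"): 1.50,
}
result = classify_locus(["I", "II", "III", "IV", "V"], example_times)
```

In this made-up example, allele V is marked in black, alleles I, II and III form a white group, and allele IV is left unmarked because it is not close enough to any two other alleles.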

High gene conservation in the autopolyploid genome context of sugarcane

We observed a remarkably high general colinearity of genes between sugarcane haplotypes and high gene structure and sequence conservation of homologous and homoeologous alleles. The general colinearity is disturbed by only a few segmental duplications. Strikingly, all the hom(oe)ologous genes (except gene fragment 10) were predicted, based on their structure, to be functional, although gene expression still needs to be verified experimentally. This was noted by Jannoo et al. (2007) between two homoeologous haplotypes, and is extended here for the first time to homologous haplotypes. This high retention of redundant functional genes contrasts with the situation in paleopolyploids (Blanc & Wolfe, 2004; Thomas et al., 2006; Freeling, 2009; Throude et al., 2009) and in some more recent allopolyploid species, such as wheat (Ozkan et al., 2001; Chantret et al., 2005) and Tragopogon (Tate et al., 2006; Buggs et al., 2010), where substantial gene elimination and pseudogenization have been observed. However, it is in line with the situation in some other allopolyploids, such as Brassica and Gossypium, for which a high level of gene conservation and colinearity has also been reported (Grover et al., 2004; Cheung et al., 2009). When duplicated genes are conserved, they can be under relaxed selection, as exemplified by the case of the MONOCULM1 region in Oryza minuta, an allotetraploid rice (BBCC) estimated to be of recent origin (0.4 Mya) (Lu et al., 2009). Interestingly, our analysis in the sugarcane Bru1 region (this study) and Adh1 regions (Supporting information Table S1) suggests, conversely, that purifying selection acts on most genes. The only exceptions concerned genes within segmental duplications.
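The distinction between purifying and relaxed selection is usually read from the ratio of nonsynonymous to synonymous substitution rates (Ka/Ks) between two alleles. The short Python sketch below shows this reading, assuming that criterion; the function name and the numerical values are hypothetical, and a point estimate of the ratio is not a substitute for a formal statistical test.

```python
def selection_regime(ka: float, ks: float) -> str:
    """Interpret a pairwise Ka/Ks ratio (point estimate only)."""
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    omega = ka / ks
    if omega < 1.0:
        return "purifying"   # nonsynonymous changes are under-represented
    if omega > 1.0:
        return "positive"    # nonsynonymous changes are over-represented
    return "neutral"         # ratio equal to 1: no constraint detected


# Hypothetical pair of alleles: Ka = 0.01, Ks = 0.05, so Ka/Ks = 0.2.
regime = selection_regime(0.01, 0.05)
```

A ratio well below 1 is the pattern expected for genes under purifying selection, whereas ratios approaching 1 indicate more relaxed constraint.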

This high hom(oe)oallele conservation, despite the extreme gene redundancy observed in sugarcane, raises the question of the mechanisms involved. The mode of chromosome pairing may have a marked impact on genome evolution. In particular, the recurrent random assortment of chromosomes at meiosis in autopolyploids may counterselect departure from the functional stability of all alleles, because it may give rise to individuals (and gametes) lacking a functionally complete gene set. Le Comber et al. (2010), using computer simulation, suggested that subfunctionalization speeds up the transition from polysomic to disomic inheritance and also acts to maintain genes in syntenic blocks. Analysis of the expression of the hom(oe)oalleles identified in this study should be undertaken to test for evidence of subfunctionalization and to further investigate this hypothesis. Genetic mapping studies have shown that chromosome assortment results from general polysomy in S. spontaneum (Al-Janabi et al., 1993) and from polysomy with some cases of preferential pairing in S. officinarum (Aitken et al., 2007) and in modern sugarcane cultivars (Grivet et al., 1996; Hoarau et al., 2001; Jannoo et al., 2004). These observations may be signs of initiation of transition from polysomic to disomic inheritance.

One of the genes duplicated through segmental duplication, gene 11, corresponds to S6PDH, which is a key enzyme in sugar metabolism for sorbitol biosynthesis. The duplication apparently occurred after the divergence between sugarcane and sorghum lineages, and possibly during the differentiation of Saccharum species, as it is absent from some of the haplotypes and confined to those tentatively attributed to S. officinarum (the sugar-producing species). The two copies have diverged; gene 11a has accumulated a large number of substitutions, but has evolved under very high purifying selection pressure, whereas more relaxed selective pressures were observed for gene 11b. These results suggest that segmental duplication has allowed the diversification of gene 11a, possibly towards an important specific function in sugarcane.

The contrasting result observed with regard to the evolution of duplicated genes suggests that, in our model, ‘horizontal’ duplications (segmental duplications), but not ‘vertical’ duplications (whole genome duplications through polyploidization), allow for diversification.

High TE variation between sugarcane haplotypes

The most represented TE families were Gypsy ‘TiPunch’ (12%) and Copia ‘Rhum’ (10%). These two retrotransposon families were also found to be the most abundant TEs in the sugarcane genome (S. officinarum) when analyzing about 1.45 Mb of random Sanger genomic sequence data (J. De Barry, J. L. Bennetzen, pers. comm., University of Georgia, USA). A survey of TEs in 260 781 sugarcane ESTs (Vettore et al., 2003) revealed a different distribution, with 53% transposons, 46.4% LTR retrotransposons and the absence of non-LTR retrotransposons. The most abundant retrotransposon superfamily in these ESTs was Copia (30%), with the Hopscotch family being the most represented element (Rossi et al., 2001; De Araujo et al., 2005). This suggested a nonrandom TE distribution in the expressed part of the genome.

In contrast with gene conservation and colinearity, we observed a general absence of colinearity for TEs among sugarcane haplotypes. This absence of colinearity, also noted by Jannoo et al. (2007) between two homoeologous haplotypes in the Adh1 region, was confirmed here in a second region extended to homologous haplotypes. In addition, the sugarcane sequence appears to be expanded compared with sorghum and rice orthologous regions because of the accumulation of TEs between genes. Indeed, rice and sorghum displayed no large TEs between genes 1 and 12 in the present study and between genes 1 and 14 in the Adh1 region.

The TE fraction of the genome in plants is known to be variable in size and very dynamic because of transposition and unequal recombination (homologous and illegitimate) (Ma et al., 2004; Vitte & Panaud, 2005). In some allopolyploids, major structural changes in the TE genome fraction have been observed after allopolyploidization, including TE proliferation and TE loss through recombination (reviewed by Parisod et al., 2010). When they occur, these changes are generally attributed to interspecific hybridization (genomic shock) rather than polyploidy per se (McClintock, 1984). In our case, the interspecific configuration that was established by modern breeding is very recent (a century ago), whereas the insertion times that could be estimated for LTR retrotransposons were much earlier. Therefore, the diversity of TE patterns is unlikely to have resulted mainly from the recent interspecific hybridization.

We can speculate that polysomy and the high effective population size associated with autopolyploidy may affect TE dynamics, and that the presence of TEs close to genes may facilitate a diversification of hom(oe)oallele expression patterns. These questions warrant further investigation.


This is the first analysis of the fine structural organization of a set of homologous and homoeologous haplotypes in sugarcane, and is the first of its kind in a highly polyploid genome. Despite an extreme level of gene redundancy, our analysis revealed strikingly high retention of redundant functional genes. In contrast, a general absence of TE colinearity among hom(oe)ologous haplotypes was observed, illustrating a dynamic expansion of TEs in the founder autopolyploid Saccharum species compared with sorghum. These results reinforce the general trend emerging from recent studies, indicating the diverse (depending on lineages) and nuanced effect of polyploidy on genome dynamics (Doyle et al., 2008; Parisod et al., 2010).

Our results also provide guidelines for future sequencing and assembly strategies that are presently being discussed within the Sugarcane Genome Sequencing Initiative (SUGESI, http://sugarcanegenome.org/). In particular, our results reveal high gene colinearity among hom(oe)ologous haplotypes, suggesting that one haplotype can serve as reference for the other hom(oe)ologous haplotypes with regard to gene content. They also confirm the high gene microcolinearity between sorghum and sugarcane (Jannoo et al., 2007; Wang et al., 2010), thus making the sorghum sequence a good template to facilitate the assembly of the gene-rich part of the sugarcane genome. This study also has practical implications for the development of new molecular marker technologies. For example, to optimize the development of haplotype-specific markers, it could be more efficient to favor technologies based on TE variation, such as insertion site-based polymorphism markers (Paux et al., 2010), rather than those based on gene variation.

Sugarcane has been recognized as one of the world’s most efficient crops in solar energy conversion and as having the most favorable input : output ratios. Our data suggest the presence of broad sets of functional homologous alleles in its genome, which could explain its unique efficiency, particularly its high phenotypic plasticity and wide adaptation.


We thank Douglas D. Silva, Magdalena Rossi, Erika M. de Jesus and Nilo Saccaro-Junior from the GaTE Laboratory, University of São Paulo, for their help in annotating TEs, and Nabila Yahiaoui and three anonymous referees whose contributions improved the manuscript. This work was funded by Genoscope (AAP2005) and Cirad.