A complex superfamily
The published genome of Arabidopsis thaliana ecotype Columbia (The Arabidopsis Genome Initiative, 2000) includes 36 genes belonging to the expansin superfamily. For rice, we limited our analysis to the map-based sequence of Oryza sativa L. cv. Nipponbare (Japonica cultivar group; Sasaki and Burr, 2000). This genome contains 58 expansin genes, including two pairs of identical genes (OsEXPA23a/OsEXPA23b and OsEXPB1a/OsEXPB1b).
In rice and Arabidopsis, the four expansin families are of comparable size, with the notable exception that EXPB genes are three times more numerous in rice (19 versus six genes). EXPA is the largest family, with 34 genes in rice and 26 in Arabidopsis, while EXLB is the smallest, with a single gene in each species. The EXLA family has three members in Arabidopsis and four in rice. For individual gene annotations, see the expansin web site at http://www.bio.psu.edu/expansins.
Combining rice and Arabidopsis protein sequences, we obtained six phylogenetic trees using neighbor-joining, parsimony and Bayesian analyses (Figure S1). EXPA sequences were analyzed independently because of the presence of family-specific insertions and deletions. We found it impossible to define orthologous groups of genes with any certainty, due to poor support at key nodes (see example in Figure 1) and contradictory results using different methods or a different set of sequences. A recent analysis (Li et al., 2003) noted similar difficulties.
Figure 1. Neighbor-joining cladogram of Arabidopsis and rice expansins. Clades of orthologous genes, as determined in the integrated trees, are also indicated on the right with alternating black and gray bars and numbered as in Figure 2. Family names are shown next to the first gene of the family. Bootstrap values are shown above well-supported nodes. Thick dashed lines are poorly resolved areas that affect orthology. Branches with support below 70 have been collapsed. An asterisk indicates branches that were rejected in the integrated trees and that affect orthology.
In addition to sequence information, the assembled genomes of Arabidopsis and rice provide positional information. The 36 expansin genes of Arabidopsis are found in 29 non-adjacent genomic locations (three of these locations contain tandems). In rice, the 58 expansins appear in 36 locations, including 10 tandems. As described below, we used positional information, in combination with dates of segment divergence, to refine the phylogeny of rice and Arabidopsis expansins (summarized in Figure 2).
Figure 2. Integrated cladograms for the 17 clades of the expansin superfamily. Clades, identified by roman numerals, represent putative independent lineages in the last ancestor of rice (left) and Arabidopsis (right). Brackets connect clades whose independence is uncertain. Intron pattern for the ancestral gene is shown below each clade. Intron positions are indicated at the top. Vertical gray lines indicate proposed polyploidy events. Rice event ρ is discontinuous due to its incomplete characterization. Continuous red line connects genes found in the same or homologous genomic locations; black dashed lines are used for genes in non-homologous locations. Green circles are for tandem duplications, blue for short range translocation, yellow for segmental duplication in tandem. For each clade, genes in tandem are indicated with boxes of the same color next to their names. White boxes are observed gene deaths, white circles on black branches are assumed deaths (see text). Collapsed branches have bootstrap values below 75 in clade trees. From the presence of the EXLB-II descendent in the Fabaceae (see Supplementary Text), we infer that this clade disappeared from the Arabidopsis lineage between events α and β (Bowers et al., 2003).
For Arabidopsis, we consulted two analyses of segmental duplications that identify the individual genes involved in these events. The first study dated the duplications by calculating Ks (synonymous substitution rate) between gene pairs (Simillion et al., 2002). A second study dated them in relation to speciation events (Bowers et al., 2003). In both cases, the results were interpreted as evidence for three rounds of polyploidy. A recent analysis of duplication rates also supports this hypothesis (Maere et al., 2005). In this work, we follow Bowers et al. (2003) in referring to these events as α, β and γ, with α being the most recent and γ being the oldest. A polyploidy event in rice, predating the divergence of the grasses, has also been proposed (Paterson et al., 2004) and we will refer to it as event ρ. These three genome-wide studies allowed us to identify paralogous relations between different expansin-containing segments (that is, segments created by duplications within a genome, without speciation involved). Segmental duplications in Arabidopsis can be linked to 12 surviving expansin gene duplications and rice event ρ seems responsible for another 10 (Figure 2).
By microsynteny analysis, we were able to go further and assemble 49 of the expansin-containing genomic segments from rice and Arabidopsis into 12 groups (see Experimental procedures). We propose that all the segments within a group descended from a single expansin-containing segment in the genome of the last common ancestor of monocots and eudicots, and thus refer to them as orthologous groups. They all contain at least one rice and one Arabidopsis segment with an expansin gene. A simplified synteny diagram for one of them is shown as Figure 3(a) (for full diagrams with gene identifications and blastp results see Figure S2). A total of 68 expansins (23 from Arabidopsis, 45 from rice) are present in the 12 orthologous groups of segments (Figure 2). Rice–Arabidopsis microsynteny also allowed us to identify seven rice segmental duplications not previously described.
Figure 3. Microsynteny and sequence-based trees. (a) Simplified synteny diagram for the orthologous group of genomic segments for clade EXPA-IV. Pentagons represent genes and their orientation. Distances are not to scale. Blue lines connect orthologous groups of genes. Yellow genes denote the ‘best hit’ (closest homolog) in the entire Arabidopsis genome for the connected rice gene. Selected Arabidopsis paralogs (in white connected by gray lines) are included for segment alignment. Expansins (red) are identified by species initial and EXPA gene number. A cladogram to the right explains the duplication history of the segments, with polyploidy events as gray lines. The triple node for event β is due to a small-scale segmental duplication in tandem, close in time to the polyploidy event (the segments that include expansins A3 and A4 are contiguous). (b) Phylogenetic trees based on synteny (1), different phylogenetic methods (2, 3, 4) or a combination of both (5). Bootstrap or posterior probability values are shown above the nodes. Polyploidy events are indicated by dashed lines. Independent lineages in the last ancestor of monocots and eudicots are identified by diamonds. A gray triangle indicates the position of the closest Pinus protein when added to the tree.
Each orthologous group of segments includes between four and 43 orthologous groups of genes (including expansins) with representation in both species (average 20). These groups are shown connected by blue lines in Figure 3(a). Most of them (68–98% in different segment groups) include an Arabidopsis gene that is the ‘best hit’ in a blastp search of the entire Arabidopsis genome for one of the rice genes in the same orthologous group (yellow symbols in Figure 3a). Searches were done with protein sequences (see Experimental procedures). These results support the orthology of the segment groups used in this study. The smallest group of orthologous segments has just four groups of orthologous genes (three including best hits), but it contains an Arabidopsis segment and a rice segment whose one-to-one synteny had been previously shown to be statistically significant (Salse et al., 2002). The second smallest group has eight genes and seven best hits.
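The best-hit criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function name `best_hit_support`, the data layout and all gene identifiers are hypothetical, and the best-hit table is assumed to have been extracted beforehand from a whole-genome blastp search.

```python
from typing import Dict, List, Set, Tuple

# Each orthologous gene group is a pair: (rice genes, Arabidopsis genes).
GeneGroup = Tuple[Set[str], Set[str]]


def best_hit_support(groups: List[GeneGroup], best_hits: Dict[str, str]) -> float:
    """Fraction of gene groups containing at least one 'best hit'.

    best_hits maps a rice gene to its top-scoring Arabidopsis gene in a
    blastp search of the entire Arabidopsis genome (assumed precomputed).
    A group counts as supported when that top hit lies inside the group.
    """
    if not groups:
        raise ValueError("at least one gene group is required")
    supported = 0
    for rice_genes, arabidopsis_genes in groups:
        if any(best_hits.get(gene) in arabidopsis_genes for gene in rice_genes):
            supported += 1
    return supported / len(groups)


# Hypothetical toy data mimicking the smallest segment group:
# four gene groups, three of which include a best hit.
toy_groups = [
    ({"Os_g1"}, {"At_g1"}),
    ({"Os_g2", "Os_g3"}, {"At_g2"}),
    ({"Os_g4"}, {"At_g3", "At_g4"}),
    ({"Os_g5"}, {"At_g5"}),
]
toy_best_hits = {
    "Os_g1": "At_g1",
    "Os_g3": "At_g2",
    "Os_g4": "At_g4",
    "Os_g5": "At_g9",  # best hit lies outside the syntenic segments
}
toy_support = best_hit_support(toy_groups, toy_best_hits)
```

With these toy inputs, three of the four groups are supported. A group with several rice genes is counted once, however many of its members have a best hit within the group.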
We used segmental duplication dates from the literature (Bowers et al., 2003; Paterson et al., 2004; Simillion et al., 2002) to construct cladograms for each orthologous group of segments. An example can be seen on the right side of Figure 3(a) (all cladograms can be found in Figure S2; for the duplication dates on which they are based see Supplementary Text). Both published analyses of the Arabidopsis genome are in general agreement as to the relative dating of individual segmental duplication events. The separation in time between the polyploidy events in the Arabidopsis lineage is also large enough that the assignment of individual segmental duplications to each of them is mostly straightforward. However, some duplicated segments in Arabidopsis appear in tandem and seem to be the result of small-scale events independent of whole genome duplications (yellow dots in Figure 2). We used expansin phylogenetic trees and parsimony considerations in addition to segmental duplication dates from the literature to establish their relative position with respect to the three genome duplications. In rice, all segmental duplications that are not linked to event ρ seem to be older, according to expansin phylogenetic trees.
Finally, in order to construct these cladograms it is necessary to determine the dating of the polyploidy events in relation to the divergence of the rice and Arabidopsis lineages. Phylogenetic trees of duplicated genes have been interpreted as indicating that event γ is shared between monocots and eudicots (Bowers et al., 2003; Chapman et al., 2004). We believe, on the other hand, that the evidence presented in these studies is not conclusive on this point and does not exclude the possibility that event γ happened in the Arabidopsis lineage after divergence from rice, where it has not yet been detected (see Discussion). Because this alternative hypothesis agrees better with the pattern of gene losses observed in the studied segments and also produces a more parsimonious tree for the expansins, we have adopted it for the segmental cladograms in Figure 3(a) and Figure S2. However, we have also considered the implications of a shared event γ. We explore below the consequences of this hypothesis for expansin phylogeny as well as for the estimated number of gene births and deaths.
Segmental cladograms are expected to parallel the cladograms of the individual genes included in them (see Discussion). With this assumption, we constructed position-based cladograms for the 12 groups of expansin genes in orthologous segments (Figure 2). A similar exercise was done assuming that event γ happened in the common lineage of monocots and eudicots (Figure S3).
A practical case
Figure 3(a) shows a simplified synteny diagram for an orthologous group of genomic segments, one from rice and nine from Arabidopsis. This group includes one rice expansin (OsEXPA7, abbreviated as R7) and five from Arabidopsis (A3, A4, A6, A9 and A16). A total of 28 genes (or tandems of related genes) from the rice segment have putative orthologs in one or several of the Arabidopsis segments (23 best hits are shown in yellow). The cladogram for these segments is shown to the right and an expansin cladogram can be deduced directly from it (tree 1 in Figure 3b).
In contrast, sequence-based phylogenetic analyses of this same group of expansin genes yielded three different results (trees 2, 3 and 4 in Figure 3b), none of which agrees with the synteny tree (whatever the timing of event γ). To make these sequence-based phylogenies compatible with the synteny tree would require the existence of a very ancient gene tandem and an unlikely sequence of gene losses. Neighbor-joining and parsimony trees are close to each other, but they are incompatible with the Bayesian tree. All of the sequence-based phylogenies require more independent lineages in the ancestor of monocots and eudicots (diamonds in Figure 3b), lineages that are not supported by extant descendents in rice.
In this case, the main difference between position and sequence-based analyses concerns the correct rooting of the group. Trees 2 and 4 (and tree 3, but for a poorly supported branch) would be identical to the synteny tree if the root were moved to the rice branch. It is noteworthy that protein distances between R7 and the group (A3, A4, A16 and A6) are the smallest in the entire superfamily for interspecific pairs (R7/A9 is the eighth smallest). Surprisingly, when a gymnosperm gene (The Institute for Genomic Research, TIGR gene index no. TC46521 from Pinus) is included in the sequence trees, it groups with rice in trees 3 and 4 (triangle in Figure 3b). This topology would thus require some of the closest Arabidopsis/rice homologs to be paralogs that predate the angiosperm–gymnosperm split, which is twice as old as that of monocots and eudicots, an unlikely proposition. A simpler hypothesis is found in the synteny tree. The rooting problem in trees 2–4 could be due to unequal rates of evolution and long-branch attraction (Felsenstein, 1978). Once we decided to accept the synteny tree, we studied the three alternative topologies for the apparent triple node linked to event β. We adopted the solution shown in tree 5, with the tandem segmental duplication happening before the polyploidy event, as most compatible with sequence-based trees (see Supplementary Text).
The other position-based cladograms were merged with sequence-based phylogenetic trees in a similar way (see Supplementary Text for detailed explanations), with the end result shown in Figure 2. To resolve branches not linked to segmental duplications, DNA trees were created for each orthologous group of expansins (Figure S4). In cases of conflict, preference was given to positional information and the most parsimonious solution was always chosen. A few branches well supported by sequence-based trees were ignored in the integrated solution (asterisks in Figure 1; see also Figure S1). All these cases showed suspected rooting problems similar to the one just described (see Supplementary Text). Position-based cladograms were also useful in providing independent confirmation for topologies that were suggested by sequence-based trees but had low support, or where different trees contradicted each other. In a couple of cases, they also resolved orthologous groups missed by all the sequence-based trees (see below). An alternative set of cladograms under the assumption that event γ happened in the common lineage of Arabidopsis and rice is provided as Figure S3.
The integrated tree can be viewed as the most parsimonious hypothesis with respect to the number of orthologous groups, gene births and deaths. It incorporates new information about topologies and branch lengths that is independent of expansin sequences, makes testable predictions and alerts us to problematic nodes that require further study. In the end, deciding in particular cases between sequence-based and position-based topologies is a matter of judgment and should be seen as a provisional solution. An increased taxon sampling in the problematic areas could eventually help to reconcile the contradicting topologies.
A fresh view of the superfamily
Using our integrated cladograms, the four families can be divided into 17 orthologous clades, which we have designated with roman numerals (Figure 2). Each clade contains all Arabidopsis and rice expansins that descend from the same gene in their last common ancestor. A previous attempt at dividing the EXPA family (Link and Cosgrove, 1998), while limited to the few sequences known at that time, is nonetheless in general agreement with our results. Subgroups A, B, C and D in this classification correspond to clades IV, III, I and V, respectively.
According to our analysis, the last common ancestor of monocots and eudicots had 15 to 17 expansin genes (10–12 EXPA, 2 EXPB, 1 EXLA and 2 EXLB). The uncertainty is due to two cases where phylogenetic trees suggest (although with poor support) that a pair of clades without synteny might actually be a single orthologous group (indicated by brackets in Figure 2; see Supplementary Text for details). If this were true, gene movement or severe loss of flanking genes could account for the lack of synteny. Analyses of additional genomes might resolve this uncertainty. It is also possible that additional ancestral genes existed but were lost in both lineages. However, the small number of unilateral clade losses argues against this view. The rice lineage seems to have lost at most three clades unilaterally, while Arabidopsis only one or two. Assuming that clade losses are random, the likelihood of many double losses is very low. New genomic sequences from early branching eudicots and monocots would allow this assumption to be tested more thoroughly. Finally, two more EXPA genes would be required in the last common ancestor of rice and Arabidopsis if event γ had already happened by then.
Another conclusion we can draw from this analysis is that most expansin genes have not moved from their genomic neighborhood since the separation of the Arabidopsis and rice lineages. Only eight translocation events suffice to explain all the cases where synteny was not detected. Moreover, microsynteny allows us to determine which gene locations are ancestral. We can say, for example, that a tandem of two Arabidopsis genes (AtEXPB6/AtEXPB2) has moved recently from the neighborhood of AtEXPB4 to a new location in chromosome I (see Supplementary Text). It is notable that translocations only occurred in clades with tandem duplications.
The last common ancestor of monocots and eudicots could have lived as recently as 140–170 Ma (Leebens-Mack et al., 2005; Sanderson et al., 2004). Since then, the size of the expansin superfamily appears to have doubled in the Arabidopsis lineage and more than tripled in rice. In Arabidopsis, at least 12 out of 21 new and surviving genes appeared through segmental duplications, all but one probably in the course of a genome doubling. In rice, segmental duplications explain 17 of the 44 new and surviving genes and event ρ seems responsible for 10 of those. At least six segmental duplications appear to predate this event, pointing to the possibility of an older genome duplication in the rice lineage.
At least 20 surviving expansin genes arose through tandem duplications in rice, compared with eight in Arabidopsis. This is in line with the relative deficit of recent tandems in Arabidopsis when compared with rice or other plants (Blanc and Wolfe, 2004b). Furthermore, tandem duplications are concentrated in just five clades. In contrast, segmental duplications have increased gene numbers in at least 10 of the 17 clades. The massive and asymmetrical growth of clades EXPA-V and EXPB-I accounts for most of the extant expansins in rice and may be related to the evolution of a distinctive cell wall composition in grasses (Carpita, 1996). It seems clear from our analysis that the growth of these clades involved both tandem and segmental duplications and that it was already well under way before event ρ: that is, before the divergence of the cereal grasses (Paterson et al., 2004).
In our analysis, we distinguish two classes of inferred gene deaths: ‘observed’ deaths (inferred from segments in the orthologous groups lacking expansin genes) and ‘assumed’ deaths (inferred from the assumption of full genome duplications).
We count ‘observed’ deaths from genomic segments that descended from an expansin-containing segment, but that no longer contain an expansin gene. When phylogenetic trees exclude the possibility of translocation, we can safely conclude that an expansin gene once existed there and later disappeared. Observed deaths are indicated by empty boxes in Figure 2 (see also Figure S2). In some cases, two or more empty paralogous segments can be explained by a single gene death in an ancestral segment.
In the Arabidopsis branch of clade EXPB-I, a tandem duplication predates event α (Figure 2). The absence of the expected duplicate of AtEXPB2 next to AtEXPB5 is taken as evidence of another gene death. Two similar cases could be argued in rice for clade EXPA-V, but due to uncertain phylogenies we have excluded them from our estimates. We conclude that a minimum of 28 expansin genes were lost in this way in the Arabidopsis lineage and nine in the rice lineage. In most cases, the expansin genes have disappeared without leaving a trace. In a single case, a small gene fragment can still be identified in one of the empty segments. It dates to event α and is 78% identical, over 212 bp, to the end of AtEXPA17 (Figure S2 and Supplementary Text).
In addition to these ‘observed’ deaths, the assumption of whole genome duplications lets us infer additional deaths, even if we cannot identify the empty paralogous segments, which may have disappeared in large deletions. Event α requires at least three such ‘assumed’ deaths (marked as circles in Figure 2). With event β, the number increases to 11. If event γ was also a genome duplication (the evidence is weaker), nine more deaths are required for a total of 20 assumed deaths. We did not make similar estimates for rice because its genomic evolution is less well understood.
In summary, assuming three polyploidy events since divergence from the rice lineage, we estimate that growth of the expansin superfamily in the Arabidopsis lineage involved a minimum of 48 gene deaths, which were more than offset by a minimum of 67 gene births (branch points in Figure 2), 56 of which are due to events α–γ, three due to independent segmental duplications, and eight due to tandem duplications. The relevant numbers if event γ happened in the common lineage of Arabidopsis and rice are 34 gene deaths and 53 gene births since the last common ancestor (Figure S3). It is noteworthy that gene deaths equaled gene births in three of the clades preserved in both species (EXPA-II, EXPA-VIII and EXPA-IX), so that one-to-one orthologous relationships have been preserved despite multiple rounds of polyploidy. Perhaps the genes in these clades are highly specialized, thus impeding functional divergence.
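The bookkeeping behind these minimum estimates can be laid out as a short worked tally. The sketch below only restates the counts given in the text for the Arabidopsis lineage under the hypothesis of three lineage-specific polyploidy events; the variable names are illustrative, and the final comparison with the ancestral estimate is a consistency check inferred from the reported numbers rather than a separately reported result.

```python
# Minimum gene deaths in the Arabidopsis lineage (counts taken from the text).
observed_deaths = 28           # empty segments in the orthologous groups
assumed_deaths_cumulative = {  # running totals implied by each polyploidy event
    "alpha": 3,
    "alpha+beta": 11,
    "alpha+beta+gamma": 20,
}
total_deaths = observed_deaths + assumed_deaths_cumulative["alpha+beta+gamma"]

# Minimum gene births, broken down by mechanism (counts taken from the text).
births_by_mechanism = {
    "polyploidy events alpha-gamma": 56,
    "independent segmental duplications": 3,
    "tandem duplications": 8,
}
total_births = sum(births_by_mechanism.values())

# Net change in family size since the last common ancestor.
net_gain = total_births - total_deaths

# Consistency check (an inference, not a reported figure): subtracting the
# net gain from the 36 extant Arabidopsis expansins gives the implied
# ancestral count, to be compared with the estimated range of 15 to 17.
extant_arabidopsis = 36
implied_ancestral = extant_arabidopsis - net_gain

print(total_deaths, total_births, net_gain, implied_ancestral)
```

The implied ancestral count falls at the upper end of the 15 to 17 genes estimated for the last common ancestor of monocots and eudicots.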
Because polyploidy events allow us to tentatively date most of the gene duplications in the Arabidopsis lineage, we can also estimate the minimal size of the superfamily at different times. We estimate at least 16 genes before event γ, 19 before event β, and 22 before the most recent polyploidy event (Figure 4). Gene loss is shown as a straight line in Figure 4, but it is not expected to be uniform in the long intervals between polyploidy events. Maize, which has been a tetraploid for just 12 Myr, already seems to have lost 50% of its duplicated genes (Lai et al., 2004). This kind of analysis could also allow predictions to be made as to the structure of the expansin superfamily in particular eudicot species, once their divergence from the Arabidopsis lineage had been dated with respect to the polyploidy events. We would know, for example, if we should be looking for orthologs of individual Arabidopsis genes or for orthologs of several duplicated genes. This sort of information could help in exporting knowledge from model species to economically important ones.
Figure 4. Changes in the minimal estimated number of expansin genes from the last common ancestor (LCA) of eudicots and monocots to present-day Arabidopsis. Polyploidy events (vertical lines) are identified as in the text. Other gene gains are shown in the intervals. TD = tandem duplications, SD = segmental duplications.
The Columbia ecotype of Arabidopsis contains two expansin pseudogenes: ΨAtEXPB6 is missing the end of the promoter and most of exon 1, whereas ΨAtEXPA19 has a 2-bp insertion that creates a premature stop codon. However, when we sequenced these genes in the Landsberg ecotype, we found them to be normal (undisrupted) genes (GenBank accession nos AY619565 and AY843212). In rice, the Japonica genome contains one pseudogene, ΨOsEXPB13, with a premature stop codon. In the Indica genome, this stop codon does not exist, but a 1-bp deletion causes a similar result (GenBank accession no. AAAA02007691). It appears that this gene was independently inactivated in both cultivars.
Nucleotide and amino acid composition
Due to its potential to distort phylogenetic trees, we investigated amino acid bias in the expansin superfamily. In Arabidopsis, the distribution of guanine + cytosine (GC) content for coding regions is unimodal and centered around 44%, while in rice it is much broader, with two main peaks, around 45–50 and 65–70% (Wang et al., 2004). Rice expansins, with an average 66% GC content, are mostly in the GC-rich category. The differences in GC content between Arabidopsis and rice are mainly due to third codon positions, but recently it was shown that rice GC-rich genes, when compared to their closest Arabidopsis homologs, are also enriched in certain amino acids (GARP; glycine, alanine, arginine and proline) encoded by GC-rich codons, while simultaneously depleted of amino acids (FYMINK; phenylalanine, tyrosine, methionine, isoleucine, asparagine and lysine) encoded by GC-poor codons (Wang et al., 2004).
We compared the average GC-rich and GC-poor amino acid content in the rice and Arabidopsis branches of each of the 12 clades represented in both species (Figure 5). All of them show a consistent shift, with rice clades showing, on average, an 18% increase in GC-rich amino acids when compared with their Arabidopsis orthologs (from 0.29 to 0.34) and a 25% decrease in GC-poor amino acids (from 0.26 to 0.20). This shift in amino acids probably contributes to the difficulties in phylogenetic analysis noted earlier and may be a common problem in sequence-based phylogenetic analyses that include rice and Arabidopsis genes.
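As an illustration of the comparison above, the following minimal Python sketch computes the proportions of GC-rich (GARP) and GC-poor (FYMINK) amino acids in protein sequences and averages them over a set of sequences. The function names and the toy sequence are hypothetical; averaging per-sequence fractions is one plausible reading of the 'average' used here, not a confirmed description of the original calculation.

```python
from typing import Iterable, Tuple

GC_RICH = set("GARP")    # glycine, alanine, arginine, proline
GC_POOR = set("FYMINK")  # Phe, Tyr, Met, Ile, Asn, Lys


def composition(protein: str) -> Tuple[float, float]:
    """Return (GC-rich fraction, GC-poor fraction) for one protein sequence.

    Non-letter characters such as '*' (stop) or '-' (gap) are ignored.
    """
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    rich = sum(aa in GC_RICH for aa in residues)
    poor = sum(aa in GC_POOR for aa in residues)
    return rich / len(residues), poor / len(residues)


def branch_average(proteins: Iterable[str]) -> Tuple[float, float]:
    """Average the per-sequence fractions over all sequences in one branch."""
    values = [composition(p) for p in proteins]
    if not values:
        raise ValueError("at least one sequence is required")
    rich_mean = sum(v[0] for v in values) / len(values)
    poor_mean = sum(v[1] for v in values) / len(values)
    return rich_mean, poor_mean


# Toy sequence containing each of the ten residues exactly once.
toy_rich, toy_poor = composition("GARPFYMINK*")
```

For the toy sequence, four of the ten residues are GC-rich and six are GC-poor. Applied to the rice and Arabidopsis members of one clade, `branch_average` would give the kind of paired values plotted for that clade in Figure 5.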
Figure 5. Changes in amino acid composition related to %GC. For each expansin clade, the average proportions of GC-rich and GC-poor amino acids are plotted for rice (gray) and Arabidopsis (white), connected by a line. Circles are for EXPA clades, triangles for EXPB and diamonds for EXLA. Clade numbers as in Figure 2.