Gene family evolution in green plants with emphasis on the origination and evolution of Arabidopsis thaliana genes


  • Ya-Long Guo

    Corresponding author
    1. Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    • State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China


Gene family size variation is an important mechanism that shapes natural variation for adaptation in various species. Despite its importance, the pattern of gene family size variation in green plants is still not well understood. In particular, the evolutionary pattern of genes and gene families in the model plant Arabidopsis thaliana remains unknown in the context of green plants. In this study, eight representative genomes of green plants are sampled to study gene family evolution and to characterize the origination of A. thaliana genes. Four important insights are gained: (i) the rate of gene gains and losses is about 0.001379 per gene per million years, similar to the rate in yeast, Drosophila, and mammals; (ii) some gene families evolved rapidly with extreme expansions or contractions, and 2745 gene families present in all eight species represent the ‘core’ proteome of green plants; (iii) 70% of A. thaliana genes can be traced back to 450 million years ago; and (iv) intriguingly, A. thaliana genes with early origination are under stronger purifying selection and are more conserved. In summary, the present study provides genome-wide insights into the evolutionary history and mechanisms of genes and gene families in green plants, and especially in A. thaliana.


Most genes can be grouped into gene families based on sequence similarity. Gene family size is variable across different lineages and may have important functional outcomes related to adaptation or speciation (Lynch and Conery, 2000; Nei and Rooney, 2005).

Various factors can alter the size of gene families (De Bodt et al., 2005; Maere et al., 2005; Hahn et al., 2007; Freeling, 2009; Proost et al., 2011; Tautz and Domazet-Loso, 2011; De Smet and Van De Peer, 2012). First, gene duplication or whole-genome duplication (polyploidy) increases gene family size. Second, gene deletion decreases gene family size. Third, the de novo creation of genes creates new gene families. Gene duplication is an ongoing contributor to genome evolution, and its rate is of the same order of magnitude as the mutation rate per nucleotide site (Lynch and Conery, 2000, 2003; Gu et al., 2002; McLysaght et al., 2002; Simillion et al., 2002; Doyle et al., 2008; Flagel and Wendel, 2009; Van De Peer et al., 2009; Van De Peer, 2011). While duplication and de novo creation of genes increase the gene number, gene deletion decreases it, through the loss of either a single gene or several genes within a segmental fragment.

Although variation in gene family size has been the subject of many studies, most analyses have focused on only one or a few gene families across a small number of plant genomes (Shiu et al., 2004; Chardon and Damerval, 2005; Kim et al., 2006; Leseberg et al., 2006; Li et al., 2006; Porter et al., 2009; Guo et al., 2011a), because few plant genomes were available or robust pipelines were lacking. Recently, two methods have been developed to address the evolutionary pattern of genes and gene families at the whole-genome level: the likelihood approach implemented in CAFE (Computational Analysis of gene Family Evolution) (De Bie et al., 2006) and the gene emergence approach (Domazet-Lošo et al., 2007; Cai et al., 2009; Zhang et al., 2010a; Banks et al., 2011). The likelihood approach uses a stochastic birth and death process to model the evolution of gene family size over a phylogeny (Hahn et al., 2005, 2007; De Bie et al., 2006; Demuth et al., 2006). In contrast, the gene emergence approach, called ‘phylostratigraphy’, maps genes to different phylogenetic ranks (internodes), termed ‘phylostrata’, according to their emergence as inferred by similarity searches in representative genomes (Domazet-Lošo et al., 2007; Prachumwat and Li, 2008; Cai et al., 2009; Zhang et al., 2010b, 2011; Banks et al., 2011; David and Alm, 2011). Recently, this approach has also been used to study the evolution of protein domains and of transcription-associated proteins (Lang et al., 2010; Kersting et al., 2012; Zhang et al., 2012).

The comparison of gene families among closely related and representative species of green plants (Plantae or Viridiplantae), comprising land plants and green algae, allows a comprehensive study of gene and gene family evolution in green plants. Eight representative genomes of green plants were compared, including two Arabidopsis genomes (Figure 1). Here I investigate several fundamental questions regarding gene and gene family evolution in green plants: (i) the distribution of gene family size; (ii) the rate of gene gain and loss, and the rate of expansion or contraction per gene family on each lineage; and (iii) for A. thaliana genes in different phylostrata, the variation in the fixation rate of genes or gene families, in functional annotation, in protein size, in expression level, and in selection strength. In summary, the analyses of gene families in the eight representative genomes provide in-depth insights into the evolutionary genomics of green plants.

Figure 1.

Phylogeny and divergence time (million years ago) of the eight green plants.


Gene family size distribution

All proteins from the eight genomes are used for the following analyses, since none of them is homologous to transposable elements. Markov Cluster Algorithm (MCL) clustering of these proteins resulted in 55 989 gene families: 13.6% of genes are singletons, 4.6% are grouped into two-gene families, and 81.7% are grouped into multigene families (Table 1; see Data S1 for details).

Table 1. Numbers of gene families (and genes) in green plants

Species                     Singletons (a)   Two-genes (b)  ≥3 (c)           Total gene families (d)  Max_gf (e)
Chlamydomonas reinhardtii   9088 (9088)      890 (1780)     697 (4293)       10 675 (15 161)          132
Physcomitrella patens       11 513 (11 513)  2081 (4162)    2115 (20 241)    15 709 (35 916)          4322
Selaginella moellendorffii  1717 (1717)      4348 (8696)    2627 (24 262)    8692 (34 675)            1541
Oryza sativa                13 597 (13 597)  2201 (4402)    2589 (37 633)    18 387 (55 632)          2208
Sorghum bicolor             7579 (7579)      1850 (3700)    2419 (24 308)    11 848 (35 587)          1258
Populus trichocarpa         6525 (6525)      2806 (5612)    3465 (33 235)    12 796 (45 372)          335
Arabidopsis lyrata          6551 (6551)      1964 (3928)    2592 (21 990)    11 107 (32 469)          226
Arabidopsis thaliana        5980 (5980)      1689 (3378)    2054 (17 481)    9723 (26 839)            204
All eight genomes           38 330 (38 330)  6548 (13 096)  11 111 (230 225) 55 989 (281 651)         4322

  a The number of singletons.
  b The number of gene families (genes) with two genes.
  c The number of gene families (genes) containing at least three genes.
  d The total number of gene families (genes).
  e The maximum gene family size.


The clustering of genes in each genome indicates that the number of gene families ranges from 8692 (S. moellendorffii) to 18 387 (O. sativa), and the number of genes within gene families varies from 15 161 (C. reinhardtii) to 55 632 (O. sativa) (Figure 2a and Table 1). In addition, the proportion of singletons in each genome ranges from 5% (S. moellendorffii) to 59.9% (C. reinhardtii); genes in two-gene families from 7.9% (O. sativa) to 25.1% (S. moellendorffii); and genes in multigene families from 28.3% (C. reinhardtii) to 73.3% (P. trichocarpa). The maximum gene family size varies from 132 (C. reinhardtii) to 4322 (P. patens). Interestingly, for closely related species within the same genus or family, such as O. sativa and S. bicolor, or A. lyrata and A. thaliana, the proportions of genes in the different classes are similar. Most genes are grouped into multigene families in each genome except C. reinhardtii, in which singletons make up more than half of the total genes (59.9%). However, only 2.4% (217/9088) of singletons in C. reinhardtii can be annotated to Gene Ontology (GO) terms, which are related to cellular process, metabolic process, response to stimulus, developmental process, cellular component organization, and multicellular organismal process.
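The proportions quoted above follow directly from the gene counts in Table 1. The short sketch below is illustrative only and is not part of the original analysis; the dictionary and function names are hypothetical, and the numbers are copied from the parenthesized gene counts in Table 1.

```python
# Gene counts per size class, copied from Table 1:
# (singletons, genes in two-gene families, genes in families of >= 3 genes).
TABLE1_GENES = {
    "C. reinhardtii": (9088, 1780, 4293),
    "P. patens": (11513, 4162, 20241),
    "S. moellendorffii": (1717, 8696, 24262),
    "O. sativa": (13597, 4402, 37633),
    "S. bicolor": (7579, 3700, 24308),
    "P. trichocarpa": (6525, 5612, 33235),
    "A. lyrata": (6551, 3928, 21990),
    "A. thaliana": (5980, 3378, 17481),
}


def class_percentages(counts):
    """Return the percentage of genes in each size class, rounded to 0.1."""
    total = sum(counts)
    return tuple(round(100.0 * c / total, 1) for c in counts)


PERCENTAGES = {sp: class_percentages(c) for sp, c in TABLE1_GENES.items()}

for species, (single, two, multi) in PERCENTAGES.items():
    print(f"{species:20s} singletons {single:5.1f}%  "
          f"two-gene {two:5.1f}%  multigene {multi:5.1f}%")
```

Summing each tuple also reproduces the per-genome gene totals in Table 1, which gives a quick consistency check on the table.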

Figure 2.

Proportions of the genes that are singletons, in two-gene families, or in multigene families. (a) Gene family sizes at whole-genome level. (b) Gene family sizes of orphan genes.

Orphan genes and loss of entire gene families in Arabidopsis

Each genome contains many orphan genes, which do not have homologues in the other genomes. The proportion of orphan genes differs among genomes: it is 61.5% in C. reinhardtii, whereas it is lower in the other species, especially in the two Arabidopsis species, A. thaliana (5.3%) and A. lyrata (12.3%) (Figure 2b and Table 2). The maximum orphan gene family size ranges from 17 (A. thaliana) to 4322 (P. patens); the total number of orphan gene families varies from 1328 (A. thaliana) to 10 812 (O. sativa); and the number of genes within orphan gene families varies from 1430 (A. thaliana) to 19 351 (P. patens). Among the eight genomes, 16.8% (S. moellendorffii) to 89.1% (A. thaliana) of orphan genes are singletons; 4.8% (A. thaliana) to 31.3% (S. moellendorffii) are in two-gene families; and 6.2% (A. thaliana) to 51.9% (S. moellendorffii) are in multigene families.

Table 2. Number of orphan gene families (and genes) in green plants

Species                     Singletons (a)   Two-genes (b)  ≥3 (c)           Total gene families (d)  Max_gf (e)
Chlamydomonas reinhardtii   6619 (6619)      443 (886)      317 (1817)       7379 (9322)              68
Physcomitrella patens       9001 (9001)      700 (1400)     368 (8950)       10 069 (19 351)          4322
Selaginella moellendorffii  1356 (1356)      1263 (2526)    600 (4184)       3219 (8066)              129
Oryza sativa                9534 (9534)      775 (1550)     503 (4961)       10 812 (16 045)          562
Sorghum bicolor             3600 (3600)      394 (788)      265 (3663)       4259 (8051)              1258
Populus trichocarpa         4667 (4667)      688 (1376)     343 (2055)       5698 (8098)              82
Arabidopsis lyrata          2279 (2279)      272 (544)      180 (1165)       2731 (3988)              99
Arabidopsis thaliana        1274 (1274)      34 (68)        20 (88)          1328 (1430)              17

  a The number of singletons.
  b The number of gene families (genes) with two genes.
  c The number of gene families (genes) containing at least three genes.
  d The total number of gene families (genes).
  e The maximum gene family size.

To understand the functional characteristics of the orphan genes, the well-studied model species A. thaliana was selected as a representative for GO functional annotation. Of the 1430 A. thaliana orphan genes, 1042 (72.9%) can be annotated to GO terms, though most (93.0%) are classified as unknown biological processes; the other terms are other cellular processes, other metabolic processes, protein metabolism, transcription, and response to stress.

Another interesting class of genes comprises those with homologues in one Arabidopsis species and in at least one of the other six species, but not in the other Arabidopsis species; such genes could have been deleted from the latter. There are 98 gene families (174 genes; minimum 1, maximum 8, average 1.7755 genes per family) that contain A. lyrata genes but no A. thaliana genes, and 117 gene families (163 genes; minimum 1, maximum 7, average 1.3932) that contain A. thaliana genes but no A. lyrata genes. Based on these genes, which are unique to one Arabidopsis species but shared with other species, the gene extinction rate for entirely lost gene families can be calculated. Assuming that A. thaliana and A. lyrata diverged 13 million years ago (Beilstein et al., 2010; Guo et al., 2011b) (Figure 1), and normalizing to the total gene number, the extinction rate in A. thaliana is 0.00050 extinctions/million years/gene [174/(13 × 27 025)]; in A. lyrata, it is 0.00038 extinctions/million years/gene [163/(13 × 32 670)].
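The extinction rate calculation above can be reproduced with a few lines of code. This is a minimal sketch for illustration, with a hypothetical function name; the inputs are the gene counts, divergence time, and genome sizes given in the text.

```python
def extinction_rate(lost_genes, divergence_my, total_genes):
    """Genes lost per million years per gene in the genome."""
    return lost_genes / (divergence_my * total_genes)


DIVERGENCE_MY = 13  # assumed A. thaliana / A. lyrata divergence time (million years)

# Families retained only in A. lyrata (174 genes) are counted as lost in A. thaliana.
rate_thaliana = extinction_rate(174, DIVERGENCE_MY, 27025)
# Families retained only in A. thaliana (163 genes) are counted as lost in A. lyrata.
rate_lyrata = extinction_rate(163, DIVERGENCE_MY, 32670)

print(f"A. thaliana: {rate_thaliana:.5f} extinctions/million years/gene")
print(f"A. lyrata:   {rate_lyrata:.5f} extinctions/million years/gene")
```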

Gene family expansion or contraction

The likelihood approach assumes that there is at least one ancestral gene in the most recent common ancestor (MRCA); thus only gene families that are shared between C. reinhardtii and at least one of the other seven species are included in the analysis. Of the 3296 families present in the MRCA of green plants, 551 have been lost completely in at least one species. The remaining 2745 gene families, which are present in all eight species, most probably represent the ‘core’ proteome of green plants.

Assuming that each species has the same λ value, the average rate of gene turnover, that is, the rate at which a gene family expands or contracts over time, is estimated to be 0.001379 gains and losses/gene/million years. This estimate implies that within a single genome such as that of A. thaliana, approximately 37.267 new duplicates and 37.267 new losses are fixed every million years (0.001379 gains and losses/gene/million years × 27 025 genes).
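As a quick check of the arithmetic above, the expected number of events per million years is simply the per-gene rate multiplied by the gene number. The sketch below is illustrative only; the constant and function names are hypothetical, and the values are those stated in the text.

```python
TURNOVER_RATE = 0.001379   # gains and losses/gene/million years, as estimated in the text
A_THALIANA_GENES = 27025   # number of A. thaliana proteins used in the text


def expected_events_per_my(rate, n_genes):
    """Expected number of fixed gains (or losses) per million years in one genome."""
    return rate * n_genes


expected = expected_events_per_my(TURNOVER_RATE, A_THALIANA_GENES)
print(f"{expected:.3f} new duplicates and {expected:.3f} new losses per million years")
```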

The average expansions and contractions differ among lineages (Figure 3). Along the phylogenetic ranks, most lineages expanded, including the embryophyte, tracheophyte, angiosperm, and rosid lineages. In contrast, the C. reinhardtii, grass, and Arabidopsis lineages contracted, and the Arabidopsis genus is the most contracted lineage (average contraction size −0.547).

Figure 3.

Expansions and contractions of gene families in green plants. The numbers near the branches show the average expansion (mean number of genes gained or lost per family, where a negative value indicates a net contraction), and the values inside the brackets are the numbers of families with expansions and contractions, respectively.

The likelihood approach can identify individual gene families that evolve at a rate of gain or loss significantly higher than the genome-wide average (Hahn et al., 2005). Of the 3296 gene families, 41 evolved non-randomly at P < 0.0001. Of these 41 rapidly evolving gene families, 37 contain A. thaliana genes, 34 of which can be annotated to GO terms. Rapidly evolving gene families are associated with various biological processes, including other metabolic processes, other cellular processes, response to stress, protein metabolism, other biological processes, and response to abiotic or biotic stimulus (Tables S1 and S2). Nine of the 41 rapidly evolving gene families changed significantly (P < 0.0001) on the terminal branch to A. thaliana. All nine gene families contracted significantly (P < 0.0001), and most of them are affiliated with nucleotide binding, protein binding, DNA or RNA binding, hydrolase activity, and structural molecule activity (Table S2).

Of the nine gene families that changed significantly on the A. thaliana terminal branch, eight contain A. thaliana genes, 43 genes in total, of which 34 have orthologues in A. lyrata. The average dN/dS for these genes is 0.1505 (median 0.0466), lower than that of all A. thaliana proteins within the gene families (average 0.2799, median 0.1786) (Wilcoxon rank sum test, P = 4.525 × 10^-6); the average dS is 0.1480 (minimum 0.0427, maximum 0.3557, median 0.1398), also slightly lower than that of all A. thaliana proteins within the gene families (average 0.2167, median 0.1448), although not significantly so (P = 0.3325). The results indicate that there is no correlation between gene copy number variation and sequence divergence.

Origination of Arabidopsis thaliana genes

The 26 839 A. thaliana genes within gene families are assigned to different phylogenetic ranks (phylostrata) (Figure 4a and Table 3). More than 70% of the genes are assigned to the Viridiplantae (40.6%) and Embryophyte (30.1%) internodes. In addition, the fixation rates of A. thaliana genes and gene families, that is, the rates at which genes (or gene families) appear on the different phylogenetic ranks, are calculated (Figure 4b and Table S3). The A. thaliana internode has the highest gene fixation rate (110.000 genes/MY), while the angiosperm internode has the lowest (6.267 genes/MY).
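The assignment of a gene family to a phylostratum can be sketched as follows. This is an illustrative reading of the gene emergence approach, not the pipeline used in this study: it assumes that a family containing A. thaliana genes is placed at the internode defined by the most distantly related species that also has a member of the family. The mapping of species to internodes and all function names are assumptions made for the example. The fixation rate is taken as the number of genes divided by the internode duration; only the 13 million year A. thaliana internode is used here, because the other internode durations are given in Figure 1 and Table S9 rather than in the text.

```python
# Phylostrata ordered from oldest to youngest. Each is paired with the
# species whose presence in a family places the family at that internode
# (an assumed mapping, for illustration).
PHYLOSTRATA = [
    ("Viridiplantae", {"C. reinhardtii"}),
    ("Embryophyte", {"P. patens"}),
    ("Tracheophyte", {"S. moellendorffii"}),
    ("Angiosperm", {"O. sativa", "S. bicolor"}),
    ("Rosid", {"P. trichocarpa"}),
    ("Arabidopsis", {"A. lyrata"}),
]


def assign_phylostratum(family_species):
    """Return the oldest phylostratum of a family that contains A. thaliana genes."""
    present = set(family_species)
    if "A. thaliana" not in present:
        raise ValueError("family has no A. thaliana member")
    for name, outgroups in PHYLOSTRATA:
        if outgroups & present:
            return name
    return "A. thaliana"  # no homologue in any other genome (orphan)


def fixation_rate(n_genes, internode_my):
    """Genes appearing per million years on an internode."""
    return n_genes / internode_my


print(assign_phylostratum({"A. thaliana", "A. lyrata", "O. sativa"}))
print(f"{fixation_rate(1430, 13):.3f} genes/MY on the A. thaliana internode")
```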

Table 3. Number of A. thaliana gene families (and genes) assigned to each phylostratum

Phylostratum  Internode      Genes (%) (a)    Gene families (min, max, avg) (b)
1             Viridiplantae  10 897 (40.601)  3007 (1, 193, 3.6239)
2             Embryophyte    8074 (30.083)    2216 (1, 204, 3.6435)
3             Tracheophyte   741 (2.761)      290 (1, 28, 2.5552)
4             Angiosperm     1617 (6.025)     732 (1, 25, 2.2090)
5             Rosid          1076 (4.009)     484 (1, 106, 2.2231)
6             Arabidopsis    3004 (11.193)    1666 (1, 134, 1.8031)
7             A. thaliana    1430 (5.328)     1328 (1, 17, 1.0768)

  a The total number of genes assigned to each phylostratum in the A. thaliana genome, with their percentage of the whole genome given in parentheses.
  b The total number of gene families assigned to each phylostratum. Within parentheses, min indicates the minimum gene family size, max the maximum gene family size, and avg the average gene family size.

Figure 4.

Phylostratigraphic analysis of A. thaliana genes. (a) Gene appearance. (b) Gene fixation rate. (c) Gene ontology annotation. A dashed line indicates the fraction of A. thaliana genes within gene families that can be annotated to GO biological process terms. (d) Protein sizes. A dashed line indicates the median protein size of A. thaliana genes within gene families. Significances of the deviations from the median protein size of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel.

Of the A. thaliana genes assigned to the different phylostrata, 86.7% (23 282/26 839) are annotated to GO terms (Figure 4c and Table S4). Among internodes, the proportion ranges from 57.2% (A. thaliana internode) to 92.1% (rosid internode). For the unknown biological process category, Viridiplantae has the lowest proportion along the phylostrata and A. thaliana the highest. For most other biological processes, Viridiplantae has the highest proportion and A. thaliana nearly the lowest (Figure S1 and Table S5). Within the A. thaliana internode, the most abundant terms are unknown biological processes, followed by other cellular processes, other metabolic processes, and transcription.

Along the phylostratigraphy from Viridiplantae to A. thaliana, the median protein size of the A. thaliana genes decreases from 413 (Viridiplantae) to 77 (A. thaliana) (Figure 4d and Table S6), whereas the median protein size for all A. thaliana proteins within the gene families is 351 (minimum 29, maximum 5337, average 407.8). In addition, along the phylostratigraphy, the proportion of expressed genes decreases from 90.1% (Viridiplantae internode) to 37.1% (A. thaliana internode) (Figure 5a and Table S7), and the median expression level decreases from 6.985 (Viridiplantae) to 2.841 (A. thaliana) (Figure 5b and Table S7). In total, 79.2% of A. thaliana genes are expressed (21 248/26 839), and the median expression level is 5.988 (minimum 1.344, maximum 14.050, average 5.866).

Figure 5.

Expression variation of A. thaliana genes among phylostrata. (a) Expressed genes. A dashed line indicates the expression fraction level of A. thaliana genes within gene families. (b) Expression level. A dashed line indicates the median expression level of A. thaliana genes within gene families. Significances of the deviations from the median expression level of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel.

To understand how divergence varies along the phylostratigraphy from Viridiplantae to A. thaliana, the divergence between A. thaliana genes and their orthologues in A. lyrata was calculated. The median dN/dS increases from 0.1389 (Viridiplantae) to 0.4556 (Arabidopsis), and the median dS increases from 0.1407 (Viridiplantae) to 0.1553 (Arabidopsis) (Figure 6 and Table S8). The median dN/dS for all A. thaliana proteins within the gene families is 0.1786, and the median dS is 0.1448.

Figure 6.

Divergence variation of A. thaliana genes among phylostrata, estimated between A. thaliana and A. lyrata. (a) dN/dS. (b) dS. A dashed line indicates the median dN/dS or dS of A. thaliana genes within gene families, respectively. Significances of the deviations from the median dN/dS and dS of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel, respectively.


Family size distribution

The Markov Cluster (MCL) algorithm, a method for clustering proteins into groups based on similarity that has been rigorously tested and validated on a number of databases (Van Dongen, 2000; Enright et al., 2002), has been used for gene family analysis in many taxa, including human (Lander et al., 2001), vertebrates and invertebrates (Prachumwat and Li, 2008), and Drosophila (Drosophila 12 Genomes Consortium, 2007). In this study, using the MCL algorithm, most genes in each genome are grouped into multigene families except in C. reinhardtii, in which singletons make up more than half of the total genes (59.9%). The predominance of singletons in C. reinhardtii suggests that its genome has a lower gene duplicability than the other genomes, possibly because of functional constraints, because some C. reinhardtii genes are too young to have accumulated duplicates, as is the case for vertebrate-specific genes (Prachumwat and Li, 2008), or because of annotation artifacts in C. reinhardtii. However, only 2.4% of singletons in C. reinhardtii can be annotated to a GO term; thus functional constraint is unlikely to be the explanation here. For S. moellendorffii, all gene models, rather than the data set containing only one of the two haplotypes, are used in order to represent the genome sufficiently; consequently, for this genome the number of genes in two-gene families is inflated and the number of singletons is correspondingly reduced.

In total, 2745 gene families are shared among all eight species and most probably represent the ‘core’ proteome of green plants. This number is consistent with another recently published study (Van Bel et al., 2012), in which 2928 core gene families were found via a parsimony method from 25 genomes, although the two studies used different strategies to call the ‘core’ proteome. Of the 2928 gene families found in the previous study, 2207 (75%) are also found in the present study when the representative A. thaliana genes of each core gene family reported in the previous study (Van Bel et al., 2012) are matched to the gene families identified here. Interestingly, two recent studies have indicated that there is also a relatively large and well-conserved core set of protein domains across green plants (Kersting et al., 2012; Zhang et al., 2012). In addition, phylogenetic comparative analysis indicated that the expansion of transcription-associated proteins is positively correlated with the increase in morphological complexity (Lang et al., 2010).

Orphan genes are protein-coding genes that do not have recognizable homologues in other species. Various factors can give rise to orphan genes (Domazet-Lošo and Tautz, 2003; Demuth et al., 2006; Hahn et al., 2007). Theoretically, these factors can be classified into six classes: (i) true loss of all functional genes; (ii) the de novo evolution of new genes; (iii) rapid protein evolution in previously existing genes so that they are no longer identified as being part of a pre-existing family; (iv) horizontal gene transfer; (v) annotation artifacts; and (vi) artifacts of the clustering process. Such genes might frequently be involved in specific ecological adaptations or be the raw fuel for micro-evolutionary divergence (Carroll et al., 2001; Domazet-Lošo and Tautz, 2003; Yang et al., 2009). Not surprisingly, more than half of the genes in the C. reinhardtii and P. patens genomes are orphan genes, given that some genes in C. reinhardtii and P. patens are probably not required for vascular plants. Recently, it has been demonstrated in Drosophila that orphan genes can quickly become essential and play an essential role in development (Chen et al., 2010).

Similar gene gain/loss rate between green plants and animals

The likelihood method implemented in CAFE version 2.0 (De Bie et al., 2006) is a robust pipeline for studying gene family evolution and has been used widely in genome analyses, including those of mammalian genomes (Demuth et al., 2006), Drosophila genomes (Hahn et al., 2007), and the rhesus macaque genome (Rhesus Macaque Genome Sequencing and Analysis Consortium, 2007). Of the 3296 families used for the likelihood analysis, along the phylogenetic ranks, gene families in each lineage expand to different extents, except in the C. reinhardtii, grass, and Arabidopsis lineages. Consistent with the contraction of A. thaliana at the whole-genome level, all nine rapidly evolving gene families contract on the terminal branch to A. thaliana. In Drosophila, it has been demonstrated that some gene families (defense response, proteolysis, and trypsin activity) are evolving rapidly at both the copy number and the sequence level (Drosophila 12 Genomes Consortium, 2007; Hahn et al., 2007). However, in Arabidopsis, there is no correlation between copy number variation and sequence divergence for the genes of rapidly evolving gene families, which suggests that the relationship between copy number variation and sequence divergence is more complex.

In the present study, the rate of gene gains and losses is about 0.001379 gains and losses/gene/million years. This rate is a little lower than some previous estimates. For instance, the average duplication rate of a eukaryotic gene was estimated to be on the order of 0.01/gene/million years, and 0.002 for A. thaliana (Lynch and Conery, 2000, 2003). However, this rate is similar to that in Drosophila (0.0012; Hahn et al., 2007) and in mammals (0.0016; Demuth et al., 2006). Because this result is consistent with the average rate of duplication of a eukaryotic gene, and because eight representative genomes of green plants are used, the estimated rate is likely to be reliable for A. thaliana relatives and for green plants as well.

One important factor that affects the estimation of the rate of gene gains or losses in certain lineages is polyploidy (whole-genome duplication). Most green plants, including several of the angiosperm species studied here, are paleopolyploids, and whole-genome duplication is normally followed by gene loss and diploidization (Simillion et al., 2002; Blanc and Wolfe, 2004; Adams and Wendel, 2005; Doyle et al., 2008; Soltis et al., 2009; Koh et al., 2010; Jiao et al., 2011; Proost et al., 2011). Therefore, at the level of green plants, whole-genome duplication likely raises the rate of gene gains and losses. With the availability of more genomes from various plant lineages, it would be interesting to address how these paleopolyploidy events affect the rate of gene gains or losses in specific plant lineages that experienced whole-genome duplication.

Origin of Arabidopsis thaliana genes

The gene emergence approach has been used to study the evolutionary emergence of genes at the whole-genome level in individual species or lineages at different phylogenetic levels (Domazet-Lošo et al., 2007; Domazet-Lošo and Tautz, 2008, 2010a,b; Prachumwat and Li, 2008; Zhang et al., 2010a,b, 2011; Banks et al., 2011). The significant differences in gene emergence across the phylostratigraphic map support the assumption that some of these genes have retained a signal of their evolutionary history. In this way, phylostratigraphy can provide a picture of evolutionary history using extant species. In this study, more than 70% of A. thaliana genes are traced back to 450 million years ago. Interestingly, the A. thaliana internode also has the highest gene fixation rate, which might be due to a high gene origination rate or to more complete annotation, since much more expression data are available for annotating this genome. Transposition in certain families is a possible cause of the high gene origination rate: a recent study indicates that specific gene families can transpose at specific points in evolutionary time, especially after whole-genome duplication events in the Brassicales (Woodhouse et al., 2011).

The fraction of expressed genes, the expression level, and the protein size all decrease along the phylostratigraphy from Viridiplantae to A. thaliana. In yeast, there is an inverse relationship between protein size and protein expression level (Warringer and Blomberg, 2006). In contrast, in A. thaliana the oldest genes are both much longer and more highly expressed. In addition, the comparison between A. thaliana and A. lyrata indicates that transposable elements and small RNAs contribute to gene expression divergence (Hollister et al., 2011). Most probably, different lineages or species have different regulatory mechanisms. For example, even compared with its sister species A. lyrata, A. thaliana has a much higher intron loss rate (Fawcett et al., 2012) and deletion rate, and also tends to delete longer fragments (Hu et al., 2011).

Interestingly, the Viridiplantae internode has the lowest dN/dS, while the Arabidopsis internode has the highest. This probably suggests that older genes have been under stronger purifying selection, since older genes are mostly fundamental for the survival of green plants. Accordingly, the older genes are more conserved, with a slower evolutionary rate (synonymous substitution rate). Evolutionary rate can be shaped by various factors (Albà and Castresana, 2005; Larracuente et al., 2008; Singh et al., 2009; Liao et al., 2010; Gaut et al., 2011; Yang and Gaut, 2011). Recent comparisons between the two available Arabidopsis genomes provide more robust results. The rates of protein evolution are correlated mainly with expression level and specificity (Slotte et al., 2011), and rates at synonymous sites vary on a genomic scale and might be a function of recombination rate and of evolutionary rates at non-synonymous sites, strongly correlated with expression pattern and with whether a gene is retained after a whole-genome duplication (Yang and Gaut, 2011). In addition, 1939 annotated protein-coding genes with little evidence of expression in A. thaliana are shorter and evolved with a two-fold higher rate of non-synonymous substitutions (Yang et al., 2011). This finding is consistent with what is found here: younger genes are shorter, have a higher non-synonymous substitution rate, and are expressed at lower levels.

Furthermore, it would be interesting to study the dN/dS ratio in other species pairs. Similar comparisons have been made between A. thaliana and poplar and between two conifers (Picea sitchensis and Pinus taeda) (Buschiazzo et al., 2012), which indicate low rates of diversification in conifers but much higher dN/dS, possibly due to positive selection. It would also be interesting to carry out a lineage-specific study of the dN/dS ratio, which could be informative for understanding the mechanisms of lineage-specific evolutionary events.

Although genome sequences of many other plant species are now available, the eight genomes used here are good representatives of green plants, with a clear phylogeny and estimated divergence times, and were assembled from longer, high-quality reads at high coverage, which is believed to be very important for addressing evolutionary questions at the whole-genome level (Demuth and Hahn, 2009; Schneider et al., 2009; Milinkovitch et al., 2010). Therefore, the eight representative green plant genomes are sufficient to address these questions at the whole-genome level. The comparison of the eight genomes revealed the evolutionary pattern of genes and gene families in green plants at an unprecedented resolution. The rate of gene gains and losses is similar to that in animals, some gene families evolved rapidly, and 2745 gene families constitute the ‘core’ proteome of green plants. Most A. thaliana genes are of ancient origin, and genes with early origination are under stronger purifying selection and are more conserved. Gene families of the Arabidopsis lineage, especially those of A. thaliana, have contracted substantially, which could be correlated with the shaping of differences in morphology, physiology, and metabolism among species.

Experimental procedures

Genome data sets

Protein sequences from eight genomes of green plants were analysed, including A. thaliana (TAIR8) (27 025 proteins) (The Arabidopsis Genome Initiative, 2000), A. lyrata (JGI-v1.0, FM6) (32 670 proteins) (Hu et al., 2011), Populus (Populus trichocarpa, JGI-v1.1) (45 555 proteins) (Tuskan et al., 2006), rice (Oryza sativa, v5.0) (56 278 proteins) (International Rice Genome Sequencing Project, 2005; Ouyang et al., 2007), Sorghum (Sorghum bicolor, JGI-v1.0) (35 899 proteins) (Paterson et al., 2009), Selaginella moellendorffii (JGI-v1.0) (34 697 proteins) (Banks et al., 2011), Physcomitrella patens ssp patens (JGI-v1.1) (35 938 proteins) (Rensing et al., 2007), and Chlamydomonas reinhardtii (JGI-v3.1) (15 256 proteins) (Merchant et al., 2007). The phylogeny and divergence times of the eight genomes are well established (Kellogg, 2001; Wikström et al., 2001; Gaut, 2002; Wellman et al., 2003; Magallon and Sanderson, 2005; Zimmer et al., 2007; Beilstein et al., 2010; Lang et al., 2010; Guo et al., 2011b) (Figure 1; see Table S9 for the divergence time at each node). In addition, the sampled genomes are representative, flanking the most important events in green plant evolution leading to A. thaliana. To remove possible transposable element sequences from the protein data sets, all proteins were searched against the repeat library database (RepeatMasker database version 20090604) using tblastn (E value <1 × 10^−5), requiring the hit length to be at least half of the query sequence. The putative transposable elements were removed from the data set before the following analyses.
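The transposable element filter described above can be illustrated with a minimal Python sketch. It is not the original pipeline: the function names and the toy data are hypothetical, and it assumes that the hit length is measured as the aligned length on the query protein, which the text does not specify.

```python
def is_putative_te(query_len, hits, max_evalue=1e-5, min_fraction=0.5):
    """Return True if any hit meets both the E-value and the length criterion.

    hits: iterable of (aligned_query_length, evalue) tuples for one protein.
    Assumption: the hit length is the aligned length on the query protein.
    """
    for aligned_len, evalue in hits:
        if evalue < max_evalue and aligned_len >= min_fraction * query_len:
            return True
    return False


def filter_proteins(protein_lengths, hits_by_protein):
    """Keep the proteins that are not flagged as putative transposable elements."""
    return [
        pid
        for pid, length in protein_lengths.items()
        if not is_putative_te(length, hits_by_protein.get(pid, []))
    ]


# Toy data (hypothetical): protein lengths and their repeat-library hits.
lengths = {"p1": 200, "p2": 300, "p3": 150}
hits = {
    "p1": [(120, 1e-20)],  # significant and covers >= half of the query: removed
    "p2": [(100, 1e-30)],  # significant but too short: kept
    "p3": [(140, 1e-3)],   # long enough but not significant: kept
}
kept = filter_proteins(lengths, hits)
```

Both criteria must hold for the same hit, so a protein with one short significant hit and one long non-significant hit is retained under this reading.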

Homology searches and gene family construction

The Markov Cluster Algorithm (mcl-06-058 package) was used to cluster the protein sequences of all eight genomes into gene families with default parameters (−I 2, −S 6). The clustering analysis had two steps: (i) blastp was used to match each protein against all other proteins (E value <1 × 10^−5) to produce raw NCBI BLAST output; and (ii) MCL was used to build a Markov matrix from the parsed similarity data converted from the raw NCBI BLAST output, and then to generate the final gene families. The clustering results were imported into a MySQL database for the following analyses. Genes (gene families) that do not have recognizable homologues in other species are called orphan genes (gene families).
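As an illustration of how orphan gene families can be recognized once the clustering is done, the following Python sketch flags families whose members all come from a single species. It assumes a cluster file with one family per line and tab-separated gene identifiers, and a hypothetical identifier convention of the form `species|gene`; neither detail is stated in the text, and singleton genes left unclustered would need the same treatment.

```python
def parse_clusters(lines):
    """Parse cluster lines (one family per line, tab-separated gene IDs)."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def species_of(gene_id):
    """Extract the species label from a hypothetical 'species|gene' identifier."""
    return gene_id.split("|", 1)[0]


def orphan_families(families):
    """Return the families whose members all belong to one species."""
    return [fam for fam in families if len({species_of(g) for g in fam}) == 1]


# Toy clustering output (hypothetical identifiers).
toy_lines = [
    "Ath|g1\tAly|g7\tPtr|g3\n",  # shared among three species
    "Ath|g2\tAth|g5\n",          # found only in one species: orphan family
    "Cre|g9\n",                  # single gene, single species: orphan gene
]
families = parse_clusters(toy_lines)
orphans = orphan_families(families)
```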

Gene family size analysis

Genes were classified into one of the following three groups by family size in the clustering, according to Prachumwat and Li (2008): (i) a singleton is a single-copy gene; (ii) a two-gene family has two copies; and (iii) a multigene family has ≥3 copies.
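The three size classes can be expressed as a small Python function. This is only a sketch: the function name is hypothetical, and the family size is taken as an argument because the text does not state whether the copy number is counted within one species or across the whole clustering.

```python
def classify_family(size):
    """Map a gene family size (copy number) to one of the three size classes."""
    if size < 1:
        raise ValueError("family size must be at least 1")
    if size == 1:
        return "singleton"
    if size == 2:
        return "two-gene family"
    return "multigene family"  # three or more copies


# Toy sizes (hypothetical) classified in order.
classes = [classify_family(n) for n in (1, 2, 3, 40)]
```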

Likelihood analysis of gene gain and loss

CAFE version 2.0 (De Bie et al., 2006), a tool for the statistical analysis of the evolution of gene family size, was used to study the evolution of gene families. With it, one can estimate the gene gain and loss rate among these plant genomes, calculate the average expansion or contraction per gene family on each branch, and identify gene families that have larger-than-expected contractions or expansions. The following input files and parameters were used. First, the input tree file is a Newick description of a rooted, bifurcating phylogenetic tree (including branch lengths in units of time) (Figure 1). Second, the data file consists of the sizes of each gene family across the eight species. Third, λ, the probability of both gene gain and loss per gene per unit time in the phylogeny (CAFE assumes that gene birth and death are equally probable), was calculated by CAFE given the gene families in the data file (the estimated λ is the globally most probable value across all families in the input file, or separate λ values for different lineages with the related −t option). Here I calculated λ across all eight species. Fourth, the P-value threshold was set to 0.01 and the number of random samples was left at the default value (1000). Fifth, the default Viterbi method was employed to identify the branch with the most unlikely amount of change. For gene families evolving at rates of gain and loss significantly higher than the genome-wide average (P < 0.0001) (Hahn et al., 2005), the Viterbi method was used to identify the branch that is the most likely cause of deviation from the random model (significant at P < 0.005) (Hahn et al., 2007).
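The data file described above, with one row per gene family and one size per species, can be sketched in Python as follows. The species labels, the `species|gene` identifier convention, and the column layout are all hypothetical; the exact layout expected by CAFE should be checked against its manual.

```python
from collections import Counter

# Hypothetical short labels for the eight species, in a fixed column order.
SPECIES = ["Ath", "Aly", "Ptr", "Osa", "Sbi", "Smo", "Ppa", "Cre"]


def size_table(families, species=SPECIES):
    """Build rows of [family_name, size_in_species_1, ..., size_in_species_n]."""
    rows = []
    for i, fam in enumerate(families, start=1):
        # Count the members of this family per species label.
        counts = Counter(gene_id.split("|", 1)[0] for gene_id in fam)
        rows.append([f"fam{i}"] + [counts.get(sp, 0) for sp in species])
    return rows


def to_tsv(rows, species=SPECIES):
    """Render the table as tab-delimited text with a header line."""
    header = "\t".join(["family"] + list(species))
    body = ["\t".join(str(x) for x in row) for row in rows]
    return "\n".join([header] + body)


# Toy families (hypothetical identifiers).
toy_families = [["Ath|g1", "Ath|g2", "Cre|g9"], ["Osa|g4", "Sbi|g8"]]
rows = size_table(toy_families)
tsv = to_tsv(rows)
```

Species that lack a family receive a size of zero, so that every family has a value in each of the eight columns.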

Phylostratigraphic analyses of Arabidopsis thaliana genes

Phylostratigraphic analysis was performed according to a previous study (Domazet-Lošo et al., 2007). All A. thaliana genes were assigned to seven phylogenetic ranks (phylostrata) of different ages according to their emergence in the reference genomes on the phylogeny, based on the similarity clustering via MCL. The seven gene groups were: (1) A. thaliana, genes that arose along the A. thaliana lineage; (2) Arabidopsis, genes present in the MRCA of the genus Arabidopsis; (3) rosid, genes present in the MRCA of the genus Arabidopsis and P. trichocarpa; (4) angiosperm, genes present in the MRCA of rosids and grasses (S. bicolor and O. sativa); (5) Tracheophyte, genes present in the MRCA of angiosperms and S. moellendorffii; (6) Embryophyte, genes present in the MRCA of Tracheophytes and P. patens; and (7) Viridiplantae, genes present in the MRCA of Embryophytes and C. reinhardtii. Gene (or gene family) fixation rates were calculated according to the formula r = N/T, where r is the number of genes (or gene families) fixed per million years, N is the number of genes (or gene families) in the phylostratum, and T is the duration of the interval (million years).
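The assignment rule and the fixation rate formula can be sketched in Python. The sketch assumes that an A. thaliana gene is placed in the oldest phylostratum implied by the most distantly related species found in its gene family; the species labels and the numbers in the example are hypothetical and are not values from this study.

```python
# Phylostrata ordered from youngest (index 0) to oldest (index 6).
PHYLOSTRATA = [
    "A. thaliana", "Arabidopsis", "rosid", "angiosperm",
    "Tracheophyte", "Embryophyte", "Viridiplantae",
]

# Hypothetical species labels mapped to the phylostratum whose MRCA they define.
RANK_OF_SPECIES = {
    "Ath": 0, "Aly": 1, "Ptr": 2, "Osa": 3, "Sbi": 3,
    "Smo": 4, "Ppa": 5, "Cre": 6,
}


def phylostratum(family_species):
    """Return the oldest phylostratum implied by the species in a gene family."""
    return PHYLOSTRATA[max(RANK_OF_SPECIES[sp] for sp in family_species)]


def fixation_rate(n, duration_myr):
    """r = N / T: genes (or gene families) per million years."""
    if duration_myr <= 0:
        raise ValueError("interval duration must be positive")
    return n / duration_myr


# Toy examples (hypothetical values).
stratum_a = phylostratum({"Ath", "Aly"})         # family limited to the genus
stratum_b = phylostratum({"Ath", "Ptr", "Cre"})  # homologue in the green alga
rate = fixation_rate(1200, 100.0)                # 1200 genes over 100 million years
```

Taking the maximum rank means that a family missing from intermediate lineages is still dated by its most distant member, which treats absences as secondary losses.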

Divergence and polymorphism

The divergence of orthologues was estimated in several steps: (i) the orthologous gene pairs between A. thaliana and A. lyrata were extracted from a whole-genome alignment analysis (Hu et al., 2011); (ii) a codon-based alignment of orthologous coding sequences was performed with PAL2NAL (Suyama et al., 2006); and (iii) dS, the number of synonymous substitutions per synonymous site, and dN/dS, the ratio of the number of non-synonymous substitutions per non-synonymous site to the number of synonymous substitutions per synonymous site, were calculated for orthologue pairs using the codeml program implemented in the Phylogenetic Analysis by Maximum Likelihood (PAML) version 3.15 package (Yang, 1997).

GO annotation and gene expression

Gene ontology annotation for A. thaliana was performed using the GO annotation implemented at TAIR (October 2009) (Gene Ontology Consortium, 2004); for other species, blast2go was used for the GO annotation (Conesa et al., 2005). GO Biological Processes were used to estimate the abundance of GO terms. Gene expression data from the AtGenExpress Arabidopsis expression atlas (Schmid et al., 2005), which was generated on the widely available Affymetrix ATH1 array platform, were averaged across all tissues as an estimate of the expression level of A. thaliana genes.


I thank Matthew W. Hahn for suggestions on CAFE analysis; Tomislav Domazet-Lošo for suggestions on phylostratigraphy analysis; Manyuan Long, Yong Zhang, Jie Guo, and anonymous reviewers for comments about this work; and Tina Hu for proofreading and improving the manuscript. I am especially grateful to Detlef Weigel and Song Ge for their long-term support and valuable comments on this work. This work was supported by the 100 Talents Program of the Chinese Academy of Sciences and the Max Planck Society.