Keywords:

  • comparative genomics;
  • evolution;
  • functional genomics;
  • non-family genes;
  • plant genome

Summary

There are a large number of ‘non-family’ (NF) genes that do not cluster into families with three or more members per genome. While gene families have been extensively studied, a systematic analysis of NF genes has not been reported. We performed comparative studies on NF genes in 14 plant species. Based on the clustering of protein sequences, we identified ~94 000 NF genes across these species that were divided into five evolutionary groups: Viridiplantae wide, angiosperm specific, monocot specific, dicot specific, and those that were species specific. Our analysis revealed that the NF genes resulted largely from less frequent gene duplications and/or a higher rate of gene loss after segmental duplication relative to genes in both low-copy-number families (LF; 3–10 copies per genome) and high-copy-number families (HF; >10 copies). Furthermore, we identified functions enriched in the NF gene set as compared with the HF genes. We found that NF genes were involved in essential biological processes shared by all plant lineages (e.g. photosynthesis and translation), as well as gene regulation and stress responses associated with phylogenetic diversification. In particular, our analysis of an Arabidopsis protein–protein interaction network revealed that hub proteins with the top 10% most connections were over-represented in the NF set relative to the HF set. This research highlights the roles that NF genes may play in evolutionary and functional genomics research.


Introduction

A gene family can be defined as a group of genes with similar sequences, resulting from various gene duplication events and often sharing similar or partially redundant functions (De Grassi et al., 2008; Demuth and Hahn, 2009). Under the most inclusive definition, the majority of genes in plant genomes belong to gene families that contain three or more members per genome. Many gene families in plants have been extensively studied, e.g. MADS-box (Parenicova et al., 2003; Nam et al., 2004), ABC protein superfamily (Verrier et al., 2008), NAC (NAM, ATAF, and CUC) transcription factor (Ooka et al., 2003), glycosyltransferase superfamily (Yin et al., 2010; Ye et al., 2011; Yonekura-Sakakibara and Hanada, 2011), F-box (Yang et al., 2008), APETALA2 (AP2)/ethylene-responsive element binding factor (ERF) superfamily (Nakano et al., 2006), and many protein kinase families (Hrabak et al., 2003). Still, a large number of genes, with one or two copies per genome, do not belong to such gene families; we term these non-family (NF) genes. Only a few studies have addressed the role of NF genes in sequenced plant species (Guo et al., 2007; Duarte et al., 2010), and the function and evolution of NF genes therefore remain unclear.

There is considerable variation in the number of genes within gene families. This variation is due partly to the extent of gene duplication and the retention of duplicated genes (Ohno, 1970; Chauve et al., 2008). Recent research indicates that the ancestral angiosperm genome contained 12 000–14 000 genes (Proost et al., 2011), far fewer than the number of genes in extant angiosperm genomes (e.g. ~27 000 in Arabidopsis, ~26 000 in Vitis, ~40 000 in Populus and ~40 000 in Oryza) (Goodstein et al., 2012). The expansion in gene number in extant angiosperm genomes appears to have arisen through various gene duplication events (Van de Peer et al., 2009). For example, the F-box gene family is expanded in herbaceous plants relative to woody plants specifically due to a high rate of tandem duplications in the former (Yang et al., 2008). Several models (e.g. neofunctionalization, subfunctionalization, balanced gene drive) have been proposed to explain the evolutionary fate of duplicated genes (Ohno, 1970; Hughes, 1994; Yang et al., 2006; Freeling, 2009). Gene loss after duplication is one of the evolutionary modes that contribute to the contraction of gene families; for example, Arabidopsis thaliana lost around 5700 genes after its divergence from A. lyrata (Proost et al., 2011).

To understand the evolution and function of NF genes in plants, we performed a large-scale comparative analysis of NF genes in diverse plant species ranging from algae to moss to angiosperms. In a comparative framework, we analyzed multiple aspects of the NF genes (one or two copies per genome) in comparison with low-copy-number gene families (LF; 3–10 copies) and high-copy-number gene families (HF; >10 copies). Our analysis revealed differences in the frequency of gene duplication, evolutionary rate, gene ontology, subcellular localization and gene expression between NF genes and genes in the LF and HF categories, and among NF genes in different evolutionary groups (i.e. Viridiplantae wide, angiosperm specific, monocot specific, dicot specific, and species specific). We found that some plant functions involved more NF genes than HF genes. We identified NF genes that were involved in stress responses as well as cell wall biosynthesis. Furthermore, our analysis of an Arabidopsis protein–protein interaction network revealed that a higher proportion of the NF gene set was classified as hubs (i.e. highly connected nodes) as compared with the HF gene set. We present this study to highlight the roles of NF genes in evolutionary and functional genomics research.

Results

Non-family (NF) genes in plant genomes

Based on clustering analysis of protein sequences, we divided the protein-encoding genes from 14 diverse plant species, including algae (Chlamydomonas reinhardtii and Volvox carteri), moss (Physcomitrella patens), lycophyte (Selaginella moellendorffii), monocots (Brachypodium distachyon, Oryza sativa, Sorghum bicolor and Zea mays), and dicots (Arabidopsis thaliana, Carica papaya, Glycine max, Populus trichocarpa, Solanum tuberosum and Vitis vinifera), into three categories: (i) genes with only 1 or 2 copies per genome, i.e. NF genes, (ii) genes in low-copy-number families (LF), containing 3–10 copies per genome, and (iii) genes in high-copy-number families (HF), containing more than 10 copies per genome (Figure 1). Non-vascular plants contained a significantly (P < 0.01, chi-squared) higher proportion of NF genes and a significantly (P < 0.01, chi-squared) lower proportion of HF genes relative to vascular plants (Figure 1). Furthermore, the NF category contained significantly (P < 0.01, paired t-test) more single-copy genes than two-copy genes (Figure 2).

Figure 1. Distribution of genes classified as non-family (NF; 1 or 2 copies per genome) genes, low-copy-number (LF; 3–10 copies) genes and high-copy-number (HF; >10 copies) genes in 14 plant species. a: Percent of all protein-encoding genes in each plant species. b: Lower plants have a significantly (P < 0.01, Chi-square) higher proportion of NF genes than higher plants (i.e. Tracheophytes). c: Lower plants have a significantly (P < 0.01, Chi-square) lower proportion of HF genes than higher plants.

Figure 2. Number of single- and two-copy genes classified here as non-family genes in each of the 14 plant species. The number of single-copy genes is significantly (P < 0.01, paired t-test) higher than that of two-copy genes.

Among all NF genes, we identified five evolutionary groups: (i) NF1 – genes having homologues in all the 14 species studied (i.e. found Viridiplantae wide), (ii) NF2 – genes having homologues in all and only the 10 angiosperm species studied (i.e. angiosperm specific), (iii) NF3 – genes having homologues in all and only the four monocot species studied (i.e. monocot specific), (iv) NF4 – genes having homologues in all and only the six dicot species studied (i.e. dicot specific), and (v) NF5 – genes not conserved among the species studied (i.e. species specific) (Figure 3). The number of genes in NF5 was significantly [P < 0.01, anova (analysis of variance) + least significant difference (LSD) test] higher than that in each of the other groups (i.e. NF1, NF2, NF3 and NF4). Interestingly, in contrast to the five NF groups (NF1–NF5), only four corresponding groups could be identified among the LF or HF genes: the dicot-specific group (LF4 or HF4), which would contain LF or HF genes having homologues in all and only the six dicot species, was absent (Figures S1 and S2). This suggests that the shared evolutionary history of the LF and HF gene families in dicots dates to before the divergence between monocots and dicots.
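
The grouping rule above can be sketched in Python as a lookup on the set of species in which a gene has homologues. This is only an illustration under stated assumptions, not the pipeline used in this study: the function name `evolutionary_group` and the abbreviated species labels are hypothetical, and returning `None` for presence patterns outside the five defined groups is an assumption, since the text does not say how such genes were handled.

```python
# Species sets taken from the 14 genomes listed in the text.
MONOCOTS = frozenset({"B. distachyon", "O. sativa", "S. bicolor", "Z. mays"})
DICOTS = frozenset({"A. thaliana", "C. papaya", "G. max",
                    "P. trichocarpa", "S. tuberosum", "V. vinifera"})
ANGIOSPERMS = MONOCOTS | DICOTS
ALL_SPECIES = ANGIOSPERMS | frozenset({"C. reinhardtii", "V. carteri",
                                       "P. patens", "S. moellendorffii"})


def evolutionary_group(species_with_homologues):
    """Map the set of species containing homologues to NF1-NF5 (hypothetical helper)."""
    present = frozenset(species_with_homologues)
    if present == ALL_SPECIES:
        return "NF1"  # Viridiplantae wide: all 14 species
    if present == ANGIOSPERMS:
        return "NF2"  # all and only the 10 angiosperms
    if present == MONOCOTS:
        return "NF3"  # all and only the four monocots
    if present == DICOTS:
        return "NF4"  # all and only the six dicots
    if len(present) == 1:
        return "NF5"  # species specific
    return None       # assumption: other patterns fall outside the five groups
```

The "all and only" wording is what makes exact set equality the natural test here, rather than a subset check.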

Figure 3. Distribution of non-family (NF; 1 or 2 copies) genes among different phylogenetic groups. NF1 gene group contains NF genes having homologues in all the 14 species; NF2 contains NF genes having homologues in all and only the 10 angiosperm species; NF3 contains NF genes having homologues in all and only the four monocot species; NF4 contains NF genes having homologues in all and only the six dicot species; and NF5 contains species-specific NF genes. The numbers of genes in each species were normalized based on Glycine, which has the highest number of genes among the 14 species studied.

Duplication and loss of NF genes

Using syntenic gene order, our analysis of duplication events in A. thaliana, B. distachyon, G. max, O. sativa, P. trichocarpa, S. bicolor, V. vinifera and Z. mays revealed that the frequency of tandem duplication in the NF genes was significantly (P < 0.001, paired t-test) lower than that in both the LF and HF categories, and that the frequency of tandem duplication in the LF genes was significantly (P < 0.001, paired t-test) lower than that in the HF genes (Figure 4a). The frequency of segmental duplication in NF genes was significantly (P < 0.01, paired t-test) lower than that in both the LF and HF categories, and there was no significant difference in the frequency of segmental duplication between the LF and HF categories (Figure 4b). Furthermore, NF genes had a significantly (P < 0.01, paired t-test) higher frequency of gene loss after syntenic segmental duplication than genes in both the LF and HF categories, and genes in the LF category had a higher frequency (P < 0.001, paired t-test) of gene loss than those in the HF category (Figure 4c). These data suggest that the NF genes resulted mainly from a higher rate of gene loss after segmental duplication as well as a lower rate of tandem gene duplication.

Figure 4. Duplication and loss of genes classified as non-family (NF; 1 or 2 copies), low-copy-number (LF; 3–10 copies) and high-copy-number (HF; >10 copies) genes in Arabidopsis, Brachypodium, Glycine, Oryza, Populus, Sorghum, Vitis and Zea. (a) The frequency of tandem duplication in the NF genes is significantly lower than that in both LF (P = 0.00003) and HF (P = 0.0001) categories, and the frequency of tandem duplication in the LF genes is significantly (P = 0.0006) lower than that in HF genes. (b) The frequency of segmental duplication in NF genes was significantly lower than that in both LF (P = 0.0015) and HF (P = 0.0077) categories. (c) NF genes had a significantly higher frequency of gene loss after syntenic segmental duplication than that in both LF (P = 0.0028) and HF (P = 0.0006) categories, and genes in the LF category had a higher frequency (P = 0.0001) of gene loss than those in HF category. The significance (P-values) of pair-wise comparisons between NF, LF and HF are estimated by paired t-tests.

Evolutionary fate of NF genes

To investigate the evolutionary fate of NF genes after duplication, we examined the non-synonymous/synonymous (Ka/Ks) ratio for gene pairs resulting from segmental duplication. Our analysis showed that NF genes had a significantly (P < 0.01, anova + LSD test) higher Ka/Ks ratio than genes in the LF and HF categories (Figure S3a), indicating that NF genes may have been under more relaxed selection than LF or HF genes. Furthermore, the species-specific NF genes (i.e. category NF5) had a significantly (P < 0.01, anova + LSD test) higher Ka/Ks ratio than the NF genes conserved among all plant species investigated (i.e. NF1) (Figure S3b), suggesting that the species-specific NF genes may have been under more relaxed selection than the conserved NF genes.

Functions of NF genes

Gene ontology (GO) analysis revealed that NF genes were disproportionately involved in various biological processes, with the most over-represented processes including nitrogen metabolism, biosynthetic process, biological regulation and response to stimulus (Figure S4). GO enrichment analysis revealed that many biological processes (e.g. translation, nucleoside metabolism, cofactor metabolism and photosynthesis) were significantly (P < 0.05, Fisher) over-represented in conserved NF genes (i.e. shared by at least two species) relative to conserved HF genes (Figure 5 and Table S1). Several biological processes (e.g. gene expression) were significantly (P < 0.05, Fisher) enriched in species-specific NF genes relative to species-specific HF genes (NF5 versus HF5; Table S2). Some biological processes (e.g. heterocycle biosynthetic process) were enriched (P < 0.05, Fisher) in conserved NF genes in comparison with conserved LF genes (Table S3). We also found some enriched (P < 0.05, Fisher) biological processes in conserved and species-specific LF genes in comparison with conserved and species-specific HF genes, respectively (Tables S4 and S5). In addition, some biological processes were enriched (P < 0.05, Fisher) in species-specific NF genes relative to conserved NF genes (Table S6).

Figure 5. Biological processes over-represented (P < 1 × 10^-77, Fisher) in conserved (i.e. shared by at least two species) non-family (NF; 1 or 2 copies) genes as compared with conserved high-copy-number family (HF; >10 copies) genes. The GO terms were summarized using REVIGO (http://revigo.irb.hr/) and the enriched biological processes are listed in Table S1.

To determine whether gene functions differed at the subcellular level among the gene-copy categories, we predicted the protein subcellular localization of genes in the NF, LF and HF categories. The results showed that (i) plasma membrane localization was over-represented in the conserved (i.e. shared by at least two species) NF, LF and HF categories, (ii) chloroplast localization was over-represented in the conserved NF and LF categories, and (iii) extracellular space and nucleus localization were over-represented in the species-specific NF and LF categories (P < 0.01, the cumulative Poisson distribution; Table 1). These over-represented subcellular localizations were in concordance with the over-represented biological processes: chloroplast localization and photosynthesis were both over-represented in conserved NF and LF genes, and nucleus localization and regulation processes were both over-represented in species-specific NF and LF genes (Tables 1, S1, S2, S4 and S5). Furthermore, we found that chloroplast and mitochondrion localization was over-represented in NF1 (Viridiplantae-wide NF genes), nucleus localization was over-represented in NF2 (angiosperm-specific NF genes) and NF5 (species-specific NF genes), and extracellular localization was over-represented in NF3 and NF4 (P < 0.01, the cumulative Poisson distribution; Table 2).

Table 1. Subcellular localization of non-family (NF; 1 or 2 copies per genome) genes, low-copy-number family (LF; 3–10 copies) genes and high-copy-number family (HF; >10 copies) genes. ‘Conserved’ indicates that the genes are shared by at least two species.

| Subcellular localization | NF conserved (%) | NF species specific (%) | LF conserved (%) | LF species specific (%) | HF conserved (%) | HF species specific (%) |
|---|---|---|---|---|---|---|
| Chloroplast | 13.6a | 1.6 | 13.4a | 1.6 | 4.9 | 0.8 |
| Cytoplasm | 20.7 | 9.0 | 24.0a | 12.3 | 27.7a | 29.2a |
| Extracellular space | 2.5 | 12.6a | 5.0 | 20.4a | 5.1 | 9.5 |
| Mitochondrion | 3.8a | 3.4 | 2.7 | 2.2 | 1.7 | 1.6 |
| Nucleus | 49.4 | 71.0a | 41.1 | 60.3a | 44.8 | 57.1 |
| Plasma membrane | 9.7a | 2.4 | 13.7a | 3.2 | 15.7a | 1.8 |

a: Over-representation (P < 0.01) of gene groups in each subcellular localization; P-value was calculated using the cumulative Poisson distribution.

Table 2. The relative abundance (i.e. proportion of genes in each gene group) of subcellular localization in different non-family (NF; 1 or 2 copies per genome) genes. NF1 (Viridiplantae wide), NF2 (angiosperm specific), NF3 (monocot specific), NF4 (dicot specific), and NF5 (species specific) as defined in Figure 3.

| Subcellular localization | NF1 (%) | NF2 (%) | NF3 and NF4 (%) | NF5 (%) |
|---|---|---|---|---|
| Chloroplast | 24.5a | 1.9 | 4.3 | 1.6 |
| Cytoplasm | 29.9a | 1.9 | 8.5 | 9.0 |
| Extracellular space | 0.7 | 3.8 | 26.6a | 12.6 |
| Mitochondrion | 3.9a | 0.6 | 1.6 | 3.4 |
| Nucleus | 35.1 | 86.0a | 55.9 | 71.0a |
| Plasma membrane | 5.8a | 5.7a | 3.2 | 2.4 |

a: Over-representation (P < 0.01) of gene groups in each subcellular localization; P-value was calculated using the cumulative Poisson distribution.

To obtain experimental support for our computational prediction of NF gene functions, we interrogated the AtGenExpress array data (Kilian et al., 2007) for Arabidopsis genes in the NF1 category (Viridiplantae-wide NF genes) versus NF5 (species-specific NF genes). Our analysis revealed eight and 15 expression clusters for NF1 and NF5, respectively (Figures S5 and S6). This finding suggests that species-specific genes are involved in more diverse stress responses than genes conserved across a large evolutionary distance.

Gene co-expression

Genes that are highly co-expressed may be involved in similar biological processes (Usadel et al., 2009). Our previous co-expression analysis identified 692 genes (plus two obsolete gene models) associated with cell wall biosynthesis in Arabidopsis (Yang et al., 2011a). Among these 692 genes, there were 70 NF genes, 177 LF genes and 445 HF genes (Tables S7 and S8). For example, two NF genes, AT2G31930 and AT2G41610, were co-expressed with known cell wall biosynthesis genes and preferentially expressed in stem tissue enriched with secondary cell wall materials (Figure 6). GO enrichment analysis showed that secondary cell wall biosynthesis was significantly (P < 0.001, Fisher) over-represented in this co-expressed gene subnetwork as compared with the whole Arabidopsis genome (Table S9).

Figure 6. An example of Arabidopsis non-family (NF) genes involved in cell wall biosynthesis. (a) Gene co-expression network, with the yellow dots representing the non-family genes with unknown function, the blue dots representing known genes associated with cell wall biosynthesis, the red dots representing other co-expressed genes and the green lines connecting two co-expressed genes. The network was drawn using the NetworkDrawer function of ATTED-II (http://atted.jp/) with the NF gene AT2G31930 and its directly co-expressed genes as query. (b) Expression pattern of the NF genes and associated known cell wall biosynthesis genes. Microarray expression data were obtained from AtGenExpress (http://www.weigelworld.org).

To explore the roles of NF genes in stress responses, we investigated the association of NF genes with known genes responsive to abiotic (i.e. cold, heat, salt, water deprivation) and biotic (i.e. bacteria and fungi) stresses in Arabidopsis. We identified 713 NF genes, 1268 LF genes and 2538 HF genes associated with stress responses (Tables S7 and S8). For example, the NF gene AT5G35320 was co-expressed with several heat shock protein genes (Figure S7a), and highly expressed under heat stress (Figure S7b). Genes related to heat response were significantly (P < 0.001, Fisher) enriched in this co-expression subnetwork, as compared with the whole Arabidopsis genome (Table S10). The NF gene AT3G01420 was co-expressed with several biotic stress responsive genes (Figure 7a) and highly expressed in Arabidopsis leaves infected with Pseudomonas (Figure 7b). Genes related to defense response were significantly (P < 0.05, Fisher) enriched in this co-expression gene subnetwork, as compared with the whole Arabidopsis genome (Table S11).

Figure 7. An Arabidopsis non-family (NF) gene involved in biotic stress response. (a) Gene co-expression network, with the yellow dots representing the non-family gene, the blue dots representing known biotic responsive genes, the red dots representing other co-expressed genes and the green lines connecting two co-expressed genes. The network was drawn using the NetworkDrawer function of ATTED-II (http://atted.jp/) with the NF gene AT3G01420 and its directly co-expressed genes as query. (b) Expression pattern of the NF gene and associated biotic responsive genes in Arabidopsis under Pseudomonas treatment. Microarray expression data were obtained from AtGenExpress (http://www.weigelworld.org).

Previous analysis revealed that 77% of one-to-one orthologous gene pairs between A. thaliana and O. sativa showed conserved co-expression (Movahedi et al., 2011). We examined the co-expression conservation of one-to-one orthologous gene pairs in NF, LF and HF between A. thaliana and O. sativa. Our analysis revealed that NF genes had a significantly (P < 0.01, the cumulative Poisson distribution) higher rate of co-expression conservation than HF genes (Figure S8).

Protein–protein interaction

Protein–protein interactions (PPIs) are crucial for a large number of cellular processes in living organisms (He and Zhang, 2006; Vallabhajosyula et al., 2009). To determine the contribution of NF proteins to protein–protein interactions, we investigated an Arabidopsis PPI network, AtPIN (Brandao et al., 2009). The results showed that a significantly (P < 0.01, the cumulative Poisson distribution) higher proportion of the NF set could be classified as hubs, the top 10% most connected nodes (with a minimum degree of 36) in the PPI network, relative to the LF and HF sets (Figure 8 and Table S12). In addition, all of the hubs in the PPI network were found in the conserved (i.e. shared by at least two species) NF, LF and HF categories, consistent with the view that hub proteins are generally evolutionarily conserved (Vallabhajosyula et al., 2009).

Figure 8. Hub genes in non-family (NF; 1 or 2 copies), low-copy-number family (LF; 3–10 copies) and high-copy-number family (HF; >10 copies) sets. Hubs were defined here as the top 10% most connected nodes in the Arabidopsis protein-protein interaction network (http://bioinfo.esalq.usp.br/atpin/atpin.pl).

Discussion

With the rapid accumulation of genome sequencing data in public databases, our knowledge about genes found in expanded families (i.e. three or more gene copies per genome) has greatly increased. However, questions related to the function and evolution of NF genes (i.e. one or two copies per genome) remain unanswered. If the NF genes are involved in principal biological processes, why have they not expanded as much as the family genes have? To begin to address these questions, we performed a large-scale systematic analysis of NF genes in 14 plant genomes across a large evolutionary space. Our study provides initial evidence that NF genes play an important role in plant functions and that their evolutionary dynamics differ from those of family genes. Key functions (e.g. cellular nitrogen compound metabolic process, translation, cofactor metabolic process) were enriched in NF genes (Figure 5 and Table S1). In particular, some essential biological processes that are shared by all plant lineages, such as photosynthesis in chloroplasts and respiration in mitochondria, were significantly (P < 1 × 10^-77, Fisher) over-represented in the NF genes (Figure 5 and Table S1). Our comparative analysis indicates that NF genes resulted largely from a higher rate of gene loss after segmental duplications and/or a lower frequency of tandem duplications (Figure 4). This is consistent with previous reports showing that tandem duplication is one of the main factors contributing to the expansion of gene families (Cannon et al., 2004; Yang et al., 2008), and that gene loss is a common aspect of evolutionary dynamics after gene duplication (Yang et al., 2006, 2011b; Freeling, 2009).

A large number of NF genes are still annotated as unknown, putative or hypothetical, especially those in the species-specific group (i.e. NF5). These NF genes should be considered as candidates for future functional studies. It was reported that the proportion of essential singleton genes was higher than that of essential duplicated genes in yeast and nematode, whereas the singletons and duplicates were determined to be equally essential in mouse (Liao and Zhang, 2007). In plants, there is currently very limited information on NF mutants relative to yeast, nematode and mouse. Future efforts are needed to create knockout mutants for the majority, if not all, of the genes in several model plant species in order to estimate the proportion of essential genes among NF and family genes. Hubs tend to be more essential than non-hub proteins (He and Zhang, 2006; Vallabhajosyula et al., 2009). We found that hub proteins in the Arabidopsis protein–protein interaction network were over-represented in the NF set relative to the HF set (Figure 8), suggesting that NF genes may play more important roles than expected in plants.

Experimental Procedures

Clustering of protein sequences

The annotated non-TE (transposable element) protein sequences of 14 plant species, including Chlamydomonas reinhardtii (www.Phytozome.net; Phytozome), Volvox carteri (Phytozome), Physcomitrella patens (Phytozome), Selaginella moellendorffii (Phytozome), Brachypodium distachyon (Phytozome), Sorghum bicolor (Phytozome), Vitis vinifera (Phytozome), Carica papaya (Phytozome), Glycine max (Phytozome), Populus trichocarpa (Phytozome), Oryza sativa (rice.plantbiology.msu.edu), Zea mays (www.maizesequence.org), Solanum tuberosum (potatogenomics.plantbiology.msu.edu), and Arabidopsis thaliana (www.Arabidopsis.org), were used to cluster protein sequences. Where multiple transcripts were annotated for one gene locus, the longest protein sequence was selected. All-against-all BlastP searches of these protein sequences were performed using Blast+ (Camacho et al., 2009) with an E-value cutoff of 1 × 10^-3, followed by clustering analysis using TRIBE-MCL with an inflation value of 1.2 (Enright et al., 2002). Based on the protein clustering analysis, the genes were classified into three categories: NF, defined as gene clusters containing 1 or 2 genes in one genome; low-copy-number family (LF), defined as gene clusters containing 3–10 genes in one genome; and HF, defined as gene clusters containing more than 10 genes in one genome.
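
The final classification step can be sketched in Python as follows. This is a minimal illustration rather than the code used in this study; the input layout (a mapping from gene ID to a species and cluster ID pair, as might be parsed from the clustering output) and the function name `classify_copy_number` are assumptions.

```python
from collections import Counter


def classify_copy_number(gene_to_cluster):
    """Label each gene NF, LF or HF from its cluster's size within its own genome.

    gene_to_cluster maps gene ID -> (species, cluster ID). This layout is
    hypothetical and would be built from the protein clustering output.
    """
    # Count the members of each cluster separately for every genome.
    copies = Counter(gene_to_cluster.values())

    labels = {}
    for gene, key in gene_to_cluster.items():
        n = copies[key]
        if n <= 2:
            labels[gene] = "NF"  # 1 or 2 copies per genome
        elif n <= 10:
            labels[gene] = "LF"  # 3-10 copies per genome
        else:
            labels[gene] = "HF"  # more than 10 copies per genome
    return labels
```

Because the count is keyed on both species and cluster, the same cluster can be NF in one genome and LF or HF in another.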

Identification of gene duplication and gene loss

The information for segmental duplication was obtained from SynMap (synteny.cnr.berkeley.edu). The tandem duplicated genes were identified and defined as an array of two or more genes that were in the same protein cluster and were found within a 100-kb genomic window (Yang et al., 2008). Gene loss after segmental duplications was defined as the absence of homologous genes within one copy of the duplicated blocks, as compared with the other copy of the duplicated segments in the syntenic regions. For example, if a syntenic block contains duplicated fragments of ‘gene A - gene B - gene C - gene D - gene E - gene F - gene G - gene H’ (fragment 1; gene F was not generated by tandem duplication) and ‘gene A* - gene B* - gene C* - gene D* - gene E* - gene G* - gene H*’ (fragment 2; genes A*, B*, C*, D*, E*, G*, and H* are duplicated copies of genes A, B, C, D, E, G, and H, respectively), we consider that there was a gene loss (i.e. gene F*) on fragment 2.
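
The two definitions above can be sketched in Python. This is an illustration under stated assumptions, not the implementation used in this study: the tuple layout, the function names, and the choice to chain same-cluster genes whose start coordinates lie within 100 kb of the previous gene are all assumptions, as the text does not specify how the window was applied.

```python
from collections import defaultdict

WINDOW = 100_000  # 100-kb genomic window, as stated in the text


def tandem_arrays(genes, window=WINDOW):
    """Group genes into tandem arrays of two or more members.

    genes: iterable of (gene_id, chromosome, start, cluster_id) tuples
    (hypothetical layout). Same-cluster genes on one chromosome are chained
    while each start lies within `window` bp of the previous start.
    """
    by_key = defaultdict(list)
    for gene_id, chrom, start, cluster in genes:
        by_key[(chrom, cluster)].append((start, gene_id))

    arrays = []
    for members in by_key.values():
        members.sort()
        current = [members[0]]
        for prev, nxt in zip(members, members[1:]):
            if nxt[0] - prev[0] <= window:
                current.append(nxt)
            else:
                if len(current) >= 2:
                    arrays.append([g for _, g in current])
                current = [nxt]
        if len(current) >= 2:
            arrays.append([g for _, g in current])
    return arrays


def genes_lost_on_partner(fragment, partner_of, tandem_genes=frozenset()):
    """Return genes on `fragment` that have no duplicate on the partner fragment.

    partner_of maps a gene to its syntenic duplicate. Genes in tandem_genes
    are skipped, following the worked example, in which the unpaired gene
    must not have arisen by tandem duplication.
    """
    return [g for g in fragment if g not in partner_of and g not in tandem_genes]
```

Applied to the worked example, fragment 1 (genes A to H) with partners for every gene except F yields `["F"]`, which corresponds to the loss of F* on fragment 2.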

Evolutionary analysis of non-family genes

Non-synonymous (Ka) and synonymous (Ks) substitution rates of full-length coding sequences were calculated using the KaKs_Calculator with a gamma-modified version of the Yang-Nielsen (GMYN) method (Wang et al., 2010). The same number (3679) of gene pairs generated from segmental duplication was randomly selected from each of the NF, LF and HF categories for Ka/Ks analysis. More than 70% of coding regions in the shorter sequences of duplicated gene pairs were aligned for Ka/Ks analysis, as determined by analysis of 200 randomly selected gene pairs (Figure S9).

Gene ontology (GO) analysis

We obtained whole genome GO term annotation for the 14 species investigated in this study using Blast2GO with a BlastP E-value hit filter of 1 × 10^-6, an annotation cutoff value of 55, and GO weight of 5 (Conesa et al., 2005). GO enrichment analysis was performed using Blast2GO or agriGO (bioinfo.cau.edu.cn/agriGO/) with Fisher's exact test (Du et al., 2010), and GO terms were summarized using the web server REVIGO (Supek et al., 2011). For pair-wise enrichment comparison between the NF, LF and HF sets, all the genes in each set were used. For example, for calculating enrichment of GO terms in the conserved NF set relative to the conserved HF set, all of the conserved NF genes were studied with all of the conserved HF genes used as a reference.
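
For a single GO term, the study-versus-reference comparison described above reduces to a 2 × 2 table (genes with and without the term in each set). The following Python sketch computes a one-sided Fisher's exact test P-value for over-representation from that table using only the standard library. It is an illustration, not the Blast2GO or agriGO implementation; the function name `fisher_enrichment_p` is hypothetical, and the one-sided alternative is an assumption, since the text does not state which alternative was tested.

```python
from math import comb


def fisher_enrichment_p(study_with, study_without, ref_with, ref_without):
    """One-sided Fisher's exact P-value for over-representation in the study set.

    Arguments are the four cell counts of the 2 x 2 table: genes annotated
    (or not) with the GO term in the study set and in the reference set.
    """
    total = study_with + study_without + ref_with + ref_without
    with_term = study_with + ref_with
    study_size = study_with + study_without

    # Hypergeometric upper tail: probability of seeing at least `study_with`
    # annotated genes in the study set when the margins are held fixed.
    upper = min(with_term, study_size)
    tail = sum(
        comb(with_term, k) * comb(total - with_term, study_size - k)
        for k in range(study_with, upper + 1)
    )
    return tail / comb(total, study_size)
```

For example, `fisher_enrichment_p(3, 1, 1, 3)` evaluates to 17/70 (about 0.243). In practice one such test is run per GO term, so some control for multiple testing would normally be considered; that step is not covered by this sketch.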

Prediction of protein subcellular localization

The prediction of protein subcellular localization was performed using YLoc (Briesemeister et al., 2010) with the model set as ‘YLoc+’ and version set as ‘Plants’, WoLF PSORT (Horton et al., 2007) with the organism set as ‘Plant’, and CELLO (Yu et al., 2006) with the organism set as ‘Eukaryotes’. The consensus results predicted by these three methods (i.e. the same results obtained by the three different methods) were adopted. Overall, 6000 randomly selected, conserved (i.e. shared by at least two species) NF/LF/HF genes, 6000 randomly selected, species-specific NF genes, all of the species-specific LF genes (6225 genes), and all of the species-specific HF genes (7164 genes) were used for subcellular localization analysis. The random selection of gene sets was repeated three times and results from analysis of protein subcellular localization were consistent among the repeated random gene sets.
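
The consensus rule described above (a prediction is adopted only when all three methods agree) can be sketched in Python as follows. The function name `consensus_localization` and the input layout are hypothetical, and the sketch assumes that the labels from the three tools have already been mapped to a shared vocabulary, since the tools' raw output labels may differ.

```python
def consensus_localization(predictions):
    """Keep genes for which every predictor reports the same compartment.

    predictions maps gene ID -> sequence of labels, one per predictor,
    already normalized to a shared vocabulary (an assumption of this sketch).
    """
    consensus = {}
    for gene, labels in predictions.items():
        # Adopt a label only when all predictors agree on it.
        if labels and len(set(labels)) == 1:
            consensus[gene] = labels[0]
    return consensus
```

Genes with any disagreement among the predictors are simply dropped, which trades coverage for a more conservative set of localizations.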

Analysis of gene expression

Expression data for Arabidopsis genes were obtained from AtGenExpress (Schmid et al., 2005; Kilian et al., 2007). K-means clustering of the gene expression pattern was performed using SC(2)ATmd (Olex and Fetrow, 2011) with optimal number of clusters determined by Figure of Merit. 263 species-specific NF genes (NF5), the same number as Viridiplantae-wide NF genes (NF1), were randomly selected for clustering analysis. The random selection of gene sets was repeated three times and results from clustering analysis were consistent among the repeated random gene sets. Arabidopsis co-expression data were obtained from ATTED-II (Obayashi et al., 2007). The co-expression conservation data for one-to-one orthologues in A. thaliana and O. sativa were obtained from Movahedi et al. (2011).

Analysis of protein–protein interaction

The Arabidopsis PPI network data were obtained from AtPIN (Brandao et al., 2009). The hubs were defined as the top 10% most connected nodes (with a minimum degree of 36).
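
The hub definition above can be sketched in Python for an undirected edge list. This is an illustration only: the function name `find_hubs`, the removal of duplicate edges and self-interactions, and the inclusion of all nodes tied at the cutoff degree are assumptions not stated in the text.

```python
from collections import Counter
from math import ceil


def find_hubs(edges, top_fraction=0.10):
    """Return (hubs, min_degree) for an undirected interaction edge list.

    Hubs are the nodes whose degree is at least that of the node ranked at
    the top `top_fraction` of all nodes; ties at the cutoff are included.
    """
    # Treat (a, b) and (b, a) as one interaction and skip self-interactions.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}

    degree = Counter()
    for edge in unique:
        for node in edge:
            degree[node] += 1
    if not degree:
        return set(), 0

    ranked = sorted(degree.values(), reverse=True)
    cutoff_rank = max(1, ceil(top_fraction * len(ranked)))
    min_degree = ranked[cutoff_rank - 1]
    hubs = {node for node, d in degree.items() if d >= min_degree}
    return hubs, min_degree
```

The returned `min_degree` plays the same role as the minimum degree of 36 reported for the AtPIN network. Proteins with no recorded interactions never appear in an edge list, so they are not counted among the nodes here.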

Statistical analysis

Statistical analyses, including Chi-square tests, paired t-tests and ANOVA with LSD tests, were performed using R (www.r-project.org/).
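For readers unfamiliar with the Chi-square test, the following Python sketch computes the Pearson statistic and P-value for a 2 × 2 table using only the standard library. It is an illustration, not the authors' R code: it omits the continuity correction (which R's `chisq.test` applies to 2 × 2 tables unless `correct = FALSE` is set), and the counts are invented.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square test (1 degree of freedom, no continuity correction).

    Table layout:   group 1: a, b
                    group 2: c, d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: hubs vs non-hubs in two hypothetical gene sets.
stat, p_value = chi_square_2x2(10, 20, 20, 10)
```

A small P-value would indicate that the proportion of hubs differs between the two gene sets more than expected by chance.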

Acknowledgements

We would like to thank S.D. Wullschleger for thoughtful and insightful comments on the manuscript. This work was supported by the U.S. Department of Energy, Office of Biological and Environmental Research, Genomic Science Program, the US DOE BioEnergy Science Center and the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. The BioEnergy Science Center is a US Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the US Department of Energy under Contract Number DE-AC05-00OR22725.

References

Supporting Information

Filename | Format | Size | Description
tpj12073-sup-0001-FigS1.pdf | application/PDF | 136K | Figure S1. Numbers of genes in low-copy-number families (LF; 3–10 copies) in different phylogenetic groups.
tpj12073-sup-0002-FigS2.pdf | application/PDF | 129K | Figure S2. Numbers of genes in high-copy-number families (HF; >10 copies) among different phylogenetic groups.
tpj12073-sup-0003-FigS3.pdf | application/PDF | 139K | Figure S3. The non-synonymous to synonymous substitution (Ka/Ks) ratio for the coding region of paralogous genes generated from segmental duplications.
tpj12073-sup-0004-FigS4.pdf | application/PDF | 181K | Figure S4. The top 15 biological processes of genes classified as non-family genes (NF; 1–2 copies per genome) in Arabidopsis.
tpj12073-sup-0005-FigS5.pdf | application/PDF | 340K | Figure S5. K-means clustering of Arabidopsis expression patterns for genes classified as non-family genes in the NF1 group (i.e. Viridiplantae wide; defined in Figure 3).
tpj12073-sup-0006-FigS6.pdf | application/PDF | 595K | Figure S6. K-means clustering of Arabidopsis expression patterns for genes classified as non-family genes in the NF5 group (i.e. species specific; defined in Figure 3).
tpj12073-sup-0007-FigS7.pdf | application/PDF | 197K | Figure S7. A network analysis of an Arabidopsis non-family (NF; 1–2 copies) gene involved in heat response.
tpj12073-sup-0008-FigS8.pdf | application/PDF | 126K | Figure S8. Co-expression conservation of NF, LF and HF genes between A. thaliana and O. sativa.
tpj12073-sup-0009-FigS9.pdf | application/PDF | 125K | Figure S9. Proportion of coding regions in the shorter sequences of duplicated gene pairs that were aligned for Ka/Ks analysis in 200 randomly selected gene pairs.
tpj12073-sup-0010-TableS1.xlsx | application/msexcel | 69K | Table S1. Biological processes over-represented (P < 0.05) in conserved (i.e. shared by at least two species) non-family genes as compared with conserved high-copy-number family genes.
tpj12073-sup-0011-TableS2.xls | application/msexcel | 19K | Table S2. Biological processes over-represented (P < 0.05) in species-specific non-family genes (i.e. gene group NF5 as defined in Figure 3) as compared with species-specific high-copy-number family genes (i.e. gene group HF5 as defined in Figure S2).
tpj12073-sup-0012-TableS3.xls | application/msexcel | 110K | Table S3. Biological processes over-represented (P < 0.05) in conserved (i.e. shared by at least two species) non-family genes as compared with conserved low-copy-number family genes.
tpj12073-sup-0013-TableS4.xls | application/msexcel | 220K | Table S4. Biological processes over-represented (P < 0.05) in conserved (i.e. shared by at least two species) low-copy-number family genes as compared with conserved high-copy-number family genes.
tpj12073-sup-0014-TableS5.xls | application/msexcel | 26K | Table S5. Biological processes over-represented (P < 0.05) in species-specific low-copy-number family genes (i.e. gene group LF5 defined in Figure S1) as compared with species-specific high-copy-number family genes (i.e. gene group HF5 defined in Figure S2).
tpj12073-sup-0015-TableS6.xls | application/msexcel | 20K | Table S6. Biological processes over-represented (P < 0.05) in species-specific non-family genes as compared with conserved (i.e. shared by at least two species) non-family genes.
tpj12073-sup-0016-TableS7.xls | application/msexcel | 19K | Table S7. Number of non-family (NF; 1–2 copies per genome), low-copy-number family (LF; 3–10 copies) and high-copy-number family (HF; >10 copies) genes related to stress responses and cell wall biosynthesis in Arabidopsis.
tpj12073-sup-0017-TableS8.xls | application/msexcel | 54K | Table S8. List of Arabidopsis non-family (NF; 1–2 copies per genome) genes related to stress responses and cell wall biosynthesis based on GO annotation, and NF genes that were co-expressed with genes involved in stress responses and cell wall biosynthesis.
tpj12073-sup-0018-TableS9.xls | application/msexcel | 20K | Table S9. Biological processes over-represented (P < 0.001) in the gene cluster in Figure 6 as compared with the Arabidopsis genome.
tpj12073-sup-0019-TableS10.xls | application/msexcel | 20K | Table S10. Biological processes over-represented (P < 0.001) in the gene cluster in Figure S7A as compared with the Arabidopsis genome.
tpj12073-sup-0020-TableS11.xls | application/msexcel | 19K | Table S11. Biological processes over-represented (P < 0.05) in the gene cluster in Figure 7a as compared with the Arabidopsis genome.
tpj12073-sup-0021-TableS12.xls | application/msexcel | 45K | Table S12. Non-family hub genes in an Arabidopsis protein–protein interaction network (http://bioinfo.esalq.usp.br/atpin/atpin.pl).
tpj12073-sup-0022-Supplementary Legend.docx | Word document | 28K |
