There are a large number of ‘non-family’ (NF) genes that do not cluster into families with three or more members per genome. While gene families have been extensively studied, a systematic analysis of NF genes has not been reported. We performed comparative studies on NF genes in 14 plant species. Based on the clustering of protein sequences, we identified ~94 000 NF genes across these species that were divided into five evolutionary groups: Viridiplantae wide, angiosperm specific, monocot specific, dicot specific, and those that were species specific. Our analysis revealed that the NF genes resulted largely from less frequent gene duplications and/or a higher rate of gene loss after segmental duplication relative to genes in both low-copy-number families (LF; 3–10 copies per genome) and high-copy-number families (HF; >10 copies). Furthermore, we identified functions enriched in the NF gene set as compared with the HF genes. We found that NF genes were involved in essential biological processes shared by all plant lineages (e.g. photosynthesis and translation), as well as gene regulation and stress responses associated with phylogenetic diversification. In particular, our analysis of an Arabidopsis protein–protein interaction network revealed that hub proteins with the top 10% most connections were over-represented in the NF set relative to the HF set. This research highlights the roles that NF genes may play in evolutionary and functional genomics research.
A gene family can be defined as a group of genes with similar sequences, resulting from various gene duplication events and often sharing similar or partially redundant functions (De Grassi et al., 2008; Demuth and Hahn, 2009). Under the most inclusive definition, the majority of genes in plant genomes belong to gene families that contain three or more members per genome. Many gene families in plants have been extensively studied, e.g. MADS-box (Parenicova et al., 2003; Nam et al., 2004), ABC protein superfamily (Verrier et al., 2008), NAC (NAM, ATAF, and CUC) transcription factor (Ooka et al., 2003), glycosyltransferase superfamily (Yin et al., 2010; Ye et al., 2011; Yonekura-Sakakibara and Hanada, 2011), F-box (Yang et al., 2008), APETALA2 (AP2)/ethylene-responsive element binding factor (ERF) superfamily (Nakano et al., 2006), and many protein kinase families (Hrabak et al., 2003). Still, a large number of genes, with one or two copies per genome, do not belong to such gene families; we term these non-family (NF) genes. Only a few studies have addressed the role of NF genes in sequenced plant species (Guo et al., 2007; Duarte et al., 2010), and as such, the function and evolution of NF genes remain unclear.
There is considerable variation in the number of genes within gene families. This variation is due partly to the extent of gene duplication and the retention of duplicates (Ohno, 1970; Chauve et al., 2008). Recent research indicates that the ancestral angiosperm genome contained 12 000–14 000 genes (Proost et al., 2011), far fewer than the number of genes in extant angiosperm genomes (e.g. ~27 000 in Arabidopsis, ~26 000 in Vitis, ~40 000 in Populus and ~40 000 in Oryza) (Goodstein et al., 2012). The expansion in gene number in extant angiosperm genomes appears to have arisen through various gene duplication events (Van de Peer et al., 2009). For example, the F-box gene family is expanded in herbaceous plants relative to woody plants specifically due to a high rate of tandem duplications in the former (Yang et al., 2008). Several models (e.g. neofunctionalization, subfunctionalization, balanced gene drive) have been proposed to explain the evolutionary fate of duplicated genes (Ohno, 1970; Hughes, 1994; Yang et al., 2006; Freeling, 2009). Gene loss after duplication is one of the evolutionary modes that contribute to the contraction of gene families; for example, Arabidopsis thaliana lost around 5700 genes after its divergence from A. lyrata (Proost et al., 2011).
To understand the evolution and function of NF genes in plants, we performed a large-scale comparative analysis of NF genes in diverse plant species ranging from algae to moss to angiosperms. In a comparative framework, we analyzed multiple aspects of the NF genes (one or two copies per genome) in comparison with low-copy-number gene families (LF; 3–10 copies) and high-copy-number gene families (HF; >10 copies). Our analysis revealed differences in the frequency of gene duplication, evolutionary rate, gene ontology, subcellular localization and gene expression between NF genes and genes in the LF and HF categories, and among NF genes in different evolutionary groups (i.e. Viridiplantae wide, angiosperm specific, monocot specific, dicot specific, and species specific). We found that some plant functions involved more NF genes than HF genes. We identified NF genes that were involved in stress responses as well as cell wall biosynthesis. Furthermore, our analysis of an Arabidopsis protein–protein interaction network revealed that a higher proportion of the NF gene set was classified as hubs (i.e. highly connected nodes) as compared with the HF gene set. We present this study to highlight the roles of NF genes in evolutionary and functional genomics research.
Non-family (NF) genes in plant genomes
Based on clustering analysis of protein sequences, we divided the protein-encoding genes from 14 diverse plant species, including algae (Chlamydomonas reinhardtii and Volvox carteri), moss (Physcomitrella patens), lycophyte (Selaginella moellendorffii), monocots (Brachypodium distachyon, Oryza sativa, Sorghum bicolor and Zea mays), and dicots (Arabidopsis thaliana, Carica papaya, Glycine max, Populus trichocarpa, Solanum tuberosum and Vitis vinifera), into three categories: (i) genes with only 1 or 2 copies per genome, i.e. NF genes, (ii) genes in low-copy-number families (LF), containing 3–10 copies per genome, and (iii) genes in high-copy-number families (HF), containing more than 10 copies per genome (Figure 1). Non-vascular plants contained a significantly (P < 0.01, chi-squared) higher proportion of NF genes and a significantly (P < 0.01, chi-squared) lower proportion of HF genes relative to vascular plants (Figure 1). Furthermore, the NF category contained significantly (P < 0.01, paired t-test) more single-copy genes than two-copy genes (Figure 2).
Among all NF genes, we identified five evolutionary groups: (i) NF1 – genes having homologues in all the 14 species studied (i.e. found Viridiplantae wide), (ii) NF2 – genes having homologues in all and only the 10 angiosperm species studied (i.e. angiosperm specific), (iii) NF3 – genes having homologues in all and only the four monocot species studied (i.e. monocot specific), (iv) NF4 – genes having homologues in all and only the six dicot species studied (i.e. dicot specific), and (v) NF5 – genes not conserved among the species studied (i.e. species specific) (Figure 3). The number of genes in NF5 was significantly [P < 0.01, anova (analysis of variance) + least significant difference (LSD) test] higher than that in the other groups (i.e. NF1, NF2, NF3, NF4). Interestingly, whereas five groups were identified among the NF genes (NF1–NF5), only four corresponding groups could be identified among the LF or HF genes: the dicot-specific group (LF4 or HF4), which would contain LF or HF genes having homologues in all and only the six dicot species, was absent (Figures S1 and S2). This suggests that the shared evolutionary history of the LF and HF gene families in dicots can be dated to before the divergence between monocots and dicots.
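The grouping rule above can be sketched in a few lines of Python. This is only an illustration of the definitions in the text, not the authors' pipeline: the species labels, the function name `evolutionary_group`, and the choice to return `None` for clusters that fit none of the five patterns are all assumptions made here.

```python
# Species sets taken from the text; the abbreviated labels are illustrative.
MONOCOTS = frozenset({"B. distachyon", "O. sativa", "S. bicolor", "Z. mays"})
DICOTS = frozenset({"A. thaliana", "C. papaya", "G. max",
                    "P. trichocarpa", "S. tuberosum", "V. vinifera"})
NON_ANGIOSPERMS = frozenset({"C. reinhardtii", "V. carteri",
                             "P. patens", "S. moellendorffii"})
ANGIOSPERMS = MONOCOTS | DICOTS
ALL_SPECIES = ANGIOSPERMS | NON_ANGIOSPERMS


def evolutionary_group(species_with_homologue):
    """Return the group number (1-5) for the set of species in which a
    gene has homologues, or None if no group definition matches."""
    present = frozenset(species_with_homologue)
    if present == ALL_SPECIES:
        return 1  # Viridiplantae wide
    if present == ANGIOSPERMS:
        return 2  # all and only the 10 angiosperms
    if present == MONOCOTS:
        return 3  # all and only the four monocots
    if present == DICOTS:
        return 4  # all and only the six dicots
    if len(present) == 1:
        return 5  # species specific
    return None  # assumption: other patterns are left unassigned
```

The same function applies unchanged to LF and HF clusters, which is how a missing group such as LF4 or HF4 would show up as an empty class.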
Duplication and loss of NF genes
Using syntenic gene order, our analysis of duplication events in A. thaliana, B. distachyon, G. max, O. sativa, P. trichocarpa, S. bicolor, V. vinifera and Z. mays revealed that the frequency of tandem duplication in the NF genes was significantly (P < 0.001, paired t-test) lower than that in both the LF and HF categories, and that the frequency of tandem duplication in the LF genes was significantly (P < 0.001, paired t-test) lower than that in the HF genes (Figure 4a). The frequency of segmental duplication in NF genes was significantly (P < 0.01, paired t-test) lower than that in both the LF and HF categories, and there was no significant difference in the frequency of segmental duplication between the LF and HF categories (Figure 4b). Furthermore, NF genes had a significantly (P < 0.01, paired t-test) higher frequency of gene loss after syntenic segmental duplication than genes in both the LF and HF categories, and genes in the LF category had a higher frequency (P < 0.001, paired t-test) of gene loss than those in the HF category (Figure 4c). These data suggest that the NF genes resulted mainly from a higher rate of gene loss after segmental duplication as well as a lower rate of tandem gene duplication.
Evolutionary fate of NF genes
To investigate the evolutionary fate of NF genes after duplication, we examined the non-synonymous/synonymous (Ka/Ks) ratio for gene pairs resulting from segmental duplication. Our analysis showed that NF genes had a significantly (P < 0.01, anova + LSD test) higher Ka/Ks ratio than genes in the LF and HF categories (Figure S3a), indicating that NF genes may have been under more relaxed selection than LF or HF genes. Furthermore, the species-specific NF genes (i.e. category NF5) had a significantly (P < 0.01, anova + LSD test) higher Ka/Ks ratio than the NF genes conserved among all plant species investigated (i.e. NF1) (Figure S3b), suggesting that the species-specific NF genes may have been under more relaxed selection than the conserved NF genes.
Functions of NF genes
Gene ontology (GO) analysis revealed that NF genes were disproportionately involved in various biological processes, with the most over-represented processes including nitrogen metabolism, biosynthetic process, biological regulation and response to stimulus (Figure S4). GO enrichment analysis revealed that many biological processes (e.g. translation, nucleoside metabolism, cofactor metabolism and photosynthesis) were significantly (P < 0.05, Fisher) over-represented in conserved NF genes (i.e. shared by at least two species) relative to conserved HF genes (Figure 5 and Table S1). Several biological processes (e.g. gene expression) were significantly (P < 0.05, Fisher) enriched in species-specific NF genes relative to species-specific HF genes (NF5 versus HF5; Table S2). Some biological processes (e.g. heterocycle biosynthetic process) were enriched (P < 0.05, Fisher) in conserved NF genes in comparison with conserved LF genes (Table S3). We also found some enriched (P < 0.05, Fisher) biological processes in conserved and species-specific LF genes in comparison with conserved and species-specific HF genes, respectively (Tables S4 and S5). In addition, some biological processes were enriched (P < 0.05, Fisher) in species-specific NF genes relative to conserved NF genes (Table S6).
To determine whether there were differences in gene functions at the subcellular level among the gene-copy categories, we predicted the protein subcellular localization of genes in the NF, LF and HF categories. The results showed that (i) plasma membrane localization was over-represented in the conserved (i.e. shared by at least two species) NF, LF and HF categories, (ii) chloroplast localization was over-represented in the conserved NF and LF categories, and (iii) extracellular space and nucleus localization were over-represented in the species-specific NF and LF categories (P < 0.01, the cumulative Poisson distribution; Table 1). These over-represented subcellular localizations were in concordance with the over-represented biological processes: chloroplast localization and photosynthesis were both over-represented in conserved NF and LF genes, and nucleus localization and regulation processes were both over-represented in species-specific NF and LF genes (Tables 1, S1, S2, S4 and S5). Furthermore, we found that chloroplast and mitochondrion localization was over-represented in NF1 (Viridiplantae-wide NF genes), nucleus localization was over-represented in NF2 (angiosperm-specific NF genes) and NF5 (species-specific NF genes), and extracellular localization was over-represented in NF3 and NF4 (P < 0.01, the cumulative Poisson distribution; Table 2).
Table 1. Subcellular localization of non-family (NF; 1 or 2 copies per genome) genes, low-copy-number family (LF; 3–10 copies) genes and high-copy-number family (HF; >10 copies) genes. ‘Conserved’ indicates that the genes are shared by at least two species.
Columns: NF conserved (%) | NF species specific (%) | LF conserved (%) | LF species specific (%) | HF conserved (%) | HF species specific (%)
Note: Over-representation (P < 0.01) of gene groups in each subcellular localization; P-value was calculated using the cumulative Poisson distribution.
Table 2. The relative abundance (i.e. proportion of genes in each gene group) of subcellular localizations in the different groups of non-family (NF; 1 or 2 copies per genome) genes. NF1 (Viridiplantae wide), NF2 (angiosperm specific), NF3 (monocot specific), NF4 (dicot specific), and NF5 (species specific) are as defined in Figure 3.
NF3 and NF4 (%)
Over-representation (P < 0.01) of gene groups in each subcellular localization; P-value was calculated using the cumulative Poisson distribution.
To obtain experimental support for our computational prediction of NF gene functions, we interrogated the AtGenExpress array data (Kilian et al., 2007) for Arabidopsis genes in the NF1 category (Viridiplantae-wide NF genes) versus NF5 (species-specific NF genes). Our analysis revealed eight and 15 expression clusters for NF1 and NF5, respectively (Figures S5 and S6). This finding suggests that species-specific genes are involved in more diverse stress responses than genes conserved across a large evolutionary distance.
Genes that are highly co-expressed may be involved in similar biological processes (Usadel et al., 2009). Our previous co-expression analysis identified 692 genes (plus two obsolete gene models) associated with cell wall biosynthesis in Arabidopsis (Yang et al., 2011a). Among these 692 genes, there were 70 NF genes, 177 LF genes and 445 HF genes (Tables S7 and S8). For example, two NF genes, AT2G31930 and AT2G41610, were co-expressed with known cell wall biosynthesis genes and preferentially expressed in stem tissue enriched with secondary cell wall materials (Figure 6). GO enrichment analysis showed that secondary cell wall biosynthesis was significantly (P < 0.001, Fisher) over-represented in this co-expressed gene subnetwork as compared with the whole Arabidopsis genome (Table S9).
To explore the roles of NF genes in stress responses, we investigated the association of NF genes with known genes responsive to abiotic (i.e. cold, heat, salt, water deprivation) and biotic (i.e. bacteria and fungi) stresses in Arabidopsis. We identified 713 NF genes, 1268 LF genes and 2538 HF genes associated with stress responses (Tables S7 and S8). For example, the NF gene AT5G35320 was co-expressed with several heat shock protein genes (Figure S7a), and was highly expressed under heat stress (Figure S7b). Genes related to heat response were significantly (P < 0.001, Fisher) enriched in this co-expression subnetwork, as compared with the whole Arabidopsis genome (Table S10). The NF gene AT3G01420 was co-expressed with several biotic stress-responsive genes (Figure 7a) and was highly expressed in Arabidopsis leaves infected with Pseudomonas (Figure 7b). Genes related to defense response were significantly (P < 0.05, Fisher) enriched in this co-expression subnetwork, as compared with the whole Arabidopsis genome (Table S11).
Previous analysis revealed that 77% of one-to-one orthologous gene pairs between A. thaliana and O. sativa showed conserved co-expression (Movahedi et al., 2011). We examined the co-expression conservation of one-to-one orthologous gene pairs in NF, LF and HF between A. thaliana and O. sativa. Our analysis revealed that NF genes had a significantly (P < 0.01, the cumulative Poisson distribution) higher rate of co-expression conservation than HF genes (Figure S8).
Protein–protein interactions (PPIs) are crucial for a large number of cellular processes in living organisms (He and Zhang, 2006; Vallabhajosyula et al., 2009). To determine the contribution of NF proteins to protein–protein interactions, we investigated an Arabidopsis PPI network, AtPIN (Brandao et al., 2009). The results showed that a significantly (P < 0.01, the cumulative Poisson distribution) higher proportion of the NF set could be classified as hubs, the top 10% most connected nodes (with a minimum degree of 36) in the PPI network, relative to the LF and HF sets (Figure 8 and Table S12). In addition, all of the hubs in the PPI network were found in the conserved (i.e. shared by at least two species) NF, LF and HF categories, consistent with the view that hub proteins are generally evolutionarily conserved (Vallabhajosyula et al., 2009).
With the rapid accumulation of genome sequencing data in public databases, our knowledge about genes found in expanded families (i.e. three or more gene copies per genome) has greatly increased. However, questions related to the function and evolution of NF genes (i.e. one or two copies per genome) remain unanswered. If the NF genes are involved in principal biological processes, why have they not expanded as much as the family genes have? To begin to address these questions, we performed a large-scale systematic analysis of NF genes in 14 plant genomes across a large evolutionary space. Our study provides initial evidence that NF genes play an important role in plant functions and that their evolutionary dynamics are different from those of family genes. Key functions (e.g. cellular nitrogen compound metabolic process, translation, cofactor metabolic process) were enriched in NF genes (Figure 5 and Table S1). In particular, some essential biological processes that are shared by all plant lineages, such as photosynthesis in chloroplasts and respiration in mitochondria, were significantly (P < 1 × 10⁻⁷⁷, Fisher) over-represented in the NF genes (Figure 5 and Table S1). Our comparative analysis indicates that NF genes resulted largely from a higher rate of gene loss after segmental duplications and/or a lower frequency of tandem duplications (Figure 4). This is consistent with previous reports showing that tandem duplication is one of the main factors contributing to the expansion of gene families (Cannon et al., 2004; Yang et al., 2008), and that gene loss is a common aspect of evolutionary dynamics after gene duplication (Yang et al., 2006, 2011b; Freeling, 2009).
There are still a large number of NF genes annotated as unknown, putative or hypothetical, especially among the NF genes in the species-specific group (i.e. NF5). These NF genes should be considered as candidates for future functional studies. It was reported that the proportion of essential singleton genes was higher than that of essential duplicated genes in yeast and nematode, whereas the singletons and duplicates were determined to be equally essential in mouse (Liao and Zhang, 2007). In plants, there is currently very limited information on NF mutants relative to yeast, nematode and mouse. Future efforts are needed to create knockout mutants for the majority, if not all, of the genes in several model plant species in order to estimate the proportion of essential genes among NF and family genes. Hub proteins tend to be more essential than non-hub proteins (He and Zhang, 2006; Vallabhajosyula et al., 2009). We found that hub proteins in the Arabidopsis protein–protein interaction network were over-represented in the NF set relative to the HF set (Figure 8), suggesting that NF genes may play more important roles than expected in plants.
Clustering of protein sequences
The annotated non-TE (transposable element) protein sequences of 14 plant species, including Chlamydomonas reinhardtii (www.Phytozome.net; Phytozome), Volvox carteri (Phytozome), Physcomitrella patens (Phytozome), Selaginella moellendorffii (Phytozome), Brachypodium distachyon (Phytozome), Sorghum bicolor (Phytozome), Vitis vinifera (Phytozome), Carica papaya (Phytozome), Glycine max (Phytozome), Populus trichocarpa (Phytozome), Oryza sativa (rice.plantbiology.msu.edu), Zea mays (www.maizesequence.org), Solanum tuberosum (potatogenomics.plantbiology.msu.edu), and Arabidopsis thaliana (www.Arabidopsis.org), were used to cluster protein sequences. Where multiple transcripts were annotated for one gene locus, the longest protein sequence was selected. All-against-all BlastP searches of these protein sequences were performed using Blast+ (Camacho et al., 2009) with an E-value cutoff of 1 × 10⁻³, followed by clustering analysis using TRIBE-MCL with an inflation value of 1.2 (Enright et al., 2002). Based on the protein clustering analysis, the genes were classified into three categories: NF, defined as gene clusters containing 1 or 2 genes in one genome; low-copy-number family (LF), defined as gene clusters containing 3–10 genes in one genome; and HF, defined as gene clusters containing more than 10 genes in one genome.
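As a minimal sketch of the final classification step, the following Python assumes the clustering output has already been parsed into lists of gene identifiers and that each gene can be mapped to its species. The function names and input shapes are hypothetical and are not part of TRIBE-MCL or of the original pipeline; genes left unclustered are assumed to appear as single-member clusters.

```python
from collections import Counter


def copy_number_category(copies_in_genome):
    """Map a per-genome cluster size to NF (1-2), LF (3-10) or HF (>10)."""
    if copies_in_genome < 1:
        raise ValueError("a gene counts as at least one copy")
    if copies_in_genome <= 2:
        return "NF"
    if copies_in_genome <= 10:
        return "LF"
    return "HF"


def classify_genes(clusters, gene_to_species):
    """Assign every gene the category of its cluster within its own genome.

    clusters: iterable of lists of gene IDs (one list per protein cluster).
    gene_to_species: dict mapping gene ID to a species label.
    """
    categories = {}
    for members in clusters:
        # Count how many members of this cluster each genome contributes.
        per_species = Counter(gene_to_species[g] for g in members)
        for gene in members:
            copies = per_species[gene_to_species[gene]]
            categories[gene] = copy_number_category(copies)
    return categories
```

Because the count is taken per genome, a cluster can be NF in one species and LF or HF in another.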
Identification of gene duplication and gene loss
The information for segmental duplication was obtained from SynMap (synteny.cnr.berkeley.edu). Tandemly duplicated genes were defined as an array of two or more genes that were in the same protein cluster and were found within a 100-kb genomic window (Yang et al., 2008). Gene loss after segmental duplication was defined as the absence of a homologous gene within one copy of a duplicated block, as compared with the other copy in the syntenic region. For example, if a syntenic block contains the duplicated fragments ‘gene A – gene B – gene C – gene D – gene E – gene F – gene G – gene H’ (fragment 1; gene F was not generated by tandem duplication) and ‘gene A* – gene B* – gene C* – gene D* – gene E* – gene G* – gene H*’ (fragment 2; genes A*, B*, C*, D*, E*, G*, and H* are duplicated copies of genes A, B, C, D, E, G, and H, respectively), we consider that there was a gene loss (i.e. gene F*) on fragment 2.
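The two definitions above can be illustrated with a short Python sketch. It is not the original implementation: the input tuples, the function names, and in particular the reading of the 100-kb window as a maximum distance between the start coordinates of consecutive array members are assumptions made for the example.

```python
from collections import defaultdict


def tandem_arrays(genes, window=100_000):
    """Group same-cluster genes on one chromosome into tandem arrays.

    genes: iterable of (gene_id, chromosome, start, cluster_id).
    Assumption: consecutive members must start within `window` bp.
    """
    by_key = defaultdict(list)
    for gene_id, chrom, start, cluster in genes:
        by_key[(chrom, cluster)].append((start, gene_id))
    arrays = []
    for members in by_key.values():
        members.sort()
        current = [members[0]]
        for start, gene_id in members[1:]:
            if start - current[-1][0] <= window:
                current.append((start, gene_id))
            else:
                if len(current) >= 2:
                    arrays.append([g for _, g in current])
                current = [(start, gene_id)]
        if len(current) >= 2:
            arrays.append([g for _, g in current])
    return arrays


def genes_lost_on_partner(fragment, duplicate_of, tandem_genes=frozenset()):
    """Genes on `fragment` with no duplicate on the partner fragment,
    ignoring genes that arose by tandem duplication."""
    return [g for g in fragment
            if g not in duplicate_of and g not in tandem_genes]
```

Applied to the worked example in the text, fragment 1 is genes A to H, the duplicate map covers every gene except F, and the second function reports F as the gene whose copy was lost on fragment 2.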
Evolutionary analysis of non-family genes
Non-synonymous (Ka) and synonymous (Ks) substitution rates of full-length coding sequences were calculated using the KaKs_Calculator with the gamma-modified Yang–Nielsen (GMYN) method (Wang et al., 2010). The same number (3679) of gene pairs generated from segmental duplication was randomly selected from each of the NF, LF and HF categories for Ka/Ks analysis. In the alignments used for Ka/Ks analysis, more than 70% of the coding region of the shorter sequence in each duplicated gene pair was aligned, as determined from 200 randomly selected gene pairs (Figure S9).
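Drawing the same number of items from each category, as described above, can be sketched as follows. The helper name, the seed handling and the default of using the size of the smallest category are assumptions for illustration; the text only states that 3679 pairs were drawn from each category.

```python
import random


def balanced_sample(groups, n=None, seed=0):
    """Draw the same number of items, without replacement, from each group.

    groups: dict mapping a category name to a sequence of items.
    n: items per group; defaults to the size of the smallest group
       (an assumption, not stated in the text).
    """
    rng = random.Random(seed)  # fixed seed so a draw can be reproduced
    if n is None:
        n = min(len(items) for items in groups.values())
    return {name: rng.sample(list(items), n)
            for name, items in groups.items()}
```

Calling the function with different seeds gives the kind of repeated random gene sets that can be compared for consistency.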
Gene ontology (GO) analysis
We obtained whole-genome GO term annotation for the 14 species investigated in this study using Blast2GO with a BlastP E-value hit filter of 1 × 10⁻⁶, an annotation cutoff value of 55, and GO weight of 5 (Conesa et al., 2005). GO enrichment analysis was performed using Blast2GO or agriGO (bioinfo.cau.edu.cn/agriGO/) with Fisher's exact test (Du et al., 2010), and GO terms were summarized using the web server REVIGO (Supek et al., 2011). For pair-wise enrichment comparison between the NF, LF and HF sets, all the genes in each set were used. For example, for calculating enrichment of GO terms in the conserved NF set relative to the conserved HF set, all of the conserved NF genes were studied with all of the conserved HF genes used as a reference.
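To make the study-versus-reference comparison concrete, here is a standard-library Python sketch of a one-sided Fisher's exact test for a single GO term. It is not the Blast2GO or agriGO implementation, and the choice of a one-sided (over-representation) test is an assumption, since the text does not say which alternative was used. The function name and arguments are hypothetical.

```python
from math import comb


def fisher_enrichment_p(study_with_term, study_total,
                        ref_with_term, ref_total):
    """One-sided Fisher's exact test for over-representation of a term
    in a study set relative to a disjoint reference set.

    Equivalent to the upper tail of a hypergeometric distribution over
    the pooled genes (study + reference).
    """
    with_term = study_with_term + ref_with_term
    total = study_total + ref_total
    denominator = comb(total, study_total)
    upper = min(with_term, study_total)
    numerator = sum(
        comb(with_term, k) * comb(total - with_term, study_total - k)
        for k in range(study_with_term, upper + 1)
    )
    return numerator / denominator
```

For instance, the study set would be the conserved NF genes and the reference set the conserved HF genes, with one call per GO term; a multiple-testing correction would normally follow and is not shown here.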
Prediction of protein subcellular localization
The prediction of protein subcellular localization was performed using YLoc (Briesemeister et al., 2010) with the model set as ‘YLoc+’ and version set as ‘Plants’, WoLF PSORT (Horton et al., 2007) with the organism set as ‘Plant’, and CELLO (Yu et al., 2006) with the organism set as ‘Eukaryotes’. Only the consensus results (i.e. the same result obtained by all three methods) were adopted. Overall, 6000 randomly selected, conserved (i.e. shared by at least two species) NF/LF/HF genes, 6000 randomly selected, species-specific NF genes, all of the species-specific LF genes (6225 genes), and all of the species-specific HF genes (7164 genes) were used for subcellular localization analysis. The random selection of gene sets was repeated three times, and the results from analysis of protein subcellular localization were consistent among the repeated random gene sets.
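The consensus rule can be written as a one-function Python sketch. It assumes the three predictors' outputs have already been mapped onto a shared vocabulary of compartment names, which in practice is a separate harmonization step because each tool uses its own labels; the function name and arguments are hypothetical.

```python
def consensus_localization(yloc, wolf_psort, cello):
    """Return the compartment if all three predictions agree, else None.

    Labels are compared case-insensitively after trimming whitespace;
    they are assumed to be harmonized to one vocabulary beforehand.
    """
    labels = {label.strip().lower() for label in (yloc, wolf_psort, cello)}
    return labels.pop() if len(labels) == 1 else None
```

Proteins for which the function returns `None` have no consensus and would simply be left out of the localization tallies.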
Analysis of gene expression
Expression data for Arabidopsis genes were obtained from AtGenExpress (Schmid et al., 2005; Kilian et al., 2007). K-means clustering of the gene expression patterns was performed using SC(2)ATmd (Olex and Fetrow, 2011), with the optimal number of clusters determined by Figure of Merit. A total of 263 species-specific NF genes (NF5), the same number as the Viridiplantae-wide NF genes (NF1), were randomly selected for clustering analysis. The random selection of gene sets was repeated three times, and the results from clustering analysis were consistent among the repeated random gene sets. Arabidopsis co-expression data were obtained from ATTED-II (Obayashi et al., 2007). The co-expression conservation data for one-to-one orthologues in A. thaliana and O. sativa were obtained from Movahedi et al. (2011).
Analysis of protein–protein interaction
The Arabidopsis PPI network data were obtained from AtPIN (Brandao et al., 2009). Hubs were defined as the top 10% most connected nodes (with a minimum degree of 36).
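A minimal Python sketch of this hub definition is given below. The edge-list input, the function name, the removal of self-interactions and duplicate edges, and the inclusion of all nodes tied at the cutoff degree are assumptions made for the example rather than details taken from AtPIN or the original analysis.

```python
from collections import Counter


def find_hubs(edges, top_fraction=0.10):
    """Return (hubs, cutoff): nodes in the top fraction by degree and the
    minimum degree that qualifies a node as a hub.

    edges: iterable of (protein_a, protein_b) undirected interactions.
    """
    # Treat the network as undirected; drop self-loops and duplicates.
    unique = {frozenset((a, b)) for a, b in edges if a != b}
    degree = Counter()
    for pair in unique:
        for node in pair:
            degree[node] += 1
    if not degree:
        return set(), 0
    ranked = sorted(degree.values(), reverse=True)
    n_top = max(1, int(top_fraction * len(ranked)))
    cutoff = ranked[n_top - 1]
    # Nodes tied at the cutoff are all kept, so the set may exceed 10%.
    hubs = {node for node, d in degree.items() if d >= cutoff}
    return hubs, cutoff
```

The returned cutoff plays the role of the minimum degree quoted in the text, and the hub set can then be intersected with the NF, LF and HF gene lists.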
Statistical analyses, including chi-squared tests, paired t-tests and anova with LSD tests, were performed using R (www.r-project.org/).
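Several P-values in the Results are attributed to the cumulative Poisson distribution, but the exact set-up is not described. One common way to apply it, shown here purely as an assumed illustration, is to take the expected count of genes in a class under a background proportion and compute the probability of seeing at least the observed count. The function name and the example numbers in the note below are hypothetical.

```python
from math import exp, fsum, lgamma, log


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Terms are evaluated in log space to avoid overflow for large means.
    The result is 1 minus the lower tail, so very small P-values lose
    precision; that is acceptable for a sketch.
    """
    if observed <= 0:
        return 1.0
    if expected <= 0:
        return 0.0
    lower = fsum(
        exp(i * log(expected) - expected - lgamma(i + 1))
        for i in range(observed)
    )
    return max(0.0, 1.0 - lower)
```

As a usage note, if a background proportion of 5% applied to a group of 200 genes, the expected count would be 10, and `poisson_upper_tail(25, 10.0)` would give the probability of observing 25 or more such genes by chance.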
We would like to thank S.D. Wullschleger for thoughtful and insightful comments on the manuscript. This work was supported by the U.S. Department of Energy, Office of Biological and Environmental Research, Genomic Science Program, the US DOE BioEnergy Science Center and the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. The BioEnergy Science Center is a US Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the US Department of Energy under Contract Number DE-AC05-00OR22725.
Conflict of Interest
The authors have no conflict of interest to declare.