Asymmetric evolution of duplicate genes encoding the CCAAT-binding factor NF-Y in plant genomes


Author for correspondence: Ji Yang Tel: +86 (10) 62753035 Fax: +86 (10) 62751526 Email:


  • NF-Y is a ubiquitous CCAAT-binding factor composed of NF-YA, NF-YB and NF-YC. Multiple genes encoding NF-Y subunits have been identified in plant genomes. It remains unclear whether the duplicate genes underwent different evolutionary patterns.
  • Likelihood-ratio tests were used to examine whether the amino acid substitution rates are the same between duplicate genes. The influences of selection on evolution were evaluated by comparing the conservative and radical amino acid substitution rates, as well as maximum-likelihood analysis.
  • Some NF-YB and NF-YC duplicates showed significant evidence of asymmetric evolution but not the NF-YA duplicates. Most amino acid replacements in the NF-YB and NF-YC duplicates result in changes in hydropathy, polar requirement and polarity. The physicochemical changes in the sequences of NF-YB seem to be coupled to asymmetric divergence in gene function.
  • Plant NF-Y genes have evolved in different patterns. Relaxed selective constraints following gene duplication are most likely responsible for the unequal evolutionary rates and distinct divergence patterns of duplicate NF-Y genes. Positive selection may have promoted amino acid hydropathy changes in the NF-YC duplicates.


The CCAAT box is a widespread regulatory sequence found in promoters and enhancers of a large number of genes (Bucher, 1990; Li et al., 1992; Mantovani, 1998). The functional importance of the CCAAT box, as a positive promoter element, has been well established in different systems (reviewed in Mantovani, 1998). Many DNA-binding proteins have been found to bind to CCAAT boxes (Li et al., 1992; Maity & Crombrugghe, 1998; Mantovani, 1999). NF-Y (also termed CBF) is a ubiquitous CCAAT-specific binding factor, which has a high affinity and sequence specificity for the CCAAT sequence (Li et al., 1992; Bellorini et al., 1997; Maity & Crombrugghe, 1998; Frontini et al., 2002).

The NF-Y CCAAT-specific binding factor is a heterotrimer composed of three subunits, all of which are essential for CCAAT binding. The three subunits are referred to as NF-YA (also known as CBF-B and HAP2), NF-YB (also CBF-A and HAP3) and NF-YC (CBF-C, HAP5) throughout this manuscript. The NF-Y heterotrimer is constructed by association of NF-YA with a tight dimer formed from NF-YB and NF-YC. This dimer produces a protein structure similar to the histone fold motif (HFM), and it is with this complex surface that NF-YA associates. The resulting trimer has been shown to have a high affinity for DNA (Mantovani, 1999; Gusmaroli et al., 2001, 2002; Frontini et al., 2002; Romier et al., 2003). Genes encoding NF-Y subunits have been isolated from various organisms. In contrast to the situation in yeast and most vertebrates, in which all the subunits are encoded by single-copy genes, multiple and distinct genes for each subunit have been identified in plant genomes (Edwards et al., 1998; Gusmaroli et al., 2001, 2002). Gusmaroli et al. (2001, 2002) identified 29 NF-Y subunit genes in Arabidopsis thaliana, including 10 NF-YAs, 10 NF-YBs and nine NF-YCs. A search of the GenBank database led to the identification of five independent NF-YA homologous genes, 10 NF-YB genes and 10 distinct NF-YC genes in the rice genome (J. Yang, unpublished data).

Duplication is a prevalent feature of plant genomes and many genes are found in tandem arrays or in duplicated segmental clusters (Cronk, 2001). The duplicate genes can arise from tandem duplications or from polyploidization events (Lawton-Rauh, 2003). Three potential evolutionary fates of duplicated genes have been suggested (Ohno, 1970; Lynch & Conery, 2000): (i) one copy of the pair simply becomes silenced by degenerative mutations after gene duplication (non-functionalization); (ii) one copy acquires a novel, beneficial function and becomes preserved by natural selection (neofunctionalization), with the other copy retaining the original function; (iii) both duplicates experience loss or reduction of expression for different subfunctions by degenerative mutations and establish distinct complementary functions. The combined action of both gene copies is necessary to fulfil the requirements of ancestral genes (subfunctionalization) (Hughes, 1994; Force et al., 1999). Questions concerning whether duplicate genes undergo different evolutionary patterns following duplication, and what the factors are that determine the fate of duplicate genes, are currently under intense research (Zhang et al., 2003).

There is often an acceleration of the rate of evolution following gene duplication (Li, 1985; Ohta, 1993, 1994; Lynch & Conery, 2000). Studies of several gene families have indicated that natural selection accelerates the fixation rate of non-synonymous substitutions shortly after a duplication event, presumably to adapt those proteins to a new or modified function (Zhang et al., 1998; Bielawski & Yang, 2003). However, an accelerated non-synonymous substitution rate could also be driven by relaxation, but not complete loss, of selective constraints. Here, duplicated proteins evolve under relaxed functional constraints for some period of time, after which functional divergence occurs when formerly neutral substitutions convey a selective advantage in a novel environment or genetic background (Bielawski & Yang, 2003). Meanwhile, degenerative mutations in regulatory subfunctions may also be accelerated under relaxed selective constraints and lead to subfunctionalization of duplicate genes.

For protein-coding genes, the traditional approach to inferring the magnitude of selective constraint and positive selection is to compare the non-synonymous (dN) and synonymous (dS) substitution rates (ω = dN/dS), with ω < 1.0, ω ≈ 1.0 and ω > 1.0 indicating purifying selection, neutral evolution and positive selection, respectively. However, the influences of selection on molecular evolution can also be evaluated by comparing the conservative and radical substitution rates in amino acid sequences. An amino acid substitution can be classified as either conservative or radical, depending on whether it involves a change in a certain physicochemical property of the amino acid (Zuckerkandl & Pauling, 1965; Dayhoff et al., 1972; Zhang, 2000). It has been proposed that substitutions which tend to conserve amino acid physicochemical properties, termed conservative substitutions, are more common than substitutions which cause large changes in physicochemical properties, termed radical substitutions (Clark, 1970; Dayhoff et al., 1972). This difference in frequency is usually explained by a higher intensity of purifying selection on radical mutations than on conservative mutations. A significantly higher rate of radical than conservative non-synonymous substitution has been taken as evidence for positive Darwinian selection on radical substitutions (Hughes et al., 1990, 2000; Zhang, 2000; McClellan & McCracken, 2001).
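The conservative/radical distinction can be illustrated with a minimal sketch. It assumes a charge-based grouping of amino acids (positive: R, H, K; negative: D, E; neutral: all others), which is one of several possible property classifications, and it simply counts differences between two aligned sequences. It does not normalise by the numbers of conservative and radical sites, so it yields counts rather than substitution rates. The function name and the example sequences are hypothetical.

```python
# Illustrative sketch only: classify amino acid differences between two
# aligned sequences as conservative or radical with respect to charge.
# The grouping below is an assumed charge classification, not the full
# set of properties analysed in this study.
CHARGE_CLASS = {aa: "positive" for aa in "RHK"}
CHARGE_CLASS.update({aa: "negative" for aa in "DE"})
CHARGE_CLASS.update({aa: "neutral" for aa in "ACFGILMNPQSTVWY"})


def count_charge_changes(seq_a, seq_b):
    """Return (conservative, radical) counts for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    conservative = radical = 0
    for a, b in zip(seq_a, seq_b):
        # Skip identical sites, gaps and ambiguous residues.
        if a == b or a not in CHARGE_CLASS or b not in CHARGE_CLASS:
            continue
        if CHARGE_CLASS[a] == CHARGE_CLASS[b]:
            conservative += 1  # charge class preserved
        else:
            radical += 1       # charge class changed
    return conservative, radical


# Hypothetical aligned fragments: K->R (conservative), D->K (radical),
# A=A (unchanged), L->V (conservative).
example_counts = count_charge_changes("KDAL", "RKAV")
```

A higher radical than conservative count in such a tally is only suggestive; a formal test requires rates per site and an appropriate null expectation.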

In this study, the likelihood-ratio test was used to examine the amino acid substitution rates of duplicate NF-Y genes in the A. thaliana and rice genomes. The selective influences on the evolution of plant NF-Y genes were evaluated by comparing the conservative and radical substitution rates. We address three related sets of questions. (i) How are these genes related to each other? Do the duplicates of plant NF-Y genes evolve at the same rates at the amino acid level following duplication? (ii) If the duplicates have evolved asymmetrically, do they exhibit similar amino acid substitution patterns in different plant genomes? What are the major factors that are responsible for the asymmetric divergence? (iii) Is the asymmetric evolution of duplicate NF-Y genes in gene sequences coupled to asymmetric divergence in gene functions?

Materials and Methods

The amino acid and nucleotide sequences of the NF-Y subunits in the Arabidopsis thaliana and rice (Oryza sativa ssp. japonica) genomes were compiled by searching the GenBank database using the BLASTP, PSI-BLAST and TBLASTN algorithms, with the default filter setting and an expectation cutoff of 1.0. The amino acid sequences of the HAP2 (accession number P06774), HAP3 (accession number P13434) and HAP5 (accession number Q02516) subunits from yeast (Saccharomyces cerevisiae) were used as queries. The accession numbers for all sequences are shown in Table 1. The NF-YB genes from other plants were also retrieved from GenBank in order to conduct a phylogeny-based comparison of the evolutionary patterns between different types of NF-YB subunits. The accession numbers for these sequences are shown in Fig. 3.

Table 1.  Accession numbers of the NF-Y sequences used in this study
Species               NF-YA                  NF-YB                  NF-YC
Arabidopsis thaliana  NM_121287 (At5g12840)  NM_100774 (At1g09030)  NM_100768 (At1g08970)
                      NM_112983 (At3g20910)  NM_130348 (At2g47810)  NM_104356 (At1g54830)
                      NM_104294 (At1g54160)  NM_126937 (At2g13570)  NM_104496 (At1g56170)
                      NM_112256 (At3g14020)  NM_117534 (At4g14540)  NM_125742 (At5g63470)
                      NM_101621 (At1g17590)  NM_124138 (At5g47640)  NM_114718 (At3g48590)
                      NM_105941 (At1g72830)  NM_115194 (At3g53340)  NM_124430 (At5g50480)
                      NM_179402 (At1g30500)  NM_179974 (At2g38880)  NM_124429 (At5g50470)
                      NM_179904 (At2g34720)  NM_179946 (At2g37060)  NM_122673 (At5g27910)
                      NM_120734 (At5g06510)  NM_124141 (At5g47670)  NM_124431 (At5g50490)
                      NM_111443 (At3g05690)  NM_102046 (At1g21970)  NM_123174 (At5g38140)
Oryza sativa          AC092262               AB095439               AC134235
Figure 3.

Phylogeny of plant NF-YB subunits reconstructed by Bayesian inference. The numbers at internal nodes are Bayesian posterior probabilities shown as percentages (i.e. 95 represents a posterior probability of 0.95; values below 50 are not shown). The accession number for each sequence follows the species name.

Amino acid sequences were aligned using Clustal X (Thompson et al., 1997). Nucleotide sequence alignments were adjusted to conform to the amino acid sequence alignments. Each subunit of the NF-Y complex contains a core region that is relatively conserved across duplicates and species, whereas the flanking regions are much less conserved and differ greatly in sequence identity and length. The core regions of each subunit possess all functional amino acid residues and are sufficient for subunit interactions and CCAAT binding (Gusmaroli et al., 2001; Romier et al., 2003). Therefore, the flanking sequences, which are ambiguous in the alignments, were not included in this study. All analyses in this study were based on the core region sequences of NF-YA, NF-YB and NF-YC, which comprise 59, 90 and 76 amino acid residues, respectively.

The phylogenetic relationships of duplicates of each subunit were inferred using the neighbour-joining method (NJ; Saitou & Nei, 1987), implemented in the program HYPHY. The Jones empirical model of amino acid substitution was employed, with invariant sites and gamma-distributed rates for variable sites estimated from the data. The trees were rooted using the yeast HAP2, HAP3 and HAP5 genes as outgroups, respectively. The maximum parsimony method (MP), implemented in PAUP* 4.0 (Swofford, 1998), and the Bayesian method, implemented in MrBayes V2.01, were also used to infer the phylogenetic relationships of the NF-YB genes from various plants. Heuristic tree search under parsimony was conducted using the TBR (tree-bisection-reconnection) swapping algorithm. The GTR + I + G model (general-time-reversible with invariant sites and gamma-distributed rates for variable sites) of sequence evolution was employed in Bayesian inference, with model parameters estimated from the data. The Markov chains were run for 1 000 000 generations, and trees were sampled every 100 generations.

The likelihood-ratio test was used to examine whether the duplicates of the NF-Y genes evolved at the same rates at the amino acid level. For each duplication event (node in the tree), two different models were compared. One model assumes the same amino acid substitution rate on the two branches leading to the two duplicates but allows the rate on other branches to be different. The other model allows one of the duplicates to evolve at an independent rate. The codeml program in the PAML package (Yang, 1997) was applied to calculate the maximum likelihood values using the Jones empirical model of amino acid substitution. Twice the log-likelihood difference was compared to a chi-square distribution. If the difference is significant, the two branches are inferred to have evolved at unequal rates.
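The comparison of the two nested models can be sketched as follows. This is a minimal illustration, assuming that the two models differ by one free rate parameter so that the statistic is compared with a chi-square distribution with one degree of freedom; the log-likelihood values in the example are hypothetical and are not results of this study. The function name is likewise an assumption.

```python
import math


def likelihood_ratio_test_1df(lnl_null, lnl_alt, critical_value=3.84):
    """Likelihood-ratio test for two nested models differing by one parameter.

    lnl_null: log-likelihood of the constrained (equal-rate) model.
    lnl_alt:  log-likelihood of the model with one independent rate.
    Returns (statistic, p_value, significant_at_5_percent).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    # For one degree of freedom the chi-square tail probability has the
    # closed form P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value, statistic > critical_value


# Hypothetical log-likelihoods for one duplication node.
stat, p, significant = likelihood_ratio_test_1df(-1234.56, -1230.12)
print(f"2*delta(lnL) = {stat:.2f}, P = {p:.4f}, unequal rates: {significant}")
```

Using the closed form for one degree of freedom keeps the sketch free of external dependencies; for other degrees of freedom a statistics library would be the more natural choice.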

To investigate whether the subunits which show asymmetries in rates of amino acid divergence exhibit similar amino acid substitution patterns in different plant genomes, we calculated the goodness of fit between an observed distribution of physicochemical changes inferred from well corroborated phylogenetic trees and an expected distribution based on the assumption of completely random amino acid replacement expected under the condition of selective neutrality, using the program TreeSAAP (Woolley et al., 2003). Six amino acid properties shown to correlate with rates of amino acid replacement (Xia & Li, 1998; McClellan & McCracken, 2001) were considered: composition of the side chain, hydropathy, isoelectric point, molecular volume, polar requirement, and polarity. We also investigated the numbers of conservative, moderate, radical and very radical physicochemical changes in amino acid property, relative to the total number of theoretically possible evolutionary pathways. To deduce the relative selective influence on each physical property of each amino acid (z-score) we compared these numbers to a normal distribution. The z-scores provide information on the direction in which selection is acting, while the goodness-of-fit score (GF-score) provides information on the intensity of that selection. Taken together, these scores describe the selective influences acting on each subunit (McClellan & McCracken, 2001).


Evolution of duplicates of plant NF-Y genes at the amino acid level

The NF-Y genes underwent successive rounds of duplication in their evolutionary histories. The likelihood-ratio test was used to examine the evolutionary rates of the duplicates in each duplication event. We found that the NF-YA duplicates from both Arabidopsis thaliana and rice showed little difference from each other in the rate of amino acid substitution, with all the values of twice the log-likelihood difference (2Δl) being smaller than the critical value χ2 = 3.84 (d.f. = 1, P = 0.05). Therefore no further analysis of the NF-YA genes was conducted. However, some duplicates of the NF-YB and NF-YC genes showed significant evidence that one copy had evolved faster than the other at the amino acid level (Fig. 1). The number of duplicates showing asymmetric evolutionary rates varies among genomes and subunits, suggesting that different subunits, as well as different duplicates, have evolved in significantly different patterns.

Figure 1.

Neighbour-joining trees of plant NF-Y genes showing the relationships of duplicates in each subunit. Branch lengths are proportional to the expected number of amino acid substitutions per site. Duplication events in each subunit were marked with white circles (where the duplicates do not show asymmetries in rates of amino acid divergence) or grey circles (where the duplicates have evolved at unequal rates following duplication).

Amino acid substitution patterns relative to individual amino acid properties

To investigate whether the subunits which show asymmetries in rates of amino acid divergence exhibit similar amino acid substitution patterns in different genomes and to evaluate the effects of selection on molecular evolution of plant NF-Y genes, we compared the relative shapes of expected and observed distributions of physicochemical changes by goodness-of-fit and by statistically comparing proportions of observed amino acid replacements to expected evolutionary pathways for each contiguous magnitude category. The results showed that the NF-YB and NF-YC duplicates from both A. thaliana and rice exhibited a poor fit (GF > 7.815, d.f. = 3, P < 0.05) to expected distributions for all amino acid properties considered, indicating that the observed evolution of these sequences is inconsistent with the assumption of selective neutrality.
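The goodness-of-fit comparison can be illustrated with a short sketch. It assumes a standard Pearson chi-square statistic over the four magnitude categories (hence d.f. = 3 and the 5% critical value of 7.815 quoted above); the exact computation performed by TreeSAAP may differ, and the counts and expected proportions below are hypothetical.

```python
def goodness_of_fit(observed_counts, expected_proportions):
    """Pearson chi-square statistic for observed category counts.

    observed_counts: replacements observed in each magnitude category.
    expected_proportions: proportions expected under neutrality (sum to 1).
    """
    if len(observed_counts) != len(expected_proportions):
        raise ValueError("one expected proportion is needed per category")
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, expected_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


CRITICAL_VALUE_DF3 = 7.815  # chi-square, d.f. = 3, P = 0.05

# Hypothetical counts for the conservative, moderate, radical and
# very radical categories, against a uniform neutral expectation.
gf = goodness_of_fit([40, 30, 20, 10], [0.25, 0.25, 0.25, 0.25])
poor_fit = gf > CRITICAL_VALUE_DF3
```

In this hypothetical case the statistic is 20.0, which exceeds 7.815, so the observed distribution would be judged a poor fit to the neutral expectation.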

Statistical comparisons between magnitude categories revealed that, with regard to composition of the side chain, isoelectric point and molecular volume, selection has constrained evolution in both the NF-YB and NF-YC subunits such that conservative and moderate changes are most common. However, with respect to polar requirement and polarity, the NF-YB duplicates from rice showed a significant difference between the moderate and radical categories, with the radical class being greater. This pattern was also found in the NF-YC duplicates, with the z-score for the A. thaliana NF-YC genes being slightly lower than the significance level (z = 1.64) in relation to polarity (Fig. 2). Comparisons of observed magnitude classes indicated that very radical changes in hydropathy occurred most frequently in the NF-YC duplicates (Fig. 2). Previous authors have proposed that the influence of positive selection favouring very radical changes in amino acid properties can be inferred when a GF-score > 7.815 and a z-score > 1.645 between the radical and very radical magnitude classes are observed, as long as the very radical class is greater than the radical class (McClellan & McCracken, 2001).
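The comparison between two adjacent magnitude classes can be sketched as a two-proportion z-test. This is an assumption about the general form of such a comparison, not a description of the procedure implemented in TreeSAAP: each class is summarised as the number of observed replacements relative to the number of possible evolutionary pathways, and the one-tailed threshold of 1.645 is the value quoted above. All counts in the example are hypothetical.

```python
import math


def two_proportion_z(count_a, pathways_a, count_b, pathways_b):
    """z-score for the hypothesis that class B has a higher proportion
    of replacements per pathway than class A (pooled standard error)."""
    prop_a = count_a / pathways_a
    prop_b = count_b / pathways_b
    pooled = (count_a + count_b) / (pathways_a + pathways_b)
    std_err = math.sqrt(pooled * (1.0 - pooled)
                        * (1.0 / pathways_a + 1.0 / pathways_b))
    return (prop_b - prop_a) / std_err


Z_CRITICAL_ONE_TAILED = 1.645  # P = 0.05, one-tailed

# Hypothetical data: 12 radical and 30 very radical replacements,
# each out of 400 possible pathways.
z = two_proportion_z(12, 400, 30, 400)
very_radical_excess = z > Z_CRITICAL_ONE_TAILED
```

Under the criterion described above, an excess of very radical changes would be taken as a sign of positive selection only if the goodness-of-fit score for the same property also exceeded 7.815.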

Figure 2.

Proportions of amino acid replacements per evolutionary pathway for NF-YB and NF-YC subunits. The solid bars show the numbers of amino acid replacements occurring in each magnitude class (C, conservative; M, moderate; R, radical; VR, very radical), with standard errors marked on the bars. The bars filled with lines show the z-scores in each case, with the direction of the lines indicating significance level: for bars filled with horizontal lines, P > 0.05; for bars filled with vertical lines, P < 0.05.

Asymmetric sequence divergence and functional divergence of duplicate NF-Y genes

To investigate the correlations between asymmetric sequence divergence and functional divergence of duplicate NF-Y genes, we conducted a phylogeny-based comparison of the divergence of plant NF-YB subunits. Plant NF-YB subunits have been proposed to divide into two classes, i.e. LEAFY COTYLEDON1 (LEC1)-type and non-LEC1-type (Kwong et al., 2003). Functional analyses have revealed that the LEC1-type subunits represent a functionally specialized subunit of the CCAAT-binding transcription factor (Lotan et al., 1998; Harada, 2001; Lee et al., 2003), which is expressed specifically within seeds. It was predicted that the NF-Y complex containing the LEC1-type subunit is unlikely to serve a general role in transcription, as the NF-Y trimer containing the non-LEC1-type subunit does, by optimizing promoter efficiency through its binding with CCAAT DNA sequences, but rather to regulate embryonic processes by activating the transcription of specific genes that are required for embryo development (Lotan et al., 1998).

The phylogenetic relationships of various plant NF-YB subunits were inferred by NJ, MP and Bayesian analyses. The tree topologies produced by the different methods were similar in overall structure. Figure 3 shows the phylogenetic tree reconstructed by Bayesian inference. The LEC1-type subunits from various plants formed a well-supported monophyletic lineage in the phylogeny, suggesting a common origin of these genes. The branching pattern of the phylogeny suggests that successive duplication events have taken place in the ancestral plant lineages. Branch A in the tree (Fig. 3) represents a duplication event that occurred prior to the divergence of gymnosperms and angiosperms and gave rise to the ancestor of the LEC1-type subunits. The likelihood-ratio test was used to examine whether the evolutionary rate was accelerated at the amino acid level along branch A. The ‘one-rate’ (global clock) model assumes the same rate for all lineages in the phylogeny and gave a log-likelihood of l0 = −2251.78. The ‘two-rate’ (local clock) model assumes two independent rates, one for branch A and a second for all other branches. The log-likelihood value under this model was l1 = −2228.37. Comparison of twice the log-likelihood difference, 2Δl = 2(l1 − l0) = 46.82, with the χ2 distribution (d.f. = 1) suggested rejection of the ‘one-rate’ model with P < 0.001, indicating that the amino acid substitution rate along branch A differs markedly from that on the other branches.
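As a check on the arithmetic, the test statistic for branch A can be written out from the two log-likelihoods reported above. The 0.1% critical value of 10.83 for one degree of freedom is a standard tabulated value and is not a figure taken from this study.

```latex
\begin{align*}
2\Delta\ell &= 2(\ell_1 - \ell_0) \\
            &= 2\bigl[-2228.37 - (-2251.78)\bigr] \\
            &= 2 \times 23.41 = 46.82 \\
46.82 &> \chi^2_{1,\,0.001} = 10.83
\end{align*}
```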

The physicochemical changes inferred from the LEC1-type subunits exhibited a good fit to the neutral expectation for hydropathy (GF = 7.26), but not for the other amino acid properties. In the case of composition of the side chain, isoelectric point and molecular volume, the lack of fit was due to selection that promoted conservative or moderate changes. However, in polar requirement and polarity, more radical or very radical changes seem to have taken place than expected (P = 0.0202 for radical changes in polarity; P = 0.01359 for very radical changes in polar requirement). In contrast to the LEC1-type subunits, the non-LEC1-type subunits exhibited a poor fit to the expected distribution for hydropathy (GF = 18.547) and fewer very radical amino acid replacements in polar requirement (P = 0.00317 for very radical changes in polar requirement).


Duplicated genes are common in genomes (Meyer, 2003), and attention has been focused on the divergence of sequences of duplicated genes and the consequent divergence of functions of the proteins they encode (Conant & Wagner, 2003). A number of cases of asymmetric divergence between duplicate genes have been reported (Van de Peer et al., 2001; Kondrashov et al., 2002; Wagner, 2002). Our results show that the NF-Y duplicates in plant genomes have evolved in different patterns, with some of the NF-YB and NF-YC duplicates, but not the NF-YA duplicates, showing significant evidence of asymmetric evolution. This difference is probably a result of their distinct roles in trimer formation and DNA binding. The core domain of NF-YA is less than 60 amino acids long and consists of two subdomains. The subunit-associating subdomain is responsible for the sequence-specific interaction of the trimer, showing remarkable specificity for NF-YB/NF-YC among HFM dimers (Mantovani, 1999), while the DNA-binding subdomain is implicated in specific recognition of the CCAAT element. Comparison of the duplicate NF-YA sequences revealed that the key residues involved in DNA binding or sequence-specific interaction are almost perfectly conserved, with only one of them showing substitution of Ala for Gly, suggesting the presence of very strong selective constraints on both subdomains. It is thus unsurprising that the NF-YA subunits do not show evidence of asymmetric evolution. We note that the NF-YA sequences used are shorter than the NF-YB and NF-YC sequences, which may, to a certain extent, reduce the power of the statistical tests. The likelihood-ratio test becomes more conservative in a data set with short sequences and low divergence (Anisimova et al., 2001). Therefore this result should be treated with caution, particularly when the test fails to reject the null hypothesis.

Asymmetries in amino acid substitution rates were detected in both NF-YB and NF-YC subunits, indicating that some duplicates of NF-YB and NF-YC have evolved at significantly different rates following gene duplication. The amino acid changes that have occurred in different duplicates are not evolutionarily equivalent. Most amino acid replacements seem to result in changes in hydropathy, polar requirement and polarity. Comparisons of magnitude classes also demonstrated that radical and very radical changes with regard to hydropathy, polar requirement and polarity happened more frequently in the NF-YC subunits than in the NF-YB subunits. This pattern was found in both the A. thaliana and rice subunits. Most interestingly, the LEC1-type and non-LEC1-type NF-YB subunits also showed significant heterogeneity in their amino acid substitution patterns with respect to hydropathy, polar requirement and polarity, suggesting that the physicochemical changes in sequences are coupled to asymmetric divergence in gene function. The amino acid properties do not change independently. The changes in hydropathy and polarity seem to be correlated in the NF-Y subunit sequences. It is unclear whether the conservative changes in composition of the side chain, isoelectric point and molecular volume in the NF-Y subunits are accompanied by changes in other amino acid properties. The relationships between the various amino acid property changes and their functional effects need to be addressed through more rigorous biochemical characterization of the residues of the NF-Y sequences.

The unequal evolutionary rates and distinct divergence patterns indicate different selective influences on the evolution of plant NF-YB and NF-YC subunits. Some duplicates of NF-YB and NF-YC are clearly subject to stronger purifying selection, as they showed little difference in amino acid substitution rates and a low level of divergence from each other. However, for the duplicates that showed an acceleration of evolution, the selective constraints may have been relaxed to some extent, but not completely lost. In both the NF-YB and NF-YC subunits, the preponderance of conservative and moderate amino acid replacements with regard to composition of the side chain, isoelectric point and molecular volume indicates the effects of negative selection, while the relatively high proportions of radical and very radical changes in polar requirement and polarity suggest a relaxed functional constraint on amino acid replacements in relation to these properties. One possible exception is the change in hydropathy in the NF-YC duplicate genes. A significantly higher proportion of very radical amino acid replacements with respect to hydropathy was detected in these duplicates, suggesting a directional change in amino acid sequences. This suggests the presence of positive selection, which has acted to promote amino acid hydropathy change to a greater extent than expected under random substitution.

It is an intriguing question whether any of the different members is able to interact with all other subunits. Gusmaroli et al. (2001, 2002) predicted that trimer formation should be possible among all members of the three subunits. At the same time, however, they showed that not all the NF-Y duplicates in A. thaliana are expressed ubiquitously. Some members are either organ-specific or developmentally regulated. We compared the amino acid substitution patterns between the ubiquitous and tissue-specific members, and failed to find any differences in sequence that are correlated with distinct expression patterns. This is possibly due to the use of only partial sequences of these genes in analyses. However, the asymmetric expression of subfunctionalized NF-Y duplicates is probably driven by the differential evolution of regulatory regions, rather than coding sequences (Conant & Wagner, 2003). The specific intermolecular interactions between different members may also play a role in determining the distinct expression patterns of duplicate NF-Y genes. Therefore, more biochemical evidence is needed to verify whether all NF-Y subunits can indeed associate.


We thank Dr M. D. Rausher and two anonymous reviewers for critical reading of the manuscript and valuable comments. This work was supported by the National Key Project for Basic Research (973) (G2000046804) to J.Y., and a National Natural Science Foundation of China grant (30370092) to J.Y. and B.J.G. Part of this work was completed during J.Y.'s visit to the University of Cambridge, supported by the China Scholarship Council.