Highly conserved low-copy nuclear genes as effective markers for phylogenetic analyses in angiosperms


  • Ning Zhang,

    1. State Key Laboratory of Genetic Engineering, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, China
  • Liping Zeng,

    1. State Key Laboratory of Genetic Engineering, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, China
  • Hongyan Shan,

    1. State Key Laboratory of Systematic and Evolutionary Botany, Chinese Academy of Sciences, Beijing 100093, China
  • Hong Ma

    1. State Key Laboratory of Genetic Engineering, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, China
    2. Institutes of Biomedical Sciences, Fudan University, 138 Yixueyuan Road, Shanghai 200032, China

Author for correspondence:
Hong Ma
Tel: +86 21 65642800
Email: hongma@fudan.edu.cn


  • Organismal phylogeny provides a crucial evolutionary framework for many studies and the angiosperm phylogeny has been greatly improved recently, largely using organellar and rDNA genes. However, low-copy protein-coding nuclear genes have not been widely used on a large scale in spite of the advantages of their biparental inheritance and vast number of choices.
  • Here, we identified 1083 highly conserved low-copy nuclear genes by genome comparison. Furthermore, we demonstrated the use of five nuclear genes in 91 angiosperms representing 46 orders (73% of orders) and three gymnosperms as outgroups for a highly resolved phylogeny.
  • These nuclear genes are easy to clone and align, and more phylogenetically informative than widely used organellar genes. The angiosperm phylogeny reconstructed using these genes was largely congruent with previous ones mainly inferred from organellar genes. Intriguingly, several new placements were uncovered for some groups, including those among the rosids, the asterids, and between the eudicots and several basal angiosperm groups.
  • These conserved universal nuclear genes have several inherent qualities enabling them to be good markers for reconstructing angiosperm phylogeny, even eukaryotic relationships, further providing new insights into the evolutionary history of angiosperms.


Phylogenetic relationships between organisms are the foundation of evolutionary biology and many other disciplines, such as biodiversity and biogeography, functional evolution of development and physiology, comparative genomics, and crop breeding. The characteristics for phylogenetic studies have evolved from morphology and development, to physiology and biochemistry, and more recently to protein and DNA sequences. The latter provides a nearly unlimited number of characters for comparison and has been made increasingly accessible with the rapid development of sequencing technology (Judd et al., 1999; Soltis et al., 2005). During recent decades, organellar and ribosomal RNA genes have been widely used for resolving organismal relationships. However, these markers have some limitations; for example, rDNA genes have high copy number and experience concerted evolution (Letsch & Kjer, 2011). In particular, the extent of rDNA sequence homogenization might differ both between gene regions and among different loci. These variations might increase the uncertainty of organismal phylogenies inferred from these sequences (Buckler et al., 1997).

The importance of using low-copy nuclear genes for phylogenetic analysis has long been recognized (Strand et al., 1997; Fulton et al., 2002; Sang, 2002; Mort & Crawford, 2004; Hughes et al., 2006; Whittall et al., 2006; Wu et al., 2006; Yuan et al., 2009; Duarte et al., 2010). Nuclear protein-coding genes represent the overwhelming majority of the cellular genome and are important for many diverse functions, providing markers to track organismal evolution through both male and female Mendelian inheritance. For example, protein-coding nuclear genes have been used to reconstruct phylogenetic relationships among nearly 200 fungal species, illustrating their effectiveness (James et al., 2006). Phylogenomics using many nuclear genes further demonstrated their power for molecular evolutionary analyses, especially in studies of fungal and animal relationships partly due to the availability of many large genomic and EST sequence datasets (Rokas et al., 2003; Moreau et al., 2006; Regier et al., 2010; Kocot et al., 2011; Smith et al., 2011; Struck et al., 2011). However, until now, only a few nuclear genes have been used for resolving the backbone of flowering plant phylogeny (Soltis et al., 1997; Mathews & Donoghue, 1999; Finet et al., 2010; Burleigh et al., 2011; Lee et al., 2011).

Flowering plants (angiosperms) are one of the most successful groups of organisms on earth, with c. 300 000 species, and provide humans and animals with foods, fibres, medicines and other materials (Judd et al., 1999). The angiosperm phylogeny establishes the evolutionary history and depicts the phylogenetic relationships among various species and groups, facilitating comparative analysis between model plants and crops (Soltis & Soltis, 2003). The recent angiosperm phylogeny, which revolutionized the view of angiosperm taxonomy mainly based on morphological evidence, provides strong support for the monophyly of major groups such as the eudicots, monocots and magnoliids (Bremer et al., 2002; Jansen et al., 2007; Moore et al., 2007, 2010; APG III, 2009; Soltis et al., 2011). Also, there is an increasing consensus on Amborella being the sister to all other extant flowering plants (Zanis et al., 2002; Moore et al., 2007, 2010; Soltis et al., 2011), although alternative hypotheses still remain (Goremykin et al., 2003, 2009). Nevertheless, several enigmatic relationships are still to be resolved, such as those among eudicots, monocots, magnoliids, Chloranthales and Ceratophyllales, those among Dilleniaceae, Saxifragales, Caryophyllales, rosids and asterids, and those among the major clades within rosids. It has been proposed that these relationships are difficult to resolve because of rapid radiation (Hilu et al., 2003; Moore et al., 2007, 2010; Wang et al., 2009).

Until now, reconstructions of angiosperm phylogeny have relied largely on plastid and mitochondrial genes (Chase et al., 1993; Savolainen et al., 2000; Hilu et al., 2003; Zhu et al., 2007; Qiu et al., 2010) and sometimes entire plastid genomes (Jansen et al., 2007; Moore et al., 2007, 2010), while the use of nuclear genes alone has been rare (Soltis et al., 1997; Mathews & Donoghue, 1999; Finet et al., 2010; Lee et al., 2011). Widespread gene duplication events represent a major challenge in selecting effective nuclear genes as phylogenetic markers (Zhang, 2003; Kellis et al., 2004; Dehal & Boore, 2005; Soltis et al., 2009; Zhou et al., 2010); recent genomic studies have shown that all extant angiosperms have experienced polyploidization events during evolution (Jiao et al., 2011). Gene duplications, in turn, make it difficult to distinguish orthologs from paralogs (Philippe et al., 2005). Even worse, in some cases, single-copy paralogs resulting from gene duplication and subsequent lineage-specific losses could be mistaken for orthologs, contributing to incorrect inference of organismal relationships (Nei & Kumar, 2000; Koonin, 2005; Scannell et al., 2006). Another difficulty in using nuclear genes for phylogenetic analysis is their relatively complex gene structure, which makes them hard to clone and align with confidence. With the advance of sequencing technologies, more and more plant genomic and transcriptomic datasets are being generated, so selecting suitable low-copy nuclear genes as phylogenetic markers is becoming feasible.

In this work, to facilitate the use of nuclear genes in phylogenetic analyses in angiosperms, we compared the genomes of one moss and seven angiosperm species and identified over 1000 genes as potential phylogenetic markers. To test their usability, sequences of five genes (SMC1, SMC2, MSH1, MLH1 and MCM5) belonging to four gene families were also obtained from 91 angiosperm species of 46 orders (73% of all orders). These genes are easy to clone and align and, compared with widely used organellar genes, they are phylogenetically much more informative. The resultant phylogenetic trees, which are generally concordant with those of previous studies, suggest that these nuclear genes are excellent candidates for reconstruction of angiosperm phylogeny at both above- and below-order levels.

Materials and Methods

Ortholog identification

In order to identify low-copy nuclear genes common to angiosperms, annotated genomes of seven angiosperm species (Arabidopsis thaliana, Populus trichocarpa, Prunus persica, Vitis vinifera, Mimulus guttatus, Oryza sativa, Sorghum bicolor) and one moss genome (Physcomitrella patens) were retrieved from Phytozome v7.0 (http://www.phytozome.net/). These genomes were used to search for putative orthologous genes with OrthoMCL v1.4 (Li et al., 2003) under default parameters, retaining clusters with 7–9 gene sequences and at least one gene from each of the seven selected angiosperms. These low-copy nuclear genes were then searched against the moss genome using HaMStR_v8 to determine the set of genes conserved among land plants (Ebersberger et al., 2009). To facilitate gene cloning and phylogenetic analyses, the genes retained for further analysis encoded proteins at least 300 amino acids long (counted from the A. thaliana protein) with sequence identities of at least 60%. The predicted functions of these genes were then analysed with TAIR10 Gene Ontology (http://www.arabidopsis.org/) (see Supporting Information Fig. S1).
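The selection criteria described above can be illustrated with a minimal Python sketch. It is not the OrthoMCL or HaMStR procedure itself: the species codes, the function names and the input layout (a list of species/gene pairs per cluster) are hypothetical, and clusters are assumed to contain sequences from the seven angiosperms only. The identity cut-off is treated as inclusive, which is also an assumption.

```python
from collections import Counter

# Hypothetical codes for the seven angiosperm genomes named in the text.
ANGIOSPERMS = {"Atha", "Ptri", "Pper", "Vvin", "Mgut", "Osat", "Sbic"}


def is_low_copy(members, min_size=7, max_size=9):
    """Return True if a cluster holds 7-9 sequences and covers all seven species.

    `members` is a list of (species_code, gene_id) pairs for one cluster.
    """
    if not (min_size <= len(members) <= max_size):
        return False
    counts = Counter(species for species, _ in members)
    # Every selected angiosperm must contribute at least one gene.
    return ANGIOSPERMS <= set(counts)


def passes_marker_filters(protein_length, mean_identity,
                          min_length=300, min_identity=60.0):
    """Apply the length (aa) and identity (%) thresholds used for candidate markers."""
    return protein_length >= min_length and mean_identity >= min_identity
```

A cluster with one gene per species passes, as does one with up to two extra paralogs; a cluster of seven sequences that lacks one species does not.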

Public sequence retrieval and cDNA cloning

The identified low-copy nuclear genes were statistically overrepresented for predicted functions related to DNA/RNA metabolism compared with the percentage of genes with such functions in the whole genome. In addition, extensive gene family phylogenetic studies in this laboratory and others indicated that the SMC, MCM, MSH, MLH, RFC, RDRP and RAD51 families with highly conserved functions in DNA/RNA metabolism showed long-term maintenance of low copies during eukaryotic evolution despite genome duplications (Forsburg, 2004; Lin et al., 2006, 2007; Surcel et al., 2008; Zong et al., 2009). Therefore, gene copy numbers of 28 low-copy genes belonging to these seven gene families were inspected in 15 angiosperm genomes, with the Arabidopsis genes as queries. To test whether other low-copy nuclear genes with different functions, sizes or sequence identities can also be used as phylogenetic markers, 20 randomly selected genes (Figs S1–S3) were also inspected similarly for copy numbers.

In order to test the usability of low-copy nuclear genes for resolving angiosperm phylogeny, five genes (SMC1, SMC2, MSH1, MLH1 and MCM5) of 91 angiosperm species from 46 orders and three gymnosperm species were also obtained. The sequences of 18 angiosperm species (15 species with fully sequenced genomes and three species with large EST data) and three gymnosperm species were retrieved from public databases (Table S1). In total, 107 sequences were retrieved from public databases. The five genes for the additional 73 angiosperm species were obtained using PCR amplification from cDNAs (Fig. S1); detailed procedures for cDNA cloning and sequencing are described in Methods S1, and the primers used in this study are listed in Tables S2, S3.

Phylogenetic analyses and Approximately Unbiased (AU) test

For the reconstruction of angiosperm phylogeny using five genes (SMC1, SMC2, MCM5, MSH1 and MLH1), in order to maximize sequence data for a few taxa, sequences from different species of the same genus were combined, including Chloranthus elatior & C. spicatus, Schisandra propinqua & S. sphenanthera. Protein sequences were initially aligned by using Muscle v3.6 (Edgar, 2004), then adjusted manually with Genedoc (Nicholas et al., 1997), and transformed into the nucleotide matrices with the aid of PAL2NAL (Suyama et al., 2006). Concatenated nucleotide and amino acid matrices were generated with SeaView v4.2.12 (Gouy et al., 2010).
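The two matrix-building steps in this paragraph, converting a protein alignment into the matching nucleotide alignment and concatenating genes, can be sketched as follows. This is a simplified illustration of the idea behind codon-aware back-translation, not the PAL2NAL or SeaView implementation; the function names and the use of `?` for missing genes are assumptions.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each residue takes the next three nucleotides; each gap becomes '---'.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be three times the number of residues")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


def concatenate(gene_alignments, missing="?"):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) end to end.

    Taxa absent from a gene are padded with the `missing` character.
    """
    taxa = sorted(set().union(*gene_alignments))
    parts = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        width = len(next(iter(alignment.values())))
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

Because gaps are inserted as whole codons, the reading frame of every sequence is preserved in the nucleotide matrix, which is what later allows codon positions to be handled separately.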

First, five single-gene trees were generated with PhyML v3.0 (Guindon et al., 2010) using the nucleotide sequence matrices to determine the gene copy number in each species with a fully sequenced genome and to test for possible contamination. If a species had two or more copies generated by lineage-specific duplication, the copy with the slowest evolutionary rate was chosen for further analyses (see Tables S3–S8, Fig. S7a–e). The evolutionary model was specified as GTR + I + Γ, as estimated by ModelTest v3.7 (Posada & Crandall, 1998). Support values were estimated by SH-like approximate likelihood ratio tests, a fast and accurate nonparametric method (Guindon et al., 2010).

The concatenated 5-gene trees were reconstructed from both the nucleotide and amino acid matrices using maximum likelihood and Bayesian methods. Maximum likelihood analyses were performed with PhyML v3.0 and RAxML v7.0.4 (Stamatakis, 2006), and Bayesian analyses with MrBayes v3.1.2 (Ronquist & Huelsenbeck, 2003) and PhyloBayes v3.2e (Lartillot & Philippe, 2004). For the nucleotide matrix, the most suitable evolutionary model was determined by the Akaike Information Criterion (AIC) with the aid of ModelTest v3.7. In the PhyML analysis, the evolutionary model was specified as GTR + I + Γ and support values were estimated by bootstrap analysis (100 replicates). In the MrBayes inference, one cold and three incrementally heated Markov chain Monte Carlo (MCMC) chains were run simultaneously, and MCMC convergence was monitored with AWTY (http://ceb.csit.fsu.edu/awty) (Nylander et al., 2008). Trees were sampled every 100 generations, the first 25% of trees were discarded as burn-in, and the remaining trees were used to generate the consensus tree. In the RAxML analysis, the GTRCAT model was specified and 500 rapid bootstrap replicates were performed to infer support values. In the PhyloBayes analysis, two independent chains were run simultaneously until the maxdiff value was < 0.1. For the amino acid matrix, the evolutionary models were specified as JTT, mixed, PROTGAMMAJTT and CAT in PhyML, MrBayes, RAxML and PhyloBayes, respectively.

It is thought that the 3rd codon positions may suffer from mutation saturation and bring noise to phylogenetic analysis (Nei & Kumar, 2000). Therefore, a nucleotide matrix excluding the 3rd codon positions was generated and phylogenetic trees were inferred with both RAxML v7.0.4 and MrBayes v3.1.2. In addition, taxon sampling, especially the inclusion of genes showing long branches, has also been considered to affect the topology. Four nucleotide matrices with different samplings were therefore generated, and the corresponding trees were inferred using MrBayes v3.1.2 and RAxML v7.0.4.
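Removing the 3rd codon positions from an in-frame nucleotide alignment is a simple column filter. The sketch below shows one way to do it in Python for a single aligned sequence; the function name is hypothetical and the alignment is assumed to start at codon position 1 and to have a length that is a multiple of three.

```python
def drop_third_positions(codon_alignment):
    """Keep codon positions 1 and 2 of an in-frame aligned sequence."""
    if len(codon_alignment) % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    # Zero-based indices 2, 5, 8, ... are the 3rd codon positions.
    return "".join(
        base for index, base in enumerate(codon_alignment) if index % 3 != 2
    )
```

Applied to every row of a 10 020-column codon matrix, this would leave 6680 columns.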

In order to determine statistical support for alternative relationships of major groups of angiosperms, 69 alternative topologies previously proposed in other studies were tested against the best ML tree. First, per-site log likelihoods for each topology were computed in RAxML under the GTR + Γ model; second, AU and WSH tests were conducted using CONSEL v0.1j (Shimodaira & Hasegawa, 2001) (details are described in Table S9).

In order to test the effect of marker gene numbers used here on resolution in the angiosperm phylogeny, trees using four (SMC1, MLH1, MSH1 and MCM5), three (MLH1, MSH1 and MCM5), and two (MLH1 and MSH1) concatenated genes were also reconstructed, using nucleotide sequences with RAxML under the same settings as mentioned above.

In order to test the usability of 20 randomly selected low-copy genes, single-gene trees were reconstructed with PhyML v3.0 based on amino acid sequences using the JTT model (Fig. S3). Furthermore, to test whether other groups of five genes can generate congruent species trees, two concatenated 5-gene trees were generated under the same settings as above using so-called ‘good genes’ (whose single-gene trees are mostly congruent with the species tree) or ‘bad genes’ (with fewer consistent nodes) (Fig. S4).

Divergence time estimation

Whether the five genes fitted the molecular clock hypothesis was tested using MEGA v5.0 under the GTR + I + Γ model (Tamura et al., 2011). For divergence time estimation, species with more than 50% missing data were excluded, namely all three gymnosperm species and three angiosperm species: Aristolochia fimbriata, Cabomba caroliniana and Alisma plantago-aquatica. To facilitate comparison with previous similar studies (Moore et al., 2007, 2010; Bell et al., 2010), four time constraints were used: the eudicot crown was set to a mean age of 125 Ma (million years ago); the most recent common ancestor (MRCA) of Fagales and Fabales was set to a minimum of 85 Ma; and the MRCAs of Caryophyllales and of Sapindales were set to minima of 83.5 and 65 Ma, respectively. The prior for the eudicot crown was a normal distribution with a mean of 125 and a standard deviation of 1; the prior for the MRCA of Fabales and Fagales was a uniform distribution with lower and upper bounds of 85 and 100, respectively; and the priors for the MRCAs of Caryophyllales and of Sapindales were lognormal distributions with a standard deviation of 1.5 and offsets of 83.5 and 65, respectively. The tree prior was specified as a Yule process. Divergence times were estimated twice independently using BEAST v1.6.1 (Drummond & Rambaut, 2007) under the uncorrelated lognormal relaxed clock model. The five genes were presumed to evolve independently under different models as inferred by ModelTest v3.7. The MCMC chain length and the sampling frequency were set to 50 000 000 and 5000, respectively. Convergence was assessed from effective sample sizes (ESS) using Tracer v1.5 (http://tree.bio.ed.ac.uk/software/tracer/). Finally, the divergence times were summarized with TreeAnnotator v1.6.1, with half of the trees treated as burn-in, and visualized using FigTree v1.3.1 (http://tree.bio.ed.ac.uk/software/figtree/).
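As a rough check on how many trees these settings leave for summarizing, the bookkeeping can be sketched in Python. The sketch assumes that a sampling frequency of 5000 means one tree logged every 5000 states and ignores any tree logged at the initial state; the function name is hypothetical.

```python
def retained_samples(chain_length, sample_every, burnin_fraction):
    """Return (total sampled trees, trees kept after discarding burn-in)."""
    n_samples = chain_length // sample_every
    n_burnin = int(n_samples * burnin_fraction)
    return n_samples, n_samples - n_burnin


# Settings reported for the BEAST analysis: 50 000 000 states, one sample
# per 5000 states, half of the trees discarded as burn-in.
total, kept = retained_samples(50_000_000, 5000, 0.5)
```

Under these assumptions each run samples 10 000 trees and 5000 of them contribute to the summary tree.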


Highly conserved single-copy genes provide excellent candidate phylogenetic markers

In order to identify low-copy nuclear genes, orthogroups containing 7–9 gene sequences from seven angiosperms (at least one gene from each species) were obtained in this study; to increase the efficiency of further gene cloning from taxa without a sequenced genome, genes with coding regions of fewer than 300 aa were excluded from further analyses. Under this criterion, 1402, 1087 and 699 orthogroups were identified with 7, 8 and 9 gene sequences, respectively. After screening against the moss genome, 1030 (77.2%), 773 (71.1%), and 489 (70%) groups remained, respectively (Table S12). Further analyses indicated that the size of encoded proteins ranged from 300 to 5098 aa, with 66.5% of them no longer than 600 aa (Fig. 1a). The average identities of each gene varied from 28.9% to 94.2% among the seven angiosperms and were approximately normally distributed (Fig. 1b). The distributions of gene sizes and sequence identities largely overlapped with those of non-low-copy genes (> 9 genes in 7 taxa and at least one gene from each genome) (Fig. 1a,b). Among these low-copy nuclear genes, 1083 highly conserved genes with identities > 60% were selected as the most promising candidate markers (see Table S12). GO analyses and chi-square tests indicated that these low-copy genes are overrepresented in the following categories: cell organization and biogenesis, developmental processes, other cellular processes, other metabolic processes, transport, and especially DNA/RNA metabolism, but underrepresented in signal transduction and transcription factor categories (Fig. 1c).

Figure 1.

Characteristics of low-copy nuclear genes. Gene sizes (a), sequence identities (b), and functional categories (c) of low-copy genes are indicated in green, while sizes, sequence identities of non-low-copy nuclear genes and functional categories of Arabidopsis genome are denoted in yellow.

It has been reported that duplicates of genes related to DNA/RNA metabolism tend to be lost after gene or genome duplication, so that these genes remain single-copy (Blanc & Wolfe, 2004). Therefore, such genes might be suitable candidate markers for reconstructing angiosperm phylogeny. In addition, extensive phylogenetic analyses have shown that members of the SMC, MCM, MSH, MLH, RAD51 and RDRP families, which function in DNA/RNA synthesis and repair, are maintained as one copy in most species in spite of their ancient origin before the divergence of plants and animals (Forsburg, 2004; Lin et al., 2006, 2007; Surcel et al., 2008; Zong et al., 2009). The maintenance of long-term orthology even after whole-genome duplications in both plant and animal lineages suggests that these genes are good candidates for phylogenetic analysis. To test this idea, we inspected their copy numbers in 15 angiosperms with fully sequenced genomes and found that most species had only one copy for most genes (Fig. 2); one notable exception is Glycine max, which likely experienced two recent genome duplications (Schmutz et al., 2010). In order to test whether other low-copy nuclear genes are also suitable for angiosperm phylogeny, 20 randomly selected genes were inspected for gene copy number, and most of them had only one copy in each species (Fig. S2). Single-gene trees showed that only one (AT5G52910) was completely congruent with the reference species tree, suggesting that some low-copy genes might not be suitable for phylogenetic analysis (Fig. S3).

Figure 2.

Gene copy number of candidate marker genes in angiosperm species with sequenced genomes. The organismal relationships, modified from our results, are shown on the left. The gene names are provided at the bottom, and gene copy numbers are denoted by different colours: green, orange and red indicate one, two, and three or more copies, respectively.

We then used SMC1, SMC2, MCM5, MSH1 and MLH1 as representatives to study the angiosperm phylogeny and obtained their sequences from an additional 73 species by degenerate PCR. The percentages of successful PCR for SMC1, SMC2, MCM5, MSH1 and MLH1 were 93.7%, 87%, 94.4%, 91.5%, and 90.1%, respectively (see Tables S4–S8). Together with gene sequences retrieved from public databases, the dataset includes in total 92 SMC1, 94 SMC2, 89 MCM5, 89 MSH1 and 83 MLH1 genes from 91 angiosperm species belonging to 75 families in 46 representative orders (73% of the orders defined by APG III; APG III, 2009).

These five genes ranged from 654 to 3201 bp and had both highly conserved and relatively divergent regions (Figs 3a–e, S5). For instance, the amino acid sequence identities of SMC1 homologs ranged from 50% to 90% depending on the position of a sliding window, providing phylogenetically informative sites for both shallow and deep relationships (Fig. 3). Furthermore, highly conserved regions greatly facilitate amplification using universal primers, as mentioned above (Tables S4–S8), and sequence alignment, requiring the introduction of only a few gaps to obtain high quality alignments.

Figure 3.

Domains and amino acid sequence conservation of SMC1 (a), SMC2 (b), MCM5 (c), MSH1 (d) and MLH1 (e). Left, percentage of sequence identity among angiosperms calculated by SWAPP with a window size of 100 and a step size of 10; the x axis gives the position in the sequence alignment. Right, domains predicted by SMART are shown as rectangles, primers used in this study are marked as arrows, and representative conserved and divergent regions are shown by WebLogo.
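A sliding-window identity profile of the kind plotted in Fig. 3 can be approximated with the short Python sketch below. It is not the SWAPP algorithm: identity within a window is taken here as the mean pairwise identity over columns where neither sequence has a gap, which is an assumption, and the function names are hypothetical. The default window and step sizes follow the figure legend.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identical residues over columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return None
    return 100.0 * sum(x == y for x, y in compared) / len(compared)


def sliding_identity(alignment, window=100, step=10):
    """Return (1-based window start, mean pairwise identity) along an alignment.

    `alignment` is a list of equal-length aligned sequences.
    """
    length = len(alignment[0])
    profile = []
    for start in range(0, max(length - window, 0) + 1, step):
        values = []
        for a, b in combinations(alignment, 2):
            value = pairwise_identity(a[start:start + window],
                                      b[start:start + window])
            if value is not None:
                values.append(value)
        mean = sum(values) / len(values) if values else None
        profile.append((start + 1, mean))
    return profile
```

Windows with high values correspond to conserved regions suited to universal primers, whereas windows with low values mark the more divergent regions.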

At the same time, regions of sequence divergence can also be used for resolving relationships among relatively closely related taxa. To test their usability, we compared Arabidopsis thaliana and A. lyrata, two sibling species that diverged c. 10 Ma (Beilstein et al., 2010), focusing on introns because they evolve more rapidly. For these five genes, the nucleotide sequence identities of individual introns ranged between 32% and 96.9%, and some indels were also observed, providing additional resolving power even between closely related species (Fig. S6a–e). In comparison, the sequence identity between these two species for ITS (internal transcribed spacers), the most widely used molecular marker for resolving relationships at low taxonomic ranks, is 95%.

As a test for possible phylogenetic markers, single-gene trees were constructed for each of the five genes and they were largely consistent with well-established organismal relationships, suggesting that these genes are likely orthologous (Fig. S7a–e). Although multiple sequences were occasionally obtained in some species, they always formed adjacent terminal branches in phylogenetic trees, indicating that they resulted from recent lineage-specific duplication and should not affect phylogenetic relationships of more distantly related groups in this study. Further ModelTest analyses of these gene sequences showed that GTR + I + Γ was the best-fitting model for four of the five genes, the exception being MLH1, for which the best model was TVM + I + Γ. This result indicates that the genes used in this study may have evolved under very similar evolutionary patterns (Table S9). Further analyses suggest that these genes are phylogenetically highly informative, with an average frequency of parsimony-informative (Pi) sites of 70%, much higher than those of widely used organellar genes such as rbcL and atpB (52% and 54%, respectively; Table 1) (Savolainen et al., 2000).

Table 1.  Sequence and tree statistics for five selected low-copy genes

Gene      Length   Constant   Pi characters   Pi (%)   CI      RI      RC      Tree length   Variable sites   Rates of change
SMC1      3222     598        2321            72.0     0.176   0.427   0.075   32 266        2624             12.30
SMC2      3171     698        2128            67.1     0.198   0.408   0.081   26 208        2473             10.60
MCM5      675      208        411             60.9     0.169   0.480   0.081   5615          467              12.02
MLH1      1098     189        823             75.0     0.195   0.426   0.083   10 142        909              11.16
MSH1      1854     345        1355            73.1     0.180   0.395   0.071   18 818        1509             12.47
5 genes   10 020   2038       7038            70.2     0.184   0.416   0.076   93 538        7982             11.97

  1. Pi, parsimony informative; CI, consistency index; RI, retention index; Rates of change, steps/variable characters.
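The derived columns of Table 1 follow from the raw counts: variable sites are the alignment length minus the constant sites, Pi (%) is the share of parsimony-informative characters in the alignment, and the rate of change is tree length divided by variable sites, as defined in the table footnote. The Python sketch below recomputes these values for the five single-gene rows; the dictionary layout and function name are illustrative only.

```python
# (length, constant sites, Pi characters, tree length) for each gene in Table 1.
TABLE1 = {
    "SMC1": (3222, 598, 2321, 32266),
    "SMC2": (3171, 698, 2128, 26208),
    "MCM5": (675, 208, 411, 5615),
    "MLH1": (1098, 189, 823, 10142),
    "MSH1": (1854, 345, 1355, 18818),
}


def derived_statistics(length, constant, pi_characters, tree_length):
    """Return (variable sites, Pi percentage, rate of change) for one gene."""
    variable = length - constant
    pi_percent = round(100.0 * pi_characters / length, 1)
    rate = round(tree_length / variable, 2)  # steps per variable character
    return variable, pi_percent, rate


for gene, row in TABLE1.items():
    print(gene, derived_statistics(*row))
```

For SMC1, for example, this gives 2624 variable sites, 72.0% parsimony-informative characters and 12.3 steps per variable character.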

Strong support for most angiosperm phylogenetic relationships using five nuclear genes

In order to use these nuclear genes for phylogenetic analysis, we aligned the amino acid sequences of the SMC1, SMC2, MCM5, MSH1 and MLH1 proteins and generated the corresponding nucleotide matrices. In total, the length of the concatenated 5-gene nucleotide matrix reached 10 020 bp, with < 15% being missing data. Topologies recovered by different methods were essentially the same when using the same matrix, and c. 83% of groupings were in agreement between the 5-gene nucleotide tree (NUtree hereafter) and the corresponding amino acid tree (AAtree hereafter) when the same phylogenetic method was used (Figs 4, S10). The nodes that differed usually had very low support values and have been considered unresolved in all previous works (Jansen et al., 2007; Moore et al., 2007). In the NUtree generated by MrBayes, 94% of nodes had posterior probabilities (PP) of 0.95 or greater; in the NUtree inferred by RAxML, 87.5% of nodes had bootstrap values (BP) of 70% or higher. After comparing the five single-gene trees with the NUtree, we found that SMC1 performed best among them. Compared with the NUtree, there were 72, 65, 48, 45 and 61 congruent nodes with support values > 50% in the SMC1, SMC2, MCM5, MLH1 and MSH1 single-gene trees, respectively (Fig. S8). The latter three genes had relatively short cloned regions in this study. Furthermore, concatenated trees of four (SMC1, MCM5, MSH1 and MLH1), three (MLH1, MSH1 and MCM5) and two (MLH1 and MSH1) genes were reconstructed. Using the NUtree as a reference, more genes led to higher support values for most nodes, but not always for controversial relationships (Fig. S9).

Figure 4.

A phylogram of the best ML tree inferred by RAxML (− loge L = −382 176.038) based on concatenated 5-gene nucleotide sequences. Asterisks indicate support values of posterior probabilities (PP) = 1 and bootstrap (BP) > 95. Numbers associated with internal branches are also support values (PP/BP). For backbone nodes, additional support values estimated using the amino acid matrix are also shown, in the following order: PPs from the nucleotide and amino acid matrices by MrBayes, respectively; BPs from the nucleotide and amino acid matrices using RAxML, respectively. Zamia, Pinus and Picea were specified as outgroups. The scale bar indicates the number of changes per site.

Our results provide full support for the monophyly of many previously defined groups, including all the orders analysed here and several larger groups: eudicots, monocots, magnoliids, Mesangiospermae and the major subgroups of eudicots (such as core eudicots, rosids and asterids), irrespective of taxon sampling (Fig. S11a–d) or 3rd codon position inclusions/exclusions (Fig. S12). Also well supported were Ranunculales, Gunneraceae and Acorales as the basalmost positions within eudicots, core eudicots and monocots, respectively. Strong support was also obtained for Amborellales, Nymphaeales and Austrobaileyales being successive sister groups of the extant angiosperms, especially in AAtrees (Figs 4, S10). Compared with the latest comprehensive phylogenetic work using 83 plastid genes from 86 species, c. 78.3% (54/69) of groupings were the same. Among the differences, 57.1% (8/14) of nodes had support of BP value of at least 70% and 21.4% (3/14) were with BP of 100% in our analyses.

The phylogeny here showed strong support for a few relationships that are different from those described by previous studies. Specifically, among rosids, all eight analyses (Figs 4, S10) and AU tests using five concatenated genes (Table S10) supported the affinity between malvids (including Brassicales, Malvales and Sapindales) and the COM clade (Celastrales, Oxalidales, Malpighiales), while the nitrogen-fixing clade of fabids (Cucurbitales, Fagales, Fabales, Rosales) was only distantly related to malvids and the COM clade; the same pattern was also recovered in all five single-gene trees (Fig. S7a–e); in addition, this placement was very robust to sampling and position inclusions/exclusions (Figs S11, S12). This placement, which was also observed in independent recent studies in spite of low support values or sparse sampling (Zhu et al., 2007; Finet et al., 2010; Qiu et al., 2010; Shulaev et al., 2010; Burleigh et al., 2011; Lee et al., 2011), is different from the previous grouping of fabids and the COM clade as eurosids I, and malvids as eurosids II. In addition, Ericales and Cornales together formed a sister group of the core asterids (Figs 4, S10), instead of being successive sisters to other asterids, as was reported in previous studies (APG III, 2009). All 5-gene trees, except for the one based on the matrix excluding the 3rd codon positions, also supported the placement of Caryophyllales as sister to rosids (Figs 4, S10, S12); previously, however, it was always placed as sister to asterids (APG III, 2009). An additional major difference is the sister relationship of Chloranthaceae and Ceratophyllaceae, as supported by almost all analyses (Fig. S10). Our results also supported a possible CCMM clade including Chloranthaceae, Ceratophyllaceae, magnoliids and monocots, in apparent divergence from results based on plastid genomes that placed Ceratophyllaceae and monocots as successive sisters of eudicots (Moore et al., 2007, 2010).

Divergence times for major groups were similar to those from previous studies

Molecular phylogeny and related sequence analysis can be used to estimate divergence times among lineages (Nei & Kumar, 2000). Similar to the patterns for rbcL, atpB, rps4 and 18S rDNA (Soltis et al., 2002), the five nuclear genes included in this study have also evolved at unequal rates across different lineages of the tree. Nevertheless, by using a relaxed molecular clock model, divergence time estimations with these five genes yielded results (Fig. 5 and Tables 2, S11) similar to previous findings (Wikström et al., 2001; Moore et al., 2007, 2010; Bell et al., 2010; Zhang et al., 2011), except where the grouping patterns differ. Specifically, the divergence times of major groups of angiosperms were as follows: Angiospermae 198 Ma, Mesangiospermae 145 Ma, Monocotyledoneae 124 Ma, Magnoliidae 116 Ma, Pentapetalae 109 Ma, rosids 99 Ma and asterids 93 Ma (Fig. 5). Clearly, in addition to phylogenetic estimates, these low-copy nuclear genes can also be used for molecular clock analyses.

Figure 5.

Chronogram showing angiosperm divergence times as estimated by BEAST using the five concatenated genes. The four calibrations used in this study are marked with solid circles. Diversification times are described in detail in Supporting Information Table S11.

Table 2. Divergence times and ranges (95% HPD) for major angiosperm groups as estimated by BEAST

Clade                                                   Age, Ma (95% HPD)
Angiospermae                                            198 (163–256)
Nymphaeales + Austrobaileyales + Mesangiospermae        179 (158–211)
Austrobaileyales + Mesangiospermae                      161 (146–185)
Mesangiospermae                                         145 (133–163)
Chloranthus + Ceratophyllum + Magnoliidae + monocots    137 (124–154)
Chloranthus + Ceratophyllum + Magnoliidae               131 (113–153)
Chloranthus + Ceratophyllum                             119 (90–147)
Monocotyledoneae                                        124 (108–142)
Eudicotyledoneae                                        126 (123–127)
Magnoliidae                                             116 (89–144)
Gunnera + Pentapetalae                                  112 (107–116)
Pentapetalae                                            109 (104–114)
Asterids                                                93 (81–103)
Rosids                                                  99 (94–103)

Ma, million years ago.
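
The values in Table 2 can be checked for internal consistency with a short script. The sketch below is illustrative only and is not part of the original analysis: the ages and 95% HPD bounds are copied from Table 2, whereas the child-to-parent nesting in `PARENT` is an assumption inferred from the clade names and from the topology described in the text, and the function names are hypothetical.

```python
# Point estimates and 95% HPD bounds (Ma), copied from Table 2: (age, lower, upper).
TABLE2 = {
    "Angiospermae": (198, 163, 256),
    "Nymphaeales + Austrobaileyales + Mesangiospermae": (179, 158, 211),
    "Austrobaileyales + Mesangiospermae": (161, 146, 185),
    "Mesangiospermae": (145, 133, 163),
    "Chloranthus + Ceratophyllum + Magnoliidae + monocots": (137, 124, 154),
    "Chloranthus + Ceratophyllum + Magnoliidae": (131, 113, 153),
    "Chloranthus + Ceratophyllum": (119, 90, 147),
    "Monocotyledoneae": (124, 108, 142),
    "Eudicotyledoneae": (126, 123, 127),
    "Magnoliidae": (116, 89, 144),
    "Gunnera + Pentapetalae": (112, 107, 116),
    "Pentapetalae": (109, 104, 114),
    "Asterids": (93, 81, 103),
    "Rosids": (99, 94, 103),
}

# ASSUMPTION: nesting (child -> enclosing clade) inferred from the clade names
# and the topology described in the text; it is not stated in Table 2 itself.
PARENT = {
    "Nymphaeales + Austrobaileyales + Mesangiospermae": "Angiospermae",
    "Austrobaileyales + Mesangiospermae":
        "Nymphaeales + Austrobaileyales + Mesangiospermae",
    "Mesangiospermae": "Austrobaileyales + Mesangiospermae",
    "Chloranthus + Ceratophyllum + Magnoliidae + monocots": "Mesangiospermae",
    "Chloranthus + Ceratophyllum + Magnoliidae":
        "Chloranthus + Ceratophyllum + Magnoliidae + monocots",
    "Chloranthus + Ceratophyllum": "Chloranthus + Ceratophyllum + Magnoliidae",
    "Magnoliidae": "Chloranthus + Ceratophyllum + Magnoliidae",
    "Monocotyledoneae": "Chloranthus + Ceratophyllum + Magnoliidae + monocots",
    "Eudicotyledoneae": "Mesangiospermae",
    "Gunnera + Pentapetalae": "Eudicotyledoneae",
    "Pentapetalae": "Gunnera + Pentapetalae",
    "Asterids": "Pentapetalae",
    "Rosids": "Pentapetalae",
}


def estimates_outside_hpd(table):
    """Clades whose point estimate falls outside its own 95% HPD interval."""
    return [name for name, (age, low, high) in table.items()
            if not low <= age <= high]


def nesting_violations(table, parent):
    """(child, parent) pairs in which the child is not younger than its parent."""
    return [(child, par) for child, par in parent.items()
            if table[child][0] >= table[par][0]]


def hpd_widths(table):
    """Width of the 95% HPD interval (Myr) for each clade."""
    return {name: high - low for name, (_, low, high) in table.items()}


if __name__ == "__main__":
    print("outside HPD:", estimates_outside_hpd(TABLE2))
    print("nesting violations:", nesting_violations(TABLE2, PARENT))
    widths = hpd_widths(TABLE2)
    print("widest interval:", max(widths, key=widths.get))
    print("narrowest interval:", min(widths, key=widths.get))
```

Under the assumed nesting, every point estimate lies within its interval and every nested clade is younger than the clade enclosing it; the widest interval is that of Angiospermae (93 Myr) and the narrowest that of Eudicotyledoneae (4 Myr).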


Conserved low-copy nuclear genes are excellent markers for angiosperm phylogeny

Current phylogenetic relationships among angiosperm groups have been obtained mainly using organellar genes, largely because of their nearly certain orthology and the ease of obtaining gene sequences. However, several limitations restrict their usefulness to some extent. First of all, compared with the low-copy nuclear genes used in this study, the plastid and mitochondrial genes are so conserved that they do not provide sufficient phylogenetically informative sites to resolve middle- and low-rank taxonomic relationships (Clegg et al., 1994; Knoop, 2004). In addition, the sizes of organellar genomes are much smaller than those of nuclear genomes; therefore, some hard-to-resolve relationships still remain in angiosperm phylogeny even when the sequences of entire plastid genomes were used (Jansen et al., 2007; Moore et al., 2007, 2010). Another factor that prevents organellar genes from being used universally is the loss of some chloroplast genes from parasitic plants (Palmer, 1990; Keeling & Palmer, 2008). Moreover, unlike nuclear genes, organellar genes are mostly inherited uniparentally, so that only a partial evolutionary history can be traced. Hybrid speciation, however, has occurred frequently in nature, and only nuclear genes provide a biparental record of the history of the evolutionary process (Birky, 2001; Hansen et al., 2007; Ness et al., 2011). For these reasons, in recent years, plant systematists working on specific groups have often indicated that more nuclear genes with more variable regions should be used to uncover the complex evolutionary history of angiosperms (Cruz-Mazo et al., 2009; Lu et al., 2010; Li et al., 2011b).
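
The notion of a phylogenetically informative site used in this comparison can be made concrete. A minimal sketch follows, using the standard definition of a parsimony-informative site (at least two character states, each present in at least two sequences) on a nucleotide alignment. It is an illustration rather than the procedure used in this study; the function name `site_classes`, the set of symbols treated as missing data and the toy alignment are all assumptions made for the example.

```python
from collections import Counter

# ASSUMPTION: symbols treated as missing data in a nucleotide alignment.
MISSING = {"-", "?", "N"}


def site_classes(alignment):
    """Return (variable, parsimony_informative) site counts.

    `alignment` is a list of equal-length aligned nucleotide strings.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    variable = informative = 0
    for column in zip(*alignment):
        # Count each observed state in this column, skipping missing data.
        counts = Counter(c.upper() for c in column if c.upper() not in MISSING)
        if len(counts) < 2:
            continue  # constant (or entirely missing) site
        variable += 1
        # Informative: at least two states, each seen in at least two sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


if __name__ == "__main__":
    # Hypothetical four-taxon toy alignment, not data from this study.
    toy = ["ACGTAC", "ACGTAT", "ATGTCT", "ATGACT"]
    print(site_classes(toy))  # (4, 2)
```

In the toy alignment, four of the six columns vary but only two of them are parsimony-informative, because in the other two the minority state occurs in a single sequence.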

Low-copy nuclear genes have long been considered to be important for reconstructing angiosperm phylogeny (Strand et al., 1997; Sang, 2002; Small et al., 2004; Wu et al., 2006). However, owing to the difficulty in identifying orthologs and obtaining gene sequences, only a few genes have been used to resolve the relationships of low-rank taxonomic groups (Baldwin, 1992; Sang et al., 1997; Galloway et al., 1998; Zhu & Ge, 2005; Yuan et al., 2009; Ness et al., 2011). Among them, ITS has been the most widely used marker: a recent search indicated that Baldwin’s paper, which first proposed ITS as a phylogenetic marker, has been cited ∼1300 times (Baldwin et al., 1995). However, as we discussed above, ITS is a part of the rRNA genes that is subject to incomplete homogenization in several species (Álvarez & Wendel, 2003; Ayalew et al., 2011), possibly confounding phylogenetic accuracy. In addition, this marker is relatively short, with ITS-1 and ITS-2 both being shorter than 300 bp, and thus its resolving power for most groups is not very high.

In this work, > 1000 low-copy putative orthologous genes were identified as candidate phylogenetic markers by comparing the genomes of a moss and seven representative angiosperms. Their amino acid sequence identities ranged from 28.9% to 94.2%, suggestive of their variable evolutionary rates; therefore, they include genes with both relatively slow and relatively rapid evolutionary rates for resolving relationships among both high- and low-rank taxonomic hierarchies. We then thoroughly inspected a set of conserved low-copy nuclear genes related to DNA/RNA metabolism and demonstrated their suitability for reconstruction of angiosperm phylogeny. Because these genes perform conserved functions necessary for all organisms, they are much less likely to be involved in adaptive fitness or under environmentally driven selective pressure than genes for signalling or regulatory processes. We also showed that they were easy to amplify from highly divergent angiosperm species, which represents the first report of using low-copy nuclear genes for reconstructing angiosperm phylogeny on a large scale. As a proof of principle, our phylogenetic analyses of angiosperms demonstrated that these nuclear genes carry sufficient phylogenetic information to resolve both deep and shallow relationships, including several relationships that have been difficult to determine. Also, the many intron sequences described here that differ between two closely related species provide candidates that are easily amplified by genomic PCR for resolving relationships among low-level taxonomic hierarchies and for DNA barcoding purposes (Hebert et al., 2003; Kress et al., 2005; Lahaye et al., 2008; Li et al., 2011a). With the improvement of sequencing technology, more and more genome and transcriptome data will become available, so, in addition to organellar genes, the low-copy nuclear genes identified here can become new choices for plant systematists in future studies.
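
As an illustration of how an amino acid sequence identity such as those quoted above can be computed, the following minimal sketch scores two pre-aligned protein sequences. The convention adopted here (identical residues divided by the number of columns in which neither sequence has a gap) is one of several in common use and is an assumption, as are the function name and the toy sequences; the convention actually used in this study is not specified in this section.

```python
GAP = "-"


def percent_identity(seq_a, seq_b):
    """Percent identity of two aligned sequences over ungapped columns.

    ASSUMPTION: columns in which either sequence has a gap are excluded
    from both the numerator and the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP or b == GAP:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical aligned peptide fragments, not sequences from this study.
    print(round(percent_identity("MKT-LLA", "MRTALLA"), 1))  # 83.3
```

Here five of the six ungapped columns match, giving 83.3%. Other conventions (for example, dividing by the full alignment length) would give a different value for the same pair, so the convention should always be reported alongside the number.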

An angiosperm phylogeny with support for alternative hypotheses

Initially, individual genes were used separately to construct angiosperm phylogenies; later, more genes or even whole plastid genomes were combined to improve the phylogeny. After decades of effort, the recently proposed APG III classification contains many consistently supported relationships (Chase et al., 1993; Soltis et al., 1997; Jansen et al., 2007; APG III, 2009; Moore et al., 2010). However, the placements of some groups remain uncertain, with low support values. In the phylogeny reported here, 78.3% of nodes at the level of order or higher are congruent with the latest comprehensive analysis using 83 plastid genes (Moore et al., 2010), indicating that these relationships are the most strongly supported by both chloroplast and nuclear genes.
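
A congruence figure of this kind can be obtained by counting the clades that two trees share. The sketch below shows one simple way to do this for rooted trees written as nested tuples; it is not the procedure used in this study, which is not described in this section. The function names are hypothetical, and the two toy topologies are deliberately simplified stand-ins for the alternative rosid arrangements discussed in the text (the COM clade grouped with the nitrogen-fixing clade versus the COM clade grouped with malvids).

```python
def clade_set(tree):
    """Leaf sets of all internal nodes below the root of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])  # a leaf is its own one-member set
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    root = visit(tree)
    found.discard(root)  # the root clade is shared trivially
    return found


def shared_clade_fraction(reference, other):
    """Fraction of the clades in `reference` that also occur in `other`."""
    ref = clade_set(reference)
    if not ref:
        raise ValueError("reference tree has no internal clades to compare")
    return len(ref & clade_set(other)) / len(ref)


if __name__ == "__main__":
    # Simplified, hypothetical topologies for illustration only.
    plastid_like = ("outgroup", ((("N-fixing", "COM"), "malvids"), "asterids"))
    nuclear_like = ("outgroup", (("N-fixing", ("COM", "malvids")), "asterids"))
    print(round(shared_clade_fraction(plastid_like, nuclear_like), 3))  # 0.667
```

In this toy case, two of the three clades in the first tree (the rosids as a whole, and rosids plus asterids) are recovered in the second, while the grouping of the COM clade with the nitrogen-fixing clade is not. Comparing rooted clades is a simplification; unrooted trees are usually compared through their bipartitions.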

At the same time, our results support hypotheses distinct from previous ones (Jansen et al., 2007; Moore et al., 2007, 2010). Within rosids, our analyses strongly support the sister relationship of malvids and the COM clade, consistent with some recent independent results with variable degrees of confidence using either mitochondrial or nuclear genes (Zhu et al., 2007; Finet et al., 2010; Qiu et al., 2010; Shulaev et al., 2010; Lee et al., 2011). The sisterhood of the COM clade and malvids is also moderately supported by floral structural features (Endress & Matthews, 2006). Another obvious difference involves Cornales and Ericales, which together formed a sister group to euasterids here, unlike the previous hypothesis of Cornales and Ericales being successive sisters to the remaining asterids (Albach et al., 2001; Bremer et al., 2002; Moore et al., 2010). In addition, Vitaceae and Saxifragales were sisters with high support in all analyses except in the NUtree excluding the 3rd codon positions. This sister relationship was further supported by statistical tests (Table S4) and by recent analyses using plastid genomes and nuclear ribosomal protein genes (Finet et al., 2010; Moore et al., 2010). Different topologies recovered by genes from different genomes might have resulted from their different evolutionary histories, which may be caused by different modes of inheritance. These apparent differences reinforce the necessity of obtaining evidence from both organellar and nuclear genes.

Some relationships recovered in this study with moderate support were also different from previous topologies. For example, in monocots, all analyses moderately supported Dioscoreales as the sister group of commelinids, a widely accepted monophyletic group composed of Poales, Commelinales, Arecales and Zingiberales (Chase et al., 2006), followed by Asparagales and Liliales as successive sisters, although five alternative hypotheses with Asparagales, Liliales and Dioscoreales as successive sisters of commelinids in different orders (Table S10) could not be excluded using AU tests. By contrast, Asparagales or Liliales were placed as sister to commelinids in analyses using, respectively, plastid genes (Davis et al., 2004; Chase et al., 2006; Graham et al., 2006; Moore et al., 2010; Soltis et al., 2011) or mitochondrial genes (Qiu et al., 2010) with variable degrees of confidence, indicating that the relationships among these four major groups are still uncertain. Also, Caryophyllales has usually been placed as sister to asterids using plastid and/or mitochondrial genes (Hilu et al., 2003; APG III, 2009; Moore et al., 2010; Soltis et al., 2011); however, our results supported its sisterhood to rosids in all matrices/methods (Figs 4, S11b,c) except in the NUtree excluding the 3rd codon positions (Fig. S12), similar to a recent topology recovered using 77 nuclear ribosomal genes (Finet et al., 2010). Among the major lineages of core eudicots, our NUtrees supported the placement of the clade of Vitaceae and Saxifragales in a group called super-rosids, as shown in a recent study with plastid genomes (Moore et al., 2010). However, our AAtrees supported a basal position in the core eudicots, in agreement with the study using nuclear ribosomal genes (Finet et al., 2010). Therefore, the placement of the clade of Vitaceae and Saxifragales is still uncertain.
In Lamiidae, our findings that Lamiales and Solanales cluster together, with Gentianales being their sister, were similar to those from several previous analyses (Bremer et al., 2002; Hilu et al., 2003). However, other studies using different genes and methods have yielded two alternative topologies (Albach et al., 2001; Finet et al., 2010; Moore et al., 2010), which could not be rejected by AU tests either (Table S10). The difficulty in resolving these relationships indicates that more genes from all three genomes – especially nuclear genes – and denser sampling are required to address these issues.

In addition, only weak support was obtained for the positions of several major groups, such as those among the Mesangiospermae – that is, the eudicots, monocots, magnoliids, Ceratophyllaceae and Chloranthaceae. Understanding the relationships among these groups is one of the most difficult problems in plant phylogenetics (Qiu et al., 2006) and there are various hypotheses for the internal relationships of Mesangiospermae (Qiu et al., 1999, 2006; Hilu et al., 2003; Moore et al., 2007, 2010). Recently, Ceratophyllaceae was placed as sister of eudicots and Chloranthaceae as sister of magnoliids, with only moderate support even when entire plastid genomes were used (Moore et al., 2010). Here, almost all analyses weakly supported the sister relationship of Ceratophyllaceae and Chloranthaceae, as was also proposed in a recent study using four mitochondrial genes (Qiu et al., 2010); floral morphological evidence also supported this combination (Endress & Doyle, 2009). Furthermore, the monophyletic CCMM clade, although speculative, suggests a distinct hypothesis that warrants further investigation. Therefore, additional gene markers and/or more taxa, as well as careful morphological inspection of live and fossil material, are required for relationships that are difficult to resolve.


We describe here a set of highly conserved single- or low-copy nuclear genes as excellent candidate phylogenetic markers. Detailed sequence analyses of five representative genes revealed that they include both highly conserved and more divergent regions; the former allow easy alignment and design of primers for amplification, whereas the latter provide informative sites for phylogenetic analysis, as well as for DNA barcoding purposes. Indeed, we were able to obtain their homologs from dozens of flowering plants covering the entire spectrum of angiosperms and representing most orders, with the five single-gene phylogenies being largely consistent with well-supported organismal relationships. These highly conserved nuclear genes are present in all eukaryotes, allowing integration of plant phylogenies into the eukaryotic tree of life. The small number of highly informative genes also facilitates the analysis of many more organisms in a single study, because of both the ease of their amplification and the economy of computational time.

The angiosperm phylogeny we obtained showed maximum support for most clades and was largely consistent with previous hypotheses, indicating that both nuclear and organellar genes support well-established angiosperm relationships. The strongly supported differences in placements for some groups suggest different evolutionary histories for nuclear and organellar genes; therefore, both kinds of markers are necessary for reconstruction of angiosperm phylogeny. The highly informative and easily cloned nuclear genes will facilitate future investigation of the angiosperm phylogeny with expanded taxon sampling and promote understanding of the structural and functional evolution of flowering plants.


We thank Yonghong Hu, Yuqin Wang, Chunce Guo and Ming Ding for their help in plant sample collection. We also thank Ji Yang, Lu Lu and Liangsheng Zhang for helpful discussions and Fan Lu for specimen identification. We are very grateful to Yaqiong Wang and Haifeng Wang for developing Perl scripts for this work, to Hongxing Yang and Qiang Zhang for help with software use, and to Ji Qi for improving the appearance of the figures. This study was supported by grants from the Natural Science Foundation of China (grant no. 31100156), the Postdoctoral Science Foundation of China (grant nos 20100480549 and 201003241), the Ministry of Sciences and Technology of China (2011CB944600) and funds from Fudan University (including the 211 and 985 programs).