The pentatricopeptide repeat (PPR) gene family, a tremendous resource for plant phylogenetic studies


Author for correspondence:
Yao-Wu Yuan
Tel:+1 206 616 7156
Fax:+1 206 685 1728


  • Despite the paramount importance of nuclear gene data in plant phylogenetics, the search for candidate loci is believed to be challenging and time-consuming. Here we report that the pentatricopeptide repeat (PPR) gene family, containing hundreds of members in plant genomes, holds tremendous potential as a source of nuclear gene markers.
  • We compiled a list of 127 PPR loci that are all intronless and have a single orthologue in both rice (Oryza sativa) and Arabidopsis thaliana. The uncorrected p-distances were calculated for these loci between two Arabidopsis species and among three Poaceae genera. We also selected 13 loci to evaluate their phylogenetic utility in resolving relationships among six Poaceae genera and nine diploid Oryza species.
  • PPR genes have a rapid rate of evolution and are best used at intergeneric and interspecific levels. Despite substantial amounts of missing data, almost all individual data sets from the 13 loci generate well-resolved gene trees.
  • With their unique combination of three characteristics (a large number of loci with established orthology assessment, an absence of introns, and a rapid rate of evolution), the PPR genes have many advantages as phylogenetic markers (e.g. straightforward alignment, minimal effort in generating sequence data, and versatile utility). We anticipate that these loci will play an important role in plant phylogenetics.


Introduction

The paramount importance of single- or low-copy nuclear gene sequence data in plant phylogenetic studies has been elaborated extensively in reviews (Sang, 2002; Small et al., 2004; Hughes et al., 2006) and has resonated frequently in empirical studies (e.g. Alvarez et al., 2008; Yuan & Olmstead, 2008; Steele et al., 2008) and commentaries (Mort & Crawford, 2004; Crawford & Mort, 2004). However, nuclear loci that can be routinely employed for most angiosperm groups are scarce and the search for candidate nuclear loci is still a challenging and continuing endeavour (Mort & Crawford, 2004). Earlier studies (Strand et al., 1997; Bailey & Doyle, 1999; Olsen & Schaal, 1999; Small & Wendel, 2000; Tank & Sang, 2001; Howarth & Baum, 2002) mainly took the ‘low-copy nuclear gene approach’ (Hughes et al., 2006) by selecting a well-characterized gene or small gene family and testing the phylogenetic utility of these selected loci in a particular study group. This approach has often proved effective, but it is restricted to a single locus or a small number of loci, which are insufficient to resolve many plant phylogenetic problems, especially at lower taxonomic levels.

With the rapid development of complete genome sequence and expressed sequence tag (EST) databases emerges the conserved orthologue set (COS) approach (Fulton et al., 2002; Wu et al., 2006; Padolina, 2006; Chapman et al., 2007; Alvarez et al., 2008). By comparisons of EST and/or complete genome sequences between model organisms, COS markers can be identified and used to develop sets of primers that amplify putative orthologue sequences across the taxa of interest for phylogenetic investigations. As whole genome sequence and EST databases continue growing on a daily basis, the COS approach holds great promise in screening a large number of nuclear loci for phylogenetic studies. However, there is one major problem with the COS approach. Given the vast number of putative COS markers this approach often produces (e.g. Wu et al. (2006) found 2869 single-copy orthologues shared by euasterids) and little prior knowledge about these loci, it often requires labour-intensive preliminary work to screen loci for their appropriate phylogenetic utility in a specific study group. For instance, after examining 141 nuclear primer combinations designed from such COS markers (Padolina, 2006), Steele et al. (2008) found only three phylogenetically informative loci for resolving interspecies relationships in the genus Psiguria (Cucurbitaceae) and two loci for resolving intergeneric relationships in the family Geraniaceae. In the end these authors concluded, ‘In any case, identifying phylogenetically informative LCN [low copy nuclear] markers remains a time-consuming endeavor ...’.

Hoping to identify numerous nuclear loci that can be used in general plant phylogenetic investigations with little preliminary work, we took an integrative approach that combines the advantages of both approaches mentioned above and avoids the disadvantages of each. The general idea is to identify a large number of putative orthologous loci that are well characterized and information-rich. The availability of online databases such as POGs/PlantRBP (Walker et al., 2007) makes this strategy straightforward. POGs/PlantRBP assigns proteins (with corresponding gene loci) in the rice (Oryza sativa) and Arabidopsis thaliana proteomes to putative orthologous groups (POGs) via a ‘mutual-best-hits’ strategy (Walker et al., 2007). Among the assigned proteins, predicted RNA-binding proteins (RBPs) are particularly well annotated. By mining this database, we found that the enormous pentatricopeptide repeat (PPR) gene family, coding for RNA-binding proteins, may have tremendous potential in plant phylogenetic applications.
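The ‘mutual-best-hits’ idea can be illustrated with a minimal Python sketch. This is not the POGs/PlantRBP implementation (see Walker et al., 2007, for that); the function name, the protein identifiers and the similarity scores below are hypothetical, and a higher score is simply assumed to mean a better hit.

```python
def mutual_best_hits(scores_ab, scores_ba):
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit.

    scores_ab maps each protein of proteome A to {protein_of_B: score};
    scores_ba holds the reverse searches. Higher score = better hit (assumption).
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores, for illustration only.
ath_vs_rice = {
    "At_1": {"Os_1": 900, "Os_2": 300},
    "At_2": {"Os_2": 500, "Os_1": 450},
    "At_3": {"Os_2": 480},
}
rice_vs_ath = {
    "Os_1": {"At_1": 880, "At_2": 440},
    "Os_2": {"At_2": 510, "At_3": 470},
}

pairs = mutual_best_hits(ath_vs_rice, rice_vs_ath)
print(pairs)
```

In this toy case At_3 is left unpaired: its best rice hit (Os_2) prefers At_2, so the reciprocity condition fails.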

The PPR gene family contains c. 450 members in A. thaliana, 477 in O. sativa, and 103 in Physcomitrella patens, whereas there are only a handful of loci in the genomes of green algae and nonplant eukaryotic organisms (Lurin et al., 2004; O'Toole et al., 2008), and virtually none in prokaryotic genomes (Lurin et al., 2004; Pusnik et al., 2007). The A. thaliana PPR genes are more or less evenly distributed throughout the 10 chromosome arms (Lurin et al., 2004). An interesting observation is that c. 80% of the PPR genes in both A. thaliana and rice are intronless (Lurin et al., 2004; O'Toole et al., 2008). PPR proteins are characterized by 2–26 tandem repeats of a highly degenerate 35 amino acid motif, and divided into two subfamilies and four subclasses based on their conserved C-terminal domain structure (Lurin et al., 2004). These proteins are targeted to organelles (i.e. mitochondria and plastids) and involved in many post-transcriptional processes undergone by organellar transcripts, including splicing, editing, processing, and translation (reviewed in Delannoy et al., 2007). The presence of one of the four subclasses (i.e. the DYW subclass) is strictly correlated with the existence of RNA editing in land plants (Salone et al., 2007; Rudinger et al., 2008). Together with other evidence, this led Salone et al. (2007) to propose that the DYW domain found exclusively in PPR proteins is the catalytic domain conducting the enigmatic organelle RNA editing process.

It might be counterintuitive at first glance that such a huge gene family with most members being intronless can have any phylogenetic utility. (1) Won't the massive number of gene copies make orthology assessment extremely difficult? (2) Can these protein-coding sequences provide sufficient variation to address phylogenetic problems at lower taxonomic levels? A phylogenetic analysis including all of the rice and A. thaliana PPR genes revealed that an extraordinarily large proportion of these genes form well-supported pairs that are probably A. thaliana and rice orthologues (O'Toole et al., 2008), which suggested that most of the PPR gene loci predate the divergence of eudicots and monocots with few duplications since then. This is consistent with the finding from the POGs/PlantRBP database (Walker et al., 2007) that the majority of PPR genes have a single orthologue in both A. thaliana and rice genomes. These results suggest that evaluating the orthology of most PPR genes should not be difficult. In addition, considering that PPR proteins probably function as RNA-binding molecules in a sequence-specific manner (Delannoy et al., 2007), they may have a rapid rate of evolution to adjust to changes in the targeted RNA species. This means that, within a putative orthologue group, sequences could be divergent enough to provide variation that could be used in resolving relationships at lower taxonomic levels, despite the lack of rapidly evolving introns. In fact, the absence of introns can be a great advantage for many phylogenetic applications such as resolving intergeneric relationships (see the Discussion). It is these three appealing characteristics – a huge number of loci, but an easily assessed orthology; an absence of introns; the likelihood of a rapid rate of evolution – that stimulate us to explore the potential of PPR genes in plant phylogenetic studies.

In this paper we have aimed: (1) to compile a comprehensive list of PPR genes that are intronless and have a single orthologue in both A. thaliana and rice; (2) to compute the pairwise distance at these compiled loci between two Arabidopsis species (A. thaliana and Arabidopsis lyrata) and among three Poaceae genera (Oryza, Zea and Sorghum) in order to obtain a cursory estimation of variation at each locus; and (3) to select a small proportion of these loci to evaluate their utility in resolving relationships among six Poaceae genera and nine diploid Oryza species. There is a large amount of genomic sequence data for 12 Oryza species (not including cultivated rice, O. sativa) in GenBank as trace archives from the Oryza Map Alignment Project (OMAP; Wing et al., 2005). Eight of them are diploid species, representing all six diploid Oryza genome types (Nayar, 1973; Aggarwal et al., 1997; Ge et al., 1999). Abundant ESTs are also available for three other Poaceae genera, Triticum (Triticum aestivum), Hordeum (Hordeum vulgare) and Saccharum (Saccharum officinarum), from The Institute for Genomic Research (TIGR) Plant Transcript Assemblies database (Childs et al., 2007). These publicly available genomic data provide a good opportunity to examine the phylogenetic utility of PPR gene loci.

Materials and Methods

Locus screening, sequence retrieval and annotation

We retrieved all the Arabidopsis thaliana (L.) Heynh. PPR gene family members with their putative orthologues in rice (Oryza sativa L.) from the POGs/PlantRBP database (Walker et al., 2007) by searching ‘At*’ by gene AND ‘PPR’ by domain. The A. thaliana PPR genes are assigned to 418 POGs, most of which contain a single locus in both rice and A. thaliana. The results were downloaded to an Excel file (see the Supporting Information, Table S1). We then screened loci for phylogenetic utility in a stepwise manner.

  • (1) If the POG contains a single locus in both rice and A. thaliana, continue to (2); otherwise, abandon.
  • (2) If the gene pair in the remaining POGs is marked as ‘well supported’ in POGs/PlantRBP, continue to (3); otherwise, abandon. When building the POGs/PlantRBP database, Walker et al. (2007) took a phylogenetic approach to evaluate POGs assigned by the ‘mutual-best-hit’ method. The top blast hits with > 50% coverage (either hit/query or query/hit) for each protein were retrieved to produce a multiple alignment and corresponding guide tree. Only those POG assignments supported by the tree topology were marked as ‘well supported’.
  • (3) If the A. thaliana gene in the remaining POGs is intronless, continue to (4); otherwise, abandon. This was done by comparing the A. thaliana locus ID against the ‘Arabidopsis intronless gene list’ from Jain et al. (2008).
  • (4) Follow the POGs/PlantRBP link for each rice locus to TIGR and for each A. thaliana locus to The Arabidopsis Information Resource (TAIR). These linked pages have comprehensive gene descriptions for the corresponding loci. If the rice gene in the remaining POGs is intronless, continue to (5) with sequence retrieval and annotation; otherwise, abandon.
  • (5) For each remaining POG, download the rice coding sequence (CDS) from TIGR and the A. thaliana CDS from TAIR. In many cases the CDS is the same as the cDNA as well as the genomic DNA sequence, while in other cases the genomic DNA sequence is longer than the CDS by regulatory regions at the 5′ and/or 3′ ends. Blast the A. thaliana sequence against Arabidopsis lyrata (L.) O’Kane & Al-Shehbaz trace archives using the megablast program and default parameters. Sequences with an E-value < e−100 were downloaded as SCF files, and were edited and assembled using sequencher 4.7 (Gene Codes Corporation, Ann Arbor, MI, USA). The positions of start and stop codons were determined by comparison with the A. thaliana sequence. Similarly, sequences of Sorghum bicolor (L.) Moench and Zea mays L. were retrieved by blasting rice sequences against the S. bicolor and Z. mays trace archives, but using the discontiguous megablast program, as the divergence between rice and S. bicolor or Z. mays is much greater than that between the two Arabidopsis species. Downloaded sequences were subsequently annotated using the sequencher 4.7 software. In a small proportion of the retained POG loci, one or more of the three annotated sequences (A. lyrata, S. bicolor and Z. mays) had > 5% polymorphic sites (see Table S1). These loci were also abandoned, as a final step, to minimize the possibility that our data for each selected locus include paralogous sequences from recent gene duplications. A set of 127 loci was finally retained.
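The stepwise screen above can be summarized as a chain of filters. The following Python sketch is only an illustration of that logic under stated assumptions: the `Pog` record, its field names and the toy entries are hypothetical, and in the study itself the checks were carried out through database look-ups and manual annotation rather than by a script like this.

```python
from dataclasses import dataclass


@dataclass
class Pog:
    """Hypothetical summary record for one putative orthologous group."""
    name: str
    n_ath_loci: int
    n_rice_loci: int
    well_supported: bool
    ath_intronless: bool
    rice_intronless: bool
    max_polymorphic_fraction: float  # highest value among the assembled sequences


def screen(pog, max_polymorphic=0.05):
    """Return 0 if the POG is retained, else the step (1-5) at which it is abandoned."""
    checks = [
        pog.n_ath_loci == 1 and pog.n_rice_loci == 1,     # step 1: single locus in both
        pog.well_supported,                               # step 2: 'well supported' pair
        pog.ath_intronless,                               # step 3: A. thaliana intronless
        pog.rice_intronless,                              # step 4: rice intronless
        pog.max_polymorphic_fraction <= max_polymorphic,  # step 5: not > 5% polymorphic
    ]
    for step, passed in enumerate(checks, start=1):
        if not passed:
            return step
    return 0


# Hypothetical toy records, for illustration only.
pogs = [
    Pog("POG_a", 1, 1, True, True, True, 0.01),
    Pog("POG_b", 2, 1, True, True, True, 0.00),
    Pog("POG_c", 1, 1, True, True, False, 0.00),
    Pog("POG_d", 1, 1, True, True, True, 0.08),
]
results = {p.name: screen(p) for p in pogs}
retained = [name for name, step in results.items() if step == 0]
print(results, retained)
```

Returning the failing step rather than a plain boolean makes it easy to tabulate how many candidate POGs are lost at each stage.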

Estimation of variation for each selected locus

For each of the 127 loci, sequence alignments between the two Arabidopsis species (A. thaliana and A. lyrata) and among the three Poaceae genera (Oryza, Sorghum and Zea) were performed manually using Se-Al version 2.0a11 (Rambaut, 1996). The uncorrected p-distance was then calculated for the two sets of aligned sequences using the ‘Pairwise Base Differences’ function implemented in paup* version 4.0b10 (Swofford, 2002), as a cursory estimation of variation level. To compare the variation level of these selected PPR gene loci with that of loci extensively used in plant phylogenetic studies, we also retrieved, annotated, and aligned sequences of rbcL, ndhF, matK, trnL-F and ITS, for the two Arabidopsis species and the three Poaceae genera, following the procedures described above. The uncorrected p-distance was also subsequently calculated for the five extensively used loci.
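For readers unfamiliar with the measure, the uncorrected p-distance is simply the proportion of compared sites at which two aligned sequences differ. The Python sketch below is a minimal illustration and not a reimplementation of paup*; in particular, skipping sites where either sequence has a gap, `?` or `N` is an assumption made here, and the two short sequences are invented.

```python
def p_distance(seq_a, seq_b, skip="-?N"):
    """Uncorrected p-distance between two aligned sequences of equal length.

    Sites where either sequence carries a character in `skip` are ignored
    (an assumption of this sketch, not necessarily the paup* behaviour).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in skip or y in skip:
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Invented toy alignment: 9 comparable sites, 2 differences.
d = p_distance("ATGCCGTA-A", "ATGACGTTTA")
print(round(d, 4))
```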

Evaluation of phylogenetic utility

We blasted the rice sequence of each of the 127 loci against those of the eight diploid Oryza species (Oryza australiensis Domin, Oryza brachyantha A.Chev. & Roehrich, Oryza glaberrima Steud., Oryza granulata Nees, Oryza nivara S.D.Sharma & Shastry, Oryza officinalis Wall., Oryza punctata Kotschy ex Steud. and Oryza rufipogon Griff.) that have trace archive sequences in GenBank. Thirteen of the 127 loci, which have partial sequences available for six or more of the eight Oryza species, were selected for further analyses (see Table S2). We then blasted the rice sequence of each of the 13 loci against the ESTs of Triticum aestivum L., Hordeum vulgare L. and Saccharum officinarum L. in the TIGR Plant Transcript Assemblies database. Partial genomic sequences of the eight Oryza species and the ESTs of the additional three Poaceae genera, whenever available, were downloaded, annotated, and aligned with rice, S. bicolor and Z. mays sequences for each locus, following the procedures described above.

Parsimony analysis was performed on each data set of the 13 loci separately, and both parsimony and Bayesian analyses were performed on the combined data set of all 13 loci. Parsimony analyses were conducted using paup* version 4.0b10 (Swofford, 2002). Heuristic searches were performed with 200 random stepwise addition replicates and tree-bisection-reconnection (TBR) branch swapping with MULTREES on. Clade support was determined by bootstrap analyses (Felsenstein, 1985) of 500 replicates, each with 10 random stepwise addition replicates and TBR branch swapping with the MULTREES option effective. Bayesian analysis was conducted using MrBayes version 3.1.2 (Ronquist & Huelsenbeck, 2003). Modeltest 3.7 (Posada & Crandall, 1998) was employed to determine the sequence evolution model that best fits the data. The GTR+G model was selected by the Akaike information criterion (AIC; Akaike, 1974) for the combined 13 loci data set. We carried out two independent runs of 1 000 000 generations using the default priors and four Markov chains (one cold and three heated chains), sampling one tree every 100 generations. The first 2500 trees were discarded as burn-in.
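Two pieces of arithmetic in the paragraph above can be made explicit. The Python sketch below shows how the AIC ranks candidate substitution models (AIC = 2k − 2 ln L, lowest value preferred) and how many trees each Bayesian run samples. The log-likelihood values are hypothetical and chosen only for illustration; the parameter counts exclude branch lengths, which is an assumption about how parameters are counted; and whether the 2500 burn-in trees apply per run is also assumed here, as the text does not say.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical log-likelihoods; k counts substitution-model parameters only.
candidates = {
    "JC": (-12500.0, 0),
    "HKY": (-12300.0, 4),
    "GTR+G": (-12250.0, 9),
}
scores = {model: aic(lnl, k) for model, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)

# Sampling scheme from the text: 1 000 000 generations, one tree per 100.
generations = 1_000_000
sample_every = 100
trees_per_run = generations // sample_every   # 10 000 (ignoring the starting tree)
burnin_fraction = 2500 / trees_per_run        # 0.25 if the burn-in is per run

print(scores, best_model, trees_per_run, burnin_fraction)
```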


Results

Selected loci and their variation level

A set of 127 PPR gene loci was finally obtained via the screening process for potential phylogenetic utility. They all are intronless and have a single orthologue in both rice and A. thaliana as well as the other three annotated taxa (A. lyrata, S. bicolor and Z. mays). Table 1 includes a comprehensive list of these loci with the uncorrected p-distance between A. thaliana and A. lyrata, and the distances among Oryza, Zea and Sorghum. Although these loci consist entirely of protein coding regions, they have a relatively rapid rate of evolution. For example, the pairwise distance between the two Arabidopsis species at the PPR loci ranged from 0.0244 (At1G10270) to 0.0985 (At3G25060), with an average of 0.0512, which is 6.7, 4.1, 2.7, 1.4 and 0.9 times the distance at rbcL, ndhF, matK, trnL-F and ITS, respectively (Table 1). Thirty-seven loci had a larger distance than ITS. Similarly, the pairwise distance between rice (O. sativa) and maize (Z. mays) ranged from 0.1262 (At5G18390) to 0.2789 (At1G10330), with an average of 0.1890, which is 4.7, 3.0, 2.3 and 2.4 times the distance at rbcL, ndhF, matK and trnL-F, respectively (Table 1). The ITS sequences could not be confidently aligned in the three Poaceae genera because of extensive length variation, and therefore no pairwise distances were available for the ITS region among these genera. Fig. 1 is a graphic view of the variation level of these 127 PPR gene loci as well as the five additional loci that have been used extensively in plant phylogenetics. The sizes of these 127 loci in A. thaliana are also listed in Table 1. They range from 909 to 3339 bp, with an average of 1977 bp.

Table 1. List of the 127 pentatricopeptide repeat loci as well as rbcL, ndhF, matK, trnL-F and ITS, with their corresponding uncorrected p-distances and sequence lengths in Arabidopsis thaliana

No.   Locus       Arabidopsis thaliana vs Arabidopsis lyrata   Oryza sativa vs Zea mays   Oryza sativa vs Sorghum bicolor   Sorghum bicolor vs Zea mays   Length in A. thaliana (bp)
5     ITS         0.055                                        NA                         NA                                NA                            639
104   At4G38150   0.0497                                       0.1922                     0.1689                            0.0612                        909

‘NA’ indicates that the uncorrected p-distance is not available because the ITS sequences cannot be unambiguously aligned between the corresponding pair of taxa.
Figure 1.

Graphic view of the variation levels of the 127 pentatricopeptide repeat (PPR) loci and rbcL, ndhF, matK, trnL-F and ITS, indicated by the uncorrected p-distance. The order of loci follows Table 1. Uncorrected p-distances are shown (a) between Arabidopsis thaliana and Arabidopsis lyrata, (b) between Sorghum bicolor and Zea mays, (c) between Oryza sativa and Z. mays, and (d) between O. sativa and S. bicolor. The arrow in (b), (c) and (d) indicates that the pairwise distance at the ITS region is not available because the ITS sequences of these taxa cannot be unambiguously aligned because of extensive length variation.

Phylogenetic utility

Because of incomplete sequences of the eight diploid Oryza species and S. officinarum, H. vulgare and T. aestivum, there are substantial amounts of missing data at the 13 loci that we used to reconstruct the intergeneric relationships within Poaceae and interspecific relationships within Oryza. Taxa that have no sequences at all at a certain locus were excluded from phylogenetic analysis of that locus and were not taken into account when calculating the percentage of missing data. The loci AT4G38150 and AT1G11290 have the lowest (11%) and highest (53%) percentages of data missing, respectively (see Fig. 2 and Table S2). All 14 taxa were included in the analyses of the concatenated data, in which 48% of data are missing.
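The bookkeeping described above (dropping taxa with no sequence at a locus before computing the percentage of missing data) can be sketched as follows. This Python example is an illustration only: coding missing sites as `?` is an assumption, and the toy alignment is invented.

```python
def percent_missing(alignment, missing="?"):
    """Percentage of missing cells at one locus.

    Taxa whose sequence consists entirely of missing characters are excluded
    before the percentage is computed. Returns (percentage, taxa kept).
    """
    kept = {taxon: seq for taxon, seq in alignment.items()
            if any(ch not in missing for ch in seq)}
    if not kept:
        raise ValueError("no taxon has any sequence at this locus")
    total_cells = sum(len(seq) for seq in kept.values())
    missing_cells = sum(sum(ch in missing for ch in seq) for seq in kept.values())
    return 100.0 * missing_cells / total_cells, sorted(kept)


# Invented toy locus: taxon_C has no sequence at all and is excluded.
toy_locus = {"taxon_A": "ATGC", "taxon_B": "AT??", "taxon_C": "????"}
pct, taxa = percent_missing(toy_locus)
print(pct, taxa)
```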

Figure 2.

Gene trees resulting from parsimony analyses of individual data sets for the 13 loci. All trees are drawn to the same scale. Bootstrap values are shown along the branches when > 50. The ID of each locus is in bold and indicated above the gene tree. The number below the locus name (in parentheses) indicates the percentage of missing data at that locus. (a), (b), (e), (i), (l) and (m) show the single maximum parsimony (MP) tree inferred from the corresponding locus. (c) and (k) show one of the six MP trees; (d) shows one of the nine MP trees; (f) shows one of the three MP trees; (g) and (j) show one of the two MP trees; (h) shows one of the 10 MP trees.

Despite the large amount of missing data, almost every individual locus generated a well-resolved gene tree (Fig. 2). The intergeneric relationships within Poaceae were congruent across all 13 loci, and are consistent with the Grass Phylogeny Working Group subfamily classification (Barker et al., 2001; see subfamily designations in Fig. 3). Within the genus Oryza, the monophyly of A-genome species and the phylogenetic position of E-genome species were very consistent among the 13 loci. However, the relationships among the A-, B- and C-genomes and the relationships among the F-genome, G-genome and all other genome types were incongruent among these loci (see genome type designation in Fig. 3). These results are consistent with a recent study of the relationships of the six Oryza diploid genome types using 142 nuclear genes (Zou et al., 2008). Fig. 3 represents the single most parsimonious tree inferred from the concatenated data for all 13 loci. Intergeneric relationships are the same as shown in gene trees resulting from individual data sets. The Oryza genome type relationships are very similar to those reported in the aforementioned study (Zou et al., 2008), except that the positions of the F- and G-genomes are switched.

Figure 3.

The single maximum parsimony (MP) tree inferred from the concatenated data for all 13 loci, 48% of which are missing data. Bootstrap values (BS) and Bayesian posterior probabilities (PP) supporting the corresponding nodes are shown along the branches (BS/PP). The asterisks indicate BS < 50 or PP < 0.95. Subfamily designations of the family Poaceae and genome type designations of the genus Oryza are represented on the right.


Discussion

The PPR genes have three major characteristics that make them excellent candidates for plant phylogenetic studies. First of all, there are a large number of loci but orthology assessment is straightforward. Each of the 127 loci obtained from the screening process in the present study should have a single orthologue in the vast majority of diploid flowering plants, given that a single orthologue was retained in both rice and A. thaliana after the deep split between monocots and eudicots. To test this intuitive assumption, we randomly drew 10 PPR loci from the list (Table 1), and blasted the A. thaliana nucleotide sequences against two other sequenced genomes, those of Populus trichocarpa (Tuskan et al., 2006) and Vitis vinifera (Jaillon et al., 2007), using the cross-species megaBLAST program. The A. thaliana sequence hit a single locus in both genomes at nine of the 10 loci and did not produce any significant hit at the other locus (data not shown). We then blasted the amino acid sequence of this exceptional locus against the same databases using the tblastn program and this time it produced a single best hit (E-value < e−100) in both genomes. In addition, the successful retrieval of unique orthologous sequences from S. officinarum, H. vulgare and T. aestivum using the rice sequences of the 13 loci used in our phylogenetic analyses corroborates this assumption.

Secondly, the majority of PPR genes are intronless (Lurin et al., 2004; O'Toole et al., 2008). In fact, the 127 loci listed in Table 1 are all intronless, as this was one of the selection criteria. There are two important practical advantages in choosing intronless loci. (1) Alignment is straightforward. Alignments of noncoding DNA sequences such as introns or intergenic spacers can be problematic because of extensive length variation among all but the most closely related species. This often necessitates the introduction of numerous (sometimes impracticably numerous) large or small gaps (‘indels’) into the alignment. Intronless genes, in contrast, tend to contain few fixed length mutations and are easy to align unless the taxa of interest are too distantly related (e.g. belong to different major clades of angiosperms). (2) Sequencing requires minimal effort. When a nuclear locus is heterozygous, direct sequencing of intron or intergenic spacer regions becomes almost impossible. A simple deletion or insertion that occurred in one allele but not the other(s) will affect all the sequence reads after this point, and cloning is necessary to generate good quality sequences in this situation. For intronless loci, although polymorphism will be observed if there is allelic variation, sequence reads after the polymorphic sites will probably not be affected, because allelic polymorphisms usually do not involve length mutations in protein coding regions. In addition, nuclear gene introns often contain polynucleotide (e.g. poly-A) and/or microsatellite regions that are extremely difficult to sequence through, whereas intronless genes tend not to contain such regions.

The difference between intronless loci and noncoding regions might be trivial if recovering allelic polymorphisms is the main focus of a study, as in some phylogeography or population genetics studies. In such studies, separating multiple alleles within an individual via cloning is desirable, no matter whether the targeted loci are intronless or noncoding. However, being intronless is an obvious advantage when the main question is phylogenetic relationships of organisms and allelic polymorphism is not an issue (i.e. incomplete lineage sorting is trivial). This advantage will be substantially inflated when resolving intergeneric relationships is the primary interest of a study. Nuclear gene intron sequences often diverge rapidly and may not be aligned at all between distantly related genera. Exon sequences are the only source of useful data. Unfortunately, one may need to sequence across several intron regions to generate sufficient exon sequences from a locus that contains both exons and introns. What is worse is that cloning is likely to be necessary to overcome the length mutation problem in introns for many organisms. The laborious cloning work and wasted effort in generating intron sequences that may be useless in resolving intergeneric relationships can be completely avoided by employing these intronless loci. Of course, one may argue that protein coding regions usually diverge much more slowly than intron regions. An intronless locus without sufficient variation to resolve the targeted phylogenetic problem, particularly at lower taxonomic levels, is not very helpful. The third characteristic of PPR genes suggests that this is not a problem.

The third property of PPR genes is that they have a rapid rate of evolution. Figure 1 shows the general pattern of variation across the 127 loci we selected, in comparison with that of the four chloroplast DNA regions and ITS region. The average pairwise distance for the selected PPR loci between A. thaliana and A. lyrata was 1.4 times that for trnL-F and 0.9 times that for ITS. The average distances for PPR loci among the three Poaceae genera were 2.3–5.6 times those for trnL-F. These data suggest that PPR loci can certainly be used at interspecies and intergeneric levels, considering that both trnL-F and ITS have been extensively used for resolving interspecific and intergeneric relationships (Alvarez & Wendel, 2003; Shaw et al., 2005). Our phylogenetic analyses of partial sequences of 13 selected loci confirm this conclusion. Despite the substantial amount of missing data, individual data sets for the 13 loci generated well-resolved gene trees (Fig. 2). The intergeneric relationships were congruent across all 13 loci and consistent with the subfamily classification (Barker et al., 2001). Within the genus Oryza, there were both congruent (e.g. the position of the E-genome, O. australiensis) and incongruent relationships (e.g. among A-, B- and C-genomes) from one locus to another. These results are consistent with a recent phylogenomic study of the Oryza diploid genome types (Zou et al., 2008). Additionally, considering that these loci are intronless, we speculate that they might also be useful to resolve relationships between closely related families, but this possibility needs to be evaluated in future studies.

The unique combination of these three properties gives PPR gene loci many advantages over other nuclear gene loci as phylogenetic tools. They provide numerous loci with established orthology assessment to use. Generating sequence data of these loci requires only minimal effort and aligning these sequences is straightforward. They have a rapid rate of evolution despite being intronless, and versatile utility at various levels (interspecific, intergeneric, and potentially interfamilial between closely related families). We believe that these loci will play a key role in resolving intergeneric relationships using nuclear gene data, given their extraordinary advantages in this respect, as discussed above. With the present report, we wish to bring the tremendous potential of these PPR gene loci as phylogenetic tools to the attention of plant systematists and to ameliorate the pessimistic view that ‘identifying phylogenetically informative LCN markers remains a time-consuming endeavor’ (Steele et al., 2008).

There are two final issues that we consider to be worth mentioning from a practical point of view. The first concerns the selection of loci from among these 127 loci for a specific project. Variation level (Fig. 1) and locus size (i.e. sequence length; Table 1) are two informative factors that one can use as guidance to select appropriate loci. However, we should caution that variation level might be lineage specific: the locus with the most rapid rate of evolution in Arabidopsis does not necessarily evolve most rapidly in another group. In this sense, locus size may be a more consistent parameter to guide locus selection. The second issue concerns primer design. While universal primers that can be used to amplify a locus across a broad spectrum of organisms (e.g. all angiosperms) are ideal choices, it is more and more widely recognized that such universal primers may not exist for most nuclear loci (Sang, 2002; Steele et al., 2008). For loci that have such a rapid rate of evolution as the PPR genes, primer design in a lineage-specific fashion is probably more fruitful than searching for universal primers. With the rapid development of whole genome sequence and EST databases (e.g. the National Center for Biotechnology Information (NCBI) plant genome project database and the TIGR Plant Transcript Assemblies database) and bioinformatics tools (e.g. BLAST), it has become much easier to design lineage-specific primers. The general idea is to use these public databases to search for orthologous sequences of a selected locus from several other plant species, especially those most closely related to the study group. The alignment of these sequences can then provide a basis for the identification of conserved motifs and the design of working primers. As a matter of fact, we have employed this approach and designed Lamiales-specific primers for five more or less arbitrarily selected loci from Table 1. Using these primers we have successfully amplified the targeted loci as single bands in the family Verbenaceae, a typical non-model-system group that has been poorly studied to date. While the details of these empirical data and phylogenetic results will be published elsewhere, we are assured that these loci are quite easy to use in practice.
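The step of locating conserved motifs in an alignment of orthologous sequences can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used to design the Lamiales-specific primers: it only reports gap-free windows that are identical in every sequence, the window length is a free parameter, the toy sequences are invented, and real primer design would also weigh melting temperature, degeneracy and 3′-end stability.

```python
def conserved_windows(aligned_seqs, length=20):
    """Return 0-based start positions of windows of `length` columns in which
    every aligned sequence carries the same unambiguous base (A, C, G or T)."""
    seqs = [s.upper() for s in aligned_seqs]
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    invariant = [
        len({s[i] for s in seqs}) == 1 and seqs[0][i] in "ACGT"
        for i in range(n_cols)
    ]
    starts = []
    run = 0  # length of the current stretch of invariant columns
    for i, ok in enumerate(invariant):
        run = run + 1 if ok else 0
        if run >= length:
            starts.append(i - length + 1)
    return starts


# Invented toy alignment; a short window is used so the result is easy to follow.
toy_alignment = ["ATGGCCAAGT", "ATGGCCTAGT", "ATGGCCAAGA"]
starts = conserved_windows(toy_alignment, length=4)
print(starts)
```

Here only the first six columns are invariant across all three sequences, so 4-column windows can start at positions 0, 1 and 2.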


Acknowledgements

The authors are grateful to Bruce Baldwin and an anonymous reviewer for comments on the manuscript. This research was supported by a Graduate Fellowship in Plant Molecular Systematics from the University of Washington Department of Biology, an NSF Doctoral Dissertation Improvement Grant (DDIG) (DEB-0710026) to RGO for the first author's dissertation research, and an NSF Grant (DEB-0542493) to RGO.