The pentatricopeptide repeat (PPR) gene family, a tremendous resource for plant phylogenetic studies

Authors


Author for correspondence:
Yao-Wu Yuan
Tel:+1 206 616 7156
Fax:+1 206 685 1728
Email: colreeze@u.washington.edu

Summary

  • Despite the paramount importance of nuclear gene data in plant phylogenetics, the search for candidate loci is believed to be challenging and time-consuming. Here we report that the pentatricopeptide repeat (PPR) gene family, containing hundreds of members in plant genomes, holds tremendous potential as nuclear gene markers.
  • We compiled a list of 127 PPR loci that are all intronless and have a single orthologue in both rice (Oryza sativa) and Arabidopsis thaliana. The uncorrected p-distances were calculated for these loci between two Arabidopsis species and among three Poaceae genera. We also selected 13 loci to evaluate their phylogenetic utility in resolving relationships among six Poaceae genera and nine diploid Oryza species.
  • PPR genes have a rapid rate of evolution and can be best used at intergeneric and interspecific levels. Despite substantial amounts of missing data, almost all individual data sets from the 13 loci generate well-resolved gene trees.
  • With the unique combination of three characteristics (having a large number of loci with established orthology assessment, being intronless, and being rapidly evolving), the PPR genes have many advantages as phylogenetic markers (e.g. straightforward alignment, minimal effort in generating sequence data, and versatile utilities). We anticipate that these loci will play an important role in plant phylogenetics.

Introduction

The paramount importance of single- or low-copy nuclear gene sequence data in plant phylogenetic studies has been elaborated extensively in reviews (Sang, 2002; Small et al., 2004; Hughes et al., 2006) and has resonated frequently in empirical studies (e.g. Alvarez et al., 2008; Yuan & Olmstead, 2008; Steele et al., 2008) and commentaries (Mort & Crawford, 2004; Crawford & Mort, 2004). However, nuclear loci that can be routinely employed for most angiosperm groups are scarce and the search for candidate nuclear loci is still a challenging and continuing endeavour (Mort & Crawford, 2004). Earlier studies (Strand et al., 1997; Bailey & Doyle, 1999; Olsen & Schaal, 1999; Small & Wendel, 2000; Tank & Sang, 2001; Howarth & Baum, 2002) mainly took the ‘low-copy nuclear gene approach’ (Hughes et al., 2006) by selecting a well-characterized gene or small gene family and testing the phylogenetic utility of these selected loci in a particular study group. This approach has often proved effective, but it is restricted to a single locus or a small number of loci, which are insufficient to resolve many plant phylogenetic problems, especially at lower taxonomic levels.

With the rapid development of complete genome sequence and expressed sequence tag (EST) databases emerges the conserved orthologue set (COS) approach (Fulton et al., 2002; Wu et al., 2006; Padolina, 2006; Chapman et al., 2007; Alvarez et al., 2008). By comparisons of EST and/or complete genome sequences between model organisms, COS markers can be identified and used to develop sets of primers that amplify putative orthologue sequences across the taxa of interest for phylogenetic investigations. As whole genome sequence and EST databases continue growing on a daily basis, the COS approach holds great promise in screening a large number of nuclear loci for phylogenetic studies. However, there is one major problem with the COS approach. Given the vast number of putative COS markers this approach often produces (e.g. Wu et al. (2006) found 2869 single-copy orthologues shared by euasterids) and little prior knowledge about these loci, it often requires labour-intensive preliminary work to screen loci for their appropriate phylogenetic utility in a specific study group. For instance, after examining 141 nuclear primer combinations designed from such COS markers (Padolina, 2006), Steele et al. (2008) found only three phylogenetically informative loci for resolving interspecies relationships in the genus Psiguria (Cucurbitaceae) and two loci for resolving intergeneric relationships in the family Geraniaceae. In the end these authors concluded, ‘In any case, identifying phylogenetically informative LCN [low copy nuclear] markers remains a time-consuming endeavor ...’.

With the hope of identifying numerous nuclear loci for general plant phylogenetic investigations that require little preliminary work to use, we took an integrative approach that combines the advantages of both approaches mentioned above and avoids the disadvantages of each. The general idea is to identify a large number of putative orthologous loci that are well characterized and information-rich. The availability of such online databases as POGs/PlantRBP (Walker et al., 2007; http://plantrbp.uoregon.edu/) makes this strategy straightforward. The POGs/PlantRBP assigns proteins (with corresponding gene loci) in the rice (Oryza sativa) and Arabidopsis thaliana proteomes to putative orthologous groups (POGs) via a ‘mutual-best-hits’ strategy (Walker et al., 2007; see also http://plantrbp.uoregon.edu/about-pogs.php for a schematic illustration of this strategy). Among the assigned proteins, predicted RNA-binding proteins (RBPs) are particularly well annotated. By mining this database, we found that the enormous pentatricopeptide repeat (PPR) gene family, coding for RNA-binding proteins, may have tremendous potential in plant phylogenetic applications.

The PPR gene family contains c. 450 members in A. thaliana, 477 in O. sativa, and 103 in Physcomitrella patens, whereas there are only a handful of loci in the genomes of green algae and nonplant eukaryotic organisms (Lurin et al., 2004; O'Toole et al., 2008), and virtually none in prokaryotic genomes (Lurin et al., 2004; Pusnik et al., 2007). The A. thaliana PPR genes are more or less evenly distributed throughout the 10 chromosome arms (Lurin et al., 2004). An interesting observation is that c. 80% of the PPR genes in both A. thaliana and rice are intronless (Lurin et al., 2004; O'Toole et al., 2008). PPR proteins are characterized by 2–26 tandem repeats of a highly degenerate 35 amino acid motif, and divided into two subfamilies and four subclasses based on their conserved C-terminal domain structure (Lurin et al., 2004). These proteins are targeted to organelles (i.e. mitochondria and plastids) and involved in many post-transcriptional processes undergone by organellar transcripts, including splicing, editing, processing, and translation (reviewed in Delannoy et al., 2007). The presence of one of the four subclasses (i.e. the DYW subclass) is strictly correlated with the existence of RNA editing in land plants (Salone et al., 2007; Rudinger et al., 2008). Together with other evidence, this led Salone et al. (2007) to propose that the DYW domain found exclusively in PPR proteins is the catalytic domain conducting the enigmatic organelle RNA editing process.

It might be counterintuitive at first glance that such a huge gene family with most members being intronless can have any phylogenetic utility. (1) Won't the massive number of gene copies make orthology assessment extremely difficult? (2) Can these protein-coding sequences provide sufficient variation to address phylogenetic problems at lower taxonomic levels? A phylogenetic analysis including all of the rice and A. thaliana PPR genes revealed that an extraordinarily large proportion of these genes form well-supported pairs that are probably A. thaliana and rice orthologues (O'Toole et al., 2008), which suggested that most of the PPR gene loci predate the divergence of eudicots and monocots with few duplications since then. This is consistent with the finding from the POGs/PlantRBP database (Walker et al., 2007) that the majority of PPR genes have a single orthologue in both A. thaliana and rice genomes. These results suggest that evaluating the orthology of most PPR genes should not be difficult. In addition, considering that PPR proteins probably function as RNA-binding molecules in a sequence-specific manner (Delannoy et al., 2007), they may have a rapid rate of evolution to adjust to changes in the targeted RNA species. This means that, within a putative orthologue group, sequences could be divergent enough to provide variation that could be used in resolving relationships at lower taxonomic levels, despite the lack of rapidly evolving introns. In fact, the absence of introns can be a great advantage for many phylogenetic applications such as resolving intergeneric relationships (see the Discussion). It is these three appealing characteristics – a huge number of loci, but an easily assessed orthology; an absence of introns; the likelihood of a rapid rate of evolution – that stimulate us to explore the potential of PPR genes in plant phylogenetic studies.

In this paper we have aimed: (1) to compile a comprehensive list of PPR genes that are intronless and have a single orthologue in both A. thaliana and rice; (2) to compute the pairwise distance at these compiled loci between two Arabidopsis species (A. thaliana and Arabidopsis lyrata) and among three Poaceae genera (Oryza, Zea and Sorghum) in order to obtain a cursory estimation of variation at each locus; and (3) to select a small proportion of these loci to evaluate their utility in resolving relationships among six Poaceae genera and nine diploid Oryza species. There is a large amount of genomic sequence data for 12 Oryza species (not including cultivated rice, O. sativa) in GenBank as trace archives from the Oryza Map Alignment Project (OMAP; http://www.omap.org/; Wing et al., 2005). Eight of them are diploid species, representing all six diploid Oryza genome types (Nayar, 1973; Aggarwal et al., 1997; Ge et al., 1999). There are also quite abundant ESTs of three other Poaceae genera, Triticum (Triticum aestivum), Hordeum (Hordeum vulgare) and Saccharum (Saccharum officinarum), available from The Institute for Genomic Research (TIGR) Plant Transcript Assemblies database (http://plantta.tigr.org/; Childs et al., 2007). These publicly available genomic data provide a good opportunity to examine the phylogenetic utility of PPR gene loci.

Materials and Methods

Locus screening, sequence retrieval and annotation

We retrieved all the Arabidopsis thaliana (L.) Heynh. PPR gene family members with their putative orthologues in rice (Oryza sativa L.) from the POGs/PlantRBP database (Walker et al., 2007; http://plantrbp.uoregon.edu/) by searching ‘At*’ by gene AND ‘PPR’ by domain. The A. thaliana PPR genes are assigned to 418 POGs, most of which contain a single locus in both rice and A. thaliana. The results were downloaded to an Excel file (see the Supporting Information, Table S1). We then screened loci for phylogenetic utility in a stepwise manner.

  (1) If the POG contains a single locus in both rice and A. thaliana, continue to (2); otherwise, abandon.
  (2) If the gene pair in the remaining POGs is marked as ‘well supported’ in POGs/PlantRBP, continue to (3); otherwise, abandon. When building the POGs/PlantRBP database, Walker et al. (2007) took a phylogenetic approach to evaluate POGs assigned by the ‘mutual-best-hit’ method. The top BLAST hits with > 50% coverage (either hit/query or query/hit) for each protein were retrieved to produce a multiple alignment and corresponding guide tree. Only those POG assignments supported by the tree topology were marked as ‘well supported’.
  (3) If the A. thaliana gene in the remaining POGs is intronless, continue to (4); otherwise, abandon. This was done by comparing the A. thaliana locus ID against the ‘Arabidopsis intronless gene list’ from Jain et al. (2008).
  (4) Follow the POGs/PlantRBP link for each rice locus to TIGR (http://www.tigr.org/) and for each A. thaliana locus to The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org/). These linked pages have comprehensive gene descriptions for the corresponding loci. If the rice gene in the remaining POGs is intronless, continue to (5) with sequence retrieval and annotation; otherwise, abandon.
  (5) For each remaining POG, download the rice coding sequence (CDS) from TIGR and the A. thaliana CDS from TAIR. In many cases the CDS is the same as the cDNA as well as the genomic DNA sequence, while in other cases the genomic DNA sequence is longer than the CDS by regulatory regions at the 5′ and/or 3′ ends. BLAST the A. thaliana sequence against Arabidopsis lyrata (L.) O’Kane & Al-Shehbaz trace archives using the megablast program and default parameters (http://blast.ncbi.nlm.nih.gov/). Sequences with an E-value < e−100 were downloaded as SCF files, and were edited and assembled using Sequencher 4.7 (Gene Codes Corporation, Ann Arbor, MI, USA). The positions of start and stop codons were determined by comparison with the A. thaliana sequence. Similarly, sequences of Sorghum bicolor (L.) Moench and Zea mays L. were retrieved by blasting rice sequences against the S. bicolor and Z. mays trace archives, but using the discontiguous megablast program (http://blast.ncbi.nlm.nih.gov/) as the divergence between rice and S. bicolor or Z. mays is much greater than that between the two Arabidopsis species. Downloaded sequences were subsequently annotated using the Sequencher 4.7 software. In a small proportion of the retained POG loci, one or more of the three annotated sequences (A. lyrata, S. bicolor and Z. mays) had > 5% polymorphic sites (see Table S1). These loci were also abandoned as the last step to minimize the possibility that our data for each selected locus include paralogous sequences from recent gene duplications. A set of 127 loci was finally retained.
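The stepwise screen described above can be summarised as a chain of filters. The following is only a minimal sketch of that logic, not the procedure as actually run (the screening was done by hand against POGs/PlantRBP, TIGR and TAIR). The `Pog` record, its field names and the `passes_screen` helper are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Pog:
    """One putative orthologous group; all field names are hypothetical."""
    rice_loci: List[str]
    arabidopsis_loci: List[str]
    well_supported: bool             # flag as marked in POGs/PlantRBP
    max_polymorphic_fraction: float  # highest value among A. lyrata, S. bicolor, Z. mays


def passes_screen(pog: Pog,
                  intronless_arabidopsis: Set[str],
                  intronless_rice: Set[str],
                  max_polymorphism: float = 0.05) -> bool:
    """Apply steps (1)-(5) in order; a POG is abandoned at the first failed step."""
    # Step 1: a single locus in both rice and A. thaliana.
    if len(pog.rice_loci) != 1 or len(pog.arabidopsis_loci) != 1:
        return False
    # Step 2: orthology marked as well supported.
    if not pog.well_supported:
        return False
    # Step 3: the A. thaliana gene is intronless.
    if pog.arabidopsis_loci[0] not in intronless_arabidopsis:
        return False
    # Step 4: the rice gene is intronless.
    if pog.rice_loci[0] not in intronless_rice:
        return False
    # Step 5: abandon loci where any annotated sequence has > 5% polymorphic sites.
    return pog.max_polymorphic_fraction <= max_polymorphism
```

Ordering the checks from cheapest (database flags) to most laborious (sequence retrieval and annotation) mirrors the order in which the steps are listed, so that sequences only need to be downloaded for POGs that survive the earlier filters.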

Estimation of variation for each selected locus

For each of the 127 loci, sequence alignments between the two Arabidopsis species (A. thaliana and A. lyrata) and among the three Poaceae genera (Oryza, Sorghum and Zea) were performed manually using Se-Al version 2.0a11 (Rambaut, 1996). The uncorrected p-distance was then calculated for the two sets of aligned sequences using the ‘Pairwise Base Differences’ function implemented in PAUP* version 4.0b10 (Swofford, 2002), as a cursory estimation of variation level. To compare the variation level of these selected PPR gene loci with that of loci extensively used in plant phylogenetic studies, we also retrieved, annotated, and aligned sequences of rbcL, ndhF, matK, trnL-F and ITS, for the two Arabidopsis species and the three Poaceae genera, following the procedures described above. The uncorrected p-distance was also subsequently calculated for the five extensively used loci.
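For readers unfamiliar with the measure, the uncorrected p-distance is simply the proportion of compared sites at which two aligned sequences differ. The sketch below illustrates the calculation; it is not the PAUP* implementation. Skipping any column in which either sequence has a gap or an ambiguous base is an assumption made here for simplicity, and the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned nucleotide sequences.

    Assumption: columns where either sequence has a gap, missing data or an
    ambiguity code are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in unambiguous and b in unambiguous:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# One difference over eight comparable sites.
print(p_distance("ACGTACGT", "ACGTACGA"))  # 0.125
```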

Evaluation of phylogenetic utility

We blasted the rice sequence of each of the 127 loci against those of the eight diploid Oryza species (Oryza australiensis Domin, Oryza brachyantha A.Chev. & Roehrich, Oryza glaberrima Steud., Oryza granulata Nees, Oryza nivara S.D.Sharma & Shastry, Oryza officinalis Wall., Oryza punctata Kotschy ex Steud. and Oryza rufipogon Griff.) that have trace archive sequences in GenBank (http://blast.ncbi.nlm.nih.gov/). Thirteen of the 127 loci, which have partial sequences available for six or more of the eight Oryza species, were selected for further analyses (see Table S2). Then we blasted the rice sequence of each of the 13 loci against the ESTs of Triticum aestivum L., Hordeum vulgare L. and Saccharum officinarum L. in the TIGR Plant Transcript Assemblies database (http://blast.jcvi.org/euk-blast/plantta_blast.cgi). Partial genomic sequences of the eight Oryza species and the ESTs of the additional three Poaceae genera, whenever available, were downloaded, annotated, and aligned with rice, S. bicolor and Z. mays sequences for each locus, following the procedures described above.

Parsimony analysis was performed on each data set of the 13 loci separately, and both parsimony and Bayesian analyses were performed on the combined data set of all 13 loci. Parsimony analyses were conducted using PAUP* version 4.0b10 (Swofford, 2002). Heuristic searches were performed with 200 random stepwise addition replicates and tree-bisection-reconnection (TBR) branch swapping with MULTREES on. Clade support was determined by bootstrap analyses (Felsenstein, 1985) of 500 replicates, each with 10 random stepwise addition replicates and TBR branch swapping with the MULTREES option in effect. Bayesian analysis was conducted using MrBayes version 3.1.2 (Ronquist & Huelsenbeck, 2003). Modeltest 3.7 (Posada & Crandall, 1998) was employed to determine the sequence evolution model that best fits the data. The GTR+G model was selected by the Akaike information criterion (AIC; Akaike, 1974) for the combined 13 loci data set. We carried out two independent runs of 1 000 000 generations using the default priors and four Markov chains (one cold and three heated chains), sampling one tree every 100 generations. The first 2500 trees were discarded as burn-in.
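As a quick check on the Bayesian sampling scheme, the arithmetic below relates the number of generations, the sampling interval and the burn-in. It assumes that the 2500 discarded trees apply to each run separately and ignores the starting tree; neither detail is stated in the text, and the helper name `burn_in_summary` is hypothetical.

```python
def burn_in_summary(generations: int, sample_every: int, burn_in_trees: int):
    """Return (trees sampled per run, trees retained per run, burn-in fraction).

    Assumptions: burn-in is applied per run, and the tree at generation 0
    is not counted.
    """
    sampled = generations // sample_every
    retained = sampled - burn_in_trees
    return sampled, retained, burn_in_trees / sampled


# 1 000 000 generations sampled every 100 generations, first 2500 trees discarded.
print(burn_in_summary(1_000_000, 100, 2500))  # (10000, 7500, 0.25)
```

Under these assumptions the burn-in corresponds to the first 25% of each run.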

Results

Selected loci and their variation level

A set of 127 PPR gene loci was finally obtained via the screening process for potential phylogenetic utility. They all are intronless and have a single orthologue in both rice and A. thaliana as well as the other three annotated taxa (A. lyrata, S. bicolor and Z. mays). Table 1 includes a comprehensive list of these loci with the uncorrected p-distance between A. thaliana and A. lyrata, and the distances among Oryza, Zea and Sorghum. Although these loci consist entirely of protein coding regions, they have a relatively rapid rate of evolution. For example, the pairwise distance between the two Arabidopsis species at the PPR loci ranged from 0.0244 (At1G10270) to 0.0985 (At3G25060), with an average of 0.0512, which is 6.7, 4.1, 2.7, 1.4 and 0.9 times the distance at rbcL, ndhF, matK, trnL-F and ITS, respectively (Table 1). Thirty-seven loci had a larger distance than ITS. Similarly, the pairwise distance between rice (O. sativa) and maize (Z. mays) ranged from 0.1262 (At5G18390) to 0.2789 (At1G10330), with an average of 0.1890, which is 4.7, 3.0, 2.3 and 2.4 times the distance at rbcL, ndhF, matK and trnL-F, respectively (Table 1). The ITS sequences could not be confidently aligned in the three Poaceae genera because of extensive length variation, and therefore no pairwise distances were available for the ITS region among these genera. Fig. 1 is a graphic view of the variation level of these 127 PPR gene loci as well as the five additional loci that have been used extensively in plant phylogenetics. The sizes of these 127 loci in A. thaliana are also listed in Table 1. They range from 909 to 3339 bp, with an average of 1977 bp.
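The fold differences quoted above follow directly from dividing the mean PPR distance by the distance at each reference locus. The short sketch below reproduces that arithmetic from the averages given in the text and the reference values in Table 1; the variable and function names are hypothetical.

```python
# Uncorrected p-distances at the reference loci (from Table 1).
arabidopsis_reference = {"rbcL": 0.0076, "ndhF": 0.0125, "matK": 0.0191,
                         "trnL-F": 0.0367, "ITS": 0.055}
rice_maize_reference = {"rbcL": 0.0399, "ndhF": 0.064, "matK": 0.0814,
                        "trnL-F": 0.0779}  # ITS could not be aligned

# Mean distances across the 127 PPR loci, as reported in the text.
mean_ppr_arabidopsis = 0.0512
mean_ppr_rice_maize = 0.1890


def fold_differences(mean_ppr, reference):
    """Ratio of the mean PPR distance to each reference locus, to one decimal."""
    return {locus: round(mean_ppr / d, 1) for locus, d in reference.items()}


print(fold_differences(mean_ppr_arabidopsis, arabidopsis_reference))
# {'rbcL': 6.7, 'ndhF': 4.1, 'matK': 2.7, 'trnL-F': 1.4, 'ITS': 0.9}
print(fold_differences(mean_ppr_rice_maize, rice_maize_reference))
# {'rbcL': 4.7, 'ndhF': 3.0, 'matK': 2.3, 'trnL-F': 2.4}
```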

Table 1.  List of the 127 pentatricopeptide repeat loci as well as rbcL, ndhF, matK, trnL-F and ITS, with their corresponding uncorrected p-distances and sequence lengths in Arabidopsis thaliana
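| No. | Locus | Arabidopsis thaliana vs Arabidopsis lyrata | Oryza sativa vs Zea mays | Oryza sativa vs Sorghum bicolor | Sorghum bicolor vs Zea mays | Length in A. thaliana (bp) |
|---|---|---|---|---|---|---|
| 1 | rbcL | 0.0076 | 0.0399 | 0.0406 | 0.0077 | 1440 |
| 2 | ndhF | 0.0125 | 0.064 | 0.0649 | 0.0099 | 2241 |
| 3 | matK | 0.0191 | 0.0814 | 0.0794 | 0.0169 | 1515 |
| 4 | trnL-F | 0.0367 | 0.0779 | 0.0792 | 0.0114 | 1384 |
| 5 | ITS | 0.055 | NA | NA | NA | 639 |
| 6 | At1G02420 | 0.0373 | 0.1906 | 0.1698 | 0.0711 | 1476 |
| 7 | At1G03510 | 0.0543 | 0.181 | 0.1787 | 0.0538 | 1290 |
| 8 | At1G03560 | 0.0459 | 0.1565 | 0.1563 | 0.0577 | 1983 |
| 9 | At1G05600 | 0.0397 | 0.2141 | 0.2032 | 0.0749 | 1515 |
| 10 | At1G05750 | 0.0472 | 0.1646 | 0.1592 | 0.0604 | 1503 |
| 11 | At1G09680 | 0.0582 | 0.2136 | 0.2016 | 0.0679 | 1824 |
| 12 | At1G10270 | 0.0244 | 0.1718 | 0.1528 | 0.043 | 2742 |
| 13 | At1G10330 | 0.0605 | 0.2789 | 0.2493 | 0.0811 | 1404 |
| 14 | At1G11290 | 0.0547 | 0.1648 | 0.1513 | 0.0747 | 2430 |
| 15 | At1G13040 | 0.0418 | 0.2109 | 0.2004 | 0.0618 | 1554 |
| 16 | At1G15510 | 0.0346 | 0.1793 | 0.1719 | 0.046 | 2601 |
| 17 | At1G19290 | 0.0531 | 0.2322 | 0.2227 | 0.06 | 2715 |
| 18 | At1G20230 | 0.0442 | 0.1808 | 0.1727 | 0.0611 | 2283 |
| 19 | At1G20300 | 0.0471 | 0.2058 | 0.1957 | 0.0728 | 1614 |
| 20 | At1G22960 | 0.0456 | 0.2045 | 0.1932 | 0.0593 | 2157 |
| 21 | At1G25360 | 0.0506 | 0.1594 | 0.1495 | 0.0497 | 2373 |
| 22 | At1G26500 | 0.0645 | 0.2013 | 0.1869 | 0.0521 | 1518 |
| 23 | At1G28690 | 0.0518 | 0.2078 | 0.2051 | 0.0883 | 1563 |
| 24 | At1G31430 | 0.0671 | 0.1837 | 0.1711 | 0.0547 | 1713 |
| 25 | At1G33350 | 0.0544 | 0.1868 | 0.174 | 0.0576 | 1617 |
| 26 | At1G53330 | 0.0593 | 0.2405 | 0.2346 | 0.0865 | 1416 |
| 27 | At1G59720 | 0.057 | 0.1843 | 0.1915 | 0.084 | 1860 |
| 28 | At1G64310 | 0.0464 | 0.2025 | 0.1983 | 0.0701 | 1659 |
| 29 | At1G66345 | 0.0981 | 0.235 | 0.233 | 0.0946 | 1617 |
| 30 | At1G68930 | 0.0535 | 0.1807 | 0.174 | 0.0581 | 2232 |
| 31 | At1G71060 | 0.062 | 0.2133 | 0.2179 | 0.0572 | 1533 |
| 32 | At1G71210 | 0.047 | 0.2366 | 0.2004 | 0.107 | 2595 |
| 33 | At1G73400 | 0.0516 | 0.1651 | 0.1591 | 0.0571 | 1401 |
| 34 | At1G74600 | 0.0598 | 0.2071 | 0.1924 | 0.0497 | 2688 |
| 35 | At1G77010 | 0.0508 | 0.2316 | 0.2214 | 0.0771 | 2028 |
| 36 | At1G77360 | 0.0436 | 0.146 | 0.1281 | 0.0546 | 1446 |
| 37 | At1G79490 | 0.038 | 0.1438 | 0.1332 | 0.0497 | 2512 |
| 38 | At1G79540 | 0.0354 | 0.2326 | 0.2148 | 0.0668 | 2343 |
| 39 | At1G80150 | 0.0419 | 0.1848 | 0.1772 | 0.067 | 1194 |
| 40 | At1G80550 | 0.0472 | 0.1511 | 0.1397 | 0.0534 | 1347 |
| 41 | At2G01860 | 0.0592 | 0.2156 | 0.2057 | 0.0623 | 1461 |
| 42 | At2G02980 | 0.0629 | 0.1614 | 0.1471 | 0.0714 | 1812 |
| 43 | At2G03380 | 0.0618 | 0.2146 | 0.203 | 0.0603 | 2070 |
| 44 | At2G03880 | 0.0417 | 0.1845 | 0.168 | 0.0827 | 1149 |
| 45 | At2G13600 | 0.0516 | 0.1528 | 0.1523 | 0.0493 | 2094 |
| 46 | At2G15630 | 0.0706 | 0.1778 | 0.18 | 0.0476 | 1869 |
| 47 | At2G15690 | 0.0607 | 0.1365 | 0.1333 | 0.0373 | 1470 |
| 48 | At2G15980 | 0.0688 | 0.2146 | 0.1999 | 0.0669 | 1497 |
| 49 | At2G16880 | 0.0412 | 0.1855 | 0.1758 | 0.0634 | 2232 |
| 50 | At2G17670 | 0.0526 | 0.2136 | 0.2081 | 0.087 | 1392 |
| 51 | At2G18940 | 0.0501 | 0.1837 | 0.1474 | 0.103 | 2469 |
| 52 | At2G20540 | 0.053 | 0.1891 | 0.1875 | 0.0676 | 1605 |
| 53 | At2G22070 | 0.0466 | 0.1917 | 0.1913 | 0.0715 | 2361 |
| 54 | At2G22410 | 0.0594 | 0.2054 | 0.1978 | 0.0581 | 2046 |
| 55 | At2G27610 | 0.047 | 0.2199 | 0.2089 | 0.0691 | 2607 |
| 56 | At2G32630 | 0.0475 | 0.2101 | 0.1921 | 0.0936 | 1875 |
| 57 | At2G33680 | 0.043 | 0.2055 | 0.2 | 0.0823 | 2184 |
| 58 | At2G33760 | 0.0576 | 0.1801 | 0.165 | 0.0683 | 1752 |
| 59 | At2G35030 | 0.0467 | 0.1892 | 0.1735 | 0.0591 | 1894 |
| 60 | At2G36240 | 0.0469 | 0.2062 | 0.1943 | 0.0674 | 1494 |
| 61 | At2G36730 | 0.052 | 0.2049 | 0.1962 | 0.0886 | 1506 |
| 62 | At2G37230 | 0.0463 | 0.1557 | 0.1553 | 0.0616 | 2274 |
| 63 | At2G37310 | 0.0476 | 0.1952 | 0.1887 | 0.0705 | 1974 |
| 64 | At2G41080 | 0.0424 | 0.1606 | 0.1564 | 0.0692 | 1698 |
| 65 | At2G42920 | 0.0602 | 0.1581 | 0.1409 | 0.0549 | 1680 |
| 66 | At3G04130 | 0.0481 | 0.2359 | 0.2205 | 0.0722 | 1527 |
| 67 | At3G04750 | 0.0481 | 0.1991 | 0.1898 | 0.0648 | 1974 |
| 68 | At3G05340 | 0.0379 | 0.2229 | 0.1899 | 0.1014 | 1977 |
| 69 | At3G09040 | 0.04 | 0.1989 | 0.1911 | 0.0587 | 3087 |
| 70 | At3G09060 | 0.046 | 0.2057 | 0.2063 | 0.0553 | 2064 |
| 71 | At3G11460 | 0.0429 | 0.1976 | 0.1918 | 0.068 | 1872 |
| 72 | At3G14730 | 0.0572 | 0.2233 | 0.2151 | 0.0591 | 1962 |
| 73 | At3G15130 | 0.0411 | 0.1734 | 0.1652 | 0.0604 | 2070 |
| 74 | At3G16890 | 0.0592 | 0.2116 | 0.2056 | 0.0791 | 1980 |
| 75 | At3G18020 | 0.0543 | 0.2281 | 0.2215 | 0.0732 | 2067 |
| 76 | At3G21470 | 0.0549 | 0.2247 | 0.2154 | 0.0658 | 1329 |
| 77 | At3G22150 | 0.0422 | 0.1735 | 0.1726 | 0.0501 | 2463 |
| 78 | At3G22670 | 0.0558 | 0.2229 | 0.2236 | 0.0457 | 1689 |
| 79 | At3G23020 | 0.0499 | 0.2018 | 0.1965 | 0.0538 | 2526 |
| 80 | At3G25060 | 0.0985 | 0.2019 | 0.1944 | 0.0778 | 1782 |
| 81 | At3G25970 | 0.0772 | 0.189 | 0.1753 | 0.0835 | 1941 |
| 82 | At3G26540 | 0.0442 | 0.1896 | 0.1746 | 0.0727 | 2103 |
| 83 | At3G29230 | 0.0577 | 0.1667 | 0.1585 | 0.0513 | 1803 |
| 84 | At3G46790 | 0.038 | 0.1744 | 0.1649 | 0.0452 | 1974 |
| 85 | At3G47530 | 0.0401 | 0.2007 | 0.1776 | 0.0691 | 1776 |
| 86 | At3G47840 | 0.0523 | 0.1899 | 0.1842 | 0.061 | 2121 |
| 87 | At3G48810 | 0.0398 | 0.2205 | 0.2182 | 0.0713 | 1980 |
| 88 | At3G49240 | 0.0487 | 0.1508 | 0.1378 | 0.0607 | 1890 |
| 89 | At3G53360 | 0.0458 | 0.1896 | 0.1817 | 0.0626 | 2307 |
| 90 | At3G57430 | 0.0616 | 0.1931 | 0.1746 | 0.0512 | 2673 |
| 91 | At4G01570 | 0.0548 | 0.1717 | 0.1627 | 0.0538 | 2418 |
| 92 | At4G14170 | 0.0712 | 0.1793 | 0.1665 | 0.07 | 1377 |
| 93 | At4G20740 | 0.0517 | 0.1893 | 0.1824 | 0.0643 | 2184 |
| 94 | At4G20770 | 0.0778 | 0.1689 | 0.1615 | 0.0582 | 2223 |
| 95 | At4G21300 | 0.0433 | 0.1876 | 0.1848 | 0.0584 | 2574 |
| 96 | At4G30700 | 0.0319 | 0.1809 | 0.1792 | 0.0626 | 2379 |
| 97 | At4G30825 | 0.0588 | 0.19 | 0.1909 | 0.0619 | 2715 |
| 98 | At4G31850 | 0.0399 | 0.1888 | 0.1774 | 0.0484 | 3339 |
| 99 | At4G32430 | 0.0401 | 0.1916 | 0.1822 | 0.0639 | 2292 |
| 100 | At4G33990 | 0.051 | 0.1702 | 0.1635 | 0.0495 | 2472 |
| 101 | At4G35130 | 0.0562 | 0.1807 | 0.1743 | 0.0649 | 2415 |
| 102 | At4G37170 | 0.0612 | 0.1704 | 0.1657 | 0.0561 | 2076 |
| 103 | At4G37380 | 0.0479 | 0.2073 | 0.1937 | 0.0679 | 1899 |
| 104 | At4G38150 | 0.0497 | 0.1922 | 0.1689 | 0.0612 | 909 |
| 105 | At4G39530 | 0.0423 | 0.2094 | 0.2036 | 0.0582 | 2505 |
| 106 | At5G01110 | 0.0558 | 0.176 | 0.1549 | 0.061 | 2190 |
| 107 | At5G02860 | 0.0378 | 0.1657 | 0.1625 | 0.0546 | 2460 |
| 108 | At5G03800 | 0.0505 | 0.2013 | 0.1917 | 0.0761 | 2691 |
| 109 | At5G04780 | 0.0478 | 0.1831 | 0.1762 | 0.0554 | 1884 |
| 110 | At5G06400 | 0.0492 | 0.2633 | 0.2559 | 0.0595 | 3093 |
| 111 | At5G08490 | 0.0349 | 0.2081 | 0.1997 | 0.0654 | 2550 |
| 112 | At5G09950 | 0.0368 | 0.1785 | 0.1757 | 0.0491 | 2988 |
| 113 | At5G12100 | 0.0522 | 0.222 | 0.2065 | 0.0577 | 2451 |
| 114 | At5G13230 | 0.0494 | 0.1953 | 0.1853 | 0.0632 | 2469 |
| 115 | At5G13770 | 0.0497 | 0.2024 | 0.1921 | 0.0607 | 1830 |
| 116 | At5G15300 | 0.0431 | 0.1969 | 0.1808 | 0.0574 | 1647 |
| 117 | At5G16420 | 0.0456 | 0.1859 | 0.1863 | 0.0698 | 1608 |
| 118 | At5G18390 | 0.0507 | 0.1262 | 0.1207 | 0.0483 | 1380 |
| 119 | At5G18475 | 0.0546 | 0.1909 | 0.1873 | 0.0542 | 1521 |
| 120 | At5G37570 | 0.045 | 0.1775 | 0.167 | 0.0686 | 1653 |
| 121 | At5G39350 | 0.0549 | 0.2237 | 0.2069 | 0.0794 | 2034 |
| 122 | At5G39680 | 0.0553 | 0.1761 | 0.1741 | 0.0549 | 2133 |
| 123 | At5G39980 | 0.0403 | 0.15 | 0.1458 | 0.0463 | 2037 |
| 124 | At5G42450 | 0.0696 | 0.1853 | 0.1883 | 0.0741 | 1179 |
| 125 | At5G44230 | 0.0481 | 0.1934 | 0.186 | 0.069 | 1974 |
| 126 | At5G47360 | 0.0642 | 0.234 | 0.2303 | 0.0587 | 1434 |
| 127 | At5G52630 | 0.056 | 0.1595 | 0.1378 | 0.0524 | 1767 |
| 128 | At5G55740 | 0.0547 | 0.2241 | 0.1949 | 0.0809 | 2493 |
| 129 | At5G56310 | 0.0621 | 0.1895 | 0.1799 | 0.0642 | 1593 |
| 130 | At5G59600 | 0.0467 | 0.2015 | 0.1942 | 0.0648 | 1605 |
| 131 | At5G60960 | 0.0473 | 0.1481 | 0.1419 | 0.0604 | 1566 |
| 132 | At5G61400 | 0.0583 | 0.2234 | 0.2175 | 0.0693 | 1965 |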
‘NA’ indicates that the uncorrected p-distance is not available because the ITS sequences cannot be unambiguously aligned between the corresponding pair of taxa.
Figure 1.

Graphic view of the variation levels of the 127 pentatricopeptide repeat (PPR) loci and rbcL, ndhF, matK, trnL-F and ITS, indicated by the uncorrected p-distance. The order of loci follows Table 1. Uncorrected p-distances are shown (a) between Arabidopsis thaliana and Arabidopsis lyrata, (b) between Sorghum bicolor and Zea mays, (c) between Oryza sativa and Z. mays, and (d) between O. sativa and S. bicolor. The arrow in (b), (c) and (d) indicates that the pairwise distance at the ITS region is not available because the ITS sequences of these taxa cannot be unambiguously aligned because of extensive length variation.

Phylogenetic utility

Because of incomplete sequences of the eight diploid Oryza species and S. officinarum, H. vulgare and T. aestivum, there are substantial amounts of missing data at the 13 loci that we used to reconstruct the intergeneric relationships within Poaceae and interspecific relationships within Oryza. Taxa that have no sequences at all at a certain locus were excluded from phylogenetic analysis of that locus and were not taken into account when calculating the percentage of missing data. The loci At4G38150 and At1G11290 have the lowest (11%) and highest (53%) percentages of data missing, respectively (see Fig. 2 and Table S2). All 14 taxa were included in the analyses of the concatenated data, in which 48% of data are missing.
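The percentage of missing data per locus, as described above, can be illustrated with a short sketch. It assumes that missing positions are coded as `?` in the alignment and that a taxon with no sequence at a locus appears as a row consisting entirely of `?`; both conventions, and the function name `percent_missing`, are assumptions made for illustration only.

```python
def percent_missing(alignment: dict, missing: str = "?") -> float:
    """Percentage of missing cells in one locus alignment.

    Taxa with no sequence at all (rows made up only of the missing symbol)
    are excluded before the percentage is computed.
    """
    included = {taxon: seq for taxon, seq in alignment.items()
                if any(char != missing for char in seq)}
    if not included:
        raise ValueError("no taxon has sequence data at this locus")
    total_cells = sum(len(seq) for seq in included.values())
    missing_cells = sum(seq.count(missing) for seq in included.values())
    return 100.0 * missing_cells / total_cells


# Toy alignment: taxon_c has no data and is excluded; 2 of the remaining 8 cells are missing.
toy = {"taxon_a": "ACGT", "taxon_b": "AC??", "taxon_c": "????"}
print(percent_missing(toy))  # 25.0
```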

Figure 2.

Gene trees resulting from parsimony analyses of individual data sets for the 13 loci. All trees are drawn to the same scale. Bootstrap values are shown along the branches when > 50. The ID of each locus is in bold and indicated above the gene tree. The number below the locus name (in parentheses) indicates the percentage of missing data at that locus. (a), (b), (e), (i), (l) and (m) show the single maximum parsimony (MP) tree inferred from the corresponding locus. (c) and (k) show one of the six MP trees; (d) shows one of the nine MP trees; (f) shows one of the three MP trees; (g) and (j) show one of the two MP trees; (h) shows one of the 10 MP trees.

Despite the large amount of missing data, almost every individual locus generated a well-resolved gene tree (Fig. 2). The intergeneric relationships within Poaceae were congruent across all 13 loci, and are consistent with the Grass Phylogeny Working Group subfamily classification (Barker et al., 2001; see subfamily designations in Fig. 3). Within the genus Oryza, the monophyly of A-genome species and the phylogenetic position of E-genome species were very consistent among the 13 loci. However, the relationships among the A-, B- and C-genomes and the relationships among the F-genome, G-genome and all other genome types were incongruent among these loci (see genome type designation in Fig. 3). These results are consistent with a recent study of the relationships of the six Oryza diploid genome types using 142 nuclear genes (Zou et al., 2008). Fig. 3 represents the single most parsimonious tree inferred from the concatenated data for all 13 loci. Intergeneric relationships are the same as shown in gene trees resulting from individual data sets. The Oryza genome type relationships are very similar to those reported in the aforementioned study (Zou et al., 2008), except that the positions of the F- and G-genomes are switched.

Figure 3.

The single maximum parsimony (MP) tree inferred from the concatenated data for all 13 loci, 48% of which are missing data. Bootstrap values (BS) and Bayesian posterior probabilities (PP) supporting the corresponding nodes are shown along the branches (BS/PP). The asterisks indicate BS < 50 or PP < 0.95. Subfamily designations of the family Poaceae and genome type designations of the genus Oryza are represented on the right.

Discussion

The PPR genes have three major characteristics that make them excellent candidates for plant phylogenetic studies. First of all, there are a large number of loci but orthology assessment is straightforward. Each of the 127 loci obtained from the screening process in the present study should have a single orthologue in the vast majority of diploid flowering plants, given the fact that a single orthologue was retained in both rice and A. thaliana after the deep split between monocots and eudicots. To test this intuitive assumption, we randomly drew 10 PPR loci from the list (Table 1), and blasted the A. thaliana nucleotide sequences against two other sequenced genomes, those of Populus trichocarpa (Tuskan et al., 2006; http://www.ncbi.nlm.nih.gov/projects/genome/seq/BlastGen/BlastGen.cgi?pid=10770) and Vitis vinifera (Jaillon et al., 2007; http://www.ncbi.nlm.nih.gov/projects/genome/seq/BlastGen/BlastGen.cgi?pid=12992), using the cross-species megaBLAST program. The A. thaliana sequence hit a single locus in both genomes at nine of the 10 loci and did not produce any significant hit at the other locus (data not shown). We then blasted the amino acid sequence of this exceptional locus against the same databases using the tblastn program and this time it produced a single best hit (E-value < e−100) in both genomes. In addition, the successful retrieval of unique orthologous sequences from S. officinarum, H. vulgare and T. aestivum using the rice sequences of the 13 loci used in our phylogenetic analyses corroborates this assumption.

Secondly, the majority of PPR genes are intronless (Lurin et al., 2004; O'Toole et al., 2008). In fact, the 127 loci listed in Table 1 are all intronless, as this was one of the selection criteria. There are two important practical advantages in choosing intronless loci. (1) Alignment is straightforward. Alignments of noncoding DNA sequences such as introns or intergenic spacers can be problematic because of extensive length variation among all but the most closely related species. This often necessitates the introduction of numerous (sometimes impracticably numerous) large or small gaps (‘indels’) into the alignment. Intronless genes, in contrast, tend to contain few fixed length mutations and are easy to align unless the taxa of interest are too distantly related (e.g. belong to different major clades of angiosperms). (2) Sequencing requires minimal effort. When a nuclear locus is heterozygous, direct sequencing of intron or intergenic spacer regions becomes almost impossible. A simple deletion or insertion that occurred in one allele but not the other(s) will affect all the sequence reads after this point, and cloning is necessary to generate good quality sequences in this situation. For intronless loci, although polymorphism will be observed if there is allelic variation, sequence reads after the polymorphic sites will probably not be affected, because allelic polymorphisms usually do not involve length mutations in protein coding regions. In addition, nuclear gene introns often contain polynucleotide (e.g. poly-A) and/or microsatellite regions that are extremely difficult to sequence through, whereas intronless genes tend not to contain such regions.

The difference between intronless loci and noncoding regions might be trivial if recovering allelic polymorphisms is the main focus of a study, as in some phylogeography or population genetics studies. In such studies, separating multiple alleles within an individual via cloning is desirable, no matter whether the targeted loci are intronless or noncoding. However, being intronless is an obvious advantage when the main question is phylogenetic relationships of organisms and allelic polymorphism is not an issue (i.e. incomplete lineage sorting is trivial). This advantage is substantially amplified when resolving intergeneric relationships is the primary interest of a study. Nuclear gene intron sequences often diverge rapidly and may not be alignable at all between distantly related genera, in which case exon sequences are the only source of useful data. Unfortunately, one may need to sequence across several intron regions to generate sufficient exon sequences from a locus that contains both exons and introns. Worse still, cloning is likely to be necessary to overcome the length mutation problem in introns for many organisms. The laborious cloning work and wasted effort in generating intron sequences that may be useless in resolving intergeneric relationships can be completely avoided by employing these intronless loci. Of course, one may argue that protein coding regions usually diverge much more slowly than intron regions. An intronless locus without sufficient variation to resolve the targeted phylogenetic problem, particularly at lower taxonomic levels, is not very helpful. The third characteristic of PPR genes suggests that this is not a problem.

The third property of PPR genes is that they have a rapid rate of evolution. Figure 1 shows the general pattern of variation across the 127 loci we selected, in comparison with that of the four chloroplast DNA regions and ITS region. The average pairwise distance for the selected PPR loci between A. thaliana and A. lyrata was 1.4 times that for trnL-F and 0.9 times that for ITS. The average distances for PPR loci among the three Poaceae genera were 2.3–5.6 times those for trnL-F. These data suggest that PPR loci can certainly be used at interspecific and intergeneric levels, considering that both trnL-F and ITS have been extensively used for resolving interspecific and intergeneric relationships (Alvarez & Wendel, 2003; Shaw et al., 2005). Our phylogenetic analyses of partial sequences of 13 selected loci confirm this conclusion. Despite the substantial amount of missing data, almost all individual data sets for the 13 loci generated well-resolved gene trees (Fig. 2). The intergeneric relationships were congruent across all 13 loci and consistent with the subfamily classification (Barker et al., 2001). Within the genus Oryza, there were both congruent (e.g. the position of the E-genome, O. australiensis) and incongruent relationships (e.g. among A-, B- and C-genomes) from one locus to another. These results are consistent with a recent phylogenomic study of the Oryza diploid genome types (Zou et al., 2008). Additionally, considering that these loci are intronless, we speculate that they might also be useful to resolve relationships between closely related families, but this possibility needs to be evaluated in future studies.
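For readers unfamiliar with the distance measure used in these comparisons, the uncorrected p-distance between two aligned sequences is the proportion of compared sites at which they differ. The following is a minimal sketch, not the software used in this study. It assumes that sites with a gap or an ambiguous or missing base in either sequence are excluded from the comparison. The function name `p_distance` and the toy sequences are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence has a gap or a non-ACGT character are
    skipped (an assumption of this sketch); the distance is the number
    of differing sites divided by the number of compared sites.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy aligned sequences (hypothetical): two differences over ten sites.
print(p_distance("ATGGCTTCAG", "ATGACTTCGG"))  # 0.2
```

Averaging such distances over loci, and dividing the mean for one marker set by the mean for another, gives ratios of the kind reported above.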

The unique combination of these three properties gives PPR gene loci many advantages over other nuclear gene loci as phylogenetic tools. They provide numerous loci with established orthology assessment to use. Generating sequence data of these loci requires only minimal effort and aligning these sequences is straightforward. They have a rapid rate of evolution despite being intronless, and versatile utility at various levels (interspecific, intergeneric, and potentially interfamilial between closely related families). We believe that these loci will play a key role in resolving intergeneric relationships using nuclear gene data, given their extraordinary advantages in this respect, as discussed above. With the present report, we wish to bring the tremendous potential of these PPR gene loci as phylogenetic tools to the attention of plant systematists and to ameliorate the pessimistic view that ‘identifying phylogenetically informative LCN markers remains a time-consuming endeavor’ (Steele et al., 2008).

There are two final issues that we consider to be worth mentioning from a practical point of view. The first concerns the selection of loci from among these 127 loci for a specific project. Variation level (Fig. 1) and locus size (i.e. sequence length; Table 1) are two informative factors that one can use as guidance to select appropriate loci. However, we should caution that variation level might be lineage specific: the locus with the most rapid rate of evolution in Arabidopsis does not necessarily evolve most rapidly in another group. In this sense, locus size may be a more consistent parameter to guide locus selection. The second issue concerns primer design. While universal primers that can be used to amplify a locus across a broad spectrum of organisms (e.g. all angiosperms) are ideal choices, it is increasingly recognized that such universal primers may not exist for most nuclear loci (Sang, 2002; Steele et al., 2008). For loci that have such a rapid rate of evolution as the PPR genes, primer design in a lineage-specific fashion is probably more fruitful than searching for universal primers. With the rapid development of whole genome sequence and EST databases (e.g. the National Center for Biotechnology Information (NCBI) plant genome project database: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Plants&taxgroup=11:Plants|12%3; the TIGR Plant Transcript Assemblies database: http://plantta.tigr.org/) and bioinformatics tools (e.g. BLAST; http://blast.ncbi.nlm.nih.gov/Blast.cgi), it has become much easier to design lineage-specific primers. The general idea is to use these public databases to search for orthologous sequences of a selected locus from several other plant species, especially those most closely related to the study group. The alignment of these sequences can then provide a basis for the identification of conserved motifs and the design of working primers.
In fact, we have employed this approach and designed Lamiales-specific primers for five more or less arbitrarily selected loci from Table 1. Using these primers we have successfully amplified the targeted loci as single bands in the family Verbenaceae, a typical non-model-system group that has been poorly studied to date. While the details of these empirical data and phylogenetic results will be published elsewhere, we are assured that these loci are quite easy to use in practice.
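The step of identifying conserved motifs in an alignment of orthologous sequences can be sketched as a simple window scan. This is an illustrative sketch only and is not the procedure used to design the Lamiales-specific primers. It assumes that the orthologous sequences have already been retrieved and aligned, and it ignores other primer criteria such as melting temperature and self-complementarity. The function name `conserved_windows`, its parameters, and the toy alignment are hypothetical.

```python
def conserved_windows(alignment: list[str], width: int = 20,
                      max_variable_cols: int = 1) -> list[int]:
    """Return 0-based start columns of gap-free alignment windows of the
    given width that contain at most max_variable_cols variable columns."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    # Classify each alignment column once.
    gapped = []
    variable = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column}
        gapped.append("-" in bases)
        variable.append(len(bases) > 1)

    starts = []
    for start in range(length - width + 1):
        window = range(start, start + width)
        if any(gapped[i] for i in window):
            continue  # candidate priming sites should not span indels
        if sum(variable[i] for i in window) <= max_variable_cols:
            starts.append(start)
    return starts


# Toy alignment (hypothetical): column 6 is variable, column 9 has a gap.
toy_alignment = [
    "ATGGCTAAGCTTGGAT",
    "ATGGCTCAGCTTGGAT",
    "ATGGCTGAG-TTGGAT",
]
print(conserved_windows(toy_alignment, width=6, max_variable_cols=0))  # [0, 10]
```

In practice one would use windows of typical primer length and include sequences from the taxa most closely related to the study group, so that the conserved windows reflect the lineage of interest.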

Acknowledgements

The authors are grateful to Bruce Baldwin and an anonymous reviewer for comments on the manuscript. This research was supported by a Graduate Fellowship in Plant Molecular Systematics from the University of Washington Department of Biology, an NSF Doctoral Dissertation Improvement Grant (DDIG) (DEB-0710026) to RGO for the first author's dissertation research, and an NSF Grant (DEB-0542493) to RGO.
