Plastid genome (plastome) sequence information is of central importance to tracing the evolutionary history of plastids and their hosts. The small genome size and high copy number per cell have made plastid genomes much more amenable to sequencing than nuclear genomes. Currently, at least 190 completely sequenced plastid genomes are available, of which 163 are from various green plants, with land plants having the best representation (137) (Fig. 1). Comparative studies indicate that plastid genomes have experienced substantial rearrangements and frequent gene losses throughout plant evolution. Some genes or groups of functionally related genes have been independently lost multiple times. These plastid genomic characters have been recognized as phylogenetically informative markers. Based on the complete plastid genome sequence data, chloroplast phylogenomics has emerged as an effective approach to clarifying phylogenetic relationships in plants and algae.
Abstract More than 190 plastid genomes have been completely sequenced during the past two decades owing to advances in DNA sequencing technologies. This unprecedented abundance of data has revealed extensive genomic changes in plastid genomes. Inversion is the most common mechanism that leads to gene order changes. Several inversion events have been recognized as informative phylogenetic markers, such as a 30-kb inversion found in all living vascular plants except lycopsids and two short inversions putatively shared by all ferns. Gene loss is a common event throughout plastid genome evolution. Many genes were independently lost or transferred to the nuclear genome in multiple plant lineages. The trnR-CCG gene was lost in some clades of lycophytes, ferns, and seed plants, and all the ndh genes are absent in parasitic plants, gnetophytes, Pinaceae, and the Taiwan moth orchid. Certain parasitic plants have, in particular, lost plastid genes related to photosynthesis because of the relaxation of functional constraint. The dramatic growth of plastid genome sequences has also promoted the use of whole plastid sequences and genomic features to solve phylogenetic problems. Chloroplast phylogenomics has provided additional evidence for deep-level phylogenetic relationships as well as increased phylogenetic resolution at low taxonomic levels. However, chloroplast phylogenomics is still in its infancy and rigorous analysis methodology has yet to be developed.
1 Overview of plastid genome sequencing
A search of public databases and published work revealed 190 completely sequenced plastid genomes, with a steep increase over the last 4 years (Fig. 2). A total of 24 and 33 genome sequences were reported during 1986–2001 and 2002–2005, respectively, whereas at least 133 were released within 2006–2009. Among these, the plastid genomes of 163 land plants and green algae have been determined (Fig. 1). Viewed within a phylogenetic framework, the sequencing effort is clearly uneven (Fig. 1). Given their apparent importance, flowering plants, particularly crops, are the focus of sequencing (Pryer et al., 2002). The lack of plastid genomic data for representatives of other crucial evolutionary nodes, however, may hinder our comparative understanding of plastid genomic organization, function, and evolution. Fortunately, this situation has started to change. Since 2007, 34 plastid genomes have been sequenced for non-flowering plants including chlorophytes (12), charophytes (1), bryophytes (2), lycophytes (2), monilophytes (2), and gymnosperms (15). The plastid genomic data, which now have broader taxonomic coverage, offer an excellent opportunity for the study of plastid genome evolution.
The rapid increase of complete plastid genome sequences is largely fuelled by advances in sequencing technologies. The first two plastid genomes were sequenced using the chemical method (Gilbert) and the dideoxynucleotide procedure (Sanger) (Ohyama et al., 1986; Shinozaki et al., 1986). At that time, sequencing a complete plastid genome was laborious work. For instance, it took approximately 10 years to sequence the tobacco plastid genome (Sugiura, 2003). After 2000, the sequencing of plastid genomes has been markedly accelerated due to improvements in Sanger sequencing technology. More recently, the next-generation sequencing platforms have been introduced to sequence plastid genomes of plants, such as Nandina domestica Thunb., Platanus occidentalis L., and Ceratophyllum demersum L., using the Roche/454 platform (Moore et al., 2006; Moore et al., 2007), and eight coniferous plastid genomes have been sequenced using the Illumina/Solexa Genome Analyzer (Cronn et al., 2008). In a recent report, nearly-complete plastid genomes of 32 species in Pinus and four relatives in Pinaceae have been generated using the Illumina/Solexa Genome Analyzer (Parks et al., 2009). These achievements have indicated the unprecedented power of next-generation sequencers for sequencing plastid genomes.
2 Comparative chloroplast genomics
2.1 Genome structure
2.1.1 Overall structure The chlorophycean plastid genome has experienced numerous architectural changes (Pombert et al., 2005, 2006; Belanger et al., 2006; Brouard et al., 2008; Turmel et al., 2009b). Nevertheless, the plastid genomes remain largely stable in the transition from charophycean green algae to land plants (Turmel et al., 2006, 2007). Land plant plastid genomes share the typical quadripartite structure that is characterized by the presence of two rRNA-containing inverted repeats (IRs) and two unequal single-copy regions (Raubeson & Jansen, 2005). Similar to land plants, five of the six monophyletic charophycean lineages, except Zygnematales (Turmel et al., 2005), have a plastid genome with the quadripartite structure as well (Turmel et al., 2006). The same structure is also found in the plastid genomes of most chlorophycean algae, the cyanelle genome of Cyanophora paradoxa Korshikov (glaucophytes) (Stirewalt et al., 1995) and about two-thirds of the sequenced secondary plastid genomes. Thus, this quadripartite architecture and gene-partitioning pattern of plastid genomes should represent ancient genomic features derived from their cyanobacterial progenitor (Turmel et al., 1999).
In streptophytes (land plants and their close relatives, charophycean algae), the sequenced plastid genomes vary from 70 to 166 kb, with only one known exception (Pelargonium x hortorum Bailey, 217.9 kb) (Chumley et al., 2006). Chlorophycean algae, however, span a wider range, from 37 to 224 kb. The increased size is largely due to the expansion of IRs, intergenic spacers, introns, or even repeats rather than to an increase in gene capacity (Maul et al., 2002; Chumley et al., 2006; Turmel et al., 2007). By contrast, the smallest plastid genomes are found in parasitic organisms, such as Helicosporidium sp. (37,454 bp, a parasitic green alga) (de Koning & Keeling, 2006) and Epifagus virginiana (L.) W.P.C. Barton (70,028 bp, a parasitic flowering plant) (Wolfe et al., 1992). Because of their specialized lifestyle, parasitic species usually experience relaxed selective pressure on their plastid genomes, especially on the photosynthesis-related genes (Krause, 2008). This reduction pattern points to a common force that shapes plastid genomes after the loss of photosynthesis (de Koning & Keeling, 2006; Wickett et al., 2008).
2.1.2 Genome rearrangements The structure of land plant plastid genomes is generally conserved. However, rearrangements do occur in certain lineages and have been proven to be effective phylogenetic markers (Fig. 3). For instance, a 30-kb inversion, which is shared by all vascular plants (tracheophytes) except lycopsids, suggested that lycopsids are the basal lineage of vascular plants (Raubeson & Jansen, 1992). More recently, a 3-kb inversion (including psbD, psbC, and psbZ) (Roper et al., 2007) and a trnD-GUC inversion (D inversion) (Gao et al., 2009) have been proposed to be potential phylogenetic markers uniting the monilophyte (fern) clade. In addition, the monophyly of a group of papilionoid legumes is also supported by the feature of loss of one copy of the IR (inverted-repeat-lacking clade) (Wojciechowski et al., 2000).
Among the 108 sequenced plastid genomes from flowering plants, the majority have a very similar organization and putatively ancestral pattern (Raubeson et al., 2007), which is represented by tobacco (Nicotiana tabacum L.). This tobacco-like pattern is conserved across angiosperms from the basal lineages including Amborella trichopoda Baill. (Goremykin et al., 2003) and Nymphaeales (Goremykin et al., 2004; Raubeson et al., 2007) to derived lineages such as Panax schinseng Nees (Kim & Lee, 2004), Eucalyptus globulus Labill. (Steane, 2005) and grape (Vitis vinifera L.) (Jansen et al., 2006). Nevertheless, a few angiosperms show several distinctive rearrangements. Pelargonium x hortorum has the largest plastid genome in land plants, with two dramatically expanded IRs, up to 75 kb each (Chumley et al., 2006). It contains several rearrangements including inversions, duplications, insertions, and expansions of the IRs (Chumley et al., 2006). Numerous rearrangements are also observed in the plastid genomes of Trifolium subterraneum L. (Fabaceae) (Cai et al., 2008), Trachelium caeruleum L. (Campanulaceae) (Haberle et al., 2008), and Jasminum nudiflorum Lindl. (Oleaceae) (Lee et al., 2007).
Gymnosperms have less conserved plastid genomes than angiosperms. The Cycas taitungensis C.F. Shen plastid genome has the typical quadripartite structure, and its gene order and genome size are very similar to the ancestral type of angiosperms (Wu et al., 2007). However, in conifers the IR regions are highly reduced or lost (Wakasugi et al., 1994; Hirao et al., 2008). In gnetophytes even the whole plastid genomes become extremely reduced and compact (McCoy et al., 2008; Wu et al., 2009a). The chloroplast genomes of Welwitschia mirabilis, Ephedra equisetina, and Gnetum parvifolium have been completely sequenced, representing the three gnetophyte orders: Welwitschiales, Ephedrales, and Gnetales, respectively. These gnetophyte plastid genomes have lost at least 18 genes that are found in other seed plants. Moreover, they have significantly shorter introns and intergenic spacers (McCoy et al., 2008; Wu et al., 2009a). Although gnetophyte plastid genomes have two large IRs, gene losses and compactness make them shorter (less than 120 kb) than other IR-containing plastid genomes, excluding those of some parasitic plants.
Comparative studies based on plastid genome sequences show that genome shuffling occurs mostly as blocks, which might be necessary for genome function (Wakasugi et al., 1997; Turmel et al., 2002, 2005, 2009b; Chumley et al., 2006; Cai et al., 2008; Haberle et al., 2008). It has also been noticed that highly rearranged plastid genomes usually display an increased abundance of small dispersed repeats (SDRs), and that SDRs and/or tRNA genes are often present at or near the endpoints of rearranged gene blocks. These repetitive elements are speculated to promote chloroplast DNA inversions through inter- or intra-molecular recombination (Hiratsuka et al., 1989; Turmel et al., 2002; Pombert et al., 2005; Chumley et al., 2006; Haberle et al., 2008). If this is indeed the case, they may have a destabilizing effect on the genome structure and should be deleted after inversions in order to guarantee genome stability. However, SDRs are also detected in some plastid genomes with few if any rearrangements (Daniell et al., 2006; Jansen et al., 2006; Raubeson et al., 2007; Timme et al., 2007; Gao et al., 2009). Most plastid SDRs are small and commonly restricted either to intergenic spacers/introns or to three genes, psaA, psaB, and ycf2 (Daniell et al., 2006; Jansen et al., 2006; Raubeson et al., 2007; Timme et al., 2007; Gao et al., 2009). Certain SDRs are located in the same regions across distantly related species (Saski et al., 2005; Raubeson et al., 2007). Currently, the exact roles of SDRs in unrearranged plastid genomes remain unknown. However, their widespread presence and conservation suggest that some SDRs are functional (Saski et al., 2005; Raubeson et al., 2007).
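The block-wise inversions described above can be pictured as an operation on a signed gene order: the genes within the block are reversed in order and each is flipped to the opposite strand. The following is a minimal sketch of that idea only. The gene names in the toy order and the function name `invert_block` are assumptions made for illustration and do not reproduce any published plastid gene map.

```python
def invert_block(order, start, end):
    """Return a new signed gene order with order[start..end] inverted.

    Each gene is a (name, strand) tuple with strand +1 or -1. An inversion
    reverses the order of the genes in the block and flips every strand.
    """
    if not (0 <= start <= end < len(order)):
        raise ValueError("block endpoints out of range")
    block = [(name, -strand) for name, strand in reversed(order[start:end + 1])]
    return order[:start] + block + order[end + 1:]


# Toy gene order (hypothetical names, for illustration only).
toy_order = [("geneA", 1), ("geneB", 1), ("geneC", -1), ("geneD", 1), ("geneE", 1)]

inverted = invert_block(toy_order, 1, 3)   # invert the geneB..geneD block
restored = invert_block(inverted, 1, 3)    # inverting again restores the order

print(inverted)
print(restored == toy_order)
```

Because an inversion is its own inverse, applying it twice to the same block returns the starting order, which is one reason a shared inversion is easy to score as a discrete character.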
2.2 Gene capacity
During the transition from endosymbiont to organelle, most of the cyanobacterial genes have been lost or transferred to the host nuclear genome (Martin et al., 1998; Timmis et al., 2004). As a result, the plastids of most modern-day photosynthetic algae and plants possess a small circular genome containing ∼100–250 genes (NCBI Organelle Genome Resources, http://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?opt=plastid&taxid=2759). Most gene losses occurred early in the process of endosymbiosis (Martin et al., 1998); however, some losses took place during the subsequent course of evolution. The current availability of plastid genomic data from several crucial evolutionary nodes will allow us to examine the mechanisms of plastid gene losses and/or transfers within a phylogenetic framework.
2.2.1 Plastid genome reductions in parasitic plants While most plants thrive as free-living organisms, some have evolved into parasites. As mentioned above, in parasitic plants many plastid genes, in particular photosynthesis-related genes such as genes encoding photosystem subunits, are lost or degraded to pseudogenes. Besides E. virginiana (Wolfe et al., 1992), recent reports on the plastid genome sequences of the Cuscuta species (Funk et al., 2007; McNeal et al., 2007) as well as Aneura mirabilis (Malmb.) Wickett (a parasitic liverwort) (Wickett et al., 2008) provide a good opportunity to dissect the reduction pattern of plastid genomes.
The E. virginiana plastid genome is extremely reduced, encoding only 42 genes and having no intact photosynthetic or chlororespiratory genes (ndh genes) (Fig. 4) (dePamphilis & Palmer, 1990; Wolfe et al., 1992). In comparison to autotrophic angiosperms, it has also lost several transcription–translation system genes including all the plastid-encoded RNA polymerase (PEP) genes, six ribosomal protein genes, and 13 tRNA genes (Wolfe et al., 1992). All of the 11 ndh genes, five psb genes, and two psa genes are missing in the A. mirabilis plastid genome as well (Fig. 4) (Wickett et al., 2008). In contrast, the four sequenced Cuscuta species retain nearly all the photosynthesis genes despite losing all ndh genes, as they still possess residual photosynthetic activity (Fig. 4) (Funk et al., 2007; McNeal et al., 2007). Interestingly, PEP genes (rpo genes) are missing in Cuscuta gronovii and Cuscuta obtusiflora (Funk et al., 2007; McNeal et al., 2007), as in E. virginiana. The PEP function seems to be fully taken over by a nuclear-encoded polymerase that is imported into the plastids (Krause, 2008). Nevertheless, the PEP genes still exist in the Cuscuta reflexa, Cuscuta exaltata, and A. mirabilis plastid genomes (Fig. 4). This may indicate either that insufficient time has passed under relaxed selection or that the genes are still needed. Another surprise is that the highly conserved maturase gene, matK, is absent from the C. gronovii and C. obtusiflora plastid genomes (Fig. 4) (Funk et al., 2007; McNeal et al., 2007). This gene loss parallels the disappearance of Group IIA introns, which are the putative targets of maturase K (McNeal et al., 2009).
2.2.2 Gene loss is a common pattern throughout plastid genome evolution Apart from parasitic plants, the ndh genes are absent in gnetophytes and Pinaceae, but present in other gymnosperms (Wakasugi et al., 1994; Wu et al., 2007, 2009a; Hirao et al., 2008; McCoy et al., 2008; Braukmann et al., 2009). This finding lends additional support to the close relationship between gnetophytes and Pinaceae (Braukmann et al., 2009). ndh genes are lost in the Phalaenopsis aphrodite Rchb.f. (Orchidaceae) plastid genome as well (Chang et al., 2006), possibly caused by their transfers to the nuclear genome (Wakasugi et al., 1994; Chang et al., 2006).
The losses of four genes, chlB, chlL, chlN, and trnP-GGG, have been proposed to be synapomorphies for flowering plants (Jansen et al., 2007). The three chl genes have also been independently lost from the whisk fern Psilotum nudum (L.) P. Beauv. (Wakasugi et al., 1998) and from two gnetophytes, Welwitschia mirabilis Hook.f. (McCoy et al., 2008) and Gnetum parvifolium (Warb.) Cheng (Wu et al., 2009a), but they are present in other non-flowering land plants. The chl genes are also found in most green algae, except for four chlorophycean algae, Ostreococcus tauri (Robbens et al., 2007), Pseudendoclonium akinetum (Pombert et al., 2005), Monomastix sp. OKE-1 (Turmel et al., 2009a), and Pedinomonas minor (Turmel et al., 2009b).
Beyond the ndh and chl gene families, multiple, independent gene losses have further been observed for genes including infA (Millen et al., 2001), rps15 (Tsuji et al., 2007), rps16 (Jansen et al., 2007; McCoy et al., 2008), rpl22 (Jansen et al., 2006), rpl23 (McCoy et al., 2008), accD (Lee et al., 2007), psaM (Wolf et al., 2003), ycf1 (Roper et al., 2007; Cai et al., 2008; Wu et al., 2009b), ycf15 (Mardanov et al., 2008), and ycf66 (Gao et al., 2009).
It is believed that a set of 30 tRNAs is sufficient for the translation of chloroplast mRNAs (Shinozaki et al., 1986). However, certain plastid genomes apparently do not possess enough tRNA genes for translation. For instance, only 12, 13, and 17 tRNA genes were identified in the plastid genomes of Selaginella uncinata (Desv. ex Poir.) Spring (Tsuji et al., 2007), Selaginella moellendorffii Hieron (Smith, 2009), and E. virginiana (Wolfe et al., 1992), respectively. The deficiencies of plastid tRNAs suggest that cytosolic tRNAs might be imported into chloroplasts by means of previously unsuspected mechanisms (Wolfe et al., 1992), although experimental evidence is lacking (Lung et al., 2006).
2.2.3 Decay of the trnR-CCG gene in ferns and gymnosperms: An example for plastid gene loss Three arginine tRNA genes (trnR-CCG, trnR-ACG, and trnR-UCU) are found in the land plant plastid genomes (Wakasugi et al., 2001). Of them, intact trnR-CCG occurs in the studied bryophytes (Ohyama et al., 1986; Kugita et al., 2003; Sugiura et al., 2003; Wickett et al., 2008), Huperzia lucidula (Michx.) Trevis. (lycophytes) (Wolf et al., 2005), two basal ferns Psilotum nudum (Wakasugi et al., 1998) and Angiopteris evecta (G. Forst.) Hoffm. (Roper et al., 2007), and most gymnosperms (Wakasugi et al., 1994; Wu et al., 2007; Cronn et al., 2008; McCoy et al., 2008; Wu et al., 2009a), but is absent in the two Selaginella species (lycophytes) Selaginella uncinata (Tsuji et al., 2007) and Selaginella moellendorffii Hieron. (Smith, 2009), tree ferns (Gao et al., 2009), polypod ferns (Wolf et al., 2003; Gao et al., 2009), a coniferous species Cryptomeria japonica D. Don (Hirao et al., 2008) and all angiosperms (Wakasugi et al., 2001) (Fig. 5: A,B). The trnR-CCG gene is normally located between rbcL and accD (Gao et al., 2009). However, it occurs between rbcL and psaI in gnetophytes (Ephedra, Gnetum, and Welwitschia) due to the accD loss (McCoy et al., 2008; Wu et al., 2009a). Strikingly, Welwitschia has a duplicated trnR-CCG in the psbJ-psbL intergenic region (McCoy et al., 2008).
The trnR-CCG gene may have been lost independently in multiple clades such as lycophytes, higher ferns, and seed plants. Sugiura and Sugita (2004) argued that trnR-CCG may not be functionally essential for plastids in spite of being conserved in non-flowering plants. The finding of its losses in multiple land plants tends to support this view. Recently, a unique trnR-UCG gene was found between rbcL and accD in tree ferns (Gao et al., 2009). Previously, at the same locus an apparent tRNA gene was annotated as trnSeC-UCA in the polypod fern Adiantum capillus-veneris L. (Wolf et al., 2003). Nevertheless, neither trnR-CCG/trnR-UCG nor trnSeC was identified at this position when other polypod fern rbcL-accD intergenic sequences deposited in GenBank were examined (Gao et al., 2009). Because of the sequence similarity and conserved location, Gao et al. (2009) proposed that the trnR-CCG, tree fern trnR-UCG, and Adiantum trnSeC genes are orthologous. These findings imply that the decay of trnR-CCG in ferns is associated with a two-step mutation of the anticodon: first, trnR-CCG is changed into trnR-UCG by one base change; second, trnR-UCG is converted into the Adiantum trnSeC by a further single-base change, becoming a trnR-UCG pseudogene (Fig. 5: C). The ψtrnR-UCG then undergoes extensive base substitutions and ultimately becomes undetectable in some polypods.
For gymnosperms, the Cryptomeria trnR-CCG gene was considered a pseudogene (Hirao et al., 2008) whose predicted anticodon sequence is CCA (Fig. 5: B). As the putative anticodon in Ephedra trnR-CCG is also CCA, Wu et al. (2009a) proposed an unusual RNA editing site at the anticodon (A-to-G), without providing any experimental evidence. Here we speculate that Ephedra trnR-CCG might be a pseudogene too, and that the putative ψtrnR-CCGs from both Cryptomeria and Ephedra were derived from trnR-CCG by alteration of the third anticodon base (by an insertion or a base substitution) (Fig. 5: C). After this mutation, the ψtrnR-CCGs, particularly Cryptomeria ψtrnR-CCG, have continued to experience substantial sequence changes (Fig. 5: B).
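The two-step anticodon pathway proposed above can be checked with a simple position-by-position comparison. The sketch below uses only the three anticodons named in the text (CCG, UCG, and UCA); the helper name `anticodon_changes` is an assumption for illustration, and the code says nothing about which step occurred first in any real lineage.

```python
def anticodon_changes(a, b):
    """List (position, from_base, to_base) differences between two anticodons.

    Positions are 1-based and follow the order in which the anticodon is
    written in the gene name (e.g. trnR-CCG -> "CCG").
    """
    if len(a) != len(b):
        raise ValueError("anticodons must have equal length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Anticodons as named in the text: trnR-CCG, trnR-UCG, trnSeC-UCA.
pathway = ["CCG", "UCG", "UCA"]
steps = [anticodon_changes(a, b) for a, b in zip(pathway, pathway[1:])]

for (a, b), diff in zip(zip(pathway, pathway[1:]), steps):
    print(f"{a} -> {b}: {diff}")

# Direct comparison of the first and last anticodon.
total = anticodon_changes(pathway[0], pathway[-1])
print(f"CCG -> UCA differs at {len(total)} positions")
```

Each step changes exactly one base (position 1, then position 3), so the two endpoints differ at two positions, which is consistent with an intermediate UCG form lying between them.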
In summary, as the trnR-CCG gene is not essential for plastid function, it has been lost independently in multiple land plant lineages. The trnR-CCG loss is not caused by a single deletion event, but by a few point mutations making the gene non-functional, then the pseudogene gradually decays until it is no longer recognizable. Moreover, the detailed degradation trajectories differ among distinct lineages (Fig. 5).
3 Phylogenetic use of whole plastid genome data
Besides providing materials for comparative chloroplast genomics, the whole plastid genome sequences of crucial species have also been used to unravel photosynthetic eukaryote phylogenies. The term “chloroplast phylogenomics” used in this review refers to the use of chloroplast genome-scale data to reconstruct the evolutionary history of organisms. At the early stage of molecular phylogenetics, a single or a few genes were used. Recently, technological breakthroughs have facilitated rapid sequencing of the entire plastid genome. Recruitment of sequences of more genes, even whole genomes, is expected to avoid or reduce two types of errors inherent in using just one or several genes: the stochastic error (sampling error); and paralogy caused by mechanisms such as gene duplication, horizontal gene transfer or lineage sorting, although it still cannot elude systematic error (Rokas et al., 2003; Martin et al., 2005; Jeffroy et al., 2006; Brinkmann & Philippe, 2008; Zou et al., 2008). In addition, the examination of plastid genomic features, such as gene content, gene order, intron gain/loss, genome size, nucleotide composition, and codon usage, may also offer independent tests of hypotheses proposed by traditional analysis of DNA sequence data (Rokas & Holland, 2000; Kelch et al., 2004; Delsuc et al., 2005; Philippe et al., 2005; Turmel et al., 2006, 2007; Goffinet et al., 2007; Mishler & Kelch, 2009). The genomic structural changes, in particular multigene inversions, show less homoplasy than sequence data, and can be informative in resolving certain intractable phylogenetic issues (Kelch et al., 2004; Philippe et al., 2005; Raubeson & Jansen, 2005; Goffinet et al., 2007; Jansen et al., 2007; Mishler & Kelch, 2009).
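One common way to use many genes at once is to concatenate the per-gene alignments into a single supermatrix while recording where each gene starts and ends. The sketch below illustrates that bookkeeping under simple assumptions: every gene is already aligned, taxa missing from a gene are padded with `?`, and the taxon names, gene names, and sequences are invented toy data. It is not the procedure of any study cited here.

```python
def concatenate_alignments(alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    `alignments` maps gene name -> {taxon: aligned sequence}. Taxa absent
    from a gene are padded with `missing` so all rows have equal length.
    Returns (supermatrix, partitions); partitions maps each gene to its
    1-based inclusive column range in the supermatrix.
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    supermatrix = {t: "" for t in taxa}
    partitions = {}
    offset = 0
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene} are not aligned")
        length = lengths.pop()
        for t in taxa:
            supermatrix[t] += aln.get(t, missing * length)
        partitions[gene] = (offset + 1, offset + length)
        offset += length
    return supermatrix, partitions


# Toy data (hypothetical taxa, genes, and sequences).
toy = {
    "gene1": {"taxonA": "ATGGCA", "taxonB": "ATGGCT", "taxonC": "ATGACA"},
    "gene2": {"taxonA": "ATGTTT", "taxonC": "ATGTTC"},
}
matrix, parts = concatenate_alignments(toy)
for taxon, row in matrix.items():
    print(taxon, row)
print(parts)
```

Keeping the partition ranges alongside the matrix allows each gene to be analysed separately later, which matters when different genes carry conflicting signals.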
3.1 Phylogeny of charophycean green algae
Land plants and their immediate ancestors, charophycean green algae, together form the phylum Streptophyta, whereas the remaining green algae are classified into the phylum Chlorophyta. In charophycean algae, six main lineages are currently recognized: Mesostigmatales (one species), Chlorokybales (one species), Klebsormidiales (three genera, 45 species), Zygnematales (ca. 50 genera, ca. 6000 species), Coleochaetales (three genera, 20 species), and Charales (six genera, 81 species), given here in an order from low to high cellular complexity (Turmel et al., 2007). Two questions are at the center of studies of these algae, namely: whether the scaly green flagellate alga Mesostigma viride Lauterborn (Mesostigmatales) is a member of charophytes or not; and which group of charophytes represents the sister to the land plants (Qiu, 2008). Recent analysis of the entire plastid genomes has firmly established that Mesostigma and Chlorokybus atmophyticus Geitler (Chlorokybales) form a clade representing the earliest divergence of charophycean algae (Lemieux et al., 2007).
Nevertheless, the second question is still vigorously debated (Becker & Marin, 2009). The Charales (stoneworts) have long been regarded as the closest relatives of land plants (Karol et al., 2001; Turmel et al., 2003; McCourt et al., 2004; Qiu et al., 2006). The well-known phylogenetic study based on four genes from three cellular compartments (plastid atpB and rbcL genes, mitochondrial nad5 gene, and the nuclear 18S rRNA gene) (Karol et al., 2001) supported the view that stoneworts are sister to land plants and that the evolution of charophytes is accompanied by a gradual increase in cellular complexity (McCourt et al., 2004). However, this view was challenged by a recent phylogenomic analysis of 76 plastid genes/proteins using various methods (Turmel et al., 2006). It showed that Charales is sister to the clade consisting of Coleochaetales, Zygnematales and land plants (Turmel et al., 2006). This unexpected branching order was further supported by plastid structural genomic features including gene order, gene content, and intron content, as well as insertions/deletions in gene-coding regions (Turmel et al., 2006). As well as being incongruent with the four-gene phylogeny (Karol et al., 2001), the chloroplast genome-based phylogeny (Turmel et al., 2006) also conflicts with inferences established from mitochondrial genomic data (Turmel et al., 2002, 2003). If the relationship shown by chloroplast phylogenomics is true, the evolution of streptophytes must be reconsidered. In particular, the cellular features thought to be shared by the Charales and land plants might have emerged independently (Turmel et al., 2007).
3.2 Phylogeny of land plants
Land plants, which probably diverged from green algae approximately 432–476 million years ago (Kenrick & Crane, 1997), successfully went ashore, and subsequently adapted to a dramatically different environment, paving the way for the development of terrestrial ecosystems. The monophyly of land plants has been robustly established by extensive phylogenetic analyses based on morphological and biochemical data, multigene supermatrices, as well as chloroplast genome sequences and gene contents (reviewed in Qiu, 2008). However, relationships among basal land plants remain unsettled.
3.2.1 Bryophytes The three major groups of bryophytes (liverworts, mosses, and hornworts) constitute the earliest lineages of land plants (Goffinet & Shaw, 2009). Knowledge of their relationship is essential for understanding the early diversification of land plants. Although they share a fundamentally similar life cycle, the paraphyly of bryophytes is well established after extensive surveys (reviewed in Qiu, 2008; Mishler & Kelch, 2009). However, the branching order among the three bryophyte clades remains vigorously contested. Based on recent molecular phylogenies, an emerging consensus is that liverworts represent the earliest diverging lineage, while hornworts are the closest relatives of extant vascular plants, that is, a branching order of “(liverworts(mosses(hornworts(vascular plants))))” (Qiu et al., 2006, 2007). The missing plastid ycf3 intron 2 in liverworts, for example, Marchantia (Ohyama et al., 1986) and Aneura (Wickett et al., 2008), further indicates that liverworts are distinct from other land plants. The sister relationship of hornworts to vascular plants is also more convincingly evidenced by chloroplast and mitochondrial genomic characters (Kelch et al., 2004; Groth-Malonek et al., 2005; Mishler & Kelch, 2009) along with morphological and physiological features, particularly those related to sporophyte development (reviewed in Qiu et al., 2006).
To date, five species representing the three major bryophyte lineages have had their plastid genomes completely sequenced: Marchantia polymorpha (Ohyama et al., 1986) and A. mirabilis (Wickett et al., 2008) (liverworts), Physcomitrella patens (Sugiura et al., 2003) and Syntrichia ruralis (GenBank: NC_012052) (mosses), and Anthoceros formosae (Kugita et al., 2003) (hornworts). The sequences of Marchantia, Physcomitrella, and Anthoceros have already been integrated in phylogenomic studies. Nevertheless, disparate topologies were revealed due to the differences inherent in tree reconstruction methods, taxon samplings, or character partitions (Wolf et al., 2005; Qiu et al., 2006; Turmel et al., 2006, 2007; Lemieux et al., 2007; Rodriguez-Ezpeleta et al., 2007). Our analysis of 79 plastid protein-coding genes from the five bryophyte complete genomes and 37 charophycean and vascular plant genomes provides robust support for the basal placement of liverworts within land plants (bootstrap (BS) = 100) and the sister relationship of hornworts to vascular plants (BS = 100) (Fig. 6: A). Signal conflicts were observed among codon positions (Fig. 6: B,C), indicating that in addition to using more extensive gene and genome sequences from more taxa, improved methodology designed to deal with signal conflicts is also urgently needed to disentangle the controversial phylogenetic issues existing in bryophytes and the vascular plants.
3.2.2 Lycophytes and ferns Vascular plants are a well-established monophyletic lineage of land plants comprising approximately 285,000 living species (Magallón & Hilub, 2009). Among them, lycophytes (clubmosses) form a clade sister to the clade of ferns and spermatophytes (seed plants) (Kenrick & Crane, 1997; Pryer et al., 2001), which is strongly supported by a unique 30-kb inversion in the chloroplast genome (Raubeson & Jansen, 1992). Currently three plastid genomes of lycophytes, H. lucidula (Wolf et al., 2005), S. uncinata (Tsuji et al., 2007), and S. moellendorffii (Smith, 2009), have been determined. The gene order in H. lucidula is very similar to that of bryophytes (Wolf et al., 2005). The two Selaginella species share a unique 17-kb transposition from LSC to SSC. Surprisingly, a 20-kb inversion (from trnC to psbI) is present in S. uncinata (Tsuji et al., 2007), but not in S. moellendorffii (Smith, 2009) or H. lucidula (Wolf et al., 2005). Here, our chloroplast phylogenomic analysis based on the sequenced H. lucidula, S. uncinata, and S. moellendorffii plastid genomes confirms that lycophytes form a monophyletic clade diverging earliest in vascular plants (Fig. 6).
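Signal conflict among codon positions is usually examined by splitting an in-frame coding alignment into its first, second, and third positions and analysing each partition separately. The sketch below shows only that splitting step, assuming the alignment starts in frame and has a length divisible by three. The function name, taxon names, and sequences are toy assumptions, and the partitions actually used for Fig. 6 are not specified here.

```python
def split_codon_positions(alignment):
    """Split an in-frame coding alignment by codon position.

    `alignment` maps taxon -> aligned coding sequence. Returns a dict with
    keys 1, 2, 3, each mapping taxon -> the columns at that codon position.
    """
    parts = {1: {}, 2: {}, 3: {}}
    for taxon, seq in alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{taxon}: length {len(seq)} is not a multiple of 3")
        for pos in (1, 2, 3):
            # Take every third column, starting at the given codon position.
            parts[pos][taxon] = seq[pos - 1::3]
    return parts


# Toy in-frame alignment (hypothetical taxa and sequences).
toy_cds = {"taxonA": "ATGGCATTT", "taxonB": "ATGGCTTTC"}
by_position = split_codon_positions(toy_cds)

for pos in (1, 2, 3):
    print(pos, by_position[pos])
```

In this toy case the two sequences differ only at third positions, a small reminder that third positions often evolve fastest and can carry a different signal from first and second positions.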
The validation of the monophyly of monilophytes/ferns constituting horsetails, whisk ferns, and all eusporangiate and leptosporangiate ferns, is a major breakthrough in recent pteridophyte systematic studies (Pryer et al., 2001). The result has also been upheld by morphological analyses of living and fossil taxa, the studies on DNA sequence data (reviewed in Pryer et al., 2004; Qiu, 2008; Schuettpelz & Pryer, 2008), as well as examinations of the chloroplast genomic features such as the 3-kb inversion and the D inversion (Fig. 3) (Roper et al., 2007; Gao et al., 2009). Although a robust phylogenetic framework of extant ferns is available (Smith et al., 2006, 2008; Schuettpelz & Pryer, 2008), relationships among their major clades remain somewhat elusive (Fig. 7) (Rothwell & Stockey, 2008; Schuettpelz & Pryer, 2008).
Analyses of the four sequenced fern plastid genomes provide fresh insights for the plastid genome evolution (Fig. 7) (Wakasugi et al., 1998; Wolf et al., 2003; Roper et al., 2007; Gao et al., 2009). After the 30-kb inversion, another two small inversion events (the 3-kb inversion and D inversion) occurred in ferns (Fig. 3). Both of the plastid genomes of Psilotum nudum (Psilotaceae, a whisk fern) (Wakasugi et al., 1998) and Angiopteris evecta (Marattiaceae, a marattioid fern) (Roper et al., 2007) have the ancestral fern gene order. The loss of three chl genes in Psilotum (Wakasugi et al., 1998) and the pseudogenization of ycf1 in Angiopteris (Roper et al., 2007) distinguish them from other ferns. A set of complex rearrangements spanning the IR-LSC junction was previously reported in higher leptosporangiate ferns (Hasebe & Iwatsuki, 1992; Stein et al., 1992). This has been validated and detailed by the recently published plastid genomes of Adiantum capillus-veneris (Pteridaceae, a polypod fern) (Wolf et al., 2003) and Alsophila spinulosa (Hook.) R.M. Tryon (Cyatheaceae, a tree fern) (Gao et al., 2009). An additional rearrangement between rpoB and psbZ, involving at least two overlapping inversions, was also identified in polypod and tree ferns (Fig. 3) (Wolf et al., 2003; Gao et al., 2009). It is noteworthy that each of the three abovementioned inversions has one endpoint located within ropB-psbZ, that is, intergenic spacers of psaM∼trnD-GUC for the 30-kb inversion, trnE-UUC∼trnG-GCC for the 3-kb inversion, and trnD-GUC for the D inversion (please refer to Figure 3 in Gao et al., 2009). The convergence of up to five tRNA genes should be a potential cause for the high instability of this region.
However, the currently available plastid genome data are clearly insufficient for resolving the fern phylogeny (Fig. 7). Plastid genome sequences of the two unsampled basal fern families, Ophioglossaceae and Equisetaceae, may provide clues to ascertain their phylogenetic positions. As mentioned, two major rearranged regions and five tRNA gene losses have been found in the higher leptosporangiate plastid genomes. More plastid genomic data from the leptosporangiate clades, such as Osmundaceae, Hymenophyllaceae, Gleicheniales, and Schizaeales, are needed to trace the evolutionary trajectories of the rearrangements and to develop phylogenetic markers.
3.2.3 Seed plants (Spermatophytes)
Seed plants account for ∼96% of living vascular plant diversity. Extant seed plants include five major clades: cycads, ginkgo, conifers, gnetophytes, and angiosperms. Their phylogenetic relationships, particularly the identity of the sister group of angiosperms, remain controversial. Several cladistic analyses based on morphological data identified gnetophytes as the closest living relatives of angiosperms (Anthophyte hypothesis) (e.g. Crane, 1985a, 1985b; Doyle & Donoghue, 1986; Donoghue, 1994; Nixon et al., 1994; Doyle, 1996). However, nearly all molecular studies disagree with this view and often support an affinity between gnetophytes and conifers instead (reviewed in Qiu, 2008; Mathews, 2009). They also show that all living gymnosperms are more closely related to each other than to angiosperms (i.e., the monophyly of extant gymnosperms) (Fig. 6) (Renner, 2009). In other words, angiosperms may have no sister group among living gymnosperms (Mathews, 2009). Chloroplast phylogenomic analyses based on 56 (Wu et al., 2007) or 57 (McCoy et al., 2008) concatenated genes support the affinity between gnetophytes and conifers, although taxon sampling in these studies is limited. A recent analysis using 83 plastid genes (Mathews, 2009) and our analysis using 79 plastid genes (Fig. 6), both with denser taxon sampling, maintain the previously found link between gnetophytes and conifers but place gnetophytes as the sister to non-Pinaceae conifers rather than to all conifers. Moreover, the ndh gene losses in both gnetophytes and conifers also highlight their link at the plastid genomic level (Section 2.2.2) (Braukmann et al., 2009).
One of the most important contributions of complete plastid genome data is their help in resolving deep-level relationships of angiosperms. Angiosperms represent a monophyletic group of approximately 270,000 known species (Magallón, 2009). Their origin and early evolution have long intrigued plant biologists. Thanks to intensive efforts, the ANA (formerly called ANITA) taxa have been identified as the earliest branches among extant angiosperms, despite dispute over the relative positions of Amborellaceae and Nymphaeales (Magallón, 2009; Soltis et al., 2009). A recent analysis of up to 81 plastid genes from 64 taxa provided the first strong support for Amborella alone, rather than together with Nymphaeales, as the sister group of the remaining angiosperms (Jansen et al., 2007). This branching order is also recovered in our 79-gene tree (Fig. 6).
Mesangiospermae (sensu Cantino et al., 2007), also called “core angiosperms”, encompasses the great majority of living angiosperms, excluding the ANA and Hydatellaceae. Five major lineages are recognized in this huge clade: Chloranthales, magnoliids, Ceratophyllaceae, monocots, and eudicots. However, relationships among these lineages have been unstable across studies (e.g. Mathews & Donoghue, 1999; Qiu et al., 1999; Graham & Olmstead, 2000; Zanis et al., 2002; Hilu et al., 2003; Leebens-Mack et al., 2005). To address this lingering question, Moore et al. (2007) analyzed a dataset that includes 61 plastid genes (∼42 kb of the plastid genome) from 45 species. Their analyses recovered a clade of Chloranthales + magnoliids as sister to a well-supported clade of monocots + (Ceratophyllum + eudicots). This topology was corroborated by the Jansen et al. (2007) study, which was based on a greatly expanded number of genes and taxa (81 genes from 64 plastid genomes).
4.1 Sequencing more plastid genomes
During the past 20 years, nearly 200 plastid genomes have been completely sequenced. However, neither the absolute number nor the taxonomic coverage is sufficient for understanding the dynamics, rate, and pattern of plastid genome evolution. Recent studies of angiosperm plastid genomes have revealed a lineage-specific correlation between rates of nucleotide substitutions, indels, and genomic rearrangements (Jansen et al., 2007). Whether this rearrangement-related rate acceleration is a general feature of plastid genome evolution remains unknown. Plastid genomes tend to have low GC contents, mostly between 20% and 40%. Yet two recently sequenced plastid genomes of the genus Selaginella have unusually high GC contents, both higher than 50% (Tsuji et al., 2007; Smith, 2009). Are there any other plant lineages with similarly high GC contents? Furthermore, what are the exact evolutionary forces influencing the nucleotide compositional bias of plastid genomes? These are still open questions. More plastid genome data with broader taxonomic coverage are needed to address these issues.
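As a point of reference for the GC content figures quoted above, the following minimal sketch shows how the GC fraction of a sequence can be computed. The function name `gc_content` and the toy sequence are illustrative assumptions and are not taken from any of the cited studies; ambiguous bases are simply ignored here, which is one possible convention among several.

```python
def gc_content(sequence: str) -> float:
    """Return the GC fraction (0 to 1) of a nucleotide sequence.

    Only unambiguous A, C, G, and T (or U) symbols are counted;
    gaps and ambiguity codes such as N are ignored (an assumption
    of this sketch).
    """
    seq = sequence.upper().replace("U", "T")
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous A/C/G/T bases")
    return gc / total


if __name__ == "__main__":
    # Hypothetical toy sequence, not real plastid data.
    toy = "ATGCATTTAA"
    print(f"GC content: {gc_content(toy):.1%}")  # prints "GC content: 20.0%"
```

Applied to a complete plastome sequence read from a file, the same function would give a whole-genome value that can be compared with the 20% to 40% range mentioned above.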
The resolution of phylogenetic relationships is often plagued by the lack of informative characters when studying taxa that are recently derived or the result of a rapid radiation. With the improvement of next-generation sequencing technologies, massively parallel plastome sequencing will become routine in the near future and be affordable for phylogenetic studies at low taxonomic levels. Most recently, the infrageneric phylogeny of Pinus was inferred from 37 nearly complete chloroplast genome sequences generated using the Illumina/Solexa sequencer (Parks et al., 2009). Compared to previous chloroplast-based phylogenetic analyses of Pinus, the plastome-scale data matrix contains 60 times more informative characters, resulting in an extraordinarily high support level (Parks et al., 2009). It can be expected that the plastome-based strategy will soon become an effective option for molecular phylogenetic and phylogeographic analyses.
Another potential application of whole plastid genome data is plant DNA barcoding. Erickson et al. (2008) raised the possibility of plastome-scale barcodes. Several research groups have proposed different plastid regions/combinations as their preferred barcodes, but no consensus has emerged (e.g. Pennisi, 2007; Erickson et al., 2008; Fazekas et al., 2008; Hollingsworth et al., 2009; Kress et al., 2009). Further discussion and debate over these issues can be expected.
4.2 Developing specific bioinformatics tools for plastid genomes
Entire plastid genome data offer an opportunity to exhaustively examine whole-genome features. Manual analysis of huge datasets is not only laborious but also prone to human error. It has become necessary to develop automated algorithms for detecting and analyzing genomic changes concealed in the increasing volume of plastid genomic data. Several general-purpose tools have already been adopted in certain plastid comparative genomic studies, for example, REPuter (Kurtz et al., 2001) for dealing with SDRs, GRAPPA (http://www.cs.unm.edu/~moret/GRAPPA/) and GRIMM (Tesler, 2002) for genome rearrangements, as well as PipMaker (Schwartz et al., 2000), VISTA (Mayor et al., 2000), and zPicture (Ovcharenko et al., 2004) for genome alignments and visualization. Recently, an updated method based on GRAPPA, GRAPPA-IR, has been developed specifically to handle chloroplast genomes; it can accurately recover both genome phylogeny and ancestral gene order (Yue et al., 2008). In the near future, there will be an ever-growing need for plastid genome-specific bioinformatics tools to fulfil the requirements of both qualitative detection and quantitative analysis of plastid genomic features.
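To illustrate the kind of quantity that automated rearrangement analysis works with, the sketch below counts breakpoints between two signed gene orders. This is a deliberately simple measure and is not the algorithm implemented in GRAPPA, GRIMM, or GRAPPA-IR, which estimate inversion distances and phylogenies with more elaborate methods. The integer gene labels, the function names, and the treatment of the genome as a linearized order with fixed ends are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def adjacencies(order: List[int]) -> Set[Tuple[int, int]]:
    """Return all adjacencies of a signed, linearized gene order.

    Each adjacency is stored in both reading directions, so that
    (a, b) and (-b, -a) are treated as the same neighbourhood.
    The order is framed by 0 and n + 1 to fix the two ends.
    """
    framed = [0] + list(order) + [len(order) + 1]
    pairs: Set[Tuple[int, int]] = set()
    for left, right in zip(framed, framed[1:]):
        pairs.add((left, right))
        pairs.add((-right, -left))
    return pairs


def breakpoint_count(reference: List[int], query: List[int]) -> int:
    """Count adjacencies of `reference` that are disrupted in `query`."""
    if sorted(abs(g) for g in reference) != sorted(abs(g) for g in query):
        raise ValueError("both gene orders must contain the same genes")
    query_pairs = adjacencies(query)
    framed = [0] + list(reference) + [len(reference) + 1]
    return sum(
        1 for pair in zip(framed, framed[1:]) if pair not in query_pairs
    )


if __name__ == "__main__":
    # Hypothetical gene orders: genes 2 to 4 are inverted in the query.
    ancestral = [1, 2, 3, 4, 5]
    derived = [1, -4, -3, -2, 5]
    print(breakpoint_count(ancestral, derived))  # prints 2
```

A single inversion disrupts at most two adjacencies, one at each endpoint, which is why the toy example yields two breakpoints. Overlapping inversions, such as those described between rpoB and psbZ, can reuse endpoints, so breakpoint counts alone do not determine the number of events.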
4.3 Chloroplast phylogenomics: Advantages and pitfalls
Chloroplast phylogenomics can overcome the limitations of single-gene phylogenies by combining many genes and ultimately complete plastid genomes (Jeffroy et al., 2006). However, this approach may also be misled by systematic errors arising from non-phylogenetic signals in the data, which are not accounted for in the tree-reconstruction models (Phillips et al., 2004). The systematic errors are mainly derived from: (i) evolutionary rate variations across lineages, which may cause the long-branch attraction artifact; (ii) heterogeneous nucleotide/amino acid compositions; and (iii) heterotachy, namely shifts of position-specific evolutionary rates (Philippe et al., 2005). If the non-phylogenetic signal is strong enough, it will lead to an incorrect but statistically strongly supported tree (Rodriguez-Ezpeleta et al., 2007). A number of approaches have been proposed to alleviate the systematic errors, such as expanding taxon sampling, improving models of sequence evolution, and removing phylogenetically misleading data (Philippe et al., 2005; Jeffroy et al., 2006; Brinkmann & Philippe, 2008). Sometimes, the number of living species in certain lineages is extremely small and nothing can be done to improve taxon sampling. In such cases, removing the non-phylogenetic signal and/or developing new methods insensitive to systematic error becomes the only option for improving phylogenetic reconstruction. However, reducing noise does not guarantee the reduction of errors (Mathews, 2009). Thus, we have to be cautious about over-reliance on plastid phylogenomic trees, whose correctness should be verified through corroboration from other independent sources, such as nuclear gene phylogenies.
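Source (ii) above, compositional heterogeneity, can be screened for before tree reconstruction. The following is a minimal sketch, assuming an alignment held as a dictionary of taxon names to sequences, that computes a chi-square statistic for homogeneity of base composition across taxa. The function name and the toy data are hypothetical. The statistic treats sites and taxa as independent and therefore ignores shared ancestry, so it should be read as a rough diagnostic rather than a rigorous test; converting it to a p-value would additionally require a chi-square distribution function, which is not included here.

```python
from typing import Dict, Tuple

BASES = "ACGT"


def composition_chi_square(alignment: Dict[str, str]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for base
    composition homogeneity across the taxa of an alignment.

    Gaps and ambiguity codes are ignored. Bases that are absent
    from every taxon do not contribute to the degrees of freedom.
    """
    counts = {
        taxon: [seq.upper().count(base) for base in BASES]
        for taxon, seq in alignment.items()
    }
    if len(counts) < 2:
        raise ValueError("at least two taxa are required")
    if any(sum(row) == 0 for row in counts.values()):
        raise ValueError("every taxon needs at least one A/C/G/T base")

    column_totals = [sum(row[i] for row in counts.values())
                     for i in range(len(BASES))]
    grand_total = sum(column_totals)

    chi2 = 0.0
    for row in counts.values():
        row_total = sum(row)
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / grand_total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected

    used_bases = sum(1 for total in column_totals if total > 0)
    dof = (len(counts) - 1) * (used_bases - 1)
    return chi2, dof


if __name__ == "__main__":
    # Hypothetical toy alignment, not real plastid data.
    toy = {"taxon_a": "ACGTACGT", "taxon_b": "ACGTACGT", "taxon_c": "GCGCGCGT"}
    statistic, dof = composition_chi_square(toy)
    print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

A large statistic relative to its degrees of freedom would flag taxa whose composition departs from the rest, for example the GC-rich Selaginella plastomes, as candidates for closer inspection or for analysis under a model that accommodates compositional heterogeneity.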
The impressive progress of complete plastid genome sequencing has not only deepened our understanding of the evolutionary dynamics of plastid genomes but also provided fresh insights into plant phylogeny. The adoption of next-generation sequencing technologies will markedly accelerate data production and lay a sound basis for further exploration of plastid genome evolution. The power of chloroplast phylogenomics has already been shown in resolving intractable relationships, for example, the phylogeny of land plants and the deep-level relationships of angiosperms. Although the incongruences arising from this genome-scale approach attract vigorous debate, it is worth bearing in mind that chloroplast phylogenomics is still in its infancy and will continue to improve.
Acknowledgements This work was supported by the 100 Talent Project of the Chinese Academy of Sciences (Grant No. 0729281F02), the National Natural Science Foundation of China (Grant No. 30970290), the Outstanding Young Scientist Project of the Natural Science Foundation of Hubei Province, China (Grant No. O631061H01), and the Open Project of the State Key Laboratory of Biocontrol, China (Grant No. 2007-01).