Noncoding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms

Authors


Thomas Borsch, Botanisches Institut und Botanischer Garten, Friedrich-Wilhelms-Universität Bonn, Meckenheimer Allee 170, 53115 Bonn, Germany. Tel.: +49-228-73-2681; fax: +49-228-73-3120; e-mail: borsch@uni-bonn.de

Abstract

Recent contributions from DNA sequences have revolutionized our concept of systematic relationships in angiosperms. However, parts of the angiosperm tree remain unclear. Previous studies have been based on coding or rDNA regions of relatively conserved genes. A phylogeny for basal angiosperms based on noncoding, fast-evolving sequences of the chloroplast genome region trnT-trnF is presented. The recognition of simple direct repeats allowed a robust alignment. Mutational hot spots appear to be confined to certain sectors, as in two stem-loop regions of the trnL intron secondary structure. Our highly resolved and well-supported phylogeny depicts the New Caledonian Amborella as the sister to all other angiosperms, followed by Nymphaeaceae and an Austrobaileya–Illicium–Schisandra clade. Ceratophyllum is substantiated as a close relative of monocots, as is a monophyletic eumagnoliid clade consisting of Piperales plus Winterales sister to Laurales plus Magnoliales. Possible reasons for the striking congruence between the trnT-trnF based phylogeny and phylogenies generated from combined multi-gene, multi-genome data are discussed.

Introduction

Flowering plants (angiosperms) are the largest and most diverse group in the plant kingdom. They have undergone an extensive radiation since the Cretaceous, and at present comprise approximately 270 000 species of remarkably diverse biological forms, spanning and dominating most habitats on earth and providing the vast majority of our food crops. Connected to this immense diversity and importance has been the need for understanding their origin and evolution. Recent contributions based on DNA sequences from genes belonging to the three plant genomes (nuclear, chloroplast and mitochondrial) analysed individually and in combinations have provided new insights into flowering plant phylogeny and radically changed our concept of their systematic relationships (Chase et al., 1993; Soltis et al., 1997, 1999a, 2000; Mathews & Donoghue, 1999, 2000; Qiu et al., 1999, 2000; Barkman et al., 2000; Graham & Olmstead, 2000; Savolainen et al., 2000). Although many new lineages have recently been identified, there are still disputable clades in the global angiosperm tree because of incongruence among phylogenies, poor branch resolution or lack of convincing statistical support. As a consequence, additional areas of evidence from new genomic regions or other sources, like the fossil record, remain crucial.

The recent surge in applying molecular techniques in systematic biology has also raised important issues relevant to understanding patterns of molecular evolution of genes and genomes and their implications for organismal phylogenies. The issue of incongruence among phylogenies inferred from different genes underscores a central problem in phylogenetic studies, namely that of distinguishing gene trees that reflect gene phylogenies from organismal trees that depict the evolutionary history of the organisms (e.g. Doyle, 1992; Moritz & Hillis, 1996). Differences between gene trees and organismal trees can be caused either by intrinsic biases of the genes, such as functional constraints resulting in heterogeneity in rates and modes of substitution, or by extrinsic factors such as deep coalescence, gene duplication and horizontal gene transfer (Doyle, 1992; Swofford et al., 1996). Combining data sets has helped to resolve most of the problems that arose from single-gene analyses of angiosperms (Qiu et al., 1999, 2000; Soltis et al., 1999a, 2000). Nevertheless, in some cases, like the analysis of combined rbcL and atpB data sets (Savolainen et al., 2000), potential dominance of information from one gene could generate evolutionary noise that obscures to varying degrees the true organismal phylogeny. For example, parsimony analysis of atpB sequences alone resolves Ceratophyllum as sister to Acorus, and the two as sister to all other monocots, whereas the combined analysis of atpB and rbcL shows Ceratophyllum as sister to all other angiosperms, reflecting its position in the rbcL analysis alone. Combining data from different genes may also cause a decrease in resolution in parts of the phylogeny and create weak support for some clades when there is incongruence between the original data sets. These shortcomings may only be overcome by sampling high numbers of independently varying characters (e.g. Graham & Olmstead, 2000; Qiu et al., 2000).

Coding regions of rather conserved genes are typically used in reconstructing deep-level phylogenies, such as relationships among major angiosperm lineages. This practice is based on the premise that the low rates of substitution characteristic of those genes reduce incidents of multiple hits that could obscure historical signal, keeping levels of homoplasy at a minimum. In addition, relative ease of sequence alignment makes homology assessment within so-called conserved genes very straightforward. In contrast, noncoding regions have been deemed unsuitable for resolving such phylogenies because of high mutational rates. Noncoding regions, on the other hand, being functionally less constrained than coding regions (e.g. Morton & Clegg, 1993; Clegg et al., 1994) may render fixation of a greater number of substitutions during cladogenesis closer to a stochastic process (i.e. selectively closer to neutral mutations; Jukes & King, 1971; Kimura, 1983). Consequently, the mutations would not to a larger extent be biased by and reflect the functional evolution of the gene.

Our application of trnT-trnF sequences to a phylogenetic analysis of the waterlily genus Nymphaea (Borsch, 2000) demonstrated that alignment of outgroup sequences beyond the Nymphaeaceae sensu APG (1998; corresponds to Nymphaeales as comprising the genera Brasenia, Cabomba, Nuphar, Barclaya, Ondinea, Victoria, Euryale, Nymphaea) is possible and led us to employ the region in investigating relationships among basal angiosperms. The trnT-trnF region is located in the large single-copy region of the chloroplast genome, approximately 8 kb downstream of rbcL. Three highly conserved transfer RNA genes [tRNA genes for threonine (UGU), leucine (UAA) and phenylalanine (GAA)] are found in tandem, separated by spacers of several hundred base pairs (bp) (Fig. 1). The high variability of the two spacers and the intron in trnL have led to the wide use of trnT-trnF sequences in addressing relationships at the species and genus levels (e.g. Taberlet et al., 1991; Van Ham et al., 1994; Sang et al., 1997; Small et al., 1998; Bakker et al., 2000). Moreover, the region was quite informative in phylogenetic studies of families like Asteraceae (Bayer & Starr, 1998), Arecaceae (Asmussen & Chase, 2001) and Rhamnaceae (Richardson et al., 2000) and orders like Laurales (Renner, 1999) and Magnoliales (Sauquet et al., in press).

Figure 1.

Structure of the trnT-trnF region in basal angiosperms and gymnospermous outgroups based on the data set used in the present study. tRNA genes (trnT and trnF are each 73 bp long) and exons (trnL-5′ is 35 bp and 3′ is 50 bp) are represented by black boxes. The spacers and the intron are illustrated by an empty bar with mutational hot spots in grey. Proportions reflect average sequence length of the sequenced taxa. Mean sequence lengths (and standard deviations, SD) in bp are 298 in H1 (SD = 222), 12 in H2 (SD = 2), four in H3 (SD = 2), 40 in H4 (SD = 7), 12 in H5 (SD = 7), 30 in H6 (SD = 31), 19 in H7 (SD = 21), and 7 in H8 (SD = 3). Minimum and maximum sizes of the spacers and the intron among the taxa sequenced are indicated below the bar. Positions of primers are marked by arrows.

In the present study, the entire trnT-trnF region was sequenced from 32 families representing most lineages of basal angiosperms. The confinement of the extreme variability to certain mutational hot spots and the presence of a majority of length mutational events in simple sequence repeats (SSRs) of 3–5 bp facilitated the alignment. Mutationally flexible stretches of sequence in the trnL intron correspond to two stem-loop regions in P8 of the proposed RNA secondary structure. This study presents a phylogenetic tree for basal angiosperms based on trnT-trnF sequence data that is largely congruent with multi-gene, multi-genome studies and demonstrates that fast-evolving, noncoding sequences do not necessarily show total saturation when applied to deep-level phylogenetic questions in angiosperms, but on the contrary, yield a phylogeny with many of the nodes receiving statistical support. This empirical analysis therefore is in line with expectations drawn from recent simulation studies (Hillis, 1998) in that higher evolutionary rates may be beneficial for reconstructing correct phylogenies.

Materials and methods

Material

Sequences from the trnT-trnF region were obtained for 38 angiosperms representing 28 families and three gymnosperms. The species, their respective families and the sources of material are listed in Table 1. The Pinus trnT-trnF sequence was obtained from the complete sequence of the chloroplast genome (Tsudzuki et al., 1992; GenBank number NC001631). Classification is in accordance with the APG (1998) system. However, for the Chloranthales (comprising Chloranthaceae) and Winterales (comprising Canellaceae and Winteraceae), an ordinal rank is accepted because (1) published ordinal names exist, (2) these groups are now identified as clearly monophyletic lineages, and (3) they do not belong to the basal angiosperm grade comprising Amborellaceae, Nymphaeaceae, Austrobaileyaceae, Illiciaceae, Schisandraceae and Trimeniaceae.

Table 1. Taxa used in the study, their respective families, source of material, location of voucher specimens, and GenBank numbers of deposited sequences.

Genus/species | Family | Garden/field origin | Voucher | GenBank number
Acorus gramineus L. | Acoraceae | Bonn Bot. Gard. | Borsch 3458 (BONN) | AY145336
Aextoxicon punctatum Ruiz & Pav. | Aextoxicaceae | Bonn Bot. Gard. | Borsch 3459 (BONN) | AY145362
Amborella trichopoda Baill. | Amborellaceae | University of California, Sta. Catarina Bot. Gard. | Borsch 3480 (VPI) | AY145324
Annona muricata L. | Annonaceae | Bonn Bot. Gard. | Borsch 3460 (BONN) | AY145352
Asimina triloba Dun. | Annonaceae | Bonn Bot. Gard. | Borsch 3461 (BONN) | AY145353
Orontium aquaticum L. | Araceae | Bonn Bot. Gard. | Borsch 3457 (BONN) | AY145338
Araucaria araucana C. Koch | Araucariaceae | Bonn Bot. Gard. | Borsch 3462 (BONN) | AY145321
Nypa fruticans Wurmb. | Arecaceae | Bonn Bot. Gard. | Borsch 3463 (BONN) | AY145339
Aristolochia pistolochia L. | Aristolochiaceae | France, Herault | Borsch 3257 (FR) | AY145341
Saruma henryi Oliv. | Aristolochiaceae | Bonn Bot. Gard. | Borsch 3456 (BONN) | AY145340
Austrobaileya scandens C. White | Austrobaileyaceae | Bonn Bot. Gard. | Borsch 3464 (BONN) | AY145326
Buxus sempervirens L. | Buxaceae | Bonn Bot. Gard. | Borsch 3465 (BONN) | AY145357
Brasenia schreberi Gmelin | Cabombaceae | USA, Virginia | Borsch & Wieboldt 3298 (VPI, FR) | AY145329
Cabomba caroliniana Grey | Cabombaceae | USA, Virginia | Ludwig, J.C. s.n. (VPI) | AY145328
Calycanthus floridus L. var. laevigatus (Willd.) T. & G. | Calycanthaceae | Bonn Bot. Gard. | Borsch 3455 (BONN) | AY145349
Canella winterana Gaertn. | Canellaceae | Bonn Bot. Gard. | Borsch 3466 (BONN) | AY145348
Ceratophyllum demersum L. | Ceratophyllaceae | USA, Virginia | Wieboldt 16073 (VPI) | AY145335
Chloranthus brachystachys Bl. | Chloranthaceae | Bonn Bot. Gard. | Borsch 3467 (BONN) | AY145334
Dicentra eximia (Ker Gawl.) Torr. | Fumariaceae | Bonn Bot. Gard. | Borsch 3468 (BONN) | AY145361
Ginkgo biloba L. | Ginkgoaceae | Virginia Tech Bot Gard. | Borsch 3469 (VPI) | AY145323
Gnetum gnemon L. | Gnetaceae | Bonn Bot. Gard. | Borsch 3470 (BONN) | AY304546
Illicium floridanum Ellis | Illiciaceae | USA, Florida | Borsch & Wilde 3104 (VPI, FR) | AY145325
Lactoris fernandeziana Phil. | Lactoridaceae | DNA from Tod Stuessy | Stuessy s.n. | AY145324
Umbellularia californica Nutt. | Lauraceae | Bonn Bot. Gard. | Borsch 3471 (BONN) | AY145350
Liriodendron tulipifera L. | Magnoliaceae | USA, Virginia | Slotta s.n. (VPI) | AY145356
Magnolia virginiana L. | Magnoliaceae | USA, Maryland | Borsch & Neinhuis 3280 (VPI, FR) | AY145354
Michelia champaca L. | Magnoliaceae | Bonn Bot. Gard. | Borsch 3472 (BONN) | AY145355
Myristica fragrans Houtt. | Myristicaceae | Bonn Agr. Bot. Gard. | Borsch 3473 (BONN) | AY145351
Nelumbo nucifera subsp. lutea (Willd.) Borsch & Barthlott | Nelumbonaceae | USA, Missouri | Borsch & Summers 3220 (FR) | AY145359
Nuphar advena (Aiton) W.T. Aiton | Nymphaeaceae | USA, Florida | Borsch & Wilde 3093 (FR) | AY145351
Nuphar lutea (L.) Sibth. & Sm. | Nymphaeaceae | Germany, Hesse | Borsch 3337 (FR) | AY145330
Nymphaea odorata Ait. subsp. odorata | Nymphaeaceae | USA, Georgia | Borsch & Wilde 3132 (VPI, BONN) | AY145333
Victoria cruciana Orbign. | Nymphaeaceae | Bonn Bot. Gard. | Borsch 3474 (BONN) | AY145332
Piper angustum Rudge | Piperaceae | Missouri Bot. Gard. | Acc. 910150 | AY145345
Piper spec. | Piperaceae | Bonn Bot. Gard. | Borsch 3475 (BONN) | AY145346
Platanus occidentalis L. | Platanaceae | USA, Virginia | Slotta s.n. (VPI) | AY145358
Houttuynia cordata Thunb. | Saururaceae | Bonn Bot. Gard. | Borsch 3481 (BONN) | AY145344
Saururus cernuus L. | Saururaceae | USA, Florida | Borsch & Wilde 3108 (VPI, FR) | AY145343
Schisandra rubriflora | Schisandraceae | Bonn Bot. Gard. | Borsch 3477 (BONN) | AY145327
Tofieldia glutinosa (Michx.) Pers. | Tofieldiaceae | USA, | Borsch, Hellquist, Wiersema 3393 (VPI, BONN) | AY145337
Trochodendron aralioides P.F. Siebold & J.G. Zuccarini | Trochodendraceae | Bonn Bot. Gard. | Borsch 3478 (BONN) | AY145360
Drimys winteri J.R. Forster & G. Forster | Winteraceae | Bonn Bot Gard. | Borsch 3479 (BONN) | AY145347

DNA isolation, amplification and sequencing

Total genomic DNA was isolated from frozen (stored at −80 °C) or silica-gel-dried leaf tissue using a modified cetyltrimethylammonium bromide (CTAB) method (extraction buffer: 2% CTAB, 1% polyvinylpyrrolidone, 100 mM Tris (pH 8), 20 mM EDTA, 1.4 M NaCl). The isolation procedure was modified in the present study by introducing triple CTAB extractions to yield optimal quantities of high-quality DNA from tissues with considerable amounts of secondary compounds, which occur in many basal angiosperms. This protocol is a modification of a miniprep procedure described in Liang & Hilu (1996). About 100 mg of dry tissue (equalling approximately 300 mg of fresh tissue) was ground in liquid N2 and incubated at 65 °C for 30 min with 700 μL of CTAB. After centrifuging and transferring the supernatant into a clean tube, the same tissue was reincubated twice with CTAB solution. All three preparations were kept separate. The CTAB solutions were then extracted with chloroform twice, and the DNA was subsequently precipitated with ethanol. After separately resuspending the pellets from all extraction steps in TE, two cleaning steps were carried out: the first by adding one-half volume 7.5 M ammonium acetate and precipitating with 100% ethanol, and the second by adding one-half volume 3 M sodium acetate and precipitating with ethanol. Genomic DNA from the second and third extractions was usually clean enough to be directly used for polymerase chain reaction (PCR) amplification.

The region was PCR-amplified in two overlapping fragments with universal primers (Taberlet et al., 1991) annealing to the tRNA genes. Primers a and d or rps4-5F (5′-AGGCCCTCGGTAACGSG-3′, designed in this study) and d were used to amplify the trnT-L spacer together with the trnL intron, and primers c and f were used to amplify the trnL intron and the trnL-F spacer. Amplification conditions were: 34 cycles of 94 °C (1 min) denaturation, 52 °C (1 min) annealing, 72 °C (2 min) extension and 72 °C (15 min) final extension. The PCR products were then purified using a QiaQuick gel extraction kit (QIAGEN, Inc., Valencia, CA, USA) and directly sequenced with an ABI Prism™ BigDye Terminator Cycle Sequencing Ready Reaction Kit (Perkin Elmer, Norwalk, CT, USA) on ABI 310 and 377 automated sequencers. In addition to the above-mentioned primers, trnL110R (5′-GAT TTG GCT CAG GAT TGC CC-3′) was designed as an additional universal sequencing primer for angiosperms.
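
As a rough illustration, the cycling conditions above can be written down as a small data structure and the programmed hold time summed. This is only a sketch: it assumes the 15 min final extension is run once after the 34 cycles, it ignores ramping time between temperatures, and the names `Step`, `CYCLE` and `total_hold_minutes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int    # block temperature in degrees Celsius
    minutes: int   # hold time in minutes


# One amplification cycle as stated in the text.
CYCLE = (
    Step("denaturation", 94, 1),
    Step("annealing", 52, 1),
    Step("extension", 72, 2),
)
N_CYCLES = 34
# Assumption: the final extension is a single step after all cycles.
FINAL_EXTENSION = Step("final extension", 72, 15)


def total_hold_minutes(cycle, n_cycles, final_step):
    """Sum programmed hold times only; ramping between steps is not modelled."""
    return n_cycles * sum(step.minutes for step in cycle) + final_step.minutes


print(total_hold_minutes(CYCLE, N_CYCLES, FINAL_EXTENSION))  # 34 * 4 + 15 = 151
```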

Sequence alignment

Sequence divergence in noncoding regions is caused by a variety of small structural changes in addition to substitution events. We concur with the opinion expressed by Thorne et al. (1992), Gu & Li (1995), Benson (1997), Kelchner (2000), and others that the nature of the underlying molecular processes leading to structural changes has to be used as the basis for alignment. Therefore, the processes creating length mutations need to be considered as of particular importance for homology assessment. In trnT-trnF sequences, most of the structural changes are known to be SSRs of 4 bp and more (e.g. Van Ham et al., 1994; Bayer & Starr, 1998). Small indels (1–3 bp) are rare and usually confined to poly-A/T strings. Several algorithms for multiple sequence alignment have been developed (e.g. McClure et al., 1994). However, currently available algorithms do not always allow a safe recognition of structural motifs of unpredictable kind, length and complexity (e.g. Graham et al., 2000; Kelchner, 2000), such as SSRs occurring in tandem, shorter indels occurring within larger, clearly delimited indels, or small inversions. These difficulties are caused by defining nucleotides as discrete and independent characters throughout all alignment positions (Kelchner, 2000), regardless of the possibility that a single length mutational event might have involved several nucleotides at once or not. This also explains why the application of fixed gap costs in current alignment algorithms can result in an alignment that deviates from the optimum (i.e. if length mutations are considered putative single events). Therefore, alignment was carried out by eye based on direct sequence comparison using quickalign 1.5.5 (Müller, 2000), a program designed for optimal manual sequence adjustment. For stringency, rules for manual alignment are required that consider known mechanisms of sequence evolution as well as other, similarity-based criteria for homology assessment, as proposed by Golenberg et al. (1993), Kelchner & Clark (1997), Hoot & Douglas (1998), Graham et al. (2000), Simmons & Ochoterena (2000) and others, which have been accepted in many studies including the present study. Similarity is a valid criterion to hypothesize homology not only in morphological but also in molecular characters (Doyle & Davis, 1998). Indels are called ‘entire’ (i.e. positional extension is identical in all taxa in which an indel occurs; Graham et al., 2000) or ‘overlapping’ (i.e. positional extensions differ in different taxa). Overlapping indels may be explained by two or more subsequent length mutational events in one taxon, or by different, overlapping length mutational events in different taxa. Inversions are not discussed as they were not found in the present data set.
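
To make the notion of a simple direct repeat concrete, the following is a minimal sketch of how tandem duplications of short motifs could be flagged in a single sequence before manual alignment. It is not the procedure used in this study, where motifs were recognized by eye; the function name `find_direct_repeats`, the 4–6 bp motif window and the flanking bases in the example are assumptions made for illustration. Homonucleotide motifs are skipped because such strings are treated separately in the alignment rules.

```python
def find_direct_repeats(seq, min_len=4, max_len=6):
    """Return (start, motif, copies) for tandem direct repeats.

    start is a 0-based index; only motifs of min_len..max_len bases that
    occur at least twice in immediate succession are reported.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_len, max_len + 1):
        i = 0
        while i + 2 * k <= len(seq):
            motif = seq[i:i + k]
            copies = 1
            # Count how many times the motif is repeated back to back.
            while seq[i + copies * k:i + (copies + 1) * k] == motif:
                copies += 1
            # Skip homonucleotide motifs such as AAAA or TTTTT.
            if copies >= 2 and len(set(motif)) > 1:
                hits.append((i, motif, copies))
                i += copies * k
            else:
                i += 1
    return hits


# Hypothetical flanking sequence around a duplicated 'TTATG' motif.
example = "ACGT" + "TTATG" * 2 + "CCA"
print(find_direct_repeats(example))  # [(4, 'TTATG', 2)]
```

A scan like this only proposes candidate motifs; deciding whether a repeat is homologous across taxa still requires the alignment context.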

The rules employed for the trnT-trnF alignment are presented below.

  • 1. Gap insertion. For the insertion of gaps, attention was given to both the potentially inserted sequence and its neighbouring sequences. A gap was inserted only when it prevented the inclusion of more than two substitutions among closely adjacent nucleotides in the alignment. This decision is based on empirical data from analyses of trnT-trnF sequences in basal angiosperms (Borsch, 2000) where length mutations were found to occur approximately half as frequently as substitutions.
  • 2. Placement of gaps. For the placement of gaps, the recognition of sequence motifs was given priority following Kelchner & Clark (1997); in this data set, the only motifs are direct SSRs. Golenberg et al. (1993), who first proposed alignment rules for length-variable DNA sequences, called multinucleotide repeats ‘Type 1b gaps’. Giving priority to a motif can result in insertions that are correctly aligned as nonhomologous (i.e. with different positional extensions) although sequence similarity would warrant their inaccurate placement under the same column (e.g. example 6B in Kelchner, 2000).
  • 3. Homonucleotide strings. Individual positions in homonucleotide strings of different lengths (poly-As or -Ts) are considered to be of uncertain homology (Golenberg et al., 1993; Hoot & Douglas, 1998; Kelchner, 2000) and are therefore excluded. Slipped strand mispairing (Levinson & Gutman, 1987) is likely to have led to numerous length mutational events involving one to several nucleotides. As only nucleotides of the same kind are involved, accurate motif recognition is not possible.
  • 4. Determination of entire indels. Entire indels of the same positional extension and of complete sequence similarity were very easily assessed as primary homologous sensu De Pinna (1991) and consequently placed in the same column(s) of the alignment. During primary homology assessment, no inference had to be made regardless of whether the length mutational event occurred in a common ancestor of all taxa sharing it or in parallel in different lineages. This is analogous to the fact that the synapomorphic status of a substitution in a particular position is not inferred in the alignment process. Recognition of a repeat motif was regarded as further evidence for correctly recognizing a length mutational event.
  • 5. Substitutions in indels. If entire indels of the same positional extension differed by individual substitutions, two principally different cases were distinguished. (A) Direct repeats with exact duplication of a sequence template that has already acquired a substitution (compared with other taxa in the alignment). The presence of autapomorphic or synapomorphic substitutions in the template sequence in this case implies that the repeat event happened after the substitution event. Compared with taxa without substitutions, those motifs provide evidence for unravelling the parallel nature of an insertion event before its potential synapomorphic status could have been tested in a phylogenetic context. As cases without substitutions do not allow such inference, and levels of homoplasy in length mutational events should be assessed equally over all alignment parts, positional extension of indels is regarded as a decisive criterion. A side-effect is that such substitutions in indels can receive double weight in phylogenetic analysis, but the signal would still be in favour of the correct topology. (B) Repeats with substitutions not found in their template sequence: this implies that substitutions either occurred in the template or inserted sequence during or after the replication process. As there is no way of distinguishing which of the nucleotides were the template and which were inserted, correct assignment of these variable positions is not possible. Consequently, variable positions of case B were excluded from the analysis for objective homology assessment following Kelchner (2000) and Asmussen & Chase (2001). We followed this more conservative approach, although Graham et al. (2000) did not see the need for exclusion.
  • 6. Overlapping indels. Those indels can be explained by two or more length mutational events and are also called ‘progressive step indels’ (Kelchner, 2000). For their alignment, a parsimony principle is employed where the least number of steps required is assumed as most probable. The least number of steps can only be inferred using a global perspective. For detecting alternative explanations, all sequences that are length variable in the respective region were placed in close proximity. Gaps were then placed so that only a minimum number of rectangles are required to describe the gaps globally. When this criterion is applied, two different cases need to be distinguished. (1) Overlapping indels with complete sequence similarity can be easily considered primary homologous following Kelchner (2000; example 5). This assumption is valid regardless of whether or not repeat motifs can be identified or the origin of an inserted sequence can be determined. (2) In case of overlapping indels differing by individual substitutions, homology assessment can follow sequence similarity criteria to place overlapping sequence parts (i.e. nucleotides present in taxa with shorter than the largest gaps). Where SSRs were involved, the above-mentioned rules had to be applied. If different placements of overlapping sequence parts (including different arrangement of variable sites) requiring the same minimal number of length mutational events were possible, homology was considered uncertain. Other authors (Gatesy et al., 1993; Davis et al., 1998; Simmons & Ochoterena, 2000) do not think that the latter have to be excluded because these alternative positions are considered to be neutral in parsimony searches. We followed the more conservative approach.
  • 7. Regions of uncertain homology. Those regions (referred to here as hot spots) were excluded from phylogenetic analysis following Swofford & Olsen (1990) and Swofford et al. (1996). As these hot spots are confined to a few blocks, their removal does not constitute a subjective exclusion of information. The core structure of trnT-trnF sequences is represented across the data set. Furthermore, these excluded blocks are comparatively small. Depending on the species, they represent approximately 7.7% (Amborella) to an average of about 20% of the entire region for the ingroup taxa and below the average for the outgroups (Araucaria = 13.8%; Pinus = 14.7%; Ginkgo = 12.9%; see Table 2). Accurate information on the location of hot spots in the sequences of all species is provided in Table 2. A similar approach was adopted in broad scale analyses of 18S rDNA in angiosperms and land plants (e.g. Soltis et al., 1997, 1999b).

Table 2. Actual length of the trnT-trnF region in basal angiosperms and gymnospermous outgroups and positions of mutational hot spots in the respective sequences (counts start with position 1 from the 5′ end of each spacer and the intron). Note that length variation of the spacers and the intron is mainly caused by the insertion of nucleotides within the hot spot regions, which differ depending on the species. Where there are no insertions in a hot spot region in individual taxa, the hot spot is considered as not present, ‘n.p.’ For mean sizes of hot spots see Fig. 1.

Taxon | Length of trnT-L spacer (bp) | Length of trnL intron (bp) | Length of trnL-F spacer (bp) | Position of H1 | Position of H2 | Position of H3 | Position of H4 | Position of H5 | Position of H6 | Position of H7 | Position of H8
Araucaria | 412 | 495 | 466 | 60–94 | 215–218 | n.p. | 289–325 | 241–248 | 290–292 | 192–294 | n.p.
Pinus | 423 | 489 | 377 | 49–88 | 214–217 | 278 | 294–336 | 252–263 | 296–298 | 137–219 | 267–270
Ginkgo | 379 | 500 | 362 | 51–88 | 201–204 | 270 | 271–295 | 241–248 | 286–304 | 137–189 | 258–269
Amborella | 474 | 475 | 374 | 74–88 | 206–218 | 321–323 | 337–388 | 233–240 | 285–295 | n.p. | 270–275
Illicium | 617 | 518 | 243 | 62–242 | 377–390 | 469–472 | 486–523 | 257–264 | 302–314 | n.p. | 127–133
Austrobaileya | 684 | 476 | 389 | 67–309 | 444–456 | 535–537 | 551–589 | 234–241 | 279–291 | 150–183 | 279–285
Schisandra | 554 | 484 | 396 | 56–197 | 327–339 | 418–420 | 434–464 | 242–249 | 287–293 | 145–173 | 279–285
Cabomba | 484 | 508 | 396 | 76–140 | 253–265 | 351–353 | 376–402 | 242–245 | 283–325 | 176–220 | 288–293
Brasenia | 479 | 522 | 359 | 77–128 | 241–253 | 337–339 | 362–387 | 240–243 | 281–338 | 162–194 | 257–263
Nuphar lut | 483 | 588 | 365 | 61–124 | 241–253 | 338–340 | 354–388 | 243–246 | 284–405 | 158–190 | 256–261
Nuphar ad | 478 | 607 | 365 | 61–124 | 241–253 | 338–340 | 354–388 | 244–247 | 285–424 | 158–190 | 256–261
Victoria | 467 | 577 | 373 | 62–113 | 225–237 | 322–324 | 338–370 | 241–244 | 282–392 | 145–183 | 246–255
Nymphaea | 476 | 521 | 379 | 62–113 | 225–237 | 327–329 | 343–384 | 241–244 | 282–336 | 174–206 | 269–275
Chloranthus | 797 | 495 | 351 | 55–455 | 584–596 | 672 | 694–717 | 250–257 | 295–311 | 166–171 | 242–248
Ceratophyllum | 838 | 530 | 441 | 64–476 | 595–607 | 687–688 | 706–749 | 243–256 | 307–351 | 190–206 | 303–312
Acorus | 726 | 522 | 376 | 56–359 | 489–501 | 580–582 | 598–636 | 243–272 | 318–331 | 174–180 | 260–268
Tofieldia | 1385 | 521 | 239 | 60–1000 | 1126–38 | 1217–19 | 1236–78 | 244–262 | 300–333 | 24–34 | 125–131
Orontium | 768 | 615 | 164 | 69–364 | 508–520 | 599–601 | 618–671 | 269–293 | 331–413 | n.p. | n.p.
Nypa | 794 | 522 | 345 | 61–424 | 556–568 | 655–659 | 677–711 | 244–268 | 306–331 | 170–180 | n.p.
Saruma | 750 | 505 | 356 | 53–360 | 493–505 | 583–586 | 608–641 | 245–256 | 298–328 | 171–180 | 247–253
Aristolochia | 716 | 512 | 371 | 55–323 | 453–465 | 558–560 | 582–618 | 261–278 | 321–345 | 167–180 | 257–263
Lactoris | 795 | 498 | 373 | 79–412 | 548–560 | 638–640 | 654–695 | 255–265 | 300–321 | 171–184 | 262–269
Saururus | 877 | 495 | 350 | 56–461 | 595–607 | 700–709 | 726–768 | 258–265 | 302–319 | 176–191 | 261–270
Houttuynia | 1411 | 491 | 353 | 56–979 | 1117–29 | 1222–31 | 1248–91 | 257–264 | 301–315 | 174–189 | 262–271
Piper ang. | 844 | 490 | 375 | 56–426 | 560–572 | 665–674 | 691–734 | 255–262 | 299–314 | 172–193 | 271–280
Piper spec. | 802 | 491 | 381 | 556–379 | 513–525 | 618–627 | 644–687 | 255–262 | 299–315 | 172–199 | 277–286
Drimys | 717 | 497 | 359 | 60–346 | 480–492 | 569–571 | 588–629 | 237–254 | 292–316 | 174–179 | 249–255
Canella | 793 | 479 | 258 | 56–412 | 546–558 | 637–639 | 656–699 | 239–252 | 290–309 | 73–78 | 148–154
Calycanthus | 653 | 324 | 332 | 77–274 | 400–412 | 494–496 | 513–555 | n.p. | n.p. | 151–156 | 216–228
Umbellularia | 545 | 484 | 362 | 33–160 | 286–293 | 374–376 | 393–452 | 242–254 | 292–310 | 176–181 | 252–258
Myristica | 881 | 503 | 300 | 55–505 | 642–653 | 732–734 | 751–797 | 235–258 | 296–324 | 175–181 | 252–258
Annona |  | 349 | 378 |  |  |  |  | 241–258 | n.p. | 181–191 | 257–263
Asimina |  | 496 | 390 |  |  |  |  | 242–259 | 297–315 | 188–198 | 264–275
Magnolia | 766 | 492 | 355 | 56–409 | 539–551 | 630–632 | 649–687 | 241–258 | 296–313 | 168–174 | 245–251
Michelia | 772 | 492 | 356 | 56–415 | 545–557 | 636–638 | 655–693 | 241–258 | 296–313 | 169–175 | 246–252
Liriodendron | 783 | 491 | 361 | 57–411 | 542–554 | 633–635 | 652–698 | 241–258 | 296–312 | 170–180 | 251–257
Buxus | 685 | 507 | 377 | 62–298 | 419–431 | 510–516 | 533–580 | 252–269 | 312–328 | 177–182 | 256–262
Platanus | 1011 | 525 | 365 | 56–630 | 752–764 | 843–850 | 867–910 | 258–276 | 319–346 | 151–156 | 235–241
Nelumbo | 1035 | 527 | 400 | 56–663 | 784–796 | 875–877 | 894–937 | 257–276 | 319–342 | 187–192 | 285–291
Trochodendron | 1077 | 441 | 368 | 57–719 | 825–837 | 916–918 | 935–978 | n.p. | 248–262 | 191–196 | 269–275
Dicentra | 713 | 476 | 359 | 56–319 | 433–445 | 526–528 | 545–585 | 238–249 | 289–307 | 158–161 | 232–238
Aextoxicon | 857 | 511 | 354 | 87–459 | 593–605 | 691–693 | 710–752 | 261–278 | 321–337 | 166–171 | 239–245
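
The exclusion of hot spots described in rule 7 amounts to dropping defined coordinate ranges from each sequence. The sketch below shows this for a single spacer, assuming 1-based inclusive coordinates as in Table 2 and assuming that H1–H4 refer to the trnT-L spacer (an inference from the position values, not stated explicitly in the table). The function name `mask_hot_spots` and the placeholder sequence are hypothetical; real data would be masked in the alignment rather than in unaligned sequences.

```python
def mask_hot_spots(seq, hot_spots):
    """Return seq without the given 1-based, inclusive (start, end) ranges."""
    excluded = set()
    for start, end in hot_spots:
        excluded.update(range(start, end + 1))
    return "".join(base for pos, base in enumerate(seq, start=1)
                   if pos not in excluded)


# Amborella, trnT-L spacer: length and H1-H4 positions taken from Table 2.
AMBORELLA_SPACER_LENGTH = 474
AMBORELLA_HOT_SPOTS = [(74, 88), (206, 218), (321, 323), (337, 388)]

# Placeholder sequence of the right length; the real bases are not needed
# to illustrate how many positions are retained.
placeholder = "N" * AMBORELLA_SPACER_LENGTH
retained = mask_hot_spots(placeholder, AMBORELLA_HOT_SPOTS)
print(len(retained))  # 474 - (15 + 13 + 3 + 52) = 391
```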

Determination of secondary structure

Based on the Michel–Westhof model of the catalytic core (Michel & Westhof, 1990), Cech et al. (1994) proposed a convention to draw secondary structures of group I introns, which is followed here. Cech et al. (1994) pointed out that introns might vary considerably in size and number of helical elements, especially at P8. Consequently, the different helical elements (stem-loop regions P1 and 2, P4 and 5, P6, P8 and P9) as well as the cloverleaf structure of the tRNA-leucine have been predicted using free energy minimization (Jaeger et al., 1989; Zuker, 1994; Zuker et al., 1999). In order to characterize the borders of highly variable parts within the P8 stem-loop region of the trnL intron, we chose an integrated approach of comparative sequence analysis and free energy minimization as proposed by Jaeger et al. (1990). Predictions of secondary structures based on free energy minimization were computed with RNAstructure 3.6 (Mathews et al., 2001) and with the mfold server (http://www.bioinfo.math.edu/mfold), which allowed a more adequate selection of parameters.

Phylogenetic analysis

Analyses were based on nucleotide substitutions, and gaps were treated as missing characters. This approach also allowed us to compare the results with those based on coding genomic regions. Phylogenetic trees were constructed with PAUP* 4.0b6 (Swofford, 2001) employing maximum parsimony (MP) with heuristic searches consisting of 100 and 1000 replicates of random stepwise addition with MULPARS in effect and tree bisection reconnection (TBR) branch swapping. Characters were optimized with ACCTRAN. Measures of support for individual clades are based on bootstrap analysis of 500 replicates and decay analysis as implemented in PAUP* and AutoDecay (Eriksson & Wikström, 1996). Numbers of steps per site were calculated using the CHART option of MacClade 3.07 (Maddison & Maddison, 1997). The data were also analysed with maximum likelihood (ML) implemented in PAUP*. A general time reversible model was employed as an approach for direct estimation of substitution rate matrix parameters and nucleotide frequencies via ML. We are aware that under these settings calculation time might be higher compared with less complex models. Rate heterogeneity across sites was modelled with four rate categories approximating a gamma distribution. Heuristic search was performed with starting trees obtained by ‘as-is’ stepwise addition, and TBR was used as branch swapping algorithm with MulTrees in effect.
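
To illustrate what a parsimony tree length means when gaps are treated as missing data, the following is a minimal sketch of Fitch's algorithm for scoring one fixed tree. It is not the heuristic search performed by PAUP*, which additionally explores tree space by stepwise addition and TBR swapping. The taxon labels, the toy alignment and the function names are hypothetical.

```python
NUCLEOTIDES = frozenset("ACGT")


def fitch_site(tree, states):
    """Return (state_set, steps) for one alignment column.

    tree is a taxon name (leaf) or a pair (left, right) of subtrees;
    states maps taxon name to the character observed in this column.
    """
    if isinstance(tree, str):
        base = states[tree].upper()
        if base in NUCLEOTIDES:
            return frozenset(base), 0
        # Gap or unknown: treated as missing, i.e. compatible with any base.
        return NUCLEOTIDES, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one substitution is required at this node.
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_length(tree, alignment):
    """Sum Fitch steps over all columns of an alignment (dict taxon -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical four-taxon example; '-' marks a gap scored as missing.
toy_tree = (("t1", "t2"), ("t3", "t4"))
toy_alignment = {"t1": "AAG-", "t2": "AAGT", "t3": "ACTT", "t4": "ACTA"}
print(parsimony_length(toy_tree, toy_alignment))  # 3
```

Because the gap in `t1` is compatible with any base, it adds no step in the last column; only the A/T difference between `t3` and `t4` is counted there.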

Results

Variability of the trnT-trnF region in basal angiosperms

In the angiosperm taxa studied, the overall length of the trnT-trnF region (excluding the tRNA genes; Fig. 1, Table 2) ranges from 1309 to 2255 bp; the trnT-L spacer accounts for 467–1411 bp, the trnL intron for 324–615, and the trnL-F spacer for 164–441. The trnT-trnF region is similar in length within the four gymnosperm taxa sequenced as outgroups, except for Gnetum in which the two spacers are considerably shorter (280 and 138 bp) and the intron is only somewhat reduced in size (346 bp). The alignment (see ‘Supplementary material’) was performed through the gymnosperms with the exception of Gnetum. The latter had accumulated numerous autapomorphies of, sometimes, unclear homology in the spacers, and thus was excluded from the analysis. The overall sequence alignment is 4622 bp long (without tRNA genes; including hot spots). High variability commonly detected at lower taxonomic levels turned out to be confined to certain mutational hot spots (Figs 1 and 2; Table 2). This pattern allowed us to exclude from the analyses all positions of uncertain homology.

Figure 2.

Proposed secondary structure of the tRNA for leucine (UAA) encoded by the trnL gene in Nymphaea odorata. (a) Cloverleaf structure corresponding to the two exons, (b) proposed secondary structure for the intron and (c) P8 stem-loop region. Three main helical elements are labelled using roman numerals I–III. The single arrow in helical element I indicates the position of a repeat region, which is missing in Nymphaea. The two arrows in helical element II border an AT-rich string of repetitive elements that cannot be aligned across angiosperms and was therefore excluded from phylogenetic analysis.

The striking difference between the absolute length of the region and the alignment is caused by length mutations, occurring at about half the frequency of nucleotide substitutions. Indels inserted in the alignment range in length from 1 to 200 bp, and most of the insertions identified as simple repeat motifs were 4–6 bp long. Inversions were absent. Several of the length mutations are synapomorphic, defining specific clades, some of which are cited in the following. Although a detailed account of microstructural changes in trnT-trnF in basal angiosperms goes beyond the scope of the present paper, the following are examples of major synapomorphies: the Nymphaeaceae (represented by Brasenia, Cabomba, Nuphar, Nymphaea and Victoria and corresponding to Nymphaeales) share a ‘TTATG’ insertion in alignment positions 1341–1345 in the trnT-L spacer and an ‘AAATG’ SSR in positions 4603–4607 of the trnL-F spacer; the lineage of Piperaceae and Saururaceae within Piperales shares a ‘CTTT’ SSR in positions 3643–3646 in the trnL-F spacer; a clade of Magnolia, Michelia and Liriodendron within Magnoliaceae which, based on substitutions only, receives 68% bootstrap support (BS) with MP, shares a ‘GAATC’ SSR in positions 2622–2626 in the trnL intron; and the two species of Nuphar share a ‘GATTT’ SSR in positions 1373–1377 in the trnT-L spacer. It appears that synapomorphic indels occur at various taxonomic levels and vary considerably in their distribution: some branches of the basal angiosperm tree (e.g. the one leading to the Nymphaeaceae) are marked by many indels, whereas others have few or none. Thorough analyses of their type and distribution with broader taxon sampling will help to assess their phylogenetic utility at various taxonomic levels including the genus level. Long indels were rather rare and were restricted to individual taxa (autapomorphic). Further, long insertions (>20 bp) generally do not occur as repeated motifs. The most prominent examples are the 176-bp insertion in the trnT-L spacer and the 200-bp deletion in the trnL-F spacer of Austrobaileya and Illicium, respectively. Interestingly, both genera are members of the same small clade (Figs 3 and 4).

Figure 3.

Strict consensus of the two most parsimonious trees (3198 steps, CI = 0.565, RI = 0.592) showing phylogenetic relationships among basal angiosperms based on noncoding sequences from the plastid region trnT-trnF. BS values >50% are given above and DE values below branches.

Figure 4.

One of the two shortest trees (3198 steps) found in parsimony analyses of the trnT-trnF data set. Branch lengths (ACCTRAN optimization) are indicated above branches. The second tree only differed by the position of Dicentra among basal eudicots.

Proposed secondary structure

The proposed secondary structures for the tRNA-Leucine and the trnL intron in Nymphaea odorata are given in Fig. 2. The P6 and P8 stem-loop regions account for most of the sequence length variation in the intron. Minimum free energy configurations reveal several helical elements, labelled using roman numerals (I–III; Fig. 2) within an extensive P8 region. Within helical element I, repetitive elements are inserted in a number of basal angiosperm taxa (hot spot H5; Table 2). This is not the case in Nymphaea. Therefore, the respective position is marked by a single arrow in Fig. 2C. A second mutational hot spot (H6; Table 2) that was also excluded from phylogenetic analysis falls into helical element II. Two arrows border the respective AT-rich string of repetitive elements. It is important to note that P8 is conserved for most of its primary sequence in angiosperms. Only two positions are prone to larger inserts, which can be of independent origin and may vary considerably in length among taxa.

Phylogeny of basal angiosperms

The two spacers and the intron provided a set of 3112 characters excluding hot spot regions and exons. The positions of excluded parts with respect to the alignment with a total length of 4707 bp are (Fig. 1): 256–1276 (H1), 1538–1550 (H2), 1729–1750 (H3), 1795–1927 (H4), 2194–2228 (trnL-5′ exon), 2720–2749 (H5), 2837–2990 (H6), 3330–3379 (trnL-3′ exon), 4025–4145 (H7), 4403–4418 (H8). Of these 3112 characters, 928 characters were variable in angiosperms (1070 in whole data set), of which 608 are parsimony-informative in angiosperms (738 in the whole data set). The relative contributions of the three trnT-trnF sections are summarized in Table 2. The MP analysis of trnT-trnF sequences resulted in two shortest trees of 3198 steps (consistency index = 0.565, retention index = 0.592), differing only in the position of Dicentra (Ranunculales) being either basal in eudicots (Fig. 4) or sister to a eudicot clade consisting of Buxus, Aextoxicon and Trochodendron (tree not shown). Increasing the number of replicates during random stepwise addition from 100 to 1000 found the same two trees, which increases confidence that the most parsimonious trees were recovered. The ML analysis resulted in one tree with a score of −ln L = 18720.06573 (not shown). The ML phylogeny differed from the MP in placing the Chloranthales as sister to the eumagnoliids (defined here to comprise Laurales, Magnoliales, Piperales, Winterales) instead of being sister to the eudicots, as in the MP trees.
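The character count given above can be checked directly from the listed exclusion ranges. The short sketch below only redoes this arithmetic; the variable names are illustrative and the ranges are taken to be inclusive alignment positions, which is an assumption consistent with the reported totals.

```python
# Total alignment length including the two trnL exons (bp).
ALIGNMENT_LENGTH = 4707

# Excluded alignment positions, read as inclusive (start, end) ranges.
EXCLUDED = {
    "H1": (256, 1276),
    "H2": (1538, 1550),
    "H3": (1729, 1750),
    "H4": (1795, 1927),
    "trnL-5' exon": (2194, 2228),
    "H5": (2720, 2749),
    "H6": (2837, 2990),
    "trnL-3' exon": (3330, 3379),
    "H7": (4025, 4145),
    "H8": (4403, 4418),
}

def span(start, end):
    """Number of positions in an inclusive range."""
    return end - start + 1

excluded_total = sum(span(*rng) for rng in EXCLUDED.values())
exon_total = sum(span(*rng) for name, rng in EXCLUDED.items() if "exon" in name)
retained = ALIGNMENT_LENGTH - excluded_total

print("excluded positions:", excluded_total)                       # 1595
print("alignment without exons:", ALIGNMENT_LENGTH - exon_total)   # 4622
print("characters retained:", retained)                            # 3112
```

The retained count matches the 3112 characters analysed, and removing only the two exons gives the 4622 bp alignment length reported for the noncoding parts.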

The trnT-trnF MP strict consensus (Fig. 3) clearly depicts [99% BS, decay value (DE) of 13] the New Caledonian woody shrub Amborella as sister to all remaining angiosperms. Diverging next is the herbaceous water lily lineage Nymphaeaceae (94% BS, six DE), followed by the Schisandra–Illicium–Austrobaileya clade (100% BS, 18 DE) that comprises small trees and woody lianas. The Chloranthaceae, a tropical family with very reduced flowers, emerges as sister to the eudicots, but with bootstrap support <50%. Piperales, comprising Piperaceae, Saururaceae, Aristolochiaceae and Lactoridaceae (members of what have been known as paleoherbs), are highly supported (97% BS, nine DE) in a clade sister to woody Canellaceae and Winteraceae (Winterales). Support for the latter sister relationship is weak (64% BS, one DE). Magnoliales and Laurales, the first of which gains moderate (73% BS, two DE) and the latter strong (100% BS, eight DE) support, appear in a clade sister to the Piperales–Winterales clade. The clade of these four lineages (eumagnoliids) is weakly supported (60% BS and two DE). The eudicots, encompassing the dicot families with tricolpate or tricolpate-derived pollen, constitute a monophyletic lineage (100% BS, 13 DE) in which Dicentra (Ranunculales) appears in a polytomy with a well-supported Nelumbo–Platanus clade (Proteales, 98% BS, 10 DE) and a clade containing the other eudicots (represented by Trochodendron, Aextoxicon and Buxus). A phylogram (MP) of one of the two shortest trees is shown in Fig. 4 to illustrate branch lengths; ML branch lengths leading to some important nodes are presented in the following. The branches leading from the first angiosperm node to the subtrees where Nymphaeaceae and Austrobaileyales are basal have 53 and 26 steps, respectively (Fig. 4). The branch leading from the root node to Amborella is 73 steps long, whereas the branches of water lilies and Austrobaileyales are 156 and 158 steps on average (mean of all terminal taxa belonging to respective clades). The branch leading from the basal grade to the subtree where the monocot–Ceratophyllum clade is basal is 33 steps. The branch lengths with ML are 0.029 and 0.010 from the first angiosperm node to the subtrees with Nymphaeaceae and Austrobaileyales at the base; 0.087 from the root node to Amborella, and 0.101 and 0.068, respectively, to taxa of Nymphaeaceae and Austrobaileyaceae. The branch leading from the basal grade to the subtree with the monocot–Ceratophyllum clade basal is 0.041.

Discussion

Molecular evolution of trnT-trnF and implications for phylogenetic utility

Sequences of trnT-trnF presented here from across seed plants allow us to examine early evolution in flowering plants with fast-evolving, noncoding DNA and to compare the evolution of the two spacers and the intron over a broad evolutionary scale. The region has been widely used in systematic studies below the family level, and often only the trnL intron and trnL-F spacer are employed (e.g. Gielly et al., 1996; Sang et al., 1997; Small et al., 1998). It is rather striking to see this noncoding region providing strong historical signal and high resolution deep in angiosperm phylogeny. The numbers of variable and informative sites correlate with the number of aligned characters for each of the three parts (Table 3). However, looking at actual sequence lengths it appears that the trnL intron sequences contain only 63% variable sites compared with 83 and 98% in the trnT-L and trnL-F spacers, respectively. This seems to be caused by fewer small length mutational changes in the intron compared with the spacers.

Table 3.  Variation and relative contribution (excluding mutational hot spots) of the three parts of the trnT-trnF region in angiosperms and gymnospermous outgroups. High numbers of insertions characteristic of noncoding regions expand the alignment, causing underestimation of variability; for a better approximation, the amount of variability is also calculated on the basis of average actual length of sequences (corrected). Note that character numbers are always based on the alignment.
                                              trnT-L spacer   trnL intron   trnL-F spacer   trnT-trnF
Average sequence length (bp)                  730             500           355             1590
Standard deviation                            235             51            49              223
Average sequence length excluding hot spots   376             459           331             1167
Standard deviation                            24              33            45              53
Number of characters                          1005            919           1188            3112
Variable characters                           313             289           326             928
% variable characters (corrected)             31 (83)         31 (63)       27 (98)         30 (79)
Parsimony informative characters              211             179           218             608
% informative characters (corrected)          21 (56)         20 (39)       18 (66)         20 (52)

Estimates of variability for noncoding DNA cannot be carried out very easily because of multiple-nucleotide mutational events (i.e. length mutations). Aligning sequences with insertions results in an inflated character number and thus underestimation of variability. We therefore consider a ‘corrected’ percentage value based on average actual sequence length (Table 3) to present a more accurate approximation of variability. The differences between the corrected and uncorrected values can be substantial, and in this data set they range from 2.0 to 3.6-fold. The correction measure used here should be considered as a rough approximation. It is obvious that differences in sequence lengths among the sampled taxa may bias the mean sequence length, but we consider that the adoption of such a correction measure provides a more realistic picture than using uncorrected values.
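The correction can be illustrated with the values in Table 3. The sketch below assumes that the corrected percentage divides the number of variable characters by the average sequence length excluding hot spots, rather than by the number of aligned characters; this reading is inferred from the table and is not stated explicitly in the text. Under that assumption the results agree with the tabulated values to within about one percentage point.

```python
# Values from Table 3: (aligned characters, variable characters,
# average sequence length excluding hot spots in bp).
TABLE3 = {
    "trnT-L spacer": (1005, 313, 376),
    "trnL intron":   (919, 289, 459),
    "trnL-F spacer": (1188, 326, 331),
    "trnT-trnF":     (3112, 928, 1167),
}

def variability(n_characters, n_variable, mean_length):
    """Return (uncorrected %, corrected %, fold difference).

    Assumption: 'corrected' means per average actual sequence length.
    """
    uncorrected = 100.0 * n_variable / n_characters
    corrected = 100.0 * n_variable / mean_length
    return uncorrected, corrected, corrected / uncorrected

results = {name: variability(*row) for name, row in TABLE3.items()}

for name, (unc, cor, fold) in results.items():
    print(f"{name:14s} uncorrected {unc:5.1f}%  corrected {cor:5.1f}%  "
          f"fold {fold:.1f}")
```

The fold difference is simply the ratio of aligned characters to average sequence length, so it is largest where insertions have expanded the alignment most (the trnL-F spacer) and smallest in the length-conserved intron.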

Compared with the trnT-L spacer, it seems that the trnL intron and the downstream spacer, trnL-F, evolve in concert. Length variability is much higher in trnT-L than in the other two regions, as indicated by the standard deviation of average length (235 vs. 51 and 49 bp; Table 3). In addition, 4–6 bp simple direct repeat motifs are about 30% less frequent in the intron and trnL-F than in trnT-L (Borsch, 2000). Moreover, the mutational hot spot sectors are much smaller in the trnL intron and trnL-F spacer compared with trnT-L (Table 2). It is also worth mentioning that the respective tRNA genes of the trnL intron and trnL-F spacer are transcribed in the same direction (Shinozaki et al., 1986; Kanno & Hirai, 1993). The relatively conserved length of the trnL intron may relate to the role this group I intron plays in splicing during pre-tRNA processing (Kuhsel et al., 1990).

These evolutionary patterns are also reflected in the gymnosperm species examined here, including the divergent Gnetum. The absolute size and degree of sequence divergence are proportionally less pronounced in the intron (346 bp) than in the spacers (280 and 138 bp; in angiosperms the average lengths of the intron and spacers are 500, 739 and 355 bp, respectively; Table 2). The extreme divergence found in the sequence and absolute size of trnT-trnF of Gnetum is important in the light of the anthophyte hypothesis (e.g. Crane, 1985; Doyle & Donoghue, 1986), which proposes the Gnetales and extinct Bennettitales as closest relatives of angiosperms. Information from this and other phylogenetic studies (e.g. Goremykin et al., 1996; Qiu et al., 1999, 2000; Bowe et al., 2000; Chaw et al., 2000; Donoghue & Doyle, 2000), as well as the analysis of genes controlling floral development (Winter et al., 1999), points to the rejection of the anthophyte hypothesis and acceptance of the monophyly of extant gymnosperms (e.g. Chaw et al., 2000). Divergence of Gnetum in the trnT-trnF region could be caused by either an accelerated mutational rate or by a very long separation from other lineages including angiosperms. In our main analysis, Gnetum was not included because large portions of trnT-trnF sequence cannot be aligned. However, a negative effect from not including this gymnosperm lineage in our basal angiosperm analysis is not to be expected, as Gnetum most likely is not the immediate sister of angiosperms, and its inclusion would probably have only resulted in additional long-branch attraction effects. The secondary structure of the trnL intron (Fig. 2) reveals that the highly length-variable sectors that cannot be aligned across angiosperms are confined to smaller stem-loops within P8. Insertion–deletion events are a characteristic mode of divergence in noncoding regions and tend to be site-dependent (Clegg et al., 1994). 
The situation in trnT-trnF is also comparable with angiosperm 18S rDNA, in which four highly variable regions have been identified by Soltis et al. (1997, 1999b). These regions are also located in loops of the proposed ribosomal RNA secondary structure and were excluded from the phylogenetic analysis because of difficulties in alignment. The two hot spots in the trnL-F spacer are small areas in which tandem duplications and repetitive elements accumulate, whereas the large mutational hot spot in trnT-L seems to be of a different nature. Length variability in trnT-L is caused by the addition of sequence in a certain area that seems to occur independently in different lineages and may involve insertions of larger fragments of so far unknown origin, particularly in monocots (Table 2).

The trnT-trnF region is known to be fast-evolving; depending on the taxonomic group, it evolves up to three times faster than rbcL (e.g. Bayer & Starr, 1998, for Asteraceae; Reeves et al., 2001, for Iridaceae). TrnT-trnF, and many other noncoding parts of the large single copy (LSC) region of the chloroplast genome differ considerably in their rates of evolution from noncoding DNA in the inverted repeat (IR; Olmstead & Palmer, 1994; Soltis & Soltis, 1998). This fast rate of evolution in LSC noncoding regions may have led to the notion that the majority of their sites would be saturated when they are used in phylogeny reconstruction at higher taxonomic levels. On the contrary, Fig. 5 indicates that the largest number of variable sites changed only once in the present data set.
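The number of steps per character on a fixed tree, which underlies the saturation argument above, can be obtained with Fitch parsimony for unordered characters. The following is a minimal sketch on a hypothetical five-taxon tree with toy data; it does not reproduce the MacClade CHART output, and the tree, the sequences and the function name `fitch_steps` are illustrative assumptions.

```python
from collections import Counter

def fitch_steps(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree: nested 2-tuples of taxon names, e.g. (("A", "B"), "C").
    states: dict mapping taxon name -> character state at this site.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # terminal taxon
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # state sets overlap: no change needed
            return common
        steps += 1                         # disjoint sets: one change on this node
        return left | right

    visit(tree)
    return steps

# Hypothetical tree and alignment (not the trnT-trnF data).
tree = ((("A", "B"), "C"), ("D", "E"))
alignment = {"A": "AAGT", "B": "AAGC", "C": "ATGT", "D": "ATCC", "E": "ATCT"}

n_sites = len(alignment["A"])
per_site = [fitch_steps(tree, {t: s[i] for t, s in alignment.items()})
            for i in range(n_sites)]
histogram = Counter(per_site)   # steps -> number of characters

print(per_site)                 # [0, 1, 1, 2]
print(sorted(histogram.items()))
```

A histogram dominated by characters with a single step, as reported for the present data set, is what one would expect when few sites have experienced multiple hits.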

Figure 5.

Amount of variability among characters in the trnT-trnF data set containing 3112 characters calculated over tree one of 3198 steps; the x-axis indicates the level of variability (i.e. number of steps for a character) and the y-axis shows the number of characters for each level of variability. Most of the variable characters have changed only one time.

Reconstructing deep-level phylogeny in the angiosperms with trnT-trnF sequences has produced results that are highly congruent with trees inferred from multi-gene, multi-genome data sets as discussed in detail below. Reasons for this unexpectedly strong performance of the trnT-trnF data set at deep levels may come from the ability of the majority of the sequence positions to evolve rather freely. Based on the secondary structure of the trnL intron of Nymphaea, 68% of the characters in the present analysis are contributed by the P8 and P6 stem-loops. In contrast, the functionally highly constrained and evolutionarily conserved P, Q, R and S regions, which act as a core in RNA catalysis (Cech, 1988; Michel & Westhof, 1990; Besendahl et al., 2000), account for only about 9% of the trnL intron sequence length (based on the secondary structure model for Nymphaea). These regions are not length variable in basal angiosperms and contain only one informative and three additional variable positions that are autapomorphic. Consequently, the effect of the P, Q, R and S regions on phylogeny reconstruction is minimal. Compensatory mutations could present a problem in a phylogenetic analysis because two different character states change dependently, resulting in double weighting of the respective changes. In the trnL secondary structure of Nymphaea, 95 positions appear in stems, which corresponds to the maximum number of potentially co-evolving sites. In the present data set only 20% of these stem positions are variable, so that the overall proportion of possibly co-evolving sites across basal angiosperms is very low. This seems to be in contrast to the maximum possible 73% of compensatory mutations reported from 18S rDNA (Soltis et al., 1997).

Congruence of trnT-trnF with multi-gene multi-genome phylogenies

The overall angiosperm phylogeny resolved in this trnT-trnF study is highly congruent with that based on multi-gene, multi-genome data sets (Qiu et al., 1999, 2000; Soltis et al., 1999a, 2000). The emergence of a grade of Amborella, Nymphaeaceae and Schisandra–Illicium–Austrobaileya as the three most basal branches (the basal angiosperm grade) is in agreement with phylogenies based on combined data sets (except rbcL plus atpB). This grade has not been observed in analyses of individual genes with the exception of atpB (Savolainen et al., 2000). In the 18S rDNA sequence analysis, Austrobaileya, Illicium and Schisandra either appear in a clade second to Amborella or as the most basal lineage, depending on sampling and outgroup choice (Soltis et al., 1997). Analysis of the trnT-trnF region, like the five-gene analysis of Qiu et al. (1999, 2000) and the six-gene analysis by Zanis et al. (2002), stands out in its strong statistical support for the basal grade (nearly 100% BS for the relevant nodes in both studies). This additional evidence from a genomic region with a basically different evolutionary mode and tempo is of particular importance as rooting the angiosperms with Amborella has been discussed in terms of possible long-branch attraction. The five-gene data set shows Amborella to have the longest branch (357 steps; Qiu et al., 2000). Taxon deletion analyses in the same study found a likelihood measure in favour of an Amborella plus Nymphaeaceae clade rather than Amborella alone as the first branch. Similar results were obtained with likelihood analyses and noise reduction experiments of a data set consisting of sequences from the three plant genomes (Barkman et al., 2000). In the present analysis, the node uniting all other angiosperms above Amborella does not only receive higher bootstrap support (99% compared with 91% in six-gene, 88% in five-gene and 65% in three-gene analyses) but also a decay value of 13. 
Recent extensive analyses on the root of the angiosperms using MP, ML and Bayesian methods of phylogeny reconstruction with an 11-gene data set favoured Amborella as sister to all other angiosperms, with less evidence for an Amborella plus Nymphaeaceae clade and almost no evidence for Nymphaeaceae alone as respective sister lineages to all other angiosperms (Zanis et al., 2002). Additional support for this basal grade in flowering plants comes from this trnT-trnF data set. Congruence between MP and ML analyses of trnT-trnF is not only in topology but also in branch lengths. Unrooted MP subtrees of the angiosperms in this study (Fig. 4) reveal that the branches leading to Amborella (126 steps) and the water lilies (103 steps on average) are quite similar in length. Branch lengths determined with ML are 0.116 vs. 0.101 and correlate well with those found using MP. Consequently, long-branch attraction to the outgroup does not seem to be a very probable factor to have influenced the basal-most position of Amborella in this study, given that Nymphaeaceae and Amborella are more or less equally divergent. Therefore, the trnT-trnF data support Amborella instead of Nymphaeaceae plus Amborella as the most basal angiosperm lineage.

The recognition of the basal grade in flowering plants points to strong shifts in habit and habitat quite early in their evolutionary history as exemplified by the divergence of the herbaceous aquatic Nymphaeaceae as the second extant lineage.

The position of monocots varies among trees based on different data sets although none of these alternative placements is well supported. Combined analysis of rbcL, atpB and 18S rDNA (Soltis et al., 1999a, 2000) shows monocots unresolved with Winterales, Laurales, Magnoliales, Chloranthales and Piperales; rbcL data alone (Chase et al., 1993) depicted them unresolved with Laurales and Piperales; atpB (Savolainen et al., 2000) analysis revealed monocots sister to eudicots and paraphyletic to Ceratophyllum (Ceratophyllales); phytochrome genes PHYA and PHYC (Mathews & Donoghue, 1999) place monocots as sister to Chloranthus (Chloranthales) in a position basal to eudicots; and three of four 18S rDNA data sets show them sister to Ceratophyllum within basal eudicots, with Acorus resolved independently from the rest of the monocots (Soltis et al., 1997). The present study places monocot-Ceratophyllum as the next-branching clade after the basal grade but without bootstrap support, in line with the six-gene analysis of Zanis et al. (2002). The five-gene analyses of Qiu et al. (1999, 2000) showed the clade sister to Chloranthales in the same phylogenetic position; these nodes, however, collapse in their strict consensus (Qiu et al., 2000). Underlying these inconsistencies in the position of monocots is weak support in all these studies.

In contrast to the difficulties in defining the monocot position among angiosperms, the relationship between monocots and the dicot Ceratophyllum is gaining support. The relationship between monocots and the aquatic Ceratophyllum inferred in this study is congruent with those based on a large number of slowly evolving chloroplast genes (Graham & Olmstead, 2000) or genes from all three genomes (Qiu et al., 2000; Zanis et al., 2002). The congruence between ML and MP trees implies that the position of Ceratophyllum in this study is not influenced by heterogeneity in rates of substitution among taxa and concomitant long-branch attraction. Such a phenomenon could affect MP more strongly than ML analysis (e.g. Huelsenbeck, 1995). The relationship of Ceratophyllum to monocots is supported by loss of primary roots. In addition, Ceratophyllum shares with the Alismatales, one of the most basal lineages in monocots, the presence of achene fruits (Les & Schneider, 1995). The fossil record is in line with this molecular-based hypothesis because earliest records for both Ceratophyllum and monocots have been found almost contemporaneously in the Cretaceous. Ceratophyllum-like fruits have been recognized in the Aptian of Australia (Krassilov, 1997), and the earliest fossils that can be assigned to the monocots are triuridaceous flowers from the Turonian (early Upper Cretaceous) of the USA (Gandolfo et al., 2000) and aroid fruits from the Albian (Herendeen & Crane, 1995). Recent calculations by Bremer (2000) postulate that the major monocot lineages may have diverged from each other during the Early Cretaceous. As many basal monocots are aquatic or nearly so (Chase et al., 2000), a possible aquatic ancestor for the Ceratophyllum–monocot group ought to be considered.

Among other important angiosperm relationships supported by the trnT-trnF data is the association between the largely herbaceous Piperales and woody Winterales. This association is in contrast with the traditional classification that places the woody Canellaceae and Winteraceae into a more broadly circumscribed order Magnoliales (Cronquist, 1981). The trnT-trnF-based position of the Winterales with the Piperales is also depicted in the phytochrome gene (Mathews & Donoghue, 1999, 2000; Graham & Olmstead, 2000), five-gene (Qiu et al., 1999, 2000) and six-gene (Zanis et al., 2002) analyses. Thus, hypothesized relationships based on different molecular data sets now seem to converge, whereas morphology favours a sister-group relationship of Winterales with Laurales (Doyle & Endress, 2000). The close affinity of the two families Canellaceae and Winteraceae has been suggested earlier based on rbcL (Chase et al., 1993; Qiu et al., 1993) and phytochemical data (Gottlieb et al., 1989). The classification of Piperales to contain the four families Piperaceae, Saururaceae, Lactoridaceae and Aristolochiaceae (APG, 1998) can be clearly defended based on the strong statistical support for this clade. However, the position of Lactoridaceae sister to Aristolochia within Aristolochiaceae with no support (Fig. 3) or with weak support in the five-gene analysis (Qiu et al., 1999, 2000) might be spurious. Analyses of a larger trnT-trnF data set of Aristolochiaceae and Piperales (Neinhuis et al., 1999) displayed the Lactoridaceae sister to the Aristolochiaceae. Lactoris has the longest branch among Piperales in this and other data sets (e.g. Graham & Olmstead, 2000), and thus its position might be influenced by long-branch attraction. Morphological information (González & Rudall, 2001) supports relationships of Lactoris to Saururaceae and Aristolochiaceae.

In line with previous molecular studies, both Laurales and Magnoliales are resolved as monophyletic lineages, with monophyly of Laurales gaining very strong statistical support. However, Magnoliales gain weak support here, which is most likely the result of a low rate of substitutions in the trnT-trnF region in this lineage and the consequently low number of synapomorphic mutations uniting them. The eudicot lineage is very well supported (100% BS, 13 DE), but the branch leading to Dicentra (Ranunculales) is very long (Fig. 4). The long branch may have caused the conflict between the shortest trees at the base of the eudicots, which may not be maintained when the eudicots are more densely sampled.

Contributions of noncoding trnT-trnF sequence data to understanding basal angiosperm relationships

Recent molecular approaches based on single and combined gene data sets have provided immense insight into the evolution of flowering plants. These contributions are redefining angiosperm classification. Hypotheses from the precladistic era recognized the Magnoliales (Takhtajan, 1980; Cronquist, 1981), with their large showy flowers and a high number of spirally arranged carpels, to be the most ancestral flowering plants (the so-called ‘Magnolialean Hypothesis’; see Qiu et al., 2000, for overview of basal angiosperm relationships). Analyses of an 18S rDNA data set by Hamby & Zimmer (1992) resolved Nymphaeaceae s.str. as the sister group to all other angiosperms. However, results of the first large-scale molecular phylogenetic analyses based on rbcL depicted the aquatic Ceratophyllum as the first-branching angiosperm (Chase et al., 1993; Qiu et al., 1993). Subsequent intense efforts of sequencing multiple genes from different genomes culminated in a first general hypothesis of what could be the root of the angiosperms (e.g. Mathews & Donoghue, 1999; Qiu et al., 1999; Soltis et al., 1999). The picture has changed not only by revealing Amborella as sister to all other angiosperms but also by providing strong corroborative evidence from various genomic regions, including trnT-trnF, in support of an Amborella, Nymphaeaceae and Illicium–Schisandra–Austrobaileya grade. Moreover, Trimenia (not sampled here) has been shown to be a member of the Illicium–Schisandra–Austrobaileya clade (e.g. Qiu et al., 1999; Renner, 1999; Zanis et al., 2002). In addition, receiving increased evidence from this noncoding DNA and from multiple gene studies (Qiu et al., 1999, 2000; Graham & Olmstead, 2000) is a core eumagnoliid clade encompassing Winterales plus Piperales and Laurales plus Magnoliales. This finding provides a phylogenetic framework for one of the most species-rich groups of basal angiosperms. Instead of the broad circumscription of the eumagnoliids (Soltis et al., 2000) as comprising all angiosperms with monosulcate or monosulcate-derived pollen except the basal grade and Ceratophyllum, this term might better be confined to the above-mentioned clade of four orders.

Remaining in flux are the positions of the Chloranthales, the eudicots (which comprise approximately 75% of angiosperm diversity) and the Ceratophyllum–monocot clade, even when combining six genes with a quite dense taxon sampling (Zanis et al., 2002). Chloranthaceae and members of the basal grade are the only angiosperms that lack post-genital carpel fusion (Endress & Igersheim, 2000). This evidence, and the extensive occurrence of unambiguously identified chloranthaceous fossils in the lower Cretaceous (Crane et al., 1995; Friis et al., 1999), point to a position more basal than currently inferred from molecular data sets. Perhaps, Chloranthaceae have diverged right after the separation of the Illicium–Schisandra–Trimenia–Austrobaileya clade. In order to reveal possible parallelisms or reversals in structural characters and to improve robustness of the molecular-derived phylogenies, the addition of genomic regions that evolve under different functional constraints as well as the integration of information from morphology, palaeobotany, and developmental genetics are needed. Better understanding of sampling effects and of patterns of molecular evolution in conjunction with the development of algorithms that more effectively reflect the evolutionary modes of the different genomic regions used in molecular systematics will perhaps allow further progress in this area.

Conclusions

Most striking is the congruence between angiosperm phylogenies based on sequences from the noncoding trnT-trnF, the five (Qiu et al., 2000; mitochondrial atp1 and matR, chloroplast atpB and rbcL, and nuclear 18S) and six combined genes (Zanis et al., 2002), and generally the three-gene (Soltis et al., 1999a; Soltis et al., 2000; chloroplast atpB and rbcL, and nuclear 18S RNA) and 17-gene (Graham & Olmstead, 2000) analyses. Soltis et al. (1999a, 2000), and Qiu et al. (1999, 2000) suggested that phylogenies based on combined data representing different genomes are more reliable than phylogenies based on individual genes because gene- or genome-specific bias can be largely ruled out. The strong congruence in topology as well as statistical support of major nodes between trees based on trnT-trnF and multiple gene/genome sequence data underscore the effectiveness of this fast-evolving, noncoding region in reconstructing phylogenies at high taxonomic levels. Low constraints (i.e. freedom for a greater number of sequence positions to vary) could result in a more equitable distribution of phylogenetic information across the region in contrast with only a few, localized, potentially variable positions; such a pattern is expected to reduce the average level of homoplasy in a genomic region. Emphasis on the utility of neutral nucleotide substitutions as phylogenetic markers is underscored in several studies at the generic and subfamilial levels (e.g. Bakker et al., 2000). Further support for this concept comes from the strong agreement between trnT-trnF and matK phylogenies (Hilu et al., in press). The matK gene also appears to be under far less selectional constraint than other genes used in phylogeny reconstruction, as is evident from the considerably higher rate of nonsynonymous substitution that is up to 26 times that of other genes and about seven-fold that of rbcL (Olmstead & Palmer, 1994; Hilu & Liang, 1997).

It is also important to note that the widely accepted distinction between ‘slow’ and ‘fast’ evolving genes usually reflects the average amount of variability in the genomic region under study rather than the rates at individual sites. The latter are the actual source of variability, and a slowly evolving gene may have its few variable positions evolving at rates similar to those at most individual positions in a fast-evolving gene. Consequently, the expected level of multiple hits per variable site is not necessarily lower in a genomic region in which the sites evolve slowly on average. In other words, assuming the benefits of slowly evolving genes as described in the introduction may also be problematic, simply because the currently used concepts of ‘slow’ vs. ‘fast’ are an oversimplification. In fact, the benefits expected for a slowly evolving region (reduced incidence of multiple hits, low levels of homoplasy) might not always apply. These conceptual issues have to be investigated further, but this requires a comparative characterization of different genomic regions, which we do not attempt in this paper.
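The distinction between average variability and per-site rates can be illustrated with a small numerical sketch. The Python example below is not part of the original study: the two regions, their site counts, the expected number of changes per variable site, and the use of a Poisson model for substitutions are all hypothetical assumptions, chosen only to show that a region that is ‘slow’ on average may have variable sites with the same expected frequency of multiple hits as a ‘fast’ region.

```python
import math


def multiple_hit_probability(expected_changes):
    """P(two or more substitutions at a site), assuming a Poisson process."""
    return 1.0 - math.exp(-expected_changes) * (1.0 + expected_changes)


def summarize(total_sites, variable_sites, changes_per_variable_site):
    """Return (average changes per site, multiple-hit probability per variable site).

    Invariant sites are assumed never to change (a hypothetical simplification).
    """
    average = variable_sites * changes_per_variable_site / total_sites
    return average, multiple_hit_probability(changes_per_variable_site)


# Two hypothetical regions of equal length; their variable sites evolve at the
# same rate, but the 'slow' region has far fewer sites that are free to vary.
slow_avg, slow_multi = summarize(
    total_sites=1000, variable_sites=100, changes_per_variable_site=0.5
)
fast_avg, fast_multi = summarize(
    total_sites=1000, variable_sites=800, changes_per_variable_site=0.5
)

print(f"'slow' region: {slow_avg:.2f} changes/site on average, "
      f"P(multiple hits per variable site) = {slow_multi:.3f}")
print(f"'fast' region: {fast_avg:.2f} changes/site on average, "
      f"P(multiple hits per variable site) = {fast_multi:.3f}")
```

Under these assumed values the average rates of the two regions differ eight-fold, whereas the probability of multiple hits at a variable site is identical, which is the situation described in the paragraph above.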

The assumption of high efficiency of functionally constrained/evolutionarily conserved DNA, in contrast with less constrained and fast-evolving DNA, for resolving deep-level phylogenies was often also based on the notion that the less constrained third codon positions are highly saturated, high in homoplasy and uninformative because of excess multiple hits (e.g. Swofford et al., 1996). Nonetheless, third codon positions of rbcL were shown to provide most of the historical signal in analyses of bryophytes and land plants (Lewis et al., 1997; Kallersjö et al., 1998). A similar observation was also made by Savolainen et al. (2000) in the combined analysis of atpB and rbcL data for angiosperms. In a recent simulation study, Hillis (1998) increased evolutionary rates originally observed in a set of 228 angiosperm sequences by a factor of 10. He found that a correctly inferred tree was achieved with far fewer characters than with the sequences evolving at lower rates. Although the average character changed 23.6 times under the increased rates (Hillis, 1998), signal was not obscured by homoplasy. Based on the relative numbers of supported nodes in different data partitions, Kallersjö et al. (1999) asserted that homoplasy can result in better recognition of groups and thus can increase phylogenetic structure. Soltis et al. (1999b), recognizing this phenomenon, gave higher weight to the more freely evolving loop characters in the 18S rDNA in their land plant study. However, the effect of differential weighting on the phylogeny was not significant, possibly because of profound differences in rate of evolution between 18S loop and stem regions (Soltis & Soltis, 1998; Soltis et al., 1999b). Rapidly evolving sites may increase the chance of generating synapomorphies for particular groups without being obscured by multiple hits simply by expanding the information basis. Further, the number of sites that are co-evolving or that have a greater likelihood to change to particular nucleotides because certain amino acids are favoured by the protein secondary and tertiary structures may be lower. The latter have been identified as a reason to explain parallelisms and reversals in RuBisCo (Kellogg & Juliano, 1997).

Other noncoding regions have also been used in phylogeny reconstruction at higher taxonomic levels. However, those studies have so far only used noncoding sequences from the chloroplast inverted repeat (IR), an extremely slowly evolving region, with average rates of substitution that are even lower than those of the more conserved protein-coding genes of the LSC region (Graham et al., 2000). The IR regions used in such studies are the internal transcribed spacer (ITS) (Goremykin et al., 1996), the rpl2 intron, the 3′-rps12 intron, the ndhB intron, and the spacers between 3′-rps12 and rps7 and between rps7 and ndhB (Graham & Olmstead, 2000; Graham et al., 2000). Graham et al. (2000) found slightly higher CIs but similar RIs comparing IR protein coding and noncoding data sets. In contrast, CIs and RIs from noncoding IR regions were substantially lower than from protein coding genes outside the IR. This points to different mutational dynamics between the IR and the single copy regions of the chloroplast genome. Therefore, except for the presence of length mutations, noncoding regions in the IR may not be directly comparable with other noncoding parts such as trnT-trnF. A more conclusive view on the effects of noncoding vs. coding and ‘fast’ vs. ‘slow’ evolving genomic regions will require comparative studies of a larger number of functionally different genomic regions based on identical sampling schemes.

In our data set, the relaxed selection pressure across the noncoding parts of trnT-trnF appears to have provided an ideal situation that allowed the recovery of a robust phylogenetic structure. A detailed study of the molecular evolution of the trnT-trnF region is currently underway. The effectiveness of the trnT-trnF sequences in phylogeny reconstruction is even more evident when we consider that the average actual length of trnT-trnF (Table 2) is less than about 20% of the length of the five- and six-gene data sets (Qiu et al., 2000; Zanis et al., 2002). The present data set recovered the same relationships with equal or greater support, from substantially fewer nucleotides. Therefore, the approach is considerably cheaper in terms of sequencing effort but requires much more prudence in alignment. The trnT-trnF data set constitutes new and strong evidence for understanding relationships among basal angiosperms, particularly within major clades. Furthermore, the results provide a strong argument for the application of noncoding regions in molecular systematics at deeper levels. The utility of these genomic regions, however, should be tested individually.

The present study underscores the importance of recognizing patterns and mechanisms of molecular evolution of the genomic regions used in molecular systematics, in order to increase the probability of exploiting historical signal in phylogeny reconstruction and of recovering correct organismal phylogenies. Moreover, analysis of noncoding regions is not subject to problems of differential weighting of codon positions or of synonymous vs. nonsynonymous mutations that, when applied, might influence data-decisiveness in phylogenetic analyses (Davis et al., 1998; Savolainen et al., 2000). Thus, gene trees inferred from noncoding regions should theoretically depict a rather close approximation of the evolutionary history of the group. Our findings demonstrate that alignable noncoding regions like trnT-trnF can be particularly promising in phylogeny reconstruction deeper than the species and generic levels.

Acknowledgments

We thank D. L. Dilcher, L. A. Alice, S. E. Scheckler and K. Müller for comments on the manuscript, Tod Stuessy and Daniel Crawford for providing Lactoris DNA, and the Missouri Botanical Garden and the Arboretum of the University of California, Santa Cruz, for contributing plant material for Piper and Amborella, respectively. Supported in part by a scholarship from Studienstiftung des deutschen Volkes to T.B. Helpful comments by two anonymous reviewers are acknowledged.

Supplementary material

The following material is available from: http://www.blackwellpublishing.com/products/journals/suppmat/JEB/JEB577/JEB577sm.htm

Appendix S1

Overall sequence alignment 4707 bp in length, including the trnT-L-spacer, the trnL gene, and the trnL-F-spacer. Positions of the trnL 5′-exon are 2194–2228, and of the trnL 3′-exon 3330–3379. Positions of hot spots are: 256–1276 (H1), 1538–1550 (H2), 1729–1750 (H3), 1795–1927 (H4), 2720–2749 (H5), 2837–2990 (H6), 4025–4145 (H7) and 4401–4416 (H8).

Appendix S2

Character matrix of the trnT-trnF-region used in phylogenetic analysis of basal angiosperms (3112 characters). Hot spots H1–H8 and trnL exons are excluded.
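
The character count given for Appendix S2 can be cross-checked against the positions listed in Appendix S1. The following minimal Python sketch is not part of the original analysis; the variable and function names are illustrative assumptions, and all position ranges are assumed to be 1-based and inclusive. It subtracts the hot spots and the trnL exons from the overall alignment length.

```python
# Cross-check of the Appendix S2 character count from the Appendix S1 positions.
# Assumption: all ranges are 1-based and inclusive; names are illustrative only.

ALIGNMENT_LENGTH = 4707

HOT_SPOTS = {
    "H1": (256, 1276),
    "H2": (1538, 1550),
    "H3": (1729, 1750),
    "H4": (1795, 1927),
    "H5": (2720, 2749),
    "H6": (2837, 2990),
    "H7": (4025, 4145),
    "H8": (4401, 4416),
}

TRNL_EXONS = {
    "5'-exon": (2194, 2228),
    "3'-exon": (3330, 3379),
}


def span_length(start, end):
    """Number of alignment positions in an inclusive range."""
    return end - start + 1


def retained_characters(total, *excluded_groups):
    """Alignment length minus all excluded (non-overlapping) ranges."""
    excluded = sum(
        span_length(start, end)
        for group in excluded_groups
        for start, end in group.values()
    )
    return total - excluded


retained = retained_characters(ALIGNMENT_LENGTH, HOT_SPOTS, TRNL_EXONS)
print(retained)  # 3112
```

Under the inclusive-range assumption, the excluded hot spots (1510 positions) and exons (85 positions) leave 3112 characters, in agreement with the matrix size stated for Appendix S2.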
