Chloroplast DNA-based studies in molecular ecology may be compromised by nuclear-encoded plastid sequence



    1. Molecular Ecology Group, Institute of Ecology, University of Innsbruck, Technikerstrasse 25, 6020 Innsbruck, Austria

    1. Institute of Genetics, Federal Research and Training Centre for Forests, Natural Hazards and Landscape, Vienna, Austria

    1. Molecular Ecology Group, Institute of Ecology, University of Innsbruck, Technikerstrasse 25, 6020 Innsbruck, Austria
    • These authors contributed equally to this paper.


    1. Molecular Ecology Group, Institute of Ecology, University of Innsbruck, Technikerstrasse 25, 6020 Innsbruck, Austria
    • These authors contributed equally to this paper.

Wolfgang Arthofer, Fax: +43 512507 6190; E-mail:


Ongoing genetic transfer from mitochondria and plastids into the nucleus is well documented. While in metazoan molecular ecology the need for surveillance against pseudogene-mediated artefacts when analysing mtDNA sequences is commonly accepted, no comparable measures have been established for plastid-based studies. We highlight the impact and management of nuclear mitochondrial insertions, argue that nuclear plastid sequences represent an underestimated but major factor in plant molecular ecology, and discuss potential avenues of remedy in chloroplast studies.

In a recent publication in Molecular Ecology Resources, Naciri & Manen (2010) report divergent chloroplast DNA (cpDNA) sequences in the angiosperm Eryngium alpinum and hypothesize that nuclear insertions of cpDNA are the source of their observation. Their inference is, in our view, an important one. Ongoing genetic transfer from mitochondria and plastids into the nucleus is well documented (Ayliffe & Timmis 1992; Zhang & Hewitt 1996; Martin & Herrmann 1998; Bensasson et al. 2001; Richly & Leister 2004; Kleine et al. 2009). Several studies have demonstrated that nuclear copies of mitochondrial DNA (mtDNA) may lead to erroneous phylogenetic inferences (e.g., Zhang & Hewitt 1996; Sorenson & Quinn 1998; Bensasson et al. 2001; Thalmann et al. 2004). Today, it is widely accepted that any use of mtDNA in molecular ecology requires tests to confirm a true mitochondrial origin of the underlying sequences (Song et al. 2008 and references therein). In contrast, awareness of pseudogene-mediated artefacts is still largely lacking for chloroplasts, as Naciri & Manen (2010) argue for plant phylogeography and molecular diagnostics. Here, we go beyond their paper by placing the problem in the more general context of molecular ecology as a whole. We summarize the impact and management of nuclear mitochondrial (NUMT) insertions, make some example-based assumptions about the abundance of nuclear plastid (NUPT) sequences, and briefly discuss how to identify and eliminate NUPT-based artefacts.

MtDNA has been the major source of molecular phylogenetic data of metazoans for almost two decades (Avise 2009). In 1996, Zhang & Hewitt reviewed NUMT-based errors in mtDNA studies for the first time and proposed double bands, unexpected frameshifts, sequence ambiguities, and contradictory tree topologies as major signs of NUMT contamination. By 2001, studies of 64 metazoan species had been confirmed to contain pseudogene artefacts (Bensasson et al. 2001), and, as an example of a more recent review, Yao et al. (2008) documented several NUMT-based errors in clinical disease studies. Of great practical concern, Benesh et al. (2006) demonstrated a case of preferential binding of universal primers to a NUMT sequence. The total frequency of NUMT detections is difficult to assess because NUMTs are infrequently reported once they have been detected (cf. Beckenbach 2009), but as of 29 April 2010, GenBank/Nucleotide contained 757 entries designated as ‘NUMT’. The current toolkit for NUMT identification (see Song et al. 2008 for a review) includes (i) testing for PCR ghost bands after mtDNA amplification, (ii) examination of sequence chromatograms for ambiguities, including examination of PHRED scores, (iii) sequence translation and tests for indels, frameshifts, and premature stop codons, (iv) sequence examination for compositional biases, (v) comparison with mtDNA sequences of closely related taxa, and (vi) BLAST analysis. Strategies for contamination avoidance rely on (a) preferred use of mtDNA-rich tissues, (b) mtDNA enrichment, (c) long PCR, (d) reverse transcription PCR, and (e) use of taxon-specific primers; the latter, however, did not prove very effective in avoiding NUMT amplification in a recent study (Moulton et al. 2010).
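
Test (iii) of the toolkit above lends itself to a short illustration. The Python sketch below flags internal stop codons and length differences that are not a multiple of three in a protein-coding mtDNA fragment. It assumes the vertebrate mitochondrial code (NCBI translation table 2), a known reading frame, and a known reference length; the function names `internal_stops` and `numt_warnings` and the toy sequences are hypothetical and are not taken from Song et al. (2008).

```python
# Stop codons of the vertebrate mitochondrial code (NCBI translation table 2).
# Assumption: other taxa need a different set.
VERT_MT_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def internal_stops(seq, frame=0, stops=VERT_MT_STOPS):
    """Return (position, codon) for every stop codon before the final codon."""
    seq = seq.upper()
    codons = [(i, seq[i:i + 3]) for i in range(frame, len(seq) - 2, 3)]
    # The last complete codon may be a legitimate terminator, so skip it.
    return [(i, codon) for i, codon in codons[:-1] if codon in stops]


def numt_warnings(seq, reference_length, frame=0):
    """Collect simple warning signs of a pseudogene in a coding fragment."""
    warnings = []
    # An indel that is not a multiple of 3 shifts the reading frame.
    if (len(seq) - reference_length) % 3 != 0:
        warnings.append("length differs from reference by a non-multiple of 3")
    stops = internal_stops(seq, frame)
    if stops:
        warnings.append(f"{len(stops)} internal stop codon(s): {stops}")
    return warnings


if __name__ == "__main__":
    # Toy sequences (hypothetical). TGA codes for Trp in this genetic code.
    print(numt_warnings("ATGTGAGCATAA", reference_length=12))
    print(numt_warnings("ATGAGAGCATA", reference_length=12))
```

For invertebrate taxa a different stop set would be needed (under NCBI translation table 5, only TAA and TAG are stops). A clean result does not prove mitochondrial origin, because recently transferred NUMTs may still carry intact reading frames.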

Comparable to mtDNA in metazoans, cpDNA has been the most important molecular tool in phylogenetic and phylogeographic studies of plants for the past quarter century (e.g., Palmer 1987; Clegg 1993; Wu et al. 2007; Qiu et al. 2009). Maternal inheritance in most plant genera has allowed cpDNA to be used for inferring relationships among nearly all taxonomic units, ranging from classes (e.g. among the basal angiosperms, Graham & Olmstead 2000) to intraspecific races and populations (Petit et al. 1993, 1997). Phylogenetic studies have mainly focused on sequence variation patterns in coding regions, whereas phylogeographic variation patterns have been widely studied based on non-coding sequences (Taberlet et al. 1991; Demesure et al. 1995).

Although evidence of cpDNA transfer to the nucleus dates back to the time when NUMTs slowly became a topic in metazoan phylogenetics (Ayliffe & Timmis 1992), and although NUPTs were identified early in genome-sequencing projects of higher plants (Richly & Leister 2004), only a limited number of papers have dealt with the topic in depth to date (e.g. Shamuradov et al. 2003; Leister 2005; Matsuo et al. 2005; Kejnovsky et al. 2006; Kleine et al. 2009). Apart from the recent paper by Naciri & Manen (2010), only Meimberg et al. (2006) emphasized potential consequences of NUPT contamination for the reconstruction of plant phylogenies. GenBank/Nucleotide currently contains no entries designated as ‘NUPT’.

We performed a MEGABLAST search of whole-chloroplast sequences against the corresponding nuclear genomes of four plant species, and an analogous MEGABLAST search of mitochondrial vs. nuclear sequences in two animal species in which NUMTs have been documented. Hits with an alignment length of at least 200 bp and a similarity of at least 85% were assumed to represent pseudogenes and are summarized in Table 1. The search indicates that NUPTs and NUMTs show quite similar values for length and sequence similarity but that in three of the plant species tested here, the absolute (i.e. uncorrected for genome size) numbers of NUPTs are far higher than the number of NUMTs in the animals examined. The NUPT similarity distribution is given in Fig. 1. Exact matches were found in all species except Arabidopsis thaliana, and the similarity class of <100 to 97% similarity to the cpDNA sequence was the one with the highest number of NUPTs.

Table 1.   MEGABLAST hits of organelle vs. nuclear genomes in four plant (Arabidopsis thaliana, Oryza sativa japonica, Populus trichocarpa, Vitis vinifera) and two animal species (Canis lupus familiaris, Pan troglodytes). Only hits with an alignment size ≥200 bp and a similarity ≥85% were considered.

Species                 n hits  Max length (bp)  Mean length (bp)  Median length (bp)  Mean similarity (%)  Median similarity (%)
Arabidopsis thaliana        34             3638             638.8                 495                 92.9                     94
Oryza sativa japonica      543            56200            2167.3                 451                 95.9                     98
Populus trichocarpa       1011             3180             671.9                 485                 95.2                     97
Vitis vinifera            1012             4818             563.7                 334                 92.8                     93
Canis lupus familiaris      47             3427             907.0                 535                 92.1                     93
Pan troglodytes             16             1055             521.3                 386                 87.7                     87

Figure 1.   Histograms of similarity distribution of BLAST hits in four plant species. Except for Arabidopsis thaliana, the similarity class of <100 to 97% similarity to the cpDNA sequence was the one with the highest number of potential NUPTs.
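
The filtering and summary steps behind Table 1 and Fig. 1 can be sketched in a few lines of Python. The sketch assumes that the MEGABLAST hits are available in BLAST's standard 12-column tabular format (`-outfmt 6`), which is not stated in the text. The function names, the 3% class width below 100%, and the example lines are hypothetical and are not data from the study.

```python
import statistics
from collections import Counter

# Column order of BLAST tabular output (-outfmt 6). Assumption: the text
# does not state which output format was used.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Parse tab-separated BLAST lines into (percent identity, length) pairs."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        hits.append((float(row["pident"]), int(row["length"])))
    return hits


def filter_hits(hits, min_len=200, min_ident=85.0):
    """Keep hits with length >= 200 bp and similarity >= 85%."""
    return [(p, n) for p, n in hits if n >= min_len and p >= min_ident]


def summarize(hits):
    """Compute the summary columns reported per species."""
    if not hits:
        return {"n_hits": 0}
    idents = [p for p, _ in hits]
    lengths = [n for _, n in hits]
    return {
        "n_hits": len(hits),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "mean_similarity": statistics.mean(idents),
        "median_similarity": statistics.median(idents),
    }


def similarity_class(pident, width=3.0):
    """Assign a hit to a similarity class; the class width is an assumption."""
    if pident >= 100.0:
        return "100"
    upper = 100.0
    while pident < upper - width:
        upper -= width
    return f"<{upper:g} to {upper - width:g}"


# Made-up example lines, not data from the study.
EXAMPLE = [
    "cp\tchr1\t100.00\t250\t0\t0\t1\t250\t1000\t1249\t1e-100\t460",
    "cp\tchr1\t98.00\t400\t8\t0\t300\t699\t5000\t5399\t1e-150\t650",
    "cp\tchr2\t84.00\t900\t144\t0\t10\t909\t70\t969\t1e-90\t400",
    "cp\tchr3\t90.00\t150\t15\t0\t5\t154\t20\t169\t1e-30\t150",
    "cp\tchr4\t92.50\t600\t45\t0\t40\t639\t800\t1399\t1e-120\t800",
]

if __name__ == "__main__":
    kept = filter_hits(parse_hits(EXAMPLE))
    print(summarize(kept))
    print(Counter(similarity_class(p) for p, _ in kept))
```

Overlapping or fragmented hits are counted separately in this sketch; whether such hits were merged in the original search is not stated.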

An in silico PCR using the primers for the trnH-psbA spacer region proposed for land plant barcoding (Kress & Erickson 2007) suggests the possibility of amplifying a NUPT in Vitis vinifera (Table 2). The putative NUPT is located on chromosome 6 and has 92.7% similarity to the corresponding cpDNA sequence. We suggest that the results of our exercise be viewed as a proof of concept for the relevance of NUPTs to cpDNA-based studies and argue that NUPT contamination is likely to distort such studies in much the same way as NUMTs are known to distort mtDNA-based ones. This is underlined by recent findings on NUPT distribution and copy number in rice, where NUPTs were found on all 12 chromosomes and where 53 of 60 known plastid genes were found to occur in multiple copies throughout the nuclear genome (Akbarova et al. 2010). Future studies may focus on patterns of length and similarity variation among NUPT sequences, the distribution of NUPTs in the nuclear genomes of more species, and whether NUPTs were replicated in repetitive regions of the chromosomes.

Table 2.   Results from in silico PCR with trnH-psbA primers in Vitis vinifera
Amplicon length (bp)               389    398
Mismatches in fwd primer region      0      0
Mismatches in rev primer region      2      2
CG content (%)                    29.0   26.1
Matches                                   369
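
The idea of an in silico PCR can be illustrated with a minimal Python sketch: search a template for a forward primer site and, downstream, for the reverse complement of the reverse primer, tolerating a few mismatches in each. This is not the tool used for Table 2, which is not named in the text. The primers, templates, and function names below are toy placeholders, not the trnH-psbA primers of Kress & Erickson (2007), and only one primer orientation is searched.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def binding_sites(template, site, max_mm):
    """Return (position, mismatch count) for every window within max_mm."""
    k = len(site)
    sites = []
    for i in range(len(template) - k + 1):
        mm = mismatches(template[i:i + k], site)
        if mm <= max_mm:
            sites.append((i, mm))
    return sites


def in_silico_pcr(template, fwd, rev, max_mm=2, max_len=2000):
    """List amplicons with fwd on the given strand and rev on the opposite one."""
    template = template.upper()
    fwd = fwd.upper()
    rev_site = revcomp(rev)  # how the reverse primer reads on the given strand
    products = []
    for i, mm_f in binding_sites(template, fwd, max_mm):
        for j, mm_r in binding_sites(template, rev_site, max_mm):
            length = j + len(rev_site) - i
            # The reverse site must lie downstream of the forward primer.
            if j >= i + len(fwd) and length <= max_len:
                products.append({"start": i, "length": length,
                                 "fwd_mm": mm_f, "rev_mm": mm_r})
    return products


def percent_identity(matches, aligned_length):
    """Similarity as matching positions over aligned length, in percent."""
    return 100.0 * matches / aligned_length


if __name__ == "__main__":
    # Toy primers and templates (hypothetical).
    fwd, rev = "ACGTAC", "GGATCA"
    plastid_like = "TTACGTACGGGGAAAATGATCCTT"
    nupt_like = "TTACGTACGGGGAAAAAATGTTCCTT"  # 2 bp longer, 1 mismatch in rev site
    print(in_silico_pcr(plastid_like, fwd, rev, max_mm=0))
    print(in_silico_pcr(nupt_like, fwd, rev, max_mm=1))
    print(round(percent_identity(369, 398), 1))
```

If the similarity in Table 2 is taken as matching positions divided by the length of the 398-bp amplicon (an assumption), 369/398 gives 92.7%, which agrees with the value stated for the putative NUPT.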

Accepting that NUPTs are of general relevance in molecular ecology means that, for reliable inferences of species and population histories, procedures for NUPT management need to be implemented. What strategies are available? In the short term, strategies successfully applied against NUMT contamination, such as alertness for ghost bands and double peaks and circumvention of preferential pseudogene amplification by the use of multiple primers and loci, should be a good starting point. In many studies, use of plastid-rich leaf tissue is the rule. Particular attention, however, is required if herbarium specimens (e.g. Savolainen et al. 1995) or ancient wood and seed tissues (Parducci & Petit 2004) are being analysed, as is frequently the case in, for example, phylogenetic studies. In such instances, the sequences of the envisaged cpDNA regions should be compared with those derived from fresh leaf material. Identifying and reporting any artefact will help in fine-tuning NUPT surveillance tools. Two important differences, however, make NUPTs more challenging to address than NUMTs. First, the relatively small size of most metazoan NUMTs allows their elimination by long PCR (Arthofer et al. 2006). The reported existence of giant NUPTs (Matsuo et al. 2005) with sizes >10 kbp in some species discourages this approach in plants. In line with this, NUPTs exceeding 2000 bp were also found in all species examined here, although the majority of NUPT sequences were shorter than 1000 bp (Fig. 2). Notably, in plants the length of NUMT inserts also sometimes far exceeds the range known from metazoan species (extreme example: a 620-kbp insertion in A. thaliana, Stupar et al. 2001). Second, a typical 150- to 162-kbp-sized chloroplast genome consists of 40–50% non-coding regions, including intergenic spacers and introns (Palmer 1987). In such regions, translation-based tests are not applicable.

In the medium term, a remedy to these two problems may arise from the increasing ease of next-generation sequencing of individual genomes. While these technologies still face some cost and bioinformatics constraints today, promising studies using them in molecular ecology are already at hand (Tautz et al. 2010; chloroplast example: Whittall et al. 2010). We should not shy away from expecting that the full realization of their potential (Gilad et al. 2009) in the everyday routine of every laboratory will include new approaches to identifying pseudogenes in general.

Figure 2.   Histograms of length distribution of BLAST hits in four plant species. The majority of potential NUPTs were shorter than 500 bp, but long insertions exceeding 2000 bp were found in all species.


We thank Astrid Haara for assistance in data mining, and News and Views Editor Nolan Kane and three anonymous referees for constructive criticism.

W Arthofer, FM Steiner and BC Schlick-Steiner focus on terrestrial animals (insects, harvestmen, millipedes, spiders) and their symbionts (ascomycete fungi, bacteria) of the Alpine environment. Their research topics include Alpine endemism, biogeography, climate-change and conservation biology, morphology and integrative taxonomy and the development of markers for population genetics.

S Schüler is interested in geno- and phenotypic variation of forest trees and the analysis of natural and anthropogenic influences shaping this variation. He utilizes molecular methods with the goal of developing guidelines for sustainable forest management.