In a recent publication in Molecular Ecology Resources, Naciri & Manen (2010) report divergent chloroplast DNA (cpDNA) sequences in the angiosperm Eryngium alpinum and hypothesize that nuclear insertions of cpDNA are the source of their observation. Their inference is, in our view, an important one. Ongoing genetic transfer from mitochondria and plastids into the nucleus is well documented (Ayliffe & Timmis 1992; Zhang & Hewitt 1996; Martin & Herrmann 1998; Bensasson et al. 2001; Richly & Leister 2004; Kleine et al. 2009). Several studies have demonstrated that nuclear copies of mitochondrial DNA (mtDNA) may lead to erroneous phylogenetic inferences (e.g., Zhang & Hewitt 1996; Sorenson & Quinn 1998; Bensasson et al. 2001; Thalmann et al. 2004). Today, it is widely accepted that any use of mtDNA in molecular ecology requires tests to confirm a true mitochondrial origin of the underlying sequences (Song et al. 2008 and references therein). In contrast, awareness of pseudogene-mediated artefacts is still largely lacking for chloroplasts, as Naciri & Manen (2010) argue for plant phylogeography and molecular diagnostics. Here, we go beyond their paper by placing the problem in the more general context of molecular ecology as a whole. We summarize the impact and management of nuclear mitochondrial (NUMT) insertions, make some example-based assumptions about the abundance of nuclear plastid (NUPT) sequences, and briefly discuss how to identify and eliminate NUPT-based artefacts.
MtDNA has been the major source of molecular phylogenetic data of metazoans for almost two decades (Avise 2009). Zhang & Hewitt (1996) were the first to review NUMT-based errors in mtDNA studies and proposed double bands, unexpected frameshifts, sequence ambiguities, and contradictory tree topologies as major signs of NUMT contamination. By 2001, pseudogene artefacts had been confirmed in studies of 64 metazoan species (Bensasson et al. 2001), and, as an example of a more recent review, Yao et al. (2008) documented several NUMT-based errors in clinical disease studies. Of great practical concern, Benesh et al. (2006) demonstrated a case of preferential binding of universal primers to a NUMT sequence. The total frequency of NUMT detections is difficult to assess because NUMTs are seldom reported once they have been detected (cf. Beckenbach 2009), but as of 29 April 2010, GenBank/Nucleotide contained 757 entries designated as ‘NUMT’. The current toolkit for NUMT identification (see Song et al. 2008 for a review) includes (i) testing for PCR ghost bands after mtDNA amplification, (ii) examination of sequence chromatograms for ambiguities, including examination of PHRED scores, (iii) sequence translation and testing for indels, frameshifts, and premature stop codons, (iv) sequence examination for compositional biases, (v) comparison with mtDNA sequences of closely related taxa, and (vi) BLAST analysis. Strategies for contamination avoidance rely on (a) preferred use of mtDNA-rich tissues, (b) mtDNA enrichment, (c) long PCR, (d) reverse transcription PCR, and (e) use of taxon-specific primers; the last of these, however, did not prove very effective in avoiding NUMT amplification in a recent study (Moulton et al. 2010).
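
Step (iii) of the toolkit can be illustrated with a minimal sketch. It is not taken from any of the cited studies; the function name `premature_stops` and the toy sequence are hypothetical. The sketch scans one reading frame of a protein-coding sequence for stop codons that occur before the final codon. Which codons count as stops depends on the genetic code, so the stop-codon set is passed in explicitly; the two sets shown are assumed to follow NCBI translation table 2 (vertebrate mitochondrial) and tables 1/11 (standard and plastid).

```python
# Stop codons, assumed here from the NCBI translation tables:
# table 2 (vertebrate mitochondrial) and tables 1/11 (standard, plastid).
VERT_MT_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def premature_stops(seq, stop_codons, frame=0):
    """Return 0-based nucleotide offsets of in-frame stop codons.

    A stop in the last complete codon is ignored, because a legitimate
    coding sequence may end with one. Alignment gaps ("-") are removed.
    """
    seq = seq.upper().replace("-", "")
    # Split the chosen reading frame into complete codons only.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [frame + 3 * k
            for k, codon in enumerate(codons[:-1])
            if codon in stop_codons]


# Toy sequence (hypothetical): ATG GCA TGA GCC TAA
toy = "ATGGCATGAGCCTAA"
print(premature_stops(toy, STANDARD_STOPS))  # [6]: TGA is a stop here
print(premature_stops(toy, VERT_MT_STOPS))   # []: TGA is not a stop in table 2
```

The toy example also shows why the choice of genetic code matters: the same sequence is flagged under the standard code but passes under the vertebrate mitochondrial code. A premature stop is only one indicator; an intact reading frame does not prove an organellar origin, and the test does not apply to non-coding regions.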
Comparable to mtDNA in metazoans, cpDNA has been the most important molecular tool in phylogenetic and phylogeographic studies of plants for the past quarter century (e.g., Palmer 1987; Clegg 1993; Wu et al. 2007; Qiu et al. 2009). Maternal inheritance in most plant genera has allowed cpDNA to be used for inferring relationships among nearly all taxonomic units, ranging from classes (e.g. among the basal angiosperms, Graham & Olmstead 2000) to intraspecific races and populations (Petit et al. 1993, 1997). Phylogenetic studies have mainly focused on sequence variation patterns in coding regions, whereas phylogeographic variation patterns have been widely studied based on non-coding sequences (Taberlet et al. 1991; Demesure et al. 1995).
Although evidence of cpDNA transfer to the nucleus dates back to the time when NUMTs slowly became a topic in metazoan phylogenetics (Ayliffe & Timmis 1992), and although NUPTs were identified early in genome-sequencing projects of higher plants (Richly & Leister 2004), only a limited number of papers have dealt with the topic in depth to date (e.g. Shamuradov et al. 2003; Leister 2005; Matsuo et al. 2005; Kejnovsky et al. 2006; Kleine et al. 2009). Apart from the recent paper by Naciri & Manen (2010), only Meimberg et al. (2006) emphasized potential consequences of NUPT contamination for the reconstruction of plant phylogenies. GenBank/Nucleotide currently contains no entries designated as ‘NUPT’.
We performed a MEGABLAST search of whole-chloroplast sequences against the corresponding nuclear genomes of four plant species, and an analogous MEGABLAST search of mitochondrial vs. nuclear sequences in two animal species in which NUMTs have been documented. Hits with an alignment length of at least 200 bp and a similarity of at least 85% were assumed to represent pseudogenes and are summarized in Table 1. The search indicates that NUPTs and NUMTs show quite similar values for length and sequence similarity but that, in three of the plant species tested here, the absolute numbers of NUPTs (i.e. uncorrected for genome size) are far higher than the numbers of NUMTs in the animals examined. The NUPT similarity distribution is given in Fig. 1. Exact matches were found in all species except Arabidopsis thaliana, and the similarity class of <100 to 97% similarity to the cpDNA sequence was the one with the highest number of NUPTs.
| Species | n hits | Max length (bp) | Mean length (bp) | Median length (bp) | Mean similarity (%) | Median similarity (%) |
| Oryza sativa japonica | 543 | 56200 | 2167.3 | 451 | 95.9 | 98 |
| Canis lupus familiaris | 47 | 3427 | 907.0 | 535 | 92.1 | 93 |
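
The filtering and summary statistics behind Table 1 can be sketched as follows. This is an illustration only, not the script used for the analysis: it assumes that the MEGABLAST hits were exported in the 12-column tabular format of BLAST+ (`-outfmt 6`, with percent identity in column 3 and alignment length in column 4), and the function name `summarize_hits` and the example lines are hypothetical.

```python
from statistics import mean, median


def summarize_hits(lines, min_length=200, min_identity=85.0):
    """Summarize BLAST tabular hits that pass the pseudogene thresholds.

    Each line is assumed to hold tab-separated fields in the order
    qseqid, sseqid, pident, length, ... (BLAST+ -outfmt 6).
    Returns None if no hit passes both thresholds.
    """
    lengths, identities = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        identity, length = float(fields[2]), int(fields[3])
        if length >= min_length and identity >= min_identity:
            lengths.append(length)
            identities.append(identity)
    if not lengths:
        return None
    return {
        "n_hits": len(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "mean_similarity": mean(identities),
        "median_similarity": median(identities),
    }


# Made-up example hits, not real data.
example = [
    "cp\tchr1\t98.0\t450\t9\t0\t1\t450\t100\t549\t0.0\t780",
    "cp\tchr2\t84.0\t900\t144\t0\t1\t900\t10\t909\t0.0\t600",   # identity too low
    "cp\tchr3\t100.0\t150\t0\t0\t1\t150\t5\t154\t1e-70\t278",   # too short
    "cp\tchr6\t92.0\t250\t20\t0\t1\t250\t7\t256\t1e-90\t350",
]
print(summarize_hits(example))
```

Note that counts obtained this way are absolute hit numbers and, as in Table 1, are not corrected for genome size.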
An in silico PCR using the primers for the trnH-psbA spacer region proposed for land plant barcoding (Kress & Erickson 2007) suggests that a NUPT could be amplified in Vitis vinifera (Table 2). The putative NUPT is located on chromosome 6 and has 92.7% similarity to the corresponding cpDNA sequence. We suggest that the results of our exercise be viewed as a proof of concept that NUPTs are of significant relevance to cpDNA-based studies, and we argue that NUPT contamination is likely to have a similarly disruptive impact on such studies as NUMTs are known to have on mtDNA-based ones. This is underlined by recent findings on NUPT distribution and copy number in rice, where NUPTs were found on all 12 chromosomes and where 53 of 60 known plastid genes were found to occur in multiple copies throughout the nuclear genome (Akbarova et al. 2010). Future studies may focus on patterns of length and similarity variation among NUPT sequences, the distribution of NUPTs in the nuclear genomes of more species, and whether NUPTs were replicated in repetitive regions of the chromosomes.
| Amplicon length (bp) | 389 | 398 |
| Mismatches in fwd primer region | 0 | 0 |
| Mismatches in rev primer region | 2 | 2 |
| CG content (%) | 29.0 | 26.1 |
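
The principle of an in silico PCR that tolerates primer mismatches, as reported in Table 2, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tool used for the analysis: the function names are hypothetical, the primers and template are toy sequences rather than the trnH-psbA primers, only one strand orientation of the template is searched, and ambiguity codes are not handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(template, primer, max_mismatches):
    """Return (position, mismatch count) for every window of the template
    that matches the primer with at most max_mismatches substitutions."""
    k = len(primer)
    sites = []
    for i in range(len(template) - k + 1):
        mm = sum(a != b for a, b in zip(template[i:i + k], primer))
        if mm <= max_mismatches:
            sites.append((i, mm))
    return sites


def in_silico_pcr(template, fwd, rev, max_mismatches=2, max_amplicon=2000):
    """List putative amplicons: a forward primer site followed downstream
    by the reverse complement of the reverse primer."""
    template, fwd, rev = template.upper(), fwd.upper(), rev.upper()
    fwd_sites = find_sites(template, fwd, max_mismatches)
    rev_sites = find_sites(template, revcomp(rev), max_mismatches)
    products = []
    for f_pos, f_mm in fwd_sites:
        for r_pos, r_mm in rev_sites:
            length = r_pos + len(rev) - f_pos  # amplicon includes both primers
            if r_pos >= f_pos + len(fwd) and length <= max_amplicon:
                products.append({"start": f_pos, "length": length,
                                 "fwd_mismatches": f_mm,
                                 "rev_mismatches": r_mm})
    return products


# Toy template (hypothetical) with an exact site for each primer.
template = "TTTT" + "ACGTACGTAC" + "GGGGGGGGGG" + "TTGACCATGA" + "CCCC"
print(in_silico_pcr(template, "ACGTACGTAC", "TCATGGTCAA", max_mismatches=0))
```

Running the same primers against both the plastid genome and the nuclear genome, and comparing amplicon lengths and mismatch counts, gives the kind of side-by-side comparison shown in Table 2. Substitution counts alone ignore the position of a mismatch within the primer, which strongly affects real amplification, so such a hit indicates a possibility rather than a prediction.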
Accepting that NUPTs are of general relevance in molecular ecology means that procedures for NUPT management need to be implemented if species and population histories are to be inferred reliably. What strategies are available? In the short term, strategies successfully applied against NUMT contamination, such as alertness for ghost bands and double peaks and the circumvention of preferential pseudogene amplification by the use of multiple primers and loci, should be a good starting point. In many studies, use of plastid-rich leaf tissue is the rule. Particular attention, however, is required if herbarium specimens (e.g. Savolainen et al. 1995) or ancient wood and seed tissues (Parducci & Petit 2004) are analysed, as is frequently the case in, for example, phylogenetic studies. In such instances, the sequences of the targeted cpDNA regions should be compared with those derived from fresh leaf material. Identifying and reporting any artefact will help to fine-tune NUPT surveillance tools. Two important differences, however, make NUPTs more challenging to deal with than NUMTs. First, the relatively small size of most metazoan NUMTs allows their elimination by long PCR (Arthofer et al. 2006). The reported existence of giant NUPTs (Matsuo et al. 2005) with sizes >10 kbp in some species discourages this approach in plants. In line with this, NUPTs exceeding 2000 bp were found in all species examined here, although the majority of NUPT sequences were shorter than 1000 bp (Fig. 2). It is noteworthy that, in plants, the length of NUMT inserts also sometimes exceeds the range known from metazoan species by far (extreme example: a 620-kbp insertion in A. thaliana, Stupar et al. 2001). Second, a typical chloroplast genome of 150–162 kbp consists of 40–50% non-coding regions, including intergenic spacers and introns (Palmer 1987). In such regions, translation-based tests are not applicable.
In the medium term, a remedy for both problems may arise from the increasing ease of next-generation sequencing of individual genomes. Although these technologies still face some cost and bioinformatics constraints today, promising studies using them in molecular ecology are already at hand (Tautz et al. 2010; chloroplast example: Whittall et al. 2010). We should not hesitate to expect that the full realization of their potential (Gilad et al. 2009) in the everyday routine of every laboratory will include new approaches to identifying pseudogenes in general.