Present address: D. C. S. G. Oliveira, Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Bellaterra, BCN 08193, Spain.
Lumi Viljakainen, Department of Biology, University of Oulu, PO Box 3000, 90014 University of Oulu, Oulu, Finland. Tel.: +358-8553-1791; fax: +358 8553 1227; e-mail: email@example.com. Susanta K. Behura, Eck Institute of Global Health, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA. Tel: 574-631-4151; fax: 574-631-7413; e-mail: firstname.lastname@example.org
Many organisms carry nuclear sequences of mitochondrial origin (NUMTs). We have identified 76 NUMTs in 25 genomic locations in the jewel wasp Nasonia vitripennis. The total amount of NUMTs in Nasonia is 42 972 bp, more than four-fold that found in Tribolium castaneum and almost fifty-fold that found in Drosophila melanogaster, whereas Apis mellifera has an even larger amount of NUMT sequence in its genome (over 230 kb). The Nasonia NUMTs were inserted by multiple independent events and frequently involved large fragments spanning multiple mitochondrial genes. Most of the NUMTs are recent transfers that occurred less than one million years ago, after the speciation of N. vitripennis. Duplications and rearrangements in the nucleus have also occurred. The data suggest that NUMTs may be more common in Hymenoptera than in other insects.
Several studies have tried to estimate the extent of NUMTs produced by a single transfer event followed by duplications within the nuclear genome compared with multiple separate integrations (e.g. Bensasson et al., 2000, 2003; Hazkani-Covo et al., 2003; Pamilo et al., 2007). These studies, covering grasshoppers, humans and honeybees, showed that both mechanisms have contributed to the observed NUMT content. The genome-wide estimates of NUMTs in human genomes depend on the method used. By comparing pairs of NUMTs and their flanking non-NUMT regions, Bensasson et al. (2003) have suggested that human NUMTs mostly resulted from separate mitochondrial insertion events. Hazkani-Covo et al. (2003), using a phylogenetic approach, proposed that the majority of the NUMTs in the human genome resulted from duplications of preexisting pseudogenes. In honeybees 2/3 of the NUMTs have been estimated to have originated from duplications of existing insertions within the nuclear genome (Pamilo et al., 2007).
In insects, thorough studies of NUMTs and their evolution are thus far restricted to honeybees and flour beetles, because of the shortage or lack of such sequences in the other sequenced genomes of the fruit fly and the mosquito (Behura, 2007; Pamilo et al., 2007). In this paper we study the NUMT content in the genome of the jewel wasp Nasonia vitripennis (Werren et al., 2010) by using BLAST searches and alignments of the target scaffolds with the N. vitripennis mtDNA (Oliveira et al., 2008). We evaluated the age of the N. vitripennis NUMTs using bioinformatic and phylogenetic approaches, taking advantage of partial nuclear genome sequences (Werren et al., 2010) and mtDNA sequences (Oliveira et al., 2008) for two closely related Nasonia species: N. giraulti and N. longicornis. Finally, the extent of independent transfers and duplications after integration in the nuclear genome were investigated. We report a large amount of recently (less than 1 MY ago) transferred mitochondrial DNA to the nuclear genome and suggest that most of the NUMTs have originated from independent transfers.
Identification and validation of NUMTs
Using BLASTN searches 195 NUMTs were initially found scattered over 29 different genomic scaffolds of N. vitripennis summing up to 27 133 bp. Alignments of the 29 genomic scaffolds with the original mitochondrial query sequences reveal that many of the small NUMTs could be joined to larger ones within the scaffolds. The query sequences were the two large mtDNA genome fragments available for N. vitripennis (for details see Oliveira et al., 2008). The current mitochondrial assembly of N. vitripennis is in two fragments, which together contain all the 13 protein-coding and the two rRNA genes; however, the sequences linking the two fragments have not been identified.
Special attention was paid to excluding spurious NUMTs. The N. vitripennis draft genome was sequenced to 6.2X genome coverage using a whole-genome shotgun sequencing strategy and therefore also includes true mitochondrial sequences. This sequencing strategy was shown to lead to false identification of NUMTs in the fish Fugu rubripes due to misassembly, probably resulting from low sequence coverage (Venkatesh et al., 2006). In the N. vitripennis genome assembly, true mitochondrial sequences or assembly artifacts were identified in four scaffolds: scaffold2008, scaffold3010, scaffold1962, and scaffold2194 (hereafter scaffold and contig numbers refer to the N. vitripennis genome assembly 1.1, HGSC Baylor College of Medicine).
One case involves scaffold2008 of 13 949 bp which contains the entire larger mitochondrial query fragment of 9791 bp (GenBank number EU746609). Scaffold2008 and the mtDNA are 99.7% identical in sequence. However, the pairwise comparison reveals that the sequence differences were mostly concentrated in a small region of 580 bp (4% of the scaffold), the other regions being identical to the mtDNA. The rest of the sequence is derived from a reptig (reptig146), with higher CG content, 44% CG in this repetitive region, compared to 18% in the mtDNA region. Since most of the missing regions of N. vitripennis mitochondria involve the control region, which is expected to be AT-rich, scaffold2008 is likely an artifact and we have excluded it from the NUMT analysis.
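The base-composition contrast used above to flag scaffold2008 can be reproduced with a few lines of code. The following is a minimal sketch and not the authors' procedure; the function name `gc_fraction` and the toy sequence are illustrative assumptions.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T).

    Ambiguity codes such as N are ignored rather than counted as AT.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the assembly.
    toy = "ATTATAATTTGATATTAACA"
    print(f"GC fraction: {gc_fraction(toy):.2f}")
```

Applied separately to the mtDNA-like part and the repetitive part of a scaffold, a function of this kind would give the two percentages that are compared in the text.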
The putative second misassembly case involves scaffold3010, of which the first 1262 bp of a total length of 1807 bp are 100% identical to the end of the second mtDNA query sequence (GenBank accession number EU746613), which is partial at both ends (Oliveira et al., 2008). This means that the last 545 bp of scaffold3010 could be either the yet uncharacterized region of the mtDNA or nuclear DNA misassembled with mtDNA.
The third case involves scaffold1962: the first part of the scaffold contains 1263 bp of the 3′ end of mtDNA fragment EU746609 followed by a 5619 bp sequence stretch, which could be the yet uncharacterized region of the mtDNA, although, again, it is a repetitive region that is relatively CG-rich. These stretches are then followed by 963 bp of the 5′ end of mtDNA fragment EU746613, and then 1247 bp of the 5′ end of mtDNA fragment EU746613 in a reverse orientation, representing a truncated and inverted 16S rRNA. Notably, the first two NUMTs of scaffold1962 are 100% identical to the mtDNA, whereas the last NUMT in reverse orientation is 99.1% identical to the mtDNA. If the last two NUMTs were duplicates it would not be expected that their divergence from the mtDNA would differ, and it is very unlikely that they are independent transfers from the mitochondrion. Thus, this scaffold is likely an assembly artifact.
The fourth case involves scaffold2194, which contained four NUMTs. The first three NUMTs are probably authentic based on their divergence from the mtDNA. The last NUMT (790 bp) at the end of the scaffold seems to be a mixture of a NUMT and mtDNA, with the first half of the sequence carrying several mutations and the other half being identical to mtDNA. In addition to these four cases, a NUMT in scaffold6052 can be included in its totality in another scaffold (scaffold253). As these cannot be confidently identified as NUMTs, and so that we can make conservative estimates of NUMT content in N. vitripennis, we excluded them from our characterization of NUMT insertions. These data show that a mixture of multiple large NUMTs and true mtDNA sequences is problematic for genome assemblies in a similar way to other repetitive sequences.
Excluding the uncertain NUMTs and after joining many of the small NUMT fragments to larger ones, the number of NUMT sequences was reduced to 76 and the number of NUMT locations (scaffolds) to 25 (Table S1). It is noteworthy that two of the 25 scaffolds (scaffold2333 and scaffold3234) are small, less than 1 kb, and contain NUMT DNA only, and 11 scaffolds (44%) either start or end with NUMT sequence, indicating difficulties in the assembly of NUMT containing sequence reads. However, based on sequence differences compared to mtDNA (see below) these NUMTs are regarded as authentic.
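The effect of joining fragmented hits can be illustrated with a simple interval merge. This is only a sketch under stated assumptions: the authors joined fragments using alignments against the mitochondrial query, whereas the code below merges hits on one scaffold purely by distance, and the `max_gap` threshold and coordinates are hypothetical.

```python
def merge_hits(hits, max_gap):
    """Merge (start, end) hits on one scaffold separated by at most max_gap bp.

    Coordinates are 0-based and half-open. This distance-based rule is an
    assumed stand-in for alignment-based joining, not the published method.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous region; the gap becomes part of the NUMT.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(regions):
    """Sum of region lengths in bp."""
    return sum(end - start for start, end in regions)


if __name__ == "__main__":
    hits = [(0, 100), (150, 300), (1000, 1100)]  # hypothetical coordinates
    joined = merge_hits(hits, max_gap=100)
    print(len(hits), total_length(hits))
    print(len(joined), total_length(joined))
```

Joining lowers the number of NUMTs but raises the summed length, because the sequence between adjacent fragments is now counted as part of the NUMT.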
After joining NUMT fragments to larger regions, the total amount of NUMTs increased substantially to 42 972 bp with a size range of 43–1838 bp and a mean length of 565 bp (Fig. 1). The NUMTs have originated from almost all parts of the examined mtDNA, with copy number ranging up to seven when examined in 50-bp windows of the mtDNA (Fig. 2). The frequency distribution of the copy numbers indicates that different parts of the mtDNA are represented unequally in NUMT sequences (χ² test, P < 0.001). The mitochondrial genes ND5, ND4, ND4L, ND6 and Cyt-b have the highest copy numbers, ranging from 5 to 7 copies for some gene regions (Fig. 2) and reaching 9 copies for the gene as a whole.
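A window-based copy number of the kind summarized in Fig. 2 could be tallied as follows. This is a minimal sketch: the function name, the coordinate convention and the rule that any overlap with a window counts as one copy are assumptions, since the text does not specify how partial overlaps were handled.

```python
def window_copy_numbers(intervals, mt_length, window=50):
    """Count NUMT intervals overlapping each fixed-size window of the mtDNA.

    intervals: iterable of (start, end) in 0-based, half-open mtDNA coordinates.
    Any overlap with a window counts as one copy (an assumed convention).
    """
    n_windows = (mt_length + window - 1) // window
    counts = [0] * n_windows
    for start, end in intervals:
        if not 0 <= start < end <= mt_length:
            raise ValueError(f"interval ({start}, {end}) outside 0..{mt_length}")
        first = start // window
        last = (end - 1) // window
        for w in range(first, last + 1):
            counts[w] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical NUMT footprints on a 200-bp stretch of mtDNA.
    print(window_copy_numbers([(0, 50), (25, 120)], mt_length=200))
```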
Phylogeny and divergence of NUMTs
Unrooted phylogenies were constructed for four mitochondrial genes of three Nasonia species and their respective N. vitripennis NUMTs (Fig. 3 and Fig. S1). Phylogenetic reconstructions are limited to overlapping regions shared by several NUMTs. Both Maximum Parsimony (MP) and Maximum Likelihood (ML) criteria produce trees with similar topologies (Fig. 3 shows MP, Fig. S1 shows ML). The phylogenies suggest that there are relatively younger and older NUMTs, and older NUMTs could be shared with the sibling species N. longicornis and N. giraulti. However, it is not obvious where to place the root.
In order to find older NUMTs that could be ancestral to all the three Nasonia species, N. vitripennis NUMTs were used to search the genomes of N. giraulti and N. longicornis by using BLAST. The hits obtained were screened so that hits containing at least 50 bp on both sides of the NUMT-non-NUMT junctions were considered significant matches. Using this method NUMT17-2 (a ND4 copy, not included in the phylogenetic analysis) and NUMT17-3 (a copy of Cyt-b, Fig. 3d) were considered to be shared between N. vitripennis and N. giraulti. None were found to be shared with N. longicornis. Although NUMT17-3 was not found in the sequence reads from N. longicornis, this could easily be due to the relatively low coverage in these species (1X for both N. longicornis and N. giraulti).
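The junction criterion described above can be written as a small predicate. This is a sketch, assuming 0-based half-open coordinates on the query and a known junction position; the function name and the example coordinates are hypothetical.

```python
def spans_junction(hit_start, hit_end, junction, min_flank=50):
    """Return True if a hit covers at least min_flank bp on both sides of a junction.

    hit_start, hit_end: aligned region on the query (0-based, half-open).
    junction: position of the NUMT/non-NUMT boundary on the same query.
    """
    return (junction - hit_start) >= min_flank and (hit_end - junction) >= min_flank


if __name__ == "__main__":
    # Hypothetical hits against a query whose junction lies at position 150.
    for hit in [(90, 220), (120, 300), (100, 200)]:
        print(hit, spans_junction(*hit, junction=150))
```

A hit confined to the NUMT side alone would fail this test, which is the point of the criterion: such a hit could equally come from true mtDNA or from a different NUMT.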
More detailed investigation revealed that only the last 147 bp of NUMT17-3 (758 bp in length) could be aligned with the N. giraulti trace sequence (trace identifier number 1285780596), whereas the flanking non-NUMT regions could be aligned until the end of the N. giraulti sequence. The alignment involved 853 bp of the N. vitripennis scaffold423 containing the NUMT17-3 and 816 bp of the trace sequence of N. giraulti. Alternatively, independent large deletions could have removed this region in N. longicornis and part of the NUMT17-3 in N. giraulti. The uncorrected pairwise nucleotide difference of this ancestral NUMT is 4.3% between N. vitripennis and N. giraulti, compared with an average of approximately 15% for the mtDNA of the same species, and only 1% for nuclear genes (Oliveira et al., 2008). This difference likely reflects the different mutation rates in the nuclear and mitochondrial genomes (Oliveira et al., 2008). Because of the accelerated rate of nucleotide substitutions in the mtDNA of Nasonia, shared NUMTs are expected to have diverged less in their nucleotide composition than the respective mtDNA gene. Pairwise divergence between true mitochondrial DNA and the NUMTs varied from 0% to 14%, with NUMT17-3 being highly divergent (12.6% divergence) (Table S1). NUMT17-3 is one of the most divergent NUMTs, and orthologs of less divergent NUMTs were not found in the N. giraulti and N. longicornis genomes.
Although it appears that most detected N. vitripennis NUMTs were recently transferred to its genome after the speciation from the two other species, both the phylogenetic position and relative level of divergence indicate that some NUMTs are relatively older, and perhaps acquired near or before divergence of N. vitripennis and its two sibling species. A more precise determination cannot be obtained because the phylogenetic trees cannot be adequately rooted. The time of divergence of N. vitripennis from the two other Nasonia species, N. longicornis and N. giraulti, is believed to be around 1 MY (Campbell et al., 1993; Raychoudhury et al., 2009), and it appears that the majority of N. vitripennis NUMTs are younger than that.
Interestingly, there was no negative association between NUMT length and the p-distance (r = −0.098, P = 0.396, Fig. 4a), suggesting that longer NUMTs are not necessarily younger than the shorter ones. Also, the number of short indels in the NUMTs did not correlate with the p-distance (r = 0.069, P = 0.553, Fig. 4b). However, in the phylogeny older NUMTs appear to have more frequent mutations that disrupt the coding sequence, as expected (Fig. 3).
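For readers who wish to reproduce this type of test, the correlation coefficient and its t statistic can be computed with the standard library alone. The sketch below assumes a Pearson correlation, which the text does not state explicitly, and the sample size of 76 used in the example is likewise an assumption (one value per NUMT). Converting t to a P value requires a Student's t distribution, which is not implemented here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two samples of equal length with at least 3 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    # Reported r for length vs. p-distance; n = 76 is an assumption.
    print(round(t_statistic(-0.098, 76), 3))
```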
Duplications and rearrangements of NUMTs in the nuclear genome
The phylogenetic results also suggest that mtDNA insertions into the nuclear genome happened more frequently relative to NUMT duplications within the nucleus (Fig. 3 and Fig. S1). The four phylogenies indicate only a couple of well-supported duplications after insertion in the nucleus, one for ND5 (Fig. 3a and Fig. S1a) and one for Cyt-b (Fig. 3d and Fig. S1d). In addition, duplicates were screened by examining junctions between NUMT (150 bp) and non-NUMT DNA (150 bp) at either 5′ or 3′ ends of the NUMTs, or both if possible. In total 32 NUMT junctions could be screened but no duplicates were found with this approach.
The existence of NUMTs in close proximity to each other indicated that some of the copies had become fragmented after their insertion into the nuclear genome (Fig. 5). Of the 25 NUMT locations, 16 contained fragmented NUMTs with a mean length of 556 bp, and nine contained intact NUMTs with a mean length of 794 bp. In addition, three of the locations containing fragmented insertions also involved inversions (e.g. scaffold579, Fig. 5). Scaffold2012 includes NUMTs originating from at least 10 mitochondrial protein-coding genes. These pseudogenes are 5–12% divergent from the mitochondrial genes, they have many insertions and deletions, and a CoI copy is split by an uncharacterized non-LTR retrotransposon (Fig. 5).
A NUMT in Nasonia giraulti
We also report that a NUMT has been identified during sequencing of the mtDNA of N. giraulti (Oliveira et al., 2008). A fragment was amplified in N. giraulti that had a gene order ND1-CoIII, containing a truncated ND1 at the 3′ end (70 amino acids shorter), while a fragment with this gene order was not amplified for N. vitripennis. Hybrid crosses were performed between N. giraulti and N. vitripennis to test whether this ND1-CoIII fragment was absent from the mitochondrial genome of N. giraulti and was actually a NUMT (details of the crosses are given in Experimental procedures). The results of a PCR amplification test showed that this truncated ND1-CoIII fragment was present in some and absent in other F2 hybrid males, independent of their mitochondria, either N. giraulti or N. vitripennis. This clearly demonstrates that it was located in the nuclear genome of N. giraulti. It is unclear how long this N. giraulti NUMT is; however, it could be as large as 4 kb based on long PCR results (Dennis V. Lavrov, pers. comm.).
Preliminary bioinformatic investigation indicates that N. giraulti and N. longicornis also possess a large number of NUMTs (not shown), as we have demonstrated for N. vitripennis. However, a comprehensive analysis of the NUMT content in these species awaits the completion of their genome sequencing.
We have identified 76 NUMTs in 25 different locations of the N. vitripennis genome. The total NUMT content in the Nasonia genome is ∼43 kb, which represents a density of 14 bp of these mtDNA-like sequences per 100 kb of the nuclear DNA. Although Nasonia has a higher density of NUMT sequences in its chromosomal DNA than the coleopteran T. castaneum (5.6 bp per 100 kb; Pamilo et al., 2007) or the dipteran Ae. aegypti (8 bp per 100 kb; Hlaing et al., 2009), this is a roughly seven-fold lower density than in another hymenopteran, A. mellifera (100 bp per 100 kb; Pamilo et al., 2007). It is interesting to note that the genome sequence of An. gambiae showed no detectable mitochondrial sequences and that of D. melanogaster showed only a few (6–8) copies (Richly & Leister, 2004). It is noteworthy that the NUMT content of Nasonia is probably even larger than we have demonstrated here, as the mitochondrial genome has not yet been fully characterized (Oliveira et al., 2008), and detection of more ancestral NUMTs is complicated by the high divergence rate between insertions and the mitochondria due to the exceptionally high mitochondrial mutation rate in Nasonia [∼35-fold higher than in the nucleus (Oliveira et al., 2008)].
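The density comparison above is simple arithmetic, sketched below. The genome size of 300 Mb is a placeholder chosen only for illustration, because the text does not state the genome size used to obtain 14 bp per 100 kb; the function names are likewise hypothetical.

```python
def density_per_100kb(numt_bp, genome_bp):
    """NUMT density expressed as bp of NUMT sequence per 100 kb of nuclear DNA."""
    if genome_bp <= 0:
        raise ValueError("genome size must be positive")
    return numt_bp / genome_bp * 100_000


def fold_difference(density_a, density_b):
    """How many times larger density_a is than density_b."""
    return density_a / density_b


if __name__ == "__main__":
    ASSUMED_GENOME_BP = 300_000_000  # placeholder, not taken from the text
    print(round(density_per_100kb(42_972, ASSUMED_GENOME_BP), 1))
    # Densities quoted in the text: honeybee 100, Nasonia 14 bp per 100 kb.
    print(round(fold_difference(100, 14), 1))
```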
The phylogenetic analyses in the present study indicate that most of the detected NUMTs have been transmitted relatively recently, after the divergence of N. vitripennis from the closely related N. giraulti and N. longicornis (Fig. 3). For only two of the 76 NUMTs (NUMT17-2 and NUMT17-3) an ortholog was detected in another Nasonia species. A second possibility is that most NUMTs are removed or derived beyond recognition by multiple rearrangements in this time scale (Fig. 5). The divergence of the Nasonia lineages has been estimated to have occurred during the Pleistocene (Campbell et al., 1993), which means that the age of most of the identified NUMTs is less than one million years. As it has been suggested that the insertion of mtDNA into the nucleus is a continuous process (Thorsness & Fox, 1990; Ricchetti et al., 1999; Woischnik & Moraes, 2002), the actual NUMT content in the Nasonia genome may be considerably larger than reported here.
Our analyses suggest that a large majority of the NUMTs have originated from independent transfers from the mitochondrion, whereas fewer are attributable to duplications within the nuclear genome (Fig. 3). In addition, duplicates were not found using within-genome screening of some of the junctions between NUMTs and flanking non-NUMT DNA. In contrast, in the honeybee 2/3 of the NUMTs have been estimated to originate from duplications (Pamilo et al., 2007). In humans two conflicting estimations have been made: that most of the NUMTs have independent origin (Bensasson et al., 2003) or that most are duplicates within the nuclear genome (Hazkani-Covo et al., 2003). The shortage of Nasonia NUMT duplications can be due to the relatively young age of the identified NUMTs, i.e. there has not been enough time for duplications to occur after the integration of NUMTs into the nuclear genome.
Nucleotide divergence between NUMTs and the mtDNA was over 2% in all but one NUMT, ranging up to 14%. Such a broad range indicates that the group of identifiable nuclear copies contains NUMTs of different ages. This group included a NUMT that is identical to the mtDNA and thus transferred very recently, whereas the time of transfer of the most divergent NUMTs is close to the split among N. vitripennis, N. giraulti and N. longicornis (Fig. 3, Table S1). The divergence results further support the idea that the transfer of mtDNA to the nucleus is a continuous process.
Even though it has been noted that the NUMT content across species varies substantially (Richly & Leister, 2004), it is interesting that both hymenopterans with sequenced genomes carry such a large amount of NUMTs in their genomes. In the honeybee, the question was raised whether there could be an association between the high recombination rate and the high frequency of NUMTs (Pamilo et al., 2007). However, in Nasonia the recombination rate is much lower than in the honeybee (Wilfert et al., 2007), and thus cannot explain the high incidence of NUMTs. Interestingly, the nuclear genomes of the three Nasonia species have also been shown to contain species-specific, and thus recently transferred, sequences of their endosymbiotic bacteria Wolbachia (Dunning Hotopp et al., 2007). In contrast, D. melanogaster is infected with Wolbachia but apparently lacks inserted bacterial sequences in its genome, whereas Drosophila ananassae has a large, ∼1 Mb, insert of Wolbachia DNA in its genome. It has been posited that the same processes that lead to NUMTs probably contribute to transfers of DNA from endosymbionts that are found in the germline of many invertebrates (Dunning Hotopp et al., 2007). As more insect genomes are sequenced, it may become possible to elucidate the factors that influence rates of acquisition and maintenance of such foreign DNA.
Bioinformatic detection of NUMTs
Mitochondrial sequences of N. vitripennis described by Oliveira et al. (2008) were used to search for NUMTs in the nuclear genome of N. vitripennis. The mitochondrial sequences included a fragment of 9791 bp (GenBank accession number EU746609) containing 11 protein-coding genes and a fragment of 4434 bp (EU746613) containing sequences of snl (12S rRNA), rnl (16S rRNA), ND1 and ND2. The search was done by using BLASTN against the N. vitripennis genome database (N. vitripennis version 1.0, 2007-Apr-05) through the Human Genome Sequencing Center (http://blast.hgsc.bcm.tmc.edu/blast.hgsc?organism=9). Only hits with expected value E < 0.0001 were counted as NUMTs. Scaffolds in which NUMTs were found were aligned with the original mitochondrial query sequences using DNA Block Aligner (DBA) (Jareborg et al., 1999) in order to join NUMTs that were fragmented due to the BLAST search parameters. A preliminary screen of NUMTs in the draft genomes of N. giraulti and N. longicornis was done by searching with 300 bp stretches, each including 150 bp of NUMT and 150 bp of non-NUMT sequence, from both ends of the NUMTs with E < 0.0001. The sequences were searched against N. giraulti and N. longicornis whole genome sequences using the trace archive database MegaBLAST search on GenBank. Additional BLAST searches using sequences of interest were also performed.
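The two screening steps described here, an E-value cut-off and the construction of 300 bp junction queries, can be sketched as follows. The searches in this study were run through web interfaces; the code below instead assumes BLAST hits saved in the standard 12-column tabular format, and the function names, coordinate convention and toy data are illustrative assumptions.

```python
def filter_hits(lines, max_evalue=1e-4):
    """Keep tabular BLAST hits with E < max_evalue.

    Expects the standard 12 tab-separated columns (query, subject, identity,
    length, mismatches, gap opens, qstart, qend, sstart, send, evalue, bitscore).
    Returns (subject, start, end, evalue) with start <= end (1-based, inclusive).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        evalue = float(fields[10])
        if evalue < max_evalue:
            sstart, send = int(fields[8]), int(fields[9])
            kept.append((fields[1], min(sstart, send), max(sstart, send), evalue))
    return kept


def junction_probes(scaffold, numt_start, numt_end, flank=150):
    """Build 5' and 3' junction queries: flank bp of non-NUMT plus flank bp of NUMT.

    Coordinates are 0-based, half-open. A probe is None when the NUMT or the
    flanking scaffold sequence is shorter than flank.
    """
    probes = {"5prime": None, "3prime": None}
    if numt_end - numt_start >= flank:
        if numt_start >= flank:
            probes["5prime"] = scaffold[numt_start - flank:numt_start + flank]
        if len(scaffold) - numt_end >= flank:
            probes["3prime"] = scaffold[numt_end - flank:numt_end + flank]
    return probes


if __name__ == "__main__":
    toy_scaffold = "A" * 200 + "C" * 300 + "G" * 100  # hypothetical sequence
    probes = junction_probes(toy_scaffold, 200, 500)
    print({side: (len(p) if p else None) for side, p in probes.items()})
```

Returning None for a junction that cannot be built mirrors the statement in the Results that junctions were screened at either end, or both if possible.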
Phylogenetic inference and genetic divergence
Manual alignments of NUMTs and the mtDNA were made of overlapping sequences over 200 bp so that at least four NUMT sequences could be included. Maximum parsimony (MP), with equal character weighting and gaps as missing data, and maximum likelihood (ML), with the GTR+Gamma model of nucleotide substitution (Lanave et al., 1984; Yang, 1994), were used as optimality criteria to generate phylogenetic trees in PAUP* v4.0b10 (Swofford, 2002). Since the number of sequences is small, all possible trees could be examined (by exhaustive searches in PAUP*). Support for branches was obtained by performing 1000 bootstrap replicates and 100 addition sequences for MP, and 1000 bootstrap replicates and 10 addition sequences for ML. Pairwise sequence differences between the mitochondrial sequence and the NUMTs were calculated as p-distance in MEGA version 4 (Tamura et al., 2007).
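The p-distance is the proportion of aligned sites at which two sequences differ. The sketch below is not MEGA's implementation; it assumes that sites with a gap or an ambiguous base in either sequence are skipped (pairwise deletion), which is one of several possible treatments, and the function name is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or a non-ACGT symbol are skipped,
    which is an assumed (pairwise-deletion) treatment of missing data.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


if __name__ == "__main__":
    # Hypothetical aligned fragments of an mtDNA gene and a NUMT.
    print(p_distance("ACGTACGT", "ACGTACGA"))
```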
Hybrid crosses and a PCR-based test for detection of a Nasonia giraulti NUMT
Three different strains were used for hybrid crosses: AsymC (N. vitripennis), RV2r(u) (N. giraulti), and R16A. The latter is a conplastic strain which has N. vitripennis mitochondria and a N. giraulti nuclear genome (Breeuwer & Werren, 1995). Three different crosses were performed: RV2r(u) males vs. AsymC females, AsymC males vs. RV2r(u) females, and AsymC males vs. R16A females. Hybrid F1 females were virgin hosted and 48 F2 hybrid males of each cross were genotyped using mitochondrial primers NMCoIIIb and NMNad1a (primers are described in Oliveira et al., 2008). This primer combination amplifies a 500 bp sequence in N. giraulti but fails to produce amplification in N. vitripennis. This experiment was designed to demonstrate that these primers amplify a N. giraulti NUMT.
We thank Pekka Pamilo for comments on the manuscript. This work was supported by the following grants: Academy of Finland (122210) to LV and National Institutes of Health grant 5R01 GM070026 and Indiana's 21st Century Research and Technology Fund UND250086 to JHW.