Correspondence: Christine Sacerdot, Unité de Génétique Moléculaire des Levures (URA 2171 CNRS, UFR 927 Université Pierre et Marie Curie), Département Génomes et Génétique, Institut Pasteur, 25 rue du Dr Roux, F-75724 Paris Cedex 15, France. Tel.: +33 1 40 61 30 59; fax: +33 1 40 61 34 56; e-mail: email@example.com
Transfer of fragments of mtDNA to the nuclear genome is a general phenomenon that gives rise to NUMTs (NUclear sequences of MiTochondrial origin). We present here the first comparative analysis of the NUMT content of entirely sequenced species belonging to a monophyletic group, the hemiascomycetous yeasts (Candida glabrata, Kluyveromyces lactis, Kluyveromyces thermotolerans, Debaryomyces hansenii and Yarrowia lipolytica, along with the updated NUMT content of Saccharomyces cerevisiae). This study revealed a huge diversity in NUMT number and organization across the six species. Debaryomyces hansenii harbors the highest number of NUMTs (145), half of which are distributed in numerous large mosaics of up to eight NUMTs arising from multiple noncontiguous mtDNA fragments inserted at the same chromosomal locus. Most NUMTs, in all species, are found within intergenic regions including seven NUMTs in pseudogenes. However, five NUMTs overlap a gene, suggesting a positive impact of NUMTs on protein evolution. Contrary to the other species, K. lactis and K. thermotolerans harbor only a few diverged NUMTs, suggesting that mitochondrial transfer to the nuclear genome has decreased or ceased in these phylogenetic branches. The dynamics of NUMT acquisition and loss are illustrated here by their species-specific distribution.
The term ‘promiscuous DNA’ was coined by Ellis (1982) to denote DNA mobility among the genetic compartments of eukaryotic cells, a phenomenon evidenced by many eukaryotic genome sequences (Richly & Leister, 2004). The transfer of mitochondrial (mt) sequences to the nuclear genome gives rise to the so-called NUMTs (NUclear sequences of MiTochondrial origin) [see Leister (2005) for a review].
The presence of NUMTs in the nuclear genome is a ubiquitous phenomenon observed in dozens of eukaryotes with great variations across species as diverse as humans and plants. Few comparative studies have been performed to characterize NUMTs in related species or subspecies (Pons & Vogler, 2005; Krampis et al., 2006; Behura, 2007; Hazkani-Covo & Graur, 2007). The hemiascomycetous yeasts offer the largest number of complete and annotated genome sequences among a monophyletic group of eukaryotes. Ricchetti et al. (1999) have characterized the NUMT content of S. cerevisiae. In the present study, we performed a detailed comparative analysis of the NUMT content in five other hemiascomycetes: Candida glabrata, Kluyveromyces lactis, Kluyveromyces thermotolerans, Debaryomyces hansenii and Yarrowia lipolytica (Dujon et al., 2004; Dujon, 2006). Candida glabrata is the second causative agent of human candidiasis, phylogenetically closer to S. cerevisiae than to Candida albicans, the major human fungal pathogen. Kluyveromyces lactis is a lactose-utilizing yeast found in cheese and commonly used in genetic studies and biotechnologies. Kluyveromyces thermotolerans is mainly found on grapes. Debaryomyces hansenii is a halotolerant yeast closely related to C. albicans, and Y. lipolytica is an alkane-utilizing yeast, distantly related to all other hemiascomycetes studied so far. All five nuclear genomes are completely sequenced and annotated, as well as the cognate mitochondrial genomes, and some major steps of their evolutionary history have been identified (Dujon, 2005). These yeasts cover a broad evolutionary range that is comparable to the entire phylum of chordates (Dujon, 2006).
This work revealed unexpected interspecies variations in the NUMT abundance and organization across the hemiascomycete phylum. The presence of numerous large mosaics of NUMTs in the genome of D. hansenii illustrates this diversity.
Materials and methods
Genomes analyzed in this work
Candida glabrata, K. lactis, D. hansenii and Y. lipolytica nuclear genome sequences and annotations were retrieved from the ‘Génolevures’ web site (http://cbi.labri.fr/Genolevures/). The ‘Génolevures’ consortium has also sequenced the genome of K. thermotolerans (accession numbers CNS12AB9, CNS12ABA to CNS12ABG).
Full-length mitochondrial sequences were retrieved from the ‘Génolevures’ web site. The mitochondrial genome of D. hansenii was also sequenced by the Génolevures Consortium and annotated by us (EMBL accession number DQ508940). For all species, the sequenced mitochondrial and nuclear genomes come from the same strain: CBS138, CLIB210, CBS6340, CBS767 and CLIB99 for C. glabrata, K. lactis, K. thermotolerans, D. hansenii and Y. lipolytica, respectively.
blastn searches were conducted locally using the complete nuclear genome sequence of each species as the database and the cognate mitochondrial genome sequence as the query. The default parameters were used (blastn matrix [1, −3], gap penalties [existence 5; extension 2], without filtering for low complexity).
In order to select HSPs [‘High-scoring Segment Pair’ (Altschul et al., 1997)], we proceeded as follows: (1) we identified a posteriori the segments of low compositional complexity contained in the query sequence using the formula of Wootton & Federhen (1996). (2) We deduced the length of high-complexity sequence of each HSP, defined as the sum of the lengths of all high-complexity subsequences. (3) We selected HSPs having a predefined minimal length of high complexity, except if their score was higher than a ‘higher cut-off,’ above which we accepted all HSPs, in order to validate low-complexity sequences with very high scores, as true NUMTs. (4) We compared systematically the output of this algorithm with the results of a filtered blastn search. Most HSPs from the filtered blastn were already obtained with our algorithm and some of them were extended. However, our algorithm excluded a few HSPs present in the filtered blastn search, which we recovered. The algorithm will be published elsewhere in more detail.
Taking the NUMT content of S. cerevisiae published by Ricchetti et al. (1999) as a reference, we chose the following parameters: minimal score=20, minimal length of high complexity=22 (which was also the minimal length of an HSP recovered from filtered blastn), higher cut-off of low-complexity tolerance=55.
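The selection rule of step (3), with the parameter values given above, can be sketched in Python as follows. This is only an illustration and not the authors' algorithm, which is stated to be published elsewhere. The `Hsp` record and its `high_complexity_len` field (assumed to be precomputed as in step 2) are hypothetical, and treating the minimal thresholds as inclusive and the higher cut-off as exclusive is an assumption.

```python
from dataclasses import dataclass

MIN_SCORE = 20            # minimal score (value from the text)
MIN_HIGH_COMPLEXITY = 22  # minimal length of high-complexity sequence, nt (from the text)
HIGHER_CUTOFF = 55        # score above which all HSPs are accepted (from the text)


@dataclass
class Hsp:
    """Hypothetical record describing one blastn HSP."""
    score: float
    high_complexity_len: int  # summed length of high-complexity subsequences


def keep_hsp(hsp: Hsp) -> bool:
    """Apply the selection rule of step (3).

    Assumption: 'minimal' thresholds are inclusive (>=) and the higher
    cut-off is exclusive (>); the source does not specify this.
    """
    if hsp.score < MIN_SCORE:
        return False
    if hsp.score > HIGHER_CUTOFF:
        # Very high score: accepted even if mostly low complexity.
        return True
    return hsp.high_complexity_len >= MIN_HIGH_COMPLEXITY
```

Under these assumptions, an HSP with a score of 21 and 25 nt of high-complexity sequence is kept, whereas an HSP with a score of 30 and only 10 nt of high-complexity sequence is rejected.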
Out of the 34 published S. cerevisiae NUMTs, seven were extended by 6–36 nt, using our algorithm. This algorithm also found three new NUMTs and we recovered one new NUMT from a filtered blastn search. This algorithm did not retain five published NUMTs of small size (22–25 nt) and we did not validate them, as they were absent from a filtered blastn search. Then we discarded a 30-nt-long HSP as it consisted of a repeated low-complexity sequence, although it came from a filtered blastn search. Altogether, we updated the NUMT content of S. cerevisiae to 32 NUMTs (see Supporting Information, Table S1).
From K. lactis, we obtained 53 HSPs, out of which 40 were made of repeated sequences from two regions of the mitochondrial genome: a TAATAG repeat present in the COX1 gene (20 HSPs) and a TATG repeat present in a mitochondrial intergene (20 HSPs). These 40 HSPs, made of periodic low-complexity sequences, were discarded. A small HSP (24 nt) mainly made of TA repeats on the nuclear sequence was also discarded. Finally, a 24-nt sequence of mt tRNA-Thr matched with four different copies of nuclear tRNA-Lys, with a single mismatch. We considered that this low score (20) alignment, which was found four times, resulted from sequence similarity between tRNAs and probably not from mtDNA insertion. Consequently, the four HSPs were discarded, leaving eight authentic NUMTs in K. lactis. Exactly the same four alignments of fragments of tRNA sequences were also found in the genome of K. thermotolerans and discarded, leaving only one NUMT in the latter genome.
Some NUMTs appear to be clustered on the chromosomes. In order to define clusters, we calculated the probability for two NUMTs to be closer than a given distance D by chance.
Theoretically, if NUMTs are considered as a set of N points distributed randomly along a straight line of length L (the chromosome), the probability P that the distance of a NUMT to its closest neighbor is lower than D can be shown to be equal to KD where K=2(N−1)/L. This result is obtained by stating that this probability is equal to one minus the probability to have all the (N−1) other NUMTs at a distance larger than D, i.e. outside a segment of length 2D centered on the NUMT. Therefore, $P = 1 - (1 - 2D/L)^{N-1}$, which is approximately equal to $2(N-1)D/L$, if we assume the distance D to be much shorter than the length of the chromosome L (i.e. D/L≪1).
For each species and for every chromosome containing at least two NUMTs, we calculated the value of D associated with a probability of $10^{-2}$ and compared this value with the actual distances separating every NUMT from its closest neighbor. The value of D was taken as a cut-off to define the clusters on each chromosome (Table 1).
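As an illustration, the cut-off D can be obtained by inverting the relation between P, N, L and D given above. The sketch below is not the authors' code: the function names are hypothetical, and representing each NUMT by a single chromosomal coordinate when grouping is an assumption, because the source does not state between which endpoints the distances were measured.

```python
def cluster_cutoff(n_numts: int, chrom_len: float, p: float = 0.01,
                   exact: bool = False) -> float:
    """Distance D such that a NUMT has probability p of having its
    closest neighbor nearer than D by chance.

    exact=False uses the approximation P = 2(N-1)D/L;
    exact=True inverts P = 1 - (1 - 2D/L)**(N-1).
    """
    if n_numts < 2:
        raise ValueError("at least two NUMTs are needed on the chromosome")
    if exact:
        return (chrom_len / 2) * (1 - (1 - p) ** (1 / (n_numts - 1)))
    return p * chrom_len / (2 * (n_numts - 1))


def group_clusters(positions, d):
    """Group NUMT coordinates: consecutive NUMTs closer than d share a cluster."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < d:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups
```

For a hypothetical chromosome of 1,000,000 bp carrying 11 NUMTs, the approximation gives D = 500 bp, and the exact inversion gives a value only slightly larger, which illustrates why the approximation is adequate when D/L≪1.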
Table 1. Calculation of the distances that define a cluster of NUMTs
Longest distance <D
Shortest distance >D
The distance D associated to a probability of 0.01 was calculated for each chromosome containing at least two NUMTs (see Materials and methods). The absence of distance <D (‘none’ in column 4) is characteristic of a chromosome devoid of clustered NUMTs. The absence of distance >D (‘none’ in column 5) is characteristic of a chromosome containing only clustered NUMTs.
Depending on the succession of the mitochondrial segments of the NUMTs found in a cluster, two situations were distinguished: (1) when the NUMTs in the cluster were in the same order and orientation as the corresponding mtDNA segments on the mitochondrial map and separated by intervening sequences of similar sizes, the cluster was classified as a ‘procession.’ Processions correspond to a single mtDNA segment transferred to the nucleus, followed by mutational decay. (2) Otherwise, the cluster was classified as a ‘mosaic.’ Mosaics correspond to multiple mtDNA segments transferred to the same chromosomal locus.
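The distinction between processions and mosaics can be sketched as a simple rule. This is a hypothetical illustration, not the procedure used in this work: the `Numt` record, the `size_tolerance` parameter (the text says only that intervening sequences are of "similar sizes") and the decision to ignore the circularity of the mitochondrial map are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Numt:
    """Hypothetical record: nuclear and mitochondrial coordinates of one NUMT."""
    nuc_start: int
    nuc_end: int
    mt_start: int   # mt_start < mt_end on the mitochondrial map
    mt_end: int
    strand: int     # +1 or -1: orientation relative to the mtDNA


def classify_cluster(numts: List[Numt], size_tolerance: float = 0.5) -> str:
    """Return 'procession' or 'mosaic' for a cluster of NUMTs."""
    ordered = sorted(numts, key=lambda n: n.nuc_start)
    # All NUMTs of a procession share the same orientation.
    if len({n.strand for n in ordered}) != 1:
        return "mosaic"
    forward = ordered[0].strand == 1
    for a, b in zip(ordered, ordered[1:]):
        nuc_gap = b.nuc_start - a.nuc_end
        mt_gap = (b.mt_start - a.mt_end) if forward else (a.mt_start - b.mt_end)
        if mt_gap < 0:
            return "mosaic"  # mitochondrial segments are not in the same order
        # Intervening sequences must be of similar size (assumed tolerance).
        if abs(nuc_gap - mt_gap) > size_tolerance * max(nuc_gap, mt_gap, 1):
            return "mosaic"
    return "procession"
```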
Note that in this work, the term ‘NUMT’ always refers to the segment of nuclear sequence homologous to a segment of mitochondrial genome, identified in practice as the hit sequence of a validated HSP; it does not refer to the whole cluster.
Detection of duplicated NUMTs within a species
Within each species, we searched for NUMTs that would be duplicated along with their chromosomal flanking regions. We selected pairs of NUMTs that originated from the same mitochondrial sequence or from largely overlapping mitochondrial segments (overlap longer than 70% of the longer NUMT). Then we looked for a possible similarity between the flanking nuclear DNA sequences of the two NUMTs by blastn search using a 2-kb nuclear region encompassing each NUMT as a query. We used the same blastn parameters as for NUMT detection.
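The pre-selection of candidate pairs described above (mitochondrial overlap longer than 70% of the longer NUMT) can be sketched as follows. The data layout, a dictionary mapping hypothetical NUMT names to half-open mitochondrial intervals, is an assumption made for illustration; the subsequent blastn comparison of flanking regions is not shown.

```python
from itertools import combinations


def mt_overlap_fraction(a, b):
    """Fraction of the longer NUMT covered by the mitochondrial overlap.

    a and b are (start, end) intervals on the mitochondrial genome,
    with end exclusive (an assumed convention).
    """
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    longer = max(a[1] - a[0], b[1] - b[0])
    return overlap / longer


def candidate_pairs(numts, min_fraction=0.70):
    """Return pairs of NUMT names whose mitochondrial segments largely overlap."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(numts.items(), 2)
        if mt_overlap_fraction(a, b) > min_fraction
    ]
```

For example, two NUMTs derived from mitochondrial positions 100–200 and 120–210 overlap over 80 nt, that is, 80% of the longer one, and would be retained as a candidate pair.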
Search for orthologous NUMTs between species
Note that the yeast species analyzed here cover a large evolutionary distance. Unlike primates, their genomes have undergone extensive reshuffling of genetic maps. When comparing their genomic sequences, mosaics of short conserved syntenic blocks separated by numerous breakpoints and often containing internal inversions of a few genes are found between all chromosomes (Dujon, 2005). In order to find orthologous NUMTs, we searched first for NUMTs that would appear in the same syntenic block between two species. In a first step, we defined the gene environment of each NUMT as the 10 surrounding genes (the five upstream and five downstream annotated protein coding genes) and searched for NUMTs that shared at least two homologous genes in their gene environment: we found seven pairs of NUMTs or NUMT loci satisfying this condition. However, these regions did not correspond to a syntenic block involving the corresponding chromosomes, according to the syntenic blocks calculated by D. Sherman (unpublished results), with the smallest blocks containing at least three conserved genes in the same order along the chromosomes.
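The first step of this search, comparing the gene environments of two NUMTs, can be sketched as below. This is a hypothetical illustration: genes are reduced to a single coordinate and a family identifier, and homology between genes is assumed to be represented by a shared family identifier, neither of which is specified in the source.

```python
from bisect import bisect_left


def gene_environment(numt_pos, genes, flank=5):
    """Families of the `flank` genes upstream and downstream of a NUMT.

    genes: list of (position, family_id) tuples sorted by position
    (an assumed, simplified representation of the annotation).
    """
    positions = [pos for pos, _ in genes]
    i = bisect_left(positions, numt_pos)
    neighbours = genes[max(0, i - flank):i] + genes[i:i + flank]
    return {family for _, family in neighbours}


def share_environment(env_a, env_b, min_shared=2):
    """True if two gene environments share at least `min_shared` families."""
    return len(env_a & env_b) >= min_shared
```

Pairs passing this test would then still need to be checked against the syntenic blocks, as described in the text.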
Identification of orthologous proteins from Pichia stipitis and protein alignments
Total genomic DNA was extracted from D. hansenii sequenced strain CBS767 using the same procedure as for S. cerevisiae. Primer sequences are available upon request.
NUMT content in six hemiascomycetous yeast genomes
We identified all NUMTs present in six hemiascomycetous yeasts: S. cerevisiae, C. glabrata, K. lactis, K. thermotolerans, D. hansenii and Y. lipolytica (see Tables S1–S6), using a new algorithm that selects HSPs from a blastn search upon their score and the amount of high-complexity sequence they contain (see Materials and methods). We found 32 NUMTs in S. cerevisiae, in agreement with the publication of Ricchetti et al. (1999) (34 NUMTs), with six small (22–30 nt) low-complexity sequences that were not retained by our algorithm and four new NUMTs. Moreover, the size of seven NUMTs was significantly extended (by 6–36 nt). A striking variation in the NUMT number was observed across the species, with 145 NUMTs in D. hansenii, by far the largest number observed here, and only one low-score NUMT in K. thermotolerans (length 25 nt, score 21, e-value of $5.11 \times 10^{-2}$). Candida glabrata, K. lactis and Y. lipolytica genomes contain, respectively, 14, 8 and 47 NUMTs. Altogether, the number of NUMTs varies by two orders of magnitude across the six species (Fig. 1a), without correlation with the size of the corresponding mitochondrial genomes (Table 2).
Table 2. NUMT occurrence in six hemiascomycetous yeast species
The NUMT content of the genome of Saccharomyces cerevisiae, previously published by Ricchetti et al. (1999), has been updated using our method of detection.
Nuclear genome size (Mb)
Mitochondrial genome size (kb)
Total transferred mitochondrial DNA (bp)
Transferred mitochondrial DNA (%)
The percentage of sequence identity between the NUMTs and their mitochondrial counterpart varies from 82% to 100% under our conditions of detection. Two species lack NUMTs with 100% sequence identity: K. thermotolerans (one 96% identity NUMT) and K. lactis (87–97% identity NUMTs) (Fig. 1b).
All NUMTs are in the same size range, with the largest NUMT being found in the C. glabrata genome (388 nt). The average (and median) lengths of NUMTs are 71 (60), 102 (80.5), 50 (47.5), 65 (54) and 43 (38) nt, respectively, in S. cerevisiae, C. glabrata, K. lactis, D. hansenii and Y. lipolytica genomes. We verified that the NUMTs do not result from sequence assembly errors by PCR analysis on the nuclear DNA of D. hansenii and confirmed the presence of 28 NUMTs in this species. Figure 2 shows the amplification of 10 chromosomal loci.
No orthologous NUMTs were detected between the species analyzed here, suggesting that each NUMT arose from a novel insertion that occurred in each species lineage. This result is consistent with the broad evolutionary range covered by these yeasts (see Materials and methods) and with the likelihood that NUMTs disappear by mutational decay too rapidly to leave recognizable traces over such evolutionary distances.
In most cases, a given mitochondrial segment was present only once in the genome. However, we detected one duplication of NUMTs in D. hansenii, one NUMT was present in six identical copies in Y. lipolytica, and we confirmed the four duplicated NUMTs described previously in S. cerevisiae (Ricchetti et al., 1999). In all cases, the flanking sequences of the NUMTs are identical or highly similar, suggesting that the duplication of the NUMTs results from the duplication of a larger chromosomal region after insertion of the mtDNA segment. The D. hansenii paralogous NUMTs (31 nt long, 100% sequence identity with mtDNA) were found within an 8.9 kilobase (kb) segmental duplication of subtelomeres of chromosomes D and F. The six paralogous NUMTs of Y. lipolytica were all in subtelomeric regions. The sequence identity between the subtelomeric regions containing the NUMT is much higher (96–100%) than the sequence identity found between the NUMTs and their relative mitochondrial segment (89–91%), indicating that mtDNA inserted first into one chromosome and spread later by subtelomere shuffling. Louis & Haber (1991) had already suggested that NUMTs could propagate by subtelomeric shuffling in S. cerevisiae.
NUMT locations on the chromosomes
Using present annotations of the genomes, we examined the position of each NUMT in its genetic environment. All NUMTs detected in C. glabrata, K. lactis, K. thermotolerans and Y. lipolytica were intergenic. All updated NUMTs of S. cerevisiae were also within intergenic regions. The two NUMTs of S. cerevisiae described previously within an ORF (Ricchetti et al., 1999) are composed of low-complexity sequences and were not validated as NUMTs by our criteria (see Materials and methods). In D. hansenii, 92% of the NUMTs (133/145) were found within intergenic regions. The other 12 NUMTs overlap one (in one case, two) annotated gene(s), and can be classified into five categories.
(1) Four NUMTs (D6, D86, D97, D129 in Table S5) overlap by 8 to 31 nt the end of a short annotated ORF (117 nt to 219 nt). In all cases, the ORF harbors no similarity to known genes. Either a true gene has been interrupted by a NUMT, or the ORF is spurious and the NUMT is actually intergenic. Similarly, one NUMT (D47 in Table S5) overlaps by 10 nt the beginning of a short ORF (216 nt).
(2) Five NUMTs (D8, D37, D47, D91, D138 in Table S5) overlap by 1 to 39 nt significantly longer genes (804 nt to 1520 nt), including either start (3 NUMTs) or stop (2 NUMTs) codons (Fig. 3a). The length of these ORFs and their sequence similarities to known genes suggest that they are true genes. It is noteworthy that all NUMTs of this category are low-identity NUMTs (87–93% sequence identity with the mitochondrial segment), suggesting that they evolved to fit with the function of the gene.
(3) One case concerns a NUMT within an intron (Fig. 4a and b). NUMT D105 (Table S5) is located within the S1 region situated between the 5′-splice site and the branchpoint of the third intron of DEHA2F06314g, which contains five introns. This part of spliceosomal introns is variable in length, which makes possible the insertion of a 51-nt-long NUMT without threatening splicing. Note that this is one of only two genes of D. hansenii with five introns, the other genes having far fewer introns (mostly one) or no intron at all (the majority of the genes).
(4) One NUMT (D64 in Table S5) was found within a highly conserved gene (DEHA2D14124g). A multiple alignment between orthologues from other yeast species revealed highly similar protein sequences in this locus, without a gap (data not shown). However, no NUMT was detected within the orthologous genes from the other species, suggesting that this low-score NUMT (29 nt, 93% sequence identity) is a fortuitous match rather than a real NUMT.
(5) One 148 nt long NUMT (D74 in Table S5) contains a spurious ORF (antisense of the mitochondrial gene ATP9); however, the mitochondrial insertion did not occur within a preexisting gene.
In order to gain more insight into the five genes overlapping a NUMT (Fig. 3a), we compared their protein sequence with their possible orthologue from P. stipitis, a yeast that is phylogenetically close to D. hansenii (Jeffries et al., 2007). Significant alignments could be generated for three genes (Fig. 3b and Supporting Information). DEHA2C03388g protein differs from its orthologue from P. stipitis mainly by an extended N-terminal domain provided by the NUMT, provided that its coding sequence actually starts at the first ATG (Fig. 3b). On the contrary, alignment shows that DEHA2F00198g protein sequence starts 114 amino acids (aa) downstream of the P. stipitis protein orthologue, suggesting that the insertion of the NUMT interrupted the gene of D. hansenii, followed by loss of the 5′ coding sequence (see alignment in Appendix S1). Finally, despite the presence of a NUMT overlapping the stop codon of the target gene, DEHA2G19470g protein does not appear to be significantly truncated, because the alignment with its orthologue from P. stipitis shows a truncation of just one aa (see alignment in Appendix S1).
As NUMTs are mainly located in intergenic regions, we asked whether we could find NUMTs overlapping pseudogenes. The pseudogenes were identified by protein sequence similarity of the translated intergenic regions in the six reading frames, using the proteome of the six yeast species, including mitochondrial proteins (I. Lafontaine, unpublished data). Two kinds of pseudogenes were found, which correspond to different situations: (1) the mitochondrial pseudogenes are traces of mitochondrial coding sequences and were all found overlapping a NUMT detected by our method (based on DNA sequence similarity). NUMTs associated with a mitochondrial pseudogene were found in all species, except K. lactis and K. thermotolerans (Table 3). (2) The nuclear pseudogenes are traces of nuclear genes whose sequences have been degraded over evolutionary time. Among the mutations, the insertion of a NUMT might be the event that triggered pseudogenization. Alternatively, the NUMT inserted within an intergene already containing a pseudogene. NUMTs overlapping (or within) nuclear pseudogenes were found only in D. hansenii (Table 4). In most cases (four of five), only a small portion of the original gene shows detectable similarity to the pseudogene (12–25% of the length of the homologous protein). NUMTs associated with these pseudogenes either overlap one extremity of the pseudogene or are located within it. Three NUMTs clustered on chromosome D were found within the same pseudogene (Fig. 5a). One full-length pseudogene containing a NUMT was also detected. The low degradation of the pseudogene allowed the identification of the main mutations that degraded the sequence of the original gene (Fig. 5b).
Table 3. Number of pseudogenes overlapped by a NUMT in the six species
Pseudogenes were identified by I. Lafontaine (unpublished data), using the nuclear and mitochondrial proteomes of all species.
Table 4. NUMTs overlapping (or included in) nuclear pseudogenes of Debaryomyces hansenii
Analysis of the distances separating the NUMTs on chromosomes revealed single NUMTs and NUMTs grouped into clusters. We defined clusters as sets of NUMTs separated by a distance associated with a probability lower than 0.01 to occur by chance (see Materials and methods). The proportion of single NUMTs varies considerably across the species analyzed, from 0% in K. lactis to 86% in C. glabrata, if we exclude K. thermotolerans with its single NUMT (Table 5). We distinguished two main types of clusters, depending on the order and orientation of the fragments in the mitochondrial genome compared with the nuclear cluster. The ‘procession’ type is characterized by a succession of NUMTs that are collinear to their associated mitochondrial segments. Processions probably result from a single primary insertion of mtDNA, whose sequence has degenerated over time, leaving small detectable regions homologous to mtDNA (the NUMTs) separated by more diverged sequences. We identified five processions altogether: one in S. cerevisiae, one in C. glabrata, one in K. lactis and two in D. hansenii. Dot-matrix analyses showed the extent of homology between the chromosomal and mitochondrial sequences, revealing less conserved homologous fragments (Fig. 6). The smaller procession of D. hansenii (cluster 3 in Table S5, ‘DEHA 3’ in Fig. 6) differs from the others as it contains a small intact NUMT (23 nt, 100% sequence identity), suggesting that this cluster may not have arisen from a single insertion event.
Table 5. Genomic organization of the NUMTs in the different species
Number of single NUMTs
Number of NUMTs in a cluster
Number of clusters
Number of mosaics
Number of processions
The procession is within a mosaic
One cluster is a tandem repeat of NUMTs
The second type of cluster consists of ‘mosaics,’ where NUMTs originate from disparate mitochondrial fragments. Mosaics are by far the most common type of cluster, with a total of 32 mosaics (vs. five processions) across the six species (Table 5). However, no mosaic was found in C. glabrata, while all NUMTs of K. lactis are distributed in three mosaics, one of which contains an internal procession of two NUMTs. In S. cerevisiae, three mosaics had been reported previously (Ricchetti et al., 1999) and we found an additional NUMT in one of them; moreover, we identified another mosaic due to a new NUMT joined to a previously described NUMT. Debaryomyces hansenii contains 23 mosaics, by far the largest number, accounting for 54% of its NUMTs. Besides nine mosaics of two NUMTs and five mosaics of three NUMTs, this species harbors the largest mosaics: six mosaics of four NUMTs, one mosaic of six NUMTs, one of seven NUMTs and one of eight NUMTs, with an average number of 3.4 NUMTs per mosaic (Fig. 7). Most mosaics contain NUMTs from both orientations, suggesting that mtDNA pieces inserted in random orientation.
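The figures quoted for D. hansenii can be cross-checked from the size distribution of its mosaics. The short Python sketch below only reproduces this arithmetic from the numbers given in the text; the variable names are illustrative.

```python
# Mosaic size (number of NUMTs) -> number of mosaics of that size
# in D. hansenii, as listed in the text.
mosaic_sizes = {2: 9, 3: 5, 4: 6, 6: 1, 7: 1, 8: 1}
TOTAL_NUMTS = 145  # total number of NUMTs in D. hansenii

n_mosaics = sum(mosaic_sizes.values())
numts_in_mosaics = sum(size * count for size, count in mosaic_sizes.items())
mean_size = numts_in_mosaics / n_mosaics
share_of_numts = numts_in_mosaics / TOTAL_NUMTS

print(n_mosaics)                     # 23 mosaics
print(numts_in_mosaics)              # 78 NUMTs located in mosaics
print(round(mean_size, 1))           # 3.4 NUMTs per mosaic
print(round(100 * share_of_numts))   # 54 (% of all NUMTs)
```

The distribution therefore accounts for 78 of the 145 NUMTs, consistent with the reported 23 mosaics, the average of 3.4 NUMTs per mosaic and the 54% share.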
In the yeast S. cerevisiae, mosaics were associated with multiple insertions of noncontiguous mitochondrial fragments as observed during the repair of a chromosomal double-strand break (DSB) (Ricchetti et al., 1999). In the in vivo double insertion events, the two fragments shared a terminal homology of 2–3 nt. Similarly, we found NUMTs overlapping their neighbor by a few nucleotides. Out of 55 mosaic junctions in the genome of D. hansenii, 24 correspond to overlapping NUMTs (by 1–7 nt), seven to exact junctions, four to nearly exact junctions (1 nt gaps) and 12 are gaps of 3–32 nt. These 47 junctions (85%) can be considered as ‘tight junctions.’ The remaining cases (8/55, 15%) concern NUMTs that are separated by 63–328 nt from their neighbor, and can be considered as ‘loose junctions.’ Both tight and loose junctions coexist in three mosaics (clusters 7, 18, 20 in Table S5, see Fig. 7b for mosaic cluster 18).
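The junction counts above can be tallied in a few lines of Python. The sketch only reproduces the arithmetic of the text; the category labels are illustrative, and counting a mosaic of n NUMTs as having n − 1 junctions is an assumption that happens to reproduce the reported total.

```python
# Junction categories in D. hansenii mosaics (counts from the text).
tight_junctions = {
    "overlap of 1-7 nt": 24,
    "exact junction": 7,
    "nearly exact junction (1-nt gap)": 4,
    "gap of 3-32 nt": 12,
}
LOOSE_JUNCTIONS = 8  # NUMTs separated by 63-328 nt

n_tight = sum(tight_junctions.values())
n_total = n_tight + LOOSE_JUNCTIONS

# Assumption: a mosaic of n NUMTs has n - 1 junctions.
# Mosaic size -> number of mosaics of that size, as given in the text.
mosaic_sizes = {2: 9, 3: 5, 4: 6, 6: 1, 7: 1, 8: 1}
n_from_sizes = sum((size - 1) * count for size, count in mosaic_sizes.items())

print(n_tight, n_total, n_from_sizes)          # 47 55 55
print(round(100 * n_tight / n_total))          # 85 (% tight junctions)
print(round(100 * LOOSE_JUNCTIONS / n_total))  # 15 (% loose junctions)
```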
In the present study, we report the diversity of the NUMT content across the hemiascomycete phylum and provide a detailed genome-wide investigation of NUMT organization within six species.
The size of the NUMTs in all six yeast species does not exceed 400 nt, which is small in comparison with other eukaryotes where they can reach several thousand nucleotides (Richly & Leister, 2004). The longest NUMT was found in the genome of C. glabrata (388 nt) and Y. lipolytica harbors the smallest average and median length of NUMTs (43 and 38 nt, respectively). Traces of longer insertions were detected as processions of NUMTs, which reveals that mtDNA fragments of >520 nt (in C. glabrata) and 800 nt (in D. hansenii) have integrated into these nuclear genomes. The surprising absence of recent insertions as long as processions (the longest NUMT with 100% sequence identity is 230 nt, found in S. cerevisiae) suggests that long insertions were rare in recent times. The small size of all yeast NUMTs could be linked to the compactness of the yeast genomes. However, this is not a general rule because the compact genome of Neurospora crassa harbors numerous long NUMTs (up to 4136 nt, average size 647 nt) while the large genome of Rattus norvegicus contains only small NUMTs (Richly & Leister, 2004).
All NUMT sequences in this study were 82–100% identical to the mtDNA. Mismatches between NUMTs and mtDNA result either from nuclear mutations after transfer or from mutations in the mitochondrial genome since the transfer occurred.
The most striking results are the high number of NUMTs and the presence of numerous large mosaics in the genome of D. hansenii, which contains up to eight mitochondrial fragments at the same locus. The fragments originate from different regions of the mtDNA, inserted independently of their orientation and order along the mitochondrial genome. Tight mosaics (when the NUMTs are very close to each other or overlap) are likely to result from the capture of several fragments of mtDNA during the repair of a chromosomal DSB, as reported experimentally in S. cerevisiae (Ricchetti et al., 1999). It is not excluded that some mosaics result from more than one insertion event. For example, mosaic cluster 16 (Table S5) combines a diverged NUMT (91% sequence identity over 106 nt) and an intact NUMT (100% sequence identity over 40 nt), which are likely to be the result of two successive insertions. However, due to the small size of the NUMTs, their sequence conservation compared with the mtDNA does not allow accurate estimation of the time of insertion. Loose mosaics (when the NUMTs are more distant from each other) could be the result of successive insertions in a fragile region of the chromosome. The high number of mosaics in the genome of D. hansenii accounts for 54% of the NUMTs and 27% of the NUMT loci and suggests that an important reservoir of mtDNA fragments is available to be captured during the repair of DSBs. This yeast might undergo a continuously high rate of mtDNA fragmentation, or bursts of massive mtDNA fragmentation under some conditions, like stress.
Another striking result is the quasi absence of NUMTs in the genome of K. thermotolerans. The only validated HSP harbors an e-value of $5.1 \times 10^{-2}$, below the threshold chosen by Richly & Leister (2004). Interestingly, K. lactis, which is phylogenetically close to K. thermotolerans, harbors only degraded NUMTs (average and median % sequence identity=91%, and no NUMT with 100% sequence identity), as if mitochondrial transfer had decreased or ceased in this evolutionary branch.
Insertion of mtDNA into nuclear genomes is a potentially mutagenic phenomenon. Recent insertions in the human genome have been associated with diseases (Willett-Brozick et al., 2001; Borensztajn et al., 2002; Turner et al., 2003; Goldin et al., 2004). As already observed in S. cerevisiae, we found most NUMTs in intergenic regions (92% in D. hansenii and 100% in the other species). However, we discovered new situations. We found a NUMT within an intron, the preferential location of recent human-specific NUMTs (Ricchetti et al., 2004). This NUMT is located in a part of the intron where it should not interfere with splicing. We also found five NUMTs of D. hansenii that overlap one extremity of a gene, potentially changing gene expression and/or resulting in protein truncation or extension (Fig. 3a). All five NUMTs are degraded, indicating old insertions; nevertheless, the open reading frames have not been interrupted by nonsense mutations, suggesting that the genes are still functional and have adapted to the presence of the NUMT. Moreover, one of these NUMTs could have acted positively on the evolution of the overlapped gene by addition of a new protein sequence (DEHA2C03388g, Fig. 3a and b). This NUMT contains four non-synonymous mutations out of five, suggesting that it may undergo positive selection, as observed for the evolution of new genes (Long et al., 2003). This hypothesis is consistent with the recent study of Noutsos et al. (2007), who found a positive impact of NUMTs on evolution.
The diversity in the NUMT content of the six genomes analyzed here may be explained by differences in mtDNA fragmentation or escape rate, differences in transfer to the nucleus or in the efficiency of integration into chromosomes, or different rates of mutational decay over evolutionary time. The causes of mtDNA fragmentation and the mechanisms of transfer to the nucleus are largely unknown. Experimental evidence has shown that NUMT acquisition ultimately relies on the DNA repair machinery in S. cerevisiae. Most genes involved in DSB repair are conserved across the six hemiascomycetes analyzed here (Richard et al., 2005), including K. thermotolerans (unpublished). The nonhomologous end-joining (NHEJ) pathway has been investigated experimentally in K. lactis, where integration of nonhomologous DNA molecules into the genome has been found 1000-fold more frequently than in S. cerevisiae (Kegel et al., 2006). This observation shows that differences in DNA repair efficiency do exist among the species. It also indicates that the low number of NUMTs in the genome of K. lactis is not due to low efficiency of NHEJ. The genomes of D. hansenii and K. thermotolerans, which are the ‘exceptions’ for the NUMT content, are not an exception in terms of presence or absence of a gene involved in DSB repair. However, even orthologous genes may possess different regulation and functional properties. Moreover, the comparative analyses were performed with S. cerevisiae taken as a reference, and so the existence of D. hansenii- or K. thermotolerans-specific genes that would modulate the DSB repair machinery in these species cannot be excluded.
We thank our colleagues from the ‘Unité de Génétique Moléculaire des Levures’ and from the Génolevures consortium for fruitful discussions, and David Sherman for calculation of the syntenic blocks between pairs of yeast species. We are grateful to Bertrand Llorente, Gilles Fischer, Cécile Fairhead and Miria Ricchetti for their valuable comments on the manuscript. This work was supported by grant ANR-05-BLAN-0331 from the Agence Nationale de la Recherche (ANR) and by the CNRS (GDR2354 ‘Génolevures’). B.D. is a member of the Institut Universitaire de France.