Twin mitochondrial sequence analysis

When applying genome-wide sequencing technologies to disease investigation, it is increasingly important to resolve sequence variation in regions of the genome that may have homologous sequences. The human mitochondrial genome challenges interpretation given the potential for heteroplasmy, somatic variation, and homologous nuclear mitochondrial sequences (numts). Identical twins share the same mitochondrial DNA (mtDNA) from early life, but whether the mitochondrial sequence remains similar is unclear. We compared an adult monozygotic twin pair using high-throughput sequencing and evaluated variants with primer extension and mitochondrial preenrichment. Thirty-seven variants were shared between the twin individuals, and the variants were verified on the original genomic DNA. These studies support highly identical genetic sequence in this case. Certain low-level variant calls were of high quality and homology to the mtDNA, and they were further evaluated. When we assessed calls in preenriched mtDNA templates, we found that these may represent numts, which can be differentiated from mtDNA variation. We conclude that twin identity extends to mtDNA, and it is critical to differentiate between numts and mtDNA in genome sequencing, particularly as significant heteroplasmy could influence genome interpretation. Further studies on mtDNA and numts will aid in understanding how variation occurs and persists.


Introduction
Mitochondria play important roles in cellular function and human disease (Poulton et al. 2010;Koopman et al. 2012) and harbor DNA that encodes for tRNAs, rRNAs, and for proteins that function in energy production. Unlike the nuclear genome, the mitochondrial DNA (mtDNA) is present in many copies within the same cell. The mtDNA sequence reflects the maternally inherited mitochondrial genome and could harbor variation that arises somatically (Park and Larsson 2011). mtDNA may be subject to a higher mutation rate due to apparent decreased replication fidelity (Song et al. 2003;Lee and Johnson 2006); however, it is unclear whether mutations accumulate (Kujoth et al. 2007; Ameur et al. 2011) particularly in humans or whether variation is preexisting. Variation within the mtDNA sequence in an individual's cells can be homoplasmic (the same sequence) or heteroplasmic (coexisting different mtDNA sequences), and heteroplasmy levels can be related to disease (Lightowlers et al. 1997). Due to the segregation of the mtDNA during cell division, mtDNA variation could differ between cells as they divide, potentially due to selection. Such variation has been assessed in maternal transmission but the variation has not been examined in depth in adults, where multiple influences over time could affect sequence.
Deep sequencing is increasingly being utilized to detect genomic variation, and as more human genome sequencing is performed, it is clearly important to accurately detect mitochondrial variation and annotate variant function (Ruiz-Pesini et al. 2007;Calvo et al. 2012). However, mitochondrial variation can be misinterpreted given the presence of nuclear mitochondrial sequences (numts), which are highly similar nuclear fragments of the mitochondrial genome located on different chromosomes (Hazkani-Covo et al. 2010). Indeed, some studies report the difficulties in estimation of variation due to coamplification of numts with the mtDNA (Hirano et al. 1997;Parfait et al. 1998;Parr et al. 2006); however, a combination of high-throughput sequencing and validation could aid in thoroughly characterizing mitochondrial variation.
In our study, we evaluate and compare the presence of variants in mtDNA of an identical twin pair. Resulting from an early split of the developing embryo, twins harbor genomes that would be identical initially. Such twins represent a model for the analysis of mtDNA variation using a variety of methods, as any genetic difference observed between twins derived from the same zygote could represent somatic variation (Dumanski and Piotrowski 2012), whereas concordant variations found in twins would support genetic similarity (Bruder et al. 2008; Baranzini et al. 2010;Hallmayer et al. 2011;Jakobsen et al. 2011). Low levels of variation between twins could be detected in mitochondrial sequence. Here, we analyzed an adult twin pair with complementary methods to investigate variation in the mitochondrial genome sequence, and we determine the potential origin of ambiguous sequence variants from deep sequencing.

DNA isolation and molecular analysis
Informed consent was obtained under the guidelines of the institutional review board. Genomic DNA was isolated from the peripheral blood of a self-reported 21-year-old female adult identical twin pair, using standard methods (Qiagen, Valencia, CA). The twins grew up together and lived in a similar environment.
Zygosity of the twins was determined by genotyping highly polymorphic DNA loci using the PowerPlex â short tandem repeat kit (Promega, Sunnyvale, CA). Amplified fragments were detected using the ABI (Life Technologies, Grand Island, NY) 3730xl DNA Analyzer. Data were analyzed with GeneMapper software (Life Technologies).

Paired-end Illumina sequencing
DNA libraries were generated using Illumina Paired-End (Illumina Inc., San Diego, CA) library preparation kit according to the manufacturer's instructions. Libraries were quantified by qPCR using a KAPA (Kapa Biosystems Inc., Woburn, MA) library quantification kit and assessed on an Agilent Technologies (Santa Clara, CA) 2100 Bioanalyzer using a High Sensitivity DNA chip. Following an automatic cluster generation, paired-end sequencing was performed on Illumina HiSeq 2000 using the V1 flow cell Hiseq system version (Illumina Inc.). The libraries were subject to a second sequencing run using the V3 flow cell HiSeq system version.

Analysis of high-throughput sequencing data
To check read quality, raw sequencing data were analyzed using the next generation tool Fastqc (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) executed in an internal Galaxy server (Giardine et al. 2005) allowing large data file upload and processing. Sequencing reads were mapped to the hg19 reference genome (NCBI GRCh37; AF347015.1 [16571 bp]/NC_012920.1 [16569 bp]) using the Bowtie alignment package (0.12.7). Alignment was performed using normal and high stringency parameters for each of the two runs. Mapped reads were made available in BAM file format using SAM tools. All BAM files were visualized and analyzed using the Integrative Genomic Viewer IGV 2.1 (Robinson et al. 2011). Mitochondrial sequence reads, produced from the second high stringency run, were filtered from the whole genome data and converted to small BAM files and analyzed with MITO-BAM annotator online tool (Zhidkov et al. 2011). Using "Generate pileup" from SAM tools, we created from the BAM files pileup files for the twins. The generated pileup files were used to call Single Nucleotide Polymorphisms (SNPs) via the "Filter pileup" tool. For each potential variant, we set a quality threshold Q ≥ 30 where Q is the Phred quality score, and asked the program to call any SNP present at least in one read. Using IGV tool, we manually verified the presence and the Q value of all the called SNPs. In addition, we used the MITOBAM annotator tool to detect and annotate any variant also present in at least one read and with a Q ≥ 30. We also verified all SNPs manually applying more stringent filters by not considering potential variants surrounded by more than 10 mismatches within the same read, and mismatches localized within the five first or last bases of the same read. Potential variants were identified as having RefSNP (rs) numbers, while some variants were only found in Mitomap mtDNA Sequence Polymorphism database (http://www.mitomap.org/MITOMAP), and others were previously not reported.

Low-level variant evaluation
Variants detected at low frequency (<0.01%) were examined using the fluorescent primer extension assay SnapShot (Applied Biosystems, Life Technologies). We designed custom multiplex reactions according to the manufacturer's instruction and pooled up to five templates per reaction for scalability. As a first step we designed template primers and one extension primer for each variant and performed primer extension on 33 candidates. In order to verify if the detected variant nucleotides were from the mitochondrial or the nuclear genome, we amplified the entire mtDNA in two large fragments of more than 8 kb with an overlap of 183 bp (Voets et al. 2011). These fragments were used as templates in primer extension reactions using the same extension primers used in the first step. Postextension treatment was performed using 1 unit of calf intestinal phosphatase. Electrophoresis was run on an ABI 3730xl DNA Analyzer. Data were analyzed using PeakScanner software (Life Technologies).

Molecular analysis of zygosity
Before assessing mitochondrial variation using highthroughput sequencing, we first verified zygosity of the self-reported identical adult twin pair studied. Short tandem repeat analysis was performed on genomic DNA from the blood of the twins. They shared identical alleles at 15 highly variable loci on their chromosomes, indicating monozygosity with high confidence (Table 1).

Mitochondrial sequence analysis by highthroughput sequencing
To assess twin mitochondrial genome sequencing performance in the context of genome sequencing without prior sequence capture, we prepared paired-end whole genome sequencing libraries and used the Illumina HiSeq platform, which yielded over 100 million reads for each twin individual (Run 1, Table 2). A second sequencing run was performed on a newer version of flow cell (Run 2, Table 2), which greatly increased the number of reads by producing 275 million reads for twin A and over 314 million reads for twin B. We aligned these reads to the hg19 reference genome using Bowtie. Initial alignment using default parameters resulted in alignment of~47% of the reads (Alignment 1, Table 2), and this was improved to over 90% of reads aligned (Alignment 2, Table 2) using the "-y/-tryhard" mode, although the alignment performed more slowly. On the mitochondrial genome the mean depth of coverage was 1151 for the twin A and 1279 for the twin B (Fig. 1). We verified variants using the Integrated Genomics Viewer manually and detected 37 high-confidence variants (>99% of the reads), and all these were common to both twin A and B (Fig. 1). These variants included 34 homoplasmic variants and three nearly homoplasmic variants ( Fig. 2A). Among these 37 variants, 27 were distributed on 12 genes throughout the mitochondrial genome (Fig. 2C), and 10 were localized at the hypervariable segments HV1 (16024-16383) and HV2 (57-372), which had variable coverage even after remapping to account for the circular mitochondrial genome. Although six variants were nonsynonymous, all were at positions of previously reported mtDNA polymorphism (Table 3).

Low-level variation analysis
We noted the presence of potential single-base sequence differences present at very low level, one to three reads per locus, excluding duplicate reads. As these differences could represent a number of possibilities including sequencing errors, read mapping ambiguity due to numts, or true mitochondrial variation, we investigated these further. In analyzing potential low-level variants in the mitochondrial genome, we used stringent filters: a strict nucleotide quality score (Phred score Q ≥ 30), the presence of the variant in at least one sequencing read, and variant surrounded by nonvariant sequence (potential variant is at least five bases from a read end, and lack of other sequence differences in same read).
We detected many single-base candidate low-level variants in the twins (Tables S1, S2, and S3); the presence of more low-level potential variants was correlated with higher coverage. Importantly, we performed additional validation on the original genomic DNA to rule out error introduced by library preparation and complementary studies to evaluate their origin. We focused on 76 lowlevel variants that were shared between twins (Fig. 2B) given their potentially inherited nature. The low-level variant calls were distributed over the whole mitochondrial genome including the three genes coding for the respiratory chain complex IV and the six genes coding for the respiratory chain complex I (Fig. 2C). Twenty of these variants were previously reported (Mitomap database). From the 56 possibly novel variants, 17 mapped to rRNA or tRNA genes (10 on the RNR2 gene, one on the RNR1 gene, and six distributed on five tRNA genes). The other 39 variants were in coding regions of mitochondrial genes, and 30 were nonsynonymous (Table 4). We designed primers (primer sequences available upon request) to amplify the surrounding regions and primers for fluorescent primer extension for detection of these 30 as they could have functional consequence if confirmed.
We also examined two low-level variants called in two different tRNA genes that have been implicated in potential disease, m.12258C>A in diabetes mellitus and Deafness syndrome, and m.3275C>A in Leber's hereditary optic neuropathy (Lynn et al. 1998;Garcia-Lozano et al. 2000). We also tested a variant that was detected in only one twin (twin B, m.4456C>T) for comparison. Using the original genomic DNA, we found the major reference allele in 33/33 positions, which confirms the results from the high-throughput sequencing, and the low-level variant was seen in 9/33 (27.3%) ( Table 5). An example is shown in Figure 3. In all cases, primer extension revealed concordant results between the twin individuals with or without variant sequence. Where low-level variants were not confirmed, these could represent  sequencing false positives or variant detection could be limited by sensitivity even though the seven primer extension assays detected variants seen in <1% of highthroughput sequencing reads.
Given these findings, for the seven variants confirmed by primer extension, we considered the possibility that nuclear mtDNA fragments (numts) could be generating apparent mitochondrial sequence variants. To test this   possibility, we generated hemigenomes of mtDNA using high-fidelity long PCR and used these as templates for the primer extension assays, as this would eliminate numts (Li et al. 2012). Accordingly, we did not find low-level variants from the preamplified mitochondrial hemigenomes (Fig. 4). The absence of low-level variant detection from the hemigenomes could suggest that they are not present in the mtDNA, but we cannot exclude the possibility that these variants are present in the mtDNA at such a low level that they were not amplified in the hemigenome reaction. In order to determine if variants reflected numt sequences (from other regions of the genome), we directly compared variant-containing reads to the genome (Data S1). We found homologies to regions of chromosomes 1 and 17, where numts are known to exist. Reads containing position 4457 or 6569, for example, have complete read homology to corresponding regions on chromosome 1 and the primer extension could coamplify mitochondrial and nuclear DNA, explaining the apparent variation in those positions, whereas primer extension on mitochondrial hemigenome only detected the mitochondrial genome. There were, however, several low-level variants, 3455 for example, where sequencing read and primer extension assay precise homology to the nuclear genome were both not present, suggesting these variants may be present in mtDNA albeit at low levels that may not be detectable by analysis of long mitochondrial fragments.
HiSeq, high-throughput sequencing; MS, aligned with moderate stringency; HS, aligned with high stringency. Dash indicates minor allele was not detected. To detect numts that could be mapping to the mitochondrial genome and producing low-level apparent variants, we performed read mapping using whole genome reads from the twins to the mtDNA only as reference. Depth of coverage increased up to 600 reads higher than the initial mapping using the whole hg19 genomic refer-  ence, and almost all of these increased the only allele or major allele call. We examined the low-level variants previously analyzed using the initial mapping data and the primer extension assay in table 5. Many of low-level variants did not show any increase in reads and these could be due to errors. However, the number of variant reads increased in 7 of 33 positions in twin A and B (Fig. 5). This increase is likely due to the presence of numts that now were solely mapped to the mtDNA. Some of these variants were present in reads that show high similarity with regions in the assembled nuclear genome. For example, reads with the variant at position 6569 are highly similar to chr1:567077-567211 (Data S1). The number of reads containing the variant allele at this specific position increased for both twins after read mapping to mtDNA alone, suggesting that this variant is due to a numt. These findings support mitochondrial similarity in twins and demonstrate the importance of distinguishing mitochondrial genome from homologous nuclear DNA using a combination of methods.

Discussion
In this study, we analyzed mitochondrial sequence variants in a pair of adult twins using high-throughput sequencing and validation. These data also allowed us to ask whether these two independent individuals harbored mitochondrial somatic variation in their blood. Our results revealed identical homoplasmic variants in mtDNA sequence from blood-derived genomic DNA, which lend support to the idea that monozygotic twins may be highly similar in their mitochondrial genomic sequence, as they seem to be in nuclear genomic sequence (Baranzini et al. 2010;Jakobsen et al. 2011). The presence of shared variants in twins suggests that the variants were likely present since conception (Avital et al. 2012). As mtDNA variation impacts many biological processes and can affect disease, it presents important challenges to diagnostic capabilities and could inform us about mutation and selection. Any intertwin differences in variant accumulation may depend on individuals studied, tissues examined or on age of subjects.
Heteroplasmy may change with age due to the accumulation of mtDNA variation Wallace 2001Krishnan et al. 2007Larsson 2010), Lee and Wei 2012;(Bratic and Larsson 2013, but this may depend on several factors. Mitochondrial heteroplasmy in blood samples was increased in older aged individuals, with no obvious mutation type pattern (Sondheimer et al. 2011). Targeted studies of the mtDNA have demonstrated variation in grandmothers (do Ros ario Marinho et al. 2011). Interestingly, studies that examined murine liver did not detect differences in mutation load with age (Ameur et al. 2011). In our analysis of two adult twin blood samples, we did not find accumulated mtDNA mutation above the 1% level. In addition, as our twin pair were young adults, it will be interesting to examine variants at different ages and environments. If numt and sequencing errors can be excluded, these low-level variants could possibly result from replication errors.
Our finding also revealed that low-level variation detected in the mitochondrial sequence from whole genome high-throughput sequencing can reflect homologous mitochondrial sequences. Low-level variants could represent sequencing errors, low frequency variation, numts, or a combination of these possibilities. The low-read calls considered in our studies only represented the top~0.5% of nucleotide quality score. Only with secondary sequencing assays do we see that the variants may not present in mitochondrial genome-enriched templates. Although we verified some of these nucleotides using analysis of original genomic DNA samples, some of these low-level called variants in mitochondrial sequence could not be detected by a secondary method. These variants could represent sequencing error or could reflect differences in sensitivity of primer extension. High-throughput sequencing errors that can be due to the sequencing signal dephasing that occur during the sequencing cycling (Metzker 2010) and might be due to DNA back folding (Allhoff et al. 2013). Additional studies are needed to develop algorithms to filter these out prior to variant call.
The detection of possible numts in our sequencing studies provides insights into the dynamic nature of the Figure 5. Mapping reads to test for nuclear mitochondrial sequences. Resulting low-level variant reads for twins when reads are mapped to the hg 19 reference genome (red dots) or to the mtDNA genome only (blue circles). The y-axis represents the number of reads with the variant allele for each position on the mitochondrial sequence, x-axis. mitochondrial sequence over evolutionary time; however, it also poses challenges in mitochondrial genome interpretation. Our findings support that initial mtDNA enrichment can identify mtDNA signals separately from potential numts, and techniques such as primer extension may be considered as an adjunct to high-throughput sequencing.
In our analyses, we found that many of the low-level variants detected by sequencing, even those mapped with high homology to the mitochondrial sequence, were not detected by primer extension using mtDNA hemigenomes. In addition, the mapping parameters could affect the low-level variant calls, and these factors could potentially influence the rate of false positives. If low-level mtDNA variants exist, they may be present at very low levels of heteroplasmy (Payne et al. 2012). Although lowlevel variation may not substantially impact the interpretation of clinical laboratory results, they may be helpful for understanding variation origin and could exist at different frequencies in different tissues (Payne et al. 2012). Given selective pressure on a specific gene, these low-level variants could potentially accumulate over time. While low-level variants resulting from sequencing may represent mapping ambiguity or result from presequence processing of DNA, identification of the exact same variants in two adult monozygotic twins by primer extension from the original DNA samples suggests these variants could exist. If true variants arise that differ between monozygotic twins, these could be somatic mutations that accumulate due to the low replication fidelity of the mtDNA polymerase (Lee and Johnson 2006). While the probability of the human mtDNA polymerase to misincorporate bases is estimated to be~5.6 9 10 À7 to 2.8 9 10 À8 errors per incorporated base (Lee and Johnson 2006), overall somatic mutation rates in mitochondria are not well defined. Mitochondrial mutation rates using population and phylogenetic data have found 1.66 9 10 À8 AE 1.48 9 10 À9 substitutions per nucleotide per year (Soares et al. 2009). In our studies of adult twins, we found that the mitochondrial sequence was largely identical. For the variants that appeared different between twins, we calculated 1.2-1.6 9 10 À6 events observed per site per year. If 1-1.4% of these calls were somatic events, the observed rate per individual would approximate the predicted error rate. A recent study reported an intratwin pair difference in monozygotic twins suggesting a posttwinning de novo copy number variant event (Ehli et al. 2012). Several other studies have reported CNVs in twin studies (Bruder et al. 2008;Maiti et al. 2011;Sasaki et al. 2011;Breckpot et al. 2012), and so there may be differences in somatic Copy Number Variant and SNP occurrence. Further studies should address the possibility of somatic events and selection.
Current read mapping to the genome insufficiently accounts for the homology between numts and the mtDNA. Variants from whole genome sequencing may be found in the mitochondrial genome or the corresponding numt, which could alter estimates of heteroplasmy. This is an important consideration in assays that measure mtDNA heteroplasmy especially for mitochondrial disorder diagnosis, where a threshold of variation may need to be surpassed to express disease. Our findings are salient given the increased application of genome sequencing and efforts to identify mosaic variation and to distinguish regions of genome homology. We suggest that combined methods of high-throughput sequencing, techniques like primer extension, and mtDNA enrichment may be useful in assessing mtDNA variants and numts.
Finally, we conclude the following: (1) these adult twins had highly similar mtDNA sequence from blood.
(2) We did not find differential somatic heteroplasmy >1% to suggest accumulating mutation over time, although more twin pairs and tissues should be tested in the future. (3) Low-level variants, only some of which are numts, are detected by high-throughput sequencing and can be confirmed by primer extension.

Supporting Information
Additional Supporting Information may be found in the online version of this article: Data S1. Summary of Blast, BLAT, and nuclear mitochondrial sequences (numts) analysis applied to the reads and extension PCR template fragment of eight variants.