Elevated Coding Mutation Rate During the Reprogramming of Human Somatic Cells into Induced Pluripotent Stem Cells§


  • Author contributions: J.J.: conception and design, provision of study material, and collection and/or assembly of data; S.H.N.: conception and design, provision of study material, collection and/or assembly of data, and data analysis and interpretation; D.N., V.S., M.S., S.H., G.C., J.D.M., and A.N.: collection and/or assembly of data; N.N.B.: conception and design, data analysis and interpretation, financial support, and manuscript writing. J.J. and S.H.N. contributed equally to this article.

  • Disclosure of potential conflicts of interest is found at the end of this article.

  • § First published online in STEM CELLS EXPRESS December 12, 2011.


Mutations in human induced pluripotent stem cells (iPSCs) pose a risk for their clinical use due to preferential reprogramming of mutated founder cells and selection of mutations during maintenance of iPSCs in cell culture. It is unknown, however, whether mutations in iPSCs are due to stress associated with oncogene expression during reprogramming. We performed whole exome sequencing of human foreskin fibroblasts and their derived iPSCs at two different passages. We found that in vitro passaging contributed 7% to the iPSC coding point mutation load, and ultradeep amplicon sequencing revealed that 19% of the mutations preexist as rare mutations in the parental fibroblasts, suggesting that the remaining 74% of the mutations were acquired during cellular reprogramming. Simulation suggests that the mutation intensity during reprogramming is ninefold higher than the background mutation rate in culture. Thus, factor-induced reprogramming stress contributes a significant proportion of the mutation load of iPSCs. STEM CELLS 2012;30:435–440


Somatic cells can be reprogrammed to embryonic stem cell (ESC)-like cells known as induced pluripotent stem cells (iPSCs) via forced expression of defined transcription factors [1–3]. However, reprogramming might be mutagenic as most of the reprogramming factors are known to be oncogenic [4–6] and generate genotoxic stress [7–9] causing cell cycle arrest [10], cellular senescence [7, 11], and apoptosis [8] in factor recipient fibroblasts.

Karyotypic analysis [12] and meta-analysis of gene expression data [13] revealed aneuploidy, and copy number analysis detected large-scale subchromosomal aberrations [14, 15] in iPSCs that arise upon prolonged passaging in vitro. Genome-wide copy number analysis of multiple iPSC lines found that regardless of the reprogramming factor combinations and gene delivery methods (retroviral vector and piggyBac transposon), iPSCs had many copy number variations (CNVs) not present in the bulk parental cells [16]. Sequencing of iPSC lines made using both integrative and nonintegrative methods (episomal and mRNA delivery) revealed an average of six nonsynonymous (i.e., protein sequence changing) point mutations per iPSC line [17]. Approximately 60% of these mutations were present in the parental fibroblasts at very low frequency, suggesting selection for mutated cells during reprogramming. Thus, so far, it is known that iPSCs harbor mutations despite absence of the MYC oncogene in the reprogramming factor cocktail and use of nonintegrative reprogramming factor delivery methods, and that some of the mutations are preexisting in the parental cells and some are acquired during passaging. However, it remains unknown what proportion of the mutations in iPSCs are acquired due to the genotoxic stress associated with reprogramming.

In this study, we determined the mutation rate during iPSC passaging by whole exome sequencing of several iPSC lines at two different passages. We further estimated the proportion of iPSC mutations that preexist as rare mutations in the parental population using ultradeep amplicon sequencing. Despite being derived from a common parental source, these iPSCs had many unique nonsilent coding mutations absent in the parental cells. We thus provide evidence that many of the coding mutations in iPSCs are incurred during the reprogramming phase.


Cell Culture

Human neonatal foreskin fibroblasts (HFFs) (ATCC, Manassas, VA, http://www.atcc.org) were maintained in fibroblast medium consisting of Dulbecco's modified Eagle's medium (DMEM) (Invitrogen, Carlsbad, CA, http://www.invitrogen.com) supplemented with 10% fetal calf serum (Fisher Scientific, Ottawa, Canada, http://www.fishersci.ca) and 1 mM L-glutamine (Invitrogen). Human ESCs (hESCs), HES2 (WiCell Research Institute, Madison, WI, http://www.wicell.org), and iPSCs were maintained on feeder-free Matrigel (BD Biosciences, Mississauga, Canada, http://www.bdbiosciences.com)-coated plates in complete mTeSR medium (Stemcell Technologies, Vancouver, Canada, http://www.stemcell.com) as previously described [18].

Retrovirus Production

Four Moloney-based retroviral vectors (pMXs) containing the human complementary DNAs (cDNAs) of OCT4, SOX2, KLF4, and c-MYC were obtained from Addgene (Addgene, Cambridge, MA). These plasmids were transfected into a previously established 293GPG packaging cell line that incorporated pMD.gagpol and tetracycline-inducible VSV-G plasmids to generate high-titer retroviruses [19]. Viral supernatant was collected 48, 72, and 96 hours post-transfection and filtered through 0.45-μm syringe filters.

Generation of Human iPSCs

Approximately 3 × 10⁵ HFFs were seeded in gelatin-coated 100-mm dishes in fibroblast medium and were infected twice with retroviruses containing the OCT4, SOX2, KLF4, and c-MYC transgenes during a 48-hour period after seeding. Approximately 24 hours after the second viral infection, cells were switched to hESC medium consisting of knockout DMEM supplemented with 20% knockout serum replacement, 1 mM L-glutamine, 1% nonessential amino acids, 0.1 mM β-mercaptoethanol, and 10 ng/ml human basic fibroblast growth factor (Invitrogen). Human iPSC lines were established 3–4 weeks postinfection by selecting newly formed colonies with hESC-like colony morphology.


iPSCs were fixed with phosphate buffered saline (PBS) containing 4% paraformaldehyde for 20 minutes at room temperature and washed with PBS. For NANOG and SOX17 intracellular staining, cells were permeabilized with 0.2% Triton X-100 for 10 minutes at room temperature. The cells were then blocked with 10% normal goat or mouse serum (Vector Labs, Burlington, Canada, http://www.vectorlab.com) in PBS for 1 hour and incubated overnight at 4°C with one of the following primary antibodies: SSEA4 (1:50, Developmental Studies Hybridoma Bank, Iowa City, IA, http://www.dshb.biology.uiowa.edu), TRA-1-60 (1:50, Millipore, Billerica, MA, http://www.millipore.com), NANOG (1:20, R&D Systems, Minneapolis, MN, http://www.rndsystems.com), SOX17 (1:50, R&D Systems), or A2B5 (1:50, R&D Systems). Cells were then washed and incubated for 1 hour at room temperature with Alexa488- or Alexa594-conjugated secondary antibodies (1:250, Invitrogen).

Flow Cytometry

Single-cell suspensions were prepared from day 15 embryoid bodies by treatment with Collagenase B (Roche, Mississauga, Canada, http://www.rochecanada.com) and cell dissociation buffer (Invitrogen) followed by staining with human CD31-R-phycoerythrin (PE) and CD34-fluorescein (FITC) antibodies (Miltenyi Biotec, Auburn, CA, http://www.miltenyibiotec.com). Data were acquired on a FACSCalibur (BD Biosciences) and analyzed with FlowJo software (Tree Star, Ashland, OR, http://www.flowjo.com). Flow gates were based on isotype controls.

Polymerase Chain Reaction and Quantitative PCR

Total RNA was isolated using RNeasy (QIAgen, Valencia, CA, http://www.qiagen.com) and treated with Turbo DNase (Ambion, Carlsbad, CA, http://www.invitrogen.com) to remove genomic DNA contamination. DNase-treated RNA was repurified by ammonium acetate precipitation after inactivation of DNase with EDTA (Sigma-Aldrich, Oakville, Canada, http://www.sigmaaldrich.com/canada-english.html). cDNA was generated from 1 μg of total purified RNA using the Reverse Transcription System (Promega, Madison, WI, http://www.promega.com) according to the manufacturer's instructions. All polymerase chain reactions (PCRs) were performed with High-Fidelity Taq DNA polymerase (Invitrogen). Quantitative PCR (qPCR) was performed with a SYBR Green qPCR kit (New England Biolabs, Ipswich, MA, http://www.neb.com) in a total reaction volume of 20 μl and analyzed on the 7900HT real-time PCR system (Applied Biosystems, Carlsbad, CA, http://www.invitrogen.com). Primer sequences are given in supporting information Table.

Bisulfite Conversion

Genomic DNA (1 μg) from human iPSCs, the hESC line HES2, and parental HFFs was processed for bisulfite modification using the EpiTect Bisulfite Kit (QIAgen) according to the manufacturer's instructions. The promoter regions of human OCT4 were amplified by PCR using previously reported primer sets, cloned into the pCR2.1-TOPO vector using the TOPO T/A cloning kit (Invitrogen), and sequenced using both forward and reverse primers.

Next Generation Sequencing

DNA was extracted using the Blood & Cell Culture DNA Midi Kit (QIAgen). Illumina libraries were prepared according to the manufacturer's protocols, and the exome was captured using Agilent's SureSelect Exon Capture (50 Mb target) according to the manufacturer's protocol. All sequencing was carried out on an Illumina Genome Analyzer IIx sequencer with 76 × 2 paired-end reads. One lane of sequencing was done for each sample.

Read Mapping

The reference human genome used in these analyses was UCSC assembly hg18 (NCBI build 36.1) containing unordered sequences (i.e., sequences that are known to be in a particular chromosome but could not be reliably ordered within the current sequence). The 76 × 2 paired-end reads were mapped onto the human reference genome using BWA (version 0.5.7) [20]. Quality scores were recalibrated using GATK [21], and PCR artifacts were discarded using Picard. Only uniquely mapping reads (BWA mapping quality ≥ 1) were retained for further analysis.

Identification and Annotation of Mutations

Only targeted genomic regions with at least 10× coverage and Phred-scaled base quality of 30 or higher were considered. Candidate iPSC mutations are defined as variants that are present in a given iPSC exome but not in the fibroblasts or in the other iPSC exomes. These candidate mutations were subjected to a series of filters: (a) candidate mutations were discarded if they were present exclusively in the latter third of the reads; (b) candidate mutations were discarded if the mutant allele was the same base as a flanking homopolymer (defined as a repeat of two or more bases); (c) candidate mutations were discarded if BLAST alignment of the reads containing them did not concur with the BWA mapping. Functional annotation of single-nucleotide polymorphisms (SNPs) as nonsynonymous or synonymous and prediction of damaging mutations were done using the SeattleSeq Annotator online tool (http://gvs.gs.washington.edu/SeattleSeqAnnotation/).
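The candidate definition and the homopolymer filter described above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the custom caller used in the study: the function names, the `(chromosome, position, allele)` tuple representation, the sample labels, and the toy sequences are all hypothetical, and the read-position and BLAST filters are not modeled.

```python
# Illustrative sketch only; names and data layout are assumptions, not the authors' caller.

def unique_to_line(variants_by_sample, line):
    """Variants seen in `line` but in no other sample (parental cells or other iPSC lines)."""
    others = set().union(
        *(calls for sample, calls in variants_by_sample.items() if sample != line)
    )
    return variants_by_sample[line] - others


def in_flanking_homopolymer(ref_seq, pos, alt, min_run=2):
    """True if the mutant allele equals an adjacent run of >= min_run identical bases.

    `pos` is a 0-based index into `ref_seq`; only the immediately flanking bases are checked.
    """
    left = ref_seq[max(0, pos - min_run):pos]
    right = ref_seq[pos + 1:pos + 1 + min_run]
    return left == alt * min_run or right == alt * min_run


# Hypothetical toy data: one shared (inherited) variant and one private variant per line.
variants = {
    "HFF": {("chr1", 100, "A")},
    "iPSC1": {("chr1", 100, "A"), ("chr2", 200, "T")},
    "iPSC2": {("chr1", 100, "A"), ("chr3", 300, "G")},
}
candidates_ipsc1 = unique_to_line(variants, "iPSC1")
flagged = in_flanking_homopolymer("ACGTTAGC", 5, "T")      # A>T next to a "TT" run
not_flagged = in_flanking_homopolymer("ACGTTAGC", 1, "T")  # no adjacent "TT" run
```

In this toy example only the private variant survives as a candidate for iPSC1, and the A>T change adjacent to the "TT" run would be discarded by filter (b).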

Simulation of Mutations in iPSCs

To compute the expected number of mutations in iPSCs at passage 6 that would arise from the cell doublings between the onset of reprogramming and passage 6 if the mutation rate during reprogramming were not elevated, we simulated a random mutagenesis process and accounted for differences in cell cycle length and passaging interval between iPSCs and fibroblasts using the following parameters. The doubling time of human foreskin fibroblasts in culture is taken to be 24 hours [22], and the doubling time of iPSCs is taken to be 44 hours [1]. Reprogramming duration is taken to be 4 weeks (28 days) with a doubling time of 34 hours, which represents the average doubling time of fibroblasts and iPSCs. Thus, during the reprogramming phase, a reprogramming cell has undergone approximately 20 cell doublings. After picking of the initial iPSC colonies, the duration of each passage was 1 week; therefore, during the six passages of the initial iPSC colony, approximately 23 cell doublings have taken place. Thus, a total of approximately 43 doublings have taken place between delivery of the reprogramming factors into the parental fibroblasts and passage 6 of the iPSCs. For the 32.4-Mb targeted region that is covered above 10× in all the iPSC and parental fibroblast exomes, a background mutation rate of 0.02 coding mutations per cell division (which equates to 6.7 × 10⁻¹⁰ per bp per cell division [17]) leads to an expectation of 0.94 coding mutations per iPSC line. Setting the background mutation rate to the mutation rate during iPSC passaging (0.035 coding mutations per cell division) leads to an expectation of 1.5 coding mutations in iPSCs at passage 6. We found an average of 12 mutations per iPSC line, of which 7% were attributed to passaging and 19% were preexisting, leaving approximately 9 mutations that are not explained by iPSC passaging or by inheritance of mutations from parental cells. Thus the mutation rate during the reprogramming phase is 6–9.4 times higher than that expected for the background mutation rate associated with cell divisions.
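The doubling counts and expected mutation numbers quoted above follow from simple arithmetic, which the short Python sketch below retraces. The variable and function names are ours, and rounding each phase to a whole number of doublings before summing is an assumption made to match the approximate figures in the text.

```python
# Retraces the arithmetic in the text; names and rounding choices are illustrative assumptions.
HOURS_PER_DAY = 24


def doublings(duration_days, doubling_time_hours):
    """Number of cell doublings during a culture period."""
    return duration_days * HOURS_PER_DAY / doubling_time_hours


reprogramming_doublings = doublings(28, 34)    # 4 weeks at the averaged 34-hour doubling time
passaging_doublings = doublings(6 * 7, 44)     # six 1-week passages at the 44-hour iPSC doubling time
total_doublings = round(reprogramming_doublings) + round(passaging_doublings)

target_bp = 32.4e6                             # targeted region covered at >= 10x in all exomes
per_bp_rate = 6.7e-10                          # per bp per cell division [17]
background_rate = per_bp_rate * target_bp      # about 0.02 coding mutations per division

expected_background = background_rate * total_doublings   # about 0.93
expected_passaging_rate = 0.035 * total_doublings         # about 1.5

print(round(reprogramming_doublings, 1), round(passaging_doublings, 1), total_doublings)
print(round(expected_background, 2), round(expected_passaging_rate, 2))
```

With these inputs the per-division background rate comes out at roughly 0.022, which gives an expectation close to the 0.94 coding mutations quoted above; the small difference reflects rounding of the intermediate values.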


Human primary neonatal foreskin fibroblasts (ATCC, catalog # CRL-2429) (passage 14) were reprogrammed with retroviruses encoding KLF4, MYC, OCT4, and SOX2 transgenes [1]. We sought to minimize technical variations by using the same gene delivery method, reprogramming factors, viral titer, culture conditions, and passaging intervals. The five randomly selected iPSC lines displayed all the hallmarks of pluripotent cells, such as expression of pluripotency markers, demethylation of the OCT4 promoter, transgene silencing, and potential to differentiate into derivatives of the three germ layers (supporting information Figs. 1 and 2). Each iPSC line was sequenced after 6 (p6-iPSC) and 12 (p12-iPSC) passages subsequent to picking the initial iPSC colony 28 days after reprogramming. We enriched for DNA encoding protein-coding genes using the Agilent SureSelect Human All Exon kit and sequenced the captured DNA from the 11 samples (i.e., parental fibroblasts, five p6-iPSC lines, and five p12-iPSC lines) using the Illumina Genome Analyzer IIx [23] with one sample per lane. After aligning the reads to the human reference genome, we obtained more than 60 million uniquely aligning reads per sample (Table 1); from here on we refer to the sequence data as the exome.

Figure 1.

The contribution of passaging, reprogramming stress, and inheritance of rare preexisting mutations from parental cells to the mutation load of iPSCs. (A): The expected number of coding mutations per iPSC line simply due to background mutation rate during the numerous cell divisions that take place during reprogramming and passaging to passage 6. The histogram of 500,000 simulations to match the number of parental fibroblasts that were plated is shown. Parameters are given in Materials and Methods. (B): Illustration of the contribution of class I, class II, and class III to the overall coding mutation load in iPSCs. P0 represents the initial colony. Abbreviation: iPSCs, induced pluripotent stem cells.

Table 1. Summary statistics of the exome sequence data and the identified variants

We developed a custom single nucleotide variant caller (Materials and Methods) to identify all the alleles in fibroblasts and iPSCs that are absent in the human reference NCBI build 36.1 (from here on referred to as variants) at targeted exons representing approximately 32.4 Mb of the genome with a minimum of 10-fold (10×) coverage in fibroblasts and in all the iPSC lines (Table 1). An iPSC variant is defined to be a mutation if it is present only in a single iPSC exome and absent in the parental fibroblast exome (Table 2). As the iPSCs were derived from a common batch of fibroblasts from a single individual, the presence of mutations unique to each iPSC line (i.e., absent in parental cells and in other iPSCs derived from the same fibroblasts during the single experiment) provides a stringent test for excluding variants that did not arise during reprogramming. Thus, each iPSC candidate mutation was tested in five independent exomes (i.e., the parental fibroblasts and the four other p6-iPSC exomes) to discard preexisting parental mutations. We found 59 mutations (∼12 mutations per iPSC line on average) in the p6-iPSC exomes (Table 2; supporting information Table 1). We randomly selected 30 candidate iPSC mutations and interrogated their presence in the parental fibroblasts and other iPSC lines with the Sequenom MassArray SNP genotyping system, which can detect alleles present at 10% frequency. We confirmed that all the candidate iPSC mutations tested were present in their corresponding iPSC line and absent in the parental fibroblasts and other iPSC lines (supporting information Table 2).

To estimate the proportion of coding mutations in p6-iPSCs that were likely acquired during passaging since the picking of the initial iPSC colony, we sequenced p12-iPSCs. We found a total of 60 mutations in p12-iPSCs. Coding point mutations largely persisted during passaging (56 of the 59 mutations in the p6-iPSCs were present in the p12-iPSCs) (Table 2). We designated all mutations identified in the p12-iPSC exome but absent in the p6-iPSC exome of the same iPSC line as passaging-induced mutations. We found a total of four passaging-induced heterozygous mutations in the p12-iPSC lines (a mutation rate of 0.1333 coding point mutations per passage per iPSC line, or 0.035 coding mutations per cell division) (Table 2; supporting information Table 1). Thus, assuming that the rate of mutations due to passaging is constant (as cells are treated identically during each passage), passaging-induced mutations account for approximately 7% (4 out of 59) of the mutations in the p6-iPSC exome. As it is possible that some of these four mutations are also present in p6-iPSCs below the detection limit, the estimated passaging-induced mutation rate is likely an overestimate, and consequently the proportion of mutations in p6-iPSCs that is not passaging induced is likely an underestimate.
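The passaging-induced rates quoted above can be retraced with the following Python sketch. The conversion from passages to cell divisions assumes the 1-week passage interval and 44-hour iPSC doubling time given in Materials and Methods; the variable names are ours.

```python
# Retraces the passaging-rate arithmetic; names are illustrative, inputs are from the text.
passaging_mutations = 4      # mutations present at p12 but absent at p6
n_lines = 5                  # iPSC lines sequenced
n_passages = 6               # passages between p6 and p12
p6_mutations = 59            # mutations found in the p6-iPSC exomes

# Assumption from Materials and Methods: 1-week passages, 44-hour iPSC doubling time.
doublings_per_passage = 7 * 24 / 44

rate_per_passage = passaging_mutations / (n_lines * n_passages)
rate_per_division = rate_per_passage / doublings_per_passage
passaging_fraction = passaging_mutations / p6_mutations

print(round(rate_per_passage, 4))       # coding mutations per passage per line
print(round(rate_per_division, 3))      # coding mutations per cell division
print(round(100 * passaging_fraction))  # percent of p6 mutations attributed to passaging
```

The last figure relies on the constant-rate assumption stated above: the four mutations gained between p6 and p12 are taken as the expected gain over the first six passages as well.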

Table 2. Counting of point mutations in iPSCs at P6 and P12
  1. The number of iPSC mutations reported are only those that are in the 32.4-Mb aggregate that is targeted and has a minimum of 10× coverage in all the fibroblast and iPSC exomes.

  2. Abbreviation: iPSC, induced pluripotent stem cell.


To estimate the proportion of mutations that might preexist in a rare fibroblast subpopulation, we performed deep amplicon sequencing of 46 randomly selected mutations out of the 59 mutations identified in the p6-iPSC exomes. The genomic regions spanning the candidate mutations were covered approximately 3 million times on average, allowing detection of rare mutant alleles in the fibroblast population; only 8 out of the 46 (or 17%) mutations were present at low frequency in the parental fibroblasts (supporting information Table 3).

As P53 has been shown to be required for maintaining the genome integrity of iPSCs [8] and is expected to prevent accumulation of reprogramming-induced mutations in iPSCs, we asked whether any of these iPSC lines incurred mutations in TP53 or were derived from founder cells with mutated TP53, which might explain survival of iPSCs despite DNA damage. As mutations inherited from the parental fibroblasts or acquired during reprogramming should be high in frequency, we used capillary sequencing to check whether the TP53 gene incurred inactivating mutations during iPSC generation. None of the 11 exons of TP53 had nonsynonymous mutations in any of the iPSC lines (supporting information Table 4). There were also no deleterious variants in MDM2, CDKN2A, P21, and BCL2, which are genes upstream or downstream of P53. Gene ontology analysis of mutated genes revealed no significant enrichment for membership in any particular biological process that would suggest defects in checkpoint arrest or apoptosis. No homozygous nonsynonymous or compound nonsynonymous heterozygous variants were found in known DNA repair genes [22]. Thus, the founder fibroblasts of the iPSC lines do not have obvious defects in genome maintenance that might make them prone to incur mutations during reprogramming.


We partitioned iPSC mutations into three mutually exclusive classes: class I represents mutations that preexisted in the parental cells, class II represents mutations incurred during reprogramming, and class III represents mutations acquired during iPSC maintenance. Two models may account for the number and the relative proportion of mutations in p6-iPSCs that are class I, class II, and class III. According to the first model, which we refer to as the constant mutation rate model, mutations in p6-iPSCs reflect accrual of mutations that occur at a background mutation rate during the numerous cell divisions that take place during reprogramming. According to the second model, which we refer to as the reprogramming stress model, the mutation rate during reprogramming is highly elevated due to the stress associated with cell fate alteration caused by the overexpression of oncogenic reprogramming factors. This model is based on the fact that the reprogramming factors have oncogenic potential [4–6] and activate the DNA damage response [7–9], reflecting an increased rate of genome instability during reprogramming. Due to the low efficiency of reprogramming, empirical measurement of the mutation rate in the subset of fibroblasts that undergo reprogramming is technically challenging. To determine which of these two models better explains the mutations seen in p6-iPSCs, we simulated the random mutational process associated with genome duplication according to the constant mutation rate model. Substitution point mutations can accumulate during the reprogramming phase and during passaging of the initial iPSC colony for six passages (parameters are listed in the Materials and Methods section). To match the number of parental cells seeded during reprogramming, we simulated the mutational process for 500,000 single cells. The distribution of the number of mutations simply due to the background mutation rate (0.02 coding mutations per cell division [17]) in an iPSC line at passage 6 gives a median of one coding point mutation (Fig. 1A). Using the mutation rate during iPSC passaging as the background mutation rate increases the median number of coding point mutations in p6-iPSCs to two mutations per line. To explain the observed number of coding mutations in the p6-iPSCs without invoking an elevated mutation rate during reprogramming, a background mutation rate of 0.3 coding mutations per cell division is required. This mutation intensity is ninefold higher than the iPSC passaging mutation rate. Thus the reprogramming stress model better explains the observed mutation load in p6-iPSCs. That our iPSCs were derived from the same batch of parental cells but harbored unique mutations not found in other iPSCs derived from the same parental cells further supports the reprogramming-associated mutagenesis model.
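A minimal Monte Carlo sketch of the constant mutation rate model is given below. It assumes that the number of coding mutations accumulated by each cell is Poisson distributed with mean equal to the per-division rate multiplied by the 43 doublings; this distributional choice, the Knuth sampler, the seed, and all names are our assumptions, since the exact simulation procedure is not specified, so the sketch is not expected to reproduce every figure reported here.

```python
# Sketch of the constant mutation rate model under a Poisson assumption (not the authors' code).
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_constant_rate(rate_per_division, n_doublings, n_cells, seed=0):
    """Simulated coding-mutation counts per cell lineage at a fixed per-division rate."""
    rng = random.Random(seed)
    lam = rate_per_division * n_doublings
    return [poisson_sample(lam, rng) for _ in range(n_cells)]


# Background rate of 0.02 per division over 43 doublings, for 500,000 simulated cells.
counts = simulate_constant_rate(0.02, 43, 500_000)
median_background = statistics.median(counts)
mean_background = statistics.mean(counts)
# Fraction of simulated cells reaching the observed average load of 12 mutations.
fraction_at_least_12 = sum(c >= 12 for c in counts) / len(counts)

print(median_background, round(mean_background, 2), fraction_at_least_12)
```

Under these assumptions the median at the background rate is one coding mutation per line, and essentially no simulated cell reaches the observed load of about 12 mutations, which is the qualitative point made by the constant-rate comparison.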


While the Gore et al. study [17] demonstrated that iPSCs had mutations and that many originated from the parental founder cell, we have shown that mutations are also acquired during reprogramming and passaging. Furthermore, we find that less than approximately 20% of the mutations in iPSCs were preexisting mutations from the parental cells and that reprogramming contributes approximately 75% of the mutations found in our fibroblast-derived iPSCs. This discrepancy may be due either to a higher false negative rate of the Gore et al. mutation caller or to the fact that we characterized iPSCs made from neonatal fibroblasts, whereas Gore et al. characterized iPSCs generated from adult fibroblasts (the latter are expected to have accumulated more mutations than neonatal fibroblasts) [17]. Our study provides strong evidence that coding point mutations are incurred during the reprogramming phase of iPSC generation and that passaging of iPSCs after the initial colony picking contributes only a small proportion of the overall iPSC mutation load (Fig. 1B). Furthermore, unlike in the case of CNVs [16], many of the coding point mutations in iPSCs persist during passaging. Our work highlights the need to identify optimal reprogramming conditions that reduce the mutations associated with iPSC generation.


This work was supported by funding to N.N.B., who is the recipient of a New Investigator Award from the Ontario Institute for Cancer Research, through generous support from the Ontario Ministry of Research and Innovation. The raw exome sequencing data are available at www.ebi.ac.uk/ena/data/view/ERA073428.


The authors indicate no potential conflicts of interest.