Detection algorithm for the validation of human cell lines


  • Névine Eltonsy,

    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    Search for more papers by this author
  • Vivian Gabisi,

    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    Search for more papers by this author
  • Xuesong Li,

    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    Search for more papers by this author
  • K. Blair Russe,

    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    Search for more papers by this author
  • Gordon B. Mills,

    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    Search for more papers by this author
  • Katherine Stemke-Hale

    Corresponding author
    1. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX
    2. Kleberg Center for Molecular Markers, Houston, TX
    • Anderson Cancer Center, Department of Systems Biology, 7435 Fannin St., 2SCRB 2.2018, Unit 950, Houston, TX 77054, USA
    Search for more papers by this author
    • Tel.: +713-745-0509, Fax: +713-563-4235


Cell lines are an important tool in understanding all aspects of cancer growth, development, metastasis and tumor cell death. There has been a dramatic increase in the number of cell lines and diversity of the cancers they represent; however, misidentification and cross-contamination of cell lines can lead to erroneous conclusions. One method that has gained favor for authenticating cell lines is the use of short tandem repeats (STR) to generate a unique DNA profile. The challenge in validating cell lines is the requirement to compare the large number of existing STR profiles against cell lines of interest, particularly when considering that the profiles of many cell lines have drifted over time and original samples are not available. We report here methods that analyze the variations and the proportional changes extracted from tetra-nucleotide repeat regions in the STR analysis. This technique allows a paired match between a target cell line and a reference database of cell lines to find cell lines that match within a user designated percentage cut-off quality matrix. Our method accounts for DNA instability and can suggest whether the target cell lines are misidentified or unstable.

Cell line cross-contamination and misidentification has been a problem ever since the establishment of the first cell culture lines. Even though many researchers were aware of the problem and sought solutions,1–4 it has not been until the last five years that concerted efforts have been made to require the use of validated cell lines in grants and publications.5, 6 Using misidentified cell lines not only affects researchers who have had to retract papers but also has implications on data generated in the past.7 The use of misidentified cell lines has set back research in Mesenchymal stem cell transplantation, thyroid cancer, leukemia and esophageal cancers (see websites from American Type Culture Collection (ATCC), German Collection of Microorganisms and Cell Cultures (DMSZ), Japanese Collection of Research Bio-resources (JCRB) and RIKEN8–11). Misidentified cell lines have also had an effect on clinical practice; data from cell lines of the wrong tumor type have been used to justify clinical trials, which then failed to demonstrate benefit in patients. Because of these high-profile failures, many journals now require that a cell line be validated prior to publication.12

Although there are several methods that can be used to authenticate a cell line, the one that is most commonly used for human cell lines is based on short tandem repeat (STR) profiling.13–15 STR repeats are regions of microsatellite instability with defined tri- or tetra-nucleotide repeats that are located across multiple chromosomes. polymerase chain reaction (PCR) reactions using primers on non-repetitive flanking regions will generate PCR products of different sizes based on the number of repeats in the region; the size of these PCR products are determined by capillary electrophoresis. By combining between 8 and 16 STR loci, such as D5S818, D13S317, D7S820, D16S539, vWA, TH01, TPOX, CSF1PO, it is possible to uniquely identify a sample. This is the same method that is used in forensics to match biological samples. The biggest advantage of using STR to validate cell lines is that it is quick and relatively inexpensive. Much work has been done to determine the characteristics of the STR loci that are currently in use. The loci must be variable enough so that a unique pattern can be discerned but stable enough so that the PCR products generated fall within a size range that can be detected by standard capillary electrophoresis approaches. There are several initiatives that are gathering STR profiles so that researchers can directly compare their STR profiles against reference sequences.16, 17

However, STR profiling has several limitations. Unless original patient tissue is available there is no absolute way to guarantee that the STR profile generated is from the expected source. This is less of a problem for new cell lines that are being established since patient tissue is often available, but the historical cell line STR profiles must be inferred by comparing cell lines with the same name across researchers and across institutions, using the lowest passages available. The STR profiling method is also not useful for determining inter-species contamination especially if human DNA is present since any small amount of human DNA would be amplified and no non-human DNA would be detected. Another problem is that STR profiles can change depending on the stresses to which a cell line has been subjected. Stresses that can alter an STR profile include passage over time, viral contamination and exposure to drugs or passage through mice to generate a better mouse model.14 Most of the time these stresses result in loss of heterozygosity or genomic rearrangements and these DNA aberrations can affect the STR profile. Because of these issues, some leeway must be allowed to say that a cell line is of the expected linage. Another issue unique to cancer cell lines is that many cell lines have defects in DNA repair that can cause microsatellite instability. Since STR regions are microsatellite regions, the STR profile in such lines can be unstable. Knowledge of whether the cell line has certain mutations in the DNA repair pathway can help infer instability in the STR profile.

We present here an automated system that can be used to compare a target or list of target STR profiles against an STR database consisting of a variable numbers of reference loci. Our matching algorithm takes into account the variability that can occur within cell lines that are used in cancer research and can suggest whether there are mixtures of cell lines. Since the algorithm is based on the number of repeats in a locus and not the PCR length, we can use our method across different STR platforms as long as the entire STR variable regions are included. The method can accommodate multiple input target STR profiles matched against different reference STR profile collections, thus taking advantage of the work of many companies and institutions that are generating the reference datasets.

Material and Methods

Cell lines

HCT-116 (NCI-60), IGROV-1, TK-10 and CAKI1 were obtained as part of the NCI-60 cell line collection from the National Cancer Institute. NCI-60 cell lines were grown in RPMI (Cellgrow 10-040-CV) media containing 10% fetal bovine serum (Gibco HiFBS 10438-024 lot 804875).

Cell line authentication

Cell lines were grown to 70–80% confluence in a T-75 flask, trypsinized and the cell pellets were washed once with 1× phosphate buffered saline. DNA was extracted from cell pellets using QiaAMP mini preps (Qiagen cat 51306) and DNA was quantitated by Nanodrop spectrometry. Cell lines were validated by STR DNA fingerprinting using the AmpFℓSTR Identifiler kit according to manufacturer instructions (Applied Biosystems cat 4322288, Foster City, CA). A higher DNA amount of 0.15 ng DNA was used so that lower level mixing of cell lines could be visible. The samples were run on an Applied Biosystems 96-well 3730 Genetic Analyzer. Data were analyzed using GeneMapper (Applied Biosystems) and exported for STR comparisons. For the purposes of this article, we did not change any of the automated calls. The STR profiles were compared to known ATCC fingerprints (, to the Cell Line Integrated Molecular Authentication database (CLIMA) version 0.1.200808 (,18 to the complete 16 loci reference listed by Lorenzi et al.,19 and to the MD Anderson fingerprint database. The STR profiles matched known DNA fingerprints (see Table 1 and Supporting Information Table 1). All cell lines were also tested by Giemsa-banding and silver staining of metaphase chromosomes to verify that the cell lines were not cross-contaminated.

Table 1. Results of STR profiles generated vs. database of known STR reference profiles
inline image

Matching Algorithm

Our matching algorithm calculates two different scores to determine the percentage identity and instability (or potential cross-contamination) for each target line. These scores are calculated from the number of repeats in each allele at each STR locus. The percentage identity is determined by comparing between the reference sequences and the target sequence, while the degree of instability and potential cross-contamination are intrinsic to the cell line itself and do not depend on any information from an external reference.


A global weighted hit score (η) is calculated for each target-reference pair by comparing all loci in common between the target set A and the reference set B. This is done by first defining a hit score for each locus, μi The variable μi is calculated for each locus by counting the number of alleles that appear in common between the target aij in set A and reference bij in set B, where we define aij as allele j at locus i, Ai as the set of target alleles at locus i and Bi as the set of reference alleles at locus i. Each loci is corrected to a local weight by dividing by the total distinct number of alleles in both target and reference (m), where aij is given the score 1 if the allele is present in both reference and target and 0 otherwise [Eq. (1.1)].

equation image(1.1)
equation image(1.2)

An intermediate hit score, μ, is calculated by summing μi across all loci [see Eq. (1.2)]; if the target and reference are a perfect match and have the same number of loci, μ would equal the total number of loci.

To correct for the discordance between the number of loci in the target and the reference, we define two limits using the intermediate hit score μ. The upper hit score limit ℓul is defined as μ divided by the number of loci in the target (Ia), and the lower hit score limit ℓll is defined as μ divided by the number of loci in the reference (Ib) [see Eq. (2.1)].

equation image(2.1)

To further correct for the missing loci information, we calculate a global hit correction factor μc [Eq. (2.2)]. The global weight, ω, is calculated by counting the number of locus with matched information (i.e., not blank,) and dividing by the total number of locus found with information.

equation image(2.2)

Finally, the global hit score η is calculated in Eq. (2.3). This final step is used to correct the average for the loci with zero information no = abs(IaIb). This decision was made to correct for samples where the bounds are at wider limits due to missing information; i.e., if only one or two loci have information, the percentage match may be 100%, but the confidence that the target and reference are the same is very low. The comparison between Test HCT-116 and the ATCC HCT-116 samples is illustrated in Supporting Information Table 2.

equation image(2.3)

In Table 2, the upper hit score in percentage for CAKI1 matching ATCC CAKI1 is 100% because all of the STR regions that are present match at 100% but the global hit score is only 0.5625 since ATCC reports fewer loci than are present in the Test CAKI1 sample. The percentage match between our test sample CAKI1 and the reference from Lorenzi et al. is only 88.54% due to the overall drifts in loci vWA and D2S1338. Scores for all cell lines are shown in Supporting Information Table 1.

Table 2. A target cell line (CAKI1) with selected matched references
inline image


Instability is an intrinsic property of a cell line sample and does not depend on values from any other cell lines. For normal chromosomes, there can exist one (homozygous) or two (heterozygous) alleles per locus. If relative peak height for all alleles is available, then heterozygous calls should be at half the height when compared to single allele locus. Instability is often seen as a halo effect where the major allele has the highest peak and minor alleles including non-integer repeats have lower peak heights (Fig. 1, loci D8S1179, D351358, D135317, D165639, vWA, D18s51 and D55818). However, most public references do not have the height information as they only report the number of tri- or tetra-repeats. For this reason, our algorithm does not check for instability until there are three or more alleles per locus. We also cannot rely on relative peak height (halo effect) to determine whether an external reference has locus instability.

Figure 1.

Shown are the STR profiles generated using the GeneMapper (Applied Biosystems) software using the AmpFℓSTR Identifiler kit. The Y-axis shows relative peak heights for the PCR products that are generated. On the X-axis the PCR length are shown. The gray columns indicate the length of the known PCR products within a locus. (a) STR profile for CAKI-1. (b) STR profile for a mixture of two cell lines, CAKI1 at 95% and TK-10 at 5%. (c) STR profile for HCT-116.

An instability score is calculated for each locus within the STR profile and the overall score is summed over all loci. The formula used in this calculation was obtained from our training set of approximately 1,500 STR profiles and tested on 2,000 additional samples. The formula is as follows:

equation image(3)

where, j = the allele number in any one locus.

m = the total number of alleles at that locus.

xj = the number of tetra-repeats for that allele, starting with the second allele length.

For the test HCT-116, τ is calculated as: 1.0 for loci CSF1PO, D13S317 and D16S539, 0.2 for loci D18S51 and D19S433, 0.5 for loci D3S1358, 0.2 for loci D5S818, 0.9 for loci D8S1179, 0.7 for loci FGA and vWA and zero otherwise. Our empirical cutoff for the instability was experimentally determined by looking at a number of known unstable lines; any locus greater than τ > 0.2 are flagged as potentially unstable.

In our set of four test samples, three of the four samples are flagged as potential unstable since three or more loci had an increased τ. Two of these, IGROV1 and HCT-116, are known to have microsatellite instability due to mutations in mismatch repair genes.20 The TK-10 when run at a higher DNA concentration also shows three loci that have more than two alleles.


Any cell line that is a mixture of two cell lines will also appear as unstable. Again, if peak height is available, and if the cell lines are mixed in unequal parts, there will be one set of alleles that have higher peak heights than the other, indicating a mixture (see Fig. 1 with CAKI1 at 95% and TK-10 at 5%). However, we cannot rely on two cell lines being present in disequilibrium and we do not always have relative peak heights to compare. Therefore, we flag cell lines with more than three regions of instability as possible mixtures of cell lines. In Supporting Information Table 1, we have mixtures of the four cells lines, CAKI1, TK-10, HCT-116 and IGROV1. For our test, we set the minimum reported match at 70% so we could display the matches. Even CAKI1 when present at 95% of the sample was flagged when mixed with any of the cell lines.


The STR profile for CAKI1 is shown in Figure 1. The 16 loci STR allele fragment patterns for TK-10, CAKI1, HCT-116 and IGROV1 are listed in Table 1. For CAKI1, reference STR from three sources were available: ATCC and JCRB which report the regions D5S818, D13S317, D7S820, D16S539, vWA, TH01, TPOX, CSF1PO plus gender amelogenin, and the work of Lorenzi et al.19 that use the same Identifiler assay set as in our study. Even from this relatively stable cell line, the STR profile differs slightly between the four samples. Lorenzi et al. shows that one locus, vWA, is missing one allele (15, 17 for ATCC, JCRB and our test sample vs. 17 in Lorenzi et al.) when compared to the published data. Another locus, D2S1338, is missing an allele when compared to our data; however, our STR profile agrees with the STR profile on both the ATCC and JCRB websites. This discrepancy can either be due to a technical failure of the PCR reaction in Lorenzi et al. where one allele was preferentially amplified and the other alleles were below the level of detection, or due to a biological change where the NCI-60 cells from Lorenzi et al. have lost an allele. Since we do not have access to the original sequencing traces or the original DNA from Lorenzi et al., we cannot determine whether there was a technical failure or a real biological change. Overall, all four cell lines matched and should be considered CAKI1.

As shown in Table 1, both HCT-116 and IGROV1 have multiple alleles per loci and there is a great deal of differences between the different sources. Both cell lines are known to have microsatellite instability and this alone can generate different profiles. Another factor that can cause fluctuations in the STR profile is chromosome rearrangements or loss of chromosomes. This is one of the reasons HCT-116 is described in different databases as having the AMEL locus as either X or XY (see Table 1 and see websites from ATCC and DMSZ). As can be seen in Figure 1, the Y peak is very close to background levels. If lower amounts of DNA are used so as to make the spectra appear to have fewer alleles, the Y band will fall below the level of detection. The G-band karyotype shows that not all cells contain the Y chromosome; different sources have reported from 16 to 50% of cells without a Y chromosome.21


In order to ensure that research conducted with human cell lines is of the highest caliber, cell lines must be validated. While the tools for validation are inexpensive, the task of comparing a test set of STR profiles for several cell lines against the large number of existing cell line STR profiles is very time consuming since current methods can only compare one cell line at a time and in most cases manual review is required. In addition, as is seen with the unstable cell lines IGROV1 and HCT-116 in this paper, the reference profiles do not always match, especially when the DNA concentration is increased in the analysis to allow detection of mixtures of cell lines.

Implementation of the algorithm in this paper will allow researchers to set up automated comparisons of target STR profiles with current reference STR profiles that can be downloaded from several sources. The current requirement that a cell line must match by at least 80% may not be achievable for cell lines with mismatch repair defects as can be seen with IGROV1 in our test samples. In these cases, our algorithm allows the user to lower the stringency threshold of reported matches so that the best hit will be displayed allowing manual comparison. Since no technique can answer all questions, if researchers are unsure whether a cell line is unstable or a mixture of clones, there are several other methods that can be used in addition to STR profiling to ascertain whether the cell line is a mixture of more than one cell line. A simple method to verify whether a cell line is mixed is to select for single clones and re-test the STR profile. Alternate approaches used in this study were Giemsa-banding and silver staining of metaphase chromosomes in addition to STR profiling to validate the unstable cell lines IGROV1 and HCT-116.

Although the purpose of this paper is to provide the scientific community with a tool to validate cell lines, it does highlight some of the problems when using unstable cell lines. In highly unstable cell lines like HCT-116, the major allele can drift even over the course of a week which might account for different biological data generated in different laboratories.22, 23 Even if misidentification of a cell line is ruled out, there remains the possibility that mutations or rearrangements may alter the phenotype of the cell line between cell passages. As more cell lines become available and in order to enhance consistency in comparing data between laboratories, it may be that researchers also “drift” away from unstable cell lines unless the observation under investigation is genomic instability in cancer.


N.E., X.L., K.B.R., G.B.M. and K.S.H. were partially supported by the Kleberg Center for Molecular Markers and N.E., X.L., V.G. and K.S.H. were partly supported by the Characterized Cell Line core, NCI. G.B.M. has potential conflict of interests based on his consulting work for Asuragen, Aushon, Catena, Diichi Pharmaceuticals, Foundation Medicine and The Komen Foundation, his stock ownership in Catena and PTV Ventures and his sponsored research with AstraZeneca, Celgene, CeMines, Exelixis, GlaxoSmithKline, LPath, Roche and Wyeth/Pfizer. The algorithm will be available for download and testing for a simple comparison on our website.