Quantification of cancer driver mutations in human breast and lung DNA using targeted, error‐corrected CarcSeq

Abstract There is a need for scientifically‐sound, practical approaches to improve carcinogenicity testing. Advances in DNA sequencing technology and knowledge of events underlying cancer development have created an opportunity for progress in this area. The long‐term goal of this work is to develop variation in cancer driver mutation (CDM) levels as a metric of clonal expansion of cells carrying CDMs because these important early events could inform carcinogenicity testing. The first step toward this goal was to develop and validate an error‐corrected next‐generation sequencing method to analyze panels of hotspot cancer driver mutations (hCDMs). The “CarcSeq” method that was developed uses unique molecular identifier sequences to construct single‐strand consensus sequences for error correction. CarcSeq was used for mutational analysis of 13 amplicons encompassing >20 hotspot CDMs in normal breast, normal lung, ductal carcinomas, and lung adenocarcinomas. The approach was validated by detecting expected differences related to tissue type (normal vs. tumor and breast vs. lung) and mutation spectra. CarcSeq mutant fractions (MFs) correlated strongly with previously obtained ACB‐PCR MF measurements from the same samples. A reconstruction experiment, in conjunction with other analyses, showed that CarcSeq accurately quantifies MFs ≥10⁻⁴. CarcSeq MF measurements were correlated with tissue donor age and breast cancer risk. CarcSeq MF measurements were correlated with deviation from median MFs analyzed to assess clonal expansion. Thus, CarcSeq is a promising approach to advance cancer risk assessment and carcinogenicity testing practices. Paradigms that should be investigated to advance this strategy for carcinogenicity testing are proposed.


1 | INTRODUCTION
Evaluating the carcinogenic potential of test articles to which humans are exposed is a critical endeavor because cancer is the second leading cause of morbidity and mortality among non-communicable diseases worldwide (Madia et al., 2019). Cancer is driven by both exogenously- and endogenously-induced mutational events. Although the most relevant data for assessing carcinogenicity in humans come from studies of human populations, ethical considerations and the long latency period for most human cancers prevent the assessment of the carcinogenic potential of therapeutics in humans as part of drug development (Bourcier et al., 2015). Retrospective human studies lack sensitivity due to human genetic variation and the impact of many low-dose co-exposures (Bourcier et al., 2015).
Obstacles to obtaining human data have resulted in a dependency on the two-year rodent tumor bioassay (RTB) for assessing the potential carcinogenicity of drugs, chemicals, and physical test articles (Bourcier et al., 2015). Even though the RTB is called the gold standard for carcinogenicity testing, it is flawed in several ways (Goodman 2018). RTBs require the use of large numbers of animals.
The highest dose tested in a RTB is usually the maximum tolerated dose, which can alter biological processes, generate results that may not be relevant to humans (Cohen 2017), and necessitate low dose extrapolation (Bucher 2000). In some instances, rodents are biologically different from humans in terms of xenobiotic metabolism, tumor cell origin, or pathology of premalignant lesions (Silva Lima and Van der Laan 2000; Thayer and Foster 2007; Oesch and Hengstler 2020).
In addition, RTBs cost millions of dollars and may take five or more years to complete (Boorman et al., 1994).
Given the drawbacks of the RTB, there is a clear need for in vivo human and rodent biomarkers that can be integrated with other information to predict carcinogenicity (Harris et al., 2019). An approach that enables prediction of tumorigenic responses due to lifetime rodent exposures from shorter-term rodent studies (28 days to 6 months) would be invaluable (Parsons 2018). Hotspot cancer driver mutations (hCDMs) have potential as biomarkers for use in carcinogenicity testing and assessing potential cancer risks associated with exogenous exposures, whether therapeutic, occupational, environmental, genotoxic, or non-genotoxic (Harris et al., 2019). Advantages of hCDMs as biomarkers of cancer risk include their relevance to carcinogenesis in both rodents and humans, their known roles in oncogenesis, and their ability to confer a growth advantage to a neoplastic cell in the microenvironment of the tissue in which the cancer arises, leading to clonal expansion of cells carrying cancer driver mutations (CDMs) (Figure 1) (Stratton et al., 2009). hCDMs have been assessed primarily as DNA-based measurements, meaning analyses can be performed on any tissue from any species from which DNA can be isolated (Harris et al., 2019). When using justifiable assumptions of mutant zygosity, cancer driver (CD) mutant fraction (MF) can be translated into mutant cell numbers or proportions, providing information useful in mathematical modeling of carcinogenesis (Soh et al., 2009). Genotoxic carcinogens can induce CDMs, other mutations, or epigenetic changes that cooperate with prevalent spontaneous CDMs, leading to clonal amplification. Carcinogenesis induced by non-genotoxic carcinogens depends on the clonal expansion of spontaneous CDMs, which is again detectable through the analysis of prevalent reporter CDMs.
Our lab previously developed allele-specific competitive blocker-polymerase chain reaction (ACB-PCR), a multi-step procedure for quantifying levels of specific base pair (bp) substitution mutations in a DNA sample, allowing for quantification of rare mutational events down to a frequency of 10⁻⁵ (Myers et al., 2014b). We used ACB-PCR to analyze hCDMs across four normal human tissues and established that the degree of interindividual variability in CD MF was positively correlated with the impact of the mutation in terms of organ-specific carcinogenesis (Parsons et al., 2017). This result suggests that hCDMs can serve as substrates for carcinogenesis and reporters of tissue-specific clonal expansion (Harris et al., 2019). ACB-PCR analyses of carcinogen-treated rodents demonstrated that variation in CD MF following relatively short-term exposures (4 weeks to 8 months) correlates with RTB responses (Parsons 2018). While ACB-PCR was used to generate valuable knowledge regarding the nature of CDMs as biomarkers of cancer risk, it is a low throughput method that assesses one mutation at a time. Given the breadth of possible mutations with importance in tissue-specific carcinogenesis, a method is needed that can interrogate many hCDMs at once.

FIGURE 1 Clonal expansion of cells carrying CDMs is a disease proximate biomarker of effect. A continuum of cancer-related biomarkers is depicted. Because pathogenic mutations lead to clonal expansion of cells carrying CDMs during carcinogenesis, the variability in CD MF across individuals may be a sensitive metric for assessing cancer risk.
Advances in next-generation sequencing (NGS) have made it possible to analyze many mutations at once, even at low frequency. Indeed, the development of a variety of error-corrected NGS (EC-NGS) methods is revolutionizing the field of genetic toxicology (Salk et al., 2018), and has created an opportunity to analyze panels of amplicons encompassing many hCDMs. Methods based on the construction of single-strand consensus sequences (SSCSs) or two-strand consensus sequences have reported sensitivities between 10⁻³ and 10⁻⁶ (Kinde et al., 2011; Young et al., 2015; Gregory et al., 2016; McKinzie and Bishop 2020). Duplex sequencing (DS), which is based on constructing double-strand consensus sequences, is capable of detecting MFs as low as 10⁻⁸ (Salk et al., 2018). Thus, the analysis of panels of hCDMs is now an achievable goal and the development of such panels as biomarkers could be applied to improving carcinogenicity testing.
Although there are advantages to using CDMs as biomarkers for carcinogenicity testing, their development will require progress in multiple areas. First, we must identify which CDMs will be the most useful reporters of carcinogenic effect in different human tissues. Second, we must establish and validate high-throughput methods for their quantitative analysis. Third, because the planned application is the detection of multiple CDMs in short-term, repeat-dose rodent treatment studies, we need to determine which human CDMs are useful reporters of carcinogenic effect in corresponding rodent tissues. This study addressed the first and second areas of necessary research. Specifically, we developed an EC-NGS method (CarcSeq) for the analysis of a panel of amplicons encompassing human hCDMs in normal and malignant breast and lung samples. We validated the CarcSeq approach in terms of its ability to replicate known tissue specificity, mutation spectra, and concordance with previously obtained ACB-PCR MF measurements.

2 | MATERIALS AND METHODS
2.1 | DNA isolation and multiplex first-round PCR
DNA was isolated from fresh-frozen normal breast, normal lung, ductal carcinomas, and lung adenocarcinoma samples, as previously described (Myers et al., 2015; Myers et al., 2016; Myers et al., 2019). Because these samples were purchased from anonymous tissue donors, this work was classified as "not human subjects research," when evaluated for the purpose of human subject protection. Normal breast and lung were collected as autopsy samples from individuals who died from causes unrelated to sample type. For normal breast, normal lung, ductal carcinomas, and lung adenocarcinomas, the mean ± SD of tissues processed was 4.12 ± 0.96 g, 2.36 ± 0.64 g, 0.43 ± 0.37 g, and 0.64 ± 0.42 g, respectively. Based on the amount of DNA recovered and assuming a diploid genome weight of 6.6 pg (Elli et al., 2019), it was calculated that the genomic DNAs analyzed were derived from an average of 2.31 × 10⁷, 6.85 × 10⁸, 1.01 × 10⁸, and 6.56 × 10⁷ diploid cell equivalents for normal breast, normal lung, ductal carcinomas, and lung adenocarcinomas, respectively.
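The conversion from DNA mass to diploid cell equivalents can be illustrated with a minimal sketch. It assumes only the 6.6 pg diploid genome weight stated above; the function names are hypothetical, and the 1 μg input is used purely as a worked example.

```python
# Mass of one diploid human genome in picograms, as assumed in the text.
DIPLOID_GENOME_PG = 6.6


def diploid_cell_equivalents(dna_mass_ug: float) -> float:
    """Convert a DNA mass in micrograms to diploid cell equivalents."""
    dna_mass_pg = dna_mass_ug * 1e6  # 1 microgram = 1e6 picograms
    return dna_mass_pg / DIPLOID_GENOME_PG


def haploid_copies(dna_mass_ug: float) -> float:
    """Number of copies of a single-copy autosomal target (two per diploid cell)."""
    return 2 * diploid_cell_equivalents(dna_mass_ug)


if __name__ == "__main__":
    # Worked example: 1 microgram of genomic DNA.
    print(round(diploid_cell_equivalents(1.0)))  # 151515
```

Under this assumption, 1 μg of genomic DNA corresponds to roughly 1.5 × 10⁵ diploid cells, or about 3 × 10⁵ copies of a single-copy autosomal target.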
Using 1 μg EcoRI-digested genomic DNA as template and high-fidelity PfuUltra Hotstart DNA Polymerase (Agilent Technologies, Santa Clara, CA), segments of CD genes encompassing hotspot mutations (see Table 1) were amplified from the normal breast (n = 9), ductal carcinoma (n = 10), normal lung (n = 9), and lung adenocarcinoma (n = 9) samples. Using the National Center for Biotechnology Information (NCBI) Primer-BLAST primer selection tool (Ye et al., 2012), amplicons 132 bp or less in length encompassing hotspot targets were identified, thereby enabling the entire amplicon to be sequenced using Illumina 150 bp paired-end sequencing after amplification using primers with 9 bp unique molecular identifier sequences (UMIs) at each end, theoretically generating 68 billion different 18 bp UMIs. Specifically, four multiplex reactions amplifying two to four gene segments each were performed (Table S1). This design produced 13 amplicons covering over 20 hCDMs, flanked by 9 bp UMIs on each end. Primer sequences are provided in Table S1. First-round PCR reactions were carried out using DNA Engine or DNA Engine Tetrad thermocyclers (Bio-Rad, Hercules, CA) with the cycling conditions provided in Table S2. All first-round PCR products were identified by size using gel electrophoresis and purified using the MinElute PCR Purification Kit (Qiagen, Germantown, MD). PCR products were frozen and stored at −80 °C, as multiple single-use aliquots.
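The figure of 68 billion possible UMIs follows directly from the design: two 9 bp random sequences give 18 random positions with four possible bases each. A short sketch of that arithmetic is given below; the function name is hypothetical.

```python
def umi_diversity(length_per_end: int = 9, ends: int = 2) -> int:
    """Number of distinct UMIs when each end carries a random sequence of A/C/G/T."""
    return 4 ** (length_per_end * ends)


if __name__ == "__main__":
    # Two 9 bp UMIs -> 4**18 combinations.
    print(f"{umi_diversity():,}")  # 68,719,476,736
```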

| Library preparation
Libraries composed of combined gel-purified samples were prepared using the Illumina® TruSeq® ChIP Sample Preparation Kit (Illumina, San Diego, CA), modified for the EC-NGS application developed in this study. The DNA concentration of each sample was determined by measurements of replicate single-use aliquots using a dsDNA High Sensitivity Kit (ThermoFisher, Waltham, MA), as previously described. Multiplex DNA products were combined in equimolar amounts. Following the pooling, 10 ng DNA were subjected to DNA end-repair, 3′ end adenylation, and ligation of index/adapter sequences, as described for the Illumina® TruSeq® ChIP Sample Preparation Kit (Figure 2a).
Following a second gel purification step to separate target amplicons from unligated adapters, 2 μl of each DNA were serially diluted and quantified by digital droplet PCR (ddPCR) using the PIK3CA H1047R ddPCR Mutation Detection Assay and the QX200 Droplet Digital PCR System (Bio-Rad). This information was used to determine the dilution and volume required to incorporate 1 million copies of the PIK3CA amplicon for breast or 1.5 million copies of the PIK3CA amplicon for lung DNA (and presumably other amplicons) into the final PCR amplification step (18 cycles) of library preparation. This step employed a primer cocktail and PCR master mix to amplify the adapter-ligated UMI-labeled amplicons, as per Illumina® TruSeq® PCR kit protocol instructions.

| DNA sequencing
Pooled DNA samples were denatured and diluted according to the Illumina® TruSeq® Library Prep Pooling Guide. All DNA samples were diluted to 2 nM and combined with 2 nM PhiX (Illumina). Samples (10 μl) were denatured by the addition of 0.2 N NaOH (10 μl) and diluted with HT1 hybridization buffer to a concentration of 1.8 pM in a total volume of 1.3 ml. Denatured and diluted samples were loaded onto reagent cartridges and cluster generation and sequencing were performed on an Illumina NextSeq 500. In this study, four samples were applied to a mid-output flow cell and a paired-end 151-cycle run was performed on the NextSeq 500, with a 6 bp index read. The NextSeq 500 was controlled using BaseSpace® onsite v2.1 HT. BaseSpace® Real-Time Analysis (RTA) software extracts intensities from images, performs base calling, and assigns a quality score to each base. Post-run data processing included sorting reads by sample, based on the 6 bp index sequence incorporated during library preparation (Figure 2b). The post-sequencing bioinformatics pipeline used is shown in Figure S1 and a detailed description is provided in Supplemental Material, "Error Correction Sequencing Data Processing." Raw sequencing data have been deposited in the precisionFDA cloud-based next-generation DNA sequencing data platform, where access can be provided upon request.

| Mutation detection and filtering
Starting with the raw numbers of mutations and the SSCS depths reported for each target base in the mutation position (mutpos) output file, the data were transformed using a three-step process. First, mutations represented by only one or two mutant SSCSs were removed from the dataset to reduce inaccuracy in MF measurements due to sampling error. Second, the coefficient of variation (COV) of each MF across samples was calculated. Following visualization of known positives and artifacts, invariant MFs were defined as those with a COV <60% and those values were filtered from the normal and tumor data sets. The third data transformation investigated CarcSeq sensitivity by applying cutoffs of either 10⁻⁵ or 10⁻⁴, eliminating measurements below the cutoff.
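The three-step transformation can be sketched for a single mutation site as follows. This is an illustrative sketch rather than the pipeline used in the study: the function name and data layout are hypothetical, and two details are assumptions, namely that the COV is computed as the sample standard deviation divided by the mean across samples, and that measurements removed in the first step count as an MF of zero when the COV is computed.

```python
from statistics import mean, stdev

MIN_MUTANT_SSCS = 3    # from the text: one or two mutant SSCSs are discarded
COV_CUTOFF_PCT = 60.0  # from the text: COV < 60% is treated as invariant
MF_CUTOFF = 1e-4       # from the text: cutoffs of 1e-5 or 1e-4 were applied


def filter_site(mutant_counts, depths, mf_cutoff=MF_CUTOFF):
    """Apply the three filters to one site measured in two or more samples.

    mutant_counts: mutant SSCS count per sample.
    depths: total SSCS depth per sample.
    Returns one MF per sample, with None where the measurement was removed.
    """
    # Step 1: require at least three mutant SSCSs (assumption: otherwise MF = 0).
    mfs = [
        m / d if m >= MIN_MUTANT_SSCS and d > 0 else 0.0
        for m, d in zip(mutant_counts, depths)
    ]

    # Step 2: drop invariant sites (assumption: COV = 100 * sample SD / mean).
    average = mean(mfs)
    if average == 0.0:
        return [None] * len(mfs)
    cov_pct = 100.0 * stdev(mfs) / average
    if cov_pct < COV_CUTOFF_PCT:
        return [None] * len(mfs)

    # Step 3: eliminate measurements below the sensitivity cutoff.
    return [mf if mf >= mf_cutoff else None for mf in mfs]


if __name__ == "__main__":
    # Hypothetical counts: an invariant site and a variable site.
    print(filter_site([100, 100, 100], [100_000] * 3))
    print(filter_site([2, 50, 5], [100_000] * 3))
```

A site with the same MF in every sample, as would be expected for a SNP or a reproducible alignment artifact, has a COV of zero and is removed, whereas a site that is elevated in only some samples is retained.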

| Analysis of CarcSeq sensitivity in a reconstruction experiment
The CarcSeq assay was used alongside ACB-PCR in a reconstruction experiment analyzing the PIK3CA E545K mutation. Mutant and wild type DNAs were prepared from plasmid DNA using a high-fidelity PCR as previously described (Parsons et al., 2017). Wild-type PIK3CA E545 (GAG) and PIK3CA E545K mutant (AAG) were mixed to generate samples with MFs of 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, and 0 (wild type only). These DNAs were used for library preparation and CarcSeq analysis, as well as ACB-PCR (Myers et al., 2014b). For this analysis, the PIK3CA E545K ddPCR Mutant Assay (Bio-Rad) was used for copy number quantification during library preparation.

| Statistical analysis
Comparison of CarcSeq and ACB-PCR results used log10-transformed MFs. Statistical analyses were performed using GraphPad Prism 8 Software (GraphPad Software, Inc., La Jolla, CA) with significance defined as p ≤ .05 (two-tailed).
The Adams and Skopek Monte Carlo method was used for comparisons of mutation spectra between sample types (Adams and Skopek 1987).
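A Monte Carlo comparison of two mutation spectra can be sketched as below. This is written in the spirit of a Monte Carlo hypergeometric test and is not taken from the cited publication or from the software used in the study: the function names, iteration count, and seed are hypothetical. Mutations from both spectra are pooled and randomly reassigned to the two groups with the group sizes held fixed, and the p-value is estimated as the fraction of simulated tables that are no more probable than the observed one.

```python
import random
from math import lgamma


def _table_stat(table):
    """Sum of log(cell!) over all cells.

    With fixed margins, the hypergeometric probability of a table is
    proportional to 1 / prod(cell!), so a larger value means a less probable table.
    """
    return sum(lgamma(count + 1) for row in table for count in row)


def monte_carlo_spectra_test(spectrum_a, spectrum_b, n_iter=2000, seed=0):
    """Estimate a p-value for the difference between two mutation spectra.

    spectrum_a, spectrum_b: counts per mutation class, in the same class order.
    """
    rng = random.Random(seed)
    n_classes = len(spectrum_a)
    totals = [spectrum_a[k] + spectrum_b[k] for k in range(n_classes)]
    # One class label per observed mutation, pooled across both spectra.
    labels = [k for k in range(n_classes) for _ in range(totals[k])]
    n_a = sum(spectrum_a)
    observed = _table_stat([spectrum_a, spectrum_b])

    hits = 0
    for _ in range(n_iter):
        rng.shuffle(labels)
        sim_a = [0] * n_classes
        for k in labels[:n_a]:  # first n_a shuffled mutations form group A
            sim_a[k] += 1
        sim_b = [totals[k] - sim_a[k] for k in range(n_classes)]
        # Count simulated tables at least as improbable as the observed table.
        if _table_stat([sim_a, sim_b]) >= observed - 1e-9:
            hits += 1
    return hits / n_iter


if __name__ == "__main__":
    # Hypothetical spectra over three mutation classes.
    print(monte_carlo_spectra_test([30, 0, 0], [0, 30, 0]))
    print(monte_carlo_spectra_test([10, 10, 10], [10, 10, 10]))
```

Fixing the seed makes the estimate reproducible; more iterations would tighten the estimate for p-values near the significance threshold.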
Variation was analyzed as a metric of clonal expansion using median absolute deviation (MAD) on raw MF measurements as described by Mroz and Rocco (Mroz and Rocco 2013).

3 | RESULTS

| CarcSeq cancer panel design and optimization
The overarching goals of this study were to: (a) develop an EC-NGS method that can quantify many different hCDMs at once, (b) validate measurements of MF based on previous ACB-PCR measurements obtained from the same samples and expected tissue-specific mutational profiles and mutation spectra, and (c) garner additional information regarding which mutational targets may serve as tissue-specific biomarkers of carcinogenic potential.
The Catalogue of Somatic Mutations in Cancer (COSMIC, a collection of CD gene mutations detected using low-sensitivity DNA sequencing, https://cancer.sanger.ac.uk/cosmic) was searched to identify the most prevalent mutations in human cancers (Table 1), expecting these to be among the most penetrant mutations and, therefore, appropriate constituents of a multi-component biomarker. hCDMs that are prevalent in tumors and in normal human tissues may be substrates for chemical carcinogenesis and serve as early reporters of carcinogen-induced clonal expansion on the path to tumorigenesis (Figure 1) (Brash 2016).
Our search resulted in the identification of 13 gene segments containing hCDMs, which could be subsampled to assess carcinogenicity in different tissues. Target amplification was developed as four multiplex groups, containing two to four amplicons. The multiplex groups, their amplicon targets, GRCh38 locations, lengths, and prevalence in ductal carcinomas and lung adenocarcinomas are provided in Table 1. Table 1 shows that some targets (e.g., PIK3CA and KRAS) differ in mutation prevalence between tumor types. This study focused on normal and malignant DNA samples from breast and lung because previously obtained ACB-PCR data (sensitivity 10⁻⁵) on specific hCDMs within these samples could be used to validate the CarcSeq results. Following amplification of the 13 amplicons, a total of 973 bps of target sequence were generated from each sample, encompassing more than 20 prevalent CDMs (Table 1). This panel was developed to be useful for analysis of multiple tumor types, enabling gene mutations that are not prevalent in breast or lung tumors (Table 1) to serve as negative controls for tissue specificity.
When possible, one of the primers was derived from an intronic sequence (sequences shown in blue), to reduce potential confounding due to pseudogene amplification.
First-round PCR was optimized to ensure the number of PCR duplications was similar to that previously used to generate first-round PCR products for ACB-PCR studies that achieved a sensitivity of 10⁻⁵ (McKinzie and Parsons 2002). The CD gene segments listed in Table 1 were amplified from the normal breast, normal lung, breast ductal carcinoma, and lung adenocarcinoma DNAs. In each case the multiplex products were gel-purified, quantified, and combined in equimolar amounts.
Libraries were prepared using the Illumina® TruSeq® ChIP Sample Preparation Kit [...]
Average and median SSCS numbers achieved for normal, tumor, and all amplicons combined are provided in Table 2, along with the percentages of amplicons analyzed with enough SSCS numbers to achieve theoretical sensitivities of 10⁻⁴ or 10⁻⁵ (assuming three molecules are needed to accurately assess MF) and the sensitivity theoretically achievable with 90% power. These calculations are based upon the median SSCS number reported for each amplicon, a number that is often slightly lower at the ends of an amplicon than in the middle. SSCS numbers were larger in the lung data set than in the breast data set. Importantly, the data in Table 2 show that the SSCS numbers obtained robustly support a sensitivity of 10⁻⁴, with fewer amplicons represented by sufficient SSCS numbers to achieve a sensitivity of 10⁻⁵. Table 2 shows that sufficient SSCSs were recovered to achieve sensitivities between 2.00 × 10⁻⁵ and 5.26 × 10⁻⁵, with 90% power.
Bar graphs showing the distribution of SSCS reads for each amplicon are provided in Figure S3. For breast, 93% of amplicons had sufficient SSCS representation for a sensitivity of ≥10⁻⁴. For lung, 96% of amplicons had sufficient SSCS representation for a sensitivity of ≥10⁻⁴.
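The relationship between SSCS depth and theoretical sensitivity can be sketched as follows. The three-molecule requirement comes from the text. The power calculation is an assumption: it models the number of mutant SSCSs as Poisson-distributed and asks for the MF at which three or more mutants would be observed with 90% probability, which may differ from the calculation behind Table 2. All function names are hypothetical.

```python
from math import exp

MIN_MUTANT_SSCS = 3  # from the text: three molecules needed to assess MF


def theoretical_sensitivity(sscs_depth: float) -> float:
    """Lowest MF at which three mutant SSCSs are expected at a given depth."""
    return MIN_MUTANT_SSCS / sscs_depth


def prob_at_least_three(expected_mutants: float) -> float:
    """P(X >= 3) for a Poisson count with the given mean (an assumed model)."""
    lam = expected_mutants
    return 1.0 - exp(-lam) * (1.0 + lam + lam * lam / 2.0)


def expected_mutants_for_power(power: float = 0.90) -> float:
    """Smallest Poisson mean giving P(X >= 3) >= power, found by bisection."""
    low, high = 0.0, 50.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if prob_at_least_three(mid) < power:
            low = mid
        else:
            high = mid
    return high


def sensitivity_with_power(sscs_depth: float, power: float = 0.90) -> float:
    """MF detectable (>= 3 mutant SSCSs) with the requested power at a given depth."""
    return expected_mutants_for_power(power) / sscs_depth


if __name__ == "__main__":
    # Hypothetical depth of 100,000 SSCSs.
    print(theoretical_sensitivity(100_000))
    print(sensitivity_with_power(100_000))
```

Under this model, about 5.3 expected mutant SSCSs are needed for 90% power, so the power-adjusted sensitivity is less favorable than the simple 3/depth figure at the same depth.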

| Analysis and validation of CarcSeq MF measurements
In detection of rare events, sampling error can negatively impact the [...]
Our CarcSeq method generated some visually-identifiable invariant MFs (see Figure S4, panels a-c), one of these being a SNP. Others were the result of reproducible bioinformatic mis-priming of amplicons with homologous pseudogene sequences (i.e., the MF described the frequency of misalignment, rather than the fraction of mutant molecules in the population, see Figure S4, [...] ddPCR-confirmed artifacts (see Figure S5 and Table 3 [...]
The mutation spectra observed within the different tissue types were compared, using the COV ≥60% MFs ≥10⁻⁴ data sets (Table 5). Significant differences in mutation spectra of normal and tumor were observed for both breast and lung. Significant differences in mutation spectra were observed between normal breast and normal lung, as well as between ductal carcinomas and lung adenocarcinomas. Most importantly, the most predominant mutational specificity in each tumor matched that previously reported, G:C → A:T for ductal carcinomas and G:C → T:A for lung adenocarcinomas (Kandoth et al., 2013).
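Mutation spectra of this kind are built by collapsing each base substitution and its complement into one of six classes, such as G:C → A:T. A minimal sketch of that bookkeeping is shown below; the function names are hypothetical and the example calls are illustrative inputs, not data from this study.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Collapse a base substitution into one of six base-pair classes.

    Changes at C or T are expressed from the complementary strand, so that
    every class starts with G:C or A:T (for example, C>T becomes G:C→A:T).
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a base substitution: {ref}>{alt}")
    if ref in ("C", "T"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}→{alt}:{COMPLEMENT[alt]}"


def mutation_spectrum(substitutions) -> Counter:
    """Tally (ref, alt) pairs into the six substitution classes."""
    return Counter(substitution_class(ref, alt) for ref, alt in substitutions)


if __name__ == "__main__":
    # The PIK3CA E545K change described in the text is GAG to AAG, a G to A change.
    print(substitution_class("G", "A"))
    print(mutation_spectrum([("G", "A"), ("C", "T"), ("G", "T")]))
```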

| Sensitivity
To further probe the sensitivity achieved in the CarcSeq analysis of MF and how that relates to the accuracy of MF measurement, a subset of [...]
As a direct approach for analyzing CarcSeq sensitivity, a reconstruction experiment was performed. PCR products amplified from plasmid DNAs encompassing the PIK3CA E545 wild-type sequence (GAG) or the E545K mutant (AAG) were mixed in known proportions (10⁻¹ to 10⁻⁵). These DNAs were used for CarcSeq library preparation and standard ACB-PCR analysis. The reconstruction experiment demonstrated that CarcSeq MF measurements correlated perfectly with the expected MFs in the standards down to a MF of 10⁻⁴ (Figure 6a), but slightly overestimated the 10⁻⁵ MF standard (Figure 6b). The ACB-PCR results analyzing the same set of MF standards are provided for comparison in Figure 6c.

| Correlation with age
Variation in MF for drivers with tissue-specific carcinogenic effect is expected to increase with increasing tissue donor age because age is the most important risk factor for cancer. Figure 7 shows that the sum of MFs measured in each individual for all targets was not correlated with breast tissue donor age (a), but the sum of PIK3CA and TP53 MFs for each individual was significantly correlated with tissue donor age (b). PIK3CA and TP53 were analyzed because they had previously been identified as containing hCDMs prevalent in breast cancers (Harris et al., 2019). Figure 7 also illustrates how tissue donor age might be related to breast cancer risk; cumulative age-related risk was calculated from data in the SEER database and plotted relative to age (c) and the sum of MFs measured at each tissue donor age is plotted relative to the calculated breast cancer risk at that age (d). [...] Figure S8b). Mean absolute deviation was also examined and the breast specific targets showed a better correlation with age than all targets, but neither was significantly correlated with age (Figure S8c,d).
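The deviation metrics referred to here and in the Methods can be sketched in a few lines. The function names are hypothetical, the example values are arbitrary, and the median absolute deviation is left unscaled, which is an assumption since the text does not state whether a scaling constant was applied.

```python
from statistics import mean, median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


if __name__ == "__main__":
    # Arbitrary example: four similar values and one clonally expanded outlier.
    example = [1, 2, 3, 4, 100]
    print(median_absolute_deviation(example))  # 1
    print(mean_absolute_deviation(example))
```

The example shows why the two metrics can behave differently: a single large value barely moves the median-based measure but dominates the mean-based one.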

4 | DISCUSSION
There is growing impetus in the scientific community to move away from the paradigm of carcinogenicity testing based on the RTB. This has prompted development of approaches to prioritize chemicals for carcinogenicity testing and the adoption of alternative carcinogenicity assessment strategies (Morton et al., 2012;Luijten et al., 2016;Yauk et al., 2019). Alternative in vivo mouse models have been developed that can speed carcinogenicity testing in this species (Donehower 1996;Flammang et al., 1997;Schwetz and Gaylor 1998;Spalding et al., 1999;Bourcier et al., 2015). Nevertheless, current strategies for carcinogenicity testing require additional improvement, specifically better prediction of rodent tumor responses from shorter-term endpoints and strengthening of the scientific knowledge underpinning rodent to human extrapolation (Zeiger and Stokes 1998).
The use of hCDMs as quantitative biomarkers of carcinogenic effect is a promising approach for improving carcinogen testing because quantifying CDMs has the potential to capture information on clonal expansion of cells with neoplastic potential, at a very early stage in the carcinogenic process. Our previous work documented how specific CDMs accumulate in treated rodent tissues following short-term exposures to carcinogens and examined the impact of dose and exposure duration (Verkler et al., 2008; Meng et al., 2010; McKinzie and Parsons 2011; Wang et al., 2012). Importantly, we reported that the tissue-specific carcinogenic impact of hCDMs is related to a metric based on interindividual variability in normal tissue levels (Parsons et al., 2017). We interpret this as meaning that the magnitude of interindividual variability in CDMs is reflective of clonal expansion driven by CDMs in individuals, which occurs in a stochastic manner that mirrors the carcinogenic process itself (Harris et al., 2019). Another way to state this is that the more heterogeneity there is in normal tissue, the more potential there is for clonal selective advantage leading to carcinogenesis.
Consistent with this, it was shown in treated rodents that a metric based on treatment group variation in CD MF correlated with tumor response better than a metric based on treatment group CD geometric mean MF (Parsons 2018).

FIGURE 7 Relationships between the sum of MF measurements in normal breast of different individuals and tissue donor age. The sum of MFs for different individual samples was correlated with age using MFs ≥10⁻⁴ for all targets (a) or only targets known to be drivers of breast cancer, PIK3CA and TP53 (b). SEER breast cancer incidence data was used to calculate a cumulative risk for each age (c), as the cumulative sum of incidence observed at the current and all previous years. Finally, the sum of PIK3CA and TP53 MFs ≥10⁻⁴ in normal breast were plotted relative to the cumulative risk expected based on the tissue donor's age (d).
It is now understood that CDMs are prevalent in normal human tissues (Sudo et al., 2006;Gao et al., 2009;Parsons et al., 2010;Myers et al., 2014a;Martincorena et al., 2015;Myers et al., 2015;Young et al., 2016;Parsons et al., 2017;Martincorena et al., 2018;Suda et al., 2018;Salk et al., 2019;Yokoyama et al., 2019), leading one to ask how such mutations can be used as biomarkers if they are so universally present. Variability in CDM frequency across a homogeneous group of individuals (i.e., individuals of the same age and sex) or a rodent treatment group can be used to assess clonal expansion of cells carrying CDMs, which is viewed as a functional indicator of carcinogenic potential and/or effect, respectively (see Figure 1).
Building on past efforts to develop hCDMs as quantitative biomarkers of cancer risk, here we describe development and characterization of an EC-NGS method, CarcSeq, for the analysis of a human amplicon panel encompassing many hCDMs. This work was conducted in human samples because it is first necessary to identify human hCDMs that are relevant biomarkers before identifying which conserved rodent hCDMs have similar tissue-specificity. Additionally, human samples that had been analyzed previously for CDMs by ACB-PCR were available for cross-platform validation (Parsons et al., 2017).
CarcSeq is similar to the Safe-Sequencing System (Safe-SeqS) (Kinde et al., 2011) or AmpliSeq HD technology (Thermo Fisher Scientific). Safe-SeqS relies on UMIs incorporated into primers or endogenous UMIs, whereas AmpliSeq HD relies on UMIs incorporated into primers. Specifically, Safe-SeqS uses two cycles in the first-round of PCR to assign UMIs to an amplicon. This limits its application in situations wherein multiple amplicons need to be analyzed from a limited sample. Furthermore, two PCR cycles could be limited in efficiency to assign UMIs due to presence of chemical fixatives in clinical samples (Kinde et al., 2011). Our CarcSeq approach used 38 cycles to assign UMIs to amplicons during first-round of multiplex PCR. While the assignment of multiple UMIs to the same molecule may occur during the first round of PCR in CarcSeq, we ameliorated this concern by diluting the input that goes into the final PCR amplification of the sample enrichment step to 1-1.5 million molecules and requiring three mutant molecules to construct a SSCS (so molecules amplified in the earliest cycles are enriched in the sequenced population).
EC-NGS methodologies depend on the use of UMIs to identify replicate sequence reads from the same template molecule, allowing genetic variants shared across reads to be identified as mutations and those not shared to be identified as sequencing or PCR errors. Incorporation of UMIs as part of a primer sequence is an efficient strategy for correcting sequencing or late PCR errors, but may not correct errors that occur within the first few cycles of PCR (Salk et al., 2018).
However, in the initial PCR amplification of our CarcSeq approach, we employed the same high-fidelity PCR conditions used for ACB-PCR, which has a sensitivity of 10⁻⁵, suggesting that the background error rate from early PCR errors would be in that range.
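The UMI-based error correction described above can be illustrated with a minimal sketch of single-strand consensus construction. The requirement of three reads per UMI family comes from the text; the strict-majority rule for calling each base, the use of "N" for positions without a majority, and the handling of unequal read lengths are assumptions made for illustration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

MIN_FAMILY_SIZE = 3  # from the text: three reads sharing a UMI form a SSCS


def build_sscs(reads, min_family_size=MIN_FAMILY_SIZE):
    """Group reads by UMI and return {umi: consensus sequence}.

    reads: iterable of (umi, sequence) pairs.
    Families smaller than min_family_size are discarded.
    """
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)

    consensus = {}
    for umi, sequences in families.items():
        if len(sequences) < min_family_size:
            continue
        # Assumption: families with reads of unequal length are skipped.
        if len({len(s) for s in sequences}) != 1:
            continue
        bases = []
        for column in zip(*sequences):
            base, count = Counter(column).most_common(1)[0]
            # Assumption: call a base only when a strict majority of reads agree.
            bases.append(base if 2 * count > len(sequences) else "N")
        consensus[umi] = "".join(bases)
    return consensus


if __name__ == "__main__":
    # Hypothetical reads: one family of three (one read carries an error)
    # and one family of two, which is too small to form a consensus.
    reads = [
        ("AAA", "GAG"), ("AAA", "GAG"), ("AAA", "GAT"),
        ("CCC", "AAG"), ("CCC", "AAG"),
    ]
    print(build_sscs(reads))
```

In this toy input the isolated T in one read is outvoted and removed, while a variant present in every read of a family would be retained in its consensus.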
We employed a SSCS approach requiring three sequences with the same UMI to comprise a SSCS. In contrast, DS requires six sequences with the same UMI to comprise a duplex consensus sequence (DCS) to maximize DS efficiency (Kennedy, 2014). Because DS has a lower consensus-making efficiency than a SSCS approach, DS requires greater read depths to construct the same number of consensus sequences (Salk et al., 2018). Thus, SSCS approaches have the potential to be more cost effective. That said, DS undoubtedly produces fewer background errors and, therefore, is expected to be more sensitive, provided enough DCS are assembled to take advantage [...]
[...] in one-third of replicate measurements on a population of size X and a MF of ≥2/X will be reported in one-third of replicate measurements, which correspond to sampling errors of the infinite amount or ≥200% of the true MF (McKinzie et al., 2001). In rare event detection, one way to achieve greater precision is to require measurements be based on the detection of more than one molecule/SSCS. In our post-sequencing analyses, we required detection of at least three SSCSs for further consideration as a "measured MF." If 3/X is the true MF in a sample of size X, then in one-third of replicate samples the mutation would be measured as 2/X and in one-third of replicate samples the mutation would be measured as ≥4/X, which means most sampling errors will be of a magnitude of ±30%. Imprecision due to sampling errors will deteriorate the power to detect significant differences in clonal expansion between populations/treatment groups.
To further increase the power of the CarcSeq approach to yield a metric of clonal expansion, we employed a filter to eliminate invariant MF measurements. In the current study, specific mutations/positions where the COV across 9 normal samples was <60% were removed from each data set. This also helped to remove pseudogene artifacts and SNPs. The pseudogene artifacts were caused either by primers annealing to highly homologous pseudogene sequences that include SNVs compared with the true CD genes or by sequence misalignment, rather than by the detection of mutant subpopulations (Figure S4d). Two types of data were collected to justify COV filtering: (a) visualization of the COV of ACB-PCR concordant mutations, SNPs, and pseudogene artifacts identified bioinformatically (through homologies at nontarget sites where large numbers of reads aligned) and (b) ddPCR results confirming that some mutations were not present in the original genomic DNA samples (Figures S4 and S5).
Although the robustness and broad applicability of the use of 60% as a cutoff with the COV approach will require further investigation [...]
[...] tumor-specificity (mutations present in 11% of lung adenocarcinomas but only 1% of ductal carcinomas), relatively high levels of STK11 mutations were observed in lung, but not breast samples. However, these mutations appeared at similar levels in lung adenocarcinomas and normal lung samples (Figure 4). How this relates to CarcSeq methodology or the quality of the data sets analyzed is currently unclear and additional studies will be needed to clarify which hCDMs will be informative predictors of future lung cancer development.
Quantitatively, some of the differences in hotspot MFs between normal and tumor were smaller than perhaps expected. This may be due to one potential variable that was not controlled in this study; specifically, a potential limitation of the current study is that different numbers of diploid cell equivalents were evaluated. Overall, large samples were analyzed relative to the sensitivity of the assay; most DNA samples were derived from 10⁷ to 10⁸ diploid cell equivalents, [...]
A final approach for validating CarcSeq MF measurements, and the overall approach of using CD MF as a metric of cancer risk, was to correlate normal MF measurements with tissue donor age, because age is the single most important risk factor for cancer. The significant correlation observed between the sum of MFs in breast-specific targets for individuals and age is encouraging, as was the initial attempt to examine the correlation between measures of clonal expansion with age (Figures 7 and S8). Obviously much more data are needed to establish measurements of specific CDMs as quantitative biomarkers of cancer risk for specific tumor types. Paradigms that should be utilized to develop such data, in both human and rodent, are illustrated in Figure 8.
In summary, we developed a novel method for EC-NGS, called CarcSeq [...]
[...] Tissue Network, which is funded by the National Cancer Institute. Other investigators may have received specimens from the same subjects. The information in these materials is not a formal dissemination of information by FDA and does not represent agency position or policy.

FIGURE 8 Depiction of experimental paradigms that could be used to relate tumor incidence in human or rodent to a metric based on analyzing batteries of hCDMs.