Restriction site associated DNA sequencing for tumour mutation burden estimation and mutation signature analysis

Abstract Background Genome‐wide measures of genetic disruption such as tumour mutation burden (TMB) and mutation signatures are emerging as useful biomarkers to stratify patients for treatment. Clinicians commonly use cancer gene panels for tumour mutation burden estimation, and whole genome sequencing is the gold standard for mutation signature analysis. However, the accuracy and cost associated with these assays limits their utility at scale. Methods WGS data from 560 breast cancer patients was used for in silico library simulations to evaluate the accuracy of an FDA approved cancer gene panel as well as restriction enzyme associated DNA sequencing (RADseq) libraries for TMB estimation and mutation signature analysis. We also transfected a mouse mammary cell line with APOBEC enzymes and sequenced resulting clones to evaluate the efficacy of RADseq in an experimental setting. Results RADseq had improved accuracy of TMB estimation and derivation of mutation profiles when compared to the FDA approved cancer panel. Using simulated immune checkpoint blockade (ICB) trials, we show that inaccurate TMB estimation leads to a reduction in power for deriving an optimal TMB cutoff to stratify patients for immune checkpoint blockade treatment. Additionally, prioritisation of APOBEC hypermutated tumours in these trials optimises TMB cutoff determination for breast cancer. The utility of RADseq in an experimental setting was also demonstrated, based on characterisation of an APOBEC mutation signature in an APOBEC3A transfected mouse cell line. Conclusion In conclusion, our work demonstrates that RADseq has the potential to be used as a cost‐effective, accurate solution for TMB estimation and mutation signature analysis by both clinicians and basic researchers.


| INTRODUCTION
Over the past decade, genomic characterisation using high throughput sequencing has played an increasingly important role in cancer precision medicine. 1One example of this is the use of tumour mutation burden (TMB), a measure of the number of alterations in a cancer genome relative to matched germline from the same patient, which has been approved for use as a biomarker to indicate whether patients with any solid tumour type may respond to α-PD1 blockade. 2However, while early studies indicated that patients with high TMB were more likely to respond to therapy, many trials have failed to replicate this association. 3The FoundationOne Cancer diagnostics panel is the FDA approved companion test for acquisition of TMB as a biomarker for ICB in solid tumour types. 2 Comparisons of TMB estimates from whole genome sequencing (WGS) to those from targeted panels suggest that their accuracy is poor for samples that do not have high TMB 4 and using other parameters to estimate measurement error via the FoundationOne panel has demonstrated its low accuracy. 5As predictive thresholds have been approved under the assumption that these panels can accurately represent TMB, their ability to correctly identify responders with TMB around the cutoff levels is likely to be limited.Consequently, there is a need for an accurate, simple and cheap method to estimate TMB that would enable re-evaluation of the prognostic significance of this biomarker.
Extensive analysis of tumour genomic data has revealed that evidence of mutagenic activity in tumour evolution can be derived by categorising mutations into their genomic context, with potential clinical utility in breast cancer. 61][12][13] There are other signatures with potential clinical utility such as the BRCA deficiency signatures (COSMIC SBS 3 and 8) identifying patients likely to respond to PARP inhibition even in the absence of pathogenic mutations in BRCA1/2 (e.g epigenetic silencing, variants of unknown significance). 14,15The gold standard approach for elucidating mutation signatures is to use WGS, however WGS is expensive, requires large amounts of compute and storage power, and many medical facilities do not have access to WGS services.This limits the uptake of mutational signature analysis, and corresponding benefit to patients, justifying research into cost-effective solutions for mutation signature analysis.
Restriction site associated DNA sequencing (RADseq) is a reduced representation method used frequently in population genetics for association studies and linkage mapping. 16RADseq requires a simple library preparation protocol and can be performed at a low cost per sample.RADseq based approaches have been used to capture genome wide mutation data in bladder cancer with RADseq effectively recapitulating mutation signatures in patient derived FFPE blocks. 17Restriction enzyme based approaches therefore potentially represent powerful tools to supplement targeted sequencing approaches in both clinical and research contexts.
In this paper, we aimed to critically evaluate the use of the FoundationOne diagnostic panel for TMB estimation and mutation signature analysis, and whether RADseq could be a suitable, cost-effective methodology for this purpose.We investigated the number of detected mutations required for accurate recapitulation of mutation signatures in breast cancer, and explored the effects of low mutation complexity caused by APOBEC mutagenesis on detection capacity.We then simulated RADseq library preparation methods for breast cancer patients and calculated the accuracy of these methods in acquisition of genome wide biomarkers such as TMB and mutation signatures, benchmarking against our estimate of the FoundationOne Cancer diagnostics panel TMB score.The effect of TMB measurement errors on the capacity to detect associations between response to ICB and TMB has not been formally evaluated.We simulated use of the RADseq libraries to evaluate the effect of TMB on response to ICB.and converted to a gRanges object for compatibility with the SomaticSignatures 19 R package.Random subsamples of between 10 and 800 mutations were extracted for each sample.Each random subsample catalogue of mutations was then characterised by the 96 trinucleotide context as using SomaticSignatures mutationContext and motifMatrix functions.This process was repeated times for each mutation subset in each sample to generate 10 representative libraries for each sample at each number of randomly subset mutations.

| Analysis of mutation signature similarities
Mutation profiles were compared by calculating the cosine similarity between the subsetted mutation profile and the WGS profile for each sample using the mutation-alPatterns 20 R package for each iteration and taking the diagonal values of the resulting matrix.The probability of obtaining a cosine similarity to WGS profile of at least 0.9, Pr(CS > 0.9), was calculated by dividing the number of cosine similarities for each mutation subset in each sample s by 10 (the number of randomly generated libraries for each sample).
In order to formally define the number of mutations required to accurately recapitulate the whole genome mutation signature, a binomial test was used.This was calculated by counting the number of samples that had a >90% cosine similarity to the original WGS profile at each number of mutations subset.

| Restriction enzyme mapping
Restriction enzyme information was obtained from the NEB website (https:// inter natio nal.neb.com/ tools -andresou rces/ usage -guide lines/ nebuf fer-perfo rmanc e-chart -with-restr ictio n-enzymes).Restriction site breakpoints were mapped by scanning the human genome (hg19) for the recognition sites, and defining regions between the breakpoints in a bed-like format.Fragments were restricted to those between 70 and 378 bp commonly used for RADseq experiments.Enzymes that are methylation sensitive were removed due to the fact that different methylation patterns in different samples could result in less reproducible library preparations.Duplicate enzymes (i.e.those with a high fidelity version) were removed.Libraries which returned an average of less than two mutations per library/sample were removed.Libraries with greater than 10% coverage of the whole genome were removed as these would not represent practically useful libraries when compared to WGS.The FoundationOne regions were defined by downloading available gene information from the FoundationOne panel 22 and matching gene names to annotations in hg19.Fragments were restricted to exonic regions as described on the FoundationOne website: for simplicity, we did not include the small amount of intronic regions that are included in the FoundationOne assay.Library coverage was defined using the 'reduce' function from the GenomicRanges R package, which gives the expected number of base pairs captured by each library.

| In silico library analysis
Once fragments were defined for enzyme libraries and the FoundationOne panel, mutations from the Nik-Zainal dataset were overlapped and cosine similarity was computed for each library as above, using the mutations that overlapped for each library instead of random subsetting.Mutation signature analysis was carried out for each library as described above.Mutation rates for each library were defined by dividing the number of mutations overlapping the library by the library coverage, that is, the number of base pairs covered by the library.Mutation rates were then compared with the WGS mutation rate using three methods; Pearson correlation of the WGS mutation library to the mutation rate estimated for the library; and two error-based methods: the difference between the WGS mutation rate and the library mutation rate; and the estimated bias, θ, of the library mutation for the 560 tumour samples.For the error-based methods, libraries with a Pearson correlation value less than 0.9 were removed and a linear model was defined using log 2 transformed values of the median value for each library as the dependent variable and log 2 transformed values of the library coverage methods as the independent variable in a linear regression analysis.

| Simulated ICB trials
For each sample in the Nik-Zainal dataset, the probability of response to ICB was defined as a function of the WGS mutation rate.The probability was scaled to defined minimum and maximum of 0 and 0.5.This resulted in a value between 0 and 1 for each sample that were that samples probability of response in that trial.The response to therapy in each separate trial was then defined using this probability value.The odds ratio of response versus non-response at defined WGS mutation rate cutoffs was then calculated.For comparison of different library methods used to estimate TMB, the odds ratio of response was calculated as above, except the mutation rate used was for the estimated mutation rate of that sample using the defined libraries.To simulate independent clinical trials where statistics are computed as part of each trial, odds ratio p-values for each cutoff were Bonferroni corrected using p.adjust separately for each trial.This process was repeated 100 times to represent 100 independent trials on the same cohorts of patients.As the rbinom function generates random deviates of the 'response' outcome based on TMB, each sample was represented 100 times, however the number of trials in which the sample 'responded' was dependent on the TMB as encoded in the model.The proportion of trials reporting a significant association at each cutoff was taken as the number of adjusted p-values <0.05 at each cutoff.

| SSM3 cell culture
SSM3 cells 23 were cultured in 90% DMEM + 10% FBS, 50 μM β-mercaptoethanol, 5 μg/mL bovine insulin, 10 ng/mL transferrin, 0.3 μM hydrocortisone at 8% CO 2 .Monoclonal isolates were derived by limiting dilution.Following this, a second round of monoclonal isolation was achieved by FACs sorting into 96 well plates.Cells were expanded in 50% conditioned media.Conditioned media was generated by harvesting media from a healthy growing parental SSM3 cell line after 1 day, followed by centrifugation, filtering through a 0.22 μm cell strainer and mixing with full SSM3 media in a 1:1 ratio.After initial plating by both methods, single colonies were left to grow for 7 days before replacing conditioned media.Due to the cluster like growth of SSM3 cells, after sufficient expansion of the colony in the 96 well plate, colonies were lifted and replated in the same well in full conditioned media to promote growth and over confluency.After colonies grew to 50% confluency, the entire well was harvested and transferred first to 24 well plates and then to T12.5 and T25 flasks, maintaining cultures in full SSM3 media.Colony growth was heterogenous, however most colonies were able to be transferred after 3 weeks following single cell isolation, and then grown as normal.
2.8 | Transfection, APOBEC mutagenesis and cell sorting of 0D5 subclones 0D5 cells were seeded at 800,000 cells/well of a 6 cm dish.24 h after plating, media was replaced with SSM3 media, and cells were transfected with a mixture containing 2 μg pcDNA3.1 vectors expressing either APOBEC3A or APOBEC3B (kindly donated by Vincent Caval 24 ), co-transfected with 2000 μg pcDNA3.1 construct expressing deaminase null APOBEC3A linked to a uracil deglycosylase construct and 1000 μg of a plasmid encoding mutant GFP and WT mCherry that is a reporter for APOBEC mutagenesis, 25 with 17.325 μL of FuGene in a total volume of 325 μL of transfection mixture.Control cells were co-transfected with pcDNA3.1-IRES-GFP 26Addgene plasmid #51406; http:// n2t.net/ addge ne: 51406 ; RRID:Addgene_51406)and the mCherry reporter construct above.To facilitate setting gates for flow cytometry, separate wells were transfected with 5 μg pcDNA3.1-IRES-GFP(GFP+ cells), and 5 μg of the reporter constructs above (mCherry+ cells), as well as extra wells for unstained and heat killed (Texas Red+) cells.Transfection mixture was left on for 24 h and then replaced with fresh media.Seventy-two hours after initial transfection, cells were scraped in cold PBS and stained with Texas Red for 20 min, washed with PBS and sorted for live, GFP+ cells into 96 well plates containing SSM3 conditioned media on a BD FACS Aria (1:1 ratio of filtered SSM3 media harvested from plates containing 40% confluent SSM3 cells combined with fresh SSM3 media).Single cells were confirmed by phase contrast microscopy, and cells were left to grow for 1 week post sorting.Monoclonal cell lines were then isolated as above (see Section 2.7).

| DNA isolation
DNA was isolated from cells by reserving the cell pellet from routine passaging of a T75 cells.DNA purity was assayed using a Nanodrop spectrophotometer, integrity by gel electrophoresis and quantified by Qubit fluorescence.

| RADseq library preparation and sequencing
RADseq library preparation and sequencing was outsourced to a local company, AgResearch, Dunedin.The RADseq libraries were constructed according to the methods outlined in Elshire et al., 27 with modifications as outlined in Dodds et al. 28 RADseq libraries were prepared using a PstI-MspI double-digest, and included negative control samples (no DNA).Libraries underwent a Pippin Prep (SAGE Science, Beverly, Massachusetts, United States) to select fragments in the size range of 220-520 bp (genomic sequence plus 148 bp of adapters).Single-end sequencing (1 × 101 bp) was performed on an Illumina NovaSeq6000 utilising v1.5 chemistry.Raw fastq files were quality checked using a custom qc pipeline (available at https:// github.com/ AgRes earch/ DECONVQC).As one of the qc steps raw fastq files were quality checked using FastQC v0.10.1 (https:// www.bioin forma tics.babra ham.ac.uk/ proje cts/ fastqc/ ).

RADseq data and SSM3 WGS data
WGS SSM3 data 29 were downloaded in FASTQ format from the Sequence Read Archive (SRA) using fasterqdump from sra-toolkit 30 (accession SRR2142076).FASTQ files were demultiplexed and adapters trimmed using cutadapt. 31Poor quality reads from FASTQ files were trimmed using Trimmomatic. 32Files were then aligned individually using bwa mem with standard parameters. 33Mutations were called using the GATK4 MuTect2 pipeline for tumour only samples. 34The MarkDuplicates step was omitted due to the nature of RADseq data likely resulting in a high probability of false positive detection of duplicates, due to reads generated from the same cut site sharing identical mapping coordinates.Sites with <20 reads and an allele frequency (AF) >0.75 were filtered out to remove germline, artifactual and low quality sites.Mutation rate was calculated by counting the number of mutations in each sample and dividing by the total base pairs covered by at least 20 reads (computed using samtools coverage).Cosine similarity matrices were then generated for comparison of different samples.The context of each mutation was extracted using MutationContext from the SomaticSignatures package.SigProfiler analysis was carried out using default parameters for extraction of SBS96, with a maximum number of signatures set at 10, and 100 non-negative matrix factorisation (NMF) replicates carried out.The SigProfiler analysis was carried out on samples with each sample split between clonal and sub clonal mutations, as well as the SSM3 WGS mutations.Signatures for the computed optimal solution were visually examined and on the basis of similarity, two signatures were combined.
To determine whether samples were enriched for Tp(C>T)pW mutations in the APOBEC transfection experiment, a binomial test was used, with the p-value calculated via where k is the number of mutations in the T(C>T)W mutation context, n is the total number of mutations detected in a sample, and π 0 is the predefined proportion of Tp(C>T) pW mutations.For this experiment, π 0 was defined as the proportion of Tp(C>T)pW in the parental 0D5 clone.Thus, if p < 0.05 the sample was defined to be to significantly enriched for APOBEC mutagenesis, indicating successful genome editing.

| Sequence data availability
Demultiplexed RADseq FASTQs and filtered VCFs from monoclonal SSM3 cell lines generated as part of this study have been deposited at the Gene Expression Omnibus (GSE234146).

| Estimating the number of mutations needed to recapitulate whole genome sequence mutation signatures
To quantify the number of mutations required to recapitulate the whole genome sequence mutation profiles of breast cancers, we used publicly available mutation data from a cohort of 560 breast cancer whole genome cancer catalogues. 9Random subsets of 10-800 mutations were sampled and a mutation profile representative of the 96 trinucleotide context as described by Alexandrov et al. 6 was generated for each subsample.This range was used as it represents a range of values that would be captured by reduced representation libraries such as RADseq, up to the number of mutations captured by WGS for breast cancer samples.Each sample was randomly subset as above 10 times.These were then compared to the mutation signatures generated from WGS by calculating cosine similarity.At higher numbers of randomly subset mutations, the cosine similarity to the WGS profile tended to increase (Figure 1A).At n = 330 mutations >95% of samples had a cosine similarity >0.9 to the WGS profile (Bonferroni corrected exact binomial test p = 4.88 × 10 5 , 95% confidence interval 95.97%-100% of samples with >0.9 cosine similarity, Figure 1B).

| APOBEC hypermutated samples can be reconstructed using a lower number of randomly sampled mutations and represent the majority of high mutation burden breast cancers
A number of samples had a high cosine similarity at a relatively low number of mutations sampled (Figure 1A).We hypothesised that these samples were APOBEC  hypermutated samples, as these are characterised by large numbers of C>T mutations in TpCpW contexts, and thus have low mutational complexity.To test this hypothesis, samples were split into 'APOBEC hypermutated' or 'APOBEC typical' samples (see Section 2) and the analysis was repeated as above for the separate groups.APOBEC hypermutated samples had significantly higher cosine similarities compared to APOBEC typical samples (Figure 2A, Holm adjusted Wilcoxon rank sum test p < 0.05 at each value of mutations subsampled).We repeated the binomial test to define the number of mutations required to recapitulate WGS profiles in these subcategories.APOBEC typical samples were able to be recapitulated at 330 mutations, the same as the global analysis (Figure 2B, Holm adjusted exact binomial test p = 0.02, 95% confidence interval 95.58%-100% of samples with >0.9 cosine similarity).
To examine the distribution of high mutation burden samples and the number of mutations required to recapitulate the WGS profile, samples were colour-coded according to the number of mutations present in the WGS profile.Paradoxically, samples with the highest TMB values were recapitulated at a lower number of mutations, as shown by the darker points on the plot in Figure 3A

| Optimisation of reduced representation methods to capture mutation signatures
The potential of RADseq to recapitulate genome wide mutation characteristics was examined.Furthermore, our estimate of the FoundationOne (F1) cancer diagnostics panel TMB was compared to these methods as it is the FDA approved standard for measuring TMB.We used the same mutation data as described above.However, instead of randomly subsetting mutations, we subset mutations based on whether they would be captured by an in silico genome digest, for each of 277 enzymes.First, we removed 'high fidelity' enzymes, as these are just variants of restriction enzymes that recognise the same cut sites, and therefore would result in duplicate libraries.Libraries were removed if the enzyme used was methylation sensitive, or if the median number of mutations captured was less than two.Methylation sensitive enzymes would be less amenable to RADseq on cancer samples as the captured regions would be dependent on methylation at each locus.The genome coverage parameters allowed us to focus on libraries with a high enough coverage to capture mutations for signature analysis.
After filtering, 117 enzyme libraries remained for analysis.For the F1 panel, we restricted mutations to those in exonic regions of genes reported to be assayed by the panel.The libraries captured a range of mutations in the range of our random sampling trials (range 1-54,968 mutations captured; Figure 4A; Figure S1a).Consistent with our random subsampling results, libraries with lower coverage, such as the F1 panel, and rarely cutting enzymes such as KpnI, usually resulted in lower mean cosine similarities of mutation profiles to the WGS profile while higher coverage restriction enzyme based methods were able to robustly recapitulate all mutation signatures (Figure 4A,B; Figure S1b).Interestingly, some libraries did not follow this general trend.For example, while more mutations were captured by the MspI library than the lower coverage BpuEI library, the cosine similarities to the WGS profiles of these libraries were similar.When samples were split by APOBEC hypermutation rate as above, even low frequency cutters such as SpeI were able to accurately recapitulate some APOBEC hypermutated samples (Figure 4D; Figure S2).The mutation profiles generated from simulated F1 libraries were highly dissimilar to the WGS generated profiles, with only a portion of profiles in the APOBEC hypermutated cohort showing similarity to the WGS profile (Figure 4C,D).

| Optimisation of reduced representation methods to estimate TMB
The suitability of various library preparation methods to estimate TMB was then tested.Estimation of tumour mutation rate derived from all simulated library preparation methods were significantly correlated with values for the whole genome sequence values (Pearson's r > 0.65, adjusted p < 0.05 for all libraries, Figure 3A).Evaluation using correlation however, has been criticised, due to the fact that highly mutated values can skew this parameter. 5This effect is demonstrated in Figure 5B.While the overall correlation appears to be strong for the F1 library, the wide range of points for values below 10 mutations/Mb in the F1 library F I G U R E 5 Correlation and error rates of libraries in estimating tumour mutation burden (TMB).(A) X axis depicts the coverage of libraries, y axis depicts Pearson correlation values between whole genome sequencing (WGS) TMB estimates and library TMB estimates.The F1 and SpeI libraries are highlighted as red and blue points respectively.(B) Scatterplot of mutation rate detected in the WGS profile versus the mutation rate detected by the F1 library (left panels, red) or the SpeI library (right panels, blue).The top panels show all the samples, while bottom panels depict only those with a WGS TMB <10 mutations/Mb.(C) The median absolute difference of the WGS TMB estimate and the library TMB estimate (y) as a function of the library coverage (x) A regression line (solid blue) is depicted with 95% confidence intervals (dashed blue).The F1 and SpeI libraries are highlighted and labelled in red and blue respectively.
suggests that this method cannot reliably estimate TMB.For example, many samples with WGS TMB between 0 and 2.5 mutations/Mb have highly variable estimates of F1 TMB, such that many samples in this category would have an estimated TMB > 2.5 mutations/Mb using the F1 library.By comparison, the SpeI library, with roughly double the coverage of F1, is both highly correlated, and displays much lower variability, even in samples with lower TMB.
Because of the potential issues with correlation-based assessment, we utilised two other parameters to measure the validity of each library in estimating TMB; the median absolute difference between the library and WGS mutation rate; and the median estimated bias, θ. 5 As expected, as the proportion of the genome covered increased, the median error rate of the library, as estimated by both parameters, decreased exponentially (Figure 5C; Figures S3 and S4).A linear model was fit, and this model suggests that error rates in this regard can be halved by increasing genome coverage by eight-fold.Libraries with a higher rate of error in this regard would be expected to have a median absolute difference higher than the 95% confidence intervals predicted by this model.The F1 library clearly falls outside of the 95% confidence interval of this linear model (Figure 5C).Some libraries with similar coverage, such as the SpeI library performed better than the F1 library in this regard (Figure 5C).

| Implications for immunotherapy trials
To examine the potential implications of the above findings in evaluating the significance of TMB as a biomarker for potential response to ICB, we simulated trials of ICB in breast cancer.The model assumes that the probability of response increases moderately as a function of mutation rate, plateauing at a probability of 50% at 11 mutations/Mb (Figure 6A).Although a simple model, this allowed us to examine the probability that, using a defined library preparation method in a cohort of breast cancer patients, the trial could detect an effect at a certain TMB cutoff.First, we tested whether, based on the results of Figure 1, pre-screening for APOBEC hypermutation might increase the power of such trials.In 100 simulated trials, clinical benefit, in the form of a significant odds ratio at a defined TMB cutoff, could be detected in the APOBEC hypermutated cohorts, with just under half the trials achieving a significant association with response (Figure 6B,C).By contrast, unselected cohorts almost always failed to detect any association of mutation rate with response (Figure 6B,C).
Next, we tested whether the library used to detect this association could affect the results of these trials.To do so, we computed significance as above, using the APOBEC hypermutated cohort, however we used the mutation rate estimated by the library to test the association, instead of the ground truth values derived by WGS.The proportion of trials reporting a true association tended to be lower when using the F1 panel at all cutoffs (Figure 6D,E) compared to using a WGS value.This trend remained true when using the widely used cutoff of 10 mutations/Mb.To test whether RADseq libraries could outperform F1 in this respect, additional libraries were tested.We used the KpnI library, as it has a similar library coverage to the F1 library and the SpeI library, which has roughly 2× the coverage of F1.At low TMB cutoff values (e.g. between two and four mutations/Mb), where use of F1 would result in no association being reported, use of KpnI libraries was able to detect a statistical association.The SpeI library was consistently able to improve the loss of significance seen at higher cutoffs (from 6 to 10 mutations/Mb).

| Utility of RADseq based methods to track mutation signature changes in a mouse cell line
While many mutation signatures have been defined using existing sequencing data, the processes by which these signatures arise remains largely unknown.To empirically test the ability of RADseq to detect mutation signatures and estimate TMB, we derived monoclones of the SSM3 cell line and subjected them to RADseq, then compared the resulting profiles to published WGS data. 29o determine whether a known mutation signature can be recapitulated from RADseq data, we transfected the 0D5 cell line with plasmids encoding either APOBEC3A and APOBEC3B, along with a uracil deglycosylase inhibitor (UGI) construct and a reporter construct that encodes a mutant GFP that is reactivated upon APOBEC editing. 25 a control, we co-transfected a population with plasmids encoding WT GFP and mCherry and sorted double positive cells.Derivative cell lines were then sequenced using a RADseq protocol and compared to published WGS sequencing data.When compared to the published WGS data for the SSM3 cells, all clones, including the parental 0D5 clone, had increased mutation rate when measured by RADseq (Figure 7A).Several clones derived from the population transfected with the APOBEC3A constructs had an increased mutation rate, and one clone, A2H1, had a different mutation profile to the remainder of the clones characterised (Figure 7A,B).Inspection of the 96 trinucleotide mutation frequency of this clone revealed that A2H1 had a significantly increased rate of mutations in a T(C>T)W context (Figure 7C), the preferred substrate for APOBEC enzymes.Furthermore, the mutations in the T(C>T)W contexts were clearly enriched in the YTCA context, supporting the notion that APOBEC3A mutagenesis generated these additional mutations (Figure S5).

| DISCUSSION
Genome wide parameters such as TMB and mutational profiles have the potential to play a significant role in precision medicine, either as biomarkers or as an aid in understanding the aetiology of cancer mutagenesis.We used our own data, together with previously published data, to explore alternative approaches for the assessment of these parameters and investigated the potential implications of the methodology used for cancer immunotherapy research.We identified that, based on random subsampling of mutation catalogues, mutation profiles could be accurately reconstructed for most tumours in the cohort using 330 mutations.APOBEC hypermutated profiles, presumably due to their low mutational complexity, were able to be recapitulated at 110 mutations.This finding may have important clinical ramifications, as it implies that these tumours can be effectively characterised with high confidence from lower coverage panels or via methods such as sequencing of ctDNA. 1,2Additionally, in this small cohort, APOBEC hypermutators were significantly more likely to have a mutation burden greater than 10 mutations/Mb when compared to non-hypermutators, consistent with previous reports. 13,35The utility of RADseq to reconstruct mutational profiles and estimate TMB was evaluated and compared to an estimate of the TMB from using the genes analysed in the FDA approved F1 panel.While TMB estimation in this cohort was significantly correlated, libraries of similar coverage outperformed the F1 panel in terms of error and potential utility in simulated ICB trials.We suggest that the current use of the F1 panel, combined with the use of unselected cohorts (compared to potential APOBEC hypermutators) significantly reduces the power to determine whether TMB could be a predictive biomarker for response to ICB.Finally, we demonstrated the utility of RADseq to detect changes in mutation profiles in mouse cell lines, including detection of an APOBEC hypermutation signature.
According to our simulated ICB trials, current clinical trials are likely to be underpowered to truly answer whether TMB can be used as a predictive biomarker for ICB.This may be because unselected cohorts do not represent the full spectrum of TMB in breast cancer.Therefore, pre-selecting APOBEC hypermutated samples may enable assessment of whether there is a significant association between TMB and ICB response.Previously, hypermutation of COSMIC signatures 2 and 13 in breast cancer has been associated with the deletion polymorphism affecting the APOBEC3B locus, especially in ER+ breast cancer. 35,36The frequency of this polymorphism varies based on ancestry.For example, while the MAF is low in European and African populations, East Asian populations exhibit heightened frequency. 37In carriers of the polymorphism, a high TMB is associated with increased immune infiltration and enrichment of an adaptive immune response signature. 11,12,35,36[38][39][40][41] This suggests that it is possible that APOBEC hypermutation, may in itself be predictive of response to immunotherapy.Therefore, an APOBEC enrichment score, combined with a TMB estimate, may provide additional clinical utility compared to a TMB estimate alone.
Our data also suggest that use of the FoundationOne panel to evaluate TMB as a predictive biomarker in immunotherapy trials may reduce the power of these trials.Detection of hypermutated samples was accurate and concordant with a similar experiment carried out by FoundationOne, using lung and colorectal TCGA data. 42n ongoing query in the field, however, is the determination of an optimal cutoff point for high TMB samples.The currently approved threshold for this biomarker is >10 mutations/Mb 2 , which would be an acceptable cutoff using this panel due to the reduction of error above this cutoff.Post-hoc analyses from a range of different tumour types, however, have suggested that optimal cutoffs are likely to differ between tumour types. 3,43,44hese analyses have suggested that tumour types with lower median TMB such as breast cancer do not demonstrate an association between TMB and response to ICB, in comparison to tumour types with high median TMB.Our results suggest that the use of the F1 panel, and perhaps cancer gene panels in general, have poor accuracy when estimating TMB in these 'lower TMB' cancer types.This may be due to the fact that these panels target known cancer genes and therefore, due to selection pressure, alterations would be expected to be found at different frequencies in these genes when compared to the rest of the genome.Therefore, we suggest that the lack of associations seen in these trials could be a result of the measurement error associated with the use of these panels to estimate TMB.Our simulated ICB trials, for example, demonstrated that at lower cutoffs, the inaccuracy of the F1 panels leads to a loss of power to determine whether TMB is predictive of response.As the currently accepted value has been based on a small range of tissue types, exploratory trials in other solid tumour types measuring TMB using this panel should be cautious about concluding associations based on the TMB estimation, especially in tumours with lower median TMB.If a tissue's true predictive value is lower than this threshold of accuracy, any true benefit may be lost due to poor estimation below this threshold.Our results suggest that optimising RADseq based approaches to estimate TMB should resolve these issues, as they have less error in TMB estimation than the F1 panel, even at lower coverage.
Another potential source of discrepancy between these results is that the FoundationOne cancer diagnostics panel sequences at a very high depth (>500×), as the panel may detect mutations at very low variant allele frequency (VAF), 1 and therefore overestimate clonal TMB.As the data used in this study was not of a sufficiently high depth, we were unable to model the extent and consequences of this overestimation of TMB.Tumour diversity may be associated with response to ICB; genetically heterogeneous tumours tend to elicit a relatively poor response to ICB, in comparison to clonal tumours. 45Therefore, a high TMB detected by 'deep and narrow' sequencing methods may represent a clonally diverse tumour with low numbers of mutations in each clone rather than a single clonal population with a high number of mutations and potential antigens.It is currently unknown how VAF values should be integrated into the TMB report. 42Paradoxically, 'shallow and wide' may be a more appropriate approach as variants that are detected will be present at a high VAF in the tumour.Altogether, these data indicate that for studies into previously uncharacterised tumours undergoing ICB, estimations of TMB based on panels targeting a small fraction of the genome should be used with caution to determine threshold values that guide clinical decisions of this biomarker.
Comparatively, the simulation of restriction enzyme based methods showed that these methods can achieve high coverage, provide accurate estimates of TMB, and effectively recapitulate genome wide mutation signatures.Although we only considered single enzymes in our analysis, combinations of enzymes may improve coverage; the potential of using combinations of enzymes to recapitulate mutation signatures in oesophageal cancer has been demonstrated. 17These methods have additional benefits over panel-based approaches, in addition to increased coverage.For example, library preparation methods are relatively simple, cost-effective and many labs may already have access to the resources needed, and there are less PCR enrichment steps which may lead to bias in panel-based analysis.
We have demonstrated the potential use of restriction enzyme associated library preparation methods to ascertain mutation signatures in the murine SSM3 cell line.This is an ER+ breast cancer cell line that is able to be implanted into immunocompetent, syngeneic 129SvEv mice, a potentially powerful model for studying immune populations in ER+ breast cancer, and the effects of endocrine therapies on ER+ tumour biology.Our transfection/cell sorting experiment results demonstrate the utility of RADseq to ascertain the aetiology of mutational processes acting in tumours, specifically the APOBEC mutation signature.Transfection of APOBEC3A mRNA resulted in a characteristic APOBEC signature.Our data suggested that the corresponding murine Apobec3 did not actively generate the APOBEC signature which is supported by the observation that the genetic sequence of this gene shares little homology to mutagenic human APOBEC enzymes, 46 and none of the control/APOBEC3B clones in this experiment demonstrated an increased APOBEC mutation signature.However, we cannot rule out the possibility that RADseq did not detect regions that were exposed to kataegis mutational processes, which have been assigned to both APOBEC3A and APOBEC3B activity. 47Despite this, this experiment demonstrates the high mutagenic potential of this enzyme, even over period of time estimated to be equivalent in this case to a single cell division.Intriguingly, another clone in this group also demonstrated increased mutation rate, albeit without the APOBEC signature.This may have resulted from the of loss of cells that were unable to survive mutagenic stress, leaving only clones with high 'mutagenic fitness', and potentially already high mutation rates, to survive transfection with this mutagenic enzyme.

| Study limitations
One of the shortcomings of our study is the lack of prospective testing on new sequence data from patients and empirical comparison of reduced representation methods.We instead relied heavily on a well characterised mutation dataset, assuming that mutations would be captured as accurately using the reduced representation methods.Previous work has shown concordance between restriction enzyme-based methods and WGS in ascertainment of mutational profiles and as noted above we reaffirmed this finding in a non-human model. 17,48To confirm these observations, direct comparison using the complete WGS and RADseq analysis pipelines on the same human tumour samples could assess the concordance between these methods.Another potential issue is that restriction enzyme sites themselves can be mutated, potentially obfuscating our in silico digest simulations.This may be exploited by researchers, especially if detection of a shift in mutations to an a priori known signature is required; for example, researchers may exploit the bias for APOBEC mutagenesis in TpCpA contexts by digesting the genome with BclI (recognition sequence TCATGA).Despite these limitations, we were able to replicate the previously reported utility of restriction enzyme based library preparation methods in both a different tumour type, with different mutation characteristics and in a non-human model of cancer.

| Future directions
Future applications of the RADseq methodology may include detection of mutation shifts in cell lines akin to experiments outlined by Zou et al. 49 These experiments, previously limited by cost and access for many research labs, can now be performed for a fraction of the price of WGS.Combined with data from Perner et al., 17 this technique represents a cost-effective tool for characterisation of mutation signatures in cancer.Experimental approaches and larger scale observation studies using this technique may reveal unknown associations between treatments and mutational signatures, as well as new approaches to diagnostics.Additionally, this study suggests that the previously defined 10 mutations/Mb threshold to identify patients likely to respond to anti-PD-1 therapy may need to be reviewed as it was initially derived using a TMB assessment methodology that our findings suggest is inaccurate.Clinical trials should use WGS, or a high coverage RADseq method as described in this paper to re-evaluate the optimal TMB cutoff.Additionally, breast cancer immunotherapy trials should prioritise recruitment of patients from populations with a high frequency of the APOBEC deletion polymorphism to increase the likelihood of capturing TMB high patients who are more likely to benefit from ICB.

| CONCLUSION
In conclusion, we have demonstrated that restriction enzyme-based library preparation methods can be used to accurately recapitulate breast cancer mutation signatures and TMB.Comparatively, RADseq based methods offer improved TMB estimation when compared to a derived version of the widely used FoundationOne panel, which may aid in exploration of association between TMB and response to ICB.We also demonstrated the utility of restriction enzymes to characterise the genome of a model organism cell line and generation of episodic APOBEC mutagenesis in this cell line.Due to the relatively low cost and increased availability of these methods, they may be preferable to assays such as WGS or WES in a range of tumour types for estimation of TMB and mutation signature quantification where these parameters have prognostic/ predictive value.

F I G U R E 2
Similarity of mutation profiles derived from whole genome sequencing (WGS) and randomly subsetted mutation catalogues from 560 breast cancer profiles.Each sample's mutation catalogue was randomly subset from 10 to 800 mutations for 10 replicates.(A) Distribution of cosine similarity to WGS profiles as a function of number of mutations subset.Each point represents the cosine similarity of a randomly subsetted profile to the WGS profile for that sample.(B) number of mutations sub sampled versus proportion of profiles that result in 90% cosine similarity to WGS profile.An asterix is shown to depict the point at which >95% samples can show >90% cosine similarity to the WGS profile (Bonferroni corrected exact binomial test for proportion of samples >0.95)Samples are grouped and coloured by APOBEC hypermutation status, defined as a sample containing >40% Tp(C>T/G)pW mutations in the 96 trinucleotide catalogue.

F I G U R E 3
Similarity of mutation profiles derived from whole genome sequencing (WGS) and randomly subsetted mutation catalogues from 560 breast cancer profiles and the distribution of mutation burden between APOBEC hypermutated versus non-hypermutated samples.Each sample's mutation catalogue was randomly subset from 1 to 800 mutations for 10 replicates.(A) Distribution of cosine similarity to WGS profiles as a function of number of mutations subset.Samples are coloured by the number of mutations in the sample.Each point represents the cosine similarity of a randomly subsetted profile to the WGS profile for that sample.(B) The distribution of mutation counts between nonAPOBEC hypermutated (APOBEC_typical) and APOBEC hypermutated (APOBEC_hyper) samples.The p-value of the Wilcoxon rank sum test between the two groups is shown, as well as the Fisher's exact p-value to test for association (bottom right).APOBEC hypermutation status was defined as a sample containing >40% Tp(C>T/G)pW mutations in the 96 trinucleotide catalogue.

F I G U R E 4
Representative library capture information.(A) number of mutations captured by each library for each of 560 breast cancer genomes.(B) Distribution of cosine similarity of the library 96 trinucleotide signature profile to the whole genome sequencing (WGS) trinucleotide signature profile.(C) same as (B) but split by APOBEC hypermutation status.(D) Library versus proportion of profiles that result in 90% cosine similarity to WGS profile.Samples are coloured by APOBEC hypermutation status as in (C).

F I G U R E 6
Simulated immune checkpoint blockade (ICB) trials.(A) Model tested, whereby the probability of response to ICB increases as a function of the tumour mutation burden (TMB) (mutations/Mb) (x axis).The distribution of the model is depicted in black.Samples tested are depicted on as jittered points, where the response to ICB is depicted as the colour.(B), (D) odds ratio for response (y) at defined for defined mutation cutoff (x) in each of 100 different trials (depicted as points).Trials are coloured either by cohort (B) or by library used to estimate TMB.(C, E) Proportion of trials reporting significant association (y) at a defined cutoff (x).Bars are coloured as in (B) and (D).Above are Fisher's exact test p-values testing proportions reporting significant association bv cohort (C) or by library used to measure mutation rate (D).

F I G U R E 7
Genomic profiling of derivative 0D5 clones transfected with APOBEC mRNAs using RADseq.(A) mutation rate (mutations/Mb sequence) in different samples.(B) Cosine similarity of mutation profiles.Tiles are coloured by value, with the cosine similarity value between two profiles printed inside each tile.(C) Proportion of mutations in either T(C>T)W or other contexts in each sample.Above, results of one-sided proportion test for proportion of T(C>T) W mutations in each sample, using the proportion in the 0D5 parental clone as a reference.ns, not significant; ***p < 0.001.