Comprehensive landscape and interference of clonal haematopoiesis mutations for liquid biopsy: A Chinese pan‐cancer cohort

Abstract Tumour‐derived DNA found in the plasma of cancer patients makes it possible to detect somatic mutations from circulating cell‐free DNA (cfDNA) in plasma samples. However, clonal hematopoiesis (CH) mutations affect the accuracy of liquid biopsy for cancer diagnosis and treatment. Here, we profiled the landscape of CH mutations in 11,725 Chinese pan‐cancer patients and explored the effects of CH on liquid biopsies in the real world. We first identified 5933 CH mutations based on panel sequencing of matched white blood cell DNA and cfDNA on 301 genes for 5100 patients, in whom the number of CH mutations was positively correlated with age at diagnosis. We observed that canonical CH‐related genes, including DNMT3A, TET2, ASXL1, TP53, ATM, CHEK2 and SF3B1, were dominant in the Chinese cohort, and that 13.29% of CH mutations appeared only in the Chinese cohort when compared with the Western cohort. Analysis of CH gene distribution bias indicated that CH tended to appear in genes with functions in tyrosine kinase regulation, PI3K‐Akt signalling and TP53 activity, suggesting unfavourable effects of CH mutations in cancer patients. We further confirmed the effect of CH in driver genes on somatic mutation detection in liquid biopsies of cancer patients. Forty‐eight actionable somatic mutations in 17 driver genes were identified as CH mutations in 92 patients (1.80%) of the Chinese cohort, implying potential impacts of CH on clinical decision‐making. Taken together, this study provides strong evidence that CH‐derived gene mutations interfere with the accuracy of cfDNA‐based liquid biopsies for cancer diagnosis and treatment in the real world.


| INTRODUCTION
Tumour-derived DNA was found in the plasma of cancer patients by Stroun et al. 1 30 years ago, making it possible to detect somatic mutations from circulating cell-free DNA (cfDNA) in plasma samples collected noninvasively. With advances in sequencing technology, somatic mutations detected from cfDNA have been used to diagnose and manage cancer patients in several ways, 2 including screening for early carcinoma, 3 guiding systemic therapy 4 and monitoring minimal residual disease (MRD). 5 However, the origin of mutations observed in cfDNA from plasma samples is diverse and includes tumour- and clonal hematopoiesis (CH)-derived mutations. Several studies have indicated that the source of some gene mutations in cfDNA is not the matched tumour in metastatic breast cancer (MBC), non-small-cell lung cancer (NSCLC) and castration-resistant prostate cancer (CRPC). 6,7 These alterations are mainly attributable to interference from CH, as multiple studies have shown. 6-9 Thus, CH mutations in plasma can affect the accuracy of liquid biopsy for cancer diagnosis, treatment and management.
Although three genes, DNMT3A, ASXL1 and TET2, are hot genes, 10 CH mutations also appear in other genes, particularly some driver genes that are indicators for cancer therapy. Hu et al. 7 found that 2 out of 58 (3.4%) advanced NSCLC patients with mutant EGFR had CH mutations in KRAS (G12X) that were persistent in the blood.
Mutations in KRAS are genomic markers indicating resistance to tyrosine kinase inhibitors (TKIs) targeting EGFR; thus, CH mutations in KRAS can lead to a misleading prognosis for TKI therapy. In addition to KRAS, several CH mutations in JAK2 and TP53 were also observed in patients with advanced NSCLC in this study. Furthermore, Jensen et al. 11 found that 7 out of 69 (10.1%) patients with advanced prostate cancer had CH mutations in DNA repair genes, including ATM, BRCA2 and CHEK2. Mutations in these DNA repair genes have been approved to guide the use of poly (ADP-ribose) polymerase inhibitors (PARPi) in patients with advanced prostate cancer, so CH mutations might induce misdiagnosis. Furthermore, Li et al. 12 analysed the association between CH mutations and genomic markers related to immunotherapy. In particular, patients with a high tumour mutational burden (TMB) possessed CH mutations, implying that improperly counting CH mutations could affect TMB estimates.
Taken together, CH mutations should be distinguished from tumour-derived mutations in the guidance of various cancer treatments, whether targeted therapy or immunotherapy.
Studies focused on the relationship between CH mutation and cancer showed that an increased risk of haematologic cancers is associated with the existence of CH mutations, 10,13,14 particularly those harboured by leukaemia driver genes (e.g. DNMT3A, ASXL1, TET2, PPM1D, TP53, RAD21, STAG2, ATM, NF1, CALR, JAK2, CBL, SETD2 and MPL). Patients with CH mutations also had adverse prognoses of nonhaematologic cancers with shorter survival times, 13 likely due to interactions between CH clones and cancer cells. 15  improved. Another study adopted high-intensity sequencing with a depth larger than 60,000× for cfDNA and matched peripheral blood lymphocyte (PBL) samples and machine learning models to call CH mutations. 6 Ultra-deep sequencing elevated both the sensitivity and specificity in CH mutation calling, resulting in a much more accurate and clearer CH landscape. In our study, we compared the CH landscape between the Chinese and Western cohorts.
In this study, we comprehensively profiled CH mutations using a bioinformatics pipeline and characterized CH mutations in a Chinese pan-cancer cohort. To improve the accuracy of CH mutation detection, we identified CH mutations under a statistical framework that compares the distribution of alteration-supporting reads in cfDNA and matched PBL samples using Fisher's exact test, treating mutations whose distribution of alteration-supporting reads was similar in both samples as CH mutations. To further avoid false-positive calling, we filtered out CH mutations that appeared in matched tumour samples. Next, we systematically compared the landscape of CH mutations between the Chinese and Western cohorts discussed above. We further defined CH-, germline- and somatic-preferred genes according to the distribution of CH, germline and somatic mutations and found distinctive patterns of enriched functions for the different categories. Additionally, we evaluated the potential effects of CH mutations on liquid biopsy; the results suggest that CH poses some risk to tumour diagnosis and the administration of targeted anticancer drugs in the real world.

| Sample collection
In total, 11,725 pan-cancer patients and 30 asymptomatic individuals without known cancer were enrolled in this study. For all cancer patients, the genomic DNA of PBL and cfDNA in plasma were extracted and sequenced to identify CH mutations. The tumour tissues of 2336 cancer patients were also collected for sequencing to validate the true CH mutations. For asymptomatic individuals, the genomic DNA of PBL and cfDNA in plasma were extracted and sequenced to identify CH mutations.

| DNA extraction
The genomic DNA of PBL was extracted using the TGuide

| Library preparation
The genomic DNA from PBL was fragmented into DNA pieces of ap-

| Capture of targeted regions and sequencing
For cancer patients, the libraries of genomic DNA and cfDNA were captured using an in-house designed panel spanning a 1.89-Mb genomic region and including 468 genes (Table S3). For asymptomatic individuals, the libraries of genomic DNA and cfDNA were captured using another in-house-designed panel spanning a 0.55-Mb genomic region and including 118 genes (

| Data processing and mutation calling
Adaptor sequences and low-quality bases of sequenced reads were trimmed using Trimmomatic (v0.36) 16 to obtain clean reads. Clean reads were mapped to the human reference genome (hg19) using BWA (v0.7.17). 17 The mapping results were sorted and masked for duplications using Picard (v2.23.0). 6 From the sorted and duplication-masked mapping results of PBL, cfDNA and tumour tissue samples, SNVs and InDels were called using VarDict (v1.5.1), 18 while complex mutations were called using FreeBayes (v1.2.0). 19 To avoid false-positive results, SNVs and InDels that appeared in the blacklist (including sequence-specific errors, repeat regions, segmental duplications and lowly mappable regions recorded in ENCODE 20 ) were removed. The filtered mutations were annotated using ANNOVAR (2015Jun17), 21 and synonymous mutations were not considered in this study.

| Identification of somatic, germline and CH mutations
First, mutations detected only in cfDNA were considered candidate somatic mutations. We retained somatic mutations that satisfied the following criteria: (1) the sequencing depth of the mutation was not smaller than 100×; (2) the number of reads supporting the variant allele was not smaller than 2; (3) the VAF of the mutation was not smaller than 0.3% and (4) the minor allele frequency (MAF) of the mutation in the gnomAD 22 and ExAC 23,24 databases was not larger than 0.2%. Furthermore, mutations recorded in the dbSNP 25 database but not in the COSMIC 26,27 database and mutations in the HLA locus were filtered out. Second, mutations present in PBL with a sequence depth of at least 30× and VAF not smaller than 20% were considered candidate germline mutations. In the subsequent analysis, we retained only pathogenic germline mutations that were recorded as 'Pathogenic' or 'Likely pathogenic' in the ClinVar database. 28 Third, for mutations present in PBL with a sequence depth of at least 30× but VAF smaller than 20%, we conducted Fisher's exact test on the count distribution of reference and variant alleles in paired samples of cfDNA and PBL to identify CH mutations. If the p-value was not smaller than 0.05 and the odds ratio was larger than 0.5 and smaller than 1.5, the mutation was considered a candidate CH mutation.
Otherwise, the mutation was considered a candidate somatic mutation (variant alleles appeared much more frequently in the cfDNA sample, the p-value was smaller than 0.05, and the odds ratio was larger than 1).
Candidate CH mutations were retained if there were no fewer than 2 reads supporting the variant allele in both cfDNA and PBL samples. Then the carrier ratio of each CH was obtained by calculating the percentage of samples carrying each mutation in total samples carrying at least one CH mutation. We further filtered out CH mutations whose carrier ratio was larger than 0.25% or that were called somatic mutations from the matched tissue sample.
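The read-count comparison described above can be illustrated with a minimal Python sketch. It is not the in-house pipeline: the function names, the `unclassified` label and the orientation of the odds ratio (cfDNA odds divided by PBL odds) are assumptions made for illustration, and the two-sided Fisher's exact test is implemented directly from the hypergeometric distribution so that the example needs only the standard library. Only the thresholds (30× PBL depth, 20% VAF, p = 0.05, odds ratio between 0.5 and 1.5, at least 2 supporting reads) come from the text.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(px for px in map(prob, range(lo, hi + 1)) if px <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def classify_variant(cf_alt: int, cf_ref: int, pbl_alt: int, pbl_ref: int) -> str:
    """Label one variant from its alt/ref read counts in cfDNA and matched PBL."""
    pbl_depth = pbl_alt + pbl_ref
    if pbl_alt == 0:
        # Seen only in cfDNA: candidate somatic (further filters not shown here).
        return "candidate_somatic" if cf_alt > 0 else "unclassified"
    if pbl_depth < 30:
        return "unclassified"  # assumption: below the 30x PBL depth requirement
    if pbl_alt / pbl_depth >= 0.20:
        return "candidate_germline"

    p_value = fisher_exact_two_sided(cf_alt, cf_ref, pbl_alt, pbl_ref)
    # Assumed orientation: odds of the variant in cfDNA relative to PBL.
    denominator = cf_ref * pbl_alt
    odds_ratio = (cf_alt * pbl_ref) / denominator if denominator else float("inf")

    if p_value >= 0.05 and 0.5 < odds_ratio < 1.5:
        # Require at least 2 supporting reads in both samples.
        return "candidate_CH" if cf_alt >= 2 and pbl_alt >= 2 else "unclassified"
    if p_value < 0.05 and odds_ratio > 1:
        return "candidate_somatic"
    return "unclassified"  # assumption: cases the text does not spell out


# Hypothetical read counts, for illustration only.
print(classify_variant(10, 990, 5, 495))    # similar VAF in cfDNA and PBL
print(classify_variant(50, 950, 2, 998))    # strongly enriched in cfDNA
print(classify_variant(480, 520, 240, 260)) # high VAF in PBL
```

With these hypothetical counts, the first variant has the same 1% VAF in both samples and is labelled a candidate CH mutation, the second is labelled candidate somatic, and the third candidate germline. Cases that the text does not describe explicitly (for example, a significant test with an odds ratio below 1) are left unclassified in this sketch rather than guessed.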

| Correlation analysis between CH mutation and diagnostic age
We grouped the pan-cancer patients according to the range of diagnostic age, including ≤40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-100 years. For each group, we calculated the percentage of patients carrying at least one CH mutation (Table S5). Pearson's correlation coefficient between the percentage and age group was calculated. For each cancer type, we conducted the same analysis.
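As a rough illustration of this analysis, the following Python sketch bins patients by diagnostic age, computes the percentage of CH carriers per bin and correlates it with the bin order. Two points are assumptions rather than details from the text: the age bins are treated as right-closed intervals (for example, 40 < age ≤ 50), and each age group is encoded by its ordinal index when computing Pearson's correlation coefficient. The patient records are invented.

```python
from math import sqrt

# Upper bounds (years) of the age groups; the first group is <=40.
AGE_UPPER_BOUNDS = [40, 50, 60, 70, 80, 90, 100]


def age_group_index(age: float) -> int:
    """Return the ordinal index of the age group (assumed right-closed bins)."""
    for index, upper in enumerate(AGE_UPPER_BOUNDS):
        if age <= upper:
            return index
    raise ValueError(f"age {age} is outside the defined groups")


def carrier_percent_by_group(patients):
    """patients: iterable of (age, has_ch) pairs -> {group index: % carriers}."""
    totals, carriers = {}, {}
    for age, has_ch in patients:
        group = age_group_index(age)
        totals[group] = totals.get(group, 0) + 1
        carriers[group] = carriers.get(group, 0) + int(has_ch)
    return {g: 100.0 * carriers[g] / totals[g] for g in sorted(totals)}


def pearson_r(xs, ys) -> float:
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented toy cohort: 10 patients per group with 1, 2 and 3 CH carriers.
toy_patients = (
    [(35, i < 1) for i in range(10)]
    + [(45, i < 2) for i in range(10)]
    + [(55, i < 3) for i in range(10)]
)
percent = carrier_percent_by_group(toy_patients)
r = pearson_r(list(percent.keys()), list(percent.values()))
print(percent, round(r, 3))
```

Running the same functions on the subset of patients with a given cancer type would correspond to the per-cancer-type analysis.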

| Distribution difference of 189 genes in the Chinese and Western cohorts
After removing genes only included in the respective panels used for sequencing, we compared the distribution of CH mutations in 277 genes between Chinese and Western cohorts and found that 189 of those genes carried CH mutations in both cohorts.
[Contingency table fragment: sample number without CH mutations in genes: c, d]

| Identification of CH-, germline- and somatic-preferred genes
We defined CH mutations that were also germline or somatic mutations in other patients as polymorphic CH mutations and selected them for this analysis. For each polymorphic CH mutation, we calculated the percentage of its occurrences in this cohort that were CH, germline or somatic. If the CH, germline or somatic percentage was larger than 50%, the polymorphic CH mutation was labelled as a CH-, germline- or somatic-preferred mutation, respectively. Based on the labelled mutations, we classified genes as follows: genes with CH-preferred mutations only were CH-preferred genes; genes with CH-preferred mutations and more germline-preferred than somatic-preferred mutations were germline-preferred genes; and genes with CH-preferred mutations and more somatic-preferred than germline-preferred mutations were somatic-preferred genes.
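The labelling rules above can be sketched in a few lines of Python. The function names and input layout are hypothetical, and the handling of cases the text leaves open (a gene with equal numbers of germline- and somatic-preferred mutations, or a mutation with no category above 50%) is an assumption: such cases are returned as `None` here.

```python
from collections import Counter


def label_mutation(n_ch: int, n_germline: int, n_somatic: int):
    """Label a polymorphic CH mutation by the category covering >50% of its calls."""
    total = n_ch + n_germline + n_somatic
    if total == 0:
        return None
    for label, count in (("CH", n_ch), ("germline", n_germline), ("somatic", n_somatic)):
        if count / total > 0.5:
            return label
    return None  # no category exceeds 50%


def label_gene(mutation_labels):
    """Label a gene from the labels of its polymorphic CH mutations."""
    counts = Counter(label for label in mutation_labels if label is not None)
    if counts["CH"] == 0:
        return None  # the rules require at least one CH-preferred mutation
    if counts["germline"] == 0 and counts["somatic"] == 0:
        return "CH-preferred"
    if counts["germline"] > counts["somatic"]:
        return "germline-preferred"
    if counts["somatic"] > counts["germline"]:
        return "somatic-preferred"
    return None  # assumption: ties are left unlabelled


# Invented call counts (CH, germline, somatic) for the mutations of one gene.
labels = [label_mutation(*counts) for counts in [(8, 1, 1), (1, 0, 6), (2, 0, 9)]]
print(labels, label_gene(labels))
```

In this toy example the gene carries one CH-preferred and two somatic-preferred mutations, so it is labelled somatic-preferred.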

| Overlap of CH mutations with actionable mutations
We overlapped CH mutations with actionable mutations collected in a knowledge base constructed by Genecast Biotechnology Co., Ltd., such as germline and somatic mutations that are therapy targets or prognosis markers. Overlaps were called if the genomic position and variant allele were exactly matched between CH mutations and actionable mutations. Overlapping CH mutations were listed and used to conduct descriptive statistics analysis (Table S6).
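The exact-match rule can be expressed as a simple keyed lookup. The sketch below is illustrative only: the record layout (`chrom`, `pos`, `alt`) and all coordinates are hypothetical, and the actual knowledge base format is not described in the text.

```python
def mutation_key(mutation: dict) -> tuple:
    """Key used for exact matching: genomic position plus variant allele."""
    return (mutation["chrom"], mutation["pos"], mutation["alt"])


def overlap_with_actionable(ch_mutations, actionable_mutations):
    """Return the CH mutations whose position and variant allele match exactly."""
    actionable_keys = {mutation_key(m) for m in actionable_mutations}
    return [m for m in ch_mutations if mutation_key(m) in actionable_keys]


# Hypothetical records; the coordinates are placeholders, not real variants.
ch_calls = [
    {"chrom": "chrA", "pos": 100, "alt": "T", "patient": "P1"},
    {"chrom": "chrA", "pos": 100, "alt": "G", "patient": "P2"},
    {"chrom": "chrB", "pos": 250, "alt": "A", "patient": "P3"},
]
knowledge_base = [
    {"chrom": "chrA", "pos": 100, "alt": "T"},
    {"chrom": "chrB", "pos": 999, "alt": "A"},
]
overlapping = overlap_with_actionable(ch_calls, knowledge_base)
print([m["patient"] for m in overlapping])
```

Only the first call overlaps: the second shares the position but not the variant allele, and the third shares neither.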

| Identification of CH mutations in cancer patients
By comparing the alteration-supporting reads distributed in the genomic DNA of PBL and matched cfDNA extracted from plasma samples, we comprehensively identified CH mutations using an in-house pipeline (Figure 1A; see Methods for details). If no distribution bias of alteration-supporting reads was found for a specific mutation (the p-value reported by Fisher's exact test was not smaller than 0.05), we called it a CH mutation. In total, we called 6034 candidate CH mutations. Next, we applied several criteria to filter the candidate CH mutations. First, we filtered out candidate CH mutations whose number of alteration-supporting reads was smaller than 2 in either the PBL or the matched plasma sample, with 6023 remaining.
Second, considering that the generation of CH mutations was random in populations, we filtered out candidate CH mutations whose carrier ratio was larger than 0.25%. We selected 0.25% as the carrier ratio threshold because we observed that most of the candidate CH mutations (5949 of 6023; 98.8%) had a carrier ratio no larger than this value (Figure 1B).
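The carrier-ratio filter can be sketched as follows, using the definition given in the Methods: the percentage of samples carrying a given mutation among all samples carrying at least one CH mutation. The function name, the `(sample_id, mutation_id)` input layout and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def filter_by_carrier_ratio(calls, max_ratio_percent: float = 0.25):
    """Drop candidate CH calls for mutations carried by too many samples.

    calls: iterable of (sample_id, mutation_id) pairs, one per candidate CH call.
    Returns (kept_calls, carrier_ratio_percent_by_mutation).
    """
    calls = list(calls)
    carriers = defaultdict(set)
    samples_with_ch = set()
    for sample_id, mutation_id in calls:
        carriers[mutation_id].add(sample_id)
        samples_with_ch.add(sample_id)

    n_samples = len(samples_with_ch)
    ratios = {m: 100.0 * len(s) / n_samples for m, s in carriers.items()}
    kept = [(s, m) for s, m in calls if ratios[m] <= max_ratio_percent]
    return kept, ratios


# Invented toy data: 1000 samples each carry a private mutation, and the
# first 10 samples additionally share one recurrent call.
toy_calls = [(f"S{i}", f"private_{i}") for i in range(1000)]
toy_calls += [(f"S{i}", "recurrent") for i in range(10)]
kept_calls, ratio = filter_by_carrier_ratio(toy_calls)
print(len(kept_calls), ratio["recurrent"], ratio["private_0"])
```

In the toy data the recurrent call has a carrier ratio of 1% and is removed, while each private mutation has a carrier ratio of 0.1% and is kept.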

| Landscape of CH mutations in Chinese pan-cancer patients
We identified CH mutations in a large Chinese pan-cancer cohort, including 11,725 patients with 18 cancer types ( Figure 1C) Table S1).
Most patients possessed one (3080 of 5100 patients; 60.4%) or two CH mutations (1289 of 5100 patients; 25.3%; Figure 2A). The distribution of CH mutation number of patients was similar across different cancer types ( Figure S1A;  Figure 2B; Table 1), suggesting that CH occurred more frequently in older patients.
Next, we calculated the carrier ratio for genes harbouring CH mutations and ranked them according to the sample number carrying CH mutations ( Figure 2C). For the top 20 genes in the ranking,

| Comparison of CH mutations with Western cohort
To detect whether a distinctive pattern exists, we compared the landscape of CH mutations between our Chinese cohort and a Western cohort. 19 After removing genes only included in the respective panels used for sequencing, we compared the distribution of CH mutations in 277 genes between these two cohorts and found that 189 of those genes carried CH mutations in both cohorts. First, we observed that the DNMT3A, TET2 and ASXL1 genes were hot genes of CH mutations in the Western cohort, consistent with that in the Chinese cohort.
Based on depicting the domain distribution of CH mutations in those genes, we found that the pattern was similar between the Chinese and Western cohorts ( Figure

| Functional bias of CH-, germline- and somatic-preferred genes
We found that CH mutations in one patient might be germline or somatic mutations in other patients of this Chinese cohort.
Thus, we classified genes into three categories according to the percentage of CH, germline and somatic mutations that appeared in each gene: CH-preferred (n = 256), germline-preferred (n = 0) and somatic-preferred (n = 38) genes (see Methods for details). Next

| Interference of CH mutations on liquid biopsy
Liquid biopsy is a common strategy to detect actionable somatic mutations, which are therapy targets or prognosis markers, for cancer patients. However, the existence of CH mutations in plasma affects the power of identifying true actionable somatic mutations. To uncover the scope of influence caused by CH mutations, we over-

| DISCUSSION
Although liquid biopsy has largely been used to identify gene mutation sites and guide anticancer drugs by applying NGS technologies, CH-associated mutations from peripheral blood might interfere with   36 We further revealed that the CH mutation spectrum in the DNMT3A, TET2, and ASXL1 genes was similar between the Chinese and Western cohorts.
However, 81 genes were uniquely identified in the Chinese cohort, and only 2 genes were uniquely identified in the Western cohort.
This result might be due to differences in sequencing depth and sample size in the Chinese and Western cohorts.
The haematopoietic system is responsible for approximately one trillion (10^12) cells arising daily in the adult human bone marrow. 43 The genetic diversity within the haematopoietic stem cell compartment is significant in aged individuals, and each haematopoietic stem cell may acquire on the order of one exonic somatic mutation per decade. 44 Our data showed that CH-preferred genes were enriched in cell proliferation and metabolic regulation (such as the PI3K/Akt signalling pathway), suggesting that these genes are active and working. A previous report indicated that 10% of persons older than 65 years, but only 1% of those younger than 50 years, had CH mutations, 10 revealing that CH mutations become increasingly common as individuals age. 10,45 The positive correlation between CH and age is consistent with the age-associated decrease in DNA repair capacity. 46

| CONCLUSIONS
This study helped to characterize CH mutations in a large Chinese pan-cancer cohort and compared the landscape of CH mutations between the Chinese and Western cohorts. Heterogeneity in the different types of CH genes was also explored. We further investigated overlap between CH mutations and drug-targeted somatic mutations in clinical examination. In summary, our research provides additional strong evidence that CH mutations can interfere with the liquid biopsy of plasma for cancer diagnosis and treatment.

ACKNOWLEDGEMENTS
We thank the patients who participated in the research. We appreciate the technical support provided by Beijing Genecast Biotechnology Co., Ltd. for this work.

PATIENT CONSENT STATEMENT
Written informed consent was obtained from all patients participating in the study for use of the samples for research and publication.

DATA AVAILABILITY STATEMENT
The datasets are available from the corresponding author and Genecast Biotechnology Co., Ltd. on reasonable request.