Germline ancestry influences the evolutionary disease course in lung adenocarcinomas

Abstract Precision medicine relies on targeting specific somatic alterations present in a patient's tumor. However, the extent to which germline ancestry may influence the somatic burden of disease has received little attention. We estimated the genetic ancestry of non‐small‐cell lung cancer (NSCLC) patients and performed an in‐depth analysis of the influence of genetic ancestry on the evolutionary disease course. Compared with European Americans (EA), African Americans (AA) with lung adenocarcinoma (LUAD) were found to be significantly younger and smoke significantly less. However, LUADs from AAs exhibited a significantly higher somatic mutation burden, with a more pronounced tobacco carcinogen footprint and increased frequencies of alterations affecting cancer genes. Conversely, no significant differences were observed between lung squamous cell carcinomas (LUSC) from EAs and AAs. Our results suggest germline ancestry influences the somatic evolution of LUAD but not LUSC.


| INTRODUCTION
Large-scale sequencing projects, including the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have revolutionized our understanding of the genomic basis of cancer. Studies building upon these data have identified scores of cancer-associated genes (Bailey et al., 2018; Lawrence et al., 2014) and revealed many of the mutational processes underpinning cancer development (Alexandrov et al., 2013, 2015, 2020). However, to date most studies have not focused on the potential influence of germline ancestry on cancer development and cancer evolution. Emerging data suggest germline ancestry can influence disparities in cancer care and the subsequent disease course. For instance, women with African ancestry (AA) have been reported to have higher breast cancer mortality, compared to women with European ancestry (EA), which has been associated with a higher occurrence of the more aggressive triple-negative form (Daly & Olopade, 2015). Colorectal cancer has been shown to be more lethal in both AA men and women relative to EA individuals (O'Keefe et al., 2015). These are not isolated observations as AA individuals have unfavorable health outcomes within numerous cancer types (Polite et al., 2017). The TCGA PanCan Atlas Germline Working Group also reported that on average, AA individuals harbor more germline predisposing variants relative to EA (Huang et al., 2018), and an enrichment of TP53 mutations (Yuan et al., 2018).
Here, we interrogate differences in the cancer genomes of EA and AA individuals for two subtypes of non-small-cell lung cancer, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), using data from TCGA (Campbell et al., 2016). We explore differences in the age at diagnosis and in disease progression between the two ancestry groups. We also investigate differences in the strength of association between tobacco exposure and lung cancer development.

| Defining germline ancestry
The genetic ancestries of the TCGA LUAD and LUSC cohorts were determined by applying ADMIXTURE (Liu et al., 2013) with reference populations from the 1000 Genomes Project (1KGP) (Liu et al., 2013). The workflow is illustrated in Figure 1.
ADMIXTURE estimates ancestry in a model-based manner from large autosomal SNP genotype datasets. It uses maximum likelihood estimation with a block relaxation approach, a fast numerical optimization algorithm (Liu et al., 2013). The model expresses the probability of the observed genotypes in terms of ancestry proportions and population allele frequencies, and both are estimated simultaneously. The supervised approach used in this analysis requires a training dataset as well as the number of clusters to be estimated. In this case, the number of clusters was set to k = 5, matching the number of super-populations in the 1KGP: EA (European American), AA (African American), SAS (South Asian), EAS (East Asian), and AMR (American).
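For orientation, the likelihood that ADMIXTURE maximizes can be written as below. This is the standard form from the ADMIXTURE literature rather than anything stated in this section, and the symbols are assumptions introduced here: $g_{ij}$ is the number of copies of the reference allele (0, 1, or 2) carried by individual $i$ at SNP $j$, $q_{ik}$ is the fraction of individual $i$'s ancestry assigned to cluster $k$, and $f_{kj}$ is the allele frequency at SNP $j$ in cluster $k$.

```latex
\ln L(Q, F) = \sum_{i} \sum_{j} \left\{ g_{ij} \ln\!\left[ \sum_{k=1}^{K} q_{ik} f_{kj} \right]
            + (2 - g_{ij}) \ln\!\left[ 1 - \sum_{k=1}^{K} q_{ik} f_{kj} \right] \right\},
\qquad \sum_{k=1}^{K} q_{ik} = 1, \quad 0 \le f_{kj} \le 1 .
```

With $K = 5$, each row of $Q$ holds the five ancestry proportions of one patient. In supervised mode, the rows belonging to reference individuals are fixed to their known population.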
FIGURE 1 (a) Overview of the workflow. Ancestry estimation was performed by applying ADMIXTURE to the TCGA LUAD and LUSC cohort using 1KGP germline data as reference population and setting k = 5, matching the number of super-populations in the 1KGP cohort. Using a threshold of 0.8, classification of TCGA data was performed. Independently, a principal component analysis (PCA) was applied to the TCGA and 1KGP germline data to evaluate the consistency of the ADMIXTURE analysis. Subsequently, a comparison of the EA and AA cohort was conducted. (b) Scatter plot of PCA component 1 against PCA component 2, PCA component 1 against PCA component 3 and PCA component 2 against PCA component 3 for 1KGP (light red for EA and light blue for AA) and for TCGA (dark red for EA and dark blue for AA). Reassuringly, the estimated ancestries in TCGA and given ancestries in 1KGP in each group cluster together very well without any overlap to the other group. An equivalent plot for LUSC is shown in Figure S1.

For each TCGA sample in the LUSC and LUAD cohort, a patient was considered as EA or AA, when the proportion for the European or African cluster was higher than 0.8, respectively, resulting in 36 AA and 448 EA patients in LUAD and 19 AA and 450 EA patients in LUSC. The results did not change qualitatively if the threshold was varied between 0.7 and 0.9. American, South Asian, and East Asian individuals were defined in the same way. The estimated ancestry was compared to the race reported in the TCGA clinical data to cross-check the results of the ADMIXTURE run.
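The 0.8 threshold rule used for classification can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `classify_ancestry`, the patient identifiers, and the proportions are all hypothetical, and the input is assumed to be one row of ADMIXTURE ancestry proportions per patient.

```python
from typing import Dict, Optional


def classify_ancestry(proportions: Dict[str, float],
                      threshold: float = 0.8) -> Optional[str]:
    """Return the cluster whose proportion exceeds the threshold, else None."""
    label, value = max(proportions.items(), key=lambda kv: kv[1])
    return label if value > threshold else None


# Hypothetical ancestry proportions (each row sums to 1), for illustration only.
samples = {
    "patient_1": {"EA": 0.93, "AA": 0.02, "SAS": 0.02, "EAS": 0.01, "AMR": 0.02},
    "patient_2": {"EA": 0.10, "AA": 0.85, "SAS": 0.02, "EAS": 0.01, "AMR": 0.02},
    "patient_3": {"EA": 0.55, "AA": 0.40, "SAS": 0.02, "EAS": 0.01, "AMR": 0.02},
}

labels = {pid: classify_ancestry(q) for pid, q in samples.items()}
```

Patients whose largest proportion does not exceed the threshold are left unassigned (`None`). Rerunning with `threshold=0.7` or `threshold=0.9` mirrors the sensitivity check described in the text.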
After defining the two groups, clinical features were compared using the t test, the Wilcoxon test, and Fisher's exact test. These features were age at initial diagnosis, lifetime tobacco exposure measured in pack-years (number of cigarettes smoked per day, divided by 20, multiplied by the number of years smoked), and tumor stage.
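As a worked example of the pack-year definition, the following sketch computes it directly. The function name `pack_years` is hypothetical, and a pack is assumed to contain 20 cigarettes, as in the definition given in the text.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Lifetime tobacco exposure: packs per day (20 cigarettes) times years smoked."""
    return cigarettes_per_day / 20.0 * years_smoked


# One pack a day for 30 years and half a pack a day for 20 years.
heavy = pack_years(20, 30)
light = pack_years(10, 20)
```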

| Exploring driver differences
For each gene in the list of cancer genes (Bailey et al., 2018; Lawrence et al., 2014; Martincorena et al., 2017) (Table S1 for LUAD and Table S2 for LUSC), the frequency of patients with at least one nonsilent mutation was determined for each group and compared between groups using Fisher's exact test. False discovery rate (FDR) control was used to account for multiple testing. The uncertainty of the relative frequency estimates was indicated by 95% confidence intervals.
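The per-gene comparison can be sketched as follows. The code is a plain-Python illustration under stated assumptions: the text does not say which FDR procedure was used, so Benjamini-Hochberg is assumed here, and the gene names and mutation counts are hypothetical (only the cohort sizes of 36 AA and 448 EA LUAD patients come from the text).

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts of mutated patients per gene: (mutated AA, mutated EA).
n_aa, n_ea = 36, 448
mutated = {"GENE_A": (12, 80), "GENE_B": (5, 60), "GENE_C": (20, 150)}

genes = list(mutated)
raw_p = [
    fisher_exact_two_sided(aa, n_aa - aa, ea, n_ea - ea)
    for aa, ea in (mutated[g] for g in genes)
]
fdr = dict(zip(genes, benjamini_hochberg(raw_p)))
```

Each gene contributes one 2x2 table (mutated versus not mutated, AA versus EA), and the adjustment is applied once across all genes tested.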

| Mutational signatures
Mutational signatures within the LUAD and LUSC cohorts were detected by applying a Bayesian variant of the non-negative matrix factorization (NMF) algorithm described in . NMF estimates two matrices, W and H, which represent the mutational signatures and their occurrence patterns in each patient. The K* detected signatures, where K* denotes the optimal number of signatures chosen by the algorithm, were compared and matched to previously published COSMIC signatures using cosine similarity. After W and H had been estimated, the cohort was separated into EA and AA, and the activity of each signature was compared between the two groups using a Wilcoxon test. Linear regression was used to check for associations of signatures with clinical features in each group. Propensity score matching was used to control for potential confounders.
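Matching an extracted signature to a reference catalogue by cosine similarity can be sketched as below. This is an illustration only: real signatures are vectors over 96 trinucleotide mutation contexts, whereas the toy vectors here have four entries, and the reference names and values are hypothetical rather than actual COSMIC profiles.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equally long non-negative vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def best_match(signature: Sequence[float],
               reference: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the reference signature with the highest cosine similarity."""
    scored = ((name, cosine_similarity(signature, ref))
              for name, ref in reference.items())
    return max(scored, key=lambda kv: kv[1])


# Toy reference catalogue and one extracted column of W (hypothetical values).
reference = {
    "REF_1": [0.70, 0.10, 0.10, 0.10],
    "REF_2": [0.10, 0.10, 0.10, 0.70],
}
extracted = [0.65, 0.15, 0.10, 0.10]

name, similarity = best_match(extracted, reference)
```

A similarity close to 1 indicates that the extracted signature has nearly the same shape as the reference, independent of the total number of mutations.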

| HLA LOH and immune deconvolution
The LOH (loss of heterozygosity) status for the TCGA cohort was collected by running LOHHLA, a computational tool to determine HLA allele-specific copy number from sequencing data (McGranahan et al., 2017). LOH can occur in different ways: Either all class I HLA alleles A, B, and C are lost, any of these alleles is lost, or none is lost.
After assigning LOH status for each patient, we used Fisher's exact test to compare whether all or any of the three mentioned HLA alleles are lost or whether none of them is lost.
Immune signatures as described by Danaher et al. (2017) were used to compare immune scores among EA and AA individuals.
FIGURE 2 Comparison of age at initial diagnosis in LUAD (a) and LUSC (b) and tobacco exposure measured by pack-years for LUAD (c) and LUSC (d). AA patients seem to be younger when diagnosed with cancer compared to EA individuals in LUAD (t test, p = .00013). Furthermore, EA patients with LUAD have on average a higher tobacco exposure (Wilcoxon test, p = .02). The results are similar after splitting the cohort into current and reformed smokers. For LUSC, there were no significant results.

Danaher et al. (2017) suggest calculating immune scores as follows. Assuming each marker gene for a certain cell type j is present at a fixed but unknown number of cells $c_j$, the average log-transformed expression of the marker genes in that cell type is equal to the log-transformed abundance of the cell type, plus an unknown constant. Let $x_{ij}$ be the expression value of marker gene i and let $n_j$ be the number of marker genes in cell type j. The cell type score for cell type j can then be obtained as follows:

$$\log c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \log x_{ij} + \kappa_j$$

where $\kappa_j$ is an unknown constant (Danaher et al., 2017).
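The cell type score, that is, the average log-transformed expression of a cell type's marker genes, can be sketched as follows. The marker gene names and expression values are hypothetical, and base-2 logarithms are an assumption, since the text does not state the base. The unknown constant is left out because it cancels when the same cell type is compared between groups.

```python
from math import log2
from typing import Dict, Sequence


def cell_type_score(expression: Dict[str, float],
                    marker_genes: Sequence[str]) -> float:
    """Mean log2 expression of the marker genes of one cell type."""
    return sum(log2(expression[g]) for g in marker_genes) / len(marker_genes)


# Hypothetical normalized expression values for one tumor sample.
sample_expression = {"MARKER_A": 4.0, "MARKER_B": 16.0, "OTHER_GENE": 100.0}
score = cell_type_score(sample_expression, ["MARKER_A", "MARKER_B"])
```

Because the score is on a log scale, a difference of one unit between two samples corresponds to a two-fold difference in estimated cell type abundance under this model.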

| Estimating the genetic ancestries in the TCGA LUAD and LUSC cohort
To explore the underlying germline ancestry of each TCGA patient, ADMIXTURE was applied (Liu et al., 2013), using the 1KGP populations as reference. To confirm the ADMIXTURE results, a principal component analysis (PCA) of the TCGA and 1KGP germline data was performed. Reassuringly, the analysis revealed strong concordance between samples grouped by ancestral genotype (Figure 1b and Figure S1).

| AAs with LUAD exhibit different clinical features compared to EA
Intriguingly, clinical features such as age at initial diagnosis and pack-years differed significantly between AA and EA patients with LUAD, but not with LUSC (Figure 2). AA patients with LUAD were younger at diagnosis (p = .00013, t test for age), and, on average, they also smoked less than EA individuals (p = .02, Wilcoxon test for pack-years). Stratifying the groups by gender gave the same result for age: AA male and female patients with LUAD were typically younger than EA male and female patients, whereas no difference was detectable in LUSC (Figure 3).
To account for a possible relationship between age and pack-years in LUAD, we used a linear model and found no evidence for such an association (p = .22, effect size = 0.191). The same results were observed when patients were divided into groups according to smoking history, as shown in Figure 3. Here, p-values and significance levels should be treated with caution, because the small number of patients within subgroups reduces statistical power.
Linear regression models of age against ancestry and of pack-years against ancestry support the hypothesis that, in LUAD, EA patients are older at diagnosis and have a higher tobacco exposure than AA patients. Interestingly, no significant associations were observed in LUSC for either age or pack-years.

| Differences in mutation burden and selection of certain mutations may lead to distinct evolutionary tumor progress
Given the younger age and lower smoking exposure in AA compared to EA LUAD patients, we next asked whether we could observe differences in the somatic landscape of LUAD tumors between these two groups.

| Differences in mutational signatures between EA and AA
To investigate the mutational processes underpinning the increased burden in LUAD patients, we applied Bayesian non-negative matrix factorization (NMF) to the LUAD cohort. In total, four mutational signatures were identified (Figure 4a), all of which corresponded to previously identified signatures. They could be linked to COSMIC Signature 4 (CS4, cosine similarity 0.96), a signature linked to tobacco exposure; CS1 (cosine similarity 0.911), associated with spontaneous deamination of methylated cytosines and thought to correlate with patient age; CS2 (cosine similarity 0.854), linked to APOBEC-mediated mutagenesis; and CS5 (cosine similarity 0.899), whose etiology is unknown but which has been found to correlate with patient age (Alexandrov et al., 2013, 2015, 2020). As expected, we observed a significantly higher contribution of CS1 and CS5 mutations in EA compared to AA LUADs, consistent with the older age of EA patients. Strikingly, however, although AA patients smoked less, their tumors harbored a significantly higher burden of CS4 mutations. Conversely, CS2, which has been linked to APOBEC-mediated mutagenesis, was found to be significantly lower in the AA cohort, with no patients exhibiting a dominant APOBEC signature (>25% of mutations) (Figure 4).
To evaluate whether these differences are influenced by patient age, we applied 3:1 propensity score matching, meaning that each AA individual was matched to three EA individuals. Interestingly, when adjusting for pack-years, the difference in CS4 mutations remained, whereas when adjusting for age, it almost completely disappeared. When both age and pack-years were used to calculate the propensity scores, the difference also almost completely disappeared (Figure 4). This suggests that the differences in mutational signatures observed between AA and EA may be linked to patient age, given that AA LUAD patients tend to be significantly younger than EA LUAD patients. However, when we applied the same procedure to CS2, calculating propensity scores from age and APOBEC3B expression, the significant difference remained (Figure 4). No clear differences in mutational signatures were observed between AA and EA LUSC tumors (Figure S3).
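The 3:1 matching step can be sketched as below. The text does not specify the matching algorithm, so greedy nearest-neighbour matching without replacement is assumed here purely for illustration. The propensity scores are taken as given (in practice they would come from a model such as a logistic regression of ancestry group on age and pack-years), and all identifiers and values are hypothetical.

```python
from typing import Dict, List


def match_controls(treated: Dict[str, float],
                   controls: Dict[str, float],
                   ratio: int = 3) -> Dict[str, List[str]]:
    """Greedily match each treated unit to its `ratio` nearest controls.

    Both inputs map a patient ID to a precomputed propensity score.
    Controls are used at most once (matching without replacement).
    """
    available = dict(controls)
    matches: Dict[str, List[str]] = {}
    for tid, score in sorted(treated.items(), key=lambda kv: kv[1]):
        nearest = sorted(available,
                         key=lambda cid: abs(available[cid] - score))[:ratio]
        matches[tid] = nearest
        for cid in nearest:
            del available[cid]
    return matches


# Hypothetical propensity scores for AA patients and EA candidates.
aa_scores = {"AA1": 0.30, "AA2": 0.70}
ea_scores = {"EA1": 0.28, "EA2": 0.33, "EA3": 0.35, "EA4": 0.65,
             "EA5": 0.72, "EA6": 0.80, "EA7": 0.10}

matched = match_controls(aa_scores, ea_scores, ratio=3)
```

After matching, the signature activities of the AA patients are compared only with those of their matched EA patients, so that the covariates used for the propensity score are balanced between the groups.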
Taken together, these data suggest that the differences in mutational burden between AA and EA LUADs may reflect differences in the impact of the mutational processes sculpting the cancer genome. However, some of these differences may be driven by differences in patient age.

| Immune infiltration is different among EA and AA in LUAD
Finally, given the higher burden of mutations, and thereby also potentially neoantigens (Rooney, Shukla, Wu, Getz, & Hacohen, 2015), in AA LUAD tumors, we considered whether AA LUADs show differences in the loss of HLA alleles. Comparison of HLA LOH revealed that AA LUADs were more likely to exhibit HLA LOH (Figure 5), although this was not statistically significant (p = .066). Conversely, EA LUAD tumors were found to exhibit more signatures of immune infiltration, with five out of 16 immune signatures measured found to be lower in AA compared to EA LUADs.

| DISCUSSION
We used germline and somatic sequencing data from TCGA to investigate the influence of germline ancestry on the somatic evolution of NSCLCs. Each NSCLC patient was assigned to one of five super-populations (AA, EA, SAS, EAS, and AMR), with EA (448 LUAD and 450 LUSC) and AA (36 LUAD and 19 LUSC) being the major groups. EA LUAD patients were found to be significantly older and exhibited a significantly distinct smoking history, with a significantly higher average pack-year count. These results are in keeping with significant differences in smoking history reported by .

FIGURE 4 In CS4 (smoking signature) (d, h), AA individuals harbored a significantly higher burden of CS4 mutations despite smoking less. As shown in (c), CS2, the APOBEC signature, was found to be significantly lower in the AA cohort. Panels (b, f) and (e, i) show the EA group having a higher contribution of the aging signatures CS1 and CS5, as expected. After applying propensity score matching (PSM) considering age and tobacco exposure in CS4, the difference completely disappears. When applying PSM to CS2 considering age and APOBEC3B expression, the difference still remains.
Interestingly, despite being significantly younger and smoking significantly less, on average, AA LUADs harbored a significantly elevated mutation burden compared to their EA counterparts.
This remained significant when restricting the analysis to nonsilent mutations. Focusing specifically on driver alterations, a subset of genes were found to be more likely to be mutated in AA compared to EA LUADs. This suggests that germline ancestry may influence the selection for subsequent somatic alterations.
However, consistent with previous studies, no clear differences were observed with regard to targetable alterations. Nevertheless, it is worth considering that the higher burden of mutations in AA LUADs may render these tumors particularly susceptible to immune checkpoint blockade (Rizvi et al., 2015).

CONFLICT OF INTEREST
None declared.