Identification of a DNA damage repair gene‐related signature for lung squamous cell carcinoma prognosis

Abstract Background DNA damage repair (DDR) plays a role in the tumorigenesis and progression of lung squamous cell carcinoma (LUSC), but the predictive value of DDR in LUSC has not been fully elucidated. Methods The LUSC datasets were retrieved from the Cancer Genome Atlas databases. Univariate Cox regression and least absolute shrinkage and selection operator regression were integrated to identify critical genes and construct a DDR gene signature. We performed Kaplan–Meier (KM) curve to compare the overall survival (OS) between the two groups based on DDR signature and used the CIBERSORT tool to compare the immune cell composition. Further gene set enrichment analysis (GSEA) was performed on the differential expressed genes. Result We established the DDR‐related gene signature on LUSC. KM curve showed the low‐risk group had a better prognosis than the high‐risk group in the training set (p = 0.022673) and the complete set (p = 0.003201). The area under receiver operating characteristic curve for OS was 0.98, 0.96, and 0.97 in the training dataset, testing dataset, and the complete dataset, respectively. The composition of immune cells was different between the high‐ and low‐risk group. The GSEA result suggests that genes of the patients in low‐risk group were mainly enriched in the DNA adducts; drug metabolism‐cytochrome P450, metabolism of xenobiotics by cytochrome P450. Conclusion This study identified DDR‐associated potential biomarkers related to overall survival of LUSC and establishes the DDR‐associated gene signature.


INTRODUCTION
Lung cancer remains the precedent cause of cancer death, with $1.8 million deaths in 2020. 1 Around 85% of lung cancer cases correspond to non-small-cell lung cancer (NSCLC), 2 including lung adenocarcinomas (LUAD) and lung squamous cell carcinoma (LUSC). More than 60% of LUAD can be identified by driving mutations, epidermal growth factor receptor (EGFR), ALK, and other driver gene mutations, which significantly improved survival in patients with LUAD; however, for LUSC, accounting for one-third of NSCLC, the driver genes such as EGFR mutations and ALK gene rearrangements are rarely detected. 3 However, increased EGFR gene copy and protein overexpression are more common in LUSC than in LUAD, 4 and LUSC has a high total mutation rate and significant genetic complexity. 5 LUSC has some apparent driver gene mutations, including PIK3CA, AKT, FGFR1, 6 but there is no related effective treatment.
Widespread intratumor heterogeneity can be detected in lung cancer for both somatic mutations and copy-number alterations. 7 Driver mutations that occurred later in evolution were almost always clonal or subclone and involved in chromatin modification and DNA damage response and repair. Bin Jia, Ting Gong, Bingsheng Sun contributed equally to this work. DNA damage response is crucial to prevent the accumulation of DNA lesions and mutations that may promote carcinogenesis. It can repair the damaged DNA by one or several pathways: base excision repair (BER); homologous recombination (HR); direct repair (DR); nucleotide excision repair (NER); mismatch repair (MMR) or non-homologous end joining (NHEJ). 8 When DNA damage repair (DDR) pathway is impaired, cytoplasmic DNA gathers, and DNA replication pressure increases, which not only leads to genomic instability, but also induces the release of validation-related factors and activates the innate immune response by activating the cyclic GMP-AMP synthase-stimulator of interferon genes (cGAS-STING) and/or nuclear factor-κB (NF-κB) pathway. 9 In this study, the Cancer Genome Atlas (TCGA) (https:// portal.gdc.cancer.gov/) database was involved in investigating the relationship between DDR genes and the LUSC prognosis and finally to establish DDR-associated prognostic signature. Here, we screened specific DDR genes that related to the prognosis of LUSC and classified the LUSC patients into high-and low-risk groups based on the DDR-related genes expression. We confirmed the classification with Kaplan-Meier (KM) curve analysis and tSNE cluster and verified the prognostic signature with multiple receiver operating characteristic (ROC) curve. In addition, we explored the distribution and composition of immune cells in the tumor microenvironment and performed Gene Ontology (GO) enrichment, and gene set enrichment analysis (GSEA) with differential expressed DDR genes between high-and low-risk groups. Based on the above analysis, we demonstrate the critical role of specific DDR gene in LUSC prognosis, indicating the potential use of this signature.

Data collection from TCGA and GSEA
Lung squamous cell carcinoma RNA transcriptome dataset and their relevant clinical information were retrieved from the TCGA database. Genes were grouped into proteincoding genes based on the human genome annotation data. The list of DDR-associated gene set was obtained from the Molecular Signatures Database (https://www.gsea-msigdb. org/gsea/msigdb).

Identification of differentially expressed DDR genes in LUSC
We used R software (https://www.r-project.org/) for statistical analyses and data visualization. The limma program was used to examine the differential expression genes (DEGs) of the DDR gene sets between LUSC and normal tissue at p < 0.05 with a two-fold change. The heatmap figure was visualized with "pheatmap" package.

Risk model construction and validation
Univariate cox regression analysis and least absolute shrinkage and selection operator (LASSO)-penalized Cox regression analysis (quantile cut-off = 0.25) were used to identify the DDR-related prognostic signature. A p-value <0.05 in univariate Cox regression analysis was considered statistically significant, whereas DDR genes were considered eligible for LASSO regression analysis only if they showed significance in the Cox regression analysis. Functional analysis of those genes was performed, and their coefficients were determined through the minimum standard. We stratified LUSC patients into high-risk and low-risk groups with DDR associated gene-based risk score prediction model. The formula for calculating the risk score of each LUSC patient was as follows: risk score = (coefficient mRNA1 Â expression of messenger RNA1 [mRNA1]) + (coefficient mRNA2 Â expression of mRNA2) + …… + (coefficient mRNAn Â expression mRNAn). mRNA indicates the FPKM value of specific gene in TCGA. Patients were then grouped into two datasets, the training dataset and the testing dataset, with F I G U R E 1 Differentially expressed DDR genes (DDR DEGs) between LUSC and normal tissue. (a) Volcano plots of DDR DEGs. X-axis represents the fold change of gene expression and y-axis stands for adjusted p value. (b) Heatmap of DDR DEGs the ratio of 7:3. The ROC curve analysis was performed, respectively, in training dataset, testing dataset, and the complete dataset to value the accuracy of the DDR signature. Differences between the high-and low-risk groups were evaluated by the KM curve and log-rank test. p-Value <0.05 was considered statistically different.

Functional and pathway enrichment analysis
GO enrichment analysis includes enrichment in cellular components (CC), molecular function (MF), and biological processes (BP). 10 GO enrichment analysis and enrichment map of differential expressed DDR genes was performed by "clusterProfiler" package. GSEA was performed with GSEA with FDR q-value < 0.05. Differential regulated genes were analyzed for high-risk group compared to low-risk group.

Relationship between DDR signature and immune infiltration
CIBERSORT was used for appraising the percentage of different types of tumor-infiltrating immune cells. LM22 signature file was used to analyze the integrated immune cell types. 11 We used "ggplot2" package in R Studio to visualize the immune infiltration differential between high-risk and low-risk groups based on the prediction model. DEGs between LUSC and normal tissue were analyzed on 150 DDR genes. In the TCGA database, including 502 LUSC samples and 49 normal lung samples, 28 upregulated DDR genes and 10 downregulated DDR genes of LUSC were identified relative to normal tissue (Figure 1(a); Table S1).
Heatmaps show the levels of the differentially expressed genes in detail (Figure 1(b)).

Prognostic value of DDR-related gene signature
A worse prognosis of LUSC patients was observed with cluster 1 (high-risk group) than patients with cluster 2 (low-risk group). The patients' background data of the two risk groups were listed in Table 1. The KM curve analysis showed that the low-risk group had a better prognosis than the high-risk group in the training set (p = 0.022673) (Figure 3(a)) and complete set (p = 0.003201) (Figure 3(c)).
However, because of the small numbers in one cluster, the KM curve of testing set did not meet the significance level (Figure 3(b)). Further, the AUC, which is defined as the area under ROC curve, was 0.98, 0.96, and 0.97 in the training dataset, the testing dataset, and the complete dataset, respectively, implying that the DDR-related 10-genes signature has good accuracy in the prognostic prediction of LUSC (Figure 4(a)-(c)). We also inspected the prognostic value of the DDR model stratified by age. For both age <70 and age ≥70 groups, patients in the high-risk group tended to have a worse OS ( Figure 5(a),(b).

Relationship between the gene signature and immune cell
LUSC samples in the TCGA database showed a difference for all the 22 immune cell types between the high-and low-risk groups ( Figure 6). Cell types, of which compositions were slightly higher in high-risk group, were naive B cells, plasma cells, memory-resting CD4 + T cells, nd T cells, activated natual killer (NK) cells, monocytes, M1 macrophages, resting dendritic cells, activated dendritic cells, and neutrophils. The tumor microenvironment in the low-risk group was infiltrated with more memory B cells, CD8 + T cells, memory-activated CD4 + T cells, follicular helper T cells, regulatory T cells, resting NK, M0 macrophages, M2 macrophages, and resting mast cells.

Enrichment analyses of DEGs based on the clustering
Based on the results of DDR signature clustering, there are 58 downregulated genes and eight upregulated genes with fold change >2 in the high-risk group (Figure 7(a); Table S3). GO term enrichment analysis demonstrated that the DEGs were principally enriched in NADP + 1-oxidoreductase activity in molecular function (Figure 7 (b)). Cell component analysis indicated that genes were significantly enriched in cornified envelope (Figure 7(c)). Biological process analysis demonstrated that the genes were principally involved in the cellular hormone metabolic process (Figure 7(d)).
The potential pathways or functions of DDR-related genes signature were explored by performing a GSEA. As shown in Figure 8(a)-(d), we discovered that patients in the high-risk group relative to low-risk group were mainly involved in DNA adducts, drug metabolism-cytochrome P450, metabolism of xenobiotics by cytochrome P450, and pentose and glucuronate interconversions.

DISCUSSION
DNA damage occurs through exogenous and endogenous processes. Carcinogens can reduce the DNA damage through myriad mechanisms. As a pivotal mechanism to preserve genome stability and repair DNA lesions, deficiency in the DNA repair pathway might give rise to hypersensitivity to carcinogens and the accumulation of DNA lesions influencing tumor development, metastasis, and prognosis. 12 In this study, we systematically investigate the role of the DDR gene in the prognosis of LUSC. By conducting a univariate Cox regression screen and performing LASSOpenalized regression analysis, we successfully formed a 10 DDR genes signature. The KM curve showed that this DDR signature could stratify patients' OS efficiently. ROC analysis results verified that the DDR signature had a high accuracy in predicting the OS of LUSC and had a good prognostic value.
Previous studies suggested that germline mutations in DNA repair genes increase the predisposition to lung cancer. 13 Supporting this, it has been shown that in $2.5% of all cancer, a germline mutation in a DNA repair gene was associated with cancer development, including LUSC. 14 The other way round, LUSC generally exhibits relatively a higher somatic mutation frequency in contrast with different tumor types. 15 A high proportion occurred in genes that are involved in the maintenance of genome integrity through chromatin modification and DNA damage response and repair and specific DDR gene alterations tend to associate with worse progression-free survival to initial chemotherapy. 16 For diagnosis and treatment, it is necessary to investigate further the predictive value resulting in the development of LUSC.
Among these 10 DDR genes in the predictive model, POLB, POLR2H, and BRF2 exhibit the highest coefficient (coefficient >2/<À2). POLB is one of the components of BER. PARP1 has been a popular target for targeted therapies in the past few years, of which role in BER is detecting single-strand breaks. PARP1 works as a signal for the repair machinery, recruiting the scaffold protein XRCC1, LIG3, and POLB, 17 to complete repair. The carrier status for the F I G U R E 8 GSEA analysis of the enrichment pathways for DEGs between high-and low-risk groups. In ascending order of enrichment score, the top 4 potential pathways are (a) chemical carcinogenesis-DNA adducts, (b) drug metabolism cytochrome P450, (c) metabolism of xenobiotics by cytochrome P450 and (d) pentose and glucuronate interconversions POLB rs3136797 germline mutation is associated with a worse prognosis for lung cancer and demonstrates poor sensitivity to cisplatin treatment. 18 POLR2H, the necessary subunit of RNA polymerase II, was essential for the transcription of DNA. 19 Du et al. 20 indicated that POLR2H expression correlates with the occurrence and progression of prostate cancer. TFIIIB is a known target of regulation by oncogenes and tumor suppressors. TFIIIB-mediated transcription is downregulated in a variety of cancers. BRF2, a component of TFIIIB required for gene external RNA pol III transcriptions, was identified as an oncogene in LUSC through integrative genomic analysis. 21 The related study of BRF2 was extremely rare. One study used the Oncomine database to study the expression of BF2 in tumors in a subset of the patient samples and showed that BRF2 is both over-and under-expressed in lung cancer.
We also investigated the correlation between DDR signature and the clinical factors in present study. In our study, DDR signature independently predict the OS of LUSC without the need to consider whether it is advanced age. In addition, the GSEA result revealed that patients in the high-risk group were negatively enriched in DNA adducts, drug metabolismcytochrome P450, metabolism of xenobiotics by cytochrome P450, and pentose and glucuronate interconversions.
It is reported that carcinogens may fall into two categories 22 : activation-dependent (e.g., polycyclic aromatic hydrocarbons) and activation-independent (e.g., ultraviolet and ionizing radiation). 23,24 Activation-dependent carcinogens require metabolic activation or molecular modification in host cells to transform them into reactive intermediates or carcinogenic metabolites. The reactions, including oxidation, reduction, or hydrolysis, mainly involve cytochrome P450 (CYP) mixed function oxidase isoforms. Bulky chemical adducts are commonly formed as a result of the interaction between activated carcinogens and DNA. Accordingly, our prognostic signature classified by 10 DDR genes showed that compared to low-risk group, the DEGs of high-risk group enrich in chemical adduct and P450 related pathway, implicating relatively an indirect DNA damage correlation.

CONCLUSION
In conclusion, our study identified DDR-related signature that could be involved in the prognosis of LUSC. To our knowledge, this is the first study that developed a DDRassociated risk model in LUSC. In addition, our study also investigated the potential relationship between DDR and clinical data, and DNA damage repair and immune cells. The findings of this study provide new insights to understanding the role of DNA damage repair in LUSC and the prediction obtained from the bioinformatics analysis can be verified by future experimental studies.