Systematic construction and validation of an immune prognostic model for lung adenocarcinoma

Abstract Lung adenocarcinoma (LUAD), the most common non‐small‐cell lung cancer, is characterized by a dense lymphocytic infiltrate, which indicates that the immune system plays an active role in the development and growth of this cancer. However, no investigations to date have proposed robust models for predicting survival outcome for patients with LUAD in terms of tumour immunology. A total of 761 LUAD patients were included in this study, in which the database of The Cancer Genome Atlas (TCGA) was utilized for discovery, and the Gene Expression Omnibus (GEO) database was utilized for validation. Bioinformatics analysis and R language tools were utilized to construct an immune prognostic model and annotate biological functions. Lung adenocarcinoma showed a weakened immune phenotype compared with adjacent normal tissues. Immune‐related gene sets were profiled, an immune prognostic model based on 2 immune genes (ANLN and F2) was developed with the TCGA database to distinguish cases as having a low or high risk of unfavourable prognosis, and the model was verified with the GEO database. The model was prognostically significant in stratified cohorts, including stage I‐II, stage III‐IV and epidermal growth factor receptor (EGFR) mutant subsets, and was considered to be an independent prognostic factor for LUAD. Furthermore, the low‐ and high‐risk groups showed marked differences in tumour‐infiltrating leucocytes, tumour mutation burden, aneuploidy and PD‐L1 expression. In conclusion, an immune prognostic model was proposed for LUAD that is capable of independently identifying patients at high risk for poor survival, suggesting a relationship between local immune status and prognosis.


| INTRODUC TI ON
Lung cancer is one of the most common causes of cancer deaths worldwide. 1,2 For treatment purposes, lung cancers are categorized as either non-small-cell lung cancer (NSCLC) or small cell lung cancer. Non-small-cell lung cancer comprises approximately 85% of all lung cancers, with lung adenocarcinoma (LUAD) being the most frequently diagnosed histological subtype of NSCLC, followed by squamous cell carcinoma. The high morbidity rate of lung cancer is due to tobacco smoking, genetic alteration, outdoor pollution, indoor air pollution and other factors. [3][4][5] Because LUAD is prone to metastasis at early stages, the prognosis for LUAD patients is usually poor, with an average 5-year survival rate of less than 20%. 6 Although recent progress in targeted therapy and molecular pathology has enhanced clinical therapy, the 5-year overall survival (OS) rate of LUAD patients remains low. 7,8 Hence, further understanding of the molecular mechanisms underlying tumorigenesis and progression in LUAD may enhance the overall prognosis and treatment of this disease.
The immune evasion strategy utilized by tumour cells for evading host immune responses and maximizing the possibility for continued growth is a hallmark of cancer. 9 Immune disorders in cancer are considered to promote oncogenesis and development. 10 Immune responses stimulated by cancer antigens, which should trigger the elimination of cancer cells, can be suppressed to offer an appropriate microenvironment for cancer growth. 10 Enormous efforts have been directed at understanding the interaction between the immune system and tumours, and significant success has been achieved in the form of tumour immune therapy to advance tumour treatment; however, this approach can be applied in only a subset of patients, as other patients either fail to respond or are unsuitable. 11,12 Cancer immunotherapy has attracted considerable attention in recent years because the development of immune checkpoint blockade therapy can achieve durable, long-term responses in refractory malignancies, including lung cancers. 13,14 The clinical development of immune checkpoint inhibitors in NSCLC began in patients being treated for metastatic diseases. 15,16 At present, three immune checkpoint inhibitors have received FDA approval as second-line NSCLC treatments (atezolizumab, pembrolizumab and nivolumab). 17 These agents were also authorized in the European Union. Atezolizumab is an immune checkpoint inhibitor that targets programmed cell death ligand 1 (PD-L1), while pembrolizumab and nivolumab target programmed cell death 1 (PD-1). 18 The immune response in the tumour microenvironment is now recognized as a significant factor that determines tumour aggressiveness and progression, as well as responsiveness to immune-modulating agents. The densities and types of tumour-infiltrating immune cells, as well as their expression of cytokines and immune genes, have been extensively studied as prognostic biomarkers in lung cancers. [19][20][21] Certain histopathologic patterns, such as intratumoral infiltration by cytotoxic lymphocytes, have also been linked to better responses in LUAD. Nonetheless, the molecular features illustrating interactions between the immune system and cancer remain to be fully explored in terms of their prognostic potential in LUAD.
In this study, multiple gene expression datasets were combined   to develop and validate an individualized immune prognostic model   for LUAD on the basis of immune genes. To leverage the complementary value of clinical and molecular features, the immune prognostic   model was combined with clinical features to build a composite prog-nostic nomogram, enabling improved estimation of LUAD prognosis.

| Gene expression data and clinical information
Level 3 raw counts of the RNA-seq data, tumour mutation burden, aneuploidy scores and corresponding clinical data from a total of 535 patients with LUAD were acquired from the data portal of The Cancer Genome Atlas (TCGA) as of 16 January 2019. Clinical parameters, including gender, age and pathological stage, were also evaluated. Transcriptome profiling data of 226 patients with LUAD in the GSE31210 dataset from the GEO database were used for validation. 22

| Identification of differentially expressed mRNAs (DEGs) in LUAD and adjacent normal tissues
To identify DEGs between adjacent normal tissues and LUAD, we performed differential expression analysis using the edgeR package (version: 3.26.5). 23 The thresholds for screening DEGs were |log 2 FC (fold-change) | > 2 and P < .01.

| Gene-set enrichment analysis
Gene-set enrichment analysis (GSEA) was conducted to explore whether immune functions, and the corresponding immune genes were significantly different between adjacent normal tissues and LUAD samples. 24

| Immune prognostic model construction and validation
First, we normalized the RNA-seq expression value of the immune genes using log 2 transformation. Then, we performed univariate Cox regression analysis to determine the relationship between patient survival and immune gene expression. Immune genes with P < .01 were selected for least absolute shrinkage and selection operator (LASSO) Cox regression analysis. Next, using multivariate Cox regression analysis based on the immune genes obtained from LASSO Cox regression analysis, in which we required selected genes to appear more than 990 times out of 1000 repetitions in total, we built a formula to predict the risk score of each patient. X-tile software (version: 3.6.1) was used to identify the optimum cut-off value for dividing patients into low-and high-risk groups according to the highest χ 2 value defined in the Mantel-Cox test. 25,26 Afterwards, we conducted a log rank test to determine the difference between the low-and high-risk groups. We plotted a Kaplan-Meier OS curve for both groups and calculated the hazard rate (HR). Additionally, we conducted Cox multivariate analysis to test whether the immune prognostic model was independent of clinical characteristics, including age, gender and pathologic stage, and we measured prognostic ability by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC) curve.

| The evaluation of immune cells in LUAD
CIBERSORT, a deconvolution algorithm based on normalized gene expression profiles, can quantify the immune cell composition and has greatly expanded the potential of the genomic database. 27 Because CIBERSORT is superior to other methods, it has received increasing attention and has been successfully used to assess the composition of immune cells in liver and breast cancer. 28,29 We utilized CIBERSORT to assess the composition of 22 immune cells in the TCGA and GEO LUAD cohorts.

| Nomogram development and validation for prognostic risk prediction
To offer quantitative methods to clinicians to predict the 1-, 3-, and

| Differentially expressed mRNAs in patients with LUAD
Analyses of mRNA expression profiles between adjacent normal tissues and LUAD tissues identified 5774 DEGs in total ( Figure 1A).

| A weakened immune phenotype in LUAD
Considering the different immune status between normal lung samples and LUAD samples, we initially utilized RNA-seq data for 5774 DEGs from a total of 535 LUAD patients from the TCGA database to identify different immune biological processes and genes. Gene-set enrichment analysis analysis indicated that LUAD was significantly negatively related to 12 immune biological processes, indicating a weakened local immune response in the LUAD microenvironment ( Figure 1C; Table S2). Among the 5774 DEGs, 353 immune genes were enriched in those 12 biological immune processes (Table S3).
To validate the relationship between LUAD and immune biological processes, we extracted 353 immune genes for subsequent survival analysis.

| Establishment and evaluation of the immune prognostic model with the training dataset
To identify immune genes related to the survival of LUAD patients, univariate Cox regression analysis of 353 immune genes was performed. We selected a set of 113 immune genes at a 0.05 significance threshold (Table S4). The 113 immune genes were subjected to LASSO Cox regression analysis, and 2 immune genes were identified ( Figure 1D,E). We then performed multivariate Cox regression analysis to establish an immune prognostic model for patients with LUAD on the basis of gene expression levels, as follows: risk value = (0.2518 × ANLN expression) + (0.0879 × F2 expression).
The risk scores for patients were calculated, and patients were categorized as low risk or high risk according to the optimal cut-off.
Low-risk patients had a longer OS than high-risk patients (P < .001; HR = 2.26; 95% confidence interval [CI] = 1.62-3.14; Figure 2A). The expression of the two immune genes and the distribution of risk scores for each patient were also analysed ( Figure 2B). The gene expression levels of ANLN and F2 were significantly correlated with the risk scores, and these genes were significantly highly expressed in the high-risk group in the TCGA LUAD cohorts ( Figure 2C-F).
The ROC curves showed that the AUCs of the immune prognos- it ranges from 0.5 to 1 and is equal to the AUC. 33 The higher the value of the C-index, the better the predictability of the model. The

| Validation and evaluation of the immune prognostic model with the validation dataset
To confirm the robustness of the immune prognostic model, the validation dataset (n = 226) was utilized for further validation analysis.
In the validation dataset, we classified all patients into high-risk and low-risk groups using the same equation based on the optimal cutoff value. Consistent with the results of the training dataset, patients with low-risk scores had markedly longer OS than high-risk patients (P < .001; HR = 2.98; 95% CI = 1.45-6.1; Figure 2H). Figure 2I Figure 2N). in EGFR mutant LUAD who may receive EGFR-TKI as adjuvant therapy ( Figure 2Q).

| Tumour immune landscape and genomic association
To further explore the correlation between the immune prognostic model and the immune response, B7 family-related metagenes, as well as seven previously studied inflammatory genes, were taken into consideration. 34,35 The B7 family, interferon and STAT1 were found to be positively correlated with the risk score, while HCK, IgG, LCK, MHC-I and MHC-II were negatively correlated with the risk score ( Figure 3A). In addition, the Microenvironment Cell Populationscounter approach was used to assess the relationship between immune cell populations and risk score. 36 We found that the immune landscape was obviously different between low-and high-risk patients ( Figure 3B). Patients in the low-risk group had a significantly higher proportion of B lineage cells, endothelial cells, myeloid dendritic cells, neutrophils and T cells, as well as a markedly lower proportion of NK cells, compared with the high-risk group ( Figure 3C).

PD-L1 (CD274) expression is a biomarker for selecting NSCLC
patients for pembrolizumab treatment. 37,38 Therefore, PD-L1 expression was investigated in patients stratified by the immune prognostic model. High-risk patients had markedly elevated PD-L1 expression compared with low-risk patients (P < .001) and may respond better and have better outcome when receiving pembrolizumab ( Figure 3D).
Tumour mutation burden (TMB), meaning all the somatic missense mutations in a baseline tumour sample, serves as a predictor for predicting the efficacy of nivolumab. 39 Patients with a high TMB have a higher response rate and favourable progression-free survival when receiving nivolumab treatment. 39 The TMB of patients stratified by the immune prognostic model was therefore investigated.
The t test demonstrated a significant difference between the lowrisk and high-risk groups (P < .0001; Figure 3E).
Aneuploidy, also known as somatic cell copy number alteration (SCNA), has been proposed to drive oncogenesis and is widely found in human cancers. Aneuploidy is associated with reduced cytotoxic immune infiltration and tumour cell proliferation. High SCNA levels in cancers are associated with unfavourable survival in melanoma patients, and cancer SCNA scores are good predictors of survival after immunotherapy. Therefore, the aneuploidy of the patients with LUAD stratified by the immune prognostic model was investigated.

F I G U R E 3
Immune profile related to the immune prognostic model. A, Association between risk score and immune response. B, Associations between risk score and immune cell populations. C, Relative proportion of immune cell expression in the high-and low-risk groups. D, The distribution of programmed cell death ligand 1 (PD-L1) expression in the high-and low-risk groups. E, The distribution of tumour mutation burden in the high-and low-risk groups. F, The distribution of aneuploidy scores in the high-and low-risk groups

| The relationships of ANLN and F2 with immune cells
To explore which immune cells are associated with ANLN and F2, correlation analyses between immune cells and ANLN or F2 were performed. First, CIBERSORT was applied to assess the relative proportions of 22 immune cells, presenting a comprehensive immune cell landscape of LUAD. 27 Figure 4A,B shows the composition of 22 immune cells in LUAD in the TCGA and GEO LUAD cohorts. Then, correlation analyses between immune cells and ANLN or F2 were conducted. The immune cells with correlation coefficients >.4 and P values <.05 in both the TCGA and GEO LUAD cohorts were considered to be associated with the investigated genes ( Figure 4C).

As a result, ANLN was associated with three immune cells (T cells
CD4 memory activated, T cells regulatory (Tregs) and neutrophils; Figure 4D,E), and F2 was associated with three immune cells (Tregs, mast cells activated and neutrophils; Figure 4F,G).

| Identification of immune prognostic modelrelated biological processes and pathways
To identify pathways underlying the immune prognostic model, differential expression analysis was performed on 353 immune genes between the low-and high-risk groups. Forty-five immune genes were highly expressed (|log 2 FC | > 2 and P < .01) in the high-risk groups, and 11 immune genes were highly expressed in the low-risk groups (|log 2 FC | > 2 and P < .01). Then, we performed functional enrichment analysis with the DAVID and KOBAS bioinformatics resources to explore the underlying biological function of these highly expressed genes in the high-and low-risk groups, respectively, revealing 143 enriched biological processes (Table S5) and 12 enriched pathways (Table S6) in the high-risk groups and 33 enriched biological processes (Table S7) and 39 enriched pathways (Table S8) in the low-risk groups (P < .05).
The top three biological processes were defence response, regulation of response to external stimulus and response to wounding in the high-risk groups ( Figure 5A; Table S5) and immune response, positive regulation of response to stimulus, and negative regulation of immune system process in the low-risk groups ( Figure 5C; Table S7). The top three pathways were alcoholism, systemic lupus erythematosus and viral carcinogenesis in the high-risk groups ( Figure 5B; Table S6) and regulation of lipolysis in adipocytes, prostate cancer and Ras signalling pathway in the low-risk groups ( Figure 5D; Table S8).

| Relationship between the immune prognostic model and clinical parameters or patient outcome
To further understand the relationship between the immune prognostic model and other clinical data, such as age, gender and pathologic stage, we performed univariate and multivariate Cox analyses, which revealed that the immune prognostic model serves as an independent factor for predicting the prognosis of LUAD patients ( Figure 6A).

| Development and validation of a predictive nomogram
To facilitate the use of our immune prognostic model, we established a nomogram for predicting LUAD prognosis based on the multivariate Cox analysis in the TCGA database. Our nomogram contained two predictive factors: pathologic stage and immune prognostic model ( Figure 6B). Each factor was assigned a score in accordance with the multivariate analysis. We derived the nomogram score in total from the sum of the individual scores of all predictive factors. A high total score was predictive of low 1-, 3and 5-year survival; however, a low total score showed the opposite pattern. The C-index of the established nomogram for OS prediction was 0.6621 (95% CI = 0.6182-0.7059), and a calibration plot demonstrated good agreement compared with the ideal model, indicating that our proposed nomogram has stability for predicting LUAD patient prognosis in clinical practice ( Figure 6C).
In addition, prediction accuracy was compared among the immune prognostic model, pathologic stage and the nomogram. The discrimination performance of the nomogram was superior to that of the immune prognostic model or pathologic stage ( Table 1). The AUC of the nomogram was also the largest ( Figure 6D). These findings show that compared with the individual prognostic factors, the nomogram is the optimal model for predicting the survival of LUAD patients.

| D ISCUSS I ON
Lung adenocarcinoma, which constitutes approximately 30%-40% of NSCLCs, is a global public health problem and the most Two immune genes predicted in our study were previously shown to function as potential biomarkers. The anillin actin-binding protein (ANLN) gene is located on chromosome 7p14.2 and encodes a 1124 amino acid protein that contains four structural domains, including a RhoA-binding domain, a C-terminal pleckstrin homology domain and an actin-and myosin-binding domain. 44,45 The ANLN protein localizes to the cytoplasm, nucleus, cell cortex, cleavage furrow and cytoskeleton and is expressed in the adult testis, placenta and spinal cord, as well as many foetal organs. 46 ANLN was originally characterized as the human homolog of anillin, a Drosophila actin-binding protein present in the cortex following breakdown of the nuclear envelope, as well as in the cleavage furrow during cytokinesis. 47 Anillin plays a significant role in cell cycle progression, as well as in the assembly of the actin and myosin contractile ring separating the daughter cells. 46 Importantly, anillin has been identified as a substrate of the anaphase-promoting complex/cyclosome (APC/C), which is a kind of ubiquitin ligase that controls mitotic progression. 48 Thus, anillin is a conserved protein that functions in cytoskeletal dynamics in cytokinesis and cellularization. Previous studies have shown that anillin knockdown results in ingression of the cleavage furrow and cytokinesis failure in multinucleated monkey BS-C-1 cells. 47 The relationship between carcinoma and cell cycle regulation is well known. ANLN has been shown to be a biomarker of unfavourable prognosis and is related to aggressive tumour phenotypes. 49 Anillin plays a regulatory role in the cell cycle and an important role in the invasion of pancreatic and breast cancers. 50,51 ANLN has been developed as a prognostic marker based on immunohistochemistry (IHC) and is clinically applicable to hepatocellular carcinoma. 52 Suzuki et al 53 explored the importance of ANLN in lung cancers using cDNA microarrays and found that the growth of NSCLC was inhibited by ANLN small interfering RNAs.
Moreover, the induction of exogenous ANLN overexpression enhanced the migratory capability of mammalian cells by interacting with RhoA. Univariate analysis of gene expression from 66 patients with squamous cell carcinoma found that ANLN had a significant prognostic ability, which was validated in an independent cohort study of 26 patients. 53,54 Our results suggested that ANLN RNA is overexpressed in patients with LUAD. Patients with higher ANLN expression have a poorer prognosis.
The coagulation factor II (F2) gene encodes human prothrombin and is located on the short arm of chromosome 11, at position 11.2. 55 The F2 gene has fourteen exons that span 21 kb, and the structural integrity of the gene is essential for viability. Mice lacking prothrombin die prematurely at the embryonic stage because of bleeding complications. 56 Single nucleotide polymorphisms (SNPs) in patients are usually related to moderate to serious bleeding phenotypes, while the G20210A mutation in the 3′ untranslated region of the F2 gene is a well-established risk factor for thrombophilia. 57 The interplay among cancer, thrombosis, and haemostasis is well known. 57 Patients with tumours have a higher risk of venous thromboembolism, and certain coagulation factors are related to the development of various types of cancers. 57 F2 has been shown to be overexpressed and to act as a hub gene in colorectal cancer liver metastasis. 58 In the present study, F2 was overexpressed in LUAD and was associated with poor prognosis in patients with LUAD. Our TA B L E 1 Comparison of predictive accuracy of the pathologic stage, immune prognostic model and nomogram Correlation analyses showed that ANLN and F2 were significantly related to Tregs and neutrophils in both the TCGA and GEO LUAD cohorts. Guo et al found that Tregs correlated with poor prognosis in LUAD, and Li et al found that the percentage of neutrophil infiltration was significantly higher in the high-risk immune group than in the low-risk groups in non-squamous NSCLC, indicating that Tregs and neutrophils were risky immune cells for LUAD, which is in line with our research results. 59,60 In our research, ANLN and F2 were risky immune genes related to Tregs and neutrophils.
Our study has a few limitations. Although our study has the advantage of using massive cohorts from the TCGA and GEO databases to construct and validate the immune prognostic model, the present study nevertheless features a retrospective design.
Hence, a prospective cohort is needed to validate our model. In addition, further functional studies are needed to explore the molecular functions of the two identified immune genes during LUAD progression.
To the best of our knowledge, our study is the first to identify and validate an immune prognostic model comprising two immune genes (ANLN and F2) in patients with LUAD, which is capable of serving as an independent prognostic marker for patients with LUAD, including early-stage LUAD. In addition, this model is capable of indicating the intensity of immune responses in the LUAD microenvironment and providing new clinical applications for LUAD, taking into account the immune target as well as immune-related treatment.

ACK N OWLED G EM ENTS
This study was sponsored by grants from the National Natural Science Foundation of China (Grant Nos. 81771781, 81872410).

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

AUTH O R CO NTR I B UTI O N S
CHL, GFP and YZ participated in the study concept and design. CHL and MYL helped with co-ordination and drafting of the manuscript. databases.