Peripheral blood transcriptome heterogeneity and prognostic potential in lung cancer revealed by RNA‐Seq

Abstract Understanding of the complex interaction between the peripheral immune system and lung cancer (LC) remains incomplete, limiting patient benefit. Here, we aimed to characterize the host peripheral immune response to LC and investigate its potential prognostic value. Bulk RNA‐sequencing data of peripheral blood leucocytes (PBLs) from healthy volunteers and LC patients (n = 142) were analysed for characterization of host systemic immunity in LC. We observed broad blood transcriptome perturbations in LC patients that were heterogeneous, as two new subtypes were established independent of histology. Functionally, the heterogeneity between the two subtypes included dysregulation of diverse biological processes, such as the cell cycle, blood coagulation and inflammatory signalling pathways, together with the abundance and activity of blood cells, particularly lymphocytes and neutrophils, ultimately manifesting as differences in antitumour immune status. Based on these findings, a prognostic model composed of ten genes dysregulated in one LC subtype with relatively poor immune status was developed and validated in a Gene Expression Omnibus (GEO) data set (n = 108), helping to generate a prognostic nomogram. Collectively, our study provides novel and comprehensive insight into the heterogeneity of the host peripheral immune response to LC. The expression heterogeneity–based predictive model may help guide prognostic management for LC patients.


| INTRODUC TI ON
The immune response is believed to play a crucial role in the occurrence and progression of cancers, including lung cancer (LC).
Past studies have pointed out that, as a highly heterogeneous disease with poor prognosis and high mortality, LC may be promoted via a reprogrammed immune microenvironment by contributing to inflammation, immune regulation and treatment response. 1,2 Therefore, a variety of diagnostic and prognostic biomarkers have been proposed, and targeted immunotherapy has been applied in clinical therapy based on components of the tumour microenvironment. [3][4][5] From another perspective, cancer is perceived as a systemic disorder. Immune cell abundance, genomic DNA methylation and serological characteristics have been found to be altered in the peripheral immune system in LC. [6][7][8] Nevertheless, a comprehensive and in-depth understanding of the impacts of LC on the host immune system is still lacking.
At present, treatment and prognosis prediction for LC patients are based mainly on the tumour-node-metastasis (TNM) staging system. 9 However, there are still considerable disparities in survival among patients with the same clinical characteristics. Researchers have tried to develop new prognostic auxiliary indicators to enhance the accuracy of LC prognosis, 3,10,11 but intratumour heterogeneity and sampling bias may limit the validity and reproducibility of prognostic markers, particularly for solid tumour specimens. Hence, reliable prognostic biomarkers are needed to guide adjuvant therapy.
As an accessible source of immune cells that migrate to and from tumour lesions, peripheral blood is exposed to tumour cells and host-secreted factors released into the peripheral circulation system and is suitable for evaluating the immune status of cancer patients.
Furthermore, genome-wide expression profile analysis based on high-throughput transcriptome sequencing (RNA-Seq) of peripheral blood provides an unbiased and in vitro operational approach to measure transcriptional responses in complex diseases, including cancer. 12,13 In this study, we generated transcriptional profiles of peripheral blood leucocytes (PBLs) from LC patients and healthy subjects by RNA-Seq. We characterized the transcription profile changes of PBLs in LC and parsed their heterogeneity for the first time.
Moreover, we developed a risk score (RS) model with signature genes of PBL subtypes as a good indicator to predict overall survival (OS) for LC patients.

| MATERIAL S AND ME THODS
See also Supplementary Methods.

| Collection of specimens
We recruited 69 healthy individuals and 73 LC patients and collected four millilitres of fresh peripheral blood from each enrolled individual before clinical treatment. The clinical characteristics of all subjects are summarized in Table S1. The study was reviewed and approved by the Ethics Committee of the Cancer Institute and Hospital of the CAMS and was carried out in accordance with the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research. 14 All subjects provided written informed consent.

| Generation and normalization of RNAsequencing data
We extracted total RNA from peripheral leucocytes. Eligible libraries were prepared from qualified samples using a NEBNext ® Ultra™ RNA Library Prep Kit (New England Biolabs, Ipswich, MA, UK) and sequenced on the Illumina HiSeq 4000 platform. Paired-end reads (150 bp) were mapped to the human reference genome (GRCh38) by Salmon. Transcript abundances were summarized at the gene level with tximport and were normalized based on transcripts-per-million (TPM).

| Differentially expressed gene identification and pathway enrichment analysis
The Bioconductor package DESeq2 was used to detect differentially expressed genes (DEGs) with a gene read count matrix. 15 Only genes with a Benjamini-Hochberg false discovery rate (FDR) < 0.05 and a |log2 fold change (FC)| ≥ 0.59 were defined as DEGs.
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed by the package ClusterProfiler and visualized with the package GOplot and Cytoscape software (v3.7.1) to functionally characterize the identified DEGs. 16,17 The Benjamini-Hochberg FDR threshold in each case was set at 0.05.

| Dimension reduction and cluster analysis
To explore PBL transcriptional profiling-based heterogeneity among samples, we performed t-distributed stochastic neighbour embedding (t-SNE) analysis by the package Rtsne using the first 50 principal components and 1000 iterations, combined with cluster analysis employed by the package mclust with default parameters using all genes. 18 Principal component analysis (PCA) of all genes and unsupervised hierarchical clustering analysis of the top 3000 most variable genes using the complete method and Pearson correlation distance were carried out by R to validate the gene expression profile heterogeneity between LC subtypes.

| Gene set enrichment analysis
To compare differences in the transcriptomes between LC subtypes, we implemented two enrichment methods for the gene expression For tmod, the CERNO test was employed. Only BTMs that were commonly significantly enriched (FDR < 5%) across the two methods were retained.

| Weighted gene correlation network analysis
In this study, two gene co-expression networks consisting of 86 samples (lung cancer set 1 (LC1) vs the normal group) and 125 samples (lung cancer set 2 (LC2) vs the normal group) were established by the package WGCNA. 21 The two adjacency matrixes were calculated with a beta of 8 or 10, and the minimum cluster size of the clustering dendrograms was 30. The association between eigengene values and clinical traits was assessed by Pearson's correlation, and key co-expression modules related to class differences were detected according to the correlation and significance P values. Genes in each key co-expression module were ranked by effect size as the mean expression of the LC class minus the expression of the normal class to generate pre-ranked gene lists for downstream GSEA with BTMs as gene sets.

| Construction and validation of a RS prognostic model
We downloaded a transcriptome data set of peripheral blood mononuclear cells (PBMCs) with clinical outcomes from NCBI's Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE13255, which contains 108 LC patients.
The normalized gene expression matrix transformed by log2 was applied to construct a RS prognostic model. A summary of the sample information is shown in Table S3.
All samples were randomly divided equally into training and testing sets. First, we identified a gene panel selected from LC2 DEGs with P < .05 in a univariate regression analysis of the training set (n = 54). Next, we introduced least absolute shrinkage and selection operator (LASSO) Cox regression analysis to the gene panel by the glmnet package. 22 The expected generalization error was estimated by 10-fold cross-validation, and the LASSO model consisting of n genes with non-zero regression coefficients was determined. Those coefficients (c) and the corresponding gene expression values (E) were used to calculate a RS for each patient, as shown below: An optimal cut-off value determined by performing timedependent receiver operating characteristic (ROC) analysis with the survival ROC package dichotomized patients into a high-risk group or a low-risk group. The Kaplan-Meier (KM) method with a log-rank test was used to perform time-to-event analysis to evaluate survival differences between the two groups. The testing set and stage I set were utilized to validate the prognostic ability of the predictive model.
Univariate and multivariate Cox proportional hazard regression analyses were performed to assess independent prognostic factors.
A novel nomogram was built based on the multivariable analyses and metric = − log 10 (p value) sign(log 2 FC) RS = c 1 E 1 + ⋯c n E n provided visualized risk stratification by the survival, foreign and rms packages. The discrimination of different predictors was appraised by Harrell's concordance index (C-index).

| Statistical analysis
The

| Broad blood transcriptome perturbations were evident in LC patients
To quantify sample differences in the PBL transcriptome between LC and healthy subjects, we carried out PCA across 142 subjects and found that the LC group was separated from the healthy group ( Figure 1A). We identified 1368 DEGs in the LC versus healthy groups to uncover the biological alterations of the peripheral immune system in response to LC ( Figure 1B, File S1). Biological processes, including the humoral immune response and regulation of lymphocyte activation, were enriched ( Figure 1C). Disturbed pathways identified previously in LC histological studies, such as arachidonic acid metabolism and transcriptional misregulation in cancer, were also observed ( Figure S1). 23,24 We next estimated the proportions of immune cells for each subject by deconvoluting the PBL RNA-Seq data to investigate changes at the cellular level (File S2). Nine major cell components are displayed ( Figure 1D). Changes in the abundance of monocytes,

NK cells, T cells and B cells in the patient group were observed,
which is in agreement with past research. 25 Similar changes were found in routine blood indicators that were significantly positively correlated with deconvolution analysis results ( Figure S2). In general, the PBL transcriptome of LC patients was widely disturbed and denoted broad aberrancies in innate and adaptive immunity.

| PBL transcriptional profiles of LC patients are heterogeneous
We next asked whether there is consistency in peripheral transcriptome perturbations among LC patients. We performed t-SNE and cluster analyses across all subjects based on the similarity of gene expression profiles (Figure 2A). The LC patients were distinct from normal subjects, in line with the PCA results ( Figure 1A). In particular, LC samples were clustered into two subsets defined as LC1 and LC2, both of which contained three LC histological subtypes ( Figure S3).
We also executed PCA and unsupervised hierarchical cluster analysis of the top 3000 most variable genes to verify between-group variance ( Figure 2B,C). Two clusters were delineated in unsupervised hierarchical cluster analysis, and the vast majority of samples in LC1 and LC2 were divided into different clusters. Detailed clinical traits of the two LC subsets are given in Table S2. Apparently, there were no significant differences in characteristics between the two subsets except for stage (Fisher's exact chi-square test, P = .004). Earlystage patients were relatively evenly distributed in the two subsets, but more advanced patients were classified as LC2.
In addition, we calculated the molecular distance for each patient to quantify global transcriptional changes in LC over the normal baseline ( Figure 2D). Notably, the molecular distance between In short, considerable heterogeneity of PBL transcriptional profiles independent of histological type was identified in LC. We next conducted analyses aiming to interpret the heterogeneity and its significance. Up-regulated BTMs in LC2 included myeloid cell signature enrichment, blood coagulation and inflammatory signalling pathways ( Figure 3A and Figure S4A). We next implemented WGCNA to evaluate class differences in the co-expression network module derived from LC-induced data ( Figure S5). Core modules related to disease states were found to be differentially enriched in the two subsets, most of which were consistent with the above enrichment analyses results ( Figure 3B, File S3). Furthermore, hallmark signatures related to the cell cycle, such as E2F targets, MYC targets and the G2M checkpoint, were salient in LC1. 26,27 LC2 highlighted the inflammatory response and cytokine-mediated signalling pathway ( Figure 3C and Figure S4B).

| Global alterations revealed the heterogeneity of the PBL transcriptome in LC patients
In view of the enrichment of immune cell signatures, we also focused on different cell abundances between the new LC subtypes.
Neutrophils and lymphocytes, especially NK cells and CD8 T cells, showed significant differences between LC1 and LC2 by deconvolution and routine blood analysis ( Figure 3D and Figure S6). Moreover, the estimated neutrophil-to-lymphocyte ratio of LC2, that is an indicator of the inflammation level, was significantly higher than that of LC1 (1.09 ± 0.55 vs 0.15 ± 0.10).
Overall, heterogeneity in the PBL transcriptome of LC patients involved cell cycle-related pathways, blood coagulation, inflammatory signalling pathways and PBL composition, which may play crucial roles in the LC subsets.

| Peripheral antitumour immune status is distinct between the LC subsets
Given the global abnormalities related to immunity of the PBL tran- Immunomodulators (IMs) are pivotal for modulation of the immune response level. To gain insight into the peripheral immune status, we next evaluated the expression of a list of IMs that may stimulate or inhibit the immune response to LC ( Figure 4C). 28 The results showed that the expression of IMs varied across the LC subsets.
In LC1, stimulatory IMs, such as CD70, CXCL9 and NK cell activation receptor-NKG2D (KLRK1), were up-regulated. 29 Comparatively, most immunosuppressive molecules, such as PD1 (PDCD1) and PDL1 (CD274), showed significant up-regulation, while stimulatory IMs, including NKG2D, tended to be down-regulated in LC2 cells. These findings underscored the possible peripheral immunosuppression in antitumour immunity of LC2 compared to LC1.  (Table S4 and Figure S8A,B,E). Most of the ten genes are expressed in monocytes or lymphoid cells and differentially expressed across the LC subsets in our study, suggesting possible differential outcomes between LC1 and LC2 (https://www.prote inatl as.org/) ( Figure S8F).

| Gene signatures of the PBL transcriptome have the capability of predicting LC survival
We calculated the RS for patients with expression values of the ten genes and corresponding regression coefficients and then divided the training set (n = 54) into high-risk and low-risk groups according to the optimum cut-off value (cut-off = −6.1; Figure S8C,D).
Time-dependent ROC analysis showed that the areas under the curve (AUCs) at 1, 2, 3 and 5 years were 0.897, 0.853, 0.813 and 0.911, respectively ( Figure S9A). The high-risk patients had poorer OS than patients with lower RS (log-rank P < .0001, Figure 5A). The model showed good performance with the same cut-off value in the testing set (n = 54, log-rank P = .0042), even in patients with stage I disease (n = 38, P = .0062) ( Figure 5B,C). The AUCs of ROC analysis at 1, 2, 3 and 5 years in the test set are shown in Figure S9B.
To determine whether the prognostic value of the RS was independent of other clinical characteristics, we conducted univariate and multivariate Cox regression analyses with the training set.
The RS remained an independent prognostic indicator with the  Figure 5D). Thus, the robustness of the RS for independently predicting LC patient OS was confirmed.

| DISCUSS ION
Lung cancer (LC) is widely recognized as a highly heterogeneous disease that greatly threatens human health. Previously, many targeted studies have been conducted to comprehend the local immune status of LC lesions, but the understanding of variations in the peripheral immune system of the LC host is still limited. [30][31][32] In In this study, we revealed substantial PBL transcriptome heterogeneity independent of histological type in LC patients. In other words, although different histological types have distinct origins, there seem to be no marked differences in influence on the peripheral immune system. Interestingly, similar findings were presented in research on breast cancer, and whether this observation represents a pan-cancer phenomenon needs additional study to verify. [35][36][37] Additionally, the heterogeneity seems to be related to stage in our study, raising the possibility that the PBL transcription profile can partly reflect the tumour burden of LC patients. mediates the trafficking of CD8 T cells, and CD70 is involved in the activation of CD8 T cells. 39,42 For LC2, we first observed activation of blood coagulation.
Previous findings suggest that activation of blood clotting leads to shows that inflammation promotes tumorigenesis and cancer progression. 45 As the first responders to inflammation, active neutrophils release several soluble neutrophil granule proteins that can promote tumour progression by inducing tumour cell proliferation, stimulating angiogenesis and disabling T cell-dependent antitumour immunity. 46 As for the protective factors, they may play a role in immune activation and anti-inflammatory. NR3C2 is considered to be a tumour suppressor and may be correlated with T cell activation. 57 It is traditionally believed that M2 macrophages are generally associated with immunosuppression and tumour metastasis. TRPC1 and MSR1 can induce M2 to polarize into M1 macrophages. 58,59 RLN2 has antiinflammatory properties. 60 However, the role of FXYD7 and HCG27 in the immune response is unclear.
The predictive ability of the RS model developed from PBMC data was validated even for patients with the same stage, proving the reliability of signature gene selection. The good performance of the prognostic model, especially the prognostic prediction for early-stage LC patients, may help improve follow-up treatment after surgical resection. More meaningfully, the RS contributed to constructing a comprehensive quantitative method for survival prediction, which performed better than existing clinical prognostic indicators.
We recognize limitations in our study. Due to the limitation of insufficient clinical data, we cannot demonstrate the potential clinical value of the heterogeneity between LC subsets directly in our data set. Moreover, both the heterogeneity of the PBL transcriptome and the universality of the RS prognostic model still require further validation with multicentre and larger scale studies.
In summary, to our knowledge, this study is the first to cover multiple LC histological types and stages and to target the peripheral blood transcriptome based on RNA-Seq. We characterized the PBL transcriptional profiles and interpreted the heterogeneity of peripheral immunity in LC hosts for the first time. A robust RS prognostic model, as a non-invasive and unbiased method to predict LC patient survival, was established and may help guide clinical treatment in the future.

ACK N OWLED G EM ENTS
The authors wish to thank all the volunteers enrolled in this study. Funders only provided funding and had no role in the study design, data collection, data analysis, interpretation and writing of the report.

CO N FLI C T O F I NTE R E S T
No conflicts of interest are declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The raw RNA-Seq data for this study are deposited in the Genome