Dissecting expression profiles of gastric precancerous lesions and early gastric cancer to explore crucial molecules in intestinal‐type gastric cancer tumorigenesis

Abstract Intestinal‐type gastric cancer (IGC) has a clear and multistep histological evolution. No studies have comprehensively explored gastric tumorigenesis from inflammation through low‐grade intraepithelial neoplasia (LGIN) and high‐grade intraepithelial neoplasia (HGIN) to early gastric cancer (EGC). We sought to investigate the characteristics participating in IGC tumorigenesis and identify related prognostic information within the process. RNA expression profiles of 94 gastroscopic biopsies from 47 patients, including gastric precancerous lesions (GPL: LGIN and HGIN), EGC, and paired controls, were detected by Agilent Microarray. During IGC tumorigenesis from LGIN through HGIN to EGC, the number of activity‐changed tumor hallmarks increased. LGIN and HGIN had similar expression profiles when compared to EGC. We observed an increase in the stemness of gastric epithelial cells in LGIN, HGIN, and EGC, and we found 27 consistent genes that might contribute to dedifferentiation, including five driver genes. Remarkably, we perceived that the immune microenvironment was more active in EGC than in GPL, especially in the infiltration of lymphocytes and macrophages. We identified a five‐gene signature from the gastric tumorigenesis process that could independently predict the overall survival and disease‐free survival of GC patients (log‐rank test: p < 0.0001), and the robustness was verified in an independent cohort (n > 300) and by comparing with two established prognostic signatures in GC. In conclusion, during IGC tumorigenesis, cancer‐like changes occur in LGIN and accumulate in HGIN and EGC. The immune microenvironment is more active in EGC than in LGIN and HGIN. The identified signature from the tumorigenesis process has robust prognostic significance for GC patients. © 2020 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of Pathological Society of Great Britain and Ireland.


Introduction
Gastric cancer (GC) is the fifth most frequently diagnosed cancer and the third leading cause of cancer-related death worldwide [1]. Gastric adenocarcinomas can be categorized into mainly intestinal and diffuse types according to the Lauren classification [2,3]. Intestinal-type GC (IGC) has a clear and multistep histological evolution that starts with inflammation and progresses through atrophy, intestinal metaplasia, gastric precancerous lesion [GPL, including low-grade intraepithelial neoplasia (LGIN) and high-grade intraepithelial neoplasia (HGIN)] to early GC (EGC) and then advanced GC (AGC) [4]. Most of the previous studies on GC have highlighted differences between AGC and the adjacent mucosa from patients undergoing gastrectomy [5,6]. Although the genomic signature of primary GC has been well characterized, there are still many unidentified mechanisms underlying different steps of the carcinogenic cascade, such as no specific mutation pattern during cancerization of intestinal-type GC, which restricts the diagnosis and treatment of GC [7]. Although some studies on premalignant gastric mucosa and EGC have been performed [8][9][10][11][12], no comprehensive molecular expression profiles interpreting the intact process of gastric tumorigenesis from GPL to EGC are available.
Gene expression signatures can be used as molecular predictors of overall survival (OS) and relapse of GC independent of TNM stage [13][14][15][16]. However, GC is highly heterogeneous in both phenotype and genotype, causing the gene signatures identified as prognosis predictors to depend greatly on the training set, which might limit the validity and the reproducibility in other cohorts, thereby emphasizing the importance of exploring molecular alterations in concert [14]. The process of tumorigenesis and progression provides an appropriate model to identify consistent genes to improve the diagnosis and prognosis prediction.
In the present study, we used LGIN, HGIN, EGC, and their paired inflammation controls to represent gastric tumorigenesis. We characterized the changes in gene expression, biological processes, tumor hallmark activities, stemness, and immune microenvironment during gastric tumorigenesis. Furthermore, we identified the gene signature from the gastric tumorigenesis and progression process to construct a risk model to predict OS and relapse of GC patients.

Patients and samples
Forty-seven patients with GPL or EGC were identified from the Department of Gastroenterology, Peking Union Medical College Hospital (PUMCH) from 2011 to 2015. Endoscopic biopsies were performed in each case according to the following parallel protocol: specimens for pathology were obtained with forceps from the serious part of the lesion; then an additional specimen was taken from the same spot in the lesion for our study. After systematic inspection with enhanced imaging techniques, the relatively normal area was localized and then two specimens were obtained from the same spot, sent for pathology and study control, respectively. In total, 94 specimens were obtained. Pathological results were confirmed by two independent pathologists (Linlin Guo and Weixun Zhou). Gastroscopic biopsies were rapidly immersed in RNAlater ® solution (Cat No AM7021; Invitrogen, Waltham, MA, USA) and transferred to −80°C after overnight storage at 4°C. According to the WHO Classification of Tumors of the Digestive System for pathological diagnosis [17], the samples were grouped into four categories: chronic gastritis, LGIN, HGIN, and EGC. Tissues with a pathological diagnosis of inflammation based on the Updated Sydney System [18] were used as controls. Tissues that were diagnosed as atrophic gastritis and intestinal metaplasia were excluded. This study was approved by the Ethics Committee of PUMCH and received institutional approval [institutional review board (IRB) number: B222]. All the patients provided written informed consent, and the experiments were performed in accordance with the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research [19].

Microarray expression profiling and data normalization
Total RNA was extracted from frozen tissues using an RNeasy Mini Kit (Cat No 74106; Qiagen, Hilden, Germany) according to the manufacturer's instructions. Subsequently, purified RNA samples were labeled and hybridized to an Agilent SurePrint G3 Human GE v2 8 × 60K Microarray (Agilent Technologies, Santa Clara, CA, USA) according to the manufacturer's instructions.
The raw data were normalized using GeneSpring GX software, version 11.5 (Silicon Genetics, Redwood City, CA, USA). The expression value for a particular gene that was mapped by multiple probes was determined as the probe with the highest median expression value across all samples [20,21]. The raw data and processed data have been deposited in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database with accession number GES130823.

Data collection
TCGA RNA sequence level 3 data [raw counts and RNA-Seq by Expectation Maximization (RSEM) normalized read counts] of 415 stomach adenocarcinomas (TCGA STAD), 35 normal stomach tissues, and the related clinical information were obtained from the cBioPortal for Cancer Genomics database (http://www.cbioportal.org/). The raw data of two human GC mRNA microarray studies with prognostic information (accession numbers GSE62254 and GSE15460; sample size > 200) were downloaded from the GEO database. These two mRNA microarray datasets were based on the same platform [HG-U133A Plus2 (GPL570) platform].

Identification of differentially expressed genes and functional enrichment
Paired or unpaired Student's t-tests wereconducted to identify differentially expressed genes (DEGs) between lesions and paired controls or among lesions. DEGs from stomach adenocarcinoma data of TCGA were obtained using the Bioconductor package 'DEseq2'. Protein coding genes with a false discovery rate (FDR) < 0.05 and a fold-change (FC) > 1.5 were regarded as DEGs. The R package 'ClusterProfiler' was applied for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses. http://software.broadinstitute.org/gsea/index.jsp) [22,23]. Gene set variation analysis (GSVA) [24] was applied to estimate the activity scores of tumor hallmarks using these gene sets. All genes related to each hallmark were employed to score the activity of this hallmark in the control tissues, which was considered the baseline activity for each lesion stage.
A list of 299 cancer driver genes from the PanCancer Project was downloaded, and their expression levels were compared between lesions and paired controls. Driver genes with FDR < 0.05 and FC > 1.5, or FC > 2 were considered as differentially expressed driver genes.
TCGA STAD mRNA expression profiles were used to estimate the correlation between gene expression and immune cell infiltration, which was executed by TIMER [34] (https://cistrome.shinyapps.io/timer/).

Development of a five-gene risk scoring system for survival analysis
The combination (n = 548) of the two GC mRNA microarray datasets was designated as the training dataset. The Least Absolute Shrinkage and Selection Operator (LASSO) Cox regression model that was performed by R package 'glmnet' was used to narrow down the 22 consistent DEGs (coDEGs) to select the most useful prognostic markers, in which the training dataset was subsampled and the tuning parameter lambda was determined according to the expected generalization error estimated from leave-one-out cross-validation [35,36]. We then constructed a signature using the expression value of the identified genes and weighted by the regression coefficient. Finally, a five-gene signature was obtained under the value of tuning parameter lambda that gives minimum mean cross-validated error. The risk score of the signature was then calculated using the risk score formula: X-tile plot was utilized to select the optimum cutoff score of the training dataset to dichotomize patients into a high-risk group or a low-risk group [35,37]. Kaplan-Meier survival analysis and the log-rank test were used to evaluate the prognostic difference of the signature between two groups [38]. Multivariate Cox proportional hazard regression analysis was used to evaluate independent prognostic factors.
Harrell's concordance index (C-index) was used to assess and compare the discrimination and prognostic accuracy of the prognostic signatures [35,39].

Statistical analysis
Pearson's chi-squared test or the Kruskal-Wallis chisquared test was used to test the variation of clinical data. Paired Student's t-tests or Wilcoxon signed rank tests were used to calculate differences between paired samples, and an unpaired Student's t-test or Mann-Whitney test was used to calculate differences between unpaired samples. Pearson's product-moment correlation test was used to estimate correlations between gene expression and stemness indices.

Changes in biological functions during gastric tumorigenesis
We collected 47 paired LGIN, HGIN or EGC tissues and matched inflammation controls ( Figure 1 and supplementary material, Table S1). No significant differences were found among the samples in terms of gender (p = 0.370) or age (p = 0.552) (supplementary material, Table S2). We detected the whole-genome RNA expression profile of each sample. Through differential gene expression analyses, we found many DEGs between LGIN, HGIN, EGC, and their paired controls (defined as L. DEGs, H. DEGs and E. DEGs, respectively). Most of the H. DEGs and E. DEGs were already present among the L. DEGs (Figure 2A and supplementary material, Figure S1A,B), indicating the presence of cancer-like genetic alterations in LGIN. Simultaneously, the results showed no DEGs between LGIN and HGIN. The DEGs between EGC and LGIN or HGIN were similar, indicating that the gene expression profiles of GPL were more similar to each other than they were to that of EGC despite the fact that Characteristics in intestinal-type gastric cancer tumorigenesis 137 HGIN and EGC are more similar in morphology (supplementary material, Figure S1C-E).
To identify biological behavior changes during gastric tumorigenesis, GO and KEGG enrichment analyses of the three aforementioned DEG groups (L. DEGs, H. DEGs, and E. DEGs) were performed. The up-regulated DEGs mainly showed enrichment for mitosis in LGIN; for mitosis, cell adhesion in HGIN; and for immune response in EGC. However, the down-regulated DEGs mainly showed enrichment for metabolism, development, and gastric acid secretion in LGIN; for the Rap1 signaling pathway and gastric acid secretion in HGIN; and for detoxification, metabolism in EGC ( Figure 2B,C). Hence, during gastric tumorigenesis, molecular expression and biological functions are substantially changed in LGIN relative to the control gastric mucosa [40].
To further describe the process behind the development of malignant phenotypes during gastric tumorigenesis, we evaluated the activities of ten tumor hallmarks. When we scored with DEGs, the number of hallmarks with significant changes in activity gradually increased from LGIN to HGIN to EGC. The earliest different hallmarks generally included genome instability and mutation and limitless replicative potential, while tissue invasion and metastasis seemed to appear at a later stage, such as EGC tissues in our data ( Figure 2D).

Appearance of cancer-like changes in LGIN
According to the results of the biological function enrichment and tumor hallmark activity analyses, gastric epithelial cells gained limitless replicative potential and impaired gastric acid secretion, which are closely related to stem cell behaviors. To further explore functional changes during gastric tumorigenesis, we characterized the stem-cell-like features of different stages by stem score [25]. The stem scores of LGIN and HGIN were significantly higher than those of their paired controls (p < 0.05), while there was no difference between EGC and paired control tissues (p > 0.05, Figure 3A). In addition, there were no differences in stemness among LGIN, HGIN, and EGC (p > 0.05, Figure 3C). However, the stem scores of EGC controls, LGIN, HGIN, and EGC were all significantly higher than those of the controls of LGIN and HGIN (p < 0.05, Figure 3B, C) [12]. These results indicated that morphologically normal tissues around EGC have already changed at the molecular level in terms of some basic biological properties, which might be influenced by the tumor and tumor microenvironments.
Furthermore, to reveal the molecular contributions to the stem score changes, we detected the expression of 299 driver genes. We found that the expression levels of BCL2L11, RET, and ALB were significantly reduced in LGIN, HGIN, and EGC compared with their paired controls, and that the expression levels of GRIN2D and BRCA1 were significantly increased ( Figure 3D-F). These results provided a more in-depth explanation at the RNA level that these five driver genes play important roles in gastric tumorigenesis.
We estimated the correlation between the expression of the five common driver genes and the stem scores. All the  Characteristics in intestinal-type gastric cancer tumorigenesis 139 up-regulated genes were significantly positively correlated with the stem scores, and all the down-regulated genes were significantly negatively correlated with the stem scores (|cor| > 0.35, p < 0.0001, Figure 3G-I and supplementary material, Figure S2). These results indicated that these genes are important in the dedifferentiated oncogenic phenotype of gastric epithelial cells and that dedifferentiation occurs in LGIN.

More active immune microenvironment in EGC compared with GPLs
Based on the functional enrichment, we observed changes in immune-related pathways in EGC. To further explore the changes in the immune microenvironment during gastric tumorigenesis, we first applied 46 immune markers of different immune cells to an unsupervised cluster of lesions and found that EGC was separated from LGIN and HGIN ( Figure 4A) [26]. The infiltration scores of 22 immune cells were significantly higher in EGC than in LGIN and HGIN (p < 0.05), while there was no difference between LGIN and HGIN (p > 0.05, supplementary material, Figure S3A and Table S3) [27]. We observed that monocytes and M1 macrophages had the first and second highest median infiltration scores in EGC, both of which are related to the prognosis of many types of cancer [41][42][43]. We obtained the same results from the other two independent datasets (supplementary material, Figure S3B,C). Subsequently, we used CIBERSORT [27] to estimate the infiltration of various immune cells and found that the absolute infiltration levels of gamma delta T cells and macrophages were markedly higher in EGC than in GPLs ( Figure 4B and supplementary material, Figure S3D). This finding was consistent with the results estimated for five immune signatures, which showed significantly higher LIexpression_score and CSF1_response in EGC than in GPLs ( Figure 4C) [28].
To further validate the relationship between gene expression and immune infiltration, we merged DEGs from TCGA STAD data with the other three DEG groups, which yielded 23 up-regulated and five down-regulated genes that were specifically changed in GC (EGC and TCGA STAD) (supplementary material, Figure S4A-C). We used these 28 DEGs to build a protein-protein interaction network LGIN, (E) HGIN, and (F) EGC relative to their paired control samples. Driver genes with p < 0.05 and fold-change > 1.5, or fold-change > 2 were identified as up driver genes (red dots). Driver genes with p < 0.5 and fold-change < 1/1.5, or fold-change < 1/2 were identified as down driver genes (green dots) and other genes were assumed as unchanged driver genes (black dots). (G-I) Correlation of stem scores and three consistent changed driver genes in LGIN, HGIN, and EGC. p < 0.05 was considered to be statistically significant. *p < 0.05; **p < 0.01; ***p < 0.001; ns: no significance. Bars show mean AE SEM.

Y Zhang et al
and obtained one largest immune subnet, consisting of nine up-regulated genes (supplementary material, Figure S4D). Using TCGA STAD to explore the relationship between these nine genes and immune infiltration, we observed that seven genes were positively correlated with the infiltration of CD8 + T cells and five genes were positively correlated with the infiltration of macrophages (cor > 0.3, p < 0.05), which was consistent with our data (supplementary material, Figure S4E-G) [34].

Consistent DEGs play critical roles in gastric tumorigenesis and progression
As shown in supplementary material, Figure S4A,B, there were 11 up-regulated and 11 down-regulated coDEGs (co-up DEGs and co-down DEGs, supplementary material, Table S4) in all four stages (LGIN, HGIN, EGC, and TCGA STAD). All co-up DEGs were significantly positively correlated with the stem scores, and all co-down DEGs were significantly negatively correlated with the stem scores (|cor| > 0.35, p < 0.0001, supplementary material, Table S5). Given that these coDEGs continue to be dysregulated during gastric tumorigenesis and progression, we wonder what their prognostic value is. To detect the prognostic value of these 22 coDEGs, the combination (n = 548) of two human GC mRNA microarray datasets was designated as the training dataset. The LASSO Cox regression model was used to narrow down the 22 coDEGs to select the most useful prognostic markers, which identified five genes (GADD45B, LAMP3, PLEKHS1, RGN, and TIMP1).We then constructed a five-gene signature and derived a formula to calculate the survival risk score for each patient based on the expression value of the five genes and weighted by the regression coefficient in the training dataset. We divided patients into a high-risk group or a low-risk group using the optimum cutoff score (cutoff = 3.79) obtained by X-tile plot. Patients with lower risk scores had a better OS (n = 547, log rank: p = 9.48 × 10 −13 , Figure 5A) and disease-free survival (DFS, n = 300, log rank: p = 2.64× 10 −8 , Figure 5B) than patients with higher risk scores.

Validation of the prognostic significance of the fivegene signature in GC
The genes in the five-gene signature participate in many important biological processes, including growth (GADD45B and TIMP1), metabolism (TIMP1 and RGN), signal transduction (PLEKHS1), extracellular matrix assembly (TIMP1), and antigen presentation (LAMP3). Univariate and multivariate Cox proportional hazard regression analyses showed that the five-gene signature was an independent hazardous prognostic factor associated with OS (HR = 1.94, p = 9.26 × 10 −8 ) and DFS (HR = 2.27, p = 1.24 × 10 −5 ) of GC patients from other Characteristics in intestinal-type gastric cancer tumorigenesis 141 covariates, including gender, age, and pathological stage (supplementary material, Table S6). We obtained the same results in each individual microarray dataset of the training dataset with the same formula and the same cutoff score (cutoff = 3.79) (p < 0.05, supplementary material, Figure S5A,B and Table S6).
To further validate the prognostic value of the fivegene signature, we used the same formula and the same cutoff (cutoff = 3.79) to dichotomize TCGA STAD patients (independent testing dataset) into high-or lowrisk groups. Similar to the findings from the training set, patients in the high-risk group had a shorter median OS (log rank: p = 0.012, Figure 5C) and DFS (log rank: p = 0.0015, Figure 5D) than patients in the low-risk group. The univariate and multivariate Cox proportional hazard regression analyses showed that the prognostic capacities of the five-gene signature in OS and DFS were independent of the aforementioned covariates (supplementary material, Table S6).
Additionally, we compared our five-gene signature with two established RNA expression signatures, including a six-gene signature [13] and a 24-lncRNA signature [44] for GC. In all the training, GSE62254, GSE15460, and TCGA STAD datasets, our five-gene signature achieved a higher C-index in OS or DFS than the sixgene signature ( Figure 6). Moreover, in three of the four datasets or both datasets with DFS data (GSE62254 and TCGA STAD data), the combination of the aforementioned covariates with our five-gene signature provided a higher prognostic accuracy of OS or DFS than with the six-gene signature ( Figure 6 and supplementary material, Figure S6). Only the microarray datasets that Similarly, for all the training, GSE62254, and GSE15460 datasets, our five-gene signature showed a higher C-index in OS or DFS than the 24-lncRNA signature that was trained in the GSE62254 dataset ( Figure 6 and supplementary material, Figure S6). In addition, the combination of covariates with our five-gene signature provided a higher prognostic accuracy of OS and DFS than with the 24-lncRNA signature ( Figure 6 and supplementary material, Figure S6).

Discussion
Over the past few decades, to eradicate tumors, scientists have focused mainly on tumors themselves. Few advances have been made in understanding the mechanisms by which GPLs become early invasive tumors and by which tumors escape attack of the immune system and therapies. In this project, we detected the expression profiles of GPL and EGC samples to investigate the mechanisms underlying gastric tumorigenesis. First, DEG enrichment showed that gastric epithelial cells first displayed changes in proliferation and development, followed by changes in cell adhesion and gastric acid secretion; finally, the detoxication was reduced along with immune microenvironment activation during gastric tumorigenesis, which has never been integrally elaborated before. As shown in Figure 2, chromatin remodelingrelated pathways (chromatin remodeling at centromere and CENP-A containing nucleosome assembly) and their regulated pathways (chromatin assembly or disassembly, nucleosome organization, DNA packaging, and protein-DNA complex assembly) that are frequently altered in GC [8,[45][46][47] were dysregulated in LGIN and HGIN. Cell adhesion-related pathways are the top perturbed pathways in GC [47][48][49]. The dysregulation of the Rap1 signaling pathway and calcium-independent cell-cell adhesion via plasma membrane cell-adhesion molecules in HGIN indicated acquisition of the potential for invasion and metastasis of HGIN [46,50]. Severe inflammation could promote the transition from a precancerous state to tumorigenesis [9,51]. The single-cell sequencing results suggested that macrophages have a critical role in promoting gastric tumorigenesis [12]. In our project, the cytokine-and inflammation-related pathways were significantly up-regulated in EGC, which might contribute to the formation of EGC.
We observed cancer-like molecular expression profiles and phenotypes appearing in LGIN that were sustained in HGIN and EGC, such as some tumor hallmarks, stemcell-like features, commonly changed driver genes, and common DEGs. Even so, the expression profiles of LGIN and HGIN were more similar to each other than to that of EGC, despite the greater similarities in morphology between HGIN and EGC. The most obvious difference between EGC and GPL was the difference in the immune microenvironment. The immune cell infiltration in EGC was significantly higher than that in the GPL, especially the infiltration of lymphocytes and macrophages. Tumorinfiltrating lymphocytes are supposedly associated with a positive prognosis for GC patients and are considered to reflect the protective host antitumor immune response [52,53]; however, tumor-associated macrophages are considered to play a complicated role in GC development by contributing to the progression and poor prognosis of GC [41,42,54] while also indicating better OS [54].
Intriguingly, we found no significant difference in the stemness between EGC and paired controls, which might be interpreted as the influence of the tumor and the tumor microenvironment on the normal mucosa in view of the significant differences between LGIN or HGIN and their paired controls, between the EGC controls and the LGIN and HGIN controls, and the lack of significant differences among LGIN, HGIN, and EGC in their stem scores. Simultaneously, the immune Harrell's concordance index (C-index) was used to assess the discrimination and prognostic accuracy of the signatures. Forest plots illustrating the C-index (95% CI) for OS and DFS in training (left) and testing (right). CI, confidence interval.
Characteristics in intestinal-type gastric cancer tumorigenesis 143 infiltration was significantly higher in EGC than in GPL (especially macrophages) [12]. The influence of the tumor and tumor microenvironment serves to assimilate the surrounding normal mucosa, which explained to some degree why patients who underwent gastric endoscopic submucosal dissection (ESD) had a higher annual incidence of recurrent GC than patients who underwent surgery [55]. According to the central dogma of genetics, RNA is the next-level executor of DNA. In the past, however, scientists have primarily focused on driver gene mutations and found that intestinal-type GC shows no specific mutation patterns during tumorigenesis [7]. We checked the expression profiles of 299 pan-cancerrelated driver genes [56] in gastric tumorigenesis and found consistent differentially expressed driver genes in GPLs and EGC, including BCL2L11, RET, ALB, GRIN2D, and BRCA1. BCL2L11, which has been considered to act as an apoptotic activator, showed a significant decrease in gastric lesions [57]. GRIN2D, which has been deemed to be a possible oncogene in pan-cancer, showed a significant increase in gastric lesions [56,58]. These two driver genes might play vital roles in gastric tumorigenesis, and other variational driver genes might change to sustain the balance of biological functions, such as BRCA1, which has been identified as a tumor suppressor and showed increased expression in gastric lesions; and TP53, which has also been identified as a tumor suppressor and showed increased expression in LGIN tissues. Moreover, these coincidentally changed driver genes showed significant correlations with dedifferentiation, indicating that expression of these driver genes plays equally critical roles in functional maintenance.
In this study, we identified 22 coDEGs that might participate in the dedifferentiation of gastric epithelium cells, with the co-up DEGs being positively correlated with the stem score and the co-down DEGs being negatively correlated with the stem score. In addition, we calculated the prognostic relevance of these coDEGs and developed a robust five-gene signature. The five-gene signature could predict OS and DFS and was an independent prognostic factor associated with OS and DFS. The prognostic robustness was validated in an independent cohort and the discrimination and prognostic accuracy were assessed and compared with two established signatures for GC.
To the best of our knowledge, this study is the first to include paired continuous lesions representing gastric tumorigenesis. Our findings showed that during gastric tumorigenesis, cancer-like changes occur in LGIN and accumulate in HGIN and EGC. EGC has a more active immune microenvironment than GPLs. EGC tissues, but not GPL tissues, could assimilate the gene expression of the surrounding normal mucosa. The five-gene signature identified from the tumorigenesis process showed robust prognostic significance for OS and DFS in GC patients, and might lead to the generation of potential molecular targets for the development of anticancer therapy.  Table S1. Clinical information for each individual Table S2. Sample information of the mRNA microarray Table S3. Mann-Whitney test of immune infiltration scores Table S4. Twenty-two coDEGs from merging four groups of DEGs Table S5. Correlations between DEGs and stem scores Table S6. Univariate and multivariate Cox proportional hazard regression analysis of overall survival and disease-free survival in training (n = 547), testing (n = 406), GSE62254 (n = 300), and GSE15460 (n = 248) patients according to the five-gene signature, gender, age, and stage 146 Y Zhang et al