scRNA‐seq determines the characteristics of T cell marker genes to predict the prognosis of esophageal cancer

Esophageal cancer (EC) mainly includes two histological subtypes, esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma, which is of high morbidity and mortality. With the continuous development of medical technology, the treatment of EC has been greatly improved, but its prognosis is still unfavorable. Recent single‐cell RNA‐sequencing (scRNA‐seq) is expected to bring breakthroughs in the treatment of EC. First, we identified T cell marker genes and generated signature by analyzing scRNA‐seq data from Expression Omnibus (GEO) database and TCGA database. Then, an immune prognostic model was constructed using the least absolute shrinkage and selection operator. We found that the survival rate of ESCC varied significantly among the low‐ and high‐risk groups. Two genes from T cell signature, DCPS and CYB5R3, were expressed in ESCC cells. Collectively, our study proposed a novel prognostic signature for ESCC patients based on T cell marker genes.

symptoms of EC are often not obvious, and the middle and advanced stages are accompanied by progressive dysphagia. 4The preferred treatment for EC is surgery, followed by radiation therapy and chemotherapy.Although surgery combined with chemoradiotherapy is one of the best options for the treatment of EC, its prognosis is still not optimistic.It remains crucial to proposed precise diagnostic methods for the treatment of EC.
Recently, a novel high-throughput sequencing technology, singlecell RNA-sequencing (scRNA-seq), has been developed rapidly, and widely used in cancer, 5,6 growth and development, 7 microbiology, 8,9 and plant 10 research fields.Single-cell sequencing enables highthroughput sequencing analysis of genomes, transcriptomes, and epigenomes at the individual cell level, which improve the precision of tumor molecular biology research, and has produced important significance for tumor biomedical research.In recent years, scRNA-seq has made important progress in revealing the tumor heterogeneity and intrinsic pathogenic mechanism of EC. 11 There are no clinically effective ESCC biomarkers so that most patients cannot be diagnosed and treated in time, resulting in prognosis is poor.Therefore, it is urgent to find effective ESCC diagnostic and prognostic biomarkers to improve the efficiency of treatment.This study provides a theoretical basis for the identification of biomarkers related to the diagnosis and prognosis of EC and the discovery of new therapeutic targets.
In this study, we reanalyzed differential expression genes from a scRNA-seq dataset of T cell in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases.Several candidate biomarkers that can predict ESCC diagnosis and prognosis simultaneously were screened through functional annotation and marker gene analysis.

| Data source
Single-cell transcriptome file of ESCC samples of GSE145370 was downloaded from the GEO database.TCGA bulk tumor transcriptome data were downloaded from https://portal.gdc.cancer.gov/.

| Identification of T cell marker genes
We used "Seurat" package to analysis scRNA-seq data.The top 2000 variably expressed genes were used to perform principal component analysis (PCA).Cell clustering analysis was performed using the uniform manifold approximation and projection (UMAP).

| Heatmap analysis
We performed the heatmap analysis to assess the activity changes in the pathways or functions of the gene set by using R package of pheatmap.

| Consensus clustering
The samples from the GEO and TCGA database were clustered by R package of Consensus Cluster Plus to determine the optimal cluster.

| Least absolute shrinkage and selection operator (LASSO)
LASSO method was used to determine significant prognostic genes.
The Kaplan-Meier survival curve of the risk scores was generated using the R package "survminer."

| Statistical analysis
Data are presented as means, and error bars indicate the standard deviation.All statistical analyses were performed with Prism (GraphPad Software Inc., La Jolla, CA, USA).Two-way ANOVA and two-tailed, Student's t-test were used to calculate statistical significance.A p-value of <.05 was considered to be significant.

| Data processing and identification of cell clusters in ESCC
The schematic diagram of this study is shown in Figure 1.The singlecell sequencing datasets of seven ESCC patients were obtained from GSE145370 through the GEO database (Table 1).After quality control and standardization of single-cell sequencing data, cells with mitochondrial sequencing count >15% and ribosome sequencing count >50% were excluded (Figure 2A).Then, cells with too high mitochondrial gene expression and some extreme cells were filtered according to the violin diagram (Figure 2B).
Next, characteristic genes with significant differences in expression between cells were extracted, that is, high expression in some cells and low expression in others, that help highlight biological signals in single-cell datasets.This is implemented with the FindVariableFeature function.Each dataset returns 2000 features, which will be used for downstream analysis (Figure 2C).The top 10 principal components (PCs) were selected for subsequent UMAP for dimension reduction analysis (Figure 2D, E).The batch effects from multiple samples were corrected by Seurat package, then the UMAP method was used for dimensionality reduction and visualization of data.Finally, all cells were clustered into 10 cell clusters using clustering analysis.DimHeatmap makes it easy to explore the main sources of heterogeneity in the dataset and can determine which PC dimensions can be used for further downstream analysis.Cells and genes are ordered according to PCA scores (Figure 2F).

| Identification of T cell marker genes expression profiles
First, we defined all the cells in groups, respectively NKcells, Tcells, Treg, Bcells, Neutrophils, Epithelial, Fbroblast, mastcells, and Myeloid (Figure 3A, B).Next, we extract the T cell (Figure 3C, D).Heatmap was used to compare different genes expression between subpopulations and identify the expression of T cell marker genes (Figure 3E).

| Construction of prognostic signature
As shown in Figure 5A, T cells were clustered into two subgroups.
TCGA data were used to validate prognostic features based on T cell marker genes, and it was divided into two groups based on single-cell cluster genes (Figure 5B).Survival analysis showed significant differences between the two cluster (Figure 5C) (p <.05), indicating that the model is successfully established.survival status showed a significant difference between the two groups.Notably, the heat map shows 10 genes with significant differences in expression, namely BTG2, PPP1R15A, SNX2, SELENOK, SMARCA4, CYB5R3, TLE1, BIK, LIMD2, RBBP7, and DCPS.We speculate these genes may be involved in the regulation of EC.

| Risk score prognostic model
F I G U R E 2 Seven ESCC patientsrelated single-cell sequencing datasets were obtained from GSE145370 through the GEO database.(A) The Seurat package in the R software was utilized for scRNA-Seq data analysis.(B)The violin diagram shows that filter cells with too high mitochondrial gene and some extreme cells.The computational dataset exhibits characteristic genes with high cell-to-cell variation.(C) Each dataset returns 2000 features, which will be used for downstream analysis.(D), (E)The top 10 principal components (PCs) were selected for subsequent uniform manifold approximation and projection for dimension reduction analysis.The UMAP revealed that all cells were clustered into 10 cell clusters using clustering analysis.(F) DimHeatmap determines further downstream analysis.Cells and genes are ordered according to PCA score.promote cancer growth, which occurs not only within the local tumor microenvironment but also in various distant body compartments. 24,25cells have a central role in virtually all therapeutic modalities, including surgery, chemotherapy, radiotherapy, immunotherapy, and targeted therapy, but are still largely unexplored.

| Experiment to validate the model
scRNA-Seq has become a widely used tool in cancer research to characterize the cellular and molecular composition of tumors.
Techniques for analyzing individual cells are currently capable to measure tumor heterogeneity at the molecular level, including DNA, RNA, proteins, and epigenetics.The scRNA-seq method was mainly used to examine tumors in detail, using gene or protein expression to characterize cell type composition and tumor heterogeneity.2][33] Our study proposed a novel prognostic signature based on T cell marker genes for ESCC patients by combining scRNA-seq sequencing data and TCGA data.

3. 3 |
Clustering T cell T cells were classified as CD4T, CD8T, Tmem, and T Naive (Figure 4A) based on different markers.The different markers' featureplots are shown in (Figure 4B-E).The gene expressions in different populations were visualized.As shown in Figure 4F, the top 5 genes with the highest expression in CD4T cells are: SELL, ICOS, GPR183, BATF, and LTB.The top 5 genes with the highest expression in T Naive cells are: HSPH1, HSPA1B, CCL4L2, HSPA1A, and GZMK.The top 5 genes with the highest expression in Tmem cells are GNLY, FCER1G, TRDC, TRDV2, and TYROBP.The top 5 genes with the highest expression in CD8T cells are: GZMH, MT1E, NKG7, and FGFBP2.

A 1
best Cox proportional hazards model was established based on the LASSO method (Figure 6A-C).Then the median value of the risk scores was the cutoff value to divide the ESCC patients into a low-risk (low score) group and a high-risk (high score) group.The curve of risk score distribution and the scatterplot of T A B L E 1 Primer sequences used for RT-qPCR experiments from Generalbiol.GeneForward primer 5 0 -3 0 Reverse primer 3 0 -5 0 The schematic representation of this study.
To further validate the diagnostic and prognostic models, we examined the expression levels of CYB5R3, DCPS, and SELENOK in vitro by qPT-PCR.Compared with normal esophagus cell (HEEC), K70 cells expressed lower levels of CYB5R3 and DCPS and higher levels of SELENOK (Figure 7A-C), which further verified these genes are abnormally enriched in tumor cells.Immunohistochemistry was used to detect the expression of the gene in our clinical samples and found a relative increase in the expression of CYB5R3 and DCPS F I G U R E 3 Identification of T cell marker genes expression profiles.(A), (B) The UMAP revealed all the cells in groups, respectively NKcells, Tcells, Treg, Bcells, Neutrophils, Epithelial, Fbroblast, mastcells and Myeloi.(C), (D) The UMAP shows the T cell marker genes.(E) The heatmap shows different genes expression between these subpopulation.gene in the adjacent tissues (Figure 7D, E).As expected, cancerrelated genes CYB5R3 and DCPS were lowly expressed in EC tissues, and patients with high expression of these genes had significantly good prognosis.These data suggest that the subpopulation-associated genes that we identified possessed a superior clinical prognostic value.F I G U R E 4 Clustering T cell.(A) T cells were classified as CD4T, CD8T, T mem and T Naiv.(B)-(E) T cells were defined by cell grouping with different markers.(F) The heatmaps show the expression of different genes in different populations.EC is one of the most common malignant tumors in the world, which has a high incidence in China.Although the incidence and mortality of EC are decreasing in China, it is still a major malignant tumor threatening the health of Chinese residents.20Studies have shown that T cells are not only abundant in the tumor stroma but also affect the prognosis of patients with cancer.[21][22][23]Different subsets of T cells have different functions, most of which promote cancer growth, but a small number show anti-tumor activity.Tumors co-opt T cells to F I G U R E 5 Construction of prognostic signature.(A) The TCGA data to validate prognostic features based on T cell marker genes.(B) The TCGA data was divided into two groups based on single-cell cluster genes.(C) Survival analysis showed significant differences between the two cluster.(The survival curve p-value is less than .05,and survival is meaningful.).
However, our selected genes DCPS and CYB5R3 were only validated by RT-PCR in tumor cells and will be further validated in T cells in the future.In summary, our results provide evidence for tumor cell heterogeneity and immune tumor cell heterogeneity in ESCC.T-cell marker genes were screened and found to be associated with prognosis of ESCC patients.F I G U R E 6 Risk score prognostic model.(A)The ESCC patients into a low-risk (low score) group and a high-risk (high score) group by the median value of the risk scores.(B) The curve of risk score distribution and the scatterplot of survival status also showed the significant difference between the two groups.(C) The heat map shows 10 genes with significant differences in expression, namely BTG2, PPP1R15A, SNX2, SELENOK, SMARCA4, CYB5R3, TLE1, BIK, LIMD2, RBBP7, and DCPS.F I G U R E 7 Experiment to validate the model.(A)-(C) The expression levels of CYB5R3, DCPS and SELENOK in vitro by qPT-PCR.(D), (E) The expression of the gene in our clinical samples by immunohistochemistry and found a relative increase in the expression of CYB5R3 and DCPS gene in the adjacent tissues.