Comprehensive characterization of pathological stage‐related genes of papillary thyroid cancer along with survival prediction

Abstract It is crucial to understand the differences across papillary thyroid cancer (PTC) stages, so as to provide a basis for individualized treatments. Here, comprehensive function characterization of PTC stage‐related genes was performed and a new prognostic signature was developed for advanced patients. Two gene modules were confirmed to be closely associated with PTC stages and further six hub genes were identified that yield excellent diagnostic efficiency between tumour and normal tissues. Genetic alteration analysis indicates that they are much conservative since mutations in the DNA of them rarely occur, but changes of DNA methylation on these six genes show that 12 DNA methylation sites are significantly associated with their corresponding genes' expression. Validation data set testing also suggests that these six stage‐related hub genes would be probably potential biomarkers for marking four stages. Subsequently, a 21‐mRNA‐based prognostic risk model was constructed for PTC stage III/IV patients and it could effectively predict the survival of patients with strong prognostic ability. Functional analysis shows that differential expression genes between high‐ and low‐risk patients would promote the progress of PTC to some extent. Moreover, tumour microenvironment (TME) of high‐risk patients may be more conducive to tumour growth by ESTIMATE analysis.

important to conduct risk stratification analysis for advanced patients to find prognostic factors related to their survival prognosis.
The weighted gene co-expression network analysis (WGCNA) is deemed as an efficient network-based approach, which can investigate the signature of gene networks in the pathogenesis of complicated diseases at system level. 8 It is an algorithm that constructs scale-free gene co-expression networks based on the expression of genes, which can not only classify different gene modules, but also figure out the relationships between clinical features and gene modules, 9 so this method provides an effective way to explore the interaction mechanism of clinical traits-related genes of diseases and identify potential biomarkers. 7,[10][11][12] Until now, no comprehensive investigation on PTC stage-related genes has been reported and the regulation characteristics of them are not well revealed. Here, this study first gives systematic functional analysis on them. First, 1243 common differentially expressed genes (DEGs) were screened out by comparing stage I, II, III and IV PTC samples with adjacent non-tumour tissue samples.
Then, WGCNA was employed to study the co-expression network of DEGs and two gene modules were proved to be associated with tumour stages. The Gene Ontologies (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis show that genes in both these two modules are mainly enriched in cancerrelated pathways, so 6 tumour stage-related hub genes were identified from the two gene modules, including RPS6KA6, SORBS2, EPHB3, QSOX1, S100A6 and UNC5CL. To validate six hub genes, their expression levels at different stages and the receiver operating characteristic (ROC) diagnostic analysis were, respectively, performed based on validation data sets. Meanwhile, DNA mutation and methylation analyses of the six hub genes were also systemically implemented.
Besides, we established a 21-mRNA-based prognostic risk model for PTC patients with stage III and IV using a least absolute shrinkage and selection operator (LASSO) Cox method. Kaplan-Meier analysis, ROC analysis, Cox regression analysis and stratified analysis were employed to assess and validate the prediction performance of the risk model on the overall survival (OS) of advanced patients. Finally, we used KEGG pathway analysis and ESTIMATE analysis to explore the changes of biological pathways and TME between high-risk and low-risk patients.

| Samples and preprocessing
The transcriptome data (level 3, HTSeq-counts) and clinical infor- According to the annotation information of gene type from GENCODE Version 29 (https://www.genco degen es.org/), the gene expression data of 19645 mRNAs were extracted. Then, genes with no or low expression in more than a quarter of the samples (read count <10) were discarded, so 14647 mRNAs were remained. Next, they were normalized by Trimmed Mean of M values (TMM). 13 As validation data, two microarray data sets GSE29265 and GSE3678 were downloaded from GEO database, including 10 and 7 pairs between tumour and normal tissues, respectively.

| Differential gene expression analysis
DEGs were detected using edgeR package in R software (http:// bioco nduct or.org/packa ges/edgeR/ ). 14 DEGs of stage I, II, III and IV PTC samples compared with adjacent non-tumour tissue samples were, respectively, screened out, according to the cut-off criteria of absolute log 2 (fold change; |log 2 FC|) ≥1 and false discovery rate (FDR) <0.05. Then, common DEGs were achieved by overlapping the four groups' DEGs.

| Weighted gene co-expression networks construction
Weighted gene co-expression network analysis was performed using 'WGCNA' R package. 9 First, the samples were clustered to delete the outlier samples. Second, a soft-threshold power β was selected based on the criterion of approximate scale-free topology using the function pickSoftThreshold. Third, the adjacency was transformed into a topological overlap matrix (TOM) using function TOM similarity. Fourth, according to the TOM-based dissimilarity measurement, average linkage hierarchical clustering was conducted to produce the common DEGs dendrogram. Consequently, module identification was performed with the function cutreeDynamic (minModule-Size of 30). Finally, to further analyse the module, the dissimilarity of module eigengenes (MEs) was calculated using the function modu-leEigengenes. ME is defined as the first principal component of the gene expression matrix of the corresponding module, which can summarize the gene expression profiles from a module. Highly similar modules were identified by clustering analysis and then to be merged together with a height cut-off of 0.25.

| Identification of stage-related gene modules and hub genes
To identify the stage-related modules and genes, module-trait relationship analysis was performed to measure the correlation between clinical traits and gene modules. The Pearson correlation coefficient and p value were calculated by between ME and clinical trait. The results were presented using heat map. Next, gene significance (GS) was calculated based on the correlation of a gene expression profile with a clinical trait. In general, the higher the absolute GS, the higher the correlation between this module and the clinical trait. For a certain gene, its Module membership (MM) was defined as correlation between its expression profile in all samples and the expression profile of a certain modules (MEs). The greater the MM value of the gene, the more important the gene is in the module. We defined the thresholds for the selection of hub genes as MM >0.8 and GS >0.2.
In order to explore the potential biological mechanism of each module, the genes in each module were uploaded into KOBAS (http:// kobas.cbi.pku.edu.cn/kobas3), 15 which is an online tool for gene enrichment analysis. Then, the Gene Ontologies (GO) functional enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment were performed. Corrected p value <0.05 was set as the cut-off criteria.
Gene expression difference analysis was used to validate the practicability of hub gene as biomarkers. On one hand, the expression of hub genes in PTC was studied using Gene Expression Profiling Interactive Analysis (GEPIA) 16 data sets. On the other hand, hub gene expression levels at different stages were also plotted.
In addition, the receiver operating characteristic (ROC) curve was preformed to verify the diagnostic performance of hub genes using 'pROC' package (https://cran.r-proje ct.org/web/packa ges/pROC/).

| Genetic alteration analysis of hub genes
Genetic alterations about hub genes were explored using cBioPortal (http://www.cbiop ortal.org/). 17 In addition, the changes of DNA methylation sites on hub genes were studied. First, DNA methylation data of the sites locating in hub genes were extracted; then, with data filtering and difference analysis, different methylation sites (DMSs) were selected according to the threshold that |Δβ| > 0.1 and p value <0.05 (Δβ: the difference value between the average β values of tumour and normal tissues); last, the Spearman's rank correlation coefficients between these DMSs and their genes were calculated.

| Construction and evaluation of the prognostic model for PTC advanced patients
A total of 148 samples with tumour stage III and stage IV were subjected to prognostic modelling analysis, after removing two with follow-up time less than 1 month. A two-step analysis strategy was established for prognostic modelling. First, the common DEGs were selected to analyse their relationship with OS of PTC patients by univariate Cox regression analysis. Those with p value <0.05 were extracted. Second, a least absolute shrinkage and selection operator (LASSO) Cox penalized regression model 18 was preformed to build the classifier using R package 'glmnet' (https://cran.r-proje ct.org/web/packa ges/glmne t/). 19 In order to optimize the model, 10-fold cross-validation was employed. Finally, candidate genes with non-zero coefficient were filtered to build a prognostic model.
The risk score of each PTC patient was calculated by the following formula: where N is the number of candidate genes, Ei is the expression of candidate normalized by TMM, and Ci is the coefficient of candidate genes in the LASSO Cox regression analysis.
Based on the risk score, the PTC patients were divided into high-and low-risk groups by cut-off median. A Kaplan-Meier survival curve was employed for survival analysis, and log-rank tests were used to compare the differences of OS between two groups.
Meanwhile, time-dependent ROC analysis was used to investigate the prognosis accuracy of the model and area under the ROC curve (AUC) values were also calculated using the 'timeROC' package (https://cran.r-proje ct.org/web/packa ges/timeR OC/).
The stratified analysis was conducted to determine whether the prognostic signature is independent of other clinical factors. KEGG pathway enrichment analysis was conducted on DGEs between high-and low-risk groups to explore potential biological pathway alteration. In addition, the stromal score, immune score and ESTIMATE score for each patient with PTC were computed using 'estimate' package (https://bioin forma tics.mdand erson.org/estim ate/). 20

| Screening DEGs
We first performed principal component analysis (PCA) for different tumour stage tissues and normal tissues using all the filtered and normalized gene expression data. As shown in Figure 1A, normal tissues and tumour tissues at different stages can be separated to a certain extent, but there is still a large proportion of overlaps. Using  Figure 1B). Again, these DEGs were used for PCA analysis of PTC tissues at different stages and normal tissues. As shown in Figure 1C, normal tissues were significantly separated from different stages tissues, showing that the DEGs screened here are reliable.
Next, as shown in Figure 1D, 1243 common DEGs were extracted from the four comparison groups. Among these 1243 common DEGs, 835 genes were up-regulated ( Figure 1E) and 408 genes were down-regulated ( Figure 1F). It is obviously to see that either the expression levels of up-regulated or the down-regulated DEGs are all consistent in four stages.

| Identification of co-expression gene modules and functional annotation
1243 common DEGs were performed WGCNA analysis. 470 samples of PTC were first clustered to remove obvious outlier samples ( Figure S1). To ensure a scale-free network, the power of β = 12 (R 2 = 0.952) was chosen for the soft-threshold parameter ( Figure 2A,B). Dynamic hybrid cutting was conducted to construct a hierarchical clustering. The genes with similar expression pattern formed a gene module, so four modules (blue, brown, turquoise and grey) were generated ( Figure 2C). Because the similarity between all modules is less than 0.75, there is no module merge ( Figure 2D). In addition, the weighted network and the eigengene heatmap were constructed to identify interaction relationships of the four coexpression modules. Figure

| Identification of stage-related modules
The module-clinical trait relationship analysis was conducted using The turquoise module is also correlated to pathologic T (r 2 = 0.18, p = 6e-05), pathologic N (r 2 = 0.38, p = 2e-17) and tumour stage (r 2 = 0.15, p = 0.001). Further, GS of each module for tumour stage also calculated. Figure 3B shows that GS values of blue and turquoise modules were much higher than brown module, so we can conclude that two modules (blue and turquoise) are confirmed to be associated with PTC pathological stages.

| Validation of the hub genes
Based on criteria of GS >0.2 and MM >0.8, two genes (RPS6KA6 and SORBS2) in blue module and four genes (EPHB3, QSOX1, S100A6 and UNC5CL) in turquoise module were identified as hub genes ( Figure 3C,D). Among them, RNA expressions of RPS6KA6 and SORBS2 in PTC tissues were significantly down-regulated compared with normal tissues, while expressions of other four genes were significantly up-regulated ( Figure 3E). In order to verify this observation, expression levels of these 6 genes were also analysed based on three validation data sets of GEPIA database, GSE29265 and GSE3678, respectively ( Figures S2-S4). We can see that all six genes are differentially expressed in at least two data sets, especially RPS6KA6, SORBS2, EPHB3 and S100A6 in all data sets. Besides, there are also significant differences on RNA expression levels of the six genes among four tumour stages ( Figure 4G-L), which were consistent with analysis results of the GEPIA database ( Figure S5).

| Genetic alteration analysis on hub genes
We furtherly performed genetic alteration analysis for the six hub genes. The DNA mutations statuses of them were analysed using TCGA PTC patients' data in cBioPortal database. The six hub genes altered in about 4 (1%) of 399 PTC patients ( Figure 5A), and the frequency of alteration of each gene is shown in Figure 5B. Only EPHB3, QSOX1 and S100A6 altered, but their frequencies of alteration were extremely low (0.3%, 0.5% and 0.5%, respectively; Figure 5B). These  (Table S1). Of them, 12 DMSs were found to be significantly associated with their corresponding genes' expression ( Figure 5C and Table 3). The 12 DMSs could regulate their corresponding genes' expression levels. As shown in Figure 5C, only cg04130557 was positive correlation with expression level of its corresponding gene SORBS2, and others DMSs were all negative correlation with their corresponding genes' expression.

| Construction of a prognostic signature for PTC stage III/IV patients
As shown in Figure 6A, First, univariate Cox regression analysis was conducted. Results show that 230 genes were associated with PTC advanced patients' OS (p < 0.05). To further screen out an optimal combination from these genes, LASSO Cox regression analysis was performed and 21 genes were identified to develop a risk score model ( Figure S6).
Finally, using the coefficients derived from LASSO Cox algorithm, a risk score prognostic model was constructed based on RNA expression values of the 21 genes: The risk score of each patient was calculated, and all patients were divided into high-and low-risk groups using the median as the cut-off. The risk score profiles and survival time of each patient are shown in Figure 6B,C. We can observe all dead patients are in highrisk group. In addition, the 21 gene expressions in normal, low-risk and high-risk group patients are shown in Figure 6D. shown in Figure 6D, the coefficient of ENTPD1 is negative and it is a safety factor, so the expression value in low-risk patients is higher than that in high-risk patients. While the coefficient of PAPSS2 is  Consequently, we aim to confirm that the prognostic signature is of high applicability and could precisely predict the OS of PTCadvanced patients. As shown in Figure 6G, univariate Cox regression analysis reveals that both age and risk score are associated with

QSOX1
QSOX1is an enzyme that oxidizes thiols during protein folding, reducing molecular oxygen to hydrogen peroxide, which may be utilized by tumour cells at different stages of tumorigenesis. 39 The results of Sung et al 40 have proven that QSOX1 might be a lung cancer tissue-derived biomarker and be involved in the promotion of lung cancers, and thus can be a therapeutic target for lung cancers S100A6 Overexpression of S100A6 is correlated with patient prognosis, so it is an independent prognostic predictor in gastric cancer and the methylation profile of specific CpG sites may affect its transcription. 41 S100A6 can not only stimulate proliferation and migration of colorectal carcinoma cells through activation of the MAPK pathways, 42 but also regulate the proliferation, invasion, migration and angiogenesis of lung cancer cells through the p53 acetylation. 43 Moreover, it plays an important role in pancreatic cancer 44,45 UNC5CL It is a novel inducer of a proinflammatory signalling cascade leading to activation of NF-κB and JNK. It has been first described as a novel ZU5 and DD-containing protein that is mostly homologous to the intracellular fragments of the Unc5-receptor family members 46

TA B L E 2
Detailed functional annotation about the six hub genes by deep literature-exploring subgroups based on stage III/low-risk, stage III/high-risk, stage IV/ low-risk and stage IV/high-risk, as shown in Figure 6I. The result indicates both stage III and IV patients in high-risk group have poorer OS than low-risk patients. Meanwhile, based on age and gender, patients were divided into four subgroups (<60/low-risk, <60/high-risk, ≥60/ low-risk and ≥60/high-risk) and four subgroups (female/low-risk, female/high-risk, male/low-risk and male/high-risk). As expected, OS of ≥60/high-risk group patients is the worst ( Figure 6J). In addition, in both female and male groups, high-risk patients have shorter survival time than low-risk ones ( Figure 6K). Overall, this prognostic signature shows a satisfactory applicability when advanced patients are regrouped by different clinicopathological characteristic, suggesting that it is an independent applicable prognostic predictor for PTC-advanced patients.

| Biological pathway and tumour microenvironment alteration between high-and lowrisk patients
To explore potential biological pathway alteration between high-and low-risk patients, we conducted KEGG pathway enrichment analysis on DEGs between two groups. First, according to |log 2 FC| ≥ 1 and FDR<0.05, we obtained 454 DEGs, including 439 up-regulated and 15 down-regulated genes, as shown in Figure 7A. Here, 40 pathways were enriched on these 454 DEGs, as listed in Table S3. Figure 7B shows the top 20 enriched pathways. We can see that pathways associated with cancers were enriched, such as 'Wnt signalling pathway', 'TGF-beta signalling pathway', 'Proteoglycans in cancer', 'PI3K-Akt signalling pathway' and 'Pathways in cancer', illustrating that these DEGs may promote the progress of PTC to some extent.
As we know, understanding the tumour microenvironment (TME) is of practical significance for cancer diagnosis and treatments. As major fraction of TME, infiltrating stromal and immune cells form the major non-tumour constituents of tumour tissues, which not only perturb the tumour signal in molecular studies but also have an important role in cancer biology. 20 Therefore, in order to investigate the relationship between these cells and the prognostic signature, ESTIMATE was preformed to calculated the Stromal Score, Immune Score and ESTIMATE Score for 148 PTC-advanced patients using R package 'estimate'. Here, the higher value estimated in Immune Score or Stromal Score means to the larger amount of the immune or stromal components in TME. ESTIMATE Score is the sum of Immune Score and Stromal Score denoting the comprehensive proportion of both components in TME. First, we analysed whether these scores were correlated with the risk score. As shown in Figure 7C-E, the prognostic signature is significantly positively correlated with Stromal Score, Immune Score and ESTIMATE Score (p < 0.05) respectively. Then, we performed difference analysis in terms of Stromal, Immune and ESTIMATE scores between low-and high-risk patients. Figure 7F-H demonstrate that high-risk patients have higher Stromal, Immune and ESTIMATE scores (p < 0.05).
These results suggest that TME of high-risk patients, compared with low-risk patients, may be more conducive to tumour growth.

| DISCUSS ION
In this paper, we systematically analysed PTC tumour stage-related genes and constructed a prognostic risk signature for PTC stage III/ IV patients. The workflow of this study is shown in Figure S7. Based Six hub genes of RPS6KA6, SORBS2, EPHB3, QSOX1, S100A6 and UNC5CL from the two stage-related modules were identified and then underwent comprehensive validation tests, including expression difference analysis between tumour and normal tissue in our data set, GEPIA database, GSE29265 and GSE3678, as well as among four stage tumours based on our data set and GEPIA database, respectively. Moreover, ROC curve analysis shows that these six hub genes yield excellent diagnostic efficiency between tumour and normal tissues. The alteration statuses of six hub genes were also analysed and mutations in the DNA of the six genes rarely occur, indicating that they are all much conservative, but the changes of DMSs on the six genes show that 12 DMSs are significantly associated with their corresponding genes' expression, so DNA methylation on six genes should be paid close attention in following researches.
Finally, by deep literature-exploring as described in All above analysis prove that these six hub genes would be potential biomarkers for PTC diagnosis and marking PTC stages.
Tang et al. 7 have also given five hub genes for PTC (COL1A1, COL1A2, COL3A1, COL5A2 and DCN) by WGCNA and protein-protein interaction network methods, but further validation about them needs to be explored. Since we performed more rigorous data filtering, COL1A2 is absent in our data set. Here, we conducted ROC curve analysis and expression difference analysis on other four genes. Figure S8 shows that AUC values of four genes are 0.757, 0.642, 0.696 and 0.909, respectively, while those of our six hub genes are all greater than 0.85 and four higher than 0.9 ( Figure 4A-F). The expression levels of the four genes yield no significant difference ( Figure S9 and  Consequently, biological pathway alteration analysis on DGEs between high-and low-risk groups illustrate that these DEGs promote the progress of PTC to some extent. Meanwhile, high-risk patients have higher stromal, immune and ESTIMATE scores than low-risk patients, suggesting that TME of high-risk patients may be more conducive to tumour growth. We can conclude the 21-mRNA-based prognostic risk signature could be a novel and effectively independent prognosis signature for predicting survival in advanced patients with PTC.

ACK N OWLED G EM ENTS
This work was financially supported by the Support Program for Science and Technology Department of Sichuan Province (2020YJ0237, 2018SZ0030) and the 1·3·5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (ZYJC18025).

CO N FLI C T O F I NTE R E S T
The authors confirm that there are no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data sharing is not applicable to this article as no new data were created or analysed in this study.