Expression scoring of a small‐nucleolar‐RNA signature identified by machine learning serves as a prognostic predictor for head and neck cancer

Abstract Head and neck squamous cell carcinoma (HNSCC) is a common malignancy with high mortality and poor prognosis due to a lack of predictive markers. Increasing evidence has demonstrated small nucleolar RNAs (snoRNAs) play an important role in tumorigenesis. The aim of this study was to identify a prognostic snoRNA signature of HNSCC. Survival‐related snoRNAs were screened by Cox regression analysis (univariate, least absolute shrinkage and selection operator, and multivariate). The predictive value was validated in different subgroups. The biological functions were explored by coexpression analysis and gene set enrichment analysis (GSEA). One hundred and thirteen survival‐related snoRNAs were identified, and a five‐snoRNA signature predicted prognosis with high sensitivity and specificity. Furthermore, the signature was applicable to patients of different sexes, ages, stages, grades, and anatomic subdivisions. Coexpression analysis and GSEA revealed the five‐snoRNA are involved in regulating malignant phenotype and DNA/RNA editing. This five‐snoRNA signature is not only a promising predictor of prognosis and survival but also a potential biomarker for patient stratification management.

poor patient management and uneven medical resources distribution can be found on HNSCC patients since very little is known about the patients' prognosis after diagnosis. (Bunbanjerdsuk et al., 2019;Huang & O'Sullivan, 2017;Xing, Zhang, & Tong, 2019). Thus, it can be concluded that despite recent advances in therapeutic management of HNSCC, little is improved in the overall survival (OS) of HNSCC patients due to the lack of diagnostic and prognostic markers (Galot et al., 2018), which means that it is currently quite urgent to identify and apply early diagnostic as well as prognostic signature for HNSCC patients.
Over the past few years, noncoding RNAs, especially micro RNAs (miRNAs) and long noncoding RNAs (lncRNAs), have been revealed as promising biomarkers for the diagnosis and prognosis of diseases, including various cancers (He et al., 2018;G. Liu et al., 2018;Xing, Zhang, & Chen, 2019). However, another class of noncoding RNAs, small nucleolar RNAs (snoRNAs), have rarely been considered as biomarkers for cancers due to the long-time common belief that they only perform housekeeping functions. snoRNAs are small RNAs of 60-300 nucleotides in length and are mainly found in the nucleolus (Williams & Farzaneh, 2012); they are one of the best-characterized classes of noncoding RNAs, and their function in rRNA biogenesis has been well documented (Romano, Veneziano, Acunzo, & Croce, 2017). Growing evidence has demonstrated that snoRNAs also play an important role in the carcinogenesis of multiple tumors (Mei et al., 2012). Dong et al have indicated that the mutation or downregulation of snoRNA U50 is associated with a malignant phenotype and identified snoRNA U50 as a candidate tumor-suppressor gene in prostate cancer (Dong et al., 2008).
Another study focused on SNORA42, an H/ACA box snoRNA encoded at 1q22, whose expression is frequently increased in non-small-cell lung cancer (NSCLC; Testa et al., 1997), while the small interfering RNA (siRNA)-induced downregulation of SNORA42 in NSCLC cell lines was able to induce apoptosis and reduce colony formation in vitro and was observed to inhibit tumor formation in a mouse model (Mei et al., 2012).
Additionally, another study found that SNORA42 also acts as an oncogene in colorectal cancer (Okugawa et al., 2017). SNORA21 appears to have potential as a prognostic biomarker in colorectal cancer according to Yoshida et al. (2017). Recently, a systematic pan-cancer analysis of the expression of snoRNAs in human cancer has highlighted significant roles of snoRNAs in the development and implementation of biomarkers or therapeutic targets for cancer (Gong et al., 2017). Considering the small size and stability of snoRNAs, as well as the accumulating evidence of its potential role in multiple tumors, snoRNAs are gaining increasing attention in the field of oncology and have the potential to serve as biomarkers for diagnosis, prognosis and therapeutic targets.
In our study, we were the first to perform a series of machine learning analyses to explore and construct a prognostic signature based on snoRNAs to predict the survival of HNSCC patients in 510 HNSCC patients. We identified prognosis-related snoRNAs using univariate regression analysis. Then, we performed dimensionality reduction for significant prognosis-related snoRNAs through the least absolute shrinkage and selection operator (LASSO) regression and multivariate regression and constructed a five-snoRNA-based prognostic signature.
Finally, we assessed the clinical utility of this prognostic model and explored its potential functions. Our findings provide new insights into the clinical significance of snoRNAs and provide a promising biomarker for predicting and evaluating the clinical outcome of HNSCC patients.  Gong et al., 2017). The expression data of snoRNA were then log2 (RPKM + 1) transformed for corresponding analysis.

| Data acquisition and preprocessing
The corresponding clinical follow-up information of HNSCC is also coming from UCSC Xena and the samples with survival time less than 30 days were removed.

| Construction of the prognostic model
Five hundred and ten HNSCC patients were involved in the construction of the predictive model after samples without clinical information were excluded, and they were randomly divided into training and testing sets (7:3). Survival related snoRNAs were identified using the HNSCC patients accompanied by the snoRNA expression profile as well as their clinicopathological features. Next, the snoRNAs with a p < .05 were screened out as candidate snoRNAs which were passed on for LASSO regression. LASSO is a popular method for regression analysis with high-dimensional features (Sauerbrei, Royston, & Binder, 2007), which has been widely used in the Cox proportional hazard regression model for survival analysis with high-dimensional data (Tibshirani, 1997).
LASSO Cox regression analysis was performed to select the most powerful prognostic markers from the candidate snoRNAs identified from the training set, which was then used for the multivariate Cox regression analysis to identify the possibilities of snoRNAs as independent prognostic markers from the most powerful snoRNAs.
Meanwhile, a risk score model was also constructed from this step.
The patients were divided into two groups, according to the median score. A Kaplan-Meier survival curve was plotted to analyze the difference between the two groups, while receiver operating characteristic (ROC) analysis was used to evaluate the efficiency of the predictive model, and AUC values were calculated to validate the performance of the prognostic predictors. To combine some basic clinicopathological information with the risk score model, a nomogram was plotted to predict the 1-, 3-, and 5-year survival probabilities of HNSCC patients. All of these processes were performed by R software (version 3.5.1).

| Correlation analysis of the five snoRNAs and annotation of their function
To determine the coexpression relationships between the five-snoRNAs and PCGs, the Pearson's correlation coefficients were calculated, and the PCGs positively or negatively correlated with the five snoRNAs were considered as snoRNA-related PCGs (we chose the top 300 correlated PCGs with adjusting p < .05 for each snoRNA) which were used to make functional annotation prediction for snoRNA (Fan & Liu, 2016;F. Liu, Xing, Zhang, & Zhang, 2019;R. Liu et al., 2017). To get further insight into the function of top 300 snoRNA-related PCGs for each snoRNA, the Database for Annotation, Visualization, and Integrated Discovery (DAVID, https://david.ncifcrf.gov/) was employed to perform the gene ontology enrichment analysis (Huang da, Sherman, & Lempicki, 2009). The gene lists of top 300 PCGs were uploaded, then analysis was run with default parameters and we got the results of the biological process (BP) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway. The annotation term with p < .05 is regarded as of statistical significance and the results were presented by dotplot (ggplot2 package in R) like our previous studies did (Xing, Zhang, Feng et al., 2019;Zhang, Feng, Du et al., 2018;Zhang, Feng, Li, Guo, & Li, 2018).

| Gene set enrichment analysis analysis
The mRNA expression data from HNSCC in this study was obtained from the public database TCGA-GDC (https://portal.gdc.cancer.gov/), and was downloaded as raw count. The expression data was preprocessed according to the following procedure. First, we keep the genes with count larger than 1 in at least 50% of the samples.
Next, the expression value was normalized using the R package edgeR, which was then transformed into log2 (expression value +1), and genes with the average value >1 were kept for gene set enrichment analysis (GSEA) analysis. GSEA software (Version:3.0; http://software.broadinstitute.org/gsea/index.jsp) and R package clusterProfiler (Yu, Wang, Han, & He, 2012) were employed to perform GSEA analysis between high and low-risk groups to determine how the hallmark phenotypes and pathways differ between them. According to the five-snoRNA signature, patients were ranked in descending order by risk score. The first 25% and last 25% of the patients were used to perform GSEA. Two annotated gene set files (h.all.v6.2.entrez.gmt and c2.cp.kegg.v6.2.entrez.gmt) were selected as the reference gene set to characterize the differences between the two groups according to the mRNA expression profile.

| Clinical characteristics of the patients
A total of 567 samples from the TCGA HNSC cohort were downloaded from the TCGA database, from which normal samples, samples without clinical follow-up, as well as samples with overall survival time less than 30 days were excluded, thus 510 HNSCC samples were obtained for this study (Table 1). Five hundred and ten samples were further randomly divided into a training set and testing set at the ratio of 7:3 (357 and 153). The study workflow is demonstrated in Figure 1.

| Prognostic values of snoRNAs
Using univariate Cox proportional hazards regression analysis, the survival-related snoRNAs were screened from training set, from which 113 snoRNAs were identified using p < .05 as the cut-off. The  patients were also grouped according the above variables and then we applied the five-snoRNA signature to the different subgroups. Since differences between different subgroups might influence the performance of the prognostic biomarkers. When it comes to sex subgroups, there were 372 males and 138 females in the HNSCC cohort, which did not differ significantly in terms of the risk score distribution (Figure 4c).
Besides, the high-risk patients also had significantly shorter OS in both the male and female groups (Figure 5a,b), which means that the five-snoRNA signature was independent of sex. There was also no significant difference in the risk score distribution of younger (≤60 years,

| Nomogram and calibration
A multivariate Cox proportional hazard regression model was constructed to combine some clinicopathological features (age, anatomic subset, sex, clinical stage, and histological grade) with the five-snoRNA signature for prediction of patients prognosis of 1-, 3-, and 5-year survival probability. A nomogram was plotted to visualize this model with an assigned score for each term ( Figure 6). It can also be concluded that the five-snoRNA signature is independent of other clinicopathological features in predicting the patient's survival.

| Functional analysis of the predictive snoRNAs
To identify the protein-coding genes co-expressed with the snoRNAs, correlation analysis was performed, using the top 300 co- F I G U R E 5 (a-j) Kaplan-Meier analyses of patients with head and neck squamous cell carcinoma in different subgroup cohorts, patients were grouped based on their gender, age, subdivision, grade, and stage. Kaplan-Meier analysis with a two-sided log-rank test was performed to estimate the differences in overall survival between the low-risk and high-risk patients Patients with HNSCC frequently have a poor prognosis and a low survival rate. Despite great improvements in diagnostic and therapeutic methods, the survival rate of HNSCC is still low (Troiano et al., 2018).
Prognostic biomarkers can provide information about the probable outcome of a cancer relative to disease progression, recurrence, or death (Ballman, 2015). This information could greatly aid in patient stratification, treatment management and monitoring disease status in clinical practice, for example, offering personalized therapeutic schedules to HNSCC patients who would benefit enormously (Nonaka & Wong, 2018).
The prognostic prediction would be very useful for clinicians to aid in choosing the magnitude and type of therapeutic approach (surgery, chemotherapy, radiotherapy, or a combination of these) on the basis of the molecular profile of HNSCC (Troiano et al., 2018). In the past few years, multiple molecular biomarkers have been proven to predict the clinical prognosis in different kinds of cancers (Quan et al., 2018;Zhu et al., 2016). In addition, combining several biomarkers achieved higher sensitivity and specificity compared to individual markers Zhao, Sun, Zeng, & Cui, 2018).
F I G U R E 6 Nomograms combining five-small nucleolar RNA signature and clinicopathological features to predict 1-, 3-, and 5-years survival probability of patients with head and neck squamous carcinoma

| 8079
We focused on snoRNAs and used similar methods to identify and prove that snoRNA could also serve as a prognostic signature. snoRNAs are a class of small (60-300 nt) noncoding RNAs implicated in the chemical modification of rRNA and are commonly known as housekeeping genes. They have been known to function as a guide for the posttranscriptional modification of rRNA, but in recent years, a new role in the regulation of other cellular pathways has emerged (Scott & Ono, 2011) including the regulation of oncogenesis in different cancers (Gong et al., 2017). Compared with the well-characterized role of miRNAs in biomarkers for early detection, recurrence and prognostication prediction (Hayes, Peruzzi, & Lawler, 2014), little is known about the significance of snoRNAs. Furthermore, the association between snoRNA expression and their clinical impact as biomarkers for HNSCC has not been undertaken.
Therefore, we performed a systematic analysis of the potential role of  results, the procedure of detection, quantification, and determination of transcriptomic activity of snoRNAs must be standardized (Guglas et al., 2017). Last, miRNA-seq is not designed for a full snoRNA repertoire, as miRNA-seq reads are too short (15-30 bp) to distinguish snoRNA from snoRNA fragments (Krishnan et al., 2016), which may have different biological functions. However, this method has been applied in several studies (Gao et al., 2015;Krishnan et al., 2016;L. L. Zheng et al., 2016) and is probably the most appropriate way to quantify snoRNA expression profiles from TCGA omics data. Thus, validating full-length snoRNAs from miRNA-seq data for further investigation is necessary (Gong et al., 2017). Therefore, the five newly found prognosis-related snoRNAs deserve more attention, and the next step for our research is to validate our results using experiments. We hope that these results could give other researchers inspiration for further exploration.
Taken together, we identified a novel five-snoRNA signature for HNSCC that is a promising independent survival predictor and serves as an important biomarker for guiding the clinical treatment of HNSCC patients to improve patient management. In addition, our findings provide new insights into the molecular mechanisms underlying HNSCC and present a promising new prognostic marker.
Therefore, our findings in the signature have very promising clinical significance.

| CONCLUSIONS
In conclusion, this study highlighted the prognostic value of snoRNAs and explored their underlying functions. Some prognosis-related snoRNAs have been revealed, and the survival of HNSCC patients could be predicted by a risk model based on these snoRNAs, which XING ET AL.