Big cohort metabolomic profiling of serum for oral squamous cell carcinoma screening and diagnosis

1 Department of Oral and Maxillofacial Surgery, Nanjing Stomatological Hospital, Medical School of Nanjing University, Nanjing, Jiangsu, China 2 Department of Oral and Maxillofacial Surgery, Affiliated Hospital of Jiangsu University, Zhenjiang, Jiangsu, China 3 Department of Chemistry, Fudan University, Shanghai, China 4 Department of Chemistry, Stanford University, Stanford, California, USA 5 Central Laboratory of Stomatology, Nanjing Stomatological Hospital, Medical School of Nanjing University, Nanjing, Jiangsu, China


INTRODUCTION
Oral cancer is one of the most common cancers in the head and neck region. An estimated 377 713 new cases and 177 757 deaths worldwide were attributed to oral cancer in 2020. 1 Oral squamous cell carcinoma (OSCC) accounts for around 90% of oral cancers.
Tobacco use (smoked or chewed), alcohol consumption, and human papillomavirus infection are regarded as high-risk factors for OSCC development. 2 The diagnosis of OSCC includes a physical examination, radiography, computed tomography, magnetic resonance imaging, and histopathological examination of tissue biopsies. 3,4 However, changes in molecular distribution at the primary carcinoma site are difficult to track at early stages, before the histological lesion can be detected. 5 In addition, many cases are still not diagnosed until the advanced stage, when distant metastases have occurred, thereby missing the best opportunity for treatment. If the necessary intervention could be conducted before tumorigenesis, the current 5-year survival rate of 60% would be expected to improve greatly. 4 Currently, tissue-based biopsy remains the gold standard in cancer diagnosis. It requires harvesting biospecimens by invasive procedures such as biopsies or needle aspirations. These procedures share common issues such as patient discomfort and sampling inaccuracy caused by tissue heterogeneity. By contrast, liquid biopsy is increasingly considered an alternative option for cancer detection because it can provide cancer-associated molecular information in a minimally invasive manner. Liquid biopsy is conducted by detecting tumor-associated markers in circulating or excreted biological fluids such as saliva, urine, and serum. Currently, the markers detected are primarily exosomes, circulating tumor cells (CTCs), and circulating cell-free tumor DNA (cfDNA) that are shed into the bloodstream by cancer cells undergoing apoptosis or necrosis. Several DNA and mRNA species were reported to be associated with OSCC progression, such as Gal-1, Gal-3, Transgelin, miR-24, miR-181, miR-196a, miR-10b, miR-18, lincRNA-p21, GAS5, and HOTAIR. [6][7][8][9][10][11][12][13]
Gene- or protein-based clinical diagnosis mainly relies on immunoassays that introduce a hybrid probe or an antibody as a specific recognition element. This immune recognition-based multiplex detection is inevitably restricted by cross-reaction and spectral overlap in the readout. The analysis time and cost also increase as more biorecognition probes are introduced. In contrast to gene and protein molecular detection, metabolomics-based in vitro diagnosis also has considerable promise because it provides metabolic phenotype information that can not only precisely characterize the oncometabolite distribution at different stages but also help to guide the necessary therapy. 14 Therefore, a highly sensitive and specific metabolomics-based approach is urgently needed for preclinical screening and diagnosis among the high-risk population.
In the past decade, ambient ionization mass spectrometry (AIMS) has increasingly gained acceptance in clinical diagnosis, with applications including cystic fibrosis, breast cancer, renal cell carcinoma, and cervical cancer. [15][16][17][18] Its most advantageous characteristic is the direct desorption and ionization of analytes from the sample matrix under atmospheric conditions. 19 Little or no additional sample pretreatment is needed compared to conventionally used chromatography-mass spectrometry systems. 20,21 Among the various AIMS methods, paper spray ionization (PSI) is the leading one because its materials are widely available. PSI integrates well with dried blood spot collection and storage, and its analytical procedure comprises solvent extraction, analyte desorption, and electric-field-induced spray ionization. 22 However, the plain paper substrate has drawbacks such as native impurities and strong retention of hydrophilic metabolites. Various surface coatings or modifications have been reported to overcome these issues. [23][24][25] Our group has also proposed using a polymer substrate instead of paper and has achieved remarkable improvements in ionization efficiency, signal stability, and coverage of hydrophilic species. [26][27][28][29] This polymer spray ionization approach has shown a good response for metabolic profiling of biological fluids. In previous work, we reported the practical value of conductive polymer spray ionization mass spectrometry (CPSI-MS) in discriminating OSCC from premalignant lesion (PML) and healthy contrast (HC). 30 Combined with machine learning (ML) for high-dimensional data interpretation, this method can be performed with high accuracy at much lower cost, with trace sample consumption (< 3 μL) and at high speed (a few seconds per sample), making it a promising analytical tool for clinical assays.
CPSI-MS/ML has shown its advantage in directly collecting abundance information for hundreds of metabolites from a trace dried biofluid spot within a few seconds under atmospheric conditions, 31 and in identifying key salivary metabolites and pathways involved in the progression from the PML to the OSCC stage. The characteristic metabolites previously discovered in saliva were mainly limited to small molecules with molecular weights below 500 Da.
Compared to saliva-based diagnosis, serum samples have the advantages of a tightly controlled homeostatic environment and less external interference caused by diet. Serum is a more clinically available biofluid than saliva. It not only contains small metabolites but is also rich in lipid information. Thus, the use of serum can provide more lipid information, which can complement salivary metabolic profiling. In addition, analysis of serum allows us to judge the progression of OSCC, which we were unable to do with saliva. The minimally invasive nature of blood-based samples, the wide distribution of metabolites, and evidence of changes in metabolites during OSCC initiation and progression make blood-based metabolites attractive biomarker candidates. 32,33 Currently, dozens of metabolites have been reported to be dysregulated with OSCC malignant progression, including ketones, malonate, glutamine, propionate, valine, tyrosine, serine, methionine, and choline. [34][35][36][37][38][39] Given the hypothesis that serum probably contains more OSCC-associated metabolic phenotype information, two questions needed to be investigated in this study: (1) whether the previously discovered salivary metabolites still differ significantly between HC and OSCC in serum and can therefore serve for preclinical screening; and (2) whether the significantly different metabolites in serum can be used not only for discriminating OSCC from HC but also for discerning OSCC at different stages (T1, T2, T3, T4). Therefore, the aim of this study was to develop panels of serum metabolite markers with high sensitivity and specificity for OSCC screening and diagnosis. The potential of serum metabolic profiling for staging was also preliminarily investigated. With the aid of the CPSI-MS/ML approach, we believe that a minimally invasive serum diagnosis can be realized to provide a quick, accurate, cost-effective diagnosis of OSCC. The scheme below describes the general workflow that is followed.

SCHEME 1
Diagram of the serum metabolic profiling workflow by CPSI-MS/ML. (a) Two cohorts of serum samples were collected from the OSCC and HC volunteers as the marker discovery and validation sets, respectively. (b) One drop of dried serum spot (3 μL) was loaded onto a conductive polymer tip. Once the extraction solvent was spiked, the high voltage was switched on to trigger the data acquisition. (c) The high-dimension metabolic profiles of different groups were classified and visualized in the constructed 3D feature space by an unsupervised machine learning model; (d) From a statistical analysis, the discriminating metabolites were selected as features. (e) Given the data of the two cohorts as the training and test sets, a machine learning model was applied; (f) The serum metabolite markers were further validated at the tissue level and the combination was employed as the diagnostic panel. tSNE, t-distributed stochastic neighbor embedding; Ln(FDR), natural logarithm of false discovery rate; Lv, latent variable.

Volunteer recruitment
Two cohorts of volunteers were recruited prior to formal surgery or chemotherapy during two separate periods. The first batch was  (Table S1).

Specimen collection and preparation
Overnight fasting for 12 h was required before the intravenous blood sampling in the morning. Approximately 1 mL of blood was withdrawn to generate 400 μL of serum. The same brand of glass centrifuge tubes (BD Vacutainer) was used for blood collection. To avoid metabolite changes before preprocessing, blood samples were temporarily stored at 4°C until natural coagulation. Serum was prepared by centrifugation at 2000 × g for 10 min at 4°C after a blood clot had formed. All serum samples were stored at -80°C for long-term storage until use.

Metabolomic profiling by CPSI-MS
A full description of the CPSI-MS instrumentation can be found in a prior publication. 30

Tissue validation by DESI-MSI
Eight intact OSCC tissues resected during clinical surgery were used to validate the serum metabolite markers. For each tumor, a series of cryosections was sliced at a thickness of 12 μm and stored at -80°C until use. One cryosection first underwent H&E staining and a histopathology check to delineate the cancer and normal regions for reference. An adjacent slice then underwent the DESI-MSI experiment. A commercial DESI system (Prosolia, Indianapolis, USA) was employed for tissue imaging. N,N-dimethylformamide-acetonitrile (1:1, v/v) was used as the spray solvent with a flow rate of 1.0 μL/min and a nebulizer gas pressure of 1.6 MPa. The impact angle between the sprayer and the section mounting stage was 56°. A high voltage of +4.0 kV was provided by the mass spectrometer and applied to the sprayer to generate the electrospray for desorbing and ionizing the components across the tumor cryosection (12 μm). Target ion image reconstruction was achieved using Massimager (Chemmind Technologies Co., Ltd, China).

Metabolomics data processing
All raw files were first converted into cdf format by Xcalibur (Thermo Fisher Scientific, San Jose, CA, USA) and then imported into MATLAB 2020a (Mathworks, Natick, MA) for batch data preprocessing using an in-house script. Each sample's metabolomic profile was obtained by averaging the mass spectra over 10 continuous scans in the corresponding time window. The mass tolerance of each bin was set at ±0.005 m/z when extracting the peak intensity information. Peaks in the average mass spectrum with an intensity lower than 500 were regarded as background noise and discarded. A bin was treated as a missing value if it was not detected in a sample. No missing value imputation was performed, to avoid artifactual statistical results in the univariate analysis. 41 To reduce the data matrix volume, any bin with more than 50% missing values among the first cohort of 254 samples was discarded. Finally, 1518 bins were initially extracted to characterize the metabolomic profile. A data matrix was constructed with each row representing one case and each column representing one bin variable. The matrix then underwent IS normalization, natural log transformation, zero-centering, and unit variance scaling before univariate analysis, multivariate analysis, and machine learning modeling were applied. The data processing was done at Fudan University and Stanford University.
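The preprocessing chain described above can be illustrated with a minimal Python sketch. This is not the MATLAB script used in the study; it assumes that missing bins are encoded as `None`, that "IS normalization" means dividing each intensity by a per-sample internal standard intensity, and that unit variance scaling uses the sample standard deviation. The function and argument names are hypothetical.

```python
import math
from statistics import mean, stdev

def preprocess(rows, is_intensity, max_missing=0.5):
    """Filter, normalize, log-transform, and autoscale a bin intensity matrix.

    rows: one list per case; each entry is a bin intensity or None (missing).
    is_intensity: assumed internal standard intensity for each case.
    Returns the indices of the kept bins and the scaled matrix.
    """
    n_cases = len(rows)
    n_bins = len(rows[0])
    # Discard bins with more than `max_missing` missing values (50% in the text).
    kept = [j for j in range(n_bins)
            if sum(row[j] is None for row in rows) / n_cases <= max_missing]
    # IS normalization followed by the natural log; missing values stay missing
    # because no imputation is performed.
    logged = [[None if row[j] is None else math.log(row[j] / is_val) for j in kept]
              for row, is_val in zip(rows, is_intensity)]
    # Zero-centering and unit variance scaling, column by column,
    # using only the observed (non-missing) values.
    scaled = [list(r) for r in logged]
    for c in range(len(kept)):
        observed = [r[c] for r in logged if r[c] is not None]
        mu = mean(observed)
        sd = stdev(observed) if len(observed) > 1 else 0.0
        for r in scaled:
            if r[c] is not None:
                r[c] = (r[c] - mu) / sd if sd > 0 else 0.0
    return kept, scaled
```

Keeping missing entries as `None` rather than zero mirrors the decision not to impute, so that later univariate tests only see observed values.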

Statistical analysis
The unsupervised metabolic profile differentiation between the OSCC and HC groups was first conducted with t-distributed stochastic neighbor embedding (t-SNE) in the MATLAB program. A rank sum test was then applied separately to each of the two cohorts to search for metabolite ions that changed significantly between the OSCC and HC groups. The false discovery rate (FDR) was estimated with the Benjamini and Hochberg method to adjust the P-value and assess the statistical significance. 42 An ion was selected if its FDR value was lower than 0.05. Only ions that changed significantly in both the discovery cohort and the validation cohort were regarded as potential serum metabolite markers. Finally, metabolites with a fold change larger than 1.5 or smaller than 0.67 were included for further validation in tissue sections by DESI-MSI.
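As an illustration of the FDR step, the Benjamini and Hochberg adjustment can be sketched in a few lines of Python. This is a generic textbook version under the assumption that one P-value per ion is already available from the rank sum test; it is not the MATLAB code used in the study, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (FDR) in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_indices(p_values, cutoff=0.05):
    """Indices of the ions whose adjusted P-value is below the cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < cutoff]
```

For example, `significant_indices(p_values)` with the default cutoff of 0.05 corresponds to the selection rule stated above.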
Variables with importance in projection (VIP) values higher than 1.5 were considered to contribute strongly to the pattern recognition of different OSCC stages. Prism (GraphPad Software, USA) was employed for preparing box plots, heatmaps, and receiver operating characteristic (ROC) curves.

To visualize the difference between HC and OSCC metabolite patterns, an unsupervised machine learning method, t-SNE, was introduced to reduce the high-dimensional metabolite ions information into a three-dimensional (3D) feature space. In the constructed 3D feature

Discovery and validation of serum metabolite markers
The rank sum test was employed to search for low abundance discriminating ions. In the discovery cohort, there were 241 significantly changed ions in OSCC compared to HC (FDR < 0.05). When the same procedure was conducted in the validation cohort, 218 ions were found to differ significantly. After overlapping the two batches of discriminating ions, 65 ions were confirmed not only to have statistical significance but also to be upregulated or downregulated in the same direction (Figure 2a). The non-overlapping ions may arise from several sources, such as inter-batch variation in individuals and sample storage. After removing ions that were either redundant or failed to meet the fold change criteria (larger than 1.5 or smaller than 0.67), 39 metabolites were finally selected as potential characteristic marker candidates (Table S4).
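The cross-cohort filtering logic can be sketched as follows. This is only an illustration: the ion labels and the dictionary layout are hypothetical, and the sketch assumes that the fold change criterion is applied in both cohorts, which the text does not state explicitly. Removal of redundant ions is not covered.

```python
def consistent_markers(discovery, validation,
                       fdr_cutoff=0.05, fc_up=1.5, fc_down=0.67):
    """Select ions that pass the FDR and fold change criteria in both cohorts.

    discovery, validation: dict mapping an ion label to (fdr, fold_change),
    where fold_change is OSCC relative to HC.
    """
    selected = []
    for ion, (fdr_d, fc_d) in discovery.items():
        if ion not in validation:
            continue
        fdr_v, fc_v = validation[ion]
        # Significant in both cohorts.
        significant = fdr_d < fdr_cutoff and fdr_v < fdr_cutoff
        # Upregulated in both cohorts or downregulated in both cohorts.
        same_direction = (fc_d > 1.0) == (fc_v > 1.0)
        # Assumption: the fold change threshold must hold in each cohort.
        passes_fc = all(fc > fc_up or fc < fc_down for fc in (fc_d, fc_v))
        if significant and same_direction and passes_fc:
            selected.append(ion)
    return selected
```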

Expression of salivary metabolite markers in serum
Given the 106 characteristic metabolites previously identified in salivary metabolic profiling, 30 the extent of their changes in serum between the OSCC and HC groups was investigated. For this inter-specimen validation purpose only, the serum samples from the two cohorts were combined for the rank-sum test. As a result, 52 of the 106 metabolites discovered in the previous saliva metabolomics study remained at abnormal levels in serum (Table S5), although the changes in serum were, for most of these metabolites, not as obvious as those in saliva. Moreover, 33 of these 52 metabolites showed the same direction of change in serum as in saliva (OSCC vs HC). Altogether, 65 metabolites were found to be changed in OSCC compared to HC with statistical significance (FDR < 0.05). These metabolites were treated as serum marker candidates with potential discriminating power for developing the OSCC prediction model.

Feature selection and machine learning model development
A variety of classification models were trained to determine the most suitable one for further development. At the initial stage, all the metabolites selected in the univariate analysis were included as the feature set to train the models. Although all models achieved high performance in the first cohort of 254 cases, with accuracies of no less than 90% (Table S6), their performance in the second cohort (the unseen cases) varied. The SVM achieved an overall accuracy of 86%, with the maximum area under the curve (AUC) value of 0.86 (95% CI: 0.82-0.90). According to the receiver operating characteristic (ROC) curves, SVM also had the highest diagnostic performance, with a sensitivity of 85.1% and a specificity of 90.6% (Figure 3a). Therefore, SVM was selected as the optimal model for further tuning.
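For readers less familiar with the reported metrics, the following Python sketch shows how sensitivity, specificity, and accuracy follow from predicted and true labels. It is a generic illustration with hypothetical names, treating OSCC as the positive class, and it assumes that both classes are present in the true labels.

```python
def diagnostic_metrics(y_true, y_pred, positive="OSCC"):
    """Compute sensitivity, specificity, and accuracy for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of OSCC cases detected
        "specificity": tn / (tn + fp),   # fraction of HC cases correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }
```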
Feature selection is a critical step to avoid overfitting by reducing the model complexity. Recalling that all metabolite ions have statistical significance between the two groups, there were various possibilities for feature selection and combination for model development. To achieve a more robust machine learning model, it is necessary to select the optimal set of metabolites as features. For this purpose, we chose a wrapper-type feature selection strategy that evaluates the chosen machine learning model's performance after training with different candidate feature subsets. 43 (Table S8).
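A wrapper-type strategy of this kind can be sketched as a greedy sequential forward selection. The sketch below is a generic illustration rather than the procedure used in the study: `score` is a hypothetical callable that would train the chosen model (for example an SVM) on a candidate feature subset and return a performance estimate such as accuracy.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset).

    features: list of candidate feature names.
    score: hypothetical callable taking a list of features and returning a
           number where larger is better (e.g., accuracy of the trained model).
    Stops when no remaining feature improves the score.
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate the model once for each candidate extension of the subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

Because every candidate subset requires retraining the model, the cost grows roughly with the square of the number of features, which is one reason to start from the statistically significant ions only.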

Serum metabolomic profiling for OSCC stages
We also investigated whether serum metabolomic pattern differences exist not only between HC and OSCC but also among different stages (from stage T1 to stage T4). The OPLS-DA model visualized the distribution of the OSCC cases staged from T1 to T4 in the two cohorts (Figure 4a). The OSCC samples at the T1 and T2 stages were well separated from those at the T3 and T4 stages. After combining T1 with T2 and T3 with T4, the OPLS-DA model performance improved greatly, to an accuracy of 90.1% (CV = 5). Unfortunately, there was no obvious separation between T1 and T2 or between T3 and T4. The variables that contributed strongly to this T1/T2 versus T3/T4 separation were searched according to their variable importance on projection (VIP) values. The variables with VIP values larger than 1.5 were selected (Table S9). After removing redundant ions, the top 50 metabolites were annotated and their relative contents in serum were displayed in a heatmap. It is worth noting that four of the top 10 metabolites discovered in the univariate analysis (DG(34:2), oleamide, lysoPC(20:4), and nonanoyl carnitine) also contributed to this T1/T2 versus T3/T4 stage discrimination (Figure 4b). As shown in Figure 4c, the relative contents of these metabolites in the OSCC group (T1-T4) were completely different from those in the HC group. Furthermore, they also showed increasing or decreasing trends from T1 to T4. Unfortunately, none of these top 50 metabolites showed statistical significance among the four stages by one-way analysis of variance (ANOVA).

FIGURE 3
The development of serum metabolomics-based machine learning model for OSCC diagnosis. (a) Different machine learning models were initially investigated by comparing their diagnostic performance on the test set. SVM was chosen as the optimal one; (b) The number of features was investigated by a sequential feature selection strategy. Fifteen features were sufficient for the SVM model to achieve the optimal predicting accuracy on the test set.

From the studies of serum metabolomics reported here and the previous saliva metabolomics, the OSCC-associated discriminating metabolites were identified, respectively. Given the identified metabolites, a pathway enrichment analysis was implemented to locate those metabolic pathways that are influenced in serum and saliva (Table S10) (Tables S4 and S5), the changes of many metabolites become less obvious in serum, although the 52 discriminating metabolites discovered in the saliva study still had abnormal abundance in serum. This was observed mostly among the metabolites located in the histidine, arginine, and proline metabolic pathways, which were the major changed pathways in the saliva of the OSCC group. In contrast, the GL, GPL, and SM molecules in serum become the major discriminating markers (Figure S3).

FIGURE 5
The OSCC-associated metabolic pathways and the involved major metabolites. The color denotes the significance value (-log P) of the hypergeometric test which was conducted in the enrichment analysis to evaluate the statistical significance of a certain pathway. The size of the dot represents the impact importance of a certain pathway. The centrality of scattered dots was used as the metric in topology analysis to evaluate the pathway impact, which ranges from zero to one.

Because OSCC occurs in the oral cavity, cancer cells might scavenge nutrient supply either endogenously from the local blood circulation or exogenously from the excretion of the salivary gland. In turn, the OSCC cells' metabolic products will also be exchanged with the extracellular environment and transported through the circulation system. 46 Therefore, this inter-specimen derived difference in dysregulated metabolic pathways might be attributed to the complex biomass transport and exchange differences between the oral environment and endogenous circulation environment. Another possible explanation for why salivary discriminating metabolites have diminished significance in serum is dilution in the global blood circulation. This suggests the possible value of employing serum metabolome data complementary with the salivary metabolome data for OSCC diagnosis based on serum lipid features.
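The hypergeometric test named in the Figure 5 caption can be illustrated with a short Python sketch. It computes the upper-tail probability of finding at least the observed number of discriminating metabolites in a pathway by chance. The argument names are hypothetical, and the sketch does not reproduce the specific enrichment software or background metabolite set used in the study.

```python
from math import comb

def hypergeom_enrichment_p(population, pathway_size, hits_total, hits_in_pathway):
    """P(X >= hits_in_pathway) under the hypergeometric distribution.

    population: number of metabolites in the background set.
    pathway_size: number of background metabolites belonging to the pathway.
    hits_total: number of discriminating metabolites submitted.
    hits_in_pathway: number of submitted metabolites that fall in the pathway.
    """
    upper = min(pathway_size, hits_total)
    total = comb(population, hits_total)
    # Sum the probabilities of observing i = hits_in_pathway ... upper hits.
    tail = sum(comb(pathway_size, i) * comb(population - pathway_size, hits_total - i)
               for i in range(hits_in_pathway, upper + 1))
    return tail / total
```

A smaller P-value means the overlap between the discriminating metabolites and the pathway is less likely to have arisen by chance.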
It is known that cancer cells consume large amounts of nutrients to support their uncontrolled proliferation. 47 Carbohydrates, amino acids, nucleotides, and fatty acids are all targeted, not only as the basic building blocks for proteins, glycans, nucleic acids, and the lipid bilayers of membranes but also as functional agents such as energy fuels, signaling factors, and transport intermediates. [48][49][50] The dysregulation in the aminoacyl-tRNA biosynthesis pathway hints at enhanced protein synthesis. 51,52 Excessive energy consumption was also indicated by the abnormal levels of glucose, lactic acid, free fatty acids (e.g., palmitic acid, palmitoleic acid, caprylic acid, linolenic acid), mono-acyl glycerol (e.g., MG(14:0), MG(16:1), and MG(16:0)), and acylcarnitine (e.g., acetylcarnitine, nonanoyl carnitine, 6-keto-decanoylcarnitine, and pentadecenoyl carnitine) involved in glycolysis and β-oxidation in the mitochondrion. GL, GPL, and SPL are not only critical constituents of the bilayer membrane systems but also serve as regulators of signaling. 53 The abnormal SPL metabolism (e.g., sphingosine 1-phosphate, sphinganine, and phytosphingosine) suggests dysregulated cell proliferation. 54 These dysregulated metabolites could not only serve as potential markers for OSCC diagnosis but might also assist in roughly evaluating the OSCC stages.
Serum metabolome-based profiling and metabolite panel-based detection can serve as a molecular diagnosis approach complementary to traditional tissue-based histopathology and routine visual examination. More than half of the discriminating serum metabolites could be traced to the primary lesion site of OSCC tissue by DESI-MSI, supporting the feasibility of using serum metabolome information for OSCC detection. The remaining discriminating metabolites that were not detected by DESI-MSI might be explained by the limited sensitivity of DESI-MSI, especially for species with low abundance and ionization efficiency. This possible explanation needs to be confirmed in the future with the aid of an LC-MS or GC-MS system after tissue extraction and analyte enrichment.
It was also found that the serum metabolome has potential for roughly assessing OSCC stages (Figure 4a). Although no inter-stage statistical significance has so far been found for any single serum metabolite (Figure 4b), a clear pattern difference appears, especially when OSCC develops from T1 or T2 into T3 or T4 (Figure 4c). This result also emphasizes that a single metabolite signature is neither specific nor sensitive enough to indicate OSCC occurrence and development compared to a combination of characteristic metabolites in the form of a diagnostic panel. We must admit that the features selected for OSCC diagnosis only achieved the distinction between two staging groups, (T1, T2) and (T3, T4). The criteria of traditional hypothesis tests in univariate analysis (conventionally P or FDR < 0.05) might be too conservative for picking out the features that matter for OSCC staging in profiling-based prediction methods. It also implies the importance of multivariate analysis, or even machine learning, in discerning these critical feature variables for profile pattern recognition.

CONCLUSION
We find that the serum metabolic profile obtained by CPSI-MS and analyzed using machine learning can reflect oral cancer development.
Most of the significant metabolites discovered in serum were also found in saliva and cancer tissue, demonstrating the potential of serum for in vitro molecular diagnosis of OSCC. By cohort analysis using CPSI-MS, we found that histidine metabolism, arginine and proline metabolism, sphingolipid metabolism, and aminoacyl-tRNA biosynthesis were present in serum. These findings provide potential clinical markers for indicating OSCC tumorigenesis. We have demonstrated that CPSI-MS is a promising ambient ionization mass spectrometry tool that offers cost-effective performance in monitoring hundreds of biofluidic metabolites with only minor sample pretreatment. The combination of CPSI-MS with ML enabled excellent OSCC prediction performance (89.6% accuracy). More surprisingly, CPSI-MS combined with an OPLS-DA model could differentiate the (T1, T2) stages from the (T3, T4) stages well (90.1% accuracy by cross-validation). All these findings indicate that CPSI-MS/ML can be a very useful tool to provide a simple, fast, affordable method for both OSCC screening and diagnosis.

TRANSPARENT PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/ntls.20210071

ETHICAL STATEMENT
Human serum and tissue samples were collected in strict observance of the ethical code of Nanjing Stomatological Hospital, Medical School of Nanjing University. All patients gave written consent.