Machine Learning and Bioinformatics Analysis for Laboratory Data in Pan‐Cancers Detection

Early diagnosis of cancer is crucial to improving the long‐term survival rate of patients. However, commonly used tumor markers lack sensitivity and specificity for screening purposes. Herein, 10 diagnostic models for 10 common types of cancer are developed by extreme gradient boosting, incorporating 66 laboratory parameters. The datasets consist of a retrospective cohort of 737 503 training and 184 012 validation cases, and a prospective cohort of 174 894 cases for model testing. The areas under the curve of the 10 diagnostic models range from 0.763 to 0.993. Notably, the different models have varying numbers of identical parameters among the 66 test features. Additionally, SHapley Additive exPlanation analysis reveals that 54 nontumor markers contributed significantly to the models. Cosine similarity analysis and clustering analysis demonstrate that some of the 10 cancers share common pathophysiological characteristics. Feature‐based inference graph models are thus performed and infer relationships between nontumor index parameters and cancers with strong correlations. In conclusion, a machine learning‐based pan‐cancer early warning system has been established in this study, which can guide doctors in selecting more accurate testing indicators and assessing the risk of 10 types of cancer with greater precision.


Introduction
Cancer is one of the leading causes of death globally, and early diagnosis is crucial for improving prognosis. Detecting cancer before it reaches stage IV could reduce cancer-related deaths by up to 15% within 5 years. [1] While some screening methods have shown effectiveness in early detection, their sensitivity and specificity are limited, and they are only applicable to certain types of cancer. [2,3] The occurrence of cancer is a complex pathological process involving multiple factors and steps of evolution. Currently, common clinical tumor markers include prostate-specific antigen (PSA), alpha-fetoprotein (AFP), carbohydrate antigen 19-9 (CA19-9), and carcinoembryonic antigen (CEA). However, these single indicators or combinations of a few tumor markers cannot comprehensively reflect the complexity of cancer, and their sensitivity and specificity cannot meet the needs of tumor screening, [4,5] which is one of the reasons why early detection of cancers is difficult to achieve. Weinberg and Hanahan summarized 14 hallmark characteristics of cancer cells in survival, proliferation, and metabolism, such as self-sufficiency in growth signals, evading apoptosis, tumor-promoting inflammation, and polymorphic microbiomes. [6][7][8] More importantly, cancer cells can abnormally transcribe, express, secrete, or release certain substances during the processes of carcinogenesis, proliferation, differentiation, metastasis, necrosis, or disintegration. The body's response to the presence and growth of cancer can also abnormally produce or upregulate some physiological substances, such as proteins, polypeptides, hormones, enzymes, polyamines, oncogene products, viral antigens, and free cells. These abnormal cellular metabolic changes are important common features of cancer and serve as effective and stable indicators.
Clinical testing is a convenient and noninvasive method of examination that involves collecting a variety of sample types, such as blood and body fluids, for analysis. With a wide range of testing options (usually >1000), it is widely used for diagnosing, excluding, classifying, or monitoring diseases due to its ability to reflect metabolic changes in the body. [9,10] However, in the diagnosis of complex diseases such as cancer, the large amount of data generated by laboratory parameters and their testing results has not been fully utilized.
The results of the diagnostic indicators are presented as separate numerical or categorical values, and the large volume of data and the complex interrelationships among numerous parameters make them difficult to analyze using traditional statistical methods. With the development of artificial intelligence, machine learning (ML) has gradually been applied to data analysis in the medical field. [11] ML methods rely on their powerful autonomous learning ability to discover the correlation between routine indicators and diseases from a large amount of test data, mine the deep value of indicators, establish multiparameter combined diagnostic models, [12] and reflect the disease process in a timely manner, thereby achieving more accurate prediction and diagnosis of diseases. [13] Abelson et al. developed an acute myelocytic leukemia predictive model using gradient boosting based on a large electronic health record database. [14] Lynch et al. predicted lung cancer patient survival via Support Vector Machine classification techniques. [15] Other machine learning algorithms such as Logistic Regression, [16] Random Forest, [17] Artificial Neural Networks, [18] and Extreme Gradient Boosting (XGBoost) [19] are also commonly used for disease prediction, treatment recommendation, and risk assessment. So far, very few, if any, pan-cancer screening strategies or detection models based on laboratory data have been developed. Zhang et al. utilized multiplexed nanomaterial-assisted laser desorption/ionization mass spectrometry technology, combined with the Support Vector Machine algorithm, to develop diagnostic models for the diagnosis and classification of pan-cancer. [20] XGBoost [26] outperforms other data mining methods in constructing classifiers and achieving high-precision disease prediction based on tabular data such as electronic health records (EHR). [27] Therefore, we chose the XGBoost algorithm to build the classification models.
In this study, we developed diagnostic models for 10 common types of cancer using clinical laboratory data. Additionally, we conducted bioinformatics analysis to examine the similarity and relationships between different types of cancer, and constructed feature-based inference graph models using the Apriori algorithm and Bayesian statistics. These findings could provide guidance to doctors for selecting more accurate testing indicators and assessing disease risk with greater precision.

Characteristics of the Participants
Following the sample formation, enrollment, and exclusion processes, a total of 1 096 409 outpatient and inpatient cases from January 1, 2017 to October 31, 2020 were ultimately included. This comprised 14 949 191 diagnostic records, 854 standardized diagnostic names, 1993 laboratory features, and 122 365 478 test results. A total of 303 test features were included after the initial feature selection, consisting of 70 qualitative test items and 233 quantitative test items (Table S1, Supporting Information). According to the systematic classification of tumors in the International Classification of Diseases (ICD) version 10, 10 types of cancers were selected for the experimental group (cancer group) based on sample size. There were 27 210 cases of bowel cancer, 12 870 cases of gastric cancer, 12 443 cases of lung cancer, 10 771 cases of pancreatic cancer, 9138 cases of urological cancers (Table S2a, Supporting Information), 8372 cases of liver cancer, 6764 cases of prostate cancer, 2553 cases of breast cancer, 2438 cases of biliary tract malignancy (Table S2b, Supporting Information), and 1524 cases of thyroid cancer. The research workflow is shown in Figure 1, which sets out the screening criteria and process leading to the final modeled sample size and features for each cancer.
The basic characteristics of the participants are presented in Table 1 and Figure S1, Supporting Information. Setting aside prostate cancer and breast cancer, thyroid cancer was significantly more common in females (F) than in males (M) (M:F = 1:3.54, P < 0.0001), whereas the other cancers were more prevalent in males than in females, with a substantial difference in malignant tumors of the urinary system (M:F = 4.34:1, P < 0.0001), liver cancer (M:F = 3.92:1, P < 0.0001), and gastric cancer (M:F = 2.30:1, P < 0.0001). The average age of prostate cancer patients was the highest (70.14 ± 9), and that of thyroid cancer patients was the lowest (44.37 ± 13). The age difference between the sexes was most pronounced in gastric cancer (P < 0.0001), with male patients (63.24 ± 11) being significantly older than female patients (56.8 ± 15). A retrospective cohort comprising 921 515 outpatients and inpatients from January 1, 2017 to December 31, 2019 was randomly divided at a 4:1 ratio into a training cohort of 737 503 patients and a validation cohort of 184 012 patients. A prospective cohort (test cohort), comprising a total of 174 894 outpatients and inpatients from January 1, 2020 to October 31, 2020, was used to evaluate the diagnostic efficacy of the models.

The Model Performance of 10 Types of Cancers Based on XGBoost Algorithm
Using the XGBoost algorithm and the forward stepwise regression method, 10 diagnostic models were established based on the specific modeling features for each of the 10 cancer types. Table S3, Supporting Information, shows the final selection of parameters for each cancer model, including parameter names and number of selections. The lung cancer model had the most parameters, with a total of 16 items, whereas the urological cancers model had the fewest, with only nine items and no tumor markers. The performance metrics for each model in diagnosing its respective cancer type include the area under the curve (AUC), sensitivity (Se), specificity (Sp), Youden's index (YI), accuracy (ACC), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) (Table 2). In the test set, except for biliary tract malignancy, the AUC of the diagnostic models was ≥0.800, with four types of cancer having an AUC > 0.900, namely pancreatic cancer, prostate cancer, breast cancer, and thyroid cancer. Four cancer models achieved an ACC ≥ 0.900, namely lung cancer, pancreatic cancer, prostate cancer, and thyroid cancer; in contrast, urological cancers had the lowest ACC, with a value of 0.702. In particular, the diagnostic models for several high-mortality cancer types [28] demonstrated excellent performance. For instance, the AUC, ACC, and YI values for pancreatic cancer were 0.918, 0.907, and 0.686, respectively; for liver cancer, the values were 0.835, 0.759, and 0.532, respectively; for lung cancer, the values were 0.896, 0.900, and 0.675, respectively; for gastric cancer, the values were 0.806, 0.731, and 0.474, respectively; and for bowel cancer, the values were 0.800, 0.752, and 0.474, respectively. We further compared the diagnostic efficiency of the ML models with that of the most widely used tumor markers: AFP (for diagnosing liver cancer), PSA (for diagnosing prostate cancer), CEA (for diagnosing bowel cancer), and CA19-9 (for diagnosing pancreatic cancer). In the test cohort, the diagnostic efficiency of these four tumor markers was lower than that of the corresponding ML models (liver cancer, AUC 0.755 vs 0.835; prostate cancer, AUC 0.934 vs 0.976; bowel cancer, AUC 0.716 vs 0.800; pancreatic cancer, AUC 0.793 vs 0.918; Figure 2).
These 10 diagnostic models collectively consist of 89 parameters, involving 66 types of laboratory test features, including 57 blood test features, seven urine test features, and two stool test features. Among the blood test features, in addition to 12 tumor markers, there were 45 other routine test features, including six blood cell-related features, 22 biochemical features, two lipoprotein and lipid metabolism-related features, five coagulation features, and 10 immunological features.
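The threshold-dependent metrics reported in Table 2 all follow from the four cells of a confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the counts are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute Se, Sp, YI, ACC, PLR and NLR from confusion-matrix counts."""
    se = tp / (tp + fn)  # sensitivity: true-positive rate
    sp = tn / (tn + fp)  # specificity: true-negative rate
    return {
        "Se": se,
        "Sp": sp,
        "YI": se + sp - 1,  # Youden's index
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        # Likelihood ratios are unbounded when the denominator is zero.
        "PLR": se / (1 - sp) if sp < 1 else float("inf"),
        "NLR": (1 - se) / sp if sp > 0 else float("inf"),
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
m = diagnostic_metrics(tp=80, fp=50, tn=950, fn=20)
print({name: round(value, 3) for name, value in m.items()})
```

The AUC is not included because it is computed from the ranking of predicted scores rather than from a single operating point.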

Feature Importance
SHapley Additive exPlanation (SHAP) values were used to determine the importance of each feature in the disease diagnosis models. The weight of each feature was taken as the mean of the absolute SHAP values for that feature, which served as the basis for ranking feature importance. As shown in Figure 3, the top three ranked features in the models for six types of cancer (liver cancer, bowel cancer, pancreatic cancer, lung cancer, prostate cancer, and breast cancer) contained not only one or two tumor markers (such as AFP, CA19-9, CEA, PSA, and NSE), but also nontumor markers such as albumin, prealbumin, globulin, and fecal occult blood test, which are commonly used in routine laboratory tests. More interestingly, almost all of the top two weighted markers in the diagnostic models for gastric cancer, biliary tract malignancy, thyroid cancer, and urological cancers were nontumor markers. Because urine is directly related to the urinary system and urine samples are readily available, three features in the urological cancers model were urine tests, with urinary leukocyte count being the most heavily weighted feature. Thyroid cancer and biliary tract malignancy do not have a tumor-specific diagnostic index. This study found that nontumor markers contributed heavily to these models, and elevations of total bilirubin and alkaline phosphatase were the most important risk factors for biliary tract malignancy. The most informative features in the thyroid cancer model were mostly thyroid hormones. In addition, both gastric and bowel cancers arise in hollow organs of the digestive system that are involved in food digestion and fecal production, and we observed that fecal-related items (fecal occult blood and fecal blood) contributed significantly to these two cancer models. The SHAP analysis of the 10 models suggested that, in addition to tumor markers (which are clinically relevant), some nontumor markers also play a significant role in the models. Additionally, the analysis showed that 24 parameters appeared in two or more cancer models, and the three parameters with the highest total weight across multiple cancers were CEA-abnormal (A), CA19-9-variable (V), and total bilirubin (Tbil)-A in blood (Figure S2, Supporting Information).
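The ranking rule described above, the mean of the absolute SHAP values per feature, can be sketched in a few lines. The attribution matrix below is hypothetical and merely stands in for the output of a SHAP explainer; the function name is likewise an illustrative assumption.

```python
def rank_by_mean_abs(shap_rows, feature_names):
    """Rank features by the mean absolute value of their per-sample attributions."""
    n_samples = len(shap_rows)
    weights = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    # Largest mean |SHAP| first.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical attributions: rows are patients, columns are features.
features = ["AFP", "albumin", "prealbumin"]
shap_rows = [
    [0.50, -0.20, 0.05],
    [-0.30, 0.40, -0.05],
    [0.70, -0.30, 0.10],
]
ranking = rank_by_mean_abs(shap_rows, features)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Taking absolute values before averaging keeps features whose contributions are large but of mixed sign from cancelling out to a low weight.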

Cosine Similarity Analysis and Cluster Analysis
Different models have varying numbers of identical parameters, suggesting that there may be common pathophysiological changes in the occurrence and development of certain cancers. Eighty-nine parameters were used as variables, and cosine similarity analysis and cluster analysis were employed to investigate the similarities and differences among the 10 types of cancer. Cosine similarity analysis revealed that pancreatic cancer and biliary tract malignancy had the highest similarity (0.52), followed by breast cancer and lung cancer (0.39), lung cancer and gastric cancer (0.34), and bowel cancer and gastric cancer (0.25) (Figure 4a). Cluster analysis can classify cancers, as shown in Figure 4b. Pancreatic cancer, biliary tract malignancy, liver cancer, bowel cancer, lung cancer, and gastric cancer could belong to the same category and were clearly distinguished from the remaining four types of cancer. Within this category, pancreatic cancer and biliary tract malignancy formed one branch, as did lung cancer and gastric cancer, and bowel cancer and liver cancer; the risk parameter compositions within each branch were relatively similar.

Feature Relation
It is widely recognized that not all patients undergo testing for tumor-related indicators during clinical testing, and routine test parameters other than tumor markers are not commonly used for early tumor diagnosis. To evaluate cancer risk in a more timely manner, Figure 5 illustrates the correlation between certain nontumor indicator parameters and strongly associated cancers. If any of the conventional indicators in the outer circle are found to be abnormal during examination, it is recommended that the remaining indicators in the inner circle be retested by the physician to promptly rule out or detect certain cancers.
In case of any abnormal results for stool blood, serum prealbumin, or total hemoglobin concentrations, it is advisable for the individual to undergo further testing for other markers within the inner circle to determine the presence of bowel or gastric cancer. If all of these markers show abnormal results, the next step should involve tests for bowel cancer and gastric cancer (Figure 5a). Similarly, in the event of abnormal results for serum amylase, cholyglycine, direct bilirubin, or alkaline phosphatase, it is recommended that the individual undergo further testing for other markers within the inner circle to ascertain the presence of pancreatic cancer and biliary tract malignancy. If these markers also show abnormal results, the subsequent examination should focus on detecting pancreatic cancer and biliary tract malignancy (Figure 5b).

Discussion
In the present study, the XGBoost algorithm and forward stepwise regression were employed to screen the features of test items and train diagnostic models for 10 types of cancers. The validation cohort and the test cohort both achieved good diagnostic efficacy (AUC 0.763-0.996), with the AUC of prostate cancer, thyroid cancer, breast cancer, and pancreatic cancer all above 0.9.
Compared to previous studies that only diagnose or predict a single cancer, the multicancer early warning system established in our study is more flexible and better suited to practical clinical application scenarios. These 10 types of cancer account for 73.82% (1 782 000/2 414 000) of annual cancer deaths in China, [29] highlighting the potential for this test to provide large-scale benefits to the population.
Different types of cancer exhibit distinct characteristics, but they also demonstrate commonalities due to underlying shared pathological processes and origins. [30] In this study, we used cosine similarity analysis and cluster analysis of laboratory test results to identify, for the first time, similarities and differences between different types of cancers. Cosine similarity analysis [31] is a widely used technique for quantifying the similarity between two vectors, facilitating the determination of their close relationship based on specific features or attributes. Clustering analysis [32] groups similar items together based on shared attributes, enabling the identification of patterns and associations. By combining cosine similarity analysis and clustering analysis, we aimed to gain a comprehensive understanding of the relationships among different types of cancers. The highest similarity score (0.52) was observed between pancreatic cancer and biliary tract malignancy, which were also grouped in the same cluster. This is likely because the biliary tract and pancreas share a common embryological origin, [33] with the ampulla region of the biliary pancreas being considered the closest end of the original biliary tract and pancreas. [34] Numerous studies have demonstrated that biliary and pancreatic-related diseases exhibit similar clinical behavior, histopathology, immune phenotypes, and molecular spectra. [35] The parameters common to the diagnostic models for these two cancers were uric acid, amylase, cholyglycine, direct bilirubin, alkaline phosphatase, and CA19-9. Changes in these enzymes and chemicals are related to the tumor itself and its cellular function. Another interesting finding was that lung cancer and gastric cancer exhibited a similarity score of 0.34, with clustering analysis revealing that lung cancer and digestive system tumors, particularly gastric cancer, belonged to the same category. The diagnostic models for these two cancers shared four common parameters: CEA, albumin, and CA19-9 in blood, and hemoglobin in urine. This may be because the respiratory epithelium from which lung cancer arises and the gastric gland epithelium from which gastric cancer arises both originate from the endoderm. [36] Some studies have demonstrated the similarities and relationships between embryonic development and tumorigenesis in terms of invasive cellular behavior, gene expression, and other important biological behaviors. [37] Cells derived from the endoderm exhibit many common characteristics and basic biological processes, including epithelial-mesenchymal transition and cell migration. [38,39] Mutations in genes such as KRAS, TP53, and PIK3CA could lead to the occurrence of these two types of cancer. [40,41] Additionally, long-term smokers have an increased risk of developing both types of cancer, [42] although the specific mechanism remains to be further studied. Analyzing the similarity and classification of these cancers can guide clinicians in more accurately selecting test indicators and determining the risk of disease, thereby helping to avoid missed diagnosis and misdiagnosis.
All indicators included in this study were commonly used clinical laboratory assays, and the results were easily obtainable. By using machine learning techniques to explore the implicit relationship between these indicators and diseases, more valuable diagnostic opinions could be provided for clinical practice, and the combination of testing items could be simplified to save medical resources. Even more interestingly, based on the features of the parameters, the most likely relevant indicators for further examination can be inferred, which can help doctors quickly identify the most probable direction of disease diagnosis, saving time and improving diagnostic accuracy.
Additionally, we will embed the multitumor early warning system into the routine electronic health record system to verify the accuracy and effectiveness of the early warning model in real time, to optimize the model, and to observe its application value in clinical practice.
In conclusion, we have established a multicancer early warning system using machine learning techniques to diagnose 10 types of cancer from clinical laboratory assay results. The study also uncovered potential shared pathological processes and origin relationships among the 10 types of cancer.

Experimental Section
Data Sources and Data Collection: 1) Data source: The research data were primarily obtained from the integrated data collection system developed by Shanghai Changhai Hospital, which includes the Clinical Data Repository and the Research Data Repository. This platform maps data to controlled vocabularies, such as the Systematized Nomenclature of Medicine Clinical Terms, ICD 9 and 10, Logical Observation Identifiers Names and Codes, Current Procedural Terminology, and Healthcare Common Procedure Coding System. 2) Data collection: The data extracted from this data integration platform include patient basic information, test data, and diagnostic data. Basic information comprises gender and date of birth. Test data include the patient ID number, item test time, sample type, test item, reference interval, test unit, test result, and result status. Diagnostic data include the patient ID number, diagnostic time, diagnostic category, diagnostic name, and ICD code. The diagnosis names of diseases encoded by ICD-9 in the database were standardized with reference to ICD-10 for diagnosis fields. We collected all test data within the 21 days before the date of discharge diagnosis or outpatient diagnosis. [43] For the same item, only the first test result at the time of diagnosis as a group disease was taken. Results were restricted to those obtained after hospitalization and before surgery or medication treatment, and duplicate records were removed from the database. All patient data used in this study were anonymized, and relevant privacy information was removed. The Ethics Committee of Changhai Hospital approved this study (CHEC2021-121).
Data Processing and Filtering: 1) Structuring: Unstructured data such as characters, sets, and reference ranges were transformed into a list format that was easy for computers to process. 2) Standardization: The various types of data in the database, including sample type, test item, result status, test unit, and diagnosis name, were standardized and converted according to the LOINC standard. 3) Feature labeling: Quantitative results within the reference range were labeled as negative (N). Abnormal results were labeled with three kinds of characteristic parameters: first, the data were normalized using the decimal equivalent method, [44] with the result represented by the variable (V); second, the result was directly labeled as abnormal (A); third, the result was labeled as high (H) if it was increased or low (L) if it was decreased. Qualitative features were uniformly labeled as negative (N) or positive (P) based on the results.
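The labeling scheme for the N, A, H, L, and P labels can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout, and the reference interval in the example are hypothetical, and the variable (V) label is omitted because the text does not specify the normalization in enough detail to reproduce it.

```python
def label_quantitative(value, low, high):
    """Label a quantitative result against its reference interval [low, high]."""
    if value < low:
        return {"status": "A", "direction": "L"}  # abnormal, decreased
    if value > high:
        return {"status": "A", "direction": "H"}  # abnormal, increased
    return {"status": "N", "direction": None}     # within the reference range


def label_qualitative(result):
    """Label a qualitative result as positive (P) or negative (N)."""
    return "P" if result.strip().lower() == "positive" else "N"


# Hypothetical reference interval, used purely for illustration.
print(label_quantitative(32.0, low=40.0, high=55.0))
print(label_quantitative(45.0, low=40.0, high=55.0))
print(label_qualitative("Positive"))
```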
We utilized the previously constructed knowledge graph [43] to integrate electronic data from the hospital by leveraging the triple relationship between medical testing entities and diagnostic entities. Specifically, we incorporated data from the cancer groups and filtered them through the knowledge graph, excluding any invalid samples that did not have correlation evidence in the knowledge graph between the test-independent variables and the diagnostic outcome variables. The remaining samples were then used in the modeling process.
Enrolled Population: The data extracted from the database were categorized into two groups, the cancer group and the control group, based on the pathological, discharge, or outpatient diagnosis. The research subjects were identified using the ID number on the first page of the inpatient medical records. The cancer group was divided into the types of cancer included in the study. The inclusion criterion for this group was a definite pathological, outpatient, or discharge diagnosis of "tumor", "cancer", or "space-occupying". The exclusion criteria were cases and data with diagnostic names of "tumor after surgery", "radiation therapy", or "chemotherapy". For each cancer, the control group consisted of cases with all diseases other than that cancer.
Screening of Inspection Characteristics: Initial feature selection: All features with a missing proportion of less than 1‰ in the total sample were determined as initial features. Model feature selection: Features with a missing proportion of less than 50% in each cancer group sample were selected. [45] The following two methods were then combined. The first method calculated, for each feature, the ratio v of its completeness (1 − missing rate) in the cancer group samples to its completeness (1 − missing rate) in the overall samples, and selected the 30 features with the highest v. The second method calculated the Pearson correlation coefficient of each feature and selected the 60 features with the highest values. Missing values were filled with negative values. [43]
Machine Learning Model Development and Performance Measurement: The XGBoost method was utilized to build a binary classification model. The parameters for the model were selected using a forward stepwise method. To assess the robustness and consistency of parameter selection for each cancer type, we employed tenfold cross-validation on the training dataset, which also reduced the potential for overfitting. AUC was used as an overall measure of discrimination. Other parameters used to evaluate model performance included the YI, ACC, Se, Sp, PLR, and NLR.
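The forward stepwise procedure can be illustrated with a generic greedy loop. This is a sketch under stated assumptions rather than the study's implementation: the `score` callable stands in for whatever evaluation is used (for example, a cross-validated AUC of an XGBoost classifier), and the toy scoring function, the gain values, and the stopping rule are hypothetical.

```python
def forward_stepwise(candidates, score, min_gain=0.0):
    """Greedily add the feature that most improves score(selected_features)."""
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Evaluate each remaining feature when added to the current set.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the score enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy stand-in for a cross-validated AUC; the gains are invented for illustration.
GAINS = {"CEA": 0.20, "albumin": 0.08, "FOBT": 0.05, "noise": 0.0}


def toy_score(features):
    return 0.5 + sum(GAINS[f] for f in features)


selected, best = forward_stepwise(GAINS, toy_score)
print(selected, round(best, 2))
```

In this toy run the uninformative feature is never added, because it yields no gain over the three features already selected.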
Model Result Observation-SHAP Values Were Used to Analyze the Characteristic Importance of the Model: SHAP values were used in XGBoost prediction to determine which features contributed the most to model prediction. Each SHAP value measures the contribution (either positive or negative) of a feature to the level of cancer risk assigned by the model. The SHAP value of feature j is obtained by weighting and summing over all possible combinations of feature values, as defined in Equations (1) and (2)

$$\phi_j(\mathrm{val}) = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!} \left( \mathrm{val}_x(S \cup \{j\}) - \mathrm{val}_x(S) \right) \qquad (1)$$

$$\mathrm{val}_x(S) = \int \hat{f}(x_1,\ldots,x_p)\, d\mathbb{P}_{x \notin S} - E_X\left(\hat{f}(X)\right) \qquad (2)$$

where S is a subset of the features used in the model (coalition), x is the vector of feature values of the instance to be interpreted, p is the number of features, and |S|!(p − |S| − 1)!/p! is the weight of subset S, which depends on the order. val_x(S) is the prediction for the feature values in subset S, and features not included in the set S are marginalized.
Model Result Observation-Cosine Similarity Analysis: The connectivity relationship between model parameters was calculated, the cosine similarity between different cancers was evaluated, [46] and the molecular characteristics of cancers were compared. Cosine similarity ranges from −1 to 1, and a higher score indicates greater similarity between two embedding vectors.
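As a brief illustration of the cosine similarity measure, the sketch below compares two parameter-weight vectors. The vectors are hypothetical; how each cancer's parameters were encoded as a vector in the study is not detailed here, so the encoding is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical vectors over four shared parameters (1 = used by the model).
cancer_a = [1, 0, 1, 0]
cancer_b = [1, 1, 0, 0]
sim = cosine_similarity(cancer_a, cancer_b)
print(round(sim, 2))
```

With non-negative weights the score lies between 0 and 1, which is consistent with the values reported in Figure 4a.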
Model Result Observation-Clustering Analyses: Cluster analysis was employed to differentiate between various types of cancer. To conduct the clustering analyses, a dendrogram and heatmap were computed based on the strength of association between 89 biomarkers and 10 different cancers from ICD-10 chapters A to N. The selection of cancers was based on the highest number of significant biomarker associations within each ICD-10 chapter. The 89 biomarkers used in the analysis were previously validated. The clustering of cancers was determined by their biomarker profiles, utilizing complete linkage clustering based on the linear correlation between the association signatures.
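Complete linkage clustering on a correlation-based distance can be sketched as below. The choice of 1 − Pearson correlation as the distance, the cancer names, and the association profiles are all assumptions made for illustration; in practice a library routine such as SciPy's hierarchical clustering would normally be used to obtain the full dendrogram.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def complete_linkage(profiles, n_clusters):
    """Merge clusters until n_clusters remain; cluster distance is the
    largest pairwise distance (1 - correlation) between their members."""
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[(a, b)] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical association profiles over four biomarkers.
profiles = {
    "cancer_A": [0.9, 0.8, 0.1, 0.0],
    "cancer_B": [0.8, 0.9, 0.0, 0.1],
    "cancer_C": [0.1, 0.0, 0.9, 0.8],
    "cancer_D": [0.0, 0.1, 0.8, 0.9],
}
groups = complete_linkage(profiles, n_clusters=2)
print(groups)
```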
Model Result Observation-Feature Relation Diagram: The Apriori algorithm was used to mine the cancer data sets and feature data, [47] and the resulting feature inference content was displayed as connected graphs, including feature set mining and cancer relationship mining.
The Apriori algorithm begins with an association rule expressed as X → Y, where X and Y are disjoint item sets, that is, X ∩ Y = ∅.
Here, an item set can be any combination of the data sets about cancers and features. Next, the frequency of each item set can be calculated from the whole data set. Sup(X) and Sup(Y) denote the frequency of the item sets X and Y, respectively. The frequency of the association rule is as defined in Equation (3)

$$\mathrm{Sup}(X \rightarrow Y) = \mathrm{Sup}(X \cup Y) \qquad (3)$$

The confidence of the association rule (Bayesian statistics) is as defined in Equation (4)

$$\mathrm{Conf}(X \rightarrow Y) = \frac{\mathrm{Sup}(X \cup Y)}{\mathrm{Sup}(X)} \qquad (4)$$

The Apriori algorithm proceeds to determine all association rules X → Y based on Equations (3) and (4), subject to a minimum value of frequency Sup(X → Y) and a minimum value of confidence Conf(X → Y) for the rules. In this study, the minimum frequency was set at 0.2 and the minimum confidence at 0.6. To better investigate the impact of each item, item set X included only one item for each set.
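The frequency and confidence of a single rule can be computed directly from the records, as in the following sketch. It uses the standard definitions of support and confidence; the item labels and records are hypothetical, and only the thresholds (0.2 and 0.6) are taken from the text.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= record for record in records) / len(records)


def confidence(records, x, y):
    """Conf(X -> Y) = Sup(X ∪ Y) / Sup(X); assumes Sup(X) > 0."""
    return support(records, set(x) | set(y)) / support(records, x)


# Hypothetical records: each is the set of abnormal features plus any diagnosis.
records = [
    {"FOBT-P", "PA-L", "bowel_cancer"},
    {"FOBT-P", "bowel_cancer"},
    {"FOBT-P", "PA-L"},
    {"PA-L"},
    {"FOBT-P", "PA-L", "bowel_cancer"},
]

x, y = {"FOBT-P"}, {"bowel_cancer"}
rule_support = support(records, x | y)
rule_confidence = confidence(records, x, y)

# Thresholds stated in the text: minimum frequency 0.2, minimum confidence 0.6.
rule_kept = rule_support >= 0.2 and rule_confidence >= 0.6
print(round(rule_support, 2), round(rule_confidence, 2), rule_kept)
```

A full Apriori implementation additionally prunes candidate item sets whose subsets are infrequent; this sketch only evaluates one rule with a single-item antecedent, matching the restriction on X described above.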
In a connected graph, each pair of nodes is connected by an edge. If any node in the outer circle is found to be abnormal, further detection can be performed by recommending a combination of the remaining nodes in the inner circle. For ease of representation, we group all nodes in the outer circle into the inner circle, and indicate the recommended combination relationships by connecting each node in the outer circle to every node in the inner circle.
Statistical Analyses: All statistical analyses were performed using Python (version 3.6.8). The XGBoost package (version 1.4.2) was utilized for XGBoost. Shapley values, interpreted as feature significance and visualized, were provided by the SHAP package (version 0.39.0). The scikit-learn package (version 0.24.2) was employed for analyzing other models and Pearson coefficients.

Figure 1. Study design and enrollment of participants. Abbreviations: CA, cancer group; CON, control group; FN, feature number.

Figure 2. Single feature and machine learning model in different cancer diagnoses. The red line represents the ML model and the blue line represents the single feature. a) AFP and ML models were used to diagnose liver cancer. b) PSA and ML models were used to diagnose prostate cancer. c) CEA and ML models were used to diagnose bowel cancer. d) CA19-9 and ML models were used to diagnose pancreatic cancer. Abbreviations: ML, machine learning; AUC, area under curve; AFP, alpha-fetoprotein; PSA, prostate-specific antigen; CEA, carcinoembryonic antigen; CA19-9, carbohydrate antigen 19-9.

Figure 3. SHapley Additive exPlanation (SHAP) summary plot. Red dots represent high values (for quantitative features) or 1 (for qualitative features), whereas blue dots represent low values (for quantitative features) or 0 (for qualitative features). For each feature, the location of a dot on the x-axis represents its SHAP value: dots on the left represent contributions that reduce the probability of cancer, whereas dots on the right represent contributions that increase the risk of cancer. a-t) SHAP values and weights of the 10 cancer diagnostic models. Abbreviations: B, blood; U, urine; A, abnormal; P, positive; V, variable; L, low.

Figure 4. Cosine similarity analysis and clustering analyses. a) Cosine similarity analysis. This figure uses cosine similarity analysis to examine the correlation between different types of cancer. The green blocks represent combinations of cancers with high similarity, and the colors range from dark to light to indicate similarity from high to low. The larger the cosine similarity coefficient, the more similar the two types of cancer. b) Heatmap showing the clustering of biomarker association features for different cancer rates. The coloring indicates the degree of correlation, and the dendrograms depict the similarity of the association pattern, computed using complete linkage clustering based on the linear correlation between the association signatures.

Figure 5. Feature reasoning. In a connected graph, there is an edge connection between each pair of nodes. All nodes of the outer circle are assembled into the inner circle and the recommended combination relationship is represented by connecting any node of the outer circle to each node in the inner circle. a) Recommended parameters for bowel cancer and gastric cancer screening. b) Recommended parameters for pancreatic cancer and biliary tract malignancy screening. Abbreviations: A, abnormal; V, variable; PA, prealbumin; THbc, total hemoglobin concentration; CEA, carcinoembryonic antigen; CG, cholyglycine; AMY, amylase; DBIL, direct bilirubin; ALP, alkaline phosphatase; CA19-9, carbohydrate antigen 19-9.

Table 1. Baseline characteristics of the enrolled population with 10 types of cancers.
Abbreviations: CA, cancer group; CON, control group; mean ± SD, mean ± standard deviation.

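Table 2. Performance of 10 ML diagnostic models in validation and test cohorts.
Abbreviations: ML, machine learning; YI, Youden's index; AUC, area under the curve; Se, sensitivity; Sp, specificity; ACC, accuracy; PLR, positive likelihood ratio; NLR, negative likelihood ratio.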