A novel prediction model of the risk of pancreatic cancer among diabetes patients using multiple clinical data and machine learning

Abstract

Introduction: Pancreatic cancer is associated with poor prognosis. Given the rising global incidence of diabetes and the fact that individuals with diabetes are considered a high-risk subpopulation for pancreatic cancer, it is critical to assess the risk of pancreatic cancer among people living with diabetes. This study aimed to develop a novel prediction model for pancreatic cancer risk among patients with diabetes, using a real-world database containing clinical features and employing multiple artificial intelligence algorithms.

Methods: This retrospective observational study analyzed data on patients with Type 2 diabetes from a multisite Taiwanese EMR database between 2009 and 2019. Predictors were selected in accordance with the literature review and clinical perspectives. The prediction models were constructed using machine learning algorithms such as logistic regression, linear discriminant analysis, gradient boosting machine, and random forest.

Results: The cohort consisted of 66,384 patients. The linear discriminant analysis (LDA) model generated the highest area under the receiver operating characteristic curve (AUROC) of 0.9073, followed by the Voting Ensemble and gradient boosting machine models. LDA, the best model, exhibited an accuracy of 84.03%, a sensitivity of 0.8611, and a specificity of 0.8403. The most significant predictors identified for pancreatic cancer risk were glucose, glycated hemoglobin, hyperlipidemia comorbidity, antidiabetic drug use, and lipid-modifying drug use.

Conclusion: This study successfully developed a highly accurate 4-year risk model for pancreatic cancer in patients with diabetes using real-world clinical data and multiple machine-learning algorithms. Potentially, our predictors offer an opportunity to identify pancreatic cancer early and thus widen the prevention and intervention windows to improve survival in patients with diabetes.

Pancreatic cancer has one of the worst prognoses among all cancer types. Its pathophysiology makes early-stage detection exceptionally difficult. Due to a lack of effective screening and diagnostic tools, most pancreatic cancer cases are diagnosed when the tumor has already reached a locally advanced or metastatic stage. Even among patients who undergo surgery, the prognosis remains poor: of the 10%-20% of pancreatic cancer patients who undergo surgical resection after diagnosis, only about 20% survive for 5 years. 1,2

Regarding the correlation between diabetes and pancreatic cancer, previous publications observed that up to 85% of pancreatic cancer patients have diabetes at the time of the cancer diagnosis. 10 An additional concern is that diabetes mellitus (DM) is associated with a high risk of mortality for cancer patients. Studies have also identified insulin resistance and hyperinsulinemia as factors related to the risk of pancreatic cancer among patients with long-term diabetes. 17

Previously, many scholars developed prediction models for the risk of pancreatic cancer in patients with diabetes. In particular, Dong et al. used a traditional logistic regression (LR) model and found eight important predictive factors: age at onset of diabetes, body mass index (BMI), hepatitis B virus (HBV), total bilirubin (TBIL), alanine aminotransferase (ALT), creatinine (Cr), apolipoprotein A1 (APO-A1), and white blood cells (WBCs). 18 This research established a prediction model with an area under the receiver operating characteristic curve (AUROC) of 0.815. 17 Hsieh et al. applied data from the Taiwan National Health Insurance database and developed prediction models using both traditional LR and a deep-learning artificial neural network (ANN) algorithm; the traditional LR model (AUROC = 0.727) was more accurate than the ANN model. 2 Boursi et al. (2022) used data from The Health Improvement Network (THIN) database in the UK and a traditional LR to establish a prediction model for the risk of pancreatic cancer within 3 years for patients with pre-diabetes, with an AUROC of 0.71 and important predictive factors including age, BMI, total cholesterol, proton pump inhibitor use, ALT, low-density lipoprotein, and alkaline phosphatase. 19

Diabetes and pancreatic cancer are strongly associated. Previous studies have established pancreatic cancer risk prediction models for patients with diabetes. However, they did not use complete clinical data types or multivariate machine learning algorithms, resulting in unsatisfactory model performance. This study aimed to comprehensively identify features that influence the occurrence of pancreatic cancer in diabetic patient populations and to develop a more accurate AI-based prediction model for personalized pancreatic cancer risk assessment.

[...] utilized services within the outpatient department (OPD) or were admitted to the inpatient department (IPD). We excluded individuals under the age of 40 years, patients with Type 1 DM (T1DM), and those who had been diagnosed with pancreatic cancer (ICD-O-3: C25) prior to a Type 2 DM (T2DM) diagnosis. Patients without visit history and those who received no antidiabetic medications (Anatomical Therapeutic Chemical [ATC] code: A10) for DM were also excluded. Subsequently, 66,384 records from the TMUCRD were used for this study. This included 13,504 patients from TMUH, 24,075 patients from WFH, and 28,805 patients from SHH. Figure 1 shows our population selection process.

| Outcome measurement
The index date for this study was defined as the date when antidiabetic medications were prescribed. The outcome was any occurrence of pancreatic cancer within 4 years following this index date. 22 Patients with pancreatic cancer were identified using data from the TMUCRD with the ICD-O-3 code C25. Participants were censored at loss to follow-up, death, or the end of the study on December 31, 2019.

| Predictors
The predictors included diagnoses, medications, and laboratory tests from outpatient or admission datasets. The particular predictors were as follows:
1. Demographic characteristics (i.e., gender, age, and body mass index [BMI]).
2. Comorbidities before the prescription date of antidiabetic drugs (e.g., cardiovascular, chronic obstructive pulmonary, and rheumatic diseases) and the Charlson comorbidity index (CCI) score.
3. Long-term medications (e.g., antacids, drugs for gastroesophageal reflux disease [GORD], and agents for gastrointestinal disorders) prescribed during the 6 months before the prescription of an antidiabetic drug.
4. Laboratory test results (e.g., glycated hemoglobin [HbA1c], glucose AC, and albumin) within 12 months before the prescription of an antidiabetic drug.
Median imputation was applied for missing continuous predictors.
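As an illustration of the imputation step, the following minimal Python sketch fills missing values of a continuous predictor with the median of its observed values. The function name `median_impute` and the HbA1c readings are hypothetical and are not taken from the study. The study does not state whether medians were computed on the training data only, so that choice is left open here.

```python
from statistics import median


def median_impute(values):
    """Replace None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical HbA1c readings (%), two of which are missing.
hba1c = [6.8, None, 7.4, 9.1, None, 7.0]
imputed = median_impute(hba1c)
print(imputed)
```

The median is less sensitive than the mean to the skewed laboratory values that are common in clinical data, which is one usual reason for preferring it.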

| Statistical analysis
Descriptive analyses of the study population were conducted, reporting the frequency (%) for categorical variables and the mean (standard deviation [SD]) for numerical variables. The univariate analysis investigated significant correlations between factors and the outcome variable. Univariate and multivariate logistic regression (LR) analyses were used to estimate the association of each factor with the outcome. Our statistical analyses were completed using R version 4.1.3 (R Project for Statistical Computing).
Statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a "true" effect that is unlikely to result from noise. By contrast, machine learning (ML) concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. ML methods are particularly helpful when one is dealing with "wide data", where the number of input variables exceeds the number of subjects, in contrast to "long data", where the number of subjects is greater than that of input variables. Classical statistics and ML vary in computational tractability as the number of variables per subject increases. Classical statistical modeling was designed for data with a few dozen input variables and sample sizes that would be considered small to moderate today. However, as the numbers of input variables and possible associations among them increase, the model that captures these relationships becomes more complex. 33

| Model training and testing
We included patient data from two sites, TMUH and WFH, in the training dataset. We conducted stratified fivefold cross-validation on the training set to evaluate the performance and errors of the different algorithms. All patients in the training dataset were divided into five groups, and each group served as the internal validation set in one of five replications. After developing the models, we used patient data from SHH for external testing to assess generalization. The external test thus evaluated whether models trained on TMUH and WFH data could predict outcomes at a hospital not used for training, as an indication of generalization to other hospitals.
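The stratified fivefold split described above can be sketched as follows. This is a minimal, standard-library illustration and not the study's code: the function name `stratified_kfold`, the random seed, and the toy cohort of 10 cases among 1,000 patients are assumptions made for the example. Stratification matters here because, with a rare outcome, a plain random split could leave a fold with no cases at all.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every fold keeps each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)  # deal each class round-robin across folds

    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


# Hypothetical rare-outcome cohort: 10 cases among 1,000 patients.
labels = [1] * 10 + [0] * 990
splits = list(stratified_kfold(labels))
for fold_number, (train, val) in enumerate(splits, start=1):
    cases = sum(labels[i] for i in val)
    print(f"fold {fold_number}: {len(val)} validation patients, {cases} cases")
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead of a hand-written splitter.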

| Oversampling
We applied the ADASYN (Adaptive Synthetic Sampling) resampling technique exclusively to the training dataset, while maintaining the original sample size of the testing dataset in alignment with real-world clinical practice patterns. ADASYN is a widely adopted approach within machine learning for addressing imbalanced datasets. It aims to rectify class distribution imbalance by creating synthetic instances for the underrepresented class. A key feature of ADASYN is that it considers both the data distribution and the feature space density, thereby enhancing its effectiveness. 34
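To make the idea concrete, the following is a simplified sketch of the ADASYN procedure in plain Python. It is an illustration under stated assumptions, not the implementation used in the study: the function name `adasyn`, the parameters `k`, `beta`, and `seed`, and the two-feature toy data are all hypothetical. Each minority point is weighted by the share of majority samples among its `k` nearest neighbours, and synthetic points are then interpolated between that point and nearby minority points.

```python
import math
import random


def adasyn(X, y, k=3, beta=1.0, seed=0):
    """Simplified ADASYN sketch: return synthetic points for the minority class (label 1)."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == 1]
    if len(min_idx) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    n_major = len(y) - len(min_idx)
    total = round((n_major - len(min_idx)) * beta)  # number of synthetic points wanted

    def neighbours(i, candidates):
        # Indices of the k candidates closest to X[i] (Euclidean), excluding i itself.
        ranked = sorted((j for j in candidates if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        return ranked[:k]

    # Difficulty of each minority point: share of majority samples among its neighbours.
    ratios = []
    for i in min_idx:
        nn = neighbours(i, range(len(X)))
        ratios.append(sum(y[j] != 1 for j in nn) / len(nn))
    norm = sum(ratios)
    if norm > 0:
        weights = [r / norm for r in ratios]
    else:  # no minority point borders the majority class: spread evenly
        weights = [1 / len(min_idx)] * len(min_idx)

    # Harder points get more synthetic samples, placed on segments to minority neighbours.
    synthetic = []
    for i, w in zip(min_idx, weights):
        mates = neighbours(i, min_idx)
        for _ in range(round(w * total)):
            j = rng.choice(mates)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Hypothetical two-feature data: 4 minority points and 12 majority points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [2.0, 2.0]]
majority = [[2.2 + 0.4 * a, 2.1 + 0.4 * b] for a in range(4) for b in range(3)]
X = minority + majority
y = [1] * len(minority) + [0] * len(majority)
synthetic = adasyn(X, y)
print(len(synthetic), "synthetic minority points")
```

In this toy data only the minority point at (2.0, 2.0) is surrounded by majority samples, so all synthetic points are generated from it. This is the adaptive behaviour that distinguishes ADASYN from uniform oversampling. Libraries such as imbalanced-learn provide a complete implementation.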

| Evaluation of model performance and interpretation
We computed various metrics, including the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score, to assess and contrast the performances of all prediction models. To determine the best model, we compared the models using the external test results and selected the model with the highest AUROC. We performed all data processing using MSSQL Server 2017 and conducted model development and validation using Python version 3.9. 35 To interpret the model, we analyzed the influence of each predictor (i.e., feature importance) on the best-performing model using SHapley Additive exPlanations (SHAP) values. 36
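The metrics listed above follow directly from the confusion matrix and from the ranking of predicted scores. The sketch below shows one way to compute them in plain Python. The helper names (`safe_div`, `confusion_metrics`, `auroc`), the 0.5 decision threshold, and the ten example patients are assumptions for illustration only. AUROC is computed here as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(y_true, y_pred):
    """Threshold-based metrics for binary labels (1 = case, 0 = non-case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


def auroc(y_true, scores):
    """AUROC as the fraction of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))


# Hypothetical predictions for 10 patients (2 cases, 8 non-cases).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

metrics = confusion_metrics(y_true, y_pred)
metrics["auroc"] = auroc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

When the outcome is rare, non-cases dominate the accuracy calculation, so accuracy tends to track specificity closely. This is one reason AUROC, rather than accuracy, is a reasonable basis for comparing models on such data.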

| RESULTS
Baseline characteristics of the study population, including patients' demographic information, diabetes treatment, comorbidities, long-term medications, and laboratory test results, can be found in Table 1. The study encompassed records from 66,384 patients. The training dataset included 37,579 records and the test dataset included 28,805 records. Among all patients living with DM, 89 patients were diagnosed with pancreatic cancer. The mean age of all patients was 64.8 years, and the proportion of males (53.1%) was higher than that of females (46.9%). The mean BMI was 26.2 kg/m² (SD: 4.87 kg/m²), slightly above the healthy range (18.5-24 kg/m²). The mean CCI score was 2.2 (SD: 1.43).
Table 2 shows the results of the univariate and multivariate logistic regression analyses. Prior to adjustment, male sex, comorbid hypertension, a higher CCI score, laxative use, diuretic use, higher HbA1c, and higher glucose AC were significantly associated with a higher risk of pancreatic cancer. Conversely, the use of biguanides and of combinations of oral blood glucose-lowering drugs was associated with a significantly lower risk of pancreatic cancer. After adjustment, only hypertension, diuretic use, and higher HbA1c remained associated with a significantly higher risk of pancreatic cancer.
Figure 3 shows the feature importance levels for the LDA model (the full model). The five most important features identified were glucose AC, HbA1c, hyperlipidemia comorbidity, antidiabetic drug use, and lipid-modifying drug use.
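For a linear scoring function, such as the discriminant score of an LDA classifier, SHAP values have a simple closed form when features are treated as independent: the contribution of a feature for one patient is its coefficient multiplied by the deviation of that patient's value from the background mean. The sketch below ranks features by mean absolute SHAP value under that assumption. The coefficients, feature names, and standardized data are hypothetical, and the study does not state which SHAP explainer was used, so this should be read as an illustration of the principle rather than a reproduction of Figure 3.

```python
def linear_shap(weights, row, means):
    """SHAP values of one row for f(x) = w . x + b, assuming independent features."""
    return [w * (x - m) for w, x, m in zip(weights, row, means)]


def mean_abs_shap(weights, rows):
    """Global importance: mean absolute SHAP value of each feature over all rows."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    totals = [0.0] * len(weights)
    for row in rows:
        for i, phi in enumerate(linear_shap(weights, row, means)):
            totals[i] += abs(phi)
    return [t / n for t in totals]


# Hypothetical coefficients and standardized feature values for three patients.
names = ["glucose_ac", "hba1c", "bmi"]
weights = [0.8, 0.5, -0.1]
rows = [[1.0, 0.5, -1.0], [-1.0, 0.5, 0.0], [0.0, -1.0, 1.0]]

importance = mean_abs_shap(weights, rows)
ranking = sorted(zip(names, importance), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

A useful check on any SHAP computation is that the values for one patient sum to the difference between that patient's score and the score at the background mean.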

| DISCUSSION
Observational research is beneficial to early prevention interventions and disease detection because it offers a wide breadth and depth of information on specific diseases and complications. In addition, emerging AI technologies afford an opportunity to establish personalized risk prediction models to optimize patient-centered personalized medicine and outcomes. This pivot in contemporary clinical medicine allows researchers to go beyond traditional statistical methods of exploring relationships between variables in a population of individual patients. This is particularly valuable for the prevention and management of rare diseases. It is of exceptional importance for patients with diabetes to engage in preventive measures and early detection to minimize future complications related to pancreatic cancer, 37 particularly given the increased mortality rate of pancreatic cancer patients and the elevated risk of pancreatic cancer that diabetic populations experience. 37

Previous studies on the association between diabetes and the risk of pancreatic cancer identified a variety of potential risk factors. This study offers a more robust review by using more accurate AI prediction models. The continuous emergence of AI and sharing of information allows studies to build on each other. This study successfully used higher-dimensional data and more advanced algorithms to construct a more accurate prediction model and to explore the most vital predictors affecting the performance of the model. This can serve as a reference for future clinical decisions for treating diabetes and preventing pancreatic cancer. Consistent with previous publications, we found that higher HbA1c levels, 17,25 which typically indicate poorly controlled diabetes, and comorbid hypertension were associated with a significantly higher risk of developing pancreatic cancer. 38 Unlike previous publications, this study found that diabetes patients who use diuretics for hypertension had a significantly higher risk of developing pancreatic cancer than those who do not use diuretics. This has direct clinical implications for provider screening and prevention education.
Besides the aforementioned risk factors, previous publications found that age, 18,23,24 hepatitis B virus (HBV) status, 18,39,40 and hepatobiliary malignancies were also risk factors for pancreatic cancer in patients with diabetes. 41 However, this study did not find significant relationships for these factors. A number of previous studies showed that a higher BMI increases the risk of pancreatic cancer, 18,38,42 while others indicated that a lower BMI increases the risk. 19 This study did not find an association between BMI and pancreatic cancer. Finally, Baecker et al. found that patients with diabetes and peptic ulcer disease had a higher risk of pancreatic cancer, while those with depression had a lower risk. 25 However, these associations were not found to be significant in this study.
Using the TMUCRD gave this study several advantages over previous publications in terms of data. First, previous studies mostly used older data, while this study used more recent clinical data (2009-2019). In addition, this study used a more comprehensive set of clinical predictors, which sets it apart from previous research. Some previous studies did not include long-term medication information, laboratory test results, or comorbidities. 2,18,23 In contrast, we intentionally encompassed a wide range of data features, including demographic information; diabetes severity and treatments; comorbidities; long-term medications; and laboratory test results, based on updated literature reviews and input from clinical experts.
Constructing precise prediction models relies on multiple advanced AI algorithms. Most previous studies of risk prediction models for pancreatic cancer in patients with diabetes used only traditional biostatistical multivariate LR methods, or compared a single AI algorithm with the traditional LR method. 2,18,23 The results of these studies do not show an advantage of AI algorithms over traditional LR methods. 2,18,23 Due to the limitations of the algorithms used, the predictive performance of previous risk prediction models for pancreatic cancer in patients with diabetes was lower than ours (AUROC values of <0.82). This study used eight algorithms (traditional LR and seven AI machine-learning techniques: LDA, LightGBM, GBM, RF, XGB, SVC, and Voting). We compared the predictive performances of the models built with these different algorithms. The best prediction model in this study had an AUROC of 0.9073, which was higher than the results of previous studies; 2 this advantage is attributed to our use of multivariate algorithms and more comprehensive clinical data.
According to the best predictive model in this study, the two most significant predictive factors for the risk of pancreatic cancer were indicators of blood glucose control in diabetes (HbA1c and glucose AC). If blood glucose in diabetes is well controlled, the risk of pancreatic cancer can be greatly reduced, which is consistent with the results of most past observational studies and predictive model studies. 17,23,25 Results of this study also showed that coexisting hyperlipidemia, as a comorbidity, was a vital predictive factor for the development of pancreatic cancer in patients with diabetes. This aligns with the previous predictive model study (i.e., patients with diabetes and comorbid hyperlipidemia had a lower risk of developing pancreatic cancer). 23

This study had several limitations. First, the study used electronic medical records from multiple hospitals as data sources, and although it included robust clinical information (i.e., demographics; disease treatment information; disease histories and comorbidities; long-term medication use; and significant test results), it still lacked other less easily obtainable data types, such as personal lifestyle (i.e., diet, exercise, smoking, and drinking) and socioeconomic information. Future research may collect this information and establish alternative prediction models. Second, the electronic medical records only contained detailed information on patients' services while visiting TMUH, WFH, and SHH. Our research did not include information on visits to other medical facilities. Therefore, patients' clinical information might be incomplete, which could lead to inaccurate predictions in the model results. To address this concern and obtain a more accurate prediction model, this study adopted an external validation mechanism: data from two hospitals (TMUH and WFH) were used to establish the prediction model, and data from the third hospital (SHH) were used for external testing to obtain the final prediction results. Finally, the data sources used in this study (including external validation) were all clinical data from Taiwanese hospitals. Prediction models established using data from a single country may lack generalizability. Therefore, we suggest that future research be carried out through international cooperation, using standardized case selection methods, the same research design and analytical methods, and data concepts that follow a common protocol. This will enable the use of big data from different countries to understand differences among countries, achieve mutual verification of results, and enhance the clinical value of the prediction model.

| CONCLUSIONS
This study successfully developed a novel and precise computer-aided risk prediction model for pancreatic cancer in patients with diabetes. Among the various models, the LDA algorithm achieved the highest AUROC (0.9073 on external testing). The significant clinical features that affected the model included glucose AC, HbA1c, hyperlipidemia as a comorbidity, antidiabetic drug use, and lipid-modifying drug use. The model can support the clinical treatment and management of patients with diabetes to help prevent pancreatic cancer. Clinicians might benefit from these results when considering patients' health conditions and prescribing medications.

FIGURE 1 Population selection process flowchart.

| 19991 CHEN et al.

FIGURE 2 Receiver operating characteristics (ROC) curves of the prediction models.

FIGURE 3 Importance ranking of features generated from the best model (linear discriminant analysis [LDA]).
TABLE 1 Basic characteristics of the study cohort. Abbreviations: CCI, Charlson comorbidity index; SD, standard deviation; min, minimum; max, maximum.

TABLE 2 Results of the logistic regression.

Performance of the prediction models.