Lifestyle and occupational risks assessment of bladder cancer using machine learning‐based prediction models

Abstract Background Bladder cancer, one of the most prevalent cancers globally, causes considerable morbidity and mortality. The bladder is constantly exposed to environmental agents and to other risk factors such as inflammation. Aims In the current study, we used machine learning (ML) methods to develop risk prediction models for bladder cancer. Methods This population‐based case–control study included 692 cases of bladder cancer and 692 healthy people. ML methods, including Neural Network (NN), Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), Gradient Boosting (GB), and Logistic Regression (LR), were applied, and model performance was evaluated. Results The RF (AUC = .86, precision = 79%) had the best performance, and the RT (AUC = .78, precision = 73%) ranked next. Based on variable importance analysis in RF, the most important factors affecting the probability of bladder cancer were, in order: recurrent infection, bladder stone history, neurogenic bladder, smoking and opium use, chronic renal failure, spinal cord paralysis, analgesic use, family history of bladder cancer, diabetes mellitus, low dietary intake of fruit and vegetables, and high dietary intake of ham, sausage, canned food, and pickles. Conclusion Machine learning approaches can predict the probability of bladder cancer according to medical history, occupational risk factors, and dietary and demographical characteristics.


| INTRODUCTION
Cancer is one of the main causes of death and morbidity nowadays.
According to the Global Cancer Observatory, bladder cancer is the 10th most frequent cancer in the general population. 1 Worldwide studies of cancer demonstrate that 1 out of every 100 men and 1 out of every 400 women experience bladder cancer during their lifetime. 2 Many risk factors contribute to bladder cancer, and they can be categorized into genetic predisposition and exposure to external carcinogens. Evidence shows that many cases of bladder cancer can be attributed to external risk factors, namely smoking and tobacco use, family history of smoking or tobacco use, workplace exposure to cigarette smoke, and past medical history such as bladder stones, neurogenic bladder, recurrent urinary tract infections (UTI), family history of bladder cancer, and diabetes mellitus. [3][4][5] Exposure to certain materials such as petroleum and its derivatives, paint, and some herbal drugs, as well as excessive use of analgesics, has also been regarded as a risk factor in some investigations. 6 Some studies have also suggested that diet and the types of food a person consumes might be useful in predicting the risk of bladder cancer. 7 Machine learning (ML) is a branch of computer science that is increasingly used in medical research. Many researchers apply different ML methods to develop models that predict the risk of diseases, make diagnostic criteria more accurate, or predict the outcome of treatment based on different factors. Knowing the risk factors might help healthcare systems strengthen their primary prevention plans and target them more accurately. The goal of this study, which included a relatively large sample, was to employ ML strategies to determine the influence each risk factor has on bladder cancer.

| METHODS
This population-based case-control study includes 692 cases of bladder cancer and 692 healthy people. Bladder cancer patients were selected from the cancer registry system, and for each case a right-hand next-door neighbor, matched on sex and age, was recruited as a control. More details are presented in a previously published article. 8 Data were randomly divided into 80% and 20% for the training and the testing sets, respectively, each time. Regularization was applied to reduce the difference in accuracy between the training and test datasets, a difference that indicates the model may not generalize as well to the test set as it fits the training set. The regularization terms were an L2 penalty parameter for the NN and a ridge (L2) penalty, the standard regularizer, for stochastic gradient descent and logistic regression. In addition, to reduce overfitting, the grid-search method was applied when tuning hyperparameters to select the best combination of parameters for the data.
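The split, penalty, and tuning steps described above can be illustrated with a minimal standard-library sketch. It is not the Orange workflow used in the study: the function names (`random_split`, `fit_logistic_l2`, `grid_search_l2`), the learning rate, the number of epochs, and the candidate penalty grid are all hypothetical choices made for illustration, and candidates are scored here on a single held-out set rather than by whatever scheme the authors used.

```python
import math
import random


def random_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and testing parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic_l2(rows, labels, l2=0.1, lr=0.1, epochs=200):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2.

    The bias term is left unpenalized (an assumption of this sketch).
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = predict_proba((w, b), x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        # The l2 * wi term is the gradient of the ridge penalty.
        w = [wi - lr * (gw / n + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, rows, labels):
    hits = sum((predict_proba(model, x) >= 0.5) == (y == 1)
               for x, y in zip(rows, labels))
    return hits / len(rows)


def grid_search_l2(train_x, train_y, val_x, val_y,
                   grid=(0.0, 0.01, 0.1, 1.0)):
    """Try each candidate penalty and return (best_l2, best_score)."""
    scored = [(accuracy(fit_logistic_l2(train_x, train_y, l2=c), val_x, val_y), c)
              for c in grid]
    best_score, best_l2 = max(scored, key=lambda pair: pair[0])
    return best_l2, best_score
```

With 1384 rows, `random_split` yields 1107 training rows and 277 testing rows. When tuning, the validation rows passed to `grid_search_l2` would normally be carved out of the training part, so that the 20% test set stays untouched until the final evaluation.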

| Statistical analysis
Descriptive statistics were reported as mean ± SD for continuous variables and as frequency and percentage for categorical ones. Associations between categorical variables were assessed using the chi-square test, and quantitative variables were compared using the independent-samples t-test. A significance level of 0.05 was used in the analysis. Missing data were imputed using model-based imputer methods. In this approach, a separate model is constructed for each attribute. The default model is a 1-nearest-neighbor learner, which takes the value from the most similar example. Lifestyle and occupational risk factors for bladder cancer were explored using the ML approaches. The calibration plot, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, precision, sensitivity, specificity, and F1 indexes were obtained to determine model performance. In the visualization, the best model was selected to calculate the probability of bladder cancer using lifestyle and occupational risk factors. All ML strategies were implemented using Orange software version 3.21.0, which builds data analysis workflows visually. The visual workflows in Orange and the dataset supporting this study's findings are available from the corresponding author upon request.
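The idea of taking a missing value from the most similar example can be sketched as follows. This is a simplified illustration, not Orange's implementation: the function name `impute_1nn` is hypothetical, all attributes are assumed to be numeric and on comparable scales, and similarity is assumed to be the mean squared difference over the attributes that both rows have observed.

```python
def impute_1nn(rows):
    """Return a copy of rows with each None replaced by the value
    found in the most similar row that has that attribute observed."""

    def distance(a, b, skip):
        # Mean squared difference over attributes observed in both rows,
        # excluding the attribute currently being imputed.
        shared = [(x - y) ** 2
                  for k, (x, y) in enumerate(zip(a, b))
                  if k != skip and x is not None and y is not None]
        return sum(shared) / len(shared) if shared else float("inf")

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows where attribute j is observed.
            donors = [r for k, r in enumerate(rows)
                      if k != i and r[j] is not None]
            if donors:
                nearest = min(donors, key=lambda r: distance(row, r, j))
                filled[i][j] = nearest[j]
    return filled


# Toy example with made-up values: the second row is closest to the first,
# so its missing third attribute is copied from there.
example = [[1.0, 0.0, 5.0], [1.1, 0.0, None], [9.0, 1.0, 50.0]]
completed = impute_1nn(example)
```

Distances are always computed on the original rows, so the result does not depend on the order in which missing cells are filled.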

| RESULTS
A total of 1384 subjects, including 692 bladder cancer patients and 692 healthy people, were included in the study. The basic information and characteristics of the sample are summarized in Table 1. Participants in the two groups were matched based on age, BMI, and sex. In past medical history, all included chronic conditions had a significant association with bladder cancer (p-value < .05).
Occupational factors did not show a significant association in univariable analysis. Among lifestyle items, smoking, opium, and analgesics were significant factors for bladder cancer (p-value < .05).
Finally, fruit and vegetable consumption and pickle consumption had a significant effect on bladder cancer (p-value < .05). Histograms for all numerical features are presented in Supplementary 1.
Subsequently, various ML algorithms were applied to assess the models for predicting bladder cancer, considering lifestyle and occupational risk factors. Table 2 shows the performance of the different ML algorithms in terms of AUC, F1, precision, sensitivity, and specificity in the test and train datasets, using 5-fold cross-validation. Considering the evaluation indexes in both datasets, the RF had the best performance, and the RT ranked next. The other approaches, however, had relatively acceptable performance.
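For readers who want to see how the reported indexes are defined, the following standard-library sketch computes precision, sensitivity, specificity, F1, and AUC from predictions, and generates 5-fold train/test index sets. The function names are hypothetical, labels are assumed to be coded 1 for cases and 0 for controls, and the AUC is computed through its rank interpretation (the probability that a randomly chosen case receives a higher score than a randomly chosen control), which may differ in implementation details from what Orange does internally.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Precision, sensitivity, specificity, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the share of (case, control) pairs ranked correctly.

    Ties count as half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

In a cross-validated evaluation, a model would be fitted on each `train` index set, scored on the matching `test` set with the two metric functions, and the per-fold values averaged.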
The ROC curves in Figure 2

| DISCUSSION
In this case-control study, we used ML methods to develop a risk prediction model for bladder cancer according to lifestyle and occupational risk factors. The univariable results showed that half of the 12 important factors for bladder cancer were related to past medical history (diabetes mellitus, chronic renal failure, bladder stone, neurogenic bladder, spinal cord paralysis, and recurrent UTI), along with family history of bladder cancer. Three further factors were lifestyle items: smoking, opium use, and analgesic use. The last two were daily consumption of fruits and vegetables and of pickles, which belong to the dietary factors category. In addition to acquiring a complete patient history, lifestyle and dietary risk factors must be considered. A better understanding of these risk factors will be a great asset to the prevention and management of bladder cancer in the future.
In 2018, a systematic review divided the risk factors into six major groups: smoking, occupational exposure, dietary factors, environmental carcinogens, gender, race, and socioeconomic status. 9 A more recent study in 2020 listed nine groups of risk factors: gender, age, hereditary factors, smoking, environmental and occupational exposure, alcohol, red meat, obesity, and pathogens. 10 In our study, we found that medical history, such as recurrent UTI, smoking and opium use, and daily consumption of vegetables, fruits, and pickles may be important factors related to bladder cancer. This is consistent with the aforementioned study; moreover, our study sheds light on the importance of patients' past medical records and emphasizes the role of smoking and dietary factors in bladder cancer. Despite their importance, a survey showed that most bladder cancer survivors were not aware of any risk factors contributing to their disease. 11  informing the audience about risk factors and how to prevent cancers.
According to our results, smoking was one of the significant risk factors for patients with bladder cancer. Smoking and the use of tobacco-related products have been the center of attention and are considered the most significant and best-known risk factors for bladder cancer. 9,10,12 Dietary factors comprise numerous variables that may promote or prevent bladder cancer depending on their quantity and quality. In our study, the daily consumption of fruits and vegetables, and of pickles, was also associated with a significant reduction in the risk of bladder cancer, ranking fifth in importance in Figure 2. Studies that investigated the association between dietary factors and bladder cancer have yielded inconsistent and controversial results. According to a survey, the role of fruits, vegetables, and micronutrients is still being debated. 13 A meta-analysis indicated that there was no association between the risk of bladder cancer and the total amount of fluids consumed. 14 Two dose-response meta-analyses of tea consumption and alcohol consumption revealed no significant association. 15,16 Finally, the past medical conditions of patients must be considered when estimating the risk of bladder cancer. In our results, recurrent UTI, neurogenic bladder, and bladder stones were all significantly associated with bladder cancer in Table 1  22 ML or DP algorithms have not yet been used to predict the risk of bladder cancer based on risk factors in the lifestyle, dietary intake, environmental, and occupational groups, and that is the novelty of our study.
The study has some strengths and limitations. We used a relatively large sample of patients randomly chosen from the Iranian Cancer Registry system and developed risk prediction models. One limitation of our study was that we could not evaluate the effect of other risk factors. Another limitation is that dietary and lifestyle factors are greatly influenced by culture, ethnicity, and geographical and historical factors, and our study was conducted on the Iranian population, so the results might differ in other populations.

| CONCLUSION
Machine learning approaches can predict the probability of bladder cancer according to medical history, occupational risk factors, and dietary and demographical characteristics.

ACKNOWLEDGMENTS
Not Applicable.