Development of an Open‐Access and Explainable Machine Learning Prediction System to Assess the Mortality and Recurrence Risk Factors of Clostridioides Difficile Infection Patients

Identifying Clostridioides difficile infection (CDI) patients at risk of mortality or recurrence facilitates prevention and timely treatment and improves clinical outcomes. The aim herein is to establish an open‐access web‐based prediction system that estimates CDI patients’ mortality and recurrence outcomes and explains the machine learning predictions in terms of patients’ characteristics. Prognostic models are developed using four types of machine learning algorithms and a statistical logistic regression model, utilizing data from over 15 000 CDI patients from 41 hospitals in Hong Kong. The boosting‐based gradient boosting machine (GBM) (Mortality AUC: 0.7878; Recurrence AUC: 0.7076) outperforms the statistical model (Mortality AUC: 0.7573; Recurrence AUC: 0.6927) and the other machine learning algorithms. Because the difficulty of interpreting complex machine learning results limits their use in medicine, Shapley additive explanations (SHAP) are adopted to identify which features are crucial to the machine learning models and to associate them with clinical findings. SHAP analysis shows that older age, reduced albumin levels, higher creatinine levels, and higher white blood cell count are the features most strongly associated with mortality, which is consistent with existing clinical findings. The open‐access prediction system for clinicians to assess and interpret the risk factors of CDI patients is now available at https://www.cdiml.care/.

and associating them with existing clinical findings are one of our major objectives.
This study aims to estimate the mortality and recurrence outcomes of Clostridioides difficile infection (CDI) patients. CDI is the most common nosocomial enteric infection, and its symptoms can range from mild diarrhea to severe sepsis with organ failure, which may lead to significant morbidity and mortality. [9] Owing to the high transmissibility of C. difficile and the increased risk associated with the widespread use of antibiotics, the disease carries a considerable health burden. We have previously reported that in Hong Kong, the incidence of CDI has increased by 26% from 15.41 cases to 36.31 cases per 100 000 persons from 2006 to 2014. [10] The estimated CDI incidence in the USA in 2011 was almost half a million cases. [11] The estimated number of deaths within 30 days of the initial diagnosis was 29 300, and the number of patients who experienced at least one recurrence was 83 000. Up to 20-35% of CDI patients suffer from CDI recurrence, of whom 45-65% develop multiple recurrent episodes. [12,13] Intelligent systems that assess CDI patients' severity at an early stage have important clinical implications for disease management, for reducing the risk of disease transmission, and for improving clinical outcomes. [16] With the changing epidemiology of CDI, [9,13-15] clinicians need immediate and reliable diagnostic tools to assess disease severity and predict clinical outcomes. Therefore, a robust prediction system that uses statistical or machine learning models to identify patients at high risk of mortality or recurrence allows upfront planning of medical treatment to improve survival outcomes.
An LR (logistic regression) model for predicting inpatient mortality and other disease-related outcomes, built on data from 2 065 patients in two US academic institutions, has been reported. [17] Random forest (RF), a machine learning algorithm based on the ensemble method, was applied to CDI recurrence prediction in 2014 using data from 198 Caucasian patients from two hospitals. [18] However, patient data acquired from a few institutions lack generalizability when compared with administrative databases that have a large sample size and cover patients from multiple centers. The Nationwide Inpatient Sample of the USA was studied to construct a CDI severity score using multivariate LR; however, neither clinical and treatment data nor laboratory results (e.g., white blood cell count) were available in the database. [19] With more systematic data collection in the EHR system, statistical or machine learning models can utilize a more comprehensive set of patient features to generalize underlying clinical patterns and provide precise predictions.
Therefore, the aim of our study is to establish a prediction system based on a large, detailed, and well-established EHR of over 15 000 CDI episodes to estimate CDI patients' mortality and recurrence outcomes. [10] We also compared the accuracy of four types of machine learning algorithms with that of regression analysis and evaluated the feature importance of the best-performing model against known clinical findings. The prediction system with the best-performing machine learning model is now available online for clinicians to assess and interpret the risk factors of CDI patients.
The study cohort has been described previously. [10] In brief, records of patients diagnosed with CDI between 2006 and 2014 were obtained from 41 public hospitals in Hong Kong using the Clinical Data Analysis and Reporting System (CDARS), a well-established electronic database managed by the Hong Kong Hospital Authority comprising laboratory and clinical records covering over 90% of all inpatient services in the territory. A CDI case was defined as a patient with a positive result on culture, toxin, or molecular assay of a diarrheal stool sample. Mortality was defined according to patients' vital status within 30 days after CDI diagnosis, whereas recurrence was defined as a recurrent diarrheal stool specimen with a positive test result within 60 days after completion of CDI treatment.
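The two outcome definitions above can be sketched as date comparisons. This is a minimal illustration only: the function names, the argument names, and the treatment of boundary days (day 30 and day 60 counted as inside the window) are assumptions, not details taken from CDARS or the study.

```python
from datetime import date, timedelta
from typing import Optional


def died_within_30_days(diagnosis: date, death: Optional[date]) -> bool:
    """Mortality label: death recorded within 30 days after CDI diagnosis.

    Assumption: a missing death date means the patient was alive.
    """
    if death is None:
        return False
    return timedelta(0) <= death - diagnosis <= timedelta(days=30)


def recurred_within_60_days(treatment_end: date,
                            next_positive: Optional[date]) -> bool:
    """Recurrence label: positive stool test within 60 days after treatment ended."""
    if next_positive is None:
        return False
    return timedelta(0) < next_positive - treatment_end <= timedelta(days=60)
```

Keeping the labeling rules in small pure functions makes the window boundaries explicit and easy to audit against the clinical definitions.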
The patient features used for model development are shown in Table 1. Features were grouped into patient demographics, admitting diagnosis, laboratory results, past surgical procedures, medication prescriptions, comorbid disease diagnoses, and clinical outcomes. Descriptive statistics are also shown in Table 1. Continuous variables are presented as mean ± s.d. and were compared using two-sample t-tests. Categorical variables are reported as N (%) and were compared using Pearson's chi-square test (χ² test). The two-sample t-test assesses whether the means of two independent populations differ significantly by comparing the two sample groups. The χ² test of independence assesses whether two categorical variables are significantly associated. The p-values of the tests are shown in Table 1.
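As a rough illustration of the two tests, the sketch below computes the test statistics with the Python standard library only. It assumes the pooled-variance form of the two-sample t-test (the text does not say whether a pooled or Welch variant was used), and the p-value helper covers only the χ² case with one degree of freedom (a 2 × 2 table). All function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value_1df(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))
```

For a 2 × 2 table of outcome versus a binary feature, `chi_square_p_value_1df(chi_square_statistic(table))` gives the p-value; larger tables need the general χ² distribution, which the standard library does not provide.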
We next analyzed the clinical parameters associated with disease recurrence. Among the 15 864 patients, 1 219 suffered CDI recurrence within 60 days, whereas 14 645 did not. Healthcare-associated infection was significantly associated
To improve generalization in CDI mortality and recurrence prediction, all data were divided into training and test sets according to patients' healthcare institutions. This external validation method is preferred over randomly splitting the data into training and test sets because factors such as clinical settings or populations may vary between hospitals. [25] The training set comprised 33 hospitals and 76% of the samples (11 470 samples in mortality prediction and 12 024 samples in recurrence prediction), and the test set comprised the remaining 8 hospitals and 24% of the samples (3 698 samples in mortality prediction and 3 840 samples in recurrence prediction). The overall workflow of model selection and development is shown in Figure 1. Five prognostic models for CDI mortality and recurrence were developed using the statistical LR model and four types of machine learning algorithms: support vector machines (SVM), GBM, RF, and neural network (NN). LR and SVM both separate the classes with a decision boundary in the n-dimensional feature space: LR fits the boundary by maximizing the likelihood of the observed labels, whereas SVM maximizes the margin between the boundary and the nearest data points. RF and GBM are built from decision trees, which recursively split the features in a top-down induction manner to distinguish between classes; they use bootstrap aggregating and boosting, respectively, to combine weak decision tree classifiers into a strong classifier. NN mimics the human brain for pattern recognition and uses a feedback mechanism to learn its weights by minimizing the classification error. CDI mortality and recurrence score systems have generally been developed with statistical LR, [17,19,26] owing to the ease of interpreting its model coefficients; nonetheless, the other four machine learning algorithms have been adopted to handle more complicated problems such as detecting C. difficile toxins in stools [27] or modeling immunoregulatory therapeutics for treating CDI. [28] To avoid selection bias and reduce overfitting, the optimal hyperparameters for each model were determined by grid search and validated through fivefold cross validation. [29] The final model for each algorithm was the one whose hyperparameters produced the lowest validation error. To evaluate each model's discriminative power, the area under the receiver operating characteristic (ROC) curve (AUC) was measured. The model with the highest AUC was chosen for development into an open-access prediction system.
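The hospital-wise split described above can be sketched as follows. The key property is that every record from a given hospital lands on the same side of the split. The record layout, the function names, and the random choice of hold-out hospitals are assumptions for illustration; the text does not state how the 8 test hospitals were selected.

```python
import random


def choose_test_hospitals(hospital_ids, n_test, seed=0):
    """Pick hold-out hospitals at random (assumed; the selection rule is not given)."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(hospital_ids), n_test))


def split_by_hospital(records, test_hospitals):
    """Assign whole hospitals to train or test so no hospital appears in both."""
    train = [r for r in records if r["hospital"] not in test_hospitals]
    test = [r for r in records if r["hospital"] in test_hospitals]
    return train, test


# Toy data: 41 hypothetical hospitals with 3 records each (not study data).
records = [{"hospital": f"H{h:02d}", "patient": f"P{h:02d}-{i}"}
           for h in range(41) for i in range(3)]
test_hospitals = choose_test_hospitals({r["hospital"] for r in records}, n_test=8)
train, test = split_by_hospital(records, test_hospitals)
```

Because the test hospitals contribute nothing to training, the test score reflects performance at institutions the model has never seen, which is the point of this form of external validation.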
The ROC curves for the five predictive models in CDI mortality and recurrence prediction are shown in Figure 2A,B. The boosting-based machine learning algorithm GBM (Mortality AUC: 0.7878; Recurrence AUC: 0.7076) outperformed the statistical LR model (Mortality AUC: 0.7573; Recurrence AUC: 0.6927) and the other types of machine learning algorithms. The superior results of GBM over the other four algorithms may be attributed to two reasons. First, boosting, [30] a method that combines weak classifiers into a final strong classifier and increases the weights of previously misclassified data for subsequent training, improves the capability of handling the imbalanced characteristics of the CDI dataset. [31] Second, GBM can generalize well to unseen data because it tends to reduce errors (bias and variance) during the training process. A shallow tree used as a weak classifier has high bias but low variance. GBM builds the weak classifiers sequentially so that the errors of earlier classifiers can be modeled by later classifiers, reducing the overall bias. [32] These properties make GBM a more robust model for CDI mortality and recurrence prediction.
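For readers less familiar with the metric, the AUC reported above equals the probability that a randomly chosen positive case (e.g., a patient who died) receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; it is a plain pairwise implementation written for clarity, not the routine used in the study, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.7878 means that roughly 79% of such positive-negative pairs are ordered correctly.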
Understanding which patient features lead to a specific prediction is a key challenge in machine learning. Complex models such as GBM and NN may provide significantly improved accuracy, but the reasons behind their predictions are harder to interpret. Therefore, we adopted SHAP, [33] which applies a game-theoretic technique for quantifying the contribution of each player in a collaborative game, to represent the feature importance distributed in each classifier of the model. The feature importance of the machine learning model helps us to interpret the features relevant to the output prediction and to exclude irrelevant features to reduce model complexity. [34] The feature importance of each classifier in the GBM model for predicting CDI mortality and recurrence is shown in Figure 3.
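The game-theoretic idea behind SHAP can be illustrated with a brute-force Shapley computation on a toy risk function. This is a sketch of the underlying definition only: it enumerates every ordering of the features, which is feasible for three features but not for 79, and it is not the algorithm implemented in the SHAP library. The feature names and risk increments are invented for illustration.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            contrib[f] += value_fn(frozenset(present)) - before
    return {f: c / len(orderings) for f, c in contrib.items()}


def toy_risk(present):
    """Hypothetical risk score with an interaction between two features."""
    risk = 0.10                      # baseline risk with no feature known
    if "older_age" in present:
        risk += 0.20
    if "low_albumin" in present:
        risk += 0.30
    if "older_age" in present and "low_albumin" in present:
        risk += 0.10                 # interaction term, shared between the two
    return risk


features = ["older_age", "low_albumin", "normal_wbc"]
phi = shapley_values(toy_risk, features)
```

The contributions sum to the difference between the full prediction and the baseline, and the interaction term is split evenly between the two features involved, which is the additivity property that makes these values usable as feature importances.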
Features ranked at the top are those the models rely on most frequently to classify patients correctly. The color and distribution of the samples reveal the relationship between risk prediction and feature value. A low minimum albumin level, long hospitalization, absence of a creatinine level over 150 μmol L⁻¹, and older age increase the predicted mortality probability of patients. The number of days hospitalized, a creatinine level rise above 150 μmol L⁻¹, and prior metronidazole prescription are the three leading features for predicting CDI recurrence.
Significant associations were found between CDI mortality and several patient factors, including older age, reduced albumin levels, higher creatinine levels, and higher white blood cell counts. In Figure 3A, the feature importance of the GBM mortality model shows that these factors are also among the top-associated features for accurate prediction, which is consistent with existing clinical findings. Older age and hospitalization for over 30 days are shown as top predictive recurrence risk factors in Figure 3B. The machine learning algorithms also suggested other key risk factors, for instance, a creatinine rise above 150 μmol L⁻¹, previous use of nitroimidazoles (e.g., metronidazole), and refractory disease.
To facilitate CDI risk assessments, the best-performing CDI mortality and recurrence predictive models were validated and released online as a platform for worldwide access at https://www.cdiml.care/. Clinicians may fill in the questionnaire for an individual patient or upload a comma-separated values file containing multiple patients' data to our web application to obtain prediction results. The questionnaire was designed to include the 18 most relevant features as input to facilitate efficient clinical assessment. To assess whether the GBM would still outperform the other four algorithms when using the top 18 of all 79 features, all five models were trained under the reduced feature setting for mortality and recurrence prediction. The GBM achieved an AUC of 0.7874 in mortality prediction and 0.7157 in recurrence prediction. Figure 2C,D shows the ROC curves in the top 18 feature setting. Comparing the GBM model with the original 79 features and the GBM model with the top 18 associated features, the AUC changes from 0.7878 to 0.7874 in mortality prediction and from 0.7076 to 0.7157 in recurrence prediction. The GBM model trained with the top associated features thus yields comparable or slightly better prediction results than the full-feature GBM model.
A risk factor explanation at the individual level was incorporated into the open-access platform to complement the predicted probability and show why the machine learning model produces a given prediction. The individual-level explanation is more specific, allowing clinicians to assess which factors most influence a particular patient rather than the entire population. Figure 4A shows the risk factor explanation for a patient predicted to have a high mortality probability of 0.81. The patient features that increase or decrease the predicted mortality risk are listed on red or blue bars, respectively: the red bars indicate the risk factors pushing the mortality prediction higher, and the blue bars indicate the factors lowering it. Features are sorted according to the magnitude of their contribution to the predicted probability. In this patient's explanation, a white blood cell count of 8.52 × 10⁶ cells mL⁻¹ is within the normal range and therefore appears as a factor reducing the mortality probability. However, the albumin and creatinine levels are outside the normal range, thus increasing the mortality probability. A patient predicted to have a low mortality probability of 0.03 is shown in Figure 4B. The predicted low mortality risk can be explained by the normal albumin level (34-54 g L⁻¹), middle age, and normal white blood cell count (≤15 kcells μL⁻¹). With the individual-level risk explanation, professionals can gain a clearer understanding of which input patient features increase or decrease the predicted mortality probability, rather than receiving only a predicted value from the machine learning models.
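The ordering and coloring of an individual-level explanation can be sketched as below. The per-feature contribution values are invented for illustration and do not come from the study or from Figure 4; the function name and the convention of coloring a zero contribution blue are likewise assumptions.

```python
def rank_contributions(contributions):
    """Sort per-feature contributions by magnitude and tag their direction.

    Positive contributions push the predicted risk higher ("red");
    the rest lower it or leave it unchanged ("blue").
    """
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, value, "red" if value > 0 else "blue") for name, value in ranked]


# Hypothetical contributions for one patient (not study data).
patient = {"albumin": 0.21, "creatinine": 0.12, "wbc": -0.05, "age": 0.30}
explanation = rank_contributions(patient)
```

Sorting by absolute value puts the most influential factors first regardless of direction, which matches how the bars are described as being ordered on the platform.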
This study has several limitations. First, the experiments were conducted retrospectively. In addition, all the patient data in this study were from Hong Kong, so sample bias may exist owing to this regional restriction. As only 79 prediction features were included, some associated factors may not have been covered. Furthermore, the clinical benefit of the machine learning approach (over conventional risk prediction) will need to be proven in prospective clinical studies. A randomized controlled trial could demonstrate whether using this algorithm, which may identify high-risk patients not identified by conventional clinical parameters, improves patient outcomes.
The performance of our model is compared here with that of previously reported models. In mortality prediction, Kulaylat et al. [17] developed a multivariable LR model with an AUC of 0.82, and Kassam et al. [19] reported a severity score model based on multivariate LR that achieved an AUC of 0.77. The AUC of our best GBM model is 0.7878, which is comparable with their performance. It is worth mentioning that in the study by Kulaylat et al., [17] the eosinophil count was a statistically significant variable in assessing the mortality risk, which could be a potential reason why the models cannot attain a higher AUC in our study cohort. In recurrence prediction, we observed a noticeable AUC difference between our study and the study by LaBarbera et al. [18] One potential reason is that their study consisted mainly of Caucasian patients, whereas our study is Hong Kong based, so demographic variation may exist. In addition, the RF algorithm was included in our study to investigate the predictive power of different machine learning algorithms. The best AUC achieved by the RF algorithm was only 0.6707, which is lower than that of our best GBM model.
In conclusion, CDI mortality and recurrence predictive models were set up using different statistical and machine learning algorithms, and the boosting-based algorithm GBM achieved the best performance in both mortality and recurrence prediction. The best-performing predictive models are available online for use. With the changing epidemiology of incipient and recurrent CDI, [10,11,42] such mortality and recurrence prognosis models could serve as a reference for doctors during treatment planning. The current treatment regimen recommended by the Infectious Diseases Society of America (IDSA) for initial CDI is vancomycin or fidaxomicin, whereas high-dose vancomycin is suggested for fulminant CDI patients. Treatment intensification can be chosen at an early stage if patients are predicted to have worse clinical outcomes or heightened mortality. The recommended treatment for recurrent CDI is a tapered course of vancomycin, fecal microbiota transplantation, or other pharmacotherapies. Patients estimated to have a high recurrence risk might benefit from upfront fecal microbiota transplantation or other pharmacotherapies before another recurrent episode. This may facilitate the early choice of appropriate treatment to reduce CDI mortality and recurrence, hence relieving the healthcare burden. The intricate modeling and algorithms that deter clinicians from incorporating machine learning models into patient care can be made less of a barrier by our intuitively accessible prediction website. The predicted probabilities and the individual-level risk factor explanations can be conveniently obtained from the entered clinical parameters. In addition, the top mortality-associated and recurrence-associated factors were uncovered through the machine learning algorithm. This could serve as a basis for further CDI pathology and clinical studies.

Experimental Section
Clinical Data Test: Statistical and machine learning model training was performed using Python 3.6.9. The data were divided into a training set (33 hospitals, 76% of records) and a test set (8 hospitals, 24% of records). Data preprocessing steps were applied to handle missing data and categorical data. Missing values in numeric fields were first imputed with the mean value of the training samples, and the fields were subsequently normalized through minimum-maximum normalization to avoid widely varying feature values. For categorical variables, one-hot encoding was applied to represent the data for computation. The number of features increased from 79 to 222 during the data preprocessing steps. Randomized grid search with cross validation was applied to tune the hyperparameters for each learning algorithm. The randomized grid search technique samples hyperparameter candidates from the target parameter sets and validates each of them through fivefold cross validation. For each algorithm, fivefold cross validation was carried out a total of 48 times to explore the optimal hyperparameters. The optimal hyperparameter set was the one with the highest mean validation score, obtained by averaging over the five folds. To compare the accuracy of the different algorithms, the model with the optimal hyperparameters for each algorithm was chosen and evaluated on the test set. In mortality prediction, the best-performing model was GBM with 190 estimators, 180 leaves, a subsample ratio of 1, a maximum tree depth of 2, and a learning rate of 0.2, which achieved a mean validation score of 0.811 and a test score of 0.7878. The best-performing model in recurrence prediction was GBM with 200 estimators, 180 leaves, a subsample ratio of 1, a maximum tree depth of 2, and a learning rate of 0.05. Its mean validation score was 0.693 and its test score was 0.7076. With a randomized train-test split, an identical set of model training and testing was conducted without splitting according to patients' healthcare institution. GBM again outperformed the other four algorithms, with an AUC of 0.8089 in mortality prediction and 0.6848 in recurrence prediction. The ROC curves are shown in Figure 2E,F. The algorithms with the highest test accuracy were selected as the best-performing models and further analyzed with SHAP. The importance of each feature was ranked and is shown in Figure 3. An open-access prediction system was established based on the best-performing models with the 18 most associated features.
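The preprocessing steps described in this section can be sketched with the standard library as follows. The imputation mean is taken from the training samples as stated above; reusing the training minimum and maximum for normalizing the test set is an assumption, as are the function names and the use of `None` to mark a missing value.

```python
from statistics import mean


def fit_numeric(train_values):
    """Learn the imputation mean and min-max range from training data only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed), min(observed), max(observed)


def transform_numeric(values, fill, lo, hi):
    """Impute missing values with the training mean, then min-max normalize."""
    span = hi - lo
    out = []
    for v in values:
        v = fill if v is None else v
        out.append((v - lo) / span if span else 0.0)
    return out


def one_hot(values, categories):
    """Encode each categorical value as a 0/1 indicator vector."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Toy numeric feature with one missing training value (not study data).
fill, lo, hi = fit_numeric([1.0, None, 3.0])
train_scaled = transform_numeric([1.0, None, 3.0], fill, lo, hi)
test_scaled = transform_numeric([None, 5.0], fill, lo, hi)
encoded = one_hot(["0-30", "over 180"], ["0-30", "91-180", "over 180"])
```

Each categorical variable expands into one indicator column per category, which is how one-hot encoding raises the feature count (here from 79 to 222). Under the assumption made here, test values outside the training range map outside [0, 1].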

Figure 1. Workflow for machine learning model selection and deployment. In the model selection phase, grid search cross validation was applied to tune the hyperparameters for each algorithm. The optimal model of each algorithm was evaluated on the test set to determine the best-performing algorithm for the dataset. SHAP analysis was applied to identify the top associated features. In the model deployment phase, the model was retrained with the top 18 features identified in the SHAP analysis and integrated into an open-access prediction system at https://www.cdiml.care/.

Figure 2. ROC curves of the four machine learning models and the LR model in the test set. A) Mortality prediction and B) recurrence prediction. C) Mortality prediction and D) recurrence prediction of models trained with the top 18 features only. E) Mortality prediction and F) recurrence prediction with a randomized train-test split rather than a split by patients' healthcare institution.

Figure 3. Top 20 features ranked by importance to the prediction. Red indicates high feature values and blue indicates low ones. A positive SHAP value indicates that the feature value pushes the prediction toward the positive class. Note that the categorical feature 'number of days hospitalized' contributes three categories (0-30 days, 91-180 days, over 180 days) to the top 20.

Figure 4. Risk factor explanation of the probability predicted by GBM. The red bars indicate the risk factors pushing the mortality prediction higher, and the blue bars indicate the factors lowering it. A) A patient predicted to have high mortality risk. B) A patient predicted to have low mortality risk.

Table 1. Patient characteristics and statistical analysis.