Automated machine learning‐based model predicts postoperative delirium using readily extractable perioperative collected electronic data

Abstract

Objective: Postoperative delirium (POD) is a common postoperative complication associated with poor outcomes. It is therefore critical to find effective methods for rapidly identifying patients at high risk of POD. A fully automated score based on a machine-learning algorithm may offer a way to predict the incidence of POD quickly.

Materials and methods: This is a secondary analysis of an observational study that included 531 surgical patients who underwent general anesthesia. The least absolute shrinkage and selection operator (LASSO) was used to screen essential features associated with POD. Eight features (age, intraoperative blood loss, anesthesia duration, extubation time, intensive care unit [ICU] admission, mini-mental state examination score [MMSE], Charlson comorbidity index [CCI], and postoperative neutrophil-to-lymphocyte ratio [NLR]) were used to establish the models. Four models (logistic regression, random forest, extreme gradient boosted trees, and support vector machines) were built in a training set (70% of participants) and evaluated in the remaining testing sample (30% of participants). Multivariate logistic regression analysis was used to further explore independent risk factors for POD.

Results: Model 1 (the logistic regression model) outperformed the other classifier models in the testing data (area under the curve [AUC] of 80.44%, 95% confidence interval [CI] 72.24%–88.64%) and also achieved the lowest Brier score. Age (OR = 1.054, 95% CI: 1.017–1.093), extubation time (OR = 1.027, 95% CI: 1.012–1.044), ICU admission (OR = 2.238, 95% CI: 1.313–3.793), MMSE (OR = 0.929, 95% CI: 0.876–0.984), CCI (OR = 1.197, 95% CI: 1.038–1.384), and postoperative NLR (OR = 1.029, 95% CI: 1.002–1.057) were independent risk factors for POD in this study.

Conclusions: We have built and validated a high-performing algorithm that shows how a patient's risk of POD changes during the perioperative period, which may support rational therapeutic choices.


| INTRODUCTION
Postoperative delirium (POD) is an acute, fluctuating neurocognitive syndrome caused by reversible neuronal disruption due to an underlying systemic perturbation. It usually occurs a few hours to a few days after surgery and mainly manifests as a decline in consciousness, attention disorders, and thinking disorders. 1 The reported incidence of POD in elderly surgical patients ranges from 10% to 70%. 2,3 Previous studies have demonstrated that early interventions can help reduce or even prevent POD, 4 yet many patients with POD cannot be identified efficiently. In clinical settings, the diagnosis of POD is still based mainly on clinical observation. 5 However, the hypoactive subtype accounts for about 71% of POD cases and is very hard to notice. Therefore, it is critical to find methods for rapidly identifying patients at high risk of POD.
In recent years, basic and clinical studies have found that many risk factors or biomarkers may affect the occurrence of POD. 6,7 For instance, many inflammatory markers investigated in scientific and clinical studies, such as CRP, are believed to be associated with POD. 8-10 Disease prediction models built on such factors offer a convenient way to screen high-risk patients, and a nomogram can be used easily in clinical settings. However, some prediction models for POD were based on a single statistical method, which may limit their predictive performance. 11,12 Recently, it has been reported that machine-learning techniques can improve the predictive performance of disease prediction models. 13,14 Thus, in the current study, we applied machine-learning techniques to clinical data collected before surgery and on the first day after surgery from 531 surgical patients who underwent general anesthesia, and established four predictive models of POD using different methods.
Finally, we compared these models and selected the one with optimal predictive performance, which can assist in identifying patients at high risk of POD. Furthermore, to make the optimal model easier to use, we transformed it into a nomogram.

| Data source and extraction
This secondary analysis was based on an observational study approved by the Ethical Committee of the Affiliated Hospital of Xuzhou Medical University (Certification No. XYFY2018-KL091). Written informed consent was obtained from all participants or from a legal surrogate or parent. Inclusion criteria were as follows: no history of clear neurological disease; major noncardiac, non-neurological surgery under general anesthesia; and an expected hospital stay of ≥3 days. Exclusion criteria were as follows 15: significant impairment of vision, hearing, or motor skills; history of neurological disease; liver or kidney dysfunction (such as severe hepatitis or pyelonephritis); severe trauma or surgery within one year; history of severe physical illness or alcoholism; mini-mental state examination (MMSE) score < 17; and refusal to sign informed consent.

| Model endpoint definition
We built classification models to predict the in-hospital incidence of POD as a binary outcome.

| Delirium assessment
Delirium was assessed using rigorous methodologies. In this trial, the Confusion Assessment Method (CAM) 16 was applied to patients who were able to communicate. The CAM-ICU 17 was applied to patients who had been admitted to the intensive care unit (ICU) and could not communicate because of endotracheal intubation. We assessed for delirium 2 h after surgery and then twice a day for three days after surgery, in the morning, afternoon, or evening, with an interval of at least 6 h between the two daily assessments. 18

KEYWORDS
delirium, machine learning, model prediction, nomogram, postoperative

[...] and delusions, was obtained from the nurses, families, and medical records. The evaluation of delirium was carried out by trained researchers who were blinded to the patients' perioperative characteristics and took no part in data entry or statistical analysis.

| Model input features
Forty-nine potentially useful features were collected, including basic information such as age, sex, BMI, and education level; American Society of Anesthesiologists (ASA) classification; laboratory data obtained before surgery, such as serum sodium, potassium, creatinine, and blood cell counts; and surgery-specific information such as the surgery type.

To achieve the highest predictive performance, four models were established: a logistic regression (LR) model, a random forest (RF) classifier, an extreme gradient boosted trees (XGB) classifier, and a support vector machine (SVM) classifier. To further explore the relationship between the eight selected features and POD, multivariate logistic regression was used to confirm the independent risk factors for POD in this study. To achieve a sample size of at least 330 in the training set, we randomly split the total dataset (n = 531) into a training set (n = 400) and a testing set (n = 131) at a ratio of 7:3. Any patient who appeared in the testing set was removed from the training set to prevent information leakage.
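The random split into disjoint training and testing sets can be sketched as follows. This is a minimal illustration in Python, not the authors' R code; the function name `split_train_test`, the integer patient IDs, and the seed are assumptions made for the example.

```python
import random

def split_train_test(patient_ids, n_train, seed=0):
    """Randomly partition unique patient IDs into disjoint training and testing lists."""
    ids = list(patient_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("patient IDs must be unique to avoid leakage")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# Hypothetical IDs 0..530 stand in for the 531 patients.
train_ids, test_ids = split_train_test(range(531), n_train=400)
```

Because each ID lands in exactly one of the two lists, no patient can contribute to both model fitting and evaluation.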

| Sample size and statistical analysis
All analyses were performed with R version 3.6.1 (R Development Core Team). The normality of numeric variables was tested with the Shapiro-Wilk test. Continuous variables with a normal distribution were expressed as the mean ± standard deviation (SD) and compared using the independent-sample t-test. Continuous variables with a non-normal distribution were compared using the Mann-Whitney U test. Categorical data were presented as a number (%) and analyzed using the chi-square test or Fisher's exact probability test. The importance of each variable in the training datasets was assessed by LASSO regression analysis.
Model hyperparameters were selected using 10-fold cross-validation on the training datasets. In 10-fold cross-validation, the data are divided into ten partitions; nine-tenths of the data are used to build the models, and the remaining one-tenth is used for testing. This process is repeated so that each partition is used for testing exactly once and for training nine times. Cross-validation gives a better assessment of model performance by averaging metrics over multiple trials.
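The partitioning scheme described above can be sketched in a few lines. This is an illustrative Python example rather than the R code used in the study; the helper name `k_fold_indices` and the seed are assumptions, and the sample count of 400 is taken from the training set size reported in this paper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(400, k=10))
```

A metric computed on each validation fold would then be averaged over the ten folds to compare candidate hyperparameter settings.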
The rule for handling missing data was as follows. If more than 20% of a variable's values were missing, the variable was excluded from the final dataset. If less than 20% were missing, the missing values were imputed using the random forest regression method.
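The missing-data rule can be written as a small decision function. The sketch below is illustrative only: the function name `handle_missing` is hypothetical, and treating a variable with exactly 20% missing values as eligible for imputation is an assumption, since the text does not specify that boundary case. The two CRP fractions are the ones reported in this paper.

```python
def handle_missing(missing_fraction, threshold=0.20):
    """Return the action for a variable given its fraction of missing values."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must lie between 0 and 1")
    if missing_fraction == 0.0:
        return "complete"
    if missing_fraction > threshold:
        return "exclude"
    # Assumption: exactly 20% missing is imputed rather than excluded.
    return "impute"

reported = {"preoperative CRP": 0.169, "postoperative CRP": 0.188}
decisions = {name: handle_missing(frac) for name, frac in reported.items()}
```

Both CRP variables fall under the threshold, which matches the paper's statement that they were imputed rather than dropped.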
Discrimination and calibration were used to verify the predictive ability of the models. Discrimination was measured by the area under the receiver operating characteristic curve (AUROC), and the Youden index (sensitivity + specificity − 1) was used to find the optimal cutoff value. Model performance was evaluated by accuracy, sensitivity, specificity, recall, and precision. Calibration was measured by the Brier score and the calibration curve. The Brier score is the mean squared difference between the predicted probability of the outcome and the true label; a lower Brier score indicates better performance.
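To make these three quantities concrete, the following Python sketch computes the AUROC (as the probability that a randomly chosen positive case is ranked above a randomly chosen negative case, with ties counted as one half), the Youden-optimal cutoff, and the Brier score. The function names and the eight toy labels and probabilities are invented for illustration; they are not study data, and the study itself used R.

```python
def auroc(y_true, y_prob):
    """AUROC as the fraction of positive-negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def youden_cutoff(y_true, y_prob):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(y_prob)):
        # A case is predicted positive when its probability is >= cutoff.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Toy data, invented for this example.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
auc = auroc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
cutoff, j_stat = youden_cutoff(y_true, y_prob)
```

Note that the AUROC depends only on the ranking of the predictions, whereas the Brier score also penalizes probabilities that are poorly calibrated, which is why the paper reports both.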
The LR and RF classifiers were implemented with the glmnet and randomForest packages, and the XGB classifier with the XGBoost package, all in R version 3.6.1. All performance metrics were calculated on the held-out testing datasets. We generated confidence intervals (CIs) for performance metrics in the training and testing datasets with the epiR package in R.

| Patient characteristics
A total of 531 patients were included in this study. Among those screened, the incidence of POD was approximately 23.54%. Two variables, preoperative C-reactive protein (CRP) and postoperative CRP, had missing values, which accounted for 16.9% and 18.8% of the total data, respectively. The missing data were imputed by random forest regression. The dataset (n = 531) was randomly divided into a training set and a testing set at a ratio of 7:3: 400 patients formed the training dataset and 131 patients formed the testing dataset. The training dataset was used to assess important variables associated with POD and to establish the predictive models.
Patients were divided into a POD group (n = 125) and a non-POD group (n = 406) according to whether delirium occurred within the first three days after surgery. The testing dataset was used to validate the predictive models. The patient recruitment flowchart is shown in Figure 1. Detailed information on patient characteristics can be found in Table 1. There was no statistically significant difference between the features of patients in the training and testing datasets. The best parameter (lambda) in the LASSO model was selected using 10-fold cross-validation. Dotted vertical lines were drawn at the optimal values given by the minimum criterion and by 1 SE of the minimum criterion (the 1-SE criterion). A vertical line was drawn at the value selected by 10-fold cross-validation, where the optimal lambda resulted in eight features with non-zero coefficients (Figure 2A,B). The eight variables with non-zero coefficients in the LASSO regression were age, intraoperative blood loss, anesthesia duration, extubation time, ICU admission, MMSE score, CCI score, and postoperative NLR (Table 2).

| Model performance
We used four algorithms to build predictive models of POD, and [...] recall = 0.852 (95% CI: 0.763-0.924)) (Table 4). The ROC curves of the models in the testing and training datasets are shown in Figure 3A,B, and the AUROC for each model is shown in Table 5.
The LR model achieved much lower (better) Brier scores than the other models. Calibration plots of the four models in the training and testing datasets are shown in Figure 3C,D. A curve lying along the 45° diagonal between the X-axis and the Y-axis indicates good agreement between predicted and observed risk.
Finally, the LR model was transformed into a nomogram to make the model easier to understand and use (Figure 4). A two-class prediction is generated from the optimal cutoff value of the optimal model. When the predictions were compared with the actual occurrence of delirium, the optimal model performed well. The optimal cutoff value of the LR model corresponds to the optimal score of the nomogram. The optimal score of the nomogram was deter-

| DISCUSSION
The accumulation of multiple risk factors is critical for the occurrence of POD, and there is currently no single treatment that prevents it. The combination of non-drug therapy and drug therapy is one of the best methods to treat POD. POD has been reported to occur in 10% to 70% of all elderly patients, 1-3 causing increased mortality, prolonged hospital stays, reduced functional abilities, 20,21 long-term cognitive dysfunction, 22 and even dementia. 15,23 Therefore, the prevention and treatment of POD is a clinical problem that needs to be solved.

In many clinical studies on POD, researchers have tried to find powerful biomarkers that can accurately predict POD, such as S100β protein, 24 neuron-specific enolase (NSE), 25 tau protein, 26 and inflammatory mediators. 27 Researchers are also trying to find better ways to reduce the occurrence of POD. Although these biomarkers have a relatively high ability to predict POD, they cannot be adopted widely in clinical practice because sampling is complex and costly, so they are mostly used to explore scientific questions in clinical trials. Disease prediction models may therefore provide a solution for the prevention of POD. Neuroinflammation and the oxidative stress response may be involved in the pathophysiological process of POD. 5

There are many ways to build a POD prediction model, but they usually involve many mathematical terms, 12,28 which makes a model harder for medical staff to understand and use. At the same time, many disease prediction models are expressed as formulas, which limits their usability. 11,12 Therefore, the model established in this study was transformed into a nomogram to make it easier to use.
In this study, we established a predictive model and incorporated the following eight variables into its construction: age, intraoperative blood loss, anesthesia duration, extubation time, ICU admission, MMSE score, CCI score, and postoperative NLR. Finally, the eight variables were included in the multivariate logistic regression analysis. We found that age, extubation time, ICU admission, MMSE score, CCI score, and postoperative NLR were independent risk factors for POD. Advanced age is known to be the most relevant risk factor for POD, and some underlying systemic diseases present before surgery may also increase the incidence of POD. 29,30 Admission to the ICU after surgery may also increase the incidence of POD, which may be related to the ICU environment, long-term mechanical ventilation, and the severity of the patient's disease. 31 The MMSE assesses cognitive function in patients and is associated with POD. 32 These findings are consistent with our study. Extubation time is related to residual anesthetic drugs at the end of the anesthesia maintenance period and to the patient's disease state before surgery. This study also confirmed that extubation time is a risk factor for POD.
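As a worked illustration of how to read the reported odds ratios: in a logistic regression, the coefficient of a variable is the natural logarithm of its odds ratio per one-unit change, so the odds ratio for a change of several units is the per-unit value raised to that power. The Python sketch below applies this to the reported odds ratio for age (1.054) and MMSE (0.929). The units (one year of age, one MMSE point) are assumptions, since they are not stated here, and the function name is hypothetical.

```python
import math

def odds_ratio_for_change(or_per_unit, delta):
    """Scale a per-unit odds ratio to a change of `delta` units."""
    coefficient = math.log(or_per_unit)  # logistic regression coefficient
    return math.exp(coefficient * delta)

# Assumed units: age in years, MMSE in points.
age_or_10_years = odds_ratio_for_change(1.054, 10)
mmse_or_5_points = odds_ratio_for_change(0.929, 5)
```

Under these assumptions, a ten-year age difference corresponds to roughly 1.7 times the odds of POD, while a five-point higher MMSE score corresponds to roughly 0.7 times the odds, holding the other variables fixed.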
The postoperative NLR is also related to POD, but CRP was excluded when we screened for important features in this study. On the one hand, we infer that the NLR, a parameter derived from different white blood cell counts, is a composite marker of both inflammation and oxidative stress and a stronger inflammatory marker than CRP. 7,8 On the other hand, CRP had a certain amount of missing data. Although we imputed the missing data, this could still have affected the screening of important features. Taken together, these two factors could explain why CRP was excluded and the NLR was included. However, two variables, intraoperative blood loss and anesthesia duration, were excluded by multivariate logistic regression, which is inconsistent with some previous research findings. 33,34 Considering that these two variables may be potential risk factors for POD, and following the principle of the minimum Akaike information criterion (AIC) and the maximum AUROC of the prediction model, we finally included these two variables in the model.

[...] First, this is a small-sample study, and the predictive model requires a larger sample for verification. Second, the imputation of missing data is a complex problem because the data were considered to be missing at random. There is a large field of research devoted to building optimal imputation algorithms, and suboptimal imputation algorithms decrease the performance of a predictive model. This may be one reason why the performance of our model is lower than that of previous models. 35 However, the imputation algorithm we chose, 36,37 while not optimal, was better than mean imputation. Third, the data in this study came from a single large academic medical center, so the model may not perform similarly in other medical institutions. Most likely, the model will need to be recalibrated when used by another institution, and the exact weights of the features may change through such recalibration. Finally, this model requires an independent dataset to test its extrapolation and generalization.
We hope to collect enough external validation data to further improve this model in the future.
The benefits of machine-learning technology are substantial, especially in medicine. For example, using machine-learning technology to establish disease prediction and risk assessment models can help clinicians better identify the factors that truly drive the occurrence and development of diseases.

| CONCLUSIONS
We developed four different POD prediction models and used the Brier score to assess their calibration and select the best-performing model. We believe that this model is a useful tool for screening patients at high risk of POD.

ACKNOWLEDGMENTS
We would like to thank the study participants, data collectors, obstetricians, and nurses for their unreserved help.
Finally, we are grateful to those who directly or indirectly supported us.

DISCLOSURE STATEMENT
The authors declare that they have no competing interest.

AUTHORS' CONTRIBUTIONS
Jun-Li Cao designed the study, critically reviewed the manuscript, approved the final version, and is accountable for the work. Xiao-Yi Hu, He Liu, and Yuan Han designed the study, conducted the study, collected the data, prepared the manuscript, critically reviewed the manuscript, approved the final version, and are accountable for the work. Xing Gao, Yang Zhou, and Jian Zhou helped conduct the study and collected the data. Hui-Lian Guan and Xun Sun analyzed and interpreted the data. Xue Zhao and Qiu Zhao helped prepare the manuscript and critically reviewed the manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.