Machine learning in clinical practice: Evaluation of an artificial intelligence tool after implementation

Artificial intelligence (AI) has gradually found its way into healthcare, and its future integration into clinical practice is inevitable. In the present study, we evaluate the post-implementation accuracy of a novel AI algorithm designed to predict admission from the triage note. This is the first such study to investigate real-time AI performance in the emergency setting.


Key findings
• Live AI model performance may differ from that of the originally trained and tested model.
• Frequent assessment of the accuracy and performance of any AI model integrated with an electronic medical record is crucial to ensure the defined diagnostic performance of the system is upheld after implementation.
• Integration of the electronic medical record and AI system can add value in clinical practice by reducing cognitive load on clinicians and mitigating risk.

Introduction
The amount of health data collected increases every day with the advancement and expansion of electronic medical records (EMR).1 Emergency physicians are required to synthesise large amounts of complex data and make critical decisions while being frequently interrupted.2,3 This 'information overload' can lead to an increase in adverse patient safety events and unexpected death.4,5 In this context, machine learning (ML) and artificial intelligence (AI) systems have become pertinent in order to manage, interpret, describe and best utilise big data. AI has been shown to improve productivity and mitigate inconsistency and disparities in care while assisting clinicians with the cognitive load of decision-making.3,6 Recently, faster and more powerful computers have brought about an expansion of AI systems and tools in healthcare.7 While there are many studies evaluating the application of AI and natural language processing (NLP) in emergency medicine, the majority use retrospective data to train and test their algorithms.8 This means that there are limited prospective studies evaluating the real-time performance of AI NLP systems and their effectiveness in clinical practice.

Growing concerns over the possible impact of AI implementation in healthcare, and its potential inadvertent consequences, have been raised by physicians, administrators and policymakers. There have been challenges in the adoption and acceptance of AI systems in healthcare by both physicians and patients.7 Concern over losing autonomy, disruption to the patient-doctor relationship, lack of trust in the AI system and inadequate understanding of AI models are some examples limiting AI acceptance. The absence of overarching AI policies in healthcare to address maleficence, risk and safety, autonomy, privacy and confidentiality is a contributing factor.9 We previously demonstrated the utility of NLP and AI to predict patient disposition based only on emergency triage notes.10 Here, we prospectively assess and evaluate the accuracy of this novel AI decision-support tool post-implementation. The performance of the AI model is evaluated against the published results; the aim is to evaluate the differences between the offline test scores and the real-time scores of the AI system.

Methods
The present study is an extension of a previous study published in Emergency Medicine Australasia in 2020 (for a detailed description of the construction of the model and the extracted features, see Tahayori et al.10). NLP was used to extract features from the unstructured triage notes. The predictive model was trained using a 10-year pool of data from St Vincent's ED, and was not retrained or updated for the present study. Although there is no clear gold standard definition of 'admission to hospital', for the purpose of the present study and the training of the AI model, the gold standard for admission was defined as patients physically transferred to an inpatient unit from the St Vincent's ED.

St Vincent's Hospital ED is a major tertiary hospital ED in metropolitan Melbourne with an average of 60 000 presentations annually. It is an adult ED with a short-stay unit to admit patients with less complicated presentations, or patients requiring transfer to their place of residence or physiotherapy assessment. St Vincent's ED is not a paediatric, women's or trauma centre.

A team comprising the hospital analytics team, the ED research centre and ED research fellow, the ED data and AI scientist, and Insight Co Ltd. (www.insight.com) was gathered to deploy and operationalise the model in clinical practice. The proposal was supported through a $10 000 internal grant from Microsoft.
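For readers unfamiliar with the general approach, a toy triage-note classifier in the same spirit can be built in a few lines. This is emphatically not the published model (which used a different NLP pipeline and 10 years of real triage data); the scikit-learn components and the toy notes below are illustrative assumptions only:

```python
# Minimal sketch of a triage-note admission classifier. NOT the published
# model; the toy data and pipeline choices here are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy triage notes with a binary label: 1 = transferred to an inpatient
# unit (the study's gold standard for admission), 0 = discharged home.
notes = [
    "chest pain radiating to left arm, diaphoretic",
    "shortness of breath, febrile, productive cough",
    "minor laceration to finger, bleeding controlled",
    "ankle sprain after fall, able to weight bear",
]
admitted = [1, 1, 0, 0]

# TF-IDF features over unigrams/bigrams feeding a logistic regression.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, admitted)

# Predicted admission probability for an unseen note.
prob = model.predict_proba(["sudden chest pain and sweating"])[0][1]
print(f"Predicted admission probability: {prob:.2f}")
```

In a live deployment the same predict step would run against each new triage note as it is entered, which is the workflow evaluated in this study.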

Translation of the research into clinical practice
The model was originally developed in a Jupyter notebook using Python (https://www.python.org/). The code was rewritten from Python into PySpark (https://spark.apache.org/) so that it could be implemented in a Databricks notebook; Databricks Runtime Version 7.5 ML was used. The training data set, comprising over 10 years of triage note data as described in the original paper,10 was pushed to Azure blob storage as part of the workflow and could be accessed by the notebooks. After the model was trained, it was used to predict unseen data and exported the results to a designated blob storage.
We designed and implemented a user-friendly interface using Microsoft Power BI to display the results of the AI system to end users, refreshing every 15 min (Fig. 1). This platform was selected because the current EMR at St Vincent's ED lacks the functionality to integrate with a third-party AI application and feed the results back into the EMR.

Data collection
The live AI system was operational from 1 January 2021, and data were collected from this date until 17 August 2022. During this period, we collected all ED patient triage notes, their final dispositions and the AI system's predictions. There were no significant changes in the layout of the department or the patient triage process during this period.

Dispositions
The final dispositions for the purpose of the study were categorised into:
• Home: patients seen in the ED and discharged from the ED without going to the short-stay area.
• Ward: patients admitted to a hospital inpatient ward from the ED without going through short-stay admission (SSA).
• SSA: patients admitted to the ED's short-stay unit and discharged home from the short-stay unit.
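As a sketch, the categorisation above, and the collapse to a binary home/ward outcome used later to replicate live performance, can be expressed as simple lookup tables. The raw disposition strings here are invented for illustration; they are not the EMR's actual codes:

```python
# Illustrative disposition mapping. The raw EMR strings are assumptions;
# only the target categories come from the study's definitions.
def categorise(final_disposition: str) -> str:
    """Map a final disposition to 'home', 'ward', 'ssa' or 'apssa'
    (admitted to an inpatient ward after short-stay admission)."""
    mapping = {
        "discharged from ED": "home",
        "admitted to inpatient ward": "ward",
        "short stay then discharged": "ssa",
        "short stay then admitted": "apssa",
    }
    return mapping[final_disposition]

def collapse_for_live_analysis(category: str) -> str:
    """Replicate the live analysis: SSA counts as 'home',
    admission after SSA counts as 'ward'."""
    return {"home": "home", "ward": "ward",
            "ssa": "home", "apssa": "ward"}[category]

print(collapse_for_live_analysis(categorise("short stay then admitted")))  # prints "ward"
```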

Results
Since the integration of the AI prediction model with the hospital's EMR, a total of 80 396 presentations to the ED had been recorded at the time of the study. A total of 3271 (4.1%) records were excluded from the final analysis because of dispositions for which the AI system could not generate a prediction, such as 'missing value', 'return to ward', 'partial treatment' or 'dies in ED'. A total of 77 125 ED presentations were included in the present study (Table 1).
In our initial published study,10 the AI model was trained and tested on data that did not include SSA; we therefore present the post-implementation performance results of the algorithm in two ways.

Post-implementation performance excluding SSA
Of the 33 747 patients who were discharged home directly from the ED, 80% (n = 27 055) were correctly predicted to be discharged by the AI model. Of patients admitted to wards, 74% (n = 8886) were correctly predicted to be admitted to the hospital by the model (Table 2). The diagnostic performance showed a sensitivity of 74% (95% confidence interval 73.2-74.8), specificity of 80.2% (79.7-80.6), positive predictive value of 57% (56.5-57.6) and negative predictive value of 89.7% (89.4-89.9), with an accuracy of 78.6% (78.2-78.9) (Table 3).
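The Table 3 metrics can be reproduced from the counts quoted above. In the short sketch below, the total number of ward admissions (12 008) is derived from the stated 74% figure rather than quoted directly in the paper, so it is an approximation:

```python
# Reconstructing the diagnostic metrics in Table 3 from the counts given
# in the text. The admitted total (12 008) is back-calculated from
# "74% (n = 8886)" and is therefore an approximation.
TP = 8886            # admitted, predicted admitted
FN = 12_008 - TP     # admitted, predicted discharged
TN = 27_055          # discharged, predicted discharged
FP = 33_747 - TN     # discharged, predicted admitted

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
ppv = TP / (TP + FP)                         # positive predictive value
npv = TN / (TN + FN)                         # negative predictive value
accuracy = (TP + TN) / (TP + TN + FP + FN)

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("PPV", ppv), ("NPV", npv), ("accuracy", accuracy)]:
    print(f"{name}: {value:.1%}")
```

Run as written, this recovers the published point estimates: sensitivity 74.0%, specificity 80.2%, PPV 57.0%, NPV 89.7% and accuracy 78.6%.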
Post-implementation performance on the entire data set, replicating live performance (including all SSA)
For this analysis, SSA was classified as 'home' and patients admitted to an inpatient ward after SSA (APSSA) were classified as 'ward'. A total of 52 494 (68%) patients were therefore discharged home and 18 437 (24%) were admitted.
Of the 6428 patients who were admitted to hospital after admission to the short-stay unit, the AI model correctly predicted 72% (n = 4599) of admissions. In contrast, the model predicted that 36% (n = 6801) of patients with a successful SSA (discharged home from the short-stay unit) required admission.
When evaluating the AI prediction accuracy for the top seven admission units in the hospital, the highest accuracy was observed for admissions under the gastroenterology unit (84%), followed by the medical units (80%) and the surgical units (76%) (Table 4). In contrast, the AI model's performance in predicting psychiatric, cardiology and plastic surgery admissions was considerably lower (34%, 52% and 54%, respectively). Although the model was trained in the pre-pandemic era, it was able to predict admission to the COVID ward with 88% accuracy (478/542).
Of the 6194 patients who left without receiving any medical review or after only receiving triage advice, the AI system predicted that 14% (n = 843) required hospital admission.

Discussion
In the present study, we tested the post-implementation diagnostic accuracy of a novel AI system designed at St Vincent's Hospital Melbourne ED. This is the first study of its kind to prospectively monitor and report the performance of an AI system in an Australian ED post-implementation. ML models have been developed to assist with different aspects of the triage process, including assignment of triage scores,11,12 patient disposition8,10,13 and identification of critical illness.14,15 Many of the tools developed to support triage processes have shown clinical utility; however, their development has largely been based on retrospective data that were processed and cleaned pre-implementation. Some of these studies acknowledge the challenges of implementing these tools into the workflow of the ED and the barriers to translating them into a live ED setting that uses real-time data. In many cases this is caused by concept drift (change of data and model requirements over time), a potential issue with live ML models.16 Concept drift may have affected our model, as the training data set was from the pre-COVID era and our goal was to illustrate the performance of the same model that was published previously.10 Our results are in line with degradation caused by concept drift and show the need to monitor and evaluate AI-enabled clinical support tools once they have been implemented in real-world contexts, to ensure their accuracy is maintained post-implementation. To this end, we propose implementing an optimised concept drift detection algorithm and continuous training of the model to fine-tune its parameters dynamically.16

Comparing the findings of the present study with the results of the previously published paper,10 the performance of the live AI model was mixed when SSA was excluded from the final analysis. The ability of the live system to correctly identify non-admitted patients (specificity) dropped from 86% to 80.2%, while sensitivity (the ability to identify admitted patients) increased slightly. In contrast, the performance of the system deteriorated significantly once SSA was added to the analysis. These findings highlight the importance of training an AI model on data that resemble everyday activity in a real-world setting. We speculate that using historical data that undergo a significant amount of exclusion, cleaning and pre-processing before being fed into the training algorithm unintentionally influences the diagnostic performance of the AI system when it is implemented in a real-world scenario with real-time data. We propose that every AI system integrated into clinical practice should undergo post-implementation diagnostic evaluation and regular performance reporting to ensure stable performance and avoid unintentional consequences. The original AI model was designed to use only triage notes as input; we therefore hypothesise that retraining the system with other patient-oriented data would improve its diagnostic performance, as shown by a recent study conducted by Kishore et al.8

The model showed differing accuracy in predicting admission to individual inpatient units. While the system was able to predict up to 80% of medical admissions, it failed to produce the same performance for admissions to cardiology or orthopaedic units. We assume that the availability of outpatient clinics for orthopaedic presentations, or of additional cardiac investigations such as troponin results, stress tests or echocardiography, could be influential factors. Furthermore, the AI system's performance was lowest when predicting psychiatric admission. This is not unexpected, considering the complex nature of psychiatric presentations to the ED. In a recent study, Garriga et al.17 developed a ML algorithm to identify patients at risk of deterioration and mental health crisis. Their algorithm achieved fair performance, with an area under the receiver operating characteristic curve of 0.797. Interestingly, Garriga et al.17 also evaluated the performance of their ML model prospectively and found that the system was able to assist in mitigating the risk of mental health deterioration or in managing the caseload in only 64% of cases.
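The continuous post-implementation monitoring discussed in this section could, in its simplest form, be a rolling-window accuracy check against the offline validation score. The following is a minimal illustrative sketch, not the optimised concept drift detection algorithm the authors propose; the class name, window size and tolerance are assumptions:

```python
# Illustrative rolling-window performance monitor for a live model.
# An alert fires when windowed accuracy falls a chosen margin below
# the accuracy measured at offline validation.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_accuracy: float, window: int = 1000,
                 tolerance: float = 0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, predicted: int, actual: int) -> None:
        self.outcomes.append(int(predicted == actual))

    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self) -> bool:
        """True once windowed accuracy drops below baseline - tolerance."""
        return self.accuracy() < self.baseline - self.tolerance

# Simulated degradation: the model is right 9/10 early, then 6/10 later.
monitor = DriftMonitor(baseline_accuracy=0.83, window=100)
for i in range(100):
    monitor.record(predicted=1, actual=1 if i % 10 else 0)      # 90% correct
print(monitor.drifted())   # False: 0.90 >= 0.78
for i in range(100):
    monitor.record(predicted=1, actual=1 if i % 10 < 6 else 0)  # 60% correct
print(monitor.drifted())   # True: 0.60 < 0.78
```

A production version would also need to decide who is alerted and how retraining is triggered, which is an operational rather than algorithmic question.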
AI has considerable potential to reduce cognitive load, emergency overcrowding, length of stay and 'Do Not Wait' patients in the ED. Diverting non-urgent patients from the ED has been proposed as a means to mitigate emergency overcrowding;18 however, the literature on the effect of this policy is not conclusive.19 Our AI system has an excellent negative predictive value of 89.7%, which could be safely used to screen, in a timely manner, low-risk patients who might benefit from diversion to other healthcare services such as virtual EDs, doctors on demand, GPs or local clinics such as the recently established Priority Primary Care Clinics in Victoria.20 In a study by Dias et al.,21 the Manchester Triage System version II was used to divert non-urgent patients from the ED to a primary care provider. Almost 62% of diverted patients were satisfied with the diversion, with no unintentional consequences within 30 days. Further studies are required to evaluate the utility of AI in easing overcrowding in Australian EDs.

A total of 5% of patients presenting to the ED left without any medical review (LWMR). The literature reports a higher rate of readmission and a 4% hospitalisation rate among these patients,22 with conflicting results on mortality.23 Our AI model predicted that almost 14% of LWMR patients required admission. Currently, there is no systematic process or policy to review all LWMR patients, owing to the high number of cases and healthcare workforce shortages. We propose that one potential application of the AI system is to improve patient outcomes and overall satisfaction by screening and identifying high-risk LWMR patients for follow up.
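As an illustration of the two screening applications discussed above (diverting low-risk arrivals and following up high-risk LWMR patients), a hypothetical downstream filter over the model's output might look like the following. The field names and thresholds are assumptions, not part of the deployed system:

```python
# Hypothetical screening over model output. "p_admit" is the model's
# predicted admission probability; thresholds are illustrative only.
def diversion_candidates(patients, threshold=0.2):
    """Patients whose predicted admission probability is low enough
    to consider diversion to alternative care."""
    return [p for p in patients if p["p_admit"] < threshold]

def lwmr_follow_up(patients, threshold=0.5):
    """LWMR patients the model predicted would need admission."""
    return [p for p in patients if p["lwmr"] and p["p_admit"] >= threshold]

patients = [
    {"id": 1, "p_admit": 0.05, "lwmr": False},  # low risk: divert
    {"id": 2, "p_admit": 0.82, "lwmr": True},   # left, likely needed admission
    {"id": 3, "p_admit": 0.40, "lwmr": True},
]
print([p["id"] for p in diversion_candidates(patients)])  # [1]
print([p["id"] for p in lwmr_follow_up(patients)])        # [2]
```

Any real deployment of such a filter would require clinical governance over the thresholds, given the consequences of a missed admission.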
Implementation frameworks have sought to provide a taxonomy of challenges for the implementation of AI, including autonomy, beneficence, non-maleficence and justice;24,25 explainability;25 validation and accuracy;24,26 and safety and efficacy.24 Further barriers and facilitators to use include the availability of large and varied data sets to improve AI prediction performance,27 the requirement for external validation to reduce the potential for bias,28 expanding research through multicentre studies,29 ensuring the AI prediction tool outperforms the clinician28 and that it is effectively integrated into the workflow.30 Public trust in and acceptance of AI tools are also important for effective implementation. Addressing the associated concerns, which include privacy, consent and bias, is considered critical to gaining public trust and acceptance.31 Furthermore, for AI systems to be deployed, security and computational resources are required at an institutional level to help translate AI research into clinical practice.31

Limitation and future studies
This is a single-site study and its findings may not generalise to other healthcare services, such as rural settings or the private sector. In addition, the model was initially trained using a data set that excluded SSA; retraining the model on all data may alter its sensitivity and specificity. We were also not able to evaluate inequity in key subgroups, such as Indigenous populations, or to assess gender bias caused by the AI system. The authors propose a multisite study to improve the performance of the AI model by considering other patient factors and to evaluate the generalisability of the AI-trained system to other health services.

Conclusion
We have demonstrated the importance of monitoring and evaluating AI-enabled clinical support tools once they have been integrated into clinical practice, to ensure their accuracy is maintained. AI-enabled decision-support tools have shown promising results in improving the quality of healthcare delivery; however, continuous training and evaluation of their performance are required to prevent inadvertent consequences. We have also demonstrated the feasibility of integrating an AI system with EMRs to add value to clinical practice by reducing clinician cognitive load and mitigating risk.

Figure 1. The top image provides an overview of the ED status, showing the total number of patients in the ED, the number of patients who need admission and the number of patients who will be discharged. The bottom image depicts the details of each presentation, including patient details (redacted for patient privacy) and the triage note, followed by the AI prediction for admission, the time of arrival, the time the admission unit was requested and the admitting unit.

TABLE 1 .
Patient demographics attending the ED during the study period

TABLE 2 .
Overview of dispositions during the study period

TABLE 3 .
Diagnostic evaluation of the artificial intelligence (AI) algorithm to predict admission using triage notes (sensitivity was stated as recall and positive predictive value as precision in the published paper10)
© 2023 The Authors. Emergency Medicine Australasia published by John Wiley & Sons Australia, Ltd on behalf of Australasian College for Emergency Medicine.

TABLE 4 .
Accuracy of artificial intelligence (AI) prediction based on the admission unit