Investigating the impact of innovative AI chatbots on post-pandemic medical education and clinical assistance: a comprehensive analysis

The COVID-19 pandemic has significantly disrupted the clinical experience and exposure of medical students and junior doctors. The integration of artificial intelligence (AI) in medical education has the potential to enhance learning and improve patient care. This study aimed to evaluate the effectiveness of three popular large language models (LLMs) as clinical decision-making support tools for junior doctors.


Introduction
Since its introduction in November 2022, ChatGPT (Chat Generative Pre-trained Transformer), an artificial intelligence (AI)-based large language model (LLM), has attracted great attention and controversy for its ability to generate academic content. 1,2 These recent advancements have led Google and Microsoft (Bing) to produce their own LLMs. Google's Bard, Bing's AI and ChatGPT can all respond to enquiries on virtually any topic in a logical, coherent and human-like manner. 3,4 ChatGPT even demonstrated a passing performance equivalent to a third-year undergraduate medical student on the US Medical Licensing Examination. 5 Furthermore, in simulated benchmarking testing, ChatGPT scored in the 89th percentile on the Scholastic Aptitude Test (Math), the 90th percentile on the Uniform Bar Exam, and the 99th percentile in the Biology Olympiad. 6 Despite their potential, the efficacy of these AI tools in medical education remains underexplored.
The COVID-19 pandemic has had a significant impact on medical students and junior doctors, particularly in terms of their clinical exposure and experience. Many medical students have reported that the reduction in face-to-face teaching and clinical contact hours has had an overall negative impact on their training. 7,8 Medical students and junior doctors have been deprived of essential clinical exposure due to the suspension of elective procedures, reduced patient numbers, and isolation measures. This has had a profound impact on their learning and development, with some trainees expressing concerns about their ability to become competent clinicians, particularly in surgical specialties. 8 Furthermore, the pandemic has created significant psychological distress, with many students and junior doctors reporting higher levels of anxiety, burnout and depression. 9,10 LLMs have the potential to be used as adjuncts to theoretical learning and clinical decision-making discussion in the post-COVID-19 pandemic era. [12][13][14] Their simple interface, accessibility and sophisticated algorithms make them an ideal adjunct for both medical students and junior doctors. In this article, the authors evaluated the ability of the three most prominent LLMs (ChatGPT-4, Google's Bard and Bing's AI) to provide accurate, useful and safe advice in scenarios of increasing complexity, with an aim to explore their transformative potential in the realm of medical education, ultimately fostering a more effective and personalized learning experience.

Aim
The primary aim was to investigate the potential of AI LLMs to serve as clinical decision-making support tools for junior doctors. For this purpose, we employed three of the most popular LLMs (ChatGPT-4, Bard and BingAI) and evaluated their capacity, effectiveness, and accuracy in providing clinical guidance to junior doctors.

Design
The LLMs were presented with a series of increasingly complex clinical scenarios from the viewpoint of a junior doctor. The scenarios were developed in consultation with all authors, three of whom hold teaching posts with the hospital-affiliated medical school. These scenarios demonstrate some of the common complex surgical scenarios faced by junior doctors. As LLMs are a recent innovation, there is currently a lack of standardized tools for evaluating them in the clinical space. We chose to assess the LLMs' responses by reviewing their accuracy, informativeness and accessibility using a Likert scale (Table 1), validated against local hospital guidelines. This was independently conducted by two junior doctors (YX and IS) and three Plastic and Reconstructive Surgeons (DHS, WR and MS) with extensive clinical and medical education experience. Any differences in the Likert scale or reliability tools were discussed until a consensus was achieved.
The comprehensibility and complexity of the generated responses were evaluated using a combination of the Flesch Reading Ease Score, the Flesch-Kincaid Grade Level and the Coleman-Liau Index (Table 2). The Flesch Reading Ease Score, which operates on a scale of 0-100, indicates how easily a text can be understood, with higher scores implying greater readability. The Flesch-Kincaid Grade Level and Coleman-Liau Index gauge the educational level necessary to understand the text and its complexity, with higher values denoting more complex content. Additionally, the DISCERN scoring system (Table 2), which operates on a scale from 16 to 80, was used to assess the responses' quality, relevance and equitable distribution of information. A greater DISCERN score represents superior information quality and a more balanced presentation of the content. For the statistical analysis, we employed Student's t-test to identify significant differences among the responses generated by the LLMs, with a P-value of less than 0.05 set as the threshold for statistical significance.
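As an illustration of how these three readability indices are derived, the sketch below computes them from simple text statistics. It is a minimal sketch, not the tool used in this study: the syllable counter is a naive vowel-group heuristic (dedicated readability software uses pronunciation dictionaries), so exact scores will differ slightly from those reported in Table 2.

```python
import re


def _syllables(word: str) -> int:
    # Naive heuristic: count vowel groups, discounting a trailing silent "e".
    count = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(len(words), 1)
    letters = sum(len(w) for w in words)
    syllables = sum(_syllables(w) for w in words)
    wps = n_words / sentences          # average words per sentence
    spw = syllables / n_words          # average syllables per word
    L = letters / n_words * 100        # letters per 100 words (Coleman-Liau)
    S = sentences / n_words * 100      # sentences per 100 words (Coleman-Liau)
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }


scores = readability(
    "The patient was reviewed on the ward. Intravenous antibiotics were continued."
)
```

Longer sentences and more polysyllabic words lower the Flesch Reading Ease Score and raise the two grade-level indices, which is why dense clinical prose such as the LLM responses scores in the "difficult" band.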

Inclusion and exclusion criteria
All LLMs use a probabilistic algorithm with random sampling to generate varied responses, which can result in different answers to the same question. For this study, the first response provided by each LLM to each question was recorded. Subsequent questions were not altered based on the responses, to ensure consistency of the questions across the three LLMs. Care was taken to ensure there were no grammatical or syntax errors in each question. There were no exclusion criteria or restrictions on the LLMs' responses. No institutional ethics approval was needed for analysing open-source artificial intelligence chatbots in this type of article.
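The response variability described above stems from temperature-scaled sampling over the model's next-token probability distribution. A minimal sketch of that stochastic step, using hypothetical token logits rather than a real model, is:

```python
import math
import random


def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    """Temperature-scaled softmax sampling: the stochastic step that lets
    an LLM return different answers to an identical prompt."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(v - m) for tok, v in scaled.items()}
    r = random.random() * sum(weights.values())
    cum = 0.0
    for tok, w in weights.items():
        cum += w
        if r < cum:
            return tok
    return tok  # fallback for floating-point edge cases


# Hypothetical logits for the word following "The likely diagnosis is"
logits = {"cellulitis": 2.5, "sepsis": 1.0, "abscess": 0.5}
token = sample_next_token(logits, temperature=0.7)
```

Lower temperatures concentrate probability on the highest-scoring token (more repeatable answers); higher temperatures flatten the distribution (more varied answers), which is why recording only the first response was necessary for consistency.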

Results
Scenario A describes a patient with right lower limb cellulitis that begins to rapidly deteriorate on the ward (Fig. 1a-d). ChatGPT's recommendations commence with a comprehensive history and examination, focusing on signs of localized and systemic deterioration. The suggestions of further investigations and analgesia are all appropriate for the scenario presented. The advice to discuss the findings and management with a senior medical colleague demonstrates good awareness of patient safety and the limitations of a junior doctor. The subsequent finding of MRSA raises a compelling point regarding antimicrobial stewardship. Although ChatGPT advises considering an antibiotic switch for MRSA coverage, a more nuanced discussion on the merits of modifying the antibiotic regimen, as opposed to adding an appropriate antibiotic, along with an overview of the consequences for antibiotic resistance and effectiveness, would have provided a more insightful and judicious recommendation. Furthermore, ChatGPT fails to include an assessment of hydration and fluid resuscitation in its initial management approach, both of which are crucial steps in the care of unwell patients.
Google's Bard was less comprehensive. It appropriately suggested re-examining the patient and considering other potential diagnoses, including the possibility of sepsis. It provided some general information on MRSA and recommended changing antibiotic coverage in response to the old wound swab; however, it did not advocate any further consultation with Infectious Disease specialists or antibiotic guidelines. It did not suggest any further investigations or options for treatment. It also demonstrated poor awareness of the question, as it recommended calling the patient's doctor and suggested that hospitalization might be necessary, despite the question specifically stating that the asker was a junior doctor on the ward.
BingAI was the least comprehensive of the three LLMs. It provided a brief, generic response on cellulitis and MRSA treatment rather than attempting to answer the questions, and would not have been useful to a junior doctor in the clinical setting.
The subsequent deterioration of the patient is recognized by ChatGPT as the development of sepsis, and the possibility of necrotising fasciitis is raised. The recommendation to call for help and involve the surgical team is appropriate. The ongoing deterioration of the patient despite adequate medical management is correctly recognized as highly concerning for necrotising fasciitis, a fulminant, rapidly progressive surgical sepsis. Although it is a clinical diagnosis, certain investigations can add weight to the decision-making process. Once necrotising fasciitis is suspected, surgical debridement becomes the mainstay of management. Given the high risk of mortality, investigations that can delay transfer to the operating theatre, such as a CT scan, are often considered unnecessary in the presence of concerning clinical features and rapid deterioration. ChatGPT's advice to involve senior medical colleagues and the surgical team earlier in the scenario demonstrates an appropriate and practical approach to managing a deteriorating patient.
Bard also recognizes the possibility of necrotising fasciitis in the deteriorating patient, appropriately suggesting broad-spectrum antibiotics, fluids, and operative management. The response is less comprehensive than ChatGPT's, however. While Bard summarizes important points in the overall management, it does not provide logical steps or a system for a junior doctor to follow in the clinical setting. Bard does not mention the primary survey, or advocate consultation with ICU or Infectious Disease specialists for a surgically septic patient. While Bard does correctly mention surgery, dialysis and intubation for the deteriorating patient, its advice does not apply to the practice of a junior doctor, whose role is not to perform all these tasks but to facilitate them in a timely and safe manner with the assistance of senior staff. BingAI's response to the question was generic, suggesting escalation of the patient to a healthcare professional; it did not provide any advice beyond this.

Scenario B describes a patient who is tachycardic on the ward 2 days postoperatively (Fig. 2a-d). ChatGPT's stepwise approach is logical and coherent. It provides some common differentials for postoperative tachycardia and an appropriate management strategy. The disclosure of the surgical procedure details, including its lengthy duration and the brief postoperative immobility period, should raise suspicion of a venous thromboembolic (VTE) event. However, ChatGPT's assertion that a tachycardic patient with adequate hydration and normal vital signs is stable with no immediate concerns can be misleading in this context.
Bard's stepwise approach is similarly logical and appropriate. It agrees with ChatGPT on the importance of an initial history and examination, followed by a review of the medical history and surgical procedure. Notably, Bard suggests an ECG in the first instance, a basic and useful investigation for tachycardia, which ChatGPT fails to do. However, Bard does not mention the importance of adequate hydration and fluid resuscitation, which ChatGPT does. None of the LLMs were able to infer from the information given that the patient had undergone a lengthy procedure with prolonged immobilization, and that in the presence of adequate hydration and pain control, a VTE event should at least be considered. As with the previous scenario, BingAI is the least informative. It correctly advocates a history and examination of the patient; however, it does not elaborate on any potential differential diagnoses, investigations or management.
The subsequent revelation that the patient has not received VTE prophylaxis should heighten the concern for a VTE event. Although this information prompts ChatGPT to consider deep vein thrombosis, it could be argued that an electrocardiogram (ECG) should have been suggested earlier in the scenario. An ECG can provide valuable diagnostic insights regarding cardiac function and potentially reveal signs of right heart strain, a significant consequence of a pulmonary embolism. The development of hypoxia correctly leads ChatGPT to recommend investigations for pulmonary complications, including a chest X-ray and computed tomography pulmonary angiogram. Again, the advice to involve senior medical colleagues early is safe and appropriate. While Bard provides some general information regarding VTE and VTE prophylaxis, it fails to make the connection that this patient may have had a VTE event. Providing only generic information may encourage the reader to form their own educated opinion and avoid misleading them toward a false conclusion, but it may also lack the necessary context or specificity to fully inform or guide the reader's understanding of the situation.
Scenario C presents the LLMs with some challenging ethical and legal dilemmas (Fig. 3a-d). ChatGPT's initial approach to diagnosis and management is safe and appropriate, recognizing the presentation of an abscess and suggesting both medical and surgical management. The revelation of illicit substance use and intoxication prompts a reconsideration of the patient's capacity to provide consent, as well as the potential impact on the patient's safety in proceeding with surgery. ChatGPT highlights the need to address the patient's substance misuse and signs of withdrawal, encouraging the provision of harm reduction education and resources. The undertreatment of withdrawal has been demonstrated to be a significant contributor to patients discharging against medical advice. 15 Finally, ChatGPT's response to the angry and aggressive patient threatening to self-discharge recognizes the importance of safety, and the issues surrounding capacity to discharge against medical advice.
Bard recognized the presenting symptoms as indicative of an infective process; however, it did not explicitly state the diagnosis of abscess in its response. Like ChatGPT, Bard emphasized patient safety and the importance of addressing the patient's substance misuse, going so far as to provide resources for further follow-up, albeit tailored to its local demographic. Bard also suggests strategies for communicating with an agitated patient; however, it fails to recognize the impact of these developments on the surgical plan and does not discuss the issues surrounding consent. BingAI performed similarly to Bard in this scenario. It made the correct diagnosis, recognized the potential harm arising from an intoxicated patient, and appropriately prioritized patient and staff safety. Unlike Bard, BingAI also identified the issues around consent, as well as the importance of communicating risks to patients discharging against medical advice.
When comparing the three LLMs for readability and reliability, ChatGPT consistently outperformed the others, registering the highest Flesch Reading Ease Score (31.2 ± 3.5), Flesch-Kincaid Grade Level (13.5 ± 0.7), Coleman-Liau Index (13) and DISCERN score (62 ± 4.4), indicative of superior comprehensibility and medical advice aligned with clinical guidelines (Table 2). This was followed closely by Bard, and then BingAI, which lagged in all categories. The only comparisons that yielded statistically non-significant outcomes (P > 0.05) were those between ChatGPT and Bard regarding their readability indices, and between ChatGPT/Bard and BingAI when assessing the Flesch Reading Ease scores (Table 3). All other comparative analyses demonstrated statistical significance (P < 0.05).
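For reference, the comparison statistic underlying Table 3 can be sketched as the pooled two-sample Student's t (equal-variance form). This is an illustrative sketch with made-up scores, not the study's data; converting t to a P-value additionally requires the t-distribution CDF (e.g. from scipy.stats), which is omitted here.

```python
import math
import statistics


def students_t(a: list, b: list) -> float:
    """Pooled two-sample Student's t statistic, as used to compare
    readability/reliability scores between pairs of models."""
    na, nb = len(a), len(b)
    # Pooled variance under the equal-variance assumption
    sp2 = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / (
        na + nb - 2
    )
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        sp2 * (1 / na + 1 / nb)
    )


# Illustrative (hypothetical) DISCERN-style scores from two models
t_stat = students_t([62, 58, 66], [50, 52, 48])
```

A larger |t| for a given sample size corresponds to a smaller P-value, so widely separated mean scores with small standard deviations are what drive the significant comparisons reported above.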
ChatGPT consistently gave more comprehensive answers than its counterparts and was the only AI language model to provide a semblance of a system for its reader to follow. Its approach is the only one that could conceivably form the foundation of an AI-guided clinical decision-making tool for junior doctors and hospital policy writers in the future.

Discussion
LLMs have gained considerable traction in the healthcare sector and academia, particularly in the sphere of medical information retrieval and decision-making, thanks to their algorithmic capabilities. Within this emerging AI technology, this study accentuates the value of LLMs in education.
LLMs such as ChatGPT demonstrate the ability to provide accurate, informative, and safe advice in clinical scenarios, while also emphasizing the importance of involving senior medical colleagues early in the decision-making process. The current cohort of junior doctors and medical students is among the most negatively affected by the reduced clinical exposure of their formative years. The necessary precautions, such as social distancing, limiting patient contact, and shifting to virtual learning, were implemented to curb the spread of the virus, but inadvertently led to a decrease in hands-on clinical experience. There is widespread concern among junior medical staff that the lack of direct patient interaction and real-world experience may have hindered the development of critical skills, such as physical examination techniques, bedside manner, and diagnostic acumen. 8 While ChatGPT and similar AI systems can never replace this lost clinical exposure and experience, they do have the potential to quickly and effectively assist clinicians in bridging knowledge gaps, serving as a useful tool in providing effective medical education and clinical decision support. 16 Incorporating LLMs into preclinical education offers medical students the opportunity to engage with a vast array of information, case studies, and clinical scenarios. AI systems can facilitate self-directed learning, enabling students to ask questions and receive immediate feedback, thereby enhancing their understanding of complex medical concepts. This can be particularly useful for students who are hesitant to ask questions in a group setting, for fear of embarrassment or being perceived as foolish. With its recognition of writing patterns and retention of prior conversations, ChatGPT can provide a personalized learning approach, catering to individual learning styles and addressing specific knowledge gaps, fostering a more effective educational experience. 17
Junior doctors may employ LLMs as a clinical decision support instrument, offering evidence-based suggestions and guidelines for patient diagnosis and management. This enhanced access to medical knowledge fosters a framework of globalized healthcare, enabling professionals and students to obtain information irrespective of geographical location or temporal constraints. While Bard provided links to some general resources, eventual models should be able to incorporate local, national and even hospital-based protocols and guidelines in their databases. Moreover, AI systems with ongoing learning capabilities ensure that their medical expertise remains current, aligning with the rapidly evolving medical field. Notably, ChatGPT exhibits an awareness of its limitations as a language model and judiciously directs users toward relevant specialists based on the provided data.
It is worth noting that while LLMs are pre-trained on large datasets from the internet, they do not contain all of this information within the model; the memory required to store large swathes of the internet would be prohibitive. Instead, LLMs use their pre-training data to build rules that guide their answers through a predictive algorithm. These rules allow an LLM to answer questions in the same manner that pre-programmed rules allow a calculator to solve mathematical problems, without having to store all of its training data. For this reason, LLMs may occasionally be erroneous or misleading based on a fundamental misconstruction of the rule used to answer, for example, when listing references for an academic text.
It is important to clarify that LLMs were not primarily designed to perform literature searches or serve as reference engines. ChatGPT and Bard, both machine learning models trained on diverse internet datasets, can generate text on a wide range of topics based on complex statistical models and rulesets, as discussed earlier. Such tools also provide an avenue for both medical students and doctors to engage with simulated patient scenarios, grasp the intricacies of various diseases, and explore different therapeutic strategies in a controlled, risk-free setting. AI-integrated learning platforms can offer an abundant array of information, enabling trainees not only to assimilate the process but also to refine their skills and supplement conventional teaching modalities. However, the application of AI in this realm warrants further scrutiny in terms of its effectiveness, accuracy and safety prior to advocating its extensive incorporation into medical education. AI-assisted clinical decision tools have been explored previously in the literature. Researchers in China compared treatment recommendations proposed by IBM Watson, a system that used natural language processing and machine learning to evaluate data, with actual clinical decisions by oncologists, finding high concordance for many types of cancer. 18 Various studies have also explored the employment of computer-aided systems in early stroke detection, 19 with some automated tools demonstrating detection times for CT head abnormalities on par with, or even faster than, radiologists. 20
Rather than being designated for narrow and esoteric purposes, the advantage of ChatGPT and other similar LLMs is that they are capable of providing answers to a wide range of queries, even within the specialized field of medicine. They are intuitive and user-friendly, capable of responding informally or using jargon, depending on the user's preference. This makes them accessible to all learners, from medical students to experts in the field.
As the first study to evaluate multiple LLMs in surgical education, this study presents a few limitations that merit consideration. First, the evaluation of the LLMs' responses was conducted by a panel of three plastic surgeons and two junior doctors; although their proficiency lends credibility to the assessment, the evaluation of the quality of responses generated by LLMs may still harbour a degree of subjectivity. Responses were also evaluated against local hospital protocols and guidelines, which may differ from those of other jurisdictions. Other limitations must be addressed when incorporating LLMs into medical education and practice. ChatGPT's knowledge base is confined to its training dataset up to September 2021, which may not incorporate the latest medical findings or guidelines and will not integrate individual hospital health protocols that can vary based on population demographics. This could potentially result in outdated or inaccurate recommendations that adversely affect patient care.
The AI systems may also inadvertently produce misleading or inaccurate responses due to biases in the training data or misconceptions of intricate medical concepts. For instance, they did not assign adequate clinical importance to the fact that the patient in scenario B had undergone an extended surgery with immobility and displayed early indications of a VTE event. Although BingAI and Bard were able to provide largely correct and appropriate information in response to each prompt, they did so in a broad, haphazard manner, indiscriminately providing a wide range of options rather than tailoring their responses to be contextually specific. ChatGPT performed better in this respect, with a more stepwise approach to the problem; however, it was still not sufficiently targeted to the clinical scenario presented. Furthermore, it does not distinguish the importance or priority of the steps in its management approach for each patient. It provides what could be considered textbook answers, rather than a nuanced consideration of which management steps may be most appropriate for the individual patient. This highlights the necessity for healthcare professionals to critically appraise AI-generated recommendations and judiciously cross-reference them with their own clinical experience and judgement.
Finally, there are significant societal, ethical, and legal concerns that must be addressed. Society's goodwill towards and trust in medical professionals rest on the fact that there is an implicit level of education, knowledge and evidence behind each treatment decision made. Even then, medical professionals do make mistakes, sometimes causing harm, for which they should be held accountable. The responsibility for harm caused by a treatment decision made by an AI is uncertain. The extent of liability placed on the creator, the developer or the user cannot be ascertained until such matters are litigated in the court system.

Conclusion
Overall, this study demonstrates that LLMs such as ChatGPT, Bard and BingAI may aid self-directed learning and personalized learning approaches, and provide valuable clinical decision-making support for junior doctors. It remains essential to acknowledge that, at present, LLMs should never replace the expertise of experienced educators. The prospective advancement of LLMs through training on valid medical databases, coupled with rigorous scrutiny of their outputs by subject matter experts, could potentially foster a valid tool for medical education.

Fig. 1. (a)-(d) Scenario A - Patient with right lower limb cellulitis that begins to rapidly deteriorate on the ward.

Table 1
Evaluation of large language model platforms' responses

Table 2
Readability and reliability of LLMs' responses

Table 3
T-test difference analysis of large language model readability and reliability