Current applications and future potential of ChatGPT in radiology: A systematic review

This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement and collaboration, were assessed. Limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature. A total of 551 ChatGPT (versions 3.0 to 4.0) assessment events were included in our analysis. In the generation of academic papers, ChatGPT was found to output data inaccuracies 80% of the time. When ChatGPT was asked questions regarding common interventional radiology procedures, its answers contained entirely incorrect information 45% of the time. ChatGPT answered US board-style questions better when lower order thinking was required (P = 0.002). Improvements were seen between ChatGPT 3.5 and 4.0 with regard to imaging questions, with accuracy rates of 61% versus 85% (P = 0.009). ChatGPT achieved an average translational ability score of 4.27/5 on a Likert scale regarding CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While ChatGPT's promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.


Introduction
In recent years, significant advancements in artificial intelligence (AI) have propelled the fields of natural language processing and machine learning to new heights. 1,2 Among these breakthroughs, the emergence of chat-based language models, notably ChatGPT, has gained considerable attention. 3 ChatGPT represents a cutting-edge language model that employs deep learning techniques to generate responses mimicking human-like conversation. 4 ChatGPT is a generative AI platform built on a transformer architecture. This model employs self-attention to encode an input, establish relations within it, and then decode the encoded data with reference to those relations to form an output. It can be thought of as a pattern-recognition prediction system that bases its text outputs on previous inputs, or prompts, and it relies on a reward model that judges whether an output is acceptable. 5 The launch of ChatGPT 4.0 enabled user input in image or text format. 5 While originally designed to facilitate interactive dialogue, the integration of ChatGPT into various disciplines, including radiology, has presented compelling opportunities. 6,7 By harnessing its language comprehension capabilities and extensive knowledge base, ChatGPT has the potential to revolutionize how radiologists interact with medical data and information. 6 The potential uses of ChatGPT in the field of radiology are particularly promising. 7,8 Radiology involves a wide variety of imaging techniques to aid in diagnosis and intervention planning, generating vast amounts of imaging data and reports. 1 ChatGPT could assume a central role in analysing and extracting meaningful information from these datasets.
9-12 Moreover, it can support radiology education and training by acting as a virtual mentor, offering guidance to trainees and facilitating knowledge transfer across different levels of expertise, as a function of ChatGPT's ability to synthesize information in response to particular questions and output its content in text format. 11 As the field of radiology embraces technological advancements, the future applications of ChatGPT in enhancing workflow efficiency, improving diagnostic accuracy and streamlining radiology practices hold huge potential. 13,14 A recent review of the advantages and disadvantages of ChatGPT in radiological practice highlighted its potential educational benefits, as well as a potential role in the formation of differential diagnoses. Of note, the issues surrounding patient consent, regulatory approval and the dependence of ChatGPT's outputs on its inputs are possible points of contention that need to be addressed. 15 We performed a systematic review of the literature evaluating the current applications and future directions of ChatGPT in the field of radiology, with a particular emphasis on the validation of ChatGPT outputs against a standard, the application of ChatGPT in academic radiology, and the type and accuracy of ChatGPT outputs.
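The self-attention step described above can be illustrated with a minimal sketch. This is purely illustrative, not the ChatGPT implementation: the query/key/value projections are left as identities for brevity, whereas a real transformer learns separate weight matrices for each.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention over a list of d-dimensional vectors.

    Each position attends to every position (itself included): dot-product
    scores are scaled by sqrt(d), softmax-normalized into weights, and the
    output is the weight-averaged value vectors. Identity projections stand
    in for the learned query/key/value matrices of a real transformer.
    """
    d = len(x[0])
    out = []
    for q in x:
        # attention scores: dot(q, k) / sqrt(d) against every key k
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        # numerically stable softmax turns scores into weights summing to 1
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # output: attention-weighted sum of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out
```

On two one-hot input vectors, each output row is a convex combination of the inputs that leans towards the vector most similar to the query, which is the "relations" behaviour described above.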

Study design and reporting guidelines
This study is a systematic review of original studies and follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines. Our systematic review was registered on PROSPERO in June 2023 (ID: CRD42023432007).

Search strategy
The following databases were searched as part of the systematic review in June 2023: Medline, EMBASE and Web of Science. The systematic search algorithm is outlined in the Supporting Information (Appendix S1). The last date of search was 18 June 2023. A grey literature search was conducted using 'Google' (Mountain View, California) to further identify other suitable publications.

Inclusion criteria
The inclusion criteria were as follows: (1) published studies demonstrating the current use or future potential of ChatGPT in radiology; (2) ChatGPT's role was considered relevant if it referred to one of the following two domains: academic radiology and clinical validation of ChatGPT; (3) publications relating to both clinical and academic use were eligible for inclusion; and (4) published in the English language.
The exclusion criteria were as follows: (1) abstract-only publications; and (2) studies failing to discuss or denote ChatGPT in radiology.

Study selection, data extraction and critical appraisal
A database was created using the reference managing software EndNote X9 (The EndNote Team, Clarivate, Philadelphia, 2013). Abstracts of articles yielded from the search were reviewed by two independent reviewers (HCT and NOS) based on the inclusion and exclusion criteria detailed above, with particular reference to the domains named above. These domains were chosen because they relate to the main uses and concerns outlined in the Introduction, namely: how can ChatGPT enhance radiologist workflow, is the ChatGPT output factually correct, and how does it compare with radiologist reporting, the current standard. Following the removal of duplicate articles, discrepancies in judgement about the relevance of articles were resolved via an open discussion between the authors and an independent third reviewer (IB). An article was excluded from the review when the three reviewers came to an agreement. Full texts of short-listed articles were obtained and further evaluated by two independent reviewers to ensure that they met our inclusion criteria. The references of short-listed articles were then searched to identify other relevant studies that may have been missed through the initial search of online databases. Data were extracted by two independent reviewers from the full-text articles that met the inclusion criteria on full-text review.
Information extracted was based on the PICOTS framework (Population, Intervention, Comparator, Outcomes, Timing and Setting). In order to extract and store data efficiently, the Cochrane Collaboration screening and data extraction tool, Covidence (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia; available at www.covidence.org), was used. 16 Conflicts between the two reviewers were resolved following an open discussion and a final decision by the senior author (IB).
A critical appraisal of the methodological quality and risk of bias of the included studies was not performed. There is currently no risk of bias (ROB) tool specific to ChatGPT and, as there is no true study population, ROB assessment cannot be applied to these studies.

Search selection and characteristics
The literature search described above yielded a total of 180 results (Fig. 1). Following the removal of 41 duplicates, 139 studies were screened. After the initial screen, 25 abstracts were reviewed and assessed for eligibility, of which 14 were selected for full-text review. Six studies met the inclusion criteria and were included in the final analysis. 17-22 All studies were published in 2023. All studies had a prospective study design and were assessed in relation to the domains named in the inclusion criteria. See Table 1 for the methodological characteristics of the included studies.

Study demographics and performance assessment of ChatGPT
The ChatGPT model, based on the GPT architecture developed by OpenAI, 23 was utilized to examine its performance in a specific task in all cases. The models ranged from ChatGPT-3 to ChatGPT-4, and in three cases the model utilized was not disclosed. 17,18,21 The population group consisted of radiology articles (1/6), 17 multiple-choice questions (2/6), 19,20 radiology information questions (2/6) 18,22 and CT/MRI reports (1/6). 21 In all cases, the control group consisted of users who interacted with the traditional methods of assessment of accuracy/performance, while the experimental group engaged with the ChatGPT model. Total assessment events across all included studies amounted to 551. A Likert scale was used to assess performance in all studies. Population groups and performance assessment are illustrated in Table 2.

Academic radiology
The performance of ChatGPT in generating academic articles was assessed by Ariyaratne et al. 17 They performed a comparative investigation to assess the accuracy and calibre of academic articles generated by ChatGPT in contrast to those authored by human experts. The study focused on radiology articles, and the evaluation was performed by two fellowship-trained musculoskeletal radiologists. The articles were graded on a numerical scale ranging from 1 to 5, with higher scores denoting greater accuracy. The findings revealed that, among the five articles generated by ChatGPT, four exhibited substantial inaccuracies, including the citation of fictitious references. While one article exhibited commendable writing attributes, presenting a well-crafted introduction and insightful discussion, all of its references were deemed spurious. They concluded that these ChatGPT-generated articles could potentially deceive readers lacking the requisite expertise and emphasized that further advancements are imperative to rectify the shortcomings identified in this investigation.

Validation of ChatGPT by radiologist
The performance of ChatGPT in answering common questions posed within the field of interventional radiology was evaluated by Barat et al. 18 They undertook a prospective study to assess the accuracy and reliability of ChatGPT, within the context of interventional radiology (IR).To achieve this, a set of 20 questions pertaining to IR was devised by a highly experienced interventional radiologist.Questions presented to the model included whether or not venous access is mandatory for patient monitoring during a transarterial interventional radiology procedure.These questions were then put forward to ChatGPT through an online chat platform.To evaluate the correctness of responses, a consensus opinion was established by two highly experienced interventional radiologists.These experts employed publicly available resources such as consensus guidelines from IR societies, organ society guidelines and device user guides as the standard of reference for assessing the accuracy of ChatGPT's answers.Each response was appraised using a rigorous three-point scale: 0 denoting an entirely erroneous answer, 1 indicating partial accuracy and 2 representing a fully correct response.Analysis of the consensus opinion revealed that 45% of ChatGPT's answers were deemed entirely incorrect (graded as 0), 15% were regarded as partially accurate (graded as 1), and 40% were evaluated as fully correct (graded as 2), based on the established standard of reference.They concluded that while ChatGPT demonstrates potential utility within the domain of interventional radiology, it is crucial for users to be cognizant of its inherent limitations.

Validation of ChatGPT by trainee examination
Two studies by Bhayana et al. 19,20 aimed to evaluate ChatGPT's performance on radiology board-style examination questions that do not involve images. In their first study, a total of 150 multiple-choice questions were used, carefully designed to match the style, content and difficulty level of the Canadian Royal College and American Board of Radiology examinations. 19 The performance of ChatGPT was evaluated overall, by question type and by topic. Additionally, the confidence of the model's language in providing responses was assessed. Univariable analysis was conducted to examine the results. ChatGPT accurately answered 69% (104 out of 150) of the questions. Notably, the model performed better on questions that required lower order thinking (84%, 51 out of 61) compared with those that required higher order thinking (60%, 53 out of 89) (P = 0.002). The model's performance was relatively weaker on physics questions (40%, six out of 15) compared with clinical questions (73%, 98 out of 135) (P = 0.02). Notably, ChatGPT consistently employed confident language, even when the provided responses were incorrect (100%, 46 out of 46). The authors subsequently performed a follow-up study, comparing the performance of GPT-4, an updated model, with the GPT-3.5 model used in the initial project. 20 GPT-4 outperformed GPT-3.5 on higher order thinking questions, particularly in the description of imaging findings (85% vs 61%, P = 0.009) and application of concepts (90% vs 30%, P = 0.006). However, there was no significant improvement in GPT-4's performance compared with GPT-3.5 on lower order questions (80% vs 84%, P = 0.64).
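The study does not state which statistical test produced these P values; as an illustration only, a standard two-proportion z-test (one plausible choice for comparing two accuracy rates, sketched here in pure Python) approximately reproduces the reported P = 0.002 for the lower order versus higher order comparison:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions,
    using the pooled standard error. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Lower order thinking: 51/61 correct; higher order thinking: 53/89 correct
z, p = two_proportion_z_test(51, 61, 53, 89)
print(round(z, 2), round(p, 3))  # z ≈ 3.14, P ≈ 0.002
```

With such small physics subgroups (six out of 15), an exact test would ordinarily be preferred, which is why this sketch is framed as an approximation rather than the authors' method.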

Validation of ChatGPT compared with academic literature
The accuracy of responses and authenticity of references provided by ChatGPT were analysed by Wagner et al. 22 This study aimed to evaluate the accuracy of responses provided by ChatGPT-3 when presented with questions pertaining to the daily routine of radiologists, as described in Table 1. Additionally, they assessed the quality and authenticity of the references provided by ChatGPT-3. A total of 88 questions, equally distributed across eight subspecialty areas of radiology, were posed to ChatGPT-3 using textual prompts. The responses generated by ChatGPT-3 were cross-checked against peer-reviewed, PubMed-listed references to determine their correctness.
Of the 88 responses to radiological questions, 59 (67%) were determined to be accurate, while 29 (33%) contained errors.Among the 343 references provided, only 124 (36%) were accessible through internet searches, while 219 (64%) appeared to be generated by ChatGPT-3.Among the 124 identified references, only 47 (38%) were deemed to offer sufficient background information to correctly address 24 questions (38%).The authors concluded that ChatGPT-3 achieved a correct response rate of approximately two-thirds for questions related to the daily clinical routine of radiologists, with the remaining responses containing errors.Moreover, the majority of references provided by ChatGPT-3 were not discovered through internet searches and only a minority of the references contained accurate information to answer the questions, highlighting a major limitation to the integration of current ChatGPT models in modern medicine.
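The reported rates follow directly from the counts above; a trivial arithmetic check (all figures taken from the study as summarized here):

```python
# Counts reported by Wagner et al. for ChatGPT-3's responses and references
correct, total_questions = 59, 88
found_refs, total_refs = 124, 343

print(round(100 * correct / total_questions))               # 67 (% accurate responses)
print(round(100 * found_refs / total_refs))                 # 36 (% references found online)
print(round(100 * (total_refs - found_refs) / total_refs))  # 64 (% apparently fabricated)
```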

Diagnostic capacity of ChatGPT
The performance of ChatGPT in improving the readability of radiology reports was assessed by Lyu et al. 21 They compiled a dataset comprising 62 low-dose chest CT scans and 76 cerebral MRI scans. Two experienced radiologists (with 21 and 8 years of experience) evaluated ChatGPT's performance at translating the radiology reports into plain English, to aid in making recommendations for a patient or a radiologist, resulting in an average translation score of 4.27 on a 5-point Likert scale (5 for best and 1 for worst). On the same scale, the assessments indicated minimal information gaps (0.08) and instances of misinformation (0.07). ChatGPT provided pertinent suggestions, with approximately 37% (51/138) of cases receiving specific recommendations based on report findings.

Discussion
Our review highlights the current strengths and weaknesses of ChatGPT in the field of radiology. 25,26 To the best of our knowledge, this is the first systematic review of the literature evaluating the implementation of ChatGPT within this field.
ChatGPT can assist radiologists by automating administrative tasks, including report generation and data entry. 1,21 This automation can reduce the burden of documentation, enabling radiologists to focus more on critical analysis and decision-making. 12,21 One notable application is radiology report generation, where these models can automate the production of comprehensive and standardized reports based on image findings. 21,22,27,28 Despite the promising potential of ChatGPT within radiology, the practical implementation of this technology in real-life clinical settings has been relatively limited thus far. 6,9,29 The integration of ChatGPT into radiology workflows requires extensive validation, addressing technical challenges and ensuring patient safety. 30,31 While research and development efforts are ongoing, it is crucial to acknowledge that the current utilization of ChatGPT in radiology is still at an early stage, and more work needs to be done to translate its potential into meaningful applications for improving patient care. 7,27,29 ChatGPT's prospective role in radiology could be that of a screening tool, proficient at efficiently processing vast volumes of medical imaging data to identify cases that merit urgent attention, thus serving as a triage system. At present, however, it is imperative to emphasize that the final validation and interpretation of results must be performed by proficient human radiologists with the requisite clinical expertise. We do note an improvement in results comparing ChatGPT 4.0 with earlier iterations; this may be due to a number of improvements in the model, including accepting images as inputs, enabling a greater number of iterations, training the model with text prediction and reinforcement from human feedback, and enabling the use of third-party data. 5 The potential applications of GPT-based models in radiology are vast and offer exciting prospects for the future.
7,12 One envisioned possibility is real-time assistance during image interpretation. 6,11,32 GPT-based models could provide immediate feedback to radiologists, highlighting areas of concern or providing additional insights. 29,33 This real-time assistance has the potential to expedite the diagnostic process, enabling prompt and accurate decision-making. Furthermore, by integrating patient data, clinical history and relevant literature, GPT-based models could support radiologists in making more informed diagnoses and treatment recommendations. 7,33 ChatGPT potentially offers more than traditional or conventional AI models due to its interactive chat interface. 34 Unlike traditional AI models that may require specific input formats or commands, ChatGPT allows users to engage in natural language conversations. 7,24 This conversational approach could make it easier for users to interact with the model, ask questions and seek clarifications in a more intuitive and human-like manner. 12,29,33 A primary limitation of ChatGPT within the field of radiology is the absence of strong evidence pertaining to image processing. 6,24 Previously, it was thought that the software lacked the requisite capabilities for effectively analysing medical images, impeding its realistic usability in this domain 12 ; however, the newer iterations of the model may alter this limitation. Currently, ChatGPT may find less utility when trainees are preparing for viva examinations (such as the RANZCR (Royal Australian and New Zealand College of Radiologists) OSCER viva examination) owing to their image-interpretation-heavy focus; however, the integration of image processing software and the ability to input images in ChatGPT 4.0 may enhance usability. 35,36 Additionally, it may be useful in phase 1 and 2 written examinations. Despite advancements in natural language processing and image recognition, the integration of these modalities into a cohesive model remains a challenge.
29 Ethical and legal implications associated with the use of GPT-based models in radiology also warrant attention. 6,11 Issues such as data privacy, security and liability need to be carefully addressed. 29 The potential for unintended consequences and the need for clear guidelines regarding the responsibility of radiologists in using AI models should be considered. 2,37 Ensuring transparency, informed consent and compliance with relevant regulations is paramount in the integration of GPT-based models into radiology practice. 29 One further concern is the inaccuracies exhibited by ChatGPT systems, particularly those arising from biases within the training data or the model's inability to handle specific cases. 24 As radiology relies on accurate interpretations and diagnoses, ensuring the reliability of ChatGPT systems is crucial. 8 Ongoing efforts to improve training data quality, mitigate biases and enhance model performance are necessary to address these concerns. ChatGPT's responses are based on the provided information, 4 lacking the ability to infer contextual knowledge or evaluate the reliability of input data. The lack of supervision of model training inputs by medical professionals is also a major weakness of the system. Additionally, the scope to ask ChatGPT any question, even those that are clinically irrelevant, is an area of concern. These limitations can potentially lead to inaccurate interpretations or recommendations, emphasizing the need for human oversight and critical appraisal.
6,9,10,12 As ChatGPT's outputs are a function of the information supplied to it during training, it is important that representative and diverse clinical information is used that is relevant to the patient cohort in which the model will be applied. This requirement may imply different ChatGPT models for different populations or areas of imaging. As discussed in prior reviews and within this manuscript, failure to diligently train and validate the model may produce erroneous outputs, resulting in patient harm and possible medicolegal consequences. 15 In summary, AI language models like ChatGPT can be powerful tools in internal auditing and validation, augmenting human capabilities and improving efficiency. However, they should be seen as a complement to human expertise rather than a replacement.

Conclusion
ChatGPT demonstrates substantial potential to reform radiology by enhancing workflow efficiency, improving diagnostic accuracy and aiding in research. 12 However, ethical considerations, limitations and the need for human oversight are crucial for the safe and effective integration of ChatGPT into clinical practice. Additionally, as the patient population has access to ChatGPT, there may be potential issues surrounding imaging reports, or the images themselves, being self-managed or self-diagnosed. With ongoing advancements and refinement, ChatGPT has the potential to transform radiological workflows and possibly improve patient care. 6 Further research and collaboration between AI developers, radiologists and policymakers are essential to harness the full capabilities of ChatGPT and address the challenges associated with its implementation.

Table 1 .
Methodological characteristics and primary outcome of included studies

Table 2 .
Population groups and performance assessment of ChatGPT. 5-point Likert scale: 1, incorrect; 2, some correct content; 3, approximately half correct content; 4, largely correct content; 5, entirely correct content.

© 2024 The Authors. Journal of Medical Imaging and Radiation Oncology published by John Wiley & Sons Australia, Ltd on behalf of the Royal Australian and New Zealand College of Radiologists.