Utility of artificial intelligence‐based large language models in ophthalmic care

With the introduction of ChatGPT, artificial intelligence (AI)‐based large language models (LLMs) are rapidly becoming popular within the scientific community. They use natural language processing to generate human‐like responses to queries. However, the application of LLMs in ophthalmic care, and comparisons of the abilities of different LLMs with those of their human counterparts, remain under‐reported.


INTRODUCTION
Artificial intelligence (AI)-based chatbots are regularly used in customer service functions in business and marketing.1 Although the use of AI is still at a relatively early stage in eyecare, it is used by ophthalmologists, optometrists and researchers in screening and diagnosing eye diseases.2 Current applications of AI in ophthalmology are mostly focused on image-based techniques used for image analysis, recognition and diagnosis of ophthalmic diseases using ophthalmic data from fundus photography and optical coherence tomography images. The two subfields of AI are machine learning and natural language processing (NLP).3 Machine learning requires 'supervised learning', where experts label and grade individual features and severity from images to develop the AI. A subset of machine learning is deep learning, which shows promise in disease screening, diagnosis, risk stratification, treatment monitoring and improved patient care for eyes with myopia,4 optic disc abnormalities (e.g., glaucoma, papilledema),5-7 retinal diseases (e.g., age-related macular degeneration, diabetic retinopathy),2 cataract2 and corneal disorders.8,9 Deep learning, sometimes referred to as 'unsupervised learning', bypasses this need to label or grade individual features, and instead uses features of the entire image to compare with a diagnosis determined by an expert.10 The individual predictive features associated with the classification of a disease severity or its diagnosis are 'self-learned' by the AI developed from deep learning. Either way, the performance of deep learning and machine learning is comparable, with decreased error rates, and is better than traditional techniques of screening, diagnosis and management of diseases at a tertiary eyecare level.
However, deep learning is limited by homogeneous training data sets, limited data availability for diseases, and disagreement and wide interobserver variability in defining disease phenotypes.10 Also, most AI systems have the 'black box' problem, where the inputs and operations are unknown to the user. These impenetrable AI systems arrive at a conclusion or decision without providing any reasoning or explanation as to how it was reached; this opaque approach reduces practitioner and patient acceptance of the AI11 and favours the human expert over AI in decision-making and treatment planning.12 Even though it is technically possible, AI has not yet been able to reach its target of converging AI and clinical care.13 NLP, the other subset of AI, is focused on extracting and processing text data, including written and spoken words. NLP can transform human language (free text) or images into code that computers understand, and has been used to date primarily for information retrieval and text extraction.3
However, NLP is susceptible to error due to the variable nature of human-generated natural text, and is limited by the requirement of a huge data set for training NLP models, which may or may not utilise deep learning or machine learning. Moreover, NLPs are often trained in specific domains, which affects how word embeddings (a method of extracting features from text based on the distance between two words) interpret relationships between words in different contexts. Most NLP applications developed so far are in the English language. It is pertinent to develop NLP in non-English languages to promote equity in care, reduce disparity and reach a wider population. Writer/user presumptions about the input (completeness and composition of words, image quality, noise), prior understanding and context are other sources of variability in NLP. Until now, NLP applications have involved text data extraction from clinical, operative and electronic health record notes in the screening of cataract14 and glaucoma,15 triaging of outpatient referrals to ophthalmic specialists,16 prediction of the quality-of-life impact of vision loss associated with diabetes17 and prediction of operative complications related to cataract surgery,18 to name a few examples.
Key points

• With the huge interest in and popularity of ChatGPT, artificial intelligence-based large language models have a massive role in providing patient information, disease diagnosis, symptom triaging, ophthalmic education and other applications.
• Human experts are the most accurate (86%) in diagnosing disease, whereas ChatGPT-4 tops in responding to text-based ophthalmology examination questions (75.9%) and in providing information and answering patient queries (84.6%).
• Large language models perform best in general ophthalmology but worse in ophthalmic subspecialties. Responses are prompt-specific and can often be misleading due to their apparent comprehensiveness and plausible-sounding fake responses.

Large language models (LLMs) are the class of AI, primarily succeeding deep learning models, that are capable of learning and recognising patterns. They are specific models within NLP that are capable of processing, understanding, generating and manipulating human language.19 LLMs are trained to predict the sequence of words in a natural human language query and, in response, generate a novel sequence of words. These models are designed to capture the complexities and nuances of natural language, enabling them to perform a wide range of language-related tasks. They are often based on transformer architectures, which are deep learning models designed to handle sequences of data, such as sentences in natural language. The key innovation in the transformer architecture is the 'self-attention' mechanism, which allows the model to weigh the importance of different words or positions in a sequence while processing each word. They are trained on vast amounts of text data to learn the patterns, semantics and context of natural language.20 These models consist of multiple layers of neural networks that process sequential data, such as words or characters, and learn patterns and relationships within the data. The training process of an LLM involves exposing it to large amounts of text data, such as books, articles, websites and other sources of human language. By learning from this vast corpus of text, the model develops a deep understanding of language patterns, grammar, context and even some semantic meaning.20 As with NLP, an LLM presumes that the input (text), which influences the accuracy of the output, is accurate and up to date.
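The self-attention mechanism described above can be illustrated with a minimal sketch in plain Python. The two-dimensional 'embeddings' are toy values; real transformers apply learned query/key/value projections to high-dimensional embeddings and use many attention heads, none of which is shown here.

```python
import math

def softmax(xs):
    """Normalise raw similarity scores into attention weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each position attends to every position: weights come from
    dot-product similarity, and each output is a weighted average."""
    outputs = []
    for q in vectors:  # the position currently being processed
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, vectors))
               for d in range(len(q))]
        outputs.append(out)
    return outputs

# Three toy 2-dimensional word embeddings; the first two are similar,
# so they attend strongly to each other.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended = self_attention(emb)
```

Because each output is a convex combination of the inputs, similar positions pull each other's representations together, which is the sense in which the model "weighs the importance" of other words.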

RISE OF THE LARGE LANGUAGE MODELS IN HEALTHCARE
LLMs have come into prominence following the recent introduction of the Chat Generative Pre-Trained Transformer, more popularly known as ChatGPT.21,22 ChatGPT (OpenAI, openai.com) currently relies on GPT-4, a language model that uses deep learning (DL)-based NLP to produce text, reportedly with approximately 170 trillion parameters (OpenAI has not disclosed the actual figure), an upgrade from the GPT-3.5 version with approximately 175 billion parameters. Like the neural network of the human brain, these parameters are the weights of connections learned during the training stage of a neural network. This massive network underpins LLMs like ChatGPT, which use supervised and reinforcement-based learning strategies.
Using NLP, an LLM can generate responses to queries which simulate human conversation. Their application in healthcare has seen widespread use in education, research, practice and electronic health records, among others.22 In early 2023, ChatGPT gained widespread attention in the medical community worldwide after it performed at or near the passing threshold of 60% accuracy on the United States Medical Licensing Examination (USMLE), primarily because of its ability to respond to an array of natural language queries and its human-like interaction.22,23 ChatGPT-3.5 passed all three difficulty levels of the USMLE, designed using both multiple choice questions (MCQs) and open-ended questions, while displaying a high level of insight and concordance in its explanations.23 A slightly better result (67.6% and 67.1% accuracy) was obtained on the MCQs from the USMLE using the instruction-tuned variant of Flan (Fine-tuned Language Net)-PaLM (Pathways Language Model) and Med-PaLM (research.google), which are LLMs with 540 billion parameters.24 The later versions, GPT-….25 In addition, a team of physicians found that ChatGPT generated written responses to healthcare-related patient questions, collected from a public social media forum, that were comparable to those by physicians in quality and empathy, even surpassing some physician responses.26 However, this does not account for most ophthalmic patient communications being verbalised and accompanied by expressive body language. With the development of newer LLMs and the exponential progress in technology, newer applications of LLMs have been identified, but not all have been assessed comprehensively.

AIM OF THIS REVIEW
The aim of this review was to provide an overview of the current literature investigating the recent advent of AI-based LLMs in ophthalmic care. In addition, the most widely used application(s) of LLMs and their future scope in eyecare are described. Finally, the limitations and challenges of implementing LLMs into clinical practice are highlighted.

METHOD OF LITERATURE SEARCH
A comprehensive review of the literature was performed through a keyword-based and medical subject headings (MeSH)-based search of PubMed/Medline, Web of Science, Scopus, Embase, Google Scholar and pre-print servers until 31 December 2023. The following keywords, their synonyms and combinations were used: "artificial intelligence", "large language model", "LLM", "ChatGPT", "Generative Pre-Trained Transformer", "chatbot", "ophthalmology", "optometry", "ophthalmic", "eye", "health", "vision", "eye disease", "care". Boolean operators "AND", "OR" and "NOT" were used to combine all search sets. When a specific application of LLM was identified, the specific factor was also used as a keyword in a second search to identify additional publications with prospective data on the specific use. Relevant articles cited in the reference lists of articles obtained through this search were also reviewed. Studies were included if they described original research using LLMs in eyecare. After the selection of articles, all specific applications were grouped, and results were discussed accordingly. The authors manually reviewed each study's title, abstract and manuscript text to validate the relevance of the studies to both eyecare and LLMs. Data extracted from each study included: the authors and year of publication, study aim, LLM used, content/query asked, grader/evaluator of the responses, results/outcome, main finding and conclusion. Any disagreements arising were resolved by discussion. All literature reviews and editorials were excluded, resulting in a yield of 70 original reports.
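The Boolean combination of keywords described above can be composed programmatically. The grouping of terms into two concept sets below is illustrative only, not the exact strategy submitted to each database:

```python
# Concept groups from the search strategy: terms within a group are
# combined with OR, and the groups are combined with AND.
model_terms = ['"artificial intelligence"', '"large language model"', '"LLM"',
               '"ChatGPT"', '"Generative Pre-Trained Transformer"', '"chatbot"']
field_terms = ['"ophthalmology"', '"optometry"', '"ophthalmic"', '"eye"',
               '"health"', '"vision"', '"eye disease"', '"care"']

def or_group(terms):
    """Join synonyms of one concept into a parenthesised OR block."""
    return "(" + " OR ".join(terms) + ")"

query = or_group(model_terms) + " AND " + or_group(field_terms)
```

The resulting string can then be pasted into a database's advanced-search field, optionally extended with NOT clauses or MeSH qualifiers.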

USE OF LLMs IN EYECARE
LLMs are trained on large corpora of text data and can interpret natural language inputs and respond with human-like real-time answers. Thus, the focus of LLMs in eyecare is to generate a list of potential diagnoses, information or guidance on management options. There has been a wide application of LLMs in eyecare so far, with studies examining different aspects of them. Figures 1 and 2 show the distribution of research articles published up to 31 December 2023, based on the ophthalmic domain and subspecialty. Figure 3 illustrates the distribution of accuracy scores (%) across the LLMs and domains.
FIGURE 2 Distribution of published studies based on ophthalmic specialty.

… (Table 2) in all the examinations. However, the judgement between GPT-4 and its human counterparts (experts) was not yet conclusive and therefore warrants further investigation. Results were similar among the American Academy of Ophthalmology's Basic and Clinical Science Course (46.0%-84.3%),41,42,46,55 OphthoQuestions (42.7%-84%),41,47,48 Fellowship of the Royal College of Ophthalmologists (FRCOphth) examination questions (32%-88.4%)51,57 and StatPearls (55.5%-73.2%).49 In comparison, a lower score was observed on Brazilian board examination questions (41.5%)44 and a higher one on European board examinations (91%).50 The performance of LLMs was found to be better for the subspecialties of medicine, cornea, refractive surgery and oncology, and weakest for glaucoma, neuro-ophthalmology, pathology, tumours, optics, oculoplastics and mathematical concepts. The weakness of GPT-3 and GPT-3.5 in answering questions on retina and vitreous (0%-23.1%)44,48 was overcome in a later version (GPT-4: 100% correct).47 The performance of both GPT-3.5 and GPT-4 was better for first-order (recall) questions and lower for higher-order (evaluative/analytical) and image-based questions. Even though ChatGPT has relatively fewer hallucinations ('imagined' or 'fabricated' information) and errors in logical reasoning in comparison with other LLMs, ChatGPT provided explanations and additional insights for both its correct and incorrect responses (63%-98%), which can be misleading due to their apparent comprehensiveness (refer to Table 3). Overall, the accuracy and performance of LLMs is improving, and they often surpass the established benchmarks or thresholds of specialised assessments. The ability to pass specialised ophthalmic examinations shows that LLMs can serve as a valuable study aid for board certification examinations, as they can generate practice questions, explanations and feedback to enhance preparation. Besides, they can be a rapid source of specialised knowledge for busy clinicians. This ability to qualify
for specialised examinations implies that conventional written evaluations fail to gauge clinical competence. The fact that AI can pass an ophthalmology examination could imply a decrease in the quality of clinicians entering the profession. However, shifting the assessments towards clinical scenarios and decision-making could improve the quality of future graduates. Fortunately, ophthalmic clinical practice relies heavily on physical examination of the eye, an element that cannot be attained solely through text-based interactions with an LLM.58
TABLE 1 Summary of current studies on the performance of LLM in diagnosing ophthalmic diseases and triage accuracy.
• … were highest (80%), then between GPT-4 and GPT-3.5 (65%), and between GPT-3.5 and experts (60%).
• Specialists took 20-40 min to diagnose cases, while ChatGPT took only a couple of minutes.
• Accuracy of GPT-4.0 in diagnosing patients with various corneal conditions was markedly improved compared with GPT-3.5.
• May enhance corneal diagnostics, improve patient interaction and experience, as well as medical education.
ChatGPT-3.5

TABLE 1 (Continued)
… Google Bard (54.1%, 40.5%-72%). Compared with GPT-3.5 and Google Bard, GPT-4 exhibited a reduced proportion of responses that received poor ratings. Of the common questions on myopia, 5.4%-9.7% of the responses were incorrect and 54.8%-87.5% were considered accurate.61,64 Comparably, for retinal conditions and surgeries, the accuracy of LLM responses varied between 15.4% and 100%.21,65,67 Nonetheless, GPT-4's responses to common questions on vitreoretinal surgeries were found to be challenging and difficult to comprehend for an average individual without specialised knowledge. Grading of the responses using the Flesch-Kincaid grade level and Flesch reading ease scores indicated that at least a college graduate's reading level is required to comprehend them.65 The accuracy dropped drastically for responses on lacrimal drainage disorders (40%).59 Even though the answers generated by GPT-3.5 had a similar error rate to the responses provided by humans, with a comparable likelihood and extent of harm, the presence of incorrect (3.6%-25%) and sometimes fabricated data (due to hallucinations) without supervision/moderation can be harmful, especially if applied in ophthalmic emergencies. Patients and parents should not rely solely on LLMs for their medical guidance, especially on treatment and the side effects of medications. The information gathered should serve as a supplement or a basis for engaging in more individualised discussions with a human expert for specialist care and counselling (see Table 4).
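The Flesch metrics used in these readability studies are simple formulas over sentence, word and syllable counts. A sketch in Python follows; the syllable counter is a crude vowel-group heuristic, so scores are approximate, and the two example sentences are invented for illustration:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return (Flesch reading ease, Flesch-Kincaid grade level)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade

simple = "The eye is red. It hurts a lot."
complex_ = ("Pars plana vitrectomy necessitates comprehensive anatomical "
            "understanding of vitreoretinal interface pathophysiology.")
ease_simple, grade_simple = readability(simple)
ease_complex, grade_complex = readability(complex_)
```

Higher reading-ease scores mean easier text, while the grade level approximates the years of schooling needed; chatbot answers scoring above grade 12 exceed the reading level recommended for patient materials.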

Performance in other potential applications
Further to the applications mentioned above, LLMs have been tested for a diverse range of purposes. Chatbots can generate average-quality scientific abstracts (41.7% correct) but remain plagued by fake data and references when not provided with a data set.85 GPT-4 scores slightly better than GPT-3.5, with lower fake scores and hallucination rates (Table 5). Chatbots can assist people with relatively weak writing or language skills to prepare written assignments both faster and of higher quality. However, there is a growing concern that AI chatbots are being abused in writing essays, scientific abstracts and even manuscripts.85 Given the number of factual errors these chatbots generate and their apparently comprehensive responses, it is important for authors to know their limitations and pitfalls, and for publishers/editors to identify AI-generated text in manuscripts.86 GPT-4 can categorise refractive surgery candidates to their ideal procedures (68%-88% correct) with low to moderate agreement (0.399-0.610) with clinicians.87 However, when it came to recommending ophthalmologists based on their specialty or proximity (location), AI chatbots were unreliable, with only 26.2%-37.5% accurate recommendations.88 ChatGPT can accurately (59%) generate International Classification of Diseases (ICD) codes from mock retina encounters89 and even predict the risk of diabetic retinopathy (54%-73%) upon receiving prompts with patient details.90

Applicability to generate novel ideas for future research

When ChatGPT-4 was questioned about 'future research', 'further innovation' and 'technological advancements' in oculoplastic research, it could not come up with any novel idea and displayed convergent thinking, conveying only known ideas for research.91 ChatGPT's focus is on speed, accuracy, logic and recognising familiar techniques through reapplication of the stored/trained information. It can be viewed as a supplementary research tool, rather than a primary source of original research ideas.

Ophthalmic operative notes
When asked to generate ophthalmic surgery operative notes, ChatGPT-4 was able to create comprehensive and detailed operative notes across ophthalmic subspecialties.92,93 However, the response depended largely on the quality of the input to GPT. The operative notes were thorough, yet they were deemed to require significant improvement. When appropriately prompted, ChatGPT could integrate specific medications, follow-up directions, consultation timing and location information into discharge summaries.92
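Because output quality depends heavily on the prompt, structured prompting helps ensure the required details reach the model. The following is a hypothetical prompt-template sketch; the field names and wording are illustrative, not those used in the cited studies, and the actual LLM API call is omitted:

```python
# Hypothetical template: structured case details are filled into a prompt
# before it is sent to an LLM API (the network call itself is not shown).
TEMPLATE = (
    "Write an ophthalmic operative note.\n"
    "Procedure: {procedure}\n"
    "Eye: {eye}\n"
    "Anaesthesia: {anaesthesia}\n"
    "Intraoperative findings: {findings}\n"
    "Include postoperative medications and follow-up instructions."
)

def build_prompt(procedure, eye, anaesthesia, findings):
    """Fill the template so no placeholder is left for the model to invent."""
    return TEMPLATE.format(procedure=procedure, eye=eye,
                           anaesthesia=anaesthesia, findings=findings)

prompt = build_prompt("phacoemulsification with IOL implantation",
                      "right", "topical", "dense nuclear sclerosis")
```

Templating of this kind makes it explicit which clinical facts were supplied by the surgeon and which text was generated, easing later review of the note.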

Literature review
Two prompts on dry eye disease were used to verify the ability of ChatGPT as a tool for conducting a literature review (GPT: 60%-75% vs. GPT Plus: 65%-68.3%). ChatGPT was found to provide article titles which were non-existent (60%-70%), with digital object identifiers (DOIs) belonging to different articles. The lack of training to distinguish between valid and invalid sources, and their relative importance in a field, may be the reason behind the errors encountered. The authors concluded that ChatGPT cannot consistently retrieve appropriate articles reliably and is not recommended for literature reviews on dry eye disease.94 A recent study reported the use of a 'zero-shot' classification approach, that is, without any previous training or exposure of the LLM, in categorising and trend analysis of ophthalmology articles, along with identification of emerging scientific trends. The proposed framework had high accuracy, with 86% correct classification of articles, as well as being time efficient.95

TABLE 3 (Continued)
• With the question section and cognitive level fixed, the difficulty index of questions was predictive of ChatGPT's accuracy.
• Improved accuracy was associated with an increased difficulty index (which implies an easier question).
• Explanations and additional insight were provided for 63% of questions.
• However, equal proportions of both correctly and incorrectly answered questions had the explanation and additional insight.
• The mean length of both the questions and the responses was similar among questions answered correctly and incorrectly.
• ChatGPT as used in this investigation did not …

Scientific writing

An evaluation of ChatGPT and DALL-E 2 (openai.com), a prompt-based image generator, in writing a scientific article on the 'Complications of the use of silicone oil in vitreoretinal surgery' showed insufficient accuracy and reliability to produce scientifically rigorous articles. Despite the topic being widely described in the literature, the language generated was uncommon, with conceptual errors; the information provided was superficial and the references did not represent the existing literature.96 The image generated by DALL-E 2 was not representative of the topic. Language models like ChatGPT can serve as research assistants but are limited to aiding in certain stages of research papers, such as data analysis, literature review, hypothesis generation and peer review, thereby offering valuable insights. With advancing technology, the potential impact of LLMs in research is expected to expand further. Hence, it is essential to recognise the contributions of LLMs appropriately and acknowledge their involvement. ChatGPT has been considered to meet the first three criteria outlined by the International Committee of Medical Journal Editors (ICMJE) but fails to meet the fourth criterion to qualify as an author.97 Concerns have been raised about the suitability of AI tools as co-authors of research papers, citing ethical considerations and copyright limitations.98 The existing legal framework does not permit non-human entities, such as AI tools, to possess copyright ownership rights. Furthermore, authors should take accountability for the integrity of the content, which cannot be effectively applied to LLMs.99 Most journals agree that LLMs do not qualify for sole authorship, and agree against LLMs as co-authors of research papers.97 Most ophthalmology journals, for example, JAMA Ophthalmology, discourage the inclusion of AI-generated content, but they do permit its use under the condition that authors acknowledge the AI models' involvement and assume responsibility for the content's integrity. Notably, journals published by Elsevier (Elsevier.com), which encompass prestigious ophthalmology publications including Ophthalmology, Progress in Retinal and Eye Research and the American Journal of Ophthalmology, allow AI tools solely for enhancing readability and clarity during the writing process. Nevertheless, authors are obligated to submit a declaration of AI usage as part of their manuscript submission.97 Interestingly, Wiley (wiley.com) states that AI cannot be considered capable of initiating an original piece of research without direction by human authors.100 The World Association of Medical Editors (WAME) recommended that chatbots cannot be authors, as they cannot meet ICMJE authorship criteria and do not understand conflicts of interest.101 To our knowledge, there are only two published review articles102,103 and a pre-print server manuscript104 which have listed ChatGPT as a co-author in academia. A third review published a corrigendum after Elsevier's Publishing Ethics Policies were revised, removing ChatGPT from the initially cited co-author list.105

In summary, LLMs' performance (median accuracy) depended on the iteration, the prompts used and the task domain involved. A human expert (86%) was the best performer for disease diagnosis, followed by ChatGPT-4 (82%). However, ChatGPT-4 (75.9%) topped the list when answering text-based ophthalmology examination questions, followed by Bing Chat (75%). Similarly, ChatGPT-4 (84.6%) and Bing Chat (78.5%) were the best options for providing information and answering questions (Figure 3). LLMs performed best in general ophthalmology but worse in the ophthalmic subspecialties. ChatGPT-4 outperformed human experts in symptom triaging (98% vs. 86%). However, LLMs can generate fake data and hallucinations based on their training data.

Voice-assisted ChatGPT
Combining the functionality of ChatGPT with a voice assistant is used by Farcana (farcana.com) in gaming. It offers gamers a new approach to general account management; for example, the AI-powered voice assistant can teach the bot specific actions like game strategy by extracting data from previous records of top players.106 Voice dictation is also helpful in familiarising beginners with the game mechanics, checking account balances, evaluating gaming activity and suggesting improvements.106 These features enhance the player's level and skill proficiency by allowing gamers to focus better on their game. Thus, chatbots with voice assistants can interact in real time and provide personalised services.106 This can improve the user experience and further automate the interaction process.
TABLE 4 Summary of current studies on the performance of LLM in providing information and answering questions.
• 40% were graded as correct, 35% partially correct and 25% incorrect.
• Responses were detailed but contained factual errors, especially for prompts related to surgery, and were not precisely evidence based.
• Performance of ChatGPT was unsatisfactory and can be termed average.
• Quality of responses improved in GPT-4.
• Based on the FKGL and Flesch reading ease scores, the answers appear to be difficult or very difficult for the average lay person to comprehend.
• A college graduate level of education is required to understand them.
• Most of the answers provided by ChatGPT-4 were consistently appropriate.
• ChatGPT and other LLMs in their current form are not a source of factual information.
• Patients, physicians and laypersons should be advised of the limitations of these tools.

TABLE 4 (Continued)
• Narrative content is more reproducible than numerical information.
• Frequently lists irrelevant or nonexistent references.
• ChatGPT holds promise as an informative tool.
• It is important to review ChatGPT-generated content cautiously.

ChatFFA model
A bilingual (Chinese and English) model utilising ChatGPT-3.5 for answering tasks related to fundus fluorescein angiography (FFA) has been developed.107 The model achieved 60%-68% accuracy (as graded by an ophthalmologist) in report generation and disease diagnosis. In cases of microaneurysms, diabetic retinopathy and arteriosclerosis, it reached an accuracy of 87%-94%.107

Electronic health records
The way in which large clinical language models, consisting of billions of parameters, can aid medical AI systems in effectively utilising unstructured Electronic Health Records (EHRs) remains uncertain. Fine-tuned LLMs like BioBERT (nvidia.com), BlueBERT (ncbi.nlm.nih.gov), DistilBERT (huggingface.co) and ClinicalBERT (mit.edu) demonstrated impressive capabilities (81.5%-84.3% precision) in accurately detecting ophthalmological examination components, such as slit lamp or fundus examination, from clinical notes.108 This highlights the promising prospect of leveraging these language models to extract and comprehend pertinent details efficiently from extensive patient records; a task that would otherwise be quite challenging. GatorTron (nvidia.com), an LLM trained on >90 billion words of text (including >82 billion words of de-identified clinical text), was evaluated on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI) and medical question answering (MQA).109,110 The GatorTron model was found to increase the scale of the clinical language model significantly, expanding it to 8.9 billion parameters from 110 million (BioBERT or PubMedBERT) and 345 million (ClinicalBERT) parameters.109 As a result, performance was enhanced across the NLP tasks; the GatorTron-large model achieved mean accuracies of 90% and 93% for NLI and MQA, respectively (a 6.8%-9.5% improvement in accuracy over BioBERT and ClinicalBERT).110 These advancements have the potential to be integrated into medical AI systems, ultimately leading to improved healthcare and/or ophthalmic care delivery.
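As a contrast to the fine-tuned transformer models above, even a trivial keyword baseline illustrates the underlying task of detecting examination components in free-text notes. The keyword patterns and the sample note below are invented for illustration; models like ClinicalBERT learn far richer contextual cues than any such list:

```python
import re

# Illustrative keyword patterns for two examination components;
# a real system would handle negation, abbreviations and misspellings.
COMPONENT_PATTERNS = {
    "slit lamp examination": r"slit[- ]?lamp",
    "fundus examination": r"fundus",
}

def detect_components(note):
    """Return the sorted names of components mentioned in a clinical note."""
    text = note.lower()
    return sorted(name for name, pattern in COMPONENT_PATTERNS.items()
                  if re.search(pattern, text))

note = ("Slit-lamp exam showed a quiet anterior chamber. "
        "Dilated fundus examination revealed no diabetic retinopathy.")
found = detect_components(note)
```

The gap between such a baseline and the reported 81.5%-84.3% precision of fine-tuned LLMs comes precisely from cases the keyword approach cannot handle, such as negated mentions ("fundus exam deferred") or paraphrases.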

Understanding patient satisfaction
Although LLMs are not yet used to determine patient satisfaction, NLP of Healthgrades (Healthgrades.com, a verified physician review website available in the USA) reviews has been used to understand the determinants of patient satisfaction and the sentiment scores of ophthalmologists.111 Since LLMs exhibit remarkable performance across a diverse set of NLP tasks, LLMs are also expected to estimate patient satisfaction.
• Unsuitable responses lacked urgency in the referral of acute conditions and detailed inquiry to the level necessary for appropriate triage.
• No sources cited or follow-up questions asked.
• Clinicians should be aware of the public health risk from patients using online chatbots.
• Chatbots may not be made consistently or appropriately aware of the urgency of symptoms.
• Discharge summaries were valid but significantly generic.
• Operative notes were comprehensive, but they needed substantial refinement.
• ChatGPT consistently acknowledges its errors when confronted with factual inaccuracies.
• Mistakes can be prevented in subsequent instances with similar prompts.

Enabling gene discovery
Med-PaLM 2 has shown the ability to accurately identify mouse genes responsible for susceptibility to six biomedical traits, including diabetes and cataract. It can also detect a novel causative murine genetic factor for susceptibility to spontaneous hearing loss.113 This result demonstrates the capability of LLMs in analysing gene-phenotype relationships and facilitating gene discovery.

Surgical training and education
The role of an LLM (ChatGPT-4) has been evaluated as a teaching assistant for plastic surgery residents. A set of eight roles (across cognitive, psychomotor and affective domains) was identified, with inter-observer agreement between experts for content output ranging between 30% and 100%.114 Incorporating LLMs into surgery residency programmes can provide an interactive, dynamic and personalised learning experience.

Drug discovery, development and other uses
LLMs can be helpful in the discovery and development of drugs. ChatGPT-4 added a retrieval plug-in that searches new documents and assays to update its knowledge base, designed to help drug discovery.115 Furthermore, LLMs can provide the blueprint of drug compounds with new structures, helping to predict a drug molecule's pharmacodynamics, pharmacokinetics and toxicity. They also assist in understanding a drug's absorption, distribution, metabolism and excretion, thus optimising drug properties, therapeutic target discovery and toxicity assessment.116 While taking multiple medications, patients are at increased risk of experiencing adverse events or drug toxicity due to drug-drug interactions (DDIs). ChatGPT can predict and explain DDIs. Although not always correct, ChatGPT is an efficient tool for understanding DDIs, especially for those without access to healthcare facilities.117 LLMs have the potential to assist with the complex process of drug repositioning or repurposing, which is finding new therapeutic purposes and targets for existing drugs. This approach saves the time and investment required for new drug development without human trials.116 Finally, DrugChat (ai.ucsd.edu) is a ChatGPT-like, pharmaceutical domain-specific LLM designed to analyse drug compounds, answer questions and generate text descriptions for drugs.118 It can enable conversational analysis of drug compounds from both text and drug molecule graphs. Thus, LLMs have a role in accelerating drug discovery, finding novel therapeutic molecules and performing functions that benefit patients.

Diagnosing ocular surface disorders
As well as diagnosing ophthalmic diseases of the posterior segment, NLP has also been used for identifying disorders such as herpes zoster ophthalmicus (HZO)119 and for the quantification of microbial keratitis.120 The sensitivity and specificity of NLP in quantifying microbial keratitis from electronic health records ranged from 75% to 96% and from 91% to 96%, respectively.120 Similarly, NLP used to screen for HZO from clinical notes had a high sensitivity (95.6%) and specificity (99.3%).119 This indicates considerable potential for LLMs in detecting ocular surface disorders.
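For reference, the sensitivity and specificity figures quoted above are derived from a screening tool's confusion matrix. A minimal sketch in Python (the counts below are illustrative, chosen only to reproduce the quoted HZO percentages, and are not the actual study data):

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Illustrative counts: 43 true positives, 2 false negatives,
# 287 true negatives, 2 false positives.
sens, spec = sensitivity_specificity(43, 2, 287, 2)
# sens ≈ 0.956 (95.6%), spec ≈ 0.993 (99.3%)
```

Sensitivity measures how many true cases the screen catches; specificity measures how many non-cases it correctly rules out, which is why both are reported together.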

Newer ophthalmic domain-specific LLM
Although most LLM systems were not built for any specific domain or field of knowledge, their popularity in health and ophthalmic care is tremendous. However, the text and images used for ophthalmic conditions differ from general web content. Thus, general-domain LLMs may have difficulty with professional conversations, resulting in incorrect answers or false facts. Ophthalmology large language-and-vision assistant (OphGLM) is a newly developed LLM (ailab.lv) that uses visual ability (images) specific to ophthalmic conditions.121 The OphGLM model conducts disease assessment and diagnosis by analysing fundus images and incorporating ophthalmic knowledge data along with real medical conversations. Additionally, the model incorporates visual capabilities and introduces a novel data set for ophthalmic multimodal instruction tracking and dialogue fine-tuning.121 Although experimental data sets demonstrate outstanding accuracy for diseases such as diabetic retinopathy, glaucoma and other ophthalmic conditions, it will be important to examine its real-life clinical application. Medical and scientific text domain-specific LLMs such as Med-PaLM 2, PubMedBERT (microsoft.com), BioBERT, ScholarBERT (public.resource.org), SciBERT (allenai.com), DARE (healx.ai), ClinicalBERT and BioWordVec (ncbi.nlm.nih.gov) already outperform the foundation LLMs in biomedical tasks.25,95,122

LIMITATIONS, CHALLENGES AND ADVANCEMENT OF LLM
The clinical deployment of ChatGPT and similar applications has been hindered by various issues and limitations.
First and foremost is the limited training of both GPT-3.5 and GPT-4, which extends only up to September 2021. Without continuous updating of this knowledge, responses may be entirely outdated and even harmful, especially in health and eye care. Second, a lack of domain-specific training data leads to the 'garbage in, garbage out' problem. Although the LLM boasts an impressive 175 billion parameters, GPT-3.5 utilised only a small fraction of the available data (570 GB) for its initial training.123 Third, the absence of real-time internet access fundamentally limits LLMs like ChatGPT, with the exception of a few LLMs such as BlenderBot 3 (ai.meta.com)124 and Sparrow (deepmind.google),125 which can access the internet while generating responses. Fourth, intensive fine-tuning and training have produced LLMs that generate responses which sound plausible and coherent, although not necessarily accurate, when presented with queries. These 'hallucinations' or 'fact fabrications' are inaccurate or fake pieces of information invented when the queried information is not represented in the training data set. With the advent of GPT-4 expanded with Advanced Data Analysis (Python, python.org), if prompted, GPT-4 can even create a fake data set for research and publication with an author's desired outcome.126 Moreover, responses are consensus-based, not evidence-based, and can therefore be misleading. Besides continuously updating their knowledge from publications, clinicians possess another advantage: access to data not yet published, from conference presentations and workshops. Thus, in the absence of benchmarking and the presence of fake data, AI may make it difficult to differentiate fact from fiction. However, LLMs can self-improve by employing chain-of-thought prompting and self-consistency-enabled autonomous fine-tuning, leading to a 5%-10% enhancement in the reasoning capabilities of an LLM.127

LLMs have undergone extensive development over the years, resulting in their emergence with 'few-shot' or 'zero-shot' capabilities. This means they can recognise, interpret and generate text with minimal or no fine-tuning. In other words, few-shot and zero-shot AIs complete tasks with or without exposure to initial examples of the task, generalising accurately to unseen examples.123 These impressive few-shot and zero-shot properties emerge when model size, data set size and computational resources reach a significant scale.128 Fifth, like any AI system, LLM processing has the 'black box' problem: the absence of clear explanations or reasoning behind the model's output, which reduces confidence and makes interpretation and clinical decision-making harder.129 Equally, DALL-E 2, which generates images in response to text prompts, risks a false-positive diagnosis when reviewed by a lay person (patient), or even a potentially dangerous false-negative diagnosis, which might lead to false reassurance and delayed treatment.123 The inability of earlier iterations of ChatGPT to process images may require incorporation of other transfer models such as the Contrastive Language-Image Pretraining (CLIP) model. CLIP can match an input image against candidate text descriptions,130 which can be used as an adjunct for LLMs without image capability. However, the CLIP variant concerned is designed for the broader domain of medical imaging and not specifically for the eye, which might limit its sensitivity in differentiating eye diseases. Furthermore, the fusion of ChatGPT with the Argil plug-in (argil.ai) might help create images from textual prompts.58 Sixth, LLMs also raise ethical considerations and challenges, such as bias in data and outputs, privacy concerns and the responsible use of AI technology. Addressing these challenges is crucial to ensure the ethical and responsible deployment of LLMs in various health and eye care applications.131 GPT-4 raises privacy concerns and lacks accountability in handling patient-identifiable and personal data.123 Seventh, it should be borne in mind that, while LLMs do not possess emotions or consciousness, they can be designed to generate text, based on patterns learned from data, that appears empathetic and considerate in certain contexts.132 However, empathy in healthcare settings is conveyed mainly verbally and through body language, so the ethical considerations of using empathetic LLMs should be weighed, especially when dealing with sensitive topics or vulnerable users. Finally, most of the existing research is focused on the qualitative appraisal of LLMs in artificial settings. Real-world clinical interventions tested through randomised controlled trials evaluating safety, efficacy, morbidity and other parameters are needed for better understanding and deployment of LLMs in clinical care.
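The self-consistency strategy mentioned above can be illustrated with a minimal sketch: several chain-of-thought completions are sampled for the same question and the most frequent final answer is kept. The sampled answers below are hard-coded stand-ins for real model outputs, and the function name is illustrative.

```python
from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote over independently sampled final answers."""
    return Counter(sampled_answers).most_common(1)[0][0]

# In practice, each string would be the final answer parsed from one
# sampled chain-of-thought completion; here they are hard-coded.
samples = ["glaucoma", "glaucoma", "papilledema", "glaucoma", "optic neuritis"]
consensus = self_consistent_answer(samples)  # the most frequent answer wins
```

The intuition is that independent reasoning paths that arrive at the same answer are more likely to be correct than any single sampled path, which is why self-consistency improves reasoning benchmarks.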
Despite the fears and hype, the barriers to LLMs replacing healthcare professionals in any capacity remain substantial.133 LLMs continue to be afflicted by mistakes and errors. Fundamentally, LLMs are limited by the quality of information available for training or for browsing in response to queries, which remains governed by human activity (e.g., research, policymaking).
To summarise, the existing uses and benefits of LLMs in eye care are: (i) disease diagnostic support; (ii) symptom assessment and triage; (iii) patient education, information and engagement; (iv) helping prepare for ophthalmic qualifying examinations; (v) literature analysis/review; (vi) preparing ophthalmic operative notes and (vii) drafting editable responses to patient queries. The potential future benefits, predicted from current trends and the abilities of LLMs, include: (i) remote monitoring and telemedicine; (ii) readily available and accurate information for clinicians in busy clinics; (iii) evidence-based practice and continuing education; (iv) predicting disease progression; (v) vision correction options; (vi) contact lens recommendations; (vii) research assistance; (viii) clinical documentation and report generation; (ix) electronic health records; (x) clinical decision-making support; (xi) image analysis and interpretation; (xii) understanding patient satisfaction; (xiii) pre- and post-operative patient counselling; (xiv) gene discovery for ophthalmic diseases; (xv) surgical training and education; (xvi) drug discovery, development, repurposing and interactions and (xvii) improving the design of future LLMs. The earlier iterations of ChatGPT (GPT-3 and GPT-3.5) could only process text-based prompts. Although ChatGPT-4 Vision can process and analyse text and image inputs, its performance has been below par. The way forward will be to develop an ophthalmic domain-specific LLM that is competent and has both text and image processing capabilities, like ChatFFA and OphGLM. Another possibility for improving the diagnostic capability of LLMs could be to synergise them with image-based deep learning algorithms, thereby enhancing the potential for contextual interpretation of text and image inputs simultaneously.58

Notwithstanding that the potential benefits of using LLMs in eyecare are immense, only some have been explored so far. These findings are similar to those previously observed in a systematic review22 of healthcare, where conversational LLMs were found to reduce overall healthcare costs alongside a multitude of benefits in research, education and clinical practice.
Unlike the earlier LLMs, the more recent versions can process, analyse and interpret images. If they can be trained for specific tasks, such as grading diabetic retinopathy or measuring the cup-disc ratio, then the opportunities for ChatGPT become vast, at a much lower cost and with the greater convenience of a single application that combines and interprets both image and text inputs. It would give access to a plethora of medical knowledge that can be updated and improved. And yet the downsides of LLMs should be addressed before embracing these applications in eyecare. Moreover, clinical decision-making requires human interaction and consideration of the socio-economic and psychological factors of the patient alongside clinical knowledge. LLMs should be adopted conservatively, by identifying the scope and specific areas of use in eyecare while maintaining human judgement for clinical decisions.

CONCLUSION
With the introduction of Cerebras-GPT (cerebras.net) (a family of seven GPT models ranging from 111 million to 13 billion parameters),134 GPT-4 (reported at 170 trillion parameters) and those already in existence such as GPT-3.5, GPT-3, Chinchilla (deepmind.google), Meta OPT (ai.meta.com), Pythia (eleuther.ai), PubMedGPT (also known as BioMedLM, crfm.stanford.edu), BioGPT (microsoft.com), BioBERT and PaLM, it appears that LLM-based AIs are tools of the present and the future,11 capable of undertaking a host of tasks from clinical decision-making to disease diagnosis, raising patient awareness, preparing for examinations and symptom triaging. Although diagnostic accuracy varies widely with the LLM iteration, they are more efficient, faster and more repeatable than human non-experts and trainees.22 Primarily, LLMs are used as medical assistants; their use can be broadened into roles which might potentially save consultation time and reduce the burden on clinicians and supporting staff, such as drafting responses that could be edited and sent to patients.22,26,61

Assessing the practical application of LLMs in a real-world clinical setting is essential before they can be deployed clinically. Moreover, the patient's perspective, attitude towards and acceptance of LLMs in a variety of ethical and minimally harmful clinical contexts must be considered. Although the image processing capability of GPT-4 alongside its text processing is expected to overcome some of these limitations and outdo existing AI systems in accurate disease screening and diagnosis, developing ophthalmic domain-specialised LLMs combining multimodal ophthalmic data will be a significant step for the future. Nevertheless, given the limitations of LLMs, caution should be exercised before embracing these applications. As AI continues to advance, it is essential to ensure that the potential benefits of introducing these applications in eyecare are maximised while minimising the risks of implementation. This is achievable only through the engagement of the eyecare community and by staying up to date with ongoing developments.

FUNDING INFORMATION
None.

CONFLICT OF INTEREST STATEMENT
The authors have no proprietary or commercial interest in any materials discussed in this article.

TABLE 2, TABLE 3 (table notes)
SPSS Statistics version 29.0 (ibm.com) was used to estimate the median, IQR and range of accuracies. The Kruskal-Wallis H omnibus test statistic was used to compare the accuracies between the LLMs. The significance level for all statistical tests was set at p < 0.05, with Bonferroni correction for post hoc pairwise comparisons. Data are represented as median with full range (%).
Abbreviations: LLM, large language model; ns, not significant.
a Significance values have been adjusted by the Bonferroni correction for multiple tests.
Abbreviations: American Academy of Ophthalmology's Basic and Clinical Science Course Self-Assessment Program; AI, artificial intelligence; DOI, digital object identifier; FKGL, Flesch-Kincaid Grade Level; FRCOphth, Fellowship of the Royal College of Ophthalmologists; FRES, Flesch Reading Ease Score; LLM, large language model; NA, not available; NHS, National Health Service; OKAP, Ophthalmology Knowledge Assessment Program; PEM, patient education materials; PEMAT-P, Patient Education Materials Assessment Tool-Printable; PG, postgraduate; TED, thyroid eye disease; UG, undergraduate; VKC, vernal keratoconjunctivitis.
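The statistics described in the table notes can be reproduced with standard tools. A minimal pure-Python sketch of the Kruskal-Wallis H statistic and the Bonferroni-adjusted per-comparison significance level follows; the tie correction factor is omitted for brevity, and in practice a package such as SPSS or scipy.stats.kruskal would be used.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic over two or more samples (tie correction omitted)."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        # Find the run of tied values and assign each the average rank.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    return 12 / (n_total * (n_total + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)

def bonferroni_alpha(alpha: float, n_groups: int) -> float:
    """Adjusted per-comparison alpha for all pairwise post hoc tests."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    return alpha / n_comparisons
```

For example, comparing the accuracies of four LLMs pairwise yields six post hoc comparisons, so each would be tested at 0.05 / 6 ≈ 0.0083.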