Reporting guidelines for artificial intelligence in healthcare research

Reporting guidelines are structured tools developed using explicit methodology that specify the minimum information required by researchers when reporting a study. The use of artificial intelligence (AI) reporting guidelines that address potential sources of bias specific to studies involving AI interventions has the potential to improve the quality of AI studies, through improvements in their design and delivery, and the completeness and transparency of their reporting. With a number of guidance documents relating to AI studies emerging from different specialist societies, this Review article provides researchers with some key principles for selecting the most appropriate reporting guidelines for a study involving an AI intervention. As the main determinants of a high‐quality study are contained within the methodology of the study design rather than the intervention, researchers are recommended to use reporting guidelines that are specific to the study design, and then supplement them with AI‐specific guidance contained within available AI reporting guidelines.


| INTRODUCTION
Artificial intelligence (AI) is one of the most exciting and rapidly evolving areas of healthcare research in the 21st century. The number of AI-related academic publications has risen exponentially in recent years and scientific advances have revealed many potential AI healthcare applications. 1 Examples are wide-ranging and include AI algorithms for screening and triage, 2-4 diagnosis, 5-7 prognostication, 8,9 decision support 10 and treatment recommendation. 11 Naturally, the potential for AI to transform healthcare (by, for example, offering earlier and more accurate diagnoses, providing novel insights for the understanding of diseases, and enabling more efficient service delivery) has generated an enormous amount of excitement amongst patients, the public, politicians, and healthcare professionals. Concerns have been raised, however, that the potential impact of AI on healthcare may be overhyped. 12 Two robust systematic reviews and meta-analyses of AI medical imaging studies have confirmed that concerns raised regarding AI hype were well-founded. 13,14 Both of the reviews revealed that poor study design, delivery, and reporting were endemic in the field; 13,14 and, in fact, one of the reviews showed that <1% of the 20 000 AI medical imaging studies included were of sufficient quality to provide a trustworthy evaluation of the AI algorithm versus a human reader. 13 There has since been a collective response calling for better design, delivery and reporting of AI studies, and guidance and tools to support this. Reporting guidelines are tools that specify the minimum information required when reporting a study. 15 The use of AI reporting guidelines has the potential to improve the quality of such studies, through improvements in their design and delivery, and the completeness and transparency of their reporting.
The speciality of ophthalmology is at the forefront of AI healthcare research, with notable advances including the use of machine learning techniques such as deep learning to diagnose diabetic retinopathy, 16 detect papilloedema 17 and predict cardiovascular risk factors using fundus photographs. 18 In fact, IDx-DR, an AI system for diabetic retinopathy screening, was the first U.S. Food and Drug Administration-approved diagnostic AI algorithm. 16 As such, the speciality of ophthalmology has an opportunity to lead by example by adopting and endorsing AI reporting guidelines.
This Review article provides an overview of AI reporting guidelines and their application in healthcare research to help researchers across all medical specialities, not only in the speciality of ophthalmology, improve the design, delivery, reporting, and ultimately, the quality of their work in the AI era.

| WHAT ARE REPORTING GUIDELINES?
Reporting guidelines are structured tools developed using explicit methodology that specify the minimum information required by researchers when reporting a study. 15 The aims of reporting guidelines are to ensure that studies can be understood by readers and reviewers, replicated by other researchers, used by healthcare professionals to make clinical decisions and included in systematic reviews and meta-analyses. 15 By highlighting key considerations relevant to different study types and demonstrating "what good looks like," reporting guidelines help improve the design, delivery and, ultimately, the quality of studies; and, by providing a clear list of minimum content that should appear in a paper, they help improve the completeness and transparency of their reporting, making areas of potential bias more visible and thus enabling more effective evaluation of studies.
The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network is an international initiative that seeks to improve the quality of healthcare research by promoting the development and use of robust reporting guidelines. 19 The network provides Toolkits to support the development, selection and use of reporting guidelines; 20 and has a Library that contains an up-to-date collection of reporting guidelines. 21

| WHY DO WE NEED AI-SPECIFIC REPORTING GUIDELINES?
The reporting guidelines developed with and published by the EQUATOR Network are organised according to study type (i.e. separate guidelines for clinical trials, diagnostic accuracy studies, observational studies etc.). This promotes a consistent approach for addressing the same study type, regardless of the speciality area, and indeed across most types of interventions. It has been recognised, however, that specific interventions (e.g. social and psychological interventions), 22 outcomes (e.g. patient-reported outcomes) 23,24 and scenarios (e.g. cluster trials) 25 require specific extensions to these reporting guidelines. For example, recognition of the potential sources of bias specific to studies involving social and psychological interventions has led to the development of social and psychological intervention-specific extensions to existing EQUATOR reporting guidelines. 22 In the same way, recognition of the potential sources of bias specific to studies involving AI interventions has led to the development of AI-specific extensions to existing EQUATOR reporting guidelines.
In parallel to the work of the EQUATOR Network, a number of experts and institutions have also developed their own recommendations for reporting AI studies. Unlike the AI reporting guidelines developed with and published by the EQUATOR Network, which start with study design and focus on clinical evaluation, these start with the intervention (AI) and usually have a broader scope, including algorithm development, data transparency, ethical standards and utility.

| WHICH AI REPORTING GUIDELINES SHOULD I USE FOR MY STUDY?
All research studies, including those involving AI interventions, should be reported using the most appropriate reporting guidelines available. The range of different reporting guidelines that have been proposed in the field of AI can, however, make it difficult for researchers to determine which AI reporting guidelines to use for their study. Whilst researchers may be primarily interested in the intervention, particularly a complex intervention such as an AI system, they should bear in mind that it is only by considering every other component of a study that they can be confident that it will provide a trustworthy evaluation of the intervention.
Most of the factors that determine the quality of a study are not specific to the intervention, but are related to the study design. The EQUATOR Network has developed reporting guidelines based around the study design, which are deliberately "speciality neutral" and, indeed, "intervention neutral," except where an intervention has distinct attributes or risks of bias that require additional explicit reporting requirements. Fundamentally, whilst there may be some distinct characteristics of AI interventions, clinical evaluation for AI should not be overly "exceptionalised," but should use well-established methodology including good study design, delivery and reporting, as would be undertaken for other health technologies.

| WHICH AI REPORTING GUIDELINES SHOULD I USE FOR A CLINICAL TRIAL?
The strongest evidence for the safety, clinical effectiveness and cost-effectiveness of an AI intervention requires evaluation in the context of one or more well-designed, well-delivered and well-reported clinical trials. 26 This is because two fundamental characteristics of clinical trials, namely randomisation and a control arm, allow researchers to make causal inferences about interventions and their outcomes. 26 Although most AI interventions have not yet been evaluated in clinical trials, this is likely to be an area of rapid expansion as the field evolves. Such studies are particularly important because they will potentially be a key part of the evidence that regulators, payers and policymakers use to decide whether an AI intervention is sufficiently safe and effective to be approved and commissioned for use.
The risk to patients and the public of an AI algorithm being approved and commissioned for clinical use based on potentially incomplete information highlighted the need for reporting guidelines specific to AI interventions. To address this, the SPIRIT-AI and CONSORT-AI Steering Group announced in October 2019 an initiative to develop the first reporting guidelines for clinical trials involving AI interventions. 27 This work was undertaken with the EQUATOR Network, and produced AI-specific extensions to its standard reporting guidelines, which are in widespread use and mandated by many leading journals.
The AI-specific extensions to the SPIRIT 2013 28 (Standard Protocol Items: Recommendations for Interventional Trials) and CONSORT 2010 29 (Consolidated Standards of Reporting Trials) guidelines for clinical trial protocols and reports were developed in accordance with the EQUATOR Network framework using a Delphi methodology with an international multidisciplinary consortium. The SPIRIT-AI 30-32 and CONSORT-AI 33-35 guidelines were co-published in September 2020, representing the first global standards for reporting AI studies.
The SPIRIT-AI 30-32 and CONSORT-AI 33-35 guidelines are extensions to, rather than replacements for, the original SPIRIT 2013 28 and CONSORT 2010 29 guidelines. Elements of clinical trials of AI interventions that require detailed and specific reporting according to the SPIRIT-AI 30-32 and CONSORT-AI 33-35 guidelines are similar across both clinical trial protocols and reports. Examples include, but are not limited to: the algorithm version; the procedure for acquiring, selecting and pre-processing the input data; and the criteria for inclusion at the level of the input data in addition to the level of participants.
For example, in a clinical trial evaluating an AI intervention for diagnosing diabetic retinopathy using fundus photographs, researchers should clearly specify in the clinical trial protocol which version of the AI algorithm they intend to use; and then state in the clinical trial report which version of the AI algorithm they actually used, and whether the algorithm version changed during the course of the trial.
Researchers should also clearly and comprehensively outline their proposed plan for acquiring, selecting and pre-processing the fundus photographs prior to analysis by the AI algorithm in the clinical trial protocol; and then describe how the fundus photographs were actually acquired, selected and pre-processed prior to analysis by the AI algorithm in the clinical trial report.
Additionally, the eligibility criteria should be clearly and comprehensively reported by researchers at both the level of participants, such as patient age, and the level of input data, such as fundus photograph image quality. This is important as it enables reviewers to differentiate between AI interventions that only work in ideal conditions and those that are more robust and suitable for real-world settings, such as large-scale national screening programmes.
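As an illustration only, the AI-specific items described above could be tracked in a simple record while a protocol and report are being drafted. The Python sketch below is not part of the SPIRIT-AI or CONSORT-AI guidelines: the class name, field names, prompt wording and example values are all hypothetical, and the checks cover only the three examples given in this section rather than the full checklists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AITrialReportingItems:
    """Hypothetical record of the three example items discussed in the text."""
    planned_algorithm_version: str = ""          # stated in the trial protocol
    used_algorithm_versions: List[str] = field(default_factory=list)  # stated in the trial report
    input_data_handling: str = ""                # acquisition, selection, pre-processing
    participant_eligibility: str = ""            # e.g. patient age
    input_data_eligibility: str = ""             # e.g. fundus photograph image quality


def reporting_prompts(items: AITrialReportingItems) -> List[str]:
    """Return a list of points that still need to be addressed in the write-up."""
    prompts = []
    if not items.planned_algorithm_version:
        prompts.append("protocol: algorithm version not specified")
    if not items.used_algorithm_versions:
        prompts.append("report: algorithm version used not stated")
    elif len(set(items.used_algorithm_versions)) > 1:
        prompts.append("report: algorithm version changed during the trial; describe the change")
    if not items.input_data_handling:
        prompts.append("input data: acquisition, selection and pre-processing not described")
    if not items.participant_eligibility:
        prompts.append("eligibility: participant-level criteria missing")
    if not items.input_data_eligibility:
        prompts.append("eligibility: input-data-level criteria missing")
    return prompts


# Illustrative values only; they do not describe any real trial.
example = AITrialReportingItems(
    planned_algorithm_version="v1.0",
    used_algorithm_versions=["v1.0", "v1.1"],
    input_data_handling="fundus photographs acquired, selected and pre-processed as per protocol",
    participant_eligibility="adults attending diabetic eye screening",
)

for prompt in reporting_prompts(example):
    print(prompt)
```

In this example the record flags that the algorithm version changed during the trial and that no eligibility criteria were given at the level of the input data, which are the kinds of omission the guidelines are intended to make visible.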

| WHICH AI REPORTING GUIDELINES SHOULD I USE FOR A DIAGNOSTIC ACCURACY STUDY?
Diagnostic AI algorithms for detecting diseases, such as IDx-DR for diabetic retinopathy, 16 promise to achieve diagnostic accuracies similar to those provided by expert clinicians whilst simultaneously reducing healthcare resource use. 36 At present, a significant proportion of potential AI healthcare applications are diagnostic AI algorithms, but much of the evidence supporting their use has been disseminated in the absence of AI-specific reporting guidelines. 36 In terms of study design, diagnostic accuracy studies should be reported according to the STARD 2015 37 (Standards for Reporting Diagnostic Accuracy Studies) guidelines. An AI-specific extension, STARD-AI, 36 is, at the time of writing, under development for use alongside the STARD 2015 37 guidelines. Other AI-specific guidelines such as MI-CLAIM 38 may have particular value here, addressing important AI-specific elements that are not currently covered by the STARD 2015 37 guidelines.
Reporting of the non-AI components of the design and delivery of diagnostic accuracy studies should continue to adhere to the original STARD 2015 37 guidelines. They are designed to ensure that important information relevant to the design and delivery of diagnostic accuracy studies is clearly and comprehensively reported, including information relating to the participants (e.g. how, where and when potentially eligible participants were identified and what criteria were used to establish their eligibility), test methods (e.g. details of, rationale for choosing and order of the index test and reference standard) and data analysis (e.g. how indeterminate test results and missing data were handled).

| WHICH AI REPORTING GUIDELINES SHOULD I USE FOR A PREDICTION MODEL STUDY?
Clinical prediction models estimate the likelihood of an individual having (diagnostic) or developing (prognostic) disease using predictor variables (risk factors such as age, sex and biomarkers). 39 The ability of AI to analyse large and complex datasets of predictor variables has led to the development of several potential AI prediction models, such as AI algorithms for predicting conversion to wet age-related macular degeneration. 8 The most widely accepted EQUATOR reporting guidelines for prediction model studies are the TRIPOD 2015 40 (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines. Many aspects of these guidelines are applicable to AI prediction models. Despite this, their use for disseminating AI prediction model studies has been poor, for various reasons such as differences in terminology, differences in the statistical basis on which AI prediction models and non-AI prediction models are built, and the fact that predictors are often hidden in the "black box" of AI prediction models. 41 To address this, an AI-specific extension, TRIPOD-AI, 41 is, at the time of writing, under development for use alongside the TRIPOD 2015 40 guidelines.
Reporting of the non-AI elements of the design and delivery of prediction model studies should continue to adhere to the original TRIPOD 2015 40 guidelines. They are designed to ensure that important information relevant to the design and delivery of prediction model studies is completely and transparently reported, including information relating to the participants (e.g. the eligibility criteria, participant characteristics and the flow of participants through the study), outcomes (e.g. definition of the outcome that is predicted by the prediction model including how and when it was assessed), predictors (e.g. definition of all the predictors used by the prediction model including how and when they were assessed) and model (e.g. how the prediction model was developed, how the prediction model was used and how the prediction model performed). 40 As with diagnostic accuracy studies, these guidelines may be supplemented by the valuable AI-specific guidance found in guidelines such as MI-CLAIM. 38

| WHAT OTHER AI-SPECIFIC GUIDELINES SHOULD I BE AWARE OF?
The MI-CLAIM 38 (Minimum Information about CLinical Artificial Intelligence Modelling) guidelines are EQUATOR reporting guidelines that were published in September 2020 to improve the reporting of information regarding clinical AI algorithms. The guidelines are designed to inform readers and users about an AI algorithm by ensuring that information about how it was developed and validated is clearly and comprehensively reported. They are split into six parts: (1) study design; (2) separation of data into partitions for model training and model testing; (3) optimisation and final model selection; (4) performance evaluation; (5) model examination and (6) reproducible pipeline. The guidelines are distinct relative to the AI reporting guidelines discussed above in that there is considerable attention given to the reporting of information relating to the AI algorithm itself rather than how it was used in the context of a specific study. In this way, their use may provide value alongside the use of AI-specific extensions such as the SPIRIT-AI, 30-32 CONSORT-AI, 33-35 STARD-AI 36 and TRIPOD-AI 41 guidelines.
The MINIMAR 42 (MINimum Information for Medical AI Reporting) guidelines are non-EQUATOR reporting guidelines published in June 2020 by the American Medical Informatics Association. The guidelines are designed for studies reporting the use of AI systems in healthcare. Their purpose is to ensure that the minimum information required to adequately understand the intended predictions, target populations and potential biases of an AI algorithm is clearly and comprehensively reported. Unlike other reporting guidelines, which provide a checklist of items that require reporting by researchers, these guidelines provide suggestions for reporting information across four key areas of clinical AI studies: (1) study population and setting; (2) patient demographic characteristics; (3) model architecture and (4) model evaluation. There is overlap between the MINIMAR 42 guidelines and the MI-CLAIM 38 guidelines, which both focus on AI algorithms and how they were developed and validated.
The DECIDE-AI 43 (Developmental and Exploratory Clinical Investigation of DEcision-support systems driven by Artificial Intelligence) guidelines are EQUATOR reporting guidelines that, at the time of writing, are under development. These guidelines are distinct in that their intended purpose is to improve the evaluation and reporting of human factors in clinical AI studies. The DECIDE-AI 43 guidelines will address the essential role that human factors will have in how a clinical AI algorithm performs. This also brings in the important distinction between AI assessed in isolation and an AI intervention assessed as part of an AI/human system; commentators sometimes refer to the potentially improved performance over the human system as "augmented intelligence." The guidelines are intended to be used in early-stage, small-scale clinical studies of AI interventions, when the intervention itself and the human-machine interaction may still be undergoing refinement prior to fuller evaluation. Such studies, if conducted, would take place after development and technical validation (in diagnostic accuracy studies or prediction model studies that are covered by the STARD-AI 36 and TRIPOD-AI 41 guidelines respectively), but before clinical validation (in clinical trials that are covered by the SPIRIT-AI 30-32 and CONSORT-AI 33-35 guidelines). By ensuring that adequate attention is placed on the human-AI interaction during the development and evaluation of clinical AI algorithms at this stage of the translational pipeline, the developers of the DECIDE-AI 43 guidelines argue that their use will ultimately enable more efficient translation of AI algorithms from code to clinic.
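The selection principle set out in this Review (choose the guideline for the study design first, then add the AI-specific guidance) can be summarised as a simple lookup. The Python sketch below is an illustrative aid rather than an official tool: the dictionary keys, class name and function name are hypothetical, the pairings restate only what is described in the sections above, and the "under development" status reflects the time of writing and should be checked against the EQUATOR Library before use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GuidelineChoice:
    base: Optional[str]                # study-design guideline, selected first
    ai_guideline: str                  # AI-specific guideline for that design
    ai_published: bool                 # status as described in this Review
    supplements: Tuple[str, ...] = ()  # guidance focused on the algorithm itself


# Hypothetical study-design keys mapped to the pairings described in the text.
GUIDELINES = {
    "trial_protocol": GuidelineChoice("SPIRIT 2013", "SPIRIT-AI", True, ("MI-CLAIM",)),
    "trial_report": GuidelineChoice("CONSORT 2010", "CONSORT-AI", True, ("MI-CLAIM",)),
    "diagnostic_accuracy": GuidelineChoice("STARD 2015", "STARD-AI", False, ("MI-CLAIM",)),
    "prediction_model": GuidelineChoice("TRIPOD 2015", "TRIPOD-AI", False, ("MI-CLAIM",)),
    "early_clinical_evaluation": GuidelineChoice(None, "DECIDE-AI", False),
}


def select_guidelines(study_design: str) -> List[str]:
    """List guidelines in the order suggested: design first, then AI-specific."""
    choice = GUIDELINES[study_design]  # raises KeyError for an unknown design
    ordered = []
    if choice.base is not None:
        ordered.append(choice.base)
    if choice.ai_published:
        ordered.append(choice.ai_guideline)
    else:
        ordered.append(f"{choice.ai_guideline} (under development)")
    ordered.extend(choice.supplements)
    return ordered


for design in GUIDELINES:
    print(design, "->", ", ".join(select_guidelines(design)))
```

The ordering of each returned list mirrors the recommendation that the study design, rather than the intervention, should drive the primary choice of reporting guideline.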

| CONCLUSION
This Review article has provided an overview of AI reporting guidelines and their application in healthcare research. The article should help researchers across all medical specialities, including in the speciality of ophthalmology, better understand, select, and use AI reporting guidelines. In this way, this article should help researchers improve the design, delivery, reporting, and, ultimately, the quality of their work in the AI era.
Ultimately, the impact of AI-specific reporting guidelines on improving the quality of AI healthcare research is determined largely by the extent to which researchers use them when reporting studies, medical journal editors require authors to use them when submitting studies, and reviewers use them when appraising studies. As a speciality at the forefront of AI healthcare research, ophthalmology can lead by example by adopting and endorsing their use.