Minimum labelling requirements for dermatology artificial intelligence‐based Software as Medical Device (SaMD): A consensus statement

Artificial intelligence (AI) holds remarkable potential to improve care delivery in dermatology. End users (health professionals and general public) of AI‐based Software as Medical Devices (SaMD) require relevant labelling information to ensure that these devices can be used appropriately. Currently, there are no clear minimum labelling requirements for dermatology AI‐based SaMDs.


INTRODUCTION
Machine learning, a facet of artificial intelligence (AI), is ubiquitous in daily life. In medicine, AI algorithms have been developed for multiple purposes, most successfully in fields reliant upon image analysis such as radiology, ophthalmology, pathology and dermatology.1 In these fields, AI tackles tasks such as disease screening, diagnostic classification, segmentation and pathology detection.2 Despite immense potential, AI is still an evolving technology with many questions yet to be clarified before it can be safely integrated into clinical pathways.1,3,4 The Achilles' heel of AI is its propensity to 'overfit' to the data set used to train the algorithm.1,5 This phenomenon affects generalizability, meaning that the algorithm adapts poorly to variations in data capture or characteristics. A given AI may not be accurate when applied to images captured by different devices, or when used in populations whose age, skin colour composition or sun damage characteristics differ from the training data set.3 Dermatology is especially sensitive to generalization biases since imaging of dermatologic diseases can be performed by anyone, in any setting and with various devices. Unlike radiology, dermatology has not had a tradition of standardizing clinical imaging procedures.6 To prevent unforeseen issues and to promote transparency and successful AI-human collaboration, it is essential to perform appropriate clinical validation and open reporting.7
The rapid development of AI technology has resulted in many new commercial AI-based Software as Medical Devices (SaMD), most commonly within the field of radiology.1 Consequently, radiology has established guidelines for performing and reporting clinical studies and for commercial product labelling, but such guidelines are largely absent in dermatology.2,7,8 In dermatology, only a few AI-based SaMD have been approved by the major regulatory bodies, such as the Food and Drug Administration (FDA), European Medicines Agency (EMA) or Therapeutic Goods Administration (TGA). However, since dermatologic conditions are easily imaged, many of the available AI-based dermatology SaMDs are general public-facing and have eluded such regulatory scrutiny.
Our literature review assessed current medical SaMD labelling guidelines and the adherence of Australian AI-based dermatology SaMD mobile applications (SaMDapps) to these standards (Oloruntoba et al, 2023, submitted). Of the 18 AI-based SaMDapps available for download that were identified, 28% catered to health professionals, 61% to the general public and 11% to both. None provided all the commonly recommended labels. While labelling is very important, it is crucial to find the proper balance between sufficient information to ensure safe use and excessive information, which might be difficult for consumers to interpret and hinder product development. In this study, we used a modified Delphi consensus methodology within the Australasian College of Dermatologists' Digital Health Committee (DHC) to arrive at a set of minimum labelling requirements for AI-based dermatology SaMD, whether found in a mobile application or integrated within a skin imaging system or device.

MATERIALS AND METHODS
A modified Delphi consensus approach was undertaken. The Delphi method seeks consensus among experts on topics lacking exact knowledge.9 Three features define the Delphi method: anonymous response, iteration with controlled feedback, and statistical group response.

Overview of modified Delphi process (Figure 1)
The research team drafted statements on proposed labelling items based on a recent literature review (Oloruntoba et al, submitted). The initial Delphi voting round allowed expert panel members (EPM) to offer qualitative input, propose modifications and share references. In total, three Delphi voting rounds were conducted, with the last round focusing on items lacking consensus.

The expert panel members
The study contextualizes minimal labelling requirements within Australian regulations governed by the TGA, a part of the Australian Government's Department of Health and Aged Care. The EPMs comprised the members of the DHC (previously the E-health committee) of the Australasian College of Dermatologists. The DHC has collective knowledge of emerging digital health technologies used for dermatological healthcare and extensive experience in clinical research, evaluation and oversight of technology developments.10 All members of the DHC were invited and accepted to participate in the study.

Construction of statements
Voting encompassed the following items: indication for use, intended user, training and test data set characteristics, algorithm design, image processing techniques, clinical validation, performance metrics, limitations, updates and adverse events. The EPM ranked the importance of including these proposed items in a minimum labelling requirement for AI-based SaMD on a Likert scale from 1 (not at all important) to 9 (very important). For appropriate items, the EPM were also asked to indicate, with a yes/no option, which characteristics of the labelling item they found important to specify (Table 2, Table S1, Figure S1).
In the first Delphi round, each item was followed up by questions relating to whether labelling should differ for (a) diagnostic versus non-diagnostic and (b) health professional versus general public-facing AI-based SaMD. Following survey amendments after Delphi round 1, the TGA's classification of SaMD was used instead.11 The regulation of software-based medical devices was changed in February 2021 by the Australian government. The new classification, summarized in Table 1, considers four different types of SaMD and the level of risk (class I, IIa, IIb and III) associated with the product, based on the seriousness of the condition, the potential harm to an individual, the public health risk, and whether the information is provided to relevant health professionals or the general public. The higher the classification, the higher the level of regulatory scrutiny.
In voting round 2, the EPM first voted on the importance of including each proposed item in the minimal labelling requirements assuming a class III AI-based SaMD. Subsequently, the EPM were asked if and how the labelling should differ for risk class I, IIa or IIb, or for the different categories of AI-based SaMD. Finally, the format in which the labelling should be made available to (a) healthcare professionals and (b) the general public was explored, using the analogy of product information sheets for pharmaceutical products, with three options: (1) a 'COMPLETE' explanation of all labelling items, similar to a product information sheet for pharmaceutical products; (2) a 'SHORTENED, SIMPLIFIED' version, similar to consumer medicine information for pharmaceutical products; and (3) 'BOTH', a shortened, simplified version with the complete explanation available upon request. Two general public consumers familiar with the subject matter were invited to provide their perspective on the level of detail they would expect in an AI-based SaMD label. In Delphi round 3, the remaining items were slightly rephrased and options for different levels of detail were added for the image capturing devices used.

Statistical methods
Consensus was defined as >75% of ratings falling within either the 7-9 range ('highly important') or the 1-3 range ('not important') on the Likert scale. This threshold was also applied to define consensus for the voting on specific characteristics of labelling items, on whether labelling should differ according to risk class and category of AI-based SaMD, and on how the labelling should be made available to different consumers. Summary statistics, including mean, median, range and whether consensus was reached, were compiled after each voting round.
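The consensus rule above can be sketched as a small helper function (an illustrative sketch only; the function name and example votes are ours, not taken from the study):

```python
def consensus_reached(ratings, threshold=0.75):
    """Apply the study's consensus rule to one item's Likert ratings (1-9).

    Consensus is declared when more than 75% of ratings fall within the
    'highly important' band (7-9) or within the 'not important' band (1-3).
    """
    n = len(ratings)
    high = sum(1 for r in ratings if 7 <= r <= 9) / n
    low = sum(1 for r in ratings if 1 <= r <= 3) / n
    return high > threshold or low > threshold

# Nine hypothetical panel members rating one proposed labelling item:
# 7 of 9 ratings (~78%) lie in the 7-9 band, so consensus is reached.
round1_votes = [9, 8, 7, 7, 9, 8, 7, 6, 5]
print(consensus_reached(round1_votes))  # True
```

Note that middling ratings (4-6) count against both bands, so a split panel never reaches consensus under this rule.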

RESULTS
Nine (100%) DHC members actively participated throughout the three Delphi voting rounds. Round one voting resulted in consensus to include all items in a minimum labelling requirement (Table 2). Algorithm design was closest to not reaching consensus (78% agreement). Specific attributes of the training and test data sets, and some performance metrics, did not reach consensus for inclusion (Figure S1). In contrast, there was consensus to include all proposed characteristics (Table 2; Figure S1) of clinical validation, updates and adverse events. Whether labelling should differ for diagnostic versus non-diagnostic and health professional versus general public-facing AI-based SaMD seldom reached consensus, and the comments and discussion were to a large extent centred around these topics. Common comments included: 'labelling is most important for diagnostic AI-based SaMD', 'diagnostic AI-based SaMD carry higher risk of causing harm', 'patients might be unable to interpret very technical labelling' and 'diagnostic products should only be allowed as an aid for health professionals and never directly to the general public'.
Since the questions regarding clinician/patient-facing and diagnostic/non-diagnostic AI-based SaMD created much debate and lacked a clear definition, the approach was shifted to use the TGA's classification of SaMD in round two (Table 1).11 Voting on the proposed labelling items, now assuming a class III AI-based SaMD, again yielded consensus to include all the proposed items in the minimum labelling requirements (Table 2). As in round 1, a few of the proposed characteristics of the training and test data sets (image capturing device and breakdown by sex and age groups) and performance metrics (rationale for statistical methods used, decision thresholds, features to improve explainability, metrics related to quality of life) did not reach consensus (Figure S1). There was complete consensus that class IIb and IIa should have identical labelling to class III AI-based SaMD. Three out of nine EPM voted for an abbreviated labelling requirement for class I AI-based SaMD (Figure S1) in voting round 2, but following round three of voting there was consensus that all classes of AI-based SaMD should have the same minimum labelling requirements.
In round three of voting (Figure S1), the EPM concurred on recording the breakdown of sex in the training and test data sets and on categorizing image capturing device(s) in broad categories (such as smartphone cameras, specific whole-body skin imaging systems and digital single-lens reflex cameras). In terms of performance metrics, there was consensus to include an explanation of performance measures and to exclude metrics related to quality of life. There was no consensus regarding the rationale for statistical methods used or features to improve explainability. Labelling accessibility for healthcare professionals and the general public was explored in voting rounds two and three, with additional input from two general public consumer representatives. Initially, four EPM favoured the 'COMPLETE' version directly available to healthcare professionals, while five endorsed the 'BOTH' option (see Methods section for definition). In round 3, all EPM supported the 'BOTH' choice. Regarding the availability for the general public, 78% of EPM, and both general public consumer representatives, favoured the 'BOTH' option.

TABLE 1 Summary of Therapeutic Goods Administration's (TGA) new classification rules for software medical devices. (Columns: risk to individual or public health; medical device category (intended for), e.g. diagnosing and/or recommending treatment or intervention for a disease or condition.)

DISCUSSION
Improving the transparency of AI-based SaMD through labelling is crucial to arrive at safe and effective human-AI collaboration.12 This modified Delphi study used commonly recommended labelling items from other medical fields and gathered a collective opinion from experts on the Australasian College of Dermatologists' DHC as to which labelling items should be considered a minimum requirement for AI-based SaMD. Consensus was achieved across 10 specific labelling domains. Furthermore, our expert panel agreed that the labelling should be the same for all TGA categories and risk classes of AI-based SaMD.
Research on AI can be traced back to the 1950s.13 With the invention and development of deep learning in the early 2000s, the field expanded quickly. In dermatology, Esteva et al.14 published groundbreaking results in 2017 using a single deep learning algorithm that performed on par with human experts. Since then, several computer laboratory-based studies have found similar results.15,16,19 There is substantial evidence on the vulnerabilities of AI, particularly its susceptibility to falter in 'out-of-distribution' cases, that is, instances that deviate from the AI's training data sets. In the dermatological context, these vulnerabilities include disparities in skin colour,3,12 sex and age distribution,20 presence of image artefacts and source of images, including capturing device (use of high resolution or high dynamic range software), setting (clinic vs. remote) and operator (clinician vs. patient).5,21,22 Some of these factors are likely to influence image quality. These issues are further compounded by findings that user experience, personality traits and faulty AI can negatively impact the interpretation of AI results.15,19,23 Finally, imbalanced data sets (common in medicine), algorithm design (traditional vs. generative AI, and supervised vs. semi-supervised vs. unsupervised vs. reinforcement learning), image modifications (e.g. cropping, compression, distortion or file conversion) and inaccurate reference/gold standards in the training/test data may impact algorithm performance in clinical contexts.3,7,10 Hence, understanding the development of the AI in SaMD and the data used for training and testing, including testing on out-of-distribution cases, is very important for the correct interpretation of the results.24 Like any product that we encounter, its proper and safe use should be outlined in an operating manual or a product information sheet.
It was not surprising that there was consensus for all broad categories to be included as minimum labelling requirements. That the indication for use of an AI-based SaMD should be provided by developers is a common opinion not just within our EPM but among regulatory bodies and the digital health scientific community.7,8,10,25,26 Similarly, acknowledgement of the intended user is critical to ensure AI outputs provide appropriate and interpretable data to the end user and to ensure the product falls within the correct risk classification.11 Specification of training and test data sets has been proposed both for the reporting of clinical studies and for the labelling of AI-based SaMD.7,8,26 Although our EPM voted for the inclusion of the number and sources of images, use of synthetic images, breakdown by skin phototype/race, breakdown by skin diagnosis categories and description of gold standard determination in the minimum labelling requirements of AI-based SaMD, some aspects remained contentious, namely the breakdowns by sex and age and the image capturing device. After discussion, the EPM agreed that the former two should generally be required to assure fair and trustworthy AI output. Reporting the specifics of image capturing devices was considered too cumbersome by some EPM when weighed against the benefit. However, since this feature may affect quality and generalizability of AI-based SaMD, the discussion following Delphi round 2 yielded a consensus that broad categories of image capturing devices (e.g. smartphone cameras, compact digital cameras, specific skin imaging systems (Vectra, FotoFinder, etc.) and digital single-lens reflex cameras) should be reported. Image quality was not considered as a separate labelling item as it is difficult to measure objectively and can be inferred from other labelling items (image source and capturing device).
Finally, as we have learnt from pharmaceutical product development, the proper recording of updates and adverse events is essential in ensuring trustworthy and transparent AI.8,10,25,27 This sentiment was echoed in the voting in this study, as there was complete consensus to include version number, modifications and updates since the last version, whether the AI is fixed or adaptive (continuously learning from input data), data on post-market performance and procedures for documenting adverse events as minimum labelling requirements for AI-based SaMD. In addition to the potential weaknesses and biases discussed above, AI is to a large extent a 'black box' science, meaning that most often the decision process of AI is not entirely clear.10,28 For these reasons, it is difficult to predict the real-world performance of an AI-based SaMD based on preclinical data.5,21 That clinical validation, using sound and robust scientific methodology, should precede the clinical implementation of an AI-based SaMD also resounded in this Delphi consensus study.5,7,8,10,26,27 More specifically, the EPM agreed that all proposed characteristics of a clinical validation should be included in the labelling (Table 2), together with freely available references for the clinical testing. Voting on performance metrics aligned with guidelines on the evaluation of image-based AI reports in dermatology, as there was consensus to report measures of accuracy, decision thresholds, explanation of performance metrics, metrics on user acceptability and specification of which testing the results emanate from.7
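The accuracy measures and decision thresholds referred to above can be illustrated with a minimal sketch (the function name, scores and labels are hypothetical, not drawn from any actual SaMD or from this study):

```python
def sensitivity_specificity(scores, labels, threshold=0.5):
    """Sensitivity and specificity of a binary classifier at a decision threshold.

    scores: model outputs in [0, 1]; labels: 1 = disease present, 0 = absent.
    A case is called positive when its score meets or exceeds the threshold.
    """
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    sensitivity = tp / (tp + fn)  # proportion of diseased cases detected
    specificity = tn / (tn + fp)  # proportion of healthy cases cleared
    return sensitivity, specificity

# Hypothetical scores for six test images (three diseased, three benign)
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
```

Because raising the threshold trades sensitivity for specificity, the chosen decision threshold is itself informative and belongs alongside the reported accuracy measures in the label.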
Even though there was consensus to exclude metrics related to quality of life from a minimum labelling requirement, there was acknowledgement that there may be devices for which this is relevant and should be reported. Other characteristics, such as the rationale for the statistical methods used and features to improve explainability, did not reach consensus. In recent years, much research has been performed on the explainability of medical AI, and there is a divide in the scientific community on whether these features are useful or even necessary.28,29 The underlying procedures for standardized testing to determine limitations of AI-based SaMD in dermatology are an area that needs further investigation.24 When exploring the different AI-based SaMD categories and risk classes, as defined by the TGA,11 the EPM agreed that all risk groups should comply with the same minimum labelling criteria. A less detailed labelling requirement for class I AI-based SaMD was initially advocated by some EPM, acknowledging that the burden of labelling must be weighed against the potential risk of harm. In this context, it should be emphasized that many AI-based SaMD are currently either exempted or excluded from regulatory oversight, which to some degree is determined by the manufacturer's self-selection during the registration of their product.11,30 Consequently, some AI-based SaMD exist that provide a 'risk assessment' but lack regulatory approval and forego responsibility by informing consumers that the product should not be used as a diagnostic tool. Finally, we assessed the opinion of the EPM on how the labelling should be made available to two groups of consumers: healthcare professionals and the general public.
Through the Delphi process, the EPM agreed that both groups of end users should be presented with a shortened, simplified version with readily available complete explanation on request.
The main strength of this study is the high level of agreement to include the proposed, albeit widely recognized, labelling items in a minimum labelling requirement for AI-based dermatology SaMD. The main limitation is that these results cannot be directly translated for use outside of Australia. The strengths of the EPM composition were the members' proficiency in digital health care, the inclusion of two consumer representatives and the absence of industry influence; its weakness was the lack of representation from other medical specialities and computer engineers.
In conclusion, in this modified Delphi consensus study in a dermatology digital health expert panel, we found a high agreement on a set of minimum labelling requirements for AI-based SaMD in dermatology which largely align with recommendations for clinical study reporting and product labelling in other medical fields.

FIGURE 1 Overview of the Delphi consensus process.

Note: From how the TGA regulates software-based medical devices 2021, Therapeutic Goods Administration, used with permission of the Australian Government, https://www.tga.gov.au.

TABLE 2 Results of modified Delphi consensus study listing labelling items and item characteristics that reached consensus to be included in a minimum labelling requirement for AI-based dermatology Software as Medical Device.

Labelling item: Updates and adverse events. Item characteristics that reached consensus to be included: version number; modifications and updates since last version; approvals and certifications of the SaMD (FDA, EMA, TGA, etc.); when and for what version the approval was obtained; fixed or adaptive AI; data on post-market performance; procedure for documentation of adverse events.