OnRAMP for Regulating Artificial Intelligence in Medical Products

Medical artificial intelligence (AI) involves the application of machine learning (ML) algorithms to biomedical datasets to improve medical practices. Products incorporating medical AI require certification before deployment in most jurisdictions. To date, clear pathways for regulating medical AI are still under development. Below the level of formal pathways lies the actual practice of developing a medical AI solution. This Perspective proposes best practice guidelines for development compatible with the production of a regulatory package which, regardless of the formal regulatory path, will form a core component of a certification process. The approach is predicated on a statistical risk perspective, typical of medical device regulators, and a deep understanding of ML methodologies. These guidelines will allow all parties to communicate more clearly in the development of a common good machine learning practice (GMLP), and thus lead to the enhanced development of both medical AI products and regulations.


Introduction
Medical AI involves the application of machine learning algorithms to biomedical datasets in order to improve medical practices. Outside of a research context, medical AI must be built into products in order to deliver the desired improvements. Typical examples of medical AI products are: (i) the evaluation of one or more data modalities to support diagnosis, also known as clinical decision support (CDS), e.g. x-ray diagnostic systems; (ii) the statistical linkage of complex patient parameters to suggest optimal treatment paths, a riskier form of CDS, e.g. tumor biopsy analysis and treatment suggestion; (iii) the automated monitoring and detection or prediction of risk events, e.g. community-based monitoring via telemedicine of chemotherapy patients, or intensive care unit (ICU)-based monitoring for cardiac events; (iv) pre-/post-operative patient risk stratification for enhanced monitoring; (v) machine learning (ML)-based surgery navigation aids. (Basch et al. 2017; Ronen, Hayat, and Akalin 2019; Meyer et al. 2018; Kim, Li, and Kim 2020; Rajpurkar et al. 2017)

Any product with the potential to impact human health must be regulated before being placed on the market. In the USA, the US Food and Drug Administration (USFDA) is responsible for regulating medical products. In Europe, the regulatory framework is driven by the EU Commission and primary legislation, via a system of notified bodies. This means that in the EU medical AI products are regulated, depending on their intended use, as either medical devices (MD) or in-vitro diagnostics (IVD). The USFDA is currently the global leader in developing medical AI regulations. The initial focus of the USFDA has been on developing legal frameworks for regulation, running trials of multiple alternative regulatory pathways. (Health 2020) A recently updated USFDA position paper has declared a commitment henceforth to developing a customization of the existing software-as-a-medical-device (SaMD) regulatory framework for medical AI products. (Health 2021)

The academic community has begun proactively producing checklists and initial approaches for medical AI best practices, with a strong focus on reporting standards. The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Group is currently working on a checklist for reporting on clinical trials the data from which are subsequently used in the development of medical AI. (Collins et al.) The SPIRIT-AI and CONSORT-AI extensions provide AI-specific protocol and reporting guidance. (Rivera et al. 2020; Liu et al. 2020) Finally, Model Cards and Fact Sheets are proposed as a means for better defining the acceptable uses of deployed AI systems. (Mitchell et al. 2019; Arnold et al. 2019)

Despite all of this progress, products incorporating medical AI are slow to appear on the market and typically fail to deliver on the promised levels of performance. (Benjamens, Dhunnoo, and Meskó 2020) Two key aspects of this failure of translation are: a lack of knowledge of regulatory practices on the part of developers; and the absence of a best practices standard, required to produce safe and effective medical AI products. (Higgins and Madai 2020) The USFDA proposes Good Machine Learning Practice (GMLP) to describe a best practice standard, in clear analogy to the pharmaceutical drug development Good Manufacturing Practice (GMP). The USFDA position paper undertakes to support the development of such harmonized GMLP. (Health 2021) This Perspective article proposes a first attempt at such a set of best practice guidelines for the regulatory-compliant development of medical AI products. The focus is on linking the development roadmap to the regulatory dossier of documentation, which must be submitted as part of the regulatory approval process. A related (non-peer-reviewed) article has been produced, including input from the current author, beginning with the regulatory audit and attempting to provide a templated approach, particularly for the regulator. (Johner 2020) That article is expected to form the core of upcoming International Telecommunications Union and World Health Organization (ITU/WHO) best practice guidelines. Here the focus is on the perspective of the ML developer or data scientist, and how to structure the work systematically, in order not to miss vital steps which may later be considered necessary by the regulator.

Guidelines for Good Machine Learning Practice in Medical AI
In order to develop comprehensive best practice guidelines for medical AI development, a basic introduction to common terms, and to the structure of the regulatory process, is first necessary. A glossary of useful technical and regulatory terms is presented directly following the Conclusion.

Three technical terms are of particular importance in medical AI development. Machine learning (ML) algorithms are computational procedures for learning from data. ML models, or trained ML models, are the output of the application of an ML algorithm to data. A trained ML model takes previously unseen data, similar to that in the training set, and makes predictions or classifications for it. Finally, ML products are products, whether physical devices or software, which use trained ML models in their operation. There are, additionally, a number of ML products that use ML algorithms whose outputs do not result in a trained model, e.g. clustering techniques. These will be discussed briefly in Section 2.1.2 but otherwise rarely appear in medical AI products.

The purpose of the regulatory process is to ensure that medical products are both safe and effective for their intended use. Safety is evaluated from a risk perspective. The likelihood of misadventure is evaluated in terms of both proper and any potential improper use of the product. When the benefits of using the product sufficiently outweigh the risks of accident, the product can be certified for use. For medical products, effectiveness is as important as safety. The product must demonstrably deliver any and all claims of medical benefits and must deliver a state-of-the-art level of performance when compared with competing solutions, including those only considered internally by the development company.
The process of evaluating the safety and efficacy claims of a new medical product follows a series of audits and inspections, with a particular emphasis on risk analysis and mitigation. As part of the application for certification the manufacturer must submit a regulatory dossier. This will include a technical file, or device master record, which describes the product and can be used to prove that it was produced according to the requirements of a quality management system (QMS). A design history file (DHF), which documents design decisions pertaining to the product, will also appear in the regulatory dossier. The inspector then examines the multiple steps of validation (assurance that the product meets the needs of all stakeholders) and verification (evaluation of the product's compliance with specifications and requirements) which are described in the dossier.
Three regulatory terms, and the associated approaches to verification and validation, are of particular importance in medical product evaluation:

1. Intended use: The manufacturer's declaration as to all of the valid usages to which the product may be put. Ultimately the product as a whole must be evaluated, in a normative top-down manner, against its intended use.

2. Stakeholder requirements: A list of requirements in the form of statements on behalf of product stakeholders. These are derived, again in a top-down manner, starting with the intended use, and serve as an input to both verification and validation processes.

3. Technical or product requirements: An exhaustive list of requirements, as to what a product should and should not do, which must be identified by the manufacturer developing the product, and serves as a checklist-driven, bottom-up approach to verification.
While evaluation of both technical requirements and stakeholder requirements is largely driven by verification of conformity to the listed requirements, it is not possible to have a truly exhaustive checklist. Therefore, the auditor is primarily concerned with examining processes. Besides verifying that the technical requirements have been fulfilled, the auditor will examine whether robust processes have been established in order to formulate the requirements lists, and ultimately to determine whether the intended use is likely to be safely delivered upon.
Following internationally agreed-upon standards of engineering is the easiest way to ensure that the needs of both verification and validation are satisfied and a product has been developed appropriately. The USFDA publishes a list of Recognized Consensus Standards for which a manufacturer may make a declaration of conformity in their premarket authorization filing. Applications based on non-recognized standards, or upon partial conformity to recognized standards, are also possible but require supporting documentation. The EU has historically maintained a list of Harmonised Standards, conformity to which was recognized as direct conformity to EU law. An updated list of standards, conforming to recent changes in EU regulations, has yet to be published.

The following guidelines are divided into two sections. Section 2.1 presents the bulk of the guidelines, with a focus on product development topics. Section 2.2 gives a much shorter, but highly necessary, overview of planning and preparing a clinical investigation for product validation.

Medical AI Product Development
In order to develop safe and effective medical AI products, which will pass the regulatory evaluation, the structure of AI development may be separated into six logically separate steps. Four of these apply to any medical AI product; two are only relevant for specific, more complex, products.

1. Data curation: the acquisition, labeling, and storage of reliable, unbiased, and high-quality data.

2. In-sample performance: evaluation of how well the trained model works on the data gathered so far.

3. New data performance: validation of how the model will perform on data sets other than the one from which it was derived.

4. Output prioritization and resource planning: Similarly to how humans must learn to prioritize their actions, particularly in a clinical setting, complex medical AI products must prioritize their outputs. For example, a product which outputs a list of 100+ potential diagnoses per patient is likely to overwhelm the user. Therefore the outputs must be safely prioritized. Similar constraints apply to the process of sequencing operations when lab tests, or interventions, must be algorithmically scheduled. In order to do this safely it is necessary to develop risk models for the competing tasks and publish strategies which mitigate the risk-adjusted worst-case scenarios.

5. Adaptive AI: The USFDA has placed a priority on developing regulatory pathways for adaptive AI. (Health 2021) These are ML-based products which adapt to their deployment situation, updating their learned input-output mappings to better match the needs of the environment. As with any learning system, there exists the possibility of unlearning. This must be strongly mitigated against, or an adaptive AI system cannot be deployed.
6. Post-market planning: The manufacturer of medical products bears a legal responsibility for their product throughout its expected lifetime. This means they cannot simply sell the product and then walk away. Rather, they must plan and implement a strategy, for surveillance and updates, which covers the entire lifecycle of the product. For AI medical products, issues such as shifting clinical standards and semantic drift are of particular consequence in this phase.
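The risk-adjusted prioritization described in step 4 can be sketched in a few lines. This is a minimal illustration, not a production scheme: the candidate diagnoses, probabilities, severity weights, and the display cut-off are all hypothetical values invented for the example, and a real product would derive them from a documented risk model.

```python
# Minimal sketch: prioritise model outputs by risk-adjusted severity.
# The severity weights and cut-off are illustrative assumptions, not
# values from any guideline.

def prioritise_outputs(candidates, max_shown=5):
    """Rank candidate findings by expected harm if missed.

    candidates: list of (label, probability, severity) tuples, where
    severity is a 0-1 weight for the harm of missing the finding.
    Returns at most max_shown labels, highest risk first.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [label for label, _, _ in ranked[:max_shown]]

# A likely-but-benign finding is outranked by rarer, higher-harm ones:
candidates = [
    ("common cold", 0.60, 0.05),
    ("pneumonia",   0.20, 0.70),
    ("lung cancer", 0.05, 0.95),
]
print(prioritise_outputs(candidates, max_shown=2))
```

The point of the sketch is the ordering criterion: a raw probability ranking would surface the benign finding first, whereas the risk-adjusted ranking mitigates the worst-case scenario of a missed high-harm diagnosis.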
The sequence of these steps follows the natural flow of a data or ML project.Data curation, despite being foundational, is largely a software engineering task so the focus, in this guide, is on the data-specific aspects.In-sample performance forms the core of a typical data or ML project and similarly forms the bulk of the guidelines.New data performance and post-market planning are regulatory requirements which, while not necessarily technically demanding, deserve separate consideration.The sections on output prioritization and adaptive AI are only necessary for specific types of products and, otherwise, may be skipped.
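One common mitigation for the unlearning risk noted in step 5 is a fixed, non-adaptive monitor that tracks the adaptive model's recent performance and halts further adaptation when a guaranteed minimum is breached. The sketch below assumes a simple rolling-accuracy check; the class name, threshold, and window size are illustrative assumptions, not prescribed values.

```python
# Hedged sketch of a non-adaptive performance monitor for an adaptive
# model: a fixed component tracks recent accuracy and locks adaptation
# if it drops below a guaranteed minimum quality.

from collections import deque

class PerformanceMonitor:
    def __init__(self, min_accuracy=0.85, window=100):
        self.min_accuracy = min_accuracy
        self.results = deque(maxlen=window)  # rolling correctness record
        self.locked = False                  # once locked, adaptation stops

    def record(self, prediction, truth):
        self.results.append(prediction == truth)
        if len(self.results) == self.results.maxlen:
            accuracy = sum(self.results) / len(self.results)
            if accuracy < self.min_accuracy:
                self.locked = True           # signal rollback to last safe model

monitor = PerformanceMonitor(min_accuracy=0.9, window=10)
for pred, truth in [(1, 1)] * 8 + [(0, 1)] * 2:   # 80% correct in the window
    monitor.record(pred, truth)
print(monitor.locked)
```

Because the monitor itself never adapts, its behaviour can be verified once and relied upon as a safe bound on the learning component.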
Importantly, the focus of these best practice guidelines is on certification, which follows from documentation. In a structured product development path it is common to conduct exploratory analyses, comparing many different modelling approaches, on early data sets. (Higgins and Madai 2020) The level of documentation required for product certification is extremely high. Therefore, a balance must be struck between fully documenting paths which are unlikely to subsequently be pursued, and omitting documentation of early attempts.
The former will lead to deeper trust from the side of the auditor, but the latter is completely reasonable until the point at which an intended use is defined.
The particular form the documentation takes is outside the scope of these guidelines. It typically involves documented standard operating procedures (SOPs) for all product development processes, accompanied by evidence that the SOPs were actually followed. Any decisions made, or considered, should be documented from the perspective of: paths considered, reasons for choices, potential risks, and risk mitigation of the chosen paths.
A summary of the sections from these guidelines, and their respective subtasks, is presented in Table 2. Among the subtasks listed there are: in-distribution performance evaluation (e.g. via synthetic data); from a data properties perspective, the handling of missing input values and justification of any data imputation; how the primary output is communicated relative to secondary outputs, with a risk evaluation of whether users will misunderstand the relative performance reliability of different outputs; resource planning, task identification performance, and task prioritisation performance; and an overall evaluation in terms of the intended use.
Data Curation
Data curation involves the acquisition, labeling and storage of reliable, unbiased, and high-quality data. An algorithm which is not built upon a strong basis of data curation is not fit for purpose. The data curation processes must not only be well planned and executed; they must also be documented, with particular focus on risks.
Data gathering and storage processes must be robust to potential error-introducing steps. In particular, data storage devices and software must themselves follow basic risk mitigation norms. Such systems, even when sourced from third-party vendors, must be backed by a quality management system (QMS) or separate quality assurance (QA) certification.

Table 2 (continued) lists the corresponding subtasks for the two optional steps. For adaptive AI: a risk-benefit analysis which specifically justifies the use of an adaptive algorithm; quality assurance of any teaching or learning signal; detailed algorithmic analysis for the potential to introduce systemic biases; safe bounds on learning performance, with a guaranteed minimal quality; and appropriate performance monitoring modules, including a non-adaptive performance monitor together with a robustness analysis of that monitor. For post-market planning: surveillance (paper-based methods, and whether telemetry streaming is allowed); software updates (minor vs major updates, deployment methodology, and support for old versions until removed from use); shifting clinical standards and the non-stationarity of data inputs over time; and semantic drift, including new diagnoses and new input fields.
In an attempt to acquire data for medical AI it is common to transcribe paper-based medical records onto digital systems. This is an error-prone process and should be evaluated for the likelihood that it introduces errors into the data. For example, in drug development it is common to have a minimum of two individuals enter every case report separately, in digital form, after which an analysis is performed for accuracy before a single coherent data set is 'locked'. (Pitman 2019) Where clinical decisions are a factor, these are frequently reviewed by multi-disciplinary panels before incorporation into the training set.
Data which does not reflect the intended use is a form of bias. Two mistakes occur commonly here. First, it is usual to kick off projects at world-class centres of excellence; such centres rarely see a 'normal' cohort of patients. If the final product is to be deployed in typical patient care, then data must be gathered from precisely such a population of patients. For example, a predictive model trained on a high-risk patient population cannot be certified to safely make predictions on low-risk patients without considerable further data acquisition and testing.
Another common mistake is in gathering 'too perfect' a data set. There is considerable scientific value in having a high-quality data set. However, unless statistics are gathered as to which patterns of entries are missing in the intended deployment setting, this will also lead to a biased data set. (Hand 2020) For example, a particular biomarker test might be included in the 'perfect' data set but not be standard-of-care in typical clinical practice. The developed model may become overly dependent on this biomarker, leading to biased, and likely inferior, performance in real-world usage. It is not possible to deploy an uncorrected model, which has been trained on biased data, on any other patient group.
Finally, medical AI is usually developed on labelled data. Therefore it is vital to ensure that the labelling process is correctly handled. This has both a technical and a human aspect. As with storage, the software for labelling should be certified to at least a medical software quality standard. Similar to the digitization process, it is common here to use multiple experts to independently label each data entry. In this case, a thorough analysis must be presented as to the degree of consensus among experts and how cases of disagreement are handled.
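Where multiple experts label each entry, the degree of consensus can be quantified before disagreements are escalated. The sketch below uses Cohen's kappa for two raters (values near 1 indicate agreement well beyond chance); the example labels are invented, and multi-rater settings would typically use a statistic such as Fleiss' kappa instead.

```python
# Illustrative sketch: quantifying labeller consensus for two annotators
# with Cohen's kappa (pure Python; example labels are invented).

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed fraction of entries on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

rater1 = ["tumour", "normal", "tumour", "normal", "tumour", "normal"]
rater2 = ["tumour", "normal", "tumour", "tumour", "tumour", "normal"]
print(round(cohens_kappa(rater1, rater2), 3))
```

Reporting such a statistic, together with the procedure for resolving the disagreeing cases, is the kind of consensus analysis an auditor would expect to find documented.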

In-Sample Performance
How well does the trained model work on the data gathered so far?
Performance of a trained model on the data set from which it was derived is referred to as in-sample performance. There is additionally a second, statistical meaning of in-sample, which refers not to the sample data itself but to the data generating process from which the data is understood to have derived. To disambiguate the two meanings, this latter meaning will be referred to as in-distribution performance. Since modern ML models may have an internal dimensionality high enough to perfectly capture and fit every data point on which they were trained, the use of a test set is extremely important to examine generalization and prevent overfitting. (Zhang et al. 2017) If the test set has been correctly constructed, the performance results on the training set, e.g. via multi-fold cross-validation, should form a narrow distribution around the test-set metric. Some medical data sets are too small to reliably allow such an approach, in which case considerable justification must be made about the risk-benefit payoff, and the lack of alternative methods, before a product built on such data may be considered for certification.
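The narrow-distribution check described above can be sketched as follows. The 'model' here is a deliberately trivial threshold classifier on synthetic one-dimensional data, chosen only so the example is self-contained; a real product would substitute its own training routine and metric.

```python
# Sketch of the cross-validation check: k-fold scores on the training data
# should form a narrow band around the held-out test score. The toy
# threshold "model" and synthetic data are illustrative assumptions.

import random
import statistics

def fit_threshold(xs, ys):
    """Toy model: midpoint between the two class means as a threshold."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

def kfold_scores(xs, ys, k=5):
    indices = list(range(len(xs)))
    random.Random(0).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        train = [i for i in indices if i not in fold]
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(t, [xs[i] for i in fold], [ys[i] for i in fold]))
    return scores

# Synthetic two-class data with overlapping distributions:
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(2, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
scores = kfold_scores(xs, ys)
print(round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))
```

A fold-score spread that is wide, or centred far from the test-set metric, is exactly the red flag the text describes: the test set was probably not correctly constructed, or the data set is too small.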
Software tools which incorporate machine learning have many purposes. An important distinction, however, is whether the ML is to be used for the purpose of automation or for exploration. Typically, medical products are not certified for discovery-style exploration. The risk of misuse is too high to justify certification. Therefore, although ML is very good at this task, it remains the domain of scientific research. Any software used in such a mode is then the legal liability of the user and is no longer the responsibility of the developers. An exception is a medical device whose intended use is scientific, rather than directly medical, and which will be operated by a user with sufficient scientific training to be able to interpret the output correctly. An example is a clustering technique whose output, at the discretion of the analyst, may subsequently contribute to a histopathology report. While scientific tools have historically been exempt from regulation, recent changes in EU law include them under medical device regulations once any competing certified product appears on the market.
Appropriate model performance metrics are algorithm and application specific. What is important is that they are thoroughly pursued, their particular use is justified, and all of their values reported. For example, for classifier systems, selective reporting of sensitivity and specificity without reference to positive predictive value (PPV) and negative predictive value (NPV) is inappropriate. Rather, it is desirable that the confusion matrix and all derived measures be reported. As a second step, it is then perfectly understandable to the regulator that sensitivity and specificity are of particular interest, from a risk analysis perspective, for screening purposes, whereas PPV and NPV are preferred for individual diagnostic or treatment decisions. Similarly, metrics such as accuracy and area-under-the-curve (AUC) may be reported, but they are unlikely to be sufficient on their own.

The first step to reporting on in-sample performance is thus the establishment of basic machine learning norms. This must, at all points, be thorough, detailed, and must openly compare with state-of-the-art performance. Beyond this there are two further technical issues on in-sample performance, which only the machine learning team have the expertise to address. They are, however, frequently overlooked in consumer-tech approaches to ML development. These issues are (i) the technical implications of the chosen ML algorithms, and (ii) the statistical properties of the underlying data sets. They will be addressed in the following two subsections.
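As a minimal illustration of reporting the full set of derived measures rather than a selective subset, the following sketch computes them from raw confusion-matrix counts. The counts are invented to mimic a low-prevalence screening setting, where sensitivity and specificity can both look strong while the PPV remains poor.

```python
# Sketch of reporting the confusion matrix and all derived measures.
# The counts are illustrative, chosen for a 1% disease prevalence.

def classification_report(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "ppv":         tp / (tp + fp),           # positive predictive value
        "npv":         tn / (tn + fn),           # negative predictive value
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }

# 90% sensitivity and 90% specificity, but only ~8% PPV at low prevalence:
report = classification_report(tp=90, fp=990, fn=10, tn=8910)
print({k: round(v, 3) for k, v in report.items()})
```

Reporting all five values makes the trade-off visible to the regulator, whereas quoting sensitivity and specificity alone would conceal it.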

Technical Implications of the ML Algorithm:
Considerable theoretical work has been carried out into the algorithms behind machine learning techniques.However, an understanding of these theoretical foundations is much less widely available.By evaluating the product development through the perspective of the algorithm enormous insights can be gained into the evaluation of in-sample performance.
A thorough description of all ML algorithms applied to the data, their relative merits, and why the final algorithm was chosen is a prerequisite for any regulatory evaluation of a ML product.The chosen algorithm must be justified as being the most appropriate for the application.The justification cannot be based solely on the preferences and available skills in the development company.Further, alternative means, i.e. non-ML methods, must also be considered and if any are considered superior to the ML-based approach, from a risk-benefit point of view, then the ML-based product cannot be certified.
Within a particular ML model class it is often possible to evaluate whether the data set is sufficiently large to support the reported model performance metrics. Although it is never recommended to estimate a priori how large a data set will be required for a particular machine learning task, it is often possible to evaluate algorithm performance post hoc, using a combination of heuristic-based approaches and actual theoretical results, to decide whether the estimated performance metrics are accurate or not. (Hua et al. 2005) The heuristic approach, to evaluate the general applicability of the model performance metrics, is essentially a modern data science validation approach. Particular focus should be on identifying whether there are specific sub-groups in the training data for whom the ML model works better. This must be combined with a risk evaluation; it is frequently less risky to restrict the intended use to those sub-groups for which performance is demonstrably reliable than to deploy more broadly.

The final application of specific algorithm knowledge is in evaluating how the trained model is likely to perform on 'similar' data. This follows the definition of in-distribution performance (see the beginning of Section 2.1.2). A training and test set are always finite samples, even if they form excellent approximations of the underlying distributions. This means there may be many points of interest which might be generated by the same data generating process and which might lead to as yet undiscovered results from the trained algorithm. A good evaluation of the algorithm will attempt to answer the question of how the trained model is likely to perform on data which is generated from the same data generating process but which is not identical to already tested points. The ideal evidence for in-distribution performance would be to show that the solution space is some kind of smoothly differentiable manifold. Theoretical results currently fall short of being able to provide such assurances; however, a numerical sensitivity analysis using synthetic data can give useful indications.
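Such a numerical sensitivity analysis can be sketched as follows: each test point is perturbed with small synthetic noise, and the stability of the model's output is measured. The toy classifier, noise scale, and trial count are illustrative assumptions; a real analysis would perturb within the documented data generating process.

```python
# Hedged sketch of a numerical sensitivity analysis with synthetic data:
# perturb each point with small noise and measure output stability.
# The toy model and noise scale are illustrative assumptions.

import random

def sensitivity_analysis(predict, points, noise=0.05, trials=20, seed=0):
    """Fraction of perturbed copies whose prediction matches the original."""
    rng = random.Random(seed)
    stable = total = 0
    for x in points:
        base = predict(x)
        for _ in range(trials):
            x_perturbed = [v + rng.gauss(0, noise) for v in x]
            stable += predict(x_perturbed) == base
            total += 1
    return stable / total

# Toy classifier: positive if the mean feature value exceeds a threshold.
predict = lambda x: sum(x) / len(x) > 0.5
points = [[0.1, 0.2, 0.1], [0.9, 0.8, 1.0], [0.5, 0.5, 0.52]]
print(sensitivity_analysis(predict, points))
```

Points far from the decision boundary come back fully stable, while the third point, sitting near the boundary, does not; a low overall stability score indicates regions of the input space where in-distribution behaviour is fragile and needs mitigation.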
In general, an evaluation of in-sample performance under the lens of the properties of the algorithm will lead to: discovery of numerous problem cases which must be mitigated; holes in data gathering which must be filled; and, insights into how the trained model is likely to work on similar patient cohorts.The absence of a history of such insights and their remedies, in the formal documentation, is a strong indicator of insufficient quality control processes or adherence to said processes.

Statistical Properties of the Data:
Machine learning algorithms converge differently depending upon the structure of the underlying data distribution. Biomedical data sets frequently contain missing input values, missing outcome information, discontinuities in the encoded data, and statistically significant patterns which do not translate well to real-world usage. At a deeper level, not all data follows well-understood distributions, such as the normal distribution. All of this leads to serious model, and product, design implications which must be addressed when building a medical AI solution.
Missing input values are a common feature of medical data. In medical imaging this is less of a problem, and can perhaps be remedied by either value imputation or an insistence on retaking an intact image. However, a typical medical history is frequently sparse, and driven by clinical imperatives at the time of presentation. The presence or absence of a lab result may be more indicative of the final diagnosis than the actual lab values reported. A well-constructed medical AI system needs to deal appropriately with these missing values. See Table 3 for some examples.

Table 3. Data imputation examples.
- Some parameters can be imputed from previous visits: e.g. age, height, weight.
- The same parameters cannot be imputed under specific clinical conditions: weight will change during pregnancy, and the change may be clinically relevant.
- Some parameter values should be switched to a binary flag (e.g. present / not present): was a lab result above a proven clinically relevant threshold?
- When marking missing data with an 'NA' code, the code should not adversely interact with the algorithm: if the actual values range from 0-5, then using '-1' for missing values will bias regression-based algorithms.

Unfortunately, obtaining outcome data from medical records can be both difficult and costly. In certain cases, e.g. pregnancy, it is reasonable to assume that a complete medical record set will be gathered for most patients. But in most illnesses, even where the patient subsequently returns to the same doctor or clinic, it is rarely possible to ascertain whether the original diagnosis-treatment combination was successful or not. This is important from a labeling point of view. It is only possible to build products which automate the diagnostic or treatment decisions of the clinician if the outcomes are known. The next best alternative is that an expert committee must decide, based on the data, what diagnoses or treatments should be assigned. In either case, an analysis of the training data must be performed with particular focus on evaluating the risk that the data points with well-labelled outcomes form a non-representative subset of the general population, which would lead to biased real-world performance.
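The 'NA' encoding pitfall from Table 3 can be made concrete with a short sketch: treating a '-1' missing-value code as real data biases even a simple mean estimate downward, whereas excluding (or separately flagging) the coded values does not. The lab values below are invented for illustration.

```python
# Illustration of the 'NA' encoding pitfall: a sentinel code such as -1,
# if not excluded, drags down any value-based estimate. Values invented.

def mean_estimate(values, missing_code=None):
    """Mean over observed values, excluding the missing-value sentinel."""
    observed = [v for v in values if v != missing_code]
    return sum(observed) / len(observed)

lab_results = [3.0, 4.0, -1, 5.0, -1, 4.0]   # -1 marks "not measured"

naive = sum(lab_results) / len(lab_results)              # treats -1 as data
corrected = mean_estimate(lab_results, missing_code=-1)  # excludes the code

print(round(naive, 2), round(corrected, 2))
```

The same distortion affects regression-based algorithms that consume the raw column; a separate missing-indicator feature is the usual remedy, since the absence of a measurement may itself be informative.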
Discontinuities appear frequently in biological data. They may be the result of monitoring equipment which is designed to work on timescales relevant to steady-state values, e.g. early blood-glucose monitoring devices. Or, discontinuities may represent biological on-off processes which switch the production of certain hormones, and other biochemicals, either entirely on or entirely off in the body. In general, machine learning algorithms operate under assumptions of continuity, which leads to inconsistent outcomes when fitting discontinuities.
The appropriate use of a kernel smoother transform in the feature space representation can greatly aid in combining discontinuous data. (Hofmann, Schölkopf, and Smola 2008) Frequently, the easiest way to improve the performance of a machine learning algorithm on biomedical data is the use of input representation transformations. Simple transformations include normalization, transforming into another set of dimensions, and the use of input encoding values. Normalization supports the ML algorithm by reducing the numerical computing difficulties of handling inputs with vastly different value ranges. Dimensional transformations and input encoding values provide support, respectively, by leading to easier separation of the data to be learned, and by building in knowledge, such as previous diagnoses, which must otherwise be learned from base data.
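The simplest of these transformations can be sketched in a few lines: z-score normalization brings inputs with vastly different value ranges onto a comparable numerical scale. The columns below (ages and a hormone level) are invented illustrative values.

```python
# Minimal sketch of z-score normalization: each column is rescaled to
# mean 0 and standard deviation 1. The columns are invented values.

import statistics

def z_normalise(column):
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)   # population std-dev of the column
    return [(v - mu) / sigma for v in column]

ages = [25, 40, 55, 70]                      # range ~tens
hormone = [1200.0, 800.0, 950.0, 400.0]      # range ~hundreds

for column in (ages, hormone):
    z = z_normalise(column)
    print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```

In a product setting, the normalization constants belong to the trained model and must be fixed from the training data, then applied unchanged to deployment inputs.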
Each of these methods, when used carefully, represents only a low risk to the final product. However, the use of more complex, particularly manually generated, engineered features must be very carefully analysed for risk. The problem with complex feature engineering is best motivated by an analogical example.
There is a well-known, but perhaps apocryphal, story of a product for AI-driven medical radiography which showed excellent in-sample performance, until further investigation revealed that it was relying on spurious information which spanned both the training and test sets. The story relates that the spurious information was a unique tag, in the borders of the images, which showed that the image originated from a clinic which specialized in a particularly rare clinical condition. The algorithm was able to leverage this statistical association to appear performant at diagnosing that rare disease. In the ML literature such a system is referred to as a 'Clever Hans'. (Lapuschkin et al. 2019) The problem with complex engineered features is that they curate such hidden associations, and they do so in a manner which is considerably more difficult to trace than in the example. Therefore, feature engineering must be motivated either by numerical computing factors or by a demonstrable biological association between the engineered feature and the target outputs.
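A crude screen for such shortcuts, in the spirit of the story above, is to ask how well each input feature predicts the label on its own: any single feature that is almost perfectly predictive deserves scrutiny as a possible 'Clever Hans'. The data set below, containing a genuine noisy biomarker and a hypothetical 'clinic tag' column, is invented for illustration.

```python
# Hedged sketch of a single-feature leakage screen: a feature that alone
# predicts the label (almost) perfectly is a red flag for a spurious
# shortcut. Data and threshold logic are illustrative assumptions.

def single_feature_accuracy(rows, labels, feature_index):
    """Best accuracy achievable using one feature and one threshold."""
    values = sorted({row[feature_index] for row in rows})
    best = 0.0
    for threshold in values:
        for flip in (False, True):   # try both orientations of the rule
            preds = [(row[feature_index] >= threshold) != flip for row in rows]
            acc = sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best

# Feature 0: a genuine but noisy biomarker; feature 1: a "clinic tag"
# that happens to identify the specialist clinic supplying the positives.
rows   = [[0.2, 0], [0.7, 0], [0.4, 0], [0.6, 1], [0.9, 1], [0.3, 1]]
labels = [0, 0, 0, 1, 1, 1]

for i in range(2):
    print(i, round(single_feature_accuracy(rows, labels, i), 3))
```

Here the biomarker reaches only modest standalone accuracy, while the clinic tag is perfectly predictive, which is exactly the pattern that should trigger an investigation of the data provenance.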
Much of statistical theory is based on the normal distribution of data. It is possible to evaluate approximately how many data points are needed before reasonable estimates of the mean and variance of such distributions are obtained. In the real world, however, biological data is frequently non-normally distributed. (Limpert, Stahel, and Abbt 2001) From a statistical analysis perspective, it has been shown that if it is not possible to distinguish between a log-normal distribution and a normal distribution, then the statistical properties of the log-normal distribution must be assumed. (Taleb 2020) In such a case, the convergence properties of statistical methods are slowed to the point where no reasonable amount of data can assure convergence. ML algorithms have different convergence properties from statistical algorithms, but they too can be expected to converge insufficiently for fat-tailed distributions. Recent work has called this incomplete convergence a form of model underspecification. (D'Amour et al. 2020) The risk is that the variance will always be underestimated, and as a consequence so will the error. If the algorithm cannot reasonably converge then it is impossible to validate it for clinical deployment.
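The slowed convergence can be illustrated numerically: with the same number of samples per experiment, the sample mean of a heavy-tailed log-normal remains far less stable across repetitions than that of a normal distribution centred on the same value. The sample sizes and distribution parameters below are arbitrary illustrative choices.

```python
# Numerical illustration: spread of the sample mean across repeated
# experiments, normal vs heavy-tailed log-normal of the same mean.
# Sample sizes and parameters are illustrative choices.

import math
import random
import statistics

def sample_mean_spread(draw, n=500, repeats=200, seed=0):
    """Std-dev of the sample mean across repeated experiments."""
    rng = random.Random(seed)
    means = [statistics.mean(draw(rng) for _ in range(n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Both distributions have mean e^2; the log-normal's tail is far heavier.
normal_spread = sample_mean_spread(lambda r: r.gauss(math.e ** 2, 1.0))
lognormal_spread = sample_mean_spread(lambda r: r.lognormvariate(0.0, 2.0))
print(round(normal_spread, 3), round(lognormal_spread, 3))
```

The log-normal spread comes out more than an order of magnitude larger at the same sample size, which is the practical face of the underestimated-variance risk described above.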
The best advice here is to use more complex, but also more robust, heuristics in evaluating how much data to employ and, more importantly, to fall back on product designs which avoid the catastrophic effects of massive failure. This point about failure modes is extremely relevant for in-sample performance evaluation. Ultimately, every tool fails at some point; the product should therefore be designed to fail in a graceful manner. This topic will have even more relevance in the following sections.

New Data Performance
Assume that a machine learning model for medical AI has been successfully built and shown to work reliably, and with sufficient efficacy, on a large data set. The next task is to validate how the model will perform on other data sets. Here the risk is that excellent model performance under the original data generating process will not transfer to real-world data generation situations.
Essentially, there are two complementary approaches here. First, a clinical trial needs to be conducted in order to demonstrate that the performance statistics reported for in-sample performance are reproduced in a clinical situation. (Higgins and Madai 2020) This is also an opportunity to evaluate model calibration, i.e. how well the model performs across all cohorts, the definition and calculation of which is not trivial for a predictive model. (Archer et al. 2020; Riley et al. 2020) The trial itself will follow normal clinical trial protocols, such as the definition of trial participant selection criteria and intended use, a discussion of which is in Section 2.2.
Reporting of the trial should follow standardized checklist-based approaches, such as CONSORT-AI and SPIRIT-AI. (Liu et al. 2020; Rivera et al. 2020) The task of the development team is to determine bounds for what the valid population, on which the product will be used, looks like. This can, of course, drive the clinical trial design process. But, more importantly, this will lead to the development of an input plausibility model. This is the second of the two complementary approaches.
Data which is submitted to the trained model must first be evaluated for plausibility. This will ensure that data entry errors are caught, and rectified, before any model prediction is run. It will also, potentially, catch patients whose clinical condition does not match the conditions under which the model can be reasonably expected to operate.
The safety issue posed by inappropriate application of an ML model cannot be overstated.
A machine learning model should not be deployed on a population which is different from the one on which it was trained. Despite the term AI incorporating the word intelligence, machine learning cannot self-correct to cope with even minor situational differences for which it has not been trained.
A simple input plausibility model will set maximum and minimum values for each input field.
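A minimal range check of this kind might look as follows (a sketch; the field names and bounds are illustrative placeholders, not clinical reference ranges):

```python
# Minimal range-based input plausibility check. Field names and bounds are
# illustrative placeholders, not clinical reference ranges.
PLAUSIBLE_RANGES = {
    "age_years": (0, 120),
    "heart_rate_bpm": (20, 300),
    "temperature_c": (30.0, 45.0),
}

def check_plausibility(record):
    """Return the field names whose values are missing or out of bounds."""
    violations = []
    for field, (lo, hi) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None or not (lo <= value <= hi):
            violations.append(field)
    return violations

# A Fahrenheit value entered into a Celsius field is caught before inference.
print(check_plausibility({"age_years": 54, "heart_rate_bpm": 72,
                          "temperature_c": 98.6}))  # -> ['temperature_c']
```

Such a check catches gross data entry errors (here, a unit confusion) but not subtler distributional mismatches, which motivates the more advanced models below.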
A more advanced model will use a support-vector machine (or similar) which has been designed for the task of one-sided classification. (D. Tax 2001; D. M. J. Tax and Duin 2001; Stephan 2001) That is, a model which has been trained to recognize data similar to the original training data and may, as a result, be used to detect anomalies. The most advanced techniques, even if they do not explicitly use a Bayesian method, will reflect the Bayesian question: what is the likelihood that the new data point is from the same data distribution as the original training data?
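As a sketch of the one-sided classification approach, scikit-learn's OneClassSVM can be fitted to synthetic stand-in "training-distribution" data and then used to flag implausible inputs (the features and parameter values here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in for the original training distribution (two synthetic features:
# heart rate and temperature in Celsius).
X_train = rng.normal(loc=[70.0, 37.0], scale=[10.0, 0.4], size=(500, 2))

# One-sided classifier: learns the support of the training data only.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# +1 = plausible (inside the learned support), -1 = out-of-distribution.
in_dist = detector.predict([[72.0, 36.9]])
out_dist = detector.predict([[72.0, 98.6]])  # Fahrenheit entry error
print(in_dist, out_dist)
```

The `nu` parameter caps the fraction of training points treated as outliers; in a real product it would be tuned against the risk analysis rather than fixed at an arbitrary value.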
The absence of an input plausibility model must be justified by a thorough risk evaluation demonstrating the lack of potential for patient harm. In contrast, the presence of a sufficiently advanced input plausibility model may be taken as evidence that the product can safely operate even for intended uses in which the potential for patient harm would otherwise be significant.

Output Prioritization and Resource Planning
As more and more complex medical AI systems are developed, the models involved will develop towards increasing levels and hierarchies of models and automation. A single-task medical AI model can be evaluated by relying only on the sections leading up to this one. In this section, the focus is on the situation where the overall product might be used for more than one medical condition, or application, or across multiple medical contexts. Such products might someday range from electronic health record integrated clinical decision support tools, which present a ranked list of potential diagnoses for a patient, all the way to automated artificial general intelligences (AGI) which take symptoms as input and directly prescribe treatments without the intervention of a human operator.
In the simplest example, a single trained ML model takes the data from a patient and evaluates the data for the likelihood that it represents a single clinical indication. A second ML model is trained for a second indication. And this process of training separate models continues for each indication which is considered of interest by the product developers.
The difficulty, in this example, is that in a product the output of each of the individually developed models must be aggregated and presented to a user, such as a clinician, in a coherent manner. In order to do this, a ranking or prioritization algorithm must be developed.
A naive approach might involve ranking the indications based on how certain each model is that its diagnosis is correct. To do this correctly, the individual models must have been trained to output such a likelihood and not just a binary classification.
This approach still leads to two difficulties. First, since different indications have very different prevalences, the accuracy of the different models is not directly comparable. And second, different indications represent very different clinical risks from a medical point of view. That is, for some medical conditions it is important to (over-)respond early rather than rely on the fact that such a diagnosis is rare, because if treated too late it will be fatal. This is a core concept in the method of differential diagnosis in medicine.
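The difference between the naive ranking and a risk-aware ranking can be sketched in a few lines (the indications, probabilities, and severity weights are entirely illustrative, not clinical guidance):

```python
# Illustrative sketch: ranking candidate indications by calibrated
# probability alone vs. probability weighted by a clinical-severity factor.
candidates = {
    # indication: (calibrated_probability, severity_weight)
    "tension_headache":   (0.60, 1.0),
    "subarachnoid_bleed": (0.05, 50.0),  # rare, but fatal if treated too late
    "sinusitis":          (0.30, 1.5),
}

by_probability = sorted(candidates, key=lambda k: candidates[k][0], reverse=True)
by_risk = sorted(candidates, key=lambda k: candidates[k][0] * candidates[k][1],
                 reverse=True)

print(by_probability)  # most likely first: the rare emergency is ranked last
print(by_risk)         # expected-harm ordering surfaces the rare emergency
```

The expected-harm ordering mirrors the logic of differential diagnosis: a low-probability, high-severity condition is surfaced first precisely because missing it is costly.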
Moving beyond this simple example of hybridising the outputs of separate models, the same issues emerge when directly validating multi-class classifiers and Bayesian Network approaches. These techniques overcome some of the technical issues of comparing performance across multiple outputs. But such approaches still suffer from the prioritization issue. These are issues which must be considered in terms of the intended use, and how this use is communicated, first via the device sales communication process and then through the user interface, to the end users. User testing must be carried out, and reported upon, in order to evaluate the risks of miscommunication and subsequent adverse outcomes.
Finally, when patients present to a doctor, the clinician must quickly evaluate the entire patient presentation. This incorporates both a high- and a low-level interpretation of symptoms and a priority-based approach to investigation and treatment. At some point 'smart' algorithms, which similarly decide where medical resources are to be concentrated, will have to be evaluated. For such an algorithm to work appropriately, it must operate a paradigm which evaluates the ongoing task environment and ranks which sub-algorithms, or trained models, are most important to subsequently proceed with. Q-learning is an example of a commonly known technique used in this area today. (Watkins 1989) In computational neuroscience and research into AGI such an algorithm embodies a task identification and/or switching process.
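As a toy illustration of the Q-learning update rule in a task-switching setting (the two-state environment is an invented stand-in, not a clinical model):

```python
import random

random.seed(0)

# Toy tabular Q-learning on a two-state task-switching problem:
# in state 0 choosing task "b" pays off, in state 1 task "a" pays off.
Q = {(s, a): 0.0 for s in (0, 1) for a in ("a", "b")}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def reward(state, action):
    return 1.0 if (state, action) in ((0, "b"), (1, "a")) else 0.0

state = 0
for _ in range(2000):
    # Epsilon-greedy action selection.
    action = (random.choice(["a", "b"]) if random.random() < epsilon
              else max("ab", key=lambda a: Q[(state, a)]))
    r = reward(state, action)
    next_state = 1 - state
    # Standard Q-learning update rule.
    Q[(state, action)] += alpha * (
        r + gamma * max(Q[(next_state, a)] for a in "ab") - Q[(state, action)])
    state = next_state

print(max("ab", key=lambda a: Q[(0, a)]))  # learned best task in state 0
```

The learned policy correctly switches tasks per state; the regulatory point is that this top-level switching module needs its own evaluation, just like the sub-models it orchestrates.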
The development of such algorithms should attempt to follow basic theoretically advantageous properties, such as the probably-approximately-correct (PAC) framework, but where the target or goal is fuzzier than that of a stand-alone tool. (L. G. Valiant 1984; L. Valiant 2014) The evaluation of this top-level process (frequently referred to as a module) must also be carried out, and will follow similar procedures to those referred to for the individual machine learning models. From a legal perspective, such an autonomous system is regulated according to its ability to safely and efficaciously serve its intended use.

Adaptive AI
A recent USFDA discussion document has made it clear that regulatory bodies are not just considering static, or 'locked', ML-based products, with a focus on when they can be changed without undergoing re-certification, but are also willing to engage with developers planning adaptive AI solutions. (Health 2020) In this case, the product is assumed to have an initial base-level of performance but is capable of modifying its performance characteristics based on the pattern of inputs received over time. This is probably best understood as a similar technology to that present in smartphone predictive text keyboards, which can adapt to the user's word-usage preferences, performing better over time.
From a machine learning point of view, it is important to note that adaptive here does not refer to techniques with hidden internal states, such as Hidden Markov Models, recurrent neural networks (RNNs), etc., where the internal state for an ongoing prediction is updated based on newly inputted data. For example, this is not a real-time ML model where the prediction for a patient may change immediately based on their answer to a symptomatic question. Such models, despite their technical complexities, are regulated in exactly the same manner as all other 'locked' ML models.
Rather, the USFDA definition of adaptive AI should be read to mean models which are locally updated without interference from the developers. That is, methods which autonomously retrain the model based on patterns of data inputted over time. Common techniques for adaptation include reward learning and Bayesian methods. Both are, of course, compatible with the development of 'locked' ML models but are also utilisable for ongoing performance improvements. For example, a reward learning method could be used, in combination with accurate lab-based patient outcome data, to advance the point in time at which a particular illness may be accurately detected to an earlier stage in the patient journey.
Alternatively, a Bayesian approach could modify its internal expectations of 'normal' patients based on experience of local conditions. That is, each clinic would have a ML model which over time would adapt to issues such as regional incidence rates of particular illnesses or comorbidities.
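Such local adaptation can be sketched with a conjugate Beta-Binomial update (the national prior and the local counts are hypothetical numbers, chosen only to show the mechanism):

```python
# Hedged sketch of Bayesian local adaptation: a clinic-level incidence
# estimate updated with a Beta-Binomial conjugate prior. The prior is
# anchored to a hypothetical national rate; local observations shift it.
def update_incidence(prior_alpha, prior_beta, positives, negatives):
    """Return posterior Beta parameters and the posterior-mean incidence."""
    alpha = prior_alpha + positives
    beta = prior_beta + negatives
    return alpha, beta, alpha / (alpha + beta)

# National prior: 2% incidence, with weight equivalent to 1000 prior cases.
a, b, mean = update_incidence(20, 980, positives=30, negatives=470)
print(round(mean, 3))  # -> 0.033: local data pulls the estimate upward
```

The prior strength controls how quickly local evidence dominates the national baseline; as the next paragraphs note, this same mechanism is also where systemic bias can silently enter.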
Clearly, in the reward learning example, the central issue is the validity of the teaching or feedback signal. This issue is the same as the labelling quality control issue posed in Section 2.1.1. The product manufacturer is responsible for ensuring the quality control of the feedback signal. The USFDA has declared that they will take a risk-based approach to this quality control issue. That is, if the risk of causing harm is sufficiently low, relative to the expected patient benefits, then such an approach may be granted approval.
The Bayesian example contains an important demonstration of the hidden potential for harm posed by adaptive AI products. Adaptation to local conditions carries a strong risk of algorithmically incorporating systemic bias into the model. That is, rather than correctly detecting a higher local incidence of a disease which only affects subgroups of the national population, the product may falsely try to 'normalise' the incidence rate to national levels which are below those experienced locally. (Obermeyer et al. 2019) Therefore the adaptive component of the AI must be evaluated against the risk of increasing, rather than decreasing, the socio-economic and genetic biases which already exist in the population.
With adaptive AI the most important regulatory aspect is in assuring that a minimum level of performance is always guaranteed. In any coupled learning process it is always possible to learn how to do things badly. The product should therefore enforce two safeguards: (i) bounds on the magnitude of the potential learning process; and (ii) strengthened monitoring processes. The goal here is to certify the product as always performing above some minimal acceptable level, while offering the possibility of improved performances.
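The two safeguards can be sketched as follows (the certified weights, drift bound, and performance floor are illustrative placeholders, not a prescribed mechanism):

```python
# Hedged sketch of the two safeguards for adaptive AI: (i) clip how far any
# learned parameter may drift from its certified value, and (ii) roll back
# to the certified model when monitored performance falls below a floor.
CERTIFIED_WEIGHTS = {"w0": 0.50, "w1": -0.20}
MAX_DRIFT = 0.05          # bound on the magnitude of adaptation
PERFORMANCE_FLOOR = 0.80  # certified minimum (e.g. a monitored AUC)

def bounded_update(certified, proposed):
    """Accept proposed weights only within the certified drift bound."""
    return {k: min(max(v, certified[k] - MAX_DRIFT), certified[k] + MAX_DRIFT)
            for k, v in proposed.items()}

def monitor(weights, measured_performance):
    """Revert to the certified model if performance drops below the floor."""
    if measured_performance >= PERFORMANCE_FLOOR:
        return weights
    return dict(CERTIFIED_WEIGHTS)

adapted = bounded_update(CERTIFIED_WEIGHTS, {"w0": 0.70, "w1": -0.21})
print(adapted)                 # w0 clipped to the drift bound
print(monitor(adapted, 0.75))  # below floor: certified weights restored
```

Bounding drift certifies a worst case a priori; the monitor handles the residual risk that even bounded adaptation degrades performance in the field.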

Post-Market Plans
As part of the regulatory package, a developer of a medical AI product should also expect to submit plans for post-market operations. Two aspects of these plans are relatively clear and are similar for any medical software device: surveillance and dealing with software updates.
Surveillance means the post-market monitoring for any adverse events which occur through the normal use, or misuse, of the product. This is separate from crash monitoring, typical in non-medical device software. Indeed, in most jurisdictions paper-based inbound reporting must still be supported. Beyond adverse event reporting, regulators globally are currently supporting a drive towards increasing post-market performance monitoring.
It remains an open question, however, whether telemetry streaming which incorporates patient data will be acceptable under patient rights.
The current proposal from the USFDA for regulating medical AI takes a lifecycle approach to regulation; this ensures that ongoing software updates will form a core aspect of the certification. (Health 2020) Minor changes to the software are typically allowed without a complete recertification. New features are often classified as major updates and entail a formal certification. The delivery of software updates, under the software lifecycle model, comes with implicit requirements that the updates are distributed to all users. A traditional software product, where each updated version is purchased separately, is only allowed when the absence of updates will not negatively impact patients. The manufacturer is responsible for all old versions of the product until they have been removed from use.
One issue which is specific to medical AI is the topic of shifting clinical standards. Medical diagnostic criteria change over time. Additionally, medical practitioners undergo intensive training early in their careers and frequently drop to a maintenance level of education as they age. This slows the dissemination of new criteria depending on career phase. A machine learning system can only maintain its performance characteristics on data which is statistically indistinguishable from the data on which it was trained. In technical terms, the data set must be stationary. Unfortunately, medical data is not.
This problem is referred to as distributional or semantic drift. (Challen et al. 2019) In order to mitigate this, a plan must be in place prior to launch, firstly to detect any such changes, and secondly to react safely to them. Detection of semantic drift requires considerable monitoring.
In medical data, there is a particular risk that un-modelled sources of variance may be very difficult to detect. The response plan will be specific to the product. Particular focus must be paid to the timeliness and feasibility of this plan.
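One common detection approach, sketched here under simplifying assumptions (a single synthetic feature, a simulated shift), is a per-feature two-sample Kolmogorov-Smirnov test comparing a live data window against a reference window from validation time:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference window: feature values seen at validation time.
reference = rng.normal(5.0, 1.0, 1000)

# Live windows: one stable, one after a simulated shift in measurement
# practice, standing in for distributional/semantic drift.
live_stable = rng.normal(5.0, 1.0, 500)
live_shifted = rng.normal(6.0, 1.0, 500)

def drift_detected(ref, live, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test for one monitored feature."""
    return ks_2samp(ref, live).pvalue < alpha

print(drift_detected(reference, live_stable))
print(drift_detected(reference, live_shifted))  # True
```

A univariate test like this misses correlated, un-modelled sources of variance, which is exactly the difficulty the paragraph above flags; in practice it would be one layer of a broader monitoring plan.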

Clinical Validation
In order to validate a medical AI product a clinical trial, or study, is almost always necessary. (Higgins and Madai 2020) The study occurs following the technical engineering process and follows practices much better known among traditional biomedical regulatory agencies. For these guidelines, the focus is on aspects of the clinical trial design which are difficult to translate when separate departments, involving machine learning on one side and the regulatory process on the other, attempt to design a trial together.
Bridging the gap between AI practitioners and a traditional pharma clinical trial design requires the understanding of (i) the target population, and (ii) the intended use. This can be extended to incorporate three further issues, which on the surface appear trivial, but contain hidden dangers. What are appropriate controls for a digital trial? Do the study endpoints match the intended use? And, can a user misunderstand the interface and consequently misuse the product?
The target population should, in machine learning terms, be understood from a statistical point of view. The statistical analyses required to carry out a thorough in-sample characterization of the product's population are obvious to regulatory experts, but less so to machine learning developers.
Just as there is potential for miscommunication between machine learning developers and regulatory experts on model function versus product intended use, there exists a potential for real-world misapplication of the product. Most users do not understand that computer systems, especially intelligent systems such as 'AI', can make mistakes. As a result, users are either overly trusting of computers or, conversely, learn to ignore them. Both of these outcomes should be actively measured in a trial, and the product design should mitigate them.

Conclusion
A well-developed medical AI product has the potential to have a huge clinical impact. In order to be allowed onto the market, development must first follow clear regulatory pathways and practices. Following the recent call for Good Machine Learning Practice (GMLP) by the USFDA, this Perspective presents guidelines for such best practice.
These guidelines follow a bottom-up approach which should be immediately intuitive to machine learning researchers and developers. The focus is on presenting ML development best practices which are compatible with the compilation of a regulatory dossier suitable for regulatory evaluation. This approach provides a necessary bridge between the risk-assessment world of the medical regulators and the technical approaches commonly used in the development and evaluation of trained machine learning models.
Only through the development of better common standards can we bridge the current translation gap, and deliver on the promises of better medical AI for all.
Table 4: Glossary of terms.
In-distribution performance Performance of a model on data sampled using an identical data generating process to that which was used to generate the original data set. In practice, this is a theoretical concept from statistical theory and leads to theoretical-property based and heuristic-based approaches to analysis.
In-sample performance Performance of a model on actual data from the original data set used in model development.
In-vitro Diagnostic (IVD) Originally used for lab-based in-vitro diagnostics, this regulatory category also encompasses digital diagnostics.

Input encoding values
Transformation of categorical data to a format compatible with ML algorithm and model inputs. For example, age ranges may be converted to a one-hot encoding.
Input plausibility model An input filter model which examines data inputted to a ML model in order to evaluate whether the model was trained to operate on such data.

Intended purpose
The use for which a product is intended according to the manufacturer's description, labelling and sales material. This does not include maintenance, or non user-facing, operations.

Intended use
Manufacturer's declaration as to all of the valid usages to which the product may be put. This includes operations such as 'maintenance modes'. The focus is on use, not just purpose.
(Intended purpose is a related, but distinct definition).

ISO
International Organization for Standardization. An international standards body.

ITU/WHO
International Telecommunication Union and World Health Organization. An international standards body.

Kernel smoother function
A mathematical function which takes multiple inputs, or high-dimensional data, and projects it to a lower-dimensional plane. Useful in smoothing input data; however, the parameters must be fixed prior to the learning of the weights in the ML model (e.g. as part of hyperparameter tuning).

Locked ML-model
The term 'locked' is a medical device usage meaning that the model does not change once it is certified for deployment. In practice this means that a version-specific ML model is trained and the product validation and certification is tied to the use of this particular model version.

Machine learning (ML)
The study of computer algorithms that improve automatically through experience and by the use of data.

Machine learning -(trained) models
Trained models take data similar to that in the training set and make predictions based on those similarities.

Premarket approval (PMA)
The manufacturer must apply for permission prior to carrying out any marketing activity.
Premarket authorization filing A set of documents which must be submitted to the USFDA in order to obtain premarket approval (PMA).

Probably approximately correct (PAC) framework
A framework for mathematical analysis of machine learning algorithms.
QA Quality assurance.
QMS Quality management system.

Recognized consensus standards
A list of national and international standards, accepted by the USFDA, for which a Declaration of Conformity may be made as part of the premarket authorization filing.

Regulatory dossier
The documentation supporting a regulatory application.

Safety
A product must be safe for users and other stakeholders. It should not cause unnecessary harm. This does not preclude the development of products which provide more benefit than harm.

Semantic drift
The meaning of medical terms changes over time. Often this change is too slow for human perception but it is highly detrimental to ML models.

Sensitivity
A phrase used in diagnostics.The proportion of positive cases which are correctly identified.

Shapley values
A mathematical approach, derived using game theory, for determining the key factors driving decisions. Its use is growing in the evaluation of ML models.
Software as a Medical Device (SaMD) Defined by the International Medical Device Regulators Forum as "software intended to be used for one or more medical purposes that perform these purposes without being part of a hardware medical device." (IMDRF SaMD Working Group 2013)

Specificity
A phrase used in diagnostics. The proportion of negative cases which are correctly identified.

SPIRIT-AI Standard Protocol Items: Recommendations for Interventional Trials - Artificial Intelligence
Stakeholder requirements A list of requirements in the form of statements on behalf of product stakeholders.

Standard of care (SoC)
The standard treatment path for a given clinical indication, typically laid down by an expert body either for a large hospital or on a national basis.
User Interface / User Experience (UI/UX) Terms commonly used in product design to express the user perspective of an interface or product experience.
USFDA US Food and Drug Administration. Responsible for regulating medical AI in the USA.

Validation
The assurance that a product meets the needs of the customer and other identified stakeholders.

Verification
The evaluation of whether or not a product complies with a requirement, specification, or imposed condition.
2015; Collins and Moons 2019) MINIMAR (MINimum Information for Medical AI Reporting) represents a minimal list of fields, including AI descriptions, which must be reported in the development of a medical AI solution. (Hernandez-Boussard et al. 2020) The CONSORT-AI (Consolidated Standards of Reporting Trials - Artificial Intelligence) extension is an emerging standard for clinical trials evaluating the efficacy of medical AI interventions; its sister protocol is SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials - Artificial Intelligence).
Best practice for the development of any machine learning (ML) model is the use of a segregated, randomly selected test set, and the use of cross-validation for any parameter tuning on the remaining training set. (Russell and Norvig 2016) Randomization of medical data sets carries a number of difficulties, notably due to different patient cohorts and inconsistencies in medical record keeping. It is, for example, incredibly common for the same patient to appear multiple times in a data set with different patient IDs. This means particular care must be paid to the quality of the randomization, preventing leakage of information between training and test sets, while preserving the desired predictive modalities across the sets. In reporting the development of the AI product, the sufficiency of the train-test or cross-validation set randomization must be justified.
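One way to prevent patient-level leakage, sketched here with scikit-learn's GroupShuffleSplit and invented patient IDs, is to randomize at the level of (linked) patients rather than individual records:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Records from the same patient (after record linkage across duplicate IDs)
# must never straddle the train/test boundary.
X = np.arange(10).reshape(-1, 1)  # stand-in feature matrix, one row per record
patient_ids = np.array(["p1", "p1", "p2", "p3", "p3",
                        "p3", "p4", "p5", "p5", "p6"])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=patient_ids))

# No patient appears on both sides of the split.
overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
print(overlap)  # -> set()
```

Note that this only works after duplicate patient IDs have been linked; group-wise splitting on unlinked IDs silently reintroduces the leakage it is meant to prevent.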
It is more informative to evaluate most products under their intended-use evaluation. The reason for this is the clear asymmetry of risk, between false positives and false negatives, in particular medical contexts. Full reporting of model performance during any cross-validation performed, e.g. during hyperparameter tuning, should be the norm. For small data sets, in products presenting a low risk profile, it may be allowable to report average model performance rather than test-set performance. Correct model comparison methods, for MxN-fold cross-validation, are particularly tricky. Correct model comparison equations are presented in papers by Nadeau and Bengio, and by Bouckaert and Frank. (Nadeau and Bengio 2000; Bouckaert and Frank 2004)
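A sketch of the corrected resampled t-test described by Nadeau and Bengio, which inflates the naive 1/J variance term by the train/test overlap factor n_test/n_train (the per-split score differences below are invented illustration data):

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(score_diffs, n_train, n_test):
    """Nadeau-Bengio variance-corrected t-test for repeated train/test splits.

    score_diffs: per-split performance differences between two models.
    The naive 1/J variance term is inflated by n_test/n_train to account
    for the overlap between resampled training sets.
    """
    d = np.asarray(score_diffs, dtype=float)
    J = len(d)
    corrected_var = (1.0 / J + n_test / n_train) * d.var(ddof=1)
    t = d.mean() / np.sqrt(corrected_var)
    p = 2.0 * stats.t.sf(abs(t), df=J - 1)
    return t, p

# Ten 90/10 resampled splits with small, noisy differences between models.
diffs = [0.010, 0.012, -0.002, 0.008, 0.011,
         0.004, 0.009, 0.013, 0.001, 0.006]
t, p = corrected_resampled_ttest(diffs, n_train=900, n_test=100)
print(round(t, 2), round(p, 3))
```

Compared with a naive paired t-test over the same differences, the corrected variance term makes the test deliberately more conservative about declaring one model better.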

Figure 1. Cartoon illustration based on experience of real-world medical AI projects.

Table 1. Principal ISO and IEC standards of particular relevance for medical AI development.
The most commonly known standards are the International Organization for Standardization (ISO) standards. The International Electrotechnical Commission (IEC) provides some equally important, but less publicly known, standards. Further standards are provided by bodies such as the National Electrical Manufacturers Association (NEMA), the Institute of Electrical and Electronics Engineers (IEEE), and ASTM International. A brief summary of the ISO and IEC standards most relevant to medical software development is presented in Table 1. A detailed description is beyond the scope of this article.

Table 2. Summary guide to build and evaluate the ML components of a medical AI product.

Table 2 (cont'd).
Reporting should follow a standardized checklist approach such as CONSORT-AI or SPIRIT-AI. (Liu et al. 2020; Rivera et al. 2020)
Output prioritization and resource planning: Blending of outputs. How is the blend achieved? I.e. what is the ranking method, or relative weighting, for different output categories? Evaluation of relative performance for different outputs. Risk evaluation for over-/under-diagnosis and the severity of the associated outcomes.

Table 3.
Appropriate data imputation in biomedical applications can depend on the intended use. Sometimes it is dangerous to impute missing values (e.g. weight change during pregnancy). In other cases, the form of imputation used can directly interact with the form of the algorithm.
For example, what will happen to patients in the time period between a shift in diagnostic criteria first occurring, its subsequent detection, and finally sufficient new training data being acquired and a product update being deployed?