Initial interactions with the FDA on developing a validation dataset as a medical device development tool

Quantifying tumor‐infiltrating lymphocytes (TILs) in breast cancer tumors is a challenging task for pathologists. With the advent of whole slide imaging that digitizes glass slides, it is possible to apply computational models to quantify TILs for pathologists. Development of computational models requires significant time, expertise, consensus, and investment. To reduce this burden, we are preparing a dataset for developers to validate their models and a proposal to the Medical Device Development Tool (MDDT) program in the Center for Devices and Radiological Health of the U.S. Food and Drug Administration (FDA). If the FDA qualifies the dataset for its submitted context of use, model developers can use it in a regulatory submission within the qualified context of use without additional documentation. Our dataset aims at reducing the regulatory burden placed on developers of models that estimate the density of TILs and will allow head‐to‐head comparison of multiple computational models on the same data. In this paper, we discuss the MDDT preparation and submission process, including the feedback we received from our initial interactions with the FDA and propose how a qualified MDDT validation dataset could be a mechanism for open, fair, and consistent measures of computational model performance. Our experiences will help the community understand what the FDA considers relevant and appropriate (from the perspective of the submitter), at the early stages of the MDDT submission process, for validating stromal TIL density estimation models and other potential computational models. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland. This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.


Introduction
Here we discuss efforts to create a dataset to validate a computational model based on data from digital pathology whole slide images (WSIs) [1] for consideration of qualification as a Medical Device Development Tool (MDDT) [2]. Specifically, we discuss feedback from the Center for Devices and Radiological Health (CDRH) of the U.S. Food and Drug Administration (FDA) on our initial interaction with the MDDT program.
The context of this work centers on tumor-infiltrating lymphocytes (TILs). TILs represent a readily available and robust prognostic biomarker that provides insight into T-cell-mediated immunity in cancer. In particular, stromal TILs (sTILs) in triple-negative breast cancer (TNBC) and human epidermal growth factor receptor 2 (HER2)-positive breast cancers [3-5] have been well studied for their prognostic importance. sTILs are quantified manually by pathologists by examining hematoxylin and eosin (H&E)-stained tissue with either a light microscope or a WSI viewer, a strategy endorsed by international clinical and pathology organizations [6,7] through a standardized assessment [8]. The guidelines define a specific biomarker, the density of sTILs, to infer prognostic importance for specific patient cohorts. However, the sTIL assessment is burdensome and fraught with pathologist variability [9,10]. These challenges can be alleviated by computational models powered by artificial intelligence and machine learning (AI/ML) that have had appropriate validation of their bias, variance, and clinical context. Because computational models are more quantitative and reproducible than pathologists, the adoption of digital pathology and AI in TNBC sTIL assessment can be quite useful in addressing a clinical need [11] that is well understood [6-8,12-30]. We view this work as a prototype addressing the broader need to validate the effectiveness of whole slide imaging AI/ML computational models.
In theory, the true value of the density of sTILs exists, but in practice, the true value is never empirically known. In this regard, pathologists provide an estimate. This estimate is noisy and subject to inter- and intrareader variability [31,32] arising from differences in the training, experience, and natural abilities of each pathologist. Pathologist variability ultimately impacts model validation when treated as the reference standard. Therefore, it is important to identify primary sources of variability and to create mitigation strategies to reduce this variability while maintaining a good estimate of the underlying sTIL density. These strategies will lead to an improved assessment and comparison of models, providing a more reliable mechanism to evaluate prognostic and predictive biomarkers.
The MDDT program is a voluntary program established by CDRH for the qualification of tools to support device development and regulatory decision-making while reducing the burden of regulatory submissions [2,33,34]. There are two phases to the MDDT qualification process: a proposal phase and a qualification phase. Each phase includes a submission to the FDA by the submitter and a response to the submitter by the FDA. Key technical elements of the proposal include a description of the tool (including its context of use), qualification criteria by which the submitter proposes the tool be judged, and a summary of the plan to collect evidence in support of qualification. The qualification package is largely the same but includes and discusses the evidence rather than the plans to collect it. If a tool is qualified, a summary of evidence and basis of qualification are produced and shared publicly on the MDDT website [2]. Use of an MDDT in a submission is optional, but when used within the context of use, a qualified MDDT does not require additional justification.
In 2019, members of the High-Throughput Truthing (HTT) project interacted with the MDDT program to pursue qualification of a validation dataset consisting of slides, images, and estimates of the density of sTILs from multiple pathologists. The density of sTILs we use is the same as defined by Salgado et al [8]: the area of sTILs divided by the area of the tumor-associated stroma. For our dataset, the area is estimated on 500 × 500 μm² regions of interest (ROIs) selected to cover different tissue types [1]. The purpose of annotating ROIs is to reduce the biological variability arising from different locations in a slide, reduce the burden on the pathologist of annotating all the tissue in a slide, and reduce the variability from pathologists looking in different areas.
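To make the definition above concrete, the ROI-level density can be computed from binary masks. This is a minimal sketch under the assumption that sTIL and tumor-associated stroma regions are available as pixel masks; the function name and mask representation are ours for illustration and are not part of the HTT protocol:

```python
import numpy as np

def stil_density(stil_mask: np.ndarray, stroma_mask: np.ndarray) -> float:
    """Density of sTILs within one ROI: area of sTILs lying within
    tumor-associated stroma, divided by the total stromal area."""
    stroma_area = stroma_mask.sum()
    if stroma_area == 0:
        return float("nan")  # ROI contains no tumor-associated stroma
    stil_in_stroma = np.logical_and(stil_mask, stroma_mask).sum()
    return float(stil_in_stroma) / float(stroma_area)

# Toy 4 x 4 ROI: half the pixels are stroma; a quarter of the stroma is sTILs.
stroma = np.zeros((4, 4), dtype=bool)
stroma[:, :2] = True             # 8 stromal pixels
stils = np.zeros((4, 4), dtype=bool)
stils[:2, 0] = True              # 2 sTIL pixels, all within stroma
print(stil_density(stils, stroma))  # → 0.25
```

In practice the masks would come from pathologist annotations or a model's segmentation output at a known pixel spacing; since the density is a ratio of areas, the pixel size cancels out.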
A standardized dataset will be useful when creating and reviewing a regulatory submission. It will reduce the burden on developers and reviewers and will allow head-to-head comparisons of multiple models based on the same data. We believe that successful application of the MDDT mechanism to the problem of estimating sTIL density will provide a prototype for applying the MDDT pathway to other whole slide imaging AI/ML processing pipelines.
In the following sections, we describe our plans to create the validation dataset outlined in our pending MDDT proposal, the feedback received from the FDA in our initial interaction, and insights learned from the process. It is not common practice to share feedback from the FDA regarding details of a submission. However, because our MDDT proposal is primarily driven by multistakeholder collaboration in the precompetitive space, we find benefit in sharing our findings with the community. This is useful in two ways. First, anyone contemplating an MDDT submission to the FDA can gain some understanding of how the FDA responds to such requests (i.e. the submission process). Second, the feedback from the FDA provides insight into considerations the FDA has when reviewing AI/ML models (referred to in the regulatory space as software as a medical device, SaMD [13]). Sharing this information with the community will help and inspire stakeholders (e.g. patient advocacy groups, healthcare providers, professional societies, and funding organizations) to develop and support additional MDDT submissions. Our initial interaction is provided in supplementary material, Appendix S1, and our interpretation of the process is outlined in Figure 1. We plan to submit our MDDT proposal and responses to the FDA's feedback.

The High-Throughput Truthing (HTT) project
The primary goal of our project, titled the HTT project [1,35], is to create a validation dataset of H&E-stained TNBC slides and digital WSIs with pathologist annotations of sTIL densities for validating computational models. Project members include scientists, clinicians, and the FDA, in collaboration with the Pathology Innovation Collaborative Community [36] and the International Immuno-Oncology Biomarker Working Group [37]. We recently completed a pilot study in which we collected 64 biopsies of invasive ductal carcinomas, defined 10 ROIs per WSI, and collected 7,373 estimates of sTIL densities from 29 pathologists [9]. Microscopic assessments were enabled by mapping WSI scan coordinates to the corresponding coordinates on the glass slide of the optical microscope using the evaluation environment for digital and analog pathology (eeDAP) system [35]. Digital assessments were made using either the caMicroscope [38] or PathPresenter [39] digital WSI viewer.

Planned MDDT
Through crowd-sourced expert pathologists [8], the HTT dataset will offer a reference standard that captures real-world pathologist variability in sTIL assessments. Estimating and comparing the performance of computational models requires an accounting of pathologist variability. The ideal qualified MDDT validation dataset would address all relevant subgroups and acquisition devices and be large enough to address submission questions. For our planned MDDT, we proposed that it be categorized as a nonclinical assessment model, as it will be used to measure a SaMD's performance on an external dataset. That said, the category is not critical; it is intended to help guide the FDA on where to send a submission within the FDA for review. The category can be revised later in collaboration with the FDA. Our targeted context of use can be found in supplementary material, Appendix S1, and is summarized as a dataset of digital ROIs from WSIs of H&E-stained slides containing triple-negative breast cancer for the analytical validation of a sponsor's computational model that quantifies sTILs in these images. Model developers who elect to use a qualified MDDT validation dataset in a regulatory submission may not need to develop their own dataset for their stand-alone performance study. Furthermore, they could avoid having to source slides and recruit expert pathologists to generate reference standards for their stand-alone study, which will reduce development burden through time and cost savings. Though our proposed dataset may have limitations that keep it from being the ideal validation dataset, we believe our project will still be useful for sTIL density estimation models and will save developers time and money. Our project will additionally develop a statistical analysis plan carried out by open-source software to assess model performance and account for variability due to pathologists and cases. The statistical analysis plan and software will be separate from the MDDT but may be used by model developers to create a performance report in a regulatory submission or by other stakeholders to compare the performance of multiple models in a consistent and fair way.

Questions to and responses from the FDA
A unique aspect of the MDDT process is the opportunity for submitters to receive official FDA feedback on the proposed tool and the plans to generate the qualification evidence, with a mechanism to ask the FDA for input on unresolved questions. The FDA recruits a team of internal regulatory reviewers to review each submission. The team includes colleagues with domain expertise (e.g. clinical, statistics, engineering) to provide thorough comments as needed. In our initial interactions (supplementary material, Appendix S1), we asked two questions:
• So that we can properly power our study, what are the FDA's recommendations on the number of sites, slides per site, and readers per slide?
• Should we expand the collected slides to include non-TNBC cases, which could facilitate data collection?

Power and sample size
For our sample size and power question, the FDA responded that 'the samples (cases and readers) should be representative of the intended populations' and 'since the truthing of the dataset is not based on an independent gold standard but statistical integration of assessments of multiple pathologists, the sample sizes should target certain precision of the truthing.' In response, we have chosen to gather clinical data elements to define the characteristics of the patient study cohort. These include patient age, sex, race, ethnicity, breast cancer stage, and WSI scanner make and model. Considering the low prevalence of TNBC and challenges with data acquisition, we excluded treatment plans and patient outcomes as covariates because including them would have required a larger sample size. We plan to provide summary statistics of the characteristics of the dataset and discuss its limitations in the submission.
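One simple way to reason about "precision of the truthing" is through the standard error of the mean of repeated pathologist reads of the same ROI. The sketch below assumes independent, identically distributed reads, which ignores the case- and reader-level correlation a full variance-components analysis would model; the reader SD and target precision used here are purely illustrative, not values from the HTT study:

```python
import math

def readers_needed(sd_reader: float, target_se: float) -> int:
    """Smallest number of independent pathologist reads per ROI such that
    the standard error of their mean, sd_reader / sqrt(n), falls at or
    below target_se. Assumes i.i.d. reads (a deliberate simplification)."""
    return math.ceil((sd_reader / target_se) ** 2)

# Example: reader SD of 15 percentage points, target SE of 5 points.
print(readers_needed(15.0, 5.0))  # → 9
```

A real sizing exercise would instead fit a mixed-effects model to pilot data (such as the 7,373 pilot reads) to separate case, reader, and residual variance before targeting a precision for the pooled reference value.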

Inclusion of non-TNBC cases
Regarding our question about including non-TNBC patients, the FDA encouraged restricting the MDDT to only TNBC cases. The agency indicated that, 'due to the significant differences between TNBC and non-TNBCs, the usability of the dataset will diminish significantly if non-TNBC samples are used during algorithm development. As TNBC will be an important part of the dataset, the Agency recommends that TNBC cases be used.' For the clinician, the task of estimating the sTIL density generalizes across TNBC, HER2+ cancer, and other subgroups. However, for a computational model, different histologic subtypes, tumor grades, and treatment modalities may change the type of lymphocyte present, the tissue context, and the morphology. Model performance might be sensitive to these types of histologic characteristics. However, we are not aware of any data showing that these characteristics exist to a degree that impacts a model's ability to identify sTILs and estimate sTIL densities. We are not creating this dataset to resolve the many questions about the generalizability of models beyond TNBC patients. More research will need to be performed in this area before we can confidently ignore the breast cancer subtype in the assessment of models that estimate sTIL densities. Until then, exclusive use of TNBC data will necessarily remain a limitation of the MDDT's context of use should it be qualified. Given TNBC's prevalence of 12-15% [40], if we need to expand our inclusion criteria, we might include core biopsies and/or pretherapeutic excisional tissue of TNBC. We may also further expand our cohort to include HER2+ cases to adequately size our study. As the FDA noted, '[i]f sourcing becomes an issue and non-TNBC samples have to be sourced, the sponsor should justify in detail why "cases without TNBC status could be used" to evaluate the TNBC algorithm'. This expansion would also change the context of use, requiring further updates to the submission. Ultimately, we have decided to keep the scope of our dataset limited to increase the odds that we can deliver an adequately sized and characterized dataset.

General feedback
In addition to responses to our stated questions, the FDA offered feedback on our initial interactions. The FDA asked us to (1) modify the context of use, (2) include a detailed description of any devices used to collect pathologist annotations that are not FDA qualified or cleared, (3) exclude the predefined statistical analysis plan from the MDDT, and (4) add a plan for dataset decommissioning.

Context of use
The FDA offered feedback on the context of use. The comments suggested various enhancements to improve the value and understanding of the dataset, such as clarifying the limitations of the MDDT. For instance, it was pointed out that '[t]he use of this tool does not seem to be limited to analytical validation of AI/ML-based algorithms, and should apply to all kinds of algorithms, including the traditional feature-based algorithms'.

Collection of annotations using unregulated devices
The FDA voiced concern about the use of nonauthorized devices for the collection of reference standard annotations (e.g. WSIs generated by scanners that are not currently authorized). The FDA wrote, 'Please note that since the eeDAP has not been FDA qualified yet as an MDDT, whether it can be included as a component of the current proposed MDDT will require additional discussions. The Agency recommends you address this issue.' This feedback reflects the importance of validation and justification of the materials and methods employed. In the proposed MDDT, pathologists can use two modalities to perform their annotations: the light microscope-based eeDAP system [35] or two online digital WSI viewing and annotation platforms (caMicroscope, PathPresenter) [38,39]. This concern means we need to validate and justify the use of the eeDAP system, as well as the WSI scanning and viewing devices. The eeDAP device registers the glass slide (stage coordinates of a microscope) to the digital WSI (WSI coordinates) such that the annotations from eeDAP are independent of WSI scanner, file format, or vendor. The true density of sTILs collected from this modality does not depend on the scanner. From a statistical point of view, the reference for a test method should be independent of the test method. That is why we wish to identify eeDAP as the primary modality for collecting annotations. We will demonstrate the registration accuracy of eeDAP (modality validation) to verify that annotations correspond to the proper ROIs across modalities.
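This paper does not describe eeDAP's registration internals, but the general idea of mapping microscope stage coordinates to WSI pixel coordinates can be sketched as a least-squares affine fit to matched landmark points. This is a hypothetical simplification for illustration; eeDAP's actual registration procedure may differ:

```python
import numpy as np

def fit_affine(stage_xy: np.ndarray, wsi_xy: np.ndarray) -> np.ndarray:
    """Least-squares affine transform mapping stage coordinates to WSI
    coordinates from n >= 3 matched landmark pairs (rows of each array)."""
    n = stage_xy.shape[0]
    design = np.hstack([stage_xy, np.ones((n, 1))])      # n x 3 design matrix
    T, *_ = np.linalg.lstsq(design, wsi_xy, rcond=None)  # 3 x 2 transform
    return T

def stage_to_wsi(T: np.ndarray, xy: np.ndarray) -> np.ndarray:
    """Apply the fitted transform to stage coordinates (rows of xy)."""
    return np.hstack([xy, np.ones((xy.shape[0], 1))]) @ T

# Toy landmarks related by a scale of 2 and an offset of (100, 50).
stage = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
wsi = stage * 2.0 + np.array([100.0, 50.0])
T = fit_affine(stage, wsi)
print(stage_to_wsi(T, np.array([[0.5, 0.5]])))  # close to (101, 51)
```

An affine model captures translation, rotation, scaling, and shear between the two coordinate frames; the residuals of the fit provide one measure of the registration accuracy mentioned above.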
Regarding WSI scanning and viewing devices, we plan to use FDA-cleared scanners to obtain WSIs but plan to validate and justify the use of noncleared viewers. The validation of the viewers will involve technical arguments and data demonstrating the agreement of microscope- and digital-mode annotations.

Regions of interest
Another point the FDA made was to provide more details around the selection of the ROIs [4]. The FDA requested 'the established evaluation method for stromal TILs and how the clinical relevance of the marked ROIs is ensured'. ROI selection could impact the context of use by affecting generalizability if the variability of morphology is not adequately sampled in the ROI corpus. For the pivotal study dataset, we will sample ROIs using estimated sTIL densities, clinical metadata attributes, and tissue features from a corpus of ROIs generated by experts. The experts participated in ROI selection only after demonstrating proficiency in the task, which included completion of a continuing medical education course [41]. Expanding on this dataset to collect sTIL densities on the entire WSI and to understand the relationship between ROI-level and WSI-level performance represents possible future work.

Statistical analysis plan
The FDA requested that the statistical analysis plan and software component be excluded from the final MDDT proposal. The FDA expressed that, due to the proposed longevity of the dataset and 'any possible future changes in Agency perspective on the appropriate statistical methods to be used in these situations, the statistical analysis plan and software component should not be a part of the proposed MDDT, but may be a subject of discussion during future communications'. On the one hand, including statistical analysis methods creates the opportunity for multimodel comparisons in a consistent way. On the other hand, including statistical analysis methods entails making several assumptions about the intended use of a model. Submissions may have different workflows for their models and statistical analysis plans that do not align with the MDDT's context of use but may be equally valid approaches. For example, the MDDT, in its current form, assumes that the pathologist will identify an ROI and send it to the model for evaluation. However, other models may receive a WSI and internally determine ROIs without pathologist input to assess sTIL density at the WSI level. This difference in intended workflow may require significant investment and development by sponsors to provide data that fit the MDDT format (e.g. a standardized definition of input file requirements) and might yield results that no longer reflect the intended use of the model as defined by the device sponsor. Because we do not presume that our approach is the only valid one, this exclusion is a reasonable consideration and may maintain the relevance of this MDDT in future submissions. However, we are developing a statistical plan to properly size our study, and it will be published and publicly available regardless of its inclusion in the MDDT.
On a related note, any dataset used multiple times to measure the performance of a model over time may cause the resulting models to suffer from overfitting [42]. In other words, when test data are used multiple times, the results implicitly feed back into model training. This happens because the model developer is making changes to improve the results on the test data. This feedback leads to subsequent models that are better on the test set and possibly worse on unseen data; the performance of subsequent models is biased high and does not generalize. Benchmark datasets, like the very large and widely used ImageNet [43], have been used to compare the performance of one model over another, yet overfitting may still lead to poor generalization on new data [44]. This phenomenon is not unique to computational models and is known more broadly as Goodhart's law: 'When a measure becomes a target, it ceases to be a good measure' [45].
In light of this issue, datasets are normally broken into three categories: training, tuning, and test sets. Model weights and parameters are typically learned on the training set, tracked and optimized with the tuning set, and (ideally) only validated on the test set once to gauge generalization performance. We plan to share our dataset broadly at no cost with any entity, subject to terms required by the FDA or the MDDT program. The terms may include legal agreements limiting how the dataset may be used or may include methods for developers to share their model while keeping the dataset hidden.
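The three-way partitioning described above can be sketched as follows. Splitting at the case (patient) level, rather than the ROI level, keeps all ROIs from one patient in a single partition and avoids leakage between partitions; the function and split fractions are illustrative, not the HTT project's actual protocol:

```python
import random

def split_cases(case_ids, frac_train=0.6, frac_tune=0.2, seed=0):
    """Partition case identifiers into train/tune/test lists so that all
    ROIs from one patient land in exactly one partition (no leakage).
    The remainder after train and tune fractions becomes the test set."""
    ids = sorted(case_ids)            # sort for a reproducible starting order
    rng = random.Random(seed)         # seeded shuffle for reproducibility
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(n * frac_train)
    n_tune = int(n * frac_tune)
    return (ids[:n_train],
            ids[n_train:n_train + n_tune],
            ids[n_train + n_tune:])

train, tune, test = split_cases(range(10))
print(len(train), len(tune), len(test))  # → 6 2 2
```

A held-out test set built this way supports the single-use evaluation discussed below: the test partition is touched once, after all training and tuning decisions are frozen.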

Dataset decommissioning
We believe that it is appropriate to use an MDDT dataset once for testing. If it is used more than once, a model developer would have to address concerns about overfitting. Two strategies that can mitigate these concerns are continuously adding new data to the test set and sampling a subset of the test set with each use. Regardless of the method used, it is important that the community monitor the generalization performance of each model over time, a suggested best practice for all medical devices [46]. Lastly, the FDA expressed concern that 'any future changes in clinical practice, especially those prompted by AI/ML algorithms themselves, and/or changes to existing literature, may make databases such as the one you propose obsolete or introduce systematic bias in training and validation which can have an impact on patient care'. We believe this concern applies not only to MDDTs but also to medical devices and laboratory-developed tests. Adapting to current clinical practice is essential to improving patient care beyond the status quo, so it is reasonable to acknowledge that any dataset will have a limited usable shelf life.

Discussion
The HTT project is creating a dataset of pathologist estimates of sTIL densities in triple-negative breast cancers. These estimates are intended to be the reference standard for assessing the performance of a computational model. The benefit of qualifying the dataset through the CDRH MDDT program is that the qualification process will improve the quality of the dataset, yielding a dataset that can be used in regulatory submissions. Moreover, access to our dataset as a qualified MDDT will (1) reduce the burden on model developers, (2) provide feedback to submitters about the performance of their model, (3) allow the FDA to make head-to-head comparisons of models using the same data, (4) provide a roadmap for others to create MDDT datasets that address the validation of computational models targeting other diagnostic digital pathology tasks, and (5) enhance public understanding of the role of the MDDT pathway in transparent, standardized model performance assessment for the public benefit.

Figure 1 .
Figure 1. Outline of the HTT project, including the two stages of the MDDT application process: submitting the proposal and submitting the qualification package. This workflow represents our experience with and interpretation of the MDDT workflow and does not represent the formal process as defined by the FDA's MDDT program. We have received feedback from our first interaction and are currently developing a proposal.

Hart et al