Why do errors arise in artificial intelligence diagnostic tools in histopathology and how can we minimize them?

Artificial intelligence (AI)‐based diagnostic tools can offer numerous benefits to the field of histopathology, including improved diagnostic accuracy, efficiency and productivity. As a result, such tools are likely to have an increasing role in routine practice. However, all AI tools are prone to errors, and these AI‐associated errors have been identified as a major risk in the introduction of AI into healthcare. The errors made by AI tools are different, in terms of both cause and nature, to the errors made by human pathologists. As highlighted by the National Institute for Health and Care Excellence, it is imperative that practising pathologists understand the potential limitations of AI tools, including the errors made. Pathologists are in a unique position to be gatekeepers of AI tool use, maximizing patient benefit while minimizing harm. Furthermore, their pathological knowledge is essential to understanding when, and why, errors have occurred and so to developing safer future algorithms. This paper summarises the literature on errors made by AI diagnostic tools in histopathology. These include erroneous labels, data concerns (data bias, hidden stratification, data imbalances, distributional shift, and lack of generalisability), reinforcement of outdated practices, unsafe failure mode, automation bias, and insensitivity to impact. Methods to reduce errors in both tool design and clinical use are discussed, and the practical roles for pathologists in error minimisation are highlighted. This aims to inform and empower pathologists to move safely through this seismic change in practice and help ensure that novel AI tools are adopted safely.


Introduction
Artificial intelligence (AI) tools are starting to be introduced to many areas of healthcare, with cellular pathology being a key area of development. AI tools can be used across the pathology pipeline, including streamlining the laboratory workflow, quality control, screening and diagnosis of whole-slide images (WSI), and immunohistochemistry assessment. 1 AI has the potential to increase the accuracy, efficiency, and availability of pathology diagnostics, amalgamating the increasing volumes of complex patient data and helping to overcome the shortage of pathologists. 2 Currently, screening tools have been approved for use in both prostate and breast cancer, with the field expected to expand rapidly and reshape the speciality of pathology. 1,3

As increasing numbers of AI tools move into the clinical realm, it will be practising pathologists who use such tools in their daily reporting. In a survey of 487 pathologists from 59 countries, the general attitude towards the implementation of AI tools was positive, with a widespread belief that they will improve efficiency. 4 However, only 24.8% of pathologists reported no concern about AI-tool errors, showing that apprehension exists within the profession about potential errors. 4 Moreover, in a literature review by the European Parliamentary Research Service, patient harm from AI tool errors was identified as one of the main risks of AI in healthcare. 5

Therefore, it is crucial that pathologists are aware of the capabilities and limitations of the AI tools being introduced, including the errors that are made, why they occur, and how they can be mitigated. This is essential to maximize benefit and minimize patient harm. 6 As users of the technology, pathologists can influence both its design and its use in order to prioritize safety. Additionally, as the introduction of AI is likely to raise complex legal questions, 7 particularly when errors do arise, it is important that pathologists are aware of this new consideration when practising.
Several articles have considered the source of errors in AI tools in healthcare generally, but none specifically focus on pathology. 8,9 The National Institute for Health and Care Excellence (NICE) has highlighted the need for pathologists to be trained to understand the limitations of AI technology. 10 This article is intended to be a guide for pathologists on the errors made by AI algorithms and how to minimize them.

Why do errors arise in AI diagnostic tools?
Interestingly, it is known that the errors made by AI systems are different from the errors made by humans. 11,12 In radiology, a similar image-based diagnostic speciality, it is reported that 60-70% of errors are perceptual errors, essentially a failure to identify a salient finding while reviewing images (as opposed to a cognitive error, where an abnormality is identified but incorrectly interpreted). 13 In contrast, AI diagnostic tools are less susceptible to incomplete searches, but instead produce errors due to other factors, so the types of errors made are likely to be different. 11

Typically, in the development of AI diagnostic tools, the model is first trained on one dataset, referred to as the training dataset. The model is then tested on a second, testing dataset, comprising data unseen during the training stage, for external validation. 14 It is well understood that these two developmental datasets should be different in order to assess the algorithm's performance on unseen data, and ideally some of the test dataset should come from external sources. This is summarized in Figure 1. The data are frequently labelled, either by labelling a whole WSI with the diagnosis or by labelling smaller areas within the WSI, giving information about the region of interest. 3 The algorithm's performance is then compared against a defined reference standard, usually a pathologist's diagnosis. 15

In the model development process, labelling is generally done by experts such as pathologists, or can be generated semiautonomously using rule-based approaches to identify certain conditions. 16 Labelling errors can be made by human annotators due to a lack of information (including poor-quality images), mistakes, variability in opinion between experts, and data coding or communication problems. This is a particular issue with medical images, where the labelling task can be complex and opinions can vary between professionals. 17 Labelling errors are known to be a pervasive issue, occurring in both the training and testing datasets used across a wide variety of machine-learning domains. 18 If slides are labelled incorrectly, the algorithm could be trained to make incorrect disease classifications, potentially resulting in errors when used. To minimize the likelihood of such errors, we must ensure high-quality and correct labelling of slides by appropriately trained experts.
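As a minimal illustration of the development pipeline described above, the sketch below (Python with scikit-learn; the toy data, variable names, and thresholds are assumptions for illustration, not any particular tool's method) shows an internal train/test split kept separate from an external validation set, together with a simple inter-annotator agreement check carried out before training.

```python
# Minimal sketch (assumed names/toy data): internal train/test split plus a
# separate external test set, and a basic check of annotator label agreement.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, accuracy_score

rng = np.random.default_rng(0)

# Toy stand-ins for slide-level feature vectors and diagnostic labels.
X_internal = rng.normal(size=(500, 20))        # e.g. slides from the developing centre
y_internal = rng.integers(0, 2, size=500)      # 0 = benign, 1 = malignant
X_external = rng.normal(size=(200, 20))        # unseen slides from another centre
y_external = rng.integers(0, 2, size=200)

# Labels from a second annotator; low agreement would prompt review of the
# labelling protocol before any training takes place.
y_second_reader = y_internal.copy()
print(f"Inter-annotator kappa: {cohen_kappa_score(y_internal, y_second_reader):.2f}")

# Training and internal test data must not overlap.
X_train, X_test, y_train, y_test = train_test_split(
    X_internal, y_internal, test_size=0.2, random_state=0, stratify=y_internal
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Internal test accuracy:", accuracy_score(y_test, model.predict(X_test)))
# External validation on data never seen during development.
print("External test accuracy:", accuracy_score(y_external, model.predict(X_external)))
```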

Missing data leading to bias and hidden stratification
Underrepresentation of disease entities or population groups within developmental datasets is a well-recognized problem, 20-30 and is likely to result in some entities not being identified correctly by the algorithm, meaning it is likely to underperform in such groups. 32,33 This can also cause the problem of hidden stratification: an algorithm appears to perform well across the whole population, but is actually performing poorly in subsets not identified during training or testing, and this niche of poor performance goes undetected. 11,12,15 For example, an algorithm may generally be effective at lung cancer detection, but consistently miss a rare subtype. 12 This presents a significant safety concern and can be difficult to detect. There is evidence of it across multiple medical imaging datasets, and therefore it is likely to also present a problem in pathology. 12

It is not uncommon for articles heralding the success of AI tools in pathology to be built using datasets that exclude certain entities or difficult types of slides. A deep-learning assistive algorithm for histopathological screening of colorectal cancer excluded slides that were blurred or contained folded tissue, slides that contained malignancies other than colonic adenocarcinoma, and any slides suggestive of mucinous adenocarcinoma or signet ring carcinoma. 34 Another colon biopsy screening tool, aimed at separating colonic biopsies into normal and abnormal cases, did not detect several entities that human pathologists would expect to, such as signet ring cells, giant cells, mitotic figures, and spirochaetosis. 35 This means that, although such algorithm results are promising, to use them safely pathologists need to understand their limitations and delineate the further work required to improve performance.
In the field of prostate cancer, multiple screening algorithms from several commercial vendors are approved for use. One such tool has several studies demonstrating its overall successful performance 36-39 ; however, only one study performed any subgroup analysis across patient age, race, and ethnicity (reporting no significant difference in algorithm performance across these variables); thus, there is limited opportunity to detect biases. 39
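A subgroup analysis of the kind described can be expressed very simply: compute the same performance metric separately for each patient or disease subgroup and flag any group where performance drops. The sketch below is illustrative only; the column names, subgroups, and threshold are assumptions.

```python
# Illustrative subgroup analysis: the same metric computed per subgroup to
# surface hidden stratification. Column names and threshold are assumptions.
import pandas as pd
from sklearn.metrics import recall_score

results = pd.DataFrame({
    "y_true":   [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "y_pred":   [1, 1, 0, 0, 0, 1, 0, 0, 1, 1],
    "subgroup": ["acinar", "acinar", "acinar", "cribriform", "cribriform",
                 "acinar", "cribriform", "acinar", "acinar", "cribriform"],
})

print(f"Overall sensitivity: {recall_score(results.y_true, results.y_pred):.2f}")

# A respectable overall figure can hide poor performance in one subgroup.
for name, group in results.groupby("subgroup"):
    sens = recall_score(group.y_true, group.y_pred)
    flag = "  <-- review" if sens < 0.8 else ""
    print(f"{name}: sensitivity {sens:.2f} (n={len(group)}){flag}")
```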

Data imbalances
Another source of error is the imbalance of entities within developmental datasets. Within pathology, there are rare disease entities that it is important not to miss. If a random selection of cases is used to train an algorithm, it is unlikely to see enough of these cases to learn from them (thus causing bias and hidden stratification); to overcome this, datasets can be "rebalanced" by oversampling rarer cases. If these datasets are not later corrected appropriately, this can lead to the AI tool overdiagnosing rare entities, again being a potential source of error. 9,40
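As one hedged illustration of this rebalance-then-correct idea, the sketch below oversamples a rare entity for training and then applies a simple prior-shift correction so that the deployed model's probabilities reflect the true prevalence rather than the artificially balanced training set. The data, prevalences, and the choice of correction method are assumptions for illustration.

```python
# Sketch (assumed toy data): oversampling a rare entity for training, then
# correcting predicted probabilities back to the true prevalence so the
# deployed tool does not systematically over-call the rare class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(1)
n_common, n_rare = 980, 20                      # true prevalence of rare entity: 2%
X = np.vstack([rng.normal(0, 1, (n_common, 5)), rng.normal(1.5, 1, (n_rare, 5))])
y = np.array([0] * n_common + [1] * n_rare)

# Rebalance: oversample the rare class so the model sees enough examples.
X_rare_up, y_rare_up = resample(X[y == 1], y[y == 1], n_samples=n_common, random_state=1)
X_bal = np.vstack([X[y == 0], X_rare_up])
y_bal = np.concatenate([y[y == 0], y_rare_up])

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Prior correction: rescale posterior odds from the training prior (50%) back
# to the true prevalence (2%) before applying any decision threshold.
p_train, p_true = 0.5, n_rare / (n_common + n_rare)
p_bal = model.predict_proba(X)[:, 1]
odds = (p_bal / (1 - p_bal)) * (p_true / (1 - p_true)) / (p_train / (1 - p_train))
p_corrected = odds / (1 + odds)

print("Mean predicted prevalence before correction:", round(p_bal.mean(), 3))
print("Mean predicted prevalence after correction: ", round(p_corrected.mean(), 3))
```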

Distributional shift and lack of generalisability
If there is a mismatch between the developmental datasets and the operational data where the tool is used, this creates a distributional shift. 42,43 In this scenario, algorithms lack generalisability, meaning the predictions are less accurate and more error-prone than reported. 27,44 This lack of generalisability can occur due to subtle differences between the development and operational datasets; for example, the use of different scanners. 27,43,45 Although studies do try to tackle this by including data from different laboratories and different scanners, it is reported that the number of laboratories included is "typically too small for a true assessment of generalizability". 46

One study key to the introduction of an approved prostate screening tool used data from the Pathology Institute at Maccabi Healthcare Services, Israel, for training and internal validation, and then the University of Pittsburgh Medical Center for external validation. 47 This is a positive step towards generalisability, with the centres being geographically distinct and using different scanners, although data from only two centres is unlikely to be truly generalisable. To overcome this, further work is ongoing, including at several UK hospitals, to increase generalisability and demonstrate the translatability of results. 48

In order to address these potential sources of error, the datasets used to train and test algorithms must be large and diverse enough to represent the different populations, environments, diseases, and data acquisition methods in which they will be used. 8,15,30,46,49 In pathology, the developmental data should encompass different hospitals, scanners, batches, patient groups, disease entities, and severities. Artificial techniques such as data augmentation (modifying existing images to create new data) can be used to increase input image variation, such as colour, contrast, and orientation, mimicking the input from different pathology laboratories. 43,46 Pathologists' knowledge should be used to shape datasets; for example: what are the rare entities with serious consequences that must not be missed, and so need enriching in the dataset? Recent proposals for standardization of pathologists' annotations may help in the assembly of these datasets in the future. 50

Subgroup analysis should be performed to evaluate the performance of the algorithm across different groups. 6 If there are any groups missing from the developmental series, for any reason, this must be clearly documented, as the algorithm may not perform as well in these groups. 15 Decisions must then be made by end-users (including pathologists) about how this is managed, perhaps even avoiding use of the algorithm in certain scenarios, although such steps may worsen, rather than address, inequalities in patient care. The alternative approach is to introduce extra quality checks to detect errors as the algorithm enters practice and to use the wider data accumulated from early adoption to improve the tool and address the deficiencies identified.
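The data augmentation described above (perturbing colour, contrast, and orientation to mimic inter-laboratory variation) can be sketched as follows. This is an assumption-laden illustration using torchvision and Pillow on a dummy tile; the parameter values are arbitrary and would need tuning against real stain and scanner variation.

```python
# Sketch of simple data augmentation for histology tiles, mimicking variation
# in colour, contrast, and orientation between laboratories and scanners.
# Requires torchvision and Pillow; parameter values are illustrative only.
import numpy as np
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),                 # orientation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),       # stain/colour shift
])

# A dummy 256x256 RGB tile standing in for a patch extracted from a WSI.
tile = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))

# Each call produces a differently perturbed copy, expanding the effective
# variety of inputs seen during training.
augmented_tiles = [augment(tile) for _ in range(4)]
print(len(augmented_tiles), "augmented variants generated")
```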

Reinforcement of outdated practices
As clinical practice and treatment progress, failure to adapt the AI tools used in the pathological assessment of samples risks a drift towards outdated practice. This could be seen as a specific form of distributional drift. 27,44,45 Although AI algorithms can be updated manually to fit new protocols, their success is largely based on the availability of appropriate data, which may not exist at the point of a new change, so such changes could be a source of errors. 8 In pathology, this is likely to be seen through incorrect classification of entities as classification and staging systems are revised.
Pathologists are pivotal in addressing this issue through being able to anticipate potential changes and assessing if the algorithm outputs remain in line with new guidance, as well as providing heightened oversight during the transition.
As discussed, distributional drift occurs due to a mismatch between the developmental and operational data. Humans often find themselves in a similar situation: a new scenario they have not experienced previously; for example, a pathologist faced with a previously unseen entity. The key in such a scenario, and an ability that humans inherently have, is to recognize our own ignorance, allowing us to seek help, such as by asking colleagues. 51 Machine-learning systems do not exhibit this characteristic, and remain rigidly fixed to the outputs they have been trained to produce. Therefore, when distributional shift occurs, AI models not only make errors but may retain high confidence in their errors. 51 This leads to the term 'unsafe failure mode': when a tool gives an output despite lacking the information or training required to make a robust decision. 8

One study of a prostate cancer assistive algorithm demonstrates that the tool has a form of fail-safe mode, being able to detect and flag slides that are out of distribution if the thumbnail image is significantly different from the distribution of core needle biopsy slides used in algorithm development. 37 This is an example of an important safety mechanism introduced to avoid errors. However, it failed to identify some slides with quality issues, and so continued to make diagnostic predictions (and misdiagnosed adenocarcinoma as benign in some of these cases). 37

Furthermore, as suggested by Macrae, 52 whether an AI tool fails safely or not is not "simply a technical property of an AI system declining to provide a prediction when its confidence is low, but rather is a sociotechnical property of the entire work system that a technology is embedded in". In other words, if an AI tool avoids giving a prediction in an unknown scenario, and the decision passes to a human pathologist lacking experience due to limited exposure, there is still potential for errors to occur due to human factors. 52,53 We need to use AI tools that have a robust fail-safe mode, so they recognize scenarios in which they could perform badly and mitigate errors by avoiding diagnosing in such settings. 51,54 Additionally, in order to provide adequate human supervision, we must implement strategies to prevent the de-skilling of pathologists. 52
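The fail-safe behaviour described above (flagging slides that fall outside the distribution seen in development and deferring to a pathologist) can be approximated very crudely, as in the sketch below: compare simple summary statistics of an incoming slide with the range observed in the development data and withhold a prediction when they fall outside it. The features, thresholds, and toy data are assumptions, not the approach used by any approved tool.

```python
# Crude illustration of a fail-safe / out-of-distribution check: withhold a
# prediction when an incoming slide's summary statistics fall far outside the
# range observed in the development data. Features and thresholds are assumptions.
import numpy as np

rng = np.random.default_rng(2)

# Toy summary statistics (mean intensity, tissue fraction) for development slides.
dev_stats = rng.normal(loc=[0.6, 0.4], scale=[0.05, 0.05], size=(1000, 2))
mu, sigma = dev_stats.mean(axis=0), dev_stats.std(axis=0)

def predict_or_defer(slide_stats, z_threshold=4.0):
    """Return a prediction only if the slide looks in-distribution; otherwise defer."""
    z = np.abs((slide_stats - mu) / sigma)
    if np.any(z > z_threshold):
        return "DEFER: out of distribution - refer to pathologist"
    return "prediction: benign"   # placeholder for the model's actual output

print(predict_or_defer(np.array([0.61, 0.39])))   # typical slide -> prediction made
print(predict_or_defer(np.array([0.95, 0.05])))   # e.g. mostly blank/folded slide -> defer
```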

Automation bias and human interaction with AI tools
In most instances, when AI tools are introduced into healthcare they will be interacting with healthcare professionals in some capacity, rather than functioning in isolation, and this interaction is a potential cause of errors.
Automation bias, as defined by DeCamp and Lindvall, is the concept of clinicians "treating AI-based predictions as infallible or following them unquestioningly". 31 Reliance, and potentially overreliance, on automated systems is known to begin very soon after exposure to a new technology. 8,9,31,53 In a study of autonomous vehicles, 49 experienced drivers undertook a 30-min commute-style journey for 5 consecutive days, having been informed of the limitations of partially autonomous vehicles. The journey involved manual and automated driving (during which drivers might have to resume manual control). The drivers quickly developed high levels of trust in the automated vehicle, such that by the end of the week drivers spent 80% of their time on smartphones, laptops, or reading (despite an unexpected, emergency handover being required on day 4). 55 It is known that clinicians are more likely to accept machine results with little or no questioning when the machine has previously been reliable and when they are busy or tired.
Automation bias can also arise due to fear of the legal repercussions of overriding an algorithm's output. 31 Studies across multiple disciplines have shown that automation bias can cause a decrease in clinician accuracy, including in ECG interpretation and skin lesion diagnosis, and that clinicians at all levels, including experts, are susceptible. 56

In prostate algorithm studies, Da Silva et al. found that when pathologists re-reviewed prostate cores knowing that the tool had called them suspicious, in approximately half of the images at least one of the two reviewers changed their previously benign diagnosis to atypical. 36 Raciti et al. also reported that in a small subset of cases, pathologists changed from an initially correct diagnosis to an incorrect diagnosis after the algorithm displayed incorrect results. 39

In order to address automation bias, pathologists need to understand the level of fallibility of the tool they are using and be aware of high-risk areas. From a workload perspective, pathologists need adequate time to supervise algorithm results, to ensure that productivity pressure does not heighten the risk of automation bias. Legal clarity would also be beneficial regarding where responsibility lies when agreeing or disagreeing with algorithm diagnoses.
Cognitive biases can also shape how healthcare professionals and AI tools interact, and are a potential source of errors. These include the anchoring effect, where clinicians may be overreliant on the first diagnosis obtained (such as from an AI tool) and so fail to make the correct diagnosis, and confirmatory bias, whereby clinicians, once given an AI tool's diagnosis, may search only for evidence that supports it, rather than approaching the case with an unbiased view. 56

Insensitivity to impact
One final consideration in the discussion of errors in AI tools is that the deep neural networks (DNNs) that form AI algorithms are trained to minimize errors under the assumption that all types of error are equal. AI tools are thus considered to be insensitive to impact. 9,57,58 However, across numerous areas this is often not the case, and it is a gross oversimplification in pathology, where different diagnostic errors can have hugely different outcomes for patients in terms of treatment decisions and prognosis. 58 When pathologists are aware of potentially high-risk scenarios, they can adjust their practice accordingly and issue a cautious report (for example, to avoid underdiagnosing malignancy in a sample too small for a confident diagnosis), but we must be aware that AI tools have no such safety behaviour.
Currently, there can be some basic modification of a tool's error profile based on the desired results, by altering the balance of sensitivity and specificity (and so the false-positive and false-negative errors made); for example, to achieve a high sensitivity and so minimize the risk of false negatives in a screening test. A more robust, future method to address the differential risks of different types of error has recently been suggested by Santana et al., 57 who propose a method for DNN classifiers that considers risk during the training and verification stages, allowing factors such as the risk of each specific misclassification and the likelihood of it occurring to be built into the model during development. This attempts to mitigate high-risk misclassifications and make algorithms sensitive to impact to some extent. 57
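The first, simpler idea (shifting the operating point towards sensitivity, and weighting the costlier error more heavily during training) can be sketched as below. This is a generic illustration with toy data, not the Santana et al. method; the class weights and thresholds are arbitrary assumptions.

```python
# Sketch: shifting the operating point of a classifier to prioritise sensitivity
# (fewer false negatives) in a screening setting, and weighting the rarer,
# higher-impact class more heavily during training. Values are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (900, 4)), rng.normal(1.0, 1, (100, 4))])
y = np.array([0] * 900 + [1] * 100)             # 1 = malignant (must not be missed)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3, stratify=y)

# class_weight makes a missed malignancy cost more during training than a false
# alarm: one simple way of making training less "insensitive to impact".
model = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.2):                    # lowering the threshold raises sensitivity
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(f"threshold={threshold}: sensitivity={tp / (tp + fn):.2f}, "
          f"specificity={tn / (tn + fp):.2f}")
```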

Future concerns
As both the types of AI tools and their roles evolve over time, there will be new errors to consider. Generative AI algorithms (those that create content, such as text and images) have been shown to have some ability to make medical diagnoses in complex cases. 59 However, this type of AI can suffer from hallucinations, meaning that these tools can fabricate information and present it as fact (occurring because they write according to what makes sense, not according to the truth), 60 something which would be of great concern in medical diagnostics.

Role of pathologists in error minimization
Many pathologists may feel that this is an abstract issue, perhaps because the development of AI tools is predominantly done by computer scientists and so feels distant from daily practice. However, there are several vital roles that uniquely apply to pathologists in minimizing the errors made by AI tools. Some of these are applicable now and some will become applicable in the near future. These are summarized in Box 1.
The safe introduction of AI tools into practice firstly requires building a robust safety case, which is predominantly the role of developers. It then requires the building of a high-quality evidence base, incorporating the AI tool as part of the entire diagnostic process, ideally including randomized controlled trials where possible. 46 This is an area where the involvement of pathologists is vital to trial development and success.
Even once there has been rigorous product testing and regulatory approval has been gained, there is a need for local validation through quality control mechanisms and audit, which should be done before local implementation of a tool and repeated subsequently. 15,27,44,61 This is another key area in which pathologists can take a proactive approach, helping to investigate local, site-specific errors.
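In its simplest form, such a local validation could compare the tool's output with the local pathologists' diagnoses on a sample of recent cases, summarise agreement, and list discordant cases for review. The sketch below is a hypothetical illustration; the case identifiers, column names, and diagnoses are invented.

```python
# Minimal sketch of a local validation audit: compare AI output with local
# pathologists' diagnoses on recent cases and list discordances for review.
# Column names and data are assumptions for illustration.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix

cases = pd.DataFrame({
    "case_id":        ["H-001", "H-002", "H-003", "H-004", "H-005"],
    "ai_diagnosis":   ["benign", "malignant", "benign", "malignant", "benign"],
    "path_diagnosis": ["benign", "malignant", "malignant", "malignant", "benign"],
})

print(f"Agreement (Cohen's kappa): "
      f"{cohen_kappa_score(cases.ai_diagnosis, cases.path_diagnosis):.2f}")
print(confusion_matrix(cases.path_diagnosis, cases.ai_diagnosis,
                       labels=["benign", "malignant"]))

# Discordant cases are flagged for review and root-cause analysis.
discordant = cases[cases.ai_diagnosis != cases.path_diagnosis]
print("Cases for review:", discordant.case_id.tolist())
```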
There are many proposed audit methods for healthcare AI, such as that from Liu et al., 11 who suggest a five-step audit method consisting of Scoping, Mapping, Artefact collection, Testing, and Reflection. They helpfully consider the roles of both developers and users in each step, giving users, such as pathologists, a clear responsibility in this process. 11 This method is not pathology-focused, but it provides a good starting point. Development of a pathology-specific audit framework by pathologists and the relevant professional bodies would be invaluable.

Explainability tools
AI algorithms are often described as "black box entities". They use intricate neural networks to reach decisions, and these inner workings are difficult, or impossible, for humans to understand. In particular, when a tool is trained in an unsupervised manner and the outcome is not clearly linked to features of a known biological parameter, how the tool uses the input data to formulate a decision can be unclear.
There is extensive discussion in the literature about how significant the black box nature of algorithms is, which is beyond the scope of this work; however, if the occurrence of errors, and/or the reasons for them occurring, cannot be understood, the tool is more difficult to supervise and improvement is hindered. Any methods that increase tool explainability will be key to human understanding of errors, and so are likely to be beneficial. 27,52 Additionally, making AI tools more explainable will help to detect and minimize the risk of an algorithm using spurious features (for example, markings on histology slides) in its prediction of the outcome.
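One simple, model-agnostic way of probing what a tool relies on is occlusion: mask regions of an image and record how much the prediction changes; large changes over non-tissue regions, such as pen marks, would suggest reliance on spurious features. The rough sketch below uses a dummy image and a placeholder prediction function; it is a generic technique, not the explainability method of any particular tool.

```python
# Rough occlusion-based saliency sketch: mask patches of an image and record how
# much the predicted probability changes. Large changes over non-tissue regions
# (e.g. pen marks) would suggest reliance on spurious features.
# The prediction function below is a toy placeholder, not a real model API.
import numpy as np

def occlusion_map(image, predict_fn, patch=32):
    """Return a grid of prediction changes when each patch is blanked out."""
    h, w = image.shape[:2]
    baseline = predict_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 255      # blank the patch
            heat[i // patch, j // patch] = baseline - predict_fn(occluded)
    return heat

# Toy stand-ins: a random image and a dummy prediction function.
rng = np.random.default_rng(4)
image = rng.integers(0, 255, (128, 128, 3)).astype(np.uint8)
predict_fn = lambda img: float(img[:64, :64].mean()) / 255.0   # placeholder "model"

heat = occlusion_map(image, predict_fn)
print("Occlusion sensitivity grid shape:", heat.shape)
```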
An example in pathology is the use of heatmaps to show which regions of a slide contributed to the algorithm's prediction. 63,64

Box 1. Specific roles for pathologists in reducing the risk and consequences of errors made by AI diagnostic tools.
• Critically appraise the available safety data on any tool they use

Confidence estimates
Tools that give a confidence score expressing the certainty of the AI tool in a specific output allow pathologists interacting with the tool to understand how confident it is in its prediction. 46,62 This may also help to counteract automation bias, as the pathologist can take the most care when reviewing low-confidence cases. 8 However, care must be taken, because algorithms with an unsafe failure mode may remain highly confident in their predictions despite making errors. Clinicopathological acumen must therefore remain at the forefront of assessment.
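A minimal sketch of the idea follows: report a per-case confidence score alongside each prediction and flag low-confidence cases for closer review. The threshold, data, and use of a predicted-class probability as the "confidence" are assumptions for illustration, and, as noted above, such scores can remain misleadingly high under distributional shift.

```python
# Sketch: reporting a per-case confidence score alongside the prediction and
# flagging low-confidence cases for closer pathologist review. The threshold
# and data are illustrative assumptions, not values from any approved tool.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X_train = rng.normal(size=(400, 6))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new = rng.normal(size=(5, 6))
proba = model.predict_proba(X_new)
confidence = proba.max(axis=1)          # confidence in the predicted class

REVIEW_THRESHOLD = 0.8
for i, (pred, conf) in enumerate(zip(proba.argmax(axis=1), confidence)):
    flag = " - LOW CONFIDENCE, prioritise careful review" if conf < REVIEW_THRESHOLD else ""
    print(f"case {i}: predicted class {pred}, confidence {conf:.2f}{flag}")
```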

Conclusion
There remains uncertainty regarding the use of AI tools in healthcare from a legal, ethical, and regulatory standpoint. If it is inevitable that an AI tool will make a mistake, then a key question is: who is legally responsible, the pathologist, the tool itself, or the implementing trust? 65,66 There is no precedent for this, and clarification from the medicolegal community is warranted, encouraging prospective legislative action rather than reactive (and likely damaging) legal action. 67

There are also numerous ethical challenges with the use of AI. The principles of beneficence, nonmaleficence, and justice state that we must not only maximize benefit while minimizing harm, but also ensure fairness in the spread of these benefits and risks between populations, meaning data biases must be overcome. 6,32,68 Finally, there is uncertainty around the accreditation process for AI tools, and concern that it is not held to the same rigorous standard as other areas of medicine. 25 As this field continues to grow, regulatory bodies must demonstrate robust standards that allow implementation of safe tools built on a high-quality evidence base, with transparency regarding any potential risks.

The large amount of funding recently given to the introduction of AI in the National Health Service will undoubtedly drive implementation, 69 and while this is an exciting time, we must keep patient safety at the forefront. This can be countercultural in the realm of AI development, where the main drivers are often profit and speed. For example, General Motors produces self-driving cars, and in 2017 one of its Cruise cars collided with a motorcyclist in San Francisco. General Motors settled the lawsuit, but the Cruise chief executive officer wrote that "While it seems crazy to test in an absurdly complex place like San Francisco, it's absolutely necessary," stating that "We believe it's the fastest way to achieve the level of performance and reliability needed to deploy self-driving cars". 70 This attitude of putting a desire for rapid implementation before safety should not be seen in the healthcare realm. We must match work on the technical innovation of AI tools with an understanding of their safety profiles and their impact on wider patient safety.
This work aims to inform and equip pathologists, hoping to empower the profession to gain the benefits of novel AI tools while ensuring patient safety remains at the forefront of diagnostics. Pathologists have several vital roles in error minimisation: their pathological knowledge is essential to safely build, test, and audit AI tools, but more importantly, as end-users they have the unique ability, and responsibility, to minimize errors by determining how AI tools are used in clinical practice. 2,11