Gross failure rates and failure modes for a commercial AI‐based auto‐segmentation algorithm in head and neck cancer patients

Abstract

Purpose: Artificial intelligence (AI) based commercial software can be used to automatically delineate organs at risk (OARs), with potential for efficiency savings in the radiotherapy treatment planning pathway and a reduction of inter- and intra-observer variability. There has been little research investigating gross failure rates and failure modes of such systems.

Method: 50 head and neck (H&N) patient data sets with "gold standard" contours were compared to AI-generated contours to produce expected mean and standard deviation values of the Dice Similarity Coefficient (DSC) for four common H&N OARs (brainstem, mandible, left and right parotid). An AI-based commercial system was then applied to 500 H&N patients. AI-generated contours were compared to manual contours, outlined by an expert human, and a gross failure was defined as a DSC more than three standard deviations below the expected mean. Failures were inspected to assess the reason for failure of the AI-based system, with failures relating to suboptimal manual contouring censored. True failures were classified into four sub-types (setup position, anatomy, image artefacts and unknown).

Results: There were 24 true failures of the AI-based commercial software, a gross failure rate of 1.2%. Fifteen failures were due to patient anatomy, four were due to dental image artefacts, three were due to patient position and two were unknown. True failure rates by OAR were 0.4% (brainstem), 2.2% (mandible), 1.4% (left parotid) and 0.8% (right parotid).

Conclusion: True failures of the AI-based system were predominantly associated with a non-standard element within the CT scan. It is likely that these non-standard elements were the reason for the gross failures, which suggests that the patient data sets used to train the AI model did not contain sufficient heterogeneity. Regardless of the reasons for failure, the true failure rate for the AI-based system in the H&N region for the OARs investigated was low (∼1%).

Early solutions to auto-segmentation used atlas-based methods,10 but the use of artificial intelligence (AI) based software for auto-segmentation of OARs has become increasingly common in recent years. A review of auto-segmentation literature from 2008 to 2020 demonstrated that a shift from atlas-based methods to AI-based methods began around 2016.11 More specifically, deep learning (DL), a subset of AI, forms the basis for these new auto-segmentation techniques,12 and it has been suggested that the use of this new technology means we have now entered the fourth generation of auto-segmentation algorithm development.2 A 2022 review of auto-segmentation techniques used in radiotherapy treatment planning concluded that DL methods have the potential to transform the radiation oncology workflow by increasing efficiency and removing inter- and intra-observer variability.13 Training sets for deep learning-based auto-segmentation models can be significant compared to earlier atlas-based methods, although the number required depends on the representativeness of the training data and can be reduced via the application of augmentation techniques such as geometrical transformations of original images.11 Systems that utilise deep learning are often referred to as "black box," because it is not possible for users to understand their internal function and therefore not possible to predict their behavior. There is therefore a need for robust studies to evaluate performance before such systems are used clinically.14 The concept of "Explainable Machine Learning" has previously been described15 and suggests that it is often possible to use interpretable models in place of black box models. To date, this approach has not been utilised for AI auto-segmentation in radiation oncology. Discussions of the use of black box AI in medicine more generally suggest that interpretability is a requirement to gain trust and acceptance of AI in medicine from physicians.16

The importance of auto-segmentation system quality assurance has previously been stressed due to the potentially serious consequences of segmentation errors, with end-users advised to employ both case-specific and routine model quality assurance on such systems.2,17 Currently this is mainly achieved by manual inspection of all auto-segmented contours generated, due to a lack of knowledge of failure rates and likely failure modes of auto-segmentation software.18 There has been little research investigating gross failure rates and failure modes of such systems. The aim of this study was therefore to identify the failure rate and failure modes of a commercial AI-based auto-segmentation system generating head and neck (H&N) OARs.

METHODS
In order to be able to define a "gross failure" it is important to establish expected behavior. For auto-segmentation, expected behavior can be quantified using similarity metrics. In this study, the Dice Similarity Coefficient (DSC)19 was used to define the normal range of similarity.
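As an illustration of the metric, the DSC for a pair of binary segmentation masks can be computed as follows. This is a minimal sketch using NumPy; the function and array names are illustrative and not part of the commercial software under test:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2|A ∩ B| / (|A| + |B|); 1.0 for perfect overlap, 0.0 for none.
    """
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    total = mask_a.sum() + mask_b.sum()
    if total == 0:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * np.logical_and(mask_a, mask_b).sum() / total
```

In this study the comparison was performed in 3D, so the masks would be full CT-volume voxel arrays rather than single slices.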
Initially, 50 anonymised H&N patient data sets with "gold standard" contours were compared to contours generated by an AI-based system (Mirada DLC Expert™, Mirada Medical Ltd, Oxford, UK). Mirada DLC Expert™ is a commercially available AI-based system for the generation of organ-at-risk (OAR) contours used in radiotherapy treatment planning. As previously described, the software uses multiple convolutional networks to learn features in the input images to generate a semantic segmentation. A coarse-resolution OAR output from an initial 2D multi-class network with 14 layers, along with the CT image data, is fed into a separately trained 10-layer OAR-specific network to predict full-resolution contours.20,21 A standard DLC Expert model, H&N CT NL004 GN, was used to generate the AI-based contours with no local customization. The contours are based on published international contouring guidelines.22 All manual contouring data originated from patients previously enrolled in the PATHOS clinical trial.23 This patient cohort was selected because the associated trial protocol included clear anatomical guidelines for OAR delineation and, in addition, trial entry involved pre-trial OAR outlining quality assurance, which all Oncologists were required to undertake. Guidelines and teaching have previously been shown to significantly reduce interobserver contour variability.24 A sub-sample of patient data was retrospectively reviewed during the study to provide further assurance around the quality of the contours used. For the purposes of this research, the contours were deemed to be "gold standard" when compared to the automatically generated contours. Mean and standard deviation values of the similarity metric, DSC, were established for four commonly used OARs in the H&N region (brainstem, mandible, left and right parotid). These data were used to define the lower limit of DSC expected for each OAR.
The same commercial AI-based system was then used to generate four commonly used OARs (brainstem, mandible, left and right parotid) on a further 500 anonymised patient CT data sets. A data set of this size was determined to be necessary due to the absence of any existing evidence on failure rates and the need to identify a sufficiently accurate failure rate. The 500 data sets also contained contours that had been previously generated manually by a human expert. The AI-based contours were compared to the manual contours using the Mirada Contour Insights™ tool to produce a 3D DSC for each patient.
Mirada DLC Expert accepts all patients without restriction on age, and can therefore be used for both adult and pediatric patients. Of the 500-patient test sample used, 498 were adults, with 1 pediatric patient, aged 2, and 1 young adult, aged 23. The test sample contained a 72:28 male-to-female ratio and the median patient age was 63.
To identify gross failures of the AI-generated contours, a three-sigma limit was used to determine the failure rate, meaning that 99.7% of results can be assumed to lie within this limit.25 All failures for each OAR were manually inspected by an expert observer, and the reasons for failure, or failure modes, were categorized as shown in Table 2. Failures identified as due to suboptimal manual contouring were censored. Figures 1-4 show DSC values for the comparison between AI auto-segmented and manually delineated OARs for the 500-patient cohort. The failure level is set at three standard deviations below the expected mean DSC value. The overall mean failure rate for the four OARs investigated was found to be 1.2%.
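The gross-failure test described above can be sketched as follows. The per-OAR mean and standard deviation values shown are placeholders for illustration only, not the values measured in this study:

```python
# Placeholder DSC statistics (mean, std) per OAR from a reference cohort.
# Illustrative numbers only -- not the study's measured values.
REFERENCE_STATS = {
    "brainstem": (0.85, 0.04),
    "mandible": (0.92, 0.02),
}

def failure_threshold(mean_dsc, std_dsc, k=3.0):
    """Gross-failure limit: k standard deviations below the expected mean DSC."""
    return mean_dsc - k * std_dsc

def flag_gross_failures(oar, dsc_per_patient):
    """Return the DSC values falling below the three-sigma limit for this OAR."""
    mean_dsc, std_dsc = REFERENCE_STATS[oar]
    limit = failure_threshold(mean_dsc, std_dsc)
    return [d for d in dsc_per_patient if d < limit]
```

Flagged cases would then be manually inspected, and censored if the manual contour rather than the AI contour was found to be at fault.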

RESULTS
Resulting failure rates by OAR are shown in Table 4. For the brainstem, there were 2 true failures and a true failure rate of 0.4%. The mandible structure had 11 true failures after censoring of suboptimal clinical contours. Of these, 8 were due to unusual patient anatomy and 3 appeared to be caused by dental artefacts.
For the left parotid there were 7 true failures. Reasons for failure were determined to be unusual patient anatomy for 5 patients; for the 2 remaining patients the failure reason could not be identified. For the right parotid there were 4 true failures. One failure was determined to be caused by a non-standard patient setup position, 2 were due to unusual patient anatomy and 1 was caused by a dental artefact in the CT scan.
An example of a setup failure for the brainstem OAR is shown in Figure 5. It can be observed that this patient had an obvious "roll" in their setup position. When measured, the axial roll was found to be approximately 7°.

An example of anatomical failure for the mandible OAR is shown in Figure 6. It can be observed that the auto-segmented mandible contour includes a surgical metal plate.

An example of failure for the right parotid OAR is shown in Figure 7. It can be observed that the inferior extent of the auto-segmented parotid contour stops at the level where dental CT artefacts are present.

DISCUSSION
The aim of this study was to assess gross failure rates and identify any common failure modes of a commercial AI-based auto-segmentation system. The four OARs (brainstem, mandible, left and right parotid) were chosen as they were routinely manually outlined for all H&N patients at our center and could therefore be compared to AI-based contours for a large retrospective patient cohort without the need for additional curation of the data. The spinal cord was also routinely contoured and was initially considered, but was subsequently excluded from the study because the inferior border of the manually outlined spinal cord was not anatomically defined. This resulted in the length of spinal cord manually contoured varying significantly from patient to patient.
The introduction of AI-based auto-segmentation software often leads to more OARs being routinely outlined compared to traditional manual contouring. Failure rates for the additional OARs should be established as part of post-implementation surveillance. This could be achieved via tracking of the number of major manual corrections required to the auto-generated AI-based contours for the first 500 instances of any new OAR being introduced. Given the low failure rates expected, a large sample size would be needed for additional OARs, in line with this study.
The 3D DSC metric was used in the study to identify gross failures of the AI-based OAR contours. Spatial-based metrics have been suggested as complementary to volume-based metrics when comparing auto-generated contours with gold-standard manual contours.26 The 2D 95% Hausdorff distance metric27 was also considered within the study, but was found not to be effective at reliably identifying gross failures of the AI-based contouring system.
The 2D Hausdorff distance is calculated from 2 contours of the same OAR. For CT slices with only one contour for the OAR, the 2D Hausdorff distance cannot be calculated, and therefore the 95% 2D Hausdorff distance for the OAR will not fully represent the similarity of the 2 contours under consideration. Gross failures of the AI-based system often involved generation of a substantially smaller OAR in the superior-inferior direction compared to the gold standard contours. This can clearly be seen in Figures 5 and 7. Distance-based metrics such as the 2D Hausdorff distance give incomplete values for these types of gross failures. For example, for the patient shown in Figure 5, the 2D 95% Hausdorff distance correctly identifies a gross error, even though it is only calculating the Hausdorff distances on the CT slices containing both manual and AI-based contours. This is because on the CT slices containing both gold standard and AI-based contours, there is a significant distance between the two contours, resulting in a 2D 95% Hausdorff distance of 13.7 mm, larger than the upper limit threshold of 11.6 mm for this OAR. For the patient shown in Figure 7, the 2D 95% Hausdorff distance again only calculates the distances on the CT slices containing both the manual and AI-based contours, but in this case does not identify a gross error. In this case, the manual and AI-based contours are similar in size, shape and position, resulting in a 2D 95% Hausdorff distance of 7.6 mm against an upper limit threshold of 22.8 mm for this OAR.

F I G U R E 5
Failure due to setup. Axial, sagittal, and coronal CT images illustrating an example of a gross failure for auto-segmentation of the brainstem. The AI-based auto-segmented contour is blue and the manual contour is orange.
By contrast, for the patient shown in Figure 6, where a surgical plate was incorrectly included in the AI-based contour, the superior-inferior length of the contour was the same as the gold standard manual contour, resulting in an accurate evaluation of the 2D 95% Hausdorff distance that exceeded the threshold for the mandible and correctly identified a gross error in this case. The three examples demonstrate the limitations and potential drawbacks of using a 2D-based distance metric when considering gross differences between multiple contours.
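For reference, a minimal sketch of the slice-wise 95% Hausdorff distance used in the comparison above, assuming each contour on a slice is given as an array of (x, y) points (the function and variable names are illustrative):

```python
import numpy as np

def hd95_2d(contour_a, contour_b):
    """95th-percentile symmetric Hausdorff distance between two 2D point sets.

    contour_a, contour_b: arrays of shape (N, 2) and (M, 2) holding (x, y)
    points sampled along each contour on a single CT slice.
    """
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    # Pairwise Euclidean distances between all points of the two contours
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distances: each point to its nearest point on the other contour
    a_to_b = d.min(axis=1)
    b_to_a = d.min(axis=0)
    # 95th percentile of the pooled directed distances
    return float(np.percentile(np.concatenate([a_to_b, b_to_a]), 95))
```

As discussed above, a slice must contain both contours for this value to be defined, which is why a contour truncated in the superior-inferior direction can escape detection by this metric.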
The study has shown that true gross failures of such a system are rare, being less than 1% for some OARs, for example the brainstem, with an overall mean failure rate of 1.2% for all OARs. One of the reasons for failure, the presence of dental artefacts, is both easy to identify on CT scans as a possible issue during review, and less likely to apply to modern CT scanners, which utilise metal artefact reduction algorithms.28 In terms of the other failure modes identified, patient setup position and non-standard anatomy, it is likely that such failures are caused by the absence of sufficient numbers of these "unusual" case types in the model training data set of the AI-based auto-segmentation software. This sort of dataset bias in AI is well-known29-34 and should be expected, given the relative frequency of such cases in the clinic. In addition, due to the observed wide anatomical variation in patients with non-standard anatomy, it may not be possible to include sufficient numbers of this patient type in a model training data set for the training to be sufficiently effective, due to the large patient data set numbers typically required to train DL models.35 These findings highlight the need for manufacturers to "open up" the black box nature of the DL software being produced. More specifically, datasheets containing comprehensive information about the data set used to produce a DL model should be provided,36 and such information could be used by healthcare professionals to guide the limits of clinical use and to devise appropriately targeted quality assurance methods.
Regardless of the reasons for failure, due to the extremely low true failure rates identified, the ideal approach for the QA of auto-segmented OARs would be to use stratified methods rather than 100% manual human expert inspection. This is partly because attribute inspection errors by humans will always exist,37 and with errors occurring at such low rates it is a distinct possibility that they may be missed by a human who does not often encounter such errors, or that multiple checks would be required to provide sufficient levels of safety,38 which would add an increased quality assurance burden for a process with very small failure rates.
It is important to note that this research looked at gross failures, rather than more subtle quality issues, which may still be clinically significant.40,41 It should, however, be noted that human observer variation can also be of a clinically significant magnitude,42 yet such differences are often not considered significant when a human is involved. For example, a previous study43 found mandible interobserver variability to have a median DSC value of 0.9, which is of a similar order of magnitude to the results obtained in this study and suggests that the quality of some AI-based auto-segmented OARs may already have reached human expert levels. This is supported by a separate study which concluded that the accuracy of AI auto-segmented contours is now at a comparable level to that of expert inter-observer variability.14 Future research to assess the clinical significance of minor contouring failures would therefore be beneficial to determine the true importance of the often-perceived requirement for further human manual inspection of contours produced by modern commercial AI auto-segmentation systems.
A further point of interest is that suboptimal clinical contours made up 45.5% of the initial total failures in this study using real-world data. This failure rate is of a similar order of magnitude to the true failure rate of the AI-based system, and raises questions around the clinical significance of these failures, which could be investigated in future research.

CONCLUSION
To conclude, this study has demonstrated that gross failure rates for the H&N OARs tested, using a modern commercial AI-based auto-segmentation system, are extremely low.
It is also recommended that manufacturers provide greater information relating to the datasets they use to produce AI models to assist users with identifying potential dataset bias, and that manufacturers attempt to further reduce this bias in future models.
Recent research has shown that differences between gold standard and AI-based auto-segmented contours are at a comparable level to inter- and intra-observer variability differences. It is therefore suggested that, as auto-segmented contour quality improves with future iterations of this technology, it may be possible to remove the need for 100% manual human expert inspection in the near future. This approach would require sufficiently accurate quality assurance methods to be included as part of the workflow.

AU T H O R C O N T R I B U T I O N S
Simon Temple contributed to study design, data collection, analysis and interpretation, and drafting of the manuscript. Carl Rowbottom contributed to study design, data collection, analysis and interpretation, and drafting of the manuscript.

AC K N OW L E D G M E N T S
We would like to thank Mirada Medical Ltd for providing Mirada DLC Expert™ and the Contour Insights™ tool, which were utilised in the study to produce structure comparison metrics.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
The authors declare no conflicts of interest.

F I G U R E 1
Brainstem OAR DSC for 500 patients. 3D DSC for the comparison between AI-based auto-segmented and manually delineated brainstem OAR.

F I G U R E 2
Mandible OAR DSC for 500 patients. 3D DSC for the comparison between AI-based auto-segmented and manually delineated mandible OAR.

F I G U R E 3
Left parotid OAR DSC for 500 patients. 3D DSC for the comparison between AI-based auto-segmented and manually delineated left parotid OAR.

F I G U R E 4
Right parotid OAR DSC for 500 patients. 3D DSC for the comparison between AI-based auto-segmented and manually delineated right parotid OAR.

TA B L E 4
AI-based auto-segmentation failure rates from 500 patient cohort.

F I G U R E 6
Failure due to anatomy.Axial CT image illustrating an example of a gross failure for auto-segmentation of the mandible.The AI-based auto-segmented contour is blue and the manual contour is orange.The AI-based auto-segmented contour has included the surgical plate as well as the mandible.

F I G U R E 7
Failure due to artefact.Coronal and axial CT images illustrating an example of a gross failure for auto-segmentation of the right parotid.The AI-based auto-segmented contour is blue and the manual contour is orange.
TA B L E 1
OARs available in the DLC Expert model (Model: H&N CT NL004 GN).

TA B L E 3
DSC values from the 50-patient cohort study: mean and standard deviations of DSC for the four OARs (brainstem, mandible, left and right parotid) from the initial 50-patient cohort.