Clinical assessment of a novel machine‐learning automated contouring tool for radiotherapy planning

ABSTRACT
Contouring has become an increasingly important aspect of radiotherapy due to inverse planning. Several studies have suggested that the clinical implementation of automated contouring tools can reduce inter‐observer variation while increasing contouring efficiency, thereby improving the quality of radiotherapy treatment and reducing the time between simulation and treatment. In this study, a novel, commercial automated contouring tool based on machine learning, the AI‐Rad Companion Organs RT™ (AI‐Rad) software (Version VA31) (Siemens Healthineers, Munich, Germany), was assessed against both manually delineated contours and another commercially available automated contouring software, Varian Smart Segmentation™ (SS) (Version 16.0) (Varian, Palo Alto, CA, United States). The quality of contours generated by AI‐Rad in the Head and Neck (H&N), Thorax, Breast, Male Pelvis (Pelvis_M), and Female Pelvis (Pelvis_F) anatomical areas was evaluated both quantitatively and qualitatively using several metrics. A timing analysis was subsequently performed to explore potential time savings achieved by AI‐Rad. Results showed that most automated contours generated by AI‐Rad were not only clinically acceptable and required minimal editing, but also superior in quality to contours generated by SS in multiple structures. In addition, timing analysis favored AI‐Rad over manual contouring, indicating the largest time saving (753 s per patient) in the Thorax area. AI‐Rad was concluded to be a promising automated contouring solution that generated clinically acceptable contours and achieved time savings, thereby greatly benefiting the radiotherapy process.


INTRODUCTION
It has been estimated that approximately half of all cancer patients will benefit from radiotherapy treatment, and the burden on radiotherapy infrastructure is anticipated to increase in line with the incidence of cancer. 1 To fully exploit the advantages of inverse planning in radiotherapy, all target volumes and surrounding organs at risk (OARs) must be contoured before treatment planning. This process may be repeated multiple times during a treatment course because of tumor response or changes in patient weight or anatomy. When manual contouring is performed, large inter-observer variations have been reported 2-4 and are considered one of the largest sources of uncertainty in radiotherapy. 5 For some tumor locations, inconsistencies in target volume definition may dominate all other errors in planning and delivery. 6 Although standardized guidelines and anatomy atlases have addressed the issue to a certain extent, automated segmentation using artificial intelligence (AI) has the potential both to minimize inter-observer variation 7 and to significantly reduce contouring time while improving planning efficiency. 8 As an emerging field of computer science, AI attempts to emulate human-like intelligence by using computer software and algorithms to perform specific tasks without direct human input. 9-11 One subcategory of AI, machine learning (ML), uses computer software, modelling, and algorithms to detect patterns and correlations by learning from databases of raw data. 9 While ML has several applications in radiotherapy, 12-14 its use in contouring has drawn substantial interest owing to the potential benefits to efficiency and consistency. 15-18 The ML approach is to learn the structure label of each image voxel directly, incorporating prior knowledge more flexibly in the form of parameterized models.
Successful techniques include the use of statistical and decision-learning classifiers 19,20 and, more recently, deep learning. 21-23 A new automated segmentation tool, the AI-Rad Companion Organs RT™ (AI-Rad) contouring software (Version VA31) (Siemens Healthineers, Munich, Germany), has been introduced in the authors' department. The software is based on the method proposed by Ghesu et al., 24 which reformulated the detection problem as a behavior-learning task for an artificial agent. In other words, an artificial agent is trained not only to distinguish the target anatomical object from the rest of the body, but also to find the object by learning and following an optimal navigation path to it in the imaged volumetric space. The purpose of this study was to evaluate the quality of contours generated by AI-Rad in different anatomical areas and, where possible, to compare it to the quality of contours generated by Smart Segmentation™ (SS) (Version 16.0) (Varian, Palo Alto, CA, USA), a commercial automated segmentation solution that has been verified by various studies 25,26 and implemented clinically at the authors' department. A timing analysis was also performed to explore potential time savings achieved by AI-Rad compared to manual delineation.

Clinical data collection
A total of 28 patients, composed of eight H&N patients, five thorax patients, five breast patients, five male pelvis patients, and five female pelvis patients, were retrospectively selected from patients treated at the authors' department. The planning CT datasets of these patients were subsequently reviewed, during which it was noted that the collected patients varied in terms of body mass index (BMI), geometry (e.g., disease side, arm position and presence of breast implants in breast patients), and setup position (e.g., use of the Standard Wing Board™ vs. the Type S™ Overlay Board in breast patients).
In addition, all CT datasets were free of significant artefacts and acquired in the head-first supine (HFS) position.

Manual contouring
A radiation therapist (RTT) with at least 10 years of experience in radiotherapy contoured the major OARs that were commonly required for radiotherapy planning on the CT datasets of the 28 patients. The RTT performed the contouring task in accordance with clinical practice at the institution. Previous studies have demonstrated comparable contouring accuracy between RTTs and radiation oncologists (ROs). 27,28 Datasets were contoured in accordance with the Radiation Therapy Oncology Group (RTOG) contouring consensus. 29 The time required to contour each OAR was recorded by the RTT.

AI-Rad automated contouring
The AI-Rad software provides a cloud-based solution to automated contouring. To run an automated contouring session, a CT dataset is first uploaded to the cloud software from either the CT scanner or the treatment planning system (TPS). The system then determines the applicable template of structures using a user-defined DICOM tag, which in this study was (0018,0015) (Body Part Examined), before automated contours are generated. If no user-defined template is available, the system contours all structures that it can automatically generate and that are present on the CT dataset. The completed structure set is automatically transferred back to the TPS. During the entire process, the only step that requires user interaction is uploading the CT dataset. In this study, once the structure set generated by AI-Rad was imported into the TPS, it was reviewed and, if necessary, adjusted by the same RTT who performed the manual contouring task. The time of this process was recorded as the time required for AI-Rad to generate automated contours. The time required by the AI-Rad software to process the CT dataset and generate contours was not included in the above timing data, as it depends significantly on system configuration. This study observed that under the trial setup (analogous to the clinical setup), it took less than 30 s for the AI-Rad software to generate the required structures and import them into the TPS, or less than 5 s per structure. Therefore, the impact of not including this process in the timing analysis was small. However, it should be noted that this time depends on system configuration, and whether the processing speed would be compromised for large batches of patients requires further investigation.

TABLE 1 Evaluation criteria used to assess the "Degree of Editing" of each automated contour.

No edits required: No changes made to the structure
Minor edits required: Less than 10% of slices requiring small edits
Moderate edits required: More than 10% of slices requiring small edits, or large edits required to a small number (<10%) of slices
Major edits required: Edits not described in the above categories, up to and including the deletion and recontouring of the structure
Not applicable: Not relevant to the fraction or not assessed
During the review process, the RTT scored the "Degree of Editing," a commonly used subjective evaluation metric, 30,31 of each automated contour according to the criteria listed in Table 1.
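The template-selection step described above (choose a structure template from the Body Part Examined tag, falling back to every structure the software can generate) can be sketched as follows. The template names and structure lists here are illustrative assumptions, not AI-Rad's actual configuration:

```python
# Hypothetical sketch of template selection driven by the DICOM
# "Body Part Examined" tag (0018,0015). Template names and structure
# lists are illustrative only, not AI-Rad's real configuration.

TEMPLATES = {
    "HEADNECK": ["Brainstem", "Larynx", "Mandible", "OpticNrv_L", "OpticNrv_R"],
    "CHEST":    ["Lung_L", "Lung_R", "Heart", "SpinalCanal"],
    "PELVIS":   ["Bladder", "Rectum", "SeminalVesicle"],
}

def select_structures(body_part_examined: str) -> list[str]:
    """Return the structure template matching the tag value, or every
    known structure when no user-defined template matches."""
    template = TEMPLATES.get(body_part_examined.upper())
    if template is None:
        # Fall back to all structures the software can generate.
        return sorted({s for t in TEMPLATES.values() for s in t})
    return template
```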

SS automated contouring
The same structures were automatically contoured by SS, an atlas-based automated contouring solution. Three SS atlases, each composed of 30-40 patients, were previously created from the datasets of retrospective clinical patients and clinically validated at the authors' department, namely a H&N atlas, a breast atlas, and a pelvis atlas. 32 In this study, the eight H&N patients were contoured by the H&N atlas, the five breast and five lung patients by the breast atlas, and the five female and five male pelvis patients by the pelvis atlas. Some OARs generated by AI-Rad did not have a corresponding model in SS; for these structures, no comparison was performed between AI-Rad and SS. Table 2 lists the contours generated manually, by AI-Rad, and by SS.

Comparison of contouring quality
Several quantitative metrics were calculated to assess the quality of automated contours, namely the Dice similarity coefficient (DSC), sensitivity, precision, and Hausdorff distance (HD), with the structures manually contoured by the RTT as the ground truth.
TABLE 2 Contours generated manually, by AI-Rad, and by SS. Columns: No. of patients (n), Manual, AI-Rad, SS.
DSC between respective volumes, A and B, is defined as:

DSC = 2|A ∩ B| / (|A| + |B|)  (1)

DSC approaches 1.0 when two structures overlap exactly. One study has recommended a DSC of 0.7 to be considered a good overlap, 33 whereas others have instead suggested 0.8. 34,35 DSC can give a false impression of high agreement. 35 The metric can over-penalize small structures whilst exaggerating the agreement of large structures. 36 Therefore, three other quantitative metrics, that is, sensitivity, precision, and HD, are introduced.
Sensitivity is defined as:

Sensitivity = |A ∩ B| / |B|  (2)

Sensitivity is the fraction of true positives divided by all actual positive cases in a population. If the automated contour A is taken to be analogous to positive results and the ground-truth contour B to positive cases, then sensitivity is the intersection of the two divided by B. Sensitivity approaches 1.0 when all parts of the "true" contour are included in the automated contour.
Precision is defined as:

Precision = |A ∩ B| / |A|  (3)

Precision refers to the number of true positive results divided by the sum of true positives and false positives, and is therefore the intersection of A and B divided by A. Precision approaches 1.0 when the automated contour only includes parts of the "true" contour.
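As a concrete illustration, the three overlap metrics can be computed from binary masks with NumPy. This is a sketch, not the in-house C# implementation used in the study:

```python
import numpy as np

def dsc(auto, truth):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = np.asarray(auto, bool), np.asarray(truth, bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def sensitivity(auto, truth):
    """|A ∩ B| / |B|: how much of the ground truth B is covered by A."""
    a, b = np.asarray(auto, bool), np.asarray(truth, bool)
    return np.logical_and(a, b).sum() / b.sum()

def precision(auto, truth):
    """|A ∩ B| / |A|: how much of A lies inside the ground truth B."""
    a, b = np.asarray(auto, bool), np.asarray(truth, bool)
    return np.logical_and(a, b).sum() / a.sum()
```

The masks may be of any dimensionality (2D slices or full 3D volumes); the metrics reduce over all voxels.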
HD is defined as the maximum Euclidean distance from each point in the ground-truth contour to the nearest point in the automated contour. 37 It has a minimum value of 0, indicating perfect agreement, and no maximum value. In this study, a sampling value of 20 was used to speed up the calculation: only 1 in every 20 points on the ground-truth contour is considered, but every point on the automated contour is. A sampling rate of 20 is estimated to introduce an average error of 3% in the HD.
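The subsampled directed HD described above can be sketched as follows (again an illustration, not the study's C# implementation): only every `sample`-th ground-truth point is tested, but against every automated point.

```python
import numpy as np

def directed_hausdorff(truth_pts, auto_pts, sample=20):
    """Directed Hausdorff distance: the largest distance from a
    (subsampled) ground-truth point to its nearest automated point.
    Only every `sample`-th ground-truth point is considered, mirroring
    the speed-up described in the text."""
    truth_pts = np.asarray(truth_pts, float)
    auto_pts = np.asarray(auto_pts, float)
    d_max = 0.0
    for p in truth_pts[::sample]:
        # Distance from this ground-truth point to the nearest automated point.
        d_min = np.linalg.norm(auto_pts - p, axis=1).min()
        d_max = max(d_max, d_min)
    return d_max
```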
The above metrics were calculated using an in-house C# application implemented in the TPS.

Timing analysis
The automated contouring time of AI-Rad was calculated as the time required by the RTT to review and adjust the specific structure to clinical standards. This time was subsequently compared to the time spent by the RTT to manually contour the same structure.

Statistical analysis
Quantitative data were expressed as mean ± standard deviation (SD). The Jarque-Bera test was used to assess the normality of the DSC and HD samples. For data that were normally distributed, pairwise comparisons were conducted with Student's independent t-test. The significance level was defined at p < 0.05.

RESULTS
Table 3 shows the DSCs of the automated contours delineated by AI-Rad and SS against the manual contours, and Table 4 shows the corresponding HDs. For both tables, the last two columns list the p-values of the pairwise comparison and the superior method where the difference is statistically significant (p < 0.05). Table 5 shows the sensitivity and precision of AI-Rad and SS against the manual contours. Table 6 shows the qualitative rating (Degree of Editing) of each automated contour generated by AI-Rad, scored by the contouring RTT. Table 7 compares the time required to review and modify the automated contours generated by AI-Rad against the time required to manually delineate the same set of structures from scratch; again, the last two columns list the p-values of the pairwise comparison and the superior method where the difference is statistically significant. Figure 1 compares the average automated contouring time of AI-Rad per anatomical area against that of manual contouring.
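The normality-then-t-test procedure described above can be sketched with SciPy as follows; the function name and returned dictionary are our own illustrative choices, not the study's actual analysis code:

```python
import numpy as np
from scipy import stats

def compare_methods(scores_a, scores_b, alpha=0.05):
    """Sketch of the pairwise comparison described above: check both
    samples for normality with the Jarque-Bera test, then, if both
    pass, compare their means with an independent two-sample t-test."""
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    # Jarque-Bera: a small p-value rejects the normality hypothesis.
    if min(stats.jarque_bera(a).pvalue, stats.jarque_bera(b).pvalue) < alpha:
        return {"normal": False}
    t_stat, p = stats.ttest_ind(a, b)
    result = {"normal": True, "p": float(p), "significant": p < alpha}
    if result["significant"]:
        result["superior"] = "A" if a.mean() > b.mean() else "B"
    return result
```

Note that the Jarque-Bera test is an asymptotic test and is only approximate for the small per-structure sample sizes used here.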

Quantitative and qualitative analysis of the quality of AI-Rad automated contours
In this study, the quality of AI-Rad contours for different anatomical areas was evaluated both quantitatively, using DSC, sensitivity, precision, and HD, and qualitatively, using a subjective score ("Degree of Editing") rated by the contouring RTT. Results are shown in Tables 3-6. DSC is widely utilized to assess the quality of contours. Although different studies have used a range of DSC thresholds, 33-35 a value of 0.7 was selected in this case due to the inclusion of a large number of small-volume structures, which can be over-penalized by a high DSC threshold. 36 In this study, of the 28 structures contoured by AI-Rad, only five, namely Larynx (0.405), OpticNrv_L (0.531), OpticNrv_R (0.468), SeminalVesicle (0.652), and SpinalCanal (0.691), had a DSC lower than 0.7. From the results of the qualitative analysis, structures that required moderate or major editing in more than 50% of instances included Bladder (60.0%), Larynx (100.0%), Mandible (100.0%), Oral Cavity (75.0%), and Rectum (100.0%).
The only structure that had both a low DSC and a low qualitative rating was Larynx. Similar findings have been reported in the literature, suggesting that the lack of standardization of larynx boundaries and the complexity of this small structure make its automated segmentation on CT difficult. 38 Similarly, a closer review of the results in this study indicated that the low agreement between the larynx contour generated by AI-Rad and that delineated manually was predominantly caused by different definitions of the boundaries of the structure. Within the authors' department, the larynx is defined to start with the epiglottis as the most superior point and to include the thyroid cartilage whilst excluding the hyoid bone and the laryngopharynx, with the posterior and inferior edges defined by the pharyngeal constrictor and the cricoid cartilage. However, the Larynx structure contoured by AI-Rad starts more superiorly and excludes both the thyroid cartilage and the airway, hence the low DSC. An example is shown in Figure 2.

TABLE 3 DSCs of different structures between AI-Rad and manual contours and between SS and manual contours.

FIGURE 2 Comparison of the larynx contour generated by AI-Rad (green) and that delineated manually (red). Disagreement between the two contours is mainly caused by different definitions of the boundaries of the structure.
Among the other structures with a low DSC, the left and right optic nerves and the spinal canal never required any manual editing, and only one case of the seminal vesicle required manual editing as per the qualitative score. The inconsistency between the DSC and the qualitative score was predominantly caused by differences in the length of the contour in the superior-inferior direction, with manual contours being consistently shorter than automated ones: the RTT only contoured these structures on a subset of the slices where they were located, whereas the AI-Rad tool contoured all slices where the structures were present. To further validate this observation, the above structures generated by AI-Rad were cropped to the same length as the corresponding manual contours, and the DSC analysis was re-performed. For these structures, the DSC significantly improved after cropping (e.g., for SpinalCanal, the DSC increased from 0.691 to 0.854; p = 0.012). Similar under-segmentation has been reported by a previous study. 32 The under-delineation of manual contours in the superior-inferior direction was caused by the time constraints that RTTs and ROs frequently encounter in clinical workflows. This can affect the accuracy of the dose-volume histogram (DVH) of a structure and have a significant impact, particularly on small serial structures such as the optic nerves. Therefore, there is a need to utilize qualitative metrics in addition to quantitative metrics such as DSC, 39 which can exaggerate the difference between manual and automated contours if not interpreted carefully. The observation also highlights a benefit of introducing automated contouring solutions: users can amend comprehensively delineated structures, thereby reducing the risk of time-based contouring errors while maintaining efficiency.

TABLE 4 HDs of different structures between AI-Rad and manual contours and between SS and manual contours.
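The cropped re-analysis described above (restricting the automated contour to the superior-inferior extent of the manual contour before recomputing DSC) can be sketched as follows, assuming binary masks with axis 0 as the slice direction:

```python
import numpy as np

def crop_to_manual_extent(auto, manual):
    """Zero out automated-contour slices outside the superior-inferior
    extent of the manual contour (axis 0 = slice direction).
    Illustrative sketch of the cropped re-analysis in the text."""
    auto = np.asarray(auto, bool)
    manual = np.asarray(manual, bool)
    # Indices of slices on which the manual contour exists.
    slices = np.flatnonzero(manual.any(axis=(1, 2)))
    cropped = np.zeros_like(auto)
    if slices.size:
        lo, hi = slices[0], slices[-1]
        cropped[lo:hi + 1] = auto[lo:hi + 1]
    return cropped
```

After cropping, the DSC between `cropped` and `manual` can be recomputed with any standard implementation to isolate the effect of differing contour lengths.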
Contours for the bladder, mandible, oral cavity, and rectum all required significant editing when qualitatively evaluated, despite acceptable DSCs. The commonality between these organs is their large structure volume, further highlighting the insensitivity of the DSC metric to inaccuracies in large structures. Poor contouring accuracy for the oral cavity and rectum could be attributed to low-contrast tissue boundaries, which previous studies have linked to worse AI contouring performance. 32,33,38 Poor results for the bladder and mandible may indicate necessary improvements to the structure models, with alternative AI models having demonstrated greater accuracy when generating contours across patient cohorts than demonstrated in this work. 40,41 For AI-Rad, improvements to the structure models can be requested but are not adjustable by the user. The inability of individual sites to improve specific structure models highlights a limitation of this implementation model.

TABLE 5 Sensitivity and precision of AI-Rad and SS against manual contours.

Comparison between AI-Rad and SS
Of the 21 contours that both AI-Rad and SS could generate, the DSC data of 18 structures passed the normality test, warranting the use of Student's t-test for pairwise comparison. Among them, 10 showed significant differences, all favoring AI-Rad, indicating its superior performance over SS.
The HD data were less often normally distributed, with only 16 structures passing the normality test. Among these, nine showed significant differences, with three favoring SS and six favoring AI-Rad. However, for the three structures that favored SS, the qualitative scores in Table 6 show that only one case of Brainstem required moderate editing, whereas neither Lung_L nor Lung_R required manual editing in any case. This indicates that, despite the statistically inferior HD values of AI-Rad for these structures, the actual clinical impact was small.
Previous studies have indicated that atlas-based automated contouring software performed poorly in structures with a low-contrast tissue boundary that was hardly distinguishable from surrounding tissues, especially in the pelvic area. 32,33,38 This study found that, compared to SS, AI-Rad performed substantially better in contouring such structures including the bladder and the rectum, although manual editing was still required for the automated contours to be clinically accepted.
Comparison of the precision and the sensitivity of the two automated contouring tools suggested that whilst the sensitivity of AI-Rad and SS was similar (0.809 vs. …).

TABLE 6 Qualitative rating of automated contours generated by AI-Rad. Columns: Structure, No edits (n/%), Minor edits (n/%), Moderate edits (n/%), Major edits (n/%).

The clinical implementation of SS requires training the atlas with user-defined datasets, whereas the ML database of AI-Rad is pre-configured by the vendor and therefore requires no user input. Additionally, AI-Rad does not require any user interaction other than uploading the CT dataset, whereas SS requires several steps of user interaction before generating results. Therefore, the clinical implementation of AI-Rad is expected to be easier than that of SS.

Timing analysis
Among the 28 structures, 24 (85.7%) showed differences between the automated and the manual contouring times, with 23 (82.1%) favoring AI-Rad and only one (3.6%, Larynx) favoring manual contouring. Similar observations have been reported in the literature: novel AI contouring tools are more efficient in contouring various organs, allowing the comprehensive delineation of multiple structures that facilitates not only planning but also reporting in radiotherapy. 9,32,33 In addition, Figure 1 shows that for all five anatomical areas, the time spent by AI-Rad was consistently shorter than that of manual segmentation, with the most pronounced time savings observed in the Thorax and Breast areas. The average time saving for each area, calculated from the difference between the mean times of the two methods in contouring all structures for that anatomical area, was 379 s for H&N, 753 s for Thorax, 679 s for Breast, 291 s for Pelvis_M, and 210 s for Pelvis_F patients.
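The per-area time saving described above (the difference between the mean manual time and the mean review/edit time, summed over the structures of an area) amounts to the following calculation; the numbers in the example are illustrative only, not the study's data:

```python
def area_time_saving(manual_times, auto_times):
    """Mean per-patient time saving for one anatomical area: the
    difference between the mean manual contouring time and the mean
    review/edit time, summed over the area's structures. Times are in
    seconds, keyed by structure name."""
    saving = 0.0
    for structure, manual in manual_times.items():
        auto = auto_times[structure]
        saving += sum(manual) / len(manual) - sum(auto) / len(auto)
    return saving

# Illustrative numbers only (not the study's data):
thorax_manual = {"Lung_L": [300, 320], "Heart": [200, 200]}
thorax_auto = {"Lung_L": [60, 80], "Heart": [50, 50]}
```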

Study limitations
One limitation of the study was that the contours of a single RTT were adopted as the ground truth to evaluate the quality of automated contours, which may introduce bias into the results. Therefore, although this study found that the quality of automated contours generated by AI-Rad was better than that of those generated by SS, further investigation utilizing ground-truth datasets defined by the common agreement of multiple observers may be required. In addition, the performance of AI-Rad in different structures, patient cohorts, and imaging modalities such as cone-beam CT (CBCT) should be investigated with a larger sample size.

FIGURE 1 Comparison of automated and manual contouring times per anatomical area.

CONCLUSION
This study evaluated the performance of AI-Rad, an ML-based AI contouring tool, both quantitatively and qualitatively. The results suggested that most of the automated contours generated by AI-Rad were clinically acceptable and required minimal manual editing. In addition, AI-Rad significantly outperformed SS, a commercial atlas-based contouring software, in over half of the compared contours. Timing analysis favored AI-Rad over manual contouring, with the former achieving significant time savings in several structures, especially in the Thorax area. In summary, AI-Rad is a promising automated contouring solution that can generate clinically acceptable contours and achieve time savings. To the authors' knowledge, this is the first contour-comparison study to have explored the performance of AI-Rad and investigated the time savings it may achieve in a clinical environment.

AU T H O R C O N T R I B U T I O N S
Yunfei Hu, Huong Nguyen, Claire Smith, Tom Chen, and Trent Aland conceived of the presented idea. Yunfei Hu, Huong Nguyen, Claire Smith, Tom Chen, and Trent Aland carried out the experiment. Yunfei Hu, Mikel Byrne, and Ben Archibald-Heeren performed related calculations and statistical analyses. James Rijken developed the C# application for data collection. Yunfei Hu wrote the manuscript with support from Mikel Byrne, Ben Archibald-Heeren, and James Rijken. Trent Aland supervised the project.

AC K N OW L E D G M E N T S
None.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
The authors declare that there is no conflict of interest to disclose.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are available from the corresponding author upon reasonable request.