Evaluating automatically generated normal tissue contours for safe use in head and neck and cervical cancer treatment planning

Abstract Purpose Volumetric‐modulated arc therapy (VMAT) is a widely accepted treatment method for head and neck (HN) and cervical cancers; however, contouring and plan optimization for VMAT plans are time‐consuming. Our group has created an automated treatment planning tool, the Radiation Planning Assistant (RPA), that uses deep learning models to generate organ at risk (OAR) contours and planning structures and that automates plan optimization. This study quantitatively evaluates the quality of contours generated by the RPA tool. Methods For patients with HN (54) and cervical (39) cancers, we retrospectively generated autoplans using the RPA. Autoplans were generated using deep‐learning and RapidPlan models developed in‐house. The autoplans were then applied to the original, physician‐drawn contours, which were used as a ground truth (GT) to compare with the autocontours (RPA). Using a “two one‐sided tests” (TOST) procedure, we evaluated whether the autocontour normal tissue dose was equivalent to that of the ground truth within a margin, δ, that we determined based on clinical judgement. We also calculated the number of plans that met established clinically accepted dosimetric criteria. Results For HN plans, 91.8% and 91.7% of structures met dosimetric criteria for automatic and manual contours, respectively; for cervical plans, 95.6% and 95.7% of structures met dosimetric criteria for automatic and manual contours, respectively. Autocontours were equivalent to the ground truth for 71% and 75% of common DVH metrics for the HN and cervix, respectively. Conclusions This study shows that dosimetrically equivalent normal tissue contours can be created for HN and cervical cancers using deep learning techniques. In general, differences between the contours did not affect the passing or failing of clinical dose tolerances.


INTRODUCTION
Radiotherapy is a crucial modality for treatment of head and neck (HN) and cervical cancers.[2][3][4] These variations can lead to poor clinical outcomes for patients treated with suboptimal plans.[3,5][8] The Radiation Planning Assistant (RPA) is a web-based automated treatment planning tool being developed to generate consistent, high-quality contours and treatment plans.[10][11][12][13][14][15][16][17][18][19][20][21] The RPA uses deep learning and knowledge-based planning to provide high-quality and safe contours and treatment plans, with a goal of improving access to radiotherapy around the world.
In this study, we quantitatively evaluate the use of automatically generated organ at risk (OAR) contours for automatic plan generation. To do this, we used the RPA to generate VMAT plans for a cohort of HN and cervical cancer patients. The RPA plans were created using OAR contours generated by deep-learning models developed in-house. We then dosimetrically compared them to the original contours drawn by the clinic to assess contour quality.

Patient data
For this analysis, the medical records of a cohort of 54 patients with HN cancer and 39 patients with cervical cancer were retrospectively collected from our institution and de-identified. All patients were previously treated using VMAT. The original physician-drawn target contours, computed tomography (CT) scans, and dose prescriptions were used in autoplan generation.
Of the 54 patients with HN cancer, 8 were originally planned using two PTV dose levels. Of 39 cervical cancer plans, 26 included a boost to the gross tumor volume. Tables 1 and 2 show the patient cohort's various subsites, prescription ranges, and fraction ranges.

RPA workflow
Plan generation in the RPA is fully automated. The only user input required is the upload of the CT images and the service request, which contains the dose prescription, determination of margins, etc. Once the CT and service request are submitted by the user, normal tissue and target contours are generated using deep-learning models. After the contours are created, VMAT plans are generated and optimized using RapidPlan models in the Eclipse treatment planning system (TPS) (Varian Medical Systems, Palo Alto, CA). After the autoplan is generated, the user can download the autocontours and plan files from the website interface to import into their own treatment planning system. The full RPA workflow [9] and model performances [7][17][18][19][20] used in this study have been outlined in previous publications. The plans used in this study were generated for use on a Varian 2100 machine. The HN plans consisted of three 360° coplanar treatment arcs with collimator angles of 15
TABLE 1 Distribution of the head and neck cancer cohort by site and total dose range.

Plan generation and data collection
For this study, we wanted to dosimetrically compare the autocontours created by the RPA to the original, physician-drawn (manual) contours. To do this, we needed to apply the same plan across both sets of contours. First, we generated an autoplan using the RPA. We then imported the manual contours into our Eclipse TPS, alongside the autocontours and autoplan. This was done for each patient in our patient cohort. The reported dose to the manual contour was considered our ground truth (GT), and we used it to evaluate the reported dose to the autocontours (RPA). Dosimetric data were collected using a Python script to interface with the Eclipse Scripting API.
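The idea of applying one plan to two contour sets can be illustrated with a minimal sketch. It does not use the Eclipse Scripting API; it assumes the dose grid and both contours are already available as NumPy arrays on the same voxel grid with equal voxel volumes, and the function name `dvh_metrics` and the toy values are hypothetical.

```python
import numpy as np

def dvh_metrics(dose_gy, mask, v_threshold_gy):
    """Summarise the dose inside one structure.

    dose_gy: array of voxel doses (Gy) for a single plan.
    mask: boolean array of the same shape, True inside the structure.
    Returns D_max, D_mean, and the percentage of the structure volume
    receiving at least v_threshold_gy (equal voxel volumes assumed).
    """
    voxels = dose_gy[mask]  # keep only the doses inside the structure
    if voxels.size == 0:
        raise ValueError("empty structure mask")
    return {
        "d_max_gy": float(voxels.max()),
        "d_mean_gy": float(voxels.mean()),
        "v_percent": float(100.0 * np.mean(voxels >= v_threshold_gy)),
    }

# Toy 1D "dose grid": the same plan is evaluated on two contours of one organ.
dose = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
manual_mask = np.array([True, True, True, True, False, False])  # ground truth
auto_mask = np.array([True, True, True, True, True, False])     # slightly larger

gt = dvh_metrics(dose, manual_mask, v_threshold_gy=30.0)
rpa = dvh_metrics(dose, auto_mask, v_threshold_gy=30.0)
print("GT :", gt)
print("RPA:", rpa)
```

Because the dose grid is identical in both calls, any difference between the two results is attributable to the contours alone, which is the property the comparison relies on.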

Evaluation process
In this evaluation, we used a "two one-sided tests" (TOST) procedure [22][23][24] to determine if the dose to the autocontour was equivalent to that of the ground truth within a margin, δ, that we determined based on clinical judgment. Below are our hypotheses:
$H_{01}: M_{\mathrm{RPA}} \le M_{\mathrm{GT}} - \delta$
$H_{02}: M_{\mathrm{RPA}} \ge M_{\mathrm{GT}} + \delta$
$H_{a}: M_{\mathrm{GT}} - \delta < M_{\mathrm{RPA}} < M_{\mathrm{GT}} + \delta$
Here, $M_{\mathrm{RPA}}$ is the median dose to the autocontour and $M_{\mathrm{GT}}$ is the median dose for the ground truth, for a given normal structure DVH metric. For volumetric comparisons, δ = 5%; for dosimetric comparisons, δ = 3.5 Gy (or 5% of 70 Gy) for HN patients and δ = 2.5 Gy (or 5% of 50 Gy) for cervical patients. We did not consider the target contours in this evaluation. Using one-sided Mann-Whitney U tests, we considered the autocontour and the manual contour equivalent if both null hypotheses were rejected (i.e., p < 0.10) and a 90% confidence interval for $M_{\mathrm{RPA}} - M_{\mathrm{GT}}$ lay within (−δ, δ). We also calculated the number of plans that met established clinically accepted dosimetric criteria. For HN plans, the criteria used are outlined in Radiation Therapy Oncology Group protocol 1016 [25]. For cervical plans, the criteria used are based on the GEC-ESTRO EMBRACE II protocol [26] and our own internal protocol.
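As an illustration of the TOST logic with one-sided Mann-Whitney U tests, the following is a minimal sketch rather than the analysis script used in this study. It assumes SciPy is available, treats the two samples as independent, and implements each one-sided test by shifting the autocontour sample by ±δ; the function name `tost_mannwhitney` and the synthetic doses are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def tost_mannwhitney(rpa, gt, delta, alpha=0.10):
    """Two one-sided Mann-Whitney U tests for equivalence within +/- delta.

    Returns (p_lower, p_upper, equivalent). Equivalence is declared only
    when both one-sided null hypotheses are rejected at level alpha.
    """
    rpa = np.asarray(rpa, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Null: RPA <= GT - delta. Shift RPA up by delta and test "greater".
    p_lower = mannwhitneyu(rpa + delta, gt, alternative="greater").pvalue
    # Null: RPA >= GT + delta. Shift RPA down by delta and test "less".
    p_upper = mannwhitneyu(rpa - delta, gt, alternative="less").pvalue
    return p_lower, p_upper, bool(p_lower < alpha and p_upper < alpha)

# Synthetic doses in Gy (not study data): a small offset and a large offset.
gt_dose = np.linspace(20.0, 40.0, 30)
close = tost_mannwhitney(gt_dose + 0.5, gt_dose, delta=3.5)
far = tost_mannwhitney(gt_dose + 10.0, gt_dose, delta=3.5)
print("offset 0.5 Gy:", close)
print("offset 10 Gy :", far)
```

The margin of 3.5 Gy matches the HN value quoted above (5% of 70 Gy); for cervical metrics the same call would use `delta=2.5`.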

Equivalence
The autocontours were equivalent (p < 0.10) to the ground truth for 12 of 17 HN DVH metrics and 15 of 20 cervical DVH metrics. Tables 3 and 4 show the confidence intervals and p-values. Some OARs were not contoured for every patient, so the number of data points used in the test for a given structure is included to provide context. $P_0$ represents the p-value from testing the null hypothesis $M_{\mathrm{RPA}} \le M_{\mathrm{GT}} - \delta$, while $P_1$ represents the p-value from testing the null hypothesis $M_{\mathrm{RPA}} \ge M_{\mathrm{GT}} + \delta$. Both $P_0$ and $P_1$ need to be less than 0.10 to show equivalence. Figure 1 shows the distribution of planned dose to the brainstem, bilateral parotid glands, and spinal cord for the HN cancer cases, and Figure 2 shows the distribution for the bladder, bowel bag, and rectum for the cervical cancer cases.
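The confidence-interval part of the equivalence rule can be sketched as follows. The text does not state how the 90% interval was computed, so this example swaps in a percentile bootstrap of the difference in medians as an assumption; it resamples the two groups independently, and the names `bootstrap_median_diff_ci` and `ci_within_margin` and the synthetic doses are hypothetical.

```python
import numpy as np

def bootstrap_median_diff_ci(rpa, gt, level=0.90, n_boot=5000, seed=0):
    """Percentile-bootstrap CI for median(rpa) - median(gt)."""
    rng = np.random.default_rng(seed)
    rpa = np.asarray(rpa, dtype=float)
    gt = np.asarray(gt, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        r = rng.choice(rpa, size=rpa.size, replace=True)
        g = rng.choice(gt, size=gt.size, replace=True)
        diffs[b] = np.median(r) - np.median(g)
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(diffs, [tail, 100.0 - tail])
    return float(lo), float(hi)

def ci_within_margin(ci, delta):
    """True when the whole interval lies strictly inside (-delta, delta)."""
    lo, hi = ci
    return -delta < lo and hi < delta

# Synthetic doses in Gy (not study data) with a 0.5 Gy systematic offset.
gt_dose = 30.0 + np.linspace(-0.1, 0.1, 30)
rpa_dose = gt_dose + 0.5
ci = bootstrap_median_diff_ci(rpa_dose, gt_dose)
print("90% CI:", ci, "within 3.5 Gy margin:", ci_within_margin(ci, 3.5))
```

An interval that crosses only the upper bound corresponds to the situation described in the Discussion, where autocontours tended to report a higher dose than the ground truth.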

Dosimetric criteria
The plans met dosimetric criteria for 91.8% and 91.7% of all HN structures for automatically generated and manual contours, respectively. For cervical plans, 95.6% and 95.7% of structures met dosimetric criteria for automatically generated and manual contours, respectively. Tables 5 and 6 list the DVH metrics evaluated for each site and the number of plans that met criteria.
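The pass-rate calculation can be sketched in a few lines. This is an illustration under stated assumptions: every criterion is treated as an upper dose limit, structures without a listed limit are skipped, and the structure names, limits, and dose values are hypothetical placeholders rather than values from RTOG 1016, EMBRACE II, or this study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DoseRecord:
    structure: str
    metric: str
    value_gy: float

def percent_meeting_criteria(records, limits_gy):
    """Percentage of records at or below the limit for their
    (structure, metric) pair. Records without a listed limit are skipped."""
    evaluated = 0
    passed = 0
    for rec in records:
        limit = limits_gy.get((rec.structure, rec.metric))
        if limit is None:
            continue  # no criterion defined for this structure and metric
        evaluated += 1
        if rec.value_gy <= limit:
            passed += 1
    if evaluated == 0:
        raise ValueError("no records matched any criterion")
    return 100.0 * passed / evaluated

# Hypothetical limits and doses, for illustration only.
limits = {("SpinalCord", "Dmax"): 45.0, ("Parotid_L", "Dmean"): 26.0}
auto_records = [
    DoseRecord("SpinalCord", "Dmax", 40.0),
    DoseRecord("SpinalCord", "Dmax", 46.0),
    DoseRecord("Parotid_L", "Dmean", 20.0),
    DoseRecord("Parotid_L", "Dmean", 25.0),
    DoseRecord("Brain", "Dmax", 50.0),  # skipped: no limit listed
]
manual_records = [
    DoseRecord("SpinalCord", "Dmax", 40.0),
    DoseRecord("SpinalCord", "Dmax", 44.0),
    DoseRecord("Parotid_L", "Dmean", 20.0),
    DoseRecord("Parotid_L", "Dmean", 27.0),
]
auto_rate = percent_meeting_criteria(auto_records, limits)
manual_rate = percent_meeting_criteria(manual_records, limits)
print(auto_rate, manual_rate)
```

In this toy case the two contour sets reach the same overall rate even though different individual entries fail, which is why per-metric counts are also useful alongside the aggregate percentage.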

DISCUSSION
Overall, this analysis demonstrates that autocontours were equivalent to the ground truth for 71% and 75% of common DVH metrics for the HN and cervix, respectively. For DVH metrics that did not meet our equivalence criteria, the autocontours tended to result in a higher reported dose than the ground truth. This may be a result of the autocontours being somewhat more generous than the manual contours, meaning they are more likely to report a higher dose. This finding can be seen in the bowel bag and rectum for cervical cancer and the brain for HN cancer, where the upper limit of the confidence interval was greater than δ and the lower limit was within our margin. For the left parotid (HN), left femoral head (cervical), and spinal cord (cervical), the opposite was true; the autocontours tended to report less dose to these structures than the ground truth contours. For some DVH metrics, equivalence could not be confirmed due to a lack of data (as not all structures are manually contoured for all patients), in particular for the optic chiasm (5 plans) and both optic nerves (10 plans each). This is also true for the bladder, despite having significantly more data points (39) than the optic chiasm and nerves. We were, however, able to confirm equivalence for the liver contour (5 plans).
In general, even for structures that did not demonstrate equivalence, the autocontours were just as safe as the manual contours for planning for a majority of the DVH metrics. For almost all DVH metrics, the same number of autocontours and ground truth contours met dosimetric criteria. The exceptions are the optic chiasm, both optic nerves, left cochlea, and left eye, for which there is a difference of one plan between the two planning sets. Although preliminary, this indicates that effort should be spent reviewing all HN and cervical structures, especially when they are near tolerance, as they will tend to be in a dose gradient, so contouring errors will be particularly impactful.
This study has some limitations. Many of the patient records in our dataset were missing some clinical contours that we used in our evaluation. Depending on the treatment area, these contours were not critical for the treatment of these patients and were not delineated; however, we would need to increase the number of patients in our dataset or curate the patient structure files prior to reproducing this study. This study focused on patients from our own institution and did not evaluate the effect of anatomical variations, patient characteristics, clinical scenarios, and other factors on contour quality. In future work, we would need to widen the scope of our study by diversifying our cohort, using patients from other institutions, and taking the aforementioned factors into account. This would allow us to evaluate the generalizability of our models. In general, changes to CT resolution and acquisition on different scanners can affect the accuracy of autocontouring models. However, Huang et al. [27] showed that deep-learning contouring models are particularly robust to pixel size and slice thicknesses >3 mm. Since the RPA does not allow CT images with slice thickness above 3 mm, CT image quality was not considered.
In this study, we showed that many, but not all, structures are dosimetrically equivalent when comparing automatically generated and manual structures. Differences in contouring did not, however, generally affect whether the structures passed or failed clinical tolerances. Although rare, there were situations where the reported dose indicated that the autoplan passed clinical criteria, but the auto-generated contour did not meet equivalence, thus showing that careful contour review is still important.

AUTHOR CONTRIBUTIONS
Laurence Court, Adenike Olanrewaju, and Lifei Zhang conceived and designed this experiment. Beth Beadle provided clinical guidance for the model used in the experiment. Raymond Mumme and Raphael Douglas carried out the experiment. Raphael Douglas performed the data collection and analysis of the data. Laurence Court supervised the findings in this work. Raphael Douglas wrote the initial draft of the manuscript with support of Laurence Court; all authors contributed to the final manuscript.

FIGURE 1
Scatterplots compare the distribution of planned dose to the normal structures for the brainstem (a), left parotid (b), right parotid (c), and spinal cord (d). The green line represents the estimated regression function. The red and blue lines represent the dosimetric constraints for a given normal structure.

FIGURE 2
Scatterplots compare the distribution of planned dose to the normal structures for the bowel bag (a), bladder (b), and rectum (c). The green line represents the estimated regression function. The red and blue lines represent the dosimetric constraints for a given normal structure.

Primary site | Number of patients | Range of total dose, Gy | Range of fractionation
°, 345°, and 90°. For cervical plans, three 360° coplanar treatment arcs were used at collimator angles of 10
TABLE 3 90% confidence intervals (CIs) and p-values of the TOST for the HN cancer cases. Abbreviations: HN, head and neck; TOST, two one-sided tests. a Structure with a 5-mm margin.
TABLE 4 90% confidence intervals (CIs) and p-values of the TOST for the cervical cancer cases. Abbreviation: TOST, two one-sided tests.
TABLE 5 Number of HN autocontours (RPA) and manual contours (GT) that met clinical dose criteria. Abbreviations: GT, Ground Truth; HN, head and neck; RPA, Radiation Planning Assistant. a Structure with a 5-mm margin.
TABLE 6 Number of cervical autocontours (RPA) and manual contours (GT) that met clinical dose criteria. Abbreviations: GT, Ground Truth; RPA, Radiation Planning Assistant.