Self‐configuring nnU‐Net for automatic delineation of the organs at risk and target in high‐dose rate cervical brachytherapy, a low/middle‐income country's experience

Abstract Background The high-dose rate (HDR) brachytherapy treatment planning workflow for cervical cancer is a labor-intensive, time-consuming, and expertise-driven process. These issues are amplified in low/middle-income countries with large deficits in experienced healthcare professionals. Automation has the ability to substantially reduce bottlenecks in the planning process but often requires a high level of expertise to develop. Purpose To implement the out-of-the-box self-configuring nnU-Net package for the auto-segmentation of the organs at risk (OARs) and high-risk CTV (HR CTV) for Ring-Tandem (R-T) HDR cervical brachytherapy treatment planning. Methods The computed tomography (CT) scans of 100 previously treated patients were used to train and test three different nnU-Net configurations (2D, 3DFR, and 3DCasc). The performance of the models was evaluated by calculating the Sørensen–Dice similarity coefficient (DSC), Hausdorff distance (HD), 95th percentile Hausdorff distance (HD95), mean surface distance (MSD), and precision score for 20 test patients. The dosimetric accuracy of the predicted contours relative to the manual contours was assessed by comparing dose volume histogram (DVH) parameters and volume differences. Three different radiation oncologists (ROs) scored the predicted bladder, rectum, and HR CTV contours generated by the best-performing model. The manual contouring, prediction, and editing times were recorded. Results The mean DSC/HD/HD95/MSD/precision scores for our best-performing model (3DFR) were 0.92/7.5 mm/3.0 mm/0.8 mm/0.91 for the bladder, 0.84/13.8 mm/5.3 mm/1.4 mm/0.84 for the rectum, and 0.81/8.5 mm/6.0 mm/2.2 mm/0.80 for the HR CTV. Mean dose differences (D2cc/D90%) and volume differences were 0.08 Gy/1.3 cm3 for the bladder, 0.02 Gy/0.7 cm3 for the rectum, and 0.33 Gy/1.5 cm3 for the HR CTV.
On average, 65% of the generated contours were clinically acceptable, 33% required minor edits, 2% required major edits, and no contours were rejected. The average manual contouring time was 14.0 min, while the average prediction and editing times were 1.6 and 2.1 min, respectively. Conclusion Our best-performing model (3DFR) provided fast, accurate auto-generated OAR and HR CTV contours with a high clinical acceptance rate.


INTRODUCTION
Cervical cancer is an ever-growing burden in Africa, with the 2017 National Centre for Communicable Diseases (NCID) registry recording approximately 6600 new cervical cancer cases in South Africa alone. This accounted for 15.85% of all new cancer cases among women. 1 In 2018, the World Health Organization (WHO) registered 13 004 new cervical cancer cases, accounting for 12.1% of all cancer cases recorded for men and women combined. 2 These numbers are only expected to increase over the coming years. It is therefore crucial to ensure that patients gain access to the latest treatment techniques and that health professionals continuously strive to improve all aspects of the radiotherapy treatment process. Image guided High Dose Rate (IG-HDR) intracavitary brachytherapy is an irreplaceable curative treatment modality for locally advanced cervical cancer. Often provided as a primary or boost treatment option, HDR brachytherapy delivers steep dose gradients that offer excellent local control rates while reducing normal tissue complications. [3][4][5][6][7] Several studies have shown an increase in survival rates and a decrease in recurrence rates when brachytherapy is used as part of a patient's treatment regimen. [8][9][10][11] Although IG-HDR brachytherapy comes with great advantages, the treatment planning workflow as a whole is a labor-intensive process that includes a number of manual, time-consuming steps and involves input from a range of professionals: Radiation Oncologists (ROs), Radiotherapists (RTs), and Medical Physicists (MPs). While MRI is the gold standard, CT-based IG brachytherapy is more cost-effective in a resource-constrained environment with limited or no access to MRI.
Organ at risk (OAR) and target delineation is an extremely important aspect of IG brachytherapy; however, this step makes up a significant portion of the whole treatment planning time and depends largely on the expertise of the RO. Several studies have highlighted issues due to inter- and intra-observer variability in organ delineation, with Hellebust et al. and Saarnak et al. reporting inter-observer variabilities between 5−8% and 10−11%, respectively. [12][13][14][15][16] Not only is inter- and intra-observer variability a growing concern; low- and middle-income countries, such as those in Africa, also face a variety of challenges in attempting to meet the demands for high-quality cancer treatment. These include, but are not limited to, equipment shortages, maintenance challenges, and high workloads, as well as huge deficits in experienced ROs, MPs, and RTs. 17,18 A recent investigation into the current state of cancer in sub-Saharan Africa (SSA), published in the Lancet Oncology journal, found that the shortage of radiotherapy professionals is one of the most crucial barriers hindering access to cancer services in SSA, with an estimated 211% increase in workforce required to provide equitable access to radiotherapy. 19 Computer-aided automation has the ability to mitigate the bottlenecks experienced throughout the radiotherapy process and to alleviate some of the workload on the ROs, MPs, and RTs. As a result, there has been a growing demand internationally to automate as much of the radiotherapy treatment planning workflow as possible.
Over the past few years, several techniques have been proposed to automate the delineation of OARs and targets in radiotherapy. A commonly used approach is atlas-based (AB) methods, which involve deformable image registration, where segmentations from the reference image or atlas are transformed onto the new or test image. [20][21][22][23] AB methods suffer from uncertainty in deformable registration and provide sub-optimal segmentations when dealing with atypical patients, organ motion, and varying levels of bladder and rectum filling. [24][25][26] Due to the inconsistencies in AB methods, there has been a worldwide shift toward deep learning (DL) automated segmentation techniques, providing an increase in accuracy, reproducibility, and robustness. [27][28][29][30] The automated delineation of OARs and targets using DL networks in external beam (EB) radiotherapy has been successfully applied to cases such as head and neck, 31 brain metastases, 32 breast, 28 rectum, 33 pelvis, 34,35 and cervix. 36,37 DL has also been effectively applied in various areas of brachytherapy, from applicator reconstruction, dose calculation, and treatment planning to organ delineation. 38 In IG-HDR cervical brachytherapy, work has been done on automating the reconstruction of applicators using DL, 39,40 with a few studies looking at the automatic segmentation of the OARs and targets. [40][41][42][43][44] These studies, however, do not assess the clinical acceptability of the generated contours and involve complex deep learning models that require a high level of expertise to reproduce or apply in one's own clinical department. 45 The purpose of this study is to train and implement the self-configuring No New U-Net (nnU-Net), developed by Isensee et al., 46,47 for the task of automatically delineating the OARs and HR-CTV in IG-HDR cervical brachytherapy. To the best of our knowledge, this is the first CT-based study applying the novel nnU-Net to cervical brachytherapy segmentation with ring and tandem applicators. This is important, as low/middle-income departments experience severe bottlenecks in their planning process and may not have the required expertise to design and implement their own machine learning models, a task that nnU-Net was designed to tackle.

Clinical dataset
CT images from 100 locally advanced cervical cancer patients were included in the study. All patients were initially treated with external beam radiotherapy (50 Gy/25 fractions or 46 Gy/23 fractions), followed by HDR brachytherapy (25 Gy/5 fractions or 24 Gy/4 fractions) using ring and tandem applicators. All patients were scanned on a Philips Brilliance Big Bore 16-slice CT scanner. Images were reconstructed with a 512 × 512 matrix size and 2 or 3 mm slice thickness. All OARs and HR-CTVs were subsequently contoured and approved by a clinical oncologist, based on the IBS-GEC ESTRO-ABS recommendations. 48 All patient data were completely anonymized. The pixel data from each patient's CT were extracted from the DICOM files and converted into a three-dimensional (3D) NIfTI file format. The contour information was also extracted from the DICOM structure file and used to create 3D multiclass segmentation masks of the OARs and HR-CTVs, which were likewise saved in NIfTI file format. The 100 patients included in the study were split into 80 patients for training/validation and 20 patients for testing. The 80 patients were then further split into 64 for training and 16 for validation.
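As an illustration, the 20/64/16 test/train/validation split described above can be reproduced with a deterministic shuffle. This is a minimal sketch; the random seed and patient ID scheme are illustrative assumptions, not taken from the study:

```python
import random

# hypothetical anonymized patient identifiers
patient_ids = [f"patient_{i:03d}" for i in range(100)]

rng = random.Random(42)       # fixed seed so the split is reproducible
rng.shuffle(patient_ids)

test      = patient_ids[:20]  # 20 held-out test patients
train_val = patient_ids[20:]  # 80 patients for training/validation
train     = train_val[:64]    # 64 training patients
val       = train_val[64:]    # 16 validation patients
```

Fixing the seed keeps the held-out test set identical across experiments, which matters when comparing the three nnU-Net configurations on the same 20 patients.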

nnU-Net
Deep learning-based segmentation models tend to be extremely task specific, with even slight adjustments in architecture, training parameters, or data leading to significant drops in performance. One therefore often requires a high level of expertise to design a robust network architecture, optimize the data augmentation, determine the appropriate pre-processing or post-processing, and select the ideal training parameters to obtain a model best suited for the segmentation task at hand. 45 This process is especially cumbersome when dealing with 3D medical images, where image size, imaging modality, voxel size, and class imbalance vary substantially, resulting in poor transfer of a configuration from one dataset to another. As a solution, Isensee et al. developed a self-configuring method for deep learning-based image segmentation specific to bio-medical imaging. The result is a completely automated deep learning segmentation pipeline, allowing any user with minimal programming knowledge to set up and train their own segmentation models. 46,47 At its heart, nnU-Net is based on the U-Net architecture, developed by the computer science department of the University of Freiburg for biomedical image segmentation. 49 Isensee et al. found that a well-trained U-Net has the potential to beat, or at the very least match, any task-specific segmentation model, leading them to develop their segmentation pipeline known as No New U-Net or nnU-Net. This self-configuring segmentation pipeline was proven to outperform highly task-specific model architectures in several international biomedical auto-segmentation competitions on 23 different publicly available datasets. These 23 datasets include anatomical sites from the brain, head and neck, breast, abdomen, and pelvic regions, with the nnU-Net pipeline showing high performance and robustness regardless of anatomical site or imaging modality used. 47

nnU-Net involves the use of three different types of parameters: fixed, rule-based, and empirical. The fixed parameters remain unchanged when applying nnU-Net to various datasets and include parameters such as learning rate, optimizer, loss function, architecture template, data augmentation, and training and inference procedures. Once a dataset has been provided in the correct format, nnU-Net determines several rule-based parameters according to the dataset fingerprint. These include intensity normalization, image target spacing, and image and annotation resampling. nnU-Net determines the remaining rule-based parameters, such as patch size, network topology, and batch size for the 2D, 3D, and low-resolution (required for the cascade configuration) networks, iteratively based on the available GPU memory. A summary of the nnU-Net rule-based parameters for the current study can be found in section A5 of the supplementary material.
A summary depicting how the nnU-Net segmentation pipeline operates, as well as certain design and configuration choices such as the learning rate scheduler and loss function, is shown in Figure 1. nnU-Net provides three different trainable U-Net configurations: a two-dimensional (2D) U-Net, a 3D full resolution U-Net (3DFR), and a 3D cascade U-Net (3DCasc), where the first stage is trained on low resolution images and the second refines the segmentation maps at full resolution.
nnU-Net allows the user to train all configurations using five-fold cross-validation, and once completed, nnU-Net can empirically determine the best performing configuration, or ensemble of configurations, for the specific task, as well as whether or not to apply all-but-largest-component suppression as post-processing. A main concern with running ensembles is the significantly higher prediction times, which can make clinical application impractical. For this study, all configurations were trained using five-fold cross-validation, and the best single configuration was determined independently using the performance metrics defined in section 2.D, as well as visual inspection of the predicted contours.
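The preprocessing, cross-validated training, and empirical configuration-selection steps described above map onto a handful of command-line calls. A sketch using the nnU-Net v1 command names documented by Isensee et al. (the task ID 101 and dataset name are hypothetical; nnU-Net v2 renames these commands to nnUNetv2_*):

```shell
# extract the dataset fingerprint and derive the rule-based parameters
nnUNet_plan_and_preprocess -t 101 --verify_dataset_integrity

# train each configuration with five-fold cross-validation (folds 0-4)
for FOLD in 0 1 2 3 4; do
    nnUNet_train 2d                 nnUNetTrainerV2 Task101_CervixBrachy $FOLD
    nnUNet_train 3d_fullres         nnUNetTrainerV2 Task101_CervixBrachy $FOLD
    nnUNet_train 3d_lowres          nnUNetTrainerV2 Task101_CervixBrachy $FOLD
    nnUNet_train 3d_cascade_fullres nnUNetTrainerV2CascadeFullRes Task101_CervixBrachy $FOLD
done

# let nnU-Net empirically pick the best configuration/ensemble and post-processing
nnUNet_find_best_configuration -t 101 -m 2d 3d_fullres 3d_lowres 3d_cascade_fullres
```

In this study the ensemble suggestion was not used; the single best configuration was instead chosen from the per-configuration metrics and visual inspection, as noted above.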
FIGURE 1 nnU-Net automated segmentation pipeline.

The total loss function implemented by nnU-Net is a combination of the Dice loss function and the categorical cross entropy loss function, as seen below:

$$\text{Total Loss} = L_{CE} + L_{DSC} \tag{1}$$

with the categorical cross entropy loss function given as:

$$L_{CE} = -\sum_{n}\sum_{c} y_{n,c}\,\log p_{n,c} \tag{2}$$

and the Dice loss function defined as:

$$L_{DSC} = 1 - \frac{1}{|C|}\sum_{c}\frac{2\sum_{n} p_{n,c}\,y_{n,c}}{\sum_{n} p_{n,c} + \sum_{n} y_{n,c} + \varepsilon} \tag{3}$$

where $p_{n,c}$ is the probability that class label $c$ in pixel $n$ is correct, $y_{n,c}$ is the one-hot encoding of the ground truth label for pixel $n$, and the indices $n$ and $c$ iterate over all pixels and classes, respectively. The $\varepsilon$ term avoids the numerical issue of division by zero when either the ground truth mask or the predicted segmentation map is empty.
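A minimal NumPy sketch of the combined cross entropy plus soft Dice loss of Equation (1), with the ε guard in the denominator. This is an illustrative re-implementation on flattened per-pixel probabilities, not nnU-Net's actual code:

```python
import numpy as np

def total_loss(p, y, eps=1e-8):
    """Combined loss: categorical cross entropy + (1 - soft Dice).

    p : (N, C) array of predicted class probabilities per pixel
    y : (N, C) one-hot encoded ground truth labels
    """
    # cross entropy averaged over pixels
    ce = -np.mean(np.sum(y * np.log(p + eps), axis=1))
    # soft Dice per class, averaged; eps guards empty masks
    inter = np.sum(p * y, axis=0)
    dice = (2.0 * inter + eps) / (np.sum(p, axis=0) + np.sum(y, axis=0) + eps)
    dsc_loss = 1.0 - np.mean(dice)
    return ce + dsc_loss

# near-perfect prediction on 4 pixels, 3 classes -> loss close to zero
y = np.eye(3)[[0, 1, 2, 1]]
p_good = np.clip(y, 1e-6, 1 - 1e-6)
p_good /= p_good.sum(axis=1, keepdims=True)
low = total_loss(p_good, y)
```

Combining the two terms balances pixel-wise classification accuracy (cross entropy) against region overlap (Dice), which helps with the class imbalance typical of small structures on large CT volumes.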

Training
All three configurations were trained for 1000 epochs using a polynomial learning rate scheduler with an initial learning rate of 0.01 and the stochastic gradient descent (SGD) optimizer with Nesterov momentum (μ = 0.99). The 2D U-Net configuration was trained on the full resolution of the image (512 × 512) with a batch size of 66, while the 3DFR was trained with a batch size of 2 and a patch size of 40 × 256 × 240. The 3DCasc configuration was first trained on downsampled low-resolution images, with a median image shape of 67 × 380 × 380, using a batch size of 2 and a patch size of 40 × 256 × 240, and then further refined at full resolution with the same batch and patch sizes. Training was carried out on the Google Colab Pro+ platform, with access to a GPU (NVIDIA Tesla P100) and 52 GB of RAM.
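The polynomial ("poly") schedule referred to above decays the learning rate smoothly from its initial value to zero over the 1000 epochs. A sketch, assuming the exponent of 0.9 that nnU-Net conventionally uses (the exponent is an assumption, not stated in the text):

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """Polynomial learning-rate decay: starts at initial_lr, reaches 0 at max_epochs."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent
```

Compared with step decay, the poly schedule avoids abrupt drops, which tends to give stable convergence over long fixed-length training runs like nnU-Net's 1000 epochs.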

Performance metrics
The performance of the three generated models (2D, 3DFR, and 3DCasc) was assessed using the 20 test patients. The predicted and manual contours were compared quantitatively using the Dice score (DSC), Hausdorff distance (HD), 95th percentile Hausdorff distance (HD95), mean surface distance (MSD), and precision score. The DSC quantifies the geometric overlap between the predicted and manual contours, while the HD, HD95, and MSD measure the boundary similarity between the manual and predicted segmentations.
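Two of these metrics can be stated directly on binary masks; a minimal NumPy sketch of the DSC and precision score (the distance metrics HD, HD95, and MSD additionally require surface extraction and are omitted here):

```python
import numpy as np

def dsc(pred, gt):
    """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def precision(pred, gt):
    """Fraction of predicted voxels that are true positives: TP / (TP + FP)."""
    tp = np.logical_and(pred, gt).sum()
    return tp / pred.sum()

# toy 2D masks: prediction covers 4 voxels, ground truth 6, overlap 4
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt   = np.zeros((4, 4), dtype=bool); gt[1:3, 1:4] = True
```

For these toy masks the DSC is 2·4/(4+6) = 0.8 and the precision is 4/4 = 1.0, illustrating why both are reported: precision ignores the voxels the model failed to cover, while the DSC penalizes them.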

Dosimetric evaluation
All brachytherapy plans were generated using the Oncentra (Elekta, Stockholm, Sweden, version 4.6) treatment planning system according to the OAR dose constraints and HR-CTV prescribed dose recommended by the American Brachytherapy Society (ABS) 50 and GEC-ESTRO 51-53 guidelines. Assuming α/β = 10 for the HR-CTV, the prescribed dose (EB + brachy) was D 90 > 80 Gy EQD2; assuming α/β = 3 for the OARs, the maximum dose (D 2cc , EB + brachy) to the bladder and rectum was limited to 90 Gy EQD2 and 75 Gy EQD2, respectively. To evaluate the dosimetric differences between the manual and predicted contours, the predicted segmentations were converted to DICOM before being imported into the planning system, where the original dose distribution was overlaid and cumulative dose volume histograms (DVHs) were generated. The dosimetric discrepancies of D 90% for the HR-CTV and D 2cc for the OARs were evaluated and compared with the DVH statistics generated from the manual contours.
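The EQD2 values above follow from the standard linear-quadratic conversion EQD2 = D·(d + α/β)/(2 + α/β), where D is the total dose and d the dose per fraction. A small sketch; the 5 Gy × 5 fraction bladder D2cc used in the example is a hypothetical value, not a result from this study:

```python
def eqd2(total_dose, dose_per_fx, alpha_beta):
    """Equieffective dose in 2 Gy fractions (linear-quadratic model)."""
    return total_dose * (dose_per_fx + alpha_beta) / (2.0 + alpha_beta)

# EBRT 50 Gy in 25 x 2 Gy fractions, alpha/beta = 3 for OARs
ebrt = eqd2(50.0, 2.0, 3.0)            # 2 Gy fractions convert to themselves
# hypothetical bladder D2cc of 5 Gy per fraction over 5 HDR fractions
brachy = eqd2(25.0, 5.0, 3.0)
combined = ebrt + brachy               # compared against the 90 Gy EQD2 limit
```

Because EBRT is delivered in 2 Gy fractions, its EQD2 equals its physical dose, so the brachytherapy boost dominates the EQD2 budget left under the OAR constraints.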

Clinician assessment
Performance and dosimetric evaluation provided a good indication of a model's accuracy; however, these metrics alone were not sufficient, and input from clinicians was required to properly assess the clinical acceptability of the generated patient contours. The OARs and HR-CTVs from each of the 20 test patients were evaluated independently by three different ROs, who scored each patient's bladder, rectum, and HR-CTV on a scale of 1 to 4, where 1 indicates complete rejection, 2 major revisions required, 3 minor revisions required before the contour is clinically acceptable, and 4 clinically acceptable as is. Contours scored 4 included those where the ROs would have preferred some small adjustments, but these were not necessary for the contours to be deemed clinically acceptable. For any contours scoring 3 or lower, the ROs indicated the time required to adjust the contours to the level of clinical acceptability. These times, along with the model prediction times, were compared to the average time taken for the ROs to manually contour all the OARs and the HR-CTV.

Model comparisons
The mean ± standard deviation of the performance metrics (DSC, HD, HD95, MSD, and precision) between the manual contours and predicted contours for all three generated nnU-Net configurations (2D, 3DFR, and 3DCasc) are given in Table 1. The best performing configurations were 3DFR and 3DCasc with the mean DSC over all OARs and HR-CTVs for both configurations being around 0.85. The mean HD, HD95, and MSD for the 3DFR and 3DCasc configurations over all OARs and HR-CTVs were approximately 9.9/10.1 mm, 4.8/4.9 mm, and 1.48/1.54 mm, respectively. The predicted contours from all three configurations were also visually inspected and the 3DFR configuration was found to provide slightly better predictions when it came to the larger bladder volumes.

Performance results
Boxplots of the DSC, HD, HD95, MSD, and precision scores for each structure of the 3DFR configuration are shown in Figure 2. Almost all predicted contours had performance metrics distributed within a narrow range, with only a single low-DSC and high-MSD outlier in the boxplots for the bladder contours. Examples comparing the manual contours to those predicted by the 3DFR nnU-Net configuration are shown in Figure 3.

Dosimetric evaluation
The dosimetric accuracy of the 3DFR nnU-Net predictions was investigated by examining the DVH parameter D 2cc for the OARs and D 90% for the HR CTV, as well as the total volume (in cm 3 or cc) of the OARs and HR CTVs.
Boxplots showing the DVH parameters and volumes of all 20 test patients for the manual and predicted contours are given in Figure 4. The dose differences for the DVH parameters (D 2cc /D 90% ) for each test patient are shown in Figure 5. The mean dose and volume differences over all OARs and HR CTVs were 0.1 ± 1.2 Gy and 1.1 ± 7.1 cm 3 , respectively. A summary of the mean ± standard deviation of the DVH parameters and volumes for the manual and predicted contours is shown in Table 2, along with the mean ± standard deviation of the DVH dose parameter and volume differences.

Clinician reviews
The average scores from the three ROs' reviews of the predicted bladder, rectum, and HR CTVs for all 20 test patients are shown in Figure 6. Overall, with OARs and HR CTV combined, 65% of the predicted contours were suitable for clinical use, while 33% required minor adjustments before being clinically acceptable; the remaining 2% required major revisions, and 0% were rejected completely. A summary of the average review scores for the bladder, rectum, and HR CTV is given in Table 3. For patient contours that scored 3 or lower, the ROs recorded the time needed to make the necessary corrections. The mean editing time, along with the mean manual contouring time and model prediction time, are given in Table 4. The mean time taken for the auto-contouring system to delineate the OARs and HR CTV is substantially less than the mean manual contouring time.

DISCUSSION
With only a few lines of code, we successfully implemented the self-configuring nnU-Net for the automatic segmentation of the OARs and HR CTV in HDR cervical brachytherapy. Three nnU-Net configurations were trained (2D, 3DFR, and 3DCasc) with a total of 64 training CTs and 16 validation CTs. All three configurations were tested on the remaining 20 CT scans, where the 3DFR and 3DCasc models provided substantially better performance, with the 3DFR providing better generalization over the larger bladder volumes when inspected visually. We demonstrated that nnU-Net can produce clinically acceptable contours for both the OARs and HR CTV, with significant time savings possible when applying the automatic contouring system.

FIGURE 6 Average RO reviews of the predicted bladder, rectum, and HR CTV contours for all 20 test patients.

Quantitative results
All three nnU-Net configurations were assessed using the performance metrics described in Section 2.C, with the 3DFR and 3DCasc models significantly outperforming the 2D model (P < 0.002) over all OAR and HR CTV contours combined (Table 1). There were no statistical differences between the 3DFR and 3DCasc models (P > 0.863); however, after visual inspection, the 3DFR model provided slightly better generalization when it came to larger bladder volumes. Among the 3DFR model's predicted bladder, rectum, and HR CTV contours, the bladder provided the highest DSC and precision scores, as well as the lowest distance metrics (HD, HD95, and MSD). The largest variation in performance metric results was found for the HR CTV. This can be attributed to the fact that HR CTVs do not have a clearly visible anatomical boundary or edge, especially in CT imaging, which can cause inter- and intra-observer variability in the ground truth contours as well as a drop in model performance for the HR-CTV when compared to the OARs with more clearly defined edges. A comparison of the 3DFR model performance metrics to those found in previous work is shown in Table 5. The performance of the 3DFR nnU-Net configuration compared well to work done in previous studies, where complex model architectures were designed specifically for the task of HDR cervical brachytherapy contouring. The 3DFR managed to outperform two of the studies when it came to bladder and rectum contouring, beaten only by the 2D ResU-Net by Mohammadi et al. 42 However, that study did not look at the segmentation of the HR CTV, included a larger dataset, and contained CT scans where strict bladder and rectal filling protocols were followed; the absence of such protocols is a limitation of the present study. The current work provided HR CTV contour performance comparable to that found in other studies.

Dosimetric accuracy
Dosimetric evaluation between the predicted and manual contours was carried out for all 20 test patients.
There were no statistically significant differences in the mean D 2cc /D 90% or volumes for the bladder, rectum, and HR CTV (all P > 0.8). The largest mean change in a DVH parameter was for the HR CTV, with a mean change in D 90% of 0.33 Gy and a mean change in volume of 1.5 cm 3 . The mean changes in D 2cc and volume for the bladder and rectum (Table 2) showed better accuracy than those found in other studies, where Mohammadi et al. 42 reported mean changes in D 2cc /volume of −0.5 Gy/34 cm 3 and 0.3 Gy/17 cm 3 for the bladder and rectum, respectively. The largest difference in D 2cc of −4.5 Gy (Figure 5) was observed for the rectum contour of a single test patient; this patient also had the lowest DSC value (0.77) and one of the highest HD values (20 mm) obtained for the predicted rectum contours.

Clinical acceptability
There was an average acceptance rate of 65% over all the OARs and HR CTVs for the 20 test patients, with average acceptance rates of 95% (19 patients) and 55% (11 patients) for the bladder and rectum, respectively. Approximately 40% (eight patients) of the rectum contours and 55% (11 patients) of the HR CTV contours required very minor revisions before being clinically acceptable. Most of the minor revisions for the rectum concerned its superior border, where the model struggled slightly to distinguish the rectum from sections of the bowel or sigmoid (see section A4 of the supplementary material). The minor revisions for the HR CTV were a combination of either incomplete contours at the superior border or, at the inferior border, too much of the ring being included in the contour. Based on the average scores for the bladder, minor revisions were required for only one patient, where the model failed to accurately determine the bladder wall edge. The average time taken for the ROs to manually contour the OARs and HR CTV was around 14 min, while the average model prediction time was 1.6 min and the average editing time approximately 2.1 min (for contours requiring minor revisions before being clinically acceptable). These results show that applying the auto-contouring system can, on average, substantially reduce the overall contouring time, thereby allowing a more efficient planning workflow by alleviating the bottleneck experienced at the contouring stage.
This study faced several limitations, such as the small dataset size, no consistent bladder or rectum preparation, and no contrast agent. Beyond these, the study was also hindered by inter-observer variation in the manual contours used for training, as well as inter-observer variation among the three ROs scoring the generated contours. In addition, it is acknowledged that CT-based brachytherapy contours are inferior in accuracy to MRI-based contours.

CONCLUSION
We have demonstrated that, with only a few lines of code and a relatively small patient dataset, one can develop an accurate and robust auto-segmentation model that provides clinically acceptable contours of the OARs and HR CTV in R-T HDR cervical brachytherapy, with no significant dose difference between manual and predicted contours. We have shown that an auto-contouring system can significantly reduce the required contouring time, thereby providing a more efficient planning workflow. This is ideally suited to a resource-constrained oncology department using CT-based image guided brachytherapy.

ACKNOWLEDGMENTS
No funding to declare.