Interactive contouring through contextual deep learning

Purpose: To investigate a deep learning approach that enables three-dimensional (3D) segmentation of an arbitrary structure of interest given a user provided two-dimensional (2D) contour for context. Such an approach could decrease delineation times and improve contouring consistency, particularly for anatomical structures for which no automatic segmentation tools exist. Methods: A series of deep learning segmentation models using a Recurrent Residual U-Net with attention gates was trained with a successively expanding training set. Contextual information was provided to the models, using a previously contoured slice as an input, in addition to the slice to be contoured. In total, 6 models were developed, and 19 different anatomical structures were used for training and testing. Each of the models was evaluated for all 19 structures, even if they were excluded from the training set, in order to assess the model ’ s ability to segment unseen structures of interest. Each model ’ s performance was evaluated using the Dice similarity coefficient (DSC), Hausdorff distance, and relative added path length (APL). Results: Iterative prediction performed better compared to direct prediction when considering unseen structures. Conclusions: Training a contextual deep learning model on a diverse set of structures increases the segmentation performance for the structures in the training set, but importantly enables the model to generalize and make predictions even for unseen structures that were not represented in the training set. This shows that user-provided context can be incorporated into deep learning contouring to facilitate semi-automatic segmentation of CT images for any given structure. Such an approach can enable faster de-novo contouring in clinical practice. © 2021 Mirada Medical Ltd. Medical Physics published by Wiley Periodicals LLC on behalf of American Association of Physicists in Medicine. [https://doi.org/10.1002/mp.14852]


INTRODUCTION
Many clinical procedures rely on accurate contouring of anatomical structures in medical images. For example, image segmentation is extensively used in radiotherapy planning to identify healthy and cancerous regions. 1 Accurate segmentation of both the tumor and the healthy tissue in the image, is essential to maximize the dose of radiation delivered to the tumor while minimizing the dose delivered to healthy tissues.
Manual contouring is a time-consuming process and is subject to significant intra-and interuser variability. As a result, semi-automatic and fully automatic image segmentation tools have been developed not only to reduce contouring time but also to improve the consistency of image segmentation. Machine learning (ML)-based approaches have been employed with great success to automatically contour medical images. 2,3 When algorithm training is performed using a sufficiently large and representative dataset, ML-based 1 approaches have been shown to generate contours that are effectively indistinguishable from manually drawn contours. 4 However, automatic image segmentation techniques fail when the postulated assumptions inherent to the ML model are violated. 5,6 For example, the high variability in size, location and appearance of tumors makes it challenging to build a representative training set for multiple tasks, limiting the performance of ML methods. 6 Human input is thus still currently required for the segmentation of structures whose appearance and location exhibit large variability.
To overcome the drawbacks of fully automatic image segmentation techniques, semi-automatic methods can be employed. Such approaches can improve segmentation accuracy compared to fully automatic techniques by integrating user inputs to guide the segmentation process. Compared to fully manual segmentation, interactive techniques have been shown to improve repeatability and consistency across multiple observers. [7][8][9][10] User inputs can be used in a ML model as additional prior information to the system. This has been shown to improve contouring results. [11][12][13] For example, a User can draw a contour on one image slice in a CT or MR dataset that can then be propagated through the remaining slices if the context between the contour on the delineated slice and the remaining image slices can be established. 11,12 As an alternative approach to incorporate user annotations, a three-dimensional (3D) U-Net for volumetric segmentation has been proposed that learns from sparsely annotated volumetric images. 14 In this work, a network is trained using only a few manually annotated image slices by using a weighted loss function and special data augmentation. Previous work on semi-automatic methods has focused on the segmentation of a predefined anatomical structure or set of structures. For such methods, networks are trained and optimized to detect the features relevant to the specific structure(s). This means, however, that they are likely to fail when applied to unseen structures. 5,6 In the present study, a deep learning training strategy is evaluated to enable 3D segmentation of an arbitrary anatomical structure using a manual segmentation on a single slice as prior knowledge. The ability of this method to capture the context between the manual segmentation and the image slice and apply it to adjacent image slices regardless of the structure of interest was investigated. This study shows that interactive contouring is possible using contextual deep learning, given access to adequate contextual inputs and a diverse training set. Most importantly, it demonstrates that the contextual deep learning model is not limited to structures included in the training set but extends to previously unseen structures.

2.A. Contextual deep learning
In this study, a deep learning model is proposed that makes predictions based on the provision of relevant contextual information, rather than being limited to a specific structure defined by the training set. Such a model is referred to as a contextual deep learning model in this work. The training strategy, particularly the diversity of structures in the training set, has a great impact on the models ability to predict previously unseen structures.
For the contextual deep learning model to learn from context, three different inputs were provided that contain the information of the image slice to be contoured and the contextual information. These inputs are illustrated in Fig. 1(a). The first input is the slice to be contoured, here referred to as the target image slice. A previously contoured image slice provides the other two inputs to the model: the image slice itself, as well as the contour information in the form of a binary mask. The relationship between the image slice and its corresponding binary mask allows the model to identify what to segment on the target image slice. The binary mask does not label which structure it is, such as whether the structure is, for example, a heart, a tumor or any other structure. This information was deliberately omitted, to generalize the model to its context. In Supplemental Material 3, different sets of inputs to the models were studied. This analysis showed that all three inputs are essential for the network to learn based on context.

2.B. Neural network architecture
In this work, a modified U-Net was used. 15 Specifically, a residual recurrent U-Net with attention gates. 16,17 The architecture is shown in Fig. 1(b). Here, the network input was expanded from one to three channels. The additional channels provide prior information to the network, as described above. Attention gates were incorporated to connect the features from the contracting path with the expanding path. The attention gates for soft self-attention highlight salient image regions implicitly. 18,19 In the last layer, a sigmoid activation was applied to get the probability map.
The network was trained using pixelwise binary cross entropy loss. 20 The Adam optimizer 21 was used with an initial learning rate of 1 × 10 −5 and a batch size of 4. Data augmentation was applied on-the-fly during training using a random affine matrix transformations consisting of translation (AE 10 mm), rotation (AE 10 degrees) and scaling (AE 5%). Each model presented in this study was trained from randomly initialized weights. Therefore, model performance may vary with initialization. To mitigate the variability and achieve the best possible performance for each model, eight models were initialized for training. After 5 epochs, the worst 50% of the models are discarded. This was repeated each time the number of epochs doubles until one model remains. Then the remaining model was trained for 20 more epochs.

2.C. Data
This study included CT images and delineations of 19 different structures of interest from various openly available datasets. An overview of all the used structures is shown in Table I.
The 3D CT volumes and corresponding contouring data were randomly split into a training, validation and test set using a 60:20:20 split, respectively. In total, 2000 image slices per structure were randomly selected from the training data in order to limit the computational resources required and to create a balanced representation of each structure in the training set. Structures that were delineated on only very few patients or that extend over fewer than 10 slices were excluded from this study.
In the training set, the previously contoured and target slices were chosen carefully to support learning from context. The interslice distance between the contoured and target slice was particularly important. If the previously contoured image slice and the target image slice were adjacent image slices, there was insufficient change between the image slices for the model to learn the context and, in this case, the model learnt to copy the existing contour rather than generate a segmentation mask for the new slice. Target slices far away from the contoured slice are less relevant for the application of this model, as the target image slices corresponding too large interslice distances often lie outside of the structure to be segmented. The slice distance was sampled from a range of distances to balance between the two extremes and create a more diverse training set that was not biased towards a specific slice distance. It was important to sample the distance from a range of distances relative to the size of the structure and not use one predefined set of distances for all structures. Thus, different structures with a variety of size (e.g., lungs, heart, tumor) were equally well represented. The previously contoured image slice must be chosen such that it lies within the structure, otherwise no contour exists from which context can be determined. However, the target image slice may lie outside the structure volume to allow the model to learn not only where to contour but also when to stop contouring.

2.D. Preprocessing
CT pixel intensities were clamped to the range from −1000 HU (air) to 2000 HU (cortical bone) to avoid artifacts in the high and low HU range. Intensities outside this range are typically irrelevant to segment biological tissue. The CT image slices of 512 × 512 pixels were rescaled to the network input size of 256 × 256 pixels. The rescaling was used to decrease computation time and reduce memory consumption.

2.E. Experimental details
A series of models was trained, for which the size of the training set was successively increased. Figure 2 shows the various models, together with the respective set of structures

2.F.1. Evaluation measures
The quality of the segmentation was quantified using the two-dimensional (2D) Dice similarity coefficient (DSC), 22 Hausdorff distance (HD), 23 and relative added path length (APL). 24 The DSC measures the overlap between two areas or two volumes, A and B. The DSC is defined as The HD measures how far two subsets of a metric space are from each other. The HD d HD (A,B) is defined by where d denotes the Euclidian distance. The APL is the length of contour drawn when editing a segmentation. Because the absolute length of a contour varies between patients and different structures, the APL is reported relative to the ground truth contour length. A tolerance of 2 mm between predicted result and ground truth was used for APL. A low relative APL means that few edits are necessary, whereas at a relative APL of the full contour needs to be drawn by the user. 24 The DSC and HD are widely used to quantify the performance of segmentation tasks. However, these measures are not necessarily the best choice in a clinical context, where the contouring speed is more important. The time needed for an expert to review and adjust the predict contours shows better correlation with the APL than with DSC and HD. 24 To compute the performance measures, the masks that were generated were upsampled to the original 512 × 512 imaging resolution using bi-linear interpolation. It is important to evaluate the performance after upsampling because a clinician would review these contours at the original image resolution.

2.F.2. Contour prediction
In practice, it is expected that a user would select an image slice and delineate a structure on that slice. Then, the structure would be segmented by the model on the remaining image slices given the initial user input. Two different contour prediction methods were evaluated.
In the first approach, the initial input image slice (the central slice) was chosen as the contextual input for the Medical Physics, 0 (0), xxxx segmentation of all the remaining slices. This method is referred to as "direct prediction." In the second approach, the expert contour at the central slice was used as the contextual input and the adjacent image slice as the target slice. The predicted target contour becomes the contextual input for the next slice. The process was repeated until all slices are contoured. This method is referred to as "iterative prediction."

RESULTS
Results for all evaluation measures can be found in the Supplemental Materials, while the relative APL or DSC are reported here. The boxplots in Fig. 4 show the relative APL for the heart, lung tumor, pancreatic tumor, liver, submandibular gland, and spleen. An equivalent figure for the rest of the structures and evaluation measures is provided in Supplemental Material 1. The most significant decrease in the relative APL for a given structure was observed when that structure was included in the training set of a model. However, observe that including additional structures in the training set also resulted in segmentation improvements to structures that were not included in the training set. For example, the relative APL for pancreas tumor, liver, submandibular gland and spleen decreased on adding the lung tumor in the training set (Model AB to Model ABC). The spleen was excluded from all models trained. Yet, the relative APL of the spleen decreases from 0.85 AE 0.22 for model A to 0.35 AE 0.21 for the final ABC-DEF model.

3.B. Comparison of contour prediction approaches
In Fig. 5, the performance of model ABCDEF for direct and iterative prediction is compared at different interslice distances from the central slice. Figures for the remaining structures and evaluation measures can be found in Supplemental Material 2.
Predictions closer to the central slice had a lower relative APL, whereas the relative APL increased towards the superior and inferior most slice of each structure. The structures included in the training set [Figs. 5(a)-5(e)] showed similar performance between the direct and iterative prediction method. In contrast, the spleen [ Fig. 5  However, with subsequent addition of structures to the training set, not only does the performance improve for the structures newly included in the training set, but also for the structures excluded from the training set, as seen in Fig. 4. For example, when adding lung tumors to the training set (going from model AB to model ABC), the performance improves for lung tumors, pancreatic tumors, liver, submandibular gland, and spleen. This improved performance cannot originate from learning any structure-specific features within the training set, but must be attributed to learning contextual features and making predictions based on the context from the provided input contour and image slice. For example, Model ABCDEF achieves a relative APL of 0.35 for the spleen, despite not having been trained on this structure before. This suggest that user interaction and thus time spent contouring can be decreased for de-novo contouring, when using such a model. Additionally, it is shown in Supplemental Material 4 that most structures can achieve a high segmentation accuracy, even if that structure was excluded from training set and the training set consisted of all remaining structures. In contrast, a model trained on only a single structure fails to generalize as shown in Supplemental Material 5.
The ability to generalize to unseen structures distinguishes this approach from previous work, in which previously contoured image slices are used to assist in segmentation. 11,12 Certain structures appear to have a larger impact on the model's ability to generalize to unseen structures. For example, four new structures were added to the training set between model A and model AB, but the improvement in segmentation performance is almost exclusively seen in the newly added structures. In contrast, between model AB and ABC, the segmentation performance improved for all structures, although only lung tumors were added to the training set. The improvement in segmentation of unseen structures and thus, the ability to generalize, could potentially be attributable to the high diversity in the tumor appearance.
At the same time, the structures that have already been included in previous models show further improvement upon expanding the training set by new structures, see Figs. 4(b) and 4(c). This suggests that the quality of segmentation of an individual structure benefits from an increased diversity in the training set. This might be attributed to a refinement of Medical Physics, 0 (0), xxxx the contextual features used in the segmentation over the shape-specific features of each structure. In this work, it has not been tested if the incremental changes between subsequent models are statistically significant due to the small and variable size of the test dataset for the different structures. This needs to be addressed in future work. The interactive contouring approach proposed here achieves similar or higher DSC than fully automatic segmentation approaches reported for the heart, 25 lung, 25 spinal cord, 25 lung tumor 26 and pancreas tumor, 27 but lower performance for the esophagus, 25 and spleen. 27 This comparison is limited to studies in which fully automatic solutions were reported for the same dataset as used in this report. It is noted that the current approach benefits from input of an initial manual contour.
The contextual deep learning approach used here did not achieve a DSC comparable to the best performing fully-automatic models for unseen structures such as the spleen. 27 Note, however, that those fully automatic models are trained on a dataset of images that include the spleen, whereas the model proposed in this work, was trained on images that did not include the spleen. Therefore, this indicates that the current approach can be applied to unseen structures and may decrease delineation times for such structures for which no automatic solutions exist. Furthermore, contextual deep learning achieved high DSC values for pancreas and lung tumors compared to fully automated solutions, highlighting the benefits of this approach for segmentation of highly variable structures.

4.B. Comparison of contour prediction approaches
While direct prediction shows similar performance as iterative prediction for most structures, iterative prediction performed significantly better for an anatomical structure that was omitted from the training set, that is, the spleen.
In the direct prediction approach, the interslice information is not used and the slice distance varies. This may increase the error of the direct prediction approach. In the iterative approach, an initial error or error in a subsequent prediction can accumulate throughout the CT scan. The improved performance of iterative over direct prediction suggests that the cumulative segmentation error through iterative prediction is smaller than the error of direct prediction when compared over the same interslice distance. For direct prediction far from the initial input slice, there may be large differences between the target and input image slices. If these differences are too large, the model will fail to relate the two slices based on context and thus hinder accurate segmentation. In practice, if the error of either prediction approach becomes to large at a certain distance from the initial contoured slice, a user can adjust the predicted contour. These predicted contours in return can then be used to make a prediction.
The difference between iterative and direct prediction is less marked for structures included in the training set. This may be because the model still learns some features specific to the structures in the training set. These learnt features may account for the improved performance of iterative prediction. Future work needs to study the impact of the proposed semi-automatic contouring tool on interobserver variability on the final segmentation. Furthermore, it is interesting to study the performance of the segmentation when compared to a reference segmentation from the same user or from different ones.

CONCLUSION
This work investigated the use of contextual information by a deep learning model using three input channels. The first channel was used for the image to be segmented. The second and third were used to provide context from an image slice and corresponding manual delineation of the structure to be contoured. The experiments demonstrated that a model trained using multiple distinct structures can accurately make predictions for structures within the training set and achieve performance similar or higher than fully automatic approaches, but more importantly it could generalize and predict structures excluded from the training set based on the provided context alone.
Two different prediction procedures (direct and iterative prediction) were introduced and compared. For unseen structures it was found that the error propagated using iterative prediction remains small when compared to the direct prediction approach.
The proposed contextual deep learning approach allows prior context to be incorporated into deep learning contouring to facilitate interactive contouring of CT imaging. Such an approach may enable faster de-novo contouring in clinical practice, where manual contouring is required and there is insufficient training data to train a specific model.

ACKNOWLEDGMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 766276. KAV acknowledges funding support from CRUK (A15935) and the CRUK Radnet Centre (A28736).