A research on the improved rotational robustness for thoracic organ delineation by using joint learning of segmenting spatially‐correlated organs: A U‐net based comparison

Abstract Purpose To study the improvement in rotational robustness gained by joint learning of spatially‐correlated organ segmentation (SCOS) for thoracic organ delineation. The network structure itself is not the focus of this work. Methods SCOS was implemented in a U‐net‐like model (abbr. SCOS‐net) and evaluated on unseen rotated test sets. Two hundred sixty‐seven patients with thoracic tumors (232 without rotation and 35 with rotation) were enrolled. The training and validation images came from 61 randomly chosen unrotated patients. The test data comprised two sets. The first consisted of 3000 slices from the remaining 171 unrotated patients, which we rotated by -30° to 30°. The second consisted of the images from the 35 rotated patients. The lung, heart, and spinal cord were delineated by experienced radiation oncologists and regarded as ground truth. SCOS‐net was compared with its single‐task learning counterparts, two published multi-task learning settings, and rotation augmentation. Evaluation used Dice, three distance metrics (maximum and 95th percentile Hausdorff distance, and average surface distance (ASD)), and the number of cases with ASD = infinity. The results were analyzed with visualization techniques. Results Without augmentation, SCOS‐net achieves the best lung and spinal cord segmentations and comparable heart delineation. With augmentation, SCOS performs better in some cases. Conclusion The proposed SCOS can improve rotational robustness and is promising for clinical application owing to its low network capacity and computational cost.

5][6] However, different immobilization techniques affect FCN performance, since they can cause patient rotation in the transverse plane and hence produce input images unlike those seen during training.
The problem is that the model has not learned well-generalizing representations: the supine position is the most common, so few rotated samples exist in the training set. Rotation augmentation [7][8][9][10][11] is a potential remedy. By increasing the number of rotated samples in training, it helps a model learn to segment organs in rotated images. Nalepa et al. 12 showed that rotation augmentation does boost a model's generalization ability. However, choosing the proportion of augmented samples at each angle is difficult, and the enlarged training set places a burden on computing hardware.5][16][17]
The settings of the related tasks vary. An image-level classification was the auxiliary task in Tao He et al.'s work: 15 classifying whether a computed tomography (CT) image contained the target organs helped the segmentation task filter out false-positive pixels. Similar works were reported by Chakravarty et al. 18 and Zhou et al. 19 Shape priors, including the distance map and the contour map, were the complementary tasks in Fernando Navarro et al.'s segmentation work. 20 Other shape knowledge, such as distinctive curves 21 and signed distances, 22 has also been adopted. Although these methods improved network performance, they increased the number of network parameters and required more training time. Moreover, no rotated samples were explicitly reported in their test sets, so the rotational robustness of these MTL methods remains unknown.
This paper studies the improvement in rotational robustness achieved by joint learning of spatially-correlated organ segmentation (SCOS). It is assessed in a U-net 23 like architecture for thoracic organ segmentation (i.e., bilateral lungs, heart, and spinal cord). The training set contains only CT images scanned in the supine position (i.e., without rotation), whereas the test set contains only rotated samples.
In our work, "spatially-correlated organs" are organs located close to each other in the human body, such as neighboring organs in the thorax. SCOS denotes the joint task of segmenting these organs.
This work consists of four comparisons. The first compares SCOS with its single-task learning (STL) counterparts as an ablation study. The second compares SCOS with other published multi-task settings. The third compares SCOS with rotation augmentation. These three comparisons were conducted on slices that we rotated ourselves. The fourth compares the above settings on real rotated cases. Furthermore, we use visualization techniques to explore why SCOS achieves rotational robustness. Specifically, the contributions of this work are as follows:
1. We study the improvement in rotational robustness achieved by joint learning of spatially-correlated organ segmentations.
2. The proposed SCOS involves only segmentation tasks of spatially correlated organs, rather than additional constraints, which saves network capacity and training time.
3. We analyze the results with visualization techniques for FCN interpretation.
To the best of our knowledge, this is the first work to study the rotational robustness achieved by SCOS and to explain its improved performance with visualization techniques. The remainder of this paper is organized as follows. Section 2 details the networks of SCOS, its STL counterparts, the other comparative MTL configurations, rotation augmentation, and the experiments. The results are presented in Section 3 and discussed in Section 4. Section 5 concludes the paper.

Learning task setting of SCOS and its STL counterparts in U-net like architecture
Figure 1 details the model trained with the proposed multi-task setting (abbreviated as SCOS-net) and its STL counterparts (abbreviated as STL-net).

Details of STL-net
The STL-net is open-source. 24 Compared with the original U-net, 23 the number of filters in each layer is halved because of the limited computational capacity of our hardware. In the decoding path, transposed convolution 25 is used instead of up-sampling the feature map followed by a 2 × 2 convolution. All convolutions in STL-net are padded so that input and output have the same size. All convolution layers use ReLU activation except the last one, which uses a sigmoid.

Details of SCOS-net
As shown in Figure 1b, SCOS-net shares the network backbone with STL-net. At the end of the network, SCOS-net has three task-specific convolution layers, one for each OAR's segmentation.
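A minimal sketch of the task-specific output layers, assuming each is a 1 × 1 convolution followed by a sigmoid (matching the 1 × 1 kernel of the last convolution in Figure 1 and the sigmoid described for STL-net). Feature maps are plain nested lists so that the example runs without a deep learning framework; `task_heads` and `head_params` are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def task_heads(features, weights, biases):
    """Apply one 1x1 convolution + sigmoid per organ to shared features.

    features: H x W nested list of C-dimensional feature vectors from the
              shared backbone (identical input for every task).
    weights:  list of T weight vectors, each of length C (one per organ).
    biases:   list of T scalars.
    Returns T probability maps of size H x W (e.g. lung, heart, spinal cord).
    """
    maps = []
    for w, b in zip(weights, biases):
        # A 1x1 convolution is a per-pixel dot product over channels.
        maps.append([[sigmoid(sum(wc * fc for wc, fc in zip(w, px)) + b)
                      for px in row] for row in features])
    return maps


def head_params(channels, tasks):
    """Extra parameters of the task-specific heads: C weights + 1 bias each."""
    return tasks * (channels + 1)


# Toy 1 x 2 feature map with C = 2 channels and three organ heads.
feats = [[[1.0, 0.0], [0.0, 1.0]]]
probs = task_heads(feats, weights=[[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]],
                   biases=[0.0, 0.0, 0.0])
```

Only these heads differ between tasks, so SCOS-net adds C + 1 parameters per organ on top of one shared backbone, whereas the STL setting trains a separate backbone per organ. If the last decoder block has C = 32 channels (an assumption: half of the original U-net's 64), the three heads add only 99 parameters.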

Other published multiple task settings in U-net like architecture
Figure 2 shows two published MTL settings for comparison with SCOS-net. Figure 2a shows the network details of the three methods, which share a structure similar to that of SCOS-net. Figure 2b illustrates the different learning-task settings. The distance map 20 and contour map 20 are auxiliary tasks that help the network learn shape representations. The distance map was generated by applying the Euclidean distance transform to each organ's segmentation. The contour map was each organ's binary edge.
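The two auxiliary targets can be sketched as below. Assumptions: the distance transform assigns each pixel inside an organ its Euclidean distance to the nearest background pixel; "normalized" (as in the Figure 2 caption, which describes the distance map as a cumulative sum of normalized distance transforms) means divided by that organ's maximum; and the binary edge uses 4-connectivity. The brute-force transform is for illustration only; in practice `scipy.ndimage.distance_transform_edt` computes the same quantity efficiently. All function names are hypothetical.

```python
import math


def distance_transform(mask):
    """Brute-force Euclidean distance transform of a binary mask: each
    foreground pixel gets its distance to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(i, j) for i in range(h) for j in range(w) if not mask[i][j]]
    if not background:
        raise ValueError("mask needs at least one background pixel")
    return [[min(math.hypot(i - a, j - b) for a, b in background) if mask[i][j] else 0.0
             for j in range(w)] for i in range(h)]


def distance_map(masks):
    """Sum over organs of each organ's distance transform scaled to [0, 1]."""
    h, w = len(masks[0]), len(masks[0][0])
    total = [[0.0] * w for _ in range(h)]
    for mask in masks:
        dt = distance_transform(mask)
        peak = max(max(row) for row in dt)
        if peak > 0:  # skip organs absent from this slice
            for i in range(h):
                for j in range(w):
                    total[i][j] += dt[i][j] / peak
    return total


def contour_map(mask):
    """Binary edge: foreground pixels with a 4-neighbour outside the organ."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                # Edge pixel: a neighbour lies outside the image or the organ.
                if not (0 <= a < h and 0 <= b < w) or not mask[a][b]:
                    out[i][j] = 1
                    break
    return out
```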

Rotation augmentation with SCOS and its STL counterparts
The SCOS-net and STL-net in Figure 1 were also trained with randomly rotated samples (abbreviated as SCOS aug -net and STL aug -net), with rotation angles ranging from -30° to 30°.
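A minimal sketch of on-the-fly rotation augmentation. It assumes the angle is drawn uniformly from [-30°, 30°] for each sample (the text only says "randomly"), that the image and its label share the angle, and that nearest-neighbour sampling is used so labels stay binary. Pixels rotated in from outside the slice are set to 0, an assumed background value. With NumPy arrays, `scipy.ndimage.rotate(img, angle, reshape=False, order=0)` is a practical alternative; `rotate_nearest` and `random_rotation` are hypothetical names.

```python
import math
import random


def rotate_nearest(img, angle_deg, fill=0):
    """Rotate a 2D grid about its centre with nearest-neighbour sampling,
    keeping the original size; pixels mapped from outside become `fill`."""
    h, w = len(img), len(img[0])
    ci, cj = (h - 1) / 2, (w - 1) / 2
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Inverse mapping: locate the source pixel that lands on (i, j).
            di, dj = i - ci, j - cj
            si = round(ci + c * di - s * dj)
            sj = round(cj + s * di + c * dj)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = img[si][sj]
    return out


def random_rotation(image, label, max_angle=30.0, rng=random):
    """Draw one angle in [-max_angle, max_angle] and apply it to both the
    image and its label so that they stay aligned."""
    angle = rng.uniform(-max_angle, max_angle)
    return rotate_nearest(image, angle), rotate_nearest(label, angle), angle
```

Nearest-neighbour sampling is the safe choice for labels; for the image itself, bilinear interpolation would be a common alternative.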

Data acquisition
The CT images were collected from 267 patients with thoracic tumors receiving radiation treatment in our department. Among them, 232 patients underwent CT scans in the supine position (i.e., without rotation). The CT slices from the remaining 35 patients showed rotations ranging from 1.5° to 9.34°. The CT slices were acquired on a LightSpeed (GE Healthcare, Chicago, USA) or a Brilliance CT Big Bore system (Philips Healthcare, Best, the Netherlands). They had 512 × 512 pixels with a slice thickness of 5 mm and a spatial resolution of approximately 1 mm. All OARs were delineated by experienced radiation oncologists and served as ground truth in our work. The labeled organs included the lung, heart, and spinal cord.

Image preprocessing
To save computing resources, all images were cropped to 512 × 256 by removing a 512 × 128 strip from both the top and the bottom. Image intensities were linearly mapped from -135 HU to 215 HU onto 0-255.
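The two preprocessing steps can be sketched as follows. Assumptions: intensities outside the [-135, 215] HU window (centre 40 HU, width 350 HU) are clipped, which the text does not state explicitly; the crop removes 128 rows at each end of the first array axis; and the mapped values are kept as floats because rounding to integers is not mentioned. Function names are hypothetical.

```python
def crop_rows(slice_2d, margin=128):
    """Drop `margin` rows from the top and the bottom of a 512 x 512 slice,
    leaving 256 rows of 512 pixels."""
    return [row[:] for row in slice_2d[margin:len(slice_2d) - margin]]


def window_intensity(hu, low=-135.0, high=215.0):
    """Linearly map [low, high] HU onto [0, 255]; values outside are clipped."""
    hu = min(max(hu, low), high)
    return (hu - low) / (high - low) * 255.0
```

For example, 40 HU (the window centre) maps to 127.5, and air at about -1000 HU is clipped to 0.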

Data splitting
Of the data collected from 267 patients (232 without rotation, 35 with rotation), 1240 CT slices from 61 randomly selected unrotated patients formed the training set (856 slices) and the validation set (384 slices). The test data consisted of two sets. The first test set consisted of two-dimensional (2D) images that we rotated ourselves: 3000 slices from the remaining 171 unrotated patients. All experimental images contained the lung, heart, and spinal cord. To investigate robustness against rotation, we applied a random rotation (-30° to 30°) to these slices and their labels. The second test set was three-dimensional (3D) data from the 35 rotated patients.
This splitting strategy (i.e., a small amount of training data but a large amount of test data) simulates a clinical scenario in which a network trained on a limited quantity of data is applied to a large number of cases.

Experiment setup and implementation
We implemented SCOS-net, the STL-nets, and the two comparative MTL settings in Python 26 and trained them with adaptive moment estimation (Adam) 27 using a learning rate of 10^-3, a batch size of 8, and 3000 epochs. The training losses (L) for SCOS-net and the STL-nets were both Dice loss functions, as defined in formulas (1) and (2), where X was the model output, Y was the ground truth, Dice (ranging from 0 to 1) measured the overlap between X and Y, i indexed the samples, and N was the number of training samples. L SCOS−net was designed to guide SCOS-net to segment three organs simultaneously, so Equation (2) summed one Dice loss per organ. The loss functions of the two comparative MTL configurations (denoted tasks 1 and 2) involved L seg , L contour , and L distance . L seg and L contour corresponded to the segmentation map and the contour map, respectively, and were both Dice losses as defined in formula (2). L distance corresponded to the distance map, in which g(x) was the ground-truth value of the pixel at location x and p(x) was the predicted value.
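Formulas (1), (2), and the task 1-2 losses are not reproduced in this text, so the sketch below assumes common forms: a soft Dice with a small smoothing term `eps`, an STL loss of 1 minus the mean Dice over the N samples of a batch, the SCOS loss as the sum of the three per-organ Dice losses (as Equation (2) is described), and L distance as the mean squared error between p(x) and g(x). These are illustrative choices, not the authors' exact definitions; all names are hypothetical.

```python
def soft_dice(x, y, eps=1e-6):
    """Soft Dice between a predicted probability map x and a binary ground
    truth y (both flattened to equal-length lists); eps avoids 0/0."""
    inter = sum(a * b for a, b in zip(x, y))
    return (2.0 * inter + eps) / (sum(x) + sum(y) + eps)


def dice_loss(preds, truths):
    """Assumed STL loss: 1 - mean Dice over the N samples of a batch."""
    n = len(preds)
    return 1.0 - sum(soft_dice(x, y) for x, y in zip(preds, truths)) / n


def scos_loss(preds_by_organ, truths_by_organ):
    """SCOS loss: one Dice loss per organ (lung, heart, spinal cord), summed."""
    return sum(dice_loss(p, t) for p, t in zip(preds_by_organ, truths_by_organ))


def distance_loss(pred, truth):
    """Assumed L_distance: mean squared error between predicted p(x) and
    ground-truth g(x) distance-map values."""
    return sum((g - p) ** 2 for p, g in zip(pred, truth)) / len(truth)
```

Summing rather than averaging the per-organ terms gives each organ's Dice loss equal weight in the shared gradient, which matches the description of Equation (2).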

Assessment metrics
Five metrics were adopted in this paper: Dice, maximum Hausdorff distance (maxHD), 95th percentile of Hausdorff distance (95%HD), average surface distance (ASD), and the number of cases with ASD equal to infinity (N(inf)). Dice was defined in formula (3). The distance metrics were calculated as follows:
1. maxHD and 95%HD: X and Y had the same definitions as in formula (3), and ‖ ‖ was the Euclidean distance. H(X, Y) was the set of distances from all pixels in X to their nearest pixels in Y; H(Y, X) was defined similarly, with distances from Y to X. maxHD was the maximum over H(X, Y) and H(Y, X), and 95%HD was the 95th percentile of all values in H(X, Y) and H(Y, X). "inf" denoted infinity.

N(inf)
$N(\mathrm{inf}) = \#(\mathrm{ASD} = \mathrm{inf})$ (10)
in which # denoted the number of cases. ASD = inf indicated that the model completely failed to recognize the target organ.
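A sketch of the five metrics on sets of pixel coordinates. Assumptions: distances are in pixels (multiply by the pixel spacing for mm); 95%HD uses the nearest-rank percentile of the pooled directed distances; ASD uses the symmetric form (Σ d(x_i, Y) + Σ d(y_i, X)) / (n_X + n_Y) over contour pixels, which the caller supplies; and a distance is infinite when either set is empty, i.e. the organ was not predicted. `math.dist` requires Python 3.8 or later; all function names are hypothetical.

```python
import math


def dice(x, y):
    """Dice between two sets of foreground pixel coordinates."""
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def directed(a, b):
    """Distances from every point in a to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff(x, y, percentile=100):
    """maxHD (percentile=100) or 95%HD (percentile=95) over the pooled
    directed distances H(X, Y) and H(Y, X); inf if either set is empty."""
    if not x or not y:
        return math.inf
    d = sorted(directed(x, y) + directed(y, x))
    rank = math.ceil(percentile / 100 * len(d))  # nearest-rank percentile
    return d[max(rank, 1) - 1]


def asd(x_contour, y_contour):
    """Average surface distance over contour pixels (assumed symmetric form)."""
    if not x_contour or not y_contour:
        return math.inf
    total = sum(directed(x_contour, y_contour)) + sum(directed(y_contour, x_contour))
    return total / (len(x_contour) + len(y_contour))


def n_inf(asd_values):
    """N(inf): number of cases whose ASD is infinite (organ missed entirely)."""
    return sum(1 for v in asd_values if math.isinf(v))
```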

Evaluation setup
The evaluations on the first and second test sets were performed in 2D and 3D, respectively. Post-processing that keeps only the largest 3D connected component was applied to the second test set. It aimed to assess the clinical effectiveness of SCOS, since such post-processing is commonly used in practice.
FIGURE 10 (continued) (a–c) are the gradient maps corresponding to lung, heart, and spinal cord segmentations, respectively. In the first row, the regions of lung, heart, and spinal cord are filled in green, red, and pink, respectively.
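The largest-connected-component post-processing used for the second test set can be sketched as a breadth-first flood fill. It assumes 6-connectivity and one organ per mask; the connectivity actually used is not stated. For NumPy volumes, `scipy.ndimage.label` provides the same labelling; `largest_component` is a hypothetical name.

```python
from collections import deque


def largest_component(voxels):
    """Keep only the largest 6-connected component of a 3D binary mask given
    as a set of (z, y, x) voxel coordinates."""
    remaining, best = set(voxels), set()
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in steps:
                nb = (z + dz, y + dy, x + dx)
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best
```

Isolated false-positive islands on individual slices are discarded by this step, which is why a low 2D N(inf) can translate into a clean 3D result.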

Results of SCOS-net and its comparison with STL-nets
The results are illustrated in the boxplots of Figure 3. SCOS-net achieved better segmentation of the lung and spinal cord than STL-net, and comparable performance for the heart. STL-net produced many more failure cases than SCOS-net: N(inf) = 669 for STL-net versus N(inf) = 34 for SCOS-net.

Comparison with other multiple task settings
Figure 4 compares the different MTL settings. SCOS-net achieved the best segmentations overall. For the spinal cord, the N(inf) of SCOS-net is substantially smaller than those of the two comparative settings, although its Dice and distance metrics are not the best.

Comparison with rotation augmentation
The boxplots in Figure 5a demonstrate that rotation augmentation indeed improves rotational robustness, since SCOS aug -net and STL aug -net perform better than SCOS-net. The bar charts in Figure 5b show the Dice differences between SCOS aug -net and STL aug -net: SCOS aug -net achieves better lung and spinal cord segmentations than STL aug -net on some occasions, and heart delineation comparable to that of STL aug -net.

Test results on real rotated cases
Figure 6a illustrates the 3D results on real rotated cases.
Without rotation augmentation, SCOS-net performs best. The smallest spinal cord N(inf) of SCOS-net on 2D slices (shown in Figures 3 and 4) contributed to its good 3D spinal cord segmentation, since post-processing removed the unconnected areas. With rotation augmentation, SCOS aug -net and STL aug -net perform better than SCOS-net, demonstrating that augmentation during training improves rotational robustness.
To further study SCOS, we plotted a bar chart (Figure 6b) of the Dice differences between SCOS aug -net and STL aug -net on the 3D slices containing all three organs. In Figure 6b, the numbers of cases in which SCOS aug -net performs better or worse than STL aug -net are comparable. However, in several cases, the Dice gain of SCOS aug -net over STL aug -net is large. Figure 7 shows such an example for qualitative comparison: the lung segmentations in all three slices and the heart segmentation in slice No. 70 suggest the superiority of SCOS.

DISCUSSION
In this section, we further explore the comparison results.

Discussion on the comparison between SCOS and its STL counterparts
The proposed SCOS setting aims to supplement the local context with pixelwise label knowledge by sharing organ-related features in each layer. In this way, the richer knowledge in the feature maps benefits the model in two ways: it (1) reduces false positives and (2) decreases sensitivity to small context changes, because more pixels are involved in each task. The following two subsections give a detailed analysis.

4.1.1 Investigation of false positives by SCOS-net and its STL counterparts
A false positive is a pixel incorrectly labeled as the target. False positives lead to poor distance metrics despite good Dice, as in the lung segmentation in Figure 3. To find the reason, we use guided Grad-CAM 28 (i.e., a combination of guided back-propagation 29 and gradient-weighted class activation mapping 28, applied here for FCN interpretation) to visualize the feature maps of validation image No. 144 in Figure 8.
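The Grad-CAM part of guided Grad-CAM reduces to a few array operations once a framework has supplied the last-layer activations A_k and the gradients of the target score with respect to them. The sketch below assumes those inputs are given (for a segmentation output, the target score is commonly taken as the sum of the organ's output probabilities, an assumption here) and up-samples with nearest-neighbour interpolation, whereas the original method uses bilinear interpolation. Function names are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map for one target: weight each of the K feature maps A_k by
    the spatial mean of d(score)/dA_k, sum, and apply ReLU.

    activations, gradients: K nested lists of shape h x w.
    """
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for a_k, g_k in zip(activations, gradients):
        alpha = sum(map(sum, g_k)) / (h * w)  # global-average-pooled gradient
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * a_k[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]


def upsample_nearest(grid, factor):
    """Nearest-neighbour up-sampling by an integer factor."""
    return [[v for v in row for _ in range(factor)] for row in grid for _ in range(factor)]


def guided_grad_cam(activations, gradients, guided_backprop, factor):
    """Element-wise product of the up-sampled Grad-CAM map and the guided
    back-propagation map (given at input resolution)."""
    cam = upsample_nearest(grad_cam(activations, gradients), factor)
    return [[c * g for c, g in zip(cr, gr)] for cr, gr in zip(cam, guided_backprop)]
```

The ReLU keeps only regions that raise the target score, and multiplying by the guided back-propagation map restores pixel-level detail that the coarse class activation map lacks.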
Although SCOS-net and STL-net both match the ground truth well, their features for segmenting the lung differ (Figure 8a): pixels inside the lung contribute to SCOS-net, whereas STL-net relies more on the contour. This may be because STL-net receives no knowledge from other organs. To clarify this point further, we discuss an example of a false positive (test image No. 1726 in Figure 9).
Figure 9a shows a false positive in STL-net's lung segmentation: p 1 is a false positive and p 2 is a true positive. In Figure 9b, the receptive fields (RFs) of p 1 and p 2 contain different contexts; part of the heart lies in the RF of p 2 . In Figure 9c, the context of p 2 is distinguishable from that of p 1 for SCOS-net, but similar to that of p 1 for STL-net. Fusing the feature map with the input image (Figure 9d) shows that most heart pixels correspond to zero values for STL-net, so the part of the heart in the RF of p 2 does not help distinguish p 2 from p 1 . Consequently, STL-net misclassifies p 1 . For SCOS-net, a large number of pixels show diverse feature values and hence give the network more information.
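To gauge how large the RFs in Figure 9b can be, the sketch below applies the standard recurrence: each layer grows the RF by (k - 1) × jump, and the jump multiplies by the stride. It assumes 2 × 2 max pooling between the four encoder levels, as in the original U-net, and takes the second bottleneck convolution as the last convolution layer of the encoding path. This is the theoretical RF; the effective RF is usually smaller. `receptive_field` is a hypothetical name.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) after a stack of layers, each given
    as (kernel, stride)."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= s            # strided layers spread later kernels further apart
    return r


conv3, pool2 = (3, 1), (2, 2)
encoder = []
for _ in range(4):            # four levels: two 3x3 convs then 2x2 max pooling
    encoder += [conv3, conv3, pool2]
encoder += [conv3, conv3]     # bottleneck: last convolutions of the encoder
rf = receptive_field(encoder)
print(rf)
```

Under these assumptions the script prints 140, i.e. about 14 cm at the stated ~1 mm resolution, a window wide enough to plausibly include parts of neighbouring organs such as the heart.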

Discussion on the pixel number involved by our SCOS-net and STL-net
To show how many pixels are involved in a specified task for SCOS-net and the STL-nets, we plotted in Figure 10 the gradient map (G) of the network output with respect to its input image (detailed in the Appendix).
Figure 10 shows similar segmentation results but different Gs for the two networks. In Figure 10a,c, SCOS-net involves more pixels in segmenting the lung and spinal cord, since its red area is clearly larger than that of STL-net. In Figure 10b, SCOS-net and STL-net cover red regions of similar size. This may explain why SCOS-net and STL-net segment the heart equally well, while SCOS-net delineates the lung and spinal cord better. Involving more knowledge in a task is likely to strengthen a model's robustness.

Discussion on the comparison between SCOS and other two MTL settings
Figures 4 and 6a show that SCOS performs better on unseen rotated images than tasks 1 and 2. A possible reason is that the spatial constraints of tasks 1 and 2 increase the network's sensitivity to image orientation, so these networks misclassify pixels when facing an unseen rotated slice.

Discussion on the comparison between SCOS and rotation augmentation
Figures 5a and 6a both demonstrate that augmentation improves rotational robustness. "SCOS+augmentation" segments better than "STL+augmentation" in some cases, possibly because SCOS exploits the data further under the same augmentation conditions.

CONCLUSION
In this paper, we study the improvement in rotational robustness achieved by joint learning of spatially-correlated organ segmentation. The proposed setting was implemented in a U-net-like network (abbreviated as SCOS-net).
It was validated on a large number of rotated samples, although the training set contained no rotated ones. SCOS-net was compared with its STL counterparts as an ablation study, and with two other published multi-task settings and rotation augmentation. The results showed that SCOS performed better than its STL counterparts and the other MTL settings. "SCOS+augmentation" achieved better segmentation than "STL+augmentation" in some cases. The performance of each setting was analyzed with visualization techniques to better interpret the improved rotational robustness.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

REFERENCES
How to cite this article: Zhang J, Yang Y, Fang M, Xu Y, Ji Y, Chen M. A research on the improved rotational robustness for thoracic organ delineation by using joint learning of segmenting spatially-correlated organs: A U-net based comparison. J Appl Clin Med Phys. 2023;24:e14096. https://doi.org/10.1002/acm2.14096

APPENDIX
The gradient map G was an M × N matrix (M = 512, N = 256) whose element g i,j was the gradient value corresponding to the input pixel in the ith row and jth column. The network segmentation (Y) and the input image (X) had the same size as G. g i,j was computed from the partial derivatives of the elements y k,p of Y (kth row, pth column) with respect to x i,j , the element of X in the ith row and jth column, followed by a ReLU.
With the ReLU, G only shows the pixels that have positive effects on the output. A larger g i,j indicates a greater impact on the network output.
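A finite-difference sketch of the gradient map, assuming g i,j is the ReLU of the derivative of the summed output (over all y k,p ) with respect to x i,j . A deep learning framework would obtain these derivatives in one backward pass with automatic differentiation; central differences are swapped in here only so the example runs without one, at the cost of two forward passes per pixel. `gradient_map` and the toy models in the tests are hypothetical.

```python
def gradient_map(model, image, h=1e-3):
    """g_ij = ReLU(d(sum_kp y_kp)/dx_ij), estimated with central differences.

    model: callable mapping an M x N nested list to an M x N nested list.
    """
    m, n = len(image), len(image[0])
    g = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            plus = [row[:] for row in image]
            minus = [row[:] for row in image]
            plus[i][j] += h
            minus[i][j] -= h
            # Perturb one input pixel and watch the change in the total output.
            diff = sum(map(sum, model(plus))) - sum(map(sum, model(minus)))
            g[i][j] = max(diff / (2 * h), 0.0)  # ReLU keeps positive effects only
    return g
```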

FIGURE 1
Architectures of SCOS-net and its single-task learning counterparts (STL-net).The white and blue thick arrows indicate encoder blocks.The orange and black thick arrows indicate decoder blocks.The kernel sizes are 1 × 1 and 2 × 2 for the last convolution and the transpose convolution (with the stride of 2 × 2), respectively.Other kernel sizes are 3 × 3. concat, concatenation; conv, convolution; trans_conv, transpose convolution.

FIGURE 2 (A) Network structures of (a, b) the two comparative task settings and (c) SCOS-net. The learning task settings are illustrated in (B) for a clear view. (B) Learning task settings of the three methods: (a) the proposed SCOS; (b, c) the two comparative settings. The distance map is the cumulative sum of the normalized distance transforms of each organ's segmentation. The contour map is each organ's binary edge. concat, concatenation; conv, convolution; trans_conv, transpose convolution.
FIGURE 3 2D results of SCOS-net and STL-nets. STL, single-task learning; maxHD, maximum Hausdorff distance; 95%HD, 95th percentile of Hausdorff distance; ASD, average surface distance; Q1, 25th percentile; Q3, 75th percentile; IQR = Q3 - Q1; "2D" in the y-axis title: evaluation is performed on 2D slices.
FIGURE 4 2D results of the proposed SCOS-net and the two comparative settings. maxHD, maximum Hausdorff distance; 95%HD, 95th percentile of Hausdorff distance; ASD, average surface distance; Q1, 25th percentile; Q3, 75th percentile; IQR = Q3 - Q1; "2D" in the y-axis title: evaluation is performed on 2D slices.

FIGURE 7 Qualitative comparison between SCOS aug -net and STL aug -net on an exemplary 3D rotated case. The subscript "aug" means rotation augmentation.
FIGURE 8 Feature maps of validation image No. 144 produced by SCOS-net and STL-nets. (a–c) are the feature maps corresponding to lung, heart, and spinal cord segmentations, respectively. In the first row, the regions of lung, heart, and spinal cord are filled in green, red, and pink, respectively. The feature map relates to the last convolution layer and is obtained by guided Grad-CAM (i.e., a combination of guided back-propagation and gradient-weighted class activation mapping). The detailed feature maps are magnified in the white boxes for a clear view.
2. ASD, in which $d(y_i, X) = \min_{x \in X} \lVert y_i - x \rVert$ and $d(x_i, Y) = \min_{y \in Y} \lVert x_i - y \rVert$; X and Y had the same meanings as in formula (3), $\lVert \cdot \rVert$ was the Euclidean distance, and $n_X$ and $n_Y$ were the numbers of contour pixels in X and Y, respectively. "inf" had the same meaning as in formulas (7, 8).

FIGURE 9 Feature map of test image No. 1726. (a) Model outputs and ground truth. The false positive is highlighted in the yellow box; p 1 and p 2 are two compared pixels. (b) Input image (I) and the receptive fields (RFs) of p 1 and p 2 . RFs are shown in I as white boxes and magnified below I. (c) Average feature map of the last convolution layer in the encoding path. The white boxes have the same sizes and locations as those in (b). (d) Fusion of the average feature map and I. SCOS-net is the proposed network. The STL-net in this figure refers only to the lung segmentation model. In the first row, the regions of lung, heart, and spinal cord are filled in green, red, and pink, respectively. The average feature map is the mean of the outputs of all filters, interpolated to the size of I.
FIGURE 10 Gradient maps of validation image No. 144 by SCOS-net and STL-nets.
AUTHOR CONTRIBUTIONS
All authors contributed to the study conception and design. Data collection was performed by Yiwei Yang, Min Fang, Yujin Xu, and Yongling Ji. Jie Zhang implemented the model, analyzed the results, and wrote the first draft of the manuscript. Ming Chen designed the whole experiment.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (82001928), the Natural Science Foundation of Zhejiang Province (LQ20H180016), and the Zhejiang Key Research and Development Project (2019C03003).
FIGURE 5 …comparison results of SCOS-net and rotation augmentation. (B) Bar chart of the 2D Dice difference between SCOS aug -net and STL aug -net. In (B), (a–c) refer to lung, heart, and spinal cord segmentations, respectively. The subscript "aug" means rotation augmentation.