Aberystwyth University BUS-Set

Purpose: BUS-Set is a reproducible benchmark for breast ultrasound (BUS) lesion segmentation, comprising publicly available images, with the aim of improving future comparisons between machine learning models within the field of BUS. Method: Four publicly available datasets were compiled, creating an overall set of 1154 BUS images from five different scanner types. Full dataset details have been provided, which include clinical labels and detailed annotations. Furthermore, nine state-of-the-art deep learning architectures were selected to form the initial benchmark segmentation result, tested using five-fold cross-validation and MANOVA/ANOVA with the Tukey statistical significance test at a threshold of 0.01. Additional evaluation of these architectures was conducted, exploring possible training bias and the effects of lesion size and type. Results: Of the nine state-of-the-art benchmarked architectures, Mask R-CNN obtained the highest overall results, with the following mean metric scores: Dice score of 0.851, intersection over union of 0.786, and pixel accuracy of 0.975. MANOVA/ANOVA and Tukey test results showed Mask R-CNN to be statistically significantly better than all other benchmarked models, with a p-value < 0.01. Moreover, Mask R-CNN achieved the highest mean Dice score of 0.839 on an additional 16-image dataset that contained multiple lesions per image. Further analysis on regions of interest was conducted, assessing Hamming distance, depth-to-width ratio (DWR), circularity, and elongation, which showed that Mask R-CNN's segmentations maintained the most morphological features, with correlation coefficients of 0.888, 0.532, and 0.876 for DWR, circularity, and elongation, respectively. Based on the correlation coefficients, statistical testing indicated that Mask R-CNN was only significantly different from Sk-U-Net. Conclusions: BUS-Set is a fully reproducible benchmark for BUS lesion segmentation obtained through the use of public datasets and GitHub. Of the state-of-the-art convolutional neural network (CNN)-based architectures, Mask R-CNN achieved the highest performance overall; further analysis indicated that a training bias may have occurred due to the lesion size variation in the dataset. All dataset and architecture details are available on GitHub: https://github.com/corcor27/BUS-Set, which allows for a fully reproducible benchmark.


INTRODUCTION AND RELATED WORK
Breast cancer is one of the most common and deadliest types of cancer, with an estimated 55 000 women and 370 men being diagnosed in the United Kingdom every year. 1 To date, the disease's 5-year average survival rate is around 85%, with the primary preventative step for reducing mortality rates being early diagnosis. 2 Ultrasound (US) imaging is regularly used in the detection and staging of breast lesions, being cheaper and more accessible than the alternative screening method of mammography. 3 However, the main drawback of US imaging is its reliance on a radiologist's experience, with scans varying greatly in complexity, image quality, speckle noise, and lesion morphology. 4 Furthermore, lesions may be indistinguishable from surrounding tissue, increasing the chances of missed detection (see Figure 1 for an example). These challenges can be mitigated using computer-aided diagnosis (CAD), which can assist and improve the accuracy of a radiologist's evaluation. 5 There are numerous examples of CAD systems, which are based on a range of traditional handcrafted and/or deep learned features. 6 Although traditional image processing methods have existed longer, recent improvements in deep learning methods have pushed convolutional neural networks (CNN) and transformer neural networks (TNN) to the forefront of computer vision and image analysis. 7 As the field of deep-learning-based segmentation has been continuously expanding since its inception, it has generated many publications within the field of breast ultrasound (BUS) segmentation. One of the main issues with these studies is that, due to differences in evaluation datasets and metrics, comparisons are often difficult and open to interpretation. Therefore, it has become important to create a reproducible benchmark dataset for BUS segmentation. Until now, a hurdle for this research area has been the lack of publicly available data that can be used in a coordinated fashion.
Currently, there are five publicly available BUS datasets: OASBUD, 8 RODTOOK, 9 UDIAT, 10,11 BUSIS, 12 and BUSI. 13 The aim of this paper is to create a reproducible benchmark from the currently available public BUS datasets and to create a baseline for future comparisons in BUS lesion segmentation. Furthermore, we evaluate the datasets based on several characteristics and calculate lesion shape features for subsequent comparisons. Several deep learning architectures are evaluated with the assembled benchmark dataset, and the results are reported. Additional comparison of the predicted segmentation results is performed, assessing whether morphological shape features have been retained and what role training bias plays in the performance of the networks. Because our analysis includes comparisons with respect to lesion type, the most recently released BUSIS dataset has been excluded from this paper, as it does not provide lesion type labels in the released version. Additionally, the benchmark is designed so that datasets can be added when they become available.
FIGURE 1 An example US image and corresponding manual mask taken from the UDIAT dataset, 10 displaying the lesion's similarity to the surrounding tissue. US, ultrasound.
In the literature, there have been several advancements in breast lesion segmentation by means of fully convolutional networks (FCNs), a type of CNN, and TNNs. For instance, Yap et al. explored the use of several deep learning methods, including U-Net and transfer learning using FCN-AlexNet. 10 Additionally, they analyzed the effects of key characteristics of their datasets, including lesion size and ratio of the segmented lesion region of interest (mask). In later work, Yap et al. used transfer learning models pre-trained on ImageNet for automatic segmentation of breast lesions. 11 When considering a Dice score > 0.5, they achieved 89.6% and 60.6% segmentation accuracy for benign and malignant masses, respectively. Hu et al. improved the effectiveness of FCN by addressing the issues of blurry boundaries and low contrast in images. They proposed combining a dilated FCN with a phase-based active contour model, and concluded that the dilated convolution layers improved the extraction of spatial details. 14 Chiao et al. explored the use of Mask R-CNN for the segmentation of BUS images. However, their results were reported using a validation set (no test set was defined) and no Dice evaluation was performed, making it difficult to gauge the architecture's full performance with respect to other existing approaches. 15 Byra et al. experimented with a variant of the U-Net model called the Selective kernel U-Net (Sk-U-Net), which utilizes an attention mechanism that automatically adjusts kernel sizes and the network's receptive field. Furthermore, they improved the network's ability to recognize biological objects at varying scales. They compared Sk-U-Net with a vanilla version of U-Net and found that Sk-U-Net outperformed U-Net on every assessed performance metric. 16
Gomez-Flores et al. explored the application of transfer learning architectures for BUS segmentation, comparing five architectures with a variety of backbones, but they used their own private dataset. 17 They further evaluated their study by applying 10-fold cross-validation to analyze the architectures' performance on smaller datasets, concluding that Deeplabv3+ performed best across the majority of their test sets. In the same year, Shareef et al. proposed the enhanced small tumor-aware network (ESTAN), a U-Net version that utilized two extraction encoders with column-wise kernels to adjust to breast anatomy, to overcome the poor segmentation performance achieved by state-of-the-art deep learning approaches on small breast lesions. 18 They compared the ESTAN model against nine state-of-the-art architectures, using a combination of the public BUSIS, BUSI, and UDIAT datasets. Zhuang et al. also proposed a U-Net variant, RDAU-NET, which used a Residual Dilated Attention Gate U-Net to enhance edge information, encoder feature maps, and background suppression. 19 They evaluated their model against 10 other FCNs using BUSIS, BUSI, and UDIAT. Qu et al. 20 discussed the use of a full-resolution residual network, integrated with Global Attention Upsample and deep supervision. The model was tested on two datasets: one from Sun Yat-sen University Cancer Center and the other being UDIAT. More recently, Zhu et al. 21 published their RAT-Net model, a region-aware transformer network with a U-Net backbone. This model was compared with a standard U-Net configuration and four other transformer models.
A limitation of these works is the difficulty of comparing the results due to the vast differences in the datasets used. This is further emphasized when considering the differences in image preparation or the initial training hyperparameters. This shows the need for a reproducible benchmark to improve the current state of the art. We note that evaluation of how well the architectures maintain morphological and lesion type aspects is underexplored. Therefore, in this paper, we assemble a reproducible benchmark from four publicly available datasets and conduct further analysis, exploring different features of the generated prediction masks to enhance our evaluation of the architectures' segmentation performance on the benchmark dataset.

Overview
The creation of a reproducible dataset is based on four publicly available US datasets, with each set containing US images (containing at least one lesion) and mask annotations provided by a radiologist. These datasets are OASBUD, RODTOOK, UDIAT, and BUSI, which have all been collected at various institutions. A summary of all the datasets used in this study can be found in Table 1, which includes image details and annotations, detailing field-of-view and lesion type classifications. The OASBUD dataset was collected from patients at the Institute of Oncology, Warsaw, Poland, and consists of 100 US images (radial scans around the nipple), 48 benign and 52 malignant cases. All images contain only one lesion per image and were collected with an Ultrasonix SonixTouch Research US scanner. 8 The RODTOOK dataset is from the Sirindhorn International Institute of Technology (SIIT), Thammasat University, Pathum Thani, Thailand. At the time of accessing the dataset, not all images were accompanied by an annotated mask. Therefore, we excluded all the images without annotations, leaving a total of 149 US images: 59 benign and 90 malignant cases. Once again, each image contained only a single lesion, and the images were collected using a Philips iU22 US scanner. 9 The UDIAT dataset was collected at the UDIAT Diagnostic Centre of the Parc Tauli Corporation, Sabadell, Spain, using a Siemens ACUSON scanner. The dataset contains 163 US images: 109 benign and 54 malignant cases, with only one lesion per image. 10,11 The BUSI dataset from the Baheya Hospital, Cairo, Egypt was collected using LOGIQ E9 and LOGIQ E9 Agile US scanners and consists of 780 US images. 13 This can be broken down into 437 benign, 210 malignant, and 133 normal cases. Since all the other datasets used in this study only contain a single lesion per image and have no normal cases, we have removed the BUSI images that contain more than one lesion or no lesions. This reduces our BUSI total to 630 US images, containing 421 benign and 209 malignant cases.

Benchmark dataset
For our benchmark dataset, we first excluded all the images from each of the public datasets that contained more than one mass per image, which is consistent with the study of Byra et al. 16 The excluded BUS images will be used for additional analysis in Section 6.4. All the remaining US images were combined, giving a total of 1154 US images for the benchmark set. The models were tested using five-fold cross-validation, with an 80/20 split for the training/testing data. An initial preprocessing step was conducted on all datasets to remove scanner annotations contained in the US images; the details of the cropped locations are available on GitHub. Last, all images were resized to 224 × 224 pixels using bi-cubic interpolation, 22 which is currently the standard input dimension for the CNN architectures considered. 10,16 As one of the objectives of this paper is to make our results as reproducible as possible, we have included all the details about the images contained in each fold as comma-separated value files available on GitHub.
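As an illustration of the five-fold protocol described above, the following minimal Python sketch partitions a list of image identifiers into five disjoint folds, so that each fold serves once as the roughly 20% test portion. The function name `five_fold_splits`, the seed, and the shuffling strategy are assumptions made for this example; the fold membership actually used for the benchmark is the one published in the comma-separated value files on GitHub.

```python
import random


def five_fold_splits(image_ids, seed=0):
    """Yield (train_ids, test_ids) pairs for five-fold cross-validation.

    Illustrative only: the seed and shuffling are assumptions, not the
    split used in the benchmark (see the fold files on GitHub).
    """
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)           # deterministic shuffle
    folds = [ids[i::5] for i in range(5)]      # five near-equal folds
    for k in range(5):
        test_ids = folds[k]
        train_ids = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train_ids, test_ids


# Hypothetical identifiers 0..1153 standing in for the 1154 benchmark images.
splits = list(five_fold_splits(range(1154)))
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the partition identical across runs, which is the same goal the published fold files serve.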

2.3 Dataset comparison
Figure 2 shows examples of benign and malignant abnormalities from each of the public datasets. Qualitative analysis allows various conclusions to be drawn about differences in speckle noise and contrast in the different datasets. With regard to image quality, BUSI, UDIAT, and RODTOOK all exhibited good pixel resolution, allowing the observation of fine structures like the pectoral muscle or parenchymal tissue. However, the OASBUD dataset is of a lower resolution due to the applied image reconstruction algorithm, which was not as sophisticated as for the other scanners. 8 The appearances of lesions vary greatly in size, shape, and contrast for all datasets, with most containing clear well-defined lesions, especially compared to OASBUD. It is important to note that although the majority of the BUSI lesions are well-defined, there are several lesions that are of low contrast and difficult to distinguish from the background. Last, speckle noise, described as granular interference caused by the environmental conditions on the imaging sensor during image acquisition, 23 is also considered, with OASBUD showing this most and all the other public datasets showing far less.
TABLE 1 Summary of dataset details, including the number of images for each dataset in our benchmark, benign/malignant quantities, and number of scanners used.
We further analyze the benchmark dataset in terms of mask area, Hausdorff distance for circularity, and moments calculation for elongation. The lesion size was estimated using pixel length from the original image, as the exact metric size was not available within the metadata. All evaluations described can be seen as box plots in Figure 3, which shows that, on average, the malignant lesions were larger than the benign lesions. It is worth mentioning that for both the BUSI and RODTOOK datasets, the malignant lesions are larger than the benign lesions, signifying a potential size bias in our benchmark dataset. We computed the Hausdorff distance between a circle of equivalent pixel area and the annotation to analyze the irregularity of the lesions. The bar chart indicates that malignant lesions are larger than benign lesions. Furthermore, the greatest variance in terms of lesion size is displayed within the BUSI benign lesions. Last, we assessed lesion elongation using moments. 24 Once again, the BUSI dataset showed the most elongated lesions, whereas OASBUD showed the least elongated examples. We also found that the majority of the outliers for the Hausdorff distance are the same for the elongation estimation.
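The exact moment-based elongation formula is not given in the text. One common formulation, assumed here purely for illustration, takes the square root of the ratio between the two eigenvalues of the second-order central moment matrix of the mask pixels, so that a square or circle scores 1 and stretched shapes score higher. The function name `elongation` and the nested-list mask format are hypothetical, and the measure may differ from the one in reference 24.

```python
import math


def elongation(mask):
    """Moment-based elongation of a binary mask given as rows of 0/1 values.

    Assumed definition: sqrt(lambda_max / lambda_min), where the lambdas are
    the eigenvalues of the 2x2 matrix of second-order central moments.
    """
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no foreground pixels")
    n = len(pts)
    mean_r = sum(r for r, _ in pts) / n
    mean_c = sum(c for _, c in pts) / n
    mu20 = sum((c - mean_c) ** 2 for _, c in pts) / n   # spread along columns
    mu02 = sum((r - mean_r) ** 2 for r, _ in pts) / n   # spread along rows
    mu11 = sum((r - mean_r) * (c - mean_c) for r, c in pts) / n
    half_gap = math.sqrt(((mu20 - mu02) / 2) ** 2 + mu11 ** 2)
    lam_max = (mu20 + mu02) / 2 + half_gap
    lam_min = (mu20 + mu02) / 2 - half_gap
    if lam_min <= 0:
        return math.inf                                  # degenerate line or point
    return math.sqrt(lam_max / lam_min)


square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
wide = [[1] * 6, [1] * 6]
print(elongation(square), elongation(wide))
```

The same eigenvalues also give major and minor axis estimates, so a ratio of this kind is closely related to axis-length descriptors.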

Performance and evaluation metrics
To assess the performance, we separate our evaluation into two categories: detection rate (DR) (the number of correctly detected lesions, i.e., the true positives [TP], compared to the false positives [FP]) and segmentation accuracy (the number of pixels correctly identified as belonging to the lesion).
Within the related BUS and the wider literature, the pixel-wise detection rate can be obtained by varying the discrimination threshold in the predicted masks. We can then create a receiver operating characteristic (ROC) curve and calculate the area under the curve (AUC) to get a clearer understanding of our models' diagnostic ability.
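The threshold sweep described above can be sketched in plain Python as follows. The example treats a predicted mask as a flat list of per-pixel probabilities and the manual mask as a flat list of 0/1 labels; the function names `roc_points` and `auc`, the threshold grid, and the toy data are assumptions for illustration and not the evaluation code released on GitHub.

```python
def roc_points(probs, labels, thresholds):
    """Return sorted (FPR, TPR) points, one per discrimination threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both lesion and background pixels")
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule over sorted points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: two lesion pixels scored high, two background pixels scored low.
probs = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
curve = roc_points(probs, labels, thresholds=[0.0, 0.5, 1.1])
print(curve, auc(curve))
```

A threshold above 1 is included so that the curve starts at the origin, and a threshold of 0 so that it ends at (1, 1).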
For segmentation accuracy, the most frequently used evaluation metrics include pixel accuracy (Acc), Dice similarity coefficient (DSC), and the Intersection over Union metric (IoU). 10,14-17 Furthermore, true negatives (TN) and false negatives (FN) are defined here as required for the above metrics. By assessing our predicted segmentation on a pixel-by-pixel basis, we can calculate the overall accuracy of our models using
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{1}$$
However, due to the large imbalance between mask and background pixels in some datasets, as seen in Figure 3, DSC would be considered more appropriate, due to its weighting on the TP pixels. The DSC metric is defined as two times the area of the intersection of the manual and predicted masks, divided by the sum of the areas of the manual and predicted masks, which is given by
$$\mathrm{DSC} = \frac{2\,|X \cap Y|}{|X| + |Y|}, \tag{2}$$
where X indicates the manual mask and Y is the predicted mask. A more general case is the similar IoU metric, which divides the intersection (the pixels found in both the manual mask and the predicted mask) by the union (all pixels found in either the manual mask or the predicted mask):
$$\mathrm{IoU} = \frac{|X \cap Y|}{|X \cup Y|}, \tag{3}$$
where X indicates the manual mask and Y is the predicted mask. Furthermore, the DSC and IoU metric results are calculated assuming that our BUS images are a single-class problem (only mask), instead of a two-class problem (background and mask). All evaluation code used within this paper can be found on GitHub. Last, Everingham et al. suggested that the overall detection accuracy of a segmentation model can be calculated by considering predictions that achieve a DSC metric score of above 0.5. 25 Therefore, a lesion is correctly detected, or a TP, if it achieves DSC ≥ 0.5; otherwise it is classed as an FP. This method was also used by Yap et al. and then as a standalone metric by Byra et al. 11,16
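The three metrics and the DSC-based detection rule follow directly from the prose definitions above. The minimal sketch below computes them for masks flattened to lists of 0/1 values. The function names and the choice to return 1.0 when both masks are empty are assumptions made for this example; the evaluation code actually used is the one on GitHub.

```python
def confusion_counts(manual, predicted):
    """Count TP, FP, TN, FN pixels for two equal-length binary masks."""
    tp = fp = tn = fn = 0
    for x, y in zip(manual, predicted):
        if y and x:
            tp += 1
        elif y and not x:
            fp += 1
        elif x and not y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def pixel_accuracy(manual, predicted):
    tp, fp, tn, fn = confusion_counts(manual, predicted)
    return (tp + tn) / (tp + fp + tn + fn)


def dice(manual, predicted):
    tp, fp, _, fn = confusion_counts(manual, predicted)
    denom = 2 * tp + fp + fn          # equals |X| + |Y|
    return 2 * tp / denom if denom else 1.0   # assumption: empty vs empty = 1


def iou(manual, predicted):
    tp, fp, _, fn = confusion_counts(manual, predicted)
    denom = tp + fp + fn              # equals |X union Y|
    return tp / denom if denom else 1.0       # assumption: empty vs empty = 1


def is_detected(manual, predicted, threshold=0.5):
    """A lesion counts as a true positive when DSC >= threshold."""
    return dice(manual, predicted) >= threshold


manual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(pixel_accuracy(manual, predicted), dice(manual, predicted),
      iou(manual, predicted), is_detected(manual, predicted))
```

In this toy case the accuracy (0.75) is higher than the DSC (about 0.67) because the background pixels dominate, which is the imbalance effect the text describes.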

METHODOLOGY
To select our benchmark architectures, we based our work on Byra et al., 16 who had already conducted an extensive study into the capabilities of Sk-U-Net on three of the four publicly available datasets used within this study. Furthermore, they discussed several key issues within their study, including the lack of comparison with the ever-popular transfer learning FCNs, a subset of CNNs without fully connected layers. These state-of-the-art FCN architectures have been extensively used in object detection and object-based segmentation in the literature, including medical image analysis. 16,26-28 Therefore, we include two state-of-the-art semantic segmentation FCN architectures: Matterport's implementation of Mask R-CNN 29 and TensorFlow's configuration of Deeplabv3+. 30 Furthermore, we include several state-of-the-art variations of U-Net, that is, Attention-U-Net (Att-U-Net), 31 Att-Dense-U-Net (Att-D-U-Net), 32 and U-Net++, 33 which, to our knowledge, have not been used for BUS segmentation.
We also wanted to explore the capabilities of the more recent state-of-the-art vision TNNs. Dosovitskiy et al.'s vision transformer (ViT) formed the foundation of a pure transformer model for computer vision tasks. 34 Therefore, Trans-U-Net and Swin-U-Net were selected for the benchmark, being specifically built for medical image segmentation. 35
[...] 39 by adding a fully connected layer, achieving instance segmentation for multiple proposed masks and by predicting segmentation at a pixel level. Furthermore, Mask R-CNN utilizes a region proposal network (RPN), which generates proposed regions for the object's location within an image. The RPN can be broken down into two components: the classifier, which calculates the probability of the object residing in the proposed region, and the regressor, which regresses the coordinates of the proposed regions. After the RPN has proposed regions, features are extracted from each proposed region and bounding-box regression is performed. Deeplabv3+ 30 combines the use of Atrous Spatial Pyramid Pooling (ASPP) and encoder-decoder structures not only to refine the borders of segmentation results, but also to improve the detection of small/thin objects, improving fine-grained segmentation. Additionally, Deeplabv3+ typically uses the backbone model Xception-65, a 65-layer atrous CNN used to extract feature maps from the input images. An advantage of the Deeplabv3+ architecture is that it does not require large datasets and pretrained weights are readily available. A diagram of Deeplabv3+ can be found in

FCN-based U-Net architectures
The U-Net architecture is based on an FCN model proposed by Ronneberger et al. 40 for biomedical image segmentation to overcome the need for large-scale datasets, and has since been integrated with more state-of-the-art techniques. The U-Net architecture is an encoder-decoder architecture, containing both a contraction path (encoder) and a symmetric expansion path (decoder). 40 Att-U-Net 31 is a modified version of U-Net for tissue/organ segmentation. It employs attention gates (AG) that focus on target structures while suppressing irrelevant features. Usually, it provides better mask localization without loss to its receptive field. 31 Att-D-U-Net is further integrated with densely connected encoders. These dense blocks connect the layers in a feedforward fashion, strengthening feature propagation and substantially reducing the number of parameters. This architecture has been applied in the field of digital mammography, achieving better segmentation results than U-Net, Att-U-Net, and Dense-U-Net without AG. 32 Sk-U-Net 16 is another variant of the U-Net model architecture, but with the conventional blocks replaced by Sk blocks, which automatically adjust their receptive field, resulting in better utilization of the spatial information at varying scales to obtain a high-level feature representation. This model architecture has been shown to be more robust for classification, displaying considerable improvements over the vanilla U-Net, as seen in Byra et al. 16 A diagram of Sk-U-Net can be found in Figure 5.
U-Net++ 33 is a nested architecture where the encoder and decoder subnetworks are connected through a series of nested, dense skip pathways, reducing the semantic gap between the feature maps of the encoder and decoder subnetworks. This improves learning, as the decoder and encoder feature maps are semantically similar. The configuration of this network can be found in Figure 6. This model has been widely evaluated for segmentation applications, showing improved performance over U-Net in nuclei, liver, and colon polyp segmentation. 33

U-Net-based transformer networks
We have selected two transformer-based models based on the U-Net described above, each with its own unique transformer block or encoder that replaces the conventional ones. Both models utilize a modified version of the ViT architecture, 34 which splits an image up into patches with their respective linearly activated pixel-wise feature maps generated from convolutional layers. These feature maps are then transformed into a sequence of tokens and fed into the transformer. These tokens are then sequenced, outputted, and projected back to the feature maps, allowing the analysis of low-level pixel-wise structures through token-wise embedding and lowering computational cost compared to CNNs. 34 Trans-U-Net is the first architecture to utilize a modified version of the ViT architecture designed for medical image segmentation. The model uses a transformer encoder as described above, and the same decoder structure with upsampling and skip connections as in the vanilla U-Net. Its transformer uses multilayer perceptron (MLP) blocks and multihead self-attention (MSA) layers to create encoded tokenized image patches from convolutional layer feature maps. These encoded feature maps are then upsampled with skip connections for accurate localization. 35 Swin-U-Net differs from Trans-U-Net in that, instead of having a transformer encoder, it uses Swin-transformer blocks that replace the conventional blocks in vanilla U-Net. The symmetric encoder transformer blocks use shifted-window MSA and patch merging layers at their base to contextualize features, whereas the decoder uses patch expanding layers to upsample the extracted deep features from the bottleneck layer. Then, the symmetric blocks upsample the feature maps along with concatenation from the skip connections. 36

Implementation
All architectures tested within the benchmark were run on color images, the format the datasets were received in. Mask R-CNN, 29 Swin-U-Net, and Trans-U-Net used pretrained weights acquired from their respective GitHub repositories. Unfortunately, we had difficulty with the original Deeplabv3+ 30 source and an alternative was used without pretrained weights. The remaining models were all trained from scratch. For training Mask R-CNN, we experimented with two backbones, ResNet50 and ResNet101, and found that ResNet101 performed the best during our preliminary experiments using only training and validation datasets. We initialized Mask R-CNN with a learning rate of 0.0005 and optimized using stochastic gradient descent (SGD) with learning momentum equal to 0.9. We set the batch size to 1 due to memory constraints and trained using a per-pixel softmax and a multinomial loss. Furthermore, we set the weight decay to 0.0001. We initially trained only the "head" layers of the network for 10 epochs and then trained the whole model for an additional 30 epochs. After each epoch, the model weights were saved, and the model with the highest average DSC score on the validation set was selected.
For benchmarking, all the FCN U-Net-based architectures and Deeplabv3+ were trained with a learning rate of 0.001 and optimized using Adam with learning momentum of 0.9. Furthermore, we set the batch size to 16 and decayed our learning rate exponentially by a factor of 0.1. Both groups of models were trained using the DSC metric (see Equation 2), and early stopping was implemented such that training was stopped if there was no improvement on our validation set after 15 epochs. We used data augmentation to improve training by vertically flipping each of the training images. We have also seeded our models to make the results as reproducible as possible. Each architecture was run three times and the model with the highest DSC score on our validation set was selected, to remain consistent with Byra et al. 16 Additionally, we implemented Deeplabv3+ with the modified Xception71 backbone as described in Chen et al., 30 along with atrous rates of 4, 8, 12.
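To make the training criteria above more concrete, the sketch below shows one common way a DSC-based loss and a patience-based early stopping rule can be written. Both are assumptions for illustration: the soft Dice formulation with a smoothing constant `eps`, the function names, and the stopping logic are not taken from the benchmark code, which is available on GitHub.

```python
def soft_dice_loss(probs, truth, eps=1e-6):
    """1 - soft DSC for flat lists of predicted probabilities and 0/1 labels.

    eps is a hypothetical smoothing term that avoids division by zero.
    """
    intersection = sum(p * t for p, t in zip(probs, truth))
    total = sum(probs) + sum(truth)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def stopping_epoch(val_scores, patience=15):
    """Return the epoch index at which training would stop.

    Training stops once `patience` epochs pass without a new best
    validation score; otherwise it runs to the last epoch.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_scores) - 1


# Made-up validation DSC curve: improves for two epochs, then plateaus.
history = [0.50, 0.60] + [0.55] * 20
print(soft_dice_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0]), stopping_epoch(history))
```

With a patience of 15 and the best score at epoch 1, the made-up run above stops at epoch 16.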
Last, both TNNs were trained for a maximum of 150 epochs using a batch size of 4 because of GPU constraints. Swin-U-Net was initialized with a base learning rate of 0.05, a decay rate of 0.0001, and an SGD optimizer with [...]
TABLE 2 Summary of the two systems used to implement the benchmark algorithms.
To ensure that our results were as reproducible as possible, each of the architectures was trained by one author at one site (C.T.) (UK) and then by another author at a different site (M.B.) (Poland), and vice versa. A summary of each site's configuration can be found in Table 2. Both authors used the same environments, which can be obtained from GitHub. After all our results had been collected separately, the best results (highest DSC score on each fold) from the two sites were compared and the best were selected as the benchmark, with the maximum difference between the two sites over all folds being 0.015 in DSC score.

Statistical significance test
To statistically validate the nine benchmarked methods, we performed a MANOVA to determine whether the multivariate sample means are equal. When appropriate, this was followed by an ANOVA and a Tukey's Honest Significant Difference (HSD) post hoc test to statistically compare the different methods. Furthermore, DSC and Acc metric scores are analyzed separately, with IoU not being used due to its correlation with DSC. The significance threshold used within this study is 0.01.
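As a rough illustration of the univariate step, the sketch below computes the one-way ANOVA F statistic for per-image scores grouped by model. It covers only the F statistic: the p-value (from the F distribution), the MANOVA, and the Tukey HSD comparison are not implemented here and would normally come from a statistics package such as SciPy (`scipy.stats.f_oneway`, `scipy.stats.tukey_hsd`). The function name and the toy scores are hypothetical and are not results from the benchmark.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy scores for three hypothetical models (not benchmark results).
toy_scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(toy_scores))
```

A large F value indicates that the differences between model means are large relative to the spread within each model, which is what motivates the follow-up pairwise HSD comparisons.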

Qualitative results
Qualitative analysis provides a complementary perspective on the segmentation results, especially in relation to lesion morphology and dataset characteristics. Lesions were randomly selected from each dataset where the average DSC value over all architectures was above 0.8, displaying the capabilities of the architectures. For simplicity, this analysis will only include the three best performing architectures (the best semantic model, the best U-Net FCN variant, and the best U-Net transformer variant), selected based on average DSC scores.

Lesion Accuracy/Morphology
To explore the models' capabilities in terms of Lesion Accuracy/Morphology, we first calculated the Hamming distance 41 between the ground truth (manual) and prediction (automatic) masks, to obtain the percentage of pixels that were segmented correctly. Although segmentation metrics are commonly used for the evaluation of different approaches, we also propose to investigate the effects of different segmentation approaches on the lesion shape and morphology when compared to the ground truth. This was conducted to evaluate morphological properties of the obtained segmentation, assessing the level of robustness of these features. The three best average performing architectures were selected for this evaluation.
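For binary masks, the Hamming distance is simply the number of pixel positions at which the two masks disagree, so the percentage of correctly segmented pixels follows by normalizing with the mask size. A minimal sketch, assuming masks flattened to lists of 0/1 values and using hypothetical function names:

```python
def hamming_distance(manual, predicted):
    """Number of pixel positions where the two binary masks differ."""
    if len(manual) != len(predicted):
        raise ValueError("masks must have the same number of pixels")
    return sum(1 for x, y in zip(manual, predicted) if bool(x) != bool(y))


def percent_correct(manual, predicted):
    """Percentage of pixels on which the two masks agree."""
    return 100.0 * (1 - hamming_distance(manual, predicted) / len(manual))


manual_mask = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_mask = [1, 1, 0, 1, 0, 0, 0, 0]
print(hamming_distance(manual_mask, predicted_mask),
      percent_correct(manual_mask, predicted_mask))
```

Because every pixel counts equally, this measure behaves like pixel accuracy and is likewise influenced by the proportion of background pixels.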

Comparing the manual and automatic masks through shape analysis
To compare the manual and automatic masks, three morphological features were analyzed: the depth-to-width ratio (DWR), circularity, and elongation. DWR was determined by calculating the major and minor axis lengths for a segmentation mask and then dividing the major by the minor to obtain a scalar estimate for the DWR. Circularity and elongation were calculated as described in Section 2.3. Both of these features have been used for breast mass classification, and segmentation methods are expected to generate masks that provide accurate estimates of these basic shape descriptors. 42 Additionally, we only compared lesions that were classified as a TP. Segmentation masks were resized to the original ground truth image size in order to extract the shape descriptors to be compared. Furthermore, significance analysis was conducted by transforming the correlation coefficients of DWR, circularity, and elongation to z-values, and subsequently we estimated the observed value of z (z_obs). 43 To estimate significance, we used an alpha of 0.01 and assumed a two-tailed test, meaning that the difference is significant if the calculated observed z-value is outside −2.58 < z_obs < 2.58.
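The significance test on correlation coefficients can be sketched with the standard Fisher r-to-z transformation for two independent correlations, which is assumed here to match the procedure in reference 43. The function names and the sample sizes in the example are hypothetical; only the critical value of 2.58 (two-tailed, alpha of 0.01) comes from the text.

```python
import math


def fisher_z(r):
    """Fisher r-to-z transformation of a correlation coefficient."""
    return math.atanh(r)


def z_observed(r1, n1, r2, n2):
    """Observed z for the difference between two independent correlations."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


def significantly_different(r1, n1, r2, n2, critical=2.58):
    """Two-tailed test at alpha = 0.01 (critical value from the text)."""
    return abs(z_observed(r1, n1, r2, n2)) > critical


# Hypothetical correlations and sample sizes, for illustration only.
z = z_observed(0.9, 103, 0.5, 103)
print(round(z, 3), significantly_different(0.9, 103, 0.5, 103))
```

The transformation is used because the sampling distribution of a raw correlation coefficient is skewed, whereas the transformed value is approximately normal with a variance that depends only on the sample size.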

Influence of lesion size
For this evaluation, we conduct two sets of analysis. First, we assess the size of the lesions in the benchmark and filter our results by their respective public datasets, followed by calculating their DSC and DR scores, allowing us to make a direct comparison between predictions and ground truths.

Second, in the literature, there has been some evidence of training biases directly affecting the DSC metric scores, as described by Maier-Hein et al. 44 They mentioned the possible drawbacks of training architectures using the DSC metric when a dataset is biased towards small/large abnormalities. 44 They concluded that the DSC metric can be appropriately used for large structures, for example, organs, rather than smaller pathological structures. This indicates that an architecture trained using the DSC metric on a dataset with imbalanced lesion sizes could show bias in its segmentation performance, favouring the dominant lesion size and exhibiting bias towards the majority class. 45 The DSC metric focuses on the segmented region during training, inducing a bias towards a specific region size (lesion size). Furthermore, we would also expect a similar bias to arise in a model trained using any of the other region-based loss functions, for example, the IoU metric.
To investigate whether lesion size influences the predicted lesion size, we evaluate the same three models as in the Lesion Accuracy/Morphology section above. The original lesion size is then plotted against its respective predicted lesion size, with the aim of finding any correlation between lesion sizes and higher or lower DSC scores.

Multiple lesions
Last, during benchmarking, the dataset was refined to include only single-lesion BUS images, although BUS images do not always contain a single lesion. We evaluated the benchmarked architectures on the 16 images excluded from the BUSI dataset. Each BUS image contained two or three lesions, with 15 images being benign and one malignant. This analysis provides some insight into the robustness of the three best performing models in segmenting multiple lesions. Furthermore, the statistical evaluation described in Section 3.3 will also be applied if appropriate.

RESULTS
This section presents the segmentation results with the proposed database in terms of quantitative (using different metrics) and qualitative evaluations. In addition, we show results based on lesion size and lesion elongation. Finally, we show how the various approaches deal with the presence of multiple lesions per image. To further validate our results, we conducted a MANOVA test using the DSC, IoU, and Acc metric scores, which indicated a statistically significant difference between all our models with a p-value < 0.01. Subsequently, we employed an ANOVA test on the three individual metrics, which again showed a statistically significant difference between all our models with a p-value < 0.01. To distinguish which models were significantly different, we used the HSD test on the DSC and Acc metrics separately with an alpha of 0.01 (we do not include IoU results at this stage as they are strongly correlated with DSC; the equivalent IoU results can be found on GitHub). The results can be found in Figure 7. Figure 8 shows the ROC curves and AUC values for all benchmarked architectures, displaying the models' ability to distinguish between lesion and surrounding tissue. Swin-U-Net achieved the highest AUC value across all possible thresholds, closely followed by Trans-U-Net and Mask R-CNN. There is no

Qualitative results
Qualitative segmentation results are shown in Figures 9 and 10, for benign and malignant lesions, respectively. For benign images, most architectures provide reasonable segmentation performance and capture the well-defined boundaries of the benign lesions. Furthermore, there does not seem to be a drop in segmentation performance due to the image quality of different scanners. With respect to the malignant masses, the methods showed a good level of segmentation on lesions with DSC > 0.5. Figures 11 and 12 show the benign and malignant lesions that achieved the lowest average mean DSC scores.
For benign cases, we observe that the BUSI and OASBUD lesions were not detected by any of the architectures. Instead, the architectures detected a nearby shadow, apart from Sk-U-Net, which failed to make any detection. For RODTOOK, Sk-U-Net and Trans-U-Net detected two visible lesions, one of which was within the ground truth, whereas Mask R-CNN, which was set to detect only one region, selected the nearby shadow instead of the lesion. Last, for the UDIAT image, Mask R-CNN managed to locate the lesion with a DSC of 0.913, whereas, again, Sk-U-Net and Trans-U-Net obtained false detections, highlighting a nearby shadow.
In Figure 12, the BUSI malignant lesion was not detected, with all of the architectures detecting the large posterior acoustic shadowing (PAS) region instead. Usually, a PAS either overlaps with the lesion, resulting in over-segmentation, or is separate from the lesion, causing a false detection. 46 A similar situation is found for the UDIAT case, although this time the PAS overlaps with the region, resulting in over-segmentation. For the RODTOOK case, only Mask R-CNN managed to partially locate the lesion, but it obtained a poor mean DSC score due to over-segmentation. Last, for the OASBUD image, all architectures failed to make an accurate prediction, and only Sk-U-Net produced any prediction.

Lesion Accuracy/Morphology
For this analysis, we selected only the three best performing architectures, as previously stated: Mask R-CNN (best semantic model), Sk-U-Net (best U-Net FCN variant), and Trans-U-Net (best U-Net transformer variant), based on average DSC scores.
FIGURE 9 Segmentation results of the best-segmented benign lesion from the best performing model from each group are presented. The lesions were randomly selected from the segmentations with a DSC above 0.8 for all networks. Manual segmentation mask shown in "Blue" and prediction mask shown in "Red." DSC, Dice similarity coefficient.
To assess the architectures' capability to preserve lesion accuracy/morphology, we first calculated the Hamming distance 41 between the manual and prediction masks, to obtain the percentage of pixels that were segmented correctly. For the three models, we found that on average, Mask R-CNN segments 84.3% of the manual mask pixels correctly, whereas Sk-U-Net obtained 76.8% and Trans-U-Net achieved 74.77%. This shows that Sk-U-Net and Trans-U-Net achieved similar accuracy, but due to false detections, Sk-U-Net's average DSC was lower than that of Trans-U-Net.
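A minimal sketch of a Hamming-distance comparison between a manual and a predicted mask is given below. It assumes both masks are binary and of equal size, and reports the fraction of pixels on which they agree; whether the study normalized over the whole image or only over the manual mask region is not stated, so the whole-image normalization here is an assumption, and `hamming_agreement` is a hypothetical helper name.

```python
def hamming_agreement(manual, predicted):
    """Fraction of pixels on which two equally sized binary masks agree.

    Masks are lists of rows of 0/1. The Hamming distance is the number of
    positions at which the flattened masks differ.
    """
    flat_manual = [p for row in manual for p in row]
    flat_pred = [p for row in predicted for p in row]
    if len(flat_manual) != len(flat_pred) or not flat_manual:
        raise ValueError("masks must be non-empty and of equal size")
    distance = sum(m != p for m, p in zip(flat_manual, flat_pred))
    return 1.0 - distance / len(flat_manual)


# Hypothetical 2 x 4 masks, not taken from the benchmark.
manual_mask = [[0, 1, 1, 0],
               [0, 1, 1, 0]]
predicted_mask = [[0, 1, 0, 0],
                  [0, 1, 1, 1]]
print(hamming_agreement(manual_mask, predicted_mask))
```

In this toy case two of the eight pixels differ, so the agreement is 0.75. Averaging this value over all test images would give a per-model percentage comparable in spirit to the figures quoted above.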
Comparing the manual and automatic masks through shape analysis, Figure 13 shows the DWR, circularity, and elongation analysis for our three deep learning architectures. Regarding DWR, we found that Mask R-CNN produced the most accurate estimates and the highest linear correlation coefficient, 0.888, compared to Sk-U-Net's 0.628 and Trans-U-Net's 0.746. Mask R-CNN performed the best, most likely because its bounding box improved lesion localization. A similar conclusion is drawn regarding circularity, with Mask R-CNN preserving the circularity characteristics of the lesions best, with a higher correlation coefficient of 0.532 compared to 0.374 for Sk-U-Net and 0.409 for Trans-U-Net. Last, for elongation, we found that once again, Mask R-CNN achieved the highest correlation coefficient, 0.876. This was followed by Trans-U-Net with a score of 0.713, whereas Sk-U-Net obtained a far lower score of 0.504, as the Sk-U-Net predictions were found to be larger due to over-segmentation.
For DWR, there is a statistically significant difference between Mask R-CNN and Sk-U-Net with z obs = 19.650, and between Sk-U-Net and Trans-U-Net with z obs = 18.353, and no significant difference between Mask R-CNN and Trans-U-Net with z obs = 1.297. For circularity, there is a statistically significant difference across all models: for Mask R-CNN and Sk-U-Net, z obs = 8.274; for Sk-U-Net and Trans-U-Net, z obs = 3.875; and for Mask R-CNN and Trans-U-Net, z obs = 4.399. With regard to elongation, there is a statistically significant difference between Mask R-CNN and Sk-U-Net with z obs = 12.696, and between Sk-U-Net and Trans-U-Net with z obs = 11.003, but no statistically significant difference between Mask R-CNN and Trans-U-Net with z obs = 1.693.
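The significance test on correlation coefficients can be sketched with the standard Fisher z-transformation for two independent correlations. This is an assumed textbook formulation; the exact formula and the sample sizes used in this study are not given in the text, so the sample size of 500 below is purely hypothetical and the resulting value is not expected to reproduce the z obs values reported above.

```python
import math


def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return math.atanh(r)


def z_observed(r1, n1, r2, n2):
    """Observed z for the difference between two independent correlations."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than three observations")
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


def is_significant(z_obs, critical=2.58):
    """Two-tailed decision at alpha = 0.01 (critical value 2.58)."""
    return abs(z_obs) > critical


# DWR correlations from the text; N_LESIONS is a hypothetical sample size.
N_LESIONS = 500
z = z_observed(0.888, N_LESIONS, 0.628, N_LESIONS)
print(round(z, 2), is_significant(z))
```

Because only lesions classified as TP were compared, each model would in practice have its own sample size, which is why `z_observed` takes `n1` and `n2` separately.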

Influence of lesion size
The size distribution of the lesions in the benchmark is presented in Figure 3. We observed that the RODTOOK and BUSI sets contain larger malignant lesions. Furthermore, BUSI and OASBUD contain larger benign lesions.

The DSC and DR scores obtained after filtering our results by their respective public datasets are given in Tables 4 and 5.
FIGURE 10 Segmentation results of the best-segmented malignant lesion from the best performing model from each group are presented. The lesions were randomly selected from the segmentations with a DSC above 0.8 for all networks. Manual segmentation mask shown in "Blue" and prediction mask shown in "Red." DSC, Dice similarity coefficient.

Considering benign images, on average, the architectures did the best on the UDIAT images with an average mean DSC score of 0.789, followed by BUSI: 0.747, RODTOOK: 0.746, and OASBUD: 0.720. It is an interesting observation that UDIAT has the smallest average benign lesion size. Excluding OASBUD from our results, as its score is most likely affected by the image quality, we observed that the models performed better on smaller lesions. The models achieved near identical scores, differing by 0.001, for RODTOOK and BUSI, which have a similar average lesion size. This suggests a correlation between smaller lesions and higher DSC scores.
FIGURE 11 Segmentation results of the worst-segmented benign lesion from the best performing model from each group are presented. Manual segmentation mask shown in "Blue" and prediction mask shown in "Red."
TABLE 5 DR for different architectures, public datasets, and types of lesions.
For malignant images, the highest average mean DSC score, 0.797, was obtained on the RODTOOK dataset, which contains the second largest average malignant lesion size. Moreover, the UDIAT dataset, which contains a lower than average lesion size, obtained a score of 0.725. The two other datasets obtained the following scores: BUSI: 0.708 and OASBUD: 0.605, indicating a bias towards larger lesion sizes for the malignant BUS images.

To investigate whether lesion size influences the predicted lesion size, we evaluated the best three models as described, with the aim of analyzing whether there is any correlation between lesion sizes and higher or lower DSC scores. Figure 3 details the lesion size profiles in the datasets. Comparing this with the graph on the left in Figure 14, we can see that smaller lesions occur in higher concentration, with most of the larger lesions being malignant. This is also supported
FIGURE 12 Segmentation results of the worst-segmented malignant lesion from the best performing model from each group are presented. Manual segmentation mask shown in "Blue" and prediction mask shown in "Red."
TABLE 6 Multiple lesions segmentation results, displaying mean metrics scores, median and standard deviation within brackets.

Multiple lesions
Multiple lesion detection results can be found in Table 6. Mask R-CNN performed the best, achieving the highest mean scores across all metrics, followed by Sk-U-Net and then Trans-U-Net. Splitting the results by lesion type, Mask R-CNN, Sk-U-Net, and Trans-U-Net obtained DSC scores of 0.759, 0.844, and 0, respectively, for the single malignant case. For the remaining 15 benign cases, they produced mean DSC scores of 0.844, 0.696, and 0.632, respectively. Surprisingly, Mask R-CNN received a higher mean score on multiple lesions compared to its benchmark performance, displaying the capabilities of semantic segmentation on multiple lesions of BUS images, although Sk-U-Net did achieve a higher score than Mask R-CNN on the single malignant case. Using MANOVA based on DSC, IoU, and Acc, we find a p-value of 0.065, which is greater than 0.01 and therefore does not indicate a statistically significant difference between the three models.

Benchmark discussion
Mask R-CNN achieved the highest segmentation results compared to all the other architectures shown in Table 3, followed by Trans-U-Net and then Sk-U-Net, both of which achieved lower segmentation performance with respect to all metrics. For the benign and malignant breakdown, Mask R-CNN once again achieved the highest metric scores, and this was consistent for both benign and malignant masses. The statistical analysis is displayed in Figure 7. For DSC, the HSD test indicated a statistically significant difference between Mask R-CNN and all other models, and between Att-D-U-Net and all other models, confirmed by a p-value < 0.01 (note that Mask R-CNN is better than all other models, and that Att-D-U-Net performs worse than the other models). For Acc, the HSD test indicated a statistically significant difference between Mask R-CNN and all other models apart from Trans-U-Net. Furthermore, it also showed a statistically significant difference between Att-D-U-Net and the other models, apart from U-Net and U-Net++. This highlights that when TN are included in the metric calculations, the distinction between models becomes less clear.

Benchmarks comparison
The capabilities of deep-learning-based methods for segmenting breast masses can be seen clearly in the results in Table 3, which showed that on average, eight of the nine explored architectures achieved a DSC score above 70%, with only Att-D-U-Net achieving an average score below 65%.
Considering benign and malignant types separately within the benchmark, on average, the models performed better on benign cases, with the two best performing models being Mask R-CNN and Trans-U-Net. In contrast, the worst performing method, Att-D-U-Net, obtained its best scores on the malignant set. This could be explained by the fact that the ratio in the benchmark set is biased towards benign images. Additionally, benign lesions usually exhibit well-defined boundaries compared to malignant lesions. Overall, the benchmark results highlight the importance of Trans-U-Net's transformer block and Mask R-CNN's bounding box localization for improving detection rates. Interestingly, when considering the U-Net-based architectures that are integrated with AG and dense blocks, we find that there does not seem to be any improvement with regard to the metric scores.
FIGURE 14 Manual mask lesion pixel area against automatic mask pixel area. All lesions were split into groups based on the median and the 25th and 75th percentiles (interquartile range) of the ground truth, shown in the chart legends, with Mask R-CNN, Sk-U-Net, and Trans-U-Net shown as (a), (b), and (c), respectively.
Our benchmark results are similar to those of other papers on BUS segmentation, although direct comparison is difficult due to the variety of datasets and methodologies. However, there are two papers that were based on a combination of some of the public datasets used in this study. Byra et al. achieved similar results, with Sk-U-Net and U-Net obtaining DSC scores of 0.826 and 0.778, respectively, a difference of 0.048. In our benchmark, Sk-U-Net managed a mean DSC of 0.748 and U-Net 0.707, a difference of 0.041. With such a small variation in the gap between the two architectures, we observed that similar relative performances were achieved, although they used a dataset of 882 lesions compared to 1154 in this work. 16 Another comparable study was conducted by Gomez-Flores et al., where DeepLabv3+ achieved a median DSC score of 0.902, whereas in our benchmark, we obtained a median DSC of 0.848. 17 However, there were some important differences, as they used a private dataset consisting of 3061 BUS images. Interestingly, they also applied 10-fold cross-validation to assess the architectures based on different US machines, with the lowest median DSC being 0.81. 17 Many examples of the predicted segmentations across several models were also shown, and their qualitative results show morphological aspects similar to our own results. A thorough comparison is difficult, as they only displayed examples that obtained a good segmentation for all models tested.

Lesion Accuracy/Morphology
For the three models, Mask R-CNN obtained the highest accuracy of 84.3% when considering the Hamming distance. 41 Furthermore, Figure 13 showed that Mask R-CNN maintained the most morphological features associated with lesions, followed by Trans-U-Net and then Sk-U-Net. Overall, Sk-U-Net obtained far lower scores for DWR, circularity, and elongation when compared with the other networks. It is also interesting to see that the Mask R-CNN and Trans-U-Net prediction masks display a far more circular appearance, reinforced by the high elongation correlation coefficient, as both networks seem to miss the finer details of the manual masks, producing a more circular prediction. From the statistical analysis, the only conclusion we could draw across the three methods is that Mask R-CNN is statistically significantly different from Sk-U-Net.

Influence of lesion size
Results presented in Tables 4 and 5 show that our models performed the best on the RODTOOK dataset, achieving an average mean DSC of 0.777 over both lesion types, followed by UDIAT: 0.767, BUSI: 0.735, and OASBUD: 0.625. The poorer performance of the architectures on the OASBUD set is likely because of the image quality and smaller lesions, as indicated in Figure 3. This is reinforced by Table 3, indicating that the UDIAT images have the lowest average detection rate, especially for the benign images. This could indicate that the performance drop on the UDIAT images is because of the lesion size, resulting in fewer features for architectures to learn from and making detection difficult. We expected the RODTOOK benign images to also have low detection rates, which differs from what can be found in Table 5, which shows UDIAT having a lower detection rate. Each of the public datasets was also obtained on a variety of different scanners, leading to differences in BUS image resolution. Furthermore, it is generally difficult to compare performance scores obtained on different datasets. Specific biases present in the training data may result in better performance of specific networks. Therefore, lesion size may have generated a training bias, as the overall set contains a large number of bigger lesions, as seen in Figure 3. This could directly affect the size of the generated masks and in turn the DSC metric scores.
In Figure 14, where the original lesion size is plotted against its respective predicted lesion size for the three models, there seems to be a distinct level of bias, with large lesions being predicted smaller and small/medium lesions being predicted larger. Mask R-CNN, which was trained using IoU loss, performed worse on smaller masses than Sk-U-Net and Trans-U-Net, which were trained using DSC loss. Note that these results align well with the work by Maier-Hein et al. 44
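The size-bias analysis described above can be sketched by grouping lesions on the quartiles of the ground-truth area and comparing predicted and manual areas within each group. The function below is a minimal illustration under that assumption; the helper name `size_bias_by_quartile`, the group labels, and the pixel areas are hypothetical and are not taken from the benchmark.

```python
from statistics import mean, quantiles


def size_bias_by_quartile(gt_areas, pred_areas):
    """Mean predicted/ground-truth area ratio per ground-truth size group.

    Groups are split at the 25th, 50th, and 75th percentiles of the
    ground-truth areas. A ratio above 1 means lesions in that group are
    predicted larger than annotated; below 1 means predicted smaller.
    """
    q1, q2, q3 = quantiles(gt_areas, n=4, method="inclusive")
    groups = {"small": [], "medium-small": [], "medium-large": [], "large": []}
    for gt, pred in zip(gt_areas, pred_areas):
        if gt <= 0:
            continue  # skip empty ground-truth masks
        if gt <= q1:
            key = "small"
        elif gt <= q2:
            key = "medium-small"
        elif gt <= q3:
            key = "medium-large"
        else:
            key = "large"
        groups[key].append(pred / gt)
    return {key: (mean(vals) if vals else None) for key, vals in groups.items()}


# Hypothetical pixel areas illustrating the trend described in the text.
gt = [100, 200, 300, 400, 500, 600, 700, 800]
pred = [150, 260, 330, 420, 500, 570, 630, 680]
print(size_bias_by_quartile(gt, pred))
```

For these made-up values the smallest group has a mean ratio of about 1.4 and the largest about 0.875, which mirrors the pattern of small lesions being predicted larger and large lesions being predicted smaller.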

Multiple lesions
It was hard to draw any conclusion from the multiple lesion benchmark in Table 6. Although Mask R-CNN obtained the highest performance, the MANOVA indicated that there was no statistically significant difference between models, probably due to the insufficient number of testing samples.

Limitations and future work
Based on our investigations, we identified several issues related to this study. First, the four public datasets contained within this study were annotated by different radiologists, which could create variations, especially along the boundaries, where delineation becomes more a matter of personal interpretation. Furthermore, image acquisition protocols were most likely different between centers, leading to more variations between datasets. An evaluation of the agreement between radiologists could help to improve the consistency of the masks. Second, we did not enhance our segmentation with post-processing methods, like region growing or watershed, 47,48 which might help to reduce the amount of features lost along the boundaries of the lesions. We also did not compare different loss functions and optimization methods for training.
Our future work will focus on feature differences between benign and malignant lesions and on semantic segmentation in 2D and 3D US. Additionally, we plan to develop our own segmentation networks, building on the research performed within this study, and to conduct further investigation on different scanners and lesion types and how these affect network performance.

CONCLUSIONS
We have proposed a reproducible benchmark, tested across two different system configurations using publicly available data. Our benchmark results found that Mask R-CNN achieved the best overall results on the benchmark dataset (1154 BUS images), achieving metric scores of DSC: 0.851, IoU: 0.786, and Acc: 0.975 using five-fold cross-validation. Furthermore, our results were analyzed using MANOVA, which indicated a statistically significant difference between models with a p-value < 0.01. Further evaluation using one-way ANOVA and the Tukey test across the DSC metric scores highlighted a significant difference between Mask R-CNN and all other models, and between Att-D-U-Net and all other models. This became more ambiguous when considering the Acc metric score, which indicated a statistically significant difference between Mask R-CNN and all other models apart from Trans-U-Net. Additionally, it showed a statistically significant difference between Att-D-U-Net and the other models, apart from U-Net and U-Net++. From evaluating the prediction masks, we found that Mask R-CNN presented the highest linear correlation for DWR, circularity, and elongation; therefore, it maintained the morphological features of the lesions most accurately. We further found that models missed similar structures along the mask boundaries. Further statistical analysis based on the linear correlation coefficients for the three methods separately indicated that Mask R-CNN is statistically significantly different from Sk-U-Net.
Possible challenges of the benchmark have been highlighted, with reasonable evidence to conclude that there is a training bias in the benchmark results, as the smaller number of malignant images causes the architectures to over-segment or incorrectly detect lesions, affecting the DSC metric score. For this work, we can see evidence of a DSC training bias, but more research would be required to evaluate the effects. Finally, three architectures were assessed on 16 images with multiple lesions, where Mask R-CNN performed the best with a mean DSC score of 0.839, although the differences between models were not statistically significant. All dataset details and evaluation code used within this study are available on GitHub.