Deep Learning‐Based Segmentation of Locally Advanced Breast Cancer on MRI in Relation to Residual Cancer Burden: A Multi‐Institutional Cohort Study

While several methods have been proposed for automated assessment of breast‐cancer response to neoadjuvant chemotherapy on breast MRI, limited information is available about their performance across multiple institutions.

Neoadjuvant chemotherapy (NAC) is increasingly used to treat patients with breast cancer. It reduces the rate of mastectomies (to favor breast-conserving surgery) and axillary lymph node dissections and allows monitoring of treatment response with the tumor in situ. 1 Reduction of the size of the primary tumor and lymph node metastases in response to NAC is associated with improved survival. 2 Breast MRI is used in addition to other imaging modalities like mammography or ultrasound to further stage breast cancer. [...] 5 Methods for response monitoring on MRI range from manual assessment by radiologists to methods under investigation for fully automated analysis. Manual assessment has been shown to be predictive of pathological complete response (pCR), depending on tumor subtype. 6 Combinations of manually selecting regions of interest (ROI) and semi-automated thresholding have also been proposed, including a semi-automated method to establish functional tumor volume (FTV). 7,8 Reduction of FTV during treatment was found to be a stronger predictor of pCR than manual radiological assessment alone and also a strong predictor of recurrence-free survival (RFS). 8,9 [...] 12 Little is known about the robustness of these methods across multiple institutions.
This study aimed to establish, first, whether changes in tumor load on dynamic contrast-enhanced MRI (DCE-MRI), as assessed by a deep learning-based model, are associated with residual cancer burden (RCB) after NAC and whether this assessment is robust to variations between institutions and MRI scanners. Second, it aimed to establish whether such a model is in agreement with the relationship between FTV and RCB.

Subjects
TRAINING COHORT. Requirement for informed consent and ethical review was waived by the institutional review board (Medical Research Ethics Committee Utrecht [METC Utrecht], no. 19-245). The patients in the training cohort were retrospectively included from the BOGOTA study. Female patients treated for LABC with NAC between January 1, 2011 and December 1, 2019 were consecutively included. Of the 147 patients available, 28 were excluded. An additional 17 were excluded for response-assessment modeling. Three patients had bilateral cancer. This resulted in 105 breasts with cancer (from 102 patients) in the training cohort (Fig. 1). The mean age at diagnosis was 50 years (range 25-73 years).
TESTING COHORT. This study was approved by the institutional review board (Medical Research Ethics Committee Utrecht [METC Utrecht], no. 19-396) and informed consent was obtained from all patients. An independent testing cohort was taken from the prospective multi-institutional LIMA study (Liquid biopsies and IMAging). 13 Between December 1, 2019 and October 1, 2021, 61 female patients with histologically proven invasive breast cancer who were planned to receive NAC were included across four institutions. Exclusion criteria in the LIMA study were luminal A or inflammatory breast cancer, presence of distant metastases at the time of diagnosis (as determined using a PET/CT scan), history of prior ipsilateral breast cancer, history of active malignant disease in the preceding 5 years (not including squamous cell or basal cell carcinoma of the skin), and contraindications to DCE-MRI. In the current study, eight patients were excluded due to unavailability of the MR imaging data. One patient had bilateral breast cancer. This resulted in 54 breasts with cancer to evaluate the response assessment models (Fig. 2). The mean age at diagnosis was 50 years (range 25-72 years).

Evaluation of Resection Specimen
RCB was derived from the final resection specimens by a dedicated breast pathologist with 30 years of experience (PJvD), following the methodology described by Symmans et al. 14 The continuous RCB measure was discretized following cut-offs described in the literature. 14 RCB categories RCB-0 (i.e., pCR) and RCB-I were defined as good responders to NAC. Conversely, categories RCB-II and RCB-III were considered bad responders.
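The discretization and dichotomization described above can be sketched as follows. This is a minimal illustration, not the study's code: the cut-off values 1.36 and 3.28 are the commonly cited thresholds attributed to Symmans et al. and are an assumption here, since the text only refers to "cut-offs described in the literature". The function names are hypothetical.

```python
# Assumed cut-offs (commonly cited for the RCB index); the text itself
# only says "cut-offs described in the literature".
RCB_I_MAX = 1.36
RCB_II_MAX = 3.28


def rcb_class(rcb_index: float) -> str:
    """Map a continuous RCB index to a class label (RCB-0 ... RCB-III)."""
    if rcb_index < 0:
        raise ValueError("RCB index cannot be negative")
    if rcb_index == 0:
        return "RCB-0"  # pathological complete response
    if rcb_index <= RCB_I_MAX:
        return "RCB-I"
    if rcb_index <= RCB_II_MAX:
        return "RCB-II"
    return "RCB-III"


def is_good_responder(rcb_index: float) -> bool:
    """Good responders are RCB-0/I; bad responders are RCB-II/III."""
    return rcb_class(rcb_index) in {"RCB-0", "RCB-I"}
```

With these assumed cut-offs, an RCB index of 1.54 falls in class RCB-II and is therefore counted as a bad responder.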

MR Imaging
TRAINING COHORT. In the training cohort, patients underwent two MRI examinations: the first at baseline prior to NAC, and the second either midway through the chemotherapy schedule or immediately before the second-to-last cycle of chemotherapy, depending on the NAC schedule. The second MRI examination was defined as the follow-up examination. Imaging was performed using 1.5 T or 3 T MRI scanners (Philips Ingenia or Achieva) with a dedicated breast coil. The current study focused on the dynamic contrast-enhanced T1-weighted MRI series, consisting of one precontrast scan and at least five postcontrast scans after injection of a gadolinium-based contrast agent (Gadobutrol, 0.1 mmol/kg). The postcontrast scans were acquired at intervals between 60 seconds and 90 seconds. Repetition time ranged between 3.3 msec and 7.1 msec, echo time 1.2-3.4 msec, flip angle 8°-10°, and field of view 340-426.7 mm. Voxel sizes ranged from 0.75 × 0.75 × 0.90 mm³ to 0.97 × 0.97 × 1.30 mm³. Fat suppression was employed for all sequences. No acceleration or parallel acquisition was performed. Eighteen patients were excluded because the MR images were not available in full (Fig. 1).
TESTING COHORT. In the testing cohort, three MRI examinations were performed: the first at baseline, the second after one-third of the chemotherapy cycles were given, and the third after all cycles of NAC but prior to surgery. Only the first and the third examination were used, corresponding with the baseline and follow-up examinations in the training cohort. Imaging was performed exclusively on 3 T scanners (Philips Achieva, Ingenia or Ingenia Elition X or Siemens MAGNETOM Avanto, Spectra, Skyra or Vida) with dedicated breast coils. A dynamic contrast-enhanced T1-weighted MRI sequence was used, consisting of one precontrast series and at least three postcontrast series after injection of a gadolinium-based contrast agent (Gadobutrol, Gadoteridol or Gadoteric acid depending on the study site, 0.1 mmol/kg). The postcontrast scans were acquired at intervals of 55-89 seconds. Repetition time ranged between 3.7 msec and 5.5 msec, echo time 1.5-2.5 msec, flip angle 8°-12°. Voxel sizes ranged from 0.5 × 0.5 × 0.9 mm³ to 1.0 × 1.0 × 1.25 mm³. No acceleration or parallel acquisition was performed.

Semi-Automated Assessment of Functional Tumor Volume
FTV was calculated per breast from each MRI examination. In short, a bounding box was manually placed around the tumor by a biomedical engineer (MHAJ, 4 years of experience in breast MRI), based on radiology reports. To verify the quality of the bounding boxes, a random subset of 20 scans was checked by a radiologist (MRM, 7 years of experience). Voxels in this box meeting thresholds for intensity, relative enhancement, and signal enhancement ratio (SER) were considered to be tumor voxels. FTV was implemented using MeVisLab (MeVis Medical Solutions AG) following the published method. 15 To account for the thinner slices in these datasets compared to the dataset used in the development of FTV, the volume of interest minimum intensity threshold was lowered to 20%. All other parameters were first set as described in the reference: a minimum of 70% relative enhancement was taken and the minimum SER threshold was set to zero. 8 If necessary, the relative enhancement threshold could be tuned per site starting from this baseline. The "omit" functionality, used to exclude irregular regions from the FTV segmentation, was not implemented. 15
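The thresholding step can be illustrated with a short NumPy sketch. It is not the MeVisLab implementation used in the study. It assumes that relative enhancement is the early postcontrast signal increase divided by the precontrast signal, that SER is the early enhancement divided by the late enhancement, and that the intensity threshold is a fraction of the maximum precontrast intensity inside the box; the last point in particular is an assumption, because the text does not state the reference intensity. All function and parameter names are hypothetical.

```python
import numpy as np


def ftv_mask(pre, early, late, box, pe_min=0.70, ser_min=0.0, intensity_frac=0.20):
    """Boolean mask of voxels inside `box` that pass all three thresholds.

    pre, early, late: 3D arrays (precontrast, early and late postcontrast).
    box: tuple of three slices describing the manually placed bounding box.
    """
    pre_b = pre[box].astype(float)
    early_b = early[box].astype(float)
    late_b = late[box].astype(float)
    eps = 1e-6

    # Assumption: intensity threshold relative to the brightest precontrast voxel.
    bright = pre_b >= intensity_frac * pre_b.max()

    # Relative (percent) enhancement of the early postcontrast series.
    early_enh = early_b - pre_b
    pe = early_enh / np.maximum(pre_b, eps)

    # Signal enhancement ratio: early over late enhancement (0 where undefined).
    late_enh = late_b - pre_b
    ser = np.divide(early_enh, late_enh,
                    out=np.zeros_like(pre_b), where=np.abs(late_enh) > eps)

    return bright & (pe >= pe_min) & (ser >= ser_min)


def ftv_mm3(mask, voxel_volume_mm3):
    """Functional tumor volume: number of tumor voxels times voxel volume."""
    return float(mask.sum()) * voxel_volume_mm3
```

Tuning per site, as described in the text, would correspond to changing `pe_min` while leaving the other parameters fixed.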

Automated Tumor Volume Assessment Using Deep Learning
A convolutional neural network (CNN) was trained to automatically segment the tumor. Ground-truth annotations of malignant disease were derived from a previously reported histopathology-validated semi-automated region grower. 16 A seed point was placed in or near the center of each tumor, followed by constrained volume growing based on contrast uptake. The resulting tumor segmentations were manually checked by the operator immediately after setting the seed points and, if considered incorrect, the seed point was adjusted or a new seed point was added or chosen. Manual corrections were made to remove erroneously segmented structures such as attached blood vessels. All seed points were placed and checked by a biomedical engineer (MHAJ, 4 years of experience in breast MRI), based on radiology reports, under supervision of a breast MR radiologist (EJMWvdB, 19 years of experience).
Preprocessing was applied to all scans: all postcontrast DCE-MRI scans were registered to the precontrast scans using deformable registration in three dimensions (elastix). 17 Images were automatically cropped in the anterior-posterior direction to the region between 1 cm anterior of the breast tissue and 5 cm posterior of the intermammillary cleft. 18 Ground-truth segmentations were used to train an nnU-Net CNN. 19 Briefly, all images were resampled to the median voxel spacing per axis across the dataset and normalized using z-scoring. A 3D U-Net-style architecture was used. A sliding window approach was then used to train the network, with the patch size determined such that each mini-batch consisted of at least two patches. The stochastic gradient descent optimizer was used with an initial learning rate of 0.01 and Nesterov momentum (μ = 0.99), and the network was trained for 1000 epochs. The learning rate was reduced during training using the polyLR schedule. 20 The loss function consisted of the sum of cross-entropy and Dice loss. Data augmentation consisted of randomly applied rotation and scaling (in 3D), additive Gaussian noise, Gaussian blurring, brightness and contrast simulation (by multiplying and clipping voxel intensities), repeated down- and up-sampling, gamma augmentation, and random mirroring along all axes. Note that no validation set was used during training as no hyperparameter tuning was performed. For more details regarding the CNN, see the literature. 19
Two-fold cross-validation on patient level was performed with equally sized folds: the CNN was trained on the train fold and the Dice score was assessed on the test fold (Fig. 3). Input to the nnU-Net consisted of the precontrast DCE series and five postcontrast DCE series (i.e., six channels in total). If an examination contained more than five postcontrast series, only the first five were used. If fewer than five were available, imputation to five series was performed using the last observation carried forward. A single 3D full-resolution network was used. After inference, total tumor volume was calculated by accumulating all foreground-class voxels in the breast region. If both the ground-truth segmentation and the nnU-Net-derived segmentation were empty volumes (i.e., no residual tumor, a true-negative result), a Dice score of 1.0 was assigned. In case of substantial deviations from the ground truth, the cause for the deviation was assessed in post hoc analysis. The trained weights of the nnU-Net are fully available at https://github.com/Lab-Translational-Cancer-Imaging/LABC_Segmentation.
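Three of the conventions in this section (six input channels with last observation carried forward, tumor volume from foreground voxels, and a Dice score of 1.0 when both masks are empty) can be sketched in NumPy as follows. This is an illustrative sketch rather than the study's pipeline; the function names are hypothetical, and the volume helper assumes the voxel spacing is given in millimetres per axis.

```python
import numpy as np


def stack_channels(pre, posts, n_post=5):
    """Stack precontrast + n_post postcontrast volumes into (n_post + 1, ...)."""
    if len(posts) == 0:
        raise ValueError("at least one postcontrast series is required")
    posts = list(posts[:n_post])      # keep only the first n_post series
    while len(posts) < n_post:        # last observation carried forward
        posts.append(posts[-1])
    return np.stack([pre, *posts], axis=0)


def tumor_volume_mm3(mask, spacing_mm):
    """Volume = number of foreground voxels times the volume of one voxel."""
    return float(np.count_nonzero(mask)) * float(np.prod(spacing_mm))


def dice_score(pred, truth):
    """Dice overlap; two empty masks count as perfect agreement (1.0)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)
```

The empty-mask convention matters mostly for follow-up scans, where a complete response leaves no tumor voxels in either segmentation and the usual Dice formula would divide by zero.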

Response Assessment
A model based on extremely randomized trees was fit to assess tumor response to NAC. 21 Three input features were used as candidate predictors: the lesion volume derived from the follow-up scan, tumor subtype, and the difference in tumor volume between baseline and follow-up scan. The end point was RCB derived from the final resection specimen, dichotomized into RCB-0/I or RCB-II/III. The area under the receiver operating characteristic curve (AUC) was used as the measure of model performance.
The tumor subtype was established from pre-NAC biopsy. Subtypes were categorized as HER2-positive (ER and PR either positive or negative), ER-positive/HER2-negative, or triple-negative (ER-negative, PR-negative and HER2-negative). Five-fold nested cross-validation was performed for hyperparameter tuning and internal model validation.
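A response model of this kind can be sketched with scikit-learn as shown below. The data are synthetic and generated only to make the example runnable; the hyperparameter grid, the integer coding of subtype, and the random seeds are assumptions, since the text does not report them. The inner loop tunes hyperparameters and the outer loop estimates the AUC, which is what nested cross-validation means here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT study data): three features per breast.
rng = np.random.default_rng(0)
n = 100
followup_volume = rng.gamma(2.0, 500.0, n)      # mm^3 at follow-up
volume_change = rng.normal(-3000.0, 1500.0, n)  # follow-up minus baseline
subtype = rng.integers(0, 3, n)                 # assumed integer coding of 3 subtypes
X = np.column_stack([followup_volume, volume_change, subtype])
# Synthetic label: 1 = RCB-II/III, loosely tied to follow-up volume.
y = (followup_volume + rng.normal(0.0, 400.0, n) > np.median(followup_volume)).astype(int)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
param_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 5]}  # hypothetical grid

search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=inner)

# Outer loop: each fold re-runs the inner grid search on its training part.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)

# Impurity-based (Gini) feature importances of the model refit on all data.
search.fit(X, y)
importances = search.best_estimator_.feature_importances_
print("AUC per outer fold:", np.round(outer_auc, 2))
print("Feature importances:", np.round(importances, 2))
```

Treating subtype as an integer code is a simplification; one-hot encoding is a common alternative for tree ensembles when the categories have no natural order.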
To assess the robustness of the response assessment to deviations in tumor segmentation relative to the ground truth, the influence of the Dice score on the resulting AUC was assessed: for both baseline and follow-up scans, the AUCs were calculated separately for the set of cases with a Dice score at or above the median and for cases with a Dice score below the median.
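The stratified analysis can be illustrated with a small pure-Python sketch. It assumes the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties counted as one half), which is equivalent to the area under the ROC curve. The function names and the example values in the usage note are hypothetical.

```python
from statistics import median


def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_by_median_dice(dice, scores, labels):
    """AUC for cases with Dice at or above the median and for cases below it."""
    cut = median(dice)
    high = [i for i, d in enumerate(dice) if d >= cut]
    low = [i for i, d in enumerate(dice) if d < cut]

    def pick(idx, values):
        return [values[i] for i in idx]

    return {
        "at_or_above": auc(pick(high, scores), pick(high, labels)),
        "below": auc(pick(low, scores), pick(low, labels)),
    }
```

Calling `auc_by_median_dice(dice, scores, labels)` separately with baseline and follow-up Dice scores gives the four AUC values compared in this analysis.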
The relative importance of the predictors in the two models was assessed using the Gini feature importance. 22 The trained models were subsequently applied to the testing cohort for performance evaluation.

Statistical Evaluation
Volumetric measurements above three times the interquartile range (IQR) were considered outliers and clipped to this range. Pearson's correlation was used to assess the correlation between the ground truth and nnU-Net volumes, and between nnU-Net volumes and FTV. Characteristics of the training and testing cohorts were compared using Chi-squared tests for categorical variables, Student's t-tests for normally distributed variables, and the Mann-Whitney U test to compare median values non-parametrically. AUCs were compared using DeLong's test. 23 P values <0.05 were considered significant. Statistical analyses were performed using SciPy v1.9.1 and scikit-learn v1.1.1.
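The outlier handling and correlation analysis can be sketched with NumPy and SciPy. The clipping rule is stated only briefly in the text, so this sketch assumes one common reading: values above the third quartile plus three times the IQR are clipped to that upper fence. The helper names are hypothetical.

```python
import numpy as np
from scipy import stats


def clip_outliers(volumes, k=3.0):
    """Clip values above Q3 + k * IQR to that upper fence (assumed rule)."""
    v = np.asarray(volumes, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    upper = q3 + k * (q3 - q1)
    return np.minimum(v, upper)


def volume_correlation(volumes_a, volumes_b, k=3.0):
    """Pearson's r and P value between two clipped sets of volumes."""
    r, p = stats.pearsonr(clip_outliers(volumes_a, k), clip_outliers(volumes_b, k))
    return float(r), float(p)
```

For example, in the list 100, 200, 300, 400, 5000 the quartiles are 200 and 400, so the upper fence is 1000 and the last value is clipped to 1000.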

Subjects
Significant differences in tumor subtype and chemotherapy regimens were present between the training and testing cohorts (Table 1). Concerning subtype, 65% of patients in the training cohort had ER+/HER2− tumors, whereas this was 35% in the testing cohort. In the training cohort, 64% of patients received FEC + DOC chemotherapy, while none received FEC + DOC in the testing cohort.

Evaluation of Resection Specimen
No significant differences were observed in median RCB at final pathology between the training cohort and the testing cohort (RCB = 1.534 vs. RCB = 1.208, Mann-Whitney U test, P = 0.137) (Table 2).

Semi-Automated Assessment of Functional Tumor Volume
Baseline FTVs were comparable between training and testing cohorts, with a median FTV of 6336 mm³ and 6319 mm³, respectively (P = 0.49). In the follow-up scans, the median FTV in the training cohort (453 mm³) was significantly larger than that observed in the testing cohort (45 mm³).

Automated Tumor Volume Assessment Using Deep Learning
An example of a semi-automatically segmented lesion is shown in Fig. 4. The median (IQR, defined as the range spanning q1-q3) cross-validated Dice score from the nnU-Net was 0.87 (0.62-0.93). Significant differences in performance between the two cross-validation folds could not be demonstrated (median [IQR]: 0.85 [0.62-0.92] vs. 0.89 [0.62-0.93], Mann-Whitney U test, P = 0.32). Pearson's correlation coefficient between volumes derived from the nnU-Net and the ground truth was R = 0.95 (fold 1: R = 0.93, fold 2: R = 0.97). The correlation between the nnU-Net-derived volume and FTV in the training cohort was R = 0.74 for the baseline scan, R = 0.72 for the follow-up scan, and R = 0.80 for all scans combined. All correlations were statistically significant. Substantial deviations in segmentation occurred more often in the follow-up scans (Dice < 0.2, to account for smaller tumors on follow-up) (34/105, 32.4%) than in the baseline scans (Dice < 0.75) (14/105, 13.3%) (Table 2). Typical deviations were under-segmentation of nonmass enhancing lesions on baseline and false-positive segmentations on follow-up in cases reported as radiological complete response (rCR). Further analysis showed that of the 18 cases that were reported as an rCR but in which the nnU-Net did segment a residual lesion, 7 (39%) had a badly responding tumor at final histopathology (RCB class II/III).

Response Assessment
In the training cohort, the cross-validated AUCs of the response assessment model were significantly different: 0.76 for the deep learning-derived tumor volumes and 0.67 for FTV. The robustness of the cross-validated assessment of RCB to deviations in tumor segmentation was as follows: the mean AUC of the model to discriminate between RCB class 0/I and RCB class II/III was 0.78 for Dice scores at or above the median score at baseline MRI, and 0.67 for scores below the median (P = 0.34). The mean AUC was 0.69 for Dice scores at or above the median score in the follow-up MRI, vs. 0.67 for scores below the median (P = 0.89).
In the testing cohort, the median (IQR) AUC of the response assessment model was 0.76 (0.71-0.84) for deep learning-derived tumor volumes and 0.77 (0.74-0.86) for FTV (Fig. 5). There was no significant difference in AUC between the two models (P = 0.66). Per-hospital performance varied, with the worst performance observed at the hospital from which the training set originated (hospital 1) (Table 3).
Subtype and volumetric features contributed equally to the performance of both the nnU-Net-based model and the FTV-based model (Table 4).

Discussion
This retrospective observational cohort study demonstrated that tumor volumes derived from deep learning on DCE-MRI are associated with RCB and can discriminate between good responders (RCB-0/I) and bad responders (RCB-II/III) to NAC. The performance is on par with response assessments derived from FTV. Advantages of the deep learning method over FTV are that it is fully automated and that it generalized well to other institutions and MRI units from different vendors without manual tuning. In addition, it appears to be robust to a variety of scan parameters.
The previously reported finding that FTV is associated with RCB in patients undergoing NAC for locally advanced breast cancer was successfully reproduced in the current study using a fully independent and separate implementation with different subjects. 9 The FTV method requires limited user interaction, but the current study also confirmed that FTV requires parameter tuning per institution. 9,24 Optimal quantification of FTV may require manual corrections to account for enhancing nontumor regions. No manual corrections to FTV or deep learning segmentations in the breast region were performed, to closely follow the reported methodology for FTV. 15,25 By not performing any manual adjustments, the potential influence of interobserver effects is also reduced. However, not performing manual adjustments can have a negative impact on assessment of FTV: a previous study reported that uncorrected measurements may lead to overestimation of FTV in some hospitals. 15 In the current study, no obvious over-segmented structures were observed.
In pursuit of methods to further automate the assessment of RCB during NAC, the nnU-Net was not trained on FTV but on segmentations from a previously developed constrained volume-growing approach. This method was previously found to be correlated with disease volume at histopathology. 16 The current study confirmed this underlying relationship by the correlation of the segmented disease volumes with the residual disease in the histopathological resection specimen.
The deep learning-derived segmentations also showed deviations from the ground truth. For example, some cases labeled as complete remission by radiologists (rCR) were considered to have residual disease by the nnU-Net. Such differences could be due to differences in how residual MRI enhancement is interpreted by radiologists and by the nnU-Net, and show the potential complementary value of the presented method. Baseline scans with substantial background parenchymal enhancement (BPE) were also associated with an increased rate of deviations in the segmentation. To assess the impact of these deviations on the accuracy of the assessment of RCB in the extremely randomized trees model, their effect on the AUC of the model was analyzed. There were no significant differences in AUC in cases with a high Dice score compared to cases with a low Dice score. These results suggest that the model for RCB response is robust to deviations in tumor segmentation.
Multiparametric MRI was not consistently available in the training cohort of the current study. To further improve tumor segmentation, multiparametric breast MRI could be employed in future studies. For instance, high temporal resolution DCE-MRI has been reported to be less affected by BPE for lesion detection, because malignant tissue typically demonstrates earlier enhancement than parenchyma. 26 [...] 34,35 These methods typically require manual or semi-automated tumor segmentation. The current study could potentially provide these segmentations.

Limitations
First, the study has a limited sample size. Although it tested the performance of a fully automated deep learning-based model to assess RCB in an independent cohort of patients across multiple hospitals, it was not able to reliably stratify the performance by breast cancer subtype. Second, due to limited image quality in the axillary region, only the breast region could be taken into account, not the axillary lymph nodes. RCB takes both the primary tumor bed and lymph node status into account. By adding information to the model about post-treatment lymph node status, for example, by using better suited imaging modalities such as ultrasound or by using the MARI procedure to sample the sentinel lymph node, assessment of RCB can potentially be further improved. 36,37 Furthermore, only two MR vendors were included and all MRI was fat-suppressed. Finally, this is a retrospective study spanning a large time period. Besides the general biases of any retrospective study, the included patient population has changed significantly over time, as evidenced by differences in subtypes and treatments between the training and testing cohorts, which were taken from different time periods. Future work will focus on validation in larger cohorts. In addition, other candidate predictors of tumor response, such as radiomic features, can be added to the model. 12,13

Conclusion
A deep-learning model based on nnU-Net was developed to estimate changes in tumor load on DCE-MRI that are associated with RCB after NAC.The response assessment is on par with that derived using FTV, a previously validated method, but it is fully automated and therefore observer independent.The performance of the model appears to be robust across multiple institutions.

FIGURE 2: Flowchart of patient selection in the testing cohort.

FIGURE 1: Flowchart of patient selection in the training cohort.LABC = locally advanced breast cancer; NAC = neoadjuvant chemotherapy.

FIGURE 3: Flowchart illustrating the process in which nnU-Net segmentation quality was assessed on the training set. The nnU-Net is trained based on segmentations obtained using a semi-automated region grower. The resulting testing segmentations are compared to the ground truth and to functional tumor volume (FTV). The figure illustrates the process where fold 1 is the training fold and fold 2 is the test fold.

FIGURE 4: Illustration of response to neoadjuvant chemotherapy on T1-weighted MRI of a 72-year-old patient with estrogen receptor positive (ER+) cancer in the right breast. The left column shows the baseline scan, the right column follow-up. The top row shows a representative axial slice, the bottom row shows the nnU-Net segmentation as an overlay on that slice (purple outline). All images shown are subtractions of the first postcontrast image and the precontrast image. The tumor volume at baseline was 15,625 mm³ and the tumor volume on follow-up was 975 mm³. On final histopathology, the patient had a residual cancer burden (RCB) score of 1.54 corresponding to class RCB-II.

FIGURE 5: Receiver operating characteristic (ROC) curves for the extremely randomized trees model assessing response to neoadjuvant chemotherapy (NAC), which is defined in terms of RCB-0/I on the pathological resection specimen versus residual cancer (RCB-II/III). Both models use tumor volume at the follow-up scan, change in tumor volume from baseline MRI, and tumor subtype as features. The tumor volumes are established using either nnU-Net or functional tumor volume (FTV). AUC = area under curve; RCB = residual cancer burden.

TABLE 1: Patient, Tumor, and Treatment Characteristics in the Training and Testing Cohorts. Bilateral Breast Cancer is Divided into Left and Right Breast Separately

TABLE 2: Origin of Substantial Deviations in Deep-Learning Segmentations in the Training Cohort (After Cross-Validation)

TABLE 3: Comparison of RCB Response Classification Performance on a Per-Hospital Basis

TABLE 4: Feature Importance in the Response Assessment Models Based on Extremely Randomized Trees. FTV = functional tumor volume.