Automating in vivo cardiac diffusion tensor postprocessing with deep learning–based segmentation

In this work we develop and validate a fully automated postprocessing framework for in vivo diffusion tensor cardiac magnetic resonance (DT-CMR) data powered by deep learning.


| INTRODUCTION
Diffusion tensor cardiac magnetic resonance (DT-CMR) is an emerging technique that provides information on myocardial microstructure. Many diffusion parameters can be extracted from the tensors, including tensor orientation and rotationally invariant measures. For tensor orientation, it is common to report the primary and either secondary or tertiary eigenvector orientations, which have been shown to relate to the orientation of the local cardiomyocytes and their sheetlet structure, respectively. 1 These angles are commonly known as the helix angle (HA) and the sheetlet angle or secondary eigenvector angle (E2A), respectively. In terms of rotationally invariant measures, it is common to report fractional anisotropy (FA), a measure of diffusion organization, and mean diffusivity (MD), which measures the magnitude of diffusion. Both tensor orientation and rotationally invariant measures have been shown to change in cardiac disease. [2][3][4][5][6] The computation of diffusion tensors from in vivo data is particularly challenging in the heart. Diffusion imaging yields low-SNR images, and it is common to compensate with long acquisitions spanning multiple breath-holds. For most data sets, image registration is needed to correct for residual image shifts due to intra- and inter-breath-hold inconsistencies. This registration is challenging due to the different diffusion weightings and encoding directions between images, which translate into differences in signal intensities and contrast, not only in the myocardium but also in the neighboring structures of the chest wall, liver, and stomach (Figure 1A). Additionally, these peripheral structures do not move rigidly with the heart during the respiratory cycle, further complicating registration.
In addition to the challenges of registration, a typical DT-CMR study of one midventricular slice scanned at two stages of the cardiac cycle will typically contain more than 160 images. Approximately 14% of the images are corrupted by signal loss due to cardiac and respiratory motion and are usually manually excluded based on visual assessment ( Figure 1B). 7 In vivo DT-CMR postprocessing is typically done retrospectively. Multiple steps require manual input, including removing frames corrupted by motion artifacts, and thresholding and segmentation of the left ventricular (LV) myocardial borders. Thresholding removes unwanted structures and background voxels, whereas segmentation not only allows quantification of diffusion parameters in the LV myocardium, but it is also used to define the local cardiac coordinate system consisting of the longitudinal, circumferential, and radial orthogonal directions ( Figure 1C). 8,9 Visualization of tensor orientation measures in relation to the local cardiac coordinates aids in the interpretation of the tensor results.
In this work we develop and validate a fully automated postprocessing framework for in vivo DT-CMR data powered by a convolutional neural network (CNN) trained to perform semantic segmentation. The aim is to improve robustness, reduce human postprocessing workload, and ultimately to produce on-the-fly DT-CMR results at the scanner, providing real-time visual feedback on scan quality.

| METHODS
In vivo DT-CMR postprocessing requires several steps before tensor calculation. In summary, the data need to go through the following steps: (1) removal of data corrupted by motion-induced signal loss, (2) image registration of the remaining diffusion images, (3) thresholding and segmentation of the LV myocardial borders, and (4) voxel-wise calculation of the diffusion tensors ( Figure 2A).
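The voxel-wise tensor calculation of step 4 follows the standard log-linear least-squares approach; the sketch below is a minimal NumPy illustration of that standard method, not the authors' MATLAB implementation.

```python
import numpy as np

def fit_tensor(signals, bvals, bvecs):
    """Log-linear least-squares fit of one diffusion tensor.
    signals: (n,) magnitudes; bvals: (n,) in s/mm^2; bvecs: (n, 3) unit vectors."""
    g = np.asarray(bvecs, float)
    b = np.asarray(bvals, float)
    # Design matrix for ln(S) = ln(S0) - b * g' D g, with D packed as
    # [Dxx, Dyy, Dzz, Dxy, Dxz, Dyz] plus a ln(S0) column.
    A = np.column_stack([
        -b * g[:, 0] ** 2, -b * g[:, 1] ** 2, -b * g[:, 2] ** 2,
        -2 * b * g[:, 0] * g[:, 1], -2 * b * g[:, 0] * g[:, 2],
        -2 * b * g[:, 1] * g[:, 2],
        np.ones_like(b),
    ])
    x, *_ = np.linalg.lstsq(A, np.log(signals), rcond=None)
    return np.array([[x[0], x[3], x[4]],
                     [x[3], x[1], x[5]],
                     [x[4], x[5], x[2]]])

def md_fa(D):
    """Mean diffusivity and fractional anisotropy from a 3x3 tensor."""
    ev = np.linalg.eigvalsh(D)
    md = ev.mean()
    fa = np.sqrt(1.5 * np.sum((ev - md) ** 2) / np.sum(ev ** 2))
    return md, fa
```

In practice the fit is applied independently to every voxel inside the segmented myocardium.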
A CNN for semantic segmentation was used as the basis to automate and enhance some of the stages of the DT-CMR tensor calculation workflow ( Figure 2B). A U-Net-based CNN was designed with five encoder/decoder levels. A diagram of its architecture is depicted in Figure 3A. A U-Net was chosen, as it is known to produce good results for biomedical data without requiring large training data sets. 10 The specific network design was configured ad hoc to yield the highest Dice scores.
The U-Net was trained to segment multiple classes: the LV myocardium; the right ventricular myocardium and papillary muscle; the two interventricular insertion points; and the remaining pixels as background (Figure 3B). For training, the positions of the ventricular insertion points, previously designated by the user, were converted to circular regions with a 6-pixel radius, excluding any pixels inside the LV myocardium. The insertion point regions predicted by the U-Net therefore have similar shapes, and the centroid of each region was taken as the position of the predicted insertion point.
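The conversion of a marked insertion point to a circular training region, and of a predicted region back to a point, can be sketched as follows (a hypothetical NumPy illustration; `insertion_point_mask` and `centroid` are not the authors' code):

```python
import numpy as np

def insertion_point_mask(point, lv_mask, radius=6):
    """Circular region of the given pixel radius around a user-marked
    insertion point, excluding pixels inside the LV myocardium."""
    ny, nx = lv_mask.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    disk = (yy - point[0]) ** 2 + (xx - point[1]) ** 2 <= radius ** 2
    return disk & ~lv_mask

def centroid(mask):
    """Centroid of a binary region, used as the predicted point position."""
    ys, xs = np.nonzero(mask)
    return ys.mean(), xs.mean()
```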
To train the U-Net, we used previously collected DT-CMR scans acquired with a STEAM-EPI sequence. The data included a total of 492 scans (348 healthy scans and 144 scans from patients with either hypertrophic cardiomyopathy [HCM], dilated cardiomyopathy, or cardiac amyloidosis). This study was approved by the National Research Ethics Service. Data were acquired during either the diastolic or the systolic pause. All data had been previously examined for artifacts, and the LV myocardium segmented by 1 of 2 experienced clinicians and subsequently checked/edited by an experienced physicist. All available data were primarily focused on the LV myocardium, although, when thresholding the heart during the analysis, the right-ventricular (RV) muscle was also segmented. The imaging protocol has been described previously, 11,12 and in summary all imaging was performed using a Skyra 3T MRI scanner (Siemens AG, Erlangen, Germany) with a STEAM-EPI sequence with zonal excitation and fat saturation, and TR = 2 RR intervals.
Each DT-CMR scan consists of multiple images with different diffusion weightings acquired in a short-axis slice at one cardiac phase. A typical clinical scan will contain approximately 80 images per slice and cardiac phase. The network input images were the mean image of each scan after spatially registering the entire diffusion series. This produces magnitude input images with a much higher SNR than the individual diffusion images; these are referred to here as scan-mean images.
Once successfully trained, the U-Net was integrated within the in vivo DT-CMR postprocessing at different stages of the workflow. It was used primarily to enhance the image registration, to identify frames corrupted with motion-induced signal loss, and to segment the LV myocardial region for quantification of diffusion tensor parameters (Figure 2B). These steps were compared with the current established method of postprocessing the DT-CMR data with manual input from an experienced user, which was considered as the ground truth (Figure 2A).

FIGURE 1 A, Diffusion images showing the heart and neighboring structures: the chest wall (yellow arrow), the liver (blue arrow), and the stomach (red arrow) at different diffusion weightings (b-value units, s mm−2) and encoding directions. B, Three diffusion frames with varying degrees of motion-induced signal-loss artifacts. C, Local cardiac orthogonal coordinate system diagram.
Finally, this work concluded by comparing the DT-CMR results of the fully automated workflow against the manual user workflow for the independent test data set, as described below. The trained U-Net can be downloaded from https://github.com/Pedro-Filipe/DT_CMR_short_axis_conv_net.

| Left ventricular myocardial segmentation
The first step of this work was to train the U-Net to segment/identify the five different regions/locations shown in Figure 3B. The scans were divided into three groups: training data with 393 (80%) DT-CMR scans; validation data with 29 (6%) scans, used to assess the network weights at the end of each training epoch; and the test data set with 70 (14%) scans, used to test the final CNN weights. The data distribution discretized by disease is shown in Supporting Information Figure S1. Data augmentation with random uniform distributions was used to improve training: rigid translation (range: 10% of FOV) and rotation (range: ±0.6 radians). These ranges were chosen to constrain these operations to realistic levels. A generalized Dice loss function 13 was used to mitigate class imbalance and assess the proportion of specific agreement between the experienced user and the U-Net segmented masks. Batch normalization layers were added to the U-Net, as they are known to reduce overfitting, and when tested on our data they were found to improve the segmentation Dice scores of the test data. For training we used the Adam optimizer 14 (learning rate: 0.001; epsilon: 1e-7), a batch size of four images, and 200 epochs. The hyperparameters were tuned until we could no longer improve the Dice scores on the validation data, with sufficient epochs for the loss function to reach a plateau.
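The generalized Dice loss of Sudre et al. 13 weights each class by the inverse of its squared reference volume, which is what mitigates the heavy class imbalance (most pixels are background). A minimal NumPy sketch of that published loss, not the authors' TensorFlow code, is:

```python
import numpy as np

def generalized_dice_loss(y_true, y_pred, eps=1e-7):
    """Generalized Dice loss over a one-hot ground truth y_true and a
    softmax prediction y_pred, both shaped (npixels, nclasses).
    Class weights are the inverse squared reference volumes."""
    w = 1.0 / (y_true.sum(axis=0) ** 2 + eps)
    intersect = (w * (y_true * y_pred).sum(axis=0)).sum()
    union = (w * (y_true + y_pred).sum(axis=0)).sum()
    return 1.0 - 2.0 * intersect / (union + eps)
```

A perfect prediction drives the loss toward 0, while an uninformative (uniform) prediction yields a clearly higher value.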
The DT-CMR postprocessing was carried out using MATLAB (MathWorks, Natick, MA), and the U-Net training/testing was performed using Python TensorFlow (Google, Mountain View, CA). A workstation with Ubuntu 18.04, two 8-core Intel Xeon CPUs, 12 GB of RAM, and an NVIDIA Quadro P6000 GPU was used for the GPU-based training.

| Registration of diffusion-weighted images
The U-Net was originally trained with the scan-mean images. To enhance registration, we attempted to use the same U-Net to segment the heart in each individual diffusion image. The segmentation masks derived from the U-Net were used to remove all peripheral structures outside the heart (ie, everything outside the U-Net-defined LV + RV + papillary region). The segmentation information was also used to increase the image intensity toward the LV myocardial region with a Gaussian filter (SD = 40% of myocardial diameter), to weight the registration favorably toward the LV at the cost of the right ventricle and to avoid sharp edges when masking. For the segmentation to work reliably, the diffusion images were strongly denoised with a nonlocal means algorithm and magnitude scaled before applying the U-Net. 15 A degree of filtering of 8 times the estimated noise was used in the denoising. This level of denoising was empirically found to smooth the image without adverse blurring. The degree of filtering controls the rate of decay of the weights in the nonlocal means algorithm, 15 while the noise variance estimation is described by Immerkaer et al. 16 The denoised diffusion frames were used only to aid segmentation and classification as described in the next section, but were not used for any diffusion tensor calculations. A multiresolution subpixel rigid image registration (translation only) by cross-correlation of the magnitude signal was subsequently performed. 17 This registration workflow, named here Reg U-Net, was compared with the image registration workflow currently used, in which a user-defined square region around the heart is cropped before using the same rigid registration algorithm. That workflow is referred to in this work as Reg. A flowchart of these methods is shown in Supporting Information Figure S2 for clarity.

FIGURE 3 A, U-Net architecture used. B, U-Net classes for one example. Abbreviations: LV, left ventricular; RV, right ventricular.
The image in the diffusion series with the highest mean intensity was used as the reference frame for registration.
Differences in the two image registration algorithms (Reg vs Reg U-Net) were quantified using the registered LV myocardial masks and by calculating the median Dice coefficient of all the diffusion images in the scan in comparison to the reference frame.
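The translation-only registration by cross-correlation, and the Dice scores used to compare the two workflows, can be illustrated with a whole-pixel NumPy sketch (the actual method of reference 17 is multiresolution and subpixel; this is only the underlying idea):

```python
import numpy as np

def translation_by_xcorr(ref, img):
    """Integer-pixel rigid translation estimate by FFT cross-correlation.
    Returns the (dy, dx) shift that, applied to img with np.roll, aligns
    it to ref (circular boundary conditions assumed)."""
    xcorr = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(img))).real
    dy, dx = np.unravel_index(np.argmax(xcorr), xcorr.shape)
    # Wrap the shifts into the signed range [-N/2, N/2)
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

def dice(a, b):
    """Dice overlap between two binary masks, used to score registration."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
```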

| Identification of data corrupted by motion artifacts
The next step in using the U-Net mask information in the processing workflow was to identify the images corrupted with motion artifacts. After image registration we recalculated the scan-mean image, which was used here as the reference image for identifying motion-corrupted images. Two different metrics were calculated for each diffusion image: the Pearson coefficient of the heart region between the scan-mean image and each registered diffusion image, which measures the image similarity, and the Dice coefficient of the LV myocardium masks, which measures the similarity of the masks. If an image contains significant signal loss in the heart, then these two metrics are likely to yield values considerably lower than 1. An outlier algorithm was then applied to the mean of the two metrics. The outlier algorithm consisted of finding images in which the mean score is more than three scaled median absolute deviations below the median. These images are excluded and not used for tensor calculation. An example of this workflow is shown in Supporting Information Figure S3. The Pearson correlation coefficient could be used to identify the corrupted frames without the need to mask the images, but this method would be sensitive to contrast changes within the heart due to the different diffusion weightings and would be affected by peripheral organs. By applying the mask obtained from the U-Net to all diffusion images, we can remove the other organs and combine the Pearson coefficient with the Dice score of the LV masks to increase the robustness of identifying images with artifacts automatically.
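The outlier rule above (flagging images whose score falls more than three scaled median absolute deviations below the median, as in MATLAB's `isoutlier` with the median method) can be sketched as:

```python
import numpy as np

def flag_corrupted(scores, k=3.0):
    """Flag frames whose quality score (here, the mean of the Pearson and
    Dice metrics described above) is more than k scaled median absolute
    deviations below the median."""
    med = np.median(scores)
    # 1.4826 makes the MAD a consistent estimator of the SD for normal data
    smad = 1.4826 * np.median(np.abs(scores - med))
    return scores < med - k * smad
```

The one-sided test matters: only unusually low scores indicate signal loss, so unusually high-scoring frames are never rejected.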
The efficiency of the classification was measured with an F1 score. The F1 score is a measure of accuracy for a binary classification. It accounts for both the precision P and the recall R and is defined as the harmonic mean of these two measures:

F1 = 2PR / (P + R).
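From a binary confusion matrix, precision, recall, and F1 follow directly:

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR/(P+R): harmonic mean of precision P = tp/(tp+fp)
    and recall R = tp/(tp+fn), computed from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```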

| Diffusion tensor results
Finally, the entire fully automated, artificial intelligence (AI)-aided in vivo DT-CMR postprocessing workflow was compared with the current manual processing by an experienced user on the test data set. After registration, removal of corrupted frames, and tensor calculation, the U-Net was finally used for the task for which it was originally trained: to automatically segment the LV in order to quantify the DT-CMR parameters in the left ventricle. The LV epicardial border was also used to calculate the local heart orthogonal coordinate system for each voxel in the LV, and to obtain maps of tensor orientation in relation to the local cardiac coordinates, providing the HA and E2A maps.
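As a simplified illustration of the local cardiac coordinate system (the actual implementation derives the directions from the segmented LV epicardial border, not from a single center point as assumed here):

```python
import numpy as np

def local_coordinates(voxel_yx, center_yx):
    """Sketch of the local cardiac coordinates for a short-axis voxel:
    longitudinal = slice normal, radial = outward from the LV center,
    circumferential = orthogonal to both."""
    longit = np.array([0.0, 0.0, 1.0])             # slice normal
    r = np.array([voxel_yx[1] - center_yx[1],      # x component
                  voxel_yx[0] - center_yx[0],      # y component
                  0.0])
    radial = r / np.linalg.norm(r)
    circ = np.cross(longit, radial)
    return longit, circ, radial
```

The HA and E2A maps are then the angles of the projected first and second eigenvectors relative to these axes.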
The agreement in tensor parameters between the experienced user and the fully automated U-Net method was analyzed with Bland-Altman plots and by calculating voxel-wise mean absolute errors for the overlapping region of the myocardial masks between the two methods.
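The Bland-Altman coordinates and the voxel-wise mean absolute error over the overlapping masks amount to the following (a minimal NumPy sketch):

```python
import numpy as np

def bland_altman(a, b):
    """Per-subject Bland-Altman coordinates: mean of the paired measures
    plotted against their difference."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (a + b) / 2.0, a - b

def masked_mae(map_a, map_b, mask_a, mask_b):
    """Voxel-wise mean absolute error restricted to the overlap of the
    two myocardial masks, as used for the user-vs-U-Net comparison."""
    overlap = mask_a & mask_b
    return np.abs(map_a[overlap] - map_b[overlap]).mean()
```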
Data were tested for normality with a Kolmogorov-Smirnov test and were found throughout to be nonnormal; therefore, nonparametric statistics were used, with a Wilcoxon signed-rank test to assess statistical differences between the user and the U-Net. Measures are quoted as median [interquartile range].
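A sketch of this statistical comparison with SciPy (assuming `scipy` is available; this is not the authors' MATLAB code):

```python
import numpy as np
from scipy import stats

def compare_paired(user_vals, unet_vals):
    """Kolmogorov-Smirnov normality check on the paired differences,
    Wilcoxon signed-rank test, and a median [IQR] summary."""
    d = np.asarray(user_vals, float) - np.asarray(unet_vals, float)
    # KS test of the standardized differences against a standard normal
    _, p_norm = stats.kstest((d - d.mean()) / d.std(ddof=1), "norm")
    _, p_diff = stats.wilcoxon(user_vals, unet_vals)
    q25, q50, q75 = np.percentile(user_vals, [25, 50, 75])
    return p_norm, p_diff, (q50, (q25, q75))
```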

| RESULTS

| Left ventricular myocardial segmentation
Training of the U-Net plateaued on the validation data well within 200 epochs and took less than an hour (Figure 4A shows the training progress). The network weights were updated for the last time at epoch 84; after that, the segmentation of the validation data did not improve, and it is likely that the remaining training was overfitting.
The U-Net achieved an overall median Dice coefficient of 0.93 [0.92, 0.94] in the test data set for the segmentation of the LV myocardial region when compared with an experienced user (Figure 4B). The two cases in which the U-Net segmentation differed considerably from the user are shown in Figure 5B,F. The HCM case (Figure 5F) was removed from the remaining analysis, as the segmentation was inadequate. All remaining 69 test cases were used for further analysis. Measuring the axis from the LV center to the insertion points, a median angular difference of 10º [6, 14] was measured for the insertion point positions between the experienced user and the U-Net. Like the Dice scores, we measured a similar range of angular differences among the cohorts: healthy 10º [6, 14]; dilated cardiomyopathy 10º [5, 16]; HCM 11º [5, 16]; and amyloidosis 8º [6, 11].
Please see Supporting Information Video S1 for a comparison of the segmentation masks between the user and the U-Net of all 70 test cases.

| Registration of diffusion-weighted images
Dice scores were used to test whether the U-Net was capable of segmenting the denoised individual diffusion images as effectively as the scan-mean images. Overall, a median Dice score of 0.94 [0.92, 0.95] was measured for the segmentation of the denoised individual diffusion images when compared with the corresponding scan-mean image segmentations (after registration). The trained U-Net was therefore capable of segmenting the denoised individual diffusion images.
We also compared the two registration methods using the Dice scores between the LV masks of each registered diffusion image and that of the reference frame. The median Dice coefficients for Reg and Reg U-Net were 0.91 [0.89, 0.92] and 0.93 [0.90, 0.94], respectively. A better registration was therefore measured for the Reg U-Net method, as demonstrated by a significant improvement in the Dice scores when compared with the existing Reg method (P < .0001) (Supporting Information Figure S4). For most cases the improvement in the Dice score was small (median improvement 0.012 [0.005, 0.027]), but there were a few cases in which the Reg U-Net method visibly improved registration. This was due to the removal of bright bordering structures around the heart, using the U-Net-derived masks, before the registration algorithm. One of these cases is shown in Figure 6, with a Dice improvement of 0.06.

| Identification of data corrupted by motion artifacts
The next processing step was to use the U-Net to identify images corrupted with significant signal loss artifacts. The confusion matrix comparing the experienced user (ground truth) versus the U-Net method is shown in Figure 7. As expected, the data are imbalanced with 94% of the data considered to be good by both user and U-Net. An F1 score of 0.70 was measured for the U-Net. As anticipated, the mismatches between the user and U-Net were the borderline cases, with partial signal loss only.

| Diffusion tensor results comparison: user versus machine
Finally, we compared the resulting LV DTI parameters between the two postprocessing workflows. The median DTI parameters are given in Table 1, separated by cardiac phase and cohort. When comparing the overall results between user and U-Net, there was a small but strongly significant drop in FA (P < .0001), a small significant difference in absolute E2A (P = .011), and no statistically significant difference for MD (P = .12). Figure 8A shows Bland-Altman plots that visually compare these parameters and indicate the corresponding biases. A voxel-wise comparison was also performed in the overlapping myocardial masks between the two methods. We found the following intersubject mean absolute errors: FA 0.04 [0.03, 0.06]; MD 0.05 × 10−3 mm2 s−1 [0.03, 0.09]; E2A 7.7º [6, 10]; and HA 7.9º [5.5, 12]. Figure 8B shows this same analysis divided by cohort.

FIGURE 4 A, Loss function evolution during training. B, Histogram of the Dice coefficients between the trained network and an experienced user for the LV myocardial segmentation in the test data.
For some cases there was a visible difference in the DTI parameter maps when using the U-Net. Figure 9A,B shows a healthy control case in which conspicuous artifacts can be seen at the edges of the DTI maps for the manual processing, produced by residual misregistration of some of the images and resulting in notably higher FAs and lower MDs at the myocardial edges, and a less circularly symmetric HA than expected from previous healthy studies. Figure 9C,D shows a clinical amyloidosis case in which the manual processing, due to human error, failed to identify and remove diffusion-weighted images with a motion-induced signal loss artifact close to the anterior insertion point, resulting in higher FAs and MDs and abnormal HA values in this region. The fully automated method detected and rejected these corrupted images before tensor calculation, resulting in more homogeneous DTI values in the left ventricle.

| DISCUSSION
The fully automated DT-CMR analysis, aided by a CNN, performed effectively, yielding similar results to an experienced user. The myocardial segmentation learned to correctly exclude papillary muscle and RV trabeculation from the LV myocardium, even in patient data with more varied morphology, including thin dilated cardiomyopathy and asymmetric HCM hearts. The only exceptions, in which the U-Net differed considerably from the user, were the two examples shown in Figure 5B,F. This is likely due to the very conspicuous EPI geometric distortion in the lateral wall and the very narrow LV cavity of the HCM scan. The U-Net was also capable of localizing the LV-RV insertion points.
This can be used to automatically divide the LV myocardium into different regions, such as the American Heart Association standardized segmentation model. 18 The use of CNNs for cardiac segmentation is currently a very active topic in cardiac MRI. Most of the published work focuses on segmenting the left and right ventricles in cine imaging. [19][20][21][22] To the best of our knowledge, this is the first published CNN trained specifically for in vivo STEAM DT-CMR data. The final aim is not only LV segmentation, but to use this capability to automate and improve the entire postprocessing workflow of DT-CMR data.
When an experienced user processes the DT-CMR data manually, the RV region is thresholded but not carefully delineated. This is due to our focus primarily on the LV myocardium in previous in vivo studies. Therefore, training the U-Net to segment the RV region was not as effective, and it unsurprisingly results in a lower agreement between the user and the U-Net, with a Dice score of 0.85 [0.80, 0.89]. All of the DT-CMR results quoted in this work are for the LV region only; therefore, the lower RV accuracy is noteworthy but less important than the LV accuracy.

FIGURE 6 Example of Reg U-Net performing noticeably better than the current Reg method for a particular diffusion-weighted image. A, Reference frame. B, Frame to be registered to the reference frame. C,D, Checkerboard composite of the two registration techniques. E,F, False color comparison of the myocardial masks for the two registration techniques.
Using a trained U-Net capable of segmenting the heart, and in particular the LV myocardial ring, before performing a rigid image registration algorithm improves the results of in vivo DT-CMR scans. Improvements are greatest when there are bright structures adjacent to the heart, commonly the chest wall, the liver, and the stomach. These organs do not move rigidly with the heart during the respiratory cycle; thus, image registration requires nonsmooth displacement field algorithms. 23 These methods have been shown to handle respiratory motion in the chest but are typically computationally expensive and thus may not be practical to implement with the large number of images in a typical DTI scan. The AI segmentation therefore provides a valid alternative to refine image registration toward a robust pixelwise diffusion tensor calculation.
The detection and removal of diffusion images corrupted with motion-induced signal-loss artifacts was also possible to automate with the aid of the U-Net. While scanning, we prescribe diffusion weightings that will keep the myocardial signal above the noise floor. If the heart contains significant fibrotic tissue or any other microstructural change that leads to higher diffusion-induced signal loss, then ideally the b-value should be reduced to keep a reasonable amount of myocardial signal. This is important in order to automatically separate diffusion-encoding signal attenuation from more severe artifactual motion-induced signal loss. The lower magnitude signals caused by high diffusivity will primarily lower the Pearson coefficients. Our method, which also considers the LV myocardial mask and a threshold of three scaled median absolute deviations below the median score, is robust to the level of diffusion signal loss encountered routinely in our clinical scans. Overall, 6% of the diffusion images were deemed to be corrupted with signal loss by both the user and the U-Net. This is less than in a previous study from our group, in which we found a 14% ratio. 7 This is likely due to the increase in number and image quality of our scan database. In that previous study we used a support vector machine approach to detect these corrupted frames. We have also experimented with a simple classification neural network. 24 Both previous methods were tested on less data and yielded higher F1 scores of 0.8 and 0.9, respectively, compared with the F1 score of 0.7 for the U-Net approach used here. The U-Net approach was nevertheless selected because of the SNR sensitivity of the previous methods when analyzing image features. They were found to be sensitive to SNR changes and not just myocardial signal loss. For a small number of low-SNR scans, both previous methods rejected most of the images. Even though overall, in the total pool of test images, the match between the human observer and these previous methods was better, this SNR bias made us less confident in their clinical robustness. A similar human SNR bias, and the subjective nature of identifying the most borderline cases (for example, the partial signal loss image shown in Figure 1B), can perhaps explain the higher F1 scores of the previous methods trained on the "ground truth" provided by previous human visual assessment. In summary, using a DT-CMR-trained CNN not only enabled the automation of postprocessing but also increased the quality of the diffusion tensor calculation for some scans.

FIGURE 8 A, Bland-Altman plots for median fractional anisotropy, mean diffusivity, and sheetlet angle. The vertical lines and respective numbers represent the median and the 5% and 95% quantiles of the cohort. B, Voxel-wise difference in the overlapping myocardial masks. The data points indicate the mean absolute error for each subject discretized by cohort, and the bars indicate the respective interquartile ranges.
The U-Net improved the registration, resulting in less edge artifacts, and this is potentially a reason for the small but significant reduction in FA when using the U-Net, as generally these edge artifacts contain high FA values, as shown in Figure 9.
An important point when using AI is the issue of training networks to meet an expectation that may differ from the truth. This is especially important when using AI to reconstruct MR images, but also important for semantic segmentation networks, as the final results are influenced by the segmentation masks. The training data segmentation masks were drawn by 1 of 2 experienced clinicians and subsequently checked by an experienced physicist. We also have a mixture of cardiac phases and heart morphologies. This heterogeneity should reduce some of the learning bias. However, the ground truth for the data is unknown, and therefore it is likely that the network assimilated human error and bias.

FIGURE 9 For the healthy control case (A,B), conspicuous artifacts can be seen at the edges of the DTI maps for the manual processing, produced by residual misregistration of some of the images and resulting in notably higher fractional anisotropy values and lower mean diffusivity values (×10−3 mm2/s) at the myocardial edges (arrows), and a less circularly symmetric helix angle (º). For the clinical amyloid case (C,D), the manual processing, due to human error, failed to identify and remove some diffusion-weighted images with a motion-induced signal loss artifact close to the anterior insertion point, resulting in higher fractional anisotropy values and mean diffusivity values, and unexpected helix angles in this region (arrows).

A potential area to explore in future work is to use the uncertainty of the U-Net segmentation to detect other image artifacts. For example, the EPI readout of the STEAM sequence occasionally leads to geometric distortions in regions with field inhomogeneities. This is typically seen in the lateral wall, as in Figure 5B. The U-Net is expected to be less certain in artifact areas (ie, closer probabilities among the different classes). An uncertainty map could therefore be a valuable tool for visualizing image quality and detecting regions affected by artifacts.
We are currently porting this postprocessing pipeline to the Gadgetron 25 framework, to enable real-time online reconstruction of DT-CMR results for on-the-fly feedback, packaged as DICOM files with the rest of the study. In terms of processing time, the added denoising and U-Net segmentation introduce a significant overhead of approximately 200%, although the automated workflow is likely to be quicker overall once the manual input time needed to remove corrupted frames, threshold the heart, and segment the left ventricle is also considered.
The main limitation of this work is the sparse availability of in vivo DT-CMR data, currently acquired in only a few specialized centers around the world. To be effective, deep learning requires a large number of data sets for training and validation. This limitation can be mitigated to a certain extent with data augmentation techniques, as used in this work. We have used data acquired by our group only; it contains healthy controls and cardiomyopathy scans. Other cardiac diseases, such as myocardial infarction, were not included at this time due to the low number of scans available. The automatic postprocessing is therefore untested for this condition, although data availability will increase as we continue our DT-CMR research. Additionally, the trained U-Net is currently untested for different scanners, sequences, and protocols. Motion-compensated spin-echo methods are often used in other centers as an alternative to the STEAM sequence. [26][27][28] These will result in different contrasts, particularly in the blood pool; however, widely used transfer learning-based methods can be applied for multiscanner and multicenter studies. Other conditions, such as congenital heart disease, present a large spectrum of more exotic heart morphologies and are therefore likely to be more challenging to automate. Because of these limitations, we have implemented a second AI-aided mode in which the postprocessing presents the initial AI guess and waits for user acceptance or correction.

| CONCLUSIONS
The postprocessing of DT-CMR was successfully automated with a trained CNN, supporting real-time results and reducing human workload. The automatic segmentation of the heart improved image registration, resulting in improvements of the calculated diffusion tensor parameters.