Predictive uncertainty in deep learning–based MR image reconstruction using deep ensembles: Evaluation on the fastMRI data set

To estimate pixel‐wise predictive uncertainty for deep learning–based MR image reconstruction and to examine the impact of domain shifts and architecture robustness.


INTRODUCTION
Deep learning (DL)-based reconstruction of MRI data promises to substantially reduce acquisition times, improve diagnostic image quality, and support quantitative image analysis. Numerous applications for DL-based reconstruction have been proposed using a variety of techniques including image enhancement, [1][2][3] physics-based unrolled networks, [4][5][6][7] k-space learning, 8,9 transform learning, 10 and hybrid learning networks. 11,12 However, image reconstruction from undersampled data bears the risk of reconstruction errors that can alter the image content and potentially have adverse effects on the diagnostic process. In contrast to conventional MR reconstruction techniques such as parallel imaging 13,14 or compressed sensing, 15 DL techniques may obscure reconstruction errors, which are thus difficult to detect. A further major challenge in these DL-based approaches lies in the limited transparency and explainability of algorithm predictions, which can make the detection and understanding of reconstruction errors even more difficult. Moreover, most methods are task-agnostic and not well calibrated to changing applications or imaging data.
Errors in DL-based MR reconstruction are more likely to occur in situations in which acquired input data differ substantially from the data used for model training. Such distribution shifts can have various origins, such as changes in sequence parameters, examination of different body parts, or patient-related factors such as motion artifacts or unusual pathologies. As a result, reconstructed images may display artifacts (e.g., inpainting artifacts, increased noise, aliasing) or altered anatomical structures.
Empirical testing for comparing the performance of various DL reconstruction solutions may be confounded by sources of variation such as data sampling, augmentation, model ablations, or hyperparameter choices. 16 Likewise, a thorough search would be prohibitively expensive, and hence conclusions are drawn from limited evidence. It would therefore be desirable to identify situations in which input data are out-of-distribution more reliably and comprehensively, to avoid erroneous image interpretation, optimize the reconstruction process, and support certification of algorithms.
A central property of out-of-distribution predictions is their association with high epistemic uncertainty. The epistemic uncertainty describes the uncertainty related to the reconstruction model, which might be incorrect outside the training distribution due to, for example, limited training data. The aleatoric uncertainty, on the other hand, describes the uncertainty caused by the inherent randomness in the data. Both uncertainties contribute to the predictive uncertainty. Several methods have been proposed for the quantification of predictive uncertainty in the context of image regression tasks, including Bayesian techniques such as variational inference, [17][18][19][20][21][22][23] maximum softmax probability, 24 temperature scaling, 25 Monte Carlo dropout, [26][27][28] Monte Carlo sampling, 29 Markov chain Monte Carlo, 30,31 and ensemble techniques. 32,33 In particular, ensemble-based epistemic uncertainty estimation in DL has been shown to provide valid and useful results in different practical applications 33,34 and has the advantage of simple implementation and application.
Even within data distributions, variability of the data due to changing sampling trajectories, acceleration factors, or imaging noise (originating from patient or scanner) can contribute to reduced reconstruction quality. These ambiguities in the data or in the examined model due to input data alterations are associated with aleatoric uncertainty. 35

The purpose of this work is to provide a method for the assessment of potential algorithm failure in DL-based MR reconstruction at test time that can be easily implemented and applied in standard settings. To this end, we propose, apply, and evaluate epistemic uncertainty estimation via deep ensembling and aleatoric uncertainty estimation via a Gaussian conditional likelihood in DL-based MR reconstruction. The proposed approach can be paired with any type of DL reconstruction, enabling investigations of its predictive uncertainty on a pixel level. The obtained uncertainty maps thus provide an interpretable solution to examine DL reconstruction uncertainties in relation to the underlying anatomical structures.

METHODS
The predictive uncertainty can be defined as the sum of the epistemic uncertainty, capturing model and approximation uncertainty (between the empirical risk minimizer and the true yet unknown risk minimizer in the hypothesis space), and the aleatoric uncertainty, capturing data uncertainty. Both components are explained subsequently and depicted in Figure 1.

Theory
To explain aleatoric and epistemic uncertainty, we start by defining the MR image reconstruction problem. For a given undersampled k-space y, we approximate a solution of the inverse problem with a solution of the optimization problem

x ∈ arg min_x ½‖Ax − y‖² + λ R(x)    (1)

with regularization constraint R(x) and weighting factor λ to obtain the image x. The encoding operator A = ΦFS consists of the coil sensitivity maps S, the Fourier transformation F, and the undersampling mask Φ. The simplest solution to Eq. (1) uses a gradient descent scheme to optimize for the reconstruction x. In DL reconstruction, physics-based unrolling replaces the gradient of the regularizer ∇R(x) by a learnable mapping function f θ (x) with trainable parameters θ (i.e., the neural network whose model uncertainty we are interested in). Please note that the regularizer operates directly on the image domain. In a more straightforward fashion, image enhancement/denoising networks estimate the output image x from noisy input samples x u by adapting the trainable parameters θ of the network following x = f θ (x u ).

F I G U R E 1
Overview of the proposed predictive uncertainty estimation. A deep ensemble of multiple trained network instances reconstructs the image and provides insights into the epistemic uncertainty (model uncertainty). Additionally, aleatoric uncertainty (data uncertainty) is predicted by the network assuming heteroscedastic additive noise. The resulting predictive uncertainty map shows the summed aleatoric and epistemic uncertainty on a pixel level. The UNet mag is shown as an exemplary network.
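As an illustration, the unrolled gradient scheme can be sketched for the single-coil case (where A reduces to ΦF, without coil sensitivities) with a placeholder standing in for the learned mapping f θ ; this is a minimal sketch under those assumptions, and all function names are our own, not from the original implementation:

```python
import numpy as np

def unrolled_gradient_recon(y, mask, f_theta, n_iter=6, step=0.5, lam=0.2):
    """Sketch of a physics-based unrolled gradient scheme for Eq. (1).

    The gradient of the regularizer is replaced by a learnable mapping
    f_theta (here passed in as a plain function). Single-coil case:
    the encoding operator reduces to A = Phi F (mask times FFT).
    """
    A = lambda x: mask * np.fft.fft2(x, norm="ortho")    # forward operator
    AH = lambda k: np.fft.ifft2(mask * k, norm="ortho")  # adjoint operator
    x = AH(y)                                            # zero-filled initialization
    for _ in range(n_iter):
        grad_data = AH(A(x) - y)                         # data-consistency gradient
        x = x - step * (grad_data + lam * f_theta(x))    # one unrolled iteration
    return x
```

With a fully sampled mask and a zero regularizer, the scheme returns the exact inverse FFT, which is a useful sanity check before swapping in a trained network for `f_theta`.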
Various strategies have been proposed for the prediction of reconstruction uncertainty, with methods varying in prevalence, scalability, and practical applicability. 36 In a Bayesian treatment, the posterior distribution over the network parameters given the training data is modeled. 22 In contrast, deep ensembling allows for a more convenient and simple implementation of the epistemic uncertainty estimation. 37,38 Computing the full posterior requires knowledge about a proper prior distribution and an approximation strategy to implement the otherwise intractable true posterior. With deep ensembles, we produce samples directly from a distribution over the network weights based on the major sources of randomness during the optimization process: weight initialization and batch sampling. In this work, we view this as an approximation of the posterior distribution to an implicitly assumed yet explicitly unknown prior. Moments of the predictive distribution, including a notion of uncertainty, can then be derived using Monte Carlo sampling. 39 From a deep ensemble of N parallel DL reconstruction networks, we obtain the predictive mean image x̄ = (1/N) Σ_{i=1}^{N} x_i and the variance over the ensembles σ²_E. Aleatoric uncertainty, on the other hand, is modeled by placing a distribution over the output of the model. The model outputs are assumed to vary according to a heteroscedastic Gaussian noise model. We assume that the residual noise on the predictions varies with the input x and, for a sufficiently large number of samples B, can be modeled under the central limit theorem as Gaussian. 40
The conditional likelihood for a reconstructed image x (of a single ensemble branch) given an undersampled input x u is p(x | x u ) = N(x; μ θ (x u ), σ²_θ(x u )), where μ θ (x u ) = x relates to the reconstructed image and σ²_θ(x u ) is the estimated variance, both of which are obtained as two outputs of the DL reconstruction network with trainable parameters θ. Following Kendall and Gal, 41 a maximum likelihood estimation with respect to θ yields the negative log-likelihood loss as an empirical risk minimization objective over a minibatch of B independent and identically distributed (iid) samples. The loss can be interpreted as a weighting of the conventional mean squared error (MSE) loss between a reference image x ref and the network-predicted reconstruction x by the predicted variance σ²_θ, summed with the logarithm of the variance (i.e., for a small variance, the output needs to be consistent with the reference, whereas this relationship is more lenient for larger variances, which still contribute a larger loss). Under the assumption of sufficient independent labeled samples and respective fully sampled references from a target data distribution, the alignment of a model's confidence with its accuracy (MSE) can be estimated and used to adjust the predictions via the negative log-likelihood loss.
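A minimal sketch of this heteroscedastic Gaussian negative log-likelihood (in the style of Kendall and Gal, with constant terms and the ½ factor dropped, as is common); the function name and the convention of predicting log σ²_θ for numerical stability are our own assumptions:

```python
import numpy as np

def heteroscedastic_nll(x_pred, log_var, x_ref):
    """Minibatch negative log-likelihood for a heteroscedastic Gaussian.

    The squared error is down-weighted where the predicted variance is
    large, while the log(var) term penalizes inflating the variance
    everywhere, so the network cannot trivially minimize the loss.
    """
    var = np.exp(log_var)
    return float(np.mean((x_pred - x_ref) ** 2 / var + log_var))
```

With a constant predicted variance of 1 (log σ²_θ = 0), the loss reduces to the plain MSE; raising the predicted variance only on erroneous pixels lowers the loss, which is the weighting behavior described above.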

Reconstruction uncertainty
A deep ensemble with 20 different random seeds for initialization, batching, and sampling was used to train the models. The mean x̄, the epistemic uncertainty σ²_E, and the aleatoric uncertainty σ²_A of the predictive distribution were Monte Carlo-approximated using the predictions of the ensemble members (x_i, σ²_θi) as samples. Subsequently, taking the expected means σ²_E and σ²_A over the ensemble members yields the overall reconstruction uncertainty σ² = σ²_E + σ²_A, under the assumption that the variance can be split into these contributing components. In an ideal setting, small epistemic uncertainty within distribution and small aleatoric uncertainty out of distribution are expected.
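The moment computation described above can be sketched as follows: epistemic uncertainty as the variance of the ensemble-member means, aleatoric uncertainty as the mean of the member-predicted variances, and their sum as the predictive uncertainty (a minimal NumPy illustration; names are our own):

```python
import numpy as np

def predictive_uncertainty(member_means, member_vars):
    """Monte Carlo approximation from N ensemble members (x_i, sigma2_i).

    Returns the predictive mean and the pixel-wise epistemic, aleatoric,
    and total predictive variance maps.
    """
    x = np.stack(member_means)        # shape (N, H, W)
    s2 = np.stack(member_vars)        # shape (N, H, W)
    mean = x.mean(axis=0)
    var_epistemic = x.var(axis=0)     # spread across ensemble members
    var_aleatoric = s2.mean(axis=0)   # expected predicted data variance
    return mean, var_epistemic, var_aleatoric, var_epistemic + var_aleatoric
```

Because the decomposition is additive, the two uncertainty maps can be inspected separately or summed into a single predictive uncertainty map per pixel.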
For the proposed approach, the neural network under test can be chosen independently of the uncertainty estimation strategy. No network modifications are required except for an additional output of the reconstruction network for the estimation of aleatoric uncertainty.
Different reconstruction networks were investigated with respect to their reconstruction uncertainty: (a) UNet mag , an image enhancement/denoiser network based on a UNet 1 ; (b) UNet unroll , a physics-based unrolled reconstruction with UNet regularizer (same architecture as in [a]) and conjugate gradient data consistency; (c) DCNN, a deep cascade of convolutional neural networks 7 ; (d) a variational neural network (VN) 5 ; and (e) a model-based deep learning architecture (MoDL). 4 Please note that the difference between (b) and (e) lies solely in the regularizer: UNet in (b) versus residual convolutions in (e). The difference between (c) and (e) lies in the closed-form solution of the data consistency for (c) in the single-coil case.
Details of the network architectures are as follows: The UNet mag consists of two levels in encoder/decoder with two convolutional layers per level (3 × 3 kernel size, 32 base features with a dyadic increase per level), rectified linear unit (ReLU) activation, and instance normalization. A 2 × 2 max pooling was used in the encoder and transposed convolutions in the decoder. For the physics-based unrolled reconstruction UNet unroll , the same UNet regularizer as in UNet mag was used in a cascade of six iterations (unrolled networks) with intermittent conjugate gradient data consistency (using the k-space and sampling pattern) for a fixed weighting factor λ = 0.2 (after optimization on the validation set). The DCNN consists of six iterations of the regularizer with intermittent proximal mapping data consistency and a fixed weighting factor λ = 0.1 (after optimization on the validation set). The regularizer consists of five convolutional layers (3 × 3 kernel size with 64 features) with ReLU activation. The VN 5 consists of 10 iterations with a fields-of-experts regularizer 42 with convolutional kernels of 11 × 11, 48 features, and trainable activation functions. The MoDL 4 uses 10 iterations with intermittent conjugate gradient data consistency (fixed weighting factor λ = 0.25 after optimization on the validation set) and a regularizer consisting of convolutional filters (3 × 3 kernel size with 64 features), batch normalization, and ReLU activation function.
Single-coil and complex-valued data were processed as a two-channel real variant (i.e., real and imaginary components were stacked as input channels [UNet unroll , DCNN, MoDL] or as a complex-valued tensor [VN]). The networks were trained with ADAM (UNet mag , UNet unroll , DCNN, MoDL) 43 /Block-ADAM (VN) 44 with a learning rate of 10⁻³, a learning rate scheduler (only VN; halving every 15 epochs), and batch size 24 over 200 epochs, to minimize the MSE loss between the reconstructed image x and the fully sampled reference image x ref . For numerical stability and to enforce a positivity constraint on σ²_θ, a softplus activation acts on the variance output, and a minimum variance of 10⁻⁶ is added. For each experiment, an ensemble consists of 20 individually trained networks with varying seed.
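The positivity constraint on the variance output can be sketched as follows (a minimal illustration of the softplus-plus-floor construction described above; the function name is our own):

```python
import numpy as np

def variance_head(raw_output, min_var=1e-6):
    """Map the unconstrained network output to a strictly positive variance.

    Softplus (log(1 + e^x), computed stably via logaddexp) guarantees
    positivity; the small additive floor keeps the NLL loss numerically
    stable when the network predicts very small variances.
    """
    return np.logaddexp(0.0, raw_output) + min_var
```

Even for strongly negative raw outputs, the returned variance never drops below the floor, which prevents division by zero in the variance-weighted loss term.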

Data
Network training and testing were performed on the single-coil knee cohort of the fastMRI database. 45 A preselection was performed to exclude subjects with small slice coverage and to restrict the data to a single magnetic field strength (3 T). In this study, we wanted to avoid misinterpretations of uncertainty due to magnetic field-related artifacts or variations and rather focus on the reconstruction process.
To test the impact of domain shifts in imaging conditions (i.e., changing imaging orientation and contrast), further separate test sets were investigated: (1) the multicoil brain data in fastMRI 45 and (2) single-coil knee data from Hammernik et al. 5 For the multicoil data, coil sensitivity maps were precomputed from a data block of size 24 × 24 at the center of k-space using ESPIRiT. 46 The coil-sensitivity combined image was used as input for the single-coil networks, and k-space data were forward-calculated. A retrospective regular undersampling was applied to the fully sampled raw data with 4× and 8× acceleration and 4%, 8%, and 16% fully sampled (fs) center lines.
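The retrospective regular undersampling can be sketched as a 1D boolean mask along the phase-encoding direction; this is a hypothetical helper illustrating the sampling scheme described above (the exact line-placement conventions of the original pipeline may differ):

```python
import numpy as np

def regular_undersampling_mask(n_pe, acceleration=4, center_fraction=0.08):
    """Regular (equidistant) undersampling with a fully sampled (fs) center.

    Keeps every `acceleration`-th phase-encoding line plus a contiguous
    block of `center_fraction` fully sampled lines around the k-space
    center, e.g. 4x/8x acceleration with 4%, 8%, or 16% fs lines.
    """
    mask = np.zeros(n_pe, dtype=bool)
    mask[::acceleration] = True                  # equidistant sampling pattern
    n_center = int(round(center_fraction * n_pe))
    start = (n_pe - n_center) // 2
    mask[start:start + n_center] = True          # fully sampled center block
    return mask
```

The mask is then broadcast along the frequency-encoding direction and applied multiplicatively to the fully sampled k-space.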
In a further ablation study, we investigated the influence of the MSE training loss. Because fully sampled reference targets cannot be regarded as completely noise-free, reference errors could potentially lead to bias and uncertainty. A Monte Carlo SURE loss 48 was applied to train the UNet mag and UNet unroll ensembles for 4× accelerated data with 8% fs, while testing was performed on 8× data with 8% fs.
Inputs to UNet unroll , DCNN, MoDL, and VN were the respective complex-valued image (real/imaginary: UNet unroll , DCNN, MoDL; complex: VN), k-space data, and undersampling mask. For UNet mag , only the magnitude image was used as input. Images were scaled to unit range. For both training and quantitative evaluation, each network reconstruction was compared with a reference image that we defined as the fully sampled reconstruction (coil-sensitivity combined in the case of multicoil input). For ablations and performance comparisons, the networks were also trained without the aleatoric uncertainty output.
Data were analyzed quantitatively by normalized MSE (NMSE) and absolute error. Reconstruction uncertainty maps are overlaid on the respective reconstructed image for qualitative assessment.
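For reference, the NMSE metric can be computed as follows, assuming the common definition of squared error normalized by the reference energy (the exact normalization used in the original evaluation is not spelled out here):

```python
import numpy as np

def nmse(x, x_ref):
    """Normalized mean squared error against the fully sampled reference:
    ||x - x_ref||^2 / ||x_ref||^2, valid for real or complex images."""
    return float(np.sum(np.abs(x - x_ref) ** 2) / np.sum(np.abs(x_ref) ** 2))
```

The normalization makes the metric invariant to a global intensity scale of the reference, which is convenient when images are rescaled to unit range.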
All experiments were performed on multiple servers equipped with Nvidia V100 GPUs (32 GB VRAM). Twenty randomized seeds were trained for each architecture and training configuration, yielding in total about 700 individually trained networks. Code will be made publicly available at github.com/midas-tum/recon_uncertainty.

RESULTS

Reconstruction performance
We observed varying reconstruction performance in terms of NMSE between the reconstructed and fully sampled reference image, depending on the underlying algorithm. MoDL and DCNN showed the best performance, whereas the UNet approaches and VN were slightly inferior. Overall, quantitative reconstruction accuracy was similar between the single-instance networks and the respective network ensembles (with or without aleatoric uncertainty estimation). Notably, there was a slight overall tendency toward lower quantitative reconstruction errors using network ensembles; this effect was most pronounced for out-of-distribution data and for the MoDL and DCNN techniques (Table 1). For example, ensemble DCNN performance even surpassed VN, whereas individual DCNN ensemble members exhibited a lower performance than the respective VN instances. These variations among ensemble members are reflected in the larger epistemic uncertainty for DCNN than VN.

T A B L E 1
Normalized mean squared error (NMSE) between the reconstructed image and the fully sampled reference, and aleatoric and epistemic uncertainty over the whole test cohort. For changing acceleration factors, the in-distribution (trained and tested on the same acceleration factor) and out-of-distribution (trained on one and tested on a different acceleration factor) settings were examined. The networks were tested as an ensemble with aleatoric uncertainty estimation (proposed), as an ensemble without aleatoric uncertainty estimation, and as a single network instance without the aleatoric uncertainty output (original versions).

Qualitative assessment of estimated uncertainty
Qualitative visual assessment of pixel-wise overall uncertainty estimates revealed distinct patterns depending on reconstruction technique and underlying data (Figure 2). Further representative subjects are illustrated in Figure S1. UNet-based architectures (UNet mag and UNet unroll ) in general showed uncertainty patterns resembling aliased content in the phase-encoding direction, whereas the DCNN and MoDL uncertainty maps showed noise-like characteristics. The VN approach tended toward high uncertainty estimates in image regions with higher intensities and along the aliased content in the left-right direction. A visual comparison of uncertainty patterns and absolute error maps revealed an overall high degree of similarity. Similar to the uncertainty patterns described previously, error maps of UNet-based architectures showed characteristics of subsampling artifacts in the phase-encoding direction; DCNN and MoDL showed rather noise-like error maps; and VN showed a tendency toward intensity-related reconstruction errors (Figure 2 and Figure S1).
Uncertainty was observed to exhibit unique patterns (in addition to architectural choices) with respect to subsampling, imaging orientation, anatomy, and pathology. In general, and as shown in Figure 3, in-distribution (Figure 3, middle column) predictive uncertainty was lower compared with the out-of-distribution (Figure 3, right column) setting and was primarily dominated by the aleatoric uncertainty. For out-of-distribution data (right column), the epistemic uncertainty increased. Independent of whether fat saturation was used, both uncertainty measures showed a similar characteristic. Predictive uncertainty in noisier images (as observed in PD with fat saturation in Figures 3 and S2) exhibited a similar characteristic as in less noise-affected images.

F I G U R E 3
Two exemplary subjects were reconstructed by the UNet mag for in-distribution (middle column) and out-of-distribution (right column) data. Training was performed on 4× accelerated knee proton-density (PD) images with and without fat saturation (FS) with 8% fully sampled center. Testing is performed on 4× and 8× accelerated data with 8% fully sampled center region. The reconstructed images are shown with the overlaid predictive uncertainty in color coding.
Networks trained on multiple acceleration factors (4× and 8×) had a comparable in-distribution uncertainty to the networks trained on a specific acceleration factor (4× or 8×) (Figure 4). Comparing in-distribution with out-of-distribution uncertainty, networks trained on multiple acceleration factors (4× and 8×) showed smaller predictive uncertainty (in-distribution) than networks trained on specific acceleration factors (4× or 8×) operating out of distribution. In these cases, aleatoric uncertainty dominated the in-distribution uncertainty, whereas epistemic uncertainty dominated the out-of-distribution setting, as shown in Figure 4. It should be noted that the network trainings resulted in a similar loss (i.e., for similarly converged networks, we obtain distinct uncertainties).

F I G U R E 4
Illustration of predictive uncertainty, aleatoric uncertainty (A), and epistemic uncertainty (E) for trainings performed with the UNet mag on data with a specific (left and middle column) or mixed acceleration factor (right column). Testing is performed with 4× and 8× accelerated data sets. The uncertainty maps are displayed in color coding.
A change in the fully sampled center also affected the uncertainty pattern (Figure 5). Increasing the fully sampled center region resulted in reduced uncertainty, as the reconstruction was aided by a larger low-frequency k-space range. These observations were consistent among different subjects and networks.
For stronger out-of-distribution domain shifts, larger overall predictive uncertainties were observed. The change in imaging orientation (testing on sagittal and axial; training on coronal) resulted in a stronger predictive uncertainty than a change in contrast (testing on T 2 weighting; training on PD), as shown in Figure 6 (second and third column). New orientations depicted novel morphological structures that the network did not see during training. However, networks were more robust to changes in contrast.
Changes in anatomy (i.e., training on knee and testing on brain) affected the predictive uncertainty in a similar fashion. In Figure 7, we illustrate the networks UNet mag and VN, which were trained on single-coil coronal PD-weighted knee data, while testing was conducted on multicoil brain MRI. Fully sampled multicoil brain data were first compressed to a single coil before being undersampled and reconstructed by the networks. In comparison to a change in imaging orientation (Figure 6), the networks presented a lower uncertainty. This might seem concerning given the anatomy shift, but both networks performed reasonably well in reconstructing the images. In line with the observations from Hammernik et al., 49 the networks were trained on a large cohort (avoiding overfitting to a small-scale training cohort), and the lower acceleration factor of the 4× accelerated data posed a smaller reconstruction challenge. In the reconstructed subjects, we did not observe any hallucinated structures. Consequently, a lower predictive uncertainty was achieved.

F I G U R E 5
Impact of changing fully sampled center region size in out-of-distribution data. Images were reconstructed for 2 exemplary subjects by the UNet mag , which was trained on 8× accelerated data with 8% fully sampled center region. The reconstructed images are shown with the overlaid predictive uncertainty in color coding.

F I G U R E 6
Investigation of out-of-distribution data impact for changing imaging orientation and contrast. UNet mag and variational neural network (VN) were trained on coronal proton density (PD)-weighted knee images (with and without fat saturation) and tested on sagittal PD (displayed rotated 90° clockwise), sagittal T 2 -weighted (T2w), and axial T 2 -weighted knee images. The predictive uncertainty is depicted in comparison to the fully sampled reference images.
For networks trained on healthy subjects (i.e., not included in the fastMRI+ database), we observed larger relative uncertainties in the presence of pathologies (drawn from fastMRI+) for all network configurations. Figure 8 shows, for 2 subjects with joint effusion, joint bodies, and partial-thickness cartilage defects, a concentration of uncertainty at these out-of-distribution findings (disease prevalence shift) for UNet mag , UNet unroll , and VN.
Ablations with respect to the training loss function (MSE vs. SURE) did not reveal any significant qualitative differences in image quality, absolute error, or predicted uncertainty, nor in training behavior (Figure S2). For noisy reference images, a slightly reduced absolute error can be achieved, but without any effect on the predicted uncertainty.

Quantitative assessment of estimated uncertainty
Confirming the qualitative impressions described previously (Figure 2), we observed a good image-level quantitative correlation between measured reconstruction errors and predictive uncertainty (Figure 9). The

F I G U R E 7
Investigation of out-of-distribution data impact for a change in anatomy. UNet mag and variational neural network (VN) were trained on coronal proton density (PD)-weighted knee images (with and without fat saturation) and tested on axial fluid-attenuated inversion recovery (FLAIR), axial T 1 -weighted (T1w), and axial T 1 -weighted postcontrast brain images. The predictive uncertainty is depicted in color coding overlaid on the reconstructed images and in comparison with the fully sampled reference images.

F I G U R E 8
Investigation of out-of-distribution impact of pathologies. UNet mag , UNet unroll , and variational neural network (VN) were trained on healthy subjects and tested on subjects with pathologies as obtained from fastMRI+. The predictive uncertainty is depicted in comparison with the fully sampled reference image. A relative increase of predictive uncertainty was observed at the pathologies.
behavior of the predictive uncertainty over NMSE in the complete test cohort is illustrated in Figure S3. For increasing NMSE, a general increase in uncertainty was observed. This characteristic was also reflected in the pixel-level analysis. A distinct predictive uncertainty behavior was observed for changing network architectures (Figures 9A and S3A). Although MoDL showed on average the smallest error, it was accompanied by the largest predictive uncertainty. The VN and UNet unroll exhibited a similar trend, slightly outperforming the UNet mag . Over the complete cohort (and in contrast to Figure 2), the DCNN showed an elevated predictive uncertainty. For the UNet mag , the predictive uncertainty on a pixel level (Figure 9B) is depicted for changing undersamplings. Predictive uncertainty (Figure S3B), aleatoric uncertainty (Figure S3C), and epistemic uncertainty (Figure S3D) are depicted in relation to NMSE for the test cohort. A similar tendency was observed for other network architectures. Aleatoric uncertainty dominated the uncertainty for in-distribution data, whereas epistemic uncertainty was the prominent uncertainty in out-of-distribution data.
Overall, included data consistency (UNet unroll , DCNN, MoDL, VN) reduced the epistemic uncertainty. For a low NMSE that was similar between reconstruction networks, the DCNN qualitatively showed the lowest predictive uncertainty in individual cases, but over the cohort a larger predictive uncertainty and a less distinct relative spatial uncertainty distribution.
Training the networks with a SURE-based loss resulted in only a slightly reduced NMSE over the MSE-trained variants, for each of the ensemble members and for the final ensemble. A stronger, but not significant, effect of the SURE loss on NMSE over the test cohort was observed for UNet unroll than for UNet mag . Predictive uncertainty remained similar for SURE- and MSE-trained networks.
A total of 2257 kWh of energy was consumed for training all experiments, equaling a carbon emission of 1056 kg CO 2 in Germany (468 g/kWh). A total of 14 690 GPU hours were used, with an average training time of 23.1 ± 29.3 h.

DISCUSSION
In this work, we propose a framework for test-time assessment of algorithm performance in deep learning-based MR image reconstruction through pixel-wise estimation of predictive uncertainty. The proposed method is based on deep ensembling paired with a negative log-likelihood loss to measure epistemic and aleatoric uncertainty. We investigated patterns of uncertainty under different conditions, including different reconstruction architectures, varying acceleration factors, and distribution shifts due to changing anatomy and pathologies.
In concordance with theoretical expectations, we observed higher epistemic uncertainty in out-of-distribution settings. Specifically, epistemic uncertainty was higher when test data were acquired at different acceleration factors, different image orientations, and different anatomical regions compared with training data. This result indicates that the obtained epistemic uncertainty estimates can potentially be used to identify out-of-distribution data at test time. This might support clinical decision making by avoiding overconfidence in reconstructed imaging data in such cases.
In addition, we observed increasing aleatoric uncertainty with higher acceleration factors. This result reflects the increasing ambiguity of the reconstruction task with higher subsampling rates. Although this effect can also be quantified by the reconstruction error during training, scenarios are conceivable in which a fully sampled reference image is not available during training (e.g., in motion-resolved or dynamic imaging). In these cases, aleatoric uncertainty estimation can potentially support the process of identifying clinically useful acceleration factors.
We identified an interplay between aleatoric and epistemic uncertainty. Because the aleatoric uncertainty has a direct influence on the network weights (via the negative log-likelihood loss), we expect it to also affect the model performance of the ensemble members and hence the epistemic uncertainty. In distribution, smaller variations of the epistemic uncertainty were observed when aleatoric uncertainty estimation was present. However, similar variations of epistemic uncertainty were observed out of distribution, independent of aleatoric uncertainty estimation. This further indicates the role of the epistemic uncertainty in capturing out-of-distribution behavior.
Interestingly, we observed variations of the spatial patterns of overall reconstruction uncertainty depending on the reconstruction architecture. For the UNet mag network, areas of high uncertainty were clearly spatially related to the presence of ghosting artifacts. This effect was less pronounced in the unrolled and cascaded frameworks, which also showed better overall reconstruction performance. Furthermore, we observed elevated reconstruction uncertainty in anatomic regions with pathologies that were not included in the training data. These findings indicate that our proposed method can potentially be of value not only for global but also for regional assessment of reconstruction quality at test time.
Importantly, quantitative analysis revealed a strong positive correlation between the reconstruction error and predicted overall uncertainty. This indicates that the obtained uncertainty estimates are a useful surrogate marker for reconstruction performance, which is a desired property. Although pixel-level uncertainty information can reveal localized reconstruction ambiguities, it may burden radiologists with additional readings. Hence, future studies should investigate an appropriate warning mechanism for reconstruction quality assessment.
As network performance can depend on training hyperparameters, we opted to use the architectures as proposed in the respective publications. The weighting factor λ was empirically optimized on a separate validation set for each architecture. The batch size was set to the maximum possible within the GPU memory limit, to avoid the network performance being confounded by it. Hence, we expect the achieved epistemic uncertainty to be independent of the batch size, as also indicated by the consistent training performance of all ensemble members. Furthermore, the maximal possible batch size is also in line with the Gaussian modeling of aleatoric uncertainty under the central limit theorem.
Although conceivable, we did not observe an effect of noisy training samples on the obtained predictive uncertainty performance. Networks trained with the SURE loss performed on par with the MSE-trained variants in terms of predictive uncertainty, both qualitatively and quantitatively. Moreover, in noisier images we observed a predictive uncertainty characteristic similar to that in less-noise-affected images. Further investigations are warranted to identify how noise confounds supervised training, how nonlinear network operations affect input noise distributions, and how this could affect predictive uncertainty under various domain shifts. Given the complexity of the MRI physics, a more comprehensive noise model would be needed to study and link network input and output variations. In this regard, a quality measure that is not confounded by noise characteristics, or that explicitly captures them, would be an essential tool for such investigations.
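For context, Stein's unbiased risk estimate (SURE) estimates the mean-squared error of a denoiser from noisy data alone, with the divergence term approximated by a single Monte Carlo probe. The sketch below is a generic textbook formulation under an additive white Gaussian noise assumption; the function `mc_sure`, its arguments, and the identity-denoiser example are illustrative, not the training loss used in the paper:

```python
import numpy as np

def mc_sure(f, y, sigma, eps=1e-4, seed=None):
    """Monte Carlo SURE: unbiased estimate of the per-pixel MSE of a
    denoiser f at noisy input y (Gaussian noise std sigma), computed
    without access to the clean image. The divergence of f is
    approximated with one random finite-difference probe."""
    rng = np.random.default_rng(seed)
    n = y.size
    b = rng.standard_normal(y.shape)
    fy = f(y)
    # divergence of f at y, estimated by b^T (f(y + eps*b) - f(y)) / eps
    div = (b * (f(y + eps * b) - fy)).sum() / eps
    return ((fy - y) ** 2).sum() / n - sigma ** 2 + 2 * sigma ** 2 * div / n

# sanity check: for the identity "denoiser" the true MSE is sigma^2,
# and MC-SURE should recover approximately that value
sigma = 0.1
rng = np.random.default_rng(2)
y = rng.normal(size=(32, 32)) + sigma * rng.standard_normal((32, 32))
sure_val = mc_sure(lambda z: z, y, sigma, seed=3)
```

Because SURE never touches the clean reference, it offers one route to studying how noise in the training targets interacts with supervised reconstruction losses.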
We acknowledge that this study has limitations. The data were drawn from a public database and evaluated in a retrospective manner with a technical focus, and only single-coil data were examined. To assess the clinical usefulness of the proposed approach, further studies including additional use cases and pathologies need to be conducted. Furthermore, we only investigated the proposed methodology in conjunction with a small number of DL techniques for MR reconstruction; different results may be observed with other or future reconstruction methods. Despite these shortcomings, the underlying hypothesis of this work, namely that pixel-wise uncertainty estimation can be achieved and can provide useful information in DL-based MR reconstruction, is well supported by our experimental results.

CONCLUSION
Epistemic and aleatoric uncertainty estimation on a pixel level is feasible in deep learning MR reconstruction using a deep ensemble approach. The predictive uncertainty correlated well with the error maps on a pixel level and can potentially provide clinically useful information about reconstruction performance at test time.

F I G U R E 2
Comparison of predictive uncertainty and absolute error in a representative subject for all examined reconstruction network architectures on a per-pixel level: magnitude-based image enhancement UNet mag, physics-based unrolled reconstructions (UNet unroll, deep cascade of convolutional neural network [DCNN], model-based deep learning architecture [MoDL]) with alternating learned regularizers and data consistency, and a variational neural network (VN). Networks were trained at 4× and tested at 8× acceleration to showcase the model uncertainty for out-of-distribution inference. The reconstructed images are shown (top row) together with the absolute reconstruction error relative to the fully sampled reference (middle row) and the predictive uncertainty (bottom row) in color coding. Two further subjects are depicted in Figure S1.

F I G U R E 9
Results of the quantitative analysis on a pixel level for a representative subject: uncertainty behavior as a function of the absolute error between the reconstructed image and the fully sampled reference. A, The predictive uncertainty for the examined network architectures (UNet mag, UNet unroll, deep cascade of convolutional neural network [DCNN], model-based deep learning architecture [MoDL], and variational neural network [VN]). B, The predictive uncertainty for UNet mag trained and tested on different undersampling factors, providing in-distribution and out-of-distribution data.
Training was performed on coronal PD-TSE images of 536 subjects (269/267 subjects with/without fat saturation). Validation was performed on 100 subjects (left out from training) acquired with coronal PD-TSE (48/52 subjects with/without fat saturation). Testing was performed on 108 subjects (left out from training) acquired with coronal PD-TSE (54/54 subjects with/without fat saturation).
This work was supported by the German Research Foundation under Germany's Excellence Strategy (EXC 2064/1; Project No. 390727645). The authors thank the organizers of the fastMRI challenge for providing the database used in this study. Open Access funding enabled and organized by Projekt DEAL.

DATA AVAILABILITY STATEMENT
The source code is made publicly available at github.com/midas-tum/recon_uncertainty (DOI: TBA).