DeepCEST 3T: Robust MRI parameter determination and uncertainty quantification with neural networks—application to CEST imaging of the human brain at 3T

Calculation of sophisticated MR contrasts often requires complex mathematical modeling. Data evaluation is computationally expensive, vulnerable to artifacts, and often sensitive to fit algorithm parameters. In this work, we investigate whether neural networks can provide not only fast model fitting results, but also a quality metric for the predicted values, a so-called uncertainty quantification, investigated here in the context of multi-pool Lorentzian fitting of CEST MRI spectra at 3T.


| INTRODUCTION
Sophisticated MR contrasts such as diffusion tensor imaging, quantitative susceptibility mapping or spectroscopic MR imaging, as well as spectrally selective CEST imaging, often require complex mathematical modeling before contrast generation. This modeling can be error-prone and time-consuming, because it commonly must be carried out offline and sometimes requires user interaction and expert knowledge for evaluation. In the case of CEST, in vivo Lorentzian models, 1,2 as well as full Bloch equations [3][4][5] and simplified Bloch equations 6,7 are used to evaluate and quantify spectrally selective effects. When correctly evaluated, these models allow several interesting CEST contrasts to be generated simultaneously, such as the semi-solid compartment provided by the so-called magnetization transfer (MT) effect, as well as relayed nuclear Overhauser (NOE) effects correlating with protein content and conformation. Best known are the CEST effects of amide, amine, and guanidine protons related to peptides and proteins, 8 as well as pH, 9 and the metabolite content of creatine 10 or glutamate. 11 The ability to generate these multiple CEST contrasts at a clinical 3T scanner would be highly desirable for clinical imaging, provided such sophisticated evaluations were possible online in a robust manner.
For CEST data, both acquisition and subsequent evaluation are prone to artifacts attributed to field inhomogeneity and motion. 12 Furthermore, as in many cases when non-linear least squares optimization is involved, the results depend strongly on details of the fit model such as initial values, boundary conditions, and the actual algorithm implementation. Last but not least, computations for 3D volume data can take hours depending on the algorithm. 4 To circumvent these problems of complex non-linear models, it was proposed as early as 1992 to use neural networks to improve curve fitting. 13 Since then, deep learning 14,15 has gained much interest within medical imaging as well as in many other branches of science. Deep learning was recently applied to circumvent the mentioned problems of complex non-linear models used in oxygen extraction fraction 16 and diffusion parameter mapping. 17 In CEST imaging, a first neural network approach was shown to successfully map 3T data to 9.4T CEST contrasts. 18 However, because deep neural networks (NN) are often considered "black boxes" with limited insight into the mapping from input to prediction, the question arises how accurate and reliable such predictions are. A neural network that gives seemingly confident predictions for all possible inputs, regardless of whether they are reasonably supported by learning or not, may lead to critical failure cases and result in mistrust by potential users. Consequently, it is highly desirable to make deep learning models probabilistic in a way that allows for uncertainty quantification, indicating how confident the NN is about its predictions for a given input. Steps towards such uncertainty quantification in deep learning have been taken in the context of computer vision [19][20][21][22] as well as in medical imaging applications (e.g., in MS lesion detection, 23 melanoma detection 24 and generation of synthetic CT from MRI 25 ).
Here, we introduce a probabilistic deep learning approach to shortcut conventional Lorentzian fitting analysis of 3T in vivo CEST data. By making use of previously evaluated data, this approach addresses the mentioned problems of complex non-linear models regarding evaluation speed and parameter fine-tuning. The objective is to evaluate semi-solid MT, NOE-CEST, and amide CEST contrasts robustly with the trained neural networks in a fraction of the normal evaluation time. At the same time, the networks are supposed to tell if predictions are trustworthy, or-more importantly-if they are not, allowing confident interpretation of contrast changes.

| Overview
An overview of the presented data processing pipeline is given in Figure 1. In vivo Z-spectra are acquired at 3T and evaluated by conventional methods (PCA denoising, 26 Lorentzian fitting 1,2,27 ). The obtained results are used to train a deep neural network to directly map from raw Z-spectra to Lorentzian parameters, yielding CEST contrast maps in ~1 s where the conventional evaluation takes ~10 min. Additionally, the neural network architecture presented here is able to provide uncertainty quantification that indicates if predictions are trustworthy.

Conclusions: The deepCEST 3T neural network provides fast and robust estimation of CEST parameters, enabling online reconstruction of sophisticated CEST contrast images without the typical computational cost. Moreover, the uncertainty quantification indicates if the predictions are trustworthy, enabling confident interpretation of contrast changes.

K E Y W O R D S
APT, chemical exchange saturation transfer (CEST), deepCEST, NOE, probabilistic neural network, uncertainty quantification

| CEST data acquisition
CEST MRI data acquisition was performed on 8 healthy subjects at 3 different sites with identical 3T whole-body MRI systems (MAGNETOM Prisma, Siemens Healthcare, Erlangen, Germany) and identical 64-channel receive coils. Additionally, 1 patient with a brain tumor was measured using the same hardware and protocols. All scans were performed after written informed consent and with approval by the local ethics committee. The used 3D snapshot-CEST sequence 28 was composed of a low-power spectrally selective saturation block and an RF- and gradient-spoiled gradient-echo readout with centric spiral reordering (elongation factor E = 0.5). Readout parameters were FOV = 220 × 180 × 60 mm³ and matrix size = 128 × 104 × 12 for a 1.7 × 1.7 × 5 mm³ resolution, TE = 2 ms, TR = 4 ms, bandwidth = 700 Hz/pixel, and flip angle (FA) = 6°. With these settings, the readout time was t RO = 2.5 s.
For the spectrally selective CEST saturation block, parameters optimized according to Deshmane et al 29 were used. A train of 80 Gaussian-shaped RF pulses, each with a pulse duration of t pulse = 20 ms separated by a delay of t delay = 20 ms (duty cycle DC = 50%) and a single nominal B 1 level of B 1 = 0.6 µT was applied before each readout, resulting in a total saturation time of T sat = 3.2 s. A crusher gradient was applied after the pulse train to spoil spurious transverse magnetization. Saturation and readout were repeated at 55 off-resonance frequencies in the range between −100 ppm and 100 ppm, and at −300 ppm for an unsaturated reference image. Saturation and readout resulted in an acquisition time per offset of TA = 6.1 s. For 55 frequency offsets and a 12-s recovery time before the first CEST module, this yielded a total scan time of ~5.35 min for 1 subject.
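The saturation-block timing stated above follows directly from the pulse parameters; a minimal consistency check (variable names are ours, not from the sequence code):

```python
# Recompute the saturation-block timing from the protocol parameters above.
n_pulses = 80
t_pulse = 0.020   # s, duration of each Gaussian-shaped RF pulse
t_delay = 0.020   # s, interpulse delay

duty_cycle = t_pulse / (t_pulse + t_delay)   # DC = 50%
t_sat = n_pulses * (t_pulse + t_delay)       # total saturation time = 3.2 s
```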

| Data evaluation: Lorentzian fitting
The reconstructed data sets were registered to the first unsaturated image using the AFNI 3Dvolreg function. 30 Z-spectra in each voxel were calculated from the saturated images S_sat and the unsaturated reference image S_0 as

Z(Δω) = S_sat(Δω) / S_0

Brain masks for each subject data set were created manually. The Z-spectra were spectrally de-noised by principal component analysis (PCA) following the approach of Breitling et al, 26 keeping only the principal components suggested by the median criterion (Supporting Information Table S1).

FIGURE 1: Schematic of the presented data processing pipeline. A deep neural network is used to shortcut the conventional CEST-spectra evaluation (PCA denoising, Lorentzian fitting). The exemplarily displayed contrast maps can be found in more detail in Figure 8.

To isolate CEST effects, a four-pool Lorentzian fit model according to Goerke et al 27 was used to fit the direct water saturation (DS), semi-solid magnetization transfer (MT), amide proton transfer (APT), and relayed NOE peaks, using the model equation

Z(Δω) = c − L_DS(Δω) − L_APT(Δω) − L_NOE(Δω) − L_MT(Δω)

with a constant c, the direct saturation pool

L_DS(Δω) = A_DS · (Γ_DS/2)² / ((Γ_DS/2)² + (Δω − δ_DS)²)

and the other pools

L_i(Δω) = A_i · (Γ_i/2)² / ((Γ_i/2)² + (Δω − δ_DS − δ_i)²)

with amplitudes A_i, full-width-at-half-maximum Γ_i, and peak positions δ_i. This model has 10 free parameters in total (4 amplitudes A_i, 4 widths Γ_i, water peak position δ_DS, and the constant c), as the positions of the APT, NOE, and MT peaks were fixed at +3.5 ppm, −3.5 ppm, and −2.5 ppm relative to the water peak position, respectively, and were therefore not treated as free parameters. Including the water peak shift δ_DS in the denominator of the APT, NOE, and MT peaks L_i allows the modeled spectra to shift along the frequency axis while relative peak positions stay fixed. Lorentzian fitting was performed by non-linear least squares optimization in MATLAB with initial and boundary conditions listed in Supporting Information Table S2.
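The four-pool Lorentzian model described above can be written compactly in code. The following is a minimal numpy sketch; function names and the example parameter values are ours, chosen for illustration:

```python
import numpy as np

# Fixed CEST peak positions relative to the water peak (ppm), as in the text.
REL_POS = {"APT": +3.5, "NOE": -3.5, "MT": -2.5}

def lorentzian(dw, A, G, pos):
    """Lorentzian line with amplitude A, FWHM G, and centre pos (ppm)."""
    return A * (G / 2) ** 2 / ((G / 2) ** 2 + (dw - pos) ** 2)

def z_model(dw, c, d_ds, A_ds, G_ds, A_apt, G_apt, A_noe, G_noe, A_mt, G_mt):
    """Z(dw) = c - L_DS - L_APT - L_NOE - L_MT with shared water shift d_ds."""
    z = c - lorentzian(dw, A_ds, G_ds, d_ds)
    z -= lorentzian(dw, A_apt, G_apt, d_ds + REL_POS["APT"])
    z -= lorentzian(dw, A_noe, G_noe, d_ds + REL_POS["NOE"])
    z -= lorentzian(dw, A_mt, G_mt, d_ds + REL_POS["MT"])
    return z

# Example spectrum on 55 offsets as in the protocol (parameter values invented).
dw = np.linspace(-100, 100, 55)
z = z_model(dw, c=1.0, d_ds=0.1, A_ds=0.8, G_ds=2.0,
            A_apt=0.03, G_apt=1.5, A_noe=0.05, G_noe=3.5,
            A_mt=0.1, G_mt=25.0)
```

Because the CEST peak positions are tied to d_ds, shifting d_ds moves the whole modeled spectrum along the frequency axis, as described in the text.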
Least squares Lorentzian fitting evaluation for a single subject 3D volume took ~10 min using a computer with Intel Xeon W-2145 3.7 GHz CPU, 8 cores, 12 parallel threads, and 32 GB RAM.
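The actual evaluation fits all 10 parameters by non-linear least squares in MATLAB. As a simplified, numpy-only illustration of the least squares step: if the widths and the water position δ_DS are held fixed, the model is linear in the constant c and the amplitudes A_i, which can then be recovered in closed form (a sketch under these assumptions, not the full nonlinear fit; names are ours):

```python
import numpy as np

def line(dw, G, pos):                      # unit-amplitude Lorentzian shape
    return (G / 2) ** 2 / ((G / 2) ** 2 + (dw - pos) ** 2)

dw = np.linspace(-10, 10, 55)
d_ds = 0.0                                 # water position, held fixed here
shapes = [line(dw, 2.0, d_ds),             # DS
          line(dw, 1.5, d_ds + 3.5),       # APT
          line(dw, 3.5, d_ds - 3.5),       # NOE
          line(dw, 25.0, d_ds - 2.5)]      # MT

# Synthetic noiseless spectrum with known parameters (c, A_DS, A_APT, A_NOE, A_MT).
true = np.array([1.0, 0.8, 0.03, 0.05, 0.10])
z = true[0] - sum(a * s for a, s in zip(true[1:], shapes))

# Z = [1, -L_DS, -L_APT, -L_NOE, -L_MT] @ (c, A_i): linear least squares.
M = np.column_stack([np.ones_like(dw)] + [-s for s in shapes])
est, *_ = np.linalg.lstsq(M, z, rcond=None)
```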

| Data evaluation: neural network
The neural network architecture presented here is a deep feed-forward neural network consisting of fully connected layers, also known as a multilayer perceptron. It takes vectors x of 55 elements representing measured raw Z-spectra as inputs and maps them to vectors μ(x) = (μ_1(x), … , μ_n(x)) of n = 10 elements representing the free parameters of the 4-pool Lorentzian model described above. The mapping is realized by sequential application of matrix multiplications (each matrix containing the weights of a layer) and a non-linear activation function. To make the neural network probabilistic and thereby allow for uncertainty estimation, it has 10 additional output neurons that represent the uncertainty σ(x) = (σ_1(x), … , σ_n(x)) of each Lorentzian parameter. In total, the neural network represents the function f_θ(x) = (μ_θ(x), σ_θ(x)) ∈ ℝ^2n, which is parametrized by the layer weights θ that are adjusted during training. Therefore, the number of neurons in the output layer is doubled compared to standard non-probabilistic neural networks, which would only output μ_θ(x). The additional uncertainty outputs σ_i(x) do not have to be given as explicitly labeled targets for training, but are indirectly inferred from the training data by applying the principle of maximum likelihood with the ground-truth training targets μ_i^tgt and likelihood function p_θ. Similar approaches of implicitly learned output variance via maximum likelihood estimation for deep neural networks have been used in computer vision 19,20,22 and are sometimes referred to as "heteroscedastic aleatoric uncertainty." A Gaussian likelihood function

p_θ(μ^tgt | x) = ∏_i 1/(√(2π) σ_i(x)) · exp(−(μ_i^tgt − μ_i(x))² / (2 σ_i(x)²))

is assumed, which results in the Gaussian negative log-likelihood (GNLL) loss function for a single input-output pair (x, μ^tgt), up to an additive constant,

L_GNLL = Σ_i [ (μ_i^tgt − μ_i(x))² / (2 σ_i(x)²) + ln σ_i(x) ]     (7)

which is numerically minimized with respect to the layer weights θ. The learned network outputs consequently represent Gaussian probability distributions with mean μ_i(x) and SD σ_i(x).
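The forward pass of such a probabilistic multilayer perceptron can be sketched in a few lines of numpy. Layer sizes follow the architecture reported in the Results (55 inputs, 2 hidden layers of 100 ELU units, 2·n outputs); how positivity of σ is enforced is not stated in the text, so letting the last 10 outputs represent log σ is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    """Exponential linear unit activation."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# 55 inputs -> 2 hidden layers of 100 ELU units -> 20 outputs (10 mu, 10 sigma).
# Random untrained weights, for shape illustration only.
sizes = [55, 100, 100, 20]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = elu(h @ W + b)
    out = h @ weights[-1] + biases[-1]       # linear output layer
    mu, log_sigma = out[..., :10], out[..., 10:]
    return mu, np.exp(log_sigma)             # sigma > 0 by construction (assumption)

x = rng.normal(size=(4, 55))                 # 4 standardized Z-spectra
mu, sigma = forward(x)
```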
The mean μ_i(x) is the predicted value of a Lorentzian parameter for a certain input spectrum x, whereas the SD σ_i(x) indicates the uncertainty of the target parameters for given x. The mapping from x to σ_i(x) is indirectly inferred during training as the optimization algorithm tries to minimize the log-likelihood loss function evaluated on the training data. By that, the distribution of training targets for given inputs is implicitly incorporated in the σ_i(x) outputs of the network. For a standard non-probabilistic neural network, the SDs σ_i would be assumed to be constant and independent of the inputs x, which reduces minimization of the GNLL loss function (Equation 7) to minimization of the commonly used mean squared error loss function

L_MSE = Σ_i (μ_i^tgt − μ_i(x))²     (9)

For minimizing the probabilistic log-likelihood loss function (Equation 7), the optimization algorithm used for training can either adjust the network weights θ to reduce the residual prediction error μ_i^tgt − μ_i(x) or increase the uncertainty output σ_i(x). The residual prediction error in the loss function cannot be reduced for training samples with similar inputs x but different targets μ_i^tgt. In this case, an unequivocal mapping is not possible, because for one input there are multiple outputs. The non-probabilistic least squares loss function (Equation 9) would simply assume high values for such training data. Minimizing the log-likelihood loss function, however, forces the optimizer to reduce the contribution of such unresolvable prediction errors by increasing σ_i(x) (i.e., signaling that the network output is uncertain for such inputs). The second summand in the loss function prevents the optimizer from making excessive use of increased σ_i(x) to reduce the loss. The optimizer is therefore supposed to find a compromise, assigning low uncertainty to inputs with low prediction error and high uncertainty only where good predictions are not possible.
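The trade-off described above can be made concrete: for a fixed, irreducible residual r, the per-parameter loss r²/(2σ²) + ln σ is minimized at σ = |r|, so the optimal uncertainty output tracks the unavoidable prediction error. A short numerical check:

```python
import numpy as np

def gnll(r, sigma):
    """Per-parameter Gaussian negative log-likelihood (Equation 7, up to
    the constant term), with residual r = mu_tgt - mu(x)."""
    return r ** 2 / (2 * sigma ** 2) + np.log(sigma)

# For a fixed residual, sweep sigma and locate the minimum of the loss.
r = 0.3
sigmas = np.linspace(0.01, 2.0, 20000)
best = sigmas[np.argmin(gnll(r, sigmas))]   # minimizer is sigma = |r| = 0.3
```

Too small a σ blows up the residual term, too large a σ is penalized by the ln σ term, which is exactly the compromise the text describes.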
The neural networks were implemented in the Keras 31 framework with a Tensorflow 32 backend. The Adam 33 optimization algorithm was used with mini-batches of size 64 and a learning rate of 10^−4. Training data were randomly split into training and validation sets containing 90% and 10% of the samples, respectively. The training set was used for calculating gradients and updating the network weights, whereas the validation set was only used to monitor the loss. Networks were trained for 5000 epochs (1 epoch means processing the whole training set once and updating the network weights), and the weights giving the best validation loss were saved. Input data were standardized to mean = 0 and SD = 1.
Hyperparameter estimation of the number of layers, neurons per layer, activation function, dropout rate, and regularization was performed by a grid search as described in the Supporting Information. For the results shown in the main text, data sets of 3 healthy subjects were used for training, which, after brain mask application and dropping of the lowest and uppermost slices, yielded a total of 135,752 training samples (1 training sample means 1 Z-spectrum together with its optimized Lorentzian parameters). We refer to the network trained on this data set as NN0. In the Supporting Information, a cross-validation analysis including all 8 acquired healthy subject data sets is shown.
For comparison, a non-probabilistic network (NNnonprob) was trained to only predict the Lorentzian parameters µ(x) without additional uncertainty outputs. For training this network, the mean squared error loss function (Equation 9) was used. Apart from that, all network parameters of NNnonprob were the same as for NN0.
Additionally, 2 types of training data augmentation were investigated:

1. Additional B 0 shifts: for each Z-spectrum used as training input, a value Δδ_DS is randomly drawn from a uniform distribution in a range of ±0.8 ppm. The Z-spectrum is shifted along the frequency axis by Δδ_DS using linear interpolation, and the corresponding target value is updated as δ_DS → δ_DS + Δδ_DS.
2. Gaussian noise of various SDs (0.1%, 1%, 5%, 10%, and 50%) is applied to the input Z-spectra while leaving the corresponding target vectors unaffected.
We refer to the network trained on such an augmented training data set as NN1.
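The two augmentations above amount to a few lines of numpy per training spectrum. A minimal sketch (variable names and the example sampling grid are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
offsets = np.linspace(-100, 100, 55)      # sampled frequency offsets (ppm)

def augment(z, d_ds, max_shift=0.8, noise_sd=0.01):
    """Apply one random B0 shift plus Gaussian noise to a Z-spectrum z.

    Returns the augmented spectrum and the updated water-position target;
    all other targets stay unchanged, as in the text."""
    # 1) Random B0 shift via linear interpolation: a feature originally at
    #    offset w now appears at w + shift.
    shift = rng.uniform(-max_shift, max_shift)
    z_shifted = np.interp(offsets, offsets + shift, z)
    # 2) Gaussian noise on the input only; targets are unaffected.
    z_noisy = z_shifted + rng.normal(0, noise_sd, z.shape)
    return z_noisy, d_ds + shift
```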
The following questions need to be addressed:

1. Is the neural network able to generalize to test data that was not included in training? Beyond that, is it able to predict CEST contrasts for a tumor patient data set, even though it was trained only on healthy subject data sets?
2. Does the uncertainty estimation accurately reflect corrupted input data, and does it indicate if input data exceeds the range of data learned during training?
3. Does augmenting the network training data improve the prediction and uncertainty estimation?
To assess the first question, the trained networks were tested on a fourth healthy subject data set as well as on a tumor patient data set, and the predictions were compared to the results of conventional Lorentzian fitting using the Pearson correlation coefficient and Bland-Altman analysis.
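Both agreement measures are standard and easily computed per parameter map; a short numpy sketch with synthetic values (the data and names here are illustrative, not from the study):

```python
import numpy as np

def agreement(fit, pred):
    """Pearson correlation plus Bland-Altman bias and 95% limits of agreement
    between reference fit values and network predictions."""
    r = np.corrcoef(fit, pred)[0, 1]
    diff = pred - fit
    bias = diff.mean()
    loa = (bias - 1.96 * diff.std(), bias + 1.96 * diff.std())
    return r, bias, loa

# Synthetic example: predictions that track the fit values closely.
rng = np.random.default_rng(2)
fit = rng.normal(0.05, 0.01, 10000)        # e.g., voxel-wise APT amplitudes
pred = fit + rng.normal(0, 0.002, 10000)   # nearly matching predictions
r, bias, loa = agreement(fit, pred)
```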
To address the second and third questions, perturbations with Gaussian noise and strong B 0 shifts were applied to the test data set. To this end, Gaussian noise with amplitudes of 1%, 2%, 3%, 4%, 5%, and 10% as well as relative B 0 shifts of 0.1, 0.2, 0.3, 0.5, 1, and 2 ppm were applied to Z-spectra in a single slice from the healthy test subject. These are in similar ranges as for the data augmentation, but not identical. The dependency of the uncertainty estimation on the strength of perturbation was investigated using the mean and SD of the prediction error (least squares fit − NN prediction) and the mean uncertainty in the whole slice, as well as voxel-wise scatter plots of prediction error and uncertainty.

| RESULTS
Optimization of the neural network resulted in an architecture with 2 hidden layers, each containing 100 neurons with an exponential linear unit (ELU) activation function (see Supporting Information Figures S1, S2 and Table S3 for details). Training NN0 (no training data augmentation) took ~8 h (135,752 training samples), and training NN1 (training data augmentation with Gaussian noise and B 0 shifts, 814,512 training samples) took ~48 h. For both networks, generation of a prediction from the data of a single subject (~50,000 voxels) took ~1 s.
To test the networks' ability to generalize to new input data sets, NN1 and NNnonprob were applied to the data set of a fourth healthy subject. Figure 2 compares the results of the conventional 4-pool Lorentzian least squares fit evaluation with the NN1 and NNnonprob predictions generated from the test data set. The predicted parameter maps of NN1 (Figure 2B) show the expected gray-white matter contrasts in the peak amplitudes of APT, NOE, and MT. Difference maps between NN1 predictions and ground truth shown in Figure 2D exhibit spatial structures, especially at the edges of the displayed slice, but of rather small magnitude of ~5%. Compared to that, the NNnonprob predictions (Figure 2C) deviate more strongly from the ground truth (Figure 2A), resulting in depleted contrasts and larger prediction errors (Figure 2E). The quantitative analysis of NN1 and NNnonprob predictions given in Figure 3 supports the observation that NN1 predictions for all parameters closely match the least squares fit results (Pearson correlation coefficients of 0.94 and 0.97 for the APT and NOE amplitudes, respectively, and >0.99 for the MT amplitude and δ DS ). This finding is confirmed for NN1 by the scatter plots (Figure 3A) and Bland-Altman analysis (Figure 3C) showing small bias and few outliers for all parameters.
The poorer performance of NNnonprob compared to NN1 is also confirmed by the scatter (Figure 3B) and Bland-Altman (Figure 3D) plots and lower Pearson correlation coefficients (below 0.9 for the APT and NOE peak amplitudes, 0.95 for the MT peak amplitude, and 0.97 for the water shift δ DS ). In particular, NNnonprob predictions of the MT peak amplitude show systematic deviations for high amplitudes. Similar comparisons of probabilistic and non-probabilistic networks were performed for different network architectures as well as for augmented and non-augmented training data. In all cases, the probabilistic networks were found to provide better predictions than those trained without uncertainty outputs.
However, there are cases in which the neural network fails to give accurate predictions. To assess these limitations, the same test data set was artificially corrupted (1) by adding Gaussian noise of different SDs to the input Z-spectra, and (2) by shifting the input Z-spectra along the frequency axis using linear interpolation, which mimics B 0 artifacts. Example spectra after both types of perturbation are shown in Figure 4A,B. Exemplary results for the APT peak amplitude with both types of input perturbation are shown in Figure 4C-L. Results for the remaining Lorentzian parameters show similar behavior on input perturbation and are given in Supporting Information Figures S7-S9. Corruption with Gaussian noise leads to noise in the APT pool amplitude obtained by the least squares fit (Figure 4D), whereas the NN1 prediction (Figure 4E) results in a smoother parameter map. Figure 4F shows the corresponding deviations between fit and neural network predictions. In the case of additional B 0 shifts, the fit results are only weakly affected (Figure 4I), as the fit model, with CEST peak positions defined relative to the water peak, is able to incorporate shifts along the frequency axis. In contrast, NN predictions show significant deviations for additional B 0 shifts >1 ppm (Figure 4J), which were not covered by the distribution of training targets. Such failure of neural network predictions is problematic, as it might not be recognized in cases where no ground truth data is available. However, the uncertainty quantification introduced by the presented network makes it possible to recognize such prediction errors: Figure 4G,L shows that for both types of input perturbation, uncertainty estimations increase strongly with increasing perturbation strength, directly indicating contrast changes that should not be trusted. This can be seen quantitatively in Figure 5.
For input perturbation with Gaussian noise, mean uncertainty estimations across the whole slice (Figure 5A, orange curve) increase by a factor of ~3 for noise amplitudes >2%. This reflects the larger SDs of the prediction errors (Figure 5A, blue curve) induced by the input noise. A voxel-wise scatter plot of uncertainty versus prediction error (Figure 5B) reveals that the uncertainty estimation makes it possible to separate corrupted from uncorrupted voxels, because the corresponding distributions have only a small overlap. In the case of additional B 0 shifts, prediction errors (Figure 5C, blue curve) are only weakly affected by B 0 shifts up to 1 ppm. Still, for shifts of 1 ppm, the mean uncertainty across the slice (Figure 5C, orange curve) increases by a factor of ~3 compared to the unperturbed case. This can also be seen in Figure 5D, as the point cloud and histogram for a shift of 1 ppm deviate from those corresponding to lower shifts. For a strong B 0 shift of 2 ppm, there is a large prediction error together with an increase of the uncertainty estimation by 2 orders of magnitude. As the training data set covers a range of B 0 shifts of approximately ±1 ppm, this indicates that the network has to extrapolate for B 0 shifts beyond that range. The point clouds in Figure 5D show that the uncertainty estimation makes it possible to separate these "out-of-training-data" input voxels from those within the trained range. The desired feature of uncertainty quantification, recognizing corrupted and "out-of-training-data" inputs, is thereby established.

Figure 6 shows the application of NN1 to the data set of a brain tumor patient together with clinical contrast images (Figure 6A,B). The NN prediction (Figure 6D) shows similar contrasts as the fit (Figure 6E), with hyperintensity in the tumor area in the APT map.
Scatter plots (Figure 7A,C,E,G) and Bland-Altman plots (Figure 7B,D,F,H) comparing fit and NN1 prediction confirm that the predictions in the tumor patient data set closely match the fit results (Pearson correlation >0.87 for all parameters) and exhibit small bias and few outliers. Therefore, the trained network is able to generalize to tumor Z-spectra. A comparison of the non-augmented NN0 and the augmented NN1 applied to the tumor patient data set is shown in Supporting Information Figure S12. As a final test of the uncertainty quantification, we created a more realistic case compared to Figure 4: an implant-like B 0 artifact, caused by a magnetic dipole field originating from a source in the tumor patient's skull, was simulated. The effect of this input data perturbation on the least squares fit, the NN1 prediction, and the uncertainty estimation is shown in Figure 8. It is clearly visible that strongly increased uncertainties (Figure 8E) indicate failure of the prediction (Figure 8D) close to the dipole location, where the strongest field inhomogeneity occurs. Therefore, contrast that arises from or is depleted by the B 0 artifact can now be identified and is thus not misinterpreted.

| DISCUSSION
In this work, we (1) analyzed the possibility of using a deep neural network to generate sophisticated CEST contrasts, which are normally obtained by multi-parametric non-linear least squares fitting. In addition, we (2) introduced an uncertainty quantification for the network prediction.
The trained networks are able to generate CEST contrasts from uncorrected Z-spectra quickly (evaluation time ~1 s for 3D stacks of ~50,000 Z-spectra, as opposed to ~10 min in the case of Lorentzian fitting) and accurately (Pearson correlation coefficient >0.9 with respect to Lorentzian fitting), generalizing to data that was not included in training. In addition, the network generalizes to tumor data, even though it was trained only on healthy subject data sets. In particular, it preserves the APT contrast that has been shown previously to correlate with gadolinium ring enhancement, 1,34 which can also be seen here. A similar deep neural network has been shown previously 18 to be able to map from Z-spectra acquired at 3T to Lorentzian fit parameters obtained at 9.4T by training on data sets generated from subjects scanned at both field strengths. Also in that approach, a network trained only with healthy subject data sets was shown to generalize to unseen tumor patient data sets. This agrees with the results shown in Figure 6, where NN1 neither produced strongly deviating predictions nor clearly increased uncertainties in the tumor region. In Zaiss et al, 18 it was argued to be plausible that certain spectra in a tumor area can be described as a combination of spectral features occurring in healthy tissue. Because of that, networks operating only in the spectral, but not the spatial, domain were argued to generalize also to tumor spectra as long as these can be expressed as non-linear combinations of healthy spectra. NN1 shows de-noising capability (Figure 4C), which was also observed in previous work 18 and is demonstrated in more detail in Supporting Information Figure S10.

| Assumptions for a "ground truth"
A limitation of the presented neural network approach is introduced by the fact that the ground truth data itself is generated by least squares Lorentzian fitting. This is important to note, because the current Lorentzian fitting is a simplified model, not including the super-Lorentzian line shape 35 and only a few of the known CEST pools. 36 It was, however, recently shown to fit the data well at low saturation powers. 29 Moreover, both our input and target data were not B 1 -corrected, which might be the reason for the observed deviations between different subjects (see Supporting Information Figure S3). Of course, the neural network will learn the same connection for differently saturated input and target data. Interestingly, the variations between subjects obtained by the Lorentzian fitting were closely reflected by the neural network prediction (see Supporting Information Figures S3-S6), which again indicates that the neural network simply follows the provided input and target data and mimics the least squares fit. In principle, a more sophisticated Bloch model including the saturation power might be able to overcome such issues. Otherwise, a B 1 correction as suggested at 3T by Goerke et al 27 might improve input and target data, and therefore also the neural network performance and reproducibility. With any model that is fitted using least squares, targets suffer from typical instabilities caused by low SNR and dependence on initial and boundary values. Some of these issues were previously addressed by the so-called IDEAL approach, 37 which harnesses the high SNR of spatially downsampled images and minimizes the need for choosing appropriate initial values and boundaries. This method is also based on multi-pool Lorentzians for the evaluation of low-power CEST and has been shown to be superior to conventional fitting. However, being an iterative process, it is still slow (~100 s for a 96 × 96 2D image) compared to a neural network prediction.
Moreover, tissue interfaces and partial volume effects can affect the spatial resolution. For both approaches, IDEAL and the presented neural networks, the methods are independent of the actual Z-spectrum model, and the insights gained herein will translate to more sophisticated models such as Bloch-McConnell fitting. Still, to date, there is no gold standard model for Z-spectra in vivo, and full quantification is not in agreement between research groups. 3,7,38 Therefore, the current approach used Lorentzian fits to prove the concept of a shortcut calculation including uncertainty quantification, but it can always be extended to an improved Z-spectrum model once one is found. Moreover, more general targets that are not generated by a fit can also be used with the current approach.

| Quantification of uncertainty
The main extension of the networks presented here compared to similar approaches is the probabilistic output layer. It makes use of a log-likelihood loss function that allows the interpretation of network predictions as Gaussian probability distributions and thereby enables uncertainty estimation. This probabilistic output layer can be adapted to any other neural network used for regression problems with only small changes to the network architecture. Hereby, it constitutes a simple way of enabling uncertainty estimation for similar deep learning approaches such as the prediction of 9.4T contrasts from 3T Z-spectra 18 or the reconstruction of MR fingerprinting data, 39 a technique that has also found applications in CEST. 38 Although for the networks investigated here the predicted parameters were in very good agreement with the conventional method right away, an uncertainty quantification that accurately reflects the expected prediction errors required training data augmentation. Figure 9 compares the results for both investigated types of input perturbation (noise and B 0 shifts) for the non-augmented NN0 and the augmented NN1. The predictions of both networks are in good agreement; we only observed that with input noise, NN0 predicts noisier maps compared to NN1 (Figure 9C,F). However, the uncertainty quantification is certainly improved by data augmentation. The non-augmented NN0 (Figure 9E) does not accurately reflect prediction errors (Figure 9D) in all voxels, as there are voxels with high error but low uncertainty (detailed plot as in Figure 5 in Supporting Information Figure S11). In comparison, the augmented NN1 reacts to input noise by increasing uncertainties in all corrupted voxels (Figure 9H). NN0 predictions are insensitive to B 0 shifts up to 0.5 ppm (Figure 9J,K), with the uncertainty estimation increasing by a factor of ~10 for inputs exceeding this range (Figure 9L). This corresponds to the range of B 0 shifts encountered during training of NN0.
Training data augmentation reduces this prediction error (Figure 9N) and leads to a less increased uncertainty (Figure 9O), as input B 0 shifts of 1 ppm are better supported by the training data of NN1. In summary, the used training data augmentation had 2 desirable effects. First, training with noise led to consistently increased uncertainties when applying NN1 to noise-corrupted data, representing the resulting prediction error more accurately than NN0. Second, including larger simulated B 0 shifts in the training data allowed better predictions for a larger range of B 0 shifts of the input spectra. This resulted in lower prediction errors and consistently lower uncertainties for shifted input spectra when using NN1 compared to NN0, which can be seen in Supporting Information Figure S11C. Similar observations can be made comparing NN0 and NN1 applied to the tumor patient data set, shown in Supporting Information Figure S12.
Therefore, given plausible data augmentation, an informative uncertainty prediction can be established using the proposed probabilistic predictive model. The uncertainty then indicates whether the input data is corrupted, as in the case of Figure 8, where B0 artifacts shifted Z-spectra drastically; as in the case of vessels, which can show noise-like pulsation artifacts in the spectral domain that are unique to a certain scan; or by any other variation that was not in the training data. This can be seen for the tumor patient data set in Figure 6F. Moreover, the increased uncertainties in the center of the displayed slice in Figure 5F indicate the typically lower SNR in the brain center, where the distance to the receive coil elements is larger. Additionally, parts of the tumor area also showed slightly larger uncertainty predictions, especially the T2 FLAIR hyper-intense region ( Figure 6B). This reflects the fact that exactly such Z-spectra were neither in the training data set nor can be generated as a superposition of healthy data. In principle, this can be solved by adding patient data to the training data. However, a large amount of data from different tumor types or other diseases would then have to be added to enable correct prediction and uncertainty quantification.

| Potential of uncertainty quantification
With the established uncertainty maps, we have a quality measure for the predicted data. A non-expert user therefore has additional information to judge whether the predicted data is trustworthy (e.g., one could generate a segmentation mask indicating where the uncertainty is above/below a certain threshold and warn the user that the data in this region cannot be trusted). Furthermore, the spatial structure of the uncertainty maps can give hints about possible error sources (e.g., in case of vessel pulsation or low SNR), as observed in Figure 6. Similarly, the structure of the uncertainty maps could indicate missing coil elements, too low B1, or a bad shim. In addition to the spatial pattern, the latter can be recognized by coincidentally increased B0 shift predictions, as observed for the dipole perturbation in Figure 8 and also in Supporting Information Figure S7. The user can then also decide whether the measured data is valid or whether a measurement has to be repeated with a corrected shim/coil/B1 setting. Furthermore, a recommendation system could be trained on the uncertainty maps to automatically recognize such failure cases, mask regions with high uncertainty, or, if necessary, recommend repeating the scan with a corrected setup. Currently, just having the uncertainty maps at hand can at least serve as an indicator of whether manual assessment by an expert is required to take a closer look at the acquired spectra in the respective regions and check for anomalies. In such cases, the conventional evaluation can be performed to see if there are deviations between prediction and ground truth.
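As a concrete illustration of the thresholding idea mentioned above, a hypothetical post-processing step could flag high-uncertainty voxels as follows; the function name and threshold choices are our own illustration, not part of the described pipeline.

```python
import numpy as np

def uncertainty_mask(sigma_map, threshold=None, percentile=95.0):
    """Return a boolean mask of voxels whose predicted uncertainty
    exceeds a threshold.

    If no absolute threshold is given, a percentile of the map itself is
    used as a simple data-driven cutoff; masked voxels could then be
    grayed out in the contrast map or flagged for expert review.
    """
    if threshold is None:
        threshold = np.percentile(sigma_map, percentile)
    return sigma_map > threshold
```

In practice, an absolute threshold calibrated on healthy-subject uncertainty maps would likely be preferable to a per-scan percentile, which by construction always flags a fixed fraction of voxels.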

| CONCLUSIONS
We have demonstrated that deep neural networks can be used to learn the mapping from raw Z-spectra to multi-pool Lorentzian parameters obtained by non-linear least squares fitting, providing a shortcut to the conventional evaluation method at 3T. As a significant extension of previously investigated deep learning approaches in this field, the presented network architecture enables uncertainty estimation that clearly indicates whether predictions can be used with confidence. After training, the networks yield robust contrast and uncertainty maps in seconds, even in the presence of B0 artifacts and noise. This is promising for fast online reconstruction, bringing sophisticated spectrally selective CEST contrasts a step closer to clinical application.

FIGURE S2
Comparison of NN predictions for best and worst hyperparameters. First column: fit results (ground truth). Second column: prediction of NNa (2 × 100 neurons, zero dropout, ELU activation). Third column: prediction of NNb (4 × 200 neurons, dropout rate 0.5, ReLU activation). Fourth column: prediction error (i.e., difference of fit results and prediction of NNa). Fifth column: prediction error of NNb

FIGURE S3
Cross-validation analysis performed on the 4 healthy subject data sets used for network training. Networks CV1-CV4 were trained using 3 out of 4 subject data sets, respectively, and tested on the remaining data set. Network CV1 is identical to NN0 as defined in the main text. (A) Pearson correlation coefficient between prediction and ground truth Lorentzian fit result, evaluated on the data set that was not included in training, respectively. Curves correspond to the amplitude target parameters of the APT, NOE, and MT peaks as well as the water shift δDS. (B-E) Maps of fit result and prediction of cross-validation networks CV1-CV4, vertically aligned with the corresponding correlation coefficients in (A)

FIGURE S4
Comparison of inter-subject variability of Lorentzian fitting (dark blue, dark red) and NN predictions (light blue, orange). Mean values of the amplitude parameters for each subject data set were calculated in GM/WM tissue segments. Data points with error bars correspond to mean and SD of these spatial means across the 4 healthy subject data sets. In the case of the NNs, predictions were generated by the cross-validation networks CV1-CV4 applied to the respective holdout test data set

FIGURE S5
NN0 applied to 4 new healthy subject data sets S5-S8. (A) Pearson correlation coefficient between prediction and ground truth Lorentzian fit result. Curves correspond to the amplitude target parameters of the APT, NOE, and MT peaks as well as the water shift δDS. (B-E) Maps of fit result and NN0 prediction for the 4 new healthy subject data sets, vertically aligned with the corresponding correlation coefficients in (A)

FIGURE S6
Inter-subject variability of Lorentzian fit results for all 8 healthy subject data sets. Each point with error bars corresponds to mean and SD across GM (blue) and WM (red) tissue segments for the respective data set. Additionally, the same results are given for the relative B1 inhomogeneity rB1 (green, different scale) obtained by the WASABI method. 6 Given percentages are the inter-subject coefficient of variation (i.e., SD divided by mean across the segment means of the 8 data sets)

FIGURE S7
Effect of input perturbation on fit results and NN1 outputs for the NOE and MT peak amplitudes and the water peak shift δDS, which are not shown in Figure 3. Perturbations are the same as in Figure 4, with the respective strengths indicated by ROI labels. First column: fit results from original data (no perturbation). Second column: fit results from corrupted input data. Third column: NN prediction generated from corrupted inputs. Fourth column: prediction error (i.e., difference of fit and NN prediction, both on corrupted data). Fifth column: NN uncertainty for corrupted input data

FIGURE S8
Effect of input perturbation on NN1 outputs and uncertainty estimations for the NOE (A and B) and MT (C and D) peak amplitudes and the water peak shift δDS (E and F), which are not shown in Figure 5. Mean prediction errors (blue) and NN uncertainties (orange) across a whole slice for different noise amplitudes between 1% and 10% (A, C, and E) as well as additional B0 shifts between 0.1 ppm and 2 ppm (B, D, and F)

FIGURE S9
Effect of input perturbation on NN1 outputs and uncertainty estimations for the NOE (A and B) and MT (C and D) peak amplitudes and the water peak shift δDS (E and F), which are not shown in Figure 5. Scatter plots of absolute prediction errors versus logarithm of NN uncertainties and marginal distributions for some of the investigated noise amplitudes and B0 shifts (indicated by colors)

FIGURE S10
Denoising capability of NN1. First column: Lorentzian fit on raw, un-denoised Z-spectra. Second column: Lorentzian fit on denoised Z-spectra (PCA, keeping only components suggested by the median criterion). Third column: NN1 prediction from un-denoised Z-spectra. Fourth column: NN1 prediction from denoised Z-spectra. Rows correspond to the APT, NOE, and MT peak amplitudes and the water peak shift δDS

FIGURE S11
Effect of input perturbation on NN0 outputs (no training data augmentation) and uncertainty estimations. The same perturbations as in Figure 5 were applied to copies of the entire slice of the test data set shown in Figure 4. (A) Mean prediction error of NN0 (light blue) compared to NN1 (orange) and mean uncertainty of NN0 (red) compared to NN1 (orange) across the whole slice for different noise amplitudes between 1% and 10%. Error bars correspond to SD of the prediction error across all voxels in the slice. (B) Scatter plot of absolute prediction error versus logarithm of NN0 uncertainty and marginal distributions for some of the investigated noise amplitudes (indicated by colors). (C) Mean prediction error of NN0 (light blue) compared to NN1 (dark blue) and mean uncertainty of NN0 (red) compared to NN1 (orange) across the whole slice for different additional B0 shifts between 0.1 ppm and 2 ppm. Error bars correspond to SD of the prediction error across all voxels in the slice.