Deep learning–based Lorentzian fitting of water saturation shift referencing spectra in MRI

Water saturation shift referencing (WASSR) Z‐spectra are used commonly for field referencing in chemical exchange saturation transfer (CEST) MRI. However, their analysis using least‐squares (LS) Lorentzian fitting is time‐consuming and prone to errors because of the unavoidable noise in vivo. A deep learning–based single Lorentzian Fitting Network (sLoFNet) is proposed to overcome these shortcomings.


INTRODUCTION
Water saturation spectra, more commonly known as Z-spectra, are used in chemical exchange saturation transfer (CEST) MRI experiments to probe exchangeable protons in low-concentration (mM range) solute molecules. [1-7] After these exchangeable protons are selectively saturated using RF pulses, the saturation is transferred to water, resulting in a reduction of the water signal. This effect is visible in the Z-spectra, where the normalized water signal is plotted as a function of saturation frequency. At the resonance frequency of water, the signal exhibits a significant decrease caused by direct saturation (DS). [4] However, small shifts in the Z-spectra, in the range of 5-10 Hz, may result in erroneous CEST quantification. [4,5] By using a low B_1 and a short saturation duration, the DS effect in the Z-spectrum can be isolated, yielding a water saturation shift referencing (WASSR) Z-spectrum with a Lorentzian line shape. The error in the CEST quantification can then be corrected by using the WASSR Z-spectrum in each voxel for field-shift correction. [2,8-10] Similarly, when measuring saturation-transfer effects that require low B_1, such as relayed nuclear Overhauser enhancements, the water resonance can be fitted not only to estimate field shifts but also to remove water-based spillover effects. [11,12] WASSR Z-spectra have also been used to determine local field shifts for whole-brain magnetic susceptibility mapping. [13] Another application has been the study of relaxation effects in tissue, [10] as the linewidth (LW) of the WASSR Z-spectrum relates to the longitudinal (R_1) and transverse (R_2) relaxation rates. [10,14] By fitting a WASSR Z-spectrum to a Lorentzian curve, its parameters (amplitude, LW, and frequency shift) can be obtained. The conventional approach is to use least-squares (LS) Lorentzian fitting. [8,10,13,15,16]

However, when phased-array coils are used, in vivo data are corrupted by spatially dependent noise that distorts the signal intensities in the voxels, rendering the fit prone to errors. In addition, conventional LS fitting approaches are usually time-consuming. Recently, deep learning (DL) has been used to analyze CEST data, for example for fitting CEST spectra to multiple Lorentzians at 3T and 7T by Glang et al. [17] and Hunger et al., [18] respectively; both of these Lorentzian fittings included denoising as a preprocessing step. Another application has been fitting a single phosphocreatine CEST peak (of arbitrary shape, with an additional magnetization-transfer contrast background) by Chen et al. [19] Furthermore, Li et al. [20] targeted the glutamate CEST signal by applying a convolutional neural network to overcome B_0 inhomogeneities and low SNR when mapping parenchymal glutamate in the brain. The focus of these previous studies was on Z-spectra with multiple saturation-transfer effects, including CEST, relayed nuclear Overhauser enhancements, and magnetization-transfer contrast, some of which show non-Lorentzian and mixed line shapes. The aim of the present study was to develop and implement a DL-based approach, the single Lorentzian fitting neural network (sLoFNet), for fitting WASSR Z-spectra to a single Lorentzian line shape with improved robustness against noise fluctuations and higher time efficiency.

Discrete Z-spectrum
The normalized WASSR Z-spectrum follows a Lorentzian line shape that can be defined as follows [10,13,14]:

S_sat(Δf_RF) / S_0 = 1 − A · (LW/2)² / [(LW/2)² + (Δf_RF − Δf_H2O)²]    (1)

where S_sat(Δf_RF) is the signal intensity at the irradiation offset Δf_RF (in Hz) from the scanner resonance frequency; S_0 is the signal intensity without saturation; A is the normalized signal amplitude; LW is the linewidth (in Hz); and Δf_H2O is the water proton frequency shift (in Hz) in the voxel. Notice that the scanner resonance frequency is set to the average water frequency of the brain, whereas the individual water frequency in a voxel varies over the brain, allowing WASSR MRI to be used to map the magnetic field variation over the brain. Each of the discrete, normalized signal values in the Z-spectrum can be expressed by Eq. (1). Thus, for 2N + 1 sampling points given by the offset vector (Δf_−N, …, Δf_N), the discrete Z-spectrum is the corresponding vector of normalized signal values x_i,: = (x_−N, …, x_N).
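As a concrete illustration, Eq. (1) can be coded directly. This is a minimal sketch; the parameter names mirror the text, and the parameterization assumes the standard WASSR form with the half-linewidth in the denominator:

```python
import numpy as np

def lorentzian(df_rf, A, LW, df_h2o):
    """Normalized WASSR Z-spectrum signal of Eq. (1).

    df_rf: saturation offset(s) in Hz; A: normalized amplitude;
    LW: linewidth in Hz; df_h2o: voxel water frequency shift in Hz.
    """
    half = LW / 2.0
    return 1.0 - A * half**2 / (half**2 + (df_rf - df_h2o)**2)
```

On resonance (Δf_RF = Δf_H2O) the dip reaches its full depth A; at an offset of LW/2 from the water shift, the dip is at half depth; far off-resonance the signal recovers toward 1.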

Rician noise and estimation of noise variance
For in vivo WASSR Z-spectra, the signal values at each frequency are governed by the Rician probability density function (PDF). [23] To obtain a more realistic simulated sample x_i,: with a discrete number of frequency-dependent signal values according to Eq. (1), noise has to be added, and the resulting noisy signal of each element in (x_i,:)_noisy must be governed by the Rician PDF according to Eq. (2) [16]:

(x_i,:)_noisy = sqrt[(x_i,: + random_1)² + (random_2)²]    (2)

where (x_i,:)_noisy is an arbitrary noisy sample that has been transformed to be governed by the Rician PDF; random_1 and random_2 are two uncorrelated vectors of equal length to x_i,:, where each element is randomly sampled from a normal distribution with zero mean and SD σ; and the squaring and square root are applied element-wise. Several methods have been proposed to determine the noise variance of magnitude MR images, [24,25] with Sijbers et al. [25] proposing a subtraction method defined as

σ² = 2 ⟨(M_s − M_a)²⟩    (3)

where ⟨⋅⟩ is an operator taking the spatial average; M_s is the magnitude signal value for a single acquired image; and M_a is the corresponding value of the averaged magnitude signal from images of two different acquisitions.
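The two steps above can be sketched as follows. This is a minimal sketch; the factor of 2 in the variance estimator is an assumption that follows from M_a being the average of two acquisitions, so that Var(M_s − M_a) = σ²/2:

```python
import numpy as np

def add_rician_noise(x, sigma, rng):
    """Eq. (2): element-wise transform of a clean sample x into a
    Rician-distributed noisy sample using two uncorrelated normal vectors."""
    r1 = rng.normal(0.0, sigma, size=np.shape(x))
    r2 = rng.normal(0.0, sigma, size=np.shape(x))
    return np.sqrt((x + r1) ** 2 + r2 ** 2)

def noise_variance(m_single, m_avg):
    """Eq. (3)-style subtraction estimate of the noise variance.

    Assumes m_avg is the average of two acquisitions of the same object,
    so Var(m_single - m_avg) = sigma^2 / 2, hence the factor of 2."""
    return 2.0 * np.mean((m_single - m_avg) ** 2)
```

For a pure-noise input (x = 0), the Rician PDF reduces to a Rayleigh PDF with mean σ·sqrt(π/2), which is a convenient sanity check for the noise transform.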

Evaluation metrics
The most common metric for evaluation of regression-predictive machine learning (ML) models is the RMS error (RMSE) [26]:

RMSE = sqrt[(1/n) Σ_{i=1..n} (y_i − ŷ_i)²]

where y_i is the ground-truth value; ŷ_i is the predicted value; and n is the total number of predictions. Another relevant metric, used as an alternative to, or in combination with, the RMSE, is the mean absolute error (MAE), defined as

MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i|

For evaluating predictions on data for which the ground truth is unavailable, such as the predictions of Lorentzian parameters in vivo, the discrepancy between the discrete input signal values and the resulting fit of the predicted parameters can be measured in terms of both RMSE and MAE. In that case, y_i is replaced with a vector of the discrete signal values of the considered sample, whereas ŷ_i is replaced with the corresponding signal values from the predicted Lorentzian fit of that sample.
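The two metrics can be sketched in a few lines (note that RMSE ≥ MAE always holds, since the RMSE additionally weights large errors):

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error between ground truth y and predictions y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error between ground truth y and predictions y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))
```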

In vivo data
Approval for human studies on 6 subjects (49 ± 8 years; 5 patients and 1 healthy volunteer; all male except 1 female patient) was obtained from the Johns Hopkins Medicine Institutional Review Board, and signed informed consent was obtained from all subjects. The diagnoses of the patients included glioblastoma, meningioma, and brain metastasis resulting from rhabdomyosarcoma. All data were deidentified and defaced at Johns Hopkins University before being processed.

MRI acquisition
Brain images were acquired using a 3T Philips Elition RX system (Philips Healthcare, Best, Netherlands) equipped with 95-mT/m gradients. The signal receiver was a 32-channel phased-array head coil, and a quadrature body coil was used for RF transmission. To ensure Lorentzian line shapes for the obtained WASSR Z-spectra, it is necessary to minimize magnetization-transfer contrast effects while attaining sufficient direct saturation. This was accomplished by applying a train of 10 sinc-Gaussian-shaped RF pulses (B_1 = 0.5 μT, 50 ms in duration) before a simultaneous multislice gradient-echo EPI readout (multiband factor = 3). A total of 28 saturation frequency elements were acquired at frequency offsets distributed around the water resonance. For all data, the first two frequency elements (±1280 Hz) were omitted to ensure that steady state was reached for the WASSR Z-spectra, leaving 26 saturation frequencies in the WASSR Z-spectra. The other two 10-ppm (±1280 Hz) intensities were averaged and used for S_0. A total of 8050, 7945, and 18 935 individual WASSR Z-spectra were extracted from the white matter (WM), gray matter (GM), and CSF regions, respectively. Despite normalization to S_0, some WASSR spectra showed intensities larger than 1.0. Therefore, an additional normalization by the highest intensity value was applied to ensure a maximum value of 1.0 in the fitting.
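The two-step normalization described above can be sketched as follows (a minimal sketch; the function name and example arrays are illustrative):

```python
import numpy as np

def normalize_zspec(signals, s0_pair):
    """Normalize a raw WASSR Z-spectrum: divide by the mean of the two
    +/-1280 Hz acquisitions used as S0, then rescale by the maximum so
    that no intensity exceeds 1.0 (handles spectra above 1.0 after the
    S0 division)."""
    z = np.asarray(signals, float) / np.mean(s0_pair)
    return z / z.max()
```

For example, a spectrum whose largest normalized intensity is 1.02 is rescaled so its maximum becomes exactly 1.0 before fitting.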

TABLE 1. Defined search space for the hyperparameter optimization of the proposed Lorentzian fitting neural network (sLoFNet).

Simulated data
To train a neural network to predict Lorentzian parameters from a discrete number of noisy (frequency-dependent) signal values, a paired training data set (X_train, Y_train) was created, consisting of a total of 5 000 000 sample-label pairs. The samples in X_train were simulated using Eq. (1), with the same 26 discrete saturation frequency offsets used during the in vivo data acquisition. The values of the parameters (A, LW, and Δf_H2O) for each sample were randomly chosen from a uniform distribution over the intervals defined previously and then saved in Y_train as the corresponding label for the simulated sample. Hence, the ground truth of the parameters (amplitude, linewidth, and shift) was available for each sample. All the simulated samples were assigned random noise and transformed to be governed by the Rician PDF according to Eq. (2). For each of the acquired in vivo WASSR brain scans, the variance of the noise was determined by applying Eq. (3) using two consecutive acquisitions of the same brain (because all scans were acquired in a time-resolved setting with multiple time points) to generate the single and averaged magnitude images. The average of the resulting variances was used as the variance of the simulated samples. An additional 10 000 samples, with their corresponding labels, were generated in the same way and kept as a hold-out testing set (X_test, Y_test) to evaluate the trained models. A comparison between the training and testing data sets was performed to ensure that no samples from the training set were included in the testing set. Similarly, another 10 separate test sets, containing 10 000 samples each, were generated, each with a different noise level incremented from σ = 0% to σ = 4.5% in steps of 0.5%, to test the robustness of the trained models and the conventional LS method with respect to noise level.
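The data-generation pipeline can be sketched as follows. The offset grid, parameter intervals, sample count, and noise level here are hypothetical placeholders; the paper defines its own ranges and uses 5 000 000 samples:

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 1000                           # 5 000 000 in the actual training set
offsets = np.linspace(-160.0, 160.0, 26)   # hypothetical 26 offsets (Hz)

# Hypothetical uniform parameter intervals (the paper defines its own)
A  = rng.uniform(0.5, 1.0, n_samples)
LW = rng.uniform(20.0, 120.0, n_samples)
df = rng.uniform(-40.0, 40.0, n_samples)

# Eq. (1) evaluated on the offset grid for every sample (vectorized)
half = LW[:, None] / 2.0
X = 1.0 - A[:, None] * half**2 / (half**2 + (offsets[None, :] - df[:, None])**2)
Y = np.stack([A, LW, df], axis=1)          # ground-truth labels (A, LW, shift)

# Eq. (2): Rician noise at the variance estimated from the in vivo scans
sigma = 0.01
X_noisy = np.sqrt((X + rng.normal(0.0, sigma, X.shape)) ** 2
                  + rng.normal(0.0, sigma, X.shape) ** 2)
```

Each row of X_noisy is one simulated noisy WASSR Z-spectrum, and the matching row of Y is its label.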

Network configuration
A base architecture for the sLoFNet was constructed, containing a dense input layer with 128 nodes, three dense hidden layers with 256 nodes each, and a dense output layer with three nodes for the three continuous prediction values: A, LW, and Δf_H2O. For all layers, the rectified linear unit activation function was used, except for the final layer, for which the linear activation function was used (to allow the output to take values from the defined continuous intervals). When the model with the base architecture was trained, the RMSE loss function and a batch size of 32 were used.
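In Keras, the base architecture reads roughly as follows. This is a sketch, not the authors' code; the RMSE loss is written as a custom function since Keras ships MSE but not RMSE out of the box:

```python
import tensorflow as tf
from tensorflow import keras

def rmse_loss(y_true, y_pred):
    # RMSE loss used for training the base model
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

def build_base_slofnet(n_points=26):
    """Base architecture from the text: a 128-node dense input layer,
    three 256-node dense hidden layers (ReLU), and a 3-node linear
    output layer predicting (A, LW, shift)."""
    model = keras.Sequential([
        keras.Input(shape=(n_points,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(3, activation="linear"),
    ])
    model.compile(optimizer="adam", loss=rmse_loss)
    return model
```

For 26 input points this base network has on the order of 1.7 × 10^5 trainable parameters, which is small enough to train quickly on millions of simulated spectra.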
To obtain the optimal hyperparameters and thus optimize the network configuration, HyperOpt was used. [27] For the optimization, an objective function (measuring the performance of the candidate configurations), a search space (defining the hyperparameters and the possible choices for the configurations), and a search algorithm (determining how the defined space is searched) need to be defined. The objective function was set to target the prediction errors (PEs), both RMSE and MAE, of the trained candidate configurations on a test data set (10 000 samples) generated for this purpose. The search space included the number of hidden layers, the number of nodes in each layer, the loss function, the final activation function, and the batch size, with the choices provided in Table 1. The search algorithm was specified to be random, and the maximum number of iterations was set to 1200. The configuration of the resulting optimized network is also given in Table 1.
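The study used HyperOpt for this search; the underlying random-search idea can be sketched dependency-free as follows. The search space values and the placeholder objective are illustrative, not the Table 1 entries (the real objective trains a candidate network and returns its test-set PE):

```python
import random

random.seed(0)

# Illustrative search space mirroring the kinds of choices in Table 1
space = {
    "n_hidden":   [2, 3, 4, 5],
    "n_nodes":    [64, 128, 256, 512],
    "loss":       ["rmse", "mae"],
    "batch_size": [16, 32, 64],
}

def objective(cfg):
    # Placeholder score; in the real pipeline this trains a candidate
    # sLoFNet and returns its prediction error on a dedicated test set.
    return abs(cfg["n_nodes"] - 256) + 10 * abs(cfg["n_hidden"] - 3)

def random_search(space, objective, n_iter=500):
    """Draw random configurations and keep the best-scoring one."""
    best_cfg, best_score = None, float("inf")
    for _ in range(n_iter):
        cfg = {k: random.choice(v) for k, v in space.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search(space, objective)
```

Random search trades exhaustiveness for speed: with 1200 iterations over a modest discrete space, the chance of missing a well-performing region is small.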
All code was written and executed in a Python 3.7 environment on an Intel Xeon W-2245 (Intel, Santa Clara, CA, USA) central processing unit, and tensor computations were executed on a Titan RTX (Nvidia, Santa Clara, CA, USA) graphics processing unit with 24 GB of VRAM. The application programming interface used for the DL coding was Keras, [29] with TensorFlow [28] as the backend.

Simulated data
Once the optimized network configuration was found, the architecture was instantiated, and a model was trained on the full training data set (X_train, Y_train) of 5 000 000 samples; this model is referred to as sLoFNet_HP. Another three instances of the same optimized configuration were instantiated to provide the models sLoFNet_1-3, which were instead trained with subsets of the full training data set of varying sizes: the first with a subset of 5000 samples, the second with 50 000, and the third with 500 000. This allowed investigation of performance stability as a function of the number of training samples used, and hence also of training time. For each of the models, 10% of the used training data served as a validation data set during training.
To test the effect of sampling density on the performance of the proposed method, sLoFNet_HP was trained (with the same training data) in five different settings: the default 26-point setting and an additional four settings with a varied number of frequency points (4, 8, 16, and 36). The frequency values for the different settings were chosen symmetrically and with increasing offsets around the DS frequency.
During the training of all models, the Adam [30] optimization function was used to update the weights and biases via backpropagation of the output error (compared with the ground-truth labels). The initial learning rate (LR) was set to 0.001, and a reduction schedule was implemented to achieve faster convergence, reducing the LR by a factor of 10 every time improvement stalled for two epochs, until the limit of LR = 10^-6 was reached, at which point training was terminated.
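The reduction schedule can be sketched as a plain function that replays the rule over a series of per-epoch validation losses (in Keras this role is typically played by the ReduceLROnPlateau callback; this stand-alone version just makes the logic explicit):

```python
def schedule_lrs(val_losses, lr0=1e-3, factor=0.1, patience=2, min_lr=1e-6):
    """Replay the LR rule from the text: start at lr0, divide by 10
    whenever the validation loss fails to improve for `patience` epochs,
    and stop training (truncate the schedule) once the LR would drop
    below min_lr."""
    lr, best, wait, lrs = lr0, float("inf"), 0, []
    for loss in val_losses:
        lrs.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr, wait = lr * factor, 0
                if lr < min_lr:          # training terminated
                    break
    return lrs
```

The sharp drops in the loss curves of Figure 2A correspond to the epochs at which this rule fires.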

Comparison of training with simulated data and in vivo samples
In this study, the training data used to obtain the single Lorentzian fitting neural network were simulated using Eq. (1) with added Rician noise. To validate the use of simulated data for training, the performance of training on simulated samples was compared with that of training on in vivo samples, using two distinct models. The first model was sLoFNet_HP (trained with 5 000 000 simulated samples); the second model was trained using approximately 56 000 in vivo samples (from GM, WM, and CSF) with label annotations made by the LS method. To yield an improved training set, outliers among the annotated in vivo samples were detected using a histogram analysis of the average of the sample parameters (linewidth and shift) of each sample. This gave a distinct bar at an average of 64 Hz, containing approximately 50 000 samples, and another bar at an average of approximately 1150 Hz, containing the remaining samples. The detected outliers were removed, resulting in a total of approximately 50 000 training samples (this model is referred to as sLoFNet_Invivo). The trained models, as well as the conventional LS method, were then applied to an in vivo test data set (test-set_7000) with approximately 7000 samples (from GM, WM, and CSF), and the results were compared.
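The histogram-based cleaning step can be sketched as follows. The synthetic clusters mimic the two histogram bars described above (around 64 Hz and around 1150 Hz), and the 500 Hz cutoff between them is an assumed value for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical LS annotations: a plausible cluster of good fits near
# 64 Hz plus a small cluster of failed fits near 1150 Hz
good = rng.normal(64.0, 10.0, 5000)
bad  = rng.normal(1150.0, 50.0, 600)
avg_params = np.concatenate([good, bad])   # per-sample mean of LW and shift

# Keep only samples falling in the physically plausible histogram bar
cutoff = 500.0                             # assumed threshold between bars
cleaned = avg_params[avg_params < cutoff]
```

Because the two bars are widely separated, a single threshold between them cleanly removes the failed-fit cluster.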

Testing and evaluation
The models with optimized configuration, trained on either the full training data set (sLoFNet_HP) or a subset (sLoFNet_1-3), were applied to predict the parameters of the hold-out test set (X_test, Y_test) with 10 000 samples. The error of the predictions compared with the available ground-truth labels Y_test was calculated in terms of RMSE for each of the four optimized models. For comparison, the training and validation RMSE losses at the final epoch for each of the models were also reported. In this way, the training, validation, and testing performances in terms of RMSE could be evaluated as a function of the number of samples used for training. For the predictions made by sLoFNet_HP on the samples of the hold-out test set, the error compared with the ground-truth (GT) labels was calculated in terms of both RMSE and MAE. For comparison, the conventional LS fitting method was executed using the lmfit package in Python. [15] The prediction time on the hold-out set was also measured for both methods.
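The study used the lmfit package for the conventional LS fit; the same fit can be sketched with SciPy's curve_fit as a stand-in. The offset grid, true parameters, and noise level here are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(df_rf, A, LW, df_h2o):
    # Eq. (1): normalized single Lorentzian line shape
    half = LW / 2.0
    return 1.0 - A * half**2 / (half**2 + (df_rf - df_h2o)**2)

offsets = np.linspace(-160.0, 160.0, 26)   # assumed sampling grid (Hz)
true_params = (0.9, 80.0, 12.0)            # A, LW, shift (illustrative)

# Simulate one noisy spectrum (Eq. 2, Rician noise)
rng = np.random.default_rng(1)
z = lorentzian(offsets, *true_params)
z_noisy = np.sqrt((z + rng.normal(0.0, 0.01, z.shape)) ** 2
                  + rng.normal(0.0, 0.01, z.shape) ** 2)

# Conventional least-squares fit of the three Lorentzian parameters
popt, _ = curve_fit(lorentzian, offsets, z_noisy, p0=(0.8, 60.0, 0.0))
```

At ~1% noise the LS fit recovers the parameters closely; the robustness comparison in the text concerns what happens as the noise level grows.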
Next, sLoFNet_HP and sLoFNet_Invivo, as well as the LS method, were applied to extract the Lorentzian parameters from 10 test data sets with noise levels increasing from 0% to 4.5%. The PEs in terms of RMSE and MAE compared with the GT were studied as a function of sample noise level.
Finally, sLoFNet_HP and the LS method were used to generate estimates on samples from in vivo regions (CSF, WM, GM, and tumor). Because the GT was not available, the goodness-of-fit was measured in terms of the discrepancy (RMSE and MAE) between the values of the measured sample points and the corresponding values on the predicted fitted line.
To investigate the generalizability of the approach, the PEs resulting from the different sLoFNet_HP settings (4, 8, 16, 26, and 36 points) were compared with the errors of the corresponding settings of the LS method. The influence of noise level was taken into consideration. In addition, the PEs were investigated as a function of the WASSR sweep-width-to-LW ratio (WSW-to-LW), in which the WSW is the frequency range over which the saturation frequencies were recorded. To compare the performance (PEs and time consumption) of the two methods, a paired Student's t-test was used with the null hypothesis that the means are equal. A preceding Shapiro-Wilk test was performed to test the null hypothesis that the data follow a normal distribution. A complete summary of the workflow is illustrated in Figure 1.

FIGURE 1. An overview of the project's workflow.
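The statistical comparison can be sketched as follows. The per-sample errors here are synthetic stand-ins (drawn to roughly echo the MAE scale reported later); the real test is applied to the measured RMSE, MAE, and time data:

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel

rng = np.random.default_rng(7)
# Hypothetical per-sample MAEs of the two methods on the same test samples
err_ls = rng.normal(3.7, 0.4, 30)   # least-squares
err_dl = rng.normal(3.1, 0.4, 30)   # sLoFNet

# Shapiro-Wilk: null hypothesis = the data are normally distributed
_, p_ls = shapiro(err_ls)
_, p_dl = shapiro(err_dl)

# Paired Student's t-test: null hypothesis = equal mean errors
_, p_equal = ttest_rel(err_ls, err_dl)
```

The Shapiro-Wilk check comes first because the paired t-test assumes approximately normal data; only if normality is not rejected is the t-test p-value interpreted directly.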

Training results and effect of training data size
The results of the training of sLoFNet_HP are presented in Figure 2A, where the training and validation losses (RMSE) converge after only a few training epochs. Both losses follow the same trend, which is an indication of good model generalization. Some sharp drops in RMSE (the largest at approximately 55 epochs and another at around 70 epochs), observed for both training and validation, are a consequence of the reduction schedule applied to the step size (LR), which accelerates convergence by reducing the LR after a plateau is reached, so that the optimization keeps moving toward the minimum of the loss function instead of circling it with too-large steps.
The RMSE performance of the networks trained with different amounts of training data is shown in Figure 2B. The model generalizability of sLoFNet_1 (5000 samples) was poor, as evidenced by the discrepancy among the RMSEs of the training, validation, and test sets. With an increased number of training samples, the overall model performance (OMP) (i.e., having a low training, validation, and test loss) improved. In particular, the model generalization capability improved, because the discrepancy among the losses for the training, validation, and test sets decreased. The most obvious improvements in terms of both OMP and generalizability were realized early on. While the OMP and the generalizability improved with the number of training samples, the improvement rate flattened out as the number of training samples increased, whereas the total training time kept increasing substantially, as can be seen from the logarithmic x-axis scale at the top of Figure 2B.
The MAE between sample points and the fitted curves resulting from the models trained with in vivo versus simulated data, when applied to the test-set_7000, was 3.10 Hz for sLoFNet_HP and 3.02 Hz for sLoFNet_Invivo, and the corresponding error for the LS method was 3.67 Hz. Thus, the model trained on in vivo data had an error approximately 2.5% lower than that of the model trained on simulated data, whereas both DL methods had more than 15% lower error than the LS method on the same test data. However, it should be noted that the error found for sLoFNet_Invivo (3.02 Hz) was for the cleaned version of the in vivo training data set, consisting of a subset of approximately 50 000 samples taken from the 56 000 samples of the full in vivo training data set. The sLoFNet_Invivo trained on the full 56 000 samples yielded an error of 90.5 Hz, which is almost 30 times higher.

Comparison between methods on simulated test data
The optimized model, sLoFNet_HP, produced a slightly lower PE on the test set (X_test, Y_test) than the (non-optimized) sLoFNet model (Table 2). Both models produced significantly lower PEs on X_test, and required significantly less time to predict, than the LS method.
For noise-free samples, the LS method produced a lower error than the DL approach. However, as soon as noise was introduced, the errors of the LS predictions increased rapidly. As can be seen in the insets of Figure 3, sLoFNet_HP and LS produced comparable PEs between 0% and 0.5% noise (the average noise level observed in the in vivo 3T images was 1%). Above 1% noise, the error of the LS predictions increased rapidly, whereas the error of sLoFNet_HP increased only marginally (Figure 3). The sLoFNet_Invivo model also showed only a marginal increase of error with increasing noise level, albeit slightly higher than that of sLoFNet_HP. However, its error was somewhat higher than that of LS at low noise levels (Figure 3A,C,D) and, for the amplitude estimations (Figure 3B), the error of sLoFNet_Invivo was higher than that of LS over the whole studied range of noise levels.

Comparison between methods on in vivo test data
Three representative in vivo voxel examples of WASSR Z-spectra fitted by sLoFNet_HP, sLoFNet_Invivo, and the LS method, from different regions of a brain (CSF, WM, and GM), are shown in the top row of Figure 4. An additional six fits (three good and three poor) on voxels from T_GBM, T_RMS, and T_MM are shown in Figure 5A-F, respectively. The fits by the different methods agree very well for all examples and are almost completely overlapping on visual inspection. Except for the poor-fit samples from T_GBM and T_MM (Figure 5D,F, respectively), where a slight deviation was observed, the fits were in excellent agreement with the sample points, which mostly fell on, or very close to, the fitted lines.

TABLE 2. Prediction errors and time consumption for the sLoFNet_HP and the least-squares method when applied to the hold-out set and in vivo data sets from different regions of 3T WASSR brain scans.

FIGURE 3. The mean absolute error (MAE) of the sLoFNet_HP predictions (blue), sLoFNet_Invivo predictions (green), and the least-squares fits.

The results in Figure 4 (bottom row) show small differences between the DL methods and the LS method, with the best result for sLoFNet_HP of the order of 1% or less. In the remainder, we therefore used only the sLoFNet_HP approach. Examples of LW and shift maps from the healthy volunteer and the rhabdomyosarcoma patient, generated by sLoFNet_HP and the LS method, are shown in Figure 6. The difference maps (Figure 6C,F,I,L) indicate that the difference was close to zero in many of the voxels.

FIGURE 4.

The average difference between the methods was less than 3% for the LW and less than 8% for the shift. It can also be observed (for both LW and shift) that the difference in the tumor region was slightly higher (Figure 6F,L).
The good agreement between the two methods is also reflected by the comparable PEs (in terms of both RMSE and MAE) on the in vivo samples from different regions of the brain (Table 2). The Shapiro-Wilk test applied to the RMSE, MAE, and time data produced by the LS and sLoFNet methods showed no significant deviation from the normal distribution. Furthermore, the paired Student's t-test indicated no significant difference between the PEs of the two methods (p = 0.35 and p = 0.27 for RMSE and MAE, respectively). The prediction time was, however, significantly different (p = 0.020) between the methods: the proposed sLoFNet_HP was on average 70 times faster.

Figure 7 shows that the PEs (only shown for MAE) increased as the number of points in the samples was reduced, for both sLoFNet_HP and the LS method; this trend was observed for both LW and Δf_H2O (Figure 7A,B). Below approximately 15 points, the error increased rapidly for sLoFNet_HP, whereas for the LS method the increase occurred earlier (below approximately 25 points) and was more pronounced. The two methods produced comparable PEs down to approximately 25 sampling points for samples with 0.2% noise, but the difference in PE between the two methods (for the same number of points) increased with increasing noise, with sLoFNet_HP showing the smaller error (Figure 7).
The PEs of the shift (Δf_H2O) and the LW (only shown for MAE) as a function of the WSW-to-LW ratio are displayed in Figure 8A,B, respectively. For the 36-point setting, the Δf_H2O-PE curves of the two methods almost overlap, and the smallest error was observed at a WSW-to-LW ratio between 3 and 4. However, for all other settings (i.e., below 36 points), sLoFNet_HP produced a smaller error for the estimated shift (Figure 8). The LW errors produced by the sLoFNet_HP settings were consistently lower than those of the corresponding LS settings. The smallest LW error for the different settings was observed at a WSW-to-LW ratio of approximately 4 (except for LS with 4 and 8 points).

The algorithm, training results, and effect of training data size
For more complex tasks, particularly nonlinear problems (as in this study), neural networks are commonly used instead of simpler ML algorithms. [31] However, this comes at the cost of a significantly larger number of hyperparameters that must be tuned for optimal performance; hence the hyperparameter optimization applied in this study. It should be noted that repetitions of the optimization process would likely yield different optimized configurations, because the obtained configuration is one of many optimal solutions. Goodfellow and Vinyals [32] showed, using an interpolation approach between different configurations, that there is a smooth plateau region of the objective-function landscape [33] in which all model configurations perform comparably. Thus, the focus should be on ensuring that the obtained configuration is not one that overfits to the training data.
The agreement between the training and validation losses (difference less than 1% at the final epoch) of sLoFNet_HP indicated that no overfitting occurred. Furthermore, the agreement among the training, validation, and test losses provided evidence of the model's generalizability. As shown in Figure 2B, sLoFNet_1 (5000 samples) was overfitted and, naturally, had poor generalizability, as seen from the discrepancy between the test loss and the training and validation losses. Nevertheless, with a slight increase in the number of training samples, the OMP greatly improved, which can be explained by an improved representation of the true sample distribution. The improvement of the OMP continued as the number of training samples was increased further, but at a decreasing rate, whereas the training time increased substantially. Hence, a trade-off between acceptable generalizability and training time (number of training samples) must be made. The model with the optimized configuration has many more trainable parameters than the base architecture (by two orders of magnitude). Although an increased number of trainable parameters allows more solutions, the risk of overfitting also increases, [33,34] thus requiring more samples (and longer training). The requirements on the number of training samples (and therefore training time) can be reduced by choosing, from the models in the plateau region of the objective function, the one with the fewest trainable parameters. This can be achieved by restricting the search space to a lower maximum number of layers and nodes per layer.

FIGURE 6. Examples of parameter maps: LW (A-C) and shift (G-I) for a healthy volunteer; LW (D-F) and shift (J-L) for a rhabdomyosarcoma metastasis patient. The first column is produced by sLoFNet_HP, the second by the LS method, and the third is the difference between the two methods (sLoFNet_HP - LS).
The benefits of using simulated data are the possibility of producing large quantities of data and the availability of a ground truth (i.e., the actual parameters used to produce each sample). Although the best representation is obviously found in in vivo training data sets, the actual GT is then not available, and explicit annotations of each in vivo sample are required to produce a paired in vivo training data set. For example, in the studies by Glang et al. [17] and Hunger et al., [18] fitting CEST spectra to multiple Lorentzians, the annotations of the parameters for the Lorentzian fitting of each CEST spectrum were made using LS, to produce the paired in vivo training sets used for training the DL models. Because the DS of WASSR spectra follows a single Lorentzian shape, accurately described by Eq. (1), training with simulated samples with added Rician noise is expected to yield a fitting capability equivalent to that of training with in vivo data. Although, for a small sample size of 50 000, the model trained with simulated data had a slightly higher error than the model trained on in vivo data, their errors were within 2.5% of each other. It is not surprising that the model trained on in vivo samples produced slightly lower errors, as its training data are indeed more representative of the targeted test data, including effects that are not fully captured by the simulations. Although noise was added to the simulated samples, the noise in the in vivo training data reflects the true noise distribution more accurately. In vivo data also include potential contributions from motion and partial volume effects between tissues, which slightly change the signal appearance while still maintaining a close-to-Lorentzian shape. However, when the number of simulated training samples was increased, the error decreased significantly because of improved representation (improved generalizability, as verified in Figure 2B).
Although it is straightforward to increase the number of simulated training samples, it is not as simple for in vivo data, because of the limited availability and the lack of a GT. An interesting observation is the much higher error produced by the network trained on the full set (56 000 samples) of LS-annotated in vivo data. In fact, this model was not able to converge during training. A closer inspection showed that some of the annotated samples had unrealistic parameter values (an average for the linewidth and shift of approximately 1050 Hz), whereas the annotations produced by the trained sLoFNet_HP on the same samples yielded reasonable values (within the defined intervals). Many of these deviating samples originated from tissue boundary regions of the brain, where partial volume effects may change the appearance of the Z-spectrum such that it is no longer well approximated by a single Lorentzian curve.

FIGURE 8. The prediction errors (MAE in Hz) as a function of the water saturation shift referencing (WASSR) sweep-width-to-LW ratio (WSW-to-LW) for the predicted LW (A) and the predicted shift (B). The different curves show samples with a different number of points (✕ 4, • 8, ◼ 16, ▲ 26, and ▾ 36) using sLoFNet_HP (blue) and LS (orange).

Evaluation metrics
For evaluation of the PE of the two methods, both RMSE and MAE were used, because there is no consensus in the literature on which metric is better suited for evaluating regression-predictive ML problems. Although RMSE is the most commonly used, Willmott et al. 35 showed that RMSE depends not only on the average error (i.e., the MAE) but also on the number of samples used and, hence, on the variance within the samples. This makes the MAE a more robust metric for intermodel comparisons. Nevertheless, for the results obtained in this study, the trends of the two metrics were in general agreement.
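The distinction between the two metrics can be made concrete with a toy example: two error distributions with identical MAE but different variance yield very different RMSE values (the numbers below are arbitrary and purely illustrative).

```python
import numpy as np

def mae(err):
    # Mean absolute error
    return float(np.mean(np.abs(err)))

def rmse(err):
    # Root-mean-square error
    return float(np.sqrt(np.mean(np.square(err))))

# Two error distributions with identical MAE (1 Hz) but different variance:
uniform_err = np.full(100, 1.0)   # every error is exactly 1 Hz
spiky_err = np.zeros(100)
spiky_err[:10] = 10.0             # a few large errors, most are zero

# mae(uniform_err) == mae(spiky_err) == 1.0
# rmse(uniform_err) == 1.0, but rmse(spiky_err) == sqrt(10) ≈ 3.16
```

RMSE thus penalizes a high-variance error distribution much more strongly, which is why it can rank models differently than MAE does.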

Performance comparison for in vivo data
Good agreement between the sLoFNet HP and LS predictions was observed for the in vivo samples. Because the noise level in vivo was approximately 1%, this agrees with the findings of the noise test (Figure 3), which showed comparable PEs for the two methods at that noise level. The PEs in the different brain regions were of the same order of magnitude, except for T GBM and T MM , whose errors were more than a factor of 2 higher. A closer inspection showed that the T GBM and T MM samples deviated slightly more from a Lorentzian line shape than the other regions (Figure 5D,F), which reduced the goodness of fit. These deviations also resulted in the larger differences between the LW and shift maps predicted by the two methods (seen for rhabdomyosarcoma in Figure 6F,I). In tumors, there is a high probability that different tissue types with inherently different LWs coexist within a given voxel, which results in a curve that no longer adheres to a Lorentzian shape. Thus, to fit the spectra of tumor tissue with extensive deviation from the Lorentzian shape, a preceding separation of the contributions from the overlapping tissue types is necessary. This is a promising direction for future work: the separation could possibly be achieved with a DL approach, and the resulting separated Lorentzian spectra could in turn be passed to sLoFNet for more accurate parameter estimation.
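The partial-volume effect can be illustrated numerically: averaging two Lorentzian dips with different LWs (an idealized two-compartment voxel; all parameter values below are arbitrary) produces a curve that deviates markedly even from the single Lorentzian matched to its peak height and measured width.

```python
import numpy as np

def lorentzian_dip(freqs, amp, lw, shift):
    # Lorentzian dip (the direct-saturation part of the Z-spectrum)
    return amp * (lw / 2.0) ** 2 / ((lw / 2.0) ** 2 + (freqs - shift) ** 2)

freqs = np.linspace(-250.0, 250.0, 501)
# Equal-volume mixture of two compartments with very different LWs:
mix = 0.5 * lorentzian_dip(freqs, 0.9, 40.0, 0.0) \
    + 0.5 * lorentzian_dip(freqs, 0.9, 160.0, 0.0)

# Single Lorentzian matched to the mixture's peak height and measured FWHM:
half = np.where(mix >= mix.max() / 2.0)[0]
fwhm = freqs[half[-1]] - freqs[half[0]]
single = lorentzian_dip(freqs, mix.max(), fwhm, 0.0)

residual = np.max(np.abs(mix - single))  # clearly nonzero: not a single Lorentzian
```

The substantial residual illustrates why voxels containing mixed tissue types degrade any single-Lorentzian fit, regardless of the fitting method.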

Noise effects
It is worth noting that the LS method outperformed sLoFNet HP in terms of both RMSE and MAE on the noise-free simulated test data set, which is expected because of the inherent approximation uncertainty of the neural network. However, the stability of sLoFNet HP against noise was evident from the only marginal PE increase over the full range of noise levels studied, whereas for LS the PEs increased rapidly beyond 1% noise (Figure 3). This robustness of sLoFNet HP against noise can, for example, be exploited to reduce the MRI acquisition time of WASSR spectra by accepting a higher noise level, or to increase the spatial resolution, such as when using WASSR for QSM. 13
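For reference, the conventional LS baseline can be sketched with SciPy's general-purpose curve_fit; this is a stand-in, and the paper's own LS implementation and starting-value heuristics may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian_z(freqs, amp, lw, shift):
    # Single-Lorentzian Z-spectrum model
    return 1.0 - amp * (lw / 2.0) ** 2 / ((lw / 2.0) ** 2 + (freqs - shift) ** 2)

def ls_fit(freqs, z):
    # Nonlinear least-squares fit; initial guesses taken from the spectrum minimum
    p0 = [1.0 - z.min(),                 # amplitude guess: depth of the dip
          (freqs[-1] - freqs[0]) / 4.0,  # linewidth guess: quarter of the sweep
          freqs[np.argmin(z)]]           # shift guess: location of the minimum
    popt, _ = curve_fit(lorentzian_z, freqs, z, p0=p0, maxfev=5000)
    return popt  # (amplitude, LW, shift)
```

With noise-free input, such a fit recovers the parameters essentially exactly; as reported above, its accuracy degrades rapidly once the noise level exceeds roughly 1%.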

Method generalizability
Although the two methods had comparable performance down to approximately 25 sample points (Figure 7), sLoFNet HP degraded more slowly as the number of frequency points was reduced. This was observed for both Δf H2O and LW (Figures 7 and 8): in the 4-point setting, sLoFNet HP showed an error below 1 Hz, whereas LS exhibited an error almost four orders of magnitude larger. In addition, the comparability of the two methods down to 25 points was only observed for the case with 0.2% sample noise (Figure 7). This observation agrees with the insets of Figure 3, which indicate that both methods produced almost the same errors at noise levels between 0% and 0.5%. With increasing noise, however, the performance gap between the two methods at the same number of points widened further, with sLoFNet HP outperforming the LS method. This was also verified by the findings in Figure 3, where the error for LS increased rapidly beyond 1% noise while only marginally increasing for sLoFNet HP . The same marginal increase observed for sLoFNet Invivo confirms the robustness of the DL approaches against increasing noise; nevertheless, the error was slightly higher for the model trained on in vivo data, likely because of the much smaller training data set compared with the simulated data used for sLoFNet HP .

The lowest PE of Δf H2O was observed at a WSW-to-LW ratio between 3 and 4, in agreement with the results of Kim et al., 2 who suggested sampling with 16-32 points and a WSW-to-LW ratio of 3.3-4.0 to obtain a MAE below 1 Hz. The lowest PE of LW was seen at WSW-to-LW ratios of 4 and above. Although the WSW-to-LW ratio for the lowest PE agrees well with the results presented by Kim et al., the constraint on the sampling density needed to maintain a MAE below 1 Hz is less strict for sLoFNet HP . This can be exploited to reduce the number of sampled frequencies in the acquired WASSR Z-spectra and, therefore, the total acquisition time. However, to obtain a MAE below 10 −2 Hz for the predicted shift with sLoFNet HP , the results (Figure 8) suggest a sampling density of at least 16 points and a WSW-to-LW ratio between 3 and 4.
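The sampling recommendation can be expressed as a small helper: given an expected LW, the sweep width follows from the target WSW-to-LW ratio. The function name and defaults below are illustrative, not part of the published protocol.

```python
import numpy as np

def wassr_offsets(expected_lw_hz, ratio=3.5, n_points=16):
    # WASSR sweep width (WSW) from the target WSW-to-LW ratio,
    # sampled uniformly and centered on the water resonance
    wsw = ratio * expected_lw_hz
    return np.linspace(-wsw / 2.0, wsw / 2.0, n_points)

# For an expected LW of 100 Hz: a 350-Hz sweep sampled at 16 points
offsets = wassr_offsets(expected_lw_hz=100.0)
```

In practice the expected LW depends on tissue and field strength, so the ratio constrains the sweep rather than fixing absolute offsets.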

CONCLUSIONS
We designed a DL method for Lorentzian fitting of WASSR MRI Z-spectral data (sLoFNet HP ) and compared its performance with conventional LS fitting, both in simulations and in vivo. LS performed well only up to noise levels of approximately 1%, whereas sLoFNet HP remained stable up to at least a 4.5% noise level. When the Z-spectral sampling density was decreased, sLoFNet performance remained acceptable down to 15 frequency points, whereas LS degraded noticeably below 25 points. The sLoFNet HP model was also faster than LS by a factor of about 70. Therefore, sLoFNet HP adds considerable value to the fitting of WASSR MRI Z-spectra, in terms of both reduced computation time and robustness against increased noise levels. CEST acquisitions (e.g., amide proton transfer-weighted MRI) are currently starting to be added to clinical packages but require assessment of B 0 homogeneity. The use of sLoFNet for WASSR data can enhance the clinical workflow for CEST packages by allowing accelerated calculation of B 0 and LW maps with higher robustness against noise. All code for building and training is available online (https://github.com/Sajad125/sLoFNet-code), making it straightforward for manufacturers to incorporate or for local users to add as offline software connected to a patient database.