Accelerated white matter lesion analysis based on simultaneous T1 and T2∗ quantification using magnetic resonance fingerprinting and deep learning

To develop an accelerated postprocessing pipeline for reproducible and efficient assessment of white matter lesions using quantitative magnetic resonance fingerprinting (MRF) and deep learning.


| INTRODUCTION
White matter (WM) lesions are a common brain imaging finding in multiple sclerosis (MS), a disease affecting the central nervous system. WM lesions are commonly characterized by increased T 1 and T * 2 relaxation times. 1 T 2 -weighted imaging or fluid-attenuated inversion recovery (FLAIR) is most commonly used in clinical MRI. 2 However, conventional magnetic resonance imaging (MRI) only provides limited insights into the pathological substrate of tissue changes (eg, axonal loss, inflammation, demyelination). Specifically, qualitative imaging inherently hampers standardization and reproducibility. Therefore, quantification of relaxation times, such as T 1 , T 2 , and T * 2 , is increasingly receiving interest for providing additional information beyond qualitative imaging. [3][4][5] However, most quantitative methods suffer from long acquisition times, as the acquisition of multiple qualitative images is required. This renders quantitative MRI susceptible to intrascan motion. Furthermore, due to interscan motion and image distortion, multiple successive scans commonly need to be co-registered in order to allow for joint analysis.
Magnetic resonance fingerprinting (MRF) is a promising, time-efficient approach for quantification of multiple tissue parameters in a single acquisition. 6 In MRF, characteristic magnetization evolutions are generated for tissues by varying sequence parameters including flip angle, echo time (TE), and repetition time (TR) throughout the acquisition. Thus, MRF has shown the potential to differentiate between healthy and pathological tissue and may, therefore, be useful for clinical MRI. 7 Rieger et al proposed an MRF sequence based on an echo-planar imaging (EPI) readout for simultaneous quantification of T 1 and T * 2 times covering the whole brain in less than 5 minutes. 8,9 Lower undersampling factors are applied compared with spiral MRF, reducing not only the noise per magnitude image but also the total number of magnitude images. This method was recently shown to provide clinically robust T 1 and T * 2 in neuro and renal applications. [8][9][10] However, compared with other EPI scans, the high acceleration factors lead to a lower signal-to-noise ratio (SNR) than is common for many clinical applications. Multiple denoising strategies have been proposed to improve the image quality and accuracy. [11][12][13] Recently, Marchenko-Pastur principal component analysis (MPPCA) was proposed to denoise EPI diffusion MRI images. 14 This is of particular interest, as a recent study demonstrated the value of denoising the acquired MRF magnitude images to improve the quality of the quantitative maps. 10 The large number of magnitude images in an MRF acquisition leads to long reconstruction times, which has been acknowledged as one of the drawbacks of the MRF methodology. 7,15 Additionally, several postprocessing steps hinder the practicability in clinical usage. In the wake of recent developments, deep learning has superseded other approaches in many areas of data processing. Numerous publications have shown the benefits of using deep learning for medical imaging. [16][17][18][19][20][21]
Specifically, deep learning accelerates processing steps and is capable of reconstructing MRI data. [22][23][24] Denoising plays an important role in MRI, and several networks were evaluated to improve the visual image quality using generative adversarial networks and deep neural networks. [25][26][27][28] Furthermore, image synthesis, which transforms a set of input images to a new set of image contrasts, has gained attention. 29,30 These image transformations can also contain deformable registration and artifact correction, which showed good accuracy using CNNs. [31][32][33] Especially for MRF, several models using fully connected neural networks, 34,35 and recurrent and convolutional neural networks (CNNs) 18,[36][37][38] were analyzed, showing promising results regarding the speed and accuracy of the reconstruction. 39 A deep learning reconstruction on MRF data using the spatiotemporal relationship between neighboring signal evolutions was proposed, 40,41 which showed an improvement in the reconstruction, especially for undersampled complex MRF data. The U-Net has frequently been used to process medical data for segmentation and regression tasks. [42][43][44][45] Since most of the MRF acquisition techniques acquire a large number of highly undersampled images, the reconstruction problem is high dimensional. Therefore, a two-step deep learning approach was proposed in Ref. [19] that, first, reduces the dimensionality by using feature extraction with a fully connected network, 46 and second, uses a U-Net for spatially constrained quantification. The advantage of this learning-based model is that it incorporates tissue properties of the neighboring pixels, which makes it more resilient to noise. 47
In this study, we performed MRF-EPI for simultaneous quantification of T 1 and T * 2 in the whole brain on 50 patients with WM lesions and 10 healthy volunteers and analyzed the T 1 and T * 2 times in WM and gray matter (GM).
Compared to conventional MRF methods, our MRF-EPI only slightly undersamples the k-space allowing for conventional parallel imaging reconstruction and yielding magnitude data that contains all relevant structural information. We developed a CNN for the MRF-EPI reconstruction of denoised and distortion corrected T 1 and T * 2 maps, and WM and GM probability maps. Furthermore, we compare different outputs, loss functions, and patches of the CNN for optimizing the entire reconstruction using deep learning.

| METHODS
This bicenter study was approved by the local institutional review board at both sites (2019-711N, BCB2012/7965), and written, informed consent was obtained prior to scanning. We performed MRF-EPI in 10 healthy volunteers (75% male, 22-30 (mean: 26) years) and 18 patients (39% male, 23-73 (mean: 39) years) with MS on a 3T scanner (Magnetom Skyra, Siemens Healthineers, Erlangen, Germany) at site 1 and in 32 patients (37% male, 1-63 (mean: 41) years) with MS on a 3T scanner (Magnetom Prisma, Siemens Healthineers, Erlangen, Germany) at site 2. Figure 1 depicts an overview of the MRF pipeline. The conventional steps (1-6), namely acquisition, denoising, dictionary generation, reconstruction, distortion correction, and masking, are depicted in the first part. The approach for standardization and acceleration using deep learning is shown in the second part, combining steps 2-6 into a single CNN.

474 | HERMANN Et Al.

FIGURE 1
Schematic of the acquisition and postprocessing pipeline.
Step 1: Varying flip angles, TE and TR with inversion pulses are played for the MRF-EPI sequence.
Step 2: Denoising the magnitude data by MPPCA denoising.
Step 3: Generation of the dictionaries for all T 1 and T * 2 .
Step 4: Voxel-wise matching to generate the parametric maps via simple dictionary matching.
Step 5: Distortion correction of the MRF maps using a restricted nonlinear registration onto the T 2 -weighted image to correct susceptibility artifacts of the EPI readout.
Step 6: WM and GM segmentation using SPM12 on the MRF T 1 maps. Manual lesion segmentation of the T 1 -FLAIR data.
Steps 2-6: Deep learning as a tool to integrate all postprocessing steps in a single operation.
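The voxel-wise dictionary matching of Step 4 can be illustrated with a minimal Python sketch. It assumes the common MRF matching rule (largest normalized inner product between the measured signal evolution and each dictionary entry); the source only states that "simple dictionary matching" was used, so the function name `match_dictionary`, the array layout, and the synthetic dictionary are hypothetical and serve illustration only.

```python
import numpy as np

def match_dictionary(signals, dictionary, params):
    """Return the parameters of the best-matching dictionary entry per voxel.

    signals:    (n_voxels, n_timepoints) measured magnitude evolutions
    dictionary: (n_entries, n_timepoints) simulated evolutions
    params:     (n_entries, n_params) tissue parameters of each entry
    Assumption: matching by maximum normalized inner product.
    """
    eps = 1e-12  # avoids division by zero for empty (background) voxels
    d = dictionary / (np.linalg.norm(dictionary, axis=1, keepdims=True) + eps)
    s = signals / (np.linalg.norm(signals, axis=1, keepdims=True) + eps)
    similarity = np.abs(s @ d.T)          # (n_voxels, n_entries)
    best = np.argmax(similarity, axis=1)  # index of best entry per voxel
    return params[best], best

# Synthetic example: 200 random "fingerprints" with 35 time points each.
rng = np.random.default_rng(0)
n_entries, n_timepoints = 200, 35
dictionary = rng.random((n_entries, n_timepoints))
params = np.column_stack([np.linspace(30, 4000, n_entries),   # placeholder T1 (ms)
                          np.linspace(5, 3000, n_entries)])   # placeholder T2* (ms)

true_idx = np.array([3, 57, 199])
signals = 2.5 * dictionary[true_idx]  # scaled copies of known entries
maps, idx = match_dictionary(signals, dictionary, params)
```

Because both sides are normalized, the match is independent of the overall signal scale, which is why the scaled copies are recovered.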

| Magnetic resonance fingerprinting
The acquisition was based on the previously proposed MRF-EPI technique for which accuracy and precision to gold standard methods were already evaluated. 8 Dictionaries were generated per slice using MATLAB (The MathWorks; Natick, MA, USA) consisting of 131,580 entries with T 1 (30-4000 ms) in 5% steps, T * 2 (5-3000 ms) in 5% steps, and flip angle efficiency B1+ (0.65-1.35) in steps of 0.05.
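As a rough illustration of the parameter grid described above, the following Python sketch builds geometric grids with 5% steps for T 1 and T * 2 and a linear grid for B1+. The restriction to entries with T * 2 ≤ T 1 is an assumption that is not stated in the text; under that assumption the sketch yields 131,580 entries, the number reported, which is suggestive but does not confirm the rule the authors used. The helper name `geometric_grid` is hypothetical.

```python
import numpy as np

def geometric_grid(start, stop, rel_step=0.05):
    """Values start * (1 + rel_step)**k for k = 0, 1, ... that do not exceed stop."""
    n = int(np.floor(np.log(stop / start) / np.log1p(rel_step)))
    return start * (1.0 + rel_step) ** np.arange(n + 1)

t1 = geometric_grid(30.0, 4000.0)    # T1 in ms, 5% steps
t2s = geometric_grid(5.0, 3000.0)    # T2* in ms, 5% steps
b1 = np.round(0.65 + 0.05 * np.arange(15), 2)  # B1+ efficiency, 0.65 to 1.35

# Assumption (not stated in the source): keep only physically plausible
# combinations with T2* <= T1.
T1, T2S = np.meshgrid(t1, t2s, indexing="ij")
valid = T2S <= T1
n_entries = int(valid.sum()) * b1.size
print(t1.size, t2s.size, b1.size, n_entries)
```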

| Principal component analysis denoising
We used MPPCA 14 to denoise the magnitude data of the MRF acquisition before reconstruction. Originally, the denoising strategy was proposed to estimate a non-Gaussian distribution on diffusion MRI data. The noise is estimated in a local neighborhood by the eigenvalues of principal component analysis using the Marchenko-Pastur distribution. 14 Quantitative T 1 and T * 2 maps were compared with and without denoising. Denoising was performed on a per slice basis using a 2-dimensional (2D) kernel. As we are not interested in the actual image contrast but in the absolute T 1 and T * 2 times, we use standard deviation to describe the noise in these values.

| Distortion correction
Distortion correction was performed to correct for susceptibility artifacts, especially around the nasal cavities. 48 Rigid registration was computed from the T 2 -weighted data to the MRF-magnitude data followed by a restricted nonlinear registration along phase-encode direction from the magnitude to the T 2 -weighted data using ANTs. 49 Distorted maps were then visually compared to the FLAIR and T 2 -weighted images to ensure that all modalities are properly registered.

| Data processing
White matter lesions were segmented manually by an expert radiologist on the FLAIR images. WM and GM were automatically segmented using SPM12 (Statistical Parametric Mapping version 12) 50 based on the T 1 maps acquired with MRF after denoising and distortion correction. The probability maps generated by SPM12 were transformed into binary masks by using a threshold (80%). Masks were visually inspected, and manually segmented WM lesions were removed from the WM and GM masks to improve accuracy.
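The masking step can be sketched in a few lines of Python. This is a minimal illustration, assuming that "threshold (80%)" means a probability of at least 0.8 and that lesion voxels are simply excluded from both tissue masks; the function name `tissue_masks` and the toy arrays are hypothetical.

```python
import numpy as np

def tissue_masks(p_wm, p_gm, lesion_mask, threshold=0.8):
    """Binarize WM/GM probability maps and exclude manually segmented lesions."""
    wm = (p_wm >= threshold) & ~lesion_mask
    gm = (p_gm >= threshold) & ~lesion_mask
    return wm, gm

# Toy 2 x 2 example; one high-probability WM voxel lies inside a lesion.
p_wm = np.array([[0.95, 0.85],
                 [0.40, 0.90]])
p_gm = np.array([[0.02, 0.10],
                 [0.85, 0.05]])
lesion = np.array([[False, True],
                   [False, False]])

wm_mask, gm_mask = tissue_masks(p_wm, p_gm, lesion)
```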

| MRF reconstruction
Our network was a modified U-Net 42 implemented in MATLAB 2020a (The MathWorks; Natick, MA) using the Deep Learning Toolbox. The network architecture is displayed in Figure 1. The training was performed on a GPU (Tesla K40m, Nvidia, Santa Clara, CA) for approximately 1 day per network. As inputs, the 35 differently weighted MRF-EPI magnitude images were used. The generated outputs were the T 1 and T * 2 maps and WM and GM probability maps. A brain mask was applied to exclude background noise. Data of 5 patients were randomly selected for testing, while the remaining data were chosen for training (49 datasets) and validation (5 datasets). Data of 6 healthy volunteers were acquired without T 2 -weighted images and were therefore excluded from training. Two patients from site 1 and 3 patients from site 2 were chosen for the testing set. The 2D network was trained on individual slices. We trained half the networks with the full input resolution (240 × 240 voxels) and the other half using 32 random patches (64 × 64 voxels) per slice (Table 1).

TABLE 1
Different parameters for all the networks compared in this work are listed here (columns: Networks, Input, Outputs, Loss function).

We evaluated the following 4 different loss functions (with reference value y i and predicted value ŷ i ): mean absolute error (MAE), mean squared error (MSE), logarithmic cosinus loss (LCL), and Huber loss (HL). We used Adam for optimization with a learning rate of 0.0001, L2-regularization of 0.0001, 50 training epochs, and batch size = 64 for all networks, which was empirically determined to be optimal. Additionally, we trained the networks using patches and the full input resolution and MAE with 3 different types of outputs: (1) the network was trained with a single output once for T 1 and another for T * 2 (single); (2) the network was trained with both T 1 and T * 2 in a single network (dual); and (3) the network was trained with 4 output maps T 1 , T * 2 , WM, and GM probability maps (4 outputs). Relative differences between dictionary matched and predicted maps were calculated, and the correlation coefficient of mean T 1 and T * 2 times between prediction and reference in WM, GM, lesions, and the whole brain was calculated. Reconstructions were executed on the CPU (Intel(R) Core(TM) i5-6500 @ 3.20 GHz).
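For orientation, the four loss functions can be written down in their textbook forms. The sketch below is a NumPy illustration, not the MATLAB implementation used in the study: it assumes that the "logarithmic cosinus loss" corresponds to the usual log-cosh loss, and the Huber threshold `delta` is a hypothetical parameter whose value is not given in the text. Any scaling or normalization applied in the actual training is likewise unknown.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y_hat - y))

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y_hat - y) ** 2)

def log_cosh(y, y_hat):
    """Log-cosh loss (assumed meaning of LCL), in a numerically stable form:
    log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2)."""
    a = np.abs(y_hat - y)
    return np.mean(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0))

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond (delta is assumed)."""
    a = np.abs(y_hat - y)
    return np.mean(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta)))

y = np.array([1.0, 2.0, 3.0])      # reference values
y_hat = np.array([1.0, 2.5, 5.0])  # predicted values
losses = {f.__name__: float(f(y, y_hat)) for f in (mae, mse, log_cosh, huber)}
```

MSE grows quadratically with the error, whereas MAE, log-cosh, and Huber grow only linearly for large errors, so they are less dominated by voxels with very large deviations.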

| Statistics
Mean T 1 and T * 2 times with standard deviations were calculated and pair-wise comparison was performed using Student's t-tests and correlation R-values. P-values less than 0.05 were considered significant. The mean Dice similarity coefficient was used as a statistical validation metric for the predicted WM and GM probability maps after binarizing them into logical masks.
Computational time was measured using a standard desktop PC.
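The Dice similarity coefficient used for the binarized WM and GM masks is defined as 2|A ∩ B| / (|A| + |B|). A minimal Python sketch follows; the function name `dice` and the convention of returning 1.0 for two empty masks are choices made here for illustration.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy example: binarize two probability vectors at 80% and compare.
p_ref = np.array([0.95, 0.90, 0.10, 0.85, 0.30])   # e.g. reference probabilities
p_pred = np.array([0.92, 0.60, 0.05, 0.88, 0.81])  # e.g. predicted probabilities
d = dice(p_ref >= 0.8, p_pred >= 0.8)
```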

| RESULTS
The first part of this section presents the results of the conventional methods for denoising, distortion correction, and dictionary-based reconstruction and analysis. The second part presents the comparison to the deep learning-based reconstruction and analysis.

| Conventional reconstruction
Image denoising was successfully performed using MPPCA and resulted in up to 50% decreased variability in the magnitude data and 15% reduced standard deviations of T 1 and T * 2 (Supporting Information Figure S1). Overall, denoising the MRF magnitude data took about 10 minutes per subject on a standard CPU.
After the denoising and reconstruction of the parametric T 1 and T * 2 maps, EPI distortion correction was performed as exemplarily shown in Supporting Information Figure S2. Deviations in the relaxation times of up to 10% were observed in caudal slices next to the nasal cavities after applying the distortion correction. Distortion corrected mean T 1 and T * 2 times show only minor variations (<2%) in WM, GM, and WM lesions compared with T 1 and T * 2 times without distortion correction. The distortion correction of the MRF data takes around 1 hour for one whole brain on a standard CPU.
Representative T 1 and T * 2 maps including annotations are shown in Figure 2 for both sites. Reconstruction of the parametric maps using a pattern-matching algorithm took around 20 minutes per subject. Mean T 1 and T * 2 relaxation times for WM, GM, and WM lesions are depicted in Figure 3 and provided in the Supporting Information Tables S1-S3. Differences between healthy and diseased subjects from both sites were less than 4% for T 1 and less than 2% for T * 2 in WM and less than 7% for T 1 and less than 3% for T * 2 in GM. MRF acquired in site 2 had 15% higher standard deviations in T 1 and T * 2 due to increased scan time acceleration. Mean T 1 relaxation times in WM lesions are widespread ranging from 800 ms, comparable to WM, up to 2500 ms. Mean T * 2 times in WM lesions were consistently higher (70%) than WM and GM with mean T * 2 times up to 200 ms. Clear separation between WM and GM was found in T 1 (Figure 3). We found a slight trend of increasing T * 2 (up to 10%) in WM and GM for increasing slice position (R = 0.974, P < .0001; Figure 3C). T * 2 was shorter and had higher standard deviations in caudal slices in the vicinity to the nasal cavities. No significant increase in T 1 and T * 2 with either age or gender was observed ( Figure 3D). T 1 and T * 2 times in WM lesions were highly heterogeneous and independent of their localization and size (P > .2).

| Deep convolutional network for MRF reconstruction
The computation time of the proposed CNN for 60 slices was about 5 seconds on a standard CPU workstation.
The performance of the reconstruction during the training process is depicted in Figure 4. Already after 5 epochs, the reconstructed maps show good visual agreement with the dictionary-matched maps. Figure 5 shows the 2D histogram of a representative slice in 1 subject for the CNN-predicted T 1 and T * 2 times over the dictionary matching. The relative difference showed mostly noise with few anatomical structures and mean deviations of less than 6% for T 1 and T * 2 . Variations in the CSF are increased, as seen around the ventricle and at the skull. T 1 and T * 2 times that exceed 3000 ms are cut, and therefore, the ventricle has variations of 0%. The average correlation coefficient R and the relative difference for T 1 and T * 2 were calculated for different loss functions and outputs (Figure 6). The P-value for all correlations was P < .001. For the 4 output models, the smallest relative difference for T 1 was observed when using the MAE, with deviations of 5.8% in the whole brain, and for T * 2 when using the LCL, with deviations of 6.0% in the whole brain. Correlation coefficients in the whole brain were higher than 0.99 for T 1 , except for the MSE (0.989), and higher than 0.985 for T * 2 . The relative difference in T 1 and T * 2 was observed to be the highest in GM. All relative differences and correlation coefficients are given in Supporting Information Table S4 for T 1 and Supporting Information Table S5 for T * 2 . The difference between the loss functions is visually depicted in Figure 7, where the MSE smooths the predicted maps the most, as clearly seen in the WM and GM probability maps. The HL yields increased T * 2 in WM and the MSE decreased T 1 in WM. In the WM probability maps, the LCL visually performed the best, as seen in the prediction around the lesion.
FIGURE 3
A, Mean T * 2 times over mean T 1 times for white matter (blue), gray matter (orange), and WM lesions (yellow) of all patients and subjects from both sites. Representative 3D T 1 and T * 2 maps are depicted on the right. B, Representative distributions of the T 1 and T * 2 times from A, which show a much wider spread for the WM lesions considering T 1 and T * 2 times compared with WM and GM. C, The mean T 1 (left) and T * 2 (right) over the slice position for white matter (blue) and gray matter (orange). Color brightness encodes different subjects. D, The WM and GM T 1 and T * 2 times over age and gender.

The training with full image input showed significant increases in the relative error (25.8% for T 1 and 21.6% for T * 2 ) and correlation coefficients of less than 0.90 for T * 2 in WM. Prediction in the WM performed better than in GM, with around 4% higher correlation coefficients, and correlation coefficients in WM lesions were observed to be higher than 0.98. The mean Dice coefficient across the test data was 0.9 for WM and 0.91 for GM after conversion into logical masks with a threshold of 80% for both SPM and DL probability maps (Table 1, network 4). Dice coefficients decreased up to 15% when training was performed on the full input size without patches (mean WM: 0.81, mean GM: 0.79). When training on T 1 and T * 2 as a dual output, prediction showed a slightly increased correlation coefficient (around 1%) and decreased relative difference compared with the 4 output models. Single T 1 and single T * 2 as outputs reached the highest correlation coefficients and smallest relative error among all networks. Figure 8 shows the mean T 1 and T * 2 times per subject between the DL and conventional reconstruction. A linear fit shows the correlation, which was above 0.99 with P < .0001 for both T 1 and T * 2 . The bright colored markers depict the test data, which are aligned to the linear fit. We observe a small offset in T 1 (55 ms) and in T * 2 (2.2 ms), which is within the standard deviations (100-200 ms ∼ 10%, 3-5 ms ∼ 10%). Figure 9 depicts the Dice coefficient between the WM and GM masks generated from the probability maps using SPM and our DL approach for different thresholds. The black line depicts the highest Dice coefficients, which closely follow a straight line with a correlation coefficient of 0.9965 for WM and 0.9974 for GM (both P < .0001). For a commonly used threshold of 80% for SPM, the Dice coefficient is shown for different thresholds of the DL WM and GM maps. For a threshold of 80% of the DL reconstruction, the mean Dice coefficient is higher than 0.9 for both WM and GM.
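The two agreement measures reported in this section can be sketched as follows. The snippet assumes that the relative difference is the mean absolute voxel-wise deviation within a tissue mask, expressed in percent of the dictionary-matched value, and that the correlation is the Pearson coefficient of per-subject mean values; the exact definitions used in the study may differ, and the function name and numbers below are purely illustrative.

```python
import numpy as np

def relative_difference(pred, ref, mask):
    """Mean absolute relative difference (percent) inside a tissue mask."""
    ref_vals = ref[mask]
    return 100.0 * np.mean(np.abs(pred[mask] - ref_vals) / ref_vals)

# Toy T1 maps in ms; the zero-valued voxel stands for masked background.
ref = np.array([[800.0, 1000.0],
                [1500.0, 0.0]])
pred = np.array([[840.0, 950.0],
                 [1500.0, 10.0]])
mask = ref > 0
rel_diff = relative_difference(pred, ref, mask)  # (5% + 5% + 0%) / 3

# Pearson correlation of per-subject means: a constant offset (here 55 ms)
# shifts the fitted line but leaves the correlation coefficient unchanged.
ref_means = np.array([800.0, 900.0, 1400.0, 1800.0])
pred_means = ref_means + 55.0
r = np.corrcoef(ref_means, pred_means)[0, 1]
```

This also illustrates why a high correlation coefficient and a small constant offset are reported separately: the correlation alone does not reveal a systematic bias.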

| DISCUSSION
We acquired MRF-EPI for simultaneous quantification of T 1 and T * 2 times in the whole brain. With a single convolutional neural network, we accelerated and combined several postprocessing steps, namely reconstruction, denoising, distortion correction, and masking.
FIGURE 5
Prediction of the CNN for 1 slice of a representative subject. The histograms (left panel) depict the predicted T 1 /T * 2 (top/bottom) of 1 slice over the T 1 /T * 2 generated by dictionary matching. The linear fit (red) with corresponding fit parameters and R- and P-values is shown. On the right side, the relative difference of T 1 and T * 2 is shown between the predicted and dictionary matched parametric maps. Voxel-wise differences range up to 30% around the ventricles, because the very high T 1 and T * 2 times of the CSF render the prediction difficult for the network.

MRF-EPI is a promising technique for quantification of T 1 and T * 2 of the whole brain in less than 5 minutes. T 1 and T * 2 times showed overall good agreement with literature. 1,3,5,51-55 However, as previously noted, MR relaxation times for WM and GM show wide variability among studies due to different sequences, fitting procedures, and natural variability among subjects. 51 Accuracy and precision measurements for the proposed MRF-EPI sequences were performed in previous work and therefore not analyzed in this study. 8,10 WM lesions exhibit a wide range of T 1 and T * 2 relaxation times. The relaxation times were independent of their localization and size in the brain. WM lesions were successfully delineated from WM, GM, and CSF based only on quantitative MRF T 1 and T * 2 maps. Lesions that are difficult to separate from CSF on conventional images show a clear difference in the T 1 and T * 2 maps acquired with MRF due to long T 1 times in CSF of around 3000-4000 ms compared with T 1 times in lesions of around 1000-2000 ms. These high and widespread ranges of T 1 times in lesions might be due to altered interstitial fluid mobility and water content from edematous brain tissue. 56 Thus, the use of quantitative relaxometry obtained by MRF might potentially enhance the segmentation around the CSF. A fraction of WM lesions exhibit only a slight elevation of the T 1 times compared with WM and, therefore, yield similar or even smaller values compared with GM. This hampers the separation of WM lesions and GM. However, the additional assessment of T * 2 proved to be beneficial for the assessment of those lesions and showed improved separation against GM. The increased sensitivity in T * 2 might be explained by the fact that T * 2 times in WM and GM yield similar values, and hence, deviations in T * 2 in lesions benefit delineation against both WM and GM. This is a gain compared to conventional methods such as FLAIR or T 2 -weighted images. We found no significant increase in T 1 and T * 2 with either age or gender, although a number of studies demonstrated that T 1 does change with age. 57,58 This might be due to the smaller number of subjects, since we split between the healthy and diseased subjects, and the narrow age range, especially for the healthy subjects. Further analysis of this might be performed when more subjects are measured. Only minor differences between data from site 1 and site 2 were observed (<7%), with no significant trends (P > .2). This demonstrates the potential of MRF as a quantitative method that is suitable for reproducible multicenter studies and a pathway to standardization. A slight trend of increasing T * 2 was identified in the cranial direction. This is unlikely to be a result of the acquisition scheme, as due to the slice interleaving any inaccuracies would be expected to appear interleaved as well. Instead, this effect might be explained by increasing B0 inhomogeneities in the axial direction. In site 2, additional scan time acceleration was achieved with a simultaneous multislice (SMS) factor of 3, reducing the effective scan time by a factor of 2-3. However, the use of SMS acceleration incurs an additional drop in SNR depending on the G-factor due to the coil geometry.
Accordingly, the quantitative data was found to have increased standard deviations of up to 15% compared with data from site 1. This might be improved by extending the acquisition scheme when using SMS or by using regularized SMS reconstructions. 59

FIGURE 6
Relative difference between the predicted and dictionary matched T 1 (left) and T * 2 (right) for the whole brain (black), WM (blue), GM (orange), and lesions (yellow) compared for the different loss functions and network outputs. The first 4 data points (mean absolute error [MAE], mean squared error [MSE], logarithmic cosinus loss [LCL], Huber loss [HL]; Table 1, networks 4-7) are the networks trained with patches and 4 outputs. The fifth one (Table 1, network 11) is trained with the full input resolution (full res.) and the MAE. The last 2 (dual output and single output; Table 1, networks 3 and 1+2) are trained using patches and the MAE loss function with 2 and 1 output maps, respectively. On the bottom, correlation coefficients for the linear fit between predicted and dictionary matched T 1 (left) and T * 2 (right) are shown for the different network outputs using the MAE, MSE, LCL, and HL.

Our deep learning-based reconstruction yielded only minor differences between the T 1 and T * 2 times of WM, GM, and WM lesions compared with conventional dictionary matching. These mean deviations of 5.8% for T 1 and 6.0% for T * 2 are small and in the range reported for other approaches (2-8%). 19,36 Of note is that there is no ground truth data and, therefore, the dictionary matched data is the reference with a precision of 5%. Our deep learning approach is in the area of this precision and might be more precise since the output is continuous for all parameters. Moreover, the deep learning reconstruction time was around 5 seconds for all slices, as compared to 20 minutes dictionary matching, 10 minutes denoising, and 1-hour distortion correction (90 minutes in total). We trained our networks with different loss functions and found that the MAE and LCL performed better regarding our regression task compared to the commonly used MSE function. 18,35 This might be due to the fact that in the MSE the CSF is weighted higher as it has longer T 1 and T * 2 times, and therefore, it is more difficult for the network to learn the relatively small differences in WM and GM. Since the T 1 and T * 2 times in the CSF are not of great clinical interest, we accept the loss in accuracy for the CSF. The Dice coefficient for WM and GM was in the range of reported literature (0.82-0.93) 45,60,61 and in the range of SPM (0.76-0.83) 62,63 and above 0.87 for all loss functions if the training was performed with patches. This might be explained by the fact that data augmentation (random patch extraction) prevents overfitting and enriches the dataset. Overall improved performance was observed for the training using patches independent of the loss function and the output. We found an overall 25.8% decreased relative error for T 1 and 21.6% for T * 2 , respectively.

This might be due to the fact that training with the full input resolution takes longer to converge. In contrast to conventional highly undersampled MRF acquisitions, our MRF-EPI approach does not require first extracting features and reducing the dimensionality of the network input, as proposed in other MRF deep learning reconstruction approaches. 19,46,47 Since the anatomical structure is retained, the network has to solve an image-to-image regression task, which might have smaller computational requirements. Fang et al 19 used a U-Net after the dimensionality was reduced: for their dataset, 2304 time points were used compared to 35 time points for our MRF-EPI (66 times smaller). We also used the U-Net since it captures information of the input locally and globally. This is important since we also include denoising and distortion correction within the same single network.

FIGURE 7
Comparison of the different loss functions to the dictionary matched input of 1 representative subject of the test data using network 4 from Table 1. T 1 , T * 2 , WM and GM probability maps are shown for the mean absolute error (MAE), mean squared error (MSE), logarithmic cosinus loss (LCL), and Huber loss (HL). A small patch (71 × 71) of 1 slice of a representative subject is shown. It is seen that the mean squared loss smooths the WM and GM probability maps the most.
Our reconstruction task included denoising and distortion correction within the MRF reconstruction and therefore, training with patches (64 × 64 voxels) achieved better results since the observed distortion from the EPI readout is only local at the nasal cavities and the frontal lobe of the brain. We showed that it is possible to perform denoising, distortion correction, and MRF reconstruction with one network architecture with relative difference within the standard deviation of the quantitative parameters.
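The patch-wise training input discussed here (32 random patches of 64 × 64 voxels per 240 × 240 slice, with 35 input contrasts and up to 4 output maps) could be generated along the following lines. This is a minimal NumPy sketch rather than the MATLAB code used in the study; the function name `random_patches`, the uniform sampling of patch positions, and the channel-last array layout are assumptions.

```python
import numpy as np

def random_patches(inputs, targets, n_patches=32, size=64, rng=None):
    """Extract co-located random patches from one slice.

    inputs:  (H, W, C_in)  e.g. 35 MRF-EPI magnitude images
    targets: (H, W, C_out) e.g. T1, T2*, WM and GM probability maps
    Assumption: patch positions are drawn uniformly over the slice.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = inputs.shape[:2]
    rows = rng.integers(0, h - size + 1, n_patches)  # upper bound is exclusive
    cols = rng.integers(0, w - size + 1, n_patches)
    x = np.stack([inputs[r:r + size, c:c + size] for r, c in zip(rows, cols)])
    y = np.stack([targets[r:r + size, c:c + size] for r, c in zip(rows, cols)])
    return x, y

# Placeholder slice with the dimensions given in the text.
slice_in = np.zeros((240, 240, 35), dtype=np.float32)
slice_out = np.zeros((240, 240, 4), dtype=np.float32)
x, y = random_patches(slice_in, slice_out, rng=np.random.default_rng(0))
```

Using the same row and column offsets for inputs and targets keeps each input patch aligned with its reference maps.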
We were additionally able to generate the WM and GM probability maps as outputs with only slightly decreased accuracy of the test data considering T 1 and T * 2 in WM and GM. We have shown that the Dice coefficients for the binarized WM and GM masks are in good correlation between our CNN and the reference SPM method. However, the network trained only on the T 1 and T * 2 maps (dual output) performed better than the 4 output model. Using single T 1 and single T * 2 maps as an output performed the best, with only minor improvement (<1%) compared to the dual output model. We compared the relative differences for different tissue types instead of using the RMSE as commonly used, 18,35 because outliers and variations of quantitative measures within single tissue types result in an overestimated error for a voxel-by-voxel comparison, especially in the CSF.
We showed that the predicted values correlate very well with the reference dictionary matched values for T 1 and T * 2 (R > .95, P < .0001) with only a slight offset, which is within the standard deviation. The correlation coefficient was the lowest for only WM since the range of single WM T 1 and T * 2 times is denser compared with GM and especially compared with lesions as provided in Figure 3.
FIGURE 8
Predicted T 1 (left) and T * 2 (right) times over the dictionary matched T 1 and T * 2 times for the 4 output networks using patches for training and the mean absolute error loss. Mean values per subject are shown in blue for WM, in orange for GM, and in yellow for the lesions. The increased brightness of the representative colors depicts the test data and the reduced brightness depicts the training and validation data. In 3 different gray shades, the single T 1 and T * 2 times per slice are shown. A linear fit is used to correlate the predicted and the dictionary matched quantitative maps with corresponding R- and P-values.

We achieved standardized results as we trained on data from both sites without significant differences between both sites (P < .01), even though the magnitude data from both sites varies due to different accelerations. However, changing the sequence parameters changes the magnitude evolutions. Therefore, new dictionaries have to be calculated and different or retrained networks are required. Transfer learning may make it possible to update the network when imaging parameters are changed. 24,64 Our study has some limitations. As GM suffers from partial volume effects, calculating the mean T 1 and T * 2 times strongly depends on the segmentation and the threshold used on the probability maps. Lesion segmentation could be an extra output from a CNN similar to the one investigated here. However, to obtain reliable results from this, more WM lesion data would be required, due to the large variation in lesion tissue parameters and the small fraction of lesions compared to WM and GM. In our experiments, the training datasets did not provide enough lesion examples for the training to converge without significantly affecting other outputs. The strength of deep learning approaches commonly stems from the abundance of training data. 65,66 Therefore, the proposed reconstruction will likely benefit from larger datasets.
Fractioning the full input into small patches is a first step to artificially generate more data, but data augmentation could be applied additionally. In this study, both sites operated on the platform of a single MRI vendor. A multi-vendor study is required for more universal comparisons.
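The idea of generating more training samples from patches, optionally combined with augmentation, can be sketched as follows. This is a simplified illustration for a single 2D image stored as a list of rows; the function names, the patch size, the stride, and the choice of left-right mirroring as augmentation are assumptions made for this example and do not describe the configuration used in this study.

```python
def extract_patches(image, patch_size, stride):
    """Split a 2D image (list of rows) into square patches of side patch_size."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - patch_size + 1, stride):
        for c in range(0, cols - patch_size + 1, stride):
            patches.append([row[c:c + patch_size] for row in image[r:r + patch_size]])
    return patches


def flip_left_right(patch):
    """Mirror a patch along its vertical axis (hypothetical augmentation)."""
    return [list(reversed(row)) for row in patch]


def augment(patches):
    """Return the original patches followed by their mirrored copies."""
    return patches + [flip_left_right(p) for p in patches]


# Toy 4 x 4 "image" with pixel values 0..15, for illustration only.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = extract_patches(image, patch_size=2, stride=2)
training_samples = augment(patches)
print(len(patches), "patches,", len(training_samples), "samples after augmentation")
```

In a real pipeline the same cropping and mirroring would have to be applied identically to the input magnitude images and to all target maps so that inputs and targets stay aligned.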

| CONCLUSIONS
MRF proves to be a promising approach for quantifying T 1 and T * 2 in subjects with MS and for obtaining this information in a standardized fashion across 2 clinical centers. The technique saves time through simultaneous acquisition of T 1 and T * 2 and might improve the lesion segmentation pipeline, as the quantitative measures of lesions are clearly separated from those of normal-appearing brain tissue types. We showed that deep learning enables a drastic speed-up of the postprocessing pipeline without a loss in accuracy and precision by combining denoising, distortion correction, reconstruction, and masking.
FIGURE 9 A, Dice coefficient for different thresholds of the SPM and DL (deep learning) WM (left) and GM (right) probability maps of a representative subject when using the MAE loss with 4 outputs and patch-wise training (Table 1, network 4). The black lines depict the maximum Dice coefficient along the different thresholds. The Dice coefficient between both binary masks is color encoded. B, Dice coefficient for a fixed threshold of 80% on the SPM WM and GM masks as a function of the threshold of the DL mask, as marked by the red area in A; both show a maximum Dice coefficient at around 80%. Exemplary WM and GM probability maps are depicted
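The threshold comparison described in the caption above can be sketched as follows, assuming flattened probability maps given as lists of values between 0 and 1. The function names and the toy values are hypothetical, and defining the Dice coefficient of two empty masks as 1 is a convention chosen for this example.

```python
def binarize(probabilities, threshold):
    """Turn a flattened probability map into a binary mask."""
    return [p >= threshold for p in probabilities]


def dice(mask_a, mask_b):
    """Dice coefficient 2|A and B| / (|A| + |B|) of two equally long binary masks."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(mask_a) + sum(mask_b)
    # Convention assumed here: two empty masks agree perfectly.
    return 2.0 * overlap / total if total else 1.0


def dice_over_thresholds(reference_prob, test_prob, reference_threshold, thresholds):
    """Dice between a fixed-threshold reference mask and the test mask per threshold."""
    reference_mask = binarize(reference_prob, reference_threshold)
    return [(t, dice(reference_mask, binarize(test_prob, t))) for t in thresholds]


# Toy probability values, for illustration only (not data from the study).
reference_prob = [0.95, 0.90, 0.85, 0.40, 0.10, 0.05]
test_prob = [0.90, 0.85, 0.70, 0.60, 0.20, 0.05]
curve = dice_over_thresholds(reference_prob, test_prob, 0.8,
                             [0.1, 0.3, 0.5, 0.7, 0.9])
best_threshold, best_dice = max(curve, key=lambda item: item[1])
print(f"best threshold = {best_threshold}, Dice = {best_dice:.3f}")
```

Repeating the sweep for several reference thresholds would give the two-dimensional Dice map of panel A, while a single call with a fixed reference threshold corresponds to a curve as in panel B.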

SUPPORTING INFORMATION
Additional Supporting Information may be found online in the Supporting Information section.

FIGURE S1 Representation of magnitude images, T 1 and T * 2 maps before and after using Marchenko-Pastur principal component analysis (MPPCA) denoising. On the right side, the relative difference between the acquired and the denoised image and map is depicted. MPPCA-denoised (blue) compared with nondenoised (orange) 2D signals of 1 voxel are depicted over the different contrasts

FIGURE S2 Distorted (top) and corrected (middle) maps (3 transversal slices and 1 sagittal slice) for T 1 . Distortion correction was performed using ANTs (nonrigid deformation in phase-encoding direction only) onto the T 2 -weighted images depicted as overlay, which shows in blue and red the deviations to the MRF-EPI before and after correction. The difference in percentage before and after distortion correction is shown. Major improvements (areas marked by arrows and circles) are observed after distortion correction, especially around the nasal cavities and the frontal lobe

TABLE S1 Mean T 1 and T * 2 times for white matter (WM), gray matter (GM), and WM lesions are listed for all patients from site 1. The last column shows the respective mean values among the subjects and the intersubject variability. The number of WM lesions per patient and the patients' age and gender are provided

TABLE S2 Mean T 1 and T * 2 times for white matter (WM) and gray matter (GM) are listed for all healthy volunteers from site 1. The last column shows the respective mean values among the subjects and the intersubject variability. The subjects' age and gender are provided

TABLE S3 Mean T 1 and T * 2 times for white matter (WM), gray matter (GM), and WM lesions are listed for all patients from site 2. The last column shows the respective mean values among the patients and the intersubject variability. The number of WM lesions per patient and the patients' age and gender are provided

TABLE S4 Mean absolute difference and correlation coefficients of T 1 for the different networks in the whole brain, WM, GM, and lesions. The loss functions mean absolute error (MAE), mean squared error (MSE), logarithmic cosinus loss (LCL), and Huber loss (HL) are used for the network with 4 outputs (T 1 , T * 2 , WM, and GM). Additionally, for MAE, the dual (T 1 and T * 2 as output) and single (combined single T 1 and single T * 2 as output) networks are listed. All networks are trained with patches and with the full input resolution (blue)

TABLE S5 Mean absolute difference and correlation coefficients of T * 2 for the different networks in the whole brain, WM, GM, and lesions. The loss functions mean absolute error (MAE), mean squared error (MSE), logarithmic cosinus loss (LCL), and Huber loss (HL) are used for the network with 4 outputs (T 1 , T * 2 , WM, and GM). Additionally, for MAE, the dual (T 1 and T * 2 as output) and single (combined single T 1 and single T * 2 as output) networks are listed. All networks are trained with patches and with the full input resolution (blue)