Quantification of spatially localized MRS by a novel deep learning approach without spectral fitting

To propose a novel end‐to‐end deep learning model to quantify absolute metabolite concentrations from in vivo J‐point resolved spectroscopy (JPRESS) without using spectral fitting.


INTRODUCTION
Numerous published reports have applied proton (¹H) MRS to basic neuroscience studies and clinical studies of brain disorders. 1 Currently, the most common approach to MRS quantification of metabolite concentrations is spectral fitting. 2-8 In general, spectral fitting is complicated by spectral overlaps, particularly for weakly represented metabolites (i.e., those metabolites having low concentrations or signal intensities). 9 Further complicating the fitting process, background signals originating from macromolecules and/or unsuppressed outer-volume signals, including scalp lipids, give rise to a background spectral baseline, 10,11 which is often difficult to separate from metabolite signals.
In contrast to spectra acquired with a single short TE, many experiments acquire MRS data consisting of different TEs, leading to additional independent spectral information. One such acquisition method is J-point resolved spectroscopy (JPRESS), 12 which acquires a series of different TE spectra. The sum of all spectra, known as the TE-averaged spectrum, has been used to detect glutamate (Glu), as the overlapping glutamine (Gln) signal was suppressed. 13,14 Using JPRESS, it was also found that a modulated average at J = 7.5 Hz can differentiate Glu from Gln without suppressing the latter. 15 However, these techniques are still one-dimensional, such that much of the spectral information in the entire echo train is lost due to signal averaging across the different echoes.
Although JPRESS provides more spectral information, including T2 and J-coupling, spectral fitting of in vivo JPRESS data 16,17 is more challenging than fitting single-TE spectra because additional fitting variables or prior knowledge are needed to account for the varying signal intensities of metabolites and baselines at different echoes.
Deep learning has proven to be exceptionally useful in medical imaging. 18,19 Significant efforts have also been made to develop deep learning approaches for MRS data preprocessing and spectral quantification. Lee et al. 20,21 presented a convolutional neural network architecture that learned to map the input data in the frequency domain to the target spectra linearly combined from basis spectra; metabolite concentrations were then obtained by fitting the output with a linear process. Gurbani et al. 22 incorporated a convolutional encoder into parametric model fitting, combining adaptive and unbiased convolutional networks with spectral models. Iqbal et al. demonstrated that deep learning could accelerate and quantify localized correlated spectroscopy, 23 and Li et al. 24 used learned component-specific representations to separate macromolecules from metabolites in short-TE spectroscopic imaging. Other deep learning applications in MRS include spectral reconstruction, denoising, artifact removal, and frequency and phase corrections. 25-28
Here we propose a novel neural network architecture to directly predict metabolite concentrations without using spectral fitting. We treat quantification of metabolites as a multiclass regression problem that can be effectively solved by deep learning. The key is to separate the individual metabolites into different spaces. To this end, the proposed model was designed for the dual task of predicting individual metabolite signals and concentrations concurrently. Leveraging the rich information carried by a diverse range of spectra with varying TEs and combining the strong signals at short TE with the clean baselines at long TE, this end-to-end model takes time-domain JPRESS data (i.e., FID signals) as its input, generates a unified representation to map the time-domain input to metabolite concentrations, and yields individual metabolite signals through a decoder.
Our hypotheses were (i) metabolite concentrations and individual component FIDs are very closely related in the sensor domain, such that a dual-task training for both metabolite concentrations and individual component FIDs can minimize overfitting as well as improve model performance; (ii) with the dual-task training metabolite concentrations and individual component FIDs can be obtained concurrently without spectral fitting. We further designed a novel encoder-decoder style neural network architecture and applied it to quantifying JPRESS data. Instead of learning the complex and often unpredictable background signals in the MRS data, our model was trained to distinguish the metabolite signals while filtering out unregistered signals. Our study revealed that short-TE MRS by LCModel fitting resulted in unjustified correlations between metabolite concentration estimates, as well as between these estimates and the noise levels. However, the proposed deep learning approach was able to eliminate or significantly reduce these correlations.

Problem formulation
In spectral fitting, parametric basis spectra of individual metabolite components are linearly combined to fit experimental data in the frequency or time domain. 2 Retrieving metabolite concentrations from input data is an inverse problem that is solved by the following equation:

$$m_c^* \subset \theta^* = \arg\min_{\theta} \mathcal{L}\left(x, \hat{x}(\theta)\right) \qquad (1)$$

where $m_c^*$ stands for the optimal estimate of the metabolite concentrations; $x$ is a vector representing the acquired MRS data; and $\hat{x}$ is the modeled spectral data with respect to the parameter space represented by $\theta$. Note that the metabolite concentrations $m_c^*$ form a subset vector in the parameter space of $\theta$. The objective $\mathcal{L}$ for spectral fitting is usually an L2 loss function. One of the problematic issues with spectral fitting is that, in general, the background signals cannot be accurately parameterized because they are not well defined like metabolite resonance signals and are often unpredictable (e.g., spectral artifacts, breakthrough outer-volume signals). A smooth baseline is usually added to $\hat{x}$ to account for background contributions and, as such, an outer loop of iterations is needed to solve Eq. (1). 3,29
Deep learning takes a very different approach. Specifically, the goal for quantifying metabolite concentrations is to convert the input MRS data in the sensor domain into a high-level representation in a hidden space that can be linearly mapped to the metabolite concentrations. Using an encoding process, spectral information from the sensor domain is converted into a dimensionality-reduced vector representation. A deep learning model consists of a stack of operation layers with weight parameters $w_i$ for layer $i$, determined during training.
For quantification of metabolite concentrations, the model can be symbolically expressed as a chain of layer operations (Eq. 2): each intermediate representation is obtained from the preceding one by an operation $F_i$ with weights $w_i$, for all $1 \le i < N$, and the concentrations are obtained from the final representation as $m_c = F_N(y_N, w_N)$. Here, $y_N$ is the final layer representation that maps to the metabolite concentrations $m_c$ through the linear operation $F_N$; $F_i$ stands for the operation in the intermediate layer $i$, which can be, for example, a linear transformation, a convolutional operation, or a stacking operation; and $F_1(y_0, w_0 = I)$ maps the input $y_0 = x$ to the first layer representation.
Deep learning is essentially a process of learning representations in different layers. 30,31 To quantify metabolite concentrations, the weight parameters in Eq. (2) are determined by supervised training. Loss gradients are obtained from backward computation, starting from the following equation:

$$w^* = \arg\min_{w} \mathcal{L}\left(m_c, m_c^{gt}\right) \qquad (3)$$

where $\mathcal{L}$ is the loss function; $m_c^{gt}$ is the ground-truth metabolite concentrations; and $w = \{w_1, \ldots, w_i, \ldots, w_N\}$. At test time, $w^*$ given by Eq. (3) is used for $w$ in Eq. (2) to compute $m_c$. Test and validation data sets were not included in the training data set.
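The layer-chain view described above can be illustrated with a minimal sketch. It is not the proposed network: the layer types, weights, and the helper names `linear`, `relu`, `forward`, and `l1_loss` are hypothetical and chosen only to show how each layer's output feeds the next layer and how a final linear operation yields the concentration estimate that enters the loss.

```python
from functools import reduce

def linear(weights):
    """Return a layer that multiplies its input vector by a weight matrix (list of rows)."""
    def layer(y):
        return [sum(w * v for w, v in zip(row, y)) for row in weights]
    return layer

def relu(y):
    """Element-wise rectifier, standing in for any intermediate nonlinearity."""
    return [max(0.0, v) for v in y]

def forward(x, layers):
    """Apply the layers in order: the output of one layer is the input of the next."""
    return reduce(lambda y, f: f(y), layers, x)

def l1_loss(pred, truth):
    """Mean absolute error between predicted and ground-truth concentrations."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

# Hypothetical toy model: one hidden linear layer, a nonlinearity, and a final
# linear layer that maps the last representation to a single concentration.
layers = [linear([[1.0, -1.0], [0.5, 0.5]]), relu, linear([[2.0, 1.0]])]
m_c = forward([3.0, 1.0], layers)
loss = l1_loss(m_c, [5.0])
```

In supervised training, the weights inside the layers would be adjusted to reduce `loss`; here they are fixed so that the chain itself is easy to follow.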

Model architecture
The proposed model is illustrated in Figure 1 with the forward operations going from bottom to top. The encoder consisted of three WaveNet 32 blocks, starting from the input of 32 FIDs. The input had the format (N, E, S, C), representing batches, echoes, data points, and channels, respectively. In this study, the input used the format (16, 32, 2048, 2). The two channels were for real and imaginary data, respectively. There were two output branches: metabolite concentrations and individual metabolite FIDs, on the left and right, respectively, in Figure 1. Starting from the input, each of the three WaveNet blocks extracted spectral features and created higher-level feature maps (all hidden layers had 128 dimensions), thereby yielding low-, intermediate-, and high-level feature maps, respectively, accompanied by the three-level individual echo representations after pooling over the 2048 sampling points. A bidirectional GRU (gated recurrent unit), 33 a variant of the long short-term memory (LSTM) neural network architecture, 34 was used to integrate individual echo representations through attention mechanisms, 35 and the new individual echo representations were expanded and concatenated with the previous-level feature maps. The subsequent output was then fed forward to the next block, and the final unified representation was obtained by pooling over all individual echo representations. The TensorFlow implementation of the encoder block can be found in Script S1.
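To make the input layout concrete, the following sketch packs complex FIDs into an (E, S, C) array with real and imaginary channels and then pools over the sampling-point axis to obtain one representation per echo. It is only an illustration: the helper names are hypothetical, only 8 sampling points are used instead of 2048, and mean pooling is an assumption because the text does not state which pooling operation was used.

```python
def to_channels(fids):
    """Convert E complex FIDs of S points each into a nested list of shape (E, S, 2)."""
    return [[[z.real, z.imag] for z in fid] for fid in fids]

def pool_over_samples(features):
    """Average over the sampling-point axis: (E, S, C) -> (E, C). Mean pooling is assumed."""
    n_channels = len(features[0][0])
    return [
        [sum(point[c] for point in echo) / len(echo) for c in range(n_channels)]
        for echo in features
    ]

def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

echoes, samples = 32, 8  # 8 points instead of 2048 to keep the example small
fids = [[complex(e + 1, -s) for s in range(samples)] for e in range(echoes)]
x = to_channels(fids)          # one sample of the batch, shape (32, 8, 2)
pooled = pool_over_samples(x)  # one vector per echo, shape (32, 2)
```

A batch dimension N would simply wrap 16 such samples, giving the (16, 32, 2048, 2) format quoted in the text.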
The proposed model also learned to reconstruct individual metabolite FIDs for all the echoes using a decoder, as shown in Figure 1. The two WaveNet blocks decoded spectral features from all three levels. The lower-level features were needed to restore the spectral information ignored by the encoder. The loss gradients were stopped by the skip connections (the dashed lines in Figure 1), to prevent gradients flowing via shortcuts.
All WaveNet blocks used a convolution dilation depth of 8 and a filter kernel size of 5. In addition to the metabolite concentrations, the unified representation was also trained to predict spectral phase and frequency offset. The model had approximately eight million trainable weight parameters.
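As a rough guide to what these settings imply, the receptive field of a stack of dilated convolutions can be estimated as follows. This is a sketch under assumptions not stated in the text: the dilation rate is taken to double with each layer (1, 2, 4, ...), as in the original WaveNet design, and each block is taken to contain a single such stack. The function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilation_depth):
    """Number of input points seen by one output point of a dilated convolution stack.

    Assumes one convolution per dilation level, with dilations 1, 2, 4, ..., 2**(depth - 1).
    """
    dilations = [2 ** i for i in range(dilation_depth)]
    return 1 + (kernel_size - 1) * sum(dilations)

rf = receptive_field(kernel_size=5, dilation_depth=8)
```

Under these assumptions a single block would cover 1021 of the 2048 sampling points, so stacking blocks lets the later feature maps span the whole FID.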

Training data set
Metabolite concentrations, spectral phase shifts, frequency offsets, and individual component FIDs for all 32 echoes were used as targets for supervised training. The training data set was generated via computer simulation with given ground-truth concentrations. Spatially localized basis spectra for 15 metabolites, including water, [...] were generated by computing spin density matrices of a 3D region containing 100 × 100 × 100 spatial points evenly distributed across the simulated region. 36 The simulated region was sufficiently large to encompass the voxel defined by the pulse sequence. The values in the brackets were the mean concentrations for each component in the training data set. The water component here represents the residual water after water suppression. The unsuppressed tissue water concentration was 43 300 mM and 35 580 mM for gray matter and white matter, respectively, 37 and was used to scale the in vivo data (see Section 2.5).

F I G U R E 1
[...] (32), data points (2048), and channels (2), respectively. The encoder consisted of three WaveNet blocks, which created individual echo feature maps and representations (after pooling over the 2048 points) in three different levels. The output was split into two branches: the unified representation for metabolite concentrations on the left and individual metabolite FIDs on the right. In addition to the metabolite concentrations, the unified representation also output spectral phase and frequency offset. The gated recurrent units (GRUs) fused the feature representations from different echoes and concatenated the outputs with the previous features to create new higher level feature maps. ⊕ stands for the concatenation operation. Reconstruction of FID signals of individual metabolites was accomplished using the two WaveNet blocks on the right of the architecture that decoded the three-level feature maps.
The concentrations of individual metabolite components, including water in the training data set, varied independently from zero to twice the values specified previously with uniform distributions. The mean concentrations were significantly higher than the real-world in vivo values for the weakly represented metabolites (i.e., those with low in vivo concentrations such as GABA and GSH). This design was intended to make all metabolites contribute similarly to the loss functions. The validation data set, however, was created with the mean metabolite concentrations close to the in vivo values reported in the literature. This design also helped to test the model's generalizability, given that the training and validation data sets had very different concentration distributions. The validation data set was created in a fashion similar to the training data set except that the mean metabolite concentrations were residual water [500 mM], NAA [...]
The numerical simulation of the training data set mimicked the JPRESS sequence implemented on a GE 3T scanner (General Electric Medical Systems, Milwaukee, Wisconsin, USA), which generated 32 echoes with TE starting at 35 ms and increasing with a 6-ms step after each echo. The resonance frequencies of metabolites and water were varied in uniform distributions over a 20-Hz range for water and a 5-Hz range relative to water for metabolites.
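The echo timing and the concentration sampling described above can be sketched in a few lines. This is an illustration rather than the simulation code used in the study: the function names are hypothetical, the random seed is arbitrary, and the 500 mM mean is the residual water value quoted for the validation set, used here only as an example input.

```python
import random

def echo_times(n_echoes=32, first_te_ms=35.0, step_ms=6.0):
    """TE of each echo in ms: starts at 35 ms and increases by 6 ms per echo."""
    return [first_te_ms + step_ms * i for i in range(n_echoes)]

def sample_concentration(mean_mM, rng):
    """Draw a concentration from a uniform distribution on [0, 2 * mean]."""
    return rng.uniform(0.0, 2.0 * mean_mM)

tes = echo_times()
rng = random.Random(0)  # fixed seed so the example is reproducible
draws = [sample_concentration(500.0, rng) for _ in range(10000)]
sample_mean = sum(draws) / len(draws)
```

With these settings the last of the 32 echoes falls at TE = 221 ms, and the sample mean of the draws stays close to the specified mean because the distribution is symmetric about it.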
To incorporate T2 attenuation into the model, individual metabolite T2 values were allowed to vary randomly from 80 ms to 400 ms in both training and validation data sets. The spectral phase was shifted randomly in a range of 0°-360°. To force the model to learn to derive signal intensities from individual echoes and determine the metabolite concentrations using aggregated information embedded in the unified representation, complex lineshapes were created by modulating the exponential decay with high-order polynomial terms to emulate signal decays caused by magnetic field inhomogeneity.
Up to three extraneous peaks with linewidths in a range of 20-50 Hz were randomly added to the data sets to serve as perturbation signals, forcing the model to learn to filter out unregistered signals and/or artifacts. Those extraneous peaks occurred randomly between 0 and 5 ppm with T2 relaxation times between 10 and 50 ms. They were only visible in short-TE echoes, which guided the model to reason about the spectral information and identify those unregistered signals. Finally, white noise of various levels was injected into the data. The training data set contained 100 000 samples, and the validation data set had 2000 samples.
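A short calculation shows why peaks with such short T2 values are visible only in the early echoes. The sketch assumes simple mono-exponential T2 decay of the echo amplitude, exp(-TE/T2), and ignores J-modulation and lineshape effects; the function name `t2_attenuation` is hypothetical.

```python
import math

def t2_attenuation(te_ms, t2_ms):
    """Fraction of signal remaining at echo time TE, assuming mono-exponential T2 decay."""
    return math.exp(-te_ms / t2_ms)

first_te, last_te = 35.0, 221.0  # first and last of 32 echoes with a 6-ms step

# Extraneous peak with the longest T2 used in the simulation (50 ms)
extraneous_first = t2_attenuation(first_te, 50.0)
extraneous_last = t2_attenuation(last_te, 50.0)

# Metabolite with the shortest T2 used in the simulation (80 ms)
metabolite_last = t2_attenuation(last_te, 80.0)
```

Under this assumption, even the slowest-decaying extraneous peak retains roughly half of its amplitude at the first echo but only about 1% at the last echo, whereas the fastest-decaying metabolite still retains about 6% there.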

Model training
The proposed deep learning model was implemented using TensorFlow v2.5 and trained on Google TPU v3. In addition to metabolite concentrations and individual metabolite FIDs, the model also output spectral phase shifts and frequency offsets for both residual water and metabolites. All outputs used L1 loss functions, which were combined with weight coefficients to generate backward gradients as follows:

$$\mathcal{L} = w_{conc}\,\mathcal{L}_{conc} + w_{FID}\,\mathcal{L}_{FID} + w_{phase}\,\mathcal{L}_{phase} + w_{frq}\,\mathcal{L}_{frq}$$

The current study used the combination of 1, 40, 0.2, and 0.2 for the weights of the losses of concentrations, FIDs, phase shifts, and frequency offsets, respectively. Supervised learning of phase shifts and frequency offsets is necessary only if predicting phase shifts and frequency offsets is required. It can be disabled by setting $w_{phase} = 0$ and $w_{frq} = 0$, thereby allowing the model to learn the phase shift and frequency offset of each echo implicitly.
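The weighted combination of the four L1 losses can be sketched as below. This is a plain-Python illustration, not the TensorFlow training code: the dictionary keys, the function names, and the use of a mean over elements inside each L1 term are assumptions, and the toy predictions are made up solely to exercise the arithmetic. Only the weight values 1, 40, 0.2, and 0.2 come from the text.

```python
def l1(pred, target):
    """Mean absolute error between two equally long sequences (mean reduction assumed)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

# Weights quoted in the text for concentrations, FIDs, phase shifts, frequency offsets
LOSS_WEIGHTS = {"conc": 1.0, "fid": 40.0, "phase": 0.2, "frq": 0.2}

def total_loss(preds, targets, weights=LOSS_WEIGHTS):
    """Weighted sum of the per-output L1 losses."""
    return sum(weights[k] * l1(preds[k], targets[k]) for k in weights)

# Made-up toy values, one entry per output branch
preds = {"conc": [10.0, 8.0], "fid": [0.1, 0.0], "phase": [10.0], "frq": [1.0]}
targets = {"conc": [9.0, 8.0], "fid": [0.0, 0.0], "phase": [0.0], "frq": [2.0]}

full = total_loss(preds, targets)

# Disabling supervised phase and frequency learning by zeroing their weights
no_phase_frq = total_loss(preds, targets, {**LOSS_WEIGHTS, "phase": 0.0, "frq": 0.0})
```

Setting the phase and frequency weights to zero removes those terms from the total while leaving the concentration and FID terms unchanged.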
In this study, NAA and NAAG, unphosphorylated Cr and PCr, and PCho and GPCho were combined as tNAA, Cr, and Cho, respectively, and reported as single components. To accommodate the strong residual water signal, the losses attributed to water concentration and FID were scaled by a factor of 0.05 and 0.1, respectively.
The training process ran for 15 epochs and took approximately 11 h using Google TPU v3. The model weight parameters were updated using an Adam optimizer with a one-cycle learning rate schedule: the learning rate was linearly ramped up to 5 × 10⁻⁴ over the first three epochs, held constant for seven epochs, and then exponentially decreased to 10⁻⁸ at the last epoch.
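The one-cycle schedule can be written as a small function of the (fractional) epoch. This sketch makes assumptions the text leaves open: the ramp is taken to start from zero, the exponential decay is taken to span the remaining five epochs and to reach 10⁻⁸ exactly at the end of epoch 15, and the function and constant names are hypothetical.

```python
PEAK_LR = 5e-4
FINAL_LR = 1e-8
WARMUP_EPOCHS = 3.0
HOLD_EPOCHS = 7.0
TOTAL_EPOCHS = 15.0

def learning_rate(epoch):
    """Learning rate at a fractional epoch in [0, TOTAL_EPOCHS]."""
    if epoch < WARMUP_EPOCHS:
        # Linear ramp from 0 (assumed starting value) to the peak
        return PEAK_LR * epoch / WARMUP_EPOCHS
    if epoch < WARMUP_EPOCHS + HOLD_EPOCHS:
        # Constant plateau
        return PEAK_LR
    # Exponential decay from the peak to the final value over the remaining epochs
    decay_epochs = TOTAL_EPOCHS - WARMUP_EPOCHS - HOLD_EPOCHS
    frac = min((epoch - WARMUP_EPOCHS - HOLD_EPOCHS) / decay_epochs, 1.0)
    return PEAK_LR * (FINAL_LR / PEAK_LR) ** frac

schedule = [learning_rate(float(e)) for e in range(16)]
```

In a TensorFlow training loop, a function like this could be evaluated per step (with the step converted to a fractional epoch) and passed to the optimizer; how the original implementation did this is not described in the text.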

In vivo data
In vivo data were acquired from 10 healthy participants (4 males, 6 females; age = 29 ± 7.5 years) using a single-channel head coil and a 3T GE whole-body scanner. Each participant was scanned twice in two different sessions. All procedures performed in this study were approved by the local institutional review board (Protocol No. NCT01266577). Written, informed consent was obtained from all participants. Scan sessions began with a T1-weighted anatomical scan using the vendor-provided 3D spoiled gradient-echo sequence. The MRS sequence for acquiring in vivo data used the same RF pulses and sequence timing as those used to numerically generate the training data sets, with TR = 3 s and sampling bandwidth = 5000 Hz. The total acquisition time for each scan was 9.6 min. The prescribed voxels (2.0 × 2.0 × 4.5 cm³) were dominated by gray matter and located in the anterior cingulate cortex.
An unsuppressed water reference was acquired immediately after spectral data acquisition. Because the first TE was 35 ms, a bi-exponential fit 37 was applied to the 32 unsuppressed water signals to remove the CSF contribution and to extrapolate the tissue water amplitude to TE = 0. The in vivo data were scaled using the amplitude ratio of the simulated unsuppressed water to the in vivo water reference. The average tissue composition over the 20 data sets was 62% ± 6.2% for gray matter, 27% ± 3.3% for white matter, and 11% ± 3.5% for CSF. Put together, the in vivo water-suppressed data were multiplied by a scaling factor in which $f_g$ and $f_w$ are the tissue composition fractions for gray matter and white matter, respectively; $\mathrm{water}_{ref}$ is the unsuppressed water intensity after removal of CSF and extrapolation to TE = 0; and $\mathrm{water}_{sim}$ is the simulated water intensity without water suppression at TE = 0.

Comparison with LCModel
To compare the proposed method with the commonly used short-TE MRS method with spectral fitting by LCModel, computer-synthesized white noise at 200 different levels was injected into an in vivo JPRESS spectrum and a short-TE spectrum (TE = 35 ms) to generate 200 spectra for each. The estimated metabolite concentrations as a function of SNR and the correlations between the metabolites were evaluated. Ideally, the mean estimated metabolite concentrations should be independent of noise levels because noise does not generate biases for an unbiased estimator. The short-TE spectral fitting used the default LCModel setting (v6.3) and basis sets with a baseline stiffness DKNTMN of 0.15.

RESULTS
Figure 2 shows a validation example in which input echoes, predictions, and true concentrations were compared in the frequency domain. Predicted individual components were added up to generate the predicted spectra.
The residuals under the input and prediction spectra illustrate their deviations from the ground truth. Because the ground truth is free of noise and extraneous perturbations, the residual lines revealed the extraneous peaks and noise in the input spectra and the prediction errors. As depicted in Figure 2, despite the large differences between the input and ground-truth spectra, the spectra predicted by the proposed deep learning method without spectral fitting closely matched the ground-truth spectra. Both extraneous peaks and noise were minimized in the predicted spectra. Another example can be found in Figure S1, in which the predicted individual metabolite spectra were compared with the ground truth.
The predicted concentrations of tNAA and the main metabolites of the glutamatergic and GABAergic systems, Glu, Gln, and GABA, for all 2000 validation samples are shown in Figure 3. Note that the validation data were also perturbed by noise and extraneous peaks. The true concentrations were distributed uniformly from 0 to twice the mean value as described in Section 2. Despite the large variations in the input ground truth, including concentrations, phase shifts, frequency offsets, and linewidths in addition to randomly augmented noise and extraneous peaks, the predicted metabolite concentrations were strongly correlated with the ground truth. Note that all phase shifts were corrected using the predicted phase. The Pearson's correlation coefficients were 0.996, 0.990, 0.965, and 0.907, respectively, for the four components shown in Figure 3. The mean absolute error for all components, including water, which measures the converged validation loss, was 0.39 mM. The mean of the predicted metabolite concentrations and the mean absolute errors are given in Table 1. No significant correlations between the prediction errors and the true concentrations were observed, as shown in Figure 3.
The lack of correlation between estimation errors and the ground-truth values is highly desirable, as it indicates that the final representation was created in a linear space and that individual components were separated from each other in subspaces with minimal covariances. Scatter plots of the predicted concentrations of Cr, Cho, mI, GSH, Asp, Tau, and Lac of the validation data set containing 2000 samples are provided in Figure S2.
Figure 4 shows an in vivo example comparing the in vivo input (red) and the deep learning predictions (blue) in the frequency domain. The predicted spectrum for each echo was constructed by adding up the predicted individual component spectra. The differences between the inputs and predictions are displayed in green and consist of prediction errors, noise, artifacts, and any other background signals that were unregistered. As expected, the spectral features of the baseline gradually disappeared as the TE increased. The mean concentrations and SDs over the 20 human data samples are listed in Table 2.
Figure 5 shows both the predicted individual metabolite concentrations and resonance signals and demonstrates that the model learned to distinguish and reconstruct individual component signals from the multi-TE FIDs. For clarity, only the first echo results are displayed. Figure 5A shows the difference between the input spectrum and the predicted spectrum. The predicted individual component spectra are displayed in Figure 5B. In Figure 5A, the difference (green) between the input echo (red) and the predicted spectra (blue) arose from the sum of prediction errors, noise, artifacts including any breakthrough outer-volume signals, and other background signals. Note that the sharp artifact peak at about 3.8 ppm in Figure 5A was effectively filtered out by the proposed deep learning method. Additional examples of in vivo data with strong artifacts can be found in Figure S3. The predicted metabolite concentrations for the data in Figure 5 were tNAA [...]
Figure 6 shows the predicted metabolite concentrations of tNAA, Glu, Gln, and GABA compared with the short-TE LCModel estimates. As shown, the Glu and Gln concentration estimates from the short-TE spectra with the LCModel fitting strongly skewed toward higher concentrations as SNR decreased, as indicated by the linearly fitted lines. Strong spurious correlations between fitted metabolite concentrations and SNR were also found for other weak metabolites, as shown in Figure S4. These spurious correlations cause significant errors in metabolite quantification. In contrast, as shown in Figure 6, the proposed deep learning method for JPRESS minimized the spurious correlations between metabolite concentrations and noise levels.
The correlation between Glu and Gln concentration estimates is displayed in Figure 7. Ideally, there should be no correlations originating from spectral overlap, such that metabolite concentrations measured by MRS can be used as spectrally uncorrelated variables for downstream statistical tests, 38 and no correlations of biological origin were present in the Monte Carlo simulations. However, the short-TE LCModel fitting gave rise to a Pearson's correlation of 0.52 due to the strong overlap between Glu and Gln and the influence of the spectral baseline. In contrast, this correlation was reduced to −0.20 by the proposed deep learning method for JPRESS, which suggested that individual spectral components were more effectively separated in our approach.

F I G U R E 2
Comparison among inputs (red), predictions (green), and ground truths (blue) for a validation sample displayed in the frequency domain. Individual component spectra predicted by the model added up to the total predicted spectra. For clarity, only the first (35 ms), eighth (77 ms), 16th (119 ms), and 24th (161 ms) echoes are shown. All spectral phase shifts were corrected by using the predicted phase. The residual spectra under the predicted and the input spectra are their differences from the ground truth. The extraneous peaks in the first echo of the inputs are clearly revealed by the residuals, as they did not exist in the ground truth. They were filtered out in the predicted spectra as shown by the residual spectra in green color. Note that the extraneous peaks were allowed to occur at any frequencies between 0 and 5 ppm. Noise was reduced substantially across all echoes in the predicted spectra.

F I G U R E 3
Scatter plots of the predicted total N-acetylaspartate (tNAA), glutamate (Glu), glutamine (Gln), and gamma aminobutyric acid (GABA) concentrations for the 2000 validation samples. The true concentrations were uniformly distributed in the range of 0-24 mM, 0-20 mM, 0-10 mM, and 0-4 mM for tNAA, Glu, Gln and GABA, respectively, as described in Section 2. The predicted metabolite concentrations were strongly correlated with the ground-truth values despite the large variations in the input ground-truth concentrations and addition of the extraneous peaks.

T A B L E 1
Predicted and true mean metabolite concentrations (mM) for the simulated validation data set.

F I G U R E 4
An in vivo example of comparing the experimental inputs and the predicted spectra. The predicted individual component spectra, including residual water, were summed to generate the predicted spectra in blue. The difference spectra (green) between the inputs and predictions consisted of prediction errors, noise, artifacts, and all other unregistered signals, showing the decay of the baseline as TE increased. Noise was substantially reduced in the predicted spectra for all echoes.

F I G U R E 5
Predicted in vivo spectrum of the first echo (A) and the individual metabolite components (B). For clarity, the vertical scale of (B) was made 40% greater than the scale of (A). In (A), the difference spectrum (green) between the input echo (red) and the predicted spectra (blue) arose from the sum of prediction errors, noise, and all other unregistered signals including any breakthrough outer-volume signals. Note that the sharp artifact peak at about 3.8 ppm in (A) was effectively filtered out by the proposed deep learning method. Asp, aspartate; Cho, choline; Cr, creatine; GABA, gamma aminobutyric acid; Gln, glutamine; Glu, glutamate; GSH, glutathione; Lac, lactate; mI, myo-inositol; tNAA, total N-acetylaspartate; Tau, taurine.

DISCUSSION
This study proposed a new deep learning method capable of directly mapping raw MRS input to metabolite concentrations end-to-end without using basis spectra or spectral fitting. Unlike conventional spectral fitting that inversely retrieves the metabolite concentrations from the input data in time or frequency domain, the proposed model extracts spectral features from the input data, converts them into a high-level representation, and then linearly maps metabolite concentrations. The model learns essential spectral features through training and filters out noise and unregistered signals. Individual metabolite signals are generated concurrently with the metabolite concentrations. We observed during the model training with the simulated data sets that learning to reconstruct individual metabolite signals significantly improved the accuracy of metabolite concentration predictions. Our ablation experiments also indicated that the extraneous peaks were filtered out only after the model learned the reconstruction of the individual metabolite FIDs instead of the summed FIDs.
Because the commonly used spectral fitting is conducted in the time or frequency domain (Eq. [1]), the parameterized model needs to account for all signals, which is often difficult due to the nonideality of in vivo data and the presence of background signals that cannot be accurately parameterized. In addition, artifacts directly lead to fit residuals/errors. With the proposed deep learning approach, a low-dimensional representation was sought to linearly map the concentrations. As such, the objective loss was generated purely by the errors between the predicted concentrations and ground-truth values, as expressed by Eq. (3).

F I G U R E 6
Comparison between the proposed deep learning method (blue) and short-TE acquisition with LCModel fitting (red). In vivo J-point resolved spectroscopy (JPRESS) data and short-TE (35 ms) data were injected with 200 different noise levels, resulting in 200 JPRESS and short-TE spectra, respectively, with SNRs ranging from 25 to 55. The estimated glutamate (Glu) and glutamine (Gln) concentrations from the short-TE spectra by LCModel fitting strongly skewed toward higher concentrations as SNR decreased, as indicated by the linearly fitted lines. In contrast, the mean predicted concentrations by the proposed deep learning method with JPRESS data minimized the spurious correlations with noise levels.
Instead of learning the background baseline, 20,24 our model was trained to filter out unregistered signals. The submodel of predicting individual metabolite FIDs in Figure 1 can be considered as a classifier in which all signals other than the underlying metabolite FIDs are treated as the background, namely, the negative class. It removes the background by learning the reconstruction of individual component FIDs, because background leak will lead to prediction errors. The quantification of metabolite concentrations involves all 32 echoes with the effect of macromolecules constrained by echoes with medium and long TEs, therefore reducing errors in the predicted metabolite concentrations. The individual-component FIDs predicted from the in vivo data (Figure 5 and Figure S3) also indicate that the baseline influences are minimized. Our strategy of filtering out the background baseline instead of learning it benefits from long echo data, in which the background signals are substantially reduced and therefore can be used by the model to identify the stronger background signals in shorter echo data with potential generalization for quantifying tumor MRS spectra (Figure S5). Overall, the proposed model leveraged the spectral information in JPRESS to map different metabolite components into separate subspaces with a unified representation. As shown in Figure 7, the glutamate-glutamine correlation caused by spectral overlap was significantly reduced.

F I G U R E 7
Correlations between the concentration estimates of glutamate (Glu) and glutamine (Gln) as the noise levels increased. With short TE by LCModel fitting (red), the Pearson's correlation coefficient and p-value were 0.52 and 3.4 × 10⁻¹⁶, respectively. In contrast, with the proposed deep learning method (blue), the corresponding values were −0.20 and 2 × 10⁻³, respectively.
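For readers who wish to repeat this kind of check on their own estimates, the Pearson's correlation coefficient between two series of concentration estimates can be computed as sketched below. The function name `pearson_r` and the toy inputs are hypothetical; the Glu and Gln estimates from the study are not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Toy series: perfectly correlated, perfectly anti-correlated, and uncorrelated
r_pos = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_neg = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
r_zero = pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])
```

Applied to the 200 noise realizations, `xs` and `ys` would hold the Glu and Gln estimates, and a coefficient near zero would indicate that the two estimates are not coupled through spectral overlap.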
Generalizability is an important criterion for the success of a deep learning model and is particularly critical for a model trained with computer-synthesized data like in the current study. We hypothesized that the model trained with simulated data can reliably predict in vivo metabolite concentrations if it can successfully defend against adversarial attacks in the training data set. Mathematically, adversarial training is formulated as a min-max problem, searching for the best solution to the worst-case problem. 39 In the present study, adding extraneous perturbation peaks into the training data was a heuristic approach inspired by the widely adopted practices in vision deep learning such as random in-plane translations, distortions, and scaling. The successful in vivo tests in the presence of spectral backgrounds shown in Figures 4 and 5 and Figure S5 validated our approach to handling adversarial attacks. More augmentations of adversarial attacks can be applied to training data, such as to mimic the eddy current effects or phase distortions.
One of the special designs in the present study is that metabolites with low concentrations in the training data set were upscaled to levels similar to the concentrations of dominant metabolites, and then the more realistic concentrations were used for validation. This design ensured that weak metabolite signals were not outweighed by the strong resonance signals of the more concentrated metabolites during model training. The validation results (Table 1) showed that the upscaled concentrations in the training data set did not lead to an overestimation of metabolite concentrations in the validation data set, which further indicates that the unified representation is a linear layer with a high degree of generalizability. Conventional spectral fitting requires initial guesses for all the fitting parameters, which often introduces biases and leads to a locally optimal solution. It may also need expert tuning for test data that do not conform to the underlying normal distribution with which the default settings for spectral fitting were established.
Without extra calibrations except for the data scaling, in vivo results obtained by the proposed deep learning method demonstrated that the predicted concentrations agreed with those reported in the literature [40][41][42][43][44][45][46][47][48][49] (Table 2). The whole processing pipeline for the absolute quantification of metabolite concentrations was also self-contained and thus can be standardized.
Introducing an accommodative baseline is necessary in spectral fitting with LCModel or similar approaches, because unaccounted background signals can cause large errors in estimation of metabolite concentrations. This artificial baseline, however, interacts with metabolite signals and creates estimation uncertainties. Different handling strategies and/or hyperparameter settings have also been found to give inconsistent results. 50,51 Furthermore, prior knowledge or assumption of metabolite T 2 is still necessary for short-TE spectral fitting, to account for individual metabolite spectral linewidths accurately, although the quantification results are less sensitive to T 2 than long-TE spectral fitting. With the proposed method, both validation and the in vivo test (Figures 2 and 4) showed that the T 2 effect was automatically incorporated into the prediction of metabolite signals for spectra at different echoes. Although the current deep learning model was not trained to output metabolite T 2 values, an extension to include this task will be attempted in the future.
Accessibility of measurement confidence is important for a quantification method. For spectral fitting, the Cramér-Rao lower bounds on estimation variances 52 can be obtained analytically with the assumption that the model is an unbiased estimator. Although the Cramér-Rao lower bounds are unattainable for a deep learning-based method due to the lack of an explicitly expressed model, we suggest that conducting multiple tests on a group of similar data can possibly provide a solution. For the proposed method, the amplitudes of the predicted individual FIDs can be independently changed to a certain degree and then mixed with the background signals filtered out by the model to generate a new test data set. The prediction variances can then be evaluated by comparing the two test results, because the differences between the two data sets are known even though the true concentrations are not. The feasibility of this approach warrants further investigation.
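The perturbation test outlined above might be sketched as follows. This is only an illustration under stated assumptions: `predict` stands for any callable returning concentrations and individual-component FIDs, and `make_toy_predictor` is a hypothetical linear least-squares stand-in used solely to make the example runnable; it is not the deep learning model of this study.

```python
import numpy as np

def perturbation_deviation(predict, data, scales):
    """Rescale predicted component FIDs, re-mix with the background, and re-predict.

    Returns the new concentrations minus the values expected from the known scaling.
    """
    conc0, fids0 = predict(data)                   # first test on the original data
    background = data - fids0.sum(axis=0)          # signals filtered out by the model
    s = np.asarray(scales, dtype=float)
    s_b = s.reshape((-1,) + (1,) * (fids0.ndim - 1))
    perturbed = (s_b * fids0).sum(axis=0) + background
    conc1, _ = predict(perturbed)                  # second test on the new data
    return conc1 - s * conc0

def perturbation_spread(predict, data, rng, n_trials=20, max_change=0.2):
    """Standard deviation of the deviations over random amplitude changes (assumed range)."""
    n = len(predict(data)[0])
    devs = [
        perturbation_deviation(
            predict, data, 1.0 + rng.uniform(-max_change, max_change, size=n)
        )
        for _ in range(n_trials)
    ]
    return np.std(devs, axis=0)

def make_toy_predictor(basis):
    """Hypothetical stand-in model: least-squares projection onto known basis FIDs."""
    def predict(data):
        coef = np.linalg.lstsq(basis.T, data, rcond=None)[0].real
        return coef, coef[:, None] * basis
    return predict

# Usage with three synthetic components and an arbitrary background.
rng = np.random.default_rng(1)
basis = rng.standard_normal((3, 64)) + 1j * rng.standard_normal((3, 64))
predict = make_toy_predictor(basis)
data = np.array([1.0, 2.0, 0.5]) @ basis
data = data + 0.1 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
deviation = perturbation_deviation(predict, data, [1.1, 0.9, 1.2])
spread = perturbation_spread(predict, data, rng, n_trials=5)
```

For the linear stand-in the deviations vanish to numerical precision; for a nonlinear network they would generally be nonzero, and their spread over repeated trials is the quantity proposed here as a confidence measure.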

CONCLUSIONS
The proposed deep learning method predicted individual metabolite concentrations and FIDs concurrently. The rich spectral information carried by the multiple-TE spectral data allowed the model to effectively extract spectral features and create a unified representation for the entire echo train. Collectively, the results show that the proposed deep learning model, which was trained with simulated data and tested with both simulated and in vivo data, can be used to quantify in vivo metabolite concentrations for MRS without using spectral fitting.

SUPPORTING INFORMATION
Additional supporting information may be found in the online version of the article at the publisher's website.
Script S1. TensorFlow implementation of the encoder block.
Figure S1. (A) Comparison among input (red), prediction (green), and ground truth (blue) for the first echo of a validation sample displayed in the frequency domain. Individual component spectra predicted by the model were added to the predicted spectrum. The residual spectra under the predicted and the input spectra are their differences from the ground truth. Note that the phase and lineshape of the residual water peak were different from those of the metabolite signals. The strong residues of the input (red) were dramatically reduced with the model prediction (green).
Figure S2. Scatter plots of the predicted concentrations of total creatine (Cr), total choline (Cho), myo-inositol (mI), glutathione (GSH), aspartate (Asp), taurine (Tau), and lactate (Lac) of the 2000 validation samples. Pearson's correlation r between the predictions and ground-truth values is shown in each panel.
Figure S3. Predicted in vivo spectra of the first echo (A,C) and the individual metabolite components (B,D). In (A) and (C), the difference spectra (green) between the input echoes (red) and the predicted spectra (blue) arose from the sums of prediction errors, noise, artifacts including any breakthrough outer-volume signals, and macromolecules. Note that the strong spike artifacts between 3 and 4 ppm were effectively filtered out.
Figure S4. Comparison between the proposed deep learning method (blue) and short TE acquisition with LCModel fitting (red) for total creatine (Cr), total choline (Cho), myo-inositol (mI), glutathione (GSH), aspartate (Asp), taurine (Tau), and lactate (Lac). In vivo J-point resolved spectroscopy (JPRESS) data and short TE (35 ms) data were injected with 200 different noise levels, resulting in 200 JPRESS and short TE samples, respectively, with SNRs ranging from 25 to 55. With short TE spectra and LCModel fitting, strong skewness in concentration versus SNR was seen for weak metabolites, as shown by the linearly fitted lines. In contrast, the proposed deep learning method for JPRESS suppressed the unwanted dependence on noise levels.
Figure S5. A computer-synthesized signal was injected into a time-domain in vivo data set in the region near 2.0 ppm, and the predictions before and after the injection were compared. (A) The first echoes of the original input data (blue) and the prediction (red) without the synthesized signal. The difference between the two spectra is displayed at the bottom (green). It shows that the model can effectively remove the extraneous signals in the region <2 ppm. Note that the model was not trained to learn where the extraneous signals were located, and the input data were in the time domain. (B) The synthesized signal was injected into the original data in the time domain (for all echoes with T 2 considered). The signal was created with a Lorentz-type lineshape, a linewidth of 30 Hz, a T 2 relaxation time of 20 ms, and a resonance frequency close to 2 ppm. The prediction yielded the residual spectrum with an elevated peak near 2 ppm. (C) The amplitude of the synthesized signal was increased 2-fold, and as a result, the residual peak near 2 ppm was notably higher. (D) The amplitude of the synthesized signal was increased 10-fold; the linewidth was expanded to 50 Hz; and the resonance frequency was shifted to the right by 30 Hz. Again, the model successfully identified the strong unregistered signal and filtered it out, as shown by the residual spectrum. The synthesized peaks can be retrieved by the subtraction of the original residuals (i.e., the residuals in [A] from the corresponding residuals arising from the data with the synthesized signals). (E) The comparisons between the retrieved signals and the true synthesized signals. The mismatches in (B) and/or (C) between the red and blue lines should be an indicator of the extent of errors due to the extraneous signals near 2 ppm.