Accelerated respiratory-resolved 4D-MRI with separable spatio-temporal neural networks

Background: Respiratory-resolved four-dimensional magnetic resonance imaging (4D-MRI) provides essential motion information for accurate radiation treatments of mobile tumors. However, obtaining high-quality 4D-MRI suffers from long acquisition and reconstruction times. Purpose: To develop a deep learning architecture to quickly acquire and reconstruct high-quality 4D-MRI, enabling accurate motion quantification for MRI-guided radiotherapy. Methods: A small convolutional neural network called MODEST is proposed to reconstruct 4D-MRI by performing a spatial and temporal decomposition, omitting the need for 4D convolutions to use all the spatio-temporal information present in 4D-MRI. This network is trained on undersampled 4D-MRI after respiratory binning to reconstruct high-quality 4D-MRI obtained by compressed sensing reconstruction. The network is trained, validated, and tested on 4D-MRI of 28 lung cancer patients acquired with a T1-weighted golden-angle radial stack-of-stars sequence. The 4D-MRI of 18, 5, and 5 patients were used for training, validation, and testing. Network performances are evaluated on image quality measured by the structural similarity index (SSIM) and motion consistency by comparing the position of the lung-liver interface on undersampled 4D-MRI before and after respiratory binning. The network is compared to conventional architectures such as a U-Net, which has 30 times more trainable parameters. Results: MODEST can reconstruct high-quality 4D-MRI with higher image quality than a U-Net, despite a thirty-fold reduction in trainable parameters. High-quality 4D-MRI can be obtained using MODEST in approximately 2.5 minutes, including acquisition, processing, and reconstruction. Conclusion: High-quality accelerated 4D-MRI can be obtained using MODEST, which is particularly interesting for MRI-guided radiotherapy.


I. Introduction
Respiratory motion poses a significant challenge in abdominal and thoracic imaging, causing large displacements in the liver 1 , lung 2 , kidney 3 , and pancreas 4 , introducing disruptive image artifacts that may preclude an accurate diagnosis 5,6 . In radiation therapy, respiratory-induced motion can lead to sub-optimal treatment because it may influence the shape and position of tumors 7,8 . Consequently, the target may receive a different dose than planned, while hazardous radiation is delivered to nearby healthy tissue and organs-at-risk 9 . In the past, respiratory-resolved imaging has been proposed to improve treatments, using imaging with high spatial resolution and accurate motion information to enable the definition of treatment margins that encompass the tumor displacement 10,11 . In particular, four-dimensional respiratory-resolved computed tomography (4D-CT) is the standard imaging modality in current clinical practice and is part of radiation treatment planning 12 . However, 4D-CT can be affected by artifacts that negatively influence the treatment outcome and local control 13,14 .
Recently, magnetic resonance imaging (MRI) has been proposed as an alternative to CT for radiotherapy guidance, leveraging the superior soft-tissue contrast that facilitates accurate target identification and dose deposition. With the clinical introduction of MRI-guided radiotherapy (MRIgRT) 15,16 , MRI acquired prior to treatment can be used to adapt the treatment plan to the daily anatomy, while fast MRI during treatment can be used to track the tumor position [17][18][19][20][21][22] .
In MRIgRT, respiratory-resolved four-dimensional MRI (4D-MRI) is used in the treatment planning phase to adapt the radiation treatment based on the quantified tumor motion 23 . The 4D-MRI must be high-quality and quickly available to ensure treatment efficiency and patient comfort, i.e., acquired and reconstructed within five minutes 24 . However, obtaining high-quality 4D-MRI remains challenging due to the limited acquisition speed of MRI.
A straightforward way to accelerate MRI is to undersample the acquisition. This violates the Shannon-Nyquist data sufficiency criterion 25 and introduces image artifacts that may preclude accurate motion quantification 24 . Several techniques have been proposed to reconstruct high-quality MRI from undersampled acquisitions, such as parallel imaging 26,27 , simultaneous multi-slice acquisitions [28][29][30] , or compressed sensing 31 . Some algorithms have been specifically developed to reconstruct high-quality respiratory-resolved 4D-MRI by taking advantage of all spatio-temporal information in the images, such as XD-GRASP 32 or HDTV-MoCo 33 . However, these reconstruction algorithms have a large computational cost and can take from 15 minutes up to 8 hours 23,33 , which is too slow for clinical practice, as long treatment times are detrimental to patient comfort and treatment efficiency.
With convolutional neural networks (CNNs), the time-consuming model training can be performed offline before treatment.
Then, the trained model can be used for fast, online inference, achieving reconstruction quality on par with or better than compressed sensing within tens of milliseconds for 2D imaging 39 .
Training such models requires large amounts of GPU memory to optimize the model parameters. As GPU memory is limited, training CNN-based reconstruction models is feasible for 2D and 3D MRI but challenging for 4D-MRI as these models require prohibitively costly four-dimensional convolutions to take advantage of the spatio-temporal information and obtain high image and motion quality. Several approaches have been proposed to avoid using 4D convolutions, e.g., by performing slice-by-slice reconstruction or carefully using multiple views of the spatio-temporal data [40][41][42][43] . However, training such models to obtain high-quality 4D-MRI remains challenging due to the computational cost or requirement for large datasets.
We propose an unrolled model to reconstruct 4D-MRI using low-dimensional subnetworks (MODEST), which exploits the spatio-temporal nature of 4D-MRI by separating the reconstruction problem into spatial and temporal components. Two independent subnetworks with few trainable parameters have been designed to learn these components without using 4D convolutional kernels. This allows the model to access the complete spatiotemporal information in 4D-MRI while maintaining low computational cost.
This work investigates the application of the proposed spatio-temporal decomposed network to accelerate the acquisition and reconstruction of undersampled 4D respiratory-resolved lung MRI, which is of particular interest for MRI-guided radiation treatments.
The model is evaluated on reconstructed image quality and consistency of the respiratory motion compared to compressed sensing reconstructions. Moreover, MODEST is compared to standard deep learning architectures such as a U-Net. Finally, we estimate the minimum acquisition length for high-quality 4D-MRI with MODEST.

II. Methods
We considered two networks to reconstruct 4D-MRI: a baseline residual U-Net, and our newly proposed architecture. After patient data was collected and pre-processed, the model hyperparameters were optimized. Then, the U-Net and MODEST were trained. To investigate the impact of the model architecture rather than the number of trainable parameters, the optimized parameters of the U-Net were pruned to match MODEST. The three models (MODEST, the baseline U-Net, and pruned U-Net) were evaluated using undersampled 4D-MRI before and after respiratory binning.

II.A. Patient data collection and preparation
Twenty-eight patients undergoing radiotherapy for lung cancer between February 2019 and February 2020 at the radiotherapy department were retrospectively included under the approval of the local medical ethical committee with protocol number 20-519/C. The male/female ratio was 16/12, and the mean age was 66 ± 13 years (range = 20-81). Patients were scanned in the supine position using a 16-channel anterior and 12-channel posterior phased-array coil. In total, 1312 radial spokes per slice were acquired, corresponding to approximately four times oversampling compared to a fully-sampled volume, which requires 206 · π/2 ≈ 324 spokes. However, as the contrast agent was injected, the relative magnitude of the self-navigation signal changed over time. To account for the contrast pickup phase, we discarded the first 200 spokes of every scan to prevent contrast mixing.
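The spoke counts quoted above can be checked with a few lines of arithmetic. This is only a sketch of the calculation stated in the text; the variable names are illustrative and not taken from the study's code.

```python
import math

matrix_size = 206        # in-plane matrix size quoted in the text
acquired_spokes = 1312   # radial spokes acquired per slice
discarded_spokes = 200   # spokes dropped for the contrast pickup phase

# Radial Nyquist criterion as used in the text: N * pi / 2 spokes.
nyquist_spokes = math.ceil(matrix_size * math.pi / 2)

# Oversampling of the full acquisition relative to a fully-sampled volume.
oversampling = acquired_spokes / nyquist_spokes

# Spokes that remain available for respiratory binning.
remaining_spokes = acquired_spokes - discarded_spokes

print(nyquist_spokes, round(oversampling, 2), remaining_spokes)
```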
For every patient, 4D-MRI was created based on a self-navigation signal by sorting k-space into ten respiratory-correlated bins for a final matrix size of V_x × V_y × n_slice × n_phase = 206 × 206 × 77 × 10. The self-navigation signal was obtained by performing a 1D Fourier transform of the center of k-space (i.e., k_0) along the slice direction and principal component analysis on the concatenated navigators 32,44 . Then, radial spokes were sorted into respiratory bins using a hybrid binning algorithm 45 based on the phase and relative amplitude of the motion surrogate. For training purposes, undersampled 4D-MRI was obtained by undersampling the respiratory bins, i.e., "phase undersampling", ensuring motion consistency between the target reconstruction and undersampled MRI. The fully-sampled 4D-MRI contained n spokes per bin for every patient. Phase-undersampled 4D-MRI was created by retaining the first n/k spokes per bin, where k ∈ ℕ is the acceleration factor, for undersampling factors R_4D = 1, 2, and 4. This corresponded to a true undersampling factor R_Nyquist of approximately 3.7, 7.4, and 14.8 per respiratory phase, respectively. After sorting, k-space was density-compensated using a Ram-Lak filter, interpolated onto a twice-oversampled Cartesian grid using a 3 × 3 Kaiser-Bessel kernel, and transformed to image-space using a non-uniform fast Fourier transform (NUFFT) 46,47 with a weighted coil combination. Coil sensitivity maps were estimated using ESPIRiT 48 . The patients were randomly split into a training (18), validation (5), and test (5) set. The training target was generated by performing an XD-GRASP reconstruction of the fully-sampled 4D-MRI with temporal total variation regularization, using a regularization weight λ = 0.03 32,49 .
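Phase undersampling as described above can be illustrated with a minimal sketch. It assumes that each respiratory bin is represented as a list of spoke indices in acquisition order and that n/k is rounded down; both are assumptions for illustration, and `phase_undersample` is a hypothetical helper rather than the study's implementation.

```python
def phase_undersample(bins, k):
    """Keep the first n // k spokes of every respiratory bin."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [spokes[: len(spokes) // k] for spokes in bins]


# Hypothetical spoke indices: 3 bins of 8 spokes each (not real data).
bins = [list(range(b * 8, (b + 1) * 8)) for b in range(3)]

half = phase_undersample(bins, 2)     # R_4D = 2
quarter = phase_undersample(bins, 4)  # R_4D = 4

# R_Nyquist scales linearly with R_4D from the quoted baseline of about 3.7.
r_nyquist = [round(3.7 * r, 1) for r in (1, 2, 4)]
print(half[0], quarter[1], r_nyquist)
```

Because only the first spokes of each bin are retained, the undersampled data stay within the same respiratory phases as the target, which is the motion-consistency property the text relies on.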
To mimic the effect of a shorter acquisition time, we also created undersampled 4D-MRI by discarding the final j acquired spokes prior to respiratory binning, with j ∈ {100, 200, . . . , 1000}, i.e., "free-breathing undersampling". These reconstructions were used to estimate the maximum achievable undersampling factor in a clinical setting by comparing the motion consistency of the free-breathing undersampled 4D-MRI to the fully-sampled reconstruction. We selected the maximum value of j for which the zero-filled reconstruction had a mean end-point error (EPE) < 1 mm and MODEST had a mean SSIM > 0.85.
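The selection rule for j can be written as a small filter. The sketch below assumes the per-j results are available as tuples of (j, mean EPE of the zero-filled reconstruction in mm, mean SSIM of MODEST); the function name and the numbers in the example are hypothetical and are not results from this study.

```python
def max_removed_spokes(results, epe_limit=1.0, ssim_limit=0.85):
    """Return the largest j that satisfies both criteria, or None."""
    accepted = [
        j
        for j, epe_zero_filled, ssim_modest in results
        if epe_zero_filled < epe_limit and ssim_modest > ssim_limit
    ]
    return max(accepted) if accepted else None


# Made-up values for illustration only.
example_results = [
    (100, 0.40, 0.91),
    (200, 0.62, 0.89),
    (300, 0.95, 0.87),
    (400, 1.30, 0.84),
]
print(max_removed_spokes(example_results))
```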

II.B. Model architectures
We propose MODEST, which uses two subnetworks to learn the spatial and temporal features. We trained a network to reconstruct 4D-MRI on a per slice basis rather than per volume to reduce memory usage, which allowed using 2D convolutions. The model input consisted of the zero-filled undersampled 4D-MRI and deformation vector fields (DVFs) computed on zero-filled, undersampled 4D-MRI, registering the exhale phase to every other respiratory phase. The DVFs were obtained using a deep learning model 50 . They were added as input because we hypothesized that DVFs improve the reconstruction performance by providing additional spatial information along the respiratory phase dimension. To reconstruct a V_x × V_y × n_phase volume, the subnetwork learning the spatial component Ξ was implemented using k × k × 1 convolution kernels, while the subnetwork learning the temporal component Ψ was implemented using 1 × 1 × n_phase convolutions. Both subnetworks used five convolutional layers and a cardioid non-linear activation function 51 .
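To make the decomposition concrete, the following sketch applies one spatial (k × k × 1) filter and one temporal (1 × 1 × n_phase) filter to a small volume and recombines them by point-wise multiplication. It is a deliberately simplified illustration and not the trained network: it assumes real-valued data, a single channel, a single layer per branch, no activation function, zero padding in space, and circular padding over the respiratory phases. As in most deep learning frameworks, the filters are applied as cross-correlations. All function names are hypothetical.

```python
def spatial_conv(x, kernel):
    """k x k x 1 filter: process every phase image separately (zero padding)."""
    r = len(kernel) // 2
    n_phase, n_rows, n_cols = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_phase)]
    for p in range(n_phase):
        for i in range(n_rows):
            for j in range(n_cols):
                acc = 0.0
                for a, row in enumerate(kernel):
                    for b, w in enumerate(row):
                        ii, jj = i + a - r, j + b - r
                        if 0 <= ii < n_rows and 0 <= jj < n_cols:
                            acc += w * x[p][ii][jj]
                out[p][i][j] = acc
    return out


def temporal_conv(x, weights):
    """1 x 1 x n_phase filter: mix the phases of every pixel (circular)."""
    n_phase, n_rows, n_cols = len(x), len(x[0]), len(x[0][0])
    c = len(weights) // 2
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_phase)]
    for p in range(n_phase):
        for q, w in enumerate(weights):
            src = (p + q - c) % n_phase  # wrap around the respiratory cycle
            for i in range(n_rows):
                for j in range(n_cols):
                    out[p][i][j] += w * x[src][i][j]
    return out


def combine(xi, psi):
    """Combination function f: point-wise multiplication."""
    return [
        [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(img_a, img_b)]
        for img_a, img_b in zip(xi, psi)
    ]


# Toy volume indexed as x[phase][row][col]: 3 phases of 2 x 2 pixels.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
box = [[1.0] * 3 for _ in range(3)]  # 3 x 3 spatial filter
mean_over_phases = [1.0 / 3.0] * 3   # temporal filter spanning all phases

estimate = combine(spatial_conv(x, box), temporal_conv(x, mean_over_phases))
```

In this single-channel setting, the two filters together hold k·k + n_phase weights, whereas one joint filter covering the same spatio-temporal extent would hold k·k·n_phase weights.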
The model hyperparameters and architecture were optimized using Bayesian optimization.
Details for this optimization are provided in Supplementary Document 1. An estimate of the 4D-MRI is then obtained as f(Ξ, Ψ), using a combination function f, which was chosen as the point-wise multiplication operator. We implemented the model to perform an unrolled optimization using three iterations. Data consistency was enforced between the reconstructed image and the sampled k-space after every iteration except the final iteration by computing Equation 1, where t is the iteration, x_t is the image at iteration t, and y is the measured, undersampled k-space.
Figure: The spatio-temporal convolution block performs low-dimensional convolution over the spatial domain (blue) and the temporal domain (orange), recombining into a 4D-MRI using a combination function f. After every iteration of the unrolled model, data consistency is enforced on the reconstructed radial k-space using the sampled radial k-space using Equation 1.
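The control flow of the unrolled scheme, three iterations with data consistency after every iteration except the last, can be sketched as follows. The callables `conv_block` and `data_consistency` are hypothetical placeholders: the actual data consistency step is the one given by Equation 1, which is not reproduced here, and reusing the same block in every iteration is an assumption of this sketch.

```python
def unrolled_reconstruction(x0, conv_block, data_consistency, n_iterations=3):
    """Alternate a learned block with data consistency, skipping the last DC."""
    x = x0
    for t in range(n_iterations):
        x = conv_block(x)
        if t < n_iterations - 1:  # no data consistency after the final iteration
            x = data_consistency(x)
    return x


# Record the order of operations with trivial stand-in callables.
calls = []
result = unrolled_reconstruction(
    0,
    conv_block=lambda x: calls.append("block") or x + 1,
    data_consistency=lambda x: calls.append("dc") or x,
)
print(calls, result)
```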
The models were trained using 20,000 randomly-sampled batches of zero-filled 4D-MRI with undersampling factors R_4D = 1, 2, and 4 to minimize the ⊥+SSIM loss 55 .
The statistical significance of the metrics (p < 0.05) was established using a paired t-test, comparing MODEST to the U-Net and the parameter-pruned U-Net.
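As an illustration of the paired t-test, the sketch below computes the test statistic and degrees of freedom for two sets of per-sample metric values. It uses only the Python standard library, so it stops short of the p-value, and the SSIM values in the example are invented for illustration and are not results from this study.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Invented per-patient SSIM values for two methods.
ssim_a = [0.92, 0.90, 0.91, 0.93, 0.89]
ssim_b = [0.88, 0.87, 0.89, 0.90, 0.86]
t_value, dof = paired_t_statistic(ssim_a, ssim_b)
print(round(t_value, 2), dof)
```

The two-sided p-value follows from the Student t distribution with the returned degrees of freedom; a library routine such as `scipy.stats.ttest_rel` computes both the statistic and the p-value directly.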

III. Results
Based on the model architecture and hyperparameter search, we found that adding non-Cartesian data consistency and motion information increased the reconstruction quality, as shown in Figure 2. Using data consistency increased the validation SSIM from 0.88 ± 0.04 to 0.90 ± 0.04 (p = 10^−6), while adding DVFs did not significantly improve the SSIM, either compared to image-only reconstruction or when combined with data consistency. However, using DVFs decreased the mean EPE from 1.23 ± 0.28 mm to 1.18 ± 0.27 mm (p = 0.0008) and the NRMSE from 0.086 ± 0.02 to 0.084 ± 0.18 (p = 0.0009), indicating increased motion consistency. Therefore, we opted to use data consistency and DVFs for MODEST.
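The EPE used as the motion-consistency metric is conventionally the mean Euclidean distance between corresponding displacement vectors of two DVFs. A minimal sketch of that conventional definition follows. It assumes the DVFs are given as flat lists of displacement vectors in millimetres, and the exact evaluation region and implementation used in this study are not specified here.

```python
import math


def mean_epe(dvf_reference, dvf_estimate):
    """Mean Euclidean distance between corresponding displacement vectors."""
    if len(dvf_reference) != len(dvf_estimate) or not dvf_reference:
        raise ValueError("DVFs must be non-empty and equally long")
    total = sum(math.dist(u, v) for u, v in zip(dvf_reference, dvf_estimate))
    return total / len(dvf_reference)


# Two toy 2D displacement fields with two vectors each (values in mm).
reference = [(3.0, 4.0), (0.0, 0.0)]
estimate = [(0.0, 0.0), (0.0, 1.0)]
print(mean_epe(reference, estimate))
```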

III.A. 4D-MRI reconstruction
Phase-undersampled zero-filled reconstructions were created using a NUFFT in approxi-
In the example of phase-undersampled 4D-MRI at R_4D = 1 in the test set (Figure 3), MODEST produced reconstructions with an SSIM of 0.92 over the entire 4D volume, considering XD-GRASP as reference. This is of significantly higher quality than the zero-filled reconstruction, which already shows undersampling artifacts and has an SSIM of 0.82 (p = 0.0017).
Despite having over thirty times fewer trainable parameters, MODEST also produces higher image quality for the considered subject than the U-Net. Compensating for the increase in parameters of the U-Net, the pruned U-Net reconstructs 4D-MRI with low image and low
Retaining fewer spokes for the free-breathing undersampled 4D-MRI decreased model performance due to an increased undersampling factor and increased intra-bin variability of the motion, as presented in Figure 6. The sharpness of the U-Net reconstruction decreased due to temporal blurring as the undersampling factor increased. In contrast, the sharpness of the MODEST reconstruction was more stable. Based on the criterion that the shortest acquisition needed to have an EPE < 1 mm for the zero-filled reconstruction and an SSIM > 0.85 for the MODEST reconstruction, using the first 500 spokes is the shortest free-breathing acquisition that allowed reconstructing high-quality 4D-MRI using MODEST, corresponding to an acquisition time of approximately two minutes.
Figure 4: Quantitative comparison. All reconstruction methods are evaluated on the test set compared to the XD-GRASP reconstruction based on image similarity, measured by the SSIM and NRMSE, and motion similarity, measured by the EPE. All deep learning models perform significantly better than the zero-filled reconstruction, but MODEST outperforms the U-Net models based on image sharpness and NRMSE. A star indicates the t-test resulted in statistically significant differences with p < 0.05.
Figure 5: Hepatic dome analysis. MODEST closely follows the XD-GRASP reconstruction, especially at inhale. At high undersampling factors, MODEST is able to reconstruct motion-consistent 4D-MRI as measured by the hepatic dome, while the other reconstruction methods show significant errors.
Figure 6: Impact of free-breathing undersampling. The impact of free-breathing undersampling was evaluated by continually removing n spokes from the acquisition and comparing the result to the fully-sampled XD-GRASP reconstruction using the SSIM, EPE, and NRMSE metrics. As the error increased significantly beyond removing 600 spokes, the minimum acquisition length was determined as 500 spokes. The approximate acquisition time is shown on top.
An example reconstruction for this acquisition is shown in Figure 7. For this patient, MODEST reconstructed 4D-MRI with a mean SSIM of 0.92 and a mean NRMSE of 0.137, which is of higher quality than the U-Net and pruned U-Net reconstructions. The MODEST reconstruction also shows good motion correspondence, as indicated by the alignment of the hepatic dome position. The quantitative results for the test set are presented in

IV. Discussion
In this work, we have proposed an architecture called MODEST for efficient 4D-MRI reconstruction by splitting the model into spatial and temporal components. We designed a model that exploits all spatio-temporal information of 4D-MRI using only low-dimensional convolution layers. Using this model, high-quality 4D-MRI was obtained from highly undersampled acquisitions in only 25 seconds, outperforming an optimized residual U-Net despite having only 3% of its trainable parameters. We have shown that the model can accurately reconstruct 4D-MRI from acquisitions shortened to approximately two minutes while maintaining high image quality (SSIM of 0.877 ± 0.025) and motion consistency with the fully-sampled 4D-MRI.
These properties have some advantages over other models: models with few trainable parameters are less likely to overfit than larger models and have the potential to generalize better on unseen data due to less parameter variance 60 . Moreover, small models typically require fewer training samples to converge 61 , which is particularly interesting for MRI, as large datasets are difficult to acquire.
Our hyper-parameter optimization and model architecture search found that performing data consistency improved image quality, and adding motion information increased the reconstructed image quality. These findings are in line with previously published literature 62 . However, only adding the DVFs without adding data consistency can be detrimental to the image reconstruction quality. At R_4D < 4, adding DVFs to the images resulted in a lower SSIM, as indicated in Figure 2. However, at R_4D = 4 and in combination with data consistency, increased SSIM, lower EPE, and lower NRMSE were observed when adding DVFs.
This could indicate that adding motion information at higher undersampling helps image reconstruction but provides less benefit at lower undersampling factors. This latter aspect could be due to the better conditioning of the inverse problem at higher sampling factors and due to imperfections in the motion estimation model. Currently, we only present the DVFs to the model as generated by a pre-trained network 50 .
This work used XD-GRASP reconstructed 4D-MRI as a ground truth since it demonstrated sufficient accuracy for radiotherapy applications 23,24,65 . However, this algorithm's regularization over the respiratory phases can introduce errors by overly smoothing the respiratory motion. This could introduce differences in motion amplitude compared to the measured data, and this uncertainty might limit the reconstructed motion quality by deep learning models. Using iterative joint image and motion reconstruction as ground truth could be a viable way to improve image quality 33 and remove residual artifacts in the ground truth.
When comparing to XD-GRASP, we considered a GPU implementation using commodity hardware, which might not be optimal. Technological developments have accelerated the XD-GRASP algorithm with specialized "Processing-in-memory" hardware 66 , curtailing the computational bottleneck for XD-GRASP and enabling a speed-up factor of 11, or 90 seconds of processing time. However, while this is a promising approach, these speed-ups have only been achieved in simulation, and such hardware has not been clinically demonstrated.
The models presented in this manuscript have been trained on data obtained from eighteen patients, which is a limited training set size and could limit the performance of the presented models. Large training sets can offer several advantages, such as better performance and improved generalization capabilities. Several steps can be taken to increase the size of our training set. First, more patient data could be acquired, but this process is slow and costly, resulting in limited extra data. Second, digital phantoms could be used to generate 4D-MRI from numerical anatomy 67 . However, these samples might not be accurate compared to 4D-MRI acquired in-vivo. Future work will investigate the impact of different data augmentation approaches and dataset size.
MODEST is not the only architecture able to reconstruct 3D+t MRI. CINENet, for example, achieves 4D convolutions by interspersing 3D convolutions with 1D convolutions.
CINENet thus uses an approach somewhat similar to ours by decomposing the 4D convolution into lower-dimensional convolution kernels, but we separated the spatial and temporal domains, whereas in CINENet they are interspersed. It is currently unclear whether interspersing or separating the spatial and temporal features would result in better performance, and this may be the subject of future investigations.
MODEST has been specifically constructed to take advantage of the spatio-temporal information in 4D-MRI to obtain high-quality reconstructions. Interestingly, spatial and temporal information from MRI is relevant in other applications, such as cardiac imaging 39,41 or dynamic contrast-enhanced MRI 68,69 . Future work could investigate the application of MODEST, retraining the currently used model for these applications.
The availability of fast, accurate, and high-quality 4D-MRI is of particular interest for MRI-guided radiotherapy, where 4D-MRI is used for treatment adaptation of mobile tumors. With fast acquisition and reconstruction of 4D-MRI, treatment efficiency and patient comfort can be improved, eliminating the acquisition of a 4D-CT for motion quantification.
By treating such patients on a hybrid MRI-Linac, motion can quickly be quantified without repositioning the patient. Moreover, high-quality 4D-MRI can also be used for high-quality time-resolved imaging 65,70 and could be helpful for real-time intra-fraction radiation treatment adaptation 22 .

V. Conclusion
We proposed a deep learning architecture called MODEST that efficiently reconstructs high-quality 4D-MRI by decomposing the reconstruction into spatial and temporal components.
This approach yielded better performance than conventional models such as U-Nets, despite having only 3% of the trainable parameters. We found that high-quality 4D-MRI can be obtained with an MR acquisition of two minutes and 15 seconds of model inference, shortening the time for MRI-guided radiation treatments while improving treatment quality and incorporating accurate motion quantification.

VI. Acknowledgement
This work is part of the research program HTSM with project number 15354, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO) and Philips Healthcare. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro RTX 5000 GPU used for prototyping this research.