University of Birmingham Complementary time-frequency domain networks for dynamic parallel MR image reconstruction

Purpose: To introduce a novel deep learning-based approach for fast and high-quality dynamic multicoil MR reconstruction by learning a complementary time-frequency domain network that exploits spatiotemporal correlations simultaneously from complementary domains


| INTRODUCTION
Magnetic resonance imaging (MRI) is a widely used diagnostic modality which generates images with high spatial and temporal resolution as well as excellent soft tissue contrast. Dynamic MRI is often used to monitor dynamic processes of anatomy such as cardiac motion by acquiring a series of images at a high frame rate. However, the physics of the image acquisition process as well as physiological constraints limit the speed of MRI acquisition, and long scan time also makes it difficult to acquire images of moving structures. Thus, acceleration of the MRI acquisition is crucial to enable these clinical applications.
Parallel imaging (PI) techniques [1][2][3] have been widely used to accelerate MR imaging. They speed up the scan time by sampling only a limited number of phase-encoding steps, and then exploiting the correlations to restore the missing information in the reconstruction phase. Compressed sensing (CS) techniques combined with PI have shown great potential in improving the image reconstruction quality and acquisition speed. [4][5][6][7][8] CS-based methods exploit signal sparsity in some specific transform domain, and recover the original image from undersampled k-space data using nonlinear reconstructions. One effective mean to exploit spatiotemporal redundancies for signal recovery in dynamic MRI is to enforce the sparsity in combined spatial and temporal Fourier (xf) domain against consistency with the acquired undersampled k-space data, and this can be represented by methods such as k-t FOCUSS 4,5 and k-t SPARSE-SENSE. 6 The combinations of CS with low-rank in matrix completion schemes and spatiotemporal partial separability 7,9,10 have also been proposed to exploit correlations between the temporal profiles of the voxels, eg, k-t SLR. 7 Some more recent approaches 11,12 also utilized patch-based regularization frameworks to exploit geometric similarities in the spatiotemporal domain. However, these CS-based approaches often require careful selection of problem-specific regularization schemes and the tuning of hyperparameters is often nontrivial. Furthermore, the reconstruction speed of these methods is often slow due to the iterative nature of the optimization used, and in the context of dynamic imaging, the additional time domain further increases the computational demand.
In contrast, deep learning (DL)-based reconstruction approaches have become extremely popular in recent years and have enabled progress beyond the limitations of traditional CS techniques. [13][14][15][16][17][18] In DL methods, prior information and regularization can be implicitly learnt from the acquired data without having to manually specify them beforehand. Additionally, image quality and reconstruction speed are improved substantially. These advances include applications in both PI [19][20][21][22][23][24][25][26] and dynamic MRI. [27][28][29][30][31][32][33] Most current approaches in DL for accelerated PI are based on exploiting information in a single image either in image domain 19,20,24 or in k-space domain, [34][35][36] where each image (or frame) is reconstructed independently. Examples of these include the variational network (VN) 19 and robust artificial-neural-networks for kspace interpolation (RAKI) etc. In accelerated dynamic MRI, one of the key ingredients is to exploit the temporal redundancies. To this end, 3D convolutional networks (Cascade CNN) 27 and bidirectional convolutional recurrent neural networks (CRNN) 29 have been proposed to exploit the temporal dependencies of dynamic sequences in spatiotemporal image domain. Most of these DL-based approaches so far have focused on either 2D static PI or single-coil dynamic MRI, whereas only a few methods exist for dynamic parallel MRI reconstruction. 32,33,37 Thus more efficient and effective DL models for dynamic parallel MRI are highly desirable.
In this work, inspired by CS-based k-t methods, we formulate the dynamic parallel MR image reconstruction as a multivariable minimization problem considering regularization in both spatiotemporal and temporal frequency domains. We propose a novel end-to-end trainable deep recurrent neural network to model the iterative process resulting from the multivariable minimization. Specifically, the proposed DL approach alternates among four steps: (1) a signal de-aliasing step in combined spatial and temporal frequency domain (xf) via an xf-CRNN; (2) a complementary de-aliasing step in spatiotemporal image domain (xt) with an xt-CRNN; (3) a closed-form point-wise data consistency (DC) step and (4) a closed-form weighted coupling step which are embedded as layers in the deep neural network (DNN). Each of these steps correspond to the iterative algorithm derived from a variable splitting technique (Section 2). As the proposed model exploits spatiotemporal redundancies from Complementary Time-Frequency domains for the effective image reconstruction, we term our model as CTFNet.
The main contributions of our work can be summarized as follows: First, we propose a new regularization method with fast reconstruction speed (2.8 seconds). This could potentially facilitate achieving fast single-breath-hold clinical 2D cardiac cine imaging.
built on recurrent neural networks for data regularization in complementary spatiotemporal and temporal frequency domains to fully exploit data redundancies. Though previous studies 15,38,39 have shown that MR reconstruction can be performed in both k-space and image domains, it is unclear how cross-domain knowledge can be effectively utilized by DNNs in the dynamic setting, with an extra temporal dimension. To the best of our knowledge, this is the first work that investigates how complementary domain knowledge can be exploited in learning-based dynamic reconstruction. Second, we propose a closed-form DC layer that does not require a complex matrix inversion, and operates together with a weighted coupling layer for multicoil images. Compared to other works, 19,20,40 our approach offers an exact update for DC, avoiding the expensive need of solving a linear system via gradient updates. This enables our approach to be computationally more efficient and simpler for implementation. Finally, we demonstrate that our approach is able to further push the undersampling rates with improved image quality against state-of-the-art CS and DL methods, as well as with a good generalization ability to unseen data. This indicates a great potential in achieving fast single-breath-hold 2D cardiac cine imaging.
This work extends our preliminary conference work on single-coil dynamic MRI reconstruction 41 and 2D static parallel MRI reconstruction, 42 where we explored dynamic MRI and static PI separately. In comparison to our previous work, this work presents a novel and unified end-to-end DL solution with a new formulation for dynamic parallel MRI reconstruction, which addresses a more common scenario in the use-case for clinical practice. It proposes a new way of exploiting complementary time-frequency domain information in DL. Significantly more thorough quantitative and qualitative evaluations of the proposed method including comparison, generalization, and ablation studies have been performed on multicoil cardiac MR data with retrospective undersampling.

| Dynamic parallel MRI model
Assume that m ∈ ℂ N is a complex-valued MR image sequence in x-y-t space represented as a vector, and let v i ∈ ℂ M (M ≪ N) denote the undersampled k-space data (in k xk yt space) measured from the ith MR receiver coil. The data acquired from each coil thus can be represented as where F s is the spatial Fourier transform matrix, D is the sampling matrix on a Cartesian grid that zeros out entries that are not acquired, and S i is the ith coil sensitivity map. The reconstruction of m from v i is an ill-posed inverse problem, where i ∈ 1, 2, …, n c and n c denotes the number of receiver coils. Similar to CS formulations 6,43,44 based on the SENSE model, we formulate dynamic parallel MRI reconstruction as the following optimization problem: Here, ℛ xt is defined as a regularization term on the spatiotemporal domain (xy-t space, also denoted as x-t) of the image sequence m, similar to the spatiotemporal total variation in most CS-based approaches. To fully exploit the spatiotemporal correlations, we additionally add a regularization term ℛ xf to regularize the data in the combined spatial and temporal frequency domain (xf space), in which F t denotes the temporal Fourier transform. This leverages the characteristic that the signal can be sparsely represented in the temporal Fourier domain, because of the periodic cardiac motion exhibited in dynamic imaging. Previous works 15,41,43,45 have shown that data regularization in different domains is beneficial due to the complementary information they represent, and thus, here we propose to combine the regularization terms from the complementary time and frequency domains with to balance between ℛ xf and ℛ xt . The last term in Equation (2) enforces the data fidelity in PI, and here we formulate it as a coil-wise DC term, which aims to avoid the need to solve a linear problem inside subsequent subproblem and also makes it simple to be embedded in an end-toend DL framework (see following Optimization). The model weight balances between regularization and data fidelity.

| Optimization
To optimize Equation (2), we propose to employ the variable splitting technique 42,46 to decouple the data fidelity term and regularization terms. Specifically, auxiliary splitting variables u ∈ ℂ N , ∈ ℂ N , and { i ∈ ℂ N } n c i = 1 are introduced here, converting Equation (2) into the following equivalent form: In detail, the introduction of the first constraint m = u decouples m in the regularization term ℛ xt from that in the data fidelity term, and the second constraint F t m = enables the decoupling of ℛ xf from the other terms. The introduction of the third constraint S i m = i is also crucial as it allows decomposition of S i m from DF s S i m in the data fidelity term, which avoids the difficult dense matrix inversion in subsequent calculations (see Equation 6). Using the penalty function method, Equation (3) can be reformulated to minimize the following single cost function: where , , and are penalty weights. To minimize Equation (4) which is a multivariable optimization problem, alternating minimization over m, u, and i is performed, resulting in iteratively solving the following sub-problems: Here, k ∈ {0, 1, 2, …, n it − 1} denotes the kth iteration and m 0 is the zero-filled reconstruction as an initialization. An optimal solution (m * ) can be found by iterating over k + 1 ,u k + 1 , k + 1 i and m k + 1 until convergence or reaching the maximum number of iterations n it . Specifically, Equations (5a) and (5b) are the proximal operators of the combined temporal Fourier and spatial domain prior ℛ xf and the spatiotemporal image domain prior ℛ xt , respectively. Equation (5c) is a coil-wise data consistency step in PI (pDC), which imposes the consistency between the acquired k-space measurements and the reconstructed data. A closed-form solution for Equation (5c) can be derived as: in which F H s is the conjugate transpose of F s and I is the identity matrix. Similarly, by optimizing Equation (5d), we obtain the following solution: where S H i is the conjugate transpose of S i . This can be regarded as a weighted coupling (wCP) of the results obtained from Equations (5a)-(5c). In particular, it can be seen that both Equations (6) and (7) are closed-form solutions and can be computed in a point-wise manner due to the inversion of diagonal matrices. This avoids iterative gradient updates and thus enables fast reconstruction speed in comparison to conjugate gradient-based approaches. 20,32,46

| CTFNet for dynamic parallel MRI reconstruction
Based on the model formulation in Equations (5a)-(5d), we propose to embed the iterative reconstruction process into a DL framework to further improve the reconstruction quality with faster reconstruction speed and higher acceleration rates. Specifically, we propose a complementary time-frequency domain network (CTFNet) for the dynamic parallel MRI reconstruction to exploit the spatiotemporal correlations in complementary spatiotemporal and temporal frequency domains. Our model consists of four core components: (1) an xf-CRNN to implicitly learn the regularization from the training data itself and perform the iterative de-aliasing in x-f domain, corresponding to Equation (5a); (2) an xt-CRNN similarly as the learning-based proximal operator in the spatiotemporal image domain, corresponding to Equation (5b); (3) a pDC layer that performs coil-wise DC in PI (Equation 5c); and (4) a wCP layer that is naturally derived from Equation (5d) and performs the weighted coupling. An illustrative diagram of the proposed model is shown in Figure 1. Note that the iterative reconstruction process as stated in Equation (5) is modeled via the convolutional recurrent neural networks (CRNN) with recurrence over iterations. Details of each component of our network is explained hereafter.

| xf-CRNN
Corresponding to Equation (5a), we first propose to exploit the spatiotemporal correlations in the combined temporal Fourier and spatial domain. Instead of explicitly imposing the regularization term on the data such as in conventional CSbased methods, here we propose to implicitly learn the regularization from the training data itself by leveraging DNNs in the x-f domain. Specifically, motivated by some of the CSbased k-t methods such as k-t FOCUSS, 4,5 where its solution to the underdetermined inverse problem can be expressed as the form that consists of a baseline signal together with its residual encoding ( k − ) for the k + 1th estimate of the x-f signal k + 1 , we propose to formulate our x-f domain reconstruction as Particularly, in our formulation of Equation (8), different from model-based 47 or compressed sensing 4,6 algorithms, we employ a stack of convolutional layers to estimate the missing data based on other available points, typically within its vicinity in x-f space. To fully exploit the spatiotemporal redundancies, we use the temporal average of a sequence as the x-f baseline signal , and thus xf-CRNN learns to reconstruct residuals of the temporal frequencies with respect to the temporal average (direct current) values. This makes the residual energy much sparser and enables the network to focus more on the dynamic patterns of the signals with less efforts in reconstructing static background regions. In contrast to k-t FOCUSS implementation where sparsity was exploited for each coil separately, the proposed approach exploits the joint information in the multisignal ensemble that represents the combination from all coils. This has been shown to be effective in reducing the number of required samples per coil and providing increased acceleration capability. 6 Furthermore, different from our previous work in, 41 we propose to model the iterative reconstruction process in x-f domain with the recurrent neural network (CRNN-i 29 ) where recurrence is evolving over iterations via hidden-to-hidden connections and the trainable network parameters are shared across sequential iteration steps.
The illustrative diagram of x-f reconstruction is shown in Figure 2. Specifically, we formulate the k-t to x-f transformation process in PI as an x-f transform layer in the network. The x-f transform layer receives input from multicoil k-t space data, and then transform it to x-f space as inputs to xf-CRNN. Details of the process are illustrated and explained in Figure 2. Note that the value range of the direct current component of the undersampled data in x-f space is lower than that of the temporal average, therefore after subtraction, the direct current component still remains but with a different value range from the temporal average. This also means that the subtracted data in image space can look similar to the temporal average but with a lower intensity range and aliasing artefacts. After the signal de-aliasing in x-f domain, another inverse Fourier transform along f is adopted to transform the estimated x-f signal k + 1 back to dynamic image space for the subsequent weighted coupling with other predictions, as shown in Figure 1.

| xt-CRNN
Corresponding to the formulation in Equation (5b), we additionally propose to learn a regularizer in the spatiotemporal F I G U R E 1 An illustrative diagram of the proposed CTFNet at a single iteration. Each component is corresponding to each subequation in Equation (11), respectively. A, Network architecture for xf-CRNN; B, network architecture for xt-CRNN; C, PI data consistency (pDC) layer; D, weighted coupling (wCP) layer. Numbers inside CNN, CRNN-i, and BCRNN layers indicate {kernel size, dilation factor, and number of filters}. Note that features learned at each iteration is propagated along iteration steps via the hidden-to-hidden connections in CRNN and BCRNN units. For mathematical notations, please refer to Equation (11) image domain complementary to the combined spatial and temporal frequency domain. Specifically, to effectively exploit the spatiotemporal redundancies in x-y-t space, we adopt a variation of our previous CRNN-MRI 29 network for image space de-aliasing which has been shown to be an effective technique in dynamic MRI reconstruction, termed as xt-CRNN. In detail, bidirectional CRNN layers 29 with recurrence evolving over both temporal and iteration dimensions via hidden-to-hidden connections are employed. This allows us to embed the iterative reconstruction process in a learning setting as well as to propagate information along temporal axis bidirectionally. Similar to the x-f space reconstruction, the proposed xt-CRNN also learns to reconstruct the combined data from all coils, and learns the residuals of the temporal average baseline m (Equation 12) in spatiotemporal domain with m k − m as input to the network. This can require fewer k-t samples for residual encoding and similarly enables the xt-CRNN to focus more on the dynamics of the reconstruction. The x-t domain and x-f domain reconstructions are complementary, which further enables the network to maximally explore cross-domain knowledge for the signal recovery.

F I G U R E 2
The x-f transform and reconstruction diagram for a single iteration in the combined spatial and temporal frequency space. In detail, the x-f transform layer receives input from multicoil k-t space data. The acquired multicoil k-space data is first averaged along t to yield a temporal average for each coil separately. At iteration k, the temporally averaged data is subtracted from corresponding coil data at each time frame, and the subtracted data and temporally averaged data from multicoils are then inverse Fourier transformed and sensitivity-combined back to image space. This yields a sequence of aliased images and a temporally averaged sequence (Equation 12). Each frequency-encoding position of the coilcombined images is then processed separately hereafter. The image rows from aliased images or baseline images are gathered and temporal Fourier transformed along t to yield an x-f image, corresponding to k − and respectively. These signals are then fed as inputs to xf-CRNN for x-f space reconstruction (Equations 8 and 11a) | 7 QIN et al.

| Data consistency layer
As discussed in Section 2, Equations (5c) and 6 give a closedform solution with no dense matrix inversion, so that we can exactly embed it as a PI data consistency (pDC) layer in the DNN. To make it concise, we reformulate Equation (6) as: where i ∈ 1, 2, …, n c and 0 = ∕( + ). The DC in PI is performed coil-wise and point-wise, which makes it simple and appealing for implementation in DNNs. Here 0 is a hyperparameter that allows the adjustment of data fidelity based on the noise level of the acquired measurements.

| Weighted coupling layer
Similarly, Equation (5d) can be formulated as a weighted coupling (wCP) layer in DNNs given estimations from Equations (5a)-(5c), as represented in the closed-form solution Equation (7). The coil sensitivity maps can be normalized to one along coil dimension, and thus we can simplify Equation (7) as in which 0 = + + and 0 = + + control the weighted coupling of predictions from x-t domain and x-f domain respectively.

| CTFNet
Based on the proposed four modules, our CTFNet can thus be compactly represented as follows: Here m denotes the temporally averaged sensitivity-combined image of a sequence that is used as the baseline signal, and it can be mathematically expressed as in which max operation is performed element-wise, ∑ t indicates summation along the temporal dimension, and [⋅] T represents the repetition operation along the temporal dimension for T times (the number of frames in a sequence). Given the proposed framework, our CTFNet can iteratively learn to reconstruct the true images from both spatiotemporal and temporal frequency spaces, so that the spatiotemporal redundancies can be jointly exploited from complementary domains for better reconstructions.

| Network architecture and learning
The detailed network architecture of the proposed CTFNet is shown in Figure 1. The CTFNet architecture consists of four components, where each component is corresponding to each subequation in Equation (11), respectively. In detail, xf-CRNN is composed of 4 layers of CRNN-i and 1 layer of 2D CNN with a residual connection from the baseline estimate. The 2D convolutions are applied over the x and f dimensions, and the y dimension is viewed as the batch dimension. For the xt-CRNN model, a variation of architecture 29 is employed which consists of 4 layers of BCRNN evolving over both temporal and iteration dimensions, 1 layer of 2D CNN and a residual connection. Here the 2D convolutions are applied on the spatial dimensions with recurrence over time dimension in BCRNN, whereas in 2D CNN layer the time dimension is viewed as the batch dimension. We used dilated 2D convolutions with kernel size 3 × 3 and dilation factor 3 × 3 to increase of the receptive field sizes. The number of input and output channels of the network was 2, representing the real and imaginary part of the complex-valued data.
Given the training set Ω with undersampled data m 0 as input and fully sampled data as target, the network is trained end-to-end by minimizing the pixel-wise L1 norm between the reconstructed data and the sensitivity-weighted ground truth data m gt : where m n it denotes the predicted image at iteration n it , that is, the final output of the proposed network, is the set of network parameters, and n Ω is the number of training samples. In our setting, we have the iteration step n it set to 5. Specifically, all the components in CTFNet including the pDC and wCP layers are needed for training and involved in gradient backpropagation, where the backward pass of the pDC operation can be similarly derived as in Ref. [27] and backward pass for wCP operation with respect to input layers are merely their corresponding coefficients due to the weighted coupling operation. For stability of training, values of 0 , 0 and 0 were all set to 0.1 based on our preliminary works. 25,42 For training details, gradients were hard-clipped to [−5, 5] to avoid the gradient explosion problem in training recurrent neural networks. ADAM optimizer was employed with a learning rate of 10 − 4 . The intensities of input images were normalized by the maximum value of their corresponding undersampled temporal average frame. During training, we extracted training patches along the frequency-encoding direction and used the entire sequence of the data. Networks for different undersampling factors were first trained jointly and then finetuned separately. A plateau of the performance can be observed with 10 5 backpropagations. Patch extraction and data augmentation were performed on-the-fly on the individual coil images, with random rotation and scaling, and the minibatch size during the training was set to 1.

| Data
We used two datasets for the experimental evaluations. The first dataset (Dataset A) includes 38 sets of complex-valued multislice short-axis cardiac MRI scans acquired on a 1.5T Siemens scanner. 2D bSSFP cine acquisition with retrospective gating and 2 × GRAPPA acceleration was performed for 14 healthy subjects and 24 patients with suspected cardiovascular diseases for left ventricular coverage. In the patient population, we encountered myocarditis, arrhythmogenic right and left ventricular cardiomyopathy, restrictive cardiomyopathy, dilated cardiomyopathy, hypertensive cardiomyopathy, non-ischemic cardiomyopathy, embolic myocardial infarction, and eosinophilic granulomatosis with polyangiitis (EGPA) with cardiac involvement. 37 The data were acquired with Cartesian sampling and with acquisition parameters including in-plane resolution of 1.9 × 1.9 mm, slice thickness of 8 mm, number of phase-encoding lines of 156, repetition time (TR) of 2.12 ms, echo time (TE) of 1.06 ms and temporal resolution of around 40 ms. Images were reconstructed from the 2 × acceleration to a fully sampled k-space by GRAPPA. Written informed consent was obtained from all subjects and the study was approved by the local ethics committee (healthy subjects: London Bridge Research Ethics Committee, patients: North of Scotland Research Ethics Committee). In experiments, 6 slices from each subject that cover the dynamic anatomy were extracted, resulting in a total number of 228 slices for the experiments. Each acquisition in this cohort consists of 25 frames with 30/34/38-channel multicoil data. The second dataset (Dataset B) used in our experiments consists of 10 fully sampled complex-valued short-axis cardiac cine MRI acquired on a 1.5 T Philips scanner. The data were acquired on healthy volunteers following written informed consent under approved research ethics (08/H0711/82). Each scan contains a single-slice SSFP acquisition with 30 temporal frames and 32-channel multicoil raw data. The acquisition has an in-plane resolution of 1.7 × 1.7 mm, slice thickness of 10mm, 190 phase-encoding lines, TE = 1.66 ms, TR=3.32 ms and an average temporal resolution of 33.70 ms.
A variable density incoherent spatiotemporal acquisition (VISTA) sampling scheme 48 was employed to undersample the k-space data in our experiments, which has been shown to be an effective Cartesian sampling strategy for dynamic data. The scheme is based on a constrained minimization of Riesz energy on a spatiotemporal grid. It allows uniform coverage of the acquisition domain with regular gaps between samples and guarantees a fully-sampled, time-averaged kspace to facilitate GRAPPA or ESPIRiT kernel estimation. In experiments, we undersampled the data at acceleration rates of 8, 16, and 24, and examples of them are shown in Supporting Information Figure S1. Coil sensitivity maps were pre-computed from the fully-sampled, time-averaged k-space center with the ESPIRiT algorithm 49 by using the BART toolbox. 50

| Experiments
We first performed the comparison study where we compared our CTFNet against other competing approaches on Dataset A with mixed healthy subjects and patients for reconstructions from undersampling rates of 8, 16, and 24.
Here the models were trained on Dataset A with a 2-fold cross-validation, where each fold contained 7 healthy subjects and 12 patients with six slices for each subject. In the second step, we explored the generalization potential of the proposed method. Specifically, we first investigated the robustness of the models when applied to data that were acquired with different scanners and acquisition settings from the training data. We employed models trained on Dataset A and directly tested them on Dataset B. Dataset B differs from Dataset A on the aspects of scanners, acquisition parameters, temporal resolutions, number of acquisition coils and sampling matrix size. In addition, we further investigated the generalization performance of the proposed method from healthy subjects to patients that were not represented in the training set. In detail, we trained another model with only healthy subjects (14 subjects, 84 slices), and directly tested it on patients in Dataset A. To better understand the proposed method and its performance, an ablation study was also conducted on both datasets to gain more insights on the effects of regularization in different domains for the dynamic parallel reconstruction problem. Specifically, we investigated and compared between the single domain regularization (ℛ xt or ℛ xf ) and the complementary time-frequency domain regularization. Lastly, a clinical evaluation of the proposed method was performed to investigate the clinical utility of the reconstructed data. In this regard, we measured the left ventricle end-diastolic volume (LVEDV), left ventricle endsystolic volume (LVESV) and left ventricle ejection fraction (LVEF) from the reference standard and the reconstructed accelerated data respectively. The segmentation masks were extracted by using a state-of-the-art DL-based segmentation algorithm, 51,52 where the model was previously trained on a different set of cardiac MR data without acceleration.

| Evaluation method
We compared our proposed approach (CTFNet) with representative MR reconstruction methods, including state-of-theart CS and low-rank-based method k-t SLR, 7 and two variants of DL methods, dynamic VN, 33 and Cascade CNN, 24,27 which have been substantially enhanced to adapt to dynamic parallel image reconstruction. Dynamic VN 33 learns the complex spatiotemporal convolutions in contrast to the original VN, 19 and for strong comparisons with our method, we propose to improve it by incorporating the temporal average baseline as an initialization. Similarly, as to Cascade CNN with the D-POCSENSE framework 24 originally designed for static PI, we also refined it to learn the residual of the temporal average, and adjusted it with the same convolutional recurrent architecture as CTFNet to equip it with the ability to exploit spatiotemporal correlations. Thus we term it as CascadeCRNN. The network architecture for CascadeCRNN was the same as xt-CRNN and the number of iteration steps n it was set to 5 for all DL methods. k-t SLR formulation has also been extended to be used with multicoil data based on SENSE model in contrast to its original implementation. 7 Quantitative results were evaluated in terms of normalized mean-squared-error (NMSE) and peak-to-noise-ratio (PSNR) on complex-valued images, as well as structural similarity index (SSIM) and high-frequency error norm (HFEN) on magnitude images. These metrics were made to evaluate the reconstruction results with complimentary emphasis. All quantitative results were computed only around cropped dynamic regions for better evaluation. Lower NMSE/HFEN and higher PSNR/SSIM indicate better results.
Statistical tests were performed for results of each metric to ensure that the differences between methods were significant. For multimodel comparisons, we first performed a Friedman test 53 to see if there was a significant difference in the population statistics (metric results). Then if the null hypothesis of the Friedman Test was rejected, we performed one-versus-all one-way Wilcoxon signed-rank test 54 with Bonferroni correction to find out if the results from our model significantly outperformed the others.

| Implementation details
The CTFNet approach as well as the compared DL methods were all implemented in PyTorch, and trained with the setting as described in Section 3.1. Experiments were performed on a 12 GB Nvidia Titan Xp Graphics Processing Unit (GPU). For k-t SLR, we used the Matlab implementation provided by Ref. [7] with an extension to multicoil data. Experiments were conducted on a 16 GB RAM, 3.60 GHz Central Processing Unit (CPU).

| Comparison study
Quantitative comparison results of different methods on dynamic multicoil cardiac data with various high acceleration rates (8 ×, 16× and 24×) are presented in Table 1. The results reported were on the entire 228 2D+t slices on Dataset A. It can be seen that our proposed CTFNet outperforms k-t SLR by a large margin in terms of all these measures at different undersampling rates. It also offers a much faster (∼1000×) reconstruction speed with 2.8 seconds for the entire sequence of one slice compared with k-t SLR with 2444.8 seconds for the same reconstruction. In comparison to other DL-based methods which have been carefully enhanced to incorporate temporal information, our proposed approach can still achieve better performance on all acceleration rates, with an improvement of around 1 dB PSNR and 1.5% SSIM increase over the most competing method (CascadeCRNN). The performance gap of the improvement is also increasing as acceleration rate increases. All results were statistically significant with p ≪ 10 − 5 . Additionally, we also compared the qualitative results on 16× and 24× undersampled data (equivalent scan time: 15 and 10 seconds respectively within a single-breath-hold) in Figure 3 and Supporting Information Figure S2, which shows the reconstructed images along both spatial and temporal dimensions as well as their corresponding error maps on a patient and a healthy subject. Compared to other competing methods, it can be observed that our proposed model can faithfully recover the images with smaller errors especially around dynamic regions, and can also produce sharper reconstructions along temporal profiles.

| Generalization study
In this study, we explored the generalization potential of the proposed method. The generalization test results of different DL models from Dataset A to Dataset B are presented. Particularly, to better understand how the methods perform on the key frames, here we show their comparisons on the end-systolic (ES) frame (Table 2) and end-diastolic (ED) frame, respectively (Supporting Information Table S1). It can be seen that the proposed method achieves high performance on the unseen test dataset and also consistently outperforms against other competing methods, indicating its capability in effectively learning the inverse dynamic reconstruction problem. All results on both ES and ED frames were also statistically significant with P < . 005. Besides, we also visualized the generalization results of Dataset B under different acceleration rates, as presented in Figures 4 and Supporting Information Figure S3. It can be observed that our approach can recover the fine details and the temporal traces of the image very well on data from unseen domain even with extreme undersampling rate (24 ×), though it is anticipated that the reconstruction gets more challenging as acceleration rate increases.
In addition, the generalization results from healthy subjects to patients were compared with models trained with mixed healthy subjects and patients (19 subjects, 114 slices), as shown in Table 3. Although the pathological conditions were not included in the training data, the generalization results from healthy data to patients were very competitive to the mixed training models with an average of only 0.2dB PSNR and 0.2% SSIM drop of performance. This can also be observed from the qualitative comparison as shown in Figure  5, where only subtle differences can be detected from these two training settings.

| Ablation study
The ablation study compared results from the spatiotemporal image space reconstruction (Proposed (ℛ xt )), the combined temporal Fourier and spatial space reconstruction (Proposed (ℛ xf )) as well as the complementary time-frequency domain reconstruction (Proposed (ℛ xf + ℛ xt )). All these ablated approaches with varying domain regularizations were conducted under the same variable splitting framework as in Section 2, where for the single domain reconstruction, only the corresponding domain network was used. The quantitative comparison results of the ablation study are shown in Table 4, where reconstruction models were trained on data with R = 8 from datasets A and B, respectively. A qualitative result is also given in Figure 6 on data with R = 16.

| Clinical evaluation
Our experiments indicate a good reconstruction quality of the proposed method with respect to the reference standard, however, the question of the clinical validity of such reconstructions remains open. To understand the translational utility, we performed the left ventricular function assessment (LVEDV, LVESV, and LVEF) on our reconstructed data in Dataset A. Note that the DL-based segmentation algorithm was trained on another group of cardiac MR data of healthy subjects without acceleration (UK Biobank data 55 ) and DLbased approaches are also known to be sensitive to variations of image distributions and perturbations, so there were some inevitable failure cases on our test data. Therefore, we performed manual quality control of the segmentation in our cases, where we have excluded severe failure cases from the cohort. The resulting dataset includes 12 healthy subjects and 13 patients, and we show their comparisons with the reference standard in Bland-Altman plots in Supporting Information Figure S4 for all acceleration rates. It can be observed that our reconstructions have achieved reasonably good results on all three measures, and on average a bias for LVEDV, LVESV, and LVEF of −0.04 ml, 1.15 ml, and −0.97% was observed with all observations lying inside the 96% confidence interval of ± 2.21%, ± 2.53, and ± 2.36% on 8× accelerated data. For 16× and 24× accelerated data, an average bias of −1.81 ml, 1.98 ml, and −2.32% for LVEDV, LVESV, and LVEF was observed with a slightly higher variance on 24× accelerated data than on 16× accelerated data. Both the F I G U R E 3 Qualitative comparison results of different methods on spatial and temporal dimensions with their error maps. Results are shown for undersampling rates 16 × of a patient (top) and 24 × of a healthy subject (bottom) on Dataset A at systolic frames. The scan time for these two acquisitions are 15 and 10 s within a single-breath-hold respectively. The proposed method can well recover the fine details and preserve the temporal traces, although this gets more challenging on aggressively undersampled data. An example of the results at diastolic frames is shown in Supporting Information Figure S2. A dynamic video is shown in Supporting Information Video S1 for better visualization biases and variances of these measurements are within an acceptable range, indicating that our reconstruction results can potentially have the clinical benefit.

| DISCUSSION
In this work, we have demonstrated that the proposed method is capable of recovering high-quality images from highly undersampled dynamic multicoil data. Different from existing DL-based approaches, we incorporated the combined spatial and temporal frequency domain regularization into the formulation of the dynamic parallel MRI reconstruction problem and exploited spatiotemporal redundancies from both x-t and x-f spaces with DNNs. Compared with spatiotemporal image (xt) domain reconstruction (Proposed (ℛ xt ), Table 4), the proposed x-f space reconstruction (Proposed (ℛ xf )) has shown to be more effective in exploiting the spatiotemporal correlations, with higher reconstruction accuracy and a smaller number of network parameters (Table 4). This is mainly due to the inherent nature of the periodic dynamic cardiac MRI data itself, where strong correlations exist in k-space and time and signal in temporal Fourier space is sparse. This has been represented in many traditional CS-based methods, and here our results have demonstrated that the learned implicit DNN-prior in the temporal Fourier domain can further increase the acceleration capability and achieve even better performance. In addition, combination of time-frequency cross-domain knowledge (Proposed (ℛ xf + ℛ xt ), Table 4 and Figure 6) further enhances the reconstruction capability of the proposed method with better reconstruction quality. In comparison to three fully independent constraints which regularize the spatial, temporal, and temporal-frequency dimensions independently, our approach offers the ability to efficiently exploit correlations in the joint spatial and temporal/temporal-frequency space, which enables the estimation of missing data to be based on other available points within its spatial vicinity at neighboring time points or temporal-frequency. Additionally, the use of two constraints is also more efficient than three independent constraints, as adding one extra constraint will lead to another subproblem for minimization, which thereby will increase complexity in network architecture design. The improved performance of CTFNet over other competing methods indicates that learning jointly from both spatiotemporal and temporal frequency domains can capture complementary useful information that can be effectively utilized by the proposed framework. Furthermore, the proposed CTFNet builds on a multivariable minimization problem and embeds it into an efficient DL framework. The employed variable splitting technique effectively decouples data regularization terms on various domains from the data fidelity term, which enables the natural derivation of pDC layer and wCP layer in PI with closedform point-wise solutions. Though the derived pDC layer shares similar form as the one proposed in D-POCSENSE 24 which is a simple extension from single-coil application, 27 our solution (pDC with wCP layers) for the multicoil setting has the mathematical support based on variable splitting and alternating minimization, and thus reasons the particular formulation and structure of our network. In contrast to 20,32 where data fidelity step is solved via conjugate gradient algorithm due to the difficult matrix inversion in their DC terms, our CTFNet offers a much simpler and more efficient solution with exact steps and avoids iterative gradient updates, allowing for faster reconstruction speed and easier embedding into DNNs. Besides, our approach also offers the flexibility of incorporating additional regularization terms in the framework, whereas this will not be very straightforward for the other approaches.
It is worth mentioning that with the recurrent design of the architecture, it is natural that the parameters for each iteration are shared, whereas this is not the case for other approaches such as VNs. 19 Similar to other DL-based unrolled network architecture, our proposed approach is a learning-based method which is designed to achieve optimal performance with the pre-set fixed number of iterations during the training, though the recurrent design enables it to be employed for arbitrary number of iterations at inference time. In our work, we determined heuristically that 5 iterations represents a compromise between GPU memory requirements and performance. Therefore, we fixed n it = 5 to train our network, and thus it is expected that the optimal inference performance also comes from n it = 5. If n it is increased correspondingly at training stage, we can expect that the performance of the network will be further improved and converge after a few iterations. However, due to the multicoil setting and the limitation of GPU memory, we were not able to train the model with more iterations. More efficient training schemes will be investigated in the future to improve the performance of the proposed method. In addition, with respect to the hyperparameters, our DL-based method appears to be more robust at test time while CS-based approaches may require case-level tuning for the optimal results.
In our work, we also propose to learn the residual of a temporally averaged frame, which is not necessarily required to be a fully-sampled image. This can be seen from our undersampling masks of R = 24 (Supporting Information Figure S1), where the averaged k-space is not fully-sampled but good reconstruction results can still be achieved. Besides, The proposed method can well reconstruct the images with good preservation of temporal trace on various undersampling rates. Though reconstruction is more challenging as R increases, the reconstructed results can still be useful. A dynamic video is shown in Supporting Information Video S2 for better visualization VISTA sampling pattern was designed in a way to spread the sampled lines out as much as possible to high frequencies, although it also appears to omit some of the highest k-space locations which is a common phenomenon for high accelerations of undersampling strategies. VISTA has been shown to be able to generate more consistent results with superior noise-sharpness trade-off compared to other commonly employed sampling patterns, 48 and thus it is also expected to benefit the DL-based reconstruction. The fundamental acceleration limit for which the undersampling does not affect the effective reconstructed resolution was not investigated in Ref. [48] and this is also not the focus in our work. Nevertheless, the proposed approach is a learning-based method, that is, the network is able to reconstruct the missing high frequency samples from the trained information and by that recover reasonably good reconstructions. Additionally, the impact of motion would also be crucial for understanding the robustness of the model. Theoretically, since the model allows high acceleration rates, it is likely that the model can accommodate even if some data are rejected or omitted. However, it would require comprehensive evaluation to study various degrees and types of motion and artifacts, and this will be an important direction of future work.
Moreover, the proposed method can generalize well to unseen cardiac MR data with different acquisition parameters and with pathology that were not seen in the training set. The method can achieve satisfactory performance on these scenarios even with highly aggressive undersampling strategies, which indicates that the proposed method is robust to unseen and unusual image features or temporal behaviors present in our currently used dataset. In addition, our clinical evaluation of the reconstructed images shows acceptable biases and variances on the left ventricular function assessment, indicating the great potential to deploy DL models for clinical practice. Also to note that the differences between the reference and reconstructed data could also be accumulated from errors induced by the segmentation performances especially on accelerated data, thus this shall also be taken into consideration when comparing them with the reference standard. However, the correction and improvement of the segmentation on accelerated data are out of scope of this work, and should be considered as an important avenue for future research.
Particularly, by exploiting spatiotemporal redundancies in the proposed DL framework, our approach can outperform the state-of-the-art CS-and DL-based methods and can further push the acceleration capability with fast reconstruction speed for the dynamic parallel MR imaging. In our work, Dataset A was a multi-breath-hold acquisition of 8 consecutive breath-holds with 15s for each (2 × GRAPPA accelerated). Hence, an acceleration rate of 16 or higher could result in the possibility of achieving the same acquisition in a single-breath-hold, but this may also require further investigation with respect to transient state imaging, achievable spatial coverage and/or temporal resolution for single breath-hold cine imaging. Despite this being a retrospective undersampling study, our results indicate a great potential in facilitating fast single-breath-hold clinical 2D cardiac cine imaging.
For the future work, we will explore the dynamic parallel image reconstruction with other types of undersampling strategies, such as radial sampling which is also commonly used in acceleration of 2D cardiac MR imaging in practice. In addition, we could also consider incorporating some other regularization terms into the framework, such as regularization on some other transform domains, to exploit the data redundancy for effective reconstruction. Besides, generalization capability of the model can be further validated on more data from different domains and with various acquisition parameters and pathologies, as well as on more aspects such as generalization to different orientations (eg, chamber views) to investigate its potential application for clinical use.

| CONCLUSION
In this paper, we have proposed a novel DL-based approach, termed CTFNet, for highly undersampled dynamic parallel MR image reconstruction. The proposed method exploits spatiotemporal correlations in both the combined spatial and temporal frequency domain and the spatiotemporal image domain based on a variable splitting and alternating minimization formulation. The network is able to learn to iteratively reconstruct the images by jointly and effectively exploiting information from the complementary time-frequency domains. Our proposed CTFNet outperforms state-of-the-art dynamic MR reconstruction methods in terms of both quantitative and qualitative performance, with excellent recovery of fine details and preservation of temporal traces. It also enables increased accelerations of data acquisition with favorable generalization ability, which is promising for realizing single-breath-hold clinical 2D cardiac cine MR imaging.

SUPPORTING INFORMATION
Additional Supporting Information may be found online in the Supporting Information section.