Robust partial Fourier reconstruction for diffusion-weighted imaging using a recurrent convolutional neural network

Purpose: To develop an algorithm for robust partial Fourier (PF) reconstruction applicable to diffusion-weighted (DW) images with non-smooth phase variations. Methods: Based on an unrolled proximal splitting algorithm, a neural network architecture is derived which alternates between data consistency operations and regularization implemented by recurrent convolutions. In order to exploit correlations, multiple repetitions of the same slice are jointly reconstructed under consideration of permutation-equivariance. The algorithm is trained on DW liver data of 60 volunteers and evaluated on retrospectively and prospectively sub-sampled data of different anatomies and resolutions. Results: The proposed method significantly outperforms conventional PF techniques on retrospectively sub-sampled data in terms of quantitative measures as well as perceptual image quality. In this context, joint reconstruction of repetitions as well as the particular type of recurrent network unrolling are found to be beneficial with respect to reconstruction quality. On prospectively PF-sampled data, the proposed method enables DW imaging with higher signal without sacrificing image resolution or introducing additional artifacts. Alternatively, it can be used to counter the TE increase in acquisitions with higher resolution. Further, generalizability is demonstrated on prospective brain data exhibiting anatomies and contrasts not present in the training set. Conclusion: This work demonstrates that robust PF reconstruction of DW data is feasible even at strong PF factors in anatomies prone to phase variations. Since the proposed method does not rely on smoothness priors of the phase but uses learned recurrent convolutions instead, artifacts of conventional PF methods can be avoided.


Introduction
Due to its ability to visualize regions of restricted diffusion, diffusion-weighted imaging (DWI) has been shown to be valuable in detecting and assessing lesions in several clinical applications [1,2,3,4]. However, DWI is an inherently signal-starved imaging technique since the diffusion encoding requires the application of strong magnetic gradients which dephase parts of the spins and reduce the total net magnetization. In addition, DWI often employs single-shot echo-planar imaging (ssEPI), which requires shorter scan times than its multi-shot counterpart and does not have to compensate for inter-shot motion, but which suffers from T2*-decay for long echo-trains. In order to alleviate T2*-related effects and increase the signal-to-noise ratio (SNR), partial Fourier (PF) sampling along the phase-encoding direction can be employed, which reduces the distance to the k-space center from one side and, hence, results in a shorter echo time (TE). The ratio between the acquired portion of k-space and the total k-space is referred to as the PF factor (PFF), with typical values of 5/8, 6/8 or 7/8.
Reconstructing an image from an asymmetrically sampled k-space is a non-trivial task given that the image at hand is not real-valued. The central idea of PF reconstruction is based on the observation that the phase of MR images usually has relatively low spatial resolution and, hence, can be estimated from the partially acquired data. Conventional PF techniques, such as Homodyne [5] and Projection Onto Convex Sets (POCS) [6], make explicit assumptions on the smoothness of the phase. Both methods perform phase correction using a low-resolution phase estimate which is computed from the symmetrically sampled portion of the data. While Homodyne is a one-step approach which involves ramp filtering in frequency domain, POCS is an iterative method that alternates between phase correction and enforcing data consistency.
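The POCS iteration described above can be sketched in a few lines of NumPy. This is a minimal single-coil illustration, not the reference implementation: the low-resolution phase is estimated from the symmetrically sampled central band, and each iteration alternates phase correction with a hard data-consistency step. Variable names and the shift conventions are assumptions.

```python
import numpy as np

def pocs_pf(kspace, mask, n_iter=5):
    """Minimal single-coil POCS partial Fourier reconstruction (sketch).

    kspace : zero-filled, centered k-space (ny, nx), sampled where mask == 1
    mask   : binary sampling mask, constant along the readout axis (ny, nx)
    """
    # Symmetrically sampled central band -> low-resolution phase estimate
    sym = mask * mask[::-1]
    phase = np.angle(np.fft.ifft2(np.fft.ifftshift(kspace * sym)))

    img = np.fft.ifft2(np.fft.ifftshift(kspace))  # zero-filled start
    for _ in range(n_iter):
        # Phase correction: keep the magnitude, impose the smooth phase
        img = np.abs(img) * np.exp(1j * phase)
        # Data consistency: re-insert the measured k-space samples
        k = np.fft.fftshift(np.fft.fft2(img))
        k = mask * kspace + (1 - mask) * k
        img = np.fft.ifft2(np.fft.ifftshift(k))
    return img
```

Note that the phase-correction step is exactly where smoothness of the true phase is assumed; when that assumption fails, as in abdominal DWI, the iteration converges to an inconsistent solution.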
Although smoothness priors are appropriate in many MR applications, they may be unsuitable in the context of DWI which oftentimes exhibits rapid phase variations, especially in motion-prone anatomies, such as the abdomen (see Supporting Information Figure S1). In the case of strong PFFs which reduce the range of symmetrically sampled data in k-space, conventional methods oftentimes introduce artifacts because the respective smooth phase estimate cannot appropriately represent the true image phase. Furthermore, phase variations may become strong enough to shift the k-space center out of the partially sampled region resulting in severe, irreversible signal loss [7].
In recent years, Deep Learning-based approaches have found their way into the field of under-sampled MR reconstruction, including PF reconstruction [8,9,10,11,12]. Using the expressive power of convolutional neural networks (CNNs), these methods try to learn a reconstruction procedure from large sets of training data and thereby circumvent weaknesses of hand-crafted image priors.
Typically, training is performed in a supervised manner by acquiring ground-truth data which is retrospectively sub-sampled to create the inputs to the network. The parameters of the CNN are then tuned towards the desired behavior by backpropagating gradients from a loss function that compares the network output with the ground-truth. In [11], it was shown that a CNN trained on PF reconstruction of natural images from the ImageNet database [13] was able to generalize to DW brain images during test time. However, the fact that ImageNet data is real-valued, which necessitated the simulation of random phases, constitutes a limiting factor. Further, in the context of medical image reconstruction, one general drawback of the networks proposed in [8,9,11] is that they perform direct image-to-image mappings and consequently lack assurance of data consistency. Unrolled networks that emulate iterative optimization schemes are known to perform better as they implement a gradual refinement of estimates and encode knowledge of the physics of the acquisition [14,15,16], e.g. by incorporating the forward model in a data consistency step. This also makes deep neural networks more interpretable and can therefore increase acceptance among clinicians. The method proposed in [12] was derived by unrolling a relaxed version of the POCS algorithm and replacing the phase-correction step with a recurrent CNN.
This work aims to further develop the preliminary approach presented in [12] by making more efficient use of correlations among multiple realizations of the same slice and eliminating the dependency on their ordering. In addition, training and evaluation are performed on liver DWI, which is known to be very challenging in the context of PF reconstruction due to strong, motion-induced phase variations. The proposed method is evaluated quantitatively as well as qualitatively by comparison with conventional PF techniques and other Deep Learning-based approaches. A reconstruction method that works robustly even at strong PFFs in anatomies with high-frequency phase components facilitates the applicability of PF to DWI. Provided that image sharpness is restored reliably, the SNR gain due to PF-induced TE reduction can increase the diagnostic value of DWI. Alternatively, the increased SNR can be traded against scan time by reducing the number of acquired repetitions, or used to compensate for the increase in TE related to acquisitions with higher spatial resolutions or motion-compensated diffusion preparation [17,18].

Inverse reconstruction problem
Assuming single-channel data y ∈ C^M, the forward model in PF-sampled MRI can be expressed as

y = Ax + n,    (1)

where x ∈ C^N is the complex-valued image, A ∈ C^(M×N) is the linear forward operator encoding both PF-sampling and the discrete Fourier transform, and n ∈ C^M represents complex Gaussian noise.
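The forward operator A and its adjoint can be sketched as a masked orthonormal Fourier transform. This is an illustrative NumPy sketch, assuming a 2-D single-channel image and a boolean sampling mask; with `norm="ortho"`, A A* equals the identity on the acquired samples.

```python
import numpy as np

def forward(x, mask):
    """Forward operator A: 2-D DFT followed by PF-sampling (M < N)."""
    k = np.fft.fftshift(np.fft.fft2(x, norm="ortho"))
    return k[mask]                      # keep only acquired samples

def adjoint(y, mask):
    """Adjoint A*: zero-filling of the missing samples, then inverse DFT."""
    k = np.zeros(mask.shape, dtype=complex)
    k[mask] = y
    return np.fft.ifft2(np.fft.ifftshift(k), norm="ortho")
```

The zero-filled reconstruction A*y used as the network input corresponds exactly to `adjoint(y, mask)`.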
Due to the non-injectivity associated with the sub-sampling (M < N) and due to noise, recovering x from its measurements y is an ill-posed inverse problem. It can be addressed by minimizing a regularized objective function of the form

x̂ = arg min_x  R(x) + ν D(Ax, y),    (2)

where R and D represent regularization and data terms, respectively. Both terms are weighted against each other by some parameter ν > 0. Using D(Ax, y) = ½‖Ax − y‖₂², which is the natural choice for normally distributed noise, a proximal splitting algorithm can be employed to approximate the solution of Equation (2). Given the iteration index k ∈ N, the proximal operators of both terms are computed in an alternating fashion with step size μ_k ≥ 0:

z_k = prox_{μ_k R}(x_{k−1}),    (3a)
x_k = arg min_x  (λ_k/2) ‖x − z_k‖₂² + ½ ‖Ax − y‖₂²,    (3b)

where x_0 = A*y and λ_k = μ_k ν. Since proximal mappings constitute generalizations of projections, it can easily be seen that POCS is in fact a realization of this scheme for λ_k = 0 and R = ι_Φ, where ι_Φ is the indicator function of the set Φ of images consistent with the low-resolution phase estimate φ ∈ R^N:

Φ = {x ∈ C^N : x = |x| e^{iφ}}.
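For the quadratic data term, the data-consistency step (3b) has a simple closed form in k-space: acquired entries of the estimate are blended with the measurements as (y + λ_k ẑ)/(1 + λ_k), so that λ_k = 0 reduces to a hard replacement of the measured samples, as in POCS. The following NumPy sketch of the alternating scheme is illustrative only; the regularization step is left as a placeholder, which the proposed method fills with a CNN.

```python
import numpy as np

def data_consistency(z, kspace, mask, lam=0.0):
    """Closed-form prox of the quadratic data term (sketch).

    Acquired k-space entries are blended with the current estimate;
    lam == 0 reduces to a hard projection onto the measured data.
    """
    k = np.fft.fftshift(np.fft.fft2(z, norm="ortho"))
    k[mask] = (kspace[mask] + lam * k[mask]) / (1.0 + lam)
    return np.fft.ifft2(np.fft.ifftshift(k), norm="ortho")

def unrolled_reconstruction(kspace, mask, prox_R, n_iter=5, lam=0.0):
    """Alternate regularization and data consistency for a fixed K."""
    # x0 = A*y: zero-filled reconstruction of the centered k-space
    x = np.fft.ifft2(np.fft.ifftshift(kspace * mask), norm="ortho")
    for _ in range(n_iter):
        z = prox_R(x)                   # regularization step (CNN in DRPF)
        x = data_consistency(z, kspace, mask, lam)
    return x
```

With the identity as regularizer and lam = 0, the scheme simply reproduces the zero-filled image, which makes explicit that all reconstruction quality has to come from the regularization step.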

Unrolled recurrent network architecture
The proposed reconstruction method is derived from unrolling the aforementioned iterative scheme for a fixed number of iterations K and parametrizing the proximal mapping onto the regularizer term in Equation (3a) by a CNN instead of relying on hand-crafted priors.
In general, there are two different ways of employing a CNN for the regularization step in unrolled frameworks: 1) weight-sharing and 2) cascading, i.e. parametrizing distinct network instances for each iteration, as in [19,20], for example. The first category includes "plug-and-play" approaches in which a CNN is trained in isolation and applied in an iterative scheme afterwards [21]. While the first strategy has the advantages of requiring fewer parameters and not necessarily being tied to a fixed number of iterations, the latter offers more flexibility in the set of performed operations over iterations, however, at the cost of an increased risk of overfitting. In order to effectively combine the benefits of both strategies, recurrent architectures can be employed which implement weight-sharing but simultaneously offer operational flexibility, as the network's behavior can be modulated by the results of earlier iterations.
The resulting unrolled network architecture is shown in Figure 1. Here, recurrence is realized by using a stack of G convolutional gated recurrent units (ConvGRUs) [22,23] which keep track of an internal memory state across iterations. Within each iteration, the CNN-based regularization is followed by a data consistency block implemented according to Equation (3b). The architecture is similar to the one used in [12] but differs with respect to the implementation of joint reconstruction of image repetitions, as pointed out in the following.
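A ConvGRU replaces the dense matrix products of a standard GRU with convolutions, so that the memory state retains spatial structure. The following framework-free sketch of a single cell is illustrative; the parametrization, initialization scale, and gate layout are assumptions rather than the trained network of this work.

```python
import numpy as np

def conv2d_same(x, w, b):
    """3x3 'same' convolution: x (C_in, H, W), w (C_out, C_in, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0],) + x.shape[1:])
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i:i + x.shape[1], j:j + x.shape[2]])
    return out + b[:, None, None]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ConvGRU:
    """Minimal convolutional GRU cell (sketch)."""
    def __init__(self, c_in, c_hidden, rng, scale=0.1):
        shape = (c_hidden, c_in + c_hidden, 3, 3)
        self.wz, self.wr, self.wh = (scale * rng.standard_normal(shape)
                                     for _ in range(3))
        self.bz, self.br, self.bh = (np.zeros(c_hidden) for _ in range(3))

    def __call__(self, x, h):
        """One update of the memory state h given the input x."""
        xh = np.concatenate([x, h])
        z = sigmoid(conv2d_same(xh, self.wz, self.bz))   # update gate
        r = sigmoid(conv2d_same(xh, self.wr, self.br))   # reset gate
        h_cand = np.tanh(conv2d_same(np.concatenate([x, r * h]),
                                     self.wh, self.bh))  # candidate state
        return (1 - z) * h + z * h_cand                  # new memory state
```

Calling the same cell once per unrolled iteration, with the state carried along, is what distinguishes this design from plain weight-sharing: the gates let the unit adapt its behavior based on earlier iterations without adding per-iteration parameters.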

Joint processing of image sets
The fact that DWI has inherently low SNR necessitates the acquisition of multiple repetitions, implying potential benefits of joint reconstruction. In [12], this was attempted by applying 3-D convolutions to repetitions arranged in a 3-D stack. When operating on sets of images, however, two central requirements need to be fulfilled: 1) the ability to handle different set sizes and 2) permutation-equivariance, i.e. the result should not depend on the ordering of the set elements.
While the first requirement can be satisfied when following the 3-D stacking strategy, it can be easily seen that permutation-equivariance cannot be ensured as a modified order of the images in the stack will modify the network output.

Figure 1. Overview of the employed network architecture which is unrolled for K iterations. Within every iteration, it alternates between a CNN-based regularization and a data consistency block (DC). The latter consists of known operators only, such as the (inverse) Fourier transform ((I)FT) and PF-sampling (PF). The regularization network is implemented by a stack of G convolutional gated recurrent units (ConvGRUs) which propagate an internal memory state along iterations, as visualized by the blue lines. Apart from the last unit, every unit uses F feature maps for convolutions. After G/2 of the units, a batch aggregation (Aggr_B) is performed which allows shared information to be distributed across the batch.

Therefore, the concept of Deep Sets [24] was employed in this work. Here, the set of available
repetitions is interpreted as a typical input batch. In order to still exploit information shared among the repetitions, permutation-invariant pooling operations (such as mean or maximum computations) across the batch dimension are performed within the network. Besides fulfilling the aforementioned requirements, this concept also allows the use of 2-D convolutions, which are more efficient with respect to memory and run-time than 3-D convolutions.

Data

In order to augment the data set prospectively, 15 and 60 repetitions were collected per slice for the low and high b-value, respectively, which corresponds to three times the number used in typical clinical protocols. Data was acquired without PF-sampling and with a parallel acceleration factor of 2 due to echo-train length limitations. Ground-truth data was obtained by performing a conventional in-line GRAPPA reconstruction [25] followed by coil combination.

Network inputs were generated by retrospective PF-sampling of the coil-combined data followed by an inverse Fourier transform. The reasoning behind this was to ensure modularity of the proposed method. Given that parallel acceleration is a standard setting in ssEPI sequences in order to avoid negative effects of excessively long echo-trains, this approach allows any parallel reconstruction method to be employed beforehand. Also, if the goal were to perform both parallel and PF reconstruction in a single framework, it would have to be tailored to a particular combination of parallel acceleration and PF factor, which would require training of multiple network instances for different scenarios. Further, since conventional PF methods are typically applied to coil-combined data following some kind of parallel reconstruction, comparison with those techniques is facilitated as differences in performance would not originate from differences with respect to parallel reconstruction/de-aliasing.

Training and implementational details
Complex-valued images were normalized by the 98th percentile value of the corresponding magnitude images and were flipped along the readout direction with a probability of 50 % for the sake of data augmentation during training. An input batch was obtained by taking a random subset comprising one third of the available repetitions for a given slice, such that the network was provided with varying realizations over training epochs. Real and imaginary parts were interpreted as separate channels.
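The preprocessing described above can be sketched as follows. The sketch is illustrative: the granularity of the percentile normalization (here computed per slice over all repetitions) and the axis of the readout direction are assumptions.

```python
import numpy as np

def preprocess(reps, rng):
    """Training-time preprocessing of one slice (sketch).

    reps : complex repetitions of one slice, shape (n_reps, H, W)
    Returns a real array of shape (batch, 2, H, W).
    """
    # Normalize by the 98th percentile of the magnitude
    scale = np.percentile(np.abs(reps), 98)
    reps = reps / scale
    # Random subset of one third of the repetitions forms the input batch
    n = max(1, reps.shape[0] // 3)
    batch = reps[rng.choice(reps.shape[0], size=n, replace=False)]
    # Flip along the readout direction (assumed last axis) with p = 0.5
    if rng.random() < 0.5:
        batch = batch[:, :, ::-1]
    # Real and imaginary parts are interpreted as separate channels
    return np.stack([batch.real, batch.imag], axis=1)
```

Drawing a fresh random subset per epoch exposes the network to varying realizations of the same slice, which acts as an additional form of augmentation.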
Given the batch of B output repetitions after K iterations, {x¹_K, ..., x^B_K}, the magnitude average across the batch was obtained according to x̄ = (1/B) Σ_{b=1}^{B} |x^b_K|, where |·| is the element-wise magnitude operation. Using the average of the ground-truth repetitions x̄_GT computed in the same manner, end-to-end training of the network was performed by minimizing a loss function of the form L = 2 L₁(x̄, x̄_GT) + L_perc(x̄, x̄_GT), where L₁ represents the L1-penalized difference averaged over all pixels and L_perc is a perceptual loss metric introduced in [26]. Given that the goal was to restore image resolution and per-pixel metrics such as L₁ do not penalize image blur heavily enough, the perceptual loss, which calculates normalized distances in a higher-level feature space of a pre-trained neural network, aided in promoting sharp reconstructions. Since the focus was still on ensuring high pixel fidelity, the first loss term was weighted twice as strongly. Note that no phase information contributed to the loss function.
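A sketch of the magnitude-average loss follows. The perceptual term requires a pre-trained feature network [26] and is therefore only a pluggable placeholder here; the relative weighting of two to one follows the description above.

```python
import numpy as np

def magnitude_average(batch):
    """Average of per-repetition magnitudes: (B, H, W) -> (H, W)."""
    return np.mean(np.abs(batch), axis=0)

def training_loss(out_batch, gt_batch, perceptual=None):
    """L = 2 * L1 + L_perc on magnitude averages (sketch).

    `perceptual` is an optional callable standing in for the
    feature-space loss of [26]; no phase information enters the loss.
    """
    avg_out = magnitude_average(out_batch)
    avg_gt = magnitude_average(gt_batch)
    l1 = np.mean(np.abs(avg_out - avg_gt))
    l_perc = perceptual(avg_out, avg_gt) if perceptual is not None else 0.0
    return 2.0 * l1 + l_perc
```

Because only magnitude averages enter the loss, any phase behavior of the network is learned implicitly, which is discussed further below.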
The proposed architecture was unrolled for K = 5 iterations, while G = 10 ConvGRUs were used within the regularizing CNN. Except for the last unit, which was constrained to have 2 feature channels to accommodate the complex nature of the output data, the ConvGRUs used F = 32 feature channels. All convolutions used kernels of size 3 × 3 with zero-padding, such that the image size was kept constant throughout. Exploitation of shared information among repetitions was implemented by maximum aggregation after half of the ConvGRUs (see Figure 1). Given the batch of B feature maps {f¹, ..., f^B}, the aggregated feature map was simply computed as the channel- and pixel-wise maximum over the batch: f_aggr = max({f¹, ..., f^B}). In the case of mean aggregation, which was used for evaluations as well, the aggregated feature map would be obtained as f_aggr = (1/B) Σ_{b=1}^{B} f^b, accordingly. In both cases, permutation-equivariance was ensured by updating every feature map with the shared aggregate.

Analogous to POCS, hard projections were performed for data consistency by setting λ_k = 0 in Equation (3b) for all iterations. Thereby, it is ensured that only the non-acquired parts of the k-space are estimated. Note that λ_k can alternatively be chosen greater than 0 in order to allow deviations from the measured data, which is required for reconstructions that involve denoising. For example, in [12], the λ_k were implemented as learnable parameters. However, in this work denoising was not desired in order to ensure comparability with conventional techniques, which do not include denoising either.
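The batch aggregation described above can be sketched as follows. The exact feature-update rule is not spelled out in the text; one common permutation-equivariant choice, assumed here purely for illustration, is to concatenate the shared aggregate back onto each element's feature maps along the channel dimension.

```python
import numpy as np

def aggregate(features, mode="max"):
    """Permutation-equivariant batch aggregation (sketch).

    features : (B, F, H, W) feature maps of the B repetitions
    Returns  : (B, 2F, H, W), each element concatenated (assumed update
               rule) with the permutation-invariant pooled map f_aggr.
    """
    if mode == "max":
        f_aggr = features.max(axis=0)     # channel- and pixel-wise maximum
    else:
        f_aggr = features.mean(axis=0)    # mean aggregation
    # Broadcast the shared map back to every batch element
    tiled = np.broadcast_to(f_aggr, features.shape)
    return np.concatenate([features, tiled], axis=1)
```

Since the pooling is invariant and every element is updated identically, permuting the repetitions simply permutes the output, which is exactly the equivariance requirement, and the operation adds no learned parameters.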

Evaluation
Qualitative evaluations of representative reconstructions were conducted for the proposed method which will be referred to as Deep Recurrent Partial Fourier (DRPF) in the following. For cases in which ground-truth data was available, quantitative measures in the form of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were reported as well. Since conventional methods perform sufficiently well in the case of a PFF of 7/8, only factors of 5/8 and 6/8 were considered during evaluations. Concerning the proposed method, distinct network instances were trained and evaluated for each PFF separately.
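Of the two reported metrics, PSNR can be computed in a few lines; SSIM is more involved and is typically taken from a library such as scikit-image. This sketch takes the peak value from the reference image, one of several common conventions.

```python
import numpy as np

def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB (peak taken from the reference)."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(reference.max() ** 2 / mse)
```

Because PSNR is a purely per-pixel measure, it is complemented by SSIM and by visual assessment, which are more sensitive to residual blur.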

Retrospectively sub-sampled liver DWI
Using the data from the test set, the reconstructions generated by DRPF were validated by qualitative and quantitative comparisons with ground-truth images. POCS was included in the comparison because it is a commonly used state-of-the-art method in clinical practice. POCS was performed for 5 iterations, although approximate convergence could generally be observed within as few as 3 iterations.

Prospectively sub-sampled liver DWI
In order to validate the reconstruction quality of the proposed method on prospectively PF-sampled data, three additional data sets were acquired using a PFF of 5/8.

Generalization to brain DWI
An important aspect with respect to learning-based methods is the ability to generalize to data that is distinct from instances seen during training. The robustness of the proposed method was demonstrated by applying it to a case which deviated from the training set in terms of anatomy, contrast and resolution. For this purpose, brain data was acquired using a PFF of 5/8 with (b = 1000 s/mm 2 , 4 diffusion directions, 4 repetitions) and without diffusion-weighting (b = 0 s/mm 2 , 3 repetitions).
The remaining acquisition parameters are listed in Supporting Information Table S2.

Benefits of joint reconstruction
In order to quantify the benefits of jointly reconstructing multiple repetitions in the context of DWI, the proposed architecture, which used maximum aggregation over the set of repetitions, was compared against two other realizations which used mean aggregation and no aggregation at all, respectively, meaning that correlations across repetitions were not accessible in the latter case.

Comparison of unrolling strategies
The approach of unrolling a network using a recurrent CNN was compared against two other unrolling strategies which used simple weight-sharing across iterations and a cascade of CNNs, respectively. Both employed a ResNet architecture [29] which approximately matched the number of parameters of the proposed recurrent architecture and used maximum aggregation after half of the residual blocks (see Supporting Information Figure S2). A base regularizer network (one iteration with data consistency) was pre-trained for 50 epochs with a learning rate of 5 · 10⁻⁴ and then used to initialize both the weight-sharing and cascading architectures, where a distinct copy was used for each iteration in the latter case. Both architectures were then trained for another 150 epochs with a learning rate of 1 · 10⁻⁴. This training strategy showed better convergence and final performance than direct end-to-end training in both cases. The remaining training parameters were the same as described in 3.2.

Retrospectively sub-sampled liver DWI

In order to further examine the effects of the different reconstruction methods, evaluations on a single repetition were conducted, which allowed phase reconstructions to be visualized as well. As shown in Figure 3, the aforementioned observations hold true for the magnitude images of the single repetitions as well, i.e. POCS is not able to recover regions affected by signal loss and introduces additional noise. Note that the artifacts appear more severe in the single repetitions since they tend to average out in the combined images. The location of the signal void in the left liver lobe coincides with high-frequency structures in the corresponding phase image which cannot be reconstructed properly by POCS. In fact, the phase produced by POCS has even lower resolution than the phase of the zero-filled image since the former only uses the symmetrically sampled portion of the k-space. The proposed method, on the other hand, is able to accurately reconstruct these high-frequency structures.

Prospectively sub-sampled liver DWI
As demonstrated in Figure 5, the proposed method also performs well on prospectively sub-sampled data. Compared to the zero-filled reconstruction, DRPF is able to restore resolution along the subsampled direction for both b-values. POCS reduces blurring as well but introduces additional noise and a mesh pattern. Representative reconstructions on prospective data of two additional subjects are presented in Supporting Information Figure S4.

Generalization to brain DWI
Results on prospectively sub-sampled brain data are presented in Figure 7. Both POCS and DRPF alleviate blurring introduced by zero-filling effectively. However, POCS leads to noise amplification which becomes more pronounced for the higher b-value. Since phase variations are comparatively moderate in brain DWI, severe artifacts such as signal voids cannot be observed. Although trained exclusively on liver data, DRPF is able to reconstruct brain images which feature sharp edges at tissue boundaries but at the same time appear more homogeneous in iso-intense regions compared to POCS.

Benefits of joint reconstruction
As Figure 8 implies, all three network realizations produce visually appealing reconstructions.
However, error images reveal that the image obtained from the network which processes repetitions independently from each other exhibits a higher degree of residual blurring around tissue boundaries and some signal fluctuations (most apparent in the spleen). Using mean or maximum aggregation, both effects can be mitigated. While qualitative differences between the two aggregation approaches are minimal, maximum aggregation results in slightly improved PSNR and SSIM as confirmed by the quantitative evaluation on the whole test set provided in Table 1. As the metrics imply, improvements of joint reconstruction are more substantial for data acquired at b = 800 s/mm 2 .
Given that a larger number of repetitions is available and individual repetitions are noisier, the greater benefit of joint reconstruction for higher b-values conforms to expectation.

Comparison of unrolling strategies
As qualitative differences in the reconstructions produced by the different unrolling strategies are marginal, we focus on the quantitative comparison provided in Table 1. As the metrics imply, the proposed recurrent architecture outperforms both alternatives. However, the surplus in parameters still allows the cascaded network architecture to produce slightly superior results compared to the weight-shared network.

Discussion
Given that conventional PF techniques, such as POCS, rely on smoothness priors of the phase, they oftentimes introduce artifacts when applied to DWI, which is prone to phase variations. Evaluations in this work showed effects like noise enhancement, the introduction of mesh patterns or even signal voids at strong PFFs (see Figures 2, 3 and 5). All of these effects prohibit the use of PF-DWI in clinical practice or limit it to weak PFFs which do not cause significant reduction of TE in ssEPI sequences.
The results of the conducted experiments imply that the proposed learning-based method achieves robust PF reconstruction of DW liver images with high-frequency phase structures. It was shown to outperform POCS and zero-filling quantitatively as well as qualitatively on a distinct testing set.
In addition, the proposed method generalized well to brain DWI acquired at b-values that were not used in the training set. This implies that the network attempts to invert the blur kernel associated with PF-sampling rather than just learning liver anatomy.
Further, it is noteworthy that recovery of high-frequency phase structures is possible although only magnitude information contributed to the objective function during training. This observation indicates that, similar to conventional methods, some form of phase regularization is exploited in order to reconstruct the missing data. However, in contrast to conventional methods, this is done in an implicit manner instead of using explicit smoothness constraints. We also hypothesize that the ability to recover signal voids arising from PF-sampling can be attributed to the restoration and utilization of high-frequency phase components as well as to the exploitation of information from other repetitions which potentially do not suffer from signal loss in the same area. In order to demonstrate that the network is not just filling signal voids empirically based on anatomy learned from training data, an experiment was performed in which an artificially generated signal void was placed into the repetition presented in Figure 3. The corresponding results in Supporting Information Figure S5 show that the artificial signal 'hole' in the liver parenchyma is deblurred but not filled, while the signal void in the left lobe resulting from PF-sampling is recovered as previously shown in Figure 3.
Further, the performance of the proposed method on prospectively sub-sampled data could be validated as well. Given the robust deblurring properties of DRPF, it is possible to omit almost one half of the k-space. This work demonstrated two ways of utilizing the associated TE reduction in combination with the proposed reconstruction technique. One was to simply minimize TE as much as possible in order to achieve DW images with high SNR without suffering from artifacts known from conventional methods (see Figure 5). Alternatively, it could be used to compensate for the TE increase associated with acquisitions of higher resolution (see Figure 6). However, there are several other possible scenarios of utilizing the shortened echo-train. For example, the consequently increased SNR could be traded against scan time by reducing the number of acquired repetitions.
Alternatively, the TE increase caused by bi-polar or motion-compensated diffusion preparation schemes could be countered by PF-sampling [17,18].
The concept of Deep Sets was employed in this work in order to provide the network with correlations within the set of available repetitions. Compared to a version of DRPF that did not exploit shared information, it could be shown that reconstruction quality improves when aggregation of image features is implemented at certain points within the network. Here, maximum aggregation provided slightly better results than mean aggregation. The fact that the incorporation of those operations is straightforward and does not increase the number of learned parameters renders this approach valuable in settings in which multiple realizations of an image are available.
Since individual repetitions can be affected by imaging artifacts, the effects of those artifacts on the reconstruction of a selected repetition were investigated as well. In the context of DW ssEPI, signal dropouts due to cardiac pulsation are particularly prominent (see Supporting Information).

Unrolled, model-based networks have emerged as the state-of-the-art architecture class for medical image reconstruction. As in our previous work [12], the regularizer was implemented by a recurrent CNN here. Our experiments showed that the given architecture was able to quantitatively outperform two alternative unrolling strategies, namely weight-sharing and cascading. In addition, it does so with K times fewer parameters than cascading, where K is the number of unrolled iterations.
Note that this is not necessarily an observation exclusive to PF reconstruction but could potentially be valid for other applications as well. However, in order to make a more general statement, further experiments are required which are beyond the scope of this work. One limitation of the comparison of unrolling strategies conducted in this work is the architectural difference between the recurrent unrolling approach and the other strategies. While the former uses ConvGRUs, the latter use a ResNet architecture. Since GRUs are recurrent by definition, a non-recurrent realization for comparison purposes would not be meaningful. Therefore, a ResNet architecture was chosen which was similar in the number of parameters.
Using recurrent networks in unrolled architectures has been proposed in previous work as well.
Recurrent Inference Machines (RIMs) were introduced in [30] and applied to accelerated MR reconstruction in [31]. In contrast to this work which defines the iterative update rule via proximal splitting in a variational setting, RIMs are derived from a more Bayesian perspective in which the gradient is calculated according to a maximum-a-posteriori solution. Hence, in [30] the recurrent CNN takes both the current estimate and its data-consistent transform as input instead of regarding regularization and data consistency as separate steps within an iteration. Further, a bi-directional recurrent network architecture was proposed in [32] for unrolled reconstruction of dynamic cardiac MRI. Memory was propagated not only along iterations but also along the time dimension of the data. In contrast to this work which employed more sophisticated GRUs, basic Elman cells [33] were used in [32].
One general limitation of PF acquisitions is the increased risk of omitting the maximum k-space point, which can be shifted away from the k-space center in data with large phase errors.
Consequently, it would become impossible to fully recover the corresponding signal loss even for a Deep Learning-based solution. However, within our data set this scenario was very rare. While it did not occur at all for the lower b-value, the maximum k-space point was outside the region that would be sampled for PFF = 5/8 in only 0.53 % of the cases for b = 800 s/mm 2 (see Supporting Information Figure S8). One of those cases is presented in Supporting Information Figure S9 where the omitted k-space region carried the image information corresponding to the hyper-intense spleen which can be recovered only to a very limited degree by the proposed reconstruction method.
Nevertheless, given that this problem only occurs at high b-values, which are typically acquired with a relatively high number of repetitions, the effects on the averaged DW images are expected to be marginal, as shown in Supporting Information Figure S10. Here, the previously described signal loss in the spleen of the single repetition did not lead to perceivable intensity reduction in the corresponding area of the averaged image consisting of 20 repetitions. Despite the results shown in Supporting Information Figure S8, local signal dropouts in PF-acquired data due to rigid bulk motion are still possible. In order to minimize this risk, respiratory-triggered or breath-held acquisitions can be employed instead of the free-breathing acquisitions performed in this work for the sake of faster data collection.
As phase variations can result from eddy currents, which are typically stronger in prospectively sub-sampled data, a clinical evaluation of the proposed method is planned, involving a broader assessment of prospective data by expert radiologists. As an outlook for future work, it could also be worthwhile to further investigate the aspects of recurrent network unrolling and joint reconstruction of image repetitions, for example, by employing more sophisticated aggregation techniques such as attention-based methods [34,35]. Also, the frequency of aggregations as well as their location within the network could be optimized. Given its robustness with respect to phase variations, the presented method, combined with potential re-training that penalizes phase errors explicitly, could enable PF imaging in other applications in which the phase is inherently non-smooth or even carries contrast information, such as phase-contrast imaging.

Conclusion
This work demonstrated that robust PF reconstruction of coil-combined DW data is possible by means of an unrolled, model-based neural network. Using a regularization that is parametrized by recurrent convolutions and trained on retrospectively sub-sampled data, artifacts introduced by conventional PF methods due to ill-suited phase priors can be avoided. Reconstruction quality was validated on retrospectively and prospectively sub-sampled data. Decent generalization to other anatomies and contrasts can be achieved, which substantiates the method's ability to approximate an inversion of the blurring operation. Accurate PF reconstruction, combined with the associated TE reduction, has the potential to enable acquisitions that may have been prohibited so far due to excessive echo-train lengths in ssEPI sequences and/or the lack of accurate PF reconstruction techniques.

List of Supporting Figures
Supporting Information Figure S1. Phase variations in a representative repetition of a DW liver slice acquired at 3 T and a b-value of 800 s/mm²: (left) magnitude image, (middle) phase image, and (right) the corresponding low-pass-filtered phase having only 1/4 of the original resolution. The latter represents a low-resolution phase estimate as it would be used in Homodyne or POCS for a PFF of 5/8. Note how high-frequency structures along the AP direction, such as in the left liver lobe, are lost completely in the low-resolution phase.
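The low-resolution phase estimate described in the caption can be sketched in numpy, assuming the symmetrically sampled central k-space band of width 2*(PFF - 1/2) along phase encoding is used, as in Homodyne or POCS:

```python
import numpy as np

def lowres_phase_estimate(image, pff=5/8):
    """Low-resolution phase estimate from the symmetrically sampled central
    k-space band. For PFF = 5/8 this band covers 2*(5/8 - 1/2) = 1/4 of
    k-space along the phase-encoding (first) axis."""
    k = np.fft.fftshift(np.fft.fft2(image))
    n = k.shape[0]
    half_width = int(round((pff - 0.5) * n))  # half-width of the symmetric band
    win = np.zeros_like(k)
    c = n // 2
    win[c - half_width:c + half_width] = k[c - half_width:c + half_width]
    lowres = np.fft.ifft2(np.fft.ifftshift(win))
    return np.angle(lowres)
```

Any phase structure outside this narrow central band, such as the high-frequency variations in the left liver lobe, is necessarily lost in the estimate.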
Supporting Information Figure S2. ResNet architecture used in the cascading and weight-sharing approaches. It consists of G residual blocks (ResBlocks), each of which is composed of two convolutional layers (Conv) with F feature channels. A batch aggregation (Aggr_B) is performed after half of the residual blocks. In order to match the number of parameters of the recurrent architecture, parameters were set to G = 6 and F = 64. Further, maximum aggregation along the batch of image features was employed.
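The maximum aggregation along the batch of repetitions can be sketched in isolation. This is a minimal numpy sketch; the concatenation of the batch-wise maximum back onto each repetition's features is an assumption about how the aggregated features re-enter the network:

```python
import numpy as np

def aggregate_batch_max(features):
    """Maximum aggregation along the batch (repetition) dimension:
    each repetition's features are concatenated with the element-wise
    maximum over all repetitions, sharing information between them.
    features: (B, F, H, W) -> (B, 2F, H, W)"""
    shared = features.max(axis=0, keepdims=True)      # (1, F, H, W)
    shared = np.broadcast_to(shared, features.shape)  # (B, F, H, W)
    return np.concatenate([features, shared], axis=1)
```

Because the maximum is invariant to the ordering of repetitions, the operation is permutation-equivariant: permuting the repetitions permutes the output in the same way.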
Supporting Information Figure S3. Reconstruction quality on retrospectively PF-sampled liver data (magnitude average of 5 repetitions acquired at 1.5 T and a b-value of 50 s/mm²). Top row: ground-truth image. Second and third rows: reconstructions and corresponding difference images (magnified by a factor of 5) produced by zero-filling, POCS and DRPF at a PFF of 6/8. Fourth and fifth rows: the same for a PFF of 5/8.
Supporting Information Figure S7. Reconstruction quality of different aggregation configurations on a retrospectively PF-sampled (PFF = 5/8) single repetition shown in Figure 3. Three configurations were compared: separate reconstruction of repetitions (DRPF_None) and joint reconstruction (DRPF_Max), where the 19 remaining repetitions of the set were either affected by motion-induced signal dropouts ('corrupt') or not ('clean'). The first and second rows show the magnitude images and the corresponding difference images with respect to the ground truth (magnified by a factor of 5), respectively.