FD-Net: An unsupervised deep forward-distortion model for susceptibility artifact correction in EPI

Purpose: To introduce an unsupervised deep-learning method for fast and effective correction of susceptibility artifacts in reversed phase-encode (PE) image pairs acquired with echo planar imaging (EPI). Methods: Recent learning-based correction approaches in EPI estimate a displacement field, unwarp the reversed-PE image pair with the estimated field, and average the unwarped pair to yield a corrected image. Unsupervised learning in these unwarping-based methods is commonly attained via a similarity constraint between the unwarped images in reversed-PE directions, neglecting consistency to the acquired EPI images. This work introduces a novel unsupervised deep Forward-Distortion Network (FD-Net) that predicts both the susceptibility-induced displacement field and the underlying anatomically correct image. Unlike previous methods, FD-Net enforces the forward-distortions of the correct image in both PE directions to be consistent with the acquired reversed-PE image pair. FD-Net further leverages a multiresolution architecture to maintain high


Introduction
Echo planar imaging (EPI) [1] is the most commonly employed MRI sequence for diffusion-weighted imaging (DWI) and functional MRI (fMRI), due to its rapid k-space acquisition capability [2,3]. However, EPI is prone to susceptibility artifacts arising from B0 field inhomogeneities, which are particularly prominent near tissue interfaces [4]. These artifacts manifest as intensity distortions from signal pileups/dropouts, and geometrical distortions due to compression/stretching of affected regions [5]. Severe artifacts can limit the clinical utility of EPI images. Therefore, artifact correction is an essential step to ensure the accuracy of downstream qualitative and quantitative assessments, especially at high field strengths [6,7,8].
A leading framework for susceptibility-artifact correction uses images acquired in reversed phase-encoding (PE) directions to estimate the susceptibility-induced displacement field directly from the resulting blip-up (BU) and blip-down (BD) EPI images [5,9,10,11]. An unwarping-based approach is commonly adopted for correction, where the reversed-PE images are nonlinearly transformed to alleviate artifacts based on the estimated displacement field. Either voxel-wise field estimates [12,13,10] or a weighted combination of spatial basis maps across the field-of-view (FOV) [9] can be used. Popular implementations of this framework include classical methods such as TOPUP from the FMRIB Software Library (FSL) [14,9] and hyperelastic susceptibility correction of DTI data (HySCO) from the Statistical Parametric Mapping (SPM) toolbox [15,16]. Since no additional data collection is needed beyond reversed-PE images, classical methods in the unwarping-based framework can offer notable benefits over measured-field-based, registration-based, or point spread function (PSF) based approaches in the literature [17,18]. Nonetheless, these classical methods rely on iterative optimization techniques that introduce substantial computational burden.
Deep neural networks have recently been considered as a powerful alternative for artifact correction that can maintain high computational efficiency [19]. Previous studies in this domain have adopted the unwarping-based framework, where reversed-PE images are first individually unwarped and then combined to produce a final estimate. In the absence of ground truth for anatomically-correct images, network training has been performed via an unsupervised learning strategy that aims to maximize the similarity of unwarped images across the two PE directions [20]. Among previous learning-based methods, S-Net performs unwarping via bilinear interpolation and assesses the similarity between the corrected BU/BD images via a cross-modal loss [20]. Deepflow-Net instead performs unwarping via cubic interpolation and assesses the similarity between the corrected BU/BD images via a mean-squared error (MSE) loss [21]. While promising results have been reported, these previous methods define an unsupervised loss function in the output domain of unwarped images, for which no ground-truth data are available. Such lack of physical constraints in the loss function can cause suboptimal learning [22,23]. In turn, the network can produce low-fidelity images during inference, resulting in solutions that are notably inconsistent with the acquired reversed-PE images [24].
Here, we propose a novel deep network model (FD-Net) based on a forward-distortion approach for correcting EPI susceptibility artifacts in reversed-PE image pairs. Unlike unwarping-based methods that average individually-corrected reversed-PE images, FD-Net predicts a single anatomically-corrected image along with a displacement field. Unlike previous deep-learning methods, FD-Net directly incorporates physical constraints in the input domain where measurements are available. Specifically, FD-Net forward-distorts the corrected image with the predicted field to reconstruct the reversed-PE image pair. Unsupervised learning is then achieved by enforcing consistency between the reconstructed and acquired reversed-PE images. A multiresolution architecture is employed to maintain performance at both local and global scales. Comprehensive demonstrations are performed to assess the quality of corrected images and field estimates on EPI data from the Human Connectome Project (HCP) database [25]. FD-Net performs competitively with the reference TOPUP method while enabling a leap in computational efficiency, and it significantly outperforms competing deep-learning methods based on the unwarping framework. These findings demonstrate the potential of FD-Net as a fast and effective method for susceptibility-artifact correction in EPI.

Susceptibility-Induced Distortions
The relationship between the anatomically-correct image and the distorted EPI image can be expressed as a linear system:

f = Kρ, (1)

where K is a transformation matrix of size n_PE·n_FE × n_PE·n_FE called the K-matrix, ρ is the vectorized anatomically-correct image, f is the vectorized EPI image, and n_PE and n_FE are the image dimensions in the PE and frequency-encode (FE) directions, respectively. In general, K can be complex-valued given complex-valued images ρ and f, such that it performs a phase shift as well as interpolation [9]. In practice, however, magnitude images are more commonly utilized for convenience and K is real-valued. Ignoring the distortion along the FE direction enables block diagonalization of the K-matrix, allowing the problem to be separated across FE lines as:

f_i = K_i ρ_i, i = 1, 2, ..., n_FE. (2)

Here, K_i are the transformation submatrices acting along the PE direction, and ρ_i and f_i are the i-th rows of the correct image and the EPI image, respectively. As shown in Figure 1, the K-matrix describes the mapping from the correct image to the EPI image. Deviations of the K-matrix from the identity matrix are representative of the amount of distortion, and multiple nonzero values on the same row indicate a many-to-one mapping (i.e., pileup/dropout distortions). For reversed-PE acquisitions, Equation (2) can be written separately for the i-th rows of the EPI images from BU/BD acquisitions. In that case, the associated K-matrices K_i,BU and K_i,BD are based on the same underlying field, with the difference of utilizing the negative of the field for the BD acquisition.
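To make the many-to-one behavior of the K-matrix concrete, the following sketch builds K_i for a single FE line from a 1D displacement field and applies Equation (2). For simplicity it uses a nearest-neighbor kernel rather than the sinc kernel used by FD-Net, and all function and variable names are illustrative, not from the paper's implementation.

```python
import numpy as np

def k_matrix_row(field_row, n_pe):
    """Build the K-matrix for one FE line from a 1D displacement field
    (in pixels), using a nearest-neighbor kernel for simplicity."""
    K = np.zeros((n_pe, n_pe))
    for q in range(n_pe):
        # source pixel q of the correct image lands at distorted location q + field[q]
        p = int(round(np.clip(q + field_row[q], 0, n_pe - 1)))
        K[p, q] += 1.0
    return K

n_pe = 6
rho_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # one row of the correct image
field = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # pixel 2 shifts onto pixel 3

K_i = k_matrix_row(field, n_pe)
f_i = K_i @ rho_i   # distorted EPI row: f_i = K_i rho_i
print(f_i)          # pixel 3 piles up (3 + 4 = 7), pixel 2 drops out to 0
```

Row 3 of K_i has two nonzero entries, producing a pileup at that pixel, while row 2 is all zeros, producing a dropout.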

Classical Methods
Among classical methods for susceptibility-artifact correction in EPI, the predominant approach is correction based on reversed-PE acquisitions. TOPUP, a popular implementation of this approach, uses an alternating least-squares optimization to jointly solve the linear system of equations resulting from the reversed-PE acquisitions [14,9]. TOPUP first estimates the underlying field, which is taken as a compact linear combination of spatial basis functions across the image domain [9]. Next, transformation matrices that act on BU/BD acquisitions are generated based on the estimated field. Finally, to generate the anatomically correct image, unwarping is performed on BU/BD acquisitions by incorporating Jacobian modulation to compensate for intensity pileups. A main limitation of this method is that it relies on iterative optimization techniques that are computationally intensive.
Another classical method for correcting susceptibility artifacts is B0 field-map-based correction, which requires at least two additional acquisitions with different TE values for computing the field based on phase differences. This field is then used to correct the distorted EPI images by unwarping in the image domain. However, erroneous field maps can elicit residual artifacts after correction, and phase unwrapping during field computation is prone to failure, especially in regions with high B0 inhomogeneity such as air/tissue and bone/soft-tissue interfaces [26]. Yet another classical method is registration-based correction, which requires an additional anatomical reference image to perform registration with the use of a cross-modal loss function [24]. A distortion-free T1- or T2-weighted image typically serves as an anatomical template for the EPI image in the presence of large distortions. Additional constraints are often incorporated to improve solutions, including diffusion tensors [27] and fiber orientation distributions [28], alignment of cortical surfaces [29], and synthesized anatomical images [30]. Popular implementations of the aforementioned methods provided in FSL are FUGUE and FLIRT, which perform B0 field-map-based correction and image-registration-based correction, respectively [31,32]. However, in addition to requiring auxiliary scans, these approaches fall short of capturing more intricate distortions or compensating for signal intensity variations [33]. Alternatively, methods based on PSF measurements have been proposed for analytical correction based on regularized deconvolution [34,18], where learning-based deconvolution methods can also be adopted to improve performance [35,36]. While PSF-based methods can correct a broad range of distortions in EPI images, they require voxel-wise PSF measurements via prolonged scans that must be repeated under notable changes in k-space trajectories [37].

Learning-based Methods
In recent years, learning-based approaches have been adopted as a promising alternative for correction of susceptibility artifacts in EPI. A first group of methods has aimed to improve the performance of classical methods via complementary data processing. Synthesis methods are applicable in cases where reversed-PE data are not available [38,39]. After an undistorted EPI image is synthesized from a structural MR image given as input, the synthesized and acquired EPI images are processed via TOPUP to unwarp the acquired image [38,39]. While suited for clinical data acquired under time limitations, synthesis methods can yield images with reduced resolution compared to those based on reversed-PE acquisitions. Fiber-orientation distribution (FOD) methods use latent features of FOD images extracted from DWI data to further improve TOPUP-based correction of reversed-PE images [40]. FOD methods incorporate additional anatomical information to improve performance in problematic regions such as the brainstem. However, they still rely on the relatively slow TOPUP correction. Learning-based correction with multi-shot EPI sequences has also been considered to help minimize the distortions in acquired images. Low-rank reconstruction of a multi-shot EPI sequence based on simultaneous multislab acquisition has been proposed for DWI [41]. Self-supervised denoising of a multicontrast multi-shot EPI sequence based on reversed-PE acquisitions has been proposed for T2, T2*, and susceptibility mapping [42]. Physics-driven reconstruction of an echo-shifting acquisition has been proposed for relaxometry along with B0 and B1 mapping [43]. Note that these methods involve advanced pulse-sequence modifications that may not be available at all sites, and often leverage TOPUP for estimation of field maps.
A second group of methods has instead aimed to improve computational efficiency over classical correction methods. A common framework in this domain relies on field estimation followed by unwarping of EPI images. Earlier studies considered supervised methods that train network models for correction assuming the availability of ground truth for undistorted EPI images [44,45,46]. These ground-truth images are typically obtained via simulations or from classical correction methods. Some supervised methods further cast estimation of the displacement field from a reversed-PE image pair as an optical-flow estimation problem, and later use the estimated field for correction [47,48]. Although supervised methods benefit from the data-driven learning capabilities of network models, reliance on the availability of undistorted EPI images limits their utility in many applications where such ground truth is unavailable. This has sparked interest in unsupervised methods that can learn to correct artifacts in the absence of ground truth. As in the case of classical methods, the predominant approach for unsupervised correction relies on reversed-PE acquisitions. Based on the assumption that displacements in non-PE directions are negligible [10], the displacement field is estimated so as to maximize the similarity of unwarped images obtained by reverse distortion of the acquired PE image pair. The recently proposed S-Net [20] utilizes a 3D U-Net model [19] to predict the field, followed by unwarping using bilinear interpolation inspired by the deformable image registration method VoxelMorph [49]. For unsupervised learning, S-Net uses a similarity loss taken as the local cross-correlation (LCC) between corrected BU/BD images, along with a diffusion regularizer to enforce field smoothness. Another recent method named Deepflow-Net [21] uses a 2D U-Net model where field estimates are produced at multiple resolutions by extracting features from various stages of the decoder [47,48]. Deepflow-Net performs correction via cubic interpolation and adopts a density compensation similar to TOPUP [9] to handle pileups. For unsupervised learning, Deepflow-Net uses as similarity loss the MSE between the corrected BU/BD images, along with a total variation regularizer to enforce field smoothness. While these seminal methods have produced promising results, they enable unsupervised learning by assessing the similarity of unwarped images in opposing PE directions. This indirect approach omits physics-driven constraints regarding the actual EPI measurements. Thus, performance of the learned correction can degrade under relatively large distortions and near tissue boundaries.
Here, we propose a novel unsupervised deep-learning method for artifact correction in EPI to improve performance. Unlike previous unsupervised methods, the proposed FD-Net method directly constrains fidelity to the actual EPI measurements. This constraint is introduced by integrating the forward physical model of EPI distortions observed on measured images, so FD-Net benefits from the enhanced reliability of physics-driven deep learning.

Proposed FD-Net
FD-Net is a novel unsupervised forward-distortion model that explicitly enforces measurement fidelity for enhanced correction performance, as outlined in Figure 2. The prediction unit, shown in Figure 2a, uses a 2D U-Net to produce both a predicted field and a predicted anatomically-correct image from the input reversed-PE images. In contrast to unwarping-based methods that produce separate unwarped images for BU/BD acquisitions, predicting a single correct image can offer SNR benefits analogous to the sensitivity-encoding approaches in parallel imaging [50]. The K-Unit in FD-Net, illustrated in Figure 2b, forward-distorts the predicted anatomically-correct image using the predicted field to reconstruct the input PE images. The BU acquisition is reconstructed using the estimated field, whereas the BD acquisition is reconstructed using the negative of the estimated field. Distortions are efficiently emulated using the K-Unit, which embodies a simple matrix multiplication with a separable formulation as in Equation (2). Afterwards, fidelity between reconstructed and measured data is enforced using a multiresolution scheme.
The rigid alignment unit in Figure 2c compensates for small movements between the input PE image in one direction (the BD acquisition in this case) and its corresponding forward-distorted image. This allows the network to focus on displacements that are due to off-resonance via the field-based formulation of the K-Unit.

Forward-Distortion with K-Unit
The K-Unit in FD-Net performs forward-distortion on the estimated anatomically-correct image using the estimated field, as illustrated in Figure 3. The steps described below are given for the BU direction for brevity, but they are similarly conducted for the BD direction, with the difference of utilizing the negative of the displacement field. First, a uniform spatial grid X_grid is formed:

X_grid = [1, 2, ..., n_PE]. (3)

The distorted grid, X_i,BU, is formed by determining the new grid location for each pixel from the shift amount given in the displacement field, i.e.,

X_i,BU = X_grid + [O_field]_i, i = 1, 2, ..., n_FE, (4)

where O_field is the estimated field output of FD-Net in units of pixels and i is the row index over the FE direction. For practical purposes, each entry in X_i,BU is kept limited between 1 and n_PE (i.e., clipped to the valid range of interpolation). Taking the difference between the two grids and then applying an interpolation kernel, κ(ξ), gives the K-matrix that acts on the i-th row:

[K_i,BU]_{p,q} = κ([X_i,BU]_q − [X_grid]_p). (5)

Using this K-matrix, the i-th row of the forward-distorted image is reconstructed via a matrix multiplication:

[O_dist,BU^T]_i = K_i,BU [O_image^T]_i, (6)

where (·)^T denotes matrix transpose, [·]_i denotes the i-th column of a matrix, and O_image is the predicted anatomically-correct image. Finally, the forward-distorted image O_dist,BU can be formed by stacking the individually distorted rows:

O_dist,BU^T = [ [O_dist,BU^T]_1 · · · [O_dist,BU^T]_{n_FE} ]. (7)

Note that multiplication with K-matrix rows performs an interpolation across pixel neighborhoods with intensity modulations, so it can emulate signal pileups/dropouts.
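The steps above can be sketched as follows, assuming a sinc interpolation kernel (as in the FD-Net implementation) and the column convention that each column of the K-matrix is a kernel centered at the distorted grid location; the names `forward_distort` and `polarity` are illustrative.

```python
import numpy as np

def forward_distort(o_image, o_field, polarity=+1):
    """Forward-distort a predicted image with a predicted field (in pixels),
    one FE row at a time; polarity=-1 emulates the BD acquisition."""
    n_fe, n_pe = o_image.shape
    x_grid = np.arange(1, n_pe + 1, dtype=float)                   # uniform PE grid
    out = np.zeros_like(o_image)
    for i in range(n_fe):
        x_dist = np.clip(x_grid + polarity * o_field[i], 1, n_pe)  # distorted grid, clipped
        # column q of the K-matrix is a sinc kernel centered at the distorted location
        K_i = np.sinc(x_dist[None, :] - x_grid[:, None])
        out[i] = K_i @ o_image[i]
    return out

rng = np.random.default_rng(0)
img = rng.random((4, 8))
assert np.allclose(forward_distort(img, np.zeros((4, 8))), img)  # zero field -> identity
```

A constant field of +1 pixel shifts intensity along PE, emptying the leading edge (dropout) and, due to grid clipping, summing two sources at the trailing edge (pileup).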

Network Architecture
The architecture of FD-Net is detailed in Figure 4. As depicted in Figure 4a, the encoder in the prediction unit projects the input reversed-PE images onto a latent representation across multiple stages. The receptive field is progressively refined by decreasing the kernel size and using convolution with stride 2 for downsampling. The decoder then resolves the predicted field and predicted image from the latent representation through multiple stages of convolutional filtering and upsampling. Feature maps from the encoder stages are projected onto the decoder through skip connections to improve information flow.
Rigid-body motion may occur between the BU and BD acquisitions. As shown in Figure 4b, the rigid alignment unit in FD-Net applies motion-related transformations on only one of the forward-distorted images (the BD distorted image in this case). This unit receives as input the measured BD acquisition along with the respective forward-distorted image, and uses convolutional and densely connected layers to predict the motion parameters s_x, s_y, and r, which capture the x-axis shift, y-axis shift, and in-plane rotation, respectively. These parameters are then used to apply a rigid transformation to the BD distorted image to improve its alignment with the corresponding BD acquisition. Note that a similar rigid alignment is also performed in TOPUP; it offloads some burden from the non-rigid field-based alignment by accounting for subject movement between the two reversed-PE acquisitions.
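Applying the predicted parameters can be sketched with an off-the-shelf resampler; the (row, column) coordinate convention and the rotation direction below are assumptions, and `rigid_align` is a hypothetical helper rather than the paper's implementation.

```python
import numpy as np
from scipy.ndimage import affine_transform

def rigid_align(image, sx, sy, r_deg):
    """Apply an in-plane rigid transform (x/y shifts in pixels, rotation in
    degrees about the image center) to a 2D image."""
    theta = np.deg2rad(r_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    center = (np.array(image.shape) - 1) / 2.0
    # affine_transform maps each output coordinate o to input coordinate R @ o + offset
    offset = center - R @ center - np.array([sy, sx])
    return affine_transform(image, R, offset=offset, order=1)

img = np.zeros((8, 8))
img[3, 3] = 1.0
shifted = rigid_align(img, sx=1, sy=0, r_deg=0)   # moves the impulse one pixel along x
```

With linear interpolation and integer shifts, the impulse moves without spreading, which makes the convention easy to verify on a test image.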
As illustrated in Figure 5, FD-Net adopts a multiresolution scheme to improve performance by enforcing consistency across different spatial resolutions, in principle leading to faster convergence and more reliable performance. In FD-Net, we refer to evaluation at different spatial scales as multiscale and at different spatial blurs as multiblur. For multiscale, field and anatomically-correct image estimates are produced at multiple spatial resolutions by extracting outputs from different stages of the decoder, using appropriate convolutional steps to produce the outputs at 1/2 and 1/4 scale in this case. For multiblur, the full-resolution outputs are blurred with Gaussian kernels at varying standard deviations: small, medium, and high blur amounts of σ_S = 0.5, σ_M = 1.5, and σ_H = 2.5 are used, respectively, with a Gaussian kernel size of 4σ_m in each case. In both cases, the estimates obtained at the multiple scales/blurs are passed through the K-Unit (Figure 2b) after proper scaling of their contribution to the overall loss function, contributing to the overall loss in a regularizing manner.
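The multiblur levels can be sketched as a separable Gaussian blur of the full-resolution output at the three stated standard deviations; `multiblur` and `gaussian_kernel1d` are illustrative helpers, and the kernel support of roughly 4σ follows the description above.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel with support of roughly 4*sigma."""
    half = max(1, int(round(2 * sigma)))
    x = np.arange(-half, half + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def multiblur(image, sigmas=(0.5, 1.5, 2.5)):
    """Return the full-resolution output followed by increasingly blurred
    copies (separable Gaussian blur), mimicking the S/M/H multiblur levels."""
    levels = [image]
    for sigma in sigmas:
        k = gaussian_kernel1d(sigma)
        blurred = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, image)
        blurred = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, blurred)
        levels.append(blurred)
    return levels
```

Each returned level keeps the full image dimensions, so all levels can be passed through the same K-Unit and compared against correspondingly blurred inputs.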

Network Loss
The overall loss function for FD-Net is given as:

L = Σ_m ω_m ( L_MSE^(m) + λ_m L_BE^(m) + L_valley^(m) ) + γ L_rigid, (8)

where m is the index of the multiresolution step, ω_m and λ_m are the weighting and field-smoothness regularization parameters for step m, superscript (m) denotes the version of a parameter at step m, and γ is the weight of the rigid alignment loss. Here, the first term denotes the sum of reconstruction losses, while the second term denotes the rigid loss, described in detail below.
First, L_MSE^(m) is the mean-squared error between the measured and forward-distorted images, averaged across the two PE directions at the m-th step:

L_MSE^(m) = (1/2) [ MSE(I_im,BU^(m), O_dist,BU^(m)) + MSE(I_im,BD^(m), O_dist,BD^(m)) ], (9)

where I_im,BU and I_im,BD are the input EPI images for BU and BD acquisitions, respectively. For the multiscale scheme, these images are downsampled appropriately to avoid aliasing artifacts.
Next, L_BE^(m) is the bending energy regularizer [51] over the field at each step m:

L_BE^(m) = Σ_{x,y} [ (∂²O_field^(m)/∂x²)² + 2 (∂²O_field^(m)/∂x∂y)² + (∂²O_field^(m)/∂y²)² ]. (10)

In practice, first- and second-order finite differences are used to approximate the gradients [52].
Then, L_valley^(m) is the valley loss for the field, which prevents the overall loss function from exploding in earlier training iterations [21]:

L_valley^(m) = mean( ReLU( |O_field^(m)| − τ_m ) ), (11)

where τ_m is a chosen threshold of maximum permissible field swing in units of pixels. Finally, L_rigid is the rigid loss, which seeks the smallest possible rigid-transformation parameters for the alignment of the measured and forward-distorted BD images:

L_rigid = s_x² + s_y² + r². (12)

Because the same rigid alignment applies to all multiresolution steps, a single rigid loss term is included in Equation (8).
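The loss terms can be sketched as follows. The exact forms of the valley and rigid losses are not fully specified in the text, so the ReLU-threshold and squared-parameter forms below are plausible assumptions consistent with the descriptions, not the paper's verbatim definitions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mse_loss(i_bu, i_bd, dist_bu, dist_bd):
    # consistency between measured and forward-distorted reversed-PE images
    return 0.5 * (np.mean((i_bu - dist_bu) ** 2) + np.mean((i_bd - dist_bd) ** 2))

def bending_energy(field):
    # second-order finite differences approximating the bending energy
    dyy = np.diff(field, n=2, axis=0)
    dxx = np.diff(field, n=2, axis=1)
    dxy = np.diff(np.diff(field, axis=0), axis=1)
    return np.mean(dyy**2) + 2 * np.mean(dxy**2) + np.mean(dxx**2)

def valley_loss(field, tau):
    # penalize field magnitudes beyond the permissible swing tau (in pixels)
    return np.mean(relu(np.abs(field) - tau))

def rigid_loss(sx, sy, r):
    # keep the predicted rigid alignment parameters as small as possible
    return sx**2 + sy**2 + r**2
```

In FD-Net these terms are weighted per multiresolution step and summed with a single rigid loss term.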

Experimental Dataset and Setup
For the experiments in this work, unprocessed DWI data from the HCP 1200 Subjects Data Release were used [25]. The images were acquired on a 3T MRI scanner (Siemens Skyra "Connectom"), using a multiband diffusion sequence with ss-EPI readouts in right-to-left (RL) and left-to-right (LR) reversed-PE polarities [53]. A total of 24 subjects were selected randomly from the HCP database, with 12 reserved for training, 4 for validation, and 8 for testing. For each subject, a single b0 volume consisting of 111 axial slices with a 168×144 image matrix was utilized. To obtain reference corrected images, the TOPUP method was applied to the data following the guidelines recommended by the toolbox.
All networks were implemented in Keras with a TensorFlow backend, on a machine with an NVIDIA RTX 3070 GPU.
Training was performed with the Adam optimizer at a learning rate of 10^−4 for a maximum of 1000 epochs, with early stopping when the change in validation loss between consecutive epochs fell below a threshold of 10^−6.
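The stopping rule can be sketched as a simple check on the validation-loss history; `early_stop` is an illustrative helper, not the actual Keras callback used in the experiments.

```python
def early_stop(val_losses, threshold=1e-6):
    """Return True when the change in validation loss between the two most
    recent epochs falls below the threshold (the stopping rule above)."""
    return len(val_losses) >= 2 and abs(val_losses[-1] - val_losses[-2]) < threshold

history = [0.50, 0.42, 0.41, 0.41]
print(early_stop(history))  # True: the last two epochs differ by < 1e-6
```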

FD-Net Implementation
The columns of the K-matrix in the K-Unit were generated using a sinc kernel, i.e., κ(ξ) = sinc(ξ). All convolutional layers in the encoder-decoder (i.e., U-Net) utilized Leaky Rectified Linear Unit (ReLU) activation with a slope coefficient α = 0.2, except at the final steps of the decoder as indicated in Figure 4a; the predicted image was output via a convolutional layer with ReLU activation, and the predicted field was output via a convolutional layer with linear activation. For the multiscale case, convolutional layers akin to the full-resolution case were employed to form the predicted field and image at 1/2 and 1/4 of the full scale. For the multiblur case, the full-resolution output was blurred with Gaussian kernels of standard deviation σ_m and width 4σ_m. Three different blur levels were used: small (S), medium (M), and high (H) blurs of σ_S = 0.5, σ_M = 1.5, and σ_H = 2.5, respectively.

Competing Methods
Two unsupervised learning-based methods, S-Net and Deepflow-Net, were implemented for comparison. In addition, a supervised method was implemented to serve as a baseline for FD-Net. Implementations of the competing methods were kept as consistent with FD-Net as possible to facilitate fair comparisons:

1. S-Net: S-Net was implemented using a 2D U-Net. Only the field head at the end of the decoder in Figure 4a was necessary, and correction was performed using a modified K-Unit approach as follows:

[O_unwarp,BU^T]_i = K_i,BU^T [I_im,BU^T]_i. (13)

Here, O_unwarp,BU denotes the unwarped BU image. By transposing K_i,BU, a standard unwarping interpolation was performed without density compensation; note that no density compensation was incorporated by Duong et al. [20] either. The BD acquisition was treated similarly, with the K-matrix formed after negation of the field. The average of the unwarped BU/BD images was taken as the corrected image. For training, the LCC of the unwarped BU/BD images was utilized as the similarity loss [20,49]. In place of the diffusion regularizer in [20], the bending energy from Equation (10) was used to facilitate comparison with FD-Net. In addition, the rigid alignment unit was utilized and the rigid loss from Equation (12) was incorporated.
2. Deepflow-Net: Deepflow-Net was implemented using a 2D U-Net. Only the field head at the end of the decoder in Figure 4a was needed, and density-compensated correction was performed based on a modified K-Unit approach. The K-matrix for the BU acquisition was multiplied with an image of 1's, 1, to produce a density pileup map W_BU. This map was inverted and used to weight the input PE image to enable density compensation akin to Zahneisen et al. [21]:

[O_unwarp,BU^T]_i = K_i,BU^T [ (I_im,BU ⊙ (1 ⊘ W_BU))^T ]_i, (14)

where [W_BU^T]_i = K_i,BU [1^T]_i. Here, ⊘ and ⊙ denote Hadamard division and product, respectively, and (1 ⊘ W_BU) is limited in [0, 1] to decrease the intensity in pileup regions [21]. The same process was also followed for the BD acquisition, with the K-matrix formed after negation of the field. In contrast to Equation (13), Equation (14) applies density compensation together with unwarping. The average of the two unwarped images was used as the corrected image. The same multiscale strategy as in FD-Net was adopted. The MSE between the unwarped BU/BD images was used as the similarity loss. In place of the TV regularization in Zahneisen et al. [21], the bending energy loss was applied for the field as in Equation (10). The rigid alignment unit was also incorporated along with its loss term.
3. Supervised Baseline: Finally, a supervised baseline was trained with an architecture identical to that of FD-Net, except that the loss was fully supervised. For this purpose, the MSE between the network-predicted field/image and the results from TOPUP was utilized.
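The density-compensated unwarping used for the Deepflow-Net comparison (item 2 above) can be sketched per FE line as follows; `unwarp_row` is an illustrative helper, and the clipping of the inverse pileup map follows the description in the text.

```python
import numpy as np

def unwarp_row(K_i, f_i, density_compensate=True):
    """Unwarp one EPI row with the transposed K-matrix; optionally weight the
    input row by the clipped inverse of the pileup map first."""
    if density_compensate:
        w = K_i @ np.ones(len(f_i))                         # pileup map: W = K 1
        inv_w = np.divide(1.0, w, out=np.zeros_like(w), where=w > 0)
        f_i = f_i * np.clip(inv_w, 0.0, 1.0)                # intensity reduced in pileup regions
    return K_i.T @ f_i                                      # transposed K-matrix unwarps

# identity K-matrix leaves the row unchanged
row = np.arange(5.0)
assert np.allclose(unwarp_row(np.eye(5), row), row)
```

For a K-matrix row with two nonzero entries (a pileup), the map halves the piled-up intensity before the transposed K-matrix redistributes it back to both source locations.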

Quantitative Assessments
The quality of the predicted images and fields was assessed via peak SNR (PSNR) and Structural Similarity Index Measure (SSIM) metrics, with the TOPUP results taken as reference. Before computing PSNR and SSIM, the field generated by each method was masked via a median Otsu threshold over the TOPUP image to remove background regions from consideration [54].
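The masked PSNR evaluation can be sketched as below; the threshold mask here is a simple stand-in for the median-Otsu brain mask, and `psnr` is an illustrative helper rather than the exact implementation used in the paper.

```python
import numpy as np

def psnr(ref, img, mask=None):
    """PSNR of img against a reference; an optional boolean mask restricts the
    comparison to foreground voxels."""
    if mask is not None:
        ref, img = ref[mask], img[mask]
    mse = np.mean((ref - img) ** 2)
    if mse == 0:
        return np.inf
    return 10 * np.log10(ref.max() ** 2 / mse)

rng = np.random.default_rng(2)
ref = rng.random((32, 32))
mask = ref > 0.5            # stand-in for a median-Otsu foreground mask
```

Masking prevents large empty background regions from inflating the similarity scores.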
For all methods, hyperparameters were chosen empirically to maximize PSNR and SSIM over the 4 subjects reserved as validation data. The selected hyperparameters are provided in Table 1. Performance assessments were reported on independent test data.
Table 1: Hyperparameter choices for the proposed FD-Net and the competing methods. For each method, irrelevant hyperparameters are marked with a dash (−). The hyperparameters considered are: λ for field-smoothness regularization, ω for the multiresolution weighting parameter, γ for the rigid loss, and τ for the valley-loss threshold. ω is split into its constituent full-resolution ("FR"), multiscale (1⁄2 and 1⁄4 scale), and multiblur (S, M, and H) components.

Computation Time
All competing methods provided a substantial computational advantage over TOPUP. Correction of a volume took on average ∼7.5 sec for each network considered. In contrast, TOPUP took on average ∼3086 sec (∼51.5 min) to predict the field and an additional ∼6 sec to apply the correction. Thus, network-based artifact correction enabled a significant speed-up over classical methods.

Ablation Studies for FD-Net
The choice of multiresolution strategy for FD-Net was first considered, followed by an ablation study on the combination of multiresolution components. The parameters were chosen empirically, with the purpose of maximizing quantitative image-quality metrics with respect to TOPUP over the predicted field/image. Lastly, an ablation study was conducted to evaluate the contribution of each loss term in Equation (8).

Multiresolution Ablation Study for FD-Net
FD-Net was trained and subsequently evaluated for each multiresolution strategy, alongside a strategy with no multiresolution. The performances of the multiscale and multiblur schemes, as well as their combination, were compared to determine the best multiresolution strategy. The hyperparameters chosen for each multiresolution scheme considered are provided in Table 2. PSNR and SSIM metrics are listed in Table 3. Overall, introducing a multiblur strategy provides a performance boost. Using the multiblur strategy, the image quality is improved by 0.67 dB PSNR/2.11% SSIM, and the field quality is improved by 1.68 dB PSNR/2.94% SSIM over the no-multiresolution case. In contrast, the multiscale strategy underperforms in comparison to both the multiblur and the no-multiresolution cases. A combination of the multiblur and multiscale strategies does not improve over the multiblur case either, indicating that multiblur alone is sufficient to boost performance. Hence, the multiblur strategy was selected for FD-Net.
Next, combinations of three different blur amounts were considered for the multiblur case, with hyperparameters as listed in Table 4. The results in Table 5 show that including two or more blur stages boosts performance. Incorporating all blur stages (i.e., the S-M-H multiblur combination) provides the best results, improving the image quality by 2.28 dB PSNR/5.23% SSIM and the field quality by 5.99 dB PSNR/8.80% SSIM over the worst-performing M multiblur scheme. Hence, the S-M-H multiblur combination was chosen for FD-Net, as it provides reliable generalization by incorporating all blur stages.

Table 4: Hyperparameter choices for the multiblur-scheme ablation study for FD-Net. Combinations over three different blur amounts are considered: small (S), medium (M), and high (H) blurs of σ_S = 0.5, σ_M = 1.5, and σ_H = 2.5, respectively. For each combination, excluded blurs are marked with a dash (−). The multiresolution weighting parameter, ω, is split into its constituent weights for full-resolution ("FR") and multiblur (S, M, and H) components. The other hyperparameters are kept fixed: λ for field-smoothness regularization, γ for the rigid loss, and τ for the valley-loss threshold.

Loss Ablation Study for FD-Net
An ablation study was conducted by removing one loss term at a time from Equation (8) to investigate its contribution to the overall performance. Additionally, a version using only the MSE loss term (i.e., L_MSE) was provided for reference. The results in Table 6 indicate that the proposed FD-Net provides the best overall performance. Removal of the rigid loss slightly decreases the predicted field quality, while removal of the valley loss slightly decreases the predicted image quality. Removal of the bending energy loss has the most detrimental effect on performance, leading to a significant drop in PSNR and SSIM down to the level of the MSE-only case. The proposed FD-Net improves the image quality by 0.27 dB PSNR/1.11% SSIM and the field quality by 1.21 dB PSNR/1.74% SSIM over the MSE-only case.
Table 6: Performance comparison results for the loss ablation study for FD-Net. PSNR and SSIM metrics are reported as mean (SD) across subjects. Removal of a loss component is indicated by the "\" symbol followed by the removed loss term in curly braces. The full version of the loss is chosen for FD-Net as it provides the best overall performance.


Comparison with Competing Methods
Comprehensive quantitative evaluations and visual assessments of the proposed FD-Net and the competing methods were conducted with respect to the reference TOPUP results.
Slice-Wise Evaluations: Figure 6 demonstrates the performance of each method across different slices of the dataset. Since the dataset captured the same anatomy at the same orientation for all subjects, a given slice number corresponds to approximately the same anatomical location in all subjects. Hence, no additional intersubject registration was conducted for this analysis. The underlying anatomy is illustrated in Figure 6a for a particular subject, where the T1-weighted volume was registered to the corresponding b0 volume for display purposes using FSL's FLIRT [31,32]. The results in Figure 6b show that all methods have dips/peaks in performance at the same slice indices, providing insight into which slices are more/less challenging in terms of distortion correction. FD-Net outperforms all competing methods in terms of predicted image quality, especially at the problematic lower brain slices where severe distortions are present. Moreover, the predicted field quality from FD-Net exceeds that of the competing methods, except for the supervised baseline.
It should be noted that while the supervised baseline is able to match the TOPUP field better, it performs the worst in terms of predicted image quality.
Subject-Wise Evaluations: The performance of each method was assessed over all slices in the volume of a given subject, for each of the 8 subjects reserved for testing. Figure 7 gives scatter plots of mean PSNR and mean SSIM of FD-Net versus each competing method for each subject, enabling a direct one-to-one performance comparison. In terms of image quality, FD-Net dominates the competing methods, including the supervised baseline. While S-Net matches FD-Net in SSIM for the predicted image, it lags behind in PSNR. As for predicted field quality, FD-Net is second only to the supervised baseline, which was trained to directly fit the results from TOPUP.
Overall Performance Evaluations: The quantitative results in Table 7 summarize the overall performance of each method across all subjects. FD-Net boosts image quality by 2.21 dB PSNR/4.01% SSIM compared to Deepflow-Net and 1.37 dB PSNR/0.27% SSIM compared to S-Net. Field quality is boosted by 4.24 dB PSNR/6.11% SSIM compared to Deepflow-Net and 2.03 dB PSNR/1.49% SSIM compared to S-Net. Compared to the supervised baseline, FD-Net substantially boosts image quality by 1.97 dB PSNR/13.54% SSIM, at a cost of 1.00 dB PSNR/3.61% SSIM in field quality.
Visual Assessments: To visually compare the qualities of the predicted images and fields, example results from the slices marked in Figure 6a are provided in Figure 8 for a lower brain slice and Figure 9 for an upper brain slice. These slices were chosen to represent the most and least challenging slices, corresponding to the dip and peak in PSNR in Figure 6b, respectively. The error maps, as well as visual inspection of the predicted image and field, indicate that FD-Net outperforms the other methods. This is especially true for the problematic lower brain slice shown in Figure 8, where large distortions are present. The upper brain slice in Figure 9 exhibits less severe distortions, posing a less challenging problem for all methods. In both cases, the predicted images from FD-Net have higher overall similarity to the TOPUP-corrected image, with fewer artifacts than the other methods. The field results also demonstrate that FD-Net produces the highest-fidelity field, with smoothness and details preserved in a coherent manner. Additionally, the forward-distorted images generated by FD-Net closely match the input distorted images for both the lower and upper brain slices, as shown in Figure 10 and Figure 11, respectively.

Discussion
In this work, we have proposed a deep forward-distortion model for unsupervised correction of susceptibility artifacts in EPI. FD-Net is based on a multiresolution network model that estimates a single anatomically correct image along with a displacement field, given a pair of reversed-PE acquisitions. Unsupervised learning is achieved by forward-distorting the anatomically correct image with the field and enforcing consistency of the forward-distorted estimates with the input BU/BD acquisitions. Our results indicate that this forward-distortion approach improves estimation fidelity for both the corrected image and the field across a broad range of cross-sections in the brain. FD-Net outperforms competing unsupervised methods in image and field quality. It also achieves higher image quality than the supervised baseline, while maintaining competitive field quality.
Unwarping-based methods rely on similarity losses between corrected BU/BD images to enable unsupervised learning. As these losses are expressed in an inaccessible domain for which no explicit measurements are available, the resultant models can perform suboptimally under large displacements or intensity mismatches. In particular, S-Net uses LCC between corrected images. As a cross-modal similarity measure, LCC is known to be tolerant against intensity mismatches [55], but it places higher emphasis on global features, which can incur spatial blur in field estimates. In turn, overly smooth field estimates and the lack of density compensation in S-Net can limit its performance in regions of large displacements with abrupt susceptibility changes, particularly near the sinuses and ear canals. To improve reliability against large displacements, Deepflow-Net performs density compensation by estimating pileups via linear interpolation of the grid point density map [21]. However, the MSE loss that it adopts to measure similarity between corrected BU/BD images can lower tolerance against intensity mismatches and induce spatial blur in image estimates. In contrast to unwarping-based methods, the proposed FD-Net leverages a forward-distortion approach based on the K-matrix formulation, where density compensation is not needed. For unsupervised learning, it uniquely measures the similarity between forward-distorted images, emulated from estimates of the anatomically correct image and the field, and the acquired BU/BD images. As such, the similarity loss is expressed in the actual measurement domain, which can improve the performance and reliability of FD-Net, as suggested by our experimental results. Quantitative assessments of field quality indicate that the supervised baseline provides a closer match to the TOPUP-estimated displacement field than FD-Net. Yet, the apparent differences are relatively modest based on visual comparisons. On the other hand, FD-Net achieves a notable boost in image quality over the supervised baseline, which is best attributed to the physics-based forward-distortion approach in FD-Net contributing to generalization performance [23].
Here, we implemented all unsupervised correction methods by including a rigid loss, for consistent and fair comparisons with FD-Net. Based on Table 6, we observe that the rigid loss slightly influences image quality but achieves a modest boost in field quality. This improvement can be attributed to the benefits of spatial registration in accounting for possible patient motion. The empirical benefits of the rigid loss are expected to become more prominent with increasing levels of motion. We also observe a modest improvement in image quality from inclusion of the valley loss. This benefit can be attributed to enhanced performance in regions of high field inhomogeneity, achieved by avoiding unrealistically large displacements. Similarly, we observe that the bending energy loss, which enforces field smoothness, is critical to the performance of FD-Net.
As is common in deep-learning methods, the trained weights of the FD-Net model are kept fixed during inference. For models trained on limited datasets, this may result in suboptimal generalization to atypical anatomy. In such cases, subject-specific optimization of model weights during inference might improve generalization at the expense of prolonged inference times [56,57]. Here, modules within FD-Net were implemented with convolutional backbones given their training efficiency. To improve sensitivity to long-range context in brain images, self-attention-based transformer backbones can be adopted [58]. In the current study, all deep-learning models were trained from scratch on relatively modest-sized datasets including only 12 subjects. In applications where training data are scarce, network modules can first be pre-trained on large public datasets and later fine-tuned on application-specific target datasets [59]. Lastly, we assumed here that only reversed-PE images are available as inputs to FD-Net. In cases where additional measurements are viable to capture the field map and/or PSF, FD-Net could be modified to integrate these measurements for improved performance.
It is worth noting that the extent of susceptibility artifacts in EPI can also be reduced by modifying the imaging procedure. For example, methods such as parallel imaging [50,60] or reduced-FOV imaging [61,62] decrease sensitivity to field inhomogeneities by encoding a smaller FOV in the PE direction during data acquisition. Similarly, multi-shot EPI [63], such as interleaved EPI [64], can also be performed to reduce field sensitivity. While powerful, these acquisition-based methods still require additional distortion correction in postprocessing. The proposed FD-Net is compatible with this class of methods, as long as a reversed-PE acquisition is performed during imaging.

Conclusions
In this work, we introduced a novel deep-learning approach for efficient and performant correction of susceptibility artifacts in EPI. The proposed FD-Net estimates an anatomically correct image and a displacement field map. It achieves unsupervised learning by leveraging a forward-distortion model to enforce consistency of the estimates with measured reversed-PE images. FD-Net performs competitively with the reference TOPUP method, while offering notably faster inference as a deep-learning approach. It also outperforms recent unsupervised correction methods that enforce similarity of unwarped reversed-PE images. Therefore, FD-Net holds great promise for susceptibility-artifact correction in EPI applications.

Figure 1: Illustration of the image distortion characterized by the K-matrix. (a) The estimated displacement field (in units of pixels) and (b) the corrected image predicted by TOPUP are shown, with the magenta dashed lines highlighting a particular row along the PE direction (RL direction). (c) The K-matrix formed from the field for the highlighted row and (d) the corresponding blip-up EPI image. The deviations of the K-matrix from the identity matrix indicate the amount and direction of distortion, as can be understood by comparing the corrected image and the blip-up image for the highlighted row. The labeled axes correspond to the PE direction.
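The row-wise action of the K-matrix described in this caption can be sketched in one dimension: each corrected-image pixel is displaced by the field and its intensity split between neighboring grid points, so a zero field yields the identity matrix. A minimal sketch of this idea (the `k_matrix` helper is hypothetical, assuming linear interpolation; not the paper's implementation):

```python
import numpy as np

def k_matrix(field):
    """Build a K-matrix for one PE line: pixel j of the corrected
    image lands at j + field[j] in the distorted image, with its
    intensity split between neighboring grid points."""
    n = len(field)
    K = np.zeros((n, n))
    for j in range(n):
        pos = j + field[j]
        i0 = int(np.floor(pos))
        w = pos - i0
        if 0 <= i0 < n:
            K[i0, j] += 1 - w          # weight on lower neighbor
        if 0 <= i0 + 1 < n:
            K[i0 + 1, j] += w          # weight on upper neighbor
    return K

print(np.allclose(k_matrix(np.zeros(6)), np.eye(6)))  # → True

# A spatially varying field produces pileups/dropouts when applied
# to a uniform corrected line:
distorted = k_matrix(np.array([0, 0, 0.5, 0.5, 0, 0])) @ np.ones(6)
print(distorted)  # intensity dips where the line stretches, peaks where it compresses
```

Note that each column of this K-matrix sums to one inside the FOV, so total signal is conserved, which is why no separate density compensation is needed in the forward-distortion view.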

Figure 2: Overview of the proposed FD-Net. (a) The input distorted blip-up/blip-down images are fed through an encoder-decoder in the prediction unit, which outputs a predicted image and a predicted field with optional multiresolution (multiscale and/or multiblur) schemes. The field is used to formulate the bending energy loss and valley loss. (b) The K-Unit applies forward-distortion in each PE direction, with the field negated for one of the directions. (c) A rigid alignment unit is included to improve registration, with the rigid loss formulated from the transformation parameters. The forward-distorted images are compared with the input images (redisplayed here for convenience) to formulate the MSE loss. Training is performed over the aggregate of the shown losses.

Figure 3: Example of forward-distortion using the K-Unit in FD-Net. (a) The input blip-up and blip-down EPI images are compared with the forward-distortion results of FD-Net. The intensities in the absolute error maps are scaled up 2.5× for improved visualization. (b) The predicted field and predicted image outputs from FD-Net, which are input to the K-Unit to obtain the forward-distorted images in (a).

Figure 4: Details of the network architecture of FD-Net. (a) The encoder-decoder in the prediction unit is outlined with blocks representing the convolutional steps. Convolution with stride 2 is used to reduce dimensionality in the encoder steps, while upsampling by 2 is used to increase it in the decoder steps. Skip connections are introduced to facilitate information flow and improve gradient propagation by concatenating the encoder representations to the corresponding stages in the decoder. The numbers inside the boxes denote feature dimensions, with the numbers in brackets indicating filter kernel sizes. Leaky ReLU activation with a slope coefficient of 0.2 is used, unless otherwise indicated. (b) A rigid alignment unit is used to align one of the forward-distorted images (the blip-down distorted image in this case) to its corresponding input EPI image. The first stage encodes the images using convolutional layers with stride 2, shown as boxes with the numbers inside indicating the feature dimensions. The output of the convolutional stage is flattened and passed through a dense layer with 32 neurons and another with 3 neurons. The final output comprises the 3 parameters required for the rigid transformation to be applied to the forward-distorted image.
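Assuming the 3 rigid parameters are an in-plane rotation angle and two translations (a common 2D rigid parameterization; the caption does not state it explicitly), the transform applied to the forward-distorted image can be sketched as:

```python
import numpy as np

def rigid_matrix(theta, tx, ty):
    """2x3 affine matrix for an in-plane rigid transform:
    rotation by theta (radians) followed by translation (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

# Applying the transform to a homogeneous pixel coordinate (x, y, 1)
M = rigid_matrix(0.0, 2.0, -1.0)          # pure translation
print(M @ np.array([1.0, 0.0, 1.0]))      # point shifted by (2, -1)
```

In practice the predicted parameters would drive a differentiable resampling layer, so the rigid loss on these 3 parameters can be backpropagated through the alignment.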

Figure 5: Illustration of two different multiresolution approaches, multiscale and multiblur, considered for FD-Net. The last stages of the decoder that generate the full-resolution image and field are also shown for clarity. The multiscale approach relies on forming an output image and field at a lower-dimensional scale, using appropriate convolutional steps to produce the outputs at 1/2 and 1/4 scale in this case. The numbers inside the boxes denote feature dimensions, with the numbers in brackets indicating filter kernel sizes. Leaky ReLU activation with a slope coefficient of 0.2 is used, unless otherwise indicated. For the multiblur case, Gaussian blur kernels are applied to the full-resolution outputs to create increasingly blurred results. Small, medium, and high blur amounts of σ_S = 0.5, σ_M = 1.5, and σ_H = 2.5 are used, respectively, with a Gaussian kernel size of 4σ_m in each case. The results of all incorporated multiresolution levels are then passed through the K-Unit shown in Figure 2, contributing to the overall loss in a regularizing manner.
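The multiblur outputs described in this caption can be emulated by convolving the full-resolution result with Gaussian kernels of widths σ_S, σ_M, and σ_H, each truncated to a size of about 4σ. A minimal separable-blur sketch (illustrative helpers, not the paper's code):

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel of total width about 4*sigma, normalized."""
    r = max(1, int(round(4 * sigma)) // 2)   # half-width ≈ 2*sigma
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur2d(img, sigma):
    """Separable Gaussian blur via two 1D convolutions."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, tmp)

# Build the S/M/H multiblur pyramid from one full-resolution output
img = np.random.default_rng(0).random((32, 32))
pyramid = {s: blur2d(img, s) for s in (0.5, 1.5, 2.5)}
```

Each blurred level would then pass through the K-Unit, so the loss at coarse blur levels regularizes the field estimate while the full-resolution level preserves detail.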
L_valley^(m) sums the excess field-swing magnitude when it exceeds τ_m. These cases are penalized heavily by weighting L_valley^(m) with a large constant in Equation (8). In later stages of training, the effect of L_valley^(m) is negligible once the network converges toward reasonable solutions.
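Under this description, the valley loss reads as a hinge penalty on the field-swing magnitude beyond the threshold τ_m; a minimal sketch of that reading (our interpretation, not the exact implementation):

```python
import numpy as np

def valley_loss(field_swing, tau):
    """Sum of the excess magnitude of field-swing values beyond tau;
    values within [-tau, tau] contribute nothing."""
    excess = np.abs(field_swing) - tau
    return np.maximum(excess, 0.0).sum()

swing = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(valley_loss(swing, tau=2.0))  # only |-3| and |4| exceed τ: (3-2) + (4-2) = 3.0
```

Because the penalty is zero inside the threshold, it only fires on unrealistically large displacements, which is consistent with its effect fading as training converges.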

Figure 6: Slice-wise performance comparison of FD-Net and competing methods. (a) An example T1-weighted image registered to the b0 volume of an individual subject to illustrate the anatomical locations corresponding to the slice indices. Magenta dashed lines indicate the locations of more challenging (lower line) and less challenging (upper line) slices in terms of distortion correction (see visual results in Figure 8 and Figure 9). (b) PSNR (top row) and SSIM (bottom row) metrics for the predicted image (left column) and predicted field (right column). Results are shown for FD-Net and competing methods as a function of slice index. For each method, the mean metric is shown along with the 95% confidence interval.

Figure 7: Subject-wise performance comparison of FD-Net against competing methods. PSNR (top row) and SSIM (bottom row) metrics for the predicted image (left column) and predicted field (right column). Metrics are averaged across slices within individual subjects, and the mean metrics for the 8 test subjects are displayed as scatter plots. The vertical axis denotes FD-Net performance, whereas the horizontal axis denotes competing-method performance (see legend). Results above the dashed identity lines indicate superior performance by FD-Net.

Figure 8: Visual results for FD-Net and competing methods from a lower brain slice, corresponding to a more challenging location in terms of distortion correction. TOPUP results are taken as reference. (a) Predicted images and absolute error maps with respect to TOPUP. The error maps are scaled by 1.25× to a visibly discernible display window. (b) Predicted fields and the masked error maps with respect to TOPUP. The error maps were masked via a median Otsu threshold over the TOPUP image to remove the background regions. See the lower magenta dashed line in Figure 6a for the anatomical location of this slice.

Figure 9: Visual results for FD-Net and competing methods from an upper brain slice, corresponding to a less challenging location in terms of distortion correction. TOPUP results are taken as reference. (a) Predicted images and absolute error maps with respect to TOPUP. The error maps are scaled by 1.25× to a visibly discernible display window. (b) Predicted fields and the masked error maps with respect to TOPUP. The error maps were masked via a median Otsu threshold over the TOPUP image to remove the background regions. See the upper magenta dashed line in Figure 6a for the anatomical location of this slice.

Figure 10: Visual results for the forward-distorted images from FD-Net for a lower brain slice. The input blip-up and blip-down EPI images are compared with the results of forward-distortion in FD-Net. The error maps are scaled by 2.5× to a visibly discernible display window. See the lower magenta dashed line in Figure 6a for the anatomical location of this slice.

Figure 11: Visual results for the forward-distorted images from FD-Net for an upper brain slice. The input blip-up and blip-down EPI images are compared with the results of forward-distortion in FD-Net. The error maps are scaled by 2.5× to a visibly discernible display window. See the upper magenta dashed line in Figure 6a for the anatomical location of this slice.

Table 2: Hyperparameter choices for the multiresolution strategy ablation study for FD-Net. For each multiresolution strategy, irrelevant hyperparameters are marked with a dash (−). The multiresolution weighting parameter, ω, is split into its constituent weights for the full resolution ("FR"), multiscale (1/2 and 1/4 scale), and multiblur (S, M, and H) components. The other hyperparameters are kept fixed: λ for field smoothness regularization, γ for the rigid loss, and τ for the valley loss threshold.

Table 3: Performance comparison of multiresolution strategies in FD-Net. PSNR and SSIM metrics are reported as mean (SD) across subjects. Bold font denotes the best-performing strategy. The multiblur strategy is chosen.

Table 5: Performance comparison of multiblur schemes for FD-Net. PSNR and SSIM metrics are reported as mean (SD) across subjects. Combinations over three different blur amounts are considered: small (S), medium (M), and high (H) blur. Bold font denotes the best-performing combination. The S-M-H multiblur combination is chosen as the multiresolution scheme for FD-Net due to its superior performance.

Table 7: Performance comparison of FD-Net and the competing methods. PSNR and SSIM metrics are reported as mean (SD) across subjects. Bold font denotes the best-performing method.