CG-SENSE revisited: Results from the first ISMRM reproducibility challenge

Purpose: The aim of this work is to shed light on the issue of reproducibility in MR image reconstruction in the context of a challenge. Participants had to recreate the results of "Advances in sensitivity encoding with arbitrary k-space trajectories" by Pruessmann et al. Methods: The task of the challenge was to reconstruct radially acquired multi-coil k-space data (brain/heart) following the method in the original paper, reproducing its key figures. Results were compared to consolidated reference implementations created after the challenge, accounting for the two most common programming languages used in the submissions (Matlab/Python). Results: Visually, differences between submissions were small. Pixel-wise differences originated from image orientation, assumed field-of-view or resolution. The reference implementations were in good agreement, both visually and in terms of image similarity metrics. Discussion and Conclusion: While the description level of the published algorithm enabled participants to reproduce CG-SENSE in general, details of the implementation varied, e.g., density compensation or Tikhonov regularization. Implicit assumptions about the data led to further differences, emphasizing the importance of sufficient meta-data accompanying open data sets. Defining reproducibility quantitatively turned out to be non-trivial for this image reconstruction challenge, in the absence of ground-truth results. Typical similarity measures like NMSE or SSIM were misled by image intensity scaling and outlier pixels. Thus, to facilitate reproducibility, researchers are encouraged to publish code and data alongside the original paper. Future methodological papers on MR image reconstruction might benefit from the consolidated reference implementations of CG-SENSE presented here, as a benchmark for methods comparison.


| INTRODUCTION
Over the past decades, MRI has experienced a strong shift towards an algorithmic perspective, owing to the increased computational power of standard computers, leading to the invention and development of numerous reconstruction methods. This is reflected in the tremendous increase of publications registered on PubMed that involve 'MRI' and either 'reconstruction' or 'fitting' over the last two decades (see Figure 1). The peak of 3354 publications in 2018 amounts to an average of 9 papers per day. Typically, computational innovation in these papers is demonstrated by comparing novel methods to established algorithms in the field via suitable quality metrics.
One of the fundamental computational approaches to image reconstruction is parallel imaging, i.e. the idea of using a-priori knowledge about multiple receiver coil sensitivities to accelerate scans. Image reconstruction then shifts from a simple Fourier transform, which may optionally include a gridding step for non-Cartesian data, to solving a more complex inverse problem based on a matrix equation of image encoding, as proposed in a general form in Pruessmann et al. [1] and commonly referred to as "conjugate gradient SENSE" (CG-SENSE). Many image reconstruction papers published thereafter refer to this standard algorithm, often performing direct comparisons to prove the efficacy of their method. However, no commonly agreed-on reference implementation of the CG-SENSE algorithm is readily available. Therefore, these comparisons to the 'SENSE' method are mere comparisons to one version of it, be it custom implementations, implementations based on openly available image reconstruction toolboxes, or even black-box implementations provided by the scanner vendors. This lack of a reference implementation reflects a fundamental problem of research reproducibility in the MR image reconstruction domain.
The reproducible research study group (RRSG) of the ISMRM aims to enhance reproducibility by facilitating fair and simple comparisons to existing algorithms. However, comparing novel algorithms to re-implementations of published work without having access to the original code can lead to wrong conclusions. Often, algorithmic details are not fully reported in publications, and small deviations of the input parameters can lead to strong differences in the output, regularly degrading the performance of the existing method; this is a general problem faced by the scientific community [2,3,4,5,6,7,8]. To that end, the RRSG announced a reproducibility challenge in April 2019 as part of the Annual Meeting of the ISMRM in Montreal. The goal was to reproduce the core findings of the paper "Advances in sensitivity encoding with arbitrary k-space trajectories" by Pruessmann et al. [1], solely based on the description available in the paper, and to converge towards a reference implementation accessible to the community.
Participants were required to reproduce the main figures of the original paper given two fully-sampled radial brain and heart datasets. Signal and trajectory data were supplied, but neither sensitivity maps nor noise covariance scans. No programming language restrictions were imposed, as long as the source code was shared and the computational results could be reproduced. The detailed instructions can be found in the corresponding ISMRM blog post.
In this work, we present the outcome of this initiative, compare the different submissions and discuss potential problems in reproducing the findings of a scientific paper solely from the manuscript. Furthermore, we consolidated the submissions from the participating groups into two reference implementations (Python and Matlab), which are available online in the ISMRM git repository and could serve as a benchmark for future publications seeking comparison to CG-SENSE. The reference implementations will be discussed in more detail in section 2.6, specifically focusing on critical points in the implementation. The main features of each submission will be shown in section 2.5 and differences regarding implementation details and possible sources of deviations to the reference implementations will be discussed.
Finally, recommendations for conducting reproducible research and future challenges are given.

OLIVER MAIER ET AL.

| Design of the first RRSG challenge
Since this was the first ever reproducibility challenge by the study group, we designed it around a rather simple premise to encourage submissions from the community. This started with the choice of the paper. We wanted a paper that is seminal in our field and whose authors did not already provide a reference implementation themselves. We wanted to be able to provide all the data ourselves and not rely on any closed-source or proprietary software for any step of the data processing. We also wanted a paper where we expected the results of the challenge to be uncontroversial. In fact, we expected that the submissions of the participants would successfully reproduce the main results of the paper without showing fundamental differences, but would still reveal some interesting differences from which we could learn about reproducibility issues. Finally, since one of the goals of this initiative is to build up a library of standard reference implementations that can be used for comparison when publishing new methods, we wanted to cover a method that is commonly used as a reference by MRM authors. A second design choice was the timeline. We wanted the turnaround of the participants to be relatively quick, because we wanted to see how well the paper can be reproduced within a time-frame of a couple of weeks. In particular, we announced the challenge and provided the material on March 28th 2019, and set the deadline for submissions to May 1st 2019.
In the rest of this section, we provide a brief review of the CG-SENSE method introduced in Pruessmann et al. [1], a detailed description of the data used for the challenge, an overview of the individual submissions and, finally, a description of the consolidated reference implementations that were developed after the conclusion of the challenge.

| CG-SENSE
Throughout this work let n_x × n_y denote the image dimensions in pixels and n_c the number of receive coils. For simplicity, assume that n_x = n_y = n. The total number of k-space samples, i.e. the number of read-outs times the number of read-out points, is denoted as n_k. Reconstructing an image v ∈ ℂ^(n²) from undersampled data m ∈ ℂ^(n_k n_c) acquired with n_c receive coils is an inverse problem following the linear encoding process

m = Ev, (1)

with E : ℂ^(n²) → ℂ^(n_k n_c) being the linear encoding matrix, mapping from image space to k-space [9,1]. The encoding matrix E describes the whole MRI acquisition pipeline, consisting of the coil sensitivity profiles S_γ and the Fourier transformation combined with the sampling operator, i.e. the non-uniform FFT (NUFFT). Assuming Gaussian noise in the acquired k-space data, the solution that is optimal with respect to the signal-to-noise ratio (SNR) is the least-squares solution

v = (E^H E)^(−1) E^H m, (2)

with ^H denoting the Hermitian transpose. As the explicit inverse of E^H E is computationally demanding, the problem is typically solved in an iterative fashion. Optionally, the conditioning of the matrix inversion can be improved by adding a small constant value λ ≥ 0 to the diagonal, E^H E + λI, with I being the identity matrix. This type of modification is typically referred to as Tikhonov regularization [10]. For λ = 0 the problem reduces to ordinary least squares. A numerically fast method to solve equation (2) is given by the conjugate gradient (CG) algorithm, outlined in Algorithm 1. A full description of the CG algorithm can be found in Shewchuk [11]. The CG algorithm can be applied to problems of the form of equation (1), but requires a positive (semi-)definite matrix E. This requirement cannot be guaranteed for arbitrary encoding matrices E. To this end, the CG algorithm is applied to the normal equation

(E^H E + λI) v = E^H m, (3)

which yields the least-squares solution defined by equation (2).
Another advantage of the normal equation is that it has a positive (semi-) definite operator (E H E + λI) by definition. Thus, the requirements for the CG algorithm are met.
Algorithm 1: The Conjugate Gradient algorithm.
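The CG iteration referenced as Algorithm 1 can be sketched as follows. This is a generic conjugate-gradient solver for the normal equation with an optional Tikhonov term, not a transcription of the paper's listing; the dense matrix E is an illustrative stand-in for the NUFFT-based encoding operator used in practice.

```python
import numpy as np

def cg_normal_equations(E, m, n_iter=10, lam=0.0, tol=1e-6):
    """Solve (E^H E + lam*I) v = E^H m with conjugate gradients.

    E is a dense matrix standing in for the encoding operator; in a real
    CG-SENSE implementation the products E @ p and E.conj().T @ (...)
    would be NUFFT-based operator calls instead of matrix products.
    """
    b = E.conj().T @ m                       # right-hand side E^H m
    v = np.zeros(E.shape[1], dtype=complex)
    r = b.copy()                             # residual of the normal equation
    p = r.copy()                             # search direction
    rs_old = np.vdot(r, r).real
    for _ in range(n_iter):
        Ap = E.conj().T @ (E @ p) + lam * p  # apply (E^H E + lam*I)
        alpha = rs_old / np.vdot(p, Ap).real
        v += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v
```

For a consistent system (m = Ev for some v), the iteration recovers v up to the conditioning of E^H E.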
If the noise correlation between receive coil channels can be estimated, e.g. from a separate noise scan, the coils and data can be pre-whitened to account for the correlation between different channels. This process creates virtual coils which can be used in the CG algorithm instead of physical coils without requiring any other modifications [1].
The conditioning of the problem and, thus, the convergence speed of the algorithm can be improved by including a density compensation function in the reconstruction pipeline. This accounts for the typically non-uniform sample density around the k-space center of non-Cartesian sampling strategies. The diagonal density compensation matrix D can be included in the encoding matrix E by Ē = D^(1/2) E, with the data weighted accordingly, weighting each k-space signal by its spatial density. It should be noted that this introduces a weighting in the data consistency term, which then deviates from the noise-optimal least-squares formulation. Intensity correction of the coil sensitivity profiles, expressed by a diagonal matrix I, can be included in analogy by solving for the intensity-corrected image I^(−1) v. Substituting E with Ē in equation (3) gives the density- and intensity-corrected image reconstruction problem. After convergence, it remains to apply the intensity correction I to I^(−1) v to obtain the final reconstruction result v, which reduces to a point-wise division in image space.
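Whether the density-weighted operator still forms a proper adjoint pair for CG can be verified numerically. Below, a random matrix stands in for E (an illustrative assumption) and the square-root weighting Ē = D^(1/2) E is checked via the adjointness identity ⟨Ēx, y⟩ = ⟨x, Ē^H y⟩:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nk = 16, 64                                # toy image pixels and k-space samples

# Toy "encoding" matrix standing in for E (NUFFT combined with coil profiles).
E = rng.standard_normal((nk, n)) + 1j * rng.standard_normal((nk, n))
d = rng.uniform(0.1, 1.0, nk)                 # toy density compensation weights (diag of D)

# Density-weighted operator: E_bar = D^(1/2) E.
E_bar = np.sqrt(d)[:, None] * E

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = rng.standard_normal(nk) + 1j * rng.standard_normal(nk)

# Adjointness test: <E_bar x, y> must equal <x, E_bar^H y>.
lhs = np.vdot(E_bar @ x, y)
rhs = np.vdot(x, E_bar.conj().T @ y)
assert np.allclose(lhs, rhs)
```

The same test is what several submissions implicitly rely on when they multiply the raw data by the square root of the density compensation.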

| Non-Uniform Fast Fourier Transform (NUFFT)
If measured k-space points are acquired on a non-Cartesian grid, modifications to the standard FFT are necessary. The main steps involve:
• Density compensation (optional).
• Gridding the non-Cartesian k-space onto a regular but oversampled grid, usually done with one of the following approaches:
  - Convolution with a pre-computed kernel; most common are Kaiser-Bessel based kernels [12].
  - Min-max interpolation following Fessler and Sutton [13].
• Standard FFT of the now Cartesian data.
• Deapodization, accounting for intensity variations due to the convolution with the gridding kernel.
• Cropping to the desired Field-of-View (FOV).
These steps are generally referred to as the non-uniform FFT (NUFFT). Even though it achieves the same computational complexity of O(N log N) as the standard FFT, the computation is typically slower and scales worse with dimensionality.
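The steps above can be sketched as a toy adjoint NUFFT (k-space to image). The Kaiser-Bessel parameters (width 4, β = 13.9) and the overall simplifications are illustrative assumptions, not values from the paper or any submission:

```python
import numpy as np

def adjoint_nufft_sketch(data, traj, n, osf=2, width=4, beta=13.9):
    """Toy adjoint NUFFT following the listed steps.

    traj: (n_k, 2) k-space coordinates in oversampled-grid units, centred at 0.
    """
    N = int(n * osf)                              # oversampled grid size
    half = width // 2

    def kb(x):                                    # Kaiser-Bessel kernel, zero outside support
        t = np.clip(1.0 - (2.0 * x / width) ** 2, 0.0, None)
        return np.where(np.abs(x) <= width / 2, np.i0(beta * np.sqrt(t)), 0.0)

    # 1) density compensation would be applied to `data` here (omitted)
    # 2) gridding: convolve each sample onto the Cartesian grid
    grid = np.zeros((N, N), dtype=complex)
    for s, (kx, ky) in zip(data, traj):
        cx, cy = kx + N // 2, ky + N // 2
        ix = np.arange(int(np.floor(cx)) - half, int(np.floor(cx)) + half + 1)
        iy = np.arange(int(np.floor(cy)) - half, int(np.floor(cy)) + half + 1)
        grid[np.ix_(ix % N, iy % N)] += s * np.outer(kb(ix - cx), kb(iy - cy))

    # 3) standard FFT of the now Cartesian data
    img = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid)))

    # 4) deapodization: divide by the image-domain response of the kernel
    ka = np.zeros(N)
    ka[N // 2 - half:N // 2 + half + 1] = kb(np.arange(-half, half + 1))
    deap = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(ka))).real
    img /= np.outer(deap, deap)

    # 5) crop to the desired FOV
    lo = N // 2 - n // 2
    return img[lo:lo + n, lo:lo + n]
```

Gridding a single sample at the k-space centre should, after deapodization, yield an (approximately) constant image, which makes a convenient sanity check.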

| Data
The evaluation in this work was performed on two different data sets. First, the algorithm was evaluated using radially sampled data provided by the organizers of the 2019 RRSG challenge. Second, during follow up work after the conclusion of the challenge, radial and spiral data including noise reference scans were acquired. These data sets closely follow the sampling trajectories and noise treatment of the original CG-SENSE paper, and were reconstructed with the consolidated reference implementations to evaluate their correctness and properties.

Challenge data
The challenge data consist of two radial k-space data sets, one brain and one cardiac measurement, supplied in the .h5 data format [14]. The data set entries are ordered using the BART toolbox convention [15], i.e. the data have the dimensions [1, Readout, Spokes, Channels] and the trajectory [3, Readout, Spokes], where the first trajectory entry encodes the three spatial dimensions. The distance between sampling points is 1/FOV_os and the entries run from −N/2 to N/2, with N being the matrix size of the desired FOV and FOV_os the readout-oversampled FOV. The brain data consist of 96 radial projections. 2D radial spin-echo measurements of the human brain were performed with a clinical 3 T scanner (Siemens Magnetom Trio, Erlangen, Germany) using a receive-only 12-channel head coil. Sequence parameters were: TR = 2500 ms, TE = 50 ms, matrix size = 256×256, slice thickness 2 mm, in-plane resolution 0.78×0.78 mm². The FOV was increased to a matrix size of 300×300 after acquisition. The sampling direction of every second spoke was reversed to reduce artifacts from off-resonances. The cardiac data set consists of 55 radial projections acquired with a 34-channel coil on a 3 T system (Skyra, Siemens Healthcare, Erlangen, Germany). A real-time radial FLASH sequence with TR/TE = 2.22/1.32 ms, slice thickness 6 mm, 5 × 15 radial spokes per frame, 1.6 × 1.6 mm² resolution and a flip angle of 10° was used. The matrix size was set to 160 × 160 with 2-fold oversampling and a FOV of 256 × 256 mm², which was up-scaled by a factor of 1.5 after acquisition to fully contain the heart, leading to a reconstruction FOV of 384 × 384 mm² with a 240 ×

Reference data
In addition to the original challenge data, two new data sets, a radially acquired heart data set and a spirally acquired brain data set, were used in this work. The heart data was acquired at the Karolinska Institutet with a prototype bSSFP pulse sequence and a golden-angle radial trajectory at 1.5 T (Aera). The data sets are supplied as .h5 files containing trajectories, multi-channel data, coil sensitivity maps, and the noise covariance matrix. Written informed consent was obtained from all healthy volunteers following the local ethics committee's regulations.

| Submissions
The following paragraphs will give a brief overview of the submissions with respect to the architecture and toolboxes used. Major differences between the submissions, especially implementation-wise, will be highlighted. The main features of each submission are summarized in Table 1.
To comply with the original algorithm, some form of k-space filter needs to be included in the reconstruction code. The most popular choice among the submissions is an arctan-based filter function, given in equation (6). If other filters are used, they are described in the corresponding paragraph of the submission. The desired undersampling factor is attained by skipping every other acquired line for the brain data, to achieve factors of {1, 2, 3, 4} relative to the acquired number of spokes. The heart data is undersampled by selecting the first {55, 33, 22, 11} projections. Different realizations of undersampling for a given implementation are described in the corresponding paragraph.
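The two undersampling schemes can be sketched directly on the BART-convention arrays. The arctan filter below is a hypothetical shape only: the exact form of equation (6) is not reproduced in this text, so a plausible smooth roll-off around the cutoff k_c is assumed:

```python
import numpy as np

def arctan_filter(k, k_c, beta=100.0):
    """ASSUMED arctan-based k-space roll-off around cutoff k_c.

    The true form of equation (6) is not given in the text; this is a
    plausible stand-in that is ~1 below k_c, 0.5 at k_c, and ~0 above.
    """
    return 0.5 - np.arctan(beta * (k - k_c) / k_c) / np.pi

def undersample_brain(data, traj, R):
    """Brain-style undersampling: keep every R-th acquired spoke.

    data: [1, readout, spokes, channels]; traj: [3, readout, spokes].
    """
    return data[:, :, ::R, :], traj[:, :, ::R]

def undersample_heart(data, traj, n_spokes):
    """Heart-style undersampling: keep the first n_spokes projections."""
    return data[:, :, :n_spokes, :], traj[:, :, :n_spokes]
```

The slicing mirrors the description above: retrospective spoke skipping for the brain data, truncation to the first projections for the heart data.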

University of California, Berkeley
Based on Python, this submission uses an interactive Jupyter notebook to run the reconstruction. The CG algorithm is implemented using the SigPy Python package, which provides a high-level abstraction interface to run code on CPUs or GPUs. Coil sensitivity profiles for the head scans are estimated via a root-sum-of-squares (SoS) approach, dividing each channel by the SoS reconstruction. The sensitivity profiles for the heart data are estimated using the BART toolbox [15].
Density compensation is done by taking the norm of each coordinate in the trajectory, resulting in a simple ramp function. The NUFFT algorithm is based on Fessler's min-max interpolation [13]. The data is multiplied with the square root of the density compensation to ensure adjointness of the forward and backward operators in the CG algorithm. The CG algorithm is terminated after {3, 6, 15, 50} iterations. No k-space filtering is performed after the CG reconstruction.
Brain and heart data are undersampled by skipping acquired lines according to the desired undersampling factor.
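The ramp density compensation used by this and several other submissions amounts to a one-liner; the trajectory layout below is assumed to follow the challenge convention with the spatial coordinates along the first axis:

```python
import numpy as np

def ramp_dcf(traj, normalize=True):
    """Ramp density compensation: weight each sample by its Euclidean
    distance from the k-space centre, optionally normalized by the maximum.

    traj: (2, n_read, n_spokes) k-space coordinates.
    """
    d = np.linalg.norm(traj, axis=0)
    return d / d.max() if normalize else d
```

For ideal radial sampling this reproduces the linear ramp filter mentioned throughout the submissions.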

Berlin Ultrahigh Field Facility (B.U.F.F)
This Matlab-based submission uses the NUFFT from the BART toolbox within a self-written CG implementation. Coil sensitivity estimation is performed via a SoS approach, dividing each channel by the SoS reconstruction. Density correction is performed by taking the Euclidean norm of each trajectory position in 2D, normalized by the maximum value. Intensity correction is achieved by normalizing the sum over the complex coil sensitivities to yield one. The CG algorithm is run on the brain data for 150 iterations in the unaccelerated case and for 100 iterations in the accelerated cases.
For the heart data set, 40 iterations are used for all cases. The filter function for k-space position k_x, k_y is given in equation 6, with k_c being half the cropped FOV size and β = 100 as parameters. The filter is applied to the cropped-FOV image after the CG algorithm has terminated.

Eindhoven University of Technology
This submission uses Python to achieve the desired reconstruction results. The main package used for reconstruction is PyNUFFT which also supplies the CG algorithm. The internal NUFFT algorithm is based on Fessler's min-max interpolation [13]. Each coil channel is reconstructed independently using PyNUFFT and the final image is formed via a SoS approach. No density correction or coil sensitivity estimation is performed prior to reconstruction. The CG algorithm is terminated after 11 iterations. No k-space filtering is performed after reconstruction. Brain and heart data are undersampled by skipping acquired lines according to the desired undersampling factor.

Swiss Federal Institute of Technology Zurich (ETH)
This implementation is built on code that was presented at past educational sessions and image reconstruction workshops of the ESMRMB and does not utilize any toolboxes, providing a white-box implementation with full details exposed to the user. The sensitivity maps are calculated from the fully-sampled center of the radial k-space data using the SVD-based approach of Walsh et al. [16]. The gridding operator is devised as a sparse matrix with a Kaiser-Bessel kernel function to utilize the fast matrix-vector multiplications in Matlab. A noise covariance matrix of the receive channels can be included in the reconstruction if available. The iteration is not regularized (e.g. by Tikhonov regularization), but stopped manually after a number of iterations determined by visual assessment of the intermediate result.
In the last step, a k-space filter is applied to suppress noise from k-space areas with no acquired data.

Karolinska Institutet (KI)
This submission is written in Matlab, using the object oriented capabilities of the language. The aim is to make the code as readable as possible. Coil sensitivities are estimated by dividing each coil element by the SoS of all coil elements. The CG algorithm is implemented from scratch using the same variable names and formalism as the original paper [1], to help readers relate the code to the paper. A "SENSE-operator" is implemented, which performs the steps outlined in the algorithm description provided in the original paper. A linear ramp-filter is used for sampling density compensation, and the gridding and Fourier transform is performed using the NUFFT routine, with Kaiser-Bessel interpolation, from the Michigan Image Reconstruction Toolbox (MIRT) [13]. No additional filtering of the k-space data is performed.

New York University (NYU)
iterations and augmented using Tikhonov regularization with a regularization weight of 0.2. The algorithm is terminated if the residual falls below 1e-6. No k-space filtering is applied after the reconstruction.

Stanford University
The submission is based on Python and implements gridding via a Cython module. Gridding is realized via linear interpolation of a precomputed Kaiser-Bessel kernel function. The visualization of the reconstruction steps is done in an interactive Jupyter notebook. Density compensation is realized via a simple ramp function computed from the Euclidean norm of the 2D k-space grid positions. The input data is multiplied with a Hamming window, with the parameter M amounting to the number of samples on each spoke. Coil estimation is performed via a SoS approach, by filtering the data with a narrow-width Gaussian kernel (σ = 5% k-space width), applying the inverse NUFFT to the filtered data and dividing each channel by the SoS reconstruction. In contrast to the other submissions, FOV cropping is performed to center the brain in the reconstructed image instead of cropping symmetrically around the center. The CG algorithm is run for 8 iterations for brain and heart data. The filter function for the normalized k-space position k_x, k_y is given in equation 6, with k_c = 1 and β = 100 as parameters. The filter is applied in every iteration of the CG algorithm.

Graz, University of Technology, Institute of Computer Graphics and Vision (TUG H.)
The submission is based on Python and uses the gpuNUFFT [17], which is written in C++/CUDA to accelerate the reconstruction operator. Wrapper scripts to run the gpuNUFFT in Python are provided. The gridding in the gpuNUFFT is realized via nearest-neighbour interpolation of a pre-computed Kaiser-Bessel function, where the gridding kernel size is set to 3. Coil sensitivity maps are estimated using ESPIRiT [18] from the BART toolbox [15]. Intensity correction is applied by dividing the coil sensitivity maps by their SoS reconstruction, resulting in unit norm along the coil channels.
Density compensation is applied using a ramp filter estimated from the provided trajectories. To achieve adjointness of the forward and adjoint reconstruction operator E and E H , the raw data is multiplied by the square-root of the density compensation function. For reconstruction, 100 iterations are performed in the CG algorithm which was extended with Tikhonov regularization (λ = 0.2).

Graz, University of Technology, Institute of Medical Engineering (TUG M.)
The submission is based on Python and makes use of OpenCL to accelerate the reconstruction process. The code is packaged and supports installation with automated dependency detection. Coil sensitivities are estimated from all acquired spokes using the ESPIRiT algorithm [18] implemented in the BART toolbox. Intensity correction is achieved by normalizing the sum over the complex coil sensitivities to yield one. As density compensation, a simple ramp function is used. The NUFFT is based on a Kaiser-Bessel gridding approach with linear interpolation between the pre-computed kernel points. The data is multiplied by the square root of the density compensation to ensure adjointness of the forward and backward application of the NUFFT. The CG algorithm is extended by a Tikhonov regularization and run for 300 iterations. The regularization weight λ is chosen as 0.5. The filter function for k-space position k_x, k_y is given in equation 6, with k_c = 25 and β = 100 as parameters. The filter is applied to the cropped-FOV image after the CG algorithm has terminated.

University of Southern California (USC)
This Matlab implementation uses in-house written C/MEX functions to implement the reconstruction algorithm.
Gridding is realized via sinc interpolation combined with the Matlab FFT implementation. As density compensation, a simple ramp function is used. Coil sensitivity estimation is performed via a SoS approach using a 32-by-32 low-resolution k-space center region and dividing each channel by the SoS reconstruction. Intensity correction is achieved by normalizing the sum over the complex coil sensitivities to yield one. The CG algorithm is run for {3, 6, 15, 50} iterations for the brain data set and for 6 iterations for the heart data set. The filter function for the normalized k-space position k_x, k_y is given in equation 6, with k_c = 0.5 and β = 100 as parameters. The filter is applied to the cropped-FOV image after the CG algorithm has terminated.
Utah Center for Advanced Imaging Research, University of Utah (Utah)

This Matlab-based submission uses a CPU/GPU-accelerated NUFFT implementation, based on the min-max interpolation of Fessler and Sutton [13]. Coil sensitivities are estimated by the method of Walsh et al. [16]. Sensitivity maps are normalized to give a sum of one along the coil dimension, accounting for intensity variations and ensuring adjointness.
Gridding is performed via linear interpolation of a precomputed Kaiser-Bessel kernel. Density compensation amounts to a simple ramp. Reconstruction is run for {200, 100, 100, 100} iterations for brain and 100 iterations for heart data.
The filter function for k-space position k x , k y is given in equation 6. k c is chosen to be half of the cropped FOV and β = 100 is used as parameter. The filter is applied to the cropped FOV image after the CG algorithm has terminated.

Massachusetts General Hospital (MGH)
This submission demonstrates how to use Matlab and modify the BART toolbox to perform a CG-SENSE reconstruction.
Coil sensitivities are estimated using the ESPIRiT algorithm [18] of BART. In addition to the forward/backward application of the NUFFT, a faster Toeplitz-embedding based implementation is described. Reconstruction is performed by a modification of BART's "pics" method. Performance is demonstrated using all acquired projections of the brain data set only. No scripts to produce the desired images from undersampled data or for the heart data were submitted.

Revised Submissions
To avoid registration of individual submissions and to eliminate errors due to the necessary interpolation, participants were given the opportunity to submit revised code to account for differences in FOV and/or resolution between the reference and their submissions. Both the original and the revised submissions are subsequently compared to the reference to show the initial deviations and the corrected images.

| Consolidated Implementation
Accounting for the two major programming languages used throughout the submissions, reference implementations are developed both in Python and Matlab.

General steps
To facilitate comparability between the two implementations, coil sensitivity profiles are pre-computed using the Walsh et al. [16] algorithm and all available projections. Density compensation is derived from the trajectory by gridding a k-space of ones, followed by division by the maximum value. Taking the inverse of the gridded k-space yields the estimated density compensation function [19].
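The density-compensation estimate described above can be sketched with nearest-neighbour gridding standing in for the kernel-based gridding of the reference code (an illustrative simplification, not the reference implementation itself):

```python
import numpy as np

def dcf_from_ones(traj, N):
    """Estimate a density compensation function by gridding ones.

    Accumulate a k-space of ones onto an N x N grid, read the accumulated
    value back at each sample location, normalize by the maximum, and
    invert [cf. Pipe-Menon-style density estimation].

    traj: (n_k, 2) coordinates in grid units, centred at 0.
    """
    grid = np.zeros((N, N))
    idx = np.round(traj).astype(int) + N // 2
    np.add.at(grid, (idx[:, 0], idx[:, 1]), 1.0)   # grid a k-space of ones
    dens = grid[idx[:, 0], idx[:, 1]]              # density seen at each sample
    dens /= dens.max()                             # divide by the maximum value
    return 1.0 / dens                              # invert to get the DCF
```

Samples that coincide on the grid (as near the centre of radial trajectories) receive proportionally smaller weights, which is exactly the intended compensation.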

Matlab specific
Sensitivity maps are assumed to be precomputed and read in at the start of the reconstruction. As in the original ETH submission, gridding is performed by a matrix-vector multiplication with a sparse matrix to reduce computation times.

Python specific
If no coil sensitivity maps are supplied in the data file, receive coil sensitivities are estimated via the SoS approach, dividing each gridded coil image by the SoS reconstruction. To account for the typically smooth sensitivity profiles, the raw data is multiplied with a Hanning window with a window width of 50 pixels, masking out high-frequency components of the acquired k-space data prior to SoS reconstruction and coil sensitivity estimation. Optionally, the non-linear inversion (NLINV) algorithm [20] can be used to estimate coil sensitivities prior to reconstruction. The pre-computed gridding kernel is derived using a Kaiser-Bessel function [12,19]. The kernel width is set to 5 and 10000 points of the window are pre-computed. The value of the kernel for gridding a specific measurement point is determined via linear interpolation of the pre-computed values. Optional Tikhonov regularization and a termination criterion can be enabled by setting the corresponding values larger than zero in the configuration file.
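The described SoS estimation can be sketched as follows; a Cartesian toy k-space replaces the gridded radial data, and the separable window construction is a simplifying assumption of this sketch:

```python
import numpy as np

def estimate_sens_sos(kspace, window_width=50):
    """SoS coil-sensitivity estimate: low-pass the k-space with a Hanning
    window, reconstruct per-coil images, divide each by the root-sum-of-
    squares image.

    kspace: (n_c, n, n) Cartesian coil k-space (toy stand-in for the
    gridded radial data of the reference implementation).
    """
    n_c, n, _ = kspace.shape
    w1 = np.zeros(n)
    lo = n // 2 - window_width // 2
    w1[lo:lo + window_width] = np.hanning(window_width)   # centred 1D window
    k_low = kspace * np.outer(w1, w1)[None]               # mask high frequencies
    imgs = np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(k_low, axes=(-2, -1))), axes=(-2, -1))
    sos = np.sqrt(np.sum(np.abs(imgs) ** 2, axis=0)) + 1e-12
    return imgs / sos
```

By construction the estimated maps have unit SoS norm across the coil dimension, which is the normalization several submissions rely on.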

| Numerical Comparison
The results from each submission were compared on a pixel-by-pixel basis to the Python consolidated reference implementation.

| Code and Data Availability Statement
All data and code used in this paper are available online. Availability and associated Digital Object Identifiers (DOIs), if applicable, are listed below:
• Challenge submissions: https://ismrm.github.io/rrsg/challenge_one/
• Supplementary spiral brain and radial cardiac data: Zenodo.URL
• Evaluation scripts to produce the figures: Zenodo.URL
All code and data are licensed under the MIT License (https://en.wikipedia.org/wiki/MIT_License).

| RESULTS

| Submissions
Example reconstruction results for an acceleration factor of 2, showing the image after evaluation of the right-hand side of equation 3 (Initial) and after convergence of the algorithm (Final), are given for each submission; some submissions deviate in matrix size (e.g., Utah: 512x512). No major structural differences are observable in the reconstructions except for the case of Eindhoven. The brain reconstruction from KI used neither Tikhonov regularization nor early stopping, and the k-space was not filtered, resulting in a noisier appearance compared to the other submissions. In addition, it shows a slight rotation to the left.
For the heart data, more differences are observed. Firstly, FOV differences occur more frequently (Eindhoven: cropped to 320x320, ETH: cropped to 360x360, NYU: cropped to 300x300, Stanford: asymmetric crop to 240x336). Secondly, the matrix size within the same FOV, and thus the resolution, is changed by some submissions (Berkeley: 300x300, USC: 256x256, Utah: 320x320). The heart reconstruction results for 11 spokes from Berkeley appear noisier than the others.
The reconstructions using the reference implementations are given in Figure 4 and Figure 5 for Python and Matlab, respectively. Neither intensity nor contrast variations are visible between the two reference implementations.

| Differences to Consolidated Implementation
Visually, no major differences to the submissions are visible. Pixel-wise difference plots are shown in Figure 6, where differences in the brain tissue can be seen. The initial steps show good agreement where the image alignment matches, with only minor intensity differences in some submissions. The heart data shows more differences overall, especially in areas of low SNR. The heart itself appears consistent between most reconstructions. At the highest acceleration, differences become more pronounced.
The pixel-wise comparison between the two reference implementations in Figure 8 for brain (top part) and heart (bottom part) data shows the overall excellent agreement between the Matlab and Python reconstruction results. No major differences are visible in any of the images. The single-coil images and initial images show very slight intensity differences. The visual impression is supported by high SSIM values (0.9987-0.9998) and small NRMSE values (0.006-0.028). The highest differences are visible outside of the brain at the border of the used image mask. A similarly excellent agreement between both reconstructions is achieved for the heart data. NRMSE and SSIM are comparable to the brain data, but areas with little to no signal outside the body show slight, noise-like deviations.
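The sensitivity of such metrics to global intensity scaling, which the abstract names as a pitfall, is easy to make concrete with a plain-NumPy NRMSE:

```python
import numpy as np

def nrmse(ref, img):
    """Normalized root-mean-square error relative to a reference image.

    Note that this metric is sensitive to global intensity scaling: two
    identical images that differ only by a scale factor yield a nonzero
    "error", one reason the text reports such metrics can mislead.
    """
    return np.linalg.norm(img - ref) / np.linalg.norm(ref)

ref = np.random.default_rng(0).random((64, 64))
# Scaling the identical image by 1.1 already yields nrmse(ref, 1.1*ref) ≈ 0.1
```

This is why intensity normalization (or scale-invariant metrics) matters before comparing reconstructions quantitatively.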
Finally, Figure 9 shows the two additionally supplied data sets. Both reference implementations produce clean images, and visually no differences can be observed.

| Achievements of the Challenge
The first ISMRM reproducibility challenge led to twelve submissions from research groups spread across many countries. In addition to the paper challenge, a questionnaire on reproducible research was opened, which showed that the majority of the 71 participants (77.5%) sees a reproducibility problem in their research area. This shows that scientists are aware of the problem of reproducibility of research and how hard it can be to recreate paper results without access to code or data. Furthermore, the challenge led to the production of two consolidated implementations of the seminal paper, written in the two most commonly used programming languages across all submissions. However, it also raised the question of what reproducibility concretely means and how to measure similarity between different images. Even though many different toolboxes and reconstruction approaches were used by the participants, the visual appearance of the reconstructions is very similar.

| Imaging Parameters
As no desired FOV was given, some problems arose with the correct choice of the reconstruction FOV. It is common to assume an oversampling factor of 2 compared to the radial acquisition, but this was not the case for the supplied data: the brain was oversampled by a factor of 1.706 and the heart by 1.3. These factors could be derived from the supplied trajectory but were not taken into account by all submissions. Larger FOVs can easily be corrected for by simply cropping to the desired FOV, which is commonly done symmetrically around the image center. All submissions cropped in this way, except for the Stanford submission, which crops symmetrically around the object. Although such an approach leads to a centered image, it can be tedious. The main reason for cropping the FOV is to remove areas with aliasing, which typically folds back at the edges of the oversampled image grid [19,22]. The aliasing gets worse with increasing distance from the image center due to the amplification from the deapodization function and, depending on the gridding kernel, additional errors from outside of its pass-band can be introduced at the image edges [12,19,22].
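The derivation of the oversampling factor from the trajectory and the symmetric center crop can be sketched as follows (a minimal illustration, assuming the trajectory is given in grid-index units; the function names and example values are hypothetical):

```python
import numpy as np

def oversampling_factor(k_traj, n_readout):
    """Ratio of readout samples to the matrix size implied by the
    sampled k-space extent (Nyquist radius = matrix / 2)."""
    k_max = np.max(np.abs(k_traj))       # largest sampled k-space radius
    return n_readout / (2.0 * k_max)

def crop_center(img, nx, ny):
    """Symmetric crop around the image center, as done by most submissions."""
    cx, cy = img.shape[0] // 2, img.shape[1] // 2
    return img[cx - nx // 2: cx + (nx + 1) // 2,
               cy - ny // 2: cy + (ny + 1) // 2]

# Example: 512 readout samples with k_max = 150 resolve a 300 matrix,
# i.e. an oversampling factor of 512/300 ≈ 1.707, close to the brain data case.
k = 150.0 * np.linspace(-1.0, 1.0, 512)
print(oversampling_factor(k, 512))
```

A submission that ignores the factor and assumes an oversampling of 2 would crop to the wrong FOV, which matches the FOV differences observed above.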

| Gridding/NUFFT
As can be seen in the difference images in Figure 6 and Figure 7, reconstructions with a larger FOV show little or no structural difference to the references. A more concerning modification is a change of resolution, as it can lead to interpolation-related changes in the visual appearance of the image. A possible source of such an increased or decreased pixel size lies in the way acquired data points are placed in k-space via the gridding operation. If an oversampled grid is defined but the sample locations of the trajectory are not rescaled to span the whole range of this oversampled k-space, only the central part will be filled. Similarly, if the points to be gridded lie outside of the desired k-space, they are either not gridded at all or enter on the opposite side of k-space, depending on the boundary conditions used. This amounts to an interpolation in image space and an artificially altered resolution, i.e. interpolation to higher or lower resolution, respectively. Small structural differences in the submissions may stem from different treatment of the supplied k-space trajectory. Normalization of the k-space coordinates, as done in many submissions, can modify the actual k-space positions if performed independently for each spatial dimension. This can lead to a rotation or distortion of the reconstructed image. Such differences cannot easily be corrected in the final images, as this would involve some form of interpolation to the desired matrix size or image registration, introducing errors related to the interpolation kernel. Therefore, no attempt was made to correct for different resolutions in the final images, leading to rather large deviations in the pixel-wise difference maps. Revised submissions in Figure 10, accounting for deviations in trajectory handling and/or FOV, show numerical differences for brain and heart data that are in line with most of the other submissions. This suggests that the large differences in the original submissions stem solely from wrongly treated trajectory information or FOV cropping.
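The effect of per-axis trajectory normalization can be illustrated with a small numerical sketch (hypothetical values, not the challenge data): scaling kx and ky independently changes the angle of a radial spoke, whereas a single isotropic scale preserves it.

```python
import numpy as np

# A tilted radial spoke at 0.3 rad: its kx extent differs from its ky extent.
t = np.linspace(-1.0, 1.0, 5)
kx, ky = 0.5 * t * np.cos(0.3), 0.5 * t * np.sin(0.3)

# Per-axis normalization rescales each dimension to +/-0.5 independently.
kx_bad = 0.5 * kx / np.abs(kx).max()
ky_bad = 0.5 * ky / np.abs(ky).max()

# Isotropic normalization uses one common scale for both dimensions.
r = np.sqrt(kx**2 + ky**2).max()
kx_ok, ky_ok = 0.5 * kx / r, 0.5 * ky / r

# Per-axis scaling rotates the spoke (here from 0.3 rad to pi/4),
# distorting the reconstructed image; isotropic scaling preserves the angle.
angle_orig = np.arctan2(ky[-1], kx[-1])
angle_bad = np.arctan2(ky_bad[-1], kx_bad[-1])
angle_ok = np.arctan2(ky_ok[-1], kx_ok[-1])
```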

| Algorithmic
The increased noise in the KI reconstruction may stem from the large number of CG iterations combined with the absence of any regularization or k-space filtering. Running the CG algorithm for too many iterations leads to increased noise in the final reconstruction, as can also be seen in the 11-spoke heart reconstruction of the Berkeley submission.
Thus, early stopping is used as regularization in the original publication. Another form of regularization used in the submissions is plain Tikhonov regularization based on the L2-norm of the image, i.e. λ > 0 in Equation 3. The regularization parameter λ balances data fidelity against regularization. Even though this is not included in the original publication, results from submissions with Tikhonov regularization show only minor differences to the reference using early stopping (see Figure 6 and Figure 7). However, choosing a correct regularization parameter can be a challenging task: too large a value can lead to slow convergence and may be the reason for residual intensity variations in the TUG M. submission for both brain and heart data. On the other hand, choosing a regularization parameter is usually easier than choosing a number of iterations, because it can be done based on sound principles [23,24,25].
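This trade-off can be sketched with plain conjugate gradients on the Tikhonov-regularized normal equations. This is a generic stand-in for the SENSE encoding operator, not the reference implementation itself; all names are illustrative:

```python
import numpy as np

def cg_sense(AH_A, AH_b, lam=0.0, n_iter=10, tol=1e-6):
    """Conjugate gradients on (A^H A + lam*I) x = A^H b.
    AH_A: callable applying the normal operator A^H A to a vector.
    A small n_iter acts as implicit regularization (early stopping);
    lam > 0 adds explicit Tikhonov regularization."""
    x = np.zeros_like(AH_b)
    r = AH_b.copy()
    p = r.copy()
    rs = np.vdot(r, r).real
    for _ in range(n_iter):
        Ap = AH_A(p) + lam * p
        alpha = rs / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Example on a tiny random least-squares problem (stand-in for MRI encoding):
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
b = rng.standard_normal(8)
x = cg_sense(lambda v: A.T @ (A @ v), A.T @ b, lam=0.1, n_iter=20)
```

With lam = 0 the iteration count is the only regularization; with lam > 0 the system is better conditioned and the limit point itself is regularized, which is why both strategies yield similar images here.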
The solution of the inverse problem of recovering an image from non-uniformly acquired data depends strongly on the quality of the pre-computed coil sensitivity profiles. A wrong or inaccurate estimate ultimately leads to poorly reconstructed images; visual effects include intensity variations or signal voids. In addition, the estimated coil profiles influence the solution via the intensity normalization, which is directly derived from them. Thus, all provided data also contain estimated coil sensitivity profiles, which were used to generate the reference reconstruction results in both implementations.

| Evaluation
As images are typically given in arbitrary units, a direct numerical comparison can be challenging. As a result, image intensity normalization was applied. However, if normalization fails due to outliers or, more generally, due to deviations from the expected intensity histogram, it can create a false impression of rather large differences. This may be the reason for the increased deviations of the ETH submission for the heart data compared to the brain data, visible as a bright error map in Figure 7. For this reason, no numerical metrics were used to compare submissions to the reference implementation, as these would suffer even more from intensity variations or image shifts.
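The scaling pitfall described above can be illustrated with a short sketch: normalizing by a high percentile of the magnitude instead of the maximum prevents single outlier pixels from dominating the scale (the percentile value is chosen arbitrarily for illustration and is not part of the challenge evaluation):

```python
import numpy as np

def normalize(img, pct=99.0):
    """Scale the image magnitude towards [0, 1] using a high percentile
    rather than the maximum, so isolated outliers do not set the scale."""
    mag = np.abs(img)
    scale = np.percentile(mag, pct)
    return np.clip(mag / scale, 0.0, 1.0)

# A single hot pixel ruins max-based scaling but not percentile scaling.
img = np.ones((100, 100))
img[0, 0] = 1.0e6
print((img / img.max()).mean())   # bulk pixels collapse towards zero
print(normalize(img).mean())      # bulk pixels keep their relative level
```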

| Reference Implementations
During the development of the reference implementations, we identified that processing steps related to gridding yield the largest deviations, e.g. trajectory normalization, apodization correction, and gridding kernel normalization.
The largest deviations were associated with the normalization of the trajectory to a specific range. The smallest deviations are achieved without any modification of the supplied trajectory, i.e. taking the k-space locations as given and adapting the gridding to account for the increased range of possible values (e.g. exceeding |0.5|).
The two reference implementations show no major difference inside the brain, as evident in Figure 8. A very slight intensity difference for single-coil and initial images can be seen, which might be related to the apodization correction. Minor implementation details, such as linear versus nearest-neighbor interpolation of the gridding kernel, or the FFTW [26] used in Matlab versus the FFTPACK-based Cooley-Tukey [27] and Bluestein [28] algorithms used in Python, in combination with the iterative steps of the algorithm, may account for the remaining differences. The heart data shows overall good agreement, with increased deviations in areas with little to no signal, either inside the lungs or outside the body. The area of interest, the heart, shows no substantial difference between the two reference implementations. The SSIM and NRMSE metrics are computed within the same binary mask used for reconstruction; even better values are thus expected if only the brain tissue itself is evaluated, and the same holds for the heart reconstruction.
By cropping the evaluated area to include only reliable pixels, i.e. pixels with sufficiently high signal, the SSIM values could be further improved.
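Restricting a metric to a mask can be sketched as follows (a simple NRMSE variant is assumed for illustration; the exact metric definitions of the reference implementations may differ):

```python
import numpy as np

def masked_nrmse(ref, test, mask):
    """NRMSE evaluated only over pixels inside a binary mask, so that
    empty background regions cannot distort the comparison."""
    r, t = ref[mask], test[mask]
    rmse = np.sqrt(np.mean(np.abs(r - t) ** 2))
    return rmse / (r.max() - r.min())

# Toy example: only the 2x2 center region enters the metric.
ref = np.zeros((4, 4))
test = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
ref[mask] = [1.0, 2.0, 3.0, 4.0]
test[mask] = [1.0, 2.0, 3.0, 5.0]
print(masked_nrmse(ref, test, mask))
```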
For these reasons, the implementations of the CG-SENSE algorithm in Matlab and Python can be considered equally accurate, and thus the submitted algorithms were compared to just one of the two references, the Python-based implementation.

| Licensing Code and Data
When the challenge was initiated, very few constraints were imposed on how data could be used and how code should be provided, to enable wide-spread participation. In retrospect, however, wide-spread adoption and re-use of the data and code submissions created by the challenge requires some consideration of licensing, in order to stand on firm legal footing.
If no license is specified, the owners of code or data retain all copyright and have to give explicit permission for its use. In the context of reproducibility, however, making software open-source and re-usable for other researchers is key. Two classes of software licenses are best suited for this purpose: copyleft licenses, such as the GNU General Public License (GPLv3), and permissive licenses, such as the MIT license.
There are good resources explaining the differences between these [29], including a very accessible website that helps to choose one: https://choosealicense.com/. In brief, the MIT license has the fewest restrictions and simplifies commercial use, whereas the GPL puts emphasis on keeping code open-source, i.e., if one builds on GPL-licensed code, one has to make it publicly available, even in commercial settings. This also means that MIT-licensed code can be used within a GPL project, but not the other way around; in that case one might have to choose the GPL as a license.
For sharing data, the situation is complicated by the fact that data might be considered part of software and documentation, or a work of creative art, for which the class of Creative Commons (CC) licenses was envisaged.

| Future Impact
The first RRSG challenge has already been met with a very positive response, both in reproducing the selected publication and, more importantly, in bringing together a community of researchers interested in reproducible science and MR image reconstruction. On top of that, we also see a very practical use of its outcome in the future, as a benchmark for methods comparison [31]. We believe that this could become a general model for future software publications: using proposed example data and outcome measures of reproducibility challenges to show the performance and scope of these tools in a more standardized fashion.
Reproducibility of image reconstruction in MRI can be challenging, especially with the increased complexity of the algorithms used. Even though the description in a paper allows one to re-implement the reconstruction algorithm, many details may not be stated explicitly and can lead to unexpected outcomes, e.g. the exact step sizes used in optimization, scaling of gradients, internal SNR estimates, and other pre- and post-processing steps. These problems arise in many iterative fitting strategies throughout the whole field of MRI research, e.g. quantifying tissue parameters or estimating perfusion/diffusion metrics, to name just a few.
Quantifying tissue parameters, more specifically the T1 relaxation constant, is also the aim of the "Reproducibility Challenge 2020" of the study group. Reproducing exact quantitative values at multiple sites is challenging due to small variations in measurement hardware and software. The challenge aims to identify the sources of variation and tries to standardize T1 mapping across multiple sites.

| CONCLUSION
The present work shows that reproducing research results without access to the original source code and data leaves room for ambiguity. From what we have learned in this first reproducibility challenge, our recommendation is to publish not only papers but also code and data to make sure research is really reproducible.

AUTHOR CONTRIBUTIONS
OM wrote the manuscript.
AF and FP performed measurements for the consolidated reference implementations.
OM implemented the Python reference implementation. All authors proofread the manuscript.

ACKNOWLEDGEMENTS
The authors would like to thank

FIGURE 3 Example reconstruction results for each challenge submission. Shown are results using 55 and 11 spokes of the supplied challenge heart data. Visually observable differences amount to FOV changes as well as image center changes. Intensity variations are not as severe as in the case of brain data. Again, some reconstructions made use of a mask to suppress background signal.

FIGURE 9 Example reconstruction results of the two additionally supplied data sets. Top rows show the spirally acquired brain data set and bottom rows show the radially acquired cardiac data. Both reconstructions included noise pre-whitening prior to reconstruction, based on a dedicated noise scan preceding image acquisition. Windowing is performed between the minimum and maximum intensity value in each image.

FIGURE 10 Relative pixel-wise difference of revised submissions, correcting FOV and/or trajectory related deviations with respect to the reference implementation. In contrast to the initial submissions, only minor deviations are visible.