Comparative review of algorithms and methods for chemical‐shift‐encoded quantitative fat‐water imaging

To propose a standardized comparison between state‐of‐the‐art open‐source fat‐water separation algorithms for proton density fat fraction (PDFF) and R2* quantification using an open‐source multi‐language toolbox.


INTRODUCTION
Chemical shift encoded MRI (CSE-MRI) techniques have become the reference for quantitative in vivo evaluation of fatty depots. CSE-MRI of water and fat signals facilitates tissue and organ fat quantification. CSE-MRI 1 acquires images at multiple TEs, at which fat and water signals accrue different phases. The proton-density fat fraction (PDFF), which is the ratio of MR-visible protons from fat to all MR-visible protons (water and fat), can be estimated from these images. PDFF quantification has been established as an accurate non-invasive biomarker 2 to assess tissue adiposity in the liver, 3 bone marrow, 4 and other organs. 5 Each application involves a specific range of PDFF but also a different type of fat composition. Also from CSE-MRI and complementary to PDFF, the quantification of R2* decay is another biomarker of interest to further probe iron overload 6 and hemorrhage. 7 The accuracy and precision of the quantitative parameters PDFF and R2* are increasingly challenged, with demands for precision down to a few percent of PDFF or tens of s−1 of R2* during follow-up, [8][9][10] to differentiate the type of adipose tissue 11 or to monitor cohorts. 12 Recently, groups of experts such as the PDFF Quantitative Imaging Biomarkers Alliance (QIBA) group and the ISMRM Quantitative MR Study group provided consensus guidelines to assess new quantitative MR methods, 13,14 which this study follows. The QIBA group conducted a meta-analysis 15 on hepatic PDFF (PDFF ≤ 50%) and assessed its reproducibility coefficient at 4.12% across experimental set-ups and its repeatability coefficient at 2.99% under the same experimental conditions. Moreover, in another QIBA study, 16,17 one fat-water phantom traveled to multiple sites and demonstrated reproducible PDFF measurements across the solutions of various MRI vendors.
Nowadays, several CSE-MRI-based methods for fat-water signal separation have been developed to obtain these quantitative PDFF and R2* biomarkers. Thus, the purpose of this study is to establish and compare the quantitative performance of these methods, in terms of bias for accuracy and limits of agreement for precision, within a reproducible research framework. 18 Relevantly, more than a decade ago, the ISMRM 2012 fat-water MRI Workshop proposed to standardize PDFF as a quantitative imaging biomarker. The gathered algorithms were benchmarked on a multitude of in vivo datasets, 19 and a MATLAB algorithm toolbox was developed and disseminated. It standardized the input/output formats of the algorithms and facilitated their comparison. Unfortunately, most studies only evaluated a discrete and limited range of PDFF, often under 50% for mixed fat-water samples. For a wider evaluation, numerical simulations can complement experimental comparisons to explore the full range of PDFF more extensively and to validate and compare methods in many more scenarios. For instance, a Python open-source framework 20 was developed to explore optimal acquisition parameters according to the number of peaks resolved in the water-fat signal model. However, this framework considered only a single algorithm and focused on acquisition parameters. Algorithms continue to emerge, 21 with a diversity of methods to solve fat-water signal separation. Whether fitting magnitude, complex, or hybrid data, algorithms can be based on iterative least-squares, 22,23 graph-cut, [24][25][26][27][28][29] region-growing, [30][31][32] and, more recently, deep learning [33][34][35][36] approaches. In addition, further image processing continues to build on these algorithms, whether to evaluate fatty acid composition, 37,38 quantitative susceptibility mapping, 29 or temperature mapping. 39 Thus, there is value in providing an independent and reproducible performance evaluation of a set of previously proposed algorithms. Finally, to assess the performance of CSE-MRI algorithms, a numerical toolbox should remain open-source to facilitate continuous comparison and possible extensions, and should support the two major programming languages (Python and MATLAB) used by MRI researchers for prototyping new methods. Therefore, the purpose of this work was to develop a multi-language (MATLAB and Python) numerical toolbox and to assess the performance of open-source state-of-the-art fat-water reconstruction methods in terms of fat-water swapped voxels (FWswaps) and PDFF and R2* quantification.

Algorithm standardization
Algorithms were standardized, building upon the input MATLAB structure of the ISMRM fat-water toolbox, with the addition of the voxel dimension. Toward generalization of applications, all algorithms were adapted to accept an input fat spectrum in their models. The output structure of each algorithm comprised the algorithm's parameters (for reproducible research), the fat spectrum model employed, and maps of fat, water, R2*, B0, PDFF, and the voxel-wise sum of squared errors. For all graph-cut algorithms, the discretization step of the B0 off-resonance was set to 2 Hz.
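To make the standardized output more concrete, the following minimal Python sketch shows one way such a structure could be organized. The class name `FatWaterOutput`, its field names, and the magnitude-based PDFF expression are illustrative assumptions and do not describe the toolbox's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FatWaterOutput:
    """Hypothetical container mirroring the standardized output described above."""

    params: Dict[str, float]                 # algorithm settings kept for reproducibility
    fat_spectrum: List[Tuple[float, float]]  # (chemical shift in ppm, relative amplitude)
    water: List[complex]                     # water map, flattened to a voxel list
    fat: List[complex]                       # fat map, same voxel order
    r2star: List[float]                      # R2* map in s^-1
    b0: List[float]                          # B0 off-resonance map in Hz
    sse: List[float] = field(default_factory=list)  # voxel-wise sum of squared errors

    def pdff_percent(self) -> List[float]:
        """Assumed definition: PDFF = 100 * |F| / (|W| + |F|); empty voxels give 0."""
        out = []
        for w, f in zip(self.water, self.fat):
            total = abs(w) + abs(f)
            out.append(100.0 * abs(f) / total if total > 0 else 0.0)
        return out


result = FatWaterOutput(
    params={"b0_step_hz": 2.0},      # 2 Hz graph-cut discretization quoted in the text
    fat_spectrum=[(-3.4, 1.0)],      # placeholder single-peak spectrum (assumption)
    water=[1 + 0j, 0.5 + 0j, 0j],
    fat=[0j, 0.5 + 0j, 1 + 0j],
    r2star=[50.0, 50.0, 50.0],
    b0=[0.0, 10.0, -10.0],
)
print(result.pdff_percent())  # [0.0, 50.0, 100.0]
```

Keeping the algorithm parameters and the fat spectrum next to the maps means a result can be re-generated from the stored object alone, which is the reproducibility goal stated above.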

Monte Carlo simulation
To numerically evaluate the algorithms' performances, synthetic CSE-MRI volumes were modeled as:

$$ s(t) = \left( W + F \sum_{m} \alpha_m e^{i 2\pi f_m t} \right) e^{i \left( 2\pi f_0 t + \phi_0 \right)} e^{-R_2^{\ast} t} + n(t) \qquad (1) $$

with W and F corresponding to the relative water and fat signal amplitudes, f0 the off-resonance frequency, R2* = 50, 150, or 300 s−1 the transverse decay rate, φ0 = 30° the initial phase, which holds only for low flip angles as demonstrated by Wang et al., 41 n(t) the complex Gaussian noise, and α_m and f_m the relative amplitude and frequency offset of the m-th peak of a subcutaneous fat spectrum, respectively. Considering a 3T scanner field strength (the exact value was 2.89T, which was employed throughout this study), virtual CSE-MRI volumes were synthesized as follows: PDFF varied from 0 to 100% in 1% steps, f0 was uniformly distributed from −300 Hz to 300 Hz 42 in 6 Hz steps, and the third dimension consisted of 100 repetitions that differed by the Gaussian noise n(t), which was added to obtain SNR = 10, 50, or 100 at the first TE. The intensity of the synthetic volumes was normalized to 99% of the maximum of the first echo. To avoid border effects due to spatial regularization, a five-pixel padding was added to each CSE-MRI volume. Different numbers of TEs (NTE = 3, 5, 7, 9) and echo spacing schemes were considered.
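As an illustration of how one voxel of such a synthetic volume could be generated, the sketch below implements a standard single-R2*, multi-peak fat-water signal with the symbols defined above. It uses only the Python standard library. The function names, the two-peak placeholder spectrum, and the SNR definition (first-echo magnitude divided by the noise standard deviation per channel) are assumptions made for this example, not the toolbox's code.

```python
import cmath
import math
import random

GAMMA_MHZ_PER_T = 42.577  # proton gyromagnetic ratio / (2*pi), in MHz/T


def cse_signal(te_s, pdff, f0_hz, r2star, fat_peaks, b0_tesla=2.89, phi0_deg=30.0):
    """Noiseless single-R2* CSE signal at the echo times te_s (in seconds).

    fat_peaks: list of (relative amplitude, chemical shift in ppm relative to water).
    pdff is a fraction in [0, 1]; water and fat amplitudes sum to 1.
    """
    water, fat = 1.0 - pdff, pdff
    amp_sum = sum(a for a, _ in fat_peaks)
    phi0 = math.radians(phi0_deg)
    signal = []
    for t in te_s:
        # Multi-peak fat term: sum_m alpha_m * exp(i 2 pi f_m t), with f_m in Hz
        fat_term = sum(
            (a / amp_sum) * cmath.exp(2j * math.pi * GAMMA_MHZ_PER_T * b0_tesla * ppm * t)
            for a, ppm in fat_peaks
        )
        phase = cmath.exp(1j * (2.0 * math.pi * f0_hz * t + phi0))
        signal.append((water + fat * fat_term) * phase * math.exp(-r2star * t))
    return signal


def add_noise(signal, snr, rng):
    """Add complex Gaussian noise so that |s(TE1)| / sigma equals snr (assumed definition)."""
    sigma = abs(signal[0]) / snr
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for s in signal]


# Placeholder two-peak spectrum, NOT the subcutaneous spectrum used in the study
peaks = [(0.9, -3.4), (0.1, 0.6)]
echo_times = [0.98e-3 + n * 1.68e-3 for n in range(5)]
clean = cse_signal(echo_times, pdff=0.3, f0_hz=60.0, r2star=50.0, fat_peaks=peaks)
noisy = add_noise(clean, snr=50.0, rng=random.Random(0))
```

Looping this function over the PDFF and f0 grids and over independent noise seeds would give the PDFF × off-resonance × repetition layout described in the text.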

Acquisition parameter: echo spacing schemes
The impact of three TE sampling schemes was also considered in this study. Only TE schemes using single-TR monopolar readout gradients (often termed "flyback" echo trains) were considered, to avoid the additional issue of gradient distortions. Realistic physical acquisition constraints were also included, since they limit the possible echo-time sampling through the minimum TE and the minimum echo spacing, which depend on the magnetic field strength, sampling dwell time, and resolution. Additionally, most of the compared algorithms were constrained by design to uniform echo spacing. Thus, a function was developed in the proposed toolbox to automatically calculate uniform TE schemes based on acquisition constraints (minimum TE and echo spacing) for any number of echoes. Three schemes were evaluated: first, alternating in- and out-of-phase echoes; second, the minimum TEs allowed by the acquisition constraints (hereafter referred to as "realistic minimal"); and third, a realistic IDEAL echo spacing, in which the uniform IDEAL echo shift formulated by Pineda 43 for three echoes was generalized to abide by both the acquisition constraints and an extended-IDEAL criterion expressed with the greatest common divisor (gcd). This criterion provides the shortest angular step that leads to a uniform N-sampling of the unit circle of fat-water phase, given hardware constraints (which prevent the shortest 2π/N steps). The derivation of this criterion is detailed in Data S1.
In practice, the extended IDEAL echo spacing yields shorter TEs for prime or odd numbers of echoes than for even numbers of echoes. As such, the three TE schemes were each evaluated for 3, 5, 7, and 9 echoes. Hardware constraints were defined for a 3T acquisition with a minimum TE of TEmin = 0.98 ms and a minimum echo spacing of ΔTEmin = 1.68 ms.
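The gcd-based idea can be sketched as follows. The exact criterion is given in the article's Data S1 and is not reproduced here, so this is one plausible reading rather than the authors' formula: with N echoes, a fat-water phase step of 2πk/N visits N distinct, uniformly spaced points of the unit circle only if gcd(k, N) = 1, and the smallest such k whose echo spacing respects ΔTEmin is retained. The function name and the single dominant fat peak at about 3.4 ppm from water are assumptions of the example.

```python
import math

GAMMA_MHZ_PER_T = 42.577  # proton gyromagnetic ratio / (2*pi), in MHz/T


def extended_ideal_spacing(n_echoes, fat_water_shift_hz, min_spacing_s):
    """Return (k, echo spacing in s) for the smallest admissible angular step 2*pi*k/N.

    A step is admissible when gcd(k, N) == 1 (uniform N-sampling of the unit circle)
    and the resulting echo spacing is not shorter than the hardware minimum.
    """
    period = 1.0 / abs(fat_water_shift_hz)  # time for one full fat-water phase cycle
    k = 1
    while True:
        spacing = k * period / n_echoes
        if math.gcd(k, n_echoes) == 1 and spacing >= min_spacing_s:
            return k, spacing
        k += 1


# Assumed dominant fat-water shift of 3.4 ppm at 2.89 T (about 418 Hz)
shift_hz = GAMMA_MHZ_PER_T * 2.89 * 3.4
for n in (3, 5, 7, 9):
    k, dte = extended_ideal_spacing(n, shift_hz, min_spacing_s=1.68e-3)
    print(f"NTE={n}: k={k}, echo spacing={dte * 1e3:.3f} ms")
```

The echo times would then follow as TEmin + n·ΔTE for n = 0 … N−1. The choice of the first TE, which also matters for IDEAL sampling, is left out of this sketch.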

Fat spectrum library
The CSE-MRI signal model (Eq. 1) relies on a multi-peak fat spectrum model with established relative amplitudes and chemical shifts corresponding to triglycerides. 23,44 The choice of a multi-peak fat spectrum has been shown to reduce PDFF estimation bias compared with a single-peak fat spectrum model, 45 but the spectrum remains highly variable across implementations. To extend the ISMRM 2012 challenge, which benchmarked algorithms with a single human fat spectrum, this study benchmarked the algorithms' sensitivity to the selected human fat NMR spectrum. 46 In the toolbox, fat NMR spectra can be described either with a list of peaks (pairs of chemical shifts and amplitudes) or with a generic triglyceride model 47 that has only three parameters: the number of double bonds (ndb), the number of methylene-interrupted double bonds (nmidb), and the chain length (CL). This simplified model enabled us to translate gas-chromatography measurements of fatty acid composition into an NMR spectrum signal. In addition, a library of documented human fat spectra 37,47,48 was implemented in the toolbox. Thus, any spectrum could easily be associated with each algorithm. Unless specified, synthetic CSE volumes were modeled with the subcutaneous fat spectrum (CL = 17.29, ndb = 2.69, nmidb = 0.58). To probe the influence of the spectrum, synthetic signals (NTE = 9) were simulated with a peanut oil spectrum but processed with two very different spectra: either the same calibrated peanut oil spectrum or the ISMRM 2012 challenge spectrum. Moreover, the effect of temperature T on the relative shift between the water peak and the fat spectrum, ΔCSwf, was also included in the toolbox using a published formula. 49
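To illustrate how three parameters can define a whole spectrum, the sketch below computes per-peak proton counts from CL, ndb, and nmidb using the proton-counting expressions commonly associated with the triglyceride model in the fat quantification literature. The expressions are written from memory and the peak names and function names are chosen for this example, so they should be treated as assumptions to be checked against the toolbox and its cited reference. Chemical shifts are deliberately omitted.

```python
def triglyceride_proton_counts(cl, ndb, nmidb):
    """Proton count of each triglyceride resonance (assumed standard expressions)."""
    return {
        "methyl": 9.0,
        "methylene": 6.0 * (cl - 4.0) - 8.0 * ndb + 2.0 * nmidb,
        "beta_carboxyl": 6.0,
        "alpha_olefinic": 4.0 * (ndb - nmidb),
        "alpha_carboxyl": 6.0,
        "diacyl": 2.0 * nmidb,
        "glycerol_ch2": 4.0,
        "glycerol_ch": 1.0,
        "olefinic": 2.0 * ndb,
    }


def relative_amplitudes(counts):
    """Normalize proton counts so that the peak amplitudes sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# Subcutaneous fat parameters quoted in the text
counts = triglyceride_proton_counts(cl=17.29, ndb=2.69, nmidb=0.58)
for name, amplitude in relative_amplitudes(counts).items():
    print(f"{name:15s} {amplitude:.4f}")
```

A useful sanity check on these expressions is that the counts add up to 6·CL + 2 − 2·ndb, the total number of protons of a triglyceride with three chains of length CL and ndb double bonds.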

In vitro: fat-water phantom
Eight fat-water phantoms of 50 mL were prepared with different fat fractions of peanut oil. Recipes and protocols from Hines et al 50 and Bush et al 51 were followed. High fat fraction vials (>70%) were not solidified due to a lack of oil-soluble surfactant and required shaking for homogenization prior to experiments. The volume percentages of oil in the phantoms were targeted at 0%, 10%, 20%, 40%, 60%, 80%, 90%, and 100%, and their respective reference PDFF values, determined with magnetic resonance spectroscopy (MRS), were 0%, 7.4%, 19.6%, 39.7%, 62.3%, 77.2%, 85.3%, and 100% (Figure S1). The room temperature was input to the algorithms through the toolbox to correct for chemical shift variations. Reference R2* values were calculated as the mean of the R2* values obtained from the five algorithms with minimal bias in the Monte Carlo study, using data with NTE = 9 and in-/out-of-phase echo spacing.
Fitting of the spectroscopy data was performed using a linear combination model implemented in FSL-MRS 52 version 1.1.10, part of FSL (FMRIB's Software Library, www.fmrib.ox.ac.uk/fsl). Briefly, basis spectra were fitted to the complex-valued spectrum in the frequency domain with a Voigt line shape. 53,54 The basis spectra were shifted and broadened with parameters fitted to the data, grouped into two metabolite groups (water and lipids). A complex polynomial baseline (order = 3) was fitted concurrently. Model fitting was performed using the Metropolis-Hastings algorithm.

In-vitro: multi-site fat-water phantom MRI data
To explore the reproducibility of algorithms across sites, field strengths, and protocols, a complementary experimental evaluation was performed on a publicly available dataset. 16,55 It included six-echo CSE-MRI acquisitions of a standardized fat-water phantom obtained across six sites at two field strengths (1.5T and 3T), with three different vendors (GE Healthcare, Philips, and Siemens Healthineers) and two different protocols (echo spacing schemes: in-/out-of-phase and realistic minimal). The fat-water phantom consisted of 11 vials with oil-to-water concentrations of 0%, 2.6%, 5.3%, 7.9%, 10.5%, 15.7%, 20.9%, 31.2%, 41.3%, 51.4%, and 100%. Following the original multi-site study, 16 bipolar acquisitions were corrected using the same method, and the six-peak peanut oil spectrum model was corrected for room temperature prior to the fat-water separation processing.

In vivo imaging
For practical demonstration purposes, algorithms were evaluated on challenging in vivo anatomies including the supraclavicular, sacral, and liver regions. Three healthy volunteers were recruited after informed consent. MR images were acquired with the same 3T MRI system using the vendor thoracic coil array and the spine coil array, totaling up to 32 channels. Imaging of the supraclavicular fat depot (which contains white and brown fat), of the bone marrow fat in the sacral region, and of the liver was considered a challenging application due to a large range of B0 inhomogeneity, limited SNR, and various spatial distributions of adipose tissues.

Evaluation metrics and statistical analysis
The comparison followed the QIBA recommendations. 13 From the Monte Carlo simulations, algorithm results were evaluated based on their bias and limits of agreement (LOA) for each model parameter (PDFF, B0, R2*). The percentage of FWswaps was quantified. A voxel was classified as an FWswap based on a criterion applied to the estimated PDFF (PDFFE). FWswaps were excluded from the analysis of linearity and bias.
As an indication, the computation times of the algorithms were recorded and reported in seconds per slice. Computations were performed in MATLAB R2019b and Python version 3.7.5 on a computer equipped with a GPU (Nvidia Quadro P5000, 16 GB) and 40 CPUs (Intel Xeon E5-2630 v4, 2.20 GHz). Statistical analysis was conducted using R (version 3.6.3). 57
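For readers less familiar with these metrics, the short Python sketch below computes a bias and 95% limits of agreement in the conventional Bland-Altman sense (mean difference ± 1.96 standard deviations of the differences). The study's own analysis was run in R and its exact computation is not given in the text, so the function name and the 1.96 factor are assumptions of this example, and the numbers are made up.

```python
import statistics


def bias_and_loa(estimated, reference, z=1.96):
    """Return (bias, (lower LOA, upper LOA)) of estimated versus reference values."""
    diffs = [e - r for e, r in zip(estimated, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, (bias - z * sd, bias + z * sd)


# Made-up PDFF values in percent, for illustration only
reference_pdff = [10.0, 20.0, 30.0, 40.0]
estimated_pdff = [11.0, 19.0, 31.0, 41.0]
bias, (lower, upper) = bias_and_loa(estimated_pdff, reference_pdff)
print(f"bias = {bias:.2f}%, LOA = [{lower:.2f}%, {upper:.2f}%]")
```

In a Monte Carlo setting, the differences would be pooled over the repetition axis (and, where relevant, the off-resonance axis) after excluding the voxels flagged as FWswaps.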
For NTE ≥ 5 and R2* = 50 s−1, most algorithms were robust to FWswaps (FWswaps < 1%), apart from GOOSE and B0-NICE, each with 5% FWswaps, which also corresponded to large B0 absolute errors: B0-NICE = 43.3 Hz, GOOSE = 13.8 Hz. The other algorithms, with no swaps, provided B0 absolute errors on the order of the 2 Hz graph-cut B0 discretization for NTE = 7: VLGCA = 2 Hz, Hernando-GC and Fatty-Riot-GC = 0.9 Hz, IDEAL-CE = 0.5 Hz, MSGCA-A/B = 0.26 Hz. From the statistical analysis, PDFF measurements using any of the graph-cut approaches were found to be significantly correlated with B0 off-resonance (p < 0.01), albeit with a negligible linear dependence of 0.1% per 100 Hz. B0-NICE and GOOSE remained highly biased with more echoes and were not included further in the quantitative analysis.
Unsurprisingly, the performance of the algorithms was significantly dependent on the number of echoes and the echo sampling scheme (Figure S2). For NTE ≥ 5, VLGCA PDFF estimation was influenced by the echo spacing scheme (p < 0.0001), with a preference for the in-/out-of-phase scheme. Increasing the number of TEs (NTE = 7 and then 9) improved PDFF accuracy: absolute bias significantly decreased (p < 0.0001) for all algorithms (Figure S3). Considering the best NTE = 7 echo spacing scheme for each algorithm and SNR = 50, the remaining five algorithms provided accurate PDFF estimation with similar PDFF bias (bias < 0.15%, LOA < 2.6%) (Figure 2A). Increasing the number of TEs significantly decreased (p < 0.0001) the LOA for all algorithms. At NTE = 5 and minimal echo spacing, five algorithms provided a low R2* mean bias (<0.5 s−1), whereas MSGCA-A suffered at low PDFF (<35%) from a large bias (−48 s−1) at high R2* (≥150 s−1) (Figure 2B). These five algorithms achieved an LOA within 15% of the targeted value, with the lowest precision for PDFF = 63%. Computation times for processing one repetition with NTE = 7 ranged from 1.8 ± 0.2 s (B0-NICE) and 1.95 ± 0.02 s (IDEAL-CE) to 51.3 ± 18.0 s (Fatty-Riot-GC) and 5455 ± 7122 s (GOOSE).

In vitro experiments
The local experiments on our custom phantom allowed validation of the results observed in the numerical study. Using the NTE = 9 results, the B0 off-resonance in the phantom was found to span more than 300 Hz, from −220 to 150 Hz. Matching simulation results, VLGCA was confirmed to have a PDFF bias above 4% in the water vial; Hernando-GC provided reliable PDFF measurements over the B0 off-resonance range of the vials shown in Figure 3A, whereas with NTE = 3, Fatty-Riot-GC was still influenced by B0 inhomogeneity, resulting in a PDFF bias of 10% in the water vial instead of FWswaps. Confirming simulation results, algorithms proved to be robust to FWswaps (≤0.05%) with NTE ≥ 7. For PDFF quantification, all algorithms were independent of the echo spacing with NTE = 7 and provided a mean bias under 2.5% (Figure 3B). However, all algorithms overestimated the 40% and 60% vials. These vials were found to have very different water and fat R2* values (R2*fat = 22 and 25 s−1, R2*water = 47 and 49 s−1 for 40% and 60%, respectively). Algorithms in our study assumed a single R2* value, and this hypothesis failed in these experiments. Numerical simulation of data matching these characteristics confirmed a +3% bias from algorithms within the 40%-60% PDFF range (Figure S4). The R2* absolute error corroborated the numerical simulations (Figure S5), with a mean absolute error higher than 10 s−1 for B0-NICE, whereas it was lower than 4.5 s−1 for the seven other algorithms.

FIGURE 1
PDFF and B0 off-resonance absolute errors measured on three algorithms sensitive to B0 inhomogeneity (Fatty-Riot-GC, Hernando-GC, and VLGCA), using synthetic CSE-MRI volumes with SNR = 50 at three and nine echoes with in-/out-of-phase echo spacing. PDFF and B0 absolute error maps were averaged along the repetition axis. Bias in B0 off-resonance field map estimation led to FWswaps with Fatty-Riot-GC, while Hernando-GC suffered from PDFF bias due to B0 inhomogeneities. With nine echoes, algorithms provided more reliable quantitative maps. Maps with specific scales are displayed in blue boxes.

In-vitro: multi-site fat-water phantom MRI data
Algorithm performance depended on the multi-echo acquisition type (monopolar, interleaved, bipolar). With interleaved echo acquisitions (sites 3 and 4, both 1.5T), all algorithms failed to provide accurate PDFF quantification (Figure 4 and Table S1), with a mean absolute error above 7% for IDEAL-CE and above 12% for the other algorithms. Thus, the issue lies in the double acquisition and is independent of the algorithms. For bipolar acquisitions (sites 5 and 6), even after correction for odd and even echoes, IDEAL-CE suffered from FWswaps, with mean FWswaps of 34.5% and a mean bias of 23.7% for non-swapped pixels. Fatty-Riot-GC also exhibited swaps, with FWswaps above 15% for sites 2 and 3. The four other algorithms provided reliable PDFF measurements across sites, protocols, and field strengths, with a mean bias of less than 2%. However, their PDFF measurements were significantly dependent, albeit with small biases, on the field strength (max 1.2% bias, F = 5.7, p < 0.05), vendor system (max 1.5% bias, F = 4.2, p < 0.05), and site (max 1.3% bias, F = 3.8, p < 0.01).

Influence of the spectrum model
The synthetic magnitude CSE signals confirmed that the R2* decay can depend on the chosen fat spectrum model (Figure 6A). Processing synthetic data with a spectrum (ISMRM 2012 challenge) different from the one employed for simulation (peanut oil) revealed a significant bias in PDFF (p < 0.0001) and R2* (p < 0.0001) quantification, with a maximum bias of 2.35% and 20.70 s−1, respectively, at NTE = 9 and SNR = 100 (Figure 6B). In vitro (Figure 6C) and in vivo (Figure 6D) experiments confirmed similar biases in practice, with mean PDFF and R2* differences of 1.22% and 12.22 s−1, respectively, in supraclavicular adipose tissue.

DISCUSSION
In keeping with the community-driven standardization of MRI body fat and iron quantification, several recent fat-water separation algorithms were compared through an open-source toolbox for their reproducibility, precision, and accuracy. This benchmarking also included the influence of acquisition parameters (number of echoes and echo spacing) to obtain more accurate quantitative maps.

FIGURE 4
Linear regression analysis on multi-site fat-water phantom data. The targeted linear response between true PDFF and estimated PDFF is represented by a black dotted line. Estimation errors correspond to deviations from this line. The HybridMag algorithm used in the original study was included as a reference. Sites 3 and 4 employed interleaved echo schemes, which impacted the algorithms' quantification. Sites 5 and 6 employed bipolar readout gradients, which impacted only IDEAL-CE quantification despite a preprocessing phase correction.

Open-source framework and reproducibility research
The proposed toolbox was developed in both the Python and MATLAB programming languages to facilitate benchmarking of fat-water separation algorithms. It can be considered an extension of the ISMRM fat-water toolbox, 19 which currently only supports the comparison of MATLAB algorithms. Another addition is the capability to run algorithms with various fat spectra from an exhaustive fat spectra library. This framework was made open-source (https://github.com/pdaude/CREAM_PDFF) to facilitate comparison with upcoming methods.
This study did not aim to elect the optimal fat-water separation algorithm but rather to provide a practical evaluation framework to help researchers select a fat-water separation algorithm. The choice of a specific algorithm remains multi-factorial, and this study aims to set a basis for further reproduction and adaptation. First, the targeted application has to be defined, including the choice of bio-physical model (fat spectrum calibration), the expected range of quantitative parameters (PDFF, R2*, B0), SNR, acquisition constraints (sampling scheme and number of echoes), and required precision. For example, while algorithms accept different TE spacings, results show that the performance of some of them (such as VLGCA and Hernando-GC) depended on this TE spacing. Second, the multi-scale evaluation appeared impactful in our study: when in silico results pointed toward three equivalent algorithms (MSGCA-B, IDEAL-CE, and Fatty-Riot-GC), in vitro results confirmed it and penalized one of them (confirmation with our custom phantom and penalization of IDEAL-CE over the multi-site dataset), whereas one algorithm (IDEAL-CE) stood out in in vivo challenges. Third, algorithms can be individually optimized for one application or one computing architecture. The proposed toolbox should facilitate such optimization, leading to improved performance of certain algorithms that were implemented "as provided" in our study. Fourth, the computation time can impact the usability of the algorithms and might enter considerations for the optimal choice. Finally, this toolbox also estimates margins of error of the parameter quantifications, which could help to further discriminate algorithms. These estimates can also be compared with Cramér-Rao lower bounds as provided by Diefenbach et al. 20

FIGURE 5
PDFF, R2*, and B0 off-resonance quantification of three algorithms of equivalent performance so far (Fatty-Riot-GC, IDEAL-CE, and MSGCA-B) over challenging in vivo datasets at 3T. PDFF overestimation or complete FWswaps were observed where large B0 off-resonances were present. Using MSGCA-B and Fatty-Riot-GC, as shown by white arrows, fat was erroneously quantified in regions where it was not expected, such as inside the bladder or the neck muscles. Fatty-Riot-GC even led to a full-volume FWswap in the pelvis due to a large off-resonance range.

Numerical simulations
Provided with data synthesized using only three echoes, most algorithms suffered from FWswaps or PDFF bias due to B0 inhomogeneities (Figures 1 and S2), while five or seven echoes provided a significant improvement in the accuracy and precision of PDFF measurement with almost no FWswaps (<1%). With NTE ≥ 5, B0 off-resonance field maps were correctly estimated by all algorithms apart from B0-NICE and GOOSE. Even when using five echoes, the PDFF precision for VLGCA, B0-NICE, and GOOSE remained greatly dependent on the echo spacing (i.e., on the echo sampling scheme). Hernando-GC with in-/out-of-phase echo spacing suffered from random failures of B0 field map estimation along the repetition axis, leading to large PDFF SDs (Figure 2A). For all other cases, MSGCA-A, MSGCA-B, IDEAL-CE, Hernando-GC, and Fatty-Riot-GC provided similar results, all suitable for reliable PDFF quantification (Figure 2A). Considering these latter five algorithms, echo spacing still influenced the precision of the R2* quantification (Figure S6). Considering moderate R2* values and a fixed number of TEs, R2* precision depended on the longest sampled TE. This conclusion does not, however, account for high R2* values that might occur in the presence of liver iron accumulation or in bone marrow. Therefore, to quantify moderate R2* with a given number of echoes, in-/out-of-phase schemes should be preferred over IDEAL or minimal echo spacing. Nevertheless, with a fixed TR, minimal echo spacing, which potentially allows more echoes to be fitted, should be preferred over the other echo spacing schemes for R2* accuracy. Tighter echo spacing (i.e., "realistic minimal" echo spacing), associated with higher signal and more sampled echoes within the same TR, improves noise performance at the acquisition level. When evaluated over a realistic range of R2*, five algorithms (VLGCA, Hernando-GC, Fatty-Riot-GC, IDEAL-CE, and MSGCA-B) provided reliable quantification with low R2* bias (<0.5 s−1) (Figure 2B). For computational efficiency, MSGCA-A is implemented using only the decoupled estimation of B0 and R2* maps, resulting in FWswaps and R2* bias if the R2* assumed in the estimation of the B0 map is too far from the ground truth.

In vitro experiments
The custom in vitro experiments validated the numerical findings. Notably, experimental results confirmed the necessity of acquiring at least five echoes to avoid FWswaps and bias due to B0 inhomogeneities with most algorithms (Figure 3). They also showed that the algorithms' bias in PDFF quantification was lower than the repeatability coefficient (for NTE = 7, PDFF bias ≤ 2.4% vs. PDFF repeatability ≤ 2.99%). Using spectroscopy as a reference measurement, PDFF bias was higher than in the corresponding simulation for all algorithms, especially for vials with 40% and 60% fat. This discrepancy was attributed to the model assumption of a common single R2* value for fat and water, when a dual R2* would have been more appropriate. 58 A numerical simulation with R2* values obtained from spectroscopy measurements confirmed this discrepancy (Figure S4). As most fat-water algorithms are built on this assumption and dual R2* modeling is inherently more sensitive to noise, separate R2* values were not investigated in this study, although they would be a valuable extension of it.

In-vitro: multi-site fat-water phantom MRI data
Analyzing multi-site fat-water phantom data required a pre-processing step to correct interleaved and bipolar echoes; 59,60 otherwise, fat-water algorithms failed to provide realistic quantitative maps (Figure 4 and Table S1). In general, corrections for gradient distortions during acquisition, including imperfect gradient system transfer function, 61 spatial non-linearity of gradients, and concomitant gradients 59 during echo trains, correspond to an image processing step performed prior to fat-water separation. 62 Thus, these corrections were not considered in this study. As suggested in the original study, 16 the PDFF biases found across sites could be due to the variability of temperature, which might affect PDFF quantification. 49,63 Finally, Fatty-Riot-GC and IDEAL-CE exhibited more FWswaps and bias than in the custom in vitro experiments.

In vivo experiments
As observed in phantoms, challenging in vivo data revealed disparities between algorithms (IDEAL-CE, MSGCA-B, Fatty-Riot-GC). These differences may be due to rapid and large variations of B0, which break the stringent constraint of a smooth field map assumed by these algorithms (Figure 5). However, the algorithms provided similar performance and proved highly resilient to challenging low-SNR in vivo data (SNR = 14), with SDs of PDFF and R2* measurements in the liver of 2.4% and 11 s−1, which was verified by simulation (Figure S7).

Influence of spectrum model
Interestingly, processing data with an inadequate spectrum led to a non-negligible bias in PDFF and R2* (Figure 6). The in vitro and in vivo experiments confirmed this difference, as shown in Figure 6C,D. Thus, the choice of a relevant spectrum remains essential for characterizing fat deposits with different fatty acid compositions. This consideration might be even more important in applications such as CSE-MRI of bone marrow. The sensitivity of PDFF quantification alone to multi-peak fat spectrum calibration has previously been explored in the liver of non-alcoholic steatohepatitis patients 64 or using synthetic CSE data with low PDFF (≤40%), 45 six echoes, and a graph-cut algorithm. Within this range, the choice of the number of spectrum peaks was found not to significantly impact PDFF. Our results extend these findings, demonstrating that with more echoes (NTE = 9) and a high SNR, a small PDFF bias (<2%) can arise within the PDFF range of 20%-80% depending on the selected fat spectrum; this bias is, however, lower than the repeatability coefficient (PDFF repeatability ≤ 2.99%). More importantly, the choice of spectrum significantly influenced the R2* bias (12.22 s−1), which might alter the evaluation of iron content in organs such as the liver or bone marrow. R2* enables the measurement of liver iron concentration (LIC), [65][66][67] and the R2*-LIC calibration has been studied 68,69 across field strengths, centers, and vendors, but the variation in fat composition could also be an interesting element to consider. From our results, the fat-composition R2* bias would correspond to a variation of 0.17 mg/g (LIC at 2.89T = −0.09 + 1.387 × 10−2 × R2*).
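The LIC variation quoted above follows directly from the slope of the linear calibration, since the intercept cancels when taking a difference. The short Python check below reproduces the arithmetic with the coefficients given in the text. The function and constant names are illustrative, and R2* is assumed to be expressed in s−1 with LIC in mg/g.

```python
LIC_INTERCEPT = -0.09   # mg/g, calibration intercept quoted for 2.89 T
LIC_SLOPE = 1.387e-2    # mg/g per s^-1, calibration slope quoted for 2.89 T


def lic_from_r2star(r2star):
    """Liver iron concentration (mg/g) from R2* (s^-1) with the quoted calibration."""
    return LIC_INTERCEPT + LIC_SLOPE * r2star


def lic_shift(delta_r2star):
    """LIC change caused by an R2* bias; the intercept cancels in the difference."""
    return LIC_SLOPE * delta_r2star


print(round(lic_shift(12.22), 2))  # 0.17
```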

Algorithms running time
Apart from GOOSE, the algorithms' running times on our computer ranged from seconds to minutes per slice. Considering the limited computational optimization of the open-source implementations (in MATLAB or Python), the seven remaining algorithms are all potentially suitable for clinical routine PDFF quantification after code optimization. We noted that algorithms developed after the ISMRM 2012 challenge (MSGCA-A/B, IDEAL-CE, B0-NICE, VLGCA) were faster than the earlier ones (Hernando-GC, Fatty-Riot-GC).

Choices of open-source algorithms
The list of selected open-source algorithms was arguably limited in our comparison study. Indeed, new approaches for solving the fat-water separation problem based on deep learning have recently emerged. 33,34,70 However, all currently available deep-learning algorithms were based on a fixed number of echoes or would need a specific training strategy and network modification to be compatible with our benchmark. Additionally, all algorithms in this study are based on the complex signal model, and it would be interesting to include algorithms using only the magnitude signal model, or hybrid methods, which have been developed to be more robust to phase errors and to circumvent the field map estimation. However, to our knowledge, no open-source algorithms of this type were documented in the literature at the time of this study. Finally, it would also have been of interest to compare the selected algorithms with commercial software, which is based in part on complex or hybrid magnitude-complex methods. However, access to such methods remains limited, if possible at all.

Possible extensions and new challenges
Our open-source toolbox was designed to be upgradeable to tackle new challenges, leaving room for multiple extensions that were beyond the scope of this study. Algorithm extensions based on external B0 field map initialization are of interest, such as methods incorporating a priori information from the scanner magnetic field distribution. 62,71 In a longer-term perspective, some algorithms have started to include refined complex signal MR models designed for quantitative fatty acid composition parameter mapping. 37,38 A standard evaluation of the performance of such advanced algorithms could be of interest. Finally, a graphical user interface for this framework would also be beneficial from a user's perspective, as has already been done for QSM processing pipelines with the SEPIA 72 package.

CONCLUSIONS
In accordance with the standardization of MRI body fat and iron quantification, an open-source bi-language toolbox was developed to evaluate eight state-of-the-art open-source algorithms for fat-water separation. Leveraging this toolbox, a multi-scale evaluation of algorithms was demonstrated: first, the algorithms' performances differed in silico, on numerical synthetic data; second, results were matched with in vitro experimental results from a custom phantom; third, a complementary evaluation on multi-site phantom data probed the algorithms' resilience to various datasets. Finally, challenging in vivo datasets illustrated failure cases of certain algorithms. This framework sets the basis for continued comparison of algorithms for fat-water separation and subsequent quantitative MRI as developments propose new avenues for refined adipose tissue characterization.

FIGURE 2
Comparison of PDFF (A) and R2* (B) bias of each algorithm over synthetic CSE-MRI volumes with NTE = 7 and SNR = 100 (A) or with NTE = 5, minimal echo spacing, and SNR = 50 (B). Mean and SD of PDFF and R2* bias were averaged along the B0 off-resonance and repetition axes and separated (in color) according to the echo spacing schemes (A) or simulated R2* (B). GOOSE and B0-NICE (in red square) were not further investigated due to highly biased results.

FIGURE 3
(A) Measurement from in vitro experiments using three algorithms: Fatty-Riot-GC, Hernando-GC, and VLGCA with NTE = 3 and NTE = 7 and in-/out-of-phase echo spacing. (B) Comparison of PDFF bias of each algorithm in phantoms. Mean and SD of PDFF bias were plotted according to the echo spacing schemes (in color) and echo number (in line style). For clarity, only the SD of PDFF bias for NTE = 7 has been plotted.

FIGURE 6
Influence of fat spectrum models on PDFF and R2* quantification with either an IDEAL algorithm (IDEAL-CE) or a graph-cut algorithm (MSGCA-A). (A) Synthetic magnitude CSE-MRI signal at PDFF = 100% and T2* = 20 ms with different spectra described in the literature or acquired for this study. (B) PDFF and R2* bias when synthetic CSE volumes are processed with either the same spectrum (peanut oil) or the ISMRM 2012 challenge one at 3T with nine echoes and realistic minimal echo spacing. PDFF and R2* measurements when in vitro (C) or in vivo (D) experiments were processed with the peanut oil spectrum at 3T with nine echoes and a minimal echo spacing scheme, and the differences ΔPDFF and ΔR2* obtained when those data were processed with the ISMRM 2012 challenge spectrum. 25