Repeat it without me: Crowdsourcing the T1 mapping common ground via the ISMRM reproducibility challenge

T1 mapping is a widely used quantitative MRI technique, but its tissue‐specific values remain inconsistent across protocols, sites, and vendors. The ISMRM Reproducible Research and Quantitative MR study groups jointly launched a challenge to assess the reproducibility of a well‐established inversion‐recovery T1 mapping technique, using acquisition details from a seminal T1 mapping paper on a standardized phantom and in human brains.


INTRODUCTION
Significant challenges exist in the reproducibility of quantitative MRI [1]. Despite its promise of improving the specificity and reproducibility of MRI acquisitions, few quantitative MRI techniques have been integrated into clinical practice [3][4]. Half a century has passed since quantitative T1 (spin-lattice relaxation time) measurements were first reported as a potential biomarker for tumors [5], followed shortly thereafter by the first in vivo T1 maps [6] of tumors, but there is still disagreement in reported values for this fundamental parameter across different sites, vendors, and measurement techniques [7].
T1 represents the time constant for recovery of the equilibrium longitudinal magnetization, and it is one of the fundamental MRI parameters [8][10][11]. Knowledge of the T1 values for tissue is crucial for optimizing clinical MRI sequences for contrast and time efficiency [12][13][14] and to calibrate other quantitative MRI techniques [15,16]. Inversion recovery (IR) [17,18] is considered the gold standard for T1 measurement due to its robustness against effects like B1 inhomogeneity [7], but its long acquisition times limit the clinical use of IR for T1 mapping [7]. In practice, it is often used as a reference for validating other T1 mapping techniques, such as variable flip-angle imaging (VFA) [19][20][21], Look-Locker [22][23][24], and MP2RAGE [25,26].
In ongoing efforts to standardize T1 mapping methods, researchers have been actively developing quantitative MRI phantoms [27]. The International Society for Magnetic Resonance in Medicine (ISMRM) and the National Institute of Standards and Technology (NIST) collaborated on a standard system phantom [28], which was subsequently commercialized (Premium System Phantom; CaliberMRI, Boulder, CO, USA). This phantom has since been used in large multicenter studies, such as Bane et al. [29], which concluded that acquisition protocols and field strength influence accuracy, repeatability, and interplatform reproducibility. Another NIST-led study [30] found no significant T1 discrepancies among measurements using NIST protocols across 27 MRI systems from three vendors at two clinical field strengths.
The 2020 ISMRM reproducibility challenge posed a slightly different question: Can an imaging protocol, independently implemented at multiple centers, consistently measure one of the fundamental MRI parameters (T1)? To assess this, we proposed using IR on a standardized phantom (ISMRM/NIST system phantom) and the healthy human brain. Specifically, this challenge explored whether the acquisition details provided in a seminal paper on T1 mapping [31] are sufficient to ensure reproducibility across independent research groups.
To evaluate reproducibility within the framework of this challenge, we explored whether the intersubmission variability in T1 measurements is the same as the intrasubmission variability.

Phantom and human data
The challenge asked researchers with access to the ISMRM/NIST system phantom [28] (Premium System Phantom) to measure T1 maps of the phantom's T1 plate (Table 1). Researchers who participated in the challenge were instructed to record the temperature before and after scanning the phantom using the phantom's internal thermometer. Instructions for positioning and setting up the phantom were devised by NIST and were provided to researchers through the NIST website. In brief, the instructions explained how to orient the phantom and how long the phantom should be in the scanner room before scanning to achieve thermal equilibrium. Researchers were also instructed to collect T1 maps in healthy human brains and were asked to measure a single slice positioned parallel to the anterior commissure-posterior commissure (AC-PC) line. Before imaging, the subjects consented to share their de-identified data with the challenge organizers and on the Open Science Framework (OSF.io) website. As the submitted data were a single slice, the researchers were not instructed to de-face the data of their imaging subjects. Researchers submitting human data provided written confirmation to the organizers that their data were acquired in accordance with their institutional ethics committee (or equivalent regulatory body) and that the subjects had consented to data sharing as outlined in the challenge.

TABLE 1. Reference T1 values of the NiCl2 array of the standard system phantom (for both phantom versions) measured at 20 °C and 3 T.
Columns: Sphere #; T1 (ms), Version 1; T1 (ms), Version 2.

MRI acquisition protocol
Researchers followed the IR T1 mapping protocol optimized for the human brain as described in the paper published by Barral et al. [31], which used TR = 2550 ms, TI = 50, 400, 1100, and 2500 ms, TE = 14 ms, 2-mm slice thickness, and 1 × 1 mm² in-plane resolution. Note that this protocol is not suitable for fitting models that assume TR > 5 T1. Instead, the more general Barral et al. [31] fitting model described in Section 2.4 can be used, and this model is compatible with both magnitude-only and complex data. Researchers were instructed to closely adhere to this protocol and report any deviations due to technical limitations.
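The incompatibility with long-TR models can be checked with simple arithmetic. The sketch below is illustrative only: it uses a simplified mono-exponential recovery term, exp(-TR/T1), and plugs in the mean genu and cortical T1 values reported in the Results of this paper; the function names are hypothetical.

```python
import math

TR_MS = 2550.0  # repetition time of the protocol (ms)


def satisfies_long_tr(t1_ms, tr_ms=TR_MS):
    """Return True when TR exceeds 5 * T1 (the long-TR assumption)."""
    return tr_ms > 5.0 * t1_ms


def unrecovered_fraction(t1_ms, tr_ms=TR_MS):
    """Simplified estimate of the magnetization not recovered after one TR.

    Assumption: mono-exponential recovery, i.e. exp(-TR / T1).
    """
    return math.exp(-tr_ms / t1_ms)


# Mean T1 values taken from the Results of this paper (ms).
for label, t1 in [("genu", 828.0), ("cortical GM", 1548.0)]:
    print(label, satisfies_long_tr(t1), round(unrecovered_fraction(t1), 3))
```

For both tissues the condition TR > 5 T1 fails, which is consistent with the need for a fitting model that does not rely on the long-TR approximation.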

Data submissions
Data submissions for the challenge were handled through a GitHub repository (https://github.com/rrsg2020/data_submission), enabling a standardized and transparent process. All data sets were converted to the NIfTI format, and images for all TIs were concatenated into a single NIfTI file. Each submission included a YAML file to store additional information (submitter details, acquisition details, and phantom or human subject details). Submissions were reviewed, and following acceptance, the data sets were uploaded to OSF.io (osf.io/ywc9g/). A Jupyter Notebook [32,33] pipeline using qMRLab [34,35] was used to process the T1 maps and to conduct quality control checks. MyBinder links to Jupyter Notebooks that reproduced each T1 map were shared in each submission's GitHub issue to easily reproduce the results in web browsers while maintaining consistent computational environments. Eighteen submissions were included in the analysis, which resulted in 39 T1 maps of the ISMRM/NIST system phantom and 56 brain T1 maps. Figure 1 illustrates all the submissions that acquired phantom data (Figure 1A) and human data (Figure 1B), the respective MRI scanner vendors, and the resulting T1 mapping data sets. Some submissions included measurements in which both complex and magnitude-only data from the same acquisition were used to fit T1 maps; thus, the number of unique acquisitions (27 for phantom data and 44 for human data) is lower than the number of T1 maps reported above. The data sets were collected on systems from three MRI manufacturers (Siemens, GE, and Philips) and were acquired at 3 T, except for one data set acquired at 0.35 T (the ViewRay MRIdian MR-linac).

Fitting model and pipeline
A reduced-dimension nonlinear least-squares approach was used to fit the complex general IR signal equation as follows:

$$S(TI) = a + b\,e^{-TI/T_1} \qquad (1)$$

where a and b are complex constants. This approach, developed by Barral et al. [31], offers a model for the general T1 signal equation without relying on the long-TR approximation. The constants a and b inherently absorb TR, as well as other imaging parameters such as the excitation flip angle, inversion-pulse flip angles, TE, TI, and a constant that has contributions from T2 and the receive coil sensitivity. Barral et al. [31] shared their MATLAB (MathWorks, Natick, MA, USA) code for the fitting algorithm used in their paper. Magnitude-only data were fitted to a modified version of Eq. (1) (Eq. [15] of Barral et al. [31]) with signal-polarity restoration by finding the signal minima, fitting the IR curve for two cases (data points for TI < TI minimum flipped, and data points for TI ≤ TI minimum flipped), and selecting the case that resulted in the best fit based on minimizing the residual between the model and the measurements. This code is available as part of the qMRLab open-source software [34,35], which provides a standardized application program interface to call the fitting in MATLAB/Octave scripts.

FIGURE 1. List of the data sets submitted to the challenge. (A) Submissions that included phantom data. (B) Submissions that included human brain data. For the phantom (A), each submission acquired its data using a single phantom, but some researchers shared the same physical phantom with each other. Green indicates submissions used for intersubmission analyses, and orange indicates the sites used for intrasubmission analyses. T1 maps used in the calculations of intersubmission (green) and intrasubmission (orange) coefficients of variation are indicated with asterisks. A more detailed figure can be found in Figure S1A. Images (C) and (D) illustrate the region-of-interest (ROI) choice in phantoms and humans.
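The fitting idea described above can be sketched in a few lines. This is a simplified illustration, not the qMRLab or Barral et al. implementation: it assumes real-valued data and a model of the form a + b·exp(-TI/T1), replaces the nonlinear search with a plain grid over T1, and uses hypothetical function names. For each candidate T1 the constants a and b are obtained in closed form by linear least squares, so only T1 is searched; for magnitude data, the two polarity-restoration cases are fitted and the one with the smaller residual is kept.

```python
import math


def fit_ab(tis, signal, t1):
    """Closed-form least-squares a and b for a fixed T1; returns (a, b, residual)."""
    e = [math.exp(-ti / t1) for ti in tis]
    n = len(tis)
    e_mean = sum(e) / n
    s_mean = sum(signal) / n
    denom = sum((x - e_mean) ** 2 for x in e)
    b = sum((x - e_mean) * (s - s_mean) for x, s in zip(e, signal)) / denom
    a = s_mean - b * e_mean
    residual = sum((s - (a + b * x)) ** 2 for x, s in zip(e, signal))
    return a, b, residual


def fit_t1(tis, signal, t1_grid):
    """Grid search over T1 only; a and b are eliminated at each grid point."""
    best = None
    for t1 in t1_grid:
        a, b, residual = fit_ab(tis, signal, t1)
        if best is None or residual < best[3]:
            best = (t1, a, b, residual)
    return best


def fit_t1_magnitude(tis, magnitude, t1_grid):
    """Polarity restoration: flip points before the minimum, excluding or including it."""
    i_min = min(range(len(magnitude)), key=lambda i: magnitude[i])
    fits = []
    for n_flip in (i_min, i_min + 1):  # TI < TI_min flipped, then TI <= TI_min flipped
        restored = [-m if i < n_flip else m for i, m in enumerate(magnitude)]
        fits.append(fit_t1(tis, restored, t1_grid))
    return min(fits, key=lambda fit: fit[3])


# Hypothetical noiseless example using the inversion times of the protocol (ms).
tis = [50.0, 400.0, 1100.0, 2500.0]
true_t1, true_a, true_b = 900.0, 1.0, -2.0
magnitude = [abs(true_a + true_b * math.exp(-ti / true_t1)) for ti in tis]
t1_grid = [float(t) for t in range(200, 3001)]  # 1-ms grid, an arbitrary choice
t1_hat, a_hat, b_hat, residual = fit_t1_magnitude(tis, magnitude, t1_grid)
print(t1_hat, a_hat, b_hat)
```

With only four inversion times and three unknowns, the fit is sensitive to noise near the signal null, which is one reason the choice between the two polarity cases is made on the residual rather than assumed.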

Image labeling and registration
The T1 plate (NiCl2 array) of the phantom has 14 spheres that were labeled as the regions of interest (ROIs) using a numerical mask template created in MATLAB, provided by NIST researchers (Figure 1C). To avoid potential edge effects in the T1 maps, the ROI labels were reduced to 60% of the expected sphere diameter. A registration pipeline in Python using the Advanced Normalization Tools (ANTs) [36] was developed and shared in the analysis repository of our GitHub organization (https://github.com/rrsg2020/analysis, filename: register_t1maps_nist.py, commit ID: 8d38644). Briefly, a label-based registration was first applied to obtain a coarse alignment, followed by an affine registration (gradientStep: 0.1, metric: cross correlation, number of steps: 3, iterations: 100/100/100, smoothness: 0/0/0, subsampling: 4/2/1) and a BSplineSyN registration (gradientStep: 0.5, meshSizeAtBaseLevel: 3, number of steps: 3, iterations: 50/50/10, smoothness: 0/0/0, subsampling: 4/2/1). The ROI label template was nonlinearly registered to each T1 map uploaded to OSF. For the human data, manual ROIs were segmented by a single researcher (M.B., 12+ years of neuroimaging experience) using FSLeyes [37] in four regions (Figure 1D), located in the genu, splenium, deep gray matter, and cortical gray matter. Automatic segmentation was not used because the data were single-slice and slice positioning was inconsistent between data sets.
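As an illustration of the 60% ROI reduction, the following sketch builds a circular mask whose diameter is a fraction of the expected sphere diameter. It is not the NIST template code: the function name, the pixel-based geometry, and the example sizes are assumptions chosen for clarity.

```python
def circular_roi(shape, center, sphere_diameter_px, scale=0.6):
    """Return a 2D boolean mask (list of rows) of a disc centered on a sphere.

    The disc diameter is scale * sphere_diameter_px, so voxels near the sphere
    edge are left out of the ROI.
    """
    radius = 0.5 * sphere_diameter_px * scale
    n_rows, n_cols = shape
    c_row, c_col = center
    return [
        [(r - c_row) ** 2 + (c - c_col) ** 2 <= radius ** 2 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical example: an 11-pixel-wide sphere centered in a 21 x 21 image.
full = circular_roi((21, 21), (10, 10), 11, scale=1.0)
reduced = circular_roi((21, 21), (10, 10), 11, scale=0.6)
print(sum(map(sum, full)), sum(map(sum, reduced)))
```

Shrinking the label trades a smaller voxel count per ROI for less contamination from the sphere boundary, where partial voluming and residual misregistration are most likely.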

Analysis and statistics
Analysis code and scripts were developed and shared in a version-controlled public GitHub repository.
The T1 fitting and data analysis were performed by M.B., one of the challenge organizers. Computational environment requirements were containerized in Docker [38,39] to create an executable environment that allows for analysis reproduction in a web browser via MyBinder [40]. Backend Python files handled reference data, database operations, ROI masking, and general analysis tools. Configuration files handled data-set information, and the data sets were downloaded and pooled using a script (make_pooled_datasets.py). The databases were created using a publicly available Jupyter Notebook script and subsequently saved in the repository. The mean T1 values of the ISMRM/NIST phantom data for each ROI were compared with temperature-corrected reference values and visualized in three different types of plots (linear axes, log-log axes, and error relative to the reference value). Temperature correction involved nonlinear interpolation of a NIST reference table of T1 values for temperatures ranging from 16 °C to 26 °C (2 °C intervals) as specified in the phantom's technical specifications. For the human data sets, the means and SDs for each tissue ROI were calculated from all submissions across all sites. Two of the submissions (one of phantom data [Submission 6 in Figure 1A] and one of human data [Submission 18 in Figure 1B]) were much larger than the others, because they included multiple acquisitions. Submission 6 consisted of data from one traveling phantom acquired at seven Philips 3T imaging sites, and Submission 18 was a large cohort of volunteers who were imaged on two 3T scanners, one GE, and one Philips. These data sets (identified in orange in Figures 1, 3, and 4) were used to calculate intrasubmission coefficients of variation (CoVs) (one per scanner/volunteer, identified by asterisks in Figure 1A,B), and intersubmission CoVs were calculated using one T1 map from each of these (orange) along with one from all other submissions (identified as green in Figures 1, 3, and 4; the T1 maps used in those CoV calculations are also indicated with asterisks in Figure 1A,B). All quality assurance and analysis plot images were stored in the repository. Additionally, the database files of ROI values and acquisition details for all submissions were also stored in the repository.
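The temperature correction step can be illustrated with a small interpolation routine. The kind of nonlinear interpolation is not specified in the text above, so this sketch assumes a quadratic (three-point Lagrange) scheme; the function name is hypothetical and the table values are made up for the example and are not NIST reference values.

```python
def interpolate_reference_t1(temps_c, t1_ms, target_c):
    """Quadratic (three-point Lagrange) interpolation of a reference T1 table.

    temps_c must be sorted in increasing order; target_c must lie inside the table.
    """
    if not temps_c[0] <= target_c <= temps_c[-1]:
        raise ValueError("temperature outside the tabulated range")
    # Use the three tabulated temperatures closest to the target.
    by_distance = sorted(range(len(temps_c)), key=lambda i: abs(temps_c[i] - target_c))
    nodes = sorted(by_distance[:3])
    result = 0.0
    for i in nodes:
        weight = 1.0
        for j in nodes:
            if j != i:
                weight *= (target_c - temps_c[j]) / (temps_c[i] - temps_c[j])
        result += weight * t1_ms[i]
    return result


# Hypothetical table for one sphere (NOT NIST values), 16 to 26 degrees C in 2-degree steps.
temps = [16.0, 18.0, 20.0, 22.0, 24.0, 26.0]
t1_table = [984.0, 991.0, 1000.0, 1011.0, 1024.0, 1039.0]
print(interpolate_reference_t1(temps, t1_table, 21.0))
```

In practice the target temperature would be the value recorded with the phantom's thermometer, and the routine would be applied once per sphere.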

Dashboard
To widely disseminate the challenge results, a web-based dashboard was developed (Figure 2, https://rrsg2020.dashboards.neurolibre.org). The landing page (Figure 2A) showcases the relationship between the phantom and brain data sets acquired at different sites/vendors. Selecting the icons labeled as "phantom" or "in vivo" and then clicking an ROI will display whisker plots for that region. Additional sections of the dashboard allow for displaying statistical summaries for both sets of data: a magnitude versus complex data fitting comparison, and hierarchical shift function analyses.

RESULTS
Figure 3 presents a comprehensive overview of the challenge results through violin plots, depicting intersubmission and intrasubmission comparisons in both phantom (A) and human (B) data sets. For the phantom (Figure 3A), the average intersubmission CoV for the spheres with T1 values in the human-brain range (Spheres 1-5, approximately 500 to 2000 ms) was 6.1%. After addressing outliers from two sites that had specific difficulties with Sphere 4 (signal null near a TI), the mean intersubmission CoV decreased to 4.1%. One participant (Submission 6, Figure 1) measured T1 maps using a consistent protocol at seven different sites, and the mean intrasubmission CoV across the first five spheres for this submission was 2.9%.
For the human data sets (Figure 3B), intersubmission CoVs for independently implemented imaging protocols were 5.9% for genu, 10.6% for splenium, 16% for cortical gray matter (GM), and 22% for deep GM. One participant (Submission 18, Figure 1) measured a large data set (13 individuals) on three scanners and two vendors, and the intrasubmission CoVs for this submission were 3.2% for genu, 3.1% for splenium, 6.9% for cortical GM, and 7.1% for deep GM. The bimodal appearance for the splenium, deep GM, and cortical GM for the sites used in the intersubmission analyses (green) can be explained by an outlier measurement, which can be seen in Figure 4E-G (Submission 3.001).
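For readers who want to reproduce this kind of summary on their own data, a coefficient of variation can be computed as sketched below. This is an assumption-laden illustration rather than the challenge's analysis code: it uses the sample standard deviation (n - 1), the function names are hypothetical, and the input numbers are invented.

```python
import statistics


def coefficient_of_variation(values):
    """CoV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def mean_cov_across_rois(t1_by_roi):
    """Average the per-ROI CoVs.

    t1_by_roi maps an ROI name to a list of mean T1 values (ms), one per T1 map.
    """
    covs = [coefficient_of_variation(values) for values in t1_by_roi.values()]
    return sum(covs) / len(covs)


# Invented example values (ms), one entry per T1 map.
example = {
    "roi_a": [90.0, 100.0, 110.0],
    "roi_b": [950.0, 1000.0, 1050.0],
}
print(coefficient_of_variation(example["roi_a"]), mean_cov_across_rois(example))
```

Applying the same function to one T1 map per submission gives an intersubmission figure, and applying it to the maps within a single submission gives an intrasubmission figure.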
A scatterplot of the T1 data for all submissions and their ROIs is shown in Figure 4 (phantom [A-C] and human brains [D-G]). The NIST phantom T1 measurements are presented in each plot for different axes types (linear, log, and error) to better visualize the results. Figure 4A shows good agreement for this data set in comparison with the temperature-corrected reference T1 values. However, this trend did not persist for low T1 values (T1 < 100-200 ms), as seen in the log-log plot (Figure 4B), which was expected because the imaging protocol is optimized for human-brain T1 values (T1 > 500 ms). Higher variability is seen for long T1 values (T1 ∼ 2000 ms) in Figure 4A. Errors exceeding 10% are observed in the phantom spheres with T1 values below 300 ms (Figure 4C), and 3-4 measurements with outlier values exceeding 10% error were observed in the human brain tissue range (∼500-2000 ms).

FIGURE 2. Dashboard. (A) Welcome page listing all the sites, the scan type (phantom/brain), the scanner vendor, and the corresponding site. (B) Phantom tab for a selected region of interest (ROI). (C) In vivo tab for a selected ROI. Link: https://rrsg2020.dashboards.neurolibre.org. GM, gray matter.
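The error metric used in this comparison is a relative difference from the temperature-corrected reference. A minimal sketch, with hypothetical function names and invented inputs, is:

```python
def percent_error(measured_ms, reference_ms):
    """Signed error of a measured T1 relative to its reference value, in percent."""
    return 100.0 * (measured_ms - reference_ms) / reference_ms


def exceeds_tolerance(measured_ms, reference_ms, tolerance_pct=10.0):
    """True when the absolute relative error is larger than the tolerance."""
    return abs(percent_error(measured_ms, reference_ms)) > tolerance_pct


# Invented values (ms): a short-T1 sphere and a sphere in the brain T1 range.
print(percent_error(115.0, 100.0), exceeds_tolerance(115.0, 100.0))
print(percent_error(980.0, 1000.0), exceeds_tolerance(980.0, 1000.0))
```

Because the error is normalized by the reference, a fixed absolute bias of a few milliseconds weighs far more on short-T1 spheres than on spheres in the brain range.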
Figure 4D-G displays the scatter plot data for human data sets submitted to this challenge, showing mean and SD T1 values for the white matter (WM; genu and splenium) and GM (cerebral cortex and deep GM) ROIs. Mean WM T1 values across all submissions were 828 ± 38 ms in the genu and 852 ± 49 ms in the splenium, and mean GM T1 values were 1548 ± 156 ms in the cortex and 1188 ± 133 ms in the deep GM, with less variation overall in WM compared with GM, possibly due to better ROI placement and less partial voluming in WM. The lower SDs for the ROIs of human database ID site 9 (Submission 18 in Figure 1, and seen in orange in Figure 4D-G) are due to good slice positioning, cutting through the AC-PC line and the genu for proper ROI placement, particularly for the corpus callosum and deep GM.

FIGURE 3. Summary of results of the challenge as violin plots displaying the intersubmission and intrasubmission comparisons for phantoms (A) and human brains (B). Green indicates submissions used for intersubmission analyses, and orange indicates the sites used for intrasubmission analyses. Interactive figure available at: https://preprint.neurolibre.org/10.55458/neurolibre.00023/. cGM, cortical gray matter; GM, gray matter.

DISCUSSION
This challenge explored whether different research groups could reproduce T1 maps based on the protocol information reported in a seminal publication [31]. Eighteen submissions independently implemented the IR T1 mapping acquisition protocol as outlined in Barral et al. [31] and reported T1 mapping data in a standard quantitative MRI phantom and/or human brains at 27 MRI sites, using systems from three different vendors (GE, Philips, and Siemens). The collaborative effort produced an open-source database of 95 T1 mapping data sets, including 39 ISMRM/NIST phantom and 56 human-brain data sets. The intersubmission variability was twice as high as the intrasubmission variability in both phantom and human-brain T1 measurements, demonstrating that acquisition details communicated via a paper are not sufficient for reproducing quantitative MRI measurements. This study reports the inherent uncertainty in T1 measures across independent research groups, which brings us one step closer to producing a practical baseline of variations for this metric. Overall, our approach did show improvement in the reproducibility of T1 measurements in vivo compared with researchers implementing T1 mapping protocols completely independently (i.e., with no central guidance), as literature T1 values in vivo vary more than reported here (e.g., Bojorquez et al. [41] report that published T1 values in WM vary between 699 and 1735 ms). We were aware that coordination was essential for a quantitative MRI challenge, which is why the protocol specifications we provided to researchers were more detailed than any public guidelines for quantitative MRI that were available at the time. Yet, even in combination with the same T1 mapping processing tools, this level of description (a paper + post-processing tools) leaves something to be desired.
This analysis highlights that more information is needed to unify all the aspects of a pulse sequence across sites, beyond what is routinely reported in a scientific publication. However, in a vendor-specific setting, this is a major challenge, given the disparities between proprietary development libraries [42]. Vendor-neutral pulse sequence design platforms [43][44][45] have emerged as a powerful solution to standardize sequence components at the implementation level (e.g., RF pulse shape, gradients). Vendor neutrality has been shown to significantly reduce the variability of T1 maps acquired using VFA across vendors [45]. In the absence of a vendor-neutral framework, a vendor-specific alternative is the implementation of a strategy to control the saturation of magnetization transfer across TRs [46]. Nevertheless, this approach can still benefit from a vendor-neutral protocol to enhance accessibility and unify implementations. This is because vendor-specific constraints are known to impose limitations on the adaptability of sequences, resulting in significant variability even when implementations are closely aligned within their respective vendor-specific development environments [47].
After reflecting on our reproducibility challenge design, we believe there are some improvements that would give additional insights if the challenge were to be repeated in the future. One major addition would be to distribute (1) a full T1 mapping protocol file that can be imported on the scanners (matched as closely as possible for each vendor) and (2) a vendor-neutral sequence file (e.g., using Pulseq [43], Gammastar [44], or RTHawk [45]), assuming sufficient sites would have the setup to use it. It would also be important to standardize the image reconstruction and postprocessing of the acquired data; this could be done using open tools such as Gadgetron [48] or BART [49]. However, this would require the authors to submit raw k-space data, which would substantially increase the data-set sizes and complicate the transfer and storage of the submissions. These two additions (matched full protocols and vendor-neutral sequences) would provide further information on how much each component of the scanner-to-T1 map pipeline contributes to the variation across independent sites. 0][21] However, these protocols have greater B1 sensitivity [26,50], requiring an additional B1 mapping protocol to be established and distributed to the researchers.
The 2020 Reproducibility Challenge, jointly organized by the Reproducible Research and Quantitative MR ISMRM study groups, led to the creation of a large open database of standard quantitative MR phantom and human-brain IR T1 maps. These maps were measured using independently implemented imaging protocols on MRI scanners from three different manufacturers. All collected data, processing pipeline code, computational environment files, and analysis scripts were shared with the goal of promoting reproducible research practices, and an interactive dashboard was developed to broaden access to and engagement with the resulting data sets (https://rrsg2020.dashboards.neurolibre.org). The differences in stability between independently implemented (intersubmission) and centrally shared (intrasubmission) protocols observed both in phantoms and in vivo could help inform future meta-analyses of quantitative MRI metrics [51,52] and better guide multicenter collaborations.
By providing access and analysis tools for this multicenter T1 mapping data set, we aim to provide a benchmark for future T1 mapping approaches. We also hope that this data set will inspire new acquisition, analysis, and standardization techniques that address non-physiological sources of variability in T1 mapping. This could lead to more robust and reproducible quantitative MRI and ultimately better patient care.
FIGURE 4. Measured mean T1 values versus temperature-corrected NIST reference values of the phantom spheres are presented as linear plots (A), log-log plots (B), and plots of the error relative to the reference T1 value (C). Green indicates submissions used for intersubmission analyses, and orange indicates the sites used for intrasubmission analyses. The dashed lines in (C) represent a ±10% error. Mean T1 values in two sets of regions of interest (ROIs), white matter (one 5 × 5 voxel ROI for genu, one 5 × 5 voxel ROI for splenium) and gray matter (GM; three 3 × 3 voxel ROIs for cortex, one 5 × 5 voxel ROI for deep GM). (G) The missing datapoints for deep GM for Submissions 1, 8, and 10 were due to the slice positioning of the acquisition not containing deep GM. Interactive figure available at: https://preprint.neurolibre.org/10.55458/neurolibre.00023/. cGM, cortical gray matter.