Reproducibility of quantitative structural and physiological MRI measurements

Abstract
Introduction: Quantitative longitudinal magnetic resonance imaging and spectroscopy (MRI/S) is used to assess the progression of brain disorders and treatment effects. Understanding the significance of MRI/S changes requires knowledge of the inherent technical and physiological consistency of these measurements. This longitudinal study examined the variance and reproducibility of commonly used quantitative MRI/S measurements in healthy subjects while controlling physiological and technical parameters.
Methods: Twenty-five subjects were imaged three times over 5 days on a Siemens 3T Verio scanner equipped with a 32-channel phased-array coil. Structural (T1, T2-weighted, and diffusion-weighted imaging) and physiological (pseudocontinuous arterial spin labeling, proton magnetic resonance spectroscopy) data were collected. Consistency of repeated images was evaluated with mean relative difference, mean coefficient of variation, and intraclass correlation (ICC). Finally, a "reproducibility rating" was calculated based on the number of subjects needed to detect a 3% and a 10% difference.
Results: Structural measurements generally demonstrated excellent reproducibility (ICCs 0.872-0.998) with a few exceptions. Moderate-to-low reproducibility was observed for fractional anisotropy measurements in the fornix and corticospinal tracts, for cortical gray matter thickness in the entorhinal, insula, and medial orbitofrontal regions, and for the count of periependymal hyperintense white matter regions. The reproducibility of physiological measurements ranged from excellent for most of the magnetic resonance spectroscopy measurements, to moderate for permeability-diffusivity coefficients in cingulate gray matter, to low for regional blood flow in gray and white matter.
Discussion: This study demonstrates a high degree of longitudinal consistency across structural and physiological measurements in healthy subjects, defining the inherent variability in these commonly used sequences. Additionally, this study identifies those areas where caution should be exercised in interpretation. Understanding this variability can serve as the basis for interpretation of MRI/S data in the assessment of neurological disorders and treatment effects.


| INTRODUCTION
Clinicians and scientists use longitudinal magnetic resonance imaging and spectroscopy (MRI/S) protocols to provide quantitative structural (T1- and T2-weighted imaging) and physiological (microstructural properties of molecular diffusion, cerebral blood flow [CBF], and concentrations of neurochemicals) measurements to assess the progression of neurological disorders and the therapeutic effects of treatment. We quantified technical and normal physiological variability in commonly used MRI/S measurements to study the consistency of repeated measurements in healthy volunteers. Quantitative analysis of imaging and spectroscopy data was performed using standardized analysis pipelines. We minimized technical variability by utilizing a single MRI scanner and technician. Physiological variability was minimized by studying a select healthy population, by restricting daily activities to each subject's consistent baseline, and by imaging over a short interval.
Previous replication efforts in neuroimaging have reported scan-to-scan variability in single-modality measurements (Acheson et al., 2017; Dickerson et al., 2008; Han et al., 2006; Jovicich et al., 2014; Li et al., 2015; Maclaren, Han, Vos, Fischbein, & Bammer, 2014). We evaluated a battery of commonly used MRI sequences and measurements that ascertained both structural and physiological states of the brain in a well-controlled group of healthy individuals. We report on the reproducibility and normal physiological variability of state-of-the-art neuroimaging and spectroscopic measurements, including reproducibility analysis for advanced bi-exponential diffusion-weighted imaging analysis. This included ascertainment of the reproducibility of cortical gray matter thickness, the volume and number of hyperintense white matter regions, resting CBF, fractional anisotropy of water diffusion, multi-b-value diffusion, and concentrations of important neurochemicals. The measurements included both gray-matter- and white-matter-specific values, providing an assessment of the normative tissue-specific variance. Presenting reproducibility data collected under controlled physiological conditions while minimizing methodological variability may help in planning future studies and performing power analyses of neuroimaging and spectroscopy measurements.
We selected three commonly used statistical metrics to provide a thorough assessment of reproducibility of MRI/S performed over a short interval in normal healthy volunteers. These metrics serve as the foundation for statistical inferences of the effects of disease or treatment on brain structure and/or physiology over time as measured by MRI/S. We used the variance observed across the three visits to perform a power analysis to calculate a hypothetical group size that is necessary to detect 3% and 10% group differences using a two-tailed t-test. This information should help to perform power analyses for the neuroimaging studies that utilize these measurements.
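The three consistency metrics named in the abstract (mean relative difference, coefficient of variation, and ICC) can be illustrated with a minimal sketch. The exact formulas used in the study are not reproduced in this text, so the definitions below are assumptions: a within-subject coefficient of variation, a pairwise relative difference, and a one-way random-effects ICC(1,1) for complete data. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def coefficient_of_variation(visits):
    """Within-subject CV in percent: sample SD across visits / mean across visits."""
    return 100.0 * stdev(visits) / mean(visits)


def relative_difference(visits):
    """Mean absolute difference over all visit pairs, in percent of each pair's mean."""
    diffs = [100.0 * abs(a - b) / ((a + b) / 2.0) for a, b in combinations(visits, 2)]
    return mean(diffs)


def icc_oneway(data):
    """ICC(1,1) from a one-way random-effects ANOVA (assumed ICC form).

    data: one list per subject, each holding the same number of visits.
    """
    n, k = len(data), len(data[0])
    grand = mean(v for row in data for v in row)
    row_means = [mean(row) for row in data]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((v - m) ** 2 for row, m in zip(data, row_means) for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


if __name__ == "__main__":
    # Hypothetical cortical thickness values (mm) for three subjects, three visits each.
    thickness = [[2.51, 2.49, 2.50], [2.62, 2.64, 2.63], [2.40, 2.41, 2.39]]
    print("mean CV (%):", mean(coefficient_of_variation(s) for s in thickness))
    print("mean relative difference (%):", mean(relative_difference(s) for s in thickness))
    print("ICC(1,1):", icc_oneway(thickness))
```

The mean coefficient of variation and mean relative difference are then the averages of the per-subject values. Subjects with a missing visit would need separate handling, which this sketch does not attempt.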

| Subjects
The study was reviewed and approved by the 59th Medical Wing, United States Air Force (USAF), Institutional Review Board. Subjects were active duty members of the USAF recruited to serve as controls for an ongoing study on the effects of occupational exposure to extreme hypobaria in aircrew. All participants were recruited with strict adherence to the Department of Defense Instruction for Protection of Human Subjects (Department of Defense, 2011). For all subjects, participation was voluntary without commander involvement or knowledge. All subjects provided informed consent prior to participation. Subjects did not receive compensation for participation.
Twenty-five healthy subjects (20 males/5 females; average age 25.8 ± 6.4 years, range 18-41 years) without hypertension, hyperlipidemia, or diabetes and meeting USAF Flying Class III neurological standards were recruited (McGuire et al., 2014a,b). All subjects were in a military training environment with consistently maintained meal times, sleep/wake times, and exercise program. Commencing 7 days prior to the first MRI and continuing throughout the study, all subjects were alcohol free, drug/medication free, and tobacco free. Any new or acute illness was disqualifying. No subject was exposed to commercial air travel.
To minimize diurnal physiological fluctuations, the daily time of repeat scans within the same subject was consistent for all three scans. All sequences were obtained during each MRI except for MRI#2, which did not include a fluid-attenuated inversion recovery (FLAIR) sequence due to time constraints. Three subjects did not return for MRI#3.

| Imaging methods
Imaging data were collected at the Wilford Hall Ambulatory Surgical Center, 59th Medical Wing, Joint Base San Antonio-Lackland, TX, using a Siemens 3T Verio scanner equipped with a 32-channel phased-array coil operated under quality control and assurance guidelines in accordance with recommendations by the American College of Radiology.

| Volumetric three-dimensional FLAIR
Three-dimensional FLAIR was utilized for analysis of white matter hyperintensities (WMH) as previously described (McGuire et al., 2014a,b). Briefly, FLAIR images were oriented to a common Talairach atlas-based stereotactic frame using a nine-parameter affine spatial transformation to ensure consistency of orientation for identification of the periependymal and subcortical regions (McGuire et al., 2013).
The volumes of the FLAIR regions were calculated in the subject's frame by using an inverse of the spatial transformation. An experienced neuroanatomist blinded to the MRI study number manually traced WMH, while a neuroradiologist similarly blinded to the MRI study number provided MRI interpretation. Intrarater test-retest reproducibility was high (r = .95). For each lobe, we manually counted the number of WMH and used freely available Mango software version 4.0 (RRID:SCR_009603; http://ric.uthscsa.edu/Mango) to compute the total volume of WMH. WMH were divided into periependymal (adjacent to the ventricles) and subcortical (McGuire et al., 2013). Three-dimensional imaging parameters were, for T1 magnetization-prepared rapid gradient echo, repetition time (TR) = 2200 ms, echo time (TE) = 2.85 ms, and isotropic resolution 0.80 mm, and for FLAIR, TR = 4500 ms, TE = 1 ms, and isotropic resolution 1.00 mm. T1 imaging data were collected using a motion-corrected protocol in which six individual segments were averaged following motion correction to improve signal-to-noise ratio (SNR) (Kochunov et al., 2006). The total T1 acquisition time was 18 min.

| Cortical gray matter thickness
The T1-weighted (T1W) image processing for cortical gray matter thickness was conducted using the freely available FreeSurfer software version 5.3 (RRID:SCR_001847; http://surfer.nmr.mgh.harvard.edu/fswiki) and a 10-mm surface smoothing kernel. We used the freely available Enhanced Neuroimaging Genetics through Meta-Analysis (ENIGMA; RRID:SCR_014649; http://enigma.ini.usc.edu/protocols/dti-protocols/) cortical gray matter thickness protocol, which included visual quality assurance and control. The cortical gray matter thickness is measured as the Euclidean distance from the white matter mesh vertex to the corresponding vertex on the cortical gray matter mesh.
Cortical gray matter thickness measurements were averaged for individual cortical areas for both hemispheres; the whole-brain cortical gray matter thickness measurement was obtained by averaging cortical gray matter thickness across left and right meshes. ENIGMA structural quality assurance/quality control (QA/QC) approach was used and one subject was excluded due to motion-related artifacts.

| High angular resolution diffusion imaging
High angular resolution diffusion imaging (HARDI) was utilized for diffusion tensor imaging (DTI) and fractional anisotropy (FA) as previously reported. Briefly, DTI data were collected using a single-shot echo-planar, single refocusing spin-echo, T2-weighted sequence with a spatial resolution of 1.7 × 1.7 × 3.0 mm and sequence parameters of TE/TR = 87/8,000 ms, field of view (FOV) = 200 mm, axial slice orientation with 50 slices and no gaps, 64 isotropically distributed diffusion-weighted directions, two diffusion weighting values (b = 0 and 700 s/mm²), and five b = 0 images. HARDI data for both groups were processed using the ENIGMA-DTI (http://enigma.ini.usc.edu/protocols/dti-protocols/) pipeline (Jahanshad et al., 2013). The ENIGMA-DTI analysis pipeline is based on the tract-based spatial statistics (TBSS) method, distributed as a part of the FSL package (Smith et al., 2006). The pipeline consists of a set of protocols and scripts to measure the average whole-brain FA value and average tract FA values for 10 major white matter tracts (corpus callosum, corticospinal, internal capsule, corona radiata, thalamic radiation, sagittal stratum, external capsule, cingulum, superior longitudinal fasciculus, and fronto-occipital). The pipeline incorporates quality assurance and control analyses, comprising visual inspection and two quantitative QA estimates: average motion and average projection distance. Prior research showed that FA estimates provided by this pipeline may become unstable if the average motion exceeds 2.5 mm and average projection distance exceeds 3.8 mm (Acheson et al., 2017). One DTI session was excluded from this analysis because it exceeded the motion threshold.
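For readers less familiar with the FA metric averaged by the pipeline above, the following sketch computes FA for a single voxel from the three eigenvalues of its diffusion tensor, using the standard textbook definition. It is an illustration only and is not part of the ENIGMA-DTI scripts; the function name and the example eigenvalues are hypothetical.

```python
import math


def fractional_anisotropy(l1, l2, l3):
    """FA from the three diffusion tensor eigenvalues (standard definition).

    FA = sqrt(3/2) * sqrt(sum((l_i - MD)^2)) / sqrt(sum(l_i^2)),
    where MD is the mean of the eigenvalues.
    """
    md = (l1 + l2 + l3) / 3.0
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0.0:
        return 0.0  # no measurable diffusion; treat the voxel as isotropic
    return math.sqrt(1.5 * num / den)


if __name__ == "__main__":
    # Hypothetical eigenvalues in units of 10^-3 mm^2/s for a coherent white matter voxel.
    print(fractional_anisotropy(1.7, 0.3, 0.3))
```

FA is 0 for perfectly isotropic diffusion and approaches 1 when diffusion occurs along a single axis, which is why partial volume averaging with gray matter or CSF lowers and destabilizes tract-average values.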

| Multi-b-value diffusion imaging (MBI) protocol
The MBI protocol was developed based on q-space protocols for in vivo mapping of water diffusion in the brain (Clark, Hedehus, & Moseley, 2002; Wu, Field, Whalen, & Alexander, 2011b; Wu et al., 2011a). This protocol consisted of 15 shells of b-values (b = 250, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000, 2500, 3000, 3500, and 3800 s/mm²; diffusion gradient duration = 47 ms, diffusion gradient separation = 54 ms). Thirty isotropically distributed diffusion-weighted directions were collected per shell, including 16 b = 0 images. The highest b-value (b = 3,800 s/mm²) was chosen because the SNR for the corpus callosum in the average diffusion image (SNR = 6.1 ± 0.7) measured in five healthy volunteers (ages 25-50 years) during protocol development approached the empirically selected lower limit of SNR = 5.0. The b-values and the number of directions per shell were chosen for improved fit of the biexponential model and SNR (Jones, Horsfield, & Simmons, 1999). The imaging data were collected using a single-shot, echo-planar, single refocusing spin-echo, T2-weighted sequence with a spatial resolution of 1.7 × 1.7 × 4.6 mm and seven slices prescribed in sagittal orientation to sample the midsagittal band of the corpus callosum. The sequence control parameters were TE/TR = 120/1,500 ms with FOV = 200 mm. The total scan time was about 10 min per subject.
four ROIs in cerebral white matter (the whole and the genu, body, and splenium of corpus callosum) and for the gray matter of the cingulate gyrus.
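The biexponential model itself is not written out in this text. As a hedged illustration, a commonly used two-compartment form consistent with the quantities reported later (the unrestricted water fraction and the permeability-diffusivity index, PDI) is shown below. The symbols S_0, D_u, and D_r, and the definition of PDI as the ratio of the two diffusivities, are assumptions rather than definitions taken from this section.

```latex
% Assumed two-compartment signal model for the diffusion-weighted signal S at b-value b:
%   M_u : fraction of signal from the unrestricted compartment
%   D_u : apparent diffusivity of the unrestricted compartment
%   D_r : apparent diffusivity of the restricted compartment
\[
  \frac{S(b)}{S_0} = M_u \, e^{-b D_u} + (1 - M_u) \, e^{-b D_r},
  \qquad
  \mathrm{PDI} = \frac{D_u}{D_r}
\]
```

Under this form, the low b-value shells mainly constrain the fast (unrestricted) component and the high b-value shells mainly constrain the slow (restricted) component, which is consistent with the wide range of b-values sampled by the protocol.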

| Pseudocontinuous arterial spin labeling (pCASL) imaging
Pseudocontinuous arterial spin labeling (pCASL; RRID:SCR_015004) imaging data for gray and white matter were collected using per 100 g/min, was calculated under the assumption that the postlabel delay was longer than average transfer time (Wang et al., 2002), where labeling efficiency was set at 0.99 and the mean transit time was set to 0.7 s based on empirical data. Data collection preceded publication of the consensus guidelines for ASL parameters in dementia (Alsop et al., 2014) and was therefore not based on them. Instead, the imaging parameters were derived empirically to maximize detection of white matter perfusion by increasing labeling efficiency and signal-to-noise ratio. This was performed based on the methods described by others (van Gelderen, de Zwart, & Duyn, 2008; Wey, Wang, & Duong, 2012).
In short, pCASL data were collected in five healthy volunteers representative of the study population (average age 25.1 ± 6.4 years, range 20-35 years) over a range of labeling offset distances, labeling durations, and postlabeling delay times. Least-squares fitting was used to calculate the sequence parameters that maximized the labeling efficiency across cerebral white matter (WM) in all five subjects. This ensured that the derived parameters took into account the geometry of the MRI scanner and incorporated the vascular physiology of subjects in this sample.

| Magnetic resonance spectroscopy
Proton magnetic resonance spectroscopy (MRS) data were acquired from voxels placed in frontal white matter and the anterior cingulate. For the frontal white matter region, short TE and long TE data were acquired using point resolved spectroscopy localization

| Statistical analysis
We used the R statistical software (https://www.r-project.org/) and SPSS (IBM Corp., Armonk, NY) for data analysis. Means and confidence intervals for each measure are found in Tables 1-7. Finally, we calculated a "reproducibility rating" based on the variance observed for each trait across the three visits. This rating is based on the number of subjects per group needed to detect a 3% and a 10% change in each measure, calculated using a power analysis as detailed elsewhere (Iscan et al., 2015). The power analysis was performed under the following assumptions: a two-group comparison with an equal number of subjects per group, performed using a two-tailed t-test with the significance level set at p = .05 and a power of 0.90. We assigned an empirical rating of high reproducibility to measurements that required two groups of <20 subjects each. Medium and low reproducibility ratings were assigned to measurements that required two groups of 40 subjects and >40 subjects per group, respectively.
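The group-size calculation behind the rating can be sketched as follows. This is a minimal illustration, not the study's code. It uses the normal approximation to the two-sample, two-tailed t-test (which slightly underestimates the group size for small samples) and assumes the variability of a measure is expressed as a standard deviation in percent of its mean. Function names are hypothetical, and treating every requirement from 20 to 40 subjects as "medium" is an assumption, because the text only names the cut points <20, 40, and >40.

```python
import math
from statistics import NormalDist


def subjects_per_group(sd_percent, diff_percent, alpha=0.05, power=0.90):
    """Subjects per group needed to detect a relative group difference.

    Normal approximation for a two-group, two-tailed comparison:
        n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / diff)^2
    sd_percent and diff_percent are both expressed in percent of the mean.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    n = 2.0 * ((z_alpha + z_power) * sd_percent / diff_percent) ** 2
    return math.ceil(n)


def reproducibility_rating(n_per_group):
    """Map a required group size to a rating (boundary handling is an assumption)."""
    if n_per_group < 20:
        return "high"
    if n_per_group <= 40:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Hypothetical measure with a standard deviation of 3% of its mean.
    for diff in (3.0, 10.0):
        n = subjects_per_group(3.0, diff)
        print(f"{diff:.0f}% difference: {n} per group ({reproducibility_rating(n)})")
```

Because the required group size scales with the square of the ratio of variability to effect size, a measure that needs a large sample to detect a 3% difference can still rate as highly reproducible for a 10% difference.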

| RESULTS
We separated measurements into structural and physiological.
Structural measurements included cortical gray matter thickness, FLAIR WMH volume and count, DTI-FA, and MBI. Physiological measurements included CBF and concentrations of neurochemicals. In general, structural measurements demonstrated greater consistency than physiological measurements (Tables 1-7). was more consistent in the corpus callosum than anterior cingulate, with M_u more consistent than PDI (Table 4; ICC range 0.434-0.967).
All measurements were high on the 3% and 10% reproducibility rating scale.
MCV, MRD, and ICC for whole-brain gray matter pCASL were consistent, while individual segments varied, with greatest variability in the inferior temporal gyrus anterior, subcallosal cortex, cingulate gyrus anterior, parahippocampus gyrus anterior, and temporal fusiform cortex, posterior division (Table 5; ICC range 0.885-0.971). Whole-brain white matter pCASL was also consistent, with again more variability in individual change over a short interval in a healthy cohort would be anticipated.
Physiological variability, however, including activity level change, diurnal variation, or nutritional and/or alcohol intake, might impact measurements. Prior to interpreting the effect of a disease state, reproducibility or consistency must be known. The aim of this study of 25 healthy subjects is to provide reference data on intrasubject variability by controlling for these other factors, thus establishing a baseline power level to help with understanding the statistical significance of the observed changes.
This manuscript quantifies reproducibility and normal physiological variability for commonly used imaging and spectroscopic measurements. Previous efforts demonstrated high scan-rescan reproducibility of neuroimaging measurements, including volumetric measurements for subcortical brain structures (Maclaren et al., 2014), cortical gray matter thickness (Dickerson et al., 2008; Han et al., 2006; Li et al., 2015), and diffusion tensor measurements (Acheson et al., 2017; Jovicich et al., 2014). Likewise, several prior studies quantified scan-rescan stability and reproducibility of MRS measurements at 3T (Wellard, Briellmann, Jennings, & Jackson, 2005; Wijtenburg & Knight-Scott, 2011; Wijtenburg et al., 2013). Our approach of three scanning sessions and tightly controlled methodological parameters provides the opportunity to assess these measurements based on the normal physiological variance among them.
White matter hyperintensity quantification for subcortical lesion volume/count was highly reproducible. Similarly, periependymal white matter hyperintensity volume was reproducible, but count less so. Pulsation of ventricular cerebrospinal fluid (CSF) and subject motion may cause artifacts, with partial volume averaging impeding accurate segmentation of small (<1 cm³) periependymal lesions (De Coene et al., 1992; Gawne-Cain, Silver, Moseley, & Miller, 1997; Kates, Atkinson, & Brant-Zawadzki, 1996). Subcortical lesions are unaffected by CSF pulsation artifacts and had higher ICC. We believe that the higher variance in periependymal count measurements is secondary to these artifacts, which make accurate identification of small periependymal lesions more challenging. This effect is further exaggerated by the much smaller (3-5 times) number and volume of lesions in this healthy sample compared to those reported in the general population (Kochunov et al., 2009, 2010), which magnifies the effect of misidentifying even a single small lesion. Overall, whole-brain average and regional cortical gray matter thickness and volumetric measurements showed excellent ICC and other measures of reproducibility that were consistent with other published results (Iscan et al., 2015; Liem et al., 2015; Yang et al., 2016).
The cortical gray matter thickness of the entorhinal, insula, and medial orbitofrontal regions demonstrated lower reproducibility. These three cortical gray matter areas are located in the inferior frontal portion of the brain, where susceptibility artifacts at the tissue-bone interface make precise identification of boundaries more difficult. Therefore, caution is recommended when interpreting cortical gray matter thickness findings from these areas. The power analysis estimates provided here showed a smaller number of subjects per group (N ~ 10) than that reported by Liem et al. (N = 40; Liem et al., 2015) but similar to that reported by Iscan et al. (N = 19; Iscan et al., 2015). This is due to a difference in methodology. Our approach was based on the variance in the average gray matter (GM) thickness measurements, as was that of Iscan et al. (2015). The power analysis by Liem and colleagues provided the number of subjects needed to detect a vertex-specific difference in mean thickness by accounting for vertex-wise variance across the surface.

Whole-brain FA was highly reproducible, with individual tracts showing only slightly lower reproducibility metrics than the whole-brain average FA. The least consistency was observed in the fornix (FX), corticospinal (CST), and superior fronto-occipital (SFO) tracts. The lack of consistency in these three tracts can be explained by partial volume averaging and/or spatial misregistration and is similar to previous reports (Vollmar et al., 2010). The FX and CST are long, tubular white matter tracts that pass through areas affected by magnetic susceptibility and are therefore prone to geometric distortions. Our overall results are comparable to and MCV of 0.69% (0.42%-0.99%) (Veenith et al., 2013). The regional pattern of reproducibility measurements was similar to that reported in Acheson et al. (2017). Future work evaluating these regions should take caution in interpreting any results localized to the CST and FX.
Measurements of the unrestricted water fraction M_u and the permeability-diffusivity index from the diffusion-weighted data collected in the white matter of the corpus callosum were highly reproducible. The same measurements performed in the anterior cingulate gray matter were more variable, suggesting tissue-specific variance in normal physiology. There are two potential sources of variability in the anterior cingulate. The variance in diffusion-based measurements is likely to be influenced by normal day-to-day physiological variability in the gray matter. The higher variance may also be due to methodological sources, as measurements from the dense and consistently oriented fibers of the corpus callosum may have greater reproducibility than measurements from the cortical GM ribbon, which is adjacent to WM and CSF. The tissue-related difference in reproducibility was likewise observed in the resting CBF as measured by pCASL. The whole-brain average CBF in cerebral white matter showed higher reproducibility than in cerebral gray matter, while the anterior cingulate gyrus was lower. Our results are consistent with other reported stud- reported an MCV of 8.3%-9.7% (Li, Babb, Soher, Maudsley, & Gonen, 2002), while Mullins and colleagues noted an MCV < 5% (Mullins et al., 2003). Our results are similar to other reported series. In six subjects scanned twice using a 30 ms point resolved spectroscopy sequence, Mullins and colleagues observed comparable MCV (Mullins et al., 2003).
This study controlled the methodological parameters by using the same scanner, head coil, and MR operator. However, these conditions are unlikely to be maintained throughout the life of longer longitudinal or cross-sectional studies, where scanner upgrades, significant hardware changes such as a change of head coil, and other methodological changes may be expected. To address these aspects of longitudinal studies, our group and others have used two strategies to accommodate methodological changes: collection of calibration data and use of meta- and mega-analyses (Jahanshad et al., 2013; Kochunov et al., 2015; McGuire et al., 2014a). In the first approach, calibration data are collected before and after a change to derive cross-calibration parameters. This approach provides direct normalization and is the only appropriate method for longitudinal studies in which different imaging points are collected on different scanners. The following challenges must be met: the calibration sample must match the constitution of the imaging sample, and a sufficient number of calibration subjects must be collected to reduce uncertainty in the calibration parameters. For instance, a more sensitive MRI coil provided higher FLAIR region counts (rise of 15%) with a less dramatic change in volume (rise of 3%) because of its ability to detect smaller lesions. Therefore, collecting FLAIR calibration datasets in a younger population with fewer and smaller lesions may have biased the calibration results. Likewise, while collecting data from 10 subjects was sufficient for FLAIR calibration, calibration of DTI data required 20 subjects to reduce uncertainty in FA measurements for smaller and more variable white matter tracts (Acheson et al., 2017). Alternatively, cross-sectional and longitudinal studies with short interimaging periods can use statistical aggregation approaches that treat samples collected on different hardware as independent datasets. 
The ENIGMA consortium has demonstrated the utility of meta- and mega-analysis of quantitative neuroimaging data (Jahanshad et al., 2013; Kochunov et al., 2015).
This study measured the stability, reproducibility, and reliability in a healthy normal population of commonly utilized MRI modalities over an interval of 5 days while controlling for technical and physiological factors. We assessed the commonly used neuroimaging measures based on the ability to reproduce them and identified a subset of measurements with high variability due to methodological and/or physiological variances. We provide a power calculation-based reproducibility rating and the number of subjects per group necessary to detect a 3% or 10% change. Caution should be exercised when reporting and interpreting outcomes based on these. Overall, this study reports high reliability for most of the neuroimaging measurements making them valuable for evaluation of disease states or treatment protocols.