Whole‐brain atrophy assessed by proportional‐ versus registration‐based pipelines from 3T MRI in multiple sclerosis

Abstract

Background and Purpose: Whole‐brain atrophy is a standard outcome measure in multiple sclerosis (MS) clinical trials, as assessed by various software tools. The effect of processing method on the validity of such data obtained from high‐resolution 3T MRI is not known. We compared two commonly used methods of quantifying whole‐brain atrophy.

Methods: Three‐dimensional T1‐weighted and FLAIR images were obtained at 3T in MS (n = 61) and normal control (NC, n = 30) groups. Whole‐brain atrophy was assessed by two automated pipelines: (a) SPM8 to derive brain parenchymal fraction (BPF, a proportional‐based method); (b) SIENAX to derive normalized brain parenchymal volume (BPV, a registration‐based method). We assessed agreement between BPF and BPV, as well as their relationship to Expanded Disability Status Scale (EDSS) score, timed 25‐foot walk (T25FW), cognition, and cerebral T2 (FLAIR) lesion volume (T2LV).

Results: Brain parenchymal fraction and BPV showed only partial agreement in the MS group (r = 0.73) and in NC (r = 0.28). Both methods showed atrophy in MS versus NC (BPF p < 0.01, BPV p < 0.05). Within the MS group, BPF (p < 0.05) but not BPV (p > 0.05) correlated with EDSS score, whereas BPV (p = 0.03) but not BPF (p = 0.08) correlated with T25FW. Both metrics correlated with T2LV (p < 0.05) and cognitive subscales. BPF (p < 0.05) but not BPV (p > 0.05) showed lower brain volume in cognitively impaired (n = 23) versus cognitively preserved (n = 38) patients. However, direct comparisons of BPF and BPV sensitivities to atrophy and clinical correlations were not statistically significant.

Conclusion: Whole‐brain atrophy metrics may not be interchangeable between proportional‐ and registration‐based automated pipelines from 3T MRI in patients with MS.


| BACKGROUND
Whole-brain atrophy is a commonly used research metric to quantify multiple sclerosis (MS) pathology (Neema, Stankiewicz, Arora, Guss, & Bakshi, 2007) and remains one of the strongest correlates and predictors of clinical status (De Stefano et al., 2014). Investigators have applied a myriad of published proprietary and open-source methods to quantify brain volume loss (Giorgio, Battaglini, Smith, & De Stefano, 2008), leading to heterogeneous segmentation procedures across sites and studies, without any agreed-upon standard approach (Bermel & Bakshi, 2006). This heterogeneity is brought to the surface by the regular incorporation of whole-brain atrophy as a supportive outcome measure in Phase III MS therapeutic clinical trials, in which registration-based {affine-fit to an external multiple subject brain size atlas, e.g., normalized brain parenchymal volume [BPV; OPERA I/II (Hauser et al., 2017), FREEDOMS (De Stefano et al., 2016), ALLEGRO (Comi et al., 2012), DEFINE (Arnold et al., 2014)]}, or proportional-based {scaled to the subject's own intracranial cavity, e.g., brain parenchymal fraction [BPF; CARE-MS I/II (Arnold et al., 2016), AFFIRM (Miller et al., 2007), TEMSO (O'Connor et al., 2011)]} methods have been employed. Moreover, this challenge is amplified by the observations that the analysis of the same MRI image sets using different segmentation pipelines can produce conflicting findings (O'Connor et al., 2011;Radue et al., 2017;Rovaris, Comi, Rocca, Wolinsky, & Filippi, 2001;Sormani et al., 2004), which hamper the ability to draw firm conclusions on therapeutic effects, and may invalidate the comparison of results across trials.
Significant technical challenges arise in the measurement of cross-sectional and longitudinal brain volume loss, especially at a fully automated scale necessary for efficient deployment in routine clinical practice. MRI-derived volumetrics are prone to deviations throughout the data pipeline, including at the acquisition stage (e.g., head motion, hardware nonuniformity including magnetic field strength, gradient distortions, and pulse sequence type and parameters; Chu, Hurwitz, Tauhid, & Bakshi, 2017;Papinutto et al., 2017;Sharma et al., 2004;Shinohara et al., 2017) and segmentation procedure (e.g., preprocessing steps-inhomogeneity correction, method of tissue class segmentation, and normalization; Chard, Parker, Griffin, Thompson, & Miller, 2002;Chu, Hurwitz, et al., 2017;Durand-Dubief et al., 2012;Granberg et al., 2016;Kazemi & Noorizadeh, 2014;Popescu, Schoonheim, et al., 2016;Vidal-Jordana et al., 2017). Furthermore, brain volume may vary based on pathophysiological factors, including recent start of immunomodulatory therapy, acute inflammation, hydration status, time of day, tobacco use, genetics, and comorbid conditions (Rocca et al., 2017). As MRI technology evolves and increasingly precise high-field (e.g., 3T) magnets proliferate in clinical practice, there remains an ongoing need for critical evaluation of the sensitivity and validity of postprocessing software pipelines (Chu et al., 2016;Stankiewicz et al., 2011).
Previous MRI research has explored methodological aspects of precision (i.e., reproducibility), accuracy (i.e., relation to gold standard maps), and validity (i.e., relationship to clinical "truth") of whole-brain and regional tissue loss in MS. Recent studies have examined the precision of metrics from 1.5T or 3T scanners using standardized acquisition parameters and software pipelines; all concluded that intrascanner variance was generally minimal, whereas interscanner variability was consistently a source of significant bias (Biberacher et al., 2016;Durand-Dubief et al., 2012;Papinutto et al., 2017;Shinohara et al., 2017). The type of postprocessing software pipeline was also associated with divergent measurements in brain volumetrics in those studies. The accuracy and validity of MRI-derived metrics has also been explored in reference to both clinical and histopathological metrics. A recent study by Popescu, Klaver, et al. (2016) correlated postmortem, histopathologically defined cortical thickness with MRI-acquired cortical gray matter (GM) measurements at 1.5T; the authors found statistically significant correlations only when using manually corrected (but not automated) pipelines in SIENAX and FreeSurfer.
A separate study from the same group compared postprocessing pipelines in SIENAX, SPM, and FreeSurfer to evaluate the link between GM atrophy and cognitive performance in MS; although the software pipelines generally exhibited similar clinical correlations with cognitive variables, the authors found significant differences in deep GM and cortical structure measurements based, at least partly, on the choice of registration template/atlas (Popescu, Schoonheim, et al., 2016). The goal of this study was to compare the validity of two freely available widely used automated postprocessing algorithms for the assessment of normalized wholebrain volume from 3T MRI. We examined patients with MS and normal controls (NC) using two methods: both proportional-based [SPM8 to measure BPF (Dell'Oglio et al., 2015)] and registrationbased (SIENAX to measure BPV).

| METHODS

| Subjects
We prospectively enrolled 61 patients with MS and 30 NC; part of the data from these subjects and the recruitment/collection procedures have been published previously (Dell'Oglio et al., 2015).
In brief, inclusion criteria were: age 18-55, no significant medical comorbidities, and no change in disease-modifying therapy in the 6 months prior to examination. MRI was obtained within 3 months of the neurological examination. Demographic and clinical data are summarized in Table 1. Clinical data were obtained by MS specialists, including Expanded Disability Status Scale (EDSS) scoring and the timed 25-foot walk (T25FW). This study was approved by our institutional review board, and all subjects provided written informed consent.

| Neuropsychological data acquisition and analysis

Cognitive function was assessed with the Minimal Assessment of Cognitive Function in MS (MACFIMS) battery (Benedict et al., 2006), which was administered by a clinical psychologist and her supervised research fellow. MACFIMS scores were corrected for baseline depression (CES-D) scores and compared to regression-based norms from an NC sample (Parmenter, Testa, Schretlen, Weinstock-Guttman, & Benedict, 2010). Cognitive impairment was defined as performance worse than the 5th percentile on two or more cognitive measures; subjects who did not meet these criteria were defined as cognitively preserved.

| Image analysis
All images were inspected for quality and processed through two separate pipelines (Figure 1). BPF: as previously described (Dell'Oglio et al., 2015), raw MDEFT images were manually de-skulled, aligned to the MNI152 template, intensity normalized using the N3 nonparametric nonuniform intensity normalization algorithm, and automatically segmented using the SPM8 (Statistical Parametric Mapping, http://www.fil.ion.ucl.ac.uk/spm/software/) unified segmentation model into GM, white matter (WM), and CSF volumes. Intracranial volume (ICV) was calculated as the sum of GM + WM + CSF, and BPF was calculated as (GM + WM)/ICV. In the BPV pipeline, raw MDEFT images were resliced to the axial plane, followed by removal of all slices inferior to the cervicomedullary junction using JIM v7 (www.xinapse.com). Images then underwent automated segmentation and template normalization using SIENAX (Smith et al., 2002), part of FSL v5.0 (Smith et al., 2004), with a previously optimized brain extraction tool (BET) threshold of 0.2 (Chu et al., 2016). T2-hyperintense lesion volumes were obtained by expert semiautomated segmentation with an edge-finding tool based on local image intensity thresholds using JIM v5, as previously published (Dell'Oglio et al., 2015); manual corrections were applied as needed (Ceccarelli et al., 2012). To determine whether manual versus default (automated) de-skulling would affect the results in SIENAX, we analyzed scans from three subjects using manually skull-stripped images with a BET threshold of 0.01 (for maximal brain extraction), normalized with the original scaling factor from non-skull-stripped data; this approach provided similar BPVs (within 20 ml of the non-skull-stripped extraction; mean ± SD = −5.66 ± 22.2 ml; range: −19 to 20 ml). Thus, we chose to employ the fully automated SIENAX algorithm to obtain BPV in this study.
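The proportional normalization above reduces to simple arithmetic. The following minimal sketch illustrates it (the function name and example volumes are hypothetical; in practice, the GM/WM/CSF volumes come from the SPM8 segmentation output):

```python
def brain_parenchymal_fraction(gm_ml: float, wm_ml: float, csf_ml: float) -> float:
    """Proportional-based normalization: brain parenchyma (GM + WM)
    scaled to the subject's own intracranial volume (GM + WM + CSF)."""
    icv_ml = gm_ml + wm_ml + csf_ml
    return (gm_ml + wm_ml) / icv_ml

# Illustrative (not patient-derived) volumes in ml:
bpf = brain_parenchymal_fraction(gm_ml=650.0, wm_ml=550.0, csf_ml=300.0)
# (650 + 550) / 1500 = 0.80
```

Because BPF is a ratio of volumes from the same segmentation, it is dimensionless and requires no external atlas for scaling.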

| Statistical analysis
Correlations between BPF and BPV in the MS and NC populations were calculated using Pearson's correlation coefficients. Group differences between MS and NC were assessed using t tests, with linear regression to correct for age and gender. The difference in the estimated effect size comparing MS and NC between the BPF and BPV segmentation methods was calculated as the difference in Cohen's d, and the 95% confidence interval (CI) for the difference was calculated using the percentile bootstrap method. Clinical correlations were obtained using Spearman's correlation coefficient (EDSS, T25FW, disease duration) or Pearson's correlation coefficient (age, gender), and partial correlation coefficients were used to correct for age and gender. For the comparison between the BPF and BPV segmentation methods regarding their correlations with EDSS and T25FW, the difference in the correlation coefficients was calculated, and the 95% CI was obtained using the percentile bootstrap method. Correlations between BPF, BPV, and components of the MACFIMS were estimated using Pearson's correlation coefficients, and partial correlation coefficients were used to adjust for age and gender. In addition, Meng's test was used to compare the correlated correlation coefficients between the BPF and BPV measurements and the MACFIMS components (Meng, Rosenthal, & Rubin, 1992). p-Values <0.05 were considered statistically significant. Analyses were performed using R (www.r-project.org) with the ppcor (Kim, 2015) and cocor (Diedenhofen & Musch, 2015) libraries.
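The percentile-bootstrap comparison of effect sizes described above can be sketched as follows (an illustrative Python translation, not the study's actual R code; function and variable names are hypothetical). Note that the same resampled subjects are used for both methods, preserving the within-subject pairing of the two measurements:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with a pooled-SD denominator."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def d_difference_ci(ms_a, nc_a, ms_b, nc_b, n_boot=2000, seed=0):
    """95% percentile-bootstrap CI for d(method A) - d(method B).

    Subjects are resampled with replacement within the MS and NC
    groups; each bootstrap draw reuses the same subject indices for
    both methods, since both measured the same individuals.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ims = rng.integers(0, len(ms_a), len(ms_a))
        inc = rng.integers(0, len(nc_a), len(nc_a))
        diffs[i] = (cohens_d(ms_a[ims], nc_a[inc]) -
                    cohens_d(ms_b[ims], nc_b[inc]))
    return np.percentile(diffs, [2.5, 97.5])
```

The same resample-then-recompute pattern applies to the CI for the difference in correlation coefficients.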

| RESULTS
The correlation between BPF and BPV is shown in Figure 2 (MS: r = 0.73; NC: r = 0.28). Effect sizes for discriminating the MS and NC groups did not differ in direct comparisons between BPF and BPV (95% CI: −0.643, 0.113; p > 0.05). Regarding the correlations between BPF or BPV and neurologic function/lesion variables (Table 2), BPF but not BPV correlated with EDSS score, whereas BPV but not BPF correlated with T25FW. However, the differences in correlation with disability between the segmentation methods were not statistically significant. The relationship between normalized whole-brain volume measures and cognition is shown in Tables 3 and 4. Brain parenchymal fraction showed statistically significant differences in whole-brain volume in cognitively impaired versus cognitively preserved patients both before (p = 0.02) and after (p = 0.03) age and gender correction (Table 3). However, there was only a trend toward lower whole-brain volume as measured by BPV in cognitively impaired versus cognitively preserved patients (p = 0.073), which did not attain significance following adjustment for age and gender (p = 0.14; Table 3). Table 4 shows the correlations between the normalized whole-brain volume measures and cognitive (MACFIMS) subsets; both BPF and BPV were significantly correlated with several MACFIMS components.

FIGURE 1 Comparison of image processing steps for the proportional- and registration-based methods of determining normalized whole-brain volume. Both methods used the same 3D, T1-weighted MDEFT source images at 3T. Brain parenchymal fraction (BPF, left, a proportional-based method) began with manual skull-stripping, followed by automated SPM8 registration to the MNI-152 atlas, nonparametric intensity normalization, and tissue class segmentation with the bias field tool disabled, yielding mutually exclusive maps for cerebrospinal fluid (CSF), gray matter (GM), and white matter (WM). BPF (bottom left) is calculated as the sum of the gray and white matter volumes divided by the total intracranial volume, represented as the sum of GM + WM + CSF. Normalized brain parenchymal volume (BPV, right, a registration-based method) began with manual neck removal to the cervico-medullary junction, followed by automated SIENAX-based brain extraction with bias field correction enabled (orange highlight), registration to the MNI-152 template to determine the skull-based scaling factor, and intensity normalization and tissue class segmentation using a Markov random field model with the associated expectation-maximization algorithm. GM and WM volumes are summed to yield the BPV, which is multiplied by a subject-specific scaling factor to yield normalized BPV (red highlight, bottom right).

| DISCUSSION
Our cross-sectional study suggests a difference in whole-brain volume measurements between proportional- and registration-based automated pipelines, despite the previously demonstrated high scan-rescan reproducibility of whole-brain and regional deep gray matter volume (mean coefficient of variation <1%; Chu, Kim et al., 2017).
Our data demonstrate that cross-sectional postprocessing methods require careful interpretation, especially as brain volume loss evolves into a potential metric for clinical decision-making in MS.
Our results are in line with several prior studies that have demonstrated improved MS-related clinical validity for a proportional-based over a registration-based metric for cross-sectional data. Gao and colleagues used a heavily T2-weighted approach at 3T to determine the total volume of intracranial cerebrospinal fluid and derived a "brain free water" fraction similar to an (inverse) BPF; this parameter outperformed a T1-weighted registration-based approach (Lesion-TOADS) in correlating with clinical variables including EDSS score, the 9-hole peg test, and the symbol digit modalities test (Gao, Nair, Cortese, Koretsky, & Reich, 2014). A separate group found that BPF derived from semiautomated methods at 1.5T outperformed the automated registration-based method using SIENAX with regard to accuracy and clinical validity against EDSS (Zivadinov et al., 2005), although this could be at least partially attributed to suboptimal brain extraction with the latter method.
Comparisons of postprocessing pipelines are complicated by the sheer number of potential underlying variables that differ between methods, as well as a lack of a clear "ground truth" gold standard.
Here we chose a pragmatic high-level approach to compare pipeline clinical validity; other authors have previously compared individual processing steps as well, yielding insight into sources of variability in healthy populations or simulated datasets. The SPM and FSL pipelines used here rely on inherently different statistical models and assumptions when performing (a) brain extraction, (b) intensity normalization and tissue segmentation, and (c) template registration/normalization. Thus, one potential limitation of our study is that we cannot specify the contribution of each of these factors to overall errors in clinical validity. Regarding (a) brain extraction, our BPF pipeline employed manually skull-stripped data, whereas our BPV pipeline used native images, as generally required to obtain a skull-based normalization factor. Although manual skull-stripping is closer to a gold standard for determining ICV, it is time-consuming and has been largely replaced with automated techniques such as BET (Smith et al., 2002), SPM's integrated tissue segmentation (Ashburner & Friston, 2005), or FreeSurfer's watershed algorithm (Dale et al., 2004).

TABLE 2 Two normalized whole-brain volume measures correlated with clinical/lesion variables in the MS group (n = 61). Notes: Age and disease duration results are Pearson's correlation r (p-value); EDSS, T25FW, and T2LV results are Spearman's correlation r (p-value). Following adjustment for age and gender, the results provided are partial correlations; age is corrected for gender only. BPF: brain parenchymal fraction; BPV: normalized brain parenchymal volume; EDSS: Expanded Disability Status Scale; T25FW: timed 25-foot walk; T2LV: cerebral T2 hyperintense lesion volume; n: number of subjects. *p < 0.05.

TABLE 3 Two normalized whole-brain volume measures: relationship to cognitive status in the MS group. Notes: Values are mean ± SD. BPF: brain parenchymal fraction; BPV: normalized brain parenchymal volume; n: number of subjects.
As prior authors have noted, the FSL BET can also be a significant source of error (Popescu et al., 2012; Zivadinov et al., 2005), and we found tissue misclassification in several subjects using the default settings; neck cropping and changing the default parameters (-f 0.2 and -B enabled) provided an optimal solution for our dataset without any significant misclassification errors (Chu et al., 2016). In the absence of visually prominent errors, several groups have concluded that brain extraction methods are generally a very small source of variance (Clark, Woods, Rottenberg, Toga, & Mazziotta, 2006; Klauschen, Goldman, Barra, Meyer-Lindenberg, & Lundervold, 2009), and we feel this preprocessing step is unlikely to be a significant source of variance between methods.
Regarding intensity normalization and tissue segmentation, both SPM and SIENAX use an integrated approach to this process (Ashburner & Friston, 2005; Smith et al., 2004; Zhang, Brady, & Smith, 2001). One advantage of using whole-brain atrophy as a metric is its relative insensitivity to GM and WM tissue misclassification, as these two measures are summed to yield whole-brain volume; a detailed comparison of tissue-class segmentation across pipelines (Derakhshan et al., 2010; Popescu, Schoonheim, et al., 2016; Rocca et al., 2017) is beyond the scope of this paper.
A third potentially important difference between our pipelines is the template registration and normalization process. Whereas the BPF metric normalizes brain volume using the subject's own intracranial volume, BPV normalizes to a registered template (MNI-152) of averaged healthy brains. We speculate that this difference in normalization factor may help explain why a proportion-based metric may be superior to a registration-based metric regarding clinical validity. This topic has not received significant attention in the literature and would be worth exploring in more detail in future experiments with longitudinal comparisons.
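To make the contrast in normalization strategy concrete, the registration-based scaling can be sketched in a few lines (hypothetical function and argument names; the volumetric scaling factor stands in for the skull-derived factor that SIENAX computes from template registration, here supplied as a plain argument):

```python
def normalized_bpv(gm_ml: float, wm_ml: float, vscale: float) -> float:
    """Registration-based normalization: raw parenchymal volume (GM + WM)
    multiplied by a volumetric scaling factor derived from an affine
    registration of the subject's skull to the MNI-152 template. The
    normalization reference is thus an external multi-subject atlas,
    rather than the subject's own intracranial cavity as in BPF."""
    return (gm_ml + wm_ml) * vscale

# Illustrative values: 1,200 ml of parenchyma in a smaller-than-template
# head (vscale > 1 scales volumes up toward the atlas head size),
# yielding a normalized BPV of about 1,440 ml.
bpv = normalized_bpv(gm_ml=650.0, wm_ml=550.0, vscale=1.2)
```

Any bias in the skull-based registration therefore propagates multiplicatively into BPV, whereas BPF is unaffected by template choice.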

| CONCLUSION
Determination of whole-brain atrophy on 3T MRI depends in part on the choice of postprocessing software; here, a comparison of automated pipelines revealed discrepant results for whole-brain atrophy measures and clinical correlations, likely based on the underlying statistical assumptions of the software's tissue segmentation and scaling methods. Results obtained using these automated pipelines are unlikely to be interchangeable and should therefore be interpreted with caution.

TABLE 4 Two normalized whole-brain volume measures: correlation with cognitive component scores in the MS group (n = 61)