Reliable measurements of brain atrophy in individual patients with multiple sclerosis

Abstract Introduction As neurodegeneration is recognized as a major contributor to disability in multiple sclerosis (MS), brain atrophy quantification could have a high added value in clinical practice to assess treatment efficacy and disease progression, provided that it has a sufficiently low measurement error to draw meaningful conclusions for an individual patient. Method In this paper, we present an automated longitudinal method based on Jacobian integration for measuring whole‐brain and gray matter atrophy based on anatomical magnetic resonance images (MRI), named MSmetrix. MSmetrix is specifically designed to measure atrophy in patients with MS, by including iterative lesion segmentation and lesion filling based on FLAIR and T1‐weighted MRI scans. Results MSmetrix is compared with SIENA with respect to test–retest error and consistency, resulting in an average test–retest error on an MS data set of 0.13% (MSmetrix) and 0.17% (SIENA) and a consistency error of 0.07% (MSmetrix) and 0.05% (SIENA). On a healthy subject data set including physiological variability, the test–retest error is 0.19% (MSmetrix) and 0.31% (SIENA). Conclusion Therefore, we can conclude that MSmetrix could be of added value in clinical practice for the follow‐up of treatment and disease progression in MS patients.

demyelination, studies have demonstrated that neurodegeneration develops along with inflammation and demyelination (Dutta & Trapp, 2007; Silber & Sharief, 1999). In addition, it has been shown that the neurodegenerative component of MS is responsible for the irreversible disability and is prognostic for short- and long-term disability and cognitive decline (Bjartmar, Kidd, Mörk, Rudick, & Trapp, 2000).
In recent years, magnetic resonance imaging (MRI) brain scans have been increasingly used to measure brain atrophy in MS patients.
Typically, a volumetric T1-weighted MRI data set is used to calculate the whole-brain, WM, GM, or cerebrospinal fluid (CSF) volume or atrophy. It has been demonstrated that brain atrophy, as measured on MRI scans, is already present in patients with a clinically isolated syndrome suggestive of MS and in patients with early (<5 years from diagnosis) definite MS (Chard et al., 2002;Chard & Miller, 2009;Henry et al., 2008;Raz et al., 2010). In addition, it has been shown that brain atrophy occurs in all types of MS, relapsing-remitting MS (RRMS), primary and secondary progressive MS (PPMS and SPMS) (De Stefano et al., 2010;Tedeschi et al., 2005). In contrast to WM atrophy, which is relatively constant across all disease stages, GM atrophy has been correlated with disease progression, physical disability, and cognitive impairment (Amato et al., 2004;Chard et al., 2004;Sanfilipo, Benedict, Sharma, Weinstock-Guttman, & Bakshi, 2005;Sanfilipo, Benedict, Weinstock-Guttman, & Bakshi, 2006;De Stefano et al., 2003).
As neurodegeneration is recognized as an important aspect of MS and a major contributor to disability in MS, brain atrophy has been increasingly used in clinical trials as an outcome measure to assess treatment efficacy and disease progression. Studies with conventional therapies such as interferon-β and glatiramer acetate have shown only limited effects on brain atrophy (Filippi et al., 2004;Leary & Thompson, 2003;Molyneux et al., 2000). However, the FREEDOMS study in patients with RRMS demonstrated that fingolimod significantly reduced brain atrophy over 2 years, compared with placebo (Radue et al., 2012;Silber et al., 2014). In addition, it was reported in the TRANSFORMS study that MS patients on fingolimod had a lower brain atrophy rate compared to patients using intramuscular interferon-β-1a (Cohen et al., 2013). Decreases in brain atrophy in RRMS patients have also been reported with laquinimod treatment (Comi et al., 2012). Brown and Coles (2013) noted decreased brain atrophy in patients treated with alemtuzumab compared to interferon-β for RRMS, while Portaccio et al. (2013) had similar findings of lower brain atrophy in patients treated with natalizumab compared to interferon-β.
An additional advantage of measuring brain atrophy in clinical trials is that 10 times fewer subjects need to be included when the trial is powered on an outcome of a 50% reduction in MRI lesions and atrophy, compared to using disability endpoints (Polman et al., 2006; Sormani et al., 2001; Sormani, Arnold, & De Stefano, 2014; De Stefano et al., 2010).
Despite the importance of neurodegeneration in MS patients and the known positive effect of some drugs on slowing down this neurodegeneration, brain atrophy is currently not used in clinical practice to assess individual MS patients. As MS is a heterogeneous disease with a relatively unpredictable course, there is a clear need for objective measures that can be evaluated for individual patients and used for follow-up and treatment decisions. Several measures for disability are used, such as the Expanded Disability Status Scale and the MS Functional Composite scale (Cutter et al., 1999; Kurtzke, 1983). However, it is known that they can be unpredictable within the same patient, being characterized by phases with predominant occurrence of relapses versus progression. Although neurodegeneration is only one of many aspects of the disease, a reliable measure of brain atrophy in individual MS patients will help in making informed decisions and in moving toward a more personalized and evidence-based medicine in MS.
Techniques to measure brain volume or brain volume loss can be subdivided into two main categories, that is, cross-sectional and longitudinal methods (Giorgio, Battaglini, Smith, & De Stefano, 2008; De Stefano et al., 2014). Cross-sectional methods use a single MRI scan to segment specific tissues or structures. As a result, the volume of these tissue types and/or structures is calculated. Well-known and validated examples of these are BPF (Rudick, Fisher, Lee, Simon, & Jacobs, 1999), SIENAX, and FreeSurfer (Fischl et al., 2002). In contrast to cross-sectional approaches, longitudinal methods take into account two MRI scans of the same subject from different time points to calculate brain volume changes or atrophy. Longitudinal methods typically try to match the two MRI scans using warping techniques and directly extract small changes in brain volume from this process (e.g., Boyes et al., 2006; Freeborough & Fox, 1997; Smith, De Stefano, Jenkinson, & Matthews, 2001).
A longitudinal method that is frequently used in clinical trials (see, e.g., Comi et al., 2013) is SIENA, while a longitudinal processing pipeline that can take more than two time points into account is included in FreeSurfer (Reuter, Schmansky, Rosas, & Fischl, 2012).
When applying brain atrophy measures for individual MS patients, the measurement error of the method and thus the reliability becomes of paramount importance. It is indeed known that the average atrophy rate in MS patients is approximately 0.5%-1.3% per year, compared with 0.1%-0.4% per year in healthy individuals (Barkhof, Calabresi, Miller, & Reingold, 2009;Fotenos, Mintun, Snyder, Morris, & Buckner, 2008;Simon, 2006). The measurement error of the brain atrophy measure therefore needs to be very low, in order to draw meaningful conclusions in individual patients. This includes a robustness of the method toward daily physiological processes that might affect brain volume.
In this paper, an automated longitudinal registration-based method is proposed to measure whole-brain and GM atrophy by performing Jacobian integration within the segmentations, with the Jacobian extracted from the registration. The longitudinal method is initialized by a cross-sectional method providing lesion-filled images as well as the segmentations. In order to assess the method's future applicability for atrophy quantification of individual MS patients, the paper is focused on the reliability of the method. The reliability of the method will be evaluated in terms of measurement error, robustness toward physiological processes, and consistency.

voxel resolution). The resolution of the T1-weighted and FLAIR images from all the scanners is high; therefore, because of the very high computational memory requirement, all T1-weighted images were downsampled to 1.0 × 1.0 × 1.0 mm³ resolution. The FLAIR image was not downsampled at this point because it is rigidly registered to the T1-weighted image in the initial stage of the method and thus takes on the (downsampled) T1-weighted image resolution.

Data set 2
Data set 2 is the publicly available data set described in Maclaren, Han, Vos, Fischbein, and Bammer (2014). They acquired the data set with the approval of the Stanford University Institutional Review Board and all subjects gave their written informed consent.
A total of 120 T1-weighted images were acquired from three healthy subjects (40 scans/subject). Each subject was scanned two times on 20 different days within a 31-day period. Subjects were repositioned on the scanner console between the two scans in each session, so that all scans were treated as separate measurements (with a resulting break of ~5 min between scans). All images were acquired on the GE MR750 3T scanner using the ADNI-recommended T1-weighted imaging protocol for this system (accelerated sagittal 3D IR-SPGR, standard 8-channel phased array head coil, TR 7.3 ms, TE 3 ms, TI 400 ms, FA 11°, 256 × 256 matrix slice, 270 mm FOV, 1.2 mm slice thickness, acquisition time: 5 min 37 s).

Data set 3
Data set 3 was available from the department of radiology of the General University Hospital in Prague and consists of brain MR images obtained from 20 subjects. All patients signed informed consent forms.

MSmetrix
The proposed method, for which an overview is shown in Fig. 1, starts with a 3D FLAIR and 3D T1-weighted MR image for each of the time points. Each time point is processed independently by a cross-sectional pipeline (MSmetrix-cross) that computes, in a fully automated way, segmentations of WM, GM, and CSF and produces T1-weighted images that are bias corrected, lesion filled, and skull stripped. Subsequently, the longitudinal pipeline is executed.

MSmetrix-cross
MSmetrix-cross is a cross-sectional pipeline and hence handles the images of each time point separately. This automated pipeline is initialized by some preprocessing steps before entering the main loop of the algorithm. These preprocessing steps include: (1) matching the FLAIR image with the T1-weighted image, (2) skull stripping by transferring a brain mask from the MNI atlas toward the T1-weighted image, and (3) warping probabilistic anatomical priors for GM, WM, and CSF toward the T1-weighted image space.

MSmetrix-long
The longitudinal method starts from a T1-weighted and a FLAIR image at two time points and consists of four fully automated steps.
First, the cross-sectional pipeline (MSmetrix-cross) is executed, generating lesion-filled and bias-corrected T1-weighted images for each time point, and their segmentation into WM, GM, and CSF.
In the second step, the T1-weighted images of the two time points are affinely registered, both from the first to the second and from the second to the first time point (i.e., registrations are executed in both directions). The affine registration consists of (1) a rigid registration based on the whole image, (2) an affine registration based on the skull to compensate for small scaling differences due to distortions, and (3) a rigid registration on the whole brain in order to correct for small translation and rotation errors in the skull-based co-registration.
FIGURE 1 Schematic overview of MSmetrix-long. MSmetrix-long starts with a quality control of the images. Subsequently, MSmetrix-cross is performed for each time point. The results from the cross-sectional step are used to initialize the longitudinal pipeline. In the longitudinal step, the brain atrophy is calculated based on a Jacobian integration in both directions

The rigid and/or affine registration is based on a block matching approach (Ourselin et al., 2001). To improve the robustness of each coregistration, a symmetric approach (Modat et al., 2014) is used, which imposes, in theory, that the transformations are diffeomorphic. For the skull image, required in step (2), the region around the brain is used. This region is computed by binary dilating the brain mask, available from the cross-sectional pipeline, with a kernel of 20 mm and subsequently subtracting the brain mask.
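The skull region used in step (2), a dilated brain mask minus the brain mask itself, can be illustrated with a minimal sketch. It is not the MSmetrix implementation: the function name `skull_region` is hypothetical, the 20 mm kernel is assumed to act as a dilation radius, and the repeated 6-connected dilation used here grows a diamond-shaped rather than a spherical neighborhood.

```python
import numpy as np
from scipy import ndimage


def skull_region(brain_mask, kernel_mm=20.0, voxel_mm=1.0):
    """Return the shell around the brain: dilated mask minus the mask.

    Assumptions (not from the paper): kernel_mm is a dilation radius and
    voxels are isotropic with edge length voxel_mm.
    """
    brain_mask = np.asarray(brain_mask, dtype=bool)
    # Number of one-voxel dilation steps needed to reach the requested radius.
    n_iter = int(round(kernel_mm / voxel_mm))
    # 6-connected structuring element; iterating it grows a diamond shape.
    structure = ndimage.generate_binary_structure(3, 1)
    dilated = ndimage.binary_dilation(
        brain_mask, structure=structure, iterations=n_iter
    )
    # Subtract the brain mask so only the surrounding region remains.
    return dilated & ~brain_mask
```

With 1 mm isotropic voxels, `skull_region(mask)` would perform 20 dilation steps; a spherical structuring element could be substituted if a rounder shell is preferred.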
In the third step, the affinely registered and bias-corrected T1-weighted images are nonrigidly registered. This registration is again performed in both directions, respectively, from the first to the second and from the second to the first time point. The nonrigid registration (Modat et al., 2010) uses normalized mutual information (NMI) as similarity and B-splines to restrict the deformation, with a final grid spacing of two voxels. The registration algorithm makes use of a multiresolution approach with two levels (grid spacing 4 and 2 mm, respectively) to allow sufficiently large deformation in order to capture the expected maximal atrophy between the time points (±5%). Furthermore, an additional penalty term is added for the Jacobian determinant of the deformation field, which slightly penalizes large local variations in the deformation field. In this way, we ensure plausible deformation fields.
Thus, the objective function is based on the NMI similarity measure and a log of the Jacobian determinant penalty term with weight = 0.95 for the former and 0.05 for the latter term.
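To make the role of the Jacobian determinant concrete, the following is a minimal sketch of how it can be obtained from a dense displacement field with finite differences, and how averaging it over a binary tissue mask yields a percentage volume change. The function names, the displacement-field layout `(3, X, Y, Z)`, and the use of a binary rather than probabilistic mask are assumptions made for illustration; this is not the MSmetrix code.

```python
import numpy as np


def jacobian_determinant(disp, spacing=(1.0, 1.0, 1.0)):
    """Jacobian determinant of T(x) = x + disp(x).

    disp: array of shape (3, X, Y, Z), displacement in mm (assumed layout).
    spacing: voxel size in mm along each axis.
    """
    # grads[i][j] approximates d(disp_i) / d(x_j) by finite differences.
    grads = [np.gradient(disp[i], *spacing) for i in range(3)]
    jac = np.empty(disp.shape[1:] + (3, 3))
    for i in range(3):
        for j in range(3):
            # Add the identity because T is the identity plus a displacement.
            jac[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)
    return np.linalg.det(jac)


def percentage_volume_change(disp, mask, spacing=(1.0, 1.0, 1.0)):
    """Mean Jacobian determinant inside a binary mask, as a percentage change.

    Values below 0 indicate shrinkage, values above 0 indicate expansion.
    """
    det = jacobian_determinant(disp, spacing)
    return 100.0 * (det[np.asarray(mask, dtype=bool)].mean() - 1.0)
```

For a uniform 1% contraction along each axis, the determinant is 0.99³ everywhere, which corresponds to a volume change of roughly -2.97%.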
In the fourth and final step, the percentage volume change between the scans is calculated based on a Jacobian integration of the segmentations. The Jacobian determinant represents local shrinkage (for values < 1) or expansion (for values > 1) and can be calculated from the deformation field describing the nonrigid registration. Here, the Jacobian determinant is calculated for each nonrigid registration (i.e.,

SIENA(X)
SIENA and SIENAX are part of the FMRIB Software Library (http://www.fmrib.ox.ac.uk/fsl), and refer to the longitudinal brain atrophy and cross-sectional brain volume measurement, respectively (Zhang, Brady, & Smith, 2001). The input images for both SIENAX and SIENA are T1-weighted MRI data sets.

SIENAX
The pipeline is initialized by applying the Brain Extraction Tool (BET; Smith, 2002), which creates a brain mask in three steps: (1) global intensity thresholding to roughly select brain from non-brain region; (2) tessellated spherical mask creation, positioned at the approximate center of gravity of the brain; (3) iterative refinement toward the brain's edge, using smoothness criteria and a local intensity threshold. Subsequently, voxels within the obtained brain mask are classified in several classes, depending on the image intensities. As a result, CSF, WM, GM, and background are segmented, and the cross-sectional volumes can be obtained, referred to as SIENAX (Zhang et al., 2001).

SIENA
SIENA is initialized by applying the BET, generating a brain mask for both images (Smith, 2002). Subsequently, an explicit skull image is extracted from the images (Jenkinson, Bannister, Brady, & Smith, 2002). The images are then warped into an intermediate space where the skull image is used to guide the scaling. The time point 1 image in the intermediate space is segmented (analogously to SIENAX) and the edge between the brain parenchyma (WM + GM) and CSF is determined. Subsequently, the brain parenchyma/CSF edge displacement between the two time points is estimated by aligning the peaks of the spatial derivatives of the intensity profiles of both images. Finally, the mean edge displacement is converted into a global estimate of percentage brain volume change (PBVC) between the two time points.
In this study, SIENA(X) was run with no manual correction and with the default parameters, except for the '-B' option for the brain extraction tool BET, which was modified from its default parameter f = 0.5 to the value f = 0.1, as found optimal in Popescu et al. (2012).

Validation
MSmetrix-cross/long and SIENA(X) are evaluated in terms of reliability of the atrophy measurements for individual patients. To this end, a validation is performed in terms of the measurement error, the robustness toward physiological changes and the consistency of the method. Per experiment, significant difference between MSmetrix-cross/long and SIENA(X) will be evaluated using the parametric paired t-test as well as the nonparametric Wilcoxon signed-rank test at significance level 0.01.
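As an illustration of this testing scheme, the sketch below applies both paired tests to per-scan-pair absolute errors of two methods using SciPy. The function name `compare_methods` and the rule of calling a difference significant only when both tests fall below the threshold are assumptions for this example, not a description of the authors' analysis scripts.

```python
import numpy as np
from scipy import stats


def compare_methods(errors_a, errors_b, alpha=0.01):
    """Paired comparison of two methods evaluated on the same scan pairs.

    errors_a, errors_b: per-scan-pair absolute errors (same order and length).
    alpha: significance level (0.01 in the text above).
    """
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    # Parametric test on the paired differences.
    p_ttest = float(stats.ttest_rel(a, b).pvalue)
    # Nonparametric counterpart on the same paired differences.
    p_wilcoxon = float(stats.wilcoxon(a, b).pvalue)
    # Assumption: report significance only if both tests agree.
    significant = bool(p_ttest < alpha and p_wilcoxon < alpha)
    return {"p_ttest": p_ttest, "p_wilcoxon": p_wilcoxon, "significant": significant}
```

Reporting both p-values side by side, as the paper does, guards against conclusions that depend on the normality assumption of the t-test.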
First, we quantify the measurement error based on test-retest images from MS patients (data set 1). The test-retest images are acquired on the same scanner and the same day. The measurement error for atrophy is computed as the estimated percentage volume change of the parenchymal volume and of GM on these pairs of images, which is expected to be zero.
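Because the true change between test and retest scans is expected to be zero, the absolute value of each estimated percentage volume change can be read directly as an error. A minimal sketch of summarizing such errors by median, quartiles, and maximum follows; the function name and the returned dictionary layout are hypothetical.

```python
import numpy as np


def summarize_test_retest(pbvc_values):
    """Summarize test-retest errors from percentage volume changes.

    pbvc_values: estimated percentage volume change per test-retest pair;
    the expected value is 0, so the absolute value is the error.
    """
    abs_err = np.abs(np.asarray(pbvc_values, dtype=float))
    q1, median, q3 = np.percentile(abs_err, [25, 50, 75])
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "max": float(abs_err.max()),
    }
```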
Subsequently, the robustness of the method is tested in terms of physiological changes (data set 2). MR images acquired at two successive time points from the same healthy subject are considered as test and retest scans, regardless of the exact time interval (0-3 days).
Hence, this evaluation captures the measurement error as well as robustness to daily changes due to physiological processes.
Finally, the consistency of the atrophy measurements over time is evaluated (data set 3). Atrophy measurements are performed for MR images from MS patients acquired with a time gap of at least 6 months.
For any three consecutive time points, T1, T2, T3, with 6 months gap, the consistency index (CI) is computed as the absolute difference in PBVC between T1 and T3 on one hand and the sum of the PBVC between T1 and T2 and between T2 and T3 on the other hand, that is,

$$\mathrm{CI} = \left| \mathrm{PBVC}(T_1, T_3) - \left( \mathrm{PBVC}(T_1, T_2) + \mathrm{PBVC}(T_2, T_3) \right) \right|.$$

This is motivated by linearly approximating the direct atrophy measurement from T1 to T3 with the cumulative atrophy measurement from T1 to T3 via T2.
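The consistency index described above, the absolute difference between the direct T1-to-T3 change and the sum of the two intermediate changes, is simple enough to express as a short sketch. The function and argument names are illustrative, and the example values are invented for demonstration rather than taken from the study.

```python
def consistency_index(pbvc_t1_t3, pbvc_t1_t2, pbvc_t2_t3):
    """Absolute difference between direct and cumulative PBVC (in %)."""
    # Cumulative change from T1 to T3 via T2 (linear approximation).
    cumulative = pbvc_t1_t2 + pbvc_t2_t3
    return abs(pbvc_t1_t3 - cumulative)


# Invented example: direct change of -1.00%, two half-year changes
# of -0.45% and -0.50%, so the two routes differ by about 0.05%.
example_ci = consistency_index(-1.00, -0.45, -0.50)
```

A perfectly consistent method would give a CI of 0 for every triplet of time points, up to the error of the linear approximation.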

Test-retest measurement error
After visual quality control of the test-retest images (data set 1), four images were found to exhibit artifacts. The four corresponding image pairs were removed from the data set, leaving 23 image pairs (7 Philips, 8 Siemens, and 8 GE).

The test-retest percentage whole-brain volume changes computed by MSmetrix-long differ in absolute value from the expected 0% by 0.13% (median over all scan pairs, on all three scanners; first and third quartiles: 0.09-0.29%, maximum value: 0.7%), while those of SIENA differ from 0 in absolute value by 0.17% (first and third quartiles: 0.08-0.22%, maximum value: 1.2%). However, the difference between MSmetrix-long and SIENA is not significant (p = .54 for the paired t-test and p = .60 for the Wilcoxon signed-rank test). For the cross-sectional methods, the median percentage whole-brain volume change error is 0.62% (first and third quartiles: 0.23-1.3%, maximum value: 3.8%) for MSmetrix-cross and 0.82% (first and third quartiles: 0.34-2.04%, maximum value: 6.8%) for SIENAX. Also, the cross-sectional methods MSmetrix-cross and SIENAX are not significantly different (p = .10 for the paired t-test and p = .16 for the Wilcoxon signed-rank test). However, MSmetrix-long is significantly different from the cross-sectional methods, that is, MSmetrix-cross and SIENAX (p < .01 for the paired t-test and for the Wilcoxon signed-rank test).
The measurement error for GM atrophy is also quantified, except for SIENA, as GM atrophy is not provided by SIENA. MSmetrix-long differs from the expected 0% absolute percentage GM volume change by 0.23% (first and third quartiles: 0.17-0.34%, maximum value: 0.9%), MSmetrix-cross by 0.77% (first and third quartiles: 0.39-1.06%, maximum value: 4.3%), and SIENAX by 1.06% (first and third quartiles: 0.67-2.59%, maximum value: 11.1%). Significance testing showed that MSmetrix-long is significantly different from MSmetrix-cross and from SIENAX (p < .01 for both the paired t-test and the Wilcoxon signed-rank test). The measurement error of GM atrophy for MSmetrix-cross is significantly smaller than that of SIENAX at significance level 0.05, but not at significance level 0.01 (p = .03 for the paired t-test and p = .02 for the Wilcoxon signed-rank test).

FIGURE 2 Illustration of MSmetrix-cross on six repeated scans of the same MS patient (data set 1). Each row shows test and retest segmentations for Philips (top), Siemens (middle), and GE (bottom). Lesions are marked with red and GM segmentation with teal
For completeness, the measurement errors for whole-brain and GM atrophy are also shown per scanner for the longitudinal methods (Fig. 4). No significant differences were observed between MSmetrix-long and SIENA for each of the scanners (paired t-test, Wilcoxon signed-rank test, significance level 0.01).

Longitudinal consistency
On a longitudinal data set of patients with MS (data set 3), the correlation between whole-brain atrophy measurements obtained with MSmetrix-long and SIENA is relatively high, with a Pearson correlation coefficient of 0.91 and an intraclass correlation coefficient of 0.90. Figure 6 presents the scatter plot of the percentage whole-brain volume changes of MSmetrix-long against those of SIENA for 6-month, 1-year, and 2-year atrophy for all 20 patients.
The CI for 6-month intervals compared to the 1-year interval of the whole-brain percentage volume change had a median absolute value of 0.07% for MSmetrix-long (first and third quartiles: 0.04-0.11%, maximum value: 0.15%) and 0.05% for SIENA (first and third quartiles: 0.02-0.08%, maximum value: 0.17%). No significant difference was observed for the CI between MSmetrix-long and SIENA (p = .42 for the paired t-test and p = .35 for the Wilcoxon signed-rank test). For GM, the CI was 0.13% for MSmetrix-long (first and third quartiles: 0.09-0.23%, maximum value: 0.39%).

DISCUSSION
In this manuscript, an automated longitudinal Jacobian integration-based method for measuring whole-brain and GM atrophy is introduced. In order to assess the use of this method in clinical practice on MRI data sets of individual MS patients, the reliability of the method is evaluated in terms of the method's measurement error, of its robustness toward physiological processes, and of its longitudinal consistency. Results were compared to SIENA, a well-validated method that is commonly used for measuring brain atrophy in clinical studies and trials. Note that only whole-brain atrophy results are compared with SIENA, as the software does not provide GM atrophy measurements.

FIGURE 3 Illustration of SIENAX on six repeated scans of the same MS patient (data set 1). Each row shows test and retest segmentations for Philips (top), Siemens (middle), and GE (bottom). GM segmentation is marked with teal

FIGURE 4 First row: Comparison of the measurement error of the longitudinal and cross-sectional methods MSmetrix and SIENA(X) on test-retest scans from all MS patients in data set 1. Boxplots show absolute values of the whole brain (left) and gray matter (right) percentual volume change, computed either by the longitudinal approaches or based on two cross-sectional measurements on the test-retest scans. Second row: Per-scanner comparison of the measurement error of the longitudinal methods MSmetrix and SIENA on test-retest scans from all MS patients in data set 1. Boxplots show absolute values of the whole brain (left) and gray matter (right) percentual volume change

FIGURE 5 Boxplots of absolute percentual volume change (whole brain and gray matter) on successive scans from healthy subjects (time interval < 3 days) (data set 2)

FIGURE 6 Comparison of whole-brain percentual volume change obtained by MSmetrix-long and SIENA in 20 MS patients, five time points each
The MSmetrix software pipeline is specifically designed to measure atrophy in patients with MS, by including iterative lesion segmentation and lesion filling based on FLAIR and T1-weighted MRI scans. In this context, it is known that applying brain volume measures without performing lesion filling can introduce errors between 0.3% and 2.5%, depending on the lesion size and lesion intensity (Battaglini, Jenkinson, & De Stefano, 2012;Chard, Jackson, Miller, & Wheeler-Kingshott, 2010;Popescu et al., 2014). When brain atrophy measures are introduced in clinical practice for individual MS patients, interpretation of these results should be done with caution. In this context, it is indeed known that there are many confounding factors that can affect the measurement of brain atrophy and therefore the interpretation of the results (Bermel & Bakshi, 2006;Simon, 2006;Zivadinov & Minagar, 2009). For example, it is known that brain volume changes are not only caused by neuronal or axonal loss but that also demyelination and inflammation can play a role (Giorgio et al., 2008). In addition, brain volume loss as measured using MRI is affected by the use of steroids or some disease modifying therapies. It has indeed been demonstrated that their anti-inflammatory properties decrease the brain volume in the first 6 months to 1 year of treatment, typically referred to as pseudoatrophy (Zivadinov et al., 2008). In this context, it was suggested that the measurement of GM volume loss is less susceptible to this pseudoatrophy compared to whole-brain or WM volume changes (Nakamura, Fox, & Fisher, 2011;Tiberio et al., 2005).
Since the difference between brain atrophy in MS patients (0.5%-1.3% yearly atrophy) and healthy subjects (0.1%-0.4% yearly atrophy) is small, and clinicians would like to assess if an individual patient is stable on brain atrophy or not, a small measurement error is of paramount importance to draw meaningful conclusions in clinical practice (Barkhof et al., 2009;Fotenos et al., 2008;Simon, 2006). This paper focuses on the reliability of the methods on MRI data sets from MS patients acquired using a "clinical" MRI protocol. In order to introduce brain atrophy measures in clinical practice, acceptable measurement and reproducibility errors are required on MRI scans that can be obtained in a clinical setting with a limited acquisition time.
In contrast to most other studies that have investigated such errors of brain atrophy measures, for this paper, repeat scans were acquired on patients with MS instead of on healthy subjects (Cover et al., 2011;Maclaren et al., 2014;Nakamura et al., 2014;Smith et al., 2001). As a result, the errors presented in this paper can be seen as representative for a clinical setting for patients with MS.
MSmetrix-long results on data set 1 demonstrated a small measurement error across the three 3T scanners, with a median value of 0.13% over all scanners. These errors are within the tolerance level that might be attributed to normal variations in healthy controls, but are lower than the expected atrophy levels in pathology. Although the results of SIENA were not significantly different based on a parametric and a nonparametric statistical test (p > .05), a larger median value (0.17%) over all scanners was observed for SIENA. This error can change when using different parameter settings of SIENA; for example, for this data set, the median absolute error doubled when the default parameter values were kept. We therefore tried to use the optimal settings described in the literature (Popescu et al., 2012).
Note that it has been demonstrated in the past that using cross-sectional methods to measure atrophy results in much higher errors compared to longitudinal approaches (Durand-Dubief et al., 2012; Nakamura et al., 2014; Smith et al., 2002). Our results confirm these findings. The measurement errors for the cross-sectional methods were significantly higher than those of the longitudinal methods (p < .01 for both parametric and nonparametric tests).
The test-retest error for whole-brain and GM atrophy computed by MSmetrix-cross was lower than for SIENAX. This relatively lower measurement error for MSmetrix-cross compared to SIENAX might contribute to the lower values of MSmetrix-long compared to SIENA, as the cross-sectional results are used as input data. The reported values for SIENA and SIENAX are similar to those observed by Smith et al. (2001, 2002), even though scans in their study were obtained from healthy subjects.
In addition to the measurement error, robustness toward daily physiological processes is evaluated using data set 2, where MSmetrix-long still results in a small overall error for whole-brain atrophy, while SIENA shows a significantly larger error compared to MSmetrix-long. A median absolute value of 0.19% was observed for MSmetrix-long and of 0.31% for SIENA for whole-brain atrophy. For GM, the median absolute value is 0.23% for MSmetrix-long. This indicates that MSmetrix-long is more robust toward daily physiological effects than SIENA.
Finally, the consistency of the methods is assessed using data set 3. No significant differences were observed between MSmetrix-long and SIENA in terms of the CI for 6-month intervals compared to the 1-year interval of the whole-brain atrophy.
It is important to note that in addition to small errors, including measurement errors, robustness toward daily physiological processes and consistency, the brain atrophy software should still be sensitive enough to detect small changes. This can be evaluated on longitudinal MRI data of MS patients. Although there is no ground truth available of the exact changes in brain atrophy that should be detected, our results suggest that MSmetrix has a high correlation with SIENA, which has already been used as a surrogate outcome measure in several MS clinical trials. Also, the sensitivity of MSmetrix-long has been demonstrated in other domains, that is, the detection of dehydration effects and separating healthy controls from Alzheimer patients. In view of the measurement errors reported in this paper, one should consider as potentially pathological change any whole-brain atrophy levels exceeding, for example, 0.7%-1% per year.
In this work, we have also shown that GM atrophy can be measured automatically alongside whole-brain atrophy using MSmetrix-long. Although the absolute measurement errors are higher than those for whole brain, the approach still has potential to detect GM atrophy reliably if this is abnormally large and should prompt immediate therapy re-evaluation (e.g., >1.5% per year).
Note that the results on GM atrophy measurement were not directly compared against another longitudinal technique in this paper. The widely used SIENA does not return GM atrophy. Other approaches for longitudinal atrophy computations are, like SIENA, not specifically designed for MS, and thus lesion filling is often recommended as a preprocessing step before applying these methods. For instance, the longitudinal brain segmentation pipeline (Reuter et al., 2012) included in FreeSurfer offers the possibility to compute volume changes of brain substructures, after transforming multiple time point images to a common subject-specific template space. In MS patients, a disadvantage might be that large deviations between the individual time points (e.g., large atrophy or the presence of large new lesions) might have an unpredictable effect on the template creation or might bias the results of individual time points toward the common template.
We conclude that, owing to its low measurement error, MSmetrix-long could be of added value in clinical practice for the follow-up of treatment and disease progression in MS patients.