Comparison of structural MRI brain measures between 1.5 and 3 T: Data from the Lothian Birth Cohort 1936

Abstract Multi‐scanner MRI studies are reliant on understanding the apparent differences in imaging measures between different scanners. We provide a comprehensive analysis of T1‐weighted and diffusion MRI (dMRI) structural brain measures between a 1.5 T GE Signa Horizon HDx and a 3 T Siemens Magnetom Prisma using 91 community‐dwelling older participants (aged 82 years). Although we found considerable differences in absolute measurements (global tissue volumes were measured as ~6–11% higher and fractional anisotropy [FA] was 33% higher at 3 T than at 1.5 T), between‐scanner consistency was good to excellent for global volumetric and dMRI measures (intraclass correlation coefficient [ICC] range: .612–.993) and fair to good for 68 cortical regions (FreeSurfer) and cortical surface measures (mean ICC: .504–.763). Between‐scanner consistency was fair for dMRI measures of 12 major white matter tracts (mean ICC: .475–.564), and the general factors of these tracts provided excellent consistency (ICC ≥ .769). Whole‐brain structural networks provided good to excellent consistency for global metrics (ICC ≥ .612). Although consistency was poor for individual network connections (mean ICCs: .275−.280), this was driven by a large difference in network sparsity (.599 vs. .334), and consistency was improved when comparing only the connections present in every participant (mean ICCs: .533–.647). Regression‐based k‐fold cross‐validation showed that, particularly for global volumes, between‐scanner differences could be largely eliminated (R² range .615–.991). We conclude that low granularity measures of brain structure can be reliably matched between the scanners tested, but caution is warranted when combining high granularity information from different scanners.


| INTRODUCTION
Understanding how estimates of brain structure vary across different MRI hardware and field strengths is an important aspect of neuroimaging research since knowledge of between-scanner differences and the means to match measures between scanners is an essential prerequisite for multi-site analyses or multi-scanner longitudinal studies.
Cross-scanner comparisons of brain measures are useful in multiple settings, including hardware replacement, relocation in ongoing research studies, or pooling of multi-site data (Kruggel, Turner, & Muftuler, 2010). An increase in the use of 3 T scanners is driven in part by the potential to increase tissue contrast and reduce background noise (thereby increasing the signal-to-noise ratio and contrast-to-noise ratio), acquire higher resolution scans more quickly, acquire higher b-values and thinner slices in diffusion MRI (dMRI), use advanced methods such as neurite orientation dispersion and density imaging (Zhang, Schneider, Wheeler-Kingshott, & Alexander, 2012), and potentially increase diagnostic accuracy (Fushimi et al., 2007; Schmitz, Aschoff, Hoffmann, & Grön, 2005; Tanenbaum, 2005; Wardlaw et al., 2012). Although higher field strength MRI may improve image quality and diagnostic accuracy in clinical practice, the theoretical doubling of the signal-to-noise ratio was, in practice, only a 25% increase, though 3 T appeared to outperform 1.5 T technology in research settings (Wardlaw et al., 2012). Although the possibility of combining MRI data points from different scanner hardware is appealing, it is challenging because scanner-dependent geometric distortions and differences in tissue contrast can be problematic (Gunter et al., 2009).
When considering the potential for two different field strengths to yield different estimates of the same brain measurements, interpretation should be tempered by the finding that even same-scanner measurements, taken twice or more over short periods, are not perfectly reliable. Same-scanner test-retest studies of the same subjects report agreement as low as .8 in terms of the intraclass correlation coefficient (ICC) for global volumetric and water diffusion measures (Iscan et al., 2015; Luque Laguna et al., 2020; Melzer et al., 2020).
In this context, the extant cross-field comparison studies which compare brain MRI measures between 1.5 and 3 T indicate that agreement, in small samples of generally younger participants, may not be substantially lower than same-scanner test-retest findings. For example, between-field-strength differences have been reported as <10% for brain and tissue volumes (Heinen et al., 2016), and ~10% for subcortical measurements (Chu, Hurwitz, Tauhid, & Bakshi, 2017).
However, these studies have typically been conducted using modest sample sizes (often N ≤ 20, but see Pfefferbaum et al., 2012 and Srinivasan et al., 2020) and among adults almost exclusively younger than 65 years old. Brains that, on average, exhibit greater degeneration are "further" from the average atlases upon which some neuroimaging pipelines rely, and older participants have a greater array of physical limitations which are a barrier to achieving artifact-free imaging data during extended scanning sessions, for example, arthritis. The wider variability in structural brain measures among individuals who are at greatest risk of cognitive decline and a range of age-related diseases and disorders might hamper the generalizability of findings from younger groups. Moreover, the low sample sizes mean that any statistical analyses aimed at identifying meaningful differences between conditions are likely to be substantially underpowered, providing speculative estimates of the comparability of data across field strengths. Finally, only a single, or a small subset, of MRI-derived phenotypes has been considered at once. Such factors fundamentally complicate the meaningful synthesis of extant data for assessing the likely cross-scanner impact on structural and diffusion measures in older participants.
To address these gaps in the literature, the current study assesses an array of T1-weighted and dMRI imaging variables using 91 participants, aged 82 years, scanned at both 1.5 and 3 T. Between-scanner comparison of imaging variables was performed at several levels: overall brain and tissue volumes; regional cortical and subcortical GM volumes; cortical volume, surface area, and thickness; dMRI measures in global WM; dMRI measures in 12 WM tracts; and whole-brain structural networks. Additionally, we used 10-fold cross-validation to test prediction of "unseen" 1.5 T values from 3 T data using linear regression.

| Participants
Data were drawn from the Lothian Birth Cohort 1936 (LBC1936), an on-going study on the influences on cognitive ageing from age 11 into the eighth and ninth decades of life (Deary et al., 2007; Deary, Gow, Pattie, & Starr, 2012; Taylor, Pattie, & Deary, 2018). Structural imaging including dMRI has been performed on the same well-maintained 1.5 T scanner at all imaging waves (Wardlaw et al., 2011). A subset of participants were also imaged at 3 T. This was motivated by the intention to safeguard against potential unexpected breakdown of the scanner, or facility relocation, and an incentive to use modern MRI acquisitions, reducing participant burden in terms of comfort and duration for those who are becoming increasingly frail and less able to lie still in a scanner for the hour-long 1.5 T acquisition. A total of 105 (60, 57.1% male) community-dwelling participants in the Lothian area were therefore recruited from the fifth wave of the LBC1936.
Prior to undergoing either scan, participants who had indicated they would undergo the standard 1.5 T session were invited to also undergo a 3 T imaging session; then, following a successful 1.5 T scan, they were booked for 3 T imaging. Participants were recruited on a first-come, first-served basis and 3 T imaging ended after 105 participants had successfully completed both scans. The mean interval between scans was 71.9 (SD = 16.6; range 28-111) days. At the time of the 1.5 T scan, participants had a mean age of 82.0 (SD = 0.3) years. Written informed consent was obtained from each participant under protocols approved by the Lothian (REC 07/MRE00/58) and Scottish Multicentre (MREC/01/0/56) Research Ethics Committees.

| MRI acquisition
MRI acquisition parameters at 1.5 T have been described previously (Wardlaw et al., 2011) and are summarized in Table 1. All participants underwent brain MRI on the same 1.5 T GE Signa Horizon HDx clinical scanner (General Electric, Milwaukee, WI) with a manufacturer supplied 8-channel phased-array head coil. High resolution 3D T1-weighted inversion-recovery prepared, fast spoiled gradient-echo volumes were acquired in the coronal plane with 160 contiguous 1.3 mm thick slices resulting in voxel dimensions of 1 × 1 × 1.3 mm. For the dMRI protocol, single-shot spin-echo echo-planar (EP) diffusion-weighted whole-brain volumes (b = 1,000 s mm⁻²) were acquired in 64 noncollinear directions, along with seven T2-weighted volumes (b = 0 s mm⁻²). Seventy-two contiguous axial 2 mm thick slices were acquired resulting in 2 mm isotropic voxels.
The same 105 participants had a brain MRI on a 3 T Siemens Magnetom Prisma (Siemens Healthcare GmbH, Erlangen, Germany) using a 32-channel matrix phase array head coil. High resolution 3D T1-weighted magnetisation prepared rapid acquisition gradient echo volumes were acquired in the coronal plane with 224 contiguous 1 mm thick slices resulting in 1 mm isotropic voxels (Table 1). The multi-shell dMRI protocol employed a single-shot spin-echo EP diffusion-weighted sequence which acquired 14 b = 0 s mm⁻², 3 b = 200 s mm⁻², 6 b = 500 s mm⁻², 64 b = 1,000 s mm⁻², and 64 b = 2,000 s mm⁻² whole-brain volumes. Seventy-four contiguous axial 2 mm thick slices were acquired resulting in 2 mm isotropic voxels. A reverse phase encoding EP dataset with 6 b = 0 s mm⁻² whole-brain volumes was also collected for subsequent EP susceptibility distortion correction using the same acquisition parameters as the main dMRI protocol. Two participants were excluded from T1-weighted analyses and five were excluded from dMRI analyses due to incomplete or missing scans at 1.5 T.
Total CSF volume was computed as the sum of the volumes of the ventricular system, nonventricular CSF and choroid plexus.
We applied FreeSurfer using the default parameters and opted not to undertake any manual editing so as not to introduce any rater-specific bias into the comparison between scanners. The outputs of  et al., 2004), data underwent brain extraction (Smith, 2002) performed on the T2-weighted EP volumes acquired along with the dMRI data.
The brain mask was applied to all volumes after correcting for systematic eddy-current induced imaging distortions and bulk patient motion using affine registration to the first T2-weighted EP volume of each subject with "eddy_correct" (Jenkinson & Smith, 2001). Due to the longitudinal character of the LBC1936 study, the dMRI processing protocol at 1.5 T has remained unchanged since the first LBC1936 imaging wave in 2007.
We determined that alternative processing was required for the 3 T dMRI data due to greater susceptibility-induced distortions at this field strength. These data were read and converted from DICOM to NIfTI-1 format using TractoR v3.3.1, masked using FSL's brain extraction tool (Smith, 2002) and corrected for susceptibility and eddy current induced distortions using "topup" and "eddy" from FSL version 5.0.9 (Andersson, Skare, & Ashburner, 2003; Andersson & Sotiropoulos, 2016). Additionally, to test the impact of different preprocessing pipelines between scanners we also applied the 1.5 T pipeline (TractoR v2.6.1 and FSL v4.1.9) using 3 T data for 10 subjects.
For all dMRI volumes, diffusion tensors were fitted at each voxel using FSL's "dtifit" and water diffusion measures were estimated for axial (AD), radial (RD), and mean (MD) diffusivity, which measure magnitudes of molecular water diffusion. FA was also computed, which measures the degree of anisotropic diffusion per voxel (Pierpaoli & Basser, 1996). All diffusion measures and tractography were computed in diffusion space. Mean values of the four dMRI measures were estimated in cerebral WM using the WM mask obtained from FreeSurfer, which was aligned to diffusion space using the transform estimated by the connectome cross-modal registration procedure (Buchanan et al., 2020). In a supplementary analysis, we also computed the same measures across the whole brain (using the mask obtained from the T2-weighted EP volume).
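To make the relationship between the four water diffusion measures concrete, the following minimal sketch computes AD, RD, MD, and FA from the three eigenvalues of a fitted tensor using the standard textbook definitions. It is an illustration only and is not the FSL "dtifit" implementation; the function name `dti_scalars` and the array layout are assumptions introduced here.

```python
import numpy as np

def dti_scalars(evals):
    """Return (AD, RD, MD, FA) from tensor eigenvalues of shape (..., 3).

    Hypothetical helper for illustration; eigenvalues are sorted so that
    the largest is treated as the principal (axial) direction.
    """
    evals = np.sort(np.asarray(evals, dtype=float), axis=-1)[..., ::-1]
    l1, l2, l3 = evals[..., 0], evals[..., 1], evals[..., 2]

    ad = l1                      # axial diffusivity: largest eigenvalue
    rd = (l2 + l3) / 2.0         # radial diffusivity: mean of the two smaller
    md = (l1 + l2 + l3) / 3.0    # mean diffusivity: mean of all three

    # Fractional anisotropy: normalised dispersion of the eigenvalues.
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    with np.errstate(invalid="ignore", divide="ignore"):
        fa = np.sqrt(1.5 * num / den)
    fa = np.where(den > 0, fa, 0.0)  # define FA as 0 where the tensor is all zero
    return ad, rd, md, fa
```

Under these definitions an isotropic tensor gives FA = 0 and a tensor with a single nonzero eigenvalue gives FA = 1, which is why FA is read as the degree of anisotropic diffusion per voxel.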
Whole-brain tractography was performed using an established probabilistic algorithm (BEDPOSTX/ProbtrackX; Behrens, Berg, Jbabdi, Rushworth, & Woolrich, 2007;Behrens et al., 2003). Probability density functions, which describe the uncertainty in the principal directions of diffusion, were computed with a two-fiber model per voxel (Behrens et al., 2007). Streamlines were then constructed by sampling from these distributions during tracking with a fixed step size of 0.5 mm between successive points.
Analysis of 12 major WM tracts was performed using probabi-

| Network processing
Whole-brain structural networks were computed for 79 participants (who had passed both T1 and dMRI QC at both field strengths). Networks were constructed using 85 neuroanatomical regions (the 84 GM regions described above plus the brain stem) and probabilistic tractography resulting in 85 × 85 networks (Buchanan et al., 2020).
Networks were computed for both MD and FA by computing the mean value of each measure in all voxels along the interconnecting streamlines between a pair of regions. We applied consistency-thresholding (Roberts, Perry, Roberts, Mitchell, & Breakspear, 2017) to remove putatively spurious connections across subjects, retaining the top 30% most consistent connections, a threshold level that was previously determined from a large single-scanner study (Buchanan et al., 2020). To obtain a representative estimate of between-scanner agreement, thresholding was applied separately for both field strengths (resulting in nonidentical sets of connections). Three common global graph-theoretic metrics were computed using weighted measures (Rubinov & Sporns, 2010): mean edge weight (mean of all edge weights per subject), global network efficiency (a measure of integration), and network clustering coefficient (a measure of segregation).
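The thresholding step can be sketched as follows. This is a hedged illustration rather than the pipeline used in the study: it assumes that "consistency" is quantified as the coefficient of variation of each connection's weight across subjects (lower meaning more consistent), which is one reading of Roberts et al. (2017), and that connections with a zero group mean are never retained. The function name `consistency_threshold` and its arguments are hypothetical.

```python
import numpy as np

def consistency_threshold(weights, keep=0.30):
    """Return a boolean (n_nodes, n_nodes) mask of retained connections.

    weights : array (n_subjects, n_nodes, n_nodes), symmetric per subject.
    keep    : proportion of possible connections to retain.
    Assumption: consistency = low coefficient of variation across subjects.
    """
    weights = np.asarray(weights, dtype=float)
    n_nodes = weights.shape[1]
    iu = np.triu_indices(n_nodes, k=1)           # each undirected edge once
    edges = weights[:, iu[0], iu[1]]             # (n_subjects, n_edges)

    mean = edges.mean(axis=0)
    sd = edges.std(axis=0, ddof=1)
    cv = np.full(mean.shape, np.inf)             # zero-mean edges rank last
    np.divide(sd, mean, out=cv, where=mean > 0)

    n_keep = int(round(keep * cv.size))
    kept = np.argsort(cv, kind="stable")[:n_keep]  # most consistent edges

    mask = np.zeros((n_nodes, n_nodes), dtype=bool)
    mask[iu[0][kept], iu[1][kept]] = True
    return mask | mask.T                         # symmetric, empty diagonal
```

With 85 nodes there are 85 × 84 / 2 = 3,570 possible connections, so `keep=0.30` retains 1,071 of them. Applying the function separately to the 1.5 T and 3 T stacks yields two masks that need not coincide, as described above.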

| Statistical analysis
Between-scanner comparison of imaging variables was performed at several levels: global (overall tissue volumes, dMRI measures in WM, global network metrics), regional (GM regions, major WM tracts), and sub-regional (cortical surface vertex analysis, network connections). Imaging variables were paired between scanners and we computed both the between-scanner difference and ICC to assess agreement. For a paired set of subject-specific measures, $x^{1.5T}_1, \ldots, x^{1.5T}_N$ and $x^{3T}_1, \ldots, x^{3T}_N$, the mean between-scanner difference was computed as $\frac{1}{N}\sum_{i=1}^{N}\left(x^{3T}_i - x^{1.5T}_i\right)$. Similarly, the mean between-scanner difference expressed as percent change from the 1.5 T values was computed as $\frac{100}{N}\sum_{i=1}^{N}\frac{x^{3T}_i - x^{1.5T}_i}{x^{1.5T}_i}$.
The ICC (Shrout & Fleiss, 1979) was originally formulated for assessing multiple raters measuring the same quantity but has been widely adopted for repeated measurements. For each imaging measure, we computed ICC using a two-way model (i.e., each subject was measured by both scanners) with single measures and using consistency of measurements between sessions (R package irr). This formulation of ICC ranges from −1 to 1, where 0 indicates random agreement and negative values would not be expected in a test-retest study (Hallgren, 2012). For ICC scores we adopted the four-level rating from Cicchetti (1994): poor for <.40; fair for .40-.59; good for .60-.74; and excellent for .75-1.00.
Rather than additionally reporting ICC agreement, which reflects both rank order agreement and intercept differences (e.g., also accounts for between-scanner discrepancies in absolute volumes), we conducted a more detailed investigation in which we tested our ability to predict "unseen" 1.5 T values from 3 T data using linear regression. To estimate the generalization performance of a linear model, we computed the required slopes and intercepts for all imaging variables, and used 10-fold cross-validation to iteratively estimate a linear fit on 9/10th of the data, applying prediction for the held-out fold and reporting average model fit (predicted R²). All imaging measures were modeled and estimated separately.
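The three statistics described in this section can be sketched in a few lines. The code below is an illustrative approximation, not the analysis code of the study (which used the R package irr): it assumes the percent difference is the mean of per-subject percent changes, uses the standard mean-squares form of the two-way, single-measure consistency ICC, and assumes predicted R² is computed within each held-out fold and then averaged. All function names are hypothetical.

```python
import numpy as np

def between_scanner_difference(x15, x3):
    """Mean difference (3 T minus 1.5 T) and mean percent change from 1.5 T."""
    x15, x3 = np.asarray(x15, float), np.asarray(x3, float)
    diff = x3 - x15
    return float(diff.mean()), float(100.0 * np.mean(diff / x15))

def icc_consistency(x15, x3):
    """Two-way, single-measure, consistency ICC from ANOVA mean squares."""
    data = np.column_stack([x15, x3]).astype(float)   # subjects x scanners
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * np.sum((data.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((data.mean(axis=0) - grand) ** 2)   # between scanners
    ss_err = np.sum((data - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Scanner (column) variance is excluded, so a constant offset is ignored.
    return float((ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err))

def cv_predicted_r2(x3, x15, n_folds=10, seed=0):
    """Average held-out R^2 when predicting 1.5 T values from 3 T values."""
    x3, x15 = np.asarray(x3, float), np.asarray(x15, float)
    idx = np.random.default_rng(seed).permutation(len(x3))
    scores = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        slope, intercept = np.polyfit(x3[train], x15[train], 1)
        pred = slope * x3[test] + intercept
        ss_res = np.sum((x15[test] - pred) ** 2)
        ss_tot = np.sum((x15[test] - x15[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```

Because the consistency form discards the between-scanner mean square, two scanners that differ only by a constant offset give an ICC of 1, whereas the same data would be penalized under an absolute-agreement ICC. This is the motivation for examining intercept and slope separately through regression.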
In order to assess if larger GM regions resulted in higher between-scanner consistency than smaller regions, we also reported the correlation between region volume (mean value of 1.5 and 3 T volumes) and the regional ICCs (for volume, surface area, and thickness). False discovery rate (FDR) was used to correct these correlations for multiple comparisons. For the cortical surface analyses, in addition to providing regional maps of ICCs and percent difference, we performed linear regression between 1.5 and 3 T values and computed uncorrected p-value maps to indicate areas of difference between scanners. To illustrate the impact of smoothing, we also provide average ICCs across the cortical mantle for volume, area and thickness smoothed with a 0, 5, 10, 15, 20, and 25 mm FWHM kernel.
Table 2 summarizes the between-scanner statistics at each level of analysis. Broadly, we found a wide range (−13.2 to 39.1%) in the difference between imaging measures at 1.5 and 3 T.
Figure 1 shows horizontal slices of two participants imaged at 1.5 T alongside the equivalent slice at 3 T. We observed different contrasts for skull, CSF, GM, and WM between field strengths. Discrepancies in both the GM-WM boundary and the GM-CSF boundary were visible between field strengths and it was apparent that more GM and WM was visible at 3 T than at 1.5 T. Gibbs ringing artifacts were apparent at 1.5 T but much less so at 3 T.

| Between-scanner agreement of global volumetric measures
Supratentorial, GM, and WM volumes were estimated as 6.6-10.2% greater at 3 T than at 1.5 T. In particular, GM volumes were 8.9% greater for total GM, 10.2% greater for cortical GM and 6.6% greater for subcortical GM. WM volume was estimated as 7.0% greater at 3 T than at 1.5 T. Conversely, total CSF volume was estimated as 4.4% lower at 3 T than at 1.5 T. Scatter plots (Figure 2a) indicated that the between-scanner relationships were largely linear (slopes between 0.688 and 1.048), and the Bland-Altman plots showed that there were few participants >2 SD from the mean difference (Figure 2b). (Table S1).
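For readers unfamiliar with Bland-Altman plots, the quantities involved can be sketched as below: the per-participant mean of the two measurements, their difference, the mean difference, and limits at ±2 SD of the differences. This is a generic illustration under those assumptions, not the plotting code used for Figure 2b; the function name `bland_altman` is hypothetical.

```python
import numpy as np

def bland_altman(x15, x3, n_sd=2.0):
    """Return Bland-Altman quantities for paired 1.5 T and 3 T measures.

    Output: (pair_mean, diff, bias, (lower, upper), outside), where
    `outside` flags participants beyond bias +/- n_sd * SD of the differences.
    """
    x15, x3 = np.asarray(x15, float), np.asarray(x3, float)
    pair_mean = (x15 + x3) / 2.0       # x-axis of the plot
    diff = x3 - x15                    # y-axis of the plot (3 T minus 1.5 T)
    bias = diff.mean()                 # mean between-scanner difference
    sd = diff.std(ddof=1)
    lower, upper = bias - n_sd * sd, bias + n_sd * sd
    outside = (diff < lower) | (diff > upper)
    return pair_mean, diff, float(bias), (float(lower), float(upper)), outside
```

Counting the `True` entries of `outside` gives the number of participants lying more than 2 SD from the mean difference, which is the quantity referred to in the text.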

| Between-scanner agreement of GM measures
Mean values and between-scanner differences for cortical regions (volume, thickness, and surface area) and subcortical volumes are reported in Tables S2-S5 and summarized in Table 2. The volumes of the 68 cortical regions were measured as 10.8% greater at 3 T than at 1.5 T on average (range: −7.0 to 28.6%). The volumes of the 16 subcortical regions were measured as 7.3% greater at 3 T than at 1.5 T (range: −21.0 to 27.6%). Cortical surface areas were measured as 12.3% larger at 3 T than at 1.5 T (range: −12.0 to 29.5%). Cortical thicknesses were measured as 4.4% (range: −4.6 to 12.3%) or 0.089 mm thicker on average at 3 T than at 1.5 T. Figure 3 shows the between-scanner ICC consistency in volume,  (Figure 3).
The cortical volumes and ICC values were weakly to moderately correlated (volume: r = .431, q < 0.001; area: r = .275, q = 0.035; thickness: r = .246, q = 0.043; FDR corrected), which indicated that ICCs were generally lower for smaller regions, for example, frontal pole and pallidum. Between-scanner consistency for subcortical volumes (mean

TABLE 2 Summary of the between-scanner comparison performed at various levels of analysis using participants scanned at both 1.5 and 3 T: mean values, between-scanner differences, and intraclass correlation coefficient (ICC). Column headings: 1.5 T mean; 3 T mean; BSD (%); ICC.

3.3 | Between-scanner agreement of vertex-wise cortical measures
Figure 4 shows the between-scanner differences and ICCs for volume, area, and thickness for data smoothed at 20 mm FWHM. Volume, surface area, and thickness were all measured as greater at 3 T than at 1.5 T. The mean between-scanner difference at vertex level was 0.132 mm³ (11.5% greater at 3 T) for volume, 0.065 mm² (14.1% greater at 3 T) for surface area, and 0.092 mm (4.8% thicker at 3 T) for cortical thickness.
When computed across all vertices, the mean ICC values were broadly in line with those for the atlas-based regions reported above ( substantially lower for cortical thickness (3.2% of vertices).
Scanner effects were somewhat regionally heterogeneous. Lower volumetric ICCs in the superior frontal lobe were mainly attributable to lower ICCs for thickness, whereas lower volumetric ICCs in orbital frontal, cingulate, and medial temporal regions were common to both area and thickness. Additionally, between-scanner contrasts for these three measures (Figure S1) indicated that for our sample most cortical vertices were not significantly different between scanners (p < .05, uncorrected). Small areas of significant difference

FIGURE 1 Axial and coronal T1-weighted slices at both 1.5 and 3 T of one participant where the between-scanner supratentorial volume difference was measured at 55.86 cm³ (a) and another where supratentorial volume difference was 113.67 cm³ (b). The slices shown are in native T1 space (not co-registered) and were matched between scanners as closely as possible. Image intensity ranges were adjusted for visualization.

lower at 3 T than at 1.5 T. AD was measured as 6.0% higher and FA as 33.0% higher at 3 T than at 1.5 T. Scatter plots indicated that the between-scanner relationships were largely linear (slopes between 0.624 and 0.937). Additionally, Bland-Altman plots showed that there were very few participants >2 SD from the mean difference.

FIGURE 3 Intraclass correlation coefficients (ICC) and estimated 95% CIs between 1.5 and 3 T acquisitions for 84 grey matter regions identified by FreeSurfer 6.0 (N = 91) measuring volume, surface area and thickness. Surface area and thickness were not computed for subcortical regions.

| Between-scanner agreement of global dMRI measures
Despite the differences in absolute levels, between-scanner consistency was considered excellent (RD ICC = .882; MD ICC = .867; AD ICC = .776), or good (FA ICC = .740; Table 2). In a supplementary analysis, we observed that ICCs were ~.1 lower in WM than when the same four measures were sampled across the whole brain (Table S6).
The lower values for WM could be explained by the discrepancy in GM/WM segmentation between 1.5 and 3 T, by which the whole-brain measures were unaffected.

| Between-scanner agreement of major WM tracts
The mean values and the between-scanner differences of 12 WM tracts are reported in Table S7 and summarized in Table 2. Figures S3 and S4 show scatter plots and Bland-Altman plots for these tracts. Visual inspection of the probability maps of each tract generated by probabilistic tractography revealed that streamlines more coherently followed the anatomical pathways at 3 T than at 1.5 T (Figure 6a; maps created using all data, prior to removal of QC fails), presumably due to the higher signal-to-noise and improved distortion correction at the higher field strength. Visual quality checking and exclusion of individual tracts identified more aberrant streamlines across subjects at 1.5 T than at 3 T with a tract success rate of 91.6-98.9% at 1.5 T and 95.3-100% at 3 T.
Across all tracts, FA was consistently higher at 3 T (mean: 37.4%; range: 23.6-48.6%), and MD consistently lower (mean: 5.8%, range: −15.1 to 1.0%), with only the right cingulum bundle having a 1.0% increase in MD at the higher field strength. We also applied the 1.5 T pipeline to 3 T data from 10 subjects and found that for the 12 tracts the FA values were measured as 3.4-24.8% higher using the 3 T pipeline, suggesting that the apparent increase in FA at 3 T was partly driven by the new FSL tools used in the 3 T tractography pipeline.  Table S8). For gMD, the first unrotated principal component explained 44% of the variance at 1.5 T and 56% at 3 T. For gFA, the first principal component explained 35% of the variance at 1.5 T and 31% at 3 T. Both gMD and gFA provided excellent between-scanner consistency (ICCs of .850 and .769, respectively), which was ~.3 greater than the mean ICC of the 12 tracts (Figure 6b and Table S7).
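A general factor of the kind reported here (gMD, gFA) can be illustrated as the first unrotated principal component of the tract-wise measures. The sketch below assumes complete data and a PCA on standardized (z-scored) tract values, that is, on the correlation matrix; neither detail is stated in the text, and the function name `general_factor` is hypothetical.

```python
import numpy as np

def general_factor(tract_values):
    """First unrotated principal component of tract-wise measures.

    tract_values : array (n_subjects, n_tracts), e.g. MD or FA in 12 tracts.
    Returns (scores, explained), where `explained` is the proportion of
    variance accounted for by the first component.
    Assumptions: no missing values; columns are z-scored before PCA.
    """
    x = np.asarray(tract_values, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

    # SVD of the standardized data yields the principal axes in vt.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)

    loadings = vt[0]
    if loadings.sum() < 0:          # fix the arbitrary sign of the component
        loadings = -loadings
    scores = z @ loadings           # one general-factor score per subject
    return scores, float(explained)
```

Computing the scores separately at 1.5 T and 3 T and then passing the two score vectors to an ICC routine mirrors the comparison described above, in which the general factors were more consistent between scanners than the individual tracts.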

| Between-scanner comparison of connectome
MD- and FA-weighted whole-brain networks were computed allowing 3,570 possible connections for unthresholded networks, but only 1,071 connections were retained after consistency-thresholding at 30%. Between-scanner results for individual connection weights (edges) and three global graph-theoretic measures (mean edge weight, global network efficiency, and network clustering coefficient) are shown in Table S9 and summarized in Table 2.
For unthresholded networks, the connection density was 79.6% greater at 3 T than at 1.5 T (mean network sparsity: 0.599 [SD = 0.037] for 1.5 T; 0.334 [SD = 0.048] for 3 T), meaning that considerably more interregional WM connections were identified at the higher field strength, presumably due to higher signal-to-noise. However, after network thresholding, which retained only the top 30% most consistent connections across subjects, each participant's network was constrained to have a sparsity of ~0.7. Separate thresholds were applied at 1.5 and 3 T which resulted in a different set of connections after thresholding. However, we found that there was an overlap in the connections retained (ICC = .687 or 835/1307 matching connections) when comparing the binary masks obtained from the thresholding procedure between the two field strengths.
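The two bookkeeping quantities used in this paragraph can be sketched as follows, assuming that sparsity is the proportion of possible connections with zero weight and that the matching count is expressed as connections shared by both masks over connections present in either mask. The second assumption is consistent with the reported numbers, since 1,071 + 1,071 − 835 = 1,307. Function names are hypothetical.

```python
import numpy as np

def network_sparsity(weights):
    """Proportion of possible undirected connections with zero weight."""
    w = np.asarray(weights, dtype=float)
    iu = np.triu_indices(w.shape[0], k=1)    # count each node pair once
    return float(np.mean(w[iu] == 0))

def mask_overlap(mask_a, mask_b):
    """Shared and total retained connections between two thresholding masks.

    Returns (shared, union, shared / union) over the upper triangle.
    """
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    iu = np.triu_indices(a.shape[0], k=1)
    a, b = a[iu], b[iu]
    shared = int(np.sum(a & b))              # retained at both field strengths
    union = int(np.sum(a | b))               # retained at either field strength
    return shared, union, shared / union
```

Under this reading, 835 shared connections out of 1,307 corresponds to a ratio of about .64, which is distinct from the ICC of .687 computed on the binary masks.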
For MD-weighted networks, mean edge weight was measured as 4.7% greater, network efficiency as 0.3% greater and network clustering coefficient as 5.3% lower at 3 T than at 1.5 T. For FA-weighted networks, mean edge weight was 39.1% greater, network efficiency was 32.6% greater and network clustering coefficient was 28.6% greater at 3 T than at 1.5 T. Figure 7a,b shows the between-scanner results for these network metrics for both MD and FA networks. Consistency was rated as good to excellent for all global metrics (ICC range: .612-.888). The greatest consistency was for network efficiency with MD (ICC = .888) and network clustering coefficient with MD (ICC = .883). FA-weighted metrics were rated as excellent for network clustering coefficient (ICC = .799) and network efficiency (ICC = .794). The lowest consistency for these measures, which was rated as good, was for mean edge weight with FA (ICC = .612) and MD (ICC = .680). Despite these differences in between-scanner consistency for the three metrics, we noted strong collinearity among the three graph-theoretic measures (r > .796 for MD; and r > .943 for FA).

FIGURE 5 Between-scanner differences of four water diffusion measures, namely, axial diffusivity (AD), radial diffusivity (RD), mean diffusivity (MD), and fractional anisotropy (FA), measured in cerebral white matter for 79 participants scanned at both 1.5 and 3 T: (a) scatter plots where the continuous blue line shows linear fit with 95% CI; (b) Bland-Altman plots of the same four measures showing the mean of between-scanner measures and the difference between these measures where the blue line indicates the mean and the red lines represent ±2 SDs.
The ICCs for each of the 1,071 individual connections which were retained following 30% network thresholding are shown in Figure 7c.
Overall between-scanner consistency was poor for both MD and FA networks (mean ICCs ≤ .280; Table 2). For FA, the mean ICC was .275 although the 95% IPR was broad (−0.051 to 0.795). This corresponded to a proportion of excellent/good/fair/poor of 5.4/9.0/12.4/73.2%. For MD, the mean ICC was .280 (95% IPR: −0.095 to 0.870) corresponding to a proportion of excellent/good/fair/poor of 14.5/8.0/6.5/71.1%. Although 30% thresholded networks achieved better between-scanner consistency (mean ICCs ≤ .280) than unthresholded networks (mean ICCs ≤ .142), the poor consistency was driven by the large difference in network sparsity between scanners, that is, there are many more zero-valued connections (marking an absence of connection between regions) at 1.5 T compared to 3 T. When all zero-valued connections (Figure S5) were excluded and ICCs were computed for only the 428 network connections (12% of all possible connections) that had a nonzero value in every participant, the between-scanner consistency was considerably higher (mean ICC of .647 for MD and .533 for FA).

FIGURE 6 Between-scanner comparison of 12 white matter (WM) tracts in 90 participants: (a) anatomical probability maps for both 1.5 and 3 T showing the streamline density of each tract (left-side only for bilateral tracts) across participants for whom validated tract data was available; (b) intraclass correlation coefficients (ICCs) and estimated 95% CIs between 1.5 and 3 T acquisitions for 12 tracts (and their general factors) identified by probabilistic neighborhood tractography and measuring both mean diffusivity (MD) and fractional anisotropy (FA). ATR, anterior thalamic radiations; ILF, inferior longitudinal fasciculus.
3.7 | Prediction of "unseen" 1.5 T imaging variables from 3 T data
Slopes and intercepts from a linear fit of global and regional imaging variables between scanners are reported in Tables S1-S7 and S9, alongside the predicted model fit (R²) obtained from 10-fold cross-validation with a linear model. The range of predicted model fit was variable across all imaging measures (.155-.991) but the highest R² values indicated that differences between scanners could be virtually eliminated (almost perfect prediction) for global volumetric measures and large brain structures. For global T1 volumetric measures (Table S1), the R² range was .615-.991 with estimated intracranial volume having the lowest and CSF having the highest R². For volumetric variables derived from FreeSurfer volumetric and subcortical processing (

FIGURE 7 Between-scanner results for whole-brain structural networks using 85 nodes with 30% network thresholding, connection strength weighted by both MD and FA and computed using 79 participants scanned at both 1.5 and 3 T: (a) scatter plots for three global network metrics, where the continuous blue line shows linear fit with 95% CI; (b) Bland-Altman plots of the same network metrics showing the mean of between-scanner measures and the difference between these measures where the blue line indicates the mean and the red lines represent ±2 SDs; (c) anatomical network plots for FA- and MD-weighted networks, where link color and thickness represent the intraclass correlation coefficient (ICC) for each connection (edge). Node abbreviations are listed in Table S10.

| DISCUSSION
In one of the largest between-scanner comparisons to date, we report previously lacking information on a wide range of structural brain measures in an exclusively older group of participants. We found excellent levels of consistency (ICC >~.75) between the 1.5 and 3 T scanners for the largest brain structures (whole-brain, ventricular and tissue volumes; global dMRI measures in WM; and global network metrics) that were similar to same-scanner test-retest studies (Buchanan et al., 2014;Iscan et al., 2015;Luque Laguna et al., 2020;Melzer et al., 2020). We noted that there were overall mean shifts in the absolute levels of most measures between 1.5 and 3 T: volumetric measures and thickness appeared larger at 3 T, RD, and MD were lower, and AD and FA were higher at 3 T, consistent with prior observations from smaller studies on single metrics (Chu et al., 2017;Han et al., 2006;Heinen et al., 2016;Pfefferbaum et al., 2012), but not others (West et al., 2013). Regression-based correction for scanner (using intercept differences) effectively eliminated scanner differences in unseen (hold-out) data for global brain measures, giving similar (and sometimes higher) agreement than might be expected from samescanner test-retest data: global measures could be accurately predicted in line with 1.5 T values from 3 T data using 10-fold crossvalidation.
Interestingly, both GM and WM tissue volumes appeared larger at 3 T than at 1.5 T, but CSF volume was smaller. Contributing factors are likely to include a combination of higher tissue contrast (resulting in differences in the tissue-CSF boundary), different scanner-specific geometric distortions and a slight difference in T1-weighted voxel dimensions. More numerous sampling instances along a complex surface may result in both superior estimation (cf., Cavalieri), and the "coastline paradox," whereby complex shapes appear larger when measured with greater fidelity (cf., Richardson; Napolitano, Ungania, & Cannat, 2012). This clearly has important implications for cross-scanner analyses that use ICV or CSF correction to measure atrophic change in global tissue volumes from cross-sectional data with different voxel dimensions: lower field strengths may potentially result in higher estimates of atrophy.
As would be expected, between-scanner agreement decreased as the granularity increased from large brain structures to include smaller regional imaging variables. Scanner agreement at the regional level was similar to or slightly lower than prior same-scanner work, such as for cortical regional measures (Boekel et al., 2017; Clayden et al., 2009; Liem et al., 2015; Luque Laguna et al., 2020; Madan & Kensinger, 2017; Srinivasan et al., 2020). We also found that smaller GM regions typically had poorer between-scanner agreement than large regions; this between-scanner finding corresponds well with the known relationship between reliability and region size observed in same-scanner work (Iscan et al., 2015; Tustison et al., 2014). This finding indicates that in this specific case, scanner differences may not contribute a substantial amount of noise beyond that typically seen in test-retest settings. It also contributes more generally to the literature on the merits and drawbacks of increasing cortical atlas granularity for the reliability of the structural connectome (de Reus & van den Heuvel, 2013) or structural-functional correspondence (Messé, 2020).
Additionally, a recent 1.5-3 T field strength comparison (N = 113) reported a broadly similar pattern for regional reliability of FreeSurfer segmentation (Srinivasan et al., 2020). The authors of that study also identified a bias in the FreeSurfer procedure toward under-segmentation of subcortical structures, particularly hippocampal volumes, in older subjects.
Our vertex-wise cortical analyses were valuable in that they showed that ICCs increase with greater smoothing and revealed a pattern of between-scanner ICC consistency that is agnostic to the boundaries imposed by a particular cortical atlas. Prior findings suggest that cortical thickness generally shows lower reliability than either volume or surface area in a same-scanner setting (Iscan et al., 2015), with which our findings are consistent. Interestingly, although the percent differences between 1.5 and 3 T data were wider for volume and surface area (especially prevalent in lateral and orbital frontal, cingulate and posterior temporal areas) than for thickness, ICCs were much lower for thickness than for either volume or area. Thus, although the increase at 3 T is proportionally greater for cortical volume and area than for thickness, this overestimation is far more systematic (the rank order is better preserved across scanners) than it is for thickness. It is possible that thickness appears less reliable between 1.5 and 3 T because the two boundaries upon which it relies (GM-WM and GM-CSF) to derive sub-millimeter measurements are those that would be affected by contrast differences between field strengths.
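The distinction drawn above, between a large but systematic offset and a small but unsystematic one, can be illustrated with two standard single-measure, two-way ICC formulations (consistency and absolute agreement, in the sense of Shrout and Fleiss or McGraw and Wong). This is a sketch only: the values are invented, `icc_two_way` is a hypothetical helper, and which formulation the study used is not stated in this section.

```python
def icc_two_way(scores):
    """scores: one row per participant, one column per scanner.
    Returns (consistency, absolute_agreement) single-measure ICCs
    computed from the two-way ANOVA mean squares."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    return consistency, agreement

# Invented values for five participants.
scanner_a = [2.0, 2.2, 2.4, 2.6, 2.8]

# Case 1: a constant offset of +0.5 with rank order fully preserved.
shifted = [[a, a + 0.5] for a in scanner_a]
cons_shift, agree_shift = icc_two_way(shifted)

# Case 2: no mean difference, but the rank order is scrambled.
scrambled = [[a, b] for a, b in zip(scanner_a, [2.4, 2.0, 2.8, 2.2, 2.6])]
cons_scr, agree_scr = icc_two_way(scrambled)

print(cons_shift, agree_shift, cons_scr, agree_scr)
```

In the first case the consistency ICC is unaffected by the offset while the absolute-agreement ICC is penalized by it; in the second case both are low despite a zero mean difference. This is why a measure can show a wide percent difference between scanners and still have a high consistency ICC.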
With respect to dMRI data, the increase in FA and AD, and decrease in RD and MD between 1.5 and 3 T, as well as the higher number of WM inter-regional connections, may also be indicative of superior signal-to-noise (a better fit of the diffusion tensor). However, these differences may also be partly explained by improved distortion correction at 3 T. On this latter factor, the application of a modified pipeline for the PNT-identified WM tracts (Tractor v2.6 with FSL v4 at 1.5 T and Tractor v3.3 with FSL v5) was necessary to work with multi-shell data for which the prior versions were not optimized, and to apply the more advanced tools in eddy-current distortion and susceptibility corrections that we would be using in future study waves at 3 T. This is likely to have provided an additional source of inconsistency for the PNT-identified WM tract analyses, whose ICCs were generally poorer than those from similar methods in same-scanner designs, which report ICCs > .54 (Boekel et al., 2017; Clayden et al., 2009; Luque Laguna et al., 2020). Indeed, the contribution of pipeline differences was borne out in our supplementary analyses: applying a different pipeline to 3 T data substantially affected dMRI measures in a small sample of our participants, and FA increased substantially in WM tracts using more recent processing algorithms, which provided better distortion corrections (FA tends to increase with better distortion correction; Yamada et al., 2014). Nevertheless, our findings in the main analyses still indicated "fair" consistency, with poorer agreement found for smaller tracts, which involved fewer streamlines and were generally found close to the ventricles; these were more likely to suffer from partial CSF contamination for some streamlines.
Global network metrics derived from the structural connectome showed good to excellent consistency, comparable to same-scanner results (Buchanan et al., 2014; Cheng et al., 2012). We found between-scanner consistency to be poor at the level of individual connections, though this was vastly improved when we accounted for differences in network sparsity (many more connections were identified at 3 T than at 1.5 T). Poor between-scanner consistency was not unexpected because the variability in T1-weighted regional segmentation and tractography both contribute to the variability in the resulting networks. Our results suggest that multi-scanner network analyses require careful consideration in the treatment of acquisition-specific network sparsities, such as the use of stringent thresholding or other de-noising methods (de Reus & van den Heuvel, 2013; Roberts et al., 2017).
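One way to account for differing sparsity, restricting the comparison to connections present in every participant on both scanners, can be sketched as below. The toy 4-node networks, the function names (`sparsity`, `common_edges`) and the definition of sparsity as the fraction of absent possible connections are all assumptions made for illustration; they are not taken from the study's pipeline.

```python
def sparsity(adj):
    """Fraction of possible undirected connections with zero weight
    (assumed definition; adj is a symmetric weight matrix)."""
    n = len(adj)
    possible = n * (n - 1) // 2
    present = sum(1 for i in range(n) for j in range(i + 1, n)
                  if adj[i][j] > 0)
    return 1.0 - present / possible

def common_edges(networks):
    """Edges (i, j) with i < j that have non-zero weight in every network."""
    n = len(networks[0])
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if all(net[i][j] > 0 for net in networks)]

# Toy networks for two participants; the second scanner finds more connections.
nets_scanner_1 = [
    [[0, .4, 0, 0], [.4, 0, .3, 0], [0, .3, 0, 0], [0, 0, 0, 0]],
    [[0, .5, 0, 0], [.5, 0, 0, 0], [0, 0, 0, .2], [0, 0, .2, 0]],
]
nets_scanner_2 = [
    [[0, .5, 0, .1], [.5, 0, .4, 0], [0, .4, 0, .2], [.1, 0, .2, 0]],
    [[0, .6, .1, 0], [.6, 0, .3, 0], [.1, .3, 0, .3], [0, 0, .3, 0]],
]

sparsity_1 = [sparsity(net) for net in nets_scanner_1]
sparsity_2 = [sparsity(net) for net in nets_scanner_2]

# Keep only edges found in every participant at both scanners; edge-wise
# agreement statistics would then be computed on this subset alone.
shared = common_edges(nets_scanner_1 + nets_scanner_2)
weights_1 = {e: [net[e[0]][e[1]] for net in nets_scanner_1] for e in shared}
weights_2 = {e: [net[e[0]][e[1]] for net in nets_scanner_2] for e in shared}
print(sparsity_1, sparsity_2, shared)
```

Connections that are absent in some participants contribute zeros on one scanner and non-zero weights on the other, which depresses edge-wise agreement; restricting to the shared set removes that source of disagreement at the cost of discarding edges.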

| Limitations
The present study has several limitations which should be taken into account. Our aim was to determine between-scanner differences in a sample of exclusively older subjects, including those with representative age-related pathology. However, same-scanner test-retest variability has been shown to be greater in older subjects than in younger subjects (Jovicich et al., 2009). Additionally, there was a relatively large interval (mean of 72 days) between scans, but even in older age it is unlikely that age-related structural changes can be reliably detected by MRI over such a short period (Resnick et al., 2000). In addition, direct comparison to prior between-scanner and same-scanner test-retest studies is problematic because different statistics are commonly used, including different formulations of the ICC.
The specific scanner configurations used in this work may limit the generalizability of the current findings given that between-scanner agreement can be influenced by scanner manufacturers, acquisition parameters and image processing software (Heinen et al., 2016; Jovicich et al., 2009; Tardif, Collins, & Pike, 2010; Wardlaw et al., 2012). Our study represents a change in field strength, manufacturer, acquisition and some necessary processing steps (e.g., for dMRI processing), such that we must be clear that the differences between scans cannot be attributed to field strength alone. A previous between-scanner comparison showed that reliability is typically better when the same scanner manufacturer was used than when different scanner manufacturers were used (Jovicich et al., 2009). Some between-scanner volumetric variability must be attributed to the slight mismatch in T1-weighted voxel dimensions (only 3 T voxels were isotropic). The spatial resolution used in our primary study was due to constraints on the scanning time for the required modalities. However, the FreeSurfer morphometric procedure was designed to be sequence-independent and involves interpolating T1 volumes to isotropic voxels before segmentation. Additionally, for longitudinal settings the more recent FreeSurfer longitudinal processing pipeline has been shown to obtain better cross-session reliability than the cross-sectional pipeline (Jovicich et al., 2013). Additionally, we did use openly accessible and commonly used methods across a large range of structural neuroimaging measures. Different versions of dMRI processing software were used as we needed to keep 1.5 T acquisition and processing consistent with prior waves of our longitudinal study. We cannot use the newer FSL software tools with our 1.5 T data as we do not acquire the necessary reverse phase-encoded volumes (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/topup).
We employed a straightforward linear regression approach using k-fold cross-validation, and therefore cannot rule out that promising scanner harmonization and calibration methods would further improve cross-scanner reliability (Cetin Karayumak et al., 2019; Keshavan et al., 2016; Pinto et al., 2020; Tax et al., 2019). Finally, we judged that providing uncorrected, rather than corrected, p-values in the statistical test of scanner differences at the vertex level was more sensitive for the purpose of illustrating any potential differences. Nevertheless, there is relatively low statistical power here (e.g., Schönbrodt & Perugini, 2013), even though this study represents one of the largest cross-scanner studies, potentially resulting in an underestimation of cross-scanner differences that may be apparent in larger meta- and mega-analytic settings.

| CONCLUSIONS
Longstanding longitudinal studies are torn between maintaining consistency of the MRI protocol and embracing improvements in scanner technology. The present study reports previously lacking cross-scanner results on a broad range of structural brain measures in a comparatively large sample of older participants. Global measures showed consistently good or excellent agreement, with lower agreement seen with increasing granularity of measurement, though in most cases these were still comparable to prior within-scanner test-retest results. Differences in the absolute level were prevalent, but we showed that, particularly for global measures, between-scanner variability could be effectively eliminated in unseen (hold-out) data using a k-fold cross-validation linear model. We conclude that low granularity measures of brain structure can be reliably measured between the different scanner manufacturers and field strengths tested. However, we recommend caution in combining high granularity information from different scanners. These data have useful implications for multicenter meta- and mega-analyses combining data across hardware, software and field strengths (Van Den Heuvel et al., 2019), and provide much-needed information in an exclusively older age group which is underrepresented in this literature.

ACKNOWLEDGMENTS
The authors thank the Lothian Birth Cohort 1936 members who took part in this study, and Lothian Birth Cohort 1936 research team members and radiographers who collected and checked data used in this study. The LBC1936 and this research are supported by Age UK (Disconnected Mind project) and by the UK Medical Research Council