Longitudinal Structural MRI in Neurologically Healthy Adults

Background: Structural brain MRI measures are frequently examined in both healthy and clinical groups, so an understanding of how these measures vary over time is desirable.
Purpose: To test the stability of structural brain MRI measures over time.
Population: In all, 112 healthy volunteers across four sites.
Study Type: Retrospective analysis of prospectively acquired data.
Field Strength/Sequence: 3 T, magnetization-prepared rapid gradient echo, and single-shell diffusion sequence.
Assessment: Diffusion, cortical thickness, and volume data from the sensorimotor network were assessed for stability over time across 3 years. Two sites used a Siemens MRI scanner, two sites a Philips scanner.
Statistical Tests: The stability of structural measures across timepoints was assessed using intraclass correlation coefficients (ICC) for absolute agreement, with a cutoff of ≥0.80 indicating high reliability. Mixed-factorial analysis of variance (ANOVA) was used to examine between-site and between-scanner type differences in individuals over time.
Results: All cortical thickness and gray matter volume measures in the sensorimotor network, plus all diffusivity measures (fractional anisotropy plus mean, axial and radial diffusivities) for primary and premotor cortices, primary somatosensory thalamic connections, and the cortico-spinal tract met the ICC cutoff. The majority of measures differed significantly between scanners, with a trend for sites using Siemens scanners to produce larger values for connectivity, cortical thickness, and volume measures than sites using Philips scanners.
Data Conclusion: Levels of reliability over time for all tested structural MRI measures were generally high, indicating that any differences between measurements over time likely reflect underlying biological differences rather than inherent methodological variability.
Level of Evidence: 4.
Technical Efficacy: Stage 1.

changes can also act as valuable markers for both the timing and efficacy of therapeutic treatment at an exploratory level.
In the field of neuroimaging, higher-resolution structural magnetic resonance imaging (MRI) is used to examine brain macrostructure including volume, cortical thickness, and surface area, while diffusion-weighted MRI is one method that interrogates the microstructural properties of white matter. 1 Both techniques are employed to measure either region-specific or whole-brain structural changes across clinical groups, including those with neurodegenerative disease. 2,3 In particular, structural MRI can be used to highlight and index morphological differences in regions of the brain associated with specific pathologies, as compared to healthy controls (or other disease groups), and how these differences change over the course of a disease trajectory. 4 There is very robust structural MRI evidence, for example, of striatal degeneration in the early stages of Huntington's disease (HD) 5,6 and of medial temporal lobe degeneration in probable Alzheimer's disease (AD), 7,8 while structural measures can inform clinical diagnosis and treatment decisions in disorders such as multiple sclerosis and dementia. 9,10 Structural MRI-derived measures of brain volume are currently included as secondary endpoints in clinical trials. However, although microstructural measures are routinely used in research studies, the complexity of diffusion-weighted MRI acquisition and analysis means that they have not yet been used in clinical trials.
Despite the clear utility of both structural and diffusion-weighted MRI, it is also important to consider that, as with any imaging technique, both noise and variations can be introduced at almost any stage. Rigorous protocol standardization and staff training notwithstanding, potential sources of variation include local magnetic field inhomogeneities, experimenter bias, inherent issues within analysis packages, and, importantly, differences between the individuals being scanned. All of these causes of variability are unrelated to the hypotheses that are being tested, but need to be considered when interpreting data. Multisite studies are becoming more frequent and they customarily include several different scanner types and models, in addition to variation in site personnel. Hence, a sound understanding of the nature and subsequent prevention of intersite differences is increasingly important.
Promisingly, there is some evidence to suggest that both standard scalar diffusion tensor imaging (DTI) metrics including diffusivity and fractional anisotropy (a ratio of diffusivity perpendicular and parallel to the direction of the main underlying fiber) and volumetric measures are reproducible across sites and scanner types despite predictably greater between-site variability. [11][12][13] However, most investigations have tended to use imaging phantoms to assess MRI reproducibility or small groups of healthy controls over very short time intervals. [11][12][13] While useful, this does not provide sufficient guidance for large multisite observational studies or clinical trials that examine participants over intervals of up to a year or longer. 14,15 It is important to have a clear grasp of the reliability of structural measures in order to understand disease progression (or its modification by treatment) over substantial time intervals. This consideration is especially important in populations where brain structure is known to change over time (eg, in neurodegenerative populations). Therefore, investigating reproducibility of imaging measures in healthy volunteers over longer time periods can help characterize variability over time due to measurement error or systematic biological change (eg, "healthy" age-related change). This, in turn, can help distinguish genuine effects related to pathology in disease populations. Accordingly, a recent study retrospectively examined the reliability of electrophysiological data in healthy individuals from the Track-On HD multisite longitudinal study over three annual timepoints using intraclass correlation coefficients (ICCs) and demonstrated that some measures met the criteria for high levels of stability, while others did not. 16 The study also identified limited between-site differences.
Here we have similarly retrospectively investigated the reliability of structural and diffusion MRI-based metrics of volume, cortical thickness, and anatomical connectivity focusing on the sensorimotor network in a similar cohort of healthy individuals from TrackOn-HD. 2,17 We aimed to characterize the effects of time, scanner type, and site on structural brain MRI measures.

Participants
Healthy control participants were recruited into the TrackOn-HD study at four study sites (London, Paris, Leiden, Vancouver). 2,17 For the present analyses, we used data from 112 participants (F = 67; mean age = 48 years ± SD: 11 years) who had complete DTI data for all three timepoints. Exclusion criteria included age below 18 or above 65 (unless previously in the Track-HD study), major psychiatric, neurological, or medical disorder or a history of severe head injury. 17 The study was approved by the local Ethics Committees, and all participants gave written informed consent according to the Declaration of Helsinki. At visits one to three, individuals had an average age of 48.1 years (SD: 10.7), 49.4 years (SD: 10.5), and 51 years (SD: 10.3), respectively. Attrition rates were low. At visit two, retention of participants was 93% and at visit three 87%. Table 1 contains demographic information about participants broken down by study site and scanner.

MRI Data Acquisition and Analysis
Data acquisition across sites was standardized as previously described. 2,18 In short, all site staff participated in a training session and regular contact was maintained between sites and study coordination throughout data collection. Prior to the start of the study, a human phantom was used at all sites to ensure identical settings and instructions. Throughout the study, data quality was monitored visually by IXICO (UK; Contract Research Organization). In parallel, quality control (QC) software was applied to all scans within 3 working days of acquisition. 3T MRI data were acquired on two different scanner systems (Philips Achieva at Leiden, Netherlands, and Vancouver, British Columbia, Canada, and Siemens TIM Trio at London, UK, and Paris, France). T1-weighted image volumes were acquired using a 3D magnetization prepared rapid gradient echo (MPRAGE) acquisition sequence with the following imaging parameters: repetition time (TR) = 2200 msec (Siemens [S]) / 7.7 msec (Philips [P]), echo time (TE) = 2.2 msec (S) / 3.5 msec (P), flip angle (FA) = 10° (S) / 8° (P), field of view (FOV) = 28 cm (S) / 24 cm (P), matrix size 256 × 256 (S) / 224 × 224 (P), 208 (S) / 164 (P) sagittal slices with a slice thickness of 1.0 mm with no gap. Diffusion-weighted images were collected with 42 unique gradient directions (b = 1000 sec/mm²) on both scanner types plus eight images with no diffusion weighting (b = 0 sec/mm²) (S) and one image with no diffusion weighting (b = 0 sec/mm²) (P). Acquisition parameters were TE = 88 msec, TR = 13,000 msec, and voxel size 2 × 2 × 2 mm (S); TE = 56 msec, TR = 11,000 msec, and voxel size 1.96 × 1.96 × 2.75 mm (P).

T1 Processing
T1 scans underwent visual QC upon data collection (performed by S.G., E.J., R.S.) to check for incorrect parameters in the metadata and image artifacts such as motion artifacts. Scans were then bias-corrected to correct for inhomogeneity within the images using the N3 algorithm. 19 Images were segmented using FreeSurfer v. 5.3 run via the default recon-all pipeline with the 3T flag. FreeSurfer has two independent default automatic processing streams, surface-based and volume-based, which are used to calculate different characteristics of structural MRI scans. Following processing, all FreeSurfer regions underwent visual QC (performed by S.G., E.J., R.S.), with both volumetric and thickness regions examined and scans excluded if regions showed a high degree of error across multiple slices. Volumetric and thickness values were automatically calculated and extracted from the following Brodmann areas: BA1 (somatosensory area), BA2 (somatosensory area), BA3a (somatosensory area), BA4a (primary motor area; anterior), BA4p (primary motor area; posterior), and BA6 (premotor area).

DTI Processing
Diffusion data were preprocessed using standard FSL (FMRIB Software Library) pipelines (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki). 20 Each DTI dataset was screened for artifacts, signal dropout, and motion (performed by S.G., E.J., R.S.) and then motion-corrected using eddy_correct in FSL; vector gradient information was updated accordingly. Both the B0 and T1 structural images were skull-stripped using the Brain Extraction Tool (BET) and manually corrected for instances of mis-segmentation, where extraneous tissue had occasionally not been removed. To improve quality of the extracted structural image, we combined and dilated a thresholded segmented image with an eroded brain-extracted T1 mask, which was then applied to the original brain-extracted T1 image. The new T1 image was then registered to the B0 image using FMRIB's Linear Image Registration Tool (FLIRT). Diffusion tensors were fit using dtifit and crossing fibers modeled using Bedpostx. 21 Probabilistic tractography was then performed for a series of sensorimotor tracts using probtrackx 22 ; these included tracts connecting the primary motor cortex (M1) and the motor thalamus; the premotor cortex (PMC) and the motor thalamus; and the primary somatosensory cortex (S1) and the somatosensory thalamus. Regions of interest were created using the Anatomy Toolbox and warped into native space for each participant. Exclusion masks were used to exclude streamlines from outside the anatomically defined tract, and a white matter termination mask was used to ensure tracts did not extend into gray matter, cerebrospinal fluid (CSF), or dura. All tracts were then warped into diffusion space using FLIRT. Fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD) were extracted for each participant for each tract.
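For readers less familiar with the four scalar metrics, the sketch below shows how FA, MD, AD, and RD are conventionally derived from the three eigenvalues of a fitted diffusion tensor. It illustrates the standard definitions only and is not the code used in this study (dtifit produces these maps internally); the function name and the example eigenvalues are hypothetical.

```python
import math


def dti_scalars(l1, l2, l3):
    """Standard DTI scalars from tensor eigenvalues, with l1 >= l2 >= l3.

    Eigenvalues are diffusivities (e.g. in mm^2/sec); FA is unitless.
    """
    md = (l1 + l2 + l3) / 3.0   # mean diffusivity: average of all three
    ad = l1                     # axial diffusivity: along the principal axis
    rd = (l2 + l3) / 2.0        # radial diffusivity: mean of the two minor axes
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # FA is a normalized dispersion of the eigenvalues around MD (0 = isotropic).
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return {"FA": fa, "MD": md, "AD": ad, "RD": rd}


if __name__ == "__main__":
    # Hypothetical eigenvalues for a coherent white matter voxel (mm^2/sec).
    print(dti_scalars(1.7e-3, 0.3e-3, 0.3e-3))
```

In a tract-based analysis such as the one described here, these voxelwise values would then be averaged within each tract mask to give one value per metric, tract, and participant.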

Statistical Analysis
To assess the reliability of our MRI measures over time, we calculated the average two-way random-effects intraclass correlation coefficient for absolute agreement (ICC), hereafter written as ICC(2,k).
For the two-way random-effects ICC, participants and observations were treated as random effects (ie, we assumed that both people and timepoints were samples from a larger population). 23 For each dependent variable, data were filtered prior to analysis to remove participants with missing data (the number of missing cases is reported for each variable below). The ICC(2,k) can be interpreted as the ratio of true variance to total variance for k measures. 24 In our case, k = 3 for the three timepoints and we selected ICC(2,k) ≥0.80 as the cutpoint indicating a relatively stable and reliable measure with relatively little variation within a person over time compared to the individual differences between people. 16,25 The single measure two-way random-effects ICC was calculated to estimate absolute agreement, referred to as ICC(2,1). To establish systematic sources of variation in our data, we also conducted 3 × 2 repeated measures analysis of variance with a within-participant factor of time and a between-participant factor of scanner type. For these tests, we applied a Bonferroni correction for multiple comparisons to each effect in the model (ie, the alpha-level for main-effects and interactions was adjusted independently). Mauchly's test was used to check for violations in sphericity. If violations were found, a Greenhouse-Geisser correction was applied (denoted by adjusted degrees of freedom in the tables below). Within each scanner type, we also conducted pairwise comparisons comparing the different study sites using the same scanner to each other. All analyses were conducted using SPSS v. 23.0 (IBM, Armonk, NY). All descriptive statistics are reported as mean (SD) unless otherwise indicated.
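To make the two reliability indices concrete, the following sketch computes ICC(2,1) and ICC(2,k) for absolute agreement from the mean squares of a two-way layout (participants × timepoints), using the standard Shrout and Fleiss style formulas. The analyses in this study were run in SPSS, so this is an illustration rather than the authors' code; the function name and the toy data are hypothetical.

```python
def icc_two_way_random(data):
    """Return (ICC(2,1), ICC(2,k)) for absolute agreement.

    data: list of n rows (participants), each a list of k values (timepoints),
    with no missing entries.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares for participants (rows), timepoints (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-participant mean square
    msc = ss_cols / (k - 1)              # between-timepoint mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average


if __name__ == "__main__":
    # Hypothetical cortical thickness values (mm): 5 participants x 3 visits.
    toy = [
        [2.41, 2.39, 2.40],
        [2.62, 2.60, 2.57],
        [2.20, 2.25, 2.21],
        [2.55, 2.50, 2.52],
        [2.33, 2.31, 2.34],
    ]
    single, average = icc_two_way_random(toy)
    print(f"ICC(2,1) = {single:.3f}, ICC(2,k) = {average:.3f}")
```

Because the timepoint mean square enters both denominators, a systematic shift between visits lowers these absolute-agreement indices even when participants keep their rank order.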

Results
Assessing Reliability over Time
All measures of white matter diffusivity (AD, RD, FA, MD) in connections between the M1 motor cortex region, the premotor cortex, or the S1 somatosensory cortex and the thalamus, as well as the cortico-spinal tract, met the ICC(2,k) cutoff ≥0.80, indicating high reliability (Table 2). In addition, all measures of cortical thickness and volume of various cortical gray matter regions in both hemispheres (BA1, BA2, BA3a, BA3b, BA4a, BA4p, BA6) also met the ICC(2,k) cutoff criterion ≥0.80 of high reliability (Tables 3 and 4, respectively). It should also be noted that ICC(2,1) was moderate to high across all measures. However, a number of measures scored between 0.6 and 0.8.
As shown in Tables 2-4, there were also statistically significant main-effects of time for several of the neuroimaging measures. For structural connectivity measures (Table 2), very few outcomes showed statistically significant main-effects of time. Indeed, following a Bonferroni correction for multiple comparisons, the only statistically significant main-effect of time was for axial diffusivity between M1 and the thalamus (P < 0.001). However, this main-effect was superseded by a significant Scanner × Time interaction (P < 0.001). Scanner × Time interactions were also found for axial diffusivity and mean diffusivity between PMC and thalamus (P < 0.001). (Details of these interactions are shown in Appendix S1.) For cortical thickness measures (Table 3), there were no statistically significant effects of time after adjusting for multiple comparisons. Further, there were no statistically significant Scanner × Time interactions (see Appendix S1). For cortical volume measures (Table 4), following correction for multiple comparisons, there were statistically significant main-effects of time for BA1 (left and right hemispheres, P's < 0.001), BA4a (left and right hemispheres, P's < 0.001), and BA6 (left and right hemispheres, P's < 0.001). There were also statistically significant Scanner × Time interactions for BA4 and BA6 (left and right hemispheres, P's < 0.001; see Appendix S1).

Assessing Agreement Between Study Sites
The majority of diffusivity measures (Table 5) differed significantly between scanners/sites (Leiden and Vancouver vs. London and Paris). Specifically, diffusion values were higher for data collected on Siemens scanners compared to those from Philips scanners for AD and MD measures in the M1 tract (P < 0.001), AD and MD measures in the PMC tract (P < 0.001), and for AD and MD measures in the S1 tract (P < 0.001).
Cortical thickness (Table 6) also differed significantly between scanners bilaterally across most cortical regions (P's < 0.001). The only exceptions were BA4a and BA4p in the left (P = 0.064, P = 0.046) and the right hemispheres (P = 0.132, P = 0.631). For all other cortical thickness measurements, values were higher bilaterally for BA1, BA2, BA3a, BA3b, and BA6 for Siemens compared to Philips scanners.
For cortical volume measurements (Table 7), values were higher for Siemens scanners for bilateral BA3a, BA3b, and BA6 (P's < 0.001). Left hemisphere BA1 showed a significant main-effect of Scanner (P < 0.001), whereas right hemisphere BA1 did not (P = 0.023). In these regions, volume measures on Siemens scanners were generally higher than on Philips scanners (P < 0.05). Other regions showed a similar pattern, but the differences were not statistically significant following a correction for multiple comparisons.
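The distinction between uncorrected and corrected significance in the paragraph above can be illustrated with a short Bonferroni sketch. It assumes a family of c = 14 comparisons, the value given in one of the table footnotes in this article; whether that count applies to Table 7 specifically is an assumption here, and the function name is hypothetical.

```python
def bonferroni_significant(p_value, n_comparisons, alpha=0.05):
    """True if p_value survives a Bonferroni correction.

    The per-test threshold is alpha divided by the number of comparisons.
    """
    return p_value < alpha / n_comparisons


if __name__ == "__main__":
    c = 14  # assumed number of comparisons in the family
    print(f"per-test threshold: {0.05 / c:.5f}")
    # Right-hemisphere BA1 scanner effect as reported in the text (P = 0.023).
    print(bonferroni_significant(0.023, c))
    # Any effect reported as P < 0.001 lies below the corrected threshold.
    print(bonferroni_significant(0.001, c))
```

Under this assumption, a P value of 0.023 falls below the conventional 0.05 level but above the corrected threshold, which matches the text's description of right BA1 as not significant.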
Within the connectivity measures, there were no statistically significant differences between the Leiden and Vancouver (Philips scanner) sites; however, there were significant differences between London and Paris (Siemens scanner) sites for AD and/or MD measures of all tracts (P's < 0.001). For the cortical thickness measures, there were no statistically significant differences between sites with the same scanner type (P ≥ 0.05). For cortical volume measures, the only statistically significant difference was in left BA2, in which the Paris site had significantly higher average volume than the London site (P < 0.001).
Table footnote: Cases where the main-effect of time is superseded by a significant scanner by time interaction; these statistically significant interactions are presented in Appendix S1.

Discussion
MRI measures of brain morphology and anatomical connectivity are key to the characterization of biological mechanisms associated with neurodegenerative disease. Understanding the reliability of these MRI measures can help interpret any pathology-related changes, and also inform the power required for their use in a study or trial. In this study we investigated the reliability over time of morphology measures (cortical thickness and volume) and of anatomical connectivity measured using DTI within the sensorimotor network, in a group of healthy individuals from the multisite, multiscanner Track-On HD study. All measures of cortical thickness, cortical volume, and white matter diffusivity for both hemispheres showed high levels of reliability, suggesting that differences between measurements over time likely reflect underlying biological differences rather than inherent methodological variability.
We first examined the long-term stability of morphological MRI-derived measures of cortical thickness and volume for brain regions within the sensorimotor cortex. Reproducibility for almost all regions was high. This was true for both reliability across the three timepoints (ICC(2,k)) and the estimated variation in "true" values captured by a single timepoint (ICC(2,1)). This is consistent with previous studies that have tested the reliability of FreeSurfer-derived cortical thickness measures in healthy people, with reproducibility across a number of visits ranging from just 2 to up to 10. 26,27 Similarly, when examining anatomical connectivity using diffusivity metrics extracted from white matter sensorimotor pathways, we found generally high levels of reliability across three timepoints (ICC(2,k)), but some diffusivity measures were only moderately reliable at any given individual timepoint (ICC(2,1)).
Both ICC measures showed that axial diffusivity was most reliable across tracts, with radial diffusivity generally lowest, but improved by calculating an average across several measurements. Measures of diffusion-based anatomical connectivity are less likely to be used in clinical trials, but they provide very useful indications of network breakdown due to changes in white matter microstructure. Previous studies have shown good levels of reliability, but tended to focus on the robustness of measures within regions of interest rather than tractography-based analyses. [11][12][13] Furthermore, previous studies have generally tested reliability over a short time period (eg, hours or days). While studies of reliability over a short timescale are important, reliability is an emergent property of the measurement tool and the context in which it is used, 14 so these studies have reduced generalizability to large multisite studies with considerable time between scans (eg, weeks or months). Our study, therefore, has, albeit retrospectively, examined variability in structural measures with greater generalizability to long timescales, specifically, studies with annual timepoints.
Table footnote: Denotes a difference that remains statistically significant following a Bonferroni correction for multiple comparisons, c = 16. Tests of site differences within a scanner (Leiden = Lei; London = Lo; Paris = P; Vancouver = V) were based on Fisher's LSD, collapsing across the three different timepoints.
Understanding the magnitude of variation from different sources is useful for researchers planning multisite and/or longitudinal studies. Reliability of a given measurement has important implications for statistical power, and the number of participants needed to achieve a desired level of statistical power is a function of the level of significance, the desired power, and the underlying effect size. However, this effect size assumes no measurement error (unless based on empirical data) and very few constructs are measured that precisely. 28 In many cases, researchers estimate effect-sizes with heuristics (eg, Cohen's d = 0.5 is a "moderate" effect; Cohen's d = 1.0 is a "large" effect), which makes the implicit assumption of no measurement error. As such, it is important that researchers temper their predicted effect-sizes by incorporating measurement error. 29,30 For example, assuming alpha = 0.05 and a population difference of δ = 0.75 between groups, most of the connectivity/thickness/volume measures in the present study would have ≥80% statistical power to detect a difference between groups when n/group ≈ 30 (see Figure 1). If the primary outcome measure in a study were M1-Thalamus AD, then the observed ICC(2,1) was 0.92; the number of participants per group required to achieve 80% statistical power increases accordingly, because the error in this measure effectively reduces the effect-size from an idealized δ = 0.75 to d = 0.72. The ICC(2,1) reflects the average ratio of true variance to total variance captured by any single measurement.
That is, if the ratio of the variance in "true" scores (T) to observed scores (X) is r²_TX = var(T)/var(X) = 0.25 in the population, ICC(2,1) will approximate 0.25 (large samples) regardless of the number of observations made. As such, ICC(2,1) reflects the average amount of variance in true scores that is captured by a single measurement. Presenting ICC(2,1) as a complement to ICC(2,k) is important because ICC(2,k) is sensitive to the number of measurements, whereas ICC(2,1) is not and, unless experimenters are making a fixed number of repeated assessments, ICC(2,1) is most relevant to trial designs with a single pre- and posttest assessment.
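The dependence of ICC(2,k) on the number of measurements can be made explicit with the Spearman-Brown relation, which links the single-measure and average-measure coefficients exactly for the two-way random-effects formulas. The sketch below is illustrative only; the function name is hypothetical, and the input of 0.68 is one of the single-measure values reported in this article.

```python
def average_measure_icc(icc_single, k):
    """Spearman-Brown step-up: reliability of the mean of k measurements,
    given the reliability of a single measurement."""
    return k * icc_single / (1 + (k - 1) * icc_single)


if __name__ == "__main__":
    # A single-measure ICC of 0.68 averaged over 1 to 5 timepoints.
    for k in range(1, 6):
        print(k, round(average_measure_icc(0.68, k), 3))
```

With k = 3, a single-measure value of 0.68 steps up to roughly 0.86, which shows how a measure can meet the ≥0.80 cutoff on ICC(2,k) while being only moderately reliable at any one timepoint.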
Alternatively, if the primary outcome measure was PMC-Thalamus RD, then the observed ICC(2,1) was 0.68. This would mean that 43 participants are required per group, because the measurement error reduces the idealized δ = 0.75 to d = 0.62. Thus, the n/group required to achieve statistical power varies markedly within these connectivity measures, because the reliability of a measure at a given timepoint (ICC(2,1)) varied so substantially. Naturally, as the effect-size increases (eg, from δ = 0.5 to 1.0) the n/group required to achieve 80% power decreases. However, these data show that despite generally good reliability across these neuroimaging measures, the differences in reliability still have negative consequences for statistical power depending on the outcome.
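The arithmetic behind these examples can be sketched as follows. The attenuation step, d = δ × √ICC(2,1), reproduces the effect-sizes quoted above (0.72 and 0.62). The sample-size step uses a normal approximation for a two-sided, two-group comparison, which is an assumption: the article does not state how its power values were computed, and an exact t-based calculation gives slightly larger group sizes. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def attenuated_effect_size(delta, reliability):
    """Observed standardized effect after measurement error.

    reliability is var(T)/var(X), approximated here by ICC(2,1).
    """
    return delta * math.sqrt(reliability)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison,
    using the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


if __name__ == "__main__":
    delta = 0.75
    for label, icc in [("M1-Thalamus AD", 0.92), ("PMC-Thalamus RD", 0.68)]:
        d = attenuated_effect_size(delta, icc)
        print(label, round(d, 2), n_per_group(d))
```

Comparing the output for the two measures shows the practical cost of lower single-timepoint reliability: the same idealized δ requires a noticeably larger group when the outcome is measured less reliably.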
We also examined the amount of change in our measures over 2 years. Cortical volume measures tended to show reliable decreases over time, consistent with previous research for healthy adults in this age range. 31,32 For example, BA1 volume decreased by 3%, and BA4a/BA6 volume decreased by 2% on average. While these effect-sizes are relatively small, the high reliability (ie, low within-subject variance) provided adequate statistical power to detect change. Cortical thickness, however, did not show statistically significant changes over time, suggesting that the magnitude of change may be smaller than that of cortical volume. It is also likely that FreeSurfer measurements of cortical thickness are less reliable, given that small errors in segmentation can significantly inflate thickness values for particular regions where volumetric measures are more robust to this type of error. This could potentially impact detection of the subtle changes in thickness that may occur over a 2-year period in healthy controls.
Our data were collected on two different types of MRI scanners and despite good within-participant reliability, there was an effect of scanner type for most measures. There appeared to be a consistent trend for sites using Siemens scanners to produce larger values for connectivity, cortical thickness, and cortical volume measures than sites using Philips scanners. For volumetric and cortical thickness measures, differences between Philips and Siemens scanners are unsurprising, as FreeSurfer was developed primarily on Siemens and GE scanners with the application on Philips data tested later in development. 33 It is also important to note that this study was not designed to test differences between scanners or study sites, so these differences must be interpreted with caution. Different participants were measured at the different study sites, so these between-site differences are not measures of "interrater" reliability.
That said, we believe that the between-scanner differences do reflect real differences between scanners and not merely sampling variability, given that there is demographic and anthropometric similarity between participants across study sites and that within-scanner differences were not statistically significant (significant within-scanner differences would be more suggestive of sampling variability alone). Potvin et al showed that scanner type is responsible for up to 2.8% of the variance (right caudate), with most regions showing variability as low as 0.1%. 34,35 This is a relatively small proportion of the variance, particularly when compared with age, sex, and total intracranial volume. In the current study, we have shown a clear effect of scanner type and, qualitatively, our results tend to agree with those of Duchesne et al, with higher volumes being reported for Siemens scanners compared with Philips scanners. 36 It is difficult to compare our results quantitatively, given that Potvin et al's study scanned one individual, whereas we have larger samples but not multiple scans per person on different scanners. Taken together, however, these results further suggest our scanner differences are real and not due to different samples at different sites.
These potential differences are especially relevant when comparing results between studies or planning multisite trials, as researchers need to account for between-site differences when pooling data sources (eg, using multilevel modeling procedures 37 ) or when contrasting data from different sources.
Finally, we investigated differences between sites using the same scanner. Between-site differences were generally less pronounced than between-scanner differences. For example, there were no differences between the two sites using Philips scanners, although there were still some notable differences in diffusivity measures between sites using Siemens scanners. Again, we must be cautious when interpreting these findings, given that the design was not balanced (ie, sites were nested within scanner types and ideally we would have all participants at each site scanned with each scanner, fully crossing the effects of site and scanner). In addition, these were relatively underpowered tests compared to those of between-scanner differences. It is clear, therefore, that despite phantom scanning, rigorous multiscanner quality control and standardization of training, there were still significant differences between scanners (for many measures) and between sites for a particular scanner (mostly for diffusivity measures). This reinforces the importance of thorough training and streamlining of scanning protocols and using analyses that can account for between-site differences in large multicenter studies and clinical trials using MRI scanning as endpoint measures.
It is important to note that these between-site (or between-scanner) comparisons are not measures of "interrater" reliability, because different participants were measured at the different study sites. However, these measures are still informative because it is important to understand systematic between-participant and within-participant variability in our data.
Quantifying sources of between- and within-subject variability in older adults over a long timescale has important implications for clinical neuroscience. Large observational studies, for instance in Huntington's disease or Friedreich's ataxia, suggest that longitudinal studies with a long timescale (eg, over a number of years) are important in understanding disease-related brain alterations. 4,38 Differences in data collection techniques, equipment, and procedures could introduce variation/noise over real biological changes that occur over time. For our healthy cohort, structural measures of thickness and volume were generally robust over time, although impacted by scanner type, and are therefore also potentially suitable for use in a clinical trial as an exploratory endpoint. Diffusion metrics do not have the same level of robustness, as they seem to be more affected by scanner type and by intersite variability. Therefore, when embarking on a longitudinal study, it is crucial to have some knowledge of the potential variability that may be introduced when investigating brain structure or connectivity.
Figure 1. The number of participants per group required to achieve 80% statistical power as a function of a hypothetical underlying effect-size (δ) and the reliability of the measurement. Reliability is expressed as the ratio of true score variance (T) to observed score variance (X), r²_TX = var(T)/var(X), which is approximated by the ICC(2,1).
Table footnote: Denotes a difference that remains statistically significant following a Bonferroni correction for multiple comparisons, c = 14. Tests of site differences within a scanner (Leiden = Lei; London = Lo; Paris = P; Vancouver = V) were based on Fisher's LSD, collapsing across the three different timepoints.