Mitigating site effects in covariance for machine learning in neuroimaging data

Abstract To acquire larger samples for answering complex questions in neuroscience, researchers have increasingly turned to multi‐site neuroimaging studies. However, these studies are hindered by differences in images acquired across multiple sites. These effects have been shown to bias comparison between sites, mask biologically meaningful associations, and even introduce spurious associations. To address this, the field has focused on harmonizing data by removing site‐related effects in the mean and variance of measurements. Contemporaneously with the increase in popularity of multi‐center imaging, the use of machine learning (ML) in neuroimaging has also become commonplace. These approaches have been shown to provide improved sensitivity, specificity, and power due to their modeling the joint relationship across measurements in the brain. In this work, we demonstrate that methods for removing site effects in mean and variance may not be sufficient for ML. This stems from the fact that such methods fail to address how correlations between measurements can vary across sites. Data from the Alzheimer's Disease Neuroimaging Initiative is used to show that considerable differences in covariance exist across sites and that popular harmonization techniques do not address this issue. We then propose a novel harmonization method called Correcting Covariance Batch Effects (CovBat) that removes site effects in mean, variance, and covariance. We apply CovBat and show that within‐site correlation matrices are successfully harmonized. Furthermore, we find that ML methods are unable to distinguish scanner manufacturer after our proposed harmonization is applied, and that the CovBat‐harmonized data retain accurate prediction of disease group.


| INTRODUCTION
The need for larger samples in human subjects research have led to a growing number of multi-site studies that aggregate data across multiple locations. This trend is especially prevalent in neuroimaging research where the reliability and generalizabilty of findings from the conventional single-site studies are often limited by the ability to recruit and study sufficiently large and representative samples from the population. Many consortia have been formed to address such issues (Mueller et al., 2005;Sudlow et al., 2015;Trivedi et al., 2016;Van Essen et al., 2013). The larger samples obtained through these efforts promote greater power to detect significant associations as well as better generalizability of results. However, these study designs also introduce heterogeneity in acquisition and processing that, if not appropriately addressed, may impact study findings.
Several researchers have determined that variability driven by site differences, often called site effects, reduce the reliability of derived measurements, and can introduce bias. Neuroimaging measurements have been repeatedly shown to be affected by scanner manufacturer, model, magnetic field strength, head coil, voxel size, and acquisition parameters (Han et al., 2006;Kruggel, Jessica, Tugan Muftuler, & Initiative, 2010;Reig et al., 2009;Wonderlick et al., 2009). Yet even in scanners of the exact same model and manufacturer, differences still exist for certain neuroimaging biomarkers (Takao, Hayashi, & Ohtomo, 2011).
Until recently, neuroimaging analyses primarily involved mass univariate testing which treats features as independent and does not leverage covariance between features. Under this paradigm, the impact of site effects is through changes in the mean and variance of measurements. Increasingly, researchers have used sets of neuroimaging features as inputs into prediction algorithms or state-of-the-art machine learning (ML) methods. This approach has become a powerful tool for leveraging both functional and structural neuroimaging for research into pain (Smith, L opez-Solà, McMahon, Pedler, & Sterling, 2017;Wager et al., 2013), neural representations (Haxby, Connolly, & Swaroop Guntupalli, 2014), and psychiatric illnesses (Koutsouleris et al., 2014). One of the major benefits of ML is that it leverages the joint distribution and correlation structure among multivariate brain features in order to better characterize a phenotype of interest (Gregorutti, Michel, & Saint-Pierre, 2017;O'Toole et al., 2007). As a result, site effects on the covariance of measurements are likely to impact findings substantially. In fact, a recent investigation showed that an ML algorithm was able to detect scanner with high accuracy and that the detection of sex depended heavily on the scanners included in the training and test data (Glocker, Robinson, Castro, Dou, & Konukoglu, 2019).
Recently, another stream of data-driven harmonization methods have aimed to apply generative adversarial networks (GANs) or distance-based methods to unify distributions of measurements across sites. However, the GAN-based harmonization methods have only been tested for harmonization of images and lack options to retain clinical associations of interest (Gao, Liu, Wang, Shi, & Jinhua, 2019;Nguyen, Morris, Harris, Korgoankar, & Ramos, 2018;Zhong et al., 2020). A recent distance-based method is applicable to derived measurements and has been tested in classification of Alzheimer's disease (AD) using support vector machines (Zhou et al., 2018). However, the method in (Zhou et al., 2018) has not been tested for detection of site via ML and also requires several conditions which may not hold in studies with sufficiently heterogeneous sites or major differences in subject demographics across sites.
In this article, we examine whether site effects influence ML results. In particular, we study the cortical thickness measurements derived from images acquired by the Alzheimer's Disease Neuroimaging Initiative (ADNI) and demonstrate the existence of site effects in covariance of structural imaging measures. We then propose a novel harmonization method called Correcting Covariance Batch Effects (CovBat) that removes site effects in mean, variance, and covariance.
We apply CovBat and show that within-site correlation matrices are successfully harmonized. Furthermore, we find that ML methods are unable to detect Siemens scanners after our proposed harmonization is applied, and that the CovBat-harmonized data retain accurate prediction of disease group. We also assess the performance of the proposed method in simulated data, and again find that the method mitigates site effects and maintains detection of meaningful associations. Our results demonstrate the need to consider covariance in harmonization methods, and suggest a novel procedure that can be applied to better harmonize data from multi-site imaging studies.

| ADNI dataset
The data for this article consist of baseline scans from ADNI (http:// adni.loni.usc.edu/ which are processed using the ANTs longitudinal single-subject template pipeline (Tustison et al., 2019) with code available on GitHub (https://github.com/ntustison/CrossLong). Informed consent was obtained by all participants in the ADNI study. Institutional review boards approved the study at all of the contributing institutions.
We briefly summarize the steps involved. First, we download raw T1-weighted images from the ADNI-1 database, which were acquired using MPRAGE for Siemens and Philips scanners and a works-inprogress version of MPRAGE on GE scanners (Jack et al., 2010). We choose ADNI-1 to highlight a severe site effect situation driven by greater variability in scanner properties, including magnetic field strength, which are standardized in later ADNI releases. For each subject, we estimate a single-subject template from all the image timepoints. After rigid spatial normalization to this single-subject template, each normalized timepoint image is then processed using the single image cortical thickness pipeline consisting of brain extraction (B. Avants, Klein, Tustison, Woo, & Gee, 2010), denoising (Manj on, Coupé, Luis, Louis Collins, & Robles, 2010), N4 bias correction (N. J. Tustison et al., 2010), Atropos n-tissue segmentation (B. B. Avants, Tustison, Wu, Cook, & Gee, 2011), and registrationbased cortical thickness estimation (Das, Avants, Grossman, & Gee, 2009). For our analyses, we use the 62 baseline cortical thickness values as defined by the Desikan-Killiany-Tourville atlas (Klein & Tourville, 2012). The sample covariance matrix for these cortical thicknesses in the largest site is shown with labels in Figure S1.
We define site based on information contained within the Digital Imaging and Communications in Medicine (DICOM) headers for each scan. Specifically, subjects are considered to be acquired at the same site if they share the scanner location, scanner manufacturer, scanner model, head coil, and magnetic field strength. In total, this definition yields 142 distinct sites of which 78 had less than three subjects and were removed from analyses. The final sample consists of 505 subjects across 64 sites, with 213 subjects imaged on scanners manufactured by Siemens, 70 by Philips, and 222 by GE. The sample has a mean age of 75.3 (SD 6.70) and is comprised of 278 (55%) males, 115 (22.8%) AD patients, 239 (47.3%) late mild cognitive impairment (LMCI), and 151 (29.9%) cognitively normal (CN) individuals.
The ADNI sample demographics are considerably different across sites, which precludes application of certain harmonization methods.
For example, (Zhou et al., 2018) relies on "nontrivial overlap" of the potential confounders across sites and proposed a subsampling approach that performs distributional shifts on subsamples of data matched by the discrete stratum of the confounders. Given that our data are sufficiently heterogeneous, it is challenging to form bins matched by age, sex, and diagnosis status to ensure that each site has at least one individual in each bin. This prevents protection of age effects in applying the harmonization method proposed by (Zhou et al., 2018).

| Combatting batch effects
We first review ComBat (Fortin et al., 2017(Fortin et al., , 2018Johnson et al., 2007) for harmonization of neuroimaging measures. ComBat seeks to remove the mean and variance site effects of the data in an empirical Bayes framework. Let y ij ¼ y ij1 , y ij2 , …, y ijp À Á T , i ¼ 1,2,…, M, j ¼ 1,2,…, n i denote the p Â 1 vectors of observed data where i indexes site, j indexes subjects within sites, n i is the number of subjects acquired on site i, and p is the number of features. Our goal is to harmonize these vectors across the M sites. ComBat assumes that the features indexed by v follow where α v is the intercept, x ij is the vector of covariates, β v is the vector of regression coefficients, γ iv is the mean site effect, and δ iv is the variance site effect. The errors e ijv are assumed to independently fol- ComBat first finds least-squares estimatesα v and β v for each feature. To estimate the site effects using empirical Bayes, ComBat assumes that the γ iv follow independent normal distributions and the δ iv follow independent inverse gamma distributions. The hyperparameters are then estimated via method of moments using data across all features. The empirical Bayes point estimates γ Ã iv and δ Ã iv are then obtained as the means of the posterior distributions. Finally, ComBat residualizes with respect to these estimates to obtain harmonized data

| Correcting covariance batch effects
To address potential covariance site effects, we build on the existing ComBat framework. We again assume that the features follow However, the error vectors e ij ¼ e ij1 , e ij2 , …, e ijp À Á T $ N 0, Σ i ð Þ may be spatially correlated and differ in covariance across site. Analogous to how ComBat modifies observations to bring each within-site variance to the pooled variance across sites, our proposed method modifies principal component (PC) scores to shift each within-site covariance to the pooled covariance structure. We achieve this by approximating within-site covariance structures using the PCs and PC scores obtained across all observations. We propose the CovBat algorithm, which accounts for the joint distribution of ComBat-adjusted observations as follows: Step 1. We first perform ComBat to remove the mean and vari- where p is the number of features. We then define these residuals using notation from Section 2.2 as M is the number of sites, and n i is the number of subjects acquired at site i.α v , x T ij ,β v , γ Ã iv , and δ Ã iv are defined in Section 2.2.
Step 2. The e ComBat ij are assumed to have mean 0; their covariance matrices which we denote by Σ i , however, may differ across sites. We first perform principal components analysis (PCA) on the full data residuals and represent the full data covariance matrix as Σ ¼ P q k¼1 λ k ϕ k ϕ T k where the rank q ¼ min P M i¼1 n i , p , λ k are the eigenvalues of Σ, and ϕ k are the PCs obtained as the eigenvectors of Σ. In practice, PCA is performed on the sample covariance matrixΣ and we obtain estimated eigenvaluesλ k and eigenvectorsφ k . The ComBatadjusted residuals can then be expressed as e ComBat ij ¼ P q k¼1 ξ ijkφk where ξ ijk are the principal component scores.
We then aim to bring each within-site covariance matrix Σ i to the pooled covariance across sites. Since our goal is to recover covariance structures resembling Σ, we approximate the within-site covariance matrices asΣ i ¼ P q k¼1λ ikφkφ T k whereλ ik are within-site eigenvalues estimated as the sample variance of the principal component scoreŝ Þandφ k are estimated from the full data covariance. This model assumes that the covariance site effect is contained within the variances of the PC scores with the principal components estimated from the full data. This assumption may not hold in some cases, but harmonization of these PC score variances will bring the within-site covariance matrices closer to the pooled covariance. This is analogous to how ComBat brings sitespecific variance closer to the variance estimated using observations across all sites.
Step 3. Thus, we posit: , τ k is the error standard deviation, and μ ik , ρ ik are the center and scale parameters corresponding to principal compo- Step 4. We obtain CovBat-adjusted residuals We then add the intercepts and covariates effects estimated in Step 1 to obtain CovBat-adjusted observations , n i be vectors of length p representing simulated cortical thickness values for three sites, each with n i observations. The y ij are generated using the following model: where x ij is a simulated diagnosis variable drawn from a Bernoulli distribution with probability 0.25, α is the first p=2 elements in each hemisphere from the sample mean vector of Scanner B observations in the ADNI data, β is the vector of simulated diagnosis effects on the mean, and e ij is the vector of error terms. We simulate mean and variance site effects based on the assumptions of ComBat and CovBat.
The mean site effects γ i ¼ γ i1 , γ i2 ,…, γ ip À Á T are vectors drawn from independent and identically distributed (i.i.d.) normal distributions with mean zero and standard deviation 0:1. The variance site effects

| Simple covariance effects
We first assess whether CovBat can recover the underlying covariance structure when the covariance site effects are captured by its PC directions.
We refer to this simulation setting as the Simple Covariance Effects simulation. For p ¼ 62, we set the underlying covariance S as the sample correlation matrix of cortical thickness observations in the ADNI data, We then generate error terms e ij that contain site-specific shifts in the first eigenvalue via where c i controls the severity of the covariance shift. For our simulations, we set c 1 ¼ À1=2, c 2 ¼ 0, and c 3 ¼ 1=2 so that the pooled covariance structure is equivalent to S. We choose β i ¼ À0:5 for p=4 b c regions of interest in both the left and right hemispheres to associate the simulated diagnosis with decreases in mean simulated cortical thicknesses.
We also investigate how the rank of the covariance effect influences CovBat harmonization results. We modify the rank of the simple covariance effect by varying K in the generation of error terms where c i takes the same values as previously, c 1 ¼ À1=2, c 2 ¼ 0, and c 3 ¼ 1=2. We simulate datasets while choosing K as 2, 6, and 12 PCs and evaluate how harmonization influences detection of site via ML.

| Complex covariance effects
To evaluate how CovBat performs when the covariance site effects are not easily captured by the principal components and when the simulated diagnosis may affect covariance, we modify the simulation to incorporate high-rank covariance shifts due to site via Ω i and due to simulated diagnosis as Ψ . For p ¼ 62, the error terms e ij are now generated via is the sample correlation matrix of cortical thickness observations in the ADNI data, Ψ is a chosen diagnosisdriven covariance shift matrix, and Ω i are site-specific covariance shift In the Diagnosis Affects Covariance simulation, we assume that the simulated diagnosis affects not only mean, but also covariance.
We choose the diagnosis effect on covariance to be proportional to a site's covariance shift. This scenario represents a situation where detection of the diagnosis using ML could be highly influenced by the presence of site effects. We use the same β value as in the ComBat and diagnosis affects mean simulations but choose Ψ to be related to Ω 3 to force confounding of Site 3 and diagnosis effects on covariance.
In the Covariance Only simulation, we assume that both site and the diagnosis influence the covariance, not the mean or variance.
We fix γ ¼ 0 and δ ¼ 1 to remove site effects in mean and variance while also using the same Ω i as in the Diagnosis Affects Mean and Diagnosis Affects Covariance simulations. Furthermore, we modify the diagnosis effect by setting β ¼ 0 while keeping Ψ ¼ À 3=4 ð ÞΩ 3 for the diagnosis effect in covariance.

| Simulation experimental design
In our simple covariance effects simulation with rank one covariance effects PCs that explain 95% of the variation, we also perform MANOVA for testing associations with site and simulated diagnosis and report the rejection rate at a type I error rate of 0.05 across the 1,000 datasets.

| Recovery of covariance
We first perform simulations to assess whether either harmonization method can recover the underlying covariance structure in our simple simulation setting and harmonize covariance matrices generally. For ComBat and CovBat, we include our simulated diagnosis status as a covariate. We apply our CovBat method using the number of PCs that explain 95% of the variation. In the Simple Covariance Effects simulation, Figure 1 shows that CovBat outperforms ComBat in recovery of the true covariance structure (denoted S in Section 2.5) across all parameters considered. Remaining deviation from the true covariance can be explained by error in covariance estimation; even with 10,000 samples per site and 62 features the distance of the pooled covariance estimate from the true covariance is still 3.14. In Table 1

| Performance across sample properties
We then conduct additional analyses to assess the robustness of Cov-Bat to reductions in sample size per site and number of features. In  (Stevens, 1980).

| Choosing the number of PCs in CovBat
To better inform the choice of PCs in CovBat, we evaluate the simulation results obtained across varying number of PCs. For simple covariance effects with varying low rank structures, Figure S2 shows that harmonizing smaller numbers of PCs yields suboptimal results for reducing the chance of detecting sites. Across varying ranks of simulated site effects contained in the covariance structures (2-12 PCs), simulation results show that the median AUC for detecting site achieves the lowest value when CovBat includes PCs that explain around 90% of total variation for a sample size of 100.
With a larger sample size of 250, the best performance was achieved when CovBat includes PCs that explain 95% of the total variation. Across both sample sizes, the U-shape curves in median AUC for detecting sites indicate that including excess numbers of PCs in CovBat hurts the performance due to overfitting and the optimal number of PCs increases with sample size. For the high rank covariance effect in our Diagnosis Affects Covariance, we observe in Figure S3 that for a sample size of 250 increasing the number of PCs lowers the AUC for detection of site across the whole range of considered PCs. However, looking at a sample size of 50, we observe that AUC for site detection increases as we select PCs that explain between roughly 95% and 100% of the variation. For detection of simulated diagnosis, Figure S4 shows that CovBat largely

| Classical multivariate analyses
We also evaluate the harmonization methods using multivariate analysis of variance (MANOVA). MANOVA tests for differences in mean across groups in multivariate data, but is known to be sensitive to differences in covariance across groups. We perform MANOVA across scanner manufacturer, sex, and diagnosis status using Pillai's trace, which is known to be more robust to inhomogeneity in covariance than other alternative test statistics (Olson, 1974). We report p-values for these associations before and after harmonization.

| CovBat reduces covariance site effects
We first examine the empirical covariance of the ADNI data before and after harmonization. To evaluate site differences, we investigate the three largest ADNI sites. Site A consists of 23 subjects acquired on a Siemens Symphony 1.5T scanner while Sites B and C each consist of 20 subjects acquired on GE Signa Excite 1.5T scanners. See Table 3 for demographic details. To avoid influence of site demographics on the covariance matrices, we residualize the cortical thickness measures across the three sites jointly on age, sex, and diagnosis status in each dataset using a linear model. Figure 4 shows the covariance matrices for each site using the residualized cortical measures both before and after harmonization (ROI labels are shown in Figure S1). The differences between the unharmonized covariance matrices are striking. We also provide quantitative comparisons for pairwise distances across sites before and after harmonization in Table 4. A tuning parameter of the CovBat model is the desired proportion of variance explained in the dimension reduction space, which we select at 95% (37 PCs). To ensure that our results do not depend strongly on the choice of tuning parameter, we also report the minimum and maximum of the pairwise Frobenius norms after applying CovBat with percent variation explained ranging from 44% (2 PCs) to 100% (62 PCs).
We report the results of this sensitivity analysis in parentheses. We find that ComBat adjustment can modestly harmonize the covariance matrices but CovBat adjustment shows large reductions in the between-site distances across a range of tuning parameter choices.

| CovBat impairs detection of site
To evaluate the potential impact of site effects in covariance using ML, we conduct a Monte-Carlo split-sample experiment for prediction of scanner manufacturer labels using all 213 ADNI subjects before and after harmonization with existing methods. We train using data harmonized with the state-of-the-art ComBat method and our proposed method, CovBat. Figure 5a shows that Siemens sites are easily identifiable based on unharmonized cortical thickness measurements (median area-under-the-curve [AUC] 0.89, IQR 0.87-0.90), which is consistent with recent findings (Glocker et al., 2019). We also note that scanner manufacturer is still detected after ComBat is applied (0.66, 0.64-0.68).
After CovBat, the ML method's performance for differentiating between sites is close to chance (0.46, 0.44-0.48). CovBat's performance depends on the number of PCs included in the model, but Figure S5 shows that the performance gain for each PC becomes negligible around the number of PCs that explain 95% of the variation.
DeLong's test results shown in Figure S2 suggest that these AUC values for site detection are significantly different between ComBat and Cov-Bat. Using MANOVA, Table 5 shows that the association with scanner manufacturer is statistically significant in unharmonized and ComBatadjusted data but is eliminated in CovBat-adjusted data.

| CovBat retains biological associations
It is well-known that cortical thickness differs substantially by sex and AD status (Lerch et al., 2005;Sowell et al., 2007). To assess whether CovBat maintains biological associations of interest, we perform two ML experiments using the full ADNI data to classify healthy versus Note: Differences in covariance structure between sites are reported as the Frobenius distance between covariance matrices calculated across observations acquired on each site. Results from adjusting the number of PC scores ranging from those explaining 44-100% of variation are shown in parentheses as the minimum and maximum pairwise Frobenius norms across the range.
F I G U R E 5 Multivariate pattern analysis experiments for detection of scanner manufacturer, sex, and Alzheimer's disease status using cortical thickness data. The data are randomly split into 50% training and 50% validation then used to train a random forests algorithm to predict a specified trait. AUC values from 100 repetitions of this analysis are reported for unharmonized, ComBat-adjusted, and CovBat-adjusted data. These findings suggest that CovBat not only provides thorough removal of site effects, but also maintains clinical associations. Figure S4 shows that CovBat retains these associations across varying number of PCs included in the model. Appendix A1 shows that similar results hold for prediction of age, where both ComBat and CovBat reduce root-mean-square error for prediction of age compared to the unharmonized data. Appendix A2 shows that these results for both detection of site and biological associations largely hold even when CovBat is trained on a subset of the data and all sites are included in both the training and validation sets. MANOVA results in Table 5 show that the significant associations with diagnosis and sex are retained after either harmonization method.

| DISCUSSION
The growing number of multi-site studies across diverse fields has spurred the development of harmonization methods that are general, but also account for field-specific challenges. In neuroimaging research, the rise of ML in neuroimaging has established an unmet need for harmonization of covariance. We demonstrate that strong site effects in covariance exist, influence downstream ML experiments, and remain after performing the state-of-the-art harmonization. We then propose a novel method and demonstrate that it is effective in removing site differences in covariance and retaining the detection of biological associations via ML. Simulation studies show similar ML results and demonstrate that CovBat performs well across a variety of settings and sample sizes.
In ADNI data, we show that substantial differences exist in the covariance structures of cortical thickness observations and can be mitigated through our proposed method. We furthermore show that ML can detect these site effects, whether through ML or conventional multivariate analyses. These results mirror recent studies that predict scanner from neuroimaging features with high accuracy (Glocker et al., 2019) and a recent study demonstrating that ComBat is insufficient to prevent detection of Siemens-manufactured scanners in a large multi-site dataset (Nielson et al., 2018 (Stevens, 1980).
Our proposed method harmonizes covariance across sites by removing mean and variance shifts in the principal components space, which we show to be effective in addressing the covariance effects we observe. This idea resembles spectral models, which also relate covariates to the eigendecomposition of covariance matrices (Boik, 2002). Our method assumes that the ideal covariance structure exists in the eigenspace of the full data covariance matrix. As we show through our simulations, in some cases this model may be insufficient to remove site effects, which do not resemble the covariance structure of the full data. Potential extensions could incorporate methods that model site effects as separate low-rank structures (Hoff & Niu, 2012) or identify projections most related to site (Zhao, Wang, Mostofsky, Caffo, & Luo, 2019 covariate effects (Pomponio et al., 2020) Imaging at the University of Southern California.

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the Alzheimer's Disease Neuroimaging Initiative. Restrictions apply to the availability of these data, which were used under license for this study. Data are available at adni.loni.usc.edu with the permission of ADNI.

CovBat preserves prediction of age
ComBat has previously been shown to preserve age prediction via linear regression and support vector regression (Fortin et al., 2018).
To evaluate whether this result holds in ADNI data for ComBat or CovBat, we propose an additional ML experiment. We (i) randomly split the subjects into 50% training set and 50% validation set, (ii) train a random forests algorithm to predict age, and (iii) assess predictive performance on the validation set via root-mean-squareerror (RMSE). We perform steps (i)-(iii) 100 times each for unharmonized, ComBat-adjusted, and CovBat-adjusted data. Figure A1 shows that the mean RMSE for age prediction decreases from 6.05 (AE0:22) in unharmonized to 5.81 (AE0:23) in ComBatadjusted data to 5.75 (AE0:21) in CovBat-adjusted data. These findings are consistent with previous work (Fortin et al., 2018) and show that CovBat provides similar recovery of the association between cortical thickness and age.

Parameter estimation using training subset
Both ComBat and CovBat estimate and residualize out the covariate effects using the full data; however, there are cases were only a subset of the data is available when performing harmonization. For instance, if a group of subjects has already been acquired, prediction on subjects subsequently acquired on the same sites could only leverage data from the original sample. In this scenario, the new sample can be harmonized using ComBat or CovBat by estimating the covariate effect using the original sample, then proceeding with subsequent steps as usual.
We evaluate this modification by repeating our main ML analyses using ADNI data with different subsampling of the patients. Specifically, we replace step (i) in our ML experiments by instead splitting the sample into 270 training subjects and 235 testing subjects such that both the train and test sets contain at least one subject acquired at each site. We then apply ComBat and CovBat by estimating the covariate effects using only the training subjects. Figure  F I G U R E A 1 Multivariate pattern analysis experiments for detection of age using cortical thickness data. The data is randomly split into 50% training and 50% validation then used to train a random forests algorithm to predict age. RMSE values from 100 repetitions of this analysis are reported for unharmonized, ComBat-adjusted, and CovBat-adjusted data F I G U R E A 2 ML experiment results for harmonization using only training data. The data is randomly split into 270 training subjects and 235 testing subjects such that every site is represented in each group. The training set is then used to train a random forests algorithm to predict Siemens scanners or patient characteristics. a shows the AUC values for detection of Siemens. AUC values for detection of AD are displayed in b and detection of male in (c). RMSE values for prediction of age are displayed in (d)