Mega‐analysis methods in ENIGMA: The experience of the generalized anxiety disorder working group

Abstract The ENIGMA group on Generalized Anxiety Disorder (ENIGMA‐Anxiety/GAD) is part of a broader effort to investigate anxiety disorders using imaging and genetic data across multiple sites worldwide. The group is actively conducting a mega‐analysis of a large number of structural brain scans. In this process, the group was confronted with many methodological challenges related to study planning and implementation, between‐country transfer of subject‐level data, quality control of a considerable amount of imaging data, and choices related to statistical methods and efficient use of resources. This report summarizes the background information and rationale for the various methodological decisions, as well as the approach taken to implement them. The goal is to document the approach and help guide other research groups working with large brain imaging data sets as they develop their own analytic pipelines for mega‐analyses.


| INTRODUCTION
The ENIGMA (Enhancing NeuroImaging Genetics through Meta Analysis) Consortium was started in 2009 with the aim of performing large-scale neuroimaging genetics research using meta-analytic methods that pool data from around the world. ENIGMA has since expanded to include many working groups, resources, and expertise to answer fundamental questions in neuroscience, psychiatry, neurology, and genetics (Thompson et al., 2020). One of these groups is the ENIGMA-Anxiety working group, created in 2016 (Bas-Hoogendam et al., 2020) and focused on anxiety-related disorders. These disorders, which include social anxiety disorder, specific phobia, panic disorder, generalized anxiety disorder (GAD), and agoraphobia, share substantial phenomenological features and are often comorbid.
Within the ENIGMA-Anxiety working group, a subgroup devoted to the study of GAD was formed, the ENIGMA-Anxiety/GAD "subgroup," which for simplicity is referred to here as "ENIGMA-GAD." Because the ENIGMA-Anxiety working group was formed relatively recently, it has benefited from the experience and work of earlier groups, particularly in terms of collaborative methods. In recent years, research groups have become increasingly favorable toward sharing and transferring de-identified individual participant data (IPD), often as part of cooperative agreements that respect country-level differences in data privacy and data protection procedures, discussed below. In the case of ENIGMA-GAD, as detailed in the final section of this article, the vast majority of sites contributed raw, T1-weighted magnetic resonance imaging (MRI) scans, as opposed to processed scans or results of subsequent analyses. These raw data could then be processed centrally using image processing software, in this case FreeSurfer (Fischl et al., 2002). Having access to raw IPD provided unique opportunities to review methods for handling and harmonizing such data, defining processing pipelines, and implementing analytic strategies. Crucially, this led ENIGMA-GAD to prioritize a mega-analysis approach, which consists of analyzing IPD from all sites in a single stage. This contrasts with two-stage approaches, in which each site generates processed data in an initial step and site-specific results are analyzed in a second step (detailed below). This paper presents some of the challenges posed by the ENIGMA-GAD group's decision to use a mega-analysis, and discusses the rationale for the choices that were made to establish the analysis plan. The discussion is broadly applicable to mega-analyses in the context of ENIGMA and other international neuroimaging efforts. Below, differences between meta-analytic and mega-analytic approaches, the benefits of preregistration, and issues concerning data sharing and data reuse are discussed. Methods for quality control and choices with respect to measurements and statistical analyses are also presented. Finally, the specific choices made by the ENIGMA-GAD group with respect to each of these issues are described.

| META-ANALYSIS VERSUS MEGA-ANALYSIS
As collaborative and coordinated endeavors, ENIGMA meta-analytic studies operate differently from literature-based meta-analyses. In the latter, results from published studies are compiled to draw conclusions on a certain question. In most cases, such pooled studies have been conducted and published over many years, with high sample and methodological heterogeneity, encompassing diverse statistical approaches. Such diversity is aggravated in meta-analyses that examine neuroimaging studies, in which substantial challenges for combined inference result from the use of statistical maps limited to significant p-values or test statistics, tables with coordinates in relation to some standard (but not always the same) stereotaxic space, and different representations of the brain (volume-based or surface-based; Fox, Lancaster, Laird, & Eickhoff, 2014; Müller et al., 2018; Tahmasian et al., 2019). Moreover, because of publication biases, there can be a misrepresentation of negative results (the "file drawer" problem; Rosenthal, 1979) or biased study selection (Roseman et al., 2011).
In ENIGMA, these issues are minimized through analysis of IPD using an agreed-upon processing strategy. Briefly, three approaches, which relate to data location, are currently used by different projects within ENIGMA working groups: (a) all raw data and all derived IPD remain remote in relation to the coordinating facility; (b) all raw data remain remote in relation to the coordinating facility, but derived data are centralized; (c) all raw data are centralized. These approaches are not mutually exclusive within a working group, and different projects conducted by the same working group may each use a different strategy, depending on the project goals and considerations about data availability, computational resources, and expertise. These approaches are summarized schematically in Figure 1.
For meta-analysis with access to IPD, the strategy includes quality checks and statistical analysis over mostly contemporaneous data. Summary statistics (such as effect sizes, standard errors, and/or confidence intervals) are pooled by a coordinating facility that then uses meta-analytic methods for inference across sites. Such a coordinated, two-stage meta-analysis approach has been pursued by most ENIGMA working groups (Hibar et al., 2015; Hibar et al., 2016; Schmaal et al., 2016; Stein et al., 2012; van Erp et al., 2016), particularly due to privacy concerns regarding genetic data. ENIGMA genome-wide association studies still use a meta-analysis approach (Hibar et al., 2017; Satizabal et al., 2019); sites analyze their own data with an agreed-upon protocol, which avoids the need to transfer individual participant genomic data, and allows distributed analysis of computationally intense approaches.
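For concreteness, the second stage of such an approach typically pools site-level estimates by inverse-variance weighting; a sketch, where θ̂_i and se_i are the estimate and standard error from site i, and the between-site variance τ² is set to zero for a fixed-effect analysis or estimated from the data (e.g., with the DerSimonian-Laird estimator) for a random-effects analysis:

```latex
\hat{\theta} = \frac{\sum_i w_i \,\hat{\theta}_i}{\sum_i w_i},
\qquad
w_i = \frac{1}{\mathrm{se}_i^{2} + \hat{\tau}^{2}},
\qquad
\mathrm{se}\bigl(\hat{\theta}\bigr) = \Bigl(\sum_i w_i\Bigr)^{-1/2}
```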
Other strategies can be considered if the coordinating facility has access to all IPD: a single-stage statistical analysis can be performed by the coordinating facility, while addressing site-related heterogeneity; this would be a one-stage meta-analysis, or simply "mega-analysis." With imaging data, such mega-analyses could start with the raw images being sent to the coordinating facility, where they then undergo batch processing using identical methods and computing environments. Alternatively, mega-analyses could start with image-derived measurements, such as the volumes of brain structures or cortical surface area, already computed and furnished by the participating sites to the coordinating facility for each individual participant; the coordinating facility then proceeds to the statistical analysis. Combinations of approaches for some projects (e.g., some sites sending raw data for processing while others send processed data) are also possible.
Analyses using IPD offer several advantages (Riley, Lambert, & Abo-Zaid, 2010): they improve consistency in inclusion criteria across sites, allow better treatment of confounds and of missing data, permit verification of the assumptions of statistical models, standardize procedures, increase statistical power, and reduce biases by not depending on previous publications of (invariably significant) results. Access to IPD further allows other strategies for investigation that are not limited to hypothesis testing. For example, it may allow classification at the individual participant level using machine-learning methods (Nunes et al., 2018). In a mega-analysis starting with raw imaging data, all data can be processed identically in the same facility, thus minimizing the chance for errors or variability that can arise when each site conducts these aspects of the analysis. One major challenge to this approach is that a mega-analysis requires at least one site to possess the necessary resources and expertise to handle large datasets. Additionally, this approach is only possible when IPD are shared with a central facility. Data exchange at the IPD level is often limited, as data protection is regulated differently among research projects, consortia, and countries. Barriers to data exchange and limitations of available resources can, in effect, restrict participation to a few well-equipped centers.
However, if individual sites use identical processing strategies with IPD, a random-effects two-stage approach leads to the same estimates as a (one-stage) mega-analysis. This is well established in the neuroimaging literature, which uses similar statistical methods for multi-level inference in the analysis of functional magnetic resonance imaging data (Beckmann, Jenkinson, & Smith, 2003; Worsley et al., 2002). Such identical processing can rarely be accomplished, though, given the usually large number of sites and the need for all of them to engage in approaches intended to ensure consistency (discussed below).

| ANALYSIS PLAN AND PREREGISTRATION
Preregistration of clinical trials has been emphasized for many years, and a registry was established by law in the United States through the Food and Drug Administration Modernization Act of 1997 (Dickersin & Rennie, 2003). Similar registries exist in other countries, and an international directory, the International Clinical Trials Registry Platform (ICTRP), was created by the World Health Organization (WHO). However, broadly similar efforts did not emerge in other research areas for decades. Defining a hypothesis and an associated analysis plan, and preregistering these ideas before conducting any analyses, is important in many ways (Chambers, 2013). It helps to conceptually separate specific, previously formulated hypotheses from exploratory analyses that have the potential to generate new hypotheses based on the data (Nosek, Ebersole, DeHaven, & Mellor, 2018; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Likewise, it helps to separate a priori and exploratory hypotheses and the analytic plans used for their investigation (Ledgerwood, 2018). The benefits, however, stretch well beyond epistemological advantages, by reducing the potential for questionable research practices (Chambers, Feredoes, Muthukumaraswamy, & Etchells, 2014). For example, preregistration reduces problems that follow when negative results remain unreported (Rosenthal, 1979; Sterling, 1959), reduces the chances of selective reporting (Macleod et al., 2014), and maximizes transparency in analytic approaches, thereby facilitating replication (Simmons, Nelson, & Simonsohn, 2011). Without preregistration, these problems remain prevalent, possibly due to the structure of incentives in academic environments (Neuroskeptic, 2012; Nosek, Spies, & Motyl, 2012). Preregistration also reduces hypothesizing after the results are known (Kerr, 1998), and protects scientists from other biases (Chambers, 2013), such as confirmation bias, hindsight bias, and anchoring effects (Moreau, 2019).

FIGURE 1 Differences between classical, literature-based meta-analyses, conducted without access to individual participant data (IPD) (upper panel), versus approaches used by different ENIGMA working groups, in which researchers collectively have access to IPD (lower panel). The latter encompasses three main approaches: (top) data are processed using common methods at each site, then summary statistics are computed and sent to a coordinating facility, which conducts a meta-analysis; (middle) data are processed using common methods at each site, then sent to the coordinating facility, which conducts a mega-analysis; (bottom) raw data are sent to the coordinating facility, which processes the data in batch and conducts a mega-analysis, while taking site-specific effects into account.
For ENIGMA, specific details and challenges need to be considered when preregistering a study. First, an analytic plan must be discussed with participating centers. The plan should include who will access the data, the roles of each participating site and their personnel, compliance with supervening laws and regulations, funding sources, as well as authorship expectations. This ensures that pooled data from different cohorts are analyzed in a way acceptable to all investigators. Second, many ENIGMA sites may have already analyzed the data they share for meta- or mega-analysis, often to test hypotheses similar to those being considered for the ENIGMA combined analyses.
Obtaining credible results requires an analytic plan free of influences from findings known to the investigators, and that remains inclusive of all relevant data. Preregistration mitigates such concerns by supporting reasonable hypotheses of broad interest and well-defined inclusion and exclusion criteria for subjects, both of which are unlikely to be swayed by prior knowledge of outcomes. These analytic plans are formalized into "project proposals," which can be distributed to members for approval and participation, and are often considered a form of preregistration for working group members.
Many platforms support preregistration; the platform provided by the Open Science Framework stands out for its comprehensiveness and user-friendliness. The process is remarkably simple, with the site offering detailed instructions and preregistration templates. Specifying an embargo period before the registration becomes public is possible, and a digital object identifier (DOI) can be generated.

| DATA SHARING AND REUSE
Both meta- and mega-analyses require that individual sites transfer data to the coordinating facility. Aggregated data, such as histograms of quality metrics, effect sizes, confidence intervals, and standard errors, are not identifiable at the individual level and can be transferred parsimoniously among sites without substantial risk of reidentification. It should be noted, however, that without precautionary measures, repeated computation of aggregate results using slightly varying subsets of participants can expose information about individuals (Dwork, 2006). This risk can be minimized through agreements among researchers on the nature and amount of aggregated data to be transferred. For mega-analyses, in which IPD are transferred, further attention is needed, due to differences across sites in the regulations that protect the confidentiality, integrity, and security of the IPD and their use in human research. In international collaborations such as ENIGMA, accommodating such requirements necessitates that the strictest regulations be followed. While compliance with the law must be integral, three points are particularly relevant for ENIGMA projects: (a) protection of data and privacy of research subjects, (b) data reuse, and (c) international transfers of data.
In the United States (US), research must follow the Federal Policy for the Protection of Human Subjects (the "Common Rule"; Arellano, Dai, Wang, Jiang, & Ohno-Machado, 2018). This requires that specific consent be obtained from participants before their data and/or specimens can be used not only for the research project in which they are enrolling, but also for future research that may use such material, which is often the case for ENIGMA projects. Privacy in the US is governed by the Health Insurance Portability and Accountability Act (HIPAA) of 1996, which requires patient data to be de-identified; reuse requires approval by an Institutional Review Board.
Regulations differ, however, across countries. In the US, there is a presumption that processing personal data is lawful unless it is expressly forbidden. In the European Union (EU), in contrast, the processing of such data is prohibited unless there is a lawful basis that permits it (Dove, 2018). Legal provision for data protection and use in research comes from the General Data Protection Regulation (GDPR), adopted in 2016, which also covers the use of data from EU residents outside the Union (Chassang, 2017). While HIPAA emphasizes subject privacy, the GDPR makes no direct mention of privacy whatsoever, dealing instead with data protection, as established in the EU Charter of Fundamental Rights along with the right to a private life. Privacy is extremely difficult to define (Alfino & Mayes, 2003), and may be understood in this context as a state of nonaccess to data pertaining to an individual (Dove, 2018). Data protection, in turn, is a less ambiguous concept, and can be understood as a set of rules that aim to protect the rights, freedoms, and interests of individuals whose personal data are handled and used (Tzanou, 2013).
The GDPR establishes that data reuse should only be allowed where new purposes are compatible with those for which the data were initially collected. This is usually the case for ENIGMA analyses.
International data transfers are not allowed unless the country to which data are sent has been found by the European Commission to provide "adequate" data protection; at the time of this writing, the list of countries for which an adequacy decision has been issued includes, for example, Argentina, Israel, Japan, New Zealand, and Switzerland. While the list does include the US and Canada, for these two countries the decisions cover commercial uses of data and do not broadly extend to research conducted by universities and research institutes, as needed for ENIGMA. In the absence of such an adequacy decision, or of specific derogations, an alternative path to data transfer is through the specific provision of safeguards concerning data protection. These require the signing of legally binding agreements between authorities, or binding corporate or institutional rules approved by competent supervisory authorities (Dove, 2018; Staunton, Slokenberga, & Mascalzoni, 2019).
If none of these paths are viable, a possible solution that still allows research is to determine that the coordinating facility for a given ENIGMA working group will be located in the EU itself; then no data from EU subjects need to be transferred outside the Union. However, such a workaround is limited in scope and time: countries that are in the process of adopting legislation modeled after the GDPR (such as the United Kingdom, through the Data Protection Act of 2018) will be under broadly similar rules; these countries might, nonetheless, quickly receive an adequacy decision from the European Commission, such that transfers between the EU and these countries should ultimately be facilitated.

| De-identification
Regardless of specific legislation, data de-identification is a crucial step. De-identification consists of the removal of personally identifiable information that would allow data to be traced back to individuals, thus rendering such identification impossible, or at least extremely difficult or unlikely.
In the context of HIPAA, unless otherwise determined by an expert, removal of information such as names, locations with granularity smaller than that of a state, dates related to an individual (such as birth date, admission date, etc.), and other identifying details, is considered to provide a reasonable basis to assume that the information cannot be used to identify an individual. Full-face photographs and any comparable images must likewise be removed for HIPAA compliance. For ENIGMA data, this means that MRI scans may need to have facial features of subjects removed before data are shared (see below).
Unlike HIPAA, the GDPR does not specify de-identification methods. Instead, researchers are expected to remain mindful that de-identified data might become reidentifiable through the development of new technologies or the use of ancillary data. Thus, the GDPR requires vigilance to ensure that data remain anonymous (Dove, 2018). Managing the risk of reidentification is crucial, and safeguards should be put in place as if the data were not anonymous. Pseudonymized (e.g., tokenized or key-coded) data are subject to the GDPR, even if the codes are not shared and remain within different organizations.
For ENIGMA, this means that sites that handle information of EU residents must ensure complete de-identification and take into account the risk that de-identified data become reidentifiable, or pursue GDPR compliance by treating the data as if they were not anonymous.
Imaging data stored in the standard Digital Imaging and Communications in Medicine (DICOM) file format are accompanied by a host of personally identifiable information. Tools exist to anonymize such files by erasing fields in the file header that could contain such information. Another popular file format used in brain imaging is that of the Neuroimaging Informatics Technology Initiative (NIFTI). This format stores no personally identifiable information, but contains two general-purpose fields ("descrip" and "intent_name," with 80 and 16 bytes, respectively) that could hold such information. The format can also accommodate extensions, and can be paired with a JavaScript Object Notation (JSON) text file, both of which may contain information that may allow subject identification. Any field with information that could lead to reidentification must be erased or removed before data can be shared between ENIGMA sites and the coordinating facility, or other safeguards must be in place to ensure that no reidentification will be attempted or possible. A popular tool for conversion from DICOM to NIFTI, "dcm2niix" (Rorden, 2014), allows removal of such information during format conversion.
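As an illustration, a minimal sketch of scrubbing these NIFTI fields using the nibabel library (file names hypothetical):

```python
import nibabel as nib

# Load the image (file name hypothetical)
img = nib.load("sub-01_T1w.nii.gz")

# Blank the two general-purpose text fields that could hold
# identifying information
img.header["descrip"] = b""
img.header["intent_name"] = b""

# Drop any header extensions, which may carry free-form text
img.header.extensions.clear()

nib.save(img, "sub-01_T1w_scrubbed.nii.gz")
```

Any accompanying JSON sidecar should be reviewed separately, as it may hold fields such as acquisition date or institution name.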
Moreover, the data portion of DICOM and NIFTI files may be edited to ensure that facial features will be removed (defacing).
Reidentification of participants based on scan data had been considered a remote possibility, a concern that nonetheless motivated the creation of defacing methods.

| Encryption and transfer
Encryption reduces the possibility that data might be misappropriated when stored, or intercepted during transfer, and thus reduces the chances that data can be used in ways that are not in the best interest of research participants. Data encryption is always compatible with both HIPAA and GDPR, and in the case of the former, it can be considered "a reasonable and appropriate measure" to ensure confidentiality, which renders it mandatory for all practical purposes. Even without specific regulations, data encryption is good practice insofar as the confidentiality, integrity, and security of data of participants are concerned.
A basic scheme consists of encrypting the data using a reasonably secure cipher (algorithm), with a key (password) that can also be used for decrypting. Such a key is transmitted from an individual site to the ENIGMA coordinating facility through means other than those used to transfer the encrypted data. A more sophisticated approach uses pairs of public/private keys: the site encrypts the data using the public (not secret) key provided by the coordinating facility; data can be decrypted by the coordinating facility using the private (secret) key.
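A minimal sketch of such a hybrid public/private-key scheme, using the Python "cryptography" library (file names hypothetical; a real deployment would also manage key storage and integrity verification):

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Coordinating facility: generate a key pair; only the public key is shared
private_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
public_key = private_key.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Contributing site: encrypt the data with a fresh symmetric key
symmetric_key = Fernet.generate_key()
with open("site_data.tar", "rb") as f:  # archive name hypothetical
    encrypted_data = Fernet(symmetric_key).encrypt(f.read())

# ...then encrypt the symmetric key itself with the facility's public key
encrypted_key = public_key.encrypt(symmetric_key, oaep)

# Coordinating facility: recover the symmetric key, then the data
recovered_key = private_key.decrypt(encrypted_key, oaep)
original_data = Fernet(recovered_key).decrypt(encrypted_data)
```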

| Organization and processing
Before or after being transferred to the coordinating facility, the data can be organized into a scheme that facilitates processing and the use of imaging pipelines, such as the Brain Imaging Data Structure (BIDS; Gorgolewski et al., 2016). BIDS prescribes a hierarchy of files and directories that is simple and intuitive, yet powerful enough to accommodate a diverse set of imaging modalities collected in varied circumstances. The scheme is intended to minimize efforts related to data curation, to reduce the number of errors due to incorrect organization of data files, and to facilitate the development and usage of software, which can be written to parse the file structure directly (Gorgolewski, Alfaro-Almagro, Auer, Bellec, & Capotă, 2017); a minimal layout is sketched below.
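For illustration, a minimal BIDS layout for a study contributing only T1-weighted scans might look as follows (study and subject labels hypothetical):

```
mystudy/
├── dataset_description.json
├── participants.tsv
├── sub-01/
│   └── anat/
│       ├── sub-01_T1w.nii.gz
│       └── sub-01_T1w.json
└── sub-02/
    └── anat/
        ├── sub-02_T1w.nii.gz
        └── sub-02_T1w.json
```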
Processing the whole dataset using one operating system and software version can help avoid inconsistencies. It has been demonstrated that differences in operating systems can have a small effect on, for example, FreeSurfer metrics (Gronenschild et al., 2012); such metrics have been used in many ENIGMA analyses to date, including ENIGMA-GAD analyses. Scientists may benefit from monitoring their computing environment and running analyses in batches that are not interspersed with periodic software updates.
Options to ensure software consistency include the use of virtual machines (such as QEMU/KVM, VirtualBox, or VMware) or containerized environments (such as Docker or Singularity). In virtual machines, the whole system, including the emulated hardware and the "guest" operating system, can be kept static and be shared. Containers use a layer of compatibility between the "host" operating system and the desired applications. They tend to run faster and have simpler maintenance than virtual machines. In either case, the researcher can keep tight control over software versions, libraries, and dependencies. Neither of the two methods, however, is ideal. Virtual machines can be heavier to run and offer less flexible integration with the host operating system (which in turn may have access to a large computing cluster, such that integration is often desirable).
Containers address this problem but introduce others: troubleshooting experimental software may be difficult because it is not always clear whether a given problem has arisen because of the software itself, or because of the container or its interaction with the host system. Regardless, such solutions improve reproducibility of results by allowing researchers to share not only their code and information about their computing environment, but also their actual computing environment.

| QUALITY CONTROL
For ENIGMA meta-analyses, each site can perform a quality assessment of its own data using a previously agreed protocol. Sites can report the quality metrics to the coordinating facility, which can then use this information in the statistical model by, for example, giving less weight to sites contributing lower-quality data. ENIGMA protocols provide consistent, streamlined strategies for visual inspection of imaging data; these strategies involve inspection of the cortical border between gray and white matter, parcellations of the cortex, and segmentations of subcortical structures. For mega-analyses, while the same kind of visual inspection could be advantageous, the amount of data may render this process difficult. Although there is no standard triage or similar requirement before sharing raw data, it is usually the case that images will have already been seen by at least one investigator before sharing, and as such, some might have been excluded from consideration and not sent to the coordinating facility. Moreover, while using the same raters may yield more consistent selection of participants across sites given imaging features, the same process might introduce unwanted selection bias, for example, if the imaging features used to visually define inclusion or exclusion are unknowingly related to the variables under investigation; this risk may be present even if quality criteria are consistent across sites.

| Automated methods
Biases arising from manual inspection can be minimized through automated quality control methods. In the UK Biobank, for example, a supervised learning classifier identifies problematic images with acceptable accuracy (Alfaro-Almagro et al., 2018). The UK Biobank, however, benefits from the fact that data collection is limited to only three sites, all of which use identical equipment (Miller et al., 2016). In ENIGMA, data come from many sites, with MRI scanners from different vendors and models, with different field and gradient strengths, different coils, acquisition sequences, and software versions. Using a quality control classifier with such heterogeneous data is challenging (Chen et al., 2014; Focke et al., 2011; Han et al., 2006; Jovicich et al., 2006), although methods with good performance have been proposed (Klapwijk, van de Kamp, van der Meulen, Peters, & Wierenga, 2019). Another option is MRIQC, which computes a set of quality metrics from the raw images (Esteban et al., 2017). This tool does not, however, classify images as having high or low quality; instead, it provides an interface for a rater to make that determination based on the computed quality metrics and possibly other features; these metrics may, in turn, be used to train a classifier. Such classification, however, can be difficult to generalize, given the diversity of data from multiple sites (Esteban et al., 2017). Even then, derived metrics may be insufficient to predict the successful generation of cortical surfaces and segmentation of subcortical structures with FreeSurfer, from which image-derived measurements of interest are often computed. Notwithstanding these considerations, it is good practice to investigate quality using this kind of tool, whose outputs include boxplots (Figure 2) and mosaics that show multiple slices, color-coded so as to highlight potential defects. The output from these tools is useful to assist in flagging images that, even if successfully processed with FreeSurfer, may require specific decisions as to whether they should remain in the sample. Moreover, these tools provide summary metrics that can be returned to the contributing sites, where local researchers can assess the quality of their own images versus those collected by others or elsewhere (Esteban et al., 2019).

| Euler characteristic
One particular metric has been found to be a good predictor of the quality of FreeSurfer outputs: the Euler characteristic (χ, sometimes also called the Euler number) of the cortical surface produced before topological correction (Rosen et al., 2018). To conceptualize the Euler characteristic, consider a polyhedron whose spatial configuration is determined by its vertices, edges, and faces. It can be shown (Lakatos, 1976) that if the polyhedron is convex, the number of vertices (V), minus the number of edges (E), plus the number of faces (F), is always equal to 2; this quantity is the Euler characteristic, that is, χ = V − E + F = 2. If the polyhedron is crossed by a single hole (as in a torus), χ is decreased by 2; if crossed by two holes, it is decreased by 4; if hollow, χ is increased by 2. More generally, for every hole that crosses a polyhedron, its χ is decreased by two, whereas for every hollow, it is increased by two. The Euler characteristic is well known in neuroimaging as a key metric for multiple-testing correction using the random field theory (RFT; Worsley et al., 1996). Here, however, it serves an entirely different purpose: it acts as a metric to quantify the topological deviation of the initial cortical surface from a sphere, as an increasingly large number of holes in the initial surface generates an increasingly negative Euler characteristic. The more negative these values, the more likely it is that the original T1-weighted scans had low quality in ways that negatively impact the surface reconstruction.
FreeSurfer treats such holes as topological defects and corrects them automatically to create a cortical surface that reaches χ = 2 (Fischl, Liu, & Dale, 2001). However, initial surfaces that have too many defects are less likely to ever be usable, even after topology correction.
The Euler characteristic was found to be highly correlated with manual quality ratings, accurately discriminating unusable from usable scans, and outperforming other data quality measures (Klapwijk et al., 2019). However, the precise threshold to apply to χ when deciding whether a surface is usable remains unknown; such a threshold may be site or scanner specific. Moreover, it is not currently known whether, as a general rule, the Euler characteristics of the two brain hemispheres should be combined as their mean or as the worst (minimum, most negative) of the two, nor whether other metrics related to surface topology could be helpful for quality assessment. For subcortical structures, specific quality metrics are currently missing from the literature.
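As a sketch, χ can be computed directly from a surface mesh; the example below uses nibabel and assumes FreeSurfer outputs organized in the usual subjects directory (paths hypothetical):

```python
import numpy as np
import nibabel as nib

def euler_characteristic(surface_path):
    """Compute chi = V - E + F for a triangulated FreeSurfer surface."""
    vertices, faces = nib.freesurfer.read_geometry(surface_path)
    # Each triangle contributes three edges; count each undirected edge once
    edges = np.sort(faces[:, [0, 1, 1, 2, 2, 0]].reshape(-1, 2), axis=1)
    n_edges = len(np.unique(edges, axis=0))
    return len(vertices) - n_edges + len(faces)

# "lh.orig.nofix" is the left-hemisphere surface before topology correction
chi = euler_characteristic("subjects/sub-01/surf/lh.orig.nofix")
print(chi)  # 2 for a surface topologically equivalent to a sphere
```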

| Manual edits
Image processing pipelines may allow manual edits when automated approaches fail to generate processed images of desirable quality. This is also the case with FreeSurfer, whereby the user can employ "control points" to guide intensity normalization, or directly edit the white matter and brain mask volumes, after which the relevant processing steps are rerun.

| MEASUREMENTS
Imaging generates a myriad of measurements. Analyses can reveal genetic and environmental influences on healthy and pathological variability in the human brain, providing great potential that is currently not fully harnessed. As an example, a recent ENIGMA meta-analysis using data from 51,665 subjects identified 187 loci influencing cortical surface area and 12 others influencing thickness (Grasby et al., 2020); in another example, a recent UK Biobank analysis used 3,144 imaging-derived traits (Elliott et al., 2018). While many such measurements can be construed as sums of many small effects, and thus expected to be approximately normally distributed, others cannot. Some notable examples include the fine-resolution area of the cortex, which follows a lognormal distribution, potentially reflecting exponential influences; fractional anisotropy of water diffusion, a quantity bounded between 0 and 1, which likewise cannot be considered a sum of multiple small effects; and functional connectivity assessments, bounded between −1 and 1. Cases such as these may be accommodated through the use of a data transformation, such as logarithmic, power, Fisher's r-to-z, logit, or probit transformations; generalized linear models and nonparametric statistics can also be considered.
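As a sketch, some of these transformations in Python (values hypothetical):

```python
import numpy as np
from scipy.special import logit
from scipy.stats import norm

area = np.array([312.5, 450.1, 289.9])  # lognormal-like regional areas
fa = np.array([0.12, 0.55, 0.93])       # fractional anisotropy, in (0, 1)
r = np.array([-0.40, 0.10, 0.85])       # functional connectivity, in (-1, 1)

log_area = np.log(area)   # logarithmic transformation
logit_fa = logit(fa)      # logit transformation for (0, 1) data
probit_fa = norm.ppf(fa)  # probit alternative for the same range
z = np.arctanh(r)         # Fisher's r-to-z transformation
```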

| Choice of resolution
Researchers need to consider whether imaging analyses should use measures obtained at every point of an image (e.g., voxelwise or vertexwise data) or aggregate measures computed over regions of interest or parcellations, broadly termed "ROIs." Although vertexwise analyses have been performed in recent ENIGMA research (Chye et al., in press; Ho et al., 2020), most previous ENIGMA studies used meta-analyses; in such cases, an ROI-based approach is more robust to small deviations from a common image registration scheme.
Moreover, voxelwise and vertexwise measurements represent small pieces of tissue in relation to the resolution inherent to the equipment or scanning sequence. As such, these measures are intrinsically noisier than ROI-based quantities. Furthermore, because the number of voxels/vertices is usually many times larger than the number of ROIs under potential consideration, their use is computationally more intensive, and leads to an exacerbation of the multiple testing problem.

| Harmonization
In a mega-analysis, pooling data from numerous cohorts requires addressing nuisance factors. Site-, scanner-, and cohort-specific effects of no interest can manifest as effects larger than those of diagnosis or other effects of interest; neglecting such nuisance effects can reduce power or generate false positives and low reproducibility (Baggerly, Coombes, & Neeley, 2008; Leek et al., 2010). These confounds can be accommodated at the time of statistical modeling and analysis, at the penalty of increased model complexity, or the data may be modified before analysis so as to remove such unwanted effects.
ComBat ("combining batches") is such an approach, that allows harmonization of data across sites. The method originated in genetics for correcting batch effects in microarrays, and is described in detail in F I G U R E 3 Surface reconstructions of the cortex of the right hemisphere based on different resolutions of a recursively subdivided icosahedron. The default in FreeSurfer uses n = 7 recursions, resulting in a total of 163,842 vertices. Considerable computational savings can be obtained with lower resolutions (such as with n = 4 or 5) without substantial losses in localizing power. V, number of vertices; E, number of edges; F, number of triangular faces (Johnson, Li, & Rabinovic, 2007). In brief, ComBat incorporates systematic biases common across voxels/vertices, under the mild assumption that phenomena resulting in such "batch" effects (e.g., site, scanner, and/or cohort effects) affect voxels/vertices in similar ways (e.g., stronger mean values, higher variability). In the method, location (additive) and scale (multiplicative) model parameters that represent these batch effects are estimated. This estimation is done by pooling information across voxels or vertices from participants from each site so as to shrink such unwanted effects toward an overall group effect (i.e., across batches and voxels/vertices). These estimates are then used to adjust the data, robustly discounting unwanted effects. Variability of interest or related to known nuisance or confounds (e.g., age or sex) can be retained. In brain imaging, the approach has been an effective method to harmonize diffusion tensor imaging data (Fortin et al., 2017), cortical thickness measures (Fortin et al., 2018), rest and task-based functional MRI (Nielson et al., 2018), and functional connectivity (Yu et al., 2018). ComBat has been used in ENIGMA studies (Hatton et al., in press; Villalón-Reina et al., in press), although it has been argued that it leads to similar results as random effects linear regression (Zavaliangos-Petropulu et al., 2019). Which statistical harmonization model is optimal remains an active discussion at the time of this writing.

| STATISTICAL ANALYSIS
Statistical analyses can proceed once data have been processed, and measurements obtained and possibly harmonized. Such analyses estimate the effects of interest and compare them with what would be expected if there were no real effect, so as to compute a p-value. It is at the stage of the statistical analysis that the differences between meta- and mega-analysis become most pronounced.

| Fixed versus random effects
For all cases discussed in Figure 1, analyses may assume that true effects are fixed (constant) across sites, such that any differences in effects among sites are solely due to random experimental error, or may assume that the true effects themselves are random (i.e., varying) across sites. For meta-analyses without access to IPD, the above distinction between fixed and random effects holds relatively unambiguously, and distinct methods to summarize literature findings exist for either of the two cases (Borenstein, Hedges, Higgins, & Rothstein, 2009). For other cases, unfortunately, these terms have multiple meanings that sometimes conflict (Gelman, 2005). For research using IPD, less ambiguous definitions apply to slopes and intercepts, which can be treated as constant (thus, fixed) or allowed to vary (thus, random) across sites. The distinction between fixed and random then becomes an attribute not of the statistical model, but of each independent variable.
As for the level of inference, in the case of ENIGMA, the selection of sites is seldom a random quantity, and generalization is sought not to an idealized "population of sites," but instead to the actual population from which the participants at the contributing sites were recruited.
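For illustration, a minimal sketch of such models using the statsmodels library (data file and column names hypothetical): a model with fixed slopes and random per-site intercepts, and a variant in which the age slope is also allowed to vary across sites.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant; file and column names hypothetical
df = pd.read_csv("enigma_ipd.csv")

# Fixed effects for diagnosis, age, and sex; random intercept per site
m_ri = smf.mixedlm("thickness ~ dx + age + sex", data=df, groups="site")
print(m_ri.fit().summary())

# Additionally allowing the age slope to vary across sites (random slope)
m_rs = smf.mixedlm("thickness ~ dx + age + sex", data=df, groups="site",
                   re_formula="~age")
print(m_rs.fit().summary())
```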

| Confounds
Unwanted data variability may arise due to procedural factors including site or scanner features, or due to factors that affect both dependent and independent variables. Variables representing the former case are termed nuisance; those representing the latter, confounds.
Variables such as age or sex may be nuisance variables in some analyses and confounds in others, depending on the relationship between these variables and the other variables studied; here, we broadly refer to nuisance and confounding variables as covariates.
The large sample sizes of ENIGMA increase statistical power in general; however, they also make analyses more sensitive to subtle confounding effects (Smith & Nichols, 2018), and ignoring such confounds may reduce power or identify spurious associations. Addressing these concerns can be challenging, as decisions regarding confounding variables affect the interpretation of the relationship between dependent and independent variables (Gordon, 1968; Lord, 1967). For example, if a variable is affected by both the imaging data and the independent variable (i.e., it is a collider), adjusting for it induces a false association (Berkson, 1946; Luque-Fernandez et al., 2019; Pearl, 2009, chapter 6), which can happen in either direction (positive or negative).
Moreover, controlling for poorly reliable measures may not completely remove their putative effects, leading to false conclusions about effects (J. Westfall & Yarkoni, 2016).

A special kind of confound in brain imaging is a composite measurement formed by pooling together the values of all voxels/vertices or regions of interest, with the goal of discounting unwanted global effects. For example, in a vertexwise analysis of surface area, it might be of interest to consider the total cortical surface area as a confounding variable. Likewise, for studies of subcortical volume, total brain size, or a related quantity, the intracranial volume (Buckner et al., 2004), can be considered a confound; for cortical thickness, the average thickness across the cortex; for functional MRI, at the subject level, a measurement of global signal, though controversial, might be considered in a similar manner (Murphy & Fox, 2017). The rationale for inclusion of a global measurement as a regressor in the model stems from an interest in enhancing the localizing power afforded by imaging methods, and in reducing sources of noise that affect measures globally. From this perspective, the scientist seeks to learn where, specifically, in the brain some phenomenon may occur. In this context, arguably, global effects would be of lesser interest, unless a research hypothesis is specifically about them. In addition, for functional MRI, some sources of noise, such as movement and respiration, result in artifactual global signal changes, and so removal of the global signal is also an effective means of reducing artifacts (Ciric et al., 2017).
What makes these confounds special is that, being composites of all other local (voxelwise/vertexwise) or regional quantities, they are almost certainly correlated with these measurements, and thus are likely to also be associated with variables of interest in the model if these are associated with the local or regional measurements. These global variables are more likely to impact results where local or regional effects of interest are present, even more so if these are widespread across the brain. Options for taking such global effects into account in the statistical analysis have been studied (Andersson, 1997; Barnes et al., 2010; Nordenskjöld et al., 2013; Sanfilipo, Benedict, Zivadinov, & Bakshi, 2004). The main approaches are: (a) convert each local or regional measurement into a proportion of the global quantity; (b) residualize the dependent variable with respect to the global measurement; and (c) include the global measurement in the model. Among these three, the latter option should generally be favored, as it accounts for effects that the confounding variable may have on both dependent and independent variables. The least preferable is the proportion method (a), one of the reasons being that noise in the global measurement propagates into every proportion, compromising the resulting measurements to a much greater extent than in the other approaches.
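A minimal sketch of the three approaches with simulated data, using statsmodels (variable names hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
global_m = rng.normal(1800.0, 150.0, n)               # e.g., total surface area
regional = 0.05 * global_m + rng.normal(0.0, 5.0, n)  # one regional measure
group = rng.integers(0, 2, n).astype(float)           # variable of interest

# (a) proportion: divide the regional measure by the global one
fit_a = sm.OLS(regional / global_m, sm.add_constant(group)).fit()

# (b) residualize the regional measure with respect to the global one
resid = sm.OLS(regional, sm.add_constant(global_m)).fit().resid
fit_b = sm.OLS(resid, sm.add_constant(group)).fit()

# (c) include the global measure as a covariate (generally preferred)
X = sm.add_constant(np.column_stack([group, global_m]))
fit_c = sm.OLS(regional, X).fit()
```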
If confounding variables are to be included in the model, it is often appropriate, considering all the above, to present results both with and without these variables in the model (Hyatt et al., 2020; Simmons et al., 2011). Ideally, these results would also be corrected for multiple testing, as the number of opportunities for falsely significant results has now doubled (see more on multiple testing below).

| Inference
Choices for inference can be broadly divided into parametric and nonparametric. Parametric methods are computationally faster, but require assumptions that are sometimes difficult to justify. For example, the data have to be assumed to be independent and normally distributed with identical variances after all nuisance variables and confounds have been taken into account. These assumptions may hold for some analyses, but not for others; when the variety of imaging modalities possible for ENIGMA studies is considered, these assumptions cannot hold for all of them. The consequence is that results will be incorrect in at least some cases. Nonparametric tests, such as permutation tests, on the other hand, require very few assumptions about the data probability distribution, and therefore can be applied to a wider variety of situations than parametric tests. For permutation tests, the only key assumption is that any random instantiation of the permuted data must be as likely to have been observed as the original, unpermuted data; in other words, the data must be exchangeable. If exchangeability holds, permutation tests are exact, in the sense that the probability of observing a p-value smaller than a predefined significance level α is α itself when there are no true effects (Holmes, Blair, Watson, & Ford, 1996; T. E. Nichols & Holmes, 2002; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). This holds even when the distribution of the maximum statistic (T. E. Nichols & Holmes, 2002), or spatial statistics such as cluster extent, cluster mass, or threshold-free cluster enhancement (TFCE; Smith & Nichols, 2009), are used. In all these cases, speed can be increased using fast, parallel implementations of permutation algorithms (Eklund, Dufort, Villani, & Laconte, 2014), using accelerations based on various mathematical and statistical properties of these same tests (Winkler, Ridgway, et al., 2016), or both.
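A minimal sketch of a permutation test for a two-group difference in means (function and variable names hypothetical):

```python
import numpy as np

def permutation_pvalue(y, group, n_perm=10000, seed=0):
    """One-sided permutation p-value for a difference in group means."""
    rng = np.random.default_rng(seed)
    observed = y[group == 1].mean() - y[group == 0].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(group)  # valid if the data are exchangeable
        count += (y[perm == 1].mean() - y[perm == 0].mean()) >= observed
    # The unpermuted arrangement counts as one permutation
    return (count + 1) / (n_perm + 1)
```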

| Multiple testing
As with any imaging experiment that uses one statistical test per imaging element (voxel, vertex, ROI), correction for multiple testing is necessary (T. Nichols & Hayasaka, 2003). For parametric inference, and under a series of additional assumptions, it is possible to control the familywise error rate (FWER) using the RFT (Worsley et al., 1996); methods and software exist for both voxelwise and vertexwise data.
However, this method cannot be used for ROIs, as these cannot be represented as a regular lattice, or for voxelwise data that do not meet all the assumptions of the theory, such as tract-based spatial statistics (Smith et al., 2006). A valid approach for all these cases, but one that controls a different error quantity, is the false discovery rate (FDR; Benjamini & Hochberg, 1995; Genovese, Lazar, & Nichols, 2002). For permutation inference, correction for multiple testing that controls the FWER can be accomplished in a straightforward manner for all the above cases using the distribution of the maximum statistic obtained across all tests in each permutation (P. H. Westfall & Young, 1993).
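Continuing the sketch above, FWER-corrected p-values via the distribution of the maximum statistic might be computed as follows (array shapes hypothetical):

```python
import numpy as np

def fwer_maxstat_pvalues(stats, perm_stats):
    """FWER-corrected p-values from the null distribution of the maximum.

    stats:      (n_tests,) observed statistics, one per voxel/vertex/ROI
    perm_stats: (n_perm, n_tests) statistics recomputed for each permutation
    """
    max_dist = perm_stats.max(axis=1)  # maximum across tests, per permutation
    exceed = (max_dist[None, :] >= stats[:, None]).sum(axis=1)
    return (exceed + 1) / (len(max_dist) + 1)
```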

| REPORTING RESULTS
Classical meta-analysis results are often reported with the aid of forest plots (Borenstein et al., 2009), which show effect sizes and confidence intervals for each study separately (or for each site, in the case of ENIGMA), along with a combined effect size that considers the effects from all studies after some sensible weighting. ENIGMA studies that used meta-analyses adopted a similar approach where possible, for example, when imaging metrics were collapsible into single numbers, such as asymmetries (Guadalupe et al., 2017; Kong et al., 2018) or indices for specific structures (Hibar et al., 2015; Stein et al., 2012). For mega-analyses, while such plots may be of lesser applicability, given that a single pooled analysis is conducted, displays of site-level effects can still be informative.

| Authorship
Given the large number of involved sites and investigators, authorship of published reports is an important aspect of ENIGMA projects. While there are no enforceable rules to determine the authorship of a scientific paper, a number of organizations have provided guidelines and recommendations intended to ensure that substantial contributors are credited as authors; for a review, see Claxton (2005).

| MEGA-ANALYSIS IN THE ENIGMA-GAD GROUP
Having discussed the above, we are now in a position to better describe the specifics of the ENIGMA-GAD analyses. In this group, sites were contacted based on their publication and funding record using imaging data of subjects with a history of anxiety disorders, and who could meet criteria for GAD. Virtually all sites that were contacted and that did have structural imaging data were able to participate.

Measurements considered for analyses, as indicated in the preregistration and as in previous ENIGMA studies, included cortical measurements of thickness and surface area for each of the parcels of the Desikan-Killiany atlas (Desikan et al., 2006), as well as volumes of subcortical structures. Cortical vertexwise thickness and surface area were also measured, and downsampled to the resolution of an icosahedron recursively subdivided four times, with 2,562 vertices per hemisphere. Because sites differed widely in variables such as age, modeling age with random slopes (with additional quadratic effects) seemed more appropriate than merely assuming that, across all ages and sites, age effects would be exactly the same. Models with and without a global measure (total surface area, average thickness, and intracranial volume) were considered. Correction for multiple testing used the distribution of the maximum statistic, assessed via permutations. ComBat was not used for the main analyses; instead, scanner-specific effects were modeled (random intercepts), and a test statistic robust to heteroscedasticity was used, along with variance groups (one per site) and exchangeability blocks. ComBat is, however, being assessed with the same data as a potential option for future studies; results will be reported in due course. Statistical analysis for this mega-analysis used the tool Permutation Analysis of Linear Models (PALM). At the time of this writing, the analysis is being finalized and the manuscript is being prepared for publication (Harrewijn et al., in prep.). Authorship, as with the present paper, was defined according to the Vancouver criteria, generally with early-career investigators, members of the coordinating facility, and those who worked directly with the data handling and the bulk of the writing appearing first, and with lead investigators appearing last; in between, the contributing sites appear in alphabetical order, with early-career investigators listed first and lead investigators last within each site.

| CONCLUSION
This overview described the analytic choices across the various stages of an ENIGMA mega-analysis, setting out the reasoning behind these choices. Aspects related to data protection and privacy, and to the handling of confounds, along with other challenges that inevitably occur when large-scale data from multiple sites are analyzed, were also discussed. The various choices made by ENIGMA-GAD when facing each of the discussed topics were presented. The hope is that the resulting survey of these practical considerations will be useful to others embarking on similar multi-site neuroimaging studies, especially those integrating data across multiple countries and data modalities.

ACKNOWLEDGMENTS
The authors thank all the many research sites worldwide who considered contributing data to the ENIGMA-Anxiety/GAD project. André [...] Research net of the University of Greifswald, Germany, which is funded by the Federal Ministry of Education and Research (grants no. 01ZZ9603, 01ZZ0103, and 01ZZ0403), the Ministry of Cultural [...]

FIGURE 4 Example report pages with multiple views of the cortical surfaces (front) and slices of subcortical volumes (back). Pial surfaces are shown, but inspection can use white and inflated surfaces; slices with subcortical volumes can be complemented with surface overlays. The script that generates these pages uses FreeSurfer scripting to automate the operation of the tools "tkmedit" and "tksurfer," and is available at https://brainder.org