Accelerated estimation and permutation inference for ACE modeling

Abstract There are a wealth of tools for fitting linear models at each location in the brain in neuroimaging analysis, and a wealth of genetic tools for estimating heritability for a small number of phenotypes. But there remains a need for computationally efficient neuroimaging genetic tools that can conduct analyses at the brain‐wide scale. Here we present a simple method for heritability estimation on twins that replaces a variance component model‐which requires iterative optimisation‐with a (noniterative) linear regression model, by transforming data to squared twin‐pair differences. We demonstrate that the method has comparable bias, mean squared error, false positive risk, and power to best practice maximum‐likelihood‐based methods, while requiring a small fraction of the computation time. Combined with permutation, we call this approach “Accelerated Permutation Inference for the ACE Model (APACE)” where ACE refers to the additive genetic (A) effects, and common (C), and unique (E) environmental influences on the trait. We show how the use of spatial statistics like cluster size can dramatically improve power, and illustrate the method on a heritability analysis of an fMRI working memory dataset.

dramatically improve power, and illustrate the method on a heritability analysis of an fMRI working memory dataset.

K E Y W O R D S
ACE model, heritability inference, permutation test, twin studies 1 | INTRODUCTION There continues to be growing interest in the joint study of imaging phenotypes and genetic data (genotypes; Glahn, Thompson, & Blangero, 2007). Imaging genetics is a multidisciplinary research area investigating the genetic influences on brain structure and function using both imaging and genetic information. A phenotype is an observable characteristic that results from the interaction of genetic inheritance and environmental conditions. To quantify the degree of the genetic effects on a phenotype, heritability is defined as the proportion of phenotypic variation that is due to genetic factors, where the genetic variability can be attributed to a particular gene or the aggregate of multiple genes. Several studies have examined the heritability of psychiatric disorders, and many of them suggest that most psychiatric disorders are moderately to highly heritable, with an estimated heritability of 0.83 for schizophrenia (Cannon, Kaprio, Lönnqvist, Huttunen, & Koskenvuo, 1998) and 0.85 for bipolar affective disorder (McGuffin et al., 2003). There also exists a large number of neuroimaging studies investigating the heritability of neuroanatomical phenotypes and brain functions and reporting considerable heritability (see for example , Ge et al., 2015, Ge et al., , 2016Ge, Holmes, Buckner, Smoller, & Sabuncu, 2017;Glahn et al., 2010;Thompson, Ge, Glahn, Jahanshad, & Nichols, 2013).
Recently, the development of genomic technologies has allowed direct heritability analysis from unrelated individuals using genomewide genetic data (Ge, Chen, Neale, Sabuncu, & Smoller, 2017;Yang et al., 2010;Yang, Lee, Goddard, & Visscher, 2011). Without genetic data, heritability can be estimated by studying individuals with varying degrees of genetic relatedness. Classic twin studies are often employed to estimate the level of genetic and environmental variations in traits.
The method of moments and the maximum likelihood approach are most commonly used methods to estimating heritability. Falconer's formula provides a simple point estimator for heritability based on moment matching (Falconer & Mackay, 1996). The best practice, likelihoodbased method uses the variance component model, which parameterizes different degrees of covariance expected with varying relatedness between subjects; the variance parameters are estimated by applying the maximum likelihood criterion (Neale & Cardon, 1992).
While the first neuroimaging studies measuring heritability used Falconer's method (e.g., Wright, Sham, Murray, Weinberger, & Bullmore, 2002), the likelihood-based approach is now routine, with variance components or structural equation modeling (SEM) methods applied one voxel at a time. However, such methods cannot exploit the spatial nature of the data, nor can they provide accurate inferences corrected for family-wise error rate over the brain. Although a simple Bonferroni correction offers the control of family-wise error rate, it is typically quite conservative for smooth images. When feasible, permutation inference offers an exact control of false positive risk and allows for specialized spatial statistics, such as inference by cluster size, which delivers family-wise error rate corrections while implicitly accounting for spatial dependence. However, most commonly used software tools for heritability estimation using twin data at present are too slow and unreliable to allow permutation.
In this article, we propose a linear regression-based method that is new to the neuroimaging community, based on the method of Grimes and Harvey (1980) and closely related to the Haseman-Elston regression for genetic linkage studies (Ge et al., 2018;Haseman & Elston, 1972). It allows voxel-wise heritability estimation with an approximate but remarkably fast and accurate performance. Using detailed Monte Carlo evaluations, we demonstrate that this method is valid with controlled false positive risk, and its statistical power is comparable to existing methods. With the speed advantage, this newly proposed method makes permutation inference more feasible and applicable.
Here we also present for the first time our permutation approach in detail, which is developed for both voxel-and cluster-wise inferences, with an application to a real fMRI blood oxygen level dependent (BOLD) dataset. Aside from fMRI data, this approach can also be applied to any other type of neuroimaging data.

| THEORY AND METHODS
Heritability can be interpreted as the proportion of phenotypic variance explained by a single genetic marker (Filippini et al., 2008) or any subset of genes/markers of the genome (Stein et al., 2010). To quantify heritability, the total phenotypic variance (σ 2 P ) can be partitioned into genetic (σ 2 G ) and environmental (σ 2 E ) components (Falconer & Mackay, 1996) in a linear manner: The heritability in the broad sense (H 2 ) measures the overall genetic influence on a trait, defined by where the genetic variation σ 2 G summarizes the additive and nonadditive genetic contributions. The additive genetic effect arises from the linear addition of independent genes, or more technically, allelic contributions at different gene loci, while the nonadditive genetic effect refers to, for example, dominance or the interactive influence among alleles within or between gene loci (e.g., epistasis). The additive genetic variation is generally of more interest since it is the summed effects of a particular allele or alleles at a given locus or at multiple trait-related loci. Thus, the narrow-sense heritability is defined as the proportion of phenotypic variation accounted for by the additive genetic effect (σ 2 A ) with an expression of which is more commonly used and is usually just called "heritability".
We now detail the models employed to assess heritability using twin data. Even in the absence of genetic influences on a phenotype, it is likely that twins are phenotypically more similar than unrelated individuals since they have been raised in the same family environment. This gives rise to the common environmental factor, which contributes to the covariance within twin pairs regardless of MZ or DZ type. Finally, there is an independent unique error, corresponding to the usual independent and identically distributed (i.i.d.) noise corrupting the measurements plus actual unique environmental influences, for example, trauma and illness. The phenotypic variance (σ 2 P ) within the population is assumed to be the same and can be divided into additive genetic (A), common environmental (C), and unique environmental (E) components, written as The so-called ACE modeling in twin studies is based on this variance decomposition . Narrow-sense heritability is denoted by and similarly, the contribution of common environmental factor can be defined as which describes the relative variance attributable to common environmental causes. The estimation of heritability and common environmental variance constitutes the analysis of variance components.

| Notation
In this section, we will outline the notation used in this article. Assume the experiment consists of n participants, including n MZ MZ twins (n MZ /2 pairs), n DZ DZ twins (n DZ /2 pairs), and n S singletons (unrelated subjects and denoted by S), such that n = n MZ + n DZ + n S . For each (brain) image, with V voxels per subject, Y i, v denotes the data from subject i and voxel v (v = 1, …, V). For voxel v, the data from all subjects can be written as a column vector Y v .
Some types of brain imaging data are directly measured, for example, grey matter density, producing one image per subject. However, fMRI requires hundreds of scans per subject to estimate blood flow change. A within-subject model is often fitted to the imaging data for each subject, producing an image of BOLD effect magnitude for each subject (Frackowiak et al., 2004). Since the same form of model is fit at each voxel, going forward we suppress the voxel index v. Thus, the general linear model (GLM) in a matrix form for each voxel can be constructed as where X is an n × p design matrix including p − 1 covariates, and ϵ is the error vector, assumed to be normally distributed, written as ϵ~N (0, V); the covariance matrix V is defined below. Typical covariates include age, sex, or other between-subject effects.
To simplify the description of variance/covariance decomposition, we introduce a subject type indicator function T: {1, … , n} ! {MZ, DZ, S}. The function T Á ð Þ associates subject index i to subject type: T i ð Þ 2 MZ,DZ,S f g for i = 1,…, n; that is, T i ð Þ = MZ when subject i is an MZ twin, T i ð Þ = DZ when subject i is a DZ twin, and T i ð Þ = S when subject i is a singleton. We now consider different possible covariance structures for pairs of subjects (i, j). To identify twins, let j(i) be the index of the twin pair of subject i. The MZ twin covariance can then be written, for i with T i ð Þ = T j i ð Þ ð Þ= MZ, as The DZ twin covariance, For subject pairs involving one or more singletons from different families without twins, (i, j) with T i ð Þ = S or T j ð Þ = S, or pairs of unrelated twins, (i, j) with j 6 ¼ j(i), we have unrelated covariance of To facilitate a general implementation, we re-write the pair-wise covariance matrices for MZ twins (4), DZ twins (5) and unrelated subjects (6) as the linear combinations of some known matrices, respectively: where variance components are extracted as the coefficients. If we denote the vector of variance components A, C, E by ρ = (A, C, E) 0 , a concise notation of the error variance-covariance matrix V is where Q r (r = 1, 2, 3) is constructed with the use of between-subject covariances (7), (8), and (9), corresponding to the arrangement of MZ, DZ and singletons in the data vector. The full likelihood and restricted likelihood (ReML) that accounts for nuisance regressors can be found in (Harville, 1977).

| Voxel-wise heritability estimation
For each voxel, we fit the GLM model to the voxel-wise data from twins and singletons, then estimate the model parameters-variance components, and finally obtain the estimate of heritability. We will first describe our proposed method in detail, and then briefly review other methods/tools that are widely used for heritability estimation.

| Linear regression with squared differences
In the 1980s, a simple linear regression method for variance component estimation using squared differences (SqD's) of each subject pair was proposed by Grimes and Harvey (1980). For a sample of n subjects, there are (n 2 − n)/2 distinct SqD's. Note that the expectation of a SqD depends on the covariance, that is, This allows SqD's to be related to variance parameters in a linear fashion, in particular construction of a linear regression where coefficients correspond to the variance components A, C, E (Grimes & Harvey, 1980;Lindquist, Spicer, Asllani, & Wager, 2012). Grimes and Harvey (1980) used ordinary least squares (OLS), which can produce negative variance component estimates that they simply neglected.
To deal with the non-negativity problem, Lawson and Hanson (1987) proposed a now ubiquitous non-negative least squares (NNLS) algorithm.
The foundation of this algorithm is the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939;Kuhn & Tucker, 1951), which were first proposed for more complex nonlinear programming problems. In our case with the linearity assumption, the KKT conditions can be simplified to accelerate the computation. Although other methods had been proposed to solve this non-negativity problem for large and sparse matrix settings, Luo and Duraiswami (2011) suggested that NNLS still maintained its superiority when small or moderate dense matrices were handled.
While Grimes and Harvey's method specifies a linear regression with the use of (n 2 − n)/2 different observations of SqD's, our modification of this method simplifies the computation such that only ( where 1 is an all-ones vector and β 0 denotes the population mean. By the extension of the covariance matrices (4), (5), and (6) and the basic properties of the variance operator, for MZ twin pairs (i, j(i)), T i ð Þ = MZ, we have for DZ twin pairs (i, j(i)), T i ð Þ = DZ, we have and for unrelated pairs of subjects (i, j), The relationships (11), (12), and (13) describe the expected values for all these (n 2 − n)/2 SqD's and specify the mean structure of a linear regression model: where the first n MZ /2 rows are the SqD's of MZ twins, the next n DZ /2 rows are for the DZ twins, and the remaining (n 2 − n)/2 − n MZ /2 − n DZ /2 rows are for the remaining unrelated subject pairings. We denote this as where, D is an (n 2 − n)/2 × 1 SqD vector of observations, Z is an (n 2 − n)/ 2 × 3 design matrix, and ρ is the unknown variance parameter vector.

Multiple linear regression
Now suppose that the GLM (3) is a multiple regression model containing an regression intercept and multiple covariates, expressed as where the n-vectors X 1 , … , X p − 1 are regressors, each associated with one of the p − 1 covariates, and β 1 , …, β p − 1 are the corresponding regression coefficients. The parameters β = (β 0 , …, β p − 1 ) 0 are not of interest and we treat them as nuisance parameters in variance component analysis. If we estimate β using OLS with an expression of Denote the residual projection matrix by R = I − X(X 0 X) −1 X 0 , and the OLS residual vector e = RY follows a normal distribution with mean E e ½ = 0 and variance Cov(e) = RVR, that is, e~N(0, RVR), where the projection matrix R projects the unobservable error vector ϵ to its estimate e that is orthogonal to the space spanned by the columns of the design matrix X. When the sample size n is large enough, the error vector ϵ can be well approximated by the residual vector e. We assume that the correlation induced by removing covariates and mean centering is negligible when compared with variance components, that is, RVR ≈ V, which is a reasonable assumption for sufficient n.
Here the variance components are derived in terms of nuisancefree errors: for MZ twin pairs, for DZ twin pairs, and for the remaining unrelated subject pairs. Therefore, the derived linear regression model with SqD's can be analogously denoted as Computational simplification For large n, the (n 2 − n)/2 rows of the SqD data and design matrix is unwieldy. Hence we modify how we compute estimatesρ where n otw = (n 2 − n)/2 − n MZ /2 − n DZ /2. Next, observe that where, D l is the l th element of D, SSD = P n 2 −n ð Þ=2 l = 1 D l is the sum of all squared differences, SSD MZ is the sum of n MZ /2 squared differences for MZ, and SSD DZ is the sum of n DZ /2 squared differences for DZ. A fundamental result (see Appendix A) shows that SSD is just a multiple of the sample variance: With the nuisance variables, we approximate this sum with the residual variance from the GLM (3), that is, SSD≈ n 2 − n À Áσ 2 , whereσ 2 = e 0 e= n− p ð Þ is the unbiased estimator for the phenotypic variance σ 2 . With simulations, we verified that estimating SSD with this residual variance had a negligible impact on parameter estimates.

Non-negative least squares
Our APACE method proceeds by applying the NNLS algorithm to the linear regression model with SqD's (10) or (15) for unknown variance component estimation; precisely, we seek where f(ρ) = kZρ − Dk 2 /2 is the objective function to be minimized.
KKT conditions provide the necessary conditions for this optimization problem: If ρ * is the local minimizer of f(ρ) satisfying the inequality constraint ρ ≥ 0, then the following conditions hold: where, the gradient vector is rf(ρ) = Z 0 (Zρ − D) (Karush, 1939;Kuhn & Tucker, 1951). As Z 0 (Zρ − D) = 0 corresponds to the least squares normal equations, for any Z, D, and ρ found by OLS, the first two conditions are trivially satisfied, and hence, only the third condition needs to be checked. The NNLS algorithm proceeds by using OLS to estimate ρ and checking for negative elements. If a negative element is found, it is set to zero, effectively dropping that column from the Z, and OLS is re-fit.

Algorithm simplification
The NNLS algorithm can be further modified and simplified for the computation in ACE modeling since there are only three parameters A, C, E in total. There are only a total of 2 3 − 1 = 7 possible models that can arise under NNLS, but we require the unique environmental factor E to always be present due to the unavoidable measurement error or noise. Thus we only need to consider four possible models E, where,ρ 0 andρ 1 are parameter estimates derived from the null and alternative models, respectively, and ℓρjY ð Þ is the ReML log-likelihood given observations Y. Under the assumption of normality, the likelihood function can be analytically computed (see for example, Harville, 1977). Wilks' theorem states that under certain regularity conditions, the null distribution of the LRT statistic for comparing nested models (e.g., CE vs. ACE) converges to a chi-squared distribution as n ! ∞ (Wilks, 1938). In particular, T is usually regarded to asymptotically follow a chi-squared distribution with 1 degree of freedom, that is, χ 2 1 , based on Wilks' theorem. However, the null value of the variance parameter A lies on the boundary of the parameter space and thus the asymptotic sampling distribution of this LRT statistic, under H 0 , is a half-half mixture of chi-squared distributions, that is, instead of a commonly used standard chi-square distribution, that is, (Dominicus, Skrondal, Gjessing, Pedersen, & Palmgren, 2006;Self & Liang, 1987;Zhang & Lin, 2008).
Given the asymptotic null distribution of the LRT statistic, the theoretical p-value can be easily calculated. Obtaining a p-value less than a given significance level α, which is typically a small number (e.g., α = 0.05), suggests that there is significant evidence against the null hypothesis and the null hypothesis should be rejected at level α. We note that when data is non-Gaussian, the LRT computed under the normality assumption can be inaccurate and the use of Wilk's theorem in approximating its null distribution might lead to invalid results, which will be investigated in our simulation studies. We therefore compute both the asymptotic theoretical p-value and the permutation-based p-value based on the empirical distribution of the LRT statistic (see below).

| Existing methods
Several approaches have been proposed for the analysis of heritability, and we briefly introduce these existing approaches in this section.

Falconer's method
The heritability method due to Falconer and Mackay (1996) is based on moment matching of intraclass correlation coefficients between MZ twins (r MZ ) and DZ twins (r DZ ): Solving for narrow-sense heritability (2), these equations give Falconer's heritability estimator: This method has been widely used and is the simplest way to estimate heritability (Falconer & Mackay, 1996). However, methods of moments estimators can perform poorly relative to optimal maximum likelihood estimators (Nichols, Friston, Roiser, & Viding, 2009). We consider this in a small set of simulations, described below.
The null hypothesis of zero heritability can be tested by comparing the MZ and DZ correlation coefficients after Fisher's variancestabilising transformation. For MZ, Fisher's transformation is which is approximately normally distributed with mean where ρ MZ is the true population correlation coefficient; likewise for z DZ . To test for the equality of r MZ and r DZ , that is, zero heritability, we can use the test statistic where 1 {Á} is the indicator function. Without the positivity constrain, this test statistic would be asymptotically distributed as a standard normal distribution under H 0 (Sakaori, 2002). Considering the positivity constraint, we consider the null distribution to be a half-half mixture of a point mass at zero (a.k.a. χ 2 0 ) and a half normal distribution.

Bayesian ReML
Statistical parametric mapping (SPM) software ( We considered perturbations of these settings but simulations found that these priors were best in terms of estimation accuracy and power (not shown). SPM uses the expectation-maximization (EM) algorithm to iteratively search for the maximum a posteriori estimates of the parameters in the log space (Friston et al., 2002b).

Structural equation modeling
The freely available R package "OpenMx" (http://openmx.psyc.virginia. edu) offers a structural equation SEM framework to allow flexible model definition and parameter estimation for variance components, both of which are commonly used in analysing genetic data for heritability inference. The SEM ACE model for univariate twin data can be displayed as a path diagram, shown in Figure 1, where the influence caused by the latent variables a, c, and e can be described by the path coefficients , respectively (Rijsdijk & Sham, 2002). According to path tracing rules, the covariance matrices for MZ and DZ twin pairs are which have the same structure as matrices (4) and (5). The goodness of fit of this model is also measured using the maximum likelihood criterion (Rijsdijk & Sham, 2002). However, there exist some drawbacks of this SEM approach employed in OpenMx for imaging data analysis.
The goodness-of-fit LRT statistic asymptotically follows a mixture of chi-square distributions (Dominicus et al., 2006;Self & Liang, 1987;Zhang & Lin, 2008), but we can observe that OpenMx incorrectly uses a single standard chi-square distribution with 1 degree of freedom (Rijsdijk & Sham, 2002). For neuroimaging, a relative weakness of OpenMx is lack of direct tools for operating with neuroimaging data.

Solar
The sequential oligogenic linkage analysis routines (SOLAR) software is designed for the investigation of genetic effects in imaging genetics studies (Almasy & Blangero, 1998;Koran et al., 2014). In addition, the SOLAR package is capable of estimating heritability with the data from diverse family structures. SOLAR uses maximum likelihood to estimate the variance parameters, A, C, E. To test the null hypothesis of zero heritability, the LRT test statistic, which is asymptotically distributed as a mixture of chi-square distributions, can be calculated (Almasy & Blangero, 1998). While SOLAR itself cannot read brain imaging data, the related software SOLAReclipse (https://www.nitrc. org/projects/se_linux) can read and write neuroimaging data.

| Permutation inference
Permutation testing is a nonparametric technique that makes minimal assumptions about the data. With a few simple assumptions like exchangeability of the observed data under the null hypothesis, the nonparametric permutation test is conceptually simple and theoretically intuitive (Nichols & Hayasaka, 2003;Nichols & Holmes, 2001).
When the null hypothesis is true, the data will exhibit the feature of exchangeability

| Voxel-wise single threshold test
Let π = 1, …, N p be the permutations considered including correctly labeled data. At a single ROI/voxel, the permutation test produces a level α threshold by finding the bαN p c + 1 largest value of the null distribution; p-values are computed as the proportion of null distribution values as large as or larger than the observed test statistic.
To control the FWE, the relevant null distribution is that of the maximum test statistic. That is, the level-α FWE threshold T FWE α is the bαN p c + 1 largest value of the maximum distribution, and FWEcorrected p-value, denoted by p FWE T , for any given test statistic T 0 is likewise the proportion of permutation maximum distribution values equaling or exceeding that value (Nichols & Holmes, 2001): where T max π is element π of the maximum distribution.

| Cluster-wise supra-threshold tests
for cluster statistics of size and mass, respectively, where K max π (or M max π ) is element π of the maximum cluster size (or mass) distribution.

| SIMULATION STUDIES
In this section, univariate simulation analysis is conducted to compare our newly proposed voxel-wise heritability estimation methods with existing methods in terms of prediction accuracy, validity, sensitivity, and the overall computation time for different variance component settings. The ROC-based simulation studies generate 2D image data for power evaluation and comparison between the voxel-and clusterwise heritability inference approaches for various settings.

| Univariate simulation evaluations
We use Monte Carlo simulations to evaluate our proposed linear regression methods, LR-SqD Perm (LR-SqD using empirical p-value based on 1,000 permutations), LR-SqD (LR-SqD using asymptotic pvalue) and LR-SqD ReML, and existing methods including Falconer's method, Bayesian ReML in SPM, SEM in OpenMx, and SOLAR.

| Simulation setting
The parameter settings shown in Table 1  Apart from the Gaussian random error, we also take into account the case where the error term is not normally distributed. Here we consider a log-normally distributed unique environmental random noise. The sample considered is balanced with the size of n = 300, that is, the number of MZ and DZ twin pairs is identical.
In total, 5 samples with balanced/unbalanced design and Gaussian/ non-Gaussian random error, along with 15 (A, C, E) parameter settings, lead to totally 90 simulation settings. For each setting, we consider both the one sample model (10)

Accuracy and precision
The MSE comparison of six methods is shown in Figure 3, which shows that the two linear regression methods, LR-SqD and LR-SqD  We believe this is due to convergence problems for small samples.

Specificity and statistical sensitivity
For the non-Gaussian case (Row 6), we note that all methods are invalid with inflated FPR except LR-SqD Perm, which does not rely on the assumption of normality.

Running time comparison
We evaluated the relative running time (relative to the running time of Falconer's method) for completion of 1,000 simulated datasets (see Figure 6). The computational performance comparing six methods

| ROC-based power evaluation
To evaluate the sensitivity of the voxel-and cluster-wise heritability inference approaches described in Section 2.3, we conduct a receiver operating characteristic (ROC) analysis.

| Simulation setting
In this set of simulations, we use n = 20, 60, 100, using only twins and equal number of MZ and DZ pairs. The signal is generated with four parameter settings shown in Table 2 consisting of different values of heritability and shared environmental variance.
The simulated images are 2D, with 128 × 128 pixels. Two spatial configurations of signal were considered, a single large region or nine separate regions, having similar number of signal pixels (1,020 vs. 1,024); see Figure 7. The set of N I images were first created by filling each pixel with i.i.d. standard Gaussian noise, and then inducing the signal with the Cholesky decomposition of the desired variance/covariance structure. Spatial Gaussian smoothing kernels with full width at half maximum (FWHM) of 0, 1.5, 3, or 6 pixels were applied to blur these images in order to accommodate spatial F I G U R E 5 The statistical power comparison with a nominal level α = 0.05 for false null hypothesis (H 0 : h 2 = 0) of LR-SqD Perm using 1,000 permutations, LR-SqD, LR-SqD ReML, Falconer's method, Bayesian ReML in SPM, SEM in OpenMx, and SOLAR, based on 1,000 realizations (Rows 1-5 for Gaussian error and Row 6 for non-Gaussian error). Comma ordered pairs on x-axis correspond to the rounded parameter values of (A, C) with A > 0; see Table 1 and Figure 2 for exact parameter settings used [Color figure can be viewed at wileyonlinelibrary.com] dependence across neighbouring voxels. A total of N I = 1000 images were generated for each simulation setting.

| ROC analysis
The ROC curves plot true positive rate (TPR; y-axis) against FPR with varying threshold levels. A standard ROC analysis is suitable for only a single outcome, while we have 128 2 outcomes.
Hence we use an alternative, free-response ROC approach (Chakraborty & Winter, 1990). As described in (Smith & Nichols, 2009), a free-response ROC, consisting of a y-axis representing the probability of true detection, averaged over pixels with A > 0, and an x-axis representing the probability of any false detections, is deployed.
F I G U R E 6 The relative running time comparison after base-10 log transformation, denoted by log 10 (t/t F ) for LR-SqD Perm using 1,000 permutations, LR-SqD, LR-SqD ReML, Bayesian ReML in SPM, SEM in OpenMx, and SOLAR, based on 1,000 realizations, where t F denotes the running time for Falconer's method, that is, log 10 (t/t F ) = 0 for Falconer's method (Rows 1-5 for Gaussian error and Row 6 for non-Gaussian error). Comma ordered pairs on x-axis correspond to the rounded parameter values of (A, C); see Table 1 and Figure 2 for exact parameter settings used [Color figure can be viewed at wileyonlinelibrary.com] We summarize the ROC curve by a normalized area under the curve (AUC), with a larger value for better performance. Since we are mostly concerned about FPR values between 0 and 0.05, corresponding to a family-wise error rate of 5%, the normalised AUC is 20 × AUC for FPR < 0.05, maintaining a"perfect" AUC of 1. We calculate this free-response ROC curve for both voxel-and cluster-wise inferences. For cluster-wise inference, we set the voxel-level clusterforming threshold to α = 0.05 or LRT statistic value u α = 2.71.
For clarity, the exact steps in this ROC calculation are as follows.  Table 2, and one of two spatial configurations in Figure 7.
2. For each image, estimate heritability pixel-by-pixel and create the LRT test statistic image.
3. Voxel-wise inference. Apply a large number of predefined grids of thresholds to the LRT test statistic image, obtain the suprathreshold pixels, and then calculate family-wise FPR and TPR for each of these threshold levels, obtaining FPR from noise-only image and TPR from the A > 0 pixels in signal images.
Cluster-wise inference. Threshold the LRT statistic images with a predetermined cluster-forming threshold (p-value α or statistic u α ) and form clusters. Use a predefined grid of cluster size thresholds to define each cluster as detected or not.

Compute the family-wise FPR and TPR:
FPR. Using the smoothed random noise images, for each threshold, the family-wise FPR is the proportion of realizations having any (false) detections.
TPR. Using the heritability images, for each threshold, compute the proportion of true positive pixels (detected and A > 0) out of all possible (number of the A > 0 pixels). This is computed for each realization and averaged over realizations.
5. Plot the ROC curves and calculate the corresponding normalized AUC values.

| ROC-based simulation results
As described above, a range of simulation settings are investigated for both voxel-and cluster-wise inference approaches using APACE. For different extents of smoothness, the returned ROC curves have fairly similar shape, so we will only illustrate the ROC curves created by medium degree of smoothing with FWHM of three pixels, which are shown in Figures 8 and 9 for the simulated focal and distributed signals, respectively. The corresponding normalized AUC comparison is then shown in Figure 10.
For both focal and distributed signal, ROC curves of the clusterwise method are always above those of the voxel-wise method, reflecting higher statistical power obtained for cluster-than voxelwise inference approaches. In general, for a particular family-wise FPR level, the TPR value of both inference methods becomes larger when the sample size or the heritability is increased.
The normalized AUC values ( Figure 10) show that the voxel-wise method has poor performance overall for all simulation settings with negligible AUC values, while the cluster-wise inference approach has much larger AUC values. While the absolute power is low here, reflecting the challenge of detecting nonzero heritability with just 100 subjects, these results show that the cluster-wise approach is more sensitive to such spatial signals and demonstrates the value of such spatial statistics.

| REAL DATA ANALYSIS
Here we report the heritability analysis of a working memory fMRI task. We illustrate the above-mentioned heritability inference approaches including univariate LR-SqD and permutations.  (Blokland et al., 2011).

| Real data results
The F I G U R E 1 1 The log-transformed FWE-corrected p-value image, that is, − log 10 p FWE K À Á , for supra-threshold clusters with significant supra-threshold cluster size [Color figure can be viewed at wileyonlinelibrary.com] only two significant voxels were identified, while there were four clusters with a total of 634 voxels identified to be significant for the cluster-wise tests. The FWE-corrected p-value image after log-10 transformation, that is, −log 10 (p FWE ), for significant supra-threshold clusters with respect to size statistic is shown in Figure 11. The heritability estimate image is shown in Figure 12, where the heritability estimate ranges between 0 and 0.59, and the significant voxels based on cluster-wise inference have a heritability range of 0.18 and 0.59. The most heritability-significant regions found using both the single threshold test and the supra-threshold cluster tests overlap with the most significant regions from the previous Mx analysis (Blokland et al., 2011).

| CONCLUSION AND DISCUSSION
In this article, we have presented two novel linear regression-based estimation methods for heritability inference in neuroimaging, trying to improve statistical power and reduce computational complexity. A simple LR-SqD method based on linear regression modeling with squared differences of paired observations has been developed, and found to have comparable or even better estimation accuracy and statistical power relative to existing methods. LR-SqD, as simple as Falconer's method, only requires linear regression to improve prediction accuracy. The univariate simulation study also showed that apart from Falconer's method, LR-SqD is the most time-efficient approach when compared with those likelihood-based iterative methods, and will never encounter any convergence problems. The fast, accurate and noniterative properties of LR-SqD make it more flexible and feasible to be applied for permutation inference.
A permutation-based heritability inference approach by embedding LR-SqD method in a permutation framework has also been developed. This permutation inference allows us to perform more exact heritability inference using LR-SqD Perm at each voxel, to control the FWER, and also to consider alternative cluster-wise imaging statistics.
The fact that adjacent voxels or regions in a brain image tend to be structurally and functionally homologous can be exploited by spatial statistics like cluster size and mass. Our use of LR-SqD, the fast and accurate noniterative method (free of any convergence issues), makes these spatially informed statistics more accessible. For equivalent FWERs, the cluster-wise approach was found to have higher sensitivity, and thus more powerful in ROC-based power simulations, which demonstrates the importance of such spatial statistics over voxel-wise statistics and the need for permutation inference to take advantage of these cluster statistics. With few weak assumptions, permutation inference is a feasible alternative to the parametric approaches, which is even preferable in studies having small sample sizes or when the stronger assumptions of the parametric approaches cannot be met (Nichols & Holmes, 2001).
Except for LR-SqD Perm, methods being compared in univariate simulations are asymptotic. We found our permutation-based LR-SqD method, LR-SqD Perm, is more robust, being the most powerful approach for nearly all simulation settings. Other asymptotic LR-SqD methods, LR-SqD and LR-SqD ReML, also have good power, and cluster F I G U R E 1 2 The heritability image for the masked brain regions. The heritability estimates vary between 0 and 0.59, and the heritability estimates of significant voxels using cluster size inference range from 0.18 to 0.59 [Color figure can be viewed at wileyonlinelibrary.com] inference methods have better detection power than voxel-wise methods. A sample size of 1,000 is still insufficient, for some parameter settings, resulting in limited power (far below 80%) for detecting heritability, but at least we found all methods are valid with normally distributed errors except SOLAR, which is specially designed for family studies with large sample sizes of various degrees of relatedness. For Gaussian noise, although Falconer's method has poor estimation accuracy, it seems to work well with the power comparable to that of LR-SqD Perm.
However, it relies on the normality assumption to test for the equivalence of MZ and DZ correlations. For non-Gaussian noise, the null distribution of LRT computed under the misspecified normality assumption can be inaccurate and the corresponding asymptotic null distribution of LRT based on Wilk's theorem is problematic, which results in inflated FPR as shown in our simulations. LR-SqD Perm, which relaxes the assumption of normality, is the only applicable method that maintains valid FPR control in the case of non-Gaussian error, and thus we suggest sticking to LR-SqD Perm. During univariate evaluations, we found adding singletons can improve neither estimation accuracy nor statistical power. However, we still suggest including singletons in the statistical analysis since a better estimate of the phenotypic variance can be obtained with more data taken into account. Averaging across all the simulation settings, we found LR-SqD is roughly 2.5 times faster than LR-SqD ReML, and around 45.5, 84.8, and 995.7 times faster than SPM, OpenMx and SOLAR, respectively.
The LRT statistic for testing H 0 : A = 0 is not asymptotically pivotal and its distribution varies discontinuously across the parameter space depending on the true value of variance component C. The configurations of C on the parameter space can be partitioned into two cases: (1) C > 0, (2) C = 0. For standard Case (1), the reference distribution for the LRT involving one parameter on the boundary of the parameter space has been proven to be a half-half mixture of χ 2 0 and χ 2 1 (Dominicus et al., 2006;Self & Liang, 1987). For nonstandard Case (2), under the null, both A and C are boundary parameters and the asymptotic distribution of the LRT statistic is a mixture of χ 2 0 , χ 2 1 , and χ 2 2 with mixing probabilities 1/2 − p, 1/2 and p, where 0 ≤ p ≤ 1/2 (Dominicus et al., 2006;Self & Liang, 1987). Even if C > 0 for Case (1), the asymptotic null distribution of the LRT statistic for a finite sample can be more similar to that for Case (2) when C is close enough to the boundary (Self & Liang, 1987). When the sample size tends to infinity or is sufficiently large, the asymptotic approximation is enhanced, and the tendency eases with the reference distribution more resembling that for Case (1). This leads to the conservativeness of the asymptotic LRT-based tests when compared with the permutation-based LR-SqD Perm, and thus we recommend using the nonparametric permutation inference.
The existence of nonzero variance components A and C induces the familial correlation (Dominicus et al., 2006). When the true value of A is nonzero, testing the null hypothesis of no heritability is similar to testing for the familial correlation since it would be difficult to precisely separate the familial influences and explicitly distinguish between the A and C effects due to the inevitable noise, which has been revealed in univariate simulation evaluations. Therefore, increasing the variance parameter C may improve the power of the test for the null hypothesis of no heritability while holding the validity of the test. In addition, when C is zero, the test comparing the AE model against the E model could probably offer higher power than the test of ACE versus CE. However, the variance parameter C is unknown in reality and impulsively using the test of AE versus E would lead to inflated FPR and invalid conclusions with overestimated power.
We have developed a Matlab-based tool "Accelerated Permutation Inference for the ACE Model (APACE)", which provides different analysis approaches specialized for heritability inference based on LR-SqD and is freely available at https://github.com/NISOx-BDI/APACE. Compared with the popular analysis tools such as OpenMx and SOLAR, APACE is designed specifically for neuroimaging data and is applicable for any sample sizes with controlled FPR. The use of the flexible permutation approach allows for any test statistics (e.g., LRT, cluster size, and so on) to be applied in computing the p-values, and enabling parallel execution further accelerates the implementation. The current version of APACE can be adopted for the family design including twins, siblings and singletons, and the generalization of APACE for any family designs is possible with the use of the pedigree information.