A Generalized Levene's Scale Test for Variance Heterogeneity in the Presence of Sample Correlation and Group Uncertainty

We generalize Levene's test for variance (scale) heterogeneity between $k$ groups for more complex data, which includes sample correlation and group membership uncertainty. Following a two-stage regression framework, we show that least absolute deviation regression must be used in the stage 1 analysis to ensure a correct asymptotic $\chi^2_{k-1}/(k-1)$ distribution of the generalized scale ($gS$) test statistic. We then show that the proposed $gS$ test is independent of the generalized location test, under the joint null hypothesis of no mean and no variance heterogeneity. Consequently, we generalize the recently proposed joint location-scale ($gJLS$) test valuable in settings where there is an interaction effect, but one interacting variable is not available. We evaluate the proposed method via an extensive simulation study, and two genetic association application studies.


Introduction
Testing for scale (variance) heterogeneity, prior to the main inference of location (mean) parameters, is a common diagnostic method in linear regression to evaluate the assumption of homoscedasticity.In some research areas, such as statistical genetics, testing for heteroscedasticity itself can be of primary interest.
With the goal of detecting a genetic association between a single-nucleotide polymorphism (SNP, G) and a quantitative outcome (phenotype, Y ), the traditional approach is to conduct a location test, testing mean differences in Y across the three genotype groups of the SNP (G = 0, 1 or 2 copies of the minor allele, the variant with population frequency < 0.5).
However, it has been noted that a number of biologically meaningful scenarios can lead to variance differences in Y across the genotype groups of a SNP of interest (say G 1 ).For example, an underlying interaction effect between G 1 and another SNP G 2 (G 1 xG 2 ) or an environmental factor E (G 1 xE), on Y can lead to heteroscedasticity across G 1 if the interacting G 2 or E variable is not collected to directly model the interaction term (Pare et al., 2010).Transformations on a phenotype can also result in variance heterogeneity (Sun et al., 2013).This transformation can occur knowingly for statistical purposes, e.g.log(Y ), or unknowingly, e.g.choosing a phenotype measurement that does not directly represent the true underlying biological outcome of a gene.In each of these scenarios, a scale test can be used either alone to indirectly detect associated SNPs (Pare et al., 2010), or combined with a location test to increase testing power (Cao et al., 2014;Soave et al., 2015).
Genotype uncertainty is inherent in both sequenced and imputed SNP data.For these types of data, the genotype of a SNP for an individual (G = 0, 1 or 2) is represented by three genotype probabilities (p 0 , p 1 , p 2 , and p 0 + p 1 + p 2 = 1).For testing methods that require genotype to be known unambiguously, the probabilistic data are typically transformed into the so-called "best-guess" (most likely or hard-call) genotype, crudely selected as the one with the largest probability.In the context of location-testing, several groups have proposed methods that incorporate the probabilistic data and showed that this improves power (Acar and Sun, 2013;Kutalik et al., 2011).The corresponding development for scaletesting, however, is lacking.
Genetic association studies often involve family data, where individuals in a sample are correlated or clustered.In addition, unintentional correlation due to cryptic relatedness may be revealed from standard quality control analyses of a population sample of presumed unrelated individuals (Sun and Dimitromanolakis, 2012).A number of generalized location tests allowing for family data have been proposed (Horvath et al., 2001;Jakobsdottir and McPeek, 2013), and their power gain over analyzing only the subset of independent individuals is a direct consequence of the increase in sample size.However, few scale tests deal with correlated data, with the exception of methods proposed specifically for clustered data present in twin studies (Haseman and Elston, 1970;Iachine et al., 2010).Further, these methods have been reported to have type 1 error issues in the presence of non-normal data or small, unequal group sizes (Iachine et al., 2010), and they have not been extended to incorporate group membership uncertainty.
Both classical statistical tests and graphical procedures have been proposed to investigate heteroscedasticity (Bartlett, 1937;Breusch and Pagan, 1979;Cook and Weisberg, 1983;Levene, 1960;White, 1980).In big data settings, such as genome-wide association studies, where possibly millions of SNPs are scanned for association with an outcome, graphical and other computationally burdensome approaches are undesirable.Levene's test (Levene, 1960) is known for its simplicity and robustness to modelling assumptions, and it is perhaps the most popular method for evaluating variance heterogeneity between k groups.Therefore, our development here focuses on Levene's method.
In this paper, we extend Levene's test for equality of variances across k groups to allow for both group membership uncertainty and sample correlation.When groups are known, we show that the proposed method outperforms existing methods for clustered twin data.
In the presence of group uncertainty, we demonstrate that our test continues to be accurate and has improved power over the "best-guess" approach.This generalized scale test can be used alone for heteroscedasticity diagnostic purposes but with wider applicability.Motivated by the complex genetic association studies described above, we also show that the proposed gS test can be combined with existing generalized location tests using the joint locationscale framework, previously developed for population samples without group uncertainty (Soave et al., 2015), to further improve power.Finally, we apply our methods to two genetic association studies, one of HbA1c levels in individuals with type 1 diabetes, and the other of lung disease in individuals with cystic fibrosis.

Methodology
We first consider a sample of independent observations with no group uncertainty, and formulate Levene's test as a regression problem.Using this regression framework, we then extend Levene's test as the generalized scale (gS hereinafter) test to allow for sample dependency and group uncertainty.For clarity of the methods comparison, we also briefly discuss the Iachine et al. (2010) extension of Levene's test, specifically designed for twin pairs without group uncertainty.Finally, we generalize the joint location-scale test of Soave et al. (2015) (gJLS) for the complex data structure considered here.

Notation and Statistical Model
Let y i , i = 1, . . ., n, be a sample of independent observations, where each y i ∼ N (µ i , σ 2 i ).
Suppose the y i 's fall into k distinct treatment groups with group-specific variance σ 2 j , j = 1, . . ., k, and let n j be the sample size for group j, n = n j .Our motivation concerns testing the hypothesis of equal variance across the k groups: For notation concision, here we use σ 2 j for group-specific variance, j = 1, . . ., k, and σ 2 i for observation-specific variance, i = 1, . . ., n; in what follows we make the distinction clear in the context.
Consider the normal linear model of interest here, where ε i ∼ N (0, σ 2 i ), σ 2 i corresponds to the variance associated with the group that y i belongs to.In other words, σ 2 i = σ 2 j * if x (j * −1)i = 1.In matrix notation, where X is the design matrix obtained by stacking the N n (0, Σ), and Σ is the covariance matrix with diagonal elements σ 2 i s.

Formulating Levene's Test as a Regression F-test and Modifications
The classical formulation of Levene's test first centres the observations, y i 's, by their estimated group means and obtains the corresponding absolute deviations, d i 's.It then tests for mean differences in the d i 's across the k groups using ANOVA.Let I ij , j = 1, . . ., k be the group indicator variables, where I ij = 1 if individual i belongs to group j.Now, let µ (j) = n i=1 y i I ij /n j be the estimated group means of the y i 's, such that an estimate of E(y i ) . The corresponding absolute deviations are Let d (j) be the estimated group means of the d i 's, such that an estimate of E(d i ) is , and let d = n i=1 d i /n be the grand mean.Finally, Levene's test statistic has the following form , where F (d) follows approximately an F (k − 1, n − k) distribution under the null hypothesis of (1), and a χ 2 k−1 /(k − 1) distribution asymptotically as n → ∞.
For the purpose of a unified development, it is prudent to re-formulate Levene's test using the following two-stage regression framework: Stage 1.1.Obtain the residuals, ε i = y i − y i = y i − x i β, from the ordinary least squares (OLS) regression of y i on x T i ; we refer to this as the stage 1 regression.
Stage 1.2.Take the absolute values of these residuals, Stage 2. Test for an association between the d i 's and x T i 's using a regression F -test; we refer to this as the stage 2 regression and test.
The justification for this two-stage regression procedure (Levene's test) being a test of the hypothesis of variance homogeneity (1) is as follows.Stage 1 performs OLS regression using a working covariance matrix Σ stage 1 = σ 2 y I, where I is the identity matrix.Therefore where h ii is the ith diagonal element of the hat matrix H.
folded-normal distribution and its mean is a linear function of σ i , This relationship between d i and σ i is approximated by the following working model in stage 2, where e i ∼ N (0, σ 2 d ).In matrix form, where θ = (α, γ T ) T = (α, γ 1 , . . ., γ k−1 ) T , and e ∼ N (0, Σ stage 2 ), Σ stage 2 = σ 2 d I. Testing the null hypothesis ( 1) is now re-formulated as testing using the classical OLS regression F -test.Note that although the d i 's are folded normal variables, Levene's variance test takes advantage of the fact that inference from OLS regression is robust to violations of the normality assumption.
This formulation of Levene's test has a similar structure to the score test of Glejser (1969) proposed for testing heterscedasticity associated with continuous covariates.Godfrey (1996) showed that when estimating β by OLS in the stage 1 regression, the resulting Glejser score statistic derived from the stage 2 regression analysis is not asymptotically distributed as χ 2 1 , unless the distribution of ε is symmetric.To achieve robustness, several modifications have been proposed (Brown and Forsythe, 1974;Im, 2000;Machado and Silva, 2000;Furno, 2005;Gastwirth et al., 2009), among which replacing sample group means with medians in constructing the d i 's is most intuitive.This substitution has been consistently recommended in the literature for its robustness against non-normality (Conover et al., 1981;Lim and Loh, 1996).It has also been shown analytically that, when the distribution of the error ε is not symmetric, centering on the sample group medians, and not the means, will lead to an asymptotically correct Levene's test (Carroll and Schneider, 1985) and correct Glejser's score test (Furno, 2005).In the regression framework, this modification corresponds to estimating β by least absolute deviation (LAD) regression instead of OLS regression in stage 1.

The Generalized Levene's Scale (gS) Test
The above regression framework for Levene's test allows us to incorporate group uncertainty by simply replacing the group indicators or dummy variables for each observation, x T i , with the corresponding group probabilities.Analogous to dummy variables, the group probabilities for each individual sum to 1, so we omit one of the covariates to ensure model identifiability.
Using genetic association as an example again, let (p 0 = 0.25, p 1 = 0.42, p 2 = 0.33) be the genotype probabilities for an individual i at a SNP of interest, then, without loss of generality, we can define x T i = (1, x 1i , x 2i ) = (1, 0.42, 0.33).Note that the "best-guess" approach would have the corresponding covariate vector as x T i = (1, 1, 0).Now, consider correlated data where ε i and ε j are no longer independent of each other and the covariance matrix Σ is no longer diagonal.In the stage 1 regression, because we are only interested in obtaining β to construct d i = |y i − x i β|, we can continue to use OLS or LAD regression with the misspecified working covariance matrix, Σ stage 1 = σ 2 y I, to obtain consistent and unbiased β estimates.
Stage 2 involves estimating the variance of γ to test the null hypothesis of (6), and not accounting for sample dependency can lead to invalid inference.Let Σ stage 2 = σ 2 d Σ d ; a valid inference can be achieved by using a generalized least squares (GLS) approach when Σ d is known (Aitken, 1936).When Σ d is unknown, feasible GLS (FGLS) (Baltagi, 2008) can be used, with or without iteration, where an estimate of Σ d is obtained, subject to constraints, and then used in GLS.Alternatively, orthogonal-triangular decomposition methods can be used to obtain a compact representation of the profiled log-likelihood, such that maximum likelihood estimates (MLE's) of all parameters can be obtained jointly through nonlinear optimization (Pinheiro and Bates, 2000).
In many scientific settings, including genetic association studies, the sample correlation structure is often specified with constraints on the n(n − 1)/2 correlations, e.g. a single serial correlation ρ for time series or family data with a single relationship type (e.g.twin data), or different cluster-specific correlations ρ's for different clusters.In this case, let T be the Cholesky decomposition, and define The GLS or FGLS regression, in essence, deals with the transformed model in stage 2 where θ = (α, γ T ) T .For a fixed ρ, the conditional MLEs for θ and The MLE of ρ can be obtained by optimizing the profiled log-likelihood, Thus, the generalized Levene's scale gS test of the null hypothesis of ( 6), H 0 : γ = 0, using the regression F-test in stage 2, has the following test statistic: where Under the linear regression model of (4), the F -statistic (8) of testing ( 6) is asymptotically (Arnold, 1980).However, similar to the results of Carroll and Schneider (1985) and Furno (2005) for the original Levene's test, we show that for nonsymmetric ε, this is true only when d is estimated using LAD in the stage 1 regression (Web Appendix A, Theorem 1).pairs.The proposed twin (T W ) test follows Levene's two-stage regression procedure but it makes use of the Huber-White sandwich estimate (White, 1980) of Var( γ 1 ) in the stage 2 analysis (here k = 2 requiring only one dummy variable) to construct an asymptotically χ 2 1 distributed Wald statistic, operationally an F -statistic in finite samples.

The
Complications with the T W test may arise if the number of clusters is small in either group (MZ or DZ) and can be compounded with imbalance between the groups (Iachine et al., 2010).Unfortunately, there is no clear definition of too few clusters (Cameron and Miller, 2013), and empirical type 1 error rates can be inflated for study designs with less than 20 clusters per group, particularly combined with non-symmetric data (see Iachine et al. (2010) and simulation results Section 3 below).
The original T W method assumes that if two observations are from the same pair/cluster they also belong to the same group k.This may not be satisfied in a more general setting like the genetic association studies discussed above.For example, two individuals from the same DZ pair or familial cluster often have different genotypes at a SNP of interest, so individuals from the same cluster may not share a common σ 2 k .However, the sandwich variance estimator can continue to be used in this setting.In the presence of group uncertainty, the T W method can be modified by replacing the group indicator covariate with group probabilities.

Generalized Joint Location-Scale (gJLS) testing
The standard location test of mean differences in an (approximately) normally distributed outcome across covariate values (e.g. the three genotype groups of a SNP in a genetic association study) is testing based on regression model (2).While the location test performs a hypothesis test on the β j 's, the scale test discussed here uses only the β estimates from the stage 1 regression of model ( 2) to obtain d i = |y i − y i | for the stage 2 regression of model ( 4), and it performs a hypothesis test on the γ j 's, testing A joint location-scale (JLS) test is interested in the following global null hypothesis, One simple yet powerful JLS method proposed in Soave et al. ( 2015) uses Fisher's method to combine p L and p S , the p-values of the individual location and scale tests.One can consider other aggregation statistics, e.g. the minimal p-value (Derkach et al., 2013(Derkach et al., , 2014)); for a review of this topic see Owen (2009) and Marozzi (2013).Focusing on Fisher's method, the corresponding test statistic is For independent observations with no group uncertainty, Soave et al. (2015) showed that, under H joint 0 of ( 9) and a Gaussian model, p L and p S are independent.Thus W F is distributed as a χ 2 4 random variable.
In the presence of sample correlation with group uncertainty, we propose to use the same framework but obtain p L from a generalized location test (e.g. a generalized least squares approach to model (2), where the design matrix X includes the group probabilities, and the covariance matrix, Σ stage 1 = σ 2 y Σ y , incorporates the sample correlation), and p S from the gS test proposed here.We show that the assumption of independence between p L and p S continues to hold theoretically under H joint 0 of (9) for normally distributed outcomes (Web Appendix B), as well as empirically for approximately normally distributed outcomes in finite samples (Web Figure 1).

Simulations
The validity of the generalized joint location-scale (gJLS) testing procedure relies on the accuracy of the individual generalized location (gL) test and generalized scale (gS) test com-ponents.The performance of the gL test has been established in the literature, therefore, our simulation studies here focused on evaluation of the proposed gS test, and when appropriate compared it with Levene's original test (Lev) and the T W test of Iachine et al. (2010).We use subscripts OLS and LAD to denote if the stage 1 regression was performed using OLS to obtain group-mean-adjusted residuals or LAD for group-median-adjusted residuals.Implementation We considered two main simulation models.Simulation model 1 followed the exact simulation setup of Iachine et al. (2010) to ensure fair comparison.Simulation model 2 extended model 1 by introducing genotype groups for each individual as well as group membership uncertainty.To apply the original Lev test for comparison, we ignored the inherent sample correlation in the presence of correlated data.In all simulations, empirical type 1 error and power were evaluated at the 5% significance level using 10,000 replicates, unless otherwise stated.
We first generated pairs of observations from independent bivariate normal distributions BV N (0, 1, ρ k ), k = 1, 2, with ρ 1 and ρ 2 corresponding to the correlation within the MZ and DZ twin pairs, respectively.Let w be the variable for an observation, we then applied a transformation g(•) to w to obtain the desired marginal distribution, y = σ k g(w), where the σ k 's induced different variances between the two groups.The choice of g(•) depended on the desired distribution for y: where Φ, F t 4 and F χ 2 4 are the cumulative distribution functions for the standard normal, Students t 4 and χ 2 4 distributions, respectively.
3.1.2Results.We were able to replicate the simulation results of Iachine et al. (2010) that studied Lev OLS , Lev LAD , T W OLS , and T W LAD (Table 1 and Web Table 1).However, we noticed that results reported in their paper for Lev LAD and T W LAD using median-adjusted residuals (labeled as W 50 and T W 50 , columns 9 and 12 of Tables 1-4 2010) that the T W method using the 10% trimmed mean "performed best", therefore, are incorrect and should instead refer to T W LAD using median-adjusted residuals from the stage 1 regression.
Our results in Table 1 clearly show that • In the presence of sample correlation, Levene's original method Lev that ignores the correlation had severely increased type 1 error rate, even with Gaussian data.That is, T W and gS performed better than Lev.
• When the error structure was non-symmetric (χ 2 4 ) or the group sizes were small (e.g. n 1 or n 2 less than 20), using OLS in the stage 1 regression for either T W or gS led to increased type 1 error.That is, T W LAD and gS LAD performed better than T W OLS and gS OLS , respectively.
• When the group sizes were unbalanced and small (e.g.n 1 = 10, n 2 = 20), T W LAD had increased type 1 error, even with Gaussian data.That is, gS LAD performed better than In large samples, the original Lev test remained too optimistic, with an empirical α of 0.097 when n 1 = n 2 = 2000 with Gaussian data (Web Table 1).Consider a SNP of interest with minor allele frequency (MAF) of q (= 0.2 or 0.1), we first simulated genotypes for n/2 (= 20, 50, 100, 500 or 1000) pairs of siblings.To account for the inherent correlation of genotypes between a pair of siblings, we started with drawing the number of alleles shared identical by decent (IBD), D = 0, 1 or 2, from a multinomial distribution with parameters (0.25, 0.5, 0.25), independently for each sib-pair.Given the IBD status D, we then simulated paired genotypes (G 1 , G 2 ) = (i, j), i, j ∈ {0, 1, 2}, following the known conditional distribution of {(G 1 , G 2 )|D} (Thompson, 1975;Sun, 2012).The distribution depends on q in a way that smaller q leads to greater imbalance in the genotype group sizes.Approximately, the distribution of the numbers of individuals with genotype G = 0, 1 and 2 is proportional to (1 − q) 2 , 2q(1 − q) and q 2 , respectively.
To introduce group membership uncertainty, we converted the simulated true genotypes G's to probabilistic data X's using a Dirichlet distribution.We used scale parameters a for the correct genotype category and (1 − a)/2 for the other two; this error model was used previously by Acar and Sun (2013) to study location tests in the presence of genotype group uncertainty.We varied a from 1 to 0.5, where a = 1 corresponds to no genotype uncertainty and a = 0.5 implies that, on average, 50% of the "best-guess" genotypes correspond to the true genotype groups.Thus, the genotype group uncertainty level ranged from 0% to 50% in our simulations.
We then simulated outcome data for each sib-pair in a fashion similar to simulation model 1.For each of the n/2 sib-pairs, we first simulated paired data from BV N (0, 1, ρ), where ρ = 0.5 was the within sib-pair correlation.For each simulated value w, we then applied the σ k g(w) transformation to obtain the desired outcome data y as in simulation model 1 (Gaussian, Student's t 4 , and χ 2 4 ).However, k here refers to the corresponding true underlying genotype group of an individual, and two individuals from the same sib-pair might not have the same genotype.We used (σ 2 0 , σ 2 1 , σ 2 2 ) = (1, 1, 1) to study type 1 error control, and (1, 1.5, 2) or (1, 2, 4) to study power; other values such as (2, 1.5, 1) and (4, 2, 1) were also considered.
It is evident from the results of simulation model 1 that the original Lev test is not valid in the presence of sample correlation, and T W OLS and gS OLS are inferior, respectively, to T W LAD and gS LAD , when the error structure is non-symmetric or the group sizes are small.
Therefore, the results presented below focus on comparison between T W LAD and gS LAD .In the presence of genotype group uncertainty, we also considered the "best-guess" approach and used T W BG LAD and gS BG LAD to represent the corresponding results.

3.2.2
Results.In the presence of sample correlation but with no group uncertainty, the results in Table 2 show that both T W LAD and gS LAD were accurate in large samples, e.g.
when sample size was 2000 (n/2 = 1000 sib-pairs).However, T W LAD had increased type 1 error when group sizes were unbalanced and relatively small, even for Gaussian data.For example, when the MAF is q = 0.2 and the number of sib-pairs is n/2 = 100, the expected sizes of the three genotype group sizes are n * ((1 − q) 2 , 2q(1 − q), q 2 ) = (128, 64, 8).In that case, the empirical type 1 error of T W LAD was 0.060, 0.072 and 0.078 for Gaussian, t 4 and χ 2 4 data, respectively.The problem was exacerbated by a smaller MAF q = 0.1 with empirical type 1 error levels of 0.092, 0.115 and 0.118, respectively for the three types of data.In contrast, the proposed gS LAD test remained accurate in most cases and was slightly conservative in small samples, when n/2 < 100.
Results in Table 3 are characteristically similar to those of Table 2.However, we note that group uncertainty somewhat mitigates the problem of unbalanced group sizes, and consequently the accuracy issue of T W LAD .Nevertheless, it is clear that gS LAD had better type 1 error control than T W LAD across the MAF values and the three outcome distributions.
As expected, T W BG LAD and gS BG LAD using the "best-guess" genotype group have similar type 1 error control to T W LAD and gS LAD incorporating the group probabilistic data, under the null hypothesis (Table 3 and Web Tables 2 and 3).
Focusing on the accurate gS LAD test, Table 4 and Figure 1 demonstrate the gain in power when incorporating the group probabilistic data into the inference (gS LAD ) as compared to using the "best-guess" group (gS BG LAD ).For example, at the 30% group uncertainty level with sample size of 1000 (n/2 = 500 sib-pairs), MAF of 0.1 and under Gaussian data, the power of gS LAD was 0.613, a 23% increase over the power of 0.495 observed for gS BG LAD ; a similar gain in efficiency was observed for other sample sizes, MAF, and with t 4 and χ 4 2 data (Table 4).
One would expect the relative efficiency gain to increase as uncertainty level increases.
However, this is true only if the uncertainty level is not too high.Depending on the model used to induce group uncertainty and the heteroscedasticity alternatives, it is reasonable to assume that the absolute power eventually converges to the type 1 error as the uncertainty increases.Consequently, the gain in relative efficiency of gS LAD compared to gS BG LAD would also diminish and converge to 1.This is consistent with results in Figure 1.

Applications
To demonstrate the utility of the proposed generalized scale (gS) test and subsequent generalized joint location-scale (gJLS) test, we revisited the two genetic association studies considered in Soave et al. (2015), and compared our results with those using only a sample of unrelated individuals with no genotype group uncertainties.We also used application data combined with simulation methods to further empirically validate the performance of the proposed methods.

HbA1c Levels in Subjects with Type 1 Diabetes
We use this application to demonstrate the gain in power by incorporating group uncertainty (probabilistic) data.Details of this dataset were previously reported in Soave et al. (2015).
Briefly, the outcome of interest was inverse normal transformed HbA1c levels in n = 1304 unrelated subjects with type 1 diabetes, and the SNP of interest was rs1358030 near SORCS1 on chromosome 10 with MAF of 0.36.With no sample correlation or group uncertainty, the original Lev test for variance heterogeneity was applied and resulted in a significant result with p = 0.01 (Soave et al., 2015).Combined with other evidence reported in Paterson et al. (2010), we assume here that the association is real and smaller p-values implies greater power.
To demonstrate the effect of genotype group uncertainty, we masked the true genotypes using the same Dirichlet distribution as in the simulation studies above, where the value of a ranged from 1 to 0.5, corresponding to no group uncertainty to 50% uncertainty.We then applied gS BG LAD to the "best-guess" genotype data and the proposed gS LAD incorporating the probabilistic data, and obtained the corresponding p-values, p gS BG LAD and p gS LAD .For a given uncertainty level, we repeated the masking process independently 1,000 times and obtained averaged p-values on the log10 scale (10 {average of log10(p)} ), p gS BG LAD and p gS LAD .Between the two methods, it was clear that gS LAD was more efficient than gS BG LAD .For example, when a = 0.75 for 25% group uncertainty, the gS LAD test remains significant with p gS LAD = 0.048 as compared to p gS BG LAD = 0.068.Regardless of the method used, the power of the scale tests decreased sharply as genotype uncertainty increased, consistent with those for location tests reported in Acar and Sun (2013), where location tests incorporating group uncertainty were compared with the "best-guess" approach.

Lung Disease in Subjects with Cystic Fibrosis
We used this application to demonstrate the gain in power by incorporating all available subjects, including relatives.We also used this dataset combined with permutation methods to further demonstrate the validity of the proposed methods.Details of this dataset were previously reported in Soave et al. (2015).Briefly, the outcome of interest was lung function as measured by the normally distributed SaKnorm quantity (Taylor et al., 2011) in a total of n all = 1507 individuals with CF, among which 1313 were singletons, 188 from 94 sib-pairs, and 6 from 2 sib-trios.In total, 8 SNPs from 3 genes (SLC26A9, SLC9A3 and SLC6A14 ) were analyzed based on association evidence for other CF-related outcomes reported in Sun et al. (2012) and Li et al. (2014).
Focusing on the n indep = 1313 + 94 + 2 = 1409 unrelated individuals, Soave et al. (2015) analyzed the association between lung function and each of the 8 SNPs using the individual location test and scale test, as well as the joint location-scale (JLS) test.They reported that SNPs from SLC9A3 and SLC6A14 were associated with CF lung functions (Table 5).
The number of omitted subjects in that analysis was small (n omit = 94 + 2 * 2 = 98) and consequently the expected loss of efficiency is anticipated to be small.Nevertheless, we re-analyzed the data available from the whole sample of n all = 1507 individuals, using the individual generalized location (gL) test, the proposed generalized scale (gS) test, and the subsequent generalized joint location-scale (gJLS) test (Table 5).We used a compound symmetric correlation structure (a single correlation parameter ρ) to model within family dependence for each application of GLS regression.
We first note that the conclusions for the presumed null SNPs from SLC26A9 did not change, as desired.The conclusions for the presumed associated SNPs from SLC9A3 and SLC6A14 did not change either, but using all available data led to smaller p-values for the gL test.The lack of apparent efficiency gain for gS was somewhat disappointing, but it was also expected given the few number of siblings added to the sample; see Discussion Section 5 for additional comments.Lastly, we note that the JLS framework indeed yields increased power when aggregating evidence from the individual tests; see Soave et al. (2015) for detailed discussions of the motivation and merits of the joint-testing framework.
To further exam the accuracy of the proposed gS and gJLS tests (as well as the gL test for completeness), we generated 10,000 permutation replicates of the outcome to assess the empirical type 1 error control; permutation was performed separately between singletons and between sib-pairs; see Abney (2015) for permutation techniques for more general family data.Without loss of generality, we focused on SNP rs17563161 from SLC9A3 (Web Figure 1).Testing the resulting p-values for deviation from the expected Uniform(0,1) distribution using the Kolmogorov-Smirnov test showed that all tests were valid.Additional simulations for inducing genotype group uncertainty led to the same conclusion (results not shown).

Discussion
Levene's scale test is widely used as a model diagnostic tool in linear regression, and more recently it has been employed as an indirect test for interaction effects.Increased data complexity due to sample correlation or group uncertainty, however, limits its applicability.
Here we proposed a generalization of Levene's scale test, gS, that has good type 1 error control in the presence of sample correlation, small samples, unbalanced group sizes, and non-symmetric outcome data.We showed that the least absolute deviation (LAD) regression approach to obtain group-median-adjusted residuals is needed to ensure robust performance of gS.Based on our results, we recommend the use of gS LAD over gS OLS (and other existing tests) uniformly for all study analyses.
In the presence of group membership uncertainty, gS incorporating the probabilistic data increases power compared to using the "best-guess" group data.However, based on the simulations considered here, we note that when the group uncertainty level is moderate (e.g.30%), the efficiency gain is also moderate (Table 4 and Figure 1).When the group uncertainty is too high, the relative efficiency gain may diminish because the absolute power decreases considerably and eventually converges to the the type 1 error rate.
In the presence of sample correlation, the original Lev test is inadequate due to inflated type 1 error.Using a subset of only unrelated individuals would improve the accuracy of Lev but at a cost to the power.The size of the efficiency loss depends on the proportion omitted from the sample as well as the dependency structure.The T W method of Iachine et al.
(2010) extends the Lev test for twin data.Their simulation study as well as ours showed that T W has an increased type 1 error rate when group sizes are unbalanced and relatively small, in contrast to the proposed gS.When all group sizes were large, gS and T W were empirically equivalent.
In the CF application, although the gS test yielded comparable or less significant results after reincorporating siblings in the analysis, we observed that the corresponding gL test results were more significant.We considered the possibility that even though scale differences existed in the data, the addition of only 98 siblings (7% increase from the independent sample) may not yield a noticeable improvement in power of the gS test.Using the setup of simulation model 1, we examined the effect of incorporating only a small proportion of additional related subjects to an otherwise independent sample (Web Table 4).We found that, compared with using a sample of 1000 singletons, using a sample of n = 900 singletons along with 100 sib-pairs (10% increase) led to a < 5% power increase.In contrast, the addition of siblings to all unrelated subjects provided a substantial increase in power (Web Table 4).
These results, and the noticeable power gain from the gL location test when applied to the same CF data, are consistent with previous observations in genetic association studies that, larger samples are needed to detect variance differences as compared to mean differences (Visscher and Posthuma, 2010;Yang et al., 2011).
The examination of the proposed gS here focused on SNP genotype categories.The socalled "additive" coding of the genotype data can be used in practice.That is, replacing the two dummy variables, x 1 and x 2 , with one continuous variable coded as x = 0, 1 or 2, if there is no group uncertainty; or replacing the two probabilistic variables, x 1 and x 2 , with an expected count (the so-called "dosage"), x = p 1 + 2 * p 2 .If the underlying model is truly additive, this model specification will lead to a more powerful test.However, the additivity assumption is often used only for testing the location parameters in genetic association studies.

The expression of E(d
in Section 2.2 suggests that the stage 2 regression of (4) could be improved by rescaling the d i 's by (1 − h ii ) −1/2 .This adjustment has been shown to improve the type 1 error control of Levene's original test for small samples with group design imbalance (Keyes and Levy, 1997) (independent observations with no group uncertainty implies h ii = 1/n j , where n j is the sample size of the group to which the ith observation belongs).Examination of this rescaling for gS under simulations involving correlated data, however, led to instances of increased type 1 error (results not shown).Thus, further investigation is required to propose an appropriate adjustment.Another potential improvement to the analysis of regression model ( 4) is from the recognition that the d i 's are folded normals and are in fact slightly correlated through correlation between the estimated residuals, ε i 's, even when there is no sample correlation among the true disturbances, ε i 's.
O 'Neill and Mathews (2000) derived expressions for the covariance matrix of d for independent observations with no group uncertainty, showing that the correlation across the d i 's disappears as the group sizes increase.For the complex data scenarios considered here, gS LAD appears robust for even small samples.Nevertheless, the potential for gain in efficiency by accounting for this type of correlation merits additional consideration.
The developments here did not consider additional covariates, z, e.g.age and sex in genetic association studies.The extension is straightforward if the effects of z on y are strictly on the mean.In that case, including z as part of the design matrix in the stage 1 regression of (2) suffices.However, if z also influences the variance of y, not including z as part of the design matrix in the stage 2 regression of (4) may lead to increased type 1 error of testing the γ j 's that are associated with the primary covariates of interest.This is the same phenomenon observed in location-testing where omitting potential confounders can lead to spurious association.
Joint location-scale testing is becoming a popular method for complex outcome-covariate association data, where the conventional location-only analyses may be underpowered.This scenario has received attention in many fields ranging from economics to climate dynamics Marozzi (2013), in addition to our motivating example of genetic epidemiology (Soave et al., 2015).The proposed gS test allows investigators to combine evidence from scale tests with existing generalized location tests via the JLS testing framework of (Soave et al., 2015), previously proposed for independent samples without group membership uncertainty.The CF application study showed that individual location or scale tests can provide more significant results when utilizing related individuals, which in turn may lead to a more powerful gJLS test.

Supplementary Materials
Web Appendix Sections A, B and C, Web Figure 1, Web Tables 1-4, and R-code description for data analysis referenced in Sections 2, 3, 4, and 5 are available below in this document.4 under Gaussian data.n/2 for the number of sib-pairs, ρ = 0.5 for the within-pair correlations, and q = 0.1 or 0.2 for the minor allele frequency (MAF) of the SNP of interest; on average the expected sizes of the three genotype groups are n(1 − q) 2 , n2q(1 − q) and nq 2 .Without loss of generality, σ 2 0 = σ 2 1 = σ 2 2 = 1 for type 1 error rate evaluation.The empirical type 1 error was estimated from 10,000 simulated replicates at the nominal 5% level.Godfrey, 1996;Im, 2000;Machado and Silva, 2000;Furno, 2005;Carroll and Schneider, 1985).In general, stage 2 tests involving d i from stage 1 OLS regression are only asymptotically correct when the y i follow a symmetric distribution.Conversely, stage 2 tests involving d i from stage 1 LAD regression will be asymptotically correct regardless of the shape of the error distribution (Furno, 2005;Carroll and Schneider, 1985).We demonstrate here that these results hold for the proposed gS test when using the F -statistic in the stage 2 generalized least squares regression to account for possible data dependence due to clusters/family relationships.
In keeping with the notation from the main text, we use only a single subscript (e.g.3)).Also note that the f subscript of σ 2 f defines the variance parameter associated with f conditional on C and X.We assume that we are able estimate β, and that n 1/2 ( β − β) has an asymptotic multivariate normal distribution with finite covariance matrix; this would be true for either OLS or LAD stage 1 regression estimates of β.
We consider an F -test based on d * i , which are the elements of , We first note that, in the absence of estimation effects, (k − 1)F (f * ) is asymptotically distributed as chi-squared with k − 1 degrees of freedom (see Arnold (1980)).To obtain the asymptotic distribution of F (d * ) we need to assume that (x i , ε i ) satisfy regularity conditions necessary for the application of standard asymptotic theory.Specifically, we assume Similar to the assertion of Carroll and Schneider (1985), we have that M SE(d * ) We can also rewrite the numerator of F n (d * ) as where (z * i ) T are the row vectors of Z * , and z * i are the predicted values from regressing the column(s) of Z * on 1 * .Note that 1 * will not, in general, be a vector or 1's, and will only be constant when all observations come from equal sized clusters with an identical symmetric correlation structure.
Let t = 1, ..., T and s = 1, ..., S index the variables associated with the residuals, ε i , that take positive and negative values, respectively.Correspondingly, T and S are the number of residuals that take positive and negative values, respectively.Let c i,j represent the element for the ith row and jth column of the matrix C −1 , and let [−] i denote the ith element from the vector with the square brackets.
Finally, we make the additional assumption, This assumption follows from Assumption 1, that x i and ε i are uncorrelated, and implies Now, applying the results of Lemma 1 from Im (2000), we propose the following lemma with proof similar to Im (2000) Lemma 3, Proof.

LHS
If the final equation becomes negligible as n becomes large, we can say that the absolute residuals, d, may replace the true disturbances, f , where and consequently This, in general, will only be the case when (T − S)/n = o p (1), which will occur for stage 1 OLS regression only if the error distribution is symmetric.On the other hand, (T − S)/n = o p (1) is ensured for stage 1 LAD regression as minimization of the sum of absolute residuals must equate the number of positive and negative residuals (Koenker and Hallock, 2001).
Note that Lemma 1 and its proof reduce to the same result of Im (2000) Lemma 3 for the case where Σ d = I (independent errors).
This leads to the following result.
Theorem 1.Under the previously stated conditions, if (T − S)/n = o p (1), Web Appendix B: Independence between the generalized location (gL) and

scale (gS) test statistics
To test for a location (mean) effect of a set of (k − 1) covariates, including an intercept, on a quantitative outcome y i with known correlation among the y i 's, we can obtain the F -statistic from a generalized least squares regression of the linear model, where the design matrix, X, represents the stacking of the x T i 's, and Σ stage1 represents the covariance matrix for y.We include indexes for families (i = 1, . . ., M ) and members within families (j = 1, . . ., n i ) to specify models that allows for within group/family dependence of observations as would be encountered in sibling or pedigree data.Under the assumption of homoscedasticity, Σ stage1 can be written as σ 2 y Σ y , where the correlation matrix, Σ y , specifies relationships between family/cluster members.
Let Σ y = C y C T y be the Cholesky decomposition, and for fixed Σ y , we use the following transformations, Thus, we can rewrite (1-1) as a classical linear model, Thus, the generalized location (gL) test statistic (T Location ) is simply the F -statistic from an ordinary least squares regression of y * on X * , which can be written as , where y * i = 1 * i β 0 and y * i = x * T i β all , are the predicted values from the regressions of y * on the first column of X * (yielding β 0 ) and all columns of X * (yielding β all ), respectively.
Without loss of generality, we assume the analysis of genetic data involving related subjects (family data) with an additive genetic model where y i is the quantitative trait under investigation and x T i = (1, x i ), where x i = 0, 1 or 2, reflects the number of minor alleles at a genetic variant under analysis, or the genotype dosage estimate incorporating uncertainty.
Thus, the density of y is an exponential family with three parameters θ are complete.Note that 1 * i and x * i are obtained from X * , and y * i from y * defined above.
Let T Scale be the proposed gS test statistic (see Section 2.3, equation ( 8)) analyzing the same covariate vector specified for T Location .Now we have the following Lemma.
Lemma 2: For the conditional normal model y ∼ N (Xβ, σ 2 y Σ y ) with fixed Σ y , T Location and T Scale are independent.
Proof.We showed that if Σ stage2 = σ 2 d Σ d , with Σ d fixed, and the covariate design vector For the different parameter combinations of simulation model 2, we considered two measures of central tendency from stage 1 of the procedure: the predicted values from OLS and LAD regression.All testing procedures were implemented in R as just described, while using the rq() function from the "quantreg" package to perform stage 1 LAD regression.[R Core Team (2016).R: A language and environment for statistical computing.R Foundation for Statistical Computing, Vienna, Austria.URL https://www.R-project.org/.]TW: Implementing the cluster-robust stage 2 regression in R Numerous modifications have been proposed for various types of cluster-designed data sets (Cameron and Miller, 2013).Iachine et al. (2010) chose to implement T W with the STATA statistical software command regress using the cluster() option to incorporate the cluster information in a standard error adjustment.This method of cluster robust inference can be implemented with the R software as follows: First, fit the desired panel data model (e.g. d ∼ X for the stage 2 regression; equation (4) of Section 2.2) using the plm() function from the "plm" package, with option index="cluster.id".Next, estimate the cluster robust covariance matrix using the vcovHC() function from the "sandwich" package, with options type= "HC0" and adjust = T .Next, multiply the covariance matrix by the following adjustment factor: where M is the total number of clusters and n − k is the residual degrees of freedom from the plm model.Next, perform the coefficient test using the waldtest() function from the "lmtest" package, setting the vcov option equal to the new covariance matrix and the test option equal to "F ".Finally, the resulting F -statistic is then compared to an F -distribution with (k-1) and (M -1) degrees of freedom to obtain the p-value of the T W test. (Note that the default degrees of freedom for waldtest() will be (k-1) and (N -1).) Web Tables and Figure Iachine et al. (2010)  Scale Test for Twin Pairs and Modifications Focusing on paired-observations, Iachine et al. (2010) extended Levene's test to determine if the variance of an outcome differs between monozygotic (MZ) and dizygotic (DZ) twin Model 1 3.1.1Model Setup.Following the exact simulation study design of Iachine et al. (2010), we simulated correlated outcome values for n 1 MZ twin pairs and n 2 DZ twin pairs, n = 2n 1 + 2n 2 , and we tested if the variance of the outcome differed between the two groups of pairs in Iachine et al. (2010))    were mistakenly replaced by the Lev and T W results obtained using 10% trimmed meanadjusted residuals (labeled as W 10 and T W 10 inIachine et al. (2010)).Subsequent conclusions inIachine et al. (

Figure 1 .
Figure 1.Relative efficiency of gS BG LAD and gSLAD under simulation model 2. gS BG LAD denotes gSLAD

i
= 1, . . ., n) to represent individuals across the sample and rely on the covariance matrix, Σ stage 2 = σ 2 d Σ d , to specify relationships between family/cluster members.Note that we are assuming homoscedasticity across the d i 's and thus Σ d = CC T describes the Cholesky decomposition of the correlation across d.Let f i = y i − x i T β , represent the absolute true disturbances from the stage 1 model (2), where β represents the true central tendency parameter vector being estimated (i.e.mean or median).For fixed C, let f * = C −1 f , such that for the model f * = α1 * +Z * γ +e * , E(e * ) = 0 and V ar(e * ) = σ 2 f I.Note that 1 * = C −1 1 and Z * = C −1 Z are a partitioned transformation of the gS stage 1 regression design matrix (i.e.X = (1, Z) from equation ( the predicted values from the regression equation (7), and d * i = 1 * i α, the predicted values from the regression of d * on 1 * .
then T Scale is asymptotically distributed as chi-squared with k − 1 degrees of freedom (Theorem 1, Web Appendix A), and it does not depend on θ (i.e.T Scale is ancillary for θ).Thus, T Scale is independent of T (see page 152 inLehmann and Romano (2005)).Because T Location is a function of T , T Location and T Scale are therefore independent.Web Appendix C: Implementing gS, T W , and Lev tests Under the simulation model 1 setup, we compared the performance of three tests: Levene's original test procedure (using two measures of central tendency: the sample means and sample medians, implemented in R using the "lawstat" package), the T W test proposed byIachine et al. (2010)  (using the same two measures of central tendency, with stage 2 regression implemented in R (see below) to mimic the results of the regress, cluster() command implemented in STATA), and our proposed gS test (using the same two measures of central tendency, with stage 2 regression implemented in R using the gls() function in the "nlme" package).

Table 1
Type 1 error evaluation under simulation model 1.Six different tests were evaluated, including the original Levene's test, Lev, the twin test ofIachine et al. (2010), T W , and the proposed generalized scale test, gS, with subscripts OLS and LAD denoting whether the stage 1 regression was performed using OLS or LAD.Parameter values included n1 and n2 for the number of MZ and DZ twin pairs, respectively, and ρ1 = 0.75 and ρ2 = 0.5 for the corresponding within-pair correlations.Without loss of generality, σ 2 1 = σ 2 2 = 1 for type 1 error rate evaluation.The empirical type 1 error was estimated from 10,000 simulated replicates at the nominal 5% level.Type 1 error evaluation under simulation model 2 without group uncertainty.Parameter values included Type 1 error evaluation under simulation model 2 with 30% group uncertainty.Superscript BG denotes T WLAD and gSLAD applied to the "best-guess" genotype data.The true genotype data were masked using a Dirichlet distribution for the genotype probabilities with scale parameters a for the correct genotype and (1 − a)/2 for the other two.On average, a = 0.7 corresponds to 30% group uncertainty level.See legend of Table2for additional simulation details.Power of gS BG LAD and gSLAD under simulation model 2 with 30% group uncertainty.gS BG LAD denotes gSLAD applied to the "best-guess" genotype data.The true genotypes were masked using a Dirichlet distribution for the genotype probabilities with scale parameters a for the correct genotype and (1 − a)/2 for the other two.On average, a = 0.7 corresponds to 30% group uncertainty.Besides the parameters shown in the table, other values include Application study of lung function in patients with cystic fibrosis.There were 1313 singletons, 94 sib-pairs and 2 sib-trios in the whole sample, resulting in n indep = 1313 + 94 + 2 = 1409 unrelated individuals, and n all = 1313 + 94 * 2 + 2 * 3 = 1507 individuals.Results for n indep were from Soave et al. (2015), where the standard regression Location test, Levene's scale test and the JLS joint location-scale test were used.Results for n all were obtained from the corresponding generalized tests, with LAD used for the stage-1 regression for the gS test.Web-based Supplementary Materials for "A Generalized Levene's Scale Test for Variance Heterogeneity in the Presence of Sample Correlation and

Table 1
Type 1 error evaluation under simulation model 1 (large samples).Six different tests were evaluated, including the original Levene's test, Lev, the twin test ofIachine et al. (2010), T W , and the proposed generalized scale test, gS, with subscripts OLS and LAD denoting whether the stage 1 regression was performed using OLS or LAD.Parameter values include n1 and n2 for the number of MZ and DZ twin pairs, respectively, and ρ1 = 0.75 and ρ2 = 0.5 for the corresponding within-pair correlations.Without loss of generality, σ 2 1 = σ 2 2 = 1 for type 1 error rate evaluation.The empirical type 1 error was estimated from 10,000 simulated replicates at the nominal 5% level.Type 1 error evaluation under simulation model 2 with 10% group uncertainty.Superscript BG denotes T WLAD and gSLAD being applied to the "best-guess" genotype data.The true genotype data were masked using a Dirichlet distribution for the genotype probabilities with scale parameters a for the correct genotype and (1 − a)/2 for the other two.On average a = 0.9 corresponds to 10% group uncertainty.See legend of Table2(main text) for additional simulation details.Type 1 error evaluation under simulation model 2 with 20% group uncertainty.Superscript BG denotes T WLAD and gSLAD being applied to the "best-guess" genotype data.The true genotype data were masked using a Dirichlet distribution for the genotype probabilities with scale parameters a for the correct genotype and (1 − a)/2 for the other two.On average a = 0.8 corresponds to 20% group uncertainty.See legend of Table2(main text) for additional simulation details.