Analysis of covariance under variance heteroscedasticity in general factorial designs

Adjusting for baseline values and covariates is a recurrent statistical problem in medical science. In particular, variance heteroscedasticity is non‐negligible in experimental designs and ignoring it might result in false conclusions. Approximate inference methods are developed to test null hypotheses formulated in terms of adjusted treatment effects and regression parameters in general analysis of covariance designs with arbitrary numbers of factors. Variance homoscedasticity is not assumed. The distributions of the test statistics are approximated using Box‐type approximation methods. Extensive simulation studies show that the procedures are particularly suitable when sample sizes are rather small. A real data set illustrates the application of the methods.

variable, but also other variables called covariates. For instance, baseline values, age, gender, and/or body weight might obscure the factor effects of the independent groups. Analysis of covariance (ANCOVA) is a general linear model that blends analysis of variance (ANOVA) and regression methods, and hence makes estimation of (adjusted) treatment effects as well as testing hypotheses formulated in terms of the (adjusted) treatment effects in such designs possible. 1 However, the performance of the methods (F-tests) with respect to maintaining the nominal type-1 error rate heavily depends on whether the data fulfill model assumptions, such as multivariate normality and homogeneous variances of the error term. Indeed, if the data are not in line with them, the methods tend to be liberal or conservative, depending on the amount of variance heteroscedasticity, sample size allocations, and general distributional shapes (see Section 5 for details). In statistical practice, however, verifying such assumptions is quasi impossible, especially when being confronted with small sample sizes. 2,3 We therefore prefer to use less stringent statistical methods for making inferences. The present article aims to introduce statistical inference methods for general ANCOVA designs with arbitrary (but fixed) numbers of factors and covariates without assuming multivariate normality and homoscedastic variances.
In recent years, several authors have proposed different methods to tackle the problem of variance heteroscedasticity in general ANOVA designs (without covariates). Pauly et al 4 provide a detailed overview of the different solutions and highlight the so-called Wald-type statistic (along with a permutation version) as well as the ANOVA-type statistic developed by Brunner et al. 5 In comparison, the Wald-type statistic is applicable for large samples only (n i ≈ 50), while the ANOVA-type statistic controls the type-1 error rate accurately even when sample sizes are rather small (n i ≈ 10). However, both of the statistics found their way into statistical practice and are implemented in prominent software packages like SAS PROC MIXED and in the R-package GFD. 6 For the analysis of general ANCOVA designs, Zimmermann et al 7 propose a Wald-type test along with a resampling version (wild-bootstrap approach) using Heteroscedasticity Consistent Standard Error (HCSE) estimators. [8][9][10][11] The methods even allow for complete variance heteroscedasticity, that is, every unit might have a different variance. Even though resampling methods share the advantages of being suitable methods particularly for small sample sizes, they might be numerically cumbersome and show their limitations in statistical planning purposes, in general. Statistical models that allow for complete heteroscedasticity might be very flexible and less stringent, however, in medicine and related sciences, reporting groupwise variance estimates is necessary, especially in translational research. We therefore focus on groupwise variance heteroscedasticity and discuss methods for complete heteroscedasticity as an aside. SAS PROC MIXED realizes an ANOVA-type statistic in such general ANCOVA designs by replacing the empirical means and variance estimators as used by Brunner et al 5 in ANOVA designs with estimators obtained from Minimum Variance Quadratic Unbiased Estimation (MIVQUE0) 12 algorithms in the general ANCOVA framework (see https://support.sas.com/documentation/onlinedoc/stat/141/mixed.pdf). The supporting site says "This generalizes results in the appendix of Brunner, Dette, and Munk (1997) to a broader class of models," see https://support.sas.com/kb/24/516.html. The results, however, are only available in SAS PROC MIXED. We investigate whether the methods maintain the nominal type-1 error rate (and power) in extensive simulation studies and find that the methods make accurate inferences in fixed effects, while the tests for the covariate effects tend to be very liberal. In this article, we will therefore generalize the results and propose novel test statistics. We develop unbiased estimators of the variance components using methods of moments and approximate the distributions of the test statistics using Box-type approximation methods. 13,14 Numerical investigations and extensive simulation studies show that (1) the tests for the fixed effects are very similar and of similar quality as their siblings in SAS and control the type-1 error rate accurately, while (2) the new approximation for testing the effects of the covariates outperforms its competitor in SAS. Especially for skewed data, it turns out the ANOVA-type statistics seem to control the type-1 error rate better than the wild-bootstrap test. 7 We note that HCSE estimators are also available in SAS PROC MIXED using the EMPIRICAL option.
The remainder of the article is organized as follows. In Section 2, a factorial toxicological and carcinogenic study is discussed. The statistical model, point estimators, and hypotheses of interest are introduced in Section 3. In the following Section 4, procedures for testing the aforementioned hypotheses are explained. Their behavior in small sample size situations is investigated in extensive simulation studies in Section 5. The article closes with the evaluation of the data in Section 6 and a discussion about the results in Section 7. Technical derivations are provided in the Appendix.
Throughout the article we use the following notation. 1 n denotes the n × 1 vector of 1's and I a the a × a unit matrix. The a × a centering matrix is P a = I a − 1 a 1 a 1 ′ a . Furthermore, A ⨁ B and A ⊗ B denote the direct sum (block operator) and the Kronecker product of the matrices A and B, respectively.

A MOTIVATING EXAMPLE
As a motivating example, we consider a part of the toxicological study 15 on pyridine (number C55301B) obtained from the US National Toxicology Program (see https://manticore.niehs.nih.gov/cebssearch/test_article/110-86-1; assessed November 2020). This animal testing aims to investigate the impact of pyridine (a basic heterocyclic organic compound) on different clinical chemistry parameters. Here, we chose blood urea nitrogen (BUN) measured in mg/dL as the response variable. Initially, N = 120 rats (60 male and 60 female) were randomized to six different dose levels of pyridine (0, 50, 100, 250, 500, or 1000 ppm) (n ij = 10 animals per dose) and BUN was measured at baseline and after 90 days. In addition, we also chose the change in body weight (in g) as a covariate. The data can thus be described in means of a two-way factorial design involving the factors gender with two levels and the factor dose with six levels with two covariates. We aim to estimate treatment effects of pyridine adjusted for baseline and change in body weight and to test whether the two factors and their interaction impact the results. We note that one male and two females in dose level 1000 died after the treatment, which makes the design imbalanced. Boxplots of BUN stratified by gender and pyridine concentration levels are displayed in Figure 1. It can be seen from Figure 1 that the factor gender seems to have an impact on BUN for each pyridine concentration. It also appears that the effect of the pyridine concentration is homogeneous across gender (ie, no interaction effect). Since the sample sizes of the study are rather small, making any distributional assumptions (eg, normality) would be doubtful. The boxplots in Figure 1 indicate that the BUN data are not necessarily symmetrically distributed. Moreover, the empirical variances of each factor-level combinations of the two factors are rather different and thus, assuming variance homoscedasticity is unlikely. Therefore, we will analyze this factorial study without either assuming any specific distribution or homogeneous variances of the data using the methodologies presented in this article. First, a general ANCOVA model, point estimators, and hypotheses of interest will be introduced in the next section.

STATISTICAL MODEL, HYPOTHESES, AND POINT ESTIMATORS
We consider a general ANCOVA model where Y ij and M ( ) ij denote the response and th covariate value from subject j under condition (treatment) i, b i represents the fixed treatment effect of condition i, p the th regression parameter, and ij the error term with E( i1 ) = 0 and Var( i1 ) = 2 i , i = 1, … , a. In total, N = ∑ a i=1 n i subjects are enrolled in the trial, where n i denotes the number of subjects in group i ∈ {1, … , a}. We note that two-and higher-way layouts can be modeled by subindexing the index i. In matrix notation, the model can be written as where Y denotes the N × 1 response vector, X = ⨁ a i=1 1 n i the design matrix for the vector of fixed treatment effects respectively. We furthermore assume the existence of fourth moments, that is, E( 4 ij ) < ∞. Now, the aim of ANCOVA is estimating and making inference across the components of the vectors b and p, that is, testing the null hypotheses Which hypothesis matrix H to chose depends on the actual model and research question of interest. For example, in a one-way design with a levels, H = P a . In a two-way design involving a factor A with a levels and a factor B with b levels the null hypotheses of no main effect A, no main-effect B and no-interaction (AB) between A and B are tested using the contrast matrices as known from linear model theory. Here, b ij denotes the effect of cell (i, j). For more details we refer to, for example, Pauly et al. 4 We estimate the unknown model parameters b and p using methods of least squares and obtain Here, P = X(X ′ X) −1 X ′ and Q = I − P denote the projection matrices and * denotes a suitable regular matrix, for example, ordinary least squares (OLS) estimators are obtained using * = I N and generalized least squares (GLS) estimators using if it is known, or a consistent estimator thereof, respectively. Different estimation methods of will be discussed in the next subsection.
Remark 1. The general model can be generalized to model interactions between the treatment effect and covariates. For instance, we aim to model an interaction term between group a and the th covariate. In this case the regression part Mp is kept in the model as above whereas the fixed treatment effect Xb is written as a regression model for qualitative predictors by Here, X ⊙ denotes the adjusted design matrix, b ⊙ = (b 1 , … , b a−1 ) ′ the vector of the treatment effects of groups 1, … , a − 1 and b a denotes the interaction effect of treatment a and the th covariate, respectively.

Estimation of the variances
For the ease of representation, we rewrite the estimatorsb andp using appropriate generating matrices asb = DY andp = AY, respectively. For large sample sizes, that is, if N → ∞ such that N∕n i → i (and under some mild regulatory assumptions, see Zimmermann et al 7 ), it can be shown that where is as given in (2). Both of these matrices are, however, unknown in practical applications and must be estimated from the data. Different methods for the estimation of , , and are available and include bootstrap, maximum likelihood (ML and REML), MIVQUE0 as well as methods of moments. 16 Clearly, each estimation method has its own advantages and disadvantages, but beyond that, the resulting groupwise variance estimators depend on the outcomes in the other groups using any of them. For instance, in a two-sample design, the estimators of 2 1 and 2 2 would change when a third group (independent of the others) was included in the model (without changing any value in the initial groups). This is not the case when covariates are not included in the model. The variance components 2 1 , … , 2 a are model constants and we therefore prefer to estimate them on a group-specific level using methods of moments. We follow the idea from Cao et al 17 and estimate them using the corresponding submodels. Let X i = 1 n i denote the n i × 1 vector of 1's and let M i , i = 1, … , a denote the matrices of the covariates for each group separately. Furthermore, let B i = (X i ⋮ M i ) denote the a partitioned matrices of X i and the corresponding covariates M i , and define the projection matrices Then, unbiased and consistent estimators of the variances 2 i are given bŷ Unbiased estimators of the matrices , , and will be provided in the next corollary.
Corollary 1. Let̂2 i as given in (9). Then,̂= ⨁ a i=1̂2 i I n i is an unbiased and consistent estimator of . Furthermore,̂= ND̂D ′ and̂= NÂA ′ are unbiased and consistent estimators of and , respectively.
We note that̂2 i is the "classical" variance estimator for each group-specific submodel and therefore the result follows, see Cao et al. 17 Both the estimatorsb andp of the treatment effects as well as their consistent variance-covariance matrix estimators can now be used for the derivation of statistical procedures. This will be explained in the next section.

THE ANCOVA-TYPE STATISTIC
In this section, test procedures for testing the null hypotheses H (b) 0 as well as H (p ) 0 as given in (3) and (4) will be introduced. In general ANOVA and ANCOVA models, test procedures are mainly quadratic forms in the estimators (scaled with variance-covariance estimators). Let H denote a suitable hypothesis matrix for testing H (b) 0 ∶ Hb = 0 given in (3). For large sample sizes, a Wald-type statistic for testing H (b) 0 is readily available. Here, (B) − denotes a generalized inverse of the matrix B. For large sample sizes, Q N (H) follows a 2 r distribution with r = rank(H) degrees of freedom. The Wald-type statistic is also numerically available in SAS PROC MIXED using the CHISQ option within the model statement. Furthermore, F-tests are computed using estimated degrees of freedom. These statistics, however, should only be applied when sample sizes are large. In the present article, we focus on small sample sizes and we therefore emphasize the ANOVA-type statistic developed by Brunner et al. 5 In their original paper, the authors allow for mean comparisons only (ie, no covariates are involved in the model). However, SAS PROC MIXED implements an ANOVA-type statistic within the ANOVAF option. The statistics shall now be explained.

ANOVAF Option in SAS PROC MIXED
A version of the ANOVA-type statistic for testing the aforementioned null hypotheses is numerically available in SAS PROC MIXED using the ANOVAF and MIVQUE0 options. Computational details are not provided, but the general procedure is explained in available handbooks, for example, on https://support.sas.com/kb/24/516.html. The following revisits its statements: "Let H denote the matrix of estimable functions for the hypothesis H ∶ Hb = 0 and let T = H ′ (HH ′ ) − H and let C denote the estimated variance-covariance matrix of (b − b) (see the section 'Statistical Properties' in the PROC MIXED documentation for the construction of C). The ANOVAF F-statistics are computed as Note that this test is a modification of the usual F-statistic where (HCH ′ ) − is replaced with t 1 = tr(TC); see, for example Brunner, Domhof, and Langer (2002, sec. 5.4). The p-values for this statistic are computed from either an F 1 , 2 or an F 1 ,∞ distribution. The respective degrees of freedom are determined by the MIXED procedure as follows: The term g ′ Ag in the term * 2 for the denominator degrees of freedom is based on approximating Var(tr(TC)) based on a first-order Taylor series about the true covariance parameters. This generalizes results in the appendix of Brunner, Dette, and Munk (1997) to a broader class of models. The vector g = (g 1 , … , g q ) ′ contains the partial derivatives and A is the asymptotic variance-covariance matrix of the covariance parameter estimates (ASYCOV option in the PROC MIXED statement). PROC MIXED reports 1 (2002), provided that the response data consists of properly created (and sorted) ranks and that the covariance parameters are estimated by MIVQUE0 in models with the REPEATED statement and properly chosen SUBJECT= and/or GROUP= effects." Up to now, the statistical procedure is available in SAS only. Replicating the computational steps is a very challenging task, especially since some computational details are provided in a somewhat cryptic manner. Furthermore, as mentioned above, the group-specific variance estimators depend on outcomes in other groups. The detailed code for the computation of the statistic is as follows The usage of the REPEATED statement is necessary for the groupwise estimation of the variances and should not be confused with "repeated measures". The repeated statement allows modeling of the structure of the variance-covariance matrix of the error term. When no covariates are available (ie, mean comparisons only), the ANOVA-type statistic is also implemented in the R-package GFD. 6

A novel approach
In the following, a novel ANCOVA-type statistic based on the point estimatorsb andp as well as their unbiased and consistent variance-covariance estimatorŝand̂will be introduced. First, procedures for testing H (b) 0 will be presented. Methods for testing the impact of the covariates will be obtained afterward in a similar way. Analogously to the ANOVAF realization in SAS, let T = H ′ (HH ′ ) − H denote the projection onto the column space of H and consider the test statistic Note that F N (T) is very similar to F A as given in (10), with the exception of using different variance estimators. The next proposition introduces an approximation of its asymptotic distribution.

Proposition 1.
Let D be as given in (8) and define the matrix 0 , the distribution of F N (T) as given in (11) can be approximated by a central F( .
The derivation of the approximation is given in the appendix. Test procedures for testing the null hypothesis H

SIMULATION RESULTS
All of the methods presented in this article are of asymptotic nature and we will therefore examine their behavior in small sample size situations with the help of extensive simulation studies (each with 10 000 simulation runs). We will use their control of the nominal type-1 error rate ( = 5%) as well as their powers to detect selected alternatives as quality criteria. Due to the abundance of possible general designs (numbers of factors), sample size/variance allocations, numbers of covariates, and their impact on the distribution of the response variable, we select a broad range of different designs to cover realistic practical scenarios, especially including the motivating example discussed in Section 2. Data have been generated from one-and two-way designs ) each with M = 3 covariates drawn from discretized normal distributions with p = (0, 1 2 , 1) ′ and b i ≡ b ij ≡ 10. The random variables Z ik and Z ijk base the error terms and were generated from standard normal, t 3 , Laplace, 2 15 , 2 7 , or exponential distributions, respectively. Thus, three differently tailed symmetric distributions as well as three skewed distributions with and Var(Y ijk ) = 2 ij . As a major assessment criteria, we will exemplify the behavior of the methods in positive and negative pairings of sample sizes and variances. The resulting different settings of sample sizes and groupwise variance allocations are displayed in Table 1.
The initial parameters of Model I are n 1,I = (7, 7, 7, 7) ′ , n 2,I = (7, 10, 13, 18) ′ , 1,I = (1, 1, 1, 1) ′ , 2,I = (1, 2,II = (1, 3,II = (  (15)) (using nboot = 10 000 bootstrap runs), the classical ANCOVA F-test as well as the novel ANCOVA-type test F N (T) in (11) for testing the null hypotheses in Figure 2 and in Figure 3, respectively. First, it can readily be seen from Figure 2 that the two ANCOVA-type tests F A (10) and F N (T) are of equal quality, and both procedures tend to control the nominal type-1 error rate quite accurate in all considered scenarios. When sample sizes are very small, they tend to be slightly conservative. It can also be seen that the quality of the approximation depends on distributional shapes. In case of symmetric distributions, the heavier the tail the more conservative the methods seem to be. This observation was made in each of the different settings. In case of skewed distributions, a similar pattern can be recognized. The higher the skewness the more conservative the tests behave. The classical F-test performs as expected and controls the nominal type-1 error rate very well under variance homoscedasticity, while it tends to be liberal or conservative under negative or positive pairings, respectively. The procedure is based on a pooled variance estimator (assuming equal variances), which is biased when the data actually have different variances in unbalanced designs. The wild-bootstrap method W * (T) (15) controls the type-1 error well in case of symmetric distributions. Under skewed distributions, the opposite as with the ANCOVA-type tests can be observed, namely the higher the skewness, the more liberal the test behaves and thus, over-rejects the null hypothesis. With increasing sample sizes, the conservatism of the ANCOVA-type tests as well as the liberality of W * (T) tend to decrease. Summing up all findings, the quality of each method depends on the shape of the underlying data distribution. In case of symmetric distributions, no major differences between the methods could be detected in the selected scenarios except for the t 3 distribution, where the wild-bootstrap test is closer to the nominal level than the somewhat conservative new ANCOVA-type tests. By contrast, the F N (T) seems more stable under skewed distributions than the wild-bootstrap method. Next, we will investigate the control of the type-1 error rate of the competing methods for testing the null hypothesis H (p 1 ) 0 ∶ p 1 = 0. Since the results obtained under Model II and Model I were very similar, we omit the latter. The simulation results are displayed in Figure 4. First, it can readily be seen that the ANCOVA-type test F A implemented in SAS is pretty liberal and over-rejects the null hypothesis when sample sizes are either small or of medium size. This can be observed under all of the investigated data distributions and settings. With increasing sample sizes, the liberality vanishes, but, it still cannot be recommended when sample sizes are n i ≈ 20. This behavior of the test might occur because SAS uses the generalized least squares with the estimated variances as the point estimators. The estimation of the variances of the estimators might be unstable and not well tracked within the approximation procedure when sample sizes are as small as considered here. The novel F I G U R E 3 Type-1 error ( = 5%) simulation results of F A in (10) implemented in SAS PROC MIXED, the wild-bootstrap test W * (T) in (15), the classical ANCOVA F-test, and the novel test F N (T) in (11)

Power simulations
We conducted extensive simulation studies to compare the powers (at significance level = 5%) of F A in (10), W * (T) in (15) and F N (T) in (11) in various scenarios. Due to the abundance of different factorial designs, we compare the powers of the methods to detect the three selected alternatives in Model I (upper row) and Model II (lower row) with varying ∈ {0, 0.1, … , 2} in all of the different settings as described in Table 1. Throughout, we set m = 0 and thus do not compare the impact of increasing We find that the wild-bootstrap method tends to have a higher power than the ANCOVA-tests in few scenarios (eg, Setting 2 in both Model I and Model II). A possible explanation could be that the method is a Wald-type statistic, while the others are approximate solutions. However, we also see the opposite in few scenarios (eg, Setting 3, Alternative 1). In general, the wild-bootstrap test W * (T) has a slightly higher power under variance heteroscedasticity, depending on variance and sample size allocations. In case of equal variances, its power is slightly smaller than of its competitors. Furthermore, F A in (10) also tends to have a slightly higher power than F N (T) in (11) (about 1% higher). The power increase might either result from using MIVQUE0 variance estimators and/or from using generalized least squares estimators. Overall, we see that the powers of the methods depend on various parameters, for instance the hypothesis of interest (contrast matrix), alternative pattern, sample sizes, degree of variance heteroscedasticity and especially their allocations. Therefore, finding a general conclusion is quasi impossible in the designs considered here. In general, none of the methods has a superior power, while there are situations in which any of the method outperforms the others.

DATA EVALUATIONS
The two-way factorial toxicological and carcinogenic study introduced in Section 2 can now be analyzed with the new methods. First, the data of this trial can be modeled by with Y ijk denoting BUN (equivalently change of BUN), M (1) ijk and M (2) ijk BUN at baseline and change of bodyweight of rat k of gender i in dose level j, respectively. First, the estimated treatment effects (b ij ), standard errors ofb ij , and the variance estimators (̂i j ) for each factor-level combinations of the two factors: gender (A) and pyridine concentration (B) are listed in Table 2. For illustration, we also report the variance estimators obtained by MIVQUE0 in SAS PROC MIXED.
It can readily be seen from Table 2 that the variance estimators from both estimation methods after adjusting for the two covariates are rather different for each combination of the two factors. Therefore, assuming homoscedastic variances in the classical ANCOVA model may lead to inaccurate statistical inferences.
In order to analyze the main effects and the interaction effect, we test each null hypothesis given in (5) using the novel ANCOVA-type test F N (T) in (11), the ANCOVA-type test F A in (10) implemented in SAS PROC MIXED, the wild-bootstrap test W * (T) in (15), and the classical ANCOVA F-test. The results are summarized in Table 3 with the test statistics, degrees of freedom (DF), and the P-values.
It can readily be seen from Table 3 that the factor A (gender) has a significant impact on the endpoint BUN at 5% level of significance using the first three tests: F N (T), F A , and W * (T), while an insignificant impact using the classical ANCOVA F-test. This difference in results might be because the test assumes homogenous variances. None of the tests detects a dose effect (pyridine concentration) at 5% level of significance. Furthermore, the bodyweight change does not seem to impact the response.

DISCUSSION
Analysis of covariance is a fundamental inference method in statistical practice and allows estimation and testing of adjusted treatment effects in general factorial designs. For example, adjusting for baseline values is essential in a variety of trials. In addition, variance heteroscedasticity is a non-negligible impact which might complicate the statistical analysis, especially in small sample sizes. In the present article, we discussed available methods for the analysis of ANCOVA models under groupwise heteroscedasticity. In particular, we investigated the ANCOVA-type statistic F A available in SAS PROC MIXED. In this spirit, we were able to derive its sibling statistic F N , which can be seen as a generalization of the ANOVA-type statistic. 5 In comparison with F A , F N is numerically feasible and can be computed within few numerical steps. Extensive simulation studies show that the ANCOVA-type tests control the nominal type-1 error rate equally well and have comparable powers to detect alternatives. As further competitor, we investigated a wild-bootstrap approach. 7 Extensive simulation studies show that none of the methods has a superior power. In few situations depending on variance heteroscedasticity, type of hypothesis, sample sizes, and so forth, W * has a higher power than its competitors. In general, the methods are applicable even in case of small sample sizes. As a rough recommendation, they are applicable when n i ≥ 10 (depending on the actual model under consideration). Furthermore, when confronted with small sample sizes, we recommend testing for no covariate effect(s) with the statistic F N . All of the presented methods are, however, approximate solutions each with its own pros and cons. Besides evaluating data, statistical planning and sample size computations are of same importance. We will tackle them in future research projects. Furthermore, we investigated Box-type approximations only. The investigation of other methods, for example, the pretty popular Kenward-Roger approximation methods 18,19 (which are implemented in SAS PROC MIXED as well), will also be part of future research.

Extension: Completely heteroscedastic errors
The assumption of equal variances within groups (see Equation (2)) may be further relaxed, in order to cover also scenarios with subject-specific errors (ie, "completely heteroscedastic" settings), namely by only assuming Translating the ANCOVA-type approximation framework that has been considered in the present article to such a general setting is not straightforward. By contrast, a Wald-type approach is still applicable, because the covariance matrix given in (13) may be replaced by the estimatorV wherêi j denotes the ANCOVA residual of subject j in group i, i ∈ {1, … , a}, j ∈ {1, … , n i }. In order to improve the finite-sample performance, the following wild-bootstrap Wald-type test has been proposed in Zimmermann et al. 7 Let B ∶= (X, M) denote the design matrix of the ANCOVA model. Independently from the data, we generate a sample of i.i.d. random variables T 11 , … , T an a satisfying P(T 1 = −1) = P(T 1 = 1) = 1∕2 and obtain the corresponding bootstrap observations Y * ij =̂i j T ij (1 − p ij ) −1∕2 , where p ij denotes the diagonal element of the hat matrix P B = B(B ′ B) −1 B ′ corresponding to subject j within group i. Finally, we calculate the wild-bootstrap Wald-type statistic W * (T) ∶= (Tb * ) ′ (T̂ * T) −1 Tb * .
Thereby,b denotes the least squares estimator of b defined in (6), but with the original observations replaced by the bootstrap sample, and̂ * denotes the upper-left block of the 2 × 2 block matrix whereVar * ( ) is the bootstrap counterpart of the covariance matrix estimator from (14). Now, when repeating the procedure of drawing i.i.d. bootstrap samples a "large" number of times, we may use the empirical (1 − ) quantile of the conditional distribution of W * (T) as the critical value for testing H (b) 0 ∶ Tb = 0. Some limited simulation evidence indicates that the wild-bootstrap Wald-type test performs well even in the more general setting of unequal variances within groups, except for some tendency toward a conservative behavior in the log-normal case (see table 1 in Zimmermann et al 7 ). In general, however, it should be kept in mind that in a well-designed study, one may expect at most slight heteroscedasticity within groups. Therefore, the ANCOVA-type approximation that has been considered in the present article might actually cover most practically relevant scenarios.
Another potential line of action might be to use alternative methods for estimating the variances of the estimated model coefficients instead of HCSE estimators. 8,9 For example, the Hadamard estimator considered in Dobriban and Su 20 may be less biased, and hence, improve the finite-sample performance of the wild-bootstrap ANCOVA. This idea needs to be further studied in future research. So far, we restricted our research to independent observations. Repeated measures designs and multivariate data will be part of future research. Implementations of the new methods in freely available software packages (eg, within the R-package GFD 6 ) are envisioned.

APPENDIX. DERIVATION OF THE APPROXIMATION PROCEDURE
We will prove Proposition 1 in the following. It follows from the (asymptotic) multivariate normality of √ N(b − b) and the representation theorem of a quadratic form in Mathai 14 that the (asymptotic) distribution ofB N (T) can be approximated by a weighted sum of independent 2 1 random variables. Following the idea of Box 13 and Brunner et al, 5 we approximate its distribution by a scaled g ⋅ 2 Since tr(T ) depends on unknown parameters and thus is unknown in practical applications, we replace it by its empirical counterpart tr(T̂). However, we observe that we can write the latter as a quadratic form and approximate its distribution in the same way as above Thus, it follows that In the last step we assumed . Therefore, we obtain the approximation The quadratic formsF N (T) and F 2 (T) are independent which motivates as approximation Finally, since f 1 and f 2 depend on unknown parameters, we replace them with their consistent estimators and obtain , respectively.