A simulation study to compare robust tests for linear mixed‐effects meta‐regression

The explanation of heterogeneity when synthesizing different studies is an important issue in meta-analysis. Besides including a heterogeneity parameter in the statistical model, it is also important to understand possible causes of between-study heterogeneity. One possibility is to incorporate study-specific covariates in the model that account for between-study variability. This leads to linear mixed-effects meta-regression models. A number of alternative methods have been proposed to estimate the (co)variance of the estimated regression coefficients in these models, which subsequently drives differences in the results of statistical methods. To quantify this, we compare the performance of hypothesis tests for moderator effects based upon different heteroscedasticity-consistent covariance matrix estimators and the (untruncated) Knapp-Hartung method in an extensive simulation study. In particular, we investigate type 1 error and power under varying conditions regarding the underlying distributions, heterogeneity, effect sizes, number of independent studies, and their sample sizes. Based upon these results, we give recommendations for suitable inference choices in different scenarios and highlight the danger of basing tests of study-specific moderators on inappropriate covariance estimators.


| INTRODUCTION
Recently, Jackson and White (2018) raised the question "When should meta-analysis avoid making hidden normality assumptions?" In the current paper, we investigate this in the context of meta-regression models while also studying the effect of employing different methods to account for heteroscedasticity. Here, the term meta-regression refers to a regression in which the effect sizes from various studies are modeled by means of certain study characteristics. Thus, the effect sizes are the dependent (or outcome) variables and the study characteristics are the independent variables (also called moderators or explanatory variables).
As the effect sizes are usually certain summary statistics within diverse studies (such as, e.g., Cohen's d or a log-odds ratio), the study-specific moderators can only account for a part of the between-study heterogeneity. Thus, to "fully" account for heterogeneity, the introduction of a random effect is necessary, naturally leading to linear mixed-effects regression models. This was, for example, proposed 1 for the case of a single covariate and later extended. [2][3][4][5][6][7] In this context, a specific question of interest is to test for an effect of a certain moderator, that is, to test the null hypothesis that the corresponding regression coefficient is zero. Here, Viechtbauer et al made a thorough comparison of different existing methods in extensive simulations. In particular, they compared tests based on the Wald-type, Knapp-Hartung (with and without truncation), permutation, Huber-White, and likelihood ratio methods together with seven different estimators of the so-called between-study heterogeneity. It turned out that the choice of heterogeneity estimator did not affect the results greatly, while the choice of method mattered: They found a certain preference for the Knapp-Hartung method 3 and also concluded that "Huber-White and likelihood ratio tests (…) cannot be recommended for routine use, at least in their present form." Moreover, they stressed that "additional simulations are needed to assess the performance (…) under more adverse conditions, such as non-normal random errors and/or true effects." In the current paper, we follow this suggestion and continue their work by investigating the effect of non-normal random effects. In addition, we analyze the effect of choosing different versions of the Huber-White heteroscedasticity-consistent (HC) covariance estimators. These estimators are typically applied when the assumption of homogeneous variance of the residuals is not plausible, to avoid inconsistent inference.
In particular, there exist the six versions HC0-HC5 of the Huber-White estimator for regression models, of which Sidik and Jonkman 8 proposed the HC0- and HC1-type in the meta-analytic context. For fixed-effects regression models, the estimators HC3 and HC4 are often recommended. 9,10 Thus, it is of interest to also investigate the influence of the different choices in the context of meta-regression models. This becomes especially important under adverse conditions, such as non-normally distributed effect sizes and/or unbalanced study sizes or arms. As already shown, 11 such circumstances can lead to poor control of the type 1 error and/or poor coverage of confidence intervals when using standard meta-analytic techniques. In this paper, we therefore investigate the performance of the different estimators in various scenarios, utilizing both standardized mean differences and log-odds ratios as effect measures.
In the following sections, we start with a formal introduction of the mixed-effects meta-regression model and introduce inference procedures for testing moderator effects (Section 2). Next, we focus on a motivational data analysis (Section 3) that illustrates the practical importance of the choice of covariance estimator and we analyze the data example using the previously introduced procedures. The data analysis motivates the need for an extensive simulation study (Section 4). In this section, we explain the various simulation designs and illustrate and discuss our main findings. We end with concluding remarks and an outlook for further research (Section 5).

| THE SETUP
The usual mixed-effects meta-regression model is given by

y_i = β_0 + β_1 x_i1 + … + β_m x_im + u_i + ε_i, i = 1, …, K,  (1)

for independent outcome/effect variables y_i, where x_ij denotes the jth moderator variable in the ith study, β_j is the corresponding model coefficient, and K is the number of independent studies. Furthermore, u_i is a random effect that is typically assumed to be normally distributed 12 with u_i ~ N(0, τ²), and ε_i is the within-study error with distribution ε_i ~ N(0, σ_i²). However, to give answers to the opening question of "When should meta-analysis avoid making hidden normality assumptions?," we also study non-normal situations regarding the random effects u_i: We do not specify a particular distribution and only assume E(u_i) = 0 and Var(u_i) = τ². From a practical point of view, u_i accounts for the variability not explained by the trial-specific moderators, leading to the notion of between-study heterogeneity for its variance τ². We point out here that the study-level outcome of each individual patient may be binary. In this case, inference is based on normal approximations to discrete (binomial) likelihoods. Caution should be used with such normal approximations, as highlighted by a recent discussion paper on the topic of hidden normality assumptions in meta-analysis. 13 Here, an alternative approach would be exact GLMM approaches, as considered by Stijnen et al and others. 14,15 Anyhow, model (1) involves several unknown parameters (σ_i², β, τ²), which have to be estimated. Thereof, the within-study sampling variance σ_i² is estimated from the observations in the study and typically assumed to be known. To provide a simple expression of the weighted least-squares estimate for β and the corresponding covariance estimators presented below, we rewrite model (1) in matrix notation as

y = Xβ + u + ε,  (2)

where X ∈ R^{K×(m+1)}, β ∈ R^{m+1}, and u, ε ∈ R^K. The weighted least-squares estimator for β is given by

β̂ = (XᵀŴX)⁻¹ XᵀŴy,  (3)

with weight matrix Ŵ = diag((σ_i² + τ̂²)⁻¹).
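To make the estimation step concrete, the weighted least-squares fit for the matrix form of the model can be sketched numerically as follows. This is a minimal illustration with made-up toy data, not the authors' implementation; the function name and all numerical values are our own assumptions:

```python
import numpy as np

def wls_meta_regression(y, X, sigma2, tau2):
    """Weighted least-squares estimator for beta in the mixed-effects
    meta-regression model y = X beta + u + eps, using the weight
    matrix W-hat = diag((sigma_i^2 + tau^2)^(-1))."""
    W = np.diag(1.0 / (sigma2 + tau2))
    bread = np.linalg.inv(X.T @ W @ X)     # (X' W X)^(-1)
    beta_hat = bread @ X.T @ W @ y
    return beta_hat

# toy data: K = 6 studies, intercept plus one moderator
rng = np.random.default_rng(1)
K = 6
x = rng.normal(size=K)
X = np.column_stack([np.ones(K), x])
sigma2 = np.full(K, 0.05)                  # known within-study variances
tau2 = 0.2                                 # between-study heterogeneity
y = X @ np.array([0.3, 0.5]) + rng.normal(scale=np.sqrt(sigma2 + tau2))
beta_hat = wls_meta_regression(y, X, sigma2, tau2)
```

In practice τ² is unknown and is replaced by an estimate such as REML (discussed below); the sketch simply treats it as given.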
In this setup, we are now interested in testing the null hypothesis of no moderator effect H_0: {β_j = 0} for j ∈ {1, …, m} against the two-sided alternative H_1: {β_j ≠ 0}.
There already exist several procedures applicable for this purpose, and most of them are based on a test statistic of (Welch-)t-type. In particular, these basically differ in how both the between-study heterogeneity τ² and the within-study variances σ_i² are accounted for. To define them, denote by β̂ the weighted least-squares estimator for β and by Σ = Cov(β̂) its covariance matrix. For all choices of (co)variance estimator Σ̂ considered in Section 2.1, a two-sided test statistic of t-type for testing for the presence of the jth model coefficient, that is, for inferring H_0: {β_j = 0}, is then calculated via

T_j = β̂_j / √(Σ̂_jj).  (4)

Here, Σ̂_jj is the jth diagonal element of the covariance estimator Σ̂. For large K, the statistic T_j approximately follows a t-distribution with K − m − 1 degrees of freedom under the null hypothesis H_0. 16 Comparing |T_j| against the 1 − α/2 quantile of the t-distribution with K − m − 1 degrees of freedom yields the corresponding test and P values. Under mild regularity conditions on the moderators, these tests are asymptotically correct. We summarize this in Theorem 1, which is given in the supplement along with a proof.
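As a small numerical illustration of this decision rule (all estimates here are hypothetical, and the critical value is read off a standard t-table rather than computed):

```python
import math

def t_type_statistic(beta_hat_j, var_jj):
    """T_j = beta_hat_j / sqrt(Sigma_hat_jj), the t-type statistic
    for testing H0: beta_j = 0."""
    return beta_hat_j / math.sqrt(var_jj)

# hypothetical estimates: K = 6 studies, m = 1 moderator, so K - m - 1 = 4 df
T1 = t_type_statistic(0.52, 0.03)   # beta_hat_1 = 0.52, Sigma_hat_11 = 0.03
crit = 2.776                        # 0.975 quantile of t with 4 df (alpha = 0.05)
reject = abs(T1) > crit             # two-sided test at level alpha
```

Here |T_1| ≈ 3.0 exceeds the critical value, so H_0 would be rejected at the 5% level in this made-up example.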
As has already been pointed out, the testing procedures are not greatly affected by the choice of residual heterogeneity estimator. 17 We therefore focus solely on one estimator for τ²: the restricted maximum likelihood (REML) estimator, which was recently advocated as a good choice for continuous data. 18,19 Details regarding the REML estimator are presented in the Supplementary Materials (cf. Equation S8). Note that in this context, REML estimates are more suitable than naive ML estimates of variance components, as the latter may have a negative bias. 20 Having fixed estimators for β and τ², we now turn to the question of how to estimate the covariance of the estimated model coefficient β̂ given in Equation (3). Here, the Knapp-Hartung method 3 has been recommended. 17 However, in the case of semiparametric linear models, robust Huber-White estimators are often seen as a reasonable solution, especially when the type of heteroscedasticity is not specified. 9,10,21 As Viechtbauer et al 17 only investigated the HC1 estimator of the six Huber-White estimators HC0-HC5, we complement their study by also investigating the other versions with respect to their applicability in meta-regression. To this end, they are detailed in the next subsection. These HC estimators are furthermore compared to the (untruncated) Knapp-Hartung method, which provided adequate control of the type 1 error rate in previous research. 17

| Robust (Huber-White) approach
In semiparametric linear models, the assumption of homogeneous variance of the residuals is often not plausible, possibly leading to invalid inference from classical methods based on homoscedasticity. Here, the typical solution is to apply sandwich estimators. These are also known as Huber-White estimators, to recognize the contributions of Peter J. Huber and Halbert White. 22,23 In model (1), it especially makes sense to consider such estimators because the marginal variances σ_i² + τ² of the effect size estimates are heteroscedastic. We are now interested in consistent estimators of the (co)variance matrix Σ = Cov(β̂). The classical White estimator of type HC0, which was proposed by Sidik and Jonkman 8 in the meta-analytic context, is given by

Σ̂_0 = (XᵀŴX)⁻¹ XᵀŴ Ê² ŴX (XᵀŴX)⁻¹,

where Ê = diag(y − Xβ̂). Multiplying it by K/(K − m − 1) leads to the HC1-type estimator Σ̂_1, which was considered in the above-mentioned work by Viechtbauer et al 17 and is known to be more conservative. However, even in classical regression models, Wald- or t-tests based on both (co)variance estimators are known to yield inflated type 1 error rates for small to moderate sample sizes. 10,24,25 This was also shown to be the case in meta-regression models. 17 Therefore, improved versions of the original Huber-White estimator have been suggested, namely White estimators of type HC2, HC3, HC4, and HC5. We introduce these estimators but refer to the papers in which they were originally discussed for further details. [26][27][28] As their general forms are rather complex (cf. Equations 5 and 6), we have also worked out the analytical form of the HC estimators in the simplest case of no moderators, that is, random-effects meta-analysis. Please refer to the Supplementary Material and the discussion for details.
The form of the respective Huber-White covariance estimators in the context of the mixed-effects meta-regression model (2) is described below. We first introduce the HC2 and HC3 estimators, given by

Σ̂_ℓ = (XᵀŴX)⁻¹ XᵀŴ D_ℓ ŴX (XᵀŴX)⁻¹, ℓ = 2, 3, with D_ℓ = diag(ε̂_i²/(1 − h_ii)^(ℓ−1)),  (5)

where ε̂_i denotes the ith residual and h_ii the ith diagonal element of the hat matrix H = X(XᵀŴX)⁻¹XᵀŴ. Here, the HC3 estimator gives a very close approximation to the computationally more expensive jackknife estimator described in Reference 26 and given by

Σ̂_JK = ((K − 1)/K) Σ_{i=1}^{K} (β̂_(i) − β̄)(β̂_(i) − β̄)ᵀ.

Here, β̂_(i) is the weighted least-squares estimate of β based on all observations except the ith, and β̄ denotes the average of these leave-one-out estimates. It is important to note that HC3, unlike HC2, is biased under homoscedasticity. 28 To improve HC3, the following variation was suggested 27 :

Σ̂_ℓ = (XᵀŴX)⁻¹ XᵀŴ D_ℓ ŴX (XᵀŴX)⁻¹, ℓ = 4, 5,  (6)

where D_4 = diag(ε̂_i²/(1 − h_ii)^{δ_i}) with δ_i = min{4, h_ii/h̄} and h̄ the average leverage, and D_5 = diag(ε̂_i²/(1 − h_ii)^{α_i/2}) with α_i = min{h_ii/h̄, max{4, γ h_max/h̄}} for a predefined constant 0 < γ < 1. Based on findings from simulation studies, the value γ := 0.7 was recommended. 28 We follow this suggestion below.
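These sandwich estimators differ only in how the squared residuals are rescaled before entering the "meat" of the sandwich. The following sketch implements the common scaling rules; it is our own illustrative code, not the authors' implementation, and the HC4/HC5 exponents follow the standard formulations attributed to Cribari-Neto and coauthors:

```python
import numpy as np

def hc_covariance(y, X, W, kind="HC3", gamma=0.7):
    """Huber-White (sandwich) covariance estimator for the WLS fit,
    with residual rescaling chosen via `kind` (HC0-HC5)."""
    K, p = X.shape                              # p = m + 1 coefficients
    bread = np.linalg.inv(X.T @ W @ X)          # (X' W X)^(-1)
    resid = y - X @ (bread @ X.T @ W @ y)       # raw residuals
    h = np.diag(X @ bread @ X.T @ W)            # leverages h_ii
    hbar = h.mean()
    if kind == "HC0":
        omega = resid**2
    elif kind == "HC1":                         # small-sample factor K/(K - m - 1)
        omega = resid**2 * K / (K - p)
    elif kind == "HC2":
        omega = resid**2 / (1.0 - h)
    elif kind == "HC3":
        omega = resid**2 / (1.0 - h)**2
    elif kind == "HC4":                         # exponent delta_i = min(4, h_ii / hbar)
        omega = resid**2 / (1.0 - h)**np.minimum(4.0, h / hbar)
    else:                                       # HC5, exponent alpha_i / 2
        alpha = np.minimum(h / hbar, np.maximum(4.0, gamma * h.max() / hbar))
        omega = resid**2 / (1.0 - h)**(alpha / 2.0)
    meat = X.T @ W @ np.diag(omega) @ W @ X
    return bread @ meat @ bread

# toy example: K = 8 studies, intercept plus one moderator
rng = np.random.default_rng(0)
K = 8
X = np.column_stack([np.ones(K), rng.normal(size=K)])
sigma2 = rng.uniform(0.05, 0.2, size=K)
W = np.diag(1.0 / (sigma2 + 0.1))               # weights with tau^2 = 0.1
y = X @ np.array([0.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2 + 0.1))
cov_hc0 = hc_covariance(y, X, W, kind="HC0")
cov_hc3 = hc_covariance(y, X, W, kind="HC3")
```

Because HC3 inflates each squared residual by (1 − h_ii)⁻², its diagonal entries are never smaller than those of HC0, which is one way to see its more conservative behavior.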
The asymptotic behavior (for large K) is the same for all of the considered covariance estimators. However, for small to moderate numbers of studies K, the respective behavior may be vastly different, as asymptotic arguments and limit theorems no longer hold. This is particularly apparent in the illustrative data example presented in the next section.

| DATA EXAMPLE

Table 1 contains data on six studies, which investigate the effectiveness of Azithromycin vs Amoxycillin or Amoxycillin/clavulanic acid (Amoxyclav) in the treatment of acute lower respiratory tract infections. An explanation of the different variables can be found in Table 2. Azithromycin is an antibiotic, which is useful for the treatment of various bacterial infections. 29 The data are contained in the R package metafor and have previously been analyzed. 30 We want to investigate whether the respective trial having included patients with a diagnosis of pneumonia has a significant effect on the effectiveness of Azithromycin within the subgroup of trials containing patients with a diagnosis of acute bacterial bronchitis. We will attempt to answer this question using a mixed-effects meta-regression model.
Although in the original work on these data 30 the authors used risk ratios as the effect measure, we decided to utilize the log-odds ratio as the effect measure of choice, due to its favorable statistical properties, such as an approximate normal distribution. 31 Moreover, the log-odds ratio behaved similarly to the standardized mean difference in our preliminary simulations. The resulting P values and values of the test statistic (4) for each choice of estimator HC_i, i = 0, …, 5, and the Knapp-Hartung method are given in Table 3.
The estimators HC0 and HC1 lead to a rejection of the null hypothesis at the nominal level α = 0.05, while the test based on HC2 still yields a significant moderator effect at the 10% level. In contrast, tests based on HC3-HC5 do not reject the null hypothesis. Comparing the newer covariance estimators HC3-HC5 with the Knapp-Hartung method, the latter rejects the null hypothesis at the nominal level α, whereas the former do not.
These results illustrate that the choice of covariance estimator can have a large influence on results in practice and that the wrong choice of HC-estimator may lead to possibly false-positive or -negative test results. In particular, it is unclear whether the above rejections/nonrejections are due to a potentially liberal/conservative behavior or different power characteristics of the corresponding tests. In any case, researchers should take care when performing inference on study-specific moderators, especially when the number of investigated studies K is small. In order to help guide researchers' decision of which covariance estimator to use in their analysis, we perform an extensive simulation study regarding type 1 error and power.

| Software
Although this data set was analyzed using the open source software R, other statistical software packages are available for meta-regression. Two examples are metareg in Stata as well as various procedures in SAS. Metareg in Stata, for example, implements the REML method as the default estimation procedure regarding the between-study variance τ². In both Stata and SAS, the covariance matrix estimation approach can be specified: Metareg implements the Knapp-Hartung method as the default covariance estimation approach. In SAS, the PROC PANEL procedure per default uses the standard sample covariance estimator but allows the option to specify one of the HC covariance estimators HC0-HC4 using the HCCME= option in the MODEL statement. 32,33 In the rma function in the metafor package in R, the default covariance matrix is simply V = diag(σ_i² + τ̂²), with REML as the default estimation procedure for the between-study heterogeneity τ². The Knapp-Hartung method can be specified via the option test = "knha" in the rma function.

TABLE 2 Explanation of variables in Table 1 (columns: Variable, Meaning)

| SIMULATION STUDY
We conducted a Monte Carlo simulation using standardized mean differences and log-odds ratios as the effect size measures. As we do not want to assume individual patient data, only the study effects θ̂_i are available (cf. Equation 10). As in previous work, 17 we assumed a single moderator influencing the true study-specific effects, resulting in the model

θ_i = β_0 + β_1 x_i + u_i, i = 1, …, K.  (10)

The values of the moderator x_i were independently generated from a standard normal distribution and, without loss of generality, β_0 was set to 0. Moreover, the random effects u_i were chosen to be either standard normal-, (standardized) exponential-, double exponential-, log-normal-, or t_3-distributed. For a detailed definition of the corresponding data generating processes, we refer to Section S6 in the Supplementary Material.
For the effect size, we considered the standardized mean difference in the ith study. We generated the true parameter θ_i directly, analogously to Viechtbauer et al, 17 according to Equation (10). An unbiased estimator of θ_i is given by Hedges' g 34

g_i = (1 − 3/(4n_i − 1)) d_i,  (11)

where n_i = n_i^T + n_i^C − 2, and n_i^T and n_i^C denote the sizes of the treatment and control group, respectively, which are specified below. Moreover, d_i denotes the effect size estimate (Cohen's d) from study i, which was generated via

d_i = ϕ_i / √(X_i/n_i),

where ϕ_i ~ N(θ_i, 1/n_i^T + 1/n_i^C) and X_i ~ χ²_{n_i}, and then applying expression (11).
For the between-study heterogeneity τ², we chose the values {0.1, 0.2, …, 0.9}, and for β_1 we considered the choices {0, 0.2, 0.5}, where 0 corresponds to no effect of the moderator variable. The number K of independent studies was chosen from {5, 10, 20, 50}. Finally, a good approximation of the sampling variance of y_i is given by 35

σ̂_i² = 1/n_i^T + 1/n_i^C + g_i²/(2(n_i^T + n_i^C)).

In order to see if and in what way the results depended on the chosen effect size measure, we also investigated log-odds ratios for binary data. Simulating data in a manner analogous to the one described in foundational work 8,12 (results not shown), it turned out that the change of effect size did not alter the general conclusion. Therefore, we focus on the standardized mean difference alone.
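The data-generating process described above can be sketched as follows (our own illustrative code under the stated distributional assumptions; the function name and chosen parameter values are ours):

```python
import numpy as np

def simulate_smd_study(theta_i, n_t, n_c, rng):
    """Generate Hedges' g and its approximate sampling variance for a
    single study with true effect theta_i and group sizes n_t, n_c."""
    n_i = n_t + n_c - 2
    phi = rng.normal(theta_i, np.sqrt(1.0/n_t + 1.0/n_c))   # phi_i ~ N(theta_i, 1/n_T + 1/n_C)
    chi = rng.chisquare(n_i)                                # X_i ~ chi-square with n_i df
    d = phi / np.sqrt(chi / n_i)                            # Cohen's d
    g = (1.0 - 3.0 / (4.0 * n_i - 1.0)) * d                 # Hedges' bias correction
    var_g = 1.0/n_t + 1.0/n_c + g**2 / (2.0 * (n_t + n_c))  # approximate sampling variance
    return g, var_g

# one simulated study under model (10): theta_i = beta_0 + beta_1 x_i + u_i
rng = np.random.default_rng(7)
beta0, beta1, tau2 = 0.0, 0.5, 0.2
x_i = rng.normal()                                   # moderator value
theta_i = beta0 + beta1 * x_i + rng.normal(scale=np.sqrt(tau2))
g, v = simulate_smd_study(theta_i, n_t=40, n_c=40, rng=rng)
```

Repeating this for i = 1, …, K yields one simulated meta-regression data set (effect estimates plus sampling variances) on which the tests can be evaluated.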
In practice, the study-specific moderators are oftentimes binary, as can be seen in our data example. For this reason, we have also (exemplarily) considered binary moderators in the case of balanced study sizes, considering normal and exponential random effects. So, instead of generating the x_1i from a N(0, 1) distribution, we generated them from a Bernoulli distribution with parameter p = 0.2. It is necessary to exclude the case where all moderators are equal to 1 or all are equal to 0. Furthermore, it is sufficient to consider only power for the binary moderators, as the type 1 error is the same as in the case of standard normally generated moderators: for β_1 = 0, the choice of the x_1i does not matter.
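Such a draw with the degenerate cases excluded can be sketched as follows (illustrative code; the function name is our own):

```python
import numpy as np

def draw_binary_moderators(K, p=0.2, rng=None):
    """Draw x_i ~ Bernoulli(p), i = 1, ..., K, redrawing until both
    moderator groups are represented (all-0 and all-1 draws excluded)."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        x = rng.binomial(1, p, size=K)
        if 0 < x.sum() < K:
            return x

x = draw_binary_moderators(K=10, p=0.2, rng=np.random.default_rng(3))
```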

| RESULTS
In this section, we describe the results of the simulation study. In particular, we present type 1 error and power based on the different covariance estimators under various simulation configurations. For power, we considered both the case of a (comparatively) smaller effect size β_1 = 0.2 and a (comparatively) larger effect β_1 = 0.5 of the study-specific moderator. For ease of presentation, we focus on the most important results and general trends and refer the interested reader to the Supplementary Material for the complete simulation results.

| Type 1 error rate
Studying the type 1 error results for all configurations given in the Supplementary Material, we can draw a first general conclusion: changes in the between-study heterogeneity τ², the number of subjects in each study, and the underlying distribution of the random effects had little effect on the behavior of the procedures under the null hypothesis. In comparison, the number of studies K and the chosen test procedure were the driving forces for changes in type 1 error control. We therefore start by presenting a summary of the results of the type 1 error simulations for different combinations of these two factors in the boxplots given in Figure 1. The results shown in Figure 1 are for the scenario of unequal study sizes. We present results for HC1-HC5 and the Knapp-Hartung method, deferring the results for HC0 to the Supplementary Material due to its known liberal behavior.
Here, each boxplot represents the 9 (τ²) × 3 (n_i) × 5 (u_i) = 135 different empirical type 1 error rates for each test in the case of K ∈ {5, 10, 20, 50} studies. The White-type test based on the classical HC0 estimator exhibits highly inflated type 1 error rates, as expected, particularly for a smaller number of studies; its type 1 error rates are even more inflated than those for HC1. For details, we refer to the Supplementary Material. A similar, but less pronounced, behavior can be observed for the tests based upon HC1 and HC2. In contrast, all other procedures control the nominal level α = 0.05 quite well. HC3-HC5 are slightly conservative for K = 5 studies, with a type 1 error around 3% for HC3 and around 4% for HC4 and HC5. For these three estimators, the type 1 error converges to the nominal level α for an increasing number of studies K. The Knapp-Hartung method holds the nominal level exactly for K = 5 studies but seems to become (only slightly) conservative for an increasing number of studies K. It is interesting to note that there was no significant correlation between type 1 error and different study sizes n (for a fixed number of studies K); see the Supplement for details. Finally, the Knapp-Hartung method controlled the nominal level α very well for a smaller number of studies K ∈ {5, 10}, which is in line with previous research. 17 In contrast, the other HC estimators were either liberal or slightly conservative in the scenario of K = 5 studies.
For a better comparison of the procedures with the overall best type 1 error control (HC3-HC5 and HC_KH), we present the boxplots of their simulated type 1 error rates together in one figure (see Figure 2). The results shown are from the simulation configuration of unbalanced study sizes with the standardized mean difference as effect measure. Figure 2 summarizes the observed type 1 error rates. These are fairly close to the nominal level α = 5%, albeit slightly conservative, with median type 1 error rates between 4% and 5%. The exception is the HC3 estimator in the case of five studies, which is much more conservative, with a median type 1 error rate just below 3% and the entire boxplot, including the whiskers, lying below the nominal level α. For HC3-HC5, the type 1 error rates increase monotonically toward the nominal level for an increasing number of studies K, whereas for the Knapp-Hartung method the type 1 error rates start close to nominal for K = 5 studies and decrease (slowly) away from the nominal level for increasing numbers of studies K.
Based on these results, we conclude that for K ≤ 10 studies the Knapp-Hartung method is to be recommended (in terms of type 1 error control), whereas for K ≥ 20 studies, especially when the number of studies is very large (e.g., 50 studies as in Figure 2), HC3 is the preferred estimator with regard to type 1 error control. For the case of 10 < K < 20 studies, further simulations are needed in order to give a clear recommendation for the choice of covariance estimator. For more comprehensive recommendations, we compare the procedures' power behavior in the next section.

| Power
In addition to the type 1 error rate, we investigated the power of the respective tests to reject the null hypothesis of no effect of the moderator variable when it is in fact false. To this end, we consider alternatives with (comparatively) small and (comparatively) larger effects by setting β_1 = 0.2 and β_1 = 0.5, respectively.
For all methods, the observed general trend was that power increased monotonically for decreasing amounts of heterogeneity τ², an increasing number of studies K, and increasing study size n. In the following, we again concentrate on the power of the procedures based on HC3-HC5 and Knapp-Hartung, as these were the only tests with satisfactory type 1 error control. The detailed power simulation results for each separate simulation scenario, for these and all other methods, are given in Section S6.2 of the Supplement. As the results for heterogeneous and homogeneous study sizes were very similar, we restrict ourselves to the former case and again refer to the Supplement (Section S6.2) for the complete results. Figure 3 summarizes the power results for the tests based on HC3-HC5 and HC_KH for a (comparatively) small effect size of β_1 = 0.2 in the scenario of unbalanced study sizes. Median power ranges from around 5% to 8% for K = 5 to around 45% to 47% for K = 50. For larger numbers of studies, the power of all shown tests is close together. However, HC3 does seem to have slightly more median power than the other estimators for K = 50. In the scenario of K = 5 studies, the Knapp-Hartung method yields much greater power than HC3 and slightly more power than HC4 and HC5. For K = 10 and K = 20 studies, HC_KH has slightly more median power than HC3-HC5, as well as a longer upper whisker in the latter case, in comparison to the other methods.
The results for a (comparatively) larger effect of β_1 = 0.5 can be found in the Supplementary Materials. We again concentrate on the estimators HC3-HC5 and HC_KH. In the Supplement, we also give the power results for the scenario of balanced study sizes. For β_1 = 0.5, the difference between methods is more pronounced, especially for a smaller number of studies K ∈ {5, 10}. In fact, HC_KH has considerably more power than HC3-HC5 for K ∈ {5, 10}. At the median, this difference amounts to 7%-8% more power than HC3 and around 4% more than HC4 and HC5 for K = 5, and around 7%-8% more power than HC3-HC5 for K = 10. For larger numbers of studies, this effect diminishes and the results are quite close together. Results were very similar for balanced study sizes.

| Bias and variance estimation
In addition to type 1 error and power, we also study the bias E[β̂_1] − β_1 and the variance Var(β̂_1) = Σ_11 of the estimator β̂_1 in the Supplement. Clearly, the point estimate β̂_1 is identical across all variations of the variance estimator. Because these quantities cannot be expressed analytically, we resorted to simulations, which we performed in the scenario of normally distributed random effects and balanced study sizes, with moderator values drawn from a normal distribution. Our findings can be summarized as follows: The estimator β̂_1 is approximately unbiased for β_1 = 0 and becomes increasingly negatively biased for larger effect sizes β_1. Moreover, the estimated variance seems to increase with each newer version of the HC estimator, that is, from HC0 to HC5. The Knapp-Hartung method, however, yields a smaller variance estimate than the newer iterations HC3-HC5 of the HC estimators. The details can be found in the Supplement.

| Binary moderators
FIGURE 3 Power of tests based on HC3-HC5 and HC_KH for K ∈ {5, 10, 20, 50}, τ² ∈ {0.1, 0.2, …, 0.9}, and β_1 = 0.2, with unbalanced study sizes and the standardized mean difference (SMD) as effect measure

Finally, since moderators can also be binary in practice, we extended the simulations to consider this scenario. The results of the power simulations with binary moderators indicate that the use of binary covariates instead of continuous ones reduces power considerably. Furthermore, power did increase for larger numbers of studies K, but much more slowly than in the case of continuous moderators. When comparing the power results of the different covariance estimators, it became apparent that the HC estimators displayed vastly superior power over the Knapp-Hartung method when the number of studies was small (K ≤ 10). This is interesting, as with continuous moderators Knapp-Hartung often had more power. For large numbers of studies (K = 50), Knapp-Hartung had slightly more power than the HC estimators. It therefore seems prudent to use one of the newer HC estimators (HC3-HC5) instead of the Knapp-Hartung method when dealing with binary moderators and a small number of studies K. However, when dealing with binary moderators and a large number of studies (K > 20), it is probably best to stick with the Knapp-Hartung method. Detailed results can be found in the Supplementary Material.

| DISCUSSION AND FURTHER RESEARCH
Mixed-effects meta-regression models offer a good possibility to describe and model moderator (covariate) effects from various studies in a meta-analysis. In this context, it is of interest to determine which moderators significantly help to explain heterogeneity. This naturally leads to t-tests for the null hypotheses of no moderator effects. Here, Viechtbauer et al 17 compared several procedures in extensive simulations and recommended the (untruncated) Knapp-Hartung method 3 as the procedure of choice. We complement their investigations by additionally considering all six robust covariance estimators of Huber-White (HC) type suggested in the literature, while also extending their simulation scenarios. In fact, following recent discussions on hidden normality assumptions in meta-analyses, 13 we also study situations with non-normal random effects. Although we focus on hypothesis tests for moderator effects, confidence intervals for the unknown regression coefficients based on t-quantiles can easily be constructed via test inversion. 37 The coverage probabilities of these confidence intervals would be given by 1 minus the respective type 1 error.
For a total of 30 240 different simulation configurations, we compared the t-tests based on the six different HC-type estimators (HC0-HC5) and the (untruncated) Knapp-Hartung method 3 with respect to their type 1 error control and power. As observed in other regression contexts, 9,17,25,27,28 the tests based on the classical Huber-White estimators HC0 and HC1, as well as HC2, generally had a highly inflated type 1 error, except in the simulation scenario of K = 50 studies. Of the other existing modifications HC3-HC5, all managed a satisfactory control of the nominal level α. HC4 and HC5 controlled the nominal level more exactly, whereas the HC3 estimator was conservative in the case of very few studies (K = 5), with an observed type 1 error of around 3%. The (untruncated) Knapp-Hartung method also controlled the nominal level α well, albeit being more exact for smaller numbers of studies and slightly conservative for a larger number of studies K.
Regarding the behavior under different alternatives, all tests' power tended to increase monotonically with an increasing number of studies K, increasing average study size, and decreasing amounts of heterogeneity τ². This is a marked difference from the type 1 error behavior, where τ² and study size had little influence.
Somewhat surprisingly, the choice of distribution of the random effects in the simulation study had hardly any effect on the type 1 error and power of the t-tests based on the considered covariance estimators. This leads us to conclude that the typical normality assumption u_i ~ N(0, τ²) for the mixed-model random effects is unproblematic, at least in the scenarios we considered in our simulation study.
Comparing HC3-HC5 and the Knapp-Hartung method, we observed a higher power of the latter, especially in the case of larger moderator effects or few studies. Only in the case of small moderator effects and a larger number of studies (K = 50) was a slight power advantage of the HC3 method observed. Nevertheless, our findings lead to conclusions similar to those drawn in previous research 17 : in most cases, the (untruncated) Knapp-Hartung method seems to be the procedure of choice.
In addition to meta-regression, we have considered the special case of no moderators (random-effects meta-analysis) and worked out the formulas for the individual HC-type variance estimators of the main effect θ̂ in this case. These results are presented in Proposition 1 of the technical Appendix in the Supplementary Material, along with a proof. Additionally, the individual formulas of the six HC estimators Σ̂_0, …, Σ̂_5, which are of the form

Σ̂_ℓ = Σ_{j=1}^{K} v_{j,ℓ} ε̂_j², ℓ = 0, …, 5,

for specific weights v_{j,ℓ}, are presented in Equations (S2)-(S7) of the Supplement along with a numerical example. Σ̂_0 and Σ̂_1 only differ by a constant, whereas Σ̂_2-Σ̂_5 differ through the exponent of a weighting factor included in v_{j,ℓ}. Please refer to the technical Appendix of the Supplementary Material for their explicit form.
In applications, one of the most problematic cases is when only a small number of studies is available. Our data example in Section 3 shows how large the influence of the choice of HC estimator can be in such a scenario. One possible reason may be that all considered estimators make direct use of the residuals. In the case of few studies, these may not be too reliable, leading to less stable estimation of the between-study heterogeneity τ² and more variable standard errors. Here, alternative approaches exist, such as higher-order likelihood-based methods, which aim to improve on inference based on first-order likelihoods. In this context, some authors have, for example, recommended inference based on Skovgaard's second-order statistic. 38,39 Moreover, we additionally conjecture that in such a case of few studies the underlying error distribution plays an important role as well. 13 We leave an exhaustive evaluation of these "residual concerns" to future research.
We conclude this paper with an outlook on ongoing and future research. In most clinical trials, two or more endpoints of interest are measured. Therefore, the current investigations will be extended to the case of multivariate mixed-effects meta-regression models. As the assumption of normality is usually more problematic than in the univariate case, [40][41][42][43] an adequate treatment may require the extension and/or improvement of existing methods. In this context, the additional study of modern imputation techniques 44,45 will be mandatory. Moreover, in contrast to the present setting, one might explore the methodology under the presence of individual patient data, allowing the application of a multitude of different permutation or resampling procedures. 25,46,47