A Q statistic with constant weights for assessing heterogeneity in meta-analysis

The conventional Q statistic, using estimated inverse-variance (IV) weights, underlies a variety of problems in random-effects meta-analysis. In previous work on the standardized mean difference and the log-odds-ratio, we found superior performance with an estimator of the overall effect whose weights use only group-level sample sizes. The Q statistic with those weights has the form proposed by DerSimonian and Kacker. The distribution of this Q and of the Q with IV weights must generally be approximated. We investigate approximations for those distributions, as a basis for testing and estimating the between-study variance (τ²). A simulation study, with the mean difference as the effect measure, provides a framework for assessing accuracy of the approximations, level and power of the tests, and bias in estimating τ². Two examples illustrate estimation of τ² and the overall mean difference. Use of Q with sample-size-based weights and its exact distribution (available for the mean difference and evaluated by Farebrother's algorithm) provides accurate levels even for very small and unbalanced sample sizes. The corresponding estimator of τ² is almost unbiased for 10 or more small studies. This performance compares favorably with the extremely liberal behavior of the standard tests of heterogeneity and the substantial bias of estimators based on inverse-variance weights.


| INTRODUCTION
In meta-analysis, many shortcomings in assessing heterogeneity and estimating an overall effect arise from using weights based on estimated variances without accounting for sampling variation. Our studies of methods for random-effects meta-analysis of the standardized mean difference1 and the log-odds-ratio2 included an estimator of the overall effect that combines the studies' estimates with weights based only on their groups' sample sizes. That estimator, SSW, outperformed estimators that use (estimated) inverse-variance-based (IV) weights. Those weights use estimates of the between-study variance (τ²) derived from the popular Q statistic discussed by Cochran,3 which uses inverse-variance weights and which we refer to as Q_IV. Thus, parallel to SSW, we investigate an alternative, Q_SW, in which the studies' weights are their effective sample sizes. This Q_SW is an instance of the generalized Q statistic Q_F introduced by DerSimonian and Kacker,4 in which the weights are fixed positive constants.
We consider the following random-effects model (REM): For Study i (i = 1, …, K), with sample size n_i = n_iT + n_iC, the estimate of the effect is θ̂_i ∼ G(θ_i, v²_i), where the effect-measure-specific distribution G has mean θ_i and variance v²_i, and θ_i ∼ N(θ, τ²). Thus, the θ̂_i are unbiased estimators of the true conditional effects θ_i, and the v²_i = Var(θ̂_i | θ_i) are the true conditional variances. The general Q statistic is a weighted sum of squared deviations of the estimated effects θ̂_i from their weighted mean θ̄_w = Σ w_i θ̂_i / Σ w_i:

Q = Σ w_i (θ̂_i − θ̄_w)².  (1)

In Cochran,3 w_i is the reciprocal of the estimated variance of θ̂_i, resulting in Q_IV. In meta-analysis those w_i come from the fixed-effect model. In what follows, we discuss approximations to the distribution of Q_F and estimation of τ² when the w_i are arbitrary positive constants. Because it is most tractable, but still instructive, we focus on a single measure of effect, the mean difference (MD). In this favorable situation, the cumulative distribution function of Q_F can be evaluated by the algorithm of Farebrother.5 We also consider approximations that match the first two or the first three moments of Q_F. In simulations and examples, we concentrate on Q_SW. For comparison we also include some of the popular inverse-variance-based methods of estimating τ², approximating the distribution of Q_IV, and testing for the presence of heterogeneity. A simulation study provides a framework for assessing accuracy of the approximations, level and power of the tests based on Q_SW and Q_IV, and bias in estimating τ².

| EXPECTED VALUE OF Q_F AND ESTIMATION OF τ²
Define W = Σ w_i, q_i = w_i/W, and Θ_i = θ̂_i − θ. In this notation, and expanding θ̄_w, Equation (1) can be written as

Q_F = Σ w_i Θ_i² − W (Σ q_i Θ_i)².  (2)

Under the above REM, and assuming that the w_i are arbitrary fixed constants, it is straightforward to obtain the first moment of Q_F as

E(Q_F) = Σ w_i v²_i − (1/W) Σ w²_i v²_i + τ² (W − (1/W) Σ w²_i).  (3)

This expression is similar to Equation (4) in DerSimonian and Kacker.4 Rearranging the terms gives the moment-based estimator of τ²:

τ̂²_M = [Q_F − (Σ w_i v̂²_i − (1/W) Σ w²_i v̂²_i)] / (W − (1/W) Σ w²_i).  (4)

What is already known?
1. The conventional Q statistic in meta-analysis underlies the usual test for heterogeneity, but for small to medium sample sizes that test rejects far too often (its p-values are too low).
2. The use of inverse-variance weights based on estimated variances makes it very difficult to approximate the distribution of Q, which varies depending on the effect measure.
3. Related moment-based estimators of the heterogeneity variance (τ²), such as the DerSimonian-Laird estimator, have considerable bias.
What is new?
1. We introduce a new Q statistic with constant weights based on studies' effective sample sizes. Its null distribution is calculated exactly by the Farebrother algorithm; alternatively, a two-moment approximation can be used. Both provide very precise control of the significance level, even when sample sizes are small and unbalanced.
2. The new Q statistic yields a new estimator of the heterogeneity variance. This estimator, SDL, is almost unbiased for 10 or more studies, even with extremely small sample sizes.
Potential impact for RSM readers outside the authors' field

1. The usual chi-square test of heterogeneity generally has level much greater than 0.05, and its power is even lower than generally believed, because it uses an incorrect null distribution for Q.
2. Our new Q statistic, with constant weights, results in a very precise test, and the related new estimate of τ² is almost unbiased. We recommend its exclusive use in practice.
This equation is similar to Equation (6) in DerSimonian and Kacker4; they likewise use the within-study (i.e., conditional) estimates v̂²_i, although for some effect measures v̂²_i is a random variable whose distribution depends on that of θ̂_i.
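As a concrete illustration, Equation (4) is straightforward to compute for any fixed weights. The following Python sketch (function names are ours, not from the paper) implements Q from Equation (1) and the moment estimator of τ²:

```python
import numpy as np

def q_constant_weights(theta_hat, w):
    """Generalized Q statistic with fixed positive weights w (Equation (1))."""
    theta_bar = np.sum(w * theta_hat) / np.sum(w)
    return np.sum(w * (theta_hat - theta_bar) ** 2)

def tau2_moment(theta_hat, v2_hat, w):
    """Moment-based estimator of tau^2 (Equation (4)), truncated at zero."""
    W = np.sum(w)
    q = q_constant_weights(theta_hat, w)
    # E(Q_F) under tau^2 = 0, with estimated conditional variances plugged in
    e_null = np.sum(w * v2_hat) - np.sum(w**2 * v2_hat) / W
    denom = W - np.sum(w**2) / W
    return max(0.0, (q - e_null) / denom)
```

With w_i = 1/v̂²_i the null expectation reduces to K − 1 and the estimator becomes the DerSimonian-Laird estimator; with the effective-sample-size weights ñ_i it gives the SDL estimator studied below.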

| APPROXIMATIONS TO THE DISTRIBUTION OF Q F
For approximations to the distribution of Q_F, we draw on results for quadratic forms, which generalize the sums of squares that arise in analysis of variance. The Q statistic, Equation (2), can be expressed as a quadratic form in the random variables Θ_i. Appendix A.1 gives the details and discusses approaches for evaluating and approximating distributions of quadratic forms in normal variables. Conveniently, the variables Θ_i for the mean difference (MD) are normal.
Two approaches are most suitable, especially for obtaining upper-tail probabilities, P(Q_F > x). One matches moments of Q_F, either the first two or the first three; Appendix A.2 gives the details. The other uses an algorithm developed by Farebrother.5
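For intuition, the two-moment idea can be sketched in a few lines, assuming the mean and variance of Q_F are available (Appendix A.2 derives them); the function name is ours, and this is an illustration rather than the paper's implementation:

```python
from scipy.stats import chi2

def two_moment_tail(x, mean_q, var_q):
    """P(Q > x) under a Welch-Satterthwaite-type approximation Q ~ c * chi2(p),
    with c and p chosen to match the given first two moments of Q."""
    c = var_q / (2.0 * mean_q)    # scale, since Mean = c*p and Var = 2*c^2*p
    p = 2.0 * mean_q**2 / var_q   # degrees of freedom
    return chi2.sf(x / c, p)
```

As a sanity check, if Q really has a chi-square distribution with p degrees of freedom (mean p, variance 2p), the approximation is exact.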

| SIMULATION STUDY FOR MEAN DIFFERENCE
For MD as the effect measure, we use simulation of the distribution of Q with constant effective-sample-size weights (SW), ñ_i = n_iC n_iT / (n_iC + n_iT), to study three approximations: the Farebrother approximation (F SW), implemented in the R package CompQuadForm6; the two-moment Welch-Satterthwaite approximation (M2 SW); and the three-moment chi-square approximation (M3 SW) of Solomon and Stephens.7 Details of these two moment-based approximations are given in Appendix A.2. We also study the bias of the moment estimator τ̂²_M in Equation (4), denoted by SDL, for this choice of constant weights.
For comparison, we also simulate Q with IV weights and study three approximations to its distribution: the standard chi-square approximation, the approximation based on the Welch test to the null distribution of Q_IV, and the "exact" distribution of Biggerstaff and Jackson (BJ)8 when τ² > 0. To compare the bias of SDL with that of estimators of τ² that use the IV weights, we also consider DerSimonian and Laird (DL),9 Mandel and Paule (MP),10 REML, and a corrected DL estimator (CDL),1 which uses an improved non-null first moment of Q_IV. Table 1 lists abbreviations for all methods used in our simulations.
We varied five parameters: the number of studies K, the total (average) sample size of each study n (or n̄), the proportion of observations in the Control arm f, the between-study variance τ², and the within-study variance σ²_T (keeping σ²_C = 1). We set the overall true MD μ = 0 because the estimators of τ² do not involve μ and the estimators of μ are equivariant.
We generate the within-study sample variances s²_ij (j = T, C) from scaled chi-square distributions, s²_ij ∼ σ²_ij χ²_{n_ij − 1} / (n_ij − 1), and the estimated mean differences y_i from a normal distribution with mean 0 and variance σ²_iT/n_iT + σ²_iC/n_iC + τ². We obtain the estimated within-study variances as v̂²_i = s²_iT/n_iT + s²_iC/n_iC. As would be required in practice, all approximations use these v̂²_i, even though the σ²_iT/n_iT + σ²_iC/n_iC are available in the simulation. All simulations use the same numbers of studies K = 5, 10, 30 and, for each combination of parameters, the same vector of total sample sizes n = (n_1, …, n_K) and the same proportions of observations in the Control arm, f_i = 0.5, 0.75 for all i. The sample sizes in the Treatment and Control arms are n_iT = ⌈(1 − f_i)n_i⌉ and n_iC = n_i − n_iT, i = 1, …, K. The values of f reflect two situations for the two arms of each study: approximately equal (1:1) and quite unbalanced (1:3).
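The data-generating steps above can be translated into code. This is an illustrative Python sketch, not the authors' R program; the function name and the seed are ours:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_study(n_i, f_i, sigma2_T, sigma2_C, tau2):
    """One simulated MD study under the simulation design (mu = 0):
    arm sizes from f_i, MD estimate from the REM, sample variances
    from scaled chi-square distributions."""
    n_T = int(np.ceil((1 - f_i) * n_i))
    n_C = n_i - n_T
    true_var = sigma2_T / n_T + sigma2_C / n_C
    y = rng.normal(0.0, np.sqrt(true_var + tau2))          # estimated MD
    s2_T = sigma2_T * rng.chisquare(n_T - 1) / (n_T - 1)   # sample variance, Treatment
    s2_C = sigma2_C * rng.chisquare(n_C - 1) / (n_C - 1)   # sample variance, Control
    v2_hat = s2_T / n_T + s2_C / n_C                       # estimated within-study variance
    return y, v2_hat, n_T, n_C
```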
We study equal and unequal study sizes. For equal study sizes, n_i is as small as 20, and for unequal study sizes the average sample size n̄ is as small as 13 (individual n_i are as small as 4), in order to examine how the methods perform for the extremely small sample sizes that arise in some areas of application. In choosing unequal study sizes, we follow a suggestion of Sánchez-Meca and Marín-Martínez.11 Table 2 gives the details.
We use a total of 10,000 replications for each combination of parameters. Thus, the simulation standard error for an empirical p-value p̂ under the null is roughly √(1/(12 × 10,000)) = 0.0029. The simulations were programmed in R version 3.6.2 using the University of East Anglia 140-computer-node High Performance Computing (HPC) Cluster, providing a total of 2560 CPU cores, including parallel processing and large memory resources. For each configuration, we divided the 10,000 replications into 10 parallel sets of 1000.

| RESULTS
For each configuration of parameters in the simulation study and for each approximation, we calculated, for each generated value of Q, the probability of a larger Q under that approximation (the parameter configurations appear in Table 2). The approximations to the non-null distribution of Q were based on the value of τ² used in the simulation. These data provide the basis for P-P plots (versus the true null distribution) for three approximations to the distribution of Q with effective-sample-size weights (F SW, M2 SW, and M3 SW) and two approximations to the distribution of Q with IV weights (chi-square/BJ and Welch) and for estimating their null levels, non-null empirical tail areas, and (roughly) their power. We also estimate the bias of five point estimators of τ² (SDL, DL, REML, MP, and CDL). In Figures 1-5, we present configurations that illustrate the differences in methods very clearly. The full results are presented, graphically, in Appendix B of Kulinskaya et al.12
In some instances M3 SW produced anomalous results or no results at all (because numerical problems kept us from obtaining estimates of its parameters).

| P-P plots
To compare an approximation for a distribution function of Q against the theoretical distribution function, with no heterogeneity (τ 2 = 0), we use probability-probability (P-P) plots. 13 Evaluating two distribution functions, F 1 and F 2 , at x yields p 1 = F 1 (x) and p 2 = F 2 (x). One varies x, either continuously or at selected values, and plots the points (p 1 (x), p 2 (x)) to produce the usual P-P plot of F 2 versus F 1 . If F 2 = F 1 , the points lie on the line from (0, 0) to (1,1). If smaller x are more likely under F 2 , the points will lie above the line, and conversely. (Working with upper tail areas reverses these interpretations.) If F 2 is similar to F 1 , the points will lie close to the line, and departures will show areas of difference. To make these more visible, we flatten the plot by subtracting the line; that is, we plot p 2 À p 1 versus p 1 .
The simulations offer a shortcut that does not require evaluating the true distribution function of Q (which is unknown for IV weights). If F is the distribution of the random variable X, F(X) has the uniform distribution on [0, 1], and so does 1 − F(X). Thus, for the values of p listed above, we plot p̂ − p versus p.
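A minimal sketch of this flattened P-P computation, assuming p̂ is taken as the empirical CDF of the simulated tail areas evaluated on a grid of p values (the function name and grid are ours):

```python
import numpy as np

def flattened_pp_points(tail_areas, grid=None):
    """Return (p, p_hat - p) for the flattened P-P plot: tail_areas are the
    approximation-based tail probabilities of simulated Q values, which
    should be uniform on [0, 1] if the approximation is exact."""
    tail_areas = np.sort(np.asarray(tail_areas))
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    # empirical CDF of the tail areas at each grid point
    ecdf = np.searchsorted(tail_areas, grid, side="right") / len(tail_areas)
    return grid, ecdf - grid
```

Exactly uniform inputs give deviations near zero; a liberal approximation shows positive deviations near p = 0.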
Our P-P plots (illustrated by Figure 1) show no differences between the M3 and M2 approximations for Q with constant weights. Very minor differences between the Farebrother and the moment approximations are visible, mainly at very small sample sizes. Other comparisons show three distinct patterns.

TABLE 2  Simulation parameter values: K (number of studies); equal and unequal study sizes; n or n̄ (size of an individual study, as the total of the two arms). For K = 10 and K = 30, the same set of unequal study sizes is used twice or six times, respectively.

FIGURE 1  P-P plots of the Farebrother, M2, and M3 approximations to the distribution of Q with sample-size-based weights, and of the chi-square and Welch approximations to the distribution of Q with IV-based weights. First row: unequal sample sizes; second and subsequent rows: equal sample sizes.

FIGURE 5  Bias in estimation of the between-study variance τ² by five methods: SDL, DL, REML, MP, and CDL. First row: unequal sample sizes; second and subsequent rows: equal sample sizes.

The chi-square approximation has strikingly higher empirical tail areas than the true distribution of Q with IV weights over the whole domain. This pattern is especially noticeable for K = 30 and small unequal sample sizes, though it persists for equal sample sizes as large as 100. It indicates that the critical values of the approximating chi-square distribution are systematically too small for Q with IV weights.
The Welch test provides a much better fit that is especially good for balanced sample sizes, equal variances, and small K. When sample sizes are small and vary among studies or are unbalanced between arms, however, its fit is worse. It produces values of Q that are systematically too small when K = 5; produces more small values and, to a lesser extent, more large values when K = 10; and produces more large values and, to a lesser extent, more small values when K = 30.
The three approximations to Q with constant weights provide reasonably good fits, which appear to be similar to the fit of the Welch test to Q with IV weights.

| Empirical levels when τ² = 0
To better visualize the quality of the approximations as the basis for a test for heterogeneity at the 0.05 level, we plot their empirical levels under the null τ 2 = 0 versus sample size. Figure 2 presents typical results for a range of sample sizes at the 0.05 level.
For equal variances, the empirical levels depend on the sample size. The chi-square test is very liberal up to n = 100, especially for unbalanced arms, and the problem becomes worse as K increases. The Welch test is considerably better than the chi-square test, but it is still noticeably liberal when the arms are unbalanced. Tests based on Q with constant weights are generally less liberal, though they may have level up to 0.07 for n = 20 with unbalanced arms and small K. The M3 approximation breaks down and results in very liberal levels for unequal sample sizes, unbalanced arms, and large K. The Farebrother and M2 approximations perform better for larger K, and overall are the best choice. They also hold the level well at smaller nominal levels. The Welch test is rather unstable at very low levels such as α = 0.001 (which corresponds, in our simulations, to just 10 occurrences in 10,000 replications), but it improves from α = 0.005 onward.

| Empirical levels when τ² > 0
To understand how the approximations behave as τ² increases, we plot the empirical p-values (p̂) versus τ² for the nominal levels 0.05 and 0.01 (Figure 3). For unequal sample sizes, the Farebrother and the three-moment approximations differ slightly at the 0.01 level, but those differences disappear at the 0.05 level and for equal sample sizes. When K = 30, M3 sometimes fails; and when it does not, it breaks down for small and large values of τ². The two-moment approximation is almost indistinguishable from the Farebrother approximation.
Overall, the Farebrother approximation performs superbly across all τ² values. This is as it should be, as it is practically an exact distribution in the case of MD. The M2 approximation is reasonably good at the 0.05 level. The BJ approximation is much too liberal, especially at smaller values of τ² and for larger K. It is considerably more liberal for very small sample sizes such as n̄ = 13, but it improves when sample sizes increase, and it is reasonable by n = 100 or n̄ = 60. For larger values of n and n̄ (not shown in Figure 3), the traces approach α as n or n̄ increases (they are farther away from α when n < 30).

| Power of tests for heterogeneity
"Power" is a reasonable term as a heading, but not as an accurate description of most of the results. Comparisons of power among tests that are intended to have a specified level (i.e., rate of Type I error) are not valid unless the tests' estimated levels are equal or nearly so, although discussions of simulation results in meta-analysis do not always observe this requirement. This complication is evident in Figure 4, which depicts the power of tests of heterogeneity at the 0.05 level for n = 20 and equal and unequal sample sizes.
The chi-square test appears to be more powerful, and the Welch test slightly less powerful, than the tests based on Q with constant weights. These differences are much smaller when n = 40 (not shown) and disappear when n is larger. But even for n = 20, these appearances are misleading. For n = 20, Figure 2 shows that for balanced arms, the level of the chi-square test is 0.08 for K = 5, 0.1 for K = 10, and considerably higher than 0.1 for K = 30. For unbalanced arms, the level of the chi-square test substantially exceeds 0.1 for all K. This behavior is a consequence of using an incorrect null distribution. Thus, our results do not show that the chi-square test has higher power, and its power may actually be lower. It is not clear how to modify the chi-square test so that it has the correct level in a broad range of situations.
The Welch test has levels similar to those of the tests based on Q with constant weights when K = 5 or 10. But for K = 30 and f = 0.75, its level is approximately 0.09. This may mean that it has somewhat lower power.
When n = 40, the traces rise more steeply, and when n < 30, they spread out and rise less steeply. When n ≥ 100 (or n̄ ≥ 60 for unequal sample sizes), visible differences among the traces for the tests disappear. Given the higher levels of the chi-square test, this means that its power is the same as or even lower than that of the tests based on Q with constant weights.

| Bias in estimation of τ²
Here we compare the SDL estimator of τ² with the well-known estimators DL, MP, and REML and the recently suggested CDL. Figure 5 depicts the biases of the five estimators for small sample sizes.
All five estimators have positive bias at τ² = 0, because of truncation at zero. The bias across all values of τ² is quite substantial, and it increases for unequal variances and/or sample sizes. Among the standard estimators, DL has the most bias and MP the least. SDL and CDL generally have similar bias, considerably less than the standard estimators. The relation of their bias to K when n̄ = 13 is interesting, but atypical. As K increases, the trace for SDL flattens toward 0, demonstrating no bias at all for larger values of τ², whereas the trace for CDL rises toward the other three. The traces flatten and approach 0 as n̄ increases to 15 and 30. When n ≥ 100 or n̄ ≥ 60, the differences among the five estimators of τ² are quite small.

| EXAMPLES

| Exercise training in people with heart failure

The systematic review of Rees et al 14 studied results of short-term trials of exercise training in people with mild to moderate heart failure. Exercise capacity was assessed by the maximal oxygen uptake, VO2max; an increase from baseline to follow-up indicates improvement with exercise. However, for the pooled analysis, the authors reversed the sign of the mean change in VO2max for both the intervention and control groups, so the beneficial effect is negative. We consider the results from Comparison 2.1.7, for the K = 15 studies with mean age above 55 years. Figure 6 shows the data and forest plot. The sample sizes in these trials are rather small, varying from 7 to 48 per arm; the average sample size is 18.4 in the treatment arm and 17.6 in the control arm. The trials are mostly balanced, with only one trial having a 2:1 allocation ratio, and they have similar variances in the two arms.
The review used a DL-based analysis and found significant heterogeneity (p = 0.03), I² = 45.73%, τ̂² = 0.79, and a significant effect of exercise, with a mean difference in VO2max of −1.77 (−2.50, −1.03). Table 3 brings together meta-analyses of these data by seven methods. When testing heterogeneity, the standard chi-square test gives p-value 0.027, and the Welch test gives 0.030, indicating significant heterogeneity. These differ substantially from the p-values for Q_SW with constant weights, where all three approximations give 0.43 or 0.44. This agrees with our simulation results, illustrating how liberal the standard heterogeneity tests can be in the case of small sample sizes and medium to large K.
Comparing the estimated τ² values, the DL method provides an estimate of 0.791. CDL is very similar at 0.783, the REML estimate is lower at 0.652, and the MP estimate is considerably lower at 0.255. Unsurprisingly, the SDL estimate is very close to zero, at 0.009. Because the standard estimators are all positively biased in this setting, we consider the SDL estimate to be the closest to the true value of τ².
These differences in the estimated heterogeneity variance have no substantial impact on the estimated overall effect of exercise on VO2max. Table 3 includes IV estimates of Δ with 95% confidence intervals. Because of IV weighting, the smaller τ² values result in stronger effects of exercise. SDL results in the most pronounced effect, −2.14 (−2.60, −1.68). However, we do not recommend IV weights for pooling effects, and instead advocate effective-sample-size-based methods.1 These weights are denoted by SSW in Table 3, and the corresponding confidence intervals are based on t_{K−1} critical values. Ironically, for these data the result, −1.78 (−2.37, −1.18), is very close to the original estimate reported in Rees et al.14 The differences between SDL and other estimators of τ² are rather striking. However, they have a simple explanation. The largest study, Belardinelli 1999, the first on the forest plot in Figure 6, is a low outlier with Δ̂₁ = −3.4, and its inverse-variance weight, when τ² = 0, is 39.3%. This study is the major contributor to the high value of Q_IV = 25.79 and the only reason for the seemingly high heterogeneity. The SSW weight of this study is less than half as large, at 17.5%, and the test based on Q_SW does not find heterogeneity in the data. Setting this study aside decreases the Q_IV statistic to 7.41 on 13 df; the p-values for all Q tests are very similar, at 0.88 for all IV tests and at 0.87 for the tests based on Q_SW; and all estimators of τ² agree on τ̂² = 0.
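For readers who want to reproduce this style of analysis, here is a minimal Python sketch of SSW pooling with a t_{K−1} interval. The standard-error formula is one plausible weighted choice, included for illustration only; it is not necessarily the exact formula used in the paper:

```python
import numpy as np
from scipy.stats import t

def ssw_pooled(theta_hat, n_T, n_C, level=0.95):
    """Pool effect estimates with effective-sample-size weights and
    form a t_{K-1}-based confidence interval (illustrative SE choice)."""
    w = n_T * n_C / (n_T + n_C)                  # effective sample sizes
    est = np.sum(w * theta_hat) / np.sum(w)
    K = len(theta_hat)
    # weighted-sample-variance-based SE of the weighted mean (an assumption)
    se = np.sqrt(np.sum(w * (theta_hat - est) ** 2) / ((K - 1) * np.sum(w)))
    half = t.ppf(1 - (1 - level) / 2, K - 1) * se
    return est, (est - half, est + half)
```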

| Drugs for prevention of exercise-induced asthma
The systematic review of Spooner et al 15 compared several types of drugs for prevention of exercise-induced asthma attacks in asthma sufferers. We consider Comparison 6.2.2, which compared inhaling a single dose of mast cell stabilizer (MCS) prior to strenuous exercise with a single dose of short-acting beta-agonists (SABA). The measure of effect was the maximum percentage decrease in pulmonary function (PFT). This meta-analysis pooled results from seven high-quality clinical trials involving a total of 187 patients. Figure 7 shows the data and forest plot. The sample sizes in these perfectly balanced trials vary from 8 to 20 per arm; the average sample size is 13.4 in each arm; the variances mostly differ in the two arms, but without any clear pattern. The review used a DL-based analysis and found that heterogeneity was not significant, τ̂²_DL = 0.65 and I² = 2.14%, and that SABA provided significantly lower PFT, Δ̂ = 6.32 (2.47, 10.18). Table 4 shows the results of meta-analyses of these data by seven methods. Heterogeneity is not significant by any method: the p-values are 0.409 for the chi-square and Welch tests and 0.799 to 0.812 for all three approximations to the distribution of Q_SW. However, the estimated values of τ² vary widely: 0 for SDL, 0.34 for MP, 0.66 and 0.67 for CDL and DL, and 9.82 for REML. These results agree with the positive biases in estimation of τ² at zero in our simulation results, though the result for REML is quite aberrant. The maximum-likelihood estimator of τ² behaves similarly; its value is not so extreme, at 5.72. The presence of a study with a noticeably lower Δ̂_i whose estimated variance is substantially lower strains the assumption that Δ_i ∼ N(Δ, τ²).
These differences in the estimated values of τ² are reflected in the width of confidence intervals for the pooled effect, but even more so in the width of prediction intervals.16 As the SDL estimate is zero, its prediction interval does not differ from the confidence interval; MP IV has a 95% prediction interval of (1.04, 11.47), DL IV a somewhat wider prediction interval of (0.85, 11.80), and REML IV a much wider interval of (−2.80, 17.58). Thus, the REML IV analysis does not find SABA drugs to be more beneficial than MCS. This conclusion does not change under the approach of Reference 19, which results in a slightly tighter prediction interval of (−2.09, 16.87).
In the forest plot (Figure 7), the study by Vazquez 1984 has a considerably lower mean than the other studies; and, because of its lower variance, its weight varies from 33.3% in the REML IV analysis to 45.5% in the DL IV analysis, in comparison to 13% in SSW. As a result, all the IV-weighted methods yield substantially lower estimates of the pooled effect (6.19 to 7.39) than SSW (9.30).
Once more, for these data, the sample-size-based weights provide more robust and more sensible inference than the IV-weighted methods.

| DISCUSSION
As a way of avoiding the shortcomings associated with the customary Q, which uses inverse-variance weights based on estimated variances, we have been studying a version of Q in which the weights are fixed constants. Such weights simplify derivation of higher moments of Q and facilitate approximation of its distribution.
In a simulation study we compared the properties of the test for heterogeneity for MD based on a Q statistic that uses constant sample-size-based weights, Q SW , with its IV-weights-based counterparts. From Q SW we also derived an estimator (SDL) of the heterogeneity variance τ 2 ; the simulation yielded estimates of its bias and comparisons with the bias of several other estimators.
A large number of small studies is the worst-case scenario for the statistical properties of meta-analysis.1 This situation may not be very widespread in medical meta-analyses, but it is very common in the social sciences and in ecology.20,21 Thus, our simulations included additional small sample sizes.
Overall, the proposed test for heterogeneity for MD, combined with its exact distribution as obtained by the Farebrother algorithm5 or, alternatively, with the two-moment approximation, provides very precise control of the significance level, even when sample sizes are small and unbalanced, in contrast to the extremely liberal behavior of the standard tests, especially for a large number of studies. (These results suggest that the null distribution of Q_IV is more difficult to approximate than the null distribution of Q_F.) Similarly, the proposed SDL estimator is almost unbiased for K ≥ 10, even in the case of extremely small sample sizes, and we recommend its exclusive use in practice.
Further, because it uses an incorrect null distribution for Q IV , the chi-square test generally has level much greater than 0.05, so our simulations could give only substantially inflated estimates of its power. An important conclusion of our work is that the power of the popular Q test is even lower than generally believed. As another consequence of the incorrect null distribution, we avoid I 2 and related measures of heterogeneity.
Our meta-analyses of the data from Rees et al 14 demonstrated just how liberal the standard tests for heterogeneity are. However, the substantial differences among the estimates of τ 2 produced only modest differences among the estimates and confidence intervals for the overall effect. On the other hand, the example illustrated how easily a single discrepant study could distort the IVweighted estimates of τ 2 .
In a second example none of the methods found significant heterogeneity. The SDL estimate wasτ 2 ¼ 0, whereas the IV-weighted methods produced substantial positive estimates, consistent with the biases that we found in our simulations. In this instance the SSW estimate of the overall effect was noticeably higher than the IV-weighted estimates.
It is enlightening to observe that, for the non-null distribution of Q IV , the approximation of Biggerstaff and Jackson 8 (using Farebrother's algorithm) is no better than the standard chi-square approximation to the null distribution. The problem here evidently lies with the IV weights.
We found that, even though both moment approximations performed well overall, the three-moment approximation sometimes fails, and it breaks down in the case of very small and unbalanced sample sizes and a large number of studies. Therefore, for MD we recommend the Farebrother 5 approximation to the distribution of Q with constant weights.
In further work we intend to develop tests for heterogeneity in other effect measures based on Q with constant weights. Even though we derived general expressions for moments of Q, application of these expressions to such effect measures as SMD and the log-odds-ratio involves a lot of tedious algebra. The moment approximations are less precise than the exact distribution or the approximation by Farebrother 5 for the case of normal variables in the quadratic form, but they are much faster and may be a better option when the distribution is only asymptotically normal.

APPENDIX A
This appendix assembles the more-technical information related to evaluating and approximating the distribution of Q. Appendix A.1 discusses approaches, in the broader context of quadratic forms in normal random variables. Appendix A.2 explains the form of the two-moment and three-moment approximations. Then Appendix A.3 presents derivations for the variance and third moment of Q. The resulting expressions involve the first six unconditional moments of Θ i . Appendix A.4 develops those moments for a general effect measure, and Appendix A.5 applies and simplifies them for the mean difference.

A.1 | Approximations to the distribution of quadratic forms in normal variables
The Q statistic, Equation (2), is a quadratic form in the random variables Θ_i. We can write Q = Θ^T A Θ for a symmetric matrix A of rank K − 1 with elements a_ij = q_i δ_ij − q_i q_j, 1 ≤ i, j ≤ K, where δ_ij is the Kronecker delta. In this section we assume constant weights unless stated otherwise. Unconditionally, the Θ_i are centered at 0, but they are not, in general, normally distributed. However, for large sample sizes n_i, their distributions are approximately normal. Normality holds exactly for the mean difference (MD). In this case the exact distribution of the quadratic form is that of a weighted sum of central chi-square variables. But the cumulative distribution function of Q needs to be evaluated numerically. Therefore, we consider suitable approximations.
Quadratic forms in normal variables have an extensive literature. When the vector Θ has the multivariate normal distribution N(μ, Σ), the exact distribution of Q is that of $\sum_{r=1}^{m} \lambda_r \chi^2_{h_r}(\delta_r^2)$, where the $\lambda_r$ are the eigenvalues of AΣ, the $h_r$ are their multiplicities, and the $\delta_r^2$ are the non-centrality parameters of the independent chi-square variables $\chi^2_{h_r}(\delta_r^2)$ with $h_r$ degrees of freedom. (The $\delta_r$ are linear combinations of $\mu_1, \ldots, \mu_K$.) Interest typically centers on the upper-tail probabilities P(Q > x). Moment-based approximations match a particular distribution, often a gamma distribution or, equivalently, a scaled chi-square distribution, to several moments of Q. These methods include the well-known Welch-Satterthwaite approximation, which uses $c\chi^2_p$ and matches the first two moments. 22,23 Imhof 24 investigated an approximation to the distribution of a quadratic form in noncentral normal variables by matching a central chi-square distribution to three moments (including the skewness); the approximation standardizes Q and $\chi^2_{h'} - h'$ and equates their first three moments, with h′ chosen to match the skewness. Pearson 25 first suggested this approach to approximate a noncentral chi-square distribution. Liu et al 26 proposed a four-moment noncentral chi-square approximation. To approximate the probability that a standardized Q exceeds t*, they use the probability that a standardized noncentral chi-square exceeds t*, equating the skewness of the two distributions and matching the kurtosis as closely as possible.
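As a concrete sketch (the weights and within-study variances below are hypothetical, chosen only for illustration), the following Python code builds the matrix A for constant normalized weights, obtains the eigenvalues of AΣ in the central case (τ² = 0, μ = 0), and compares the Welch-Satterthwaite two-moment tail approximation with a Monte Carlo evaluation of the exact weighted-chi-square distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical constant weights (e.g., effective sample sizes) and variances
w = np.array([10.0, 20.0, 30.0, 15.0, 25.0])
q = w / w.sum()                                  # normalized weights, sum to 1
v2 = np.array([0.40, 0.22, 0.15, 0.30, 0.18])    # v_i^2, with tau^2 = 0

# A has elements a_ij = q_i*delta_ij - q_i*q_j; it has rank K - 1
A = np.diag(q) - np.outer(q, q)
lam = np.sort(np.linalg.eigvals(A @ np.diag(v2)).real)[::-1]

# Exact central representation: Q ~ sum_r lam_r * chi2_1
mean_q = lam.sum()                # E(Q) = tr(A Sigma) = sum q_i(1-q_i) v_i^2
var_q = 2.0 * (lam**2).sum()      # var(Q) in the central case

# Welch-Satterthwaite: match c * chi2_p to the first two moments
c = var_q / (2.0 * mean_q)
p = 2.0 * mean_q**2 / var_q

x = 3.0 * mean_q                  # an upper-tail evaluation point
approx = stats.chi2.sf(x / c, p)

# Monte Carlo check of the exact weighted-chi-square tail
draws = rng.chisquare(1, size=(200_000, lam.size)) @ lam
mc = (draws > x).mean()
```

Here the eigenvalues are fairly homogeneous (small coefficient of variation), so by Yuan and Bentler's criterion the two-moment approximation should be close to the Monte Carlo tail probability.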
Yuan and Bentler 27 studied, by simulation, the Type I errors of a Q test with the critical values based on the Welch-Satterthwaite approximation. They concluded that this approximation is satisfactory when the eigenvalues do not have too large a coefficient of variation, preferably less than 1. For larger CV, the Type I errors may be larger than nominal.
For the general case of a noncentral quadratic form, the distribution of Q can be approximated by the distribution of $cU^r$, where the distribution of U may depend on one or two parameters. The choice of c, r, and the parameters of U then permits matching the necessary moments. Solomon and Stephens 7 consider three moment-based approximations: a four-moment approximation by a Type III Pearson curve and two three-moment approximations, one with $U \sim N(\mu, \sigma^2)$ and the other with $U \sim \chi^2_p$. They recommend the latter as fitting better in the lower tail, partly because it necessarily starts at zero, whereas the other approximations do not. This approximation matches the constants c, r, and p to the first three moments of Q. For $c(\chi^2_p)^r$ the moments about 0 are $\mu'_k = c^k 2^{kr} \Gamma(kr + p/2)/\Gamma(p/2)$. Other, more-complicated methods include numerical inversion of the characteristic function 24 ; this can be made very accurate, with bounds on accuracy. The algorithm of Sheil and O'Muircheartaigh, 28 improved by Farebrother, 5 represents the value of the c.d.f. of a noncentral quadratic form by an infinite sum of central chi-square probabilities. Kuonen 29 proposes a saddlepoint approximation, and Zghoul 30 and Ha and Provost 31 consider approximations by Hermite and Laguerre polynomials. The first two methods are nearly exact and perform better than Pearson's three-moment approximation by a central chi-square distribution or, in the noncentral case, the four-moment approximation by a Type III Pearson curve. 24,6 Bodenham and Adams 32 and Chen and Lumley 33 discuss the behavior of various approximations when K is large.
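For intuition, here is a minimal Python sketch of the characteristic-function-inversion route, using Imhof's formula for the central case with unit multiplicities; the eigenvalues and quadrature settings are illustrative, not the tuned error bounds of the published algorithms:

```python
import numpy as np
from scipy import integrate, stats

def imhof_sf(x, lam):
    """P(Q > x) for Q = sum_r lam_r * chi2_1 (central case, h_r = 1),
    by numerical inversion of the characteristic function (Imhof's formula)."""
    lam = np.asarray(lam, dtype=float)

    def integrand(u):
        theta = 0.5 * np.arctan(lam * u).sum() - 0.5 * x * u
        rho = np.prod((1.0 + lam**2 * u**2) ** 0.25)
        return np.sin(theta) / (u * rho)

    val, _ = integrate.quad(integrand, 0.0, np.inf, limit=200)
    return 0.5 + val / np.pi

# Sanity check: equal eigenvalues reduce Q to an ordinary chi-square
p_imhof = imhof_sf(5.0, [1.0, 1.0, 1.0])
p_exact = stats.chi2.sf(5.0, df=3)      # the two should agree closely

# A case with unequal (hypothetical) eigenvalues
p_unequal = imhof_sf(1.0, [0.7, 0.2, 0.1])
```

The integrand is bounded near u = 0 and decays polynomially, so ordinary adaptive quadrature suffices for small K; the production algorithms (Imhof, Farebrother) add explicit truncation and error control.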
We are aware of only one paper 34 on the asymptotic (K → ∞) distribution of quadratic forms in non-normal iid random variables with finite sixth moment. That distribution can be approximated by the distribution of a second-order polynomial in normal variables.
In meta-analysis, approximations to the distribution of Q have usually been sought only for the Q IV version with non-constant inverse-variance weights. Typically, the chi-square distribution with K − 1 degrees of freedom is used indiscriminately as the null distribution of Q IV . For MD, Kulinskaya et al 35 introduced an improved two-moment approximation to this version of Q based on the Welch 36 test in heteroscedastic ANOVA. The distribution of this Welch test for MD is approximated under the null by a rescaled F distribution, and under alternatives by a shifted chi-square distribution. Kulinskaya et al also explored improved moment-based approximations for some other effect measures, 37-39 using two-moment approximations with a scaled chi-square distribution to the null distribution of Q IV . Biggerstaff and Jackson 8 used the Farebrother approximation to the distribution of a quadratic form in normal variables as the "exact" distribution of Q IV . This is not correct when the weights are the reciprocals of estimated variances, but with constant weights it is correct for MD. When τ² = 0, the Biggerstaff and Jackson approximation to the distribution of Q IV is the $\chi^2_{K-1}$ distribution.
A.2 | Two- and three-moment approximations to the distribution of Q
The two- and three-moment approximations to the distribution of Q use the distribution of a transformed chi-square random variable $c(\chi^2_p)^r$. The parameters c, r, and p are found by matching the first two or three moments.
A.2.1 | Two-moment approximation
The kth moment about zero of $c(\chi^2_p)^r$ is $\mu'_k = c^k 2^{kr} \Gamma(kr + p/2)/\Gamma(p/2)$. For the two-moment approximation we take r = 1, so that Q ∼ $c\chi^2_p$; then $\mu'_1 = cp$ and $\mu'_2 = c^2 p(p+2)$, and matching the first two moments of Q yields c = var(Q)/(2E(Q)) and p = 2E(Q)²/var(Q).

A.2.2 | Three-moment approximation
For the three-moment approximation we have Q ∼ $c(\chi^2_p)^r$. As in the two-moment case, setting k = 1, 2, 3 gives the system of equations $\mu'_k = c^k 2^{kr} \Gamma(kr + p/2)/\Gamma(p/2)$, k = 1, 2, 3. Dividing $\mu'_2$ by $\mu'_1$, we obtain the following expression for c: $c = (\mu'_2/\mu'_1)\,\Gamma(r + p/2)/[2^r \Gamma(2r + p/2)]$. To eliminate c, define $A = \mu'_2/(\mu'_1)^2$ and $B = \mu'_3/(\mu'_1)^3$. Then we have the following two nonlinear equations: $A = \Gamma(2r + p/2)\,\Gamma(p/2)/\Gamma(r + p/2)^2$ and $B = \Gamma(3r + p/2)\,\Gamma(p/2)^2/\Gamma(r + p/2)^3$. We solve this system for p and r by using the function "multiroot" in the R package rootSolve 40 with starting values r = 1 and c and p from the two-moment approximation.
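As an illustrative Python sketch of the same root-finding step (the paper itself uses R's rootSolve), the system above can be solved with scipy. The moments below are those of a plain $\chi^2_4$, chosen so that the fit should recover the known solution c = 1, r = 1, p = 4:

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import gammaln

# First three moments of Q about zero (here: those of a chi-square with 4 df,
# chosen so the known solution c = 1, r = 1, p = 4 can be checked)
m1, m2, m3 = 4.0, 24.0, 192.0
A = m2 / m1**2
B = m3 / m1**3

def equations(x):
    r, p = x
    # Log forms of A = Gamma(2r+p/2) Gamma(p/2) / Gamma(r+p/2)^2 and
    #              B = Gamma(3r+p/2) Gamma(p/2)^2 / Gamma(r+p/2)^3
    eq_a = gammaln(2*r + p/2) + gammaln(p/2) - 2*gammaln(r + p/2) - np.log(A)
    eq_b = gammaln(3*r + p/2) + 2*gammaln(p/2) - 3*gammaln(r + p/2) - np.log(B)
    return [eq_a, eq_b]

# Starting values: r = 1 and p from the two-moment approximation
p0 = 2 * m1**2 / (m2 - m1**2)
r, p = fsolve(equations, [1.0, p0])

# Recover c from mu'_1 = c * 2^r * Gamma(r + p/2) / Gamma(p/2)
c = m1 * np.exp(gammaln(p/2) - gammaln(r + p/2)) / 2**r
```

Working with log-gamma values keeps the equations numerically stable for larger p and r, mirroring the role of the two-moment solution as a starting point.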
A.3 | Variance and third moment of Q
For approximations based on the first two or three moments, we need the second and the third moments of Q under the REM introduced in Section 1. We distinguish between the conditional distribution of Q (given the θ_i) and the unconditional distribution, and the respective moments of $\Theta_i$. For instance, the conditional second moment of $\Theta_i$ is $M^c_{2i} = v_i^2$, and the unconditional second moment is $M_{2i} = v_i^2 + \tau^2$; similarly, $M_{4i}$ is the fourth (unconditional) central moment of $\hat\theta_i$. These two moments are required to calculate the variance of Q; Appendix A.3.1 gives the details. When the weights are not related to the effect, these expressions for the mean and variance of Q are the same as in Kulinskaya et al. 37 For (known) inverse-variance weights $w_i = v_i^{-2}$, and assuming that each $\hat\theta_i$ is normally distributed and τ² = 0, so that $M_{2i} = v_i^2$ and $M_{4i} = 3v_i^4$, the first moment of Q is K − 1, and the variance is 2(K − 1), as it should be for a chi-square distribution with K − 1 degrees of freedom.
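The last claim admits a quick numerical check (a sketch with hypothetical within-study variances): simulating normal estimates with τ² = 0 and known inverse-variance weights, the simulated mean and variance of Q should be close to K − 1 and 2(K − 1):

```python
import numpy as np

rng = np.random.default_rng(7)
K = 5
v2 = np.array([0.40, 0.22, 0.15, 0.30, 0.18])  # hypothetical v_i^2
w = 1.0 / v2                                   # known inverse-variance weights

# Simulate theta_hat_i ~ N(0, v_i^2) (tau^2 = 0) and compute Q in each replicate
theta = rng.normal(0.0, np.sqrt(v2), size=(200_000, K))
theta_bar = (theta * w).sum(axis=1) / w.sum()  # IV-weighted mean
Q = (w * (theta - theta_bar[:, None])**2).sum(axis=1)

mean_Q = Q.mean()   # should be close to K - 1 = 4
var_Q = Q.var()     # should be close to 2(K - 1) = 8
```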
In general, the unconditional moments M 2i and M 4i depend on the effect measure (through its second and fourth conditional moments) and on the REM that defines the unconditional moments. Appendix A.4 gives the details.
Under the null hypothesis τ² = 0, and the unconditional moments of Q coincide with its conditional moments.
The derivation of the unconditional third moment of Q parallels that for the second moment, starting from Equation (2). Appendix A.3.2 gives the details of the derivation. Importantly, the third moment involves $M_{3i} = E(\Theta_i^3)$ and cross-product terms of the form $\sum\sum\sum_{i \ne j \ne k} q_i(1 - q_i)\,q_j(1 - q_j)\,q_k(1 - q_k)\,M_{2i} M_{2j} M_{2k}$.