Estimation in meta‐analyses of mean difference and standardized mean difference

Methods for random‐effects meta‐analysis require an estimate of the between‐study variance, τ 2. The performance of estimators of τ 2 (measured by bias and coverage) affects their usefulness in assessing heterogeneity of study‐level effects and also the performance of related estimators of the overall effect. However, as we show, the performance of the methods varies widely among effect measures. For the effect measures mean difference (MD) and standardized MD (SMD), we use improved effect‐measure‐specific approximations to the expected value of Q for both MD and SMD to introduce two new methods of point estimation of τ 2 for MD (Welch‐type and corrected DerSimonian‐Laird) and one WT interval method. We also introduce one point estimator and one interval estimator for τ 2 in SMD. Extensive simulations compare our methods with four point estimators of τ 2 (the popular methods of DerSimonian‐Laird, restricted maximum likelihood, and Mandel and Paule, and the less‐familiar method of Jackson) and four interval estimators for τ 2 (profile likelihood, Q‐profile, Biggerstaff and Jackson, and Jackson). We also study related point and interval estimators of the overall effect, including an estimator whose weights use only study‐level sample sizes. We provide measure‐specific recommendations from our comprehensive simulation study and discuss an example.


INTRODUCTION
Meta-analysis is a statistical methodology for combining estimated effects from several studies in order to assess their heterogeneity and obtain an overall estimate. In this paper, we focus on meta-analysis of continuous outcomes. The data and, often, existing tradition determine the choice of outcome measure. In a comparative study with continuous subject-level data for a treatment arm (T) and a control arm (C), the customary outcome measures are the mean difference (MD) and the standardized MD (SMD). Part 2, Chapter 9, of the Cochrane Handbook 1 pointed out that the choice between MD and SMD depends on whether "outcome measurements in all studies are made on the same scale." However, fields of application have established preferences: MD in medicine and SMD in social sciences. In ecology, almost half of all meta-analyses use another outcome measure, the log-transformed ratio of means (RoM), also called the response ratio. 2, 3 We plan to discuss RoM in a separate paper.
If the studies can be assumed to have the same true effect, a meta-analysis uses a fixed-effect (FE) model (common-effect model) to combine the estimates. Otherwise, the studies' true effects can depart from homogeneity in a variety of ways. Most commonly, a random-effects (RE) model regards those effects as a sample from a distribution and summarizes their heterogeneity via its variance, usually denoted by 2 . (Another approach, which we do not discuss further, allows the studies' true effects to differ without following a distribution. 4 ) The between-studies variance, 2 , has a key role in estimates of the mean of the distribution of random effects; but it is also important as a quantitative indication of heterogeneity, 5 especially because the interpretation of the popular I 2 measure 6 is problematic. 7,8 In studying estimation for meta-analysis of MD and SMD, we focus first on 2 and then proceed to the overall effect.
Veroniki et al 9 provide a comprehensive overview and recommendations on general-purpose methods (which can be used with any measure of effect) of estimating 2 and its uncertainty. Such a review, however, does not take into account the important evidence that the performance of those methods varies widely among effect measures. Veroniki et al 9(Section 6.1) mention this variation only in passing, as a hypothetical possibility. To address this important issue, we introduce new methods, specific to MD and SMD, that could perform better than the general-purpose ones.
Veroniki et al 9 recommend four methods of estimating 2 : the well-established methods of DerSimonian and Laird, 10 Mandel and Paule, 11 and restricted maximum likelihood, and the less-familiar method of Jackson. 12 Three of these four methods match moments to the asymptotic distribution of Cochran's Q statistic, and the fourth ignores the randomness of the inverse-variance weights. However, they all may be applicable only for large sample sizes.
As an alternative, we use improved effect-measure-specific approximations to the expected value of Q for both MD 13 and SMD 14 to introduce two new moment-based point estimators of 2 for MD (Welch-type [WT] and corrected DerSimonian-Laird [CDL]) and one WT interval estimator. We also introduce one moment-based point estimator and one interval estimator for 2 in SMD.
Any review on comparative performance of the existing methods, such as Veroniki et al, 9 currently can draw on limited empirical information, which we summarize in Appendix A in the Supplementary Materials. Existing gaps in evidence for MD include a complete lack of simulations using unpooled estimators of the study-level variance; instead, some studies have used the pooled estimator, and others, equivalently, have generated one normally distributed effect measure and an independent chi-squared estimate of the variance. The pooled estimator is equivalent to the unpooled estimator only when the sample sizes are equal within the study. So far, the only two studies 12,15 of coverage investigated a very limited number of interval estimators of 2 . In addition, studies have not examined the effect of estimation of 2 on coverage of the overall mean (Petropoulou and Mavridis 16 consider only inverse-variance-weighted estimators). For SMD, no studies have investigated coverage of 2 . Only one study 17 investigated coverage of the overall SMD, , but only for = 0.5. Therefore, we undertook an extensive simulation study to evaluate our new methods of estimating heterogeneity variance for MD and SMD and to compare them with existing methods, aiming also to address the gaps in evidence. We also study coverage of confidence intervals for 2 achieved by five methods, comparing our Q-profile methods based on improved approximations to the distribution of Cochran's Q with the Q-profile method of Viechtbauer, 18 profile-likelihood-based intervals, and methods by Biggerstaff and Jackson 19 and Jackson. 12 For each estimator of 2 , we also study bias of the corresponding inverse-variance-weighted estimator of the overall effect. As our work progressed, it became clear that those inverse-variance-weighted estimators generally had unacceptable bias for SMD. Therefore, we added an estimator (SSW) whose weights depend only on the sample sizes of the treatment and control arms. We study the coverage of the confidence intervals associated with the inverse-variance-weighted estimators, and also the HKSJ interval, the HKSJ interval using the improved estimator of 2 , and the interval centered at SSW and using the improved̂2 in estimating its variance.
The structure of this paper is as follows. In Section 2, we briefly review the continuous effect measures MD and SMD. Section 3 describes the standard random-effects model. Section 4 lists the methods for point estimation and interval estimation of a between-study variance. Section 5 lists the methods for point and interval estimation of the overall effect. Section 6 reports on our extensive simulation study. Section 7 discusses an example for SMD. Section 8 concludes with a discussion of practical implications for meta-analysis of MD and SMD, including recommendations on the choice of methods.

MD AND SMD
We assume that each of the K studies in the meta-analysis consists of two arms, treatment(T) and control (C), with sample sizes n iT and n iC . The total sample size in study i is n i = n iT + n iC . We denote the ratio of the control sample size to the total by q i = n iC ∕n i . The subject-level data in each arm are assumed to be normally distributed with means iT and iC and variances 2 iT and 2 iC . The sample means arex i , and the sample variances are s 2 i , for i = 1, … , K and j = C or T.

Mean difference
The MD effect measure is s 2 iT and s 2 iC do not depend on iT and iC , sô2 i does not involve i . In the best-case scenario for traditional meta-analysis methods, for normal data, the sample means are independent of the sample variances (and therefore of inverse-variance-based weights). However, the relation of the between-study variance 2 and the within-study variances 2 i may affect quality of estimation. Sometimes the pooled sample variance, given by Equation (2), is used instead of v 2 i . Then, however, unequal variances in the treatment and control arms can adversely affect estimation. 13

Standardized mean difference
The SMD effect measure is The variances in the treatment and control arms are usually assumed to be equal. Therefore, i is estimated by the square root of the pooled sample variance The plug-in estimator d i = (x iT −x iC )∕s i , known as Cohen's d, is biased in small samples, and we do not consider it further. Instead, we study the unbiased estimator where m i = n iT + n iC − 2, and the factor often approximated by 1 − 3∕(4 m − 1), corrects for bias. 20 This estimator of is sometimes called Hedges's g. For the variance of g i , we use the unbiased estimator derived by Hedges. 20 The sample SMD g i has a scaled noncentral t-distribution with noncentrality parameter Cohen 21 categorized values of = 0.2, 0.5, 0.8 as small, medium, and large effect sizes. However, these definitions of "small," "medium," and "large" may not be appropriate outside the behavioral sciences. Ferguson 22 proposed the values 0.41, 1.15, 2.70 as benchmarks in the social sciences. In an empirical study of 21 ecological meta-analyses by Møller and Jennions, 23 136 observed values of SMD varied in magnitude from 0.005 to 3.416, with mean 0.721 and 95% confidence interval (0.622-0.820). Unfortunately, little is known about the range of 2 for SMD in various applications.

STANDARD RANDOM-EFFECTS MODEL
In meta-analysis, the standard random-effects model assumes that within-and between-study variabilities are accounted for by approximately normal distributions of within-and between-study effects. For a generic measure of effect, resulting in the marginal distribution̂i ∼ N( , 2 i + 2 ).̂i is the estimate of the effect in Study i, and its within-study variance is 2 i , estimated bŷ2 i , i = 1, … , K. 2 is the between-study variance, which is estimated bŷ2. The overall effect can be estimated by the weighted mean̂R where theŵ i (̂2) = (̂2 i +̂2) −1 are inverse-variance weights. The FE estimatêuses weightsŵ i =ŵ i (0).

Point estimators
Our study includes the four methods recommended by Veroniki et al 9

Point estimation of 2 for MD by the WT and CDL methods
Because theŵ i (̂2) in (6) involve thê2 i , K − 1 is an adequate approximation for the expected value of Cochran's Q statistic only for very large sample sizes. However, this approximation is used in all moment methods for estimating 2 . As an alternative one can use an improved, effect-measure-specific approximation to the expected value of Q. Corrected MP-type moment-based methods for estimating 2 equate the Q statistic, with weightsŵ i ( 2 ), to the first moment of an improved approximate null distribution of Q. Corrected DerSimonian-Laird-type methods equate the Q statistic, with weightsŵ(0), to the first nonnull moment of Q.
More-realistic approximations to the null distribution of Q are available for several effect measures. These approximations do not treat the estimateŝ2 i as equal to 2 i . For MD, Kulinskaya et al 13 proposed an approximation based on the method of Welch. 27 This method calculates corrected first two moments of Q, 1 = E[Q] and 2 = Var[Q], under the null hypothesis of homogeneity and then approximates the null distribution of Q by an F distribution:ĉF K−1,̂2 with matched moments. The estimated degrees of freedom̂2 and the scale factorĉ are functions of K, the n iT and n iC , and thê2 iT and̂2 iC .
To simplify notation, with w i = 1∕Var(̂i), let W = ∑ w i , W (k) = ∑ w i k , and p i = 1 − w i ∕W, and let where f i j = n i j − 1 is the number of degrees of freedom for group j of study i, j = T, C. Then, the null moments of Q for MD 13 are We propose a new moment-based estimator of 2 for MD based on this improved approximation. Let E WT (Q) = 1 denote the corrected expected value of Q. Then, one obtains the WT estimator of 2 in the spirit of Mandel and Paule 11 by substitutinĝ2 i for 2 i where Var(̂i) appears in 1 to obtainÊ WT (Q) and iteratively solving We denote the resulting estimator of 2 bŷ2 WT . This proposal assumes that using the true 2 in the denominator in Q( 2 ) would achieve the null value of E(Q). However, this assumption is motivated by the standard assumption of a chi-squared distribution and, as we show in Section 6.1.2, is disproved by simulations. This is not surprising, as the null distribution of Q is better approximated by an F distribution. 13 We also propose another new moment-based estimator of 2 for MD based on the improved first moment of Q and the same term in 2 as in DerSimonian-Laird. 10 With and substitutinĝ2 i for 2 i in E(Q) (as above) and Q for E(Q), the CDL estimator is given bŷ The difference from the WT estimator is that CDL uses the improved nonnull first moment of Q.

Point estimation of 2 for SMD by the Kulinskaya-Dollinger-Bjørkestøl method
For SMD, Kulinskaya et al 14 derived O(1∕n) corrections to moments of Q and suggested using the chi-squared distribution with degrees of freedom equal to the estimate of the corrected first moment to approximate the distribution of Q. Kulinskaya et al 14 give expressions from which it can be calculated, along with a computer program in R.
We propose a new moment-based estimator of 2 for SMD in the spirit of Mandel and Paule 11 based on this improved approximation. Let E KDB (Q) denote the corrected expected value of Q. Then, one obtains the KDB estimate of 2 by iteratively solving Equation (9) with E KDB (Q) instead of E WT (Q) in the right-hand side.
We denote the resulting estimator of 2 bŷ2 KDB .

Interval estimators
Among the confidence-interval methods reviewed by Veroniki et al, 9

WT interval and Kulinskaya-Dollinger-Bjørkestøl interval
We propose a new WT confidence interval for the between-study variance for MD. This interval for 2 combines the Q-profile approach and the improved approximation by Kulinskaya et al 13 based on the method of Welch 27 (ie, the scaled F distribution with K − 1 and̂2 degrees of freedom based on the corrected first two moments of Q). This corrected Q-profile confidence interval can be estimated from the lower and upper quantiles of F Q , the cumulative distribution function for the improved approximation to the distribution of Q: The upper and lower confidence limits for 2 can be calculated iteratively.
Similarly, when the effect measure is SMD, the Kulinskaya-Dollinger-Bjørkestøl (KDB) confidence interval for 2 is based on the chi-squared distribution with the corrected first moment developed by Kulinskaya et al. 14

METHODS OF ESTIMATING OVERALL EFFECT
Most of the point estimators of the overall effect have corresponding interval estimators, but some do not. Therefore, we describe point estimators and interval estimators in separate sections.

Point estimators
A random-effects method that estimates by a weighted mean with inverse-variance weights, as in Equation (6), is determined by the particular̂2 that it uses inŵ i (̂2). The best-known and most widely used estimator,̂D L , was introduced by DerSimonian and Laird 10 ; it useŝ2 DL . Its shortcomings, in particular bias and below-nominal coverage of the companion confidence interval, have led numerous authors to propose alternative estimators of 2 . Some of those shortcomings arose from the derivation underlyinĝ2 DL , which uses the 2 i and 2 and then substitutes thê2 i and̂2. Unfortunately, the alternative methods REML, J, and MP generally rely on that same unsupported substitution; for MD, CDL attempts to reduce its impact.
In an attempt to reduce the bias in estimating the overall SMD that we encountered in the inverse-variance-weighted estimators, we included a point estimator whose weights depend only on the studies' sample sizes. 28,29 For this estimator (SSW), w i =ñ i = n iT n iC ∕(n iT +n iC ); that is, w i omits the term in g 2 i in Equation (3);ñ i is the effective sample size in Study i.

Interval estimators
The point estimators DL, REML, J, MP, WT, CDL, and KDB have companion interval estimators of . The customary approach estimates the variance of̂R E by [ ∑ K i=1ŵ i (̂2)] −1 and bases the half-width of the interval on the normal distribution. That expression for the variance of̂R E would be correct if it were based on w i = ( 2 i + 2 ) −1 . In practice, however, usingŵ i (̂2) may not yield a satisfactory approximation. In addition, we have not seen empirical evidence that the sampling distributions of̂R E for the various choices of estimator for 2 are adequately approximated by a normal distribution.
Hartung and Knapp 30 and, independently, Sidik and Jonkman 31 developed an estimator for the variance of̂R E that takes into account the variability of thê2 i and̂2. The Hartung-Knapp-Sidik-Jonkman (HKSJ) confidence interval uses the estimatorV together with critical values from the t distribution on K−1 degrees of freedom. A potential weakness is that the derivation of the variance estimator and the t distribution uses the 2 i and 2 and then substitutes thê2 i and̂2 DL . In addition, the HKSJ interval useŝD L as its midpoint, so it will have any bias that is present in̂D L . We study a modification of HKSJ that uses the WT estimator or KDB estimator of 2 and useŝW T or̂K DB , respectively, as the midpoint.
The interval estimators corresponding to SSW (SSW WT, SSW CDL, and SSW KDB) use the SSW point estimator as the midpoint, and the half-width equals the estimated standard deviation of SSW under the random-effects model times the critical value from the t distribution on K − 1 degrees of freedom. The estimator of the variance of SSW iŝ in which v 2 i comes from Equation (1) or (3) and̂2 =̂2 WT ,̂2 =̂2 CDL , and̂2 =̂2 KDB , respectively.

SIMULATION STUDY
As mentioned in Section 1, other studies have used simulation to examine estimators of 2 or of the overall effect for MD or SMD, but gaps in evidence remain. Appendix A in the Supplementary Materials contains a detailed summary of previous simulation studies and provides our rationale for choosing the ranges of values for , , and 2 that we consider realistic for a range of applications.
The following paragraphs describe mainly features that are common to our simulations for MD and SMD. Sections 6.1.1 and 6.2.1 describe other features that are specific to those measures.
All simulations use the same numbers of studies K = 5, 10, 30 and, for each combination of parameters, the same vector of total sample sizes n = (n 1 , … , n K ) and the same proportions of observations in the control arm q i = .5, .75 for all i. The sample sizes in the treatment and control arms are n iT = ⌈(1 − q i )n i ⌉ and n iC = n i − n iT , i = 1, … , K. The values of q reflect two situations for the two arms of each study: approximately equal (1:1) and quite unbalanced (1:3).
We study equal and unequal study sizes. For equal study sizes, n i is as small as 20, and for unequal study sizes, n i is as small as 12, in order to examine how the methods perform for the extremely small sample sizes that arise in some areas of application.
In choosing unequal study sizes, we follow a suggestion of Sánchez-Meca and Marín-Martínez, 32 who selected study sizes having skewness of 1.464, which they considered typical in behavioral and health sciences. Tables 1 and 2 give the details.
The patterns of sample sizes are illustrative; they do not attempt to represent all patterns seen in practice. By using the same patterns of sample sizes for each combination of the other parameters, we avoid the additional variability in the results that would arise from choosing sample sizes at random (eg, uniformly between 20 and 200).
We use a total of 10 000 repetitions for each combination of parameters. Thus, the simulation standard error for estimated coverage of 2 , , or at the 95% confidence level is roughly √ 0.95 × 0.05∕10, 000 = 0.00218. The simulations were programmed in R version 3.3.2 using the University of East Anglia 140-computer-node High Performance Computing (HPC) Cluster, providing a total of 2560 CPU cores, including parallel processing and large memory resources. For each configuration, we divided the 10 000 replications into 10 parallel sets of 1000 replications.
The structure of the simulations invites an analysis of the results along the lines of a designed experiment, in which the variables are 2 , n, K, q, 2 C , and 2 T . Most of the variables are crossed, but two have additional structure. Within the two  T consist of a cross of two factors, equal/unequal and small/large ( 2 C = 1 and 2 T = 1, 2 C = 10 and 2 T = 10, 2 C = 1 and 2 T = 2, and 2 C = 10 and 2 T = 20). We approach the analysis of the data from the simulations qualitatively, to identify the variables that substantially affect (or do not affect) the performance of the estimators as a whole and the variables that reveal important differences in performance. We might hope to describe the estimators' performance one variable at a time, but such "main effects" often do not provide an adequate summary: important differences are related to certain combinations of two or more variables.
We use this approach to examine bias and coverage in estimation of 2 and bias and coverage in estimation of and . Our summaries of results include illustrative figures and are based on examination of the figures in the corresponding arXiv e-prints. 33 A reviewer inquired about the values of I 2 underlying our simulations. Figures C1 and C2 in Appendix C plot I 2 = 100 2 ∕( 2 + s 2 ) versus 2 ∈ [0, 1] for MD and versus for SMD when 2 = 0.5(0.5)2, with traces for n = 20, 40, 100, 250. As indicated by the definition, I 2 increases as 2 increases. The value of n also has a substantial impact (larger n yields higher I 2 ); Higgins and Thompson 6 did not construct I 2 to be independent of the precisions of estimates observed in the studies. Importantly, for SMD, I 2 decreases as increases, especially for the smaller n, contrary to the scale-invariance criterion of Higgins and Thompson. We emphasize that we discourage use of I 2 , for the reasons mentioned here and in Section 1.

Design
For the MD, we vary six parameters: the between-study variance ( 2 ) and the within-study variances ( 2 T and 2 C ), in addition to the number of studies (K), the total sample size (n andn), and the proportion of observations in the control arm (q). Table 1 lists the values of each parameter. We set the overall true MD = 0 because the estimators of 2 do not involve and the estimators of are equivariant.
To cover both small and large values of the ratio of within-study to between-studies variance, separately from the value of 2 , we use two series of within-study variances ( 2 C , 2 T =(1,1), (1,2) and 2 C , 2 T =(10,10), (10,20)). We generate the within-study sample variances s 2 i ( j = T, C) from chi-squared distributions as 2 i 2 n i −1 ∕(n i − 1). We generate the estimated MDs y i from a normal distribution with mean and variance 2 iT ∕n iT + 2 iC ∕n iC + 2 . We obtain the estimated within-study variances as v 2 i = s 2 iT ∕n iT + s 2 iC ∕n iC . The simulation standard error in the estimates of is 0.01 (for n = 20) or less for the first series of within-study variances, and 0.02 or less for the second series.

Results
Our full simulation results are available as an arXiv e-print (Bakbergenuly et al 33 ). They comprise 108 figures, each presenting a plot of bias, mean squared error (MSE) or coverage versus 2 for the four values of n orn and the three values of K. A short summary is given below and illustrated by Figures 1 to 3. A detailed description appears in Appendix D in the Supplementary Materials. Table 3

summarizes our recommendations.
Bias in estimation of 2 ( Figure 1) In summary, except for CDL and WT, the estimators of 2 (DL, REML, J, and MP) have nonnegligible positive bias, especially for small sample sizes (n ≤ 40) and small values of 2 . Overall, CDL has the least bias, except for the most extreme cases, and is recommended for use in practice. WT is increasingly negatively biased for moderate to large heterogeneity, even for large sample sizes, so it is not recommended. All other estimators become acceptable for larger sample sizes (n ≥ 100).
Coverage in estimation of 2 ( Figure 2) In summary, none of the interval estimators of 2 (PL, QP, BJ, J, and WT) consistently achieve coverage close to .95 (ie, between .94 and .96). All have difficulty at 2 = 0, usually overcoverage; the departures of PL extend to other small 2 , and its coverage is often greater than .96 but sometimes less than .94. Meta-analyses in which the studies have small sample sizes are challenging for PL, QP, BJ, and J, which in some situations have coverage well below nominal for all 2 ∈ [0, 1], especially when the number of studies is larger (K = 30 vs. K = 5 and K = 10). Overall, WT comes closest to providing  nominal coverage of 2 . (The contrast in behavior between the WT interval and point estimators is surprising, but the former uses the appropriate F approximation to the distribution of Q, whereas the latter does not.)

Bias in estimation of
Because the estimated MD and its estimated variance are independent, all the estimators of are practically unbiased in all situations.
Coverage in estimation of ( Figure 3) When within-study sample sizes are balanced, HKSJ and HKSJ WT generally (but not uniformly) have the best coverage for small 2

Design
For the SMD, we vary five parameters: the overall true SMD ( ) and the between-studies variance ( 2 ), in addition to the number of studies (K), the studies' total sample size (n andn), and the proportion of observations in the control arm (q). Table 2 lists the values of each parameter.
We generate the true effect sizes i from a normal distribution: i ∼ N( , 2 ). We generate the values of Hedges's estimator g i directly from the appropriately scaled noncentral t-distribution, given by Equation (4), and obtain their estimated within-study variances from Equation (3).

Results
Our full simulation results are available as an arXiv e-print (Bakbergenuly et al 34  when K = 10, and KDB is closer to unbiased when K = 30 (and also when K = 20; see Appendix H). When n ≥ 100, MP, KDB, and REML are nearly unbiased. DL and J seriously underestimate 2 . The average of MP and KDB should be close to unbiased. Coverage in estimation of 2 ( Figure 4) As Figure 4 (bottom) illustrates, all five interval estimators of 2 (PL, QP, BJ, J, and KDB) have coverage substantially above .95 when 2 = 0. When 2 ≥ 0.5, QP is generally closest to .95, whereas KDB is somewhat too liberal when n = 20. The unusual behavior of BJ (and, to a lesser extent, J) when K = 30 (and also when K = 20; see Appendix H) adds to the evidence against it.
Bias and MSE in estimation of ( Figure 5) The bias of SSW is close to 0, and the other five estimators (DL, REML, J, MP, and KDB), which use inverse-variance weights, have greater (and negative) bias, amounting to 5% to 10% when sample sizes are small and ≥ 1. This bias increases as 2 and/or increases (see also Appendix H). SSW usually has slightly greater mean squared error than KDB and MP when n is small, but its MSE can be substantially smaller, especially for small 2 .
Coverage in estimation of ( Figure 6) Except when ≥ 1 and K ≥ 20 (see also Appendix H), HKSJ and HKSJ KDB have coverage closest to .95, though somewhat liberal; they differ little, and departures from .95 (toward lower coverage) are seldom serious. SSW KDB is rather conservative when K = 5 and for other K when 2 = 0. Otherwise it provides reliable albeit slightly conservative coverage. When = 2 and K = 20 or 30, SSW KDB is substantially the best choice. All the estimators that use inverse-variance weights and critical values from the normal distribution often have coverage substantially below .95.

EXAMPLE
As an example, we use data previously considered by Sánchez-Meca and Marín-Martínez, 35 on the efficacy of psychological treatments for obsessive-compulsive disorder (OCD). These data, Table 4, consist of 24 trials with mostly small sample sizes, ranging from 12 to 121 patients. Studies 5,15,16, and 23 are rather unbalanced; study 5 has 23 patients in the treatment arm and 11 in the control arm. The effect measure is SMD, and positive values correspond to lower levels of obsessions and compulsions in the treatment group. Figure 7 shows a forest plot, and Table 5 gives the results from various methods of estimation; recommended choices are in bold.  The estimated values of 2 have almost a threefold range, from 0.16 for REML to 0.45 for KDB. The methods differ much less in estimation of SMD. To a large degree, this is due to the relatively large number of studies (24). For instance, the variance of the overall effect for SSW given by (12) includes the multiplier of ∑ñ 2 i ∕( ∑ñ i ) 2 for 2 , and it is clearly of the order 1∕K (equal to 1∕K for equal sample sizesñ). This is also true for other estimators, so the differences between point estimators of almost disappear.
The results of our simulations for small sample sizes and near 1, Figures H1 and H2 in Supplementary Materials, indicate that 2 may be somewhat overestimated by KDB, somewhat underestimated by MP, and even more underestimated by REML, J, and especially DL. Combining this information with the results in Table 5, we expect 2 ≥ 0.4, much higher than the value of̂2 DL (= 0.1697). On the other hand, the Q-profile method is expected to provide the best confidence interval for 2 , here (0.099, 1.100), whereas the KDB interval may be too narrow at (0.216, 0.905). Both confidence intervals include a sizable range of values of 2 .
For , we expect all standard methods to yield negatively biased point estimates, including the KDB-based IV-weighted estimate at 1.122, so the SSW estimate of 1.095 seems somewhat low. From our simulations, the two best confidence intervals for are HKSJ KDB, here (0.802, 1.442), and the DL-based HKSJ, here (0.785, 1.365), but both may be too narrow. The SSW KDB interval, centered at the SSW point estimator, witĥ2 KDB in its estimated variance and t critical values, is  widest, at (0.700, 1.490); it may be too conservative, because it is 1.235 times as wide as HKSJ KDB and 1.362 times as wide as HKSJ.
We performed a small simulation (1000 replications per configuration), using values of 2 and within the confidence limits in Table 5. The results, plotted in Figure 8, show that the KDB method yields the least-biased estimates of 2 and has coverage of 2 comparable to or better than other methods. However, for these data, we prefer the more conservative QP interval. The HKSJ KDB interval for provides the best, though still somewhat liberal, coverage of among all intervals centered at an IV-weighted estimate. As expected, the sample-size-weighted estimator of is the only unbiased estimator, and the SSW KDB interval provides the most reliable though sometimes conservative coverage of . These methods are our recommended choices for estimation of .

DISCUSSION: PRACTICAL IMPLICATIONS FOR META-ANALYSIS
Methods for random-effects meta-analysis require an estimate of the between-study variance, 2 . We show that the performance of the popular estimators of 2 and related estimators of the overall effect varies widely among effect measures, and the existing evidence is scarce. For the effect measures MD and SMD, we use improved effect-measure-specific approximations to the expected value and distribution of Q to introduce two new methods of point estimation of 2 for MD (WT and CDL) and one WT interval method. We introduce one point estimator and one interval estimator for 2 in SMD. We also provide the first comprehensive simulation study for both MD and SMD.
The results of our simulations give a rather disappointing picture of the current state of meta-analysis for most common measures of effect. In brief, small sample sizes are rather problematic for many methods of meta-analysis, even for such a well-behaved effect measure as the MD, and meta-analyses that involve numerous small studies are especially challenging.
For MD, the between-study variance, 2 , is usually overestimated near zero. When n = 20, DL has a constant positive bias of about 0.07 regardless of 2 . REML is better for larger 2 , but it is about the same for 2 ≤ 0.2 when K = 30. These are the main methods used in the vast majority of meta-analyses. MP is the best at 0.03 to 0.06 bias (Figure 1). We do not recommend the WT point estimator of 2 . The CDL point estimator of 2 is essentially unbiased when n = 40, and it is the most reliable overall, across all values of 2 , n, and K; and our WT method provides reliable interval estimation. The estimators of are unbiased. Widespread complacency about the quality of meta-analysis methods is due to the use of MD as the outcome measure in many simulations. HKSJ intervals provide good but too liberal coverage of MD when studies are small and/or unbalanced. Our SSW CDL intervals are more reliable in this case, especially for larger K.
Arendacká 36 and Liu et al. 37 propose new confidence intervals for 2 in the one-way heteroscedastic random-effects model. These intervals can be used directly in meta-analysis of means in noncomparative studies. Both publications include extensive simulations and compare their intervals with those of Knapp et al. 15 Both proposals seem to do very well for normal distributions and very small sample sizes. It should be possible to extend these methods to MD in comparative two-arm designs; we plan to pursue this extension elsewhere.
For other effect measures, the picture is much more concerning. Because the study-level effects and their variances are related (as in Equation (3) for SMD), the performance of all statistical methods depends on the effect measures, estimates of overall effects are biased, and coverage of confidence intervals is too low, especially for small sample sizes. We see this for SMD. Bias of all inverse-variance methods for SMD when n = 20 is about 7% (Figure 4). Coverage of SMD is considerably worse when SMD is large and 2 < 0.5, at about 85% for HKSJ ( Figure 6). This may easily lead to misinterpretation of clinical findings.
The conventional wisdom is that these deficiencies do not matter, as meta-analysis usually deals with studies that are "large," so all these little problems are automatically resolved. Unfortunately, this is not true, even in medical meta-analyses; in Issue 4 of the Cochrane Database 2004, the maximum study size was 50 or less in 25% of meta-analyses that used MD as an effect measure, and less than 110 in 50% of them. 38 We have not surveyed typical study sizes in psychology, but Sánchez-Meca and Marín-Martínez, 35 promoting MA in psychological research, use an example with 24 studies in which the smallest study size is 12 and the largest is 121. We considered this example in Section 7. In ecology, typical sample sizes are between 4 and 25. 39 An effect-measure-specific estimator of 2 , such as KDB for SMD, can reduce inherent biases.
Arguably, the main purpose of a meta-analysis is to provide point and interval estimates of an overall effect. Usually, after estimating the between-study variance 2 , inverse-variance weights are used in estimating the overall effect (and, often, its variance). This approach relies on the theoretical result that, for known variances, and given unbiased estimateŝ variances are unknown, and use of the estimated variances makes the inverse-variance-weighted estimator of the overall effect biased. Consumers routinely expect point estimates to have no (or small) bias and CIs to have (close to) nominal coverage. Thus, the IV-weighted approach is unsatisfactory because, in general, it cannot produce an unbiased estimate of an overall effect.
We agree with Rukhin 40 : "A meta-analyst must be willing to use different estimates of the between-study variance 2 for different purposes: one to minimize the variance of the treatment effect statistic; another to construct a reliable confidence interval for this parameter; yet another to estimate 2 itself!" Our recommendations for meta-analysis of MD and SMD appear in Table 3.
A pragmatic approach to unbiased estimation of uses weights that do not involve estimated variances of study-level estimates, for example, weights proportional to the study sizes n i . Hunter and Schmidt 29 and Shuster, 41 among others, have proposed such weights, and Marín-Martínez and Sánchez-Meca 42 and Hamman et al 39 have studied the method's performance by simulation for SMD. We prefer to use weights proportional to an effective sample size,ñ i = n iT n iC ∕n i ; these are the optimal inverse-variance weights for SMD when = 0 and 2 = 0. Thus, the overall effect is estimated bŷ SSW = ∑ñ îi ∕ ∑ñ i , and its variance is estimated by Equation (12). Hamman et al 39 use weights proposed by Hedges, 43 which differ slightly fromñ for very small sample sizes. A good estimator of 2 , such as MP or KDB (for SMD), can be used aŝ2. Furthermore, confidence intervals for centered at̂S SW witĥ2 KDB in Equation (12) can be used.
This approach based on SSW requires further study. For example, in the confidence intervals, we have used critical values from the t-distribution on K − 1 degrees of freedom, but we have not yet examined the actual sampling distribution of SSW. The raw material for such an examination is readily available: For each situation in our simulations, each of the 10 000 replications yields an observation on the sampling distribution of SSW.

DATA AVAILABILITY STATEMENT
Our R programs for calculation of the proposed estimators of 2 and are provided in Supplementary materials. The scripts of simulations are available on request. All simulation results are provided in our arXiv eprints 33,34 and in Supplementary materials.