Methods for estimating between‐study variance and overall effect in meta‐analysis of odds ratios

In random‐effects meta‐analysis the between‐study variance ( τ2) has a key role in assessing heterogeneity of study‐level estimates and combining them to estimate an overall effect. For odds ratios the most common methods suffer from bias in estimating τ2 and the overall effect and produce confidence intervals with below‐nominal coverage. An improved approximation to the moments of Cochran's Q statistic, suggested by Kulinskaya and Dollinger (KD), yields new point and interval estimators of τ2 and of the overall log‐odds‐ratio. Another, simpler approach (SSW) uses weights based only on study‐level sample sizes to estimate the overall effect. In extensive simulations we compare our proposed estimators with established point and interval estimators for τ2 and point and interval estimators for the overall log‐odds‐ratio (including the Hartung‐Knapp‐Sidik‐Jonkman interval). Additional simulations included three estimators based on generalized linear mixed models and the Mantel‐Haenszel fixed‐effect estimator. Results of our simulations show that no single point estimator of τ2 can be recommended exclusively, but Mandel‐Paule and KD provide better choices for small and large numbers of studies, respectively. The KD estimator provides reliable coverage of τ2. Inverse‐variance‐weighted estimators of the overall effect are substantially biased, as are the Mantel‐Haenszel odds ratio and the estimators from the generalized linear mixed models. The SSW estimator of the overall effect and a related confidence interval provide reliable point and interval estimation of the overall log‐odds‐ratio.


| INTRODUCTION
Meta-analysis is broadly used for combining estimates of a measure of effect from a set of studies in order to estimate an overall (pooled) effect. In studies with binary individual-level outcomes, the most common measure of treatment effect is the odds ratio. 1 Our primary interest lies in meta-analysis of odds ratios. The actual measure of effect is the logarithm of the odds ratio (LOR), and the summary data are the numbers of subjects and the numbers of events in the two arms of each study, from which the usual analysis calculates the logarithm of each study's sample odds ratio and the large-sample estimate of its variance. A fixed-effect (or common-effect) model (FEM) assumes that the studies share a single true effect. It is usually more likely that the true study-level effects differ. A random-effects model (REM) describes that variation via a distribution, whose mean serves as the overall effect and whose variance summarizes the heterogeneity of the true study-level effects. Higgins et al 2 point out, "This variance explicitly describes the extent of the heterogeneity and has a crucial role in assessing the degree of consistency of effects across studies, which is an element of random-effects meta-analysis that often receives too little attention. " We focus mainly on two-stage approaches, which first calculate the studies' log-odds-ratios (and their estimated variances) and then combine those estimates; but we include, for limited comparisons, some one-stage approaches, which use the studies' numbers of events and subjects (eg, in a binomial likelihood) and avoid calculating the sample log-odds-ratios. To estimate the overall effect, the most common methods use a weighted average of the study-level estimates in which the weight for a study's estimate is the reciprocal of an estimate of its variance. Under the REM such inverse-variance weights combine the variance of the study-level estimate and the variance of the distribution of true study-level effects (τ 2 ). Thus, they require an estimate of the between-study variance. Most of the common inverse-variance-weighted methods estimate τ 2 by using the theoretical moments of Cochran's Q or its generalization. However, Kulinskaya and Dollinger 3 and van Aert et al 4 have shown that, for log-odds-ratio, the distributions assumed for those theoretical moments are incorrect. As a result, the moment-based point estimators of τ 2 are biased, and the interval estimators have coverage below the intended 95% level. Also, in combination with inverse-variance weighting, the departures from assumptions lead to biased point estimation of the overall effect and undercoverage of the associated confidence intervals (CIs). Therefore, for estimating betweenstudy variance, we propose new point and interval estimators based on an improved approximation to the moments of Cochran's Q statistic, suggested by Kulinskaya and Dollinger. 3 For the overall effect, we propose a weighted average in which the weights depend only on the effective sample sizes.
We use simulation to compare bias of our proposed point estimator of τ 2 with that of three previous momentbased estimators (the popular estimators of DerSimonian and Laird 5 and Mandel and Paule 6 and the less-familiar estimator of Jackson 7 ) and the restricted-maximumlikelihood estimator, and also to compare coverage of our proposed interval estimator of τ 2 with that of four previous estimators (profile likelihood, 8 the Q-profile [QP] interval, 9 and the generalized QP intervals of Biggerstaff and Jackson 10 and Jackson 7 ). We also compare bias and coverage of our proposed point estimator of the overall effect Highlights What is already known?
In combining estimates from studies that had a binary individual-level outcome, the most common methods of meta-analysis use a weighted average of the studies' odds ratios (on the logarithmic scale), under a random-effects model; but their required estimators of the between-study or heterogeneity variance suffer from bias and belownominal coverage, and produce bias and undercoverage in estimates of the overall log-odds-ratio.

What is new?
Our extensive simulations confirm that the usual methods of meta-analysis produce biased estimates of the overall effect and confidence intervals whose coverage is too low. Estimates of heterogeneity variance have similar shortcomings. Small sample sizes are rather problematic, and meta-analyses that involve numerous small studies are especially challenging.
• For estimating between-study variance, a new method (KD), based on an improved approximation to the null distribution of Cochran's Q, provides reliable interval estimates. The KD point estimator is inferior to another estimator (Mandel-Paule) when the number of studies is small, but is better otherwise. • A new, pragmatic point estimator of the overall effect (SSW) uses a weighted average in which a study's weight is proportional to its effective sample size. It has less bias than the popular inversevariance-weighted estimators and three estimators obtained from generalized linear mixed models. • The best interval estimator of the overall logodds-ratio is centered on SSW and bases its endpoints on a t distribution and the KD point estimator of the between-study variance.

Potential impact for RSM readers outside the authors' field
• The methods in common use for randomeffects meta-analysis of odds ratios can advantageously be replaced by the new estimators, which have better performance. • Meta-analysis software should include the new estimators. and a companion interval estimator with those of the related inverse-variance-based estimators. We extend the comparisons by including point estimators of τ 2 and point and interval estimators of the overall effect obtained from logistic linear mixed-effects models, and also the Mantel-Haenszel estimator of the odds ratio. Section 2 reviews estimation of study-level log-oddsratio and Section 3 briefly reviews REMs. Section 4 discusses previous point and interval estimators of between-study variance and introduces the proposed Kulinskaya-Dollinger method. Section 5 describes the corresponding point and interval estimators of the overall effect. Section 6 presents our simulation study and summarizes its results. In Section 7, we apply the various methods to data on the effect of diuretics on pre-eclampsia. Section 8 offers a concluding summary.
The Supporting Information (Data S1) reviews the logistic linear mixed-effects models, tabulates the methods studied in our simulations, discusses the properties of the M-H estimator under the REM, presents the results of the additional simulations that included the logistic linear mixed-effects estimators and the M-H estimator, and lists our R programs for calculating the proposed estimators.

| ESTIMATION OF STUDY-LEVEL LOG-ODDS-RATIO
Consider K studies that used a particular individual-level binary outcome. Each study i reports a pair of independent binomial variables, X iT and X iC , the numbers of events in n iT subjects in the Treatment arm ( j = T) and n iC subjects in the Control arm ( j = C); for i = 1, …, K: X iT $ Binom n iT , p iT ð Þ and X iC $ Binom n iC , p iC ð Þ: ð2:1Þ The log-odds-ratio for Study i is: ð2:2Þ The large-sample estimate of the variance ofθ i , derived by the delta method, is: (in finite samples Varθ i À Á is not finite). Evaluation ofθ i andσ 2 i requires the estimatesp ij . The usual (and maximum-likelihood) estimate of p ij isp ij = x ij =n ij , but an adjustment is necessary when either of the observed counts x ij is 0 or n ij (ie, when the 2 × 2 table for Study i contains a 0 cell). The standard approach adds 1/2 to x iT , n iT − x iT , x iC , and n iC − x iC when the 2 × 2 table contains exactly one 0 cell, and it omits Study i when the 2 × 2 table contains two 0 cells. An alternative approach always adds a (>0) to all four cells of the 2 × 2 table for each of the K studies; that is, it estimates p ij bŷ p ij a ð Þ = x ij + a À Á = n ij + 2a À Á . The most common choice, a = 1/2, removes bias of order n −1 inθ i . 11 It is convenient to denote the resulting estimate of θ i byθ i a ð Þ . Usingp ij a ð Þ with a = 1/2 in Equation (2.3) yields an estimator of Varθ i a ð Þ À Á that is unbiased except for terms of order n −3 . When n ij p ij < 3, however, that estimator substantially overestimates Varθ i a ð Þ À Á : 12 As far as we are aware, the corresponding small-sample bias of the standard approach has not been calculated. However, using unbiased estimators of the θ i and Varθ i À Á does not make the inversevariance estimator of the combined LOR unbiased, because 1= d Varθ i À Á is a biased estimator of 1=Varθ i À Á and theθ i and their estimated variances are not independent. 13,14

| Double-zero studies
Meta-analysis of binary data is challenging when the event rates are low. Such situations may involve so-called doublezero studies (ie, studies with zero events in both arms or, at the other extreme, x iT = n iT and x iC = n iC ). Actual practice varies, but often meta-analyses omit these studies. A popular argument is that such studies provide no information on the direction or magnitude of the effect. 15 Simulations that retain double-zero studies are rather scarce. Kuss 16 considers only methods that include double-zero studies without adjustment. Bhaumik et al 13 refer to their extensive simulation study comparing inclusion (with a = 1/2) and exclusion of double-zero studies and claim that inclusion results in less bias in estimation of the overall effect, but negatively affects estimation of τ 2 . Cheng et al 17 provide a review and some limited simulations for p ≤ .01 and K = 5. They argue that including double-zero studies is beneficial when θ is 0, but detrimental when a true treatment effect exists. We believe that this issue has no major practical consequences for our simulations (Section 6) because we use θ ≥ 0 and p iC ≥ .1.

| Standard random-effects model
The standard REM assumes that each estimated studylevel effect,θ i , has an approximately normal distribution and that the true study-level effects, θ i , follow a normal distribution:θ Thus, the marginal distribution ofθ i is N θ,σ 2 i + τ 2 À Á . Although the σ 2 i are generally unknown, they are routinely replaced by their estimates,σ 2 i . A key step involves estimating the between-study variance, τ 2 ; the most popular random-effects method uses the DerSimonian-Laird (DL) estimate. 5 The estimate of the overall effect is then ð3:2Þ weight for Study i. If the σ 2 i and τ 2 were known, the variance ofθ RE would be [ In practice, the variance ofθ RE is traditionally estimated by Pŵ iτ 2 À Á Â Ã − 1 , and a CI for θ uses critical values from the normal distribution.
The assumptions in this model (eg, within-study normality, between-study normality, and known σ 2 i ) have become familiar and seldom attract attention. Jackson and White, 14 however, advocate careful examination; they conclude that methods that make fewer normality assumptions should be considered more often in practice.

| Logistic linear mixed-effects models
One alternative approach uses a binomial-normal likelihood; the resulting logistic linear mixed-effects model belongs to the class of generalized linear mixed models (GLMMs). 18,19 Kuss, 16 Jackson et al, 20 and Bakbergenuly and Kulinskaya 21 review these GLMM methods. We include a fixed-intercept model (FIM) and a randomintercept model (RIM), equivalent to Models 4 and 5, respectively, of Jackson et al 20 and to models FIM2 and RIM2 of Bakbergenuly and Kulinskaya. 21 Briefly, the FIM includes fixed control-group effects (log-odds for the control-group probabilities), and the RIM replaces these fixed effects with random effects. Section A.1 in the Supporting Information gives more details.

| Noncentral-hypergeometric-normal model (NCHGN)
When one conditions on the total number of events for Study i, X iT + X iC = X i , only the number of events in the treatment group X iT is random. Then, given the studyspecific log-odds-ratio θ i , X iT has a noncentral hypergeometric distribution. If the θ i are normally distributed, θ i $N(θ, τ 2 ), the exact hypergeometric-normal likelihood function for Study i can be written as 19,22 : where the normalizing constant is defined as: and ϕ(Á| θ, τ 2 ) is the probability density function of the normal distribution with mean θ and variance τ 2 . Integrating out the unobserved study-specific effects produces the marginal distribution of X iT . Section A.1 in the Supporting Information gives more details.

| METHODS OF ESTIMATING BETWEEN-STUDY VARIANCE
A number of methods provide point and interval estimates of between-study variance. In a comprehensive review of existing simulation and empirical studies, Veroniki et al 23 focus on general-purpose estimators. Langan et al 24 systematically review simulation studies that compared estimators of heterogeneity variance. They summarize performance in estimating heterogeneity and also in estimating the overall effect. The studies used a variety of effect measures, including the odds ratio. Langan et al 25 use simulated data on standardized mean difference and odds ratio to compare nine estimators. We considered the recommendations of those three reports in choosing estimators to study. This section briefly reviews them; for reference, Section A.2 in the Supporting Information contains a list. Moredetailed descriptions appear in Veroniki et al, 23 Langan et al, 24 and Langan et al 25

| Point estimators
In applications, the DerSimonian-Laird 5 method remains the most popular; its relative simplicity facilitated its early implementation in software. Accumulating evidence of its inferior performance has done little to dislodge it. Recommended alternative point estimators include restricted maximum likelihood (REML), the method of Mandel and Paule, 6 and the method of Jackson. 7 These and other methods have been studied by many authors, including Viechtbauer 26 and Kosmidis et al. 27 This section briefly reviews these four methods and describes the Kulinskaya-Dollinger method. Information on the logistic linear mixed-effects models (FIM, RIM, and NCHGN) appears in Section A.1 in the Supporting Information. All these methods replace negative values ofτ 2 with zero.

| DerSimonian-Laird method
Pŵ i , is customarily assumed to have approximately the χ 2 distribution on K − 1 degrees of freedom. DerSimonian and Laird 5 substitute w i = 1=σ 2 i forŵ i , derive the corresponding expected value of Q when Varθ i À Á = σ 2 i + τ 2 , and estimate τ 2 by the method of moments. The resulting closed-form expression has made the DL estimator attractive.

| Restricted-maximum-likelihood method
Assuming that theθ i are distributed as N θ,σ 2 i + τ 2 À Á , the REML estimatorτ 2 REML maximizes the restricted (or residual) log-likelihood function l R (θ, τ 2 ), which differs from the ordinary likelihood function by the addition of − 1 2 ln Þ . It is obtained iteratively, using θ =θ REML from Equation (3.2) with weightsŵ iτ 2 REML À Á . REML is superior to DL because of its balance between unbiasedness and efficiency. 26 However, like DL, using theσ 2 i as if they were the σ 2 i may undermine its performance.
One can also obtain the REML estimator of τ 2 by maximizing the penalized log-likelihood developed by Kosmidis et al 27 to reduce the bias of maximumlikelihood estimation.

| Mandel-Paule method
The Mandel-Paule (MP) estimator,τ 2 MP , is another iterative moment-based estimator of the between-study variance. 6,28 As in Section 3.1, let the random-effects weights and θ RE depend on τ 2 ; denote the resulting Q by Q(τ 2 ). The MP estimatorτ 2 MP is obtained by iteratively solving the equation:

| Jackson method
DerSimonian and Kacker 33 generalize Q, replacing theŵ i by arbitrary fixed positive constants, a i , to obtain Q a = P a iθi −θ a À Á 2 , from which they derive a general method-of-moments estimator of τ 2 . They discuss several special cases, including DL (with a i = 1=σ 2 i , treating thê σ 2 i as fixed).
As an option when some heterogeneity is anticipated but there is little prior knowledge about its extent, Jackson 7 uses Q a with a i = 1/σ i . Although that choice yields a point estimator of τ 2 , he focuses on the interval estimator. However, the R function inference in the supplementary materials of Jackson 7 returns the point estimate. (His computational procedure avoids negativeτ 2 .) We abbreviate the point and interval estimators as J. In practice, meta-analyses would useσ i , so the a i in Q a are not fixed.

| Kulinskaya-Dollinger method
The chi-squared approximation for Q is inaccurate, and the actual distribution depends on the effect measure. Under the null hypothesis of homogeneity of the logodds-ratio, Kulinskaya and Dollinger 3 obtain corrected approximations for the mean and variance of Q and match those corrected moments to obtain a gamma distribution that (as their simulations confirm) closely fits the null distribution of Q. These approximations blend theoretical derivations with simulation results. Let E KD (Q) denote the corrected expected value of Q under the null hypothesis τ 2 = 0. This corrected first moment has the form E KD (Q where E th (Q) is a theoretical moment obtained from their general expansion for the mean of Q for arbitrary effect measures. 34 The corrected variance of Q is a quadratic function of the corrected mean E KD (Q). The expression for E th (Q) involved in specifying the corrected distribution of Q is not simple; Kulinskaya and Dollinger 3 give the details. For large sample sizes, E th (Q) ! K − 1.
We propose a new estimator of τ 2 based on this improved approximation. One obtains the KD estimatê τ 2 KD by iteratively solving This estimator closely resembles the MP estimator; both assume that adding τ 2 toσ 2 i in the IV weights makes the non-null distribution of Q (or at least, its mean) close to its null distribution. This assumption needs to be verified by simulation.

| Interval estimators
Viechtbauer 9 and Jackson and Bowden 35 compare CI estimators of the between-study variance. Interval estimators recommended by Veroniki et al 23 include profile likelihood, 8 the QP interval, 9 and the generalized QP intervals of Biggerstaff and Jackson 10 and Jackson. 7 Quality of estimation varies with the effect measure; for odds ratio van Aert et al 4 found that coverage of the last three methods can deviate substantially from the nominal 95% level. If the lower confidence limit is not defined or is negative, all these methods set it to zero. The logistic linear mixed-effects methods (FIM, RIM, and NCHGN) as implemented in the rma.glmm function in metafor, 36 used in our simulations, do not produce CIs for τ 2 .

| Q-profile confidence interval
If the weight for Study i is 1= follows the chi-squared distribution with K − 1 degrees of freedom. To obtain the QP CI, Viechtbauer 9 finds the lower and upper confidence limits by iteratively solving Qτ 2 L À Á = χ 2 K − 1; 0:975 and Qτ 2 U À Á = χ 2 K − 1; 0:025 . In practice, it is necessary to use theσ 2 i instead of the σ 2 i , and then the generalized Q-statistic no longer follows the assumed chisquared distribution.

| Biggerstaff and Jackson interval
For a generic effect measure, Biggerstaff and Jackson 10 derive the exact distribution of the statistic They show that the distribution is that of a linear combination of mutually independent chi-squared random variables, each with 1 degree of freedom, and they take advantage of available software for evaluating the cumulative distribution function F Q of such a distribution.
That distribution yields a generalized QP CI, whose lower and upper limits are the solutions to the equations:

| Jackson interval
As mentioned in Section 4.1.4, Jackson 7 proposes another generalized QP CI for τ 2 . The approach is the same as for the BJ interval, but with a i = 1/σ i in Q a .

| Kulinskaya-Dollinger interval
For the log-odds-ratio, we propose a new CI for the between-study variance. The KD CI for τ 2 combines the QP approach and the improved approximation by Kulinskaya and Dollinger. 3 This corrected QP CI can be estimated from the lower and upper quantiles of F Q , the cumulative distribution function for the corrected distribution of Q, as in Equation (4.6). The upper and lower confidence limits for τ 2 can be calculated iteratively.

| METHODS OF ESTIMATING OVERALL EFFECT
Most of the point estimators of the overall effect have corresponding interval estimators, but some do not. Therefore, we describe point estimators and interval estimators in separate sections.

| Point estimators
A random-effects method that estimates θ by a weighted mean with inverse-variance weights, as in Equation (3.2), is determined by the particularτ 2 that it uses inŵ iτ 2 À Á . The best-known and most widely used estimator,θ DL , was introduced by DerSimonian and Laird 5 ; it usesτ 2 DL . Its shortcomings, in particular bias and below-nominal coverage of the companion CI, have led numerous authors to propose alternative estimators of τ 2 . Some of those shortcomings arise from the derivation underlyinĝ τ 2 DL , which uses the σ 2 i and τ 2 and then substitutes theσ 2 i andτ 2 . Unfortunately, the alternative methods (REML, J, and MP) generally rely on that same unsupported substitution. In our simulations, we add one more inverse-variance-weighted estimator, KD, to this list.
In an attempt to avoid the bias in the inversevariance-weighted estimators, we include a point estimator whose weights depend only on the studies' effective sample sizes. 38,39 For this estimator (SSW)θ i usesp ij a ð Þ with a = 1/2 (as discussed in Section 2), and w i =ñ i = n iT n iC = n iT + n iC ð Þ ;ñ i is the effective sample size in Study i. These weights would be equivalent to the inverse-variance weights if all the probabilities across studies were equal (ie, p iT = p iC ≡ p for i = 1, …, K).
As we mentioned in Section 1, we also include estimators obtained from logistic linear mixed-effects models, namely FIM, RIM, and NCHGN.
A reviewer pointed out that the weights in SSW are the same as those in the MH estimator of a common risk difference, 40 and suggested that we include the MH estimator of a common odds ratio. That fixed-effect estimator applies the weight (n iT − x iT )x iC /(n iT + n iC ) to the sample odds ratio for Study i. As we discuss in Section A.3 in the Supporting Information, we expect MH to be biased under the REM.
In summary, the point estimators that we study are DL, REML, J, MP, KD, SSW, FIM, RIM, NCHGN, and MH.

| Interval estimators
The point estimators DL, REML, J, MP, and KD have companion interval estimators of θ. The customary approach estimates the variance ofθ RE by Pŵ and bases the width of the interval on the normal distribution. That expression for the variance ofθ RE would be correct if it were based on w i = σ 2 i + τ 2 À Á − 1 . In practice, however, usingŵ iτ 2 À Á may not yield a satisfactory approximation. Also, we have not seen empirical evidence that the sampling distributions ofθ RE for the various choices of estimator for τ 2 are adequately approximated by a normal distribution.
Hartung and Knapp 41 and, independently, Sidik and Jonkman 42 developed an estimator for the variance ofθ DL that takes into account the variability of theσ 2 i andτ 2 . The Hartung-Knapp-Sidik-Jonkman (HKSJ) CI uses the estimator: together with critical values from the t distribution on K − 1 degrees of freedom. A potential weakness is that the derivation of the variance estimator and the t distribution uses the σ 2 i and τ 2 and then substitutes thê σ 2 i andτ 2 DL . Also, the HKSJ interval usesθ DL as its midpoint, so it will have any bias that is present inθ DL . We study a modification of HKSJ (HKSJ KD) that uses the KD estimator of τ 2 and usesθ KD as the midpoint.
The interval estimator corresponding to SSW (SSW KD) uses the SSW point estimator as its center, and its width equals the estimated standard deviation of SSW under the REM times twice the critical value from the t distribution on K − 1 degrees of freedom. The estimator of the variance of SSW is

| SIMULATION STUDY
In a simulation study with log-odds-ratio as the effect measure, we varied six parameters: the number of studies K, the total sample size of each study n, the proportion of observations in the Control arm q, the overall true LOR θ, the between-study variance τ 2 , and the probability of an event in the Control arm p C . The number of studies K ∈ {5, 10, 30}. We included sample sizes that were equal for all K studies and sample sizes that varied among studies. The total sample sizes were n ∈ {40, 100, 250, 1000} for equal sample sizes, and the average total sample sizes were n ∈ 30, 60, 100, 160 f g for unequal sample sizes. In choosing sample sizes that varied among studies, we followed a suggestion of Sánchez-Meca and Marín-Martínez, 43 who selected study sizes having skewness 1.464, which they considered typical in behavioral and health sciences. For K = 5, Table 1 lists the sets of five sample sizes, which have the chosen skewness and average equal to 30, 60, 100, and 160. The simulations for K = 10 and K = 30 used each set of unequal sample sizes twice and six times, respectively. The values of q were .5 and .75. The sample sizes of the Treatment and Control arms were n iT = d(1 − q)n i e and n iC = n i − n iT , i = 1, …, K. The values of the overall true LOR were θ = 0(0.5)2 (ie, from 0 to 2 in steps of 0.5). The probability in the Control arm was p iC = .1, .2, . 4. The values of the between-study variance were τ 2 = 0(0.1)1, corresponding to small to moderate heterogeneity. This interval of τ 2 values is similar to or, for smaller sample sizes, somewhat shorter than that for the meta-analyses of LOR in the Cochrane database (Appendix 2 of Langan et al) 25 .
Altogether, the simulations comprised 7920 combinations of the six parameters. We generated 10 000 metaanalyses for each combination. The true values of LOR (θ i ) were generated from a normal distribution with mean θ and variance τ 2 . For a given p iC , the number of events in the control group, X iC , was generated from the Binomial(n iC , p iC ) distribution. The number of events in the treatment group, X iT , was generated from the Binomial(n iT , p iT ) distribution with p iT = p iC exp(θ i )/(1 − p iC + p iC exp(θ i )). The estimateθ i was calculated as in Equation (2.2), and its sampling variance was estimated by substitutingp iT andp iC in Equation (2.3). The methods differed however, in the way they obtainedp ij from x ij and n ij . In all standard methods, we added 1/2 to each cell of the 2 × 2 table only when the table had at least one cell equal to 0. This approach corresponds to the default values of the arguments add, to, and drop00 of the escalc procedure in metafor. 36 In the KD methods, and for estimation of θ i in SSW, we corrected for bias by adding 1/2 to each cell of all K tables. We also tried always adding 1/2 in the standard methods, but that made the biases forτ 2 worse.
In expanding our comparative study, we included the MH estimator of θ and the estimators from the FIM, RIM, and NCHGN models in simulations for selected values of the parameters: p C = .1, q = .5 and equal sample sizes with n = 40 and n = 100. The three logistic linear mixed-effects methods provide point but not interval estimators of τ 2 and both point and interval estimators of θ. For the MH point estimator of θ, we studied two versions: the usual version (MH), which does not modify the cell counts, and a version that always adds 1/2 to each cell (MH with 1/2). The results of these additional simulations are plotted in Section A.4 in the Supporting Information.

| Results of simulation studies
Our full simulation results are available as an arXiv eprint. 44 They comprise 300 figures, each presenting a plot of bias or coverage vs τ 2 for the four values of n or n and the three values of K. A detailed summary is given below and illustrated by Figures 1-3.

| Bias in estimation of τ 2
All the estimators have bias that varies with τ 2 , often roughly linearly. The sign and magnitude of the bias and the slope of that relation depend on p iC , θ, n, K, and q. For example, when p iC = .1, θ = 0, q = .5, n = 40, and K = 5, the bias of KD goes from +0.32 when τ 2 = 0 to −0.08 when τ 2 = 1, and the traces for the other estimators, close together, go from around +0.12 to around −0.47. Among these, MP appears to be the least biased. As K increases, the pattern shifts down; and as n increases, the traces tend to flatten (when n = 1000, most of the estimators are unbiased, but the bias when τ 2 = 1 is −0.08 for J and −0.17 for DL). As θ increases, the patterns shift down. When all studies are unbalanced (in favor of the Control arm), q = .75, the patterns often shift down, and the slopes become steeper. Figure 1 shows these patterns for KD and MP in the balanced case (q = .5). Both estimators have positive bias at zero, but for larger values of τ 2 , the bias of MP is mostly negative, whereas for KD it may be positive for larger values of θ. MP is considerably worse than KD (apart from τ 2 = 0) for K = 30. For K = 5 and 10, KD is less biased than MP for large values of τ 2 , but it may be worse for small values.
The effect of increasing p iC is not simple. As p iC increases from .1 to .2 to .4, the (positive) bias of KD at τ 2 = 0 decreases, and its bias at τ 2 = 1 approaches 0; at τ 2 = 0 the (positive) bias of the other estimators changes little, but at τ 2 = 1 the magnitude of the (negative) bias decreases when θ = 0 but decreases and then increases when θ = 2.
None of the point estimators of τ 2 has bias consistently close enough to 0 to be recommended; but among the existing estimators, MP and KD provide better choices for small and large K, respectively (Figure 2).

| Bias in estimation of θ
In the results for bias of the point estimators of θ, a common pattern is that the bias is roughly linearly related to τ 2 with a positive slope. The varied positions of the estimators' traces relative to the horizontal line of zero bias, however, complicate the process of summarizing. The situation with p iC = .1 and θ = 0 is straightforward: When F I G U R E 2 Bias and ratio of MSEs for estimators of the overall effect θ for θ = 0, p iC = .1, q = .5, and equal sample sizes n = 40, 100.
Light grey line at 0 and 1, respectively [Colour figure can be viewed at wileyonlinelibrary.com] n = 40 and K = 5, all estimators have no bias when τ 2 = 0; when τ 2 = 1, SSW has bias 0.14, and the other estimators' biases range from 0.23 to 0.26. Increasing K (to 10 and 30) has little effect on the pattern, and increasing n (to 100, 250, and 1000) flattens the pattern until little bias remains. (The plots for n = 100 show that the bias of SSW decreases F I G U R E 3 Coverage of between-studies variance τ 2 (top two rows) and overall effect θ (bottom two rows) for θ = 0, p iC = .1, q = .5, and unequal sample sizes n = 30, 100. Light grey line at 0.95 [Colour figure can be viewed at wileyonlinelibrary.com] more rapidly.) When θ = 0.5, the pattern splits into three: SSW has much smaller slope and flattens to essentially zero bias; the bias of KD changes from negative to positive around τ 2 = 0.5; and the common trace for the other estimators parallels that for KD and is about 0.06 units above it. Again, by n = 1000 the traces have flattened and merged. As θ increases (to 1.0, 1.5, and 2.0), the traces for all estimators except SSW shift down further, and the gap between KD and the others widens. When p iC = .2 and θ > 0, slopes of the non-SSW traces decrease as θ increases (the traces are flat when θ = 2). When p iC = .4 and θ > 0, the non-SSW traces go from flat to having negative slope as θ increases. Also, increasing K tends to shift those traces down slightly. When all K studies are unbalanced (q = .75), p iC = .1, θ = 0, and n = 40, the estimators have larger positive bias, even when τ 2 = 0. This effect decreases as θ and p iC increase, consistent with the behavior when q = .5, and it is absent when n ≥ 100.
As expected, in the vast majority of situations, SSW avoids most, if not all, of the bias in the IV-weighted estimators. The bias of the IV-weighted estimators affects their efficiency, so SSW tends to have smaller mean squared error than MP as τ 2 and K increase, but larger MSE than KD when K = 5 and K = 10 and when K = 30 and τ 2 is small ( Figure 2).

| Coverage in estimation of τ 2
Coverage of τ 2 is generally good for K = 5, but is considerably worse for larger numbers of studies, especially so for large values of θ. All methods are somewhat conservative at τ 2 = 0. When K = 5, PL is very conservative, whereas KD provides close to nominal coverage for τ 2 > 0, though it may become a bit liberal for large θ. The other methods are between these two, being somewhat conservative for small sample sizes n. For K = 10, PL is still mostly conservative, though it may become somewhat liberal for larger τ 2 . KD is almost perfect, though in one instance, for unequal sample sizes with n = 30, p C = :4, and θ = 2, its coverage drops to 90%. The other intervals are too liberal for small n. The large number of studies K presents the greatest challenge for the standard methods. PL is the most affected, with considerable undercoverage up to n = 100 for medium to large values of τ 2 . The other methods also have low coverage for small n, but they improve faster with increasing n. KD provides reliable coverage except for small sample sizes combined with p C = .4 and θ ≥ 1.5, where its undercoverage worsens with increasing τ 2 , though it is still considerably better than all the competitors (Figures 2 & 3).

| Coverage in estimation of θ
Interval estimators of θ respond in a variety of ways to the variables in the simulations. No simple description adequately summarizes the patterns. In one common pattern, coverage decreases as τ 2 increases, often falling substantially below the nominal 95% for the IV-weighted estimators. For a given value of θ and K = 10 and K = 30, undercoverage tends to decrease as n increases. For K = 5, however, the undercoverage of the IV-weighted estimators generally increases as n goes from 40 to 100 to 250 to 1000; when n = 1000, coverage is around 95% when τ 2 = 0 and roughly constant, at several percentage points below 95%, for 0.1 ≤ τ 2 ≤ 1 (the decrease is greater for p iC = .2 and p iC = .4 than for p iC = .1). Because HKSJ, HKSJ KD, and SSW KD do not exhibit such undercoverage in these situations, the explanation is likely to lie in the use of the normal distribution as the basis for the CI (Figure 3).
On the other hand, for given values of θ, n = 40 and n = 100, and τ 2 > 0, coverage tends to decrease as K increases. This effect is small for SSW KD (which moves from overcoverage to coverage close to 95%) and larger (by varying amounts) for all the other estimators. Thus, counterintuitively, when more than a small amount of heterogeneity is present and n ≤ 100, increasing the number of studies decreases coverage. A likely contributor is bias in estimating θ, which (for n = 40 and n = 100) is positive and increasing as τ 2 increases, and changes little with K.
A different pattern arises when θ ≥ 1, n = 40 and n = 100 (and n = 30 and 100), and K = 30. Coverage of HKSJ KD and KD is below 95% when τ 2 = 0 and increases toward 95% as τ 2 increases. For KD the explanation probably lies in its bias in estimating θ, which is negative and rises toward 0 (but remains <0) as τ 2 increases. For HKSJ KD (which has greater undercoverage), the reason is less clear. Undercoverage of both KD and HKSJ KD at τ 2 = 0 increases as θ increases. This pattern arises when p iC = .1 and p iC = .2. When p iC = .4, however, it is not evident when θ = 1. When θ ≥ 1.5, coverage of KD and HKSJ KD decreases as τ 2 increases and then stabilizes.
We do not recommend standard CIs based on IVweighted estimators of θ, because of their undercoverage. HKSJ and HKSJ KD often have coverage close to 95%, but they sometimes have serious undercoverage. All problems are typically worse for the unbalanced sample sizes. The SSW KD interval often has coverage somewhat greater than 95%, but its coverage is at least 93% (except for a few cases involving K = 30 and unequal sample sizes with n = 30) (Figure 2). 6.1.5 | Additional results: FIM, RIM, NCHGN, and MH In estimating τ 2 , FIM and RIM ( Figure A4.1 in the Supporting Information) often have bias that is between those ofτ 2 KD andτ 2 MP (Figure 1) and is generally not small, going from positive near τ 2 = 0 to negative at larger τ 2 . The size of their bias tends to decrease as K increases. As θ increases, the bias of RIM tends to decrease, whereas the bias of FIM remains roughly constant. The pattern of NCHGN is more complicated: positive and decreasing as K increases and θ increases when n = 40; but roughly linear (+ to −) in τ 2 , increasing as θ increases, and flattening as K increases when n = 100, where NCHGN is almost unbiased for larger τ 2 when K = 30. However, convergence rates of NCHGN are rather low, especially so for low values of τ 2 and K; they improve somewhat for larger values of n ( Figure A4 For interval estimation of θ ( Figure A4.3 in the Supporting Information), FIM, RIM, and NCHGN generally have lower coverage than the other estimators, decreasing as θ increases. When n = 100, the coverage of RIM decreases rapidly as τ 2 increases, and that pattern becomes more pronounced as K increases.
In summary, we do not recommend MH or the GLMMs for point or interval estimation of θ.

| EXAMPLE: EFFECTS OF DIURETICS ON PRE-ECLAMPSIA
Data from nine trials that reported the effect of diuretics on pre-eclampsia 45 were studied by Hardy and Thompson, 8 Biggerstaff and Tweedie, 46 Turner et al, 18 Viechtbauer, 9 Kulinskaya and Olkin, 47 and Bakbergenuly and Kulinskaya. 48 The data are shown in Table 2 and are re-analyzed here in order to compare the methods of point and interval estimation of between-study variance and the log-odds-ratio. For comparison we include results from three GLMMs available in the metafor package 36 : the FIM, the RIM, and the exact method based on the NCHGN. Bakbergenuly and Kulinskaya 21 give more details on those methods. Table 3 provides the point estimates of the betweenstudy variance and the point estimates and CIs for the overall log-odds-ratio and the overall odds ratio; and Table 4 shows the point estimates and CIs for the between-study variance. DL has the lowest estimate of τ 2 , 0.230, followed by the GLMM estimates at 0.254 to 0.264, and KD gives the highest estimate, 0.392. MP is second highest at 0.386. QP provides the longest CI for τ 2 , with length 2.130, and KD the second longest at 1.875, whereas BJ is considerably shorter at 1.384, and NCHGN has a very short interval with a length of just 0.667.
In estimating θ, all inverse-variance-weighted methods give similar values, ranging from −0.518 to −0.517 apart from KD which is −0.507, and the GLMM methods also give similar values ranging from −0.516 to −0.513. By contrast the FEM produces the highest estimate, −0.398, and SSW produces the lowest, −0.558. All the standard inverse-variance-weighted methods and the GLMMs show a significant effect of diuretics on pre-eclampsia, whereas all methods using t quantiles (HKSJ DL, HKSJ KD, and SSW KD) do not find a significant effect.
It is rather difficult to decide, from our simulation results, which method gives the best estimates, as the sample sizes, even though rather balanced, vary greatly, from 38 to 1370 in the Treatment arm. Therefore, we ran additional simulations, where we kept the sample sizes and the prevalence in the Control arm as in the actual nine trials, and varied the value of θ from −0.4 to −0.6 and the value of τ 2 = 0.20(0.05)0.45 to cover the range of possible values of these parameters. We used 10 000 repetitions at each combination of θ and τ 2 . Results of these simulations are shown in Figure 4. From these simulations, MP and KD are the least biased estimates of τ 2 ; the other methods have considerable negative bias, especially DL and the GLMMs, RIM being the most biased. KD provides the best coverage of τ 2 , though the coverage of all methods appears to be reasonable. All methods but SSW considerably overestimate θ, though here NCHGN and FIM are the least biased, with positive biases of 0.01 to 0.03. Coverage of θ is best for SSW KD and somewhat too low for the other methods based on t quantiles. The coverage of the standard IV-weighted methods based on normal quantiles is clearly not acceptable, and the GLMMs provide even worse coverage, probably because of their underestimation of τ 2 .

| SUMMARY
Our extensive simulations demonstrate that the existing methods of meta-analysis of (log) odds ratio often present a biased view of both the heterogeneity and the overall effect. In brief: small sample sizes are rather problematic, and meta-analyses that involve numerous small studies are especially challenging. Because the study-level effects and their variances are related, estimates of the overall effects are biased, and the coverage of CIs is too low, especially for small sample sizes and larger numbers of studies.
The between-study variance, τ 2 , is typically estimated by generic methods which assume normality of the estimated effectsθ i . It is usually overestimated near zero, but the standard methods are negatively biased for larger values of τ 2 . Our findings agree with those by van Aert et al 4 that the standard interval estimators of τ 2 are often too liberal. The behavior of the PL method is especially erratic.
Therefore, we proposed and studied a new method of estimating τ 2 based on the corrected approximation to the null distribution of Cochran's Q for log-odds-ratio F I G U R E 4 Bias and coverage of estimators of the between-study variance τ 2 and of the LOR θ for the sample sizes and thep iC in the pre-eclampsia data of Collins et al, 45  developed by Kulinskaya and Dollinger. 3 The KD method provides reliable interval estimation of τ 2 across all values of τ 2 , n, and K. Point estimation of τ 2 is more challenging; even though KD is better for K = 30, for small values of K it has positive bias and MP is better.
Arguably, the main purpose of a meta-analysis is to provide point and interval estimates of an overall effect.
Usually, after estimating the between-study variance τ 2 , inverse-variance weights are used in estimating the overall effect and its variance. This approach relies on the theoretical result that, for known variances, and given unbiased estimatesθ i , it yields a uniformly minimumvariance unbiased estimate of θ.
In practice, however, the true within-study variances are unknown, and use of the estimated variances makes the inverse-variance-weighted estimate of the overall effect biased. These biases (and even their sign) depend on τ 2 and the true value of θ, worsen for unbalanced studies, and may be considerable, even for reasonably large sample sizes such as n = 250. The coverage of the overall effect follows the same patterns because the centering of the CIs is biased. Additionally, traditional intervals using normal quantiles are too narrow; and the use of t quantiles, as in the HKSJ method, brings noticeable though not sufficient improvement.
Our additional simulations showed that the MH method and the GLMMs also do not perform well for point or interval estimation of θ.
A pragmatic approach to unbiased estimation of θ uses weights that do not involve estimated variances of study-level estimates, for example, weights proportional to the study size n i . Hedges and Olkin, 38 Hunter and Schmidt, 39 and Shuster, 49 among others, have proposed such weights. We use weights proportional to an effective sample size,ñ i = n iT n iC =n i ; these are equivalent to the optimal inverse-variance weights for LOR when all the probabilities are equal. Importantly, because inverse-varianceweighted estimators have considerable biases, little, if any, efficiency is lost by using the sample-size-based weights.
A reasonable estimator of τ 2 , such as MP or KD, can be used asτ 2 . Further, CIs for θ centered atθ SSW withτ 2 KD in Equation (5.2) can be used. In our simulations, this is by far the best interval estimator of θ, providing nearnominal coverage under all studied conditions.