Interval estimation of the overall treatment effect in random‐effects meta‐analyses: Recommendations from a simulation study comparing frequentist, Bayesian, and bootstrap methods

There exists a variety of interval estimators for the overall treatment effect in a random‐effects meta‐analysis. A recent literature review summarizing existing methods suggested that in most situations, the Hartung‐Knapp/Sidik‐Jonkman (HKSJ) method was preferable. However, a quantitative comparison of those methods in a common simulation study is still lacking. Thus, we conduct such a simulation study for continuous and binary outcomes, focusing on the medical field for application. Based on the literature review and some new theoretical considerations, a practicable number of interval estimators is selected for this comparison: the classical normal‐approximation interval using the DerSimonian‐Laird heterogeneity estimator, the HKSJ interval using either the Paule‐Mandel or the Sidik‐Jonkman heterogeneity estimator, the Skovgaard higher‐order profile likelihood interval, a parametric bootstrap interval, and a Bayesian interval using different priors. We evaluate the performance measures (coverage and interval length) at specific points in the parameter space, that is, not averaging over a prior distribution. In this sense, our study is conducted from a frequentist point of view. We confirm the main finding of the literature review, the general recommendation of the HKSJ method (here with the Sidik‐Jonkman heterogeneity estimator). For meta‐analyses including only two studies, the high length of the HKSJ interval limits its practical usage. In this case, the Bayesian interval using a weakly informative prior for the heterogeneity may help. Our recommendations are illustrated using a real‐world meta‐analysis dealing with the efficacy of an intramyocardial bone marrow stem cell transplantation during coronary artery bypass grafting.


| INTRODUCTION
Over the years, a variety of interval estimators has been developed for the overall treatment effect in random-effects (RE) meta-analyses, including frequentist, Bayesian, and bootstrap methods (see Veroniki et al 1 and the references therein). Correspondingly, there is also a variety of simulation studies comparing these methods. In their literature review, Veroniki et al 1 summarize those simulation studies (as well as real-life data studies). They "tentatively" recommend the Hartung-Knapp/Sidik-Jonkman (HKSJ) method for most situations and list situations where this choice might not be optimal (see section 2.3.2). However, they note that a comparison of all methods in a common simulation study is necessary for making reliable recommendations, but still lacking.
Thus, in this article, we want to compare a broad range of interval estimators under a broad range of scenarios. As it is computationally demanding to compare all methods under all scenarios, we confine ourselves to a sensible selection of methods and scenarios. Our choice is mainly guided by the results from Veroniki et al, 1 but we also present some new theoretical considerations relevant for the choice of methods.
We note that our work is motivated by meta-analyses in the medical field. The results generalize beyond this field; however, the medical focus does have a minor impact on the choice of the simulation parameter values and of the priors used in the Bayesian interval.
In section 2, we introduce the RE model and the interval estimators considered by Veroniki et al. 1 In that section, we also describe the setup of the simulation study. In section 3, we present the results of the simulation study and summarize them through general recommendations. The application of those recommendations is illustrated in section 4 using a real-world meta-analysis. Our work is discussed in section 5.
For our computations, we used R 2 . The R 2 code used for this article, its output, as well as information about the computing environment is available online. 3

| METHODS
Let K ≥ 2 be the number of studies included in the meta-analysis, indexed by k ∈ {1, …, K}. The studies are assumed to be independent. The sizes of the treatment and the control group of study k will be denoted by N_{T,k} and N_{C,k}, respectively. We assume that N_{T,k} and N_{C,k} satisfy N_k ≔ N_{T,k} + N_{C,k} > 4 (necessary, e.g., for section 2.4.1).
Let θ_k ∈ ℝ be the true treatment effect in study k, whose estimator is θ̂_k. The standard error (SE) of θ̂_k (given θ_k) will be denoted by ν_k ∈ (0, ∞), whose estimator is ν̂_k. All methods investigated here were derived under the assumption that the ν_k are known. Currently, this assumption is standard practice for meta-analytic methods 4 even though it might not be optimal. 1 Due to this assumption, we describe all methods in terms of the true ν_k instead of their estimators ν̂_k. The feasible counterparts of the methods are obtained by substituting ν_k by ν̂_k. In contrast to the theoretical formulation of the methods, we use these feasible counterparts when applying the methods in our simulation study (see section 2.5). This is done to have a simulation study which is as realistic as possible. In section 2.4, we define θ_k, θ̂_k, and ν̂_k² for the standardized mean difference (SMD) as well as for the log odds ratio (log OR).

What is already known?
• A unifying simulation study comparing the variety of interval estimators for the overall treatment effect in a random-effects meta-analysis is still lacking.

What is new?
• We conduct such a unifying simulation study.
• We recommend the Hartung-Knapp/Sidik-Jonkman (HKSJ) confidence interval (CI) under all conditions, except for small meta-analyses including only 2 studies. In this special case, the Bayesian credible interval (CrI) using a weakly informative prior for the between-study standard deviation is recommended.

Potential impact for RSM readers outside the author's field: Our recommendations are illustrated in Figure 8:
• For meta-analyses with more than 2 studies, the HKSJ CI should be used together with the Sidik-Jonkman (SJ) between-study variance estimator. This HKSJ-SJ CI is easy to calculate and features a major improvement in terms of coverage over other interval estimators.
• For meta-analyses with only 2 studies, the Bayesian CrI should be used, together with a weakly informative prior for the between-study standard deviation.

| Random-effects model
The most popular RE model is a normal-normal hierarchical model (NNHM). 4,5 Other RE models are also possible: For example in the case of binary data, a binomial-normal hierarchical model (BNHM) might be a better option (see section 5). However, our work will be restricted to the NNHM, so that in the following, the term "RE model" will refer to the NNHM.
Let N(·, ·) denote the normal distribution. The RE model then consists of the following assumptions 4,5 :

• θ̂_k | (θ_k, ν_k²) ~ N(θ_k, ν_k²),
• θ_k | (θ, τ²) ~ N(θ, τ²) with θ ∈ ℝ denoting the overall treatment effect and τ² ∈ [0, ∞) denoting the between-study variance (heterogeneity).

Marginally, that is, by integrating out the study-specific effect θ_k, the RE model yields 4,5

θ̂_k | (θ, τ²) ~ N(θ, τ² + ν_k²).

In the degenerate case τ² = 0, the RE model simplifies to the fixed-effect (FE) model (also called the common-effect model 6 ), which should not be confused with the fixed-effects model 7 where the θ_k are not pooled at all, at least not by the model itself. 6 The FE model is quite restrictive and therefore not recommended in general. 8,9 The fixed-effects (as well as the FE) model only allows for conditional inference, meaning that its results cannot be generalized beyond the studies at hand (in contrast to the RE model, which allows for unconditional inference). 6,10-12 Thus, we only consider methods for the RE model.
If τ² was known, θ could be estimated using the uniformly minimum-variance unbiased estimator (UMVUE) 9

θ̂(τ²) ≔ (1/w(τ²)) · Σ_{k=1}^K w_k(τ²) · θ̂_k

with the unnormalized precision weights w_k(τ²) ≔ 1/(τ² + ν_k²) and their sum w(τ²) ≔ Σ_{k=1}^K w_k(τ²). In practice, τ² needs to be estimated. In general, plugging in an estimator and not taking this into account leads to invalid inference. 8,9 We will come back to this problem in section 2.3.1.
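The plug-in version of this estimator can be sketched in a few lines. The paper's own computations are in R; the following Python sketch (names are ours) just mirrors the formula above.

```python
def pooled_estimate(theta_hat, nu2, tau2):
    """Inverse-variance weighted estimate of theta for a given tau^2.

    theta_hat : study effect estimates theta_k hat
    nu2       : squared within-study standard errors nu_k^2
    tau2      : between-study variance
    """
    w = [1.0 / (tau2 + v) for v in nu2]  # precision weights w_k(tau^2)
    return sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)

# With tau2 = 0 and equal nu_k^2, this is simply the arithmetic mean:
print(pooled_estimate([0.3, 0.5, 0.7], [0.04, 0.04, 0.04], tau2=0.0))
```

Setting tau2 = 0 recovers the fixed-effect estimate, which is how the weights interpolate between the FE and RE analyses.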

| Estimators for the between-study variance τ²

Several interval estimators for θ require the specification of an estimator for τ². Reviews of estimators for τ² are given, for example, by Veroniki et al 13 and Langan et al. 14 Here, we only present those estimators used in our simulation study.

| DerSimonian-Laird
The DerSimonian-Laird (DL) estimator 15 is the default τ² estimator in multiple software packages and, presumably therefore, probably the most common τ² estimator. 13 For its definition, several FE quantities are needed: the weights u_k ≔ 1/ν_k², their sum u ≔ Σ_{k=1}^K u_k, the FE estimator θ̂_FE ≔ (1/u) · Σ_{k=1}^K u_k θ̂_k, and the Q statistic by Cochran, 16 given by

Q ≔ Σ_{k=1}^K u_k (θ̂_k − θ̂_FE)².

By the method of moments (i.e., equating Q to its expected value under the RE model, given θ and τ²) and cutting off at zero, the DL estimator τ̂²_DL results as

τ̂²_DL ≔ max{0, (Q − (K − 1)) / (u − (Σ_{k=1}^K u_k²)/u)}.

| Paule-Mandel

The Paule-Mandel (PM) estimator is generally recommended by Veroniki et al 13 and Langan et al. 14 Veroniki et al 13 also recommend the restricted maximum likelihood (REML) estimator, but the PM (or rather the empirical Bayes, EB) estimator and the REML estimator are closely related. 5,19,20 The PM estimator is a method of moments estimator based on the generalized Q statistic

Q_gen(τ²) ≔ Σ_{k=1}^K w_k(τ²) (θ̂_k − θ̂(τ²))²,

which, given θ and τ², follows a χ² distribution with K − 1 degrees of freedom. 21 Thus, the PM estimator τ̂²_PM is defined as the solution to the equation

Q_gen(τ²) = K − 1.   (6)

As noted by DerSimonian and Kacker, 22 Q_gen(τ²) is a strictly monotonically decreasing function of τ². Thus, if a solution to (6) exists, it is unique. 22 However, the case Q_gen(0) < K − 1 may also occur. 22 In this case, there is no solution to (6) on [0, ∞) and the PM estimator is set to zero. 22 For our simulation study, we use the R 2 package metafor 23 to calculate the PM estimate. In fact, we replaced the PM estimate by the EB estimate from the same package since this yielded a faster computation. The estimating equation for the EB estimator is given, for example, by Sidik and Jonkman. 20
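Both moment estimators are easy to reproduce. Below is an illustrative Python sketch (the article uses R's metafor): the DL estimator in closed form, and the PM estimator solved by bisection, exploiting that Q_gen is strictly decreasing; the upper bracket `upper=100.0` is our assumption, adequate for effect sizes on usual scales.

```python
def q_gen(tau2, theta_hat, nu2):
    """Generalized Q statistic Q_gen(tau^2)."""
    w = [1.0 / (tau2 + v) for v in nu2]
    mu = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
    return sum(wk * (th - mu) ** 2 for wk, th in zip(w, theta_hat))

def tau2_dl(theta_hat, nu2):
    """DerSimonian-Laird moment estimator, truncated at zero."""
    u = [1.0 / v for v in nu2]
    fe = sum(uk * th for uk, th in zip(u, theta_hat)) / sum(u)
    Q = sum(uk * (th - fe) ** 2 for uk, th in zip(u, theta_hat))
    K = len(theta_hat)
    denom = sum(u) - sum(uk ** 2 for uk in u) / sum(u)
    return max(0.0, (Q - (K - 1)) / denom)

def tau2_pm(theta_hat, nu2, upper=100.0, tol=1e-12):
    """Paule-Mandel estimator: solve Q_gen(tau^2) = K - 1 by bisection."""
    K = len(theta_hat)
    if q_gen(0.0, theta_hat, nu2) < K - 1:  # no root on [0, inf): truncate at 0
        return 0.0
    lo, hi = 0.0, upper
    while hi - lo > tol:                    # Q_gen is strictly decreasing in tau^2
        mid = (lo + hi) / 2
        if q_gen(mid, theta_hat, nu2) > K - 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For homogeneous data (Q below its expectation), both estimators return exactly zero, illustrating the truncation discussed above.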

| Sidik-Jonkman
The Sidik-Jonkman (SJ) estimator 24 is derived by writing the marginal RE model as a linear model with heteroscedastic errors. Applying the method of weighted least squares and assuming a known τ² leads to

τ̂² = (1/(K − 1)) · Σ_{k=1}^K (ν_k²/τ² + 1)⁻¹ (θ̂_k − θ̂_w)²,

where θ̂_w denotes the corresponding weighted mean of the θ̂_k. By replacing τ² with the crude estimator

τ̂₀² ≔ (1/K) · Σ_{k=1}^K (θ̂_k − θ̄)²

(with θ̄ denoting the unweighted mean of the θ̂_k), Sidik and Jonkman 24 define the SJ estimator through

τ̂²_SJ ≔ (1/(K − 1)) · Σ_{k=1}^K ŵ_k (θ̂_k − θ̂_ŵ)², with ŵ_k ≔ (ν_k²/τ̂₀² + 1)⁻¹

and θ̂_ŵ denoting the weighted mean of the θ̂_k using the weights ŵ_k. Note that τ̂²_SJ is almost surely positive since θ̂₁ = … = θ̂_K occurs with probability zero.
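A direct transcription of these two steps can be sketched as follows (Python for illustration; the article's code is in R). The guard against a degenerate crude estimate is omitted since, as noted above, θ̂₁ = … = θ̂_K has probability zero.

```python
def tau2_sj(theta_hat, nu2):
    """Sidik-Jonkman estimator: weighted least squares with crude initial tau_0^2."""
    K = len(theta_hat)
    mean = sum(theta_hat) / K
    tau2_0 = sum((th - mean) ** 2 for th in theta_hat) / K  # crude initial estimate
    w = [1.0 / (v / tau2_0 + 1.0) for v in nu2]             # WLS weights
    theta_w = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
    return sum(wk * (th - theta_w) ** 2 for wk, th in zip(w, theta_hat)) / (K - 1)
```

Because the residual sum of squares is strictly positive for non-identical estimates, the returned value is positive, matching the property stated in the text.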

| Interval estimators for the overall treatment effect θ
In their literature review, Veroniki et al 1 consider 15 different interval estimators for θ. These are listed in Table 1 together with their abbreviations used here. Since a comparison of all 15 methods is impractical-especially when covering a wide range of scenarios-we use the information given by Veroniki et al 1 to select only those methods which promise a favorable behavior (with the exception of the widespread Wald-type z confidence interval, see the following section). In Table 1, methods that are excluded from our simulation study are colored in gray. We explain why these methods are excluded after introducing the included methods.
| Wald-type z

Marginally, θ̂(τ²) ~ N(θ, 1/w(τ²)) under the RE model, and by plugging in an estimator τ̂² for τ², the Wald-type z (WTz) confidence interval (CI) is defined through 21

θ̂(τ̂²) ± z_{1−α/2} · sqrt(1/w(τ̂²)),

with τ̂² denoting the chosen τ² estimator and z_{1−α/2} the (1 − α/2) quantile of the standard normal distribution. It is well known that the WTz CI generally has a coverage considerably below the nominal level, especially for few studies (i.e., small K) or moderate to large τ² values. 1 This is caused by ignoring the additional uncertainty due to the estimation of τ² and by keeping the assumption of a normal distribution even though τ² is estimated. 8 Nevertheless, we include the WTz CI in our simulation study to show its inferiority compared to the other methods. For this, we use the DL estimator for τ², resulting in the typical "WTz-DL" combination that is used as the default interval estimator for θ in many statistical software packages. 1
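As a minimal illustration (Python sketch, stdlib only; the article uses R), the plug-in normal interval is:

```python
from statistics import NormalDist

def wtz_ci(theta_hat, nu2, tau2_hat, alpha=0.05):
    """Wald-type z CI: normal approximation with tau^2 simply plugged in."""
    w = [1.0 / (tau2_hat + v) for v in nu2]
    theta = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
    se = (1.0 / sum(w)) ** 0.5              # ignores the uncertainty in tau2_hat
    z = NormalDist().inv_cdf(1 - alpha / 2) # z_{1-alpha/2}
    return theta - z * se, theta + z * se
```

The comment marks exactly the step criticized above: the SE treats τ̂² as if it were the known τ², which is what drives the undercoverage.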
| Hartung-Knapp/Sidik-Jonkman

The HKSJ CI shows favorable properties in multiple simulation studies: 1 It usually maintains the nominal coverage level, even for small K and independently of the magnitude of τ². Additionally, the coverage of the HKSJ CI is only slightly influenced by the chosen τ² estimator. 1 However, according to Veroniki et al, 1 caution is required with the HKSJ CI if few (less than 5) or small studies are included in the meta-analysis, if there is no (or little) between-study heterogeneity, if analyzing binary data with rare events, or if the studies are of strongly varying precision (e.g., due to different study sizes).
In our simulation study, we will compare the PM and the SJ estimator for τ 2 when calculating the HKSJ CI. The PM estimator is chosen since it is generally recommended by Veroniki et al 13 and Langan et al. 14 The SJ estimator is chosen since it shows good properties in the simulation study by Sánchez-Meca and Marín-Martínez. 28 Depending on the results, one of those two τ 2 estimators will be chosen for the final comparison with the other methods.
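For concreteness, the HKSJ construction as commonly defined (weighted residual SE plus a t quantile with K − 1 degrees of freedom) can be sketched as follows. This is an illustrative Python transcription, not the article's R code; the t quantile is passed in as a number since the Python standard library has no t distribution (e.g., t_{2, 0.975} ≈ 4.303 for K = 3).

```python
def hksj_ci(theta_hat, nu2, tau2_hat, t_quantile):
    """HKSJ CI: WTz SE replaced by the weighted residual SE, normal quantile
    replaced by t_{K-1, 1-alpha/2} (supplied via `t_quantile`)."""
    K = len(theta_hat)
    w = [1.0 / (tau2_hat + v) for v in nu2]
    theta = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
    # q(tau2_hat): weighted mean squared deviation around the pooled estimate
    q = sum(wk * (th - theta) ** 2 for wk, th in zip(w, theta_hat)) / (K - 1)
    se = (q / sum(w)) ** 0.5
    return theta - t_quantile * se, theta + t_quantile * se

# K = 3, alpha = 0.05, tau^2 from the SJ estimator of the running example:
lo, hi = hksj_ci([0.2, 0.5, 0.8], [0.04, 0.04, 0.04],
                 tau2_hat=0.054, t_quantile=4.303)
```

Note that if the PM estimator is used, q(τ̂²_PM) = 1 by construction, so the SE reduces to the WTz one and only the t quantile differs.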

| Skovgaard
The Skovgaard CI 29,30 arises from the ordinary profile likelihood (PL) CI 31 through a higher-order correction for the likelihood ratio statistic. Let

L(θ, τ²) ≔ ∏_{k=1}^K φ(θ̂_k; θ, τ² + ν_k²)   (14)

(with φ(·; μ, σ²) denoting the N(μ, σ²) density) denote the likelihood of the marginal RE model and

L_p(θ₀) ≔ L(θ₀, τ̂²_ML(θ₀))

the profile likelihood (profiled over the nuisance parameter τ²), where τ̂²_ML(θ₀) denotes the maximum likelihood (ML) estimator for τ² given a specific θ₀ ∈ ℝ, that is,

τ̂²_ML(θ₀) ≔ argmax_{τ² ∈ [0,∞)} L(θ₀, τ²).

The ML estimator for θ, here denoted by θ̂_ML, is then given by the solution to

θ̂_ML = θ̂(τ̂²_ML)

with τ̂²_ML ≔ τ̂²_ML(θ̂_ML) denoting the ML estimator for τ². Under H₀: θ = θ₀, the signed square root of the likelihood ratio statistic, that is,

Z(θ₀) ≔ sign(θ̂_ML − θ₀) · sqrt(2 · [log L_p(θ̂_ML) − log L_p(θ₀)])

(with sign(·) denoting the sign function), is asymptotically N(0, 1)-distributed. Instead of using Z(θ₀) as done in the ordinary PL CI, the Skovgaard CI uses the corrected statistic

Z̃(θ₀) ≔ Z(θ₀) + (1/Z(θ₀)) · log(U(θ₀)/Z(θ₀)),

with U(θ₀), given τ̂²_ML(θ₀) and τ̂²_ML, having a closed-form, yet complex expression that may be found in Guolo. 30 Note that Z̃(θ₀) is only defined for θ₀ ≠ θ̂_ML. To our knowledge, the behavior of Z̃(θ₀) for θ₀ → θ̂_ML has not been investigated yet. Because of Z(θ̂_ML) = 0, we assume that Z̃(θ₀) is supposed to converge to 0 for θ₀ → θ̂_ML and we therefore define Z̃(θ̂_ML) ≔ 0. The Skovgaard CI is then given by the set

{θ₀ ∈ ℝ : |Z̃(θ₀)| ≤ z_{1−α/2}}.   (21)

In R 2 , we calculate the Skovgaard CI using the function profile.metaLik() from the package metaLik, 32 version 0.41.0. There are newer versions of this package available (at the time of writing: up to version 0.43.0) where the calculation of the Skovgaard CI is still possible using internal results (see, e.g., Appendix A.8 of Veroniki et al 1 ), but version 0.41.0 is used here because it performs a "monotonicity check" when using the function profile.metaLik(). In this monotonicity check, several supporting points are used to check that Z̃(·) is strictly monotonically decreasing. For example, a pole of Z̃(·) at θ̂_ML could prevent the set in (21) from forming an interval. If the monotonicity test is not passed, the ordinary PL CI is calculated.
As this happened quite frequently during our simulations, we report on this issue in Section S1. In order to monitor this and other issues encountered during the simulation and to resolve some bugs, we slightly modified the source code of some functions from the package metaLik, 32 as may be seen in the online repository for the software code 3 (with the modified parts highlighted there).

| Bootstrap
Generation of the bootstrap samples

Van Den Noortgate and Onghena 33 discuss four bootstrap types for the RE model: two nonparametric bootstraps (cases bootstrap, error bootstrap) and two parametric bootstraps (effect size bootstrap, raw data bootstrap). They conclude that the nonparametric cases bootstrap should be avoided. 33 The other three bootstrap types lead to similar results. 33 We therefore confine ourselves to one of those three types. As the error bootstrap and the raw data bootstrap require additional assumptions for effect measures like the SMD and the log OR, we choose the effect size bootstrap. It is performed by generating the R* ∈ ℕ "bootstrap-world" equivalents to θ̂_k according to the N(θ̂(τ̂²), τ̂² + ν_k²) distribution and by using ν_k as the bootstrap-world equivalent to ν_k (so no new values are generated for ν_k).

Bootstrap confidence intervals
The R 2 package boot, 34,35 which is used here for performing the bootstrap and for calculating the corresponding CI for θ, offers five options for the CI type: the normal approximation CI, 35 the basic bootstrap CI 36 (also known as nonstudentized pivotal bootstrap CI 37 ), the studentized bootstrap CI 36 (also known as studentized pivotal bootstrap CI or bootstrap-t CI 37 ), the bootstrap percentile CI, 38 and the adjusted bootstrap percentile CI 39 (also known as bias-corrected and accelerated [BCa] bootstrap CI 37 ).
In order to choose a CI type for our simulation study, we first consider the coverage accuracy. First-order accuracy means that the coverage error (the true coverage minus the nominal coverage) tends to zero at the rate O(n^{−1/2}) in the sample size n, whereas second-order accuracy means the faster rate O(n^{−1}). 40 Among the above-mentioned CI types, only the studentized bootstrap CI and the BCa bootstrap CI have a second-order accuracy. 35 In contrast to the BCa bootstrap CI, the studentized bootstrap CI may be easily combined with a parametric bootstrap. Thus, for our simulation study, we choose the studentized bootstrap CI. It uses the pivot 35

(θ̂(τ̂²) − θ) / sqrt(V̂),

where V̂ is a squared SE estimator for θ̂(τ̂²). Here, we will use the squared HKSJ SE estimator (i.e., V̂ ≔ V̂_HKSJ) instead of the squared WTz SE estimator 1/w(τ̂²) since the HKSJ CI was shown to perform slightly better than the Wald-type t (WTt) CI 41 in terms of coverage 28 (and they only differ in the SE estimator).
In the bootstrap world, the pivot is computed from each bootstrap sample (i.e., the true θ is replaced by its bootstrap-world equivalent θ̂(τ̂²)). We assume the number R* of bootstrap samples to be chosen such that

(R* + 1) · α/2 and (R* + 1) · (1 − α/2)

are integer-valued. Then the α/2 and the (1 − α/2) quantiles of the distribution of the pivot are estimated by the corresponding order statistics of the R* bootstrap pivot values,

ẑ_{((R*+1)·α/2)} and ẑ_{((R*+1)·(1−α/2))},

respectively. Thus, the studentized bootstrap CI (henceforth the bootstrap CI) is defined here through 35

[θ̂(τ̂²) − ẑ_{((R*+1)·(1−α/2))} · sqrt(V̂), θ̂(τ̂²) − ẑ_{((R*+1)·α/2)} · sqrt(V̂)].

For the number of bootstrap samples, we choose the value R* ≔ 999, at which the Monte Carlo error should be low enough to allow for reliable comparisons. 35,42 For the τ² estimator, we will choose the superior one from the comparison of τ² estimators for the HKSJ CI.
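The full procedure (effect size bootstrap plus studentized pivot) can be sketched end to end. This is an illustrative Python version under our own simplifications: the DL estimator stands in for the τ² estimator chosen later in the article, and the HKSJ-type SE is used in the pivot as described above.

```python
import random

def pooled_and_se(theta_hat, nu2, tau2):
    """Pooled estimate and HKSJ-type SE used in the studentized pivot."""
    K = len(theta_hat)
    w = [1.0 / (tau2 + v) for v in nu2]
    est = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
    q = sum(wk * (th - est) ** 2 for wk, th in zip(w, theta_hat)) / (K - 1)
    return est, (q / sum(w)) ** 0.5

def tau2_dl(theta_hat, nu2):
    """DL estimator (stand-in for whichever tau^2 estimator is chosen)."""
    u = [1.0 / v for v in nu2]
    fe = sum(uk * th for uk, th in zip(u, theta_hat)) / sum(u)
    Q = sum(uk * (th - fe) ** 2 for uk, th in zip(u, theta_hat))
    denom = sum(u) - sum(uk ** 2 for uk in u) / sum(u)
    return max(0.0, (Q - (len(theta_hat) - 1)) / denom)

def studentized_boot_ci(theta_hat, nu2, R=999, alpha=0.05, seed=1):
    rng = random.Random(seed)
    tau2 = tau2_dl(theta_hat, nu2)
    est, se = pooled_and_se(theta_hat, nu2, tau2)
    pivots = []
    for _ in range(R):
        # effect size bootstrap: theta_k* ~ N(est, tau2 + nu_k^2); nu_k kept fixed
        boot = [rng.gauss(est, (tau2 + v) ** 0.5) for v in nu2]
        t2_b = tau2_dl(boot, nu2)
        est_b, se_b = pooled_and_se(boot, nu2, t2_b)
        pivots.append((est_b - est) / se_b)
    pivots.sort()
    lo_q = pivots[round((R + 1) * alpha / 2) - 1]        # alpha/2 quantile
    hi_q = pivots[round((R + 1) * (1 - alpha / 2)) - 1]  # (1-alpha/2) quantile
    return est - hi_q * se, est - lo_q * se
```

Note the characteristic quantile reversal in the last line: the upper pivot quantile forms the lower CI limit and vice versa.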

| Bayesian credible interval
In our article, the Bayesian credible interval (CrI) is only considered as a technical device in a frequentist meta-analysis. For the Bayesian CrI, the heterogeneity will be parameterized in terms of τ instead of τ². Since our focus is not on the study-specific effects θ_k, we define the likelihood through p(x | θ, τ) ≔ L(θ, τ²) with L(θ, τ²) denoting the likelihood of the marginal RE model from Equation (14) and x ≔ (θ̂₁, …, θ̂_K)ᵀ denoting the data. A priori, θ and τ are assumed to be independent, so that their joint prior factorizes into their respective parts 43 :

p(θ, τ) = p(θ) · p(τ).

Prior for θ

For our simulation study, we will use the R 2 package bayesmeta. 43 This package offers two options for the prior distribution for θ: an improper uniform distribution on the real line (denoted here by U(−∞, ∞)) and a normal distribution N(μ_θ, σ_θ²) whose mean μ_θ and variance σ_θ² may be specified by the user. The N(μ_θ, σ_θ²) prior is conjugate conditionally on τ. 43 The U(−∞, ∞) prior also results in a normal conditional posterior for θ. 4,5,43 The mean and the variance of this latter normal conditional posterior are the limiting cases of those resulting from using the normal prior when σ_θ² from the normal prior approaches infinity. 43 Additionally, they are identical to the frequentist UMVUE introduced in section 2.1 (where we also conditioned on τ) and its variance, respectively. 4,5,43 However, as noted by other authors, 4,5,43 this relationship between frequentist and Bayesian inference does not hold for unknown τ.
In our simulation study, we will use the U −∞, ∞ ð Þ prior as done e.g. by Friede et al. 4 The uniform prior may be deemed a noninformative prior, but in general, we have to warn about the downsides of noninformative priors as they may be a lot more informative than intended. 44 We will discuss the use of the U −∞, ∞ ð Þ prior in section 5.

Prior for τ
Existing simulation studies (e.g., References 5,4,45,46) suggest a strong impact of the prior for τ on the coverage of the Bayesian CrI. Therefore, we want to alter the information content of the prior for τ as much as possible: We will investigate one informative, one weakly informative, and one "noninformative" prior for τ.
For the informative prior, we choose the following priors, which are both defined for log τ²:
• for the SMD: nonstandardized t distribution with 5 degrees of freedom, location −3.44, and scale 2.59 by Rhodes et al 47 ;
• for the log OR: N(−2.56, 1.74²) distribution by Turner et al. 48
These priors are based on data from the Cochrane Database of Systematic Reviews (CDSR) and are thus tailored to an application in the medical field. For both priors, we choose the general (marginal) setting, that is, we don't specify any characteristics such as the outcome type or the type of intervention.
For the weakly informative prior, we mainly follow the arguments of Gelman 49 * and Polson and Scott. 50 † Gelman 49 derives the family of half-t distributions as a family of weakly informative priors for τ based on Bayesian arguments. This family includes the half-Cauchy distribution (degrees of freedom equal to 1) and, although not in a strict sense, the normal distribution (degrees of freedom approaching infinity). 43 Polson and Scott 50 show that the half-Cauchy distribution is also favorable from a frequentist point of view: Its (frequentist) risk, using the quadratic loss function, is a good compromise between an improvement compared to the James-Stein estimator 51 for smaller Euclidean norms of θ ≔ (θ₁, …, θ_K)ᵀ and a concession for larger Euclidean norms of θ. 50 Because of the arguments given by Polson and Scott 50 and because of its heavy tail, which rarely allows for large heterogeneities, we choose the half-Cauchy prior among the family of half-t distributions. This is also supported by the work of Williams et al. 12 Concerning the scale of the half-Cauchy prior, Röver 43 considers a value of 0.5 reasonable for effects on the log OR scale. Thus, we only choose the scale of 0.5 for our half-Cauchy prior. As our second effect measure, the SMD, may be regarded as being similar to the log OR, 47,52 this applies to both the SMD and the log OR.

In the case of the "noninformative" prior for τ, we choose the U(0, ∞) distribution as it is the limiting case of the half-t distributions when the scale approaches infinity (and the degrees of freedom are fixed). 49 It is important to note that the U(0, ∞) prior only results in a proper posterior if K > 2. 49 Thus, in the special case K = 2, we will use a U(0, 10³) prior instead of the U(0, ∞) prior. This is a heuristic workaround and only intended for illustrative purposes.
All of the τ priors chosen for our simulation study are illustrated on the common scale of τ in Figure 1 using the following abbreviations:
• Rhodes15: informative prior by Rhodes et al 47 in case of the SMD,
• Turner12: informative prior by Turner et al 48 in case of the log OR,
• hC(0.5): half-Cauchy prior with scale 0.5,
• U(0, ∞): "noninformative" uniform prior (replaced by U(0, 10³) in the case K = 2).

Marginal posterior for θ
The two options for the prior for θ allow the package bayesmeta 43 to use analytical and numerical procedures (instead of Markov chain Monte Carlo [MCMC] procedures) for approximating the marginal posterior distribution of θ, that is, the distribution with density p(θ| x).
The key idea is to write p(θ| x) as a mixture of Gaussian densities (mixed over the marginal posterior of τ), that is and to discretize in τ . 43 This way, the integral in Equation (27) is approximated by a weighted sum of Gaussian densities with weights given by the marginal posterior probability masses of τ lying in the discretization intervals. 43 The marginal posterior probability masses of τ are calculated using a combination of analytic simplifications and numerical integration. 43 In order to control the approximation of p(θ| x), the divergence restricting conditional tesselation (DIRECT) algorithm 53 is used. It restricts the symmetrized Kullback-Leibler divergence between p(θ| τ, x) and its approximation. This also guarantees a restriction of the symmetrized Kullback-Leibler divergence between p(θ| x) and its approximation (the weighted sum). 53 After approximating p(θ| x), a numerical integration yields the (approximate) cumulative distribution function which may be used for deriving quantiles through a numerical root-finding algorithm. 43 The α 2 and 1 − α 2 À Á quantiles are used here to give the central (1 − α) CrI for θ.
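The mixture idea can be made concrete with a crude fixed-grid sketch. This Python illustration is not bayesmeta's DIRECT algorithm: it uses a naive equidistant τ grid instead of the adaptive tessellation, and it assumes the improper uniform prior on θ (under which p(θ | τ, x) is Gaussian and θ can be integrated out of the likelihood analytically).

```python
import math
from statistics import NormalDist

def half_cauchy_pdf(tau, scale=0.5):
    """Density of the half-Cauchy prior hC(scale) on [0, inf)."""
    return 2.0 / (math.pi * scale * (1.0 + (tau / scale) ** 2))

def gaussian_mixture_posterior(theta_hat, nu2, tau_grid, tau_prior_pdf):
    """Approximate p(theta|x) by a Gaussian mixture over a discrete tau grid."""
    weights, means, variances = [], [], []
    for tau in tau_grid:
        s2 = [tau ** 2 + v for v in nu2]
        w = [1.0 / s for s in s2]
        mu = sum(wk * th for wk, th in zip(w, theta_hat)) / sum(w)
        # log marginal likelihood log p(x|tau), with theta integrated out
        loglik = (-0.5 * sum(math.log(2 * math.pi * s) for s in s2)
                  - 0.5 * sum(wk * (th - mu) ** 2 for wk, th in zip(w, theta_hat))
                  + 0.5 * math.log(2 * math.pi / sum(w)))
        weights.append(math.exp(loglik) * tau_prior_pdf(tau))
        means.append(mu)                 # conditional posterior mean of theta
        variances.append(1.0 / sum(w))   # conditional posterior variance
    total = sum(weights)
    weights = [wt / total for wt in weights]  # masses approximating p(tau|x)
    return weights, means, variances

def posterior_cdf(theta0, weights, means, variances):
    """CDF of the mixture approximation; invert numerically for CrI limits."""
    return sum(wt * NormalDist(mu, var ** 0.5).cdf(theta0)
               for wt, mu, var in zip(weights, means, variances))
```

A central CrI then follows by root-finding on `posterior_cdf` at α/2 and 1 − α/2, mirroring the quantile step described above.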

| Excluded methods
The WTt CI 41 arises from the WTz CI when substituting the standard normal quantile z_{1−α/2} by the t_{K−1; 1−α/2} quantile. The Wald-type CI with quantile approximation (WTqa CI) 54 arises from the WTz CI when substituting the standard normal quantile by a special quantile which was derived by Brockwell and Gordon 54 and depends on K. The WTt CI and the WTqa CI are excluded from our simulation study as they were shown to perform worse than the HKSJ CI in terms of coverage. 1 The modified Hartung-Knapp/Sidik-Jonkman (mHKSJ) CI 55 corresponds to the HKSJ CI, except that it uses max{1, q(τ̂²)} instead of q(τ̂²) in the definition of the HKSJ SE estimator. The mHKSJ CI paired with the PM estimator for τ² is identical to the WTt CI (by definition of the PM estimator, which always gives q(τ̂²_PM) = 1). 56 Thus, with the exclusion of the WTt CI, the mHKSJ CI is also excluded from our simulation study. A further justification for this exclusion is that the mHKSJ CI has indeed been recommended instead of the HKSJ CI in some very specific cases, but we could not observe strong deviations of the HKSJ CI's coverage from the nominal level in those cases. This is something we will come back to later in the discussion (section 5).
The ordinary PL CI, which uses Z(θ₀) instead of Z̃(θ₀) in Equation (21), has been shown to perform worse than the Skovgaard and the Bartlett-type corrected CIs in terms of coverage. The Zeng-Lin (ZL) CI 65 is based on a two-step resampling procedure which uses the DL estimator for τ² in its first step. The ZL CI is excluded here since it was shown to perform worse than the Skovgaard CI in terms of coverage. 66 For the nonparametric bootstraps, we already explained in section 2.3.4 why we exclude them.
The Follmann-Proschan (FP) CI 41 is composed of those values for θ for which a permutation test cannot reject the corresponding null hypothesis. It has led to promising results in existing simulation studies 1,59,60,63 but it has a crucial downside 60 : At a significance level of α = 0.05, the permutation test can reject the null hypothesis only if K ≥ 6 so that the FP CI equals (−∞, ∞) if K < 6. As will be seen in section 2.5, real-world meta-analyses usually consist of a small number of studies so that the FP CI is useless in most practically relevant cases.

| Study-specific effect measures
In the following, the individual observations (i.e., on patient level) will be denoted by X T, k, n (n ∈ {1, …, N T, k }) and X C, k, n (n ∈ {1, …, N C, k }) for the treatment and the control group of study k, respectively. In addition to the between-study independence, we also assume that the individual observations are stochastically independent within each study.

| Standardized mean difference
For the SMD, we assume a normal distribution (and variance homogeneity) for the patient-level observations:

X_{T,k,n} ~ N(μ_{T,k}, σ_k²) and X_{C,k,n} ~ N(μ_{C,k}, σ_k²)

with μ_{T,k} ∈ ℝ and μ_{C,k} ∈ ℝ denoting the means in the treatment and the control group (respectively) and σ_k² ∈ (0, ∞) denoting the variance in both treatment groups. The study-specific treatment effect (the SMD of study k) is then defined through 67

θ_k ≔ (μ_{T,k} − μ_{C,k}) / σ_k.

Its unbiased estimator (Hedges's g) is defined through 68

θ̂_k ≔ c(m_k) · (X̄_{T,k} − X̄_{C,k}) / S_k

with m_k ≔ N_k − 2, S_k denoting the pooled standard deviation, X̄_{T,k} and X̄_{C,k} denoting the group means, and the correction factor

c(m) ≔ Γ(m/2) / (sqrt(m/2) · Γ((m − 1)/2)),

where Γ(·) denotes the gamma function. The variance of θ̂_k (given θ_k) is given by 67

ν_k² = (m_k/(m_k − 2)) · (1/ñ_k) · (1 + ñ_k θ_k²) · c(m_k)² − θ_k², with ñ_k ≔ N_{T,k} N_{C,k} / N_k,

which may be estimated unbiasedly by 68

ν̂_k² ≔ 1/ñ_k + θ̂_k² · (1 − (m_k − 2)/(m_k · c(m_k)²)).
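Putting these pieces together, Hedges's g and its variance estimate can be computed as in the following Python sketch (stdlib only; `math.lgamma` keeps the correction factor stable for large m). The variance line follows the unbiased-estimator form stated above; variants of this formula exist in the literature.

```python
import math

def hedges_g(x_t, x_c):
    """Hedges's g with the exact small-sample bias correction c(m),
    plus a variance estimate for it."""
    n_t, n_c = len(x_t), len(x_c)
    m = n_t + n_c - 2
    mean_t = sum(x_t) / n_t
    mean_c = sum(x_c) / n_c
    ss = (sum((x - mean_t) ** 2 for x in x_t)
          + sum((x - mean_c) ** 2 for x in x_c))
    s_pooled = math.sqrt(ss / m)
    # c(m) = Gamma(m/2) / (sqrt(m/2) * Gamma((m-1)/2)), via log-gamma
    c = math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2)) / math.sqrt(m / 2)
    g = c * (mean_t - mean_c) / s_pooled
    n_tilde = n_t * n_c / (n_t + n_c)
    var_g = 1.0 / n_tilde + g ** 2 * (1.0 - (m - 2) / (m * c ** 2))
    return g, var_g
```

Since c(m) < 1, g always shrinks the raw standardized difference slightly toward zero, which is exactly the small-sample bias correction.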

| Log odds ratio
For the log OR, we assume a Bernoulli distribution for the binary patient-level observations: X_{T,k,n} ~ Bin(1, p_{T,k}) and X_{C,k,n} ~ Bin(1, p_{C,k}) with p_{T,k} ∈ (0, 1) and p_{C,k} ∈ (0, 1) denoting the event probabilities in the treatment and control group of study k (respectively) and Bin(N, p) denoting the binomial distribution with parameters N ∈ ℕ and p ∈ (0, 1). The log OR of study k is defined through

θ_k ≔ logit(p_{T,k}) − logit(p_{C,k}),

using the logit function logit(p) ≔ log(p/(1 − p)) for p ∈ (0, 1). Let

E_{T,k} ≔ Σ_{n=1}^{N_{T,k}} X_{T,k,n} and E_{C,k} ≔ Σ_{n=1}^{N_{C,k}} X_{C,k,n}

denote the number of events in the treatment and control group of study k, respectively. We then estimate the log OR of study k by

θ̂_k ≔ log( (E_{T,k} · (N_{C,k} − E_{C,k})) / (E_{C,k} · (N_{T,k} − E_{T,k})) ),

where, in studies with at least one zero cell, the value 0.5 is added to each of the four cells E_{T,k}, N_{T,k} − E_{T,k}, E_{C,k}, and N_{C,k} − E_{C,k} before computing θ̂_k. This type of zero-cell correction is used, for example, by Pateras et al. 69 It is also the default zero-cell correction in the R 2 package metafor. 23 Accordingly, the variance of θ̂_k (given θ_k) is estimated by

ν̂_k² ≔ 1/E_{T,k} + 1/(N_{T,k} − E_{T,k}) + 1/E_{C,k} + 1/(N_{C,k} − E_{C,k}),

with the same zero-cell correction applied.
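A compact implementation of this estimate, with the conditional 0.5 correction (applied only when a zero cell occurs, mirroring the default behavior described for metafor), might look as follows; Python is used for illustration.

```python
import math

def log_or_estimate(e_t, n_t, e_c, n_c):
    """Log OR estimate and its variance from a 2x2 table;
    0.5 is added to all four cells if any cell is zero."""
    cells = [float(e_t), float(n_t - e_t), float(e_c), float(n_c - e_c)]
    if min(cells) == 0.0:
        cells = [x + 0.5 for x in cells]   # zero-cell correction
    a, b, c, d = cells                     # events/non-events, treatment/control
    theta = math.log((a * d) / (b * c))
    var = 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d
    return theta, var
```

The correction guarantees a finite estimate and variance even for studies with zero events in one arm, at the price of a small bias toward the null.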

| Simulation setup
For each scenario of our simulation study, we perform 5000 replications. The parameters defining the scenarios are listed in Table 2. The choice of those parameter values is described in detail in sections 2.5.1-2.5.6. As a full factorial design would have increased the number of scenarios considerably, we split our simulation study up into four analyses (see Table 2). In the location shift analysis, we investigate the sensitivity of the interval estimators with respect to a shift of the magnitude of θ (see section 2.5.3). In the robustness analysis, we assess the sensitivity with respect to deviations from the normal distribution of the θ_k (see section 2.5.5).

Given a fixed scenario and a fixed replication of this scenario, we generate the study-specific estimates θ̂_k and ν̂_k² as follows:

1. Generate the N_{T,k} according to the chosen distribution (see section 2.5.2) and set N_{C,k} ≔ N_{T,k}.
2. Generate the true study-specific parameters:
• For the SMD:
a. Generate the θ_k according to the chosen distribution, usually N(θ, τ²) (see section 2.5.5).
b. Calculate μ_{T,k} ≔ μ_{C,k} + θ_k · σ_k.
• For the log OR:
a. Calculate ℓ_C ≔ logit(p_C) with a value for p_C from section 2.5.6.
b. Generate ℓ_{T,k} ~ N(ℓ_C + θ, τ²/2) and ℓ_{C,k} ~ N(ℓ_C, τ²/2) independently.
c. Calculate p_{T,k} ≔ logit⁻¹(ℓ_{T,k}) and p_{C,k} ≔ logit⁻¹(ℓ_{C,k}) with logit⁻¹ denoting the inverse of the logit function.
3. Generate the patient-level observations:
• For the SMD: Generate X_{T,k,n} ~ N(μ_{T,k}, σ_k²) and X_{C,k,n} ~ N(μ_{C,k}, σ_k²).
• For the log OR: Generate directly E_{T,k} ~ Bin(N_{T,k}, p_{T,k}) and E_{C,k} ~ Bin(N_{C,k}, p_{C,k}).
4. Calculate the study-specific estimates θ̂_k and ν̂_k² as explained in section 2.4.
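The SMD arm of one replication can be sketched as follows. This Python illustration makes two choices the excerpt does not state (μ_{C,k} = 0 and σ_k = 1) and uses the common large-sample variance approximation rather than the unbiased estimator, so it is a simplified stand-in for step 4.

```python
import math
import random

rng = random.Random(7)

def simulate_smd_meta(K=5, theta=0.5, tau2=0.1):
    """One replication of the SMD data-generating steps (simplified sketch)."""
    theta_hat, nu2_hat = [], []
    for _ in range(K):
        n = rng.randint(10, 30)                    # small group sizes; N_C,k = N_T,k
        th_k = rng.gauss(theta, math.sqrt(tau2))   # true study effect ~ N(theta, tau2)
        # mu_T,k = mu_C,k + theta_k * sigma_k, with mu_C,k = 0, sigma_k = 1 assumed
        x_t = [rng.gauss(th_k, 1.0) for _ in range(n)]
        x_c = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m_t, m_c = sum(x_t) / n, sum(x_c) / n
        ss = (sum((x - m_t) ** 2 for x in x_t)
              + sum((x - m_c) ** 2 for x in x_c))
        sp = math.sqrt(ss / (2 * n - 2))
        d = (m_t - m_c) / sp                       # SMD estimate (bias correction omitted)
        theta_hat.append(d)
        # large-sample variance approximation for the SMD
        nu2_hat.append((n + n) / (n * n) + d ** 2 / (2 * (n + n)))
    return theta_hat, nu2_hat
```

Feeding the returned pairs (θ̂_k, ν̂_k²) into the interval estimators and repeating this 5000 times per scenario is then exactly the outer loop of the simulation.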
Afterwards, all 5 interval estimators (8 when taking into account the variants arising from different τ² estimators or τ priors) from sections 2.3.1-2.3.5 are applied to this simulated meta-analysis. By averaging over the 5000 replications, we estimate the primary and secondary performance measures, the coverage and the interval length, respectively. When plotting those performance measures, we also plot the Monte Carlo standard error (MCSE).
We note that in contrast to the frequentist and the bootstrap methods, the Bayesian CrI is not designed to maintain the nominal coverage when given a fixed point in parameter space (here, the two-dimensional space of θ and τ). It is rather supposed to maintain the nominal coverage when averaging over the prior distribution. 43 By estimating the coverage at fixed points in parameter space, we take a frequentist point of view. This is necessary, however, to see which interval estimators are affected by the magnitude of τ 2 . Williams et al 12 also argue why comparing methods according to their frequentist properties makes sense, even for Bayesian methods.

| Number of studies (K)
For the number of studies, K, we choose the Fibonacci numbers 2, 3, 5, 8, 13, 21, 34, and 55. They reflect the right-skewed empirical distribution of K from the CDSR quite well 70 and cover the range of K values used in most existing simulation studies. 1

| Treatment group sizes (N T, k )
For generating the treatment group sizes N_{T,k}, we use three settings: small, large, and mixed. The small group sizes are drawn from a uniform distribution over {10, 11, …, 30} and the large group sizes from a uniform distribution over {100, 101, …, 300}. Those two settings are also used by Panityakul et al. 72 For the mixed group sizes, we follow Doi et al 73 and Doi et al 74 in using a Delaporte distribution (with differing parameter values, however). The Delaporte distribution (denoted here by Delap(λ, κ, ξ)) arises from a Poisson distribution whose mean is the sum of a fixed component (λ ∈ (0, ∞)) and a gamma-distributed component (shape: κ ∈ (0, ∞), scale: ξ ∈ (0, ∞)). 75 Thus, it also arises as the convolution of a Poisson distribution and a negative binomial distribution. 76 Here, we use a Delap(15, …) distribution (see Figure 2).
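The Delaporte construction described above — a Poisson draw whose mean is λ plus a gamma-distributed component — can be sketched as follows. The values κ = 2 and ξ = 10 are illustrative placeholders, not the (truncated) values used in the study; only λ = 15 is taken from the text.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's inversion-by-multiplication method; adequate for moderate means
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def delaporte(lam, kappa, xi):
    # Poisson whose mean is the fixed component lam plus a
    # Gamma(shape=kappa, scale=xi) distributed component
    return poisson(lam + random.gammavariate(kappa, xi))

# lam = 15 as in the text; kappa and xi are illustrative placeholders
draws = [delaporte(15, 2, 10) for _ in range(5000)]
mean = sum(draws) / len(draws)  # theoretical mean: lam + kappa * xi = 35
```

The resulting counts are overdispersed relative to a plain Poisson, which is what makes the distribution attractive for "mixed" study sizes.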
For the location shift analysis and the robustness analysis, we only use small group sizes. Mixed group sizes would have hindered a clean interpretation and large group sizes are less common in practice. 70

| Overall treatment effect (θ)
For the main analyses as well as for the robustness analysis, we use θ = 0.5 as this is a reasonable value for the SMD 78 as well as for the log OR 79 that is also used in many existing simulation studies. 1,8,14,71 In the location shift analysis, we will compare the results for θ = 0.5 to those for θ ∈ {0,0.8} in selected scenarios (see Table 2).

| Study-specific treatment effects (θ k )
The θ_k will be generated according to the N(θ, τ²) distribution (abbreviated in tables and figures by "N"), except for the robustness analysis where we will investigate a skew-normal distribution ("sN") and a bimodal 50:50 mixture of two normal distributions of equal variance ("biN"). To maintain comparability, we ensured that the mean and the variance of the skew-normal distribution and the bimodal normal mixture are also equal to θ and τ², respectively.
For drawing from the skew-normal distribution, we use the function rsnorm() from the R 2 package fGarch 80 with argument xi (determining the skewness) set to 1/2. For the bimodal normal mixture, we need to specify the distance δ_mix between the means of the two normal distributions. In order to have a positive common variance and an actual bimodal distribution, 81 this distance needs to fulfill δ_mix ∈ (√2·τ, √4·τ). Thus, we set δ_mix := √3·τ (more or less heuristically). For the τ² value used in the robustness analysis (τ² = 0.15), we get δ_mix = √(3·0.15) ≈ 0.67. The densities of this skew-normal distribution and this bimodal normal mixture are illustrated in Figure 3, together with the corresponding normal distribution.
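The moment constraints above pin down the mixture components: with δ_mix = √3·τ, the two components sit at θ ± δ_mix/2 with common variance τ² − δ_mix²/4 = τ²/4. A quick numerical check (not taken from the paper) that the mixture indeed has mean θ and variance τ²:

```python
import math
import random

random.seed(2)

theta, tau2 = 0.5, 0.15
tau = math.sqrt(tau2)
delta = math.sqrt(3.0) * tau                  # distance between component means
sigma_c = math.sqrt(tau2 - delta ** 2 / 4.0)  # common component SD; equals tau/2

def draw_bin():
    # 50:50 mixture of N(theta - delta/2, sigma_c^2) and N(theta + delta/2, sigma_c^2)
    mu = theta - delta / 2 if random.random() < 0.5 else theta + delta / 2
    return random.gauss(mu, sigma_c)

sample = [draw_bin() for _ in range(200_000)]
m = sum(sample) / len(sample)
v = sum((x - m) ** 2 for x in sample) / len(sample)
```

With δ_mix in the stated interval, sigma_c is real and positive and the density has two modes, as required.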

| Parameters specific to the effect measures
In case of the SMD, we choose μ C, k = 0 and σ 2 k = 1 as done in other simulation studies. 28,71 In case of the log OR, we choose an "overall control group event probability" of p C ∈ {0.1,0.5} as done, for example, by Hartung and Knapp. 25 The value of p C is varied since the coverage generally depends on this parameter. 1,25

| RESULTS
Issues that occurred during the simulation are listed in Section S1. The plots for the interval length as well as some other plots may be found in Section S4. Henceforth, all references to Supporting Information 1 may be recognized by the prefix "S", for example, Section S1 or Figure S1.
The interval length as well as K are always plotted on logarithmic scale. In situations where there are many scenario parameters to visualize, we use a nested loop plot 82 (see e.g., Figure 5).

| Main analysis
We first performed the following pre-selections:
1. pre-selection of the τ² estimator (PM or SJ) in the HKSJ CI,
2. pre-selection of the prior for τ (informative, weakly informative, or "noninformative") in the Bayesian CrI.
The corresponding results are given in Section S2. The SJ estimator is selected for the HKSJ CI and the hC(0.5) prior for the Bayesian CrI.
We now focus on comparing all methods chosen a priori or in the respective pre-selection. Figure 4 and Figure S10 show the coverage and the interval length for these methods. Figure S9 corresponds to Figure 4, but zoomed into the relevant part of the coverage. Interestingly, the HKSJ-SJ CI and the bootstrap CI building upon it (pBoot-HKSJ-SJ) have a very similar coverage and interval length. Since the formulas of these CIs only differ in their quantiles, this similarity suggests that the t_{K−1} distribution underlying the HKSJ CI is a good approximation to the true sampling distribution of θ̂(τ̂²). This might be the reason why, in terms of coverage, the HKSJ-SJ and the pBoot-HKSJ-SJ CI perform best in most scenarios. They also yield the longest intervals, but in most scenarios, this additional length is not too large. For K = 2 however, the length of these CIs increases drastically. This is due to t_{1, 0.975} ≈ 12.7. 4 The huge length is also associated with a decrease in power for the corresponding test. 83 Thus, the HKSJ-SJ and the pBoot-HKSJ-SJ CI cannot be recommended in the case K = 2. As was shown before, 1,8 the WTz-DL CI often has a coverage well below the nominal level. Its short length is not an advantage but rather associated with this inappropriate behavior.
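The K = 2 pathology can be made concrete: the t distribution with one degree of freedom is a standard Cauchy, so its 97.5% quantile has a closed form. A quick check (our own illustration, not taken from the paper):

```python
import math

def t1_quantile(p):
    # t with 1 degree of freedom is a standard Cauchy distribution,
    # so its quantile function is tan(pi * (p - 1/2))
    return math.tan(math.pi * (p - 0.5))

q = t1_quantile(0.975)    # close to 12.71
inflation = q / 1.959964  # relative to the normal quantile z_{0.975}
```

The half-width of the HKSJ CI for K = 2 is thus inflated roughly 6.5-fold relative to a normal-approximation interval with the same SE, which explains the extreme lengths observed.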
The coverage of the Skovgaard CI is also below the nominal level if τ² is large and K is very small. In contrast, for τ² = 0, the coverage of the Skovgaard CI is above the nominal level. The behavioral pattern of the Skovgaard CI is to some degree similar to the Bayes-hC(0.5) CrI's. However, when looking closer at the problematic case K = 2, the coverage of the Bayes-hC(0.5) CrI does not decrease as heavily as the Skovgaard CI's when τ² increases. Apart from that, the length of the Bayes-hC(0.5) CrI is usually even lower than the Skovgaard CI's. This suggests a preference for the Bayes-hC(0.5) CrI over the Skovgaard CI in the case K = 2.

| Location shift analysis
The coverage and the interval length for the location shift analysis are shown in Figure 5 and Figure S11, respectively. Here, we omit the pre-selection steps for the HKSJ CI and the Bayesian CrI (as they do not provide any new insights beyond the results presented here), but they may be found in the online repository for the software code. 3 The same applies to the robustness analysis following in section 3.1.3.
The interval length increases slightly for increasing θ (Figure S11). This may be due to the fact that ν_k increases with increasing |θ_k| (see Equation (36) and use the fact that c_k ∈ (0.957, 1) for the study sizes N_k ≥ 20 used here) so that on average, an increasing |θ| leads to:
• an increasing SE 1/√(w(τ²)) of θ̂(τ²) given a fixed τ² (relevant for the WTz-DL, the HKSJ-SJ, and the pBoot-HKSJ-SJ CI) and
• a wider likelihood for θ given τ² (relevant for the Skovgaard CI and the Bayes-hC(0.5) CrI; for the latter, this yields a wider marginal posterior for θ).
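The mechanism in the first bullet can be illustrated with the common large-sample variance approximation for the SMD, used here as a stand-in for Equation (36), which is not part of this excerpt:

```python
def smd_var(d, n_t, n_c):
    # common large-sample approximation for the variance of an SMD estimate;
    # the second term makes it increase monotonically in |d|
    return (n_t + n_c) / (n_t * n_c) + d * d / (2 * (n_t + n_c))

# variance for |d| = 0, 0.5, 0.8 with two groups of 20 patients
vals = [smd_var(d, 20, 20) for d in (0.0, 0.5, 0.8)]
```

Larger |θ_k| thus yields larger ν²_k, smaller precision weights, and a larger SE of the pooled estimate.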
Note that in the following, we often have similar observations for the frequentist and the Bayesian methods, which is not surprising due to their relationship under the U(−∞, ∞) prior for θ when conditioning on τ² (see section 2.3.5).
For very large K (K = 55), the coverage decreases with increasing θ. This has already been observed by Sánchez-Meca and Marín-Martínez 28 who explain this by an increasingly negative bias of θ̂(τ̂²) for increasing θ (see also Viechtbauer 9 ). The negative bias itself might again be due to the monotonic relationship between ν_k and |θ_k|. Note that Sánchez-Meca and Marín-Martínez 28 use K ∈ {5, 10, 20, 40, 100} and average the coverage over their scenarios defined by K, N_k, and τ². Thus, it seems indeed plausible that Sánchez-Meca and Marín-Martínez 28 observe the same effect as we do for K = 55. For K < 55, we cannot observe this effect consistently. This might be due to the Monte Carlo error being too large to detect such small differences.

| Robustness analysis
When taking into account the Monte Carlo error, the main conclusion from the plots for the coverage and the interval length in the robustness analysis (Figure 6 and Figure S12, respectively) is that all methods are relatively robust against deviations from the normal distribution of the study-specific effects. The only exceptions might be the coverages of the WTz-DL and the Skovgaard CI when the study-specific effects follow the "biN" distribution in the case K = 2. All other differences in coverage are rather small compared to the Monte Carlo error and not consistent across the values for K. The missing consistency also holds for the interval length: There might be a slight increase in the interval length for the "sN" distribution and an additional slight increase for the "biN" distribution, but not consistently across the values for K.

FIGURE 7 Coverage of all chosen interval estimators. Effect measure: log odds ratio
The performance of the HKSJ-SJ and the pBoot-HKSJ-SJ CI is again very similar and they usually provide the coverage closest to the nominal level. In particular, we may conclude that the pBoot-HKSJ-SJ CI is not needed to ensure robustness.
For a subset of the methods investigated here, this robustness has already been observed-also under additional non-normal distributions. 63,84

| Log odds ratio
The pre-selection steps (τ 2 estimator for the HKSJ CI, τ prior for the Bayesian CrI) are also performed for the log OR, but shown in Section S3. We draw the same conclusions as for the SMD, that is, the SJ estimator is selected for the HKSJ CI and the hC(0.5) prior for the Bayesian CrI.
Here, we will focus on the comparison of all chosen methods. The coverage and the interval length are illustrated in Figure 7 and Figure S13, respectively. Essentially, the results are very similar to those for the SMD. Most importantly, when looking at the overall performance and taking into account the Monte Carlo error, the HKSJ-SJ and the pBoot-HKSJ-SJ CI perform best in terms of coverage. They are strikingly similar in terms of coverage and interval length. Besides, they are usually the longest intervals, but the difference from the other methods is already small for K = 3 and decreases even further for increasing K. For K = 2 however, the length of these two CIs is extreme compared to the other methods. As for the SMD, the Skovgaard CI has a coverage below the nominal level if τ² is large and K is very small. To some degree, the Skovgaard CI and the Bayes-hC(0.5) CrI again behave similarly in terms of coverage. In particular, for large τ², the Bayes-hC(0.5) CrI sometimes also has a coverage below the nominal level. However, when looking only at K = 2 (and large τ²), the coverage of the Bayes-hC(0.5) CrI is still higher than the Skovgaard CI's. The inferiority of the WTz-DL CI in terms of coverage is confirmed here again.
Concerning the behavior of the coverage with respect to p C , we observe that for the WTz-DL CI, the Skovgaard CI, and the Bayes-hC(0.5) CrI, the coverage usually decreases for common events (p C = 0.5) compared to rare events (p C = 0.1). If the coverage is already above the nominal level for both p C values (usually for small to moderate τ 2 ), this yields a "common events" coverage closer to the nominal level than the "rare events" coverage. In contrast, if the coverage is already below the nominal level for both p C values (especially for large τ 2 ), this yields a "common events" coverage farther away from the nominal level. Interestingly, for small K (K ∈ {2, 3}), the HKSJ-SJ and the pBoot-HKSJ-SJ CI usually behave the opposite way (see also Figure S5 where this is also the case for the HKSJ-PM CI).
The ranking of the interval estimators with respect to their length is the same as for the SMD. The influence of p_C on the interval length is very consistent: Rare events lead to longer intervals than common events and this is more pronounced as τ² decreases. The longer intervals for rare events might be due to increasing SEs ν_k of the log OR estimators θ̂_k in case of rare events. For the WTz-DL, the HKSJ-SJ, and the pBoot-HKSJ-SJ CI, this leads to an increasing SE √(1/w(τ²)) of θ̂(τ²) given a fixed τ² and this also explains why this phenomenon is more pronounced as τ² decreases: For small τ², the ν²_k dominate the precision weights w_k(τ²) = 1/(τ² + ν²_k). For the Skovgaard CI and the Bayes-hC(0.5) CrI, the increasing ν_k lead to a wider likelihood for θ given τ² (yielding a wider marginal posterior for θ in case of the Bayes-hC(0.5) CrI). Again, this also explains why this phenomenon is more pronounced as τ² decreases: For small τ², the ν²_k dominate the term (τ² + ν²_k) occurring in the likelihood (see Equation (14)).
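The rare-events mechanism can be checked with the standard large-sample variance of the log OR from a 2×2 table (a generic textbook formula, not reproduced from the paper; the paper's zero-cell handling is a separate issue discussed later):

```python
def log_or_var(e_t, n_t, e_c, n_c):
    # standard large-sample variance of the log OR from a 2x2 table:
    # 1/a + 1/b + 1/c + 1/d over the four cell counts
    return 1 / e_t + 1 / (n_t - e_t) + 1 / e_c + 1 / (n_c - e_c)

rare = log_or_var(2, 20, 2, 20)      # event counts near p_C = 0.1
common = log_or_var(10, 20, 10, 20)  # event counts at p_C = 0.5
```

Small event counts inflate the reciprocal cell terms, so ν²_k is largest for rare events, which then dominates the weights when τ² is small.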

| Recommendations for applied researchers
We want to keep our summarizing recommendations as simple as possible. Thus, even though the performance of all methods becomes more and more similar for increasing K, we generally recommend the HKSJ-SJ CI as it performs best in most scenarios.
TABLE 3 Study-specific results from a meta-analysis by Donndorf et al, 85 Figure 2. The MD was originally used as effect measure, but the data is also analyzed on SMD scale here

In the special case K = 2, the usage of the HKSJ-SJ CI is limited by its extreme length. The alternative methods (WTz-DL, Skovgaard, Bayes-hC(0.5)) have the downside that for K = 2, their coverage decreases for increasing τ², eventually falling below the nominal level for sufficiently large τ². Among those alternative methods, the Bayes-hC(0.5) CrI still performs best in terms of coverage and still better than the Skovgaard CI in terms of interval length. Thus, for K = 2, we recommend the Bayes-hC(0.5) CrI.
Note that the above recommendations hold for the SMD as well as for the log OR.

| EXAMPLE FROM CARDIAC SURGERY
We illustrate our recommendations using a meta-analysis conducted by Donndorf et al, 85 where the treatment group received an autologous intramyocardial bone marrow stem cell transplantation during coronary artery bypass grafting while the control group only underwent the coronary artery bypass grafting. The outcome considered here is the left-ventricular ejection fraction (LVEF) difference, calculated as the LVEF after the intervention (in %) minus the LVEF before the intervention (in %). For this outcome, K = 6 studies were included in the meta-analysis. The effect measure used by Donndorf et al 85 is the mean difference (MD), that is, the difference of the treatment and control group means. The results from the 6 studies are shown in Table 3, 85 and for all investigated τ² values greater than 0, this can be seen in the 4 rightmost subplots in the first row of Figure 4.
This example also demonstrates that the conclusions drawn from MD and SMD analyses need not coincide. Possible reasons are:
1. The SMD includes the standard deviation estimator σ̂_k in its denominator, which might introduce additional variability (seen with respect to the scale, e.g., by looking at the coefficient of variation).
2. The true distribution of the SMD estimator θ̂_k (given θ_k) is a scaled noncentral t distribution, 67 so the normal distribution is just an approximation, even when assuming normally distributed patient-level observations. Under the same assumption, the MD estimator exactly follows a normal distribution. 9
3. For the SMD, ν_k and |θ_k| follow a positive monotonic relationship (see section 3.1.2).
As all studies in the meta-analysis by Donndorf et al 85 measure the LVEF difference in the same unit, we consider the MD to be a legitimate effect measure 26,92,93 which allows for an easier interpretation. 93 Nevertheless, we point out that even when using the MD, the significance of the difference of the overall treatment effect from 0 is only weak (HKSJ-SJ CI: [0.71, 10.19]). Had the null hypothesis not been a point hypothesis (θ = 0) but rather an interval hypothesis (θ ∈ [−δ, δ] with a prespecified δ > 0), the significance would probably not have persisted. An interval hypothesis might be more helpful, especially for a medical interpretation.

| DISCUSSION
The approach for correcting "zero cells" when analyzing binary data has a crucial impact on the coverage. 94 The approach chosen here (add 1/2 to all cells of studies having at least one zero cell) was shown to be preferable over the approach of adding 1/2 to all cells of all studies (no matter if they have a zero cell or not), especially for rare events. 94 A zero-cell correction may be avoided in the first place, for example, by using the arcsine difference as effect measure 95 or by using the BNHM mentioned in section 2.1 instead of the NNHM, either in a Bayesian framework [96][97][98] or in a frequentist framework. [99][100][101]

FIGURE 8 Decision tree illustrating the recommendations from our frequentist performance evaluation. K is the number of studies included in the meta-analysis. Note that the Bayesian CrI may always be used in a fully Bayesian meta-analysis

In case of a meta-analysis of only K = 2 studies, the RE model might not be the optimal model in the first place since it basically requires estimating a variance (the between-study variance τ²) with only two data points and since a power gain (compared to the average power of the individual studies) is not to be expected for K < 5. 102 Bender et al 6 investigate the case K ∈ {2, 3, 4} more deeply than we do. Their conclusions are very detailed (also including the FE model and a qualitative evidence synthesis, apart from the HKSJ and the Bayesian approach) and thus, we refer readers interested in meta-analyses with small K to their paper.

| Interval estimators
For the estimation of τ 2 in the HKSJ CI, we did not consider the REML estimator as it is closely related to the PM estimator. 5,19,20 We could confirm this similarity for the HKSJ CI in the SMD main analysis of our simulation (results not shown).
Our simulation results provide an additional justification for excluding the mHKSJ CI (cf. section 2.3.6): First, the mHKSJ CI would not have helped for K = 2 since it is always at least as long as the HKSJ CI. Second, for general K, the coverage of the HKSJ CI is already close to the nominal level so that a substantial improvement is not to be expected from the mHKSJ CI. However, note that the mHKSJ CI has been recommended 1 if:
1. K is small and the study-specific SEs ν_k show high variability 56 (the latter being reflected by the mixed study sizes in our simulation study) or
2. binary data is analyzed. 103
In those situations, we could not observe a remarkable failure of the HKSJ CI. But, as will be discussed in section 5.2, our simulation setup might not be extreme enough in those cases. Furthermore, the mHKSJ CI has been recommended if the estimated between-study variance is zero. 1 However, for the SJ estimator recommended here, this is not possible: It gives a positive estimate for τ² almost surely.
In a personal communication, the authors of the metaLik package 32 pointed out that the Skovgaard method was mainly intended to correct the likelihood ratio test, not the corresponding CI (the ordinary PL CI). Thus, when comparing different tests, especially in terms of power, the Skovgaard method is still worth including.
Concerning the bootstrap CI investigated here (the pBoot-HKSJ-SJ CI), our simulation results show that its additional computational and theoretical complexity (compared to the HKSJ-SJ CI) does not provide any remarkable advantage for applied researchers. This is why we favor the simpler HKSJ-SJ CI instead.
Because of the general downsides of so-called "noninformative" priors in a Bayesian analysis (see section 2.3.5), we also had a look at a weakly informative prior for the overall treatment effect θ instead of the U(−∞, ∞) prior: For the log OR, Günhan et al 98 derived the N(0, 2.82²) distribution as a weakly informative prior for θ. We tested this prior (together with the hC(0.5) prior for τ), but found little difference in coverage and interval length compared to the U(−∞, ∞) prior, at least for our preset value of θ = 0.5 (see Figures S14 and S15). This is in accordance with the results from Günhan et al. 98 ‡ The only minor systematic difference between the two priors for θ observable here is that for small K, the weakly informative prior yields a slightly shorter CrI (see Figure S15), which is also in accordance with Günhan et al. 98 This shorter length for small K may be directly explained by the narrower prior for θ, carrying forward its narrowness to the marginal posterior for θ in case of sparse data.
Our choice of the half-Cauchy distribution as a weakly informative prior for τ was mainly based on theoretical considerations (see section 2.3.5). However, it may be interesting to include other priors in the comparison: Under the assumption of a "constant rate penalisation," Simpson et al 104 derive the exponential distribution as a weakly informative prior for τ (even though not in a meta-analytic context). In future studies, it might be interesting to compare the exponential prior to the half-Cauchy prior. In this case, the gamma distribution (as a generalization of the exponential distribution) could also be investigated. Hamaguchi et al 46 investigate various "noninformative" priors for τ (and for transformations of τ) by simulation. They focus on prediction intervals, but also investigate CrIs for θ, with very similar results for these two types of intervals. Like in our work, they also observe that the performance of the Bayesian CrI largely depends on τ 2 (apart from the dependence on other simulation parameters like K), so our general recommendation of the HKSJ-SJ CI for K > 2 is not challenged by the additional priors investigated by Hamaguchi et al. 46 However, the special case K = 2 is not investigated by Hamaguchi et al, 46 so we ran parts of our simulation study (mixed study sizes) with K = 2 again for the priors performing best in the simulation study by Hamaguchi et al. 46 These best performing priors were "Jeffreys' prior" 43,105,106 and the "conventional prior". 43,107 For Jeffreys' prior, the results by Hamaguchi et al 46 already indicated that the Bayesian CrI might be longer than the HKSJ-SJ CI and our additional simulation confirmed this (results not shown). Thus, we did not investigate Jeffreys' prior further. For the conventional prior, we received a lot of warnings from the R 2 package bayesmeta 43 (about 10% of all scenario replications). 
Thus, before being able to interpret the results for the conventional prior reliably, future investigations first need to clarify where these warnings come from. The remaining 90% of the replications (results not shown) indicate that the conventional prior might perform slightly better than the hC(0.5) prior. Note that Hamaguchi et al 46 also mention the "Berger-Deely prior", 43,107 the U(0, 10) prior for τ, and the U(0, ∞) prior for τ among the better performing priors, but the Berger-Deely prior leads to an even longer CrI than Jeffreys' prior, 46 the U(0, 10) prior performs similarly to the U(0, ∞) prior, 46 and the U(0, ∞) prior was already included in our simulation study.
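For intuition about the hC(0.5) prior discussed throughout, a half-Cauchy draw can be generated by folding a Cauchy draw at zero. This is only an illustrative sketch; the actual analyses in the paper use dedicated R packages such as bayesmeta.

```python
import math
import random

random.seed(3)

def half_cauchy(scale):
    # inverse-CDF sampling: |scale * tan(pi * (U - 1/2))| is half-Cauchy(scale)
    return abs(scale * math.tan(math.pi * (random.random() - 0.5)))

draws = sorted(half_cauchy(0.5) for _ in range(100_000))
median = draws[len(draws) // 2]  # the hC(0.5) prior has median 0.5
```

The heavy right tail (no finite mean) is what makes the half-Cauchy "weakly" rather than strongly informative: it concentrates mass at plausible small τ values while still allowing large heterogeneity.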
Note that the Bayesian CrI may always be used in a fully Bayesian meta-analysis. Apart from the typical advantages of Bayesian analyses in general 108 (most importantly, a full posterior distribution for θ and a more intuitive interpretation of the interval estimate), this has the advantage that posteriors for the θ_k and posterior predictive distributions (for the θ_k as well as for the θ̂_k) are easily obtained. 43 In a fully Bayesian meta-analysis, using a weakly informative prior for θ instead of the U(−∞, ∞) prior might be preferred. 12,44,98,109 In that case, if a weakly informative prior with heavier tails than the normal distribution is desired (e.g., a Cauchy distribution 109 ), the R 2 package bayesmeta 43 cannot be used anymore. The computational flexibility required for such an analysis is offered, for example, by the R 2 package brms 110,111 which is based on Stan 112 and its R 2 interface package rstan. 113 Note that in a fully Bayesian meta-analysis, the U(0, ∞) prior for τ clearly does not make sense as authors of real-world meta-analyses usually only include studies with comparable designs (which restricts their heterogeneity). Thus, the choice of a prior for τ is then rather between a weakly informative (e.g., half-Cauchy with scale 0.5) and an informative prior. For the latter, a context-specific setting of the parameters is offered by Turner et al. 48

The following interval estimators were excluded from our comparison:
1. the quantile-based mHKSJ CI,
2. alternative SE estimators for the HKSJ CI, 27,117
3. the Kenward-Roger CI, 118
4. several exact CIs based on inverting hypothesis tests, 119,120
5. an exact CI based on Monte Carlo sampling and inverting a hypothesis test, 121
6. a CI based on a conditional Monte Carlo approach and the inversion of a hypothesis test. 122

We excluded the quantile-based mHKSJ CI since it is very similar to the mHKSJ CI.
This does not only refer to their conceptual similarity: In the SMD main analysis of our simulation study (see section 2.5), we observed that in general, the quantile-based mHKSJ CI is overly conservative (results not shown) which has already been observed-to a larger extent-for the mHKSJ CI. 56,123 However, like the mHKSJ CI, the quantile-based mHKSJ CI may be more appropriate than the HKSJ CI in case of extremely varying within-study precisions. Our study sizes might not be mixed extremely enough to detect this (see section 5.2).
We excluded the alternative SE estimators 27,117 for the HKSJ CI since the simulation study by Partlett and Riley 123 showed that they lead to undercoverage and therefore perform worse than the HKSJ CI.
We excluded the Kenward-Roger CI since the simulation study by Partlett and Riley 123 showed that in general, it is overly conservative. For extremely mixed study sizes however, the Kenward-Roger CI might perform better than the HKSJ CI. 123 Again, our study sizes might not be mixed extremely enough (see section 5.2).
The exact CIs by Liu et al 119 and Wang and Tian 120 were excluded here since they suffer from the same problem as the FP CI: They are not appropriate for K < 6 and α = 0.05.
Michael et al 121 state that their exact CI performs similarly to the Bayesian CrI with "a typical" noninformative prior. This was the main reason why it was excluded here, but when trying to apply their method in our simulation study (via the R 2 package rma.exact 124 ), we also ran into a lot of errors (more than with any included method).
The conditional Monte Carlo CI by Sugasawa and Noma 122 might be interesting for future investigations but we had to exclude it here for technical reasons (the runtime being about 11 times higher than for the Bayesian CrI which is already the most time-consuming method of our simulation study).

| Simulation settings
Our simulation scenarios were chosen with a focus on medical research. Even though our simulation study is extensive, it cannot cover all possible settings. Our conclusions are restricted to the settings we have investigated.
In a real-world meta-analysis, the assumption of variance homogeneity (i.e., the same σ 2 k in both treatment groups) is most probably violated. However, extending our simulation study to include scenarios with heterogeneous variances would be outside of the scope of our article.
When simulating log ORs, the procedure for generating the event probabilities p_{T,k} and p_{C,k} has a nonnegligible impact, especially for small study sizes N_k. 69 For the HKSJ CI however, this influence is rather small 69 so that our general recommendation of the HKSJ-SJ CI should remain valid.
The probability of p_C = 0.1 might be considered too large for actually rare events as they occur, for example, for safety endpoints of a clinical trial. 85,98 In our "mixed" setting, the study sizes N_k could have been varied more extremely and not as smoothly as implied by the Delaporte distribution. This might lead to conclusions comparable to Röver et al. 56

For the bimodal normal mixture in the robustness analysis, we note that a mixture ratio other than 50:50 (in particular, an extreme mixture with one component being much more probable than the other) might have altered our results and would be interesting for future investigations.
Finally, our simulation study does not answer the question which method to choose in meta-regressions or in meta-analyses where a publication bias is deemed possible. For the latter, the HC CI (see section 2.3.6) or the IVhet CI 73 might be methods to include in a comparison.