A confidence interval robust to publication bias for random-effects meta-analysis of few studies

Systematic reviews aim to summarize all the available evidence relevant to a particular research question. If appropriate, the data from identified studies are quantitatively combined in a meta-analysis. Often only few studies regarding a particular research question exist. In these settings the estimation of the between-study heterogeneity is challenging. Furthermore, the assessment of publication bias is difficult as standard methods such as visual inspection or formal hypothesis tests in funnel plots do not provide adequate guidance. Previously, Henmi and Copas (Statistics in Medicine 2010, 29: 2969-2983) proposed a confidence interval for the overall effect in random-effects meta-analysis that is robust to publication bias to some extent. As is evident from their simulations, the confidence intervals have improved coverage compared with standard methods. To our knowledge, the properties of their method has never been assessed for meta-analyses including fewer than five studies. In this manuscript, we propose a variation of the method by Henmi and Copas employing an improved estimator of the between-study heterogeneity, in particular when dealing with few studies only. In a simulation study, the proposed method is compared to several competitors. Overall, we found that our method outperforms the others in terms. In particular, an improvement in coverage probability of the new method compared with the proposal by Henmi and Copas is demonstrated. The work is motivated and illustrated by a systematic review and meta-analysis in paediatric immunosuppression following liver transplantations.


INTRODUCTION
Systematic reviews aim to summarize all the available evidence relevant to a particular research question. If appropriate, the data from identified studies are quantitatively combined in a meta-analysis. If the true effect is the same in all studies to be combined in a meta-analysis, then the so-called common-effect or fixed-effect model is appropriate. In practical applications this assumption often appears to be too strict as some level of between-trial heterogeneity in the effects is suspected. Then the random-effects model is applied which treats the latent study-specific effects as random from an assumed distribution, often a normal distribution. Inference regarding the overall effect in the standard model relies on normality assumptions which apply if the number of trials and the trials themselves are sufficiently large. A number of estimators for the between-trial heterogeneity have been proposed [1]; among them the standard approach by DerSimonian and Laird [2].
Recently the topic of meta-analysis of few studies, say 2 -5, got more attention, since this case is very common in practice. With few studies, however, confidence intervals of the overall effect based on normal quantiles tend to be too short as they ignore the uncertainty in estimating the between-trial heterogeneity. As remedies, methods based on t-quantiles have been proposed [5,6,7,8]. With few studies only, however, they are often conservative and so long that they are uninformative [9]. Also various likelihood-based methods have recently been assessed in the specific situation of few studies and not found to be a solution to the problem [10]. Between-trial heterogeneity estimates mentioned above often result in zero [4], with the notable exception of the method proposed by Chung et al [3]. Chung et al suggested the so-called Bayes modal (BM) estimator, which uses in a Bayesian framework a weakly informative prior for the between-trial heterogeneity to avoid zero estimates of the heterogeneity. Furthermore, a fully Bayesian approach to random-effects meta-analysis with weakly informative priors for the between-trial heterogeneity parameter has some advantages in this situation, since zero estimates are avoided as with the BM estimator and in addtition the uncertainty in estimating the heterogeneity is accounted for [4,11]. Of course the Bayesian credible intervals would not necessarily have frequentist properties. Evaluating the operating characteristics in extensive simulation studies, it was found that the frequentist coverage probabilities are often above the nominal level with conservative choices of the prior for the between-trial heterogeneity [4,11]. Bender et al [12] recently provided an overview on the topic of meta-analyses with few studies.
In systematic reviews, relevant evidence is identified through systematic searches of literature databases. If all relevant studies would be published, this would be sufficient. However, this is not always the case. The problem was first described as the 'file drawer problem' [13]. Today various types of reporting biases are carefully distinguished including publication bias, time lag bias, citation bias and outcome reporting bias to name but a few. Studies might not be published at all for various reasons or only with a certain delay or in journals or languages that are more difficult to access (see e.g. Table 7.2.a in [14]). In the following we focus here on the aspect of publication bias. Prospective registration of clinical trials is one way to tackle this problem. It has become standard practice to search not only at least two electronic databases of the literature but also to search at least one registry for clinical studies such as clinicaltrials.gov. The idea would be to include unpublished studies in systematic reviews. However, access to the unpublished results is often challenging as it requires the cooperation of investigators, sponsors etc.
A number of methods have been proposed over the years to deal with publication bias [15]. A popular way to interrogate data for publication bias is the visualization in form of a so-called funnel plot. In this scatter plot, each study contributes an estimate of an effect measure and its estimated standard error. The former is plotted on the x-axis, the latter on the y-axis. If no publication bias is present, we would expect the plot to be symmetric to the vertical line running through the average effect. Any absence of this symmetry might be interpreted as a signal that some form of reporting bias might be present. As this can be difficult to judge, formal hypothesis tests have been proposed (see e.g. [16]). The problem with the visual inspection as well as with the formal tests is that they become more powerful with larger number of studies, but are less sensitive with few studies only. In the context of funnel plots, trim-and-fill methods have been proposed to correct the overall effect for potential publication bias [17]. Following an alternative approach, several sensitivity analysis methods have been suggested based on selection functions describing the selective publication process [18,19,20]. For instance, Copas and Jackson [19] investigated the maximum bias over all possible selection functions which satisfy the (fairly weak) condition that studies with smaller standard errors are at least as likely to be selected than studies with larger standard errors. Building on their work, Henmi et al [20] developed sensitivity analyses that, in contrast to the proposal by Copas and Jackson [19], account for uncertainty in estimation. Again, the methods are not designed for the setting of few studies only.
In contrast to the approaches described above, Henmi and Copas [21] proposed a method for random-effects meta-analysis that is robust to selection. They modified the DerSimonian-Laird (DL) confidence interval [2] by replacing the random-effect estimator by the fixed-effect estimator of the overall effect and by replacing the normal quantiles by more accurate ones. The latter depends on the between-trial heterogeneity. The DL estimator is used in the computation of the quantiles. Therefore, with few studies this approach may not work well. Here we propose a modification of the Henmi-Copas method by replacing the estimator of the between-study heterogeneity in the computation of the quantiles by the one developed by Chung et al [3]. The properties of the new approach are assessed and compared to alternative methods including the Henmi-Copas approach in Monte Carlo simulation studies considering in particular the case of few studies with and without publication bias.
Our work is motivated by a systematic review and meta-analysis of controlled clinical trials assessing efficacy and safety of Interleukin-2 receptor antagonists (IL-2RA) in children having undergone liver transplantations [22], a rare surgical procedure in children. In total only six relevant studies were identified with little standardization with regard to the design of the studies implying some level of heterogeneity. Although the authors carefully checked for publication bias using standard techniques, it cannot be excluded that in particular some smaller studies were not published if they resulted in inconclusive treatment effects.
The manuscript is organized as follows. In the next section the new confidence interval for the overall effect is developed starting by introducing notation and reviewing existing methodology. The simulation study assessing the properties of the new confidence interval in comparison to existing methods is presented in Section 3. In Section 4 the proposed method is applied to the motivating example. We close with a brief discussion of our findings and their limitations.

METHODS
Adopting the notation by Henmi and Copas [21], the true effect of an individual study i out of n independent studies is denoted by θ i . Estimates y i of the effects θ i are observed with stand errors σ i . Here we consider the normal-normal hierarchical model (NNHM), which is the standard model for random-effects meta-analysis. In the NNHM, it is assumed that the θ i are from a normal distribution with expectation θ and variance τ 2 , i.e.
Furthermore, the effect estimators Y i follow (at least approximately) a normal distribution with expectation θ i and variance σ 2 i , i.e.
From Equations (1) and (2) they follow the marginal model If the between-trial heterogeneity τ 2 is 0, then the random-effects model reduces to the so-called fixed-effect or common-effect model. The focus of our study is inference regarding θ, the overall effect. A standard method to construct an estimator and a (1−α) confidence interval for θ was proposed by DerSimonian and Laird [2] (DL). In short, the DL estimator of θ is given bŷ Here, the DL estimatorτ 2 DL of the between-study heterogeneity τ 2 is given byτ The weights w i are the fixed effect weights (with τ 2 = 0), which are whereθ F is the fixed (or common) effect estimator of the overall effect witĥ If the estimatorτ 2 DL is assumed to be a fixed constant as the true value of τ 2 , then it holds that This results in the DerSimonian-Laird (1 − α)% confidence interval (DL) for θ, which is given by where z γ is the γ quantile of the standard normal distribution. The assumption that the estimateτ 2 DL is the true value of τ 2 might be reasonable when the between-study heterogeneity can be estimated with high precision, i.e. when the number of studies included in the meta-analysis is large. In medical applications, however, this is frequently not the case. As noted by several authors, the application of the DL approach in metaanalyses with small to moderate numbers of studies results in coverage probabilities below the nominal level 1 − α [4].
Henmi and Copas [21] tackled the two problems that (a) the distribution of the pivot statistics is quite different from the standard normal distribution when the number of studies n is small, and (b) the estimators of θ are biased due to selective publication of smaller studies with less favourable results (publication bias). With respect to the latter they note that the common (or fixed) effect estimatorθ F is more robust to publication bias than the random-effects estimatorθ R simply because smaller studies, which are less likely to be published when their outcome is not favourable, have a smaller weight in the construction ofθ F than inθ R . To address the problem of the normal approximation they derive the distribution of the pivot statistics based on the fixed effect estimator under the random-effects model. More specifically, the pivot statistics U is given by The variance ofθ F is The variance V can be estimated by plugging inτ 2 DL for τ 2 . We denote this estimator of V byV which is given byV Recall that the weightsŵ i also depend onτ 2 DL . The point in the derivation of the distribution of U by Henmi and Copas is to take into account the random variation ofτ 2 as well asθ F . However, the derived expression of the distribution of U includes some conditional distribution of Q, which is difficult to calculate. Henmi and Copas [21] propose an approximation of this conditional distribution by the gamma distribution with the exact conditional mean and variance of Q. There is no explicit solution for the (approximate) γ quantile of U , which is denoted by u γ , but it can be obtained by means of numerical integration and optimization (see Appendix B in [21] for an implementation in R). More importantly, the approximation relies on an estimate of τ 2 , since the above conditional mean and variance of Q depend on the unknown true value of τ 2 . Henmi and Copas [21] suggest to use the DL estimatorτ 2 DL as provided in (5). Hence, a (1 − α) confidence interval for θ is given by In simulation studies, Henmi and Copas [21] could show that their approach improves coverage probabilities as compared to standard procedures including the DL approach. With only few studies included in the meta-analysis, however, the performance is not satisfying. The poor performance of the method in this particular situation is caused (at least partly) by the use of the DL estimatorτ 2 DL , which results frequently in zero estimates with few studies although the between-trial heterogeneity is positive, τ 2 > 0.
The use of weakly informative priors for the between-study heterogeneity to avoid zero estimates has been advocated for some time, whereas an uninformative, e.g. improper uniform, prior is used for the effect θ [23,24]. A number of suggestions have been made on the choice of such weakly informative priors for τ including half-t [24] and half-normal distributions [4]. Here we follow Chung et al [3] who proposed to use a gamma distribution with shape η and rate λ as a prior for τ , specifically p(τ ) = λ η τ η−1 e −λτ /Γ(η) with gamma function Γ(η). This choice means that the logarithm of the posterior of θ and τ is equal to the log likelihood plus a term depending only on τ but not θ. Rather than using the mean or median of the posterior, Chung et al [3] consider the mode, which can be computed by numerical optimization. This estimator of τ is referred to as the Bayes Modal (BM) estimatorτ BM . As default, Chung et al recommend to use α = 2 and λ close to 0. The BM estimatorτ BM can be interpreted as a penalized maximum likelihood (ML) estimator [3].
Here, we propose to replace the DL estimatorτ 2 DL in the computation of the quantiles of the pivot statistics U by the BM estimatorτ 2 BM . The resulting γ quantile is denoted by u (BM ) γ . The (1 − α) confidence interval for θ is then given by In summary, our idea is that we still use the DL estimatorτ 2 DL in the construction of the pivot statistics U given in (10) in the same way as Henmi and Copas [21], but we use the BM estimatorτ 2 BM in the approximation for the distribution of U instead ofτ 2 DL . The reason for the use of the DL estimatorτ 2 DL in the construction of U is that it is easier to calculate the distribution of the pivot statistics U , taking into account the effect of estimating τ 2 . However, the distribution of U depends on the unknown true value of τ 2 and it is necessary to use some estimate of τ 2 to approximate the distribution of U . One possibility is to use the DL estimatorτ 2 DL again, which was done in [21], but it would be inaccurate unless the number of studies are sufficiently large. Hence, we propose the use of the BM estimatorτ 2 BM to improve the accuracy in estimating τ 2 and in approximating the distribution of U , which we expect to lead the improvement of the coverage probabilities of the Henmi-Copas (HC) confidence interval (13). In the next section, by simulation studies, we show that the new confidence interval (14) actually improves the HC confidence interval (13) in coverage probability as well as the DL confidence interval (9) in both cases with and without publication bias, especially when the number of studies is small.

SIMULATION STUDY
In order to compare the performance of the proposed approach with previously suggested procedures a Monte Carlo simulation study was conducted. As comparators the methods by Henmi and Copas [21] (HC), Chung et al [3] (BM) and DerSimonian and Laird [2] (DL) were included. The first one is known to be robust to publication bias to some extent, but its performance in meta-analyses with few studies only is unknown. The approach by Chung et al [3] was developed for the scenario of few studies but might not be robust to publication bias. The DL approach was included here as it is often considered to be the standard approach to random-effects meta-analysis. The simulation model by Brockwell and Gordon [25] formed the basis for our simulation study. It was used in several recent simulation studies and therefore appeared to be a good choice. To account for publication bias, we used the same selection function (probability that a study with an outcome y and associated standard error σ is selected in the meta-analysis) as in [21] with the same sets of the parameters β and γ for moderate and severe publication bias. Here, Φ is the cumulative distribution function of the standard normal distribution. Table 1 summarizes the simulation scenarios considered. Per scenario N = 2, 000 simulation replications were run. Figure 1 presents the simulated coverage probabilities for the different confidence intervals in the various scenarios. In all scenarios considered, the proposed method performs at least as well as the HC method in terms of the coverage probability. With larger number of studies, say n ≥ 9, and more pronounced between-trial heterogeneity, say τ 2 ≥ 0.15, the performance of both approaches is fairly similar. With smaller numbers of studies or only low levels of heterogeneity, however, there is a clear advantage for the new proposal as it improves the coverage probability considerably. In scenarios with few studies, n = 3 or n = 6, and only low levels of between-trial heterogeneity, τ 2 = 0.05, the coverage probabilities of the BM approach are slightly higher than those of the proposed method. In the scenarios with publication bias, however, the coverage probabilities of the BM approach rapidly decrease well below the nominal level of 0.95 with increasing numbers of studies included in the meta-analysis and increasing levels of between-trial heterogeneity. Overall, the coverage probabilities of the proposed approach are closest to the nominal level whereas the coverages for the DL approach are well below the nominal level for several scenarios characterized by publication bias and small numbers of studies included in the meta-analysis. In scenarios where different methods resulted in similar coverage probabilities close to the nominal level it is of interest to compare the length of the intervals obtained by these methods. Shorter intervals with the same coverage would of course be preferred. For instance, in the setting without publication bias and n = 3 studies the median length of our proposed confidence interval (HC-BM interval) is 1.12, which is slightly larger than the median length of the BM intervals (1.15) although the coverage is 0.96 just below the coverage of the BM intervals (0.97). Similarly, with n = 6 studies the median lengths of the HC-BM and BM intervals are 0.76 and 0.67, respectively. In scenarios where the coverages of the HC and HC-BM intervals are close, the median lengths of the intervals are similar again. As we have seen above, looking across all scenarios considered the HC-BM intervals achieve the best coverage. However, the price for this performance of coverage is paid by somewhat longer lengths when comparing to methods that achieve the desired coverage in particular situations but not in others.

APPLICATION TO A SYSTEMATIC REVIEW IN PAE-DIATRIC LIVER TRANSPLANTATION
Crins et al [22] report a systematic review and meta-analysis evaluating Interleukin-2 receptor antibodies (IL-2RA) for immunosuppression in children who underwent liver transplantation. The authors identified a total of six controlled studies including two randomized trials. Given the heterogeneity in the designs of the studies, some between-   study heterogeneity in the treatment effects can be expected. Although Crins et al [22] did not identify any publication bias by visual inspection of funnel plots and formal tests for asymmetry of these plots, this provides little reassurance that indeed no publication bias is present, since the number of studies is fairly small, which hinders the identification of publication bias in funnel plots or formal hypothesis tests. Therefore, there is a need for methods for random-effects meta-analyses robust to publication bias in this setting. The endpoint acute rejections was reported in all six studies identified in the systematic review, whereas only three also reported the outcome steroid-resistant rejections. Table  2 summarizes the findings for both outcomes. These data were previously considered by Friede et al [4] who applied several point estimators and confidence intervals of the overall effect including DL and BM to these. For acute rejections, DL and BM yielded log odds ratios (95% confidence interval) of -1.59 (-2.21, -0.96) and -1.61 (-2.35, -0.87), respectively. The between-study heterogeneity was estimated asτ 2 DL = 0.16 andτ 2 BM = 0.38 with the DL and BM methods, respectively. The fixed-effect estimate of the overall effect is -1.56 smaller than the random-effects estimates. The HC interval given by (-2.24, -0.89) is centred around the fixed-effect estimate. The HC-BM interval proposed here is calculated as (-2.31, -0.82) which is considerably wider than the HC interval.
For steroid-resistant rejections, DL and BM resulted in a log odds ratio (95% confidence interval) of -1.21 (-2.28, -0.15) and -1.32 (-2.78, 0.14), respectively. Whereas the DL method results in a statistically significant treatment difference, the effect is not statistically significant with the BM approach although the point estimate hints at a more pronounced treatment effect. This is explained by the larger between-study heterogeneity ofτ 2 BM = 0.87 with the BM method which compares toτ 2 DL = 0.14 with the DL method. These compare to the fixed-effect estimate of -1.17 with 95% confidence intervals of (-2.24, -0.09) and (-2.53, 0.20) for the HC and HC-BM methods, respectively. Again, the fixedeffect estimate is smaller than the effects obtained from random-effects meta-analyses. Furthermore, the HC-BM confidence interval is wider than the HC interval. Here, this wider interval means that the effect is no longer statistically significant on the usual 5% level.

DISCUSSION
Meta-analyses of only a few studies are very common, but pose a number of challenges. These include the estimation of between-trial heterogeneity as well as the assessment of publication bias. Here we proposed a method that faces both challenges successfully. The confidence interval of the overall effect proposed by Henmi and Copas [21] was improved by replacing the DerSimonian-Laird estimator by the Bayes Modal estimator of Chung et al [3] in the computation of the quantiles to construct the confidence interval. The use of a weakly informative prior biases the Bayes Modal estimator away from zero. This resulted in larger quantiles, in particular in situations with few studies and only small to moderate levels of between-trial heterogeneity, which improved the coverage of the confidence intervals.
The normal-normal hierarchical model considered here is a standard model for randomeffects meta-analyses. This model is very general but not without limitations since effect estimates are modelled and not the data directly implying a two-step procedure. For instance, considering binary outcomes and treatment effects summarized by odds ratios Jackson et al [26] discuss six alternative generalised linear mixed models which are more efficient one-step procedures. Modelling the data directly can have particular benefits when dealing with rare events; see for example Günhan et al [27] or Gronsbell et al [28].
The approach taken here to improve the coverage of confidence intervals of the overall effect in pairwise meta-analysis might also be useful in more complex settings such as meta-regression or network meta-analysis. The exploration of such opportunities is out of the scope of this manuscript but subject of future research.