How trace plots help interpret meta-analysis results

The trace plot is seldom used in meta-analysis, yet it is a very informative plot. In this article we define and illustrate what the trace plot is, and discuss why it is important. The Bayesian version of the plot combines the posterior density of τ, the between-study standard deviation, and the shrunken estimates of the study effects as a function of τ. With a small or moderate number of studies, τ is not estimated with much precision, and parameter estimates and shrunken study effect estimates can vary widely depending on the true value of τ. The trace plot allows visualization of the sensitivity to τ along with a plot that shows which values of τ are plausible and which are implausible. A comparable frequentist or empirical Bayes version provides similar results. The concepts are illustrated using examples in meta-analysis and meta-regression; implementation in R is facilitated in a Bayesian or frequentist framework using the bayesmeta and metafor packages, respectively.


INTRODUCTION
Much progress has been made in encouraging people to go beyond a fixed-effect (FE) model (also known as common-effect model) to a random-effects (RE) model in meta-analysis. 1 However, most RE models do not take into account the uncertainty in estimates of the standard deviation of true effects (commonly called tau, and symbolized as τ), and the effect that this uncertainty has on other aspects of the analysis, such as the shrunken estimates of true effect sizes. When the number of studies is large, τ is estimated with enough precision that the effect on shrunken estimates is not worrisome; but that is not the case when the number of studies is small or even moderate (say, a dozen). 2,3 The problem occurs with methods based on heterogeneity point estimates, such as DerSimonian and Laird, and even with empirical Bayes methods, which go part way towards a fully Bayesian solution. 4 On the other hand, many practitioners do not understand the Bayesian approach, in which the mathematical derivations can be taxing. In this article we show that using graphical methods, and conceptual explanations of Bayesian models, everyone can benefit from the Bayesian approach without worrying about derivations of the results.
A range of graphical tools has been established to aid in the interpretation or diagnosis of meta-analysis data and models, such as forest plots, funnel plots, L'Abbé plots, radial plots, or drapery plots. 5,6,7 One particular type of graphical display that (to our knowledge) was originally introduced by Rubin (1981) 8 is the trace plot, which illustrates conditional estimates over a range of plausible heterogeneity values and thereby offers valuable insights into the inner workings of a random-effects model, in particular the role of the heterogeneity parameter τ. This display seems to have been rarely picked up in the meantime; it used to be implemented in DuMouchel's "hblm" S-PLUS software package, 9,10 but besides some theoretical or instructional treatments 4,11,12,9,13 we are aware of only a few (published) applications of this kind of display. 14,15 The trace plot of Rubin (1981) (see supplementary Figure C3) 8 has been reproduced in several sources, including Gaver et al. (1993) 11 and Gelman et al. (2013), although the latter has separate plots for the shrunken estimates and the posterior distribution of τ. 12 Raudenbush and Bryk (1985) 4 included plots similar to Rubin's, and added lines for parameter estimates and a 95 percent confidence interval for those estimates. A variation of trace plots, with the posterior density of τ plotted below the trace lines, appeared in Zucker et al. (1997). 14 DuMouchel (1994) 9 developed software for producing such plots, but it was written in S-PLUS and was not completely compatible with R. Supplementary Figure D4 shows a trace plot of the SAT coaching data produced by the hblm program of DuMouchel (see DuMouchel, 1994 9 ).
The posterior distribution of τ is plotted for 9 values. Because DuMouchel uses unequal spacing when picking quadrature points for τ, the heights of the posterior plot bars represent probability correctly but do not reflect the posterior density. Thus this plot is somewhat misleading compared to a continuous plot. The density would be the probability divided by the bin width; bin widths in this plot are shorter for smaller values of τ.
The predecessor to the trace plot is a plot that has the raw estimates on one side (sometimes left vs. right, sometimes top vs. bottom) and empirical Bayes estimates on the other side, with straight lines connecting them; such a display is a special case of a parallel coordinates plot. 16 These plots show only two values of τ: raw estimates, with τ = ∞, and shrunken estimates, with τ being at its point estimate. An example of such a plot is in Efron and Morris (1975; p. 315). 17
We believe that the trace plot provides a very intuitive access to the inner workings of the random-effects model (also called the normal-normal hierarchical model, NNHM) that is commonly applied for meta-analysis, and may aid practitioners in the interpretation of meta-analytic inferences. In the following, we will first introduce a motivating example in Section 2, followed by a brief introduction of the random-effects model for meta-analysis in Section 3. Use of the trace plot is then extensively demonstrated using several example applications of meta-analyses and meta-regressions in Section 4. A frequentist variation of the trace plot is introduced in Section 5. We conclude with a discussion in Section 6.
Rubin (1981) 8 discussed an application case involving what used to be called the Scholastic Aptitude Test (SAT); these tests are regularly given to support college admission decisions. While the SAT is designed to be resistant to short-term preparation exercises, this particular example dealt with the effectiveness of coaching programs to prepare students for the SAT. The data set includes the results of eight randomized experiments (performed in eight schools), in which the adjusted effects on SAT-V ("SAT-verbal section") scores in response to a coaching scheme were evaluated. The effect size is the mean difference

FIGURE 1
Forest plot for the SAT coaching example data introduced in Section 2. 8 For each of the 8 schools (labelled A-H), an estimate yᵢ of the coaching programme's effectiveness is given along with a standard error σᵢ. A positive effect estimate (i.e., an observed increase in the SAT-V score) suggests a successful programme.
in SAT verbal scores among those coached versus those not coached in each of the eight randomized experiments. Figure 1 shows the data in a forest plot. Effect estimates tend to be on the positive side, suggesting successful coaching programmes; however, uncertainty is large, and none of the eight experiments was able to convincingly demonstrate effectiveness on its own. In addition, it is not obvious whether there are any differences between schools, i.e., whether for example the school that appeared to show the largest treatment effect (school A) did in fact do better than the others. While the observed experimental outcomes might also be consistent with a common effect across schools, it is also conceivable that effects may vary between schools. A hierarchical (or random-effects (RE)) model allowing for potential heterogeneity between estimates was proposed for analysis, as described in the following section. 8

The analysis model
A simple but useful and commonly applied meta-analysis approach is given through the normal-normal hierarchical model (NNHM), which captures the analysis problem based on normal distributions for both the measurement errors and the between-study heterogeneity. The model is sketched in Figure 2; a more technical treatment is provided e.g. by Röver (2020). 18 The NNHM is characterized by an overall mean parameter μ, which denotes the "average" effect across all studies, and the heterogeneity τ ≥ 0, denoting the (dis-)similarity of effects in different studies. The data that are the basis for inference here are the effect estimates yᵢ and their associated standard errors σᵢ (which are treated as known). In mathematical terms, the model may be expressed as
\[ \theta_i \mid \mu, \tau \sim \mathrm{Normal}(\mu, \tau^2) \qquad (1) \]
and
\[ y_i \mid \theta_i, \sigma_i \sim \mathrm{Normal}(\theta_i, \sigma_i^2) \qquad (2) \]
(see also Figure 2), meaning that study-specific means θᵢ may exhibit a certain amount of variation (of scale τ) around their common mean μ, while the data yᵢ measure the effects θᵢ only with a limited accuracy, which is expressed through the associated standard errors σᵢ.

FIGURE 2
Several studies (here: i = 1, 2, 3) have slightly differing (heterogeneous) treatment effects θᵢ; their underlying distribution is shown in blue at the top of the figure, and it is characterized by its center (or "overall mean") μ, and the degree of (dis-)similarity is denoted by the heterogeneity parameter τ. When a study is conducted, we only ever get to know its true effect θᵢ with some amount of uncertainty, which is quantified through the standard error σᵢ. The eventual data, the effect estimates yᵢ, are hence more or less offset from the true values θᵢ, depending on the magnitudes of the σᵢ.
Figure 2 illustrates the model graphically; equation (1) is represented by the blue density at the top that generates (here: three) different true effects θᵢ, one for each study. According to equation (2), each study then yields an effect estimate yᵢ that tends to be some distance from θᵢ, as indicated by the green densities. The data eventually observed are the estimates yᵢ and their standard errors σᵢ; the θᵢ as well as μ and τ are unknowns. The overall mean μ is often the figure of primary interest; generally it may be interpreted as the average effect across studies. In the special case of τ = 0, the mean μ as well as the study-specific θᵢ all collapse to a single common effect.
In the statistical analysis, one then needs to learn the unknown model parameters (overall mean effect μ, heterogeneity τ, and study-specific effects θᵢ) based on the data given in terms of the effect estimates yᵢ and standard errors σᵢ. As one might imagine, the data may sometimes only convey a rather vague idea of the underlying true parameter values, for example in a case of only few estimates, as in the sketch in Figure 2.
Use of a normal distribution for the measurement uncertainty (2) may often be motivated via the central limit theorem (i.e., "large" sample sizes within studies); the normal distribution at the first model stage is mostly a convenient (albeit obvious) choice. The simple NNHM may be extended to a meta-regression model considering study-level covariables in addition. 19 For example, when each study also provides a covariate xᵢ, the common overall mean μ may be replaced by an expression β₀ + β₁xᵢ to account for the (potential) effect of xᵢ on the outcome. With covariates in the model, τ now represents the residual standard deviation; that is, the amount of variation not accounted for by the covariates (and standard errors).
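Written out in the notation of equations (1) and (2), the meta-regression extension for a single covariate might read as follows. This is a sketch under the assumption that only the stage describing the true effects changes, while the measurement stage stays as in equation (2); the symbols β₀, β₁ and xᵢ are those of the preceding paragraph.

```latex
\begin{align*}
\theta_i \mid \beta_0, \beta_1, \tau &\sim \mathrm{Normal}(\beta_0 + \beta_1 x_i,\; \tau^2), \\
y_i \mid \theta_i, \sigma_i &\sim \mathrm{Normal}(\theta_i,\; \sigma_i^2).
\end{align*}
```

Setting β₁ = 0 recovers the simple NNHM with μ = β₀.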

Marginal and conditional posteriors
In a Bayesian context, the information extracted from a data set is formulated in terms of probability distributions, expressing what parameter values or ranges are more or less probable given the data at hand. 12 Depending on the problem, the data, and the assumptions implemented, these so-called posterior distributions may convey more or less precision or ambiguity, and these may often roughly resemble the shape of a normal distribution, or may also look quite different.
In the present context, we will first of all consider the posterior distributions of the heterogeneity parameter τ, and of the study-specific effects θᵢ or their overall mean μ (see Section 3.1). A special variety of a posterior distribution is the so-called conditional posterior. For example, the posterior of μ conditional on τ (denoted as "μ | τ") depends on the value of τ that we insert on the right-hand side, and it reflects the inference on μ if we happened to know that τ was the actual true heterogeneity value. By varying the τ value, we may then get an impression of how our inferences depend on our information about τ. In the particular case of the NNHM, the conditional posteriors are all simply normal distributions, so that they are readily characterized through their associated (conditional) mean and variance parameters.
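As a concrete instance, the conditional posterior of the overall mean can be written in closed form. The expressions below are a sketch that assumes an (improper) uniform prior on μ; the weights w_i(τ) and the symbol k for the number of studies are notation introduced here for illustration.

```latex
\begin{align*}
\mu \mid \tau, y &\sim \mathrm{Normal}\bigl(\hat\mu(\tau),\, V_\mu(\tau)\bigr), \\
\hat\mu(\tau) &= \frac{\sum_{i=1}^{k} w_i(\tau)\, y_i}{\sum_{i=1}^{k} w_i(\tau)}, \qquad
V_\mu(\tau) = \frac{1}{\sum_{i=1}^{k} w_i(\tau)}, \qquad
w_i(\tau) = \frac{1}{\sigma_i^2 + \tau^2}.
\end{align*}
```

For τ = 0 the conditional mean reduces to the familiar inverse-variance weighted average of the common-effect model, and for very large τ the weights become nearly equal.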
The conditional posterior distributions have analogous (and to some extent analytically identical) counterparts in the context of frequentist/likelihood inference; a given τ value defines the conditional likelihood, and with that, conditional maximum likelihood (ML) estimates and associated standard errors. 4,20

FIGURE 3
Trace plot for the SAT coaching example data (see Sections 2 and 4.1). 8

The trace plot
While the heterogeneity τ is commonly considered a nuisance parameter, the trace plot aims to illustrate how our conclusions depend on this value, while also indicating the relevant plausible range of values. The x-axis in a trace plot corresponds to different τ values (with τ ≥ 0). The y-axis shows first of all the inferred values (conditional posterior means) of the study-specific effects θᵢ, as well as model parameters such as the overall mean μ (or regression parameters β, or linear combinations of these). The trace plot is sometimes also called a conditional means or conditional shrinkage plot.
A trace plot for the example data from Section 2 is shown in Figure 3 and is introduced and discussed in detail in the following section.

SAT coaching meta-analysis
Figure 3 shows a trace plot for the SAT coaching example data (Rubin, 1981) 8 introduced in Section 2. Meta-analysis was performed using uniform priors for μ and τ. These "flat" priors are essentially uninformative and conservative specifications for both parameters. 18,21 At the bottom of the trace plot, the posterior distribution of the heterogeneity τ is illustrated; its most likely value is at τ = 0, but a range of positive values also remains plausible. The posterior median is at 5.2, which is indicated by a vertical grey line, and the 95% credible interval (CI), reflecting the most likely region for the true parameter value, ranges from 0 to 17.3, and is shown by the grey shaded area.
The top section of the plot shows the conditional means (θᵢ) for the eight schools, as well as for the overall mean μ. On the far left, the case of τ = 0 corresponds to the "homogeneous" (or "fixed-effect") case where all schools' effects collapse into a common effect (of about 7.9). As soon as one considers positive τ values, the estimated (conditional) effects θᵢ tend to be a compromise between the estimated overall mean μ and the observed (empirical, apparent) effect yᵢ. Technically, the conditional posterior means (E[θᵢ | τ]) shown in the trace plot result as a weighted mean of the (conditional posterior) overall mean estimate (E[μ | τ]) and the sample estimates (yᵢ). 18,22 The attraction towards the overall mean vanishes with increasing τ.
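The weighted-mean construction can be sketched in a few lines of Python. This is an illustrative sketch, not the bayesmeta implementation: it assumes an improper uniform prior on μ, the function names are hypothetical, and the inputs are rounded values resembling the eight-schools data (an assumption, not the article's exact figures).

```python
# Illustrative inputs: rounded values resembling the eight-schools data.
# They are an assumption for this sketch, not the article's exact data.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def conditional_mu(tau, y, sigma):
    """Mean and variance of mu given tau (uniform prior on mu assumed)."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mu_hat, 1.0 / sum(w)


def conditional_theta(tau, y, sigma):
    """Conditional posterior means E[theta_i | tau] for all studies."""
    mu_hat, _ = conditional_mu(tau, y, sigma)
    means = []
    for yi, si in zip(y, sigma):
        b = si ** 2 / (si ** 2 + tau ** 2)  # shrinkage weight on mu_hat
        means.append(b * mu_hat + (1.0 - b) * yi)
    return means


# One "trace" per study: conditional means over a grid of tau values.
tau_grid = [0.5 * j for j in range(0, 61)]  # 0, 0.5, ..., 30
traces = [conditional_theta(t, y, sigma) for t in tau_grid]
```

Plotting each study's entries of `traces` against `tau_grid` would give the upper panel of a trace plot: at τ = 0 all traces coincide with the pooled mean, and for large τ they approach the observed estimates.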
This so-called shrinkage of individual effects towards each other (or towards their common mean) is an example of the classical "regression towards the mean" effect. 23,24,9 This may be illustrated by considering one of the more extreme examples; consider the case of school A, which appeared to demonstrate the greatest coaching effect (of 28.4 points; see Figure 1). When this datum is not considered in isolation, but in the context of the remaining observations, it appears to be a "lucky" outlier to some extent. Assuming τ = 0, the homogeneous case at the left of the trace plot, one must conclude that the observation of 28.4, somewhat above the probable mean of 7.9, was due to measurement error alone (which is not implausible given the standard error and the fact that this happened to be the maximum out of 8 measurements). 8 Once some positive heterogeneity is allowed for, the reasoning changes only gradually; the fact that school A measured the greatest effect is quite naturally attributed to some degree to an outlying ("lucky") measurement, while at the same time, once differing effects between schools are permitted, school A is likely to have the largest effect among the eight schools.
Supplementary Figure A1 illustrates the "shrinkage" estimate for school A in a bit more detail; in addition to the conditional mean, the conditional credible intervals (CIs) for school A and for the overall mean are shown. While the shrinkage estimate for school A is above the overall mean, the CIs are largely overlapping, which also makes sense when considering that the between-study heterogeneity is estimated to most likely be smaller than the individual estimates' standard errors (see Figure 1). Consideration of such shrinkage estimates has important applications e.g. in the context of clinical trials, where a meta-analysis of "historical" data may contribute to the prior information considered in a new trial. 25,22 The shrinkage of individual studies' effect estimates goes along with a certain precision gain; for finite τ values, the conditional standard errors are always smaller than the original σᵢ values, 26 and for τ = 0 all estimates are completely "pooled".
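The precision gain can be made explicit with a short derivation. It is a sketch assuming a uniform prior on μ and at least two studies; B_i(τ) is a shrinkage weight introduced here for illustration, and V_μ(τ) denotes the conditional posterior variance of μ given τ, that is, the reciprocal of the sum of the weights 1/(σ_j² + τ²) over all studies.

```latex
\begin{align*}
B_i(\tau) &= \frac{\sigma_i^2}{\sigma_i^2 + \tau^2}, \\
\mathrm{Var}(\theta_i \mid \tau, y) &= B_i(\tau)\,\tau^2 + B_i(\tau)^2\, V_\mu(\tau) \\
  &\le B_i(\tau)\,\tau^2 + B_i(\tau)^2\,(\sigma_i^2 + \tau^2)
   = B_i(\tau)\,\tau^2 + B_i(\tau)\,\sigma_i^2 = \sigma_i^2 .
\end{align*}
```

The inequality uses V_μ(τ) ≤ σ_i² + τ², which holds because the sum of weights is at least the single weight of study i; it is strict as soon as a second study contributes.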
While the trace plot shows the conditional means and allows for some insights into the role of the heterogeneity, eventual inference will focus on marginal estimates (e.g., means, medians or modes of marginal distributions), i.e., consideration of (conditional) estimates integrated (marginalized) over the heterogeneity posterior distribution shown at the bottom of the trace plot. With most probability concentrated at low heterogeneity values, we may expect substantial shrinkage, and a corresponding gain in precision.
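The marginalization step can be illustrated numerically by averaging a conditional mean over a grid approximation of the heterogeneity posterior. The following is a rough sketch under stated assumptions, not the bayesmeta algorithm: uniform priors on μ and τ, a grid truncated at τ = 100, hypothetical function names, and rounded inputs resembling the eight-schools data rather than the article's exact figures.

```python
import math

# Illustrative inputs (rounded values resembling the eight-schools data);
# an assumption for this sketch, not the article's exact data.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def mu_given_tau(tau):
    """Conditional posterior mean and variance of mu (uniform prior on mu)."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w), 1.0 / sum(w)


def theta_given_tau(i, tau):
    """Conditional posterior mean of the i-th study effect."""
    mu_hat, _ = mu_given_tau(tau)
    b = sigma[i] ** 2 / (sigma[i] ** 2 + tau ** 2)
    return b * mu_hat + (1.0 - b) * y[i]


def log_post_tau(tau):
    """Unnormalised log posterior density of tau under uniform priors."""
    mu_hat, v_mu = mu_given_tau(tau)
    lp = 0.5 * math.log(v_mu)
    for yi, si in zip(y, sigma):
        v = si ** 2 + tau ** 2
        lp -= 0.5 * math.log(v) + (yi - mu_hat) ** 2 / (2.0 * v)
    return lp


# Grid approximation; truncating at tau = 100 is an assumption of this sketch.
tau_grid = [0.05 * j for j in range(2001)]
dens = [math.exp(log_post_tau(t)) for t in tau_grid]
total = sum(dens)
prob = [d / total for d in dens]  # approximate posterior probabilities

# Marginal shrinkage estimate for the first study: average the trace over tau.
marginal_first = sum(p * theta_given_tau(0, t) for p, t in zip(prob, tau_grid))
```

Because the grid probabilities weight each point of the trace by its plausibility, the marginal estimate lands between the fully pooled value at τ = 0 and the unshrunken observation.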
Supplementary Figure B2 shows a forest plot indicating marginal shrinkage estimates as well as the overall mean along with the data. Marginal shrinkage estimates are substantially more precise than the original data estimates, and in all cases we see a sizeable shift towards the common mean estimate. For example, the marginal estimate (of θ₁) for school A is at 10.5 (95% CI [−3.3, 29.9]), and the associated posterior standard deviation amounts to only 56% of the original σ₁. The posterior standard deviation for the overall mean μ is at 5.2, and with that substantially smaller than any of the σᵢ provided with the data.
As a historical digression, we include the original trace plot from Rubin (1981) 8 in supplementary Figure C3. Essentially the same plot also appears in Gaver et al. (1992, Sec. 3.3), 11 and the example (including slightly different plots) is also discussed in a general hierarchical modeling context in Gelman et al. (1995, Sec. 5.5). 12 Using DuMouchel's original S-PLUS code 9,10 and porting it to R, we generated a trace plot for this data set, shown in supplementary Figure D4.

Aspirin meta-analysis
The next example shows the utility of the trace plot for checking model assumptions and locating violations of those assumptions. The data are from the widely-used meta-analysis of studies on the effect of aspirin on prevention of a second myocardial infarction (heart attack). 27 Figure 4 shows a trace plot based on a simple random-effects meta-analysis of the log-odds ratios (log-ORs) for myocardial infarction when comparing aspirin to placebo. Note that (as in the previous examples) the trace lines level off towards large values of τ; the limiting values are in fact the yᵢ values, and their arithmetic mean. In Figure 4 the limiting values are included at the right-hand side (dotted lines) at the "τ = ∞" x-axis tick mark.
At the left-hand side, for values of τ near zero, estimates of true effect sizes are at or near the overall average, but the posterior distribution of τ shows that zero is an unlikely value of τ.

FIGURE 4
Trace plot for the Aspirin example data. 27
For values of τ that are more likely to be true, the trace lines of shrunken estimates diverge into two parts: the first part is a group of studies that have similar (negative) values of the effect size, and the second part is a single study (the AMIS trial) that diverges from the main group. The single outlier not only diverges from the main group, it has an effect that is positive (suggesting that aspirin was harmful), while the others are strongly negative (see also the forest plot in supplementary Figure E5). The heterogeneity's posterior median is at τ = 0.20, and the posterior distribution largely covers ranges in which rather little shrinkage is taking place. The data were subsequently analyzed to investigate potential sources of heterogeneity, but with limited success; adjusted mortality estimates seemed more homogeneous, and differences were found in short-term vs. long-term follow-up outcomes. 28 A substantial fraction of the empirical heterogeneity may be attributed to the outlying AMIS study; the posterior median for τ is reduced to only 0.094 (from 0.20) when the AMIS study is omitted. This sensitivity analysis shows that it is the AMIS study driving the heterogeneity estimate towards larger values where less shrinkage is implied.

Meta-regression using binary covariables
DuMouchel (1994) 9 meta-analyzed nine studies that examined the relationship between nitrogen dioxide (NO₂) exposure and the development of respiratory illness in children. The results had originally been compiled by Hasselblad et al. (1992); 29 supplementary Figure E6 presents the original data as a forest plot, and Figure 5 shows the trace plot for a "simple" meta-analysis of the data. DuMouchel utilized a weakly informative prior distribution for the heterogeneity here; 9,18 the prior density (dashed line) is shown along with the posterior density (solid line) at the bottom of the plot. In this case, one can clearly see the differing shrinkage that comes with different standard errors (see also the forest plot in supplementary Figure E6): the study with the largest σᵢ ("Keller79") has substantial shrinkage even for very large τ values, whereas the estimates with greater precision ("Ware84", "Melia77", "Melia79") only tend to shrink towards the common overall mean for small τ values. Note the differing appearance when comparing the "Keller79" study with the case of the AMIS trial in the previous example: an outlying estimate that also has a large standard error associated has a very different effect on the analysis.

FIGURE 5
Trace plot for the NO₂ example data. 9
The studies' one-to-one comparability, however, seemed to be questionable, and a number of distinguishing features were noted. For example, some studies reported estimates that had been adjusted for gender, while others hadn't. One may imagine that if the chances for respiratory illness differ between girls and boys, then an analysis of the NO₂ effect may reduce bias, gain precision, and avoid confounding effects if gender is adjusted for. Such study-level covariables may be considered in a meta-regression analysis; when each study's adjustment status (xᵢ = 0 if the i-th study did adjust for gender, xᵢ = 1 if it failed to adjust) is provided, one may specify a model fitting two parameters (β₀ + β₁xᵢ) rather than a single "intercept" parameter (μ) only. 30,19
Figure 6 shows the trace plot for a meta-regression considering gender adjustment as a covariable. Rather than showing the (conditional) estimates of μ as in a "simple" meta-analysis, one may now include estimates of the regression parameters (β), or of linear combinations of these in the trace plot. With the binary coding for gender adjustment as suggested above, the estimates of β₀ and β₀ + β₁ then correspond to the mean treatment effects of adjusted and unadjusted studies, respectively. β₁ here corresponds to the (possible) bias due to failure to adjust for gender. In the trace plot, it becomes evident that the estimated heterogeneity is substantially reduced (compared to the previous analysis shown in Figure 5); it is essentially the difference of about 0.2 between the two group means that is explained by the covariable and that in turn reduces the heterogeneity from a posterior median of 0.065 to 0.025. The individual studies' shrinkage estimates now behave quite differently; instead of converging to a common effect for zero heterogeneity at the left of the plot, each study now shrinks towards one out of the two subgroup means. Based on the two groups' mean effects, the use of gender adjustment within a study seems to result in a larger estimated effect of NO₂ exposure.

FIGURE 6
Trace plot for the NO₂ example data; meta-regression with a single binary covariable (adjustment for gender, y/n). 9
The importance of this meta-analysis was that it actually investigated three sources of methodological diversity 31 by coding various ways that the studies could be subject to confounding, and therefore could estimate the effect size for a study that had none of those sources, even if there were no such study. There was one such study, but the others added information about the target effect, as well as information about the bias introduced by each potential source of bias. To do this, DuMouchel fit a model with three binary dummy variables, namely gender (whether the quoted estimate had been adjusted for gender, y/n), smoke (whether the estimate was adjusted for parents' smoking status, y/n), and no2 (whether NO₂ levels were measured directly (y/n), or presence of a gas stove in the household was used as a proxy); see also the forest plot in supplementary Figure E6 for the original data. These variables were all coded 1 if the study had failed to control for a condition, and 0 if it did. Thus the intercept estimated the effect size of a study that was zero on each dummy variable, and thus had none of the three possible sources of bias. This is similar to a suggestion of Rubin (1992) 32 that meta-analysis should be a response surface analysis, such that the main quantity estimated is the effect in an "ideal" study, rather than the average study.

FIGURE 7
Trace plot for the NO₂ example data; meta-regression with three binary covariables (adjustment for gender, smoking, and mode of NO₂ measurement). 9
Figure 7 shows the trace plot for the meta-regression analysis including all three covariables. Looking at the trace plot, one can see that the posterior distribution of τ is highest at zero; the maximum-likelihood (ML) estimate would likewise be zero. Further, in an ML approach the regression parameters would be estimated conditioning on the estimate of zero for τ. But the distribution of τ is spread over a rather large range, meaning that moderate non-zero values are also plausible. At those larger plausible values, the shrunken estimates would diverge somewhat from the fully shrunken values at τ = 0, but not to an extreme.
Conditional estimates for five linear combinations of the regression coefficients (β) are also illustrated in the plot, namely, corresponding to the cases where all three covariables (x₁, x₂, x₃) equaled 0 (studies with adjustment for all 3 covariables), studies without adjustment, and the three cases where one of the covariables is adjusted for. One can again see that each study shrinks towards an individual mean value, depending on its associated combination of covariates; those studies sharing the same combination then shrink towards a common mean. Consideration of covariates then allows us to investigate potential differences in effects for different study designs; it appears that a study meeting all three criteria (like the "Neas91" study) would yield the largest effect estimate (of about 0.3). Also, a study without any adjustment has an estimate near zero, so an unadjusted study misleadingly suggests no deleterious effect.
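The way studies shrink towards covariate-specific means can be sketched for the simplest case of one binary covariable. The code below is illustrative only: the data are hypothetical (not the NO₂ data), the function names are made up for this example, and uniform priors on the regression coefficients are assumed, so that the conditional estimates for a given τ coincide with weighted least squares.

```python
# Hypothetical data: six studies, x = 0 (adjusted) or 1 (unadjusted).
y = [0.30, 0.25, 0.35, 0.05, 0.10, 0.00]
sigma = [0.10, 0.15, 0.20, 0.10, 0.12, 0.25]
x = [0, 0, 0, 1, 1, 1]


def conditional_beta(tau):
    """Weighted least-squares estimates (beta0, beta1) for a given tau."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s_w * s_wxx - s_wx ** 2
    beta1 = (s_w * s_wxy - s_wx * s_wy) / det
    beta0 = (s_wy - beta1 * s_wx) / s_w
    return beta0, beta1


def conditional_theta(tau):
    """Each study shrinks towards its own predicted value beta0 + beta1 * x_i."""
    beta0, beta1 = conditional_beta(tau)
    out = []
    for yi, si, xi in zip(y, sigma, x):
        b = si ** 2 / (si ** 2 + tau ** 2)  # shrinkage weight on the prediction
        out.append(b * (beta0 + beta1 * xi) + (1.0 - b) * yi)
    return out
```

With a binary covariable, `beta0` and `beta0 + beta1` equal the weighted means of the two subgroups, so at τ = 0 the traces collapse onto two values rather than one.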
Using only nine studies to fit four regression parameters (some of which are not clearly different from zero) implies a lot of uncertainty in the model fit, which is also reflected in the heterogeneity's posterior. With a greater number of degrees of freedom in the model, the likelihood is not able to constrain the heterogeneity as much, leading to a posterior that is closer to its prior, and with that, a larger heterogeneity estimate. In general, one of course needs to be cautious in balancing the number of parameters estimated against the number of studies included in order to avoid overfitting; on the other hand consideration of known effect moderators may also be considered essential, and the use of informative effect priors might help here.
Karner et al. (2014) 33 performed a systematic review and meta-analysis to investigate the effects of tiotropium, a medication used in the management of chronic obstructive pulmonary disease (COPD). A total of 22 randomized placebo-controlled trials were found; among the primary endpoints was the odds ratio (OR) of exacerbation; the raw study data (with effects in terms of log-ORs) are shown in a forest plot in supplementary Figure E7. A meta-analysis of the data, using a weakly informative half-normal prior with scale 0.5 for the heterogeneity, 21 yields an estimated log-OR of −0.25, indicating a beneficial effect of the treatment (i.e., a reduction of exacerbations). A trace plot for the analysis is shown in Figure 8. The half-Normal(0.5) heterogeneity prior (implying a prior median and 95% quantile of about 1/3 and 1, respectively, for τ) is a conservative specification in the context of endpoints on a logarithmic scale. 21 The heterogeneity's posterior distribution is much narrower than the prior; it shows that τ is not likely to be zero, but also is not likely to be larger than about 0.3, while the prior appears effectively uniform across the relevant range. Over the range of plausible values of τ, the shrunken effect size estimates vary, but most are in a range indicating that the drug is effective. Two outliers are apparent (the studies by Verkindre (2006) and Sun (2007)), which were the two most extreme effect estimates, and at the same time were also based

FIGURE 8
Trace plot for the COPD example data. 33 Meta-analysis without covariables.
on the smallest sample sizes. With their large standard errors, these two are still consistent with the remaining data (i.e., the intervals shown in the forest plot (supplementary Figure E7) are still mutually overlapping), and their estimates are shrunk considerably towards the remaining studies. With some heterogeneity evident in the data, and a number of study characteristics recorded, it is interesting to investigate potential sources of heterogeneity. For example, studies were of differing duration (a case of methodological diversity 31 ), and it is quite conceivable that the treatment effect (OR) may differ for shorter or longer follow-up periods. 34 A meta-regression including a binary covariable distinguishing "short" and "long" studies (defined as follow-up times up to 1 year or more than 1 year) is shown in Figure 9. Unlike in the previous example (Section 4.3), from the data it is not quite clear whether there actually exists a difference between the two groups, and the heterogeneity's posterior distribution is virtually unaffected by the inclusion of the covariable.
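As a brief aside on the half-normal heterogeneity prior with scale 0.5 used for the COPD analyses, the quoted prior median and 95% quantile (about 1/3 and 1) can be checked with a few lines of Python. The helper name `half_normal_quantile` is hypothetical; the sketch relies only on the fact that the absolute value of a zero-mean normal variable follows a half-normal distribution.

```python
from statistics import NormalDist


def half_normal_quantile(p, scale):
    """Quantile of a half-normal distribution with the given scale."""
    # If Z ~ Normal(0, scale^2), then P(|Z| <= q) = 2 * Phi(q / scale) - 1,
    # so the p-quantile of |Z| is scale * Phi^{-1}((1 + p) / 2).
    return scale * NormalDist().inv_cdf((1.0 + p) / 2.0)


prior_median = half_normal_quantile(0.50, 0.5)  # roughly 0.34
prior_q95 = half_normal_quantile(0.95, 0.5)     # roughly 0.98
```

On the log-OR scale, these values mean the prior places half of its mass below a heterogeneity of about one third and only 5% above roughly 1.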
Another relevant determinant of the treatment effect may be the study participants' disease severity. A common measure of disease severity is the forced expiratory volume in 1 second (FEV₁), which is determined through spirometry, and which quantifies a patient's breathing capacity. This amount is reduced with increasing COPD severity. 35 For the present data set, the population averages (at inclusion) are available for all 22 studies. As this covariable relates to differences between study populations, this would be a case of clinical diversity. 31 Figure 10 shows a trace plot for the meta-regression considering the FEV₁ value as a covariable (via an intercept / slope parametrization).

FIGURE 9
Trace plot for the COPD example data. 33 Meta-regression using a single binary covariable ("short" vs. "long" follow-up).

FIGURE 10
Trace plot for the COPD example data. 33 Meta-regression using a single continuous covariate (baseline FEV₁).
On the left-hand side the effects are fully shrunken to the predicted effect size for the covariate value of that study. There is a great deal of spread in these values, indicating that the covariate is important. Many of the lines are flatter than in the other two plots, indicating less shrinkage being necessary than for the other models. Also, most of the lines are not as steep over the range of plausible τ values, indicating less sensitivity to the value of τ. Further, the posterior distribution of τ has moved to smaller values, because more variability among studies is accounted for. The posterior median is at about 0.12, while the range of predicted values covers multiples of that, illustrating how a substantial fraction of heterogeneity is explained by the FEV₁ value.
The traces for conditional effects at three selected FEV 1 values are also shown in the plot (for FEV 1 between 1.0 and 2.0, roughly covering the range of population means encountered in the data); one can see that larger FEV 1 values correspond to larger effects, suggesting that treatment benefit is greater in less severe cases.
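The way each trace in a meta-regression trace plot starts at the regression prediction and moves towards the study's own estimate can be sketched in a few lines. The following is a minimal illustration only, assuming uniform priors on the regression coefficients and an intercept / slope model; the data values and the function names (`wls_line`, `conditional_shrinkage`) are hypothetical and are not taken from the COPD data set or from any package.

```python
# Hypothetical data (NOT the COPD data): effect estimates, standard errors,
# and a continuous study-level covariate.
y = [-0.40, -0.10, -0.35, -0.05, -0.25]
s = [0.20, 0.15, 0.30, 0.10, 0.25]
x = [1.0, 1.3, 1.5, 1.8, 2.0]


def wls_line(y, s, x, tau):
    """Weighted least squares fit of y = b0 + b1 * x,
    with weights 1 / (s_i^2 + tau^2) for a given heterogeneity tau."""
    w = [1.0 / (si**2 + tau**2) for si in s]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx**2
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def conditional_shrinkage(y, s, x, tau):
    """Conditional shrinkage estimates for a given tau: a weighted average
    of the regression prediction and the study's own estimate."""
    b0, b1 = wls_line(y, s, x, tau)
    estimates = []
    for yi, si, xi in zip(y, s, x):
        prediction = b0 + b1 * xi
        weight = si**2 / (si**2 + tau**2)  # weight on the prediction
        estimates.append(weight * prediction + (1.0 - weight) * yi)
    return estimates


# One "vertical slice" of the trace plot per tau value.
for tau in (0.0, 0.1, 0.5):
    print(tau, [round(e, 3) for e in conditional_shrinkage(y, s, x, tau)])
```

At tau = 0 every estimate lies exactly on the fitted line, and for large tau the estimates approach the original study estimates, which corresponds to the left-to-right behaviour of the traces described above.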

FREQUENTIST TRACE PLOTS
Trace plots may also be motivated based on frequentist reasoning. The conditional distributions of effects (study-specific effects, overall effects or linear combinations of regression coefficients) have their analogues in so-called best linear unbiased prediction (BLUP). When considering uniform priors for effects (μ or β), the conditional posterior expectations and standard deviations correspond to frequentist conditional point estimates and standard errors. 4,20,36 While in a frequentist framework it is not possible to assign (prior or posterior) probabilities to heterogeneity (τ) values, one may still mark confidence interval bounds or consider different values or ranges in the spirit of a sensitivity analysis. While there are similarities and analogies between the Bayesian and frequentist approaches, care needs to be taken regarding the differing interpretation, e.g., of credible and confidence intervals (in the Bayesian and frequentist contexts, respectively), but these issues are beyond the scope of the present investigation. 37,38
Figure 11 shows a trace plot for the COPD example data (see also Figure 8) based on functions from the metafor package. 39 The plot's bottom panel shows the Q-test statistic as a function of the heterogeneity τ considered as the (point) null hypothesis. The grey area indicates the central 95% region based on a χ² distribution with (k − 1) degrees of freedom (with k denoting the number of studies); the τ values for which the Q-statistic stays within these bounds then constitute a confidence interval for τ, shown in dark grey. This bottom panel hence essentially illustrates the construction of a (Q-profile) confidence interval for τ. 40 The dashed vertical line shows the heterogeneity point estimate (here: the restricted ML estimate τ̂_REML = 0.14). Analogous plots can be generated based on other test statistics, e.g., the likelihood ratio (or deviance).
The plot highlights the differences between common frequentist and Bayesian treatments of the inference problem. In a Bayesian approach, effect estimates (for the overall mean μ or for shrinkage estimates θ_i) result from averaging (marginalising) over the heterogeneity's posterior distribution (shown e.g. at the bottom of Figure 8). A common frequentist approach on the other hand is to treat the heterogeneity estimate τ̂ as if it were known to be the true value (i.e., to condition on τ̂) and derive effect estimates by considering the corresponding vertical "slice" of the plot's top half. This is a reasonable approximation when τ is estimated with good accuracy, but otherwise it leads to overconfidence in the resulting effect estimates. Another approach to propagating heterogeneity uncertainty through to the effect estimates is to use adjusted standard errors and Student-t quantiles as in the Hartung-Knapp-Sidik-Jonkman (HKSJ) method. 41
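The construction of a Q-profile interval can be illustrated with a short, dependency-free sketch. It is not the metafor implementation. As assumptions, it uses the SAT coaching (eight schools) estimates and standard errors as commonly reproduced in the literature rather than the COPD data, hard-codes the χ² quantiles for 7 degrees of freedom, and uses hypothetical function names (`generalized_q`, `solve_tau`).

```python
# SAT coaching (eight schools) data as commonly reproduced (assumption).
y = [28, 8, -3, 7, -1, 1, 18, 12]
s = [15, 10, 16, 11, 9, 11, 10, 18]

# 2.5% and 97.5% quantiles of a chi-squared distribution with k - 1 = 7
# degrees of freedom, hard-coded to keep the sketch dependency-free.
CHI2_LO, CHI2_HI = 1.689869, 16.012764


def generalized_q(tau):
    """Generalized Q statistic when tau is taken as the null value."""
    w = [1.0 / (si**2 + tau**2) for si in s]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))


def solve_tau(target, upper=200.0, tol=1e-8):
    """Bisection for Q(tau) = target, using that Q decreases in tau.
    Returns 0 if Q(0) is already at or below the target; assumes that
    Q(upper) lies below the target."""
    if generalized_q(0.0) <= target:
        return 0.0
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if generalized_q(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


ci_lower = solve_tau(CHI2_HI)  # where Q falls to the upper chi-squared bound
ci_upper = solve_tau(CHI2_LO)  # where Q falls to the lower chi-squared bound
print("Q(0) =", round(generalized_q(0.0), 3))
print("95% Q-profile interval for tau:", ci_lower, "to", round(ci_upper, 2))
```

For these data Q(0) already lies below the upper χ² bound, so the interval starts at zero; plotting `generalized_q` over a grid of τ values together with the two bounds would reproduce the kind of bottom panel described above.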
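The contrast between conditioning on a single heterogeneity value and averaging over its posterior can be made concrete with a small grid approximation. This is a sketch under stated assumptions, not the bayesmeta algorithm: it uses uniform priors on both μ and τ, truncates the τ grid at an arbitrary upper limit (`TAU_MAX`), uses the SAT coaching data as commonly reproduced, and conditions on τ = 0 purely for illustration.

```python
import math

# SAT coaching (eight schools) data as commonly reproduced (assumption).
y = [28, 8, -3, 7, -1, 1, 18, 12]
s = [15, 10, 16, 11, 9, 11, 10, 18]


def conditional_mu(tau):
    """Conditional mean and variance of mu given tau (uniform prior on mu)."""
    w = [1.0 / (si**2 + tau**2) for si in s]
    var = 1.0 / sum(w)
    mean = var * sum(wi * yi for wi, yi in zip(w, y))
    return mean, var


def log_post_tau(tau):
    """Unnormalised log marginal posterior density of tau,
    assuming uniform priors on mu and tau."""
    mean, var = conditional_mu(tau)
    lp = 0.5 * math.log(var)
    for yi, si in zip(y, s):
        t = si**2 + tau**2
        lp += -0.5 * math.log(t) - 0.5 * (yi - mean) ** 2 / t
    return lp


# Midpoint grid over tau; the truncation point is an assumption.
TAU_MAX, N = 100.0, 4000
grid = [TAU_MAX * (j + 0.5) / N for j in range(N)]
log_dens = [log_post_tau(t) for t in grid]
peak = max(log_dens)
dens = [math.exp(l - peak) for l in log_dens]
total = sum(dens)
wts = [d / total for d in dens]  # posterior probability of each grid cell

# Marginal moments of mu: mixture of the conditional normal distributions.
cond = [conditional_mu(t) for t in grid]
marg_mean = sum(p * m for p, (m, v) in zip(wts, cond))
marg_var = sum(p * (v + m * m) for p, (m, v) in zip(wts, cond)) - marg_mean**2
marg_sd = math.sqrt(marg_var)

# "Vertical slice": condition on one fixed tau (here 0, for illustration).
plug_mean, plug_var = conditional_mu(0.0)
plug_sd = math.sqrt(plug_var)

print("conditional on tau = 0:", round(plug_mean, 2), round(plug_sd, 2))
print("marginal over tau     :", round(marg_mean, 2), round(marg_sd, 2))
```

Because the conditional variance of μ grows with τ, the marginal standard deviation can never be smaller than the one obtained by conditioning on τ = 0, which is the overconfidence issue in miniature.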

DISCUSSION
The trace plot is a little-used plot that conveys a great deal of useful information. Usual methods of analysis, both fixed-(common-) and random-effects as well as empirical Bayes, commonly ignore uncertainty in the estimation of τ. Bayesian methods take it into account, but average over values of τ, thus hiding the extra cause of variability. The trace plot shows both the uncertainty in our knowledge of τ and the effect of that uncertainty on our knowledge of study effects and parameter estimates. In addition, the plot allows us to see more clearly the presence of outliers or hidden subgroups of studies. It is most useful for meta-analyses with small to moderate numbers of studies, because as the number of studies grows, our knowledge of τ becomes stronger, and there is little variation in parameter or shrinkage estimates within reasonable ranges of estimates of τ. Furthermore, with many studies the trace lines can appear too tangled, making interpretation more difficult.
If the trace lines are relatively flat over the region where the posterior density of τ is appreciable, the estimates are insensitive to τ. If the lines are not flat, the estimates are sensitive to τ. Thus the plot gives valuable information about the sensitivity to the estimate of τ. This is especially important for likelihood or empirical Bayes meta-analysis, which rely on the estimate of τ being precise for inference about the mean (or regression parameters).
While trace plots provide insights regarding the interplay of heterogeneity and effect and shrinkage estimates, they do not help much in judging the appropriateness of a prior. Choice of sensible priors depends on the scale of the endpoint and the context of the analysis, and varying prior shape or scale is part of sensitivity analyses in a Bayesian framework. 18,21 Varying priors will only affect the trace plot's bottom panel, while the top remains unaffected (besides possible changes in the range of τ values considered), as the effect estimates are conditional on the heterogeneity.
We can produce analogous plots from a frequentist perspective based on the best linear unbiased predictions (BLUP). Figure 11 was produced using results from the metafor package. 20 The bottom panel of the trace plot illustrates the inference on the heterogeneity, showing the point estimate as well as the Q-profile confidence interval along with the underlying Q-statistic. The trace plot is easy to produce in bayesmeta using the "traceplot()" function that is applied to the object returned by the analysis function (bayesmeta() or bmr()). 18,19,42 For those who use metafor, 20 we provide code for that package in the Appendix. It also used to be available in the hblm package, 10 but that has never been officially released for R, and the S-PLUS version is no longer available on the web, nor would it work in R without modification.
The general ideas underlying the trace plot should be generalizable beyond the normal-normal hierarchical model (NNHM) discussed here. An obvious example would be network meta-analyses (NMAs); as long as these may be expressed as special cases of a simple meta-regression (e.g., when only pairwise comparisons are included), these would be tractable using the tools shown in Section 4 already. 19 In principle, as long as there is only a single heterogeneity parameter involved, the same approaches should generally still work for NMAs, and these might be straightforward to implement, e.g., based on existing functions in the netmeta R package. 43 However, while computations are straightforward (and mostly analytical) in "simple" normal models, technical calculations would be more demanding, e.g., in the case of a binomial-normal model; and these might in fact be easier in a frequentist framework compared to a Bayesian one.
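A rough numerical counterpart to judging flatness by eye is to compute how much each conditional shrinkage estimate changes across a chosen range of τ values. The sketch below is one possible way to do this, not a feature of bayesmeta or metafor; the function names, the τ range, and the use of the SAT coaching data as commonly reproduced are assumptions made for illustration.

```python
# SAT coaching (eight schools) data as commonly reproduced (assumption).
y = [28, 8, -3, 7, -1, 1, 18, 12]
s = [15, 10, 16, 11, 9, 11, 10, 18]


def shrinkage_trace(tau):
    """Conditional shrinkage estimates of all study effects for a given tau
    (uniform prior on the overall mean)."""
    w = [1.0 / (si**2 + tau**2) for si in s]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return [(si**2 * mu + tau**2 * yi) / (si**2 + tau**2)
            for yi, si in zip(y, s)]


def sensitivity(tau_lo, tau_hi, steps=50):
    """Per-study spread (max minus min) of the conditional shrinkage
    estimates over an evenly spaced grid from tau_lo to tau_hi."""
    rows = [shrinkage_trace(tau_lo + (tau_hi - tau_lo) * j / steps)
            for j in range(steps + 1)]
    return [max(col) - min(col) for col in zip(*rows)]


# Example: spread of each trace over tau in [0, 10]; in practice the range
# would be chosen to cover the plausible tau values (e.g. a credible or
# confidence interval).
for label, spread in zip("ABCDEFGH", sensitivity(0.0, 10.0)):
    print(label, round(spread, 2))
```

A spread close to zero corresponds to a flat trace line over that range, while a large spread flags a study whose shrunken estimate depends strongly on the assumed heterogeneity.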

APPENDIX A TRACE PLOT FOR SAT COACHING DATA WITH CREDIBLE INTERVALS
Figure A1 shows a trace plot for the SAT coaching example data from Section 4.1; 8 the plot is analogous to Figure 3, but includes (conditional) credible bounds.

B FOREST PLOT FOR SAT COACHING DATA WITH ESTIMATES
Figure B2 shows the data from Section 4.1 in a forest plot along with shrinkage estimates (θ_i), overall mean (μ) and prediction (θ*) based on an analysis with uniform priors. The eventual (marginal) shrinkage estimates result from averaging the conditional shrinkage estimates over the heterogeneity's posterior distribution.

D THE HBLM PLOT
Figure D4 shows a trace plot for the SAT coaching example data from Section 4.1 as generated by the hblm S-PLUS package (actually: a code version ported to R). Note the odd scaling of the (discretized) τ-axis, and the very different appearance compared to Figure 3.

FIGURE D4
Trace plot for the SAT coaching example data (generated using the hblm package). 9,10 Note the scaling of the τ-axis.

FIGURE E6
Forest plot for the NO2 example data. 9

FIGURE E7
Forest plot for the COPD example data. 33
Figure E7 illustrates the data of the COPD example from Section 4.4.

FIGURE 2
Illustration of the normal-normal hierarchical model (NNHM) that is at the basis of many meta-analyses. Several studies (here: i = 1, 2, 3) have slightly differing (heterogeneous) treatment effects θ_i; their underlying distribution is shown in blue at the top of the figure, and it is characterized by its center (or "overall mean") μ, and the degree of (dis-)similarity is denoted by the heterogeneity parameter τ. When a study is conducted, we only ever get to know its true effect θ_i with some amount of uncertainty, which is quantified through the standard error σ_i. The eventual data, the effect estimates y_i, are hence more or less offset from the true values θ_i, depending on the magnitudes of the σ_i.

FIGURE 11
Trace plot for the COPD example data, 33 analogous to Figure 8, but based on a frequentist analysis. The bottom panel illustrates the Q-test statistic as a function of the heterogeneity, and the resulting Q-profile confidence interval for τ; the dashed line indicates the point estimate for τ.

FIGURE A1
Trace plot for the SAT coaching example data (see Section 4.1) including (conditional) credible ranges. As it is a very busy plot, only the estimates and CIs for school A (red) and for the overall mean (black) are highlighted; the remaining 7 estimates are shown in pale green. Conditional means are shown as solid lines, 95% credible limits are shown as dotted lines. 8

FIGURE C3
Figure C3 shows a trace plot for the SAT coaching example data from Section 4.1 as shown in Rubin's original publication. 8

FIGURE E5
Forest plot for the Aspirin example data. 27
Figure E5 illustrates the data of the Aspirin example from Section 4.2. Figure E6 illustrates the data of the NO2 example from Section 4.3.