On weakly informative prior distributions for the heterogeneity parameter in Bayesian random-effects meta-analysis

The normal-normal hierarchical model (NNHM) constitutes a simple and widely used framework for meta-analysis. In the common case of only a few studies contributing to the meta-analysis, standard approaches to inference tend to perform poorly, and Bayesian meta-analysis has been suggested as a potential solution. The Bayesian approach, however, requires the sensible specification of prior distributions. While non-informative priors are commonly used for the overall mean effect, the use of weakly informative priors has been suggested for the heterogeneity parameter, in particular in the setting of (very) few studies. To date, however, a consensus on how to generally specify a weakly informative heterogeneity prior is lacking. Here we investigate the problem more closely and provide some guidance on prior specification.


INTRODUCTION
In meta-analysis, researchers commonly encounter a certain amount of variability between experiments, to a degree going beyond what could be attributed to measurement error alone. Hierarchical models are commonly used in order to account for such ("between-study") heterogeneity. 1,2 In the present paper, we focus on the simple special case of meta-analysis within the framework of the normal-normal hierarchical model (NNHM). The NNHM approximates estimates from separate sources and their standard errors via normal distributions, and implements heterogeneity at a second level using another normal variance component. In meta-analysis applications, the NNHM provides a good approximation for many types of endpoints or effect measures. 3,4 The normal approximation has its limitations, 5 some of which are less of a problem in a Bayesian context. 6 A small number of studies tends to pose a problem especially for frequentist methods, in particular regarding the construction of confidence intervals (CIs) with good coverage properties. 7,8,9,10 A common convention is to exercise extra caution when the number of studies is small. 9 Bayesian approaches to meta-analysis have been advocated for quite a while, 11,12,13,14,15,16,17 and analyses may technically be performed using MCMC methods 1 or semi-analytical integration. 18 Within the R software, for example the bayesmeta 19,20 or bmeta 21 packages are available. Performing a Bayesian analysis is not technically challenging; computations are straightforward and valid for any number of studies, although less data will mean that results are more sensitive to prior specifications (especially when it comes to variance parameters). A crucial condition is that the explicitly implemented normal approximation needs to hold, which may break down e.g. for meta-analyses of small studies. 5,6
While for large numbers of studies the choice of prior distributions usually has little impact, for few studies the exact form of the prior distributions chosen may become crucial, as one cannot rely on the prior information being overruled by the data in that case. At least part of this problem may be considered "shared" between frequentist and Bayesian methods, as long as one tries to get by without using a proper, informative prior. 22 Some supposedly noninformative prior distributions can probably be argued to be less influential than others, but ultimately these are unlikely to be the best choice in few-study problems. Beyond meta-analysis, the use of informative priors for regularisation in the estimation of certain parameters is also common. 23 Especially for few studies, this may be a promising approach. 24 The case of "few" studies is hard to define; there is no obvious threshold, and in fact there may actually be no need to distinguish: use of an informative prior will not be harmful for analyses of "many" studies. Indeed, a proper prior is necessary irrespective of the number of studies in case the analysis requires the calculation of marginal likelihoods. In the present manuscript, we will investigate examples ranging in size between 2 and 5 studies. These are the cases where the use of an informative prior will make the greatest difference, and such situations have been discussed in the context of up to 4 studies, 9 3 to 10 studies, 7 or only 2 studies. 8 Heterogeneity priors have been investigated previously from different angles; some discussed general considerations for variance parameters, 15,25,26 while others motivated particular settings for specific example cases 27,28 or investigated commonly used settings in a systematic literature review. 29 The aim of the present investigation is to provide general guidance for judging and deriving weakly informative heterogeneity priors, and to suggest consensus examples for some common types of effect measures.
This may also aid in the design and justification of prior settings, or the prospective pre-specification of Bayesian meta-analyses 30 and it may help avoid (suspicion of) post-hoc tweaking of prior assumptions.
The remainder of this article is structured as follows. In the next section, the normal-normal hierarchical model (NNHM) along with its parameters and prior distributions is formally introduced. Section 3 discusses prior distributions for the heterogeneity parameter and some general motivating considerations and implications. Section 4 motivates heterogeneity priors for a selection of common types of endpoints and effect measures based on the previously discussed ideas. In Section 5, examples of meta-analyses with different endpoints are introduced, and analyses are performed using the suggested prior settings. Section 6 closes with conclusions and recommendations.

The normal-normal hierarchical model (NNHM)
The normal-normal hierarchical model (NNHM) represents measurements yᵢ from k different sources using two hierarchy levels. Along with the estimates yᵢ, their associated standard errors σᵢ need to be available. The σᵢ are assumed to be fixed and known (which commonly is only an approximation 5,31). Each estimate yᵢ is assumed to measure an underlying true value θᵢ, which is not necessarily identical across all measurements; ("between-study") variability among the θᵢ is accounted for by an additional variance component whose magnitude is given by the heterogeneity τ ≥ 0:

    yᵢ | θᵢ, σᵢ ∼ Normal(θᵢ, σᵢ²),   (1)
    θᵢ | μ, τ ∼ Normal(μ, τ²),   (2)

for i = 1, …, k, where the estimates yᵢ (as well as the θᵢ) are modelled as exchangeable. The overall mean effect μ is often the figure of primary interest. By marginalizing over the θᵢ values, the model may be written in simplified form:

    yᵢ | μ, σᵢ, τ ∼ Normal(μ, σᵢ² + τ²).   (3)

This is a random-effects model, which in the special case of τ = 0 simplifies to the common-effect model (also known as the fixed-effect model). 3,4,20,32 The NNHM provides a good approximation for many types of effect measures where the estimates as well as between-study variability may be assumed to be (approximately) normally distributed. 5 While often the aim of a meta-analysis is estimation of the overall mean μ, it is sometimes useful to also infer the study-specific means θᵢ or a prediction θₖ₊₁ for a future study. The amount of information gained on θᵢ or θₖ₊₁ through the joint meta-analysis depends very much on the amount of heterogeneity τ. If there was no heterogeneity (τ = 0), then we would have θ₁ = θ₂ = … = θₖ₊₁ = μ, and all data would essentially contribute to the estimation of a single common parameter. If, on the other hand, τ was very large, then the different parameters θᵢ would only be very loosely connected (2), and consideration of additional data would only add very little to the estimation of any particular θᵢ or to a prediction of θₖ₊₁. In between, for moderate τ values, estimates of the θᵢ are somewhat "shrunk" towards the overall mean μ, and the prediction of θₖ₊₁ is also more tightly constrained.
Estimation of the heterogeneity τ hence also has distinct effects on the so-called "shrinkage estimates" of the θᵢ as well as on predictions of θₖ₊₁. 20,33

In some models there exist (often improper) priors leading to posterior distributions that also provide proper frequentist coverage, but usually such a prior is not available. 44 Bayesian credible intervals are calibrated in the sense that they yield proper coverage on average across the prior distribution; for the point-wise coverage this means that there may be overcoverage in certain regions of the parameter space and undercoverage in others. 20,45,46,47 For example, in the present case this may mean that long-run coverage may be above the nominal level if data were repeatedly generated based on heterogeneity values from the lower end of the prior range, and below the nominal level otherwise.

Aim
For meta-analyses involving many studies (large k), the choice of prior distribution often has little impact, and an (improper) uniform prior for τ may be a good choice, not least due to its invariance property. 20,25 Here we are concerned first of all with the case of few studies (small k); a uniform prior may not actually be an option here, as it requires k ≥ 3 studies in order to yield a proper, integrable posterior, 25 and it may otherwise generally be considered overly conservative. 7,8,25 Similar problems arise also with the Jeffreys prior for the NNHM; 20, Sec. 2.2 this kind of issue is common in Bayesian analysis. 23 Another case where a proper, weakly informative prior may be required (not only for few studies) is when marginal likelihoods or Bayes factors are of interest.
While the availability of a "noninformative" prior comes with a certain convenience (one less issue to worry about), in the present case its failure to provide reasonable estimates in certain instances will often appear somewhat contradictory to common sense. The introduction of an informative prior then may entail a trade-off of the introduced regularisation versus simplicity and robustness. On the other hand, the explicit consideration of relevant prior information may also be seen as an advantage.
From a merely "technical" perspective, a heterogeneity prior must (in order to ensure integrability of the posterior) have a shorter-than-uniform upper tail (an eventually decreasing, integrable density function) and also an integrable density towards zero. In that spirit, it may also make sense to consider near-origin and upper-tail behaviours separately. While an (improper) uniform prior may be considered noninformative for several reasons (e.g., due to its scale-invariance property 20, Sec. 2.2), its overly heavy upper tail may also be considered "anti-conservative". 48 On the other hand, it may be possible to "rescue" some of the desirable behaviour and robustness e.g. by the use of heavy-tailed priors. 49 Besides upper-tail considerations, priors may also behave quite differently near zero; for example, depending on whether the prior density approaches zero, a finite value, or infinity. A finite prior density ensures near-zero behaviour roughly like that of a uniform prior, while a zero density may be useful e.g. in bounding maximum-a-posteriori (MAP) point estimates away from zero; 38 in particular from the regularisation perspective, the prior density's derivative near zero may also be of interest (as it determines how small values are pushed towards or away from zero).
While the concept of "weak informativeness" remains somewhat elusive (just like that of a "noninformative" prior), the information content (or "vagueness") of a prior is commonly related to its variance, its entropy, 50 or its associated effective sample size (ESS). 51,52 In many cases it is also helpful to consider the informativeness of a prior relative to a reference, 53 for example, a unit information prior. 26,54 Since the posterior draws its interpretation in part from the prior, it is important to make the prior specification plausible and transparent. Following the parsimony principle (Ockham's razor), it may be constructive to seek the (in some sense) simplest prior distribution within any relevant constraints. 55 Possible approaches to implement such a notion in practice may work, e.g., via maximization of the entropy, 50 pre-specification of an effective sample size, 51,52 or matching of moments.
Despite the aim of a weakly informative formulation, one should also anticipate the case where the data have little information to add, so that the posterior closely resembles the prior and hence the analysis results are largely determined by the prior settings. This may happen especially in cases of few studies and is also suggested in some of the examples that will be discussed below (see Figure 8); such cases highlight the importance of a transparent and convincing prior specification.
In the remainder of this section, we aim to facilitate a structured approach to interpreting heterogeneity and specifying heterogeneity prior distributions by pointing out relevant perspectives and highlighting consequences of certain heterogeneity settings. Similar ideas are to some degree also utilized in prior elicitation in general. 56,57 A set of guiding questions is eventually suggested in Table 6.

General properties of the NNHM
When considering prior distributions for the heterogeneity τ, it is useful to recall that τ ≥ 0 is a scale parameter, and that its square τ² denotes a variance component within the NNHM. Immediate associations with variance priors useful in a simple normal model, however, may be misleading: inverse-gamma (or inverse-χ²) distributions are usually not recommended, as these arise as conjugate distributions only in related, yet distinctly different circumstances. An inverse-gamma distribution is conjugate in the simple case of estimating the variance of a normal distribution with known mean. 1 In such a case, an unequal pair of two data points for example implies that the variance must be positive (a zero variance would have a zero likelihood); in the present NNHM context, however, unequal yᵢ values may be consistent with zero heterogeneity (τ = 0), so that such priors are not a natural choice here, and their use is generally discouraged. 2,25,58,59 Supposedly noninformative settings based on inverse-gamma distributions commonly tend to result in sensitivity to specification details, 25 and often too much probability is allocated to very large heterogeneity values. 60 For uniform or normal effect prior distributions, the resulting conditional effect posterior p(μ | y, τ) again is normal. While for increasing τ the (conditional) posterior mean of μ shifts from the inverse-variance weighted mean towards the unweighted average of the estimates yᵢ, the (conditional) posterior variance of μ is increasing in τ. 20 At the same time, larger heterogeneity values also imply wider prediction intervals and less shrinkage 16,20,61,62,63 (see also Section 2.1). Varying τ between zero and infinity essentially also means varying between the extremes of pooled and separate analyses of the individual studies. In a sense, overestimation of τ may hence often be considered a "conservative" or "less harmful" form of bias.
In that spirit, one might argue that (within reasonable limits) a prior that is stochastically larger than another is also more conservative. 64 A simple way to implement stochastically ordered distribution families is by using parametrisations that include a scale parameter. 65, Sec. VII.6.2 Use of a scale parameter does not actually impose a restriction; if not already included in the parametrisation, it may easily be introduced. Note that a simple re-scaling of a prior distribution p(τ) then also implies a (re-)scaling of the corresponding marginal prior predictive distribution p(θᵢ | μ) by the same factor. In general, stochastically ordered priors also imply the same ordering for the resulting posteriors. 63,66,67 Consideration of stochastically ordered alternative priors may hence also offer a framework for sensitivity analyses (see also Appendix D.4).
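As a small numerical illustration of the conditional posterior behaviour described above, the following Python sketch computes the conditional posterior of μ given τ under a uniform effect prior; the data values and the helper function are our own, not taken from the paper:

```python
import numpy as np

def conditional_mu_posterior(y, sigma, tau):
    """Conditional posterior of mu given tau (uniform prior on mu):
    normal, with inverse-variance weights w_i = 1 / (sigma_i^2 + tau^2)."""
    w = 1.0 / (sigma**2 + tau**2)
    return np.sum(w * y) / np.sum(w), 1.0 / np.sum(w)

# three hypothetical estimates with differing precisions:
y = np.array([-0.5, 0.3, 0.8])
sigma = np.array([0.1, 0.3, 0.5])

for tau in [0.0, 0.5, 2.0, 10.0]:
    mean, var = conditional_mu_posterior(y, sigma, tau)
    print(f"tau = {tau:5.1f}:  mean = {mean:6.3f},  sd = {np.sqrt(var):.3f}")
```

For τ = 0 the posterior mean is the precision-weighted average (dominated by the most precise estimate), while for large τ it approaches the unweighted mean of the yᵢ and the posterior standard deviation keeps growing, in line with the behaviour described above.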

Reasonable (proper) distributional families
A simple way to implement the "technical" requirements (as suggested in Section 3.1) may be to require roughly uniform behaviour near zero (implying indifference among small heterogeneity values on the τ scale and ensuring integrability in the lower tail), and a monotonically decaying tail with increasing heterogeneity values (implying decreasing probability for increasing τ values and ensuring integrability in the upper tail). This may be achieved e.g. by using half-normal, half-Student-t, half-Cauchy, half-logistic, exponential or Lomax distributions. A sample of such distributions is sketched in Figure 1. Note that for comparability, the distributions in the figure are all scaled such that they have a common median of 1; their corresponding parameters are also listed in Table 4 below. In particular, half-normal, half-Student-t, or half-Cauchy distributions have been recommended as appropriate families within the NNHM, also due to favourable frequentist properties. 2,25,58 The half-Student-t distribution (including the half-Cauchy as a special case, and the half-normal as a limiting case) may be derived as a conditionally conjugate distribution in an extended parametrisation of the NNHM. 2, Sec. 19.6 The exponential distribution might be motivated as the maximum entropy distribution for a pre-specified prior expectation, 50 or as the penalised complexity prior. 39 The half-logistic distribution combines a zero derivative (implying near-uniform behaviour) at the origin with an upper tail behaviour close to that of an exponential distribution.
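The median-1 scaling used for comparability can be reproduced for a few of these families; a sketch in Python, where the concrete scipy parametrisations and scale values are ours:

```python
import math
from scipy import stats

# Scale parameters chosen so that each family has median 1, mirroring the
# normalization used for Figure 1 (the concrete numbers are our own):
priors = {
    "half-normal": stats.halfnorm(scale=1.0 / stats.norm.ppf(0.75)),  # scale ~1.483
    "exponential": stats.expon(scale=1.0 / math.log(2.0)),            # scale ~1.443
    "half-Cauchy": stats.halfcauchy(scale=1.0),
    "Lomax(1)":    stats.lomax(c=1.0, scale=1.0),
}

for name, dist in priors.items():
    med, q95 = dist.ppf(0.5), dist.ppf(0.95)
    print(f"{name:12s} median = {med:.3f}, 95% quantile = {q95:6.2f}, "
          f"ratio = {q95 / med:.2f}")
```

The printed ratio of the 95% quantile over the median increases from the half-normal to the Lomax family, giving a quick numerical impression of their relative heavy-tailedness.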
Half-Student-t and Lomax distributions here may be considered as heavy-tailed variants of the half-normal and exponential distributions, respectively. In the spirit of a contaminated prior, encompassing priors "close to an elicited one", 68,69, Sec. 3.5.3 these may also be motivated as scale mixtures, where the (exponential or half-normal) scale parameter is associated with some variability or uncertainty. The scale mixture connection is also derived in detail in Appendix C below. The special case of a Lomax(α = 1) distribution also coincides with the form of prior distribution suggested by DuMouchel (a log-logistic prior for τ). 70,71 Similarly, the exponential distribution may also be motivated as a scale mixture of a half-normal distribution with Rayleigh-distributed scale. The use of heavy-tailed prior distributions has the advantage of ensuring some degree of robustness against prior misspecification (or prior/data conflict) 49 at the cost of sacrificing some of the prior's "regularisation" power. Another simple way of implementing some degree of robustness is by combining "informative" and "heavy-tailed" elements in a two-component mixture distribution. 72,73 Another simple and common prior distribution is the (proper) bounded uniform distribution defined on an interval [0, b]. It inherits certain qualities from the (improper) uniform distribution, but it introduces a sharp cutoff at the upper bound b, which may be hard to motivate or justify; if the bound is chosen large enough, however, it may still be very reasonable (e.g. for log-ORs).
Among the above examples, the Student-t and Lomax distributions possess "shape" parameters in addition to scale parameters, which here essentially regulate the degree of heavy-tailedness. If considered desirable, more complex prior assumptions may be implemented using more complex distributions, e.g., using folded non-central Student-t distributions with a non-zero mode; 2,25,58 however, additional degrees of complexity would probably require solid justification to be convincing. In the context of a penalisation interpretation of the prior, a mode at zero also implies a corresponding "penalty term" that is monotonically increasing in τ; this applies e.g. for a penalized-complexity prior 39 that aims to give preference to sparse models. In empirical investigations based on meta-analyses archived in the Cochrane Database of Systematic Reviews, log-normal and log-Student-t (5 degrees of freedom) distributions have been fitted to empirical data. 74,75 The log-normal and log-t distributions here were found to fit the predictive distributions best; however, only few alternatives (log-normal, log-t and inverse-gamma, 74 or log-normal, inverse-gamma and gamma distributions 75 for τ²) were considered as candidates in these comparisons. Some properties of the distributions discussed here are also listed in Appendix B.
In practice, the half-normal distribution is quite commonly used; the reasons for its popularity are probably its simple and familiar form, its near-uniform behaviour at the origin along with a reasonably quickly decaying upper tail, as well as considerations of numerical stability. In the following, we will focus mostly on half-normal distributions. In our experience, minor differences between similar prior densities are of little practical relevance, while it is most important what heterogeneity ranges the bulk of prior probability is assigned to.
When eventually formulating prior assumptions in terms of a parametric prior probability distribution, it is first of all necessary to be able to judge the meaning and implications of certain heterogeneity settings; these issues will be discussed in the following section.

Units of τ
Informative priors naturally always need to be considered in the context of the endpoint under consideration. In order to specify a sensible prior for τ, it is important to recapitulate its role in the NNHM (see Section 2.1). The heterogeneity τ is a scale parameter that relates to the probable size of ("between-study") differences in effects (θᵢ and μ; see equation (2)). With that, the units of measurements (yᵢ), effects (θᵢ, μ) and heterogeneity (τ) are the same; if the effect is measured, say, in metres, then so is the heterogeneity. Or both may be dimensionless, as e.g. in the case of log-transformed ratios (like log-odds-ratios

FIGURE 1
A selection of potential probability densities for the heterogeneity τ. All distributions are scaled so that their prior median is at unity (τ = 1, dashed line; see also Table 4).
(log-ORs), log-incidence-rate-ratios (log-IRRs), log-hazard-ratios (log-HRs), …) or standardized mean differences (SMDs). One may in fact argue that the nature of the effect scale is the most important aspect to consider for prior specification. 24 In case the effects have been transformed prior to analysis, it is often useful to consider implications on the back-transformed scale.
Transformations are usually introduced to achieve a better fit to the normality assumptions within the NNHM; for example, using logarithmic or arcsine transforms. 3,4,76 In such cases, also considering the back-transformed (exponential or sine) effect scales is often instructive.
In case the effect scale has definite upper and lower bounds (which is often the case e.g. for endpoints measured as scores), this also provides information on the plausible (and possible) between-study variability. In case of bounded scales, it may for example be useful to consider the extreme cases of a continuous uniform distribution across the considered range (which would have standard deviation (b − a)/√12 ≈ (b − a)/3.46, where a and b are the lower and upper bounds, respectively), or a discrete distribution with probabilities of 1/2 concentrated at both margins a and b (which would have standard deviation (b − a)/2). Such considerations may define absolute "worst-case" settings for the heterogeneity. Any normal approximation employed on a bounded parameter space with a standard deviation of, say, σ > (b − a)/4 would inevitably have substantial overlap with out-of-domain values; any heterogeneity value that is not ≪ (b − a)/4 should raise suspicion and might actually call for a different approach (e.g., transformation to a different parameter space).
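These reference standard deviations are simple to compute; a small sketch, where the bounds 0 and 100 are an arbitrary example:

```python
import math

def bounded_scale_sds(a, b):
    """Reference standard deviations for effects confined to [a, b]:
    a uniform distribution over the range, and the extreme two-point
    distribution with probability 1/2 at each bound."""
    return (b - a) / math.sqrt(12.0), (b - a) / 2.0

# e.g. a score bounded between 0 and 100 (hypothetical endpoint):
sd_uniform, sd_two_point = bounded_scale_sds(0.0, 100.0)
print(f"uniform: {sd_uniform:.2f}, two-point: {sd_two_point:.2f}, "
      f"(b-a)/4: {100.0 / 4:.2f}")
```

Any heterogeneity value approaching the (b − a)/4 threshold on such a scale would, as argued above, call the normal approximation into question.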

Magnitudes of other effects
Relevant hints may originate from considering the magnitude of other (known or plausible) effects of interventions or covariates. The reasonable range for the overall mean effect μ may also have implications for the expected range of study-specific means θᵢ; in case an informative prior for μ is used (or is at least plausible), its variance may also help constrain the between-trial variability. Heterogeneity may often be attributed to differences in the composition of the populations underlying each estimate, and the distribution of relevant covariates within (which may be observed or unobserved). If the observed heterogeneity is assumed to be due to different compositions of populations, then the heterogeneity relates to accumulated effects of associated covariates. With that, within- and between-study variability in effects are related to within- and between-study differences among subjects and the plausible magnitude of covariates' effects. For example, if a treatment effect is known to differ between males and females by a certain amount, this difference between genders may help judging or motivating plausible magnitudes of effect differences between studies. In case the variability between centers within the same study has been investigated, this may also provide a hint on between-study variability (which will then most likely be larger).

Implications of a fixed heterogeneity value
Specific values of the heterogeneity τ may be judged and compared based on the implied distribution of true effects θᵢ, which is given by the (conditional) prior predictive distribution p(θᵢ | μ, τ) (see equation (2)), where τ defines the distribution's standard deviation. The effects θᵢ (conditional on μ) then vary within a range of μ ± 1.96τ with 95% probability. For a randomly picked pair of effects (θᵢ and θⱼ), their difference (θᵢ − θⱼ) follows a N(0, 2τ²) distribution (see (2)), and their absolute difference |θᵢ − θⱼ| then has a median of 0.95τ. Quite commonly, the effects are transformed prior to analysis, so that it may be helpful to consider the implications on the back-transformed scale. A very common example is the logarithmic transformation, which is often used for analyses involving e.g. odds ratios (ORs), relative risks (RRs) or hazard ratios (HRs), and where the inverse transform is the exponential function. 95% predictive intervals and median differences are shown for a range of τ values in Table 1. Ranges of τ values have been categorized 15 in the context of log-ORs as "reasonable", "fairly high" or "fairly extreme", as shown in Table 2. Such investigations may help judging what τ values are reasonable or unrealistic, and with that may help specifying e.g. the heterogeneity prior's tail quantiles.
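The implications of a fixed heterogeneity value can be computed directly; a sketch of the Table 1 logic, where μ = 0 and τ = 0.5 are example values of our choosing:

```python
import math
from scipy import stats

def implied_effects(mu, tau):
    """For fixed heterogeneity tau, summarize theta_i ~ N(mu, tau^2):
    the 95% range mu +/- 1.96 tau, and the median absolute difference of
    two random effects, median|N(0, 2 tau^2)| = sqrt(2)*tau*Phi^{-1}(0.75)."""
    half_width = 1.96 * tau
    med_abs_diff = math.sqrt(2.0) * tau * stats.norm.ppf(0.75)
    return mu - half_width, mu + half_width, med_abs_diff

# on a log-OR scale (mu = 0), back-transform the range to odds ratios:
lo, hi, mad = implied_effects(0.0, 0.5)
print(f"95% OR range: {math.exp(lo):.2f} to {math.exp(hi):.2f}")
print(f"median |theta_i - theta_j|: {mad:.3f}")   # roughly 0.95 * tau
```

The computed median absolute difference confirms the 0.95τ factor quoted above, and the exponentiated interval shows what a given τ implies on the odds-ratio scale.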

Implications of a heterogeneity distribution
Besides considering the conditional distribution for fixed τ values (p(θᵢ | μ, τ), see previous subsection), one may also investigate the marginal prior predictive distribution p(θᵢ | μ), marginalized over a particular heterogeneity prior, which technically results as the integral p(θᵢ | μ) = ∫₀^∞ p(θᵢ | μ, τ) p(τ) dτ. Since p(θᵢ | μ, τ) is normal (2), the marginal p(θᵢ | μ) is a normal (scale) mixture distribution. Its form may usually either be derived numerically, 18,19,20 or it may easily be explored using collapsed Gibbs sampling, that is, generating a Monte Carlo sample by repeatedly sampling from the heterogeneity prior (p(τ)), and subsequently from the conditional predictive distribution (p(θᵢ | τ)). Investigating the marginal prior predictive distribution may help judging the prior scale or distributional family. Table 3 illustrates a range of prior predictive distributions for a set of half-normal priors that differ in their scale. The implied probabilities for the (log-OR) categories shown in Table 2 are also given. Note that a simple re-scaling of the heterogeneity prior implies proportional scaling of mean and quantiles for τ as well as θᵢ (as can be seen in Table 3). In this spirit, Dias et al.

TABLE 2
Categorization of heterogeneity values in the context of log-ORs.
category           range
"reasonable"       0.1 < τ < 0.5
"fairly high"      0.5 < τ < 1.0
"fairly extreme"   τ > 1.0

Similarly, Table 4 illustrates a range of prior predictive distributions for a set of heterogeneity priors from different distributional families; what they have in common is the prior median of 1.0 for τ. Quantiles or mean of τ or θᵢ for other scalings of p(τ) may be derived by proportional re-scaling (as in Table 3). For example, a half-Cauchy distribution that has its median heterogeneity matched to that of a half-normal distribution requires a scale parameter that is smaller by a factor of ≈ 2/3. From the table, one can also read off the ratio of the 95% quantile over the median, which may be a useful indicator of the heavy-tailedness of the different distribution families.
The distributions from Table 4 are also illustrated in Figure 1. Some additional properties of these distributions are provided in Appendix B.
Different distributional families for the prior p(τ) imply differing marginal prior predictive distributions p(θᵢ | μ). Concrete prior information on p(θᵢ | μ) then may help constraining the shape of p(τ); however, the prior family may also be selected based on considerations of heavy-tailedness, near-zero behaviour, or simplicity.
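The collapsed-Gibbs-style exploration of the marginal prior predictive distribution described above can be sketched in a few lines; the half-normal scale of 0.5 is just an example value:

```python
import numpy as np

rng = np.random.default_rng(42)
scale = 0.5          # half-normal heterogeneity prior scale (example value)
n = 1_000_000

# collapsed Gibbs-style sampling from the marginal predictive p(theta | mu=0):
tau = np.abs(rng.normal(0.0, scale, size=n))   # tau ~ half-normal(scale)
theta = rng.normal(0.0, tau)                   # theta | tau ~ N(0, tau^2)

print("theta quantiles (2.5/50/97.5%):",
      np.round(np.quantile(theta, [0.025, 0.5, 0.975]), 3))

# implied probabilities for the log-OR categories of Table 2:
for label, lo, hi in [("reasonable", 0.1, 0.5),
                      ("fairly high", 0.5, 1.0),
                      ("fairly extreme", 1.0, np.inf)]:
    print(f"P({label}) = {np.mean((tau > lo) & (tau < hi)):.3f}")
```

Re-running this with a re-scaled prior scale multiplies the θ quantiles by the same factor, illustrating the proportional-scaling property noted above.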

The role of the unit information standard deviation (UISD)
Consider the simple case of an effect measure that for each study is determined as an average of independent identically distributed observations. In such a case, the associated standard error is simply of the form

    σᵢ = σᵤ / √nᵢ ,   (4)

where nᵢ is the sample size, and σᵤ is the common "population" standard deviation of each single observation that was averaged over. This figure describes the population-, or within-study, standard deviation, 54 which for the moment we take to be constant across studies. This figure is also called the unit information standard deviation (UISD), as it relates to an observational unit's contribution to a study's likelihood. One may now relate the heterogeneity τ to σᵤ and ask whether the between-study variability (τ) is likely to exceed the within-study variability (σᵤ), or what ratios of these two are plausible. Figure 2 illustrates the relationship of within-study and between-study standard deviations σᵤ and τ. Usually, one would expect τ ≪ σᵤ, implying that while study means (θᵢ) may differ to some degree, the distributions of subjects within studies will still be largely overlapping (see Figure 2, left panel). In that sense, the UISD σᵤ may constitute an important "landmark" on the heterogeneity continuum and thus may help constraining the range of plausible heterogeneity values. 26 This concept of within-study standard deviation may be extended to other types of effect scales; for example, the standard error of a log-OR derived from a 2×2 table is approximately given by σᵢ = 4/√nᵢ, so that, heuristically, the UISD here equals σᵤ = 4 per subject (at least). 20, Appendix A.1 Sometimes it may also make more sense to define UISDs not per subject but rather per event (see also Appendix A.3 for an example), but care also needs to be taken in order not to confuse these two figures. For a given set of log-OR estimates, the UISD may alternatively also be investigated by inverting equation (4) (see also (6) and the examples in Section 5.3 below).
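As a quick check of the 4/√n heuristic, one may compare the exact log-OR standard error from a balanced 2×2 table with the approximation; the table counts below are a hypothetical example:

```python
import math

def se_log_or(a, b, c, d):
    """Standard error of a log odds ratio from a 2x2 table with cells a, b, c, d."""
    return math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)

# balanced table with event probability 1/2 and total sample size n = 200,
# i.e. 50 subjects per cell:
n = 200
print(se_log_or(50, 50, 50, 50))   # exact standard error
print(4.0 / math.sqrt(n))          # heuristic UISD-based value, 4/sqrt(n)
```

In this balanced case the two values coincide; unbalanced tables or rarer events yield larger standard errors, which is why the UISD of 4 per subject is a lower bound ("at least").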
Another link may be drawn between σᵤ and τ via shrinkage estimation (see Section 2.1) and the consideration of prior effective sample sizes. 52,77 Consider the case where a meta-analysis of k studies is available, and a new (k+1-th) study is conducted.

FIGURE 2
Illustration of the relationship of between-study heterogeneity τ and unit information standard deviation (UISD) σᵤ. The left panel (a) shows the commonly expected setup, in which the heterogeneity is relatively small compared to the within-study standard deviation (τ ≪ σᵤ). The right panel (b) shows that a larger τ would imply that the distributions of subjects from different studies were eventually barely overlapping. Note that the eventual estimates (yᵢ) resulting from the different studies then may have different standard errors σᵢ = σᵤ/√nᵢ < σᵤ associated, depending on the studies' sample sizes nᵢ.

TABLE 5
Correspondence between prior maximum sample sizes (n⋆∞) and the magnitude of the heterogeneity (τ) relative to the unit information standard deviation (UISD) (σᵤ) (see (5)). 77

The previous meta-analysis of course provides (prior) information on the new study's estimate θₖ₊₁, the exact amount of which is determined by the number of studies k, their sample sizes nᵢ, the UISD σᵤ, but also by the amount of heterogeneity. 33,72 If τ is large, then separate studies are only loosely related and the previous data add little information. If on the other hand τ is very small (i.e., studies are almost homogeneous), then they may contribute a lot of information. With that, the amount of heterogeneity is related to whether studies should rather be pooled or viewed as essentially independent pieces of information. One may then consider the idealized limiting case of infinitely many (k → ∞) infinitely large (nᵢ → ∞) studies as the previous data source, so that the amount of contributed information solely depends on τ. In that case, the historical data may be thought of as effectively contributing a number of n⋆∞ additional subjects to the (k+1)-th study. This prior maximum sample size then relates to σᵤ and τ as 77

    n⋆∞ = (σᵤ / τ)² .   (5)

Table 5 illustrates this relationship. For example, if in the ideal case (i.e., k = ∞, nᵢ = ∞) the additional data should add information equivalent to at most 16 subjects, then this would correspond to τ amounting to at most a quarter of σᵤ. If one has an idea of how much information a meta-analysis may (or should) contribute to a single study's shrinkage estimate (in the idealized case of very many very large studies), then such considerations may help constraining probable magnitudes of τ, or associating probabilities with ranges of τ values.
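The correspondence between n⋆∞ and the ratio τ/σᵤ can be tabulated directly; a minimal sketch, assuming the quadratic relation n⋆∞ = (σᵤ/τ)² from (5):

```python
# Table-5-style correspondence between the prior maximum sample size
# n_star_inf and tau expressed as a multiple of sigma_u (assumed relation:
# n_star_inf = (sigma_u / tau)^2, i.e. equation (5)):
for ratio in [2.0, 1.0, 0.5, 0.25, 0.125]:   # tau as a multiple of sigma_u
    n_star = ratio ** -2
    print(f"tau = {ratio:5.3f} * sigma_u  ->  n_star_inf = {n_star:7.1f}")
```

For example, τ = σᵤ/4 yields n⋆∞ = 16, matching the "16 subjects" example above.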
Note that a number of priors have been proposed which are defined relative to the magnitude of the σᵢ values (or their harmonic mean), e.g., the Jeffreys, DuMouchel or uniform shrinkage priors. 20 Sec. 2.2 In view of the above arguments, it might also make sense to define priors relative to the UISD, or its estimated value. Inverting (4) yields σ₁ = σᵢ √nᵢ for a single study, and based on a given data set we suggest the more general empirical estimate

    σ̂₁ = √( n̄ · σ̄²ₕ ) ,    (6)

where n̄ is the average (arithmetic mean) sample size, and σ̄²ₕ = ( (1/k) ∑ᵢ σᵢ⁻² )⁻¹ is the harmonic mean of the squared standard errors (variances). This estimator is defined so that in the special case of a common-effect analysis (i.e., assuming τ = 0), the overall mean estimate's variance (which then is given by ( ∑ᵢ σᵢ⁻² )⁻¹) consistently also equals σ̂₁² / ∑ᵢ nᵢ.
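The estimator σ̂₁ = √(n̄ · σ̄²ₕ) and its consistency property under a common-effect analysis can be checked numerically; a Python sketch with a hypothetical function name and made-up illustrative data:

```python
from statistics import mean

def uisd_estimate(n, se):
    """Empirical UISD estimate sigma_1_hat = sqrt(n_bar * s2_h), with n_bar
    the arithmetic mean sample size and s2_h the harmonic mean of the
    squared standard errors (hypothetical helper)."""
    n_bar = mean(n)
    s2_h = len(se) / sum(s ** -2 for s in se)
    return (n_bar * s2_h) ** 0.5

# Consistency property: under a common-effect analysis (tau = 0), the overall
# mean's variance, (sum of inverse variances)^-1, equals sigma_1_hat^2
# divided by the total sample size. Made-up illustrative data:
n = [40, 60, 100]
se = [0.50, 0.40, 0.30]
sigma1_hat = uisd_estimate(n, se)
common_effect_var = 1 / sum(s ** -2 for s in se)
print(abs(common_effect_var - sigma1_hat ** 2 / sum(n)) < 1e-12)  # True
```
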

Empirical information on τ
Empirical data, e.g. from earlier investigations in a related area, 78 may also contribute a-priori information. Informative priors based on empirical information have been derived for standardized mean differences (SMDs) and log-ORs in medical applications by investigating large numbers of meta-analyses published in the Cochrane Database of Systematic Reviews. 85 Note that some references provide information directly on the heterogeneity parameter τ, while others summarize estimates of heterogeneity. Empirical information often entails the question of how representative the external information is for the study at hand, what the relevant data subset may be, or what to do if no such sample is available. In terms of the epistemic view discussed in Section 2.2.2, the inclusion of empirical evidence in the prior specification affects the interpretation of the prior, and with that, of the posterior. Empirical data may then often be seen as a somewhat complementary source of evidence. When there is doubt about the immediate applicability of empirical information for the problem at hand, this may also be reflected e.g. in a robustified two-component mixture prior. 72,73

Guiding questions
In order to summarize the above arguments, Table 6 lists some guiding questions that may aid in structuring the specification of a prior for the heterogeneity. These are mostly based on the arguments laid out in Sections 3.3 and 3.4. Firstly, plausible heterogeneity magnitudes (in terms of τ values or ranges) need to be determined. These reflections may then also help choosing a parametric family for the prior, or the distributional family may also be selected based on considerations of near-zero behaviour, heavy-tailedness or simplicity. Beyond the mere type of endpoint or effect measure, the context may also determine whether smaller or larger amounts of heterogeneity are to be expected, e.g., depending on whether studies' designs and populations were similar. Special considerations in the context of specific common types of effect scales are discussed in detail in Section 4. These are then illustrated using actual data examples in Section 5.

TABLE 6
Some guiding questions for judging reasonable prior distributions for the heterogeneity parameter τ.

Prior information:
(i) What is the effect scale, and what (between-study) differences are expected or plausible?
(ii) What is the magnitude of other known (or plausible) effects? Do these provide guidance? Is an informative effect prior used? If so, what is its variance? Does it provide guidance?
(iii) Is a plausible "unit information standard deviation (UISD)" available? Does it provide guidance?
(iv) Is relevant external empirical information on heterogeneity available? Should it be considered in the analysis?

Translation into a prior probability distribution:
(v) Does the prior information help pinpointing prior quantiles (of τ)?
(vi) Does the prior information help pinpointing prior predictive quantiles (of θᵢ)?
(vii) Does the prior information suggest particular properties for the prior (τ-density)?

Means and mean differences
This general case covers endpoints measured on absolute scales, hence it is not possible to give universally applicable advice on a plausible prior scale. For example, the same analysis may require different scalings of the prior depending on whether an endpoint is expressed, say, in terms of hours or minutes. In particular, in case of effects that are defined as averages, the UISD (see also Section 3.4.5) may provide some guidance; if standard errors scale with sample size (σᵢ ≈ σ₁/√nᵢ, see also equation (4)), then σ₁ (or an estimate σ̂₁, (6)) may provide some orientation based on the considered (or other related) data. Relating effects to "within-population standard deviations" is actually an approach that is also formalized in the case of standardized mean differences (SMDs); see the following section.
Mean differences are another very common special case. These are often used in order to "normalize" outcomes; for example, in controlled clinical trials, each study's treatment group is usually related to a control group in order to express the treatment effect relative to the unexposed group. In the simplest case, the study's outcome then is defined as yᵢ = x̄₂;ᵢ − x̄₁;ᵢ, where x̄₁;ᵢ and x̄₂;ᵢ are the ith study's averages from control and treatment group, respectively. When considering UISDs, the relevant sample size will then result as the sum of the two treatment groups' sizes (nᵢ = n₁;ᵢ + n₂;ᵢ). In the simple case of two equally-sized groups (n₁;ᵢ = n₂;ᵢ = nᵢ/2) and equal variances within groups (so that Var(x̄₁;ᵢ) = Var(x̄₂;ᵢ) = σ²w / (nᵢ/2)), the UISD simply results as σ₁ = 2σw, where σ²w is the within-group variance. Again a special case arises when considering paired differences. 86 In general, analogous considerations apply for un-paired as well as for paired differences; only for the latter case the UISD σ₁ takes a somewhat different form. Finally, there are generic cases of parameter estimates that are reported along with a standard error, but which do not necessarily have a "sample size" (nᵢ) associated, as is sometimes the case, e.g., for laboratory experiments. 87
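For the balanced two-group case just described, the cancellation of the sample size (so that the UISD is 2σw regardless of nᵢ) can be verified in a few lines of Python (helper names are ours):

```python
from math import sqrt

def mean_difference_se(sigma_w, n):
    """Standard error of a two-group mean difference with balanced groups
    (n/2 subjects each) and common within-group SD sigma_w
    (hypothetical helper)."""
    var_per_group_mean = sigma_w ** 2 / (n / 2)
    return sqrt(2 * var_per_group_mean)

def uisd_from_se(se, n):
    """Invert sigma_i = sigma_1 / sqrt(n_i) to recover the UISD."""
    return se * sqrt(n)

# The UISD comes out as 2 * sigma_w, independently of the sample size:
se = mean_difference_se(sigma_w=5.0, n=80)
print(uisd_from_se(se, 80))  # 10.0
```
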

Standardized mean differences
Standardized mean differences (SMDs) aim to compare mean differences measured on different scales by normalizing them through their population standard deviation. Effectively, these measure by how many standard deviations the two study groups differ; SMDs are always dimensionless. Their aim is to estimate θᵢ = (μ₂;ᵢ − μ₁;ᵢ)/σ, where μ₂;ᵢ and μ₁;ᵢ are the two groups' true means and σ is the within-group standard deviation (which may be defined with respect to one or the other or both treatment groups, or which may also be externally informed). Note that σ here bears some similarity to the UISD σ₁ (when considering the latter with respect to the unstandardized differences). Slightly differing, but essentially similar approaches are given e.g. by the "Cohen's d", "Hedges' g" or "Glass' Δ" estimators, which differ in details like bias correction or standardization terms. 3,4 Essentially, these aim to estimate the mean difference (μ₂;ᵢ − μ₁;ᵢ) by the difference of averages (x̄₂;ᵢ − x̄₁;ᵢ), and also the standard deviation σ by an empirical one. SMDs (along with the correlations treated below) are somewhat different here from the "general" mean differences, in that they are explicitly designed and utilized in order to compare endpoints measured on different scales, which are not directly comparable. A heterogeneity of τ = 0 may hence be considered particularly unlikely. A value of τ = 1 would mean that the between-study heterogeneity (among θᵢ values) was equal to the within-group variability σ. Closely related to SMDs are standardized regression coefficients, which are re-scaled as if both the regressor's as well as the response's variance were normalized to unity. 88 Similar arguments would apply for analyses involving standardized regression coefficients, and arguments applicable to correlation coefficients (see Section 4.5 below) may also be relevant.
Effects on the SMD scale have been categorized as 0.2 = "small", 0.5 = "medium", 0.8 = "large", 89 Sec. 2.2.3 where an extension has recently been proposed to include the grades of 0.1 = "very small", 1.2 = "very large", and 2.0 = "huge". 90 Consequently, such a ranking might be utilized in order to bound between-study effects to mostly non-extreme values, e.g. by anticipating mostly up to "large" heterogeneity and hence formulating a bound on P(τ ≤ 1). Neglecting estimation uncertainty for the denominator, and for simplicity assuming equal sample sizes for each of the ith study's groups, leads to a UISD of σ₁ = 2 (see Appendix A.1).
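The σ₁ = 2 figure may be motivated by a short calculation (a sketch under the same assumptions: balanced groups, the denominator's estimation uncertainty neglected):

```latex
% Large-sample variance of an SMD estimate, balanced groups of n_i/2
% subjects each, neglecting the denominator's estimation uncertainty:
\sigma_i^2 \approx \frac{1}{n_i/2} + \frac{1}{n_i/2} = \frac{4}{n_i}
\qquad\Longrightarrow\qquad
\sigma_i \approx \frac{2}{\sqrt{n_i}},
\qquad
\sigma_1 = \sigma_i \sqrt{n_i} = 2 .
```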
Empirical evidence on heterogeneities between SMDs, based on an analysis of studies archived in the Cochrane Database of Systematic Reviews, is given by Rhodes et al. (2015); 74 for a general healthcare setting (not restricted to a particular outcome type), a log-Student-t distribution was derived.

Log-transformed odds, rates and effect scales
Many outcomes are commonly analyzed on a logarithmic scale, which may be advantageous for several reasons. Firstly, the domain of positive numbers is mapped to the complete real line, which makes strictly positive scales tractable for normal models like the NNHM, which is often convenient. Secondly, additive effects on the log-scale translate to multiplicative effects on the original scale. Symmetry of the normal distribution (2) on the log-scale then implies a "symmetric" treatment of multiplicative factors and their inverses (since exp(μ + τ) = exp(μ) × exp(τ), while exp(μ − τ) = exp(μ) × 1/exp(τ)). This is useful, e.g., when dealing with outcomes like rates, odds, rate ratios, odds ratios, relative risks, hazard ratios or concentration measurements. An offset of, say, 0.1 on the log-scale translates (approximately) to a change of 10% on the back-transformed (exponentiated) scale, regardless of the original value. Thirdly, the normal approximation to the likelihood that is used in the NNHM (1) may provide a better fit on the logarithmic scale.
When considering heterogeneity values on the logarithmic scale, a more intuitive approach is usually to examine the corresponding implications on the back-transformed scale. Note that a normal model on the log-scale actually corresponds to a log-normal model on the original scale. In a sense, an analysis on the logarithmic scale may also be viewed as an implementation of a dependent joint prior for effect and heterogeneity 22,34 on the original (exponentiated) scale. The consequences of certain heterogeneity values or heterogeneity distributions were already investigated in some detail in Sections 3.4.3 and 3.4.4; the important issue to judge is what relative (multiplicative) difference between studies is deemed plausible; see also the extensive discussion by Spiegelhalter et al. (2004). 15 Sec. 5.7 A common type of effect are log-transformed odds (or logits). 91,92 For example, in epidemiology or at the design stage of a clinical trial, it may be of interest to infer the magnitude and variability of the prevalence of a certain condition, or historical information may be utilized to support the control group in a clinical trial. 72 The prevalence may be expressed in terms of the probability p ∈ [0, 1] or the odds p/(1−p) ∈ [0, ∞], while for meta-analysis purposes it then makes sense to move to the log-odds scale, log(p/(1−p)) ∈ ℝ. Rather than viewing this as a case of a logarithmic transformation of the odds, one might as well consider this as a logit transformation of probabilities, mapping the interval [0, 1] to the real line via the logit function logit(p) = log(p/(1−p)). Besides considerations of what ratios the odds may plausibly be spanning, here it may be helpful to consider a uniform distribution in proportions as an extreme case; for the log-odds, this implies a logistic distribution that has a standard deviation of π/√3 = 1.81. The UISD in this case amounts to (at least) σ₁ = 2 (see Appendix A.2).
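The π/√3 figure is easy to verify numerically; a short Python sketch (illustration only, not part of the original analysis), evaluating the log-odds over a fine grid of proportions:

```python
from math import log, pi, sqrt

# If a proportion p is uniform on (0, 1), its log-odds log(p/(1-p)) follow
# a standard logistic distribution, whose standard deviation is pi/sqrt(3).
# Quick numerical check via a fine midpoint grid of proportions:
n = 100_000
logodds = [log(p / (1 - p)) for p in ((i + 0.5) / n for i in range(n))]
sd = sqrt(sum(z ** 2 for z in logodds) / n)  # mean is zero by symmetry
print(round(pi / sqrt(3), 2))        # 1.81
print(abs(sd - pi / sqrt(3)) < 0.01)  # True
```
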
Similarly, event rates (based on a Poisson model) are commonly combined in meta-analyses after a log-transformation.
Similarly to the cases of means and mean differences discussed earlier, a log-transform is also commonly applied in the context of two-group comparisons, for example, for log-OR, log-IRR, log-RR or log-HR effect measures. Logarithmic ORs are a natural extension of the log-odds case above, since the logarithmic ratio of odds is simply a difference of log-odds; other pairwise group comparisons generalize similarly from single-group estimates. UISDs for log-ORs and log-RRs are derived in Röver (2020), 20 and for log-IRRs in Appendix A.3; the corresponding figures for log-HRs are discussed by Spiegelhalter et al. (2004). 15 Sec. 2.4.2 When discussing UISDs for count outcomes, it is important to clearly indicate whether these relate to subjects or events (e.g., for ORs the numbers are 4 per subject 20 and 2 per event 15 ).
Empirical evidence on the magnitude of heterogeneities within meta-analyses published in the Cochrane Database of Systematic Reviews is given by Turner et al. (2015). 75,79 For example, for a log-OR effect in a general healthcare setting (without restricting to a specific type of outcome), a log-normal distribution with μ = −1.28 and σ = 0.87 was derived, implying a median and 95% quantile of 0.28 and 1.16, respectively (see also Table 3). Similarly, Günhan et al. (2020) 85 in a re-analysis of data from the Cochrane Database of Systematic Reviews determined a 95% quantile of heterogeneity estimates of 1.05 for analyses based on binary data and log-ORs. Consider for example the common case of a meta-analysis of log-OR estimates. If we want to restrict prior probabilities mostly to "reasonable" to "fairly high" heterogeneity levels (according to Table 2 in Section 3.4.3), one could use a half-normal prior with scale 0.5, implying P(τ > 1.0) = 4.6% and assigning 52% and 27% probability to the "reasonable" and "fairly high" categories, respectively (see also Tables 2 and 3).
In Figure 3, the priors are illustrated; at the bottom, the probabilities for the heterogeneity categories are shown. The probabilities assigned by the half-normal(0.5) prior and the "empirical" prior are roughly in agreement, while the half-normal(1.0) prior would assign more or less equal probabilities to the "reasonable", "fairly high" and "fairly extreme" categories, and leave only 8% probability for smaller values. Similar arguments also hold for other log-transformed effect scales.
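The tail probabilities used in this kind of reasoning only require the half-normal distribution function; a minimal Python sketch (the function name is ours):

```python
from math import erf, sqrt

def halfnormal_cdf(x, scale):
    """P(tau <= x) under a half-normal prior with the given scale
    (hypothetical helper)."""
    return erf(x / (scale * sqrt(2)))

# Tail probabilities beyond tau = 1.0 for the two priors discussed above:
print(round(1 - halfnormal_cdf(1.0, 0.5), 3))  # 0.046
print(round(1 - halfnormal_cdf(1.0, 1.0), 3))  # 0.317
```
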

Regression slopes
Very closely related to mean differences is the more general case of meta-analysis of regression parameters (slopes or interactions) and their standard errors. 93 In the special case of a single binary covariate, the regression effectively reduces to a two-group comparison, and consideration of additional covariates then may allow for some "adjustment". When the covariate is continuous, however, extra care needs to be taken, since not only the endpoint's scaling is relevant (the regression's "y" variable), but also the regressor's scaling (the regression's "x" variable). Whether the regressor is expressed in, say, days or weeks affects the resulting slope parameter (and its standard error) by a corresponding re-scaling by a factor of seven. The regressor's scaling will then similarly also affect the scale of the anticipated heterogeneity: when combining estimated (linear) regression coefficients, which are to be interpreted as "the expected change in y for a one-unit change in x", the heterogeneity between estimates depends on the units of x. For example, the variability expected among temporal changes that are expressed on a per-week scale rather than a per-day scale should be seven times as large.
The immediate question then is what increment in the regressor to base heterogeneity considerations on; what is eventually needed is a statement of the form "for a change in the regressor by a difference of Δx, the associated effects are anticipated to vary by a magnitude of τ", and that difference Δx needs to be specified. Sometimes there may be obvious "natural" units to be used, for example in the common case of a binary (zero/one) coded covariate (e.g. for treatment vs. control or males vs. females); the obvious difference to consider here is an increment of Δx = 1. Otherwise the width of the regressor's distribution may be relevant. 94 Consider again the case of a binary covariate and a balanced setup; the standard deviation of the binary variable will then be 1/2, so that twice the standard deviation might generally be a sensible scale to consider. Note though that this is by no means universally applicable, as such scales may be affected by many factors (e.g., inclusion criteria in clinical trials) and might also be very different between studies. Note that the Δx value needs to be the same across the considered studies.
Once the "reference" increment Δx has been determined, a prior for the associated heterogeneity may be formulated. In case the actual analysis then is done with respect to a differing scaling, the prior needs to be re-scaled accordingly. For example, if a prior with scale s was determined for a per-week increment, but the actual analysis is based on the per-day regression coefficients, then their prior should have scale s/7. The UISD σ₁ then also scales proportionally.
Note that the above arguments extend beyond simple linear regressions with continuous outcomes, to, for example, logistic regressions, Poisson regressions or survival analyses, in which regression parameters then relate to log-ORs, log-IRRs or log-HRs. Once a reference increment Δx has been determined, the arguments regarding log-transformed endpoints discussed earlier in Section 4.3 apply, and potential re-scaling issues still need to be considered. A way to circumvent considerations of the regressor's or response's scales may be to move to standardized regression coefficients instead, which are unitless and somewhat similar to SMDs (see also Section 4.2) or correlations (see Section 4.5). 88 Depending on the exact type of regression analysis and the standardization technique (e.g., in case of a logistic regression, and when standardization is done based only on the regressor's scale), 95,96,97 arguments relevant for log-transformed endpoints might also apply.

Correlation coefficients
Estimated correlation coefficients (Pearson's r) are commonly quoted and summarized for studies dealing with paired observations. 3,4,98 Correlation coefficients are restricted to the domain [−1, 1], with values of |r| = 1 indicating perfectly linear (positive or negative) correlation, and r = 0 indicating uncorrelatedness. 99 Due to the problems with bounded parameter spaces, correlation coefficients are commonly analyzed after an appropriate transformation using Fisher's z-transform, which is defined as z = artanh(r) = ½ log((1+r)/(1−r)). Correlation values within the range −0.5 < r < 0.5 are little affected by the transformation, which makes more of a difference for more extreme values.
An upper limit to the expected heterogeneity may be specified by considering a uniform distribution of values across the range of correlation coefficients as a "worst case". For plain (correlation r) values, this would imply a variance of 1/3 = 0.58². On the scale of z-transformed values, this implies a distribution with probability density function f(z) = 2/(exp(−z) + exp(z))², which has zero mean and a variance of π²/12 ≈ 0.91² (these moments might actually motivate a prior for the overall effect μ, too). The standard error of z values after transformation (see above) implies a UISD of approximately σ₁ = 1.0. With that, it should usually be safe to expect heterogeneity values well below τ = 1.0.
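Both variance figures (π²/12 here, and the 0.30² quoted below for more moderate uniform ranges) can be reproduced numerically; a Python sketch (function name and grid settings are ours, illustration only):

```python
from math import atanh, pi

def var_fisher_z(lo, hi, n=100_000):
    """Variance of Fisher-z-transformed correlations under a uniform
    distribution on (lo, hi), approximated by midpoint-rule numerical
    integration (hypothetical helper, illustration only)."""
    z = [atanh(lo + (hi - lo) * (i + 0.5) / n) for i in range(n)]
    m = sum(z) / n
    return sum((x - m) ** 2 for x in z) / n

# Uniform over the full range (-1, 1): variance pi^2/12, i.e. about 0.91^2.
print(abs(var_fisher_z(-1, 1) - pi ** 2 / 12) < 0.05)     # True
# More moderate Uniform(-0.5, 0.5): standard deviation about 0.30.
print(abs(var_fisher_z(-0.5, 0.5) ** 0.5 - 0.30) < 0.01)  # True
```
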
If τ values near unity (or 0.91) already imply rather extreme heterogeneity, the question remains what constitutes "large", yet reasonable heterogeneity. For that, we may consider the somewhat more moderate cases of r ∼ Uniform(−0.5, 0.5) or r ∼ Uniform(0.0, 0.8). Both these cases happen to lead to similar variances of Var(z) = 0.30² on the transformed scale, so that τ = 0.30 may already be considered "large" heterogeneity. While the use of "plain", un-transformed correlation values within the NNHM framework is a bit problematic due to the bounded parameter space that is not reflected in the model, it is not uncommon. We have already seen some hints of what amounts of between-study variance for plain correlations may be possible or plausible in the considerations above.

Grande et al. (2015) 100 Analysis 1.5 investigated the effect of physical exercise (vs. no exercise as control) on the duration of acute respiratory infections (ARIs). Four studies were jointly considered in a meta-analysis; the endpoint of interest was the mean difference in the number of symptom days per episode. The relevant data are shown in Table 7.

Mean differences
The outcome here is measured in units of days (change in symptom duration for treated patients relative to the control group). For the purpose of the present analysis, ARIs were defined as "infections of the respiratory tract that last for less than 30 days", 100 while ARI durations generally are substantially shorter, lasting of the order of a week. 101,102 With that, the reduction in symptom days cannot be more than (roughly) a week. ARIs may be caused by bacterial or viral pathogens; the effect of antibiotic treatment is a shortening of the order of one day. 103 From the data (Table 7), we can derive estimates of the UISD, which here is at an average of σ̂₁ = 3.9.
The treatment effect may be expected to be of the order of days (anything below 1 day would probably not be considered clinically meaningful), and a similar magnitude may be expected for the heterogeneity. Values τ > 1 would make the between-study heterogeneity larger than the effect of antibiotics, which seems implausible. Variations in treatment effects of the order of several days would probably imply that the effect was several times larger in some studies than in others.
A value of τ = 1.0 would imply a median difference in true effects of ≈ 1 day for a random pair of studies (see Table 1), which might be at the upper end of the plausible range. A half-normal(0.5) prior would imply P(τ ≤ 1) ≈ 95%, and considering the corresponding prior predictive distribution (see Table 3), we can see that this implies a 95% prior predictive interval of roughly ±1 day around the overall mean effect.
For the present example, we would hence suggest a half-normal(0.5) prior. Note that this is a common, well-researched condition. For more uncertain cases, one might want to go for a heavier-tailed prior. A meta-analysis based on the half-normal(0.5) prior is illustrated in Figure 4. Among the four studies considered, one suggests a stronger effect than the others; however, due to its relatively small size and correspondingly large associated standard error, it is still consistent with the remaining three. The estimated heterogeneity (the median and 95% credible interval (CI) are shown in the bottom left of the forest plot) here has barely changed from the a priori anticipated amount (see Table 3). The heterogeneity's posterior is also illustrated in Figure 8; prior and posterior are very similar in this case. The resulting combined estimate then also suggests a more moderate effect, namely, a reduction of the order of one symptom day, with an uncertainty of about a factor of two. The estimated heterogeneity is relatively low compared to the width of the overall mean's CI, and so the prediction interval is only slightly longer, and the shrinkage intervals show substantially greater precision than the original estimates. Sensitivity to other prior choices is also investigated for this example in Appendix D.4.

Aalbers et al. (2017) 104 Analysis 1.1 investigated the short-term effect of music therapy on depression symptoms; four studies comparing music therapy plus treatment-as-usual (TAU) versus TAU alone were found. Within these four studies, differing clinician-rated symptom scores were utilized in order to quantify depression severity: the Hamilton rating scale for depression (HAM-D), considering potentially differing numbers of items between studies, as well as the Montgomery-Åsberg depression rating scale (MADRS). In order to facilitate a joint analysis, the meta-analysis was based on SMDs (here: Hedges' g); the relevant data are shown in Table 8.
The outcome being measured on the SMD scale means that a unit change in θ corresponds to a one-standard-deviation change in the symptom severity score. Considering e.g. the Albornoz (1992) study, 105 which was measuring change in symptom severity using the 17-item HAM-D scale with a within-group standard deviation of about 5 (see Table 8), a difference of 1 on the SMD scale here would roughly correspond to a 5-point change in HAM-D score. 106,107,108,109 In terms of SMD, this would already be considered a "large" effect. 89,90 The UISD for SMDs is predicted at σ₁ = 2, while from the present data we get a very similar empirical average of σ̂₁ = 2.2.

Standardized mean differences
For the between-study differences, we would assume that they would be mostly in the "small" to "medium" range (τ ≪ 1); otherwise effects would be differing by a standard deviation or more between studies. A value of τ = 1.0 would imply a median difference of ≈ 0.95 ("large") for a random pair of true study means (see Table 1), which already appears like a rather extreme amount; values of τ = 0.5 (implying mostly "medium" sized between-study differences) or below seem to be more plausible. A half-normal(0.5) prior would cover this range and would imply a prior median (for τ) slightly above the magnitude suggested by the empirical investigations (see also Table 3).
For the present example, we would then suggest a half-normal(0.5) prior as a slightly conservative choice, in order to reflect the potential heavy-tailedness suggested by Rhodes et al. (2015), 74 and to account for the fact that the empirical data might be of limited relevance for the present example data. A meta-analysis based on the half-normal(0.5) prior is illustrated in Figure 4. Among the four studies, three consistently indicate estimates in the range 0.5-0.8, while the first one shows a huge effect estimate of the order of 2.0; a positive amount of heterogeneity appears to be present (the CI for τ is in a strictly positive range; see also Figure 8), and the eventual combined estimate indicates a "small" to "very large" average effect. Given the pronounced heterogeneity, one might discuss whether the estimation of a pooled effect is meaningful. Nevertheless, we use this example to illustrate the use of Bayesian methods in heterogeneous situations, where heterogeneity cannot be explained and good reasons are available to perform a quantitative meta-analysis despite the large heterogeneity. The large estimated heterogeneity here results in a wide CI for the overall effect, a very wide prediction interval, and also very little shrinkage for the estimated study-specific effects θᵢ.

Log odds ratio
A systematic review was performed by Crins et al. (2014) 110 to investigate the effect of Interleukin-2 receptor antagonists (IL-2RA) on recovery of pediatric patients following liver transplantation. One aspect of interest was the occurrence of acute rejection (AR) reactions as a common adverse event. Two randomized controlled trials reporting such data were found; the event counts along with the corresponding (logarithmic) odds ratios and standard errors are shown in Table 9. Both studies indicated a reduction in the chances of an AR event for the treatment group.
The treatment effect is expressed and analyzed on a logarithmic scale here. A heterogeneity magnitude of τ = 1.0 would imply that any random pair of studies would be expected to exhibit effects differing by a factor of 2.6 (see Table 1), which seems quite extreme already; values like τ = 0.5 or below seem more plausible. A half-normal(0.5) prior would mostly cover values τ < 1.0 (up to "fairly high" heterogeneity according to Table 2) with an expectation and median below 0.5 (see also Table 3). The resulting 95% prior predictive interval would still include effects within a factor of 3 around the overall mean log-OR μ. For the present investigation, we would then suggest a half-normal(0.5) prior as a reasonably conservative choice, which also agrees roughly with the empirical evidence (see Fig. 3). A meta-analysis based on this prior is shown in Figure 5. In this example we have two studies only, demonstrating the somewhat speculative nature of inferring heterogeneity based on sparse data, and highlighting the value of considering a-priori probabilities. In the present case, the two studies involved are not very large, and their resulting CIs are overlapping, which makes the data consistent with a wide range of heterogeneity values, from homogeneity (τ = 0) up to magnitudes of τ = 10 or τ = 20. Including the weakly informative heterogeneity prior, and effectively down-weighting unreasonably large heterogeneity values, then leads to an estimate of −1.81 for the log-OR, corresponding to a reduction in the odds of an AR event down to exp(−1.81) = 16%.

TABLE 9 Log-OR example data. 110 The event counts and total numbers of patients in treatment and control groups together summarize each trial outcome in terms of a 2 × 2 table. The yᵢ are the derived logarithmic odds ratios and the σᵢ the associated standard errors that eventually go into the analysis (see Section 2.1). Negative values here indicate a reduction of the event odds, i.e., a beneficial treatment effect.

While the uncertainty still is large (ranging roughly from 5% up to 50%), the analysis clearly indicates a substantial reduction in AR events here. The heterogeneity's posterior density is also shown in Figure 8; here we can see that for the present example constellation, the posterior is very similar to the prior. With the very uncertain original estimates (due to the small sample sizes), the overall mean's CI is wide, but the additional width of the prediction interval is limited due to the (prior and empirical) information on the heterogeneity, and a noticeable shrinkage effect is also observable.

Log incidence rate ratio
Four studies investigating the effect of ferric carboxymaltose vs. placebo in heart-failure patients with iron deficiency were jointly analyzed by Anker et al. (2018). 112 The main outcome was the incidence rate ratio (IRR) with respect to the composite endpoint of recurrent cardiovascular (CV) hospitalisations or CV death. The relevant available data are shown in Table 10. The eventual analysis is based on the logarithmic ratio of the event rates (per 100 patient-years of follow-up) of treatment over placebo group. As in the previous example, the outcome is analyzed on the logarithmic scale, so that many arguments apply essentially analogously here. Regarding empirical evidence on previously encountered amounts of heterogeneity, there are no studies available that would be directly applicable for log-IRRs; however, odds ratios and rate ratios have quite some similarity, so that these findings also have some bearing here. The UISD here is at σ₁ = 2 per event (see Appendix A.3); with a total of 114 events observed among a total of 839 patients 112 Tab. 4 (a rate of ≈ 0.14 events per patient), this would correspond to σ₁ ≈ 2/√0.14 = 5.4 per patient. For the present data, we empirically get an average of σ̂₁ = 6.6.
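The per-event to per-patient conversion is a one-liner; a Python sketch (the helper name is ours) reproducing the example's figures:

```python
from math import sqrt

def uisd_per_patient(uisd_per_event, events, patients):
    """Convert a per-event UISD into a per-patient UISD via the average
    event rate (hypothetical helper)."""
    return uisd_per_event / sqrt(events / patients)

# Figures from the example: per-event UISD of 2, with 114 events
# among 839 patients (a rate of about 0.14 events per patient):
print(round(uisd_per_patient(2.0, 114, 839), 1))  # 5.4
```
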
For this example, we would again suggest a half-normal(0.5) prior. A meta-analysis based on this prior is shown in Figure 5. While the data look homogeneous (all intervals have some overlap, also because some studies are very small and intervals are correspondingly wide), we would still anticipate the possibility of heterogeneity, since from experience we know that heterogeneity is frequently present, and because we know that heterogeneous circumstances are still likely to produce data that may "look homogeneous". 7 Compared to our a-priori expectations of τ values up to 0.98 (see Table 3), the posterior then suggests a slightly lower heterogeneity range of up to 0.75, but the data do not provide very much evidence in this regard (see also the posterior in Figure 8).

TABLE 10 Log-IRR example data. 112 The incidence rate ratios for the composite endpoint of recurrent cardiovascular (CV) hospitalisations and CV mortality are given for each study. For the analysis, the logarithmic rate ratio is considered. Negative values here indicate a reduction of incidence rates, i.e., a beneficial treatment effect.

The mean treatment effect eventually is at a log-IRR of −0.49, corresponding to an IRR of 61% (i.e., a reduction in the event rate), with a CI ranging from 33% up to 116%. For these somewhat homogeneous estimates, one can see that the ones with very large associated standard errors eventually have shrinkage estimates close to the overall prediction interval. A sensitivity analysis investigating alternative prior choices for this example is also shown in Appendix D.4.

Log odds
Neuenschwander et al. investigated the use of historical data in order to inform the analysis of a new data set. 77 A meta-analysis of several trials in ulcerative colitis was performed in order to support the analysis of a subsequent phase II trial. The figure of interest here was the probability of clinical remission at week 8 in placebo-treated patients, and the main interest was in a prediction for the new study's event probability, to then formally integrate this in a subsequent analysis using a meta-analytic-predictive (MAP) approach. 72 Four previous randomized controlled trials reporting this endpoint were available; their data are shown in Table 11. Instead of working directly on the estimated probabilities p, the analysis here is done based on the odds p/(1−p) and a subsequent log-transformation. 92 Homogeneity of placebo rates is not expected; differences between control rates are among the main reasons for requiring a control arm for each RCT, and for pursuing a contrast-based analysis. 113,114 The studies were designed aiming for an estimate of the treatment effect, and the placebo rate originally was mostly a nuisance parameter here. However, some amount of similarity still is anticipated, and the aim of this exercise is to carefully derive the predictive distribution, which of course depends on the amount of heterogeneity τ.
The earliest of the four studies was planned anticipating a remission rate of 10% for the placebo group, 115 and hence a UISD of σ₁ ≈ √(1/0.1 + 1/0.9) = 3.33 may be expected. Empirically, we get an estimate of σ₁ = 3.2 from the present data set. As the endpoint is a logarithmic odds, we may again apply similar reasoning as in the previous subsections regarding the anticipated ratios of odds. However, a major difference here is that while clinical trials are usually carefully designed to provide reliable estimates of treatment effects (treatment/control contrasts), this is not necessarily the case for the event rates that we are considering here; we may expect the log-odds to be more variable than the log-ORs. With this in mind, and considering conservatism and robustness particularly desirable in the present context, we would suggest a half-normal(1.0) prior here. From Table 3, we can see that the implied 95% prior predictive interval then spans a range of roughly a factor 9 around the median. Given the context, it may be of particular interest to consider the associated prior maximum sample size n⋆∞ (see Section 3.4.5); for the prior median of τ = 0.67, we have τ/σ₁ = 0.67/3.2 = 0.21, corresponding to a maximum size of n⋆∞ = 23 (compared to an original total of 363 subjects included in the analysis). The prior's 95% quantile is (approximately) at τ = 2, and larger values would effectively imply (with n⋆∞ < 3) an almost noninformative posterior predictive distribution. The eventual analysis is illustrated in Figure 6. Looking at the heterogeneity's posterior (Figure 8), one can see that heterogeneity here appeared to be less than anticipated. The prediction interval is relatively wide, and on the back-transformed scale it is centered at a probability of 0.11, with its 95% posterior predictive interval ranging from 0.03 to 0.34.
The posterior predictive distribution's standard error is 0.70; relative to the UISD, this roughly corresponds to an effective sample size of 3.2²/0.70² ≈ 21 subjects.
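The numbers quoted above follow from simple arithmetic; the following illustrative Python sketch recomputes them, assuming (as the figures in the text suggest) that the prior maximum sample size is n⋆∞ = (σ₁/τ)²:

```python
# Illustrative recomputation of the log-odds UISD and the prior maximum
# sample size n*_inf = (sigma_1 / tau)^2 (cf. Section 3.4.5); the values
# 3.2, 0.67 and 2.0 are those quoted in the example above.
import math

p = 0.10                                 # anticipated placebo remission rate
uisd = math.sqrt(1/p + 1/(1 - p))        # UISD for a log-odds endpoint
print(round(uisd, 2))                    # 3.33

sigma1 = 3.2                             # empirical UISD for this data set
n_star = {tau: (sigma1 / tau)**2 for tau in (0.67, 2.0)}
print({t: round(v, 1) for t, v in n_star.items()})   # {0.67: 22.8, 2.0: 2.6}
```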

TABLE 11
Log-odds example data due to Neuenschwander et al. (2010). 77 The nᵢ and xᵢ here denote the total numbers of patients and the numbers of remitting patients among these. Analysis is done based on the derived log-odds and their standard errors.

Bergau et al. (2017) 116 investigated predictors of all-cause mortality among patients with an implantable cardioverter-defibrillator (ICD) device. Several potential covariables were considered, among these the left ventricular ejection fraction (LVEF), which is a measure of the efficiency of heart function that is usually determined via echocardiography. LVEF is commonly expressed in percent, where 52%-72% are normally observed in healthy individuals, while values below 30% are considered abnormal. 117 Criteria for an indicated ICD therapy include various conditions, including thresholds on the LVEF in the range 30-40%. 118 Five studies were found that had reported on survival analyses including LVEF as a predictor, and a meta-analysis was performed based on the coefficients standardized to a 5 percentage point decrease in LVEF; the data are shown in Table 12. The different studies also included different sets of additional covariates in their analyses. 116 The regressor, LVEF, here is expressed in percentages (between 0 and 100), but it might just as well have been expressed as a fraction (between 0 and 1), while for the analysis a unit of a 5 percentage point decrease was used; this highlights the importance of clarifying the scale of the increment Δ that heterogeneity considerations are to be based on. Table 12 also shows the distributions of LVEF within studies; these are roughly similar and have standard deviations of the order of 10 percentage points. For the "reference" increment Δ for judging plausible heterogeneity magnitudes, we will then consider a difference of 20 percentage points, which roughly spans the bulk of LVEF values encountered in each of the studies.
This also coincides with the range of values considered "normal" (52%-72%), or the difference between the "normal" and "abnormal" ranges (≥ 52% vs. < 30%).
Since the regression coefficient is to be interpreted as a logarithmic HR, we will assume a half-normal(0.5) prior for the effect corresponding to a Δ = 20 percentage point increment (analogously to the arguments made in Sections 5.3.1 and 5.3.2). For the 5 percentage point decreases considered in the analyses, this then implies a four-fold smaller heterogeneity, i.e., a half-normal(0.125) prior. Analysis results for a half-normal(0.125) prior are illustrated in Figure 6. The estimates are very homogeneous, which is evident from the forest plot as well as from the estimated heterogeneity (see also Figure 8). The overall log-HR estimate is at 0.19, corresponding to a 1.21-fold increased mortality hazard for a 5 percentage point decrease (worsening) in LVEF.

Correlations
Molloy et al. (2014) 119 investigated the relationship between conscientiousness and medication adherence. A total of 16 relevant studies reporting correlation coefficients of the two factors were found, which were also graded according to their methodological quality. Three of the studies were rated with the highest quality score; their data are shown in Table 13. The data are also available as part of the metafor R package. 91 In order to avoid problems due to the bounded parameter space of correlations (between −1 and +1), we will use the Fisher-transformed values z instead. Note that, since in the present example the reported correlations (r) are relatively close to zero, the corresponding Fisher-z values are almost identical here (see Table 13; r and z values only differ in their third decimal place), and the transformation eventually makes little difference. As elaborated in Section 4.5, we expect smaller magnitudes of heterogeneity for correlation endpoints (say, mostly τ ≤ 0.3); the UISD is at σ₁ = 1.0, which also matches the figures we see empirically in the present data set (σ₁ = 1.004). Van Erp et al. 82 report a median and 95% quantile of 0.12 and 0.29, respectively, for empirically observed heterogeneity estimates from published studies. Meta-analysing the remaining set of 13 studies from the present data set 119 (using a uniform prior), in order to quantify the evidence "external" to the example data, yields a heterogeneity estimate of 0.07 with 95% CI [0.00, 0.17].
Heterogeneity values of τ = 0.1 or τ = 0.2 would imply differences between a random pair of studies of a similar order of magnitude (see Table 1). A half-normal(0.2) prior for the heterogeneity would cover values mostly in the range below 0.4, with a prior median at τ = 0.13 (see Table 3).
For the present analysis, we would then suggest a half-normal(0.2) prior for the heterogeneity. A meta-analysis of the example data based on this prior is illustrated in Figure 7. The two traits were originally measured using differing scales, so that complete homogeneity might be considered especially unlikely. The heterogeneity's resulting posterior median is at τ = 0.12 (with the 95% CI ranging up to 0.30); its posterior distribution is also illustrated in Figure 8. The three studies are of differing size and suggest neutral to slightly positive correlation between conscientiousness and medication adherence. The resulting mean estimate is positive at about 0.08, while the CI ranges from negative to positive (−0.1 to +0.3).
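The half-normal quantiles quoted above follow directly from the normal quantile function; a minimal Python sketch using only the standard library (Φ⁻¹ via statistics.NormalDist):

```python
# Half-normal quantiles: for |X| with X ~ N(0, scale^2), the p-quantile
# equals scale * Phi^{-1}((1 + p) / 2).
from statistics import NormalDist

def half_normal_quantile(p, scale):
    """Quantile function of a half-normal distribution with given scale."""
    return scale * NormalDist().inv_cdf((1 + p) / 2)

scale = 0.2                                           # suggested prior scale
print(round(half_normal_quantile(0.50, scale), 2))    # median: 0.13
print(round(half_normal_quantile(0.95, scale), 2))    # 95% quantile: 0.39
```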

DISCUSSION
While executing a Bayesian meta-analysis is not technically difficult, specifying a widely acceptable prior remains a challenge, especially when it comes to the heterogeneity parameter τ. Although the problem may appear complex at first, it is usually possible to break down the specification into a number of more specific questions that are easier to approach one by one. These steps are summarized in Table 6 and may be outlined as follows: (i) what is the effect's scale? (ii) what is the probable magnitude of other effects? (iii) how large is the unit information standard deviation (UISD)? (iv) is relevant empirical information available?
The information may then be related to more concrete prior specifications by constraining (v) prior quantiles (of τ), (vi) prior predictive quantiles (of the effects), and (vii) other prior properties. We have demonstrated the prior specification in seven applications involving few studies and covering a range of common effect scales and application areas, leading to sensible prior distributions and results in all examples. Besides the case of few studies, another context in which (weakly) informative priors are useful is whenever marginal likelihoods (or Bayes factors) need to be computed. 73 Calculation of marginal likelihoods requires proper prior distributions, and special care must be taken in their selection in order to avoid (seemingly) paradoxical results. 1,120 In many applications, the results will be robust to variations of the prior, which may also be checked in sensitivity analyses. The prior specification will usually not be the most crucial or influential among the line of assumptions being made, which include normality, 5 exchangeability, the selection of estimates to be pooled, or the choice between effect measures. 121 Different prior specifications will of course leave their imprint on the posterior distribution; for example, results based on short- or heavy-tailed priors will reflect the differing assumptions, which may be based on emphasizing regularisation or robustness aspects. There usually is no unique "correct" prior, and "sceptical" or "enthusiastic" results may be derived by implementing corresponding prior assumptions. 42 Even uncertainty in the prior distribution itself (or its scale) may be accommodated by using mixture priors. Consideration of the stochastic ordering of heterogeneity priors may help in assessing more or less conservative settings, which may be useful for the definition of sensitivity analyses.
However, we would also like to warn against inflationary default specification and execution of multiple analyses here, as the resulting alternative estimates may lead to unnecessary ambiguity or inconsistent (flip-flopping) conclusions. In Appendix D.4, sensitivity analyses are discussed in the context of the two examples from Sections 5.1 and 5.3.2. Pre-specification of analyses (and their intended consequences) may help here. In case there is genuine a-priori uncertainty about the heterogeneity's magnitude, this might better be reflected in a single prior (e.g., in terms of a mixture distribution). Either way, one needs to be prepared and willing to base the eventual analysis results on the posterior also when the data have little information on heterogeneity to add to the weakly informative prior, as was the case for some of the examples discussed here (see Figure 8). If it is not possible to specify a suitable (weakly) informative prior for the expected heterogeneity, then one might have to resort to a more conservative approach using uninformative priors.
Another assumption crucial to the validity of inference is exchangeability (see Section 2.1). This might be compromised by selection effects, for example, publication bias 122 or reporting bias. 123 Especially in the case of only few studies, such effects might be hard to detect from the data, and information on the presence of selection effects may need to come from considerations of the context.
Choice of heterogeneity priors has consequences for estimation of the overall mean parameter, but in particular also in prediction and shrinkage applications, as the inferred heterogeneity directly impacts the amount of borrowing-of-strength; 20,33,72 smaller heterogeneity will lead to stronger pooling of estimates, while larger heterogeneity will imply that individual estimates are only loosely connected through the model.
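A minimal numerical sketch of this borrowing-of-strength mechanism (with made-up standard errors, and conditional on a fixed τ): within the NNHM, the overall mean's conditional estimate weights each study proportionally to 1/(sᵢ² + τ²), so that larger heterogeneity equalizes the weights.

```python
# Hypothetical within-study standard errors (for illustration only);
# normalized inverse-variance weights 1/(s_i^2 + tau^2) for several tau.
std_errors = [0.2, 0.4, 0.8]
weights = {}
for tau in (0.0, 0.25, 1.0):             # no, moderate, large heterogeneity
    w = [1 / (s**2 + tau**2) for s in std_errors]
    weights[tau] = [round(wi / sum(w), 2) for wi in w]
    print(tau, weights[tau])             # weights equalize as tau grows
```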
Especially in regulatory settings such as drug approval or health technology assessment (HTA), the definition of a standard prior distribution for the heterogeneity parameter is important to avoid post hoc discussions in case the use of different prior distributions leads to results suggesting conflicting interpretations. The Institute for Quality and Efficiency in Health Care (IQWiG) in Germany is currently looking into determining the empirical distribution of the between-study heterogeneity parameter from all published IQWiG reports, with the goal of motivating a suitable prior distribution for HTA applications.
While in the present manuscript we focused on the NNHM, some of the arguments laid out here are analogously transferable to other models for pairwise meta-analysis, for example, a binomial-normal model. Additional parameters and their priors may need to be specified in regard to baselines (which are often nuisance parameters and assigned vague priors). 85,113,114,124 More complex applications in evidence synthesis such as meta-regression or network meta-analysis would again require similar prior specifications regarding between-study heterogeneity in the effects, but would then entail additional model components, e.g., in order to accommodate individual-patient data (IPD). 125,126,127 Analogous arguments also extend more generally to hierarchical or multilevel models, such as generalized linear mixed models (GLMMs). 2,128 The sensitivity analyses shown in Appendix D.4 suggest that (for a given prior median) the prior distribution's shape has little impact on the results, as compared to the scaling of the prior. As it might simplify prior specification further, it will be interesting to investigate whether or to what extent this feature holds more generally. In summary, the application of Bayesian methods with weakly informative prior distributions for the heterogeneity parameter can be recommended for random-effects meta-analyses, especially in the common case of only few studies. This paper provides guidance on the choice of useful prior distributions for various effect measures and data situations.

HIGHLIGHTS
• What is already known: -A Bayesian approach to meta-analysis may often be useful, in particular in cases of only few studies, and in order to derive predictions and shrinkage estimates.
-Careful specification (and justification) of prior distributions is required, especially for the heterogeneity parameter.
• What is new: -Prior selection may usually be narrowed down considerably using a structured approach.
-A series of questions to guide choice and justification of the prior distribution was devised.
-Unit information standard deviations (UISDs) were derived for some commonly used effect measures.

• Potential impact for Research Synthesis Methods readers outside the authors' field:
-Similar approaches may be useful also in related fields where hierarchical models or generalized linear mixed models (GLMMs) are used.

A.1 Standardized mean differences (SMDs)
Defining an SMD simply as θ = (μ₂ − μ₁)/σ (see Section 4.2), this figure is in practice estimated based on the empirical group averages ȳ₁ and ȳ₂. Neglecting uncertainty in variance estimation and assuming a known common standard deviation σ for both treatment groups then leads to Var(θ̂) = 1/n₁ + 1/n₂. Furthermore assuming equal group sizes (n₁ = n₂ = n/2) then leads to an approximate standard error of 2/√n, and hence a UISD of σ₁ = 2.
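The identity underlying this derivation may be checked numerically (an illustrative Python sketch):

```python
# With equal group sizes n/2 each, sqrt(1/n1 + 1/n2) equals 2/sqrt(n),
# so that n = 1 yields the per-subject UISD sigma_1 = 2.
import math

for n in (1, 10, 100):
    se = math.sqrt(1 / (n / 2) + 1 / (n / 2))   # SMD standard error
    assert abs(se - 2 / math.sqrt(n)) < 1e-12
    print(n, round(se, 3))                      # n = 1 gives 2.0
```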

A.2 Logarithmic odds (logits)
The variance of a logarithmic odds (or logit-proportion) estimate is 1/(np) + 1/(n(1−p)), where n is the sample size and p is the true proportion. The variance (squared standard error) is in practice commonly estimated by 1/x + 1/(n−x), where x is the observed event count. 92 The UISD then is given by σ₁ = √(1/p + 1/(1−p)) ≥ 2. Note the similarity to the standard error of a logarithmic odds ratio, 20 which may be expressed as the difference of two log-odds. For p = 1/2, the resulting UISD σ₁ for a log-OR is twice as large (i.e. the variance σ₁² is four times as large), since (i) the two logits' variances add, while (ii) each of the two logits has twice the variance, since it is only based on "half as many" subjects (per total number of subjects n).
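The factor-four relationship may be verified directly (an illustrative Python check at p = 1/2):

```python
# The log-OR variance (two logits, each based on n/2 subjects) is four
# times the single-logit variance based on all n subjects.
n, p = 100, 0.5
var_logit = 1/(n*p) + 1/(n*(1 - p))                 # one logit, n subjects
var_logOR = 2 * (1/((n/2)*p) + 1/((n/2)*(1 - p)))   # two logits, n/2 each
print(round(var_logOR / var_logit, 6))              # 4.0
```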

A.3 Logarithmic incidence rate ratios (log-IRRs)
An (approximate) standard error for a logarithmic incidence rate ratio is given by √(1/x + 1/y), where x and y are the event counts in treatment and control groups, respectively. 129 Sec. 6.7.1 Assuming a total number of d events and, for simplicity, x = y = d/2, this yields a standard error of 2/√d, implying a per-event UISD of 2. For a given event rate r (per subject), the per-subject standard deviation then is at σ₁ = 2/√r.

B PRIOR DISTRIBUTION FAMILIES
Table B1 characterizes some of the probability distribution families that are discussed in Section 3 in more detail (see also Figure 1 and Table 4). The distribution families considered are half-normal, half-Student-t, half-Cauchy, half-logistic, exponential, Lomax, log-normal and (proper) uniform. 130,75 The distributions' parameters, probability density functions, medians, 95% quantiles, means, variances, and coefficients of variation are listed. In particular for families including several parameters, some of the expressions may get somewhat complex (e.g., the moments of a general half-Student-t distribution, which are omitted in the table). 131 However, if only a scale parameter is present, then quantiles, expectation and standard deviation are simply proportional to the scale, and the coefficient of variation is a constant. Examples are the half-normal distribution, or half-Student-t or Lomax distributions with fixed shape parameters.
Note that for the exponential distribution, which is most commonly parameterized using a rate (or inverse scale) parameter, the inverse of the rate is a scale parameter. Similarly, for the log-normal distribution, exp(μ) would be a scale parameter, and the corresponding expressions then factor as multiples of exp(μ). Some of the expressions given below are not always defined; e.g., expectation and variance of the half-t distribution are only defined for ν > 1 and ν > 2, respectively, 131 and the first two moments of the Lomax distribution are only finite for α > 1 and α > 2, respectively. 130
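The scale-family property noted above may be illustrated for the half-normal distribution, whose mean is σ√(2/π) and whose standard deviation is σ√(1 − 2/π), so that the coefficient of variation √(π/2 − 1) ≈ 0.756 is the same for every scale (an illustrative Python check):

```python
# Half-normal mean and SD are proportional to the scale, so the
# coefficient of variation is constant across scales.
import math

for scale in (0.25, 0.5, 1.0):
    mean = scale * math.sqrt(2 / math.pi)
    sd = scale * math.sqrt(1 - 2 / math.pi)
    print(scale, round(sd / mean, 3))   # same value (0.756) for every scale
```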

TABLE B1
Some properties of potential prior distribution families that were discussed in Section 3. An asterisk ( * ) means that the corresponding expression is somewhat complex and hence omitted here, and a dash (-) means the figure is not defined. v denotes the coefficient of variation (the ratio of standard deviation over expectation).

C.1 Motivating Lomax and Student-t distributions as scale mixtures
Heavy-tailed priors may be constructed as scale mixtures of shorter-tailed distributions. For example, a distribution p(τ|σ) that has a scale parameter σ > 0 may be generalized by specifying a mixing distribution p(σ) and subsequently marginalizing over it, yielding the mixture p(τ) = ∫₀^∞ p(τ|σ) p(σ) dσ. 49,132 In order to make the connection to the original (conditional) distribution p(τ|σ), it is instructive to consider the mixing distribution's location and spread, e.g. in terms of its expectation μ = E[σ] and coefficient of variation v. For small v, the mixture will closely resemble the original distribution p(τ|σ); for larger v, it will be heavier-tailed. Note that, since the scale parameter's domain is the positive real line, an increasing coefficient of variation also implies an increasingly skewed mixing distribution. In the following, we show how Lomax and Student-t distributions result as scale mixtures of exponential and normal distributions, respectively, and how these may be parameterised in terms of pre-specified expectation and coefficient of variation of their scale parameters. Specification of a prior in terms of a scale mixture may be seen as a case of a "contaminated" prior, also considering variations of a prior that are "close to an elicited one". 68,69 Sec. 3.5.3

C.2 The Lomax distribution as an exponential scale mixture
The exponential distribution may be parameterized in terms of its rate (inverse scale) λ, or its scale σ = 1/λ, where the expected value is σ. Assuming an inverse-gamma(a, b) distribution for the scale (implying a gamma distribution for the rate, with shape a and scale 1/b), a mixture of exponential distributions with inverse-gamma-distributed scale (or gamma-distributed rate) then results as a Lomax distribution parameterized by shape α = a and scale β = b, with expectation β/(α−1) and variance β²α / ((α−1)²(α−2)). By pre-specifying the exponential scale's expectation and uncertainty (in terms of the coefficient of variation), we can then derive the corresponding Lomax distribution. For example, if we are aiming for an exponential scale mixture in which the scale has expectation μ = 0.5 and coefficient of variation v = 0.5, this implies Lomax parameters of shape α = 2 + 1/v² = 2 + 1/0.5² = 6 and scale β = μ(1 + 1/v²) = 0.5 (1 + 1/0.5²) = 2.5.
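This mixture construction may be checked by simple Monte Carlo simulation (an illustrative Python sketch; sampling a gamma-distributed rate and then an exponential heterogeneity value should reproduce the Lomax mean β/(α−1) = 0.5):

```python
# Monte-Carlo sanity check of the exponential / inverse-gamma mixture:
# with scale expectation mu = 0.5 and coefficient of variation v = 0.5,
# the marginal is Lomax with shape 6 and scale 2.5, hence mean 0.5.
import random

random.seed(1)
mu, v = 0.5, 0.5
shape = 2 + 1 / v**2                 # inverse-gamma shape a = 6
scale = mu * (1 + 1 / v**2)          # inverse-gamma scale b = 2.5

n = 200_000
draws = []
for _ in range(n):
    rate = random.gammavariate(shape, 1 / scale)   # rate ~ Gamma(a, 1/b)
    draws.append(random.expovariate(rate))         # tau | rate ~ Exp(rate)

print(round(sum(draws) / n, 2))      # close to 0.50 = beta / (alpha - 1)
```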

C.3 The (half-) Student-t distribution as a normal scale mixture
The Student-t distribution (with ν degrees of freedom) is classically defined as the distribution of a variable T = Z / √(X/ν), where Z follows a standard normal distribution, and X is independent and follows a χ² distribution with ν degrees of freedom. The Student-t family includes the Cauchy distribution as a special case (for ν = 1) and the normal distribution as a limiting case (for ν → ∞). Alternatively, the distribution of T may be expressed via T|σ ∼ N(0, σ²), where the distribution of the normal scale σ is a scaled inverse χ distribution with ν degrees of freedom and scale s = √ν. The latter formulation then makes the scale mixture connection more obvious. The arguments in the following then equally apply for Student-t and half-Student-t distributions. The inverse χ distribution is simply defined as the distribution of the inverse of the square root of a χ²-distributed deviate; it is a special case of a square-root inverted-gamma distribution 133 (with shape ν/2 and scale 1/2). The scaled inverse χ distribution then results by introducing an additional scale parameter s. 65 Sec. VII.6.2 Its probability density function is given by p(σ | ν, s) = (2^(1−ν/2) / Γ(ν/2)) s^ν σ^(−ν−1) exp(−s² / (2σ²)). Specification may then proceed by first fixing the degrees of freedom ν and subsequently the scale s. Table C2 lists corresponding coefficients of variation for a selected set of degrees-of-freedom values (according to equation (C2)). Inversion of the relationship may be done numerically; degrees-of-freedom settings corresponding to certain coefficients of variation v are shown in Table C3.
For example, if one was aiming for a normal scale mixture with expectation μ = 0.5 and coefficient of variation v = 0.5, this first of all implies ν = 4.2 degrees of freedom (Table C3). Using a "plain" Student-t distribution now would correspond to a scaled inverse χ mixing distribution of the normal scale with degrees of freedom ν = 4.2 and scale s = √4.2 = 2.05, and, according to equation (C2), with E[σ] = 1.24. In order to set the expectation to the intended μ = 0.5 instead, the (half-) Student-t distribution needs to be scaled by a factor of 0.5/1.24 ≈ 0.40.
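The expectation quoted above may be recomputed from the scaled inverse χ mean formula E[σ] = s Γ((ν−1)/2) / (√2 Γ(ν/2)), which follows from E[X^(−1/2)] for a χ²_ν deviate X (an illustrative Python check):

```python
# Expectation of a scaled inverse-chi distributed normal scale, used for
# rescaling a "plain" Student-t to a pre-specified mixing expectation.
import math

def scaled_inv_chi_mean(nu, s):
    """E[sigma] for sigma ~ s * (chi^2_nu)^(-1/2)."""
    return s * math.gamma((nu - 1) / 2) / (math.sqrt(2) * math.gamma(nu / 2))

nu = 4.2
s = math.sqrt(nu)                              # "plain" Student-t scaling
print(round(scaled_inv_chi_mean(nu, s), 2))    # 1.24
print(round(0.5 / scaled_inv_chi_mean(nu, s), 2))   # rescaling factor 0.40
```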

C.4 Scale mixture examples
Tables C4 and C5 below show a number of Lomax and Student-t distributions that result as scale mixtures for pre-specified mean and variance of the exponential or half-normal scale parameter. Note that, due to linearity, simple re-scaling of the (exponential or half-normal) scale's distribution implies proportional re-scaling of heterogeneity and predictive distribution. For example, the Lomax(α=6, scale 8.17) distribution from Table 4 results from re-scaling of the Lomax(α=6, scale 5) distribution from Table C4 by a factor of 1/0.612, so that the prior median is at 1.0. Note also that by fixing the expectation and increasing the coefficient of variation, one gets an increasingly skewed mixing distribution with a decreasing median.
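The rescaling factor quoted above follows from the Lomax quantile function: the median of a Lomax(α, β) distribution is β (2^(1/α) − 1) (an illustrative Python check):

```python
# Lomax(alpha=6, scale=5) has median 5 * (2**(1/6) - 1) = 0.612; rescaling
# the distribution so that the median is 1.0 yields scale 5/0.612 = 8.17.
alpha = 6
median5 = 5 * (2**(1 / alpha) - 1)
print(round(median5, 3))          # 0.612
print(round(5 / median5, 2))      # rescaled Lomax scale: 8.17
```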

TABLE C4
Lomax prior distributions resulting as scale mixtures of exponential distributions. An inverse-gamma distributed scale (or gamma-distributed rate) parameter for the exponential distribution marginally yields a Lomax distribution. Pre-specifying expectation and coefficient of variation (v) for the scale (shown in bold) implies a unique inverse-gamma and resulting Lomax distribution. The table illustrates distributions of the exponential scale (σ), the heterogeneity (τ | σ ∼ exponential(σ)) and predictions. The first line corresponds to a "plain" exponential distribution with fixed scale.

TABLE C5
Half-Student-t prior distributions resulting as scale mixtures of half-normal distributions (τ | σ ∼ HN(σ)).

D.2 R code to illustrate the marginal prior predictive distribution
The following R code illustrates the marginal prior predictive distribution of the study-specific effects (see Section 3.4.4) for the example case discussed by Dias et al. (2013). 28 A half-normal(0.32) distribution for τ implies a marginal distribution of effects (ORs, exp(θ)) within factors of 0.5 and 2.0 with 95% probability.
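The stated 95% probability may also be checked by a simple Monte Carlo simulation (an illustrative Python sketch, assuming θ | τ ∼ N(0, τ²) marginalized over the half-normal(0.32) prior for τ):

```python
# Monte-Carlo sketch of the marginal prior predictive: tau ~ HN(0.32),
# theta | tau ~ N(0, tau^2); check how often exp(theta) falls within
# factors 0.5 to 2.0, i.e. |theta| < log(2).
import math
import random

random.seed(2)
n = 200_000
inside = 0
for _ in range(n):
    tau = abs(random.gauss(0.0, 0.32))   # half-normal(0.32) heterogeneity
    theta = random.gauss(0.0, tau)       # log-OR, given tau
    if abs(theta) < math.log(2):
        inside += 1
print(round(inside / n, 2))              # roughly 0.95
```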

D.3 R code to reproduce examples
The following R code shows how to use the bayesmeta library 20 to perform a meta-analysis of the example data from Section 5.3.1 110 using a half-normal(0.5) prior.

D.4.1 General remarks
In the following we illustrate some sensitivity analyses for the prior choice based on the MD example from Section 5.1 (Grande et al., 2015) 100 and on the IRR example from Section 5.3.2, 112 both involving k = 4 studies. Sensitivity analyses are commonly suggested and often easy to do; however, sensitivity to the choice of prior alone should only be a reason for concern if the prior is not convincingly motivated. Also, prior sensitivity must not be confused with the (weak/strong) informativeness of a prior; these are two quite separate aspects. A sensitivity analysis will also contribute little to the question of whether a particular prior is "appropriate" or not. Besides investigating variations of a given prior, analysis results may also be contrasted with those obtained when using a noninformative prior (presuming that this is possible; e.g., the improper uniform prior requires k ≥ 3 studies in order to yield a proper posterior). The aim here may then be to investigate to what extent results are determined by prior or data (likelihood). It should also be noted that the prior is only one among several crucial aspects of the model that might be challenged; additional aspects include normality, 5 exchangeability (selection effects), 122,123 the choice of effect measure, 121 the model parametrisation, 134 or the use (by deliberate choice or due to a zero heterogeneity estimate) of a common-effect model. 38,7 The default statement of several (seemingly) alternative results might also encourage inconsistent (flip-flopping) conclusions from the data; if in fact there is uncertainty about the shape or scale of the prior, this might more appropriately be addressed via specification of a mixture prior reflecting this uncertainty. In Sections 5.1 and 5.3.2 half-normal priors with scale 0.5 were suggested for both examples.
In order to investigate sensitivity of the analysis to details of the heterogeneity prior specification, we will vary the prior scale (within the half-normal family) as well as the distribution family (while keeping the prior median fixed). For the sensitivity check, we will then consider scales half or twice as large. For the investigation of sensitivity with respect to the choice of the prior distribution's shape, we consider the distribution families shown in Figure 1 and Table 4, which are half-Student-t (with ν = 3 d.f.), half-Cauchy, half-logistic, exponential and Lomax (with shape parameters 6 or 1). The prior median for the original half-normal(0.5) prior was at τ = 0.34; the different distributions then were scaled to have a matching median. Some of the reasons why one might choose one of these distribution families were discussed in Section 3.3; differences between these in particular relate to their behaviour near zero or towards their upper tail, or to their motivation as mixture distributions (see Appendix C). In order to contrast results with those obtained by using a noninformative prior, we selected the improper uniform prior in τ, as well as the Jeffreys prior, for comparison. Both of these are improper and "noninformative" in a particular sense, and, since k ≥ 3 in both examples, both yield proper posteriors here. 20

D.4.2 Mean difference example (Grande et al.; 2015)
Varying the prior scale by a factor of two here implies a re-scaling of the prior predictive distribution by the same factor (see also the discussion in Section 3.4.4 and especially Table 3). Instead of a-priori considering between-study variations of ±1 day around the overall mean most plausible, this would mean focusing on a range of half a day or two days instead, respectively. Figure D1 (left panel) shows the overall effect estimates corresponding to the three prior settings. Most notably, with larger heterogeneity deemed plausible, the effect CI's lower bound includes more extreme values, while median and upper bound are less affected. This is consistent with the empirical data here (see Figure 4), as larger heterogeneity implies greater weight for the most extreme, yet also most uncertain, first estimate, while lower heterogeneity implies that weighting is closer to the inverse-variance weights. 20 When varying the prior distribution's shape (and keeping the prior median fixed), the effect on the resulting overall estimate is remarkably small, despite the different priors' different properties and appearances (see Table 4 and Figure 1). Figure D1 (right panel) illustrates the corresponding effect estimates, where differences are barely discernible visually.
Parameter estimates for the above analyses (also for the heterogeneity τ and the prediction θₖ₊₁) are shown in Table D6. When contrasting results with those obtained based on the noninformative uniform or Jeffreys priors, the estimates differ more substantially. One may argue that, without the use of a weakly informative prior, the empirical data alone are not sufficient to rule out implausible ranges of heterogeneity here. For instance, in case of the improper uniform prior, heterogeneity values beyond τ = 4.0 would be considered a-posteriori plausible. This is more than the estimated UISD (σ₁ = 3.9), and, looking at Table 4, this would imply variability of effect values (differences in mean numbers of symptom days) within ranges of more than ±1 week, while the numbers of symptom days themselves were only of the order of one week (see Table 7). Use of a weakly informative prior then allows one to rule out such implausible parameter ranges.

TABLE D6
Estimates and 95% CIs for the heterogeneity (τ), the overall mean effect (μ) and the prediction (θₖ₊₁) corresponding to the discussed sensitivity analyses for the MD example (Grande et al.; 2015). 100 The first line shows the estimates resulting from the half-normal(0.5) prior that was originally proposed in Section 5.1.

D.4.3 Incidence rate ratio example (Anker et al.; 2018)
Although the CI width changes slightly, and for the more "optimistic" half-normal(0.25) prior the CI almost excludes zero, the conclusions do not change drastically. Figure D2 (right panel) illustrates the overall effect estimates corresponding to the different prior distribution families. Despite their different appearances and properties (see also Figure 1), the resulting overall effect estimates and CIs again are remarkably similar.
The noninformative priors yield CIs that are wider by factors of roughly 1.5 or 2.1 than those from the analysis based on the proposed half-normal(0.5) prior, so the precision gain is quite substantial here. In addition to investigating the priors' influence on the overall effect (μ), it may also be of interest to consider their effect on prediction intervals, shrinkage estimates, or the heterogeneity's posterior. Table D7 lists some estimates corresponding to all of the sensitivity analyses discussed above.

TABLE D7
Estimates and 95% CIs for the heterogeneity (τ), the overall mean effect (μ) and the prediction (θₖ₊₁) corresponding to the discussed sensitivity analyses for the IRR example (Anker et al.; 2018). 112 The first line shows the estimates resulting from the half-normal(0.5) prior that was originally proposed in Section 5.3.2.