Normalized power priors always discount historical data

Power priors are used for incorporating historical data in Bayesian analyses by taking the likelihood of the historical data raised to the power $\alpha$ as the prior distribution for the model parameters. The power parameter $\alpha$ is typically unknown and assigned a prior distribution, most commonly a beta distribution. Here, we give a novel theoretical result on the resulting marginal posterior distribution of $\alpha$ for the normal and binomial models. Counterintuitively, when the current data perfectly mirror the historical data and the sample sizes of both data sets become arbitrarily large, the marginal posterior of $\alpha$ does not converge to a point mass at $\alpha = 1$ but approaches a distribution that hardly differs from the prior. The result implies that complete pooling of historical and current data is impossible if a power prior with a beta prior for $\alpha$ is used.


Introduction
Power priors are a class of prior distributions that can be used to incorporate historical data in a Bayesian analysis of current data (Chen and Ibrahim, 2000). The basic idea is to use the likelihood of the historical data raised to the power of α as the prior distribution for the model parameters θ. This leads to a posterior distribution of θ that borrows information from both the current and the historical data, typically resulting in an information gain compared to an analysis of the current data in isolation.
The power parameter α is usually restricted to the interval between zero and one, thereby determining how much the historical data are discounted and enabling a quantitative compromise between the extreme positions of completely trusting the historical data (α = 1) and completely ignoring them (α = 0).
In practice, the power parameter α is unknown and therefore often assigned a prior distribution.
In this case, the marginal posterior density of the model parameters θ based on the current data D and the historical data D₀ is given by

$$\pi(\theta \mid D, D_0) = \int_0^1 \pi(\theta \mid D, D_0, \alpha) \, \pi(\alpha \mid D, D_0) \, \mathrm{d}\alpha,$$

that is, the posterior of θ based on a fixed α averaged over the marginal posterior of α. The marginal posterior of α thus determines how much pooling between the two data sets occurs; complete pooling of both data sets happens when the posterior has all its mass at α = 1.
Standard Bayesian asymptotic theory establishes that, under certain regularity conditions, posterior distributions become more concentrated with increasing amounts of data (Bernardo and Smith, 2000, section 5.3). Since the power prior is based on the historical data, intuition would suggest that as the sample sizes of the current and historical data sets increase, the marginal posterior of α should become increasingly concentrated at α = 0 if there is conflict between the data sets, and increasingly concentrated at α = 1 if there is no conflict. Here, we show that only the former is true, but not the latter, at least in the typical situation where α has a beta prior distribution and the data have a normal or binomial likelihood. In the absence of conflict, there is instead a limiting posterior distribution that is hardly different from the prior. Our results imply that complete discounting is possible, but complete pooling is impossible. This paper is structured as follows: We first show the claimed result for parameter estimates under normality (Section 2). We then show that the result also approximately holds in the binomial model (Section 3). The paper ends with some concluding remarks in Section 4. As a running example we consider data from two randomized clinical trials comparing the efficacy of the drugs Fidaxomicin and Vancomycin on Clostridium difficile-associated diarrhoea in adults (Table 1).
Table 1: Historical and current data on the comparison between Fidaxomicin and Vancomycin with respect to their effect on Clostridium difficile-associated diarrhoea in adults. The numbers of participants and events were taken from the respective intention-to-treat analyses of the studies as in the meta-analysis of Nelson et al. (2017).

Study | Risk (Fidaxomicin) | Risk (Vancomycin) | Risk ratio (95% CI)

[Table rows not recoverable from the extracted text.]

Normal model

The relative variance c = σ₀²/σ² = n/n₀ can then be interpreted as a ratio of sample sizes. For both estimates, assume a normal likelihood centered around the parameter θ and with variance equal to the squared standard error. In the default "normalized" version of the power prior (Duan et al., 2005; Neuenschwander et al., 2009), the prior for the power parameter α is assigned marginally to the posterior distribution of θ based on an initial prior for θ and the likelihood of the historical data D₀ raised to the power of α. Here and henceforth we use an initial uniform prior π(θ) ∝ 1 for the parameter and a beta prior α ∼ Be(p, q) for the power parameter. This choice leads to the normalized power prior

$$\pi(\theta, \alpha \mid D_0) = \mathrm{N}(\theta \mid \hat\theta_0, \sigma_0^2/\alpha) \, \mathrm{Be}(\alpha \mid p, q) \quad (1)$$

with N(· | m, v) the normal density function and Be(· | p, q) the beta density function. Combining (1) with the likelihood of the current data produces a joint posterior for θ and α, i.e.,

$$\pi(\theta, \alpha \mid D, D_0) \propto \mathrm{N}(\hat\theta \mid \theta, \sigma^2) \, \mathrm{N}(\theta \mid \hat\theta_0, \sigma_0^2/\alpha) \, \mathrm{Be}(\alpha \mid p, q),$$

from which a marginal posterior for α can be obtained by integrating out θ, i.e.,

$$\pi(\alpha \mid D, D_0) = \frac{\mathrm{N}(\hat\theta \mid \hat\theta_0, \sigma^2 + \sigma_0^2/\alpha) \, \mathrm{Be}(\alpha \mid p, q)}{\int_0^1 \mathrm{N}(\hat\theta \mid \hat\theta_0, \sigma^2 + \sigma_0^2/a) \, \mathrm{Be}(a \mid p, q) \, \mathrm{d}a}. \quad (2)$$

The black solid lines in Figure 1 show the marginal posterior distributions of the power parameter α (left) and the log risk ratio θ (right) based on a uniform prior α ∼ Be(1, 1) for the power parameter and the data from Table 1: the log risk ratio with standard error from the current data D = {θ̂ = 0.15, σ = 0.06}, and the log risk ratio with standard error from the historical data D₀ = {θ̂₀ = 0.16, σ₀ = 0.06}.
Both distributions are computed via numerical integration.
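The paper's analyses were conducted in R; as an illustrative sketch (not the authors' code, and with hypothetical helper names), the numerical integration behind (2) can be reproduced in Python with the summary data quoted above, assuming scipy is available:

```python
import numpy as np
from scipy import integrate, stats

# Log risk ratios and standard errors as quoted in the text
theta_hat, se = 0.15, 0.06    # current data D
theta0_hat, se0 = 0.16, 0.06  # historical data D0
p, q = 1.0, 1.0               # uniform Be(1, 1) prior for alpha

def unnorm_post(alpha):
    """Numerator of (2): N(theta_hat | theta0_hat, se^2 + se0^2/alpha)
    times the Be(alpha | p, q) prior density."""
    var = se**2 + se0**2 / alpha
    return (stats.norm.pdf(theta_hat, loc=theta0_hat, scale=np.sqrt(var))
            * stats.beta.pdf(alpha, p, q))

norm_const, _ = integrate.quad(unnorm_post, 0, 1)

def post_alpha(alpha):
    """Marginal posterior density (2) of the power parameter."""
    return unnorm_post(alpha) / norm_const

print(post_alpha(0.5), post_alpha(1.0))
```

For these data the posterior density at α = 1 comes out at roughly 1.3, close to the limiting Be(3/2, 1) density of 1.5 there, and nowhere near a point mass at one.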
The left plot of Figure 1 shows that the observed marginal posterior of α has hardly changed from the uniform prior, despite the almost perfect correspondence of the historical and current log risk ratios. The dashed line shows the (soon to be discussed) marginal posterior for the best-case scenario in which both log risk ratios correspond perfectly and their standard errors become arbitrarily small. As is visible, this best-case posterior of α is not too far off the observed one, giving only slightly more support to larger values of α.
The right plot of Figure 1 shows the corresponding marginal posterior for θ. As can be seen, the observed posterior (solid line) and the best-case posterior (dashed line) are virtually indistinguishable, suggesting that the data achieve as much pooling as the normalized power prior model permits. In contrast, the dotted line indicates that the posterior based on complete pooling of the data sets would be somewhat more peaked.
In general, the integral in the denominator of (2) has to be computed by numerical integration, but there are certain important situations in which an analytical solution exists and further insight can be gained. We discuss these cases in the following.

Perfect compatibility of historical and current data
The first situation occurs when the current data perfectly mirror the historical data in the sense that both parameter estimates are equivalent (θ̂ = θ̂₀). In this case, several terms cancel in (2) so that the integral can be represented in terms of the hypergeometric function (Abramowitz and Stegun, 1965, section 15.3.1), i.e.,

$$\pi(\alpha \mid D, D_0) = \frac{\alpha^{p - 1/2} \, (1 - \alpha)^{q - 1} \, (1 + \alpha/c)^{-1/2}}{B(p + 1/2, q) \; {}_2F_1\!\left(1/2, \, p + 1/2; \, p + q + 1/2; \, -1/c\right)} \quad (3)$$

where c = σ₀²/σ² is the relative variance. The distribution (3) is close to a Be(p + 1/2, q) distribution, which is hardly different from the Be(p, q) prior distribution, despite perfect compatibility of both data sets. Importantly, the marginal posterior (3) does not depend on the actual values of the standard errors σ and σ₀ but only on the relative variance c. This means that (3) holds for finite standard errors but also in the idealized mathematical situation where both standard errors go equally fast to zero (i.e., infinite sample size), but with possibly different starting values (c ≠ 1).
Typically, the historical data are predetermined and only the standard error of the current study can be changed. It is therefore interesting to study the behavior of (3) for c → ∞, i.e., the current standard error σ goes to zero while the historical standard error σ₀ remains fixed, reflecting an arbitrary increase of the current sample size. In that case it is straightforward to see from the power series representation of the hypergeometric function (Abramowitz and Stegun, 1965, section 15.1.1) that

$$(1 + \alpha/c)^{-1/2} \to 1 \quad \text{and} \quad {}_2F_1\!\left(1/2, \, p + 1/2; \, p + q + 1/2; \, -1/c\right) \to 1 \quad \text{as } c \to \infty.$$

Hence, the limiting posterior density is

$$\pi(\alpha \mid D, D_0) = \mathrm{Be}(\alpha \mid p + 1/2, q), \quad (4)$$

that is, again a beta density but with updated success parameter p + 1/2, so just slightly different from the prior. The limiting Be(3/2, 1) distribution for the uniform prior is depicted by the dashed line in the left plot of Figure 1.
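As a numerical sanity check (a sketch, not taken from the paper), the closed form (3) can be compared against direct numerical normalization of its kernel, using scipy's `hyp2f1`:

```python
from scipy import integrate, special, stats

p, q = 1.0, 1.0  # Be(p, q) prior for the power parameter

def post_closed(alpha, c):
    """Closed-form marginal posterior (3) under perfect compatibility."""
    norm = (special.beta(p + 0.5, q)
            * special.hyp2f1(0.5, p + 0.5, p + q + 0.5, -1.0 / c))
    return alpha**(p - 0.5) * (1 - alpha)**(q - 1) * (1 + alpha / c)**(-0.5) / norm

def post_numeric(alpha, c):
    """Same density, normalized by direct numerical integration."""
    kernel = lambda a: a**(p - 0.5) * (1 - a)**(q - 1) * (1 + a / c)**(-0.5)
    norm, _ = integrate.quad(kernel, 0, 1)
    return kernel(alpha) / norm

print(post_closed(0.8, 1.0), post_numeric(0.8, 1.0))  # should agree
# For c -> infinity the posterior approaches the Be(p + 1/2, q) density
print(post_closed(0.8, 1e8), stats.beta.pdf(0.8, p + 0.5, q))
```

The agreement follows from Euler's integral representation of the hypergeometric function, and letting c grow reproduces the limit (4).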

Arbitrarily precise current data
The second situation in which the marginal posterior (2) is available in closed form is the limiting case when the current standard error σ goes to zero while the historical standard error σ₀ remains fixed (but, in contrast to the previous situation, the parameter estimates θ̂ and θ̂₀ can take different values). In this case, the integral in (2) can be represented by the confluent hypergeometric function (Abramowitz and Stegun, 1965, section 13.2.1), so that the marginal posterior is given by

$$\pi(\alpha \mid D, D_0) = \frac{\exp\!\left\{-\alpha \, (\hat\theta - \hat\theta_0)^2 / (2\sigma_0^2)\right\}}{{}_1F_1\!\left(p + 1/2; \, p + q + 1/2; \, -(\hat\theta - \hat\theta_0)^2 / (2\sigma_0^2)\right)} \, \frac{\alpha^{p - 1/2} \, (1 - \alpha)^{q - 1}}{B(p + 1/2, q)}. \quad (5)$$

As expected, the distribution (5) reduces to (4) when the parameter estimates are equal (θ̂ = θ̂₀) since then the left fraction becomes one, which can be shown using the power series representation of the confluent hypergeometric function (Abramowitz and Stegun, 1965, section 13.1.2). The marginal posterior for α can thus at best become a Be(p + 1/2, q) if the prior is a Be(p, q), implying that complete pooling of historical and current data can never be achieved. On the other hand, complete discounting is possible since conflict between the current and historical data can make the marginal posterior arbitrarily peaked at α = 0. However, considering that the distribution (5) is based on an extremely informative current data set which leads to an estimate of the unknown parameter θ without any measurement error, the rate at which the posterior becomes more concentrated also seems unsatisfactory. For instance, for a standardized parameter difference |θ̂ − θ̂₀|/σ₀ = 3, the posterior is only slightly peaked at around α = 0.1.
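Similarly, the limiting posterior (5) can be checked numerically with scipy's `hyp1f1`; the sketch below (with hypothetical helper names) also locates the mode for a standardized difference of 3:

```python
import numpy as np
from scipy import integrate, special

p, q = 1.0, 1.0  # Be(p, q) prior for the power parameter

def post_limit(alpha, d):
    """Limiting marginal posterior (5) as the current standard error
    goes to zero; d = |theta_hat - theta0_hat| / sigma0."""
    z = d**2 / 2
    norm = (special.beta(p + 0.5, q)
            * special.hyp1f1(p + 0.5, p + q + 0.5, -z))
    return np.exp(-alpha * z) * alpha**(p - 0.5) * (1 - alpha)**(q - 1) / norm

# The density integrates to one ...
total, _ = integrate.quad(lambda a: post_limit(a, 3.0), 0, 1)
print(total)

# ... and for d = 3 its mode sits only around alpha = 1/9
grid = np.linspace(1e-4, 1 - 1e-4, 10_000)
print(grid[np.argmax(post_limit(grid, 3.0))])
```

With the uniform prior the kernel is √α · exp(−αd²/2), so the mode is at 1/d² = 1/9 ≈ 0.11, matching the mild peak near α = 0.1 noted above.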

Binomial model
So far we have assumed a normal likelihood and a univariate parameter; Appendix A shows that the previous result can be generalized to normal linear models with a multivariate parameter θ. In this section, we show that the previous result also holds approximately in the binomial model, which is an important model class in medical applications of power priors.
Let D = {x, n} and D₀ = {x₀, n₀} denote the number of successes and total trials from the current and historical data sets, respectively. Assume a binomial likelihood with success probability θ for each of them, and let θ̂ = x/n and θ̂₀ = x₀/n₀ denote the respective maximum likelihood estimates.
Assigning an initial beta prior θ ∼ Be(0, 0) for the success probability and a beta prior α ∼ Be(p, q) for the power parameter leads to the normalized power prior

$$\pi(\theta, \alpha \mid D_0) = \mathrm{Be}(\theta \mid \alpha x_0, \, \alpha(n_0 - x_0)) \, \mathrm{Be}(\alpha \mid p, q). \quad (6)$$

Combining the prior (6) with the likelihood of the current data leads to a joint posterior distribution for θ and α, from which a marginal posterior for α can be obtained by integrating out θ, i.e.,

$$\pi(\alpha \mid D, D_0) = \frac{\mathrm{BeBin}(x \mid n, \, \alpha x_0, \, \alpha(n_0 - x_0)) \, \mathrm{Be}(\alpha \mid p, q)}{\int_0^1 \mathrm{BeBin}(x \mid n, \, a x_0, \, a(n_0 - x_0)) \, \mathrm{Be}(a \mid p, q) \, \mathrm{d}a} \quad (7)$$

with BeBin(· | n, a, b) the beta-binomial probability mass function.
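As an illustration (again a Python sketch rather than the authors' R code), the marginal posterior (7) can be evaluated by numerical integration, here with the trial data that also appear in Figure 3:

```python
from scipy import integrate, stats

# Historical (Louie et al., 2011) and current (Cornely et al., 2012) data
x0, n0 = 214, 302
x, n = 193, 270
p, q = 1.0, 1.0  # uniform Be(1, 1) prior for alpha

def unnorm_post(alpha):
    """Numerator of (7): beta-binomial marginal likelihood of the
    current data times the beta prior for alpha."""
    return (stats.betabinom.pmf(x, n, alpha * x0, alpha * (n0 - x0))
            * stats.beta.pdf(alpha, p, q))

norm_const, _ = integrate.quad(unnorm_post, 0, 1)

def post_alpha(alpha):
    """Marginal posterior density (7) of the power parameter."""
    return unnorm_post(alpha) / norm_const

# Far from a point mass at alpha = 1 despite similar observed risks
print(post_alpha(1.0))
```

Despite the very similar observed risks 214/302 and 193/270, the density at α = 1 stays in the vicinity of the best-case value, nowhere near a spike at one.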
As for the normal model, the integral in the denominator of (7) is generally not available in closed form. Yet again, it is possible to obtain a closed-form expression when the probability estimates from both studies are equivalent (θ̂ = θ̂₀). Applying Stirling's approximation

$$B(x, y) \approx \sqrt{2\pi} \, x^{x - 1/2} \, y^{y - 1/2} \, (x + y)^{-(x + y - 1/2)}$$

to both beta functions in the probability mass function of the beta-binomial then leads to

$$\mathrm{BeBin}(x \mid n, \, \alpha x_0, \, \alpha(n_0 - x_0)) \approx \binom{n}{x} \, \hat\theta^{\,x} (1 - \hat\theta)^{n - x} \left(\frac{\alpha}{\alpha + c}\right)^{1/2} \quad (8)$$

with c = n/n₀. Using the approximation (8) in the numerator and denominator of (7) produces the same marginal posterior (3) as for the normal model but with c = n/n₀, representing the relative sample size of the data sets. The limiting marginal posterior distribution of α is thus (approximately) the same for the normal and binomial models.
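The quality of the Stirling-based approximation (8) can be checked numerically; the sketch below compares it with the exact beta-binomial mass for perfectly compatible data with θ̂ = θ̂₀ = 0.5 (illustrative values, not from the paper):

```python
import numpy as np
from scipy import special, stats

# Perfectly compatible data sets with theta_hat = theta0_hat = 0.5
x0, n0 = 150, 300
x, n = 150, 300
theta_hat = x / n
c = n / n0  # relative sample size

for alpha in (0.25, 0.5, 1.0):
    exact = stats.betabinom.pmf(x, n, alpha * x0, alpha * (n0 - x0))
    approx = (special.comb(n, x) * theta_hat**x * (1 - theta_hat)**(n - x)
              * np.sqrt(alpha / (alpha + c)))
    print(alpha, exact, approx)
```

For sample sizes of this order the relative error of (8) is well below one percent, since Stirling's approximation is already very accurate for the beta-function arguments involved.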
Figure 3 shows the marginal posterior (7) based on the data from Table 1 using the risks in the Fidaxomicin group. The observed posterior is close to the best-case posterior for c = 1, indicating that the data achieve almost as much pooling as the power prior permits in that case.

Discussion
We showed that normalized power priors in normal and binomial models, combined with beta priors assigned to the power parameter α, have undesirable and counterintuitive properties. Specifically, in the best-case scenario when the current data perfectly mirror the historical data and the sample sizes of both data sets become arbitrarily large, the marginal posterior of α does not converge to a point mass at α = 1 but approaches a Be(p + 1/2, q) distribution, hardly differing from the Be(p, q) prior.
The result implies that complete pooling of historical and current data can never be achieved. Our case study illustrates that the property is not only a mathematical curiosity but can occur in statistical analyses of medical data.
We still believe that normalized power priors are useful since they permit arbitrarily large data-driven discounting of historical data. However, data analysts should be aware that this does not work in the other direction, as the amount of possible pooling is predetermined by the prior. Data analysts have different options to alleviate this limitation. For instance, they can use the power prior based on a fixed α and either use a "guide value" (Ibrahim et al., 2015) or elicit a reasonable value from external knowledge about the similarity of the data sets. Another option is to specify informative priors which give most of their mass to larger values of α, thereby shifting the best-case marginal posterior to larger values as well. Finally, a pragmatic alternative is to specify α via an empirical Bayes approach as proposed by Gravestock and Held (2017), which permits complete pooling of both data sets.
We only studied the limiting marginal posterior of α in the normal and binomial models combined with beta priors on α, yet we conjecture that the issue is more fundamental and also present in other types of models.However, this will likely be more difficult to establish as marginal posteriors are typically not available in closed form for more complex models.
For the normal model, there is an exact correspondence between power parameter models with fixed α and hierarchical (random-effects meta-analysis) models with fixed heterogeneity variance (Chen and Ibrahim, 2006). This connection may provide an intuition for why the counterintuitive result occurs: Precisely estimating a heterogeneity variance from two observations alone (the historical and current data sets) seems like an impossible task, as the "unit of information" is the number of data sets and not the number of samples within a data set. We report elsewhere on the precise connection between power parameter and hierarchical models when the power parameter and the heterogeneity variance are random (Pawel et al., 2022).

Software and data
The summary data used in this study were extracted from Figure 5.1 in Nelson et al. (2017). All analyses were conducted in the R programming language version 4.3.0 (R Core Team, 2022). The packages ggplot2 (Wickham, 2016) and hypergeo (Hankin, 2016) were used for graphics and for the (confluent) hypergeometric function implementations, respectively. The code and data to reproduce our analyses are openly available at https://github.com/SamCH93/ppPooling. A snapshot of the GitHub repository at the time of writing is available at https://doi.org/10.5281/zenodo.6626963.
Appendix A: Normal linear model

Assume a normal linear model with known variance σ², a k-dimensional parameter vector θ, current design matrix X, and historical design matrix X₀. Assigning an initial uniform prior π(θ) ∝ 1 for the parameter and a beta prior α ∼ Be(p, q) for the power parameter, we obtain the normalized power prior

$$\pi(\theta, \alpha \mid D_0) = \mathrm{N}_k\!\left(\theta \mid \hat\theta_0, \, \sigma^2 \alpha^{-1} (X_0^{\top} X_0)^{-1}\right) \mathrm{Be}(\alpha \mid p, q) \quad (9)$$

with N_k(· | µ, Σ) the density function of a k-variate normal distribution. Updating the prior (9) with the likelihood of the current data leads to a joint posterior for θ and α, from which a marginal posterior for α can be obtained by integrating out θ, i.e.,

$$\pi(\alpha \mid D, D_0) = \frac{\mathrm{N}_k\!\left(\hat\theta \mid \hat\theta_0, \, \sigma^2 (X^{\top} X)^{-1} + \sigma^2 \alpha^{-1} (X_0^{\top} X_0)^{-1}\right) \mathrm{Be}(\alpha \mid p, q)}{\int_0^1 \mathrm{N}_k\!\left(\hat\theta \mid \hat\theta_0, \, \sigma^2 (X^{\top} X)^{-1} + \sigma^2 a^{-1} (X_0^{\top} X_0)^{-1}\right) \mathrm{Be}(a \mid p, q) \, \mathrm{d}a}. \quad (10)$$

When the current and historical data are perfectly compatible (characterized by equivalent maximum likelihood estimates θ̂ = θ̂₀), the integral in (10) can again be represented in terms of the hypergeometric function, and the marginal posterior of α is given by (3) but with c = |XᵀX|/|X₀ᵀX₀|, the ratio of the determinants of the precision matrices (XᵀX)/σ² and (X₀ᵀX₀)/σ² of the maximum likelihood estimates θ̂ and θ̂₀.
Define the current data set by D = {θ̂, σ} with θ̂ an estimate of an unknown univariate parameter θ and σ the (assumed known) standard error of the estimate. Denote by D₀ = {θ̂₀, σ₀} the respective quantities obtained from the historical data. The standard errors are usually of the form σ = κ/√n and σ₀ = κ/√n₀, where n and n₀ are effective sample sizes and κ² is the variance of one unit.

Figure 2: Limiting marginal posterior distribution of the power parameter α based on a α ∼ Be(1, 1) prior when the current standard error goes to zero (σ ↓ 0), for different values of the parameter difference standardized by the historical standard error |θ̂ − θ̂₀|/σ₀.

Figure 2 shows the distribution (5) for different values of the parameter estimate difference standardized by the historical standard error (|θ̂ − θ̂₀|/σ₀). One can see that when the parameter estimates become more different (larger |θ̂ − θ̂₀|), the limiting distribution (5) is increasingly shifted towards smaller values of α, indicating more incompatibility between the data sets and reducing the information borrowing from the historical data. This shift amplifies with decreasing historical standard error σ₀, meaning that the posterior can become arbitrarily peaked by increasing the sample size of the historical study. In contrast, when the parameter estimates are the same (θ̂ = θ̂₀), the historical standard error σ₀ does not influence the posterior.

Figure 3: Marginal posterior of the power parameter α based on a α ∼ Be(1, 1) prior and historical data D₀ = {x₀ = 214, n₀ = 302} from Louie et al. (2011). The black solid line shows the marginal posterior for the actually observed current data D = {x = 193, n = 270} from Cornely et al. (2012), whereas the dashed lines show the best-case marginal posteriors for hypothetical current data D = {x = c × x₀, n = c × n₀} which perfectly mirror the original data but with relative sample sizes c = n/n₀.