A general approximation to nested Bayes factors with informed priors

A staple of Bayesian model comparison and hypothesis testing, Bayes factors are often used to quantify the relative predictive performance of two rival hypotheses. The computation of Bayes factors can be challenging, however, and this has contributed to the popularity of convenient approximations such as the Bayesian information criterion (BIC). Unfortunately, these approximations can fail in the case of informed prior distributions. Here, we address this problem by outlining an approximation to informed Bayes factors for a focal parameter θ. The approximation is computationally simple and requires only the maximum likelihood estimate θ̂ and its standard error. The approximation uses an estimated likelihood of θ and assumes that the posterior distribution for θ is unaffected by the choice of prior distribution for the nuisance parameters. The resulting Bayes factor for the null hypothesis H0: θ = θ0 versus the alternative hypothesis H1: θ ∼ g(θ) is then easily obtained using the Savage–Dickey density ratio. Three real-data examples highlight the speed and closeness of the approximation compared with bridge sampling and Laplace's method. The proposed approximation facilitates Bayesian reanalyses of standard frequentist results, encourages application of Bayesian tests with informed priors, and alleviates the computational challenges that often frustrate both Bayesian sensitivity analyses and Bayes factor design analyses. The approximation is shown to suffer under small sample sizes and when the posterior distribution of the focal parameter is substantially influenced by the prior distributions on the nuisance parameters. The proposed methodology may also be used to approximate the posterior distribution for θ under H1.


Introduction
Bayes factors represent the standard solution to problems involving Bayesian model comparison and hypothesis testing (e.g., Benjamin et al., 2018; Berger & Delampady, 1987; Johnson et al., 2023; Kass & Raftery, 1995). Across the empirical sciences, the prototypical testing scenario features a null hypothesis H0 where a focal, test-relevant parameter θ is fixed to a particular value of interest: H0: θ = θ0. The alternative hypothesis H1 relaxes the restriction imposed by H0. Here we consider the Bayesian framework in which the test-relevant parameter is assigned a prior distribution: H1: θ ∼ g(θ). Both H0 and H1 may additionally feature a common set of nuisance parameters ψ. The Bayes factor (e.g., Etz & Wagenmakers, 2017; Jeffreys, 1935, 1939; Kass & Raftery, 1995) quantifies the evidence that the data y provide for H1 versus H0 and is defined as the ratio of the two integrated likelihoods, that is, the ratio of the likelihoods integrated over the prior:
$$\mathrm{BF}_{10} = \frac{p(y \mid H_1)}{p(y \mid H_0)} = \frac{\int\!\!\int p(y \mid \theta, \psi, H_1)\, p(\theta, \psi \mid H_1)\, d\theta\, d\psi}{\int p(y \mid \theta_0, \psi, H_0)\, p(\psi \mid H_0)\, d\psi}.$$
Hence the Bayes factor reflects the models' relative predictive performance, which also equals the extent to which the data warrant a change from prior to posterior model odds (Wrinch & Jeffreys, 1921):
$$\frac{p(H_1 \mid y)}{p(H_0 \mid y)} = \mathrm{BF}_{10} \times \frac{p(H_1)}{p(H_0)}.$$
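As a small numerical illustration of the relation between the Bayes factor and the model odds, the following Python sketch converts a Bayes factor into posterior odds and a posterior model probability. The function names and the example value BF10 = 3 are illustrative assumptions and are not taken from the analyses reported in this article.

```python
def posterior_odds(bf10, prior_odds=1.0):
    """Posterior odds of H1 over H0: Bayes factor times prior odds."""
    return bf10 * prior_odds


def posterior_prob_h1(bf10, prior_prob_h1=0.5):
    """Posterior probability of H1 implied by BF10 and a prior probability of H1."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    odds = posterior_odds(bf10, prior_odds)
    return odds / (1.0 + odds)


# Hypothetical example: BF10 = 3 with equal prior model probabilities.
print(posterior_prob_h1(3.0))
```

With equal prior probabilities the prior odds are 1, so the posterior odds coincide with the Bayes factor itself.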
In most non-trivial applications, however, researchers who seek to obtain a Bayes factor are faced with considerable computational challenges. The integrals that define the integrated likelihood may be high dimensional, and state-of-the-art methods such as bridge sampling (Meng & Wong, 1996; Gronau et al., 2017a) or reversible-jump Markov chain Monte Carlo (Green, 1995) are generally time intensive. This concern is especially relevant for prior sensitivity analyses and Bayes factor design analyses that require repeated Bayes factor evaluations (Schönbrodt & Wagenmakers, 2018; Stefan et al., 2019). In addition, models with many nuisance parameters are almost always applied using a default multivariate prior distribution that is difficult to adjust in light of substantive background knowledge concerning the focal, test-relevant parameter of interest.
These challenges are often side-stepped by convenient approximations to the integrated likelihood such as the Bayesian information criterion (BIC; Schwarz, 1978) and Laplace's method (Tierney & Kadane, 1986), or an approximation to default Bayes factors requiring only sample size and a test statistic or p-value (Jeffreys, 1936; Wagenmakers, 2022; also see Johnson, 2005; Shao et al., 2019; Villa & Walker, 2022; Rostgaard, 2023 for other approaches). However, these approximations have notable limitations.
First, the BIC is defined as −2 log p(y | ξ̂) + k log n, that is, −2 times the log-likelihood evaluated at the maximum likelihood estimate ξ̂ plus a complexity correction term that contains the number of free parameters k and the sample size n. Unfortunately, both k and n can be surprisingly difficult to determine (Kass & Raftery, 1995; Pauler, 1998). Moreover, the BIC approximates a "default" Bayes factor that is based on a unit-information prior (Kass & Wasserman, 1995); consequently, the BIC does not easily lend itself to an analysis that seeks to take advantage of substantial background knowledge.
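To make the structure of the BIC concrete, the sketch below computes the BIC for two nested models and converts the BIC difference into the usual approximate Bayes factor, BF01 ≈ exp((BIC1 − BIC0)/2). The maximized log-likelihoods, parameter counts, and sample size are hypothetical values chosen only for illustration.

```python
import math


def bic(max_log_lik, k, n):
    """BIC = -2 * maximized log-likelihood + k * log(n)."""
    return -2.0 * max_log_lik + k * math.log(n)


def bic_bf01(max_log_lik0, k0, max_log_lik1, k1, n):
    """Approximate BF01 from the BIC difference between H1 and H0."""
    return math.exp((bic(max_log_lik1, k1, n) - bic(max_log_lik0, k0, n)) / 2.0)


# Hypothetical values: H1 has one extra free parameter and fits slightly better.
print(bic_bf01(max_log_lik0=-520.0, k0=2, max_log_lik1=-518.5, k1=3, n=200))
```

Note that no prior distribution appears among the inputs, which is why the BIC cannot reflect informed prior knowledge about the focal parameter.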
Second, Laplace's method assumes that the posterior distribution is highly peaked around the maximum likelihood estimate and approximates the integrated likelihood under each hypothesis H as
$$p(y \mid H) \approx (2\pi)^{k/2}\, |\Sigma|^{1/2}\, p(y \mid \hat{\theta}, \hat{\psi}, H)\, p(\hat{\theta}, \hat{\psi} \mid H),$$
where Σ is the covariance matrix of the k maximum likelihood estimates θ̂, ψ̂ (Kass & Raftery, 1995). Laplace's method effectively assumes that the posterior distribution is multivariate normal, fully determined by the likelihood function, and not influenced by the shape of the prior distributions. Consequently, Laplace's method can perform poorly with informed prior distributions.
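The following sketch illustrates this limitation in the simplest possible setting: a single parameter (k = 1) with a normal likelihood summarized by an estimate and its standard error, and a normal prior, for which the exact integrated likelihood is available in closed form. The model, the function names, and all numerical values are assumptions made for illustration; the sketch uses the maximum-likelihood-based variant of Laplace's method.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def laplace_marginal(theta_hat, se, mu0, sigma0):
    """Laplace approximation with k = 1 and Sigma = se^2:
    (2*pi)^(1/2) * se * L(theta_hat) * g(theta_hat)."""
    lik_at_mle = normal_pdf(theta_hat, theta_hat, se)
    prior_at_mle = normal_pdf(theta_hat, mu0, sigma0)
    return math.sqrt(2.0 * math.pi) * se * lik_at_mle * prior_at_mle


def exact_marginal(theta_hat, se, mu0, sigma0):
    """Exact integrated likelihood for a normal likelihood and a normal prior."""
    return normal_pdf(theta_hat, mu0, math.sqrt(se ** 2 + sigma0 ** 2))


def laplace_to_exact_ratio(theta_hat, se, mu0, sigma0):
    return (laplace_marginal(theta_hat, se, mu0, sigma0)
            / exact_marginal(theta_hat, se, mu0, sigma0))


theta_hat, se = 0.0, 0.1  # hypothetical estimate and standard error
print(laplace_to_exact_ratio(theta_hat, se, 0.0, 10.0))  # diffuse prior
print(laplace_to_exact_ratio(theta_hat, se, 0.3, 0.15))  # informed prior
```

A ratio close to 1 indicates that the approximation matches the exact integrated likelihood; in this toy setting the ratio stays near 1 for the diffuse prior and departs from 1 once the prior is informative relative to the likelihood.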
Finally, Jeffreys's approximate Bayes factor (first mentioned in Jeffreys, 1936, p. 417) assumes that the prior distribution θ ∼ g(θ) varies slowly in the neighborhood of the maximum likelihood estimate and provides a convenient test against a null hypothesis H0: θ = 0. Jeffreys's approximate Bayes factor simplifies to BF01 = A √n exp(−χ²/2), where χ² corresponds to a Wald statistic and A is a constant usually close to 1 (see Wagenmakers, 2022 for an overview). Although the prior sensitivity of Jeffreys's approximate Bayes factor can to some degree be accommodated by adjusting the constant A, the general expression with A = 1 corresponds to an "objective" unit-information Bayes factor that does not test informed hypotheses. Other approaches (e.g., Johnson, 2005; Shao et al., 2019; Villa & Walker, 2022; Rostgaard, 2023) also focus on Bayes factors under "objective" or improper prior distributions, which we do not consider in this manuscript.
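Jeffreys's approximate Bayes factor is simple enough to state in a few lines of Python. The sketch below assumes that the Wald statistic is computed as the squared ratio of an estimate to its standard error; the function names and the example inputs are hypothetical.

```python
import math


def wald_chi2(theta_hat, se):
    """Wald statistic for H0: theta = 0."""
    return (theta_hat / se) ** 2


def jeffreys_approx_bf01(chi2, n, a=1.0):
    """Jeffreys's approximate Bayes factor BF01 = A * sqrt(n) * exp(-chi2 / 2)."""
    return a * math.sqrt(n) * math.exp(-chi2 / 2.0)


# Hypothetical inputs: estimate 0.2 with standard error 0.1 from n = 100 observations.
print(jeffreys_approx_bf01(wald_chi2(0.2, 0.1), n=100))
```

The only way prior information can enter this expression is through the constant A, which illustrates why the approximation is tied to a default rather than an informed test.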
In sum, all three approximate methods outlined above are appropriate only for scenarios with relatively uninformed prior distributions (Kass & Raftery, 1995). Consequently, accurate approximations are lacking for exactly the type of testing scenario in which Bayesian inference ought to excel: the case where substantial prior knowledge is available (Gronau et al., 2020).
To overcome this limitation we outline a simple method that can approximate informed Bayes factors for focal parameters in nested models differing only in the presence or absence of a single test-relevant parameter θ. The method takes advantage of the approximate likelihood function of Tsou & Royall (1995) and the Savage-Dickey density ratio (Dickey, 1971; Dickey & Lientz, 1970; Wetzels et al., 2009), the same principle recently used by Johnson et al. (2023), Mulder et al. (2020), and Rostgaard (2023). The approximation requires a maximum likelihood estimate θ̂ and its standard error se(θ̂), and holds under weak regularity conditions. Appendix 1 shows how the resulting Bayes factor can be readily computed in R or JASP (R Core Team, 2021; JASP Team, 2021; Ly et al., 2021).
We refer to the proposed methodology as the Savage-Dickey normal approximation and illustrate its precision and utility with three real-data examples. The first example features a two-sample t-test and compares the results to those obtained using standard numerical methods. The second example features a sequential parametric survival analysis and compares the results to those obtained using bridge sampling and Laplace's method. The third example features meta-regression and presents a comparison based on a prior sensitivity analysis. The closing section points to limitations and outlines further advantages of the approximation.

Methods
As the name implies, the Savage-Dickey normal approximation is based on the Savage-Dickey density ratio, which obviates the need to compute the ratio of two marginal likelihoods; instead, it expresses the Bayes factor BF10 for H1: θ ∼ g(θ) against a point null hypothesis H0: θ = θ0 as the ratio of the prior ordinate over the posterior ordinate under H1, evaluated at the test value θ0:
$$\mathrm{BF}_{10} = \frac{p(\theta = \theta_0 \mid H_1)}{p(\theta = \theta_0 \mid y, H_1)}.$$
The Savage-Dickey density ratio assumes that p(ψ | θ = θ0, H1) = p(ψ | H0), that is, the prior distributions on the nuisance parameters are specified such that H1 reduces to H0 when the focal parameter equals θ0 in both models (Jeffreys, 1961, p. 249; Verdinelli & Wasserman, 1995; for a generalization that relaxes this assumption see Verdinelli & Wasserman, 1995; Heck, 2019; Mulder et al., 2020).
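The Savage-Dickey identity can be checked numerically in a toy setting without nuisance parameters. The sketch below assumes a normal likelihood summarized by an estimate and its standard error together with a normal prior, so that both the marginal likelihoods and the posterior are available in closed form; the function names and the input values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bf10_marginal_ratio(theta_hat, se, mu0, sigma0, theta0):
    """BF10 as the ratio of the two integrated likelihoods."""
    marginal_h1 = normal_pdf(theta_hat, mu0, math.sqrt(se ** 2 + sigma0 ** 2))
    marginal_h0 = normal_pdf(theta_hat, theta0, se)
    return marginal_h1 / marginal_h0


def bf10_savage_dickey(theta_hat, se, mu0, sigma0, theta0):
    """BF10 as prior ordinate over posterior ordinate at theta0."""
    post_var = 1.0 / (1.0 / se ** 2 + 1.0 / sigma0 ** 2)
    post_mean = post_var * (theta_hat / se ** 2 + mu0 / sigma0 ** 2)
    prior_ordinate = normal_pdf(theta0, mu0, sigma0)
    posterior_ordinate = normal_pdf(theta0, post_mean, math.sqrt(post_var))
    return prior_ordinate / posterior_ordinate


args = dict(theta_hat=0.25, se=0.1, mu0=0.0, sigma0=0.5, theta0=0.0)  # hypothetical
print(bf10_marginal_ratio(**args), bf10_savage_dickey(**args))
```

Both routes give the same number in this conjugate case, but only the second avoids integrating the likelihood under each hypothesis separately.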
The prior ordinate for θ under H1 evaluated at θ0 is available directly from the prior probability density function, but the marginal posterior density function p(θ | y, H1) usually does not have a closed-form solution. Therefore, we obtain the posterior ordinate for θ at θ0 using an approximate marginal posterior probability density function pa(θ | y, H1).
We construct this approximate marginal posterior probability density function via the approximate likelihood function
$$L_a(\theta) = \exp\left(-\frac{(\theta - \hat{\theta})^2}{2\, \mathrm{se}(\hat{\theta})^2}\right),$$
which is proportional to a normal density. Tsou & Royall (1995) showed that the estimated likelihood La(θ) = L(θ, ψ̂) is asymptotically globally pointwise equivalent to the complete likelihood L(θ, ψ) under weak regularity conditions. In other words, La(θ) is asymptotically equivalent to L(θ, ψ) when comparing support provided by the data between any two values of θ via likelihood ratios (Royall, 1997, p. 158). Next we apply Bayes' rule to obtain the approximate marginal posterior distribution for θ,
$$p_a(\theta \mid y, H_1) = \frac{L_a(\theta)\, g(\theta)}{\int L_a(\theta)\, g(\theta)\, d\theta},$$
as a standardized product of the approximate likelihood of θ and the prior distribution of θ (see Pratt, 1965 for the same approximation with insufficient statistics). Since we need to approximate only the marginal posterior distribution of θ, the denominator features only a one-dimensional integral that can easily be evaluated numerically. This can be considered a simple extension of the Laplace approximation to the posterior distribution (Leonard, 1982) that takes into account the prior information without assuming the posterior is normally distributed.
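The construction described above, a normal-shaped approximate likelihood multiplied by the prior and standardized by a one-dimensional integral, can be sketched in plain Python as follows. The Simpson integrator, the integration range of ±10 standard errors around the estimate, the Cauchy prior, and the input values are all assumptions made for illustration and are not the implementation from Appendix 1.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def approx_likelihood(theta, theta_hat, se):
    """Normal-shaped approximate likelihood, up to a multiplicative constant."""
    return math.exp(-0.5 * ((theta - theta_hat) / se) ** 2)


def approx_posterior_pdf(theta, theta_hat, se, prior_pdf, width=10.0):
    """Approximate marginal posterior density: likelihood * prior / integral.

    Assumption: the product is negligible outside theta_hat +/- width * se.
    """
    def unnormalized(t):
        return approx_likelihood(t, theta_hat, se) * prior_pdf(t)

    norm = simpson(unnormalized, theta_hat - width * se, theta_hat + width * se)
    return unnormalized(theta) / norm


def cauchy_pdf(x, loc=0.0, scale=1.0):
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


# Hypothetical inputs: estimate 0.2, standard error 0.1, Cauchy(0, 0.5) prior.
print(approx_posterior_pdf(0.0, 0.2, 0.1, lambda t: cauchy_pdf(t, 0.0, 0.5)))
```

Because the prior enters only as a callable density, any proper prior on the focal parameter can be substituted without changing the rest of the computation.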
Finally, we substitute the approximate posterior distribution into the Savage-Dickey expression for the Bayes factor and obtain the Savage-Dickey normal approximation as follows:
$$\mathrm{BF}_{10} \approx \frac{g(\theta_0)}{p_a(\theta_0 \mid y, H_1)} = \frac{\int L_a(\theta)\, g(\theta)\, d\theta}{L_a(\theta_0)}.$$
When the focal parameter θ is assigned a normal prior distribution with mean µ0 and standard deviation σ0, H1: θ ∼ N(µ0, σ0²), the Savage-Dickey normal approximation conveniently yields a closed-form expression for the Bayes factor:
$$\mathrm{BF}_{10} \approx \sqrt{\frac{\mathrm{se}(\hat{\theta})^2}{\mathrm{se}(\hat{\theta})^2 + \sigma_0^2}}\, \exp\left(\frac{(\hat{\theta} - \theta_0)^2}{2\, \mathrm{se}(\hat{\theta})^2} - \frac{(\hat{\theta} - \mu_0)^2}{2\, (\mathrm{se}(\hat{\theta})^2 + \sigma_0^2)}\right).$$
Appendix 2 shows that this expression, which corresponds to a Bayesian z-test (cf. Berger & Delampady, 1987, Eq. 6, and Clyde et al., 2021), approaches Jeffreys's default approximate Bayes factor (e.g., Jeffreys, 1961, p. 247) with increasing sample size and under a unit information prior; as an approximate Bayes factor for logistic regression the expression has been advocated by Wakefield (2007, 2009).
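For a normal prior, the ratio of the integrated approximate likelihood to the approximate likelihood at the test value reduces to a ratio of two normal densities, which can be evaluated on the log scale to avoid overflow when the evidence is extreme. The sketch below is one way to write this down; the function name and the example inputs are hypothetical.

```python
import math


def log_bf10_normal_prior(theta_hat, se, mu0, sigma0, theta0=0.0):
    """Log of BF10 for H0: theta = theta0 versus H1: theta ~ Normal(mu0, sigma0^2).

    Equals log N(theta_hat; mu0, se^2 + sigma0^2) - log N(theta_hat; theta0, se^2).
    """
    total_var = se ** 2 + sigma0 ** 2
    return (
        0.5 * math.log(se ** 2 / total_var)
        + (theta_hat - theta0) ** 2 / (2.0 * se ** 2)
        - (theta_hat - mu0) ** 2 / (2.0 * total_var)
    )


# Hypothetical inputs: estimate 0.25, standard error 0.1, Normal(0, 0.5^2) prior.
log_bf = log_bf10_normal_prior(theta_hat=0.25, se=0.1, mu0=0.0, sigma0=0.5)
print(math.exp(log_bf))   # BF10
print(math.exp(-log_bf))  # BF01
```

Working with the logarithm means BF01 is obtained by a sign flip, and very large Bayes factors can be reported as log values instead of being exponentiated.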

Examples
Two-sample t-test
First we compare the Savage-Dickey normal approximation to a numerical solution for the informed Bayesian two-sample t-test (Gronau et al., 2020). We consider Gronau et al.'s re-analysis of a replication study of the facial feedback hypothesis (Wagenmakers et al., 2016), which holds that facial expressions can impact emotional experience. In the replication study, participants were instructed to rate the funniness of a cartoon while holding a pen either with their teeth, i.e., the smile condition, or with their lips, i.e., the pout condition. The facial feedback hypothesis predicts that participants in the smile condition will rate the cartoons to be funnier than participants in the pout condition (Figure 1). In the replication study, Wagenmakers et al. (2016) found a mean funniness rating of 4.63 (SD = 1.48) across 53 participants in the smile condition and 4.87 (SD = 1.32) across 57 participants in the pout condition. To specify the prior distribution on the effect size parameter δ, Gronau et al. (2020) performed prior elicitation with an expert from the field of social psychology and obtained an informed Student-t prior distribution, called the "Oosterwijk" prior distribution after the expert. The prior distribution specifies mostly small effect sizes and is restricted to positive values:
$$\delta \sim \text{Student-}t^{+}(\mu = 0.350,\ \sigma = 0.102,\ \nu = 3).$$
The numerical solution presented by Gronau et al. (2020) shows strong evidence in favor of the null hypothesis, BF0+ = 11.6. The effect in the sample is in the direction opposite to that predicted by the facial feedback hypothesis. The Savage-Dickey normal approximation with the maximum likelihood estimate δ̂ = −0.17 and standard error se(δ̂) = 0.19 leads to almost identical evidence in favor of the null hypothesis: BF0+ = 11.5.
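The computation for this example can be sketched with nothing more than a one-dimensional numerical integral. The code below assumes the Oosterwijk prior is a Student-t distribution with location 0.350, scale 0.102, and 3 degrees of freedom truncated to positive values (as reported by Gronau et al., 2020), and it uses the rounded estimate and standard error quoted above; the helper names and the integration settings are illustrative choices, not the implementation from Appendix 1.

```python
import math


def student_t_pdf(x, loc, scale, df):
    """Density of a location-scale Student-t distribution."""
    z = (x - loc) / scale
    const = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return const * (1.0 + z * z / df) ** (-(df + 1) / 2.0) / scale


def simpson(f, a, b, n=4000):
    """Composite Simpson rule; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def bf0plus_t_prior(delta_hat, se, loc, scale, df):
    """Approximate BF0+ for H0: delta = 0 versus a positive-truncated t prior."""
    def prior(d):
        return student_t_pdf(d, loc, scale, df)

    def lik(d):
        return math.exp(-0.5 * ((d - delta_hat) / se) ** 2)

    # Prior mass on delta > 0: half the mass lies above loc by symmetry,
    # plus the signed mass between 0 and loc.
    mass_positive = 0.5 + simpson(prior, 0.0, loc)
    # Integrated approximate likelihood under the truncated prior; the
    # likelihood is assumed negligible beyond 12 standard errors.
    upper = max(delta_hat, 0.0) + 12.0 * se
    marginal = simpson(lambda d: lik(d) * prior(d), 0.0, upper) / mass_positive
    return lik(0.0) / marginal


print(bf0plus_t_prior(delta_hat=-0.17, se=0.19, loc=0.350, scale=0.102, df=3))
```

With these rounded inputs the result should land close to the values reported for this example; small differences are to be expected from rounding and from the numerical integration settings.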
We found that the Savage-Dickey normal approximation Bayes factor closely corresponds to the numerical solution by Gronau et al. (2020) for all reasonable effect sizes, |δ̂| < 1. Larger sample sizes and effect sizes can result in notable underestimation of the evidence in favor of the alternative hypothesis H1. However, in such cases the evidence in favor of H1 is already so large, e.g., BF10 > 10^100, that a ten times lower Bayes factor obtained by the Savage-Dickey normal approximation does not change the qualitative assessment of evidence.

Parametric survival analysis
Next we compare the Savage-Dickey normal approximation to Laplace's method and bridge sampling for an informed Bayesian parametric survival analysis (Bartoš et al., 2022a). We repeat Bartoš et al.'s full sample and sequential re-analysis of a colon cancer treatment trial that examined the potential increase in disease-free survival due to adding Cetuximab to the standard sixth version of a FOLFOX regimen (Alberts et al., 2012).
The data set obtained from Project Data Sphere (Re3data.Org, 2019) contains 22.9% recurrences across 1247 participants in the standard treatment group and 22.9% recurrences across 1251 participants in the enhanced treatment group. We perform two analyses: (1) we specify an uninformed standard normal distribution on the log acceleration factor, log(AF) ∼ Normal(0, 1), and (2) we specify an informed directional hypothesis test, log(AF) ∼ Normal+(0.30, 0.15²), as performed by Bartoš et al. (2022a). Laplace's method should perform relatively well in the first scenario, with weak prior information, and relatively poorly in the second scenario, with strong prior information.
We focus solely on the log-normal survival model, which received the highest posterior model probability in the re-analysis.
For the nuisance parameters, we use informed prior distributions as specified in Table 1 in Bartoš et al. (2022a).
In the scenario with the standard normal prior distribution, i.e., weak prior information, the precise Bayes factor computed by bridge sampling on the complete data set shows an absence of evidence, BF10 = 1.3, as does the Savage-Dickey normal approximation, BF10 = 1.3, and Laplace's method, BF10 = 1.3. The left panel of Figure 2 compares the results of both approximations to bridge sampling when the data are analyzed as they accumulate over time. We see that especially early in the trial, when the number of observed events is low (i.e., 8 vs. 1, and 13 vs. 2 events in the experimental and control conditions, respectively), Laplace's method approximates the precise Bayes factor better than the Savage-Dickey normal approximation, possibly due to the impact of the informed prior distributions on the nuisance parameters, which is not accounted for by the Savage-Dickey normal approximation. Nevertheless, the Savage-Dickey normal approximation quickly converges to the precise Bayes factor as well.
Using the informed directional hypothesis test illustrates one of the limitations of Laplace's method; at the conclusion of the trial the maximum likelihood estimate of the log acceleration factor is negative, −0.19 with a standard error of 0.08, as it is throughout most of its sequential trajectory. In order to obtain an approximation using Laplace's method at all, we removed the lower truncation from its computation (the maximum likelihood estimate is not inside the region of interest). Using the complete data set, the precise computation using bridge sampling shows strong evidence in favour of the absence of the treatment effect, BF0+ = 61.8, a value that is closely approximated by the Savage-Dickey normal approximation, BF0+ = 63.2, but poorly approximated by the modified Laplace's method, BF0+ = 23.4. The right panel of Figure 2 tells a similar story when accumulating the evidence over time; the Savage-Dickey normal approximation quickly converges to the precise computation with increasing number of observed events, whereas the modified Laplace's method is unable to provide the expected answer.
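Both survival analyses use normal prior distributions, so the Savage-Dickey normal approximation can be written in closed form. The sketch below uses the rounded estimate (−0.19) and standard error (0.08) quoted above; for the truncated prior it multiplies the untruncated Bayes factor by the ratio of posterior to prior mass above zero, a standard identity for order-restricted hypotheses that is assumed here and not taken from the article. Function names are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def bf10_normal_prior(theta_hat, se, mu0, sigma0, theta0=0.0):
    """BF10 for H0: theta = theta0 versus H1: theta ~ Normal(mu0, sigma0^2)."""
    marginal_h1 = normal_pdf(theta_hat, mu0, math.sqrt(se ** 2 + sigma0 ** 2))
    return marginal_h1 / normal_pdf(theta_hat, theta0, se)


def bf_plus0_truncated_normal(theta_hat, se, mu0, sigma0):
    """BF+0 for H0: theta = 0 versus H+: theta ~ Normal+(mu0, sigma0^2)."""
    post_var = 1.0 / (1.0 / se ** 2 + 1.0 / sigma0 ** 2)
    post_mean = post_var * (theta_hat / se ** 2 + mu0 / sigma0 ** 2)
    post_mass = 1.0 - normal_cdf(0.0, post_mean, math.sqrt(post_var))
    prior_mass = 1.0 - normal_cdf(0.0, mu0, sigma0)
    return bf10_normal_prior(theta_hat, se, mu0, sigma0) * post_mass / prior_mass


theta_hat, se = -0.19, 0.08  # rounded values quoted in the text
print(bf10_normal_prior(theta_hat, se, 0.0, 1.0))                   # BF10, Normal(0, 1)
print(1.0 / bf_plus0_truncated_normal(theta_hat, se, 0.30, 0.15))  # BF0+, Normal+
```

Because the inputs are rounded, the output should be read as a plausibility check against the Bayes factors reported for this example, not as an exact reproduction.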
The sequential analysis also illustrates the practical utility of the Savage-Dickey normal approximation for Bayes factor design analyses, in which hundreds of Bayes factor trajectories need to be computed under different simulated data sets. For this single Bayes factor trajectory consisting of 60 Bayes factors, the computation required 3.4 and 3.5 CPU hours for the bridge sampling implementation, 0.1 and 0.4 CPU seconds for the Savage-Dickey normal approximation, and 8.2 and 7.7 CPU seconds for Laplace's method, for the uninformed and informed specifications, respectively.
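To indicate how such a design analysis might look, the sketch below simulates sequential Bayes factor trajectories for a deliberately simple one-sample design with unit-variance normal data. The data-generating model, the stopping thresholds, the prior, and all default values are hypothetical and unrelated to the survival model analyzed above; the point is only that each Bayes factor costs a handful of arithmetic operations.

```python
import math
import random


def log_bf10_normal_prior(theta_hat, se, mu0, sigma0, theta0=0.0):
    """Closed-form log BF10 under a Normal(mu0, sigma0^2) prior."""
    total_var = se ** 2 + sigma0 ** 2
    return (0.5 * math.log(se ** 2 / total_var)
            + (theta_hat - theta0) ** 2 / (2.0 * se ** 2)
            - (theta_hat - mu0) ** 2 / (2.0 * total_var))


def simulate_design(n_sims=200, n_min=10, n_max=100, true_effect=0.3,
                    mu0=0.3, sigma0=0.15, threshold=10.0, seed=1):
    """Proportion of simulated trajectories stopping for H1, for H0, or not at all."""
    rng = random.Random(seed)
    log_thr = math.log(threshold)
    stopped_h1 = stopped_h0 = 0
    for _ in range(n_sims):
        total = total_sq = 0.0
        for n in range(1, n_max + 1):
            y = rng.gauss(true_effect, 1.0)
            total += y
            total_sq += y * y
            if n < n_min:
                continue
            mean = total / n
            var = max((total_sq - n * mean * mean) / (n - 1), 1e-12)
            log_bf = log_bf10_normal_prior(mean, math.sqrt(var / n), mu0, sigma0)
            if log_bf >= log_thr:
                stopped_h1 += 1
                break
            if log_bf <= -log_thr:
                stopped_h0 += 1
                break
    undecided = n_sims - stopped_h1 - stopped_h0
    return {"H1": stopped_h1 / n_sims, "H0": stopped_h0 / n_sims,
            "undecided": undecided / n_sims}


print(simulate_design())
```

Replacing the closed-form function with a numerical integral would allow non-normal priors at modest additional cost.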

Meta-Regression
Finally we compare the Savage-Dickey normal approximation to Laplace's method and bridge sampling for a meta-regression analysis. In the original article, Gronau et al. (2017b) combined evidence across six replication studies concerning the (weak form of the) power posing hypothesis, stating that an expansive body posture can increase the subjective feeling of power. We extend the analysis by testing and adjusting for a moderator: participants' self-reported familiarity with the power posing hypothesis.
Table 1 summarizes the effect size estimates y and their standard errors se(y) split by participants' familiarity with the power posing hypothesis. We specify a fixed-effect meta-regression model, y ∼ Normal(α + βx, se(y)²), with the intercept, α, corresponding to the overall (unweighted) mean effect size and a moderator, β, accounting for the difference based on participants' self-reported familiarity (x; yes = 0.5, no = −0.5) with the power posing hypothesis. We use default independent Cauchy prior distributions for both parameters, α, β ∼ Cauchy(0, 1/√2) (Morey & Rouder, 2015), and perform a prior sensitivity analysis for the scale parameters of the prior distributions.
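For a fixed-effect meta-regression with known standard errors, the maximum likelihood estimates of α and β and their standard errors follow from weighted least squares. The sketch below assumes this standard result; the effect sizes and standard errors are hypothetical placeholders and are not the values from Table 1.

```python
import math


def fixed_effect_meta_regression(y, se, x):
    """Weighted least squares for y_i ~ Normal(alpha + beta * x_i, se_i^2).

    Returns (estimate, standard error) pairs for alpha and beta.
    """
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2
    alpha_hat = (swxx * swy - swx * swxy) / det
    beta_hat = (sw * swxy - swx * swy) / det
    return {"alpha": (alpha_hat, math.sqrt(swxx / det)),
            "beta": (beta_hat, math.sqrt(sw / det))}


# Hypothetical data: familiarity coded as 0.5 (yes) and -0.5 (no).
y = [0.30, 0.15, 0.25, 0.20]
se = [0.10, 0.12, 0.15, 0.11]
x = [0.5, -0.5, 0.5, -0.5]
print(fixed_effect_meta_regression(y, se, x))
```

Each (estimate, standard error) pair can then be passed on to the Savage-Dickey normal approximation for the corresponding parameter.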
Using the default prior distribution, the precise Bayes factor computed by bridge sampling shows strong evidence for the presence of the overall effect of power posing on perceived power, BF10 = 88.0 for α, and moderate evidence against moderation by participants' familiarity with the power posing hypothesis, BF10 = 0.22 for β. Essentially equivalent results are obtained by the Savage-Dickey normal approximation, BF10 = 87.9 for α and BF10 = 0.23 for β, and by Laplace's method, BF10 = 89.4 for α and BF10 = 0.23 for β. Figure 3 compares results of both approximations to bridge sampling for the prior sensitivity analysis for the effect of power posing on perceived power (left panel) and the moderation by participants' familiarity (right panel). For the overall effect of power posing the results of all three methods are virtually indistinguishable. For the moderation by participants' familiarity, however, Laplace's method performs poorly for small scales of the Cauchy prior distribution, that is, for informed prior distributions.
The prior sensitivity analysis again demonstrates the practical utility of the Savage-Dickey normal approximation. Computation of 40 Bayes factors for each parameter took 52.1 CPU seconds for bridge sampling, which is substantially more demanding than the computation of the Savage-Dickey approximation, 0.24 CPU seconds, or Laplace's method, 0.05 CPU seconds. The speed difference was compounded by the fact that the Savage-Dickey normal approximation requires the estimation of only a single model, which yields both maximum likelihood estimates that are then combined with their respective prior distributions, whereas the bridge sampling implementation requires the estimation of three Stan models.
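A prior sensitivity analysis of this kind amounts to re-evaluating a one-dimensional integral for each prior scale while the estimate and its standard error stay fixed. The sketch below does this for a zero-centered Cauchy prior; the estimate, the standard error, the grid of scales, and the integration range are hypothetical choices for illustration.

```python
import math


def simpson(f, a, b, n=4000):
    """Composite Simpson rule; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def cauchy_pdf(x, loc=0.0, scale=1.0):
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def bf10_cauchy_prior(theta_hat, se, scale, theta0=0.0, width=10.0):
    """BF10 = integral of L_a(theta) * g(theta) divided by L_a(theta0).

    Assumption: the integrand is negligible outside theta_hat +/- width * se.
    """
    def lik(t):
        return math.exp(-0.5 * ((t - theta_hat) / se) ** 2)

    marginal = simpson(lambda t: lik(t) * cauchy_pdf(t, 0.0, scale),
                       theta_hat - width * se, theta_hat + width * se)
    return marginal / lik(theta0)


theta_hat, se = 0.25, 0.08  # hypothetical estimate and standard error
for scale in [0.05, 0.1, 0.25, 1.0 / math.sqrt(2.0), 1.0, 2.0]:
    print(round(scale, 3), bf10_cauchy_prior(theta_hat, se, scale))
```

The model is never refitted inside the loop, which is the source of the speed advantage in sensitivity analyses.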

Concluding Comments
Popular approximations to the Bayes factor hypothesis test such as the BIC and Laplace's method fail in the case of informed prior distributions, arguably exactly the kind of scenario in which the demand for a Bayesian method is most acute. To overcome this limitation we outlined the Savage-Dickey normal approximation. This approximation is simple, accurate, and fast. Compared to the commonly used Laplace's method or BIC, the approximation allows researchers to test informed hypotheses instead of forcing them to perform different tests than they intended, or even to rely on more ad hoc methods. Compared to the gold-standard bridge sampling implementation, the approximation is several orders of magnitude faster, a reduction of computation time that is especially useful in Bayesian design analyses and prior sensitivity analyses. Furthermore, because the maximum likelihood estimate and its standard error are commonly reported, the Savage-Dickey normal approximation presents a particularly straightforward method to re-evaluate claims from the literature. The approximation also provides a sanity check for researchers wishing to implement precise Bayes factor alternatives to existing frequentist tests.
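As an illustration of such a re-evaluation, the sketch below starts from a ratio-scale estimate reported with a 95% confidence interval, recovers an estimate and standard error on the log scale, and then applies the closed-form approximation under a normal prior. The reported estimate, the interval, and the prior are hypothetical, and the recovery of the standard error assumes a symmetric Wald interval on the log scale.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_scale_summary(estimate, ci_lower, ci_upper, z=1.959964):
    """Log-scale estimate and standard error from a ratio and its 95% interval."""
    theta_hat = math.log(estimate)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z)
    return theta_hat, se


def bf10_normal_prior(theta_hat, se, mu0, sigma0, theta0=0.0):
    """BF10 for H0: theta = theta0 versus H1: theta ~ Normal(mu0, sigma0^2)."""
    marginal_h1 = normal_pdf(theta_hat, mu0, math.sqrt(se ** 2 + sigma0 ** 2))
    return marginal_h1 / normal_pdf(theta_hat, theta0, se)


# Hypothetical published result: ratio 0.80 with 95% CI [0.65, 0.98].
theta_hat, se = log_scale_summary(0.80, 0.65, 0.98)
print(theta_hat, se)
print(bf10_normal_prior(theta_hat, se, mu0=0.0, sigma0=0.5))
```

Only summary statistics from the original report are needed, so the raw data do not have to be available.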
The Savage-Dickey normal approximation is not without limitations. Specifically, the approximation does not apply to non-nested model comparison (e.g., different parametric families as in Bartoš et al., 2022a). Also, the approximation assumes that the prior distributions on the nuisance parameters do not strongly impact the posterior distribution of the focal parameter. Even though this is often the case (Jeffreys, 1961, pp. 249-251), there exist several applications of Bayesian inference with restrictive prior distributions on nuisance parameters to help regularize the estimates for the focal parameter (e.g., Maier et al., 2022; Bartoš et al., 2022b). In some cases, this limitation might be addressed by incorporating the restriction into the likelihood directly.
Finally, throughout the manuscript we used a normal distribution to approximate the likelihood. This might not always lead to acceptable results; in some cases, a log or logistic transformation of the focal parameter is advisable (Leonard, 1982; Mulder et al., 2020). In other cases, a completely different likelihood function might be used directly, e.g., binomial or Student-t. We also observed that the Savage-Dickey normal approximation can lose precision with increasing distance of the maximum likelihood estimate from the test value θ0. This is a consequence of the decreasing appropriateness of the approximate likelihood function in the extreme tails. However, in such cases the data are so diagnostic that the results will pass the "interocular traumatic test", attributed to Berkson (Edwards et al., 1963). The approximation still leads to qualitatively the same conclusion, although a more precise method should be used if high-precision results are demanded. Some of the above limitations may be re-phrased as advantages; for instance, the approximation does not require the analyst to specify prior distributions on nuisance parameters, which can be challenging. Furthermore, since the approximation concerns a single focal parameter, it does not fall prey to the Borel-Kolmogorov paradox (Consonni & Veronese, 2008). We believe that the Savage-Dickey normal approximation provides a straightforward and attractive alternative to other ways of approximating Bayes factors for nested models, especially in cases with informed prior distributions on the parameter of interest.

Figure 1: Distribution of mean cartoon funniness ratings in the pout and smile conditions. Data from Oosterwijk's laboratory in the replication study of the facial feedback hypothesis by Wagenmakers et al. (2016). Figure from JASP.

Figure 2: Comparison of different methods for computing Bayes factors for a sequential analysis of the survival analysis example. The solid black line corresponds to bridge sampling, representing the gold standard, the dashed blue line corresponds to the Savage-Dickey normal approximation, and the dashed red line corresponds to Laplace's method. Left panel: Bayes factor for the treatment effect, i.e., the log acceleration factor, under a weakly informed Normal(0, 1) prior distribution. Right panel: Bayes factor for the treatment effect, i.e., the log acceleration factor, under an informed Normal+(0.30, 0.15²) prior distribution. Laplace's method would be undefined for the right panel since the maximum likelihood estimate of the log acceleration factor is negative; therefore we modified the procedure by ignoring the truncation of the prior distribution.

Figure 3: Comparison of different methods for computing Bayes factors for prior sensitivity analysis of the meta-regression example. The solid black line corresponds to bridge sampling, representing the gold standard, the dashed blue line corresponds to the Savage-Dickey normal approximation, and the dashed red line corresponds to Laplace's method. Left panel: Bayes factor for the overall meta-analytic mean effect with varying scale of the prior Cauchy distribution. Right panel: Bayes factor for the moderation by participants' familiarity with the power posing hypothesis with varying scale of the prior Cauchy distribution.

Figure 4: JASP implementation of the Savage-Dickey normal approximation for Example 1: the two-sample t-test. The left panel of the graphical user interface contains the input, that is, the likelihood and the prior distribution of the null and alternative hypothesis. The right panel presents the output, that is, the Bayes factor in favour of the null hypothesis over the informed directional alternative hypothesis. Screenshot from JASP.

Table 1: Effect sizes and standard errors of the effect of power posing on perceived power across six replication studies, split by participants' familiarity with the power posing hypothesis.