Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?

We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox), the problem whereby the use of diffuse priors leads to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). We demonstrate this methodology for Bayesian hypothesis tests in the scenarios under which the one and two sample t-tests, and linear models, are typically derived. The resulting Bayesian test statistic takes the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses we show, for independent identically distributed regular parametric models, that Bayesian hypothesis tests using cake priors are Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. Lindley's paradox is also discussed. We argue that a true Lindley's paradox will only occur with small probability for large sample sizes.


Introduction
Determining appropriate parameter prior distributions is of paramount importance in Bayesian hypothesis testing. Bayesian hypothesis testing often centres around the concept of a Bayes factor, which was initially developed by Jeffreys (1935, 1961), and later popularized by Kass and Raftery (1995). The Bayes factor is simply the odds of the marginal likelihoods between two hypotheses and is analogous to the likelihood ratio statistic in classical statistics, where instead of maximizing the likelihoods with respect to the model parameters, the model parameters are marginalized out. In classical statistical theory testing a simple point null hypothesis against a composite alternative is routine. However, such hypothesis tests can pose severe difficulties in the Bayesian inferential paradigm where the Bayes factors may exhibit undesirable properties unless parameter prior distributions are chosen with exquisite care. This paper offers a solution to this difficulty.
Prior distributions can be chosen in an informative or uninformative fashion. Employing informative priors (either based on data from previously conducted experiments, or eliciting priors from subject matter experts) can be impractical, particularly when the number of parameters in the model is large. Furthermore, informative priors can be criticised on grounds that such priors are inherently subjective or may not let the data from the current experiment speak for itself. However, using alternative priors can also lead to problems.
One such problem occurs when using overly diffuse or flat improper priors. In the former case, as priors become more diffuse the hypothesis corresponding to the smaller model becomes increasingly favoured regardless of the evidence provided by the data. This problem occurs due to the normalizing constants of the priors dominating the expression for the Bayes factor, and is sometimes referred to as Bartlett's paradox (e.g., Liang et al., 2008) or the Jeffreys-Lindley paradox (e.g., Robert, 1993, 2014), named after the pioneering work of Jeffreys (1935, 1961), Lindley (1957), and Bartlett (1957), whose authors identified this and other related problems associated with Bayes factors. Discussions of this paradox and the related Lindley's paradox can be found in Aitkin (1991), Bernardo (1999), Sprenger (2013), Spanos (2013), and Robert (2014).
An extension of Bartlett's paradox occurs in the limit where diffuse priors become flat to the point of being improper. The use of flat improper priors gives rise to arbitrary constants in the numerator and denominator of the Bayes factor (see DeGroot, 1973).
Such arbitrary constants are problematic since, without suitable modification, they could be chosen by the analyst to suit any preconceived conclusions preferred, and as such, are not suitable for scientific purposes. Techniques for selecting the arbitrary constants in Bayes factors in an acceptable way when employing flat improper priors have been developed in several papers. Bernardo (1980) proposes to derive a reference prior for the null hypothesis by maximizing a measure of missing information. Spiegelhalter and Smith (1982), and Pettit (1992) use an imaginary data device leading to the arbitrary constants cancelling with other terms in the Bayes factor. A further approach to the problem of using diffuse priors was proposed by Robert (1993) who advocated reweighting the prior odds to balance against the parameter priors as prior hyperparameters become diffuse. O'Hagan (1995) considers the problem of using flat improper priors in the calculation of the Bayes factor by splitting the data into a training and testing set. The training set is used to construct an informative prior, which is then used to calculate the Bayes factor using the remaining portion of the data. These ideas have been refined in O'Hagan (1997), Berger and Pericchi (1996), and Berger and Pericchi (2001). A computational drawback of some of these approaches is that the same model is fit multiple times. For models where Bayesian inferential procedures are considered too slow for fitting a single model these approaches to Bayesian testing become infeasible from a practical viewpoint.
Other Bayesian hypothesis testing approaches abandon the Bayes factor altogether by constructing hypothesis testing criteria which involve the prior only through the parameter posterior distributions themselves. These include information criteria type approaches such as the Bayesian information criterion (BIC) and the deviance information criterion (DIC). The BIC or Schwarz's criterion uses a Laplace approximation where the prior term is assumed to be asymptotically negligible as the sample size grows (Schwarz, 1978). The DIC involves a linear combination of the log-likelihood evaluated at a suitably chosen Bayesian point estimator and the posterior expectation of the log-likelihood (Spiegelhalter et al., 2002). Under such a construction the DIC is not dominated by the prior as prior hyperparameters diverge. Similarly, posterior Bayes factors proposed by Aitkin (1991) are based on the posterior expectation of the likelihood function, rather than the joint likelihood (comprising the model likelihood and prior). Since this only involves the prior in the calculation of the posterior distribution, the prior does not dominate posterior Bayes factors. Berger and Pericchi (1996) criticized this approach because it employs a double use of the data that is not consistent with typical Bayesian logic.
An interesting alternative approach to Bayesian hypothesis testing is that suggested in Section 6.3 of Gelman et al. (2013), who discuss examining the posterior distribution of carefully chosen test statistics such that large values of a given test statistic provide evidence against the null hypothesis. This idea is explored more formally in Gelman et al. (1996) and gives rise to the concept of posterior predictive p-values: the probability that a test statistic of posterior predictive values is greater than the observed value of the test statistic.
Bayes factors in the context of linear model selection (Zellner and Siow, 1980; Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Fernández et al., 2001; Liang et al., 2008; Maruyama and George, 2011; Bayarri et al., 2012) and generalized linear model selection (Chen and Ibrahim, 2003; Hansen and Yu, 2001; Wang and George, 2007; Chen et al., 2008; Gupta and Ibrahim, 2009; Bové and Held, 2011; Hanson et al., 2014; Li and Clyde, 2015) have received an enormous amount of attention. While we defer discussion of the types of priors used in these contexts to Section 4.3 we will draw special attention to Liang et al. (2008), who consider several prior structures in the context of linear models. They employ Zellner's g-prior (Zellner and Siow, 1980; Zellner, 1986) for the regression coefficients, where g is a prior hyperparameter. They consider several approaches to choosing g, including setting g to various constants, selecting g via local and global empirical Bayes procedures, and placing a hyperprior on g. Their results suggest that in order for the resulting Bayes factors to be well behaved (including model selection consistent) a hyperprior needs to be placed on g.
In this paper we will construct a new class of priors inspired by the priors used in the context of linear and generalized linear models. This class of priors is constructed in such a way as to mimic Jeffreys priors (which have the desirable property that they are invariant under parameter transformations, Jeffreys, 1946) in the limit as a prior hyperparameter g diverges. In order to circumvent a Bartlett-like paradox from occurring, the rate at which g diverges is different under the null and alternative hypotheses, so that problematic terms cancel in both the numerator and denominator of the Bayes factor. Bayes factors using cake priors have several desirable properties. In the examples we consider the Bayes factor can be expressed as a difference in BIC values, i.e., a penalized version of the likelihood ratio test (LRT) statistic. Using properties of the LRT statistic we show that Bayesian hypothesis tests are Chernoff-consistent in the sense of (Shao, 2003, Section 2.13), i.e., they achieve asymptotically zero type I and type II errors as the sample size diverges. In contrast, classical hypothesis testing procedures are usually chosen to have a fixed type I error and are consequently not Chernoff-consistent. In this respect our Bayesian hypothesis tests are superior to classical procedures whose type I error is held fixed. Due to the above properties we call the priors we develop "cake priors" since they allow the use of diffuse priors (having one's cake) while being able to perform sensible statistical inferences (eating it too). We will also discuss Lindley's paradox in the context of cake priors and argue that generally Lindley's paradox will only occur with vanishingly small probability for large samples.
In Section 2 we reintroduce Bayes factors, including the interpretation of Bayes factors.
In Section 3 we discuss more specifically the problems associated with Bayes factors, including both Lindley's and Bartlett's paradoxes. In Section 4 we describe cake priors and illustrate their use in the context of one sample tests for equal means (with unknown variance), two sample tests for equal means (assuming unequal variances), linear models, and one sample tests for equal means (with known variance). In Section 5 we derive some asymptotic theory for our proposed Bayesian hypothesis tests. In Section 6 we discuss the relationship between cake priors and improper priors and discuss how arbitrary constants can be introduced into the Bayes factor. In Section 7 we take a closer look at the interpretation of Bayes factors in light of our findings. In Section 8 we conclude.

Bayes factors
Bayes factors are a key concept in Bayesian hypothesis testing introduced by Jeffreys (1935, 1961), although a similar concept was also developed independently by Good (1952). Suppose that we have observed the data vector x = (x_1, ..., x_n)^T consisting of samples from P = { p_i(·) : i = 1, ..., n }, and we have two hypotheses H_0 and H_1 representing two models P_j = { p_ij(·|θ_j, H_j) : i = 1, ..., n }, j = 0, 1, describing two potential distributions from which x was drawn, i.e.,

H_0 : P ∈ P_0 versus H_1 : P ∈ P_1.   (1)

The models could potentially have distinct parameters from two distinct models and the models need not be nested. Let p(θ_j|H_j) be the prior distribution under hypothesis H_j for j = 0, 1. The Bayes factor is then defined as

BF_01 = p(x|H_0) / p(x|H_1), where p(x|H_j) = ∫ p(x|θ_j, H_j) p(θ_j|H_j) dθ_j,

and where integrals are replaced with combinatorial sums for discrete random variables. Kass and Raftery (1995) offer an interpretation of λ_Bayes = 2 ln(BF_10), where BF_10 = 1/BF_01, in Table 1 in terms of strength of evidence against the null hypothesis. In Section 7 we will take a closer look at the interpretation of Bayes factors in light of the analysis in the current paper.
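To make the definition above concrete, the following sketch computes a Bayes factor by numerical integration for a simple normal-mean example and checks it against the closed form. The prior N(µ_0, τ²), the sample size, and the data-generating values are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy import integrate, stats

# Illustrative sketch: Bayes factor for H0: mu = mu0 vs H1: mu != mu0
# with x_i ~ N(mu, sigma^2), sigma^2 known, and an assumed prior
# mu | H1 ~ N(mu0, tau^2). All numerical values below are arbitrary.
rng = np.random.default_rng(0)
n, mu0, sigma, tau = 50, 0.0, 1.0, 1.0
x = rng.normal(0.3, sigma, size=n)
xbar = x.mean()

# xbar is sufficient for mu, so the Bayes factor based on xbar equals the
# Bayes factor based on the full data (likelihood constants cancel).
m0 = stats.norm.pdf(xbar, mu0, sigma / np.sqrt(n))
integrand = lambda mu: (stats.norm.pdf(xbar, mu, sigma / np.sqrt(n))
                        * stats.norm.pdf(mu, mu0, tau))
m1, _ = integrate.quad(integrand, -10, 10)

BF01 = m0 / m1                     # Bayes factor in favour of H0
lam_bayes = 2 * np.log(1 / BF01)   # lambda_Bayes = 2 ln BF10

# Sanity check: the integral has a closed form, xbar | H1 ~ N(mu0, sigma^2/n + tau^2).
m1_exact = stats.norm.pdf(xbar, mu0, np.sqrt(sigma**2 / n + tau**2))
print(BF01, lam_bayes, abs(m1 - m1_exact))
```

The quadrature and closed-form marginal likelihoods agree to numerical precision, illustrating that the Bayes factor is simply a ratio of prior-averaged likelihoods.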
For the examples we consider, using the cake priors described later, the quantity λ_Bayes will turn out to be a penalized version of λ_LRT = −2[ℓ_0(θ̂_0) − ℓ_1(θ̂_1)] (the LRT statistic for the hypotheses in (1), where ℓ_j(θ_j) = ln p(x|θ_j, H_j) and the θ̂_j's are the MLEs under H_j), given by

λ_Bayes = λ_LRT − ν ln(n), up to additive O(1) or o(1) terms,   (2)

depending on the example, where ν is the difference in the number of parameters between H_1 and H_0. Intuitively one might expect λ_Bayes and λ_LRT to be related since both approaches are based on a ratio of likelihoods, albeit different likelihoods.
Table 1: Interpretation of Bayes factors from Kass and Raftery (1995).

λ_Bayes  | BF_10    | Strength of evidence
0 to 2   | 1 to 3   | not worth more than a bare mention
2 to 6   | 3 to 20  | positive
6 to 10  | 20 to 150| strong
> 10     | > 150    | very strong

Paradoxes in Bayesian hypothesis testing
Problems with Bayesian hypothesis testing based on Bayes factors, for particular combinations of hypotheses and priors, have been identified as early as 1935 by Jeffreys (1935), and later by Lindley (1957), and Bartlett (1957). As we will see for particular hypothesis tests, when parameter priors are not chosen with care, the conclusions based on Bayes factors will not be sensible. To give some context for the ensuing discussion we will now consider the hypothesis testing problem introduced by Lindley (1957) in order to illustrate potential problems.
Lindley's example: Consider the hypothesis test where the sample is modelled via x_i|µ ∼ N(µ, σ²), 1 ≤ i ≤ n independently, where µ and σ² are the mean and variance parameters respectively. Here µ is an unknown value to be estimated and σ² is a fixed known constant. Suppose that we wish to perform the hypothesis test

H_0 : µ = µ_0 versus H_1 : µ ≠ µ_0,   (3)

where µ_0 is a known constant. Under H_0 the values of all model parameters are fixed (so that under H_0 the model has zero unknown parameters), i.e., H_0 is a simple point null hypothesis. Suppose that for H_1 we employ the prior µ|H_1 ∼ N(µ_0, τ²) where the prior variance τ² is a known constant. The Bayes factor with the stated prior on µ is

BF_01 = (1 + nτ²/σ²)^{1/2} exp{ −(z(x)²/2) · nτ²/(σ² + nτ²) },   (4)

where z(x) = √n(x̄ − µ_0)/σ is the standard z-test statistic (see Bernardo, 1999). The p-value for this test is P(χ²_1 > z(x)²).
If we were to choose µ|H 1 as above then Lindley (1957) identified the following problem.
• Problem I: For any fixed p-value as n → ∞ we have BF 01 → ∞.
Suppose that the observed value of z(x) is large so that, for any reasonably chosen level α, the typical frequentist approach would reject the null hypothesis. For this value of z(x) a Bayesian procedure based on the above Bayes factor would prefer the null hypothesis for a sufficiently large n, drawing a contradiction between the two inferential paradigms.
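Problem I can be seen numerically. The sketch below holds z(x) fixed at the 5% critical value while n grows; σ² = τ² = 1 are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of Problem I: hold the z-statistic (equivalently the p-value) fixed
# and let n grow; the Bayes factor BF01 from Lindley's example then diverges,
# favouring H0 no matter how significant the frequentist result looks.
z, sigma2, tau2 = 1.96, 1.0, 1.0   # z fixed at the 5% critical value

def bf01(n):
    # BF01 = sqrt(1 + n tau^2/sigma^2) * exp(-z^2/2 * n tau^2/(sigma^2 + n tau^2))
    r = n * tau2 / sigma2
    return np.sqrt(1 + r) * np.exp(-0.5 * z**2 * r / (1 + r))

for n in [10**2, 10**4, 10**6, 10**8]:
    print(n, bf01(n))   # grows roughly like sqrt(n)
```

The exponential factor is bounded once n is large, so BF_01 is eventually dominated by the √(1 + nτ²/σ²) term.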
We now consider a second example posed by Sprenger (2013).
Sprenger's example: Jahn et al. (1987) used electronic and quantum-mechanical random event generators with visual feedback; the subject with alleged psychokinetic ability tries to "influence" the generator. The number of "successes" was s = 52,263,470 and the number of trials was n = 104,490,000. Assuming independence of the trials we model x_i|ρ ∼ Bernoulli(ρ), 1 ≤ i ≤ n, and test

H_0 : ρ = 1/2 versus H_1 : ρ ≠ 1/2,   (5)

so that a rejection of H_0 provides evidence that the subject has psychokinetic ability. Using the data, a classical hypothesis testing approach leads to a p-value approximately equal to 0.0003, leading to a rejection of the null hypothesis at the α = 0.05 cut-off. A 95% confidence interval for ρ is (0.50008, 0.50027). A standard Bayesian hypothesis test using the prior ρ ∼ Beta(1/2, 1/2) (the Jeffreys prior) leads to

λ_Bayes = 2 ln Beta(1/2 + s, 1/2 + n − s) − 2 ln(π) + 2n ln(2),   (6)

where s = Σ_{i=1}^n x_i. For the Bayesian test λ_Bayes ≈ −5.86, which implies the null model is preferred and an apparent contradiction between inferential paradigms.
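The reported numbers can be checked directly. The sketch below recomputes the approximate p-value (via the normal approximation, our simplification) and λ_Bayes for Sprenger's example using the log-beta function.

```python
import numpy as np
from scipy import special, stats

# Reproducing the numbers reported in the text for Sprenger's example:
# x_i ~ Bernoulli(rho), H0: rho = 1/2 vs H1: rho != 1/2, with the
# Jeffreys prior rho ~ Beta(1/2, 1/2) under H1.
s, n = 52_263_470, 104_490_000

# Classical two-sided test based on the normal approximation.
z = (s - n / 2) / np.sqrt(n / 4)
p_value = 2 * stats.norm.sf(abs(z))

# lambda_Bayes = 2 ln Beta(1/2 + s, 1/2 + n - s) - 2 ln(pi) + 2 n ln(2)
lam_bayes = (2 * special.betaln(0.5 + s, 0.5 + n - s)
             - 2 * np.log(np.pi) + 2 * n * np.log(2))
print(z, p_value, lam_bayes)
```

The p-value is roughly 0.0003 while λ_Bayes is roughly −5.86, reproducing the apparent frequentist/Bayesian disagreement described in the text.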

Resolving Lindley's paradox
We will now resolve Lindley's paradox in both of the above examples.

Resolving Lindley's example:
We argue that Problem I for Lindley's example only occurs because it is assumed that the p-value is held fixed, and that a true Lindley's paradox only occurs with vanishingly small probability as n → ∞. The p-value cannot be held fixed as n → ∞ as its behaviour depends on the data generating process. Let X = (X_1, ..., X_n)^T be a random sample. Consider the value of λ_Bayes for Lindley's example as a function of this random sample, i.e.,

λ_Bayes(X) = z(X)² · nτ²/(σ² + nτ²) − ln(1 + nτ²/σ²).   (7)

The first term on the right-hand side of (7) depends on the data generating process for X, whereas the second term is O(ln(n)). Consider the two cases:

1. Suppose the data is generated from H_0. Then z(X)² ∼ χ²_1 = O_p(1) and the O(ln(n)) term dominates. Hence, as n → ∞ we have λ_Bayes(X) → −∞ implying P(T(X) = 0) → 1, i.e., the null hypothesis is preferred.
2. Suppose the data is generated from H_1 with µ ≠ µ_0. Then z(X)² ∼ χ²_1(nδ²), where δ = (µ − µ_0)/σ and χ²_ν(λ) is the non-central chi-squared distribution with degrees of freedom ν and non-centrality parameter λ. Then λ_Bayes(X) → ∞ as n → ∞, implying P(T(X) = 1) → 1, i.e., the alternative hypothesis is preferred.
Note that 1. implies that a test based on the above Bayes factor has vanishing type I error as n → ∞ and 2. implies that the Bayesian test is consistent in the sense of (Lehmann, 2004, Section 3.3). Combining 1. and 2. implies that the test is Chernoff consistent (see Section 5 for a formal definition).

Resolving Sprenger's example: Using properties of the beta function, the gamma function, and Stirling's approximation leads to approximating (6) by

λ_Bayes ≈ λ_LRT − ln(n) − ln(π/2),

where λ_LRT is the LRT statistic corresponding to the hypotheses (5). Again, Lindley's paradox occurs here if we consider λ_LRT(x) (or equivalently the p-value) to be fixed. If λ_LRT is held fixed and n diverges then the null hypothesis will be preferred in the limit.
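A small simulation sketch of this behaviour, using the approximation above together with the normal approximation to the binomial; the sample size and replicate count below are illustrative and far smaller than those used in the paper.

```python
import numpy as np

# Simulation sketch of Chernoff consistency for the Bernoulli test, using the
# approximation from the text: lambda_Bayes ~ lambda_LRT - ln(n) - ln(pi/2),
# with lambda_LRT ~ z^2 under the normal approximation.
rng = np.random.default_rng(1)
n, reps = 10**6, 500   # illustrative, much smaller than in the paper

def prefers_H1(rho):
    s = rng.binomial(n, rho, size=reps)
    z2 = (s - n / 2)**2 / (n / 4)
    lam_bayes = z2 - np.log(n) - np.log(np.pi / 2)
    return np.mean(lam_bayes > 0)

typeI = prefers_H1(0.5)    # empirical type I error: near 0
power = prefers_H1(0.51)   # empirical power under a clear alternative: near 1
print(typeI, power)
```

Under H_0 the threshold ln(n) + ln(π/2) grows without bound, so the type I error vanishes; under a fixed alternative z² grows linearly in n, so the type II error vanishes as well.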
Let X = (X_1, ..., X_n)^T be a random sample. We will later show (see Section 5) that the required conditions hold so that the test based on λ_Bayes is Chernoff consistent. Figure 1 illustrates the empirical probabilities of rejecting the null hypothesis (for the frequentist test at the 5% level) or preferring the alternative hypothesis (for the Bayesian test) based on simulating 10^6 datasets with the true value of ρ in the set {0.5, 0.5001, 0.5002, 0.5003} for n on a grid from n = 10^6.5 to n = 10^9. The vertical line in Figure 1 indicates the actual value of n in Sprenger's example and the dashed grey line illustrates the estimated empirical probability that the two tests disagree. In Figure 1 we see that when H_0 is false both tests reject the null as n or ρ grows. When ρ = 0.5 the frequentist test rejects the null model 5% of the time by design, while the Bayesian test prefers the alternative model with very low probability. Furthermore, at the actual value of n in the experiment the disagreement between the frequentist and Bayesian tests could have occurred by chance with relatively high probability. However, for much larger n such a disagreement will only occur with low probability when H_1 is true.
We now note for Lindley's example that the test based on λ_Bayes is an LRT with an extremely small level α given by α = P[χ²_1 > {1 + σ²/(nτ²)} ln(1 + nτ²/σ²)] (when τ² is large), while for Sprenger's example the test based on λ_Bayes is also an LRT with asymptotic level α = P[χ²_1 > ln(n) + ln(π/2)]. Hence, we note, as was also argued by Naaman (2016) and noted by Lindley (1957), that Lindley's paradox can also be resolved by letting the level α of the frequentist test tend to zero as n → ∞. We discuss Lindley's paradox more generally in Section 5.2. The above expression for the level of the test for Lindley's example draws attention to a second problem for Lindley's example which does not occur in Sprenger's example.
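The implied levels can be computed directly. The sketch below uses the n from Sprenger's example and illustrative values σ² = τ² = 1 for Lindley's example.

```python
import numpy as np
from scipy import stats

# The Bayesian tests above are LRTs with a sample-size-dependent level.
# For Sprenger's example the asymptotic level is alpha = P[chi2_1 > ln(n) + ln(pi/2)].
n = 104_490_000
alpha_sprenger = stats.chi2.sf(np.log(n) + np.log(np.pi / 2), df=1)

# For Lindley's example: alpha = P[chi2_1 > (1 + sigma^2/(n tau^2)) ln(1 + n tau^2/sigma^2)],
# here with illustrative sigma^2 = tau^2 = 1.
sigma2, tau2 = 1.0, 1.0
r = n * tau2 / sigma2
alpha_lindley = stats.chi2.sf((1 + 1 / r) * np.log(1 + r), df=1)

print(alpha_sprenger, alpha_lindley)   # both far below the conventional 0.05
```

Both implied levels are on the order of 10^-5 at this sample size, making vivid how much stricter the Bayesian tests are than a fixed 5% frequentist test.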
• Problem II: For any fixed data, as τ² → ∞ we have BF_01 → ∞.

This problem occurs because as the prior variance τ² increases the level of the test decreases.
Problem II is referred to as Bartlett's paradox in Liang et al. (2008) and the Jeffreys-Lindley paradox in Robert (2014). Bartlett's paradox is paradoxical since as τ 2 becomes large the prior on µ becomes increasingly vague regarding the location of µ. However, in the attempt to be vague about the location of µ the prior becomes "informative" in favouring H 0 as the preferred hypothesis, again, regardless of the evidence provided by the data. Unlike Lindley's paradox, we believe that Bartlett's paradox is a real problem in practice since the use of diffuse priors can sometimes lead to testing procedures with extremely small power. Our proposed cake priors described in the next section circumvent this problem.

Cake priors
Consider the general hypotheses (1). Let d_0 and d_1 be the dimensions of θ_0 and θ_1 respectively. For the time being we will assume that 0 < d_0 ≤ d_1 (later we will consider the case where the null model has zero parameters). Define the observed information matrix J(θ) = −∂²ℓ(θ)/∂θ∂θ^T and the Fisher information matrix I(θ) = E_{x|θ}[J(θ)] respectively. Define the mean observed information matrix as J̄(θ) = n^{−1} J(θ) and the mean expected information matrix as Ī(θ) = n^{−1} E_{x|θ}[J(θ)]. We will denote the Fisher information matrix under the null and alternative hypotheses as I_0(θ_0) and I_1(θ_1), with similar use of subscripts to denote the quantities J, Ī and J̄. We define a Jeffreys prior as any density for θ such that p(θ) ∝ |I(θ)|^{1/2}. We construct cake priors using the following ingredients:

1. Define the priors

p(θ_j|H_j) = (2π g_j)^{−d_j/2} |P_j(θ_j)|^{1/2} exp{ −θ_j^T P_j(θ_j) θ_j / (2 g_j) },   j = 0, 1,

where P_j(θ_j) is a prior precision matrix (assumed to be full rank). For all of the examples considered in this paper we will use P_j(θ_j) = Ī_j(θ_j).

2. Set g_j = h^{1/d_j} for j = 0, 1, where h > 0 is a common hyperparameter.
3. Calculate the Bayes factor as

BF_01 = lim_{h→∞} p(x|H_0) / p(x|H_1),   (8)

where p(x|H_j) = ∫ p(x|θ_j, H_j) p(θ_j|H_j) dθ_j depends on h through g_j = h^{1/d_j}. When P_j(θ_j) ∝ Ī_j(θ_j), j = 0, 1, (8) leads to a Bayes factor, in the limit as g_j → ∞, that would have been obtained if a Jeffreys prior were used.
This is because the priors p(θ_j|H_j) are Jeffreys priors in the limit as g_j → ∞. Letting g_j → ∞ would be problematic if not for step 2, which leads to certain terms involving h cancelling in the Bayes factor.
As h → ∞, the priors on θ j are made diffuse, but at a rate that depends on the d j 's.
We will now give some intuition for how cake priors avoid Bartlett's paradox via the following heuristic argument. Let P_j(θ_j) ≡ P_j, i.e., the prior precision matrices are constant, and consider λ_Bayes (which depends on g_0 and g_1 rather than h). Expanding the prior exponents via a Taylor series argument in g_0 and g_1, ignoring the dependency of the O(g_j^{−1}) terms on the θ_j's, using Laplace's method on the numerator and denominator of the resulting ratio of integrals, and setting P_j = J̄_j(θ̂_j) (where the θ̂_j's are the MLEs of the θ_j's), leads to

λ_Bayes = λ_LRT − d_1 ln(n g_1) + d_0 ln(n g_0) + O(g_0^{−1} + g_1^{−1} + n^{−1}).   (9)

The O(n^{−1}) error follows from the relative error of Laplace's method applied to the numerator and denominator (Tierney et al., 1989; Kass et al., 1990). Suppose that g_0 = g_1 = g. Then, using the asymptotic χ²_ν distribution of λ_LRT, the level of the test using (9) is α = P[χ²_ν ≥ ν ln(ng)], so that again we see that the power of the test goes to 0 as g → ∞. Here also we see that setting g_0 to be a large constant (making the prior for θ_0 diffuse) leads the test to prefer H_0, while making g_1 large leads to preferring H_1.
Hence, the relative rates that g 0 and g 1 diverge must be considered.
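The heuristic can be illustrated numerically. In the sketch below, the values of λ_LRT, ν, and n are arbitrary; the point is only the sign flip as g grows.

```python
import numpy as np

# Heuristic from the text: with constant prior precision matrices and
# g0 = g1 = g, lambda_Bayes ~ lambda_LRT - nu * ln(n g). Holding the data
# (lambda_LRT) fixed and inflating g flips the decision to H0, which is
# Bartlett's paradox. All numerical values are illustrative.
lam_lrt, nu, n = 10.0, 1, 100

def lam_bayes(g):
    return lam_lrt - nu * np.log(n * g)

print(lam_bayes(1.0))    # positive: H1 preferred
print(lam_bayes(1e8))    # negative: H0 preferred, regardless of the data
```

This is why the relative divergence rates of g_0 and g_1 matter: tying both to a single h via g_j = h^{1/d_j} makes the h-dependent terms cancel instead of swamping the likelihood ratio.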
Lastly, let us briefly discuss the choice of P_j. Setting P_j = I leads to λ_Bayes = λ_LRT − ν ln(n) + ln|J̄_0(θ̂_0)| − ln|J̄_1(θ̂_1)| + O(h^{−1/d_0} + h^{−1/d_1} + n^{−1}). This would be undesirable because of the additional computational burden of the log-determinant terms (which can be considerable in some contexts), and because if λ_LRT ≈ ν ln(n) we would prefer the model with smaller ln|J̄_j(θ̂_j)|, i.e., larger standard errors. For this reason we would like P_j ≈ J̄_j(θ̂_j) so that at least approximate cancellation occurs.

One sample test for equal means (with unknown variance)
Consider the hypothesis test (3) where x_i|µ, σ² ∼ N(µ, σ²), 1 ≤ i ≤ n, independently, where µ and σ² are the mean and variance parameters respectively. Suppose now that both µ and σ² are unknown parameters to be estimated (unlike Lindley's example in Section 3 where σ² was assumed to be known).
We cannot directly use the methodology outlined in Section 4 since σ² is constrained to be positive. To handle this complication we use the transformation σ² = exp(s). Under this transformation the mean expected information matrices become Ī_0(s) = 1/2 and Ī_1(µ, s) = diag(exp(−s), 1/2). Using the steps of Section 4 under this transformation we obtain the priors s|H_0 ∼ N(0, 2g_0) under H_0, and µ|s, H_1 ∼ N(0, g_1 exp(s)) and s|H_1 ∼ N(0, 2g_1) under H_1. Transforming back to the σ² parametrisation gives σ²|H_0 ∼ LN(0, 2g_0), with density

p(σ²|H_0) = (σ²)^{−1} (4π g_0)^{−1/2} exp{ −(ln σ²)² / (4g_0) },

which is a Jeffreys prior for σ² in the limit as g_0 → ∞.
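That the LN(0, 2g_0) prior flattens toward the Jeffreys prior p(σ²) ∝ 1/σ² can be checked numerically: σ²·p(σ²) should become constant as g_0 grows. The evaluation points below are arbitrary.

```python
import numpy as np

# The LN(0, 2 g0) prior on sigma^2 has
#   sigma^2 * p(sigma^2) = (4 pi g0)^(-1/2) exp(-(ln sigma^2)^2 / (4 g0)),
# which becomes flat (i.e. p(sigma^2) ∝ 1/sigma^2, the Jeffreys prior)
# as g0 -> infinity. We check flatness at two illustrative points.
def scaled_density(v, g0):
    return np.exp(-np.log(v)**2 / (4 * g0)) / np.sqrt(4 * np.pi * g0)

def flatness_ratio(g0, v1=0.5, v2=20.0):
    # ratio of sigma^2 * p(sigma^2) at two points; -> 1 as g0 grows
    return scaled_density(v1, g0) / scaled_density(v2, g0)

print(flatness_ratio(1.0), flatness_ratio(100.0), flatness_ratio(10000.0))
```

The ratio approaches 1 as g_0 increases, consistent with convergence to the (improper) Jeffreys prior up to normalization.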
The marginal distributions of x given H_0 and H_1 are obtained by integrating the likelihood against these priors. Suppose that we were to use g = g_0 = g_1 and let g → ∞. Then we would see a manifestation of Bartlett's paradox where the null hypothesis is favoured, since BF_01 → ∞ as g → ∞. If we instead use g_0 = h and g_1 = h^{1/2} then the Bayes factor simplifies to an expression that can be evaluated using univariate quadrature or other methods for any fixed h > 0.
As a computational short-cut, if the integrand is a monotonic function of h with a well defined limit as h → ∞, we will write expressions using the notation "=_{h⇒∞}", which denotes equality in the limit as h → ∞ after terms related to h cancel in the numerator and denominator of the Bayes factor, or vanish as h diverges. The above expressions can then be simplified using standard results to reach the same expression for λ_Bayes as above.
We conduct the following short simulation study to illustrate the differences between the LRT and the Bayesian test for this problem. Letting µ_0 = 0 we simulate data from x_i ∼ N(µ_true, 1), 1 ≤ i ≤ n. After simulating 10^6 such datasets for each value of µ_true in the set {0, 0.05, 0.25, 0.5} and each n on a grid from n = 15 to n = 1000, we plot in Figure 2 the empirical probabilities of rejecting the null hypothesis (for the LRT at α = 0.05) or preferring the alternative hypothesis (for the Bayesian test).
From Figure 2 we see empirically that the type I error of the Bayesian test tends to 0 as n grows when H_0 is true, whereas the LRT has, by design, a type I error of 0.05. When H_1 is true and ln(n) < χ²_{1,α} the Bayesian test is more powerful than the LRT, and when H_1 is true and ln(n) > χ²_{1,α} the LRT is more powerful than the Bayesian test. When µ_true ∈ {0.25, 0.5} both tests appear to have power tending to 1 as n grows. Lastly, for the case µ_true = 0.5 when ln(n) > χ²_{1,α} both tests have very similar power.
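A reduced version of this simulation can be sketched as follows, assuming the penalized form λ_Bayes = λ_LRT − ln(n) for this test (ν = 1); the replicate counts are far smaller than the paper's 10^6.

```python
import numpy as np

# Simulation sketch: the Bayesian test prefers H1 when lambda_LRT > ln(n),
# while the LRT rejects at level 0.05 when lambda_LRT > chi2_{1,0.05} ~ 3.84.
# Here lambda_LRT = n ln(1 + t^2/(n-1)) for the one sample t-test of H0: mu = 0.
rng = np.random.default_rng(2)

def rejection_rates(mu_true, n, reps=2000):
    x = rng.normal(mu_true, 1.0, size=(reps, n))
    xbar, s2 = x.mean(axis=1), x.var(axis=1, ddof=1)
    t2 = n * xbar**2 / s2
    lam_lrt = n * np.log(1 + t2 / (n - 1))
    return np.mean(lam_lrt > 3.84), np.mean(lam_lrt > np.log(n))

lrt_t1, bayes_t1 = rejection_rates(0.0, 1000)   # empirical type I errors
lrt_pow, bayes_pow = rejection_rates(0.5, 100)  # empirical powers under H1
print(lrt_t1, bayes_t1, lrt_pow, bayes_pow)
```

With n = 1000 the Bayesian threshold ln(n) ≈ 6.9 exceeds 3.84, so the Bayesian type I error is well below 0.05, while both tests have power near 1 at µ_true = 0.5.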
Two sample test for equal means (assuming unequal variances)

For the two sample test for equal means (assuming unequal variances), similar calculations involve the function ξ(x) = −(1/2) ln(x) + O(x^{−1}). Using this, λ_Bayes simplifies to a penalized LRT statistic. Note that the coefficient of ln(n) is d_1 − d_0 = 2, which is the degrees of freedom of the corresponding LRT.
It is important to note that all of the constant terms have cancelled from the asymptotic approximation for λ_Bayes. This has been achieved by incorporating the (n/n_0) and (n/n_1) factors in the priors for µ_0 and µ_1, and the (2n/n_0) and (2n/n_1) factors in the priors for σ²_0 and σ²_1. Without these factors, cancellation of O(1) and larger terms in the expression for λ_Bayes would not occur.
The empirical probabilities of rejecting the null (in the LRT case) or preferring the alternative (in the Bayesian test) are illustrated in Figure 3. Note that under H 0 the type I error approaches zero as n → ∞ for the Bayesian test, and under H 1 the type II error approaches zero as n → ∞ for both the Bayesian and LRT tests. When H 0 is true the LRT has a fixed 5% type I error.

Linear models
We will now consider hypothesis testing for linear models. Consider the base model

y | α, β, σ² ∼ N(1α + Xβ, σ² I),

where y is a response vector of length n, α is an intercept, β is a coefficient vector of length p, σ² is a positive scalar, X is a full-rank n by p matrix of covariates, and I is the identity matrix of appropriate dimension. In order to simplify some calculations we will transform y and X so that y and the columns of X are standardized, i.e., ȳ = 0, ‖y‖² = y^T y = n, X_j^T 1 = 0, and ‖X_j‖² = n, where X_j is the jth column of X. Let γ be a binary vector of length p, and let X_γ be the submatrix of X comprised of the columns of X whose corresponding elements of γ are non-zero. Consider the hypothesis test

H_0 : γ = γ_0 versus H_1 : γ = γ_1,

where γ_0 and γ_1 denote the models under the null and alternative hypotheses respectively, with 0 ≤ |γ_0| ≤ |γ_1|.
To simplify exposition for this example we will only use cake priors for α and β γ .
Since σ² is a parameter common to all models we can use the typical improper (Jeffreys) prior for σ² given by p(σ²) ∝ (σ²)^{−1} I(σ² > 0). This choice has been formally justified in Berger et al. (1998). Cake priors can be used for all parameters in this example, but the working is lengthy and unnecessarily obfuscates the exposition.
Note that as h → ∞ the parameter posteriors become α|y, γ ∼ t_n(0, σ̂²_γ/n), β_γ|y, γ ∼ t_n(β̂_γ, σ̂²_γ (X_γ^T X_γ)^{−1}), and σ²|y, γ ∼ IG(n/2, n σ̂²_γ/2), where β̂_γ and σ̂²_γ are the MLEs corresponding to model γ. We will not provide numerical examples due to the close relationship between our Bayes factors and the BIC, and the fact that nearly every paper on model selection for linear models makes comparisons with the BIC. We direct the interested reader to any of the papers in the discussion below, all of which use the BIC as a model selection criterion in their comparisons.
There are four main differences between the priors used here and those used in the literature for linear models. The first difference concerns the prior on α, for which the typical choice is the Jeffreys prior p(α) ∝ 1, advocated in Berger et al. (1998). If we were to use this prior and only use cake priors for β_γ then p(g_0|γ_0; h) = δ(g_0; h^{1/|γ_0|}) instead of p(g_0|γ_0; h) = δ(g_0; h^{1/(1+|γ_0|)}).
The consequence of this would be that the Bayes factor for the null model (where γ = 0) would become problematic to calculate.
The second difference is in the choice of prior for β. Most Bayesian approaches to model selection for linear models use the Zellner g-prior β_γ|σ², γ ∼ N(0, g σ² (X_γ^T X_γ)^{−1}) instead of the prior for β_γ in (12). This difference is subtle. Bayarri et al. (2012) advocate that priors for β should remain proper and not degenerate to a point mass. If we were to treat X_γ as random then, under mild conditions, n^{−1} X_γ^T X_γ converges almost surely to a positive definite matrix, suggesting that our prior for β_γ does not degenerate to a point mass. Lastly, (12) and (13) simplify marginal likelihoods since terms involving determinants cancel, and have the added advantage that they do not depend on the units of measurement of the covariates.
The third difference is the choice of prior on g. As stated in the introduction, Liang et al. (2008) argue for a hyperprior to be assigned to g, and consider the hyper-g prior, the hyper-g/n prior, and the Zellner-Siow prior (equivalent to a particular inverse-gamma prior on g) (Zellner and Siow, 1980). Maruyama and George (2011) use a prior different from (12) or (13), namely a beta-prime prior with specially chosen hyperparameter values. For all of these choices, apart from that of Maruyama and George (2011), either no closed form expression for the marginal likelihood exists, or such an expression is in terms of a Gauss hypergeometric function which is numerically difficult to evaluate (Pearson et al., 2017), so that approximation is required.
Model selection consistency is another desirable criterion discussed by Bayarri et al. (2012). The hyper-g/n, Zellner-Siow, beta-prime, and robust priors are model selection consistent for all possible models. Our prior specification results in a null-based Bayes factor which is a simple function of the BIC, and so achieves model selection consistency for iid data (under some additional mild assumptions, Yang, 2005). See also Section 5.2.
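A minimal sketch of such a null-based, BIC-like comparison follows. The design, coefficients, and the omission of the intercept and standardization steps are simplifying assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

# Sketch of a null-based comparison in the linear model using the BIC-type
# form: lambda_Bayes = lambda_LRT - nu * ln(n), where
# lambda_LRT = n ln(RSS_0 / RSS_1) and nu = |gamma_1| - |gamma_0|.
rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X[:, 0] * 0.5 + rng.normal(size=n)   # only the first covariate matters

def rss(X_gamma, y):
    if X_gamma.shape[1] == 0:
        return np.sum(y**2)              # null model: no covariates
    beta, *_ = np.linalg.lstsq(X_gamma, y, rcond=None)
    return np.sum((y - X_gamma @ beta)**2)

def lam_bayes(cols0, cols1):
    nu = len(cols1) - len(cols0)
    lam_lrt = n * np.log(rss(X[:, cols0], y) / rss(X[:, cols1], y))
    return lam_lrt - nu * np.log(n)

print(lam_bayes([], [0]))      # large and positive: X0 should enter the model
print(lam_bayes([0], [0, 1]))  # usually negative: X1 is noise
```

The ln(n) penalty per added parameter is exactly what makes this comparison behave like a BIC difference, and hence model selection consistent under the conditions cited above.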

Handling zero parameters in the null model
Let us now return to Lindley's example posed by Lindley (1957) described in Section 3.
In order to apply the methodology of Section 4 the null model needs to have a non-zero number of parameters. We use the following artificial construct to augment the problem so that both hypotheses have a non-zero number of parameters.
1. Introduce a second sample of hypothetical data, say z.
2. Modify the null and alternative hypotheses by adding a clause that the hypothetical data has the same distribution under the null and alternative hypotheses.
3. Apply the methodology of Section 4 to the augmented problem.
In order to illustrate this approach, suppose we have a second sample of hypothetical data z = (z_1, ..., z_n)^T and consider the augmented hypotheses
H_0: x_1, ..., x_n ∼ N(µ_0, σ²) and z_1, ..., z_n | µ̃ ∼ N(µ̃, σ²), versus
H_1: x_1, ..., x_n | µ ∼ N(µ, σ²) and z_1, ..., z_n | µ̃ ∼ N(µ̃, σ²),
where µ_0 and σ² have known fixed values, and µ̃ is an artificial mean parameter corresponding to the sample z. This modification of the original hypotheses (3) has the same logical implication as (3) for the observed sample x, since the hypothetical data has the same model under the null and alternative hypotheses. For the augmented problem we have θ_0 = µ̃ with d_0 = 1, and θ_1 = (µ, µ̃)^T with d_1 = 2, so that we have avoided the problem of dividing by zero. The cake priors become µ̃ | H_0 ∼ N(0, g_0 σ²), µ | H_1 ∼ N(0, g_1 σ²), and µ̃ | H_1 ∼ N(0, g_1 σ²). For g_0 and g_1 we use g_0 = h and g_1 = h^{1/2}, so that g_0^{d_0} = g_1^{d_1} = h. In the limit as h → ∞ the Bayes factor satisfies 2 ln BF_10 = λ_LRT(x) − ln(n), where λ_LRT(x) = n(x̄ − µ_0)²/σ² is the likelihood ratio test statistic corresponding to the hypotheses (3). We conducted a small simulation study identical to the simulation study in Section 4.1, with the exception that σ² was treated as known. The resulting figure and interpretation were nearly identical to those in Section 4.1 (not shown).
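A minimal numerical sketch of this augmented test, assuming µ_0 = 0, σ² = 1, and the limiting statistic 2 ln BF_10 = λ_LRT(x) − ln(n); the function name and simulation settings are our own illustrative choices:

```python
import numpy as np

def lambda_bayes(x, mu0=0.0, sigma2=1.0):
    """Limiting statistic 2*ln(BF_10) = lambda_LRT - ln(n) for the
    augmented one-sample test of H0: mu = mu0 with known variance."""
    n = len(x)
    lam_lrt = n * (np.mean(x) - mu0) ** 2 / sigma2
    return lam_lrt - np.log(n)

rng = np.random.default_rng(1)
n = 200
x_null = rng.normal(0.0, 1.0, n)   # data generated under H0
x_alt  = rng.normal(0.5, 1.0, n)   # data generated under H1

print(lambda_bayes(x_null))   # typically negative, so H0 is preferred
print(lambda_bayes(x_alt))    # large and positive, so H1 is preferred
```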

Theory
In all of the examples in Section 4 the quantity λ Bayes can be placed into the form (2).
We will now consider the asymptotic properties of hypothesis tests based on this form, drawing on the theory developed by Shao (2003).
We will adopt his notation and definitions here. Let X = (X_1, ..., X_n)^T be a sample from a distribution P with densities { p_i(·) : i = 1, ..., n }. The type I and type II error probabilities are defined by α_T(P) = P(T(X) = 1) when P ∈ P_0 and 1 − α_T(P) = P(T(X) = 0) when P ∈ P_1, respectively. Fix the level of significance α such that sup_{P∈P_0} {α_T(P)} ≤ α. We will now suppose that T_n(X) ≡ T(X) and consider scenarios where n diverges. In our ensuing discussion we use the following definitions.
(i) sup_{P∈P_0} {α_{T_n}(P)} is called the size of T_n.
(ii) If lim_{n→∞} sup_{P∈P_0} {α_{T_n}(P)} exists, then it is called the limiting size of T_n.
(iii) The sequence of tests T n is called consistent if and only if the type II error probability converges to 0, i.e., lim n→∞ [1 − α Tn (P)] = 0, for any P ∈ P 1 .
(iv) The sequence of tests T n is called Chernoff-consistent if and only if T n is consistent and the type I error probability converges to 0, i.e., lim n→∞ {α Tn (P)} = 0, for any P ∈ P 0 . Furthermore, T n is called strongly Chernoff-consistent if and only if T n is consistent and the limiting size of T n is 0.
We note that any reasonable consistent test for which the level α is controllable can be made Chernoff-consistent by letting α_n → 0 as n → ∞.
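Chernoff consistency of the penalized test can be sketched numerically. The following snippet (our own illustration, using the one-sample normal mean test with known variance and the penalty ν ln(n) with ν = 1; the alternative mean 0.3 is an assumption) estimates both error probabilities as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def reject(x, mu0=0.0, sigma2=1.0):
    """Penalised LRT of the form (2): reject H0 when lambda_LRT > nu*ln(n), nu = 1."""
    n = len(x)
    lam_lrt = n * (np.mean(x) - mu0) ** 2 / sigma2
    return lam_lrt > np.log(n)

def error_rates(n, reps=2000):
    """Monte Carlo type I and type II error rates at sample size n."""
    type1 = np.mean([reject(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
    type2 = np.mean([not reject(rng.normal(0.3, 1.0, n)) for _ in range(reps)])
    return type1, type2

rates = {n: error_rates(n) for n in (50, 500, 5000)}
for n, (t1, t2) in rates.items():
    print(n, t1, t2)   # both error rates shrink towards zero as n grows
```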
Wilks' theorem (Wilks, 1938) tells us that, assuming the data were generated under the null distribution, and under appropriate regularity conditions (including that the hypotheses are nested), λ_LRT converges in distribution to χ²_ν, so that λ_LRT = O_p(1). A detailed exposition on the characterization of the asymptotic distribution of the LRT statistic under quite general conditions, including when H_0 and/or H_1 is misspecified, and whether the hypotheses are nested or non-nested, can be found in Vuong (1989). Below we summarize the most pertinent results.

Asymptotic properties of the likelihood ratio test statistic
Suppose X_1, X_2, ..., X_n are independent random variables from { p_0i(·) : i = 1, ..., n } and that we have a parametric model P = { p_i(·|θ) : i = 1, ..., n, θ ∈ Θ } (which may or may not include the true distributions { p_0i(·) }). Define the log-likelihood as ℓ(θ) = Σ_{i=1}^n ln p_i(X_i|θ), and the MLE and "pseudo-true" value of θ as θ̂ = argmax_{θ∈Θ} ℓ(θ) and θ* = argmax_{θ∈Θ} E[ℓ(θ)], respectively (assuming both are well-defined and unique). Assume also that E[n^{-1} ℓ(θ*)] → C for some finite constant C. The "pseudo-true" value θ* yields the distribution { p_i(·|θ*) }, which is the "best" distribution in the sense that it results in the smallest Kullback-Leibler (KL) divergence with respect to the true distribution over all distributions in the model.
If P is suitably regular, with Θ a nice subset of d-dimensional Euclidean space, then certain derivatives exist and various statements can be made: the Euclidean norm of θ̂ − θ* is O_p(n^{-1/2}), and writing ∇ℓ(θ) and ∇²ℓ(θ) for the first and second order partial derivatives, we may expand ℓ(θ*) about θ̂ to get
ℓ(θ*) = ℓ(θ̂) + ½ (θ* − θ̂)^T ∇²ℓ(θ̄) (θ* − θ̂),
since ∇ℓ(θ̂) ≡ 0, for some θ̄ between θ* and θ̂; the quadratic form is O_p(1). Thus, we may decompose the maximised log-likelihood into the following terms:
ℓ(θ̂) = E[ℓ(θ*)] + { ℓ(θ*) − E[ℓ(θ*)] } + { ℓ(θ̂) − ℓ(θ*) }.
The first term on the right-hand side is asymptotic to nC; the second term is a random sum of n terms with expectation zero, and so under further regularity conditions is O_p(n^{1/2}); the third term is O_p(1) by the expansion above. We refer to these three terms as the "O_p(n)", "O_p(n^{1/2})" and "O_p(1)" terms respectively (from left to right). Precise regularity conditions guaranteeing all of this nice behaviour may be found in Vuong (1989); see also Chapter 5 of van der Vaart (1998).
These conditions are easily satisfied in all of our examples.
Finally, we note that the first term can be written as
E[ℓ(θ)] = Σ_{i=1}^n { −KL( p_0i ‖ p_i(·|θ) ) } + Σ_{i=1}^n E[ ln p_0i(X_i) ],
where the first term corresponds to the negative KL-divergence between p_0i and p_i(·|θ), and the second term is related to the entropy of p_0i and does not depend on θ. We see from the above equation that maximizing E[ℓ(θ)] with respect to θ is equivalent to minimizing the KL-divergence between Π_{i=1}^n p_0i(·) and Π_{i=1}^n p_i(·|θ).
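The equivalence between maximizing E[ℓ(θ)] and minimizing KL-divergence can be illustrated numerically. In the sketch below (our own construction) the true distribution is N(1, 4) while the model N(θ, 1) is misspecified; the pseudo-true value is nonetheless θ* = 1, the KL-minimising mean:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 2.0, 100_000)          # true distribution: N(1, 4)

def avg_loglik(theta):
    """Monte Carlo estimate of E[ln p(X | theta)] for the model N(theta, 1)."""
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

grid = np.linspace(-2.0, 4.0, 601)
theta_star = grid[np.argmax([avg_loglik(t) for t in grid])]
print(theta_star)   # close to 1.0, the KL-minimising (pseudo-true) value
```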
2. If C_1 = C_0 then immediately we see that the two "O_p(n)" terms in the difference between the maximised log-likelihoods would, at least asymptotically, cancel out and the "O_p(n^{1/2})" terms would dominate. However, in many practical examples the "O_p(n^{1/2})" terms also cancel out, in which case λ_LRT(X) = O_p(1) (see in particular Theorem 3.3 of Vuong, 1989). This occurs when the "best" member of each model yields the same distribution, that is when p_i(·|θ*_0) = p_i(·|θ*_1) for i = 1, ..., n. The parametrisations may be completely different, but nonetheless the distributions corresponding to the pseudo-true parameter values are identical. This occurs when the two models have some overlap, i.e., P_0 ∩ P_1 is non-empty as a subset of all possible sets of distributions { p_i(·) }. In such a case, the models may or may not be nested, and may or may not be correctly specified; however, the "best" distribution in both is the same (and is part of the overlap).
We also have the following consequences.
• Lindley's paradox: If H_1 is correct (and H_0 is not) then the above theory implies λ_LRT(X) is O_p(n) and a test of the form (2) is consistent. Since Lindley's paradox occurs with asymptotic probability P(χ²_{ν,α} < λ_LRT < ν ln n), it occurs with vanishingly small probability as n → ∞.
• Chernoff consistency: If H_0 is correct then λ_LRT(X) is O_p(1) while the penalty ν ln n diverges, so the type I error probability of a test of the form (2) converges to zero. Combined with consistency under H_1, a Bayesian test of the form (2) is Chernoff-consistent.
• Model selection consistency: Now suppose that we have M competing hypotheses H_j, j = 1, ..., M. For each model we have E[n^{-1} ℓ_j(θ*_j)] → C_j for some constants C_j. We will call a hypothesis H_j correct if j ∈ C, where C = { j : C_j = max_{1≤k≤M} C_k }. The hypotheses H_j such that j ∈ C correspond to correct models in the sense that all such models are closest in terms of their KL-divergence to the data generating distribution. Using (2) to compare a model not in C with a model in C leads to λ_LRT(X) being O_p(n) and the test preferring the model in C. Comparing any two models in C will asymptotically prefer the model with the smallest number of parameters.
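Returning to the Lindley's paradox bullet above, the asymptotic probability P(χ²_{ν,α} < λ_LRT < ν ln n) can be computed exactly for the one-sample test of H_0: µ = 0 with known variance, where λ_LRT is noncentral chi-squared with noncentrality nµ² under H_1 (the alternative µ = 0.3 and the sample sizes below are illustrative assumptions of ours):

```python
import numpy as np
from scipy.stats import chi2, ncx2

# One-sample test of H0: mu = 0 with known variance 1, true mean mu = 0.3:
# lambda_LRT = n * xbar^2 is exactly noncentral chi-squared with ncp = n * mu^2.
alpha, nu, mu = 0.05, 1, 0.3
crit = chi2.ppf(1 - alpha, nu)          # chi^2_{nu, alpha}

probs = {}
for n in (50, 200, 1000):
    ncp = n * mu ** 2
    probs[n] = ncx2.cdf(nu * np.log(n), nu, ncp) - ncx2.cdf(crit, nu, ncp)
    print(n, probs[n])   # probability of a "true" Lindley's paradox
```

The probability vanishes for large n, consistent with the claim above.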
When cake priors are made arbitrarily diffuse to the point of becoming improper, a further problem occurs. We now discuss this problem.

Arbitrary constants
We now return to the issue of improper priors in the context of Bayes factors discussed in the introduction. Suppose the priors are improper of the form p(θ_j|H_j) = D_j f_j(θ_j), j = 0, 1, where f_j is a nonnegative function whose integral diverges, so that p(θ_j|H_j) is not a proper conditional density and the constants D_j are arbitrary. Then
BF_01 = (D_0/D_1) × [ ∫ p(x|θ_0) f_0(θ_0) dθ_0 / ∫ p(x|θ_1) f_1(θ_1) dθ_1 ].
This Bayes factor is problematic since it depends on the two arbitrary constants D_0 and D_1.
• Problem III: When using improper priors either the null or the alternative model can be made to be preferred by artificially changing D 0 or D 1 to suit the a priori preferred conclusion.
In the limit as h → ∞ cake priors become improper. Suppose that instead of using g_0^{d_0} = g_1^{d_1} = h we take g_0^{d_0} = c_0 h and g_1^{d_1} = c_1 h, where c_j > 0 are arbitrary constants. This implies λ_Bayes(x) = λ_LRT(x) − ν ln(n) + ∆, where ∆ = ln(c_0/c_1) is an arbitrary but controllable constant. Lastly, we note that if we select ∆ = ν ln(n) − χ²_{ν,α}, where χ²_{ν,α} denotes the upper α quantile of the chi-squared distribution with degrees of freedom parameter ν, then Bayesian tests can be made to mimic the frequentist LRT when the type I error is controlled to have level α.
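The calibration above can be checked directly. The sketch below (illustrative function and values of our own) implements the rule "reject H_0 when λ_Bayes = λ_LRT − ν ln(n) + ∆ > 0" and shows that ∆ = ν ln(n) − χ²_{ν,α} recovers the level-α LRT decision:

```python
import numpy as np
from scipy.stats import chi2

def bayes_test(lam_lrt, n, nu, alpha=None):
    """Reject H0 when lambda_Bayes = lambda_LRT - nu*ln(n) + Delta > 0.
    Delta = 0 gives the default cake-prior test; Delta = nu*ln(n) - chi2_{nu,alpha}
    reproduces the level-alpha likelihood ratio test."""
    delta = 0.0 if alpha is None else nu * np.log(n) - chi2.ppf(1 - alpha, nu)
    return bool(lam_lrt - nu * np.log(n) + delta > 0)

n, nu = 100, 1
lam = 4.2    # between chi^2_{1,0.05} = 3.84 and nu*ln(n) = 4.61
print(bayes_test(lam, n, nu))              # False: default test retains H0
print(bayes_test(lam, n, nu, alpha=0.05))  # True: calibrated test rejects, like the LRT
```

A statistic in the window (χ²_{ν,α}, ν ln n) is exactly where the two tests disagree, which is the Lindley's paradox region discussed in Section 5.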
Given (15), we can directly compare Bayes factors. However, the interpretation offered by Kass and Raftery (1995) in Table 1 appears to have the shortcoming of taking into account neither the degrees of freedom ν of the test nor the sample size n.
Note that every p-value is smaller than 5%, suggesting that Bayesian tests are typically more conservative at preferring the alternative than classical tests rejecting the null at the usual 5% cut-off. Further, going from λ_Bayes(x) = 2 to λ_Bayes(x) = 6, and from λ_Bayes(x) = 6 to λ_Bayes(x) = 10, roughly translates to a 5 to 10 fold reduction in the corresponding p-value. Thus, the anticipated potential shortcoming of Table 1 not depending on ν or n does not pan out, at least for the values of n and ν considered in Table 2. We believe Table 2 offers a reasonable interpretation of the strength of evidence for Bayesian tests.
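The correspondence between λ_Bayes values and classical p-values can be reproduced as follows, assuming the penalized form λ_LRT = λ_Bayes + ν ln(n); the choices n = 100 and ν = 1 are illustrative:

```python
import numpy as np
from scipy.stats import chi2

def p_from_lambda_bayes(lam_bayes, n, nu):
    """Classical p-value of the LRT statistic implied by lambda_Bayes,
    using lambda_LRT = lambda_Bayes + nu*ln(n)."""
    return chi2.sf(lam_bayes + nu * np.log(n), nu)

for lb in (2.0, 6.0, 10.0):
    print(lb, p_from_lambda_bayes(lb, n=100, nu=1))
```

Each step of 4 in λ_Bayes shrinks the implied p-value by roughly an order of magnitude, in line with the discussion above.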

Conclusion
We have introduced a new class of priors, which we call cake priors, having a number of desirable properties. Cake priors can be made arbitrarily diffuse without the Bayes factor favouring the null or alternative hypothesis as the prior becomes increasingly diffuse. In the limit, at least for the examples considered here, Bayes factors take the form of penalized likelihood ratio test statistics, one of the most thoroughly understood quantities in statistics.
Due to their close link with Jeffreys priors, cake priors are parametrization invariant. The resulting Bayesian tests avoid the need to specify a p-value cut-off and are asymptotically Chernoff-consistent. With a slight modification, an arbitrary but controllable constant can be used to bridge the gap between Bayesian tests and likelihood ratio tests. Unlike approaches that split the dataset into parts or use imaginary data, cake priors are transparent, uncomplicated, and easily implemented. Finally, Bayesian tests using cake priors provide some protection against sequential testing by being more conservative as the sample size grows. We believe all of the above properties should make cake priors the default choice when performing Bayesian hypothesis tests of a simple point null against a composite alternative for parametric models with iid data.