Learning from a lot: Empirical Bayes for high‐dimensional model‐based prediction

Abstract Empirical Bayes is a versatile approach to “learn from a lot” in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well‐known model‐based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss “formal” empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross‐validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed “co‐data”. In particular, we present two novel examples that allow for co‐data: first, a Bayesian spike‐and‐slab setting that facilitates inclusion of multiple co‐data sources and types and, second, a hybrid empirical Bayes–full Bayes ridge regression approach for estimation of the posterior predictive interval.


Baseball batting example, revisited
We briefly revisit the famous baseball batting example (Efron and Morris, 1975), often used as a didactic example of EB. While this is an estimation problem rather than a prediction problem, we revisit it for several reasons: i) it is a well-known example for which the true values are known; ii) the EB objective function is the same as for diagonal linear discriminant analysis; and iii) casting the problem into a large-$p$ setting allows us to show the importance of $p$ being large.
For 18 baseball players, the batting averages over their first 45 at-bats are recorded and denoted by $B_i$. The batting averages over the remainder of the season are also known and are considered to be the truth. We follow Van Houwelingen (2014) by modeling $B_i \sim N(\theta_i, \sigma_i^2)$, where the aim is to estimate $\theta_i$. The variances are estimated by $\hat{\sigma}_i^2 = B_i(1-B_i)/45$. Then, to effectuate shrinkage, Van Houwelingen (2014) applies a Gaussian prior $N(\mu, \tau^2)$ to $\theta_i$. In the formulation of the marginal likelihood (see Main Document), this implies hyper-parameter $\alpha = (\mu, \tau^2)$, and estimation of $\alpha$ is straightforward due to the conjugacy of the likelihood and the prior. The posterior mean estimate then equals $\hat{\theta}_i = E(\theta_i \mid B_i; (\hat{\mu}, \hat{\tau}^2)) = \hat{\mu} + \hat{\tau}^2(\hat{\tau}^2 + \hat{\sigma}_i^2)^{-1}(B_i - \hat{\mu})$. The conclusion in Van Houwelingen (2014) is that the shrinkage prior slightly reduces the mean squared error, but enforces too strong shrinkage for the extremes. E.g., for the best player $\hat{\theta}_1 = 0.271$, whereas $B_1 = 0.400$ and the true $\theta_1 = 0.346$. Two possible explanations come to mind: the estimate of the prior parameters is poor because $p = n = 18$ is small, and/or the prior does not accommodate the extremes well. We investigate both.
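The conjugate Gaussian-Gaussian computation above can be sketched in code; the following is a minimal illustration in Python, where the function name and the grid-based maximization over $\tau^2$ (with $\mu$ profiled out in closed form) are our own choices, not the implementation used in the paper:

```python
import math

def eb_gaussian(B, s2, tau2_grid):
    """Empirical Bayes for B_i ~ N(theta_i, s2_i) with prior theta_i ~ N(mu, tau2).

    Maximizes the marginal likelihood, B_i ~ N(mu, tau2 + s2_i), over a grid of
    tau2 values; for fixed tau2, the optimal mu is the precision-weighted mean.
    Returns the hyper-parameter estimates and the posterior means."""
    best = None
    for tau2 in tau2_grid:
        w = [1.0 / (tau2 + s) for s in s2]
        mu = sum(wi * b for wi, b in zip(w, B)) / sum(w)
        # marginal log-likelihood at (mu, tau2)
        ll = sum(-0.5 * math.log(2.0 * math.pi * (tau2 + s))
                 - 0.5 * (b - mu) ** 2 / (tau2 + s)
                 for b, s in zip(B, s2))
        if best is None or ll > best[0]:
            best = (ll, mu, tau2)
    _, mu, tau2 = best
    # posterior means: shrink each B_i towards mu
    post = [mu + tau2 / (tau2 + s) * (b - mu) for b, s in zip(B, s2)]
    return mu, tau2, post
```

Because the shrinkage factor $\hat{\tau}^2/(\hat{\tau}^2 + \hat{\sigma}_i^2)$ lies in $(0, 1)$, every returned posterior mean lies between $B_i$ and $\hat{\mu}$.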
First, we simulate 10,000 additional true values from a density estimate with Gaussian kernel (using R's density command) applied to $(\theta_1, \ldots, \theta_{18})$. To obtain $B_i$, $i = 19, \ldots, 10018$, Gaussian noise was added with variances $\theta_i(1-\theta_i)/45$. The estimates obtained in Van Houwelingen (2014) were $\hat{\mu} = 0.256$ and $\hat{\tau}^2 = 0.000623$. The latter seems to be a major cause of over-shrinkage: the true variance computed from the 18 known $\theta_i$'s equals 0.00143. If we estimate $\tau^2$ from the large data set, a much better estimate is obtained: $\hat{\tau}^2 = 0.00195$, as compared to the variance of the 18 known plus 10,000 generated true $\theta_i$'s, equaling 0.00166. From this, we obtain posterior mean estimate $\hat{\theta}_1 = 0.293$, which is substantially closer to $\theta_1 = 0.346$ than $\hat{\theta}_1 = 0.271$. Estimates for all 18 players are displayed in Figure 1(a).
In this example, it is natural to replace the Gaussian prior by a 3-component Gaussian mixture prior (bad, mediocre and good players): $\theta_i \sim \sum_{k=1}^{3} p_k N(\mu_k, \tau_k^2)$. Then $\alpha$ consists of 8 free hyper-parameters, given that $p_3 = 1 - p_1 - p_2$. We employed the EM-type algorithm of Van de Wiel et al. (2012) to maximize the marginal likelihood (see Main Document) in terms of $\alpha$. Here, we use that the likelihood is Gaussian and that the Gaussian mixture prior is conjugate to it. The latter also facilitates straightforward computation of the shrunken estimator $\hat{\theta}_i^{\mathrm{Mixt}} = E(\theta_i \mid B_i; \hat{\alpha})$. In this setting, the mixture prior is fairly close to the estimated Gaussian prior, and so are the shrunken estimates, as displayed in Figure 1(b). Slightly less shrinkage is observed for the extremes, though. For example, $\hat{\theta}_1^{\mathrm{Mixt}} = 0.298$.
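An EM algorithm of this type can be sketched as follows; this is a generic, minimal Python version for a $K$-component Gaussian mixture prior under heteroscedastic Gaussian noise (function names are ours, and this is a simplified stand-in for, not a reproduction of, the algorithm of Van de Wiel et al., 2012). The E-step computes component responsibilities and the per-component posterior moments of $\theta_i$; the M-step updates the weights, means and variances:

```python
import math

def normal_logpdf(x, m, v):
    """Log-density of N(m, v) at x."""
    return -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v

def marginal_loglik(B, s2, p, mu, tau2):
    """Marginal log-likelihood: B_i ~ sum_k p_k N(mu_k, tau2_k + s2_i)."""
    K = len(p)
    return sum(math.log(sum(p[k] * math.exp(normal_logpdf(b, mu[k], tau2[k] + s))
                            for k in range(K)))
               for b, s in zip(B, s2))

def em_mixture_prior(B, s2, p, mu, tau2, n_iter=100):
    """EM for the hyper-parameters of the mixture prior; each iteration
    cannot decrease the marginal log-likelihood."""
    n, K = len(B), len(p)
    for _ in range(n_iter):
        # E-step: responsibilities R and posterior moments (M, V) of theta_i per component
        R, M, V = [], [], []
        for b, s in zip(B, s2):
            logw = [math.log(p[k]) + normal_logpdf(b, mu[k], tau2[k] + s)
                    for k in range(K)]
            mx = max(logw)
            w = [math.exp(l - mx) for l in logw]
            tot = sum(w)
            R.append([wi / tot for wi in w])
            v = [1.0 / (1.0 / tau2[k] + 1.0 / s) for k in range(K)]
            V.append(v)
            M.append([v[k] * (mu[k] / tau2[k] + b / s) for k in range(K)])
        # M-step: weighted updates of weights, means and variances
        for k in range(K):
            rk = sum(R[i][k] for i in range(n))
            p[k] = rk / n
            mu[k] = sum(R[i][k] * M[i][k] for i in range(n)) / rk
            tau2[k] = max(sum(R[i][k] * ((M[i][k] - mu[k]) ** 2 + V[i][k])
                              for i in range(n)) / rk, 1e-10)
    return p, mu, tau2
```

The shrunken estimate $\hat{\theta}_i^{\mathrm{Mixt}}$ then follows as the responsibility-weighted average of the per-component posterior means $m_{ik}$.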

Bayesian elastic net
The Bayesian linear elastic net model, as used in the Main Document, is (Li and Lin, 2010):

$$\mathbf{Y} \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n), \qquad \pi(\beta \mid \sigma^2) = g(\lambda_1, \lambda_2, \sigma^2)^{-1} \exp\Big\{-\frac{1}{2\sigma^2}\big(\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\big)\Big\},$$

with some arbitrary (possibly improper) density $f(\sigma^2)$ for the error variance. The normalizing constant $g(\lambda_1, \lambda_2, \sigma^2)$ follows by integrating the prior over $\beta$ and is available in closed form (Li and Lin, 2010). Since the simulations are for illustrative purposes only, the error variance was kept fixed at its true value ($\sigma^2 = 1$) throughout the simulations. Then, after introducing the latent variables $\tau = (\tau_1, \ldots, \tau_p)^T$, $\tau_j > 1$, we have the following conditional distributions for $\beta$ and $\tau$: the full conditional of $\beta$ is multivariate normal, $\beta \mid \mathbf{Y}, \tau, \sigma^2 \sim N(A_\tau^{-1} X^T \mathbf{Y}, \sigma^2 A_\tau^{-1})$ with $A_\tau = X^T X + \lambda_2 \,\mathrm{diag}\big(\tau_j/(\tau_j - 1)\big)$, and the full conditional of $\tau_j$ satisfies $(\tau_j - 1) \mid \mathbf{Y}, \sigma^2, \beta_j \sim \mathrm{GIG}(1/2, \psi, \chi_j)$, with $\psi$ depending on $(\lambda_1, \lambda_2, \sigma^2)$ and $\chi_j$ additionally on $\beta_j$ (Li and Lin, 2010), and GIG denotes the generalized inverse Gaussian distribution.
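The GIG full conditional of $\tau_j$ can be sampled via its reciprocal: in the parameterization $\mathrm{GIG}(p, a, b)$ with density proportional to $w^{p-1}\exp\{-(aw + b/w)/2\}$, if $W \sim \mathrm{GIG}(1/2, \psi, \chi)$ then $1/W$ is inverse Gaussian with mean $\sqrt{\psi/\chi}$ and shape $\psi$. A minimal Python sketch, using the standard transformation-with-multiple-roots inverse Gaussian sampler of Michael, Schucany and Haas (1976); the function names and the illustrative parameter values are ours:

```python
import math
import random

def sample_invgauss(mu, lam, rng=random):
    """Draw from the inverse Gaussian IGauss(mu, lam) via the
    transformation method of Michael, Schucany and Haas (1976)."""
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = mu + mu * mu * y / (2.0 * lam) \
        - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    # accept the smaller root x with probability mu / (mu + x), else take mu^2 / x
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def sample_gig_half(psi, chi, rng=random):
    """Draw W ~ GIG(1/2, psi, chi), density proportional to
    w^{-1/2} exp{-(psi * w + chi / w) / 2}, using the reciprocal
    relation 1/W ~ IGauss(sqrt(psi / chi), psi)."""
    return 1.0 / sample_invgauss(math.sqrt(psi / chi), psi, rng)
```

Within the Gibbs sampler, a draw of $\tau_j$ is then obtained as `1.0 + sample_gig_half(psi, chi_j)`.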

Marginal likelihood from Gibbs samples
According to Chib (1995), the log marginal likelihood of a Bayesian model may be calculated from the converged Gibbs samples via the identity

$$\log p(\mathbf{Y}) = \log p(\mathbf{Y} \mid \beta^*) + \log \pi(\beta^*) - \log p(\beta^* \mid \mathbf{Y}),$$

where $\beta^*$ is some high posterior density point of $p(\beta \mid \mathbf{Y})$ and the posterior ordinate is estimated by averaging over the Gibbs samples $\tau^{(k)}$, indexed by $k = 1, \ldots, K$:

$$p(\beta^* \mid \mathbf{Y}) \approx \frac{1}{K} \sum_{k=1}^{K} p(\beta^* \mid \mathbf{Y}, \tau^{(k)}).$$

In principle, any point $\beta^*$ may be used, but for the sake of efficiency a high-density point of $\beta$ is preferred, such as the posterior mode. Then, for fixed $\sigma^2$, the log marginal likelihood is approximated by

$$\log p(\mathbf{Y} \mid \sigma^2) \approx \log p(\mathbf{Y} \mid \beta^*, \sigma^2) + \log \pi(\beta^* \mid \sigma^2) - \log\bigg(\frac{1}{K} \sum_{k=1}^{K} p(\beta^* \mid \mathbf{Y}, \tau^{(k)}, \sigma^2)\bigg).$$

Sampling from the multivariate normal full conditional is a costly operation in high dimensions. In Bhattacharya et al. (2015) an efficient sampling scheme for $\beta$ is described which, for a full conditional of the form $N(A^{-1}X^T\mathbf{Y}, A^{-1})$ with $A = X^TX + D^{-1}$ (here stated for $\sigma^2 = 1$), requires solving only an $n \times n$ system: (i) sample $u \sim N(0, D)$ and $\delta \sim N(0, I_n)$; (ii) set $v = Xu + \delta$; (iii) solve $(XDX^T + I_n)w = \mathbf{Y} - v$; (iv) set $\beta = u + DX^Tw$. Furthermore, if $(\tau_j - 1) \mid \mathbf{Y}, \sigma^2, \beta_j \sim \mathrm{GIG}(1/2, \psi, \chi_j)$, then $1/(\tau_j - 1) \mid \mathbf{Y}, \sigma^2, \beta_j \sim \mathrm{IGauss}(\mu_j = \sqrt{\psi/\chi_j}, \lambda = \psi)$. Sampling from this inverse Gaussian is done by the following scheme (Michael, Schucany and Haas, 1976): draw $\nu \sim N(0,1)$, set $y = \nu^2$ and $x = \mu_j + \mu_j^2 y/(2\lambda) - (\mu_j/(2\lambda))\sqrt{4\mu_j\lambda y + \mu_j^2 y^2}$; then, with probability $\mu_j/(\mu_j + x)$, return $x$, and otherwise return $\mu_j^2/x$.

Proof for Theorem 1: EMSE of $\hat{\tau}^2$ for linear regression

Let $\hat{\beta}$ denote the least squares estimator, so that $\hat{\beta} \mid X, \beta \sim N(\beta, \Sigma)$ with $\Sigma = \sigma^2(X^TX)^{-1}$. Then, let us first compute the expected squared bias w.r.t. $\beta$, for which we use the central moments of Gaussian random variables, available from Isserlis' theorem (Isserlis, 1918): for the centralized variables $\tilde{\beta}_j$, $E(\tilde{\beta}_j^2) = \Sigma_{jj}$, $E(\tilde{\beta}_j^4) = 3\Sigma_{jj}^2$, $E(\tilde{\beta}_j^2\tilde{\beta}_k^2) = \Sigma_{jj}\Sigma_{kk} + 2\Sigma_{jk}^2$, and all central moments of odd total order vanish. For the variance part of the EMSE, we need to compute $V(\hat{\beta}_j^2)$ and $\mathrm{Cov}(\hat{\beta}_j^2, \hat{\beta}_k^2)$. These are again derived from expressions for the central moments of Gaussian random variables. Let us first express the non-central moments in

$$\mathrm{Cov}_Y(\hat{\beta}_j^2, \hat{\beta}_k^2) = \mathrm{Cov}(\hat{\beta}_j^2, \hat{\beta}_k^2) = E(\hat{\beta}_j^2 \hat{\beta}_k^2) - E(\hat{\beta}_j^2) E(\hat{\beta}_k^2)$$

in terms of the central ones. Denote the centralized value of $\hat{\beta}_j$ by $\tilde{\beta}_j = \hat{\beta}_j - \beta_j$. Then,

$$T_1 = E(\hat{\beta}_j^2 \hat{\beta}_k^2) = E\big\{(\beta_j + \tilde{\beta}_j)^2 (\beta_k + \tilde{\beta}_k)^2\big\} = \beta_j^2 \beta_k^2 + \beta_j^2 \Sigma_{kk} + \beta_k^2 \Sigma_{jj} + 4\beta_j \beta_k \Sigma_{jk} + \Sigma_{jj}\Sigma_{kk} + 2\Sigma_{jk}^2,$$

because $T_2$, the collection of expansion terms containing a centralized variable with odd power, equals 0 due to the symmetry of the central Gaussian distribution. Likewise, the second term of the covariance equals:

$$E(\hat{\beta}_j^2) E(\hat{\beta}_k^2) = (\beta_j^2 + \Sigma_{jj})(\beta_k^2 + \Sigma_{kk}) = \beta_j^2 \beta_k^2 + \beta_j^2 \Sigma_{kk} + \beta_k^2 \Sigma_{jj} + \Sigma_{jj}\Sigma_{kk}.$$

Subtracting the latter from $T_1$ cancels the common terms in both expressions, rendering

$$\mathrm{Cov}(\hat{\beta}_j^2, \hat{\beta}_k^2) = 4\beta_j \beta_k \Sigma_{jk} + 2\Sigma_{jk}^2, \qquad V(\hat{\beta}_j^2) = 4\beta_j^2 \Sigma_{jj} + 2\Sigma_{jj}^2,$$

where we used the equations for the central moments of Gaussian random variables (Isserlis, 1918).
Note that the variance $V(\hat{\beta}_j^2)$ can also be obtained by writing $\hat{\beta}_j^2$ as a scaled non-central $\chi^2_1$ variable with non-centrality parameter $\beta_j^2/\Sigma_{jj}$. Substituting the variance and covariance expressions into the variance of $\hat{\tau}^2$ and taking expectation w.r.t. $\beta$ leaves only moments of the prior, because we assume i.i.d. central priors for the $\beta_j$: $E_\beta(\beta_j^2) = \tau^2$ and $E_\beta(\beta_j \beta_k) = E_\beta(\beta_j)E_\beta(\beta_k) = 0$ for $j \neq k$. It remains to take expectations w.r.t. the design $X$: when the rows of $X$ are i.i.d. multivariate Gaussian, $(X^TX)^{-1}$ follows an inverse Wishart distribution, so the requested moments $E_X(\Sigma_{jj})$, $E_X(\Sigma_{jk}^2)$ and $E_X(\Sigma_{jj}\Sigma_{kk})$ are known (Press, 1982), where we assume $p < n - 3$ to ensure their existence. Substituting these moments into the variance part and aggregating with the expected squared bias finalizes the result. The result simplifies for independent covariates, because then $\psi_{jj} = 1$ and $\psi_{jk} = 0$ for $j \neq k$, with $\psi$ denoting the scale matrix of the covariates' distribution.
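The fourth-moment identity $\mathrm{Cov}(\hat{\beta}_j^2, \hat{\beta}_k^2) = 2\Sigma_{jk}^2 + 4\beta_j\beta_k\Sigma_{jk}$ used in the derivation is easy to verify numerically; a small self-contained Monte Carlo check in Python (the function name and parameter values are arbitrary illustrations):

```python
import math
import random

def check_cov_of_squares(beta_j, beta_k, s_jj, s_kk, s_jk, n=200000, seed=1):
    """Monte Carlo check of Cov(bhat_j^2, bhat_k^2) = 2*s_jk^2 + 4*beta_j*beta_k*s_jk
    for (bhat_j, bhat_k) ~ N((beta_j, beta_k), [[s_jj, s_jk], [s_jk, s_kk]]).

    Returns the empirical and theoretical covariances."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix
    a = math.sqrt(s_jj)
    b = s_jk / a
    c = math.sqrt(s_kk - b * b)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append((beta_j + a * z1) ** 2)
        ys.append((beta_k + b * z1 + c * z2) ** 2)
    mx = sum(xs) / n
    my = sum(ys) / n
    emp = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    theo = 2.0 * s_jk ** 2 + 4.0 * beta_j * beta_k * s_jk
    return emp, theo
```

Setting $k = j$ in the identity yields the companion check for $V(\hat{\beta}_j^2) = 2\Sigma_{jj}^2 + 4\beta_j^2\Sigma_{jj}$.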

Simulation Example
Here, we show the results for all settings of the Simulation example presented in the Main Document.