Weighted Bayesian bootstrap for scalable posterior distributions

We introduce and develop a weighted Bayesian bootstrap (WBB) for machine learning and statistics. WBB provides uncertainty quantification by sampling from a high dimensional posterior distribution. WBB is computationally fast and scalable using only off‐the‐shelf optimization software. First‐order asymptotic analysis provides a theoretical justification under suitable regularity conditions on the statistical model. We illustrate the proposed methodology in regularized regression, trend filtering and deep learning and conclude with directions for future research.


INTRODUCTION
Weighted Bayesian bootstrap (WBB) is a simulation-based algorithm for assessing uncertainty in machine learning and statistics. Uncertainty quantification (UQ) is an active area of research, particularly in high-dimensional inference problems (Wang & Swiler, 2018). Whilst there are computationally fast and scalable algorithms for training models in a wide variety of contexts, uncertainty assessments are still required, as are methods to compute these assessments. Bayesian analysis offers a general solution, but developing computationally fast, scalable algorithms for sampling a posterior distribution is a notoriously hard problem. WBB makes a contribution to this literature by showing how off-the-shelf optimization algorithms, such as convex optimization or stochastic gradient descent (SGD), can be adapted to provide uncertainty assessments. Our goal here is to marry Bayesian uncertainty techniques with state-of-the-art optimization methods and software systems.
For relatively simple statistical models, the weighted likelihood bootstrap (WLB) method provides approximate posterior sampling through repeated optimization of a randomly weighted likelihood function (Newton & Raftery, 1994). The proposed WBB extends the WLB to a broad class of contemporary statistical models by leveraging advances in optimization methodology. Essentially, the WLB used optimization of certain randomized objective functions to enable the approximate marginalization (i.e., integration) required in Bayesian analysis. The same idea, optimizing a randomized objective function to achieve posterior sampling, is at the heart of the proposed WBB method, though some changes to the WLB procedure are required to carry out this program for the models considered. Theoretical support for the WLB approximation is based on connections between posterior variation and the curvature of the log-likelihood, revealed through repeated optimization of randomly weighted likelihoods. By contrast, the proposed WBB calculates a series of randomized posterior modes rather than randomized likelihood maximizers.
A key rationale for this proposal is that high dimensional posterior modes are now readily computable, thanks to systems such as TensorFlow (Abadi et al., 2015) and Keras (Chollet, 2015) that deploy SGD and convex optimization methods for large-scale problems, such as those arising in the neural network architectures used in deep learning (LeCun, Bengio & Hinton, 2015). By linking random weighting with advanced optimization, we expose a simple scheme for approximate uncertainty quantification in a wide class of statistical models.
Uncertainty quantification is typically unavailable from a purely optimization-based regularization method. We contend that UQ is available directly by repeated optimization of randomized objective functions, using the same computational tools that produce the primary estimate, rather than through Markov chain Monte Carlo, variational methods, approximate Bayesian computation (ABC), or other techniques. See Green et al. (2015) for a good summary of the history of Bayesian computation. Thus, uncertainty assessments are provided at little extra effort over the original training computations. A further benefit is that, with extra computational cost, it is straightforward to add a regularization path across hyper-parameters (e.g., simply repeat WBB at different values of $\lambda$), which is usually difficult to compute in traditional Bayesian sensitivity analysis. We use predictive cross-validation techniques in this regard.
The rest of the article is outlined as follows. Section 2 develops our weighted Bayesian bootstrap (WBB) algorithm. Section 3 provides applications to high dimensional sparse regression, trend filtering and deep learning. WBB can also be applied to Bayesian tree models (Taddy et al., 2015). Section 4 indicates several directions for future research, including bootstrap filters in state-space models (Gordon, Salmond & Smith, 1993) and connections to the resampling-sampling perspective in sequential Bayesian inference (Lopes, Polson & Carvalho, 2012).

Setting
We work with a broad class of statistical models for data structures involving outcomes and covariates. Examples considered in Section 3 include regression, trend filtering, and deep learning. Let $y = (y_1, y_2, \ldots, y_n)$ be an n-vector of outcomes and let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ be a p-dimensional parameter of interest. Covariate data may be organized in an n × p matrix A whose rows are the design points (or "features") $a_i^T$, where we index observations by i and parameters by j. A large number of estimation/training problems can be expressed in the form
$$\theta^* := \arg\min_\theta \left\{ l(y|\theta) + \lambda\,\phi(\theta) \right\}, \qquad (1)$$
where $l(y|\theta) = \sum_{i=1}^n l_i(y_i|\theta)$ is a measure of fit (or "empirical risk function") depending on $\theta$ and y, and implicitly on A. The penalty function, or regularization term, $\phi(\theta)$, may encode soft or hard constraints, and is controlled by a hyper-parameter $\lambda > 0$ whose values index an entire path of solutions. The penalty function $\phi(\theta)$ effects a favourable bias-variance estimation trade-off and provides extensive modelling flexibility (Wellner & Zhang, 2012). To accommodate contemporary applications we allow $\phi(\theta)$ to have points in its domain where it fails to be differentiable (e.g., the $L_1$ norm). If we treat the data y as arising from a probabilistic model parameterized by $\theta$, then the likelihood function $p(y|\theta)$ yields the model-associated measure of fit $l(y|\theta) = -\log p(y|\theta)$. The maximum likelihood estimator (MLE) is $\hat\theta := \arg\max_\theta p(y|\theta)$, though of course this usually differs from the solution $\theta^*$ to Equation (1). We recall a key duality between regularization and Bayesian analysis.
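As a concrete and purely illustrative instance of this setup, the following R sketch encodes the objective in Equation (1) for a Gaussian linear model with an $L_1$ penalty; the function name and arguments are ours, not the paper's.

```r
# Hypothetical illustration of Equation (1): l(y|theta) + lambda * phi(theta)
# for squared-error loss and an L1 penalty. X, y, lambda are assumed given.
objective <- function(theta, X, y, lambda) {
  loss    <- 0.5 * sum((y - X %*% theta)^2)   # l(y|theta) = sum_i l_i(y_i|theta)
  penalty <- lambda * sum(abs(theta))         # lambda * phi(theta), phi = L1 norm
  loss + penalty
}
```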

Bayesian regularization duality
From the Bayesian perspective, the measure of fit, $l(y|\theta) = -\log p(y|\theta)$, and the penalty function, $\lambda\,\phi(\theta)$, correspond to the negative logarithms of the likelihood and prior distribution in the model
$$p(\theta|y) \propto p(y|\theta)\, p(\theta) \propto \exp\left\{ -l(y|\theta) - \lambda\,\phi(\theta) \right\}. \qquad (2)$$
This posterior $p(\theta|y)$ is often a proper distribution over $\mathbb{R}^d$, even if the prior $p(\theta)$ is not proper. The well-known equivalence between regularization and Bayesian methods is seen, for example, in regression with a Gaussian sampling model, where an $L_2$-norm (ridge) penalty corresponds to a Gaussian prior and an $L_1$-norm (LASSO) penalty corresponds to a double-exponential prior. By this duality, the posterior mode, or maximum a posteriori (MAP) estimate, is $\theta^*$, a solution to Equation (1). See Gribonval & Machart (2013) for a nuanced view of the connection between Equations (1) and (2) in Gaussian regression models.
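To make the duality concrete, here is a small numerical sketch (not from the paper) checking that, for a Gaussian linear model with an independent Gaussian prior on the coefficients, the posterior mode coincides with the ridge solution; the variance values are illustrative assumptions.

```r
# Ridge regression as a Gaussian MAP estimate: a minimal sanity check.
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)

sigma2 <- 1     # noise variance (taken as known here)
tau2   <- 0.5   # prior variance: theta_j ~ N(0, tau2)
lambda <- sigma2 / tau2

# Closed-form posterior mode (= ridge solution with penalty lambda)
theta_map <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# The same solution from direct optimization of the penalized negative log-posterior
obj <- function(th) sum((y - X %*% th)^2) / (2 * sigma2) + sum(th^2) / (2 * tau2)
theta_opt <- optim(rep(0, p), obj, method = "BFGS")$par

max(abs(drop(theta_map) - theta_opt))   # should be numerically near zero
```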

Optimization
Advances in optimization methodology provide efficient algorithms to compute $\theta^* = \arg\min_\theta \mathcal{L}(\theta)$, where $\mathcal{L}(\theta) = l(y|\theta) + \lambda\,\phi(\theta)$, for a wide range of loss and penalty functions. Theory is well developed in the case of convex objective functions (e.g., Bertsekas, Nedić & Ozdaglar, 2003; Boyd & Vandenberghe, 2004). For example, if the loss l is convex and differentiable in $\theta$ and the penalty $\phi$ is convex, then a necessary and sufficient condition for $\theta^*$ to minimize $l(y|\theta) + \lambda\phi(\theta)$ is
$$0 \in \nabla l(y|\theta^*) + \lambda\, \partial\phi(\theta^*), \qquad (3)$$
where $\partial$ is the subdifferential operator (the set of subgradients of the objective), in this case the sum of a point and a set. Though not a formula for $\theta^*$, such as is given by the normal equations in linear regression, Equation (3) usefully guides algorithms that aim to solve for $\theta^*$. For example, under separability conditions on the penalty function, coordinate descent algorithms effectively solve for $\theta^*$; see Wright (2015), or Hastie, Tibshirani & Wainwright (2015, chap. 5) for a statistical perspective. The optimization literature also characterizes $\theta^*$ as the fixed point of a proximal operator
$$\mathrm{prox}_\phi(\theta) = \arg\min_z \left\{ \phi(z) + \tfrac{1}{2}\|z - \theta\|_2^2 \right\},$$
which opens the door to powerful MM algorithms and related schemes; see Lange (2016, chap. 5) and Polson, Scott & Willard (2015). Beyond convexity, the guarantees are weaker (e.g., local rather than global minima) and the algorithms are many (e.g., Nocedal & Wright, 2006). Gradient descent or SGD are effective in many cases, owing to parameter dimensionality and the structure of the gradients. The Appendix develops SGD for one example. Advances in applied optimization also provide effective software tools for data analysis. For example, the R package glmnet deploys coordinate descent for loss functions arising from generalized linear models and LASSO or elastic net penalties (Friedman, Hastie & Tibshirani, 2010). To solve the generalized LASSO problem, the R package genlasso deploys a dual path algorithm (Arnold & Tibshirani, 2014). A variety of general purpose optimization tools for statistics are compiled in the Optimization task view at CRAN (https://cran.r-project.org/web/views/Optimization.html). For machine learning, the TensorFlow system has greatly simplified gradient descent, SGD, and related algorithms for many applications (Abadi et al., 2015).
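As one hedged illustration of these ideas (not code from the paper), the sketch below implements a basic proximal gradient (ISTA-type) iteration for the LASSO, where soft-thresholding is the proximal map of the $L_1$ penalty; the step size uses the Lipschitz constant of the smooth part and no convergence check is attempted.

```r
# Proximal gradient descent for 0.5 * ||y - X theta||^2 + lambda * ||theta||_1.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)   # prox of t * ||.||_1

lasso_ista <- function(X, y, lambda, iters = 500) {
  L <- max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values)
  step  <- 1 / L                     # step size from the Lipschitz constant of the gradient
  theta <- rep(0, ncol(X))
  for (k in seq_len(iters)) {
    grad  <- -drop(crossprod(X, y - X %*% theta))   # gradient of the smooth loss
    theta <- soft_threshold(theta - step * grad, step * lambda)
  }
  theta
}
```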

WBB Algorithm
We now define the WBB. Recalling the original objective function in Equation (1), we form the randomized objective
$$\mathcal{L}_w(\theta) = \sum_{i=1}^n w_i\, l_i(y_i|\theta) + w_0\, \lambda\, \phi(\theta), \qquad (4)$$
where the entries of $w = (w_0, w_1, \ldots, w_n)$ are independent and identically distributed (i.i.d.) standard exponential random weights, generated by the analyst independently of the data y. Alternatively, when the penalty is separable, a separate exponential weight may be attached to each component of the penalty; this helps with sparsity and reduces the effect of an occasionally large scalar weight. In either case, associated with any vector w is the solution $\theta^*_w = \arg\min_\theta \mathcal{L}_w(\theta)$. Our basic conjecture is that the conditional distribution of $\theta^*_w$, that is, the distribution induced by w with the data fixed, approximates the Bayesian posterior in Equation (2): for any measurable set A in the parameter space,
$$P\left(\theta^*_w \in A \mid y\right) \approx \int_A p(\theta|y)\, d\theta. \qquad (5)$$
Section 2.5 provides an asymptotic argument in support of Equation (5), and we investigate the approximation numerically in a few examples in Section 3. Assuming this conjecture is true, we have a straightforward optimization-based algorithm for approximate posterior sampling:

Algorithm 1 Weighted Bayesian bootstrap
for t = 1, … , T do
  Input: Independently draw $w_1, \ldots, w_n$ as i.i.d. standard exponentials, and construct $w_0$ as either a univariate exponential (common weight case) or a vector of i.i.d. exponentials (separate weights case). Set $w = (w_0, w_1, \ldots, w_n)$.
  Compute: $\theta^{*,t}_w \leftarrow \arg\min_\theta \mathcal{L}_w(\theta)$.
end for
When optimization on the original problem, Equation (1), is fast and scalable, so too is the WBB. Weights w and the corresponding $\theta^*_w$ are independent across t, making it possible to speed up the algorithm via parallel computing. To choose the amount of regularization $\lambda$, which is assumed to be fixed for all sets of w, we can use the marginal likelihood $m_\lambda(y)$, estimated by bridge sampling (Gelman & Meng, 1998), or simply use predictive cross-validation. Next we consider the approximation in Equation (5) from an asymptotic perspective.
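A minimal sketch of Algorithm 1 (ours, not the authors' supplementary code) for a Gaussian linear model with a LASSO penalty, using glmnet observation weights for the $w_i$ and penalty factors for the penalty weights. Note that glmnet internally rescales observation weights and penalty factors, so the correspondence with Equation (4) is approximate and $\lambda$ may need adjusting.

```r
library(glmnet)

# Draw T approximate posterior samples by repeatedly solving the randomized LASSO.
wbb_lasso <- function(X, y, lambda, T = 200, separate = TRUE) {
  n <- nrow(X); p <- ncol(X)
  draws <- matrix(NA_real_, T, p)
  for (t in seq_len(T)) {
    w  <- rexp(n)                                     # w_1, ..., w_n ~ Exp(1)
    w0 <- if (separate) rexp(p) else rep(rexp(1), p)  # penalty weight(s)
    fit <- glmnet(X, y, family = "gaussian", weights = w, penalty.factor = w0,
                  lambda = lambda, standardize = FALSE, intercept = FALSE)
    draws[t, ] <- as.numeric(coef(fit))[-1]           # drop the intercept
  }
  draws                                               # rows: approximate posterior draws
}
```

In practice one would fix $\lambda$ once, for example by cross-validation on the unweighted problem (cv.glmnet), and hold it constant across draws, as described above.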

WBB Properties
Conditions under which the target posterior distribution in Equation (2) is approximately Gaussian are well established (Kleijn & van der Vaart, 2012). For example, when the data form a random sample from a fixed distribution $p(y_i|\theta_0)$ that resides in a sufficiently regular model, and when the prior is smooth and positive around $\theta_0 \in \mathbb{R}^p$, we have, approximately,
$$\theta \mid y \sim N\!\left(\theta^*_n,\; [J_n(\theta^*_n)]^{-1}\right), \qquad (6)$$
where, including the sample size n as an explicit subscript, $\theta^*_n = \arg\min_\theta \mathcal{L}_n(\theta)$ is the posterior mode, and $J_n(\theta^*_n) = n\,\mathcal{I}(\theta^*_n)$ is the Fisher information matrix evaluated at $\theta^*_n$. Here $\mathcal{I}(\theta)$ is the information per sample, and $y = (y_1, \ldots, y_n)$ denotes the data. Johnstone (2010) studies the Bernstein-von Mises theorem in high-dimensional settings where p grows with n, and there the situation is very different. Centring on the posterior mode, rather than the MLE, improves accuracy in many cases.
As to the WBB distribution, consider a one-term Taylor expansion of $\nabla \mathcal{L}_{w,n}(\theta)$ about the posterior mode $\theta^*_n$, which is allowable for sufficiently smooth loss and penalty terms:
$$\nabla \mathcal{L}_{w,n}(\theta) = \nabla \mathcal{L}_{w,n}(\theta^*_n) + \nabla^2 \mathcal{L}_{w,n}(\theta^*_n)\,(\theta - \theta^*_n) + R_n, \qquad (7)$$
where $R_n$ is an error term and $\nabla$ and $\nabla^2$ denote the gradient vector and matrix of second partial derivatives, respectively, of the weighted objective function. Evaluating this expansion at $\theta^*_{w,n} = \arg\min_\theta \mathcal{L}_{w,n}(\theta)$ zeros out the left-hand side of Equation (7) and leads to
$$\theta^*_{w,n} - \theta^*_n = -\left[\nabla^2 \mathcal{L}_{w,n}(\theta^*_n)\right]^{-1} \nabla \mathcal{L}_{w,n}(\theta^*_n) + \tilde R_n, \qquad (8)$$
where $\tilde R_n$ is another error term. Following Newton & Raftery (1994), we recognize that the w-induced variation in Equation (8), conditional upon the data, causes the matrix factor to be approximately the inverse information $[J_n(\theta^*_n)]^{-1}$, the score-like second factor to be approximately mean-zero Gaussian with covariance equal to $J_n(\theta^*_n)$, and the error $\tilde R_n$ to be negligible. Thus, matching the target posterior variation in Equation (6), the conditional distribution of $\theta^*_{w,n}$ given the data is approximately $N\left(\theta^*_n, [J_n(\theta^*_n)]^{-1}\right)$.
In a relatively narrow asymptotic sense, therefore, the WBB procedure approximates the target posterior distribution, as both are approximately Gaussian with the same mean and covariance. Details of the asymptotic analysis follow the WLB case presented in Newton & Raftery (1994), and differ only slightly in our use of the posterior mode $\theta^*_n$ in place of the MLE $\hat\theta_n$, and also in our incorporation of the weight $w_0$ on the penalty term of the objective function. At this level of first-order asymptotic analysis, neither of these features affects the limiting conditional Gaussian distribution of $\theta^*_{w,n}$. We note that rescaling the weights in Equation (4) has no effect on the solutions $\theta^*_{w,n}$, and so it is equivalent in the construction to use normalized weights $\bar w$ that sum to unity. Such $\bar w$ are uniformly distributed over the unit simplex, and thus correspond to a specific Dirichlet distribution. Exponential/Dirichlet weights are motivated from the perspective of both inference, related to the original Bayesian bootstrap (Rubin, 1981; Muliere & Secchi, 1996), and computation, owing to the benefits of smoothly varying weights. Other weight distributions may also be effective (Barbe & Bertail, 2012).
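As an outside-the-paper illustration of this first-order agreement, consider the normal-means model with known variance and a flat prior, where the posterior of the mean is exactly $N(\bar y, \sigma^2/n)$ and each WBB draw reduces to an exponentially weighted average; the sketch below compares the two.

```r
set.seed(42)
n <- 200; sigma <- 2
y <- rnorm(n, mean = 1, sd = sigma)

# With a flat prior the penalty vanishes, and each WBB draw is
# arg min_theta sum_i w_i (y_i - theta)^2 / 2 = sum_i w_i y_i / sum_i w_i.
T <- 5000
wbb <- replicate(T, { w <- rexp(n); sum(w * y) / sum(w) })

c(wbb_mean = mean(wbb), post_mean = mean(y))      # means agree
c(wbb_sd = sd(wbb), post_sd = sigma / sqrt(n))    # spreads agree to first order
```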
We aim to use WBB samples as approximate posterior samples for uncertainty quantification. In general, the relationship between this WBB distribution and the posterior distribution associated with any specific prior remains unknown. The first-order asymptotic approximation above is constructed so that the prior structure is asymptotically negligible; one could say, then, that WBB approximates any one of a number of different posterior distributions. In particular, the WBB samples may not provide a close approximation to the posterior formed from the same prior as used in the penalty term of the objective function. We deploy numerical experiments to understand the approximation better in finite samples.

NUMERICAL EXPERIMENTS
We illustrate the proposed methodology with a number of scenarios to assess the quality of the WBB approximation.

LASSO Experiment
First, consider a simple univariate normal-means problem with a LASSO prior, where $y \mid \theta \sim N(\theta, 1)$ and $p(\theta) \propto \exp(-\lambda|\theta|)$. Given the i.i.d. exponential weights $w_1$ and $w_0$, the weighted posterior mode $\theta^*_w$ is given by
$$\theta^*_w = \arg\min_\theta \left\{ \tfrac{w_1}{2}\,(y - \theta)^2 + w_0\, \lambda\, |\theta| \right\}.$$
This is sufficiently simple for an exact WBB solution in terms of the soft-thresholding proximal operator:
$$\theta^*_w = \mathrm{sign}(y)\left( |y| - \frac{w_0}{w_1}\,\lambda \right)_+ .$$
The WBB mean $E_w(\theta^*_w \mid y)$ is approximated by the sample mean of $\{\theta^{*,t}_w\}_{t=1}^T$. On the other hand, Mitchell (1994) gives an exact expression for the posterior mean under this model, which also shrinks the posterior mean towards zero. WBB samples are approximate posterior draws, though the algorithm structure does not permit a clear description of the prior that is in force for this sampled distribution. Whatever this effective prior is, it may differ from the prior encoded in the penalty used in repeated optimization, as the result in Figure 1 suggests.
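Because the weighted mode has the closed form above, the WBB distribution for this toy problem can be simulated exactly; the sketch below (unit noise variance assumed) draws from it and reports the WBB mean and the point mass at zero induced by the penalty.

```r
# Exact WBB draws for the univariate normal-means LASSO example:
# theta*_w = sign(y) * max(|y| - (w0 / w1) * lambda, 0).
wbb_univariate <- function(y, lambda, T = 10000) {
  w1 <- rexp(T); w0 <- rexp(T)
  sign(y) * pmax(abs(y) - (w0 / w1) * lambda, 0)
}

draws <- wbb_univariate(y = 1.5, lambda = 1)
mean(draws)        # approximates E_w(theta*_w | y)
mean(draws == 0)   # proportion of draws thresholded exactly to zero
```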
A simulation study is conducted in order to further investigate WBB in the context of regression. We simulate data from the linear model
$$y = X\theta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I),$$
where X is n × p and y is n × 1. The design matrix X is generated by drawing its n rows independently from a p-dimensional normal distribution, $N(0, \Sigma)$. Within each row, the covariance between entries in columns i and j is set to be $\Sigma_{i,j} = 0.1 \times 0.8^{|i-j|}$. Furthermore, we set the noise variance $\sigma^2 = \|X\theta\|_2^2/(2n)$, so that the signal-to-noise ratio is 2. Figure 2 displays results from this simulation setting. Next we use the simulation to compare WBB distributions to MCMC-computed posteriors in a range of regression settings. We consider both sparse and dense cases for the coefficient vector $\theta = [\theta_1, \theta_2, \ldots, \theta_p]'$: (i) a sparse setting in which only a small number of the coefficients are nonzero, and (ii) a dense setting in which $\theta_j = 1$ for $1 \le j \le p$. For each setting of $\theta$, we consider p = 40, 60, 80, 100, 120. To investigate the performance of WBB under different sample sizes, we also choose two settings of n: n = 100 for all p, or n = p/2 for all p.
We compare the WBB with LASSO prior and separate weights to Markov chain Monte Carlo under the Bayesian LASSO (Park & Casella, 2008; Carlin & Polson, 1991). The two methods entail different prior penalties: the Bayesian LASSO imposes a noninformative marginal prior on $\sigma^2$, $p(\sigma^2) = 1/\sigma^2$, and the posterior distribution is sampled by a Gibbs sampler, with $\lambda$ chosen by maximizing the marginal likelihood. In WBB, $\lambda$ is chosen via cross-validation, using the standard unweighted LASSO. Here the original LASSO prior is separable and we study the separate-weights version of WBB: in high-dimensional cases, a large common $w_0$ multiplying $\phi(\theta)$ can introduce extra sparsity to all marginal posteriors, and we use separate weights in an effort to overcome this issue. For comparison criteria, we present the estimation MSE of the coefficients (using the posterior mean as the estimate), the out-of-sample prediction MSE (test sets are of the same size as the corresponding training sets), and 95% credible interval coverage. Fixing (p, n, $\theta$), we draw T = 200 posterior samples by each method and the estimation procedure is repeated over B = 500 independent datasets $\{(X, y)^{(b)}\}_{b=1}^B$. Let $\theta^{*,t(b)}$ denote the tth posterior draw with respect to the bth dataset. For each coordinate j, coverage is calculated using $\{\theta_j^{*,t(b)} : 1 \le t \le T,\ 1 \le b \le B\}$. The coverages are then averaged across $1 \le j \le p$.
The complete results for estimation error, out-of-sample prediction error, and credible interval coverage are displayed in Tables 1-3, respectively. In almost all cases when p ≥ 60, WBB has lower estimation error than BLASSO. WBB shows comparable or superior out-of-sample prediction in most cases. Credible intervals of neither WBB nor BLASSO have exact 95% coverage. WBB performs well when p ≥ 100 and $\theta$ is sparse, though it often under-covers. When $\theta_j = 1$ for all $1 \le j \le p$, BLASSO intervals are too wide while WBB intervals have coverage close to 95% when 40 ≤ p ≤ 80. Though limited in scope, this simulation reveals some distinctions and also broad similarities between WBB and Bayesian LASSO in both posterior structure and the sampling performance of approximate posterior summaries. It may provide a useful basis for further methodological development. Serial computations were used in the simulation above, and WBB was slightly slower than BLASSO (by a factor of 1.5 in CPU time); the computational advantage of WBB shows up in parallel computations, since weight vectors induce separate optimizations.

Diabetes Data
To further illustrate the WBB methodology, we apply it to the often-analyzed diabetes dataset (Efron et al., 2004), in which the response is a measure of disease progression and there are 10 baseline variables (p = 10), including age, sex, body mass index, average blood pressure, and six blood serum measurements. We apply Algorithm 1 with T = 1,000. (R code is in the Supplementary Material.) For each weight vector w, the WBB estimate $\theta^*_w$ is calculated using Equation (9) via the regularization method in the package glmnet,
$$\theta^{*,\mathrm{common}}_w := \arg\min_\theta \left\{ \sum_{i=1}^n w_i\, l_i(y_i|\theta) + w_0\, \lambda \sum_{j=1}^p |\theta_j| \right\}. \qquad (9)$$
The regularization factor $\lambda$ is chosen by cross-validation with the unweighted likelihood. The WBB is also performed with separate weights on each $|\theta_j|$,
$$\theta^{*,\mathrm{separate}}_w := \arg\min_\theta \left\{ \sum_{i=1}^n w_i\, l_i(y_i|\theta) + \lambda \sum_{j=1}^p w_{0,j}\, |\theta_j| \right\}.$$
As in the simulation study, WBB is compared to the Bayesian LASSO. Figure 3 shows the results of all three methods (the WBB with common or separate weights on the prior terms, and the Bayesian LASSO); marginal posteriors for the $\theta_j$ are presented. For some coefficients there is very good agreement among the methods (e.g., bmi and map). One notable feature is that the WBB tends to introduce less sparsity than the Bayesian LASSO. For example, the Bayesian LASSO posteriors of age, tc, ldl, tch and glu have higher spikes near zero compared with the WBB.
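For reference, a hypothetical call pattern for this analysis, reusing the wbb_lasso sketch from Section 2 together with the diabetes data shipped in the lars package; the preprocessing choices here are illustrative rather than the authors'.

```r
library(lars)      # provides the Efron et al. (2004) diabetes data
data(diabetes)
X <- as.matrix(diabetes$x)           # 442 x 10 matrix of baseline covariates
y <- diabetes$y - mean(diabetes$y)   # centred response (no intercept in the sketch)

library(glmnet)
lambda <- cv.glmnet(X, y, intercept = FALSE)$lambda.min   # unweighted cross-validation
draws  <- wbb_lasso(X, y, lambda = lambda, T = 1000, separate = TRUE)

# Approximate marginal posterior summaries for each coefficient
apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
```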

Trend Filtering
The generalized LASSO solves
$$\theta^* = \arg\min_\theta \left\{ \tfrac{1}{2}\|y - X\theta\|_2^2 + \lambda\, \|D\theta\|_1 \right\},$$
where $l(y|\theta) = \tfrac{1}{2}\|y - X\theta\|_2^2$ is the negative log-likelihood, $D \in \mathbb{R}^{m \times p}$ is a penalty matrix, and $\phi(\theta) = \|D\theta\|_1$ is the negative log-prior or regularization penalty. There are fast path algorithms for solving this problem (see the genlasso package).
As a subproblem, polynomial trend filtering (Tibshirani, 2014) allows for piecewise polynomial curve-fitting, where the knots and the parameters are chosen adaptively. Intuitively, the trend-filtering estimator is similar to an adaptive spline model: it penalizes the discrete derivative of order k, resulting in piecewise polynomials of higher degree for larger k. Specifically, $X = I_p$ in the trend filtering setting and the data $y = (y_1, \ldots, y_p)$ are assumed to be meaningfully ordered from 1 to p. The penalty matrix is built from the discrete (k + 1)th-order derivative: $D^{(1)}$ is the first-difference matrix, and $D^{(k+1)} = D^{(1)} D^{(k)}$ for k = 1, 2, 3, …. For example, the log-prior in linear trend filtering (k = 1) penalizes second differences,
$$\phi(\theta) = \sum_{i=1}^{p-2} \left|\theta_i - 2\theta_{i+1} + \theta_{i+2}\right|.$$
For a general order k > 1, WBB solves the following generalized LASSO problem in each draw:
$$\theta^*_w = \arg\min_\theta \left\{ \frac{1}{2}\sum_{i=1}^p w_i\,(y_i - \theta_i)^2 + w_0\, \lambda\, \|D^{(k+1)}\theta\|_1 \right\}.$$
To illustrate, we simulate data $y_i$ from a Fourier series regression for i = 1, 2, … , n = 500, where the errors $\epsilon_i \sim N(0, 2^2)$ are i.i.d. Gaussian deviates. The cubic trend filtering result is given in Figure 4. For each i, the WBB gives a group of estimates $\{\theta^{*,t}_w(i) : 1 \le t \le T\}$, where T is the total number of draws. The standard deviation of these weighted solutions constitutes a posterior standard deviation, or essentially a standard error for the estimator $\hat\theta_i$.
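A hedged sketch of one WBB draw for order-k trend filtering, folding the likelihood weights into the response and design so that the genlasso solver can be reused; the difference-matrix construction and the rescaling trick are our illustration, not the paper's code, and the lambda interpolation relies on genlasso's coef method.

```r
library(genlasso)

# (k+1)th-order discrete difference matrix: D^(1) is the first-difference matrix
# and D^(k+1) = D^(1) D^(k), as in the text.
make_D <- function(p, k) {
  D <- diff(diag(p))
  if (k >= 1) for (j in seq_len(k)) D <- diff(diag(nrow(D))) %*% D
  D
}

# One WBB draw: minimize 0.5 * sum_i w_i (y_i - theta_i)^2 + w0 * lambda * ||D theta||_1,
# rewritten as an ordinary generalized LASSO with sqrt(w) folded into y and X.
wbb_trendfilter_draw <- function(y, lambda, k = 3) {
  p  <- length(y)
  w  <- rexp(p); w0 <- rexp(1)
  D  <- make_D(p, k)
  fit <- genlasso(y = sqrt(w) * y, X = diag(sqrt(w)), D = D)
  drop(coef(fit, lambda = w0 * lambda)$beta)
}
```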

Deep Learning: MNIST Example
Deep learning is a form of machine learning that uses hierarchical abstract layers of latent variables to perform pattern matching and prediction. A Bayesian probabilistic perspective provides a number of insights into more efficient algorithms for optimization and hyper-parameter tuning. The general goal is to find a predictor of an output y given a high-dimensional input x. For a classification problem, y ∈ {1, 2, … , K} is a discrete variable and can be coded as a K-dimensional 0-1 vector. The model is as follows. Let $z^{(l)}$ denote the lth layer, so that $x = z^{(0)}$. The final output is the response y, which can be numeric or categorical. A deep prediction rule (Polson & Sokolov, 2017) is then the layered composition
$$z^{(l)} = f^{(l)}\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right), \quad l = 1, \ldots, L,$$
with the last layer producing the prediction. Here, $W^{(l)}$ are weight matrices, $b^{(l)}$ are threshold or activation levels, and $f^{(l)}$ is the activation function. Probabilistically, the output y in a classification problem is generated by a multinomial probability model whose negative log-likelihood is the cross-entropy
$$l(y|W, b) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log p_k(x_i),$$
where $y_{ik}$ is 0 or 1 and $p_k(x_i)$ is the model's class-k probability for input $x_i$. Adding the negative log-prior $\lambda\,\phi(W, b)$, the objective function (negative log-posterior) to be minimized by SGD is
$$\mathcal{L}(W, b) = l(y|W, b) + \lambda\,\phi(W, b).$$
Accordingly, with each draw of weights w, WBB provides the estimates $(W^*_w, b^*_w)$ by solving the following optimization problem:
$$(W^*_w, b^*_w) = \arg\min_{W, b} \left\{ \sum_{i=1}^n w_i\, l_i(y_i|W, b) + w_0\, \lambda\, \phi(W, b) \right\}.$$
We take the classic MNIST example (LeCun & Cortes, 2010) of classifying images of handwritten digits into K = 10 classes. For simplicity, we build a 2-layer neural network with layer sizes 128 and 64, respectively, so the parameter dimensions are determined by the 784-dimensional input, the two hidden layers, and the 10 output classes. The activation function $f^{(i)}$ is the ReLU, $f(x) = \max\{0, x\}$, and the negative log-prior is a standard weight penalty $\lambda\,\phi(W, b)$, where we manually set $\lambda = 10^{-4}$. Figure 5 shows the posterior distribution of the classification accuracy in the test dataset. We see that the test accuracies are centred around 0.75 and the posterior distribution is left-skewed. Furthermore, the accuracy is higher than 0.35 in 99% of the cases. The 95% interval is [0.407, 0.893]. Due to the simple 2-layer neural network structure, the classification accuracy is admittedly low compared with the state-of-the-art (e.g., Liang & Hu, 2015). This illustrative example shows how WBB can be easily implemented in deep learning. Implementing variational Bayes in deep learning is discussed in Polson & Sokolov (2017).
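The paper's example trains a two-layer network on MNIST by SGD; as a much smaller, self-contained illustration of the same weighting scheme (and emphatically not the authors' implementation), the sketch below uses the single-hidden-layer nnet package on the iris data, with case weights playing the role of the likelihood weights $w_i$ and a randomized decay parameter playing the role of $w_0\lambda$ on an $L_2$ penalty.

```r
library(nnet)

data(iris)
X <- scale(iris[, 1:4]); y <- iris$Species
lambda <- 1e-4; T <- 50
set.seed(7)
idx <- sample(nrow(X), 100)                  # fixed train/test split

test_acc <- replicate(T, {
  w  <- rexp(nrow(X))                        # likelihood weights
  w0 <- rexp(1)                              # penalty weight
  fit <- nnet(X[idx, ], class.ind(y[idx]), weights = w[idx],
              size = 8, decay = w0 * lambda, softmax = TRUE,
              maxit = 200, trace = FALSE)
  pred <- max.col(predict(fit, X[-idx, ]))
  mean(pred == as.integer(y[-idx]))          # held-out accuracy for this draw
})

quantile(test_acc, c(0.025, 0.5, 0.975))     # approximate posterior summary of accuracy
```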

DISCUSSION
WBB provides a computationally attractive solution to scalable Bayesian inference (Minsker et al., 2014; Welling & Teh, 2011), whilst accounting for parameter uncertainty, by maximizing a weighted posterior distribution. WBB can also be used in conjunction with proximal methods (Parikh & Boyd, 2014; Polson, Scott & Willard, 2015) to provide sparsity in high-dimensional statistical problems. With a similar ease of computation, WBB provides an alternative to ABC methods (Beaumont, 2009) and variational Bayes (VB) methods (Blei, Kucukelbir & McAuliffe, 2017). A fruitful area for future research is the comparison of WBB to recent extensions of the WLB for generalized Bayesian inference (Lyddon, Holmes & Walker, 2019). For a wide range of non-smooth objective functions and statistical models, recent regularization methods provide fast, scalable algorithms for calculating estimates of the form in Equation (1), which can also be viewed as posterior modes. Therefore, as $\lambda$ varies we obtain a full regularization path as a form of prior sensitivity analysis. Furthermore, Strawderman, Wells & Schifano (2013), among others, considered scenarios where posterior modes can be used as posterior means from augmented probability models. There may be useful interpretations of the random weights from the data-augmentation perspective.
Extending the WBB asymptotics presents an exciting research opportunity. The argument in Section 2.5 relies on smoothness in both the sampling model and the prior, and it retains a fixed parameter dimension as n increases. Theoretical guarantees remain unavailable for relatively large parameter dimension or for non-smooth penalty functions. Fortunately, groundwork has been laid, for example by van der Pas, Kleijn & van der Vaart (2014), Narisetty & He (2014), and others on the asymptotic behaviour of the posterior distribution, and by Knight & Fu (2000) and others on the sampling theory of optimization-based estimators.

APPENDIX
Here we consider how SGD may be deployed to minimize the penalized loss function $\sum_{i=1}^n w_i\, l_i(y_i;\theta) + w_0\,\lambda\,\phi(\theta)$. The method minimizes the function by taking, at iteration k, a negative step along an estimate $g_k$ of the gradient $\nabla_\theta \left[ \sum_{i=1}^n w_i\, l_i(y_i;\theta_k) + w_0\,\lambda\,\phi(\theta_k) \right]$. The approximate gradient is estimated by calculating
$$g_k = \frac{n}{b_k} \sum_{i \in E_k} w_i\, \nabla_\theta l_i(y_i;\theta_k) + w_0\,\lambda\, \nabla_\theta \phi(\theta_k),$$
where $E_k \subset \{1, \ldots, n\}$ and $b_k = |E_k|$ is the number of elements in $E_k$. When $b_k > 1$ the algorithm is called batch SGD, and simply SGD otherwise. A usual strategy for choosing the subset is to cycle through consecutive elements of $\{1, \ldots, n\}$, $E_{k+1} = [E_k \bmod n] + 1$. The approximate direction $g_k$ is calculated using the chain rule (i.e., back-propagation) in deep learning settings, and it is an unbiased estimator of the gradient. Thus, at each iteration, SGD updates the solution as $\theta_{k+1} = \theta_k - t_k g_k$.
For deep learning applications the step size $t_k$ (i.e., the learning rate) is usually kept constant, or some simple step-size reduction strategy is used, such as exponential decay in the iteration number k. Appropriate learning rates, or the hyperparameters of the reduction schedule, are usually found empirically from numerical experiments and observations of the loss function's progression.
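A minimal sketch (our own, for a linear model with squared-error loss, a differentiable $L_2$ penalty, a constant learning rate, and cyclic mini-batches) of the update described above, applied to the randomized objective of Equation (4).

```r
# Mini-batch SGD for sum_i w_i * 0.5 * (y_i - x_i' theta)^2 + w0 * lambda * ||theta||^2 / 2.
sgd_wbb <- function(X, y, w, w0, lambda, step = 1e-3, epochs = 50, batch = 10) {
  n <- nrow(X); p <- ncol(X)
  theta <- rep(0, p)
  for (e in seq_len(epochs)) {
    for (start in seq(1, n, by = batch)) {
      idx   <- start:min(start + batch - 1, n)            # cycle through consecutive mini-batches
      Xb    <- X[idx, , drop = FALSE]
      resid <- drop(Xb %*% theta) - y[idx]
      g <- (n / length(idx)) * drop(crossprod(Xb, w[idx] * resid)) +  # rescaled loss gradient
           w0 * lambda * theta                                        # exact penalty gradient
      theta <- theta - step * g
    }
  }
  theta
}
```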