SwISS: A Scalable Markov chain Monte Carlo Divide-and-Conquer Strategy

Divide-and-conquer strategies for Monte Carlo algorithms are an increasingly popular approach to making Bayesian inference scalable to large data sets. In its simplest form, the data are partitioned across multiple computing cores and a separate Markov chain Monte Carlo algorithm on each core targets the associated partial posterior distribution, which we refer to as a sub-posterior, that is the posterior given only the data from the segment of the partition associated with that core. Divide-and-conquer techniques reduce computational, memory and disk bottlenecks, but make it difficult to recombine the sub-posterior samples. We propose SwISS: Sub-posteriors with Inflation, Scaling and Shifting; a new approach for recombining the sub-posterior samples which is simple to apply, scales to high-dimensional parameter spaces and accurately approximates the original posterior distribution through affine transformations of the sub-posterior samples. We prove that our transformation is asymptotically optimal across a natural set of affine transformations and illustrate the efficacy of SwISS against competing algorithms on synthetic and real-world data sets.


Introduction
Markov chain Monte Carlo (MCMC) algorithms are widely used within Bayesian modelling to sample from the often intractable posterior distribution. These techniques are widely applicable and only require point-wise evaluation of the posterior density. One potential drawback of MCMC algorithms is their lack of scalability: the computational cost of MCMC is typically linear in the amount of data, and for large data sets both the computation and the storage requirements can become prohibitive.
In settings with large data sets, or where the model is computationally expensive, evaluating the posterior at every iteration of the MCMC algorithm may be infeasible. Strategies to overcome this include data subsampling (Welling and Teh, 2011; Baker et al., 2019a,b; Nemeth and Fearnhead, 2021), where only a subset of the data is used at each MCMC iteration, or delayed acceptance (Sherlock et al., 2017; Quiroz et al., 2018), where the Metropolis-Hastings accept-reject probability is first computed with a cheaper approximation to the true posterior and the full-data posterior is evaluated less frequently.

arXiv:2208.04080v1 [stat.CO] 8 Aug 2022
In situations where it is easy to parallelise computation, in a MapReduce framework or through cloud-computing infrastructure such as Amazon Web Services, statistical modelling becomes readily scalable to large data sets. However, applying this approach in practice with algorithms such as MCMC, which are designed to work in serial rather than in parallel, is challenging. In this paper, we consider the divide-and-conquer strategy to circumvent the computational bottleneck of MCMC, where the data are partitioned into batches and each batch is stored on a separate computer core. MCMC is then applied independently to each data batch, and the posterior samples from each computer are combined to form an accurate approximation of the full posterior, i.e., the posterior that would have been obtained using the full data set.
The main challenge with divide-and-conquer approaches for MCMC lies in the merge step. A range of approaches has been considered in the literature, such as: weighted averages of the batch samples (Scott et al., 2016); kernel density estimation (Neiswanger et al., 2014); Gaussian process approximations (Nemeth and Sherlock, 2018); the Wasserstein barycenter of different measures (Srivastava et al., 2015); the geometric median of batch samples (Minsker et al., 2014); and post-MCMC importance sampling (Entezari et al., 2018), to name a few.
One of the most popular algorithms in the literature is the consensus Monte Carlo algorithm (Scott et al., 2016), which approximates the full posterior using a weighted average of sub-posterior samples. The consensus approach is computationally cheap to apply, does not require tuning and scales well to high-dimensional parameter spaces. It is also analytically exact in the case of Gaussian sub-posteriors, but can produce poor approximations when the sub-posteriors are non-Gaussian (see Section 4.1).
In this paper we propose SwISS, an algorithm that is as fast as the consensus algorithm, is exact in the Gaussian case, does not require tuning, and scales well to high-dimensional posterior distributions; in the case of non-Gaussian sub-posteriors, however, it can produce more accurate posterior approximations than the consensus algorithm. Unlike the consensus approach, SwISS does not merge samples but instead applies a transformation to posterior samples that are generated from a stochastic approximation of the full posterior. As in Entezari et al. (2018), we refer to this stochastic approximation as the inflated sub-posterior: the posterior density, conditional on a subset of the data, raised to a positive power. Inflating the sub-posterior in this manner has the effect of approximately preserving the shape of the posterior density conditional on the full data. Affine transformations (shift and re-scale) are applied to each batch of sub-posterior samples to form an approximate sample from the full posterior. This generalises the algorithm of Wu and Robert (2017), which simply shifts each sub-posterior with no further correction, and hence performs poorly when the sub-posterior variances differ substantially. There are many different affine transformations that produce a sample from the true posterior when the sub-posteriors are Gaussian; we provide theoretical support for our particular choice, showing that, in a natural sense, it is optimal amongst the set of transformations that are exact in the Gaussian case.
The paper is organised as follows: Section 2 provides an introduction to divide-and-conquer MCMC, covering the notation for posterior and sub-posterior densities. In Section 3 we introduce our proposed algorithm, SwISS, and provide supporting theoretical results and pseudo-code for implementation. Section 4 assesses the numerical performance of SwISS against other popular divide-and-conquer algorithms from the literature. Finally, Section 5 summarises the contributions of the paper.

Preliminaries
Let f(y|θ) be the likelihood for a statistical model, parameterised by θ ∈ R^d, for a data set y = {y_1, y_2, ..., y_n} of length n. Let π_0(θ) denote the prior density for the parameter vector θ; our posterior density is then, up to a constant of proportionality,

    π(θ|y) ∝ π_0(θ) f(y|θ).    (1)

We assume that y can be partitioned into B batches, y_1, ..., y_B, such that the likelihood for the full data is the product of the likelihoods for the individual batches, i.e., f(y|θ) = ∏_{b=1}^B f_b(y_b|θ). This is the case, for example, when the individual data points are independent. The posterior density for θ given y is, up to a constant of proportionality,

    π(θ|y) ∝ π_0(θ) ∏_{b=1}^B f_b(y_b|θ).    (2)

In the literature, there are generally two approaches to applying MCMC on batches of data. In the first approach, MCMC is applied to target a sub-posterior density for each batch b = 1, ..., B, of the form

    π_b(θ|y_b) ∝ π_0(θ)^{1/B} f_b(y_b|θ),    (3)

such that π(θ|y) ∝ ∏_{b=1}^B π_b(θ|y_b). If we assume that there are J sub-posterior samples from each of the B batches, which we define as (θ_b^{(j)}; j ∈ {1, ..., J}, b ∈ {1, ..., B}), then the consensus Monte Carlo algorithm (Scott et al., 2016) gives a simple strategy for approximating the full posterior (2) through a weighted average of the sub-posterior samples,

    ϑ^{(j)} = (∑_{b=1}^B W_b)^{-1} ∑_{b=1}^B W_b θ_b^{(j)},

where the weights are typically chosen to be the inverse sample variance of each sub-posterior, W_b = V_b^{-1}. If each sub-posterior is Gaussian, then the consensus algorithm produces exact samples from the full posterior.
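As a concrete illustration of the consensus combination described above, the following sketch merges synthetic Gaussian "sub-posterior" draws using inverse-covariance weights; the batch moments, dimensions and sample counts are invented for the example, and in practice the samples would be MCMC output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for MCMC output: B batches of J sub-posterior
# draws in d dimensions, each batch Gaussian with its own moments.
B, J, d = 4, 2000, 3
means = [rng.normal(size=d) for _ in range(B)]
covs = []
for _ in range(B):
    A = rng.normal(size=(d, d))
    covs.append(A @ A.T + d * np.eye(d))  # guarantees positive definiteness
samples = [rng.multivariate_normal(means[b], covs[b], size=J) for b in range(B)]

# Consensus merge: weight each batch by its inverse sample covariance,
# W_b = V_b^{-1}, and average draw-by-draw.
W = [np.linalg.inv(np.cov(s, rowvar=False)) for s in samples]
W_inv = np.linalg.inv(sum(W))
merged = np.stack([
    W_inv @ sum(W[b] @ samples[b][j] for b in range(B)) for j in range(J)
])
print(merged.shape)  # (2000, 3)
```

By linearity, the mean of the merged draws is exactly the precision-weighted average of the batch sample means, which is the Gaussian-exactness property the text refers to.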
A second approach applies MCMC to each inflated sub-posterior, where the target density for batch b = 1, ..., B is

    π_b^B(θ|y_b) ∝ [π_0(θ)^{1/B} f_b(y_b|θ)]^B = π_0(θ) f_b(y_b|θ)^B.    (4)

This is a stochastic (across partitions of the data) approximation to the full posterior, π(θ|y) ≈ π_b^B(θ|y_b), and hence individual samples from it are, in a sense, already on the same scale as samples from the full posterior.
If we assume that the data are partitioned equally across batches then, in the limit as the amount of data in each batch, n* = n/B, approaches infinity, the likelihood will typically dominate the prior, so that by the Bernstein-von Mises theorem the b-th inflated sub-posterior is approximately N(θ̂_b, I_{O,b}^{-1}(θ̂_b)), where, approximately, θ̂_b ∼ N(θ_0, B I_E^{-1}(θ_0)), and where I_E and I_{O,b} are the full-data expected information and the observed information from the inflated likelihood for batch b, respectively, and θ_0 is the true parameter value. Hence, the differences between the expectations of the inflated sub-posteriors are O(n^{-1/2}) and, since I_{O,b} = I_E[1 + O(n^{-1/2})], the ratio of the variances of the inflated sub-posteriors is 1 + O(n^{-1/2}). However, in practice, both the location and the scale of the inflated sub-posteriors can vary considerably if the partitioned data sets are imbalanced (see the examples in Section 4). Our proposed algorithm, SwISS, provides a correction for the discrepancy in the variance and location of the sub-posterior approximations.
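A minimal worked example of why inflation restores the full-data scale, using a toy conjugate-normal model with a flat prior (all numbers invented for the illustration): the plain sub-posterior variance is B times the full-posterior variance, whereas the inflated sub-posterior matches the full-posterior variance exactly, with only the batch means differing by O(n^{-1/2}).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model: y_i ~ N(theta, sigma2) with a flat prior, so the
# full posterior for theta is N(ybar, sigma2 / n).
n, B, sigma2 = 10_000, 10, 4.0
y = rng.normal(loc=0.5, scale=np.sqrt(sigma2), size=n)
batches = np.split(y, B)

full_var = sigma2 / n                    # full-posterior variance
sub_var = sigma2 / (n // B)              # plain sub-posterior: B times wider
inflated_var = sigma2 / (B * (n // B))   # batch likelihood raised to power B

print(round(sub_var / full_var))         # 10
print(inflated_var == full_var)          # True

# The inflated sub-posterior means still differ across batches by O(n^{-1/2}),
# which is the residual discrepancy that SwISS corrects.
print(max(abs(b.mean() - y.mean()) for b in batches) < 0.5)
```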

SwISS Algorithm
Suppose that we have applied independent Monte Carlo algorithms, such as MCMC, in parallel to sample from the inflated sub-posteriors (4), and denote the j-th (of J) sample from the b-th (of B) batch by θ_b^{(j)}. The SwISS algorithm transforms each sample from the inflated sub-posterior into a sample from an approximation to the full posterior (2) using a batch-specific affine transformation. In the case of Gaussian sub-posteriors, as we show below, these affine transformations produce a set of samples from the correct Gaussian full posterior, and SwISS is exact in this setting. In general, sub-posteriors are non-Gaussian; however, under standard regularity conditions and the Bernstein-von Mises theorem (Le Cam et al., 2000), as n* = n/B approaches infinity the sub-posteriors will be approximately Gaussian, and SwISS can be expected to produce samples from an approximation to the full posterior.
Firstly, let us suppose that each inflated sub-posterior is Gaussian with expectation µ_b and invertible variance matrix V_b, so that the full posterior is θ|y ∼ N_d(µ, V), where

    V = B (∑_{b=1}^B V_b^{-1})^{-1}  and  µ = (∑_{b=1}^B V_b^{-1})^{-1} ∑_{b=1}^B V_b^{-1} µ_b    (5)

are the variance and mean of the full posterior.
Since it is invertible, V_b is a positive-definite matrix and therefore has a d × d, invertible square root, M_b; i.e. V_b = M_b M_b^T. Similarly, the full posterior variance V has a square root, M, so that V = M M^T. Let samples from the b-th inflated sub-posterior, θ_b^{(1:J)}, be (marginally) realisations of the random variable θ_b ∼ π_b^B, and define the transformed random variable

    ϑ = A_b (θ_b − µ_b) + µ,  where A_b = M M_b^{-1}.    (6)

Then E[ϑ] = µ and Var[ϑ] = M M_b^{-1} V_b M_b^{-T} M^T = M M^T = V. Furthermore, an affine transformation of a Gaussian random variable is Gaussian, and hence ϑ ∼ N_d(µ, V). Applying the same transformation to the individual samples from the b-th batch therefore provides a sample from the full posterior. As discussed above, even when the sub-posteriors are not Gaussian, we can still apply the same scaling and shifting to any sub-posterior samples and produce samples from an approximation to the full posterior.

Choice of Matrix Square Roots
Matrix square roots are not unique; e.g. for a diagonal matrix, each element of the diagonal square root could be negated. Methods for finding a square root of a positive-definite matrix include the Cholesky decomposition and the simple asymmetric square root arising from the spectral decomposition. Moreover, for square roots M and M_b, A_b need not simply be M M_b^{-1}; indeed, this is not always the most sensible choice. For now, let M be any d × d square root of V, let A_b = M M_b^{-1}, and suppose that V_b = V with M = I; provided M_b is chosen to be I, A_b becomes the identity transformation. To be clear, though, if some diagonal elements of M_b had been chosen to be −1 rather than 1 then A_b would not be the identity and, unless the initial distribution of points θ_b^{(1:J)} was elliptically symmetric, the transformation in (6) would not then lead to a set of points that represented the true posterior at all. Applying the same logic, the transformation M_b should be the square root of V_b that moves the individual points θ_b^{(j)} as little as possible. With this in mind, we define a natural measure of the distance moved by J points, θ^{(1:J)}, to which a linear transformation A is applied, as

    D(A; θ^{(1:J)}) = (1/J) ∑_{j=1}^J ‖A θ^{(j)} − θ^{(j)}‖_2^2,    (7)

where ‖·‖_2 denotes Euclidean distance. We wish to find the square root M_b that minimises the distance moved by the centred points θ_b^{(1:J)} − µ_b under the standardising transformation M_b^{-1}. In Section 3.2 we show that, provided the points have expectation zero, as J ↑ ∞, the best choice of M_b is the positive-definite, symmetric square root of V_b; this is the square root used by SwISS. The choice of square root, M, of V is less important; however, with the general motivation of preventing excess movement, SwISS sets M to be the positive-definite, symmetric square root of V. Finally, the averaged re-centring algorithm of Wu and Robert (2017) can be viewed as a special case of SwISS where A_b = I.

The Positive-Definite, Symmetric Square Root and its Optimality
We first define the positive-definite, symmetric square root of a positive-definite matrix and detail the sense in which it is optimal with respect to the distance measure D in (7).
Let V be a positive-definite matrix and let its spectral decomposition be

    V = U Λ^2 U^T,    (8)

where Λ is a diagonal matrix with entries equal to the positive square roots of the eigenvalues of V, and U is a unitary matrix (i.e. the columns of U are the orthonormal right eigenvectors of V, so U U^T = I = U^T U), and so the positive-definite, symmetric square root of V is

    M = U Λ U^T.    (9)

The natural interpretation of M is as a simple scaling transformation, with a different scaling applied along each of the eigenvectors of V.
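In NumPy, the symmetric positive-definite square root described through (8) and (9) can be computed from the eigendecomposition; a small sketch (the function name `spsq` mirrors the notation used later in Algorithm 1):

```python
import numpy as np

def spsq(V):
    """Symmetric positive-definite square root of V: with V = U diag(l) U^T
    (l the eigenvalues of V), return M = U diag(sqrt(l)) U^T."""
    eigvals, U = np.linalg.eigh(V)        # eigh: for symmetric V
    return (U * np.sqrt(eigvals)) @ U.T   # scales columns of U by sqrt(l)

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
V = A @ A.T + 4 * np.eye(4)               # an arbitrary positive-definite matrix

M = spsq(V)
print(np.allclose(M, M.T))                      # True: symmetric
print(np.allclose(M @ M, V))                    # True: a square root of V
print(bool(np.all(np.linalg.eigvalsh(M) > 0)))  # True: positive definite
```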
As explained previously, we require a matrix A_b such that the transformation (6) leads to a sample with a variance of V; however, when the inflated sub-posteriors are non-Gaussian, we also need a transformation that preserves the shape and orientation of the inflated sub-posterior as much as possible. Theorem 1 shows that, for large J, choosing M_b^{-1} is not likely to cause more than the minimum discrepancy, given the constraints.
Theorem 1. Let θ^{(j)} ∈ R^d (j = 1, ..., J) be a set of independent and identically distributed realisations of a random variable θ with E[θ] = 0 and Var[θ] = V. Then, amongst all matrices A such that A^{-1} is a square root of V, the limit as J → ∞ of D(A; θ^{(1:J)}) is, almost surely, minimised by taking A^{-1} = U Λ U^T, the positive-definite, symmetric square root of V from (9).

The proof uses the following two auxiliary results.

Proposition 1. Let Z^{(j)} (j = 1, ..., J) be independent and identically distributed with E[Z] = 0 and Var[Z] = I_d, and let B be a d × d matrix. Then, as J → ∞,

    D(B; Z^{(1:J)}) → trace[(B − I)(B − I)^T] = trace(B B^T) − 2 trace(B) + d,

where the convergence is almost sure.

Proof. By the strong law of large numbers, D(B; Z^{(1:J)}) = (1/J) ∑_{j=1}^J ‖(B − I) Z^{(j)}‖_2^2 → E[‖(B − I)Z‖_2^2] = trace[(B − I)^T (B − I)], where the convergence is almost sure. But trace(B^T B) = trace(B B^T), giving the required result.

Lemma 1. Let V, Λ and U be as defined in Theorem 1, and let λ_i = Λ_{i,i} (i = 1, ..., d). Then

    sup { trace(M) : M M^T = V } = ∑_{i=1}^d λ_i.

The supremum is achieved when M = U Λ U^T.

Proof. The columns of U form an orthonormal basis e_1, ..., e_d, with e_i^T V e_i = λ_i^2 (i = 1, ..., d). For any matrix M with M M^T = V, trace(M) = ∑_{i=1}^d e_i^T M e_i = ∑_{i=1}^d e_i^T f_i, where f_i = M^T e_i. Next, recall that for any unitary matrix, U, and square matrix, M, trace(U M U^T) = trace(M). Thus, using the Cauchy-Schwarz inequality, e_i^T f_i ≤ ‖e_i‖_2 ‖f_i‖_2 = (e_i^T M M^T e_i)^{1/2} = (e_i^T V e_i)^{1/2} = λ_i, and hence trace(M) ≤ ∑_{i=1}^d λ_i. The final part of the Lemma follows, since U^T is also unitary, as trace(U Λ U^T) = trace(Λ) = ∑_{i=1}^d λ_i.

Proof of Theorem 1. Let Z^{(j)} = A θ^{(j)}, so that E[Z] = 0 and Var[Z] = A V A^T = I_d, and let B = A^{-1}, so that B B^T = V. Since D(A; θ^{(1:J)}) = D(B; Z^{(1:J)}), by Proposition 1 and then Lemma 1 we have, almost surely,

    lim_{J→∞} D(A; θ^{(1:J)}) = trace(V) − 2 trace(B) + d ≥ trace(V) − 2 ∑_{i=1}^d λ_i + d,

with equality when B = U Λ U^T.

Algorithm 1: SwISS Algorithm; here SPSQ(V) denotes the symmetric positive-definite square root of the matrix V as described through (8) and (9).

Require: {θ_b^{(j)}}_{j=1}^J — J Monte Carlo samples from each of the B inflated sub-posteriors.
1. Calculate the mean and variance of each of the inflated sub-posteriors: for b ∈ {1, ..., B} do µ_b ← mean[θ_b^{(1:J)}] and V_b ← var[θ_b^{(1:J)}].
2. Set the global mean µ and variance V, and calculate the matrix square root M ← SPSQ(V).
3. Apply the affine transformation to the inflated sub-posterior samples: for b ∈ {1, ..., B} do A_b ← M SPSQ(V_b)^{-1} and ϑ_b^{(1:J)} ← A_b (θ_b^{(1:J)} − µ_b) + µ.
4. Concatenate the transformed samples ϑ_b^{(1:J)} to give a Monte Carlo approximation of the full posterior distribution π(θ|y).
return ϑ

The affine transformation (6) of SwISS is easy to apply to each batch of inflated sub-posterior samples, making the algorithm as fast and as simple to use as the consensus algorithm, with the guarantee of exactness in the Gaussian case. A visual representation of SwISS is given in Figure 1 and pseudo-code for implementing the algorithm is given in Algorithm 1.

Experiments
In this section we test the accuracy of the SwISS algorithm in merging batch posterior samples drawn from a variety of posterior distributions. We consider various complex posterior geometries to highlight the difference between affine transformations of posterior samples (i.e. SwISS) and averaging posterior samples (i.e. consensus Monte Carlo). We also investigate the efficiency of alternative merging algorithms on popular statistical models with simulated and real data. We compare the SwISS algorithm against the following popular competing algorithms from the literature:

• Consensus Monte Carlo (Cons) algorithm (Scott et al., 2016), as described in Section 2.
• Semiparametric density estimation (SKDE) from Neiswanger et al. (2014), where sub-posteriors are approximated semi-parametrically as described in Hjort and Glad (1995).
• Average re-centring (AR) algorithm from Wu and Robert (2017), which is a special case of SwISS where A_b = I.
• Gaussian barycenter (GB) algorithm (Srivastava et al., 2018): assuming a Gaussian approximation for each inflated sub-posterior, the barycenter is the geometric centre of the inflated sub-posterior distributions.
We assess the accuracy of the above algorithms in combining batch posterior samples to form an approximation of the full posterior, comparing the merged approximations against the full posterior, which is generated by sampling (in serial) from the posterior conditional on the full data set. Accuracy of estimation of the posterior of the d-dimensional parameter, θ, is assessed with the following discrepancy measures:

• Mahalanobis distance (Mah): Mah = [(µ_a − µ_f)^T V_f^{-1} (µ_a − µ_f)]^{1/2}, where V_f and µ_f are the variance and mean estimates of posterior samples taken from the full data posterior using an MCMC algorithm. For a given posterior approximation algorithm, e.g. SwISS, µ_a denotes the estimated mean.
• Mean absolute skew deviation (Skew): the absolute deviation between η_a and η_f, where η denotes the sum over components of the third standardised moments.
• Integrated absolute distance (IAD): the average of the integrated absolute differences between two kernel density estimates of the marginal posteriors for each component, j, of θ: π̂_j^f, obtained from samples from the true posterior, and π̂_j^a, obtained using one of the approximate merging algorithms (Chan et al., 2021).
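A sketch of how an IAD-style metric can be computed from two sample arrays using kernel density estimates; the function name, grid construction and the use of SciPy's default bandwidth are our illustrative choices, not taken from the cited references.

```python
import numpy as np
from scipy.stats import gaussian_kde

def iad(full, approx, grid_size=512):
    """Integrated absolute distance between two (J, d) sample arrays,
    averaged over the d marginals: 0 for identical densities, close to 1
    for densities with essentially disjoint support."""
    d = full.shape[1]
    total = 0.0
    for j in range(d):
        a, b = full[:, j], approx[:, j]
        grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()),
                           grid_size)
        diff = np.abs(gaussian_kde(a)(grid) - gaussian_kde(b)(grid))
        # Trapezoidal rule; the factor 1/2 scales the distance into [0, 1].
        step = grid[1] - grid[0]
        total += 0.5 * float(np.sum((diff[:-1] + diff[1:]) / 2) * step)
    return total / d

rng = np.random.default_rng(4)
x = rng.normal(size=(4000, 2))
print(iad(x, x) == 0.0)        # True: identical samples
print(iad(x, x + 10.0) > 0.9)  # True: essentially disjoint marginals
```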

Complex Posterior Geometries
One of the main motivations for using MCMC to sample from a posterior distribution, rather than using deterministic approximations (e.g. Laplace), is that posteriors are often non-Gaussian. We consider three artificially generated posterior distributions of dimension one or two (see Figure 2), which reflect a range of potential posterior shapes, and we compare SwISS against the consensus Monte Carlo algorithm in these settings. Below, φ(µ) denotes the probability density function of a Gaussian N(µ, 1), for some µ ∈ R.

• Rare Bernoulli density: this corresponds to a posterior with 1000 Bernoulli observations with a single positive response and a uniform prior on the success probability, θ, which gives a skewed posterior density.
• Warped bivariate Gaussian density (the banana-shaped target; Figure 2, middle).
• Bimodal density (Figure 2, right).

Both the SwISS and the consensus algorithms are guaranteed to be exact in the case of merging Gaussian posterior samples, and both can still work well for a variety of non-Gaussian posteriors. However, one of the drawbacks of the consensus algorithm is that averaging across batches of sub-posterior samples can remove posterior features such as skewness and multi-modality, as illustrated in Figure 2.
Figure 2 shows posterior density plots for each of the three models, where full MCMC has been used to provide a ground-truth approximation of the full data posterior. The consensus Monte Carlo and SwISS approximations are based on combining samples from B = 10 sub-posterior and inflated sub-posterior approximations, respectively. The results from these three test cases show that the consensus algorithm struggles to approximate the full data posterior when the target density exhibits non-Gaussian behaviour. The SwISS algorithm, which uses affine transformations of the inflated sub-posterior samples rather than averaging, can produce reliable approximations when the posterior is significantly non-Gaussian.

Scalability with parameter dimension
Typically, divide-and-conquer methods are advertised for use with tall data, i.e. a large number of observations and up to a moderate number of parameters. Here, we test the accuracy and computational speed of the merging algorithms as the number of parameters, d, grows. Each sub-posterior is Gaussian, with expectation and variance drawn respectively from Gaussian and inverse-Wishart distributions; with Gaussian sub-posteriors, the full data posterior is tractable. For each experiment, J = 5,000 samples were drawn from each sub-posterior and inflated sub-posterior, and the following set of dimensions was used: d ∈ {5, 10, 20, 40, 80}. Figure 3 shows that both the consensus Monte Carlo algorithm and SwISS perform well with increasing dimension (as measured by integrated absolute distance) and are both computationally efficient. The semi-parametric KDE, Gaussian barycenter and average re-centring approaches display reduced accuracy, and only SwISS and consensus are robust to increasing the dimension of the parameter space. In terms of the computational cost required to merge the posterior samples, all approaches are generally fast, with the exception of the semi-parametric KDE approach.

Linear Mixed Effects Model
A natural way to extend the simple linear model is to introduce both fixed and random effects. This extension can be particularly useful when data exhibit a hierarchical dependency structure, for example, to cluster student test scores based on classroom. Let y_{i,j} ∈ Y ⊆ R (for i = 1, ..., n_j and j = 1, ..., n) be the response variable, where n_j is the number of observations for group j. The fixed and random effect covariates are x_{i,j} ∈ X ⊆ R^p and z_j ∈ Z ⊆ R^r, respectively, and are related to the response variable by

    y_{i,j} | β, α_j ∼ Logistic(x_{i,j}^T β + z_j^T α_j, 1),  α_j ∼ N_r(0, Σ),

where β ∈ R^p and α_j ∈ R^r are the fixed and random effect model coefficients and the Logistic(a, 1) distribution has a cumulative distribution function of 1/(1 + exp[−(x − a)]). Our parameters of interest are then θ = (β, α, Σ), where Σ represents the variance of the random effects. We assume an inverse-Wishart distribution for the prior of Σ, Σ ∼ W^{-1}(ν, S), with ν = 5 and S = 5 I_r, and a priori we assumed β ∼ N_p(0, 1000 I_p).
We simulated a dataset containing 200,000 observations. We set the number of groups to n = 2000 and the number of observations for each group to n_j = 100, for j = 1, ..., n. The number of parameters for the fixed effects was set to p = 10, with β_{0,i} = (−1)^{i−1}. The number of parameters for each random effect was set to r = 2, and the α_j were simulated independently from a N_r(0, Σ_0) distribution. We included an intercept term, that is x_{i,j,1} = 1 for all i, j; otherwise x_{i,j} and z_j were simulated from independent Bernoulli(0.5) distributions.
The data were randomly partitioned into B = 10 batches by group, so that each group belonged to only one batch. This was necessary since divide-and-conquer methods assume independence between the batches. With a Gaussian observation model for the y_{i,j}, marginalisation over all of the random effects would be tractable; the logistic observation model necessitates the use of a sampling scheme such as MCMC.
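The by-group partitioning can be sketched as follows; the group counts mirror the simulation above, while the labelling mechanics are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Partition hierarchical data into B batches *by group*, so that all
# observations sharing a random effect fall in the same batch and the
# batch likelihoods remain independent.
n_groups, B = 2000, 10
# Assign each group a batch label, equal batch sizes, random allocation.
batch_of_group = rng.permutation(np.repeat(np.arange(B), n_groups // B))
batches = [np.flatnonzero(batch_of_group == b) for b in range(B)]

print([len(b) for b in batches] == [200] * 10)  # True: equal-sized batches
print(sum(len(b) for b in batches))             # 2000: each group used once
```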
We used the STAN software to sample from the full posterior and sub-posteriors, generating J = 5,000 MCMC samples after an initial burn-in of 1,000 samples. Table 1 gives the discrepancy measures for each of the merging algorithms, averaged over random partitions of the data. The results show that all algorithms perform well, with the exception of AR and the Gaussian barycenter, both on the Mahalanobis metric. The SwISS and consensus algorithms are robust across the range of metrics.

Logistic Regression Model
Logistic regression is a popular technique for modelling binary data, i.e. y_i ∈ {0, 1}. Features x_i ∈ R^d, also known as covariates, that can indicate the classification outcome are mapped onto the binary observations using a logit transformation, where the outcome probability, P(y_i = 1) = exp(x_i^T θ)/(1 + exp(x_i^T θ)), is the success probability of a Bernoulli random variable. Our parameter of interest, θ ∈ R^d, is the vector of coefficients.
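For a single batch, the inflated sub-posterior log-density that each parallel sampler would target can be written down directly from (4): B times the batch log-likelihood plus the full log-prior. The sketch below assumes a Gaussian prior for illustration; the function name and prior variance are our choices, not from the experiments.

```python
import numpy as np

def log_inflated_subposterior(theta, Xb, yb, B, prior_var=1000.0):
    """Log-density (up to an additive constant) of the inflated
    sub-posterior for a logistic regression batch: batch log-likelihood
    raised to the power B (i.e. multiplied by B on the log scale) plus the
    full N(0, prior_var * I) log-prior. Illustrative sketch only."""
    eta = Xb @ theta
    # Bernoulli log-likelihood with a numerically stable log(1 + exp(eta)).
    loglik = np.sum(yb * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * np.sum(theta ** 2) / prior_var
    return B * loglik + logprior

# At theta = 0 every success probability is 1/2, so the batch
# log-likelihood is -len(yb) * log(2) regardless of the responses.
Xb = np.ones((10, 3))
yb = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
val = log_inflated_subposterior(np.zeros(3), Xb, yb, B=5)
print(np.isclose(val, -5 * 10 * np.log(2)))  # True
```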
We consider two data sets. The first is a synthetic data set designed to simulate a scenario with rare but highly informative features, similar to the one given in Scott et al. (2016). We simulate N = 100,000 data points with d = 5 binary features, with relative frequencies of x_i = 1 of (1, 0.02, 0.03, 0.05, 0.001); the corresponding true parameter values are θ_0 = (−3, 1.2, −0.5, 0.8, 3). The rarely occurring final feature can lead to largely differing variances across the sub-posteriors. For our experiments, we split the data equally across B = 25 batches.
We also consider a real-world data set: the Hepmass data set from high-energy particle physics, where the response is an indicator for whether a signal was indicative of an exotic particle being present, as opposed to background noise. In each of our experiments, the data were repeatedly partitioned n_runs = 5 times, with a Monte Carlo average of the discrepancy metrics given in Table 2. The STAN (Carpenter et al., 2017) software, which implements an automatically-tuned version of Hamiltonian Monte Carlo sampling, was used as the MCMC sampler and applied to the full posterior and sub-posteriors for each experiment. Each sampler drew J = 10,000 samples after a burn-in of 1,000 iterations.
The results in Table 2 show that SwISS and the consensus algorithm outperform all of the others on the simulated data, whereas for the real example all of the methods work well. The AR algorithm performs especially poorly on the synthetic data example: the variance varies substantially across sub-posteriors, and the AR algorithm does not correct for this when the sub-posterior samples are merged. The similarity of performance on the Hepmass data could be due to the sub-posteriors all being close to Gaussian; we would expect both consensus and SwISS to work well in this setting, as they are exact for Gaussian sub-posteriors and much faster to apply than nonparametric methods such as SKDE.

Conclusions
We have introduced a new method to merge posterior samples generated in parallel on independent batches of data. Our algorithm, SwISS, is fast, scalable to high-dimensional settings, and accurate on a variety of test cases. The SwISS algorithm, like the consensus Monte Carlo algorithm, is simple to apply and competitive against popular alternative divide-and-conquer algorithms. SwISS also has the advantage that it does not require hyper-parameter tuning and is faster to apply than many of the alternative divide-and-conquer algorithms in the literature. We have provided theoretical support for our choice of affine transformations and shown that SwISS is exact in the case of merging inflated Gaussian sub-posteriors. Code to recreate this work is available on GitHub: https://github.com/CJohnVyner/SwISS

Figure 1 :
Figure 1: A visual representation of the SwISS algorithm applied to a two-dimensional Gaussian with three sub-posteriors (blue) approximating the full posterior (black). First, the batch samples are shifted by their respective means (θ_b − µ_b), then scaled by the matrix A_b, before finally being shifted by the global mean µ.

Require:
{θ_b^{(j)}}_{j=1}^J — J Monte Carlo samples from each of the B inflated sub-posteriors. Calculate the mean and variance of each of the inflated sub-posteriors: for b ∈ {1, ..., B} do µ_b ← mean[θ_b^{(1:J)}] and V_b ← var[θ_b^{(1:J)}].

Figure 2 :
Figure 2: Density plots showing the posterior reconstructions using the SwISS algorithm against the consensus algorithm on a rare Bernoulli target (left), warped Gaussian (also known as the banana-shaped target) (middle) and bi-modal target (right). In all cases the x-axis is θ_1; for the left plot the y-axis is density, and for the other two plots it is θ_2.

Figure 3 :
Figure 3: The left plot shows the integrated absolute distance for each method for a different number of parameters. The right plot shows the mean time each combination method took to obtain the samples from an approximate posterior.

Table 1 :
Discrepancy measures for the linear mixed effects model on the simulated dataset. These measures were averaged over 5 random partitions of the data. Estimated standard errors are given in brackets.

Table 2 :
Discrepancy measures for the logistic regression model with simulated and Hepmass data sets. Each metric was averaged over 5 runs. Estimated standard errors are given in brackets. The Hepmass data set has 27 real features, which we augmented with an intercept term to give d = 28 parameters; the full data set contains 10.5 million responses, of which we considered the first N = 100,000, split across B = 20 batches.