Accelerating MCMC algorithms

Markov chain Monte Carlo algorithms are used to simulate from complex statistical distributions by way of a local exploration of these distributions. This local feature avoids heavy requirements on the understanding of the nature of the target, but it also potentially induces a lengthy exploration of this target, with a requirement on the number of simulations that grows with the dimension of the problem and with the complexity of the data behind it. Several techniques are available for accelerating the convergence of these Monte Carlo algorithms, either at the exploration level (as in tempering, Hamiltonian Monte Carlo, and partly deterministic methods) or at the exploitation level (with Rao–Blackwellization and scalable methods). This article is categorized under:
Statistical and Graphical Methods of Data Analysis > Markov Chain Monte Carlo (MCMC)
Algorithms and Computational Methods > Algorithms
Statistical and Graphical Methods of Data Analysis > Monte Carlo Methods


| INTRODUCTION
Markov chain Monte Carlo (MCMC) algorithms have been used for nearly 60 years and became a reference method for analyzing complex Bayesian models in the early 1990s (Gelfand & Smith, 1990). The strength of this method is that it guarantees convergence to the quantity (or quantities) of interest with minimal requirements on the targeted distribution (also called the target) behind such quantities. In that sense, MCMC algorithms are robust or universal, as opposed to the most standard Monte Carlo methods (e.g., Rubinstein, 1981; Robert & Casella, 2004), which require direct simulations from the target distribution. This robustness may, however, induce a slow convergence behavior in that the exploration of the relevant space (meaning the part of the space supporting the distribution that has a significant probability mass under that distribution) may take a long while, as the simulation usually proceeds by local jumps in the vicinity of the current position. In other words, MCMC, especially in its off-the-shelf versions like Gibbs sampling and Metropolis-Hastings (MH) algorithms, is very often myopic in that it provides a good illumination of a local area while remaining unaware of the global support of the distribution. As with most other simulation methods, there always exist ways of creating highly convergent MCMC algorithms by taking further advantage of the structure of the target distribution. Here, we mostly limit ourselves to the realistic situation where the target density is only known as the output of a computer code, or to settings similarly limited in their information content.
The approaches to the acceleration of MCMC algorithms can be divided into several categories, from those that improve our knowledge about the target distribution, to those that modify the proposal in the algorithm, to those that better exploit the output of the original MCMC algorithm. The following sections provide more details about these directions and the solutions proposed in the literature.

| What is MCMC and why does it need accelerating?
MCMC methods have a history (e.g., Cappé & Robert, 2000) that starts at approximately the same time as Monte Carlo methods, in conjunction with the conception of the first computers. They have been devised to handle the simulation of complex target distributions, when complexity stems from the shape of the target density, the size of the associated data, the dimension of the object to be simulated, or from time requirements. For instance, the target density π(θ) may happen to be expressed in terms of multiple integrals that cannot be solved analytically, which requires the simulation of the entire vector (θ, ξ). In cases when ξ is of the same dimension as the data, as for instance in latent variable models, this significant increase in the dimension of the object to be simulated creates computational difficulties for standard Monte Carlo methods, from managing the new target ω(θ, ξ) to devising a new and efficient simulation algorithm. An MCMC algorithm allows for an alternative resolution of this computational challenge by simulating a Markov chain that explores the space of interest (and possibly supplementary spaces of auxiliary variables) without requiring a deep preliminary knowledge of the density π, besides the ability to compute π(θ_0) for a given parameter value θ_0 (possibly up to a normalizing constant) and possibly the gradient ∇ log π(θ_0). The validation of the method (e.g., Robert & Casella, 2004) is that the Markov chain is ergodic (e.g., Meyn & Tweedie, 1993), namely that it converges in distribution to the distribution with density π, no matter where the Markov chain is started at time t = 0. The Metropolis-Hastings algorithm is a generic illustration of this principle. The basic algorithm is constructed by choosing a proposal, that is, a conditional density K(θ'|θ) (also known as a Markov kernel), the Markov chain {θ_t}_{t=1}^∞ being then derived by successive simulations of the transition: given θ_t, propose θ' ~ K(θ'|θ_t) and accept θ_{t+1} = θ' with probability min{1, [π(θ') K(θ_t|θ')] / [π(θ_t) K(θ'|θ_t)]}, setting θ_{t+1} = θ_t otherwise.
This acceptance-rejection feature of the algorithm makes it appropriate for targeting π as its stationary distribution, provided the resulting Markov chain {θ_t}_{t=1}^∞ is irreducible, that is, has a positive probability of visiting any region of the support of π in a finite number of iterations. (Stationarity can easily be shown, e.g., by using the so-called detailed balance property that makes the chain time-reversible; see, e.g., Robert & Casella, 2004.) Considering the initial goal of simulating samples from the target distribution π, the performances of MCMC methods like the Metropolis-Hastings algorithm above often vary quite a lot, depending primarily on the correspondence between the proposal K and the target π. For instance, if K(θ|θ_t) = π(θ), the Metropolis-Hastings algorithm reduces to i.i.d. sampling from the target, which is of course only a formal option when i.i.d. sampling from π proves impossible to implement. Although there exist rare instances when the Markov chain {θ_t}_{t=1}^∞ exhibits negative correlations between the successive terms of the chain, making it more efficient than regular i.i.d. sampling (Liu, Wong, & Kong, 1995), the most common occurrence is one of positive correlation between the simulated values (sometimes uniformly, see Liu, Wong, & Kong, 1994). This feature implies a reduced efficiency of the algorithm and hence requires a larger number of simulations to achieve the same precision as an approximation based on i.i.d. simulations (without accounting for differences in computing time). More generally, an MCMC algorithm may require a large number of iterations to escape the attraction of its starting point θ_0 and to reach stationarity, to the extent that some versions of such algorithms fail to converge in the time available (i.e., in practice if not in theory).
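As a concrete illustration of the transition just described, here is a minimal random-walk Metropolis-Hastings sketch (our own toy construction; the function name and tuning values are illustrative), targeting a standard Gaussian density known only up to its normalizing constant:

```python
import math
import random

def metropolis_hastings(log_target, theta0, scale=1.0, n_iter=10_000, seed=42):
    """Random-walk Metropolis-Hastings: propose theta' ~ N(theta_t, scale^2)
    and accept with probability min(1, pi(theta') / pi(theta_t))."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, scale)
        # The proposal is symmetric, so the acceptance ratio reduces to the
        # ratio of (unnormalized) target densities, compared on the log scale.
        if math.log(rng.random()) < log_target(prop) - log_target(theta):
            theta = prop
        chain.append(theta)
    return chain

# Target: standard normal, specified only through an unnormalized log-density
chain = metropolis_hastings(lambda t: -0.5 * t * t, theta0=3.0)
```

Note that only the log-target ratio is ever needed, which is why the normalizing constant of π can remain unknown.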
It thus makes sense to seek ways of accelerating (a) the convergence of a given MCMC algorithm to its stationary distribution, (b) the convergence of a given MCMC estimate to its expectation, and/or (c) the exploration by a given MCMC algorithm of the support of the target distribution. These goals are related but still distinct. For instance, a chain initialized by simulating from the target distribution may still fail to explore the whole support in an acceptable number of iterations. While there is no optimal and universal solution to this issue, we discuss below approaches that are as generic as possible, as opposed to artificial ones taking advantage of the mathematical structure of a specific target distribution. Ideally, we aim at covering realistic situations where the target density is only known (up to a constant or an additional completion step) as the output of an existing computer code. Pragmatically, we also cover solutions that require more effort and calibration steps when they apply to a wide enough class of problems.
1.2 | Accelerating MCMC by exploiting the geometry of the target

While there is no end to the search for more efficient and faster MCMC algorithms, and while this (endless) goal needs to account for the cost of devising such alternatives under limited resource budgets, there exist several generic solutions in which the geometry (or topology) of the target density is first explored before the algorithm is constructed. Although this type of method somehow takes us away from our original purpose, which was to improve upon an existing algorithm, it still makes sense within this survey in that it allows for almost automated implementations.

| Hamiltonian Monte Carlo
From the point of view of this review, Hamiltonian (or hybrid) Monte Carlo (HMC) is an auxiliary variable technique that takes advantage of a continuous-time Markov process to sample from the target π. This approach comes from physics (Duane, Kennedy, Pendleton, & Roweth, 1987) and was popularized in statistics by Neal (1999, 2011) and MacKay (2002). Given a target π(θ), where θ ∈ R^d, an artificial auxiliary variable ϑ ∈ R^d is introduced along with a density ϖ(ϑ|θ) so that the joint distribution of (θ, ϑ) enjoys π(θ) as its marginal. While there is complete freedom in this representation, the HMC literature often calls ϑ the momentum of a particle located at θ by analogy with physics. HMC is based on the representation of the joint distribution as

ω(θ, ϑ) ∝ exp{−H(θ, ϑ)},

where H(·) is called the Hamiltonian, and is associated with the continuous-time process (θ_t, ϑ_t) generated by the so-called Hamiltonian equations

dθ_t/dt = ∂H/∂ϑ (θ_t, ϑ_t),   dϑ_t/dt = −∂H/∂θ (θ_t, ϑ_t),

which keep the Hamiltonian target stable over time, as

dH(θ_t, ϑ_t)/dt = ∂H/∂θ · dθ_t/dt + ∂H/∂ϑ · dϑ_t/dt = 0.
Obviously, the above continuous-time Markov process is deterministic and only explores a given level set of H, instead of the whole augmented state space R^{2d}, which induces an issue with irreducibility. An acceptable solution to this problem is to refresh the momentum, ϑ_t ~ ϖ(ϑ|θ_{t−}), at random times {τ_n}_{n=1}^∞, where θ_{t−} denotes the location of θ immediately prior to time t, and the random durations {τ_n − τ_{n−1}}_{n=2}^∞ follow an exponential distribution. By construction, the continuous-time Hamiltonian Markov process can be regarded as a specific piecewise deterministic Markov process using Hamiltonian dynamics (Davis, 1984, 1993; Bou-Rabee et al., 2017), and our target, π, is the marginal of its associated invariant distribution.
Before moving to the practical implementation of the concept, let us point out that the free cog in the machinery is the conditional density ϖ(ϑ|θ), which is usually chosen as a Gaussian density, with either a constant covariance matrix M corresponding to the target covariance, or a local curvature depending on θ as in Riemannian HMC (Girolami & Calderhead, 2011). Betancourt (2017) argues in favor of these two cases against non-Gaussian alternatives, and Livingstone, Faulkner, and Roberts (2017) analyze how different choices of kinetic energy in HMC affect algorithm performances. For a fixed covariance matrix M, the Hamiltonian equations become

dθ_t/dt = M^{−1} ϑ_t,   dϑ_t/dt = ∇ log π(θ_t),

where ∇ log π is the score function. The velocity (or momentum) of the process is thus driven by this score function, the gradient of the log-target. The above description remains quite conceptual in that there is no generic methodology for producing this continuous-time process exactly, since the Hamiltonian equations cannot be solved exactly in most cases. Furthermore, standard numerical solvers like Euler's method create an unstable approximation that induces a bias as the process drifts away from its true trajectory. There exists, however, a discretization technique that produces a Markov chain and is well-suited to the Hamiltonian equations, in that it preserves the stationary distribution (Betancourt, 2017). It is called the symplectic integrator, and one version in the independent case with constant covariance M consists in the following (so-called leapfrog) steps:

ϑ_{t+ϵ/2} = ϑ_t + (ϵ/2) ∇ log π(θ_t),
θ_{t+ϵ} = θ_t + ϵ M^{−1} ϑ_{t+ϵ/2},
ϑ_{t+ϵ} = ϑ_{t+ϵ/2} + (ϵ/2) ∇ log π(θ_{t+ϵ}),

where ϵ is the time-discretization step. Using a proposal on ϑ_0 drawn from the Gaussian auxiliary target and deciding on the acceptance of the value of (θ_{Tϵ}, ϑ_{Tϵ}) by a Metropolis-Hastings step can limit the danger of missing the target.
Note that the first two leapfrog steps induce a Langevin move on θ_t, thus connecting with the Metropolis-adjusted Langevin algorithm (MALA) discussed below (see Durmus and Moulines, 2017 for a theoretical discussion of the optimal choice of ϵ). The leapfrog integrator is quite an appealing middle ground between accuracy (as it is second-order accurate) and computational efficiency. In practice, it is important to note that discretizing the Hamiltonian dynamics introduces two free parameters, the step size ϵ and the trajectory length Tϵ, both to be calibrated. As an empirically successful and popular variant of HMC, the "no-U-turn sampler" (NUTS) of Hoffman and Gelman (2014) adapts the value of ϵ based on primal-dual averaging. It also eliminates the need to choose the trajectory length T via a recursive algorithm that builds a set of candidate proposals for a number of forward and backward leapfrog steps and stops automatically when the simulated path starts to double back on itself.
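The leapfrog scheme above, combined with a momentum refreshment and a Metropolis-Hastings correction, can be sketched as follows. This is a minimal illustrative HMC step of our own (unit mass matrix M = I, one-dimensional Gaussian target, hand-picked ϵ and T), not a reference implementation:

```python
import math
import random

def leapfrog(theta, v, grad_log_pi, eps, T):
    """T leapfrog steps with M = I: half step on the momentum,
    full step on the position, half step on the momentum again."""
    for _ in range(T):
        v = v + 0.5 * eps * grad_log_pi(theta)
        theta = theta + eps * v
        v = v + 0.5 * eps * grad_log_pi(theta)
    return theta, v

def hmc_step(theta, log_pi, grad_log_pi, eps=0.1, T=20, rng=random):
    v0 = rng.gauss(0.0, 1.0)                      # refresh the momentum
    theta_new, v_new = leapfrog(theta, v0, grad_log_pi, eps, T)
    # Hamiltonian H = -log pi(theta) + v^2 / 2; accept with prob min(1, e^{-dH})
    h0 = -log_pi(theta) + 0.5 * v0 * v0
    h1 = -log_pi(theta_new) + 0.5 * v_new * v_new
    if math.log(rng.random()) < h0 - h1:
        return theta_new
    return theta

rng = random.Random(1)
theta, chain = 2.0, []
for _ in range(5000):
    theta = hmc_step(theta, lambda t: -0.5 * t * t, lambda t: -t, rng=rng)
    chain.append(theta)
```

Because the leapfrog integrator nearly conserves H for this target and step size, almost all proposals are accepted while the chain moves a full trajectory length at each iteration.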
A further acceleration step in this area is proposed by Rasmussen (2003) (see also Fielding, Nott, & Liong, 2011), namely the replacement of the exact target density π(·) by an approximation π̂(·) that is much faster to compute in the many iterations of the HMC algorithm. A generic way of constructing this approximation is to rely on Gaussian processes, interpreted as prior distributions on the target density π(·), which is only observed at some values of θ, namely π(θ_1), …, π(θ_n) (Rasmussen and Williams, 2005). This solution speeds up the algorithm, possibly by orders of magnitude, but it introduces a further approximation into the Monte Carlo approach, even when the true target is used at the end of the leapfrog discretization, as in Fielding et al. (2011).
Stan (named after Stanislaw Ulam, see Carpenter et al., 2017) is a computer language for Bayesian inference that, among other techniques, implements the NUTS algorithm to remove hand-tuning. More precisely, Stan is a probabilistic programming language in that the input is at the level of a statistical model, along with data, rather than the specifics of an MCMC algorithm. The algorithmic part is somewhat automated, meaning that when models can be conveniently defined through this language, it offers an alternative to the sampler that produced the original chain. As an illustration of the acceleration brought by HMC, Figure 1, reproduced from Hoffman and Gelman (2014), shows the performance of NUTS, compared with both random-walk MH and Gibbs samplers.

| Accelerating MCMC by breaking the problem into pieces
The explosion in the collection and analysis of "big" data sets in recent years has brought new challenges to the MCMC algorithms that are used for Bayesian inference. When examining whether or not a new proposed sample is accepted at the accept-reject step, an MCMC algorithm such as the Metropolis-Hastings version needs to sweep over the whole data set, at each and every iteration, to evaluate the likelihood function. MCMC algorithms are then difficult to scale up, which strongly hinders their application in big data settings. In some cases, the data sets may be too large to fit on a single machine. It may also be that confidentiality requirements force different databases to sit on separate networks, with the possible added burden of encrypted data (Aslett, Esperança, & Holmes, 2015). Communication between the separate machines may prove impossible on an MCMC scale that involves thousands or hundreds of thousands of iterations.

FIGURE 1 Comparison between the random-walk Metropolis-Hastings, Gibbs sampling, and NUTS algorithms on a highly correlated 250-dimensional multivariate Gaussian target. Similar computation budgets are used for all methods to produce the 1,000 samples on display. Source: Hoffman and Gelman (2014)

1.5 | Scalable MCMC methods

In recent years, efforts have been made to design scalable algorithms, namely, solutions that manage to handle large-scale targets by breaking the problem into manageable or scalable pieces. Roughly speaking, these methods can be classified into two categories (Bardenet, Doucet, & Holmes, 2015): divide-and-conquer approaches and subsampling approaches.
Divide-and-conquer approaches partition the whole data set, denoted X, into batches, {X_1, …, X_k}, and run separate MCMC algorithms on each data batch, independently, as if they were independent Bayesian inference problems. 1 These methods then combine the simulated parameter outcomes together to approximate the original posterior distribution. Depending on the treatment of the batches selected in the MCMC stages, these approaches can be further subdivided into two finer groups: subposterior methods and boosted subposterior methods. Subposterior methods are motivated by the independent product equation

π(θ | X) ∝ ∏_{i=1}^k π(θ)^{1/k} ℓ(θ | X_i) =: ∏_{i=1}^k π_i(θ),

where ℓ denotes the likelihood, and they target the subposterior densities π_i(θ) (up to a constant) in their respective MCMC steps. They thus bypass communication costs (Scott et al., 2016) by running MCMC samplers independently on each batch, and they most often increase MCMC mixing rates (in effective sample size produced per second), given that the subposterior distributions π_i(θ) are based on smaller data sets. For instance, Scott et al. (2016) combine the samples from the subposteriors, π_i(θ), by a Gaussian reweighting. Neiswanger, Wang, and Xing (2013) estimate the subposteriors π_i(θ) by nonparametric and semi-parametric methods, and they run additional MCMC samplers on the product of these estimators toward approximating the true posterior π(θ). Wang and Dunson (2013) refine this product estimator with an additional Weierstrass sampler, while Wang, Guo, Heller, and Dunson (2015) estimate the posterior by partitioning the space of samples with step functions.
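As a hedged sketch of the combination stage of such subposterior methods, the following toy example (our own construction, with simulated one-dimensional Gaussian subposterior draws) mimics a consensus-style Gaussian reweighting in the spirit of Scott et al. (2016), weighting each subposterior by the inverse of its empirical variance:

```python
import random

rng = random.Random(0)

def draws(mu, s, n=5000):
    """Stand-in for the MCMC output on one batch: N(mu, s^2) draws."""
    return [rng.gauss(mu, s) for _ in range(n)]

# Three batches whose subposteriors disagree in location and spread
sub = [draws(0.8, 1.0), draws(1.2, 2.0), draws(1.0, 1.5)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Gaussian reweighting: each subposterior is weighted by its inverse
# empirical variance, and the j-th consensus draw is the weighted average
# of the j-th draws across batches.
w = [1.0 / variance(s_i) for s_i in sub]
consensus = [
    sum(wi * s_i[j] for wi, s_i in zip(w, sub)) / sum(w)
    for j in range(len(sub[0]))
]
```

In the Gaussian case this weighted average reproduces the precision-weighted combination of the subposteriors; outside the Gaussian case it is only an approximation.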
As an alternative to sampling from the subposteriors, boosted subposterior methods instead target the components π̃_i(θ) ∝ π_i(θ)^k in separate MCMC runs. Since they formally amount to repeating each batch k times toward producing pseudo data sets of the same size as the true one, the resulting boosted subposteriors, π̃_1(θ), …, π̃_k(θ), have the same scale in variance for each component of the parameter θ as the true posterior, and can thus be treated as a group of estimators of the true posterior. In the subsequent combining stage, these subposteriors are merged together to construct a better approximation of the target distribution. For instance, Minsker, Srivastava, Lin, and Dunson (2014) approximate the posterior with the geometric median of the boosted subposteriors, embedding them into associated reproducing kernel Hilbert spaces, while Srivastava, Cevher, Dinh, and Dunson (2015) achieve this goal using the barycenters of π̃_1, …, π̃_k, these barycenters being computed with respect to a Wasserstein distance.
In a perspective different from the above parallel scheme of divide-and-conquer approaches, subsampling approaches aim at reducing the number of individual datapoint likelihood evaluations operated at each iteration toward accelerating MCMC algorithms. From a general perspective, these approaches can be further classified into two finer classes, exact subsampling methods and approximate subsampling methods, depending on their resulting outputs. Exact subsampling approaches typically require subsets of the data of random size at each iteration. One solution to this effect is to take advantage of pseudo-marginal MCMC by constructing unbiased estimators of the target density evaluated on subsets of the data (Andrieu & Roberts, 2009). Quiroz, Villani, and Kohn (2016) follow this direction by combining the powerful debiasing technique of Rhee and Glynn (2015) and the correlated pseudo-marginal MCMC approach of Deligiannidis, Doucet, and Pitt (2015). Another direction is to use piecewise deterministic Markov processes (PDMP) (Davis, 1984, 1993), which enjoy the target distribution as the marginal of their invariant distribution. This PDMP version requires unbiased estimators of the gradients of the log-likelihood function, instead of the likelihood itself. By using a tight enough bound on the event rate function of the associated Poisson processes, PDMP can produce super-efficient scalable MCMC algorithms. The bouncy particle sampler and the zig-zag sampler (Bierkens, Fearnhead, & Roberts, 2016) are two competing PDMP algorithms, and later work unifies and extends these two methods.
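The pseudo-marginal principle mentioned above can be sketched in a few lines: the Metropolis-Hastings ratio is evaluated with an unbiased, nonnegative, noisy estimator of the unnormalized target, and the current estimate is recycled until the next acceptance. This is an illustrative toy of our own (the Gaussian target and the mean-one lognormal noise are arbitrary choices standing in for a subsampled likelihood estimator), yet the chain still targets π exactly:

```python
import math
import random

rng = random.Random(9)

def pi_hat(t):
    """Unbiased, noisy estimator of an unnormalized N(0, 1) density:
    the true density times an independent lognormal with mean one
    (exp(mu + sigma^2 / 2) = exp(-0.045 + 0.045) = 1)."""
    z = math.exp(rng.gauss(-0.045, 0.3))
    return math.exp(-0.5 * t * t) * z

x = 0.0
u = pi_hat(x)          # the current estimate is kept until the next acceptance
chain = []
for _ in range(20_000):
    prop = x + rng.gauss(0.0, 1.0)
    v = pi_hat(prop)
    # MH acceptance uses only the estimates, never the exact density
    if rng.random() * u < v:
        x, u = prop, v
    chain.append(x)
```

The key point is that recycling u (rather than re-estimating it) is what makes the chain exact for the marginal target, at the price of occasional sticky stretches when u is overestimated.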
One should also note that PDMP produces a non-reversible Markov chain, which suggests that the algorithm can be more efficient in terms of mixing rate and asymptotic variance, when compared with reversible MCMC algorithms such as MH, HMC, and MALA, as observed in some theoretical and experimental works (Bierkens, 2016; Chen & Hwang, 2013; Hwang, Hwang-Ma, & Sheu, 1993; Sun, Gomez, & Schmidhuber, 2010).
Approximate subsampling approaches aim at constructing an approximation of the target distribution. Besides the aforementioned attempts of Rasmussen (2003) and Fielding et al. (2011), one direction is to approximate the acceptance probability with high accuracy by using subsets of the data (Bardenet et al., 2015; Bardenet, Doucet, & Holmes, 2014). Another solution is based on a direct modification of exact methods. The seminal work of Welling and Teh (2011), stochastic gradient Langevin dynamics (SGLD), exploits the Langevin diffusion

dθ_t = (1/2) Λ ∇ log π(θ_t) dt + Λ^{1/2} dB_t,

where Λ is a user-specified matrix, π is the target distribution, and B_t is a d-dimensional Brownian motion. By virtue of the Euler-Maruyama discretization and the use of unbiased estimators of the gradient of the log-target density, SGLD and its variants (Chen, Fox, & Guestrin, 2014; Ding et al., 2014) often produce fast and accurate results in practice when compared with MCMC algorithms using MH steps. Figure 2 shows the time requirements of a consensus Monte Carlo algorithm (Scott et al., 2016) compared with a Metropolis-Hastings algorithm using the whole data set, while Figure 3 displays the savings in likelihood evaluations achieved by the confidence sampler of Bardenet et al. (2015).
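A minimal SGLD sketch under simplifying assumptions (our own toy: scalar θ, Λ = 1, a Gaussian model with known unit variance and a N(0, 10^2) prior, illustrative step and batch sizes) reads as follows. Note that SGLD drops the Metropolis-Hastings correction, so the minibatch gradient noise slightly inflates the stationary variance:

```python
import math
import random

rng = random.Random(3)
N = 1000
data = [rng.gauss(2.0, 1.0) for _ in range(N)]   # x_i ~ N(mu, 1), true mu = 2

def sgld(data, eps=1e-4, n_batch=50, n_iter=4000):
    """Stochastic gradient Langevin dynamics (Welling & Teh, 2011) for the
    posterior on mu, with an unbiased minibatch estimate of the gradient."""
    theta = 0.0
    chain = []
    n_full = len(data)
    for _ in range(n_iter):
        batch = rng.sample(data, n_batch)
        grad_prior = -theta / 100.0                    # d/dtheta log N(0, 10^2)
        # Rescaled minibatch sum: unbiased for the full log-likelihood gradient
        grad_lik = (n_full / n_batch) * sum(x - theta for x in batch)
        # Euler-Maruyama step: drift eps/2 * gradient, noise N(0, eps)
        theta += 0.5 * eps * (grad_prior + grad_lik) \
                 + rng.gauss(0.0, math.sqrt(eps))
        chain.append(theta)
    return chain

chain = sgld(data)
```

Each iteration touches only 50 of the 1,000 observations, which is the source of the computational savings in genuinely large data sets.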

| Parallelization and distributed schemes
Modern computational architectures are built with several computing units that allow for parallel processing, either fully independently or with some communication. Although the Markovian nature of MCMC is inherently sequential and somewhat alien to the notion of parallelization, several partial solutions have been proposed in the literature for exploiting these parallel architectures. The simplest approach consists in running several MCMC chains in parallel, each blind to all others, until the allotted computing time is exhausted, and finally averaging the resulting estimators over all chains. However, this naive implementation may suffer from the fact that some of those chains have not reached their stationary regime by the end of the computation time, which then induces a bias in the resulting estimate. Ensuring that stationarity has been achieved is a difficult (if at all possible) task, although several approaches can be found in the literature (Guihenneuc-Jouyaux & Robert, 1998; Jacob, O'Leary, & Atchadé, 2017; Mykland, Tierney, & Yu, 1995). At the opposite extreme, complex targets may be represented as products involving many terms that must all be evaluated, each of which can be attributed to a different thread before being multiplied together; this strategy requires communication among processors at each MCMC step. In middle-ground versions (e.g., Jacob, Robert, & Smith, 2011), chains periodically share information, possibly through joint MCMC steps, thus allowing an efficient combination of global (coordinated) exploration and local approximation. O-MCMC methods also allow for a parallel implementation of the Multiple Try Metropolis. In Calderhead (2014), a generalization of the Metropolis-Hastings algorithm allows for a straightforward parallelization, in which each proposed point can be evaluated in a different processor at every MCMC iteration.
Finally, note that the section on scalable MCMC also contains parallelizable approaches, such as the prefetching method of Angelino, Kohler, Waterland, Seltzer, and Adams (2014) (see also Banterle, Grazian, Lee, & Robert, 2015 for a related approach, primarily based on an approximation of the target). A more recent endeavor called asynchronous MCMC (Terenin, Simpson, & Draper, 2015) aims at higher gains in parallelization by reducing the amount of exchange between the parallel threads, but the notion has not spread widely at this stage.

| Accelerating MCMC by improving the proposal
In the same spirit as the previous section, this section stretches the purpose of this paper by considering possible modifications of the MCMC algorithm itself, rather than merely exploiting the output of a given MCMC algorithm. For instance, devising an HMC algorithm is one such modification, even though the "improvement" is not guaranteed. Nonetheless, our argument here is that, once provided with this output, it is possible to derive new proposals in a semi-autonomous manner.

| Simulated tempering
The target distribution, π(θ), on the d-dimensional state space Θ, can exhibit multimodality, with the probability mass located in different regions of the state space. The majority of MCMC algorithms use a localized proposal mechanism that is tuned toward local approximate optimality; see, for example, Roberts, Gelman, and Gilks (1997) and Roberts and Rosenthal (2001). By construction, these localized proposals can result in the Markov chain becoming "trapped" in a subset of the state space, meaning that in finite run-time the chain can entirely fail to explore other modes of the target, leading to biased samples. Strategies to accelerate MCMC often use local gradient information, which draws the chain back toward the center of the current mode, the opposite of what is required in a multimodal setting.
There is an array of methodology available to overcome issues of multimodality in MCMC, the majority of which uses state space augmentation. Auxiliary distributions that allow a Markov chain to explore the entirety of the state space are targeted, and their mixing information is then passed on to aid mixing in the true target. While the subposteriors of the previous section can be seen as special cases of the following, the most successful and convenient implementation of these methods is to use power-tempered target distributions. The target distribution at inverse temperature level β, for β ∈ (0, 1], is defined as

π_β(θ) = π(θ)^β / K(β),   K(β) = ∫_Θ π(θ)^β dθ.

Therefore, π_1(θ) = π(θ). Temperatures β < 1 flatten out the target distribution, allowing the chain to explore the entire state space provided the β value is sufficiently small. The simulated tempering (ST) and parallel tempering (PT) algorithms (Geyer, 1991; Marinari & Parisi, 1992) typically use such power-tempered targets to overcome the issue of multimodality. The ST approach runs a single Markov chain on the augmented state space B × Θ, where B = {β_0, β_1, …, β_n} is a discrete collection of n + 1 inverse temperature levels with 1 = β_0 > β_1 > … > β_n > 0. The algorithm uses a Metropolis-within-Gibbs strategy by cycling between updates in the Θ and B components of the space. For instance, a proposed temperature swap move β_i → β_j is accepted with probability

min{1, [π(θ)^{β_j} K(β_i)] / [π(θ)^{β_i} K(β_j)]}

in order to preserve detailed balance. Note that this acceptance ratio depends on the normalization constants K(β), which are typically unknown, although they can sometimes be estimated, as in, for example, Wang and Landau (2001) and Atchadé and Liu (2004). In case estimation of the marginal normalization constants is impractical, the PT algorithm is employed instead. This approach simultaneously runs a Markov chain at each of the n + 1 temperature levels, targeting the joint distribution given by ∏_{i=0}^n [π(θ_i)]^{β_i}.
Swap moves between chains at adjacent temperature levels are accepted according to a ratio that no longer depends on the marginal normalization constants. Indeed, this power-tempering approach has been successfully employed in a number of settings and is widely used; see, for example, Neal (1996), Earl and Deem (2005), Xie, Zhou, and Jiang (2010), Mohamed, Calderhead, Filippone, Christie, and Girolami (2012), and Carter and White (2013).
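To make the mechanics concrete, here is a toy parallel tempering sketch of our own (a bimodal one-dimensional mixture target, three hand-picked inverse temperatures, random-walk moves within levels, and adjacent-level swaps). In this example the cold chain visits both modes, which a single random-walk chain started in one mode would rarely do within the same budget:

```python
import math
import random

rng = random.Random(7)

def log_pi(t):
    """Bimodal target: equal mixture of N(-4, 0.5^2) and N(4, 0.5^2)."""
    a = math.exp(-0.5 * ((t + 4) / 0.5) ** 2)
    b = math.exp(-0.5 * ((t - 4) / 0.5) ** 2)
    return math.log(a + b + 1e-300)      # guard against log(0)

betas = [1.0, 0.3, 0.05]     # inverse temperatures; beta_0 = 1 is the cold chain
thetas = [-4.0] * len(betas)
cold = []
for _ in range(20_000):
    # Within-level random-walk Metropolis moves on pi^beta
    for i, b in enumerate(betas):
        prop = thetas[i] + rng.gauss(0.0, 1.0)
        if math.log(rng.random()) < b * (log_pi(prop) - log_pi(thetas[i])):
            thetas[i] = prop
    # Propose swapping the states of a random adjacent pair of levels;
    # the acceptance ratio follows from the joint target prod_i pi(theta_i)^beta_i
    j = rng.randrange(len(betas) - 1)
    delta = (betas[j] - betas[j + 1]) * (log_pi(thetas[j + 1]) - log_pi(thetas[j]))
    if math.log(rng.random()) < delta:
        thetas[j], thetas[j + 1] = thetas[j + 1], thetas[j]
    cold.append(thetas[0])
```

The hottest level sees a nearly flat target and crosses the valley between the modes, while the swap moves percolate those crossings down to the cold chain.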
In both approaches, there is a "Goldilocks" principle to setting up the inverse temperature schedule. Spacings between temperature levels that are "too large" result in swap moves that are rarely accepted, hence delaying the transfer of hot state mixing information to the cold states. On the other hand, spacings that are too small require a large number of intermediate temperature levels, again resulting in slow mixing through the temperature space. This problem becomes even more difficult as the dimensionality of Θ increases.
Much of the historical literature suggested that a geometric spacing was optimal; that is, there exists c ∈ (0, 1) such that β_{i+1} = cβ_i for i = 0, 1, …, n − 1. However, in the case of the ST version, Atchadé, Roberts, and Rosenthal (2011) considered the problem as an optimal scaling problem by maximizing the (asymptotic in dimension) expected squared jumping distance in the B space for temperature swap moves. Under restrictive assumptions, they showed that the spacings between consecutive inverse temperature levels should scale with dimension as O(d^{−1/2}) to prevent degeneracy of the swap move acceptance rate. For a practitioner, the result gave guidance on optimal setup since it suggested a corresponding optimal swap move acceptance rate of 0.234 between consecutive inverse temperature levels, in accordance with Gelman, Gilks, and Roberts (1996). Finally, contrary to the historically recommended geometric schedule, the authors suggested that the temperature schedule should be constructed consecutively so as to induce an approximate 0.234 swap acceptance rate between consecutive levels, which is achieved adaptively in Miasojedow, Moulines, and Vihola (2013). The use of expected squared jumping distance as the measure of mixing speed was justified in Roberts and Rosenthal (2014) where, under the same conditions as in Atchadé et al. (2011), it was shown that the temperature component of the ST chain has an associated diffusion process.
The target of a 0.234 acceptance rate gives good guidance for setting up the ST/PT algorithms in certain settings, but there is a major warning for practitioners following this rule for optimal setup. The assumptions made in Atchadé et al. (2011) and Roberts and Rosenthal (2014) ignore the restrictions of mixing within a temperature level, instead assuming that this mixing occurs infinitely fast relative to the mixing within the temperature space. Huber (2009a, 2009b) and Bhatnagar and Randall (2016) undertake a comprehensive analysis of the spectral gap of the ST/PT chains, and their conclusion is rather damning for ST/PT approaches that use power-tempered targets. Essentially, in situations where the modes have different structures, the time required to reach a given level of convergence for the ST/PT algorithms can grow exponentially in dimension. A major reason is that power-based tempering does not preserve the relative weights/mass between regions at the different temperature levels (see Figure 4), an issue that can scale exponentially in dimension. From a practical perspective, in these finite-run, high-dimensional, nonidentical modal structure settings, the swap acceptance rates can be very misleading, meaning they have limited use as a diagnostic for intermodal mixing quality.

| Adaptive MCMC
Improving and calibrating an MCMC algorithm toward a better correspondence with the intended target is a natural step in making the algorithm more efficient, provided enough information is available about this target distribution. For instance, when an MCMC sample associated with this target is available, even one that has not fully explored the range of the target, it contains some amount of information which can be exploited to construct new MCMC algorithms. Some of the solutions available in the literature (e.g., Liang, Liu, & Carroll, 2007) proceed by repeating blocks of MCMC iterations and updating the proposal K after each block, aiming at a particular optimality goal such as a specific acceptance rate, for example, 0.234 for Metropolis-Hastings steps (Gelman et al., 1996). Most versions of this method update the scale structure of a random-walk proposal based on previous realizations (Robert & Casella, 2009). Other adaptive resolutions bypass this preliminary and somewhat ad hoc construction and aim instead at a permanent updating within the algorithm, motivated by the idea that a continuous adaptation keeps improving the correspondence with the target. In order to preserve the validity of the method (Gelman et al., 1996; Haario, Saksman, & Tamminen, 1999; Roberts & Rosenthal, 2007; Saksman & Vihola, 2010), namely that the chain produced by the algorithm converges to the intended target, specific convergence results need to be established, as the ergodic theorem behind standard MCMC algorithms does not apply. Without due caution (see Figure 5), an adaptive MCMC algorithm may fail to converge due to over-fitting: the update of the proposal distribution may rely too much on the earlier simulations and thus reinforce the exclusion of parts of the space that have not yet been explored.
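A minimal sketch of such continuous adaptation, in our own illustrative one-dimensional version of a Haario-style scheme: the random-walk scale tracks the empirical standard deviation of the past chain, stabilized by a ridge term so that the proposal never collapses:

```python
import math
import random

def adaptive_mh(log_target, theta0, n_iter=20_000, seed=11):
    """Adaptive random-walk Metropolis (1-D sketch): the proposal scale is
    2.38 times the running empirical standard deviation of the chain,
    plus a small ridge, updated online with Welford's recursion."""
    rng = random.Random(seed)
    theta = theta0
    chain = [theta]
    mean, m2 = theta, 0.0          # running mean and sum of squared deviations
    for t in range(1, n_iter + 1):
        sd = 2.38 * math.sqrt(m2 / t + 1e-2)   # empirical sd + ridge
        prop = theta + rng.gauss(0.0, sd)
        if math.log(rng.random()) < log_target(prop) - log_target(theta):
            theta = prop
        chain.append(theta)
        # Welford update of the running mean and variance with the new state
        d = theta - mean
        mean += d / (t + 1)
        m2 += d * (theta - mean)
    return chain

chain = adaptive_mh(lambda t: -0.5 * t * t, theta0=5.0)
```

Because the adaptation here never freezes, its validity rests on the diminishing influence of each new state on the running moments, an informal instance of the diminishing adaptation condition discussed next.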
For the validation of adaptive MCMC methods, stricter constraints must thus be imposed on the algorithm. One well-described solution (Roberts & Rosenthal, 2009) is called diminishing adaptation. Informally, it consists in imposing that the distance between two consecutive proposal kernels decreases uniformly to zero. In practice, this means stabilizing the changes in the proposal by ridge-like factors, as in the early proposal by Haario et al. (1999). A drawback of this resolution is that the decrease itself must be calibrated and may well fail to bring a significant improvement over the original proposal.

FIGURE 5 Markov chains produced by an adaptive algorithm where the proposal distribution is a Gaussian distribution with mean and variance computed from the past simulations of the chain. The three rows correspond to different initial distributions. The fit of the histogram of the resulting MCMC sample is poor, even for the most spread-out initial distribution (bottom). Source: Robert and Casella (2004)

| Multiple try MCMC

A completely different approach to improving the original proposal used in an MCMC algorithm is to consider a collection of proposals, built on different rationales and experiments. The multiple try MCMC algorithm (Liu, Liang, & Wong, 2000; Bédard, Douc, & Moulines, 2012; Martino, 2018) follows this perspective. As the name suggests, the starting point of a multiple try MCMC algorithm is to simultaneously propose N potential moves θ_t^1, …, θ_t^N of the Markov chain, instead of a single value. The proposed values θ_t^i may be independently generated according to N different proposal densities K_i(·|θ_t), conditional on the current value of the Markov chain, θ_t.
One of the θ_t^i's is selected based on the importance sampling weights ϖ_i ∝ π(θ_t^i)/K_i(θ_t^i|θ_t). The selected value is then accepted through a further Metropolis-Hastings step, which involves a ratio of normalization constants for the importance stage, one corresponding to the selection made previously and another one created for this purpose. Indeed, besides the added cost of computing the sum of the importance weights and generating the different variates, this method faces the non-negligible drawback of requiring N − 1 supplementary simulations that are only used for achieving detailed balance and computing a backward summation of importance weights. This constraint may vanish when considering a collection of independent Metropolis-Hastings proposals, q(θ), but this setting is rarely realistic as it requires some amount of prior knowledge or experimentation to build a relevant distribution. An alternative found in the literature is ensemble Monte Carlo (Iba, 2000; Cappé, Douc, Guillin, Marin, & Robert, 2008; Neal, 2011; Martino, 2018), illustrated in Figure 6, which produces a whole sample at each iteration, targeting the product of the initial targets, in closer proximity to particle methods (Cappé, Guillin, Marin, & Robert, 2004; Mengersen & Robert, 2003).
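A minimal multiple try sketch, using a single symmetric random walk proposal shared by all N candidates and a standard weight choice for which the selection probabilities reduce to the target values at the candidates; the toy target and tuning constants are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def target(x):
    return np.exp(-0.5 * x ** 2)       # unnormalized N(0,1) density (illustrative)

N, x, draws = 5, 3.0, []
for it in range(4000):
    # Propose N candidates from a symmetric random walk around the current state.
    cand = x + rng.normal(scale=1.0, size=N)
    w = target(cand)
    y = rng.choice(cand, p=w / w.sum())    # select one candidate by its weight
    # Reference points drawn around the selected candidate, with the current
    # state appended as the N-th one: the backward weight sum they provide is
    # what restores detailed balance.
    ref = np.append(y + rng.normal(scale=1.0, size=N - 1), x)
    if rng.uniform() < w.sum() / target(ref).sum():
        x = y
    draws.append(x)

draws = np.array(draws)
```

The N − 1 draws in `ref` are exactly the supplementary simulations mentioned above: they enter only the backward summation in the acceptance ratio and are then discarded.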
Yet another implementation of this principle is called delayed rejection (Tierney & Mira, 1998; Mira, 2001; Mira & Sargent, 2003), where proposals are instead considered sequentially, each one once the previous proposed value has been rejected, so as to speed up MCMC by considering several possibilities, albeit sequentially. A computational difficulty with this approach is that the associated acceptance probabilities get increasingly complex as the number of delays grows, which may annihilate its appeal relative to simultaneous multiple tries. A further difficulty is to devise the sequence of proposals in a sufficiently diverse manner.
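A two-stage delayed rejection sketch with a bold first proposal and a timid second one; because the second-stage proposal here is symmetric around the current state and independent of the first candidate, its density ratio cancels from the second-stage acceptance probability (the general formula of Mira, 2001, carries extra proposal terms). All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def pi(x):
    return np.exp(-0.5 * x ** 2)                 # unnormalized N(0,1) (illustrative)

def q1(a, b, s=2.5):
    return np.exp(-0.5 * ((b - a) / s) ** 2)     # stage-1 RW density, up to a constant

def alpha1(a, b):
    return min(1.0, pi(b) / pi(a))               # first-stage acceptance probability

x, chain = 0.0, []
for it in range(5000):
    y1 = x + 2.5 * rng.normal()                  # bold first-stage proposal
    if rng.uniform() < alpha1(x, y1):
        x = y1
    else:
        # Delayed rejection: a second, more timid proposal after the rejection.
        y2 = x + 0.5 * rng.normal()
        num = pi(y2) * q1(y2, y1) * (1.0 - alpha1(y2, y1))
        den = pi(x) * q1(x, y1) * (1.0 - alpha1(x, y1))
        if den > 0 and rng.uniform() < num / den:
            x = y2
    chain.append(x)

chain = np.array(chain)
```

Even with only two stages, the second-stage probability already involves the rejected candidate y1 through both q1 and alpha1, which illustrates how quickly the expressions grow with the number of delays.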

| Accelerating MCMC by reducing the variance
Since the main goal of MCMC is to produce approximations of quantities of interest of the form I_h = ∫ h(θ) π(θ) dθ, an alternative (and cumulative) way of accelerating these algorithms is to improve the quality of the approximation derived from an MCMC output. That is, given an MCMC sequence θ_1, …, θ_T converging to π(·), one can go beyond the basic Monte Carlo approximation Î_T^h = (1/T) Σ_{t=1}^T h(θ_t) toward reducing the variance (if not the speed of convergence) of Î_T^h as an estimator of I_h. A common remark when considering Monte Carlo approximations of I_h is that the representation of the integral as an expectation is not unique (e.g., Robert & Casella, 2004). This leads to the technique of importance sampling, where alternative distributions are used in place of π(θ), possibly in an adaptive manner (Douc et al., 2007), or sequentially as in particle filters (Andrieu, Doucet, & Holenstein, 2011; Del Moral, Doucet, & Jasra, 2006). Within the framework of this essay, the outcome of a given MCMC sampler can also be exploited in several ways that lead to an improvement of the approximation of I_h.
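As a reminder of why the non-uniqueness of the representation matters, the following self-normalized importance sampling sketch estimates I_h = E_π[h(θ)] using draws from a wider proposal q in place of π; the target, the proposal, and h are toy choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def pi_unnorm(x):
    return np.exp(-0.5 * x ** 2)      # unnormalized N(0,1) target (illustrative)

def h(x):
    return x ** 2                     # here I_h = E_pi[theta^2] = 1

# Importance sampling: draw from a wider Gaussian proposal q = N(0, 2^2).
x = rng.normal(scale=2.0, size=50000)
q = np.exp(-0.5 * (x / 2.0) ** 2) / 2.0
w = pi_unnorm(x) / q                  # unnormalized importance weights

# Self-normalized estimator of I_h; the unknown normalizing constants of
# both pi and q cancel in the ratio.
is_estimate = np.sum(w * h(x)) / np.sum(w)
```

The self-normalized form is what makes the technique usable when π is only known up to a constant, which is the typical MCMC setting.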

| Rao-Blackwellization and other averaging techniques
The name "Rao-Blackwellisation" was coined by Gelfand and Smith (1990) in their foundational Gibbs sampling paper and it has since become a standard way of reducing the variance of integral approximations. While it essentially proceeds from the basic probability identity π(θ) = ∫ π(θ|ξ) ω(ξ) dξ, when π can be expressed as such a marginal density, and while sufficiency does not have a clear equivalence for Monte Carlo approximation, the name stems from the Rao-Blackwell theorem (Lehmann & Casella, 1998) that improves upon a given estimator by conditioning upon a sufficient statistic. In a Monte Carlo setting, this means that the basic average Î_T^h can be improved by the partly integrated version (1/T) Σ_{t=1}^T E[h(θ)|ξ_t], assuming that a second and connected sequence of simulations (ξ_t) is available and that the conditional expectation is easily computed. For instance, Gibbs sampling (Gelfand & Smith, 1990) is often open to this Rao-Blackwell decomposition, as it relies on successive simulations from several conditional distributions, possibly including auxiliary variates and nuisance parameters. In particular, a generic form of Gibbs sampling called the slice sampler (Robert & Casella, 2004) produces one or several uniform variates at each iteration. However, a more universal type of Rao-Blackwellization is available (Casella & Robert, 1996) for all MCMC methods involving rejection, first and foremost Metropolis-Hastings algorithms. Indeed, first, the distribution of the rejected variables can be derived or approximated, which leads to an importance correction of the original estimator. Furthermore, the accept-reject step depends on a uniform variate, and this uniform variate can be integrated out. Namely, given a sample θ^(1), …, θ^(T) produced by a Metropolis-Hastings algorithm, one can exploit both underlying samples, the proposed values ϑ_1, …, ϑ_T and the uniform variates u_1, …, u_T, so that the ergodic mean can be rewritten as Î_T^h = (1/T) Σ_{i=1}^T n_i h(ϑ_i), where n_i denotes the number of times ϑ_i is repeated in the chain; replacing n_i with its conditional expectation given ϑ_1, …, ϑ_T then produces an estimator with smaller variance. See also Tjelmeland (2004) and Douc and Robert (2010) for connected improvements based on multiple tries.
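A toy illustration of the Rao-Blackwell decomposition in a Gibbs sampler: for a standard bivariate normal with correlation ρ, the conditional expectation E[θ|ξ] = ρξ is available in closed form, and averaging it has smaller variance than averaging θ itself. The example and all constants are ours.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, T = 0.7, 20000
theta, xi = 0.0, 0.0
raw, rb = [], []
for t in range(T):
    # Gibbs sampler for a standard bivariate normal with correlation rho.
    theta = rho * xi + np.sqrt(1 - rho ** 2) * rng.normal()   # theta | xi
    xi = rho * theta + np.sqrt(1 - rho ** 2) * rng.normal()   # xi | theta
    raw.append(theta)        # basic estimate averages theta itself
    rb.append(rho * xi)      # Rao-Blackwell version averages E[theta | xi]

raw_est, rb_est = np.mean(raw), np.mean(rb)
var_raw, var_rb = np.var(raw), np.var(rb)
```

Here Var(E[θ|ξ]) = ρ² < 1 = Var(θ), so the per-term variance of the Rao-Blackwellized average is visibly smaller, at essentially no extra cost since ξ_t is simulated anyway.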
An even more rudimentary (and cheaper) version can be considered by integrating out the decision step at each Metropolis-Hastings iteration: if θ_t is the current value of the Markov chain and ϑ_t the proposed value, to be accepted (as θ_{t+1}) with probability α_t, the version (1/T) Σ_{t=1}^T [α_t h(ϑ_t) + (1 − α_t) h(θ_t)] should most often bring an improvement over the basic estimate (Liu et al., 1995; Robert & Casella, 2004).
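The rudimentary averaging above can be sketched in a few lines: at each Metropolis-Hastings iteration, the accept/reject decision is integrated out by weighting h at the proposed and current values by α_t and 1 − α_t; the toy target and test function are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(6)

def pi(x):
    return np.exp(-0.5 * x ** 2)      # unnormalized N(0,1) target (illustrative)

def h(x):
    return x ** 2                     # estimating E_pi[theta^2] = 1

x, basic, averaged = 0.0, [], []
for t in range(20000):
    prop = x + rng.normal()
    alpha = min(1.0, pi(prop) / pi(x))
    # Integrate out the uniform accept/reject decision: weight the proposed
    # and current values by their respective acceptance probabilities.
    averaged.append(alpha * h(prop) + (1 - alpha) * h(x))
    if rng.uniform() < alpha:
        x = prop
    basic.append(h(x))

basic_est, avg_est = np.mean(basic), np.mean(averaged)
```

Both estimators target the same quantity; the averaged version reuses every proposed value, accepted or not, at no additional simulation cost.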

| CONCLUSIONS
Accelerating MCMC algorithms may sound like a new Achilles-versus-tortoise paradox, in that there always exist methods to speed up a given algorithm. The stopping rule of this infinite regress is, however, that at some point the added pain of achieving the acceleration overcomes the added gain. While we have only, and mostly superficially, covered some of the possible directions in this survey, we warmly encourage readers to stay aware of the potential brought by a wide array of almost cost-free accelerating solutions, as well as to keep trying to devise more fine-tuned improvements in every new MCMC implementation. For instance, for at least one of us, Rao-Blackwellization is always considered at this stage. Keeping at least one such bag of tricks at one's disposal is thus strongly advised. The authors are quite grateful to a reviewer for his or her detailed coverage of an earlier version of the paper, which contributed to significant improvements in the presentation and coverage of the topic. All remaining errors and omissions are solely the responsibility of the authors.