The Barker proposal: Combining robustness and efficiency in gradient‐based MCMC

Abstract. There is a tension between robustness and efficiency when designing Markov chain Monte Carlo (MCMC) sampling algorithms. Here we focus on robustness with respect to tuning parameters, showing that more sophisticated algorithms tend to be more sensitive to the choice of step‐size parameter and less robust to heterogeneity of the distribution of interest. We characterise this phenomenon by studying the behaviour of spectral gaps as an increasingly poor step‐size is chosen for the algorithm. Motivated by these considerations, we propose a novel and simple gradient‐based MCMC algorithm, inspired by the classical Barker accept‐reject rule, with improved robustness properties. Extensive theoretical results, dealing with robustness to tuning, geometric ergodicity and scaling with dimension, suggest that the novel scheme combines the robustness of simple schemes with the efficiency of gradient‐based ones. We show numerically that this type of robustness is particularly beneficial in the context of adaptive MCMC, giving examples where our proposed scheme significantly outperforms state‐of‐the‐art alternatives.


Introduction
The need to compute high-dimensional integrals is ubiquitous in modern statistical inference and beyond (e.g. Brooks et al. [2011], Krauth [2006], Stuart [2010]). Markov chain Monte Carlo (MCMC) is a popular solution, in which the central idea is to construct a Markov chain with a certain limiting distribution and use ergodic averages to approximate expectations of interest. In the celebrated Metropolis-Hastings algorithm, the Markov chain transition is constructed using a combination of a 'candidate' kernel, to suggest a possible move at each iteration, together with an accept-reject mechanism [Metropolis et al., 1953, Hastings, 1970]. Many different flavours of Metropolis-Hastings exist, with the most common difference being in the construction of the candidate kernel. In the random walk Metropolis, proposed moves are generated using a symmetric distribution centred at the current point. Two more sophisticated methods are the Metropolis-adjusted Langevin algorithm [Roberts and Tweedie, 1996] and Hamiltonian/hybrid Monte Carlo [Duane et al., 1987, Neal, 2011]. Both use gradient information about the distribution of interest (the target) to inform proposals. Gradient-based methods are widely considered to be state-of-the-art in MCMC, and much current work has been dedicated to their study and implementation (e.g. Beskos et al. [2013], Durmus and Moulines [2017], Dalalyan [2017]).
Several measures of performance have been developed to help choose a suitable candidate kernel for a given task. One of these is high-dimensional scaling arguments, which compare how the efficiency of the method decays with d, the dimension of the state space. For the random walk algorithm this decay is of the order d^{-1} [Roberts et al., 1997], while for the Langevin algorithm the same figure is d^{-1/3} [Roberts and Rosenthal, 1998] and for Hamiltonian Monte Carlo it is d^{-1/4} [Beskos et al., 2013]. Another measure is to find general conditions under which a kernel will produce a geometrically ergodic Markov chain. For the random walk algorithm this essentially occurs when the tails of the posterior decay at a faster than exponential rate and are suitably regular (more precise conditions are given in [Jarner and Hansen, 2000]). The same is broadly true of the Langevin and Hamiltonian schemes [Roberts and Tweedie, 1996, Livingstone et al., 2019, Durmus et al., 2017a], but here there is an additional restriction that the tails should not decay too quickly. This limitation is caused by the way in which gradients are used to construct the candidate kernel, which can result in the algorithm generating unreasonable proposals that are nearly always rejected in certain regions [Roberts and Tweedie, 1996, Livingstone et al., 2019].
There is clearly some tension between the different results presented above. According to the scaling arguments gradient information is preferable. The ergodicity results, however, imply that gradient-based schemes are typically less robust than others, in the sense that there is a smaller class of limiting distributions for which the output will be a geometrically ergodic Markov chain. It is natural to wonder whether it is possible to incorporate gradient information in such a way that this measure of robustness is not compromised. Simple approaches to modifying the Langevin algorithm for this purpose have been suggested (based on the idea of truncating gradients, e.g. Roberts and Tweedie [1996], Atchade [2006]), but these typically compromise the favourable scaling of the original method. In addition to this, it is often remarked that gradient-based methods can be difficult to tune. Algorithm performance is often highly sensitive to the choice of scale within the proposal [Neal, 2003, Fig. 15], and if this is chosen to be too large in certain directions then performance can degrade rapidly. Because of this, practitioners must spend a long time adjusting the tuning parameters to ensure that the algorithm is running well, or develop sophisticated adaptation schemes for this purpose (e.g. Hoffman and Gelman [2014]), which can nonetheless still require a large number of iterations to find good tuning parameters (see Sections 5 and 6). We will refer to this issue as robustness to tuning.
In this article we present a new gradient-based MCMC scheme, the Barker proposal, which combines favourable high-dimensional scaling properties with favourable ergodicity and robustness to tuning properties. To motivate the new scheme, in Section 2 we present a direct argument showing how the spectral gaps for the random walk, Langevin and Hamiltonian algorithms behave as the tuning parameters are chosen to be increasingly unsuitable for the problem at hand. In particular, we show that the spectral gaps for commonly used gradient-based algorithms decay to zero exponentially fast in the degree of mismatch between the scales of the proposal and target distributions, while for the random walk Metropolis (RWM) the decay is polynomial. In Section 3 we derive the Barker proposal scheme beginning from a family of π-invariant continuous-time jump processes, and discuss its connections to the concept of 'locally-balanced' proposals, introduced in [Zanella, 2019] for discrete state spaces. The name Barker comes from the particular choice of 'balancing function' used to uncover the scheme, which is inspired by the classical Barker accept-reject rule [Barker, 1965]. In Section 4 we conduct a detailed analysis of the ergodicity, scaling and robustness properties of this new method, establishing that it shares the favourable robustness to tuning of the random walk algorithm, can be geometrically ergodic in the presence of very light tails, and enjoys the d^{-1/3} scaling with dimension of the Langevin scheme. The theory is then supported by an extensive simulation study in Sections 5 and 6, including comparisons with state-of-the-art alternative sampling schemes, which highlights that this kind of robustness is particularly advantageous in the context of adaptive MCMC. The code to reproduce the experiments is available from the online repository at the link https://github.com/gzanella/barker. Proofs and further numerical simulations are provided in the supplement.

Basic setup and notation
Throughout we work on the Borel space (R^d, B), with d ≥ 1 indicating the dimension. For λ ∈ R, we write λ ↑ ∞ and λ ↓ 0 to emphasize the direction of convergence when this is important. For two functions f, g : R → R, we use the Bachmann-Landau notation f(t) = Θ(g(t)) if lim inf_{t→∞} f(t)/g(t) > 0 and lim sup_{t→∞} f(t)/g(t) < ∞.
The Markov chains we consider will be of the Metropolis-Hastings type, meaning that the π-invariant kernel P is constructed as P(x, dy) := α(x, y)Q(x, dy) + r(x)δ_x(dy), where Q : R^d × B → [0, 1] is a candidate kernel,

α(x, y) := min{1, π(dy)Q(y, dx) / (π(dx)Q(x, dy))}    (1)

is the acceptance rate for a proposal y given the current point x (provided that the expression is well-defined, see Tierney [1998] for details here), and r(x) := 1 − ∫ α(x, y)Q(x, dy) is the average probability of rejection given that the current point is x.
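As a concrete illustration, the Metropolis-Hastings transition above can be sketched in a few lines of Python. This is a minimal sketch with our own naming, not the paper's code; the Gaussian target, random walk candidate and step-size 0.5 below are illustrative assumptions:

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings transition P(x, dy)."""
    y = propose(x, rng)
    # log of the Hastings ratio pi(y)q(y,x) / (pi(x)q(x,y))
    log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return y          # accept the candidate
    return x              # reject: stay at x (the r(x) * delta_x term)

# Illustrative example: standard normal target, Gaussian random walk candidate
rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x**2
propose = lambda x, rng: x + 0.5 * rng.standard_normal()
log_q = lambda x, y: 0.0   # symmetric candidate: log q cancels in the ratio
x = 0.0
for _ in range(1000):
    x = mh_step(x, log_pi, propose, log_q, rng)
```

The abstract `propose`/`log_q` pair is what the algorithms in Section 2 vary while the accept-reject wrapper stays fixed.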

Robustness to tuning
In this section, we seek to quantify the robustness of the random walk, Langevin and Hamiltonian schemes with respect to the mismatch between the scales of π(·) and Q in a given direction. Unlike other analyses in the MCMC literature (e.g. Roberts and Rosenthal [2001], Beskos et al. [2018]), we are interested in studying how MCMC algorithms perform when they are not optimally tuned, in order to understand how crucially performance depends on such design choices (e.g. the choice of proposal step-size or pre-conditioning matrix). The rationale for performing such an analysis is that achieving optimal or even close to optimal tuning can be extremely challenging in practice, especially when π(·) exhibits substantial heterogeneity. This is typically done using past samples in the chain to compute online estimates of the average acceptance rate and the covariance of π (or simply its diagonal terms for computational convenience), and then using those estimates to tune the proposal step-sizes in different directions [Andrieu and Thoms, 2008]. If the degree of heterogeneity is large, it can take a long time for certain directions to be well-explored, and hence for the estimated covariance to be representative and the tuning parameters to converge. In such settings, algorithms that are more robust to tuning are not only easier to use when such tuning is done manually by the user, but can also greatly facilitate the process of learning the tuning parameters adaptively within the algorithm. We show in Sections 5 and 6 that if an algorithm is robust to tuning then this adaptation process can be orders of magnitude faster than in the alternative case, drastically reducing the overall computational cost for challenging targets. The intuition for this is that more robust algorithms will start performing well (i.e. sampling efficiently) earlier in the adaptation process (when tuning parameters are not yet optimally tuned), which in turn will speed up the exploration of the target and the learning of the tuning parameters.

Analytical framework
The most general scenario we consider is a family of target densities π^(λ,k) indexed by λ > 0 and k ∈ {1, ..., d}, defined as

π^(λ,k)(x) := λ^{-k} π(x_1/λ, ..., x_k/λ, x_{k+1}, ..., x_d),    (2)

where π is a density defined on R^d for which π(x) > 0 for all x ∈ R^d and log π ∈ C^1(R^d). The setup allows modification of the scale of the first k components of π^(λ,k) through the parameter λ. Our main results are presented for the case k = 1, and we write π^(λ) := π^(λ,1) for simplicity, before discussing extensions to the k > 1 setting in Section 2.5. We consider targeting π^(λ) using a Metropolis-Hastings algorithm with fixed tuning parameters, and study performance as λ varies. Intuitively, we can think of λ as a parameter quantifying the level of heterogeneity in the problem. As a concrete example, consider a random walk Metropolis algorithm in which, given the current state x^(t), the candidate move is y = x^(t) + σξ, with σ > 0 a fixed tuning parameter and ξ ~ N(0, I_d), where I_d is the d × d identity matrix. It is instructive to take σ as the optimal choice of global scale for π, meaning that when λ is far from one then σ is no longer a suitable choice for the first coordinate of π^(λ).
In the context of the above, the λ ↓ 0 regime is representative of distributions in which one component (in this case the first) has a very small scale compared to all others. Conversely, the λ ↑ ∞ regime reflects the case in which one component has a much larger scale than its counterparts. Studying robustness to tuning in the context of heterogeneity is particularly relevant, as highlighted above, as this is exactly the context in which tuning is more challenging. The λ ↓ 0 regime is particularly interesting and has been recently considered in Beskos et al. [2018], where the authors study the behaviour of the random walk Metropolis for 'ridged' densities for different values of k using a diffusion limit approach. The focus in that work, however, was on finding the optimal tuning parameters for the algorithm as a function of λ, whereas the present paper is concerned with the regime in which the tuning parameters are fixed (as discussed above).
The above framework could be equivalently formulated by keeping the target distribution π fixed and instead rescaling the first component of the candidate kernel by a factor 1/λ. This is indeed the formulation we mostly use in the proofs of our theoretical results. A proof of the mathematical equivalence between the two formulations can be found in the supplement.

Measure of performance
Our measure of performance for the various algorithms will be the spectral gap of the resulting Markov chains. Consider the space of functions

L²_{0,1}(π) := {f : R^d → R such that E_π[f] = 0 and E_π[f²] = 1}.

Note that any function g with E_π[g²] < ∞ can be associated with an f ∈ L²_{0,1}(π) through the map f = (g − E_π g)/√(Var_π g), and that if X^(t) ~ π(·) and X^(t+1)|X^(t) ~ P(X^(t), ·) then Corr{g(X^(t)), g(X^(t+1))} = Corr{f(X^(t)), f(X^(t+1))}. The (right) spectral gap of a π-reversible Markov chain with transition kernel P is

Gap(P) := inf_{f ∈ L²_{0,1}(π)} (1/2) ∫∫ (f(x) − f(y))² π(dx)P(x, dy).

The expression inside the infimum is called a Dirichlet form, and can be thought of as the 'expected squared jump distance' for the function f provided the chain is stationary. This can in turn be re-written as 1 − Corr{f(X^(t)), f(X^(t+1))}. Maximising the spectral gap of a reversible Markov chain can therefore be understood as minimising the worst-case first-order auto-correlation among all possible square-integrable test functions.
The spectral gap allows one to bound the variances of ergodic averages (see Proposition 1 of Rosenthal, 2003). Also, a direct connection between the spectral gap and mixing properties of the chain can be made if the operator Pf(x) := ∫ f(y)P(x, dy) is positive on L²(π). This will always be the case if the chain is made lazy, which is the approach taken in Woodard et al. [2009], and the same adjustment can be made here if desired.
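On a finite state space the right spectral gap can be computed exactly as one minus the second-largest eigenvalue of the transition matrix, which gives a concrete handle on the definition above. The following toy illustration (our own example, not from the paper) uses a small reversible chain whose gap can be checked by hand:

```python
import numpy as np

# A 3-state transition matrix, reversible with respect to pi = (0.25, 0.5, 0.25)
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.5, 0.25])

# Reversibility check: the matrix pi_i * P_ij must be symmetric
F = pi[:, None] * P
assert np.allclose(F, F.T)

# For a reversible P, D^{1/2} P D^{-1/2} is symmetric with the same spectrum,
# so we can use a symmetric eigensolver; the top eigenvalue 1 corresponds to
# constant functions, and the gap is 1 minus the second-largest eigenvalue.
D = np.diag(np.sqrt(pi))
Dinv = np.diag(1.0 / np.sqrt(pi))
eigvals = np.sort(np.linalg.eigvalsh(D @ P @ Dinv))[::-1]
gap = 1.0 - eigvals[1]
print(gap)   # 0.5 for this chain
```

For the continuous-state chains studied here the gap cannot be computed this way, which is why the theory below works with bounds on the Dirichlet form instead.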

The small λ regime
In this section we assess the robustness to tuning of the random walk, Langevin and Hamiltonian schemes as λ ↓ 0. This corresponds to the case in which the proposal scale is chosen to be too large in the first component of π^(λ). The results in this section will support the idea that classical gradient-based schemes pay a very high price for any direction in which this tuning parameter is chosen to be too large, as already noted in the literature (e.g. Neal, 2003, page 738), while the random walk Metropolis is less severely affected by such issues.

Random walk Metropolis
In the random walk Metropolis (RWM), given a current point x ∈ R^d, a proposal y is calculated using the equation y = x + σξ, with σ > 0 and ξ ~ μ(·) for some centred symmetric distribution μ. The resulting candidate kernel Q^R is given by Q^R(x, dy) = q^R(x, y)dy with q^R(x, y) = σ^{-d} μ((y − x)/σ), where μ(ξ) for ξ ∈ R^d denotes the density of μ. Following Section 2.1, we consider Metropolis-Hastings algorithms with proposal Q^R and target distribution π^(λ) defined in (2), and denote the resulting transition kernels as P^R_λ. We impose the following mild regularity conditions on the density μ(ξ). These are satisfied for most popular choices of μ, as shown in the subsequent proposition.
Condition 2.1. There exists λ_0 > 0 such that for any x, y ∈ R^d and λ < λ_0 we have μ(δ_λ) ≥ μ(δ), where δ = y − x and

δ_λ := (λδ_1, δ_2, ..., δ_d).    (5)

We conclude the section with a characterization of the rate of convergence to zero of the spectral gap for the random walk Metropolis as λ ↓ 0.
Theorem 2.1. Assume Condition 2.1 and Gap(P^R_1) > 0. Then it holds that Gap(P^R_λ) = Θ(λ) as λ ↓ 0.

Note that Theorem 2.1 requires very few assumptions on the target π other than Gap(P^R_1) > 0. Note also that the lower bound is of the form Gap(P^R_λ) ≥ λ Gap(P^R_1); see the proof of Theorem 2.1 for details. No dependence on the dimension of the problem other than that intrinsic to Gap(P^R_1) is therefore introduced.
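The RWM transition just analysed can be sketched compactly: because the candidate density is symmetric, it cancels in the Hastings ratio, leaving the familiar min(1, π(y)/π(x)) rule. A minimal illustrative implementation (our own naming, Gaussian μ assumed):

```python
import numpy as np

def rwm_step(x, log_pi, sigma, rng):
    """One random walk Metropolis transition. The symmetric candidate density
    q_R cancels in the Hastings ratio, leaving min(1, pi(y)/pi(x))."""
    y = x + sigma * rng.standard_normal(x.shape)   # y = x + sigma * xi
    if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
        return y
    return x

# Illustrative target: standard Gaussian on R^2
rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * np.sum(x**2)
x = np.zeros(2)
for _ in range(1000):
    x = rwm_step(x, log_pi, 1.0, rng)
```

Replacing σ by a vector of per-coordinate scales is exactly the tuning problem the λ framework above probes: a σ that is too large in one coordinate only degrades RWM polynomially in the mismatch.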

The Langevin algorithm
In the Langevin algorithm (or more specifically the Metropolis-adjusted Langevin algorithm, MALA), given the current point x ∈ R^d, a proposal y is generated by setting

y = x + (σ²/2) ∇log π^(λ)(x) + σξ,    (6)

for some σ > 0 and ξ ~ N(0, I_d). In this case the proposal is no longer symmetric and so the full Hastings ratio (1) must be used. The proposal mechanism is based on the overdamped Langevin stochastic differential equation dX_t = ∇log π^(λ)(X_t)dt + √2 dB_t. We write Q^M_λ for the corresponding candidate distribution and P^M_λ for the Metropolis-Hastings kernel with proposal Q^M_λ and target π^(λ). We present results for the Langevin algorithm in two settings. Initially we consider more restrictive conditions under which our upper bound on the spectral gap depends on the tail behaviour of π in a particularly explicit manner, and then give a broader result.
(ii) For some q ∈ [0, 1), it holds that

Theorem 2.2. If Condition 2.2 holds, then there is a γ > 0 such that Gap(P^M_λ) ≤ Θ(e^{−(γλ^{−(1+q)} + q log λ)}) as λ ↓ 0.

When compared with the random walk algorithm, Theorem 2.2 shows that the Langevin scheme is much less robust to heterogeneity. Indeed, the spectral gap decays exponentially fast in λ^{−(1+q)}, meaning that even small errors in the choice of step-size can have a large impact on algorithm efficiency, and so practitioners must invest considerable effort in tuning the algorithm for good performance, as shown through simulations in Sections 5 and 6. Theorem 2.2 also illustrates that the Langevin algorithm is more sensitive to λ when the tails of π(·) are lighter. This is intuitive, as in this setting gradient terms can become very large in certain regions of the state space.
Remark 2.1. Theorem 2.2 (and Theorem 2.4 below) could be extended to the case q ≥ 1 in Condition 2.2; however, in these cases samplers typically fail to be geometrically ergodic when λ is small [Roberts and Tweedie, 1996, Livingstone et al., 2019], meaning the spectral gap is typically 0 and the theorem becomes trivial.
We expect Condition 2.3 to be satisfied in many commonly encountered scenarios, with the exception of particularly heavy-tailed models. In the exponential family class π(x) ∝ exp{−α‖x‖_2^β}, for example, Condition 2.3 holds for any α and β > 0 (see proof in the supplement).
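A minimal sketch of one MALA transition may help fix ideas: the Euler-Maruyama discretisation of the Langevin SDE gives the Gaussian candidate of (6), and, since the candidate is not symmetric, the full Hastings ratio is needed. This is an illustrative implementation with our own naming, not the paper's code:

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One MALA transition: the Euler-Maruyama step of the Langevin SDE is the
    candidate, corrected with the full Hastings ratio."""
    mean = lambda z: z + 0.5 * sigma**2 * grad_log_pi(z)
    y = mean(x) + sigma * rng.standard_normal(x.shape)
    # log-density (up to a constant) of the Gaussian candidate N(mean(.), sigma^2 I)
    log_q = lambda frm, to: -np.sum((to - mean(frm))**2) / (2 * sigma**2)
    log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return y
    return x
```

The `grad_log_pi` term is precisely where a step-size that is too large interacts badly with light tails: in regions where the gradient is huge, the drift σ²∇log π(x)/2 overshoots and the correction step rejects almost every candidate.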

Hamiltonian Monte Carlo
In Hamiltonian Monte Carlo (HMC) we write the current point x ∈ R^d as x(0), and construct the proposal y := x(L) for some prescribed integer L using the leapfrog update

ξ(j + 1/2) = ξ(j) + (σ/2) ∇log π^(λ)(x(j)),
x(j + 1) = x(j) + σ ξ(j + 1/2),
ξ(j + 1) = ξ(j + 1/2) + (σ/2) ∇log π^(λ)(x(j + 1)),

where each x(j) is defined recursively in the same manner, and ξ(0) ~ N(0, I_d). The transition is based on numerically solving Hamilton's equations for the Hamiltonian system H(x, ξ) = −log π^(λ)(x) + ξ^T ξ/2 for Lσ units of time. The decision of whether or not the proposal is accepted is taken using the acceptance probability min(1, π^(λ)(y)/π^(λ)(x) × e^{−ξ(L)^T ξ(L)/2 + ξ(0)^T ξ(0)/2}), where ξ(L) denotes the momentum after L leapfrog steps. A more detailed description is given in Neal [2011]. We write P^H_λ for the corresponding Metropolis-Hastings kernel with proposal mechanism as above and target π^(λ). Here we present a heterogeneity result under Condition 2.2 of the previous subsection.
Theorem 2.4. If Condition 2.2 holds, then there is a γ > 0 such that Gap(P^H_λ) ≤ Θ(e^{−(γλ^{−(1+q)} + q log λ)}) as λ ↓ 0.

It is no surprise that Theorem 2.4 is comparable to Theorem 2.2, since setting L = 1 equates the Langevin and Hamiltonian methods.
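The HMC proposal above is driven by the leapfrog integrator. A minimal sketch of a standard leapfrog discretisation (our own naming; illustrative, not the paper's code) shows the two properties that make the accept-reject step valid and efficient, namely time-reversibility and approximate energy conservation:

```python
import numpy as np

def leapfrog(x, xi, grad_log_pi, sigma, L):
    """L leapfrog steps of size sigma for H(x, xi) = -log pi(x) + xi^T xi / 2."""
    x, xi = x.copy(), xi.copy()
    xi = xi + 0.5 * sigma * grad_log_pi(x)       # initial momentum half-step
    for j in range(L):
        x = x + sigma * xi                       # full position step
        if j < L - 1:
            xi = xi + sigma * grad_log_pi(x)     # full momentum step
    xi = xi + 0.5 * sigma * grad_log_pi(x)       # final momentum half-step
    return x, xi

# Illustrative check on a standard normal target: grad log pi(x) = -x
grad = lambda x: -x
y, xi_L = leapfrog(np.array([1.0]), np.array([0.5]), grad, 0.1, 10)
```

Running the integrator forward and then backward with negated momentum returns to the starting point, and the Hamiltonian error stays O(σ²), which is why the acceptance probability min(1, e^{H(x(0), ξ(0)) − H(x(L), ξ(L))}) stays close to one when σ is well chosen.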

The large λ regime
In this section we briefly discuss the λ ↑ ∞ regime, where σ is chosen to be too small for the first component of π^(λ), arguing that all samplers under consideration behave similarly in this regime and pay a similar price for tuning parameters that are too small in a given direction. The intuition for this is that as λ ↑ ∞ the gradient-based proposal mechanisms discussed here all tend towards that of the random walk sampler in the first coordinate. For example, if we consider one-dimensional models, for any x ∈ R we can write ∇log π^(λ)(x) = λ^{-1} ∇log π(x/λ), meaning that as λ ↑ ∞ the amount of gradient information in the proposal is reduced, provided π is suitably regular. The following result makes this intuition precise. To avoid repetition, we state here the result for both the Langevin and the Barker proposal that we will introduce in the next section.
Proposition 2.2. Fix x ∈ R and let the density π be such that ∇log π is bounded in some neighbourhood of zero. Then the Langevin and Barker candidate kernels Q^M_λ and Q^B_λ, defined in (6) and (16) respectively, both satisfy

lim_{λ↑∞} ‖Q^M_λ(x, ·) − Q^R(x, ·)‖_TV = 0 and lim_{λ↑∞} ‖Q^B_λ(x, ·) − Q^R(x, ·)‖_TV = 0,

where Q^R is the (Gaussian) random walk candidate kernel.
The same intuition applies to the Hamiltonian case provided L is fixed, since each gradient term in the proposal is also Θ(1/λ). While there are many well-known measures of distance between two distributions, we argue that total variation is an appropriate choice here, since it has an explicit focus on how much the two kernels overlap and is invariant under bijective transformations of the state space (including re-scaling coordinates).
While the above statements provide useful heuristic arguments, in order to obtain more rigorous results one should prove that the spectral gaps decay to 0 at the same rate as λ ↑ ∞, which we leave to future work. We note, however, that the conjecture that the algorithms behave similarly for large values of λ is supported by the simulations of Section 5.1.

Extensions
The lower bound of Theorem 2.1 extends naturally to the k > 1 setting, becoming instead ≥ Θ(λ^k), and so the rate of decay remains polynomial in λ for any k. Analogously, we expect the corresponding upper bound for gradient-based schemes to remain exponential and become ≤ Θ(e^{−k(γλ^{−(1+q)} + q log(λ))}), although the details of this are left for future work. We explore examples of this nature through simulations in Section 5 and find empirically that the single-component case is informative also of more general cases. Further extensions in which a different λ_i is chosen in each of the k directions can also be considered, with each λ_i ↓ 0 at a different rate. We conjecture that in this setting the λ_i that decays most rapidly will dictate the behaviour of spectral gaps, though such an analysis is beyond the scope of the present work.

Combining robustness and efficiency
The results of Section 2 show that the two gradient-based samplers considered there are much less robust to heterogeneity than the random walk algorithm. In this section, we introduce a novel and simple-to-implement gradient-based scheme that shares the superior scaling properties of the Langevin and Hamiltonian schemes, but also retains the robustness of the random walk sampler, both in terms of geometric ergodicity and robustness to tuning.

Locally-balanced Metropolis-Hastings
Consider a continuous-time Markov jump process on R^d with associated generator

Lf(x) = ∫ (f(y) − f(x)) g(π(y)q(y, x) / (π(x)q(x, y))) q(x, y) dy    (11)

for some suitable function f : R^d → R, where π(x) is a probability density, Q(x, dy) := q(x, y)dy is a transition kernel and the balancing function g : (0, ∞) → (0, ∞) satisfies

g(t) = t g(1/t).    (12)

A discrete state-space version of this process with symmetric Q was introduced in Power and Goldman [2019]. The dynamics of the process are such that if the current state is X_t = x, the next jump will be determined by a Poisson process with intensity

Z(x) := ∫ g(π(y)q(y, x) / (π(x)q(x, y))) q(x, y) dy,    (13)

and the next state is drawn from the kernel g(π(y)q(y, x) / (π(x)q(x, y))) q(x, y) dy / Z(x). The L²(R^d) adjoint, or forward operator A (e.g. Fearnhead et al. [2018]) is given by

Ah(x) = ∫ h(y) g(π(x)q(x, y) / (π(y)q(y, x))) q(y, x) dy − h(x)Z(x).
Note that in the case h(x) = π(x), using (12) the first expression on the right-hand side can be written

∫ g(π(y)q(y, x) / (π(x)q(x, y))) π(x)q(x, y) dy = π(x)Z(x),

meaning Aπ = 0, suggesting π is invariant. It can therefore serve as a starting point for designing Markov chain Monte Carlo algorithms.
In the 'locally-balanced' framework for discrete state-space Metropolis-Hastings introduced in Zanella [2019], candidate kernels are of the form

Q^(g)(x, dy) ∝ g(π(y)/π(x)) μ_σ(y − x) dy,

meaning the embedded Markov chain of (11) with the choice Q(x, dy) := μ_σ(y − x)dy, where μ_σ(y − x) := σ^{-d} μ((y − x)/σ) for some symmetric density μ. It is well-known that the invariant density of the embedded chain does not coincide with that of the process when jumps are not of constant intensity, in this case becoming proportional to Z(x)π(x), as shown in Zanella [2019].
As a result, a Metropolis-Hastings step is employed to correct for the discrepancy. In Power and Goldman [2019] it is suggested that, as an alternative, the jump process can be simulated exactly.
The challenge with employing either of these strategies on a continuous state space is that the integral (13) will typically be intractable. To overcome this issue we take two steps, and for simplicity we first describe these on R (there are two options on R^d for d > 1, which are discussed in Section 3.3). The first step is to consider a first-order Taylor series expansion of log π within g (again with a symmetric choice of Q), leading to the family of processes with generator

L̃f(x) = ∫ (f(y) − f(x)) g(e^{∇log π(x)(y−x)}) μ_σ(y − x) dy.

We refer to candidate kernels in Metropolis-Hastings algorithms that are constructed using the embedded Markov chain of this new process as first-order locally-balanced proposals, taking the form

Q^(g)(x, dy) := g(e^{∇log π(x)(y−x)}) μ_σ(y − x) dy / Z(x),    (15)

where Z(x) := ∫ g(e^{∇log π(x)(y−x)}) μ_σ(y − x) dy. The second step is to note that, if particular choices of g are made, then Z(x) becomes tractable. In fact, if the balancing function g(t) = √t and a Gaussian kernel μ_σ are chosen, then the result is the Langevin proposal y = x + (σ²/2)∇log π(x) + σξ with ξ ~ N(0, 1). Thus, MALA can be viewed as a particular instance of this class. Other choices of g are, however, also possible, and give rise to different gradient-based algorithms. In the next section we explore what a sensible choice of g might look like.
Remark 3.1. One can also think of (12) as a requirement to ensure that the proposals in (15) are exact (i.e. π-reversible) to first order. In particular, in the supplement it is shown that a proposal Q^(g) defined as in (15) is π-reversible with respect to log-linear density functions if and only if (12) holds.

The Barker proposal on R
The requirement for the balancing function g to satisfy g(t) = tg(1/t) is in fact also imposed on the acceptance rate of a Metropolis-Hastings algorithm to produce a π-reversible Markov chain. Indeed, setting t := π(y)q(y, x)/(π(x)q(x, y)) and assuming α(x, y) := α(t), the detailed balance equations can be written α(t) = tα(1/t). Possible choices of g can therefore be found by considering suggestions for α in the literature. One choice, proposed in Barker [1965], is

α(t) = t/(1 + t).

The work of Peskun [1973] and Tierney [1998] showed that this choice of α is inferior to the more familiar Metropolis-Hastings rule α(t) = min(1, t) in terms of asymptotic variance. The same conclusion cannot, however, be drawn when considering the choice of balancing function g.
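The balancing identity g(t) = t g(1/t) is easy to verify numerically for the candidate balancing functions discussed here; the following small check (our own illustration) confirms it for the Barker choice t/(1 + t) and for the square root, which recovers MALA:

```python
# Both the Barker choice and the square root satisfy g(t) = t * g(1/t):
# t * (1/t)/(1 + 1/t) = t/(1 + t), and t * (1/t)^{1/2} = t^{1/2}.
g_barker = lambda t: t / (1.0 + t)
g_sqrt = lambda t: t ** 0.5

for t in [0.1, 0.5, 2.0, 7.3]:
    assert abs(g_barker(t) - t * g_barker(1.0 / t)) < 1e-12
    assert abs(g_sqrt(t) - t * g_sqrt(1.0 / t)) < 1e-12
```

The identity is the whole constraint: any positive g satisfying it generates a valid first-order locally-balanced proposal, so the question becomes which choice gives tractable and robust kernels.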
In fact, the choice g(t) = t/(1 + t) was shown to minimize asymptotic variances of Markov chain estimators in some discrete settings in Zanella [2019].In addition, as shown below, this particular choice of g leads to a fully tractable candidate kernel that can be easily sampled from.
The resulting proposal distribution is

Q^B(x, dy) = 2 μ_σ(y − x) / (1 + e^{−∇log π(x)(y−x)}) dy.    (16)

We refer to Q^B as the Barker proposal. A simple sampling strategy to generate y ~ Q^B(x, ·) is given in Algorithm 1.

Algorithm 1 Generating a Barker proposal on R
Require: the current point x ∈ R.
1. Draw z ~ μ_σ.
2. Compute p(x, z) = 1/(1 + e^{−z∇log π(x)}).
3. Set b(x, z) = 1 with probability p(x, z), and b(x, z) = −1 otherwise.
4. Return y = x + b(x, z) × z.

Proposition 3.2. Algorithm 1 produces a sample from Q^B on R.
Algorithm 1 shows that the magnitude |y − x| = |z| of the proposed move does not depend on the gradient ∇log π(x); it is instead dictated only by the choice of symmetric kernel μ_σ. The direction of the proposed move is, however, informed by both the magnitude and direction of the gradient. Examining the form of p(x, z), it becomes clear that if the signs of z and ∇log π(x) are in agreement, then p(x, z) > 1/2, and indeed as z∇log π(x) ↑ ∞ then e^{−z∇log π(x)} ↓ 0 and so p(x, z) ↑ 1. Hence, if the indications from ∇log π(x) are that π(x + z) ≫ π(x), then it is highly likely that b(x, z) will be set to 1 and y = x + z will be the proposed move. Conversely, if z∇log π(x) < 0, then there is a larger than 50% chance that the proposal will instead be y = x − z. As ∇log π(x) ↑ ∞ the Barker proposal converges to μ_σ truncated on the right, and similarly to μ_σ truncated on the left as ∇log π(x) ↓ −∞. See Figure 1 for an illustration.
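The sampling mechanism just described is only a few lines of code. This is an illustrative sketch with our own naming (Gaussian base kernel μ_σ assumed), not the paper's implementation:

```python
import numpy as np

def barker_proposal_1d(x, grad, sigma, rng):
    """Draw y ~ Q_B(x, .) on R: sample z from the symmetric base kernel
    (Gaussian here), then keep or flip its sign with probability p(x, z)."""
    z = sigma * rng.standard_normal()                 # z ~ mu_sigma
    t = np.clip(z * grad, -50.0, 50.0)                # avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-t))                      # p(x, z)
    b = 1.0 if rng.uniform() < p else -1.0            # b(x, z)
    return x + b * z
```

With a strong positive gradient almost every draw moves right (the right-truncation limit described above), while with a zero gradient the kernel reduces to the symmetric random walk candidate.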
The multiplicative term 1/(1 + e^{−∇log π(x)(y−x)}) in (16), which incorporates the gradient information, injects skewness into the base kernel μ_σ (as can be clearly seen in the left-hand plot of Figure 1). Indeed, the resulting distribution Q^B is an example of a skew-symmetric distribution [Azzalini, 2013, eq. (1.3)]. Skew-symmetric distributions are a tractable family of (skewed) probability density functions that are obtained by multiplying a symmetric base density function with the cumulative distribution function (cdf) of a symmetric random variable. We refer to Azzalini [2013, Ch. 1] for more details, including a more general version of Propositions 3.1 and 3.2. In the context of skewed distributions the Gaussian cdf is often used, leading to the skew-normal distribution introduced in Azzalini [1985]. In our context, however, the Barker proposal (which leads to the logistic cdf p(x, z) in Algorithm 1) is the only skew-symmetric distribution that can be obtained from (15) using a balancing function g satisfying g(t) = tg(1/t). See the supplement for more detail.

The Barker proposal on R d
There are two natural ways to extend the Barker proposal to R^d, for d > 1. The first is to treat each coordinate separately, and generate the proposal y = (y_1, ..., y_d) by applying Algorithm 1 independently to each coordinate. This corresponds to generating a z_i and b_i(x, z_i) for each i ∈ {1, ..., d}, and choosing the sign of each b_i using

p_i(x, z_i) = 1/(1 + e^{−z_i ∂_i log π(x)}),

where ∂_i log π(x) denotes the partial derivative of log π(x) with respect to x_i. Writing Q^B_i(x, dy_i) to denote the resulting Barker proposal candidate kernel for the ith coordinate, the full candidate kernel Q^B can then be written

Q^B(x, dy) = ∏_{i=1}^d Q^B_i(x, dy_i).    (17)

The full Metropolis-Hastings scheme using the Barker proposal mechanism for a target distribution is given in Algorithm 2 (see the supplement for more details and variations of the algorithm, such as a pre-conditioned version). Note that the computational cost of each iteration of the algorithm is essentially equivalent to that of MALA and will typically be dominated by the cost of computing the gradient and density of the target.
The second approach to deriving a multivariate Barker proposal consists of sampling z ∈ R^d from a d-dimensional symmetric distribution, and then choosing whether or not to flip the sign of every coordinate at the same time, using a single global b(x, z) ∈ {−1, 1}, to produce the global proposal y = x + b(x, z) × z. In this case the probability that b(x, z) = 1 will be 1/(1 + e^{−z^T ∇log π(x)}).

Algorithm 2 Metropolis-Hastings with the Barker proposal on R^d
Require: starting point for the chain x^(0) ∈ R^d, scale σ > 0, and number of iterations N. Set t = 0 and do the following:
1. Given x^(t) = x, draw y_i using Algorithm 1 independently for i ∈ {1, ..., d}
2. Set x^(t+1) = y with probability α^B(x, y), where

α^B(x, y) = min{1, (π(y)/π(x)) ∏_{i=1}^d (1 + e^{−(y_i−x_i)∂_i log π(x)}) / (1 + e^{(y_i−x_i)∂_i log π(y)})}.

Otherwise set x^(t+1) = x
3. If t + 1 < N, set t ← t + 1 and return to step 1, otherwise stop.
This second approach does not allow gradient information to feed into the proposal as effectively as the first. Specifically, only the global inner product z^T ∇log π(x) is considered, and the decision to alter the sign of every component of z is taken based solely on this value. In other words, once z ~ μ_σ has been sampled, gradient information is only used to make a single binary decision choosing between the two possible proposals x + z and x − z, while in the first strategy gradient information is used to choose between the 2^d possible proposals {x + (b_1 z_1, ..., b_d z_d) : b ∈ {−1, 1}^d}. Indeed, the following proposition shows that the second strategy cannot improve over the random walk Metropolis by more than a factor of two.
One can also make a stronger statement than the above proposition, namely that if this strategy is employed, only a constant-factor improvement over the random walk Metropolis can be achieved in terms of asymptotic variance, for any L²(π) function of interest. Given Proposition 3.3, we choose to use the first strategy described to produce Barker proposals on R^d, and the multi-dimensional candidate kernel given in (17). In the following sections we will show both theoretically and empirically that this choice does indeed have favourable robustness and efficiency properties.
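Putting the pieces together, one full transition of the coordinate-wise scheme can be sketched as follows. This is an illustrative implementation with our own naming; the base kernel is Gaussian and the candidate-density ratio is computed directly from the per-coordinate logistic skewing factors, with the symmetric base kernel cancelling in the Hastings ratio:

```python
import numpy as np

def barker_step(x, log_pi, grad_log_pi, sigma, rng):
    """One Metropolis-Hastings transition using the coordinate-wise Barker
    proposal on R^d (Algorithm 1 applied independently to each coordinate)."""
    g_x = grad_log_pi(x)
    z = sigma * rng.standard_normal(x.shape)
    p = 1.0 / (1.0 + np.exp(-np.clip(z * g_x, -50.0, 50.0)))   # p_i(x, z_i)
    b = np.where(rng.uniform(size=x.shape) < p, 1.0, -1.0)     # b_i(x, z_i)
    y = x + b * z

    # log of the skewing factor of q_B(frm, to); the symmetric base kernel
    # mu_sigma and the constant normaliser cancel in the Hastings ratio
    log_skew = lambda frm, to, grad: -np.sum(np.logaddexp(0.0, -(to - frm) * grad))
    log_ratio = (log_pi(y) + log_skew(y, x, grad_log_pi(y))
                 - log_pi(x) - log_skew(x, y, g_x))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return y
    return x
```

Note that, like MALA, each iteration needs one new gradient evaluation (the gradient at the accepted point can be cached), so the per-iteration cost of the two schemes is essentially the same.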

Robustness, scaling and ergodicity results for the Barker proposal
In this section we establish results concerning robustness to tuning, scaling with dimension and geometric ergodicity for the Barker proposal scheme.As we will see, the method enjoys the superior efficiency of gradient-based algorithms in terms of scaling with dimension, but also shares the favourable robustness properties of the random walk Metropolis when considering both robustness to tuning and geometric ergodicity.

Robustness to tuning
We now examine the robustness to tuning of the Barker proposal using the framework introduced in Section 2. We write Q^B_λ and P^B_λ to denote the candidate and Metropolis-Hastings kernels for the Barker proposal targeting the distribution π^(λ) defined therein, and P^B for the case λ = 1. The following result characterises the behaviour of the spectral gap of P^B_λ as λ ↓ 0.
Theorem 4.1. Assume Condition 2.1 and Gap(P^B) > 0. Then the spectral gap of P^B_λ decays no faster than polynomially in λ as λ ↓ 0.

Comparing Theorem 4.1 with Theorems 2.1-2.4 from Section 2.3, we see that the Barker proposal inherits the robustness to tuning of random walk schemes and is significantly more robust than the Langevin and Hamiltonian algorithms. In the next section we establish general conditions under which Gap(P^B) > 0.

Geometric ergodicity
In this section we study the class of target distributions for which the Barker proposal produces a geometrically ergodic Markov chain.We show that geometric ergodicity can be obtained even when the gradient term in the proposal grows faster than linearly, which is typically not the case for MALA and HMC.
Recall that a Markov chain with transition kernel P and invariant distribution π is called geometrically ergodic if its n-step distribution converges to π geometrically fast in total variation distance, where the total variation distance is defined for probability measures µ and ν. When such a condition can be established for a reversible Markov chain, a central limit theorem exists for any square-integrable function [Roberts and Rosenthal, 2004]. We prove geometric ergodicity results for generic proposals as in (15), assuming g to be bounded and monotone, and µ_σ to have lighter-than-exponential tails. Following the discussion in Section 3.3 we consider proposals that are independent across components. With a slight abuse of notation, we use µ_σ to represent both one- and d-dimensional densities. The Barker proposal in (17) is the special case obtained by taking g(t) = t/(1 + t).
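In symbols, the two objects discussed above can be sketched as follows. This is our reconstruction in standard notation rather than a quotation of the paper's displayed equations: the componentwise first-order proposal is written in locally-balanced form, with the Barker case recovered by g(t) = t/(1 + t).

```latex
% Geometric ergodicity: there exist rho < 1 and C(x) < infty such that
\|P^n(x,\cdot) - \pi\|_{\mathrm{TV}} \;\le\; C(x)\,\rho^n,
\qquad
\|\mu - \nu\|_{\mathrm{TV}} \;=\; \sup_{A}\,|\mu(A) - \nu(A)|,
% for probability measures mu and nu.

% Componentwise first-order proposal with balancing function g:
Q^{(g)}_{\sigma}(x,\mathrm{d}y) \;\propto\; \prod_{i=1}^{d}
  g\!\left(e^{\,\partial_i \log\pi(x)\,(y_i - x_i)}\right)
  \mu_{\sigma}(y_i - x_i)\,\mathrm{d}y_i .
% Taking g(t) = t/(1+t) gives the Barker proposal.
```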
For the results of this section, we make the simplifying assumption that π is spherically symmetric outside a ball of radius R < ∞.
We note that tail regularity assumptions such as Condition 4.1 are common in this type of analysis (e.g. Jarner and Hansen [2000], Durmus et al. [2017a]). As an intuitive example, the condition is satisfied by the exponential family π(x) ∝ exp(−α|x|^β) for all β > 1. By contrast, for MALA and HMC it is known that for β > 2 the sampler fails to be geometrically ergodic [Roberts and Tweedie, 1996, Livingstone et al., 2019]. We expect the Barker proposal to be geometrically ergodic also in the case β = 1, although we do not prove it in this work.

Scaling with dimensionality
In this section we provide preliminary results suggesting that the Barker proposal enjoys scaling behaviour analogous to that of MALA in high-dimensional settings, meaning that under appropriate assumptions it requires the number of iterations per effective sample to grow as Θ(d^{1/3}) with the number of dimensions d as d → ∞. Similarly to Section 4.2, we prove results for general proposals Q^(g) as in (21) with balancing functions g satisfying g(t) = t g(1/t). The Barker proposal is a special case of this family.
We perform an asymptotic analysis for d → ∞ using the framework introduced in Roberts et al. [1997]. The main idea is to study the rate at which the proposal step size σ needs to decrease as d → ∞ to obtain well-behaved limiting behaviour for the MCMC algorithm under consideration (such as a Θ(1) acceptance rate and convergence to a non-trivial diffusion process after appropriate time re-scaling). Based on the rate of decrease of σ one can infer how the number of MCMC iterations required for each effective sample increases as d → ∞. For example, in the case of the random walk Metropolis σ² must be scaled as Θ(d^{−1}) as d → ∞ to have a well-behaved limit [Roberts et al., 1997], which leads to RWM requiring Θ(d) iterations for each effective sample. By contrast, for MALA it is sufficient to take σ² = Θ(d^{−1/3}) as d → ∞, which leads to only Θ(d^{1/3}) iterations for each effective sample [Roberts and Rosenthal, 1998]. While these analyses are typically performed under simplifying assumptions, such as having a target distribution with i.i.d. components, the results have been extended in many ways (e.g. removing the product-form assumption, see Mattingly et al. [2012]), obtaining analogous conclusions. See also Beskos et al. [2013] for optimal scaling analysis of HMC and Roberts and Rosenthal [2016] for rigorous connections between optimal scaling results and computational complexity statements.
In this section we focus on the scaling behaviour of Metropolis-Hastings algorithms with proposal Q^(g) as in (21), when targeting distributions of the form π(x) = ∏_{i=1}^d f(x_i), where f is a one-dimensional smooth density function. Given the structure of Q^(g) and π(·), the acceptance rate takes the form α(x, y) = min{1, ∏_{i=1}^d α_i(x_i, y_i)}, where each α_i is a one-dimensional acceptance factor depending on f through φ = log f. In such a context, the scaling properties of the MCMC algorithms under consideration are typically governed by the behaviour of log(α_i(x_i, y_i)) as y_i gets close to x_i, or more precisely by the degree of the leading term in the Taylor series expansion of log(α_i(x_i, x_i + σu_i)) in powers of σ as σ → 0 for fixed x_i and u_i. For example, in the case of the random walk Metropolis one has log(α_i(x_i, x_i + σu_i)) = Θ(σ) as σ → 0, which in fact implies the proposal variance σ² must decrease at a rate Θ(d^{−1}) to obtain a non-trivial limit. By contrast, when the MALA proposal is used, one has log(α_i(x_i, x_i + σu_i)) = Θ(σ³); see Durmus et al. [2017b] for a more detailed and rigorous discussion of the connection between the Taylor series expansion of log(α_i(x_i, y_i)) and MCMC scaling results. The following proposition shows that the condition g(t) = t g(1/t), when combined with some smoothness assumptions, is sufficient to ensure that the proposals Q^(g) exhibit the same behaviour.

Proposition 4.1. Let g : (0, ∞) → (0, ∞) with g(t) = t g(1/t) for all t. If g is three times continuously differentiable and ∫_R g^(j)(e^{sw}) µ(w) dw < ∞ for all s > 0 and j ∈ {0, 1, 2, 3}, where g^(j) : (0, ∞) → (0, ∞) is the j-th derivative of g, then log(α_i(x_i, x_i + σu_i)) = O(σ³) as σ → 0, for any x_i and u_i in R.
Proposition 4.1 suggests that Metropolis-Hastings algorithms with proposals Q^(g) such that g(t) = t g(1/t) have scaling behaviour analogous to MALA, meaning that σ² = Θ(d^{−1/3}) is sufficient to ensure a non-trivial limit and thus Θ(d^{1/3}) iterations are required for each effective sample. To make these arguments rigorous one should prove weak convergence results for d → ∞, as in Roberts and Rosenthal [1998]. Proving such a result for a general g would require a significant amount of technical work, going beyond the scope of this section. In this paper we instead support the conjecture of Θ(d^{1/3}) scaling for Q^(g) by means of simulations (see Section 5.2). While Proposition 4.1 only shows log(α_i(x_i, x_i + σu_i)) = O(σ³), it is possible to show that log(α_i(x_i, x_i + σu_i)) = Θ(σ³) under some extra assumptions on φ that exclude exceptional cases (see the supplement for more detail).
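The cubic behaviour in Proposition 4.1 can be checked numerically for the Barker balancing function g(t) = t/(1 + t). The sketch below is our own illustration, not from the paper: it takes a one-dimensional standard normal target (so φ(t) = −t²/2) and evaluates the log acceptance factor at two step-sizes; a cubic leading term means halving σ shrinks |log α_i| by roughly a factor of 8.

```python
import math

def log_accept_barker(x, u, sigma):
    """One-dimensional log acceptance ratio log(alpha_i) for the Barker
    proposal with target N(0, 1), i.e. phi(t) = -t^2/2, at y = x + sigma*u."""
    phi = lambda t: -0.5 * t * t        # log-density up to a constant
    dphi = lambda t: -t                 # phi'
    y = x + sigma * u
    return (phi(y) - phi(x)
            + math.log1p(math.exp(-(y - x) * dphi(x)))
            - math.log1p(math.exp((y - x) * dphi(y))))

# Halving sigma should shrink |log(alpha_i)| by roughly 2^3 = 8 if the
# leading Taylor term is cubic, as Proposition 4.1 indicates.
ratio = log_accept_barker(1.0, 1.0, 0.05) / log_accept_barker(1.0, 1.0, 0.10)
# ratio is close to 1/8; for RWM the analogous ratio would be close to 1/2
```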

Simulations with fixed tuning parameters
Throughout Sections 5 and 6, we choose the symmetric density µ_σ within the random walk and Barker proposals to be N(0, σ²I_d) for simplicity. Note, however, that any symmetric density µ_σ could in principle be used. It would be interesting to explore the impact of different choices of µ_σ on the performance of the Barker algorithm, and we leave such a comparison to future work.

Illustrations of robustness to tuning
We first provide an illustration of the robustness to tuning of the random walk, Langevin and Barker algorithms in three simple one-dimensional settings. In each case we approximate the expected squared jump distance (ESJD) using 10^4 Monte Carlo samples and standard Rao-Blackwellisation techniques, across a range of proposal step-sizes between 0.01 and 100. As is clearly shown in Figure 2, all algorithms perform similarly when the step-size is smaller than optimal, as suggested in Section 2.4. As the step-size increases beyond this optimum, however, behaviours begin to differ. In particular the ESJD for MALA rapidly decays to zero, whereas in the random walk and Barker cases the reduction is much less pronounced. In fact, the rate of decay is similar for the two schemes, which is to be expected following the results of Sections 4.1 and 2.3. See the supplement for a similar illustration on a 20-dimensional example.

Figure 2: Expected squared jump distance (ESJD) against proposal step-size for RWM, MALA and Barker on different 1-dimensional targets (Gaussian, Laplace and Student-t with 5 degrees of freedom).

Comparison of efficiency on isotropic targets
Next we compare the expected squared jump distance of the random walk, Langevin and Barker schemes when sampling from isotropic distributions of increasing dimension, with optimised proposal scale (chosen to maximise expected squared jumping distance). This setup is favourable to MALA, which is the least robust scheme among the three, as the target distribution is homogeneous and the proposal step-size optimally chosen. We consider target distributions with independent and identically distributed (i.i.d.) components, corresponding to the scenario studied theoretically in Section 4.3. We set the distribution of each coordinate to be either a standard normal distribution or a hyperbolic distribution, corresponding to log π(x) = −Σ_{i=1}^d x_i²/2 + const and log π(x) = −Σ_{i=1}^d (0.1 + x_i²)^{1/2} + const, respectively. Figure 3 shows how the ESJD per coordinate decays as dimension increases for the three algorithms. For MALA and Barker the ESJD appears to decrease at the same rate as d increases, which is in accordance with the preliminary results in Section 4.3. In the Gaussian case, MALA outperforms Barker roughly by a factor of 2 regardless of dimension (more precisely, the ESJD ratio lies between 1.7 and 2.5 for all values of d in Figure 3), while in the hyperbolic case the same factor is around 1.2, again independently of dimension (ESJD ratio between 1.1 and 1.25 for all values of d in Figure 3). The rate of decay for the random walk Metropolis is faster, as predicted by the theory.
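The ESJD metric used in these comparisons can be estimated along a chain by the Rao-Blackwellised average of α(x, y)·‖y − x‖² over proposed moves. The sketch below is our own illustration for the random walk Metropolis on a Gaussian target (names and target are our assumptions, not the authors' code); the same estimator applies to MALA and Barker with their respective proposals and acceptance ratios.

```python
import numpy as np

def esjd_rwm(log_pi, sigma, d, n_iter, rng):
    """Rao-Blackwellised expected squared jump distance for the random walk
    Metropolis: average over iterations of alpha(x, y) * ||y - x||^2."""
    x = np.zeros(d)
    lp_x = log_pi(x)
    total = 0.0
    for _ in range(n_iter):
        y = x + sigma * rng.normal(size=d)
        lp_y = log_pi(y)
        alpha = float(np.exp(min(0.0, lp_y - lp_x)))   # min(1, pi(y)/pi(x))
        total += alpha * float(np.sum((y - x) ** 2))
        if rng.random() < alpha:                        # advance the chain
            x, lp_x = y, lp_y
    return total / n_iter

rng = np.random.default_rng(1)
log_pi = lambda x: -0.5 * float(np.sum(x ** 2))  # standard Gaussian target
# ESJD peaks at an intermediate step-size and decays for very small or
# very large ones, reproducing the qualitative shape of Figure 2
```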

Simulations with Adaptive Markov chain Monte Carlo
In this section we illustrate how robustness to tuning affects the performance of adaptive MCMC methods.

Adaptation strategy and algorithmic set-up
We use Algorithm 4 in Section 5 of Andrieu and Thoms [2008] to adapt the tuning parameters within each scheme. Specifically, in each case a Markov chain is initialised using a chosen global proposal scale σ_0 and an identity pre-conditioning matrix Σ_0 = I_d, and at each iteration the global scale and pre-conditioning matrix are updated using the equations

log(σ_{t+1}) = log(σ_t) + γ_{t+1}(α(X^(t), Y^(t)) − ᾱ*),    (24)
µ_{t+1} = µ_t + γ_{t+1}(X^(t+1) − µ_t),    (25)
Σ_{t+1} = Σ_t + γ_{t+1}((X^(t+1) − µ_{t+1})(X^(t+1) − µ_{t+1})^T − Σ_t).    (26)

Here X^(t) denotes the current point in the Markov chain, Y^(t) is the proposed move, µ_0 = 0, ᾱ* denotes some ideal acceptance rate for the algorithm and the parameter γ_t is known as the learning rate. We set ᾱ* to 0.23 for RWM, 0.57 for MALA and 0.40 for Barker. We tried changing the value of ᾱ* for Barker in the range [0.2, 0.6] without observing major differences.
In our simulations we constrain Σ_t to be diagonal (i.e. all off-diagonal terms in (26) are set to 0). This is often done in practice to avoid having to learn a dense pre-conditioning matrix, which would incur a high computational cost and require a large number of MCMC samples. See the supplement for full details on the pre-conditioned Barker schemes obtained with both diagonal and dense matrices Σ_t, including pseudo-code of the resulting algorithms.
We set the learning rate to γ_t := t^{−κ} with κ ∈ (0.5, 1), as suggested for example in Shaby and Wells [2010]. Small values of κ correspond to more aggressive adaptation; for example, Shaby and Wells [2010] suggest using κ = 0.8. In the simulations of Section 6.2 we use κ = 0.6, as this turned out to be a good balance between fast adaptation and stability for MALA (κ = 0.8 resulted in adaptation that was too slow, while values of κ lower than 0.6 led to instability). The adaptation of RWM and Barker was not very sensitive to the value of κ. Unless specified otherwise, all algorithms are randomly initialised with each coordinate sampled independently from a normal distribution with standard deviation 10. Following results from optimal scaling theory [Roberts and Rosenthal, 2001], we set the starting value of the global scale to σ_0² = 2.4²/d for RWM and σ_0² = 2.4²/d^{1/3} for MALA. For Barker we initialise σ_0 to the same value as for MALA.
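One iteration of this adaptation, in its diagonal form, can be sketched as follows. This is a hedged sketch in the spirit of Algorithm 4 of Andrieu and Thoms (2008) rather than the exact implementation used in the experiments; function and argument names are ours.

```python
import numpy as np

def adapt_step(log_sigma, mu, Sigma_diag, x_new, alpha, t,
               alpha_star=0.40, kappa=0.6):
    """One adaptation update, diagonal version: a Robbins-Monro step on the
    log of the global scale, plus running estimates of the target mean and
    coordinate-wise variances.

    alpha is the acceptance probability of the current proposal and
    gamma_t = t^(-kappa) is the learning rate."""
    gamma = t ** (-kappa)
    log_sigma = log_sigma + gamma * (alpha - alpha_star)    # global scale
    mu = mu + gamma * (x_new - mu)                          # running mean
    Sigma_diag = Sigma_diag + gamma * ((x_new - mu) ** 2 - Sigma_diag)
    return log_sigma, mu, Sigma_diag
```

The design makes the scale self-correcting: σ_t grows while the observed acceptance rate exceeds ᾱ* and shrinks otherwise, while the diagonal of Σ_t tracks the empirical variances of the chain.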

Performance on target distributions with heterogeneous scales
In this section we compare the adaptive algorithms described above when sampling from target distributions with significant heterogeneity of scales across their components. We consider 100-dimensional target distributions with different types of heterogeneity, tail behaviour and degree of skewness, according to the following four scenarios: (1) (One coordinate with small scale; Gaussian target) In the first scenario, we consider a Gaussian target with zero mean and diagonal covariance matrix. We set the standard deviation of the first coordinate to 0.01 and that of the other coordinates to 1. This scenario mirrors the theoretical framework of Sections 2 and 4.1, in which a single coordinate is the source of heterogeneity.
(2) (Coordinates with random scales; Gaussian target) Here we modify scenario 1 by generating the standard deviations of each coordinate randomly, sampling them independently from a log-normal distribution. More precisely, we sample log(η_i) ∼ N(0, 1) independently for i = 1, ..., 100, where η_i is the standard deviation of the i-th component.
(3) (Coordinates with random scales; Hyperbolic target) In the third scenario we change the tail behaviour of the target distribution, replacing the Gaussian with a hyperbolic distribution (a smoothed version of the Laplace distribution that ensures log π ∈ C¹(R^d)). In particular, we set log π(x) = −Σ_{i=1}^d (ε + (x_i/η_i)²)^{1/2} + c, with ε = 0.1 and c being a normalizing constant. The scale parameters (η_i)_i are generated randomly as in scenario 2.
(4) (Coordinates with random scales; Skew-normal target) Finally, we consider a non-symmetric target distribution, which represents a more challenging and realistic situation. We assume that the i-th coordinate follows a skew-normal distribution with scale η_i and skewness parameter α, meaning that log π(x) = Σ_{i=1}^d [log φ(x_i/η_i) + log Φ(αx_i/η_i)] + c, where φ and Φ denote the standard normal density and distribution function and c is a normalizing constant. We set α = 4 and generate the η_i's randomly as in scenario 2.
First we provide an illustration of the behaviour of the three algorithms by plotting the trace plots of tuning parameters and MCMC trajectories; see Figure 4 for the results in scenario 1. The adaptation of tuning parameters for the Barker scheme stabilises within a few hundred iterations, after which the algorithm's performance appears to be stable and efficient. By contrast, both RWM and MALA struggle to learn the heterogeneous scales, and the adaptation process has either only just stabilised or not yet stabilised after 10^4 iterations. Looking at the behaviour of MALA in Figure 4 we see that, in order for the algorithm to achieve a non-zero acceptance rate, the global scale parameter σ_t must first be reduced considerably to accommodate the smallest scale of π(·). At this point the algorithm can slowly begin to learn the components of the pre-conditioning matrix Σ_t, but this learning occurs very slowly because the comparatively small value of σ_t results in poor mixing across all dimensions other than the first. Analogous plots for scenarios 2, 3 and 4 are given in the supplement and display comparable behaviour.
We then compare the algorithms more quantitatively, by looking at the average mean squared error (MSE) of MCMC estimators of the first moment of each coordinate, which is a standard metric in MCMC. For any h : R^d → R, define the corresponding MSE as E[(ĥ^(t) − E_π[h])²], where ĥ^(t) = (t − t_burn)^{−1} Σ_{s=t_burn+1}^t h(X^(s)) is the MCMC estimator of E_π[h] after t iterations of the algorithm. Here t_burn is a burn-in period, which we set to t_burn = ⌊t/2⌋, where ⌊·⌋ denotes the floor function. Below, we report the average MSE for the collection of test functions given by h(x) = x_i/η_i for i = 1, ..., d after t MCMC iterations (rescaling by η_i is done to give equal importance to each coordinate).
In addition, we also monitor the rate at which the pre-conditioning matrix Σ_t converges to the covariance of π, denoted by Σ, in order to measure how quickly the adaptation mechanism learns suitable local tuning parameters. We consider the ℓ₂-distance between the diagonal elements of Σ_t and Σ on the log scale, leading to the following measure of convergence of the tuning parameters after t MCMC iterations: d_t = E[(Σ_{i=1}^d (log (Σ_t)_{ii} − log Σ_{ii})²)^{1/2}], where the expectation is with respect to the Markov chain (X^(t))_{t≥1}. We use the log scale as it is arguably more appropriate than the natural one when comparing step-size parameters, and we focus on diagonal terms as both Σ_t and Σ are diagonal here. By monitoring the convergence of d_t to 0 we can compare the speed at which good tuning parameters are found during the adaptation process for different schemes.
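The two performance metrics just described can be computed from stored MCMC output as follows. This is a minimal sketch under our own naming conventions (a chain stored as a (t, d) array, diagonal covariances stored as vectors), not the authors' code.

```python
import numpy as np

def mse_first_moments(chain, eta, true_means=None):
    """Average MSE of the estimators of E[x_i / eta_i] over coordinates,
    discarding the first half of the chain (shape (t, d)) as burn-in."""
    t = chain.shape[0]
    kept = chain[t // 2:] / eta                 # rescale each coordinate
    est = kept.mean(axis=0)
    truth = np.zeros_like(est) if true_means is None else true_means
    return float(np.mean((est - truth) ** 2))

def log_scale_distance(Sigma_t_diag, Sigma_diag):
    """d_t for one run: l2-distance between the log-diagonals of the
    adapted pre-conditioner Sigma_t and the target covariance Sigma."""
    diff = np.log(Sigma_t_diag) - np.log(Sigma_diag)
    return float(np.sqrt(np.sum(diff ** 2)))
```

Averaging `log_scale_distance` over independent runs gives the Monte Carlo estimate of d_t used in the comparisons.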
Figure 5 displays the evolution of d_t and the MSE defined above over 4 × 10^4 iterations of each algorithm, where d_t and the MSE are estimated by averaging over 100 independent runs of each algorithm. The results are in accordance with the illustration in Figure 4, and suggest that the Barker scheme is robust to different types of targets and heterogeneity and results in very fast adaptation, while both MALA and RWM require significantly more iterations to find good tuning parameters. The tuning parameters of MALA appear to exhibit more unstable behaviour than those of RWM in the first few thousand iterations (larger d_t), while after that they converge more quickly, which again is in accordance with the behaviour observed in Figure 5 and with the theoretical considerations of Sections 2 and 4.1. To further quantify the tuning period, we define the time to reach a stable level of tuning as τ_adapt(ε) = inf{t ≥ 1 : d_t ≤ ε} for some ε > 0. We take ε = 1 and report the resulting values in Table 6.2, denoting τ_adapt(1) simply as τ_adapt. The results show that in these examples Barker always has the smallest adaptation time, with a speed-up compared to RWM of at least 34x in all four scenarios, and a speed-up compared to MALA ranging between 3x (scenario 3) and 30x (scenario 2). The adaptation times τ_adapt tend to increase from scenario 1 to scenario 4, suggesting that the target distribution becomes more challenging as we move from scenario 1 to 4. The hardest case for Barker seems to be the hyperbolic target, although even there the tuning stabilised in roughly 3,000 iterations, while the hardest case for MALA is the skew-normal, in which tuning stabilised in roughly 30,000 iterations.
The differences in adaptation times have a direct implication for the resulting MSE of MCMC estimators, which is intuitive because the Markov chain will typically start sampling efficiently from π only once good tuning parameters are found. As we see from the second row of Figure 5 and the second part of Table 6.2, the MSE of Barker is already quite low (between 0.007 and 0.012) after 10^4 iterations in all scenarios, while RWM and MALA need significantly more iterations to achieve the same MSE. After finding good tuning parameters and having sampled enough, MALA is slightly more efficient than Barker for the Gaussian target in scenario 1 and equally efficient for the hyperbolic target of scenario 3, which is consistent with the simulations of Section 5.

Comparison on a Poisson random effects model
In this section we consider a Poisson hierarchical model of the form

y_ij | η_i ∼ Poisson(exp(η_i)) independently for j = 1, ..., n_i; η_i | µ ∼ N(µ, σ_η²) for i = 1, ..., I; µ ∼ N(0, 10²),    (28)

and test the algorithms on the task of sampling from the resulting posterior distribution p(µ, η_1, ..., η_I | y), where y = (y_ij)_ij denotes the observed data. In our simulations we set I = 50 and n_i = 5 for all i, leading to 51 unknown parameters and 250 observations. The model in (28) is an example of a generalized linear model that induces a posterior distribution with light tails and potentially large gradients of log π, which creates a challenge for gradient-based algorithms. In particular, the task of sampling from the posterior becomes harder when either the observations (y_ij)_ij contain large values or they are heterogeneous across values of i ∈ {1, ..., I}. The former case results in a more peaked posterior distribution with larger gradients, while the latter induces heterogeneity across the posterior distributions of the parameters η_i.
In our simulations we consider three scenarios, corresponding to increasingly challenging target distributions: (1) In the first scenario we take σ η = 1 and generate the data y from the model in (28) assuming the data-generating value of µ to be µ * = 5 and sampling the data-generating values of η 1 , . . ., η I from their prior distribution.
(2) In the second scenario we increase the value of σ η to 3, which induces more heterogeneity across the parameters η 1 , . . ., η I .
(3) In the third scenario we keep σ η = 3 and increase the values of µ * to 10, thus inducing larger gradients.
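As a sketch of the target in this example, the following computes the unnormalised log posterior under the hierarchical structure suggested by the text, namely y_ij | η_i ∼ Poisson(exp(η_i)), η_i ∼ N(µ, σ_η²), µ ∼ N(0, 10²); the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def log_posterior(mu, eta, y, sigma_eta):
    """Unnormalised log posterior of (mu, eta) given counts y of shape (I, n),
    assuming y_ij | eta_i ~ Poisson(exp(eta_i)), eta_i ~ N(mu, sigma_eta^2),
    mu ~ N(0, 10^2); the constant log(y_ij!) terms are dropped."""
    lp = -0.5 * mu ** 2 / 100.0                               # prior on mu
    lp += float(np.sum(-0.5 * (eta - mu) ** 2 / sigma_eta ** 2))  # eta priors
    # Poisson log-likelihood: sum_ij [ y_ij * eta_i - exp(eta_i) ]
    lp += float(np.sum(y * eta[:, None] - np.exp(eta)[:, None]))
    return lp
```

The exp(η_i) term in the likelihood is what produces the light tails and potentially very large gradients discussed above: the log posterior decays super-exponentially in each η_i.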
Similarly to Section 6.2, we first provide an illustration of the behaviour of the tuning parameters and MCMC trace plots for RWM, MALA and Barker in Figure 6. Here all algorithms are run for 5 × 10^4 iterations, with the target defined in the first scenario. We use the adaptation strategy of Section 6.2 for tuning, following (24)-(26) with κ = 0.6 and Σ_t constrained to be diagonal, and initialise the chains from a random configuration sampled from the prior distribution of the model. In this example, the random walk converges to stationarity in roughly 10,000 iterations while the Barker scheme takes a few hundred. By contrast, MALA struggles to converge and exhibits unstable behaviour even after 5 × 10^4 iterations. Note that the first 3 × 10^4 iterations of MALA, in which the parameter µ appears to be constant, do not correspond to rejections but rather to moves with very small increments in the µ component.
We then provide a more systematic comparison between the algorithms under consideration in Table 6.3. In addition to RWM, MALA and Barker, we also consider a state-of-the-art implementation of adaptive Hamiltonian Monte Carlo (HMC), namely the Stan [Stan Development Team, 2020] implementation of the No-U-Turn Sampler (NUTS) [Hoffman and Gelman, 2014], as well as of standard HMC (referred to as "static HMC" in the Stan package). The NUTS algorithm is a variant of standard HMC in which the number of leapfrog iterations, i.e. the parameter L in (34), is allowed to depend on the current state (using a "No-U-Turn" criterion). The resulting number of leapfrog steps (and thus log-posterior gradient evaluations) per iteration is not fixed in advance but rather tuned adaptively depending on the hardness of the problem. This is also the case for the static HMC implementation in Stan, as in that case the total integration time in (34) is fixed and the step-size and mass matrix are adapted. For both algorithms we use the default Stan version that learns a diagonal covariance/mass matrix during the adaptation process. This is analogous to constraining the pre-conditioning matrix Σ_t for RWM, MALA and Barker to be diagonal, as we are doing here. Table 6.3 reports the results of the simulations for the five algorithms in each of the three scenarios. For each algorithm, we report the number of log-posterior gradient evaluations and the minimum and median effective sample size (ESS) across the 51 unknown parameters.

Table 2: Comparison of sampling schemes on the posterior distribution arising from the Poisson hierarchical model in (28). Blocks of rows from 1 to 3 refer to the three data-generating scenarios described in Section 6.3. All numbers are averaged across ten repetitions of each algorithm. For each algorithm we report: number of iterations; number of leapfrog steps per iteration and total number of gradient evaluations (when applicable); estimated ESS (minimum and median across parameters); minimum ESS per hundred gradient evaluations (with standard deviation across the ten repetitions).

ESS values are computed with the effectiveSize function from the coda R package [Plummer et al., 2006], discarding the first half of the samples as burn-in. The RWM, MALA and Barker schemes are run for 5 × 10^4 iterations, and the HMC and NUTS schemes for 2 × 10^3 iterations. The latter is the default value in the Stan package and in this example corresponds to a number of gradient evaluations between 1.7 × 10^4 and 1.6 × 10^7. All numbers in Table 6.3 are averaged over ten independent replications of each algorithm. We use the minimum ESS per gradient evaluation as an efficiency metric, of which we report the mean and standard deviation across the ten replicates (multiplied by 100 to facilitate readability).

According to Table 6.3, NUTS is the most efficient scheme in scenario 1, while Barker is the most efficient in scenarios 2 and 3. This is in accordance with the intuition of Barker being a more robust scheme, as the target distribution becomes more challenging as we move from scenario 1 to 3. MALA struggles to converge to stationarity in scenarios 2 and 3 (with an estimated ESS around zero), while it performs better in scenario 1, although with high variability across different runs (shown by the large standard deviation in the last column). The RWM displays low ESS values for all three scenarios, although with a less dramatic deterioration going from scenario 1 to 3. Interestingly, the performance of Barker is remarkably stable across scenarios (with an ESS of around 1400), as well as across the parameters for which ESS is computed (in all cases the minimum and median ESS are close to each other) and across repetitions (shown by the relatively small standard deviation in the last column). We note that NUTS is also remarkably effective, considering that it is not an algorithm designed with a major emphasis on robustness, but its performance does degrade when moving from scenario 1 to scenario 3. As in the MALA case, static HMC struggles to converge in scenarios 2 and 3 and is not very efficient in scenario 1. Note that NUTS, and in particular HMC, compensate for the increasing difficulty of the target by increasing the number of leapfrog steps per iteration. For example, the drop in efficiency of NUTS between scenarios 1 and 2 is mostly due to the increase in the average number of leapfrog iterations from 8.5 to 57.7 rather than to a decrease in ESS. Somewhat surprisingly, in static HMC the number of leapfrog steps per iteration increases significantly more than in NUTS, which could be due either to genuine algorithmic differences or to variations in the details of the adaptation strategy implemented in Stan. Overall, Barker and NUTS are the two most efficient algorithms in these simulations, with a relative efficiency that depends on the scenario under consideration: NUTS is roughly 2.4 times more efficient in scenario 1, Barker 2.3 times more efficient in scenario 2 and Barker 40 times more efficient in scenario 3.

Additional simulations reported in the supplement
In the supplement we report additional simulations for some of the above experiments. As a sensitivity check, we also performed simulations using the tamed Metropolis-adjusted Langevin algorithm [Brosse et al., 2018] and the truncated Metropolis-adjusted Langevin algorithm [Roberts and Tweedie, 1996, Atchade, 2006], two more robust modifications of MALA in which large gradients are controlled by monitoring the size of ‖∇ log π(x)‖. These schemes do offer some added stability compared to MALA in terms of controlling large gradients, but they are ultimately still very sensitive to heterogeneity of the target distribution and to the choice of the truncation level, and do not exhibit the same robustness observed for the Barker scheme. See the supplement for implementation details, results and further discussion.

Discussion
We have introduced a new gradient-based MCMC method, the Barker proposal, and have demonstrated both analytically and numerically that it shares the favourable scaling properties of other gradient-based approaches, along with an increased level of robustness, both in terms of geometric ergodicity and robustness to tuning (as defined in the present paper).The most striking benefit of the method appears to be in the context of adaptive Markov chain Monte Carlo.Evidence suggests that combining the efficiency of a gradient-based proposal mechanism with a method that exhibits robustness to tuning gives a combination of stability and speed that is very desirable in this setting, and can lead to efficient sampling that requires minimal practitioner input.
The theoretical results in this paper could be extended by studying in greater depth the large-λ regime (Section 2.4) and the high-dimensional scaling of the Barker proposal (Section 4.3). Of course, there are many other algorithms that could be considered under the robustness-to-tuning framework, and it would be worthwhile future work to explore which features of a scheme result in robustness to tuning or a lack of it. Extensions to the Barker proposal that incorporate momentum and exhibit the d^{−1/4} decay in efficiency with dimension enjoyed by Hamiltonian Monte Carlo may be possible, as well as the development of other methods within the first-order locally-balanced proposal framework introduced in Section 3, or indeed schemes that are exact at higher orders.

Supplement to 'On the robustness of gradient-based MCMC algorithms.'
The supplementary material contains proofs of the theoretical results and additional figures related to the simulations. It also includes some background on the key techniques used for the proofs of Section 2, a proof of Condition 2.3 for the exponential family class, and details related to skew-symmetry and pre-conditioning of the Barker proposal. In this supplement, we number equations, figures and lemmas differently from the main paper, e.g. (1) rather than (1.1), to avoid confusion between the two documents.

A Tools to bound spectral gaps
To establish lower bounds on spectral gaps we use the following Lemma.
To find upper bounds we use the notion of conductance for a Markov chain. For a π-reversible Markov chain with transition kernel P, define the conductance of a set K ∈ B with 0 < π(K) ≤ 1/2, and recall that the spectral gap of P can be bounded above in terms of the conductance of any such K. This can be seen directly by setting g(x) = π(K^c)I(x ∈ K) − π(K)I(x ∈ K^c), letting f(x) := g(x)/(∫ g(y)² π(dy))^{1/2} and computing the Dirichlet form of f using (3). Here I(·) denotes the indicator function.

B Change of variables and isomorphic Markov chains
In this section we provide two lemmas showing that bijective mappings change neither the spectral gaps of Markov chains nor the Metropolis-Hastings dynamics. These lemmas will allow us to prove the results of Section 2 working with the equivalent formulation in which the target is fixed and the proposal distribution changes, rather than having a target that changes with λ. This will in turn allow us to exploit results such as Lemma A.1, thus significantly simplifying the proofs. We follow the terminology of Johnson and Geyer (2013), and introduce the notion of isomorphic Markov chains. Intuitively, two Markov chains (X^(t))_{t≥1} and (Y^(t))_{t≥1} are isomorphic if (φ(X^(t)))_{t≥1} is equal in distribution to (Y^(t))_{t≥1} for some bijective map φ. More formally, let (X^(t))_{t≥1} and (Y^(t))_{t≥1} be Markov chains with transition kernels P and K and state spaces (S, A) and (T, B), respectively. We say that (X^(t))_{t≥1} and (Y^(t))_{t≥1} are isomorphic if there exists a bijective function φ from S to T such that K(φ(x), B) = P(x, φ^{−1}(B)) for every x ∈ S and B ∈ B. (31) Equation (31) means that K(φ(x), ·) is the push-forward of P(x, ·) under φ for every x ∈ S, which we write as K = φ • P. We will use • to denote the push-forward operator for both probability distributions and transition kernels, so that (φ • µ)(A) = µ(φ^{−1}(A)) for a probability measure µ and measurable set A. Isomorphic Markov chains share the same convergence behaviour and, in particular, they have the same L²-spectral gap, as stated in the following lemma (see Lemma 1 of Papaspiliopoulos et al. (2019) for a proof of analogous results).
Lemma B.1. Isomorphic Markov chains have the same L2-spectral gap.
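On a finite state space a bijection φ is simply a relabelling of the states, so Lemma B.1 can be checked directly: if Π is the permutation matrix of φ, the pushed-forward kernel is ΠPΠᵀ, a similarity transform with the same spectrum and hence the same spectral gap. A minimal sketch (toy chain, arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random transition matrix on 4 states (rows normalised to sum to one).
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)

# A bijection phi on {0,1,2,3} encoded as a permutation matrix Pi;
# the relabelled kernel is K = Pi P Pi^T, i.e. K[i, j] = P[perm[i], perm[j]].
perm = [2, 0, 3, 1]
Pi = np.eye(4)[perm]
K = Pi @ P @ Pi.T

# Similar matrices have identical spectra, hence identical spectral gaps.
eP = np.sort_complex(np.linalg.eigvals(P))
eK = np.sort_complex(np.linalg.eigvals(K))
assert np.allclose(eP, eK)
```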
In the following we will exploit the fact that the Metropolis-Hastings (MH) algorithm preserves isomorphisms of Markov chains under transformations of both the target and candidate distributions, as shown by the following lemma.
Note first that, by the definitions of push-forward measure and transition kernel given above, the pushed-forward quantities transform as required. Denoting the transition kernels of (X(t))_{t≥1} and (Y(t))_{t≥1} by P and K respectively, it therefore holds that (X(t))_{t≥1} and (Y(t))_{t≥1} are isomorphic.

C Proofs
Throughout the proofs we often use ‖·‖ to denote the standard Euclidean norm ‖·‖2.
In cases (i) and (iii), µ(z) is monotonically decreasing in ‖z‖2². So when λ < 1 the required inequality holds, which proves the condition for λ0 = 1. In case (ii), µ(z) is monotonically decreasing in ‖z‖1, so the same argument applies. The statement that sup_{z1∈R} µ1(z1) < ∞ follows by noting that in all three cases the marginal µ1 is known in closed form and is, respectively, a Gaussian, Laplace or Student's t density, all of which are bounded.
Proof of Theorem 2.2. This follows directly from the proof of Theorem 2.4 below, noting that setting L = 1 in Hamiltonian Monte Carlo gives the Langevin algorithm.
Proof of Theorem 2.4. As in the random walk and Langevin cases, we study the MH transition kernel P̃^H_λ with proposal φ • Q^H_λ and target φ • π^(λ), and exploit the fact that Gap(P̃^H_λ) = Gap(P^H_λ) by Lemmas B.2 and B.1. Considering φ(x) = Σ^{1/2}_λ x as above, we have π̃ = φ • π^(λ) and Q̃^H_λ = φ • Q^H_λ evolving according to a preconditioned HMC algorithm, as follows. Writing the current point x ∈ R^d as x(0), as in Section 2.3.3 of the paper, the proposal y := x(L) ~ Q̃^H_λ(x, ·) is obtained using the stated update, where each x(j) is defined recursively in the same manner and ξ(0) ~ N(0, I_d). It is easy to check that Q̃^H_λ = φ • Q^H_λ using the same calculations as in the proof of Theorem 2.3. We now prove that Gap(P̃^H_λ) ≤ Θ(e^{−λ^{−α}}) as λ ↓ 0 for some α > 0. To simplify notation we prove the equivalent statement that Gap(P̃^H_{h^{-1}}) ≤ Θ(e^{−h^α}) as h → ∞. Fix δ ∈ (0, (1 − q)/2), with q defined in Condition 2.2, and consider the stated sets. Here k is chosen large enough that Lemma C.1 below is satisfied and that 0 < π̃(K) < 1/2, which can always be done thanks to the tightness and positivity of π̃ (see above). Lemma C.1 implies that if the stated event occurs, where h0 = λ0^{-1} with λ0 defined as in Lemma C.1, then X(t+1) ∈ K. We now upper bound the probability P(X(t+1) ∈ K^c | X(t) ∈ K). Breaking out the first term on the right-hand side of the initial decomposition and applying the result of Lemma C.1 gives the overall bound. Using standard bounds on Gaussian tails and ξ1 ~ N(0, 1), together with Lemma C.3, we have P(X(t) ∈ K ∩ A^c_h) ≤ Θ(e^{−γ h^{1+q} − q log(h)}) as h → ∞ for some γ ∈ (0, ∞). Hence, since δ < (1 − q)/2 and P(X(t) ∈ K) = π̃(K) is constant with respect to h, we obtain P(X(t+1) ∈ K^c | X(t) ∈ K) ≤ Θ(e^{−γ h^{1+q} − q log(h)}) as h → ∞.
Proof of Proposition 2.2. For the Langevin case, standard results on the total variation distance between two Gaussian measures with differing means reveal that the distance is controlled by t = σ|∇ log π(x/λ)|/(4λ). Because ∇ log π is bounded in a neighbourhood of zero, for large enough λ we can write t ≤ C/λ for some C < ∞, and the stated limit follows as λ ↑ ∞. For the Barker case, note that the total variation distance can be written in terms of g(t) = 1/(1 + t^{-1}). Setting u := ∇ log π(x)z, a Taylor series expansion about zero, with the Lagrange form of the remainder, yields an expression involving some ξ satisfying |ξ| ≤ |∇ log π(x)z|. Substitute this into the integral and simplify. For large enough λ the boundedness assumption allows us to write |∇ log π(x/λ)| ≤ c for some c < ∞; in addition g″(ξ) = 2e^{−2ξ}/(1 + e^{−ξ})³ − e^{−ξ}/(1 + e^{−ξ})², so that sup_{ξ∈R} |g″(ξ)| = (6√3)^{−1}. Substituting into the bound and evaluating the two integrals gives an upper bound on the total variation distance which is Θ(1/λ) as λ ↑ ∞, as desired.
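The constant (6√3)^{-1} ≈ 0.0962 is the supremum of |g″| for ḡ(ξ) := g(e^ξ) = 1/(1 + e^{−ξ}), the logistic function. A quick grid check (our own sanity check, not part of the proof):

```python
import numpy as np

# g''(xi) = 2e^{-2xi}/(1+e^{-xi})^3 - e^{-xi}/(1+e^{-xi})^2 is the second
# derivative of the logistic function; its supremum over R should equal
# 1/(6*sqrt(3)), attained at xi = log(2 + sqrt(3)).
xi = np.linspace(-8.0, 8.0, 400001)
e = np.exp(-xi)
g2 = 2.0 * e**2 / (1.0 + e) ** 3 - e / (1.0 + e) ** 2
assert abs(np.max(np.abs(g2)) - 1.0 / (6.0 * np.sqrt(3.0))) < 1e-6
```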
C.1.1 Lemmas used to prove Theorem 2.4

Lemma C.1. Assume Condition 2.2 and let δ ∈ (0, 1). For every L ≥ 1 there exist λ0 > 0 and a large enough k such that the stated bounds hold.

Proof. Recall that, for each i ≥ 1, x1(i) is implicitly a function of the starting location x1(0), the parameter λ and the noise ξ1. For notational convenience, in the following we set h = λ^{-1} and study the limit h ↑ ∞. To prove the result it is sufficient to show that (35) holds for fixed, sufficiently large k > 0, where ξ1 and x1(0) in the infimum are restricted as in the lemma's statement, i.e. ξ1 ∈ (−h^{1−δ}, h^{1−δ}) and x1(0) as specified there. To prove (35) we will show that, for all i ≥ 1, as h ↑ ∞ we have (36) and (37), where the infima and suprema run over the same sets as in (35), and ∂1(i) stands for (∂ log π1/∂x1)(x1(i)). Note that, for any i ≥ 1, (37) is implied by (36), since inf |x1(1)| → ∞ as h → ∞ and by (7). Thus it suffices to prove that (36) holds for all i ≥ 1, which we do by induction over i.
Lemma C.2. Condition 2.2 (ii) implies that there exist t, c and C in (0, ∞) such that the stated tail bound holds.

Proof. Condition 2.2 implies that there exist t, c ∈ (0, ∞) such that the corresponding derivative bound holds. Since log π1 ∈ C¹(R), this implies that one of the two displayed alternatives holds for all |x1| ≥ t. Since ∫ π1(x1) dx1 = 1, the latter option must be true. Computing the anti-derivative gives log π1(x1) ≤ −c x1^{1+q} + log C for some constant log C. An analogous argument applies in the case x1 ↓ −∞, and the two combined give the result.
Proof of Proposition 3.2. Assume y = x + b(x, z) × z is generated using Algorithm 1. Then for any A ∈ B(R) the displayed decomposition holds. The second term on the right-hand side can be re-written owing to the symmetry of µσ, from which the stated form of the candidate density follows, completing the proof.
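For reference, here is a minimal one-dimensional sketch of the Barker proposal of Algorithm 1 with its Metropolis-Hastings correction, using the form of the candidate density given by the proposition (so that the symmetric µσ terms cancel in the acceptance ratio). The standard-Gaussian target, step size and variable names are our own illustrative choices.

```python
import numpy as np

def grad_log_pi(x):
    return -x          # standard Gaussian target: log pi(x) = -x^2/2 + const

def log_pi(x):
    return -0.5 * x * x

def log_p(x, z):
    # log p(x, z) with p(x, z) = 1 / (1 + exp(-z * grad_log_pi(x))),
    # computed stably via logaddexp.
    return -np.logaddexp(0.0, -z * grad_log_pi(x))

rng = np.random.default_rng(0)
sigma, n = 1.5, 20000
x, chain = 0.0, np.empty(n)
for t in range(n):
    z = sigma * rng.standard_normal()
    # Algorithm 1: keep the sign of z with probability p(x, z), else flip it.
    b = 1.0 if np.log(rng.random()) < log_p(x, z) else -1.0
    y = x + b * z
    # MH ratio: q(x, y) = 2 mu_sigma(y - x) p(x, y - x), so the mu terms cancel.
    w = y - x
    log_alpha = log_pi(y) + log_p(y, -w) - log_pi(x) - log_p(x, w)
    if np.log(rng.random()) < log_alpha:
        x = y
    chain[t] = x

print(chain.mean(), chain.var())   # both should be near the target's 0 and 1
```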
Proof of Proposition 3.3. We establish a point-wise bound between the candidate transition densities of the two algorithms; combining this with Lemma A.1 gives an equivalent bound on the spectral gaps. To reach this point-wise bound, first note that the candidate transition density associated with the random walk algorithm is qR(x, x + z) = µσ(z) for any x, z ∈ R^d. For the modified Barker proposal, the candidate density q̃B(x, x + z) can be written as displayed, where on the last line we have used that p(x, −z) = 1 − p(x, z). Noting that p(x, z) ≤ 1 establishes that qR(x, x + z) ≥ q̃B(x, x + z)/2 for any x, z ∈ R^d, and combining this with Lemma A.1 the result follows.
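The two facts the proof rests on — the skew-symmetry p(x, z) + p(x, −z) = 1, and the resulting pointwise bound qR(x, x+z) ≥ q̃B(x, x+z)/2 — are easy to verify numerically. A small check (Gaussian target and proposal scale chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.7

def grad_log_pi(x):
    return -x  # standard Gaussian target as an example

def p(x, z):
    # p(x, z) = 1 / (1 + exp(-z * grad_log_pi(x)))
    return 1.0 / (1.0 + np.exp(-z * grad_log_pi(x)))

def mu(z):
    # N(0, sigma^2) density, playing the role of mu_sigma
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = rng.normal(size=1000)
z = rng.normal(size=1000)
qR = mu(z)                    # random-walk candidate density
qB = 2.0 * mu(z) * p(x, z)    # Barker candidate density (Proposition 3.2)

assert np.allclose(p(x, z) + p(x, -z), 1.0)   # skew-symmetry of p
assert np.all(qR >= qB / 2.0 - 1e-12)         # pointwise bound used in proof
```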

C.3 Proofs for Section 4
Interestingly, the proof of the lower bound of Theorem 4.1 is analogous to that of Theorem 2.1, providing further insight into the similarity between the Barker scheme and the random walk in terms of robustness to scales.
Proof of Theorem 4.1. As in the proof of Theorem 2.1, we write Q^B_λ for the Barker candidate kernel targeting π^(λ), and Q̃^B_λ(x, dy) := q̃^B_λ(x, y) dy for the isomorphic kernel defined as Q̃^B_λ = φ • Q^B_λ, where φ is the same function used in the proof of Theorem 2.1. Also, we denote by P^B_λ and P̃^B_λ the Metropolis-Hastings kernels with candidate kernels Q^B_λ and Q̃^B_λ and target distributions π^(λ) and π̃, respectively.
From (16) and (17) it follows that the displayed identity holds. Here we use µ to denote the d-dimensional distribution obtained by proposing each coordinate independently, as in Section 3.3. We therefore have the stated expression. We will show that the conditions of Lemma C.4 are satisfied for the Lyapunov function V_s(x) = exp(s‖x‖∞) based on the sup norm, ‖x‖∞ = sup_i |x_i|.
In the following results we denote sup_{t>0} g(t) by M. We write the log-target and its derivatives as U(x) = log π(x) and U_i(x) = ∂U(x)/∂x_i, respectively. Condition 4.1 implies that ∇U(x) = f(‖x‖) x/‖x‖ and U_i(x) = f(‖x‖) x_i/‖x‖ for ‖x‖ > R. Also, we denote the kernel Q^(g) in (21) by Q for brevity, and its density as q(x, y) = ∏_{i=1}^d q_i(w_i; x), where w_i = y_i − x_i and q_i(w_i; x) = g(e^{w_i U_i(x)}) µσ(w_i)/Z_i(x). First, we provide some simple results on the behaviour of g, Z_i and q_i that will be useful afterwards.
Proof. If t ≥ 1 then g(t) ≥ g(1) = g(1) min{1, t} by the monotonicity of g. If t < 1 then g(t) = t g(1/t) ≥ t g(1) = g(1) min{1, t}, again using g(t) = t g(1/t) and the monotonicity of g.

Proof. Consider the case U_i(x) → −∞. From g(t) = t g(1/t) ≤ tM it follows that g(t) → 0 as t → 0. Also, from the boundedness and monotonicity of g it holds that g(t) → M as t → ∞. Therefore the stated pointwise limits hold and, from the bounded convergence theorem, the limit of Z_i follows. We now provide two lemmas that will be used to prove the inequality in (43).
Lemma C.7. Suppose Condition 4.1 holds. Let V_s(x) = exp(s‖x‖∞) and Q the kernel with density q as in (46). Then the stated bound holds for any w ∈ R. Also, from (46) and Lemma C.5, q_i can be bounded in terms of ∫_R e^{s|w|} µσ(w) dw, and thus a corresponding lim sup bound follows. Combining the last two displayed equations, together with (48), (49) and basic properties of the lim sup, we obtain a bound of the form lim sup_{‖x‖→∞} (·) ∫ e^{sw} µσ(w) dw.

Thus lim sup_{‖x‖→∞} QV_s(x)/V_s(x) goes to 0 as s → ∞.
Proof. Let w = y − x and µσ(w) = ∏_{i=1}^d µσ(w_i). Also, denote by α(w; x) = α(x, y) the MH acceptance probability when moving from x to y. We write f(w; x) ≳ g(w; x) if f(w; x) is greater than or equal to g(w; x) up to positive constants independent of x and w. From Lemma C.5 we have g(1)/2 ≤ Z_i(x) ≤ M, which gives a first lower bound on q(w; x) α(w; x). Then, using g(t) ≥ g(1) min{1, t} from Lemma C.5, we obtain a further lower bound on q(w; x) α(w; x).
Assume ‖x‖ large and w ∈ A(x), where A(x) = {w ∈ R^d : ‖x + w‖ ≤ ‖x‖ − ε, ‖w‖ ≤ 2ε and x_i w_i ≤ 0 for all i} for some fixed ε > 0. From x_i w_i ≤ 0 it follows that w_i U_i(x) ≥ 0 and thus, from the monotonicity of g, g(e^{w_i U_i(x)}) ≥ g(1). Combining the latter with the last displayed equation gives a lower bound on q(w; x) α(w; x) proportional to µσ(w), where the last inequality holds for ε sufficiently small because of the assumption inf_{w∈(−δ,δ)} µσ(w) > 0. Therefore lim inf_{‖x‖→∞} ∫_{R^d} q(x, y) α(x, y) dy ≥ lim inf_{‖x‖→∞} ∫_{A(x)} q(w; x) α(w; x) dw ≳ lim inf_{‖x‖→∞} ∫_{A(x)} 1 dw.
The proof is completed by noting that lim inf_{‖x‖→∞} ∫_{A(x)} 1 dw > 0 by the construction of A(x).
Proof of Theorem 4.2. Lemmas C.7 and C.8 imply that there exists an s > 0 such that V_s satisfies (43). The result then follows from Lemma C.4, noting that compact sets are small for P (which can be deduced from the fact that inf π(x) > 0 on compact sets) and that sup_x QV_s(x)/V_s(x) < ∞ by the displayed bound, where we used e^{‖y‖∞ − ‖x‖∞} ≤ e^{‖y−x‖∞} ≤ ∏_i e^{|y_i − x_i|}, q_i(w_i; x) ≤ 2µσ(w_i) and ∫_R e^{s|w|} µσ(w) dw ≤ 2 ∫_R e^{sw} µσ(w) dw < ∞ for every s > 0.
C.4 Proof of Proposition 4.1

Proposition 4.1 follows directly from Lemmas C.9 and C.10 below.
Lemma C.9. Under the assumptions of Proposition 4.1, the displayed expansion of g(e^{−φ′(x_i + σu_i) σu_i}) g(e^{φ′(x_i) σu_i}) holds up to O(σ³) terms as σ → 0, for all x_i, u_i ∈ R.
Lemma C.10. Under the assumptions of Proposition 4.1, the stated expansion holds for all x_i, w_i ∈ R.
Proof. Without loss of generality, assume g(1) = 1 throughout the proof. First consider log(Z_i(x_i)), where

Z_i(x_i) = ∫_R g(e^{φ′(x_i)(y_i − x_i)}) σ^{−1} µ((y_i − x_i)/σ) dy_i = ∫_R g(e^{φ′(x_i) σ s}) µ(s) ds. (54)

For every non-negative integer j, denote by κ_j the j-th moment of the distribution µ(·). Note that, since µ is a symmetric pdf, κ_0 = 1, κ_j = 0 if j is odd and κ_j > 0 if j is even. For j ∈ {1, 2, 3}, we have

∫_R [∂^j/∂σ^j g(e^{φ′(x_i) σ s})]_{σ=0} µ(s) ds = ∫_R [∂^j/∂σ^j g(e^{φ′(x_i) σ})]_{σ=0} s^j µ(s) ds = [∂^j/∂σ^j g(e^{φ′(x_i) σ})]_{σ=0} κ_j, (55)

where the exchange of integration and differentiation is justified by the assumptions on g and µ.
Combining the latter equation with (56) gives the result.

Remark C.2. For the Barker proposal, the normalization term Z_i(x_i) is constant over x_i and thus Lemma C.10 is trivially satisfied.

These schemes are effective in achieving geometric ergodicity even for light-tailed targets [Atchade, 2006]. They are less effective, however, in terms of robustness to tuning, being very sensitive to the choice of truncation parameter (respectively δ and σ^{−2}). We illustrate this point in Figure 9, where we compare MALTA, MALTAc and Barker on targets that are 100-dimensional Gaussian distributions with one component much smaller than the others, analogously to the first scenario of Section 6.2. For MALTA we set δ = 1000, as is done, for example, in Atchade [2006]; we also tried δ = 100 without observing major differences. Rows 1-3 of Figure 9 consider exactly the same target as Figure 4, a 100-dimensional Gaussian in which the first component has standard deviation 0.01 and all other standard deviations are equal to 1. In this case both MALTA and MALTAc improve over MALA; in particular, MALTAc manages to converge to stationarity in around 4000 iterations, although this is still significantly slower than Barker (which requires only a few hundred iterations). We then modify the target distribution by either multiplying the scales of all coordinates by 100 (first component standard deviation 1, all others 100) or dividing all scales by 100 (first component standard deviation 10^{−4}, all others 10^{−2}). The results are reported in rows 4-6 and rows 7-9 of Figure 9, respectively. We observe a dramatic deterioration in the performance of both MALTA and MALTAc, while the performance of the Barker scheme is much less affected. The underlying reason is that MALTA and MALTAc are highly sensitive to the choice of truncation parameter (respectively δ and σ^{−2}), which needs to be tuned appropriately depending on the scales of the target distribution.
These illustrative simulations suggest that ad-hoc strategies for improving the robustness of gradient-based MCMC, such as truncating or taming gradients, are intrinsically more fragile and more sensitive to heterogeneity of scales than a more principled solution such as the Barker algorithm, in which robustness arises naturally from the proposal mechanism. In addition, truncating and taming can be thought of as introducing a 'bias' into the proposal mechanism, in the sense that the resulting proposal is no longer first-order exact. Depending on how the truncation level δ is scaled with the dimensionality d, this can compromise the d^{−1/3} scaling behaviour discussed in the paper.
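The qualitative difference can be seen directly from the proposal mechanisms themselves. The sketch below uses one common truncation rule, drift ∝ δ·∇/max(δ, |∇|) (an illustrative choice; the exact form used by MALTA may differ): once |∇ log π| exceeds δ the drift saturates at a constant whose appropriateness depends entirely on the target's scale, whereas the Barker move x + bz is never larger than |z|, whatever the gradient.

```python
import numpy as np

def truncated_drift(grad, step, delta):
    # One common truncation/taming rule (illustrative; MALTA's exact form
    # may differ): drift = (step^2 / 2) * delta * grad / max(delta, |grad|).
    return 0.5 * step**2 * delta * grad / max(delta, abs(grad))

step, delta = 0.1, 1000.0
for grad in [1e2, 1e4, 1e6]:
    print(grad, truncated_drift(grad, step, delta))
# For |grad| >= delta the drift saturates at step^2 * delta / 2 = 5.0,
# however steep the target is -- far too large if the relevant scale of the
# target is, say, 1e-4. Hence the sensitivity to the choice of delta.

# The Barker proposal, by contrast, never moves farther than |z|:
rng = np.random.default_rng(3)
z = step * rng.standard_normal()
for grad in [1e2, 1e4, 1e6]:
    p = 1.0 / (1.0 + np.exp(-np.clip(z * grad, -50.0, 50.0)))  # sign-keep prob.
    b = 1.0 if p > 0.5 else -1.0
    assert abs(b * z) <= abs(z)  # move magnitude is |z| regardless of gradient
```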

Figure 1: Left: density of the Barker proposal in one dimension. The current location is x = 0 and the four lines with increasing red intensity correspond to ∇ log π(x) equal to 1, 3, 10 and 50. Right: density of the Barker proposal in two dimensions. Solid lines display the proposal density contours, heat colours refer to the target density, and the current location is x = (−2, 0).

Figure 3: ESJD against dimensionality for RWM, MALA and Barker schemes with optimally tuned step size. The target distribution has i.i.d. coordinates following either a Gaussian distribution (left plot) or a hyperbolic one (right plot).

Figure 4: RWM, MALA and Barker schemes with adaptive tuning as in (24)-(26) and learning rate γ_t = t^{−κ} with κ = 0.6. The target distribution is a 100-dimensional Gaussian in which the first component has standard deviation 0.01 and all others have unit scale. First row: adaptation of the global scale σ_t; second row: adaptation of the local scales diag(Σ_t) = (Σ_{t,ii})_{i=1}^{100}; third row: trace plot of the first coordinate; fourth row: trace plots of coordinates 2 to 100 (superposed).

Figure 5: Comparison of RWM, MALA and Barker on the four target distributions (Scenarios 1 to 4) described in Section 6.2, averaging over ten repetitions of each algorithm. First row: convergence of tuning parameters, measured by d_t defined in (27). Second row: mean squared error (MSE) of MCMC estimators of first moments, averaged over all coordinates.

Figure 6: Behaviour of RWM, MALA and Barker on the posterior distribution from the Poisson hierarchical model in (28). Data are generated as in the first scenario of Section 6.3. First row: adaptation of the global scale σ_t; second row: adaptation of the local scales diag(Σ_t) = (Σ_{t,ii})_{i=1}^{100}; third row: trace plot of the parameter µ; fourth row: trace plots of the parameter η_1.

Figure 8: Same as Figure 4 for the target distributions of Scenario 2 (rows 1 to 3), Scenario 3 (rows 4 to 6) and Scenario 4 (rows 7 to 9). For each scenario, the first row displays the trace plot of the global scale σ_t; the second row those of the normalized local scales Σ_{t,ii}/Σ_{ii} for i = 1, ..., 100; and the third row those of the normalized coordinates X(t)_i/Σ_{ii}^{1/2} for i = 1, ..., 100. See Section 6.2 for more details.

Figure 9: Comparison of MALTA, MALTAc and Barker on target distributions with one small component. Rows 1-3: same target as in Figure 4 (100-dimensional Gaussian with first component standard deviation equal to 0.01 and all other standard deviations equal to 1). Rows 4-6 and rows 7-9: re-scaled versions in which the scales of all coordinates are multiplied or divided by 100, respectively. See Figure 8 for an explanation of each quantity plotted.

Table 1: Adaptation times (τ_adapt) and mean squared errors (MSE) from 10k, 20k and 40k iterations of the RWM, MALA and Barker algorithms under each of the four heterogeneous scenarios described in Section 6.2.