Accelerating adaptation in the adaptive Metropolis–Hastings random walk algorithm

The Metropolis–Hastings random walk algorithm remains popular with practitioners due to the wide variety of situations in which it can be successfully applied and the extreme ease with which it can be implemented. Adaptive versions of the algorithm use information from the early iterations of the Markov chain to improve the efficiency of the proposal. The aim of this paper is to reduce the number of iterations needed to adapt the proposal to the target, which is particularly important when the likelihood is time‐consuming to evaluate. First, the accelerated shaping algorithm is a generalisation of both the adaptive proposal and adaptive Metropolis algorithms. It is designed to remove, from the estimate of the covariance matrix of the target, misleading information from the start of the chain. Second, the accelerated scaling algorithm rapidly changes the scale of the proposal to achieve a target acceptance rate. The usefulness of these approaches is illustrated with a range of examples.


Introduction
The Metropolis-Hastings random walk (MHRW) algorithm (Metropolis et al. 1953; Hastings 1970) is a Markov chain Monte Carlo (MCMC) algorithm that has an enduring popularity with practitioners due to the ease of implementation and the wide variety of circumstances in which it is applicable. Adaptive versions of the algorithm (Haario, Saksman & Tamminen 1999; Haario et al. 2001) automatically tune the proposal covariance matrix to improve the mixing. Landmark papers have shown that, for a d-dimensional target with covariance matrix Σ, in a range of circumstances the optimal proposal covariance matrix is (2.38²/d)Σ, leading to an optimal acceptance rate of 0.234 (Gelman, Roberts & Gilks 1996; Roberts, Gelman & Gilks 1997; Roberts & Rosenthal 2001). Curiously, this acceptance rate has also been proved to be optimal for some other MCMC proposals (Lee & Neal 2018). One drawback of the adaptive MHRW algorithm is that it can take a large number of iterations to adapt to the target. Accelerating the speed at which adaptations occur during the burn-in phase of the MHRW algorithm is the subject of this paper.
Another drawback of adaptive algorithms is that the process of learning the shape of the target from the history of the chain destroys the Markov property, and therefore it can be challenging to prove that an adaptive algorithm converges to the correct target. Indeed, in some examples adaptive algorithms have been shown to converge towards an incorrect distribution (Haario, Saksman & Tamminen 1999; Atchadé & Rosenthal 2005; Roberts & Rosenthal 2007). Many adaptive algorithms in the literature come with accompanying proofs of asymptotic convergence, and may contain features designed to facilitate these proofs rather than rapid adaptation, leaving scope for improvement in most practical applications. One existing approach that avoids such difficulties is Yang & Rosenthal (2017), where the adaptation is stopped once the mixing appears close to optimal. The authors' step-by-step approach identifies a first adaptive phase, a transient phase (for travelling to the mode), a second adaptive phase and finally a sampling phase with no adaptation, which guarantees convergence to the correct target. Although this approach provides great scope for accelerating adaptation, the focus of that paper is on diagnostics to identify the phases; accelerating the rate of adaptation is not explored in detail.
Better-mixing MCMC algorithms than the MHRW exist (see Robert et al. 2018, for a review), but they generally require more effort to implement. Examples include gradient-based approaches such as Hamiltonian Monte Carlo (Neal 2011) and the Metropolis-adjusted Langevin algorithm (Roberts & Tweedie 1996), as well as the delayed rejection adaptive Metropolis algorithm (Haario et al. 2006). Furthermore, the burn-in phase of an MCMC algorithm can be greatly reduced by starting the chain from the posterior mode. However, there are circumstances in which it is very time-consuming or unhelpful to calculate gradients of the target: for example, when there are discrete parameters, large amounts of missing data are being imputed, the likelihood function takes a long time to evaluate, or there are discontinuities in the target. Any one of these circumstances means that both numerical optimisation to find the posterior mode and gradient-based MCMC algorithms become challenging to implement, and practitioners frequently revert to the MHRW algorithm. An important motivating application is the problem of fitting non-linear systems of differential equations to time series data, such as when modelling the spread of infectious diseases (see e.g. Keeling & Rohani 2011; Hollingsworth et al. 2015). The model equations can only be solved numerically, so posterior gradients are not available; the posterior mode is challenging to obtain and the likelihood is time-consuming to evaluate, so a large amount of computation time is required to fit the model. This paper focusses on methods to shorten the burn-in phase of the adaptive MHRW algorithm by making the adaptation occur more rapidly. It outlines a general algorithm for shaping the MHRW proposal, which includes both the adaptive proposal (AP; Haario, Saksman & Tamminen 1999) and the adaptive Metropolis (AM; Haario et al. 2001) algorithms as special cases.
Furthermore, in Section 2.3, a rescaling approach is discussed that uses the Robbins-Monro algorithm to achieve a target acceptance rate. As the number of MCMC samples increases, the algorithms described in this paper converge towards existing adaptive algorithms that approach optimal mixing, but, as illustrated by the examples in Section 3, they approach the optimal proposal more rapidly. The adjustments that are proposed are straightforward to implement without substantial extra coding. The number of user-specified parameters has been kept to a minimum so that the resulting algorithm is not too problem specific and can be applied to a wide range of well-behaved problems with little tuning. However, there are many problems for which the MHRW algorithm is not a good choice (e.g. when the target has heavy tails or there are multiple separated local modes), and for these problems the proposed algorithm will not generate good mixing, and may even fail to converge. Alternative algorithms that explore multi-modal targets should be used instead, such as simulated tempering (Geyer 1991; Tawn, Roberts & Rosenthal 2020) or parallel tempering (Marinari & Parisi 1992; Miasojedow, Moulines & Vihola 2013; Tawn & Roberts 2019). For a discussion of these and related methods, see for example Tawn (2017).

Accelerated shaping
In a landmark paper, Haario et al. (2001) developed a Gaussian random walk proposal and proved that the resulting MCMC approaches the correct target. This algorithm is termed the AM algorithm. The proposal for iteration n + 1 is Y_{n+1} ∼ N(X_n, cΣ_n), where X_n is the d-dimensional column vector representing the current location of the chain. This proposal is then accepted or rejected according to the usual Metropolis-Hastings ratio. The authors proposed the formula

Σ_n = cov(X_0, …, X_n) + εI for n ≥ n_0, and Σ_n = Σ_0 otherwise, (1)

where ε is a small positive constant and I is the d-dimensional identity matrix. Recall that cov(X_0, …, X_n) = n^{-1} ∑_{i=0}^{n} (X_i − X̄_n)(X_i − X̄_n)^T, where X̄_n = (n + 1)^{-1} ∑_{i=0}^{n} X_i. The authors also suggested using c = 2.38²/d, which was proved to be optimal for a range of targets (Gelman, Roberts & Gilks 1996; Roberts, Gelman & Gilks 1997; Roberts & Rosenthal 2001), when Σ_n is replaced with the true covariance matrix Σ of the target distribution.
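As a concrete illustration, the AM shape matrix can be sketched in a few lines. This is a minimal sketch, not the authors' code; the function and argument names are our own:

```python
import numpy as np

def am_shape_matrix(history, sigma0, n0, epsilon=0.01):
    """Sigma_n for the AM algorithm (a sketch; argument names are ours).

    history : array of shape (n + 1, d) holding the states X_0, ..., X_n.
    Returns the fixed initial matrix sigma0 until iteration n0, and after
    that the empirical covariance of the whole history regularised by
    epsilon * I.  The proposal is then Y ~ N(X_n, c * Sigma_n) with
    c = 2.38**2 / d.
    """
    n_plus_1, d = history.shape
    if n_plus_1 <= n0:
        return sigma0
    # np.cov with rowvar=False uses divisor n for n + 1 rows, matching
    # the definition of cov(X_0, ..., X_n) above
    emp = np.cov(history, rowvar=False)
    return emp + epsilon * np.eye(d)
```

Note that the regularising term εI keeps the proposal covariance positive definite even when the empirical covariance is singular.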
The above algorithm becomes increasingly efficient as the covariance matrix adapts to the target. However, in the early iterations there are two major inefficiencies. First, no adaptation occurs at all for the first n_0 iterations, so if the scale of the initial covariance matrix Σ_0 has been badly chosen then these iterations are completely wasted, leaving nothing on which to base the proposal for subsequent iterations. Second, the estimate of the covariance matrix always includes the (arbitrary) starting location of the chain, as well as the following burn-in. Since the usual mean and covariance estimates are sensitive to outliers, it can take a long time for the influence of these points to reduce enough for the proposal to become efficient. Interestingly, in an earlier paper (Haario, Saksman & Tamminen 1999), the authors describe the adaptive proposal (AP) algorithm, which uses a fixed number of the most recent observations to estimate the covariance matrix, avoiding this pitfall. In this paper, a more general shaping algorithm is described, termed the accelerated shaping algorithm, that includes both the AM and AP algorithms as special cases. Locations visited in the early iterations of the MCMC are removed from the estimate of the covariance matrix, potentially at a rate slower than new ones are accumulated. The framework allows the adaptation to occur smoothly and to begin immediately, avoiding a sharp transition in the proposal. The aim is to make the proposal distribution as effective as possible for every iteration, even in the early stages.
Let f(n) be a non-decreasing sequence of integers such that f(1) = 0 and either f(n + 1) = f(n) or f(n + 1) = f(n) + 1 for all n. For example, f(n) = ⌊n/2⌋. Consider, for n > 0,

Σ_n = w_n cov(X_{f(n)}, …, X_n) + S_n, (2)

where (w_n) is a non-negative sequence and (S_n) is a sequence of positive semi-definite d × d matrices. In what follows redefine X̄_n = (n − f(n) + 1)^{-1} ∑_{i=f(n)}^{n} X_i.
Lemma 1. If n > 1 and w_{n−1} > 0 then (2) can be calculated iteratively using the update rules below. Write k = n − f(n) + 1 for the number of states in the current window, and let C_n = cov(X_{f(n)}, …, X_n) denote the empirical covariance term in (2), recovered from C_{n−1} = (Σ_{n−1} − S_{n−1})/w_{n−1}.

If f(n) = f(n − 1) then a new observation is included:

X̄_n = X̄_{n−1} + (X_n − X̄_{n−1})/k,
C_n = ((k − 2)/(k − 1)) C_{n−1} + (1/k)(X_n − X̄_{n−1})(X_n − X̄_{n−1})^T.

Otherwise f(n) = f(n − 1) + 1 and the new observation replaces the oldest:

X̄_n = X̄_{n−1} + (X_n − X_{f(n)−1})/k,
C_n = C_{n−1} + (k − 1)^{-1} [(X_n − X̄_{n−1})(X_n − X̄_n)^T − (X_{f(n)−1} − X̄_{n−1})(X_{f(n)−1} − X̄_n)^T].

Important special cases of (2) include the AM algorithm, in which f(n) = 0 and, if n < n_0, then w_n = 0 and S_n = Σ_0, while if n ≥ n_0 then w_n = 1 and S_n = εI; and the AP algorithm, in which the covariance estimate is based on the most recent H observations, with f(n) = max(0, n − H), and, for n ≥ H, S_n = 0 (the d × d zero matrix) and w_n = 1.
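The update rules of Lemma 1 amount to standard rank-one updates of a running mean and covariance over a growing or sliding window. A minimal sketch, with naming of our own choosing (the window bookkeeping, i.e. the choice of f(n), is left to the caller):

```python
import numpy as np

def update_window_cov(mean, cov, k, x_new, x_old=None):
    """One step of the rolling covariance update behind Lemma 1 (a sketch).

    mean, cov : running mean and sample covariance (divisor k - 1) of the
                current window of k states.
    x_new     : the newly visited state X_n.
    x_old     : if f(n) = f(n-1) + 1, the state X_{f(n)-1} leaving the
                window; if None, the window simply grows (f(n) = f(n-1)).
    Returns (mean, cov, k) for the updated window.
    """
    if x_old is None:
        # window grows by one observation
        k += 1
        delta = x_new - mean
        mean = mean + delta / k
        cov = ((k - 2) * cov + np.outer(delta, delta) * (k - 1) / k) / (k - 1)
        return mean, cov, k
    # window slides: x_new replaces x_old, size k unchanged
    new_mean = mean + (x_new - x_old) / k
    cov = cov + (np.outer(x_new - mean, x_new - new_mean)
                 - np.outer(x_old - mean, x_old - new_mean)) / (k - 1)
    return new_mean, cov, k
```

Both branches cost O(d²) per iteration, compared with O(kd²) for recomputing the covariance from scratch.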

Bayesian learning of the covariance matrix
In this section we discuss learning the covariance matrix of the target in a Bayesian way, and use this idea to find suitable choices for the weight of the empirical covariance matrix, w_n, and the regularising sequence of covariance matrices, S_n.
For the accelerated shaping algorithm to converge towards the optimal proposal, it is required that n − f(n) → ∞, w_n → 1 and S_n → 0. The convergence and ergodicity properties of the algorithm are discussed further in Section 2.4. To avoid the need to specify the length of the burn-in a priori, it is desirable to have a smooth transition between the initial guess for the covariance estimate, Σ_0, and the current estimate. The aim is for Σ_n to represent the current best estimate of the covariance matrix of the target after n iterations, taking into account that in early iterations the empirical covariance estimate is likely to be poor. One solution to this problem is to allow Σ_n to be based on a Bayesian analysis of the covariance matrix of the states visited by the MCMC, with the initial covariance matrix Σ_0 taking the role of the estimate from the prior. As more observations are collected, the influence of the prior diminishes in a natural way.
Let μ denote the mean and Σ the covariance matrix of the target distribution being explored by the MCMC. Assume a normal inverse-Wishart distribution for (μ, Σ) ∼ NIW(μ_0, λ_0, C_0, ν_0), which has density (Prince 2012)

p(μ, Σ) ∝ |Σ|^{−(ν_0 + d + 2)/2} exp{−(1/2) tr(C_0 Σ^{−1}) − (λ_0/2)(μ − μ_0)^T Σ^{−1} (μ − μ_0)}.

A posterior for (μ, Σ) is going to be derived, based on the states visited between iterations f(n) and n, as in Section 2.1, and so the analogous notation will be used. Suppose that k = n − f(n) + 1 independent observations X_{f(n)}, …, X_n are observed from a multivariate normal N(μ, Σ). Using the conjugacy of the normal inverse-Wishart prior, the posterior is (μ, Σ) | X_{f(n)}, …, X_n ∼ NIW(μ_n, λ_n, C_n, ν_n), with

λ_n = λ_0 + k, ν_n = ν_0 + k, μ_n = (λ_0 μ_0 + k X̄_n)/(λ_0 + k),
C_n = C_0 + ∑_{i=f(n)}^{n} (X_i − X̄_n)(X_i − X̄_n)^T + (λ_0 k/(λ_0 + k))(X̄_n − μ_0)(X̄_n − μ_0)^T.

The parameters λ_0 and ν_0 quantify the strength of the prior information for the mean and covariance matrix respectively. From the updating rules above, it is possible to interpret these parameters as the number of observations of data represented in the prior. Under this model, the maximum a posteriori (MAP) estimator for the covariance matrix is Σ_n = C_n/(n − f(n) + ν_0 + d + 2). Choosing the prior covariance matrix so that the prior mode is equal to the initial proposal covariance matrix Σ_0 yields C_0 = (ν_0 + d + 1)Σ_0. Finally, to reduce the number of algorithmic parameters and to avoid the need to specify a prior for the mean of the target, μ_0, it is possible to set λ_0 = 0. This leads to

Σ_n = [(n − f(n)) cov(X_{f(n)}, …, X_n) + (ν_0 + d + 1)Σ_0] / (n − f(n) + ν_0 + d + 2),

which is of the form (2) with w_n = (n − f(n))/(n − f(n) + ν_0 + d + 2) and S_n = (ν_0 + d + 1)Σ_0/(n − f(n) + ν_0 + d + 2). These assumptions produce some cancellations in the update rules of Lemma 1, for both f(n) = f(n − 1) and f(n) = f(n − 1) + 1, where n > 1 in both cases. In general, the observations from the MCMC will be neither independent nor distributed according to the multivariate normal distribution. However, the motivation above can still be used to justify the posterior mode as the best available estimate of the covariance matrix of the target.
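The resulting MAP shrinkage estimate is straightforward to compute directly. A sketch under the assumptions above (λ_0 = 0 and C_0 = (ν_0 + d + 1)Σ_0; function and argument names are ours):

```python
import numpy as np

def map_cov_estimate(window, sigma0, nu0):
    """MAP covariance estimate blending the prior guess sigma0 with the
    empirical covariance of the states in the current window (a sketch).

    window : array (k, d) of the states X_{f(n)}, ..., X_n, k = n - f(n) + 1.
    With lambda_0 = 0 and C_0 = (nu0 + d + 1) * sigma0, the posterior mode is
        (C_0 + scatter) / (k + nu0 + d + 1),
    which tends to the empirical covariance as k grows.
    """
    k, d = window.shape
    xbar = window.mean(axis=0)
    scatter = (window - xbar).T @ (window - xbar)   # (k - 1) * empirical cov
    C0 = (nu0 + d + 1) * sigma0
    return (C0 + scatter) / (k + nu0 + d + 1)
```

For small k the estimate stays close to the prior mode σ0, so adaptation can begin from the first iteration without a separate non-adaptive phase.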

Accelerated scaling
In Algorithm 4 of Andrieu & Thoms (2008), the authors suggest changing the global scale of the covariance matrix in order to achieve a target acceptance rate. This modifies the proposal distribution to Y_{n+1} ∼ N(X_n, σ_n² cΣ_n). A good choice (Atchadé & Rosenthal 2005; Andrieu & Thoms 2008; Garthwaite, Fan & Sisson 2016) for adapting the global scale parameter σ_n is to use the stochastic search algorithm known as the Robbins-Monro algorithm (Robbins & Monro 1951). This algorithm can be used to find the solution σ of the equation p(σ) = a for some monotonically decreasing function p of σ, based on the results of some Bernoulli random variables with success probability p(σ). Let θ_n = log(σ_n); then θ_n is updated iteratively via

θ_{n+1} = θ_n + n^{-1}(b_{n+1} − a),

where b_{n+1} = 1 if the proposal at iteration n + 1 is accepted and b_{n+1} = 0 otherwise. When applied to the MHRW algorithm, this adaptive scheme leads to the scale increasing after each accepted proposal and reducing after each rejected proposal, so that the expected increase is zero if and only if the success probability is a. Alternatively, it is possible to remove the randomness introduced by the Bernoulli trials by setting

θ_{n+1} = θ_n + n^{-1}(α(Y_{n+1}|X_n) − a), (5)

where α(Y_{n+1}|X_n) is the Metropolis-Hastings acceptance probability for proposal Y_{n+1}. In Garthwaite, Fan & Sisson (2016) this algorithm is refined for adaptive MCMC. First, the authors derive a suitable step-size constant, expressed in terms of the cdf of a standard normal distribution. Second, the authors introduce a check to prevent the step size diminishing too rapidly and the algorithm being unable to reach the target. If σ_n changes by a factor of 3 from its starting value (or equivalently |θ_n − θ_0| > log(3)) then the algorithm is restarted from its current location. Finally, they begin (or restart) the algorithm at n = round(5/(a(1 − a))) to avoid rapid changes in scale in the early stages, or after a restart. These changes are summarised in Algorithm 1, using a slightly different notation to avoid the confusion of having more than one iteration with index n.
In this revised notation it is clear that the rounding of the starting iteration is unnecessary.
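A single Robbins-Monro scale update with the factor-of-3 restart rule can be sketched as follows. The Garthwaite, Fan & Sisson step-size constant is omitted for brevity, and the variable names are our own:

```python
import math

def rm_scale_update(theta, n, theta0, accept_prob, a=0.234):
    """One Robbins-Monro update of theta = log(sigma) (a sketch of the
    scheme described above; the step-size constant is omitted).

    theta0 is the (possibly re-centred) starting value and n the step
    counter.  Returns the updated (theta, n, theta0).
    """
    theta = theta + (accept_prob - a) / n
    n += 1
    # restart rule: if sigma has drifted a factor of 3 from its start,
    # re-centre and reset the step counter to damp subsequent swings
    if abs(theta - theta0) > math.log(3):
        theta0 = theta
        n = round(5 / (a * (1 - a)))
    return theta, n, theta0
```

Using the acceptance probability in place of the accept/reject indicator, as in (5), removes the extra Monte Carlo noise from the Bernoulli trials.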
As the dimension of the target increases, the optimal acceptance rate has been shown to rapidly approach 0.234 (Gelman, Roberts & Gilks 1996; Roberts & Rosenthal 2001), at least when the target is Gaussian or can be written as a product over its dimensions, that is, π(x) = ∏_{i=1}^{d} f(x_i). For such targets we would expect to achieve the optimal acceptance rate when σ_n = 1, giving the optimal scaling discussed previously. However, if the covariance matrix of the target is underestimated in the early iterations of an adaptive algorithm, the scale can be inflated to keep 'pushing at the boundaries' of the target in order to increase the rate at which the true covariance matrix is estimated. Conversely, if the covariance matrix is overestimated (perhaps due to some outlying points from the burn-in), then reducing the scale of the proposal prevents the acceptance rate becoming too small and the chain getting stuck. Reducing the scale by too much, however, can cause very slow mixing. To prevent this it is sometimes necessary to impose a minimum value for σ_n, for example σ_min = 1, which corresponds to σ_n² cΣ_n = (2.38²/d)Σ_n. If this value is reached, then the algorithm will not achieve the target acceptance rate a. This version of the Robbins-Monro scaling algorithm is referred to here as the accelerated scaling algorithm.

Convergence and ergodicity
This section discusses some conditions for the proposed algorithms to converge to the correct target. Since the accelerated shaping algorithm includes both the AM algorithm (which is ergodic for the target π) and the AP algorithm (which is not always ergodic for π) as special cases, it is clear that further technical conditions must be introduced to imply the correct ergodicity. Finally, the ways in which the algorithm can fail to converge are discussed, which suggests some additional checks for convergence. Roberts & Rosenthal (2007) introduce two conditions that imply convergence and the correct ergodicity for adaptive MCMC algorithms. The authors consider a collection of Markov chain kernels {P_γ}_{γ∈Y} on a state space X, each of which has stationary distribution equal to the target, π. An adaptive algorithm is then given by a sequence of states visited by the Markov chain (X_n)_n along with a sequence of Y-valued random variables (Γ_n)_n that indicate the choice of kernel at iteration n + 1. The first condition, which is needed for the correct ergodicity, is diminishing adaptation:

lim_{n→∞} sup_{x∈X} ‖P_{Γ_{n+1}}(x, ·) − P_{Γ_n}(x, ·)‖ = 0 in probability,

where ‖·‖ denotes the total variation norm. The second condition is containment, which states that the convergence times are bounded in probability; that is, for all ε > 0, the sequence {M_ε(X_n, Γ_n)}_n is bounded in probability, where M_ε(x, γ) = inf{m ≥ 1 : ‖P_γ^m(x, ·) − π(·)‖ ≤ ε}. These conditions are often difficult to verify in practice, and are not satisfied by accelerated shaping or accelerated scaling in general. Previous authors have introduced additional conditions, for example that X is bounded (Haario et al. 2001); that Σ_n cannot shrink to zero (e.g. S_n = εI, Haario et al. 2001); or that each iteration may be drawn from a fixed kernel with small probability (Roberts & Rosenthal 2009) to prevent the chain getting stuck. Although these additional conditions facilitate the proof of ergodicity to the correct target, they make the algorithm less efficient than the optimal non-adaptive algorithm.
These differences can be made small, but some parameters may be difficult for inexperienced practitioners to interpret. For example, if the scales of the parameters are radically different then choosing appropriate values of ε, when adding εI to the proposal covariance matrix, may be challenging. Craiu et al. (2015) prove that, under certain technical conditions, a general adaptive MCMC algorithm converges to the correct target in total variation distance, as long as adaptation only occurs within a compact subset K ⊆ Y, with a fixed and bounded proposal outside of K. This implies that a suitable adaptive algorithm that remains within a compact subset of Y must converge to the correct target. Vihola (2011) considers two adaptive MHRW algorithms, one with adaptive scaling using Robbins-Monro updates and one with adaptive scaling and shaping, that are very close in spirit to the algorithms discussed in this paper. Under either of two combinations of assumptions given in that paper, a strong law of large numbers is proved for bounded functions f on the target for both algorithms; that is,

n^{-1} ∑_{i=1}^{n} f(X_i) → π(f) almost surely as n → ∞.

Although in Vihola (2011) the scaling constant is not restricted to a compact set, it must be bounded away from zero. Furthermore, the covariance matrix of the proposal must be restricted to reside within a compact set, by assuming that adaptation cannot occur unless the eigenvalues of the unscaled proposal covariance matrix remain within [L^{-1}, L] for some L ∈ [1, ∞). Rather than add these additional technical assumptions for the accelerated shaping algorithm and repeat these proofs (which is beyond the scope of this article), it is more important to note that conditions ensuring asymptotic convergence do not guarantee that a finite sample from such an MCMC chain will resemble a sample from the target, as is always the case with MCMC. Ignoring this may give a false sense of security.
It is therefore of more practical benefit to identify ways in which these adaptive algorithms may fail to converge. The following discussion describes three ways in which the accelerated algorithms can fail to converge, along with some suggested methods to identify when this has occurred. First, if the accelerated shaping algorithm gets stuck in a location, then the empirical estimate of the covariance matrix will shrink to the zero matrix, which can lead to Σ_n approaching the zero matrix. This will be obvious from the very low acceptance rate and in a trace plot of the entries of Σ_n. Second, the accelerated scaling algorithm can fail to converge if σ_n tends to zero, which can be identified from a trace plot of log(σ_n). Finally, if the target has very heavy tails, then the scale of the proposal may continue increasing, possibly indefinitely. This can be identified from a trace plot of σ_n. For such targets, the MHRW algorithm is not a good choice and is unlikely to be successful under any kind of adaptation.

Accelerated shaping example
A simple two-dimensional example shows that removing, as well as adding, observations in the estimation of the covariance matrix reduces the time taken to obtain a reasonable estimate when the chain is not started close to the posterior mode. The example target is a multivariate normal with mean (0, 200) and covariance matrix Σ = [50, −40; −40, 50]. This is an elliptical ridge with strong negative correlation. If the chains are started at (0, 0), then the initial gradient sends the chain into the positive quadrant, tricking the covariance estimate into having positive correlation. Once a chain reaches the crest of the ridge, it must change direction and follow the ridge-line up to the summit, going against its fledgling correlation estimate. Having reached the summit, the chain must then forget the burn-in and learn the true covariance matrix.
We compared the performance of the AM algorithm (Haario et al. 2001) with the accelerated shaping algorithm described in Section 2. Estimates of the entries of the proposal covariance matrix are shown in Figure 1. Both algorithms used Σ_0 = I; the AM algorithm used n_0 = 100 and ε = 0.01, while the accelerated shaping algorithm used ν_0 = 100. The accelerated shaping algorithm of Section 2.2 is seen to converge towards the true Σ much more rapidly, and the trace reaches the posterior mode approximately twice as fast. In this toy two-dimensional example, the chains were started a long way from the mode to emphasise the difference between the two algorithms. This was intended to mimic more realistic problems in higher dimensions, in which a starting location close to the mode becomes hard to identify a priori.

Global scaling example
For multivariate normal targets, or targets that can be written in product form (i.e. π(x) = ∏_{i=1}^{d} f(x_i)), the optimal scale of the proposal is achieved with a scaling constant of 2.38²/d, which yields an acceptance rate of 0.234. But if the target does not satisfy the required conditions, then an algorithm can be tuned to either one of these at the expense of the other. But which criterion should be aimed for?
Consider the 'banana-shaped' target (Haario, Saksman & Tamminen 1999; Roberts & Rosenthal 2009) with 'bananicity constant' B = 0.1 and dimension d = 2. Robbins-Monro adaptive scaling algorithms with a range of target acceptance rates were compared with a non-adaptive algorithm with σ_n = 1. In all cases the covariance matrix was assumed known, that is, Σ_n = Σ for all n. For the adaptive algorithms, the scaling constant σ_n was started from σ_1 = 1 and updated via (5), recalling that θ_n = log(σ_n). For this example, σ_min was set to 0, to illustrate the effect that shrinking the proposal below σ_n = 1 has on the mixing. All chains were started at the posterior mode and run for 2 × 10^5 iterations. Table 1 shows the mean squared jumping distance (MSJD), the mean Euclidean jumping distance (MEJD) and minimum effective sample size (ESS) for each algorithm (for definitions see Sherlock, Fearnhead & Roberts 2010). The ESS was calculated using the function effectiveSize from the R package coda (Plummer et al. 2006). Table 1 shows that σ_n < 1 when the target acceptance rate was a = 0.234. Approximately the largest MSJD (which equates to the best mixing) and ESS were produced by setting σ_n = 1. However, targeting 0.234 gave approximately the best MEJD (distance travelled). As an aside, the optimal acceptance rate would be higher than 0.234 if the conditions of the theorem held, since this target has just two dimensions. A secondary observation from Table 1 is that the Robbins-Monro algorithm is generally accurate in achieving the desired acceptance rate.
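The jumping-distance diagnostics reported in Table 1 can be computed from a stored chain as follows (a sketch of the usual definitions; names are ours):

```python
import numpy as np

def jump_diagnostics(chain):
    """Mean squared (MSJD) and mean Euclidean (MEJD) jumping distances
    for a chain of shape (n, d) (a sketch of the usual definitions)."""
    steps = np.diff(chain, axis=0)          # successive displacements
    sq = np.sum(steps ** 2, axis=1)         # squared jump lengths
    return sq.mean(), np.sqrt(sq).mean()
```

Rejected proposals contribute zero-length jumps, so both quantities penalise a chain that gets stuck.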
This simple example showed that shrinking the proposal to achieve the 'optimal' acceptance rate actually made the mixing worse. However, there are still possible advantages of rescaling a multivariate proposal; for example, maximising the distance travelled during the burn-in may help an adaptive algorithm to estimate the covariance matrix of the target more rapidly. This will be especially true when the covariance matrix is underestimated in at least some dimensions. Shrinking the proposal also increases the acceptance rate and prevents the chain from getting stuck. In light of this example, in future the Robbins-Monro algorithm will be prevented from shrinking the proposal below cΣ_n = (2.38²/d)Σ_n. This can easily be achieved by setting σ_min = 1 and replacing (5) with

θ_{n+1} = max{log(σ_min), θ_n + n^{-1}(α(Y_{n+1}|X_n) − a)}

in the Robbins-Monro update. This version of the Robbins-Monro algorithm will be referred to as the accelerated scaling algorithm.

Table 1. Mean squared jumping distance (MSJD), mean Euclidean jumping distance (MEJD) and effective sample size (ESS) for 2 × 10^5 iterations of the accelerated scaling algorithm with two-dimensional banana-shaped target with B = 0.1. Normal proposals using the true covariance matrix were scaled by σ_n using the Robbins-Monro adaptive scaling algorithm in order to target an acceptance rate of a. Observed acceptance rates are given by â and the mean value of σ_n is given by σ̄.

Choosing the forgetting sequence, f(n)
The final aspect of Algorithm 1 that needs to be determined is an appropriate choice of the forgetting sequence, f(n). To determine this, consider another example from Roberts & Rosenthal (2009) with a higher dimensional target, where there is a great deal more learning to be done in the proposal covariance matrix. Let M be a d × d matrix with entries drawn from independent standard normal distributions, and then form Σ = MM^T. The target is then the d-dimensional multivariate normal with mean zero and covariance matrix Σ.
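Constructing one such random target is a one-liner (a sketch; names are ours):

```python
import numpy as np

def random_gaussian_target(d, rng):
    """Covariance of the random d-dimensional Gaussian target used above:
    Sigma = M @ M.T with M having i.i.d. standard normal entries, which
    is symmetric and (almost surely) positive definite."""
    M = rng.normal(size=(d, d))
    return M @ M.T
```

Averaging results over many such draws, as in the experiment below, avoids tuning to one particular correlation structure.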
The first phase of an adaptive MCMC algorithm is the transient phase, where the chain travels towards the high posterior mass of the target distribution. Once the states from the transient phase have been removed from the estimate of the covariance matrix there is no need to remove further observations, so this example concentrates on finding the best sequence f(n) for forgetting the transient phase. Three functional forms f_1, f_2 and f_3 for f(n) were considered, for 0 < b < 1, with f_1 linear in n. Here, the constants have been chosen to ensure f_i(2) ≤ 1 while maximising the impact of varying b ∈ (0, 1). Figure 2 shows the results from the three forgetting sequences as a function of the scaling constant b. For comparison, the AM algorithm is also shown. Chains were started at X_0 = (5, 0, …, 0), which is away from the posterior mode at the origin, and run for 10,000 iterations. The results shown are averages over 100 randomly generated Gaussian targets with dimension d = 20. The results show that the linear function f_1 achieves the lowest mean square error for values of b around 0.4; however, this does not produce the largest MSJD. The MSJD is highest for f_3 with b close to one. All three functions outperform the AM algorithm in both measures, except for f_1 with b close to one. The difference between the AM algorithm and the alternatives with b close to zero is explained by the improved use of the initial iterations, as described in Section 2.2. In conclusion, the performance was not very sensitive to the choice of b in the interval (0.2, 0.5). Generally b should be chosen to be as small as possible so that, once the observations from the transient phase have been removed, as many of the subsequent observations as possible contribute to the estimate of the covariance matrix. Unfortunately it is impossible to say a priori when the transient phase will end, but if it is expected to be over by iteration B, then an improved forgetting function would be f(n) = min{B, ⌊bn⌋}.
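The linear forgetting sequence, and the improved transient-aware variant, can be sketched as follows. We read the cap as a minimum, so that forgetting stops once the transient (assumed over by iteration B) has been discarded; the other two functional forms compared in Figure 2 are not reproduced here:

```python
import math

def f_linear(n, b):
    """The linear forgetting sequence f1(n) = floor(b * n)."""
    return math.floor(b * n)

def f_transient(n, b, B):
    """Forget at rate b but never beyond iteration B, so that the
    transient phase is discarded and nothing more (our sketch of the
    improved forgetting function)."""
    return min(B, math.floor(b * n))
```

With 0 < b < 1 both sequences satisfy the requirements on f(n): they start at f(1) = 0 and increase by at most one per iteration.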

High-dimensional example
Finally, consider a high-dimensional Gaussian target with d = 100, as in Section 3.3. Four adaptive random walk algorithms will be compared: the AM algorithm, the accelerated shaping algorithm from Section 2.1, the accelerated scaling algorithm (the Robbins-Monro algorithm with σ_min = 1) and the combined accelerated shaping and scaling algorithm described in Algorithm 1. The algorithm parameters were ε = 0.01 and n_0 = 100 for AM; ν_0 = 100 and f(n) = ⌊0.3n⌋ for accelerated shaping; a = 0.234 for accelerated scaling; and Σ_0 = I for all algorithms. Each chain was run for 5 × 10^5 iterations starting, away from the posterior mode, at (5, 0, …, 0). Figure 3 shows the mean squared error in the estimate of the covariance matrix, the MSJD and the scaling factor σ_n as a function of iteration for each of the four algorithms. Unsurprisingly, it takes the AM algorithm a considerable number of iterations to learn the d(d + 1)/2 = 5050 parameters of the covariance matrix. The accelerated scaling algorithm estimated the covariance matrix more rapidly than AM because it made larger jumps and was able to explore the target more effectively. The accelerated shaping algorithm produced a better covariance estimate than AM as it used the early iterations more effectively and gradually removed the initial transient phase from the covariance estimate. Using both accelerated shaping and scaling together produced the lowest mean squared error in the estimate of the covariance matrix and led to the highest MSJD. As the estimate of the covariance matrix became more accurate, the scaling factor σ_n naturally adapted towards the optimal value of one.
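Putting the pieces together, a compact sketch of the combined accelerated shaping and scaling MHRW might look as follows. This is our own illustrative implementation under the stated parameter choices, not the authors' Algorithm 1; the Robbins-Monro restart rule and step-size constant are omitted for brevity, and all names are ours:

```python
import numpy as np

def accelerated_mhrw(log_target, x0, n_iter, rng,
                     nu0=100, b=0.3, a=0.234, sigma_min=1.0):
    """Sketch of a combined accelerated shaping and scaling MHRW.

    Shaping: the proposal covariance is the MAP blend of the prior guess
    (the identity) with the empirical covariance of the states from
    iteration f(n) = floor(b * n) onwards.
    Scaling: a Robbins-Monro update of log(sigma) targets acceptance
    rate a, with the scale floored at sigma_min.
    """
    d = len(x0)
    x = np.asarray(x0, dtype=float)
    lp = log_target(x)
    chain = [x.copy()]
    theta = 0.0                                   # log proposal scale
    rm_n = round(5 / (a * (1 - a)))               # Robbins-Monro counter
    c = 2.38 ** 2 / d
    for n in range(1, n_iter + 1):
        window = np.asarray(chain[int(np.floor(b * n)):])
        k = len(window)
        xbar = window.mean(axis=0)
        scatter = (window - xbar).T @ (window - xbar)
        C0 = (nu0 + d + 1) * np.eye(d)            # prior mode = identity
        cov = (C0 + scatter) / (k + nu0 + d + 1)  # MAP shaping estimate
        sigma = max(np.exp(theta), sigma_min)
        y = rng.multivariate_normal(x, sigma ** 2 * c * cov)
        lpy = log_target(y)
        alpha = min(1.0, np.exp(lpy - lp))        # MH acceptance probability
        if rng.random() < alpha:
            x, lp = y, lpy
        chain.append(x.copy())
        # accelerated scaling update, floored at log(sigma_min)
        theta = max(np.log(sigma_min), theta + (alpha - a) / rm_n)
        rm_n += 1
    return np.asarray(chain)
```

For clarity the covariance is recomputed from the window each iteration; in practice the rank-one updates of Lemma 1 would be used instead.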

Discussion and conclusion
This numerical study has explored two ideas for increasing the rate of adaptation during the early stages of an adaptive MHRW. First, the shaping algorithm of Haario et al. (2001) was modified to adapt smoothly between the initial covariance matrix Σ_0 and the current estimate Σ_n, and to remove early outlying locations from the estimate of the covariance matrix at a rate slower than new observations arrive. This was shown to greatly improve the rate of convergence to the true covariance matrix. Although this modification is not required if the chain is initialised at the posterior mode, there are circumstances in which the mode can be challenging to obtain, such as when the gradients of the target are not available. Second, the study explored an approach to scaling the proposal to achieve a target acceptance rate via the Robbins-Monro algorithm. Although shrinking the proposal turned out to be deleterious in a multivariate banana-shaped example, the accelerated scaling approach was easily modified to prevent shrinking from occurring. The accelerated scaling and shaping approaches together were shown to increase the rate at which the covariance matrix of the target was learned in a high-dimensional example.
The adaptive MHRW algorithm has an enduring popularity despite the availability of more sophisticated alternatives, largely due to the simplicity of its implementation, its wide applicability and its robustness to misspecification of algorithmic parameters such as the initial covariance matrix Σ_0. The modifications described here have been shown to improve the learning rate of the adaptation in unimodal targets, at a negligible cost in terms of increased complexity and difficulty of implementation.