Particle Markov chain Monte Carlo methods


Address for correspondence: Arnaud Doucet, Department of Statistics, University of British Columbia, 333–6356 Agricultural Road, Vancouver, British Columbia, V6T 1Z2, Canada.


Summary.  Markov chain Monte Carlo and sequential Monte Carlo methods have emerged as the two main tools to sample from high dimensional probability distributions. Although asymptotic convergence of Markov chain Monte Carlo algorithms is ensured under weak assumptions, the performance of these algorithms is unreliable when the proposal distributions that are used to explore the space are poorly chosen and/or if highly correlated variables are updated independently. We show here how it is possible to build efficient high dimensional proposal distributions by using sequential Monte Carlo methods. This allows us not only to improve over standard Markov chain Monte Carlo schemes but also to make Bayesian inference feasible for a large class of statistical models where this was not previously so. We demonstrate these algorithms on a non-linear state space model and a Lévy-driven stochastic volatility model.

1. Introduction

Monte Carlo methods have become one of the standard tools of the statistician's apparatus and among other things have allowed the Bayesian paradigm to be routinely applied to ever more sophisticated models. However, expectations are constantly rising and such methods are now expected to deal with high dimensionality and complex patterns of dependence in statistical models. In this paper we propose a novel addition to the Monte Carlo toolbox named particle Markov chain Monte Carlo (PMCMC) methods. They rely on a non-trivial and non-standard combination of MCMC and sequential Monte Carlo (SMC) methods which takes advantage of the strength of its two components. Several algorithms combining MCMC and SMC approaches have already been proposed in the literature. In particular, MCMC kernels have been used to build proposal distributions for SMC algorithms (Gilks and Berzuini, 2001). Our approach is entirely different as we use SMC algorithms to design efficient high dimensional proposal distributions for MCMC algorithms. As we shall see, our framework is particularly suitable for inference in state space models (SSMs) but extends far beyond this application of choice and allows us to push further the boundaries of the class of problems that can be routinely addressed by using MCMC methods.

To be more specific, the successful design of most practical Monte Carlo algorithms to sample from a target distribution, say π, in scenarios involving both high dimension and complex patterns of dependence relies on the appropriate choice of proposal distributions. As a rule of thumb, to lead to efficient algorithms, such distributions should both be easy to sample from and capture some of the important characteristics of π, such as its scale or dependence structure. Whereas the design of such efficient proposal distributions is often feasible in small dimensions, this proves to be much more difficult in larger scenarios. The classical solution that is exploited by both MCMC and SMC methods, albeit in differing ways, consists of breaking up the original sampling problem into smaller and simpler sampling problems by focusing on some of the subcomponents of π. This results in an easier design of proposal distributions. This relative ease of implementation comes at a price, however, as such local strategies inevitably ignore some of the global features of the target distribution π, resulting in potentially poor performance. The art of designing Monte Carlo algorithms mainly resides in the adoption of an adequate trade-off between simplicity of implementation and the often difficult incorporation of important characteristics of the target distribution. Our novel approach exploits differing strengths of MCMC and SMC algorithms, which allow us to design efficient and flexible MCMC algorithms for important classes of statistical models, while typically requiring limited design effort on the user's part. This is illustrated later in the paper (Section 3) where, even using standard off-the-shelf components, our methodology allows us straightforwardly to develop efficient MCMC algorithms for important models for which no satisfactory solution is currently available.

The rest of the paper is organized as follows. Section 2 is entirely dedicated to inference in SSMs. This class of models is ubiquitous in applied science and lends itself particularly well to the exposition of our methodology. We show that PMCMC algorithms can be thought of as natural approximations to standard and ‘idealized’ MCMC algorithms which cannot be implemented in practice. This section is entirely descriptive both for pedagogical purposes and to demonstrate the conceptual and implementational simplicity of the resulting algorithms. In Section 3, we demonstrate the efficiency of our methodology on a non-linear SSM and a Lévy-driven stochastic volatility model. We first show that PMCMC sampling allows us to perform Bayesian inference simply in non-linear non-Gaussian scenarios where standard MCMC methods can fail. Second, we demonstrate that it is an effective method in situations where using the prior distribution of the underlying latent process as the proposal distribution is the only known practical possibility. In Section 4 we provide a simple and complete formal justification for the validity and properties of PMCMC algorithms. Key to our results is the realization that such seemingly approximate algorithms sample from an artificial distribution which admits our initial target distribution of interest as one of its components. The framework that is considered is somewhat more abstract and general than that for SSMs but has the advantage of applicability far beyond this class of models. In Section 5 we discuss connections to previous work and potential extensions.

2. Inference in state space models

In this section we first introduce notation and describe the standard inference problems that are associated with SSMs. Given the central role of SMC sampling in the PMCMC methodology, we then focus on their description when applied to inference in SSMs. For pedagogical purposes we consider in this section one of the simplest possible implementations—standard improvements are discussed in Section 2.5. The strengths and limitations of SMC methods are subsequently briefly discussed and we then move on to describe standard MCMC strategies for inference in SSMs. Again we briefly discuss their strengths and weaknesses and then show how our novel methodology can address the same inference problems, albeit in a potentially more efficient way. No justification for the validity of the algorithms presented is provided here—this is postponed to Section 4.

2.1. State space models

Further on, we use the standard convention whereby capital letters denote random variables, whereas lower case letters are used for their values. Consider the following SSM, which is also known as a hidden Markov model. In this context, a hidden Markov state process {Xn; n ≥ 1} is characterized by its initial density X1 ∼ μθ(·) and transition probability density

$$X_n \mid (X_{n-1} = x_{n-1}) \sim f_\theta(\cdot \mid x_{n-1}) \tag{1}$$

for some static parameter θ ∈ Θ which may be multidimensional. The process {Xn} is observed, not directly, but through another process {Yn; n ≥ 1}. The observations are assumed to be conditionally independent given {Xn}, and their common marginal probability density is of the form gθ(y|x); i.e., for 1 ≤ n ≤ m,

$$Y_n \mid (X_1 = x_1, \ldots, X_n = x_n, \ldots, X_m = x_m) \sim g_\theta(\cdot \mid x_n). \tag{2}$$

Hereafter for any generic sequence {zn} we shall use zi:j to denote (zi,zi+1,…,zj).

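Our aim is to perform Bayesian inference in this context, conditional on some observations y1:T for some T ≥ 1. When θ ∈ Θ is a known parameter, Bayesian inference relies on the posterior density pθ(x1:T|y1:T) ∝ pθ(x1:T,y1:T) where

$$p_\theta(x_{1:T}, y_{1:T}) = \mu_\theta(x_1) \prod_{n=2}^{T} f_\theta(x_n \mid x_{n-1}) \prod_{n=1}^{T} g_\theta(y_n \mid x_n). \tag{3}$$

If θ is unknown, we ascribe a prior density p(θ) to θ and Bayesian inference relies on the joint density

$$p(\theta, x_{1:T} \mid y_{1:T}) \propto p_\theta(x_{1:T}, y_{1:T})\, p(\theta). \tag{4}$$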

For non-linear non-Gaussian models, p(x1:T|y1:T) and p(θ,x1:T|y1:T) do not usually admit closed form expressions, making inference difficult in practice. It is therefore necessary to resort to approximations. Monte Carlo methods have been shown to provide a flexible framework to carry out inference in such models. It is impossible to provide a thorough review of the area here and instead we briefly review the underlying principles of MCMC and SMC methods for SSMs at a level that is sufficient to understand our novel methodology.

2.2. Sequential Monte Carlo algorithm for state space models

In the SSM context, SMC methods are a class of algorithms to approximate sequentially the sequence of posterior densities {p(x1:n|y1:n); n ≥ 1} as well as the sequence of marginal likelihoods {p(y1:n); n ≥ 1} for a given θ ∈ Θ. More precisely such methods aim to approximate first p(x1|y1) and p(y1), then p(x1:2|y1:2) and p(y1:2) and so on. In the context of SMC methods, the posterior distributions that are associated with such densities are approximated by a set of N weighted random samples called particles, leading for any n ≥ 1 to the approximation

$$\hat{p}_\theta(\mathrm{d}x_{1:n} \mid y_{1:n}) := \sum_{k=1}^{N} W_n^k\, \delta_{X_{1:n}^k}(\mathrm{d}x_{1:n}), \tag{5}$$

where $W_n^k$ is a so-called importance weight associated with particle $X_{1:n}^k$. We now briefly describe how such sample-based approximations can be propagated efficiently in time.

2.2.1. A sequential Monte Carlo algorithm

The simplest SMC algorithm propagates the particles $\{X_{1:n}^k\}$ and updates the weights $\{W_n^k\}$ as follows. At time 1 of the procedure, importance sampling (IS) is used to approximate p(x1|y1) by using an importance density q(x1|y1). In effect, N particles $\{X_1^k\}$ are generated from q(x1|y1) and ascribed importance weights $\{W_1^k\}$ which take into account the discrepancy between the two densities. To produce N particles approximately distributed according to p(x1|y1) we sample N times from the IS approximation $\hat{p}_\theta(\mathrm{d}x_1 \mid y_1)$ of p(x1|y1); this is known as the resampling step. At time 2 we aim to use IS to approximate p(x1:2|y1:2). The identity

$$p_\theta(x_{1:2} \mid y_{1:2}) \propto p_\theta(x_1 \mid y_1)\, f_\theta(x_2 \mid x_1)\, g_\theta(y_2 \mid x_2)$$

suggests reusing the samples obtained at time 1 as a source of samples approximately distributed according to p(x1|y1) and extending each such particle through an IS density q(x2|y2,x1) to produce samples approximately distributed according to p(x1|y1) q(x2|y2,x1). Again importance weights $\{W_2^k\}$ need to be computed since our target is p(x1:2|y1:2) and a resampling step produces samples approximately distributed according to p(x1:2|y1:2). This procedure is then repeated until time T. The resampling procedure of the SMC algorithm prevents an accumulation of errors by eliminating unpromising samples: this can be both demonstrated practically and quantified theoretically (see Section 4.1 for a discussion).

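Pseudocode of the SMC algorithm that was outlined above is provided below. To alleviate the notational burden we adopt below the convention that whenever the index k is used we mean ‘for all k ∈ {1,…,N}’, and we also omit the dependence of the importance weights on θ; we shall do so in the remainder of the paper when confusion is not possible. We also use the notation $W_n := (W_n^1, \ldots, W_n^N)$ for the normalized importance weights at time n and $\mathcal{F}(\cdot \mid p)$ for the discrete probability distribution on {1,…,m} of parameter p=(p1,…,pm) with pk ≥ 0 and $\sum_{k=1}^{m} p_k = 1$, for some $m \in \mathbb{N}$.

  • Step 1: at time n=1,

  • (a) sample $X_1^k \sim q_\theta(\cdot \mid y_1)$ and
  • (b) compute and normalize the weights
    $$w_1(X_1^k) := \frac{p_\theta(X_1^k, y_1)}{q_\theta(X_1^k \mid y_1)} = \frac{\mu_\theta(X_1^k)\, g_\theta(y_1 \mid X_1^k)}{q_\theta(X_1^k \mid y_1)}, \qquad W_1^k := \frac{w_1(X_1^k)}{\sum_{m=1}^{N} w_1(X_1^m)}. \tag{6}$$
  • Step 2: at times n=2,…,T,

  • (a) sample $A_{n-1}^k \sim \mathcal{F}(\cdot \mid W_{n-1})$,
  • (b) sample $X_n^k \sim q_\theta(\cdot \mid y_n, X_{n-1}^{A_{n-1}^k})$ and set $X_{1:n}^k := (X_{1:n-1}^{A_{n-1}^k}, X_n^k)$, and
  • (c) compute and normalize the weights
    $$w_n(X_{1:n}^k) := \frac{p_\theta(X_{1:n}^k, y_{1:n})}{p_\theta(X_{1:n-1}^{A_{n-1}^k}, y_{1:n-1})\, q_\theta(X_n^k \mid y_n, X_{n-1}^{A_{n-1}^k})} = \frac{f_\theta(X_n^k \mid X_{n-1}^{A_{n-1}^k})\, g_\theta(y_n \mid X_n^k)}{q_\theta(X_n^k \mid y_n, X_{n-1}^{A_{n-1}^k})}, \qquad W_n^k := \frac{w_n(X_{1:n}^k)}{\sum_{m=1}^{N} w_n(X_{1:n}^m)}. \tag{7}$$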

In this description, the variable $A_{n-1}^k$ represents the index of the ‘parent’ at time n−1 of particle $X_{1:n}^k$ for n=2,…,T. The standard multinomial resampling procedure is thus here interpreted as being the operation by which offspring particles at time n choose their ancestor particles at time n−1, according to the distribution

$$r(A_{n-1} \mid W_{n-1}) = \prod_{k=1}^{N} \mathcal{F}(A_{n-1}^k \mid W_{n-1}),$$

where, for any n=2,…,T, $A_{n-1} := (A_{n-1}^1, \ldots, A_{n-1}^N)$. The introduction of these variables allows us to keep track of the ‘genealogy’ of particles and is necessary to describe one of the algorithms that is introduced later (see Section 2.4.3). For this purpose, for k=1,…,N and n=1,…,T we introduce $B_n^k$, the index which the ancestor particle of $X_{1:T}^k$ at generation n had at that time. More formally for k=1,…,N we define $B_T^k := k$ and for n=T−1,…,1 we have the backward recursive relation $B_n^k := A_n^{B_{n+1}^k}$. As a result for any k=1,…,N we have the identity $X_{1:T}^k = (X_1^{B_1^k}, X_2^{B_2^k}, \ldots, X_{T-1}^{B_{T-1}^k}, X_T^{B_T^k})$ and $B_{1:T}^k = (B_1^k, B_2^k, \ldots, B_{T-1}^k, B_T^k = k)$ is the ancestral ‘lineage’ of a particle. This is illustrated in Fig. 1.

Figure 1.

 Example of ancestral lineages generated by an SMC algorithm for N=5 and T=3: the lighter path is $X_{1:3}^2 = (X_1^3, X_2^4, X_3^2)$ and its ancestral lineage is $B_{1:3}^2 = (3, 4, 2)$
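
The backward recursion that defines an ancestral lineage can be sketched in a few lines of Python. This is an illustration only: the function name `ancestral_lineage` and the ancestor table `A` are hypothetical, and the table was chosen so that particle 2 has lineage (3,4,2) as in the caption above; the remaining entries are not taken from the figure. Indices are stored 1-based to match the text.

```python
def ancestral_lineage(ancestors, k):
    """Return the lineage (B_1^k, ..., B_T^k) of particle k, using 1-based indices.

    ancestors[n - 1][j - 1] holds A_n^j, the index at time n of the parent
    of particle j at time n + 1, for n = 1, ..., T - 1.
    """
    T = len(ancestors) + 1
    lineage = [0] * T
    lineage[T - 1] = k  # B_T^k = k
    for n in range(T - 1, 0, -1):  # n = T-1, ..., 1
        # B_n^k = A_n^{B_{n+1}^k}
        lineage[n - 1] = ancestors[n - 1][lineage[n] - 1]
    return lineage


# Hypothetical ancestor indices for N = 5 particles and T = 3 time steps.
A = [
    [1, 1, 2, 3, 5],  # A_1^1, ..., A_1^5
    [2, 4, 4, 5, 5],  # A_2^1, ..., A_2^5
]
lineage_of_2 = ancestral_lineage(A, 2)
```

With this table, `lineage_of_2` is `[3, 4, 2]`: particle 2 at time 3 descends from particle 4 at time 2, which itself descends from particle 3 at time 1.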

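This procedure provides us at time T with an approximation of the joint posterior density p(x1:T|y1:T) given by

$$\hat{p}_\theta(\mathrm{d}x_{1:T} \mid y_{1:T}) := \sum_{k=1}^{N} W_T^k\, \delta_{X_{1:T}^k}(\mathrm{d}x_{1:T}), \tag{8}$$

from which approximate samples from p(x1:T|y1:T) can be obtained by simply drawing an index from the discrete distribution $\mathcal{F}(\cdot \mid W_T)$. This is one of the key properties exploited by the PMCMC algorithms. In addition we shall also use the fact that this SMC algorithm provides us with an estimate of the marginal likelihood p(y1:T) given by

$$\hat{p}_\theta(y_{1:T}) := \hat{p}_\theta(y_1) \prod_{n=2}^{T} \hat{p}_\theta(y_n \mid y_{1:n-1}), \tag{9}$$

where

$$\hat{p}_\theta(y_n \mid y_{1:n-1}) = \frac{1}{N} \sum_{k=1}^{N} w_n(X_{1:n}^k)$$

is an estimate computed at time n of

$$p_\theta(y_n \mid y_{1:n-1}) = \int w_n(x_{1:n})\, q_\theta(x_n \mid y_n, x_{n-1})\, p_\theta(x_{1:n-1} \mid y_{1:n-1})\, \mathrm{d}x_{1:n}.$$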

It follows from equation (7) that wn(x1:n) only depends on x1:n through xn−1:n. We have omitted the dependence on N in equations (5), (8) and (9), and will do so in the remainder of this section when confusion is not possible.
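
As an illustration, the SMC recursion and the marginal likelihood estimate of equation (9) can be sketched in Python with the standard library only. The sketch rests on assumptions that are not part of the text: a toy linear Gaussian model (X1 ∼ N(0, sd_x²), Xn = phi·Xn−1 + noise, Yn = Xn + noise), the prior used as importance density, and the hypothetical names `bootstrap_smc` and `sample_path`.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log-density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def bootstrap_smc(ys, n_particles, rng, phi=0.9, sd_x=1.0, sd_y=1.0):
    """SMC with the prior as importance density and multinomial resampling.

    Assumed toy model (not from the paper):
        X_1 ~ N(0, sd_x^2), X_n = phi * X_{n-1} + N(0, sd_x^2), Y_n = X_n + N(0, sd_y^2).
    Returns the particles xs[n][k], the ancestor indices, the final normalized
    weights and the log of the marginal likelihood estimate.
    """
    xs, ancestors, weights, log_lik = [], [], None, 0.0
    for n, y in enumerate(ys):
        if n == 0:
            x_new = [rng.gauss(0.0, sd_x) for _ in range(n_particles)]
        else:
            # Resampling: each offspring chooses its parent according to F(. | W_{n-1}).
            parents = rng.choices(range(n_particles), weights=weights, k=n_particles)
            ancestors.append(parents)
            x_new = [rng.gauss(phi * xs[-1][a], sd_x) for a in parents]
        xs.append(x_new)
        # With the prior as importance density the weight reduces to g(y_n | x_n).
        log_w = [normal_logpdf(y, x, sd_y) for x in x_new]
        shift = max(log_w)  # subtract the maximum before exponentiating, for stability
        w = [math.exp(lw - shift) for lw in log_w]
        total = sum(w)
        log_lik += shift + math.log(total / n_particles)  # log of (1/N) * sum_k w_n
        weights = [v / total for v in w]
    return xs, ancestors, weights, log_lik


def sample_path(xs, ancestors, weights, rng):
    """Draw an index from F(. | W_T) and trace its ancestral lineage back to time 1."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    path = [0.0] * len(xs)
    for n in range(len(xs) - 1, -1, -1):
        path[n] = xs[n][k]
        if n > 0:
            k = ancestors[n - 1][k]
    return path
```

The estimate is accumulated on the log scale because the product in equation (9) underflows quickly as T grows. For a single observation the toy model gives p(y1) in closed form as an N(0, sd_x² + sd_y²) density, which offers a simple sanity check of the estimator.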

2.2.2. Design issues and limitations

This algorithm requires us to specify q(x1|y1) and {q(xn|yn,xn−1);n=2,…,T}. Guidelines on how best to select {q(xn|yn,xn−1)} are well known. With p(xn|yn,xn−1) ∝ f(xn|xn−1) g(yn|xn), it is usually recommended to set q(xn|yn,xn−1)=p(xn|yn,xn−1) whenever possible and to select q(xn|yn,xn−1) as close as possible to p(xn|yn,xn−1) otherwise; see for example Carpenter et al. (1999), Cappé et al. (2005), Doucet and Johansen (2009), Liu (2001) and Pitt and Shephard (1999). It is often much simpler to design these ‘local’ importance densities than to design a global importance density approximating p(x1:T|y1:T). An ‘extreme’ case, which was originally suggested in Gordon et al. (1993), consists of using the prior density of the latent Markov process {Xn; n ≥ 1} as an importance density; i.e. set q(x1|y1)=μ(x1) and q(xn|yn,xn−1)=f(xn|xn−1). In scenarios where the observations are not too informative and the dimension of the latent variable not too large, this default strategy can lead to satisfactory performance. It is in fact the only possible practical choice for models where f(xn|xn−1) is intractable or too expensive to evaluate pointwise, but easy to sample from; see Ionides et al. (2006) for many examples.

Note that SMC methods also suffer from well-known drawbacks. Indeed, when T is too large, the SMC approximation to the joint density p(x1:T|y1:T) deteriorates as components sampled at any time n<T are not rejuvenated at subsequent time steps. As a result, when T − n is too large the approximation to the marginal p(xn|y1:T) is likely to be rather poor as the successive resampling steps deplete the number of distinct particle co-ordinates xn. This is the main reason behind the well-known difficulty of approximating p(θ,x1:T|y1:T) with SMC algorithms; see Andrieu et al. (1999), Fearnhead (2002) and Storvik (2002), for example. We shall see in what follows that, in spite of its reliance on SMC methods as one of its components, PMCMC sampling is much more robust and less likely to suffer from this depletion problem. This stems from the fact that PMCMC methods do not require SMC algorithms to provide a reliable approximation of p(x1:T|y1:T), but only to return a single sample approximately distributed according to p(x1:T|y1:T).

2.3. Standard Markov chain Monte Carlo methods

A popular choice to sample from p(θ,x1:T|y1:T) with MCMC methods consists of alternately updating the state components x1:T conditional on θ and θ conditional on x1:T. Sampling from p(θ|y1:T,x1:T) is often feasible and we do not discuss this here. Sampling exactly from p(x1:T|y1:T) is possible for two scenarios only: linear Gaussian models and finite state space hidden Markov models (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994). Beyond these particular cases the design of proposal densities is required. A standard practice consists of dividing the T components of x1:T in, say, adjacent blocks of length K and updating each of these blocks in turn. For example we can update xn:n+K−1 according to an MCMC step of invariant density

$$p_\theta(x_{n:n+K-1} \mid y_{n:n+K-1}, x_{n-1}, x_{n+K}) \propto \prod_{k=n}^{n+K} f_\theta(x_k \mid x_{k-1}) \prod_{k=n}^{n+K-1} g_\theta(y_k \mid x_k). \tag{10}$$

When K is not too large it might be possible to design efficient proposal densities which can be used in a Metropolis–Hastings (MH) update; see Shephard and Pitt (1997) for a generic Gaussian approximation of equation (10) for SSMs with a linear Gaussian prior density f(xk|xk−1) and a log-concave density g(yk|xk). However, as K increases building ‘good’ approximations of equation (10) is typically impossible. This limits the size K of the blocks xn:n+K−1 of variables which can be simultaneously updated and can be a serious drawback in practice as this will slow down the exploration of the support of p(x1:T|y1:T) when its dependence structure is strong.

These difficulties are exacerbated in models where f(xk|xk−1) does not admit an analytical expression but can be sampled from; see for example Ionides et al. (2006). In such scenarios updating all the components of x1:T simultaneously by using the joint prior distribution as a proposal is the only known strategy. However, the performance of this approach tends to deteriorate rapidly as T increases since the information that is provided by the observations is completely ignored by the proposal.

2.4. Particle Markov chain Monte Carlo methods for state space models

In what follows we shall refer to MCMC algorithms targeting the distribution p(θ,x1:T|y1:T) which rely on sampling exactly from p(x1:T|y1:T) as ‘idealized’ algorithms. Such algorithms are mostly purely conceptual since they typically cannot be implemented but in many situations are algorithms that we would like to approximate. In the light of Sections 2.2 and 2.3, a natural idea consists of approximating these idealized algorithms by using the output of an SMC algorithm targeting p(x1:T|y1:T) using N ≥ 1 particles as a proposal distribution for an MH update. Intuitively this could allow us to approximate with arbitrary precision such idealized algorithms while only requiring the design of low dimensional proposals for the SMC algorithm. A direct implementation of this idea is impossible as the marginal density of a particle that is generated by an SMC algorithm is not available analytically but would be required for the calculation of the MH acceptance ratio. The novel MCMC updates that are presented in this section, PMCMC updates, circumvent this problem by considering target distributions on an extended space which includes all the random variables that are produced by the SMC algorithm; this is detailed in Section 4 and is not required to understand the implementation of such updates.

The key feature of PMCMC algorithms is that they are in fact ‘exact approximations’ to idealized MCMC algorithms targeting either p(x1:T|y1:T) or p(θ,x1:T|y1:T) in the sense that for any fixed number N ≥ 1 of particles their transition kernels leave the target density of interest invariant. Further they can be interpreted as standard MCMC updates and will lead to convergent algorithms under mild standard assumptions (see Section 4 for details).

We first introduce in Section 2.4.1 the particle independent Metropolis–Hastings (PIMH) update, an exact approximation to a standard independent Metropolis–Hastings (IMH) update targeting p(x1:T|y1:T), which uses SMC approximations of p(x1:T|y1:T) as a proposal. We emphasize at this point that we do not believe that the resulting PIMH sampler that is presented below is on its own a serious competitor to standard SMC approximations to p(x1:T|y1:T). However, as is the case with standard IMH-type updates, the PIMH update might be of interest when used in combination with other MCMC transitions. In Section 2.4.2, we describe the particle marginal Metropolis–Hastings (PMMH) algorithm which can be thought of as an exact approximation of a ‘marginal Metropolis–Hastings’ (MMH) update targeting directly the marginal density p(θ|y1:T) of p(θ,x1:T|y1:T). Finally, in Section 2.4.3 we present a particle approximation to a Gibbs sampler targeting p(θ,x1:T|y1:T), called hereafter the particle Gibbs (PG) algorithm.

2.4.1. Particle independent Metropolis–Hastings sampler

A standard IMH update leaving p(x1:T|y1:T) invariant requires us to choose a proposal density q(x1:T|y1:T) to propose candidates $X_{1:T}^*$ which, given a current state X1:T, are accepted with probability

$$1 \wedge \frac{p_\theta(X_{1:T}^* \mid y_{1:T})}{p_\theta(X_{1:T} \mid y_{1:T})}\, \frac{q_\theta(X_{1:T} \mid y_{1:T})}{q_\theta(X_{1:T}^* \mid y_{1:T})},$$

where a ∧ b := min{a, b}. The optimal choice for q(x1:T|y1:T) is q(x1:T|y1:T)=p(x1:T|y1:T), but in practice this ideal choice is impossible in most scenarios. Our discussion of SMC methods suggests exploring the idea of using the SMC approximation of p(x1:T|y1:T) as a proposal density, i.e. draw our proposed sample from equation (8). As indicated earlier, sampling $X_{1:T}^*$ from equation (8) is straightforward given a realization of the weighted samples $\{W_T^k, X_{1:T}^k;\ k = 1, \ldots, N\}$, but computing the acceptance probability above requires the expression for the marginal distribution of $X_{1:T}^*$, which turns out to be intractable. Indeed this distribution is given by

$$q_\theta(\mathrm{d}x_{1:T} \mid y_{1:T}) = E\{\hat{p}_\theta(\mathrm{d}x_{1:T} \mid y_{1:T})\},$$

where the expectation is here with respect to all the random variables generated by the SMC algorithm to sample the random probability measure $\hat{p}_\theta(\mathrm{d}x_{1:T} \mid y_{1:T})$ in equation (8). Although q(dx1:T|y1:T) does not admit a simple analytical expression, it naturally suggests the use of the standard ‘auxiliary variables trick’ by embedding the sampling from p(x1:T|y1:T) into that of sampling from an appropriate distribution defined on an extended space including all the random variables underpinning the expectation above. The resulting PIMH sampler can be shown to take the following extremely simple form, with $\hat{p}_\theta(y_{1:T})$ as in equation (9).

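  • Step 1: initialization, i=0: run an SMC algorithm targeting p(x1:T|y1:T), sample $X_{1:T}(0) \sim \hat{p}_\theta(\cdot \mid y_{1:T})$ and let $\hat{p}_\theta(y_{1:T})(0)$ denote the corresponding marginal likelihood estimate.

  • Step 2: for iteration i ≥ 1,

  • (a) run an SMC algorithm targeting p(x1:T|y1:T), sample $X_{1:T}^* \sim \hat{p}_\theta(\cdot \mid y_{1:T})$ and let $\hat{p}_\theta(y_{1:T})^*$ denote the corresponding marginal likelihood estimate, and
  • (b) with probability
    $$1 \wedge \frac{\hat{p}_\theta(y_{1:T})^*}{\hat{p}_\theta(y_{1:T})(i-1)} \tag{11}$$
    set $X_{1:T}(i) = X_{1:T}^*$ and $\hat{p}_\theta(y_{1:T})(i) = \hat{p}_\theta(y_{1:T})^*$; otherwise set X1:T(i)=X1:T(i−1) and $\hat{p}_\theta(y_{1:T})(i) = \hat{p}_\theta(y_{1:T})(i-1)$.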

Theorem 2 in Section 4.2 establishes that the PIMH update leaves p(x1:T|y1:T) invariant and theorem 3 establishes that under weak assumptions the PIMH sampler is ergodic. Note in addition that, as expected, the acceptance probability in equation (11) converges to 1 when N→∞ since both $\hat{p}_\theta(y_{1:T})^*$ and $\hat{p}_\theta(y_{1:T})(i-1)$ are, again under mild assumptions, consistent estimates of p(y1:T).
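
A minimal Python sketch of the PIMH sampler is given below, under assumptions that are not part of the text: a toy linear Gaussian model, the prior as SMC importance density, multinomial resampling, and the hypothetical function names `smc` and `pimh`. Each SMC run returns one path drawn from the particle approximation together with the log of the marginal likelihood estimate, which is all that the acceptance probability (11) requires.

```python
import math
import random


def smc(ys, n_particles, rng, phi=0.9, sd_x=1.0, sd_y=1.0):
    """Bootstrap SMC for an assumed toy model (AR(1) state observed in Gaussian noise).

    Returns (one path drawn from the particle approximation,
             log of the marginal likelihood estimate).
    """
    paths = [[rng.gauss(0.0, sd_x)] for _ in range(n_particles)]
    log_lik, w = 0.0, None
    for n, y in enumerate(ys):
        if n > 0:
            # Multinomial resampling followed by propagation from the prior.
            parents = rng.choices(range(n_particles), weights=w, k=n_particles)
            paths = [paths[a] + [rng.gauss(phi * paths[a][-1], sd_x)] for a in parents]
        log_w = [-0.5 * math.log(2.0 * math.pi * sd_y ** 2)
                 - 0.5 * ((y - p[-1]) / sd_y) ** 2 for p in paths]
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
        log_lik += shift + math.log(sum(w) / n_particles)
    k = rng.choices(range(n_particles), weights=w, k=1)[0]
    return paths[k], log_lik


def pimh(ys, n_particles, n_iters, seed=0):
    """Particle independent Metropolis-Hastings; returns (chain of paths, acceptance rate)."""
    rng = random.Random(seed)
    path, log_lik = smc(ys, n_particles, rng)          # Step 1
    chain, accepted = [path], 0
    for _ in range(n_iters):                           # Step 2
        path_star, log_lik_star = smc(ys, n_particles, rng)
        # Accept with probability 1 ^ (estimate* / current estimate), on the log scale.
        if rng.random() < math.exp(min(0.0, log_lik_star - log_lik)):
            path, log_lik = path_star, log_lik_star
            accepted += 1
        chain.append(path)
    return chain, accepted / n_iters
```

Storing the likelihood estimate alongside the current path matters: on rejection the old estimate is reused rather than recomputed, as in step 2(b).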

2.4.2. Particle marginal Metropolis–Hastings sampler

Consider now the scenario where we are interested in sampling from p(θ,x1:T|y1:T) defined in equation (4). We focus here on an approach which jointly updates θ and x1:T. Assume for the time being that sampling from the conditional density pθ(x1:T|y1:T) for any θ ∈ Θ is feasible and recall the standard decomposition p(θ,x1:T|y1:T) = p(θ|y1:T) pθ(x1:T|y1:T). In such situations it is natural to suggest the following form of proposal density for an MH update:

$$q\{(\theta^*, x_{1:T}^*) \mid (\theta, x_{1:T})\} = q(\theta^* \mid \theta)\, p_{\theta^*}(x_{1:T}^* \mid y_{1:T}),$$

for which the proposed $X_{1:T}^*$ is perfectly ‘adapted’ to the proposed θ*, and the only degree of freedom of the algorithm (which will affect its performance) is q(θ*|θ). The resulting MH acceptance ratio is given by

$$\frac{p(\theta^*, x_{1:T}^* \mid y_{1:T})\, q\{(\theta, x_{1:T}) \mid (\theta^*, x_{1:T}^*)\}}{p(\theta, x_{1:T} \mid y_{1:T})\, q\{(\theta^*, x_{1:T}^*) \mid (\theta, x_{1:T})\}} = \frac{p_{\theta^*}(y_{1:T})\, p(\theta^*)\, q(\theta \mid \theta^*)}{p_\theta(y_{1:T})\, p(\theta)\, q(\theta^* \mid \theta)}. \tag{12}$$

The expression for this ratio suggests that the algorithm effectively targets the marginal density p(θ|y1:T) ∝ pθ(y1:T) p(θ), justifying the MMH terminology. This idea has also been exploited in Andrieu and Roberts (2009) and Beaumont (2003) for example and might be appealing since the difficult problem of sampling from p(θ,x1:T|y1:T) is reduced to that of sampling from p(θ|y1:T), which is typically defined on a much smaller space. It is natural to propose a particle approximation to the MMH update where, whenever a sample from pθ(x1:T|y1:T) and the expression for the marginal likelihood pθ(y1:T) are needed, their SMC approximation counterparts are used instead in the PMMH update. The resulting PMMH sampler is as follows (note the change of indexing notation for $\hat{p}_\theta(y_{1:T})$ compared with the PIMH case).

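  • Step 1: initialization, i=0,

  • (a) set θ(0) arbitrarily and
  • (b) run an SMC algorithm targeting pθ(0)(x1:T|y1:T), sample $X_{1:T}(0) \sim \hat{p}_{\theta(0)}(\cdot \mid y_{1:T})$ and let $\hat{p}_{\theta(0)}(y_{1:T})$ denote the marginal likelihood estimate.
  • Step 2: for iteration i ≥ 1,

  • (a) sample θ* ∼ q{·|θ(i−1)},
  • (b) run an SMC algorithm targeting pθ*(x1:T|y1:T), sample $X_{1:T}^* \sim \hat{p}_{\theta^*}(\cdot \mid y_{1:T})$ and let $\hat{p}_{\theta^*}(y_{1:T})$ denote the marginal likelihood estimate, and
  • (c) with probability
    $$1 \wedge \frac{\hat{p}_{\theta^*}(y_{1:T})\, p(\theta^*)\, q\{\theta(i-1) \mid \theta^*\}}{\hat{p}_{\theta(i-1)}(y_{1:T})\, p\{\theta(i-1)\}\, q\{\theta^* \mid \theta(i-1)\}} \tag{13}$$
    set θ(i)=θ*, $X_{1:T}(i) = X_{1:T}^*$ and $\hat{p}_{\theta(i)}(y_{1:T}) = \hat{p}_{\theta^*}(y_{1:T})$; otherwise set θ(i)=θ(i−1), X1:T(i)=X1:T(i−1) and $\hat{p}_{\theta(i)}(y_{1:T}) = \hat{p}_{\theta(i-1)}(y_{1:T})$.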

Theorem 4 in Section 4.4 establishes that the PMMH update leaves p(θ,x1:T|y1:T) invariant and that under weak assumptions the PMMH sampler is ergodic. Also note that under mild assumptions given in Section 4.1 the acceptance probability (13) converges to equation (12) as N→∞.
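
The PMMH sampler can be sketched in Python along the following lines. Everything specific here is an assumption made for illustration: the toy model, the choice θ = phi (the autoregressive coefficient), a Uniform(−1, 1) prior, a Gaussian random-walk proposal, and the hypothetical names `smc_log_lik` and `pmmh`. With a flat prior and a symmetric proposal, the prior and proposal terms of the acceptance probability (13) cancel inside the prior support.

```python
import math
import random


def smc_log_lik(ys, phi, n_particles, rng, sd_x=1.0, sd_y=1.0):
    """Bootstrap SMC for the assumed toy model; returns (sampled path, log-likelihood estimate)."""
    paths = [[rng.gauss(0.0, sd_x)] for _ in range(n_particles)]
    log_lik, w = 0.0, None
    for n, y in enumerate(ys):
        if n > 0:
            parents = rng.choices(range(n_particles), weights=w, k=n_particles)
            paths = [paths[a] + [rng.gauss(phi * paths[a][-1], sd_x)] for a in parents]
        log_w = [-0.5 * math.log(2.0 * math.pi * sd_y ** 2)
                 - 0.5 * ((y - p[-1]) / sd_y) ** 2 for p in paths]
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
        log_lik += shift + math.log(sum(w) / n_particles)
    k = rng.choices(range(n_particles), weights=w, k=1)[0]
    return paths[k], log_lik


def pmmh(ys, n_particles, n_iters, step=0.1, seed=0):
    """PMMH for theta = phi; returns (chain of (theta, path) pairs, acceptance rate)."""
    rng = random.Random(seed)
    theta = 0.0                                            # Step 1(a): arbitrary start
    path, log_lik = smc_log_lik(ys, theta, n_particles, rng)   # Step 1(b)
    chain, accepted = [(theta, path)], 0
    for _ in range(n_iters):
        theta_star = theta + rng.gauss(0.0, step)          # Step 2(a)
        # Outside the prior support p(theta*) = 0, so the proposal is always rejected.
        if -1.0 < theta_star < 1.0:
            path_star, log_lik_star = smc_log_lik(ys, theta_star, n_particles, rng)  # Step 2(b)
            # Step 2(c): flat prior and symmetric proposal leave the ratio of estimates.
            if rng.random() < math.exp(min(0.0, log_lik_star - log_lik)):
                theta, path, log_lik = theta_star, path_star, log_lik_star
                accepted += 1
        chain.append((theta, path))
    return chain, accepted / n_iters
```

For a non-uniform prior or an asymmetric proposal, the log prior and log proposal densities would have to be added to the log ratio before the comparison.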

2.4.3. Particle Gibbs sampler

An alternative to the MMH algorithm to sample from p(θ,x1:T|y1:T) consists of using the Gibbs sampler which samples iteratively from p(θ|y1:T,x1:T) and pθ(x1:T|y1:T). It is often possible to sample from p(θ|y1:T,x1:T) and thus the potentially tedious design of a proposal density for θ that is necessary in the MMH update can be bypassed. Again, sampling from pθ(x1:T|y1:T) is typically impossible and we investigate the possibility of using a particle approximation to this algorithm. Clearly the naive particle approximation to the Gibbs sampler where sampling from pθ(x1:T|y1:T) is replaced by sampling from an SMC approximation $\hat{p}_\theta(\mathrm{d}x_{1:T} \mid y_{1:T})$ does not admit p(θ,x1:T|y1:T) as invariant density.

A valid particle approximation to the Gibbs sampler requires the use of a special type of PMCMC update called the conditional SMC update. This update is similar to a standard SMC algorithm but is such that a prespecified path X1:T with ancestral lineage B1:T is ensured to survive all the resampling steps, whereas the remaining N−1 particles are generated as usual. The algorithm is as follows.

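  • Step 1: let $X_{1:T} = (X_1^{B_1}, X_2^{B_2}, \ldots, X_{T-1}^{B_{T-1}}, X_T^{B_T})$ be a path that is associated with the ancestral lineage B1:T.

  • Step 2: for n=1,

  • (a) for k ≠ B1, sample $X_1^k \sim q_\theta(\cdot \mid y_1)$ and
  • (b) compute $w_1(X_1^k)$ by using equation (6) and normalize the weights $W_1^k \propto w_1(X_1^k)$.
  • Step 3: for n=2,…,T,

  • (a) for k ≠ Bn, sample $A_{n-1}^k \sim \mathcal{F}(\cdot \mid W_{n-1})$,
  • (b) for k ≠ Bn, sample $X_n^k \sim q_\theta(\cdot \mid y_n, X_{n-1}^{A_{n-1}^k})$ and
  • (c) compute $w_n(X_{1:n}^k)$ by using equation (7) and normalize the weights $W_n^k \propto w_n(X_{1:n}^k)$.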
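
The conditional SMC update can be sketched in Python as follows. The toy linear Gaussian model, the use of the prior as importance density and the names `conditional_smc` and `trace_path` are assumptions made for this illustration. Indices are 0-based, so a lineage written (3,4,2) in the text becomes `[2, 3, 1]` here.

```python
import math
import random


def conditional_smc(ys, ref_path, lineage, n_particles, rng, phi=0.9, sd_x=1.0, sd_y=1.0):
    """Conditional SMC with the prior as importance density (assumed toy model).

    ref_path[n] is kept at particle index lineage[n] at every time step, with its
    ancestor frozen to lineage[n - 1]; the other N - 1 particles are resampled
    and propagated as in a standard SMC algorithm.
    """
    T, N = len(ys), n_particles
    xs = [[0.0] * N for _ in range(T)]
    ancestors = [[0] * N for _ in range(T - 1)]
    w = None
    for n in range(T):
        for k in range(N):
            if k == lineage[n]:
                xs[n][k] = ref_path[n]                    # retained particle
                if n > 0:
                    ancestors[n - 1][k] = lineage[n - 1]  # retained ancestor
            elif n == 0:
                xs[n][k] = rng.gauss(0.0, sd_x)
            else:
                # w still holds the normalized weights of time n - 1 at this point.
                a = rng.choices(range(N), weights=w, k=1)[0]
                ancestors[n - 1][k] = a
                xs[n][k] = rng.gauss(phi * xs[n - 1][a], sd_x)
        # Additive constants of the log-density cancel after normalization.
        log_w = [-0.5 * ((ys[n] - x) / sd_y) ** 2 for x in xs[n]]
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
        total = sum(w)
        w = [v / total for v in w]
    return xs, ancestors, w


def trace_path(xs, ancestors, k):
    """Rebuild the path of particle k at the final time from the ancestor indices."""
    path = [0.0] * len(xs)
    for n in range(len(xs) - 1, -1, -1):
        path[n] = xs[n][k]
        if n > 0:
            k = ancestors[n - 1][k]
    return path
```

By construction, tracing the particle with index `lineage[-1]` at the final time returns `ref_path` unchanged, which is the sense in which the prespecified path survives all the resampling steps.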

For further clarity we illustrate this update on a toy example. Fig. 1 displays ancestral lineages that were generated by a standard SMC method in a situation where N=5 and T=3. Consider $X_{1:3}^2 = (X_1^3, X_2^4, X_3^2)$ whose ancestral lineage is $B_{1:3}^2 = (3, 4, 2)$. A conditional SMC update leaving $X_{1:3}^2$ (the lighter path in Fig. 1) identical generates four new paths consistent with both $X_{1:3}^2$ and $B_{1:3}^2$. One could, for example, obtain the set of new paths that is presented in Fig. 2.

Figure 2.

 Example of N−1=4 ancestral lineages generated by a conditional SMC algorithm for N=5 and T=3 conditional on $X_{1:3}^2$ and $B_{1:3}^2$

A computationally more efficient way to implement the conditional SMC update is presented in Appendix A—this is, however, not required to present our particle version of the Gibbs sampler, the PG sampler, as follows.

  • Step 1: initialization, i=0: set θ(0), X1:T(0) and B1:T(0) arbitrarily.

  • Step 2: for iteration i ≥ 1,

  • (a) sample θ(i) ∼ p{·|y1:T,X1:T(i−1)},
  • (b) run a conditional SMC algorithm targeting pθ(i)(x1:T|y1:T) conditional on X1:T(i−1) and B1:T(i−1), and
  • (c) sample $X_{1:T}(i) \sim \hat{p}_{\theta(i)}(\cdot \mid y_{1:T})$ (and hence B1:T(i) is also implicitly sampled).

In theorem 5 in Section 4.5, it is shown that this algorithm admits p(θ,x1:T|y1:T) as invariant density and is ergodic under mild assumptions.
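
A compact Python sketch of the PG sampler follows. It is an illustration under several assumptions that do not come from the text: a toy model in which θ is the observation noise variance with an inverse gamma IG(a, b) prior (so that p(θ|y1:T,x1:T) is again inverse gamma), the prior as importance density, and a conditional SMC step that always stores the retained path in particle slot 0, a common implementation choice that relies on the particle labels being exchangeable under multinomial resampling. The names `csmc` and `particle_gibbs` are hypothetical.

```python
import math
import random


def csmc(ys, ref, n_particles, sd_y, rng, phi=0.9, sd_x=1.0):
    """Conditional SMC keeping the reference path `ref` in slot 0; returns one sampled path."""
    paths = [[ref[0]]] + [[rng.gauss(0.0, sd_x)] for _ in range(n_particles - 1)]
    w = None
    for n, y in enumerate(ys):
        if n > 0:
            # Only N - 1 particles are resampled and propagated; slot 0 is retained.
            parents = rng.choices(range(n_particles), weights=w, k=n_particles - 1)
            moved = [paths[a] + [rng.gauss(phi * paths[a][-1], sd_x)] for a in parents]
            paths = [paths[0] + [ref[n]]] + moved
        # Additive constants of the log-density cancel after normalization.
        log_w = [-0.5 * ((y - p[-1]) / sd_y) ** 2 for p in paths]
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
    return paths[rng.choices(range(n_particles), weights=w, k=1)[0]]


def particle_gibbs(ys, n_particles, n_iters, a=2.0, b=2.0, seed=0):
    """Particle Gibbs for theta = observation variance; returns a list of (theta, path)."""
    rng = random.Random(seed)
    path = list(ys)  # Step 1: arbitrary initial path
    chain = []
    for _ in range(n_iters):
        # Step 2(a): conjugate update, theta | y, x ~ IG(a + T/2, b + sum (y - x)^2 / 2).
        rate = b + 0.5 * sum((y - x) ** 2 for y, x in zip(ys, path))
        var_y = 1.0 / rng.gammavariate(a + 0.5 * len(ys), 1.0 / rate)
        # Steps 2(b)-(c): conditional SMC given the current path, then draw a new path.
        path = csmc(ys, path, n_particles, math.sqrt(var_y), rng)
        chain.append((var_y, path))
    return chain
```

An inverse gamma draw is obtained as the reciprocal of a gamma draw; `random.gammavariate` takes a shape and a scale, hence the `1.0 / rate` argument.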

2.5. Improvements and extensions

2.5.1. Advanced particle filtering and sequential Monte Carlo techniques

For ease of presentation, we have limited our discussion in this section to one of the simplest implementations of SMC algorithms. However, over the past 15 years numerous more sophisticated algorithms have been proposed in the literature to improve on such a basic scheme; see Cappé et al. (2005) or Doucet and Johansen (2009) for recent reviews. Such techniques essentially fall into two categories:

  • (a) techniques aiming at reducing the variance that is introduced by the resampling step of the SMC algorithm such as the popular residual and stratified resampling procedures (see Liu (2001), chapter 3, and Kitagawa (1996)) and
  • (b) techniques aiming at fighting the so-called degeneracy phenomenon which include, among others, the auxiliary particle filter (Pitt and Shephard, 1999) or the resample–move algorithm (Gilks and Berzuini, 2001).

Popular advanced resampling schemes can be used within the PMCMC framework—more details on the technical conditions that are required by such schemes are given in Section 4.1. Roughly speaking these conditions require some form of exchangeability of the particles. Most known advanced SMC techniques falling into category (b) will also lead to valid PMCMC algorithms. Such valid techniques can in fact be easily identified in practice but this requires us to consider the more general PMCMC framework that is developed in Section 4.1.
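
As an example of a scheme in category (a), stratified resampling can be sketched in Python as below. The function name `stratified_resample` is hypothetical and the sketch returns as many indices as there are weights. One uniform variate is drawn in each of the N strata [k/N, (k+1)/N) and mapped through the inverse of the cumulative weight function.

```python
import random


def stratified_resample(weights, rng=random):
    """Return N ancestor indices (0-based) drawn by stratified resampling.

    weights need not be normalized; N is the number of weights.
    """
    n = len(weights)
    total = sum(weights)
    indices = []
    j, cum = 0, weights[0] / total  # cum = cumulative normalized weight up to index j
    for k in range(n):
        u = (k + rng.random()) / n  # one uniform draw inside the k-th stratum
        while u > cum and j < n - 1:
            j += 1
            cum += weights[j] / total
        indices.append(j)
    return indices
```

Because the uniforms are increasing across strata, the returned indices are sorted, and a particle with normalized weight w receives a number of offspring that stays close to N·w, which is the source of the variance reduction relative to multinomial resampling.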

2.5.2. Using all the particles

A possible criticism of the PMCMC updates is that they require the generation of N particles at each iteration of the MCMC algorithm to propose a single sample. It is shown in theorem 6 in Section 4.6 that it is possible to reuse all the particles that are generated in the PIMH, the PMMH and PG samplers to compute estimates of conditional expectations with respect to p(x1:T|y1:T) and p(θ,x1:T|y1:T).

3. Applications

We provide here two applications of PMCMC methods. The first model that we consider is a popular toy non-linear SSM (Gordon et al., 1993; Kitagawa, 1996). The second model is a Lévy-driven stochastic volatility model (Barndorff-Nielsen and Shephard, 2001a; Creal, 2008; Gander and Stephens, 2007).

3.1. A non-linear state space model

Consider the SSM


where inline image, inline image and inline image; here $\mathcal{N}(m, \sigma^2)$ denotes the Gaussian distribution of mean m and variance σ² and IID stands for independent and identically distributed. We set θ=(σV,σW). This example is often used in the literature to assess the performance of SMC methods. The posterior density p(x1:T|y1:T) for this non-linear model is highly multimodal as there is uncertainty about the sign of the state Xn which is only observed through its square.
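
For readers who wish to experiment, the following Python sketch simulates data from the form in which this benchmark commonly appears in the literature. All constants are assumptions of this sketch and are not read off the text above: the transition x/2 + 25x/(1+x²) + 8cos(1.2n), the observation x²/20, and an N(0, 5) initial state. The name `simulate_benchmark` is hypothetical.

```python
import math
import random


def simulate_benchmark(T, sd_v, sd_w, rng, var_x1=5.0):
    """Simulate (x_{1:T}, y_{1:T}) from an assumed form of the toy non-linear SSM.

    Assumed form (constants are assumptions of this sketch):
        X_1 ~ N(0, var_x1)
        X_n = X_{n-1}/2 + 25 X_{n-1}/(1 + X_{n-1}^2) + 8 cos(1.2 n) + V_n,  V_n ~ N(0, sd_v^2)
        Y_n = X_n^2 / 20 + W_n,                                            W_n ~ N(0, sd_w^2)
    """
    xs, ys = [], []
    for n in range(1, T + 1):
        if n == 1:
            x = rng.gauss(0.0, math.sqrt(var_x1))
        else:
            prev = xs[-1]
            x = (prev / 2.0 + 25.0 * prev / (1.0 + prev * prev)
                 + 8.0 * math.cos(1.2 * n) + rng.gauss(0.0, sd_v))
        xs.append(x)
        # The state enters the observation only through its square.
        ys.append(x * x / 20.0 + rng.gauss(0.0, sd_w))
    return xs, ys
```

Since the observation depends on the state only through its square, the sign of each state is not identified by the corresponding observation alone, which is what makes the posterior multimodal.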

We generated two sets of observations y1:100 according to model (14)–(15) with inline image, and inline image and inline image. We display in Fig. 3 the average acceptance rate of the PIMH algorithm when sampling from p(x1:T|y1:T) as a function of T and N. This was computed using 50000 iterations of the PIMH sampler. We used the most basic resampling scheme, i.e. the multinomial resampling that was described in Section 2.2.1. We also used the simplest possible proposal for SMC sampling, i.e. q(x1)=μ(x1) and q(xn|yn,xn−1)=f(xn|xn−1) for n=2,…,T. The acceptance probabilities are higher when inline image than when inline image and inline image. This is to be expected as in this latter scenario the observations are more informative and our SMC algorithm only samples particles from a rather diffuse prior. Better performance could be obtained by using an approximation of p(xn|yn,xn−1) based on local linearization as a proposal distribution q(xn|yn,xn−1) (Cappé et al. (2005), page 230), and a more sophisticated resampling scheme. However, our aim here is to show that even this off-the-shelf choice can provide satisfactory results in difficult scenarios.

Figure 3.

 Average acceptance rate of the PIMH sampler as a function of N and T for (a) inline image and inline image and (b) inline image and inline image; |, T=10; ×, T=25; *, T=50; □, T=100

Determining a sensible trade-off between the average acceptance rate of the PIMH update and the number of particles seems to be difficult. Indeed, whereas a high expected acceptance probability is theoretically desirable in the present case, this does not take into account the computational complexity. For example, in the scenario where T=100 and inline image, we have an average acceptance rate of 0.80 for N=2000 whereas it is equal to 0.27 for N=200, resulting in a Markov chain which still mixes well. Given that the SMC proposal for N=2000 is approximately 10 times more computationally expensive than for N=200, it might seem appropriate to use N=200 and to run more MCMC iterations.
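The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below uses the two acceptance rates quoted above and assumes, as a crude proxy only, that the cost of one PIMH iteration is proportional to N. Accepted moves per unit cost is not a measure of statistical efficiency, so this is indicative at best.

```python
# Figures quoted above for T = 100: number of particles -> average acceptance rate.
acceptance = {200: 0.27, 2000: 0.80}

# Assumption: cost per iteration is proportional to N, so the number of
# accepted moves per unit of computation is proportional to rate / N.
accepted_per_cost = {N: rate / N for N, rate in acceptance.items()}

# Relative advantage of N = 200 over N = 2000 under this proxy (about 3.4).
ratio = accepted_per_cost[200] / accepted_per_cost[2000]
```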

When θ is unknown we set the prior inline image and inline image where inline image is the inverse gamma distribution and a=b=0.01. We simulated T=500 observations with inline image and inline image. To sample from p(θ,x1:T|y1:T), we used the PMMH sampler and the PG sampler using for the SMC proposal the prior and stratified resampling with N=5000 particles. The PMMH sampler uses a normal random-walk proposal with a diagonal covariance matrix. The standard deviation was equal to 0.15 for σV and 0.08 for σW. We also compared these algorithms with a standard algorithm where we update the state variables X1:T one at a time by using an MH step of invariant distribution p(xn|yn,xn−1,xn+1) and proposal density f(xn|xn−1). In the one at a time algorithm, we updated the state variables N times at each iteration before updating θ. Hence all the algorithms have approximately the same computational complexity. All the simulations that are presented here are initialized by using inline image. We ran the algorithms for 50000 iterations with a burn-in of 10000 iterations. In Fig. 4, we display the estimates of the marginal posterior densities for σV and σW, a scatter plot of the sampled values inline image and the trace plots that are associated with these two parameters.

Figure 4.

 Approximations of p(σV|y1:T) and p(σW|y1:T), scatter plots and trace plots after burn-in of simulated values for (a) the MH one at a time update, (b) the PG sampler and (c) the PMMH sampler: inline image, true values on the histograms; inline image, true values on the trace plots; □, true values on the scatter plots

For this data set the MH one at a time update appears to mix well as the auto-correlation functions (ACFs) for the parameters (σV,σW) (which are not shown here) decrease to zero reasonably fast. However, this algorithm tends to become trapped in a local mode of the multimodal posterior distribution. This occurred on most runs when using initializations from the prior for X1:T and results in an overestimation of the true value of σV. Using the same initial values, the PMMH and the PG samplers never became trapped in this local mode. In practice, we can obviously combine both strategies by only occasionally updating the state variables with a PG update to avoid such traps while using more standard and cheaper updates for a large proportion of the computational time.

We present in Fig. 5 the ACF for (σV,σW) for the PG and PMMH samplers and various numbers of particles N. Clearly the performance improves as N increases. In this scenario, it appears necessary to use at least 2000 particles to make the ACF drop sharply, whereas increasing N beyond 5000 does not improve performance, i.e. for N>5000 we observe that the ACFs (which are not presented here) are very similar to those for N=5000 and probably very close to that of the corresponding idealized MMH algorithm.
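For readers who wish to reproduce this kind of diagnostic, a minimal sketch of the usual sample ACF estimator is given below. The function name `sample_acf` is hypothetical, and the estimator (with the biased 1/n normalization) is a standard choice rather than necessarily the one used to produce Fig. 5.

```python
def sample_acf(chain, max_lag):
    """Sample autocorrelation of a scalar chain at lags 0..max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    return [
        sum(centred[t] * centred[t + k] for t in range(n - k)) / (n * var)
        for k in range(max_lag + 1)
    ]


# A perfectly alternating chain has strongly negative lag-1 autocorrelation.
acf = sample_acf([1.0, -1.0] * 50, 2)
```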

Figure 5.

 ACF of the parameters (a), (b) σV and (c), (d) σW for (a), (c) the PG sampler and (b), (d) the PMMH sampler: inline image, 1000 particles; inline image, 2000 particles; inline image, 5000 particles

3.2. Lévy-driven stochastic volatility model

The second model that we discuss is a Lévy-driven stochastic volatility model. These models were recently introduced in Barndorff-Nielsen and Shephard (2001a) and have become extremely popular in financial econometrics; see for example Creal (2008), Frühwirth-Schnatter and Sögner (2008), Gander and Stephens (2007) and Roberts et al. (2004). However, performing inference for Lévy-driven stochastic volatility models is a challenging task. We demonstrate here that PMCMC methods can be useful in this context. The model can be described as follows. The logarithm of an asset price y*(t) is assumed to be determined by the stochastic differential equation


where μ is the drift parameter, β the risk premium and B(t) is a Brownian motion. The instantaneous latent variance or volatility σ²(t) is assumed to be stationary and independent from B(t). It is modelled by the Lévy-driven Ornstein–Uhlenbeck process


where λ>0 and z(t) is a purely non-Gaussian Lévy process with positive increments and z(0)=0. We define the integrated volatility


Let Δ denote the length of time between two periods of interest; then the increments of the integrated volatility satisfy






Here ‘inline image’ means ‘equal in distribution’. By aggregating returns over a time interval of length Δ, we have


thus, conditional on the volatility, we obtain


Many publications have restricted themselves to the case where σ²(t) follows marginally a gamma distribution, in which case the stochastic integrals appearing in equation (17) are finite sums. Even in this case, sophisticated MCMC schemes need to be developed to perform Bayesian inference (Frühwirth-Schnatter and Sögner, 2008; Roberts et al., 2004). However, it is argued in Gander and Stephens (2007) that

‘the use of the gamma marginal model appears to be motivated by computational tractability, rather than by any theoretical or empirical reasoning’.

We address here the case where σ2(t) follows a tempered stable marginal distribution inline image. This is a flexible class of distributions which includes inverse Gaussian distributions for inline image. In this case, it is shown in Barndorff-Nielsen and Shephard (2001b) that


where A0=2^κ δκ/Γ(1−κ) and inline image. In equation (18), {ai}, {ei} and {vi} are independent of one another. The {ei} are IID exponential with mean 1/B and the {vi} are standard uniform, whereas a1<a2<… are arrival times of a Poisson process of intensity 1. It is also established in Barndorff-Nielsen and Shephard (2001b) that z(t) is the sum of an infinite activity Lévy process and of a compound Poisson process such that


where A=2^κ δκ²/Γ(1−κ). In equation (19), {ai}, {ei}, {ri}, inline image and {vi} are independent of one another. The {ai}, {ei} and {vi} follow the same distributions as in equation (18), the {ci} are IID inline image where inline image is the gamma distribution and {ri} and inline image are standard uniform. Finally N(λΔ) is a Poisson random variable of mean λΔδγκ.

Performing inference in this context is difficult as the transition prior of the latent process Xn:=(σ2(nΔ),z(λnΔ)) cannot be expressed analytically. It is actually not even possible to sample exactly from this prior as equations (18) and (19) involve infinite sums. However, it was shown experimentally in Barndorff-Nielsen and Shephard (2001b) that these sums are dominated by the first few terms, ‘although as κ goes to one this becomes less sharp’. Further on, we truncate the infinite sums in equations (18) and (19) to their first 100 terms to obtain a ‘truncated’ prior. We found that increasing the number of terms did not have any effect on our results. In Creal (2008), an SMC method is proposed to sample from p(x1:T|y1:T) which uses the truncated prior as proposal density, all the hyperparameters θ of the model being assumed known. We propose here to use the PMMH algorithm to sample from p(θ,x1:T|y1:T) where θ=(κ,δ,γ,λ) and we set μ=β=0 as in Creal (2008). Our SMC method to sample from p(x1:T|y1:T) is similar to Creal (2008) and simply uses the truncated prior as a proposal. We do not know of any realistic alternative in the present context. Indeed, if the truncated prior was not used, it follows from equation (19) that a proposal density on a space of dimension more than 400 would have to be designed.

We first simulate T=400 data from the model with Δ=1 and (κ,δ,γ,λ)=(0.50,1.41,2.83,0.10). We assigned the following independent priors (Gander and Stephens, 2007): inline image, inline image, inline image and inline image. Here inline image denotes the beta distribution. We used a normal random-walk MH proposal to update the parameters jointly, the covariance of the proposal being the estimated covariance of the target distribution which was obtained in a preliminary run. It is also possible to use an adaptive MCMC strategy to determine this covariance on the fly (Andrieu and Thoms (2008), section 5.1). The results for N=200 are displayed in Fig. 6. Using N=200 might appear too small for T=400 but it is sufficient in this scenario as the observations are weakly informative. For example, the posterior for κ is almost identical to the prior. We checked that indeed the likelihood function for this data set is extremely flat in κ. We also ran the PMMH algorithm for N=50,100,200 to monitor the ACF for the four parameters and to check that the ACFs decrease reasonably fast for N=200.

Figure 6.

 Lévy-driven stochastic volatility model for synthetic data: (a) histogram approximations of posterior densities p(κ|y1:T), p(δ|y1:T) and p(λ|y1:T)(inline image, prior) and scatter plots obtained for N=200, and (b) ACFs of the simulated values for various N (inline image, κ; inline image, δ; inline image, γ; inline image, λ)

We now apply our algorithm to the Standard & Poors 500 data from January 12th, 2002 to December 30th, 2005, which have been standardized to have unit variance. We assign the following independent priors (Gander and Stephens, 2007): inline image, inline image, inline image and inline image. We have T=1000 and we use N=1000 particles. We also use a normal random-walk MH proposal, the covariance of the proposal being the estimated covariance of the target distribution which was obtained in a preliminary run. In this context, 1000 particles appear sufficient to obtain good performance. The results are presented in Fig. 7.

Figure 7.

 Lévy-driven stochastic volatility model for Standard & Poors 500 data: (a) histogram approximations of posterior densities p(κ|y1:T), p(δ|y1:T), p(γ|y1:T) and p(λ|y1:T) (inline image, prior) and scatter plots, and (b) ACFs of the simulated values obtained for N=1000 (inline image, κ; inline image, δ; inline image, γ; inline image, λ)

Gander and Stephens (2007) proposed an MCMC method to sample from the posterior p{θ,σ²(0),η1:T|y1:T} which updates one at a time σ²(0) and the terms ηn by using the truncated prior as a proposal. The algorithm has a computational complexity of order O(T²) for updating η1:T as it requires recomputing xn:T each time that ηn is modified to evaluate the likelihood of the observations appearing in the MH ratio. For the two scenarios that were discussed above, we also designed MCMC algorithms using such a strategy to update σ²(0) and η1:T. We tried various updating strategies for θ but they all proved rather inefficient with the ACF of parameters decreasing much more slowly towards zero than for the PMMH update. It appears to us that designing efficient MCMC algorithms for such models requires considerable model-specific expertise. In this respect, we believe that the PMCMC methodology is less demanding as we could design reasonably fast mixing MCMC algorithms with little user input.

4. A generic framework for particle Markov chain Monte Carlo methods

For ease of exposition we have so far considered one of the simplest implementations of the SMC methodology that is used in our PMCMC algorithms (see Section 2). This implementation does not exploit any of the possible standard improvements that were mentioned in Section 2.5 and might additionally suggest that the PMCMC methodology is only applicable to the sole SSM framework. In this section, we consider a more general and abstract framework for PMCMC algorithms which relies on more general SMC algorithms that are not specialized to the SSM scenario. This allows us to consider inference in a much wider class of statistical models but also to consider the use of advanced SMC techniques in a unified framework. This can be understood by the following simple arguments.

First note that the SMC algorithm for SSMs that was described in Section 2.2.1 aims to produce sequentially approximate samples from the family of posterior densities {p(x1:n|y1:n);n=1,…,T} defined on the sequence of spaces inline image. It should be clear that this algorithm can be straightforwardly modified to sample approximately from any sequence of densities {πn(x1:n);n=1,…,P} defined on inline image for any P ≥ 1. This points to the applicability of SMC, and hence PMCMC, methods beyond the sole framework of inference in SSMs to other statistical models. This includes models which naturally have a sequential structure (e.g. Liu (2001)), but also models which do not have such a structure and for which the user induces such a structure (e.g. Chopin (2002) and Del Moral et al. (2006)).

Second, as described in Doucet and Johansen (2009), the introduction of advanced SMC techniques for SSMs such as the auxiliary particle filter (Pitt and Shephard, 1999) or the resample–move algorithm (Gilks and Berzuini, 2001) can be naturally interpreted as introducing additional intermediate, and potentially artificial, target densities between, say, p(x1:n|y1:n) and p(x1:n+1|y1:n+1). These additional intermediate densities might not have a physical interpretation but are usually chosen to help to bridge samples from p(x1:n|y1:n) to samples from p(x1:n+1|y1:n+1). Such strategies can therefore be recast into the problem of using SMC methods to sample sequentially from a sequence of densities {πn(x1:n);n=1,…,P} for some integer P ≥ T.

4.1. A generic sequential Monte Carlo algorithm

Consider the problem of using SMC methods to sample from a sequence of densities {πn(x1:n);n=1,…,P} such that for n=1,…,P the density πn(x1:n) is defined on inline image. Each density is only assumed known up to a normalizing constant, i.e. πn(x1:n)=γn(x1:n)/Zn where inline image can be evaluated pointwise but the normalizing constant Zn is unknown. We shall use the notation Z for ZP. For the simple SMC algorithm for SSMs in Section 2, we have γn(x1:n):=p(x1:n,y1:n) and Zn:=p(y1:n). An SMC algorithm also requires us to specify an importance density M1(x1) on inline image and a family of transition kernels with associated densities {Mn(xn|x1:n−1);n=2,…,P} to extend inline image by sampling inline image conditional on x1:n−1 at time instants n=2,…,P. To describe the resampling step, we introduce a family of probability distributions on {1,…,N}^N, {r(·|w);w ∈ [0,1]^N}. In Section 2.2 the sampling distributions are Mn(xn|x1:n−1):=q(xn|yn,xn−1) and inline image. As in Section 2.2 we use the notation inline image where the variable inline image indicates the index of the ‘parent’ at time n−1 of particle inline image for n=2,…,P. The generic SMC algorithm proceeds as follows.

  • Step 1: for n=1,

  • (a) sample inline image and
  • (b) compute and normalize the weights
  • Step 2: for n=2,…,P,

  • (a) sample An−1 ∼ r(·|Wn−1),
  • (b) sample inline image and set inline image, and
  • (c) compute and normalize the weights
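The two steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes multinomial resampling for r(·|w) and the usual incremental weights, γ1(x1)/M1(x1) at time 1 and γn(x1:n)/{γn−1(x1:n−1)Mn(xn|x1:n−1)} thereafter. The callables `sample_m`, `log_m` and `log_gamma`, and the Gaussian toy target used to exercise them, are hypothetical names introduced for the example.

```python
import math
import random


def generic_smc(P, N, sample_m, log_m, log_gamma, rng):
    """Generic SMC with multinomial resampling (an assumed choice of r).

    sample_m(n, prefix, rng) draws x_n from M_n(. | x_{1:n-1}),
    log_m(n, prefix, x) evaluates log M_n(x | x_{1:n-1}),
    log_gamma(n, path) evaluates log gamma_n(x_{1:n}).
    Returns the N paths, their normalized weights and an estimate of log Z.
    """
    # Step 1: sample from M_1 and compute the first weights.
    paths = [[sample_m(1, [], rng)] for _ in range(N)]
    logw = [log_gamma(1, p) - log_m(1, [], p[0]) for p in paths]
    W, log_z = None, 0.0
    for n in range(1, P + 1):
        if n > 1:
            # Step 2(a): each offspring chooses its parent, A_{n-1} ~ r(. | W_{n-1}).
            ancestors = rng.choices(range(N), weights=W, k=N)
            new_paths, logw = [], []
            for a in ancestors:
                prefix = paths[a]
                x = sample_m(n, prefix, rng)  # step 2(b): extend the parent path
                new_paths.append(prefix + [x])
                # Step 2(c): incremental weight gamma_n / (gamma_{n-1} M_n).
                logw.append(log_gamma(n, prefix + [x])
                            - log_gamma(n - 1, prefix) - log_m(n, prefix, x))
            paths = new_paths
        m = max(logw)  # stabilize the exponentials
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        W = [v / total for v in w]
        log_z += m + math.log(total / N)  # log of the average unnormalized weight
    return paths, W, log_z


# Toy check: gamma_n is an unnormalized standard Gaussian product density and
# M_n is the standard Gaussian itself, so every weight equals sqrt(2 * pi).
def sample_m(n, prefix, rng):
    return rng.gauss(0.0, 1.0)


def log_m(n, prefix, x):
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_gamma(n, path):
    return -0.5 * sum(x * x for x in path)


paths, W, log_z = generic_smc(5, 100, sample_m, log_m, log_gamma, random.Random(0))
```

Storing whole paths keeps the sketch short at the price of O(NP) memory per time step; storing the ancestor indices instead would recover the ancestral lineages that are used in the remainder of this section.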

This algorithm yields an approximation to π(dx1:P) and its normalizing constant Z through


Again the role of the vector An is to parameterize a random mapping on {1,…,N}→{1,…,N}N, and the standard resampling procedure is hence interpreted here as being the operation by which offspring particles at time n choose their parent particles at time n−1 according to a probability distribution r(·|Wn−1) parameterized by the parents’ weights Wn−1. For any n ≥ 1 we shall hereafter use inline image to denote inline image, the number of offspring of particle k at time n, and s(·|Wn) to denote the corresponding probability distribution of inline image. We shall make extensive use of the notion of ancestral lineage inline image of a path inline image inline image already introduced in Section 2. We recall that inline image and for n=P−1,…,1 we have inline image. This notation is necessary to establish the mathematical validity of PMCMC algorithms since it allows us to describe precisely and simply the various probabilistic objects that are involved. For example it will be useful in what follows to describe the probability density inline image of all the random variables generated by the generic SMC algorithm above. Letting inline image denote inline image, the set of N simulated inline image-valued random variables at time n, for n=1,…,P, it is straightforward to establish that the joint density of inline image defined on inline image is


We shall make extensive use of this result in the remainder of the paper. Note in particular that a sample X1:P that is drawn from inline image in equation (21) has a distribution inline image where inline image denotes the expectation with respect to ψ.

Not all choices of {πn}, {M1(x1),Mn(xn|x1:n−1);n=2,…,P} and r(·|w) will lead to a consistent SMC algorithm, i.e. to an algorithm such that inline image and inline image respectively converge to π and Z in some sense as N→∞. We shall rely on the following standard minimal assumptions. The following notation will be needed to characterize the support of the target and proposal densities. We define


with the convention π0(x1:0)=1 and M1(x1|x1:0)=M1(x1). The required set of minimal assumptions is as follows.

 Assumption 1.  For n=1,…,P, we have inline image.

 Assumption 2.  For any k=1,…,N and n=1,…,P the resampling scheme satisfies




Assumption 1 simply states that it is possible to use the importance density πn−1(x1:n−1)Mn(xn|x1:n−1) to approximate πn(x1:n). Assumption 2 is related to the resampling scheme. The ‘unbiasedness’ condition in equation (23) is satisfied by the popular multinomial, residual and stratified resampling procedures. The condition in equation (24) is not usually satisfied as in practice, for computational efficiency, On is usually drawn first according to a probability distribution s(·|Wn) such that equation (23) holds (i.e. without explicit reference to An) and the offspring then matched to their parents. More precisely, once On has been sampled, this is followed by a deterministic allocation procedure of the offspring particles to the parents, which defines indices; for example the inline image first offspring particles are associated with the parent particle number 1, i.e. inline image, and likewise for the inline image following offspring particles and the parent particle number 2, i.e. inline image etc. However, condition (24) can be easily enforced by the addition of a random permutation of these indices.
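The two-stage procedure that is described above (draw the offspring numbers first, allocate them deterministically and then permute the indices at random) can be sketched as follows. Multinomial offspring numbers are used as an assumed example of s(·|Wn), and both function names are hypothetical.

```python
import random


def multinomial_offspring(weights, rng):
    """Draw offspring numbers O^1..O^N with E[O^k] = N * W^k (multinomial scheme)."""
    N = len(weights)
    counts = [0] * N
    for k in rng.choices(range(N), weights=weights, k=N):
        counts[k] += 1
    return counts


def ancestors_from_offspring(counts, rng):
    """Deterministic allocation of offspring to parents, then a random permutation."""
    # The first counts[0] offspring get parent 0, the next counts[1] get parent 1, etc.
    ancestors = [k for k, c in enumerate(counts) for _ in range(c)]
    # A uniformly random permutation removes the dependence on the allocation order.
    rng.shuffle(ancestors)
    return ancestors


rng = random.Random(0)
counts = multinomial_offspring([0.1, 0.2, 0.3, 0.4], rng)
ancestors = ancestors_from_offspring(counts, rng)
```

The permutation leaves the offspring numbers unchanged, so the unbiasedness property of the scheme that generated `counts` is preserved.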

We provide here some results concerning the precision of SMC estimates of key quantities that are involved in the implementation of PMCMC algorithms as a function of both P and N. We point to the fact that some of these results rely on relatively strong conditions, but their interest is nevertheless twofold. First they provide some quantitative insight into the reasons why using the output of SMC algorithms as proposals might be a good idea and how performance might scale with respect to both P and N. Second these results correspond to current understanding of SMC methods and have been empirically observed to extend beyond the scenarios that are detailed below.

 Assumption 3.  There is a sequence of constants inline image for some integer inline image such that for any inline image


 Assumption 4.  There are μ(·) a probability density on inline image and inline image such that, for any inline image and any inline image,


Theorem 1.  Assume assumption 1 for inline image and some inline image, and assumption 3. For the multinomial resampling scheme, for any inline image there are C(P) and D(P) such that for any N ≥ 1 the variance of inline image satisfies


and such that the distribution of a sample from inline image satisfies for any N ≥ 1


where ‘‖·‖’ denotes the total variation distance.

If in addition assumption 4 is satisfied then there are constants C,D>0, depending on inline image and μ but not P, such that the results above hold with


Assumption 3 is related to the standard boundedness condition for importance weights in classical importance sampling. The results in equations (26) and (27) have been established very early on in the literature; see for example Del Moral (2004). However, these results are rather weak since C(P) and D(P) are typically exponential functions of P. Assumption 4 imposes a practically realistic pattern of dependence on the components of X1:P, namely a forgetting property, which turns out to be beneficial in that it mitigates the propagation of error in the SMC algorithm. As a result the linear bounds in expression (28) can be established (Cérou et al. (2008) and personal communication with Professor Pierre Del Moral). As discussed in more detail in the next sections these results have direct implications on the performance of PMCMC algorithms. In particular expression (28) suggests that approximations of idealized algorithms requiring exact samples from π(dx1:P) by algorithms which instead use a particle X1:P drawn from inline image in equation (21) are likely to scale linearly with increasing dimensions under assumption 4. This should be contrasted with the typical exponential deterioration of performance of classical IS approaches.

4.2. The particle independent Metropolis–Hastings update

To sample from π(x1:P), we can suggest a PIMH sampler which is an IMH sampler using an SMC approximation inline image of π(x1:P) as proposal distribution. The algorithm is similar to the algorithm that was discussed in Section 2.4.1 where P=T and where we substitute inline image and inline image respectively in place of inline image and inline image, with the notation given in equation (21).

Given inline image, the PIMH update consists of running an SMC algorithm targeting π(x1:P) to obtain an approximation inline image and inline image respectively for π(dx1:P) and Z, sampling inline image. We set inline image with probability


and inline image otherwise.
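The update can be sketched as follows, assuming that the acceptance probability takes the form 1 ∧ Z*/Z(i−1), i.e. the ratio of the normalizing constant estimate of the proposed particle system to that of the current one, as suggested by the IMH interpretation that is developed below. The callable `run_smc`, which is taken to return the particle paths, their normalized weights and the logarithm of the normalizing constant estimate, is a hypothetical interface.

```python
import math
import random


def pimh(run_smc, n_iter, rng):
    """Particle independent MH sketch; run_smc(rng) -> (paths, W, log_z)."""

    def propose():
        paths, W, log_z = run_smc(rng)
        # Draw one path from the weighted particle approximation.
        k = rng.choices(range(len(paths)), weights=W, k=1)[0]
        return paths[k], log_z

    path, log_z = propose()  # initialize with the output of an SMC run
    chain, accepted = [path], 0
    for _ in range(n_iter):
        cand, cand_log_z = propose()
        # Accept with probability min(1, Z* / Z), computed on the log scale.
        if rng.random() < math.exp(min(0.0, cand_log_z - log_z)):
            path, log_z = cand, cand_log_z
            accepted += 1
        chain.append(path)
    return chain, accepted / n_iter
```

Only the selected path and the scalar estimate need to be stored between iterations; the rest of the particle system can be discarded.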

We now prove that a sequence {X1:P(i)} generated by a PIMH sampler, i.e. by iterating the PIMH update (initialized with the output inline image of an SMC algorithm targeting π(x1:P)), has π(x1:P) as the desired equilibrium density for any N ≥ 1. The key to establishing this result is to reformulate the PIMH update as a standard IMH update defined on an extended state space X with a suitable invariant density. First we establish the expression for the density of the set of random variables generated to construct inline image above. In the light of the discussion of Section 4.1 the SMC algorithm generates the set of random variables inline image and the selection of inline image among the particles inline image involves sampling a random variable K with distribution inline image. From equation (22) we deduce that this density takes the simple form


and is defined on inline image. Here inline image is the realization of the normalized importance weight inline image. The less obvious point is to identify the density inline image on X targeted by the PIMH algorithm which is given by


where we remind the reader that inline image and note that inline image is the marginal probability density inline image. Note the important property that, for a sample inline image from this distribution, inline image is distributed according to the distribution of interest π. For any i ≥ 0, let inline image denote the distribution of X1:P(i) generated by the PIMH sampler with N ≥ 1 particles. Our main result is the following theorem, which is proved in Appendix B.

Theorem 2.  Assume assumption 2. Then for any N ≥ 1 the PIMH update is a standard IMH update on the extended space X with target density inline image defined in equation (31) and proposal density qN defined in equation (30).

Proving this theorem simply consists of checking that under our assumptions the ratio between the extended target inline image and the extended proposal qN is, whenever qN>0, equal to inline image, and deduce that the acceptance ratio of an IMH update with target and proposal densities inline image and qN takes the form in equation (29). Note that, although assumption 1 is not needed to establish this theorem, this condition is, however, required both to ensure that inline image is a consistent estimator of Z (and hence that the PIMH is a consistent ‘exact approximation’) and that the corresponding sampler is ergodic. This result implies that if inline image, and in particular inline image, then after an IMH update the resulting sample inline image, and in particular inline image. In addition formulating the PIMH sampler as an IMH algorithm in disguise targeting inline image allows us to use standard results concerning the convergence properties of the IMH sampler to characterize those of the PIMH sampler.

Theorem 3.  Assume assumptions 1 and 2. Then

  • (a) the PIMH sampler generates a sequence {X1:P(i)} whose marginal distributions inline image satisfy
  • (b) if additionally assumption 3 holds, then there exists ρP ∈ [0,1) such that for any i ≥ 1 and inline image

The first statement is a direct consequence of theorem 2, standard convergence properties of irreducible MCMC algorithms and the fact that inline image. The second statement, leading to equation (32), simply exploits the well-known fact that the IMH sampler converges geometrically if and only if the supremum of the ratio of the target density to the proposal density (here inline image) is finite.

This latter result nevertheless calls for some comments since ρP is, perhaps surprisingly, independent of N, implying the rather negative property that increasing N does not seem to improve convergence of the algorithm. Again the IMH nature of the PIMH sampler sheds some light on this point. In simple terms, the convergence properties of an IMH sampler are governed by the large values of the ratio of the target density to the proposal density. Indeed, leaving a state with such a large ratio is difficult and results in a slowly mixing Markov chain exhibiting a ‘sticky’ behaviour. What the second result above tells us is that the existence of such sticky states is not eliminated by the PIMH strategy when N increases. However, the probability of visiting such unfavourable states can be made arbitrarily small by increasing N by virtue of the results in theorem 1 and the application of Tchebychev's inequality. In fact, as a particular case of Andrieu and Roberts (2009), it is possible to show that for any ɛ,η>0 there exists an N0 such that for any N ≥ N0 and any i ≥ 1


with ψ-probability larger than 1−η, where inline image denotes the conditional distribution of X1:P(i) given the random variables generated at iteration 0 by the SMC algorithm.

4.3. The conditional sequential Monte Carlo update

The expression


appearing in inline image given in equation (31) is the density under inline image of all the variables that are generated by the SMC algorithm conditional on inline image. Although this sheds some light on the structure of inline image, sampling from this conditional density can also be of a practical interest. As we shall see, it is a key element of the PG sampler discussed in Sections 2.4.3 and 4.5 and can also be used to update sub-blocks of x1:P. Given inline image the algorithm to sample from the distribution above proceeds as follows.

  • Step 1: n=1,

  • (a)for inline image, sample inline image and
  • (b)compute inline image and normalize the weights inline image
  • Step 2: for n=2,…,P,

  • (a)sample inline image,
  • (b)for inline image, sample inline image and set inline image, and
  • (c)compute inline image and normalize the weights inline image.

Here we have used the notation inline image. We explain in Appendix A how to sample efficiently from inline image. Intuitively this update can be understood as updating N−1 particles together with their weights while keeping one particle fixed in inline image. Going one step further, one can suggest sampling inline image from this updated empirical distribution inline image. The remarkable property here is that, whenever inline image (corresponding to the marginal inline image), then inline image. This stems from the fact that the conditional SMC update followed by sampling from inline image can be interpreted as a standard Gibbs update on the distribution inline image. Indeed, as mentioned earlier, the conditional SMC update samples from inline image and it can easily be checked from equation (41) that inline image, which is precisely the probability involved in sampling from inline image. We stress the crucial fact that a single particle inline image is needed to initialize this Gibbs update.
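A minimal sketch of the conditional SMC update followed by the selection of a new path is given below. It assumes multinomial resampling and, as a simplification that is often made in implementations with this scheme, places the conditioning path in slot 0 with ancestral lineage (0,…,0) instead of carrying a general lineage. The callables follow the same hypothetical interface as a generic SMC routine: `sample_m`, `log_m` and `log_gamma`.

```python
import math
import random


def conditional_smc(ref_path, N, sample_m, log_m, log_gamma, rng):
    """Conditional SMC sketch: slot 0 is frozen to ref_path, N-1 particles are free."""
    P = len(ref_path)
    # Step 1: the frozen particle plus N-1 fresh draws from M_1.
    paths = [[ref_path[0]]] + [[sample_m(1, [], rng)] for _ in range(N - 1)]
    logw = [log_gamma(1, p) - log_m(1, [], p[0]) for p in paths]
    W = None
    for n in range(1, P + 1):
        if n > 1:
            # Step 2(a): slot 0 keeps its own parent; the others resample freely.
            ancestors = [0] + rng.choices(range(N), weights=W, k=N - 1)
            new_paths, logw = [], []
            for k, a in enumerate(ancestors):
                prefix = paths[a]
                # Step 2(b): the frozen particle copies the reference state.
                x = ref_path[n - 1] if k == 0 else sample_m(n, prefix, rng)
                new_paths.append(prefix + [x])
                # Step 2(c): same incremental weight as in the unconditional algorithm.
                logw.append(log_gamma(n, prefix + [x])
                            - log_gamma(n - 1, prefix) - log_m(n, prefix, x))
            paths = new_paths
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        W = [v / total for v in w]
    # Select the new path from the updated weighted particle approximation.
    K = rng.choices(range(N), weights=W, k=1)[0]
    return paths, W, paths[K]


# Toy usage with an unnormalized standard Gaussian target and Gaussian proposals.
def sample_m(n, prefix, rng):
    return rng.gauss(0.0, 1.0)


def log_m(n, prefix, x):
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_gamma(n, path):
    return -0.5 * sum(x * x for x in path)


ref = [0.1, -0.2, 0.3]
paths, W, new_path = conditional_smc(ref, 10, sample_m, log_m, log_gamma, random.Random(0))
```

Because the frozen particle competes with the others at each resampling step, the selected path may coincide with the reference path over an initial stretch and differ from it thereafter.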

An important practical application of this property is concerned with the sampling of subblocks of x1:P when P is so large that a prohibitive number of particles might be required to lead to an efficient global update. In such situations, we can simply divide the sequence x1:P into large sub-blocks and use a mixture of Gibbs sampler updates as described above. Given a sub-block Xc:d=xc:d for 1<c<d<P such an update leaving π(xc:d|x1:c−1,xd+1:P) invariant proceeds as follows.

  • (a)Sample an ancestral lineage Bc:d uniformly in {1,…,N}^(d−c+1).
  • (b)Run a conditional SMC algorithm targeting π(xc:d|x1:c−1,xd+1:P) conditional on Xc:d and Bc:d.
  • (c)Sample inline image

4.4. The particle marginal Metropolis–Hastings update

We now consider the case where we are interested in sampling from a density


with inline image assumed known pointwise and Z a possibly unknown normalizing constant, independent of θ ∈ Θ. In the case of the simple SMC algorithm for SSMs that was considered in Section 2.4.2, we have P=T, π(θ,x1:P)=p(θ,x1:T|y1:T), γ(θ,x1:P)=p(θ,x1:T,y1:T) given in equation (4) and Z=p(y1:T). Following the developments of Section 2.4.2 we can suggest the use of a PMMH sampler which consists of approximating an MMH algorithm with proposal density inline image and target density π(θ,x1:P)=π(θ)πθ(x1:P) where πθ(x1:P)=γ(θ,x1:P)/γ(θ) with inline image and π(θ)=γ(θ)/Z; we have γ(θ)=pθ(y1:T)p(θ) in Section 2.4.2. We use an SMC algorithm to sample approximately from πθ(x1:P) and approximately compute its normalizing constant γ(θ). This requires introducing a family of bridging densities inline image, each of them known up to a normalizing constant, such that inline image and a family of IS densities inline image. We shall use inline image and inline image respectively to denote the SMC approximation to πθ(dx1:P) and γ(θ).

The PMMH update consists at iteration i of sampling a candidate θ* ∼ q{·|θ(i−1)}, then running an SMC sampler to obtain inline image and sampling inline image. We set inline image with probability


and inline image otherwise. We formulate very mild and natural assumptions which will guarantee convergence for any N ≥ 1, i.e. ensure that the sequence {θ(i),X1:P(i)} that is generated by the PMMH sampler will have π(θ,x1:P) as asymptotic density. For any θ ∈ Θ, we define


with the convention inline image and inline image, and inline image. We make the following assumptions.

 Assumption 5.  For any inline image, we have inline image for n=1,…,P.

 Assumption 6.  The MH sampler of target density π(θ) and proposal density q(θ*|θ) is irreducible and aperiodic (and hence converges for π almost all starting points).

Again assumption 5 is needed to ensure that inline image can be used as an importance density to approximate inline image for any θ ∈ Θ such that π(θ)>0 whereas assumption 6 ensures that the associated MH algorithm converges. Our main result is the following theorem, which is proved in Appendix B.

Theorem 4.  Assume assumption 2. Then for any N ≥ 1

  • (a)the PMMH update is an MH update defined on the extended space Θ×X with target density
    and proposal density
    where inline image consists of the realization of the normalized importance weights that are associated with the proposed population of particles, and
  • (b)if additionally assumptions 5 and 6 hold, the PMMH sampler generates a sequence {θ(i),X1:P(i)} whose marginal distributions inline image satisfy

4.5. The particle Gibbs update

The PG sampler aims to solve the same problem as the PMMH algorithm, i.e. sampling from π(θ,x1:P) defined on some space inline image as defined in equation (8) in the situation where an unnormalized version γ(θ,x1:P) is accessible. A Gibbs sampler for this model would typically consist of alternately sampling from π(θ|x1:P) and πθ(x1:P). To simplify our discussion we shall assume here that sampling exactly from π(θ|x1:P) is possible. However, sampling from πθ(x1:P) is naturally impossible in most situations of interest, but motivated by the structure and properties of inline image defined in equation (36) we can suggest performing a Gibbs sampler targeting precisely this density. We choose the following sweep:

  • (a)inline image,
  • (b)inline image,
  • (c)inline image.

Steps (a) and (c) are straightforward to implement. In the light of the discussion of Section 4.3, step (b) can be directly implemented thanks to a conditional SMC algorithm. Note that step (a) might appear unusual but leaves inline image invariant and is known in the literature under the name ‘collapsed’ Gibbs sampler (Liu (2001), section 6.7). We remind the reader that a detailed particular implementation of this algorithm in the context of SSMs with multinomial resampling is given in Section 2.4.3. We now state a sufficient condition for the convergence of the PG sampler and provide a simple convergence result which is proved in Appendix B.
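Step (b), together with the draw of K in step (c), can be sketched in Python as follows. This is a minimal illustration only: the toy model (x_1 ~ N(0,1), x_t = θ x_{t−1} + N(0,1) noise, y_t = x_t + N(0,1) noise), the prior proposal and the function names are assumptions for the example. The sketch also uses a common shortcut for multinomial resampling, namely keeping the conditioned path in slot 0 at every time step instead of drawing its lineage uniformly.

```python
import math
import random


def obs_weights(xs, y):
    # Assumed observation model y_t | x_t ~ N(x_t, 1); rescaled for numerical stability.
    logw = [-0.5 * (y - x) ** 2 for x in xs]
    top = max(logw)
    return [math.exp(lw - top) for lw in logw]


def conditional_smc(theta, ys, x_cond, n_particles, rng):
    """One conditional SMC sweep followed by the draw of K; returns a path x_{1:T}."""
    n = n_particles
    # Slot 0 always carries the conditioned path; the other N-1 particles are free.
    particles = [[x_cond[0]] + [rng.gauss(0.0, 1.0) for _ in range(n - 1)]]
    ancestors = []
    w = obs_weights(particles[0], ys[0])
    for t in range(1, len(ys)):
        # Conditional multinomial resampling: slot 0 keeps ancestor 0,
        # the remaining N-1 ancestors are drawn from the current weights.
        anc = [0] + rng.choices(range(n), weights=w, k=n - 1)
        prev = particles[-1]
        x_t = [x_cond[t]] + [rng.gauss(theta * prev[a], 1.0) for a in anc[1:]]
        ancestors.append(anc)
        particles.append(x_t)
        w = obs_weights(x_t, ys[t])
    # Draw K from the final weights and trace its ancestral lineage back.
    k = rng.choices(range(n), weights=w, k=1)[0]
    path = [particles[-1][k]]
    for t in range(len(ys) - 2, -1, -1):
        k = ancestors[t][k]
        path.append(particles[t][k])
    path.reverse()
    return path
```

With a single particle the sweep can only return the conditioned path itself, which gives some intuition for why the PG convergence result is stated for at least two particles.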

 Assumption 7.  The Gibbs sampler that is defined by the conditionals π(θ|x1:P) and πθ(x1:P) is irreducible and aperiodic (and hence converges for π almost all starting points).

We have the following result.

Theorem 5.  Assume assumption 2. Then

  • (a)the PG update defines a transition kernel on the extended space Θ×X of invariant density inline image defined in equation (36) for any N ≥ 1, and
  • (b)if additionally assumptions 5–7 hold, the PG sampler generates a sequence {θ(i),X1:P(i)} whose marginal distributions inline image satisfy for any N ≥ 2

4.6. Reusing all the particles

Standard theory of MCMC algorithms establishes that under our assumptions inline image will almost surely converge to inline image whenever inline image as the number L of PMMH or PG iterations goes to ∞. We show here that it is possible to use all the particles that are involved in the construction of inline image to estimate this expectation, but also rejected sets of particles. The application to the PIMH sampler is straightforward by ignoring θ in the notation, replacing inline image with inline image and the acceptance ratios below with their counterparts in expression (29).

Theorem 6.  Assume assumptions 2–5 and let inline image be such that inline image. Then as soon as the PMMH sampler or the PG sampler is ergodic then, for any N ≥ 1 or N ≥ 2 respectively,

  • (a)the estimate
    converges almost surely towards inline image as L→∞ where inline image corresponds to the set of normalized weights and particles used to compute inline image,
  • (b)and for the PMMH sampler, denoting by inline image the set of proposed weighted particles at iteration i (i.e. before deciding whether or not to accept this population) and inline image the associated normalizing constant estimate
    with for any θ,θ ∈ Θ
    converges almost surely towards inline image as L→∞.

The proof can be found in Appendix B and relies on ‘Rao–Blackwellization’-type arguments. The estimator in equation (39) is in the spirit of the ideas of Frenkel (2006) and tells us that it is also possible to recycle all the candidate populations that are generated by the PMMH sampler.
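The two estimators of theorem 6 can be illustrated with a short Python sketch. Since the displayed formulae are not reproduced here, the code assumes the usual forms: the first averages the weighted particle estimate over iterations, and the second mixes, at each iteration, the proposed and the current weighted populations with weights α and 1−α, where α is the acceptance probability of that proposal. The function names and data layout are hypothetical.

```python
def recycled_estimate(iterations, f):
    """Average over iterations of sum_k W^k f(X^k).

    iterations: list of (weights, particles); weights are normalized and
    each particle is whatever object f accepts (for example a path x_{1:P}).
    """
    total = 0.0
    for weights, particles in iterations:
        total += sum(w * f(x) for w, x in zip(weights, particles))
    return total / len(iterations)


def recycled_estimate_with_rejections(records, f):
    """Assumed waste-recycling form that also uses rejected populations.

    records: list of (alpha, (w_prop, x_prop), (w_cur, x_cur)), where alpha is
    the acceptance probability of the proposal made from the current state.
    """
    total = 0.0
    for alpha, (w_prop, x_prop), (w_cur, x_cur) in records:
        proposed = sum(w * f(x) for w, x in zip(w_prop, x_prop))
        current = sum(w * f(x) for w, x in zip(w_cur, x_cur))
        total += alpha * proposed + (1.0 - alpha) * current
    return total / len(records)
```

Both functions only post-process stored populations, so they add storage cost but no extra SMC runs.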

5. Discussion and extensions

5.1. Discussion and connections to previous work

The PIMH algorithm is related to the configurational-biased Monte Carlo (CBMC) method, which is a very popular method in molecular simulation (Siepmann and Frenkel, 1992). However, in contrast with the PIMH algorithm, the CBMC algorithm does not propagate N particles in parallel. Indeed, at each time step n, the CBMC algorithm samples N particles but the resampling step is such that a single particle survives, to which a new set of N offspring is then attached. The problem with this approach is that it is somewhat too greedy and that if a ‘wrong’ decision is taken too prematurely then the proposal will most likely be rejected. It can be shown that the acceptance probability of the CBMC algorithm does not converge to 1 for P>1 as N→∞ in contrast with that of the PIMH algorithm. It has been more recently proposed in Combe et al. (2003) to improve the CBMC algorithm by propagating several particles simultaneously in the spirit of the PIMH algorithm. Combe et al. (2003) proposed to kill or multiply particles by comparing their unnormalized weights inline image with respect to some prespecified lower and upper thresholds; i.e. the particles are not interacting and their number is a random variable. In simulations, the performance of this algorithm is very sensitive to the values of these thresholds. Our approach has the great advantage of bypassing the delicate choice of such thresholds. In statistics, a variation of the CBMC algorithm known as the multiple-try method has been introduced in the specific case where P=1 in Liu et al. (2000). Our methodology differs significantly from the multiple-try method as it aims to build efficient proposals using sequential and interacting mechanisms for cases where P≫1.

The idea of approximating an MMH algorithm which samples directly from π(θ), by approximately integrating out the latent variables x1:P, was proposed in Beaumont (2003) and then generalized and studied theoretically in Andrieu and Roberts (2009). The present work is a simple mechanism which opens up the possibility of making this approach viable in high dimensional problems. Indeed in this context the SMC estimate that is used by the PMMH algorithm is expected to lead to approximations of π(θ) (up to a normalizing constant) with a much lower variance than the IS estimates that were used in Andrieu and Roberts (2009) and Beaumont (2003). The results in Andrieu and Roberts (2009) suggest that this is a property of paramount importance to design efficient marginal MCMC algorithms. Recently it has been brought to our attention by Professor Neil Shephard that a simple version of the PMMH sampler has been proposed independently in the econometrics literature in Fernandez-Villaverde and Rubio-Ramirez (2007). However, their PMMH sampler is suggested as a heuristic approximation to the MMH algorithm and the crucial point that it admits the correct invariant density is not established.

Note finally that it is possible to establish in a few lines that the PMMH sampler admits π(θ) as marginal invariant density. Indeed if equation (23) and assumption 5 hold then inline image is an unbiased estimate of γ(θ) (Del Moral (2004), proposition 7.4.1). It was established in Andrieu et al. (2007) that it is only necessary to have access to an unbiased positive estimate of an unnormalized version of a target density to design an MCMC algorithm admitting this target density as invariant density. The two-line proof given in Andrieu et al. (2007) is as follows. Let U denote inline image, the set of auxiliary variables distributed according to the density ψθ(u) given in equation (37) that is necessary to compute the unbiased estimate inline image; we write here inline image to make this dependence explicit. The extended target density inline imageψθ(u) admits by construction π(θ) as a marginal density in θ. To sample from inline image, we can consider a standard MH algorithm of proposal q(θ*|θ)ψθ*(u*). The resulting acceptance ratio is given by


which corresponds to equation (35). The algorithm that was proposed in Møller et al. (2006) can also be reinterpreted in the framework of Andrieu et al. (2007), the unbiased estimate of the inverse of an intractable normalizing constant being obtained in this context by using IS. However, we emphasize that the PMCMC methodology goes further by introducing an additional random variable K and establishing that the associated extended target density inline image can be rewritten as in equation (36). For example, identifying this target density shows that we obtain samples not only from the marginal density π(θ) but also from the joint density π(θ,x1:P). Moreover this formulation naturally suggests the use of standard MCMC techniques to sample from this extended target distribution. This is a key distinctive feature of our work. This has allowed us, for example, to develop the conditional SMC update of Section 4.3 which leads to a novel MCMC update directly targeting π(x1:P) or any of its conditionals π(xc:d|x1:c−1,xd+1:P).
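For reference, the cancellation behind the two-line argument above can be written out as follows. The symbol \hat{\gamma}(\theta,u) for the unbiased estimate is notation assumed here, since the original symbols are not reproduced in this text; the rest follows from the stated extended target, proportional to \hat{\gamma}(\theta,u)\psi_{\theta}(u), and the proposal q(\theta^*|\theta)\psi_{\theta^*}(u^*).

```latex
% Unbiasedness gives the correct marginal in theta:
\[
\int \hat{\gamma}(\theta,u)\,\psi_{\theta}(u)\,\mathrm{d}u = \gamma(\theta) \propto \pi(\theta).
\]
% The auxiliary densities cancel in the MH ratio:
\[
\frac{\hat{\gamma}(\theta^*,u^*)\,\psi_{\theta^*}(u^*)}{\hat{\gamma}(\theta,u)\,\psi_{\theta}(u)}
\times
\frac{q(\theta\mid\theta^*)\,\psi_{\theta}(u)}{q(\theta^*\mid\theta)\,\psi_{\theta^*}(u^*)}
=
\frac{\hat{\gamma}(\theta^*,u^*)\,q(\theta\mid\theta^*)}{\hat{\gamma}(\theta,u)\,q(\theta^*\mid\theta)} .
\]
```

The ratio therefore involves only the two estimates and the proposal for θ, so the density ψθ(u) never needs to be evaluated.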

5.2. Extensions

We believe that many problems where SMC methods have already been used successfully could benefit from the PMCMC methodology. These include contingency tables, generalized linear mixed models, graphical models, change-point models, population dynamic models in ecology, volatility models in financial econometrics, partially observed diffusions, population genetics and systems biology. The CBMC method, to which our approach is related, is a very popular method in computational chemistry and physics, and PMCMC algorithms might also prove a useful alternative in these areas. We are already aware of recent successful applications of PMCMC methods in econometrics (Flury and Shephard, 2010) and statistics (Belmonte et al., 2008).

From a methodological point of view, there are numerous possible extensions. Given that we know the extended target distributions that the PMCMC algorithms are sampling from, it is possible to design many other sampling strategies. For example, we can sample only a proportion of the particles at each iteration or a part of their paths instead of sampling a whole new population at each iteration. It would be also interesting to investigate the use of dependent proposals to update the latent variables. In practice, the performance of the PIMH and PMMH algorithms is closely related to the variance of the SMC estimates of the normalizing constants. Adaptive strategies to determine the number of particles that is necessary to ensure that the average acceptance rate of the algorithms is reasonable could also be proposed.

From a theoretical point of view, it is possible to study how ‘close’ the PMCMC algorithms are to the idealized MCMC algorithms that they are approximating—corresponding to N→∞—using and refining the techniques that are developed in Andrieu and Roberts (2009).


The authors thank the referees and the Research Section Committee for their valuable comments which have helped to improve the manuscript. We thank Paul Fearnhead for pointing out an error in Appendix A of an earlier version of the manuscript. Christophe Andrieu's research is supported by an Engineering and Physical Sciences Research Council Advanced Research Fellowship.


Appendix A: Conditional sequential Monte Carlo implementation

The delicate step in practice to implement the conditional SMC procedure is that of sampling from inline image. As discussed in Section 4.1, the resampling procedure is usually defined in terms of the number of offspring On−1 of the parent particles from iteration n−1. In this case, a generic algorithm consists of the following two steps.

  • (a)Sample the numbers of offspring inline image.
  • (b)Sample the indices of the N−1 ‘free’ offspring uniformly on the set inline image.

To sample from inline image, we can use the fact that






In the multinomial resampling case, denoting inline image the multinomial distribution, this is equivalent to the following procedure.


Appendix B: Proofs

B.1. Proof of theorem 2

We can easily check that equations (30) and (31) sum to 1. Note that the factor 1/N^P corresponds to the uniform distribution on the set {1,…,N}^P for the random variables inline image. Now the acceptance ratio of an IMH algorithm is known to depend on the following importance weight which is well defined because of assumption 1:


where inline image is given in equation (21). In the manipulations above we have used assumption 2 on the second line whereas the final result is obtained thanks to the definitions of the incremental weights (20) and of the normalizing constant estimate (21). It should now be clear that the PIMH algorithm that is described above corresponds to sampling particles according to qN defined in equation (30) and that the acceptance probability (29) corresponds to that of an IMH algorithm with target density inline image given by equation (31).

B.2. Proof of theorem 3

Under the assumptions the PIMH defines an irreducible and aperiodic Markov chain with invariant density inline image from theorem 2. Since inline image we conclude the proof from the properties of inline image. To establish the second statement, we note that under assumption 3


for all inline image. For an IMH algorithm this implies uniform geometric ergodicity towards inline image, with a rate at least inline image; see for example Mengersen and Tweedie (1996), theorem 2.1. This, together with a reasoning similar to above concerning X1:P(i), allows us to conclude the proof.

B.3. Proof of theorem 4

The proof of the first part of theorem 4 is similar to the proof of theorem 2 and is not repeated here. The second part of the proof is a direct consequence of theorem 1 in Andrieu and Roberts (2009) and assumptions 5 and 6.

B.4. Proof of theorem 5

The algorithm is a Gibbs sampler targeting equation (36). We hence focus on establishing irreducibility and aperiodicity of the corresponding transition probability. Let inline image, inline image, k ∈ {1,…,N} and i ∈ {1,…,N}^{P−1} be such that inline image. From assumption 5 it is possible to show that accessible sets for the Gibbs sampler are also marginally accessible by the PG sampler, i.e. more precisely if inline image is such that inline image for some finite j>0 then also inline image for all k ∈ {1,…,N} and i ∈ {1,…,N}^P. From this and the assumed irreducibility of the Gibbs sampler in assumption 7, we deduce that if π{(θ,X1:P) ∈ D×E}>0 then there is a finite j such that inline image for all k ∈ {1,…,N} and i ∈ {1,…,N}^P. Now, because π{(θ,X1:P) ∈ D×E}>0 and step (b) corresponds to sampling from the conditional density of inline image, we deduce that


and the irreducibility of the PG sampler follows. Aperiodicity can be proved by contradiction. Indeed from assumption 5 we deduce that, if the PG sampler is periodic, then so is the Gibbs sampler, which contradicts assumption 7.

B.5. Proof of theorem 6

To simplify the presentation, we shall use the notation inline image and for inline image we define the function


Note, using two different conditionings, that the following equalities hold:


where we have used that inline image and inline image by using an identity similar to equation (41). The first statement follows from the ergodicity assumption on both the PMMH and the PG samplers and the resulting law of large numbers involving {K(i),V(i)}. To prove the second result in equation (39) we introduce the transition probability Q that is associated with the PMMH update. More precisely, with inline image and α(v,v) the acceptance probability of the PMMH update, the conditional expectation of F with respect to the distribution Q(k,v;·) is given by


where inline image denotes the normalized weights that are associated with inline image. By construction Q leaves inline image invariant, which from equation (42) leads to inline image. Now, noting that Ψ(v,v)α(v,v) does not depend on k for the PMMH we can rewrite equation (43) as


Using the definition of F, the fact that inline image does not depend on k and equation (42) lead to


and we again conclude the proof from the assumed ergodicity of {K(i),V(i)}. Note that the proofs suggest that theorem 6 still holds for a more general version of the PMMH sampler for which the proposal distribution for θ is allowed to depend on v (i.e. all the particles), but on neither k nor k.

Discussion on the paper by Andrieu, Doucet and Holenstein

Paul Fearnhead (Lancaster University)

I see great potential for particle Markov chain Monte Carlo (MCMC) methods, as the strengths of particle filters and of MCMC sampling are in many ways complementary. For example, in work on mixture models (Fearnhead, 2004), particle filter methods can perform well at finding different modes of the posterior, whereas MCMC methods do well at exploring the posterior within a mode. Similarly particle methods do well for analysing state space models conditional on known parameters and can analyse models from which you can simulate but for which you cannot calculate transition densities, whereas MCMC methods are better suited to mixing over different parameter values. This is the first work to use particle filters within MCMC sampling in a principled and theoretically justified way.

The paper describes several particle MCMC methods, and I shall concentrate the rest of my comments on just one of these: particle Gibbs sampling.

To understand the mixing properties of particle Gibbs sampling it helps to look at the set of paths that can be sampled from at the end of a conditional sequential Monte Carlo (SMC) update: Fig. 8(a) gives an example. The conditional SMC update is an SMC algorithm conditioned on a specific path surviving, which I shall call the conditioned path. The set of paths can be split into those which coalesce with the conditioned path, and those which do not and hence are independent of it. For the particle Gibbs sampler to mix well we want the probability of sampling one of these latter independent paths to be high.

Figure 8.

  (a) Example realization of paths of conditional SMC updates (inline image, conditioned path; inline image, paths which coalesce with the conditioned path; inline image, independent paths) and (b) probability of sampling an independent path for multinomial (broken curves) and stratified (full curves) sampling, as a function of N/T for various values of T (inline image, inline image, T=50; inline image, inline image, T=200; inline image, inline image, T=400)

For this, we would like to minimize the number of times that the particle of the conditioned path is resampled at each iteration. Consider time n, and assume that the conditioned path consists of the first particle at both times n and n+1 (in the notation of the paper Bn=1 and Bn+1=1). Let On be the number of times that the first particle at time n is resampled. Then under the conditional SMC algorithm we are interested in


thus it is easy to show that inline image. Hence we see the importance of choosing a resampling scheme that minimizes the variance of the number of times that each particle is resampled; or of not resampling every time step.

To illustrate this empirically, I considered a simple toy model inline image,


We simulated data for σ=0.1. Fig. 8(b) shows how the probability of sampling an independent path in the conditional SMC step depends on T, N and the type of resampling. Results are given for multinomial resampling and for stratified resampling (Kitagawa, 1996; Carpenter et al., 1999), which is known to minimize var(On). We see that stratified sampling requires much smaller values of N to have the same performance as for multinomial sampling. Also, as pointed out in the paper, you want N to increase linearly with T to have a roughly constant performance (this suggests that the central processor unit cost of SMC scales as O(T^2); this sounds competitive with or better than standard MCMC sampling; see Roberts (2009)).
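The difference between the two resampling schemes can be seen directly on the offspring counts. The Python sketch below is illustrative only (the function names are hypothetical and it does not reproduce the toy model or Fig. 8): it draws offspring numbers under multinomial and stratified resampling and estimates the variance of the number of offspring of one chosen particle.

```python
import random


def multinomial_offspring(weights, rng):
    """Offspring counts when N ancestors are drawn independently from the weights."""
    n = len(weights)
    counts = [0] * n
    for a in rng.choices(range(n), weights=weights, k=n):
        counts[a] += 1
    return counts


def stratified_offspring(weights, rng):
    """Offspring counts when one uniform is drawn in each stratum [j/N, (j+1)/N)."""
    n = len(weights)
    total = sum(weights)
    counts = [0] * n
    i, cum = 0, weights[0] / total
    for j in range(n):
        u = (j + rng.random()) / n
        # Advance to the particle whose cumulative-weight interval contains u.
        while u > cum and i < n - 1:
            i += 1
            cum += weights[i] / total
        counts[i] += 1
    return counts


def offspring_variance(sampler, weights, index, n_reps, rng):
    """Empirical variance of the offspring count of particle `index`."""
    draws = [sampler(weights, rng)[index] for _ in range(n_reps)]
    mean = sum(draws) / n_reps
    return sum((d - mean) ** 2 for d in draws) / n_reps
```

With equal weights, for instance, stratified resampling gives every particle exactly one offspring, whereas the multinomial counts still fluctuate.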

I have two other comments on particle Gibbs sampling. Firstly it seems that how you initialize the conditional SMC update is important. Naive strategies of sampling particles from the prior will lead to many initial particles being sampled in poor areas of the state space. Care needs to be taken even when a better proposal for the initial particles is used, as initially particles in the mode of this proposal will have small weights and resampling may remove many of these particles sampled unless the resampling probabilities are chosen appropriately (Chen et al., 2005; Fearnhead, 2008). Also, within particle Gibbs sampling are there extra ways of learning a good proposal for the initial particles: could you learn this from the history of the MCMC run or use the information of the conditioned path?

Secondly, you can use particle Gibbs sampling to update jointly parameters and the states by using a conditional SMC update with particles being both the state of the system and the parameters. Again care needs to be taken in terms of the proposal distribution for the parameters, and how resampling is done. On the above toy example it was possible to use the conditional SMC update to sample jointly new X1:T- and σ-values—with moves where σ changed by more than an order of magnitude more than that of a Gibbs sampler which updates σ|X1:T. For such an implementation, can you use MCMC methods within the conditional SMC update (Fearnhead, 1998, 2002; Gilks and Berzuini, 2001; Storvik, 2002)?

It feels like the theory behind the efficiency of particle Gibbs sampling may be very different from that for the other particle MCMC methods. Whereas the latter seems related to the variance of the SMC estimates of the marginal likelihood, the efficiency of particle Gibbs sampling seems related to rates of coalescences of paths in the conditional SMC update (and reminiscent of Kingman (1982)). Are these related, or is there a fundamental difference between particle Gibbs and other particle MCMC methods?

This has been a fascinating paper, and I look forward to seeing future developments and applications of particle MCMC methods. It gives me great pleasure to propose the vote of thanks.

Simon Godsill (University of Cambridge)

In seconding the vote of thanks on this paper I congratulate the authors on a fine contribution to Bayesian computational methods. The techniques that they propose allow us to combine, in a principled way, the two most successful tools that we currently have available in this field: the particle filter and the Markov chain Monte Carlo (MCMC) method. Other attempts in this area have focused on incorporating MCMC methods into sequential updating with particle filters. The current contribution, however, introduces the full power of particle filters into batch MCMC schemes. This has been done before, using empirical justifications (see Fernandez-Villaverde and Rubio-Ramirez (2007), which implements precisely the particle marginal Metropolis–Hastings update), but here we have a full theoretical justification for this usage, which will reassure practitioners and should increase the uptake of such methods widely. By adopting a fully principled approach, which identifies an augmented target distribution which is at the heart of the particle marginal Metropolis–Hastings approach, we gain significant extra mileage, notably through the particle Gibbs algorithm, a method that applies Gibbs sampling to the same augmented target distribution. This particle Gibbs algorithm goes significantly beyond what has been applied before and allows inference in intractable models where the only feasible state sampling approach is particle filtering. It should be highlighted, however, that the approach is one of the most computationally demanding methods proposed to date. In its basic form it requires a full particle filtering run for each iteration of the MCMC algorithm, which for a complex model with many static parameters could prove infeasible. 
The algorithm is also slightly wasteful in that all but one particle and its back-tracking lineage are discarded in each step of the algorithm (even though the discarded samples can be used in the final Monte Carlo estimates, as shown by the authors in Section 4.6). This latter point raises the possibility that one might adapt a parallel chain or population Monte Carlo scheme to the particle MCMC framework, to utilize fully in the MCMC algorithm more than one stream of output from the particle filter.

To conclude, I wonder whether the authors have considered adaptations of their approach which incorporate particle smoothing, both Viterbi style (Godsill et al., 2001) and backward sampling (Godsill et al., 2004). These could improve the quality of the proposals from the particle filter at relatively small cost (at least in the backward sampling case, which is O(NT) per sample path, as for the basic particle filter). This latter approach typically gives better diversity of backward sample paths than those arising from the standard filter output—hence I wonder also whether we can gain something by including multiple path imputations from the smoother into the particle MCMC approach—see my earlier comment about parallel chain or population MCMC methods.

The vote of thanks was passed by acclamation.

Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris)

Two interesting metrics for the influence of a paper read to the Society are

  • (a)the number of previous papers that it affects in some way and
  • (b)the number of interesting theoretical questions that it opens.

In both respects, this paper fares very well.

Regarding (a), in many complicated models the only tractable operations are state filtering and likelihood evaluation; see for example the continuous time model of Chopin and Varini (2007). In such situations, the particle Hastings–Metropolis (PHM) algorithm offers Bayesian estimates ‘for free’, which is very nice.

Similarly, Chopin (2007) (see also Fearnhead and Liu (2007)) formulated change-point models as state space models, where the state xt=(θt,dt) comprises the current parameter θt and the time since last change dt. Then we may use sequential Monte Carlo (SMC) methods to recover the trajectory x1:T, i.e. all the change dates and parameter values. It works well when xt forgets its past sufficiently quickly, but this forbids hierarchical priors for the durations and the parameters. PHM removes this limitation: Chopin's (2007) SMC algorithm may be embedded in a PHM algorithm, where each iteration corresponds to different hyperparameters. This comes at a cost, however, as each Markov chain Monte Carlo (MCMC) iteration runs a complete SMC algorithm.

Regarding (b), several questions, which have already been answered in the standard SMC case, may be asked again for particle MCMC methods. Does residual resampling outperform multinomial resampling? Is the algorithm with N+1 particles strictly better than that with N particles? What happens about Rao–Blackwellization, or the choice of the proposal distribution? One technical difficulty is that marginalizing out components always reduces the variance in SMC sampling, but not in MCMC sampling. Another difficulty is that particle MCMC methods retain only one particle trajectory x1:T; hence the effect of reducing variability between particles is less obvious.

Similarly, obtaining a single trajectory x1:T from a forward filter is certainly much easier than obtaining many of them, but it may still be demanding in some scenarios, i.e. there may be so much degeneracy in x1 that not even one particle contains an x1 in the support of p(x1|y1:T).

Rong Chen (Rutgers University, Piscataway)

It is a pleasure to congratulate the authors on an impressive, timely and important paper. The problem of parameter estimation for complex dynamic systems by using sequential Monte Carlo methods has been known as a very difficult problem. The authors provide a clean and powerful way to deal with such a problem. The method will certainly become a popular and powerful tool for solving complex problems.

I wish to concentrate my discussion on one aspect—the resampling scheme. The current paper seems to insist on resampling by using the current weights (e.g. assumption 2). We note that the procedure proposed actually works for more flexible resampling schemes. In a way, we can view that a flexible resampling scheme is in effect changing the intermediate distributions. More specifically, in the notation of the paper, a flexible resampling scheme operates as follows. At times n=2,…,T, first construct inline image. Then

  • (a)sample inline image
  • (b)sample inline image and set inline image, and
  • (c)compute and normalize the weights

This is not a new idea. For example, Liu (2001) mentioned the use of inline image for some α ∈ (0,1) to reduce the sudden impact of large jumps in the system. Shephard (private conversation) suggested the use of an incremental weight spreading technique,


The auxiliary particle filter of Pitt and Shephard (1999) in a way can be thought of as using


where inline image is a prediction of the future state Xn. Similarly, we can also use delayed sampling (Chen et al., 2000; Wang et al., 2002) and block sampling (Doucet et al., 2006) ideas to design the resampling schemes, bringing in future information in the resampling scheme. Lin et al. (2010) constructed the resampling scores by using backward pilots in generating Monte Carlo samples of diffusion bridges.

The flexible resampling scheme is essentially changing the intermediate distribution γt−1(xt−1) (which is defined in Section 4.1) to


hence all the theoretical properties of standard particle filters work. It also works inside the particle Markov chain Monte Carlo algorithm.
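A flexible resampling step of this kind can be sketched in Python as follows. The sketch assumes priority scores of the tempered form β = w^α, which is one reading of the choice attributed to Liu (2001) above; the function name and the restriction to the resampling step alone (no propagation or incremental weight) are simplifications made for the example.

```python
import random


def flexible_resample(particles, weights, alpha, rng):
    """Resample with priority scores beta_i = w_i ** alpha (assumed form).

    Returns the resampled particles and their corrected, normalized weights.
    Each survivor carries weight proportional to w / beta, which compensates
    for having been selected with probability proportional to beta, not w.
    """
    n = len(particles)
    betas = [w ** alpha for w in weights]
    anc = rng.choices(range(n), weights=betas, k=n)
    new_particles = [particles[a] for a in anc]
    corrected = [weights[a] / betas[a] for a in anc]
    norm = sum(corrected)
    return new_particles, [c / norm for c in corrected]
```

Setting alpha to 1 recovers standard resampling with equal weights afterwards, while alpha equal to 0 selects ancestors uniformly and leaves the original weights attached to the survivors.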

Mark Girolami (University of Glasgow)

This is a potentially very important contribution to Markov chain Monte Carlo (MCMC) methodology. The capabilities of existing MCMC techniques are being severely stretched, in part because of the increasing awareness of the importance of statistical issues surrounding the mathematical modelling of complex stochastic non-linear dynamical systems in areas such as computational finance and biology. The proposed particle Markov chain Monte Carlo (PMCMC) framework of algorithms provides very general and powerful novel methodology which may allow inference to proceed over increasingly complex models in a more efficient manner and as such this is a most welcome addition to the literature.

The use of an approximate posterior to improve proposal efficiency in terms of producing large moves with high probability of acceptance is a strategy that has been demonstrated to great effect in reversible jump MCMC methods where approximate posteriors for model proposals ensure high acceptance of between-model moves (Lopes and West, 2004; Zhong and Girolami, 2009). A similar strategy is to consider a proposal process as the outcome of forward simulation of a stochastic differential equation which has the desired target distribution as its ergodic stationary distribution. Simulating from the stochastic differential equation numerically incurs errors which can then be corrected for, as with PMCMC sampling, by employing the Hastings ratio, e.g. the Metropolis adjusted Langevin algorithm (Roberts and Stramer, 2003). The alternative method is numerically to forward-simulate a deterministic system based on a Hamiltonian and to employ a Metropolis accept–reject step to correct for discrete integration errors, as in the hybrid Monte Carlo methods which have been shown to perform well on high dimensional problems that were similar to those studied in this paper (Neal, 1993; Girolami et al., 2009).

The correctness of the algorithms is established with extensive and detailed proofs; therefore my comments have a practical focus. The strategy that is adopted is to employ an approximate, potentially non-equilibrium sequential Monte Carlo (SMC) procedure to make high dimensional proposals for the Metropolis method. In many ways the issue of designing a proposal mechanism is pushed back to designing importance distributions for the SMC method so that difficulties may yet arise in terms of tuning the SMC parameters to obtain a high rate of acceptance. Sampling from the joint posterior p(θ,x1:T|y1:T) within the PMCMC framework may still require the undesirable design of a proposal for the parameters θ as employed in the particle marginal Metropolis–Hastings sampler although the particle Gibbs sampler employing conditional SMC updates appears a promising though largely untested alternative.

Nick Whiteley (University of Bristol)

I offer my thanks to the authors for an inspirational paper. Their approach to constructing extended target distributions is powerful and can be exploited further and applied elsewhere. A key ingredient is the elucidation of the probability model underlying a sequential Monte Carlo (SMC) algorithm and the genealogical tree structures that it generates. Two further developments on this theme are described below.

Firstly, at the end of one conditional SMC run in the particle Gibbs algorithm, the authors suggest sampling K from its full conditional under inline image, then deterministically tracing back the ancestral lineage of inline image, to yield


There is an alternative. Having sampled K, for n=T−1,…,1, we could sample from


with inline image defined as before according to expression (44), but with newly sampled ancestor indices.

The advantage of this ‘backward’ sampling is that it enables exploration of all possible ancestral lineages and not only those obtained during the ‘forward’ SMC run. This offers a chance to circumvent the path degeneracy phenomenon and to obtain a faster mixing particle Gibbs kernel, albeit at a slightly increased computational cost.

When p(x1:T,y1:T) arises from a state space model, it is straightforward to verify that


which uses the importance weights that are obtained during the forward SMC run. In this case, the above procedure coincides with one draw using the smoothing method of Godsill et al. (2004).
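The backward pass for a state space model can be sketched as follows. This is only an illustration under stated assumptions: the model is a hypothetical scalar AR(1) state observed in Gaussian noise, the forward pass is a bootstrap filter, and the names `bootstrap_filter`, `backward_sample` and `normalize` are not from the paper. Each backward draw reweights the stored forward weights by the transition density towards the state already selected at the next time step.

```python
import math
import random


def normalize(logw):
    """Turn log-weights into probabilities without underflow."""
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    s = sum(w)
    return [wi / s for wi in w]


def bootstrap_filter(ys, n, phi, sigma_x, sigma_y, rng):
    """Forward SMC run; returns particles[t][i] and log-weights[t][i]."""
    particles, logweights = [], []
    sd0 = sigma_x / math.sqrt(1.0 - phi * phi)  # stationary initial law
    xs = [rng.gauss(0.0, sd0) for _ in range(n)]
    for t, y in enumerate(ys):
        if t > 0:
            probs = normalize(logweights[-1])
            anc = rng.choices(range(n), weights=probs, k=n)
            xs = [phi * particles[-1][a] + sigma_x * rng.gauss(0.0, 1.0)
                  for a in anc]
        particles.append(xs)
        logweights.append([-0.5 * ((y - x) / sigma_y) ** 2 for x in xs])
    return particles, logweights


def backward_sample(particles, logweights, phi, sigma_x, rng):
    """Draw one trajectory by backward sampling over all stored particles."""
    n = len(particles[0])
    k = rng.choices(range(n), weights=normalize(logweights[-1]))[0]
    path = [particles[-1][k]]
    for t in range(len(particles) - 2, -1, -1):
        x_next = path[0]
        # Forward weight times transition density f(x_next | x_t^i).
        logbw = [logweights[t][i]
                 - 0.5 * ((x_next - phi * particles[t][i]) / sigma_x) ** 2
                 for i in range(n)]
        k = rng.choices(range(n), weights=normalize(logbw))[0]
        path.insert(0, particles[t][k])
    return path
```

The returned path need not coincide with any ancestral lineage of the forward run, which is the source of the improved mixing; the cost of each backward draw is of order N per time step.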

Secondly, I believe that the particle Markov chain Monte Carlo framework can be adapted to accommodate the particle filter of Fearnhead and Clifford (2003), which is somewhat different from the SMC algorithm that is considered in the present paper. Owing to constraints on space I provide no specifics here, but I believe that suitable formulation of the probability model underlying the algorithm of Fearnhead and Clifford (2003) allows it to be manipulated as part of a particle Markov chain Monte Carlo algorithm.

Gareth Roberts (University of Warwick, Coventry)

I add my congratulations to the authors for this path breaking work. In this discussion, I shall expand on comments in the paper linking the methods introduced to a generic framework for Markov chain Monte Carlo (MCMC) methods which can be applied to missing data problems and other situations where the target density is unavailable but can be estimated unbiasedly by using an auxiliary variable construction. This work can be found in Andrieu and Roberts (2009), generalizing an idea that was introduced in Beaumont (2003).

For MCMC sampling, enlargement of state spaces comes at a price. Consider, for instance, an ‘optimized’ Metropolis–Hastings algorithm on π(θ,z). Typically this converges slower than its rival counterpart on the marginalized distribution π(θ). This suggests that we might mimic the marginalized algorithm through Monte Carlo sampling. Here I shall describe the simplest version of the pseudomarginal approach.

Choose Z ∈ ℝ^N, with components IID from q, and set


Consider two options for using inline image within an MCMC framework: Monte Carlo within Metropolis and generalized importance Metropolis–Hastings.

Step | Marginal | Monte Carlo within Metropolis | Generalized importance Metropolis–Hastings
0: given | θ and π(θ) | θ and π(θ) | θ, Z and inline image
1: sample | θ* ∼ q(θ,·) | θ* ∼ q(θ,·) | θ* ∼ q(θ,·)
 |  | inline image | inline image
2: compute | π(θ*) | inline image and inline image | inline image
3: compute r | inline image | inline image | inline image
4: with probability inline image | ϑ=θ* | ϑ=θ* | ϑ=θ*, Z=Z*
 otherwise | ϑ=θ | ϑ=θ | ϑ=θ, Z=Z

The Monte Carlo within Metropolis approach biases the MCMC algorithm so that the marginal stationary distribution of θ under the scheme is typically not π (if it exists at all). However, the generalized importance Metropolis–Hastings approach has the following invariant distribution:


The θ-marginal of this chain is π(θ).

Thus there is no Monte Carlo bias in generalized importance Metropolis–Hastings sampling (though of course there is still Monte Carlo error) and, under weak regularity conditions, as N→∞ the algorithm ‘converges’ to the true marginal algorithm.
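A minimal sketch of generalized importance Metropolis–Hastings sampling on a toy problem may help to fix ideas. Everything specific here is an assumption made for illustration: the joint density (θ standard normal, z given θ normal with mean θ), the importance density q (normal with standard deviation 2), the symmetric random-walk proposal and the function names. The essential point is that the estimate attached to the current state is stored and reused, not recomputed, when the next proposal is assessed.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def joint_density(theta, z):
    # Toy pi(theta, z): theta ~ N(0, 1) and z | theta ~ N(theta, 1),
    # so the theta-marginal is exactly N(0, 1).
    return normal_pdf(theta, 0.0, 1.0) * normal_pdf(z, theta, 1.0)


def estimate_marginal(theta, n, rng, q_sd=2.0):
    """Unbiased importance sampling estimate of pi(theta) from n draws of q."""
    zs = [rng.gauss(0.0, q_sd) for _ in range(n)]
    return sum(joint_density(theta, z) / normal_pdf(z, 0.0, q_sd) for z in zs) / n


def gimh(n_iter, n, step, rng):
    theta = 0.0
    pi_hat = estimate_marginal(theta, n, rng)
    chain = []
    for _ in range(n_iter):
        theta_new = theta + step * rng.gauss(0.0, 1.0)  # symmetric proposal
        pi_hat_new = estimate_marginal(theta_new, n, rng)
        # Accept with probability min(1, pi_hat_new / pi_hat); the stored
        # pi_hat is carried along with theta and never refreshed.
        if rng.random() * pi_hat < pi_hat_new:
            theta, pi_hat = theta_new, pi_hat_new
        chain.append(theta)
    return chain
```

Recomputing `pi_hat` for the current θ at every iteration would turn this into the Monte Carlo within Metropolis scheme, whose stationary distribution is in general not π.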

Drawing Z as an independent and identically distributed sample can be significantly improved on, e.g. by letting Z denote a sample path of a Markov chain with invariant distribution π(z|θ) (or even a particle approximation as in the present paper).

Andrieu and Roberts (2009) applies this idea in simple examples and explores some of the theoretical properties of the method. One important and promising application of the idea involves a substantial generalization of reversible jump MCMC sampling which improves the potentially problematic step of choosing appropriate between-dimension moves.

In modified form, this construction is also an ‘exact’ and efficient computational solution to doubly intractable problems (see Andrieu et al. (2008)),


for unknown K(·) as well as θ.

Miguel A. G. Belmonte (University of Warwick, Coventry) and Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)

We congratulate the authors for a remarkable paper, which addresses a problem of fundamental practical importance: parameter estimation in state space models by using sequential Monte Carlo (SMC) algorithms. In Belmonte et al. (2008) we fit duration state space models to high frequency transaction data and we require a computational methodology that can handle efficiently time series of length inline image. We have experimented with particle Markov chain Monte Carlo (PMCMC) methods and with the smooth particle filter (SPF) of Pitt (2002). The latter is also based on the use of SMC algorithms to derive maximum likelihood parameter estimates; it is, however, limited to scalar signals. Therefore, in the context of duration modelling this limitation rules out multifactor or multi-dimensional models, and we believe that PMCMC methods can be very useful in such cases.

In this contribution we present a preliminary simulation study which contrasts particle marginal Metropolis–Hastings (PMMH), particle Gibbs (PG) and the SPF methods on simulated data from a linear single-factor state space model:


Parameter values are set to μ=0.75, φ=0.95 and inline image and various values for T and the signal-to-noise ratio inline image are tried. When Bayesian inference with PMCMC sampling is made for μ, an improper flat prior is used. We adopt a pragmatic point of view according to which the practitioner, especially for a small number of parameters, is indifferent between maximum likelihood and Bayesian inference but is mostly concerned about the computational efficiency of the methods. Our simulation and prior specification set-up is such that the posterior mean and precision estimates coincide with the maximum likelihood and observed information estimates respectively, and the exact values are available by using the Kalman filter (KF). The bootstrap filter is used in all the SMC algorithms.

For various values of T, Table 1 shows a comparison of parameter estimates by the particle methods and KF. In this problem the SPF and PG methods show remarkable robustness to the length of the series in terms of the accuracy of the estimates. The mixing time of the latter does not show deterioration with T (note that the mixing time of the limiting algorithm with T=∞ does not arbitrarily deteriorate with T either; see Papaspiliopoulos et al. (2003) for details). We also varied the signal-to-noise ratio and report our findings in Table 2.

Table 1.   Comparison of estimates by the SPF, PMMH and PG methods against the KF for various T†

Results for the following values of T (one column per value):

KF
inline image | 0.658 | 0.826 | 0.417 | 0.605 | 0.757 | 0.752 | 0.759
inline image | −93.75 | −208.44 | −502.56 | −1031.78 | −2037.27 | −5132.14 | −10211.19
inline image | 0.441 | 0.255 | 0.112 | 0.058 | 0.030 | 0.012 | 0.006
SPF
Relative error | −0.120 | 0.019 | 0.049 | −0.011 | 0.058 | 0.008 | −0.007
Likelihood difference | −0.0110 | −0.0122 | −0.0037 | −0.0004 | −0.0125 | −0.0014 | −0.0000
Ratio of variance | 0.999 | 1.000 | 0.954 | 0.984 | 0.968 | 0.983 | 0.982
PMMH
Relative error | 0.056 | 0.024 | −0.007 | −0.030 | −0.023 | 0.070 | −0.022
Likelihood difference | −0.0015 | −0.0008 | −0.0000 | −0.0027 | −0.0049 | −0.0972 | −0.0222
Ratio of variance | 1.009 | 0.980 | 0.888 | 0.989 | 1.280 | 1.151 | 1.352
Acceptance probability | 0.606 | 0.409 | 0.217 | 0.071 | 0.034 | 0.004 | 0.003
PG
Relative error | 0.0163 | −0.0026 | −0.0115 | −0.0041 | 0.0007 | −0.0020 | −0.0010
Likelihood difference | −0.0001 | −0.0000 | −0.0001 | −0.0000 | −0.0000 | −0.0000 | −0.0000
Ratio of variance | 0.985 | 0.990 | 0.998 | 0.986 | 0.980 | 0.994 | 0.989

†The particle algorithms set N=500, with inline image. Exact estimates are reported for the KF. For the particle methods we compute the relative error inline image, log-likelihood difference inline image and the ratio of variance inline image. l(·) denotes the exact KF log-likelihood. Efficiency for the PMMH and PG methods is measured by the following approximation to the integrated auto-correlation time inline image, where inline image is the MCMC sample correlation at lag 1.

Table 2.   Comparison of estimates by the SPF, PMMH and PG methods against the KF for combinations of signal-to-noise ratio†

Results for the following signal-to-noise ratios (one column per value):

KF
inline image | 0.537 | 0.559 | 0.582 | 0.608 | 0.641 | 0.695
inline image | 0.128 | 0.104 | 0.080 | 0.056 | 0.031 | 0.007
SPF
Relative error | 0.211 | 0.000 | −0.063 | −0.056 | −0.000 | 0.004
Likelihood difference | −0.0501 | −0.0000 | −0.0085 | −0.0103 | −0.0000 | −0.0005
Ratio of variance | 0.918 | 0.927 | 0.940 | 0.942 | 0.966 | 0.985
PMMH
Relative error | −0.108 | 0.109 | 0.027 | 0.008 | −0.014 | −0.003
Likelihood difference | −0.0131 | −0.0178 | −0.0016 | −0.0002 | −0.0012 | −0.0002
Ratio of variance | 1.574 | 1.049 | 1.101 | 1.058 | 1.071 | 0.961
Acceptance probability | 0.023 | 0.097 | 0.159 | 0.218 | 0.231 | 0.161
Centred PG
Relative error | −0.015 | 0.004 | 0.004 | 0.003 | 0.002 | 0.001
Likelihood difference | −0.0002 | −0.0000 | −0.0000 | −0.0000 | −0.0000 | −0.0001
Ratio of variance | 0.988 | 0.988 | 0.989 | 0.987 | 0.990 | 0.990
Non-centred PG
Relative error | −0.093 | −0.099 | −0.436 | 0.181 | 0.047 | 0.010
Likelihood difference | −0.0098 | −0.0148 | −0.4028 | −0.1083 | −0.0141 | −0.0032
Ratio of variance | 0.000 | 0.018 | 0.345 | 1.356 | 1.088 | 1.006

†The total variance inline image is fixed at 0.35. The larger the ratio the larger the observation variance is. T=1000 observations and N=1000 particles. The non-centred PG algorithm subtracts proposed μ from the trajectory drawn from the smoothing density p(x0:T|μ,y0:T).

We also consider two different parameterizations under which we applied PG sampling: the so-called centred (X1,…,XT,θ) and non-centred (X1−θ,…,XT−θ,θ); see Papaspiliopoulos et al. (2003). When the state has such high persistence it is known (Papaspiliopoulos et al., 2003) that the centred Gibbs sampler (for T=∞) has better mixing. The robustness of PG sampling is again very promising. Note that the SPF and PMMH methods have worse performance for small values of the ratio, which is due to the deterioration of the bootstrap filter with decreasing observation error. This deterioration appears to have no effect on PG sampling in this simple setting.

Krzysztof Łatuszyński (University of Toronto) and Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)

We congratulate the authors for a beautiful paper. A fundamental idea is the interplay between unbiased estimation (by means of importance sampling in this paper) and exact simulation. We show how unbiased estimation relates to exact simulation of events of unknown probability s ∈ [0,1]. Details, proofs and an application to the celebrated Bernoulli factory problem (Nacu and Peres, 2005) can be found in Łatuszyński et al. (2009).

We wish to simulate the binary random variable Cs such that P[Cs=1]=s. If inline image is a realizable unbiased estimator of s taking values in [0,1], we use the following algorithm 1.

  • Step 1: simulate G0 ∼ U(0,1).

  • Step 2: obtain inline image

  • Step 3: if inline image set Cs:=1; otherwise set Cs:=0.

  • Step 4: output Cs.
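Algorithm 1 can be sketched in a few lines. The comparison in step 3 is not legible in the text above; the sketch assumes that it is G0 ⩽ ŝ, where ŝ denotes the realized estimate, which gives P[Cs=1]=E[ŝ]=s. The function name and the toy estimator in the usage note are illustrative assumptions.

```python
import random


def bernoulli_from_unbiased(estimator, rng):
    """Return C_s with P[C_s = 1] = s, given an unbiased estimator of s in [0, 1]."""
    g0 = rng.random()        # Step 1: G0 ~ U(0, 1)
    s_hat = estimator(rng)   # Step 2: one realization of the estimator
    # Step 3 (assumed form): accept when G0 does not exceed the estimate.
    return 1 if g0 <= s_hat else 0
```

For example, with `estimator = lambda rng: rng.uniform(0.0, 0.6)`, which has mean s=0.3, the long-run frequency of ones is approximately 0.3 although s itself is never evaluated.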

If l1,l2,… and u1,u2,… are sequences of lower and upper bounds converging monotonically to s then we can resort to the following algorithm 2.

  • Step 1: simulate G0 ∼ U(0,1); set n=1.

  • Step 2: compute ln and un.

  • Step 3: if G0 ⩽ ln set Cs:=1.

  • Step 4: if G0 > un set Cs:=0.

  • Step 5: if ln < G0 ⩽ un set n:=n+1 and go to step 2.

  • Step 6: output Cs.
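A sketch of algorithm 2 follows, with an illustrative choice of bounds that is not from the source: for s=exp(−a) with 0<a⩽1, the odd and even partial sums of the alternating exponential series give lower and upper bounds that converge monotonically to s. The function names and the iteration cap are assumptions.

```python
import math
import random


def exp_neg_bounds(a, n):
    """Bounds l_n < exp(-a) < u_n from partial sums S_(2n-1) and S_(2n), 0 < a <= 1."""
    term, partial = 1.0, 1.0
    sums = [partial]                      # S_0
    for k in range(1, 2 * n + 1):
        term *= -a / k
        partial += term
        sums.append(partial)
    return sums[2 * n - 1], sums[2 * n]


def bernoulli_from_bounds(bounds, rng, max_iter=1000):
    """Simulate C_s by refining the bounds until G0 falls outside (l_n, u_n]."""
    g0 = rng.random()                     # Step 1
    for n in range(1, max_iter + 1):
        lower, upper = bounds(n)          # Step 2
        if g0 <= lower:                   # Step 3
            return 1
        if g0 > upper:                    # Step 4
            return 0
        # Step 5: otherwise refine the bounds and repeat.
    raise RuntimeError("bounds did not separate G0 within max_iter refinements")
```

The number of refinements is random but is small unless G0 falls very close to s.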

We can combine these ideas to have unbiased estimators Ln and Un of ln and un. The estimators live on the same probability space and have the following properties:




Under these assumptions we can use the following algorithm 3.

  • Step 1: simulate G0 ∼ U(0,1); set n=1.

  • Step 2: obtain Ln and Un given inline image

  • Step 3: if G0 ⩽ Ln set Cs:=1.

  • Step 4: if G0 > Un set Cs:=0.

  • Step 5: if Ln < G0 ⩽ Un set n:=n+1 and go to step 2.

  • Step 6: output Cs.

The final step is to weaken condition (49) and to let Ln be a reverse time supermartingale and Un a reverse time submartingale with respect to inline image. Precisely, assume that for every n=1,2,… we have


Consider the following algorithm 4, which uses auxiliary random sequences inline image and inline image constructed on line.

  • Step 1: simulate G0 ∼ U(0,1); set n=1; set inline image

  • Step 2: obtain Ln and Un given inline image

  • Step 3: compute inline image

  • Step 4: compute (equation (52))

  • Step 5: if inline image set Cs:=1.

  • Step 6: if inline image set Cs:=0.

  • Step 7: if inline image set n:=n+1 and go to step 2.

  • Step 8: output Cs.

Thomas Flury and Neil Shephard (University of Oxford)

We congratulate Christophe Andrieu, Arnaud Doucet and Roman Holenstein for this important contribution to the sequential Monte Carlo and Markov chain Monte Carlo (MCMC) literature. At the base of their paper is the deceivingly simple looking idea of combining two powerful and well-known Monte Carlo algorithms to create a truly Herculean tool for statisticians. They use sequential Monte Carlo methods to generate high dimensional proposal distributions for MCMC algorithms.

We focus our discussion on one very specific insight: one can use an unbiased simulation-based estimator of the likelihood inside an MCMC algorithm to perform Bayesian inference. For dynamic models this estimator is obtained from a standard particle filter. Importantly, this means that the particle filter now offers a complete extension of the Kalman filter: it can carry out filtering and now direct parameter estimation.

We are particularly impressed with the minimalistic assumptions that we need to perform likelihood-based inference in dynamic non-linear and non-Gaussian state space models, which is of great interest for microeconometrics, macroeconometrics and financial econometrics. In the particle marginal Metropolis–Hastings algorithm we only need to be able to evaluate the measurement density and to sample from the state transition density. Another advantage is that we do not need an infinite number of simulation draws for consistency: all theoretical results hold from as little as N ⩾ 1 particles. Practical implementation is also very easy as one only needs to change very few lines of code to estimate a different model.
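The likelihood estimator in question can be sketched as a bootstrap particle filter that touches the model only through three user-supplied functions. The interface below (`sample_initial`, `sample_transition`, `measurement_density`) is a hypothetical one chosen for illustration. The product of the average unnormalized weights is an unbiased estimate of the likelihood for any N ⩾ 1; the function returns its logarithm.

```python
import math
import random


def particle_loglik(ys, n_particles, sample_initial, sample_transition,
                    measurement_density, rng):
    """Log of the bootstrap filter estimate of p(y_1:T)."""
    xs = [sample_initial(rng) for _ in range(n_particles)]
    loglik = 0.0
    w = []
    for t, y in enumerate(ys):
        if t > 0:
            # Multinomial resampling, then propagation through the transition.
            ancestors = rng.choices(xs, weights=w, k=n_particles)
            xs = [sample_transition(x, rng) for x in ancestors]
        # Only point evaluations of the measurement density are required.
        w = [measurement_density(y, x) for x in xs]
        mean_w = sum(w) / n_particles
        if mean_w <= 0.0:
            return -math.inf
        loglik += math.log(mean_w)
    return loglik
```

Changing the model amounts to passing different functions; the filter itself is untouched, which is the sense in which very few lines of code need to change.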

In Flury and Shephard (2010) we showed the power of this method on four famous examples in econometrics. Other applications, such as in repeated auctions, will also become important. Our experience is that these methods work, are quite simple to implement, general purpose and highly computationally demanding. The last point is important; they take so long to run that it is tempting to use the phrase ‘computationally brutal’.

Christian P. Robert and Pierre Jacob (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris), Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris) and Håvard Rue (Norwegian University for Science and Technology, Trondheim)

We congratulate the authors for opening a new vista for running Markov chain Monte Carlo (MCMC) algorithms in state space models. Being able to devise a correct Markovian scheme based on a particle approximation of the target distribution is a genuine tour de force that deserves enthusiastic recognition! This is all the more impressive when considering that the ratio


is not unbiased and thus invalidates the usual importance sampling solutions, as demonstrated by Beaumont et al. (2009). Thus, the resolution of simulating by conditioning on the lineage truly is an awesome resolution of the problem!

We implemented the particle Hastings–Metropolis algorithm for the (notoriously challenging) stochastic volatility model


based on 500 simulated observations. With parameter moves


and state space moves derived from the auto-regressive AR(1) prior, we obtained good mixing properties with no calibration effort, using N=10² particles and 10⁴ Metropolis–Hastings iterations, as demonstrated by Figs 9 and 10. Other runs (which are not reproduced here) exhibited multimodal configurations that the particle MCMC algorithm managed to handle satisfactorily within 10⁴ iterations.

Figure 9.

 Evolution of (a) the parameter simulations for μ, ρ and σ, plotted against iteration indices, and (b) the estimated acceptance rate of the particle MCMC algorithm, obtained with N=10² particles and 10⁴ Metropolis–Hastings iterations and a simulated sequence of 500 observations with true values μ0=1, ρ0=0.9 and σ0=0.5

Figure 10.

 (a)–(c) Auto-correlation and (d)–(f) corresponding pairwise graphs for the μ-, ρ- and σ-sequences for the same target as in Fig. 9

Our computer program (which is available at may be adapted to any state space model by simply rewriting two lines of code, namely those that

  • (a) compute p(yt|xt) and
  • (b) simulate xt+1|xt.

Contemplating a different model does not even require the calculation of full conditionals, in contrast with Gibbs sampling. Another advantage of the particle Hastings–Metropolis algorithm is that it is trivial to parallelize. (Adding a comment before the loop over the particle index is enough, by using the OpenMP technology.)
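The two-function structure can be illustrated by the following sketch, which is not the program referred to above. It assumes a textbook stochastic volatility form (log-volatility following an AR(1) process around μ with persistence ρ and scale σ, and observations that are normal with variance exp(xt)), updates ρ only by a random walk under a flat prior on (−1,1), and uses hypothetical function names throughout. The model enters only through `measurement_density` and `sample_transition`.

```python
import math
import random


def measurement_density(y, x):
    """(a) p(y_t | x_t): y_t ~ N(0, exp(x_t)) in the assumed model."""
    var = math.exp(x)
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def sample_transition(x, theta, rng):
    """(b) x_{t+1} | x_t: AR(1) around mu in the assumed model."""
    mu, rho, sigma = theta
    return mu + rho * (x - mu) + sigma * rng.gauss(0.0, 1.0)


def particle_loglik(ys, theta, n_particles, rng):
    """Bootstrap filter estimate of the log-likelihood at theta."""
    mu, rho, sigma = theta
    sd0 = sigma / math.sqrt(1.0 - rho * rho)  # stationary initial law
    xs = [mu + sd0 * rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik, w = 0.0, []
    for t, y in enumerate(ys):
        if t > 0:
            xs = [sample_transition(x, theta, rng)
                  for x in rng.choices(xs, weights=w, k=n_particles)]
        w = [measurement_density(y, x) for x in xs]
        mean_w = sum(w) / n_particles
        if mean_w <= 0.0:
            return -math.inf
        loglik += math.log(mean_w)
    return loglik


def pmmh_rho(ys, mu, sigma, rho0, n_iter, n_particles, step, rng):
    """Particle Metropolis-Hastings chain for rho with mu and sigma held fixed."""
    rho = rho0
    ll = particle_loglik(ys, (mu, rho, sigma), n_particles, rng)
    chain, accepted = [], 0
    for _ in range(n_iter):
        rho_new = rho + step * rng.gauss(0.0, 1.0)
        if -1.0 < rho_new < 1.0:  # flat prior on (-1, 1)
            ll_new = particle_loglik(ys, (mu, rho_new, sigma), n_particles, rng)
            # The stored estimate ll is reused, as in the pseudomarginal scheme.
            if rng.random() < math.exp(min(0.0, ll_new - ll)):
                rho, ll = rho_new, ll_new
                accepted += 1
        chain.append(rho)
    return chain, accepted / n_iter
```

The list comprehensions over the particle index are the loops that would be parallelized; in pure Python this sketch is slow and is intended only to show the structure.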

Finally, we mention possible options for a better recycling of the numerous simulations that are produced by the algorithm. This dimension of the algorithm deserves deeper study, maybe to the extent of allowing for a finite time horizon overcoming the MCMC nature of the algorithm, as in the particle Monte Carlo solution of Cappé et al. (2008).

A more straightforward remark is that, owing to the additional noise that is brought by the resampling mechanism, more stable recycling would be produced both in the individual weights wn(X1:n) by Rao–Blackwellization of the denominator in equation (7) as in Iacobucci et al. (2009) and over past iterations by a complete reweighting scheme like AMIS (Cornuet et al., 2009). Another obvious question is whether or not the exploitation of the wealth of information that is provided by the population simulations is manageable via adaptive MCMC methods (Andrieu and Robert, 2001; Roberts and Rosenthal, 2009).

Finally, since


is an unbiased estimator of p(y1:T), there must be direct implications of the method towards deriving better model choice strategies in such models, as exemplified in the population Monte Carlo method of Kilbinger et al. (2009) in a cosmology setting.

The following contributions were received in writing after the meeting.

Anindya Bhadra (University of Michigan, Ann Arbor)

The authors present an elegant theory for novel methodology which makes Bayesian inference practical on implicit models. I shall use their example, a sophisticated financial model involving a continuous time stochastic volatility process driven by Lévy noise, to compare their methodology with a state of the art non-Bayesian approach. I applied iterated filtering (Ionides et al., 2006, 2010) implemented via the mif function in the R package pomp (King et al., 2008).

Fig. 11 shows some results from applying the iterated filtering algorithm with 1000 particles to the simulation study that is described by the authors in Section 3.2. If θ denotes the parameter vector of interest, the algorithm generates a sequence of parameter estimates inline image converging to the maximum likelihood estimate inline image. As a diagnostic, the log-likelihood of inline image is plotted against i (Fig. 11(a)). We see that the sequence of log-likelihoods rapidly converges. On simulation studies like this, a quick check for successful maximization is to observe that the maximized log-likelihood typically exceeds the log-likelihood at the true parameter value by approximately half the number of estimated parameters (Fig. 11(a)). We can also check for successful local maximization by sliced likelihood plots (Figs 11(b)–11(e)), in which the likelihood surface is explored along one of the parameters, keeping the other parameters fixed at the estimated local maximum. The likelihood surface is seen to be flat as λ varies, which is consistent with the authors’ observation that parameter combinations are weakly identified in this model. A profile likelihood analysis could aid the investigation of the identifiability issue. Owing to the quick convergence of iterated filtering with a relatively small number of particles, many profile likelihood plots can be generated at the computational expense of, say, one Markov chain Monte Carlo run of length 50000.

Figure 11.

 Diagnostic plots for iterated filtering: (a) likelihood at each iteration, evaluated by sequential Monte Carlo sampling (inline image, likelihood at the truth); (b)–(e) likelihood surface for each parameter sliced through the maximum (?, parameter values, where the likelihoods were evaluated; |, maximum likelihood estimate; inline image, true parameter value)

The decision about whether one wishes to carry out a Bayesian analysis should depend on whether one wishes to impose a prior distribution on unknown parameters. Here, I have shown that likelihood-based non-Bayesian methodology provides a computationally viable alternative to the authors’ Bayesian approach for complex dynamic models.

Luke Bornn and Aline Tabet (University of British Columbia, Vancouver)

We congratulate the authors on this very important contribution to stochastic computation in statistics. Whereas the authors have explored and discussed several applications in the paper, we would like to highlight the benefits of using particle Markov chain Monte Carlo (PMCMC) methods as a way to extend sequential Monte Carlo (SMC) methods which employ sequences of distributions of static dimension. Through PMCMC sampling, we can separate the variables of interest into those which may be easily sampled by using traditional MCMC techniques and those which require a more specialized SMC approach. Consider for instance the use of simulated annealing in an SMC framework (Neal, 2001; Del Moral et al., 2006). Rather than finding the maximum a posteriori estimate of all parameters, PMCMC sampling now allows practitioners to combine annealing with traditional MCMC methods to maximize over some dimensions while simultaneously exploring the full posterior in others.

When variables are highly correlated, SMC methods may be used as an efficient alternative to MCMC sampling. For instance, SMC samplers (Del Moral et al., 2006) and other population-based methods (Jasra et al., 2007) proceed by working through a sequence of auxiliary distributions until a particle-based approximation to the posterior is reached. In non-identifiable or weakly identifiable models, SMC sampling is used to construct a sequence of tempered distributions allowing particles to explore fully the resulting ridges in the posterior surface of the non-identifiable variables. However, because SMC algorithms often rely on importance sampling, they can suffer in high dimensions owing to increased variability in the importance weights. Many non-identifiable models contain only a small portion of variables with identifiability issues, and hence it may be adding unnecessary complication to build the tempered distributions in all dimensions. In this case, PMCMC sampling gives the option to explore some parameters by using MCMC sampling while exploring others (such as those which are highly correlated or non-identifiable) with SMC sampling, and hence limit variance in the SMC importance weights. There are several options for performing this in the PMCMC framework: both the particle Gibbs and the particle Metropolis–Hastings variants could be used; the choice largely depends on the correlation between the identifiable and non-identifiable subsets of variables. In conclusion, we feel that, as much as PMCMC sampling provides Monte Carlo solutions to a unique class of problems, it also provides a flexible framework allowing practitioners to mix and match Monte Carlo strategies to suit their particular application.

Olivier Cappé (Telecom ParisTech and Centre National de la Recherche Scientifique, Paris)

I congratulate the authors for this impressive piece of work which, I believe, is a very significant contribution to the toolbox of Markov chain Monte Carlo and sequential Monte Carlo (SMC) methods.

For brevity, I focus on the particle independent Metropolis–Hastings (PIMH) algorithm which is the basic building block for the other samplers that are presented in the paper. Although theorem 2 also covers the more involved case of SMC sampling, the core idea is the auxiliary construction which shows that a proper Markov chain Monte Carlo algorithm may be obtained from sampling–importance resampling (Rubin, 1987), irrespective of the number N of particles. This idea, however, does seem to be quite different both from the multiple-try (Liu et al., 2000) and the pseudomarginal (Beaumont, 2003) approaches and I encourage the authors to discuss in more detail its connections, if any, with earlier ideas in the literature.

Fig. 3 (in Section 3.1) is very promising as it suggests that the approach is practicable in large dimensional settings for which a ‘causal’ factorization of the likelihood is available. In particular, I wonder whether it is possible to predict the relationship between the dimension T and the number N of particles that is implicit in Fig. 3. In an attempt to answer this question, I conducted a toy numerical experiment in the spirit of the scaling construction (Roberts and Rosenthal, 2001), where the target πT is a product probability density function and SMC sampling is also carried out by using successive independent proposals; clearly, the latter situation is very specific, although it satisfies assumptions 1–4 that were made in the paper. In this example, any method based on direct importance sampling, including the PIMH algorithm using an SMC algorithm without resampling (i.e. sequential importance sampling), is bound to fail for all feasible values of N when T is larger than, say, 17 (see the caption of Fig. 12). In contrast, Fig. 12 shows that the PIMH algorithm using an embedded SMC algorithm with resampling at each step (as described in Section 2.2.1) can cope with dimensions as large as T=10³. In addition, Fig. 12 also suggests that increasing N as O(T) is sufficient to stabilize the acceptance rate. I would be happy to hear the authors’ comments on whether the behaviour of PIMH sampling in this simple scenario can be inferred from known results about SMC methods regarding the rate of convergence of inline image as N increases.

Figure 12.

 PIMH acceptance rate as a function of the dimension T of the target and the number N of particles: the target probability density function is inline image, where π is the normal probability density function truncated to the range [−4, 4]; the SMC proposal ‘kernel’ q is an independent proposal, uniformly distributed in the range [−4, 4]; to assess the difficulty of the simulation task, note that for direct self-normalized importance sampling targeting πT the effective sample size statistic (Kong et al., 1994), normalized by N, tends to 2.26^(−T) (inline image) as N increases, which is about 10⁻⁶ for T=17
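The constant 2.26 in the caption can be checked numerically. For self-normalized importance sampling the normalized effective sample size tends to 1/E_q[w²], where w=π/q is the normalized weight, and for a product target this factorizes over the T coordinates. The sketch below, whose function name and grid size are illustrative choices, evaluates the one-dimensional factor ∫π(x)²/q(x)dx by the midpoint rule.

```python
import math


def second_moment_ratio(half_width=4.0, n_grid=20000):
    """Midpoint-rule value of the integral of pi(x)^2 / q(x) over [-4, 4]."""
    # Normalizing constant of the standard normal truncated to [-4, 4].
    z = math.erf(half_width / math.sqrt(2.0))
    q = 1.0 / (2.0 * half_width)          # uniform proposal density
    h = 2.0 * half_width / n_grid
    total = 0.0
    for i in range(n_grid):
        x = -half_width + (i + 0.5) * h
        pi_x = math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * z)
        total += pi_x * pi_x / q * h
    return total


rho = second_moment_ratio()
ess_fraction_T17 = rho ** -17             # limiting ESS / N for T = 17
```

The factor `rho` is close to 8/(2√π)≈2.26, and `ess_fraction_T17` is of order 10⁻⁶, in agreement with the caption.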

J. Cornebise (Statistical and Applied Mathematical Sciences Institute, Durham) and G. W. Peters (University of New South Wales, Sydney)

Our comments on adaptive sequential Monte Carlo (SMC) methods relate to particle Metropolis–Hastings (PMH) sampling, which has acceptance probability given in equation (13) of the paper for proposed state inline image, relying on the estimate


Although a small N suffices to approximate the mode of a joint path space distribution, producing a reasonable proposal for x1:T, it results in high variance estimates of inline image. We study the population dynamics example from Hayes et al. (2010), model 3 excerpt, involving a log-transformed θ-logistic state space model; see Wang (2007), equations 3(a) and 3(b), for parameter settings and Figs 13–15 for an illustration of the algorithm's behaviour. Particle Markov chain Monte Carlo (PMCMC) performance depends on the trade-off between degeneracy of the filter, N, and design of the SMC mutation kernel. Regarding the latter, we note the following.

Figure 13.

 Sequence of simulated states and observations for the population dynamic log-transformed θ-logistic model from Wang (2007), with static parameter θ=(r,ζ,K) under constraints K>0, r<2.69 and inline image (the state transition is inline image], 0.01), and the local likelihood is ginline image, for T=100 time steps): inline image, generated latent state realizations; inline image, generated observations

Figure 14.

 Path of three sampled latent states (a) x2, (b) x37 and (c) x93, and of the sampled parameters (d) r, (e) K and (f) ζ, over 100 000 PMH iterations based on N=200 particles by using a simple sampling–importance resampling filter (the one-dimensional state did not call for Rao–Blackwellization): note also the effect of the adaptive MCMC proposal for θ=(r,ζ,K), set up to start at iteration 5000, which is particularly visible on the mixing of parameter K; the most noticeable property of the algorithm is the remarkable mixing of the chain, in spite of the high total dimension of the sampled state; each iteration involves a proposal of (X1:T, θ) of dimension 103

Figure 15.

 Convergence of the distribution of the path of latent states x1:T (note the change in vertical scale; initializing PMH sampling on a very unlikely initial path does not prevent the minimum mean-square error estimate of the latent states from converging; as few as 10 PMH iterations already begin to concentrate the sampled paths around the true path (inline image), which is assumed here to be close to the mode of the posterior distribution thanks to the small observation noise, with very satisfactory results after 20 000 iterations): (a) initialized PMH state for x1:T (inline image); (b) average posterior mean path estimate for X1:T ( inline image), PMH Markov chain iteration i=10 and N=200 particles; (c) average posterior mean path estimate for X1:T, PMH Markov chain iteration i=20 000 and N=200 particles

  • (a) A Rao–Blackwellized filter (Doucet et al., 2000) can improve acceptance rates; see Nevat et al. (2010).
  • (b) Adaptive mutation kernels, which in PMCMC methods can be considered as adaptive SMC proposals, can reduce degeneracy on the path space, allowing for higher dimensional state vectors xn. Adaption can be local (within filter) or global (sampled Markov chain history). Though currently particularly designed for approximate Bayesian computation methods, the work of Peters et al. (2010) incorporates into the mutation kernel of SMC samplers (Del Moral et al., 2006) the partial rejection control (PRC) mechanism of Liu (2001), which is also beneficial for PMCMC sampling. PRC adaption reduces degeneracy by rejecting a particle mutation when its incremental importance weight is below a threshold cn. The PRC mutation kernel
    can also be used in PMH algorithms, where q(xn|yn,xn−1) is the standard SMC proposal, and
    As presented in Peters et al. (2010), algorithmic choices for inline image can avoid evaluation of r(cn,xn−1). Cornebise (2010) extends this work, developing PRC for auxiliary SMC samplers, which are also useful in PMH algorithms. Threshold cn can be set adaptively: locally either at each SMC mutation or Markov chain iteration, or globally based on chain acceptance rates. Additionally, cn can be set adaptively via quantile estimates of pre-PRC incremental weights; see Peters et al. (2010). Cornebise et al. (2008) stated that adaptive SMC proposals can be designed by minimizing function-free risk theoretic criteria such as Kullback–Leibler divergence between a joint proposal in a parametric family and a joint target. Cornebise (2009), chapter 5, and Cornebise et al. (2010) use a mixture of experts, adapting kernels of a mixture on distinct regions of the state space separated by a ‘softmax’ partition. These results extend to PMCMC settings.
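The PRC idea in item (b) can be illustrated with a minimal sketch. It is a simplified reading, not the kernel of Peters et al. (2010): the threshold is applied to the incremental weight alone, a rejected mutation is re-proposed from the same previous state, and the returned weight omits the normalizing constant r(cn,xn−1). The function names `prc_mutation`, `propose` and `incr_weight`, and the toy random-walk model, are assumptions made for illustration.

```python
import math
import random


def prc_mutation(x_prev, propose, incr_weight, c, rng, max_tries=10_000):
    """Mutate one particle with partial rejection control (simplified sketch).

    A proposal x_new ~ q(. | x_prev) with incremental weight w is accepted with
    probability min(1, w / c); otherwise it is discarded and a fresh proposal
    is drawn from the same x_prev. The returned weight max(w, c) omits the
    normalizing constant r(c, x_prev).
    """
    for _ in range(max_tries):
        x_new = propose(x_prev, rng)
        w = incr_weight(x_prev, x_new)
        if rng.random() < min(1.0, w / c):
            return x_new, max(w, c)
    raise RuntimeError("no proposal accepted; threshold c may be too high")


# Hypothetical toy setting: random-walk proposal, Gaussian likelihood at y = 0.
def propose(x_prev, rng):
    return x_prev + rng.gauss(0.0, 1.0)


def incr_weight(x_prev, x_new):
    return math.exp(-0.5 * x_new ** 2)


rng = random.Random(0)
particles = [prc_mutation(3.0, propose, incr_weight, c=0.1, rng=rng)
             for _ in range(200)]
```

Every surviving particle carries a weight of at least c, which is how the mechanism limits weight degeneracy, at the price of a random number of proposals per particle.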

Drew D. Creal (University of Chicago) and Siem Jan Koopman (Vrije Universiteit Amsterdam)

We congratulate the authors on writing an interesting paper. They demonstrate how arguments from Markov chain Monte Carlo (MCMC) theory can be extended to include algorithms where proposals are made from the path realizations that are produced by sequential Monte Carlo (SMC) algorithms such as the particle filter. As with all good ideas, this basic idea is simple and quite clever at the same time. The implementation requires a particle filter routine, which is generally easy to code. Various MCMC strategies such as Metropolis–Hastings steps can then be adopted to accept–reject paths proposed from the discrete particle approximations that are created by the particle filter. The resulting particle MCMC algorithms widen the applicability of SMC methods. The authors also provide a theoretical justification for why the methods work. In practice for complex models, it may be easier to design an SMC algorithm and to include it within an MCMC algorithm rather than design an alternative, perhaps more intricate MCMC algorithm that is computationally less expensive.

The examples in Section 3 are interesting. The first example concerns a non-linear state space model which is used to compare the new method with a more standard MCMC algorithm. A numerical exercise reveals that the new method outperforms the other slightly. It should be noted that the model is intricate since the corresponding filtering and smoothing distributions are multimodal. The second example is the most interesting since other MCMC algorithms proposed in the literature can be tedious to implement. The difficulty arises because the transition density p{x(t)|x(t−1);θ}, with x(t)=σ2(t) as given by equations (16), is not known in closed form, making it difficult to implement a good MCMC algorithm. The authors show convincingly that their methods are effective for filtering and smoothing. A minor comment is that the time series dimensions for the simulated data sets (T=400) and for the Standard & Poor's 500 data (T=1000) are rather short and atypical. It appears to confirm our suspicion that the method is computationally time intensive, which is due to the repeated loops in the algorithm. However, designing and coding the algorithm are easy anyway.

In the conclusion, the authors state that the performance of particle MCMC algorithms will depend on the variance of the SMC estimates of the normalizing constants. Can they provide some discussion on when practitioners may encounter problems such as this? For example, how does the dimension of the state vector (or state space) affect the algorithm? This is particularly of interest in financial time series where we would like to build multivariate volatility models for high dimensional data. Secondly, how does the specification of the transition equation affect the estimates? For example, many economists specify state space models with unobserved random-walk components.

Despite these somewhat critical but constructive questions, we have enjoyed reading the paper and we are impressed by the results.

Dan Crisan (Imperial College London)

This is an authoritative paper which brings together two of the principal statistical tools for producing samples from high dimensional distributions. The authors propose an array of methods where sequential Monte Carlo (SMC) algorithms are used to design high dimensional proposal distributions for Markov chain Monte Carlo (MCMC) algorithms. The following are some comments that perhaps can suggest future research or improvements in this area.

Firstly the authors present not just the numerical verification of the proposed methodology but also (very laudably) its theoretical justification. They make the point that the theorems that are presented in the paper rely on relatively strong conditions, even though the methods have been empirically observed to apply to scenarios beyond the conditions assumed. In particular, assumption 4 is a very restrictive condition that is rarely satisfied in practice. It amounts (virtually) to the assumption that the state space of the hidden Markov state process is compact. The need for such an assumption is imposed by the preference for a framework where the posterior distribution exhibits stability properties, as discussed in Del Moral and Guionnet (2001). However, in recent years this assumption has been considerably relaxed. Le Gland and Oudjane (2003) have introduced the idea of truncating the posterior distribution, which was further exploited in Oudjane and Rubenthaler (2005) and in Crisan and Heine (2008) to produce stability criteria under quite natural conditions. The theorems in the paper under discussion are likely to hold under the same conditions as those contained, for example, in Crisan and Heine (2008), with proofs that will follow similar steps.

Secondly, the authors concentrate on SMC algorithms where the resampling step is the multinomial step. They make the point that more sophisticated algorithms have been proposed where the multinomial resampling step can be replaced by a stratified resampling procedure and prove the results under conditions that cover other SMC algorithms. However, the optimal choice for the resampling step is the tree-based branching algorithm that was introduced by Crisan and Lyons (2002). This algorithm has several optimality properties (see also Künsch (2005) for additional details) and satisfies the conditions (assumptions 1 and 2) that are required by the theoretical results in the paper.

Thirdly, the trade-off between the average acceptance rate for the particle independent Metropolis sampler and the number of particles that is used to produce the SMC proposal warrants further analysis. The numerical results suggest some deterministic relationship between the two quantities, one that perhaps holds only asymptotically. It would be beneficial to find this relationship and to see what it can tell us about the optimal choice for distributing the computational effort between the SMC and the MCMC steps.

David Draper (University of California, Santa Cruz)

I have two questions on Monte Carlo efficiency for the authors of this interesting paper.

  • (a) Has the authors’ methodology reached a sufficiently mature state that they can give us general advice on how to use their methods to obtain the greatest amount of information per central processing unit second about the posterior distribution under study (because this is of course the real performance measure on which users need to focus), and if so what would that advice be? (The authors made a start on this task in Section 3.1; it would be helpful to potential users of their methodology if they could expand on those remarks.)
  • (b) People often measure Monte Carlo improvement in Markov chain Monte Carlo samplers by how well a new method can drive positive auto-correlations (in the sampled output for the monitored quantities, viewed as time series) down towards zero, but it is sometimes possible (e.g. Dreesman (2000)) to do even better. Is there any scope in the authors’ work for achieving negative auto-correlations in the Markov chain Monte Carlo output?

Richard Everitt (University of Bristol)

I congratulate the authors on this significant paper. My comments relate to the use of the marginal variant of the algorithm for parameter estimation in undirected graphical models and, more generally, the computational cost of the methods.

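Let us consider the following factorization into clique potentials φ1:M on cliques C1:M of a joint probability density function over variables X1:T given parameters θ1:M:

$$ p(X_{1:T} \mid \theta_{1:M}) = \frac{1}{Z_{\theta_{1:M}}} \prod_{j=1}^{M} \varphi_j(X_{C_j} \mid \theta_j). $$
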
As in a state space model, the variables X1:T are observed indirectly through observations y1:T of random variables Y1:T, which are assumed conditionally independent given X1:T and are identically distributed as Yi|X1:T ∼ g(·|X1:T). Our aim is to estimate the unknown θ, which is ascribed prior p(θ), given the observations, by simulating from the posterior p(θ|y1:T). It is well known that Gibbs sampling from p(θ,X1:T|y1:T) is not feasible since the intractable normalizing ‘constant’ Zθ1:M must be evaluated when updating θ (other standard approaches also fail for the same reason).

As an alternative, consider the direct application of a marginal particle Markov chain Monte Carlo (PMCMC) move where, as in the paper, the proposal q(·|θ) is used to draw a candidate point θ*, the latent variables X1:T are sampled using a sequential Monte Carlo (SMC) algorithm targeting inline image (as, for example, in Hamze and de Freitas (2005)) and the move is accepted with the probability given in equation (35). Note that (owing to the use of the SMC algorithm) this approach has the advantage that at no point does Zθ1:M need to be evaluated directly. A similar approach may be used in the context of MCMC updates on the space of graphical model structures.

The computational cost of PMCMC methods in general is likely to be high (particularly so in application to the model above). Alleviating this through reusing particles from each run of the SMC algorithm seems intuitively possible. Also, it is worth considering the implementation of PMCMC methods on a graphics processing unit to exploit the parallel nature of the algorithm. The work of Maskell et al. (2006) on graphics processing unit implementations of particle filters is directly applicable here (also see recent work by Lee et al. (2009)).

Andrew Golightly and Darren J. Wilkinson (Newcastle University)

We thank the authors for a very interesting paper. Consider a d-dimensional diffusion process Xt governed by the stochastic differential equation


where Wt is standard Brownian motion. It is common to work with the Euler–Maruyama approximation with transition density f(·|x) such that


For low frequency data, the observed data can be augmented by adding m−1 latent values between every pair of observations. For observations on a regular grid, y1:T=(y1,…,yT) that are conditionally independent given {Xt} and have marginal probability density g(y|x), inferences are made via the posterior distribution θ,x1:T|y1:T by using Bayesian Markov chain Monte Carlo techniques. Owing to high dependence between x1:T and θ, care must be taken in the design of a Markov chain Monte Carlo scheme. A joint update of θ and x1:T or a carefully chosen reparameterization (Golightly and Wilkinson, 2008) can overcome the problem. The particle marginal Metropolis–Hastings (PMMH) algorithm that is described in the paper allows a joint update of parameters and latent data. Given a proposed θ*, the algorithm can be implemented by running a sequential Monte Carlo algorithm targeting p(x1:T|y1:T,θ*) using only the ability to forward-simulate from the Euler–Maruyama approximation.
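A minimal sketch may help to show how little the PMMH scheme asks of the model. It is an illustration under stated assumptions rather than the implementation that was used for the experiments: the one-dimensional drift −θx with unit diffusion coefficient, the Gaussian observation density and the names `euler_step`, `bootstrap_loglik` and `pmmh` are all hypothetical. The only model-specific operations are forward simulation through m Euler–Maruyama substeps and evaluation of g(y|x).

```python
import math
import random


def euler_step(x, theta, dt, rng):
    # One Euler-Maruyama step for the hypothetical SDE dX = -theta*X dt + dW.
    return x - theta * x * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)


def log_obs_density(y, x, sd):
    # Gaussian observation density g(y|x), assumed for illustration.
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((y - x) / sd) ** 2


def bootstrap_loglik(theta, ys, n_particles, m, rng, obs_sd=0.5):
    """Bootstrap particle filter estimate of log p(y_1:T | theta)."""
    dt = 1.0 / m
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik = 0.0
    for y in ys:
        # Forward simulation only: m latent Euler substeps between observations.
        for _ in range(m):
            particles = [euler_step(x, theta, dt, rng) for x in particles]
        logw = [log_obs_density(y, x, obs_sd) for x in particles]
        top = max(logw)
        w = [math.exp(l - top) for l in logw]
        loglik += top + math.log(sum(w) / n_particles)
        particles = rng.choices(particles, weights=w, k=n_particles)
    return loglik


def pmmh(ys, n_iters, n_particles, m, rng, step=0.3):
    """Random-walk PMMH on log(theta) with a U(-7, 2) prior on log(theta)."""
    log_theta = 0.0
    ll = bootstrap_loglik(math.exp(log_theta), ys, n_particles, m, rng)
    chain = []
    for _ in range(n_iters):
        prop = log_theta + step * rng.gauss(0.0, 1.0)
        if -7.0 < prop < 2.0:  # proposals outside the prior support are rejected
            ll_prop = bootstrap_loglik(math.exp(prop), ys, n_particles, m, rng)
            # Flat prior and symmetric proposal: the ratio is the likelihood ratio.
            if math.log(1.0 - rng.random()) < ll_prop - ll:
                log_theta, ll = prop, ll_prop
        chain.append(math.exp(log_theta))
    return chain
```

The current likelihood estimate is stored and reused rather than recomputed at each iteration; this is what makes the chain target the correct posterior despite the noise in the estimate.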

To compare the performance of the PMMH scheme with the method of Golightly and Wilkinson (2008) (henceforth referred to as the GW scheme), consider inference for a stochastic differential equation governing Xt=(X1,t,X2,t) with


This is the diffusion approximation of the stochastic Lotka–Volterra model (Boys et al., 2008). We analyse a simulated data set of size 50 with θ=(0.5,0.0025,0.3), corrupted by adding zero-mean Gaussian noise. Independent uniform U(−7,2) priors were taken for each log(θi). The GW scheme and the PMMH sampler were implemented for 500000 iterations, using a random-walk update with normal innovations to propose log(θ*), with the variance of the proposal being the estimated variance of the target distribution, obtained from a preliminary run. The PMMH scheme was run for N=200, N=500 and N=1000 particles and, in all cases, discretization was set by taking m=5.

Computational cost scales roughly as 1:8:20:40 for GW:PMMH (N=200:500:1000). For N=1000 particles, the mixing of the chain under the PMMH scheme is comparable with the GW scheme; Fig. 16. Despite the extra computational cost of the PMMH scheme, unlike the GW scheme the PMMH algorithm is easy to implement and requires only the ability to forward-simulate from the model. This extends the utility of particle Markov chain Monte Carlo methods to a very wide class of models where evaluation of the likelihood is difficult (or even intractable), but forward simulation is possible.

Figure 16.

 Auto-correlation function of θ1 from the output of the GW scheme (inline image) and PMMH schemes with N=200 (inline image), N=500 (inline image) and N=1000 (inline image)

Edward L. Ionides (University of Michigan, Ann Arbor)

The authors are to be congratulated on an exciting methodological development. An attractive feature of this new methodology is that it has an algorithmic implementation in which the only operation applied to the underlying Markov process model is the generation of draws from f(xn|xn−1). This property has been called plug and play (He et al., 2009; Bretó et al., 2009) since it permits simulation code, which is usually readily available, to be plugged straight into general purpose software. I would like to add some additional comments to the authors’ coverage of this aspect of their work.

The plug-and-play property has been developed in the context of complex system analysis with the terminology equation free (Kevrekidis et al., 2004). For optimization methodology, the analogous term gradient free is used to describe algorithms which are based solely on function evaluations. Plug-and-play inference methodology has previously been proposed for state space models (including Kendall et al. (1999), Liu and West (2001), Ionides et al. (2006), Toni et al. (2008) and Andrieu and Roberts (2009)). This paper is distinguished by describing the first plug-and-play algorithm giving asymptotically exact Bayesian inference for both model parameters and unobserved states.

We should expect plug-and-play approaches to require additional computational effort compared with rival methods that have access to closed form expressions for model properties such as transition densities or their derivatives. However, advances in computational capabilities and algorithmic developments are making plug-and-play methodology increasingly accessible for state space models. The great flexibility in model development that is permitted by the generality of plug-and-play algorithms is enabling scientists to ask and answer scientific questions that were previously inaccessible (e.g. King et al. (2008)). The methodology that is developed here (and other approaches which inherit the plug-and-play property from the basic sequential Monte Carlo algorithm) will benefit from further research into improvements and extensions of sequential Monte Carlo methods that fall within the plug-and-play paradigm: reduced variance resampling schemes are consistent with plug-and-play methods, but most other existing refinements are not.

Pierre Jacob (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris), Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris), Christian Robert (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris) and Håvard Rue (Norwegian University for Science and Technology, Trondheim)

This otherwise fascinating paper does not cover the calculation of the marginal likelihood p(y), which is the central quantity in model choice. However, the particle Markov chain Monte Carlo (PMCMC) approach seems to lend itself naturally to the use of Chib's (1995) estimate, i.e.

$$ p(y) = \frac{p(y \mid \theta)\, p(\theta)}{p(\theta \mid y)} $$

for any θ. Provided that the p(θ|x,y) density admits a closed form expression, the denominator may be estimated by

$$ \hat{p}(\theta \mid y) = \frac{1}{M} \sum_{i=1}^{M} p(\theta \mid x_i, y), $$

where the xi, i=1,…,M, are provided by the MCMC output.

The novelty here is that p(y|θ) in the numerator needs to be evaluated as well. Fortunately, each iteration provides a Monte Carlo estimate of p(y|θ=θi), where θi is the parameter value at MCMC iteration i. Some care may be required when choosing θi; for example selecting the θi with largest (evaluated) likelihood may lead to a biased estimator.
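As a hedged illustration of how such an estimate could be assembled from PMCMC output, the sketch below combines a likelihood estimate at a chosen θ, the prior density and the averaged conditional densities p(θ|xi,y) through Chib's identity, log p(y) = log p(y|θ) + log p(θ) − log p(θ|y). The function `chib_log_marginal` and the conjugate normal toy check (θ∼N(0,1), y|θ∼N(θ,1), no latent states, so that each conditional density equals the exact posterior density) are assumptions for illustration only.

```python
import math


def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def chib_log_marginal(log_lik_hat, log_prior, cond_post_densities):
    """Estimate log p(y) as log p^(y|theta) + log p(theta) - log mean_i p(theta|x_i, y)."""
    post_hat = sum(cond_post_densities) / len(cond_post_densities)
    return log_lik_hat + log_prior - math.log(post_hat)


# Toy check in a conjugate model where every term is available exactly.
y, theta = 1.0, 0.3
log_lik = normal_logpdf(y, theta, 1.0)              # stands in for the SMC estimate
log_prior = normal_logpdf(theta, 0.0, 1.0)
post = math.exp(normal_logpdf(theta, y / 2, 0.5))   # exact posterior is N(y/2, 1/2)
estimate = chib_log_marginal(log_lik, log_prior, [post] * 3)
exact = normal_logpdf(y, 0.0, 2.0)                  # marginally, y ~ N(0, 2)
```

In a genuine application `log_lik` would be the noisy SMC estimate at the chosen θ, so the caution above about selecting θ by its estimated likelihood applies directly to the first argument.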

We did some experiments to compare the approach described above with integrated nested Laplace approximations (Rue et al., 2009) and nested sampling (Skilling (2006); see also Chopin and Robert (2010)), using the stochastic volatility example of Rue et al. (2009). Unfortunately, our PMCMC program requires more than 1 day to complete (for a number N of particles and a number M of iterations that are sufficient for reasonable performance), so we cannot include the results in this discussion. A likely explanation is that the cost of PMCMC sampling is at least O(T²), where T is the sample size (T=945 in this example), since, according to the authors, good performance requires that N=O(T), but our implementation may be suboptimal as well.

Interestingly, nested sampling performs reasonably well on this example (reasonable error obtained in 1 h), and, as reported by Rue et al. (2009), the integrated nested Laplace approximation is fast (1 s) and very accurate, but more work is required for a more meaningful comparison.

Michael Johannes (Columbia University, New York) and Nick Polson and Seung M.-Yae (University of Chicago)

We would like to comment on a few aspects of the paper. First, for several years, macroeconomics has used a related algorithm (e.g. Fernández-Villaverde and Rubio-Ramírez (2005)) to estimate dynamic general equilibrium models by using a random-walk Metropolis algorithm proposing new parameter values and accepting or rejecting the draws via marginal likelihoods from sequential Monte Carlo (SMC) sampling. This now quite large literature encountered a serious problem in models with more than a few parameters. In these cases, Metropolis algorithms often converge very slowly, and the combination of slow convergence and repeated iteration between SMC and Markov chain Monte Carlo (MCMC) sampling often requires that algorithms run for days, even when coded efficiently in C++. This experience provides a cautionary note to those using these algorithms in high dimensions. In the authors’ defence, these problems are extremely difficult, and the computationally expensive SMC–MCMC approach may be the only feasible strategy.

Second, the authors consider learning σ and σx in the non-linear state space model


assuming α=2, β1=0.5, β2=25 and β3=8. It is disappointing that all these parameters are constrained, as a more realistic test of their algorithm would estimate all the unknown parameters.

The authors compare with an MCMC algorithm using single-state updating. We suggest two more realistic competing algorithms. The first assumes α=2 and

  • (a) generates a full vector of latent states, x1:T, by using SMC sampling, accepting or rejecting these draws via Metropolis updates and then
  • (b) updates the parameters by using p(θ|x1:T,y1:T).

This algorithm exploits the fact that the conditional posterior, p(θ|x1:T,y1:T), is a known distribution and simple to sample. This algorithm would probably perform better than the current algorithm combining SMC with random-walk Metropolis sampling.

The second algorithm is the approach of Johannes et al. (2007) that

  • (a) solely relies on SMC methods,
  • (b) uses slice variables to induce sufficient statistics,
  • (c) estimates all the parameters (α,β1,β2,β3,σ,σx) for similar sized data sets and
  • (d) solves the sequential problem by approximating p(θ,xt|y1:t) for each time t.

Fig. 17 provides an example of the output.

Figure 17.

 Posterior distribution and learning of the parameters in the non-stationary growth model (the particle size is 300 000): (a) α; (b) β1; (c) β2; (d) β3; (e) σ; (f) σx

The algorithm of Johannes et al. (2007) relies on similar SMC methods but is computationally simpler. To obtain a sense of the computational demands, the current paper uses 60000 MCMC iterations and 5000 particles whereas we obtain accurate parameter estimates (verified via simulation studies), using one run of 300000 particles, using roughly 1/1000th of the computational cost. We would be interested in a direct horse-race of these competing methods in this specification.

Finally, the approach in the paper can have attractive convergence properties under various assumptions, including assumption 4. We would like to ask the authors whether assumption 4 is satisfied in the examples that are considered in the paper. In particular, does it hold for various signal-to-noise ratio combinations of σ and σx?

Adam M. Johansen and John A. D. Aston (University of Warwick, Coventry)

We congratulate the authors on an exciting paper which combines the novel idea of incorporating sequential Monte Carlo proposals within Markov chain Monte Carlo samplers with a synthesis of ideas from disparate areas. It is clear that the paper is a substantial advance in Monte Carlo methodology and is of substantially greater value than a collection of its constituent parts.

However, one constituent which has received little attention in the literature seems to us to be interesting: although it is computationally rather expensive to do so, equations (27)–(28) suggest that it is possible to obtain samples which characterize the path space distribution well, at least in the case of mixing dynamic systems when we are interested in marginal distributions of bounded dimension, albeit at the cost of running an independent sequential Monte Carlo algorithm for every sample. In practice some reuse of samples is likely to be possible.

Typically, such a strategy might be dismissed (perhaps correctly) as being of prohibitive computational cost. However, in an era in which Monte Carlo algorithms whose time cost scales superlinearly in the number of samples employed are common, might there be other situations in which this strategy finds a role?

A rather naive approach to smoothing, for example, would be to employ an ensemble of independent particle filters and to sample one trajectory from each independent filter. For simplicity, consider employing a bootstrap filter in the univariate case, with inline image and inline image. To assess performance, consider the estimated covariance of Xn:n+1 (and the determinant of that covariance, to provide a compact summary). Fig. 18 shows covariance estimates obtained by using a 100-filter ensemble, each of 100 particles, a single particle filter of equal cost (using 10000 particles) and the exact solution (Kalman smoothing). This illustrates the degeneracy and consequent failure to represent sample path variability of a single filter adequately and contrasts it with the estimate obtained by using an ensemble of filters. Each of the Monte Carlo algorithms required approximately 30 s over 1000 time steps using SMCTC (Johansen, 2009) and a 1.33-GHz Intel laptop.
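The ensemble strategy can be sketched as follows. Since the model specification is not reproduced above, the sketch assumes a hypothetical univariate linear Gaussian model, x_t = 0.9 x_{t−1} + v_t and y_t = x_t + w_t with unit-variance noise; the function names are likewise illustrative, and this is not the SMCTC implementation. Each filter stores ancestor indices, so a single trajectory can be traced back from one particle drawn at the final time.

```python
import math
import random


def run_filter(ys, n, rng, a=0.9):
    """Bootstrap filter for the hypothetical model x_t = a*x_{t-1} + v_t, y_t = x_t + w_t."""
    history, ancestors, weights = [], [], None
    for y in ys:
        if weights is None:
            xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        else:
            idx = rng.choices(range(n), weights=weights, k=n)  # multinomial resampling
            ancestors.append(idx)
            xs = [a * history[-1][i] + rng.gauss(0.0, 1.0) for i in idx]
        logw = [-0.5 * (y - x) ** 2 for x in xs]
        top = max(logw)
        weights = [math.exp(l - top) for l in logw]
        history.append(xs)
    return history, ancestors, weights


def trace_path(history, ancestors, k):
    """Follow ancestor indices back from final particle k to recover its full path."""
    path = []
    for t in range(len(history) - 1, -1, -1):
        path.append(history[t][k])
        if t > 0:
            k = ancestors[t - 1][k]
    return path[::-1]


def ensemble_paths(ys, n_filters, n, rng):
    """One trajectory from each of n_filters independent particle filters."""
    paths = []
    for _ in range(n_filters):
        history, ancestors, weights = run_filter(ys, n, rng)
        k = rng.choices(range(n), weights=weights, k=1)[0]
        paths.append(trace_path(history, ancestors, k))
    return paths


def single_filter_paths(ys, n, rng):
    """All n trajectories from one filter; early states typically share few ancestors."""
    history, ancestors, weights = run_filter(ys, n, rng)
    ks = rng.choices(range(n), weights=weights, k=n)
    return [trace_path(history, ancestors, k) for k in ks]
```

Comparing `len({p[0] for p in paths})` for the two constructions gives a quick view of path degeneracy: the ensemble retains one distinct early state per filter, whereas the trajectories of a single filter usually coalesce onto a handful of ancestors.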

Figure 18.

 Covariance of joint smoothing estimates by using a single 10 000-sample particle filter (?), an ensemble of 100 100-sample particle filters (+) and the exact Kalman smoother (inline image): (a) determinant of two-state covariance matrices; (b) covariances at n=20, 21 (inline image, single-particle filter covariance; inline image, multiple-particle filter covariance); (c) covariances at n=994, 995 (inline image, single-particle filter covariance; inline image, multiple-particle filter covariance)

Might it be possible to employ such a strategy to provide simple-to-implement algorithms with better path space performance? Can the error inline image be controlled uniformly for bounded L?

Anthony Lee and Chris Holmes (University of Oxford)

We congratulate the authors on a major contribution to practical statistical inference in a variety of models. An important application is approximating the posterior distribution of static parameters in state space models. The particle marginal Metropolis–Hastings (PMMH) algorithm is perhaps the simplest of the algorithms introduced, relying only on the unbiasedness of the marginal likelihood estimator. Denoting by y the observations, z the set of all auxiliary random variables used in the filter and θ the static parameters, the likelihood estimator is a joint density p(y,z|θ) satisfying

$$ \int p(y, z \mid \theta)\, \mathrm{d}z = p(y \mid \theta). \tag{55} $$

An interesting feature of the sequential Monte Carlo class of methods is that the choice of auxiliary variables z is flexible. For example, we can perform multinomial resampling in a variety of ways without affecting condition (55). Let x1:T be the latent variables in the state space model. When xt is univariate, sorting the particles before resampling as in Pitt (2002) but without interpolation gives an empirical distribution function for particle indices inline image that is identical to the empirical distribution function for xt itself. We can then construct a Metropolis–Hastings Markov chain targeting p(θ,z|y) by proposing moves of the form (θ,z)→(θ′,z) and (θ,z)→(θ,z′). For the first type, this amounts to a use of common random variables so that in the acceptance ratio

$$ \frac{p(y, z \mid \theta')\, p(\theta')\, q(\theta \mid \theta')}{p(y, z \mid \theta)\, p(\theta)\, q(\theta' \mid \theta)} $$

the terms p(y,z|θ′) and p(y,z|θ) are positively correlated. We can therefore expect the resulting Markov transition kernel to be closer to that of the true marginal Metropolis–Hastings algorithm on θ, suggesting superior performance over standard PMMH algorithms.

We ran both a PMMH algorithm and this correlated variant CPMMH on a linear Gaussian state space model with univariate latent variables xt and a single unknown parameter. We used an improper prior with p(θ)∝1 and a random-walk proposal. Since we can compute p(y|θ) for this model, we can also compute the acceptance probabilities of the marginal algorithm and analyse how both algorithms move compared with the marginal algorithm. In a 50000-step chain, the PMMH algorithm differed from the true marginal algorithm 13065 times whereas CPMMH differed only 2333 times in terms of accepting or rejecting a move. Fig. 19 shows the differences between the acceptance probabilities for both the PMMH and the CPMMH algorithms against the marginal algorithm. Although CPMMH does not extend trivially to the multivariate case, tree-based resampling schemes as in Lee (2008) that generalize the methodology in Pitt (2002) give similar improvements.
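One way to picture the first type of move, in which θ changes while z is held fixed, is to drive the whole filter from a fixed random number seed and to sort the particles before resampling. The sketch below is an assumption-laden illustration rather than the CPMMH implementation that was used for these experiments: the model x_t = θx_{t−1} + v_t, y_t = x_t + w_t with unit-variance noise, the function `loglik_hat` and the use of a seed to represent z are all hypothetical.

```python
import bisect
import itertools
import math
import random


def loglik_hat(theta, ys, n, seed):
    """Particle estimate of log p(y|theta) with all auxiliary variables z fixed by `seed`.

    Hypothetical model: x_t = theta*x_{t-1} + v_t, y_t = x_t + w_t, unit-variance noise.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    weights, loglik = None, 0.0
    for y in ys:
        # The same number of draws is made whatever theta is, so z stays aligned.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        us = sorted(rng.random() for _ in range(n))
        if weights is not None:
            # Sort particles by value, then resample by inverting the weighted
            # empirical distribution function at the sorted uniforms.
            pairs = sorted(zip(xs, weights))
            cum = list(itertools.accumulate(w for _, w in pairs))
            xs = [pairs[min(bisect.bisect_left(cum, u * cum[-1]), n - 1)][0]
                  for u in us]
        xs = [theta * x + e for x, e in zip(xs, noise)]
        logw = [-0.5 * math.log(2 * math.pi) - 0.5 * (y - x) ** 2 for x in xs]
        top = max(logw)
        weights = [math.exp(l - top) for l in logw]
        loglik += top + math.log(sum(weights) / n)
    return loglik
```

Calling `loglik_hat` at θ and θ′ with the same seed corresponds to a move that keeps z fixed, whereas changing only the seed corresponds to refreshing z at fixed θ.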

Figure 19.

 Scatter plots of the difference between the acceptance probabilities of (a) the PMMH algorithm and (b) the CPMMH algorithm and the acceptance probability of the marginal algorithm at each step

Finally, many people at the meeting commented on the heavy computational burden of particle Markov chain Monte Carlo methods. However, the emerging use of parallel architectures such as graphics cards can alleviate this burden via parallelization of the particle filtering algorithm itself, as in Lee et al. (2009).

Simon Maskell (QinetiQ, Malvern)

This paper provides, to the sequential Monte Carlo (SMC) sampling specialist, a mechanism to perform parameter estimation by using Markov chain Monte Carlo (MCMC) sampling. To the MCMC sampling specialist, this paper offers a route to efficient proposals in very high dimensional problems. Both contributions are significant in isolation. To achieve the two simultaneously is a significant achievement.

The paper's approach is to extend the space of variables of interest to include auxiliary variables that are necessarily involved in the algorithmic process of drawing samples from an SMC sampler. The tactic is to extend the state space such that the problem of interest is expressed as a projection of some larger problem (which is easier to consider) onto a smaller dimensional space. There are a number of such larger problems that project onto the same smaller dimensional space. It is therefore surprising that the authors focus on a relatively restrictive structure for their samplers that target the joint distribution of x1:T and θ: the particle marginal Metropolis–Hastings sampler considers a sampler of the form inline image and the particle Gibbs sampler considers a sampler that alternates between q(θ*|x1:T) and inline image. These samplers therefore avoid the possibility of proposals of the form inline image.

It is natural to ask what such proposal distributions would offer (apart from more complex variants of the MCMC acceptance ratios). SMC samplers are recursive algorithms, i.e. xt is sampled conditionally on x1:t−1 for each t. As touched on in the paper, the statistical efficiency of SMC algorithms is coupled to their ability to generate samples of xt from a proposal distribution that is a good approximation to the target density. The optimal proposal distribution is only optimal in terms of its ability to exploit previous samples (and data) to generate the current sample xt: the notion of optimality is intricately tied to the recursive application of an SMC sampler. In the context of particle MCMC methods, SMC samplers are still applied recursively, but we also have a previous sample of the entire trajectory, x1:T. This trajectory encodes information about ‘future’ target distributions and samples that will turn out to be efficient in hindsight. It therefore seems plausible that a different notion of an optimal proposal distribution is needed for particle MCMC sampling and that this should include dependence on the previous sample of the trajectory.

This paper seems likely to seed a unified research direction that facilitates a combined effort between practitioners and researchers who are associated with both MCMC and SMC methods. Such extensions can therefore be expected.

Lawrence Murray, Emlyn Jones, John Parslow, Eddy Campbell and Nugzar Margvelashvili (Commonwealth Scientific and Industrial Research Organisation, Canberra)

We thank the authors for their work on what we agree is a very compelling approach to parameter estimation in state space and other models. We have been investigating similar ideas in the context of marine biogeochemistry, with encouraging results for a toy Lotka–Volterra predator–prey model (Jones et al., 2009). Our approach uses random-walk Metropolis–Hastings steps in parameter space, with a particle filter employed to calculate likelihoods for the Metropolis–Hastings acceptance term. It is essentially an instance of the method described here as particle marginal Metropolis–Hastings (PMMH) sampling. The approach does seem computationally expensive, and we observe some potential consistency problems in the use of a particle filter to estimate likelihoods.

Biogeochemical models characterize the interaction of phytoplankton and zooplankton species and the conserved cycle of nutrients such as nitrogen, carbon and oxygen through an ecosystem. They are generally described by using ordinary differential equations, with our own formulation introducing stochasticity via interaction terms at discrete time intervals. They are one specific case of a wide variety of physical–statistical models obtained via the introduction of stochasticity to existing deterministic models in a Bayesian hierarchical framework.

These models fall into a broad class where the transition density p(xn|xn−1) is not available in closed form. This precludes use of some of the advanced proposal and resampling techniques that are mentioned by the authors, owing to the need to cancel the intractable transition density in the numerator and denominator in expression (7). In particular, the optimal proposal p(xn|yn,xn−1) is not available. We find the iteration of a particle filter in the PMMH framework for these models to be very expensive computationally, mostly because of numerical integration of the ordinary differential equations with the limited availability of these advanced techniques confounding the matter further.

We find it necessary to use many more samples for PMMH sampling than we would with the same particle filter used only for state tracking, to deliver consistent likelihood estimates. Although a particle filter may momentarily fail to track the state adequately at a particular time but then recover (e.g. in a form of mild degeneracy where the effective sample size is low) the likelihood contribution at that time will be unreliable. In the worst case, iterating the particle filter with the same parameter configuration but different sample sets from the prior p(x1|y1) can produce wildly different likelihood estimates in the presence of such anomalies.
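The run-to-run variability mentioned above can be checked empirically by repeating the filter at a fixed parameter value and looking at the spread of the log-likelihood estimates for different numbers of particles. The sketch below does this for a linear Gaussian toy model, which is an assumption standing in for the discussants' model; `loglik_spread` and the other names are hypothetical.

```python
import math
import random
import statistics


def simulate(theta, n_obs, rng):
    """Assumed toy model: x_n = theta*x_{n-1} + N(0,1), y_n = x_n + N(0,1)."""
    x, ys = rng.gauss(0.0, 1.0), []
    for t in range(n_obs):
        if t > 0:
            x = theta * x + rng.gauss(0.0, 1.0)
        ys.append(x + rng.gauss(0.0, 1.0))
    return ys


def pf_loglik(theta, ys, n_particles, rng):
    """Bootstrap particle filter estimate of the log-likelihood."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            xs = [theta * x + rng.gauss(0.0, 1.0) for x in xs]
        # Unit-variance Gaussian observation density, on the log scale.
        logw = [-0.5 * math.log(2.0 * math.pi) - 0.5 * (y - x) ** 2 for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        loglik += m + math.log(sum(w) / n_particles)
        xs = rng.choices(xs, weights=w, k=n_particles)
    return loglik


def loglik_spread(theta, ys, n_particles, n_repeats, rng):
    """Mean and standard deviation of repeated log-likelihood estimates
    obtained with the same parameter value but fresh particle sets."""
    vals = [pf_loglik(theta, ys, n_particles, rng) for _ in range(n_repeats)]
    return statistics.mean(vals), statistics.stdev(vals)
```

Comparing the returned standard deviation for, say, 20 and 400 particles gives a rough indication of how many particles are needed before the estimates are stable enough to be used inside an acceptance ratio.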

G. W. Peters (University of New South Wales, Sydney) and J. Cornebise (Statistical and Applied Mathematical Sciences Institute, Durham)

This paper will clearly have a significant influence on scientific disciplines with a strong interface with computational statistics and non-linear state space models. Our comments are based on practical experience with particle Markov chain Monte Carlo (MCMC) implementation in latent process multifactor stochastic differential equation models for commodities (Peters et al., 2010), wireless communications (Nevat et al., 2010) and population dynamics (Hayes et al., 2010), using Rao–Blackwellized particle filters (Doucet et al., 2000) and adaptive MCMC methods (Roberts and Rosenthal, 2009).

  • (a) From our implementations, ideal use cases consist of highly non-linear dynamic equations for a small dimension dx of the state space, large dimension d of the static parameter and potentially large length T of the time series. In our cases dx was 2 or 3, d up to 20 and T between 100 and 400.
  • (b) In particle Metropolis–Hastings (PMH) sampling, non-adaptive MCMC proposals for θ (e.g. tuned according to presimulation chains or burn-in iterations) would be costly for large T and require that N is kept fixed over the whole run of the Markov chain. Adaptive MCMC proposals such as the adaptive Metropolis sampler (Roberts and Rosenthal, 2009) avoid such issues and proved particularly relevant for large d and T.
  • (c) For intractable joint likelihood p(y1:T|x1:T), we could design a sequential Monte Carlo (SMC)–approximate Bayesian computation algorithm (see for example Peters et al. (2010) and Ratmann (2010), chapter 1) for a fixed approximate Bayesian computation tolerance ɛ, using the approximations
    with ρ a distance on the observation space and inline image simulated observations. Additional degeneracy on the path space induced by the approximate Bayesian computation approximation should be controlled, e.g. with partial rejection control (Peters et al., 2008).
  • (d) Particle Gibbs (PG) sampling could potentially stay frozen on a state x1:T(i). Consider a state space model with state transition function almost linear in xn for some range of θ, from which y1:T is considered to result, and strongly non-linear elsewhere. If the PG samples θ(i) in those regions of strong non-linearity, the particle tree is likely to coalesce on the trajectory preserved by the conditional SMC sampler, leaving it with a high importance weight, maintaining (θ(i+1),x1:T(i+1))=(θ(i),x1:T(i)) over several iterations. Using a PMH within PG algorithm would help to escape this region, especially using partial rejection control and adaptive SMC kernels, outlined in another comment, to fight the degeneracy of the filter and the high variance of inline image.

Ralph S. Silva and Robert Kohn (University of New South Wales, Sydney), Paolo Giordani (Sveriges Riksbank) and Michael K. Pitt (University of Warwick, Coventry)

We congratulate the authors on their important paper which opens the way for a unified method for Bayesian inference using the particle filter and should allow for inference for models which are difficult to estimate by using other methods. To establish notation and to summarize the result that is relevant to our discussion, let p(y|θ) be the correct but intractable likelihood with f(y|θ,u) its approximation by the particle filter, where u is a set of latent variables. By Del Moral (2004), this approximation is unbiased, i.e. E{f(y|θ,u)|θ}=p(y|θ), where the expectation is taken over u.


The authors show that this implies that f(θ|y)=p(θ|y) so a Markov chain Monte Carlo simulation based on the posterior f(θ,u|y) gives iterates of θ from the correct marginal posterior p(θ|y). Our own research reported in Silva et al. (2009) applies the fundamental insight in the current paper to study the behaviour of adaptive sampling schemes when the particle filter is used to obtain f(y|θ,u) for state space models. The two adaptive samplers that we consider are a three-component version of the adaptive random-walk proposal of Roberts and Rosenthal (2009) and the adaptive independent Metropolis–Hastings proposal of Giordani and Kohn (2008). Combining the particle filter with adaptive sampling is attractive because f(y|θ,u) is a stochastic non-smooth function of θ. Our results suggest the following.

  • (a) It is feasible to use adaptive sampling for the particle Markov chain Monte Carlo and in particular particle marginal Metropolis–Hastings algorithm.
  • (b) It is computationally efficient to obtain a good adaptive proposal because the cost of constructing such a proposal is negligible compared with the cost of evaluating f(y|θ,u) by the particle filter.
  • (c) A well-constructed proposal can be much more efficient than an adaptive random-walk proposal.
  • (d) Independent Metropolis–Hastings proposals are attractive because they can be easily run in parallel, thus significantly reducing the computation time of particle-based Bayesian inference.
  • (e) When the particle filter is used, the marginal likelihood of any model is obtained in an efficient and unbiased manner, making model comparison straightforward.

Miika Toivanen and Jouko Lampinen (Helsinki University of Technology, Espoo)

We congratulate the authors for introducing the idea of combining ‘ordinary’ Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methodologies in a novel way, namely using SMC algorithms for designing proposal distributions for MCMC algorithms. We wish to share briefly our own experience on using particle Monte Carlo methods on a static problem, related to computer vision.

Consider having a few dozen feature points and a posterior distribution of their locations in a test image. Owing to the combinatorial explosion, approximate methods are needed to compute the integrals that involve the posterior distribution. The multimodality of the posterior distribution complicates the approximation problem. Although MCMC methods can be efficient in exploring a single mode, the probability for them to switch a mode during the sampling is low, especially if the modes are far apart. Although some improvements to overcome this disadvantage exist, the population Monte Carlo (PMC) scheme offers a much more natural approach.

PMC techniques are based on the idea of representing the posterior with a weighted set of particles. Each particle can be considered as a hypothesis about the correct location of the feature set and the weight reveals the goodness of the hypothesis. The particles are sampled from proposal distributions, which are allowed to differ between the particles and iterations. Hence, heuristics can safely be incorporated to guide the sampler towards the modes of the posterior, without jeopardizing theoretical convergence. In our implementation, the proposals are Gaussian distributions, which have the previous estimate as mean value and whose variance decreases for particles with high posterior probability. Owing to the resampling, the weakest hypotheses die, and the resulting particle set often gives a good representation of the posterior distribution (Toivanen and Lampinen, 2009a,b).
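A generic PMC iteration of the kind described above can be sketched as follows: each particle proposes from its own Gaussian centred at its previous value, receives an importance weight equal to the target density divided by its own proposal density, and the population is then resampled. The bimodal one-dimensional target, the fixed set of proposal scales and the function names are assumptions chosen for illustration; this is not the discussants' computer vision implementation, in which the variance depends on the posterior probability of each particle.

```python
import math
import random


def target_pdf(x):
    """Unnormalized bimodal target with modes at -4 and 4 (assumed)."""
    return 0.5 * math.exp(-0.5 * (x + 4.0) ** 2) + 0.5 * math.exp(-0.5 * (x - 4.0) ** 2)


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def pmc(n_particles, n_iters, scales, rng):
    """Population Monte Carlo with particle-specific Gaussian proposals."""
    xs = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]  # diffuse start
    for _ in range(n_iters):
        proposals, weights = [], []
        for i, x in enumerate(xs):
            sd = scales[i % len(scales)]  # proposals differ between particles
            z = rng.gauss(x, sd)
            proposals.append(z)
            # Importance weight: target over the density that generated z.
            weights.append(target_pdf(z) / normal_pdf(z, x, sd))
        # Resampling: weak hypotheses die, strong ones are duplicated.
        xs = rng.choices(proposals, weights=weights, k=n_particles)
    return xs
```

Because the weights correct for whatever proposal was used, the scales can be chosen heuristically without affecting the validity of the weighted sample.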

SMC methods can also be applied to sample the posterior, by updating the parameter vector incrementally (Toivanen and Lampinen, 2009c; Tamminen and Lampinen, 2006). The previously sampled components guide the sampler via the conditional prior distribution and the number of distinct modes decreases as the parameter vector expands. However, because the resampling is not based on the whole parameter vector, unlike in PMC methods, the method is prone to lead to a particle set representing a fallacious minor mode which in a marginal posterior is stronger than the main mode of the full posterior. Thus, it might be interesting to test whether PMC, instead of SMC, methods could be combined with MCMC methods in a fashion suggested by the authors, and whether it would improve the performance in these kinds of problem.

Jonghyun Yun and Yuguo Chen (University of Illinois at Urbana—Champaign)

We congratulate the authors on successfully combining two popular sampling tools, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) methods. We discuss two specific implementation issues of particle MCMC (PMCMC) algorithms.

In PMCMC sampling, proposing a single sample at each iteration requires N particles. This means that running PMCMC algorithms for L iterations requires NL particles. If we can afford to generate only a fixed number N* of particles, a practical question is how to balance between N and L under the constraint that NL=N*. We did a simulation study on model (14)–(15) with known parameters inline image and inline image. Let N*=1 000 000. We simulated 100 sequences of observations y1:300 from the model. For each sequence, four particle independent Metropolis–Hastings (PIMH) samplers with different combinations of N and L were applied to estimate the states x1:300. The standard SMC method in Section 2.2.1 with 1 000 000 particles is also included in the comparison. The performance criterion is the root-mean-squared error (RMSE) between the true xi and the estimates:


The average RMSE and acceptance rate from 100 simulations are reported in Fig. 20. According to Fig. 20, PIMH sampling with a small N could perform worse than standard SMC sampling. Part of the reason may be the low acceptance rate. Increasing N seems to improve the acceptance rate and the performance, even though L decreases correspondingly, which may affect the convergence of the Markov chain. For each combination of N and L, the acceptance rate becomes lower as the dimension T of the state x1:T grows.
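The budget constraint NL=N* and the performance criterion used in this comparison can be written down in a few lines. The RMSE below is the usual root of the average squared error over the time points, which is assumed here to be the definition intended by the discussants; the function names are hypothetical.

```python
import math


def budget_splits(total_particles, particle_counts):
    """Return (N, L) pairs with N * L equal to the fixed particle budget."""
    return [(n, total_particles // n)
            for n in particle_counts
            if total_particles % n == 0]


def rmse(truth, estimate):
    """Root-mean-squared error between true states and their estimates."""
    if len(truth) != len(estimate):
        raise ValueError("sequences must have the same length")
    squared = [(x - x_hat) ** 2 for x, x_hat in zip(truth, estimate)]
    return math.sqrt(sum(squared) / len(squared))


# The four PIMH settings listed in the caption of Fig. 20.
settings = budget_splits(1_000_000, [25, 100, 250, 1000])
```

With these inputs `settings` contains the pairs (25, 40 000), (100, 10 000), (250, 4000) and (1000, 1000), matching PIMH1 to PIMH4.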

Figure 20.

 Comparisons of the standard SMC method with 1 million particles with four PIMH samplers with different combinations of N and L (N=25 and L=40 000 (PIMH1), N=100 and L=10 000 (PIMH2), N=250 and L=4000 (PIMH3), and N=1000 and L=1000 (PIMH4)) of (a) average RMSE for all five (?, SMC; △, PIMH1; *, PIMH2; +, PIMH3; □, PIMH4) and (b) average acceptance rate for the four PIMH samplers (△, PIMH1; *, PIMH2; +, PIMH3; □, PIMH4)

Another practical issue is about reusing all particles. Two estimates which use all particles are suggested in Section 4.6. We compared these two estimates with the original PIMH sampler on the same model and the same four settings as before. Denote the two estimates in equations (38) and (39) as PIMH-Reuse1 and PIMH-Reuse2 respectively. We also propose a new estimate, which is denoted by PIMH-Reuse3 (see theorem 6 for the notation):


This estimate is based on the N weighted particles proposed at each iteration before the accept–reject step. PIMH-Reuse3 can be used when there are no unknown parameters in the model, and its convergence can be proved. The comparison of the average RMSE in Fig. 21 shows that PIMH-Reuse1 has almost the same performance as PIMH-Reuse2, and both outperform the original PIMH sampler. The relationship between PIMH-Reuse3 and the other methods is not so clear.

Figure 21.

 Comparison of the average RMSE for the PIMH sampler and the three methods that reuse all particles: (a) PIMH1 (?, PIMH1; +, PIMH1-Reuse1; □, PIMH1-Reuse2; ×, PIMH1-Reuse3); (b) PIMH2 (?, PIMH2; +, PIMH2-Reuse1; □, PIMH2-Reuse2; ×, PIMH2-Reuse3); (c) PIMH3 (?, PIMH3; +, PIMH3-Reuse1; □, PIMH3-Reuse2; ×, PIMH3-Reuse3); (d) PIMH4 (?, PIMH4; +, PIMH4-Reuse1; □, PIMH4-Reuse2; ×, PIMH4-Reuse3)

The authors replied later, in writing, as follows.

We thank the discussants for their very interesting comments.

What the users say

Perhaps the most important feedback that we have received is the confirmation by several discussants (Belmonte and Papaspiliopoulos, Bhadra, Flury and Shephard, Cappé, Robert, Jacob and Chopin, and Golightly and Wilkinson) that the approach is not only conceptually simple but also, more importantly, relatively easy to implement in practice and able to produce satisfactory results. We were particularly interested in the reported simulations and user experience of Golightly and Wilkinson. They indicate that particle Markov chain Monte Carlo (PMCMC) methods can lead to performance that is similar to that obtained with a carefully handcrafted (and possibly complex) algorithm and point to the comparatively little effort that is required by the user in terms of design and implementation. Naturally, except in situations where such implementational simplicity cannot be avoided, this ease comes at the expense of ‘computational brutality’, which might currently deter or prevent some users from using the approach (Chopin, and Flury and Shephard). However, as pointed out by Lee and Holmes, and Everitt, recent advances in the use of cheap graphical processing units and other multicore computing machines (such as game consoles) for scientific computing offer good hope that ever more complex problems can be routinely attacked with PMCMC methods. We naturally realize that the notion of ‘difficult problems’ is not static and do not believe in black boxes and silver bullets: ultimately very difficult problems at the frontier of what current technology can achieve will always require more thinking by the user. In relation to this we are looking forward to seeing applications of PMCMC sampling in the context of approximate Bayesian computations (Cornebise and Peters, and Peters and Cornebise) and general graphical models (Everitt).

Correctness and sequential Monte Carlo implementations

For brevity and to ensure simplicity of exposition the algorithms that were presented throughout the paper focus on some of the simplest implementations, and our discussion of general validity was confined to Section 2.5 and the beginning of Section 4. Not surprisingly quite a few comments focus on this aspect.

Valid sequential Monte Carlo implementations

Although the design of efficient MCMC algorithms can be facilitated by the use of sequential Monte Carlo (SMC) sampling as proposal mechanisms, the performance of the latter will naturally affect the performance of the former and one might wonder which standard SMC improvement strategies are legitimate. One can complement and summarize the rules of Section 2.5 and the beginning of Section 4 as follows. In broad terms PMCMC algorithms are valid

  • (a) when unbiasedness in the resampling step holds and this includes very general and popular schemes (e.g. Chopin, Fearnhead and Crisan) and
  • (b) for all enhancement methods involving additional artificial intermediate distributions; examples include the popular auxiliary particle filter approach (Pitt and Shephard, 1999) and resample–move algorithms (Gilks and Berzuini, 2001) (in other words MCMC-within-SMC methods), but also, as pointed out by Chen, the use of flexible resampling strategies.
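Unbiasedness of a resampling scheme, as required in (a), means that the expected number of offspring of each particle equals the number of draws times its normalized weight. The sketch below illustrates this for multinomial and systematic resampling with a simple Monte Carlo check. The function names are hypothetical, and the check is a numerical illustration rather than a proof of validity within PMCMC.

```python
import random


def multinomial_resample(weights, n_out, rng):
    """Draw n_out ancestor indices independently with probabilities
    proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=n_out)


def systematic_resample(weights, n_out, rng):
    """Systematic resampling: a single uniform draw shifts an evenly spaced
    grid over the cumulative normalized weights."""
    total = sum(weights)
    u = rng.random()
    indices, i = [], 0
    cumulative = weights[0] / total
    for j in range(n_out):
        point = (u + j) / n_out
        # Advance to the particle whose cumulative weight covers this point.
        while point > cumulative and i < len(weights) - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices


def mean_offspring(resampler, weights, n_out, n_reps, rng):
    """Monte Carlo estimate of the expected offspring count per particle."""
    totals = [0] * len(weights)
    for _ in range(n_reps):
        for idx in resampler(weights, n_out, rng):
            totals[idx] += 1
    return [t / n_reps for t in totals]
```

For both schemes the averages returned by `mean_offspring` should be close to `n_out` times the normalized weights; systematic resampling additionally keeps each individual count within one of that value.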

It is worth mentioning here that the exchangeability property (assumption 2) is not needed for the PMMH algorithm when only inference on θ is needed. Since writing the paper we have been working on establishing that even more general resampling schemes lead to valid PMCMC algorithms. Of particular interest are adaptive resampling schemes, which usually reduce the number of times that resampling is needed. It has been empirically observed in the literature dedicated to SMC algorithms that such schemes might be beneficial, and we expect this to carry on to the PMCMC framework (see the discussion below on the influence of the variability of inline image (or inline image) on the performance of PMCMC algorithms as well as the discussion of Fearnhead concerning the particle Gibbs (PG) sampler). It is also possible to adapt the number N of particles within the SMC step, which might be for example of interest to moderate the effect of outliers discussed by Murray, Jones, Parslow, Campbell and Margvelashvili.
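The adaptive resampling schemes mentioned above are commonly driven by the effective sample size (ESS) of the current weights: resampling is carried out only when the ESS falls below a chosen fraction of N. The sketch below shows this decision rule in isolation. The threshold of one half and the function names are assumptions, and the sketch says nothing about the validity of such schemes within PMCMC, which the text describes as work in progress.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; equals N for uniform weights and 1 when a
    single particle carries all the weight."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def maybe_resample(particles, weights, rng, threshold=0.5):
    """Resample only if ESS < threshold * N (threshold is an assumed choice).

    Returns the particles, their normalized weights and a flag indicating
    whether resampling took place.
    """
    n = len(particles)
    total = sum(weights)
    if effective_sample_size(weights) < threshold * n:
        resampled = rng.choices(particles, weights=weights, k=n)
        return resampled, [1.0 / n] * n, True
    # Otherwise keep the particles and carry the normalized weights forward.
    return particles, [w / total for w in weights], False
```

When the weights stay reasonably balanced, this rule skips the resampling step altogether, which is the reduction in the number of resampling times referred to above.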

Large state spaces

As pointed out by several discussants (Girolami, and Creal and Koopman) the design of efficient proposal distributions for the importance sampling stage of the SMC algorithm might be difficult in situations where the dimension of inline image is large. It can be shown on simple examples that such a penalty will typically be exponential in the dimension (consider for example Cappé’s example). However, it is possible in this case to introduce subsequences of intermediate distributions bridging for example πn and πn+1, e.g. Del Moral et al. (2006) and Godsill and Clapp (2001). This offers the possibility of employing well-known standard MCMC-type strategies that are well suited to high dimensional set-ups to update sub-blocks of the state vector between two particular distributions πn and πn+1. An alternative strategy consists of updating the state components one at a time by using conditional SMC updates.

General proposals for particle Metropolis–Hastings algorithms

Whereas the PG sampler bypasses the need for the design of a proposal distribution for θ the particle marginal Metropolis–Hastings (PMMH) algorithm requires such a design, which might not always be obvious as pointed out by Girolami, and Silva, Kohn, Giordani and Pitt.

As pointed out by Maskell, and Robert, Jacob, Chopin and Rue the degree of freedom that is offered by the choice of proposal of the PMMH step, or indeed a particle independent Metropolis–Hastings (PIMH) step, might turn out to be an opportunity which needs to be further explored. Dependence of proposals on previous particle populations is definitely an option (Everitt, and Robert, Jacob, Chopin and Rue) and might be beneficial to calibrate proposal distributions, but also to reduce the variability of acceptance probabilities. Note, however, our remark on the validity of recycling strategies in such a scenario at the very end of Appendix B.5. The work of Lee and Holmes offers an alternative variance reduction strategy of the acceptance probability for some situations.

Another natural solution consists of using adaptive MCMC algorithms (Andrieu and Thoms, 2008). Silva, Kohn, Giordani and Pitt report some results in this direction and in particular report better performance of the adaptive independent MH algorithms compared with that of a particular implementation of the AM algorithm (Haario et al., 2001; Roberts and Rosenthal, 2007). A further interesting comparison might involve robust versions of the AM algorithm described in Andrieu and Thoms (2008). Finally it is worth mentioning the complementary and competitive method of Ionides et al. (2006) to compute maximum likelihood estimates of the static parameter θ, which could be used as a useful stepping stone towards Bayesian inference in very difficult situations.


The smoothing approaches that were described by Whiteley (and hinted at by Godsill) and Johansen and Aston are very promising developments. The first approach is in the vein of existing ‘particle smoothing’ approaches which allow one to exploit the information that is gathered by all the particles generated by a single SMC procedure within the PMCMC framework. Its interest is intuitively evident in the case of the PG in the light of Fearnhead's discussion, but we expect such a smoothing procedure also to have a positive effect beyond this special case. This might for example improve the quality of samples {X1:P(i)} that are produced by the PMMH algorithm and suggests further improvements to our suggested recycling strategies. The second approach of Johansen and Aston, which was suggested in a non-PMCMC framework, consists of replacing a single SMC sampler using KN particles with K independent SMC samplers using N particles, which amounts to effectively replacing inline image with


and using a stratified sampling strategy to sample K paths. As illustrated by Johansen and Aston, reducing particle interaction might be beneficial when smoothing is of interest. Adaptation of this idea to the PMCMC framework seems possible and raises numerous interesting theoretical and practical questions. This strategy, as well as that described earlier, might address the issue that was raised by Fearnhead concerning the particle depletion phenomenon for initial values.

Performance and the choice of N: from theory to practice

The choice of the number N of particles is a difficult, but central, issue which is paramount to the good performance of PMCMC algorithms. This question is made even more difficult when considering the optimum trade-off between N and L for fixed computational resources, and a credible and generally valid answer to this question is beyond our current understanding.

Dependence on N of the performance of the PMCMC algorithms that were considered in the paper takes two different forms, at first apparently unrelated. It is first important to recall the fact that current PMCMCs can be thought of as being ‘exact approximations’ of idealized algorithms, which might or might not turn out to be ideal. It is indeed possible to construct examples, which are not unrelated to reality, for which the idealized algorithm is slower than its PMCMC version, suggesting that increasing N might not improve performance indefinitely, if at all. This partly answers Chopin's questions related to Rao–Blackwellization and the N versus N+1 issue. Some understanding of the idealized algorithm is therefore necessary, and we shall assume below that this algorithm is a worthy approximation.

For the PMMH or PIMH step the variability of inline image (or inline image) will determine how statistically close its transition probability is to that of the idealized algorithm, and as a result some of its performance measures. In the case of the PG, dependence of the performance on the variability of inline image is less obvious, but it seems to be governed by the coalescence structure of particle paths, as discussed by Fearnhead and observed by Whiteley.

In relation to this discussion, residual resampling will outperform multinomial resampling (Chopin) when closeness to the marginal algorithm is considered. Closeness to the marginal algorithm, when achieved, also suggests how the proposal distribution of θ in the PMMH should be adjusted: a random-walk Metropolis step should be tuned such that its acceptance probability is of the order of 0.234 etc. Some results illustrating the effect of N on the performance of the MCMC algorithm can be found in Andrieu and Roberts (2009).

The theoretical results of Section 4 do not unfortunately provide us with precise values but with bounds on rates of convergence as a function of both N and P (or T). Although we believe, and agree with Crisan, that such results can be established under weaker assumptions, we doubt that more practical (and sufficiently general) results can be obtained. We hence doubt that we can ever answer Draper's question, which remains largely unanswered even for standard MCMC algorithms. We find it comforting to see that the experiments of Fearnhead, Cappé and Chen indicate that the main conclusion of the theoretical results, i.e. that N should scale linearly with P for ‘ergodic’ models, seems to hold for quite general scenarios. We were puzzled by the extremely positive results obtained by Belmonte and Papaspiliopoulos for the PG sampler. We note that beyond their explanatory power these results suggest, possibly manual, ways of choosing N by monitoring, for example, the evolution of the variance of normalizing constants as a function of N. Naturally such nice ergodicity properties do not hold for numerous situations of interest, such as models for which components of the state evolve in a quasi-deterministic manner. This includes the class of dynamic stochastic equilibrium models; see Fernandez-Villaverde and Rubio-Ramirez (2007) and Flury and Shephard (2010). This lack of ergodicity of the model probably explains the reported slow convergence of the PMMH algorithm in the scenarios that were mentioned by Johannes, Polson and Yae. As acknowledged in Flury and Shephard (2010), any SMC-based method will suffer from this problem and it is expected that N will scale superlinearly with T in such scenarios. Note, however, that, in principle, the PMCMC framework allows for the use of standard off-the-shelf MCMC remedies, e.g. tempering ideas which might alleviate this issue by introducing bridging models with improved ergodicity.

Ideally we would like the choice of N to be ‘automatic’, in particular for the PMMH and PG algorithms. Indeed, as suggested by the theoretical result on the variance, different values of θ might require different values of N to achieve a set precision. Designing such a scheme which preserves π(θ,x1:P) as invariant distribution of the MCMC algorithm proves to be a challenge. However, adaptation within the SMC algorithm can be achieved through look-ahead procedures and by boosting the number of particles locally when necessary. This can help to prevent the problems that were described by Murray, Jones, Parslow, Campbell and Margvelashvili, where a small number of outliers can have a serious effect on the estimate of the normalizing constant or marginal likelihood and hence the PMCMC procedure.

Unbiasedness versus sampling

Several authors (Flury and Shephard, Łatuszyński and Papaspiliopoulos, Roberts, and Silva, Kohn, Giordani and Pitt) stress the unbiasedness of inline image (or inline image) that is produced by an SMC algorithm as being the basic principle underpinning the validity of the PMMH algorithm, in the spirit of Beaumont (2003), Andrieu and Roberts (2009) and Andrieu et al. (2007). This is indeed one of the two ways in which we came up with the PMMH algorithm initially in the course of working on two separate research projects. The other perspective, favoured in our paper, is that of ‘pseudosampling’, which in our view goes beyond unbiasedness (in the spirit of the ‘pseudomarginal’ approach) and is in our view fertile. Indeed although, in the context of the PMMH algorithm, the pseudomarginal perspective is appropriate when sampling from π(θ) is all that is needed, it is not sufficient to explain that it is possible to sample from π(θ,x1:P) using the same output from the SMC step. We do not think that the PG, of which the conditional SMC update is the key element, could have emerged without this perspective. It is in fact rather interesting to re-explain what the conditional SMC update achieves in the simple situation where the target distribution is π(x1:P) and P=1. In this situation, the extended target distribution of the paper takes the particularly simple form (we omit the subscript 1 to simplify the notation)


A Gibbs sampler to target this distribution consists, given xk, of sampling according to the two following steps:

  • (a)inline image and
  • (b)inline image,

which by standard arguments leave inline image invariant. Step (a) is a trivial instance of the conditional SMC update whereas step (b) consists of choosing a sample in x1:N according to the empirical distribution


Note the similarity of this update with the standard importance sampling–resampling procedure. The remarkable feature here is that whenever xk ∼ π then xl ∼ π as well, owing to the aforementioned invariance property. In other words the conditional SMC update followed by resampling can be thought of as being an MCMC update leaving π invariant. Unbiasedness seems to be a (happy) by-product of the structure of inline image and the proposal distributions used, since it can be easily checked that


for inline image and inline image.
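The two-step update described above for P=1 can be sketched as a Markov kernel: keep the current point, draw the remaining N−1 points from an importance proposal q, weight all N points by the target density divided by q, and select one of them with probability proportional to its weight. The standard normal target, the N(0, 2²) proposal and the function names below are assumptions made for illustration; the formulas of the reply itself are not reproduced here.

```python
import math
import random

PROPOSAL_SD = 2.0  # assumed importance proposal q = N(0, 2^2)


def target_unnormalized(x):
    """Unnormalized density of the assumed target pi = N(0, 1)."""
    return math.exp(-0.5 * x * x)


def proposal_pdf(x):
    """Density of the importance proposal q."""
    return math.exp(-0.5 * (x / PROPOSAL_SD) ** 2) / (PROPOSAL_SD * math.sqrt(2.0 * math.pi))


def conditional_is_step(x_current, n_particles, rng):
    """One update: step (a) refreshes all points except the current one,
    step (b) selects a point according to the normalized importance weights."""
    xs = [x_current] + [rng.gauss(0.0, PROPOSAL_SD) for _ in range(n_particles - 1)]
    ws = [target_unnormalized(x) / proposal_pdf(x) for x in xs]
    return rng.choices(xs, weights=ws, k=1)[0]


def run_chain(x0, n_iters, n_particles, rng):
    """Iterate the update starting from x0 and return the visited states."""
    chain, x = [], x0
    for _ in range(n_iters):
        x = conditional_is_step(x, n_particles, rng)
        chain.append(x)
    return chain
```

Started far from the mode, for example at `x0=5.0`, the chain should settle into samples with mean close to 0 and variance close to 1, which is a numerical check of the invariance property discussed above rather than a proof of it.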

The PIMH and PMMH algorithms take advantage of this unbiasedness property but as illustrated above the structure of inline image offers other useful applications. One interesting application is described in the paper: assume that P is so large that the number N of particles to obtain a reliable SMC step is prohibitive, probably at least of the order of P. Then updating large sub-blocks of x1:P is a tempting solution. In the light of the discussion above, the conditional SMC update offers the possibility of targeting π(xa:b|x1:P∖a:b) for 1 ≤ a ≤ b ≤ P. Assuming for notational simplicity here that b=P and a>1, if x1:P ∼ π, then inline image once the update above has been applied to xa:P. Similarly the conditional SMC algorithm can be used in cases where the dimension, say m, of inline image is large in order to update, for example, π{x1:P(l)|x1:P(1:m∖{l})} for l=1,…,m.

Using sequential Monte Carlo methods with Markov chain Monte Carlo moves

As mentioned in Section 2.2.2 and by Johannes, Polson and Yae, an alternative to PMCMC methods consists of using SMC methods with MCMC moves (Fearnhead, 1998; Gilks and Berzuini, 2001). These methods are not applicable in complex models such as the stochastic volatility model in Section 3.2, but, when applicable, seem at first appealing. They are particularly elegant in scenarios where p(θ|x1:T,y1:T) depends on x1:T,y1:T only through a set of fixed dimensional statistics and have received significant attention since their introduction and development over a decade ago; see for example Andrieu et al. (1999, 2005), Fearnhead (1998, 2002), Storvik (2002) and Vercauteren et al. (2005). Despite their appeal these well-documented methods are widely acknowledged to be rather delicate to use, owing to the so-called path degeneracy phenomenon and the fact that a good initialization distribution for θ seems paramount because of the lack of ergodicity of the system. In fact such techniques rely implicitly on the approximation of p(x1:T|y1:T) and it can be observed empirically that the algorithm might converge to incorrect values and even sometimes drift away from the correct values as the time index T increases; see for example Andrieu et al. (1999, 2005).

As a consequence we would recommend extreme caution when using such techniques, whose interest might be to provide a quick initial guess for the inference problem at hand. Assessing path degeneracy is certainly essential to evaluate the credibility of the results. A simple proxy to measure degeneracy consists of monitoring the number of distinct particles representing p(xk|y1:T) for various values of k ∈ {1,…,T} (preferably low values). If this number is below a reasonable number, say 500, then the particle approximation of p(θ,x1:T|y1:T) is most probably unreliable.
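The degeneracy proxy suggested above can be computed from the ancestor indices stored during resampling, by tracing the genealogy of the final particles backwards in time. In the sketch below, `ancestors[t][i]` is assumed to hold the index at time t of the parent of particle i at time t+1; this storage convention and the function names are hypothetical, and the default threshold of 500 is the value quoted in the text.

```python
def distinct_counts(ancestors, n_particles):
    """Number of distinct time-k particles that are ancestors of at least one
    particle at the final time, for every k (0-based time index)."""
    alive = set(range(n_particles))  # every particle is alive at the final time
    counts = [len(alive)]
    for parents in reversed(ancestors):
        alive = {parents[i] for i in alive}  # step one generation back
        counts.append(len(alive))
    counts.reverse()
    return counts


def degenerate_times(counts, min_distinct=500):
    """Time indices at which fewer than min_distinct particles survive."""
    return [k for k, c in enumerate(counts) if c < min_distinct]


# Tiny example with 4 particles and 3 time points.
example_ancestors = [[0, 0, 1, 2], [0, 0, 0, 1]]
example_counts = distinct_counts(example_ancestors, 4)
```

In the tiny example only one distinct particle survives at the first time point, two at the second and four at the last, so `example_counts` is `[1, 2, 4]`. The counts can never increase when moving backwards in time, which is why low values of k are the most informative.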

Johannes, Polson and Yae propose to reconsider the example that was discussed in Section 3.1 and to estimate the parameters (α,σ,β1,β2,β3,σx). They use Gibbs steps within a bootstrap particle filter to update θ:=(σ,β1,β2,β3,σx) and a slice sampler to update α. As we do not have the details of their slice sampler, we shall limit ourselves to the estimation of p(θ,x1:T|y1:T) by using the PG sampler. We considered their scenario and simulated T=100 data points by using the parameters that Johannes, Polson and Yae used and we set informative priors approximately similar to theirs by checking the width of their posteriors at time n=0. In this context, Johannes, Polson and Yae used 300000 particles for the particle filter with Gibbs moves and argue, on the grounds of the simulations that were discussed at the end of Section 3.1, that PMCMC methods would require l000 times more computation to perform inference in this scenario. We want to reassure them and the readers that this is not so. We used N=5000 particles at the end of Section 3.1 because we addressed a much more difficult scenario where T=500,σ=1 and σx=√10, i.e. the data set was five times larger and we used the bootstrap filter in a very unfavourable scenario where the likelihood of the observations is peaked and the noise of the dynamic diffuse. This is in contrast with the scenario that is considered by Johannes, Polson and Yae where σ=√10 and σx=1, i.e. the likelihood is fairly diffuse and the bootstrap filter and conditional bootstrap filters can provide good proposals for a number of particles as small as 150, which is in agreement with Fig. 3 of the paper. Moreover our PG sampler samples only (N−1)T random variables Xn and one set of parameters (σ,β1,β2,β3,σx) per MCMC iteration whereas the particle filter using Gibbs moves needs to sample NT random variables (Xn,σ,β1,β2,β3,σx). 
As a result, for the same computational cost as the bootstrap filter with Gibbs moves using N=300 000 particles, we can run the PG sampler for 12 000 iterations with a conditional SMC sampler using 150 particles, which is more than sufficient in this context. The MATLAB program runs in 7 min on a desktop computer. Fig. 22 displays the results. We ran many realizations initialized with this very informative prior and the algorithm consistently returned virtually identical results. Using vague priors for all parameters, we observed that poorly initialized PG samplers can sometimes become trapped in some modes (and we conjecture that this might be so even for the ‘exact’ Gibbs sampler) but can also manage to escape, in which case the results are very similar and stable.
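The cost comparison above can be checked with a short calculation. The accounting used here is our reading of the text, not a statement from it: we assume that cost is measured in scalar random draws, that the filter with Gibbs moves draws six scalars (Xn and the five parameters) per particle and time step, and that one PG iteration draws (N−1)T states plus five parameters. The variable names are illustrative.

```python
T = 100                      # number of observations in this scenario

# Bootstrap filter with Gibbs moves: N*T draws of (Xn, σ, β1, β2, β3, σx).
n_filter = 300_000
scalars_per_draw = 6         # assumption: one state plus five parameters
cost_filter = n_filter * T * scalars_per_draw

# PG sampler: (N - 1)*T state draws plus one set of five parameters per iteration.
n_pg = 150
n_params = 5
cost_pg_iteration = (n_pg - 1) * T + n_params

# Number of PG iterations affordable within the filter's budget.
affordable_iterations = cost_filter // cost_pg_iteration
```

Under these assumptions the budget of the filter covers roughly 12 000 PG iterations, which matches the figure quoted in the text.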

Figure 22.

Approximations of the marginal posterior distributions of (a) σx, (b) σ, (c) β1, (d) β2 and (e) β3 obtained by using 12 000 PG iterations with 1000 burn-in

For the same data set and the same informative prior, we ran 10 runs of the bootstrap filter with Gibbs steps for N=100 000 particles. For some parameters, the results were quite similar among runs. However, we also observed significant variability in the estimates as illustrated in Fig. 23. As expected this variance increases with time as a result of the path degeneracy phenomenon. Using vague priors for all parameters, the procedure appeared unable to produce sensible approximations of the posterior.

Figure 23.

 Estimates of (a) inline image and (b) inline image for n=1, …, 100 and 10 different runs of the bootstrap filter with Gibbs steps using N=100 000 particles

We conjecture that the variance of the approximation error of p(θ,x1:T|y1:T) increases superlinearly with T for such algorithms.

Some past and future work

As mentioned in Section 5.1 of our paper and as recalled by Godsill and by Johannes, Polson and Yae, a version of the PMMH algorithm based on the bootstrap filter was previously proposed by Fernandez-Villaverde and Rubio-Ramirez (2007) as a natural heuristic to sample approximately from p(θ|y1:T) (and not p(θ,x1:T|y1:T)). As discussed earlier, beyond the (non-trivial to us) proof that this approach is in fact exact, we hope that we have demonstrated that the PMMH algorithm is only a particular case of a more general and useful framework which goes far beyond the heuristic. As pointed out in Section 5.1, the PMCMC framework encompasses the multiple-try Metropolis (MTM) algorithm of Liu et al. (2000) and the configurational bias Monte Carlo update of Siepmann and Frenkel (1992). These connections, which might not be obvious at first sight (Cappé), are detailed in Andrieu et al. (2010), where other interesting developments are also presented.