Discussion on the paper by Andrieu, Doucet and Holenstein
Paul Fearnhead (Lancaster University)
I see great potential for particle Markov chain Monte Carlo (MCMC) methods, as the strengths of particle filters and of MCMC sampling are in many ways complementary. For example, in work on mixture models (Fearnhead, 2004), particle filter methods can perform well at finding different modes of the posterior, whereas MCMC methods do well at exploring the posterior within a mode. Similarly particle methods do well for analysing state space models conditional on known parameters and can analyse models from which you can simulate but for which you cannot calculate transition densities, whereas MCMC methods are better suited to mixing over different parameter values. This is the first work to use particle filters within MCMC sampling in a principled and theoretically justified way.
The paper describes several particle MCMC methods, and I shall concentrate the rest of my comments on just one of these: particle Gibbs sampling.
To understand the mixing properties of particle Gibbs sampling it helps to look at the set of paths that can be sampled from at the end of a conditional sequential Monte Carlo (SMC) update: Fig. 8(a) gives an example. The conditional SMC update is an SMC algorithm conditioned on a specific path surviving, which I shall call the conditioned path. The set of paths can be split into those which coalesce with the conditioned path, and those which do not and hence are independent of it. For the particle Gibbs sampler to mix well we want the probability of sampling one of these latter independent paths to be high.
For this, we would like to minimize the number of times that the particle of the conditioned path is resampled at each iteration. Consider time n, and assume that the conditioned path consists of the first particle at both times n and n+1 (in the notation of the paper B_{n}=1 and B_{n+1}=1). Let O_{n} be the number of times that the first particle at time n is resampled. Then under the conditional SMC algorithm we are interested in
thus it is easy to show that . Hence we see the importance of choosing a resampling scheme that minimizes the variance of the number of times that each particle is resampled; or of not resampling every time step.
To illustrate this empirically, I considered a simple toy model ,
I simulated data for σ=0.1. Fig. 8(b) shows how the probability of sampling an independent path in the conditional SMC step depends on T, N and the type of resampling. Results are given for multinomial resampling and for stratified resampling (Kitagawa, 1996; Carpenter et al., 1999), which is known to minimize var(O_{n}). We see that stratified resampling requires much smaller values of N to have the same performance as multinomial resampling. Also, as pointed out in the paper, you want N to increase linearly with T to have a roughly constant performance (this suggests that the central processing unit cost of SMC scales as O(T^{2}); this sounds competitive with or better than standard MCMC sampling; see Roberts (2009)).
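The effect of the resampling scheme on var(O_{n}) can be checked numerically. The following is a minimal sketch, not the code behind Fig. 8(b): the function names, the weight vector in the usage note and the number of replications are assumptions made for illustration. It counts how often the first particle is resampled under multinomial and under stratified resampling.

```python
import numpy as np


def multinomial_offspring(weights, rng):
    """Offspring counts when N ancestors are drawn independently from the weights."""
    n = len(weights)
    return rng.multinomial(n, weights)


def stratified_offspring(weights, rng):
    """Offspring counts under stratified resampling: one uniform per stratum [k/N, (k+1)/N)."""
    n = len(weights)
    u = (np.arange(n) + rng.random(n)) / n
    cum = np.cumsum(weights)
    cum[-1] = 1.0  # guard against rounding error in the last cumulative weight
    idx = np.searchsorted(cum, u)
    return np.bincount(idx, minlength=n)


def offspring_variance(scheme, weights, reps=5000, seed=0):
    """Monte Carlo estimate of var(O_n) for the first particle (illustrative helper)."""
    rng = np.random.default_rng(seed)
    counts = np.array([scheme(weights, rng)[0] for _ in range(reps)])
    return counts.var()
```

Calling offspring_variance with each scheme on the same normalized weight vector gives a direct comparison; both schemes leave the expected number of offspring of each particle at N times its weight, so only the variability differs.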
I have two other comments on particle Gibbs sampling. Firstly it seems that how you initialize the conditional SMC update is important. Naive strategies of sampling particles from the prior will lead to many initial particles being sampled in poor areas of the state space. Care needs to be taken even when a better proposal for the initial particles is used, as initially particles in the mode of this proposal will have small weights and resampling may remove many of these particles sampled unless the resampling probabilities are chosen appropriately (Chen et al., 2005; Fearnhead, 2008). Also, within particle Gibbs sampling are there extra ways of learning a good proposal for the initial particles: could you learn this from the history of the MCMC run or use the information of the conditioned path?
Secondly, you can use particle Gibbs sampling to update jointly parameters and the states by using a conditional SMC update with particles being both the state of the system and the parameters. Again care needs to be taken in terms of the proposal distribution for the parameters, and how resampling is done. On the above toy example it was possible to use the conditional SMC update to sample jointly new X_{1:T} and σ values, with moves in which σ changed by more than an order of magnitude more than under a Gibbs sampler which updates σ | X_{1:T}. For such an implementation, can you use MCMC methods within the conditional SMC update (Fearnhead, 1998, 2002; Gilks and Berzuini, 2001; Storvik, 2002)?
It feels like the theory behind the efficiency of particle Gibbs sampling may be very different from that for the other particle MCMC methods. Whereas the latter seems related to the variance of the SMC estimates of the marginal likelihood, the efficiency of particle Gibbs sampling seems related to rates of coalescences of paths in the conditional SMC update (and reminiscent of Kingman (1982)). Are these related, or is there a fundamental difference between particle Gibbs and other particle MCMC methods?
This has been a fascinating paper, and I look forward to seeing future developments and applications of particle MCMC methods. It gives me great pleasure to propose the vote of thanks.
Simon Godsill (University of Cambridge)
In seconding the vote of thanks on this paper I congratulate the authors on a fine contribution to Bayesian computational methods. The techniques that they propose allow us to combine, in a principled way, the two most successful tools that we currently have available in this field: the particle filter and the Markov chain Monte Carlo (MCMC) method. Other attempts in this area have focused on incorporating MCMC methods into sequential updating with particle filters. The current contribution, however, introduces the full power of particle filters into batch MCMC schemes. This has been done before, using empirical justifications (see Fernandez-Villaverde and Rubio-Ramirez (2007), which implements precisely the particle marginal Metropolis–Hastings update), but here we have a full theoretical justification for this usage, which will reassure practitioners and should increase the uptake of such methods widely. By adopting a fully principled approach, which identifies an augmented target distribution which is at the heart of the particle marginal Metropolis–Hastings approach, we gain significant extra mileage, notably through the particle Gibbs algorithm, a method that applies Gibbs sampling to the same augmented target distribution. This particle Gibbs algorithm goes significantly beyond what has been applied before and allows inference in intractable models where the only feasible state sampling approach is particle filtering. It should be highlighted, however, that the approach is one of the most computationally demanding methods proposed to date. In its basic form it requires a full particle filtering run for each iteration of the MCMC algorithm, which for a complex model with many static parameters could prove infeasible.
The algorithm is also slightly wasteful in that all but one particle and its backtracking lineage are discarded in each step of the algorithm (even though the discarded samples can be used in the final Monte Carlo estimates, as shown by the authors in Section 4.6). This latter point raises the possibility that one might adapt a parallel chain or population Monte Carlo scheme to the particle MCMC framework, to utilize fully in the MCMC algorithm more than one stream of output from the particle filter.
To conclude, I wonder whether the authors have considered adaptations of their approach which incorporate particle smoothing, both Viterbi style (Godsill et al., 2001) and backward sampling (Godsill et al., 2004). These could improve the quality of the proposals from the particle filter at relatively small cost (at least in the backward sampling case, which is O(NT) per sample path, as for the basic particle filter). This latter approach typically gives better diversity of backward sample paths than those arising from the standard filter output—hence I wonder also whether we can gain something by including multiple path imputations from the smoother into the particle MCMC approach—see my earlier comment about parallel chain or population MCMC methods.
The vote of thanks was passed by acclamation.
Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris)
Two interesting metrics for the influence of a paper read to the Society are
In both respects, this paper fares very well.
Regarding (a), in many complicated models the only tractable operations are state filtering and likelihood evaluation; see for example the continuous time model of Chopin and Varini (2007). In such situations, the particle Hastings–Metropolis (PHM) algorithm offers Bayesian estimates ‘for free’, which is very nice.
Similarly, Chopin (2007) (see also Fearnhead and Liu (2007)) formulated changepoint models as state space models, where the state x_{t}=(θ_{t},d_{t}) comprises the current parameter θ_{t} and the time since last change d_{t}. Then we may use sequential Monte Carlo (SMC) methods to recover the trajectory x_{1:T}, i.e. all the change dates and parameter values. It works well when x_{t} forgets its past sufficiently quickly, but this forbids hierarchical priors for the durations and the parameters. PHM removes this limitation: Chopin's (2007) SMC algorithm may be embedded in a PHM algorithm, where each iteration corresponds to different hyperparameters. This comes at a cost, however, as each Markov chain Monte Carlo (MCMC) iteration runs a complete SMC algorithm.
Regarding (b), several questions, which have already been answered in the standard SMC case, may be asked again for particle MCMC methods. Does residual resampling outperform multinomial resampling? Is the algorithm with N+1 particles strictly better than that with N particles? What happens about Rao–Blackwellization, or the choice of the proposal distribution? One technical difficulty is that marginalizing out components always reduces the variance in SMC sampling, but not in MCMC sampling. Another difficulty is that particle MCMC methods retain only one particle trajectory x_{1:T}; hence the effect of reducing variability between particles is less obvious.
Similarly, obtaining a single trajectory x_{1:T} from a forward filter is certainly much easier than obtaining many of them, but it may still be demanding in some scenarios, i.e. there may be so much degeneracy in x_{1} that not even one particle contains an x_{1} in the support of p(x_{1} | y_{1:T}).
Rong Chen (Rutgers University, Piscataway)
It is a pleasure to congratulate the authors on an impressive, timely and important paper. The problem of parameter estimation for complex dynamic systems by using sequential Monte Carlo methods has been known as a very difficult problem. The authors provide a clean and powerful way to deal with such a problem. The method will certainly become a popular and powerful tool for solving complex problems.
I wish to concentrate my discussion on one aspect—the resampling scheme. The current paper seems to insist on resampling by using the current weights (e.g. assumption 2). We note that the procedure proposed actually works for more flexible resampling schemes. In a way, we can view that a flexible resampling scheme is in effect changing the intermediate distributions. More specifically, in the notation of the paper, a flexible resampling scheme operates as follows. At times n=2,…,T, first construct . Then
(a) sample
(b) sample and set , and
(c) compute and normalize the weights and
This is not a new idea. For example, Liu (2001) mentioned the use of for some α ∈ (0,1) to reduce the sudden impact of large jumps in the system. Shephard (private conversation) suggested the use of an incremental weight spreading technique,
The auxiliary particle filter of Pitt and Shephard (1999) in a way can be thought of as using
where is a prediction of the future state X_{n}. Similarly, we can also use delayed sampling (Chen et al., 2000; Wang et al., 2002) and block sampling (Doucet et al., 2006) ideas to design the resampling schemes, bringing in future information in the resampling scheme. Lin et al. (2010) constructed the resampling scores by using backward pilots in generating Monte Carlo samples of diffusion bridges.
The flexible resampling scheme is essentially changing the intermediate distribution γ_{t−1}(x_{t}−1) (which is defined in Section 4.1) to
hence all the theoretical properties of standard particle filters work. It also works inside the particle Markov chain Monte Carlo algorithm.
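The general principle can be sketched in code. This is only an illustration under stated assumptions, not the scheme as written in the displayed steps: it takes resampling scores proportional to the current weights raised to a power α, one of the choices mentioned above, draws ancestors from those scores by multinomial sampling, and corrects each surviving particle by the ratio of its weight to its score. The function name and signature are hypothetical.

```python
import numpy as np


def flexible_resample(weights, alpha, rng):
    """Resample with scores proportional to weights**alpha and return the
    ancestor indices together with the corrected, normalized weights.

    alpha = 1 corresponds to resampling with the current weights (uniform
    weights afterwards); alpha = 0 draws ancestors uniformly and keeps weights
    proportional to the originals.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    scores = w ** alpha
    scores = scores / scores.sum()
    ancestors = rng.choice(len(w), size=len(w), p=scores)
    # Dividing by the score undoes the bias introduced by not resampling
    # according to the weights themselves.
    corrected = w[ancestors] / scores[ancestors]
    return ancestors, corrected / corrected.sum()
```

Because each selected particle carries weight proportional to weight over score, weighted averages over the resampled population target the same distribution as before resampling, whatever positive scores are used.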
Mark Girolami (University of Glasgow)
This is a potentially very important contribution to Markov chain Monte Carlo (MCMC) methodology. The capabilities of existing MCMC techniques are being severely stretched, in part because of the increasing awareness of the importance of statistical issues surrounding the mathematical modelling of complex stochastic nonlinear dynamical systems in areas such as computational finance and biology. The proposed particle Markov chain Monte Carlo (PMCMC) framework of algorithms provides very general and powerful novel methodology which may allow inference to proceed over increasingly complex models in a more efficient manner and as such this is a most welcome addition to the literature.
The use of an approximate posterior to improve proposal efficiency in terms of producing large moves with high probability of acceptance is a strategy that has been demonstrated to great effect in reversible jump MCMC methods where approximate posteriors for model proposals ensure high acceptance of between-model moves (Lopes and West, 2004; Zhong and Girolami, 2009). A similar strategy is to consider a proposal process as the outcome of forward simulation of a stochastic differential equation which has the desired target distribution as its ergodic stationary distribution. Simulating from the stochastic differential equation numerically incurs errors which can then be corrected for, as with PMCMC sampling, by employing the Hastings ratio, e.g. the Metropolis adjusted Langevin algorithm (Roberts and Stramer, 2003). The alternative method is to forward-simulate numerically a deterministic system based on a Hamiltonian and to employ a Metropolis accept–reject step to correct for discrete integration errors, as in the hybrid Monte Carlo methods which have been shown to perform well on high dimensional problems that were similar to those studied in this paper (Neal, 1993; Girolami et al., 2009).
The correctness of the algorithms is established with extensive and detailed proofs; therefore my comments have a practical focus. The strategy that is adopted is to employ an approximate, potentially non-equilibrium sequential Monte Carlo (SMC) procedure to make high dimensional proposals for the Metropolis method. In many ways the issue of designing a proposal mechanism is pushed back to designing importance distributions for the SMC method so that difficulties may yet arise in terms of tuning the SMC parameters to obtain a high rate of acceptance. Sampling from the joint posterior p(θ, x_{1:T} | y_{1:T}) within the PMCMC framework may still require the undesirable design of a proposal for the parameters θ as employed in the particle marginal Metropolis–Hastings sampler although the particle Gibbs sampler employing conditional SMC updates appears a promising though largely untested alternative.
Nick Whiteley (University of Bristol)
I offer my thanks to the authors for an inspirational paper. Their approach to constructing extended target distributions is powerful and can be exploited further and applied elsewhere. A key ingredient is the elucidation of the probability model underlying a sequential Monte Carlo (SMC) algorithm and the genealogical tree structures that it generates. Two further developments on this theme are described below.
There is an alternative. Having sampled K, for n=T−1,…,1, we could sample from
with defined as before according to expression (44), but with newly sampled ancestor indices.
The advantage of this ‘backward’ sampling is that it enables exploration of all possible ancestral lineages and not only those obtained during the ‘forward’ SMC run. This offers a chance to circumvent the path degeneracy phenomenon and to obtain a faster mixing particle Gibbs kernel, albeit at a slightly increased computational cost.
When p(x_{1:T},y_{1:T}) arises from a state space model, it is straightforward to verify that
which uses the importance weights that are obtained during the forward SMC run. In this case, the above procedure coincides with one draw using the smoothing method of Godsill et al. (2004).
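For a state space model, one backward draw of this kind can be sketched as follows. This is a minimal illustration in the spirit of the smoothing method of Godsill et al. (2004), under assumptions that are mine: scalar states, stored particles and normalized forward weights held in T × N arrays, and a user-supplied log transition density log_f(x_next, x_prev) that is vectorized over x_prev. None of these names come from the paper.

```python
import numpy as np


def backward_sample(particles, weights, log_f, rng):
    """Draw one trajectory of particle indices by backward sampling.

    particles : (T, N) array of stored particle values (scalar states assumed)
    weights   : (T, N) array of normalized forward importance weights
    log_f     : callable(x_next, x_prev) giving the log transition density of
                x_next given each entry of the array x_prev
    """
    T, N = particles.shape
    path = np.empty(T, dtype=int)
    # The final index is drawn from the terminal weights.
    path[T - 1] = rng.choice(N, p=weights[T - 1])
    for n in range(T - 2, -1, -1):
        x_next = particles[n + 1, path[n + 1]]
        with np.errstate(divide="ignore"):
            logp = np.log(weights[n]) + log_f(x_next, particles[n])
        # Forward weight times transition density, normalized stably.
        p = np.exp(logp - logp.max())
        path[n] = rng.choice(N, p=p / p.sum())
    return path
```

Each backward step costs O(N), so one path costs O(NT), and the path is no longer restricted to the lineages that survived the forward resampling steps.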
Secondly, I believe that the particle Markov chain Monte Carlo framework can be adapted to accommodate the particle filter of Fearnhead and Clifford (2003), which is somewhat different from the SMC algorithm that is considered in the present paper. Owing to constraints on space I provide no specifics here, but I believe that suitable formulation of the probability model underlying the algorithm of Fearnhead and Clifford (2003) allows it to be manipulated as part of a particle Markov chain Monte Carlo algorithm.
Gareth Roberts (University of Warwick, Coventry)
I add my congratulations to the authors for this path breaking work. In this discussion, I shall expand on comments in the paper linking the methods introduced to a generic framework for Markov chain Monte Carlo (MCMC) methods which can be applied to missing data problems and other situations where the target density is unavailable but can be estimated unbiasedly by using an auxiliary variable construction. This work can be found in Andrieu and Roberts (2009), generalizing an idea that was introduced in Beaumont (2003).
For MCMC sampling, enlargement of state spaces comes at a price. Consider, for instance an ‘optimized’ Metropolis–Hastings algorithm on π(θ,z). Typically this converges slower than its rival counterpart on the marginalized distribution π(θ). This suggests that we might mimic the marginalized algorithm through Monte Carlo sampling . Here I shall describe the simplest version of the pseudomarginal approach.
Choose Z ∈ R^{N}∼^{IID}q, and set
Consider two options for using within an MCMC framework: Monte Carlo within Metropolis and generalized importance Metropolis–Hastings.
Step  Marginal  Monte Carlo within Metropolis  Generalized importance Metropolis–Hastings 
0: given  θ and π(θ)  θ and π(θ)  θ,Z and 
1: sample  θ^{*}∼q(θ,·)  θ^{*}∼q(θ,·)  θ^{*}∼q(θ,·) 
   
2: compute  π(θ^{*})  and  
3: compute r    
4: with probability  ϑ=θ^{*}  ϑ=θ^{*}  ϑ=θ^{*},Z=Z^{*} 
otherwise  ϑ=θ  ϑ=θ  ϑ=θ, Z=Z 
The Monte Carlo within Metropolis approach biases the MCMC algorithm so that the marginal stationary distribution of θ under the scheme is typically not π (if it exists at all). However, the generalized importance Metropolis–Hastings approach has the following invariant distribution:
The θ-marginal of this chain is π(θ).
Thus there is no Monte Carlo bias in generalized importance Metropolis–Hastings sampling (though of course there is still Monte Carlo error) and, under weak regularity conditions, as N → ∞ the algorithm ‘converges’ to the true marginal algorithm.
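The difference between the two options can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the construction in Andrieu and Roberts (2009): the target is a standard normal density, the unbiased non-negative estimate multiplies it by the average of N log-normal variables with mean 1, and the chain stores the current estimate rather than Z itself, which plays the same role here. Setting refresh_current=True re-estimates the current density at every iteration, which corresponds to Monte Carlo within Metropolis.

```python
import numpy as np


def noisy_density(theta, n, rng, sigma=0.5):
    """Unbiased, non-negative estimate of exp(-theta**2 / 2) (hypothetical toy)."""
    noise = np.exp(sigma * rng.standard_normal(n) - 0.5 * sigma**2)  # mean 1
    return np.exp(-0.5 * theta**2) * noise.mean()


def pseudo_marginal_chain(n_iter, n, step=1.0, refresh_current=False, seed=0):
    """Random walk Metropolis using estimated densities.

    refresh_current=False keeps the estimate attached to the current state
    (the generalized importance Metropolis-Hastings idea);
    refresh_current=True recomputes it each iteration (Monte Carlo within
    Metropolis).
    """
    rng = np.random.default_rng(seed)
    theta = 0.0
    dens = noisy_density(theta, n, rng)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        if refresh_current:
            dens = noisy_density(theta, n, rng)
        prop = theta + step * rng.standard_normal()
        dens_prop = noisy_density(prop, n, rng)
        # Accept with probability min(1, dens_prop / dens).
        if rng.random() * dens < dens_prop:
            theta, dens = prop, dens_prop
        chain[i] = theta
    return chain
```

With refresh_current=False the stored estimate is recycled until the next acceptance, which is what keeps π(θ) as the θ-marginal of the invariant distribution for any N.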
Drawing Z as an independent and identically distributed sample can be significantly improved on, e.g. by letting Z denote a sample path of a Markov chain with invariant distribution π(z | θ) (or even a particle approximation as in the present paper).
Andrieu and Roberts (2009) apply this idea in simple examples and explore some of the theoretical properties of the method. One important and promising application of the idea involves a substantial generalization of reversible jump MCMC sampling which improves the potentially problematic step of choosing appropriate between-dimension moves.
In modified form, this construction is also an ‘exact’ and efficient computational solution to doubly intractable problems (see Andrieu et al. (2008)),
for unknown K(·) as well as θ.
Miguel A. G. Belmonte (University of Warwick, Coventry) and Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)
We congratulate the authors for a remarkable paper, which addresses a problem of fundamental practical importance: parameter estimation in state space models by using sequential Monte Carlo (SMC) algorithms. In Belmonte et al. (2008) we fit duration state space models to high frequency transaction data and we require a computational methodology that can handle efficiently time series of length . We have experimented with particle Markov chain Monte Carlo (PMCMC) methods and with the smooth particle filter (SPF) of Pitt (2002). The latter is also based on the use of SMC algorithms to derive maximum likelihood parameter estimates; it is, however, limited to scalar signals. Therefore, in the context of duration modelling this limitation rules out multifactor or multidimensional models, and we believe that PMCMC methods can be very useful in such cases.
In this contribution we present a preliminary simulation study which contrasts particle marginal Metropolis–Hastings (PMMH), particle Gibbs (PG) and the SPF methods on simulated data from a linear single-factor state space model:
 (45)
Parameter values are set to μ=0.75, φ=0.95 and and various values for T and the signal-to-noise ratio are tried. When Bayesian inference with PMCMC sampling is made for μ, an improper flat prior is used. We adopt a pragmatic point of view according to which the practitioner, especially for a small number of parameters, is indifferent between maximum likelihood and Bayesian inference but is mostly worried about the computational efficiency of the methods. Our simulation and prior specification set-up is such that the posterior mean and precision estimates coincide with the maximum likelihood and observed information estimates respectively, and the exact values are available by using the Kalman filter (KF). The bootstrap filter is used in all the SMC algorithms.
For various values of T, Table 1 shows a comparison of parameter estimates by the particle methods and KF. In this problem the SPF and PG methods show remarkable robustness to the length of the series in terms of the accuracy of the estimates. The mixing time of the latter does not show deterioration with T (note that the mixing time of the limiting algorithm with T=∞ does not arbitrarily deteriorate with T either; see Papaspiliopoulos et al. (2003) for details). We also varied the signal-to-noise ratio and report our findings in Table 2.
Table 2. Comparison of estimates by the SPF, PMMH and PG methods against the KF for combinations of signal-to-noise ratio†
Results for the following signal-to-noise ratios:
 0.05  0.23  0.41  0.59  0.77  0.95 
KF 
 0.537  0.559  0.582  0.608  0.641  0.695 
l(μ_{KF})  −920.806  −978.023  −1014.000  −1031.784  −1032.964  −993.820 
 0.128  0.104  0.080  0.056  0.031  0.007 
SPF 
Relative error  0.211  0.000  −0.063  −0.056  −0.000  0.004 
Likelihood difference  −0.0501  −0.0000  −0.0085  −0.0103  −0.0000  −0.0005 
Ratio of variance  0.918  0.927  0.940  0.942  0.966  0.985 
PMMH 
Relative error  −0.108  0.109  0.027  0.008  −0.014  −0.003 
Likelihood difference  −0.0131  −0.0178  −0.0016  −0.0002  −0.0012  −0.0002 
Ratio of variance  1.574  1.049  1.101  1.058  1.071  0.961 
Acceptance probability  0.023  0.097  0.159  0.218  0.231  0.161 
Efficiency  138.94  20.62  12.18  6.57  5.06  4.64 
Centred PG 
Relative error  −0.015  0.004  0.004  0.003  0.002  0.001 
Likelihood difference  −0.0002  −0.0000  −0.0000  −0.0000  −0.0000  −0.0001 
Ratio of variance  0.988  0.988  0.989  0.987  0.990  0.990 
Efficiency  1.00  1.00  1.00  1.01  1.01  1.05 
Non-centred PG 
Relative error  −0.093  −0.099  −0.436  0.181  0.047  0.010 
Likelihood difference  −0.0098  −0.0148  −0.4028  −0.1083  −0.0141  −0.0032 
Ratio of variance  0.000  0.018  0.345  1.356  1.088  1.006 
Efficiency  1.77  22.19  166.99  366.05  126.63  22.02 
We also consider two different parameterizations under which we applied PG sampling: the so-called centred (X_{1},…,X_{T},θ) and non-centred (X_{1}−θ,…,X_{T}−θ,θ); see Papaspiliopoulos et al. (2003). When the state has such high persistence it is known (Papaspiliopoulos et al., 2003) that the centred Gibbs sampler (for T=∞) has better mixing. The robustness of PG sampling is again very promising. Note that the SPF and PMMH methods have worse performance for small values of the ratio, which is due to the deterioration of the bootstrap filter with decreasing observation error. This deterioration appears to have no effect on PG sampling in this simple setting.
Krzysztof Łatuszyński (University of Toronto) and Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)
We congratulate the authors for a beautiful paper. A fundamental idea is the interplay between unbiased estimation (by means of importance sampling in this paper) and exact simulation. We show how unbiased estimation relates to exact simulation of events of unknown probability s ∈ [0,1]. Details, proofs and an application to the celebrated Bernoulli factory problem (Nacu and Peres, 2005) can be found in Łatuszyński et al. (2009).
We wish to simulate the binary random variable C_{s} such that P[C_{s}=1]=s. If Ŝ is a realizable unbiased estimator of s taking values in [0,1], we use the following algorithm 1.
Step 1: simulate G_{0}∼U(0,1).
Step 2: obtain Ŝ.
Step 3: if G_{0} ≤ Ŝ set C_{s}:=1; otherwise set C_{s}:=0.
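Algorithm 1 amounts to comparing a uniform draw with one realization of the estimator. A minimal sketch follows, assuming the estimator is supplied as a callable that returns a value in [0,1]; the function name and the toy estimator in the usage note are illustrative only.

```python
import numpy as np


def bernoulli_from_unbiased(estimator, rng):
    """Return 1 with probability s, given an unbiased [0,1]-valued estimator of s.

    estimator : callable(rng) returning one realization of the estimator
    """
    g0 = rng.random()        # step 1: G_0 ~ U(0,1)
    s_hat = estimator(rng)   # step 2: one draw of the unbiased estimator
    return 1 if g0 <= s_hat else 0  # step 3
```

For example, an estimator that is uniform on (0, 0.6) has mean 0.3, so the returned value equals 1 with probability 0.3: conditionally on the estimate the event has probability equal to the estimate, and averaging over the estimator gives s.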
If l_{1},l_{2},… and u_{1},u_{2},… are sequences of lower and upper bounds converging monotonically to s then we can resort to the following algorithm 2.
Step 1: simulate G_{0}∼U(0,1); set n=1.
Step 2: compute l_{n} and u_{n}.
Step 3: if G_{0} ≤ l_{n} set C_{s}:=1.
Step 4: if G_{0}>u_{n} set C_{s}:=0.
Step 5: if l_{n}<G_{0} ≤ u_{n} set n:=n+1 and go to step 2.
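Algorithm 2 can be sketched in the same way. The code below assumes that the deterministic bounds are supplied as callables of n and adds an iteration cap as a safeguard, which is not part of the algorithm as stated; the names are hypothetical.

```python
import numpy as np


def bernoulli_from_bounds(lower, upper, rng, max_iter=10_000):
    """Return 1 with probability s, given monotone bounds l_n <= s <= u_n
    that converge to s.

    lower, upper : callables(n) returning l_n and u_n for n = 1, 2, ...
    """
    g0 = rng.random()                 # step 1
    for n in range(1, max_iter + 1):
        l_n, u_n = lower(n), upper(n)  # step 2
        if g0 <= l_n:                  # step 3: G_0 is certainly below s
            return 1
        if g0 > u_n:                   # step 4: G_0 is certainly above s
            return 0
        # step 5: l_n < G_0 <= u_n, so refine the bounds and try again
    raise RuntimeError("bounds did not separate G_0 from s within max_iter steps")
```

The loop stops as soon as the bounds decide on which side of s the uniform draw falls, so s itself never needs to be computed.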
Under these assumptions we can use the following algorithm 3.
Step 1: simulate G_{0}∼U(0,1); set n=1.
Step 2: obtain L_{n} and U_{n} given
Step 3: if G_{0} ≤ L_{n} set C_{s}:=1.
Step 4: if G_{0}>U_{n} set C_{s}:=0.
Step 5: if L_{n}<G_{0} ≤ U_{n} set n:=n+1 and go to step 2.
Consider the following algorithm 4, which uses auxiliary random sequences and constructed on line.
Step 1: simulate G_{0}∼U(0,1); set n=1; set
Step 2: obtain L_{n} and U_{n} given
Step 3: compute
 52
Step 4: compute
 (52)
 (53)
Step 5: if set C_{s}:=1.
Step 6: if set C_{s}:=0.
Step 7: if set n:=n+1 and go to step 2.
Thomas Flury and Neil Shephard (University of Oxford)
We congratulate Christophe Andrieu, Arnaud Doucet and Roman Holenstein for this important contribution to the sequential Monte Carlo and Markov chain Monte Carlo (MCMC) literature. At the base of their paper is the deceptively simple-looking idea of combining two powerful and well-known Monte Carlo algorithms to create a truly Herculean tool for statisticians. They use sequential Monte Carlo methods to generate high dimensional proposal distributions for MCMC algorithms.
We focus our discussion on one very specific insight: one can use an unbiased simulation-based estimator of the likelihood inside an MCMC algorithm to perform Bayesian inference. For dynamic models this estimator is obtained from a standard particle filter. Importantly, this means that the particle filter now offers a complete extension of the Kalman filter: it can carry out filtering and now direct parameter estimation.
We are particularly impressed with the minimalistic assumptions that we need to perform likelihood-based inference in dynamic nonlinear and non-Gaussian state space models, which is of great interest for microeconometrics, macroeconometrics and financial econometrics. In the particle marginal Metropolis–Hastings algorithm we only need to be able to evaluate the measurement density and to sample from the state transition density. Another advantage is that we do not need an infinite number of simulation draws for consistency: all theoretical results hold from as few as N ≥ 1 particles. Practical implementation is also very easy as one only needs to change very few lines of code to estimate a different model.
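These minimal requirements are visible in a sketch of the likelihood estimator from a bootstrap particle filter. The code is an illustration under assumptions of mine (multinomial resampling at every step, user-supplied callables with the hypothetical signatures shown); it uses only draws from the initial and transition distributions and evaluations of the measurement log-density.

```python
import numpy as np


def bootstrap_loglik(y, sample_x0, sample_transition, log_obs_density,
                     n_particles, rng):
    """Log of the bootstrap particle filter estimate of the likelihood.

    sample_x0(n, rng)           -> n draws from the initial state distribution
    sample_transition(x, rng)   -> one draw of the next state for each entry of x
    log_obs_density(y_t, x)     -> log measurement density of y_t given each state
    """
    x = sample_x0(n_particles, rng)
    loglik = 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            # Multinomial resampling followed by propagation through the model.
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x = sample_transition(x[idx], rng)
        logw = log_obs_density(y_t, x)
        m = logw.max()
        w = np.exp(logw - m)
        # Each factor estimates the predictive density of y_t given the past.
        loglik += m + np.log(w.mean())
        w = w / w.sum()
    return loglik
```

It is the exponential of the returned value, the likelihood estimate itself, that is unbiased; its logarithm is not. Estimating a different model only requires different callables.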
In Flury and Shephard (2010) we showed the power of this method on four famous examples in econometrics. Other applications, such as in repeated auctions, will also become important. Our experience is that these methods work, are quite simple to implement, general purpose and highly computationally demanding. The last point is important; they take so long to run that it is tempting to use the phrase ‘computationally brutal’.
Christian P. Robert and Pierre Jacob (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris), Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris) and Håvard Rue (Norwegian University for Science and Technology, Trondheim)
We congratulate the authors for opening a new vista for running Markov chain Monte Carlo (MCMC) algorithms in state space models. Being able to devise a correct Markovian scheme based on a particle approximation of the target distribution is a genuine tour de force that deserves enthusiastic recognition! This is all the more impressive when considering that the ratio
 (54)
is not unbiased and thus invalidates the usual importance sampling solutions, as demonstrated by Beaumont et al. (2009). Thus, the resolution of simulating by conditioning on the lineage truly is an awesome resolution of the problem!
We implemented the particle Hastings–Metropolis algorithm for the (notoriously challenging) stochastic volatility model
based on 500 simulated observations. With parameter moves
and state space moves derived from the autoregressive AR(1) prior, we obtained good mixing properties with no calibration effort, using N=10^{2} particles and 10^{4} Metropolis–Hastings iterations, as demonstrated by Figs 9 and 10. Other runs (which are not reproduced here) exhibited multimodal configurations that the particle MCMC algorithm managed to handle satisfactorily within 10^{4} iterations.
Contemplating a different model does not even require the calculation of full conditionals, in contrast with Gibbs sampling. Another advantage of the particle Hastings–Metropolis algorithm is that it is trivial to parallelize. (Adding a comment before the loop over the particle index is enough, by using the OpenMP technology.)
Finally, we mention possible options for a better recycling of the numerous simulations that are produced by the algorithm. This dimension of the algorithm deserves deeper study, maybe to the extent of allowing for a finite time horizon overcoming the MCMC nature of the algorithm, as in the particle Monte Carlo solution of Cappé et al. (2008).
A more straightforward remark is that, owing to the additional noise that is brought by the resampling mechanism, more stable recycling would be produced both in the individual weights w_{n}(X_{1:n}) by Rao–Blackwellization of the denominator in equation (7) as in Iacobucci et al. (2009) and over past iterations by a complete reweighting scheme like AMIS (Cornuet et al., 2009). Another obvious question is whether or not the exploitation of the wealth of information that is provided by the population simulations is manageable via adaptive MCMC methods (Andrieu and Robert, 2001; Roberts and Rosenthal, 2009).
Finally, since
is an unbiased estimator of p(y_{1:T}), there must be direct implications of the method towards deriving better model choice strategies in such models, as exemplified in the population Monte Carlo method of Kilbinger et al. (2009) in a cosmology setting.
The following contributions were received in writing after the meeting.
Anindya Bhadra (University of Michigan, Ann Arbor)
The authors present an elegant theory for novel methodology which makes Bayesian inference practical on implicit models. I shall use their example, a sophisticated financial model involving a continuous time stochastic volatility process driven by Lévy noise, to compare their methodology with a state of the art non-Bayesian approach. I applied iterated filtering (Ionides et al., 2006, 2010) implemented via the mif function in the R package pomp (King et al., 2008).
The decision about whether one wishes to carry out a Bayesian analysis should depend on whether one wishes to impose a prior distribution on unknown parameters. Here, I have shown that likelihood-based non-Bayesian methodology provides a computationally viable alternative to the authors’ Bayesian approach for complex dynamic models.
Luke Bornn and Aline Tabet (University of British Columbia, Vancouver)
We congratulate the authors on this very important contribution to stochastic computation in statistics. Whereas the authors have explored and discussed several applications in the paper, we would like to highlight the benefits of using particle Markov chain Monte Carlo (PMCMC) methods as a way to extend sequential Monte Carlo (SMC) methods which employ sequences of distributions of static dimension. Through PMCMC sampling, we can separate the variables of interest into those which may be easily sampled by using traditional MCMC techniques and those which require a more specialized SMC approach. Consider for instance the use of simulated annealing in an SMC framework (Neal, 2001; Del Moral et al., 2006). Rather than finding the maximum a posteriori estimate of all parameters, PMCMC sampling now allows practitioners to combine annealing with traditional MCMC methods to maximize over some dimensions while simultaneously exploring the full posterior in others.
When variables are highly correlated, SMC methods may be used as an efficient alternative to MCMC sampling. For instance, SMC samplers (Del Moral et al., 2006) and other population-based methods (Jasra et al., 2007) proceed by working through a sequence of auxiliary distributions until a particle-based approximation to the posterior is reached. In non-identifiable or weakly identifiable models, SMC sampling is used to construct a sequence of tempered distributions allowing particles to explore fully the resulting ridges in the posterior surface of the non-identifiable variables. However, because SMC algorithms often rely on importance sampling, they can suffer in high dimensions owing to increased variability in the importance weights. Many non-identifiable models contain only a small portion of variables with identifiability issues, and hence it may be adding unnecessary complication to build the tempered distributions in all dimensions. In this case, PMCMC sampling gives the option to explore some parameters by using MCMC sampling while exploring others (such as those which are highly correlated or non-identifiable) with SMC sampling, and hence limit variance in the SMC importance weights. There are several options for performing this in the PMCMC framework: both the particle Gibbs and the particle Metropolis–Hastings variants could be used; the choice largely depends on the correlation between the identifiable and non-identifiable subsets of variables. In conclusion, we feel that, as much as PMCMC sampling provides Monte Carlo solutions to a unique class of problems, it also provides a flexible framework allowing practitioners to mix and match Monte Carlo strategies to suit their particular application.
Olivier Cappé (Telecom ParisTech and Centre National de la Recherche Scientifique, Paris)
I congratulate the authors for this impressive piece of work which, I believe, is a very significant contribution to the toolbox of Markov chain Monte Carlo and sequential Monte Carlo (SMC) methods.
For brevity, I focus on the particle independent Metropolis–Hastings (PIMH) algorithm which is the basic building block for the other samplers that are presented in the paper. Although theorem 2 also covers the more involved case of SMC sampling, the core idea is the auxiliary construction which shows that a proper Markov chain Monte Carlo algorithm may be obtained from sampling–importance resampling (Rubin, 1987), irrespective of the number N of particles. This idea, however, does seem to be quite different both from the multiple-try (Liu et al., 2000) and the pseudo-marginal (Beaumont, 2003) approaches and I encourage the authors to discuss in more detail its connections, if any, with earlier ideas in the literature.
Fig. 3 (in Section 3.1) is very promising as it suggests that the approach is practicable in large dimensional settings for which a ‘causal’ factorization of the likelihood is available. In particular, I wonder whether it is possible to predict the relationship between the dimension T and the number N of particles that is implicit in Fig. 3. In an attempt to answer this question, I conducted a toy numerical experiment in the spirit of the scaling construction (Roberts and Rosenthal, 2001), where the target π_{T} is a product probability density function and SMC sampling is also carried out by using successive independent proposals—clearly, the latter situation is very specific, although it satisfies assumptions 1–4 that were made in the paper. In this example, any method based on direct importance sampling, including the PIMH algorithm using an SMC algorithm without resampling (i.e. sequential importance sampling), is bound to fail for all feasible values of N when T is larger than, say, 17 (see the caption of Fig. 12). In contrast, Fig. 12 shows that the PIMH algorithm using an embedded SMC algorithm with resampling at each step (as described in Section 2.2.1) can cope with dimensions as large as T=10^{3}. In addition, Fig. 12 also suggests that increasing N as O(T) is sufficient to stabilize the acceptance rate. I would be happy to hear the authors’ comments on whether the behaviour of PIMH sampling in this simple scenario can be inferred from known results about SMC methods regarding the rate of convergence of as N increases.
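A toy experiment of this kind can be imitated in a few lines. The following is a minimal sketch under stated assumptions, not the discussant's actual code: each factor of the product target is taken to be standard normal and the independent proposal is a wider normal with standard deviation `sigma` (both hypothetical choices), and the function names are illustrative. Because the proposals do not depend on the past, the particles at each step are independent draws from the proposal whatever was resampled, so the SMC estimate of the normalizing constant reduces to a product over time of average weights, and the PIMH sampler accepts with probability min(1, ratio of successive estimates).

```python
import numpy as np

def log_z_hat(T, N, rng, sigma=1.5):
    """Log of the SMC normalizing-constant estimate for the toy product target.

    Assumed setting: target factor N(0, 1), independent proposal N(0, sigma^2),
    resampling at every step. The true normalizing constant is 1 (log = 0).
    """
    x = rng.normal(0.0, sigma, size=(T, N))
    # log incremental weight = log target density - log proposal density
    log_w = -0.5 * x**2 + 0.5 * x**2 / sigma**2 + np.log(sigma)
    m = log_w.max(axis=1)  # per-step maximum, for numerical stability
    step_means = np.mean(np.exp(log_w - m[:, None]), axis=1)
    return float(np.sum(m + np.log(step_means)))

def pimh_acceptance_rate(T, N, iters, seed=0, sigma=1.5):
    """Empirical acceptance rate of a PIMH chain on the toy target."""
    rng = np.random.default_rng(seed)
    current = log_z_hat(T, N, rng, sigma)
    accepted = 0
    for _ in range(iters):
        proposed = log_z_hat(T, N, rng, sigma)
        # accept with probability min(1, Z_hat_proposed / Z_hat_current)
        if np.log(rng.uniform()) < proposed - current:
            current = proposed
            accepted += 1
    return accepted / iters
```

Evaluating `pimh_acceptance_rate` over a grid of T with N held proportional to T gives a rough empirical check of whether the acceptance rate stabilizes, in the spirit of the O(T) scaling suggested above.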
J. Cornebise (Statistical and Applied Mathematical Sciences Institute, Durham) and G. W. Peters (University of New South Wales, Sydney)
Our comments on adaptive sequential Monte Carlo (SMC) methods relate to particle Metropolis–Hastings (PMH) sampling, which has acceptance probability given in equation (13) of the paper for proposed state , relying on the estimate
Although a small N suffices to approximate the mode of a joint path space distribution, producing a reasonable proposal for x_{1:T}, it results in high variance estimates of . We study the population dynamics example from Hayes et al. (2010), model 3 excerpt, involving a log-transformed θ-logistic state space model; see Wang (2007), equations 3(a) and 3(b), for parameter settings and Figs 13–15 for an illustration of the algorithm's behaviour. Particle Markov chain Monte Carlo (PMCMC) performance depends on the trade-off between degeneracy of the filter, N, and design of the SMC mutation kernel. Regarding the latter, we note the following.
 (a) A Rao–Blackwellized filter (Doucet et al., 2000) can improve acceptance rates; see Nevat et al. (2010).
 (b) Adaptive mutation kernels, which in PMCMC methods can be considered as adaptive SMC proposals, can reduce degeneracy on the path space, allowing for higher dimensional state vectors x_{n}. Adaption can be local (within filter) or global (sampled Markov chain history). Though currently particularly designed for approximate Bayesian computation methods, the work of Peters et al. (2010) incorporates into the mutation kernel of SMC samplers (Del Moral et al., 2006) the partial rejection control (PRC) mechanism of Liu (2001), which is also beneficial for PMCMC sampling. PRC adaption reduces degeneracy by rejecting a particle mutation when its incremental importance weight is below a threshold c_{n}. The PRC mutation kernel
can also be used in PMH algorithms, where q(x_{n}|y_{n},x_{n−1}) is the standard SMC proposal, and
As presented in Peters et al. (2010), algorithmic choices for
can avoid evaluation of r(c_{n},x_{n−1}).
Cornebise (2010) extends this work, developing PRC for auxiliary SMC samplers, which are also useful in PMH algorithms. Threshold c_{n} can be set adaptively: locally either at each SMC mutation or Markov chain iteration, or globally based on chain acceptance rates. Additionally, c_{n} can be set adaptively via quantile estimates of pre-PRC incremental weights; see Peters et al. (2010). Cornebise et al. (2008) stated that adaptive SMC proposals can be designed by minimizing function-free risk theoretic criteria such as Kullback–Leibler divergence between a joint proposal in a parametric family and a joint target. Cornebise (2009), chapter 5, and Cornebise et al. (2010) use a mixture of experts, adapting kernels of a mixture on distinct regions of the state space separated by a ‘softmax’ partition. These results extend to PMCMC settings.
Drew D. Creal (University of Chicago) and Siem Jan Koopman (Vrije Universiteit Amsterdam)
We congratulate the authors on writing an interesting paper. They demonstrate how arguments from Markov chain Monte Carlo (MCMC) theory can be extended to include algorithms where proposals are made from the path realizations that are produced by sequential Monte Carlo (SMC) algorithms such as the particle filter. As with all good ideas, this basic idea is simple and quite clever at the same time. The implementation requires a particle filter routine, which is generally easy to code. Various MCMC strategies such as Metropolis–Hastings steps can then be adopted to accept–reject paths proposed from the discrete particle approximations that are created by the particle filter. The resulting particle MCMC algorithms widen the applicability of SMC methods. The authors also provide a theoretical justification for why the methods work. In practice for complex models, it may be easier to design an SMC algorithm and to include it within an MCMC algorithm rather than design an alternative, perhaps more intricate MCMC algorithm that is computationally less expensive.
The examples in Section 3 are interesting. The first example concerns a nonlinear state space model which is used to compare the new method with a more standard MCMC algorithm. A numerical exercise reveals that the new method outperforms the other slightly. It should be noted that the model is intricate since the corresponding filtering and smoothing distributions are multimodal. The second example is the most interesting since other MCMC algorithms proposed in the literature can be tedious to implement. The difficulty arises because the transition density p{x(t)|x(t−1);θ}, with x(t)=σ^{2}(t) as given by equations (16), is not known in closed form, making it difficult to implement a good MCMC algorithm. The authors show convincingly that their methods are effective for filtering and smoothing. A minor comment is that the time series dimensions for the simulated data sets (T=400) and for the Standard & Poor's 500 data (T=1000) are rather short and atypical. It appears to confirm our suspicion that the method is computationally time intensive, which is due to the repeated loops in the algorithm. However, designing and coding the algorithm are easy anyway.
In the conclusion, the authors state that the performance of particle MCMC algorithms will depend on the variance of the SMC estimates of the normalizing constants. Can they provide some discussion on when practitioners may encounter problems such as this? For example, how does the dimension of the state vector (or state space) affect the algorithm? This is particularly of interest in financial time series where we would like to build multivariate volatility models for high dimensional data. Secondly, how does the specification of the transition equation affect the estimates? For example, many economists specify state space models with unobserved random-walk components.
Despite these somewhat critical but constructive questions, we have enjoyed reading the paper and we are impressed by the results.
Dan Crisan (Imperial College London)
This is an authoritative paper which brings together two of the principal statistical tools for producing samples from high dimensional distributions. The authors propose an array of methods where sequential Monte Carlo (SMC) algorithms are used to design high dimensional proposal distributions for Markov chain Monte Carlo (MCMC) algorithms. The following are some comments that perhaps can suggest future research or improvements in this area.
Firstly the authors present not just the numerical verification of the proposed methodology but also (very laudably) its theoretical justification. They make the point that the theorems that are presented in the paper rely on relatively strong conditions, even though the methods have been empirically observed to apply to scenarios beyond the conditions assumed. In particular, assumption 4 is a very restrictive condition that is rarely satisfied in practice. It amounts (virtually) to the assumption that the state space of the hidden Markov state process is compact. The need for such an assumption is imposed by the preference for a framework where the posterior distribution exhibits stability properties, as discussed in Del Moral and Guionnet (2001). However, in recent years this assumption has been considerably relaxed. Le Gland and Oudjane (2003) have introduced the idea of truncating the posterior distribution, which was further exploited in Oudjane and Rubenthaler (2005) and in Crisan and Heine (2008) to produce stability criteria under quite natural conditions. The theorems in the paper under discussion are likely to hold under the same conditions as those contained, for example, in Crisan and Heine (2008), with proofs that will follow similar steps.
Secondly, the authors concentrate on SMC algorithms where the resampling step is the multinomial step. They make the point that more sophisticated algorithms have been proposed where the multinomial resampling step can be replaced by a stratified resampling procedure and prove the results under conditions that cover other SMC algorithms. However, the optimal choice for the resampling step is the tree-based branching algorithm that was introduced by Crisan and Lyons (2002). This algorithm has several optimality properties (see also Künsch (2005) for additional details) and satisfies the conditions (assumptions 1 and 2) that are required by the theoretical results in the paper.
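To make the difference between the two resampling steps named above concrete, the following minimal sketch implements multinomial and stratified resampling for a vector of normalized weights. The function names are illustrative, and the tree-based branching algorithm of Crisan and Lyons (2002) is not shown here.

```python
import numpy as np

def multinomial_resample(weights, rng):
    """Draw N ancestor indices independently with probabilities `weights`."""
    n = len(weights)
    return rng.choice(n, size=n, p=weights)

def stratified_resample(weights, rng):
    """Draw one ancestor index from each of N equal-probability strata."""
    n = len(weights)
    # one uniform draw inside each stratum [i/n, (i+1)/n)
    u = (np.arange(n) + rng.uniform(size=n)) / n
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0  # guard against rounding error in the cumulative sum
    # invert the weight distribution function at each stratified uniform
    return np.minimum(np.searchsorted(cdf, u, side="right"), n - 1)
```

Under stratified resampling the number of offspring of particle i differs from N times its weight by less than 2, whereas under multinomial resampling the offspring counts are multinomial and can deviate much further; this is one way to see why the choice of resampling step affects the variance of the SMC estimates.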
Thirdly, the trade-off between the average acceptance rate for the particle independent Metropolis sampler and the number of particles that is used to produce the SMC proposal warrants further analysis. The numerical results suggest some deterministic relationship between the two quantities, one that perhaps holds only asymptotically. It would be beneficial to find this relationship and to see what it can tell us about the optimal choice for distributing the computational effort between the SMC and the MCMC steps.
David Draper (University of California, Santa Cruz)
I have two questions on Monte Carlo efficiency for the authors of this interesting paper.
 (a) Has the authors’ methodology reached a sufficiently mature state that they can give us general advice on how to use their methods to obtain the greatest amount of information per central processor unit second about the posterior distribution under study (because this is of course the real performance measure on which users need to focus), and if so what would that advice be? (The authors made a start on this task in Section 3.1; it would be helpful to potential users of their methodology if they could expand on those remarks.)
 (b) People often measure Monte Carlo improvement in Markov chain Monte Carlo samplers by how well a new method can drive positive autocorrelations (in the sampled output for the monitored quantities, viewed as time series) down towards zero, but it is sometimes possible (e.g. Dreesman (2000)) to do even better. Is there any scope in the authors’ work for achieving negative autocorrelations in the Markov chain Monte Carlo output?
Richard Everitt (University of Bristol)
I congratulate the authors on this significant paper. My comments relate to the use of the marginal variant of the algorithm for parameter estimation in undirected graphical models and, more generally, the computational cost of the methods.
Let us consider the following factorization into clique potentials φ_{1:M} on cliques C_{1:M} of a joint probability density function over variables X_{1:T} given parameters θ_{1:M}:
where
As in a state space model, the variables X_{1:T} are observed indirectly through observations y_{1:T} of random variables Y_{1:T}, which are assumed conditionally independent given X_{1:T} and are identically distributed as Y_{i}|X_{1:T}∼g(·|X_{1:T}). Our aim is to estimate the unknown θ given the observations, ascribed prior p(θ), by simulating from the posterior p(θ|y_{1:T}). It is well known that Gibbs sampling from p(θ,X_{1:T}|y_{1:T}) is not feasible since the intractable normalizing ‘constant’ Z_{θ_{1:M}} must be evaluated when updating θ (other standard approaches also fail for the same reason).
As an alternative, consider the direct application of a marginal particle Markov chain Monte Carlo (PMCMC) move where, as in the paper, the proposal q(·|θ) is used to draw a candidate point θ^{*}, the latent variables X_{1:T} are sampled using a sequential Monte Carlo (SMC) algorithm targeting (as, for example, in Hamze and de Freitas (2005)) and the move is accepted with the probability given in equation (35). Note that (owing to the use of the SMC algorithm) this approach has the advantage that at no point does Z_{θ_{1:M}} need to be evaluated directly. A similar approach may be used in the context of MCMC updates on the space of graphical model structures.
The computational cost of PMCMC methods in general is likely to be high (particularly so in application to the model above). Alleviating this through reusing particles from each run of the SMC algorithm seems intuitively possible. Also, it is worth considering the implementation of PMCMC methods on a graphics processing unit to exploit the parallel nature of the algorithm. The work of Maskell et al. (2006) on graphics processing unit implementations of particle filters is directly applicable here (also see recent work by Lee et al. (2009)).
Andrew Golightly and Darren J. Wilkinson (Newcastle University)
We thank the authors for a very interesting paper. Consider a d-dimensional diffusion process X_{t} governed by the stochastic differential equation
where W_{t} is standard Brownian motion. It is common to work with the Euler–Maruyama approximation with transition density f(·|x) such that
For low frequency data, the observed data can be augmented by adding m−1 latent values between every pair of observations. For observations on a regular grid, y_{1:T}=(y_{1},…,y_{T})^{′} that are conditionally independent given {X_{t}} and have marginal probability density g(y|x), inferences are made via the posterior distribution θ,x_{1:T}|y_{1:T} by using Bayesian Markov chain Monte Carlo techniques. Owing to high dependence between x_{1:T} and θ, care must be taken in the design of a Markov chain Monte Carlo scheme. A joint update of θ and x_{1:T} or a carefully chosen reparameterization (Golightly and Wilkinson, 2008) can overcome the problem. The particle marginal Metropolis–Hastings (PMMH) algorithm that is described in the paper allows a joint update of parameters and latent data. Given a proposed θ^{*}, the algorithm can be implemented by running a sequential Monte Carlo algorithm targeting p(x_{1:T}|y_{1:T},θ^{*}) using only the ability to forward-simulate from the Euler–Maruyama approximation.
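The forward simulation that the PMMH scheme relies on can be sketched as follows. This is a minimal scalar illustration under stated assumptions: the function name is illustrative, `drift` and `diffusion` are hypothetical user-supplied placeholders for the drift and the squared diffusion coefficient of the stochastic differential equation, and the d-dimensional case would replace the square root by a matrix square root.

```python
import numpy as np

def euler_maruyama_interval(x0, theta, drift, diffusion, m, dt_obs, rng):
    """Forward-simulate one inter-observation interval with m Euler steps.

    Returns an array of length m: the m - 1 latent values followed by the
    simulated value at the next observation time. `drift(x, theta)` and
    `diffusion(x, theta)` are placeholders supplied by the user.
    """
    dt = dt_obs / m
    x = float(x0)
    out = np.empty(m)
    for j in range(m):
        # Euler-Maruyama step: Gaussian increment with mean drift * dt
        # and variance diffusion * dt
        noise = np.sqrt(diffusion(x, theta) * dt) * rng.normal()
        x = x + drift(x, theta) * dt + noise
        out[j] = x
    return out
```

A bootstrap particle filter would call such a routine once per particle and per observation interval, then weight the final value by g(y|x); no evaluation of the transition density is needed.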
To compare the performance of the PMMH scheme with the method of Golightly and Wilkinson (2008) (henceforth referred to as the GW scheme), consider inference for a stochastic differential equation governing X_{t}=(X_{1,t},X_{2,t})^{′} with
This is the diffusion approximation of the stochastic Lotka–Volterra model (Boys et al., 2008). We analyse a simulated data set of size 50 with θ=(0.5,0.0025,0.3), corrupted by adding zero-mean Gaussian noise. Independent uniform U(−7,2) priors were taken for each log (θ_{i}). The GW scheme and the PMMH sampler were implemented for 500000 iterations, using a random-walk update with normal innovations to propose log (θ^{*}), with the variance of the proposal being the estimated variance of the target distribution, obtained from a preliminary run. The PMMH scheme was run for N=200, N=500 and N=1000 particles and, in all cases, discretization was set by taking m=5.
Computational cost scales roughly as 1:8:20:40 for GW:PMMH (N=200:500:1000). For N=1000 particles, the mixing of the chain under the PMMH scheme is comparable with the GW scheme; Fig. 16. Despite the extra computational cost of the PMMH scheme, unlike the GW scheme the PMMH algorithm is easy to implement and requires only the ability to forward-simulate from the model. This extends the utility of particle Markov chain Monte Carlo methods to a very wide class of models where evaluation of the likelihood is difficult (or even intractable), but forward simulation is possible.
Edward L. Ionides (University of Michigan, Ann Arbor)
The authors are to be congratulated on an exciting methodological development. An attractive feature of this new methodology is that it has an algorithmic implementation in which the only operation applied to the underlying Markov process model is the generation of draws from f(x_{n}|x_{n−1}). This property has been called plug and play (He et al., 2009; Bretó et al., 2009) since it permits simulation code, which is usually readily available, to be plugged straight into general purpose software. I would like to add some additional comments to the authors’ coverage of this aspect of their work.
The plug-and-play property has been developed in the context of complex system analysis with the terminology equation free (Kevrekidis et al., 2004). For optimization methodology, the analogous term gradient free is used to describe algorithms which are based solely on function evaluations. Plug-and-play inference methodology has previously been proposed for state space models (including Kendall et al. (1999), Liu and West (2001), Ionides et al. (2006), Toni et al. (2008) and Andrieu and Roberts (2009)). This paper is distinguished by describing the first plug-and-play algorithm giving asymptotically exact Bayesian inference for both model parameters and unobserved states.
We should expect plug-and-play approaches to require additional computational effort compared with rival methods that have access to closed form expressions for model properties such as transition densities or their derivatives. However, advances in computational capabilities and algorithmic developments are making plug-and-play methodology increasingly accessible for state space models. The great flexibility in model development that is permitted by the generality of plug-and-play algorithms is enabling scientists to ask and answer scientific questions that were previously inaccessible (e.g. King et al. (2008)). The methodology that is developed here (and other approaches which inherit the plug-and-play property from the basic sequential Monte Carlo algorithm) will benefit from further research into improvements and extensions of sequential Monte Carlo methods that fall within the plug-and-play paradigm: reduced variance resampling schemes are consistent with plug-and-play methods, but most other existing refinements are not.
Pierre Jacob (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris), Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Paris), Christian Robert (Centre de Recherche en Economie et Statistique and Université Paris Dauphine, Paris) and Håvard Rue (Norwegian University for Science and Technology, Trondheim)
This otherwise fascinating paper does not cover the calculation of the marginal likelihood p(y), which is the central quantity in model choice. However, the particle Markov chain Monte Carlo (PMCMC) approach seems to lend itself naturally to the use of Chib's (1995) estimate, i.e.
for any θ. Provided that the p(θ|x,y) density admits a closed form expression, the denominator may be estimated by
where the x_{i}s, i=1,…,M, are provided by the MCMC output.
The novelty here is that p(y|θ) in the numerator needs to be evaluated as well. Fortunately, each iteration provides a Monte Carlo estimate of p(y|θ=θ_{i}), where θ_{i} is the parameter value at MCMC iteration i. Some care may be required when choosing θ_{i}; for example selecting the θ_{i} with largest (evaluated) likelihood may lead to a biased estimator.
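For reference, the identity that underlies Chib's (1995) estimate and the averaging estimate of its denominator can be written in their standard textbook form as follows. This is offered as a reading aid with notation assumed from the surrounding text (x denotes the latent states and M the number of MCMC draws), not as a transcription of the discussants' displays.

```latex
\[
p(y) = \frac{p(y \mid \theta)\, p(\theta)}{p(\theta \mid y)},
\qquad
\hat{p}(\theta \mid y) = \frac{1}{M} \sum_{i=1}^{M} p(\theta \mid x_i, y).
\]
```

The first identity holds for every θ with positive posterior density, which is why the choice of the evaluation point is free in principle but matters for the variance of the resulting estimate.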
We did some experiments to compare the approach described above with integrated nested Laplace approximations (Rue et al., 2009) and nested sampling (Skilling (2006); see also Chopin and Robert (2010)), using the stochastic volatility example of Rue et al. (2009). Unfortunately, our PMCMC program requires more than 1 day to complete (for a number N of particles and a number M of iterations that are sufficient for reasonable performance), so we cannot include the results in this discussion. A likely explanation is that the cost of PMCMC sampling is at least O(T^{2}), where T is the sample size (T=945 in this example), since, according to the authors, good performance requires that N=O(T), but our implementation may be suboptimal as well.
Interestingly, nested sampling performs reasonably well on this example (reasonable error obtained in 1 h), and, as reported by Rue et al. (2009), the integrated nested Laplace approximation is fast (1 s) and very accurate, but more work is required for a more meaningful comparison.
Michael Johannes (Columbia University, New York) and Nick Polson and Seung M.Yae (University of Chicago)
We would like to comment on a few aspects of the paper. First, for several years, macroeconomics has used a related algorithm (e.g. Fernández-Villaverde and Rubio-Ramírez (2005)) to estimate dynamic general equilibrium models by using a random-walk Metropolis algorithm proposing new parameter values and accepting or rejecting the draws via marginal likelihoods from sequential Monte Carlo (SMC) sampling. This now quite large literature encountered a serious problem in models with more than a few parameters. In these cases, Metropolis algorithms often converge very slowly, and the combination of slow convergence and repeated iteration between SMC and Markov chain Monte Carlo (MCMC) sampling often requires that algorithms run for days, even when coded efficiently in C++. This experience provides a cautionary note to those using these algorithms in high dimensions. In the authors’ defence, these problems are extremely difficult, and the computationally expensive SMC–MCMC approach may be the only feasible strategy.
Second, the authors consider learning σ and σ_{x} in the nonlinear state space model
assuming α=2, β_{1}=0.5, β_{2}=25 and β_{3}=8. It is disappointing that all these parameters are constrained, as a more realistic test of their algorithm would estimate all the unknown parameters.
The authors compare with an MCMC algorithm using single-state updating. We suggest two more realistic competing algorithms. The first assumes α=2 and
 (a) generates a full vector of latent states, x_{1:T}, by using SMC sampling, accepting or rejecting these draws via Metropolis updates and then
 (b) updates the parameters by using p(θ|x_{1:T},y_{1:T}).
This algorithm exploits the fact that the conditional posterior, p(θ|x_{1:T},y_{1:T}), is a known distribution and simple to sample. This algorithm would probably perform better than the current algorithm combining SMC with random-walk Metropolis sampling.
The second
 (a) solely relies on SMC methods,
 (b) uses slice variables to induce sufficient statistics,
 (c) estimates all the parameters (α,β_{1},β_{2},β_{3},σ,σ_{x}) for similar sized data sets and
 (d) solves the sequential problem by approximating p(θ,x_{t}|y_{1:t}) for each time t.
Fig. 17 provides an example of the output.
The algorithm of Johannes et al. (2007) relies on similar SMC methods but is computationally simpler. To obtain a sense of the computational demands, the current paper uses 60000 MCMC iterations and 5000 particles whereas we obtain accurate parameter estimates (verified via simulation studies), using one run of 300000 particles, using roughly 1/1000th of the computational cost. We would be interested in a direct horse race of these competing methods in this specification.
Finally, the approach in the paper can have attractive convergence properties under various assumptions, including assumption 4. We would like to ask the authors whether assumption 4 is satisfied in the examples that are considered in the paper. In particular, does it hold for various signal-to-noise ratio combinations of σ and σ_{x}?
Adam M. Johansen and John A. D. Aston (University of Warwick, Coventry)
We congratulate the authors on an exciting paper which combines the novel idea of incorporating sequential Monte Carlo proposals within Markov chain Monte Carlo samplers with a synthesis of ideas from disparate areas. It is clear that the paper is a substantial advance in Monte Carlo methodology and is of substantially greater value than a collection of its constituent parts.
However, one constituent which has received little attention in the literature seems to us to be interesting: although it is computationally rather expensive to do so, equations (27)–(28) suggest that it is possible to obtain samples which characterize the path space distribution well, at least in the case of mixing dynamic systems when we are interested in marginal distributions of bounded dimension, albeit at the cost of running an independent sequential Monte Carlo algorithm for every sample. In practice some reuse of samples is likely to be possible.
Typically, such a strategy might be dismissed (perhaps correctly) as being of prohibitive computational cost. However, in an era in which Monte Carlo algorithms whose time cost scales superlinearly in the number of samples employed are common, might there be other situations in which this strategy finds a role?
A rather naive approach to smoothing, for example, would be to employ an ensemble of independent particle filters and to sample one trajectory from each independent filter. For simplicity, consider employing a bootstrap filter in the univariate case, with and . To assess performance, consider the estimated covariance of X_{n:n+1} (and the determinant of that covariance, to provide a compact summary). Fig. 18 shows covariance estimates obtained by using a 100-filter ensemble, each of 100 particles, a single particle filter of equal cost (using 10000 particles) and the exact solution (Kalman smoothing). This illustrates the degeneracy and consequent failure to represent sample path variability of a single filter adequately and contrasts it with the estimate obtained by using an ensemble of filters. Each of the Monte Carlo algorithms required approximately 30 s over 1000 time steps using SMCTC (Johansen, 2009) and a 1.33 GHz Intel laptop.
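The ensemble strategy can be sketched as follows. This is a minimal illustration under stated assumptions rather than the discussants' SMCTC implementation: the model is taken to be a stationary AR(1) state observed in Gaussian noise, with hypothetical parameters `phi`, `sx` and `sy`, and all function names are illustrative.

```python
import numpy as np

def simulate_data(T, rng, phi=0.9, sx=1.0, sy=1.0):
    """Simulate observations from the assumed linear Gaussian model."""
    x = np.empty(T)
    x[0] = rng.normal(0.0, sx / np.sqrt(1.0 - phi**2))
    for t in range(1, T):
        x[t] = phi * x[t - 1] + sx * rng.normal()
    return x + sy * rng.normal(size=T)

def bootstrap_filter_paths(y, N, rng, phi=0.9, sx=1.0, sy=1.0):
    """Bootstrap filter with multinomial resampling of whole paths.

    Returns an (N, T) array of equally weighted trajectories.
    """
    T = len(y)
    paths = np.zeros((N, T))
    for t in range(T):
        if t == 0:
            x = rng.normal(0.0, sx / np.sqrt(1.0 - phi**2), size=N)
        else:
            x = phi * paths[:, t - 1] + sx * rng.normal(size=N)
        paths[:, t] = x
        logw = -0.5 * ((y[t] - x) / sy) ** 2
        w = np.exp(logw - logw.max())
        # resample complete paths so that ancestries are carried along
        paths = paths[rng.choice(N, size=N, p=w / w.sum())]
    return paths

def ensemble_of_filters(y, M, N, rng, **model):
    """Run M independent filters and keep one trajectory from each."""
    out = np.empty((M, len(y)))
    for i in range(M):
        paths = bootstrap_filter_paths(y, N, rng, **model)
        out[i] = paths[rng.integers(N)]
    return out
```

With `ens = ensemble_of_filters(y, M, N, rng)`, the covariance of X_{n:n+1} can be estimated by `np.cov(ens[:, n:n+2], rowvar=False)`. In a single long filter run the early columns of the returned paths typically contain very few distinct values, which is the path degeneracy that the ensemble is meant to avoid.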
Might it be possible to employ such a strategy to provide simple-to-implement algorithms with better path space performance? Can the error be controlled uniformly for bounded L?
Anthony Lee and Chris Holmes (University of Oxford)
We congratulate the authors on a major contribution to practical statistical inference in a variety of models. An important application is approximating the posterior distribution of static parameters in state space models. The particle marginal Metropolis–Hastings (PMMH) algorithm is perhaps the simplest of the algorithms introduced, relying only on the unbiasedness of the marginal likelihood estimator. Denoting by y the observations, z the set of all auxiliary random variables used in the filter and θ the static parameters, the likelihood estimator is a joint density p(y,z|θ) satisfying
 (55)
An interesting feature of the sequential Monte Carlo class of methods is that the choice of auxiliary variables z is flexible. For example, we can perform multinomial resampling in a variety of ways without affecting condition (55). Let x_{1:T} be the latent variables in the state space model. When x_{t} is univariate, sorting the particles before resampling as in Pitt (2002) but without interpolation gives an empirical distribution function for particle indices that is identical to the empirical distribution function for x_{t} itself. We can then construct a Metropolis–Hastings Markov chain targeting p(θ,z|y) by proposing moves of the form (θ,z)→(θ^{′},z) and (θ,z)→(θ,z^{′}). For the first type, this amounts to a use of common random variables so that in the acceptance ratio
the terms p(y,z|θ^{′}) and p(y,z|θ) are positively correlated. We can therefore expect the resulting Markov transition kernel to be closer to that of the true marginal Metropolis–Hastings algorithm on θ, suggesting superior performance over standard PMMH algorithms.
We ran both a PMMH algorithm and this correlated variant CPMMH on a linear Gaussian state space model with univariate latent variables x_{t} and a single unknown parameter. We used an improper prior with p(θ)∝1 and a random-walk proposal. Since we can compute p(y|θ) for this model, we can also compute the acceptance probabilities of the marginal algorithm and analyse how both algorithms move compared with the marginal algorithm. In a 50000-step chain, the PMMH algorithm differed from the true marginal algorithm 13065 times whereas CPMMH differed only 2333 times in terms of accepting or rejecting a move. Fig. 19 shows the differences between the acceptance probabilities for both the PMMH and the CPMMH algorithms against the marginal algorithm. Although CPMMH does not extend trivially to the multivariate case, tree-based resampling schemes as in Lee (2008) that generalize the methodology in Pitt (2002) give similar improvements.
Finally, many people at the meeting commented on the heavy computational burden of particle Markov chain Monte Carlo methods. However, the emerging use of parallel architectures such as graphics cards can alleviate this burden via parallelization of the particle filtering algorithm itself, as in Lee et al. (2009).
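The effect of holding the auxiliary variables z fixed while θ changes can be illustrated with a small sketch. It assumes a hypothetical AR(1)-plus-noise model with unit variances in which θ is the autoregressive coefficient; the discussants' own model, proposal and implementation are not specified here. The auxiliary variables are represented by a seed, and particles are sorted before multinomial resampling in the spirit of Pitt (2002) without interpolation.

```python
import math
import random

def log_lik_estimate(theta, ys, n, seed):
    """Bootstrap-filter estimate of log p(y | theta) for the assumed model
    x_1 ~ N(0,1), x_t = theta * x_{t-1} + N(0,1), y_t = x_t + N(0,1).
    All auxiliary variables z are regenerated from `seed`, so two calls with
    the same seed but different theta share the same z."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_lik = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            xs = [theta * x + rng.gauss(0.0, 1.0) for x in xs]
        ws = [math.exp(-0.5 * (y - x) ** 2) for x in xs]
        log_lik += math.log(sum(ws) / n) - 0.5 * math.log(2.0 * math.pi)
        # Sort the particles, then resample multinomially by inverting the
        # weighted empirical distribution function at sorted uniforms.
        pairs = sorted(zip(xs, ws))
        total = sum(ws)
        us = sorted(rng.random() for _ in range(n))
        new_xs, cum, j = [], pairs[0][1] / total, 0
        for u in us:
            while u > cum and j < n - 1:
                j += 1
                cum += pairs[j][1] / total
            new_xs.append(pairs[j][0])
        xs = new_xs
    return log_lik

# Simulate data from the assumed model with theta = 0.8.
rng = random.Random(0)
x, ys = rng.gauss(0.0, 1.0), []
for t in range(50):
    if t > 0:
        x = 0.8 * x + rng.gauss(0.0, 1.0)
    ys.append(x + rng.gauss(0.0, 1.0))

seeds = range(20)
# Same z for both parameter values (common random variables).
common = [abs(log_lik_estimate(0.805, ys, 100, s)
              - log_lik_estimate(0.80, ys, 100, s)) for s in seeds]
# Fresh z for the proposed parameter value (standard PMMH-style estimate).
indep = [abs(log_lik_estimate(0.805, ys, 100, s + 1000)
             - log_lik_estimate(0.80, ys, 100, s)) for s in seeds]
mean_common = sum(common) / len(common)
mean_indep = sum(indep) / len(indep)
print(mean_common, mean_indep)
```

With common z the difference between the two log-likelihood estimates is dominated by the change in θ instead of by Monte Carlo noise, which is the mechanism behind the positive correlation described above.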
Simon Maskell (QinetiQ, Malvern)
This paper provides, to the sequential Monte Carlo (SMC) sampling specialist, a mechanism to perform parameter estimation by using Markov chain Monte Carlo (MCMC) sampling. To the MCMC sampling specialist, this paper offers a route to efficient proposals in very high dimensional problems. Both contributions are significant in isolation. To achieve the two simultaneously is a significant achievement.
It is natural to ask what such proposal distributions would offer (apart from more complex variants of the MCMC acceptance ratios). SMC samplers are recursive algorithms, i.e. x_{t} is sampled conditionally on x_{1:t−1} for each t. As touched on in the paper, the statistical efficiency of SMC algorithms is coupled to their ability to generate samples of x_{t} from a proposal distribution that is a good approximation to the target density. The optimal proposal distribution is only optimal in terms of its ability to exploit previous samples (and data) to generate the current sample x_{t}: the notion of optimality is intricately tied to the recursive application of an SMC sampler. In the context of particle MCMC methods, SMC samplers are still applied recursively, but we also have a previous sample of the entire trajectory, x_{1:T}. This trajectory encodes information about ‘future’ target distributions and samples that will turn out to be efficient in hindsight. It therefore seems plausible that a different notion of an optimal proposal distribution is needed for particle MCMC sampling and that this should include dependence on the previous sample of the trajectory.
This paper seems likely to seed a unified research direction that facilitates a combined effort between practitioners and researchers who are associated with both MCMC and SMC methods. Such extensions can therefore be expected.
Lawrence Murray, Emlyn Jones, John Parslow, Eddy Campbell and Nugzar Margvelashvili (Commonwealth Scientific and Industrial Research Organisation, Canberra)
We thank the authors for their work on what we agree is a very compelling approach to parameter estimation in state space and other models. We have been investigating similar ideas in the context of marine biogeochemistry, with encouraging results for a toy Lotka–Volterra predator–prey model (Jones et al., 2009). Our approach uses random-walk Metropolis–Hastings steps in parameter space, with a particle filter employed to calculate likelihoods for the Metropolis–Hastings acceptance term. It is essentially an instance of the method described here as particle marginal Metropolis–Hastings (PMMH) sampling. The approach does seem computationally expensive, and we observe some potential consistency problems in the use of a particle filter to estimate likelihoods.
Biogeochemical models characterize the interaction of phytoplankton and zooplankton species and the conserved cycle of nutrients such as nitrogen, carbon and oxygen through an ecosystem. They are generally described by using ordinary differential equations, with our own formulation introducing stochasticity via interaction terms at discrete time intervals. They are one specific case of a wide variety of physical–statistical models obtained via the introduction of stochasticity to existing deterministic models in a Bayesian hierarchical framework.
These models fall into a broad class where the transition density p(x_{n}|x_{n−1}) is not available in closed form. This precludes use of some of the advanced proposal and resampling techniques that are mentioned by the authors, owing to the need to cancel the intractable transition density in the numerator and denominator in expression (7). In particular, the optimal proposal p(x_{n}|y_{n},x_{n−1}) is not available. We find the iteration of a particle filter in the PMMH framework for these models to be very expensive computationally, mostly because of numerical integration of the ordinary differential equations, with the limited availability of these advanced techniques confounding the matter further.
We find it necessary to use many more samples for PMMH sampling than we would with the same particle filter used only for state tracking, to deliver consistent likelihood estimates. A particle filter may momentarily fail to track the state adequately at a particular time and then recover (e.g. in a form of mild degeneracy where the effective sample size is low), but the likelihood contribution at that time will be unreliable. In the worst case, iterating the particle filter with the same parameter configuration but different sample sets from the prior p(x_{1}|y_{1}) can produce wildly different likelihood estimates in the presence of such anomalies.
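One simple way to locate such anomalies is to monitor the effective sample size of the importance weights at each time step. The following is a minimal sketch; the 10% threshold and the function names are illustrative assumptions and are not taken from the discussants' implementation.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2 for unnormalized, non-negative weights."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return total * total / sum(w * w for w in weights)

def flag_unreliable_steps(weight_history, min_fraction=0.1):
    """Return the time indices whose ESS falls below min_fraction * N.

    The default threshold is an illustrative assumption only.
    """
    flagged = []
    for t, ws in enumerate(weight_history):
        if effective_sample_size(ws) < min_fraction * len(ws):
            flagged.append(t)
    return flagged

# Equal weights give ESS = N; a single dominant weight gives ESS = 1.
history = [[1.0] * 20, [1.0] + [0.0] * 19]
print(flag_unreliable_steps(history))
```

Time steps flagged in this way are those where the incremental likelihood estimate rests on very few particles and is therefore likely to vary strongly between runs.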
G. W. Peters (University of New South Wales, Sydney) and J. Cornebise (Statistical and Applied Mathematical Sciences Institute, Durham)
This paper will clearly have a significant influence on scientific disciplines with a strong interface with computational statistics and nonlinear state space models. Our comments are based on practical experience with particle Markov chain Monte Carlo (MCMC) implementation in latent process multifactor stochastic differential equation models for commodities (Peters et al., 2010), wireless communications (Nevat et al., 2010) and population dynamics (Hayes et al., 2010), using Rao–Blackwellized particle filters (Doucet et al., 2000) and adaptive MCMC methods (Roberts and Rosenthal, 2009).
 (a)
From our implementations, ideal use cases consist of highly nonlinear dynamic equations for a small dimension d_{x} of the state space, large dimension d of the static parameter and potentially large length T of the time series. In our cases d_{x} was 2 or 3, d up to 20 and T between 100 and 400.
 (b)
In particle Metropolis–Hastings (PMH) sampling, non-adaptive MCMC proposals for θ (e.g. tuned according to presimulation chains or burn-in iterations) would be costly for large T and require that N is kept fixed over the whole run of the Markov chain. Adaptive MCMC proposals such as the adaptive Metropolis sampler (Roberts and Rosenthal, 2009) avoid such issues and proved particularly relevant for large d and T.
 (c)
For intractable joint likelihood p(y_{1:T}|x_{1:T}), we could design a sequential Monte Carlo (SMC)–approximate Bayesian computation algorithm (see for example Peters et al. (2010) and Ratmann (2010), chapter 1) for a fixed approximate Bayesian computation tolerance ε, using the approximations or with ρ a distance on the observation space and simulated observations. Additional degeneracy on the path space induced by the approximate Bayesian computation approximation should be controlled, e.g. with partial rejection control (Peters et al., 2008).
 (d)
Particle Gibbs (PG) sampling could potentially stay frozen on a state x_{1:T}(i). Consider a state space model with state transition function almost linear in x_{n} for some range of θ, from which y_{1:T} is considered to result, and strongly nonlinear elsewhere. If the PG sampler draws θ(i) in those regions of strong nonlinearity, the particle tree is likely to coalesce on the trajectory preserved by the conditional SMC sampler, leaving it with a high importance weight, maintaining (θ(i+1),x_{1:T}(i+1))=(θ(i),x_{1:T}(i)) over several iterations. Using a PMH-within-PG algorithm would help to escape this region, especially using partial rejection control and adaptive SMC kernels, outlined in another comment, to fight the degeneracy of the filter and the high variance of .
Ralph S. Silva and Robert Kohn (University of New South Wales, Sydney), Paolo Giordani (Sveriges Riksbank) and Michael K. Pitt (University of Warwick, Coventry)
We congratulate the authors on their important paper which opens the way for a unified method for Bayesian inference using the particle filter and should allow for inference for models which are difficult to estimate by using other methods. To establish notation and to summarize the result that is relevant to our discussion, let p(y|θ) be the correct but intractable likelihood with f(y|θ,u) its approximation by the particle filter, where u is a set of latent variables. By Del Moral (2004), f(y|θ,u) is an unbiased estimator of p(y|θ), i.e., with the expectation taken over u,
E_{u}{f(y|θ,u)}=p(y|θ).
The authors show that this implies that f(θ|y)=p(θ|y) so a Markov chain Monte Carlo simulation based on the posterior f(θ,u|y) gives iterates of θ from the correct marginal posterior p(θ|y). Our own research reported in Silva et al. (2009) applies the fundamental insight in the current paper to study the behaviour of adaptive sampling schemes when the particle filter is used to obtain f(y|θ,u) for state space models. The two adaptive samplers that we consider are a three-component version of the adaptive random-walk proposal of Roberts and Rosenthal (2009) and the adaptive independent Metropolis–Hastings proposal of Giordani and Kohn (2008). Combining the particle filter with adaptive sampling is attractive because f(y|θ,u) is a stochastic non-smooth function of θ. Our results suggest the following.
 (a)
It is feasible to use adaptive sampling for particle Markov chain Monte Carlo algorithms and, in particular, for the particle marginal Metropolis–Hastings algorithm.
 (b)
It is computationally efficient to obtain a good adaptive proposal because the cost of constructing such a proposal is negligible compared with the cost of evaluating f(y|θ,u) by the particle filter.
 (c)
A well-constructed proposal can be much more efficient than an adaptive random-walk proposal.
 (d)
Independent Metropolis–Hastings proposals are attractive because they can be easily run in parallel, thus significantly reducing the computation time of particle-based Bayesian inference.
 (e)
When the particle filter is used, the marginal likelihood of any model is obtained in an efficient and unbiased manner, making model comparison straightforward.
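The statement that a chain run on the joint posterior of (θ,u) yields draws of θ from the correct marginal posterior can be checked on a toy example. The sketch below is not the particle filter setting of the discussion: it assumes a hypothetical model with a single observation y, a latent x|θ ∼ N(θ,1) and y|x ∼ N(x,1), so that p(y|θ) is the N(θ,2) density, and it replaces the particle filter by a simple unbiased Monte Carlo average. With a flat prior the exact posterior is N(y,2), which gives a reference for the output.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def lik_estimate(theta, y, n, rng):
    """Unbiased Monte Carlo estimate of p(y | theta) = N(y; theta, 2):
    the average of N(y; x_i, 1) over x_i ~ N(theta, 1)."""
    return sum(normal_pdf(y, rng.gauss(theta, 1.0), 1.0) for _ in range(n)) / n

def pseudo_marginal_mh(y, n, iters, step, rng):
    """Random-walk Metropolis-Hastings using likelihood estimates in place
    of the exact likelihood; the current estimate is stored and reused."""
    theta = y
    f = lik_estimate(theta, y, n, rng)
    chain = []
    for _ in range(iters):
        prop = theta + rng.gauss(0.0, step)
        f_prop = lik_estimate(prop, y, n, rng)
        # Flat prior and symmetric proposal: accept with probability
        # min(1, f_prop / f).
        if rng.random() * f < f_prop:
            theta, f = prop, f_prop
        chain.append(theta)
    return chain

rng = random.Random(42)
chain = pseudo_marginal_mh(y=1.0, n=10, iters=20000, step=2.0, rng=rng)
mean = sum(chain) / len(chain)
var = sum((t - mean) ** 2 for t in chain) / len(chain)
print(mean, var)  # to be compared with the exact posterior N(1, 2)
```

The key design point is that the estimate attached to the current state is kept until a proposal is accepted; re-estimating it at every iteration would change the invariant distribution.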
Miika Toivanen and Jouko Lampinen (Helsinki University of Technology, Espoo)
We congratulate the authors for introducing the idea of combining ‘ordinary’ Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methodologies in a novel way, namely using SMC algorithms for designing proposal distributions for MCMC algorithms. We wish to share briefly our own experience on using particle Monte Carlo methods on a static problem, related to computer vision.
Consider having a few dozen feature points and a posterior distribution of their locations in a test image. Owing to the combinatorial explosion, approximate methods are needed to compute the integrals that involve the posterior distribution. The multimodality of the posterior distribution complicates the approximation problem. Although MCMC methods can be efficient in exploring a single mode, the probability for them to switch a mode during the sampling is low, especially if the modes are far apart. Although some improvements to overcome this disadvantage exist, the population Monte Carlo (PMC) scheme offers a much more natural approach.
PMC techniques are based on the idea of representing the posterior with a weighted set of particles. Each particle can be considered as a hypothesis about the correct location of the feature set and the weight reveals the goodness of the hypothesis. The particles are sampled from proposal distributions, which are allowed to differ between the particles and iterations. Hence, heuristics can safely be incorporated to guide the sampler towards the modes of the posterior, without jeopardizing theoretical convergence. In our implementation, the proposals are Gaussian distributions, which have the previous estimate as mean value and whose variance decreases for particles with high posterior probability. Owing to the resampling, the weakest hypotheses die, and the resulting particle set often gives a good representation of the posterior distribution (Toivanen and Lampinen, 2009a,b).
SMC methods can also be applied to sample the posterior, by updating the parameter vector incrementally (Toivanen and Lampinen, 2009c; Tamminen and Lampinen, 2006). The previously sampled components guide the sampler via the conditional prior distribution and the number of distinct modes decreases as the parameter vector expands. However, because the resampling is not based on the whole parameter vector, unlike in PMC methods, the method is prone to lead to a particle set representing a fallacious minor mode which in a marginal posterior is stronger than the main mode of the full posterior. Thus, it might be interesting to test whether PMC, instead of SMC, methods could be combined with MCMC methods in a fashion suggested by the authors, and whether it would improve the performance in these kinds of problem.
Jonghyun Yun and Yuguo Chen (University of Illinois at Urbana–Champaign)
We congratulate the authors on successfully combining two popular sampling tools, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) methods. We discuss two specific implementation issues of particle MCMC (PMCMC) algorithms.
In PMCMC sampling, proposing a single sample at each iteration requires N particles. That means running PMCMC algorithms for L iterations needs NL particles. If we can afford to generate only a fixed number N^{*} of particles, a practical question is how to balance between N and L under the constraint that NL=N^{*}. We did a simulation study on model (14)–(15) with known parameters and . Let N^{*}=1000000. We simulated 100 sequences of observations y_{1:300} from the model. For each sequence, four particle independent Metropolis–Hastings (PIMH) samplers with different combinations of N and L were applied to estimate the states x_{1:300}. The standard SMC method in Section 2.2.1 with 1000000 particles is also included in the comparison. The performance criterion is the root-mean-squared error RMSE between the true x_{i} and the estimates:
RMSE = {(1/300) Σ_{i=1}^{300} (x̂_{i}−x_{i})^{2}}^{1/2}, where x̂_{i} denotes the estimate of x_{i}.
The average RMSE and acceptance rate from 100 simulations are reported in Fig. 20. According to Fig. 20, PIMH sampling with a small N could perform worse than standard SMC sampling. Part of the reason may be the low acceptance rate. Increasing N seems to improve the acceptance rate and the performance, even though L decreases correspondingly, which may affect the convergence of the Markov chain. For each combination of N and L, the acceptance rate becomes lower as the dimension T of the state x_{1:T} grows.
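The two quantities that drive this comparison, the error criterion and the split of a fixed particle budget into N and L, can be written down in a few lines. The candidate values of N below are hypothetical; the four settings that were actually used are reported in Fig. 20 and are not reproduced here.

```python
import math

def rmse(truth, estimates):
    """Root-mean-squared error between two equal-length sequences."""
    if len(truth) != len(estimates):
        raise ValueError("sequences must have equal length")
    return math.sqrt(
        sum((x - e) ** 2 for x, e in zip(truth, estimates)) / len(truth)
    )

def budget_splits(total, particle_counts):
    """Return the (N, L) pairs with N * L == total, one for each candidate
    N that divides the total budget exactly."""
    return [(n, total // n) for n in particle_counts if total % n == 0]

# Hypothetical candidate particle numbers under a budget of 1000000.
print(budget_splits(1_000_000, [100, 1000, 3000, 10000]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Candidates that do not divide the budget are dropped so that every reported pair uses exactly N^{*} particles in total.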
Another practical issue is about reusing all particles. Two estimates which use all particles are suggested in Section 4.6. We compared these two estimates with the original PIMH sampler on the same model and the same four settings as before. Denote the two estimates in equations (38) and (39) as PIMH-Reuse1 and PIMH-Reuse2 respectively. We also propose a new estimate, which is denoted by PIMH-Reuse3 (see theorem 6 for the notation):
 (56)
This estimate is based on the N weighted particles proposed at each iteration before the accept–reject step. PIMH-Reuse3 can be used when there are no unknown parameters in the model, and its convergence can be proved. The comparison of the average RMSE in Fig. 21 shows that PIMH-Reuse1 has almost the same performance as PIMH-Reuse2, and both outperform the original PIMH sampler. The relationship between PIMH-Reuse3 and the other methods is not so clear.
The authors replied later, in writing, as follows.
We thank the discussants for their very interesting comments.
Perhaps the most important feedback that we have received is the confirmation by several discussants (Belmonte and Papaspiliopoulos, Bhadra, Flury and Shephard, Cappé, Robert, Jacob and Chopin, and Golightly and Wilkinson) that the approach is not only conceptually simple but also more importantly that it is relatively easy to implement in practice and able to produce satisfactory results. We were particularly interested in the reported simulations and user experience of Golightly and Wilkinson. They indicate that particle Markov chain Monte Carlo (PMCMC) methods can lead to performance that is similar to that obtained with a carefully handcrafted (and possibly complex) algorithm and point to the comparatively little effort that is required by the user in terms of design and implementation. Naturally, except in situations where such implementational simplicity cannot be avoided, this ease comes at the expense of ‘computational brutality’, which might currently deter or prevent some users from using the approach (Chopin, and Flury and Shephard). However, as pointed out by Lee and Holmes, and Everitt, recent advances in the use of cheap graphical processing units and other multicore computing machines (such as game consoles) for scientific computing offer good hope that ever more complex problems can be routinely attacked with PMCMC methods. We naturally realize that the notion of ‘difficult problems’ is not static and do not believe in black boxes and silver bullets: ultimately very difficult problems at the frontier of what current technology can achieve will always require more thinking by the user. In relation to this we are looking forward to seeing applications of PMCMC sampling in the context of approximate Bayesian computations (Cornebise and Peters, and Peters and Cornebise) and general graphical models (Everitt).
Correctness and sequential Monte Carlo implementations
For brevity and to ensure simplicity of exposition the algorithms that were presented throughout the paper focus on some of the simplest implementations, and our discussion of general validity was confined to Section 2.5 and the beginning of Section 4. Not surprisingly quite a few comments focus on this aspect.
Valid sequential Monte Carlo implementations
Although the design of efficient MCMC algorithms can be facilitated by the use of sequential Monte Carlo (SMC) sampling as proposal mechanisms, the performance of the latter will naturally affect the performance of the former and one might wonder what standard SMC improvement strategies are legitimate? One can complement and summarize the rules of Section 2.5 and the beginning of Section 4 as follows. In broad terms PMCMC algorithms are valid
It is worth mentioning here that the exchangeability property (assumption 2) is not needed for the PMMH algorithm when only inference on θ is needed. Since writing the paper we have been working on establishing that even more general resampling schemes lead to valid PMCMC algorithms. Of particular interest are adaptive resampling schemes, which usually reduce the number of times that resampling is needed. It has been empirically observed in the literature dedicated to SMC algorithms that such schemes might be beneficial, and we expect this to carry on to the PMCMC framework (see the discussion below on the influence of the variability of (or ) on the performance of PMCMC algorithms as well as the discussion of Fearnhead concerning the particle Gibbs (PG) sampler). It is also possible to adapt the number N of particles within the SMC step, which might be for example of interest to moderate the effect of outliers discussed by Murray, Jones, Parslow, Campbell and Margvelashvili.
As pointed out by several discussants (Girolami, and Creal and Koopman) the design of efficient proposal distributions for the importance sampling stage of the SMC algorithm might be difficult in situations where the dimension of the state space is large. It can be shown on simple examples that such a penalty will typically be exponential in the dimension (consider for example Cappé’s example). However, it is possible in this case to introduce subsequences of intermediate distributions bridging for example π_{n} and π_{n+1}, e.g. Del Moral et al. (2006) and Godsill and Clapp (2001). This offers the possibility of employing well-known standard MCMC-type strategies that are well suited to high dimensional set-ups to update sub-blocks of the state vector between two particular distributions π_{n} and π_{n+1}. An alternative strategy consists of updating the state components one at a time by using conditional SMC updates.
General proposals for particle Metropolis–Hastings algorithms
Whereas the PG sampler bypasses the need for the design of a proposal distribution for θ the particle marginal Metropolis–Hastings (PMMH) algorithm requires such a design, which might not always be obvious as pointed out by Girolami, and Silva, Kohn, Giordani and Pitt.
As pointed out by Maskell, and Robert, Jacob, Chopin and Rue the degree of freedom that is offered by the choice of proposal of the PMMH step, or indeed a particle independent Metropolis–Hastings (PIMH) step, might turn out to be an opportunity which needs to be further explored. Dependence of proposals on previous particle populations is definitely an option (Everitt, and Robert, Jacob, Chopin and Rue) and might be beneficial to calibrate proposal distributions, but also to reduce the variability of acceptance probabilities. Note, however, our remark on the validity of recycling strategies in such a scenario at the very end of Appendix B.5. The work of Lee and Holmes offers an alternative variance reduction strategy of the acceptance probability for some situations.
Another natural solution consists of using adaptive MCMC algorithms (Andrieu and Thoms, 2008). Silva, Kohn, Giordani and Pitt report some results in this direction and in particular report better performance of the adaptive independent MH algorithms compared with that of a particular implementation of the AM algorithm (Haario et al., 2001; Roberts and Rosenthal, 2007). A further interesting comparison might involve robust versions of the AM algorithm described in Andrieu and Thoms (2008). Finally it is worth mentioning the complementary and competitive method of Ionides et al. (2006) to compute maximum likelihood estimates of the static parameter θ, which could be used as a useful stepping stone towards Bayesian inference in very difficult situations.
The smoothing approaches that were described by Whiteley (and hinted at by Godsill) and Johansen and Aston are very promising developments. The first approach is in the vein of existing ‘particle smoothing’ approaches which allow one to exploit the information that is gathered by all the particles generated by a single SMC procedure within the PMCMC framework. Its interest is intuitively evident in the case of the PG sampler in the light of Fearnhead's discussion, but we expect such a smoothing procedure also to have a positive effect beyond this special case. This might for example improve the quality of samples {X_{1:P}(i)} that are produced by the PMMH algorithm and suggests further improvements to our suggested recycling strategies. The second approach of Johansen and Aston, which was suggested in a non-PMCMC framework, consists of replacing a single SMC sampler using KN particles with K independent SMC samplers using N particles, which amounts to effectively replacing with
and using a stratified sampling strategy to sample K paths. As illustrated by Johansen and Aston, reducing particle interaction might be beneficial when smoothing is of interest. Adaptation of this idea to the PMCMC framework seems possible and raises numerous interesting theoretical and practical questions. This strategy, as well as that described earlier, might address the issue that was raised by Fearnhead concerning the particle depletion phenomenon for initial values.
Performance and the choice of N: from theory to practice
The choice of the number N of particles is a difficult, but central, issue which is paramount to the good performance of PMCMC algorithms. This question is made even more difficult when considering the optimum trade-off between N and L for fixed computational resources, and a credible and generally valid answer to this question is beyond our current understanding.
Dependence on N of the performance of the PMCMC algorithms that were considered in the paper takes two different forms, at first apparently unrelated. It is first important to recall the fact that current PMCMCs can be thought of as being ‘exact approximations’ of idealized algorithms, which might or might not turn out to be ideal. It is indeed possible to construct examples, which are not unrelated to reality, for which the idealized algorithm is slower than its PMCMC version, suggesting that increasing N might not improve performance indefinitely, if at all. This partly answers Chopin's questions related to Rao–Blackwellization and the N versus N+1 issue. Some understanding of the idealized algorithm is therefore necessary, and we shall assume below that this algorithm is a worthy approximation.
In relation to this discussion, residual resampling will outperform multinomial resampling (Chopin) when closeness to the marginal algorithm is considered. Closeness to the marginal algorithm, when achieved, also suggests how the proposal distribution of θ in the PMMH should be adjusted: a random-walk Metropolis step should be tuned such that its acceptance probability is of the order of 0.234 etc. Some results illustrating the effect of N on the performance of the MCMC algorithm can be found in Andrieu and Roberts (2009).
The theoretical results of Section 4 do not unfortunately provide us with precise values but with bounds on rates of convergence as a function of both N and P (or T). Although we believe, and agree with Crisan, that such results can be established under weaker assumptions, we doubt that more practical (and sufficiently general) results can be obtained. We hence doubt that we can ever answer Draper's question, which remains largely unanswered even for standard MCMC algorithms. We find it comforting to see that the experiments of Fearnhead, Cappé and Chen indicate that the main conclusion of the theoretical results, i.e. that N should scale linearly with P for ‘ergodic’ models, seems to hold for quite general scenarios. We were puzzled by the extremely positive results obtained by Belmonte and Papaspiliopoulos for the PG sampler. We note that beyond their explanatory power these results suggest, possibly manual, ways of choosing N by monitoring, for example, the evolution of the variance of normalizing constants as a function of N. Naturally such nice ergodicity properties do not hold for numerous situations of interest, such as models for which components of the state evolve in a quasi-deterministic manner. This includes the class of dynamic stochastic equilibrium models; see Fernandez-Villaverde and Rubio-Ramirez (2007) and Flury and Shephard (2010). This lack of ergodicity of the model probably explains the reported slow convergence of the PMMH algorithm in the scenarios that were mentioned by Johannes, Polson and Yae. As acknowledged in Flury and Shephard (2010), any SMC-based method will suffer from this problem and it is expected that N will scale superlinearly with T in such scenarios. Note, however, that, in principle, the PMCMC framework allows for the use of standard off-the-shelf MCMC remedies, e.g. tempering ideas which might alleviate this issue by introducing bridging models with improved ergodicity.
Ideally we would like the choice of N to be ‘automatic’, in particular for the PMMH and PG algorithms. Indeed, as suggested by the theoretical result on the variance, different values of θ might require different values of N to achieve a set precision. Designing such a scheme which preserves π(θ,x_{1:P}) as invariant distribution of the MCMC algorithm proves to be a challenge. However, adaptation within the SMC algorithm can be achieved through look-ahead procedures and by boosting the number of particles locally when necessary. This can help to prevent the problems that were described by Murray, Jones, Parslow, Campbell and Margvelashvili, where a small number of outliers can have a serious effect on the estimate of the normalizing constant or marginal likelihood and hence the PMCMC procedure.
Unbiasedness versus sampling
Several authors (Flury and Shephard, Łatuszyński and Papaspiliopoulos, Roberts, and Silva, Kohn, Giordani and Pitt) stress the unbiasedness of (or ) that is produced by an SMC algorithm as being the basic principle underpinning the validity of the PMMH algorithm, in the spirit of Beaumont (2003), Andrieu and Roberts (2009) and Andrieu et al. (2007). This is indeed one of the two ways in which we came up with the PMMH algorithm initially in the course of working on two separate research projects. The other perspective, favoured in our paper, is that of ‘pseudo-sampling’, which in our view goes beyond unbiasedness (in the spirit of the ‘pseudo-marginal’ approach) and is fertile. Indeed although, in the context of the PMMH algorithm, the pseudo-marginal perspective is appropriate when sampling from π(θ) is all that is needed, it is not sufficient to explain that it is possible to sample from π(θ,x_{1:P}) using the same output from the SMC step. We do not think that the PG sampler, of which the conditional SMC update is the key element, could have emerged without this perspective. It is in fact rather interesting to re-explain what the conditional SMC update achieves in the simple situation where the target distribution is π(x_{1:P}) and P=1. In this situation, the extended target distribution of the paper takes the particularly simple form (we omit the subscript 1 to simplify the notation)
A Gibbs sampler to target this distribution consists, given x^{k}, of sampling according to the two following steps:
 (a)
and
 (b)
,
which by standard arguments leave invariant. Step (a) is a trivial instance of the conditional SMC update whereas step (b) consists of choosing a sample in x^{1:N} according to the empirical distribution
Note the similarity of this update with the standard importance sampling–resampling procedure. The remarkable feature here is that whenever x^{k}∼π then x^{l}∼π as well, owing to the aforementioned invariance property. In other words the conditional SMC update followed by resampling can be thought of as being an MCMC update leaving π invariant. Unbiasedness seems to be a (happy) by-product of the structure of and the proposal distributions used, since it can be easily checked that
for and .
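The two-step update described above for P=1 can be sketched numerically. The following is a minimal illustration under stated assumptions: the target π is taken to be N(1,1) and the importance proposal, here called q, is taken to be N(0,2²); both choices, and the function names, are hypothetical. The conditioned particle is placed in the first slot for convenience.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of the illustrative target pi = N(1, 1)."""
    return -0.5 * (x - 1.0) ** 2

def log_proposal(x):
    """Unnormalized log density of the illustrative proposal q = N(0, 2^2)."""
    return -0.5 * (x / 2.0) ** 2

def conditional_is_step(x_current, n, rng):
    """One update leaving pi invariant.

    Step (a): keep x_current and draw the other n - 1 particles from q.
    Step (b): select one particle with probability proportional to pi / q.
    """
    xs = [x_current] + [rng.gauss(0.0, 2.0) for _ in range(n - 1)]
    log_ws = [log_target(x) - log_proposal(x) for x in xs]
    top = max(log_ws)  # subtract the maximum for numerical stability
    ws = [math.exp(lw - top) for lw in log_ws]
    return rng.choices(xs, weights=ws, k=1)[0]

rng = random.Random(7)
x, draws = 0.0, []
for _ in range(20000):
    x = conditional_is_step(x, n=10, rng=rng)
    draws.append(x)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # to be compared with the target mean 1 and variance 1
```

Only ratios of weights enter step (b), so the normalizing constants of π and q are not needed; the empirical mean and variance of the chain should settle near those of the target.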
The PIMH and PMMH algorithms take advantage of this unbiasedness property but as illustrated above the structure of offers other useful applications. One interesting application is described in the paper: assume that P is so large that the number N of particles to obtain a reliable SMC step is prohibitive, probably at least of the order of P. Then updating large sub-blocks of x_{1:P} is a tempting solution. In the light of the discussion above, the conditional SMC update offers the possibility of targeting π(x_{a:b}|x_{1:P∖a:b}) for 1⩽a⩽b⩽P. Assuming for notational simplicity here that b=P and a>1, if x_{1:P}∼π, then the path is still distributed according to π once the update above has been applied to x_{a:P}. Similarly the conditional SMC algorithm can be used in cases where the dimension, say m, of the state space is large in order to update, for example, π{x_{1:P}(l)|x_{1:P}(1:m∖{l})} for l=1,…,m.
Using sequential Monte Carlo methods with Markov chain Monte Carlo moves
As mentioned in Section 2.2.2 and by Johannes, Polson and Yae, an alternative to PMCMC methods consists of using SMC methods with MCMC moves (Fearnhead, 1998; Gilks and Berzuini, 2001). These methods are not applicable in complex models such as the stochastic volatility model in Section 3.2, but, when applicable, seem at first appealing. They are particularly elegant in scenarios where p(θ | x_{1:T},y_{1:T}) depends on x_{1:T},y_{1:T} only through a set of fixed dimensional statistics and have received significant attention since their introduction and development over a decade ago; see for example Andrieu et al. (1999, 2005), Fearnhead (1998, 2002), Storvik (2002) and Vercauteren et al. (2005). Despite their appeal these well-documented methods are widely acknowledged to be rather delicate to use, owing to the so-called path degeneracy phenomenon and the fact that a good initialization distribution for θ seems paramount because of the lack of ergodicity of the system. In fact such techniques rely implicitly on the approximation of p(x_{1:T} | y_{1:T}) and it can be observed empirically that the algorithm might converge to incorrect values and even sometimes drift away from the correct values as the time index T increases; see for example Andrieu et al. (1999, 2005).
As a consequence we would recommend extreme caution when using such techniques, whose interest might be to provide a quick initial guess for the inference problem at hand. Assessing path degeneracy is certainly essential to evaluate the credibility of the results. A simple proxy to measure degeneracy consists of monitoring the number of distinct particles representing p(x_{k} | y_{1:T}) for various values of k ∈ {1,…,T} (preferably low values, where degeneracy is most severe). If this number is below a reasonable number, say 500, then the particle approximation of p(θ,x_{1:T} | y_{1:T}) is most probably unreliable.
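The degeneracy proxy described above can be computed from the ancestor indices stored during the resampling steps of a particle filter. The sketch below assumes a particular storage convention that is not in the text: `ancestors[t, i]` is taken to be the index at time t of the parent of particle i at time t+1. The function names are hypothetical, and the demonstration resamples with equal weights purely to produce some ancestry to analyse.

```python
import numpy as np


def distinct_ancestors(ancestors):
    """Count the distinct time-k particles that survive in the final paths.

    ancestors has shape (T-1, N); the result has length T, with entry k
    giving the number of distinct particles at time index k.
    """
    n_steps, n_particles = ancestors.shape
    counts = np.empty(n_steps + 1, dtype=int)
    lineage = np.arange(n_particles)  # every final particle is its own ancestor
    counts[n_steps] = n_particles
    for t in range(n_steps - 1, -1, -1):
        lineage = ancestors[t, lineage]  # step each path one generation back
        counts[t] = np.unique(lineage).size
    return counts


def unreliable_times(counts, threshold=500):
    """Time indices at which fewer than `threshold` distinct particles remain."""
    return np.flatnonzero(counts < threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 100, 5000
    # Illustrative ancestry only: multinomial resampling with equal weights.
    anc = rng.integers(0, N, size=(T - 1, N))
    counts = distinct_ancestors(anc)
    print("distinct particles at first, middle, last time:",
          counts[[0, T // 2, T - 1]])
    print("number of time indices below the threshold:",
          unreliable_times(counts, threshold=500).size)
```

Counting distinct indices rather than distinct values assumes a continuous state, so that two particles coincide only when they share an ancestor.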
Johannes, Polson and Yae propose to reconsider the example that was discussed in Section 3.1 and to estimate the parameters (α,σ,β_{1},β_{2},β_{3},σ_{x}). They use Gibbs steps within a bootstrap particle filter to update θ:=(σ,β_{1},β_{2},β_{3},σ_{x}) and a slice sampler to update α. As we do not have the details of their slice sampler, we shall limit ourselves to the estimation of p(θ,x_{1:T} | y_{1:T}) by using the PG sampler. We considered their scenario and simulated T=100 data points by using the parameters that Johannes, Polson and Yae used and we set informative priors approximately similar to theirs by checking the width of their posteriors at time n=0. In this context, Johannes, Polson and Yae used 300 000 particles for the particle filter with Gibbs moves and argue, on the grounds of the simulations that were discussed at the end of Section 3.1, that PMCMC methods would require 1000 times more computation to perform inference in this scenario. We want to reassure them and the readers that this is not so. We used N=5000 particles at the end of Section 3.1 because we addressed a much more difficult scenario where T=500, σ=1 and σ_{x}=√10, i.e. the data set was five times larger and we used the bootstrap filter in a very unfavourable scenario where the likelihood of the observations is peaked and the noise of the dynamics is diffuse. This is in contrast with the scenario that is considered by Johannes, Polson and Yae where σ=√10 and σ_{x}=1, i.e. the likelihood is fairly diffuse and the bootstrap filter and conditional bootstrap filters can provide good proposals for a number of particles as small as 150, which is in agreement with Fig. 3 of the paper. Moreover our PG sampler samples only (N−1)T random variables X_{n} and one set of parameters (σ,β_{1},β_{2},β_{3},σ_{x}) per MCMC iteration whereas the particle filter using Gibbs moves needs to sample NT random variables (X_{n},σ,β_{1},β_{2},β_{3},σ_{x}).
As a result, for the same computational cost as the bootstrap filter with Gibbs moves with N=300 000 particles, we can run the PG sampler for 12 000 iterations with a conditional SMC sampler using 150 particles, which is more than sufficient in this context. The MATLAB program runs in 7 min on a desktop computer. Fig. 22 displays the results. We ran many realizations initialized with this very informative prior and the algorithm consistently returned virtually identical results. Using vague priors for all parameters, we observed that poorly initialized PG samplers can sometimes become trapped in some modes (and we conjecture that this might be so even for the ‘exact’ Gibbs sampler) but also manage to escape, in which case the results are very similar, and stable.
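The iteration count quoted above can be roughly reproduced with a back-of-the-envelope calculation. The accounting below is an assumption rather than the authors' stated method: cost is measured by the number of scalar random draws, resampling and weight evaluations are ignored, and the helper names are hypothetical.

```python
def pf_gibbs_draws(n_particles, n_time, n_params=5):
    # Particle filter with Gibbs moves: every particle draws X_n and the
    # n_params parameters at each of the n_time steps.
    return n_particles * n_time * (1 + n_params)


def pg_draws_per_iteration(n_particles, n_time, n_params=5):
    # PG sampler: (N-1)T state variables plus one set of parameters.
    return (n_particles - 1) * n_time + n_params


# Budget of the particle filter with Gibbs moves: N = 300 000, T = 100.
budget = pf_gibbs_draws(300_000, 100)
# Cost of one PG iteration with 150 particles on the same data.
per_iteration = pg_draws_per_iteration(150, 100)
# Number of PG iterations that fit within the same budget.
iterations = budget // per_iteration
print(budget, per_iteration, iterations)  # 180000000 14905 12076
```

Under this crude accounting the two budgets match at about 12 000 PG iterations, which agrees with the figure given in the text.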
For the same data set and the same informative prior, we ran 10 runs of the bootstrap filter with Gibbs steps for N=100 000 particles. For some parameters, the results were quite similar among runs. However, we also observed significant variability in the estimates as illustrated in Fig. 23. As expected this variance increases with time as a result of the path degeneracy phenomenon. Using vague priors for all parameters, the procedure appeared unable to produce sensible approximations of the posterior.
We conjecture that the variance of the approximation error of p(θ,x_{1:T} | y_{1:T}) increases superlinearly with T for such algorithms.
Some past and future work
As mentioned in Section 5.1 of our paper and as recalled by Godsill and by Johannes, Polson and Yae, a version of the PMMH algorithm based on the bootstrap filter has been previously proposed as a natural heuristic to sample approximately from p(θ | y_{1:T}) (and not p(θ,x_{1:T} | y_{1:T})) by Fernández-Villaverde and Rubio-Ramírez (2007). As discussed earlier, beyond the (non-trivial to us) proof that this approach is in fact exact, we hope that we have demonstrated that the PMMH algorithm is only a particular case of a more general and useful framework which goes far beyond the heuristic. As pointed out in Section 5.1, the PMCMC framework encompasses the MTM algorithm of Liu et al. (2000) and the configurational-bias Monte Carlo update of Siepmann and Frenkel (1992). These connections, which might not be obvious at first sight (Cappé), are detailed in Andrieu et al. (2010), where other interesting developments are also presented.