Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion)


Gareth O. Roberts, Department of Mathematics and Statistics, Fylde College, Lancaster University, Lancaster, LA1 4YF, UK.
E-mail: g.o.roberts@lancaster.ac.uk


Summary.  The objective of the paper is to present a novel methodology for likelihood-based inference for discretely observed diffusions. We propose Monte Carlo methods, which build on recent advances on the exact simulation of diffusions, for performing maximum likelihood and Bayesian estimation.

1. Introduction

Diffusion processes are extensively used for modelling continuous time phenomena in many scientific areas; an incomplete list with some indicative references includes economics (Black and Scholes, 1973; Chan et al., 1992; Cox et al., 1985; Merton, 1971), biology (McAdams and Arkin, 1997), genetics (Kimura and Ohta, 1971; Shiga, 1985), chemistry (Gillespie, 1976, 1977), physics (Obuhov, 1959) and engineering (Pardoux and Pignol, 1984). Their appeal lies in the fact that the model is built by specifying the instantaneous mean and variance of the process through a stochastic differential equation (SDE). Specifically, a diffusion process V is defined as the solution of an SDE of the type


$$dV_s = b(V_s;\theta)\,ds + \sigma(V_s;\theta)\,dB_s, \qquad s \in [0,t], \tag{1}$$

driven by the scalar Brownian motion B. The functionals b(·;θ) and σ(·;θ) are called the drift and the diffusion coefficient respectively and are allowed to depend on some parameters θ ∈ Θ. They are presumed to satisfy the regularity conditions (locally Lipschitz, with a linear growth bound) that guarantee a weakly unique, global solution of equation (1); see chapter 4 of Kloeden and Platen (1995). In this paper we shall consider only one-dimensional diffusions, although multivariate extensions are possible.

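For sufficiently small time increment dt and under certain regularity conditions (see Kloeden and Platen (1995)), $V_{t+dt} - V_t$ is approximately Gaussian with mean and variance given by the so-called Euler (or Euler–Maruyama) approximation

$$V_{t+dt} \approx V_t + b(V_t;\theta)\,dt + \sigma(V_t;\theta)\sqrt{dt}\,Y, \qquad Y \sim N(0,1),$$

though higher order approximations are also available. The exact dynamics of the diffusion process are governed by its transition density

$$p_t(v,w;\theta) = P[V_t \in dw \mid V_0 = v;\theta]/dw, \qquad t>0,\ w,v \in \mathbb{R}. \tag{2}$$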
We shall assume that the process is observed without error at a given collection of time instances,

$$\mathbf{v} = \{V_{t_0}, V_{t_1}, \ldots, V_{t_n}\}, \qquad 0 = t_0 < t_1 < \cdots < t_n;$$

this justifies the notion of a discretely observed diffusion process. The time increments between consecutive observations will be denoted $\Delta t_i = t_i - t_{i-1}$ for $1 \le i \le n$.

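The log-likelihood of the data set v is

$$l(\theta \mid \mathbf{v}) = \sum_{i=1}^{n} l_i(\theta), \qquad l_i(\theta) := \log\{p_{\Delta t_i}(V_{t_{i-1}}, V_{t_i};\theta)\}.$$
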
Unfortunately, in all except a few special cases the transition density of the diffusion process and thus its likelihood are not analytically available. Therefore, it is already well documented that deriving maximum likelihood estimates (MLEs) for discretely observed diffusion processes is a very challenging problem. None-the-less, theoretical properties of such MLEs are now well known in particular under ergodicity assumptions; see for example Kessler (1997) and Gobet (2002).
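To make the additive structure of the log-likelihood concrete, the following minimal Python sketch evaluates the sum of log transition densities when each transition density is replaced by its Gaussian Euler approximation. This is the kind of approximation that the methods of this paper are designed to avoid, so it is shown only for orientation. The function name `euler_loglik` and its argument layout are illustrative assumptions and do not come from the paper.

```python
import math

def euler_loglik(obs, times, b, sigma, theta):
    """Approximate log-likelihood: sum over i of the log Gaussian (Euler)
    density of V_{t_i} given V_{t_{i-1}}. Illustrative only; the Euler
    density is exact only in special cases such as Brownian motion."""
    total = 0.0
    for i in range(1, len(obs)):
        dt = times[i] - times[i - 1]
        mean = obs[i - 1] + b(obs[i - 1], theta) * dt
        var = sigma(obs[i - 1], theta) ** 2 * dt
        resid = obs[i] - mean
        total += -0.5 * math.log(2.0 * math.pi * var) - resid * resid / (2.0 * var)
    return total

# Hypothetical toy data; drift sin(v - theta) with unit diffusion coefficient.
obs = [0.0, 0.4, 1.1, 1.9]
times = [0.0, 1.0, 2.0, 3.0]
print(euler_loglik(obs, times,
                   lambda v, th: math.sin(v - th),
                   lambda v, th: 1.0,
                   math.pi))
```

For coarse time increments such as Δt = 1 the Euler density can be a poor substitute for the true transition density, which is the motivation for the exact estimators developed below.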

Inference for discretely observed diffusions has been pursued in three main directions. One direction considers estimators that are alternative to the MLE. Established methods within this paradigm include techniques that are based on estimating functions (Bibby et al., 2002), indirect inference (Gourieroux et al., 1993) and efficient methods of moments (Gallant and Long, 1997). Another direction involves numerical approximations to the unknown likelihood function. Aït-Sahalia (2002) advocated the use of closed form analytic approximations to the unknown transition density; see Aït-Sahalia (2004) for multidimensional extensions. An alternative strategy has been to estimate an approximation to the likelihood by using Monte Carlo (MC) methods. The approximation is given by Euler-type discretization schemes, and the estimate is obtained by using importance sampling. The strategy was put forward by Pedersen (1995) and Santa-Clara (1995) and was considerably refined by Durham and Gallant (2002). The third direction employs Bayesian imputation methods. The idea is to augment the observed data with values at additional time points so that a satisfactory complete-data likelihood approximation can be written down and to use the Gibbs sampler or alternative Markov chain Monte Carlo (MCMC) schemes; see Roberts and Stramer (2001), Elerian et al. (2001) and Eraker (2001). An excellent review of several methods of inference for discretely observed diffusions is given in Sørensen (2004).

The approach that is introduced in this paper follows a different direction, which exploits recent advances in simulation methodology for diffusions. Exact simulation of diffusion sample paths has become feasible since the introduction of the exact algorithm (EA) in Beskos et al. (2004a). The algorithm is reviewed in Section 2 and relies on a technique called retrospective sampling which was developed originally in Papaspiliopoulos and Roberts (2004). To date there are two versions of the algorithm: EA1, which can be applied to a rather limited class of diffusion processes, which we call $\mathcal{D}_1$, and EA2, which is applicable to the much more general $\mathcal{D}_2$-class; all definitions are given in Section 2. The greater applicability of EA2 over EA1 comes at the cost of higher mathematical sophistication in its derivation, since certain results and techniques from stochastic analysis are required. However, its computer implementation is similar to that of EA1.

In this paper we show how to use the EA to produce a variety of methods that can be used for maximum likelihood and Bayesian inference. We first discuss three unbiased MC estimators of the transition density (2) for a fixed value of θ: the bridge method (Section 4; first proposed in Beskos et al. (2004a)), the acceptance method (AM) (Section 5) and the Poisson estimator (Section 6; first proposed in Wagner (1988a)). The last two estimators are evolved in Sections 5.1 and 6 to yield unbiased estimators of the transition density simultaneously for all θ ∈ Θ. Thus, the simultaneous estimators can readily be used in conjunction with numerical optimization routines to estimate the MLE and other features of the likelihood surface.

We proceed by introducing a Monte Carlo expectation–maximization (MCEM) algorithm in Section 8. The construction of the algorithm crucially depends on whether there are unknown parameters in the diffusion coefficient σ. The simpler case where only drift parameters are to be estimated is treated in Section 8.1, whereas the general case necessitates the path transformations of Roberts and Stramer (2001) and it is handled in Section 8.2.

Section 9 presents an MCMC algorithm which samples from the joint posterior distribution of the parameters and of appropriately chosen latent variables. Unlike currently favoured methods, our algorithm is not based on imputation of diffusion paths but instead on what we call a hierarchical simulation model. In that way, our MCMC method circumvents computing the likelihood function.

Therefore, all our methodology is simulation based, but it has advantages over existing methods of this type for two reasons.

  • (a) The methods are exact in the sense that no discretization error exists, and the MC estimation provides the only source of error in our calculations. Specifically, as the number of MC samples increases, the estimated MLE converges to the true MLE and, as the number of iterations in our MCMC algorithm increases, the samples converge to the true posterior distribution of the parameters.
  • (b) Our methods are computationally efficient. Whereas approximate methods require rather fine discretizations (and consequently a number of imputed values which greatly exceeds the observed data size) to guarantee sufficient accuracy, our methodology suffers from no such restrictions.

A limitation of the methods that are introduced here is that their applicability is generally attached to that of the EA. However, on-going advances on the EA itself (Beskos et al., 2005a) will weaken further the required regularity conditions so that a much larger class of diffusions than $\mathcal{D}_2$ can be effectively simulated. It is expected that these enhanced simulation algorithms will be of immediate use to the methods that are presented in this paper.

Our methods are illustrated on three different diffusion models. The first is the periodic drift model, which belongs to $\mathcal{D}_1$ and, although it is quite interesting in its own right since its transition density is unavailable, it is used primarily for exposition. However, we also consider two more substantial and well-known applications: the logistic diffusion model for population growth and the Cox–Ingersoll–Ross (CIR) model for interest rates. The former belongs to the $\mathcal{D}_2$-class, whereas the latter is a diffusion process that is outside the $\mathcal{D}_2$-class, and it is used to illustrate how our exact methods can be extended for processes for which the EA2 algorithm is not applicable. Moreover, since we can calculate analytically the likelihood for this model, we have a bench-mark to test the success of our approach. We fit the CIR model to a well-studied data set, which contains daily euro–dollar rates between 1973 and 1995, to allow for comparisons with existing methods.

All the algorithms that are presented in this paper are coded in C and have been executed on a Pentium IV 2.6 GHz processor. We note that our methods are not computationally demanding according to modern statistical computing standards, and in the examples that we have considered the computing times (which are reported explicitly in the following sections) were of the order of seconds, or at worst minutes.

The structure of the paper is as follows. Section 2 reviews the EA. Section 3 sets up the context of transition density estimation, Sections 4–6 present the three different estimators and Section 7 compares them theoretically and empirically. Section 8 introduces the MCEM algorithm and Section 9 the MCMC algorithm. We finish with some general conclusions and directions for further research in Section 10. Background material and proofs are collected in a brief appendix.

2. Retrospective exact sampling of diffusions

Understanding the statistical methodology to be presented in later sections presupposes the introduction of the simulation techniques that are central to our approaches. In particular, we need to understand the form of the output that is provided by the EA. The main references for the material of this section are Beskos et al. (2004a) and Beskos and Roberts (2005). For this paper it suffices to analyse the EA for simulating diffusion paths conditional on their ending point (also known as diffusion bridges). In particular we shall show how to simulate a diffusion path starting from $V_0 = v$ and ending at $V_t = w$, for any t > 0, v,w ∈ R. Simulation of unconditioned paths follows along the same lines and is sketched in Section 2.3. When necessary we characterize the EA as conditional or unconditional to emphasize the type of diffusion simulation that it is used for.

The EA performs rejection sampling by proposing paths from processes that we can simulate and accepting them according to appropriate probability density ratios. The novelty lies in the fact that the paths proposed are unveiled only at finite (but random) time instances and the decision whether to accept the path or not can be easily taken.

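It is essential that we first transform the diffusion process (1) into an SDE of unit diffusion coefficient by applying the 1–1 transformation $V_s \to \eta(V_s;\theta) =: X_s$, where

$$\eta(u;\theta) = \int^{u} \frac{1}{\sigma(z;\theta)}\,dz \tag{3}$$

is any antiderivative of $\sigma^{-1}(\cdot;\theta)$. Assuming that σ(·;θ) is continuously differentiable, we apply Itô’s rule to find that the transformed process satisfies the SDE

$$dX_s = \alpha(X_s;\theta)\,ds + dB_s, \qquad X_0 = \eta(V_0;\theta), \quad s \in [0,t], \tag{4}$$

where

$$\alpha(u;\theta) = \frac{b\{\eta^{-1}(u;\theta);\theta\}}{\sigma\{\eta^{-1}(u;\theta);\theta\}} - \frac{\sigma'\{\eta^{-1}(u;\theta);\theta\}}{2}, \qquad u \in \mathbb{R};$$

$\eta^{-1}$ denotes the inverse transformation and $\sigma'$ denotes the derivative with respect to the space variable. In what follows we shall make the following standard assumptions for any θ ∈ Θ.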

  • (a) α(·;θ) is continuously differentiable.
  • (b) $(\alpha^2+\alpha')(\cdot;\theta)$ is bounded below.
  • (c) Girsanov's formula for X that is given in expression (32) in Appendix B.1 is a martingale with respect to Wiener measure.

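We define

$$A(u;\theta) := \int^{u} \alpha(z;\theta)\,dz$$

to be any antiderivative of α. The transition density of X is defined as

$$\tilde p_t(x,y;\theta) = P[X_t \in dy \mid X_0 = x;\theta]/dy, \qquad t>0,\ x,y \in \mathbb{R}. \tag{5}$$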


Before proceeding, we require the following preliminary notation. Let $C \equiv C([0,t],\mathbb{R})$ be the set of continuous mappings from [0,t] to R, $\mathcal{C}$ the corresponding cylinder σ-algebra and $\omega = (\omega_s, s \in [0,t])$ a typical element of C. Let $\mathbb{Q}_\theta^{(t,x,y)}$ denote the distribution of the process X conditioned to start at $X_0 = x$ and to finish at $X_t = y$, for some fixed x and y, and $\mathbb{W}^{(t,x,y)}$ be the probability measure for the corresponding Brownian bridge (BB). The notation highlights the dependence of the measure $\mathbb{Q}_\theta^{(t,x,y)}$ on θ.

The objective is to construct a rejection sampling algorithm to draw from $\mathbb{Q}_\theta^{(t,x,y)}$. The following lemma proved in Appendix B.1 is central to the methodology. By $\mathcal{N}_t(u)$ we denote the density of the normal distribution with mean 0 and variance t evaluated at u ∈ R.

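Lemma 1.  Under conditions (a)–(c) above, $\mathbb{Q}_\theta^{(t,x,y)}$ is absolutely continuous with respect to $\mathbb{W}^{(t,x,y)}$ with density

$$\frac{d\mathbb{Q}_\theta^{(t,x,y)}}{d\mathbb{W}^{(t,x,y)}}(\omega) = \frac{\mathcal{N}_t(y-x)}{\tilde p_t(x,y;\theta)}\,\exp\{A(y;\theta) - A(x;\theta)\}\,\exp\Big\{-\int_0^t \tfrac12(\alpha^2+\alpha')(\omega_s;\theta)\,ds\Big\}. \tag{6}$$

Let $l(\theta) \le \inf_{u \in \mathbb{R}}\{(\alpha^2+\alpha')(u;\theta)/2\}$ and $r(\omega,\theta) \ge \sup_{s \in [0,t]}\{(\alpha^2+\alpha')(\omega_s;\theta)/2 - l(\theta)\}$. In general r is a positive random variable. However, in special cases it can be chosen independently of ω, e.g. if $(\alpha^2+\alpha')(\cdot;\theta)$ is bounded above. We now define the non-negative function φ, with $0 \le \varphi \le 1$, which transforms a given path ω as follows:

$$\varphi(\omega_s;\theta) = \frac{1}{r(\omega,\theta)}\Big\{\tfrac12(\alpha^2+\alpha')(\omega_s;\theta) - l(\theta)\Big\}, \qquad s \in [0,t].$$

It is now clear that

$$\frac{d\mathbb{Q}_\theta^{(t,x,y)}}{d\mathbb{W}^{(t,x,y)}}(\omega) \propto \exp\Big\{-r(\omega,\theta)\int_0^t \varphi(\omega_s;\theta)\,ds\Big\} \le 1. \tag{7}$$

Thus we have managed in expression (7) to bound the density ratio. The key to proceed to a feasible rejection sampler is to recognize expression (7) as a specific Poisson process probability.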

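Theorem 1.  Let ω ∈ C, Φ be a homogeneous Poisson process of intensity r(ω,θ) on [0,t]×[0,1] and N be the number of points of Φ below the graph $s \mapsto \varphi(\omega_s;\theta)$. Then

$$P[N = 0 \mid \omega] = \exp\Big\{-r(\omega,\theta)\int_0^t \varphi(\omega_s;\theta)\,ds\Big\}.$$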


Theorem 1 suggests rejection sampling by means of an auxiliary Poisson process as follows.

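  • Step 1: simulate a sample path $\omega \sim \mathbb{W}^{(t,x,y)}$.
  • Step 2: calculate r(ω,θ); generate a marked Poisson process Φ={Ψ,ϒ}, with points $\Psi = \{\psi_1,\ldots,\psi_\kappa\}$ that are uniformly distributed on [0,t] and marks $\Upsilon = \{\upsilon_1,\ldots,\upsilon_\kappa\}$ that are uniformly distributed on [0,1], where κ∼Po{r(ω,θ)t}.
  • Step 3: compute the acceptance indicator $I := \prod_{j=1}^{\kappa} \mathbb{I}[\varphi(\omega_{\psi_j};\theta) < \upsilon_j]$, which we refer to as expression (8).
  • Step 4: if I=1, i.e. {N=0} has occurred, then accept ω; otherwise return to step 1.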

Unfortunately, this ‘algorithm’ is impossible to implement since it requires the simulation of complete BBs on [0,t]. However, it might be possible to determine I on the basis of only partial information about the path proposed. For instance, when r is only a function of θ (see Section 2.1) we can actually reverse the order in which steps 1 and 2 are carried out. Specifically, we would first simulate Φ, and afterwards, retrospectively, we would realize ω at the time instances that are determined by Ψ, since this is sufficient for determining I. The technique of exchanging the order of simulation to implement in finite time simulation of infinite dimensional random variables has been termed retrospective sampling in Papaspiliopoulos and Roberts (2004).

The general framework under which the EA operates assumes that it is possible to write

$$r(\omega,\theta) = r\{k(\omega),\theta\},$$


where k(ω) has the following properties.

  • (a) k(ω) is finite dimensional.
  • (b) The law of k(ω) can be simulated (under $\mathbb{W}^{(t,x,y)}$).
  • (c) The finite dimensional distributions of $\mathbb{W}^{(t,x,y)}$ given k(ω) can be simulated.

Under these conditions, the following retrospective implementation of the rejection sampler can be carried out in finite time: the EA.

  • Step 1: simulate k(ω).
  • Step 2: generate a realization of the marked Poisson process Φ={Ψ,ϒ} of rate r{k(ω),θ}.
  • Step 3: simulate the skeleton $\{\omega_{\psi_1},\ldots,\omega_{\psi_\kappa}\}$, conditionally on k(ω).
  • Step 4: compute the acceptance indicator I.
  • Step 5: if I=1, then accept the skeleton proposed, and return k(ω) and $S(\omega) := \{(0,x),(\psi_1,\omega_{\psi_1}),\ldots,(\psi_\kappa,\omega_{\psi_\kappa}),(t,y)\}$; otherwise return to step 1.

S(ω) is an exact draw from a finite dimensional distribution of $\mathbb{Q}_\theta^{(t,x,y)}$, which can be filled in at any required times afterwards using simply BB interpolation (see Section 2.3). The technical difficulty of finding or simulating k(ω), and consequently simulating the process at some time points given k(ω), imposes some restrictions on the applicability of the EA. We now describe the two cases where the algorithm can be easily applied.

2.1. Exact algorithm 1 (EA1)

Implementation of the EA is straightforward when r does not depend on ω. This will be true within the following diffusion class.

Definition 1.  We say that a diffusion process V with SDE (1) belongs to $\mathcal{D}_1$, and write $V \in \mathcal{D}_1$, if the drift of the transformed process $X_s = \eta(V_s;\theta)$, s ∈ [0,t], satisfies conditions (a)–(c) below equation (4) and $(\alpha^2+\alpha')(\cdot;\theta)$ is bounded above.

In this case we can take

$$r(\theta) = \sup_{u \in \mathbb{R}}\{(\alpha^2+\alpha')(u;\theta)/2 - l(\theta)\}.$$


Within $\mathcal{D}_1$, step 1 of the EA is unnecessary, since no information about the path is a priori needed for determining the Poisson rate. Moreover, step 3 entails simulation from a finite dimensional distribution of the BB (see Appendix A).

2.1.1. Example 1 (periodic drift)

Consider the SDE

$$dX_s = \sin(X_s - \theta)\,ds + dB_s.$$


Though apparently simple, the SDE cannot be solved analytically. However, the EA1 algorithm can be applied since $X \in \mathcal{D}_1$ with $l(\theta) = -1/2$ and r(θ)=9/8. Our proposed methods will be tested on the SINE data set simulated from X with the unconditional version of EA1 (Beskos and Roberts, 2005) under the specifications n=1000, $\Delta t_i = 1$, $X_0 = 0$ and θ=π; see also Fig. 1. When θ is to be estimated, we take Θ=[0,2π] for identifiability.
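As an illustration of the retrospective idea, the following Python sketch runs the conditional EA1 for a drift of the form sin(u − θ), for which (α² + α′)/2 lies between −1/2 and 5/8, giving l(θ) = −1/2 and r(θ) = 9/8. The Poisson times are drawn first and the proposed Brownian bridge is then unveiled only at those times. The function names (`phi`, `poisson_times`, `ea1_bridge`) are hypothetical and the code is a minimal sketch under these assumptions, not the authors' C implementation.

```python
import math
import random

R_BOUND = 9.0 / 8.0   # r(theta) for the periodic drift model
L_BOUND = -0.5        # l(theta), lower bound of (alpha^2 + alpha')/2

def phi(u, theta):
    """phi(u; theta) = {(alpha^2 + alpha')(u)/2 - l} / r, which lies in [0, 1]."""
    s, c = math.sin(u - theta), math.cos(u - theta)
    return ((s * s + c) / 2.0 - L_BOUND) / R_BOUND

def poisson_times(rate, t, rng):
    """Points of a homogeneous Poisson process on [0, t], in increasing order."""
    times, s = [], rng.expovariate(rate)
    while s < t:
        times.append(s)
        s += rng.expovariate(rate)
    return times

def ea1_bridge(x, y, t, theta, rng):
    """Conditional EA1: returns an accepted skeleton from (0, x) to (t, y)."""
    while True:
        psi = poisson_times(R_BOUND, t, rng)      # Poisson times first
        skeleton, s0, w0, accept = [(0.0, x)], 0.0, x, True
        for s in psi:                             # then unveil the BB there
            mean = w0 + (s - s0) / (t - s0) * (y - w0)
            sd = math.sqrt((s - s0) * (t - s) / (t - s0))
            w = rng.gauss(mean, sd)
            if rng.random() < phi(w, theta):      # mark falls below the graph
                accept = False
                break
            skeleton.append((s, w))
            s0, w0 = s, w
        if accept:
            return skeleton + [(t, y)]

print(ea1_bridge(0.0, 0.5, 1.0, math.pi, random.Random(1)))
```

The proposal is abandoned as soon as one Poisson point falls below the graph of φ, since the remaining points cannot change the decision.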

Figure 1. First 250 values of (a) the SINE (using the EA1 algorithm) and (b) the LOG-GROWTH (using the EA2 algorithm) data sets that are defined in example 1 and example 2 respectively

2.2. Exact algorithm 2 (EA2)

The EA2 algorithm applies to the wider class of diffusion processes that is defined below.

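Definition 2.  We say that a diffusion process V with SDE (1) belongs to the class $\mathcal{D}_2$, and write $V \in \mathcal{D}_2$, if the drift of the transformed process $X_s = \eta(V_s;\theta)$, s ∈ [0,t], satisfies conditions (a)–(c) below equation (4) and either

$$\limsup_{u \to \infty}\{(\alpha^2+\alpha')(u;\theta)\} < \infty \tag{9}$$

or $\limsup_{u \to -\infty}\{(\alpha^2+\alpha')(u;\theta)\} < \infty$.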

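Owing to symmetry, we shall study only case (9). We define the following elements of the path space C=C([0,t],R):

$$m(\omega) := \inf\{\omega_s;\ s \in [0,t]\}, \qquad \tau(\omega) := \sup\{s \in [0,t] : \omega_s = m(\omega)\};$$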


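thus m is the minimum value of a path ω, and τ is the time that the minimum is attained. Within $\mathcal{D}_2$, k(ω)=m(ω) and

$$r\{m(\omega),\theta\} = \sup\{(\alpha^2+\alpha')(u;\theta)/2 - l(\theta);\ u \ge m(\omega)\}.$$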


As shown in Beskos et al. (2004a) and summarized in Appendix A, m satisfies the three requirements that are stated just above the EA. Simulation of (τ,m) can be done by using simple transformations of elementary random elements. It is known (Asmussen et al., 1995) that the BB conditionally on (τ,m) can be derived in terms of two independent Bessel bridges, each operating on either side of (τ,m). The Bessel bridge is defined as a BB that is constrained to be positive and its simulation can be carried out by means of independent BBs; see Appendix A for details.

2.2.1. Example 2 (the logistic growth model)

A popular model for describing the dynamics of a population which grows at a geometric rate in an environment with limited feeding resources is given by the SDE

$$dV_s = R\,V_s\,(1 - V_s/\Lambda)\,ds + \sigma V_s\,dB_s.$$


R is the growth rate per individual, Λ the maximal population that can be supported by the resources of the environment and σ a noise parameter; for more details see chapter 6 of Goel and Richter-Dyn (1974). Related models have been investigated in the context of financial economics; see for example Gourieroux and Jasiak (2003). The transition density of V is not known analytically. The modified process $X_s = -\log(V_s)/\sigma$ solves the SDE

$$dX_s = \Big\{\frac{\sigma}{2} - \frac{R}{\sigma} + \frac{R}{\sigma\Lambda}\exp(-\sigma X_s)\Big\}\,ds - dB_s,$$

where −B is again a standard Brownian motion, so X has unit diffusion coefficient.


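It can be verified that $V \in \mathcal{D}_2$ and that the EA2 algorithm is applicable with $l(\theta) = \sigma^2/8 - R/2$ and

$$r\{m(\omega),\theta\} = \frac12 \max\Big[\frac{R^2}{\sigma^2},\ \Big\{\frac{R}{\sigma\Lambda}\exp\{-\sigma\,m(\omega)\} - \frac{R}{\sigma}\Big\}^2\Big].$$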


Our proposed methods will be tested on the LOG-GROWTH data set simulated from V by first simulating from X by using the unconditional EA2 algorithm and then transforming $X_s \to V_s$ (Beskos et al., 2004a). We took n=1000, $\Delta t_i = 1$, $V_0 = 700$ and (R,Λ,σ)=(0.1,1000,0.1); see Fig. 1.
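The bound l(θ) = σ²/8 − R/2 can be checked numerically. The sketch below evaluates (α² + α′)/2 for the drift of the transformed logistic process on a grid and compares its minimum with l(θ); it also compares a grid supremum over u ≥ m with the closed form ½ max[R²/σ², {R exp(−σm)/(σΛ) − R/σ}²], which follows from writing (α² + α′)/2 − l(θ) as {R exp(−σu)/(σΛ) − R/σ}²/2. The helper names and the grid are illustrative assumptions.

```python
import math

def alpha(u, R, Lam, sig):
    """Drift of X = -log(V)/sigma for the logistic growth model."""
    return sig / 2.0 - R / sig + R / (sig * Lam) * math.exp(-sig * u)

def alpha_prime(u, R, Lam, sig):
    """Derivative of alpha with respect to u."""
    return -R / Lam * math.exp(-sig * u)

def half_a2_plus_da(u, R, Lam, sig):
    """(alpha^2 + alpha')(u) / 2."""
    return 0.5 * (alpha(u, R, Lam, sig) ** 2 + alpha_prime(u, R, Lam, sig))

def r_bound(m, R, Lam, sig):
    """Closed-form supremum of (alpha^2 + alpha')/2 - l over u >= m."""
    g = R / (sig * Lam) * math.exp(-sig * m)
    return 0.5 * max((R / sig) ** 2, (g - R / sig) ** 2)

R, Lam, sig = 0.1, 1000.0, 0.1
l_theta = sig ** 2 / 8.0 - R / 2.0
grid = [-100.0 + 0.01 * i for i in range(20001)]   # u in [-100, 100]
values = [half_a2_plus_da(u, R, Lam, sig) for u in grid]
print("grid minimum:", min(values), " l(theta):", l_theta)

m = -80.0
sup_grid = max(v - l_theta for u, v in zip(grid, values) if u >= m - 1e-9)
print("grid supremum for u >= m:", sup_grid, " closed form:", r_bound(m, R, Lam, sig))
```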

2.3. Unconditional exact algorithm and path reconstruction

The EA can be applied in a similar fashion when the diffusion is conditioned only on its initial point $X_0 = x$. The main difference lies in that the final value of the path proposed is distributed according to the density

$$h(u) \propto \exp\{A(u;\theta) - (u-x)^2/(2t)\}$$


(which is assumed to be integrable). Simulation from h can be done efficiently by using an adaptive rejection sampling approach. Conditionally on this final point, the rest of the path is a BB, and the EA proceeds as already described.

The output of the EA is a skeleton S(ω) and possibly the collection k(ω) of variables related to the path. However, we can afterwards fill in this finite representation of an accepted path ω according to the dynamics of the proposed BB and without further reference to the target process. For the EA1 algorithm, the path reconstruction requires the simulation of the BBs that connect the currently unveiled instances of an accepted path ω. For the EA2 algorithm, the extra conditioning of the proposed BB on its minimum m implies that the filling in will be driven by two independent Bessel bridges; see Appendix A.

3. Unbiased transition density estimators—preliminaries

As already discussed in Section 1 one of our primary aims is to provide unbiased MC estimators of the transition density (2). The first step is to express equation (2) in terms of the density (5) of the transformed process X. A change of variables argument yields

$$p_t(v,w;\theta) = \tilde p_t\{\eta(v;\theta),\eta(w;\theta);\theta\}\,|\eta'(w;\theta)|.$$


Thus, we shall demonstrate how to estimate $\tilde p_t(x,y;\theta)$ for arbitrary x, y and θ, using the unconditional EA (bridge method) and the conditional EA (AM, Poisson estimator). Automatically, our methods yield unbiased estimators of the likelihood exp{l(θ|v)}, for any fixed θ ∈ Θ, so coupled with grid or more elaborate stochastic search algorithms they can be used for locating the MLE. However, this pointwise exploration of the likelihood surface is expected to be computationally inefficient owing to the introduction of independent MC error at each likelihood evaluation, and it is not guaranteed to provide consistent MC estimators of the MLE. We overcome these drawbacks by modifying the two methods based on the conditional EA to give estimates of the likelihood function simultaneously for all θ ∈ Θ. The simultaneous methods achieve estimation of the complete function

$$\theta \mapsto \tilde p_t\{\eta(v;\theta),\eta(w;\theta);\theta\} \tag{10}$$


using a single stream of random elements which are independent of θ. Therefore, numerically efficient optimization algorithms can be used to estimate features of the likelihood surface, such as the MLE and level sets. The simultaneous AM has appealing consistency properties, as we discuss in Section 5.1.

A fundamental result on which the AM and the Poisson estimator are based is the following corollary to lemma 1. This important result is also contained in lemma 1 of Dacunha-Castelle and Florens-Zmirou (1986).

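Corollary 1.  Let X be the diffusion process in equation (4) with transition density $\tilde p_t(x,y;\theta)$. Then

$$\tilde p_t(x,y;\theta) = \mathcal{N}_t(y-x)\,\exp\{A(y;\theta) - A(x;\theta)\}\,E_{\mathbb{W}^{(t,x,y)}}\Big[\exp\Big\{-\int_0^t \tfrac12(\alpha^2+\alpha')(\omega_s;\theta)\,ds\Big\}\Big]. \tag{11}$$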



4. The bridge method

The bridge method that was introduced in Beskos et al. (2004a) uses explicitly the output of the unconditional EA to produce unbiased estimates of $\tilde p_t(x,y;\theta)$ for any fixed x, y and θ. We enrich the notation for the transition density to allow for conditioning on further random elements: let

$$\tilde p_t(x,y \mid \cdot\,;\theta) = P[X_t \in dy \mid X_0 = x,\ \cdot\,;\theta]/dy$$


be the conditional density of Xt given the starting value and any other random elements.

Consider some δ > 0. Suppose that S(ω) and k(ω) are the output of the unconditional EA for simulating X started at $X_0 = x$ on the extended time interval [0,t+δ]. By conditional expectation properties, we have that

$$\tilde p_t(x,y;\theta) = E[\tilde p_t\{x,y \mid S(\omega),k(\omega);\theta\}]. \tag{12}$$


The joint distribution of S(ω) and k(ω) is intractable but can be easily simulated by using the EA on [0,t+δ]. Recall that the value of the path at times other than those specified by S(ω) can be obtained according to BB dynamics (in EA1) or Bessel bridge dynamics (in EA2). As a result, the conditional distribution of $X_t$ given S(ω) and k(ω), and thus the density on the right-hand side of equation (12), can be easily identified. By construction S(ω) contains at least one time point on either side of t, since $(0,x),(t+\delta,\omega_{t+\delta}) \in S(\omega)$, and let $t_-$ and $t_+$ denote the adjacent time points, $t_- < t < t_+$. By the Markov property $\tilde p_t\{x,y \mid S(\omega),k(\omega);\theta\}$ depends only on the points $(t_-,\omega_{t_-})$ and $(t_+,\omega_{t_+})$ of S(ω) and it will be equal either to a BB or a Bessel bridge density, both of which are analytically available and can be computed explicitly.

Thus, the main attraction of the method is that the density of Xt given the simulated variables (the skeleton) is explicitly known, unlike for example the MC method by Pedersen (1995), where it is approximated.
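For the EA1 case, where the filling in follows BB dynamics, the conditional density in the bridge method is a Gaussian density determined by the two skeleton points adjacent to t. The following Python sketch shows this computation and the resulting average over skeletons. It assumes that skeletons are supplied as lists of (time, value) pairs produced by an unconditional EA1 on [0, t+δ], which is not implemented here; the function names are hypothetical.

```python
import math

def bb_conditional_density(y, t, left, right):
    """Density at y of a Brownian bridge at time t, given the adjacent
    skeleton points left = (t_minus, w_minus) and right = (t_plus, w_plus)."""
    (tm, wm), (tp, wp) = left, right
    mean = wm + (t - tm) / (tp - tm) * (wp - wm)
    var = (t - tm) * (tp - t) / (tp - tm)
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bridge_estimate(y, t, skeletons):
    """Average of the conditional densities over the supplied skeletons.
    Each skeleton must contain points strictly before and strictly after t."""
    total = 0.0
    for sk in skeletons:
        left = max((p for p in sk if p[0] < t), key=lambda p: p[0])
        right = min((p for p in sk if p[0] > t), key=lambda p: p[0])
        total += bb_conditional_density(y, t, left, right)
    return total / len(skeletons)

# Hypothetical skeletons on [0, 2] with t = 1 (values chosen by hand).
skeletons = [[(0.0, 0.0), (0.6, 0.3), (1.4, 0.1), (2.0, -0.2)],
             [(0.0, 0.0), (2.0, 0.5)]]
print(bridge_estimate(0.2, 1.0, skeletons))
```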

5. The acceptance method

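In this section we show how to use a simple identity that is related to the conditional EA to derive an unbiased estimator of the transition density $\tilde p_t(x,y;\theta)$. Let a(x,y,θ) be the acceptance probability of the EA for simulating from $\mathbb{Q}_\theta^{(t,x,y)}$. Directly, expression (7) implies that

$$a(x,y,\theta) = E_{\mathbb{W}^{(t,x,y)}}\Big[\exp\Big\{-r(\omega,\theta)\int_0^t \varphi(\omega_s;\theta)\,ds\Big\}\Big]. \tag{13}$$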


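Substituting this expression in equation (11), and rearranging terms according to equation (6), yields the following identity which relates the acceptance probability of the EA to the transition density of the diffusion:

$$\tilde p_t(x,y;\theta) = \mathcal{N}_t(y-x)\,\exp\{A(y;\theta) - A(x;\theta) - l(\theta)\,t\}\,a(x,y,\theta). \tag{14}$$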


Recall the definition of the acceptance indicator I in expression (8), which we shall now rewrite as I(x,y,θ,Φ,ω) to emphasize the elements which determine its value: the Poisson process Φ, and the proposed bridge $\omega \sim \mathbb{W}^{(t,x,y)}$, unveiled at the times that are determined by Φ. By definition,

$$a(x,y,\theta) = E[I(x,y,\theta,\Phi,\omega)], \tag{15}$$


where the expectation is taken with respect to the joint distribution of Φ and ω. Thus, a simple unbiased estimator of a(x,y,θ) for any fixed θ, x and y can be obtained by recording the number of times that the EA accepts a proposed skeleton in K, say, trials.

Estimation of the transition density through equations (14) and (15) is referred to as the AM. An essential feature of the resulting estimator is that it is almost surely bounded, since $I \le 1$. As a result, all the moments of the estimator are finite.
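The AM can be sketched in a few lines of Python for a diffusion in the EA1 setting, where r does not depend on the path. The acceptance probability is estimated by the proportion of accepted proposals in K trials and is then multiplied by the explicit factor relating it to the transition density of X. The function names and the argument layout (`phi`, `r`, `l`, `A` passed as Python callables and numbers) are assumptions made for this illustration.

```python
import math
import random

def accept_once(x, y, t, phi, r, rng):
    """One EA1 trial: returns 1 if the proposed Brownian bridge from (0, x)
    to (t, y) is accepted, 0 otherwise."""
    s0, w0 = 0.0, x
    s = rng.expovariate(r)                 # first Poisson time
    while s < t:
        mean = w0 + (s - s0) / (t - s0) * (y - w0)
        sd = math.sqrt((s - s0) * (t - s) / (t - s0))
        w0, s0 = rng.gauss(mean, sd), s    # unveil the bridge at time s
        if rng.random() < phi(w0):         # Poisson point below the graph
            return 0
        s += rng.expovariate(r)
    return 1

def am_density(x, y, t, phi, r, l, A, K, rng):
    """AM estimate of the transition density of X from x to y over time t."""
    a_hat = sum(accept_once(x, y, t, phi, r, rng) for _ in range(K)) / K
    normal = math.exp(-(y - x) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)
    return normal * math.exp(A(y) - A(x) - l * t) * a_hat

# Drift sin(u - theta): l = -1/2, r = 9/8 and A(u) = -cos(u - theta).
theta = math.pi
phi = lambda u: ((math.sin(u - theta) ** 2 + math.cos(u - theta)) / 2.0 + 0.5) / (9.0 / 8.0)
A = lambda u: -math.cos(u - theta)
print(am_density(0.0, 0.5, 1.0, phi, 9.0 / 8.0, -0.5, A, 1000, random.Random(0)))
```

Because the estimated acceptance probability never exceeds 1, the estimate is bounded above by the explicit factor that multiplies it.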

5.1. The simultaneous acceptance method

In this section we upgrade the AM to yield an estimator of the complete map in expression (10). The method is now referred to as the simultaneous AM (SAM). We emphasize that equations (13)–(15) are applied for x=x(θ)=η(v;θ) and y=y(θ)=η(w;θ). The simultaneous estimator is obtained by expressing a(x,y,θ) as an expectation of a function of θ, x, y and random variables, none of which depends on θ, x and y. The method is presented for $V \in \mathcal{D}_1$, since the derivation of the SAM for $V \in \mathcal{D}_2$ is technically much more difficult, and it is contained in Beskos et al. (2005b), along with more theoretical results about the methods.

In the AM, the Poisson process Φ and the proposed path ω both depend on θ, since the former is of rate r(θ) and the latter is a BB from X0=x to Xt=y. As a result a(x,y,θ) can be estimated only by running a separate MC experiment for each θ ∈ Θ. Below we show how the thinning property of the Poisson process (see for example Section 5.1 of Kingman (1993)) and the relocation invariance property of BBs (see Appendix A) can be exploited to decouple the dependence between Φ, ω and θ.

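The relocation invariance property of BBs suggests that we can rewrite the acceptance indicator in terms of a standard BB, $\omega \sim \mathbb{W}^{(t,0,0)}$, as follows:

$$I(x,y,\theta,\Phi,\omega) = \prod_{j=1}^{\kappa} \mathbb{I}\Big[\varphi\Big\{\omega_{\psi_j} + \Big(1-\frac{\psi_j}{t}\Big)x(\theta) + \frac{\psi_j}{t}\,y(\theta);\theta\Big\} < \upsilon_j\Big]. \tag{16}$$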


Suppose that we can find an $r_{\max} < \infty$, such that

$$r(\theta) \le r_{\max} \quad \text{for all } \theta \in \Theta.$$


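Let $\Phi_{\max} = \{\Psi_{\max},\Upsilon_{\max}\}$ be a marked Poisson process on [0,t]×[0,1] with rate $r_{\max}$ and κ number of points, $\kappa \sim \mathrm{Po}(r_{\max}t)$. The thinning property for Poisson processes implies that the process that is obtained by deleting each point of $\Phi_{\max}$ with probability $1 - r(\theta)/r_{\max}$ is a Poisson process with rate r(θ). Conditionally on κ, let $U = (U_1,\ldots,U_\kappa)$ be a collection of independent and identically distributed Un[0,1] selection random variables. The thinning property implies that we can rewrite equation (16) in terms of $\Phi_{\max}$ as

$$I(x,y,\theta,\Phi_{\max},U,\omega) = \prod_{j=1}^{\kappa} \mathbb{I}\Big[\mathbb{I}[U_j < r(\theta)/r_{\max}]\,\varphi\Big\{\omega_{\psi_j} + \Big(1-\frac{\psi_j}{t}\Big)x(\theta) + \frac{\psi_j}{t}\,y(\theta);\theta\Big\} < \upsilon_j\Big];$$

note that only Poisson points for which $U_j < r(\theta)/r_{\max}$ determine the value of I.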

Theorem 2.  Let $\omega \sim \mathbb{W}^{(t,0,0)}$, and $\Phi_{\max} = \{\Psi_{\max},\Upsilon_{\max}\}$ be an independent marked Poisson process of rate $r_{\max}$ with κ number of points. Then

$$E[I \mid \omega, \Psi_{\max}] = \prod_{j=1}^{\kappa}\Big[1 - \frac{1}{r_{\max}}\Big\{\tfrac12(\alpha^2+\alpha')\Big(\omega_{\psi_j} + \Big(1-\frac{\psi_j}{t}\Big)x(\theta) + \frac{\psi_j}{t}\,y(\theta);\theta\Big) - l(\theta)\Big\}\Big]. \tag{17}$$


a(x,y,θ) is the expected value of equation (17) with respect to the joint distribution of $\Psi_{\max}$ and $\omega \sim \mathbb{W}^{(t,0,0)}$; thus equation (17) together with equation (14) determine an unbiased simultaneous estimator of $\tilde p_t\{x(\theta),y(\theta);\theta\}$, for any fixed data points v and w. An MC estimator of a(x,y,θ) is obtained by averaging over independent realizations of equation (17).
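The decoupling can be illustrated with a short Python sketch for a drift of the form sin(u − θ) with unit diffusion coefficient, so that η is the identity and x(θ) = v, y(θ) = w. The random streams (Poisson times of rate r_max and a standard BB evaluated at those times) are drawn once and then reused for every θ. As a simplification chosen for this illustration, the marks and selection variables are averaged out analytically: given the Poisson time and the bridge value, each factor of the thinned indicator has conditional expectation 1 − {(α² + α′)/2 − l(θ)}/r_max, which makes the estimate a smooth function of θ. All function names are hypothetical.

```python
import math
import random

R_MAX = 9.0 / 8.0    # upper bound on r(theta) for this model
L_THETA = -0.5       # l(theta), here free of theta

def draw_streams(t, K, rng):
    """K independent lists of (psi_j, omega_psi_j): Poisson times of rate
    R_MAX on [0, t] and a standard Brownian bridge (0 to 0) at those times."""
    streams = []
    for _ in range(K):
        pts, s0, w0 = [], 0.0, 0.0
        s = rng.expovariate(R_MAX)
        while s < t:
            mean = w0 * (t - s) / (t - s0)
            sd = math.sqrt((s - s0) * (t - s) / (t - s0))
            w0, s0 = rng.gauss(mean, sd), s
            pts.append((s, w0))
            s += rng.expovariate(R_MAX)
        streams.append(pts)
    return streams

def sam_acceptance(theta, v, w, t, streams):
    """Estimate of a(x, y, theta) from fixed streams, as a function of theta."""
    total = 0.0
    for pts in streams:
        prod = 1.0
        for s, om in pts:
            u = om + (1.0 - s / t) * v + (s / t) * w - theta   # relocated bridge
            half = (math.sin(u) ** 2 + math.cos(u)) / 2.0
            prod *= max(0.0, 1.0 - (half - L_THETA) / R_MAX)
        total += prod
    return total / len(streams)

def sam_log_density(theta, v, w, t, streams):
    """Log of the simultaneous density estimate, using A(u) = -cos(u - theta)."""
    log_normal = -0.5 * math.log(2.0 * math.pi * t) - (w - v) ** 2 / (2.0 * t)
    drift_term = -math.cos(w - theta) + math.cos(v - theta) - L_THETA * t
    return log_normal + drift_term + math.log(sam_acceptance(theta, v, w, t, streams))

streams = draw_streams(1.0, 100, random.Random(0))
for th in (3.0, 3.1, 3.2):
    print(th, sam_log_density(th, 0.0, 0.5, 1.0, streams))
```

Because the streams are fixed, repeated evaluations at the same θ return the same value, which is what allows a deterministic optimizer to be applied to the estimated log-likelihood.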

5.1.1. Example: periodic drift

We applied the SAM to the SINE data set, choosing rmax=9/8. For two different MC sample sizes, K=100 and K=1000, we used Brent's optimization algorithm (which combines parabolic interpolation with the golden section search algorithm; see section 10.2 of Press et al. (1992)) for finding the maximum of the log-likelihood. The MLE was estimated as 3.116, when K=100 (in 6 s), and 3.112 when K=1000 (in 57 s). Using numerical differentiation, the standard error was estimated as 0.04 when both K=100 and K=1000. Fig. 2(a) shows the estimate of the log-likelihood function based on the two different MC sample sizes.

Figure 2. Simultaneous estimate of (a) the log-likelihood of θ for the SINE data set and (b) the profile log-likelihood of R for the LOG-GROWTH data set, using the SAM with K=100 (dashed line) and K=1000 (solid line) MC samples

When the model assumed is in $\mathcal{D}_2$, it is mathematically much more difficult to derive a formula that is analogous to equation (17) which allows for simultaneous estimation of the likelihood function. In EA2 the minimum m and the time τ when the minimum occurs both depend on θ. Since the Poisson rate r is a function of m as well as θ, and ω is simulated conditionally on (τ,m), the dependence of ω and Φ on θ is considerably more difficult than in EA1 to decouple and the method requires novel couplings and results from stochastic analysis. Such constructions are beyond the scope of this paper and are presented in Beskos et al. (2005b), where several other issues related with the SAM are also tackled. In particular it is proved that the simultaneous estimator for diffusions in $\mathcal{D}_1$ and $\mathcal{D}_2$ converges uniformly in θ to the likelihood function as the number of MC samples increases. Under standard assumptions, this guarantees consistency of the MC estimate of the MLE.

5.1.2. Example: logistic growth

We used the SAM in conjunction with the downhill simplex optimization method (section 10.4 of Press et al. (1992)) to locate the MLE of the LOG-GROWTH data set, for MC sample sizes K=100 and K=1000. The estimates corresponding to (R,Λ,σ) were (0.1098,1014.89,0.10057) for K=100 (in 2 min) and (0.1097,1014.78,0.10057) for K=1000 (in 20 min). Using numerical differentiation the curvature of the estimated log-likelihood at the estimated MLE was found:


The corresponding standard errors are (0.016, 9.5, 0.002). Fig. 2(b) shows the estimates of the profile log-likelihood of R for the two MC sample sizes.

6. The Poisson estimator

Corollary 1 relates the transition density of the diffusion to an expectation over the BB measure. Thus, any unbiased estimator of the expectation on the right-hand side of equation (11) corresponds to an unbiased estimator of $\tilde p_t(x,y;\theta)$. Generalizing, suppose that the expectation

$$E_{\mathbb{P}^{(t,x,y)}}\Big[\exp\Big\{-\int_0^t f(\omega_s)\,ds\Big\}\Big] \tag{18}$$


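is to be estimated for an arbitrary continuous function f and diffusion bridge measure $\mathbb{P}^{(t,x,y)}$. For any c ∈ R, λ > 0, and path ω, we write

$$\exp\Big\{-\int_0^t f(\omega_s)\,ds\Big\} = e^{-ct}\exp\Big[\int_0^t \{c - f(\omega_s)\}\,ds\Big] = e^{(\lambda-c)t}\,E\Big[\Big\{\frac{1}{\lambda t}\int_0^t \{c - f(\omega_s)\}\,ds\Big\}^{\kappa}\,\Big|\,\omega\Big],$$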


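where the expectation is taken with respect to κ∼Po(λt), conditionally on ω. If ψ∼Un[0,t], then $\{c - f(\omega_\psi)\}/\lambda$ is an unbiased estimator of $\int_0^t \{c - f(\omega_s)\}\,ds/(\lambda t)$. Let $\Psi = \{\psi_1,\ldots,\psi_\kappa\}$ be a Poisson process of rate λ on [0,t], and $\omega \sim \mathbb{P}^{(t,x,y)}$. Then, we obtain the following simple unbiased estimator of expectation (18), which we call the Poisson estimator:

$$e^{(\lambda-c)t}\,\lambda^{-\kappa}\prod_{j=1}^{\kappa}\{c - f(\omega_{\psi_j})\}. \tag{19}$$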


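This estimator was introduced in the context of statistical physics by Wagner (1988a, 1989). It can be derived from first principles that its second moment is

$$e^{(\lambda-2c)t}\,E_{\mathbb{P}^{(t,x,y)}}\Big[\exp\Big\{\frac{1}{\lambda}\int_0^t \{c - f(\omega_s)\}^2\,ds\Big\}\Big],$$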


which is not guaranteed to be finite. The choice of c and λ with the view of improving the efficiency of the algorithm is discussed in Section 7.
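A single draw of the Poisson estimator can be sketched as follows when the bridge measure is that of a BB from (0, x) to (t, y). The bridge is simulated only at the Poisson times, one factor {c − f(ω_ψ)}/λ is accumulated per point, and the result is scaled by exp{(λ − c)t}. The function name and signature are assumptions made for this illustration.

```python
import math
import random

def poisson_estimator(f, x, y, t, c, lam, rng):
    """One unbiased draw for E[exp{-int_0^t f(omega_s) ds}], with omega a
    Brownian bridge from (0, x) to (t, y) and Poisson times of rate lam."""
    est = math.exp((lam - c) * t)
    s0, w0 = 0.0, x
    s = rng.expovariate(lam)               # first Poisson time
    while s < t:
        mean = w0 + (s - s0) / (t - s0) * (y - w0)
        sd = math.sqrt((s - s0) * (t - s) / (t - s0))
        w0, s0 = rng.gauss(mean, sd), s    # bridge value at the Poisson time
        est *= (c - f(w0)) / lam
        s += rng.expovariate(lam)
    return est

# Example with f = (alpha^2 + alpha')/2 for the drift sin(u - pi), lambda = c = 1.
f = lambda u: (math.sin(u - math.pi) ** 2 + math.cos(u - math.pi)) / 2.0
rng = random.Random(0)
draws = [poisson_estimator(f, 0.0, 0.5, 1.0, 1.0, 1.0, rng) for _ in range(1000)]
print(sum(draws) / len(draws))
```

When f is constant and c is set to that constant plus λ, every factor equals 1 and the estimator returns the exact value with zero variance, which gives a simple sanity check. Unlike the AM, individual draws can be negative or larger than 1 for other choices of c and λ.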

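Taking $\mathbb{P}^{(t,x,y)} \equiv \mathbb{W}^{(t,x,y)}$, and $f = (\alpha^2+\alpha')/2$, the Poisson estimator can be used with equation (11) to estimate $\tilde p_t(x,y;\theta)$. This transition density estimator will also be referred to as the Poisson estimator. A simultaneous estimation of the complete map $\theta \mapsto \tilde p_t\{x(\theta),y(\theta);\theta\}$ for any data points v and w merely requires decoupling of ω in expression (19) from x and y, since Ψ clearly is independent of θ. Since in the current context $\mathbb{P}^{(t,x,y)} \equiv \mathbb{W}^{(t,x,y)}$, it suffices to exploit the relocation invariance property of BBs and to rewrite expression (19), with $\omega \sim \mathbb{W}^{(t,0,0)}$, as

$$e^{(\lambda-c)t}\,\lambda^{-\kappa}\prod_{j=1}^{\kappa}\Big[c - \tfrac12(\alpha^2+\alpha')\Big\{\omega_{\psi_j} + \Big(1-\frac{\psi_j}{t}\Big)x(\theta) + \frac{\psi_j}{t}\,y(\theta);\theta\Big\}\Big].$$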


This estimator will be referred to as the simultaneous Poisson estimator. For more general diffusion bridge measures, decoupling of ω from x and y can be cumbersome, since the intuitive relocation invariance property does not hold in general. Such a case is treated in the following section.
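The decoupling can be sketched as follows, assuming the Brownian bridge case. The Poisson times and a bridge from 0 to 0 are drawn once and stored; each evaluation then relocates the stored bridge by the line joining x and y, so that the same random numbers serve every θ. The factory function, its arguments and the signature `f(u, theta)` are hypothetical names introduced for illustration.

```python
import numpy as np

def standard_bridge_at(times, t, rng):
    """Brownian bridge from 0 at time 0 to 0 at time t, sampled at `times`."""
    order = np.argsort(times)
    values = np.empty(len(times))
    s_prev, w_prev = 0.0, 0.0
    for idx in order:
        s = times[idx]
        remaining = t - s_prev
        mean = w_prev * (t - s) / remaining
        var = (s - s_prev) * (t - s) / remaining
        w = mean + np.sqrt(var) * rng.standard_normal()
        values[idx] = w
        s_prev, w_prev = s, w
    return values

def make_simultaneous_estimator(f, t, lam, n_samples, rng):
    """Draw (Psi, omega) once; return a deterministic function of (theta, x, y, c)."""
    draws = []
    for _ in range(n_samples):
        kappa = rng.poisson(lam * t)
        psi = rng.uniform(0.0, t, size=kappa)
        draws.append((psi, standard_bridge_at(psi, t, rng)))

    def estimate(theta, x, y, c):
        total = 0.0
        for psi, omega0 in draws:
            # Relocation: add the straight line from x to y to the 0-0 bridge.
            path = omega0 + (1.0 - psi / t) * x + (psi / t) * y
            total += np.exp((lam - c) * t) * np.prod((c - f(path, theta)) / lam)
        return total / len(draws)

    return estimate
```

Because the returned function is deterministic given the stored draws, it can be handed directly to a numerical optimizer or evaluated on a grid to trace a profile likelihood.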

6.1. Inference for the Cox–Ingersoll–Ross model

We apply the Poisson estimator to perform inference for a diffusion process outside D2, the CIR diffusion model, which solves


It is assumed that all parameters are positive and 2ρμ > σ², which guarantees that V does not hit zero (see page 391 of Cox et al. (1985)). We fit the CIR model to a well-studied data set (used among others by Aït-Sahalia (1996), Elerian et al. (2001) and Roberts and Stramer (2001)) which contains daily euro–dollar rates between 1973 and 1995, to allow for comparisons with existing methods. We take a subsample of 550 values, corresponding to time intervals of 10 days, since the diffusion model seems more appropriate on that scale. We call this subsample the EURODOLLAR data set and plot it in Fig. 3(a).

Figure 3.

 (a) EURODOLLAR data set and (b) real (——) profile log-likelihood of σ and its estimation ( - - - - - -) by using the simultaneous Poisson estimator with λ=c=1 and K=100 MC samples

In this case η(u;θ)=2√u/σ; thus the transformation Xs=η(Vs;θ) solves the SDE


Setting α(u;θ)=(2ρμ/σ²−0.5)/u − ρu/2, it is easy to check that the CIR model is outside D2. X is almost surely positive and its distribution is absolutely continuous with respect to that of the Bessel process. Let inline image and ℝ(t,x,y) be the bridge measures of X and the Bessel process respectively, and inline image and qt(x,y) be the corresponding transition densities. Direct application of lemma 1 yields an analogue of equation (11) for positive diffusions:


Thus, we can readily use the Poisson estimator to estimate the transition density of the CIR model for a fixed θ ∈ Θ. However, simultaneous estimation for all θ ∈ Θ is non-trivial, since the decoupling of ω∼ℝ(t,x,y) from x and y is not straightforward. The appropriate construction is in fact a variation of the approach that we have devised for the SAM for diffusions in D2 and is contained in Beskos et al. (2005b). This stems from the characterization of the Bessel bridge as a BB that is conditioned to remain positive. As in the EA2 algorithm, we consider first the minimum m of the Bessel bridge and subsequently reconstruct the path given m. The distribution of m is the restriction on (0,∞) of the distribution of the minimum of the corresponding BB. Given m, the path can be reconstructed exactly as if it were a BB with known minimum m.

We applied the simultaneous Poisson estimator to the EURODOLLAR data set taking, after some pilot tuning, λ=c=1. In Fig. 3(b) we show the real and the estimated profile log-likelihood for σ by using the downhill simplex optimization algorithm. Even with 100 MC samples the estimated curve coincides with the true curve, although it can be shown that the variance of the estimator is infinite in this case.
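The transformed drift of this section can be checked numerically. The sketch below assumes the usual CIR parameterization dV = −ρ(V−μ) dt + σ√V dB, which is consistent with the condition 2ρμ > σ² but is not displayed above, and compares the closed form of α with the drift obtained by applying Itô's formula to X = 2√V/σ. It also evaluates f=(α²+α′)/2, which is unbounded both near 0 and for large u. Function names are hypothetical.

```python
import numpy as np

def cir_alpha(u, rho, mu, sigma):
    """Closed-form drift of X = 2*sqrt(V)/sigma for the assumed CIR model."""
    return (2.0 * rho * mu / sigma**2 - 0.5) / u - rho * u / 2.0

def drift_via_ito(u, rho, mu, sigma):
    """Same drift from Ito's formula: b/s - s'/2, evaluated at v = eta^{-1}(u)."""
    v = (sigma * u / 2.0) ** 2
    b = -rho * (v - mu)                   # drift of V
    s = sigma * np.sqrt(v)                # diffusion coefficient of V
    ds = sigma / (2.0 * np.sqrt(v))       # its derivative in v
    return b / s - ds / 2.0

def cir_f(u, rho, mu, sigma):
    """The functional (alpha^2 + alpha')/2 used by the Poisson estimator."""
    k = 2.0 * rho * mu / sigma**2 - 0.5
    alpha = k / u - rho * u / 2.0
    dalpha = -k / u**2 - rho / 2.0
    return 0.5 * (alpha**2 + dalpha)
```

Since f behaves like a multiple of 1/u² near the origin and like ρ²u²/8 for large u, no finite c bounds it from above, which is in line with the remark that the variance of the estimator is infinite here.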

7. Comparison of different transition density estimators

We have introduced three different methods which can be used to estimate the diffusion transition density. Whereas the AM (and SAM) were devised solely for this purpose, it is important to recognize that the other two methods have greater scope. The Poisson estimator can estimate expectations of general diffusion exponential functionals (18). The bridge method is based on equation (12), which is a form of Rao–Blackwellization that can be used to estimate other conditional probabilities for diffusions, e.g. hitting probabilities.

In the context of transition density estimation, the weakness of the AM and the bridge method relative to the Poisson estimator is that their applicability is determined by that of the EA. In contrast, the Poisson estimator can be used without assuming that the drift α is suitably bounded.

A significant advantage of the AM over the competitors is that it is guaranteed to have finite polynomial moments. For the bridge method little is known since we have not yet derived analytic expressions for its variance. Checking whether the variance of the Poisson estimator is finite is in general tedious. A notable exception is when the diffusion is in D1, where, since f=(α²+α′)/2 is bounded, it is easy to see from expression (20) that the variance will be finite for any finite c and λ. Two examples outside D1 for which we know that this will not be the case are the CIR and the logistic growth models.

A feature of the bridge method which is likely to make it generally less efficient than the competitors is that the simulated paths ignore the final data point, since the method is founded on the unconditional EA.

When the aim is to explore features of the likelihood surface, it is imperative that the estimators yield estimates of the transition density simultaneously for all θ ∈ Θ. The bridge method is the only method for which we have been unable to achieve that.

Although derived from very different perspectives, the SAM and the simultaneous Poisson estimator are related. Taking λ=rmax, c=λ+l(θ) and contrasting equation (17) with expression (21) reveals that, in D1, the SAM is a special case of the simultaneous Poisson estimator. Moreover, there are certain optimality properties of this choice. For any c such that c>r(θ)+l(θ), expression (20) is bounded above by exp([−2c+λ+{c−l(θ)}²/λ]t). This quantity is minimized by any pair (λ,c) such that c=λ+l(θ), where λ ≥ r(θ). Requiring that the Poisson estimator yields estimates simultaneously for all θ ∈ Θ, the computationally most efficient bound on the variance is achieved by the choice λ=rmax and c=λ+l(θ), under which the Poisson estimator and the SAM coincide. Note that, in this choice, λ is the range and c the maximum of the functional (α²+α′)(u;θ)/2 over all u ∈ R, θ ∈ Θ. It is not obvious whether choosing c > r(θ)+l(θ) is optimal. The connection between the two methods is less transparent outside D1. The rate of the Poisson process that is used in the SAM for diffusions in D2 will depend on the minimum of the path proposed, and it is precisely because of this dependence that the estimator is almost surely bounded. The Poisson process and the path proposed in the Poisson estimator are inherently independent. Guided by our findings for D1, we propose to choose λ according to an estimate of the range of (α²+α′)(u;θ)/2 over all u ∈ R, θ ∈ Θ, and to take c=λ+l(θ). This choice has proved successful empirically.
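The minimization of the bound can be verified with a few lines of arithmetic. The sketch below evaluates the exponent −2c+λ+{c−l}²/λ as stated in the text, with `l` and `lam` as hypothetical stand-ins for l(θ) and λ. Substituting c=λ+l gives −2l whatever the value of λ, and moving c away from λ+l by d adds d²/λ to the exponent.

```python
import numpy as np

def second_moment_bound(c, lam, l, t):
    """Upper bound exp([-2c + lam + (c - l)^2 / lam] t) on the second moment."""
    return np.exp((-2.0 * c + lam + (c - l) ** 2 / lam) * t)

def recommended_c(lam, l):
    """The choice c = lam + l, which minimizes the bound over c for fixed lam."""
    return lam + l
```

Since the minimized bound does not depend on λ, the smallest admissible λ is the cheapest choice computationally, because the expected number of Poisson points is λt.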

A brief study of the performance of the SAM and the simultaneous Poisson estimator in estimating the MLE of the LOG-GROWTH data set is summarized in Table 1. The parameters λ and c were chosen as suggested above. Taking into account its smaller computational cost, the simultaneous Poisson estimator is more efficient in this example. A feature that is not depicted in Table 1 is the sensitivity of the simultaneous Poisson estimator to the choice of c and λ. The method can produce unreliable estimates, especially with few MC samples, for certain choices of the tuning parameters. In contrast the SAM is fully automatic.

Table 1.   Summary of 500 independent estimations of the MLE of the LOG-GROWTH data set†

Parameter   Estimator              Minimum    1st quartile   Median     3rd quartile   Maximum
R           Simultaneous Poisson   0.1030     0.1072         0.1087     0.1099         0.1139
Λ           Simultaneous Poisson   1008.88    1013.48        1014.93    1016.50        1022.80
σ           Simultaneous Poisson   0.10040    0.10050        0.10054    0.10058        0.10069

†Each estimation is based on K=10 MC samples. For each parameter, the first row corresponds to the SAM and the second to the simultaneous Poisson estimator, with λ=1 and c=l(θ)+λ. The average time for each experiment was around five times higher in the SAM compared with the simultaneous Poisson estimator.

8. A Monte Carlo expectation–maximization approach

The EM algorithm (Dempster et al., 1977) is suitable for locating the MLE when the likelihood of the observed data is intractable but the joint likelihood of the observed and some appropriately defined missing data is of a simple form. At the expense of some additional effort the EM algorithm can also be used to obtain an estimate of the observed information matrix. To fix ideas, let vobs={V0=v,Vt=w} be the observed data, which consist of discrete time observations from equation (1) (without loss of generality we assume two data points). It is assumed throughout this section that V ∈ D2. When the diffusion coefficient does not depend on unknown parameters, i.e. when σ(·;θ)=σ(·), we can treat the paths between consecutive observations as missing data and use Girsanov's formula (32) to form the complete likelihood. This naturally leads to an EM algorithm whose implementation is described in the next section. However, when the diffusion coefficient depends on θ, the missing paths should be appropriately transformed, as shown in Section 8.2.

8.1. Monte Carlo expectation–maximization for diffusions with known diffusion coefficient

We first transform the process Vs→Xs=η(Vs), where η is defined in equation (3) and, since by assumption σ(·;θ)=σ(·), η does not depend on θ. Thus, we define xobs={X0=x,Xt=y}, x=η(v) and y=η(w); xobs contains discrete time observations from the diffusion X with SDE (4). The observed log-likelihood is inline image. Let xmis denote the missing path, (Xs,s ∈ [0,t]), and xcom={xobs,xmis} denote the complete data, which is the continuous diffusion path starting from X0=x and finishing at Xt=y. The complete log-likelihood is given by Girsanov's formula (32),




The E-step of the EM algorithm requires analytic evaluation of


However, note that we can introduce the random variable U∼Un[0,t], which is independent of both xmis and xobs, and rewrite


We cannot perform analytically the expectation above. However, since we can simulate U, and Xu for any u by using the EA, we adopt an MC implementation of the EM algorithm. The MCEM algorithm was introduced in Wei and Tanner (1990); convergence and implementation issues were tackled in Chan and Ledolter (1995), Sherman et al. (1999) and Fort and Moulines (2003). It is well documented (see for example Fort and Moulines (2003), and references therein) that the number of MC samples that are used to approximate the expectation should increase with the EM iterations. We have followed this approach in our examples. Several methods have been proposed for using the EM algorithm to estimate the observed information matrix; see for example Louis (1982), Jamshidian and Jennrich (2000) and Meng and Rubin (1991). We have used the method that was suggested by Louis (1982), where we use MC estimations of the required expectations, as in Wei and Tanner (1990).
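The device of introducing U∼Un[0,t] rests on the identity that the time integral of g(Xs) over [0,t] equals t times the expectation of g(XU) over U. A minimal sketch of the resulting MC average follows. As a stand-in for the conditional EA, which is not reproduced here, it draws XU from the Brownian bridge marginal, so it illustrates the averaging step only; `g` and the function names are hypothetical.

```python
import numpy as np

def bridge_marginal(u, t, x, y, rng):
    """Brownian bridge from (0, x) to (t, y), one independent draw per time in u."""
    mean = x + (u / t) * (y - x)
    sd = np.sqrt(u * (t - u) / t)
    return mean + sd * rng.standard_normal(np.shape(u))

def mc_time_integral(g, t, x, y, n_samples, rng):
    """MC estimate of E[int_0^t g(X_s) ds] as t * mean(g(X_U)), U ~ Un[0, t]."""
    u = rng.uniform(0.0, t, size=n_samples)
    x_u = bridge_marginal(u, t, x, y, rng)   # in the paper's setting: a draw via the EA
    return t * np.mean(g(x_u))
```

In an MCEM run `n_samples` would be increased over the iterations, in line with the schedule reported in the footnote of Table 2.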

8.1.1. Example: periodic drift

We applied MCEM to the SINE data set; the EM iterations are presented in the second column of Table 2. The M-step was implemented by using Brent's optimization method. The convergence is very rapid, essentially in one iteration, and the estimated MLE is in perfect agreement with the estimate that is obtained with the SAM in Section 5.1. The standard error was estimated to be 0.04.

Table 2.   MCEM iterations for the SINE (second column) and the LOG-GROWTH (last three columns) data sets†
†The first five iterations were performed using 200 MC samples and the last five using 2000 samples. The execution times were 177 s for the SINE data and 18 min for the LOG-GROWTH data.


8.2. The general case: missing information and a path transformation

An important result from stochastic analysis, which has found numerous applications in statistical inference for diffusion processes (see for instance Barndorff-Nielsen and Shephard (2002)), is that a continuous time diffusion path on an interval of time [0,t] can be used to estimate perfectly the parameters that are involved in σ(·;θ). A similar result does not hold for parameters in the drift b(·;θ), where perfect estimation from continuous time data typically holds asymptotically as t→∞. However, a finite number of discrete time observations can contain only finite information about any of the parameters. The computational implication is that we cannot construct an EM algorithm as in Section 8.1 where the paths between the observed data are treated as missing data. According to the EM terminology, this augmentation scheme leads to a fraction of missing information equal to 1. This problem was first encountered in an MCMC framework by Roberts and Stramer (2001) and Elerian (1999).

One remedy for this type of problem is to find a suitable invertible transformation of the missing data to reduce the augmented information. In particular, we want to transform the process so that the diffusion coefficient is independent of θ, exploiting the result that continuous time paths contain only finite information about the drift parameters. Our approach is in the spirit of Roberts and Stramer (2001). We start by transforming the process, Vs→Xs=η(Vs;θ), where η is the transformation in equation (3); therefore X0=x(θ)=η(v;θ) and Xt=y(θ)=η(w;θ). Unlike in Section 8.1, (Xs,s ∈ [0,t]) is not directly observed; instead it is a function of the unknown parameters. The second level of path transformation is from inline image where


inline image is a diffusion bridge starting from inline image and finishing at inline image. Its dynamics depend on θ and are typically intractable; nevertheless it is easy to simulate inline image at any time s ∈ [0,t], conditionally on vobs and a specific value of θ, following the procedure

  • (a) find x=x(θ) and y=y(θ),
  • (b) simulate Xs from the bridge diffusion measure inline image by using the EA and
  • (c) apply the transformation in expression (25).

The inverse transformation from inline image to X is


We propose an augmentation scheme where inline image and vcom={vobs,vmis}. The following lemma, which is proved in Appendix B, gives the complete log-likelihood.

Lemma 2.  The log-likelihood of the complete data vcom is given by


Arguing as in Section 8.1, we have that


We can now apply MCEM in a similar fashion to that in Section 8.1.

8.2.1. Example: logistic growth

We used MCEM to estimate the MLE and the observed information for the LOG-GROWTH data set. The results in Table 2 indicate rapid convergence. We used the Fletcher–Reeves version of the conjugate gradient method for the M-step of the algorithm (see section 10.5 of Press et al. (1992)). The standard errors are estimated using K=2000 as (0.015, 9.5, 0.002).

9. Hierarchical simulation models and inference using Markov chain Monte Carlo methods

In this section we develop an MCMC algorithm for Bayesian inference for discretely observed diffusions. The stationary distribution of our MCMC algorithm is the exact posterior distribution of the parameters. However, it is not only the exactness which distinguishes our approach from competitive existing MCMC algorithms. Existing methods follow a path augmentation approach, as we did in the EM algorithm in Section 8. The missing continuous paths are approximated by a fine discrete time Markov chain, whose transition is assumed to follow an Euler-type approximation to the true diffusion transition. The joint distribution of the observed and missing data is given by an appropriate approximation of Girsanov's formula. Then, a Gibbs sampler (or more general componentwise updating MCMC algorithm) is used to sample from the approximate posterior distribution of the parameters and the missing paths. Often, for any two data points several points need to be imputed in between them to obtain a good approximation to the true posterior of θ. However, the performance of basic MCMC schemes can severely deteriorate as the amount of imputation increases; see Roberts and Stramer (2001) for details.

The approach that is introduced here is not based on augmentation of paths. Instead, we construct a graphical model which involves the variables that are used in the EA, and we show that the posterior distribution of the parameters is obtained as a marginal in this graphical model. Thus, we use an appropriate Metropolis–Hastings algorithm to sample from the joint posterior of all variables in the graph. One of the steps of the sampler involves running the conditional EA. The state space of our MCMC algorithm typically has much smaller dimension than the competing augmentation methods and as a result it can be computationally more efficient. However, a comparison between alternative MCMC methods is not carried out in this paper and will be reported elsewhere.

We describe in detail the MCMC algorithm when it is assumed that V ∈ D1. The derivation of the algorithm when V ∈ D2 is technically much more difficult, and it can be found in Beskos et al. (2004b). Nevertheless, we present simulation results for both cases. An essential ingredient of the algorithm is the following lemma which derives the density of the output of the conditional EA1 algorithm.

Lemma 3.  Consider any two fixed points x and y. Let Φ={Ψ,ϒ} be the marked Poisson process on [0,t]×[0,1] with rate r(θ) and number of points κ∼Po{r(θ)t}, which is used in EA1 for simulating from inline image. Let ω∼W(t,0,0), and I be the acceptance indicator (16) which decides whether (ωs+(1−s/t)x+(s/t)y,s ∈ [0,t]) is accepted as a path from inline image. Then, the conditional density of ω and Φ given {I=1}, π(ω,Φ|θ,x,y,I=1), is


with respect to the product measure Φ×W(t,0,0), where Φ is the measure of a unit rate Poisson process on [0,t]×[0,1], and a(x,y,θ) is the acceptance probability of the EA1 algorithm.

To clarify the notation, in the remainder of the section (ω*,Φ*) represent the accepted random elements in EA1, with density π(ω*,Φ*|θ,x,y) given in expression (26); Φ*, and ω* at any collection of times, can be easily sampled by using the conditional EA1 algorithm.

Let π(θ) denote an appropriately specified prior density for θ. The aim is to sample from the posterior distribution of θ given diffusion observations v={Vt0,…,Vtn}, say π(θ|v), where V ∈ D1. It is convenient to define


where η is the transformation in equation (3). For a given θ, the values xi are discrete time samples from SDE (4). For 1 ≤ i ≤ n, let inline image and inline image be the accepted elements of the conditional EA1 algorithm applied on the time interval [ti−1,ti]; inline image will denote the corresponding output skeleton. We construct the following hierarchical model, which is defined through the densities:


Note that in this hierarchical model the observed data are in the middle of the hierarchy. We term expression (28) a hierarchical simulation model, to emphasize that it represents the order in which variables must be simulated to ensure that the output of the EA1 algorithm, inline image, is indeed from inline image. The posterior density of interest, π(θ|v), is a marginal of the joint posterior density of θ and the latent variables, inline image. We aim at sampling from this density via the Gibbs sampler. The conditional density of the latent variables is


thus the pairs inline image are conditionally independent with density given in expression (26). The key property of this hierarchical simulation model is that θ is independent of inline image given the collection of skeletons inline image. Specifically, the conditional density of θ given the observed data and the latent variables is given in the following theorem.

Theorem 3. θ is conditionally independent of inline image given v and inline image, with density inline image proportional to


Note that expression (30) is given in a simple computable form, and the intractable terms a(xi−1,xi,θ) are not involved when conditioning on inline image. Our proposed MCMC algorithm is a componentwise updating algorithm, simulating iteratively the skeletons inline image and θ from their posterior conditional distributions. The skeletons inline image are conditionally independent given θ and can be easily simulated by using the EA1 algorithm. In some cases it might be possible to sample from density (30) directly; otherwise a Metropolis–Hastings step could be used. When a random-walk Metropolis algorithm is being used, it will generally make sense to scale the proposal variance to be proportional to n⁻¹.
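The structure of such a componentwise sampler can be sketched generically. In the sketch below, `sample_latent` stands in for the conditional EA1 draw of the skeletons given θ and `log_cond_theta` for the logarithm of density (30) up to a constant; both are hypothetical callables supplied by the user, and neither is implemented here. The proposal standard deviation is taken proportional to n^(−1/2), i.e. variance proportional to n⁻¹; the constant 2.4 is an arbitrary default.

```python
import numpy as np

def rw_metropolis_step(theta, log_density, proposal_sd, rng):
    """One random-walk Metropolis update for a scalar parameter."""
    proposal = theta + proposal_sd * rng.standard_normal()
    log_ratio = log_density(proposal) - log_density(theta)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return theta, False

def componentwise_sampler(theta0, sample_latent, log_cond_theta,
                          n_obs, n_iter, rng, scale=2.4):
    """Alternate latent-variable draws with a Metropolis update of theta."""
    theta = theta0
    proposal_sd = scale / np.sqrt(n_obs)   # proposal variance proportional to 1/n
    draws = np.empty(n_iter)
    accepted = 0
    for k in range(n_iter):
        latent = sample_latent(theta, rng)   # placeholder for the conditional EA1
        theta, ok = rw_metropolis_step(
            theta, lambda th: log_cond_theta(th, latent), proposal_sd, rng)
        accepted += ok
        draws[k] = theta
    return draws, accepted / n_iter
```

A prior with bounded support is handled by letting `log_cond_theta` return `-np.inf` outside the support, so that such proposals are always rejected.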

9.1. Example: periodic drift

We implemented the algorithm for the SINE data set. We used a uniform prior on [0,2π]. It took 47 s to run the algorithm for 10000 iterations, and some summaries are shown in the rightmost column of Fig. 4. The posterior mean is estimated as 3.1127 and the posterior standard deviation as 0.04. We used a Metropolis step to update θ, which had acceptance probability 0.49. The algorithm mixes very rapidly, and essentially the autocorrelation in the θ-series is because a Metropolis step is used rather than direct simulation from its conditional distribution. The dependence between θ and the latent variables is very weak.

Figure 4.

 Summary of MCMC results—(a)–(d) trace plots, (e)–(h) autocorrelation plots and (i)–(l) posterior density estimates: (d), (h), (l) 10000 MCMC iterations for the SINE data set (the first 500 samples were removed as burn-in for autocorrelation and density estimation); (a)–(c), (e)–(g), (i)–(k) 50000 MCMC iterations for the LOG-GROWTH data set (with a burn-in of 2000 iterations); (a), (e), (i) R; (b), (f), (j) Λ; (c), (g), (k) σ; (d), (h), (l) θ

When the assumed model is in D2, the construction of the MCMC algorithm is more complicated. In particular, derivation of the joint density of (θ,ω*,Φ*) is challenging owing to the more complex structure of the EA2 algorithm. This is done in Beskos et al. (2004b), where also other important issues related to the implementation of the MCMC algorithm are tackled. The algorithm necessitates recent non-centring reparameterization methodology for hierarchical models as described in Roberts et al. (2004) and Papaspiliopoulos et al. (2003).

9.2. Example: logistic growth

Using the MCMC algorithm that is described in Beskos et al. (2004b), we obtained 50000 samples from the posterior distribution of the parameters (R,Λ,σ) for the LOG-GROWTH data set. The computing time was 20 min, and a summary of the results is given in Fig. 4. The posterior means were estimated as (0.1075, 1017.4, 0.1007) and the posterior precision matrix as


from which the posterior standard deviations read as (0.015, 31.13, 0.002).

10. Conclusions

In this paper we have introduced a variety of methods which can be used for likelihood-based inference for discretely observed diffusions. The methods rely on recent advances in the exact simulation of diffusions. The computational efficiency of the methods was illustrated in a collection of examples. However, an exhaustive simulation study which tests the relative performance of our methods and existing approaches, under various model specifications and parameter settings, is not given here. Such a detailed empirical investigation is currently taking place. However, some qualitative remarks are made below.

In general, the computing time that is required for our methods critically depends on the rate of the Poisson process. Ceteris paribus, this rate is a function of the time increment Δt between the observations. The performance (measured either in computing time or MC error) of our methods will be very strong for small Δt (high frequency data). However, the performance of the methods as presented here deteriorates for sparser data sets (Δt large). Specifically, the problem stems from the following two characteristics of the conditional EA:

  • (a) the Poisson rate is linear in Δt (in EA2 it can increase even faster);
  • (b) the acceptance rate typically decreases exponentially to 0 in Δt.

The computing time that is required for the AM and SAM increases because of (a) and for the MCEM and MCMC algorithms because of (a) and (b). In the AM and SAM, the variance will also increase with Δt, since MC averages would be heavily dominated by the terms (of very small probability) corresponding to Poisson configurations with very few points. Similarly, expression (20) suggests that the variance of the Poisson estimator increases exponentially with Δt. In MCMC sampling, the computing cost can be transformed to be linear in Δt. This is achieved by augmenting any two observed data with additional points in between. This is implemented by applying the conditional EA on intervals of length a fraction of Δt. Of course, the additional augmentation will affect the MCMC mixing.
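The effect of splitting an interval can be illustrated with a toy cost model. The model is an assumption made for illustration only: the acceptance probability on a subinterval of length h is taken to be exp(−decay·h) and each proposal is charged one unit plus `poisson_rate`·h Poisson points. Neither the functional form nor the parameter names come from the analysis above.

```python
import math

def expected_cost(delta_t, n_pieces, poisson_rate, decay):
    """Expected work for one observation interval split into n_pieces parts."""
    h = delta_t / n_pieces
    proposals_per_piece = math.exp(decay * h)   # 1 / assumed acceptance probability
    work_per_proposal = 1.0 + poisson_rate * h  # fixed cost plus Poisson points
    return n_pieces * proposals_per_piece * work_per_proposal
```

Under this model, keeping h fixed (so that the number of pieces grows in proportion to Δt) makes the cost linear in Δt, whereas the unsplit cost grows exponentially. The model says nothing about the effect of the extra imputed points on MCMC mixing.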

Some extensions of our methodology are possible. One important direction is the extension of our methods for diffusions outside the D2-class. As we have shown, the Poisson estimator can be readily used for likelihood inference for more general diffusion processes. However, current progress on the EA itself, which attempts to remove the boundedness conditions on the drift (Beskos et al., 2005a), is expected to broaden the applicability of the estimation methods. It is worth noting that MCEM and the AM remain unaltered irrespective of which version of the EA is being used; thus they can readily accommodate extensions in the EA. The other methods use explicitly the structure of the EA and will have to be modified appropriately.

We have concentrated here on the case where the diffusion is observed without error at a finite set of times. However, data often occur in different forms. For instance, data might be subject to observation error. All the methods for evaluating MLEs are difficult to extend to this case. However, the MCMC approach can be extended in a straightforward manner. An alternative interesting data form is when we observe a one-dimensional component of a higher dimensional diffusion, as for instance in continuous time filtering models. We are currently working on a collection of such filtering problems, and we have found that our methods can be extended to this case, although there are significant additional implementation challenges in this approach.

Our methodology extends some way to time inhomogeneous and multivariate diffusions. In carrying out these extensions, there are two steps in the arguments that are used in this paper that need to be generalized. Firstly, it is necessary to generalize the transformation (3) which eliminates the diffusion coefficient. This is straightforward under mild smoothness conditions on σ in the time inhomogeneous extension. For multivariate extensions, however, the generalized version of transformation (3) involves the solution of an appropriate vector differential equation which is often intractable or insolvable (see for example Aït-Sahalia (2004)). This imposes restrictions on the class of multivariate diffusions to which the methods that are presented here can currently be applied. Secondly, we need to eliminate the stochastic integral from Girsanov's formula (32) to derive lemma 1. Again this is fairly routine in the time inhomogeneous case, whereas an extra condition is needed in the multivariate case, requiring that the multivariate drift be the gradient of a suitable potential function. This is a well-known condition in stochastic analysis and for ergodic, unit diffusion coefficient diffusions corresponds to reversibility.

It is natural to ask whether the ideas of this paper extend to SDEs that are driven by Lévy processes. However, the Cameron–Martin–Girsanov formula, providing a closed form likelihood function, is critical to all our methodology. Unfortunately, there is not an analogous expression for the Radon–Nikodym derivative of an infinite activity Lévy-driven SDE with respect to a measure that is tractable and easy to simulate from. Nevertheless, it is straightforward to extend our methods to incorporate SDEs with jumps according to a finite activity Lévy process (as for example considered in Roberts et al. (2004)).

We hope that the collection of techniques that are described in this paper will have further applications. We are currently working on MC estimation of derivative hedge ratios in finance. The methodology builds on Rao–Blackwellization techniques such as those devised for the bridge method.

Although we recognize that the mathematical details of this work are complex for those without a working knowledge of diffusion theory, we firmly believe that our methods have the potential to influence applied statistical work. This is because the algorithms that we use are relatively simple and easy to code, and the methods are not computationally demanding and are capable of handling long time series. In addition, motivated by the desire to make our work as accessible as possible, we have started to develop generic software for the implementation of some of our methods, beginning with the EA.


The authors thank Mike Pitt and Neil Shephard for helpful suggestions. We are grateful to all the referees for their comments which contributed to the improvement of the paper. The first author acknowledges financial support from the Greek State Scholarships Foundation and the second author is funded by Engineering and Physical Sciences Research Council grant GR/S61577/01.

Discussion on the paper by Beskos, Papaspiliopoulos, Roberts and Fearnhead

Eric Moulines (Ecole Nationale Supérieure des Télécommunications, Paris)

The problem of estimation and inference for discretely observed diffusion has been studied extensively in recent years. The classical approach to the problem is based on a first-order approximation. In many applications, this approximation is not sufficiently accurate for sampling instants at which data are available. This paper belongs to the class of Monte Carlo maximum likelihood methods which was introduced by Pedersen (1995) and later considerably refined by many researchers.

Suppose that we wish to approximate the transition density of the diffusion over an interval. The first-order approximation is accurate if this interval is sufficiently short. If this is not so, we may partition the interval by introducing fictitious observations in such a way that the first-order approximation is sufficiently accurate on each subinterval. These fictitious points are then marginalized. This is most often carried out by using importance sampling techniques, much of the effort being spent on the design of an appropriate proposal distribution (see for instance Elerian et al. (2001) and the references therein). Although this approach can come arbitrarily close to the true transition density, its implementation remains computationally cumbersome.

The present paper takes a different direction. There are two main contributions in it. First, the authors propose an exact algorithm to sample trajectories of non-linear diffusion. Using the exact algorithm they propose several novel Monte Carlo estimators of the likelihood functions. These methods are extremely ingenious and have considerable appeal; however, they are still limited in scope, because the exact algorithm is available only under restrictive conditions on the drift terms of an appropriately transformed version of the diffusion. Second, the authors propose a novel estimator of the likelihood function which replaces the discretization and importance sampling estimates by a clever Poisson approximation. This method can be applied to a much wider class of diffusions, alleviating some of the problems that are linked to the discretization issue (see the discussion below). These are both potentially very useful pieces of practical statistical methodology and the authors are to be congratulated for these findings. I would like to raise two points.

  • (a) My first question concerns the potential benefits of the Poisson approximation compared with the classical importance sampling estimator. The transition density of the diffusion
    may be expressed as
    where A(u;θ)=∫^u α(s;θ) ds and W(t,x,y) denotes the probability of a Brownian bridge pinned at x and y. This estimator immediately suggests a ‘plain’ Monte Carlo estimator of the transition density,
    where ω(k)∼W(t,x,y). Of course this estimator cannot be computed. The classical approach is to approximate the inner integral by Riemann sums.
  • Despite their appearance, the Poisson estimator shares similarities with the Riemann approximation, with the inverse of the intensity λ−1 (the mean distance between two successive samples) playing a role that is similar to the discretization interval δ. It would be of interest to understand why, for the same average number of samples (δ=1/λ), the Poisson approximation should dominate the ‘natural’ Riemann estimator (at least in the asymptotic regime when δ→0 and λ→∞).
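
The comparison raised in point (a) can be explored numerically. The sketch below is only an illustration under stated assumptions: it takes the quantity of interest to be an expectation of the form E[exp{−∫φ(ω_s) ds}] under a Brownian bridge on [0,t], uses a hypothetical integrand φ supplied by the caller, and writes the Poisson estimator as exp{(λ−c)t} λ^(−κ) ∏(c−φ(ω_ψj)) with κ∼Poisson(λt) and uniform times ψj, which is one standard form of such an estimator. The constant c and all function names are assumptions made for illustration.

```python
import math
import random


def bridge_at(times, t, x, y, rng):
    """Sample a Brownian bridge from x (time 0) to y (time t) at sorted times in (0, t)."""
    values, s, a = [], 0.0, x
    for u in times:
        mean = a + (u - s) / (t - s) * (y - a)
        var = (u - s) * (t - u) / (t - s)
        a = rng.gauss(mean, math.sqrt(var))
        values.append(a)
        s = u
    return values


def poisson_count(mean, rng):
    """Poisson draw: number of unit-rate exponential arrivals before `mean`."""
    k, total = 0, rng.expovariate(1.0)
    while total < mean:
        k += 1
        total += rng.expovariate(1.0)
    return k


def riemann_estimate(phi, t, x, y, n, rng):
    """One draw of exp(-Riemann sum of phi) on a midpoint grid with n cells (biased)."""
    delta = t / n
    times = [(j + 0.5) * delta for j in range(n)]
    omega = bridge_at(times, t, x, y, rng)
    return math.exp(-delta * sum(phi(w) for w in omega))


def poisson_estimate(phi, t, x, y, lam, c, rng):
    """One draw of the Poisson estimator (unbiased for E[exp(-integral of phi)])."""
    k = poisson_count(lam * t, rng)
    times = sorted(rng.uniform(0.0, t) for _ in range(k))
    omega = bridge_at(times, t, x, y, rng)
    prod = 1.0
    for w in omega:
        prod *= (c - phi(w)) / lam
    return math.exp((lam - c) * t) * prod
```

Averaging many draws of each function, with the grid size n matched to the expected number of Poisson points λt, gives a rough empirical comparison of the two approaches at equal average cost. It does not answer the asymptotic question posed above.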

  • (b) My second question deals with the Monte Carlo EM algorithm. For simplicity, I focus here on the case where the diffusion term does not depend on the parameter. When using the Monte Carlo EM algorithm, we must estimate conditional expectations
    where the expectation is taken under the distribution of the process conditioned to start at X0=x and to finish at Xt=y. We may estimate these quantities by using a plain Monte Carlo technique, but this supposes that we may obtain exact samples from this conditional distribution, which is not always possible. Moreover, this is presumably not a very clever choice from the computational point of view. Assume that we wish to estimate ∫Φ(x)π(dx). If the samples from π are generated from an instrumental distribution q by using the accept–reject algorithm, then after K draws from the instrumental distribution the Monte Carlo estimate is given by
    where M is some constant satisfying M q(x)>π(x) for all x. It is in general more sensible to use an importance sampling estimator,
    As suggested by Gelman (1995) and Quintana et al. (1999), nothing prevents us from approximating the intermediate quantities in the Monte Carlo EM algorithm by using the importance sampling estimator. The situation is more complicated here, because the importance weights and the function Φ cannot be computed exactly. Indeed, if the Brownian bridge is chosen to be the importance distribution, the importance sampling estimate of this conditional expectation takes the form


    which involves integrals. Nevertheless, this expression can be approximated numerically by replacing integrals by Riemann sums or by the Poisson approximation. It would be of interest to compare these two approaches.
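
The contrast in point (b) between accept–reject averaging and importance sampling can be illustrated on a toy problem. The sketch below assumes a standard normal target π, a standard Cauchy instrumental distribution q and the bound M = 1.53 (slightly above sup π/q ≈ 1.52); these choices, and the function names, are illustrative assumptions and are unrelated to the diffusion bridge setting. Both estimators reuse the same K draws from q.

```python
import math
import random

M = 1.53  # assumed bound with M * q(x) >= pi(x) for all x


def target_pdf(x):
    """pi: standard normal density (illustrative target)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def proposal_pdf(x):
    """q: standard Cauchy density (illustrative instrumental distribution)."""
    return 1.0 / (math.pi * (1.0 + x * x))


def proposal_draw(rng):
    """Draw from the standard Cauchy distribution by inversion."""
    return math.tan(math.pi * (rng.random() - 0.5))


def estimates(phi, K, rng):
    """Return (accept-reject estimate, importance sampling estimate) of the integral of phi against pi."""
    xs = [proposal_draw(rng) for _ in range(K)]
    ratios = [target_pdf(x) / proposal_pdf(x) for x in xs]
    # Accept-reject: keep x with probability pi(x) / (M q(x)), average phi over accepted draws.
    accepted = [x for x, r in zip(xs, ratios) if rng.random() * M <= r]
    ar = sum(phi(x) for x in accepted) / len(accepted)
    # Importance sampling: use all K draws, each weighted by pi(x) / q(x).
    imp = sum(phi(x) * r for x, r in zip(xs, ratios)) / K
    return ar, imp
```

The accept–reject estimate discards the rejected draws, whereas the importance sampling estimate uses every draw. In the diffusion setting the weights themselves involve path integrals, which is where the Riemann or Poisson approximations would enter.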

As with any good paper read to the Society, this one contains much interesting material, both on the theoretical foundations of the methodology and on numerous aspects of practical applications. There is still much work to be done, but plenty has already been done by Alexandros Beskos, Omiros Papaspiliopoulos, Gareth Roberts and Paul Fearnhead. It gives me great pleasure to propose a vote of thanks.

Dan Crisan (Imperial College London)

The paper extends classical results related to exact simulations of random variables to an infinite dimensional set-up. Here random processes, in particular, solutions of one-dimensional stochastic differential equations, are exactly sampled and various maximum likelihood and Bayesian inference methods are developed on the basis of the two versions EA1 and EA2 of the exact sampling algorithms.

It is very easy to obtain a sample path from a Brownian motion or from its conditioned version, the Brownian bridge. This observation is the basis of the exact simulation of diffusions: if a diffusion X has an absolutely continuous distribution with respect to that of a Brownian motion or a Brownian bridge, then the retrospective sampling method that was developed by Papaspiliopoulos and Roberts can generate an exact simulation of X by using the perfect Brownian sampling.

Unfortunately the class of one-dimensional stochastic differential equations with solutions that are absolutely continuous with respect to a Brownian motion is very small. In effect we can only have equations of the form


(I shall not make explicit the coefficients’ dependence on the unknown parameter θ as it is not relevant to the arguments that are presented.) X is simply a Brownian motion plus a drift term. This stems from the fact that the Brownian path t↦Bt has quadratic variation ⟨B⟩_t=t and X must have the same property if it is to have an absolutely continuous distribution with respect to that of B. In what follows I shall call these processes ℬ-diffusions.

The methods that are presented in the paper are applicable to a larger class of processes. Under additional assumptions, all diffusions that can be reduced to a ℬ-diffusion via a suitable change of space co-ordinates can be simulated exactly. These are the diffusions V for which there is a diffeomorphism η:ℝ→ℝ such that η(V) is a ℬ-diffusion. From a pathwise perspective, the change of co-ordinates amounts to stretching or squashing the paths so that they obtain the right quadratic variation. This can be done, more often than not, at the expense of ending up with unbounded or, even worse, exploding drifts α. As a result, the simulation procedures and the inference methodology become either more difficult or not possible at all.
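
The change of co-ordinates, and the way in which it can produce an exploding drift, can be sketched as follows. The example assumes a logistic-growth type diffusion dV = rV(1−V/K) dt + βV dB with the transformation η(v)=log(v)/β; this parameterization, the sign convention for η and the parameter values are illustrative assumptions. The transformed drift is obtained from Itô's formula as α(x)=b(v)/σ(v)−σ′(v)/2 evaluated at v=η⁻¹(x).

```python
import math


def transformed_drift(b, sigma, dsigma, eta_inv):
    """Drift of X = eta(V) when eta'(v) = 1 / sigma(v), so that X has unit diffusion coefficient."""
    def alpha(x):
        v = eta_inv(x)
        return b(v) / sigma(v) - 0.5 * dsigma(v)
    return alpha


# Hypothetical logistic-growth coefficients (illustrative values only).
r, K, beta = 0.5, 1000.0, 0.1
b = lambda v: r * v * (1.0 - v / K)
sigma = lambda v: beta * v
dsigma = lambda v: beta
eta_inv = lambda x: math.exp(beta * x)  # inverse of eta(v) = log(v) / beta

alpha = transformed_drift(b, sigma, dsigma, eta_inv)

if __name__ == "__main__":
    # The transformed drift decreases exponentially fast as x grows.
    for x in (0.0, 50.0, 100.0, 200.0):
        print(x, alpha(x))
```

In this example α(x)=r(1−exp(βx)/K)/β−β/2, which is unbounded below as x increases, in line with the remark that the transformation may be bought at the price of an unbounded drift.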

It is far more natural to start with a diffusion X that has an explicit distribution and the same quadratic variation structure as V and then to check whether it has an absolutely continuous distribution with respect to that of V. There is a generic method that finds such diffusions X and it stems from a classical result by Doss (1977) and Sussmann (1978). The result states that, if b is Lipschitz continuous and σ is twice differentiable with bounded first and second derivatives, then the one-dimensional stochastic differential equation


has a unique strong solution which can be written in the form


where u:ℝ2→ℝ is the solution of the ordinary differential equation


and the function t →Yt(ω) solves an ordinary differential equation for every ω ∈ Ω. However, if we choose a deterministic or a constant Y, the resulting process


will satisfy the equation




In particular, if Y is constant, then


where the last integral is a Stratonovitch integral. Applying this strategy to the logistic growth model (the second example in the paper), the corresponding process X can be chosen to be the log-normal diffusion


with the explicit solution


The density of the (unconditional) distribution of V with respect to the distribution of X (on the path space) will then be proportional to


There are three apparent advantages of this method over the change of spatial co-ordinates approach:

  • (a) it applies to a wide class of sets of coefficients (b,σ);
  • (b) it usually leads to simpler expressions for the Radon–Nikodym derivative, hence perhaps simplifying the ensuing estimation procedures;
  • (c) it extends naturally to multivariate diffusions and time inhomogeneous diffusions. We need to consider the multidimensional and, respectively, the time inhomogeneous version of expression (33).

For the Cox–Ingersoll–Ross model (the third example in the paper) though, the resulting process X is not absolutely continuous with respect to V. However, in the same vein, the Cox–Ingersoll–Ross diffusion is absolutely continuous with respect to the squared Bessel process


I have two further comments on the paper. It would be desirable to see a theoretical analysis of the computational effort that is involved for the various inference methods presented as a function of the number of time instances when the data are acquired. Also no comparison or analogy is made between the methods presented and the sequential Monte Carlo methods. However, the authors say that a comparison study between their methods and other existing methods is in progress.

In conclusion, the authors have already done a large amount of work on the subject, though I would venture to say that there is still much to be done on this line of research and I look forward to seeing the follow-ups. It gives me great pleasure to congratulate the authors on their paper and to second the vote of thanks.

The vote of thanks was passed by acclamation.

Barry Rowlingson (Lancaster University)

We are developing a package for the R programming environment to follow the theoretical developments that are described in the paper as well as to implement basic functionality for diffusion processes.

At present we can simulate a range of diffusion processes using the Euler method, and also we can use the EA1 algorithm for those that can be simulated that way. Methods for plotting realizations of diffusion processes have been implemented.

The code contains contributions from the paper's authors as well as from Duncan Murdoch. Eventually we hope to release the package into the CRAN archive of R packages, but for now it is available on request from the authors of the paper. We welcome any offers of help in the development of this package.

N. Chopin (University of Bristol)

This is an impressive piece of work; the methodology is very elegant, and the range of applications is promising. I shall raise three points.

  • (a) Given the first expression in lemma 1, the boundedness assumption on α²+α′ (condition (b) in Section 2) is natural from a standard accept–reject perspective. One may naïvely remark, however, that, during a small time interval [0,t], a Brownian bridge is confined to a small interval with very high probability. Thus I wonder whether, in some specific settings (for instance when computing the likelihood between two close observations x and y), an arbitrary truncation of the function α, to force condition (b), would not produce an acceptable bias. Would this be a reasonable way to extend slightly the class of diffusions that can be simulated by the techniques that are presented here?
  • (b) I believe that the methodology developed in this paper can also be used in sequential problems, and within particle filtering algorithms. I shall not give exact details, for lack of space, but I shall illustrate this point with a toy example: the sequential filtering of a diffusion with known parameters, but observed at discrete times with noise. The simulation of the diffusion trajectory would then replace the simulation of the hidden Markov chain in the standard case.
  • (c) The authors mention some shortcomings of Euler discretization schemes, and corresponding Markov chain Monte Carlo samplers: in particular, the trade-off between discretization bias (coarse grids) and computational inefficiency (fine grids) is difficult to handle; ‘computational inefficiency’ refers here both to the cost of each iteration and to the number of iterations that are required to reach convergence.

I wonder whether this trade-off could not be dealt with in an automatic fashion through some adaptive algorithm. Consider some sequence p_n of posterior distributions (of parameters and imputed values) corresponding to a decreasing sequence of discretization steps Δ_n, and apply a sequential Monte Carlo algorithm to this artificial sequence. It then seems possible to design some stopping rule that detects when the discretization bias has become small relative to the Monte Carlo error. This alternative approach may be useful for cases where the Markov chain Monte Carlo algorithm that is developed in this paper cannot be applied.

S. B. Connor and W. S. Kendall (University of Warwick, Coventry)

We enjoyed and are greatly impressed by the elegance of this exact simulation method. This is one of an increasing number of simulation options which involve striking applications of probabilistic ideas. We take the opportunity here to sketch some recent work on another such option, that of coupling from the past (CFTP). It is natural to wonder whether these ideas might connect together in productive and useful ways.

CFTP was first introduced by Propp and Wilson (1996) as a method for sampling from the exact stationary distribution of an ergodic Markov chain. However, as shown by Foss and Tweedie (1998), this algorithm is possible (in an impractical sense) if and only if the Markov chain is uniformly ergodic. More recently, Kendall (2004) showed that all geometrically ergodic chains have (again impractical) dominated CFTP algorithms (as introduced in Kendall (1998)). This poses the question: what about when the chain X is polynomially ergodic?

We are currently investigating conditions under which a dominated CFTP algorithm exists for a polynomially ergodic chain X. Specifically, we consider chains for which there are constants α ∈ (0,1) and b,c ∈ (0,∞), a small set C and a scale function V:𝒳→[1,∞) on the state space 𝒳, bounded on C, such that


This drift condition implies that the chain is polynomially ergodic, as shown in Jarner and Roberts (2002).

In Connor and Kendall (2006) we introduce the concept of a tame chain. A chain X satisfying the drift condition (34) is tame if we can subsample X (in a specific, adaptive way) to produce a geometrically ergodic chain X′. In this case we prove that we can produce a suitable dominating process for X by slowing down the dominating process for X′ that is shown to exist in Kendall (2004).

Thus the question becomes: when is a polynomially ergodic chain tame? We have some sufficient conditions for tameness, but also examples to show that none of these are necessary. It is unclear at present whether all chains are tame, and this is an issue that we shall be investigating as part of our future research in this area.

John T. Kent (University of Leeds)

Let me focus on the authors’ first model, which can be recast in the form


As a diffusion on the circle, it can be termed a ‘von Mises diffusion’ because in equilibrium it follows a von Mises distribution, with probability density function proportional to exp{κ cos(x−μ)}, 0≤x<2π, κ=2λ/σ² (e.g. Kent (1978)). It has some nice applications.

  • (a) Position model: imagine a liquid in a magnetic field containing tiny iron filings whose orientations are subject to the same sorts of molecular bombardments that motivate the standard Brownian motion model.
  • (b) Velocity model: consider a moving bacterium with a preferential direction of motion. Suppose that the magnitude of the velocity of the bacterium is constant (or at least the process for magnitude is independent of orientation) and that the orientation of the velocity follows the von Mises diffusion (Hill et al., 1997). The data consist of locations of the bacterium at regularly spaced times.

Several questions come to mind for the von Mises process and more generally.

  • (i) Can the methods of this paper be used for applications such as the bacterium example, where the velocity follows a diffusion, but where the data are based on the integrated velocity?
  • (ii) Suppose that we wish to estimate μ for data from the von Mises process. The simplest statistical approach is to compute the maximum likelihood estimate (MLE) by treating the data as independent and identically distributed from a von Mises distribution. We might expect that this naïve MLE will be close to the proper MLE, but that we shall underestimate the true variance of the estimator. This latter problem might be overcome by allowing for the autocorrelation in the data. How does this simple statistical approach compare with the proper analysis that is proposed in this paper?
  • (iii) The efficiency of the authors’ simulation technique presumably decreases as either of two quantities increases: the time step δt/σ and the concentration parameter κ. Is this correct?
  • (iv) It is an elementary observation that the distribution of a Brownian bridge does not depend on the presence of any (constant) drift in the underlying Brownian motion. Does this property help to explain the unexpectedly high efficiency of the authors’ simulation techniques in general?
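
The ‘simple statistical approach’ in question (ii) can be sketched as follows. For independent and identically distributed von Mises data the maximum likelihood estimate of μ is the direction of the resultant vector; the sketch computes it while ignoring any autocorrelation, so it addresses only the point estimate and not its variance. The function names are illustrative.

```python
import math


def naive_mean_direction(angles):
    """Naive MLE of mu, treating the angles (in radians) as i.i.d. von Mises draws."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % (2.0 * math.pi)


def mean_resultant_length(angles):
    """Length of the mean resultant vector; values near 1 indicate high concentration."""
    n = len(angles)
    s = sum(math.sin(a) for a in angles) / n
    c = sum(math.cos(a) for a in angles) / n
    return math.hypot(s, c)
```

Because the estimate is built from sines and cosines, observations on either side of the origin of the circle (for instance 6.2 and 0.1 radians) are averaged correctly rather than being pulled towards π.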

Bruno Casella (Università Bocconi, Milan, and Lancaster University)

I shall report on on-going research joint with Gareth Roberts that is closely connected to the paper. In the paper the exact algorithm is the key tool for the construction of an unbiased Monte Carlo estimator of the transition densities to be used for inference on the parameters of the diffusion. The exact algorithm framework can be successfully applied in many other Monte Carlo problems involving the simulation of functionals ξ(·) of diffusion processes. A natural area where these problems arise is option pricing.

Our main idea is the following. Given a process X={Xt: 0≤t≤T} ∈ 𝒟1 with starting-point x and unknown probability measure ℚ, we simulate an exact skeleton


by the (unconditional) EA1 algorithm. Given S1, ℚ is the product of independent Brownian bridges. Using this characterization, we can simulate ξ conditionally on S1 exactly (i.e. without discretization error). We named this methodology the exact Monte Carlo method. We applied the method to the problem of the estimation of


where τ is the first exit time of ω from a set H:=(a,b) such that x ∈ (a,b), and g(·) and h(·) are two measurable functions. This problem is relevant in finance, e.g. in barrier options pricing or credit risk modelling. First, we call algorithm EA1 and output S1. Given S1, all that we need for the simulation of ξ is to sample the crossing event of a product of independent Brownian bridges. In the single-barrier case (a=−∞ or b=∞) we apply the well-known Bachelier–Lévy theory which provides an explicit formula for the one-sided crossing probability of a Brownian bridge. Notably, this also allows us to construct a Rao–Blackwellized Monte Carlo estimator of ν, which is similar in spirit to the one that was used for the bridge method, by sampling directly 𝔼(ξ|S1) (Fig. 5). In the double-barrier case we do not have an explicit formula for the (two-sided) crossing probability of a Brownian bridge: we circumvent the problem by developing a suitable procedure to simulate from it. The procedure is based on Doob's representation of the crossing probability of the Brownian motion as a telescopic sum.
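
For the single-barrier case, the explicit one-sided crossing probability of a Brownian bridge can be sketched in a few lines. The formula used is the classical one: a bridge from x to y over a time span Δ, with both end points below b, reaches level b with probability exp{−2(b−x)(b−y)/Δ}. The skeleton passed to the second function is a hypothetical input and is not produced by EA1 here; the function names are illustrative.

```python
import math


def upper_crossing_prob(x, y, dt, b):
    """Probability that a Brownian bridge from x to y over time dt reaches level b."""
    if x >= b or y >= b:
        return 1.0
    return math.exp(-2.0 * (b - x) * (b - y) / dt)


def survival_prob(times, values, b):
    """Probability that no bridge segment between skeleton points reaches b.

    Conditionally on the skeleton, the segments are independent Brownian bridges,
    so the segment-wise non-crossing probabilities multiply.
    """
    prob = 1.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        prob *= 1.0 - upper_crossing_prob(values[i - 1], values[i], dt, b)
    return prob
```

Weighting a functional of the end point by `survival_prob`, rather than simulating the crossing indicator, is one way to read the Rao–Blackwellization mentioned above, although the exact construction used in that work is not given in the text.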

Figure 5.

 Monte Carlo estimate of ν=𝔼(XT 𝟙{τ>T}) where X is the SINE model; single-barrier case (x=1.5; a=−∞; b=4.5; T=5): (a) Monte Carlo convergence of the exact Monte Carlo estimator (not Rao–Blackwellized and Rao–Blackwellized); (b) convergence of the estimators based on a discrete Euler scheme (○) and a continuous Euler scheme (+) to the exact Monte Carlo estimate (⋯⋯); double-barrier case (x=2; a=1; b=4.5; T=5): (c) Monte Carlo convergence of the exact Monte Carlo estimator; (d) convergence of the estimators based on a discrete Euler scheme (○) and a continuous Euler scheme (+) to the exact Monte Carlo estimate (⋯⋯)

Simulation studies show that our estimator is

  • (a) unbiased—Euler-based estimators converge to our estimator as the length of the discretization interval decreases—and
  • (b) efficient—for small discretization intervals, our estimator is more efficient than Euler's.

Extensions of this methodology to processes X ∈ 𝒟2 and jump diffusion processes are possible. We are currently working on the application of the exact Monte Carlo method to Monte Carlo estimation of derivative hedge ratios in finance.

The following contributions were received in writing after the meeting.

Frank Ball, Ian Dryden and Mousa Golalizadeh (University of Nottingham)

We congratulate the authors for their inspirational paper. Our particular interest concerns extensions of the retrospective exact simulation work to compact spaces. Ball et al. (2004) have considered Ornstein–Uhlenbeck (OU) processes in Kendall's (1984) shape space for k≥3 points in two dimensions. In particular, a family of OU processes has been defined with Itô stochastic differential equations for the Riemannian shape distance ρt to a reference configuration at time t>0 that is given by


where g(ρ)=dg(ρ)/dρ is the infinitesimal OU drift term, Bt is standard Brownian motion on the real line and the process starts at ρ0 at time t=0. In the case k=3 the planar shape space for triangles of points is equivalent to a sphere of radius 1/2 in three dimensions, and ρt is the colatitude on this sphere.

Since the diffusion process for ρt is restricted to the closed interval [0,π/2], simulation of the ρt-process is challenging. For example the Euler scheme is problematic near the end points 0 or π/2 and needs adaptation (see Ball et al. (2004)).

We are currently exploring the retrospective exact simulation methods of the present paper for simulating ρt, and the methods need to cope with the facts that the domain is a closed interval, α²+α′ is not bounded below at π/2 and not bounded above at 0.

We have used an adaptation of algorithm EA2 for simulating ρt, where the decomposition is carried out with respect to the minimum, as follows.

  • (a) The function α²+α′ is replaced with a differentiable bounded function which is identical to α²+α′ except in the intervals [0,ɛ1) and (π/2−ɛ2,π/2], where ɛ1,ɛ2>0 are small.
  • (b) The time length T of a simulation is chosen such that the probability that the maximum is outside the interval [0,π/2] is negligible. This is practical unless ρ0 is very close to π/2, in which case the decomposition should be taken with respect to the maximum and the probability that the minimum is outside the interval should be negligible.
  • (c) Additional criteria for rejecting a proposed path are if
    • (i) the simulated end point from density h in Section 2.3 is outside the interval [0,π/2] or
    • (ii) the simulated minimum m of the path is below zero.

Consider an example with g(ρ)=κ  sin (2ρ). This process was introduced by Kent (1975) for the sphere (k=3). We use the adapted retrospective exact simulation method for a single long run consisting of 100000 individual time steps of length T=0.02, where the end point from one time step becomes the start-point for the next step. In Fig. 6 we plot simulated values at unit intervals from the OU process, and a histogram of the simulated values evaluated at time intervals of length 0.02 with the density of the theoretical equilibrium distribution overlaid. The histogram and density are extremely similar, indicating that approximations to retrospective exact sampling can also be useful in more general situations.

Figure 6.

 (a) Simulated points at unit time steps from the diffusion process for ρt with g(ρ)=κ sin(2ρ), k=4, κ=10 and ρ0=0.2 and (b) density scaled histogram of simulated points from the OU process at intervals of 0.02, with the density of the theoretical equilibrium distribution f(ρ) ∝ sin^(2k−5)(ρ) cos(ρ) exp{κ cos(2ρ)} (——)

N. H. Bingham (University of Sheffield)

I congratulate the authors on this interesting paper. I confine myself to a few comments on theoretical and financial aspects.

Curse of dimensionality

The multidimensional extensions that are mentioned in the final section are intriguing from the point of view of mathematical finance. There, by Markowitzian diversification, one holds a large number of assets; the danger here is the curse of dimensionality. Methods are available to counter this which reduce the effective dimensionality from the number of assets to the number of industrial sectors; see for example Bingham et al. (2003).

Ensemble models

An alternative approach to estimating diffusion coefficients by discrete methods arises where we have (or can obtain) an ensemble of paths, or diffusing particles say, rather than one. We can then extract information from the counts of the number of diffusing particles in some window of observation, as a function of time. The technique is called number fluctuation spectroscopy. It was used by Smoluchowski, following Einstein's work, to estimate Avogadro's number, and by Rothschild to study mobility of spermatozoa. For details and references, see for example Bingham and Dunham (1997) and Bingham and Pitts (1998).

Drift and diffusion

As observed in Section 2, granted a sample path over a time interval of positive length, we can determine the diffusion coefficient with certainty (e.g. in the Brownian case from the quadratic variation). But this involves (uncountably) infinite sample sizes, and we can obtain and handle only finite samples in reality.

Nevertheless, the suggestion is that in principle we can discriminate between different diffusion coefficients with certainty; if both are the same, we can form likelihood ratios and, for example, test hypotheses comparing the drifts.
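
The remark that the diffusion coefficient can be read off from the quadratic variation, whatever the drift, can be illustrated with a finite sample. The sketch assumes the simplest case of a Brownian motion with constant drift μ and constant diffusion coefficient σ, observed at n equally spaced times; the function names and parameter values are illustrative.

```python
import math
import random


def drifted_brownian_path(mu, sigma, horizon, n, rng):
    """Simulate X_t = mu t + sigma B_t on an equally spaced grid of n steps."""
    dt = horizon / n
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return path


def realised_quadratic_variation(path):
    """Sum of squared increments along the observed path."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    rng = random.Random(0)
    horizon = 1.0
    # Two different drifts, same sigma: both estimates should be close to sigma^2.
    for mu in (0.0, 2.0):
        path = drifted_brownian_path(mu, 0.7, horizon, 100000, rng)
        print(mu, realised_quadratic_variation(path) / horizon)
```

With a finite number of observations the estimate of σ² carries sampling error of relative order (2/n)^(1/2), which reflects the caveat that certainty is available only in the continuous time limit.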

Diffusions, or Itô processes, being locally Gaussian, relevant here is the split between mutual absolute continuity and singularity of Gaussian processes, for which see for example Ibragimov and Rozanov (1978), chapter III.

Splitting times

The use of random times such as when the minimum is (last) attained—which is far from being a stopping time—to split the path into independent fragments is a powerful technique. Such times are called splitting times; for general theory, see Rogers and Williams (1994), section III.49.

Multidimensional diffusions

The irreducibility condition that was used by Aït-Sahalia (2004) to reduce to the case of unit diffusion coefficient predates the standard work on multidimensional diffusions by Stroock and Varadhan (1979)—which does not contain it. It is good to see practice driving theory here.

Peter Clifford (Oxford University)

I have just a few comments on an enjoyable paper. Firstly, note that the Cox–Ingersoll–Ross diffusion


is the square of an Ornstein–Uhlenbeck process in a specific dimension. The model is discussed in Karlin and Taylor (1981), pages 333–334, where the transition density is given explicitly. The transitions of the diffusion can be readily simulated by using the representation of a non-central χ²-distribution as a Poisson mixture of gamma variates.
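
The Poisson mixture representation can be sketched as follows. Since the displayed equation is not reproduced above, the sketch assumes the common parameterization dX = a(m−X) dt + s√X dB, for which X_t given X_0=x0 is a constant c times a non-central χ² variate with 4am/s² degrees of freedom and non-centrality x0 exp(−at)/c, where c=s²{1−exp(−at)}/(4a). The parameter names a, m and s, and the function names, are assumptions made for illustration.

```python
import math
import random


def poisson_draw(mean, rng):
    """Poisson draw: number of unit-rate exponential arrivals before `mean`."""
    k, total = 0, rng.expovariate(1.0)
    while total < mean:
        k += 1
        total += rng.expovariate(1.0)
    return k


def cir_transition_draw(x0, t, a, m, s, rng):
    """Draw X_t given X_0 = x0 for the assumed CIR parameterization."""
    decay = math.exp(-a * t)
    c = s * s * (1.0 - decay) / (4.0 * a)   # scale of the non-central chi-square
    dof = 4.0 * a * m / (s * s)             # degrees of freedom
    noncentrality = x0 * decay / c
    # Non-central chi-square = central chi-square with dof + 2N degrees of freedom,
    # N ~ Poisson(noncentrality / 2); a chi-square is a gamma with scale 2.
    n = poisson_draw(0.5 * noncentrality, rng)
    return rng.gammavariate(0.5 * dof + n, 2.0 * c)
```

A simple check is that the sample mean of many draws should be close to m+(x0−m) exp(−at), the known conditional mean under this parameterization.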

My second comment is more substantial. I am uneasy that a novel method of solving stochastic and partial differential equations has not been rigorously compared with established methods in numerical analytic terms. At the very least it would have been helpful to see additions to Fig. 1 in Aït-Sahalia (2002), but ideally we should be given comparative analyses of the growth of error and the order of accuracy.

Finally, where can I see a real diffusion? Stock-market transactions and price movements, for example, are essentially discrete. Perhaps it is better to think of diffusion as modelling latent sentiment, which is continuously fluctuating but not directly observable. The observables are then driven by this underlying process. A simple model is that market events (trades etc.) are a Poisson process with intensity that is proportional to the diffusion. For the Cox–Ingersoll–Ross class of diffusions we have recently shown how to calculate likelihoods explicitly for such a doubly stochastic process. The method exploits the identity between discrete processes of this type and the death epochs of the standard immigration, birth and death process (Clifford and Wei, 1993).

Valentine Genon-Catalot (Université Paris 5 René Descartes)

For diffusions having ergodic properties, complete results are available when estimating both drift and diffusion parameters by maximum likelihood (consistency, asymptotic normality as the sampling interval tends to 0 and the total length of the observation interval tends to ∞). Without ergodicity assumptions, exact maximum likelihood estimators of diffusion parameters now have well-known properties (consistency, asymptotic mixed normality, as the sampling interval tends to 0 within a time interval of fixed length). Because, for most models, the exact likelihood is not analytically tractable, these results are of a theoretical nature and researchers have all been faced with the same problem: it was not possible to produce exact simulated data.

Now, let me express my enthusiasm and congratulate the authors on their impressive paper. Retrospective exact simulation of diffusion sample paths is a revolution for the study of diffusion processes and especially for the statistics of diffusion processes. This revolution is only at the first stage of its development.

The authors focus on producing unbiased estimation of the exact likelihood. A crucial step is to obtain unbiased estimators of the transition density pt(x,y,θ) of the model. I am especially interested in three of the methods proposed to achieve this goal: the acceptance and the simultaneous acceptance methods and the Poisson estimator. For these, the authors use a representation formula of the transition density which was already known but had only been used for theoretical purposes. This formula contains the expectation of a functional under a Brownian bridge distribution. I am particularly impressed by the simultaneous acceptance method which produces a very simple unbiased estimator of this expectation for all values of the parameter to be estimated. Also, the Poisson estimator is interesting because it can be applied to a wider framework.

Numerous problems remain to be solved. In particular, the paper makes no connection between the simulations and theoretical results. For instance, the sine model is a null recurrent diffusion whereas logistic growth may be positive recurrent. Asymptotic results on maximum likelihood estimators are thus different. Joint estimation of drift and diffusion parameters is not investigated here. In diffusion models, using simulation based on Euler schemes, we can observe that some parameters are badly estimated and others are not: for instance, when the drift is mean reverting (b(x)=α(β−x)), α is badly estimated (a good estimator requires a very large time interval), and β is always well estimated. The parameters in the diffusion coefficient are usually well estimated (even when the total time interval is small). This area is probably under study by the authors and this paper suggests plenty of new and interesting topics.

Konstantinos Kalogeropoulos (Athens University of Economics and Business)

I add my congratulations to the authors for their wonderful contribution to the challenging problem of inference for diffusions.

My comments focus on two assumptions that hinder some desirable extensions. The first assumption requires the Markovianity of the observed diffusion. Although not stated explicitly, it is the Markov property that allows us to write down the log-likelihood as


Non-Markovian diffusion models are being used extensively. Famous examples include stochastic volatility and continuous time autoregressive moving average models.
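
The factorization that the Markov property permits can be sketched as follows: the log-likelihood of the observations is a sum of log transition densities over consecutive pairs. The sketch uses a Brownian motion with constant drift as a stand-in model because its transition density is explicit; the model and the function names are illustrative assumptions, and any function returning a log transition density could be supplied instead. When only one component of a multivariate diffusion is observed, as in stochastic volatility models, the observed series is not Markov and this sum is no longer the log-likelihood.

```python
import math


def markov_loglik(times, obs, log_transition):
    """Sum of log transition densities over consecutive observation pairs."""
    return sum(
        log_transition(t1 - t0, x0, x1)
        for t0, t1, x0, x1 in zip(times, times[1:], obs, obs[1:])
    )


def make_bm_log_transition(mu, sigma):
    """Log transition density of dX = mu dt + sigma dB (illustrative model)."""
    def log_transition(dt, x, y):
        var = sigma ** 2 * dt
        return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - x - mu * dt) ** 2 / var
    return log_transition


if __name__ == "__main__":
    log_trans = make_bm_log_transition(mu=0.1, sigma=0.5)
    print(markov_loglik([0.0, 0.5, 1.0, 2.0], [0.0, 0.2, 0.1, 0.4], log_trans))
```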

The second assumption stems from the transformation in equation (3). To find such a transformation h(·) for a d-dimensional diffusion we need to solve the following vector differential equation:


where now σ(Vt,θ)σ(Vt,θ)ᵀ denotes the diffusion matrix and Id is the identity matrix of dimension d. Since equation (35) is not always solvable, this limits the extensions to the multivariate case. Aït-Sahalia (2004) termed the diffusions with solvable equation (35) reducible. He also showed that the diffusions that are involved in ordinary stochastic volatility models do not belong to this class.

An alternative likelihood-based approach uses data augmentation schemes; see Roberts and Stramer (2001), Elerian et al. (2001) and Eraker (2001). These schemes overcome the barrier of the Markov property and, although they do not provide exact inference, their discretization error may become arbitrarily small.

The existence of the transformation in equation (3) is a major issue for data augmentation schemes as well. Under a fairly fine discretization the quadratic covariation process determines exactly the parameters in σ(Vt,θ) and the algorithm degenerates. Nevertheless, there is some on-going work that handles some cases of multivariate diffusions. Kalogeropoulos et al. (2006a) propose a reparameterization for reducible diffusions, whereas Kalogeropoulos (2006), Kalogeropoulos et al. (2006b), Chib et al. (2004) and Golightly and Wilkinson (2005) present algorithms for stochastic volatility models.

Mathieu Kessler (Universidad Politécnica de Cartagena)

The authors are to be congratulated for providing an impressive collection of ingenious and innovative ways to implement likelihood-based estimation for discretely observed diffusions. Their paper combines original expressions for the transition density of the process and the use of exact simulation algorithms in the corresponding Monte Carlo procedures.

Like several other recent papers, it focuses on obtaining the best possible numerical approximation to the likelihood, so that the derived estimators inherit the nice properties, e.g. asymptotic efficiency, of the likelihood-based estimators. In particular, the authors emphasize the fact that, through the use of exact simulation algorithms, they can obtain unbiased estimates of the transition density. However, I am missing insight into the consequences of the Monte Carlo approximation for the behaviour of the estimators. Why, for example, is it an improvement with respect to the existing methods to be able to use unbiased estimators of the transition density? Would it be possible to assess, through a theoretical result, the rate at which the number N of Monte Carlo replicates should grow with the number n of observations, for the asymptotic efficiency to be maintained under the assumption of ergodicity? The existing results for approximation-based estimators that have been proposed in the literature can only guarantee the existence of such an increasing sequence Nn. In particular, what would be the asymptotic properties of the proposed estimators, for a fixed Monte Carlo effort, when the number of observations tends to ∞?

Unfortunately, such results do not seem to be easy to obtain. It would be different if the Monte Carlo approximation did not concern the transition density itself but the derivative of its logarithm with respect to the parameter, i.e. the building-block of the score function. Indeed, if an unbiased Monte Carlo approximation of ∂θpt(u,v;θ)/pt(u,v;θ) were available, it would be possible to prove the asymptotic normality of the deduced approximate estimator, even for a fixed Monte Carlo effort. Moreover, it would be easy to characterize precisely the corresponding loss of efficiency in the asymptotic variance; this kind of result was obtained for estimating functions in Kessler and Paredes (2002). My last question is, therefore, could some of the expressions that are proposed for the transition density be used to derive unbiased simulation-based approximations to ∂θpt(u,v;θ)/pt(u,v;θ)?

Claudia Klüppelberg (Munich University of Technology)

I congratulate the authors for an immensely impressive and path breaking paper. They use a very rich methodology embracing deep results from stochastic analysis, Markov chain Monte Carlo simulation and algorithmic methods. Their simultaneous acceptance method is based on the Cameron–Martin–Girsanov formula, which is the basis of dealing with stochastic differential equations driven by Brownian motion. Of course, this is a very large class of models, although they must restrict it further for technical reasons.

As modelling issues quite naturally often run ahead of the necessary development of the corresponding statistical inference, the next question related to their work is already waiting for its solution. High frequency financial data are modelled by continuous time models; stylized facts require, among other features, the estimation of processes with jumps. These features are covered by the by now well-known Barndorff-Nielsen and Shephard (2002) model that was mentioned in the paper, and also by a new continuous time generalized autoregressive conditional heteroscedasticity (COGARCH) model, which is defined as follows. For a Lévy process L and parameters β,η,ϕ>0 the COGARCH(1,1) process is defined as the solution to the equations (for t>0)


(Klüppelberg et al., 2004) where V is left continuous and


is the discrete part of the quadratic variation process of L, i.e. it is a subordinator.
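For orientation, the COGARCH(1,1) equations are commonly written in the literature in the form sketched below. The notation is an assumption chosen to match the surrounding text (G for the observed process, V for the left continuous squared volatility, and the superscript (d) for the discrete part of the quadratic variation); it is not a transcription of the original display.

```latex
% COGARCH(1,1), for t > 0, with Levy process L and parameters beta, eta, phi > 0
dG_t = \sqrt{V_t}\, dL_t, \qquad
dV_{t+} = (\beta - \eta V_t)\, dt + \phi\, V_t\, d[L,L]^{(d)}_t,
\qquad
[L,L]^{(d)}_t = \sum_{0 < s \le t} (\Delta L_s)^2 .
```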

This process has been estimated in Müller (2005) by methods that were used also in Roberts et al. (2004) for the Barndorff-Nielsen and Shephard model, but both papers are restricted to finite activity Lévy processes. It would indeed be interesting to see over the next few years a development of such clever methods as presented in the present paper to models that are driven by infinite activity Lévy processes.

Hans R. Künsch (Eidgenössische Technische Hochschule Zürich)

The authors are to be congratulated on their impressive and highly original paper. I believe that diffusion processes will become increasingly more important in statistical applications because they are the obvious extension of deterministic differential equations which are widespread. It is nice that with the methods of this paper it is not necessary to rely on a time discretization. However, one hopes that the estimation procedures do not depend on the fine structure that is implied by the diffusion model since this is the most doubtful part.

As a more technical comment, I would like to add that the exact simulation method that is presented here allows us to implement the accept–reject version of the particle filter (see Hürzeler and Künsch (1998)) for partially observed diffusions. Assume that the observations Y1:n=(Y1,…,Yn) are conditionally independent given (Xt) and Yi depends on Xti only. The particle filter approximates the conditional density of Xti given Y1:i by a sample of particles. To generate particles at time ti+1 recursively, we select an index K with probabilities proportional to inline image; then given K=k we generate a proposal at time ti+1 according to the density


and finally we decide whether to accept this proposal by the exact algorithm in this paper. If the normalizing constant of this density cannot be computed in closed form, we can replace it by a suitable approximation at the cost of an additional factor in the acceptance probability.

T. J. Lyons (Oxford University)

This paper is focused on developing earlier and interesting work on methods for simulating processes (and particularly bridges) into numerical tools for distinguishing between a parameterized family of Markov models on the basis of discrete time observations of a process. A key point in this is to understand, in a computationally effective way, the relative likelihood of the Markov transition xn → xn+1 given the different parameter values. In effect, we need a parameterized solution to the backward equations.

One could certainly attempt this analysis in a variety of ways, including the use of partial differential equations. It is clear that the benefits and complexity of different approaches will mean that some approaches are better than others.

To make comparisons, it seems essential that a clear quantitative approach is adopted, and one regret that I had about the paper is the lack of analytic details with which to judge the quality of the approach. In particular, it would be helpful to have precise classes of models for which we would apply this approach, and estimates for the errors after a given number of function evaluations etc. in terms of the smoothness of the coefficients in the models studied etc.

It would have been very helpful to have precise results (in terms of the number of function evaluations per unit error) setting out when this method was more effective than, for example, partial differential equation methods. The latter might be expected to perform quite well in low dimensions.

Rogemar Mamon and Keming Yu (Brunel University, Uxbridge)

The applied theme of the paper on likelihood-based inference for discretely observed diffusions is very interesting. However, in financial modelling the specification of a diffusion process V via the stochastic differential equation (SDE) in equation (1) is too simple to capture the stylized features of the economy and markets. Nowadays, the use of regime switching models to understand the behaviour of financial variables (e.g. interest rates) is becoming more common, motivated by empirical evidence and by economic arguments. See, for example, Bansal and Zhou (2002). In particular, both the drift and the diffusion coefficients of the SDE in equation (1) can have dynamics of their own. For instance, they can be allowed to switch between economic regimes, with the shift modulated by a finite state Markov chain. This will certainly enrich standard models such as the Cox–Ingersoll–Ross model that was mentioned in the paper. Guo and Zhang (2004) and Wu and Zheng (2004), among others, have provided examples for this type of model formulation. Assuming that an SDE describing the evolution of a financial phenomenon does not have an analytic solution, especially if it has periodic drift, and therefore that the log-likelihood is not available, we wonder how the exact algorithm can be extended into the setting where the process has a mixture of Gaussian and Markov chain dynamics. It would be of interest to determine how the exact algorithm can be employed efficiently in the estimation of transition matrices, the state of the Markov chain and other model parameters.

Also, the authors pointed out that the ideas and methods of the paper can accommodate SDEs with jumps provided that they are based on a finite activity Lévy process. In general, what class of integrators (i.e. properties and characterization) is possible for the stochastic integral in equation (1) so that either the EA1 or EA2 algorithms can be extended successfully? In other words, what kind of semimartingales can be included in this class?

Finally, it is not clear to us how the proposed transition density estimation technique compares with nonparametric estimation methods such as the kernel estimation of transition density. The latter is much simpler in mathematical details, and computationally easier to implement as well as distribution free. For example, given the observations V={Vt0,Vt1,…,Vtn} of a diffusion process V, we can estimate quickly the transition density (2) via


where Kh(·) is a kernel function with bandwidth h.
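As an illustration of the kind of estimator that is meant here, the following Python sketch implements one common ratio-type kernel estimate of a transition density. The specific form (a kernel-smoothed joint density divided by a kernel-smoothed marginal), the Gaussian kernel, the single bandwidth h and all function names are assumptions made for the example, not a transcription of the displayed formula. The sketch also assumes equally spaced observations, and uses an Ornstein–Uhlenbeck path only because its transition density is known in closed form.

```python
import math
import random


def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h: K_h(u) = N(u; 0, h^2)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def kernel_transition_density(obs, x, y, h):
    """Ratio-type kernel estimate of the one-step transition density p(x, y).

    Assumes equally spaced observations, so that every consecutive pair
    (obs[i-1], obs[i]) is a draw from the same transition law.
    """
    num = 0.0
    den = 0.0
    for prev, nxt in zip(obs[:-1], obs[1:]):
        w = gaussian_kernel(x - prev, h)        # closeness of the start point to x
        num += w * gaussian_kernel(y - nxt, h)  # smoothed mass of the end point at y
        den += w
    return num / den if den > 0.0 else float("nan")


def simulate_ou(n, dt, rho=1.0, sigma=1.0, seed=0):
    """Exact draws from a stationary Ornstein-Uhlenbeck path at spacing dt."""
    rng = random.Random(seed)
    a = math.exp(-rho * dt)
    sd = sigma * math.sqrt((1.0 - a * a) / (2.0 * rho))
    path = [rng.gauss(0.0, sigma / math.sqrt(2.0 * rho))]
    for _ in range(n):
        path.append(a * path[-1] + sd * rng.gauss(0.0, 1.0))
    return path


def ou_transition_density(x, y, dt, rho=1.0, sigma=1.0):
    """Closed-form OU transition density, used only as a reference value."""
    a = math.exp(-rho * dt)
    sd = sigma * math.sqrt((1.0 - a * a) / (2.0 * rho))
    return math.exp(-0.5 * ((y - a * x) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
```

Such an estimate does not depend on θ, so it yields a single nonparametric transition density rather than a likelihood surface, and its accuracy is governed by the bandwidth and by the number of observations near x.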

Jesper Møller (Aalborg University)

I wonder how successfully the ideas of this interesting and stimulating paper can be extended to the case of a Cox process Xt, where Vt is a non-negative diffusion process and Xt conditional on Vt is a Poisson process with intensity function Vt. Suppose that t1,…,tn are the events of the Cox process observed on a finite time interval [0,a] and, as in the paper, we observe the diffusion process only at the times 0=t0<t1<…<tn. Using the notation in the paper, the likelihood is


where the conditional expectation is with respect to the diffusion process given (Vt0,…,Vtn)=v. How efficiently would the methods in the paper apply when the likelihood is approximated and maximized by using a Markov chain Monte Carlo missing data approach?

In passing it may be worth noting that, if the diffusion is a Cox–Ingersoll–Ross model, Srinivasan (1988) and Clifford and Wei (1993) have established the equivalence between the Cox process and the process of death times of a simple immigration, birth and death process, which is easy to simulate. Incidentally, this model is a special case of the permanent process that is introduced in McCullagh and Møller (2005).

Sara Pasquali and Fabrizio Ruggeri (Istituto di Matematica Applicata e Tecnologie Informatiche, Milan)

We express our gratitude to the authors for an enlightening paper which provides a groundbreaking new method for likelihood-based inference for discretely observed diffusions. The paper provides many details about the method and discusses many of its aspects, leaving a thorough analysis of some of them to forthcoming papers. We are looking forward to reading them since they will address some of our concerns. Here we just want to comment on possible extensions of the method.

The most general stochastic differential equation is given by


as in the following case about population dynamics:


where θ=(α,β,σ), wt is Gaussian white noise, T is the temperature (possibly dependent on the time s) and R(s) is a known recruitment function.

In the paper, the time s does not appear directly in the drift and in the diffusion, but only through Vs. We wonder whether the method proposed is valid also in this case.

As a second comment, we think of practical problems in which parallel, multivariate diffusion processes are to be analysed all together. We suppose that the diffusion processes might slightly differ in drift and/or diffusion and/or the prior on the parameters. Is there a way to estimate all of them together, sharing as many simulated points as possible, or should they be analysed separately?

Another related comment is about evolving systems in which new data are made available as time goes by. Is it possible to analyse the process on [0,s+t], starting from estimation in [0,s] and adding the data that are observed in (s,s+t], or should the estimation on [0,s+t] restart from scratch?

It would be interesting to explore the relationship between parameter estimation and stochastic stability, e.g. in the logistic growth model (example 2) where the equilibrium state 0 is stable for σ²>2R and unstable otherwise. In the latter case, which is considered in the paper, it is known that the process Vs fluctuates around the value Λ. Once data have been simulated from a logistic model, we would like to know what happens to the stability properties of the equilibrium, especially when parameters are close to the threshold σ²=2R.

Harry Pavlopoulos (Athens University of Economics and Business)

The authors are to be congratulated for their ground breaking work on likelihood-based parametric inference for discretely observed diffusions, as well as for the clarity with which they have summarized its mathematical and computational aspects (indeed of highly technical nature) in the paper. The purpose of this comment is to raise two interrelated points.

First, adding to the possible extensions of the methods that were considered by the authors, I suggest also the consideration of discretely observed diffusion with boundary condition(s) imposed on either one or both end points of its interval state space, provided that these are regular accessible boundaries. In other words, is it possible to extend the methodology to estimate parameters imposed through boundary condition(s), specifying transition probabilities from the interior of the state space to its boundaries and vice versa, in addition to parameters introduced through the drift and diffusion coefficients? This question is really of much broader intent, pertaining also to nonparametric methods of inference for discretely observed diffusions, such as those proposed by Aït-Sahalia (1996), Jiang and Knight (1997), Stanton (1997) and more recently by Fan and Zhang (2003) and Bandi and Phillips (2003). This consideration is of importance in geophysical and other environmental applications where the observed process is intermittent. An example of this situation is the diffusion model that was proposed by Pavlopoulos and Kedem (1992) for processes of spatially averaged rain rate over a sufficiently large region, where the state space of the diffusion is the closed interval [0,∞) and intermittency between wet and dry states of the region is modelled via a sticky boundary condition at {0}.

Second, I should like to share that throughout my reading of the paper I was under the impression that the spacing between time instants at which the diffusion is observed is quite general, as implied by the inequality 0=t0<t1<…<tn when the log-likelihood of the observations v was defined. Regular spacing in the EURODOLLAR data and in the simulations of the SINE and LOG-GROWTH data did not change this impression, until reading the conclusions of the paper where the authors address the performance of their methods by implying a uniform sampling frequency Δt, and thus regular spacing. This point remains subtle, at least in my understanding, and needs clarification. However, if the methodology holds under irregular spacing, then in the case of regular accessible boundaries it might be still implemented for conditional inference, at least, using observations only from the interior of the state space and ignoring those on the boundaries.

Christian P. Robert (Université Paris Dauphine)

The authors have achieved in this paper a tour de force, bypassing the usual and awkward discretization that is required for processing diffusions. The example of the Cox–Ingersoll–Ross model is quite revealing in this respect since, although a closed form representation via non-central χ²-distributions (or ARG(1) chains) does exist, Euler's discretization of the diffusion is biased because of the positivity constraint on the discretized process. This paper is thus bound to have long-term consequences in the way that we can handle and do inference on stochastic processes.
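The contrast can be made concrete with a small Python sketch, under assumptions that are not taken from the paper: the Cox–Ingersoll–Ross model is parameterized here as dV = κ(μ − V)dt + σ√V dB, the non-central χ² draw is generated as a Poisson mixture of gamma variables, and the function names and parameter values are purely illustrative.

```python
import math
import random


def poisson_draw(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n += 1
        t += rng.expovariate(1.0)
    return n


def exact_cir_step(v, dt, kappa, mu, sigma, rng):
    """Exact draw of V_{t+dt} given V_t = v via a scaled non-central chi-square."""
    decay = math.exp(-kappa * dt)
    c = sigma ** 2 * (1.0 - decay) / (4.0 * kappa)   # scale factor
    df = 4.0 * kappa * mu / sigma ** 2               # degrees of freedom
    nc = v * decay / c                               # non-centrality parameter
    n = poisson_draw(nc / 2.0, rng)
    # chi-square with df + 2n degrees of freedom = Gamma(shape df/2 + n, scale 2)
    return c * rng.gammavariate(df / 2.0 + n, 2.0)


def euler_cir_step(v, dt, kappa, mu, sigma, rng):
    """One Euler step; the Gaussian increment can push the state below zero.

    The max(v, 0) inside the square root is an ad hoc positivity fix that
    multi-step use would need, and such fixes are a source of bias.
    """
    noise = sigma * math.sqrt(max(v, 0.0) * dt) * rng.gauss(0.0, 1.0)
    return v + kappa * (mu - v) * dt + noise
```

With, say, κ = 0.5, μ = 0.04, σ = 0.4, a step of 0.1 and a starting value of 0.01, a sizeable fraction of Euler draws fall below zero, whereas the exact draws are non-negative by construction and have mean μ + (v − μ)exp(−κΔt).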

Obviously (and happily so!), much remains to be done in this new direction for the processing of parameterized diffusions. The boundedness condition on α²+α′ is quite restrictive, and the construction of the bound r(ω,θ) seems to be another tour de force. It would be of considerable interest to achieve a relaxation of the boundedness condition as well as a more constructive approach to the derivation of the bound r(ω,θ). Given that the method at the core of the paper is intrinsically an accept–reject method, I wonder whether or not importance sampling alternatives could be available, in that they do not require (in principle) a bound on the target distribution that is to be simulated. In particular and in parallel with standard importance sampling theory on finite dimensional distributions, is it possible to come up with variance comparison and optimality goals in a family of different importance functions? Furthermore, in classical Monte Carlo approximations, using an indicator I as in equation (12) in the estimation of pt(x,y;θ) often leads to estimators with high variance. A choice of an importance function that would increase the probability of getting a 1 for this indicator would thus be of interest. For instance, in the simulation of the Cox–Ingersoll–Ross model, importance functions that introduce a drift in the variance of expression (30) lead to a significant improvement in the estimation of some quantities (Douc et al., 2005). Similarly, any kind of Rao–Blackwellization applied to this indicator, if possible, would decrease the variance.

Mathias Rousset (University of Toulouse) and Arnaud Doucet (University of British Columbia, Vancouver)

The authors are to be congratulated for this impressive paper which solves many problems and opens many avenues of investigation. We present here a direct application of their methodology to time discretization error-free filtering of partially observed diffusions. Consider the following diffusion where X0∼π0 and for t>0:


This diffusion is partially observed at times {tk}k≥1 (where tk>tk−1) and the conditional density of the kth observation Ytk given by gk(ytk|xtk) is known analytically. We are interested in estimating sequentially the distributions


where t0=0, xt0:tk=(xt0,xt1,…,xtk) and yt1:tk=(yt1,yt2,…,ytk). To achieve this we propose to use a sequential Monte Carlo (SMC) algorithm (Doucet et al., 2001). The distributions are approximated by a large number N of weighted random samples, termed particles. The particles are sampled by using




where {qn(xtn|ytn,xtn−1)} are importance distributions known pointwise. In the standard SMC framework, these particles should be reweighted according to normalized weights proportional to








The particles are resampled whenever the variance of the weights is too large. Clearly this SMC algorithm cannot be implemented, as the transition density of the diffusion, which enters these weights, does not admit a closed form expression. However, a straightforward argument shows that it is not necessary to know this density exactly. Only an unbiased positive estimate of it is necessary to obtain asymptotically consistent SMC estimates under weak assumptions. Hence all the techniques that were developed by the authors to estimate the transition density unbiasedly can be applied straightforwardly. The need for positive estimates restricts us to diffusions that are similar to those of the EA1 or EA2 algorithms.

To be efficient, the SMC method requires the design of ‘good’ importance distributions and estimates of the importance weights with low variance. To design the importance distributions, we can approximate the transition density analytically by a Gaussian distribution using a local linearization technique (Durham and Gallant, 2002) and combine this approximate prior with gk(ytk|xtk) or a linearized version of it to obtain qk(xtk|ytk,xtk−1).

It is also crucial to reduce the variance of the estimates of the weights given by


To achieve this, if the Poisson estimator of Section 6 is used, we could sample for each particle P Poisson random variables but use the same Brownian bridge to sample retrospectively for computational savings. However, a large Poisson parameter λ and a large P may be needed to obtain a reasonable variance.

Osnat Stramer (University of Iowa, Iowa City)

This paper proposes a new, thorough and interesting approach for calculating the likelihood of discretely observed diffusions. The log-likelihood of the data set v is

$$\ell(\theta \mid v) = \sum_{i=1}^{n} \log p_{\Delta t_i}(V_{t_{i-1}}, V_{t_i}; \theta),$$

where the transition density of the diffusion process V is pΔti(Vti−1,Vti;θ). I thank the authors for providing a mine of ideas for modelling continuous time models.

In my comment, I shall also refer to two other existing approaches for estimating the transition density. One approach is the simulation methods that were proposed by Pedersen (1995) and improved substantially by Durham and Gallant (2002). This approach involves two types of approximations or errors:

  • (a) due to discretization and
  • (b) due to averaging.

Increasing the number of time intervals per Δti reduces the bias, but at the cost of increasing the number of simulations (see Stramer and Jun (2005a)). The second error will not disappear even if we use exact simulation. The methodology that is introduced in this paper avoids the first error but its efficiency depends heavily on the efficiency of the exact simulation of conditional diffusions. As pointed out by the authors, the acceptance rate of the Poisson process typically decreases exponentially to 0 in the length of the interval Δti between adjacent observations.
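The two sources of error can be seen in a minimal Python sketch of a simulated-likelihood estimator in the spirit of Pedersen (1995). It assumes a plain Euler scheme with m sub-intervals and no importance sampling, so it does not include the refinements of Durham and Gallant (2002); the function and argument names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def pedersen_density(x, y, delta, drift, vol, m, n_paths, rng):
    """Simulated-likelihood estimate of the transition density p_delta(x, y).

    m       : number of Euler sub-intervals per observation interval,
              which controls the discretization error (a)
    n_paths : number of simulated paths, which controls the averaging error (b)
    """
    h = delta / m
    total = 0.0
    for _ in range(n_paths):
        v = x
        # Euler steps up to the last sub-interval before the observation time
        for _ in range(m - 1):
            v += drift(v) * h + vol(v) * math.sqrt(h) * rng.gauss(0.0, 1.0)
        # Gaussian (Euler) density of the final sub-step, evaluated at y
        total += normal_pdf(y, v + drift(v) * h, vol(v) * math.sqrt(h))
    return total / n_paths
```

For example, `pedersen_density(0.5, 0.2, 0.5, lambda v: -v, lambda v: 1.0, 8, 20000, random.Random(0))` targets an Ornstein–Uhlenbeck transition density whose exact value is available for comparison; increasing m reduces the bias, whereas increasing n_paths only reduces the Monte Carlo variance.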

Another approach for estimating the transition probability is the closed form method that was introduced in Aït-Sahalia (2002a, 2004). It has been shown in Aït-Sahalia (2002b) and Stramer and Jun (2005b) that the closed form methods can be much faster and more accurate than the Durham and Gallant (2002) method in many interesting examples, though in theory it is assumed that Δti is small. It would be interesting to see a numerical comparison of the methods that were presented in this paper with the simulation-based methods and the closed form methods.

The methodology that is presented in this paper requires that the diffusion process be transformed to a new one with unit volatility. The simulation approach is in general more amenable when this transformation can be done (see Durham and Gallant (2002)). As noted by the authors a multivariate extension of their methods is therefore only possible for a limited class of models. More investigation in this direction will be necessary.

The authors replied later, in writing, as follows.

Firstly we thank all the discussants for their many contributions, insights and thought-provoking questions. The area of inference for partially observed diffusions is rapidly developing and hopefully our paper and the subsequent discussion will add further momentum to this exciting field.

The scope of the exact algorithm

A number of discussants (including Ball and his colleagues, Bingham, Chopin, Crisan, Kalogeropoulos, Moulines, Pasquali and Ruggeri, and Robert) asked about the scope of the exact algorithm (EA) methodology. Our paper introduces a general framework for inference which is illustrated mainly within the quite restrictive EA1 context and to some extent the more general EA2 situation.

We recall that the application of algorithm EA2 requires conditions (a)–(c) just below equation (4) to be satisfied, and the additional condition (9). Whereas (a)–(c) are largely innocuous, condition (9) is a serious constraint on the applicability of the method. Many natural diffusion models, for instance the Ornstein–Uhlenbeck process, violate this condition. There are two ways in which the applicability of our methodology can be extended significantly beyond these constraints.

  • (a) As outlined in the paper, the Poisson estimator permits unbiased estimation for models outside 𝒟2. This permits the construction of a Monte Carlo EM algorithm for classes of diffusions that are outside 𝒟2. The potential problem with this approach is that finite variances of estimators are not guaranteed to exist.
  • (b) We have recently developed an extension of the EA called the EA3 algorithm (see Beskos et al. (2005a)) which is applicable to all diffusions in a class 𝒟3, defined as the diffusions which satisfy only conditions (a)–(c). Since 𝒟3 is not constrained by restriction (9), it covers a very broad class of models.

Algorithm EA3 is mathematically more complex than either EA1 or EA2, though not necessarily less efficient. It is based on the idea that, once upper and lower bounds for the trajectory are known, the usual rejection sampling algorithm is easy to implement. For this, it introduces a different decomposition of Brownian motion, which we call layered Brownian motion. This decomposition permits upper and lower bounds on the diffusion sample path to be imposed once the sample path layer has been constructed. What makes algorithm EA3 more complicated than algorithm EA2 is the fact that the joint law of Brownian motion and its maximum modulus up to a fixed time t is not easily tractable, and exact simulation requires the construction of specific retrospective simulation techniques for probabilities that are expressible as infinite sums of alternating terms.

A minor modification of algorithm EA3 can also be used to cover other cases. For instance, by a judicious choice of the layers that are used in the construction, diffusions with finite entrance boundaries (and which are therefore not reached with probability 1) such as the shape space Ornstein–Uhlenbeck model of Ball and his colleagues can be simulated exactly. However, we also appreciate that the approximate approach that is adopted in their contribution represents sound methodology, since it seems easy to ensure that the approximation that is involved is arbitrarily accurate.

Several discussants enquired about other extensions. Beskos et al. (2005a) also show that extension of the EA to the time inhomogeneous case (as queried by Pasquali and Ruggeri) is straightforward, although the function φ must be modified accordingly. Moreover, extension to the multivariate case (as mentioned by Kalogeropoulos and Bingham) is routine provided that the following two (non-trivial) conditions are met.

  • (a) The multivariate version of the transformation that is given in equation (3) needs to exist. Although the transformation in equation (3) always exists for scalar diffusions, its multivariate counterpart (translating a diffusion with arbitrary diffusion coefficient to a diffusion with unit diffusion coefficient) may not exist. Conditions for this are given in Aït-Sahalia (2004) as mentioned in the contributions of Kalogeropoulos and Bingham.
  • (b) Assuming that the transformation in equation (3) is possible, the resulting multivariate drift function must be of gradient type, i.e. there is a function V(x) such that the drift α(x) satisfies α(x) = ∇V(x).
    This is a very common and natural condition for diffusions which corresponds (in the stationary case) to the reversibility of the diffusion.

However, when these two conditions are not met, it seems difficult to see how the EA can be applied at all.

Kalogeropoulos mentions the interesting extension to non-Markov models. Stochastic differential equations are very naturally formulated in non-Markov settings, and existing data augmentation techniques can in principle handle this generalization. Furthermore this generalization is important for certain application areas such as finance where Markovianity can be doubtful. (See Dellaportas et al. (2004) for an example of a Bayesian analysis of a non-Markov model by data augmentation.) Except in special cases, the EA cannot be applied for non-Markov models.

Pavlopoulos discusses two generalizations. Firstly the case of regular accessible boundaries for diffusions, for instance with reflection at the boundary, is complicated. Where diffusion drifts are locally bounded and converging to 0 at the boundary, and volatilities are bounded away from 0, pure reflecting boundaries can easily be dealt with by unfolding the reflecting boundary producing mirror image dynamics either side of the boundary. This approach works for two-sided boundaries also. However, it is not clear how to extend this to the case where the drift is not converging to 0 at the boundary since now the unfolded diffusion has discontinuous drift.
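The unfolding idea can be sketched in a few lines of Python. The sketch below assumes unit volatility and a drift b on [0,∞) with b(0)=0, extends the drift to the whole line as an odd function and folds the resulting path back with the absolute value. It uses a plain Euler scheme for the unfolded process rather than the EA, so it only illustrates the mirror image construction; the function names are hypothetical.

```python
import math
import random


def unfolded_drift(drift):
    """Odd extension of a drift defined on [0, inf): b~(y) = sign(y) * b(|y|)."""
    def b(y):
        return math.copysign(1.0, y) * drift(abs(y))
    return b


def reflected_path_euler(x0, t, n_steps, drift, rng):
    """Euler path of the unfolded unit-volatility diffusion, folded back by |.|.

    The unfolded process lives on the whole real line with mirror image
    dynamics on either side of 0; its absolute value is reflected at 0.
    """
    b = unfolded_drift(drift)
    h = t / n_steps
    y = x0
    path = [abs(y)]
    for _ in range(n_steps):
        y += b(y) * h + math.sqrt(h) * rng.gauss(0.0, 1.0)
        path.append(abs(y))  # fold the unfolded path back onto [0, inf)
    return path
```

With zero drift the construction reduces to |x0 + Bt|, i.e. reflected Brownian motion, for which the Euler steps are exact at the grid times. If the drift does not vanish at the boundary, the odd extension is discontinuous at 0, which is the difficulty described above.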

Secondly, Pavlopoulos asks about varying Δt. In fact this causes no problem for our methodology since all the Monte Carlo procedures are carried out independently between each two consecutive observations.
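A minimal Python sketch shows why irregular spacing is harmless: the log-likelihood is a sum of per-interval terms, each evaluated with its own Δti. The function `log_transition` is a hypothetical stand-in for any per-interval evaluator or Monte Carlo estimator of the log transition density; the Brownian motion transition is used here only because it is available in closed form.

```python
import math


def log_likelihood(times, values, log_transition):
    """Sum of log transition densities over consecutive observations.

    log_transition(x, y, dt) may be any per-interval evaluator or estimator;
    each interval is handled on its own, with its own spacing dt.
    """
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        total += log_transition(values[i - 1], values[i], dt)
    return total


def brownian_log_transition(x, y, dt):
    """Log density of N(x, dt) at y: closed-form stand-in for an estimator."""
    return -0.5 * math.log(2.0 * math.pi * dt) - 0.5 * (y - x) ** 2 / dt
```

For instance, `log_likelihood([0.0, 0.1, 0.5, 2.0], [0.0, 0.2, -0.1, 1.0], brownian_log_transition)` handles three intervals of lengths 0.1, 0.4 and 1.5 without any special treatment.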

Klüppelberg, and Mamon and Yu, ask the intriguing question of whether our methodology can be extended to deal with processes that are driven by Lévy process noise. To date, we have no idea how this can be accomplished in the infinite activity Lévy case. However, we certainly agree that this is an important open problem arising from our work here.

Numerical investigations and comparisons

Several discussants (Clifford, Crisan, Lyons and Stramer) request further comparisons with existing methodology and we agree completely that this is needed.

There are two (not entirely mutually exclusive) stages to this work. The first consists of numerical investigations into the effectiveness of the EA for different diffusions in comparison with competing discretization schemes. Important progress towards this goal has been achieved in Casella's comment and in more detail in his doctoral thesis.

Furthermore, there are tight analytic bounds on the computational cost of the EA in Beskos and Roberts (2005) and Beskos et al. (2004). For instance, for the recast von Mises example in the interesting discussion of Kent, Beskos and Roberts (2005) demonstrated that simulating the diffusion over a time interval of length T requires computational effort of order λ²T/σ².

The second stage in a comparison study will involve comparisons of the methods of this paper for likelihood-based inference with competitor methodologies. One important issue here will be that, although we would expect most (or maybe all) methods to deteriorate for increasingly sparse data sets, it is of great interest to know which methods show the greatest level of robustness to this phenomenon. For the Markov chain Monte Carlo data augmentation methodology for instance this involves studying the effects of block updating strategies for data sets where imputing the entire missing data between observations is inefficient. Some interesting conclusions in this direction appear in Chib et al. (2004).

For this comparison work, we emphasize again the important distinction between methods that can smoothly estimate whole likelihood surfaces and those which are focused on pointwise estimation. The former methods are highly likely to be the more effective in most statistical contexts. Smoothness (in a suitably precise sense) of the estimated likelihood surface is required for the proof of consistency (for large Monte Carlo samples) of maximum likelihood estimates as demonstrated in Beskos et al. (2005b). Lyons suggests comparisons with numerical methods for solving the partial differential equations that are satisfied by the diffusion transition density. This approach was explored in Lo (1988) although it appears difficult to estimate entire likelihood surfaces simultaneously by using this method.

Theoretical foundations

Kessler asks interesting questions about the properties of estimators that are produced by our Monte Carlo methodology. We have only partial answers to this question, showing for instance in Beskos et al. (2005b) that, as the Monte Carlo sample size increases, the Monte Carlo maximum likelihood estimate converges almost surely to the true maximum likelihood estimate. Furthermore, the Monte Carlo EM method can give an unbiased estimate of the score function, and thus we would expect that, even for fixed Monte Carlo sample size per observation, the Monte Carlo EM maximum likelihood estimate would be consistent in the limit as the sample size increases. These are both important consequences of unbiasedness in the estimators. We certainly concur that more quantitative rate results would be desirable.

However, although our methods are based on Monte Carlo methods, as pointed out by Robert, the transparency of our methods readily permits variance reduction ideas which mean that, in the examples that we have considered so far, very robust estimates of whole likelihood surfaces are obtained using minimal computational effort.

Filtering and the Poisson estimator

Robert and Stramer ask for guidance about the optimal implementation of the Poisson estimator. In recent work (Fearnhead et al., 2005) we have generalized the Poisson estimator to permit a general distribution for the number of time points at which the diffusion is evaluated, and we have produced optimality criteria for this distribution. Motivated by these criteria, a sensible proposal distribution appears to be a negative binomial distribution whose mean is close to inline image. In the EA1 setting of the SINE example, such a choice can lead to a reduction in variance by two or three orders of magnitude over the Poisson estimator with parameters that are chosen as in Section 7.
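For orientation, the basic Poisson estimator can be sketched in a few lines. The sketch below is illustrative only: it assumes the usual form exp{(λ−c)t} λ^(−κ) ∏{c−φ(ψj)} with κ Poisson distributed with mean λt and ψj uniform on [0,t], and it replaces the diffusion path by a deterministic toy integrand φ(s)=s so that the target exp(−1/2) is known. The function names and parameter values are hypothetical and are not those of Section 7.

```python
import math
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) variate by counting unit-rate exponential arrivals."""
    count, clock = 0, rng.expovariate(1.0)
    while clock <= mean:
        count += 1
        clock += rng.expovariate(1.0)
    return count

def poisson_estimator(phi, t, lam, c, rng):
    """One unbiased draw for exp(-integral_0^t phi(s) ds), assuming phi <= c on [0, t]."""
    kappa = poisson_count(lam * t, rng)      # number of evaluation times
    product = 1.0
    for _ in range(kappa):
        psi = rng.uniform(0.0, t)            # evaluation time, uniform on [0, t]
        product *= (c - phi(psi)) / lam
    return math.exp((lam - c) * t) * product

rng = random.Random(1)
phi = lambda s: s                            # toy integrand with 0 <= phi <= 1 on [0, 1]
draws = [poisson_estimator(phi, 1.0, 1.0, 1.0, rng) for _ in range(20000)]
estimate = sum(draws) / len(draws)
print(estimate, math.exp(-0.5))              # Monte Carlo average versus the exact value
```

Replacing `poisson_count` by a draw from another count distribution, with the product reweighted accordingly, is the kind of generalization that is discussed above; that variant is not shown here.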

There are several comments concerning how our work can be applied to filtering problems, e.g. where we make partial observations of the underlying diffusion at discrete time points. Chopin points out that the EA enables a simple implementation of the basic particle filter (Gordon et al., 1993) as the EA enables us to simulate exactly the value of the state at the next time point given the current value of the state. Künsch suggests that the rejection-sampling-based particle filter of Hürzeler and Künsch (1998) could be applied, and Doucet suggests a more general particle filter where the Poisson estimator is used to generate random weights for each particle, with the mean of these random weights being equal to the true (analytic) weight.
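The basic particle filter that Chopin refers to can be sketched as follows. This is a minimal illustration under stated assumptions rather than our implementation: the exact transition sampler is passed in as a function, and a Brownian motion transition is used as a stand-in where algorithm EA would be called for a general diffusion. The initial state, observation values and function names are hypothetical.

```python
import math
import random

def bootstrap_filter(observations, dt, obs_sd, n_particles, sample_transition, rng):
    """Return filtering means; sample_transition(x, dt, rng) must draw from the exact transition law."""
    particles = [0.0] * n_particles          # assumed known initial state 0
    means = []
    for y in observations:
        # Propagate each particle with the exact transition sampler.
        particles = [sample_transition(x, dt, rng) for x in particles]
        # Weight by the normal observation likelihood (constants cancel on normalising).
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means

def brownian_transition(x, dt, rng):
    """Stand-in exact sampler: Brownian motion, whose transition law is N(x, dt)."""
    return rng.gauss(x, math.sqrt(dt))

rng = random.Random(2)
means = bootstrap_filter([0.3, 0.5, 0.1], dt=1.0, obs_sd=0.5, n_particles=5000,
                         sample_transition=brownian_transition, rng=rng)
print(means)
```

For the stand-in model the first filtering mean is available in closed form (0.3/1.25 = 0.24), which gives a simple check on the sketch.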

We have independently been investigating how the EA and related ideas can be applied to particle filters, and in Fearnhead et al. (2005) we have developed a general framework for unbiased particle filters. Here we give a brief summary of results in the case of observing the SINE data set with normal error (variance σ²). For this application, the rejection sampling idea of Künsch can be applied by

  • (a) choosing a current particle with probability proportional to
  • (b) proposing X_{t_{i+1}} = u conditional on inline image from a normal distribution with mean
    and variance σ²(t_{i+1} − t_i)/(σ² + t_{i+1} − t_i), and
  • (c) performing two accept–reject steps, the first with acceptance probability  exp {A(u)−1}, and the second being the EA1 accept–reject step for the path.

This algorithm has the same overall acceptance probability as a simple implementation of algorithm EA1 for simulating from the SINE state diffusion, but it has the important advantage of simulating from the particle filter's target distribution at time t_{i+1} and avoiding the need for an importance sampling correction by the likelihood. This is particularly useful if σ² is small relative to t_{i+1} − t_i. However, an efficient random-weight particle filter (similar to the ideas of Rousset and Doucet) has computational advantages over even this rejection algorithm. The advantages depend on the values of the states at times t_i and t_{i+1}, but in extreme cases (where both state values are close to a multiple of 2π) computational gains of over a factor of 10 are possible.

As Rousset and Doucet point out, to implement a random-weight particle filter requires the random weights to be positive with probability 1. Although this is straightforward to achieve in EA1 situations, we have adapted EA3 ideas to ensure this for general diffusions. The methodology that we have developed is trivially extended to inference for diffusion-driven Cox processes (Møller and Clifford), and it should be possible to extend to stochastic volatility models (Kalogeropoulos) and velocity models where positions are observed (Kent). For full details see Fearnhead et al. (2005).

Mathematical formulation

Crisan suggests that it is more natural to avoid the step which transforms the observed diffusion to one of unit diffusion coefficient. It is clear that this is mathematically equivalent to our approach, and the Radon–Nikodym derivative that is obtained by using both approaches should be almost surely equivalent. In fact, the solution to the differential equation that he gives in his equation (32) turns out to define our unit diffusion coefficient transformation (3) in the time homogeneous case, and the time inhomogeneous case results in a simple generalization of transformation (3) which we describe in detail in Beskos et al. (2005a).

In fact, from a statistical perspective, it seems to us to be more natural to work with the transformed unit volatility process since this can be thought of as an infinitesimal standardization of the missing data.

Implicit in Crisan's comment is the useful idea that different dominating measure proposal distributions can be used as the basis for the rejection sampling approach. There are several different tractable unit diffusion coefficient proposals which could be used in this way: for example in the Cox–Ingersoll–Ross example we used a Bessel process, and in more recent work we have used Ornstein–Uhlenbeck processes. However, Crisan's approach offers no additional generality since any general diffusion which has explicit finite dimensional distributions corresponds (through transformation (3)) to one which has unit diffusion coefficient.


We concur with the view of Clifford, Künsch, and Mamon and Yu, who mention both that diffusions that are used in statistical modelling are often rather approximate and that precise macroscopic properties of models are likely to be unimportant in many practical modelling contexts. For instance in financial data, stylized facts about financial asset power variation are at odds with the characteristic finite non-zero quadratic variation which diffusions are bound to exhibit.

Clifford points out that the exact likelihood in the Cox–Ingersoll–Ross model is explicit. This is precisely why we used this example so that the analytic solution could be explicitly compared with our results.

Chopin raises the possibility of substantially increasing efficiency at the cost of a certain bias by truncating drifts in regions which the diffusion is unlikely to visit in the time period of interest. This seems a sensible practical suggestion. It turns out that algorithm EA3 using the layered Brownian motion construction can be motivated in a similar way, and moreover allows unbiasedness to be retained.

Pasquali and Ruggeri and Genon-Catalot ask about the relationship between parameter estimation and diffusion stability. In the asymptotic case, much is known about this through the elegant theory of mixed asymptotic normality (see Basawa and Rao (1980)). In the finite sample case, it is difficult to make general statements. However, as the discussants point out, even our simple logistic growth model is sufficiently simple for interesting structure to emerge. In this example for instance, for σ² > 2R, the diffusion will eventually converge to 0 so presumably K will be badly estimated in this case.

Mamon and Yu ask about the comparisons of our methods with nonparametric methods for stationary diffusions. Kent mentions a similar issue, pointing out that serial correlation of the data can be taken into account by standard time series methods. Although this idea requires stationarity, an appealing feature of such methodology is the fact that it will provide an increasingly good approximation to an exact likelihood-based approach for sparse data—exactly the situation which is problematic for all currently existing exact likelihood-based methods.


Appendix A: Background material

A.1. Brownian motion and Brownian bridge

Standard Brownian motion {Bs; 0 ≤ s ≤ t}, B0=0, can be simulated on a collection of time instances 0=s0<s1<…<sk ≤ t in the following recursive way:


A standard BB with distribution 𝕎(t,0,0) can be obtained as a transformation of B:


An arbitrary BB with distribution 𝕎(t,x,y) can be simulated via the relocation invariance property of BBs: if ω∼𝕎(t,0,0), then ωs+(1−s/t)x+(s/t)y is a path from 𝕎(t,x,y).
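The three constructions of this subsection can be put together in a short sketch. It assumes the usual choices, namely independent Gaussian increments with variance equal to the time step for the recursion, and the transformation ωs = Bs − (s/t)Bt for the standard BB; the function names and the time grid are hypothetical.

```python
import math
import random

def brownian_path(times, rng):
    """Standard Brownian motion (B_0 = 0) at increasing times, from independent Gaussian increments."""
    values, previous_time, current = [], 0.0, 0.0
    for s in times:
        current += rng.gauss(0.0, math.sqrt(s - previous_time))
        values.append(current)
        previous_time = s
    return values

def brownian_bridge(times, t, x, y, rng):
    """Bridge from x at time 0 to y at time t, evaluated at `times` (last entry must equal t)."""
    b = brownian_path(times, rng)
    b_t = b[-1]
    # Standard bridge: subtract the straight line to the simulated endpoint.
    standard = [b_s - (s / t) * b_t for s, b_s in zip(times, b)]
    # Relocation: add the straight line from x to y.
    return [w + (1 - s / t) * x + (s / t) * y for s, w in zip(times, standard)]

rng = random.Random(3)
times = [0.25, 0.5, 0.75, 1.0]
path = brownian_bridge(times, t=1.0, x=1.0, y=2.0, rng=rng)
print(path)                                  # the last value is the fixed endpoint y
```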

A.2. Decomposition of a Brownian path at its minimum

The following theorem, which was stated in Beskos et al. (2004a), shows how to simulate a BB path together with its minimum m and the time τ when the minimum is attained.

Theorem 4.  A path ω∼𝕎(t,x,y) can be simulated in the following way.

  • Step 1: simulate m and τ according to their joint density
    for m ≤ min(x,y) and τ ∈ [0,t].
  • Step 2: the rest of the path is a transformation of six independent standard BBs, bj∼𝕎(1,0,0), j=1,…,6. Analytically, for 0 ≤ s ≤ τ,
    and, for τ < s ≤ t,

In step 1, m is drawn from its marginal distribution, which is a transformed Rayleigh distribution and, conditionally on m, τ is drawn by using the inverse cumulative distribution function method.
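The draw of m in step 1 can be illustrated by inversion. The sketch assumes the standard distribution function of the minimum of a BB from x to y over [0,t], P(m ≤ a) = exp{−2(x−a)(y−a)/t} for a ≤ min(x,y), and solves the resulting quadratic for its smaller root. The draw of τ given m is not shown, and the function name and numerical values are hypothetical.

```python
import math
import random

def sample_bridge_minimum(t, x, y, rng):
    """Draw the minimum of a Brownian bridge from x to y over [0, t] by inverting its distribution function."""
    u = 1.0 - rng.random()                   # uniform on (0, 1], keeps log(u) finite
    return 0.5 * (x + y - math.sqrt((y - x) ** 2 - 2.0 * t * math.log(u)))

rng = random.Random(4)
t, x, y = 1.0, 0.0, 0.5
minima = [sample_bridge_minimum(t, x, y, rng) for _ in range(20000)]
level = -0.3
empirical = sum(m <= level for m in minima) / len(minima)
exact = math.exp(-2.0 * (x - level) * (y - level) / t)
print(empirical, exact)                      # empirical versus assumed distribution function
```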

A.3. Bessel process and Bessel bridge

The (three-dimensional) Bessel process or bridge can be defined as a scalar Brownian motion or bridge that is constrained to be positive; for the main results see chapter 11 of Revuz and Yor (1994). Recall that an arbitrary Bessel bridge measure is denoted by ℝ(t,x,y). Simulation of ω∼ℝ(t,0,y) can be carried out by means of three independent BBs (see for example Bertoin and Pitman (1994)):


Simulation of ω∼ℝ(t,x,y) is more involved and follows the lines of theorem 4. We first simulate its minimum m, whose distribution is the restriction to (0,∞) of the distribution of the minimum of the corresponding BB. Given m, the path can be reconstructed exactly as if it were a BB with known minimum m.
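The three-BB construction for a Bessel bridge started at 0 can be sketched as follows. It assumes the standard representation of the three-dimensional Bessel bridge from 0 to y as the Euclidean norm of a three-dimensional BB from the origin to (y,0,0); the function names and the time grid are hypothetical, and the more involved case x>0 is not covered.

```python
import math
import random

def standard_bridge(times, t, rng):
    """Brownian bridge from 0 to 0 over [0, t] at increasing `times` (last entry must equal t)."""
    values, previous_time, current = [], 0.0, 0.0
    for s in times:
        current += rng.gauss(0.0, math.sqrt(s - previous_time))
        values.append(current)
        previous_time = s
    return [b_s - (s / t) * values[-1] for s, b_s in zip(times, values)]

def bessel_bridge_from_zero(times, t, y, rng):
    """Bessel bridge from 0 to y as the norm of three independent bridges, the first shifted towards y."""
    b1, b2, b3 = (standard_bridge(times, t, rng) for _ in range(3))
    return [math.sqrt((y * s / t + u) ** 2 + v ** 2 + w ** 2)
            for s, u, v, w in zip(times, b1, b2, b3)]

rng = random.Random(6)
times = [0.2, 0.4, 0.6, 0.8, 1.0]
bessel_path = bessel_bridge_from_zero(times, t=1.0, y=1.5, rng=rng)
print(bessel_path)                           # non-negative values ending at y
```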

Appendix B: Proofs of main results

B.1. Proof of lemma 1

Girsanov's formula, which gives the density of the law of X with respect to Wiener measure, is


Thus, using Itô’s lemma


B.2. Proof of lemma 2

We shall find the density inline image of the law of vcom with respect to Leb ⊗ 𝕎(t,0,0), where Leb denotes the Lebesgue measure on R. Clearly,


The law of the 1–1 transformation inline image given Vt is just inline image, where x=η(V0;θ) and y=η(Vt;θ). Thus,


Use of lemma 1 concludes the proof.

B.3. Proof of lemma 3

Note that the Radon–Nikodym derivative of a homogeneous Poisson process with rate r(θ) on [0,t]×[0,1] with respect to Φ is


which simplifies to exp [t{1−r(θ)}] r(θ)^κ.
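A quick numerical check of the simplified expression is possible because it depends on the configuration only through the number of points κ, which has a Poisson distribution with mean t under the unit-rate process on [0,t]×[0,1]. The sketch below, with hypothetical names and illustrative values of t and r, verifies that the expression integrates to 1 under the unit-rate law and that reweighting by it gives mean count rt.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def density_ratio(kappa, t, r):
    """exp[t{1 - r}] r^kappa: rate-r process relative to the unit-rate process on [0,t] x [0,1]."""
    return math.exp(t * (1.0 - r)) * r ** kappa

t, r = 2.0, 1.7
# Sum over counts 0..199; the neglected tail is numerically zero for these values.
total_mass = sum(poisson_pmf(k, t) * density_ratio(k, t, r) for k in range(200))
reweighted_mean = sum(k * poisson_pmf(k, t) * density_ratio(k, t, r) for k in range(200))
print(total_mass, reweighted_mean)           # should be close to 1 and to r * t
```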

B.4. Proof of theorem 3

The joint density of θ and the latent variables conditionally on v can be decomposed as


To simplify the formulae, we write xi instead of xi(θ). Clearly,


Using equation (14), the acceptance probability of EA1 a(xi−1,xi,θ) can be expressed as


Combining this expression with expressions (29) and (26) we find that inline image is proportional to


It is now straightforward to integrate out inline image and to obtain expression (30).