Conclusions
MCMC methods can work well on problems that are described in the paper, i.e. the computational gap can be reduced (but not removed completely), but this requires careful algorithm construction. (Perhaps there is a role here for adaptive MCMC methods.) The authors’ work offers an appealing and rapid approximation which provides a pragmatic alternative to full posterior exploration. I think it is imperative that we learn more about the accuracy of the approximations that are used in the paper; in particular, how this is affected by the data. However, it is clear that this is exciting and stimulating work. I am therefore very pleased to be able to propose the vote of thanks to the authors for their work.
Christian P. Robert (Université Paris Dauphine and Centre de Recherche en Economie et Statistique, Malakoff)
Rue, Martino and Chopin are to be congratulated on their impressive and wide-ranging attempt at overcoming the difficulties in handling latent Gaussian structures. In time series as well as spatial problems, the explosion in the dimension of the latent variable is indeed a stumbling block for Markov chain Monte Carlo (MCMC) implementations and convergence, and recent solutions all resort to approximations of sorts whose effect has not always been completely assessed (see, for example, Polson et al. (2008)). The double Laplace approximation that is proposed by the authors is therefore an interesting and competing alternative in these areas.
Nonetheless, as much as I respect the many other major contributions of my fellow Norman, Pierre Simon de Laplace, to (Bayesian) statistics, mathematics and other branches of science, and despite earlier uses in Bayesian statistics (Tierney et al., 1989), I have always (Robert, 1992, 1994) been wary of Laplace's approximation because
 (a)
it is not parameterization invariant,
 (b)
it requires some analytic computation and/or some black box numerical differentiation, while being based on a standard second-order Taylor approximation to the log-density, and
 (c)
it misses a clean evaluation of the associated error.
In the present paper the amount of recourse to this approximation is particularly intense since both π(x|θ,y) and π(x_{i}|θ,x_{−i},y) are approximated by multilevel (nested) Laplace approximations. I, however, contend that a less radical interpretation of approximation (3) as a proposal could lead to a manageable MCMC implementation, at least in some settings.
My first reservation is that the calibration of those Laplace approximations seems to require a high level of expertise that conflicts with the off-the-shelf features that are advertised in the final section of the paper. Designing the approximation then represents an overwhelming portion of the analysis time, whereas computation indeed becomes close to instantaneous, unless one abandons the analytical derivation for a numerical version that is difficult to trust entirely. After reading the paper, I certainly feel less than confident in implementing this type of approximation, although I did not try to use the generic open source software that has been carefully developed by Rue and Martino. Attempting to apply this approach to the standard stochastic volatility model using our own programming thus took us quite a while, even though it produces decent approximations to the marginal π(θ|y), as Casarin and Robert will discuss here later. I am, however, wondering whether or not approximation (3) is a proper density for any and every model, since it is written as |Q(θ)|^{1/2}|Q(θ)+diag{c(θ)}|^{−1/2} π{x^{*}(θ),θ,y}, where the dependence of both x^{*} and c on θ is quite obscure.
My second reservation is that, although the pseudo-marginal (3) seems to be an acceptably manageable version of the marginal posterior of θ, the additional Laplace approximations in the derivation of the marginals of the x_{i}s do not appear crucial or necessary. Indeed, once approximation (3) is available as a (numerical) approximation to π(θ|y), given that θ has a limited number of components, as hinted at on several occasions in the paper, a regular MCMC strategy targeted at this distribution is likely to work. This would result in a much smaller cost than the discretization underlying Fig. 1 (which cannot resist the curse of dimensionality, unless cruder approximations as in Section 6.5 are used, but at a cost in accuracy). Once simulations of the θs are available, simulations of the xs can also be produced by using the Gaussian approximation as a proposal and the true target as a Metropolis–Hastings correction. (It is thus possible to envision the whole simulation of the pair (θ,x).) Indeed, the derivation of the approximations to the marginals of the x_{i}s is extremely complex and thus difficult to assess. In particular, the construction is partly numeric and must be repeated for a large number of θs, even though I understand that this does not induce high computing costs. Given that these approximations are averaged over several values of θ, it somehow is delicate to become convinced that this complex construction is worthwhile, when compared with an original Gaussian approximation coupled with an MCMC simulation. Simulating a single x_{i} or the whole vector x from the Gaussian approximation has the same precision (in x_{i}); therefore the dimension of x cannot be advanced as having a negative effect on the convergence of the MCMC algorithm. Furthermore, a single run of the chain produces approximations for all x_{i}s.
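The Metropolis–Hastings correction alluded to above amounts to an independence sampler: propose from the Gaussian approximation and accept or reject against the true (unnormalized) target. The following minimal Python sketch illustrates the mechanism on a hypothetical one-dimensional target; the specific target density and tuning values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def independence_mh(log_target, mu, sigma, n_iter=5000):
    """Independence Metropolis-Hastings: propose from the Gaussian
    approximation N(mu, sigma^2), correct with the true target."""
    def log_q(x):
        # Gaussian proposal log-density, up to an additive constant
        return -0.5 * ((x - mu) / sigma) ** 2

    x = mu
    samples = np.empty(n_iter)
    for i in range(n_iter):
        x_new = mu + sigma * rng.standard_normal()
        # independence-sampler acceptance ratio:
        # target(x') q(x) / {target(x) q(x')}
        log_alpha = (log_target(x_new) - log_target(x)
                     + log_q(x) - log_q(x_new))
        if np.log(rng.uniform()) < log_alpha:
            x = x_new
        samples[i] = x
    return samples

# Toy non-Gaussian target (log scale, unnormalized): an illustrative
# stand-in for a full conditional pi(x_i | theta, x_{-i}, y).
log_target = lambda x: -0.5 * x**2 - 0.25 * np.abs(x)**3
draws = independence_mh(log_target, mu=0.0, sigma=1.2)
```

Because the proposal does not depend on the current state, the chain mixes well whenever the Gaussian approximation is a reasonable envelope of the target, which is precisely the situation the paper's approximations are designed for.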
My last reservation is that the error resulting from this approximation is not, despite the authors’ efforts, properly assessed. We can check on the simpler examples that the error resulting from one of the schemes is indeed minimal but the error rates do little for my reassurance as
 (a)
they involve the sample size, even though we are dealing with a fixed sample, and not a computational effort that seems to be restricted to the size of the grid in the θ-space, and
 (b)
they do not calibrate the error resulting from the Bayesian inference based on this approximation.
Using and comparing different approximations that are all based on the same principle (Laplace's!) does not provide a clear indicator of the performances of those approximations. Furthermore, resorting to the measure of effective dimension of Spiegelhalter et al. (2002) does not bring additional validation to the approximation.
A minor side remark is that, from a Bayesian point of view, I find it rather annoying that latent variables and parameters or hyperparameters with Gaussian priors are mixed together in x (as in the stochastic volatility example) and that θ coalesces all the leftovers without paying any attention to the model hierarchy (as in Section 1.3 with θ_{1} versus θ_{2}). Of course, this does not impact the sampling performance of the method, but it still feels awkward. In addition, this may push towards a preference for Gaussian priors, since, the more (hyper)parameters with Gaussian priors, the smaller m and the less costly the numerical integration step.
Given the potential advances resulting from (as well as the challenges that are posed by) this paper, both in terms of modelling and of numerical approximation, I am unreservedly glad to second the vote of thanks.
The vote of thanks was passed by acclamation.
Peter J. Diggle (Lancaster University)
The rediscovery of Markov chain Monte Carlo (MCMC) methods in the 1980s has had a profound effect on statistical practice, enabling the apparently routine fitting of complex models to experimental and observational data. The benefits of this are obvious, but they come at a price.
The authors of this paper raise one often-voiced concern with MCMC methods, namely that model fitting can take hours, or even days, of computing time. I discount this objection when the model in question is not in doubt—waiting days for the answer seems unexceptionable when data may have taken months or years to collect—but it carries some force at an exploratory stage, when many competing models may need to be fitted and their results compared.
Another familiar concern is the difficulty of assessing convergence of MCMC algorithms. This seems to me a more important problem, and one that is fundamentally insoluble in complete generality unless and until exact algorithms can be found for a usefully wide class of models. Until then, I see a fundamental distinction between MCMC and direct Monte Carlo methods of inference, namely that, for the latter, convergence is not an issue; hence the assessment of Monte Carlo error is straightforward by using elementary methods applied to independent simulations from the target distribution.
Finally, there is an ever-present danger that the ability to fit overcomplicated models to sparse data sets will encourage people to do precisely that. And, within the Bayesian paradigm at least, a flat likelihood is no bar to inference, and the fact that apparently innocuous prior specifications can be highly informative in some dimensions of the multivariate posterior distribution can easily go unnoticed.
For all these reasons and more, reliable alternatives to MCMC methods would be of enormous value, and I consider this paper to be a very important development.
I would like to be able to use the authors’ methods for geostatistical prediction. In this context X represents an unobserved, spatially continuous stochastic process and a common requirement is to be able to draw samples from the predictive distribution of one or more non-linear functionals of X, e.g. its maximum value over a spatial domain of interest. The authors mention in Section 6.1 the possibility of extending their methods to enable evaluation of the joint predictive distribution for multiple elements of X, but they warn us that the accuracy of their methods ‘may possibly decrease’ as the dimensionality of X increases. Do they have any more tangible results on this issue?
Leonhard Held and Andrea Riebler (University of Zurich)
In a wideranging paper, Besag et al. (1995) saw Markov chain Monte Carlo (MCMC) methods as ‘putting probability back into statistics’. The authors of the paper discussed tonight are to be congratulated on bringing ‘numerics back into statistics’. Their work proposes important extensions on numerical approximations to posterior distributions (Tierney and Kadane, 1986). The accuracy and computational speed of the integrated nested Laplace approximation (INLA), as illustrated in the examples, is remarkable and—without doubt—INLA will replace MCMC sampling in routine applications of structured additive regression models.
One advantage of MCMC methods is that functions of high dimensional posterior distributions can be estimated easily. For example, Besag et al. (1995), section 6.3, proposed an algorithm to compute simultaneous credible bands, a useful tool for Bayesian data analysis (Held, 2004). Such bands would address the question of whether a linear age group effect is plausible in example 5.4. Similarly, there is often interest in the posterior ranking of a large number of random effects (Goldstein and Spiegelhalter, 1996). In Section 6.1, Rue and his colleagues consider approximations of multivariate marginal distributions of small size. Up to what number of parameters can we compute simultaneous credible bands and posterior ranks with INLA?
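For reference, the sampling-based route mentioned above can be sketched as follows: given an m×n matrix of posterior draws of a curve, a simultaneous band is formed from columnwise order statistics, in the spirit of Besag et al. (1995), section 6.3. The rank-based rule below is our reading of that algorithm, offered as an illustrative sketch rather than a definitive implementation.

```python
import numpy as np

def simultaneous_band(samples, alpha=0.05):
    """Simultaneous credible band from posterior samples.
    samples: array of shape (m, n) -- m posterior draws of an
    n-dimensional functional (e.g. an age-effect curve)."""
    m, n = samples.shape
    order = np.sort(samples, axis=0)                     # columnwise order statistics
    ranks = samples.argsort(axis=0).argsort(axis=0) + 1  # ranks 1..m per column
    # for each draw, its most extreme rank over all coordinates
    extreme = np.maximum(ranks, m + 1 - ranks).max(axis=1)
    # smallest rank threshold whose band contains >= (1 - alpha) of the draws
    t_star = int(np.sort(extreme)[int(np.ceil((1 - alpha) * m)) - 1])
    lower = order[m - t_star, :]   # (m + 1 - t*)-th order statistic
    upper = order[t_star - 1, :]   # t*-th order statistic
    return lower, upper
```

By construction every draw whose most extreme rank is at most t_star lies inside the band in all n coordinates simultaneously, so the band has at least the nominal simultaneous coverage over the sampled curves.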
Going back to example 5.4, the question arises where exactly INLA adjusts for the sum-to-zero constraint on f^{(a)} and f^{(s)}. Martino and Rue (2008) mention a small positive scalar c to be added to the diagonal of the precision matrix of the unconstrained model to ensure positive definiteness. Does this introduce an additional approximation error depending on c? A similar problem occurs in MCMC sampling (Knorr-Held and Rue (2002), page 608); however, a Metropolis–Hastings step automatically corrects for c, so the algorithm remains exactly valid.
An exciting feature of INLA is the ease with which the marginal likelihood and cross-validated probability integral transform values can be calculated. There is a large literature on how to estimate these quantities through MCMC methods; however, some of the estimates proposed have unduly large Monte Carlo error. A multivariate extension of the probability integral transform value, the Box (1980) density ordinate transform, has recently been suggested by Gneiting et al. (2008) as
P{π(Y_{S}|y_{−S})≤π(y_{S}|y_{−S})},
where y_{S}={y_{i}:i ∈ S} and Y_{S} is the associated random vector with density π(y_{S}|y_{−S}). We would be interested to learn from the authors whether this quantity can be computed via INLA for moderate dimensions of S.
Finally, we are wondering whether a sequential version of INLA can be developed for dynamic models with non-Gaussian observations to improve the accuracy of the extended Kalman filter and smoother (Fahrmeir, 1992).
Adam M. Johansen (University of Warwick, Coventry)
I congratulate the authors on this interesting and thought-provoking paper which presents a potentially widely applicable and practically important contribution to the field of Bayesian inference. It has been clear from the outset, even to Monte Carlo specialists, that whenever it is possible to avoid the use of Monte Carlo methods it is desirable to do so (Trotter and Tukey, 1956). This paper expands the class of problems which can be addressed without recourse to such techniques.
 (a)
An exciting aspect of this work is the impressive performance which is obtained when p/n_{d}∼1. In their setting, Shen and McCullagh (1995) suggested that a Laplace approximation can be expected to be reliable (asymptotically) only if p is o(n^{1/3}), and that peculiar characteristics of individual problems must be exploited outside this regime. The integrated nested Laplace approximation (INLA) can clearly perform well here. Further investigation seems warranted: a clear theoretical characterization would provide confidence in the robustness of the INLA approach; additionally, INLA may provide some insight into what properties are required for Laplace approximation techniques to be justified in this type of problem, perhaps using novel arguments, and this may be more generally important.
 (b)
Hidden Markov models, such as the stochastic volatility model of Section 5.3, in which an unobserved, discrete time Markov process {X_{n}} is imperfectly observed through a second process {Y_{n}} such that, conditional on X_{n}, Y_{n} is independent of the remainder of X and Y, are common in time series analysis. A small number of unknown parameters are often present, and it is challenging to design good Markov chain Monte Carlo algorithms to perform inference in such models. Inference for these models is often restricted to the estimation of low dimensional marginals, often using sequential Monte Carlo methods (Doucet et al., 2001). Sometimes this choice is due to a requirement that computation can be performed in an ‘online’ fashion (and one might ask whether it is possible to develop ‘sequential’ versions of INLA); in other cases it is simply an efficient computational technique—see, for example, Gaetan and Grigoletto (2004). Smoothing (the problem of estimating the marginal posterior distribution of X_{k}, k≤t, given y_{1:t}) as well as parameter estimation as touched on in Section 5.3 (which is non-trivial with sequential Monte Carlo methods and computationally intensive with Markov chain Monte Carlo methods) are of considerable interest. Existing analytic techniques are very limited: this seems to be a natural application for INLA.
Sujit K. Sahu (University of Southampton)
In this very impressive and stimulating paper it is nice to see the triumph of integrated nested Laplace approximations for Bayesian inference. The paper develops easy-to-implement and ready-to-use software routines for approximations as alternatives to Markov chain Monte Carlo (MCMC) methods for many relatively low dimensional Bayesian inference problems. Thus, these methods have the potential to provide an independent check on the MCMC code written for at least a simple version of a more complex general problem. However, the conditional independence assumption of the latent Gaussian field in Section 1.3 seems to limit the use of the methods in many space–time data modelling problems. The following example shows that it may be possible to relax the assumption.
Suppose that
where Z is the design matrix and is a known covariance matrix. Assume further that the latent field x follows the multivariate normal distribution with zero mean and precision matrix Q_{x}. A flat prior on θ completes the model specification. As in Section 1.3, we have
Thus, there is no need to obtain a Gaussian approximation, and we have
Because of this exact result, the ratio at x=x^{*}(θ) will be free of θ and, according to equation (3) of the paper,
which implies that
 (33)
where
again requiring no approximations.
The marginal model here is given by
with the exact posterior distribution given by
 (34)
The nested approximation in equation (33) and the exact marginalization in equation (34) are in fact identical since it can be shown that A=Q_{y}. We thus see that the proposed nested approximation method, INLA, works even for spatially coloured covariance matrices for the random field as well as the data. Is it possible, then, to relax the conditional independence assumption in general? In practice, however, Q_{x}, being a spatially coloured matrix, will depend on several unknown parameters describing smoothness and the rate of spatial decay. It is likely that Gaussian approximations for the posterior distributions of those parameters will be more challenging, since MCMC sampling algorithms often behave very poorly owing to weak identifiability; see for example Sahu et al. (2007).
Roberto Casarin and Christian P. Robert (Université Paris Dauphine)
To evaluate the effect of the Gaussian approximation on the marginal posterior on θ, we consider here a slightly different albeit standard stochastic volatility model
(The difference from the authors’ expression is that the variance of the x_{t}s is set to 1 and that we use the notation ρ instead of φ and σ^{2} instead of exp(μ).) If we look at the second-order approximation of the non-linear term, we have
where . A Gaussian approximation to the stochastic volatility model is thus
where the Gaussian precision matrix Q(θ) has 3/2+ρ^{2} on its diagonal, −ρ on its first subdiagonal and superdiagonal, and 0 elsewhere. Therefore the approximation (3) of the marginal posterior of θ is equal to
for a specific plug-in value of x.
Using for this plug-in value the mode (and mean) x^{M} of the Gaussian approximation, as it is readily available, in contrast with the mode of the full conditional of x given y and θ as suggested by the authors, we obtain a straightforward recurrence relation on the components of x^{M}
with appropriate modifications for t=0,T. We thus obtain the recurrence (t>0)
with
and
This choice of x^{M} as a plug-in value for the approximation to π(θ|y) gives quite accurate results, when compared with the ‘true’ likelihood that is obtained by a regular (and unrealistic) importance sampling approximation. Fig. 11 shows the correspondence between the two approximations, indicating that the Gaussian approximation (3) can be used as a good proxy to the true marginal.
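As a rough illustration of the kind of computation involved, the sketch below builds the tridiagonal precision matrix described above and evaluates its log-determinant via a Cholesky factorization, which is the dominant cost in evaluating an approximation of the form (3). The boundary adjustments for t=0,T mentioned in the text are ignored here for simplicity, and the dense construction is only a stand-in for the sparse computations one would use in practice.

```python
import numpy as np

def sv_precision(rho, T):
    """Tridiagonal Gaussian-approximation precision sketched above:
    3/2 + rho^2 on the diagonal (boundary corrections ignored),
    -rho on the first sub- and superdiagonals."""
    Q = np.diag(np.full(T, 1.5 + rho**2))
    idx = np.arange(T - 1)
    Q[idx, idx + 1] = -rho
    Q[idx + 1, idx] = -rho
    return Q

def log_det(Q):
    # log-determinant via Cholesky, as one would do with sparse matrices:
    # log|Q| = 2 * sum(log(diag(L))) for Q = L L'
    L = np.linalg.cholesky(Q)
    return 2.0 * np.sum(np.log(np.diag(L)))

Q = sv_precision(rho=0.9, T=50)
ld = log_det(Q)
```

The matrix is strictly diagonally dominant for |ρ|<1 (since 3/2+ρ² > 2|ρ|), so the Cholesky factorization always succeeds and the determinant term in the approximate marginal is well defined.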
Jean-Michel Marin (Université Montpellier 2 and Centre de Recherche en Economie et Statistique, Malakoff), Roberto Casarin (University of Brescia) and Christian P. Robert (Université Paris Dauphine and Centre de Recherche en Economie et Statistique, Malakoff)
The authors state that for a given computational cost the use of Markov chain Monte Carlo (MCMC) sampling for the latent Gaussian models rarely makes sense in practice. We wish to argue against this point: MCMC sampling, when combined with other Monte Carlo techniques, can still be an efficient methodology.
To assess this point, we consider the same univariate stochastic volatility model as that studied by the authors in Section 5.3. We assume that μ=0, β^{2}=1/τ and f_{0}∼N{0,β^{2}/(1−φ^{2})}. For θ=(φ,β^{2}), we use a uniform prior distribution on Θ=[0,1]×[0,1].
The self-avoiding MH step is applied particle-wise to the particle set considered as a whole, and targets the modified N-product of the posterior
(Mengersen and Robert, 2003). The second term in the product forces the particles to explore the posterior. Fig. 12 presents the way in which this term acts on a part of the posterior distribution. Increasing ξ_{t} induces a heavier mass reduction in correspondence to the simulated values. Increasing α_{t} reduces the repulsion. After the move step, the weights are updated as follows:
with
Fig. 13 relates to the exploration of the parameter space and shows that a strong repulsion factor, when compared with a weaker repulsion, causes a higher number of particles to explore the tails of π_{t}(θ|y_{t}).
In contrast with the two-step deterministic procedure that is described in the paper, a Monte Carlo hybrid algorithm can thus address simultaneously the estimation and the parameter space exploration problems. The algorithm proposed produces samples from a sequence of distributions and is thus suitable for online applications. A future extension of the Laplace approximation framework to sequential inference problems would be particularly welcome.
Sylvia Richardson (Imperial College London) and Arnoldo Frigessi (University of Oslo)
Given the identity
the strategy in this paper is to produce an approximating model for which the normalizing function is easy to compute as the square root of the determinant of Q plus a diagonal matrix. Alternative approaches have aimed at approximating the partition function Z_{θ,y} directly, starting with Geyer and Thompson (1992). Have the authors given any thought to understanding the link between the two types of approximation?
In this form, the approximation that is used by the authors is equivalent to an implicit approximation for the likelihood term of the original model. Could the authors work out which approximate likelihood is used for some examples? Since all parameters retain their original interpretation, could we simply use this approximate (non-Gaussian) model and forget the original model?
The emphasis is put on evaluating marginal posterior distributions in latent Gaussian models in a range of examples with few hyperparameters. You have demonstrated that the method that you propose is computationally efficient, very accurate and beneficial in this respect in comparison with alternative standard implementations of Markov chain Monte Carlo (MCMC) algorithms. The generic nature of your computational approach is less obvious. MCMC sampling is (much) easier to implement and extend; for example, if we wanted to depart from the latent Gaussian Markov random field, the integrated nested Laplace approximation (INLA) involves complex numerical optimization and programming, and users will mostly rely on existing software, rather than develop their own extensions. INLA requires the user to bring his model into the form of equation (1). This might not be easy, and reparameterization may be necessary, whereas MCMC methods would work with the natural parameterization. We have also experienced some difficulties in using INLA with 3000 latent variables, which could indicate that the optimization requires much random-access memory storage.
One crucial benefit of MCMC algorithms is their capacity to compute easily any joint posterior distribution for parameters of interest. In Section 6.1, you discussed the possibility of obtaining bivariate (and low dimensional multivariate) posterior distributions by using INLA-generated posterior marginals and a copula-like approximation for the joint distribution. If this works well, it would be an important step towards embedding INLA within more general hierarchical structures and retaining the inferential flexibility that is typical of MCMC outputs.
Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)
I congratulate the authors for an excellent paper. Contrary to what is hinted at in Section 7, I think that the paper shares certain principles with Beskos et al. (2006) by devising efficient computational methods for a particular class of models as opposed to appealing to generic Monte Carlo methods. The efficiency comes at the cost of it not being straightforward to use the methodology outside this class.
I have two comments on the methodology: the first concerns the use of the ratio identity for the marginal likelihood
p(y|θ)=π(x,y|θ)/π(x|y,θ), (35)
as opposed to the integral identity
p(y|θ)=∫π(x,y|θ) dx. (36)
Equation (36) has been intensively used in conjunction with importance sampling. Here the authors approximate the denominator with a Gaussian density and use equation (35) to approximate p(y|θ) ‘semiparametrically’. It is interesting to understand the full potential of equation (35). In some cases it can be used for the exact computation of p(y|θ), although this is not obvious from equation (36). An example is the so-called Matérn hard core process (Matérn, 1986), which is obtained as a thinning of an initial Poisson marked point process. A point of the process is selected if its mark is larger than all marks in a radius of range δ>0. Here x∪y are the original marked points, y are the selected points and θ are intensity parameters.
An alternative use of equation (35) was proposed in Beskos et al. (2006) for the simultaneous unbiased estimation of p(y|θ). Suppose that we can simulate from p(x|y,θ) by using rejection sampling with proposal q(x|y). Then we have the identity
p(y|θ)=K α(y,θ),
where K is the envelope constant of the rejection sampler and α(y,θ) is the acceptance probability of the algorithm, which can be estimated unbiasedly with simulations. In Beskos et al. (2006) the simulation was performed without even computing p(y,x|θ) by using retrospective sampling techniques.
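The unbiasedness of the acceptance-rate estimator can be illustrated on a toy conjugate model of our own choosing (not an example from Beskos et al. (2006)): with prior x∼N(0,1), likelihood y|x∼N(x,1) and the prior as rejection proposal, the marginal p(y) equals the envelope constant times the acceptance probability, and the latter is estimated by simulation.

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(1)

def norm_pdf(z, var=1.0):
    return np.exp(-0.5 * z**2 / var) / np.sqrt(2 * np.pi * var)

y = 1.0                    # the (single) observation
M = 1.0 / sqrt(2 * pi)     # envelope: p(x, y)/q(x) = N(y - x; 0, 1) <= M

# Rejection sampler for p(x | y) with proposal q(x) = N(0, 1), the prior:
# accept x with probability p(x, y) / {M q(x)} = N(y - x; 0, 1) / M.
x_prop = rng.standard_normal(200_000)
accept = rng.uniform(size=x_prop.size) < norm_pdf(y - x_prop) / M

# Unbiased estimate of the marginal: p(y) = M * Pr(accept).
p_hat = M * accept.mean()
p_true = norm_pdf(y, var=2.0)   # exact marginal: y ~ N(0, 2)
```

The accepted points are, as a by-product, exact draws from p(x|y), so the same run delivers both the marginal likelihood estimate and posterior samples.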
My second comment relates to the application of the proposed methodology to sequential estimation. Suppose concretely that we deal with state space models with Gaussian state dynamics and data y_{1:T} (the stochastic volatility example that the authors consider is such an instance). The approach of the paper directly covers the problem of computing p(y_{1:T}|θ). Suppose instead that we wish to compute p(y_{1:t}|θ), as t varies from 1 to T. It seems to me that in general the Laplace approximation would require a computational cost of order T^{2} to achieve this. I would generally be very interested in the authors’ perspectives on how the integrated nested Laplace approximation fits in with problems in estimation of state space models.
The following contributions were received in writing after the meeting.
Kjersti Aas (Norwegian Computing Center, Oslo)
I congratulate the authors on presenting us with a most excellent and interesting paper. I am most interested in the possibilities that integrated nested Laplace approximations (INLAs) may open in the field of financial econometrics. Volatility in financial time series is mainly analysed through two classes of models: generalized autoregressive conditional heteroscedastic (GARCH) models (Bollerslev, 1986) and stochastic volatility (SV) models (Taylor, 1982). Whereas GARCH models are relatively straightforward to estimate by using maximum likelihood techniques, the likelihood function in SV models does not have a closed form; hence, they require more complex inferential and computational tools. Much attention has been devoted to the development of efficient Markov chain Monte Carlo (MCMC) algorithms for SV models; see for example Shephard and Pitt (1997) and Chib et al. (2002). However, the latent field and strong correlation structures which are often found in SV models make even well-constructed MCMC algorithms slow and their convergence difficult to assess. Hence, even though SV models are in general recognized to be more flexible than GARCH models (Kim et al., 1998), the latter are still by far the most popular in terms of real life applications.
It is my opinion that the INLA approach that is suggested by Rue and his colleagues may help SV models to exit the academic world and to reach the financial industry. The most basic SV model can be written as
r_{t}=exp(h_{t}/2)ɛ_{t}, (37)
h_{t}=ν+φ(h_{t−1}−ν)+ση_{t}, (38)
where r_{t} and h_{t} are the logarithmic return and log-variance at time t respectively, ɛ_{t} and η_{t} are independent and identically distributed N(0,1) and ν, φ and σ are parameters to be estimated. Hence, this model belongs to the class of latent Gaussian models, for which Rue and his colleagues have shown that MCMC calculations can be substituted by accurate deterministic approximations. One may therefore obtain accurate approximations for the posterior marginals of the latent log-variances as well as of the model parameters in a fraction of the time that is used by well-designed MCMC algorithms.
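A minimal simulation sketch of such a model is given below, assuming the common mean-reverting AR(1) parameterization of the log-variance with a stationary initial condition; the parameter values are arbitrary illustrative choices, not estimates from any data set.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_sv(T, nu, phi, sigma):
    """Simulate a basic stochastic volatility model:
    h_t is a stationary AR(1) around nu, r_t = exp(h_t / 2) * eps_t
    with eps_t, eta_t iid N(0, 1)."""
    h = np.empty(T)
    # draw h_0 from the stationary distribution N(nu, sigma^2 / (1 - phi^2))
    h[0] = nu + sigma / np.sqrt(1 - phi**2) * rng.standard_normal()
    for t in range(1, T):
        h[t] = nu + phi * (h[t - 1] - nu) + sigma * rng.standard_normal()
    r = np.exp(h / 2) * rng.standard_normal(T)
    return r, h

r, h = simulate_sv(T=1000, nu=-1.0, phi=0.95, sigma=0.2)
```

Simulated series of this kind are useful for checking any inferential scheme, approximate or MCMC based, against a known ground truth for the latent log-variances.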
In financial applications, a Gaussian distribution for ɛ_{t} is usually too restrictive. Using the INLA framework, the Gaussian distribution may be replaced by heavy-tailed and skew distributions like the Student t or the normal inverse Gaussian distribution (Barndorff-Nielsen, 1997) without any additional computational cost. Another feature that is often observed for financial data is that volatility responds asymmetrically to positive and negative return shocks (so-called leverage effects). Leverage effects may be incorporated in SV models by letting the noise terms ɛ_{t} and η_{t} be correlated (Harvey and Shephard, 1996). Does the INLA approach allow for approximate inference even for this kind of model?
Sudipto Banerjee (University of Minnesota, Minneapolis)
I congratulate the authors for a fine paper on Bayesian computation for a very general class of models. The integrated nested Laplace approximations (INLAs) that are developed here can potentially abate the computational concerns that are associated most conspicuously with Bayesian methods. The benefits of these ‘fast and accurate’ approximations will be more pronounced for non-Gaussian likelihoods, where marginalization of the latent effects does not render easily tractable distributions and Markov chain Monte Carlo sampling suffers from high autocorrelations and slow convergence.
However, the conditional independence assumption of the latent effects considerably restricts its scope. The INLA uses conditional independence to achieve a sparse Q(θ) in expression (9), which is crucial in speeding up computations. Yet, this assumption precludes several more challenging Bayesian models. In particular, I am dubious about the effectiveness of the INLA with the spatial and spatiotemporal models that are alluded to in part (c) of Section 1.2—especially for high dimensional spatial models. Here the random effects often arise as realizations of a Gaussian process with a correlation function and, in general, the Markovian dependence is lost. Relaxing the assumption of conditional independence would significantly detract from the computational benefits of the INLA and π(x) will now need to be approximated by a Gaussian Markov random field. Such approximations have been explored elsewhere (Rue and Tjelmeland, 2002; Rue and Held, 2005), but this approach is best suited for spatial locations on a regular grid. With irregular locations, realignment to a grid or a torus is required, which is done by an algorithm, possibly introducing unquantifiable errors in precision. Adapting these approaches to richer hierarchical spatial models is potentially problematic.
Another brief remark: the INLA does not deliver posterior samples. The power of Bayesian analysis lies in the vast array of model assessment and model choice techniques facilitated by sampling-based inference. For instance, posterior samples of θ immediately yield the posterior of g(θ) for any function g(·). Although the authors have discussed computing posterior predictive distributions and model choice measures such as the deviance information criterion, these (and many others) may not be as easily obtained as from Markov chain Monte Carlo sampling.
Finally, the GMRFLib C libraries have been useful to me in my research and I welcome the proposal to translate this technology into an R package. This will considerably enhance its accessibility to researchers in diverse fields and only then will the INLA be put to its true test.
George Casella (University of Florida, Gainesville)
It was a pleasure to hear the presentation of the authors, and I join in congratulating them for a stimulating paper and a new approach to the problem. In this discussion I would like to point out one fact about the approximation that should be noted. The approximation in equation (3) starts from the fact that, from the calculus of probabilities (Bayes's rule),

π(θ|y) = π(x,θ,y)/{π(y) π(x|θ,y)},

but ignores the factor π(y), which the authors call the normalizing constant. Hence, approximation (3) suffers from both an approximation error (due to the Laplace approximation) and the error from ignoring π(y), the marginal distribution of y. Conditions under which the error in the Laplace approximation will improve, as discussed by the authors in Section 4, will not fix the normalizing constant, which can only be taken care of by renormalization. As an example, consider a very special case of equation (1) where
where Y and X are p-dimensional vectors, 1 is a vector of 1s and θ is a scalar. In this case all approximations are exact, and the correct π(x|y), using equation (5) exactly, is
with A=I−δ^{2}σ^{2}11^{′}/{(σ^{2}+τ^{2})(p δ^{2}+τ^{2})}. However, if equation (5) is used with approximation (3), the resulting approximation to π(x|y) is
Thus, without renormalization there is potential for large errors, even if the Laplace approximation is accurate, so equation (5) should never be used without normalization. The authors, of course, know this, as it is discussed in Section 4.1. Moreover, in implementing the approximation in Section 3.2.1, they explicitly normalize (see the discussion after expression (17)), and in the implementation in Section 3.2.2 there is implicit normalization.
However, this discussion highlights another difference between the Markov chain Monte Carlo approach and the approximation approach. The more computationally intensive Markov chain Monte Carlo approach will often circumvent the calculation of the normalizing constant. But this must always be done in using the approximation, and computation of the normalizing constant can sometimes, in itself, be a difficult task.
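Casella's point can be checked numerically. The sketch below uses a hypothetical one-dimensional conjugate model (my own toy, not the p-dimensional example above): the unnormalized numerator π(y|x) π(x) integrates to π(y) rather than to 1, and renormalizing over a grid recovers the exact posterior.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# Toy conjugate model (illustrative only): y | x ~ N(x, 1), x ~ N(0, 1).
# The exact posterior is x | y ~ N(y/2, 1/2), and pi(y) = N(y; 0, 2).
y = 1.3

def unnormalized(x):
    # pi(y | x) pi(x): proportional to pi(x | y), but off by the factor pi(y).
    return norm.pdf(y, loc=x, scale=1.0) * norm.pdf(x, loc=0.0, scale=1.0)

grid = np.linspace(-6.0, 6.0, 2001)
vals = unnormalized(grid)

# Without renormalization the curve integrates to pi(y), not to 1.
mass = trapezoid(vals, grid)
print(mass)   # equals pi(y) = N(1.3; 0, sqrt(2)), far from 1

# Renormalizing over the grid recovers the exact posterior density.
post = vals / mass
exact = norm.pdf(grid, loc=y / 2.0, scale=np.sqrt(0.5))
print(np.max(np.abs(post - exact)))   # negligible numerical error
```

The same renormalization-by-quadrature is what the authors perform after expression (17); the toy merely isolates why it cannot be skipped.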
Nikolaos Demiris (Medical Research Council Biostatistics Unit, Cambridge)
The authors are to be congratulated for an excellent paper that could enable the routine use of this wide class of models among practitioners. I have two questions that the authors may be able to shed some light on. From a practical perspective, interest often focuses on possibly non-linear functions (or functionals) of the parameters θ and/or x. Inference for such functions is typically trivial in a simulation-based approach such as Markov chain Monte Carlo sampling. Matters do not appear as straightforward in the new approach, and some generic guidance on the functions that may or may not be covered by methods based on integrated nested Laplace approximations would be particularly valuable. From a theoretical viewpoint, I wonder whether the authors considered a non-Bayesian likelihood-based approach based on their methods. The formula π(θ|y)=π(θ,x|y)/π(x|θ,y) may also be written as π(y|θ)=π(y,x|θ)/π(x|θ,y). In view of the similarity between Laplace approximations and the p*-formula (Barndorff-Nielsen, 1983), it appears that there may be some scope for further exploration of this issue.
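The likelihood identity π(y|θ) = π(y,x|θ)/π(x|θ,y) can indeed be exploited directly: approximating the denominator by a Gaussian at the mode gives a Laplace approximation to the likelihood itself. A minimal sketch, for a hypothetical one-dimensional Poisson model with a Gaussian latent effect (a toy of my own, not an example from the paper):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad
from scipy.special import gammaln

# Toy latent Gaussian model: y | x ~ Poisson(exp(x)), x | theta ~ N(0, theta).
# Laplace approximation to pi(y | theta): evaluate the joint pi(y, x | theta)
# at the mode x_hat and divide by the Gaussian approximation to pi(x | theta, y),
# which at its own mode equals sqrt(hess / (2 pi)).
y, theta = 4, 1.0

def log_joint(x):
    return (y * x - np.exp(x) - gammaln(y + 1)   # log pi(y | x)
            - 0.5 * x**2 / theta                 # log pi(x | theta)
            - 0.5 * np.log(2.0 * np.pi * theta))

res = minimize_scalar(lambda x: -log_joint(x), bounds=(-10, 10), method="bounded")
x_hat = res.x
hess = np.exp(x_hat) + 1.0 / theta               # -d^2/dx^2 of log_joint

# pi(y | theta) ~= pi(y, x_hat | theta) / N(x_hat; x_hat, 1/hess)
laplace = np.exp(log_joint(x_hat)) * np.sqrt(2.0 * np.pi / hess)

# Exact marginal likelihood by quadrature, for comparison.
exact, _ = quad(lambda x: np.exp(log_joint(x)), -10.0, 10.0)
print(laplace, exact)   # the two agree closely for this well-behaved posterior
```

Maximizing such a Laplace-approximated likelihood over θ is the non-Bayesian counterpart of the authors' exploration of π̃(θ|y).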
Richard Everitt (University of Bristol)
The authors are to be congratulated on their work for two reasons: firstly, for the interesting technical approach and, secondly, for highlighting why Monte Carlo methods should not always be the first resort when tackling complex inference problems. As is mentioned in the paper, the use of an approximation to the true posterior has received much attention in machine learning but is largely out of favour in statistics. Although the theoretical guarantees regarding the Markov chain Monte Carlo errors are appealing, for practical applications there is certainly a place for approximation-based methods. Even if Monte Carlo algorithms are truly the only viable approaches, it seems likely that approximation-based methods could be used to improve the efficiency of these algorithms.
An important aspect of the integrated nested Laplace approximation approach is its generality. The ability to obtain accurate results easily for a wide variety of models is important in itself, but the fact that this can be achieved with minimal tuning is highly significant. The design time that is involved in an inference method is frequently overlooked but makes a large difference to the user of a technique.
As noted in the paper, the clear limitation of the method as it stands is in cases where the dimensionality of the hyperparameters is high. Do the authors believe that there is any scope to tackle this problem through making assumptions about the distributions or conditional independences of the hyperparameters?
It is interesting that in the discussion the authors mention the parallel implementation of the algorithm. Given the recent shift towards parallel computing in home and office computers (using either multicore processors or graphics cards) it seems that parallel implementation of algorithms is an important topic in computational statistics and should receive more attention than it does at present.
I look forward to seeing this method used to tackle applications in signal and image processing.
Ludwig Fahrmeir (Ludwig-Maximilians-Universität, Munich) and Thomas Kneib (Georg-August-Universität, Göttingen, and Ludwig-Maximilians-Universität, Munich)
Our first comment relates to intrinsic Gaussian Markov random field (GMRF) priors. Whereas the precision matrix Q is assumed to be non-singular in Sections 1 and 2, examples 5.4 and 5.5 employ intrinsic GMRF priors with singular Q. The difference between singular and regular prior precision may be ignorable for model estimation, but proper latent Gaussian models seem to be necessary to guarantee that marginal likelihoods and Bayes factors are well defined in Section 6.2. In contrast, the predictive measure (Section 6.3) and the deviance information criterion (Section 6.2) only require propriety of the posterior which, under regularity conditions, can be guaranteed for a wide range of structured additive regression models even when considering partially improper priors (Fahrmeir and Kneib, 2009). Generally, the question arises of which situations require regular precision matrices Q.
The authors compare variants of the integrated nested Laplace approximation (INLA) procedure with fully Bayesian inference based on Markov chain Monte Carlo sampling. The simplest variant of the INLA seems to be closely related to mixed-model-based inference, which rests on Laplace approximations and restricted maximum likelihood estimation for smoothing parameters. Mixed-model-based inference in semiparametric regression has gained much attention in recent years, so it would be interesting to contrast it with the INLA procedure. Whereas Ruppert et al. (2003) and Kauermann (2005) took a frequentist perspective on mixed-model-based inference, it can also be interpreted as an empirical Bayes approach (Fahrmeir et al., 2004; Kneib and Fahrmeir, 2006, 2007).
Our final comment relates to the extensibility of the INLA procedure. The general framework in Sections 1.3, 2.1 and 2.2 suggests that there is only one stage in the model hierarchy involving GMRFs. However, models with more than one latent Gaussian hierarchy may be of interest when local adaptivity of the smoothing parameter is to be achieved (see for example Lang et al. (2002), Lang and Brezger (2004), Baladandayuthapani et al. (2005) and Crainiceanu et al. (2007) for adaptive regression function smoothing or Brezger et al. (2007) for adaptive surface smoothing). In random-walk smoothing, the variance could itself be assumed to follow a random-walk prior. Recently, Krivobokova et al. (2008) proposed an approach based on Laplace approximations for locally adaptive penalized spline estimation. Although presented in the spirit of penalized likelihood estimation, we speculate that their Laplace approximations could successfully be improved by adapting the INLA approach.
Marco A. R. Ferreira (University of Missouri, Columbia)
I congratulate Professor Rue and his colleagues for their important contribution to the development of fast and accurate computations for the analysis of latent Gaussian models. In addition, I applaud the authors for making open source software for the implementation of their methods available on the Internet.
Their methodology makes use of two properties that are usually valid for latent Gaussian models: first, the latent Gaussian process is a Markov random field; second, the vector of hyperparameters is low dimensional. In addition, the methodology implicitly assumes that the mode of the full conditional distribution of x for a given θ and the mode of the approximate marginal posterior for θ are unique. As these assumptions are satisfied by many of the latent Gaussian models that are currently used in practice, their methodology will bring substantial savings in computational time for end users.
However, I wonder what happens when the posterior distribution has multiple modes. In that case, quasi-Newton methods will find only one local mode, and all subsequent computations will be based on that specific local mode. More critically, the proposed method for assessing the approximation error will most likely be unable to detect the inadequacy of the analysis: the simplified Laplace and the Laplace approximations based on the same local mode will probably have a small symmetric Kullback–Leibler divergence. Thus, to assess how appropriate the use of their method is in a specific problem, it seems extremely important to verify the uniqueness of the posterior mode. Nevertheless, this is usually a daunting task for most highly complex latent Gaussian models.
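A cheap, if partial, safeguard is multi-start optimization: run the quasi-Newton search from several dispersed starting values and check whether all runs agree. A sketch on a hypothetical bimodal one-dimensional posterior (an equal mixture of two Gaussians; nothing here comes from the paper):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical bimodal log-posterior: equal mixture of N(-2, 1) and N(2, 1).
def neg_log_post(z):
    x = z[0]
    return -np.log(0.5 * np.exp(-0.5 * (x + 2.0)**2) +
                   0.5 * np.exp(-0.5 * (x - 2.0)**2))

def curvature(x, h=1e-4):
    # Finite-difference check that a stationary point is a genuine minimum
    # of the negative log-posterior (a mode, not an antimode).
    f = lambda v: neg_log_post([v])
    return (f(x - h) - 2.0 * f(x) + f(x + h)) / h**2

# Multi-start quasi-Newton (BFGS) search from dispersed starting values,
# collecting the distinct local modes that are found.
modes = []
for start in np.linspace(-5.0, 5.0, 11):
    r = minimize(neg_log_post, x0=[start], method="BFGS")
    x = r.x[0]
    if r.success and curvature(x) > 0 and all(abs(x - m) > 1e-3 for m in modes):
        modes.append(x)

print(sorted(modes))   # two distinct modes, near -2 and +2
```

A single run started near the origin would report one mode and proceed as if the posterior were unimodal, which is precisely the failure mode described above; the multi-start sweep at least gives the analysis a chance to notice.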
Montserrat Fuentes (North Carolina State University, Raleigh)
I congratulate Rue, Martino and Chopin for bringing forward new critical approaches to perform approximate Bayesian inference for latent Gaussian models. My comments focus on some of the limitations of the proposed Laplace approximation methods in the context of making Bayesian inference for spatial or spatiotemporal data. Nowadays, statisticians are frequently involved in the spatial analysis of huge data sets. One of the main challenges when analysing continuous spatial processes and making Bayesian spatial inference is calculating the likelihood function of the covariance parameters. For large data sets, calculating the determinants that we have in the likelihood function can often be infeasible. Spectral methods could be used to approximate the likelihood and to obtain the maximum likelihood estimates of the covariance parameters (e.g. Fuentes (2007)). Stein et al. (2004) proposed another spatial likelihood approximation method to reduce the computation of Vecchia's (1988) approach and to improve the efficiency of the estimated covariance parameters. Banerjee et al. (2008) have introduced some methodology for rank reduction. One of the main constraints of the elegant methods proposed by the authors is that they rely on having the maximum likelihood estimates of the covariance or correlation parameters. Therefore, in most practical settings when working with large continuous spatial processes, their approach could not be implemented unless it is combined with some other approximation approach (like those mentioned above) to obtain those maximum likelihood estimates.
In a spatial setting, often the observations are spatially correlated even after conditioning on a latent spatial process. In those situations, the residual correlation can be introduced via a spatial (or spatiotemporal) copula approach. For instance, if the interest is in spatial extremes a generalized Pareto distribution could be used such that the parameters of this distribution vary according to a latent Gaussian spatial model capturing spatial dependence. However, it is likely that there is spatial dependence which is unexplained by the latent spatial specifications for the distribution parameters. In these common situations with observations that are not independent given the spatially varying parameters, if the number of observations is large, then the Laplace approximations would not facilitate the inference, because the main computational challenge is due to the copula and the large covariance matrices in the likelihood function. A spectral approximation to the likelihood of the spatial correlation parameters would again facilitate the computational burden.
The Laplace approximation would not work well in situations where a good representative of the probability mass is not a local maximum, which could be easily the case in spatial settings with complex nonstationary patterns.
Alan Gelfand (Duke University, Durham)
The authors are to be congratulated on a valuable contribution for Bayesian computation. This work shows the full maturation of effort that Rue and his colleagues have expended for nearly a decade. Over time has come an increased appreciation of the subtleties that arise in posterior inference for hierarchical models along with associated computational tricks to best effect approximation at various places. However, a less sophisticated user will struggle to appreciate quantitatively when more sophisticated approximations are needed, much less to assess their value for any particular application. Unless the authors’ technology becomes a ‘black box’, I am sceptical of its widespread usage.
I was struck by the underlying presumption of a conditionally independent hierarchical model specification. It reminded me that, arguably, this setting was the most successful application of Laplace approximation in the 1980s (see, for example, Kass and Steffey (1989)). It also reminded me of the need to build a new approximation for each posterior of interest as well as the need for high dimensional posterior mode evaluation, both of which are still part of the authors’ new technology.
My main point, as one who now works primarily with space and space–time data, is that I believe that the authors' approximate inference strategy will still suffer some of the problems that plague Markov chain Monte Carlo (MCMC) methods in fitting customary models for such data. Following the authors' setting, suppose a spatiotemporal process yielding a conceptual binary observation at an arbitrary location and time. As a first comment, conditional independence may be unsuitable here since even a latent Gaussian process model yielding smooth realizations for the binary probabilities need not reveal spatial pattern in the realized binary outcomes. But, suppose that we seek to learn about the parameters of the covariance function of the Gaussian process as well as to infer about the binary probabilities over space and time. It is well known that the variance of the Gaussian process model and the space and time decay parameters are weakly identified. Hence, MCMC implementations struggle. But these parameters fall into the authors' θ, and I would be concerned about how well the approximation works in this case. The key issue here is prior specification, with very informative priors needed for some parameters. On attending to this, we have found (Banerjee et al., 2008) that MCMC sampling with a predictive process approximation, introducing a small amount of white noise, handles these models very effectively and is essentially off the shelf.
Andrew Gelman (Columbia University, New York)
Statisticians often discuss the virtues of simple models and procedures for extracting a simple signal from messy noise. But in my own applied research I constantly find myself in the opposite situation: fitting models that are simpler than I would like—models that clearly miss important features—because of limitations of computing speed and memory.
But, after decades of Moore's law, it is only fair to describe these as limitations on our computational procedures. I routinely want to fit models that cannot be fitted by using existing software, even though I know that a sufficiently simple algorithm must be out there to fit it using much less than the capabilities of a modern desktop computer.
Examples include hierarchical models for parallel time series (e.g. trends in public opinion in each of 50 states, or models for stochastically aligning tree ring data) and varyingintercept, varyingslope logistic regressions (in which case a covariance matrix needs to be modelled for the group level structure).
When fitting such models I lurch between various approximate methods based on point estimates and full Gibbs–Metropolis steps which can be slow if not guided well. These two approaches meet in the middle: approximations can be iteratively adjusted, leading ultimately to a Gibbslike stochastic procedure, and Markov chain Monte Carlo (MCMC) sampling becomes more efficient when guided by approximations that have been tailored to the problem at hand.
I welcome the paper under discussion because it provides a more general way to construct these approximations. I suspect that, in addition to being a competitor to Gibbs and Metropolis algorithms, this approach ultimately can be used to make these stochastic algorithms more efficient.
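One concrete version of this is to let a deterministic approximation drive the proposal of an independence Metropolis sampler. The sketch below uses a hypothetical skew-normal target with a Laplace-style Gaussian fit at its mode (none of this is from the paper); most proposals are accepted because the approximation already captures the bulk of the posterior.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_ndtr

rng = np.random.default_rng(0)

# Hypothetical non-Gaussian target: skew-normal log-density up to a constant.
alpha = 3.0
def log_target(x):
    return -0.5 * x * x + log_ndtr(alpha * x)

# Cheap deterministic approximation: a Gaussian fitted at the mode.
res = minimize_scalar(lambda x: -log_target(x), bounds=(-5, 5), method="bounded")
mode = res.x
h = 1e-4
curv = -(log_target(mode - h) - 2 * log_target(mode) + log_target(mode + h)) / h**2
sd = 1.0 / np.sqrt(curv)

# Independence Metropolis sampler proposing from the Gaussian approximation.
x, accepted, draws = mode, 0, []
n_iter = 20000
for _ in range(n_iter):
    prop = rng.normal(mode, sd)
    # Acceptance ratio: target ratio times reversed proposal ratio
    # (the Gaussian normalizing constants cancel).
    log_ratio = (log_target(prop) - log_target(x)
                 + 0.5 * ((prop - mode) / sd)**2
                 - 0.5 * ((x - mode) / sd)**2)
    if np.log(rng.uniform()) < log_ratio:
        x, accepted = prop, accepted + 1
    draws.append(x)

print(accepted / n_iter)   # a large fraction of proposals is accepted
```

The better the deterministic fit, the closer the acceptance rate gets to 1, which is the sense in which an approximation of the authors' kind can make a stochastic algorithm more efficient.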
As the authors note, a challenge remains with problems with many hyperparameters. It might help to model the hyperparameters explicitly with a hierarchical model rather than to consider them as unconstrained in some potentially large space.
I conclude with some history. 20 years ago, importance sampling was commonly viewed as an exact method, with MCMC sampling as a sometimes necessary but unfortunate approximation. For example, it was sometimes proposed to start a computation with MCMC sampling and then to finish with importance sampling to obtain an exact result. Eventually, though, statisticians realized that actually existing importance sampling is not exact but can instead be viewed as just another iterative simulation method, and one that has no particular advantages over the Metropolis algorithm or other more clearly iterative approaches (Gelman, 1991). As noted by Rue and his colleagues, now MCMC sampling is often perceived to be ‘exact’, but in practice it is not.
John Haslett, Michael Salter-Townshend and Nial Friel (Trinity College Dublin)
With the assistance of Håvard Rue, we have been using integrated nested Laplace approximations (INLAs) in inverting a multivariate nonparametric regression, i.e. given training data data={(y_{i},c_{i}); i=1,…,n} we study the distribution of c given y_{new}; in our application, c is two dimensional. With INLAs, it is now possible to do fast, rather general, Bayesian, multivariate, nonparametric regression without Markov chain Monte Carlo (MCMC) methods. Further, by an extension of their identities in Section 6.3, it is now possible to do fast inverse cross-validation. For example, if y is multivariate—and there are cases within the applied literature where dim(y)>50—then model evaluation in the inverse sense is more natural. The MCMC difficulties that were addressed in Haslett et al. (2006) and Bhattacharya and Haslett (2007)—in particular those of model evaluation—can now be completely circumvented.
The simplest case is as follows. A univariate count y is related to c via the density π{y;θ_{y},x(c)} to a smooth latent scalar function x(c) itself modelled as a stochastic process, with smoothness parameters θ_{y}; in fact y and x(c) can be multivariate. One simple version leads to
where k is a normalizing constant. With Rue and his colleagues, we model x(c) as an a priori Gaussian Markov random field on a finite lattice C; invoking the INLA, x(c) is approximately Gaussian a posteriori under the simplest of their approximations, with parameters given by the mean and variance of π_{G}(·) evaluated at the mode. The evaluation of the integral by quadrature is fast and adequately accurate. Since C is finite, we can normalize by evaluating the right-hand side for all c ∈ C.
It is possible to go further with the INLA, since fast approximate cross-validation within the given data is now possible, i.e. we can evaluate π(c|y_{i},data_{−i}) and compare with the known c_{i} for every i. But, as we require for each i an evaluation at all c, the updates in Section 6.3 no longer suffice. However, approximate fast updates are available, in the same spirit as the rank 1 constraint in their equation (8); see Salter-Townshend (2008) for details. A very powerful Bayesian tool is thus available without MCMC methods.
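The inverse step can be sketched in a few lines. Everything below is a hypothetical toy (a scalar c, a Poisson count, an assumed Gaussian posterior for x(c) at each lattice point); it only illustrates the normalization over a finite lattice C described above, not the discussants' actual application.

```python
import numpy as np
from scipy.stats import norm, poisson
from scipy.integrate import trapezoid

# Finite lattice of candidate covariate values c.
C = np.linspace(0.0, 10.0, 101)

# Assumed (hypothetical) Gaussian posterior for the latent x(c) at each
# lattice point, as a forward fit might deliver: mean m(c), sd s(c).
m = np.sin(C / 2.0) + 1.0
s = 0.2 * np.ones_like(C)

y_new = 7   # new count whose covariate value c we wish to infer

# pi(c | y_new) propto integral of Poisson(y_new; exp(x)) N(x; m(c), s(c)^2) dx,
# evaluated by simple quadrature on an x-grid for each lattice point.
xg = np.linspace(-2.0, 4.0, 400)
lik = np.array([trapezoid(poisson.pmf(y_new, np.exp(xg)) * norm.pdf(xg, mi, si), xg)
                for mi, si in zip(m, s)])

# Because C is finite, normalization is just a sum over the lattice.
post_c = lik / lik.sum()
print(C[np.argmax(post_c)])   # most probable covariate value given y_new
```

Note that the resulting π(c|y_new) may well be multimodal (several c-values can map to the same latent level), which is exactly why normalizing over the whole lattice, rather than reporting a single inverse estimate, matters.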
One limitation is the dimensionality of C. When C is two dimensional on, for example, a 50×50 lattice, the GMRFLib routines are more than adequately fast. But even for a 30×30×30 three-dimensional lattice we encounter problems. What might the authors recommend?
Tom Heskes and Botond Cseke (Radboud University, Nijmegen)
The authors are to be congratulated for a very interesting and stimulating paper. For the special case of sparse Gaussian processes with a small number of hyperparameters, the authors provide an automated procedure for approximate inference, producing very accurate results, which is orders of magnitude faster than Markov chain Monte Carlo methods.
We deeply appreciate the authors’ efforts to relate their own approach to the deterministic approximations that have been developed in the machine learning literature. Following up on that, we shall attempt to shed some light on the link to expectation–propagation (EP) and discuss whether it could be used as an alternative to the Laplace approximation.
Computing posterior marginals for the latent variables x_{i}
For accurate approximations of the posterior marginals for the latent variables the authors have to go beyond the Gaussian approximation that they used for computing the posterior marginals for the hyperparameters. A full nested Laplace approximation is (way) too expensive, as would be a full nested EP approximation, and the authors introduce several clever tricks to obtain faster approximations thereof. Although it is not clear to us how generally applicable these approximations are, this appears to be the most important contribution of the paper. It is fair to say that (to the best of our knowledge) there are no deterministic approximations in the machine learning literature that even attempt to reach the same level of accuracy. A recent interpretation of EP as a series expansion (Opper et al., 2009) may be turned into an alternative approach.
Nils Lid Hjort (University of Oslo) and D. M. Titterington (University of Glasgow)
The authors are to be congratulated for what promises to be a very influential contribution to practical Bayesian analysis. The methodology is very well thought out and the examples are convincing. Of course, there are many models involving latent variables that do not fit the Gaussian framework that is considered in this paper, so exploration of approximations such as variational Bayes (VB) methods and expectation–propagation is still appropriate. As mentioned in Section 1.6, VB methods tend to underestimate posterior variance, and experience in references such as Wang and Titterington (2005) is that the VB posterior distribution is typically ‘similar’ to that corresponding to complete-data analysis. Indeed, for the scenario that is considered in Appendix A, if x were known, then
If now we note that E(x^{T}Rx)=n/θ, assume as in the paper that the data are not very informative and, for a rough-and-ready calculation, substitute θ by the mean a/b of a Γ(a,b) distribution, then the complete-data posterior distribution becomes the same as the VB approximation that is stated in the paper.
Although not noted at the time, this tie-up between the VB approximation and the complete-data result is manifest in the numerical illustration of a mixture of two known densities in Humphreys and Titterington (2000). The best of the approximate methods that were illustrated there was the so-called probabilistic editor, a recursive method based on matching first and second moments; see for example section 6.2.1 of Titterington et al. (1985). Again in hindsight, the corresponding empirical posterior variance can be seen to be very close to that from the gold standard Gibbs sampler. In fact, the probabilistic editor can be regarded as a recursive version of expectation–propagation. Similar empirical results are available in Stephens (1997) and Minka (2001) and, much more recently, we have exploited this link to establish that, at least for this and some other simple mixture problems, expectation–propagation gets the posterior variance right, asymptotically.
Finally, we have two questions. First, is there any hope of a version of the authors' approach that at some level handles scenarios, like mixtures, in which the latent variables are anything but Gaussian? Secondly, we wish to point to the ‘model builder’ methodology and software that were developed by Skaug and Fournier (2002) and others, which make it possible to fit quite general non-Gaussian hierarchical models with latent variables, also using Laplace approximations. Are there connections to the present paper, and are there classes of models where both approaches may be used?
Jim Hodges (University of Minnesota, Minneapolis)
Faster computing methods are always good news, and I congratulate the authors on a brave, interesting body of work. I have a big concern here, however: the authors' approach appears to require unimodal posteriors, and the authors seem to think that, in the relevant set of problems, multimodal posteriors are so rare that they can be ignored. Specifically, they say in Section 1.5, ‘For most real problems and data sets, the conditional posterior of x is typically well behaved and looks ‘‘almost’’ Gaussian’ and, again, in Section 6.2, ‘Fortunately, latent Gaussian models generate unimodal posterior distributions in most cases’. They present no evidence to support this sanguine view, because no such evidence exists. The only systematic evidence that I know of is Liu and Hodges (2003), which showed that, in the simplest possible case, the balanced one-way random-effects model, the joint marginal posterior of the two variances becomes bimodal quite readily. Moreover, since the restricted likelihood is always unimodal in this problem, the bimodality arises entirely because of the prior, which suggests that the prior can create bimodality in any problem. Further examples of bimodal posteriors for simple models and real data sets include a two-level normal errors model (Wakefield, 1998) and a conditional autoregressive model with two classes of neighbour relations (Reich et al., 2007). And these are just from my own work; there is no reason to think that I am a magnet for freak problems.
I would argue, therefore, that the authors need to take multimodality more seriously. Perhaps they can handle multimodality with a modest extension of their method, which would be good news indeed, but, until they show that they can detect bimodality reliably and then work their approximation in spite of it, they really need to tone down the sales pitch.
Borus Jungbacker and Siem Jan Koopman (VU University Amsterdam)
We congratulate the authors for an interesting paper. Their main proposal is to rely on Laplace approximations for the observation density π(y|x;θ), with respect to both the latent process, represented by x, and the parameters that drive the latent processes (which are sometimes referred to as hyperparameters), represented by θ. Although it is not recognized by the authors, likelihood-based methods of this kind were introduced in the time series literature in papers such as Shephard and Pitt (1997) and Durbin and Koopman (1997), who used approximating Gaussian models based on Laplace approximations similar to those described in Section 2.2. The generality of this approach for inference on θ is made evident in Jungbacker and Koopman (2007).
Many different inference procedures have been developed for the stochastic volatility model and they are usually quite successful. It would, however, be more convincing if the methods were also successful for a stochastic volatility model with leverage. This requires (negative) correlation between the standard normal sequences ɛ_{t} and η_{t} in the model

y_{t} = exp(x_{t}/2) ɛ_{t},        x_{t+1} = μ + φ(x_{t} − μ) + σ η_{t},

where μ, 0<φ<1 and σ>0 are fixed unknown coefficients; see Jungbacker and Koopman (2007) for further details concerning estimation based on Laplace approximations.
Andrew Lawson (Medical University of South Carolina, Charleston)
The authors are to be commended on a very interesting and innovative paper. There is a need for alternatives to sometimes time-consuming Markov chain Monte Carlo (MCMC) posterior sampling. Although the authors provide several convincing examples of the efficiency of the approach, I have a general concern about the flexibility of the approach and the need for tuning. In the disease mapping example (5.4) the authors assume particular priors, such as N(0,0.01) for an intercept, whereas in other examples they use diffuse Gaussian priors for regression parameters. The authors also always use gamma priors for precisions. There are issues arising from these choices.
 (a) Are the informative priors needed for the disease mapping example? If so, what effect does this have on the modelling approach?
 (b) A more general question is how sensitive the approximation is to changes in the prior distribution, and how easy it is to implement such changes. For example, if I chose not to use gamma priors for precisions and used instead the half-Cauchy priors of Gelman (2006), what would the effect be?
 (c) It would also be interesting to find out how well posterior functionals can be computed (e.g., in the disease mapping example, Pr(p_{i}>0.5)). I would guess that there could be a greater problem where tail probabilities are important and shifts of probability mass are found.
 (d) Is the computation of the deviance information criterion more stable than under MCMC sampling, i.e. can you obtain negative p_{D} values?
Currently WinBUGS easily implements prior specification changes and can model spatial data with convolution models (to convergence in a reasonably small number of seconds).
Finally I take issue with two statements:
In relation to the first of these, for a wide range of models MCMC sampling is reasonably fast (try the WinBUGS examples online!). In relation to the second, although this is convenient for the authors' purpose, I do not think that this is common practice at all. It would be more usual to analyse the point locations as a point process.
I guess that implementing the integrated nested Laplace approximations approach for any given model, and flexibly changing that model, would not be easy without easy-to-use software with a graphical user interface, which is not currently available. Hence WinBUGS will remain the package of choice for its ease of use (even though MCMC sampling might be painfully slow in some complex cases).
Youngjo Lee (Seoul National University)
In the Bayesian approach, simulation techniques such as Markov chain Monte Carlo sampling have often been used. Likelihood is informative as well. To reduce the computational burden the authors propose to use the Laplace approximation to various marginal posterior distributions π(θ_{i}|y), π(x_{i}|y) etc. Here, if we take π(θ)=1, we have h-likelihood, and we (Lee and Nelder, 1996, 2001a) proposed to use the Laplace approximation to the marginal likelihood
We (Lee and Nelder, 2001a) showed that the form that is used in the Laplace approximation is identical to the Cox and Reid (1987) adjusted profile likelihood for eliminating fixed parameters. Thus, we can use this form for the Laplace approximation to eliminate both fixed and random parameters simultaneously, by eliminating fixed parameters by conditioning on their maximum likelihood estimators and random parameters by integration. This allows various adjusted profile h-likelihoods (APHLs), e.g. the generalization of the restricted maximum likelihood estimators (Lee and Nelder, 2001a). Consider the Epil example in Section 5.2. Fig. 14 shows various marginal posteriors from OpenBUGS and the corresponding APHLs. They show almost identical plots for both random and fixed effects. However, plots for dispersion components can be different because the authors' inverse gamma prior is informative. This leads to biases when the prior is not right, e.g. when dispersion parameters are not random but fixed unknowns (Jang et al., 2007). APHLs could be used for sensitivity analysis of the choice of priors: if the marginal posterior is somewhat different from the corresponding APHL, the prior could be informative.
Finn Lindgren (Lund University)
A point that may not be apparent from the paper is the usefulness of the integrated nested Laplace approximation (INLA) method in the context of latent spatial Gaussian or geostatistical models. Such models are often specified in terms of covariance models (Diggle and Ribeiro, 2006), but computationally efficient Gaussian Markov random-field (GMRF) alternatives exist (Rue and Held, 2005). The INLA approach promises to be an invaluable tool for practical inference for these types of GMRF models, with a wide range of applications, e.g. in environmetrics and epidemiology. Previously, numerical optimization has been used (Rue and Tjelmeland, 2002) to find GMRF models approximating given covariance models. However, recent advances (Lindgren and Rue, 2007) in methods for obtaining explicit expressions for the precision matrix can be combined with the INLA approach, fully exploiting the efficiency of sparse matrix calculations and direct approximation of the posterior distributions.
Spatial random fields on ℝ^{d} with Matérn covariance functions
r(u) = σ^{2} (κ‖u‖)^{ν} K_{ν}(κ‖u‖) / {2^{ν−1} Γ(ν)},
where K_{ν} is the modified Bessel function of the second kind, are solutions to a fractional stochastic partial differential equation (Whittle, 1954). The stochastic partial differential equation is
(κ^{2} − Δ)^{α/2} x(s) = ε(s),   α = ν + d/2,
where Δ is the Laplace operator and ε(s) is spatial Gaussian white noise. For integer values of α, an explicit GMRF approximation can be derived (Lindgren and Rue, 2007) from a finite element construction with local basis functions. For example, we can derive explicitly that, for a two-dimensional regular lattice, a GMRF with precision matrix elements given by a local stencil corresponds to an anisotropic Matérn covariance function with ν = 1, and ranges √(8a/β) and √(8b/β). The resulting covariance approximation is shown in Fig. 15.
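For the isotropic special case the stencil can be recovered numerically by squaring a five-point discretization of (κ² − Δ). The sketch below (our own illustration, with a hypothetical 11×11 lattice and κ² = 1, not the anisotropic construction in the text) checks the familiar 13-point stencil entries at an interior node.

```python
import numpy as np

# M = kappa^2 I + L, where L is the five-point discretization of -Delta
# (diagonal 4, nearest neighbours -1).  Then Q = M @ M is the nu = 1
# GMRF precision, with interior stencil:
#   centre (kappa^2 + 4)^2 + 4, nearest -2(kappa^2 + 4), diagonal 2, two-away 1.
n, kappa2 = 11, 1.0
N = n * n
idx = lambda i, j: i * n + j

L = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        L[idx(i, j), idx(i, j)] = 4.0
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            if 0 <= i + di < n and 0 <= j + dj < n:
                L[idx(i, j), idx(i + di, j + dj)] = -1.0

M = kappa2 * np.eye(N) + L
Q = M @ M

c = idx(n // 2, n // 2)                       # an interior node
centre = Q[c, c]                              # (1 + 4)^2 + 4 = 29
nearest = Q[c, idx(n // 2, n // 2 + 1)]       # -2 (1 + 4) = -10
diagonal = Q[c, idx(n // 2 + 1, n // 2 + 1)]  # 2
two_away = Q[c, idx(n // 2, n // 2 + 2)]      # 1
```

The sparsity of Q (13 non-zeros per interior row, however large the lattice) is what makes the combination with sparse Cholesky or INLA machinery attractive.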
Generalizations include the construction of (possibly oscillating) fields on non-Euclidean manifolds, such as geostatistical models on a globe, and the inclusion of non-homogeneous differential operators, specified either directly, via spatial deformation (Sampson and Guttorp, 1992), or via covariates. If the extensions are parameterized by using a few parameters, the INLA approach can be used for inference.
William J. McCausland (University of Montreal)
I congratulate the authors on a paper which may be remembered as a key paper marking a transition away from Markov chain Monte Carlo and importance sampling methods. To me, the most obvious and pressing need is a more convincing way to assess approximation error. The authors assess error by comparing two similar approximations of the target distribution with each other; the error properties of each approximation relative to the target itself remain unknown. I hope that the authors or others can find stronger results sanctioning the use of integrated nested Laplace approximations.
I shall also comment on the computational complexity of the simplified Laplace approximation. The cost of computing expression (22) for a given i ‘is the same order as the number of non-zero elements of the Cholesky triangle’. The authors conjecture that the computational complexity of doing this separately for all i is ‘close to the lower limit for any general algorithm’. Although this may turn out to be true, I offer an example of a special case where similar approximations have lower order cost. I believe that there is some promise for extension to more general latent Gaussian models.
In McCausland (2008), I consider models where the latent Gaussian vector (x_{1},…,x_{n}) has a band-diagonal precision matrix with bandwidth 3. State space models with univariate states are important examples. I first express the derivative of the logarithm of the integration factor c_{1}(x_{i}) in
in terms of the function f(x_{i}) = E[x_{i−1}|θ,x_{i},y]. A handful of first-order difference equations then gives the coefficients of cubic approximations for all n of these functions, which leads to cubic approximations of log{π(x_{i}|θ,x_{i+1},…,x_{n})}. This leads in turn to an O(n) approximation of π(x|θ,y) that is extremely close for the example that is considered in the paper. Inspired by Rue, Martino and Chopin, I discovered that a simple extension gives the derivative of log{c_{2}(x_{i})}, where
in terms of E[x_{i−1}|θ,x_{i},y] and E[x_{i+1}|θ,x_{i},y]. The latter can be approximated in the same way as the former, by using the same difference equations and the time-reversed series. One can use these to compute an approximation of all the π(x_{i}|θ,y) at cost O(n).
C. McGrory (Queensland University of Technology, Brisbane), J. Marriott (Nottingham Trent University) and A. N. Pettitt (Queensland University of Technology, Brisbane)
We very much enjoyed reading this paper and are very impressed by the computational speed which the approximate methods enjoy, even for the large data set problem that is described in Section 5.5. We are reminded of the early work on Bayesian computation that was carried out at Nottingham in the 1980s. Naylor and Smith (1982) used iterative Gauss–Hermite quadrature to produce posterior expected values for functions of the model parameters. This relies on the posterior density being approximated by the product of a Gaussian density and a polynomial. They showed how the basic iterative Gauss–Hermite algorithm could be applied to multiple integrals as a Cartesian product rule and subsequently incorporated spherical quadrature in their Bayes4 software (Smith et al., 1987). The Bayes4 software routinely provides univariate and bivariate posterior densities but there is an effective limit on the number of parameters at about 12. Naylor et al. (2008) show how, in some circumstances, this number can be considerably extended; nonetheless we note the contrast that this paper makes.
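The mechanics of Gauss–Hermite quadrature for posterior summaries can be sketched in one dimension (a toy posterior of our own choosing, not the Bayes4 implementation): the posterior is written as a Gaussian kernel times a smooth factor and the quadrature rule integrates the product exactly when that factor is polynomial.

```python
import numpy as np

# Toy posterior proportional to exp(-theta^2 / 2) * (1 + theta^2):
# a Gaussian density times a polynomial, so a 20-node rule is exact.
nodes, weights = np.polynomial.hermite.hermgauss(20)

# hermgauss targets int exp(-x^2) f(x) dx, so substitute theta = sqrt(2) x
# to match the N(0, 1) kernel exp(-theta^2 / 2).
theta = np.sqrt(2.0) * nodes

def non_gaussian_factor(t):
    return 1.0 + t**2

# Normalizing constant and posterior second moment (the sqrt(2) Jacobian
# cancels in the ratio).
norm = np.sum(weights * non_gaussian_factor(theta))
post_mean_sq = np.sum(weights * theta**2 * non_gaussian_factor(theta)) / norm
# Closed form: E[theta^2] = (1 + 3) / (1 + 1) = 2.
```

Iterating such rules, with the Gaussian kernel re-centred at the current posterior mean and variance, is the essence of the Naylor–Smith scheme; Cartesian products of the one-dimensional rule give the multivariate version, which is what caps the workable dimension.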
In the spatial models of Sections 5.4 and 5.5, the Gaussian Markov random field (GMRF) is chosen to be the intrinsic second-order model, which has a tendency to oversmooth. In an attempt to be less smooth, Pettitt et al. (2002) introduced a full rank GMRF which could be used in the integrated nested Laplace approximations context, and we believe that this would be a worthwhile alternative to the intrinsic GMRFs. A further attempt to be less smooth is to use an autologistic or Ising-type model for spatial data with a hidden binary variable (e.g. McGrory et al. (2008)), whereas Weir and Pettitt (2000) used a thresholded GMRF.
Another possibility (see Woolrich and Behrens (2006)) is to use a K-multivariate GMRF prior from which a hidden K-categorical variable can be approximated, to imitate a Potts model. This can be achieved by introducing a prior distribution on the continuous weights vectors for the GMRF, which results in a posterior probability of membership of one of K discrete classes for each observation. We would like to know whether or not the approximations of Section 3.2.2 would be sufficiently good to find good overall approximations to the posteriors in this case.
Debashis Mondal (University of Chicago)
To consolidate the ideas of this paper further, we indicate an empirical Bayes approach in which hyperparameters are estimated by using a set of empirical measurements, and the posterior densities of covariate and underlying spatial effects are then obtained by using the approximations described in Section 3.2, without resorting to Markov chain Monte Carlo computations. Consider the longitudinal model discussed in Section 5.2. Assume, for convenience,
where Z forms the matrix of covariates. From the theoretical expected values of Y_{j,k}(Y_{j,k}−1) and Y_{j,k}Y_{j,k′} for k≠k^{′}, the moment estimate of is
We then minimize the sum of squares of deviances between the observed and the expected values of Y_{j,k} to obtain 0.8073 and 105.296 as respective estimates of and . With this knowledge of hyperparameters, we can now advance calculations as outlined in Section 3.2. Although we sacrifice variability of hyperparameters, the empirical Bayes procedure simplifies the computations.
In spatial mixed linear models on regular arrays, the precision parameters can be estimated by equating the theoretical variogram with the observed variogram at small lags. Once the precision parameters are estimated, the marginal posterior means and variances for covariate and spatial effects can be computed by using the results of Rue and Martino (2007). My ongoing project extends the parametric empirical Bayes estimation of precision parameters to Poisson and binomial spatial models. When the latent Gaussian field has a non-singular Q, we can invoke equation (7) to evaluate marginal moments of the observed data. However, when the latent Gaussian field is ‘first order’ intrinsic, differences of X, rather than X itself, have a proper distribution. Consider deriving an extension of expression (7) as
This can be used to evaluate γ_{i,j} to deduce empirical Bayes estimates of precision parameters after computing appropriate expectations of empirical measurements.
The gain in the empirical Bayes approach can be taken to derive frequentist inference of spatial models, which is a longstanding difficult problem. In particular, estimation of precision parameters can be done by moment (or variogram) expansions of marginal densities, and inference for randomtreatment and spatial effects can be based on conditional distributions or point predictors (e.g. best linear unbiased predictors in mixed spatial models). Approximations in Section 3.2 can be used in computations of these point predictors.
In practice, it is important to obtain marginal posterior means, variances and a selected number of quantiles for covariate or spatial effects. We presume that the skew normal approximation in Section 3.2.3 provides an efficient way to compute these quantities when the log-likelihood is skewed; how does this compare with the general Gauss–Hermite quadrature formula (17)?
John Nelder (Imperial College London)
Since Lee and Nelder (1996) introduced the model class of hierarchical generalized linear models it has been further extended (Lee and Nelder, 2001b, 2006). Random effects can appear in the linear predictors not only for the mean but also for various variance components. With this new class, heavy-tailed distributions can be used for the various components of the model, allowing robust modelling against misspecification of the distribution of random effects (Noh et al., 2005) and various data contaminations (Noh and Lee, 2007a); it also allows modelling of abrupt changes (Yun and Lee, 2006) and modelling of parametric Lévy processes for stochastic volatility in finance data (Castillo and Lee, 2008). The model class (1) in this paper assumes normal random effects only in the linear predictor for the mean, but puts priors on the hyperparameters θ. For inferences about extended models we proposed to use the h-likelihood and introduced various adjusted profile h-likelihoods for inferences about the various components of the model. To approximate the necessary integration we have proposed to use the Laplace approximation; for non-normal random effects the second-order Laplace approximation has been recommended (Lee and Nelder, 2001a, 2007b). With the h-likelihood we can make inferences without inventing priors for hyperparameters.
David Nott (National University of Singapore) and Robert Kohn (University of New South Wales, Sydney)
We congratulate Rue, Martino and Chopin on their paper which addresses the important issue of effective computation for Bayesian inference. The authors demonstrate that the class of latent models that they consider makes fast computations possible and provides important insights and solutions. The results will be used by applied researchers and will also generate new research in Bayesian computation.
Our first comment concerns Section 6.1, where they suggest copula approximations for marginals of subsets of x based on the univariate marginals. We have also recently developed some approximate Bayesian computational methods using copulas, in particular for marginal likelihood computation in Bayesian model comparison. The starting point for this is the so-called candidate's formula, similar to expression (3). Writing the set of all unknowns, including any latent variables, now simply as θ, the formula is
p(y) = p(y|θ) p(θ) / p(θ|y),
which holds for any value of θ; clearly, approximation of the posterior at a point allows approximation of the marginal likelihood. A subset of the parameters could be handled non-parametrically as in the present paper. The Laplace approximation corresponds to the use of a Gaussian approximation for p(θ|y) evaluated at the posterior mode. As an extension, it is natural to approximate p(θ|y) with a Gaussian copula, and this can be done both with and without simulation (Nott et al., 2008). Various extensions, such as the use of copulas with importance-sampling-based methods and the use of the t-copula instead of a Gaussian copula, are possible. It would be interesting to apply copula approximations to posterior distributions based on the methods of the current paper in models where those methods can be applied.
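The candidate's formula with a Gaussian approximation in the denominator can be demonstrated on a conjugate normal–normal toy model (our own illustrative example, not the discussants' copula construction); here the posterior is exactly Gaussian, so the Laplace-style denominator recovers the marginal likelihood exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
tau2, n = 4.0, 10                        # prior variance, sample size
y = rng.normal(1.0, 1.0, size=n)         # y_i | theta ~ N(theta, 1), theta ~ N(0, tau2)

# Exact Gaussian posterior: precision n + 1/tau2
v_post = 1.0 / (n + 1.0 / tau2)
mu_post = v_post * y.sum()

def log_norm(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

# Candidate's formula evaluated at the posterior mode, with the posterior
# replaced by its Gaussian (here exact) approximation:
theta_hat = mu_post
log_py = (np.sum(log_norm(y, theta_hat, 1.0))       # log p(y | theta)
          + log_norm(theta_hat, 0.0, tau2)          # log p(theta)
          - log_norm(theta_hat, mu_post, v_post))   # log of Gaussian approx at mode

# Direct check: marginally y ~ N_n(0, I + tau2 * 1 1^T)
Sigma = np.eye(n) + tau2 * np.ones((n, n))
exact = (-0.5 * n * np.log(2 * np.pi)
         - 0.5 * np.linalg.slogdet(Sigma)[1]
         - 0.5 * y @ np.linalg.solve(Sigma, y))
```

In non-conjugate models the Gaussian (or Gaussian copula) approximation of p(θ|y) is no longer exact, and the quality of the marginal likelihood estimate tracks the quality of that approximation at the evaluation point.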
A second comment concerns the class of models that was considered. Although this class of models is quite broad, the use of a Gaussian latent variable, a small number of hyperparameters and, perhaps most importantly, the conditional independence assumptions for y given x are serious restrictions for many applications. Future developments of the work that is described in Section 6.5 are important, we feel, as are methods for combining the approximations that are proposed with simulation-based methods.
J. T. Ormerod and M. P. Wand (University of Wollongong)
We concur with the authors that good analytic approximations, as an accurate alternative to Markov chain Monte Carlo methods, are worth pursuing. These early results on integrated nested Laplace approximations are impressive and we look forward to seeing how this methodology progresses. In particular we are interested in the advertised interface from R and eventually in giving integrated nested Laplace approximations a ‘test drive’.
Our recent research has involved work on variational approximations for similar models. Most of the discussion in Section 1.6 pertains to a particular version of variational approximation where q(x,θ) = q_{x}(x) q_{θ}(θ). The phrase ‘the variational Bayes approach is not without potential problems’ and the subsequent discussion actually correspond to this one type of variational approximation, even though q(x,θ) can be constrained in other ways. Indeed, some variational approximations, such as those developed in Jaakkola and Jordan (2000), do not involve the Kullback–Leibler contrast. Lastly, the name ‘variational Bayes’ gives the impression that variational approximation is specific to Bayesian approaches, which is not so.
Recently, we have explored some other approaches to variational approximations that exhibit improved accuracy in our test examples. One approach involves applying the Jaakkola and Jordan (2000) tangent transform idea in a gridwise fashion (Ormerod, 2008; Ormerod and Wand, 2008). Another takes the Kullback–Leibler contrast route but restricts q to be in a parametric family, such as the Gaussian distribution. We close with some details on the latter approach, which we call Gaussian variational approximation, for frequentist Poisson mixed models with a single variance component:
 y_{ij}|U_{i} ∼ Poisson[exp{x_{ij}^{T}β + U_{i}}],  U_{i} ∼ N(0,σ^{2}) independently, 1 ≤ j ≤ n_{i}, 1 ≤ i ≤ m. (39)
The log-likelihood of (β,σ^{2}) is
l(β,σ^{2}) = Σ_{i=1}^{m} log ∫ [∏_{j=1}^{n_{i}} exp{y_{ij}(x_{ij}^{T}β + u) − exp(x_{ij}^{T}β + u) − log(y_{ij}!)}] (2πσ^{2})^{−1/2} exp{−u^{2}/(2σ^{2})} du.
A variational approach to handling the m intractable integrals is to multiply the integrand by the quotient of the N(μ_{i},λ_{i}) density function with itself and to invoke Jensen's inequality:
After simplification we obtain the following lower bound on l(β,σ^{2}):
l(β,σ^{2},μ,λ) = Σ_{i=1}^{m} [ Σ_{j=1}^{n_{i}} {y_{ij}(x_{ij}^{T}β + μ_{i}) − exp(x_{ij}^{T}β + μ_{i} + λ_{i}/2) − log(y_{ij}!)} + ½ log(λ_{i}/σ^{2}) − (μ_{i}^{2} + λ_{i})/(2σ^{2}) + ½ ]
for all values of the variational parameters μ = (μ_{1},…,μ_{m}) and λ = (λ_{1},…,λ_{m}). Maximizing over these parameters narrows the gap between l(β,σ^{2},μ,λ) and l(β,σ^{2}), and so sensible estimators of the model parameters are
(β̂, σ̂^{2}) = arg max_{β,σ^{2}} max_{μ,λ} l(β,σ^{2},μ,λ).
Table 1 conveys excellent performance of Gaussian variational approximation when expression (39) is applied to the data that were used in Section 5.2. Early theoretical exploration looks promising.
Table 1. Estimates and approximate 95% confidence intervals for Gaussian variational approximation corresponding to the example in Section 5.2 with the ν_{ij} term omitted†

Parameter   Gaussian variational approximation   Exact
β_{0}       1.924 (1.767, 2.081)                 1.924 (1.766, 2.082)
β_{Base}    0.165 (−0.128, 0.458)                0.165 (−0.128, 0.459)
β_{Trt}     0.842 (0.013, 1.671)                 0.842 (0.014, 1.673)
β_{BT}      −0.366 (−0.805, 0.072)               −0.366 (−0.806, 0.073)
β_{Age}     −0.328 (−1.072, 0.416)               −0.328 (−1.074, 0.418)
β_{v4}      0.236 (0.138, 0.333)                 0.236 (0.138, 0.333)
τ^{−1/2}    0.580 (0.466, 0.723)                 0.581 (0.461, 0.700)
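The mechanics of maximizing such a Gaussian variational lower bound can be sketched for a stripped-down version of the model (one cluster, intercept only; the data and parameter values below are our own hypothetical choices, not the discussants'):

```python
import numpy as np
from math import lgamma

# Toy model: y_j | u ~ Poisson{exp(beta + u)}, u ~ N(0, sigma2).
y = np.array([1.0, 2.0, 0.0, 1.0])
beta, sigma2 = 0.0, 0.25
n = len(y)
const = sum(lgamma(v + 1) for v in y)        # sum of log(y_j!)

def lower_bound(mu, lam):
    # Jensen bound with variational density N(mu, lam)
    eta = beta + mu + 0.5 * lam
    return (y.sum() * (beta + mu) - n * np.exp(eta) - const
            + 0.5 * np.log(lam / sigma2)
            - (mu**2 + lam) / (2.0 * sigma2) + 0.5)

# Maximize over (mu, lam): fixed-point update for lam, Newton step for mu
mu, lam = 0.0, sigma2
for _ in range(100):
    lam = 1.0 / (n * np.exp(beta + mu + 0.5 * lam) + 1.0 / sigma2)
    eta = beta + mu + 0.5 * lam
    grad = y.sum() - n * np.exp(eta) - mu / sigma2
    mu -= grad / (-n * np.exp(eta) - 1.0 / sigma2)
bound = lower_bound(mu, lam)

# 'Exact' log-likelihood by 40-node Gauss-Hermite quadrature
nodes, weights = np.polynomial.hermite.hermgauss(40)
u = np.sqrt(2.0 * sigma2) * nodes
loglik = (beta + u) * y.sum() - n * np.exp(beta + u) - const
exact = np.log(np.sum(weights * np.exp(loglik)) / np.sqrt(np.pi))
```

By Jensen's inequality the bound can never exceed the exact log-likelihood, and for data of this kind the gap is small, which is consistent with the close agreement reported in Table 1.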
Carl Edward Rasmussen (University of Cambridge)
I congratulate Professor Rue and his colleagues for their contribution to developing efficient analytic approximation methods for a wide and practically important class of models.
I am concerned, however, about the extent to which the shortcomings of the Laplace approximation may have been treated too lightly when advocating it as a generally applicable tool. The Achilles heel of the Laplace approximation is expansion around the mode of the distribution. In high dimensions, for non-Gaussian, non-symmetric posterior distributions, the mode may not be typical of the distribution; for a skewed distribution the majority of the mass may lie far to one side of the mode. This is true even for unimodal, log-concave and otherwise fairly harmless distributions. As the Laplace approximation is symmetric around the mode, this may seriously hamper its accuracy.
Gaussian latent variable models with a logistic likelihood are a case in point, which has been studied carefully in the machine learning community, where it is known as Gaussian process classification (Rasmussen and Williams, 2006). Careful comparisons between the Laplace approximation and other analytical approximations, as well as a Markov chain Monte Carlo gold standard (Kuss and Rasmussen, 2005), document exactly this widespread failure mode. The expectation–propagation (EP) method (Minka, 2001) is an alternative analytical approximation method which does not suffer from this problem, as it is based on (approximate) matching of marginal moments. The EP approximation is found to be in close agreement with Markov chain Monte Carlo results and much more accurate than the Laplace approximation in the difficult cases where the posterior deviates significantly from Gaussianity, in terms of both the quality of the approximation to the marginal likelihood and the predictive distribution for test cases.
Kuss and Rasmussen (2005) focused on conditional distributions given the hyperparameters, but the same numerical method that is suggested by Rue and his colleagues can also be used with EP to treat hyperparameters. Although simple implementations of both methods scale cubically with the number of latent variables, comparable implementations of both algorithms (available at http://www.gaussianprocess.org/gpml) show that the Laplace approximation is typically about 10 times faster than EP (and not 8000 times faster as implied in Section 5.4). Additionally, both methods can be sparsified to achieve a further speed-up. Whereas the Laplace approximation clearly has its merits, an exclusive reliance on it would appear hazardous.
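The mode-versus-mean point can be made concrete in one dimension (a toy of our own, not from the paper): for the skewed, log-concave posterior proportional to N(x; 0, 9) times a logistic factor, the mode used by the Laplace approximation sits well below the posterior mean.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Posterior p(x) proportional to exp(-x^2 / 18) * sigmoid(x):
# log-concave and unimodal, but clearly skewed to the right.

# Posterior mode by Newton's method on the log-density
x = 0.0
for _ in range(50):
    grad = -x / 9.0 + sigmoid(-x)              # d/dx log p(x)
    hess = -1.0 / 9.0 - sigmoid(x) * sigmoid(-x)
    x -= grad / hess
mode = x

# Posterior mean by brute-force quadrature on a uniform grid
grid = np.linspace(-15.0, 15.0, 20001)
dens = np.exp(-grid**2 / 18.0) * sigmoid(grid)
mean = np.sum(grid * dens) / np.sum(dens)
# mean exceeds mode by roughly half a prior standard deviation unit,
# so a Gaussian centred at the mode misplaces most of the mass.
```

A moment-matching approximation such as EP would centre its Gaussian near the mean instead, which is why it fares better on strongly skewed posteriors.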
R. A. Rigby (London Metropolitan University)
I congratulate the authors for their general approach and also for their computational achievements.
A key equation in the paper is the approximation that is used for the marginal posterior of the hyperparameters θ given by equation (3).
The denominator in equation (3), π̃_{G}(x|θ,y), is a Gaussian approximation to π(x|θ,y), centred at the posterior mode x^{*}(θ), with precision matrix D^{*}(θ) = Q + diag(c) derived from the curvature at the mode. Equation (3) is evaluated by the authors at x = x^{*}(θ), giving
This Laplace approximation is equivalent to the formula that was used for inference about hyperparameters θ by Lee and Nelder (1996), equation (4.6), in their hierarchical generalized linear models and, among others, by Rigby and Stasinopoulos (2005), equation (12), in their generalized additive models for location, scale and shape.
Matthias Seeger (University of Saarbrücken)
As a machine learning researcher, I welcome the publication of this paper, and the discussion that it might initiate. Bayesian statisticians restrict themselves to a single approved approximate inference methodology: Markov chain Monte Carlo (MCMC) sampling. For smooth generalized linear models with provably unimodal (logconcave) posteriors, but strong couplings, there are solid arguments that, on a reasonable time horizon, random sampling should be less accurate in general than deterministic approximations. Modern statistics supports decision making in new fields of science and engineering, where high dimensional, complex models are used under time constraints. I agree with the authors that the argument of accuracy after reasonable time cannot be taken lightly: in many applications, well served by rigorous Bayesian statistical analysis, it is more important than vanishing error in the limit.
The authors could have presented the considerable amount of work done on variational approximate inference in machine learning, information theory and statistical physics over the last 15 years a little more carefully; Section 1.6 seems somewhat dismissive and not well informed. Variational (mean field) Bayes methods tend to underestimate variances, but the comment about expectation–propagation seems ill informed. I am not convinced that the authors' own method would fare consistently better when compared on an equal footing (their second-order Gaussian approximation at the conditional mode being replaced by expectation–propagation or variational Bayes methods), and they do not present such a comparison. In my experience, each of the major deterministic approximations comes with strengths and weaknesses, as does MCMC sampling. The technique presented here seems no exception. For example, sparse linear models, which are of substantial interest right now, are latent Gaussian models, but the second-order Gaussian approximation at the posterior mode is not well defined for them. A good Gaussian approximation can be found by expectation–propagation or other variational approximations. Also, for provably skewed posteriors (logistic regression), it seems suboptimal to place a Gaussian approximation at the mode.
For many posteriors, MCMC sampling is still the only useful option. Often easy to implement, it can be overly slow, and it is very difficult to use properly for nonexperts. For some models, deterministic approximations are nowadays used in many statistics applications and deserve attention in the field. Most concepts that they rely on are grounded in statistics and probability, and they can be analysed by similar mathematics to point estimation techniques. A wider interest among statisticians may help their properties to become better understood, so that practitioners can use them with the confidence that is required for sound statistical analysis.
D. P. Simpson (Queensland University of Technology, Brisbane)
The analysis and implementation of integrated nested Laplace approximations described in this paper is grounded in the fast, exact Cholesky algorithms for sampling from a Gaussian Markov random field introduced by Rue (2001). However, as integrated nested Laplace approximation is an approximate method, further computational savings may possibly be made through the judicious use of inexact methods. This comment will focus on the computation of Gaussian approximations described in Section 2.2 and carry through the formulation of integrated nested Laplace approximation in Section 3 in the obvious way.
Two key computations, the mode and the marginal variances, are required to determine the Gaussian approximation to π(x). The computation of the mode requires the solution of the sequence of linear systems
These systems can be solved quickly and inexactly by the preconditioned conjugate gradient method (Saad, 2003).
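As a concrete toy illustration of the inexact-solver idea, here is a plain (unpreconditioned) conjugate gradient routine applied to a small one-dimensional second-order random-walk precision plus a diagonal term; the matrix and right-hand side are our own invented example, not the discussant's 50×50 model.

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=500):
    """Conjugate gradient for a symmetric positive definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stop once the residual is tiny
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100
D = np.diff(np.eye(n), n=2, axis=0)    # second-difference operator
Q = D.T @ D + 0.5 * np.eye(n)          # RW2-type precision made proper
b = np.sin(np.linspace(0.0, 3.0, n))
x = cg(Q, b)
```

In practice only matrix–vector products with Q + diag(c) are needed, so the sparsity of the GMRF precision is exploited without ever forming a Cholesky factor, and the solver can be stopped early at an accuracy matched to that of the surrounding approximation.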
Consider the mth step of an orthogonal tridiagonalization
which can be performed by using the symmetric Lanczos algorithm with a random starting vector (Saad, 2003; Stewart, 2001). This leads to the approximation to the marginal variances
For reasonable values of m, this approximation has two correct digits. These values can be refined by applying the preconditioned conjugate gradient method, and the refinement can be performed in parallel. We note that a similar procedure does not lead to an efficient method for approximating the mode.
The computational saving can be demonstrated by considering the simple binomial regression model
where Q is a second-order random walk on a 50×50 grid. The data were generated from the surface z = sin(3x) cos(7y) + cos(7x) sin(4y). All computations were performed by using MATLAB 7.4 on a 2.3-GHz MacBook Pro with 2 Gbytes of random-access memory. When the reduced model was run with m = 20, there was a saving of over a second compared with the exact Gaussian approximation (1.785 s compared with 2.867 s).
We note that, as m increases, the error in the marginal variances (Fig. 16) decays only slowly. This supports our assertion that this approximation requires only relatively small values of m.
Hans J. Skaug (University of Bergen)
I congratulate the authors on an innovative paper on the use of Laplace-type approximations in modern Bayesian modelling. Apart from the computational techniques, I find the proposal to assess the accuracy of the Laplace approximation by using asymptotics in terms of p_{D}(θ)/n_{d} very promising. However, the interpretation of p_{D} as the effective dimension of x can be misleading in the current context. As the authors point out, for non-informative data we have p_{D} = 0, without this meaning that the dimension of x is 0 in any sense. It is more appropriate to think of (n − p_{D})/n = tr{Q(θ) Q^{*}(θ)^{−1}}/n as the relative influence of the prior on the posterior. When the (Gaussian) prior dominates, the Laplace approximation becomes increasingly accurate.
My main comment is that several aspects of the computational machinery that is presented by Rue and his colleagues could benefit from the use of a numerical technique known as automatic differentiation (AD) (see Griewank (2000)). Simply put, AD is compiler-generated code for evaluating (numerically) the first and higher order derivatives of a mathematical function. The method outperforms finite-difference-based derivatives with respect to both accuracy and computational cost, and it also differs from symbolic differentiation. Skaug and Fournier (2006) argued that, by evaluating Q^{*} using AD, the Laplace approximation becomes ‘automatic’, relieving the statistician of the burden of calculating second-order derivatives. Further, they presented a formula for the gradient with respect to θ of the Laplace approximation, involving up to third-order derivatives, which would be useful under (a) in Section 3.1 as well. We may view the authors’ framework as a backbone, where the user only has to specify a few parametric components: π(θ), Q(θ) and π(y_{i}|x_{i},θ), or more generally π(y|x,θ). By the use of AD one could obtain a system that is automatic from a user's perspective, almost to the same extent as with BUGS. The benefit would be a fast, flexible and easy-to-use system for doing Bayesian analysis in models with Gaussian latent variables.
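The idea behind AD can be sketched in a few lines of forward-mode code: dual numbers carry a value and its derivative through ordinary arithmetic, giving derivatives exact to machine precision with no finite-difference step (this toy class is our own illustration, not the actual AD machinery that Skaug and Fournier use).

```python
import math

class Dual:
    """A value together with its derivative with respect to one input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__
    def exp(self):
        e = math.exp(self.val)
        return Dual(e, e * self.dot)   # chain rule for exp

def grad(f, x):
    """Derivative of f at x, propagated exactly by the dual arithmetic."""
    return f(Dual(x, 1.0)).dot

# Negative log-density of one Poisson log-link observation (a hypothetical
# user-specified term of the kind the backbone would differentiate)
def nll(eta, y=3.0):
    return (-1.0) * (y * eta - eta.exp())

g = grad(nll, 0.5)                     # equals exp(0.5) - 3 exactly
```

Production AD tools apply the same propagation rules to entire model codes (and, by nesting or reverse mode, obtain the second- and third-order derivatives that the Laplace machinery needs), which is what makes the 'automatic' Laplace approximation feasible.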
I. L. Solis-Trapala (Lancaster University)
We propose a procedure to conduct likelihoodbased inference based on the marginal density of the observational variables y. The idea, in tune with the authors’ strategy, is to explore specific features of the models under consideration to reduce the computational effort without compromising accuracy. A potentially useful procedure to achieve this is to implement the method of maximization by parts that was proposed by Song et al. (2005).
The maximizationbyparts procedure involves the decomposition of the loglikelihood function into two components,
where l_{w} is chosen as a ‘working’ model so that its likelihood score equation is easy to solve, but the second derivatives of l_{e} may be challenging. To find the maximum likelihood estimate of θ, the algorithm solves the working score equation and then recursively solves l̇_{w}(θ_{k+1}) = −l̇_{e}(θ_{k}), where l̇ denotes the vector of first-order derivatives of l. The algorithm is attractive because, under certain conditions, its convergence is relatively fast and stable, with little loss of efficiency relative to using the full likelihood function.
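The recursion can be illustrated on a deliberately simple toy problem (our own choice of split, purely for illustration): an intercept-only logistic model, with a quadratic working part whose score equation is linear in θ.

```python
import math

# Full log-likelihood: l(theta) = sum_i [y_i * theta - log(1 + exp(theta))].
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]          # hypothetical binary data
n, s = len(y), sum(y)

def score(theta):
    """Full score l'(theta) = s - n * sigmoid(theta)."""
    return s - n / (1.0 + math.exp(-theta))

# Working part l_w(theta) = s * theta - n * theta^2 / 2, so l_w'(theta) = s - n * theta.
# The MBP recursion l_w'(theta_{k+1}) = -l_e'(theta_k), with l_e' = l' - l_w',
# reduces to the simple update theta_{k+1} = theta_k + l'(theta_k) / n.
theta = 0.0
for _ in range(200):
    theta += score(theta) / n

mle = math.log(s / (n - s))                  # closed-form MLE: logit of ybar
```

Each step only requires the full score, never the second derivatives of l_{e}, which is precisely the appeal of the scheme when those derivatives are expensive.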
Let θ_{2} denote the vector of parameters that index the conditional density of the observational variables y = {y_{i} : i ∈ } given a latent Gaussian process x^{′} with mean 0 and precision matrix Q(θ_{1}). Note that x^{′} does not contain any parameters, in contrast with the latent process x that is defined by the authors. The target of inference is θ = (θ_{1},θ_{2}). Under the assumption of conditional independence of the variables {y_{i} : i ∈ } given x^{′}, the full likelihood function of θ is proportional to
where π(x^{′};θ_{1}) is a multivariate Gaussian density function with mean 0 and precision matrix Q(θ_{1}).
To conduct likelihood inference by using maximization by parts we propose a working model, where we assume that the variables are marginally independent normal random variables; and the working posterior density π_{w}(x^{′}y;θ) is Gaussian with mean and precision as derived in Section 2.2. Under this choice for the working model, the loglikelihood function can be expressed as
 (40)
where
and π_{w}(x^{′};θ) is a multivariate normal density with mean 0 and a diagonal covariance matrix. The first component in equation (40) involves the evaluation of one-dimensional integrals that can be computed through Gaussian quadrature; the integral in the second component, with the Gaussian working model, can easily be evaluated by using Monte Carlo methods even when the dimension of x^{′} is large. The properties of this procedure are under current investigation.
S. H. Sørbye, F. Godtliebsen and K. Hindberg (University of Tromsø), T. A. Øigård (Institute of Marine Research, Tromsø, and Norwegian Centre for Telemedicine, Tromsø), V. Hadziavdic (Discover Petroleum, Tromsø) and K. Thon (Norwegian Centre for Telemedicine, Tromsø)
First of all, we congratulate the authors on a very important paper that we believe will have a huge influence in many areas of statistics. The authors have an impressive list of potential applications for their new methodology. In this discussion we shall add to their list.
Our focus is how to utilize the integrated nested Laplace approximation (INLA) approach within scale–space analysis, where an observed process that evolves in time and/or space is studied as a function of both location and scale. Scale–space ideas were introduced to nonparametric curve estimation by Chaudhuri and Marron (1999), presenting the ‘SiZer’ methodology. The main idea of SiZer is to perform inference for several scales or smoothing levels simultaneously, detecting and visualizing significant trends of the true curve viewed at different resolutions. Originally, inference in SiZer was based on Gaussian distributional quantile assumptions or bootstrapping. Computational improvements based on extreme value and asymptotic theory were developed in Hannig and Marron (2006).
For Gaussian observational models, the ideas of SiZer can be combined with fast and exact Bayesian inference, utilizing the special conditional independence structure of Gaussian Markov random fields; see Øigård et al. (2006) and Hadziavdic et al. (2008). By applying INLA, these ideas can be extended to nonGaussian observational models. One example is given in Sørbye et al. (2009), where the simplified Laplace approximation was used to extend the ideas of SiZer to spectral density estimation. Applying an integrated Wiener process as a prior, posterior marginals of both the true curve and its first derivative can be estimated for different locations and scales. Tests for significant trends are then evaluated straightforwardly.
Scale–space methods might be rather computer intensive for large data sets. Applying INLA, fast Bayesian inference can now be obtained for the wide class of latent Gaussian models. Compared with Markov chain Monte Carlo alternatives, inference using INLA is also very accurate. Of special importance within scale–space applications is the fact that INLA produces posterior marginal estimates having a relative error, implying that tail probabilities are also estimated accurately. In comparison, Markov chain Monte Carlo techniques would produce marginal estimates with absolute error.
A very nice feature of INLA is that the methodology is easily accessed by using the inla program that is described in Martino and Rue (2008). Different regression models are easily specified, e.g. including covariates, unstructured effects and linear constraints. Interesting scale–space applications could involve for example image analysis and analysis of climate data sets.
Ingelin Steinsland and Henrik Jensen (Norwegian University of Science and Technology, Trondheim)
First we thank the authors for their contributions and also for making the methodology easily available to others through the integrated nested Laplace approximation (INLA) program (Martino and Rue, 2008). Our interest is in quantitative genetics, and we have two comments: