## 1. Introduction

### 1.1. Aim of the paper

This paper discusses how to perform approximate Bayesian inference in a subclass of structured additive regression models, named *latent Gaussian models*. Structured additive regression models are a flexible and extensively used class of models; see for example Fahrmeir and Tutz (2001) for a detailed account. In these models, the observation (or response) variable *y*_{i} is assumed to belong to an exponential family, where the mean *μ*_{i} is linked to a structured additive predictor *η*_{i} through a link function *g*(·), so that *g*(*μ*_{i}) = *η*_{i}. The structured additive predictor *η*_{i} accounts for effects of various covariates in an additive way:

$$\eta_i = \alpha + \sum_{j=1}^{n_f} f^{(j)}(u_{ji}) + \sum_{k=1}^{n_\beta} \beta_k z_{ki} + \varepsilon_i. \tag{1}$$

Here, the {*f*^{(j)}(·)}s are unknown functions of the covariates **u**, the {*β*_{k}}s represent the linear effect of covariates **z** and the *ɛ*_{i}s are unstructured terms. This class of model has a wealth of applications, thanks to the very different forms that the unknown functions {*f*^{(j)}} can take. Latent Gaussian models are the subset of all Bayesian additive models with a structured additive predictor (1) which assign a Gaussian prior to *α*, {*f*^{(j)}(·)}, {*β*_{k}} and {*ɛ*_{i}}. Let **x** denote the vector of all the latent Gaussian variables, and **θ** the vector of hyperparameters, which are not necessarily Gaussian. In the machine learning literature, the phrase ‘Gaussian process models’ is often used (Rasmussen and Williams, 2006). We discuss various applications of latent Gaussian models in Section 1.2.
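To make the structure of the additive predictor concrete, the following minimal sketch simulates data from a latent Gaussian model of this form. All numerical choices here — the sine-shaped stand-in for a realization of a smooth *f*(·) term, the coefficient values and the Poisson observation model with log-link — are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
u = np.linspace(0.0, 1.0, n)       # covariate entering through a smooth effect f(u)
z = rng.normal(size=n)             # covariate entering linearly with coefficient beta

# Hypothetical components of the structured additive predictor (1):
alpha = 0.5                        # intercept (Gaussian prior in a latent Gaussian model)
f_u = np.sin(2 * np.pi * u)        # stands in for one realization of the prior on f(.)
beta = 0.3
eps = rng.normal(scale=0.1, size=n)  # unstructured terms

eta = alpha + f_u + beta * z + eps   # structured additive predictor eta_i
mu = np.exp(eta)                     # log-link: g(mu_i) = log(mu_i) = eta_i
y = rng.poisson(mu)                  # exponential-family observation model
```

The latent vector **x** would collect *α*, the *f* values and *β*; only *y* is observed.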

The main aim of this paper is twofold:

- (a) to provide accurate and fast deterministic approximations to all, or some of, the *n* posterior marginals for *x*_{i}, the components of the latent Gaussian vector **x**, plus possibly the posterior marginals for **θ** or some of its components *θ*_{j} (if needed, the marginal densities can be post-processed to compute quantities like posterior expectations, variances and quantiles);
- (b) to demonstrate how to use these marginals
  - (i) to provide adequate approximations to the posterior marginal for subvectors **x**_{S} for any subset *S*,
  - (ii) to compute the marginal likelihood and the deviance information criterion (DIC) for model comparison and
  - (iii) to compute various Bayesian predictive measures.

### 1.2. Latent Gaussian models: applications

Latent Gaussian models have a large and wide-ranging list of applications; most structured Bayesian models are in fact of this form; see for example Fahrmeir and Tutz (2001), Gelman *et al.* (2004) and Robert and Casella (1999). We first give some areas of application, grouped according to their physical dimension. Let *f*(·) denote one of the *f*^{(j)}(·) terms in equation (1), with variables *f*_{1}, *f*_{2}, ….

- (a) *Regression models*: Bayesian generalized linear models correspond to a purely linear predictor (Dey *et al.*, 2000). The *f*(·) terms are used either to relax the linear relationship of the covariate, as argued for by Fahrmeir and Tutz (2001), to introduce random effects, or both. Popular models for smooth effects of covariates are penalized spline models (Lang and Brezger, 2004), random-walk models (Fahrmeir and Tutz, 2001; Rue and Held, 2005), continuously indexed spline models (Wahba, 1978; Wecker and Ansley, 1983; Kohn and Ansley, 1987; Rue and Held, 2005) and Gaussian processes (O'Hagan, 1978; Chu and Ghahramani, 2005; Williams and Barber, 1998; Besag *et al.*, 1995; Neal, 1998). Random effects make it possible to account for overdispersion caused by unobserved heterogeneity, or for correlation in longitudinal data, and can be introduced by defining *f*(*u*_{i}) = *f*_{i} and letting {*f*_{i}} be independent, zero mean and Gaussian (Fahrmeir and Lang, 2001).
- (b) *Dynamic models*: temporal dependence can be introduced by using *i* in equation (1) as a time index *t* and defining *f*(·) and the covariate **u** so that *f*(*u*_{t}) = *f*_{t}. Then {*f*_{t}} can model a discrete time or continuous time autoregressive process, a seasonal effect or, more generally, the latent process of a structured time series model (Kitagawa and Gersch, 1996; West and Harrison, 1997). Alternatively, {*f*_{t}} can represent a smooth temporal function in the same spirit as in regression models.
- (c) *Spatial and spatiotemporal models*: spatial dependence can be modelled similarly, using a spatial covariate **u** so that *f*(*u*_{s}) = *f*_{s}, where *s* represents the spatial location or spatial region. The stochastic model for *f*_{s} is constructed to promote spatially smooth realizations of some kind. Popular models include the Besag–York–Mollié model for disease mapping, with extensions for regional data (Besag *et al.*, 1991; Held *et al.*, 2005; Weir and Pettitt, 2000; Gschløssl and Czado, 2008; Wakefield, 2007), continuously indexed Gaussian models (Banerjee *et al.*, 2004; Diggle and Ribeiro, 2006) and texture models (Marroquin *et al.*, 2001; Rellier *et al.*, 2002). Spatial and temporal dependence can be combined by using either a spatiotemporal covariate (*s*, *t*) or a corresponding spatiotemporal Gaussian field (Kammann and Wand, 2003; Cressie and Johannesson, 2008; Banerjee *et al.*, 2008; Finkenstadt *et al.*, 2006; Abellan *et al.*, 2007; Gneiting, 2002; Banerjee *et al.*, 2004).

In many applications, the final model may consist of a sum of various components, such as a spatial component, random effects and both linear and smooth effects of some covariates. Furthermore, linear or sum-to-zero constraints are sometimes imposed as well to separate the effects of various components in equation (1).
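As a small illustration of the sparse structure behind a typical *f*(·) term, the sketch below builds the precision matrix of a first-order random-walk prior and verifies the rank deficiency that motivates a sum-to-zero constraint. The size `n` and precision parameter `tau` are arbitrary illustrative choices:

```python
import numpy as np
import scipy.sparse as sp

n, tau = 100, 1.0
# First-order random walk: increments f_{i+1} - f_i are i.i.d. N(0, 1/tau),
# giving the sparse (tridiagonal) precision matrix Q = tau * D^T D -- a GMRF.
D = sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
Q = (tau * (D.T @ D)).tocsr()

# Q is rank deficient: the constant vector lies in its null space, so the
# overall level of f is not identified. This is why a sum-to-zero constraint
# sum_i f_i = 0 is imposed when an intercept (or another f term) is present.
null_ok = np.allclose(Q @ np.ones(n), 0.0)
```

Only the sparsity pattern matters for the fast computations discussed in Section 1.3; the same construction applies to second-order random walks and intrinsic spatial models.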

### 1.3. Latent Gaussian models: notation and basic properties

To simplify the following discussion, denote generically *π*(·|·) as the conditional density of its arguments, and let **x** be all the *n* Gaussian variables {*η*_{i}}, *α*, {*f*^{(j)}} and {*β*_{k}}. The density *π*(**x**|*θ*_{1}) is Gaussian with (assumed) zero mean and precision matrix **Q**(*θ*_{1}) with hyperparameters *θ*_{1}. Denote by $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ the $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ Gaussian density with mean **μ** and covariance (inverse precision) **Σ** at configuration **x**. Note that we have included {*η*_{i}} instead of {*ɛ*_{i}} in **x**, as it simplifies the notation later.

The distribution for the *n*_{d} observational variables **y** = {*y*_{i} : *i* ∈ ℐ} is denoted by *π*(**y**|**x**, *θ*_{2}), and we assume that the {*y*_{i} : *i* ∈ ℐ} are conditionally independent given **x** and *θ*_{2}. For simplicity, denote by **θ** = (*θ*_{1}^{T}, *θ*_{2}^{T})^{T} with dim(**θ**) = *m*. The posterior then reads (for a non-singular **Q**(**θ**))

$$\pi(\mathbf{x}, \boldsymbol{\theta} \mid \mathbf{y}) \propto \pi(\boldsymbol{\theta})\, \pi(\mathbf{x} \mid \boldsymbol{\theta}) \prod_{i \in \mathcal{I}} \pi(y_i \mid x_i, \boldsymbol{\theta}) \propto \pi(\boldsymbol{\theta})\, |\mathbf{Q}(\boldsymbol{\theta})|^{1/2} \exp\Bigl\{-\tfrac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{Q}(\boldsymbol{\theta})\mathbf{x} + \sum_{i \in \mathcal{I}} \log \pi(y_i \mid x_i, \boldsymbol{\theta})\Bigr\}.$$

The imposed linear constraints (if any) are denoted by **Ax** = **e** for a *k* × *n* matrix **A** of rank *k*. The main aim is to approximate the posterior marginals *π*(*x*_{i}|**y**), *π*(**θ**|**y**) and *π*(*θ*_{j}|**y**).
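The unnormalized log-posterior above is cheap to evaluate once **Q**(**θ**) and the likelihood are chosen. A toy sketch for a single hyperparameter *θ* = *τ* follows; the AR(1)-type precision matrix (with a fixed dependence parameter `phi`), the Poisson likelihood and the exponential prior on *τ* are all hypothetical choices for illustration:

```python
import numpy as np
from scipy.stats import poisson, expon

def log_posterior_unnorm(x, tau, y):
    """log pi(x, theta | y) up to an additive constant, with theta = tau,
    a stationary AR(1)-type GMRF prior on x and Poisson observations with
    log-link; every modelling choice here is illustrative."""
    n, phi = len(x), 0.9
    Q = tau * (np.diag(np.full(n, 1 + phi**2))
               + np.diag(np.full(n - 1, -phi), 1)
               + np.diag(np.full(n - 1, -phi), -1))
    Q[0, 0] = Q[-1, -1] = tau                     # stationary boundary correction
    _, logdet = np.linalg.slogdet(Q)
    log_prior_tau = expon.logpdf(tau)             # log pi(theta)
    log_prior_x = 0.5 * logdet - 0.5 * x @ Q @ x  # log pi(x | theta) + const
    log_lik = poisson.logpmf(y, np.exp(x)).sum()  # sum_i log pi(y_i | x_i, theta)
    return log_prior_tau + log_prior_x + log_lik

rng = np.random.default_rng(0)
x = 0.5 * rng.normal(size=20)
y = rng.poisson(np.exp(x))
val = log_posterior_unnorm(x, 1.0, y)
```

In a real GMRF implementation the log-determinant would of course be obtained from a sparse Cholesky factor rather than a dense `slogdet`.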

Many, but not all, latent Gaussian models in the literature (see Section 1.2) satisfy two basic properties which we shall assume throughout the paper. The first is that the latent field **x**, which is often of large dimension, *n* = 10^{2}–10^{5}, admits conditional independence properties. Hence, the latent field is a Gaussian Markov random field (GMRF) with a sparse precision matrix **Q**(**θ**) (Rue and Held, 2005). This means that we can use numerical methods for sparse matrices, which are much quicker than general dense matrix calculations (Rue and Held, 2005). The second property is that the number of hyperparameters, *m*, is small, say *m* ≤ 6. Both properties are usually required to produce fast inference, but exceptions exist (Eidsvik *et al.*, 2009).
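The computational gain from sparsity can be sketched in a few lines: a tridiagonal precision matrix of dimension 10^{4} (a made-up nonsingular GMRF, used here only as an example) is factorized and solved almost instantly, whereas dense methods at this size would already be costly:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 10_000
# Tridiagonal (hence sparse) precision matrix of an illustrative nonsingular GMRF:
Q = sp.diags([-np.ones(n - 1), np.full(n, 2.1), -np.ones(n - 1)],
             [-1, 0, 1], format='csc')

lu = splu(Q)          # sparse LU factorization exploits the Markov structure
b = np.ones(n)
mu = lu.solve(b)      # e.g. solving Q mu = b for a conditional mean
```

For banded matrices such as this one the factorization cost grows roughly linearly in *n*, as opposed to cubically for a dense solve.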

### 1.4. Inference: Markov chain Monte Carlo approaches

The common approach to inference for latent Gaussian models is Markov chain Monte Carlo (MCMC) sampling. It is well known, however, that MCMC methods tend to exhibit poor performance when applied to such models. Various factors explain this. First, the components of the latent field **x** are strongly dependent on each other. Second, **θ** and **x** are also strongly dependent, especially when *n* is large. A common approach to (try to) overcome the first problem is to construct a joint proposal based on a Gaussian approximation to the full conditional of **x** (Gamerman, 1997, 1998; Carter and Kohn, 1994; Knorr-Held, 1999; Knorr-Held and Rue, 2002; Rue *et al.*, 2004). The second problem requires, at least partially, a joint update of both **θ** and **x**. One suggestion is to use the one-block approach of Knorr-Held and Rue (2002): make a proposal to move **θ** to *θ*^{′}, update **x** from the Gaussian approximation conditional on *θ*^{′}, then accept or reject jointly; see Rue and Held (2005), chapter 4, for variations on this approach. Some models can alternatively be reparameterized to overcome the second problem (Papaspiliopoulos *et al.*, 2007). Independence samplers can also sometimes be constructed (Rue *et al.*, 2004). For some (observational) models, auxiliary variables can be introduced to simplify the construction of Gaussian approximations (Shephard, 1994; Albert and Chib, 1993; Holmes and Held, 2006; Frühwirth-Schnatter and Wagner, 2006; Frühwirth-Schnatter and Frühwirth, 2007; Rue and Held, 2005). Despite all these developments, MCMC sampling remains painfully slow from the end user's point of view.
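The one-block idea can be sketched in the smallest possible setting: a single latent *x* with a N(0, 1/*τ*) prior and one Poisson observation. The Gaussian approximation to the full conditional of *x* is obtained by Newton iterations at the mode, and (*τ*′, *x*′) is accepted or rejected jointly. The priors, proposal scale and data value are all hypothetical stand-ins, not the paper's examples:

```python
import numpy as np
from scipy.stats import norm, poisson, gamma

rng = np.random.default_rng(1)
y = 3  # a single toy Poisson count

def gauss_approx(tau):
    """Gaussian approximation to pi(x | tau, y): match mode and curvature of
    the log full conditional -tau*x^2/2 + y*x - exp(x)."""
    x = 0.0
    for _ in range(20):                          # Newton iterations
        grad = -tau * x + y - np.exp(x)
        hess = -tau - np.exp(x)
        x -= grad / hess
    return x, np.sqrt(-1.0 / hess)

def log_target(x, tau):
    return (gamma.logpdf(tau, a=1.0)             # Exp(1) prior on tau (illustrative)
            + norm.logpdf(x, 0.0, 1.0 / np.sqrt(tau))
            + poisson.logpmf(y, np.exp(x)))

# One-block update (in the spirit of Knorr-Held and Rue, 2002): propose tau',
# then x' from the Gaussian approximation given tau', accept/reject jointly.
tau, x = 1.0, 0.0
for _ in range(200):
    tau_p = tau * np.exp(0.3 * rng.normal())     # random walk on log(tau)
    m, s = gauss_approx(tau_p)
    x_p = m + s * rng.normal()
    m0, s0 = gauss_approx(tau)
    log_a = (log_target(x_p, tau_p) - log_target(x, tau)
             + norm.logpdf(x, m0, s0) - norm.logpdf(x_p, m, s)
             + np.log(tau_p) - np.log(tau))      # Jacobian of the log-scale proposal
    if np.log(rng.uniform()) < log_a:
        tau, x = tau_p, x_p
```

The same Gaussian approximation reappears below as the central building block of the deterministic approach.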

### 1.5. Inference: deterministic approximations

Gaussian approximations play a central role in the development of more efficient MCMC algorithms. This remark leads to the following questions.

- (a) Can we bypass MCMC methods entirely and base our inference on such closed form approximations?
- (b) To what extent can we advocate an approach that leads to a (presumably) small approximation error over another approach giving rise to a (presumably) large MCMC error?

Obviously, MCMC errors seem preferable, as they can be made arbitrarily small, for arbitrarily large computational time. We argue, however, that, for a given computational cost, the deterministic approach that is developed in this paper outperforms MCMC algorithms to such an extent that, for latent Gaussian models, resorting to MCMC sampling rarely makes sense in practice.

It is useful to provide some orders of magnitude. In typical spatial examples where the dimension *n* is a few thousands, our approximations for all the posterior marginals can be computed in (less than) a minute or a few minutes. The corresponding MCMC samplers need hours or even days to compute accurate posterior marginals. The approximation bias is, in typical examples, much less than the MCMC error and negligible in practice. More formally, on one hand it is well known that MCMC sampling is a last resort solution: Monte Carlo averages are characterized by additive $\mathcal{O}(N^{-1/2})$ errors, where *N* is the simulated sample size. Thus, it is easy to obtain rough estimates, but nearly impossible to obtain accurate ones; an additional correct digit requires 100 times more computational power. More importantly, the implicit constant in $\mathcal{O}(N^{-1/2})$ often hides a curse of dimensionality with respect to the dimension *n* of the problem, which explains the practical difficulties with MCMC sampling that were mentioned above. On the other hand, Gaussian approximations are intuitively appealing for latent Gaussian models. For most real problems and data sets, the conditional posterior of **x** is typically well behaved, and looks ‘almost’ Gaussian. This is clearly due to the latent Gaussian prior that is assigned to **x**, which has a non-negligible effect on the posterior, especially in terms of dependence between the components of **x**.
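The $\mathcal{O}(N^{-1/2})$ behaviour is easy to observe numerically; here a standard-normal mean (true value 0) is estimated with increasing sample sizes, purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Monte Carlo error shrinks as N^{-1/2}: each additional correct digit costs
# roughly a 100-fold increase in the simulated sample size N.
errors = {N: abs(rng.normal(size=N).mean()) for N in (10**2, 10**4, 10**6)}
```

With a typical draw, each 100-fold increase in *N* shrinks the error by roughly one decimal digit.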

### 1.6. Approximation methods in machine learning

A general approach towards approximate inference is the variational Bayes (VB) methodology that was developed in the machine learning literature (Hinton and van Camp, 1993; MacKay, 1995; Bishop, 2006). VB methodology has provided numerous promising results in various areas, like hidden Markov models (MacKay, 1997), mixture models (Humphreys and Titterington, 2000), graphical models (Attias, 1999, 2000) and state space models (Beal, 2003), among others; see Beal (2003), Titterington (2004) and Jordan (2004) for extensive reviews.

For the sake of discussion, consider the posterior distribution *π*(**x**, **θ**|**y**) of a generic Bayesian model, with observation **y**, latent variable **x** and hyperparameter **θ**. The principle of VB methods is to use as an approximation the joint density *q*(**x**, **θ**) that minimizes the Kullback–Leibler contrast of *π*(**x**, **θ**|**y**) with respect to *q*(**x**, **θ**). The minimization is subject to some constraint on *q*(**x**, **θ**), most commonly *q*(**x**, **θ**) = *q*_{x}(**x**) *q*_{θ}(**θ**). Obviously, the VB approximated density *q*(**x**, **θ**) does not capture the dependence between **x** and **θ**, but one hopes that its marginals (of **x** and **θ**) approximate well the true posterior marginals. The solution of this minimization problem is approached through an iterative, EM-like algorithm.

In general, the VB approach is not without potential problems. First, even though VB methods often seem to approximate the posterior mode well (Wang and Titterington, 2006), the posterior variance can be (sometimes severely) underestimated; see Bishop (2006), chapter 10, and Wang and Titterington (2005). In the case of latent Gaussian models, this phenomenon does occur, as we demonstrate in Appendix A; we show that the VB-approximated variance can be up to *n* times smaller than the true posterior variance in a typical application. The second potential problem is that the iterative process of the basic VB algorithm is tractable for ‘conjugate exponential’ models only (Beal, 2003). This implies that *π*(**θ**) must be conjugate with respect to the complete likelihood *π*(**x**, **y**|**θ**), and the complete likelihood must belong to an exponential family. However, few of the latent Gaussian models that are encountered in applications are of this type, as illustrated by our worked-through examples in Section 5. A possible remedy around this requirement is to impose restrictions on *q*(**x**, **θ**), such as independence between blocks of components of **θ** (Beal (2003), chapter 4), or a parametric form for *q*(**x**, **θ**) that allows for a tractable minimization algorithm. However, this requires case-specific solutions, and the constraints will increase the approximation error.
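The variance underestimation can be seen in the smallest possible example: a bivariate Gaussian target with strong correlation, for which the mean-field VB fixed point is known in closed form (this is a generic textbook fact, not the specific construction of Appendix A):

```python
import numpy as np

# For a Gaussian target with precision matrix Q, the mean-field VB fixed point
# gives each factor q_i the variance 1/Q_ii (inverse of the precision diagonal),
# while the true marginal variance is (Q^{-1})_ii -- always at least as large.
rho = 0.95
Q = np.array([[1.0, -rho], [-rho, 1.0]]) / (1 - rho**2)  # precision of corr-rho Gaussian

true_var = np.linalg.inv(Q)[0, 0]   # true marginal variance: 1.0
vb_var = 1.0 / Q[0, 0]              # VB variance: 1 - rho^2, far too small for rho near 1
```

For *ρ* = 0.95 the VB variance is about a tenth of the truth; with strongly dependent latent fields the mismatch grows with dimension.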

Another approximation scheme that is popular in machine learning is the expectation–propagation (EP) approach (Minka, 2001); see for example Zoeter and Heskes (2005) and Kuss and Rasmussen (2005) for applications of EP to latent Gaussian models. EP follows principles which are quite similar to VB methods, i.e. it minimizes iteratively some pseudodistance between *π*(**x**, **θ**|**y**) and the approximation *q*(**x**, **θ**), subject to *q*(**x**, **θ**) factorizing in a ‘simple’ way, e.g. as a product of parametric factors, each involving a single component of (**x**, **θ**). However, the pseudodistance that is used in EP is the Kullback–Leibler contrast of *q*(**x**, **θ**) relative to *π*(**x**, **θ**|**y**), rather than the other way around (as in VB methods). Because of this, EP usually overestimates the posterior variance (Bishop (2006), chapter 10). Kuss and Rasmussen (2005) derived an EP approximation scheme for classification problems involving Gaussian processes that seems to be accurate and fast; but their focus is on approximating *π*(**x**|**θ**, **y**) for **θ** set to the posterior mode, and it is not clear how to extend this approach to a fully Bayesian analysis. More importantly, deriving an efficient EP algorithm seems to require specific efforts for each class of models. With respect to computational cost, VB and EP methods are both designed to be faster than exact MCMC methods, but, owing to their iterative nature, they are (much) slower than analytic approximations (such as those developed in this paper); see Section 5.3 for an illustration of this in one of our examples. Also, it is not clear whether EP and VB methods can be implemented efficiently in scenarios involving linear constraints on **x**.

The general applicability of the VB and EP approaches does not contradict the existence of improved approximation schemes for latent Gaussian models, hopefully without the problems just discussed. How this can be done is described next.

### 1.7. Inference: the new approach

The posterior marginals of interest can be written as

$$\pi(x_i \mid \mathbf{y}) = \int \pi(x_i \mid \boldsymbol{\theta}, \mathbf{y})\, \pi(\boldsymbol{\theta} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\theta}, \qquad \pi(\theta_j \mid \mathbf{y}) = \int \pi(\boldsymbol{\theta} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\theta}_{-j},$$

and the key feature of our new approach is to use this form to construct nested approximations

$$\tilde{\pi}(x_i \mid \mathbf{y}) = \int \tilde{\pi}(x_i \mid \boldsymbol{\theta}, \mathbf{y})\, \tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\theta}, \qquad \tilde{\pi}(\theta_j \mid \mathbf{y}) = \int \tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\theta}_{-j}. \tag{2}$$

Here, $\tilde{\pi}(\cdot \mid \cdot)$ is an approximated (conditional) density of its arguments. Approximations to *π*(*x*_{i}|**y**) are computed by approximating *π*(**θ**|**y**) and *π*(*x*_{i}|**θ**, **y**), and using numerical integration (i.e. a finite sum) to integrate out **θ**. The integration is possible as the dimension of **θ** is small; see Section 1.3. As will become clear in what follows, the nested approach makes Laplace approximations very accurate when applied to latent Gaussian models. The approximation of *π*(*θ*_{j}|**y**) is computed by integrating out *θ*_{−j} from $\tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})$; we return in Section 3.1 to the practical details.
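The nested construction — a finite sum over a grid of *θ* values — can be sketched in a toy conjugate model where all the conditional densities are available in closed form, so that the numerical integration is the only approximation. The model (y | x ~ N(x, 1), x | θ ~ N(0, 1/θ), θ ~ Gamma(2, 1)) and the grids are hypothetical choices made purely for illustration:

```python
import numpy as np
from scipy.stats import norm, gamma

y = 1.5
thetas = np.linspace(0.05, 10.0, 200)      # grid over the hyperparameter
dtheta = thetas[1] - thetas[0]

# pi(theta | y) on the grid: y | theta is N(0, 1 + 1/theta) marginally.
post_theta = norm.pdf(y, 0, np.sqrt(1 + 1 / thetas)) * gamma.pdf(thetas, a=2.0)
post_theta /= post_theta.sum() * dtheta    # renormalize on the grid

# pi(x | theta, y) is N(y / (1 + theta), 1 / (1 + theta)); sum it out over theta:
xs = np.linspace(-3, 5, 400)
marg = np.zeros_like(xs)
for th, w in zip(thetas, post_theta):
    marg += norm.pdf(xs, y / (1 + th), np.sqrt(1 / (1 + th))) * w * dtheta
# marg now approximates pi(x | y), a mixture over the theta grid
```

In INLA the two closed-form conditionals are replaced by the Laplace-type approximations described next; the outer finite sum stays the same.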

Our approach is based on the following approximation of the marginal posterior of **θ**:

$$\tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y}) \propto \left.\frac{\pi(\mathbf{x}, \boldsymbol{\theta}, \mathbf{y})}{\tilde{\pi}_{\mathrm{G}}(\mathbf{x} \mid \boldsymbol{\theta}, \mathbf{y})}\right|_{\mathbf{x} = \mathbf{x}^{*}(\boldsymbol{\theta})}, \tag{3}$$

where $\tilde{\pi}_{\mathrm{G}}(\mathbf{x} \mid \boldsymbol{\theta}, \mathbf{y})$ is the Gaussian approximation to the full conditional of **x**, and **x**^{*}(**θ**) is the mode of the full conditional for **x**, for a given **θ**. The proportionality sign in expression (3) comes from the fact that the normalizing constant for *π*(**x**, **θ**|**y**) is unknown. This expression is equivalent to Tierney and Kadane's (1986) Laplace approximation of a marginal posterior distribution, and this suggests that the approximation error is relative and of order $\mathcal{O}(n_d^{-3/2})$ after renormalization. However, since *n* is not fixed but depends on *n*_{d}, standard asymptotic assumptions that are usually invoked for Laplace expansions are not verified here; see Section 4 for a discussion of the error rate.
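Expression (3) is easy to exercise in one dimension. The sketch below applies it to a toy latent Gaussian model (x | θ ~ N(0, 1/θ), y | x ~ Poisson(exp(x)), an exponential prior on θ and the data value y = 4 — all illustrative choices) and compares the result with a brute-force numerical integral over x:

```python
import numpy as np
from scipy.stats import norm, poisson, gamma

y = 4

def joint(x, theta):
    """pi(x, theta, y) for the toy model (vectorized in x)."""
    return (gamma.pdf(theta, a=1.0) * norm.pdf(x, 0, 1 / np.sqrt(theta))
            * poisson.pmf(y, np.exp(x)))

def laplace(theta):
    """Expression (3): joint / Gaussian approximation, at the mode x*(theta)."""
    x = 0.0
    for _ in range(30):                          # Newton iterations for the mode
        grad = -theta * x + y - np.exp(x)
        hess = -theta - np.exp(x)
        x -= grad / hess
    s = np.sqrt(-1.0 / hess)                     # sd of the Gaussian approximation
    return joint(x, theta) * np.sqrt(2 * np.pi) * s  # Gaussian density at its mode is 1/(sqrt(2 pi) s)

thetas = np.linspace(0.1, 5.0, 50)
xs = np.linspace(-6, 6, 2001)
dx = xs[1] - xs[0]
exact = np.array([np.sum(joint(xs, t)) * dx for t in thetas])  # brute-force integral over x
approx = np.array([laplace(t) for t in thetas])
exact /= exact.sum()
approx /= approx.sum()                           # compare after renormalization
```

After renormalization the two curves agree closely, illustrating why the unknown normalizing constant in (3) is harmless.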

$\tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})$ itself tends to depart significantly from Gaussianity. This suggests that a cruder approximation based on a Gaussian approximation to *π*(**θ**|**y**) is not sufficiently accurate for our purposes; this also applies to similar approximations that are based on ‘equivalent Gaussian observations’ around **x**^{*}, and evaluated at the mode of expression (3) (Breslow and Clayton, 1993; Ainsworth and Dean, 2006). A critical aspect of our approach is to explore and manipulate $\tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})$ and $\tilde{\pi}(x_i \mid \mathbf{y})$ in a ‘non-parametric’ way. Rue and Martino (2007) used expression (3) to approximate posterior marginals for **θ** for various latent Gaussian models. Their conclusion was that $\tilde{\pi}(\boldsymbol{\theta} \mid \mathbf{y})$ is particularly accurate: even long MCMC runs could not detect any error in it. For the posterior marginals of the latent field, they proposed to start from $\tilde{\pi}_{\mathrm{G}}(\mathbf{x} \mid \boldsymbol{\theta}, \mathbf{y})$ and to approximate the density of *x*_{i}|**θ**, **y** with the Gaussian marginal derived from it, i.e.

$$\tilde{\pi}(x_i \mid \boldsymbol{\theta}, \mathbf{y}) = \mathcal{N}\{x_i;\, \mu_i(\boldsymbol{\theta}),\, \sigma_i^2(\boldsymbol{\theta})\}. \tag{4}$$

Here, **μ**(**θ**) is the mean (vector) of the Gaussian approximation, whereas ***σ***^{2}(**θ**) is the vector of corresponding marginal variances. This approximation can be integrated numerically with respect to **θ** (see expression (2)), to obtain approximations of the marginals of interest for the latent field:

$$\tilde{\pi}(x_i \mid \mathbf{y}) = \sum_k \tilde{\pi}(x_i \mid \boldsymbol{\theta}_k, \mathbf{y})\, \tilde{\pi}(\boldsymbol{\theta}_k \mid \mathbf{y})\, \Delta_k. \tag{5}$$

The sum is over values of **θ** with area weights Δ_{k}. Rue and Martino (2007) showed that the approximate posterior marginals for **θ** were accurate, whereas the error in the Gaussian approximation (4) was larger. In particular, equation (4) can present an error in location and/or a lack of skewness. Other issues in Rue and Martino (2007) were both the difficulty of detecting the *x*_{i}s whose approximation is less accurate and the inability to improve the approximation at those locations. Moreover, they could not control the error of the approximations or choose the integration points {*θ*_{k}} in an adaptive and automatic way.

In this paper, we solve all the remaining issues in Rue and Martino (2007), and present a fully automatic approach for approximate inference in latent Gaussian models which we name *integrated nested Laplace approximations* (INLAs). The main tool is to apply the Laplace approximation once more, this time to *π*(*x*_{i}|**y**, **θ**). We also present a faster alternative which corrects the Gaussian approximation (4) for error in location and lack of skewness at moderate extra cost. The corrections are obtained by a series expansion of the Laplace approximation. This faster alternative is a natural first choice, because of its low computational cost and high accuracy. It is our experience that INLA outperforms any MCMC alternative by a wide margin, in terms of both accuracy and computational speed. We shall also demonstrate how the various approximations can be used to derive tools for assessing the approximation error, to approximate posterior marginals for a subset of **x**, and to compute interesting quantities like the marginal likelihood, the DIC and various Bayesian predictive measures.

### 1.8. Plan of paper

Section 2 contains preliminaries on GMRFs, sparse matrix computations and Gaussian approximations. Section 3 explains the INLA approach and how to approximate *π*(**θ**|**y**), *π*(*θ*_{j}|**y**) and *π*(*x*_{i}|**θ**, **y**). For the latent field, three approximations are discussed: Gaussian, Laplace and simplified Laplace. Section 4 discusses the error rates of the Laplace approximations that are used in INLA. Section 5 illustrates the performance of INLA through simulated and real examples, which include stochastic volatility models, a longitudinal mixed model, a spatial model for mapping of cancer incidence data and spatial log-Gaussian Cox processes. Section 6 discusses some extensions: construction of posterior marginals for subsets **x**_{S}, approximations of the marginal likelihood and predictive measures, the DIC for model comparison and an alternative integration scheme for cases where the number of hyperparameters is not small but moderate. We end with a general discussion in Section 7.