On the error in Laplace approximations of high-dimensional integrals

Laplace approximations are commonly used to approximate high-dimensional integrals in statistical applications, but the quality of such approximations as the dimension of the integral grows is not well understood. In this paper, we prove a new result on the size of the error in first- and higher-order Laplace approximations, and apply this result to investigate the quality of Laplace approximations to the likelihood in some generalized linear mixed models.


1 Introduction
Integrals of the form
$$L = \int_{\mathbb{R}^d} \exp\{-g(u)\}\, \mathrm{d}u \tag{1}$$
are frequently encountered in statistical applications, where g(.) is a smooth function with a unique minimum. For example, the likelihood function for a generalized linear mixed model is of this form, where u is a vector of random effects. Integrals of this type are also common in Bayesian applications, for example as marginal likelihoods used for model comparison.
Laplace approximations are often used to approximate integrals of form (1). Suppose that g(u) grows at rate n. Often g(u) is a sum with one term for each observation, so n is the sample size. If d is fixed as n → ∞, many results on the quality of the Laplace approximation are available: see Small (2010) for a review. However, in many examples of interest, d and n tend to infinity simultaneously, and there are very few results available on the quality of Laplace approximations in this setting. Shun and McCullagh (1995) provide a formal expansion for integrals of type (1). By studying the size of various terms in this expansion, they conjecture that the first-order Laplace approximation should be reliable if $d = o(n^{1/3})$, under the assumption that all derivatives of g(.) grow at rate n. This condition is typically not met for generalized linear mixed models, and we give an example in which $d = o(n^{1/3})$ but the error in the first-order Laplace approximation grows with n.
In Section 2, assuming alternative conditions on g(.), we develop a new result on the error in Laplace approximations of various orders to integrals of type (1). Our result is motivated by a two-level random intercept model with $n_j$ observations on items in the jth cluster, for which the likelihood factorizes into a product of one-dimensional integrals
$$L = \prod_{j=1}^{d} \int_{\mathbb{R}} \exp\{-g_j(u_j)\}\, \mathrm{d}u_j,$$
where each $g_j(u_j)$ is a sum over $n_j$ terms. In this case, we could use existing results on the error of Laplace approximations to one-dimensional integrals to show that the error in the first-order Laplace approximation to the integral is $O(\sum_{j=1}^{d} n_j^{-1})$. We show that a version of this result also holds more generally, and find similar expressions for the error in higher-order Laplace approximations. In Section 3, we apply these results to study the quality of Laplace approximations of the likelihood for some generalized linear mixed models, including a multilevel random intercept model with any number of levels of hierarchy.
2 Error in the log-integral approximation

2.1 A series expansion for the log-integral

Shun and McCullagh (1995) give a series expansion for the log-integral ℓ = log L. We use their expansion here, expressed with slightly different notation. We write
$$\ell = \hat\ell_1 + \sum_{l=1}^{\infty} e_l, \tag{2}$$
where $\hat\ell_1$ is the first-order Laplace approximation to the log-integral, and the $e_l$ are contributions to the error in this approximation, of size decreasing with l, which we define in Section 2.2. The first-order Laplace approximation to the log-integral is
$$\hat\ell_1 = -g(\hat u) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det g^{(2)},$$
where $\hat u = \arg\min_{u \in \mathbb{R}^d} g(u)$ and $g^{(2)} = g''(\hat u)$ is the matrix of second derivatives of g(.) with respect to u, evaluated at $\hat u$. Based on the decomposition (2), we may also define an order-k Laplace approximation to the log-integral, for k ≥ 2, as
$$\hat\ell_k = \hat\ell_1 + \sum_{l=1}^{k-1} e_l.$$
What is meant by the order of a Laplace approximation is not standard across the literature: our definition is obtained by grouping together terms in a series expansion of the log-integral according to their asymptotic order. This is a different notion of order from that used by Raudenbush et al. (2000), who group together terms according to the number of derivatives required to compute them.
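As a concrete illustration (the test integrand and all numerical choices here are our own, not from the paper), the first-order approximation $\hat\ell_1 = -g(\hat u) + \frac{d}{2}\log(2\pi) - \frac12\log\det g^{(2)}$ can be checked in one dimension against an integral with a known value:

```python
import math
from scipy.optimize import minimize_scalar

# First-order Laplace approximation to log ∫_R exp{-g(u)} du when d = 1:
# l1_hat = -g(u_hat) + (1/2) log(2 pi) - (1/2) log g''(u_hat).
def laplace_l1(g, g2, u_hat):
    return -g(u_hat) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(g2(u_hat))

# Test integrand: g(u) = e^u - (n + 1) u. Substituting t = e^u shows that
# ∫_R exp{-g(u)} du = Gamma(n + 1), so the exact log-integral is lgamma(n + 1).
n = 50
g = lambda u: math.exp(u) - (n + 1) * u
g2 = lambda u: math.exp(u)  # g''(u)

u_hat = minimize_scalar(g, bounds=(0.0, 10.0), method="bounded").x
l1 = laplace_l1(g, g2, u_hat)
exact = math.lgamma(n + 1)
print(exact - l1)  # error of order 1/n, close to 1/(12(n + 1))
```

Here the first-order error recovers the familiar leading Stirling correction for the Gamma function.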
In this paper, we study the errors $\epsilon_k = \hat\ell_k - \ell$ in these Laplace approximations to the log-integral.

2.2 An expansion over bipartitions

Shun and McCullagh (1995) give a series expansion for the log-integral in terms of particular bipartitions. For positive integers v and m, define the set of M-bipartitions $M_{v,m}$ to be the set of all pairs (P, Q) such that $P = (p_1 | \dots | p_v)$ and $Q = (q_1 | \dots | q_m)$ are both partitions of $\{1, \dots, 2m\}$, each block of P contains at least three elements, and each block of Q contains exactly two elements.

For each $(P, Q) \in M_{v,m}$, define a corresponding graph $G_{P,Q}$ with vertices $1, \dots, 2m$, and an edge between each pair of vertices contained in the same block of either P or Q. If $G_{P,Q}$ is a connected graph, we say that (P, Q) is a connected bipartition, and write $(P, Q) \in M^C_{v,m}$. We define the level of $(P, Q) \in M_{v,m}$ to be $l = m - v$, and write $M^C_l$ for the set of all connected level-l M-bipartitions. For a vector of indices I, write $g_I(u) = \nabla_{u_I} g(u)$ and $g_I = g_I(\hat u)$. Let $g^{(k)}$ be the k-dimensional array with entries $g^{(k)}_{j_1,\dots,j_k} = g_{j_1,\dots,j_k}$, and write $g^{jk}$ for the (j, k) entry of $[g^{(2)}]^{-1}$. Then define
$$e_{P,Q} = \frac{(-1)^v}{v! \prod_{p \in P} |p|!} \sum_{j \in [1:d]^{2m}} \prod_{p \in P} g_{j_p} \prod_{q \in Q} g^{j_q}, \tag{3}$$
where $[1:d]^{2m} = \{(j_1, \dots, j_{2m}) : j_l \in \{1, \dots, d\}\}$, and $j_p$ is the sub-vector of $j = (j_1, \dots, j_{2m})$ corresponding to the indices in p.
We may write the level-l contribution to the log-integral, $e_l$, as a sum of contributions from each connected level-l M-bipartition:
$$e_l = \sum_{(P,Q) \in M^C_l} e_{P,Q}. \tag{4}$$

2.3 The level-1 contribution
To demonstrate the definitions in Section 2.2, we find the level-1 contribution e 1 , used in the second-order Laplace approximation.
A level-1 bipartition has either $v = 1$, $m = 2$, with P a single block of four elements, or $v = 2$, $m = 3$, with P two blocks of three elements. Representatives of the three distinct patterns of connected level-1 bipartition are $P_1 = (1234)$, $Q_1 = (12|34)$; $P_2 = (123|456)$, $Q_2 = (12|36|45)$; and $P_3 = (123|456)$, $Q_3 = (14|25|36)$. McCullagh (1987) lists 3 bipartitions similar to $(P_1, Q_1)$, 9 similar to $(P_2, Q_2)$ and 6 similar to $(P_3, Q_3)$, so the level-1 contribution is $e_1 = 3e_{P_1,Q_1} + 9e_{P_2,Q_2} + 6e_{P_3,Q_3}$, and the second-order Laplace approximation to the log-integral is $\hat\ell_2 = \hat\ell_1 + e_1$. Written out explicitly,
$$e_1 = -\frac{1}{8} \sum_{j \in [1:d]^4} g_{j_1 j_2 j_3 j_4} g^{j_1 j_2} g^{j_3 j_4} + \frac{1}{8} \sum_{j \in [1:d]^6} g_{j_1 j_2 j_3} g_{j_4 j_5 j_6} g^{j_1 j_2} g^{j_3 j_6} g^{j_4 j_5} + \frac{1}{12} \sum_{j \in [1:d]^6} g_{j_1 j_2 j_3} g_{j_4 j_5 j_6} g^{j_1 j_4} g^{j_2 j_5} g^{j_3 j_6}. \tag{5}$$
There may be more efficient ways to compute $e_1$ than direct computation of the sums in (5). For example, Zipunnikov and Booth (2011) describe a more efficient method for computing these terms for a generalized linear mixed model.
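In one dimension, the level-1 contribution reduces to the classical correction $e_1 = -g^{(4)}(\hat u)/\{8\, g^{(2)}(\hat u)^2\} + 5\, g^{(3)}(\hat u)^2/\{24\, g^{(2)}(\hat u)^3\}$. A minimal numerical check (the test integrand and sample size are our own illustrative choices) again uses $g(u) = e^u - (n+1)u$, for which the exact log-integral is lgamma(n + 1):

```python
import math

n = 50
m = n + 1.0
u_hat = math.log(m)              # solves g'(u) = e^u - m = 0
g_hat = m - m * u_hat            # g(u_hat) for g(u) = e^u - m u
g2 = g3 = g4 = m                 # derivatives of g at u_hat all equal e^{u_hat} = m

l1 = -g_hat + 0.5 * math.log(2.0 * math.pi / g2)
e1 = -g4 / (8.0 * g2 ** 2) + 5.0 * g3 ** 2 / (24.0 * g2 ** 3)  # = 1/(12 m) here
l2 = l1 + e1                     # second-order Laplace approximation

exact = math.lgamma(n + 1)
print(exact - l1, exact - l2)    # first-order error ~ 1/(12 m); second-order far smaller
```

Adding $e_1$ reduces the error from order $n^{-1}$ to order $n^{-3}$ in this example, matching the Stirling series.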

2.4 Asymptotic order of terms
We use a particular notion of a random array being of order 1 in probability, written $A = O^*_p(1)$; this notion, which controls the size of sums of the entries of the array over its indices, is used in Condition 2 below and throughout the proofs in Appendix A.

2.5 Assumptions
We assume that g(.) in (1) satisfies the following conditions.

Condition 1. g(.) is a smooth function with a unique minimum.
For a given choice of normalizing terms $n_1, \dots, n_d$, and for each vector of indices I, define the normalized derivatives
$$f_I(u) = \Big(\prod_{i \in I} n_i^{-1/|I|}\Big) g_I(u), \qquad f_I = f_I(\hat u).$$

Condition 2. There is some choice of normalizing terms $n_1, \dots, n_d$ such that the normalized derivative arrays satisfy $f^{(k)} = O^*_p(1)$ for each $k \geq 3$, and $[f^{(2)}]^{-1} = O^*_p(1)$.

The normalizing terms are often chosen so that $g_{jj} = \Theta_p(n_j)$, and we may think of $n_j$ as an effective sample size for $u_j$.

2.6 Error in log-integral approximations
We state here our main result, which is proved in Appendix A.
Theorem 1. Suppose L is of form (1), where g(.) satisfies Conditions 1 and 2, for some choice of normalizing terms $n_1, \dots, n_d$. Then the error in the order-k Laplace approximation to log L is
$$\epsilon_k = \hat\ell_k - \log L = O_p\Big(\sum_{j=1}^d n_j^{-k}\Big).$$
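A numerical illustration of the k = 1 case (the per-component test integral and the sizes are our own choices): for a product of independent one-dimensional integrals with unequal effective sample sizes, the log-integral errors add, and the total behaves like $\sum_j n_j^{-1}$, dominated by the smallest $n_j$:

```python
import math

# First-order Laplace error for the 1-d test integral
# log ∫ exp{-(e^u - m u)} du = lgamma(m), with m = n_j + 1;
# the error is approximately 1/(12 m) = O(n_j^{-1}).
def laplace_error(n_j):
    m = n_j + 1.0
    l1 = -(m - m * math.log(m)) + 0.5 * math.log(2.0 * math.pi / m)
    return math.lgamma(m) - l1

# For a product of independent integrals, the log-integral errors add,
# so the total error is governed by sum_j n_j^{-1}.
sizes = (30, 300)
errors = [laplace_error(n_j) for n_j in sizes]
total = sum(errors)
print(total, sum(1.0 / (12.0 * (n_j + 1)) for n_j in sizes))
```

The component with the smaller effective sample size contributes most of the total error, as the theorem's bound suggests.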

2.7 Linear reparameterizations
Laplace approximations are invariant to linear reparameterizations. That is, if we make an invertible linear change of variables and write $L^{(v)}$ for the resulting integral (with the Jacobian of the transformation absorbed into g), we have $L^{(v)} = L$, and the order-k Laplace approximation of L is unchanged by the reparameterization, so that $\hat L^{(v)}_k = \hat L_k$. In many situations, Condition 2 does not hold in the original parameterization, but does hold after making a suitable linear reparameterization, so we may still apply Theorem 1. We give an example of this in Section 3.5.
3 Application to likelihood approximation for generalized linear mixed models

3.1 The model
In a generalized linear mixed model, the distribution of the response $Y = (Y_1, \dots, Y_n)$ is determined by a linear predictor $\eta = (\eta_1, \dots, \eta_n)$. Conditional on η, the components $Y_i$ of the response are independent, with known density function $f(y_i | \eta_i)$. We assume an exponential family with canonical link, so that
$$f(y_i \mid \eta_i) = \exp\left[\frac{y_i \eta_i - b(\eta_i)}{a_i(\phi)} + c_i(y_i, \phi)\right],$$
where b(.) is a smooth and convex function, $a_i(\phi) > 0$, and φ is the dispersion parameter, which we assume here to be known. The linear predictor is modelled as $\eta = X\beta + Zu$, where $X \in \mathbb{R}^{n \times p}$ and $Z \in \mathbb{R}^{n \times d}$ are design matrices, $\beta \in \mathbb{R}^p$ is a vector of fixed effects, and $u \in \mathbb{R}^d$ is a vector of random effects. We assume that $u \sim N_d(0, \Sigma(\psi))$, where $\psi \in \mathbb{R}^q$ is an unknown parameter, and write $\theta = (\beta, \psi)$ for the full vector of unknown parameters.

3.2 The likelihood
The likelihood for this model is
$$L(\theta) = \det\{2\pi\Sigma(\psi)\}^{-1/2} \int_{\mathbb{R}^d} \exp\{-g(u; \theta)\}\, \mathrm{d}u, \tag{6}$$
where
$$g(u; \theta) = h(u; \theta) + \tfrac{1}{2} u^T \Sigma(\psi)^{-1} u \tag{7}$$
and
$$h(u; \theta) = -\sum_{i=1}^n \log f(y_i \mid \eta_i). \tag{8}$$
The integral (6) is typically intractable, except in the special case of a linear mixed model, where the $Y_i | \eta_i$ are normally distributed. Because of this intractability, it is common to use some numerical approximation $\hat L(\theta)$ to the likelihood, and a first-order Laplace approximation is often used. For example, by default the lme4 R package (Bates et al., 2015) uses a first-order Laplace approximation to the likelihood for inference, and the integrated nested Laplace approximation of Rue et al. (2009) is a Bayesian approach based on Laplace approximations to the likelihood.
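To make the intractability concrete, here is a sketch (the model sizes, parameter values, and the use of brute-force quadrature as a stand-in for the exact integral are all our own illustrative choices) comparing the first-order Laplace approximation with near-exact quadrature for a single cluster of a Poisson random intercept model:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

rng = np.random.default_rng(1)
beta, sigma, n_j = 0.5, 1.0, 20                   # illustrative values
y = rng.poisson(np.exp(beta + rng.normal(0.0, sigma)), size=n_j)

# g(u) = -sum_i {y_i eta - exp(eta)} + u^2 / (2 sigma^2), with eta = beta + u
# (canonical log link); the log(y_i!) terms do not involve u and are dropped.
def g(u):
    eta = beta + u
    return -(np.sum(y) * eta - n_j * np.exp(eta)) + u ** 2 / (2.0 * sigma ** 2)

def g2(u):  # second derivative of g
    return n_j * np.exp(beta + u) + 1.0 / sigma ** 2

u_hat = minimize_scalar(g, bounds=(-20.0, 20.0), method="bounded").x
l1 = -g(u_hat) + 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(g2(u_hat))

# Near-exact value by quadrature, stabilised by factoring out exp{-g(u_hat)}
half = 8.0 / np.sqrt(g2(u_hat))                   # +/- 8 posterior sd's
val, _ = quad(lambda u: np.exp(-(g(u) - g(u_hat))), u_hat - half, u_hat + half)
l_exact = -g(u_hat) + np.log(val)
print(l1 - l_exact)  # first-order Laplace error for this cluster
```

For a single cluster the one-dimensional integral can still be evaluated almost exactly, which is what makes the two-level model below a useful benchmark.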

3.3 Assumption checking
In order to apply Theorem 1 to the likelihood of a generalized linear mixed model, we must first show that g(.), as defined in (7), satisfies Conditions 1 and 2. We drop θ from the notation, so that (6) is of form (1). We can show that Condition 1 holds in all cases; the proof of the following proposition is given in Appendix B.
Proposition 1. Let g(u) be as defined in (7), where Σ is a positive definite matrix. Then g(.) satisfies Condition 1.
We need to show that Condition 2 holds on a case-by-case basis.
In our examples, we choose the normalizing term n j to be the number of observations which involve u j .

3.4 A two-level random intercept model
We consider a two-level random intercept model, which is a special case of the generalized linear mixed model of Section 3.1 in which each observation i is contained in a cluster c(i). Observations in the same cluster j are correlated by a shared random effect $u_j$. The linear predictor is $\eta_i = x_i^T \beta + u_{c(i)}$ ($i = 1, \dots, n$), where we suppose the $u_j$ are independent $N(0, \sigma^2)$ random variables. In the notation of Section 3.1, we have $Z_{i,c(i)} = 1$ and $Z_{i,j} = 0$ for $j \neq c(i)$, and $\Sigma = \sigma^2 I$, where I is an identity matrix.
In this special case, the likelihood (6) simplifies to a product of one-dimensional integrals,
$$L(\theta) = \prod_{j=1}^d (2\pi\sigma^2)^{-1/2} \int_{\mathbb{R}} \exp\{-g_j(u_j)\}\, \mathrm{d}u_j, \quad \text{where } g_j(u_j) = -\sum_{i : c(i) = j} \log f(y_i \mid \eta_i) + \frac{u_j^2}{2\sigma^2},$$
so the log-likelihood may be written as a sum of one-dimensional log-integrals,
$$\ell(\theta) = \sum_{j=1}^d \ell_j(\theta), \tag{9}$$
and $\epsilon_k$ is a sum of d separate error terms, one for each cluster.
Proposition 2. Suppose we have a two-level random intercept model, with $n_j$ observations on cluster j, for $j = 1, \dots, d$. The error in the order-k Laplace approximation to the log-likelihood is
$$\epsilon_k(\theta) = O_p\Big(\sum_{j=1}^d n_j^{-k}\Big).$$

Proof. The derivative arrays $g^{(k)}$ are diagonal for all k, with diagonal entries $g_{j \dots j} = \Theta_p(n_j)$, so Condition 2 holds with normalizing terms $n_1, \dots, n_d$. Theorem 1 then gives $\epsilon_k(\theta) = O_p(\sum_{j=1}^d n_j^{-k})$, as required.
In the balanced case, where all $n_j = nd^{-1}$, we have $\epsilon_k = O_p(d^{k+1} n^{-k})$. This tends to zero as $n \to \infty$ if $d = o(n^{k/(k+1)})$. In particular, the error in the first-order Laplace approximation tends to zero if $d = o(n^{1/2})$.
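A quick numerical check of the balanced-case rate for k = 1 (using the 1-d test integral log ∫ exp{-(eᵘ - mu)} du = lgamma(m) as a stand-in for each cluster's integral, an assumption of this sketch): with n fixed and $n_j = n/d$, the total error grows roughly like $d^2/(12n)$:

```python
import math

def laplace_error(n_j):
    # First-order Laplace error for one cluster's 1-d test integral,
    # log ∫ exp{-(e^u - m u)} du = lgamma(m), with m = n_j + 1.
    m = n_j + 1.0
    l1 = -(m - m * math.log(m)) + 0.5 * math.log(2.0 * math.pi / m)
    return math.lgamma(m) - l1

n = 10000
for d in (10, 100):                  # balanced clusters, n_j = n / d
    total = d * laplace_error(n // d)
    print(d, total)                  # grows roughly like d^2 / (12 n)
```

Multiplying d by 10 multiplies the total first-order error by roughly 100, consistent with the $d^2 n^{-1}$ rate.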
In an unbalanced case, the result can be quite different. As an extreme example, the cluster sizes $n_1, \dots, n_d$ may be chosen, with total sample size $n > d \log d$, so that $\sum_{j=1}^d n_j^{-1}$ tends to infinity as $d \to \infty$, no matter how large n is relative to d. For example, if $n = d^4$, then $d = o(n^{1/3})$, but $\epsilon_1 \to \infty$.
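This can be checked numerically with one concrete unbalanced allocation (our own illustrative choice): put ⌈log d⌉ observations into each of d − 1 small clusters and all remaining observations into one large cluster, which is possible whenever n > d log d. The quantity $\sum_j n_j^{-1}$ then grows with d even when $n = d^4$:

```python
import math

# One concrete highly-unbalanced allocation (our illustrative choice):
# d - 1 small clusters of size ceil(log d), and one large cluster holding
# the remaining observations; this requires n > d log d.
def sum_inverse_sizes(d, n):
    small = math.ceil(math.log(d))
    sizes = [n - (d - 1) * small] + [small] * (d - 1)
    assert min(sizes) > 0
    return sum(1.0 / n_j for n_j in sizes)  # the quantity sum_j n_j^{-1}

for d in (10, 100, 1000):
    n = d ** 4                # so d = o(n^{1/3}), yet the sum still grows
    print(d, sum_inverse_sizes(d, n))
```

The sum grows like d/log d, even though the total sample size grows like d⁴.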

3.5 A multilevel random intercept model
Suppose that each observation i is contained in a level-2 cluster $c_2(i)$, and that each level-2 cluster j is itself contained within a hierarchy of higher-level clusters $c_l(j)$, for $l = 3, \dots, L$. The clusters are nested within one another, so that if $c_l(j) = c_l(k)$, then $c_{l+1}(j) = c_{l+1}(k)$. The linear predictor is
$$\eta_i = x_i^T \beta + u^{(2)}_{c_2(i)} + \sum_{l=3}^L u^{(l)}_{c_l(c_2(i))},$$
where we assume $u^{(l)}_j \sim N(0, \sigma_l^2)$, for $l = 2, \dots, L$, with all the $u^{(l)}_j$ independent. Suppose that there are d level-2 clusters in total, and $d_l$ level-l clusters, for each $l = 3, \dots, L$. It is no longer possible to write the log-likelihood as a sum of one-dimensional log-integrals as in (9). Since an accurate approximation to the exact log-likelihood is no longer readily available, it is important to understand the quality of the Laplace approximation in this case.
Condition 2 does not hold in this parameterization, so we define a new parameterization of the model. Let
$$v_j = u^{(2)}_j + \sum_{l=3}^L u^{(l)}_{c_l(j)},$$
so that $\eta_i = x_i^T \beta + v_{c_2(i)}$, and there are now a total of d random effects, rather than $d + d_3 + \dots + d_L$ in the original parameterization. We have reduced the structure to the two-level random intercept model of Section 3.4, except that now the random effects $v_1, \dots, v_d$ are no longer independent.

Proposition 3. Suppose we have an L-level random intercept model with independent random effects, with $n_j$ observations in level-2 cluster j, for $j = 1, \dots, d$. The error in the order-k Laplace approximation to the log-likelihood is
$$\epsilon_k(\theta) = O_p\Big(\sum_{j=1}^d n_j^{-k}\Big).$$
The proof is in Appendix B.
The asymptotic order of the error in a Laplace approximation to the log-likelihood depends on the number of observations in each of the level-2 clusters, but not on how these level-2 clusters are grouped into higher-level clusters.
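The collapse of a multilevel structure into correlated level-2 effects can be sketched numerically (the three-level layout, the cluster assignment c3, and the variances below are our own illustrative choices): defining $v_j = u^{(2)}_j + u^{(3)}_{c_3(j)}$ leaves d random effects whose covariance matrix is block-structured rather than diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-level layout: d = 6 level-2 clusters, grouped into 3
# level-3 clusters, with c3[j] the level-3 cluster containing j.
d = 6
c3 = np.array([0, 0, 1, 1, 2, 2])
sigma2, sigma3 = 1.0, 0.5

u2 = rng.normal(0.0, sigma2, size=d)   # level-2 random intercepts
u3 = rng.normal(0.0, sigma3, size=3)   # level-3 random intercepts

# Collapsed effects: v_j = u2_j + u3_{c3(j)} gives the same linear predictor
# with d random effects instead of d + 3.
v = u2 + u3[c3]

# Covariance of v: sigma2^2 on the diagonal, plus sigma3^2 for every pair of
# level-2 clusters sharing a level-3 cluster -- block-diagonal, not diagonal.
A = np.zeros((d, 3))
A[np.arange(d), c3] = 1.0
Sigma = sigma2 ** 2 * np.eye(d) + sigma3 ** 2 * (A @ A.T)
print(Sigma)
```

The off-diagonal blocks record the correlation induced by shared higher-level effects, which is why Condition 2 must be re-checked after the reparameterization.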

3.6 Impact on approximate likelihood inference
When an approximate likelihood $\hat L(\theta)$ is used for inference, the impact of the error in the likelihood approximation on the resulting inference is of more interest than the size of that error itself. If the error in the log-likelihood $\epsilon(\theta) = \log \hat L(\theta) - \log L(\theta)$ tends to zero in probability, uniformly in θ, Douc et al. (2004) show that the approximate likelihood estimator $\hat\theta$ will be fully efficient, with the same first-order asymptotic distribution as the maximum likelihood estimator. In our examples, if
$$\sum_{j=1}^d n_j^{-k} \to 0, \tag{10}$$
we expect the order-k Laplace estimator to be fully efficient. In order to make the argument rigorous, we would need to show that the supremum of the error in the log-likelihood in some region around the true parameter value tends to zero. However, condition (10) is likely to be stronger than necessary for an order-k Laplace estimator to be fully efficient. Ogden (2017) gives conditions on the size of the error in the score function $\nabla_\theta \epsilon(\theta)$ which ensure that inference with an approximate likelihood retains the same first-order properties as inference with the exact likelihood. By studying this error in the score, it should be possible to show that the order-k Laplace estimator is fully efficient under a weaker condition than (10). Some modification of the results of Ogden (2017) would be required before they could be used in this case, as information on different components of the parameter vector may grow at different rates (Nie, 2007).

Appendix A Proof of main result
To prove Theorem 1, we aim to find the size of the contribution to the error from each connected bipartition (P, Q).

Lemma 1. Suppose that Conditions 1 and 2 hold. Then, for each $(P, Q) \in M^C_l$, $e_{P,Q} = O_p(\sum_{j=1}^d n_j^{-l})$.
In order to prove Lemma 1, we need some auxiliary results.
Proposition 4. For each $(P, Q) \in M_{v,m}$ of level $l = m - v$, we may write
$$\prod_{p \in P} g_{j_p} \prod_{q \in Q} g^{j_q} = A_{P,Q}(j) \prod_{i=1}^{2m} n_{j_i}^{c_i},$$
where $\sum_{j=1}^{2m} c_j = -l$, and each $c_j < 0$.

Proof. We may write the product in this form for some $c_1, \dots, c_{2m}$, with $A_{P,Q}(j)$ expressed in terms of the normalized derivative arrays. Each index i is contained in exactly one block p of P and one block q of Q: the factor $g_{j_p}$ contributes an exponent $1/|p|$ to $n_{j_i}$, and the factor $g^{j_q}$ contributes an exponent $-1/2$, so $c_i = -1/2 + 1/|p|$, for whichever p contains i, and $c_i < 0$ as $|p| \geq 3$. We have
$$\sum_{i=1}^{2m} c_i = \sum_{p \in P} \sum_{i \in p} \Big(\frac{1}{|p|} - \frac{1}{2}\Big) = v - m = -l,$$
which gives the result.
Proposition 5. Suppose that A and B are random arrays, indexed by $j_S$ and $j_T$ respectively, with $A = O^*_p(1)$ and $B = O^*_p(1)$, and that $S \cap T \neq \emptyset$. Then the array C with entries $C_j = A_{j_S} B_{j_T}$, indexed over $S \cup T$, satisfies $C = O^*_p(1)$.

Proof. We use induction on $k = \dim(C)$. In the case k = 1, we have S = T, since $S \cap T \neq \emptyset$, so $C_{j_1} = A_{j_1} B_{j_1} = O_p(1)$, and $C = O^*_p(1)$. Now we suppose the hypothesis is true for $\dim(C) = k - 1$, and consider $\dim(C) = k \geq 2$.
Writing $j_{-a} = (j_1, \dots, j_{a-1}, j_{a+1}, \dots, j_k)$ and $C^{-a}_{j_{-a}} = \sum_{j_a} |C_j|$, if we can show that $C^{-a} = O^*_p(1)$ for some a, then $C = O^*_p(1)$. In the first case, we must have $\dim(A) \geq 2$, since otherwise $S = \{a\}$ and $S \cap T = \emptyset$, which would be a contradiction. Since $A = O^*_p(1)$, the array $A^{-a}$ with entries $A^{-a}_{j_{S \setminus a}} = \sum_{j_a} |A_{j_S}|$ must also be $O^*_p(1)$. So the array $C^{-a}$ with entries $C^{-a}_{j_{-a}} = |B_{j_T}| A^{-a}_{j_{S \setminus a}}$ is $O^*_p(1)$, by the induction hypothesis, since $\dim(C^{-a}) = k - 1$. Similarly, in the second case $C^{-a} = O^*_p(1)$. In the third case, we may bound the relevant sum by the Cauchy–Schwarz inequality, so again $C^{-a} = O^*_p(1)$, by the induction hypothesis.
Proposition 6. Suppose that Condition 2 holds. Then, for each connected bipartition (P, Q), $A_{P,Q} = O^*_p(1)$.

Proof. We have $A_{P,Q}(j) = \prod_{p \in P} f_{j_p} \prod_{q \in Q} f^{j_q}$. Since $f^{(k)} = O^*_p(1)$ for $k \geq 3$ and $[f^{(2)}]^{-1} = O^*_p(1)$, $A_{P,Q}$ is a product of $O^*_p(1)$ arrays. We build up this product one term at a time, at each step applying Proposition 5 to show that the product remains $O^*_p(1)$. We start with an arbitrary $p_1 \in P$, and choose $q_1 \in Q$ such that one element of $q_1$ is in $p_1$, and the other is not in $p_1$, and therefore must be in some other block $p_2 \in P$. If $v > 1$, it will always be possible to find such a $q_1$, because (P, Q) is a connected bipartition, so blocks of P (which form disjoint clusters in $G_{P,Q}$) are connected by blocks of Q.
Let $S^1$ be the array with entries $S^1_{j_{p_1 \cup q_1}} = f_{j_{p_1}} f^{j_{q_1}}$. Then $S^1 = O^*_p(1)$ by Proposition 5, as $p_1 \cap q_1 \neq \emptyset$. Let $T^1$ be the array with entries $T^1_{j_{p_1 \cup q_1 \cup p_2}} = S^1_{j_{p_1 \cup q_1}} f_{j_{p_2}}$; then $T^1 = O^*_p(1)$, as $p_2 \cap (p_1 \cup q_1) = p_2 \cap q_1 \neq \emptyset$. We continue to choose alternating terms from blocks of Q and P, at step k choosing a block $q_k$ with one entry in $p_1 \cup \dots \cup p_k$, and the other entry in a new block $p_{k+1}$. At each stage k we have $S^k = O^*_p(1)$ and $T^k = O^*_p(1)$. We continue until we have included all blocks of P, and have $T^{v-1} = O^*_p(1)$. We have already included terms from $v - 1$ blocks of Q. We may multiply in the remaining $m - v + 1$ blocks of Q while retaining an $O^*_p(1)$ array by Proposition 5, as $T^{v-1}$ is an array on all indices $j_1, \dots, j_{2m}$, and $q \cap \{1, \dots, 2m\} = q \neq \emptyset$ for each $q \in Q$. So $A_{P,Q} = O^*_p(1)$, as required.

Proof of Lemma 1. By Proposition 4,
$$e_{P,Q} = c_{P,Q} \sum_{j \in [1:d]^{2m}} A_{P,Q}(j) \prod_{i=1}^{2m} n_{j_i}^{c_i}, \tag{11}$$
where $c_{P,Q}$ is a constant not depending on d. We apply the weighted form of the inequality of arithmetic and geometric means, which states that given non-negative numbers $x_1, \dots, x_n$ and non-negative weights $w_1, \dots, w_n$ with $\sum_i w_i = 1$, $\prod_i x_i^{w_i} \leq \sum_i w_i x_i$. Here, we let $n = 2m$, $x_i = n_{j_i}^{-l}$ and $w_i = -c_i/l$, to give that
$$\prod_{i=1}^{2m} n_{j_i}^{c_i} \leq \sum_{i=1}^{2m} \Big(-\frac{c_i}{l}\Big) n_{j_i}^{-l}. \tag{12}$$
Putting (12) back into (11) gives
$$|e_{P,Q}| \leq |c_{P,Q}| \sum_{i=1}^{2m} \Big(-\frac{c_i}{l}\Big) \sum_{j \in [1:d]^{2m}} |A_{P,Q}(j)|\, n_{j_i}^{-l}.$$
For each i, since $A_{P,Q} = O^*_p(1)$, summing $|A_{P,Q}(j)|$ over the indices $j_1, \dots, j_{i-1}, j_{i+1}, \dots, j_{2m}$ leaves an $O_p(1)$ quantity for each value of $j_i$, so
$$\sum_{j \in [1:d]^{2m}} |A_{P,Q}(j)|\, n_{j_i}^{-l} = O_p\Big(\sum_{j=1}^d n_j^{-l}\Big),$$
and $e_{P,Q} = O_p(\sum_{j=1}^d n_j^{-l})$, as required.

Appendix B Proofs for examples
Proof of Proposition 1. The matrix of second derivatives of g(.) with respect to u is $g^{(2)}(u) = h^{(2)}(u) + \Sigma^{-1}$, where $h^{(2)}(u)$ is the matrix of second derivatives of h(.) with respect to u, and h(.) is defined in (8). We have $h^{(2)}(u) = Z^T W(u) Z$, where W(u) is a diagonal matrix with diagonal entries
$$W_{ii}(u) = \frac{b''(x_i^T \beta + z_i^T u)}{a_i(\phi)}.$$
But $a_i(\phi) > 0$, and since b(.) is a convex function, $b''(x_i^T \beta + z_i^T u) \geq 0$, so $W_{ii}(u) \geq 0$ for all u. So W(u) is a non-negative definite matrix, and for any $x \in \mathbb{R}^d$, $x^T h^{(2)}(u) x = (Zx)^T W(u) (Zx) \geq 0$, which means that $h^{(2)}(u)$ is non-negative definite. Since $\Sigma^{-1}$ is positive definite, this means that $g^{(2)}(u)$ is positive definite for all u, so g(.) is strictly convex, and therefore has a unique minimum. Since b(.) is a smooth function, so is g(.), so Condition 1 holds.
Proof of Proposition 3. To prove the result, we need to show that, after the reparameterization, Condition 2 holds with normalizing terms $n_1, \dots, n_d$, so that we can apply Theorem 1.
For each l, let $\Sigma_{[l]}$ denote the covariance matrix of the partial sums of the random effects up to level l, so that $\Sigma = \Sigma_{[L]}$. Then $\Sigma_{[l]}$ is a block-diagonal matrix, with $d_l$ blocks, one for each level-l cluster. Applying the Sherman–Morrison formula to invert each block of $\Sigma_{[l]}$ gives a recursive expression (14) for $\Sigma^{-1}_{[l]}$ in terms of $\Sigma^{-1}_{[l-1]}$, involving quantities $r_j$ and $s_c$. We hypothesize a form for the order of the entries of $\Sigma^{-1}_{[l]}$, and prove this by induction on l. The claim is true for $l = 2$, as $\Sigma^{-1}_{[2]} = \sigma_2^{-2} I$. For $l > 2$, applying the induction hypothesis to (14), we find $r_j = \Theta(1)$, so $s_c = \Theta(d^l_c)$ and
$$\frac{\sigma_l^2 r_j r_k}{1 + \sigma_l^2 s_{c_l(j)}} = \Theta\big((d^l_c)^{-1}\big).$$