# Likelihood functions for state space models with diffuse initial conditions

## Authors

Correspondence to: Siem Jan Koopman, Department of Econometrics, Vrije Universiteit Amsterdam, De Boelelaan 1105, NL-1081 HV Amsterdam, The Netherlands.

## Abstract

State space models with non-stationary processes and/or fixed regression effects require a state vector with diffuse initial conditions. Different likelihood functions can be adopted for the estimation of parameters in time-series models with diffuse initial conditions. In this article, we consider profile, diffuse and marginal likelihood functions. The marginal likelihood function is defined as the likelihood function of a transformation of the data vector. The transformation is not unique. The diffuse likelihood is a marginal likelihood for a data transformation that may depend on parameters. Therefore, the diffuse likelihood cannot be used generally for parameter estimation. The marginal likelihood function is based on an orthonormal data transformation that does not depend on parameters. Here we develop a marginal likelihood function for state space models that can be evaluated by the Kalman filter. The so-called diffuse Kalman filter is designed for computing the diffuse likelihood function. We show that a minor modification of the diffuse Kalman filter is needed for the evaluation of our marginal likelihood function. Diffuse and marginal likelihood functions have better small sample properties compared with the profile likelihood function for the estimation of parameters in linear time series models. The results in our article confirm the earlier findings and show that the diffuse likelihood function is not appropriate for a range of state space model specifications.

## 1. Introduction

Consider the linear regression model y = Xβ + u with observation vector y, covariate matrix X, regression coefficient vector β and disturbance vector u ∼ N(0, σ²Ω), where σ is a scaling factor and Ω is a variance matrix depending on the vector of nuisance parameters θ. We may therefore write Ω = Ω(θ) and possibly X = X(θ). The marginal likelihood function is defined as the likelihood function of a transformation of the observations in y such that the transformed data are orthogonal to X and therefore independent of β. The profile likelihood function for the linear regression model is the likelihood function evaluated at the maximum likelihood estimate of β. In econometrics, the profile likelihood function is also known as the concentrated likelihood function. Among others, Cooper and Thompson (1977) and Tunnicliffe Wilson (1989) argue that the marginal likelihood is superior to the profile likelihood for the inference of nuisance parameters collected in vector θ. Small sample evidence for time-series models is provided by Shephard (1993). The marginal likelihood is for a (transformed) random variable and therefore its score vector has expectation zero; see, for example, Shephard (1993), Rahman and King (1997) and Francke and de Vos (2007).

The state space form for linear Gaussian time series models is convenient for likelihood-based estimation, signal extraction and forecasting. State space models can be represented as linear regression models with specifically designed matrices X and Ω, see Durbin and Koopman (2001, section 4.11). The likelihood function for stationary time-series models can be evaluated by the Kalman filter as it effectively carries out the prediction error decomposition; see Schweppe (1965) and Harvey (1989). Nuisance parameter vector θ can be estimated by directly maximizing the likelihood function. Time-series models with (time-varying) regression parameters and non-stationary latent factors require state space formulations with unknown initial conditions. In cases where the initial conditions are treated as fixed regression coefficients, the profile likelihood function can be computed as in Rosenberg (1973). When they are treated as random variables with large variances converging to infinity, a so-called diffuse likelihood function can be defined and computed as described in, among others, Harvey (1989, section 3.4.3), Ansley and Kohn (1985, 1990), De Jong (1988, 1991) and Koopman (1997). The diffuse likelihood function is a marginal likelihood function based on a transformation that is not necessarily invariant to the parameter vector θ. In this article, we develop a marginal likelihood function for the linear Gaussian state space model that is always invariant to θ when θ is linearly dependent on X. The evaluation of the marginal likelihood requires a modification of the diffuse Kalman filter. We further discuss its relation with profile and diffuse likelihood functions.

In Section 2, we develop general expressions for the profile, diffuse and marginal likelihood functions and we discuss their merits. Section 3 shows how the Kalman filter needs to be modified for the computation of the marginal likelihood function. Illustrations are given in Section 4. It is shown that different specifications of the same model lead to different diffuse likelihood functions whereas the marginal likelihood functions remain equal. Section 5 concludes.

## 2. Likelihood functions for state space models

For the Nt × 1 vector of time series yt, with t = 1,…,T, the state space model is given by

$$y_t = Z_t \alpha_t + \varepsilon_t, \qquad \alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \tag{1}$$

with p × 1 state vector αt and where the system matrices Zt, Tt and Rt are fixed but may depend on known functions of parameter vector θ. The disturbance vectors ɛt and ηt are mutually and serially independent and distributed by

$$\varepsilon_t \sim N(0, \sigma^2 H_t), \qquad \eta_t \sim N(0, \sigma^2 Q_t), \tag{2}$$

where σ2 is a scaling factor and variance matrices Ht and Qt are fixed but may depend on θ as well. The state space model specification is completed with the initial state vector modelled by

$$\alpha_1 = b + B\beta + C\xi, \qquad \xi \sim N(0, \sigma^2 Q_0), \tag{3}$$

where vector b and matrices B, C and Q0 are fixed system variables of appropriate dimensions. The random vector ξ is independent of the other disturbances. The k × 1 vector of coefficients β can be treated in two ways: (i) as a fixed and unknown vector; (ii) as a diffuse random vector, distributed by β ∼ N(0,σ2Σ), where Σ−1 → 0. The initial state constant b is for known effects, the coefficient vector β is for unknown regression effects and for initial effects in non-stationary processes whereas the random vector ξ is for the exact initialization of stationary processes. As ξ is a random vector with a properly defined variance matrix, we are not interested in case (ii) with Σ as a regular variance matrix and therefore we always assume that Σ−1→ 0 and E(β) = 0 without loss of generality. Finally, the (possibly time-varying) system matrices are fixed and known functions of the vector of nuisance parameters θ. Textbook treatments of state space time series models are, amongst others, given by Anderson and Moore (1979), Harvey (1989) and Durbin and Koopman (2001).

The state space model (1) can be represented as a linear regression model. In particular, we can consider the formulation

$$y = c + X\beta + u, \qquad u \sim N(0, \sigma^2 \Omega). \tag{4}$$

The equivalence of (4) with the state space model is obtained by defining

(5)

where Z = diag(Z1,…,ZT) and with Ω representing the covariance structure implied by the state space model and depending on all system matrices. The dimension of y is n × 1 with n = N1 + ⋯ + NT, and the dimension of X is n × k. As system matrices may depend on θ, the explanatory variable matrix X = X(θ) and covariance matrix Ω = Ω(θ) may also depend on θ.

### 2.1. Profile likelihood function

In terms of the linear regression model (4) with a fixed and unknown β, the likelihood function is denoted by L =  exp {ℓ(y; β, σ, θ)} and the scaled log-likelihood function is given by

(6)

Analytical expressions for the maximum likelihood estimators for β and σ can be obtained and are given by the generalized least squares expressions

(7)

where MΩ = I − X(X′Ω⁻¹X)⁻¹X′Ω⁻¹. The log-likelihood function (6) evaluated at the maximum likelihood estimate of β is given by

(8)

and is defined as the profile log-likelihood function. We obtain the concentrated profile log-likelihood function by replacing σ² by its maximum likelihood estimator in (7), that is,

(9)
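The generalized least squares steps behind the profile likelihood can be sketched numerically. The block below is a minimal illustration for dense matrices, not the paper's implementation: the function name `profile_loglik` is hypothetical, and the log-likelihood is assumed to be the standard Gaussian log-density with the constant 2π retained, which may differ from the scaled form used in (6)–(9) by constants that do not involve θ.

```python
import numpy as np

def profile_loglik(y, c, X, Omega):
    """Return (beta_hat, sigma2_hat, concentrated profile log-likelihood).

    Assumes y = c + X beta + u with u ~ N(0, sigma^2 Omega) and uses the
    standard Gaussian log-density (an assumption of this sketch).
    """
    n = y.shape[0]
    r = y - c
    Oinv_X = np.linalg.solve(Omega, X)      # Omega^{-1} X
    Oinv_r = np.linalg.solve(Omega, r)      # Omega^{-1} (y - c)
    S = X.T @ Oinv_X                        # X' Omega^{-1} X
    s = X.T @ Oinv_r                        # X' Omega^{-1} (y - c)
    beta_hat = np.linalg.solve(S, s)        # GLS estimate of beta
    rss = r @ Oinv_r - s @ beta_hat         # (y - c)' Omega^{-1} M_Omega (y - c)
    sigma2_hat = rss / n                    # ML estimate of sigma^2
    logdet_Omega = np.linalg.slogdet(Omega)[1]
    # At sigma2_hat the quadratic term equals n, hence the trailing "+ n".
    loglik = -0.5 * (n * np.log(2.0 * np.pi * sigma2_hat) + logdet_Omega + n)
    return beta_hat, sigma2_hat, loglik
```

Working with `np.linalg.solve` rather than an explicit inverse of Ω is a design choice for numerical stability; for state space models the same quantities are obtained far more cheaply by the filtering recursions of Section 3.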

### 2.2. Diffuse likelihood function

In terms of the linear regression model with a random vector β ∼ N(0, σ2Σ), the log-likelihood function is given by

(10)

where ℓ(y | β; σ, θ) = ℓ(y; β, σ, θ) is given in (6) whereas ℓ(β; σ, θ) = ℓ(β; σ) is the log-density of β ∼ N(0, σ²Σ).

The density implied by ℓ(β | y;σ,θ) is obtained as follows. Since E(y) = c + XE(β) = c, var(y) = σ2(XΣX′+Ω), E(β) = 0, var(β) = σ2Σ and E(βy′) = σ2ΣX′, we obtain

where we have suppressed the dependence on σ and θ. These results follow from a matrix inversion lemma and some minor manipulations. The last term in the right-hand side of (10) becomes

By rearranging the different terms of the log-likelihood function (10), we obtain

The diffuse log-likelihood function  log LD is defined as

(11)

from which it follows that

(12)

which is equivalent to (8) apart from the term log |X′Ω⁻¹X|. This result is due to De Jong (1991). The log-likelihood function (12) evaluated at the maximum likelihood estimate of σ is given by

(13)

which is equivalent to (9) apart from the term  log |X′Ω−1X|.

The definition of the diffuse log-likelihood function (11) may be regarded as somewhat ad hoc. For example, an alternative suggestion is to define the diffuse log-likelihood function as:

(14)

see De Jong and Chu-Chun-Lin (1994). In light of definition (14), the likelihood functions (12) and (13) remain the same but with n replaced by m = n − k. The alternative definition in (14) becomes relevant in the discussion of the marginal likelihood function in the next subsection.

### 2.3. Marginal likelihood function

The concept of marginal likelihood was introduced by Kalbfleisch and Sprott (1970). The marginal likelihood function for model (4) is defined as the likelihood function that is invariant to the regression coefficient vector β. Many contributions in the statistics literature have developed the concept of marginal likelihoods further and have investigated this approach in more detail; for example, see Patterson and Thompson (1971), Harville (1974), King (1980), Smyth and Verbyla (1996), and Rahman and King (1997). In particular, McCullagh and Nelder (1989) consider the marginal likelihood function for the generalized linear model. The marginal likelihood function has also been adopted for the inference of nuisance parameters in time-series models, for example, see Levenbach (1972), Cooper and Thompson (1977) and Tunnicliffe Wilson (1989). In the linear model y = c + Xβ + u where u ∼ N(0,Ω) with X = X(θ) and Ω = Ω(θ), the marginal likelihood function is for a transformed data vector y* = A′y that does not depend on β. The transformation matrix A has dimension n × m with m = n − k, is of full column rank and is subject to A′X = 0. Apart from these conditions, the choice of matrix A is irrelevant. In our context of likelihood-based inference for θ, it is important to assume that matrix A does not depend on θ.

The scaled log-density function of y* is given by

(15)

since A′X = 0. The equalities

imply that A(A′ΩA)−1A′ = Ω−1MΩ. Furthermore, since

the determinantal term in the density is |A′ΩA| = |Ω| · |A′A| · |X′X|⁻¹ · |X′Ω⁻¹X|. Following Harville (1974) we normalize matrix A such that A′A = Im and |A′A| = 1. The marginal likelihood function with respect to β is based on the density of y* = A′y. The scaled marginal log-likelihood function is then given by

(16)

The marginal likelihood (16) is equivalent to (12) apart from the term log |X′X| and n replaced by m. When the diffuse likelihood function is defined as in (14), the marginal likelihood only differs by the term log |X′X|.

The variance scalar σ2 can also be concentrated out from the marginal likelihood function. The marginal likelihood evaluated at the maximized value of σ is given by

(17)

and is equivalent to (13) apart from the term log |X′X| and n replaced by m. Expressions (16) and (17) are new and convenient for our purposes below.
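The two identities used above, A(A′ΩA)⁻¹A′ = Ω⁻¹MΩ and |A′ΩA| = |Ω| · |X′X|⁻¹ · |X′Ω⁻¹X| for orthonormal A, can be checked numerically. The following sketch is only an illustration under stated assumptions: the function names are hypothetical, A is taken from a complete QR decomposition of X, and the Gaussian constant is retained, so the value may differ from the scaled form (16) by a constant.

```python
import numpy as np

def null_space_basis(X):
    """Orthonormal n x m matrix A with A'X = 0 and A'A = I, m = n - k."""
    n, k = X.shape
    Q_full, _ = np.linalg.qr(X, mode="complete")
    return Q_full[:, k:]                     # columns orthogonal to span(X)

def marginal_loglik(y, c, X, Omega, sigma2):
    """Marginal log-likelihood from the closed-form expression (no A needed)."""
    n, k = X.shape
    m = n - k
    r = y - c
    Oinv_X = np.linalg.solve(Omega, X)
    Oinv_r = np.linalg.solve(Omega, r)
    S = X.T @ Oinv_X                         # X' Omega^{-1} X
    s = X.T @ Oinv_r                         # X' Omega^{-1} (y - c)
    quad = r @ Oinv_r - s @ np.linalg.solve(S, s)
    ld_Omega = np.linalg.slogdet(Omega)[1]
    ld_S = np.linalg.slogdet(S)[1]
    ld_XX = np.linalg.slogdet(X.T @ X)[1]    # the log|X'X| term
    return -0.5 * (m * np.log(2.0 * np.pi * sigma2)
                   + ld_Omega + ld_S - ld_XX + quad / sigma2)

def transformed_loglik(y, c, X, Omega, sigma2):
    """Log-density of y* = A'y evaluated directly as N(A'c, sigma2 A' Omega A)."""
    A = null_space_basis(X)
    z = A.T @ (y - c)
    V = A.T @ Omega @ A
    m = z.shape[0]
    quad = z @ np.linalg.solve(V, z)
    return -0.5 * (m * np.log(2.0 * np.pi * sigma2)
                   + np.linalg.slogdet(V)[1] + quad / sigma2)
```

Both functions should return the same value for any full-column-rank X and positive definite Ω, which illustrates that the marginal likelihood can be evaluated without ever constructing A.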

### 2.4. Discussion of likelihood functions

The close resemblance of the diffuse and marginal likelihoods has been discussed by Shephard (1993) and Kuo (1999). Their marginal likelihood function does not have the term log |X′X| in (16) and the marginal and diffuse likelihood functions are proportional. They also argue that the marginal likelihood function is based on the density of a random variable and therefore the score function has zero expectation. Given that the difference between the profile and marginal likelihoods is the term log |X′Ω⁻¹X| − log |X′X| where Ω = Ω(θ) and X = X(θ), the score of the profile likelihood function generally has non-zero expectation and the profile likelihood is subject to a bias term.

Let us interpret the diffuse likelihood as a marginal likelihood where the data transformation is scaled such that |A′A| is proportional to |X′X|. In cases where X does not depend on θ, the marginal and diffuse likelihoods are indeed proportional to each other and the choice between the two likelihoods is irrelevant for the inference of θ. This fact is recognized by Ansley and Kohn (1985) in their treatment of the diffuse likelihood function and they explicitly assume that θ does not influence the transformation matrix. However, in the next section we consider cases where matrix X does depend on θ, that is, X = X(θ). Then, the data transformation implied by the diffuse likelihood function of Shephard (1993) and Kuo (1999) is based on some matrix A* for which we can assume that |A*′A*| ∝ |X′X| without loss of generality. In case X = X(θ), the diffuse likelihood function is not appropriate for a likelihood-based analysis with respect to θ. The diffuse likelihood only reduces to the marginal likelihood when the data transformation matrix A* does not depend on the unknown parameters θ. The marginal likelihood function defined by (16) is based on the transformation matrix A with A′A = I as shown in the previous subsection. The orthonormal transformation does not depend on θ in linear models and therefore can be used for the inference of θ. In other words, the term log |X′X| in (16) and (17) cannot be ignored.

In case the regression model (4) implies a time-series model in the state space form (1), matrix X and its dependence on θ should be considered carefully. In case of stationary time-series models without regression effects, this issue does not arise as β is not present. In case regression effects are present or the model includes non-stationary processes, coefficient vector β is present and the dependence of covariate matrix X on θ must be taken into account. The use of the marginal likelihood function is recommended for this class of linear time-series models.

## 3. Evaluation of likelihood functions

The Kalman filter effectively carries out the prediction error decomposition for time-series models in the state space representation (1); see Schweppe (1965) and Harvey (1989). The prediction error decomposition is based on

where Yt = {y1,…,yt}. The prediction error vt = yt − E(yt|Yt−1), with its variance matrix Ft = var(yt|Yt−1) = var(vt), is serially uncorrelated when the model is correctly specified. This implies that var(v) = F is block-diagonal with prediction error vector v = (v1′,…,vT′)′ and associated variance matrix F = diag(F1,…, FT). The Kalman filter therefore carries out the Cholesky decomposition Ω = L⁻¹FL′⁻¹, or F = LΩL′, where Ω = Ω(θ) is implied by state space model (1) and n × n matrix L is a lower block unit triangular matrix with |L| = 1. It also implicitly follows that v = L(y − c).

The Kalman filter for the state space model (1) with β = 0 in the initial state specification (3) is given by

(18)

for t = 1,…,T and with a1 = b and P1 = CQ0C′. The likelihood function (6) with β = 0 can be written as:

It follows that the Kalman filter can evaluate the likelihood function (6) with β = 0 in a computationally efficient way.
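The prediction error decomposition can be illustrated with a short filter. This is a minimal sketch under stated assumptions, not a reproduction of recursion (18): the function name `kalman_loglik` is hypothetical, the system matrices are taken as time-invariant, the standard forms v_t = y_t − Z a_t, F_t = Z P_t Z′ + H and K_t = T P_t Z′ F_t⁻¹ are assumed, and the Gaussian constant is retained.

```python
import numpy as np

def kalman_loglik(y, Z, T, R, H, Q, a1, P1, sigma2=1.0):
    """Gaussian log-likelihood via the prediction error decomposition.

    y has shape (n_obs, N). The initial state is alpha_1 ~ N(a1, sigma2 * P1),
    that is, beta = 0 in the initial state specification.
    """
    a, P = a1.copy(), P1.copy()
    loglik = 0.0
    for yt in y:
        v = yt - Z @ a                        # prediction error v_t
        F = Z @ P @ Z.T + H                   # scaled variance F_t of v_t
        K = T @ P @ Z.T @ np.linalg.inv(F)    # Kalman gain K_t
        loglik += -0.5 * (len(yt) * np.log(2.0 * np.pi * sigma2)
                          + np.linalg.slogdet(F)[1]
                          + v @ np.linalg.solve(F, v) / sigma2)
        a = T @ a + K @ v                     # one-step-ahead state prediction
        P = T @ P @ T.T - K @ F @ K.T + R @ Q @ R.T
    return loglik
```

For a local level model with a proper initial variance, the result should agree with the multivariate normal log-density computed from the full n × n matrix Ω, while requiring only operations on small matrices at each time point.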

### 3.1. Evaluation of profile likelihood

The evaluation of the profile likelihood functions (8) and (9) focuses on

where

with V = LX. It follows that

(19)

For these definitions, it is implied that q ≡ (y − c)′Ω⁻¹(y − c), s ≡ X′Ω⁻¹(y − c) and S ≡ X′Ω⁻¹X. Given that the Kalman filter evaluates the block elements of v = L(y − c) recursively, the columns of matrix V = LX = L(X1,…, Xk), where Xi is the ith column of X for i = 1,…,k, can be evaluated simultaneously and recursively in the following way

(20)

with A1 = B and . Furthermore, we have

The Kalman filter with the additional recursion (20) is referred to as the diffuse Kalman filter and is developed by De Jong (1991).

The likelihood function (6), for any β, and the profile log-likelihood functions (8) and (9) can be expressed by

which can be evaluated by the diffuse Kalman filter in a computationally efficient way.
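The accumulation of q, s and S alongside the Kalman filter can be sketched as follows. The sign convention is an assumption of this sketch and may differ from recursion (20): here V_t = Z A_t with A_1 = B and A_{t+1} = T A_t − K_t V_t, which yields V = LX. The function name `augmented_kalman` is hypothetical and the system matrices are taken as time-invariant.

```python
import numpy as np

def augmented_kalman(y, Z, T, R, H, Q, b, B, P1):
    """Return q, s, S and sum_t log|F_t| for a time-invariant model.

    q = (y-c)' Omega^{-1} (y-c), s = X' Omega^{-1} (y-c), S = X' Omega^{-1} X,
    accumulated from v_t and V_t without forming Omega or X.
    """
    k = B.shape[1]
    a, A, P = b.copy(), B.copy(), P1.copy()
    q, s, S, logdet_F = 0.0, np.zeros(k), np.zeros((k, k)), 0.0
    for yt in y:
        v = yt - Z @ a                  # block of v = L(y - c)
        V = Z @ A                       # block of V = L X (assumed sign convention)
        F = Z @ P @ Z.T + H
        F_inv = np.linalg.inv(F)
        K = T @ P @ Z.T @ F_inv
        q += v @ F_inv @ v
        s += V.T @ F_inv @ v
        S += V.T @ F_inv @ V
        logdet_F += np.linalg.slogdet(F)[1]
        a = T @ a + K @ v
        A = T @ A - K @ V               # same filter applied to the columns of X
        P = T @ P @ T.T - K @ F @ K.T + R @ Q @ R.T
    return q, s, S, logdet_F
```

Given these outputs, the generalized least squares estimate is the solution of S β = s and the quadratic form (y − c)′Ω⁻¹MΩ(y − c) equals q − s′S⁻¹s, so the profile and diffuse criteria follow from a single pass through the data.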

### 3.2. Evaluation of diffuse likelihood

The diffuse log-likelihood functions (12) and (13) are evaluated by

respectively. Here we have replaced n by m and in effect have adopted definition (14) for the diffuse likelihood function. All terms can be evaluated by the diffuse Kalman filter.

### 3.3. Evaluation of marginal likelihood

The marginal log-likelihood differs from the diffuse log-likelihood by the term log |X′X|. It follows from the design of X in (5), implied by the state space model (1), that the k × k matrix S* = X′X can be evaluated by the recursion

(21)

with and . The marginal log-likelihood functions are given by

and are evaluated by the diffuse Kalman filter together with the additional recursion (21).
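A possible form of the additional recursion is sketched below. It rests on an assumption that is not confirmed by the text as reproduced here: that the rows of X belonging to time t equal Z_t T_{t−1} ⋯ T_1 B, which is what the state equation and the initial condition α1 = b + Bβ + Cξ imply for the effect of β on y_t. The function name `xtx_recursion` is hypothetical and the system matrices are taken as time-invariant.

```python
import numpy as np

def xtx_recursion(Z, T, B, n_obs):
    """Accumulate S* = X'X where block t of X is Z T^(t-1) B."""
    k = B.shape[1]
    Pi = B.copy()                 # Pi_t = T^(t-1) B, effect of beta on alpha_t
    S_star = np.zeros((k, k))
    for _ in range(n_obs):
        Xt = Z @ Pi               # rows of X for time t
        S_star += Xt.T @ Xt
        Pi = T @ Pi               # propagate the effect of beta one step ahead
    return S_star
```

For a local linear trend model with Z = (1, 0), T = [[1, 1], [0, 1]] and B = I2, the rows of X are (1, t − 1), so the recursion returns the cross-product matrix of a constant and a linear time trend. The log determinant of S* then supplies the extra term that distinguishes the marginal from the diffuse log-likelihood.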

## 4. Two illustrations

This section explores the differences between estimation based on the profile, diffuse and marginal likelihood functions. The diffuse/marginal likelihood functions have score functions with zero expectations since they are based on a random variable (the transformed data vector). The profile likelihood function does not have this property. The non-zero expectation of the score for the profile likelihood may lead to a bias in the estimation of θ. Shephard and Harvey (1990), Shephard (1993) and Kuo (1999) have investigated this in more detail. Consider the stochastic trend model yt = μt + εt with trend μt modelled by the random walk process μt+1 = μt + ηt and with signal-to-noise ratio q = var(ηt)/var(εt). In a Monte Carlo study, it can be shown that the estimation of q by maximizing the profile likelihood leads to many zero estimates whereas the true q value is strictly positive. Estimation based on the diffuse/marginal likelihood function reduces this bias substantially. Next, we consider two cases where the use of the marginal likelihood function is advocated specifically.

### 4.1. Non-stationary time-series models

The initial conditions of non-stationary components in a time-series model must depend on the vector β in (3). In cases where β ≠ 0 and X does not depend on θ, the diffuse and marginal likelihoods are proportional and provide the same maximum likelihood estimates of θ. In case X = X(θ), the marginal likelihood is advocated. Testing for unit roots in autoregressive models provides an illustration of a clear difference between profile and marginal likelihood functions. Consider the first-order autoregressive model with a constant as given by

(22)

for t = 1,…,T, where β is an unknown scalar. The specification of the initial condition (22) is coherent as the variance of u1 goes to infinity for ρ ↑ 1. The core of this problem is that the profile likelihood degenerates in the unit root. The marginal likelihood is well defined for −1 < ρ ≤ 1 whereas the profile likelihood is 0 when ρ = 1. Francke and de Vos (2007) show that unit root tests based on the marginal likelihood ratio outperform other well-known tests.

The marginal likelihood for model (22) for yt can be based on the difference transformation Δyt. It can be shown that the marginal likelihood does not depend on parameters μ and β as required. The state space formulation (1), for Δyt has state vector αt = (Δut, −ηt−1)′. The disturbance variances in (2) are Ht = 0, Qt = 1 and . Since Δut+1 = Δut − ηt−1 + ηt, we obtain the system matrices

and for the initial conditions in (3) we have b = 0, B = 0 and C = I2.

### 4.2. Multi-variate non-stationary time-series models

The generality of the state space framework allows different state space representations of the same time-series model. We will show that different state space formulations for the same model can lead to different values of the diffuse likelihood function whereas this is not the case for the marginal likelihood functions. A convenient illustration is given in the context of multi-variate time-series models. Consider a model with random walk trends from which some trends are possibly common to all series. The N × 1 vector of observations yt is modelled by

(23)

for t = 1,…,T, where μt is an r × 1 vector of independent random walks with r < N and γ is an N × 1 fixed unknown vector for which the first r elements are zero, γ = (0,…,0, γr+1,…,γN)′. The N × r matrix of factor loadings Λ contains unknown fixed elements which are collected in the parameter vector θ. The properties of disturbance vector ɛt are not relevant for this illustration but ɛt is assumed Gaussian and independent of ηs for t, s = 1,…,T.

A valid state space formulation (1) of model (23) can be based on the N × 1 state vector and with system matrices

(24)

where Λ1 consists of the first r rows of Λ and Λ2 collects the remaining N − r rows of Λ. Given the non-stationary process for μt, all initial values in αt at t = 1 are treated as unknown coefficients and collected in vector β of (3). The initial state condition for this time-series model is therefore given by (3) with b = 0, B = IN and C = 0. As a result, the matrix X in (5) depends on Λ and therefore X = X(θ). The marginal and diffuse likelihood functions are clearly different since the term log |X′X| then depends on θ.

The state space formulation (1) for model (23) can also be based on the N × 1 state vector αt = γ + Λμt and with system matrices Zt = IN, Tt = IN, Rt = Λ and Qt = Ir. The initial state conditions in (3) remain the same with b = 0, B = IN and C = 0. In this case, n × N matrix X = (IN,…, IN)′ in (5), with n = N · T, does not depend on θ and the marginal and diffuse likelihoods are proportional to each other.

It further follows that the marginal likelihood functions for both state space representations are the same or, at least, proportional to each other whereas the diffuse likelihood functions are different for both representations. The difference in the diffuse likelihood functions for the two model representations is because of the term log |X′Ω⁻¹X| as X is defined differently in the two representations. In the first representation, matrix X depends on θ, that is, X = X(θ). The difference in the marginal likelihood functions for the two model representations is because of the log determinantal terms log |X′Ω⁻¹X| − log |X′X|. However, the different X matrices in the two model representations do not influence the marginal likelihood because the dependence on X in the term log |X′Ω⁻¹X| is cancelled by the term log |X′X|. Hence, in our illustration, the marginal likelihood function is the same for both model representations.
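The cancellation argument can be checked numerically for two regressor matrices that span the same column space. The sketch below makes assumptions that go beyond the text: the second representation uses X2 = (IN,…,IN)′, the first is taken as X1 = X2 G with a hypothetical invertible N × N matrix G that stands in for the dependence on θ, and Ω is an arbitrary positive definite matrix held fixed across both representations. The function names are illustrative.

```python
import numpy as np

def determinant_terms(X, Omega):
    """Return (log|X' Omega^{-1} X|, log|X' Omega^{-1} X| - log|X'X|)."""
    S = X.T @ np.linalg.solve(Omega, X)
    ld_S = np.linalg.slogdet(S)[1]
    ld_XX = np.linalg.slogdet(X.T @ X)[1]
    return ld_S, ld_S - ld_XX

def compare_representations(G, n_periods=6, seed=0):
    """Differences in the diffuse-type and marginal-type determinant terms."""
    N = G.shape[0]
    X2 = np.kron(np.ones((n_periods, 1)), np.eye(N))   # X = (I_N, ..., I_N)'
    X1 = X2 @ G                                        # same column space, depends on G
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(N * n_periods, N * n_periods))
    Omega = W @ W.T + np.eye(N * n_periods)            # arbitrary positive definite matrix
    d1, m1 = determinant_terms(X1, Omega)
    d2, m2 = determinant_terms(X2, Omega)
    return d1 - d2, m1 - m2
```

For any invertible G, the first returned difference equals 2 log |det G| and therefore varies with the parameters entering G, while the second difference is zero up to rounding error. This mirrors the statement that the diffuse likelihood changes with the representation whereas the marginal likelihood does not.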

For a numerical illustration, we consider state space representation (24) for model (23) with N = 2 and r = 1. We simulate T = 100 observations from the bivariate common trend model (23) with γ = (0,1)′, Λ = (1, 0.1)′, var(εt) = I2 and var(ηt) = 0.25². Figure 1 presents the marginal and diffuse log-likelihoods as functions of ψ. The diffuse likelihood is clearly not proportional to the marginal likelihood whereas the maximum of the latter is in the neighbourhood of the true value ψ = 0.25. The diffuse and marginal log-likelihood functions for the second state space representation are proportional to the marginal log-likelihood.

## 5. Conclusion

We have argued for the preference of the marginal likelihood function over the profile and diffuse likelihood functions when we estimate parameters in time-series models that have non-stationary components and unknown regression effects. In many cases, the diffuse and marginal likelihood functions are proportional to each other. However, in cases where the implied data transformation for the diffuse likelihood function depends on parameters, estimation based on the diffuse likelihood function will lead to unreliable results. For these cases, we should consider the marginal likelihood as defined by Harville (1974); its computation in the case of state space models requires a minor modification of the diffuse Kalman filter.

## Acknowledgements

The authors thank J. J. F. Commandeur and B. Jungbacker for their comments on an earlier version.