On causal and non‐causal cointegrated vector autoregressive time series

Previous treatments of multivariate non‐causal time series have assumed stationarity. In this article, we consider integrated processes in a non‐causal setting. We generalize the Johansen–Granger representation for causal vector autoregressive (VAR) models to allow for dependence on future errors and discuss how the parameters can be estimated. The asymptotic distribution of the trace statistic is also considered. Some Monte Carlo simulations are presented.


INTRODUCTION
Vector autoregressive, VAR, models are one of the basic tools for analyzing macroeconomic time series. In such time series trends are often observed, and tools for handling such behavior are necessary. Cointegration has played an important role in this respect since it was introduced in the seminal paper by Engle and Granger (1987). One of the implications is that under some regularity conditions there exists a Johansen-Granger representation of a VAR model which means that it is possible to decompose the time series into a random walk part, a stationary part, a non-stochastic part and a part depending on an initial condition. Certain linear combinations of the components of the levels of the time series are stationary and these linear combinations are the cointegration vectors.
The regularity conditions needed for a Johansen-Granger representation ensure that the stationary part is causal, that is, the observations depend on only past and present errors of the VAR. However, this is not the only way to obtain stationarity. In the so-called non-causal and mixed causal non-causal models the observations also depend on future errors. Earlier contributions to the literature on non-causal models are Breidt et al. (1991), Andrews et al. (2006) and the monograph by Rosenblatt (2000), which all mainly considered univariate time series models. Recently, multivariate models have also attracted attention; see Lanne and Saikkonen (2013), Cubadda et al. (2019), Gouriéroux and Jasiak (2017), Davis and Song (2020) and Cavaliere et al. (2020). For multivariate nonstationary and non-causal time series the question arises whether there is a formulation such that the important property of cointegration implied by a Johansen-Granger representation is retained and analysis based on levels of the time series is permitted.
The models allowing for non-causality which we will consider can be seen as an extension of the stable causal VAR model, where the roots of the determinant of the autoregressive polynomial are all located outside the unit circle, {z : |z| = 1}. Requiring only that the determinant of the autoregressive polynomial is non-zero on the unit circle is sufficient to ensure the existence of stationary processes, causal or non-causal. If all the roots are inside the unit circle, the process is purely non-causal, and if there are roots both inside and outside the unit circle, the process is called mixed causal-non-causal.
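This classification can be made concrete numerically: the roots of det A(z) are the reciprocals of the non-zero eigenvalues of the companion matrix of the VAR. The following sketch (Python with NumPy; the function name and tolerance are our own illustrative choices, not part of any formal result) labels a VAR as causal, purely non-causal or mixed according to the location of these roots:

```python
import numpy as np

def classify_var(A_list, tol=1e-8):
    """Classify a VAR by the roots of det A(z), A(z) = I - A_1 z - ... - A_k z^k.

    The roots of det A(z) are the reciprocals of the non-zero eigenvalues
    of the companion matrix, so the eigenvalue moduli are inspected directly.
    """
    p = A_list[0].shape[0]
    k = len(A_list)
    top = np.hstack(A_list)
    if k > 1:
        bottom = np.hstack([np.eye(p * (k - 1)), np.zeros((p * (k - 1), p))])
        comp = np.vstack([top, bottom])
    else:
        comp = top
    mods = np.abs(np.linalg.eigvals(comp))
    if np.any(np.abs(mods - 1.0) < tol):
        return "root(s) on the unit circle (possibly integrated)"
    # eigenvalue modulus > 1  <=>  root of det A(z) inside the unit circle
    inside = bool(np.any(mods > 1.0))
    outside = bool(np.any((mods < 1.0) & (mods > tol)))
    if inside and outside:
        return "mixed causal-non-causal"
    return "purely non-causal" if inside else "causal"
```

For instance, a VAR(1) with coefficient matrix 0.5·I is causal, one with coefficient 2·I is purely non-causal, and a diagonal coefficient matrix with entries 0.5 and 2 gives a mixed process.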
The well-known integrated and cointegrated processes can be seen as another extension of the stable causal VAR models, obtained by also permitting one or more roots of the determinant of the characteristic polynomial exactly at 1, in addition to those located outside the closed unit disk.
We shall in this article address the situation where both extensions of the stable VAR models are possible. Then the determinant of the characteristic polynomial of the VAR model can have roots outside and inside the unit circle but also some roots exactly at z = 1. Previous studies of reduced rank VAR models allowing roots inside the unit circle have assumed causality in addition, which means that the process is explosive. Johansen (2009) and Nielsen (2010) studied situations where unstable roots are present. Engsted and Nielsen (2012) used such processes to model bubble behavior. Another example where explosive processes arise is in using bootstrap methods to find the appropriate reduced rank r. The usual procedure consists of fitting models of increasing rank. For some estimation methods it may happen that although the data generating model may have a characteristic polynomial where the determinant of the characteristic polynomial is non-zero inside the unit circle, some of the fitted models may not, see, for example, Cavaliere et al. (2012) and Swensen (2006). The alternative to introducing explosive processes is to allow for non-causality. This has been done for unit root testing by Saikkonen and Sandberg (2016). We will consider the multivariate situation.
In the present article we show that a Johansen-Granger representation also exists when some roots of the determinant of the characteristic polynomial of the VAR have modulus less than 1 and the only roots with modulus 1 are exactly at z = 1. Then the stationary part is no longer causal and may depend on future random shocks.
When autoregressive models of this type are fitted to an observed time series a natural question is how many roots of the determinant of the characteristic polynomial of the VAR are located inside the unit circle. This can be answered once the coefficients of the VAR-model have been estimated. To find out how that can be done is therefore important.
An example where such results can be employed is described in the article by Lanne and Saikkonen (2013), where a non-causal VAR model was introduced to analyze the expectation hypothesis of the term structure of interest rates. In the article a bivariate time series consisting of the 6-month and 5-year interest rates was considered, using the change in the 6-month rate and the spread between the 6-month and 5-year rates. Being able to employ the levels directly, rather than relying on a transformation to obtain stationarity, is an advantage. For example, one can investigate whether the transformation to stationarity using the difference and the spread is appropriate.
There are a couple of special features which are prominent in the full rank stationary case, which also have implications when the rank is reduced, and which are worth mentioning. One is how the dependence on future shocks shall be interpreted. Non-causal models take the uncertainty of future errors, not only present and past errors, into account. This can be a clear advantage in some situations. Gouriéroux and Zakoian (2017), section 3.3, stress that such models with fat-tailed error distributions can describe the locally explosive behavior often found in economic and financial time series. Taking non-causality into account can be useful also when fitting causal models. Lanne and Saikkonen (2013) pointed out that this is the case when only a subset of the variables which are causally generated is modeled. More specifically, building on a result of Johansen and Juselius (2014), they explained that a linear combination, Y_t, of such a subset may fail to have a one-sided (causal) representation. Another feature concerns identification: for a non-causal model to be identifiable, only non-Gaussian error distributions are permitted, and some additional restrictions must be imposed.

The article is organized as follows. In the next section we prove a Johansen-Granger representation for a non-causal VAR model. For the case where there are no deterministic terms, we discuss in Sections 3 and 4 how the unknown parameters can be estimated and find the asymptotic distribution of the trace test for determining the rank. Section 5 contains results from some Monte Carlo simulations.
Some additional results can be found in the online supporting information.

A JOHANSEN-GRANGER REPRESENTATION THEOREM
The vector autoregressive, VAR, model of dimension p and order k is defined by the recursion

X_t = A_1 X_{t−1} + · · · + A_k X_{t−k} + ε_t, (1)

where {ε_t} is a series of uncorrelated random variables with expectation zero and finite second-order moments. The traditional stability requirement, see Hannan and Deistler (1988) and Lütkepohl (2005), that the determinant of the autoregressive or characteristic polynomial, A(z) = I − A_1 z − · · · − A_k z^k, is non-zero on the closed unit disk, {z : |z| ≤ 1}, is sufficient to ensure that {X_t}_{t=−∞}^{∞} is stationary and can be expressed as a linear filter of the present and past values of the errors ε_t, that is,

X_t = Σ_{j=0}^{∞} C_j ε_{t−j}. (2)

VAR models satisfying the stability requirement have been extensively studied and are an important ingredient in applications in many fields such as empirical macroeconomics, engineering and climate research, just to mention a few. Important references in addition to the mentioned Hannan and Deistler (1988) and Lütkepohl (2005) are Hannan (1970), Brockwell and Davis (1991) and Hendry (1995).

For models allowing roots of the determinant of the characteristic polynomial on the unit circle the alternative vector equilibrium correction, VECM, formulation is useful. The recursion (1) can be reparameterized as

ΔX_t = ΠX_{t−1} + Γ_1 ΔX_{t−1} + · · · + Γ_{k−1} ΔX_{t−k+1} + ΦD_t + ε_t, (3)

where a deterministic term D_t has been added. When the rank r of the matrix Π is reduced, 0 ≤ r < p, det A(1) = 0, and it follows from the Johansen-Granger theorem, see Johansen (1995), that under some regularity conditions X_t, t = k + 1, k + 2, …, can be decomposed into a random walk, a stationary part, a non-stochastic part and a term depending on the initial values X_1, … , X_k.

A Johansen-Granger representation allowing for non-causality can be proved under assumptions which are quite similar to those used in the causal situation. In fact, the only change is that solutions of det A(z) = 0 which have moduli strictly less than 1 are permitted.

Assumption 1.
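The algebra behind the reparameterization from (1) to (3), with Π = −A(1) = A_1 + · · · + A_k − I and Γ_i = −(A_{i+1} + · · · + A_k), can be checked numerically. The sketch below (Python/NumPy; the coefficient matrices are chosen purely for illustration) simulates a VAR(2) and verifies that the VECM form reproduces the same errors:

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 2, 50
A1 = np.array([[0.6, 0.1], [0.0, 0.4]])
A2 = np.array([[0.1, 0.0], [0.05, 0.2]])

# VECM coefficients: Pi = -A(1) = A1 + A2 - I and Gamma_1 = -A2
Pi = A1 + A2 - np.eye(p)
G1 = -A2

# Simulate the VAR(2) recursion X_t = A1 X_{t-1} + A2 X_{t-2} + eps_t
eps = rng.standard_normal((T, p))
X = np.zeros((T, p))
for t in range(2, T):
    X[t] = A1 @ X[t - 1] + A2 @ X[t - 2] + eps[t]

# The VECM form dX_t = Pi X_{t-1} + Gamma_1 dX_{t-1} + eps_t recovers eps_t
for t in range(2, T):
    dX_t = X[t] - X[t - 1]
    dX_tm1 = X[t - 1] - X[t - 2]
    assert np.allclose(dX_t - Pi @ X[t - 1] - G1 @ dX_tm1, eps[t])
```

The identity holds exactly, term by term, for any choice of A_1 and A_2.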
The recursion (3) satisfies the following conditions: (i) the determinant of the characteristic polynomial has roots which are exactly at 1 or have moduli which are either strictly less than 1 or strictly larger than 1, that is, det A(z) = 0 implies z = 1 or |z| ≠ 1; (ii) the matrix Π = αβ′, where the p × r matrices α and β have full rank r with 0 ≤ r < p; (iii) the matrix α′⟂Γβ⟂, where Γ = I − Γ_1 − · · · − Γ_{k−1}, has full rank p − r.

The following notation is used. If an m × n matrix a, where n ≤ m, has full rank, a⟂ denotes an m × (m − n) matrix of full rank such that a′⟂a = 0. The matrix a(a′a)^{−1} is denoted ā, so that a′ā = I_n and āa′ is the projection matrix on the space spanned by the columns of a.
The following assumptions on the distribution of the errors and the behavior of the deterministic terms are also needed.
Assumption 2. The errors ε_t are i.i.d. random variables with expectation 0 and covariance matrix Ω.
Assumption 3. There exist constants a and b such that the deterministic term D_t is bounded in terms of a and b.

The idea is, following Hansen (2005), to express the model (3) in a companion form, for suitable pk × l, l = p(k − 1) + r, matrices α* and β*, as

ΔX*_t = α*β*′X*_{t−1} + ε*_t. (4)

Multiplying both sides of (4) with β*′ and rearranging yields

β*′X*_t = (I + β*′α*)β*′X*_{t−1} + β*′ε*_t. (5)

Under Assumption 1, as shown in Appendix A, there exist nonsingular matrices M, G_1 and G_2 such that

M^{−1}(I + β*′α*)M = G_1 ⊕ G_2 ⊕ 0,

where all the eigenvalues of G_1 have modulus less than 1, all the eigenvalues of G_2 have modulus larger than 1 and the lower right block is present only when the matrix I + β*′α* is singular.
In Appendix A and the supporting information one can find a proof of the following.
Proposition 1. Under Assumptions 1-3 and with the matrices M, G_1 and G_2 as described above, X_t can be represented as the sum

X_t = C Σ_{i=k+1}^{t} ε_i + Σ_{s=−∞}^{∞} C_s ε_{t−s} + a non-stochastic part + a term depending on the initial values X_1, … , X_k, (6)

where C = β⟂(α′⟂Γβ⟂)^{−1}α′⟂. Conversely, if I + β*′α* is non-singular, a process satisfying (6), where C_s = FMC*_s M^{−1}B, must satisfy the recursion defined in (3) for t = k + 1, ….
The determinant of the characteristic polynomial is a third order polynomial. We consider a simple situation where p = 3, k = 1, β = (1, 0, 1)′ and α = (a, a, a)′. Then

ΔX_t = αβ′X_{t−1} + ε_t. (7)

The determinant of the characteristic polynomial of (7) is det A(z) = (1 − z)²(1 − (1 + 2a)z), since in this case Γ = I_3 and β′α = 2a. Then the non-zero coefficients are C_s = (I − C)β̄β′(1 + 2a)^s when s ≥ 0 and |1 + 2a| < 1, and C_s = −(I − C)β̄β′(1 + 2a)^s when s < 0 and |1 + 2a| > 1. X_t is an I(1) process since ΔX_t is stationary, as one can see by direct subtraction. Also β′X_t = (1, 0, 1)X_t = X_{1,t} + X_{3,t} is stationary, since the random walk part disappears and only the linear filter with exponentially decaying coefficients remains. As one can see, according to whether the root 1∕(1 + 2a) of the determinant of the characteristic polynomial is outside or inside the unit circle, the stationary part is causal or purely non-causal. For more complex situations where there are roots both outside and inside the unit circle, the stationary part will be a combination of causal and non-causal processes.
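The role of the parameter a in this example is easy to verify numerically. The sketch below (Python/NumPy; the helper name is ours) forms I + αβ′, whose eigenvalues are 1, 1 and 1 + 2a, and returns the non-unit root 1∕(1 + 2a) of det A(z):

```python
import numpy as np

def nonunit_root(a):
    """Root of det A(z) = (1 - z)^2 (1 - (1 + 2a) z) different from 1,
    computed from the eigenvalues of I + alpha beta' with
    alpha = (a, a, a)' and beta = (1, 0, 1)'."""
    alpha = np.full((3, 1), float(a))
    beta = np.array([[1.0], [0.0], [1.0]])
    M = np.eye(3) + alpha @ beta.T           # AR matrix of the levels VAR(1)
    eig = np.linalg.eigvals(M)
    lam = eig[np.argmax(np.abs(eig - 1.0))]  # the eigenvalue 1 + 2a
    return (1.0 / lam).real

# a = -0.25 gives root 2 (outside the unit circle: causal stationary part);
# a = 0.5 gives root 0.5 (inside: purely non-causal stationary part).
```

Varying a thus moves the process continuously between the causal and the purely non-causal regime, with the double unit root untouched.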
Defining the Laurent series C(z) = Σ_{s=−∞}^{∞} C_s z^s, the following corollary is immediate. One can find a definition of Laurent series in Brockwell and Davis (1991, p. 88).
Corollary 1. Under Assumptions 1-3, X_t can be represented as in (6), where C(z) = Σ_{s=−∞}^{∞} C_s z^s is a Laurent series converging in an annulus 1 − δ < |z| < 1 + δ for some δ > 0.

Corollary 2. Consider the recursion (3) for t = k + 1, … , T.
Remark 1. Because Π = −A(1) = αβ′, the equation det A(z) = 0 has at least p − r roots at z = 1, but in combination with the third requirement in Assumption 1 it follows, by the argument in Corollary 4.3 in Johansen (1995), that exactly p − r roots are equal to 1.
Remark 2. By fixing T, see Corollary 2, a variety of solutions can be obtained using the method used to prove Proposition 1. In the following we only discuss the solution of the form (6) since considering the others will depend on specifying terminal conditions for X T+1 , … , X T+k .

A. RYGH SWENSEN

matrix H. Sufficient conditions that ensure that Assumption 4 is fulfilled can be found in Lanne and Saikkonen (2013) and Davis and Song (2020).

ESTIMATION OF THE PARAMETERS
Let X 1 , … , X T be T observations from the autoregressive model (3) satisfying Assumptions 1, 2 and 4. In the following we only consider the situation where Φ = 0 so Assumption 3 is automatically satisfied. From Proposition 1 it follows that the time series X t consists of a sum of a random walk component and a stationary process in addition to a term depending on the initial condition. The additional Assumption 4 ensures that X t is identified when the stationary part is causal or non-causal.
For the parameterization of a stationary non-causal VAR model, Lanne and Saikkonen (2013) used the formulation Π(B)Φ(B^{−1})X_t = ε_t, where B is the backward shift operator and Π and Φ are matrix polynomials. Both det Π(z) = 0 and det Φ(z) = 0 were required to have solutions strictly larger than 1 in absolute value. Davis and Song (2020) studied stationary non-causal time series with the conventional formulation (1), but allowing for solutions of det A(z) = 0 which have moduli strictly less than 1 in addition to moduli strictly larger than 1. We will follow their approach, which has the advantage of not needing to specify the orders of Π and Φ.

An Approximate Likelihood
Assume first that β is fixed and normalized by requiring c′β = I, where c is a known p × r matrix. Then the process β′X_t is stationary and the known results for that situation can be used. Thus, following Davis and Song (2020), we express the distribution of the variables (β*′X*_k, ΔX′_{k+1}, … , ΔX′_T)′ as a transformation of the errors ε_t, t = 0, ±1, ±2, …. It turns out that the influence of the errors before t = k and after t = T + 1 can be ignored asymptotically. Some additional notation is necessary for describing the transformation. From the Jordan canonical form theorem there exists a nonsingular matrix M_1 such that I + β*′α* = M_1 J M_1^{−1}, where J = J_1 ⊕ J_2, that is, J is block diagonal with diagonal blocks J_1 and J_2 consisting of the canonical blocks with eigenvalues strictly smaller and strictly larger than 1 in modulus respectively. The notation J(θ), J_1(θ) and J_2(θ) will be used when it is useful to emphasize that J, J_1 and J_2 are functions of the autoregressive parameters. Applying M_1^{−1} to the process β*′X*_t yields variables V_{1t} and V_{2t}, which can be expressed as linear filters with coefficients that are exponentially decaying. They are therefore bounded in probability. The following result is proved in Appendix C.
Proposition 2. Assume that Assumptions 1, 2 and 4 are satisfied. Then there exists a non-singular matrix K such that the variables (β*′X*_k, ΔX′_{k+1}, … , ΔX′_T)′ can be expressed as a transformation by K of the errors.

Denote the parameters by θ and assume that ε_t has density f. One can then, as in Davis and Song (2020), consider the pointwise approximate log likelihood

l_T(θ) = Σ_{t=k+1}^{T} log f(ε_t(θ); Σ, λ) + (T − k) log |det J_2(θ)|, (11)

where ε_t(θ) = ΔX_t − αβ′X_{t−1} − Γ_1 ΔX_{t−1} − · · · − Γ_{k−1} ΔX_{t−k+1} and the first k observations X_1, … , X_k are considered as given. The approximation consists of ignoring the term log det M_1 and the contributions from V_{1k} and V_{2T}. Asymptotically these terms will not contribute to a log likelihood with T − k terms since they are bounded in probability.
For fixed β, the general parameter space where the approximate likelihood is defined can be described by dividing θ into three parts: first, the coefficients of the autoregressive recursion, α, Γ_1, … , Γ_{k−1}, denoted by ϑ; second, the parameters describing the correlation of the errors, denoted by Σ, an element of the set of symmetric positive definite p × p matrices; third, the other parameters of the error distribution, denoted by λ, which are assumed to belong to an open d-dimensional set. The parameters ϑ form a subset of ℝ^{pr+(k−1)p²} satisfying the requirements specified in Assumption 1. These requirements define a union of open subsets, so the parameters belong to open sets.
The term det J_2(θ) in (11) is the product of the eigenvalues of I + β*′α* having moduli larger than 1, so for fixed values of the parameters it can be determined by a procedure computing the eigenvalues. Once the autoregressive parameters are known, the eigenvalues defining det J_2(θ) are the inverses of the solutions of det A(z) = 0 with moduli less than 1. The number of such solutions also determines the dimension of J_2(θ). The extra term (T − k) log |det J_2(θ)| in the likelihood when the process is non-causal is positive and increases with the number of solutions of det A(z) = 0 having moduli less than 1, and also with their distance from the unit circle. However, since det A(0) = 1, the eigenvalues in det J_2(θ) must be finite, since their inverse values are roots of det A(z), as pointed out by Hansen (2005) in the proof of his Lemma A.2.
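For a given value of the autoregressive parameters, the correction term can thus be computed directly from an eigenvalue decomposition. A minimal sketch (Python/NumPy; the function name and input convention are our own) computes log|det J_2| as the sum of the log-moduli of the eigenvalues of I + β*′α* lying outside the unit circle, which is zero in the causal case:

```python
import numpy as np

def log_abs_det_J2(bsas):
    """log|det J_2| for the likelihood correction term.

    `bsas` is the l x l matrix beta*' alpha*; det J_2 is the product of the
    eigenvalues of I + beta*' alpha* with modulus larger than 1."""
    eig = np.linalg.eigvals(np.eye(bsas.shape[0]) + bsas)
    big = np.abs(eig)[np.abs(eig) > 1.0]
    return float(np.sum(np.log(big)))  # 0.0 when the process is causal
```

With β*′α* = diag(−0.5, 1.0) the eigenvalues of I + β*′α* are 0.5 and 2, so the term equals log 2; when all eigenvalue moduli are below 1 it vanishes, in line with the discussion above.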
The behavior of J_2(θ) can be quite complicated when some of the eigenvalues are not simple. However, in the case where they are distinct, the matrices J_1 = J_1(θ) and J_2 = J_2(θ) are diagonal with diagonal elements equal to the eigenvalues. Furthermore, the eigenvalues are then continuous and differentiable as functions of the matrix entries, see, for example, Theorem 6.3.12 in Horn and Johnson (2013). Also the decomposition I + β*′α* = M_1 J M_1^{−1} simplifies, since J = J(θ) will be a diagonal matrix with the eigenvalues of I + β*′α* as diagonal entries and the eigenvectors as columns of M_1.

In Lanne and Saikkonen (2013) a similar approach to deriving the distribution of the observations and an approximate likelihood as in (10) and (11) can be found. However, due to their use of another parameterization of the process, the expression for the approximate likelihood is different.

A Two-Step Procedure
We first discuss how the maximum likelihood estimators of the elements of θ can be determined for fixed values of β, for a restricted parameter space which is confined to the situation where the eigenvalues of I + β*′α*, that is, the inverses of the roots of the determinant of the characteristic polynomial of the autoregressive process β*′X*_t of (5), are distinct. Multiplicity of the roots larger than 1 imposes constraints on the autoregressive coefficients. Hence the complement, which defines the region where the eigenvalues are distinct, that is, the restricted parameter space, is a union of open sets.
The distribution of {ε_t} is assumed to satisfy the conditions A.1-A.7 in Davis and Song (2020) or the corresponding conditions in Lanne and Saikkonen (2013). In particular, we focus on the situation where the distribution is elliptic, that is, f(x; Σ, λ) = (det Σ)^{−1∕2} f(x′Σ^{−1}x; λ), where λ is a d-dimensional parameter.

Let now the true parameter value be an interior point of the restricted parameter set. It follows, using the arguments in Davis and Song (2020), that the likelihood is differentiable. Therefore there exists a consistent sequence of roots of the approximate likelihood equation l̇_T(θ) = 0 as T → ∞. It can also be proved that these estimators are asymptotically normally distributed. For verifying that the location of the global maximum converges, an alternative argument is necessary, see, for example, van der Vaart (1998, p. 68). Notice that the maximization to determine the parameters ϑ can be done over the whole space ℝ^{pr+(k−1)p²}, since under the conditions we have imposed the probability of choosing values in the exceptional subsets will tend to zero. Also, these exceptional parts will be very small parts of the set where the likelihood is defined.
Next, consider the case where β is not known. We will consider the situation where a consistent estimator β̂ is plugged in for β, and investigate the asymptotic distribution of the resulting estimator θ̂(β̂). By writing out and multiplying the two relevant expansions, we see that a sufficient condition for θ̂(β̂) to have the same asymptotic distribution as θ̂(β) is that

√T(β̂ − β) = o_P(1). (12)

Although likely to be valid in many cases, a formal proof of (12) is needed for a verification of the plug-in procedure.
A candidate for estimating β satisfying √T(β̂ − β) = o_P(1) is based on the common method of solving a generalized eigenvalue problem, which amounts to using the maximum likelihood estimator from the situation where the stationary part is causal and the errors are Gaussian, in the notation of Johansen (1995).

Proposition 3. Under Assumptions 1 and 2 with Φ = 0, β can be estimated as the matrix β̂ of the r eigenvectors of

det S(λ) = det(λS_11 − S_10 S_00^{−1} S_01) = 0 (15)

corresponding to the r largest eigenvalues, normalized by β̂′S_11β̂ = I, and the normalized estimator β̃ = β̂(β̄′β̂)^{−1} satisfies √T(β̃ − β) = o_P(1).

Proof. By inspecting the proof of Lemma 13.1 in Johansen (1995) one can see that the essential element of the proof of the consistency of β̂ in the causal case is the decomposition of X_t into a random walk and a stationary part. From Proposition 1 the argument remains valid also in the non-causal situation. It follows from Lemma 6 (i) and (iii) in the supplementary material that S_11 = O_P(1) and from Lemma 7 (ii) that S_10 = O_P(1). By pre- and post-multiplying S(λ) with (β, β̄⟂∕√T)′ and its transpose and applying Lemma 7 (i) from the supporting information, the solutions of the equation det S(λ) = 0 converge to the solutions of a limiting determinantal equation, where W_u, 0 ≤ u ≤ 1, is a p-dimensional Brownian motion with Cov(W_u) = uΩ and Σ_00, Σ_0β = Σ′_β0 and Σ_ββ are the limits in probability of S_00, S_0β = S_01β and S_ββ = β′S_11β respectively. Now consider the space spanned by the r eigenvectors of (15) corresponding to the r largest solutions of det S(λ) = 0. Arguing as in Johansen (1995), it follows that this space converges to the space spanned by the first r unit vectors and hence that β̃ = β̂(β̄′β̂)^{−1} is a consistent estimator of β with √T(β̃ − β) = o_P(1); compare also his formula (13.3). A more thorough treatment of other possible estimators for β and their asymptotic distributions is outside the scope of the present article, but is an interesting question.
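For a VAR(1) in VECM form with no deterministic terms, the generalized eigenvalue problem (15) can be set up directly from the product moments of ΔX_t and X_{t−1}. The sketch below (Python/NumPy; the function name and the cointegrated data-generating process are ours, for illustration only) reduces (15) to a symmetric eigenvalue problem via a Cholesky factor of S_11 and returns eigenvectors normalized by β̂′S_11β̂ = I:

```python
import numpy as np

def rrr_eigen(X):
    """Solve det(lam * S11 - S10 S00^{-1} S01) = 0 for a VAR(1) in VECM form,
    with R0_t = dX_t and R1_t = X_{t-1} (k = 1, no deterministic terms).

    Returns the eigenvalues in descending order and eigenvectors V with
    V' S11 V = I; the first r columns of V estimate the cointegration space."""
    dX, X1 = np.diff(X, axis=0), X[:-1]
    T = dX.shape[0]
    S00, S01, S11 = dX.T @ dX / T, dX.T @ X1 / T, X1.T @ X1 / T
    L = np.linalg.cholesky(S11)
    Li = np.linalg.inv(L)
    # Symmetrized problem with the same eigenvalues as (15)
    M = Li @ S01.T @ np.linalg.inv(S00) @ S01 @ Li.T
    lam, W = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]
    return lam[order], Li.T @ W[:, order]

# Illustration: X2 - X1 is stationary, so beta is proportional to (1, -1)'
rng = np.random.default_rng(1)
w = np.cumsum(rng.standard_normal(2000))
X = np.column_stack([w, w + 0.5 * rng.standard_normal(2000)])
lam, V = rrr_eigen(X)
```

In this illustration λ̂_1 is well separated from λ̂_2, and the ratio of the entries of the first eigenvector is close to −1, recovering the cointegration vector (1, −1)′ up to scale.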

THE ASYMPTOTIC DISTRIBUTION OF THE TRACE TEST
Next, we address the question of how to determine the rank when Φ = 0. The model where the rank is r is denoted by H(r). The trace test of the hypothesis H(r) versus H(p) is based on the statistic Q_r = −(T − k) Σ_{j=r+1}^{p} log(1 − λ̂_j), where 1 > λ̂_1 > · · · > λ̂_p are the ordered solutions of det S(λ) = 0 with S(λ) = λS_11 − S_10 S_00^{−1} S_01. The hypothesis is rejected for large values of Q_r. In the causal situation the trace test is the likelihood ratio test for testing H(r) versus H(p) when the errors are Gaussian. The argument for the asymptotic distribution in this case depends on a representation of the same type as in Proposition 1, but with Σ_{i=−∞}^{−1} C_i z^i = 0, so that the stationary part does not involve future errors.
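Given the ordered eigenvalues, the trace statistic is a one-liner. The helper below (our own illustrative code, Python/NumPy) implements Q_r = −(T − k) Σ_{j=r+1}^{p} log(1 − λ̂_j):

```python
import numpy as np

def trace_statistic(lam, T, k, r):
    """Q_r = -(T - k) * sum_{j=r+1}^{p} log(1 - lam_j) for eigenvalues lam,
    taken in decreasing order (the solutions of det S(lam) = 0)."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]
    return float(-(T - k) * np.sum(np.log1p(-lam[r:])))
```

For example, with λ̂ = (0.5, 0.1), T = 101 and k = 1, Q_1 = −100 log 0.9 and Q_0 = −100(log 0.5 + log 0.9); testing proceeds sequentially from r = 0 upwards until the first non-rejection.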
Without the causality assumption the asymptotic distribution is more complicated, but can be found by elaborating on the result of Theorem 11.1 in Johansen (1995). Consider the decomposition of det[(β, β̄⟂)′S(λ)(β, β̄⟂)]. If λ = λ̃∕T and T → ∞ for λ̃ fixed, then β′S(λ)β = −Σ_β0Σ_00^{−1}Σ_0β + o_P(1), so the probability limit of det[β′S(λ)β] is different from zero for all λ̃. Following Johansen (1995), the remaining factor must then be analyzed. In the causal case β′N_0 = 0, which simplifies the derivation of the asymptotic distribution. More generally, the asymptotic distributions of S_11 and S_10 are given in Lemma 7 in the supporting information. Recall that W_u, 0 ≤ u ≤ 1, is a p-dimensional Brownian motion with Cov(W_u) = uΩ. Then the relevant normalized product moments converge weakly, where N_1 and N_2 are random matrices distributed as the corresponding asymptotic limits and N′_3 and N_4 are limits in probability. Expressions for N_1, N_2 and N_3 can be found in Lemma 6 in the supporting information.
Proposition 4. Assume Φ = 0 and that Assumptions 1 and 2 are satisfied. Denoting β̄′⟂C ∫_0^1 W_u W′_u du C′β̄⟂ by U_1 and the limit of β̄′⟂S_10 by U_2, the asymptotic distribution of the trace statistic is given by the sum of the solutions of a determinantal equation in U_1 and U_2.

Remark 5. The value of this result is limited by the fact that the asymptotic distribution depends on unknown parameters. A parametric bootstrap is one possibility for dealing with this problem. A reasonable way to proceed may be first to estimate the parameters as described in Section 3 and compute the centered residuals ε̂_t, t = k + 1, … , T. The bootstrap observations can then be generated using the representation described in Proposition 1 after resampling the centered residuals. Since there may be roots of the determinant of the estimated characteristic polynomial inside and/or outside the unit circle, the recursions in Corollary 2 must be used to generate the stationary part. This parallels the procedure used to generate the Monte Carlo simulations in Section 5, where the point is discussed in more detail.

NUMERICAL RESULTS
To get an impression of the finite sample properties, consistency and asymptotic normality of the estimators, we present some Monte Carlo simulations. Additional results can be found in the supporting information. The time series were generated from model (3) using the representation from Proposition 1. The errors were p-variate elliptical t-distributed with expectation zero, ν degrees of freedom and parameter Σ = {σ_ij}, where Σ is a positive definite matrix with square root Σ^{1∕2}. The density is

f(x) = Γ((ν + p)∕2) ∕ [Γ(ν∕2)(νπ)^{p∕2}(det Σ)^{1∕2}] (1 + x′Σ^{−1}x∕ν)^{−(ν+p)∕2}.

This is the density of a random variable X which can be represented as X = Z^{−1∕2}(νΣ)^{1∕2}Y, where Y and Z are independent, the p-dimensional variable Y is multivariate normal with expectation 0 and covariance matrix I_p, and Z is χ²-distributed with ν degrees of freedom, see, for example, Muirhead (1982). The simulations were from a VAR(1) model of the form (3) with dimension p = 3 and rank r = 2. For non-Gaussian errors the stationary distribution does not depend only on the first two moments of the error distribution. Therefore the recursions described in Corollary 2, with a burn-in period of 20 for each of the initial distributions of the forward and backward recursions, were used to simulate the stationary part of the process.

Table I illustrates the consistency of the estimator of β, based on 1000 replications, as the length of the simulated series increases, solving the generalized eigenvalue problem discussed in Section 3.2. As one can see, the distributions get more concentrated about the true values (0, 0)′, with the exception of β_32 when ν = 20 and T passes from 500 to 1000. The value is not significantly different from 0, however.
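The stochastic representation X = Z^{−1∕2}(νΣ)^{1∕2}Y gives a direct way to draw such errors. The sketch below (Python/NumPy; the function name is ours) implements it; note that the implied covariance matrix is νΣ∕(ν − 2) rather than Σ itself, which can serve as a sanity check:

```python
import numpy as np

def elliptical_t(rng, n, nu, Sigma):
    """Draw n p-variate elliptical t errors via X = Z^{-1/2} (nu*Sigma)^{1/2} Y,
    with Y ~ N(0, I_p) and Z ~ chi^2_nu independent (so Cov X = nu*Sigma/(nu-2))."""
    root = np.linalg.cholesky(nu * Sigma)   # one square root of nu * Sigma
    Y = rng.standard_normal((n, Sigma.shape[0]))
    Z = rng.chisquare(nu, size=n)
    return (Y @ root.T) / np.sqrt(Z)[:, None]

rng = np.random.default_rng(2)
E = elliptical_t(rng, 200_000, 5, np.eye(2))
```

With ν = 5 and Σ = I_2, the sample covariance of E should be close to (5∕3)·I_2.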
For maximizing the likelihood for fixed β̂, as explained in Section 3.2, the default Nelder-Mead option of the R function optim, R Core Team (2016), was used, supplemented by employing the BFGS option with numerical calculation of the gradients. The iterations were started at the values used for the simulations.
In Table II the simulations for estimating the parameters ϑ are summarized. For the case T = 100 only the Nelder-Mead step was possible. The QQ-plots and histograms in Figures 1 and 2 show in more detail the convergence of the estimates of the parameter α_11. The rather slow convergence toward normality may be related to problems in locating the maximum value of the likelihood, which are more pronounced for small values of T. For T = 100 numerical derivatives of the likelihood were not always returned.

CONCLUSION
A main result of this article is a representation theorem for a p-dimensional autoregressive time series X_t where the roots of the determinant of the characteristic polynomial can be outside or inside the unit circle or equal to 1.