Penalisation Methods in Fitting High-Dimensional Cointegrated Vector Autoregressive Models: A Review

Summary Cointegration has proven useful for modelling non-stationary data with long-run equilibrium relationships among variables, with applications in many fields such as econometrics, climate research and biology. However, the analysis of vector autoregressive models is becoming more difficult as data sets of higher dimensions become available, in particular because the number of parameters is quadratic in the number of variables. This leads to a lack of statistical robustness, and regularisation methods are paramount for obtaining valid estimates. In the last decade, many papers have appeared suggesting different penalisation approaches to the inference problem. Here, we provide a comprehensive review of penalisation methods suggested in the literature that are adapted to the specific structure of cointegrated vector autoregressive models, with relevant references to software packages. The methods are evaluated and compared according to a range of error measures in a simulation study, considering combinations of low and high dimension of the system and small and large sample sizes.


Introduction
High-dimensional data sets containing large collections of interrelated time series are becoming more and more common as measurement, recording and data collection techniques advance. A popular statistical model for multivariate time series is the vector autoregressive (VAR) process. Statistical inference for stationary VAR processes has traditionally been based on standard linear regression models (Lütkepohl, 2005). Many applications have employed nonstationary cointegrated VAR models, which means that some of the time series move together in the long run although they have stochastic trends. In this case the standard linear regression models may lead to spurious regression (Granger & Newbold, 1974), and cointegration analysis is the appropriate method of statistical inference. Cointegration analysis is therefore a popular method in econometrics and is also used in other fields, such as biology (Dahlhaus et al., 2018; Østergaard et al., 2017), agriculture (Zapata & Gil, 1999) and climatology (Schmith et al., 2012; Friedrich et al., 2023). Recently, attempts to fit cointegrated VAR models in high-dimensional settings have started appearing, such as for big data in macroeconomics and finance (Wilms & Croux, 2016; Smeekes & Wijler, 2020; Chen & Schienle, 2022), EEG data (Levakova et al., 2022), phase synchronisation (Østergaard et al., 2022) and structural health monitoring problems (Cross et al., 2011; Mousavi & Gandomi, 2021).
While upscaling a linear regression model for a high-dimensional stationary VAR process already has its challenges, high dimensions of cointegrated VAR processes induce specific issues for model fitting. The most challenging aspects are the determination of the cointegration rank and robust estimation of the matrix of coefficients with the specified rank. The asymptotic distributions of cointegration tests when the dimension and the sample size grow proportionally have been studied recently in Onatski & Wang (2018, 2019). If the rank is misspecified, the variance is inflated for overestimated ranks, whereas a bias is introduced for underestimated ranks, often accompanied by a lower variance (Holberg & Ditlevsen, 2022). Even in settings with a fixed number of dimensions mildly exceeding two, existing techniques for fitting cointegrated models often fail to provide accurate, testable and computationally tractable estimates (Ho & Sørensen, 1996). A modification of the popular Johansen procedure has recently been proposed in Bykhovskaya & Gorin (2022), and alternative approaches such as factor models have also been used (Smeekes & Wijler, 2020).
Regularisation methods are a popular way to deal with high-dimensional estimation problems, where the number of estimated parameters is close to, or even higher than, the number of observations, and standard estimators suffer from extreme variance or are not defined. Regularisation methods typically place a penalty on some of the estimated parameters, which are then estimated with a tolerable bias and lower variance. The appropriate use of penalisation methods when some of the variables are nonstationary has been attracting increasing attention, especially in a single-equation framework (Koo et al., 2020; Lee et al., 2022; Masini & Medeiros, 2021; Smeekes & Wijler, 2021).
The goal of this article is to describe statistical inference and penalisation methods for the analysis of cointegrated multidimensional data, and to provide tools for choosing an adequate method for a given problem. The choice of method depends on the purpose of the analysis. If the purpose is to infer an interaction graph, it is important to identify zero components in the cointegration matrix; if the purpose is prediction, minimisation of squared errors might be appropriate; and if correct identification of the cointegration relations is of interest, yet other methods might be the relevant ones to implement. Furthermore, some methods might outperform others in high-dimensional settings, where the number of observations is small compared with the number of parameters, even if they are not competitive in low dimensions. In this review, we consider all of the above-mentioned objectives. A thorough discussion of how to deal with high-dimensional cointegrated data through factor models, with a specific focus on forecasting, can be found in Smeekes & Wijler (2020).
We give an overview of the penalisations used in fitting high-dimensional cointegrated models: direct penalisation of the coefficients describing interactions between the univariate processes, ways to regularise the cointegration relationships or the adjusting mechanisms, and ways to shrink the cointegration rank. The main focus of the paper is on the implementation of the methods. In the final section, we present a simulation study in which we compare and evaluate all methods with a list of different error measures, and for different scenarios of sample size and system dimension, which highlights the advantages and disadvantages of each method.
The paper is structured as follows. First, we provide a brief review of cointegrated vector autoregressive models and their properties in Section 2. In Section 3, we briefly summarise penalisation methods, namely the lasso, ridge regression, the elastic net and the nuclear norm penalty. In Section 4, we review the many algorithms with different types of penalisation proposed in the literature for high-dimensional cointegration models, starting from the un-penalised baselines, namely ordinary least squares (OLS) and the Johansen procedure. We also include the two main tests to determine the rank of the system, the trace test and the maximum eigenvalue test, as well as the rank selection criterion and the method of Pedroni et al. (2015), which are more appropriate for large dimensions. In Section 5, we conduct a simulation study, evaluating the methods through different error measures. Finally, in Section 6, we discuss the results and provide some recommendations.

Notation
A p-dimensional column vector of zeros is denoted by $0_p$, a p-dimensional column vector of ones by $1_p$, and $I_p$ is the $p \times p$ identity matrix. The trace of a generic matrix $A$ is $\mathrm{tr}(A)$, the transpose of $A$ is $A'$, and the generalised inverse of $A$ is $A^-$. If $A \in \mathbb{R}^{p \times r}$ with $r \le p$ is a full-rank matrix, a matrix $A_\perp \in \mathbb{R}^{p \times (p-r)}$ denotes the orthogonal complement, that is, $A' A_\perp = 0_{r \times (p-r)}$. The $\ell_1$ norm of a matrix $A$ is $\|A\|_1 = \sum_i \sum_j |A_{ij}|$. A diagonal matrix with the elements of a vector $v$ on the diagonal is denoted by $\mathrm{diag}(v)$. Similarly, we sometimes write $\mathrm{diag}(A)$ where the argument $A$ is a matrix, to denote a diagonal matrix with the same diagonal elements as $A$. Finally, a hat marks an estimate, that is, $\hat\mu$ is an estimate of the parameter $\mu$.

Cointegration
Assume a sample of a p-dimensional stochastic process $\{x_n\}_{n=0}^N$ observed at equidistant time points $t_0, \ldots, t_N$. The series $\{x_n\}_{n=0}^N$ is said to be a realisation of a vector autoregressive (VAR) process of order k if

$x_n = \mu + A_1 x_{n-1} + \cdots + A_k x_{n-k} + \varepsilon_n, \qquad (1)$

where $A_1, \ldots, A_k \in \mathbb{R}^{p \times p}$, $\mu \in \mathbb{R}^p$ and $\{\varepsilon_n\}_{n=0}^N$ are p-dimensional i.i.d. errors with zero mean and constant variance $\mathrm{Var}(\varepsilon_n) = \Sigma$, where $\Sigma \in \mathbb{R}^{p \times p}$ is a positive semidefinite matrix. For the methods relying on the likelihood approach, we further assume $\varepsilon_n \sim N_p(0_p, \Sigma)$; however, some of the estimation methods presented in this review do not require the normality assumption. For convenience, we use the vector error-correction (VEC) form of the model,

$\Delta x_n = \mu + \Pi x_{n-1} + \Gamma_1 \Delta x_{n-1} + \cdots + \Gamma_{k-1} \Delta x_{n-k+1} + \varepsilon_n, \qquad (2)$

where $\Pi = A_1 + \cdots + A_k - I_p$ and $\Gamma_i = -(A_{i+1} + \cdots + A_k)$. Cointegration (Engle & Granger, 1987; Johansen, 1995) can only arise when $\{x_n\}_{n=0}^N$ is nonstationary with a stochastic trend, that is, a realisation of an integrated process. A process whose first differences $\Delta x_n$ are asymptotically stationary (i.e. the first and second moments tend to constants) is termed an integrated process of order 1, or an I(1) process. Similarly, a process whose first differences $\Delta x_n$ are not asymptotically stationary, but whose second differences $\Delta^2 x_n = \Delta x_n - \Delta x_{n-1}$ are, is an I(2) process, and so on. The notion of an integrated process can also be established through the characteristic polynomial $\big|(1-z)I_p - \Pi z - (1-z)\sum_{i=1}^{k-1}\Gamma_i z^i\big|$. The process $\{x_n\}_{n=0}^N$ is integrated if the characteristic polynomial has at least one root $z = 1$, while all the other roots lie outside the unit circle. A process can be cointegrated if it is integrated of order 1 or more, but we focus only on I(1) cointegrated processes in this review, because this is the most common scenario in applications.
The rank of $\Pi$, $r = \mathrm{rank}(\Pi)$, is termed the cointegration rank, because it has fundamental implications for the properties of the VAR process. We can distinguish three qualitatively different forms of the process $\{x_n\}_{n=0}^N$, depending on the value of the cointegration rank:

r = p: $\{x_n\}_{n=0}^N$ is asymptotically stationary and contains neither a stochastic trend nor a linear trend (Lütkepohl, 2005).

r = 0: $\Pi = 0_{p \times p}$, the process is driven by p independent stochastic trends, and no linear combination of the components of $\{x_n\}_{n=0}^N$ is stationary.

0 < r < p: $\{x_n\}_{n=0}^N$ is driven by exactly $p - r$ independent stochastic trends. Moreover, there exist r linear combinations of the vector process $\{x_n\}_{n=0}^N$, termed cointegration relationships, that yield an asymptotically stationary one-dimensional process. Linear trends due to the drift $\mu$ are possible too.
The third case is the one investigated in cointegration analysis, and it is the case we consider from now on. The matrix $\Pi$ describes the long-run behaviour of the process and represents the main challenge in fitting the model because of the restriction on its rank, while the matrices $\Gamma_1, \ldots, \Gamma_{k-1}$ describing the short-run effects are of arbitrary rank. We therefore assume for simplicity only VAR processes of order 1, that is,

$\Delta x_n = \mu + \Pi x_{n-1} + \varepsilon_n. \qquad (3)$

If $0 < r < p$, we can find matrices $\alpha, \beta \in \mathbb{R}^{p \times r}$ of full rank r such that $\Pi = \alpha\beta'$. The matrices $\alpha$ and $\beta$ are not unique: for an arbitrary invertible matrix $Q \in \mathbb{R}^{r \times r}$, another decomposition is $\Pi = (\alpha Q)\big(\beta (Q')^{-1}\big)'$. However, the subspaces spanned by the columns of $\alpha$ and $\beta$, $\mathrm{sp}(\alpha)$ and $\mathrm{sp}(\beta)$, are unique (Kessler & Rahbek, 2004). The matrices $\alpha$ and $\beta$ play a special role in the interpretation of the VEC model.
The matrix β is called the cointegration matrix, because the columns of β contain cointegration vectors.A cointegration vector v ∈ ℝ p is a vector such that v 0 x n is an asymptotically stationary univariate process.Cointegration vectors form a linear subspace, which means that any linear combination of 2 distinct cointegration vectors is also a cointegration vector.The columns of β represent one possible basis of the cointegration subspace.
The matrix $\alpha$ is called the loading matrix. It describes the correcting mechanism that ensures that $\beta' x_n$ is always pushed towards the long-run mean $(\beta' x)_\infty := \lim_{n\to\infty} \mathrm{E}(\beta' x_n)$. This can be seen by rewriting the VEC model as

$\Delta x_n = \alpha\big(\beta' x_{n-1} - (\beta' x)_\infty\big) + \varepsilon_n,$

absorbing the drift $\mu$ into the equilibrium mean (which is possible when $\mu$ lies in the column space of $\alpha$, i.e. when there is no linear trend). The matrix $\alpha$ thus reveals the rate at which the system reacts to deviations of the cointegration relationships $\beta' x_{n-1}$ from the asymptotic mean $(\beta' x)_\infty$. The natural interpretation of $\Pi$ is that the elements $\Pi_{ij}$ describe interactions between the univariate processes $\{x_{in}\}_{n=0}^N$ and $\{x_{jn}\}_{n=0}^N$, $i, j \in \{1, \ldots, p\}$. In a more concise matrix form, we have

$\Delta X = 1_N \mu' + X \Pi' + E, \qquad (4)$

where $\Delta X \in \mathbb{R}^{N \times p}$ has rows $\Delta x_n'$, $X \in \mathbb{R}^{N \times p}$ has rows $x_{n-1}'$, $n = 1, \ldots, N$, and $E$ stacks the errors $\varepsilon_n'$. To rule out the possibility of an I(k) cointegrated process of order $k \ge 2$, an additional condition $|\alpha_\perp' \beta_\perp| \neq 0$ must hold according to the Granger representation theorem (Engle & Granger, 1987). The conditions securing that $\{x_n\}_{n=0}^N$ is I(1) and cointegrated can be formulated alternatively using the Jordan normal form $\Pi = P J P^{-1}$, where $J$ is the associated block-diagonal Jordan matrix $J = \mathrm{diag}(J_{p_1}(\lambda_1), \ldots, J_{p_m}(\lambda_m))$. The Jordan blocks $J_{p_i}(\lambda_i)$ are of dimension $p_i$, with $\sum_{i=1}^m p_i = p$ and $\lambda_i$ on the diagonal. The $J_{p_i}(0)$ blocks give rise to stochastic trends. The matrix $P$ is a nonsingular $p \times p$ matrix. The condition $|\alpha_\perp' \beta_\perp| \neq 0$ holds if and only if there is no $J_{p_i}(0)$ block with $p_i \ge 2$ (Archontakis, 1999). Note that $J$ is not diagonal and $\Pi$ is not diagonalisable in general: blocks $J_{p_i}(\lambda_i)$ with $p_i \ge 2$ and $|\lambda_i + 1| < 1$ have unit elements above the diagonal.
For estimation purposes, factorisations other than $\Pi = \alpha\beta'$ and $\Pi = P J P^{-1}$ are sometimes useful. One of them is the singular value decomposition (SVD), $\Pi = U D V'$, where $U, V \in \mathbb{R}^{p \times r}$ are orthonormal matrices of left and right singular vectors, respectively, and $D = \mathrm{diag}(d_1, \ldots, d_r)$ is a diagonal matrix of singular values. One use of the SVD is to find a possible decomposition of $\Pi$ into $\alpha$ and $\beta$: we can choose either $\alpha = U$, $\beta = V D$, or $\alpha^* = U D$, $\beta^* = V$. Another useful factorisation is the QR decomposition $\Pi' = Q R$, where $Q \in \mathbb{R}^{p \times p}$ is an orthogonal matrix and $R \in \mathbb{R}^{p \times p}$ is upper triangular.

High Dimension and Penalisation
When estimating parameters of a cointegrated VAR process, the 'gold standard' is the Johansen procedure (Johansen, 1995), which is a special case of maximum likelihood estimation for the VEC model (2). However, this estimation method may not be optimal when (1) the dimension p is high, (2) the number of observations N is low compared with p, or (3) a substantial portion of the elements of $\Pi$ are zeros and we wish to identify them. Then it can be beneficial to impose restrictions on $\Pi$ that encourage a simpler form.
The VEC model already imposes one restriction on $\Pi$, namely $\mathrm{rank}(\Pi) = r$. In reality, the rank is unknown and must be estimated as well. In the Johansen procedure, this is carried out by sequential likelihood-ratio testing until the cointegration rank is identified. A particular problem with a high dimension p is that the estimation of the rank is technically difficult and tends to suffer from substantial bias (Østergaard et al., 2022). Estimates of the cointegration relationships are biased for underestimated rank, whereas the variance is increased for overestimated rank (Holberg & Ditlevsen, 2022). Therefore, it is worth exploring alternative strategies to identify r.
This paper presents a range of estimation approaches that impose some form of restriction on $\Pi$, on r, or on both. Before presenting the methods one by one, we review some basic forms of penalisation in uni- and multivariate regression problems that will appear repeatedly throughout the paper (see Hastie et al., 2019, for further details). First, consider a univariate regression problem of the form

$y = Zb + \varepsilon, \qquad (5)$

where $y \in \mathbb{R}^N$ is the response vector, $Z \in \mathbb{R}^{N \times q}$ is the design matrix, $b \in \mathbb{R}^q$ is the vector of unknown parameters and $\varepsilon \in \mathbb{R}^N$ is a vector of i.i.d. errors.

Lasso
The lasso estimator of b is defined as

$\hat b = \arg\min_b \left\{ \tfrac12 \|y - Zb\|_2^2 + \lambda \|b\|_1 \right\}, \qquad (6)$

where $\lambda \ge 0$ is a tuning parameter. A well-known property of the lasso estimator is that the $\ell_1$ penalty $\lambda\|b\|_1$ tends to shrink some of the elements of b to exact zeros and thus performs variable selection. Specifically, $\hat b = 0_q$ if $\lambda \ge \lambda_{\max} = \max_j |\langle y, Z_{\bullet j}\rangle|$ (Hastie et al., 2019). Several algorithms exist to solve (6) numerically. Probably the most convenient one is cyclic coordinate descent, which relies on the soft-thresholding operator with threshold $\lambda > 0$,

$S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+. \qquad (7)$

Cyclic coordinate descent cycles repeatedly over the predictors $Z_{\bullet j}$ and updates the respective regression coefficients as

$\hat b_j \leftarrow \frac{S_\lambda\big(\langle Z_{\bullet j},\, y - \sum_{k\neq j} Z_{\bullet k}\hat b_k\rangle\big)}{\|Z_{\bullet j}\|_2^2}. \qquad (8)$

The tuning parameter $\lambda$ can be chosen by estimating b for a range of values of $\lambda$, calculating some measure of fit for each of them, for example, a cross-validated error, the Akaike information criterion (AIC, Akaike, 1974), the Bayes information criterion (BIC, Schwarz, 1978) or the Hannan-Quinn criterion (HQ, Hannan & Quinn, 1975), and picking the value of $\lambda$ providing the best fit.
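To make the update (8) concrete, here is a minimal R sketch of cyclic coordinate descent for (6); the function names are ours, and no convergence check is included.

```r
# Soft-thresholding operator (7): S_lambda(x) = sign(x) * (|x| - lambda)_+
soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# Cyclic coordinate descent for the lasso problem (6)
lasso_cd <- function(y, Z, lambda, n_sweeps = 100) {
  b <- rep(0, ncol(Z))
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_along(b)) {
      r_j  <- y - Z[, -j, drop = FALSE] %*% b[-j]                     # partial residual
      b[j] <- soft(crossprod(Z[, j], r_j), lambda) / sum(Z[, j]^2)    # update (8)
    }
  }
  b
}
```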

Ridge regression
Another popular penalty is the ridge penalty (also termed Tikhonov regularisation). The ridge estimator is defined as

$\hat b = \arg\min_b \left\{ \|y - Zb\|_2^2 + \lambda \|b\|_2^2 \right\}. \qquad (9)$

The minimisation problem is easily solved by writing the expression within the brackets in (9) as a single sum of squares $\|y^* - Z^* b\|_2^2$, where $y^* = (y', 0_q')'$ and $Z^* = \big(Z', \sqrt{\lambda}\, I_q\big)'$ are augmented data matrices. This yields the estimator

$\hat b = (Z'Z + \lambda I_q)^{-1} Z' y. \qquad (10)$

The ridge penalty also shrinks b towards $0_q$, but instead of shrinking a few coefficients to exact zeros while leaving the others largely unaffected, as the lasso does, it shrinks all coefficients towards zero more evenly and does not produce exact zeros. It is often used in situations with multicollinearity, where it reduces the variance of the estimator at the price of introducing a tolerable bias.
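The augmented-data trick is easy to verify numerically. The following R snippet (all names ours) checks that OLS on the augmented data reproduces the closed form (10):

```r
# Numerical check: OLS on augmented data equals the closed-form ridge estimator (10)
set.seed(1)
N <- 100; q <- 5; lambda <- 2
Z <- matrix(rnorm(N * q), N, q)
y <- Z %*% rnorm(q) + rnorm(N)

b_ridge <- solve(crossprod(Z) + lambda * diag(q), crossprod(Z, y))

y_star <- c(y, rep(0, q))                    # y* = (y', 0_q')'
Z_star <- rbind(Z, sqrt(lambda) * diag(q))   # Z* = (Z', sqrt(lambda) I_q)'
b_aug  <- solve(crossprod(Z_star), crossprod(Z_star, y_star))

max(abs(b_ridge - b_aug))                    # numerically zero
```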

Elastic net
The elastic net (Zou & Hastie, 2005) combines the lasso and the ridge penalty as follows:

$\hat b = \arg\min_b \left\{ \tfrac12\|y - Zb\|_2^2 + \lambda\big(\omega\|b\|_1 + \tfrac{1-\omega}{2}\|b\|_2^2\big) \right\}, \qquad (11)$

where $\omega \in [0, 1]$ controls the mixture of the two penalties. The aim is to get the advantages of both lasso and ridge regression, that is, to achieve more even shrinkage of the different elements of b, while still allowing some of them to be exact zeros. Clearly, lasso and ridge regression are special cases of the elastic net, for $\omega = 1$ and $\omega = 0$, respectively. The estimate $\hat b$ can be calculated by the cyclic coordinate descent algorithm with the update step

$\hat b_j \leftarrow \frac{S_{\lambda\omega}\big(\langle Z_{\bullet j},\, y - \sum_{k\neq j} Z_{\bullet k}\hat b_k\rangle\big)}{\|Z_{\bullet j}\|_2^2 + \lambda(1-\omega)}. \qquad (12)$

Now, consider a multivariate regression

$Y = ZB + \varepsilon, \qquad (13)$

where $Y \in \mathbb{R}^{N \times p}$ is the response matrix, $Z \in \mathbb{R}^{N \times q}$ is the design matrix, $B \in \mathbb{R}^{q \times p}$ is the matrix of unknown parameters and $\varepsilon \in \mathbb{R}^{N \times p}$ is a matrix of i.i.d. errors. The multivariate elastic net estimator of B,

$\hat B = \arg\min_B \left\{ \tfrac12\|Y - ZB\|_F^2 + \lambda\big(\omega\|B\|_1 + \tfrac{1-\omega}{2}\|B\|_F^2\big) \right\}, \qquad (14)$

amounts to p independent univariate elastic net problems with response vector $Y_{\bullet j}$ and coefficient vector $B_{\bullet j}$, $j \in \{1, \ldots, p\}$. The general regression (13) relates to the VEC regression (4) by setting $Y = \Delta X$, $Z = X$ and $B = \Pi'$. Note, however, that $\Pi$ has restricted rank, which is not addressed in the elastic net.
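As an illustration, the multivariate problem (14) can be fitted column by column with the glmnet package, whose alpha argument plays the role of $\omega$. Note that glmnet scales the squared-error loss by $1/(2N)$, so its lambda argument corresponds to $\lambda/N$ in (14). A rough sketch:

```r
library(glmnet)

# Multivariate elastic net (14) via p univariate fits; Y, Z as in (13)
elastic_net_B <- function(Y, Z, lambda, omega) {
  N <- nrow(Y)
  B <- sapply(seq_len(ncol(Y)), function(j) {
    fit <- glmnet(Z, Y[, j], alpha = omega,
                  lambda = lambda / N,        # adjust for glmnet's 1/(2N) loss scaling
                  intercept = FALSE, standardize = FALSE)
    as.numeric(coef(fit))[-1]                 # drop the (zero) intercept
  })
  B                                           # q x p matrix, column j = B_{.j}
}
```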

Penalty to enforce row sparsity
The multivariate elastic net produces estimates with zero elements scattered anywhere in B. Sometimes it is desirable to have entire rows of B be either zero or nonzero. This can be achieved by solving the group-lasso-type problem

$\hat B = \arg\min_B \left\{ \tfrac12\|Y - ZB\|_F^2 + \lambda \sum_{j=1}^q \|B_{j\bullet}\|_2 \right\}. \qquad (15)$

One possible algorithm cycles over the rows $B_{j\bullet}$ and updates the estimates as

$\hat B_{j\bullet} \leftarrow \left(1 - \frac{\lambda}{\|Z_{\bullet j}' R^{(j)}\|_2}\right)_+ \frac{Z_{\bullet j}' R^{(j)}}{\|Z_{\bullet j}\|_2^2}, \qquad R^{(j)} = Y - \sum_{k\neq j} Z_{\bullet k}\hat B_{k\bullet}. \qquad (16)$
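A minimal R sketch of the row-wise update (16), under our naming and without a convergence check:

```r
# Block coordinate descent for the row-sparse problem (15)
group_row_lasso <- function(Y, Z, lambda, n_sweeps = 100) {
  B <- matrix(0, ncol(Z), ncol(Y))
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(nrow(B))) {
      R_j <- Y - Z[, -j, drop = FALSE] %*% B[-j, , drop = FALSE]  # partial residual
      g   <- crossprod(Z[, j, drop = FALSE], R_j)                 # 1 x p gradient row
      gn  <- sqrt(sum(g^2))
      B[j, ] <- if (gn <= lambda) 0 else (1 - lambda / gn) * g / sum(Z[, j]^2)
    }
  }
  B
}
```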

Nuclear norm penalty
Matrix reconstruction problems gave origin to penalisation proportional to the nuclear norm (also called the trace norm, Schatten 1-norm or Ky Fan norm) $\|B\|_* = \sum_{i=1}^{\min\{q,p\}} d_i$, where $d_i$ are the singular values of B. The estimator with a penalised nuclear norm is defined as

$\hat B = \arg\min_B \left\{ \tfrac12\|Y - ZB\|_F^2 + \lambda\|B\|_* \right\}. \qquad (17)$

The nuclear norm is a convex relaxation of the rank of a matrix; therefore, shrinkage of the nuclear norm frequently reduces the rank of B. The estimator with the nuclear penalty is more complex to calculate. The solution can be found, for example, by a second-order cone program solver such as SDPT3 (Tütüncü et al., 2003). An alternative, simpler algorithm is given in Section 4.4.1.
Apart from these commonly used penalties, some of the estimation methods presented in the following section use more specific penalties that will be explained together with the respective estimation methods.

Methods
In this section, we give an overview of ways of estimating the parameters of the VEC model (4), with the main focus on penalisation methods. For simplicity, we consider model (4) without a drift term $\mu$, in the form

$Y = Z\Pi' + E, \qquad (18)$

where Y and Z are either centred or standardised counterparts of $\Delta X$ and X, respectively. Centering subtracts the column means; standardisation further divides each column by its empirical standard deviation $\sigma_w$, where $w$ denotes the vector of observations in that column. The estimates $\hat\Pi$ and $\hat\mu$ of the parameters in the original model (4) are related to the estimates $\hat\Pi^{(c)}$ and $\hat\Pi^{(s)}$ obtained with the centred and standardised data matrices, respectively, as

$\hat\Pi = \hat\Pi^{(c)}, \qquad \hat\Pi_{ij} = \frac{\sigma_{\Delta X_{\bullet i}}}{\sigma_{X_{\bullet j}}}\,\hat\Pi^{(s)}_{ij}, \qquad \hat\mu = \overline{\Delta X} - \hat\Pi\,\overline{X}, \qquad (19)$

where $\overline{\Delta X}$ and $\overline{X}$ denote the vectors of column means. The methods can be grouped according to the type of factorisation of $\Pi$ utilised:

1. No factorisation: The matrix $\Pi$ is estimated directly. The penalisation can be imposed on the elements of $\Pi$ or on $\mathrm{rank}(\Pi)$.
2. Cointegration factorisation $\Pi = \alpha\beta'$: The penalisation is imposed either on $\alpha$ or on $\beta$.
3. QR factorisation $\Pi' = QR$: The penalty is imposed on R.
4. SVD factorisation $\Pi = UDV'$: The penalisation can be imposed either only on the singular values, or on both the singular values and the singular vectors simultaneously.
The estimation methods also differ in their approach to the cointegration rank. Some of the methods select an appropriate rank as part of their estimation procedure. However, most of the methods do not address the problem of rank determination and require some preliminary estimate of r to set the dimensions of the matrices under estimation. The rank of the final estimate $\hat\Pi$ might be reduced further below the initial value. We present two methods of rank selection in Section 4.5.

All the presented methods are summarised, together with their main advantages and disadvantages, in Table 1, where relevant references and software implementations are also listed. Details on each method are presented in the following subsections. They are evaluated through a simulation study in Section 5.

Direct Estimation of Π
We start with estimation methods where no special structure of $\Pi$ is utilised. Many estimators can be expressed in terms of the moment matrices

$S_{yy} = \tfrac1N Y'Y, \quad S_{yz} = \tfrac1N Y'Z, \quad S_{zz} = \tfrac1N Z'Z, \quad S_{zy} = S_{yz}'. \qquad (20)$

The ordinary least squares (OLS) estimator of $\Pi$ minimises the sum of squared residuals with no restrictions imposed on the rank of $\Pi$. It is given by

$\hat\Pi_{OLS} = S_{yz} S_{zz}^{-1}. \qquad (21)$

In practice, the rows of $\hat\Pi_{OLS}$ can be obtained from OLS estimation in the p univariate regression models $Y_{\bullet i} = Z\,\Pi_{i\bullet}' + \varepsilon_{\bullet i}$, $i = 1, \ldots, p$. This fact reveals the main drawback of the method, namely that possible relations between the response variables $Y_{\bullet i}$ are ignored. Moreover, the OLS estimator performs poorly when predictors are highly correlated, which can easily happen with a cointegrated process. On the other hand, the OLS estimator is consistent when p is fixed (Lütkepohl, 2005).
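For concreteness, (20) and (21) amount to a few lines of R (our naming; Y and Z are the centred data matrices):

```r
# Moment matrices (20) and the OLS estimator (21)
ols_Pi <- function(Y, Z) {
  N   <- nrow(Y)
  Szz <- crossprod(Z) / N        # S_zz = Z'Z / N
  Szy <- crossprod(Z, Y) / N     # S_zy = Z'Y / N
  t(solve(Szz, Szy))             # Pi-hat = S_yz S_zz^{-1}
}
```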

Penalising Π with unrestricted rank
The simplest approach to obtaining a sparse estimator of $\Pi$ is to impose a lasso penalty as in (6) and thus force some elements of $\Pi$ to be exact zeros. A ridge penalty as in (9) can be added to make the estimator robust to collinearity. Thus, an estimator with an elastic net penalty, cf. (14), is given by

$\hat\Pi = \arg\min_\Pi \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\big(\omega\|\Pi\|_1 + \tfrac{1-\omega}{2}\|\Pi\|_F^2\big) \right\}. \qquad (22)$

In practice, the multivariate elastic net regression (22) is performed as p independent univariate elastic net regressions

$\hat\Pi_{i\bullet} = \arg\min_{\Pi_{i\bullet}} \left\{ \tfrac12\|Y_{\bullet i} - Z\,\Pi_{i\bullet}'\|_2^2 + \lambda\big(\omega\|\Pi_{i\bullet}\|_1 + \tfrac{1-\omega}{2}\|\Pi_{i\bullet}\|_2^2\big) \right\}, \quad i = 1, \ldots, p. \qquad (23)$

Several forms of lasso-type penalties in single-equation models with cointegrated variables have been studied further to illustrate how nonstationarity of the predictors affects the estimator. The asymptotic properties of the standard lasso are provided in Koo et al. (2020). In Smeekes & Wijler (2021), a single-equation penalised error correction selector is proposed. This procedure is close to the elastic net penalisation (23), but the predictors $Z_{\bullet(-i)}$ are transformed so that the errors $\varepsilon_{\bullet(-i)}$ are orthogonal to $\varepsilon_{\bullet i}$. In Masini & Medeiros (2021), a weighted lasso procedure is proposed, with weights determined according to the nature of the trend in each series. In Lee et al. (2022), a twin adaptive lasso is introduced; it consists of a two-step weighted lasso that improves variable selection.
Note that the same tuning parameter $\lambda$ is used in (23) for all $i = 1, \ldots, p$; hence the choice of $\lambda$ must be carried out with respect to all the univariate models, for example, by choosing a value that gives the smallest cross-validation error of the full multivariate model. The estimator can be further refined by introducing weights into the penalty term, which then becomes $\lambda\big(\omega\sum_{i,j} w_{ij}|\Pi_{ij}| + \tfrac{1-\omega}{2}\sum_{i,j} w_{ij}\Pi_{ij}^2\big)$. Penalised regression with the lasso-type penalty was investigated in the context of fitting cointegrated VEC models in Smeekes & Wijler (2018), where the lasso outperformed OLS in a simulation study in terms of the forecast error.

Reducing the rank of Π
The OLS estimator and the estimator penalising the elements of $\Pi$ are likely to give an estimate with full or close to full rank. If we want an estimate of $\Pi$ of a certain rank r, one possible strategy is to estimate $\Pi$ with no restriction on the rank and then find an approximation of $\hat\Pi$ among matrices with rank r. This is best achieved with the SVD $\hat\Pi = \hat U\hat D\hat V'$. The Eckart-Young theorem states that a matrix solving the problem $\min_{\hat\Pi_r} \|\hat\Pi - \hat\Pi_r\|_F$ subject to $\mathrm{rank}(\hat\Pi_r) = r$ is

$\hat\Pi_r = \sum_{i=1}^r \hat d_i\, \hat U_{\bullet i}\hat V_{\bullet i}', \qquad (24)$

where the singular values are in descending order, $\hat d_1 \ge \cdots \ge \hat d_p$, and $\hat U_{\bullet i}$, $\hat V_{\bullet i}$ are the corresponding left and right singular vectors. For a more concise notation, let $H_\lambda$, with $(H_\lambda(A))_{ij} = A_{ij}\,\mathbb{1}(|A_{ij}| \ge \lambda)$ for $A \in \mathbb{R}^{p \times p}$, be the hard-thresholding operator, and let $H^*_\lambda$ be the hard-thresholding operator for an SVD, $H^*_\lambda(\Pi) = U H_\lambda(D) V'$. Then $\hat\Pi_r = H^*_{\hat d_r}(\hat\Pi)$.
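In R, the rank-r truncation (24) is a few lines around svd() (our naming):

```r
# Best rank-r approximation of a preliminary estimate Pi0 (Eckart-Young),
# i.e. the hard-thresholding H*_{d_r}(Pi0) of the SVD at the r-th singular value
rank_r_approx <- function(Pi0, r) {
  s <- svd(Pi0)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], nrow = r) %*%
    t(s$v[, 1:r, drop = FALSE])
}
```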

Penalising the rank of Π
The rank penalty was introduced in Bunea et al. (2011) for reduced-rank regression. The estimator is defined as

$\hat\Pi = \arg\min_\Pi \left\{ \|Y - Z\Pi'\|_F^2 + \lambda\,\mathrm{rank}(\Pi) \right\}. \qquad (25)$

A simple computational procedure is derived in Chen et al. (2013). It can be shown that minimising the objective function within the brackets in (25) is equivalent to minimising

$\|Z\hat\Pi_{OLS}' - Z\Pi'\|_F^2 + \lambda\,\mathrm{rank}(Z\Pi').$

The minimum is achieved when $Z\Pi'$ is the best approximation to $Z\hat\Pi_{OLS}'$ among matrices of restricted rank, in particular keeping only the singular values $\hat d_i > \sqrt\lambda$. This yields the estimator of $\Pi$ as

$\hat\Pi' = \hat\Pi_{OLS}'\, \hat V_{(\lambda)}\hat V_{(\lambda)}', \quad \text{so that } Z\hat\Pi' = H^*_{\sqrt\lambda}\big(Z\hat\Pi_{OLS}'\big), \qquad (26)$

where $\hat U\hat D\hat V'$ is the SVD of $Z\hat\Pi_{OLS}'$ and $\hat V_{(\lambda)}$ collects the right singular vectors with $\hat d_i > \sqrt\lambda$. The method can be robustified by adding a ridge penalty (Mukherjee & Zhu, 2011),

$\hat\Pi = \arg\min_\Pi \left\{ \|Y - Z\Pi'\|_F^2 + \lambda(1-\omega)\|\Pi\|_F^2 + \lambda\omega\,\mathrm{rank}(\Pi) \right\}. \qquad (27)$

The robustified estimator is found by constructing augmented data matrices $Y^* = (Y', 0_{p\times p}')'$ and $Z^* = (Z', \sqrt{\lambda(1-\omega)}\, I_p)'$, as in (10), and applying the rank-penalised estimator (26) to $(Y^*, Z^*)$ with tuning parameter $\lambda\omega$.
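A sketch of (26) in R (our naming; the estimated rank is whatever the thresholding leaves):

```r
# Rank-penalised estimator (25)-(26): hard-threshold the singular values of
# the OLS fit Z Pi_ols' at sqrt(lambda) and project Pi_ols accordingly.
rank_penalty_Pi <- function(Y, Z, lambda) {
  Pi_ols <- t(solve(crossprod(Z), crossprod(Z, Y)))
  s <- svd(Z %*% t(Pi_ols))
  r <- sum(s$d > sqrt(lambda))
  if (r == 0) return(matrix(0, ncol(Y), ncol(Z)))
  V_r <- s$v[, 1:r, drop = FALSE]
  V_r %*% crossprod(V_r, Pi_ols)      # Pi-hat = V_r V_r' Pi_ols
}
```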

Reduced-rank lasso
A natural idea is to combine the rank restriction and lasso penalisation into one joint estimation problem (Lian et al., 2015) and to define the estimator of $\Pi$ as

$\hat\Pi = \arg\min_{\Pi:\,\mathrm{rank}(\Pi)\le r} \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\|\Pi\|_1 \right\}. \qquad (28)$

The algorithm suggested in Lian et al. (2015) is proximal gradient descent. This algorithm is designed to solve $\min_{x\in G}[f(x) + g(x)]$, where f is convex and differentiable, g is convex and G is a closed convex set. For a chosen decreasing sequence of positive numbers $\alpha_k$, the general algorithm iterates two steps:

1. Proximal step: Minimise only g while keeping the auxiliary solution $s^{(k+1)}$ close to the previous update $x^{(k)}$ by penalising the distance $\|x^{(k)} - s\|$; the auxiliary update is not restricted to G:
$s^{(k+1)} = \arg\min_s \left\{ g(s) + \tfrac{1}{2\alpha_k}\|x^{(k)} - s\|^2 \right\}.$

2. Gradient step: Take a gradient step on f and project onto G:
$x^{(k+1)} = P_G\big(s^{(k+1)} - \alpha_k \nabla f(s^{(k+1)})\big),$
where $P_G(\cdot)$ is the projection onto G.

To solve (28), let $f(\Pi) = \tfrac12\|Y - Z\Pi'\|_F^2$, $g(\Pi) = \lambda\|\Pi\|_1$ and $G = \{\Pi \in \mathbb{R}^{p\times p}: \mathrm{rank}(\Pi) \le r\}$. The two steps of the proximal gradient descent algorithm become:

1. Element-wise soft-thresholding,
$s^{(k+1)} = S_{\lambda\alpha_k}\big(\Pi^{(k)}\big),$
where $S_\lambda(\cdot)$ is defined in Equation (7).

2. A gradient step on the sum of squared errors followed by projection onto the space of matrices with rank at most r,
$\Pi^{(k+1)} = H^*_{\tilde d_r}\Big(s^{(k+1)} - \alpha_k\big(s^{(k+1)} Z'Z - Y'Z\big)\Big),$
where $\tilde d_r$ is the r-th largest singular value of the matrix in the parentheses, and $H^*_{\tilde d_r}$ is the hard-thresholding operator for an SVD.

The implementation of the proximal gradient descent algorithm suffers from two issues. First, the proximal gradient method is proven to converge when the set G is closed and convex; however, the set of all matrices of rank at most r is not convex, so convergence is not guaranteed, although convergence in simulations was reported in Lian et al. (2015). Second, we found no recommendation regarding how to choose the sequence $\alpha_k$. The sequences we tried usually led to satisfactory convergence when the dimension p is low, but often to diverging behaviour for high p.
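For illustration, a compact R sketch of the two steps; the constant step size $\alpha = 1/\sigma_{\max}(Z)^2$ is our ad hoc choice, since no recipe for $\alpha_k$ is given:

```r
# Proximal gradient sketch for the reduced-rank lasso (28).
soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

rr_lasso <- function(Y, Z, r, lambda, n_iter = 500) {
  p     <- ncol(Y)
  alpha <- 1 / norm(Z, "2")^2    # our ad hoc constant step size
  ZtZ   <- crossprod(Z)
  YtZ   <- crossprod(Y, Z)
  Pi    <- matrix(0, p, p)
  for (k in seq_len(n_iter)) {
    S  <- soft(Pi, lambda * alpha)        # proximal step on the l1 term
    G  <- S %*% ZtZ - YtZ                 # gradient of f at S
    sv <- svd(S - alpha * G)              # gradient step, then project on rank <= r
    Pi <- sv$u[, 1:r, drop = FALSE] %*% diag(sv$d[1:r], nrow = r) %*%
          t(sv$v[, 1:r, drop = FALSE])
  }
  Pi
}
```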

Estimation Through the Factorisation $\Pi = \alpha\beta'$
This subsection deals with the methods that take cointegration directly into account by estimating the factorisation $\Pi = \alpha\beta'$, $\alpha, \beta \in \mathbb{R}^{p \times r}$. Before presenting the penalisation methods, we recall the Johansen procedure (Johansen, 1995), which is a popular method of fitting cointegrated VEC models using the maximum-likelihood principle.
The log-likelihood function has, up to an additive constant, the form

$\ell(\alpha, \beta, \Sigma) = -\frac{N}{2}\log|\Sigma| - \frac12\sum_{n=1}^N (\Delta x_n - \alpha\beta' x_{n-1})'\,\Sigma^{-1}(\Delta x_n - \alpha\beta' x_{n-1}).$

For a given cointegration rank r, the columns of $\hat\beta$ can be found as the eigenvectors $v_1, \ldots, v_r$ corresponding to the r largest eigenvalues $\hat\lambda_i$ of the generalised eigenvalue problem

$\big|\lambda S_{zz} - S_{zy} S_{yy}^{-1} S_{yz}\big| = 0, \qquad (34)$

satisfying the normalisation condition

$\hat\beta' S_{zz}\,\hat\beta = I_r, \qquad (35)$

where the moment matrices $S_{\cdot\cdot}$ are defined in (20). Any basis of $\mathrm{sp}(\hat\beta)$ can serve as $\hat\beta$, because any linear combination of cointegration vectors yields a cointegration vector.

Once $\hat\beta$ is determined, we construct new stationary covariates $U = Z\hat\beta \in \mathbb{R}^{N \times r}$ and turn the VEC model (2) into a standard linear regression model $Y = U\alpha' + \varepsilon$ with no restrictions on the rank of the parameter matrix $\alpha$. The maximum likelihood estimators (MLE) of $\alpha$ and $\Sigma$ are

$\hat\alpha = S_{yz}\hat\beta\big(\hat\beta' S_{zz}\hat\beta\big)^{-1}, \qquad (36)$

$\hat\Sigma = S_{yy} - S_{yz}\hat\beta\big(\hat\beta' S_{zz}\hat\beta\big)^{-1}\hat\beta' S_{zy}. \qquad (37)$

The chosen form of $\hat\beta$ affects $\hat\alpha$; however, $\mathrm{sp}(\hat\alpha)$ is invariant, and $\hat\Pi = \hat\alpha\hat\beta'$ is always unique. The presented estimation procedure thus gives some flexibility in choosing $\beta$, but not $\alpha$.
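For reference, a compact R sketch of the Johansen estimation step (34)-(36), under our naming; it ignores numerical subtleties that full implementations, such as ca.jo in the urca package, handle:

```r
# Johansen reduced-rank step: eigenvectors for (34), normalisation (35),
# then alpha-hat from (36).
johansen_fit <- function(Y, Z, r) {
  N   <- nrow(Y)
  Syy <- crossprod(Y) / N; Szz <- crossprod(Z) / N
  Szy <- crossprod(Z, Y) / N; Syz <- t(Szy)
  W   <- solve(Szz, Szy %*% solve(Syy, Syz))   # S_zz^{-1} S_zy S_yy^{-1} S_yz
  ev  <- eigen(W)
  beta <- Re(ev$vectors[, 1:r, drop = FALSE])  # ignore tiny imaginary parts
  beta <- beta %*% solve(chol(t(beta) %*% Szz %*% beta))  # beta' S_zz beta = I_r
  alpha <- Syz %*% beta                        # (36), simplified by the normalisation
  list(alpha = alpha, beta = beta, Pi = alpha %*% t(beta))
}
```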

Penalising β
A procedure with a lasso penalty on $\beta$ was introduced in Wilms & Croux (2016). The original method also incorporates additional penalties on $\mu$ and $\Omega = \Sigma^{-1}$, but we ignore those here for simplicity. The estimators are derived from the penalised negative log-likelihood

$(\hat\alpha, \hat\beta, \hat\Sigma) = \arg\min_{\alpha,\beta,\Sigma} \left\{ \frac1N\,\mathrm{tr}\big[(Y - Z\beta\alpha')\Sigma^{-1}(Y - Z\beta\alpha')'\big] + \log|\Sigma| + \lambda\|\beta\|_1 \right\}. \qquad (38)$

For identifiability reasons, the estimator $\hat\alpha$ is constrained by the condition $\hat\alpha'\Sigma^{-1}\hat\alpha = I_r$; restricting $\alpha$ instead of $\beta$ is convenient for the calculations. The solution of (38) is found by an algorithm that iterates three steps:

1. Solving for $\alpha$ conditional on $\beta$ and $\Sigma$: The minimisation problem reduces to
$\hat\alpha\,|\,(\beta, \Sigma) = \arg\min_{\alpha:\,\alpha'\Sigma^{-1}\alpha = I_r}\ \mathrm{tr}\big[(Y - Z\beta\alpha')\Sigma^{-1}(Y - Z\beta\alpha')'\big]. \qquad (39)$
This is a weighted Procrustes problem, and the solution is $\hat\alpha = \Sigma^{1/2} V U'$, where U and V are the left and right singular vector matrices from the SVD $\beta' Z' Y\Sigma^{-1/2} = UDV'$.

2. Solving for $\beta$ conditional on $\alpha$ and $\Sigma$: The minimisation problem reduces to
$\hat\beta\,|\,(\alpha, \Sigma) = \arg\min_\beta \left\{ \frac1N\big\|Y\Sigma^{-1/2}\alpha^* - Z\beta\big\|_F^2 + \lambda\|\beta\|_1 \right\}, \qquad (40)$
with $\alpha^* = \Sigma^{-1/2}\alpha$. This is a multivariate lasso regression of the response matrix $Y\Sigma^{-1/2}\alpha^*$ on Z.

3. Solving for $\Sigma$ conditional on $\alpha$ and $\beta$: The estimator $\hat\Sigma$ is the usual maximum likelihood estimator (37).
The starting value $\beta^{(0)}$ can be chosen as the first r eigenvectors of the matrix $\mathrm{diag}(S_{zz})^{-1} S_{zy}\,\mathrm{diag}(S_{yy})^{-1} S_{yz}$, together with $\Sigma^{(0)} = I_p$ (Wilms & Croux, 2016). If it is not important to have a sparse $\Sigma$, we propose instead to determine $\alpha^{(0)}$ by regressing Y on $Z\beta^{(0)}$ and to set $\Sigma^{(0)} = (Y - Z\beta^{(0)}\alpha^{(0)\prime})'(Y - Z\beta^{(0)}\alpha^{(0)\prime})/N$. Note that an appropriate choice of starting values can be more important in high-dimensional settings (Wilms & Croux, 2016).
There is no proof of convergence of the algorithm, only reports of good performance in simulations. An implementation of the iterative algorithm in R is provided in the supplementary material to Wilms & Croux (2016). We discovered a minor discrepancy between the theoretical form of the algorithm and the actual implementation: the implemented algorithm selects the optimal value of the tuning parameter $\lambda$ by cross-validation in each iteration of step 2, so $\lambda$ varies over the iterations. The resulting estimators therefore cannot be associated with a solution of (38) for a specified value of $\lambda$, and in practice we cannot track the amount of shrinkage of $\hat\Pi$ as a function of $\lambda$. It would be possible to correct the implementation so that the same $\lambda$ is used in all iterations; however, this would substantially increase the running time and make computations with larger dimensions nearly prohibitive. Therefore, we stick to the original implementation.
Imposing a lasso penalty on $\beta$ leads to biased estimation of the cointegration space. A sparse $\hat\beta$ can only induce sparsity in $\hat\Pi = \hat\alpha\hat\beta'$ if it contains a large portion of zeros that make some of its rows identically zero. This can be achieved directly by a penalty that enforces row sparsity of $\beta$ (Chen & Huang, 2012). Row sparsity means that some variables are forced not to enter any of the cointegration relationships, which seems too strict a restriction for most cointegrated processes.

Penalising α
If we prioritise unbiased estimation of $\beta$, and thus of the cointegration space, it is more meaningful to impose the penalty on $\alpha$ instead of $\beta$. The following simple procedure was proposed in Levakova et al. (2022). We first estimate $\beta$ as in the Johansen procedure (34). Then we find $\hat\alpha$ by minimising the penalised sum of squared residuals,

$\hat\alpha = \arg\min_\alpha \left\{ \tfrac12\big\|Y - Z\hat\beta\alpha'\big\|_F^2 + \lambda\Big(\omega\|\alpha\|_1 + \tfrac{1-\omega}{2}\|\alpha\|_F^2\Big) \right\}. \qquad (41)$

It is still not guaranteed that $\hat\Pi = \hat\alpha\hat\beta'$ is a sparse matrix. Zeros in $\hat\alpha$ mean that some of the univariate subprocesses of $\{x_n\}$ do not adjust to the disequilibria in some of the cointegration relationships.
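A sketch of this two-stage procedure in R, reusing johansen_fit from above for $\hat\beta$ and glmnet for the elastic net step (41) (our naming; glmnet needs at least two columns in the design matrix, i.e. $r \ge 2$):

```r
library(glmnet)

# beta-hat from the Johansen step, then elastic net (41) on U = Z beta-hat
fit_alpha <- function(Y, Z, beta_hat, lambda, omega) {
  U <- Z %*% beta_hat
  N <- nrow(Y)
  alpha <- sapply(seq_len(ncol(Y)), function(j) {
    fit <- glmnet(U, Y[, j], alpha = omega, lambda = lambda / N,
                  intercept = FALSE, standardize = FALSE)
    as.numeric(coef(fit))[-1]
  })
  t(alpha)        # p x r loading matrix alpha-hat
}
```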

Penalisation in a process transformed into stationary components and stochastic trends
The decomposition $\Pi = \alpha\beta'$ is used in Liao & Phillips (2015) to transform the process $\{x_n\}_{n=0}^N$ into stationary and nonstationary components, and the penalised estimation is performed on the transformed process. Let $\alpha_\perp$ and $\beta_\perp$ be full-rank $p \times (p-r)$ matrices of orthogonal complements, satisfying $\alpha_\perp'\alpha = 0_{(p-r)\times r}$ and $\beta_\perp'\beta = 0_{(p-r)\times r}$, respectively. Without loss of generality, we assume $\alpha_\perp'\alpha_\perp = I_{p-r}$ and $\beta_\perp'\beta_\perp = I_{p-r}$. The matrices $\alpha_\perp$ and $\beta_\perp$ can be constructed from the normalised left and right singular vectors that correspond to the zero singular values of $\Pi$. Define the matrix $F = [\beta, \alpha_\perp]'$. Then

$F\Pi = \begin{pmatrix}\beta'\alpha\beta' \\ 0_{(p-r)\times p}\end{pmatrix}, \qquad (42)$

because $\alpha_\perp'\alpha = 0$. Thus, the transformed process $w_n = F x_n$ satisfies the VEC model

$\Delta w_n = \begin{pmatrix}\beta'\alpha & 0_{r\times(p-r)} \\ 0_{(p-r)\times r} & 0_{(p-r)\times(p-r)}\end{pmatrix} w_{n-1} + \xi_n, \qquad (43)$

where $\xi_n = F\varepsilon_n$. The transformation thus decomposes the process into r stationary components and $p - r$ random walks. The estimator of $\Pi$ is defined as

$\hat\Pi = \arg\min_\Pi \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\sum_{j=1}^p \big\|(\hat F\Pi)_{j\bullet}\big\|_2 \right\}, \qquad (44)$

where $\hat F$ is a matrix consisting of the normalised left eigenvectors of a preliminary estimate $\hat\Pi_0$. The procedure does not require a prespecified cointegration rank, but tries to reduce the rank to the optimal number by identifying the zero rows of $\hat F\Pi$. Theoretical properties of this estimator are derived in Liao & Phillips (2015). Unfortunately, no algorithm to solve (44) is provided, and we have not found any information regarding its practical implementation in the literature.

Estimation Through QR Decomposition
The following method was introduced in Liang & Schienle (2019) and makes use of the QR decomposition $\Pi' = QR$. The matrix $Q \in \mathbb{R}^{p\times p}$ is an orthonormal matrix and $R \in \mathbb{R}^{p\times p}$ is an upper triangular matrix. The matrix $\Pi'$ can be written in block-matrix form as

$\Pi' = QR = [Q_1, Q_2]\begin{pmatrix} R_{11} & R_{12} \\ 0_{(p-r)\times r} & R_{22}\end{pmatrix}, \qquad (45)$

where $R_{22} \in \mathbb{R}^{(p-r)\times(p-r)}$. The matrix Q can be seen as a $p \times p$ extended form of $\beta$, where the first r columns span the cointegration space, while $R'$ represents an extended form of $\alpha$ that has zeros in the last $p - r$ columns, so that $R_{22} = 0_{(p-r)\times(p-r)}$. Therefore, we expect an estimate $\hat R_{22}$ with elements close to zero.

This motivates the idea of imposing a penalty on the columns of $R'$ (that is, the rows of R) to shrink the block $R_{22}$ towards 0. The objective function to be minimised is

$\|Y - Z R' Q'\|_F^2 + \lambda\sum_{j=1}^p w_j\,\|R_{j\bullet}\|_2, \qquad w_j = \|\hat R^{(0)}_{j\bullet}\|_2^{-\gamma}, \qquad (46)$

where $\lambda$ is the tuning parameter and $\gamma > 0$ is a constant fixed in advance. In Liang & Schienle (2019), $\gamma \in \{2, 3, 4\}$ is used, with a lower value suggested when the sample size is small. The matrix Q is the orthogonal matrix obtained from a QR decomposition of a preliminary estimate $\hat\Pi_0$, for example the OLS estimate; similarly, the weights $w_j$ are calculated from the QR decomposition $\hat R^{(0)}$ of $\hat\Pi_0$. An alternative element-wise penalisation is used in Chen & Schienle (2022).

Penalising the nuclear norm
This estimation procedure was first introduced in Yuan et al. (2007) in the context of reduced-rank regression problems. A practical summary of the method with focus on cointegrated VAR models is provided in Signoretto & Suykens (2012). The argument for using the nuclear norm penalty in fitting a cointegrated VAR model is as follows. The rows $\Pi_{i\bullet} \in \mathbb{R}^p$, $i = 1, \ldots, p$, belong to a linear subspace B of dimension less than p. Let $\{\eta_1, \ldots, \eta_p\}$ be a set of basis vectors of $\mathbb{R}^p$ and A a subset of $\{1, \ldots, p\}$ such that $B \subseteq \mathrm{span}\{\eta_i: i \in A\}$. Assume for the moment that the basis vectors $\eta_i$ are known, and consider the problem of simultaneous selection of the basis vectors and parameter estimation in the model $\Pi' = H\Psi$, where $H = (\eta_1, \ldots, \eta_p)$, via the row-wise group penalty

$\min_\Psi \left\{ \tfrac12\|Y - Z\Psi' H'\|_F^2 + \lambda\sum_{i=1}^p \|\Psi_{i\bullet}\|_2 \right\}. \qquad (49)$

It turns out that we do not need to know the basis vectors $\eta_i$, $i = 1, \ldots, p$, because we can utilise an SVD, $\Pi' = UDV'$, from which it follows that $H = U$, $\Psi = DV'$ and $\sum_{i=1}^p \|\Psi_{i\bullet}\|_2 = \sum_{i=1}^p d_i = \|\Pi\|_*$. The resulting estimator is

$\hat\Pi = \arg\min_\Pi \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\|\Pi\|_* \right\}. \qquad (50)$

It can be shown that the nuclear norm $\|\Pi\|_*$ is the convex envelope of $\mathrm{rank}(\Pi)$, which makes (50) the tightest convex relaxation of the problem $\min_\Pi\{\|Y - Z\Pi'\|_F^2 + \lambda\,\mathrm{rank}(\Pi)\}$ discussed in Section 4.1.3. The convexity implies that any local solution found by a convergent algorithm is globally optimal. One possibility for minimising (50) is a second-order cone program solver such as SDPT3 (Tütüncü et al., 2003). An alternative is a special case of the fast iterative shrinkage-thresholding algorithm (FISTA), which iterates

$\Pi^{(k)} = S^*_{\lambda\alpha}\Big(\tilde\Pi^{(k)} - \alpha\big(\tilde\Pi^{(k)} Z'Z - Y'Z\big)\Big), \qquad (51)$
$t_{k+1} = \tfrac12\Big(1 + \sqrt{1 + 4t_k^2}\Big), \qquad \tilde\Pi^{(k+1)} = \Pi^{(k)} + \frac{t_k - 1}{t_{k+1}}\big(\Pi^{(k)} - \Pi^{(k-1)}\big),$

where $S^*_\lambda$ soft-thresholds the singular values, $S^*_\lambda(\Pi) = U S_\lambda(D) V'$, $\alpha$ is a step size, and the auxiliary sequence $\tilde\Pi^{(k)}$ is a linear combination of the previous two updates $\hat\Pi^{(k-1)}$ and $\hat\Pi^{(k-2)}$. The convergence properties of this procedure are documented in Beck & Teboulle (2009).

To make the estimator more robust, and to prevent numerical issues when $\lambda$ is close to 0, adding a ridge penalty term to the objective function (50) is recommended, which leads to the minimisation problem

$\hat\Pi = \arg\min_\Pi \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\Big(\omega\|\Pi\|_* + \tfrac{1-\omega}{2}\|\Pi\|_F^2\Big) \right\}. \qquad (52)$

The estimator $\hat\Pi$ is found by constructing augmented data matrices, as in (10), and performing the FISTA algorithm with tuning parameter $\lambda\omega$.
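A minimal R sketch of this FISTA scheme (our naming; a fixed number of iterations instead of a convergence criterion):

```r
# FISTA for the nuclear-norm problem (50) (Beck & Teboulle, 2009).
svd_soft <- function(A, lambda) {          # S*_lambda: soft-threshold singular values
  s <- svd(A)
  d <- pmax(s$d - lambda, 0)
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)
}

nuclear_fista <- function(Y, Z, lambda, n_iter = 500) {
  p     <- ncol(Y)
  alpha <- 1 / norm(Z, "2")^2              # step size 1/L with L = sigma_max(Z)^2
  ZtZ   <- crossprod(Z)
  YtZ   <- crossprod(Y, Z)
  Pi_old <- Pi <- Pi_tilde <- matrix(0, p, p)
  t_k <- 1
  for (k in seq_len(n_iter)) {
    G  <- Pi_tilde %*% ZtZ - YtZ           # gradient of the quadratic part
    Pi <- svd_soft(Pi_tilde - alpha * G, lambda * alpha)
    t_next   <- (1 + sqrt(1 + 4 * t_k^2)) / 2
    Pi_tilde <- Pi + ((t_k - 1) / t_next) * (Pi - Pi_old)   # momentum step
    Pi_old <- Pi
    t_k    <- t_next
  }
  Pi
}
```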

Adaptive nuclear norm penalisation
The previous method was modified in two respects in Chen et al. (2013), where the estimator $\hat\Pi$ is defined as

$\hat\Pi = \arg\min_\Pi \left\{ \tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\|Z\Pi'\|_{*w} \right\}. \qquad (54)$

First, the objective function contains the norm of $Z\Pi'$ instead of the norm of $\Pi$. This is inspired by the rank penalty from Section 4.1.3, where the minimum of $\|Y - Z\Pi'\|_F^2 + \lambda\,\mathrm{rank}(\Pi)$ was found to be the same as the minimum of $\|Y - Z\Pi'\|_F^2 + \lambda\,\mathrm{rank}(Z\Pi')$. The solution is expected to yield a low-rank estimate of $Z\Pi'$ and thus possibly also a sparser form of $\Pi$. Second, the nuclear norm is generalised by allowing weights $w = (w_1, \ldots, w_p)'$ satisfying $0 \le w_1 \le \cdots \le w_p$ and defining the weighted nuclear norm $\|\Pi\|_{*w} = \sum_{i=1}^p w_i d_i(\Pi)$. Because the penalty is imposed on $Z\Pi'$, an explicit solution of (54) can be found using the soft-thresholding operator of Equation (7) applied to the singular values,

$Z\hat\Pi' = \sum_{i=1}^p \big(\hat d_i - \lambda w_i\big)_+\,\hat U_{\bullet i}\hat V_{\bullet i}', \qquad (55)$

where $\hat U\hat D\hat V'$ is the SVD of $Z\hat\Pi_{OLS}'$. The only difference from the rank-penalised estimator (26) is that the hard-thresholding operator is replaced by a soft-thresholding operator.

The adaptive nuclear norm penalisation method can be robustified in a similar way as the penalised rank regression, by incorporating a ridge penalty term. The ridge penalty is again imposed on $Z\Pi'$ instead of on $\Pi$, so the objective function to be minimised is

$\tfrac12\|Y - Z\Pi'\|_F^2 + \lambda\Big(\omega\|Z\Pi'\|_{*w} + \tfrac{1-\omega}{2}\|Z\Pi'\|_F^2\Big). \qquad (56)$

The minimiser is $\hat\Pi = \hat\Pi(\lambda\omega)/[1 + \lambda(1-\omega)]$, where $\hat\Pi(\lambda\omega)$ is the adaptive nuclear penalty estimator with tuning parameter $\lambda\omega$.

Penalisation of all components of singular value decomposition of Π
An algorithm to estimate a sparse form of the SVD of $\Pi$ was introduced in Chen et al. (2012). The SVD splits $\Pi$ into a sum of r unit-rank matrices,

$\Pi = \sum_{k=1}^r d_k\, U_{\bullet k} V_{\bullet k}'.$

The algorithm relaxes the orthogonality of the left and right singular vectors and minimises the following objective function with respect to the triplets $(d_k, U_{\bullet k}, V_{\bullet k})$:

$\tfrac12\Big\|Y - Z\sum_{k=1}^r d_k\, V_{\bullet k} U_{\bullet k}'\Big\|_F^2 + \lambda\sum_{k=1}^r \sum_{i,j} w^{(k)}_{ij}\,\big|d_k\, U_{ik} V_{jk}\big|. \qquad (57)$

The multiplicative form of the penalty not only promotes sparsity in the singular vectors, but also shrinks the singular values of each SVD layer towards 0.

The objective function is non-convex. The optimisation algorithm treats the unit-rank layers $d_k U_{\bullet k} V_{\bullet k}'$ in parallel; therefore, we first present the algorithm for r = 1.

For rank(Π) = 1: The problem reduces to minimising

$\tfrac12\|Y - d\,(ZV)U'\|_F^2 + \lambda\, d \sum_{i,j} w_{ij}\,|U_i V_j|, \qquad (58)$

where $\|U\|_2 = \|V\|_2 = 1$ and $d \ge 0$. The weights are decomposed as $w_{ij} = w^u_i w^v_j$ and chosen in a data-driven manner using the leading SVD triplet of an initial estimate, $\hat\Pi_0' = \hat d_0\,\hat U_0 \hat V_0'$, as

$w^u = |\hat U_0|^{-\gamma}, \qquad w^v = |\hat V_0|^{-\gamma}, \qquad (59)$

where $\gamma$ is a prespecified non-negative parameter ($\gamma = 2$ is used in Chen et al. (2012)) and the absolute value and power are applied element-wise. The algorithm alternates between estimating (d, V) for fixed U and estimating (d, U) for fixed V until convergence.

1. Fix U, minimise with respect to (d, V): Writing $v^* = d\,V$ and using that $\|Y - (Zv^*)U'\|_F^2 = \|YU - Zv^*\|_2^2 + \text{const}$ for a unit vector U, problem (58) becomes a standard weighted lasso problem in the unknown parameters $v^*$, with response vector $YU$, design matrix Z and penalty weight $\lambda\sum_i w^u_i |U_i|$ on each $|v^*_j|$ scaled by $w^v_j$.

2. Fix V, minimise with respect to (d, U): Writing $u^* = d\,U$ and using that $\|Y - (ZV)u^{*\prime}\|_F^2 = \sum_j \|Y_{\bullet j} - u^*_j\, ZV\|_2^2$, problem (58) becomes a standard weighted lasso problem in the unknown parameters $u^*$, with the single predictor $ZV$ in each coordinate.

For rank(Π) = r > 1: The problem is decomposed into r parallel unit-rank problems by splitting the response matrix Y into r additive layers,

$Y_k = Y - Z\sum_{l\neq k} \hat d_l\,\hat V_{\bullet l}\hat U_{\bullet l}', \qquad k = 1, \ldots, r.$

1. For each $k \in \{1, \ldots, r\}$: (a) construct adaptive weights $w^{(k)}$ from the k-th SVD triplet of the initial estimate, and (b) solve the unit-rank problem (58) with response matrix $Y_k$. The algorithm can be further improved by iterating step 1 until convergence.

Determination of the Cointegration Rank
If a particular method of estimating $\Pi$ requires setting $r = \mathrm{rank}(\hat\Pi)$, a popular way of selecting r is Johansen's trace and maximum eigenvalue tests (Johansen, 1995). It can be shown that the maximum of the log-likelihood $\ell(\alpha, \beta, \Sigma\,|\,r)$ given a cointegration rank r is

$\ell_{\max}(r) = -\frac{N}{2}\Big(\log|S_{yy}| + \sum_{i=1}^r \log(1 - \hat\lambda_i)\Big) + \text{const},$

where $\hat\lambda_i$, $i = 1, \ldots, p$, are the eigenvalues of the eigenvalue problem (34) in descending order.

The trace test and the maximum eigenvalue test are likelihood-ratio tests of the null hypothesis $H_0: r = r_0$ against the alternative $H_a: r = p$ (trace test) or $H_a: r = r_0 + 1$ (maximum eigenvalue test). The test statistics are

trace test: $-2\big(\ell_{\max}(r_0) - \ell_{\max}(p)\big) = -N\sum_{i=r_0+1}^p \log(1 - \hat\lambda_i),$

maximum eigenvalue test: $-2\big(\ell_{\max}(r_0) - \ell_{\max}(r_0+1)\big) = -N\log(1 - \hat\lambda_{r_0+1}).$

Both test statistics follow nontrivial distributions; their values have been calculated numerically for $p \le 11$. To determine the cointegration rank, the trace test or maximum eigenvalue test is performed for $r_0 \in \{0, \ldots, p-1\}$, starting with $r_0 = 0$ and increasing $r_0$ by one as long as the null hypothesis is rejected. The lowest $r_0$ for which the null hypothesis is not rejected is the estimate of the cointegration rank r.
Determination of the rank by the trace or maximum eigenvalue tests is not viable for large p, due to unknown critical values of the tests. The critical values can possibly be determined by bootstrap methods (Cavaliere et al., 2012, 2014); however, computational times become prohibitive for large p. A more general way of estimating r is the rank selection criterion (RSC) (Bunea et al., 2011). The method is based on the rank penalisation from Section 4.1.3, with the tuning parameter set as

$\lambda = 2\hat\sigma^2\big(p + \mathrm{rank}(Z)\big),$

where $\hat\sigma^2$ is an estimate of the error variance. The rank estimated by RSC is the number of singular values of $Z\hat\Pi_{OLS}'$ that are larger than or equal to $\sqrt\lambda$; this value of $\lambda$ was derived as an optimal threshold to distinguish significant singular values from those induced by noise.
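A sketch of the RSC in R; the exact threshold constant and the variance estimate follow our reading of Bunea et al. (2011) and should be checked against the paper before serious use:

```r
# Rank selection criterion sketch: count the singular values of the OLS fit
# Z Pi_ols' exceeding sqrt(lambda).
rsc_rank <- function(Y, Z) {
  N <- nrow(Y); p <- ncol(Y); q <- qr(Z)$rank
  Pi_ols <- t(solve(crossprod(Z), crossprod(Z, Y)))
  E  <- Y - Z %*% t(Pi_ols)
  s2 <- sum(E^2) / (N * p - q * p)    # residual variance estimate (our choice)
  lambda <- 2 * s2 * (p + q)          # threshold, our reading of Bunea et al. (2011)
  sum(svd(Z %*% t(Pi_ols))$d >= sqrt(lambda))
}
```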
Another option is the procedure of Pedroni et al. (2015). It starts by projecting the series $\{x_n\}_{n=1}^N$ on the deterministic trend, which is a linear trend for the model assumed here. This provides a detrended series of residuals $\{\hat u_n\}_{n=1}^N$, which are estimates of the stochastic component of $\{x_n\}_{n=1}^N$, and a cumulated series $\{\hat s_n\}_{n=1}^N$, where $\hat s_n = \sum_{i=1}^n \hat u_i$. These series are used to calculate an estimate $\hat\Omega^C_N$ of the contemporaneous variance and an estimate $\hat\Omega^L_N$ of the long-run variance matrix. The test statistic is based on the ordered eigenvalues $\hat\lambda_1 \ge \cdots \ge \hat\lambda_p$ of $\hat\Omega^C_N(\hat\Omega^L_N)^{-1}$. Under the null hypothesis, the distribution of the test statistic converges weakly to

$\mathrm{tr}\left[\int_0^1 W_2(s)W_2(s)'\,\mathrm{d}s \left(\int_0^1 Q_2(s)Q_2(s)'\,\mathrm{d}s\right)^{-1}\right],$

where $W_2(s)$ is an orthogonal complement to a projection of a $(p-r)$-dimensional Brownian motion on a linear trend and $Q_2(s) = \int_0^s W_2(v)\,\mathrm{d}v$. The critical values are obtained numerically from simulations for a particular dimension p. The test statistic can be used in the same recursive manner as in the trace and maximum eigenvalue tests.

Illustration of Shrinkage Effects
We first conduct a simulation study to illustrate how the methods shrink $\hat\Pi$ in different manners as a function of the size of the tuning hyperparameter. In the next section, we conduct a larger simulation study with fixed penalisation (selected by cross-validation) to evaluate and compare the methods by different error measures.
We simulated a cointegrated process of dimension p = 30 with cointegration rank r = 4. The true $\Pi$ has 114 nonzero elements out of 900, located in 4 rows; the remaining rows contain only zeros. The norm $\|\Pi\|_F$ is approximately 0.9. The simulated data set consists of N = 30,000 observations of the vector process, to showcase the asymptotic behaviour of the estimation methods. We show the full regularisation paths, that is, characteristics of the estimates of $\Pi$ (and some other relevant parameters) as functions of the tuning parameters, from zero penalisation to full penalisation (where $\hat\Pi$ is shrunk to a zero matrix), in Figure 1. To achieve the same scales in the plots, the x-axis does not show the actual tuning parameter values, but the norm $\|\hat\Pi\|_F$ obtained with a given value of the tuning parameter. Thus, a norm of zero corresponds to the largest tuning parameter value, giving the largest shrinkage, namely an estimate of zero, and a norm of 0.9 corresponds to a tuning parameter of 0, giving no shrinkage of $\hat\Pi$ compared with the nonpenalised estimators (the OLS estimator for unrestricted rank and the Johansen estimator for restricted rank).
The elastic net estimator tends to shrink the smallest elements of $\hat\Pi$ in absolute value. As the tuning parameter is increased, the elements outside of the four nonzero rows are typically shrunk to 0 first, and the smaller elements within those four rows are shrunk next, even though the method does not penalise the rank of $\Pi$. The regularisation path has a region around $\|\hat\Pi\|_F = 0.3$ where the proportion of zeros and the estimated rank change rapidly, while in other regions the proportion of zeros changes only a little (Figure 1A).
The elastic net penalty imposed on $\alpha$ has the same impact on $\hat\alpha$ as the elastic net penalty has on $\hat\Pi$, that is, it shrinks the smallest elements of $\hat\alpha$. The effect on $\hat\Pi$ is that, with increasing penalisation, the four nonzero rows are correctly identified while the other rows are shrunk to zero (Figure 1C). The elements in these rows are shrunk evenly to 0, that is, a whole row is eliminated at a time. This in fact gives a more appropriate form of sparsity to $\hat\Pi$ than the standard elastic net penalisation of $\Pi$, because a relatively small amount of shrinkage suffices to pick the most important rows and eliminate the less important ones. Therefore, only a little bias is introduced (Figure 1H, red lines).
The penalisation of $R'$ in the QR decomposition of $\Pi$ shrinks whole columns of $\hat R'$ (Figure 1D). The last $p - r$ columns, corresponding to the non-existing cointegration relationships, are among the columns that are shrunk first, and $\hat R'$ has around four nonzero columns during the main part of the regularisation path. This means that the rank of $\hat\Pi$ is close to the true cointegration rank for a large part of the regularisation path (Figure 1H, middle panel, green line). The shrinkage of $\hat\Pi$ is nevertheless gradual, and no exact zero elements are created.
The rank penalty, the nuclear penalty and the adaptive nuclear penalty shrink $\hat\Pi$ in a similar manner (Figure 1B, E, F). The shrinkage of the singular values does not push any of the elements of $\hat\Pi$ to exact zeros, and the elements are shrunk approximately evenly. The rank penalty gives a step-wise shrinkage of the singular values, meaning that a singular value is either kept intact or shrunk to 0 as the tuning parameter becomes larger. The regularisation path thus shows 'break points' whenever some singular value drops out. The nuclear penalty provides a similar, but smoother, regularisation path. The adaptive nuclear penalty yields a more even shrinkage of $\hat\Pi$ than the rank penalty and the nuclear penalty.
The methods are compared using the following error measures:

• Relative shrinkage of $\hat\Pi$: $\mathrm{Shrink}(\hat\Pi) = \|\hat\Pi\|_F / \|\Pi\|_F$.
• Distance between the true and estimated cointegration rank: $d(\hat r, r) = |\mathrm{rank}(\hat\Pi) - r|$.
• Proportion of zeros in $\hat\Pi$: $p_0 = |\{(i,j): \hat\Pi_{ij} = 0\}| / p^2$.
• Sensitivity (true positive rate): the proportion of truly nonzero elements of $\Pi$ estimated as nonzero.
• Specificity (true negative rate): the proportion of truly zero elements of $\Pi$ estimated as zero.

The OLS and the Johansen method are always fastest with respect to computing time. The fastest penalisation methods are the rank penalty and the adaptive nuclear penalty, which use similar algorithms based on hard- and soft-thresholding of singular values. The most computationally heavy methods are the penalisation of the QR decomposition, the nuclear penalty and the penalisation of the whole SVD decomposition.
Altogether, the best performing methods appear to be: lasso with fixed rank for small N and low p, lasso on SVD for small N and high p, and the penalisation of α for large N.

Discussion
In this paper, we have reviewed regularisation methods for estimation in cointegrated VEC models proposed in the literature, highlighting the pros and cons of the different approaches, and evaluating them through a simulation study.

The simulation study confirmed that the OLS method leads to one of the largest prediction errors in all investigated setups where it is feasible. On the contrary, lasso on SVD and lasso with a fixed rank give the best fit in the most challenging setup with high p and small N, even if the rank was underestimated. This reveals that rank reduction, possibly with some additional regularisation, is most beneficial for fitting a cointegrated VEC model, and that the reduced rank induced by cointegration should always be taken into account.

The best method for high p and N > p is the penalisation of α. The good performance may be due to several reasons. First, the method always starts with a reduced rank, which is a form of pre-regularisation. If the rank is estimated well (such as by the well-established RSC method), it has an advantage over the nuclear penalty, lasso on SVD or the penalisation of the QR decomposition, because those methods have to identify the rank as part of the procedure, often leading to overestimation of the rank. Second, the penalisation of α is most similar to the Johansen procedure and thus inherits some of its beneficial properties, namely the unbiased estimation of the cointegration space spanned by β. In contrast, the algorithm used in the penalisation of β updates the estimates of β and α conditioned on each other in an iterative process, which makes both of them biased. Similarly, the penalisation of the whole SVD structure distorts both the left and the right singular vectors. The penalisation of the QR decomposition also distorts the estimated cointegration space, because Q, which is the counterpart of β, does not satisfy the normalisation condition (35). This means that subsets of columns of Q span different subspaces than subsets of the normalised eigenvectors in $\hat\beta$ estimated with the Johansen procedure. Third, we did not experience any issues with convergence, due to the simplicity of the algorithm.
If the aim is to identify the zeros in Π, most of the methods in this review are not suitable. The only options are the standard penalisation of Π that does not account for the reduced rank, and the penalisations imposed on the QR decomposition or on the whole SVD structure of Π.

Most methods were originally designed for the reduced-rank regression problem, a larger class of models than cointegrated VEC models. Many theoretical properties of the estimators were derived under the assumption that X has finite moments. Unfortunately, the moment conditions are not satisfied if $\{x_n\}$ is an I(1) process. It is not clear how much the violation of this assumption affects the properties of these estimators.
We presented most of the methods with only one tuning parameter λ. However, it is often straightforward to generalise the methods by introducing multiple tuning parameters, for example, specific values for columns or rows of the matrix under estimation. The algorithms for the unrestricted penalisation of Π, the penalisation of α and the penalisation of β contain a step where a penalised multivariate regression model is fitted. This problem can be split into univariate regressions, each with a different tuning parameter. If the variables are on the same scale or have been standardised, it is most natural to use the same tuning parameter. The penalisation of β is, however, performed with a transformed response matrix $Y\hat\Sigma^{-1/2}\hat\alpha^*$, and the penalisation of α with a transformed predictor matrix $Z\hat\beta$, where the same scale is not guaranteed, so multiple tuning parameters can be reasonable.

We opted for one tuning parameter in the simulation study to make the comparison of the methods as fair as possible. The only method where we used multiple tuning parameters is the penalisation of the SVD. Note that this method is usually one of the worst performing methods despite the multiple tuning parameters.
The presented methods can be extended to fit a VAR model of order higher than 1, in terms of the number of lags in the model. The additional lags are always accompanied by the full-rank matrices $\Gamma_1, \ldots, \Gamma_{k-1}$, so a possible strategy is to penalise them with one of the common types of penalties, such as the lasso (used in Wilms & Croux, 2016), the weighted lasso (used in Liang & Schienle, 2019; Chen & Schienle, 2022) or the elastic net. However, a VAR model with order of integration larger than 1 requires a more complex treatment.
Finally, we hope this review is useful for deciding which penalisation method to use in a specific application according to the purpose of the analysis.


FIGURE 1. Illustration of the shrinkage of $\hat\Pi$ and other relevant parameters by the penalisation methods. A-G: Regularisation paths of $\hat\Pi$ and the other parameters being shrunk in the respective estimation methods, as functions of $\|\hat\Pi\|_F$. The grey lines show the relationship between individual elements of the estimated parameters and the amount of shrinkage corresponding to $\|\hat\Pi\|_F$. The dotted lines mark the zero level corresponding to the maximum possible shrinkage. H: The proportion of zeros $p_0$ in $\hat\Pi$ as a function of $\|\hat\Pi\|_F^2$ (left), $\mathrm{rank}(\hat\Pi)$ as a function of $\|\hat\Pi\|_F^2$ (middle), and the norms $\|\hat\Pi\|_F$ resulting from the sequences of tuning parameters used for each method (right). The coloured lines indicate the different methods as given in the legend. The dotted lines indicate the norm, the proportion of zeros $p_0$ and the rank of the true $\Pi$.

FIGURE 7. Distribution of the true positives for each estimation method. Boxplots are based on estimates obtained with $r^{(0)} = \hat r_{RSC}$, or with $r^{(0)} = p$ for the methods not requiring a pre-estimate of the rank.

Table 1. Overview of the estimation methods. Some of the comments in the last column of pros and cons refer to results from the simulation study conducted in Section 5.

Table 2. Simulation results for p = 50, N = 50 (low dimension, small sample size). The results are averages over 100 simulation runs. See the main text for the definition of estimation methods and error measures. Bold numbers indicate the best results within a column, after excluding those where the true cointegration rank was used.

Table 3. Simulation results for p = 50, N = 150 (low dimension, large sample size). The results are averages over 100 simulation runs. See the main text for the definition of estimation methods and error measures. Bold numbers indicate the best results within a column, after excluding those where the true cointegration rank was used.

FIGURE 2. Average error measures (angle $\Theta(\hat\Pi, \Pi)$, mean squared prediction error $\mathrm{MSPE}(\hat\Pi)$ and estimated $\mathrm{rank}(\hat\Pi)$) as functions of the true rank, calculated on 100 simulated data sets. Coloured lines represent different methods. Averages are based on the estimates obtained with the rank estimated by $\hat r_{RSC}$ for those methods that require a pre-estimate of the rank. The ranks estimated by the Johansen method, elastic net followed by SVD, and lasso with a fixed rank are identical to the initial value $r^{(0)} = \hat r_{RSC}$ and overlap.

FIGURE 5. Distribution of the relative deviance $\mathrm{RD}(\hat\Pi)$ for each estimation method. Boxplots are based on estimates obtained with $r^{(0)} = \hat r_{RSC}$, or with $r^{(0)} = p$ for the methods not requiring a pre-estimate of the rank.

FIGURE 6. Distribution of $\Theta(\hat\Pi)$ for each estimation method. Boxplots are based on estimates obtained with $r^{(0)} = \hat r_{RSC}$, or with $r^{(0)} = p$ for the methods not requiring a pre-estimate of the rank.