Prediction of Singular VARs and an Application to Generalized Dynamic Factor Models

Vector autoregressive processes (VARs) whose innovations have a singular covariance matrix (in short, singular VARs) appear naturally in the context of dynamic factor models. Estimating such a VAR is problematic because the solution of the corresponding equation systems is numerically unstable. For example, if we overestimate the order of the VAR, then the singularity of the innovations renders the Yule-Walker equation system singular as well. We are going to show that this has a severe impact on the accuracy of predictions. While the asymptotic rate of the mean square prediction error is not affected by this problem, the finite sample behaviour suffers severely. This effect is reinforced if the predictor variables do not come from the stationary distribution of the process but contain additional noise. Again, this happens to be the case in the context of dynamic factor models. We explain the reason for this phenomenon and show how to overcome the problem. Our numerical results underline that it is very important to adapt prediction algorithms accordingly.


INTRODUCTION
The principal motivation of our article originates from an application and numerical implementation of a recent method by Forni et al. (2015), here denoted by FHLZ, for estimation and forecasting of generalized dynamic factor models (GDFMs). We noticed that predictions can be improved when we slightly adapt the original forecasting algorithm. In essence, our observation was that a regularization step in the estimation part is very important. In this article, we are going to see that this is due to an 'explosive' combination of two features of FHLZ: (a) the method relies crucially on predicting singular VAR systems, that is, vector autoregressive processes with a singular noise covariance; (b) these singular VAR systems are loaded with noisy predictor variables. We cannot directly plug in lagged values of the process to forecast, since the lagged values are latent and need to be estimated. This perturbation, even if it is very small and asymptotically negligible, has a huge impact on the forecasts, especially in combination with (a).
While improving FHLZ was the initial motivation for our research, we believe that our theory has some more general interest in time series forecasting. Hence the main theme of this article is not restricted to predicting or modelling GDFMs. Rather, the primary contribution of this article is an analysis of how noisy predictors impact the quality of VAR forecasts (see Section 2). Using these findings we will then, as an important application, improve the FHLZ algorithm. To fix ideas, let us consider a generic d-dimensional VAR(1) setting, where Y_{t+1} = Θ Y_t + Z_{t+1}. Here (Z_t) is some white noise sequence, potentially with singular covariance. Let Θ̂ be some estimator (e.g. the OLS estimator) for the coefficient matrix Θ. Then we are interested in the prediction error which, on the one hand, is caused by estimation of Θ and which, on the other hand, is due to an additional perturbation ε ∈ R^d of the predictor variable Y_t:

Θ̂(Y_t + ε) − Θ Y_t.   (1.1)

Notice that ε denotes some generic noise which might also depend on t. Possible time dependence is not important for our theory, so it will typically be ignored. When ε = 0, the difference in (1.1) corresponds to the difference between the 'usual' prediction and the optimal (but infeasible) prediction Θ Y_t. In this article, we focus on one-step-ahead forecasting, this setting being easily generalized. As we indicated above, noisy predictors are inherent in the GDFM framework (see Section 3.2), and even very small noise can have a dramatic impact on the quality of the forecasts. However, noisy predictors are also not uncommon in other contexts, and to the best of our knowledge this problem has not been explored in the literature. We give an illustrative example of data that we have been working with in Appendix C.
The rest of the article is structured as follows: in Section 2 we outline the theory for singular VARs. In Section 3 some important background on GDFMs is provided. Moreover, we explain the FHLZ prediction method. At the end of this section we point out how the theory from Section 2 is related to this method. Then, in Section 4, we present some numerical exercises which illustrate the dramatic improvement our approach can yield. Proofs, details of the implementation and several tables are provided in the Appendix.

SINGULAR VAR PROCESSES
We review some basics of VARs and explain how singularity of the noise impacts their estimation and prediction.

The Yule-Walker Equations
Let us consider some generic d-dimensional VAR(p) process of the form

Y_t = Θ_1 Y_{t−1} + ⋯ + Θ_p Y_{t−p} + Z_t,   (2.1)

where Z_t ∼ WN(0, Ω) (white noise). Denote by Γ(h) the lag-h autocovariance matrix. We consider the case where the autoregression is singular, hence where |Ω| = 0. Chen et al. (2011) have studied stationary solutions for this class of processes and have shown a number of interesting and subtle properties of the solutions of the related Yule-Walker equations. Let us briefly recall some basic facts related to the Yule-Walker equations, which also serve to fix notation. The equations are obtained by multiplying (2.1) by Y′_{t+1−h}, h = 0, …, p, and taking expectations. Defining Y_{t,p} := (Y′_t, …, Y′_{t−p+1})′ and Γ_p := E[Y_{t,p} Y′_{t,p}], we can summarize them as

(Γ(1), …, Γ(p)) = (Θ_1, …, Θ_p) Γ_p   and   Γ(0) = Σ_{i=1}^p Θ_i Γ(i)′ + Ω.

Hence, as opposed to the case when |Ω| > 0, we obtain that for singular VAR processes Γ_{p+1} is singular. Consequently, if we overfit the model, that is, if we select an order that is larger than the actual order of the VAR, we will run into a singularity problem when computing the empirical version of Γ^{−1}_{p+1}. The latter is required when we do estimation using least squares or the Yule-Walker estimator.
The matrix Γ_p may or may not be singular. We show below that Γ_p is close to singular if ‖Θ_p‖ is small, and we give an example where Γ_p is singular even when ‖Θ_p‖ is not small. To examine this, let us explore the condition number

C(M) := λ_max(M) / λ_min(M),

where λ_max(M) and λ_min(M) denote the largest and smallest eigenvalue, respectively, of some square matrix M. A large condition number indicates that we are close to an unidentifiable VAR system, and the resulting Yule-Walker equations will be numerically unstable: small approximation errors in the equation system may have a severe impact on the accuracy of the solution.
Proposition 2.1. Let ‖M‖ be the spectral radius of some square matrix M, that is, the largest modulus of its eigenvalues.

Note that Γ(0) implicitly depends on Ω and the Θ_i. Hence the second of the two lower bounds in the proposition is better suited to relate the condition number to the model parameters. In particular, if ‖Θ_p‖ = ε and we let ε → 0, then C(Γ_p) diverges at rate ε^{−2}.
The lower bounds in Proposition 2.1 are very general and do not impose a structure on the VAR process. It is clear that with diverse structural assumptions we can obtain sharper bounds. We do not pursue this direction much further, but only consider one special case which demonstrates that a large condition number is not necessarily tied to a small norm of Θ_p. To this end, consider a model in which Θ_i = θ_i I_d for scalars θ_i. It is easy to see that then Γ(0) is proportional to Ω. From (A1.1) it follows immediately that λ_min(Γ_p) = 0.
In the following small numerical experiment we demonstrate that C(Γ_p) typically depends dramatically on Ω. Our framework consists of 1000 Monte Carlo experiments for each combination of (p, d) with p = 3, 5 and d = 3, 5. The outcome is that a perturbation of the noise covariance Ω which renders this matrix degenerate leads to a huge increase of the condition number of Γ_p.
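The effect can be reproduced with a short numerical sketch (our own illustration, not the exact Monte Carlo design of this section): Γ_p is the stationary covariance of the companion-form state, which solves the discrete Lyapunov equation Γ = ΘΓΘ′ + Q, and we compare its condition number under a full-rank and a rank-deficient Ω. All function names and parameter choices below are ours.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def companion(thetas):
    """Stack VAR(p) coefficient matrices into the dp x dp companion matrix."""
    p, d = len(thetas), thetas[0].shape[0]
    A = np.zeros((d * p, d * p))
    A[:d, :] = np.hstack(thetas)
    A[d:, :d * (p - 1)] = np.eye(d * (p - 1))
    return A

def cond_gamma_p(thetas, omega):
    """Condition number of Gamma_p via the stationary state covariance,
    which solves Gamma = A Gamma A' + Q (discrete Lyapunov equation)."""
    d = thetas[0].shape[0]
    A = companion(thetas)
    Q = np.zeros_like(A)
    Q[:d, :d] = omega
    eig = np.linalg.eigvalsh(solve_discrete_lyapunov(A, Q))
    eig = np.clip(eig, 1e-15, None)   # guard against tiny negative FP values
    return eig.max() / eig.min()

rng = np.random.default_rng(1)
d, p, rho_target = 3, 3, 0.5
thetas = [rng.uniform(-1, 1, (d, d)) for _ in range(p)]
rho = max(abs(np.linalg.eigvals(companion(thetas))))
# rescale lag-j coefficient by s^j so the companion spectral radius becomes 0.5
thetas = [(rho_target / rho) ** (j + 1) * th for j, th in enumerate(thetas)]

omega_full = np.eye(d)                  # full-rank innovation covariance
B = rng.standard_normal((d, d - 1))
omega_sing = B @ B.T                    # rank d-1, i.e. degenerate

print(cond_gamma_p(thetas, omega_full), cond_gamma_p(thetas, omega_sing))
```

In repeated draws the condition number under the degenerate Ω is typically orders of magnitude larger than under the full-rank one.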
More theoretical results about singular VAR systems can be found in Anderson et al. (2012, 2016) and references therein.

Estimation and Prediction
In the sequel we fix p and suppress it in the notation. For example, we write Γ and Y_t instead of Γ_p and Y_{t,p}. We denote by A ⊗ B the Kronecker product of two matrices and let 𝒩_m(μ, Σ) be an m-dimensional normal random vector with mean μ and covariance matrix Σ. We define Y*_{t+1} as the optimal but infeasible prediction, that is, Y*_{t+1} = (Θ_1, …, Θ_p) Y_t. We assume that Γ and its estimator are of full rank. However, we focus on a situation in which C(Γ) is large.
The derivations of the previous section indicate that |Ω| = 0 will have an impact on the estimation of the VAR parameters. The matrix Γ, and thus also Γ̂, might be close to singular, and the resulting estimator (Θ̂_1, …, Θ̂_p) is likely to be of poor quality. This does not necessarily imply, however, that the prediction is poor, too. Define

P_t(ε) := (Θ̂_1, …, Θ̂_p)(Y_t + ε) − Y*_{t+1}.   (2.5)

Then P_t(0) is the difference between our actual prediction and the optimal but infeasible prediction. The quantity P_t(ε) takes into account that our predictor variable is contaminated with some noise ε, and we are interested to see how this noisy variable impacts the forecast.
As we will see in Proposition 2.2, |Ω| = 0 is not relevant for the order of magnitude of ‖P_t(0)‖. However, it can be very relevant for ‖P_t(ε)‖, even if ‖ε‖ is comparably small.
Let us first rewrite our VAR model in state-space form. To this end define the m × m companion matrix Θ (m = dp), whose first d rows are (Θ_1, …, Θ_p) and whose remaining rows are (I_{d(p−1)}, 0), and set Z_t := (Z′_t, 0′)′ ∈ R^m. To simplify the presentation below, we assume that we can base our estimators on observations Y_1, …, Y_T. Relating this to the index set for the Y_t's, it means that we work with Y_{2−p}, …, Y_T. Then we have

Y_{t+1} = Θ Y_t + Z_{t+1},   (2.6)

and thus, multiplying (2.6) by Y′_t and taking expectations on both sides, we get

Δ = Θ Γ,  with  Δ := E[Y_{t+1} Y′_t],   (2.7)

which is the state-space form of the Yule-Walker equations. On the other hand, multiplying (2.6) by Y′_t, summing over t ∈ {1, …, T − 1} and scaling by T, we obtain

Δ̂ = Θ Γ̂ + R,  with  Δ̂ := T^{−1} Σ_{t=1}^{T−1} Y_{t+1} Y′_t,  Γ̂ := T^{−1} Σ_{t=1}^{T−1} Y_t Y′_t  and  R := T^{−1} Σ_{t=1}^{T−1} Z_{t+1} Y′_t.   (2.8)

This gives rise to the least squares estimator of Θ, namely

Θ̂ := Δ̂ Γ̂^{−1},   (2.9)

and accordingly (Θ̂_1, …, Θ̂_p) := Δ̂_d Γ̂^{−1}, where Δ̂_d := Π_d Δ̂, with Π_d = (I_d, 0, …, 0) ∈ R^{d×m} and m = dp. Combining (2.8) and (2.9) leads to Θ̂ − Θ = R Γ̂^{−1}. We remark that the corresponding Yule-Walker estimator from the state-space representation is very similar; there the summation index in Γ̂ runs from 1 to T. It follows that

P_t(0) = Π_d (Θ̂ − Θ) Y_t = Π_d R Γ̂^{−1} Y_t.

If our predictor Y_t is perturbed with noise ε we have

P_t(ε) = Π_d (Θ̂ − Θ) Y_t + Π_d Θ̂ ε.

The term H(ε) := Π_d Θ̂ ε represents the additional prediction error due to a noisy predictor. We further split this term into the parts H_1(ε) := Π_d Θ ε and H_2(ε) := Π_d (Θ̂ − Θ) ε, so that

P_t(ε) = P_t(0) + H_1(ε) + H_2(ε).   (2.10)

In Proposition 2.2 we investigate the order of magnitude of the terms P_t(0) and H_i(ε), i = 1, 2.
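The decomposition (2.10) can be checked numerically in a minimal VAR(1) sketch, where p = 1 so that no stacking is needed and Π_d is the identity; all variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 500
Theta = 0.5 * np.eye(d) + 0.1 * rng.standard_normal((d, d))
# simulate a stable VAR(1) with full-rank standard normal innovations
Y = np.zeros((T, d))
for t in range(1, T):
    Y[t] = Theta @ Y[t - 1] + rng.standard_normal(d)

# least squares estimator Theta_hat = Delta_hat Gamma_hat^{-1}, as in (2.9)
Gamma_hat = Y[:-1].T @ Y[:-1] / T
Delta_hat = Y[1:].T @ Y[:-1] / T
Theta_hat = Delta_hat @ np.linalg.inv(Gamma_hat)

eps = 0.01 * rng.standard_normal(d)          # perturbation of the predictor
y = Y[-1]
P_eps = Theta_hat @ (y + eps) - Theta @ y    # total prediction discrepancy
P_0 = (Theta_hat - Theta) @ y                # estimation-error part
H_1 = Theta @ eps                            # noise through the true parameter
H_2 = (Theta_hat - Theta) @ eps              # noise through the estimation error
print(np.allclose(P_eps, P_0 + H_1 + H_2))   # prints True: decomposition (2.10)
```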
Proposition 2.2. Assume that ε = (ε′_1, …, ε′_p)′ and ε_j = (ε_{j1}, …, ε_{jd})′, with the ε_{jk} being i.i.d. with zero mean and variance σ². Moreover, we assume that ε is independent of the innovations (Z_j) and hence of the VAR process. We assume that the (Z_j) are i.i.d. and that |Γ| > 0. It holds that

E‖P_t(ε)‖² = τ²_0 + τ²_1(ε) + τ²_2(ε).

Here the variables N_{k,T} appearing in the expansions of these terms are asymptotically normal with covariance matrix Ω.
We would like to stress that the error terms in this proposition are exact, so we can also use these results to assess the finite sample behaviour. We can interpret the result as follows:

1. The term τ²_0 is the usual mean square prediction error when no noise contaminates the predictor variables. Due to the limiting distribution of the N_{k,T}, the ‖N_{k,T}‖² are asymptotically i.i.d. and follow a weighted χ² distribution, with weights given by the eigenvalues of Ω. This means that small eigenvalues are not relevant for the distribution of ‖N_{k,T}‖²; the distribution essentially depends only on the large eigenvalues of Ω.

2. The term τ²_1(ε) is a bias term which comes from the noisy predictor variable. It is proportional to σ² and it is not influenced by Ω.

3. The term τ²_2(ε) measures the error when predicting outside of the stationary distribution. It is proportional to σ²/T, a factor which obviously is small when the sample size is large or when σ² is small. However, it crucially depends on the condition number C(Γ) via the reciprocal eigenvalues of Γ̂. If |Ω| = 0 these eigenvalues can be very small, as we have seen, and may dramatically inflate τ²_2(ε) in finite samples. This will become impressively visible in our empirical part.

A Regularization Approach
In view of the previous results we suggest using a regularized estimator Θ̂^{(K)}, which is defined as

Θ̂^{(K)} := Δ̂ Γ̂^{−1}_K,  where  Γ̂^{−1}_K := Σ_{k=1}^K λ̂_k^{−1} v̂_k v̂′_k

and (λ̂_k, v̂_k) are the eigenvalue-eigenvector pairs of Γ̂, with eigenvalues in decreasing order. Simply speaking, eigenvectors of Γ̂ belonging to small eigenvalues are removed from our estimator. If we denote by 𝒫_K := Σ_{k=1}^K v̂_k v̂′_k the projection onto the span of the first K eigenvectors and 𝒫̄_K := I_m − 𝒫_K, then we obtain by the results of the previous section that

P_{t|K}(ε) = P_{t|K}(0) + H_{1|K}(ε) + H_{2|K}(ε) + H_{3t|K}.

The term H_{3t|K} is the additional bias that is introduced by the regularization. The meaning of the other error terms is the same as in the previous section. This bias term is solely related to the predictor without noise.
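A minimal sketch of the spectral trimming (our own illustration; `trimmed_estimator` and the variable names are not from the article): Γ̂ is inverted only on the span of its K leading eigenvectors. For K = m the estimator coincides with the ordinary LS estimator Δ̂Γ̂^{−1}.

```python
import numpy as np

def trimmed_estimator(Delta_hat, Gamma_hat, K):
    """Regularized LS estimator: invert Gamma_hat only on the span of its
    K leading eigenvectors, discarding directions with small eigenvalues."""
    lam, V = np.linalg.eigh(Gamma_hat)
    order = np.argsort(lam)[::-1]              # eigenvalues in decreasing order
    lam, V = lam[order], V[:, order]
    VK = V[:, :K]
    Gamma_inv_K = VK @ np.diag(1.0 / lam[:K]) @ VK.T
    return Delta_hat @ Gamma_inv_K

rng = np.random.default_rng(2)
m, d = 6, 2
A = rng.standard_normal((m, m))
Gamma = A @ A.T + 0.1 * np.eye(m)              # a positive definite stand-in
Delta = rng.standard_normal((d, m))
# with K = m no direction is trimmed, so we recover the plain LS estimator
print(np.allclose(trimmed_estimator(Delta, Gamma, m),
                  Delta @ np.linalg.inv(Gamma)))  # prints True
```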
Proposition 2.3. Assume the same setting and notation as in Proposition 2.2. It holds that

E‖P_{t|K}(ε)‖² = τ²_{0|K} + τ²_{1|K}(ε) + τ²_{2|K}(ε) + τ²_{3|K}.

We see that the terms τ²_{0|K} and τ²_{1|K}(ε) are of a similar order of magnitude as τ²_0 and τ²_1(ε), respectively, if K is close to m. On the contrary, a large K implies that the reciprocals of small eigenvalues are included in the expansion of τ²_{2|K}(ε), to which this term is very sensitive. The terms τ²_{0|K}, τ²_{1|K}(ε) and τ²_{2|K}(ε) grow with K, while τ²_{3|K} decays. The tuning of K is thus a bias-variance trade-off problem. In this article, we consider two methods for choosing K:

The TRIM approach: Select K by minimizing the resulting mean square prediction error ρ(K). In practice ρ(K) is unknown and requires an empirical counterpart, so we compute estimators of the ρ(K)s. Details of how we tackle this in our implementation are given in Appendix B.
The CV approach: Select K by cross-validation. Again the implementation details are given in Appendix B.
The numerical performance of both approaches and a comparison with the usual OLS estimator are provided in Section 4. From a theoretical point of view, it would be interesting to give bounds on ρ(m) − ρ(K). (Note that ρ(m) = τ²_0 + τ²_1(ε) + τ²_2(ε).) Since we are interested in the non-asymptotic case, this appears to be a very difficult challenge which is beyond the scope of this article.
We conclude this section by noting that other regularization methods could be used as well. For example, in our numerical exercises we have also worked with a ridge penalty but found our approach to be favourable. We have not explored other regularization methods, and leave this open for further research.

PREDICTION OF GDFMS
The literature on dynamic factor models has been expanding at a very rapid pace over recent decades, mostly due to the success of such models in predicting macro-economic variables within a large cross-sectional panel dataset and a relatively small time range (cf. Stock and Watson (2011) for a review). To this day, dynamic factor models and so-called GDFMs receive a lot of attention due to their wide range of applicability and their empirical success (see e.g. Forni et al. (2018)).
In a GDFM the observation vectors X_t (which usually are very high dimensional) can be decomposed into an unobserved common component χ_t, capturing most of the serial and cross-sectional dependence, and an unobserved idiosyncratic component ξ_t, which is weakly correlated in time and cross-section. We then have

X_t = χ_t + ξ_t.   (3.1)

If a generalized dynamic q-factor model holds, then χ_t is driven by q serially orthogonal white noise sequences. We refer to Barigozzi (2019) for a comprehensive exposition of dynamic factor models. We recall key features of the GDFM (3.1) and the way the FHLZ prediction algorithm works. We do not present all technicalities here, but refer to the corresponding articles Forni et al. (2015, 2017). In Section 3.1 we explain the FHLZ estimation framework, while in Section 3.2 we outline the prediction step. We show how the findings in Section 2 relate to forecasting GDFMs and how they can be used to improve the corresponding prediction method.

GDFM Setup and FHLZ Estimation Procedure
Let X^∞_t = (X_1t, X_2t, …)′, t ∈ Z, and let us assume that our observations X_t = (X_1t, X_2t, …, X_Nt)′ are the first N components of it. For a lighter notation we suppress the cross-sectional dimension in our variables and parameters. For any N, our observations are assumed to form a second-order stationary process. Moreover, all processes appearing in the sequel are assumed to have zero mean. We let Γ_X(h) = E[X_{t+h} X′_t] be the autocovariance of (X_t) at lag h and let Γ̂_X(h) be its sample version. We denote by 𝒮_X(θ) := (2π)^{−1} Σ_{h∈Z} Γ_X(h) e^{−ihθ} the corresponding spectral density matrix at frequency θ ∈ [−π, π]. The series defining 𝒮_X(θ) is assumed to converge for almost all θ. The eigenvalues of 𝒮_X(θ) are denoted λ_X(j|θ), j ≥ 1, and are indexed in decreasing order. The analogous notation is used for the common component χ and the idiosyncratic component ξ. With the superscript ∞ we indicate corresponding quantities based on X^∞_t; for example, we use Γ^∞(h) or 𝒮^∞(θ). The eigenvalues λ^∞_X(j|θ) are interpreted as the limit of λ_X(j|θ) as N → ∞. In a GDFM it is assumed that the common components are given as

χ_it = b_i1(L) u_1t + ⋯ + b_iq(L) u_qt,   (3.2)

where u_t = (u_1t, …, u_qt)′ is the q-dimensional time series of factors. This vector time series is assumed to be an orthonormal white noise, independent of the idiosyncratic components ξ_s defined in (3.1) at any lead and lag. The b_ij(L) = Σ_{ℓ≥0} b_ijℓ L^ℓ denote the backshift series of factor loadings. The idiosyncratic components are assumed to be, in some sense, weakly dependent in cross-section; roughly speaking, it is assumed that the cross-sectional average of ξ_t converges to zero. Hence, all relevant cross-sectional dependences are contained in the common components. Thus, each idiosyncratic component is forecast independently of the other units via a univariate model, typically an autoregressive model. The final forecast is the sum of the common component forecast and the idiosyncratic forecast (Section 8 of Barigozzi (2019)).
In the theory part of this article and in the simulation study we focus on the forecast of the common component.
A key result in Forni and Lippi (2001) is that, under stationarity conditions, if λ^∞_X(q|θ) = ∞ for almost all θ ∈ [−π, π], while λ^∞_X(q+1|θ) ≤ B < ∞ for almost all θ ∈ [−π, π], then for any N ≥ 1 the process (X_t) can be represented as in (3.1) and (3.2), where the filter coefficients are square summable, that is, Σ_{ℓ≥0} b²_ijℓ < ∞. The idiosyncratic component and the common component are also stationary and their spectral densities exist. All eigenvalues of 𝒮_ξ(θ) remain bounded, while the first q eigenvalues of 𝒮_χ(θ) diverge as N → ∞. This decomposition is unique.
We now turn to the explanation of the FHLZ prediction method, a state-of-the-art prediction method for GDFMs. It requires that the filters in the GDFM are rational:

b_ij(L) = c_ij(L) / d_ij(L),   (3.3)

where c_ij(L) and d_ij(L) are lag polynomials of degree ≤ s₁ and ≤ s₂ respectively. For simplicity of the presentation, we assume that s₁ and s₂ are uniform bounds with respect to the index i. This assumption is milder than the classical assumption that b_ij(L) is a finite lag polynomial. The FHLZ framework encompasses important models that the earlier literature could not consider. For instance, it contains the important special case in which every cross-sectional unit (χ_it : t = 1, …, T) is an autoregressive process. Let i = (i_1, …, i_{q+1}), with 1 ≤ i_1 < i_2 < ⋯ < i_{q+1} ≤ N, and for some vector x ∈ R^N denote x^i = (x_{i_1}, …, x_{i_{q+1}})′. It was then shown in Forni et al. (2015) that, for almost all such parametrizations c_ij and d_ij of the GDFM, any subvector χ^i_t has the following VAR representation:

χ^i_t = A_{i1} χ^i_{t−1} + ⋯ + A_{i p_{i0}} χ^i_{t−p_{i0}} + R_i u_t.   (3.4)

For the sake of simplicity, we will assume an arbitrary but fixed N of the form N = (q + 1)n.
If j → i_j denotes a permutation of {1, …, N}, the above result implies that one can obtain n VAR equations (3.4) by selecting the subsets of indexes {i_1, i_2, …, i_{q+1}}, {i_{q+2}, i_{q+3}, …, i_{2(q+1)}}, …, {i_{(n−1)(q+1)+1}, …, i_{n(q+1)}}. Reassembling these n equations, a VAR representation of the whole vector χ_t can be obtained:

χ_t = A_1 χ_{t−1} + ⋯ + A_p χ_{t−p} + R u_t,

where p denotes the maximum among the true orders of the n VAR equations. Some remarks are in order.

Remark 3.1. The orders p_i0 of these VAR models are unknown and need to be estimated.

Remark 3.2. Let Ω_i be the covariance matrix of the innovations R_i u_t in (3.4). Then apparently Ω_i is of rank q, which means that (χ^i_t) is a singular VAR process. That this feature is not specifically restricted to the above model assumptions is stressed in Anderson et al. (2012): 'In almost all empirical econometric modelling and forecasting exercises using GDFM's, the latent variables are modelled as a singular AR model, … '.
We are now ready to outline the estimation part of the FHLZ algorithm. For details we refer to Forni et al. (2017).

THE FHLZ ESTIMATION ALGORITHM:
1. Estimate the spectral density matrices of the observations (X_t), for example with the Bartlett lag-window estimator

𝒮̂_X(θ) := (2π)^{−1} Σ_{|h| ≤ B_T} (1 − |h|/(B_T + 1)) Γ̂_X(h) e^{−ihθ}.   (3.5)

2. Derive estimators 𝒮̂_χ(θ) and 𝒮̂_ξ(θ). Assumptions on the model imply that a spectral expansion of 𝒮̂_X(θ) along the first q eigenvectors consistently estimates the population spectral density 𝒮_χ(θ), and an expansion on the subsequent eigenvectors consistently estimates 𝒮_ξ(θ).

3. Estimate the autocovariance matrices at lag h of the common component through

Γ̂_χ(h) := ∫_{−π}^{π} 𝒮̂_χ(θ) e^{ihθ} dθ.
In practice the integral is computed by averaging over a set of frequencies.

4. Select corresponding components of this matrix to obtain the autocovariance matrix Γ̂_i(h) of χ^i_t.

5. For the n equations (3.4), select the respective orders p̂_i0 and use the Yule-Walker equations to estimate the coefficients of these VARs. For each of the n (q+1)-dimensional VARs we thus obtain the estimators Â_ij, j = 1, …, p̂_i0. Let p̂ be the maximum of the n estimated orders p̂_i0.

6. Combine the elements of the previous step to obtain estimators Â_1, …, Â_p̂. Define Ê_t := X_t − (Â_1 X_{t−1} + ⋯ + Â_p̂ X_{t−p̂}), t = p̂ + 1, …, T.

7. The residuals Ê_t are estimators of E_t = R u_t + A(L) ξ_t, where A(L) = I − A_1 L − ⋯ − A_p L^p. Now R and u_t are estimated through a regular PCA on the residual process (Ê_t).
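Steps 1-3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: we use a crude frequency grid, the Bartlett lag window, a frequency-wise eigen-split into the leading q directions, and a frequency average for the inverse transform; all function names are ours. By construction the eigen-split is additive, so the estimated autocovariances of the common and idiosyncratic parts sum to that of the observations.

```python
import numpy as np

def bartlett_spectral(X, B):
    """Bartlett lag-window spectral density estimator of the columns of the
    T x N data matrix X, evaluated on a grid of 2B+1 frequencies."""
    T, N = X.shape
    X = X - X.mean(0)
    freqs = np.pi * np.arange(-B, B + 1) / B
    S = np.zeros((len(freqs), N, N), dtype=complex)
    for h in range(-B, B + 1):
        w = 1.0 - abs(h) / (B + 1)                  # Bartlett weights
        if h >= 0:
            G = X[h:].T @ X[:T - h] / T             # Gamma_hat(h)
        else:
            G = (X[-h:].T @ X[:T + h] / T).T        # Gamma_hat(h) = Gamma_hat(-h)'
        S += w * G[None] * np.exp(-1j * h * freqs)[:, None, None]
    return freqs, S / (2 * np.pi)

def split_common_idio(S, q):
    """Frequency-wise eigen-split: the first q eigen-directions estimate the
    spectral density of the common component, the rest the idiosyncratic one."""
    S_chi = np.zeros_like(S)
    for k in range(S.shape[0]):
        lam, V = np.linalg.eigh(S[k])               # ascending eigenvalues
        Vq = V[:, -q:]
        S_chi[k] = Vq @ np.diag(lam[-q:]) @ Vq.conj().T
    return S_chi, S - S_chi

def autocov_from_spectrum(freqs, S, h):
    """Inverse transform by averaging over the frequency grid."""
    vals = S * np.exp(1j * h * freqs)[:, None, None]
    return 2 * np.pi * vals.mean(0).real

rng = np.random.default_rng(0)
T, N, q = 200, 8, 1
X = np.outer(rng.standard_normal(T), rng.uniform(0.5, 1.5, N)) \
    + 0.5 * rng.standard_normal((T, N))             # one strong common factor
freqs, S = bartlett_spectral(X, B=10)
S_chi, S_xi = split_common_idio(S, q)
G_chi0 = autocov_from_spectrum(freqs, S_chi, 0)
```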
Remark 3.3. Note that, since we do not directly observe the common components, the estimator of the VAR parameters in step 5 is not exactly obtained as in Section 2.2. We obtain the second-order moments via the spectral analysis and use those moments for solving the Yule-Walker equations.

FHLZ Predictions and How to Improve Them
With steps 1-7 combined, we obtain estimators Â(L), R̂ and û_t. The next step is the actual prediction step. As mentioned previously, we focus on the forecast of the common component. Let us describe this in detail and explain how it relates to Section 2.
We have already pointed out in Remark 3.2 that the (q+1)-dimensional subvectors χ^i_t define singular VAR processes, which we write in the more compact form A_i(L) χ^i_t = R_i u_t, so that χ^i_t = (A_i(L))^{−1} R_i u_t. If we observed χ^i_{t−h}, h = 0, 1, 2, …, our forecast would naturally be defined as

χ̂^i_{t+1} := Â_{i1} χ^i_t + ⋯ + Â_{i p̂_i0} χ^i_{t−p̂_i0+1}.

Since we do not observe χ^i_t, we use χ̃^i_t = (Â_i(L))^{−1} R̂_i û_t and define instead

χ̂^i_{t+1} := Â_{i1} χ̃^i_t + ⋯ + Â_{i p̂_i0} χ̃^i_{t−p̂_i0+1}.   (3.6)

Computing predictions of each subsystem in this way leads to the forecast χ̂_{t+1} of the entire vector. In the FHLZ algorithm, steps 1-7 are repeated N_rep times, typically 30 times, for different partitions of the cross-section into (q+1)-dimensional subvectors. The resulting predictions for each component are then averaged. A look at (3.6) shows that we find ourselves in the context of predicting singular VARs with noisy predictors: in the notation of Section 2, the error term is given as ε = χ̃^i_t − χ^i_t (and its lags). For the sake of thoroughness, let us point out, however, that in the GDFM context Â_i(L) is not estimated in exactly the same way as we outlined in Section 2. This is, on the one hand, because of Remark 3.3. Second, assuming i.i.d. components for ε is an assumption which may not hold and could be relaxed. Our theory therefore does not capture the entire complexity of the problem; it is rather intended to provide a simple framework capturing its essence.
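In practice the inverse filter (Â_i(L))^{−1} is computed by truncating its power series at a finite number of lags (Section 4 uses 20). A sketch of the underlying recursion, with names of our choosing: if A(L) = I − A_1L − ⋯ − A_pL^p and A(L)^{−1} = Σ_k B_k L^k, then matching coefficients gives B_0 = I and B_k = Σ_{j=1}^{min(k,p)} A_j B_{k−j}.

```python
import numpy as np

def invert_var_filter(A_list, n_lags):
    """Coefficients B_0, ..., B_{n_lags} of A(L)^{-1} for
    A(L) = I - A_1 L - ... - A_p L^p, via B_0 = I and
    B_k = sum_{j=1}^{min(k,p)} A_j B_{k-j}."""
    d = A_list[0].shape[0]
    B = [np.eye(d)]
    for k in range(1, n_lags + 1):
        B.append(sum(A_list[j - 1] @ B[k - j]
                     for j in range(1, min(k, len(A_list)) + 1)))
    return B

# scalar AR(1) sanity check: (1 - 0.7 L)^{-1} = sum_k 0.7^k L^k
B = invert_var_filter([np.array([[0.7]])], 5)
print([round(b[0, 0], 4) for b in B])  # prints [1.0, 0.7, 0.49, 0.343, 0.2401, 0.1681]
```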
The key point of our proposed modification is to alter the estimation step 5 (Yule-Walker estimation of the sub-VAR systems) and use either TRIM or CV. The other steps remain unchanged. The details for these implementation methods are given in Appendix B.

SIMULATIONS
We run two sets of simulations. In the first, we study the impact of spectral trimming on the prediction of VAR processes of different dimension, order, persistence, sample size and rank. In the second, we study the impact of the trimming on the prediction of a dynamic factor model and compare it to the FHLZ prediction. Below we denote by 𝒩(0, 1) a standard normal random variable and by 𝒰[a, b] a continuous uniformly distributed variable on the interval [a, b].

Forecasting VARs
We generate a VAR(p) process (Y_t : 1 ≤ t ≤ T) (p = 1, 2, 3) of dimension d = 3, 4, 5 with sample size T = 120, 240, where the largest eigenvalue of the companion matrix is ν = 0.1, 0.4, 0.8 and the innovation process has rank d − L, 0 ≤ L ≤ 2. Our procedure is as follows: first, we draw pd² coefficients independently from a uniform distribution on [−1, 1]. Let Θ_0i, i = 1, …, p, be the resulting coefficient matrices of dimension d × d. If μ_max is the largest eigenvalue (in modulus) of the corresponding companion matrix, we define the actual coefficient matrices by Θ_i = (ν/μ_max)^i Θ_0i. This parametrization ensures that the largest eigenvalue of the resulting companion matrix is ν; hence, with |ν| < 1 the process is stable. The innovation process (Z_1, …, Z_T) is the product of a d × (d − L) matrix whose coefficients are independently drawn from 𝒰[−1, 1] and a (d − L) × T matrix whose coefficients are independently drawn from a standard normal distribution. The coefficient matrices Θ_1, …, Θ_p and the innovations (Z_t) allow us to generate the VAR process (Y_t). From this process we compute the LS estimators Θ̂_1, …, Θ̂_p. We then generate perturbed observations Ỹ_t by adding noise ε_t to Y_t. The elements of ε_t are drawn independently from a centred normal distribution with standard deviation 0.01 σ̂, where σ̂ is the mean standard deviation across the component processes of the time series Y_t. Hence, the scale of the noise is 1% of the scale of the observations. We separate the sample into a training set and a test set; the training set contains the first [0.8T] observations. For t ∈ T_test := {[0.8T] + 1, [0.8T] + 2, …, T} and for iteration i, we compute the prediction error e_{it1} := ‖P_t(0)‖² of the LS forecast without perturbation, and e_{it2}, which corresponds to ‖P_t(ε)‖². The terms e_{it3} and e_{it4} also correspond to the prediction errors with noisy predictors, but when we use our proposed regularization methods, namely ‖P_{t|K}(ε)‖², with K chosen by TRIM or CV respectively.
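The data generating mechanism described above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def draw_var(d, p, nu, L, T, rng):
    """Draw VAR(p) coefficients with companion spectral radius nu and an
    innovation sequence of rank d - L, following the recipe above."""
    Theta0 = [rng.uniform(-1, 1, (d, d)) for _ in range(p)]
    comp = np.zeros((d * p, d * p))
    comp[:d] = np.hstack(Theta0)
    comp[d:, :d * (p - 1)] = np.eye(d * (p - 1))
    mu_max = max(abs(np.linalg.eigvals(comp)))
    # Theta_i = (nu / mu_max)^i Theta0_i rescales the companion radius to nu
    Theta = [(nu / mu_max) ** (j + 1) * th for j, th in enumerate(Theta0)]
    M = rng.uniform(-1, 1, (d, d - L))        # d x (d-L) mixing matrix
    Z = M @ rng.standard_normal((d - L, T))   # rank-deficient innovations
    return Theta, Z

rng = np.random.default_rng(3)
Theta, Z = draw_var(d=4, p=2, nu=0.4, L=1, T=240, rng=rng)
```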
Based on 1000 replications, we compute for k = 2, 3, 4 the ratios

RE(k) := (Σ_{i,t} e_{itk}) / (Σ_{i,t} e_{it1}).

For simplicity we write RE = RE(2), RE_TRIM = RE(3) and RE_CV = RE(4). Hence, RE represents the inflation of the prediction error due to the perturbation ε, while RE_TRIM and RE_CV additionally take the proposed spectral trimming into account. The values of RE(k) for ν = 0.4 and T = 240 are displayed in Table I; other constellations of ν and T are given in Tables D1-D5 in Appendix D. We notice that on average the regularized predictions are more accurate than the standard method in all cases, and that in some situations we get tremendously more accurate results.
When L = 0 (full-rank noise) the effect of the noisy predictor variables is very small (RE ≈ 1). Still, the forecasts based on the regularized estimators are significantly better, especially for larger values of p.
When L ≠ 0 (singular VAR) a regularization is absolutely necessary; the prediction error can get completely out of control. The bigger L, the more relevant the regularization. A lower rank of the noise covariance implies a larger condition number of Γ_p, which often causes extreme instability of the predictions. For the same reason (see Proposition 2.1), the naive prediction becomes worse the smaller ν and the larger p. While the OLS estimator is the best linear unbiased estimator, it is far from optimal for predictions.
We also observe that CV often leads to considerably better results than TRIM. An advantage of TRIM is its low computational cost compared to CV. Moreover, the method TRIM is useful when we work with GDFMs: here, CV cannot be directly used, as we explain in Appendix B, and requires TRIM as a preliminary step.

Comparison of FHLZ, TRIM and CV in the Context of GDFMs
We now use spectral trimming in the prediction algorithm of a dynamic factor model. Again we call the approaches TRIM and CV, as compared to FHLZ. We simulate from a data generating process in which the common components are obtained by applying filters a_ij (1 + α_ij L)^{−1} to the factors u_jt, with randomly drawn coefficients a_ij and α_ij. We take the u_jt to be i.i.d. standard normal. For the idiosyncratic components, we take ξ_it i.i.d. ∼ 𝒩(0, 1). Every experiment (T, N, q) consists of 500 repetitions of the same procedure with different drawings of the random values (including the parameters). Based on the DGP described above, we make one-step-ahead predictions of the common components χ̂^{(j)}_T with j ∈ {FHLZ, TRIM, CV}. These predictions are solely based on the information provided by X_it, i = 1, …, N, and t = 1, …, T − 1. We then compare these procedures through the relative mean square error between the forecast and the theoretically optimal forecast, which is available in closed form from the filters a_ij (1 + α_ij L)^{−1} and the factors u_tj. The accuracy of the forecasts is measured through this relative mean square error, with j ∈ {FHLZ, TRIM, CV}, at every repetition. We estimate the spectral density of the different components using a lag window of length [T^{1/2}]. We limit the computation of the inverse of Â(L) to 20 lags. The number of factors q is estimated by the criterion IC_2 of Hallin and Liška (2007). The comparison between FHLZ and TRIM is shown in Table II. A comparison of TRIM and CV is shown in Table D6 of Appendix D. We see a huge improvement when using our proposed procedures. Depending on the choice of q, T, N, the median prediction error over the 500 simulations can be as low as 6% of the one produced by FHLZ. One can also observe that CV gives a further improvement compared to TRIM. To disentangle the effects of the trimming and the order selection, we have run all methods with the BIC criterion as proposed in FHLZ. Using a modified version (see Section B.4) further improves the accuracy of all methods; the relative accuracy between them remains mostly unchanged.

CONCLUSION
We have investigated the problem of forecasting singular VARs. A particular and important application is the forecasting of GDFMs via the FHLZ prediction algorithm. The latter is founded on the fact that (q + 1)-dimensional subvectors of a q-factor GDFM have a singular VAR structure.
Proof of Proposition 2.2. By a routine martingale CLT argument we can show the required asymptotic normality for any fixed vectors. The consistency of Γ̂ implies consistency of its eigenvalues λ̂_k and eigenvectors ê_k, and hence with a Slutsky-type argument the result for τ²_0 is readily shown. The term τ²_1(ε) is immediate. Finally, using orthogonality of the ê_k it is easy to obtain the expression for τ²_2(ε); taking expectations finishes the proof. ◽

Proof of Proposition 2.3. The results for τ²_{0|K} and τ²_{2|K}(ε) are obtained exactly as in Proposition 2.2. To obtain τ²_{1|K}(ε) we basically only need to apply the definition. The term τ²_{3|K} follows analogously. ◽

B.1. Choosing K with TRIM
We first assume that σ² is known or can be estimated (see e.g. Section B.3). Our approach consists of the following steps: 1. We compute the LS estimator Θ̂_p; for the GDFM, we use the FHLZ estimator.

B.2. Choosing K with CV
We suggest the following C-fold cross-validation algorithm: 1. Divide the training sample into C subsamples of equal size. 2. For c ∈ {1, …, C}, estimate Θ̂^{(K)}_p on all subsamples except the c-th and compute the resulting squared prediction errors e²_{ckp} on the c-th subsample. If the order of the VAR is unknown, one can compute Θ̂^{(K)}_p for different p and deduce the K and p that minimize Σ_{c=1}^C e²_{ckp}. Using the CV algorithm in the context of a GDFM, we need reference values of the common component to compute prediction errors. For the j-th iteration of the CV algorithm, we use as reference the average of the N_rep TRIM estimators of χ̂^i_t and the j − 1 first estimators obtained via CV. The reference is thus changed at every iteration.
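A sketch of the C-fold procedure for a plain VAR (our own simplified illustration, ignoring the GDFM-specific reference values; function and variable names are ours):

```python
import numpy as np

def choose_K_by_cv(Y, p, C, rng):
    """C-fold cross-validation for the trimming level K of a VAR(p) fit.
    Y is a T x d sample; returns the K minimizing the summed squared
    one-step prediction errors over the held-out folds."""
    T, d = Y.shape
    m = d * p
    # stacked predictors X_t = (Y_t', ..., Y_{t-p+1}')' and targets Y_{t+1}
    X = np.hstack([Y[p - 1 - j:T - 1 - j] for j in range(p)])
    Ynext = Y[p:]
    folds = np.array_split(rng.permutation(len(Ynext)), C)
    errs = np.zeros(m)
    for test_idx in folds:
        train = np.setdiff1d(np.arange(len(Ynext)), test_idx)
        G = X[train].T @ X[train] / len(train)
        D = Ynext[train].T @ X[train] / len(train)
        lam, V = np.linalg.eigh(G)
        lam, V = lam[::-1], V[:, ::-1]           # decreasing eigenvalues
        for K in range(1, m + 1):
            VK = V[:, :K]
            Theta_K = D @ VK @ np.diag(1.0 / lam[:K]) @ VK.T
            resid = Ynext[test_idx] - X[test_idx] @ Theta_K.T
            errs[K - 1] += np.sum(resid ** 2)
    return int(np.argmin(errs)) + 1

# illustration on a small stable VAR(1)
rng = np.random.default_rng(0)
T, d = 200, 2
A = np.array([[0.5, 0.1], [0.0, 0.4]])
Y = np.zeros((T, d))
for t in range(1, T):
    Y[t] = A @ Y[t - 1] + rng.standard_normal(d)
K = choose_K_by_cv(Y, p=1, C=5, rng=rng)
```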

B.3. Estimation of the Noise Variance in a GDFM
Using the TRIM algorithm requires an estimator of the noise variance σ². Our approach to this problem is to obtain a preliminary estimator σ̃² using the FHLZ algorithm, which is then sequentially updated. More precisely, we proceed as follows: compute χ̃^{(i)}_t = (Â^{(i)}(L))^{−1} R̂^{(i)} û_t for t ≤ T and for the n selected sub-processes as it is described in

B.4. A Modified BIC Criterion
FHLZ have suggested using the BIC criterion to choose the order p̂_i0. We propose to adapt this criterion to the singular VAR case. For a regular VAR of dimension q + 1, the criterion is based on the term Σ_{j=1}^{q+1} log λ_ij plus the usual BIC penalty, where λ_ij is the j-th eigenvalue of the covariance matrix of the residuals of (3.4). Imposing that the rank of this covariance matrix is q suggests extending the sum above only from 1 to q. The difference between the two versions can lead to very different results, due to the degeneracy of the logarithm at zero.

C. An Illustrative Data Example

Allen et al. (2012) investigate the difference between measurements of particulate matter PM10 or PM2.5 (the numbers 10 and 2.5 refer to the maximal diameter of the particles in micrometres) made with the Patashnick Tapered Element Oscillating MicroBalance (TEOM) method and with time-integrated samplers (gravimetric methods which are officially approved by the U.S. Environmental Protection Agency). The TEOM method is attractive because it is comparably cheap and provides direct and thus near-continuous measurements of PM. However, during the TEOM measuring process there is a certain loss of semi-volatile particulate mass, and hence the measurements may differ from those of the more expensive gravimetric methods. The differences depend on the aerosol composition of ambient air, which in turn is influenced by temperature, relative humidity, and so on. This means that the data actually recorded by TEOM, let us say Ỹ_t, need to be preprocessed before they can be officially used (e.g. to record exceedances of limit values). Let us denote the processed and corrected data by Y_t. Quite naturally, when fitting a statistical model one will work with the processed data Y_t. If the processing is not instantaneous, however, timely predictions can only be computed with the unprocessed data Ỹ_t. This means that the stationary model related to the process (Y_t) is loaded with the predictor variables Ỹ_t.
This leads to the question of how big the impact of the processing is, that is, how much Θ̂Ỹ_t deviates from Θ̂Y_t. In terms of (1.1) we have ε = Ỹ_t − Y_t. As we show in this article, the impact can be surprisingly large even for very small ε.