Accelerated Estimation of Switching Algorithms: The Cointegrated VAR Model and Other Applications

Restricted versions of the cointegrated vector autoregression are usually estimated using switching algorithms. These algorithms alternate between two sets of variables but can be slow to converge. Acceleration methods are proposed that combine simplicity and effectiveness. These methods also outperform existing proposals in some applications of the expectation–maximization method and parallel factor analysis.


Introduction
Estimation of models with cointegration involves non-linear optimization, except in the basic case. Following Johansen & Juselius (1994), the standard approach is to alternate between sets of coefficients. They called this a 'switching' algorithm and that terminology is now common in this literature. The numerical literature tends to refer to 'alternating variables' algorithms. Subsequently, there have been many examples of switching algorithms in related settings.
The advantage of switching is that each step is easy to implement, and no derivatives are required. Furthermore, the partitioning circumvents the lack of identification that can otherwise occur in these models and which makes it harder to use Newton-type methods. The drawback is that progress is often slow, taking many iterations to converge and occasionally resulting in premature convergence. This paper proposes a modification that accelerates estimation. It amounts to adding a line search, which is both simple and not problem specific. A minimal extra effort results in a substantial speed-up, as well as better quality of convergence.
The focus at first is on the cointegrated vector autoregression (VAR) with I(1) cointegration. This is introduced in Section 2, together with the basic estimation algorithms. Then, Section 3 considers several line search procedures, which are evaluated in Section 4. While we limit ourselves to I(1) models, we note that the proposed line search works in all cointegration models with switching, including our recently developed algorithms for I(2) models (Doornik, 2017).
The expectation-maximization (EM) algorithm has a similar structure to switching in cointegration models and can also be very slow to converge. Acceleration procedures have been proposed for EM, and we compare some of them with our approach in Section 6. The paper finishes with an application to parallel factor models and low-rank matrix approximation.

The I(1) model
The starting point is the VAR with p dependent variables and m ≥ 1 lags:

y_t = A_1 y_{t-1} + ... + A_m y_{t-m} + Φ x_t + ε_t,   ε_t ~ IIN_p[0_p, Ω],   (1)

for the p × 1 vector y_t, t = 1, ..., T, with y_j, j = -m+1, ..., 0, fixed and given; x_t is a k-vector of additional regressors and Ω is a p × p positive definite matrix. This model can be rewritten in equilibrium-correction form without imposing any restrictions as

Δy_t = y_t - y_{t-1} = Π y_{t-1} + Γ_1 Δy_{t-1} + ... + Γ_{m-1} Δy_{t-m+1} + Φ x_t + ε_t.   (2)

The cointegrated VAR (CVAR) restricts the rank of Π to at most r by writing Π = α β_y′, where α and β_y are both p × r matrices. The implicit rank reduction of the I(2) model is ruled out. More generally, we allow for variables that are restricted to lie in the cointegrating space, x_t^R, and those that enter unrestrictedly, x_t^U; these can be deterministic or stochastic. The CVAR is then formulated as

Δy_t = α β′ (y_{t-1}′, x_{t-1}^{R}′)′ + Γ_1 Δy_{t-1} + ... + Γ_{m-1} Δy_{t-m+1} + Φ x_t^U + ε_t = α β′ w_{1t} + Ψ w_{2t} + ε_t,   (3)

where β has extended dimension p_1 × r: β′ = (β_y′, β_c′). A specific case is the VAR with a restricted linear trend, where x_t^R = t and x_t^U = 1, so p_1 = p + 1. This is the specification used below. Gaussian maximum likelihood estimation is via a reduced-rank regression (RRR) after partialling out the unrestricted coefficients Ψ (see, e.g. Juselius, 1990 and Johansen, 1995b). The maximum can be determined by solving an eigenproblem. This is no longer the case when imposing restrictions on the columns of β, which requires iterative maximization instead.
Writing z_{0t} for the residuals from regressing Δy_t on w_{2t}, and z_{1t} for the residuals from regressing w_{1t} on w_{2t}, the concentrated model becomes

z_{0t} = α β′ z_{1t} + ε_t,

with moment matrices S_{ij} = T^{-1} Σ_t z_{it} z_{jt}′, i, j = 0, 1. The concentrated log-likelihood is maximized by β̂ = (v̂_1, ..., v̂_r), where the v̂_i are the eigenvectors corresponding to the eigenvalues λ̂_i, in descending order, of the generalized eigenvalue problem:

|λ S_{11} - S_{10} S_{00}^{-1} S_{01}| = 0.

Then, α̂ = S_{01} β̂. We write RRR(z_{0t}, z_{1t} | z_{2t}) for the RRR of z_{0t} on z_{1t} corrected for z_{2t}. Note that the second moment matrices can be avoided when computing the eigenvalues (Doornik & O'Brien, 2002).
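As an illustration, the RRR step can be sketched in a few lines of NumPy (a minimal sketch via the moment matrices, even though, as noted above, these can be avoided; the function name `rrr` is ours):

```python
import numpy as np

def rrr(z0, z1, r):
    """Reduced-rank regression of z0 (T x p) on z1 (T x p1), rank r.

    Solves the generalized eigenproblem (lambda*S11 - S10 S00^{-1} S01) v = 0
    via a Cholesky transformation, returning (alpha, beta) with beta p1 x r.
    """
    T = z0.shape[0]
    S00 = z0.T @ z0 / T
    S01 = z0.T @ z1 / T
    S11 = z1.T @ z1 / T
    M = S01.T @ np.linalg.solve(S00, S01)      # S10 S00^{-1} S01
    L = np.linalg.cholesky(S11)
    Li = np.linalg.inv(L)
    lam, W = np.linalg.eigh(Li @ M @ Li.T)     # eigenvalues in ascending order
    beta = Li.T @ W[:, ::-1][:, :r]            # eigenvectors of the r largest
    # The normalization beta' S11 beta = I gives alpha = S01 beta directly
    alpha = S01 @ beta
    return alpha, beta
```

With all p_1 eigenvectors retained, the product α̂β̂′ reproduces the unrestricted regression coefficient S_{01}S_{11}^{-1}, which is a convenient check of the implementation.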
Testing the rank involves non-standard distributions consisting of functionals of Brownian motions (see Johansen, 1995b, Juselius, 2006 and the references therein). Convenient approximations are provided by Doornik (1998).
Only the cointegrating space is estimated, because α V^{-1} V β′ = α β′ for any non-singular r × r matrix V. Identification of the cointegrating vectors β is usually guided by suggestions from economic theory about long-run relations. Exact identification imposes no restrictions, but overidentification can be tested with standard χ² inference, provided the rank of Π is kept fixed. Normalization of each column of β may be kept separate from identification: it will be helpful to avoid normalizing on a coefficient that is close to zero. Johansen & Juselius (1994) consider the case where the cointegrating vectors, i.e. the columns of β, are split in two sets, each with a common linear restriction. This is estimated by alternating between the two sets of vectors, keeping the other fixed in turn. The procedure was subsequently generalized by Johansen (1995a) to the case where each column is restricted independently. This is called the 'beta-switching' algorithm and is described in Section 2.3. Boswijk & Doornik (2004) provide an overview of the different types of restrictions that have been studied by Søren Johansen and Katarina Juselius. They also introduce more general specifications of restrictions, which are estimated by alternating between the coefficients θ in α = α(θ) given φ, and φ in β = β(φ) given θ, provided the rank of α(θ)β(φ)′ remains r. The concentrated model becomes

z_{0t} = α(θ) β(φ)′ z_{1t} + ε_t.   (4)

Alpha-beta switching
The restrictions take the form

β = (H_1 φ_1, ..., H_r φ_r),   α = (G_1 θ_1, ..., G_r θ_r),   (7)

where the H_i are known p_1 × m_i matrices and the φ_i vectors of length m_i. Similarly, the G_i are known p × s_i matrices and the θ_i vectors of length s_i; when G_i = I_p, the corresponding vector in α is unrestricted. The H_i need not identify the cointegrating space, but the column rank of α and β must remain r. Now, H = dg(H_1, ..., H_r) is block-diagonal and of dimension rp_1 × Σ m_i, and φ′ = (φ_1′, ..., φ_r′), with G, θ defined analogously. For numerical reasons, we keep the two steps in a generalized least squares (GLS) form that can be estimated by least squares (using a precomputed QR decomposition results in more efficient computations, see the appendix of Doornik, 2017). First, with θ fixed at θ_F, so α_F = α(θ_F) = (G_1 θ_F1, ..., G_r θ_Fr), and Ω fixed at Ω_F, the model is linear in φ after writing vec β = Hφ. This can be estimated by GLS using Ω_F = PP′.
Step 1 of Algorithm 1 updates the parameters, providing new candidate values.
Step 2 evaluates the function at the updated parameters.
Step 3, which simply accepts the candidates as the new values, will be improved by a line search below, resulting in a significant acceleration. Finally, the last step is the convergence decision, based on the relative change in the function value as well as the parameters. The parameter change is based on the long-run coefficients Π, which are always identified. This is a stronger criterion than the change in the log-likelihood alone, and the square root of ε_1 is used as the parameter tolerance. Basing the convergence decision on the change in the objective function only is more likely to result in premature convergence.
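The skeleton of such a switching loop, including the two-part convergence decision, can be sketched generically as follows (a sketch only: `update` and `loglik` stand in for the model-specific steps, and the relative-change measures are illustrative choices, not the paper's exact criterion):

```python
import numpy as np

def switch_estimate(update, loglik, params, eps1=1e-12, max_iter=10000):
    """Generic switching loop: update, evaluate, accept, test convergence.

    Convergence requires a small relative change in BOTH the objective and
    the parameters (tolerance sqrt(eps1) for the latter), which guards
    against premature convergence on the function value alone.
    """
    f = loglik(params)
    k = 0
    for k in range(1, max_iter + 1):
        cand = update(params)            # Step 1: new candidate values
        f_new = loglik(cand)             # Step 2: evaluate the objective
        dpar = np.max(np.abs(cand - params)) / (1.0 + np.max(np.abs(params)))
        df = abs(f_new - f) / (1.0 + abs(f))
        params, f = cand, f_new          # Step 3: accept the candidates
        if df <= eps1 and dpar <= np.sqrt(eps1):
            break
    return params, f, k
```

For instance, plugging in an alternating least-squares update for a rank-one matrix approximation (each half-step a simple regression) drives the loop to the best rank-one fit.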

Beta switching
The β-switching algorithm of Johansen (1995a) fixes all columns of β except one, which is estimated by RRR. One iteration then cycles over all columns, freeing one in turn, and this process is repeated until convergence.
Beta switching can handle restrictions of the form:

H: α = C ϑ,   β = (H_1 φ_1, ..., H_r φ_r),

for a known matrix C (p × s) and coefficients ϑ (s × r), with p > s ≥ r and H_i as in (7). The restriction on α can be removed (given Ω) through multiplication of (4) by C̄ = Ω^{-1} C (C′ Ω^{-1} C)^{-1}. The full system is transformed by (C̄ : C_⊥)′, which splits it into two independent systems:

C̄′ z_{0t} = ϑ β′ z_{1t} + C̄′ ε_t,   (12)

and C_⊥′ z_{0t} = C_⊥′ ε_t. If Ω is known, we can work with (12) instead of the full system. During the iterations, we use the Ω that corresponds to the most recent estimates of α and β. It is also possible to remove C exactly (see, e.g. Juselius, 2006, section 11.1). However, the current approach resulted in faster convergence in our experiments, in addition to being more convenient in our implementation.
For the main part of the algorithm, assume that α is unrestricted and partition β = (H_i φ_i : β_{ni}), keeping β_{ni}, which has r - 1 columns, fixed. Then φ̂_i, together with α̂, can be obtained from RRR(z_{0t}, H_i′ z_{1t} | β_{ni}′ z_{1t}).
Algorithm 2: β-switching algorithm. To start, set k = 0 and choose starting values β^{(1)}, ε_1, and the maximum number of iterations. Normalization has been omitted from this implementation. Very occasionally, it can help to include an intermittent check on the scale, to prevent some coefficients from running away while their product stays unchanged.

Accelerated estimation
Newton and quasi-Newton-type maximization algorithms that iterate over all parameters use a line search to protect against overstepping the maximum in the current upward direction, which would produce a downward step. The line search can also ensure that there is sufficient progress to prove convergence to a stationary point (see, e.g. Nocedal & Wright, 2006, ch. 3). In practice, the added line search results in much better performance, but there is no need to spend much effort on achieving high accuracy in these intermediate stages.
An alternating variables maximization algorithm uses no derivative information, although the steps can generally be shown to be in a non-downward direction. In practice, progress is often slow and occasionally so slow that it results in premature convergence. Lack of orthogonality of the sets of parameters over which the algorithm alternates exacerbates this problem. Because the parameter space is split into separate directions, the actual step tends to be overly conservative. These algorithms can be accelerated by a line search that allows for an expansion (in contrast to the Newton-type line searches, which are usually contracting).
To formalize the line search algorithm, we first write ξ for the parameter vector (θ′ : φ′)′. At the start of iteration k, when only the iteration counter has been incremented, we possess current parameter values ξ^{(k-1)}. One update of the switching algorithm gives candidate values ξ_c^{(k)}. The standard approach would base the line search on the step from the previous actual values:

δ_s^{(k)} = ξ_c^{(k)} - ξ^{(k-1)}.   (14)

The proposal here is to define instead

δ_r^{(k)} = ξ_c^{(k)} - ξ_c^{(k-1)},   (15)

and run the line search over η using

f^{(k)}(η) = f(ξ_c^{(k-1)} + η δ_r^{(k)}).   (16)

This differs from the standard approach (14) because it is based on the previous candidate values. The parameter η is the objective of a scalar maximization. Denoting the approximate solution as η̂, the new parameter values are ξ^{(k)} = ξ_c^{(k-1)} + η̂ δ_r^{(k)}. The principle of the algorithm is visualized in Fig. 1, where triangles denote candidate values and circles actual values. In the left graph, the change is expressed relative to the previous actual value. On the right, it is relative to the previous candidate value, which is the proposed method (15). The start in Fig. 1 (right) is at ξ^{(2)}; the alternating steps move first along the horizontal axis, then along the vertical axis to the next candidate point ξ_c^{(3)}, indicated by the solid triangle in the centre. This takes us to ξ^{(3)}, from which the process repeats.
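One iteration of this candidate-based scheme can be sketched as follows (the {1.2, 2, 4, 8} expansion grid is the one used in Section 4; `update`, `f` and the function name `l1step` are illustrative stand-ins for the switching update and the log-likelihood):

```python
import numpy as np

def l1step(update, f, x, x_prev_cand):
    """One accelerated iteration: search along the direction from the
    PREVIOUS candidate to the new candidate, as in (15), expanding while
    the objective keeps improving (a sketch of the L1Step idea).
    """
    cand = update(x)
    delta = cand - x_prev_cand
    best_x, best_f = cand, f(cand)       # eta = 1 is the plain switching step
    for eta in (1.2, 2.0, 4.0, 8.0):
        trial = x_prev_cand + eta * delta
        f_trial = f(trial)
        if f_trial > best_f:
            best_x, best_f = trial, f_trial
        else:
            break                        # stop expanding once the value drops
    return best_x, best_f, cand          # cand is the next x_prev_cand

# Illustration on a correlated quadratic, where coordinate-wise updates crawl:
Q = np.array([[1.0, 0.9], [0.9, 1.0]])
f = lambda x: -x @ Q @ x
def update(x):
    x = x.copy()
    x[0] = -0.9 * x[1]                   # exact maximization of f over x[0]
    x[1] = -0.9 * x[0]                   # then over x[1]
    return x

x = np.array([1.0, 1.0])
prev_cand = x
for _ in range(60):
    x, fx, prev_cand = l1step(update, f, x, prev_cand)
```

Since the expanded step is only accepted when it improves the objective, the scheme can never do worse than plain switching on this example, and in practice it reaches the maximum in far fewer iterations.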
For the αβ-switching algorithm, the new candidate values θ_c^{(k)} and φ_c^{(k)} are defined in Step 1 of Algorithm 1. The acceleration scheme applies (15) to the stacked parameter vector:

(θ^{(k)}(η)′ : φ^{(k)}(η)′)′ = ξ_c^{(k-1)} + η δ_r^{(k)}.   (17)

The new parameter values are θ^{(k)} = θ^{(k)}(η̂) and φ^{(k)} = φ^{(k)}(η̂), both using the same η̂. Some care is needed if the cointegrating vectors are normalized in an iteration: if the normalization changes, then δ_r will be meaningless.
An unusual aspect of (15) and (17) is that it is defined in terms of the previous candidate parameter values, rather than the previous actual values. This is based on practical experience, and the benefit may come from the fact that it offers some protection against actual steps becoming prematurely too small. We will compare several schemes to estimate the step length of the line search.
A parallel implementation could evaluate f^{(k)}(η_i) simultaneously for all η_i and then choose the best.

Concentrated line search
The loadings α and cointegrating vectors β have different properties in the CVAR, with the latter converging at a faster rate. This suggests a line search that is only over the parameters in the cointegrating vectors, re-evaluating the loadings each time. Now, each function evaluation in the line search requires an additional regression. Algorithm 4: Line search L1Beta. Use δ_r from (17) and define θ as θ(φ(η)) based on (9). Then follow the recipe of line search L1Step.

Least-squares line search
Because the restrictions are linear, we can write α(η) = α_0 + η δ_α and β(η) = β_0 + η δ_β, with residuals e_t(η) = z_{0t} - α(η) β(η)′ z_{1t}. In matrix form,

E(η) = E_0 - η Z_a - η² Z_b,

where all matrices are T × p, Z_a = Z_1 [δ_β α_0′ + β_0 δ_α′], Z_b = Z_1 δ_β δ_α′, and Z_i′ = (z_{i1} ... z_{iT}). The least-squares solution involves minimizing the trace tr Ω^{-1} E(η)′ E(η). Setting the derivative with respect to η to zero leads to a cubic equation:

2 t_{bb} η³ + 3 t_{ab} η² + (t_{aa} - 2 t_{0b}) η - t_{0a} = 0,   t_{xy} = tr(Ω^{-1} Z_x′ Z_y),  t_{0x} = tr(Ω^{-1} E_0′ Z_x),   (18)

which has either one or three real solutions. In the latter case, which we very rarely observed, the real value closest to unity is used. The least-squares line search is Algorithm 5: Line search LLsq. Find η by solving (18) using (θ_c^{(k)}, φ_c^{(k)}) and the change based on candidate values as in (17). Do not accept η if it leads to a worse function value.
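The first-order condition can be solved directly with a polynomial root finder. A minimal sketch, taking Ω = I for simplicity (an assumption of ours; the weighted version only changes the trace terms):

```python
import numpy as np

def llsq_eta(E0, Za, Zb):
    """Least-squares step length for residuals E(eta) = E0 - eta*Za - eta^2*Zb,
    with Omega = I. Minimizing tr E(eta)'E(eta) gives a cubic first-order
    condition; when it has three real roots, take the one closest to unity.
    """
    t0a, t0b = np.sum(E0 * Za), np.sum(E0 * Zb)      # tr(E0'Za), tr(E0'Zb)
    taa, tab, tbb = np.sum(Za * Za), np.sum(Za * Zb), np.sum(Zb * Zb)
    # d/deta tr E'E = 0:  2*tbb*eta^3 + 3*tab*eta^2 + (taa - 2*t0b)*eta - t0a = 0
    roots = np.roots([2.0 * tbb, 3.0 * tab, taa - 2.0 * t0b, -t0a])
    real = roots[np.abs(roots.imag) < 1e-10].real
    return float(real[np.argmin(np.abs(real - 1.0))])
```

Because the objective is a quartic in η with a positive leading coefficient, a single real root of the cubic is the global minimizer.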

Performance of accelerated switching algorithms
To illustrate the impact of the proposed additions on the various switching algorithms, we look at a model for the Danish data based on Juselius (2006, section 4.1.1). The model has five dependent variables: real money, real GDP, the GDP deflator, and two interest rates. There are two lags in the VAR, with an unrestricted constant and a restricted trend for the deterministic terms. The estimation period is 1973(3)-2003(1), so 119 observations, and the cointegrating rank is r = 3. Note that the aim is not to find a good model, or to test valid restrictions, but rather to stress the algorithms. Ox code to replicate all reported results can be found online. Table 1 lists the range of restrictions under which we estimate the models. Restrictions on the columns of α are indicated by an upper-case letter, those on β in lower case. Restriction Aa, for example, refers to the model where α is unrestricted and β = (I_{6;1:3} φ_1, I_{6;1,6} φ_2, I_{6;3:6} φ_3). Matrices that are subsets of the identity matrix are written as, for example, I_{6;1,6} to select the first and last columns of I_6, while I_{6;2:5} keeps columns two to five. Further matrices used in Table 1 are defined there.

[Table 1. Restrictions on models for Danish data with rank 3: restrictions on α = (G_1 θ_1, G_2 θ_2, G_3 θ_3) and on β, with the G_i given in the table.]

Table 2 shows the number of iterations required by the two switching algorithms using ε_1 = 10^{-12} when estimating the model under a range of restrictions. Note that, in our implementation, the number of iterations equals the number of parameter updates. In each case, the unrestricted estimates for r = 3 are used as starting values. A better starting-value routine should be used in practice; however, here we focus on algorithm performance and wish to keep the iteration counts comparable. When there is no line search, the algorithm is occasionally slow to reach the specified precision ε_1. Much faster convergence is achieved with the line searches; one case shows an almost 200-fold reduction in the iteration count.
For αβ switching, we have also implemented the least-squares line search. However, LLsq does not offer any advantage here over L1Step, while requiring more effort to derive and implement.
L1Beta involves a line search over the coefficients in β only; otherwise it behaves as L1Step. Beta switching estimates the model in terms of β only, so there we only use L1Beta. This is not the case for αβ switching, but there is no clear advantage of L1Beta over L1Step: for restrictions Bb, it requires 206 iterations where L1Step uses 74.

A more comprehensive comparison
A Monte Carlo experiment is used to show the impact of the line search in more detail for three choices of restrictions. Data are generated from (2) without further lags. The data generation process (DGP) parameters α, β′ = (β_y′, β_c′) and Ω are the estimates of the I(1) model for the Danish data with 2 lags and rank 3 (but no further restrictions). The generated sample has the same size, and the initial values are taken from the actual data. The models estimated from the generated data are the same, except for the additional restrictions on α and β.
M = 1000 random samples are drawn in this way. The maximum number of iterations was set to 10,000, ε_1 = 10^{-12}, and all replications are counted. Table 3 gives the average number of iterations required to achieve convergence as well as the total central processing unit (CPU) time for each experiment (in seconds, running on all cores of an Intel i7 at 2.3 GHz).
Comparing the L1Step and LLsq line searches first, we see that L1Step has the better performance: it requires more function evaluations, but this is offset by the reduction in updates. This is surprising, because L1Step is ad hoc in comparison, using a limited {1.2, 2, 4, 8} line-search grid. Table 3 reports the mean iteration count (Iters), the mean log-likelihood evaluation count (Logliks), and the total CPU time in seconds to convergence for ε_1 = 10^{-12} in 1000 replications. LStd is the 'standard' line search (14), using the same 4-point grid, but based on the previous actual values instead of the previous candidate values as in L1Step, L1Beta and LLsq. Indeed, this is the only difference between LStd and L1Step: L1Step is substantially faster.
The best performance is with L1Beta: for Aa, it is more than twice as fast, although for Dc there is almost no difference. The reduction in iteration count is more pronounced than the gain in speed, because of the overhead of evaluating the concentrated parameters.
The quality of convergence is just as important as the speed. More detailed statistics are reported for the restrictions denoted by Aa, starting with Fig. 2, which presents results for β switching. The first two histograms show the number of iterations (i.e. parameter updates) needed to obtain convergence (or to reach the upper limit of 10,000). The L1Beta line search is about 50 times faster, reducing the number of updates by two orders of magnitude.
The bottom graph of Fig. 2 shows the difference in the log-likelihoods for the 1000 replications consecutively. In the experiments, l̂ = l_c(ξ̂) = -(T/2) log|Ω̂(θ̂, φ̂)| is around 3000 and, adding the line search used as an argument, the graph plots l̂(no line search) - l̂(L1Beta). A negative number indicates that the L1Beta version converged to a better value. That this happens regularly, with small differences, is an indication of premature convergence when using no line search. In practice, we should use a better initial-value procedure, which is likely to reduce this effect.
The same statistics are shown for αβ switching in Fig. 3, for three different line searches and none. The histograms along the top show the reductions in iteration count. The remaining graphs compare the achieved maxima, always in such a way that a negative number corresponds to L1Step having found a higher maximum. With αβ switching, discrepancies occur less frequently.
Finally, we compare the rates of convergence. For each experiment, this is measured as the change from the first log-likelihood (i.e. from the first update, not the initial values), set to zero, to the final value upon termination, set to one. Figure 4 plots the 50%, 90% and 99% quantiles, both without line search and using L1Beta. For αβ switching, the line search has little impact in the first few iterations but does cut down the tail. With β switching, the 90% and 99% quantiles are clearly shifted to the left from an early stage onwards.

Quadratic line search
Generic procedures that maximize a scalar function, such as Brent (1973, ch. 5) or Powell (1964, section 8), could be used but make the overall maximization significantly slower. It may be possible, as suggested by an anonymous referee, to specialize the quadratic procedure of Powell (1964) for the line search. To be competitive with the simple approach of L1Step and L1Beta, the number of function calls must be kept small. We will provide an implementation and compare it to the approaches so far.

Implementation
Already available are the function values (16) at η = 0 and η = 1, to which we add η = 2. Then, if the values accelerate upwards, the upper boundary of the specified interval is taken as the predicted value η_q. Otherwise, we use the quadratic approximation to predict η_q, unless the quadratic function is flat or leads to a minimum. Finally, if the prediction is close to a value that has already been evaluated (i.e. at 0, 1, 2), it is accepted. Otherwise, the function is evaluated at the prediction, and the best value returned. Table 4 reports the mean iteration count (Iters), log-likelihood evaluations (Logliks) and total central processing unit (CPU) time in seconds to convergence for ε_1 = 10^{-12} in 1000 replications.
The resulting algorithm remains relative to the previous candidate point; it is formalized as the LQStep line search. Steps 1a, 1b and 1c handle the upwards acceleration, the flat or wrong curvature case, and the quadratic approximation, respectively.
Step 3 is only effective when the prediction is not close, in which case an additional function evaluation is required to check if we have a better point. So LQStep requires only one or two function evaluations. Table 4 extends Table 3 with the new procedure. To illustrate our claims, we have also added a line search based on the algorithm of Brent (1973, ch. 5). The first version, LBrentStd, maximizes the line search relative to the previous actual values (14). This is the analogue to LStd (Table 3) but now maximizing the scalar function, rather than evaluating it at a few points. LStd is the faster (and simpler) of the two. LBrent uses Brent's maximization on the line search based on previous candidate values, the approach advocated in this paper, leading to a fivefold improvement. As expected, LBrent is very similar to LLsq.
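The prediction logic of steps 1a-1c can be sketched as follows (a minimal sketch: the cap `eta_max = 8` and the tie-breaking fallbacks are illustrative assumptions of ours, not the paper's exact rules):

```python
def lq_predict(f0, f1, f2, eta_max=8.0, tol=1e-12):
    """Predict a step length from objective values at eta = 0, 1, 2.

    Fits the quadratic f(eta) = f0 + b*eta + a*eta^2 through the three points
    and returns its maximizer, with fallbacks for flat or convex fits.
    """
    a = (f2 - 2.0 * f1 + f0) / 2.0          # curvature of the fitted quadratic
    b = (4.0 * f1 - 3.0 * f0 - f2) / 2.0    # slope at eta = 0
    if a >= -tol:                            # quadratic is flat or has a minimum
        if f2 >= f1 >= f0:                   # 1a: values accelerate upwards
            return eta_max                   #     take the interval boundary
        vals = {0.0: f0, 1.0: f1, 2.0: f2}   # 1b: fall back to the best point
        return max(vals, key=vals.get)
    return max(0.0, min(-b / (2.0 * a), eta_max))   # 1c: quadratic maximizer
```

If the prediction lands close to 0, 1 or 2 it can be accepted without a further evaluation; otherwise one extra function call checks whether it is an improvement, so the whole search costs one or two evaluations.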

Results
There are two versions of the quadratic line search: LQStep, which can be directly compared to the simple grid style of L1Step, and LQBeta, which is the quadratic version of L1Beta. In three out of four cases, the quadratic line search has a reduced number of likelihood calls. This has limited impact here but can help in other settings where they are more costly.

Applications in other settings
It is somewhat surprising that this simple approach to the line search provides such an acceleration of the algorithm. To illustrate its applicability more generally, we consider several different settings.

EM algorithm
The switching algorithms for restricted cointegration estimation exhibit occasional very slow convergence. They have this in common with EM algorithms (see, e.g. Jamshidian & Jennrich, 1997). Many proposals have been made in the literature to improve the EM algorithm by a line search or step adjustment based on recent iteration history (e.g. Varadhan & Roland, 2008; Berlinet & Roland, 2012). In some cases, the proposed line search comes with sophisticated heuristics. The two test problems are taken from Varadhan & Roland (2008, sections 7.1 and 7.2), who provide a clear description of the data, log-likelihoods and the E and M steps for the EM algorithm. The Update function for maximization consists of one application of the E and M steps, while Eval returns the average log-likelihood.
The first model is a two-component mixture of Poisson distributions, applied to mortality of women 80 years and older, as reported in The Times during 1910-1912. The Poisson model has three parameters θ = (p, μ_1, μ_2)′, and the model is repeatedly estimated starting from random initial values (0.05 + 0.9u_0, 100u_1, 100u_2), where the u_i are drawn from the uniform(0, 1) distribution.
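The E and M steps for this mixture are simple enough to sketch (an illustrative version of the Update function; the mortality data themselves are not reproduced here, so the test below uses synthetic frequencies):

```python
import numpy as np
from math import factorial

def pois_pmf(y, mu):
    """Poisson pmf for an integer array y (counts are small, so exact factorials)."""
    fact = np.array([factorial(int(k)) for k in y], dtype=float)
    return np.exp(-mu) * mu ** y / fact

def em_step(theta, y, w):
    """One E+M update for a two-component Poisson mixture, theta = (p, mu1, mu2),
    with observed counts y and their frequencies w.
    """
    p, mu1, mu2 = theta
    d1 = p * pois_pmf(y, mu1)
    d2 = (1.0 - p) * pois_pmf(y, mu2)
    r = d1 / (d1 + d2)                        # E step: posterior responsibilities
    p_new = np.sum(w * r) / np.sum(w)         # M step: weighted ML estimates
    mu1_new = np.sum(w * r * y) / np.sum(w * r)
    mu2_new = np.sum(w * (1.0 - r) * y) / np.sum(w * (1.0 - r))
    return np.array([p_new, mu1_new, mu2_new])
```

Each application of `em_step` is guaranteed not to decrease the mixture log-likelihood, which is the EM property the accelerations must preserve.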
Two accelerations of the EM algorithm are included in the comparison. The first, labelled LSqS3g, is the global S3 variant of the SQUAREM line search as preferred by Varadhan & Roland (2008). We found that the reported convergence failures could be avoided by forcing the initial step of the SQUAREM line search to stay inside the parameter space. The second is the parabolic line search of Berlinet & Roland (2012), called LParabolic here.
The second EM test case is the multivariate t-distribution. The estimated model assumes one degree of freedom (Cauchy), but the generated data are IIN(0, 1). We only report the variant that Varadhan & Roland (2008) call PX-EM. The dimension is 50, with 100 observations. The parameters are the mean μ and the Choleski factor P of the scale Σ = PP′, so there are 1325 parameters to estimate. The data are repeatedly drawn, and the model estimated with initial values based on the sample mean and variance of the generated data.
Both cases use ε_1 = 10^{-12} and a warm-up of three iterations in which the line search is not entered (this is included when counting function calls).

PARAFAC
Parallel factor analysis, or PARAFAC, refers to a class of models used in food applications (Bro, 1998), chemometrics (Sanches & Kowalski, 1990) and psychometrics. PARAFAC has the following tensor structure:

y_{ijk} = Σ_{f=1}^{F} a_{if} b_{jf} c_{kf} + e_{ijk},   (19)

where e_{ijk} is an error term. The data y_{ijk} can be represented by a three-dimensional matrix, which can be visualized as a sequence of K matrices of dimension I × J stacked behind each other. If, instead, these K matrices are lined up next to each other, the three-dimensional matrix has been flattened to a normal matrix of dimension I × JK. This is written as Y^{(I×JK)}. The flattening can be performed in different ways, for example Y^{(K×IJ)}. The PARAFAC model can now be written as a trilinear model,

Y^{(I×JK)} = A (C ⊙ B)′ + E,

with objective function ‖Y^{(I×JK)} - A (C ⊙ B)′‖², where ⊙ denotes the column-wise Kronecker (Khatri-Rao) product. Each loading matrix is updated in turn by least squares, for example A = Y^{(I×JK)} (C ⊙ B) [(C′C) ∘ (B′B)]^+, where superscript + denotes the Moore-Penrose inverse; Z′Z = (C′C) ∘ (B′B) can be computed using a Hadamard product. The normalization is recommended by Uschmajew (2012) and scales each of a_f, b_f and c_f to the common norm (‖a_f‖ ‖b_f‖ ‖c_f‖)^{1/3}, f = 1, ..., F. Rajih et al. (2008) propose using an exact line search, labelled ELS, which can be compared to LLsq, except that it is formulated in terms of the previous actual values. The trilinear structure of PARAFAC means that now the roots of a fifth-order polynomial must be found. This may require five function evaluations to choose the best. We have not implemented ELS.
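One alternating least-squares cycle can be sketched as follows (a sketch under the assumption that Y is the I × JK flattening with k varying slowest; the function names are ours):

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: (K x F) and (J x F) -> (KJ x F)."""
    K, F = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, F)

def als_update(Y, A, B, C):
    """One cycle of alternating least squares for Y(I x JK) = A (C ⊙ B)' + E,
    forming each Z'Z via the Hadamard product of small F x F matrices.
    """
    I, _ = Y.shape
    J, K = B.shape[0], C.shape[0]
    # Update A by regressing Y on Z = C ⊙ B, with Z'Z = (C'C) * (B'B)
    A = Y @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
    # Update B and C from the corresponding re-flattenings of Y
    Yb = Y.reshape(I, K, J).transpose(2, 1, 0).reshape(J, K * I)   # J x KI
    B = Yb @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
    Yc = Y.reshape(I, K, J).transpose(1, 2, 0).reshape(K, J * I)   # K x JI
    C = Yc @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C
```

On noiseless data generated from (19), a few cycles from slightly perturbed starting values recover the factorization exactly, which is a useful correctness check before adding noise and the line search.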
The experiment takes the following form: data are repeatedly generated using (19), with A = 3 + I_{I×F}, B = 2 + I_{J×F}, C = 1 + I_{K×F} and e_{ijk} ~ 0.1 · IIN[0, 1]. Here I_{I×F} denotes the I × F identity matrix, so A has the value 4 on the diagonal and 3 elsewhere. For each replication, starting values are chosen as a_{if}^{(0)} = a_{if} + (U(0,1) - 0.5)/10, with U(0,1) representing a uniform random number. Starting values for B and C are chosen in the same way.

Matrix approximation
The final test case is taken from the literature on low-rank matrix approximations; our test case is for an unstructured matrix. A line search is not commonly used in the proposed procedures, and we shall show the benefits of using L1Step. The algorithm is as in Zachariah et al. (2012). To test the algorithms, we generate X as a draw from IIN[0, 1], keeping it fixed between replications. A and B are taken as the first r columns of the appropriate identity matrices. In each replication, y = X vec(AB′) + ε, with ε_i ~ IIN[0, 1]. Starting values are derived from the singular value decomposition (SVD) of the full-rank estimate. The product AB′ is used for the parameter change in the convergence check. Table 5 shows the impact of adding a line search to the maximization algorithms. It reports the average number of calls to the Update function and to the objective function (Eval). CPU is the total CPU time for the experiment, on a single core of an Intel Xeon E5 at 2.9 GHz.
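The alternating core of such a procedure can be sketched via the identities vec(AB′) = (B ⊗ I_p)vec(A) = (I_q ⊗ A)vec(B′), which make the problem linear in each factor in turn (a sketch without the line search or SVD starting values; identity-matrix starts as in the experiment):

```python
import numpy as np

def lowrank_als(y, X, p, q, r, iters=50):
    """Alternating least squares for y = X vec(A B') + e, A (p x r), B (q x r).

    Given B, vec(AB') = (B kron I_p) vec(A) is linear in A; given A,
    vec(AB') = (I_q kron A) vec(B') is linear in B.
    """
    A = np.eye(p)[:, :r]
    B = np.eye(q)[:, :r]
    for _ in range(iters):
        Za = X @ np.kron(B, np.eye(p))                    # regressors for vec(A)
        A = np.linalg.lstsq(Za, y, rcond=None)[0].reshape((p, r), order="F")
        Zb = X @ np.kron(np.eye(q), A)                    # regressors for vec(B')
        Bt = np.linalg.lstsq(Zb, y, rcond=None)[0].reshape((r, q), order="F")
        B = Bt.T
    return A, B
```

Adding L1Step on top of this loop is mechanical: the stacked (vec A′ : vec B′)′ plays the role of ξ in (15).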

Results
The results show that L1Step provides better acceleration of the EM algorithm than the other line searches, with the lower number of function calls reflected in the reduced total time of each experiment. Note that the parabolic line search obtains its speed from using the previous candidate, just as L1Step does; the parabolic aspect does not yield any further improvement. There is a small additional advantage from using the quadratic version LQStep. The parabolic line search uses a double update. We can also do this for L1Step: line 5 in Algorithm 7 becomes ξ_c = Update(Update(ξ)). This makes it a bit faster still, trading an increase in updates for a reduction in objective evaluations (see Table 5 under L1Step*).
As noted by Varadhan & Roland (2008), the PX variant of the multivariate t-model is already very good. Adding the line searches now only provides a limited improvement.
The speed-up of the PARAFAC algorithm from L1Step is 80-fold, almost two orders of magnitude. The reduction in the number of updates is larger still. In contrast, Rajih et al. (2008) report that, while ELS reduces the number of iterations, it is not really faster than no line search and slower in some cases. L1Step performs extremely well, with the added benefit that it is not problem specific, and extremely easy to implement, both unlike ELS.
Finally, there is a useful improvement of adding L1Step to the low-rank matrix approximation.
There are two cases with premature convergence in the Poisson mixture model without line search. These failures happen because the algorithm makes so little progress in the first 10 iterations that it reports convergence. In the low-rank case, there are five replications that converge to different values. This is not counted as a failure because they appear to be different local modes: different line searches converge to one of two modes. In all other test cases in Table 5, the different accelerations (or absence thereof) converge to the same maximum within the convergence tolerance.

Conclusions
We presented a line search to accelerate switching and alternating variables algorithms. The approach is exceedingly simple but, nonetheless, provides some useful insights into the requirements for good acceleration. First, partitioning the parameter space results in default steps that are too small, so expanding searches are needed. Second, it is much more effective to use the previous candidate parameter values than the previous actual values. Our results illustrate this in many different settings. Finally, there is a penalty for expending too much effort in the line search. In some models, it is possible to optimize the step length explicitly, but even this results in slower algorithms than the proposed simple approximation of L1Step. Similarly, the parabolic line search proposed for EM algorithms spends a bit too much time on a more sophisticated search.
However, while there is limited return in trying to derive a slightly more optimal curve from updates along the iterative path of the algorithm, it may be possible to exploit statistical knowledge of the model that is estimated. The main focus of this paper is on models for cointegration, because these have lacked acceleration so far. Our results show that L1Step works very well in the I(1) cointegrated VAR models with linear restrictions. Although not reported here, this also holds for non-linear restrictions, different ranks and lag lengths, as well as I(2) models. In the I(1) model, it helps that the α and β coefficients are asymptotically independent. We used the fact that β converges at a faster rate than α to implement the simple stepwise line search in terms of β only. This did provide a further improvement.
Faster convergence facilitates bootstrapping and Monte Carlo experiments. It is also useful when multimodality is suspected, and the model is re-estimated from many randomized initial values.