#### 2.1 Overall approach and research questions

A linear stationary normal process is completely characterized by its mean and autocovariance function, and the time-reversed process has the same distribution as the original process (see for instance [7], Chapter 15). In general, traffic variables appear to be time-irreversible. For example, traffic volumes usually rise precipitously but decay gradually: the typical shape of the daily profile is an asymmetric M (whereas for speed, it is an asymmetric W). The class of threshold regression models [8] has been widely employed in the literature to explain empirical phenomena similar to the ones discussed earlier (a list of applications is presented in [9]). The major features of such models are limit cycles, amplitude-dependent frequencies, and jump phenomena [10]. Although threshold regressions serve primarily as forecasting tools in this paper, they may also be used to examine some important research questions in vehicular traffic dynamics, as follows:

- For a given location in a road network, how many traffic states (regimes with linear dynamics) characterize the traffic cycle in each day and which time intervals correspond to each regime?
- Do regimes occur at different times at different locations in the network?
- Do specific regimes within the daily cycle differ significantly for different weekdays?

Given a forecasting horizon, in each measurement location and for each traffic regime, some past information from upstream or downstream neighboring locations is expected to be useful with regard to short-term forecasting performance. In fact, significant neighbors for a specific location may differ across regimes and significant time lags of measurements collected at a neighboring location may also differ across regimes. In previous works, the set of possible statistically significant neighbors was specified a priori using regime-dependent spatial weight matrices (see, for instance [11] and [6]), and the autoregressive order of the model was decided using statistical information criteria [12].

This work avoids a priori specification of spatial weight matrices by using *l*_{1}-penalized methods that simultaneously perform estimation and model selection. Essentially, given a relatively large autoregressive order and a single matrix that represents general features of the road network topology, penalized estimation shrinks nonsignificant predictors to zero. To the best of our knowledge, this is the first application of penalized estimation methods to space-time regression models; a modified version of the basic *l*_{1}-penalization algorithm adapted for spatial regression problems, different from the ones considered here, appeared recently in [13]. For some recent applications of penalized estimation methods in transportation problems, the reader may consult [14, 15] and [16].

In the application, we compare the forecasting performance of models estimated using: (i) the adaptive least absolute shrinkage and selection operator (LASSO) which performs *l*_{1}-penalized minimization of squared residuals and (ii) *l*_{1}-penalized minimization of the least absolute deviations (LAD) of the residuals (adaptive LAD-LASSO). Although ordinary least squares (OLS) is by far the most popular estimation method in regression models, estimation via LAD is among the most commonly employed robust techniques and has been shown to be particularly effective in terms of forecasting when the distribution of the response variable is prone to outliers [17].

Given some estimated temporal thresholds which define homogeneous traffic regimes, penalized estimation provides interesting insights in addition to its primary use as a generator of parsimonious and well-specified (e.g., free from the symptoms of multicollinearity) forecasting models. Namely, the aforementioned method permits answering the following research questions:

- What is the spatial extent of the impact from neighboring locations in the network on the forecasted traffic level on a link?
- Are the spatial and temporal influences of neighboring locations on a road link dependent on other factors such as the distance from the link, congestion level, traffic variability, and so on?

#### 2.2 Decomposition

Our modeling strategy divides traffic dynamics into two basic components: a location specific daily profile and a term that captures the deviation of a measurement from that profile. Forecasting using unobserved components has been frequently adopted as it can provide a better understanding of the dynamic characteristics of the series and the way these characteristics change over time [18]. Within the context of short-term traffic forecasting, such decompositions are expected to lead to superior performance compared with models applied directly to traffic variables [19, 6]. Traffic variables display nonlinear dynamics in both their mean and their variance (e.g., heteroscedasticity, see [11] and [1]); detrending simplifies the dynamics of the modeled series relative to the original ones as the daily profile will capture some of the nonlinearity in mean.

Specifically, let *d* be the day of the week index, *s* the location index, and *t* the time of day index. The overall model structure for a traffic variable *y* is

- *y*_{d,s,t} = *μ*_{d,s,t} + *x*_{d,s,t} (1)

where *d* = 1, … ,*D*, *s* = 1, … ,*S*, and *t* = 1, … ,*T*. *S* represents the number of locations for which we seek to forecast traffic conditions, and *T* is the total number of time intervals per day. *D* may be less than seven if there is sufficient evidence of similarity of traffic dynamics for two (or more) days of the week. The profile *μ*_{d,s} captures the daily trend and can be viewed as a baseline forecasting model that uses only historical data and neglects information from the recent past of the process. A weighted average that weights recent historical data more heavily can be employed for the estimation of *μ*_{d,s} [6]; alternative methods include principal component analysis [20] and wavelet-based decomposition [21]. In what follows, we let *D* = 7 and estimate daily profiles based on weighted averages with weights fine-tuned for optimal forecasting performance. It should be emphasized that the decomposition described earlier is not absolutely necessary; an alternative modeling strategy would apply models similar to the ones presented in the next subsection directly to the original (rather than the detrended) data.
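As a concrete sketch of this decomposition, the following snippet estimates a daily profile as a weighted average over past weeks and detrends the current day. The geometric decay factor is a hypothetical choice for illustration; the actual weights in the paper are fine-tuned for forecasting performance.

```python
import numpy as np

# Illustrative decomposition: estimate the daily profile mu_{d,s} as a
# weighted average over past weeks (recent weeks weighted more heavily)
# and detrend the current day. The decay factor is hypothetical.
def daily_profile(past_weeks, decay=0.8):
    """past_weeks: array of shape (D_w, T); row 0 is the most recent week."""
    weights = decay ** np.arange(past_weeks.shape[0])
    weights /= weights.sum()              # normalize weights to sum to one
    return weights @ past_weeks           # shape (T,): one value per time slot

rng = np.random.default_rng(0)
base = 50 + 30 * np.sin(np.linspace(0, 2 * np.pi, 96))   # stylized daily shape
history = base + rng.normal(0, 2, size=(4, 96))          # four past weeks
today = base + rng.normal(0, 2, size=96)

mu = daily_profile(history)               # baseline forecast for each slot
x = today - mu                            # transient deviation, input to (2)
```

The deviation series `x` is what the transient model of the next subsection operates on.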

#### 2.3 The transient model

The second part of the modeling procedure concentrates on the dynamics of the (short-term) deviation from the historical daily profile. In this work, we adopt a regime-switching modeling framework. In particular for each location *s*, a space-time threshold autoregressive model is employed to capture transient behavior

- *x*_{d,s,t} = *α*_{d,s}^{(r_{d,s})} + Σ_{i=1}^{p} *α*_{d,s,i}^{(r_{d,s})} *x*_{d,s,t−i} + Σ_{n=1}^{N_s} Σ_{i=1}^{p} *α*_{d,s,n,i}^{(r_{d,s})} *x*_{d,s_n,t−i} + *ε*_{d,s,t} (2)

where *T*_{r_{d,s}−1} < *t* ≤ *T*_{r_{d,s}} for *r*_{d,s} = 1, … ,*R*_{d,s} + 1, *s*_{n} denotes the *n*-th neighboring location of *s*, and we use the convention that *T*_{0} = 0 and *T*_{R_{d,s}+1} = *T*. In (2), *r*_{d,s} is an index that specifies the operating regime. The thresholds *T*_{1} < ⋯ < *T*_{R_{d,s}} separate and characterize the different regimes and in general may differ across locations in the road network and across days of the week. In contrast to [6], in this work, the number of thresholds and their magnitudes are unknown quantities that need to be estimated. It should be emphasized that, purely for simplicity of exposition, the predictive part in (2) contains only past information from the variable being modeled; the approach can be extended in a straightforward manner to include past information from other traffic variables as well.

The aforementioned predictive equation contains an intercept term that varies with location, day of the week, and traffic regime within a day. *N*_{s} is the number of neighboring locations of *s* that may provide useful information (at some previous time instances) with regard to short-term forecasting performance, and *p* is the autoregressive order (maximum time lag) of the model. Hence, the first sum in (2) contains information on the recent past of the location of interest, whereas the second sum contains information from its neighbors. The *α*'s are unknown coefficients that need to be estimated; the statistically significant ones in the second sum indicate which temporal lags of each neighboring location provide useful information with regard to short-term forecasting. Finally, *ε* is assumed to be a martingale difference sequence with respect to the history of the time series up to time *t* − 1; hence, it is assumed to be serially uncorrelated (but not necessarily independent), and its variance is not restricted to be equal across regimes.
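To make the regime-switching mechanism concrete, the following sketch shows how a one-step forecast of this form could be assembled once thresholds and per-regime coefficients are available. All numerical values (thresholds, coefficients, lags) are invented for illustration only.

```python
import numpy as np

# Hypothetical one-step forecast in the spirit of model (2): the
# time-of-day index selects the operating regime, and the forecast
# combines an intercept, own lags, and neighbor lags.
def regime_index(t, thresholds):
    """Regime r (0-based) such that T_{r-1} < t <= T_r."""
    return int(np.searchsorted(np.asarray(thresholds), t, side="left"))

def forecast(t, own_lags, neigh_lags, params, thresholds):
    intercept, a_own, a_neigh = params[regime_index(t, thresholds)]
    return intercept + a_own @ own_lags + np.sum(a_neigh * neigh_lags)

thresholds = [28, 44, 96]                 # night | morning peak | rest of day
params = {                                # (intercept, own-lag, neighbor-lag)
    0: (0.0, np.array([0.5, 0.2]), np.zeros((2, 2))),
    1: (0.3, np.array([0.7, 0.1]), np.full((2, 2), 0.05)),
    2: (0.1, np.array([0.6, 0.0]), np.zeros((2, 2))),
}
own = np.array([1.2, 0.8])                # x_{t-1}, x_{t-2} at the location
neigh = np.ones((2, 2))                   # two neighbors, two lags each
x_hat = forecast(35, own, neigh, params, thresholds)
```

The same dispatch-by-threshold logic applies regardless of how many regimes a given day/location combination has.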

The threshold model in (2) essentially dictates abrupt transitions between traffic regimes whereas it would have been more natural to adopt (logistic) smooth transitions as in [1]. However, that would have complicated things considerably as estimation of the parameter related to the speed of transition can be problematic, see for instance [22] and [1]. Furthermore, in contrast to some previous approaches (for example [19]), (2) does not contain any moving average terms. Autoregressive models with sufficiently large autoregressive order *p* may approximate autoregressive moving average processes (as shown for instance in [23], Chapter 2), when some stationarity and invertibility conditions are satisfied by model coefficients. The majority of linear models applied to traffic data satisfy such conditions implicitly [19].

Another feature that distinguishes our modeling approach from previously reported ones is that we do not consider simultaneous estimation of a system of equations with a common covariance matrix (each equation corresponding to a measurement location in the network) as in [24, 11, 25, 12, 6]. Although a system is in general expected to produce more efficient estimates than the equation-by-equation approach, such an estimation framework cannot be applied in practice when *S* is as large as in the applications we consider (where *S* may easily exceed 300).

#### 2.4 Threshold estimation

The predictive model in (2) defines a threshold regression per measurement location, with an unknown number of regimes; a detailed discussion of such models is presented in [8]. Time of day is the threshold variable that defines subsamples in which the relationship is stable. In general, the threshold variable can be subject to a model building procedure which chooses the traffic variable for which linearity is more strongly rejected; such a procedure is presented in [1]. In [1], it is shown that the choice of the time index (which facilitates the forecasting procedure) is effective in capturing nonlinear dynamics compared with alternative threshold variables (traffic variables in levels and differences). In the application, the number of thresholds per day/segment combination ranges from 0 (which essentially means that a linear model is adequate for capturing traffic dynamics of the particular segment at that particular day) to 4, which corresponds to five traffic regimes. An example of five regimes is two tranquil periods during the night (e.g., from midnight to early morning and from late afternoon to midnight), a morning peak period, an afternoon peak and an intermediate, not heavily congested regime that separates the two peaks.

A linear specification is nested in the threshold regression depicted in (2). Therefore, a first step is to test the linearity of the model against the piecewise linear specification. If the null hypothesis is rejected, one may proceed to estimate a threshold regression and the residuals of the piecewise linear model should be tested for significant remaining nonlinearity that could be captured by adding a regime in the model. For the purposes of our application, we performed a battery of specification tests that include White's test [26], the sequential procedure proposed in [27] and F-tests for structural change [28]. In the next section, we present a sample of results based on the tests proposed in [29] and [30].
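As an illustration of the flavor of such tests, the snippet below computes a Chow-type F-statistic for a single structural break at a known candidate point on simulated data. It is a simplified stand-in for the cited procedures, not a reproduction of them.

```python
import numpy as np

# Simplified Chow-type F-test for one structural break: compare the
# pooled fit of two regime-specific regressions against a single
# full-sample regression. Toy illustration on simulated data.
def ssr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def chow_f(X, y, split):
    n, k = X.shape
    s_full = ssr(X, y)
    s_split = ssr(X[:split], y[:split]) + ssr(X[split:], y[split:])
    return ((s_full - s_split) / k) / (s_split / (n - 2 * k))

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 0.5]) + rng.normal(0, 0.1, size=n)
y[n // 2:] += 2.0                         # shift in the intercept: new regime
F = chow_f(X, y, n // 2)                  # large F => reject linearity
```

A large value of `F` relative to the appropriate F-distribution quantile rejects the single-regime null in favor of the piecewise specification.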

Alternatively, the decision on the number of regimes can be based on the values of an information criterion as in [31]. A sample of comparisons based on the Akaike information criterion (AIC) is shown in the numerical experiments that will follow. It should be noted that the aforementioned methods are not based on the out-of-sample predictive performance of the models, and some fine-tuning may be required by comparing via cross-validation a small set of plausible regime selections per *x*_{d,s}. Indeed, in a number of studies, although nonlinear models are suggested by statistical tests, simpler linear alternatives have been found superior in terms of forecasting performance [32]. In this work, the primary focus is on forecasting performance and although we examine the results of specification tests and information criteria, the final decision on the adequate number of thresholds is based on the results of the forecasting experiment.
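A minimal sketch of the information-criterion route, using the standard least-squares form AIC = *n* log(SSR/*n*) + 2*k* on simulated data with a genuine regime shift at a known split point:

```python
import numpy as np

# Toy AIC comparison between a single linear model and a two-regime
# piecewise alternative with a known split point. Simulated data with a
# genuine regime shift; all parameter values are illustrative.
def fit_ssr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def aic(ssr_value, n, k):
    return n * np.log(ssr_value / n) + 2 * k

rng = np.random.default_rng(2)
n, split = 300, 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_a, beta_b = np.array([0.0, 0.2]), np.array([1.5, 0.8])
y = np.where(np.arange(n) < split, X @ beta_a, X @ beta_b)
y = y + rng.normal(0, 0.2, size=n)

aic_linear = aic(fit_ssr(X, y), n, k=2)
aic_tar = aic(fit_ssr(X[:split], y[:split]) + fit_ssr(X[split:], y[split:]), n, k=4)
# With a true regime shift, the two-regime model attains the lower AIC
# despite its penalty for the extra parameters.
```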

Effective methods for threshold estimation for given *R*_{d,s} have been proposed in [8, 31] and [33]. These methods apply in a univariate setting, that is, to each measurement location separately. Here, we employ a strategy that focuses on computational tractability; it is based on the multivariate threshold regression models (or threshold vector autoregressions, TVAR) presented in [34]. Instead of treating measurement locations as independent, we classify them into groups of small size (e.g., three to four neighboring locations) that are expected to be characterized by the same thresholds per day of the week. This is a plausible hypothesis that considerably simplifies the computations that need to be performed, as it allows simultaneous threshold estimation for each group of locations. A system of threshold regressions is estimated for each group: it comprises one predictive equation for each group member, with lagged traffic variables of all group members appearing as explanatory covariates.

Classification of the measurement locations that form each TVAR system is based on distance. Essentially, in the application, we divide measurement locations into non-intersecting groups of size 3: each new group is seeded by the ungrouped location that is closest, in terms of geographical distance, to an existing group, and is completed with that location's two closest first-order neighbors (in the sense defined in [35]) or, when these are unavailable, with the first-order neighbor and the closest second-order neighbor. Then, we estimate one TVAR model for each group. This substantially reduces the number of threshold estimations (and, according to our experiments, the computational time) compared with the alternative of estimating thresholds for each measurement location separately. We performed the same procedure with groups of size 4; the results were almost identical and are not reported for brevity.
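The grouping step can be sketched as a greedy procedure on pairwise distances. The seeding rule below is a simplified reading of the scheme described above, and the coordinates are made up for illustration.

```python
import numpy as np

# Greedy distance-based partition into triples: seed each new group with
# the ungrouped location nearest to an already-grouped one, then add its
# two nearest ungrouped neighbors. A simplified, hypothetical sketch.
def group_locations(coords, size=3):
    coords = np.asarray(coords, float)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    ungrouped = set(range(len(coords)))
    groups, grouped = [], set()
    while len(ungrouped) >= size:
        if grouped:   # seed: ungrouped location closest to any grouped one
            seed = min(ungrouped, key=lambda i: min(dist[i, j] for j in grouped))
        else:
            seed = min(ungrouped)
        ungrouped.discard(seed)
        nearest = sorted(ungrouped, key=lambda i: dist[seed, i])[: size - 1]
        group = [seed, *nearest]
        groups.append(group)
        grouped.update(group)
        ungrouped.difference_update(nearest)
    return groups

coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
groups = group_locations(coords)          # two well-separated triples
```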

Univariate threshold regressions and TVAR models comprising up to three linear regimes can be estimated in a straightforward manner in modern statistical/econometric software and have been found adequate for our purposes. To test for the need for additional regimes, we combined standard techniques with a priori knowledge of traffic dynamics. For instance, a four-regime model can be estimated by fixing the threshold that marks the beginning of the morning peak period (which is always identified in practice) and estimating a two-threshold model using the remaining part of the data. Similarly, a five-regime model can be estimated by dividing the day into two halves and estimating a two-threshold model for each half. The first model estimates the thresholds that define the morning peak period, whereas the second estimates those that define the afternoon peak.
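The basic building block of these stepwise schemes, a grid search for a single time-of-day threshold that minimizes the pooled sum of squared residuals (in the spirit of [8]), can be sketched as follows on simulated data:

```python
import numpy as np

# Toy grid search for one time-of-day threshold: every candidate splits
# the sample into two regimes, each fitted by least squares, and the
# candidate minimizing the pooled SSR is selected. Trimming keeps a
# minimum share of observations in each regime. Simulated data only.
def fit_ssr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def estimate_threshold(tod, X, y, trim=0.15):
    lo, hi = np.quantile(tod, [trim, 1 - trim])
    candidates = np.unique(tod[(tod >= lo) & (tod <= hi)])
    def pooled_ssr(c):
        mask = tod <= c
        return fit_ssr(X[mask], y[mask]) + fit_ssr(X[~mask], y[~mask])
    return min(candidates, key=pooled_ssr)

rng = np.random.default_rng(3)
tod = np.tile(np.arange(96), 5)           # five days, 96 slots per day
X = np.column_stack([np.ones(tod.size), rng.normal(size=tod.size)])
y = X @ np.array([0.0, 0.3]) + 2.0 * (tod > 30) + rng.normal(0, 0.2, tod.size)
t_hat = estimate_threshold(tod, X, y)     # recovers the break after slot 30
```

Two-threshold models repeat the same search over pairs of candidates, which is why fixing one threshold from prior knowledge reduces the computational burden so markedly.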

Each equation in the TVAR system approximates (as it contains information from a reduced number of neighbors) the predictive model (2) for the measurement location that appears on its left-hand side. The underlying hypothesis is that the timing of traffic regimes does not differ significantly for measurement locations within each group and that omitted predictors do not significantly influence threshold estimation. TVAR systems for groups of five or more locations would contain a large proportion of statistically nonsignificant coefficients and hence were not considered.

In case one would like to avoid the aforementioned approximation schemes, one may implement a combination of the grid-search procedure à la Hansen [8] with penalized estimation, as described in Section 2.5. We would like to highlight that to the best of our knowledge, such combinations have not appeared in the literature until now. Our implementation of such a procedure led to results that are practically equivalent to the ones reported in Section 3 but was substantially more demanding in terms of computational time.

#### 2.5 Penalized estimation for automatic model selection

Within regime *r*_{d,s}, the model in (2) is a linear regression that in theory can be estimated using conventional methods (e.g., OLS or LAD). However, direct estimation may be inefficient, as a fraction of the predictors will not contribute significantly to the predictive power of the model. In some cases, direct estimation may be problematic, with the variances of the estimated coefficients being unacceptably high, or even infeasible because of multicollinearity. This happens especially when *p* and *N*_{s} are large. Without a procedure for model building, that is, a selection of significant predictors, the resulting model may be unstable under perturbations and, worse yet, may produce undesirable output because of ill-conditioned matrix inversion.
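A small simulated illustration of the ill-conditioning problem: lagged values of a persistent series are nearly collinear, so the Gram matrix that OLS must invert becomes badly conditioned as the autoregressive order grows.

```python
import numpy as np

# Illustrative (not from the paper): lagged copies of a persistent series
# are nearly collinear, so the condition number of X'X explodes as the
# autoregressive order increases, inflating coefficient variances.
rng = np.random.default_rng(6)
T = 500
z = np.cumsum(rng.normal(size=T))        # persistent, slowly varying series

def lag_matrix(k):
    """Design matrix with columns z_{t-1}, ..., z_{t-k}."""
    return np.column_stack([z[k - i : T - i] for i in range(1, k + 1)])

cond_small = np.linalg.cond(lag_matrix(2).T @ lag_matrix(2))
cond_large = np.linalg.cond(lag_matrix(8).T @ lag_matrix(8))
# cond_large vastly exceeds cond_small: the inversion is ill-conditioned.
```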

We thus propose the use of a penalized estimation scheme within the context of threshold regression. In previous studies, model building was either based on exploratory analyses (e.g., plots of the estimated (partial) autocorrelations), see [24, 11] and [25], or on information criteria such as AIC as in [36] and [12]. The former method is practically infeasible for large *S*, whereas application of the latter through an automatic (for instance general-to-specific [37]) sequential procedure is very demanding in terms of computational power.

In this work, estimation and model selection per regime take place simultaneously for each location using LASSO penalized regression which enforces sparse solutions in problems with large numbers of predictors [38]. LASSO is a constrained version of ordinary estimation methods and at the same time a widely used automatic model building procedure. Compared with classical variable selection methods, such as subset selection, the LASSO has two advantages. First, the selection procedure is continuous and hence more stable than the subset selection which is discrete [39]. Second, the LASSO is computationally feasible for high-dimensional data. In contrast, computation in subset selection is combinatorial and not feasible when the number of predictors is very large [40].

Given a loss function *g*(.), coefficient estimation within regime *r*_{d,s} in (2) is performed by minimizing the criterion

- *g*(·) + Σ_{i=1}^{p} *λ*_{i} |*α*_{d,s,i}^{(r_{d,s})}| + Σ_{n=1}^{N_s} Σ_{i=1}^{p} *λ*_{n,i} |*α*_{d,s,n,i}^{(r_{d,s})}| (3)

We consider two variants of the estimation procedure: one in which the sum of absolute residuals is minimized, henceforth referred to as LAD-LASSO, and one with a least squares objective, which we call conventional LASSO. In the former case,

- *g*(·) = Σ_{t} |*ε̂*_{d,s,t}|,

whereas in the latter,

- *g*(·) = Σ_{t} *ε̂*_{d,s,t}^{2},

where *ε̂*_{d,s,t} denotes the residual of (2) and the sums run over all observations that fall within regime *r*_{d,s} when historical traffic data from *D*_{w} past weeks are available.
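The robustness argument for the LAD loss can be seen in the simplest possible case, an intercept-only model, where minimizing absolute deviations yields the sample median while minimizing squared residuals yields the mean:

```python
import numpy as np

# Intercept-only illustration of LAD robustness: a few gross outliers
# drag the OLS estimate (the mean) away from the bulk of the data, while
# the LAD estimate (the median) stays put. Simulated data only.
rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=200)
x[:5] = 60.0                              # five gross outliers

ols_fit = x.mean()                        # minimizes the sum of squares
lad_fit = np.median(x)                    # minimizes the sum of |residuals|
```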

The second and third components of the sum in (3) are penalty terms which shrink the coefficients toward the origin and tend to discourage models with large numbers of marginally relevant predictors. The intercept *α*_{d,s} is excluded from the LASSO penalty, whose strength is determined by the positive tuning constants *λ*. It is worth emphasizing that the aforementioned criterion implements the adaptive LASSO method, which has been shown to be more effective than the ordinary LASSO [41]. In what is presented next, we follow the procedure justified theoretically in [42] and apply the adaptive LASSO as a two-step procedure: in the first step, coefficient estimates are derived under slight penalization and the *λ*'s are set inversely proportional to the magnitudes of these estimates; in the second step, (3) is minimized given the *λ*'s from the first step.
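A self-contained sketch of this two-step procedure, using coordinate descent with soft thresholding as one standard *l*_{1} solver; the penalty levels `lam1` and `lam2` are illustrative, not the tuned values used in the paper:

```python
import numpy as np

# Hedged sketch of the two-step adaptive LASSO: a lightly penalized first
# pass yields pilot estimates, penalty weights are set inversely
# proportional to their magnitudes, and a second pass solves the weighted
# l1 problem by coordinate descent with soft thresholding.
def soft(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - X b||^2 + sum_j lam[j]*|b_j| by coordinate descent."""
    n, k = X.shape
    b = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(k):
            r_j = y - X @ b + X[:, j] * b[j]      # partial residual
            b[j] = soft(X[:, j] @ r_j, lam[j]) / col_sq[j]
    return b

def adaptive_lasso(X, y, lam1=0.01, lam2=10.0):
    pilot = lasso_cd(X, y, np.full(X.shape[1], lam1))   # step 1: slight penalty
    weights = 1.0 / (np.abs(pilot) + 1e-8)              # adaptive weights
    return lasso_cd(X, y, lam2 * weights)               # step 2: weighted l1

rng = np.random.default_rng(5)
n, k = 300, 8
X = rng.normal(size=(n, k))
beta = np.array([2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0])  # sparse truth
y = X @ beta + rng.normal(0, 0.5, size=n)
b_hat = adaptive_lasso(X, y)              # irrelevant predictors shrunk to zero
```

The LAD-LASSO variant replaces the squared-error working loss with absolute deviations, which requires a linear-programming or IRLS-type solver instead of plain coordinate descent.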

The use of penalized estimation allows considerable flexibility with regard to the specification of the matrices that define neighboring relationships in a road network. Using a modeling framework similar to the one adopted in previous studies (e.g., [24, 11, 25, 12, 6]), we would have to define different matrices per regime and per time lag of the model at a pre-processing stage. The adaptive LASSO procedure allows that process to be automated: the inputs to the model need not be regime- or location-specific predictors. For this reason, the number of input coefficients is fixed across regimes and days, and in (2), we can use *p* and *N*_{s} instead of regime-specific *p*_{d,s} and *N*_{d,s}.

One expects that, depending on the characteristics of the traffic data and the density of measurement locations in a road network, there is a maximum time lag and a maximum number of neighbors above which additional predictors in a piecewise linear regression model such as the one in (2) do not contribute to out-of-sample predictive ability. When *p* and *N*_{s} take very large values, the estimation problem becomes harder to solve and the finite-sample performance of the estimator degrades; consequently, out-of-sample predictive ability weakens.