Performance evaluation of deep neural networks for forecasting time-series with multiple structural breaks and high volatility

The problem of automatic and accurate forecasting of time-series data has always been an interesting challenge for the machine learning and forecasting community. A majority of real-world time-series problems have non-stationary characteristics that make the understanding of trend and seasonality difficult. Our interest in this paper is to study the applicability of the popular deep neural networks (DNN) as function approximators for non-stationary time-series forecasting (TSF). We evaluate the following DNN models: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), RNN with Long-Short Term Memory (LSTM-RNN), and RNN with Gated-Recurrent Unit (GRU-RNN). These DNN methods have been evaluated over data from 10 popular Indian financial stocks. Further, the performance evaluation of these DNNs has been carried out in multiple independent runs for two settings of forecasting: (1) single-step forecasting, and (2) multi-step forecasting. These DNN methods show convincing performance for single-step forecasting (one-day-ahead forecast). For multi-step forecasting (multiple-days-ahead forecast), we have evaluated the methods for different forecast periods. The performance of these methods demonstrates that long forecast periods have an adverse effect on performance.


Introduction
In recent years, as national economies have developed, the stock market has become an increasingly essential and intricate part of them. One such study can be found in [1]. Investors now need to consider a large number of factors and evaluate a considerable amount of risk before investing in stocks in any form [2], because stock prices are chaotic and dynamic. These investors expect to make decent profits from their investments. However, analysing the factors and risks that affect stock prices, and predicting those prices, can be exhausting and may demand a high degree of skill [3]. Hence, the prediction of stock prices could be a significant reference for investors and financial pundits when devising trading and investment strategies.
With the rapid developments in machine learning (ML) tools and techniques, especially deep learning (DL) algorithms, along with an adequate increase in computational power, predicting stock prices has become less laborious and does not require much economic expertise. These DL tools and algorithms, such as Deep Neural Networks (DNNs), learn the trend and the factors responsible for fluctuations (such as a sudden rise or drop) in the prices and accordingly predict values with acceptable approximation [4]. Furthermore, the primary advantage of such methods is that they may be able to handle the raw time-series directly and forecast the future raw outputs. These outputs could be one or multiple; we call the two cases 'single-step' and 'multi-step' forecasting, respectively.
Recently, there have been many successful attempts to use machine learning methods for automatic time-series forecasting. Some of these methods incorporate information from social media, some deal with a transformed feature space, and some work with various economic indicators. Recent works published under this umbrella include [5,6,7,8].
In this paper, we employ and explore various state-of-the-art deep neural network methods to build models for predicting stock prices. As we wish the model to analyse and understand the factors affecting the prices over a time period and to predict accurately, this problem can also be treated as a time-series analysis problem, where the goal is not only to predict the stock prices but also to show some understanding of the effects of volatility and structural breaks on the prediction [9,10]. In what follows, we outline the significant objectives and contributions of this work.

Objectives and contributions of the study
Our goal is to study the performance of neural machine learning models in forecasting the prices of stocks that have exhibited a significant degree of volatility with numerous structural breaks. Our study is focused on the application of deep neural networks. To the best of our knowledge, few studies have been conducted on Indian stock market data. Therefore, our research involves implementations for the Indian stock market. This makes our present study a new case study in the field of forecasting in the Indian stock market. However, this does not limit our resulting analysis and conclusions to our datasets only; they can be applied to other generic datasets as well.
To analyse the relative performances of deep neural networks in time-series forecasting, we employ the following neural network models: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), RNN with Long-Short Term Memory (LSTM-RNN), and RNN with Gated-Recurrent Unit (GRU-RNN). These deep networks are evaluated for two different ways of time-series forecasting, viz. single-step-ahead stock price prediction and multiple-step-ahead stock price prediction (that is, prediction of a window of stock prices). By employing four different state-of-the-art deep network models with ten different datasets of stock price data from the last 17 years, our present work serves as a good case study on the applicability of deep neural networks to Indian stock market data.

Organisation of this paper
This paper is organised as follows: Section 1 introduced the motivation, problem statement, and major contributions of this study. In section 2, we provide brief details about research efforts made by the community in the field of statistics and machine learning for time-series forecasting. Section 3 provides a detailed description of the data and methodology used by our work. Section 4 describes the simulation setup, summarises the results, and discusses the findings. The paper is concluded in section 5. The detailed results and time-series prediction plots for various stocks for both one-step as well as multi-step forecasting are provided in Appendix A.

Related Works
A useful review of multi-step-ahead forecasting is published in [11], which describes different uses of neural networks. The authors experimented with two constructive algorithms, initially developed to learn long-range dependencies in time-series, that selectively add time-delayed connections to recurrent networks; these produced noticeable results on single-step forecasting. Their experimental evidence also indicates that allowing longer-range delays helps the system learn the series better and improves results on multi-step prediction problems.
Statistical models are another class of tools that are suitable and successful for time-series forecasting. One such model is the Autoregressive Integrated Moving Average (ARIMA) [12]. These models have been quite successful for one-step and sometimes multi-step forecasting. Further, researchers have explored the idea of hybridising ARIMA and other non-statistical models for forecasting [13,14]. The most successful hybrids are the techniques combining neural networks and statistical models, such as [13,15,16]. However, communities continue to compare statistical models against neural network models. One of the latest studies along similar lines is the work done by Namini and Namini [17], where the authors explore the applicability of ARIMA and LSTM-based RNNs. The authors' empirical study suggested that deep learning-based algorithms such as LSTM outperform traditional algorithms such as the ARIMA model. More specifically, the average reduction in error rates obtained by LSTM is around 85% when compared to ARIMA, indicating the superiority of LSTM to ARIMA.
Majumder and Hussian [18] have used an artificial neural network model trained with back-propagation for forecasting. They have studied the effects of hyperparameters, including activation functions. They have critically selected the input variables and have introduced lags between them, building models with delays ranging from 1 to 5 day-lags. The input variables chosen for this model are the lagged observations of the closing prices of the NIFTY Index. The experimental results showed that the tanh activation function performed better. However, the relative performance of the various day-lags depended on the loss function used.
Neeraj et al. [19] have used an Artificial Neural Network (feedforward back-propagation network) for modelling BSE Sensex data. After performing initial experiments, a model was finalised that had 800 neurons with a tan-sigmoid transfer function in the input layer, three hidden layers with 600 neurons each, and an output layer with one neuron predicting the stock price. They built two networks: the first used a 10-week oscillator and the second used 5-week volatility. A 10-week oscillator (momentum) is an indicator that gives information regarding the future direction of stock values. When combined with moving averages, it is observed to improve the performance of the ANN. They used RMSE (Root Mean Squared Error) to calculate errors. They concluded that the first network performed better than the second one for predicting the weekly closing values of BSE Sensex.
In a recent study [20], the authors have used different DL architectures such as RNNs, LSTMs, CNNs, and MLPs. For the first dataset, they trained the networks on TATAMOTORS stock prices and tested the trained model on the stock prices of Maruti, Axis Bank, and HCL Tech. They also built linear models like ARIMA to compare against the nonlinear DNN architectures. Their network had 200 input neurons and ten output neurons; the window size of 200 was chosen after performing error calculations on various window sizes. They also tested this model on two other stocks, Bank of America (BAC) and Chesapeake Energy (CHK), to identify the typical dynamics between different stock exchanges. Their experimental results showed that the models were capable of detecting the patterns existing in both stock markets, whereas linear models like ARIMA were not able to identify the underlying dynamics within the various time series. They concluded that deep architectures (particularly CNNs) performed better than the other networks in capturing abrupt changes in the system.
Our study is a comprehensive addition to the literature in the sense that it employs four different deep models on ten different Indian stock time series with varying degrees of volatility and significant structural breaks over 17 years. Further, it also explores the performance of such models with regard to one-step and multi-step forecasting. This work could be considered a significant benchmarking study concerning the Indian stock market.

Data
In order to provide generalised inferences and value judgements on the performance of neural networks in single-step and multi-step time-series forecasting, stock price datasets are well suited, as their time-series data typically exhibit characteristics like non-stationarity, multiple structural breaks, and high volatility. Further, instead of using a single stock, we used a diversified dataset of 10 different stocks in the Indian stock market. Table 1 describes all the ten stocks that were used for the study. It should be noted that the duration or time-frame of the data for each stock is the same. Furthermore, we use the same dataset of 10 stock prices for both single-step and multi-step forecasting in order to provide a better contrast between the performance of the various deep neural network models across both types of prediction.

Deep Neural Networks (DNN)
We formulate the problem in the following way. Let $x$ be a time-series defined as $x = (x_1, \ldots, x_w, \ldots, x_{w+w_{test}})$, where $x_i$ represents the stock price at time-step $i$, $w$ refers to the window size, and $w_{test}$ refers to the test period for which the forecast is to be evaluated. So, the time-steps $(w+1, \ldots, w+w_{test})$ form a $w_{test}$-period window. Correspondingly, we denote the neural network predictions for time-steps $w+1$ to $w+w_{test}$ as $(\hat{x}_{w+1}, \hat{x}_{w+2}, \ldots, \hat{x}_{w+w_{test}})$.
For single-step forecasting, the goal is to predict $\hat{x}_{w+1}$ given $(x_1, x_2, \ldots, x_w)$. Mathematically, we can express this as:
$$\hat{x}_{w+1} = f(x_1, x_2, \ldots, x_w; \theta)$$
where $\theta$ denotes the learnable model parameters and $f$ represents a deep network.
Multi-step prediction can be done using two approaches: the iterative approach and the direct approach [21]. In the iterative method, the first subsequent period is predicted from past observations. Afterwards, the estimated value is used as an input to predict the next period. The process is carried on until the end of the forecast horizon. The function produces a single value at every future time-step. Let $(x_1, \ldots, x_w)$ be the last window of the input time-series, and let $(x_{w+1}, \ldots, x_{w+w_{test}})$ be the stock values for the forecast horizon $w_{test}$. The goal is to predict $(\hat{x}_{w+1}, \ldots, \hat{x}_{w+w_{test}})$. Using the iterative approach, this can be defined as follows. Consider an iterator variable $j \in \{w+1, \ldots, w+w_{test}\}$. If $w+1 \le j \le 2w$,
$$\hat{x}_j = f(x_{j-w}, \ldots, x_w, \hat{x}_{w+1}, \ldots, \hat{x}_{j-1}; \theta)$$
and, if $j > 2w$,
$$\hat{x}_j = f(\hat{x}_{j-w}, \ldots, \hat{x}_{j-1}; \theta)$$
In the direct multi-step forecast method, successive periods are predicted all at once. Each prediction is related only to the stock values in the input window. We can write this as:
$$\hat{x}_j = f_j(x_k, \ldots, x_{w+k-1}; \theta)$$
where $f_j$ denotes the network output for time-step $j$, $j \in \{w+k, \ldots, w+k-1+w_{test}\}$, and $k$ is a variable used to denote the iterator over the day instance.
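The windowing and the iterative strategy described above can be illustrated with a minimal sketch. The function names (`make_windows`, `iterative_forecast`) and the toy one-step model are hypothetical and are not taken from the shared code; any trained network that maps a window of w values to one value could play the role of `one_step`.

```python
from typing import Callable, List, Sequence, Tuple


def make_windows(series: Sequence[float], w: int,
                 horizon: int) -> List[Tuple[List[float], List[float]]]:
    """Pair every w-length input window with the `horizon` values that follow it.

    horizon=1 gives single-step samples; horizon>1 gives direct multi-step samples.
    """
    pairs = []
    for k in range(len(series) - w - horizon + 1):
        inputs = list(series[k:k + w])
        targets = list(series[k + w:k + w + horizon])
        pairs.append((inputs, targets))
    return pairs


def iterative_forecast(one_step: Callable[[List[float]], float],
                       last_window: Sequence[float],
                       horizon: int) -> List[float]:
    """Apply a one-step model repeatedly, feeding each prediction back as input."""
    window = list(last_window)
    predictions = []
    for _ in range(horizon):
        nxt = one_step(window)
        predictions.append(nxt)
        # Drop the oldest value and append the new prediction.
        window = window[1:] + [nxt]
    return predictions


def toy_model(window: List[float]) -> float:
    """Hypothetical stand-in for a trained network: last value plus one."""
    return window[-1] + 1.0


single_step_samples = make_windows([1, 2, 3, 4, 5], w=3, horizon=1)
direct_samples = make_windows([1, 2, 3, 4, 5, 6], w=3, horizon=2)
iterated = iterative_forecast(toy_model, [1.0, 2.0, 3.0], horizon=3)
```

In the direct strategy the network is trained on pairs such as those in `direct_samples`, so no prediction is ever fed back as an input.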
In the following subsections, we briefly describe the existing deep network tools used in this work. These tools are standard, and the mathematical details could be found in the corresponding references, and therefore, we do not explicitly provide the precise mathematical workings of these models.

Multilayer Perceptron (MLP)
An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer [22]. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilises a supervised learning technique called back-propagation for training [23]. The inputs in our case will be time-series data from a specific window.
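As a rough illustration of how an MLP maps a window of prices to one forecast, the following sketch implements only the forward pass in plain Python. The layer sizes (16, 16, 1) and the relu activation follow the Implementation section; the random weight initialisation, the helper names, and the window size of 5 are assumptions made for the example, and training by back-propagation is not shown.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def relu(values: List[float]) -> List[float]:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully connected layer: weights[j][i] links input i to output unit j."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Assumed initialisation: small uniform weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(window: List[float], layers: List[Layer]) -> float:
    """Pass a window of normalised prices through every layer in turn."""
    activations = list(window)
    for weights, biases in layers:
        activations = relu(dense(activations, weights, biases))
    return activations[0]


rng = random.Random(0)
w = 5  # illustrative window size
sizes = [w, 16, 16, 1]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
prediction = mlp_forward([0.1, 0.2, 0.3, 0.4, 0.5], layers)
```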

Convolutional Neural Network (CNN)
The idea behind CNNs [24] is to convolve a kernel (whose size can be varied) across an array of input values (as in time-series data) and extract features at every step. The kernel moves along the array according to the stride parameter, which determines the step size with which the kernel moves along the input to learn the features required for predicting the final output. In our case, we have applied 1D convolution to our array of stock prices from various time-steps with an appropriate kernel size. This kernel learns the features from that window of the input in order to predict the next value as accurately as possible. This technique, however, does not capture time-series correlations across windows and treats each window separately.
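The roles of kernel size and stride can be seen in a small sketch of a 1D convolution over a price window. The function `conv1d` is a hypothetical helper written for illustration; like most deep learning libraries it computes a cross-correlation (the kernel is not flipped), and it uses no padding.

```python
from typing import List, Sequence


def conv1d(values: Sequence[float], kernel: Sequence[float],
           stride: int = 1) -> List[float]:
    """Slide `kernel` over `values`, moving `stride` positions at a time."""
    k = len(kernel)
    return [
        sum(kernel[i] * values[start + i] for i in range(k))
        for start in range(0, len(values) - k + 1, stride)
    ]


prices = [1.0, 2.0, 3.0, 4.0, 5.0]
difference_features = conv1d(prices, [1.0, 0.0, -1.0])        # stride 1
strided_features = conv1d(prices, [1.0, 0.0, -1.0], stride=2)  # stride 2
smoothed = conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])            # moving average
```

In a trained CNN the kernel values are learned rather than fixed as they are here.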

Recurrent Neural Network (RNN)
RNNs make use of sequential information to learn and understand the input features. They differ from MLPs, where inputs and outputs are assumed to be independent; such conventional methods fail in situations where inputs and outputs influence each other (time-dependence) [25]. RNNs are recurrent in that they process all the steps in a sequence in the same way and produce outputs that depend on previous outputs. In other words, RNNs have a memory that stores the information gained so far. Theoretically, they are expected to learn and remember information from long sequences, but in practice, they have been found to store information from only a few steps back. In our work, we have passed the input time-series data sequentially, one value at a time, into the network. The hidden states are trained accordingly and are used to predict the next stock price. During training, we compare the predicted and true values and try to reduce the error. During testing, we use the previously predicted value to calculate the next time-steps (future stock prices).
(a) Gated-Recurrent Units (GRU) based RNN: The principles of GRU and LSTM [26] cells are similar, in the sense that both are used as "memory" cells and both are used to overcome the vanishing gradient problem of RNNs. A GRU cell, however, has a different gating mechanism with two gates: a reset gate and an update gate [27]. The reset gate determines how much of the previously gained memory or hidden state needs to be forgotten. The update gate decides how much of the previously gained information needs to be passed along the network. The advantage of using the gating mechanism in these cells is the ability to learn long-term dependencies.
(b) Long-Short Term Memory Cells (LSTM) based RNN: LSTM [26] cells were designed to overcome the problem of vanishing gradients in RNNs. Vanishing gradients is a problem faced in deeper networks, where the error propagated through the system becomes so small that training and updating of weights do not happen efficiently. LSTMs overcome this problem by embedding a gating mechanism in each of their cells. They have input, forget, and output gates, which update and control the cell states. The input gate controls how much of the new state computed from the current input is passed on through the network. The forget gate decides how much of the previous state to let through. In the end, the output gate defines how much of the current state to expose to the higher layers (next time-steps).
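To make the gating mechanism concrete, the following is a minimal sketch of one GRU update with scalar input and scalar hidden state. The parameter names in the dictionary are hypothetical, real cells use weight matrices rather than scalars, and the convention for which of z or (1 - z) multiplies the previous state differs between references; the one used here is an assumption of the example.

```python
import math
from typing import Dict, Sequence


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x: float, h_prev: float, p: Dict[str, float]) -> float:
    """One GRU update for a scalar input x and scalar hidden state h_prev."""
    # Update gate: how much of the candidate state replaces the old state.
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])
    # Reset gate: how much of the old state is used to form the candidate.
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])
    # Candidate state built from the input and the reset-scaled old state.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Assumed convention: z weights the candidate, (1 - z) the old state.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(sequence: Sequence[float], p: Dict[str, float]) -> float:
    """Feed a window of prices one value at a time, starting from h = 0."""
    h = 0.0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


zero_params = {name: 0.0 for name in
               ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
half_kept = gru_step(1.0, 0.8, zero_params)

closed_gate = dict(zero_params, bz=-50.0)  # update gate almost closed
memory_kept = gru_step(1.0, 0.8, closed_gate)
```

With all parameters at zero the update gate equals 0.5 and the candidate is 0, so half of the previous state survives; with a strongly negative update-gate bias the previous state passes through almost unchanged.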

Implementation
For single-step forecasting, the input window (i.e. backcast window) size is studied in the set {3, 5, 7, 9, 11, 13, 15}. The implementation for this is straightforward, as explained in section 3.2, where the testing window is a single stock value in the future. For multi-step forecasting, the implementation is conducted for 3 different backcast windows {30, 60, 90} and 4 different forecast windows {7, 14, 21, 28}. The implementation for multi-step forecasting is carried out using the direct strategy as described earlier.
Further, the following details are relevant in our implementations: The original data for prices of all the stocks were normalised to the interval [0, 1]. For each stock, the goal was to use the training set for model building, after which the trained model would be used to predict the whole test set. The train-test split for each stock was done in such a way that the training set comprised stock prices from 1st January 2002 to 1st January 2017, and the subsequent prices formed the testing set.
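The normalisation and the date-based split can be sketched as follows. The helper names and the three toy records are hypothetical; the text does not state whether the scaling bounds were computed on the whole series or on the training set only, so the sketch simply scales whatever list it is given.

```python
from datetime import date
from typing import List, Sequence, Tuple

Record = Tuple[date, float]  # (trading day, price)


def min_max_scale(prices: Sequence[float]) -> List[float]:
    """Rescale prices linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(prices), max(prices)
    if hi == lo:
        raise ValueError("constant series cannot be min-max scaled")
    return [(p - lo) / (hi - lo) for p in prices]


def split_by_date(records: Sequence[Record],
                  cutoff: date) -> Tuple[List[Record], List[Record]]:
    """Records up to and including `cutoff` form the training set."""
    train = [(d, p) for d, p in records if d <= cutoff]
    test = [(d, p) for d, p in records if d > cutoff]
    return train, test


toy_records = [
    (date(2016, 12, 30), 10.0),
    (date(2017, 1, 1), 20.0),
    (date(2017, 1, 2), 15.0),
]
train, test = split_by_date(toy_records, cutoff=date(2017, 1, 1))
scaled = min_max_scale([p for _, p in toy_records])
```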
It should be noted that for all the deep network models, the input size remains equal to the window size (w). The deep networks involve many different hyperparameters; however, given the amount of data and computational resources available to us, we were limited to some manual tuning of these parameters. Due to space constraints, we are unable to provide the details of this tuning. We note that automatically tuning the various hyperparameters of these deep networks could result in better forecast performance. The manually fixed hyperparameters are furnished below:
MLP: There are 2 hidden layers with sizes (16, 16). The output layer has 1 neuron. The activation function in all layers is relu (rectified linear unit).
CNN: There are 4 hidden layers with sizes (32, 32, 2, 32), with the third layer being a max-pooling layer. The output layer has size 1. The activation function used in every layer is relu.
GRU-RNN: There are 2 hidden layers with sizes (256, 128). The output layer has 1 neuron. The activation function used in each layer is relu, with linear activation for the final layer.
LSTM-RNN: There are 2 hidden layers with sizes (256, 128). The output layer has 1 neuron. The activation function used in every layer is relu, with linear activation for the final layer.
The evaluation or loss metric for these models is the mean squared error (MSE). Further, for reliable model evaluation, each model was independently run (trained and tested) 5 times to obtain statistically reliable performance estimates. Consequently, we obtained results in the form of loss intervals corresponding to our predictions on the test datasets versus the actual stock prices. These test loss intervals are reported in the results tables. They summarise the mean and standard deviation of the MSE obtained over the five runs. In the tables, the loss intervals are represented as mean (±std.dev.).
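The loss interval reported for each model can be computed as in the following sketch. The five loss values are made-up numbers for illustration only, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between true and predicted prices."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def loss_interval(run_losses: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the test MSE over independent runs."""
    return mean(run_losses), stdev(run_losses)


# Illustrative test losses from five hypothetical runs of one model.
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
m, s = loss_interval(runs)
table_entry = f"{m:.4f} (±{s:.4f})"
```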
All our implementations are carried out in the Python environment. The deep neural networks are implemented using the Python library Keras. All the experiments are conducted on a machine with an Intel i7 processor, 16GB of main memory, and an NVIDIA 1050 GPU that has 4GB of video memory. We used the Python nsepy library to fetch the historical data for all Indian stocks from the National Stock Exchange (NSE: https://www.nseindia.com/). The code and data are shared via GitHub repository: https://github.com/kaushik-rohit/timeseries-prediction.

Result and Discussion
In this section, we provide a summary of results that are obtained for single- and multi-step forecasting of the 10 different Indian stock data. For clear presentation, we place all the result tables and some sample forecast plots in Appendix A and only provide the statistical test results in this section. However, the individual forecast result tables are referred to in the discussion text.

Single-Step Forecasting
The performance observed for the ACC stock shows that all four deep models perform the prediction task similarly. However, as the window size increases, the predictions of all the models move further away from the true values, increasing the error. Hence, it can be concluded that the next stock price is highly dependent on the immediately preceding prices and less dependent on prices further in the past. However, a different prediction trend was shown by the models for the AXISBANK stock. It can be seen from the graphs of the AXISBANK stock that all the models performed quite well for the smallest window size of 3. The predictions for window size 7 were also good for all the models. However, the results for the other window sizes varied irregularly and were not as good.
A very different trend was seen for BHARTIARTL stock prediction. Table A3 suggests that for smaller window sizes, MLP performed slightly better than the others. However, as the window size increases, CNN starts outperforming all the other models. One aspect common to all these models can be observed in the forecasting graphs (refer to Appendix A): all the models failed to predict the full extent of sudden increases in the prices. Hence, it could be emphasised that the information from the previous trends of stock prices is not sufficient for predicting future prices, which may depend on a variety of factors that have not been incorporated in these models.
A filter-based deep network such as CNN outperforms the other deep models for the CIPLA stock dataset, as shown in Table A4. This holds for all window sizes. However, the results obtained for the HCLTECH stock are quite different. Table A5 shows that GRU-RNN performs much better than the other models. The window size of 13 produced the best result within the GRU model. This demonstrates that the GRU-RNN structure could handle the deviation within the stock prices for an extended period (i.e., w = 13). A similar inference could also be made for the HDFC stock, where both LSTM-RNN and GRU-RNN have performed very well for w = 9 (refer to Table A6). Table A7 shows that an identical trend in performance was observed across different window sizes for INFY stock price prediction. Additionally, CNN required a higher number of input features (i.e., w = 11) to perform to its capacity for this dataset.
The JSWSTEEL stock dataset contains a very high number of structural breaks and is highly volatile. Table A8 shows that this characteristic acted as an adversarial feature for all the models, and hence the models were not able to perform well. However, LSTM-RNN shows some improved performance given a larger input window of 13. Table A9 suggests that a similar trend in performance was also observed for the MARUTI stock dataset, with the surprising result that a model like MLP could perform better than the other deep models with a minimal input window of 3. MLP also performs better than its counterparts for the ULTRACEMCO dataset, as shown in Table A10.

Statistical significance test
The results obtained over five different independent runs of the models are subjected to a statistical significance test. For this, we conduct the Diebold-Mariano (DM) test [28,29]. However, we conduct the DM-test only for the single-step forecasting results. The DM-test compares two hypotheses at a time, and the value is converted into a p-value. From Table 2, it could be concluded that most of the results are significant for any pair of hypotheses. The results of the Diebold-Mariano test at the 0.01% level of significance (α = 0.0001) suggest that the relative order of the deep network models for single-step forecasting, in increasing order of performance, is: GRU-RNN, CNN, LSTM-RNN, and MLP; that is, MLP outperforms all others. We note that the statistical significance test looks at the overall performance of a model rather than its performance on an individual dataset. Because MLP does not encode any long-term dependency arising in the time-series data, it might not be expected to perform as well as standard dependency-learning models such as LSTM- or GRU-RNNs. A possible explanation is that the data used in our present work may not contain any long-term dependencies for which a sequence-based or convolution-based deep model would be particularly useful. Our goal here is not to recommend MLP as the best model for real-world applications of time-series modelling, but rather to present it as a typical deep model that performs well on data that has multiple structural breaks and is highly volatile. However, readers should note that the level of significance plays a crucial role in choosing the performance ordering of the models.
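For readers unfamiliar with the Diebold-Mariano test, the following is a minimal sketch of the statistic for one-step-ahead forecasts. It assumes a squared-error loss, a normal approximation for the two-sided p-value, and no autocovariance correction (which is only adequate for a one-step horizon); the function name and the toy data are hypothetical and do not reproduce the computation behind Table 2.

```python
import math
from typing import Sequence, Tuple


def dm_test(actual: Sequence[float], pred_a: Sequence[float],
            pred_b: Sequence[float]) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) for forecasts A versus B.

    A negative statistic means model A has the lower squared-error loss.
    """
    # Loss differential at every time-step.
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / (n - 1)
    stat = d_bar / math.sqrt(var_d / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


actual = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_a = [1.1, 1.9, 3.1, 3.9, 5.1]  # small errors
pred_b = [1.5, 1.6, 3.6, 3.5, 5.4]  # larger errors
stat_ab, p_ab = dm_test(actual, pred_a, pred_b)
stat_ba, p_ba = dm_test(actual, pred_b, pred_a)
```

Swapping the two models flips the sign of the statistic and leaves the p-value unchanged, which is why each pair needs to be tested only once.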

Multi-step Forecasting
Multi-step forecasting has always been a challenging problem in time-series prediction. The results are in Table A. For Table B1 through to Table B10, the multi-step forecast results suggest that for a small forecast window, the deep network methods perform well for all the datasets. As the forecast window size is increased (for example, to 28), the performance drops significantly. The performance of the four deep network models for the ACC stock data suggests that the MLP needs to observe as many as 30 input days to accurately predict 7 days of future data. This is expected for a densely connected network like an MLP, where the salient features are constructed in its intermediate hidden layers. This observation also holds for the other stocks except for the JSWSTEEL stock. Furthermore, it contrasts with larger inputs such as 60 or 90 days, where the additional days do not add any useful information to the model. For the JSWSTEEL stock, by contrast, the performance of the MLP model is best at 60 input days for producing a 7-days-ahead forecast of stock prices.
The GRU-RNN model needs a large input, such as 60 or 90 days, to make predictions for 7 days in the future, whereas for the LSTM-RNN and CNN, 30 days of input is sufficient to produce accurate future predictions. Similarly, looking at the performance of all the models for all the forecast windows considered in this work, {7, 14, 21, 28}, we note that MLP outperforms all other deep models for the majority of stocks. To support this observation, we conduct a statistical significance test for a sample input-output combination.

Statistical significance test
The DM-test results for multi-step forecasting with input window size 30 and output window size 7 are in Table 3. The level of significance is set at 0.1%. For comparing the relative forecasting performance of any pair of models from the table, we take a majority vote based on DM-test analysis for each of the 10 stocks. Accordingly, for each pairwise comparison, one model is chosen as the better of the pair if it is found to be the better model for more than 5 out of 10 stocks based on the DM-test p-value analysis for that pair. It is observed that MLP outperforms all the other deep network approaches for this combination of input and output windows. This observation is consistent with the observation for single-step forecasting performance as well. The overall relative order of the different neural networks for multi-step forecasting, in increasing order of performance, is found to be: CNN, LSTM-RNN, GRU-RNN, and MLP. Readers should note that the level of significance plays a crucial role in choosing the performance ordering of the models.
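The majority-vote rule can be sketched as below. The function name, the sign convention (a negative DM statistic favours model A), and the example statistics are assumptions made for illustration; they are not values from Table 3.

```python
from typing import List, Optional, Tuple


def pairwise_winner(per_stock: List[Tuple[float, float]],
                    alpha: float = 0.001) -> Optional[str]:
    """Majority vote over stocks for one pair of models (A versus B).

    per_stock holds one (DM statistic, p-value) pair per stock. A stock counts
    as a win only when the difference is significant at level alpha.
    """
    wins_a = sum(1 for stat, p in per_stock if p < alpha and stat < 0)
    wins_b = sum(1 for stat, p in per_stock if p < alpha and stat > 0)
    half = len(per_stock) / 2
    if wins_a > half:
        return "A"
    if wins_b > half:
        return "B"
    return None  # no model wins on more than half of the stocks


# Ten hypothetical stocks: A wins six significantly, B wins three, one is a tie.
clear_case = [(-4.0, 0.0001)] * 6 + [(4.0, 0.0001)] * 3 + [(-1.0, 0.3)]
tied_case = [(-4.0, 0.0001)] * 5 + [(4.0, 0.0001)] * 5

clear_result = pairwise_winner(clear_case)
tied_result = pairwise_winner(tied_case)
```

The default `alpha=0.001` corresponds to the 0.1% level of significance used for the multi-step comparison.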

Conclusion
In this paper, we comprehensively studied the applicability of the popular deep neural networks (DNN) as function approximators for non-stationary time-series forecasting. Specifically, we evaluated the following DNN models: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), RNN with Long-Short Term Memory Cells (LSTM-RNN), and RNN with Gated-Recurrent Unit (GRU-RNN). These four powerful DNN methods have been evaluated over datasets of ten popular Indian financial stocks. Further, the evaluation is carried out through predictions in both fashions: (1) single-step-ahead, and (2) multi-step-ahead. The training of the deep models for both single-step and multi-step forecasting has been carried out using over 15 years of data, and the models have been tested on two years of data. Our experiments show the following: (1) the neural network models used in these experiments demonstrate good predictive performance for single-step forecasting across all stock datasets; (2) the predictive performance of these models remains consistent across various forecast window sizes; and (3) given the limited input window for multi-step forecasting, the performance of the deep network models is not as good as that seen for single-step forecasting. However, notwithstanding this limitation of the models for multi-step forecasting, given the vast amount of data collected over a duration of 17 years on which the models are built, this work could be considered a significant benchmark study with regard to the Indian stock market.
Further, we note the following observation. The deep network models are built with the raw time-series of stock prices. That is, no external features such as micro- or macro-economic factors, other statistically handcrafted parameters, or relevant news data are provided to these models. These features are often considered useful for stock price prediction. A model that takes these additional factors into account could improve the predictive performance of both single-step and multi-step forecasting.

A Appendix: Forecasting Results
The forecasting plots show the average of five independent runs of the programs during testing, and this average is compared with the true value. Due to space constraints, we provide results for only one stock dataset (ACC). However, similar performance was also observed for the other datasets; these results can be found at https://github.com/kaushik-rohit/timeseries-prediction.