A hybrid crude oil price forecasting framework: Modified ensemble empirical mode decomposition and hidden Markov regression

The considerable influence of crude oil prices on the international economy has motivated numerous scholars to develop various prediction models. Two difficulties are encountered in forecasting. One is that the time series of crude oil prices show massive jumps in high frequency. The other is that the time series of crude oil prices are characterised by nonlinearity, structural breaks and highly volatile states. Therefore, we propose a hybrid model that incorporates the principle of decomposition–integration. Its overall process can be divided into the steps of decomposition and integration. In the decomposition step, modified ensemble empirical mode decomposition is used to decompose the data on crude oil prices. Next, K‐means and principal component analysis are utilised to extract the optimal intrinsic mode function (IMF) for each frequency group from the decomposed IMFs. Following decomposition, different forecasting methods are matched to IMFs with different frequencies. Because high‐frequency series exhibit structural breaks and highly volatile states, hidden Markov regression is applied to describe predicted values in a probabilistic manner, providing objective interpretation and forecasting. A fully connected neural network is used to predict intermediate‐frequency IMFs. Because the low‐frequency IMF and the residual sequence are more stable than the other frequency sequences, the autoregressive integrated moving average model is applied to them. Finally, all individual predictions are ensembled as the final prediction. Results confirm that the proposed model is better than several common hybrid and single models in terms of prediction accuracy and Diebold–Mariano test results, indicating that the newly proposed hybrid model is a promising tool for the analysis and forecasting of crude oil prices.


| INTRODUCTION
Crude oil is the world's largest and most actively traded commodity, accounting for more than 10% of the total world trade. As a result, large swings in oil prices can disrupt overall economic activity [2][3][4]. Crude oil prices are determined by the forces of supply and demand but are highly influenced by numerous irregular external events, such as the production of other commodities (for example, renewable energy, which may have a substitution effect), economic and technological developments in financial markets, and political factors. Figure 1 plots the volatility of Brent crude oil prices from 1993 to 2022. It shows that irregular external events can cause large fluctuations in the crude oil market. For instance, the 1998 Asian financial crisis led to a sharp decline in crude oil prices. The Iraq War in 2003 caused the price of crude oil to rise sharply. The COVID-19 pandemic in 2020 led to a rapid decline in crude oil prices. This review illustrates that supply and demand forces and irregular external events result in the sharp fluctuations and nonlinear characteristics of oil prices [5]. Given that crude oil is the main energy source that dominates economic activity, finding a forecasting method for oil prices is a perennially crucial subject that has garnered wide attention from policymakers, practitioners and other financial market investors [6,7]. Prediction methods include single and hybrid modelling. The basic concept of hybrid modelling is decomposition-integration.
Hybrid models can effectively eliminate the influence of noise in data and remarkably improve prediction accuracy. However, the original decomposition method has some disadvantages, such as mode mixing, which reduces prediction accuracy. Furthermore, most decomposition methods depend on subjectively chosen parameters, and misjudging these parameters also reduces prediction accuracy. At the same time, decomposed sequences have different local characteristics, yet few studies have provided reasonable explanations on how to choose suitable prediction methods for them. The objective of our study is to build an effective decomposition-ensemble procedure for forecasting oil prices. Modified ensemble empirical mode decomposition (MEEMD) and hidden Markov regression (HMR) are combined to predict oil prices.
The main contribution of our work is a new model that incorporates the principle of decomposition-integration for reliable oil price forecasting. Validation reveals that the proposed model also presents more robust results than other models. In detail, first, we propose an adaptive hybrid model incorporating MEEMD, HMR, neural network (NN), autoregressive integrated moving average (ARIMA) and ARMA to improve the forecasting accuracy of nonstationary and nonlinear crude oil prices and thus address the drawbacks described above. The raw series of crude oil prices is decomposed by MEEMD, and the mode confusion of sequences is evaluated through multiscale permutation entropy (MPE). Subsequently, in consideration of the data characteristics of each component, HMR, NN, ARIMA and ARMA are applied to forecast each component independently, and the predicted values of each component are aggregated as the final forecast. Empirical studies show that our proposed model addresses the nonlinearity and nonstationarity of crude oil prices better than other well-known models. Our proposed model can not only decompose fluctuations in oil prices accurately but also eliminate the misjudgement of fluctuation decomposition caused by subjective parameters. This ability reduces the mode confusion of decomposition. Second, we propose the use of HMR for prediction to improve interpretability and reduce the heavy reliance on the setting of parameters and initial values. This model can accurately describe structural breaks and highly volatile states and eliminate the misjudgement of structural break states by classifying and redefining the values of sequences and then describing the relations and laws between each type of value through probability.
The remainder of this paper is organised as follows. Section 2 presents a thorough review and discussion of different methods. Section 3 provides an overview of the methodologies. Empirical results are given in Section 4. Section 5 concludes this paper.

| LITERATURE REVIEW
The related literature on the forecasting of crude oil prices widely discusses the theoretical frameworks of single and hybrid models. The first class is single models, which focus on classic econometric and statistical approaches [2]. Guha and Bandyopadhyay presented an inside view of the application of an ARIMA time series model to forecast future gold prices in Indian browsers [8]. Ediger and Akar used ARIMA and seasonal ARIMA to estimate the future primary energy demand of Turkey [9]. Although single models provide accurate forecasts, they are built on the assumption that data series are linear and stationary and therefore cannot easily predict crude oil prices, which are nonlinear and nonstationary [10]. Single models encompass various statistical models: ARIMA, generalised autoregressive conditional heteroskedasticity (GARCH), vector autoregression, linear regression and the grey model.
The second class is hybrid models, including mixed single and multiscale models [11,12]. Mixed single models, such as ARIMA-support vector regression (SVR) [13,14] and ARIMA-GARCH [15], combine simple single models. Xu et al. proposed a hybrid ARIMA-SVR model based on the advantages of linear and nonlinear models [16]. However, mixed single models still cannot overcome the shortcomings of the assumption that data series are linear and stationary. Multiscale models have demonstrated their superiority in forecasting nonlinear and nonstationary crude oil prices [17,18]. Such models decompose a complex time series into a few simple components, predict each component individually and finally ensemble all predicted values as the final result [10]. Al-Alimi et al. provided a suitable model that can address sudden fluctuations in the energy market and different kinds of energy data sets. Their methods mainly combine the features of long short-term memory (LSTM) and artificial neural network (ANN) and employ transfer learning to stop the modification of the backward propagation of LSTM layers and modify their output [19]. Al-Qaness et al. presented a metaheuristic optimisation algorithm for optimising the DNR model and used the model to forecast oil production [20]. Al-Qaness et al. applied a modified Aquila optimiser with the opposition-based learning technique to forecast oil production [21]. These studies show that the hybrid model is not only based on the concept of decomposition-integration but also combines the advantages of all models and compensates for the disadvantages of single and mixed single models.
Forecasting accuracy directly depends on decomposing the components of time series by capturing the characteristics of time series [11,16]. The decomposition-integration principle is considered as a multiscale model tool for analysing nonlinear crude oil data and can considerably improve prediction efficiency. Studies on oil price forecasting demonstrate that prediction is sensitive to the nonlinearity, structural breaks and highly volatile behaviour of time series. Therefore, in this study, we propose the precise prediction of crude oil prices through the wavelet decomposition of the complex signals of a price time series into separate specific components. The principle of wavelet decomposition is the transformation of a complex sequence into a superposition of several simple sequences. Empirical mode decomposition (EMD) is the natural technique for capturing the local characteristics of the time series of crude oil prices. To reduce the mode confusion of EMD, Wu and Huang [22] proposed the addition of white noise to EMD to eliminate the noise in the original sequence; given that this approach reduces mode confusion, it has been widely applied in oil price forecasting [10,17,23]. However, the white noise added by ensemble EMD (EEMD) cannot completely neutralise the noise of the original sequence. 6][27] However, the setting of the amplitudes of white noise and increased iteration times is mainly based on subjective experience, and unreasonable parameter settings result in high time complexity and mode confusion [2,11,16,17,22]. Wu et al. proposed a novel model based on EEMD and LSTM for crude oil price forecasting; their results demonstrate that the decomposition tool is promising for forecasting crude oil prices [28].
In this study, crude oil prices are decomposed by MEEMD [29], and the mode confusion of decomposed sequences is evaluated by MPE. The setting of white noise amplitudes and increased iteration times is mainly based on subjective experience, and unreasonable parameter settings result in high time complexity and mode confusion. To address this issue, MPE is used to calculate the randomness of the complexity between reconstructed decomposed sequences by amplifying the slight changes in the time series, and the randomness of the complexity is then evaluated on the basis of the entropy of the table operator sequence [29]. Multiscaling is reflected by the smoothing of the sequence dimension to retain sufficient information while reducing the dimension of the original sequence.
Next, an approach is designed for forecasting each component after the time series of crude oil prices is decomposed by MEEMD. HMR, NN and ARIMA are each matched to the intrinsic mode function (IMF) of the appropriate frequency for prediction. On the basis of Markov ideas, we use past information to describe the optimal parameters for modelling and prediction based on the hidden Markov model (HMM). HMR classifies and redefines the values of sequences and then describes the relationships and laws between each value through probability. It is thus suitable for the short-term prediction of high-frequency sequences. Additionally, NN is used to predict intermediate-frequency sequences because it adapts to nonlinear or fluctuating characteristics [30]. ARIMA is selected as the prediction model for stationary low-frequency sequences.

| Decomposition based on MEEMD
MEEMD is a decomposition model that combines the advantages of MPE and CEEMD. In 2.1, we first introduce the concept of permutation entropy (PE) and then the steps of MEEMD. MPE consists of the reconstruction of the phase space of the signal and the calculation of PE [31]. The phases of PE calculation are as follows:
Phase 1: The concept of MPE is to obtain the signal of multiscale coarse graining [32] and then calculate the PE of the signal. The sequence $\{y(i),\ i = 1, 2, \ldots, N\}$ is set, $\lambda$ is the time delay factor, $m$ is the embedded dimension, $r$ is a positive integer and […]. The sequence of $y$ is shown below: […] (2) where $U(r)$ is the symbol sequence of the $r$th component in the phase space, and $y_r^j$ […]. Therefore, the sequence is defined as […] where […]. Formula (3) is standardised to obtain Formula (4), and the value range of the embedded dimension $m$ is suggested in previous works [3,10,31].
The highlight of MEEMD is the addition of MPE for component screening. The general phases of MEEMD are as follows:
Phase 1: The original signal is decomposed by CEEMD. Let the original signal be $I(t)$. Next, the opposite white noises $n_i(t)$ and $-n_i(t)$ with zero mean value are added: […] where […].
Phase 2: Yao et al. [29] suggested that the optimal range of the corresponding threshold $\tau$ is $[0.55, 0.6]$. When we obtain $k$ components and $\eta$ abnormal components from PE, the $k - \eta$ pseudocomponents are then represented as […]. We then decompose $R(t)$ through EMD and arrange all the components in order from high frequency to low frequency, together with a residual sequence (RS).
CEEMD reduces the reconstruction error caused by white noise and the mode confusion of decomposed local signals while ensuring that its decomposition effect is similar to that of EEMD. PE is a method for detecting the randomness of time series and can effectively catch nonstationary signals with mutations. Therefore, MEEMD is suitable for the decomposition of volatile crude oil prices.
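The PE and coarse-graining steps described in this subsection can be illustrated with a minimal sketch. It follows the standard ordinal-pattern definition of permutation entropy, normalised by $\ln(m!)$, and may differ in notation and detail from Formulas (1) to (4). The function names and default arguments (`m=3`, `delay=1`, `scales=(1, 2, 3)`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def coarse_grain(series, scale):
    """Average non-overlapping windows of length `scale` (multiscale step)."""
    n = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def permutation_entropy(series, m=3, delay=1):
    """Normalised permutation entropy in [0, 1] for embedding dimension m."""
    n_vectors = len(series) - (m - 1) * delay
    if n_vectors <= 0:
        raise ValueError("series too short for the chosen m and delay")
    patterns = Counter()
    for i in range(n_vectors):
        window = [series[i + j * delay] for j in range(m)]
        # The ordinal pattern is the index order that sorts the window.
        patterns[tuple(sorted(range(m), key=window.__getitem__))] += 1
    probs = [count / n_vectors for count in patterns.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(math.factorial(m))


def multiscale_pe(series, scales=(1, 2, 3), m=3, delay=1):
    """Permutation entropy of the coarse-grained series at each scale."""
    return {s: permutation_entropy(coarse_grain(series, s), m, delay) for s in scales}
```

A strictly increasing series contains a single ordinal pattern and therefore has an entropy of zero, whereas an irregular series approaches one. In a screening step, a component whose entropy exceeds the threshold $\tau$ would be treated as abnormal; how the threshold is applied here is an assumption of this sketch.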

| HMR
HMMs are stochastic models wherein observations are assumed to follow a mixed distribution but the parameters of the components are governed by a Markov chain that is unobservable [33]. They are regarded as excellent classification models and are widely used for forecasting crude oil prices. We propose HMR based on HMM. The model is briefly described below.
First, the parameters are defined: $H$ is the homogeneous Markov chain based on the state space […], $\pi$ is the initial distribution of $H$, $A = (a_{ij})$ is the transition probability matrix for $H$, namely, $a_{ij} = P(H_{t+1} = j \mid H_t = i)$, and […] is the probability that the observed value is equal to $v_k$ at moment $t$ in state $j$.
[…]} is the set of all possible observations. The initial probability is $\pi$ […]. Subsequently, the model parameter $\lambda^*$ is obtained by using the Baum-Welch algorithm [34] for parameter estimation.
Fourth, the Viterbi algorithm [34] is used to track the optimal state sequence $\hat{i}^*_l$, $1 \le l \le T$. In accordance with the above calculation, $o_T$ corresponds to the state $\hat{i}^*_T$. We select […]; […] represents all the observed values in interval $\hat{i}^*_{T+1}$; and the estimated value at time $T + 1$ is […]. In practice, prices fluctuate for a long time in a certain state and then shift over time. To avoid $\hat{i}^*_l$ always appearing in the same state, we consider forecasting the optimal state sequence […]. Therefore, when […]. Apparently, […] (13). On the basis of the principle of exponential smoothing [35], $E f(\omega, t) = \hat{X}(\omega)$ is the estimate of the forecasted value of $X(\omega)$ […]. Therefore, the forecasted value sequence can be obtained as […].
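The state-tracking step can be illustrated with a minimal Viterbi sketch for a discrete HMM. It assumes that the initial distribution, transition matrix and observation probabilities have already been estimated (for example, by Baum-Welch). The helper `most_likely_next_state`, which picks the state with the highest transition probability, is an assumption of this sketch because the corresponding selection formula is not legible in the text above. All function and argument names are hypothetical.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Return the most probable hidden-state path for a discrete HMM.

    initial[i]        : probability of starting in state i
    transition[i][j]  : probability of moving from state i to state j
    emission[i][k]    : probability of observing symbol k in state i
    """
    n_states = len(initial)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[i] is the best log-probability of any path ending in state i.
    score = [log(initial[i]) + log(emission[i][observations[0]]) for i in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log(transition[i][j]))
            pointers.append(best_i)
            new_score.append(score[best_i] + log(transition[best_i][j]) + log(emission[j][obs]))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the most probable final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def most_likely_next_state(current_state, transition):
    """State with the highest transition probability out of current_state."""
    row = transition[current_state]
    return max(range(len(row)), key=row.__getitem__)
```

Working in log-probabilities avoids numerical underflow on long sequences, which matters when the observation series spans several years of prices.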

| The scheme of the model
The scheme for forecasting crude oil prices is shown in Figure 2 and includes the following steps:
Step 1: Decomposition of the data on crude oil prices. MEEMD is adopted to decompose crude oil prices. Subsequently, IMFs and a low-frequency sequence are obtained.
Step 2: IMF grouping. For IMFs, we mainly use K-means and principal component analysis (PCA) as our grouping technique to cluster [36] and distinguish sequences with different frequencies. First, we use K-means to identify similarities between series and to capture the principal pattern of IMF components. Second, IMFs are divided into high-frequency sequences ($\mathrm{IMF}_{H_i}$), intermediate-frequency sequences ($\mathrm{IMF}_{I_i}$) and low-frequency sequences ($\mathrm{IMF}_{L_i}$). We set $m$, $n$ and $p$ as positive integers. Subsequently, […]. Next, we apply PCA [37] to extract the principal frequency sequence in each frequency grouping. The optimal IMFs with different frequencies are obtained by using Formula (17): […]
Step 3: Individual forecasting with certain methods. First, $\mathrm{IMF}_H$ is forecasted by using HMR. We apply HMR to predict $\mathrm{IMF}_H$ because the local features of $\mathrm{IMF}_H$ are more obvious than those of other frequency sequences. Second, $\mathrm{IMF}_I$ is forecasted by NN. NN is used for the fitting and prediction of intermediate-frequency sequences. ANN has been recently used to provide powerful solutions to the prediction of nonlinear series by Shambora & Rossiter and Kristjanpoller & Minutolo [38,39]. The iterative method is an effective approach for forecasting with NN. First, the subsequent period is predicted on the basis of past observations. Afterwards, the estimated value is used as input; the next period is thereby predicted [40]. Yu et al. [41] studied the internal correlation structures of each intrinsic mode component by utilising the NN model and presented an oil price forecasting model with EMD-based multiscale NN learning. Some important components are selected as the final NN inputs with NN weights. In our model, the input values are related to one another, so we do not remove any of them and thereby capture the fluctuations of the input sequence. Second, NN can capture overlooked volatility effects [42]. Therefore, let $n$ be a point in time. $\mathrm{IMF}_{I_n}$ is thus […].
Third, $\mathrm{IMF}_L$ and RS are forecasted by ARIMA because, compared with $\mathrm{IMF}_H$ and $\mathrm{IMF}_I$, the low-frequency series and RS are relatively stationary.
Step 4: Overall prediction. An individual forecasting model is developed for each IMF group with a different frequency, and the final prediction is an integration of the individual predictions from each model.
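The iterative forecasting idea in Step 3 and the aggregation in Step 4 can be sketched as follows. This is a minimal illustration, not the fitted models of this study: `mean_model` and `drift_model` are hypothetical stand-ins for trained one-step predictors (such as the NN or ARIMA), and summing the component forecasts assumes that the decomposition is additive, so that the components add up to the original series.

```python
def recursive_forecast(history, one_step_model, window, horizon):
    """Forecast `horizon` steps by feeding each prediction back as input."""
    values = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step_model(values[-window:])
        forecasts.append(next_value)
        values.append(next_value)  # the estimate becomes part of the next input
    return forecasts


def aggregate(component_forecasts):
    """Sum component forecasts point by point to obtain the final forecast."""
    return [sum(step) for step in zip(*component_forecasts)]


# Hypothetical stand-ins for fitted one-step models.
def mean_model(window):
    return sum(window) / len(window)


def drift_model(window):
    return window[-1] + (window[-1] - window[0]) / (len(window) - 1)
```

For example, `aggregate([recursive_forecast([1, 2, 3], drift_model, 3, 2), recursive_forecast([2, 4], mean_model, 2, 2)])` combines two component forecasts into a single two-step forecast. Because each prediction is reused as input, errors can accumulate over long horizons, which is one reason to keep the horizon short for volatile components.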

| Performance of forecasting accuracy
The predictions of our model are compared with those of benchmark models, such as the ARIMA, Holt-Winters, nonlinear autoregressive (NAR), SVR and NN models.
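The predictions are evaluated in terms of root-mean-square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{\varepsilon}\sum_{t=1}^{\varepsilon}\left(x_t^* - \hat{x}_t^*\right)^2}, \qquad \mathrm{MAE} = \frac{1}{\varepsilon}\sum_{t=1}^{\varepsilon}\left|x_t^* - \hat{x}_t^*\right|, \qquad \mathrm{MAPE} = \frac{100\%}{\varepsilon}\sum_{t=1}^{\varepsilon}\left|\frac{x_t^* - \hat{x}_t^*}{x_t^*}\right|,$$
where $\varepsilon$ is the number of observations in the test data set, and $x^*$ and $\hat{x}^*$ are the actual and predicted values, respectively. Moreover, we perform the Diebold-Mariano (DM) test to compare prediction models and determine whether significant differences in predictions exist. The DM test is described as follows:
$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0)/T}},$$
where $\hat{f}_d$ is the consistent estimate of the loss differential spectral density (evaluated at frequency zero), $\bar{d}$ is the mean of the two predicted loss differentials and $T$ is the length of the predicted time series.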
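The evaluation measures above can be computed with a few lines of code. The sketch below assumes squared-error loss and one-step-ahead forecasts for the DM statistic, in which case the long-run variance term reduces to the sample variance of the loss differential; a multi-step comparison would need additional autocovariance terms. The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def dm_statistic(actual, pred_a, pred_b):
    """One-step-ahead Diebold-Mariano statistic under squared-error loss.

    Positive values mean that pred_a has the larger average loss.
    """
    d = [(a - pa) ** 2 - (a - pb) ** 2 for a, pa, pb in zip(actual, pred_a, pred_b)]
    n = len(d)
    d_bar = sum(d) / n
    # For one-step forecasts, the long-run variance reduces to the sample variance of d.
    variance = sum((x - d_bar) ** 2 for x in d) / n
    return d_bar / math.sqrt(variance / n)
```

Under the null hypothesis of equal accuracy, the statistic is approximately standard normal, so its absolute value can be compared with a normal critical value such as 1.645.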

| EMPIRICAL RESULTS
In this study, two main crude oil price series and Brent (per barrel/US dollar) crude oil prices (indexmundi.com) are chosen as experimental samples. The volatility of crude oil prices during October 2009-September 2019 is plotted in Figure 3, which shows the large fluctuations in crude oil markets and the wide price range.
Table 1 reports the basic statistics of crude oil prices, including the mean, standard deviation, median, minimum, maximum, skewness, kurtosis and Shapiro-Wilk test statistics. Hence, our model can normally decompose time-varying volatility and provide good estimates and forecasts of every sequence.
As discussed in Section 3.3, we examine correlations at different scales by decomposing Brent oil prices into different modes by using MEEMD, allowing us to obtain IMFs with different frequencies. These IMFs are illustrated in Figure 4. All of the IMFs present changing frequencies and amplitudes, and the volatility of each IMF decreases from high frequency to low frequency. Given that our decomposition results contain only three component sequences and one RS, the sequences of different frequencies are identified directly by K-means. Subsequently, IMF1 is taken as $\mathrm{IMF}_H$, IMF2 is taken as $\mathrm{IMF}_I$ and IMF3 is taken as $\mathrm{IMF}_L$. Figure 4 shows that $\mathrm{IMF}_H$ (IMF1) fluctuates considerably within [−20, 20]. The fluctuation range of $\mathrm{IMF}_I$ (IMF2) and $\mathrm{IMF}_L$ (IMF3) is smaller than that of RS. In general, crude oil prices are showing signs of fatigue.
In practice, the continuing effect of the US subprime mortgage crisis in 2008 and the global financial crisis increased crude oil prices. During the 2011 Greek and European debt crises and 2013 European sovereign debt crisis, prices remained high. The establishment of the Asian Infrastructure Investment Bank in 2015 helped change the financial situation. Although crude oil prices may fluctuate under the influence of some major events, they return to their normal trends after a certain period in terms of IMFs. Therefore, our experiment only analyses crude oil prices at each frequency and does not consider the temporal effect of external events on crude oil prices.
First, we use the decomposed IMFs to perform experiments. We calculate the states of $\mathrm{IMF}_H$. We select 0.3 as the threshold value for classification. The four states are positive1 (6, 12], positive2 (0.3, 6), negative1 (−6, 0.3] and negative2 (−6, −13]. The results obtained after the calculation of the transfer matrix are shown in Table 2.
The transfer matrix shows that the fluctuation of $\mathrm{IMF}_H$ easily stabilises and then transfers to other states after a period of time and that low values of $\mathrm{IMF}_H$ increase after a sudden collapse. Each state is held for 6 points and then moves in the opposite direction.
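The construction of a transfer matrix from a discretised series can be sketched as follows. The sketch counts transitions between consecutive states and normalises each row, which is a simple empirical estimate rather than necessarily the estimation procedure used for Table 2. The boundary list passed to `discretise` is hypothetical, and intervals are treated as right-closed.

```python
from bisect import bisect_left


def discretise(values, boundaries):
    """Map each value to the index of the right-closed interval it falls into."""
    return [bisect_left(boundaries, v) for v in values]


def transition_matrix(states, n_states):
    """Row-normalised counts of transitions between consecutive states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no observed outgoing transition keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With three illustrative boundaries such as `[-6, 0.3, 6]`, the series is mapped to four states, and each row of the resulting matrix gives the empirical probability of moving from one state to each of the others.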
Next, NN is used to predict $\mathrm{IMF}_I$ with a 9:1 split between training and test sets. Figure 5 shows that the R value is 0.964, which suggests that the model does not exhibit overfitting. Therefore, we have reason to believe that the predictions are valid.
We use the auto.arima function in the R language to select the optimal parameters in the calculation of $\mathrm{IMF}_L$ predictions. ARIMA(1, 1, 1) and AR(2) are fitted to $\mathrm{IMF}_L$ and RS, respectively. The resulting prediction performance is shown in Table 3. Table 3 shows that the DM values range from 4 to 36.55 and are higher than the critical value of 1.645 at the 5% significance level. That is, the forecasts of our proposed hybrid model differ significantly from those of the other models, and the proposed model achieves the best forecasting performance amongst all models.
Next, we discuss the details of forecasting performance on the basis of MAPE, RMSE and MAE. The RMSE and MAE of the proposed prediction model are drastically lower than those of other well-recognised approaches. The MAE of our method is 6.17, which is smaller than that of ARIMA (19.13), Holt-Winters (29.31), NAR (14.77), SVR (13.33), ARIMA-SVR (7.04), NN (20.83), LSTM (58.33) and EEMD-LSTM (24.96). This result may be attributed to the application of an effective decomposition method to extract the local volatility of crude oil prices and the selection of a suitable nonlinear model for fitting and forecasting the IMFs of different frequencies. In addition, ARIMA-SVR remains superior to the other methods except the proposed method; it surpasses the ARIMA model because it not only retains the effectiveness of the traditional time series model but also considerably improves prediction accuracy through SVR optimisation. Furthermore, the RMSE of the new method is lower than that of almost all the other models. This result indicates that in most cases, the predictions of the proposed method deviate only slightly from actual crude oil prices. However, we have to admit that the traditional econometric ARIMA models also have acceptable predictive performance, although their MAPE values are in most cases slightly worse than those of other models. Moreover, the prediction accuracy of our model is higher than that of ARIMA models mainly due to the poor performance of individual models and the important influence of integration strategies (i.e., simple average) on overall prediction efficiency. Similarly, we find that our model is optimal in RMSE analysis mainly because crude oil prices are characterised by high noise and nonlinear and complex factors. Moreover, our model can better capture the fluctuations of each local characteristic than other models.
In summary, our proposed hybrid model with filtering, reconstruction and integration is suitable for forecasting complex crude oil prices. Meanwhile, the proposed HMR can fully capture the volatility of $\mathrm{IMF}_H$ and improve the performance of the integrated model.
FIGURE 5 Experimental results of $\mathrm{IMF}_I$. IMF, intrinsic mode function.

| CONCLUSION
The complexity of crude oil prices leads to less-than-ideal predictions for two main reasons. One is that the time series of crude oil prices have massive jumps in frequency. Therefore, decomposing the time series is important. The other is that the time series of crude oil prices are characterised by nonlinearity, structural breaks and highly volatile states. Consequently, we propose a new hybrid model that matches these characteristics and yields more accurate predictions. Our work incorporates the principle of decomposition-integration into the proposed methodology, which was developed to forecast crude oil prices. Following the philosophy of decomposition, we perform MEEMD to obtain IMFs with different local features and then use K-means and PCA to extract the optimal grouping of IMFs with different frequencies. Next, in accordance with the philosophy of integration, we perform individual forecasting with HMR, NN and ARIMA and finally ensemble all individual predictions as the final prediction.
Our work has certain novelties, which are mainly reflected as follows: (1) We address the unreasonable parameter settings that result in high time complexity and mode confusion when decomposing IMFs. We calculate the randomness of the complexity between the reconstructed IMFs by amplifying the slight changes in the time series and then evaluate the randomness of complexity on the basis of the MPE of each IMF. If an IMF presents obvious complexity, then it is eliminated until the optimal decomposed sequence is obtained. (2) HMR is used to handle the nonlinearity, structural breaks and highly volatile states of high-frequency series. The HMR model selects the optimal parameters by using past information to describe predicted values in a probabilistic manner, providing objective interpretation. With time, the HMR model chooses the optimal parameters for modelling and prediction. Therefore, this method is suitable for the short-term prediction of high-frequency sequences in nonlinear series.
However, our proposed method has some limitations. For example, it does not consider the effect of external events, such as wars, political conflicts and substitution with other energy forms, on crude oil prices. Our prediction results mainly depend on historical data, so external events cause the prediction accuracy of our model to decline. Nevertheless, mining historical information to predict future crude oil prices can help further understand the characteristics of oil prices. The future development of the model will involve expanding its application areas, such as stock prices. At the same time, external effects will be incorporated to forecast future crude oil prices.

FIGURE 3 Price of Brent (per barrel/US dollar) crude oil.
TABLE 1 Statistical description for price of Brent oil.
[…] the state moves to the suboptimal probability of state probability. For illustrating the validity of the model, we set the random variable $X_n$ at any time as $o_N$, set $M > 0$ […] of $f(\omega, t)$ exists, and we set […].
TABLE 2 Transfer matrix of $\mathrm{IMF}_H$.
(3) The empirical results obtained by incorporating Brent crude oil prices indicate that the proposed hybrid model performs better than ARIMA, Holt-Winters, NAR, SVR, NN and LSTM in terms of prediction accuracy and DM test statistics. This result confirms that the proposed hybrid model outperforms some single machine learning and econometric models. Meanwhile, the proposed model achieves satisfactory results in MAE, MAPE and RMSE analyses and the DM test, which indicates that the proposed model has better accuracy than common hybrid models, such as ARIMA-SVR and EEMD-LSTM.