Forecasting carbon price using signal processing technology and extreme gradient boosting optimized by the whale optimization algorithm

Predicting carbon prices is crucial for the growth of China's carbon trading industry. This paper proposes a residual correction model that considers multiple influencing factors. First, the best historical data and main external factors input to the model are determined using the partial autocorrelation function and Spearman correlation analysis, and the carbon price forecasting index system is constructed. Second, the whale optimization algorithm (WOA) is utilized to determine the optimal parameters of the extreme gradient boosting (XGBoost) model, and the WOA-XGBoost model is built to perform preliminary carbon price forecasts and obtain the residual series. Finally, the carbon price residual series is decomposed into multiple components using complete ensemble empirical mode decomposition for subsequent forecasting and aggregation of the outcomes. Experiments are conducted on two carbon trading markets, Hubei and Guangzhou, and a feature importance analysis is performed. The results indicate that the proposed hybrid model consistently outperforms the comparative models in terms of prediction accuracy. Furthermore, it is revealed that historical carbon prices and European Union carbon prices are the key factors influencing the prediction of carbon market prices.

development. To mitigate its effects, all parties must work together, and governments have proposed various policies to address this issue. The carbon trading market is widely recognized as a highly effective tool for emissions reduction.1 In the European Union, the first carbon emissions trading system (EU ETS) was created, which has served as a model for carbon markets in other regions.2 As a major global CO2 emitter, China is actively involved in the development of carbon trading markets. The construction of China's carbon market has progressed at a steady pace, with eight carbon trading market pilots established to date.3 Moreover, the Chinese government's commitment in 2020 to reach carbon peaking by 2030 and carbon neutrality by 2060 has given further impetus to the development of carbon markets. On 16 July 2021, the Chinese government took a significant stride in expediting the construction of the carbon emissions trading market with the launch of online trading within the national carbon market. In the carbon emission trading market, the carbon trading price is the central element. An accurate carbon price not only facilitates the scientific and sustainable growth of the carbon trading market but also helps the government and companies formulate reasonable development policies and plan optimal paths to reduce the cost of carbon emission reduction.4 Additionally, it can reduce investors' risks and maximize their return on investment. Therefore, carbon price prediction has received considerable attention, and whether carbon prices can be accurately predicted is a critical issue facing the academic community; further research and exploration are needed to develop effective approaches to predicting carbon prices.
Econometric methods, artificial intelligence methods, and hybrid forecasting methods are commonly used for carbon price forecasting.5 Econometric models are classical time series forecasting methods; common examples include the autoregressive integrated moving average (ARIMA) model,6 the generalized autoregressive conditional heteroscedasticity (GARCH) model,7 and cubic exponential smoothing (Holt-Winters).8 Li9 used the ARIMA model to forecast carbon emissions allowance prices in Fujian and found that the fit was excellent and the trend of carbon prices could be accurately predicted. Wu10 investigated the price of carbon emission rights in Shanghai using an ARIMA model. This type of method does not require much external information or many assumptions and can be quickly applied to linear data for forecasting purposes. Lv et al.11 constructed an ARIMA model to forecast the trading price of EU carbon futures for the upcoming 3 months, utilizing the monthly carbon futures settlement price of EUAs. Dutta12 used a GARCH model to predict carbon price volatility and demonstrated that outlier data processing and crude oil volatility indices can improve the prediction accuracy of carbon price volatility. However, Zhu and Wei13 found that ARIMA could not capture the nonlinear pattern of carbon prices, so a hybrid method combining ARIMA and least squares support vector machine models was used for the prediction of EU carbon prices. In the case of nonlinear carbon price data, it is difficult to capture enough useful information for carbon price forecasting using econometric models alone.
With the continuous development and refinement of artificial intelligence technology, highly accurate models and algorithms have been proposed.14 Machine learning models are extensively applied in research on price forecasting, providing valuable information and guiding decision-making processes. Zhang15 applied the extreme learning machine (ELM) model to predict stock prices and achieved good results. Jabeur et al.16 used the extreme gradient boosting (XGBoost) model to predict gold price trends and interpreted its feature importance using SHapley Additive exPlanations (SHAP); in comparison with five well-known benchmark models, XGBoost provided superior results. Meanwhile, numerous experiments have applied artificial intelligence to carbon price prediction. For example, Zhu et al.17 proposed a new multiobjective least squares support vector machine to predict the price of carbon emission rights on the Intercontinental Exchange in the United States and the European Energy Exchange. Jiang and Peng18 constructed a carbon price prediction model based on a BP neural network optimized by a chaotic particle swarm algorithm. Li et al.
19 used the long short-term memory (LSTM) network model to predict the carbon price in Hubei and Guangdong and found that the model is more suitable for carbon price prediction than the comparative models. García and Jaramillo-Morán20 used artificial neural networks to predict short-term EU carbon allowance futures prices, providing reliable results. Zhang and Wen21 used an advanced deep neural network model to predict carbon prices, demonstrating that every evaluation metric of the proposed model is optimal compared with the comparison models. Nevertheless, when addressing the carbon price prediction challenge, a single machine learning model exhibits shortcomings, including susceptibility to getting stuck in local minima or overfitting, and a series of nonpurely stochastic, nonlinear residuals can still remain after prediction by a single machine learning model.22 Thus, in an attempt to further enhance prediction performance, researchers have started combining decomposition, optimization, and prediction models, fully exploiting the benefits of different models and algorithms to increase prediction robustness and stability. The most commonly used decomposition algorithms are wavelet-based decomposition methods, decomposition algorithms based on variational mode decomposition (VMD), and decomposition algorithms based on empirical mode decomposition (EMD).23 Cao et al.24 investigated the correlation between predictive models and the carbon pricing market, devising a composite carbon price forecasting approach that integrates a reinforcement learning model with the empirical wavelet transform (EWT) algorithm; experimental results indicate that this model demonstrates superior and stable predictive performance, serving as a valuable tool for the assessment and management of carbon pricing markets. Chai et al.25 used VMD to decompose the carbon price series, followed by particle swarm optimization of the ELM for subsequent prediction. Zhu26 presented a hybrid model integrating EMD, a genetic algorithm, and an artificial neural network to forecast the prices of two carbon futures on the European Climate Exchange; the experimental findings demonstrated the advantage of incorporating the decomposition algorithm into the model. Sun and Huang27 proposed a hybrid carbon price prediction model that integrates EMD, VMD, and LSTM; an empirical analysis of eight carbon market pilots in China demonstrates the superiority of the combined model. Zhang et al.28 proposed a hybrid carbon price prediction model based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and window-based XGBoost methods and confirmed that CEEMDAN-XGBoost (W-b) outperformed EEMD-XGBoost (W-b). To predict carbon prices and trading volumes in eight carbon markets in China, Lu et al.
29 combined six different machine learning models with the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) algorithm. To sum up, hybrid prediction models incorporating a decomposition algorithm show promise and offer advantages in tackling carbon price prediction. However, most current studies only decompose the raw carbon price series, ignoring the valuable information contained in the prediction residual series.30 The residual series obtained from a high-precision prediction model is often the most complex and disordered part. Therefore, it is necessary to use advanced decomposition techniques to decompose the carbon price residual series into relatively stable components, forecast them again, and correct the errors so as to improve forecasting accuracy.31,32 In many previous studies, carbon price forecasts rely only on historical time series. However, carbon price trends are influenced by a variety of factors,33 including but not limited to the economic situation, the energy market, environmental changes, policies, and regulations. For example, Batten et al.34 investigated the impact of key energy prices and weather conditions on the carbon price within the EU carbon trading market; the study concluded that energy prices could explain 12% of carbon price changes and that climate change would exacerbate volatility and increase hedging costs. Lin and Xu35 analyzed the relationship between carbon prices and influencing factors using a data-driven nonparametric additive regression model. They discovered that coal prices exhibit an inverted U-shaped nonlinear effect on carbon prices. Additionally, renewable energy prices have a positive U-shaped nonlinear effect on carbon prices, while fuel oil prices demonstrate an M-shaped nonlinear effect. Wu et al.
36 analyzed the evolving relationship between carbon prices and potential drivers before and after the COVID-19 pandemic. The findings indicated that, in the short term, energy prices, macroeconomic indicators, and exchange rates were the primary external influences on carbon prices. Zhang et al.37 emphasized that the comprehensive incorporation of diverse internal and external factors is a pivotal concern in carbon price forecasting.39,40 There is an urgent need to establish an indicator system that integrates the consideration of multiple relevant factors.
Reviewing prior studies reveals potential research gaps in current carbon price forecasting. First, most previous studies have typically selected only the historical series of carbon prices as input features. This approach often overlooks other potential influencing factors, leading to a lack of comprehensiveness in forecasting carbon prices. Second, existing research often lacks input feature selection; incorporating feature selection can help mitigate the risk of model overfitting. Third, most existing carbon price forecasting studies use preprocessing models, that is, data decomposition to smooth the data before forecasting, and applications of postprocessing models based on residual correction are lacking.
To fill the above gaps, this paper proposes a hybrid model for carbon price residual correction that takes into account multiple influencing factors, building on previous research. In the model, historical carbon price data as well as multiple external influences covering energy, the economy, EU carbon prices, and the environment are considered as input features. The WOA-XGBoost-CEEMDAN model is built by combining the whale optimization algorithm, the extreme gradient boosting algorithm, and the complete ensemble empirical mode decomposition with adaptive noise. First, literature analysis is used to identify potential factors affecting carbon prices, and data on carbon prices and their potential factors are collected for two different carbon pilots. Second, a WOA-XGBoost model is established, whereby the parameters of the XGBoost model are optimized using the WOA algorithm; this optimized model is then used to predict carbon prices. Then, using the CEEMDAN method, the residual series resulting from the initial fitting of the WOA-XGBoost model is decomposed into several components of different frequencies. Finally, the WOA-XGBoost model is used to predict each component generated by CEEMDAN separately. The component predictions are added to the preliminary predicted values to estimate the final carbon price.
This study makes several contributions to carbon price forecasting. The remaining sections are organized as follows: Section 2 presents the fundamental methods and theories used in this research. Section 3 describes the established system of indicators and evaluation criteria. Section 4 focuses on the experimental analysis. Section 5 provides a discussion. Finally, Section 6 summarizes the key findings and conclusions of this study.

| METHODS
This section presents algorithms related to the proposed hybrid prediction model.

| Feature selection
A feature selection tool becomes extremely useful when dealing with large, high-dimensional data sets. By selecting the most influential features, model redundancy can be reduced, prediction accuracy can be improved, and training time can be shortened.41 Additionally, feature selection can help researchers understand the relationship between models and data, enabling better data analysis and decision-making. In this study, the partial autocorrelation function (PACF) is used to select the best historical carbon price inputs, and Spearman correlation analysis is used to determine the best external influencing factors.

| PACF
In time series analysis, the PACF is utilized to assess the partial correlation between a time series and its lagged data. After removing the influence of other lagged periods, the PACF measures the correlation between the current time point and a specified lagged time point. Specifically, the PACF at lag k is the conditional correlation between X_i and X_{i+k} given the intervening variables X_{i+1}, ..., X_{i+k-1}; it represents the partial autocorrelation between X_i and X_{i+k} after the influence of X_{i+1}, ..., X_{i+k-1} has been removed.
The values of the PACF are commonly calculated using the Yule-Walker equations, and the resulting values are used to construct the PACF plot.42 In the PACF plot, lag orders whose coefficients fall outside the confidence interval may significantly affect the target variable, while the partial autocorrelation coefficients at other lag orders tend toward zero, suggesting that those lags have little impact on predicting the target variable. At the same time, to interpret the PACF plot accurately, it is imperative to ensure the stationarity of the time series.
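As an illustration of the Yule-Walker calculation mentioned above, the following sketch estimates PACF values with NumPy; in practice a library routine such as statsmodels' `pacf` would typically be used. The AR(2) test series and all variable names are illustrative assumptions, not data from this study.

```python
import numpy as np

def pacf_yule_walker(x, max_lag):
    """Estimate partial autocorrelations via the Yule-Walker equations.

    The PACF at lag k is the last coefficient of the AR(k) model obtained
    by solving the Yule-Walker system R * phi = r.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Sample autocorrelations r_0 .. r_max_lag (biased estimator).
    acf = np.array([np.dot(x[: n - k], x[k:]) / np.dot(x, x)
                    for k in range(max_lag + 1)])
    pacf = [1.0]  # lag 0 by convention
    for k in range(1, max_lag + 1):
        # Toeplitz autocorrelation matrix for the AR(k) model.
        R = np.array([[acf[abs(i - j)] for j in range(k)] for i in range(k)])
        r = acf[1 : k + 1]
        phi = np.linalg.solve(R, r)
        pacf.append(phi[-1])  # last AR coefficient = partial autocorrelation at lag k
    return np.array(pacf)

# Synthetic AR(2) process: its PACF should cut off after lag 2,
# mirroring how significant lags are read off the PACF plot.
rng = np.random.default_rng(0)
e = rng.standard_normal(3000)
x = np.zeros(3000)
for t in range(2, 3000):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]
vals = pacf_yule_walker(x, 5)
```

Lags whose estimated coefficients fall outside the approximate ±1.96/√n confidence band would then be kept as model inputs.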

| Spearman correlation analysis
Spearman correlation analysis is a nonparametric statistical method used to measure the relationship between two variables. It transforms the values of each variable into their respective ranks and subsequently computes the relationship between these ranks.43 Spearman's rank correlation analysis is therefore based on the ranks of variables rather than their specific numerical values. As a result, it is suitable for handling non-normally distributed, ordinal, or ranked data and is adept at uncovering potential correlations under nonlinear relationships. It possesses flexibility and robustness, rendering it highly applicable in tasks such as data exploration, feature selection, and correlation analysis.45,46 However, in certain scenarios, due to nonlinear interactions, variables with low correlation can still make significant contributions to predictions. Additionally, the presence of a significant correlation between two variables does not necessarily imply a causal relationship between them. Therefore, it is crucial to be mindful of the assumptions underlying data selection and to comprehensively understand the data's correlations by integrating domain knowledge and the context of the problem.
The steps of Spearman correlation analysis are as follows:

1. Data preparation: Gather and organize the data sets for the two variables.
2. Ranking: For each variable, arrange all of its values in ascending order and convert them into their corresponding ranks. If multiple observations share the same value, assign them the average rank.
3. Compute rank differences: For each data point, calculate the difference in rank between the two variables, indicating their disparity in rank across the two variables.
4. Calculate the Spearman correlation coefficient between the two variables using the following formula:

ρ_s = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1)),  (1)

where ρ_s is the Spearman correlation coefficient, d_i is the difference between each pair of ranks, and n is the sample size. The coefficient ranges from −1 to +1: a value of −1 represents a perfect negative correlation, while +1 indicates a perfect positive correlation.
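The steps above can be sketched as follows. Note that the rank-difference formula is exact only when there are no ties (ties require the averaged-rank correction, as in `scipy.stats.spearmanr`); the sample values here are hypothetical.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman coefficient via the rank-difference formula (no-ties case)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Convert values to ranks (1 = smallest); double argsort yields rank order.
    rx = np.argsort(np.argsort(x)) + 1
    ry = np.argsort(np.argsort(y)) + 1
    d = rx - ry                      # rank differences d_i
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# A monotone but nonlinear relationship: ranks agree exactly, so rho = 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
rho = spearman_rho(x, y)        # +1.0: perfect positive rank correlation
rho_neg = spearman_rho(x, -y)   # -1.0: perfect negative rank correlation
```

This illustrates why Spearman analysis suits the nonlinear relationships between carbon prices and external factors: a cubic relationship that linear (Pearson) correlation would understate still yields a perfect rank correlation.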

| WOA
In 2016, Australian scholar Mirjalili proposed the WOA,47 a bionic metaheuristic algorithm that finds optimal target parameters by simulating the feeding behavior of humpback whales. The WOA has the advantages of fast convergence, easy implementation, and strong global search capability. In the WOA, each humpback whale's position corresponds to a set of parameters representing a feasible solution. The optimization search proceeds as follows.

| Enclosure of prey
As humpback whales search for prey in the ocean, they share information about their location with each other.They then choose the whale closest to the prey as the best position.At the same time, other whales will also approach this best position to better surround the prey.
The equation expressions for this behavior are

D = |C · W*(t) − W(t)|,  (2)
W(t + 1) = W*(t) − A · D,  (3)

where W(t) is the position of the humpback whale, W*(t) denotes the current optimal whale position, representing the current optimal solution, t is the current iteration number, and D is the enclosing step. A and C are coefficient vectors obtained from (4)-(6) as follows:

A = 2a · r1 − a,  (4)
C = 2 · r2,  (5)
a = 2(1 − t / T_max),  (6)

where r1 and r2 are random numbers ranging from 0 to 1, a represents a convergence factor that linearly decreases from 2 to 0, and T_max denotes the global maximum number of iterations.

| Hunting behavior
In this phase, the whales gradually approach the optimal solution along a spiral path, thus increasing the accuracy of the search. The following mathematical model simulates humpback whale hunting behavior:

D_p = |W*(t) − W(t)|,  (7)
W(t + 1) = D_p · e^{bl} · cos(2πl) + W*(t),  (8)

where the constant parameter b controls the shape of the spiral, l is a uniformly distributed random number ranging from −1 to 1, and D_p indicates the distance between the current search whale and the prey.
While the whale uses the rotational search to close in on its prey, it also contracts its envelope. Assuming an equal probability of executing the enclosure contraction mechanism and the spiral update mechanism, the position update is

W(t + 1) = W*(t) − A · D, if p < 0.5,
W(t + 1) = D_p · e^{bl} · cos(2πl) + W*(t), if p ≥ 0.5,  (9)

where p is a random number taking values from 0 to 1.

| Searching for prey
When the distance between the whale and the current optimal solution is greater than or equal to the radius of the envelope, the whale performs random exploration to further expand the search range and avoid getting trapped in a local optimum. The expressions are

D = |C · W_rand(t) − W(t)|,  (10)
W(t + 1) = W_rand(t) − A · D,  (11)

where W_rand(t) is the position of a randomly selected whale. When |A| ≥ 1, the whale lies outside the envelope and moves away from the current optimal solution, engaging in random search. When |A| < 1, the whale instead converges toward the current optimal solution.
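The three update mechanisms above (shrinking encirclement, spiral update, random search) can be sketched as a minimal WOA loop. The population size, iteration budget, and convex test function below are illustrative choices, not the settings used in this study.

```python
import math
import random

def woa_minimize(fitness, bounds, n_whales=20, t_max=200, seed=1):
    """Minimal Whale Optimization Algorithm sketch.

    Each whale is a candidate parameter vector; per iteration it either
    encircles the best solution, spirals toward it, or searches randomly.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(w):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(w, bounds)]

    whales = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_whales)]
    best = min(whales, key=fitness)[:]
    for t in range(t_max):
        a = 2 * (1 - t / t_max)                 # convergence factor: 2 -> 0
        for i, w in enumerate(whales):
            r1, r2 = rng.random(), rng.random()
            A, C = 2 * a * r1 - a, 2 * r2
            p, l = rng.random(), rng.uniform(-1, 1)
            if p < 0.5:
                # |A| < 1: shrink toward the best; |A| >= 1: random exploration.
                ref = best if abs(A) < 1 else whales[rng.randrange(n_whales)]
                new = [ref[j] - A * abs(C * ref[j] - w[j]) for j in range(dim)]
            else:
                # Logarithmic spiral update toward the best position.
                b = 1.0
                new = [abs(best[j] - w[j]) * math.exp(b * l) * math.cos(2 * math.pi * l)
                       + best[j] for j in range(dim)]
            whales[i] = clip(new)
            if fitness(whales[i]) < fitness(best):
                best = whales[i][:]
    return best

# Sanity check on a convex test function whose minimum lies at (3, -2).
sol = woa_minimize(lambda w: (w[0] - 3) ** 2 + (w[1] + 2) ** 2,
                   bounds=[(-10, 10), (-10, 10)])
```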

| WOA-XGBoost
XGBoost is an ensemble decision-tree-based learning algorithm developed by Chen et al.48 in 2016 that can be used for classification and regression problems. Compared with the gradient boosting decision tree (GBDT), XGBoost can run in parallel across multiple threads and has significant advantages in training speed and scalability. The model combines multiple weak tree learners to build an ensemble with higher prediction accuracy. The ensemble is constructed iteratively by gradient boosting: each iteration grows a new tree to fit the residuals of the previous trees until the model reaches its optimal result.
Classification and regression trees (CARTs) are the base learners of XGBoost. Given a data set with n samples and m features, the final prediction of K CARTs is

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F,  (12)

where f_k is the decision function of the kth tree and F is the function space of all CART decision trees.
The objective function of XGBoost consists of two parts: a loss function and a regularization term. Its general form is

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k),  (13)

where l(y_i, ŷ_i) is the loss function, that is, the error between the true and predicted values, and Ω(f_k) is the regularization term, which controls model complexity.
The regularization term is expressed as

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} ω_j²,  (14)

where T represents the number of leaf nodes in the tree and ω_j represents the weight assigned to the jth leaf node. The parameters γ and λ serve as penalty coefficients that promote smoother scores for each leaf node, thereby controlling the complexity of the tree and alleviating overfitting. The larger the values of γ and λ, the simpler the tree structure.
During model training, a gradient boosting strategy adds one new regression tree to the model at a time, so the prediction for the ith sample x_i at iteration t is

ŷ_i^{(t)} = ŷ_i^{(t−1)} + f_t(x_i).  (15)

Substituting Equation (15) into Equation (13) gives

Obj^{(t)} = Σ_{i=1}^{n} l(y_i, ŷ_i^{(t−1)} + f_t(x_i)) + Ω(f_t) + constant.  (16)

Applying a second-order Taylor expansion to the objective function and removing the constant terms yields

Obj^{(t)} ≈ Σ_{i=1}^{n} [g_i f_t(x_i) + (1/2) h_i f_t²(x_i)] + Ω(f_t),  (17)

where g_i = ∂l(y_i, ŷ_i^{(t−1)})/∂ŷ_i^{(t−1)} and h_i = ∂²l(y_i, ŷ_i^{(t−1)})/∂(ŷ_i^{(t−1)})² are the first-order and second-order derivatives of the loss function. Let I_j = {i | q(x_i) = j} denote the set of samples on leaf j, where q is the tree structure mapping each sample to a leaf. Differentiating the objective with respect to each leaf weight and setting the derivative to zero gives the optimal weight ω_j* = −Σ_{i∈I_j} g_i / (Σ_{i∈I_j} h_i + λ). Substituting this optimal solution, the final objective function is obtained as

Obj* = −(1/2) Σ_{j=1}^{T} (Σ_{i∈I_j} g_i)² / (Σ_{i∈I_j} h_i + λ) + γT.  (18)
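The closed-form leaf weights and objective described above can be checked numerically. The sketch below assumes a squared-error loss (so g_i = 2(ŷ_i − y_i) and h_i = 2) and a hypothetical two-leaf tree; all numbers are illustrative, not taken from the paper's experiments.

```python
import numpy as np

# Gradients/Hessians of squared-error loss l = (y - yhat)^2 at current predictions.
y = np.array([1.0, 1.5, 3.0, 3.2])
yhat = np.array([2.0, 2.0, 2.0, 2.0])     # current ensemble prediction
g = 2 * (yhat - y)                        # first-order derivatives g_i
h = np.full_like(y, 2.0)                  # second-order derivatives h_i

lam, gamma = 1.0, 0.0
# Suppose the new tree routes samples {0, 1} to leaf 1 and {2, 3} to leaf 2.
leaves = [np.array([0, 1]), np.array([2, 3])]

# Optimal weight per leaf: w_j* = -sum(g_i) / (sum(h_i) + lambda)
w_star = [-g[I].sum() / (h[I].sum() + lam) for I in leaves]

# Final objective: -1/2 * sum_j (sum g)^2 / (sum h + lambda) + gamma * T
obj = -0.5 * sum(g[I].sum() ** 2 / (h[I].sum() + lam) for I in leaves) \
      + gamma * len(leaves)
```

Here leaf 1 (where the model currently over-predicts) receives a negative weight and leaf 2 a positive one, exactly the residual-fitting behavior that gradient boosting relies on.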
XGBoost is an efficient, accurate, and versatile machine learning algorithm, but it comes with a multitude of model parameters.49 When training with XGBoost, different combinations of parameters can have a substantial impact on the model's classification ability and predictive accuracy, so these parameters must be set and adjusted according to the specific circumstances. The grid search method is slow and cannot guarantee finding the optimal combination of parameters. Therefore, the WOA, which is effective in solving parameter optimization problems, is chosen to globally search for the optimal values of five parameters, reducing the randomness of parameter selection:50 the number of decision trees (n_estimators), the maximum depth of each tree (max_depth), the ratio of features used to train each tree to the total features (colsample_bytree), the boosting learning rate (learning_rate), and the ratio of samples used to train each tree to the total samples (subsample). The optimized values are output as the final parameters of the XGBoost algorithm. Figure 1 shows the WOA-XGBoost algorithm flow, and the execution steps are as follows:

1. Initializing the WOA: Configure the maximum number of iterations, the population size, and the boundary parameters of the whale search range to prepare the iterative process for exploring the solution space.
2. Fitness assessment: Calculate the fitness value of each individual whale. By assessing the fitness of all whales in the population, the whale with the lowest value is selected as the global best solution.
The fitness function is chosen as the mean square error of the XGBoost model:

fitness(x_i) = (1/N) Σ_{j=1}^{N} (θ_{i,j} − θ̂_{i,j})²,  (19)

where x_i represents the position of the ith whale, θ_{i,j} denotes the jth true value associated with the ith individual, and θ̂_{i,j} is the corresponding prediction obtained by the XGBoost model under the parameter settings of the ith whale.
3. Updating the positions: For each individual whale, one of the three strategies above is used to update the whale's position, depending on its current position and the position of the optimal solution.
4. Updating the optimal solution: Identify the whale in the current population with the best fitness as the current optimal solution, and retain a record of it for comparison in subsequent iterations.
5. Checking the stopping condition: If the maximum number of iterations or the desired fitness threshold is reached, the algorithm terminates; otherwise, Steps 3 and 4 are repeated.
6. Output: The final result of the WOA search is passed to the XGBoost model for prediction.
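The fitness-evaluation pattern in the steps above can be sketched end to end. To keep the example dependency-free, a polynomial ridge regressor with two hyperparameters stands in for XGBoost, and a plain random search stands in for the WOA position updates; in the paper's pipeline, WOA proposes the candidate parameter positions and held-out MSE plays the fitness role. All data and parameter ranges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 120)
y = 1.5 * x ** 2 - 0.5 * x + 0.1 * rng.standard_normal(120)
# Chronological 80/20 split, as with time series data.
x_tr, y_tr, x_te, y_te = x[:96], y[:96], x[96:], y[96:]

def fitness(params):
    """Held-out MSE of the stand-in model for one candidate parameter vector."""
    degree = int(round(params[0]))          # stand-in for a discrete parameter
    lam = 10.0 ** params[1]                 # stand-in for a continuous parameter
    X_tr = np.vander(x_tr, degree + 1)
    X_te = np.vander(x_te, degree + 1)
    # Ridge solution of the normal equations.
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(degree + 1), X_tr.T @ y_tr)
    return float(np.mean((X_te @ w - y_te) ** 2))

# Random search stands in here for the WOA update loop (Steps 3-5).
best_p, best_f = None, float("inf")
for _ in range(200):
    cand = [rng.uniform(1, 6), rng.uniform(-6, 1)]   # (degree, log10 lambda)
    f = fitness(cand)
    if f < best_f:
        best_p, best_f = cand, f
# Step 6: best_p would be handed to the final model for prediction.
```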

| CEEMDAN
The CEEMDAN algorithm is a nonlinear, adaptive, data-driven algorithm for signal processing and analysis proposed by Torres et al.51 Its main steps are as follows:

1. Add an adaptive white noise sequence σw_i(t) to the original signal x(t), perform EMD on each noisy copy, and average the resulting first components to obtain the first modal component IMF_1(t).
2. The residual component R_1(t) at this point is given by R_1(t) = x(t) − IMF_1(t).
3. The residuals are used as the new input data: the adaptive white noise sequence σE_1(w_i(t)) is added to R_1(t), where E_j(·) is the jth eigenmodal component obtained after EMD decomposition. EMD is then performed on the modified sequence, and the results are averaged to derive the second modal component and the residual component.
4. Repeat Steps 1-3 to obtain the (j+1)th modal component and the jth residual component.
5. The aforementioned steps are repeated until the remaining component can no longer be decomposed by EMD, at which point the process is terminated. This iterative process decomposes the original sequence into multiple eigenmodal components and a trend component:

x(t) = Σ_j IMF_j(t) + R(t).  (27)

After the decomposition of the original sequence by CEEMDAN is completed, each intrinsic mode function component is individually predicted using the XGBoost model. The resulting predictions for each component are then summed term by term to obtain the final residual prediction result.
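CEEMDAN itself requires an EMD sifting routine (third-party packages such as PyEMD provide one). As a dependency-free stand-in, the sketch below replaces EMD with repeated moving-average smoothing to illustrate the key property used in Step 5: the extracted components plus the final residue reconstruct the original series exactly, which is what licenses predicting components separately and summing.

```python
import numpy as np

def smooth(s, k=11):
    """Centered moving average, used here as a crude low-pass stand-in for EMD."""
    pad = k // 2
    padded = np.pad(s, pad, mode="edge")
    return np.convolve(padded, np.ones(k) / k, mode="valid")

def decompose(signal, levels=3):
    """Peel off progressively slower components; the last residue is the trend."""
    comps, residue = [], signal.astype(float)
    for _ in range(levels):
        trend = smooth(residue)
        comps.append(residue - trend)   # fast component at this level
        residue = trend
    return comps, residue

t = np.linspace(0, 4 * np.pi, 400)
x = np.sin(5 * t) + 0.4 * np.sin(t) + 0.002 * t ** 2   # fast + slow + trend
comps, residue = decompose(x)
recon = np.sum(comps, axis=0) + residue   # components + residue = original
```

The telescoping construction guarantees exact reconstruction regardless of the smoother used, mirroring the additivity of CEEMDAN's eigenmodal components and trend.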

| WOA-XGBoost-CEEMDAN model
This paper is dedicated to improving carbon price prediction accuracy and proposes a hybrid algorithm based on CEEMDAN, WOA, and XGBoost to build the WOA-XGBoost-CEEMDAN model. The utilization of CEEMDAN for residual decomposition is grounded in the following considerations: first, the residual sequence obtained after high-precision prediction by the WOA-XGBoost model often contains valuable information; second, the residual sequence resulting from WOA-XGBoost predictions is complex and disordered. By using CEEMDAN, the carbon price residual sequence is decomposed into relatively stable components, ultimately enhancing predictive accuracy. The main steps of the experiment are as follows: First, the WOA algorithm is used to optimize the parameters of the XGBoost model, and the optimized model is used for the initial prediction. Second, the sequence of carbon price residuals is decomposed using CEEMDAN to better understand the characteristics and changing patterns of the signals. Subsequently, the WOA-XGBoost model is used again to predict the different components. Finally, the predicted values of the different components and the preliminary predicted values are summed to obtain the final prediction results. Figure 2 shows the flowchart. In the whole prediction process, WOA-XGBoost is utilized for prediction in two distinct stages, for the following reasons: First, dealing with complex and high-dimensional data sets in the initial stage poses a challenge for the selected model; therefore, the powerful XGBoost algorithm, optimized by WOA, is chosen for prediction. By virtue of its flexibility and efficiency, this model fits complex data well and converges quickly. The prediction task in the second stage is relatively simple, since the residual sequence has been effectively decomposed, removing noise and irregular fluctuations in the data and
providing a more reliable and accurate foundation for subsequent predictions.Although other algorithms could be considered for this stage, for the sake of maintaining overall research consistency, the decision has been made to continue using WOA-XGBoost for component prediction.
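The two-stage residual-correction flow described above can be sketched end to end with simple stand-ins: a linear trend model in place of WOA-XGBoost, a moving-average split in place of CEEMDAN, and a synthetic "price" series. The point is the shape of the pipeline (predict, decompose the residuals, predict the components, sum), not the stand-in models themselves.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
t = np.arange(n)
# Hypothetical "carbon price": linear trend + 30-step seasonal wiggle + noise.
price = 40 + 0.02 * t + 2 * np.sin(2 * np.pi * t / 30) + 0.3 * rng.standard_normal(n)

split = int(0.8 * n)                       # chronological 8:2 split
X = t.reshape(-1, 1).astype(float)

def fit_linear(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.hstack([Xn, np.ones((len(Xn), 1))]) @ coef

# Stage 1: preliminary forecast (linear trend stands in for WOA-XGBoost).
stage1 = fit_linear(X[:split], price[:split])
resid = price[:split] - stage1(X[:split])

# Stage 2: split the in-sample residuals into slow/fast parts
# (a 31-point moving average stands in for CEEMDAN).
k = 31
trend = np.convolve(np.pad(resid, k // 2, mode="edge"), np.ones(k) / k, mode="valid")
fast = resid - trend

# Stage 3: forecast each residual component, then add everything back together.
period = 30
fast_fc = np.tile(fast[-period:], (n - split) // period + 1)[: n - split]
trend_fc = np.full(n - split, trend[-1])
final = stage1(X[split:]) + fast_fc + trend_fc

mae_stage1 = np.mean(np.abs(price[split:] - stage1(X[split:])))
mae_final = np.mean(np.abs(price[split:] - final))   # residual correction helps
```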

| Primary index system establishment
The carbon trading markets in Hubei and Guangzhou are both among the first pilot carbon trading initiatives nationwide. They play a crucial role in promoting the transition to a low-carbon economy, reducing carbon emissions, and addressing climate change. As of the end of 2022, the cumulative trading volume and turnover of the Hubei carbon trading market accounted for 44.6% and 46.9% of the national totals, respectively, and the market has demonstrated leading trading volume and liquidity nationwide. In 2022, the Guangzhou carbon trading market performed exceptionally well, ranking first in both trading volume and turnover among the nationwide carbon trading pilot projects. Therefore, this study selects the prices of two representative carbon trading pilots, Hubei and Guangzhou, as the primary data for predicting carbon prices.
The carbon price samples for Hubei cover the period from 27 August 2017 to 23 September 2022, with a total of 1280 data points. For Guangzhou, the carbon price samples span from 31 August 2020 to 4 August 2023, totaling 721 data points. The training and testing data sets are split using an 8:2 ratio. The selected training timeframe covers three distinct stages: price increase, decrease, and oscillation, constituting a complete price cycle.52 It also includes various policy time points related to the "carbon peaking and carbon neutrality" initiative proposed by the government. This inclusion facilitates the validation of the proposed model's ability to extract and predict the complex variations in carbon prices under different market conditions.
Carbon market data provide new observations every day, a high update frequency that effectively reflects market changes and fluctuations.24 Therefore, this study uses a hybrid model to further predict carbon prices, ensuring real-time and accurate forecasts of carbon market prices. The data from the Hubei and Guangzhou carbon markets are presented in Table 1, and the corresponding line chart is depicted in Figure 3.
In addition to being influenced by their own dynamics, carbon prices are affected by a multitude of factors.53 The establishment of a reasonable carbon price indicator system is necessary to correctly predict the carbon price trend and to promote the healthy development of the carbon market. In this study, energy factors, economic factors, international carbon prices, and environmental factors are considered in the forecast indicators. The details are described below.
1. Energy factors: China, as the world's largest energy consumer, holds a substantial share of 23% of global energy consumption. An evident correlation exists between international energy prices and domestic carbon prices, wherein fluctuations in the international energy market exert an influence on the domestic carbon market.54 This paper selects coal prices, crude oil prices, and natural gas prices as indicators to assess the influence of energy factors on carbon prices. Brent crude oil is more suitable as an influencing factor for carbon prices in China, and the New York Mercantile Exchange (NYMEX) is one of the most commonly used reference indices for natural gas prices in the market. Therefore, in this paper, the Rotterdam coal price is used for the coal price, the Brent crude oil price is used for the crude oil price, and the natural gas price is taken from NYMEX.

2. Economic factors: Economic factors play an important role in determining carbon prices. These factors include the stock market, the exchange rate, and so forth.55 Changes in stock market prices may reflect changes in the market's expectations of economic development, which in turn affect the investment and production behavior of businesses, ultimately affecting carbon prices. Changes in exchange rates can affect supply and demand in international energy and carbon markets, which in turn affect carbon prices. In this paper, the CSI 300 Index, the S&P 500 Index, and the euro-to-RMB exchange rate are used to reflect the impact of economic factors on carbon market prices.

3. International carbon price: The EU carbon trading market is one of the largest carbon markets in the world, with a large carbon trading volume, strong carbon price influence, and a certain degree of innovation in carbon market design, which has a profound effect on the global carbon market's development. Therefore, the EU carbon market price trend can be used to forecast the Chinese carbon price and control risk.
4. Environmental factors: Increasing extreme weather events due to climate change may lead to increased market demand for carbon emission reductions. Governments may intensify their efforts to reduce emissions, thus affecting the carbon market's supply and demand. This paper uses the air quality index (AQI) of the provincial capital city as an environmental indicator.
The detailed information regarding the multiple external factors influencing carbon prices in China is presented in Table 2. The data used in this paper are derived from the Wind database.

| Ultimate index system establishment
In determining the input variables for the carbon price forecasting model for Hubei, this paper considers both the selection of historical carbon price variables and the selection of external influencing factors. To verify the autocorrelation between historical carbon price variables
and historical carbon price data, the PACF method is used to select historical input features for the forecasting model. First, it is necessary to verify the stationarity of the carbon price sequence by conducting an augmented Dickey-Fuller test. The obtained p value is 0.0427, which is less than 0.05. Consequently, the null hypothesis is rejected, suggesting that the sequence is stationary. This allows for a direct analysis of partial autocorrelations on the original sequence. Figure 4 shows the PACF results; the Hubei carbon price data exhibit second-order partial autocorrelation. Here x_i is the output feature, and {x_{i−1}, x_{i−2}} is identified as the set of input historical variables for the Hubei data set.
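The lag-selection step above can be illustrated with a stdlib-only sketch. The partial autocorrelations are computed from the sample autocorrelations via the Durbin-Levinson recursion, and lags whose PACF exceeds the approximate 95% band ±1.96/√n are retained as input features. The data here are a synthetic AR(1) series, not the Hubei carbon prices, so only the mechanics are shown.

```python
import random
import math

def acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / var
            for k in range(max_lag + 1)]

def pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)
    phi = [[0.0] * (max_lag + 1) for _ in range(max_lag + 1)]
    result = [1.0]
    phi[1][1] = r[1]
    result.append(r[1])
    for k in range(2, max_lag + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
        result.append(phi[k][k])
    return result  # result[k] is the PACF at lag k

# Synthetic AR(1) series with coefficient 0.7: the PACF should be large at
# lag 1 and near zero beyond, so only x_{i-1} would be kept as an input lag.
random.seed(42)
series = [0.0]
for _ in range(4999):
    series.append(0.7 * series[-1] + random.gauss(0, 1))
p = pacf(series, 5)
significant = [k for k in range(1, 6) if abs(p[k]) > 1.96 / math.sqrt(len(series))]
```

For the Hubei series the same procedure flags the first two lags, yielding the inputs {x_{i−1}, x_{i−2}} described above.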
As more input features are added, prediction accuracy may decline due to redundancy. Selecting the most informative variables to eliminate collinearity between features not only reduces dimensionality but can also improve the prediction accuracy of the model.
Therefore, to identify the main external factors influencing the carbon price in Hubei, this paper adopts the Spearman correlation analysis method for feature selection and calculates the correlation coefficient between the carbon price and each influencing factor according to formula (1). The correlation analysis between the predictors and the predicted labels is shown in Table 3. Based on the results listed in Table 3, within the carbon price indicator system established in this study, the p value for AQI is greater than 0.05, indicating that there is no significant correlation between AQI and carbon prices. However, all other candidate independent variables exhibit a significant correlation with carbon prices (p < 0.01). Moreover, the absolute value of the Spearman correlation coefficient for AQI is 0.037, falling within the extremely weak correlation range of [0.0, 0.2]. Consequently, AQI is excluded from the indicator system, and coal price, crude oil price, gas price, CSI 300, S&P 500, EURCNY, and EUA are selected as input variables representing external factors. Finally, the input variables {Coal price, Crude price, Gas price, CSI 300, S&P 500, EURCNY, EUA, x_{i−2}, x_{i−1}} are determined for the prediction model of the Hubei data set. The same approach is used to identify the optimal input variables for the Guangzhou carbon market, resulting in the selection of {Coal price, Crude price, Gas price, CSI 300, S&P 500, EURCNY, EUA, x_{i−3}, x_{i−1}} as input variables. Table 4 provides the statistical description of the two data sets.
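The Spearman coefficient used for this screening is simply the Pearson correlation of the rank vectors. A minimal stdlib sketch (with illustrative data, not the paper's series):

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotonically related pair has |rho| = 1 even when the relation is
# nonlinear, which is why Spearman suits skewed financial series.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rho_pos = spearman(xs, [v ** 3 for v in xs])
rho_neg = spearman(xs, [-v for v in xs])
```

Features whose |rho| falls in the extremely weak band (below 0.2) or whose p value exceeds 0.05, such as AQI here, would then be dropped from the indicator system.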

| Data preprocessing
The collected influence factor data have different magnitudes and units. To make the data comparable, it is necessary to normalize them and convert them to the same scale. This facilitates comparison and analysis and also improves the accuracy and effectiveness of machine learning algorithms. Normalization is expressed as follows:

x* = (x − x_min) / (x_max − x_min),

where x* is the normalized value, x is the original data, x_min is the minimum value in the data set, and x_max is the maximum value in the data set. The normalized data fall within [0, 1].
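The min-max scaling above is straightforward to implement; the price values below are illustrative only:

```python
def min_max_normalize(values):
    """Scale values to [0, 1] via x* = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values]

prices = [45.2, 47.8, 44.1, 50.3, 46.5]
scaled = min_max_normalize(prices)
# The minimum maps to 0.0, the maximum to 1.0, and order is preserved.
```

In practice the scaling parameters (x_min, x_max) are fitted on the training set only and reused on the test set, so that no test-set information leaks into training.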

| Evaluation indicators
The root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R-squared, R²) are selected in this study. RMSE is the square root of the MSE, which removes the magnitude distortion introduced by squaring the errors. MAE is the mean of the absolute differences between the predicted and true values and reflects the average prediction error. The coefficient of determination R² indicates the degree of correlation between the predicted results and the true values. The indicators are calculated as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ),
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
MAPE = (1/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i|,
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²,

where y_i represents the true values of the time series, ŷ_i denotes the predicted values, and n and i refer to the test sample size and the index of the test sample point, respectively. The closer the values of RMSE, MAE, and MAPE are to 0 and the closer R² is to 1, the higher the prediction accuracy of the model. Additionally, improvement percentages for RMSE, MAE, MAPE, and R² were calculated to show the effectiveness of the proposed model more intuitively:

P_Index = |Index_model1 − Index_model2| / Index_model1 × 100%,

where Index_model1 denotes the RMSE, MAE, MAPE, or R² of the benchmark model, and Index_model2 denotes the corresponding value of the comparison model.
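As a sanity check on these definitions, a minimal implementation (the series values are illustrative, not from either carbon market):

```python
import math

def evaluate(y_true, y_pred):
    """Return RMSE, MAE, MAPE, and R^2 for paired true/predicted series."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mape = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, mape, r2

def improvement(index_base, index_other):
    """Percentage improvement of a comparison model over a benchmark."""
    return abs(index_base - index_other) / index_base * 100

y_true = [42.0, 43.5, 44.0, 45.2, 44.8]
y_pred = [42.3, 43.1, 44.4, 45.0, 45.1]
rmse, mae, mape, r2 = evaluate(y_true, y_pred)
```

Lower RMSE/MAE/MAPE and higher R² from `evaluate`, and larger values from `improvement`, correspond to the rankings reported in Tables 7-10.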

| EXPERIMENTS AND ANALYSIS
Among the models selected in this paper, GBDT and XGBoost are single prediction models that use neither decomposition techniques nor optimization algorithms. WOA-XGBoost adds the WOA but does not use a decomposition algorithm. The WOA-XGBoost-EEMD and WOA-XGBoost-CEEMDAN models use EEMD and CEEMDAN, respectively: the decomposition algorithm decomposes the residual sequence predicted by WOA-XGBoost, each component is then predicted again using the WOA-XGBoost model, and finally the test-set predictions of the components are combined to obtain the final residual prediction. All of the above models consider only the historical carbon price. In contrast, the ALLVARIABLE-WOA-XGBoost-CEEMDAN model takes all external influences, in addition to historical carbon price data, as input features. The proposed model incorporates feature selection, choosing lagged terms with significant partial autocorrelation with historical carbon prices together with the major external factors with high correlation, to form the final carbon price forecasting index system of this paper. Table 5 summarizes the characteristics of the different models. Since model parameter settings have a considerable impact on model performance and effectiveness, Table 6 summarizes the parameter settings.
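The WOA parameter search at the core of these hybrids can be sketched compactly. The following stdlib-only code implements the standard WOA search mechanics (shrinking encirclement, random exploration, and spiral updates) and minimizes a toy sphere function standing in for an XGBoost validation-error objective; it is an illustration of the mechanism, not the authors' implementation, and all parameter values are assumptions.

```python
import math
import random

def woa_minimize(obj, dim, bounds, n_whales=30, n_iter=200, seed=0):
    """Minimal whale optimization algorithm: each whale either encircles the
    current best (|A| < 1), explores around a random whale (|A| >= 1), or
    spirals toward the best, with coefficient a decaying linearly 2 -> 0."""
    rng = random.Random(seed)
    lo, hi = bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=obj)[:]
    for t in range(n_iter):
        a = 2 - 2 * t / n_iter
        for w in whales:
            r1, r2 = rng.random(), rng.random()
            A, C = 2 * a * r1 - a, 2 * r2
            if rng.random() < 0.5:
                target = best if abs(A) < 1 else whales[rng.randrange(n_whales)]
                for d in range(dim):
                    w[d] = target[d] - A * abs(C * target[d] - w[d])
            else:  # logarithmic spiral around the best solution
                l = rng.uniform(-1, 1)
                for d in range(dim):
                    dist = abs(best[d] - w[d])
                    w[d] = dist * math.exp(l) * math.cos(2 * math.pi * l) + best[d]
            for d in range(dim):  # keep positions inside the search bounds
                w[d] = max(lo, min(hi, w[d]))
        cand = min(whales, key=obj)
        if obj(cand) < obj(best):
            best = cand[:]
    return best, obj(best)

# Toy objective; in the paper's setting obj would train XGBoost with the
# candidate hyperparameters and return a validation error.
sphere = lambda x: sum(v * v for v in x)
best, fitness = woa_minimize(sphere, dim=2, bounds=(-10.0, 10.0))
```

In the actual pipeline, `dim` would be the number of tuned XGBoost hyperparameters and `bounds` their admissible ranges, with each objective evaluation being a model fit.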

| Case 1: Hubei carbon price forecast results
The comparison of predicted values for different models is illustrated in Figure 5. The evaluation metrics for the established prediction framework and the other forecasting models can be found in Figure 6 and Table 7. Figure 6 visualizes the prediction performance of each model in the form of a bar chart. In Table 7, the best results are shown in bold. Table 8 presents the percentage improvement results of the models. Based on the experimental data, it is evident that the model proposed in this paper surpasses the other comparison models in all cases. The following conclusions can be drawn:
F I G U R E 5 Performance of different models on the Hubei test set predictions. CEEMDAN, fully integrated adaptive noise empirical mode decomposition; EEMD, ensemble empirical mode decomposition; GBDT, gradient boosting decision tree; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.
F I G U R E 6 Forecast results of various models in the Hubei market.CEEMDAN, fully integrated adaptive noise empirical mode decomposition; EEMD, ensemble empirical mode decomposition; GBDT, gradient boosting decision tree; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.
1. The prediction model proposed in this paper outperforms the other models in evaluation metric comparisons. According to

| Case 2: Guangzhou carbon price forecast results
To demonstrate the robustness of the model, an experimental study was conducted on the Guangzhou carbon market. Using relevant data from the Guangzhou carbon market as samples, a line graph comparing the predicted values of different models is shown in Figure 7.
The prediction results of each model are illustrated in Figure 8 and Table 9. In Table 9, the best results are shown in bold. Table 10 provides the percentage improvement of the models.
T A B L E 7 The forecasting results for Hubei data set.

From the charts, it can be concluded that the conclusions of Experiment 2 align with those of Experiment 1. This indicates that the model proposed in this study combines the advantages of optimization algorithms, decomposition methods, and feature selection, performing exceptionally well across various aspects.
Based on the results of the two case studies described above, the following conclusions can be drawn:
1. The model proposed in this paper performs best in both carbon market experiments. The advantages of the WOA-XGBoost-based model, the CEEMDAN decomposition algorithm, and the feature selection method proposed in this paper are clearly confirmed by every evaluation metric. This result proves the effectiveness and robustness of this forecasting framework for carbon price forecasting.
2. Adding the WOA to the XGBoost model improves prediction accuracy and predictive power. In the context of carbon price forecasting, the WOA-XGBoost model performs far better than the XGBoost model in both carbon markets; it is therefore worthwhile to use the WOA to perform parameter searches for the XGBoost model.
3. Compared with not using decomposition technology, using decomposition technology greatly improves forecasting accuracy. It is reasonable and effective to apply the decomposition algorithm to carbon price forecasting. The CEEMDAN method improves on the EEMD method by adding adaptive white noise at each stage, which gives it a significant advantage in dealing with nonlinear series. Therefore, in both carbon markets, the model incorporating CEEMDAN works better than the model incorporating EEMD and the model without a decomposition algorithm.
4. Feature selection plays an important role in establishing a carbon price index system. Using this approach, the most informative features for carbon price prediction can be selected from a large number of candidates, improving prediction accuracy and model robustness. The experimental results show that the processed indicator system achieves better results.

| Feature importance analysis
A feature importance analysis can help identify factors that have a significant impact on forecast results.56 Governments can use this information to make scientific decisions in a variety of areas, including gaining a deeper understanding of energy markets, optimizing policy design, and developing cross-cutting policies. In this paper, XGBoost's built-in feature importance attribute is used to analyze the importance of features in the WOA-XGBoost model, the best performing of the single models. The importance scores and ranking results for the two carbon markets are depicted in Figure 9 and Table 11. It can be found that historical carbon prices and EU carbon prices are the best data sources for predicting carbon prices in both carbon markets. In the Hubei data set, the lagged carbon price features achieved scores of 0.1794 and 0.1542, respectively. In the Guangzhou data set, the scores for the two lag orders reached 0.2282 and 0.2030, respectively. Compared with other features, historical carbon prices hold a dominant position. This is because the price time series represents the external manifestation of the inherent complexity within the market.57 It encapsulates crucial information about past market performance and price fluctuations, aiding researchers in analyzing and predicting future price trends. Additionally, carbon price time series data exhibit autocorrelation. Consequently, in previous studies on carbon price prediction, a majority of researchers have utilized historical carbon price sequences for forecasting. This underscores the indispensable role of historical carbon prices in predicting carbon prices.19,58 This observation aligns with the findings of Jiang et al.59 and Zhang et al.,37 who respectively analyzed influencing factors in the Chinese carbon trading pilot markets and the European carbon trading market. Both studies found that carbon prices are primarily influenced by their own historical prices.
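The paper relies on XGBoost's built-in gain-based importance; as a model-agnostic illustration of the same idea, the sketch below uses permutation importance instead (a swapped-in technique, not the authors' method): shuffle one feature column and measure how much the prediction error grows. The model and data here are entirely synthetic stand-ins.

```python
import random

def permutation_importance(predict, X, y, rng):
    """Increase in MSE when one feature column is shuffled; larger = more important."""
    def mse(X_):
        return sum((predict(row) - t) ** 2 for row, t in zip(X_, y)) / len(y)
    base = mse(X)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature/target association for column j
        X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        scores.append(mse(X_perm) - base)
    return scores

rng = random.Random(7)
# Toy data: the target depends strongly on feature 0, weakly on feature 1.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [3.0 * a + 0.1 * b for a, b in X]
model = lambda row: 3.0 * row[0] + 0.1 * row[1]  # a "fitted" stand-in model
scores = permutation_importance(model, X, y, rng)
```

A large score for a feature here plays the same role as a high importance score in Figure 9: shuffling a lagged carbon price or the EUA column would degrade the forecasts far more than shuffling, say, AQI.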
Excluding the historical carbon price, EUA is the feature with the highest score, reaching 0.1362 and 0.1806 in the carbon markets of Hubei and Guangzhou, respectively. This is because carbon prices within the EU ETS have consistently ranked at the forefront of all global carbon markets, solidifying Europe's global leadership in carbon pricing policy. There exists a long-term equilibrium and mutually guiding relationship between the European Union carbon market and the Chinese carbon market, although this influence is asymmetric. The spillover effect from the EU market to the Chinese market is particularly pronounced, with volatility in EU prices significantly impacting the direction of carbon price fluctuations in China.60 When EUA prices rise, demand in the Chinese carbon market increases, pushing prices up. Conversely, when EUA prices decrease, demand in the Chinese carbon market drops, leading to price reductions. Eventually, this process leads to a convergence of carbon trading prices.61 Table 11 further reveals that economic and energy factors also make a certain contribution to the predictive target. This is because these factors have an indirect relationship with carbon prices. During economic prosperity and stock market upswings, optimistic market sentiment may lead businesses to expand production, increasing the demand for carbon emissions and ultimately causing shifts in carbon prices. Conversely, economic downturns and declines in stock market indices can similarly influence the carbon market. When the price of fossil fuels increases, both businesses and individuals face higher costs for using these energy sources. As a result, businesses might adopt more emission reduction measures in response, aiming to decrease energy consumption and lower carbon emissions. This, in turn, can influence changes in carbon prices.62

F I G U R E 8 Forecast results of various models in the Guangzhou market. CEEMDAN, fully integrated adaptive noise empirical mode decomposition; EEMD, ensemble empirical mode decomposition; GBDT, gradient boosting decision tree; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.

T A B L E 9 The forecasting results for Guangzhou data set.

| DISCUSSION

| The proposed model application test
To explore the generalization ability of the proposed model and study its universality, the analysis is extended to price predictions for the European Union carbon market. The framework considers energy factors, economic factors, the national carbon market price (CEA), and the EUA's own historical prices. Daily trading data for EUA from 16 July 2021 to 28 February 2023 are selected for prediction, comprising a total of 421 data points. The data set is divided using an 8:2 ratio for training and testing, as depicted in Figure 10. During the training period, carbon prices underwent the typical patterns of price variation, including phases of increase, decrease, and oscillation. The performance metrics for each model are presented in Table 12, with the best results highlighted in bold.
Based on the findings in Table 12, the following conclusions can be drawn. First, among the single machine learning models, the WOA-XGBoost model, with its integrated optimization algorithm, exhibits outstanding performance. Its evaluation metrics improved by 26.19%, 4.75%, 5.42%, and 1.85%, respectively, compared with XGBoost without the optimization algorithm. This indicates that using the WOA for parameter optimization in XGBoost can enhance predictive accuracy. Second, the introduction of the EEMD and CEEMDAN hybrid algorithms notably improved precision, with CEEMDAN showing a slight gain over EEMD. WOA-XGBoost-CEEMDAN achieved metric values of 0.6630, 0.4985, 0.0060, and 0.9903, respectively. By decomposing the initial predicted residual sequence and extracting effective information, prediction accuracy was further improved. Third, relevant influencing factors of the predictive target were incorporated into the indicator system of WOA-XGBoost-CEEMDAN, followed by feature selection. The four evaluation indicators of the resulting model reached 0.1983, 0.1553, 0.0019, and 0.9991, respectively. Compared with predicting based solely on carbon prices, this represents an improvement of 70.09%, 68.84%, 68.33%, and 0.88%.

Nonetheless, this work has limitations. In terms of the model: (1) the heuristic parameter search consumes a considerable amount of time and computational resources in finding suitable parameter settings, which also reduces the intelligence of the model; (2) the incorporation of feature selection into hybrid models introduces additional complexity to both the model itself and its operational procedures, resulting in prolonged time consumption. In terms of the carbon price research: (1) the indicator system constructed in this study needs further refinement; (2) only a one-step-ahead prediction of carbon prices has been conducted, without an exhaustive examination of future price trends; (3) the time-varying characteristics of the impact factors were not considered.64
Therefore, in future endeavors, there is a need to strengthen research on optimization algorithms and model simplification. The primary goal is to develop a more intelligent and user-friendly predictive model, enabling the automation of parameter tuning for heuristic optimization algorithms while reducing the complexity of experimental procedures. Furthermore, in constructing the indicator system, additional influencing factors, such as new energy prices and quantifiable policy factors, should be considered. Nonstructured data sources, such as the Baidu Index and Google Search Index, should also be taken into account to establish a more comprehensive and refined indicator system. Finally, an attempt will be made to incorporate a sliding window to better capture the dynamics of the data and make the model more adaptable.38

| Policy recommendations
As one of the world's largest carbon-emitting countries, China's establishment of a robust carbon market will have a positive impact on global carbon emissions reduction. The following policy recommendations are based on the research findings of this paper:
1. Historical carbon prices are one of the crucial indicators in carbon price prediction. The trends and patterns in their variations can provide valuable insights for forecasting future carbon prices and can also help China formulate carbon market policies and strategies. Therefore, in-depth analysis and research on historical carbon prices is of crucial importance. First, the government should establish a carbon market reserve, injecting or withdrawing carbon allowances as needed according to market conditions, to ensure stable and controllable trading prices in the carbon market and avoid drastic fluctuations. Second, the government should build a comprehensive carbon market database, including historical carbon prices, carbon trading volumes, and carbon emissions data across different carbon markets. This will facilitate scholars' access to data for in-depth analyses of relevant issues.
2.
The EU carbon prices have a significant impact on the carbon markets in Hubei and Guangzhou. It is recommended that the government, when formulating carbon market policies, start by examining the trading rules, market demand, industry structure, and other aspects of the EU carbon market. Analyzing their impact on carbon prices will provide a better understanding of the mechanisms at play. When the EU carbon price fluctuates, domestically the government should formulate social security policies to mitigate the impact of carbon price changes on industries and social groups, alleviate the burden on affected groups, and maintain reasonable fluctuations in the domestic carbon market price. Internationally, it should enhance cooperation with other countries and regions, sharing experiences and best practices and collectively promoting the development of the global carbon market.
3. Fluctuations in energy prices are directly related to firms' production costs and energy choices and have a complex impact on carbon prices. For the Hubei carbon market, coal and crude oil prices have a significant impact on carbon price predictions. When these prices rise, the government should take a series of measures to alleviate the production cost pressure on businesses. First, the government should encourage enterprises to improve energy efficiency and upgrade their technologies; providing financial support, low-interest loans, or other incentives can help businesses use energy more effectively. Second, it should promote the development of clean energy. Increasing the supply of clean energy and reducing dependence on coal and oil consumption can lower the volatility of carbon prices and the dependence on traditional fossil fuel prices. When energy prices decline, the government needs to review and adjust the mechanism for allocating carbon quotas to ensure that the carbon market continues to drive businesses to adopt low-carbon measures. For the Guangzhou carbon market, the
government should pay more attention to the prices of natural gas and coal to ensure the accuracy of carbon price predictions and promote sustainable development.
4. On the economic front, the Hubei carbon market should monitor the price trends of the CSI 300, S&P 500, and EURCNY. For the Guangzhou carbon market, more attention should be given to the price trends of EURCNY and the S&P 500 index. These factors provide valuable references for accurately predicting carbon market prices. When these indices rise, it indicates optimistic sentiment in the stock market and economic growth, and demand in the carbon market may increase. During economic prosperity, the government should intensify training and education related to the carbon market, enhance understanding among businesses and investors, and promote active trading in the carbon market. When these indices fall, governments should stimulate economic growth by cutting taxes or increasing spending. Finally, the government should maintain reasonable stability in the exchange rate to mitigate the adverse effects of exchange rate fluctuations on the economy and energy markets.

| CONCLUSION
The establishment of a high-precision prediction model is conducive to the accurate prediction of China's carbon trading price and will support the healthy development of China's carbon trading market. In this study, the CEEMDAN decomposition method is introduced for the residual sequence correction problem. The WOA-XGBoost model proposed in this paper is used to predict the IMF and trend components generated by CEEMDAN individually, and finally, the WOA-XGBoost-CEEMDAN ensemble learning combined prediction model is established. This study utilized multiple predictive models to investigate carbon price forecasting in two Chinese carbon pilot markets and introduced four evaluation indicators to assess the prediction accuracy and stability of each model on different data sets. The empirical analysis draws the following conclusions:
1. The prediction model proposed in this paper has the highest performance among all benchmark models. Taking the carbon price prediction of Hubei as an example, the RMSE, MAE, MAPE, and R² of the model are 0.2210, 0.1778, 0.0041, and 0.9983, respectively. Furthermore, these findings were validated in other experiments, underscoring the superiority and universality of the model in carbon price forecasting.
2. Decomposition of the residual sequence can greatly improve prediction accuracy. The introduction of decomposition algorithms aids in extracting valuable information from residual sequences. Experiments have demonstrated that processing residual sequences significantly enhances the performance of predictive models.
3.
The establishment of an indicator system that takes external influences into account makes a significant contribution to the study of carbon price forecasting. Carbon prices are complex, and a wide range of factors affects their fluctuations. Considering only historical carbon price data is not sufficient to predict future carbon prices accurately. Therefore, this paper builds a hybrid prediction framework that incorporates feature selection.
4. Through the feature importance analysis, it is found that the historical carbon price and the EU carbon price are the main influencing factors for carbon price prediction.
The model proposed in this study holds significant value and prospects, offering robust support for the stable development of carbon markets, sustainable economic growth, and the achievement of global climate objectives.The application of this model will aid in guiding stakeholders' decisions within the carbon market, thereby enabling them to better address future challenges and opportunities.

F I G U R E 7
Performance of different models on the Guangzhou test set predictions. CEEMDAN, fully integrated adaptive noise empirical mode decomposition; EEMD, ensemble empirical mode decomposition; GBDT, gradient boosting decision tree; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.
The framework of the carbon price forecasting based on WOA-XGBoost-CEEMDAN. CEEMDAN, fully integrated adaptive noise empirical mode decomposition; IMF, intrinsic modal function; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.
T A B L E 1 Sample size and date range. Guangzhou: Guangdong carbon emission allowance (GDEA), 31.08.2020-04.08.2023, 721 data points.
T A B L E 2 The details of the multiple external influence factors that affected the carbon price.
F I G U R E 4 The partial autocorrelation function results of Hubei carbon price data.
T A B L E 3 Correlation analysis between prediction index and predicted label. **p < 0.01.
DUAN ET AL.
T A B L E 4 The descriptive statistics of the data.

Table 7, the RMSE, MAE, MAPE, and R² of the designed forecasting framework are 0.2210, 0.1778, 0.0041, and 0.9983, respectively.
2. A comparison of the single model XGBoost and XGBoost with the WOA added indicates that XGBoost with the WOA has higher accuracy. The values of the four evaluation metrics of WOA-XGBoost are 0.8412, 0.5231, 1.2089, and 0.9750, improvements of 9.70%, 3.24%, 2.66%, and 1.16%, respectively, over the XGBoost metrics. It is evident that the inclusion of the optimization algorithm improves the accuracy of the model.
3. The single model without the decomposition algorithm is compared with the model that includes the decomposition algorithm, and the latter shows improved prediction results. The WOA-XGBoost model with the EEMD decomposition algorithm achieved values of 0.4994, 0.3881, 0.0091, and 0.9912, percentage improvements of 40.12%, 25.80%, 99.20%, and 1.63%, respectively, a substantial gain in accuracy. Replacing the EEMD decomposition algorithm with the improved CEEMDAN algorithm further improved the four evaluation metrics by 17.76%, 17.91%, 18.68%, and 0.28%. This indicates the contribution of the CEEMDAN algorithm to the prediction results.
4.
To demonstrate the importance of developing a reasonable indicator system, this paper compares a model that considers only historical carbon price variables, a model that considers historical carbon price data and all external variables, and a model that considers historical carbon price data and the major external variables. The RMSE, MAE, MAPE, and R² of the prediction model proposed in this paper are optimal compared with the other two models; relative to the case where only historical carbon prices are considered, the evaluation indexes of the ultimate index system improve by 46.19%, 44.19%, 44.59%, and 0.43%. Based on the comparison of these three models, it can be concluded that considering relevant influencing factors can appropriately improve the accuracy of the prediction model. However, incorporating too many influencing factors may result in only a modest enhancement in model performance. The indicator system established through feature selection contributes to improving the predictive accuracy of the model.
T A B L E 8 The improvement percentage of the models in the Hubei market.
T A B L E 10 The improvement percentage of the models in the Guangzhou market.
Abbreviations: CEEMDAN, fully integrated adaptive noise empirical mode decomposition; EEMD, ensemble empirical mode decomposition; GBDT, gradient boosting decision tree; MAE, mean absolute error; MAPE, mean absolute percentage error; RMSE, root mean square error; WOA, whale optimization algorithm; XGBoost, extreme gradient boosting.
F I G U R E 9 Feature importance histogram.
T A B L E 11 Importance ranking of carbon price characteristics.
F I G U R E 10 Time series of European Union carbon market.
T A B L E 12 The forecasting results for EU data set.