Prediction of short‐term photovoltaic power via codec neural network and mode decomposition based deep learning approach

Accurate photovoltaic (PV) power prediction is of great significance for the stable operation of PV systems, but the PV power sequence is nonstationary, so it is difficult to establish an effective prediction model with a simple neural network. In this study, MCVMD‐MI‐SWATS‐Codec (multidimensional constraints variational mode decomposition‐mixed initialization‐switching from Adam to stochastic gradient descent‐codec), which is based on the idea of deep model fusion, is proposed to predict PV power generation. The MCVMD method, with the parameter K determined by a multidimensional constraint criterion, is used to decompose the PV power data, and the frequency of each component sequence is analyzed after decomposition to explore the physical characteristics and application value of each component frequency. Then, a hybrid ResNet‐LSTM (residual network‐long short‐term memory) model based on the codec mechanism integrates input data with different dimensions, such as weather conditions and historical IMFs (intrinsic mode functions), into dense vectors with the same dimension. Experimental data from a polysilicon PV array in the Australian desert environment are used to test the proposed fusion neural network model and six competing models. The results show that the MCVMD algorithm significantly helps in decomposing the nonstationary data to improve the prediction accuracy, and that the MCVMD‐MI‐SWATS‐Codec model has high prediction accuracy and robustness in both stable and unstable weather conditions.

The International Energy Agency (IEA) announced in 2020 that the installed capacity of photovoltaics has been growing at a faster rate in recent years. 1 Compared with traditional thermal power and nuclear power, PV systems work outdoors and are affected by various environmental factors, such as solar radiation, cloud amount, wind speed, temperature, and humidity. Under the influence of these natural factors, PV output is volatile and intermittent, which affects the quality and the supply reliability of the power grid. Accurate prediction of the power output of PV systems can effectively ensure the quality and stability of power grid operation and is of great significance to the integration of large-scale PV power generation. [1][2][3] There are two traditional PV power prediction methods, namely the physical model and the statistical model. The physical model [4][5][6][7] aims to establish a model of PV power generation related to natural conditions. Statistical models are black boxes that require only inputs and outputs; the relationships between parameters are not considered. However, strong disturbances such as nonstationarity and outliers in the sequence reduce the prediction accuracy of traditional statistical models. In addition, traditional filtering methods used in existing statistical prediction models, such as mean filtering and wavelet-domain denoising, have limited ability to eliminate the noise component of PV power, which restricts further improvement of the prediction accuracy. [8][9][10][11][12][13] At present, data-driven artificial neural networks and machine learning are used in the field of nonlinear fitting. [10][11][12][13]16,17 Tian Z proposed LSSVM for wind power prediction. 14 Ahmad et al used GPR to predict solar and wind power. 15 Wang et al presented a robust combination approach for short-term wind speed forecasting, in which the combined models include ARIMA, SVM, ELM, LSSVM, and GPR. 16
References 14-16 show that the time series prediction accuracy of machine learning methods is greatly improved compared with traditional statistical or physical models. In addition, a combined machine learning model has higher prediction accuracy than a single model. Wang et al used GBDT to predict time series, and the results show that GBDT is more robust than traditional machine learning methods. 17 Almonacid F et al used a shallow multi-layer perceptron for PV power prediction in 2011. 18 Experiments show that neural networks can efficiently extract complex, higher-dimensional nonlinear data features and can map the output directly from the input. With the development of artificial intelligence and computing power, researchers have used DBN (deep belief network), 19,20 LSTM, 21,22 and CNN (convolutional neural network) 23,24 to predict nonstationary sequences. The results show that the nonlinear learning ability of deep learning models is stronger than that of shallow artificial neural networks.
As research has deepened, it has been found that a single deep artificial neural network model has limited advantages and cannot meet the needs of complex prediction tasks. Therefore, some scholars have suggested fusing single models to construct fused neural networks. Wang et al used LSTM-CNN 25,26 to predict PV power series. The results show that the accuracy of the fused neural network prediction is improved, and that the fused model is more stable than single models such as MLP, CNN, and LSTM in all weather conditions. Colak et al used a multi-model fusion method, a fusion neural network model based on bionic optimization algorithms, to predict PV power. The experimental results show that adding specific optimization algorithms to the fusion algorithm can improve the prediction accuracy and convergence speed of the model and enhance its adaptability. 27 Studies 28,29 show that the prediction performance becomes worse when the time series is nonstationary, and signal decomposition is used to solve this problem. Widely used signal decomposition techniques include the wavelet transform 28 and EMD (empirical mode decomposition). [29][30][31] The wavelet transform, which is based on a kernel function, can extract signal components, but the optimization of the kernel function is difficult. The kernel function of EMD is extracted from the data, so it has good adaptability, but its mode mixing is serious. Dragomiretskiy and Zosso proposed VMD (variational mode decomposition) in 2014, 32 which applies multiple Wiener filters in multiple adaptive frequency bands. Preliminary research results show that, compared with the existing modal decomposition algorithms, VMD can effectively decompose the PV power generation sequence and solve the problem of mode mixing. 33,34 At present, there are many challenges in the prediction of PV power generation: 1. 
At present, there is no mature or reliable method to determine the value of K (the number of modal components) for VMD. 35 Furthermore, the value of K is more difficult to select because of the increased complexity of the neural network model proposed in this paper. In addition, previous studies consider only the degree of mode mixing and ignore the distortion rate of mode reconstruction. However, the distortion rate of modal reconstruction greatly affects the accuracy of the final power generation prediction. 36 [...] dramatically. However, the training difficulty of fusion neural networks is also increased by the large number of fusion model parameters. An improper initial value or optimization algorithm may cause slow convergence, divergence, overfitting, or underfitting of the fusion neural networks.
To deal with these challenges, a fusion model called MCVMD-MI-SWATS-Codec is proposed for PV power prediction. The contributions of this model are as follows:
1. MCVMD is proposed for decomposition in PV power prediction. The optimal K value can be automatically selected by constraint setting. The constraint items include the range of the distortion rate of the reconstructed signal, the distortion variance, and the over-decomposition degree.
2. Mixed initialization (MI) and SWATS (switching from Adam to stochastic gradient descent) are used to optimize the training process of the fusion model. The Kaiming 37 and Xavier 38 methods are integrated to construct a mixed initialization mechanism that ensures training efficiency, and SWATS helps to refine model training.
3. A new fusion architecture based on codec is adopted. 39,40 The codec (encoder and decoder) architecture proposed in this study gives the model a more suitable layout: each part of the model performs its own duties instead of all inputs being handled uniformly, and the feature extraction ability is much stronger. 41 Meanwhile, the internal structure of the coding layer and the decoding layer is deeply optimized. For example, the LSTM 42 model with time sensitivity is selected as the time encoder, and the ResNet (residual network) 43-45 model with spatial sensitivity is selected as the spatial encoder.
The main structure of this paper is as follows: Section 2 introduces the hybrid model framework and the role of each module in it; Section 3 presents the experimental data acquisition and model parameter selection; Section 4 compares the proposed model with the competing models in practical applications; Section 5 concludes the paper.

| Summary of MCVMD
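Assuming that the original input signal is f(t), the constrained variational problem of the traditional VMD model, in its standard form, 32 is as follows:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t) \qquad (1)$$

In the formula, $\{u_k\} = \{u_1, u_2, u_3, \ldots, u_K\}$ represents the decomposed modal components and $\{\omega_k\} = \{\omega_1, \omega_2, \omega_3, \ldots, \omega_K\}$ represents the modal center frequencies. The constraint condition indicates that the sum of all modal components should equal the original signal.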
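In order to solve the above constrained problem, the quadratic penalty factor, α, and the Lagrange multiplier, λ(t), are introduced, and the augmented Lagrangian expression, in its standard VMD form, 32 is obtained as follows:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \qquad (2)$$

The solution of (2) is carried out directly in the frequency domain. The steps are as follows:

Step 1: Initialize $\{\hat{u}_k^1\}$, $\{\omega_k^1\}$, and $\hat{\lambda}^1$, and set n = 1.

Step 2: Update each modal component for k = 1, ..., K:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i<k} \hat{u}_i^{n+1}(\omega) - \sum_{i>k} \hat{u}_i^{n}(\omega) + \hat{\lambda}^{n}(\omega)/2}{1 + 2\alpha\left(\omega - \omega_k^{n}\right)^2} \qquad (3)$$

Step 3: Update each center frequency:

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega} \qquad (4)$$

Step 4: Update the multiplier, where τ is the update step:

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{n+1}(\omega) \right) \qquad (5)$$

Step 5: Check the convergence condition, where ε is the tolerance:

$$\sum_{k=1}^{K} \frac{\left\| \hat{u}_k^{n+1} - \hat{u}_k^{n} \right\|_2^2}{\left\| \hat{u}_k^{n} \right\|_2^2} < \varepsilon \qquad (6)$$

If the convergence condition (6) is not yet met, set n = n+1 and return to Step 2. Otherwise, the iteration process ends.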
where $n$, $u_k^n$, $\omega_k^n$, and $\lambda^n$ are the iteration number of the whole process, the sequence of modal components, the central frequency of the modal components, and the Lagrange multiplier, respectively. $\hat{u}_k^n$, $\hat{\omega}_k^n$, and $\hat{\lambda}^n$ represent the Fourier transforms of $u_k^n$, $\omega_k^n$, and $\lambda^n$, respectively. It can be seen from the above steps that the number of modal components, K, needs to be set manually. However, at present, there is no mature method to determine the hyperparameter K of the VMD algorithm. If an intelligent optimization algorithm or the center frequency method is used to find K, the amount of calculation is greatly increased, and the distortion rate of the reconstructed signal is neglected in the process. 46 In order to solve these two problems, a method is proposed that searches for K only by adding constraints. The distortion rate and the distortion variance of the reconstructed signal should be within 5% and 0.01, respectively. The degree of over-decomposition should be limited to one error component. According to the above ideas, the method is detailed in pseudo code, in which $f_r$, $e_u^K$, $e_{var}^K$, and $e_c^K$ represent the reconstructed signal sequence, the distortion rate of the reconstructed signal, the variance of the reconstructed signal error, and the number of components that are over-decomposed, respectively. It is worth noting that K does not iterate from 0. Pre-experiments show that when K is less than 4, the reconstructed signal accuracy obviously does not meet the requirements, and when K is greater than 6, the reconstructed signal error requirements have been met. The upper bound of K is set to 9 to ensure sufficient margin. Therefore, K = 4:9 is considered in the pseudo code.
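The constraint-based search for K described above can be sketched as follows. This is a minimal illustration, not the authors' pseudo code: the function names `select_k`, `toy_decompose`, and `no_overdecomposition` are hypothetical, the distortion rate is assumed to be a relative absolute error, the over-decomposition limit is assumed to be "at most one component", and the decomposer is a toy stand-in for VMD whose reconstruction error simply shrinks as K grows.

```python
from statistics import pvariance


def select_k(signal, decompose, count_overdecomposed,
             k_range=range(4, 10), max_rate=0.05, max_var=0.01, max_over=1):
    """Return (K, e_u, e_var, e_c) for the first K meeting all constraints, else None."""
    for k in k_range:
        modes = decompose(signal, k)                    # K component sequences
        recon = [sum(vals) for vals in zip(*modes)]     # reconstructed signal f_r
        err = [s - r for s, r in zip(signal, recon)]
        # Assumed definition: relative absolute reconstruction error.
        e_u = sum(abs(e) for e in err) / sum(abs(s) for s in signal)
        e_var = pvariance(err)                          # variance of the reconstruction error
        e_c = count_overdecomposed(modes)               # number of over-decomposed components
        if e_u <= max_rate and e_var <= max_var and e_c <= max_over:
            return k, e_u, e_var, e_c
    return None


def toy_decompose(signal, k):
    """Toy stand-in for VMD: k identical modes that recover a (1 - 0.5**k) share of the signal."""
    scale = (1.0 - 0.5 ** k) / k
    return [[s * scale for s in signal] for _ in range(k)]


def no_overdecomposition(modes):
    """Toy stand-in: report that no component is over-decomposed."""
    return 0


signal = [0.0, 0.5, 1.0, 0.5] * 10
result = select_k(signal, toy_decompose, no_overdecomposition)
```

With a real VMD routine plugged in as `decompose`, the same loop would scan K = 4:9 and stop at the first value that satisfies the three limits stated in the text.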

| Summary of MI and SWATS
Generally, it is found that the Kaiming method is more effective when the model uses an unsaturated activation function such as ReLU. 37 The initial values of the model generated by the Kaiming method obey the normal distribution N(0, std), where std can be expressed as

$$\text{std} = \sqrt{\frac{2}{(1 + a^2) \times \text{fan\_in}}} \qquad (11)$$

where a represents the slope of the activation function on the negative semi-axis of x. Therefore, when the activation function is ReLU or Leaky ReLU, a = 0 or a = a_i, respectively. The fan_in represents the size of the input data and can be calculated as H × W × C_in, where H, W, and C_in are the dimensions of the input data.
The Xavier method is more applicable when a saturated activation function such as Sigmoid or Tanh is used. 38 The initial values of the model generated by the Xavier method obey the normal distribution N(0, std_1), where std_1 can be expressed as (12):

$$\text{std}_1 = \text{gain} \times \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}} \qquad (12)$$

In Equation (12), gain represents the scaling factor. When the activation function is Sigmoid or Tanh, gain is 1 or 5/3, respectively. The value of fan_out is H × W × C_out, where H, W, and C_out are the dimensions of the output data.
The ReLU activation function is used in the ResNet model, and the Sigmoid and Tanh are used as the activation function in the LSTM model in this paper. Therefore, two initialization algorithms are used simultaneously in the hybrid model, which can be called MI (Mixed Initialization).
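As a small numerical illustration of mixed initialization, the sketch below computes the standard deviations of the commonly used Kaiming-normal and Xavier-normal schemes and picks one per layer type. The function names and the `layer_kind` switch are hypothetical; the paper's actual implementation is not shown in the text.

```python
import math

# Gains for saturated activations, as stated in the text.
GAIN = {"sigmoid": 1.0, "tanh": 5.0 / 3.0}


def kaiming_std(fan_in, a=0.0):
    """Kaiming-normal std; a is the negative-side slope (0 for ReLU)."""
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


def xavier_std(fan_in, fan_out, gain=1.0):
    """Xavier-normal std with a scaling factor gain."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


def mixed_init_std(layer_kind, fan_in, fan_out, activation="relu", a=0.0):
    """Hypothetical MI rule: Kaiming for ResNet (ReLU) layers, Xavier for LSTM layers."""
    if layer_kind == "resnet":
        return kaiming_std(fan_in, a)
    if layer_kind == "lstm":
        return xavier_std(fan_in, fan_out, GAIN[activation])
    raise ValueError(f"unknown layer kind: {layer_kind}")


conv_std = mixed_init_std("resnet", fan_in=8, fan_out=8)
gate_std = mixed_init_std("lstm", fan_in=4, fan_out=4, activation="sigmoid")
cell_std = mixed_init_std("lstm", fan_in=4, fan_out=4, activation="tanh")
```

Weights would then be drawn from a zero-mean normal distribution with the returned standard deviation, for example with `random.gauss(0.0, conv_std)`.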
The convergence of the deep fusion model needs to be improved. At present, it is found that the SGD algorithm is more versatile than Adam, AdaGrad, and RMSprop (root mean square prop), but its convergence rate is lower than theirs. [47][48][49] The SWATS 50 algorithm is adopted to optimize the model; it automatically changes the learning method and switches from Adam to SGD when the trigger conditions are met. The switching process is adaptive, and the method does not increase the number of hyperparameters in the optimizer.
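The trigger condition can be illustrated with a simplified sketch of the switching test as it is commonly described for SWATS: the Adam step is projected onto the gradient to estimate an equivalent SGD learning rate, and the switch happens once a bias-corrected running average of that estimate stabilizes. The function names, the tolerance `eps`, and the default `beta2` are assumptions for illustration; this is not the training code used in the paper.

```python
def projected_sgd_lr(adam_step, grad):
    """Estimate the SGD learning rate gamma = (p.p) / (-p.g) from an Adam step p and gradient g."""
    pp = sum(p * p for p in adam_step)
    pg = sum(p * g for p, g in zip(adam_step, grad))
    if pg == 0.0:
        return None          # projection undefined; skip the test for this step
    return pp / (-pg)


def should_switch(lam_prev, gamma, k, beta2=0.999, eps=1e-9):
    """Update the running average of gamma and report whether to switch to SGD at step k."""
    lam = beta2 * lam_prev + (1.0 - beta2) * gamma
    corrected = lam / (1.0 - beta2 ** k)      # bias-corrected average
    switch = k > 1 and abs(corrected - gamma) < eps
    return switch, lam


# If the Adam step were exactly a plain gradient step with rate 0.05,
# the projection recovers that rate.
grad = [2.0, -4.0]
step = [-0.05 * g for g in grad]
gamma = projected_sgd_lr(step, grad)
```

After the switch, training would continue with SGD using the bias-corrected estimate as its learning rate, which is why no extra optimizer hyperparameter has to be tuned by hand.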
LI et al.

| Codec mechanism of neural network
In this study, the power generation of three historical days and the NWP (numerical weather prediction) of the next two days are used as input features of the neural network, but the two inputs do not have the same dimensions. Therefore, a neural network based on the codec mechanism is applied. It can guide the data flow and fuse feature signals of different sizes, as shown in Figure 1A. In general, researchers use an RNN as the encoder and decoder, but in this study, LSTM and ResNet are fused as the encoder and decoder. Compared with the conventional RNN, the feature extraction ability of the codec neural network model is further enhanced, and the model can extract temporal and spatial features separately. LSTM and ResNet are introduced in detail below.
There are three gating units and a convergence layer in LSTM, as shown in Figure 1B. The forget gate regularly discards historical information held by the neurons, which means that information from historical time points can be discarded regularly. The input gate adds new information to the neurons and generates memories. The output gate makes a judgment and transmits the output information after combining the convergence layer information and the historical information.
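The gating behavior described above can be made concrete with a single step of a scalar LSTM cell. This is a minimal sketch of the standard LSTM equations with one input and one hidden unit; the weight layout (a dictionary of `(w_x, w_h, b)` triples) and the gate names are chosen here for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell; weights maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = weights[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate memory content
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state (the cell output)
    return h, c


# With all-zero weights every gate is 0.5 and the candidate is 0,
# so half of the previous cell state is kept.
zero = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h1, c1 = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, weights=zero)
```

Running the step repeatedly over a sequence, and feeding `h` and `c` forward, gives the time-sensitive encoding that the text assigns to the LSTM part of the codec.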
ResNet is developed on the basis of CNN. The basic structure of ResNet is the same as that of CNN: both are composed of convolution layers, but the connections are different. ResNet has parallel and cascade connection modes, while CNN adopts a strict cascade. Taking ResNet as an example, the basic structure of convolution is described as follows. The neurons of ResNet are arranged in a multidimensional space and have certain spatial structure characteristics. The neurons are also called convolution kernels. The dimension of a convolution kernel should be equal to that of the input data, and the number of convolution kernels is equal to the dimension of the output data. A single ResNet neuron slides along a single channel to generate a feature, and each convolution kernel outputs a one-dimensional vector after calculation, as shown in Figure 1C. Thus, when the one-dimensional time series passes through the multidimensional neuron array, it is mapped by ResNet to the corresponding positions in the multidimensional space, which further enhances the extraction of the internal features of the time series.
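The difference between a strict cascade and the parallel (residual) connection can be sketched with a one-dimensional residual block. The example below is an illustrative simplification with a single channel, zero padding, and hand-written kernels; the function names are hypothetical and the block is not the exact layer configuration of the paper.

```python
def conv1d_same(x, kernel):
    """Slide an odd-length kernel over a zero-padded series; output length equals len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, kernel1, kernel2):
    """Two cascaded convolutions plus a parallel identity path: relu(x + F(x))."""
    out = conv1d_same(relu(conv1d_same(x, kernel1)), kernel2)
    return relu([a + b for a, b in zip(x, out)])


# With zero kernels the convolution path contributes nothing,
# so the input passes through the identity path unchanged (up to the final ReLU).
passthrough = residual_block([1.0, -2.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
doubled = residual_block([1.0, 2.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

Because the identity path always carries the input forward, gradients can bypass the convolution path, which is why stacking many such blocks remains trainable.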
With the above-mentioned methods, the fusion of ResNet and LSTM is adopted in the codec to extract both temporal and spatial information. A single LSTM is not adopted in the codec because it lacks spatial feature extraction ability. In addition, ResNet has more layers than an ordinary CNN, and its feature extraction ability is stronger than that of CNN. Therefore, a mix of ResNet and LSTM is employed as the codec in this study.

| Hybrid model framework
MCVMD, Codec, MI, and SWATS are integrated into a hybrid model, and the hybrid model framework includes four parts: data decomposition and classification, modal decomposition of the power generation sequence, modal component prediction, and prediction reconstruction. The detail of the model framework is shown in [...]
The parameters of the PV array obtained by DKASC are shown in Table 1. DKASC uses a variety of sensors to collect 9 status data of the PV array every 5 minutes from 6:00 to 19:00, so there are 156 sampling points per sensor per day.
The state characteristics include active energy delivered-received (kWh), current phase average (A), active power (kW), wind speed (m/s), weather temperature Celsius (°C), weather relative humidity (%), global horizontal radiation (W/m²), diffuse horizontal radiation (W/m²), wind direction (degrees), and weather daily rainfall (mm), but only the last seven factors are selected to build the prediction model. The data set division is shown in Figure 4. Each sensor has 56 940 sampling points from December 2008 to December 2009 and 14 040 sampling points from January 2010 to March 2010. The field of the sliding window sampler is 5 days (780 sampling points), and its stride is 1 day (156 sampling points). There are 450 samples in the total dataset, and they are used to train and test the fused model. Each sample contains five days of sampling data, where all the features of the first three days (1st-3rd day) and the NWP of the last two days (4th-5th day) are the input data, and the power of the last two days (4th-5th day) is the prediction label. The PV power of the next 48 hours is predicted by the multi-step prediction method.
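The sliding window sampler can be sketched as follows. This is a simplified illustration in which each element of `days` stands for one day of records (156 sampling points in the paper); the function name and the split into a `history` part and a `future` part are assumptions made for the example, and a short 10-day toy series is used instead of the real dataset.

```python
def sliding_windows(days, field=5, stride=1, history=3):
    """Cut a list of daily records into (history_days, future_days) samples."""
    samples = []
    for start in range(0, len(days) - field + 1, stride):
        window = days[start:start + field]
        # First `history` days: all features are inputs.
        # Remaining days: NWP is an input and power is the prediction label.
        samples.append((window[:history], window[history:]))
    return samples


toy_days = list(range(10))          # day indices standing in for daily records
samples = sliding_windows(toy_days)
```

Each returned pair corresponds to one training sample: three historical days followed by the two days to be predicted, with consecutive samples shifted by one day.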

| Neural network model parameters
In order to reduce the complexity of the model established in Section 2.4, the encoder and the decoder use the same parameters, even though their structures are different. Based on this, three fusion neural networks of different scales are designed, as shown in Table 2. The batch size of the model is 40. The optimization algorithm is Adam, the maximum number of epochs is 150, and the loss function is MSE. The input data are the weather conditions of the 1st-3rd day, the power generation sequence of the 1st-3rd day, and the weather conditions of the 4th-5th day. The output data are the power generation sequence of the 4th-5th day. The experiments are run with Python 3.7 and PyTorch 1.2.0 on a Core (TM) i7-9750H CPU @ 2.60 GHz. The iterative process of the loss function during training is shown in Figure 5. It can be seen that the small-scale model causes underfitting, as shown in Figure 5A. The lowest value of the training loss is 0.21, which is obviously poor and needs to be improved. The large-scale model causes overfitting, as shown in Figure 5B.
Although the training loss is satisfactory, the testing loss converges to 0.28. Finally, both the training loss and the testing loss of the middle-scale model converge to 0.035 at the 400th iteration in Figure 5C. Therefore, the result shown in Figure 5C is obviously superior to those in Figure 5A,B.
The effectiveness of the MI+SWATS algorithm is then verified on the basis of the middle-scale model, as shown in Figure 5D. Figure 5D shows that MI+SWATS tends to converge to 0.02 at the 225th iteration. Obviously, the MI+SWATS algorithm is better than the other algorithms in both convergence speed and final result. The resulting structures of ResNet and LSTM are shown in Figure 6A,B. In Figure 6A, the input data pass through a data filter layer and are then fed successively into three residual blocks, each composed of two convolution layers; the red dotted line indicates the residual path, which mitigates gradient vanishing. Then, the number of features is reduced through the pooling layer to prevent overfitting. Finally, the data are output through the fully connected layer. In Figure 6B, each solid block represents a neuron of the LSTM. The number of neurons in each layer is equal to the sequence length, and there is a directed connection between any two adjacent neurons. The input data are calculated and passed along the arrow direction. Finally, the final output data are obtained by integrating the output data of each layer.

In order to comprehensively analyze the effectiveness of MCVMD, an experiment based on discrete Fourier transform analysis is carried out. Firstly, a randomly selected continuous 5-day generation power sequence is taken as the original signal, as shown in Figure 7A. Thus, there are 780 (156 points × 5 days) sampling points in the original signal. It is assumed that the virtual sampling frequency is 780 Hz, so the total sampling duration is 1 s. The original signal is analyzed by DFT (discrete Fourier transform), and the results excluding the DC (direct-current) component are shown in Figure 7B. The frequency of the DC component is 0 Hz, and its spectral density is 2038.2618 kW/Hz. According to the Nyquist sampling theorem, the spectrum bandwidth is 0-390 Hz, and the frequency resolution is 1 Hz. Figure 7B shows that the spectral density of the 5 Hz component is 858.1843 kW/Hz.
Obviously, the spectral density of the 0 Hz and 5 Hz components is significantly higher than that of the other frequency components, so these two components play a decisive role in the amplitude of the original signal. The 0 Hz signal has an infinite period; thus, its time domain image is a horizontal line, which indicates that the 0 Hz signal represents the inherent average level of generated power. The period of the 5 Hz signal is 0.2 s, which spans 156 sampling points, so the real period is 13 h (5 min × 156). Therefore, the 5 Hz signal accurately describes the periodicity of daily power generation. The amplitudes of the other frequency components are low and have no obvious physical significance; they contain fluctuation information caused by weather changes and are less important than the 0 Hz and 5 Hz signals. Secondly, according to the method in Section 2.3, the convergence process of the MCVMD algorithm is shown in Figure 7C, and the decomposition result of MCVMD is shown in Figure 7D. Figure 7C shows that the center frequencies stabilize and reach the convergence requirement of Equation 6 after the MCVMD algorithm has run for 125 iterations, and the central frequencies of IMF (intrinsic mode function) 1-IMF7 are 0.0201 Hz, 5.2522 Hz, 22.0994 Hz, 49.8280 Hz, 77.9442 Hz, 164.3210 Hz, and 335.469901 Hz, respectively. Figure 7D shows that K is 7 and the original signal is decomposed into seven components. IMF1 and IMF2 approximately represent the 0 Hz and 5 Hz signals in Figure 7B. This proves that the MCVMD algorithm is effective for extracting important low-frequency signals. The effectiveness of the MCVMD algorithm for medium- and high-frequency signal extraction needs further analysis.
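The link between the 5 Hz component and the 13 h daily period can be checked with a small DFT experiment. The sketch below uses a synthetic signal (a half-rectified sine with one bump per 156-point day), not the DKASC measurements, and a direct DFT written in plain Python; the variable names are chosen for the example.

```python
import cmath
import math

POINTS_PER_DAY = 156                 # one sample every 5 min from 6:00 to 19:00
DAYS = 5
N = POINTS_PER_DAY * DAYS            # 780 sampling points
FS = 780.0                           # virtual sampling frequency in Hz (duration 1 s)


def dft_magnitudes(signal):
    """Normalized magnitudes of the DFT bins from 0 Hz up to the Nyquist bin."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


# Synthetic stand-in for generated power: one positive bump per day, zero otherwise.
signal = [max(0.0, math.sin(2.0 * math.pi * i / POINTS_PER_DAY)) for i in range(N)]

mags = dft_magnitudes(signal)
peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])   # skip the DC bin
peak_hz = peak_bin * FS / N                                   # frequency resolution is FS / N = 1 Hz
period_hours = (N / peak_bin) * 5.0 / 60.0                    # samples per cycle times 5 min
```

For this synthetic signal the strongest non-DC bin is the one that completes five cycles over the five-day window, which maps back to a real period of 156 samples, or 13 h.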
Thirdly, K is iterated from 4 to 9 during the operation of the MCVMD algorithm in Section 2.1. The decomposition results for the remaining values of K are explained below. The results of decomposition are shown in Figure 8A-C when K is 4, 5, and 6, respectively. It is obvious from the figure that the mode mixing is not serious. The central frequency of each component when K is 4, 5, and 6 is shown in Table 3. These results show that as K increases, the number of medium- and high-frequency signals above 10 Hz increases significantly, which indicates that K mainly affects the number of medium- and high-frequency signals. The constraint conditions of (7)-(10) are manually calculated when K is 4, 5, and 6, respectively, as shown in Table 4. As the number of medium- and high-frequency components increases, $e_u$ and $e_{var}$ decrease continuously, reaching minimum values of 5.61% and 0.0164, respectively, when K is 6. Although the mode mixing is acceptable when K is 6, $e_u$ and $e_{var}$ still exceed the limits of 5% and 0.01 and can be further optimized. Finally, the original signal is decomposed when K is 8 and 9, respectively, and the corresponding results are shown in Figure 8D,E. The mode mixing for K = 8 and K = 9 is more serious than that when K is 4, 5, and 6. For instance, IMF7 and IMF8 overlap seriously in Figure 8D, and IMF8 and IMF9 overlap seriously in Figure 8E. The central frequency of each component when K is 8 and 9 is shown in Table 3. Obviously, the number of medium- and low-frequency signals below 200 Hz remains unchanged when K is 8 and 9, but the number of high-frequency signals above 200 Hz increases significantly. The constraint conditions of (7)-(10) are manually calculated when K is 8 and 9, as shown in Table 4. The results show that $e_c$ is 2 and 3 when K is 8 and 9, respectively. This indicates that over-decomposition may occur when K is 8 and 9. 
It is unnecessary to over-decompose the high-frequency components because they contain random factors that are difficult and unnecessary to predict.
In conclusion, over-decomposition does not occur and the values of $e_u$, $e_{var}$, and $e_c$ meet the engineering requirements of MCVMD only when K = 7.

ALGORITHM COMPARISON
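There are two types of weather conditions in the dataset: stable weather and unstable weather. The evaluation indicators in this section include RMSE (root mean square error), MAE (mean absolute error), MAPE (mean absolute percentage error), and R² (R-square), which correspond to (13)-(16):

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2} \qquad (13)$$

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \qquad (14)$$

$$\text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \qquad (15)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \qquad (16)$$

where N is the sequence length, $\hat{y}_i$ is the predicted value, $y_i$ is the measured value, and $\bar{y}$ is the mean of the measured values. The detailed parameters of the comparison models are shown in Table 5.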
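The four indicators can be computed with a few lines of code. The sketch below uses the usual textbook definitions of RMSE, MAE, MAPE, and R²; the function names and the two-point toy data are illustrative, and MAPE is undefined whenever a measured value is zero, so such points would need to be excluded beforehand.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; every measured value must be nonzero."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


measured = [2.0, 4.0]      # toy measured power values
predicted = [1.0, 5.0]     # toy predictions
scores = {
    "RMSE": rmse(measured, predicted),
    "MAE": mae(measured, predicted),
    "MAPE": mape(measured, predicted),
    "R2": r_squared(measured, predicted),
}
```

Lower RMSE, MAE, and MAPE indicate a better fit, while R² approaches 1 as the predictions explain more of the variance in the measured values.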

| Experiment in stable weather
Generally, the main characteristics of stable weather include sunny skies, little cloud, and no sand-dust. The radiation intensity collected by the sensor in stable weather should exhibit periodic and regular changes. The test results of the MCVMD-MI-SWATS-Codec algorithm on February 13th-14th, 2010 and March 24th-25th, 2010 are shown in Figure 9A,C. Figure 9A,C show that the MCVMD-MI-SWATS-Codec algorithm can accurately predict the trend of PV power generation. With the help of the MCVMD algorithm, even subtle fluctuations in power generation can be accurately predicted. The dispersion degree of the prediction results is shown in Figure 9B,D. Figure 9B,D show that the yellow points fit the red reference line closely without obvious abnormal points; thus, the proposed algorithm shows high anti-interference performance. A further experiment compares the performance of MCVMD-MI-SWATS-Codec and the competing algorithms in stable weather conditions. The prediction results of the competing algorithms on February 13th-14th, 2010 and March 24th-25th, 2010 are shown in Figure 10A,B. The evaluation indicators of the predicted values of each model are shown in Table 6.
The results show that the MAPE, RMSE, and MAE of MLP and SVM are more than 10%, 0.4, and 0.3, respectively, and their R² is less than 0.9. The MAPE and RMSE of LSTM, ResNet, and ResNet-LSTM are more than 5% and 0.2, respectively. The MAE of LSTM, ResNet, and ResNet-LSTM is 0.2272, 0.1770, and 0.1637, respectively, and their R² is more than 0.95. However, as the complexity of the neural network increases, the training time of the network [...] 16.31%, and 32.25%, respectively, and the reduction ratio of training time is 17.70%. These results prove that the codec mechanism effectively coordinates the fusion of data flow and features so that the stationary feature factors from MCVMD are fully utilized, which directly improves the accuracy and stability of prediction. The MI-SWATS algorithm effectively reduces the training time from 820.66 s to 675.48 s. It is also known from Section 3.2.1 that the MI-SWATS algorithm can indirectly improve the prediction accuracy. Three additional evaluation indexes, RRMSE, IA, and TIC, are added to RMSE, MAE, MAPE, and R² to fully evaluate these models.
According to Table 6, the MCVMD-MI-SWATS-Codec proposed in this paper performs best on the TIC, IA, and RRMSE indexes. The error distribution of the predicted values of each model is shown in Figure 11. It shows that the error quartiles of the MCVMD-MI-SWATS-Codec algorithm are significantly lower than those of the other competing models. The reason is that the other competing models lack modules to reduce the nonstationarity of the sequence, which makes it difficult for them to extract the main features of the nonstationary sequence. In addition, the other competing models have a single structure and limited feature extraction ability, so their prediction accuracy needs to be improved. This further illustrates that the proposed algorithm has high accuracy and high stability in stable weather conditions.

| Experiment in unstable weather
Generally, the main characteristics of unstable weather include continuous cloud, sand-dust weather, rain, etc. The radiation intensity fluctuates greatly in this weather. Some unstable weather data were selected for testing during the experiment. The test results of the [...] Figure 12A,C. There was rain on January 11th, 2010, and the rest of the time was cloudy. Figure 12A,C show that MCVMD-MI-SWATS-Codec has an excellent fitting effect in unstable weather. Figure 12B,D show that the predicted values lie close to the red reference line with little dispersion, which proves that the algorithm still shows good stability in unstable weather. The test results of the competing algorithms on January 10th-11th, 2010 and February 26th-27th, 2010 are shown in Figure 13A,B. The evaluation indicators of the predicted values of each model are shown in [...] In addition, the MCVMD-MI-SWATS-Codec proposed in this paper performs best on the IA, TIC, and RRMSE indexes. The reason for the above phenomenon is that the features of a nonstationary sequence are difficult to obtain with the other competing models, which reduces their prediction accuracy even in stable weather conditions. Moreover, there are many disturbances in unstable weather, so the feature extraction ability of the other competing models is further reduced because of their simple structure and lack of a filtering module. Experimental results show that the fusion model proposed in this study greatly improves the prediction accuracy in unstable weather conditions. In summary, the prediction error distribution of all the models is shown in Figure 14. It shows that MCVMD-MI-SWATS-Codec has better prediction accuracy and stability in unstable weather conditions.

| Performance of models with different inputs
In order to further understand the prediction performance of MCVMD-MI-SWATS-Codec, this section focuses on the prediction accuracy for the next 24 h, 48 h, and 72 h when the length of the input historical data is 1, 2, 3, 4, or 5 days. The experimental results are shown in Figures 15 and 16. They show that the MAPE of the proposed algorithm is lower than that of the competing models. Note that the proposed algorithm has the best prediction effect when the input historical data cover 3 days; this is because the correlation between input and output is reduced if the input data are too long, and the extraction of historical information is insufficient if the input data are too short. Therefore, the best input data length is 3 days. Compared with the competing models, the MAPE of the proposed model declines to different degrees when the length of the input data is 3 days. Figure 17A shows that the prediction MAPE of this model is significantly lower than that of the other models in stable weather, especially for the 24-hour prediction. The highest reduction ratio of MAPE is 93.33% for the 24-hour prediction and 76.13% for the 48-hour prediction. Figure 17B shows that the proposed model still contributes greatly to reducing MAPE in unstable weather. Although the weather conditions fluctuate violently, the improvement brought by the algorithm is obvious, especially for the 24- and 48-hour predictions. The highest reduction ratios of MAPE are 86.91% and 79.71% for the 24- and 48-hour predictions, respectively. [...] PV power output prediction, the MAPE of PV prediction is less than 5% with the proposed model. In addition, the proposed model has less training time than other fusion models such as ResNet-LSTM and MCVMD-ResNet-LSTM. 
The major contribution of this paper is the creation of a hybrid model named MCVMD-MI-SWATS-Codec that can fully extract the frequency characteristics of time series by taking full advantage of MCVMD, MI, SWATS, and Codec. Furthermore, the effective feature extraction of each frequency component can further improve the prediction accuracy of PV power. In terms of efficiency, the proposed model has less training time in different scenarios. This model can provide accurate data for the safe and efficient operation of grid-connected PV power generation. The method is suitable for short-term prediction; however, whether it is applicable to long-term prediction remains to be determined. Therefore, we want to extend the contribution of this study to long-term prediction; furthermore, combining an attention mechanism to increase the ability of the model to capture long series features is considered for future work.