A novel NOx prediction model using the parallel structure and convolutional neural networks for a coal‐fired boiler

In this paper, a novel model with a parallel structure is proposed to predict NOx emissions from coal-fired boilers by using historical operational data, coal properties, and convolutional neural networks. The model inputs are processed and passed into three parallel subnetworks with well-designed building blocks. The features learned by the three subnetworks are fused and used to predict NOx emissions from a 330-MW pulverized coal-fired utility boiler. A comprehensive comparison of different prediction models based on deep learning algorithms shows that the prediction model proposed in this paper outperforms the others under the root mean square error criterion. The results show that the parallel structure is key to obtaining accurate predictions while reducing model complexity, which suggests that model performance can be improved through careful design of the model architecture.


| INTRODUCTION
Power plants that burn fossil fuels release nitrogen oxides (NOx). In many countries, NOx emissions from coal combustion cause pollution that severely impacts human health and the environment. 1 Strict regulations have therefore been imposed to limit power plants' NOx emissions, 2 and several technologies have been developed to meet these increasingly stringent environmental requirements. 3 A primary focus of these low-NOx technologies is building an accurate NOx prediction model. 4 However, it is difficult to build such a model from the fluid mechanics and chemical reaction theory of coal-fired boilers, so an alternative approach is necessary.
One approach to building a NOx prediction model is to use machine learning (ML) algorithms to establish the relationship between historical operational data and NOx emissions. The parameters of an ML algorithm are trained on data collected from a database attached to a coal-fired boiler; a trained ML algorithm is considered a NOx prediction model when it gives accurate NOx predictions. In early studies, researchers preferred traditional ML algorithms when building NOx prediction models. [5][6][7][8][9][10] For example, Zhou et al. used a shallow neural network to predict NOx emissions from a coal-fired boiler. 6 Lv et al. built an ensemble learning paradigm based on least squares support vector regression to predict NOx emissions. 7 Tan et al. used an extreme learning machine to predict NOx emissions from a coal-fired boiler. 10 However, these traditional ML algorithms have simple structures with few parameters, which makes it challenging for them to learn meaningful representations from complex data. 11 These shortcomings ultimately limit the accuracy of NOx prediction models in practice.
Recently, there has been much interest in deep learning (DL) algorithms and their applications. 12 DL algorithms are a subset of ML algorithms. Compared with traditional ML algorithms, DL algorithms have well-designed structures with more parameters, which makes it easier for them to exploit big data and learn good representations from complex data. 13 Some researchers have used the long short-term memory (LSTM) network to build DL-based NOx prediction models. 14,15 An additional computational unit, the attention mechanism, has also been introduced to boost the performance of LSTM networks. 16,17 These studies suggested that LSTM-based NOx prediction models outperform traditional ML-based ones. However, existing LSTM-based models suffer from two severe limitations. First, they depend on high-quality data sets created by data preprocessing methods such as principal component analysis; these methods are tricky and expensive, and they become major bottlenecks to applying NOx prediction models in practice. Second, an LSTM network contains multiple components, which makes LSTM-based models computationally costly and challenging to train. The attention mechanism further exacerbates this computational cost.
Several researchers have focused on other DL algorithms. Adams et al. designed a deep neural network (DNN) consisting of multiple layers to predict NOx emissions; however, their model's performance depended heavily on the quality of the training sets. 18 Li and Hu built NOx prediction models based on the convolutional neural network (CNN) and performed a detailed comparative analysis among different CNN-based NOx prediction models. 19 Their studies suggested that CNNs are a viable alternative for building NOx prediction models. Compared with other DL algorithms, CNNs perform simple and efficient convolution operations, which gives them a low computational cost. 20 Furthermore, several well-designed building blocks have been proposed to boost the performance of CNN-based models. 21 To build a CNN-based NOx prediction model, several building blocks are stacked in series to form a hierarchical architecture. It is generally accepted that the number of model parameters is essential to a best-performing CNN-based model. However, a model with too many parameters is time-consuming to train, which limits its practical applications. How can the number of parameters of a CNN-based model be reduced while maintaining high performance? This is still an issue worth exploring.
This paper presents a novel CNN-based model with a parallel structure for modeling NOx emissions. The model parameters were trained on data collected from the database attached to a 330-MW pulverized coal-fired utility boiler. Experimental results show that this parallel structure is the key to reducing the number of model parameters while maintaining high NOx prediction accuracy. Comparisons with other DL-based NOx prediction models are also made. Section 2 describes the data set and the model design. Section 3 describes the evaluation of our model and detailed comparisons among different DL-based models. Section 4 concludes the paper.

| Brief description of the coal-fired boiler
In this study, the object is a 330-MW subcritical tangential pulverized coal-fired utility boiler manufactured by Shanghai Boiler Co., Ltd. The boiler belongs to one unit of the Dong Sheng power plant in Inner Mongolia, China. The schematic diagram of the boiler is shown in Figure 1. The furnace is 65.500 m in height and has a cross-section of 14.022 m × 14.022 m. Five layers of primary air nozzles (A, B, C, D, and E) and seven layers of secondary air nozzles (AA, AB, BC, CC, DD, DE, and EE) are located alternately in the vertical direction. Coal-air mixtures are fed to the burners on levels A-D.

| Data preparation
This study collected a total of 86,400 raw data points covering five consecutive days from the distributed control system with a time resolution of 5 s. The measurements in each raw data point are divided into three groups. First, the operational variables of the boiler contain a rich source of information that partly reflects the dynamics of the boiler. Fifty-five operational variables were determined based on engineers' suggestions and knowledge of the boiler, including boiler load (x1), main steam pressure (x2), total fuel flow (x3), total airflow (x4), coal-feeder rate (x5), and other variables up to x55. Second, there may be a correlation between the amounts of NOx emitted at different moments over a short period; therefore, NOx emissions at side A and side B of the furnace exit (x56 and x57) should also be considered. Third, the formation of NOx is directly related to the properties of the combusted coal. Industrial analysis is an accepted method to measure coal properties. Therefore, the daily industrial analysis results of the combusted coal, namely the volatile matter (x58), the ash content (x59), the moisture content (x60), the sulfur content (x61), and the quantity of produced heat (x62), are used to predict NOx emissions. The detailed analysis results are listed in Table 1.
Previous studies have shown that well-crafted data preprocessing methods can significantly improve the quality of raw data by reducing the dimension of the raw data. 15,17 However, the structure of the processed samples is fundamentally different from the raw samples.
In addition, some complex calculations involved in these methods cannot effectively handle a single sample or a small number of samples in a real-time setting. These drawbacks make it difficult for the trained models to quickly obtain NO x predictions on new samples.
In this study, the min-max scaling method is adopted to eliminate the adverse effects of scale differences in the raw data. The raw data points are rescaled as follows:

z_i = (x_i − min(x_i)) / (max(x_i) − min(x_i)),  (1)

where x_i denotes a measurement in the raw data point, z_i denotes the corresponding measurement in the processed data point, and min(x_i) and max(x_i) denote the minimum and maximum values of the i-th measurement in the data set.
Compared with data preprocessing methods that perform dimensionality reduction, the min-max scaling method has several advantages. First, it is simple and efficient, as it introduces no complex computational processes. Second, it improves data quality without changing the structure of the raw data. Third, it allows a single data point to be processed in real time. Consequently, the trained NOx prediction model can be applied directly to unseen samples processed by the min-max scaling method.
After data preprocessing, each sample in the data set is formed as a pair (z(k), y(k)), where z(k) is a matrix whose rows are consecutive processed observations [z_1, z_2, …, z_62] and y(k) contains the corresponding NOx emissions at sides A and B of the furnace exit. For each observation in a sample, all the values z_i, i = 1, 2, …, 62 come from the same day. Therefore, the values of z_i, i = 56, 57, …, 62 are determined by the day the observations come from. For example, if the coal property parameters are from Day 1, they equal the corresponding values in the row "Day 1" of Table 1. In other words, the coal quality parameters are the same in all observations from the same day.
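The per-variable scaling in formula (1) can be sketched as follows. This is a minimal illustration, not the paper's code; the point is that the minimum and maximum are fitted once on the training data and then reused, so a single unseen data point can be rescaled in real time.

```python
import numpy as np

def min_max_scale(X, x_min=None, x_max=None):
    """Rescale each measurement (column) of X to [0, 1] per formula (1).

    When x_min/x_max are given (fitted on training data), the same
    range is reused, so one new sample can be processed on its own.
    """
    if x_min is None:
        x_min = X.min(axis=0)
    if x_max is None:
        x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min), x_min, x_max

# Fit the scaling range on training data (toy values), then reuse it.
train = np.array([[100.0, 10.0], [300.0, 20.0], [200.0, 30.0]])
scaled, lo, hi = min_max_scale(train)

new_sample = np.array([[150.0, 25.0]])            # one unseen data point
new_scaled, _, _ = min_max_scale(new_sample, lo, hi)  # → [[0.25, 0.75]]
```

Because no dimensionality reduction is involved, the processed sample keeps the same 62-variable structure as the raw sample.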
The samples are set in this format mainly to match the NOx prediction model in this study. In addition, each sample contains multiple observations, so the model can learn richer information from them. The data set is divided into three subsets to train and test the NOx prediction model: the training subset consists of 51,790 samples covering the first three of the five consecutive days, the validation subset consists of 17,250 samples covering the fourth day, and the test subset consists of 17,250 samples covering the last day. These three subsets contain enough samples to cover the boiler's different operating conditions, making the NOx prediction model more generalizable.
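The sample formation and the chronological (day-wise) split can be sketched as below. The window length of 8 observations and the 60/20/20 proportions are illustrative assumptions only; the paper's actual window size is not stated in this excerpt.

```python
import numpy as np

def make_samples(Z, y, window=8):
    """Slice the scaled series Z (time x variables) into overlapping
    windows; each sample pairs a window with the NOx targets at its
    last time step. window=8 is a hypothetical choice."""
    X, targets = [], []
    for k in range(len(Z) - window + 1):
        X.append(Z[k:k + window])
        targets.append(y[k + window - 1])
    return np.array(X), np.array(targets)

# Toy series: 100 time steps, 62 variables, 2 targets (sides A and B).
rng = np.random.default_rng(0)
Z = rng.random((100, 62))
y = rng.random((100, 2))
X, t = make_samples(Z, y)          # X.shape == (93, 8, 62)

# Chronological split, no shuffling, mirroring the day-wise split above.
n = len(X)
train = X[: int(0.6 * n)]
val = X[int(0.6 * n): int(0.8 * n)]
test = X[int(0.8 * n):]
```

Splitting by time rather than at random keeps the test subset strictly after the training subset, as in the paper's day-wise division.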

| Brief description of CNN
In recent years, CNN has become the dominant algorithm in computer vision. For a two-dimensional image, CNN uses convolution to extract patches from its input along two spatial axes and applies different filters to these patches, producing an output.
We note that the first component z(k) in the sample (z(k), y(k)) in Section 2.2 can be considered a two-dimensional image: time is one axis and the multivariate observation at each time step is the other. This structure allows CNN-based models to learn local patterns in the samples. Figure 2 shows how the convolution operation works. First, a patch is extracted from the sample along the time axis using a receptive field of a predefined fixed size. Second, a scalar value is computed by taking the dot product of the extracted patch and a learned filter.
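The patch-and-dot-product operation of Figure 2 can be illustrated in plain NumPy. The filter values and the 3-step receptive field below are arbitrary; in a trained CNN the filter weights are learned.

```python
import numpy as np

def conv_along_time(sample, filt):
    """Slide a filter along the time axis of a (time x variables) sample.
    Each output value is the dot product of the filter with one patch."""
    t_len, n_vars = sample.shape
    f_len, f_vars = filt.shape
    assert f_vars == n_vars, "filter spans all variables"
    out = np.empty(t_len - f_len + 1)
    for k in range(len(out)):
        patch = sample[k:k + f_len]      # extract patch (receptive field)
        out[k] = np.sum(patch * filt)    # dot product with the filter
    return out

sample = np.arange(12, dtype=float).reshape(6, 2)  # 6 time steps, 2 variables
filt = np.ones((3, 2))                             # 3-step receptive field
result = conv_along_time(sample, filt)             # → [15. 27. 39. 51.]
```

A real CNN layer applies many such filters in parallel, producing one output channel per filter.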

| Architecture design of the CNN-based model
The building blocks used to build the NOx prediction model in this study were designed based on the practical guidelines proposed for lightweight CNN architecture design. 19 Figure 3A shows a schematic representation of the building block with stride 2. This building block has a standard parallel structure with two branches. In the right branch, an initial CNN reduces the input's dimensionality to improve the model's computational efficiency. Separable CNNs (SCNNs) in both branches are the central computational units, learning representations with less computational consumption. 22 A stride of 2 means that the input of the SCNN is downsampled by a factor of 2. A concatenation operation then fuses the learned representations of the two branches. Finally, the channel shuffle operation enables information communication between the two branches. 23 In this building block, the batch normalization (BN) operation and the rectified linear unit (ReLU) improve the output quality of the CNN and SCNN to accelerate model training.
Figure 3B shows a schematic representation of the building block with stride 1, which differs from the stride-2 block in two ways. First, a channel split operator divides the block's input into two distinct parts, which pass to two separate branches: the left branch keeps its part unchanged, while the right branch learns representations from the other part. Second, the stride of the SCNN equals 1, so the input of the SCNN is not downsampled. The other components serve the same purpose as in the stride-2 block.
Figure 4A shows a schematic representation of the CNN-based NOx prediction model in this study. The original input samples and two downsampled signals are passed to the subnetworks in the corresponding parallel branches.
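The channel shuffle and channel split operators used in these building blocks are simple tensor manipulations. A NumPy sketch follows (the group count of 2 matches the two branches; the shuffle is the reshape-transpose-flatten trick from ShuffleNet, which reference 23 describes):

```python
import numpy as np

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches can
    exchange information (reshape -> transpose -> flatten)."""
    n, h, w, c = x.shape
    x = x.reshape(n, h, w, groups, c // groups)
    x = x.transpose(0, 1, 2, 4, 3)
    return x.reshape(n, h, w, c)

def channel_split(x):
    """Split the input into two halves along the channel axis,
    one half per branch of the stride-1 building block."""
    c = x.shape[-1]
    return x[..., : c // 2], x[..., c // 2:]

x = np.arange(8).reshape(1, 1, 1, 8)   # 8 channels labeled 0..7
shuffled = channel_shuffle(x)          # channels become 0,4,1,5,2,6,3,7
left, right = channel_split(x)         # channels 0..3 and 4..7
```

Neither operator adds any learnable parameters, which is part of why these building blocks keep the model lightweight.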
Each subnetwork consists of a series of building blocks that learn representations at a different scale. Representations from the different subnetworks are fused by a concatenation operation and formed into a one-dimensional vector by a global average pooling operation. Finally, the one-dimensional vector is fed to two fully connected (FC) networks that give the predicted NOx emissions at sides A and B. Figure 4B shows a schematic representation of the subnetwork: a building block with stride 2 is set at the beginning, followed by a building block with stride 1 repeated three times, and a subsequent CNN adjusts the dimension of the subnetwork's output. The configuration of the subnetwork and the FC networks is listed in Table 2.

| NO x prediction results
Our model was implemented using Keras with the TensorFlow backend on a single NVIDIA GeForce RTX 2080 with 16 GB of memory. The GPU-accelerated library cuDNN was used to improve the efficiency of model training. The model was trained with the Adam optimizer and a batch size of 512, based on the root mean square error (RMSE) criterion, a widely used performance measure for prediction models defined as

RMSE = sqrt( (1/N) * Σ_{i=1}^{N} (y_i − ŷ_i)^2 ),

where N denotes the number of samples, y_i denotes the measured values, and ŷ_i denotes the corresponding predicted values.
We combined the early stopping strategy and the checkpoint procedure to reduce the adverse effects of overfitting on the model. The early stopping strategy terminates model training when the RMSE value on the validation set cannot be improved further. The model checkpoint procedure then saves the best model, that is, the one with the minimum RMSE on the validation set. To be consistent with the performance measure used during model training and validation, we used the RMSE criterion, the mean absolute error (MAE) criterion, and the R^2 score to evaluate the performance of our model on the test subset. The MAE criterion and the R^2 score are defined, respectively, as

MAE = (1/N) * Σ_{i=1}^{N} |y_i − ŷ_i|,

R^2 = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)^2 / Σ_{i=1}^{N} (y_i − ȳ)^2,

where N denotes the number of samples, y_i denotes the measured values, ŷ_i denotes the corresponding predicted values, and ȳ denotes the mean of the measured values. The model was run 30 times to evaluate its performance, yielding 30 RMSE values, 30 MAE values, and 30 R^2 scores, from which the maximum, minimum, mean, and standard deviation of each item were calculated. A summary of our model is shown in Table 3. These results indicate that our model can generally predict NOx emissions accurately. The minimum RMSEs on sides A and B were both obtained in the third run. Figure 5 compares the measured and predicted values of the third run on the test subset. The perfect lines in the figures represent predicted values equal to the measured values with no error. All predicted values follow a diagonal distribution along the perfect line, which means our model makes highly accurate predictions on the test set.
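The three evaluation criteria can be computed directly from the measured and predicted values; a minimal NumPy sketch with toy values (not the paper's data):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def r2_score(y, y_hat):
    """R^2: one minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([300.0, 310.0, 320.0, 330.0])      # measured NOx (toy values)
y_hat = np.array([302.0, 308.0, 322.0, 328.0])  # predicted NOx

# rmse(y, y_hat) → 2.0, mae(y, y_hat) → 2.0, r2_score(y, y_hat) → 0.968
```

Lower RMSE and MAE and an R^2 score closer to 1 all indicate a more accurate prediction model.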
To analyze the relative errors of the third run, the variations in the boiler load are shown in Figure 6, together with the relative errors on the test set. Almost all relative errors of the predicted values lie between −3% and 3%, which shows that we have established an accurate NOx prediction model. However, when the boiler load changes abruptly, the relative errors of the predicted values can be relatively large.

| Model comparisons and discussions
Some DL-based prediction models have achieved success in predicting NOx emissions from boilers. In this section, a comparative analysis of different DL-based NOx prediction models was conducted to further assess the proposed deep CNN-based model.

[Table 2: Configuration of the subnetwork and FC networks (shown in columns). Recoverable fragment: building block with stride 2 — left branch SCNN-3-48-2, right branch CNN-3-90-1.]

The outlines of these DL-based models are as follows: the LSTM-based model consisting of a single LSTM network, 14 the LSTM-based model composed of two stacked LSTM networks, 15 the DNN-based model, 18 and the CNN-based model. 19 Most of the hyperparameters used in these four models can be found in the original references. We retrained all models on our data set in the same training environment for a fair comparison. Each model was run 30 times, and the corresponding statistics are shown in Table 4. Figures 7 and 8 provide more details about each model at its minimum RMSE. Our model achieved the best results of all the models. The DNN-based model 18 and the CNN-based model 19 obtained better results than the LSTM-based models, but their overall performance remained inferior to ours. The two LSTM-based models 14,15 achieved similar results, but their overall performance was poor. These results demonstrate that our model can make accurate predictions without complex data preprocessing methods involving dimensionality reduction.
Although the optimal results of our model are very close to those of the CNN-based model, our model has fewer than half as many parameters. Consequently, our model is easier to train and less prone to overfitting than the CNN-based model. The main difference between the two models is their structure: our model has a parallel structure, while the CNN-based model has a cascade structure. This demonstrates the ability of a parallel structure to reduce model complexity while maintaining good model performance. In addition, models with a parallel structure can better exploit available hardware for parallel computing, which could improve the training efficiency of our model.
Figure 7 shows that the DNN-based model has the smallest number of parameters, though its prediction error is slightly higher than ours. Moreover, the DNN-based model has large standard deviations of RMSEs according to Table 4, which suggests a problem of numerical instability. In addition, as can be seen from Figure 8, some of its predicted values have relative errors exceeding the interval [−3%, 3%], which is worse than our model. As shown in Figure 7, the LSTM-based models achieved acceptable results with a greater number of parameters. However, as seen in Table 4, the maximum RMSEs and maximum MAEs of the LSTM-based models were significantly higher, and their minimum R^2 scores substantially lower, than those of the other models. These facts suggest that trained LSTM models often fail to give accurate predictions. Figure 8A,B shows that a large number of the LSTM-based models' predictions deviate noticeably from the measured values. One possible reason is that the min-max scaling method used in this study does not reduce the dimension of the samples in the data set.
Therefore, the input dimension of the LSTM model is much larger than that of the LSTM models in previous studies.
In addition, regarding simulation time, Table 5 shows that our model takes significantly less time to train than the LSTM-based model of Yang et al. 15 and the CNN-based model of Li and Hu. 19 Although our model takes more time to train than the LSTM-based model of Tan et al. 14 and the DNN-based model of Adams et al., 18 its prediction accuracy is better than both.