Intelligent wind turbine blade icing detection using supervisory control and data acquisition data and ensemble deep learning

Ice accretion on wind turbine blades is one of the major faults affecting the operational safety and power generation efficiency of wind turbines. Current icing detection methods rely on either a meteorological observation system or an additional condition monitoring system. Compared with these methods, icing detection using the intrinsic supervisory control and data acquisition (SCADA) data of wind turbines has several potential advantages, such as low cost, high stability, and early detection ability. However, this approach has not yet been investigated in depth. In this paper, a novel intelligent wind turbine blade icing detection method based on wind turbine SCADA data is proposed. The method consists of three processes: SCADA data preprocessing, automatic feature extraction, and ensemble icing detection model construction. Specifically, a deep autoencoders network is employed to adaptively learn multilevel fault features from the complex SCADA data, and the ensemble technique is used to exploit the features extracted from all hidden layers of the deep autoencoders network when building the ensemble icing detection model. The effectiveness of the proposed method is validated using data collected from actual wind farms. The experimental results show that the proposed method not only adaptively extracts valuable fault features from the complex SCADA data, but also achieves higher detection accuracy and generalization capability than conventional machine learning models and an individual deep learning model.


KEYWORDS
blade icing detection, deep autoencoders, ensemble learning, SCADA data, wind turbine

| INTRODUCTION
With the sustained growth of the renewable energy industry, global cumulative installed wind power capacity has grown considerably in recent years, and wind power has become a primary contributor to electricity generation. 1 To better harness wind energy, land-based wind turbines (WTs) are usually deployed in high-altitude regions where the climate is cold and humid. In this operating environment, however, WTs are vulnerable to blade icing, which can cause many problems. 2 On the one hand, the blade airfoil changes after ice accretion. This reduces the wind energy capture capability while more power is consumed to drive the blades, so the power generation efficiency declines. On the other hand, ice accretion can change the modal parameters at the corresponding regions on the blades. This may induce blade breakage and cause severe operational accidents. Once ice accretion has been detected, ice removal machinery should be triggered. Therefore, timely icing detection is of great importance for wind farms in improving both the power generation efficiency and the operational life of WTs.
Currently, there are two kinds of WT blade icing detection systems, namely systems based on meteorological observation and systems based on external icing condition monitoring. 3,4 Meteorological observation systems collect meteorological data and then predict the blade icing condition by analyzing the effect of atmospheric changes on the WT. 5 In external icing condition monitoring systems, specialized sensors are installed on the WT in addition to the standard configuration to measure the changes in physical properties of the blades caused by ice accretion, such as mass, conductivity, and dielectric constant. 2,6-10 However, this approach not only increases the mechanical complexity of the WT, but also incurs extra maintenance costs. Sensor degradation and faults considerably impair the signal accuracy and decrease the reliability of the icing detection system. Based on the above analysis, it is valuable and promising to develop WT blade icing detection approaches that offer timely detection, higher reliability, and lower cost.
As one basic constituent part of wind farms, supervisory control and data acquisition (SCADA) system could supply abundant environmental, electrical, and mechanical information about WT operating state. 11,12 Due to the convenient accessibility and sufficient amount, many researchers have started investigating condition monitoring and fault diagnosis of WT based on SCADA data. Kusiak et al 13 proposed a WT fault prediction model based on SCADA data using several data-mining algorithms, like artificial neural network (ANN) and support vector machine (SVM). This model could predict the fault within 5-60 minutes before the failure occurs. In subsequent studies, principal component analysis (PCA) was employed by Kusiak et al 14 to reduce the dimension of SCADA data and random forest (RF) was used afterward to identify early failures. Chen et al 15 proposed a WT SCADA system alarm processing and diagnosis method based on ANN for recognizing pitch system fault. Simulation results revealed that the proposed method could identify faults fast and reduce the false alarm rate. Besides the research on fault diagnosis and prognostic, performance degradation assessment of WTs based on SCADA data was also investigated. 16 Jia et al 17 presented a PC2-deviation method to assess the degradation state of WT based on PCA. Sun et al 18 proposed a generalized model for WT anomaly identification based on ANN and fuzzy synthetic evaluation, and a quantitative index was defined to quantify the WT abnormal level.
The above literature review indicates that condition monitoring and fault diagnosis of WTs based on SCADA data shows convincing performance. However, most current studies focus on the monitoring and diagnosis of the generator, gearbox, pitch system, etc. Research on WT blade icing detection using SCADA data is much scarcer. Recently, Ge et al 19 proposed a blade icing prediction model based on RF with 29 SCADA system features. Zhou et al 20 constructed features from SCADA data based on expert knowledge, including the residuals between actual and theoretical values of power, wind speed, and generator speed, and the nonprincipal component direction values of wind speed and power. All these features were used as the input of an SVM with particle swarm optimization to build the icing detection model. Zhang et al 21 selected wind speed and power as two basic features and constructed six additional features, such as average wind speed, average power, and degree of deviation of average power, and then used RF to identify the blade icing fault.
Although some papers have been published using SCADA data for WT blade icing detection, some deficiencies remain. Since SCADA parameters do not directly reflect the WT health state, feature extraction plays a very significant role when SCADA data are used to assess the WT health condition. According to the literature, current studies commonly extract fault features based on expert empirical knowledge before training the models. However, this manual feature extraction is labor-intensive and relies heavily on expert knowledge and skills. Imperfect expert knowledge hampers the comprehensive utilization of the SCADA data and leads to unrecoverable loss of valuable information. The handcrafted features may also not be sensitive enough to reflect changes in the WT health state. All these factors affect the performance and efficiency of the model constructed subsequently.
Recently, deep learning methods have been widely applied in many fields, such as speech recognition 22 and natural language processing, 23 and their performance has improved steadily. The main idea of deep learning is to adaptively learn the critical information from input data through multiple nonlinear transformations and to approximate complex nonlinear functions with small errors. Owing to the superior performance of deep learning in automatic feature extraction, deep neural networks (DNNs), especially deep autoencoders (DAEs) networks, have been widely applied in machine health monitoring and fault diagnosis in recent years. Wang et al 24 applied a DAE network to monitor WT blade breakage. Jia et al 25 mined useful information from machinery vibration signals using a DNN, which was employed to diagnose rotating machinery faults intelligently. A stacked denoising autoencoder network with one layer was used by Sun et al 26 to extract nonlinear features from multisensor data and achieved great performance in rotating machinery diagnosis. In the research on WT blade icing detection, Chen et al 27 proposed a DNN-based framework to learn discriminative deep feature representations from SCADA data. The triplet-header hinge loss was applied as the optimization goal of the model to preserve locality across operation stages and discrimination between different health conditions.
Despite the successful application of deep learning in unsupervised feature extraction, some flaws remain. In the deep architecture of a DNN, each hidden layer can be considered as one kind of feature space derived from the raw data. However, in most DNN models, researchers use only the last hidden layer as the input of the final predictor, and the value of the numerous intermediate hidden layers is not exploited. Besides deep learning, ensemble learning is also a commonly used machine learning technique in practice. The fundamental idea of ensemble learning is to combine many good but different base learners, using a certain combination strategy, to achieve better performance than a single learner. Inspired by ensemble learning, this paper proposes an intelligent WT blade icing detection method which integrates a family of DNN-based classifiers. The novelty and merits of the proposed method are summarized as follows: (a) The feature extraction process and the quality of the acquired features are improved by using deep learning to adaptively extract multiple levels of nonlinear features from the SCADA data, which lays a foundation for improving the diagnostic accuracy of the model. (b) The abstract features in all hidden layers of the DNN are used to train several base classifiers, which are then combined into the final icing detection model through ensemble learning. Therefore, compared with conventional machine learning models and an individual DNN-based model, the proposed ensemble icing detection model has better detection accuracy and generalization capability.
The rest of the paper is organized as follows. Section 2 describes the proposed approaches in detail. Section 3 applies the approaches to an actual case and discusses the results. Finally, the conclusions are given in Section 4.

| Method overview
The overview of the proposed approach is outlined in Figure 1. The data available for this research are collected from the SCADA system, including WT motion parameters (such as rotation angle of blade and generator speed), WT state parameters (such as nacelle temperature and nacelle acceleration), and environmental parameters (such as wind speed and environment temperature). The collected data are partitioned into training data and testing data. Training data are used to build the feature extraction model and icing detection model, while the models are evaluated on the testing data. For the training data, firstly, data preprocessing, including outlier elimination, data segmentation, data normalization, and data balancing, is applied. Secondly, useful features, which are related to the icing condition of the WT blade, are extracted based on deep learning. Thirdly, both the extracted features and the corresponding data sample labels are used to construct the icing detection model utilizing the ensemble technique. Thereafter, in the testing phase, the same data preprocessing process is utilized and features from the testing data are extracted for icing detection. Each sample is classified into either the "normal" or the "icing" class by the icing detection model.

FIGURE 1 Overview of the proposed icing detection method

| Data preprocessing
The quality of the dataset has strong influence on the final performance of the constructed model. In data preprocessing, the SCADA data are cleaned and transformed into a uniform data format that can be conveniently manipulated in the subsequent steps, including outlier elimination, data segmentation, data normalization, and data balancing.
Outliers mainly refer to abnormal data that deviate far from the majority. Outliers are typically generated by electrical noise, transmission failures, or other failures. Such data would severely disturb the model training process. In SCADA data, outlier identification is mainly guided by the WT control strategy. The common control strategy of WTs is variable speed pitch control, which aims at maximizing energy capture while keeping the WT within its steady-state limits. 28,29 There are three critical points in the wind speed and power curve which determine the operating state of the WT 30,31 : (a) cut-in wind speed, below which the turbine will not produce power; (b) rated speed, at which the turbine produces rated power and below which the generator speed varies with the wind speed and the pitch angle is kept at its optimum operation; and (c) cut-out wind speed, beyond which the turbine is not allowed to deliver power since further increase in the wind speed may damage the rotor. If a data point does not follow the aforementioned rules, it can be treated as an outlier and removed. However, it is worth noting that faults of mechanical or electrical components in the WT may also cause outliers. Therefore, outlier elimination is not conducted on "icing" class data points, to avoid losing fault information.
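The rule-based elimination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout (`wind_speed`, `power`, `label`), the threshold values, and the tolerance `power_tol` are all hypothetical, and only the cut-in and cut-out rules stated in the text are checked.

```python
def violates_power_curve(wind_speed, power, cut_in, cut_out, power_tol=0.0):
    """True if power is delivered below cut-in or beyond cut-out wind speed."""
    outside_operating_range = wind_speed < cut_in or wind_speed > cut_out
    return outside_operating_range and power > power_tol


def eliminate_outliers(samples, cut_in, cut_out, power_tol=0.0):
    """Drop rule-violating samples, but never drop "icing" samples."""
    kept = []
    for sample in samples:
        if sample["label"] == "icing":
            # Fault-related samples are kept so that no fault information is lost.
            kept.append(sample)
        elif not violates_power_curve(
            sample["wind_speed"], sample["power"], cut_in, cut_out, power_tol
        ):
            kept.append(sample)
    return kept


# Hypothetical records and thresholds (units are arbitrary here).
samples = [
    {"wind_speed": 2.0, "power": 50.0, "label": "normal"},   # power below cut-in
    {"wind_speed": 8.0, "power": 500.0, "label": "normal"},  # plausible point
    {"wind_speed": 2.0, "power": 50.0, "label": "icing"},    # kept regardless
    {"wind_speed": 30.0, "power": 10.0, "label": "normal"},  # power beyond cut-out
]
cleaned = eliminate_outliers(samples, cut_in=3.0, cut_out=25.0)
```

A fuller rule set would also check the region between cut-in and rated speed against the expected power curve, which requires turbine-specific parameters not given in the text.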
After outlier elimination, data segmentation 32 is applied to divide the continuous data into a series of frames by shifting a sampling window with a specific frame length L and sliding step s, as illustrated in Figure 2. In each frame, several statistical representations, namely mean, standard deviation, skewness, median, upper quartile, lower quartile, maximum, minimum, peak to peak, and trend value, are extracted. These primary features, which roughly describe the behavior of the raw data, replace the raw SCADA data in the subsequent deep feature extraction, with the aim of improving the model performance. The mathematical expressions of these statistical features are shown in Table 1.
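The segmentation and per-frame statistics can be sketched as below. This is an illustrative sketch under assumptions: the population standard deviation, the moment-based skewness, and the default quartile method of Python's `statistics.quantiles` are choices made here and are not confirmed by the text; the trend value is left out because its formula is not available.

```python
import statistics


def segment(signal, frame_length, step):
    """Split a sequence into frames of length L taken every s samples."""
    last_start = len(signal) - frame_length
    return [signal[i:i + frame_length] for i in range(0, last_start + 1, step)]


def frame_features(frame):
    """Statistical summary of one frame (needs at least two samples)."""
    mean = statistics.fmean(frame)
    std = statistics.pstdev(frame)  # population standard deviation (assumption)
    lower_q, median, upper_q = statistics.quantiles(frame, n=4)
    third_moment = statistics.fmean([(x - mean) ** 3 for x in frame])
    skewness = third_moment / std ** 3 if std > 0 else 0.0
    return {
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "median": median,
        "upper_quartile": upper_q,
        "lower_quartile": lower_q,
        "maximum": max(frame),
        "minimum": min(frame),
        "peak_to_peak": max(frame) - min(frame),
    }


frames = segment([1, 2, 3, 4, 5, 6], frame_length=4, step=2)
features = [frame_features(f) for f in frames]
```

With a step smaller than the frame length, consecutive frames overlap, so each raw sample contributes to several feature vectors.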
Thereafter, data normalization is conducted to eliminate the large differences in magnitude between physical features without changing their distributions and interdependence. This can also speed up the learning of the subsequent algorithms. In this study, the following formula was used:

$$x_{\mathrm{norm}} = \frac{(y_{\max} - y_{\min})(x - x_{\min})}{x_{\max} - x_{\min}} + y_{\min}$$

where $x$ and $x_{\mathrm{norm}}$ correspond to the data before and after normalization, respectively, $x_{\max}$ and $x_{\min}$ correspond to the maximum and minimum values of the raw data, and $y_{\max}$ and $y_{\min}$ are the maximum and minimum values of the target range for the raw data. The commonly used target range is [0, 1] or [−1, 1]. When the target range is defined as [0, 1], the above formula becomes min-max normalization.
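The linear rescaling described above, which maps [x_min, x_max] onto the target range [y_min, y_max], can be sketched as follows. The function names and the handling of a constant feature (mapped to the middle of the target range) are assumptions made for this illustration.

```python
def fit_min_max(values):
    """Return (x_min, x_max) of the raw data, e.g. from the training set."""
    return min(values), max(values)


def normalize(x, x_min, x_max, y_min=-1.0, y_max=1.0):
    """Map x linearly from [x_min, x_max] to [y_min, y_max]."""
    if x_max == x_min:
        # Constant feature: no spread to rescale (assumed convention).
        return (y_min + y_max) / 2.0
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min


raw = [0.0, 5.0, 10.0]
x_min, x_max = fit_min_max(raw)
scaled = [normalize(x, x_min, x_max) for x in raw]            # target [-1, 1]
scaled_01 = [normalize(x, x_min, x_max, 0.0, 1.0) for x in raw]  # min-max
```

Keeping the fitted `x_min` and `x_max` separate from the mapping makes it possible to apply the same transformation to the testing data.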
The last step of data preprocessing is data balancing. For WTs, the amount of "icing" class data is much smaller than that of "normal" class data. When trained on class-unbalanced data, a model will be strongly biased toward the majority class and will easily misclassify the minority class, because most classifiers assume a relatively balanced class distribution. 33,34 In this study, the synthetic minority oversampling technique (SMOTE), 35,36 which is an oversampling technique, is utilized to solve this problem. According to the preset sampling rate, synthetic samples are generated by SMOTE from the minority class. The artificial samples created by SMOTE are linear combinations of two similar samples from the minority class. After data balancing, the final preprocessed data are utilized as the input of the subsequent models for deep feature extraction and icing detection.
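The interpolation idea behind SMOTE can be sketched as below. This is a simplified illustration rather than the reference algorithm: base samples are drawn at random instead of being visited in turn, and the neighbor count `k`, the seed, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself).
        others = [m for m in minority if m is not base]
        neighbors = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


icing = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
new_samples = smote(icing, n_synthetic=10, k=2)
```

Each synthetic sample lies on the line segment between two existing minority samples, which is the "linear combination of two similar samples" mentioned above.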

| Feature extraction using deep autoencoders
Owing to the complexity of the SCADA data and its obscure relation with the blade icing condition, an adaptive feature learning strategy is employed instead of manual feature extraction and selection. Deep learning refers to a family of DNNs that employ deep structures to extract hierarchical features from input data using multiple nonlinear transformations. 37,38 These highly abstract and nonlinear features help the subsequent fault pattern recognition process achieve better performance.
Deep autoencoders (DAEs) are one common kind of DNN that can be used for adaptive feature learning. In DAEs, a three-layered unsupervised neural network called an autoencoder is employed as the basic network structure. An autoencoder is trained to copy its input to its output. 25 It therefore contains two symmetric networks, namely an encoder network and a decoder network. First, the encoder network transforms an input $x$ into a hidden representation $h$ through the activation function as follows:

$$h = f_s(Wx + b)$$

where $f_s(\cdot)$ is the activation function. In this paper, the hyperbolic tangent function (tanh) was selected as the activation function. Then, the decoder network maps the hidden representation $h$ back to the reconstructed data $\hat{x}$ in a similar way as follows:

$$\hat{x} = f_s(W'h + b')$$

The network parameters $\theta = \{W, b, W', b'\}$, comprising the weight and bias terms, are optimized to minimize the reconstruction error between the reconstructed data $\hat{x}$ and the original data $x$. The reconstruction error can be measured, for example, by the squared error:

$$L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2$$

It has been shown that a denoising autoencoder, which is trained to recover the original data from corrupted versions of it and is thereby prevented from simply learning the identity mapping, is more effective than the conventional autoencoder at discovering robust features. Therefore, a corruption process $q_D(\tilde{x} \mid x)$, which represents a conditional distribution over corrupted samples $\tilde{x}$, is applied to the original input $x$ during training. Then, through the encoder and decoder networks, more robust and powerful representations can be learned by reconstructing the raw data from the corrupted data, and the sensitivity of the autoencoder to small random disturbances is reduced. The architecture of the denoising autoencoder is shown in Figure 3.

TABLE 1 Statistical features extracted in each frame: (1) mean, (2) standard deviation, (3) skewness, (4) median, (5) upper quartile, (6) lower quartile, (7) maximum, (8) minimum, (9) peak-to-peak value, $x_{\text{peak to peak}} = \max(X) - \min(X)$, and (10) trend value (integral value).

As a hierarchical model, the core of DAEs is that multiple autoencoders can be stacked in the deep architecture presented in Figure 4, with the aim of finding highly nonlinear and complex patterns in the data. This training process, which is called layer-wise pretraining, is task-free and mainly focuses on hierarchical representation learning from unlabeled data in an unsupervised manner. The last code vector $h_n$ retains much of the intrinsic information of the original data, while its dimension is much lower than that of the original data. The last code vector can then be used for subsequent classification.
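The encoder, decoder, corruption process, and reconstruction error described above can be sketched as a single forward pass. This is only an illustration under assumptions: masking noise (randomly zeroing inputs) is one possible corruption process and the squared error is one possible reconstruction loss, neither of which is confirmed by the text; training by gradient descent is omitted, and all names are hypothetical.

```python
import math
import random


def affine_tanh(weights, bias, vector):
    """Compute tanh(W v + b) for a weight matrix given as a list of rows."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def corrupt(x, drop_prob, rng):
    """Masking noise: set each input to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def dae_forward(x, W, b, W_prime, b_prime, drop_prob, rng):
    """Return (h, x_hat, loss) for one denoising autoencoder."""
    x_tilde = corrupt(x, drop_prob, rng)           # corrupted input
    h = affine_tanh(W, b, x_tilde)                 # encoder
    x_hat = affine_tanh(W_prime, b_prime, h)       # decoder
    # The loss compares the reconstruction with the clean input x.
    loss = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return h, x_hat, loss


def stacked_codes(x, encoders):
    """Pass x through stacked encoders and collect every hidden code."""
    codes = []
    current = x
    for W, b in encoders:
        current = affine_tanh(W, b, current)
        codes.append(current)
    return codes


rng = random.Random(0)
x = [0.5, -0.2, 0.1]
W = [[0.1, -0.3, 0.2], [0.4, 0.0, -0.1]]        # 3 inputs -> 2 hidden units
W_prime = [[0.2, 0.1], [-0.3, 0.5], [0.0, 0.4]]  # 2 hidden units -> 3 outputs
h, x_hat, loss = dae_forward(x, W, [0.0, 0.0], W_prime, [0.0] * 3, 0.3, rng)
codes = stacked_codes(x, [(W, [0.0, 0.0]), ([[0.7, -0.2]], [0.0])])
```

`stacked_codes` returns one code vector per hidden layer, which is the set of multilevel features that a stacked network makes available after pretraining.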

| Ensemble detection model construction
Generally, by combining DAEs with a classifier, the DNN can be utilized as a discriminative model for classification. After all the autoencoders have been pretrained, the weights can be slightly adjusted by the back-propagation algorithm for fine tuning on a small well-labeled dataset. Considering that each hidden layer of DAEs can be regarded as a different level of the feature space abstracting the behavior of the raw data, 39 an ensemble classification technique can be developed to obtain better prediction performance.
Following this idea, the framework of the proposed ensemble deep autoencoders (En-DAEs) icing detection model is shown in Figure 5. After the stacked autoencoders have been trained, N different levels of abstract features of the SCADA data are obtained. Then, N base classifiers are generated by connecting each hidden layer with a classifying layer. The back-propagation (BP) algorithm is employed to update the weights in each hidden layer of the corresponding base classifier. To benefit from the ensemble learning strategy, the base learners should meet two requirements, namely good accuracy and considerable diversity. These requirements are satisfied in this model, since the input features of each base classifier come from different layers of the DAEs network and the parameters of all base classifiers are fine tuned to enhance their performance. Thereafter, the final model output is determined by a plurality vote over the outcomes of all the base classifiers, which gives the icing detection model higher prediction accuracy and better generalization performance.

FIGURE 5 The framework of ensemble detection model
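The plurality vote over the base classifier outputs can be sketched as follows. The tie-breaking rule is an assumption added for this illustration (the text does not say how ties are resolved); here ties are broken in favor of a configurable label.

```python
from collections import Counter


def plurality_vote(labels, tie_break="icing"):
    """Return the label predicted by the most base classifiers."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    # Tie: fall back to the preferred label if it is among the winners.
    return tie_break if tie_break in winners else winners[0]


# Hypothetical outputs of six base classifiers for one sample.
base_outputs = ["icing", "normal", "icing", "icing", "normal", "icing"]
decision = plurality_vote(base_outputs)
```

With an even number of base classifiers and two classes, ties are possible, so some explicit rule of this kind is needed in practice.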

| Evaluation metrics
For the model performance evaluation, a confusion matrix was utilized to determine the detection capability as shown in Table 2.
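According to the confusion matrix, two evaluation metrics can be derived:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix, respectively. Accuracy is a commonly used classification evaluation metric. However, the main concern for a class-unbalanced dataset is whether the minority class can be predicted correctly. Therefore, the Matthews correlation coefficient (MCC) is also employed. The MCC is the correlation coefficient between the real labels and the predicted labels in binary classification and takes values in [−1, 1]. The closer the value of MCC is to 1, the more accurate the model prediction is.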
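Both metrics can be computed directly from the confusion matrix counts, as sketched below. Treating "icing" as the positive class and returning 0 when the MCC denominator vanishes are conventions assumed for this illustration.

```python
import math


def confusion_counts(y_true, y_pred, positive="icing"):
    """Return (tp, tn, fp, fn) with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


y_true = ["icing", "icing", "normal", "normal", "normal", "normal"]
y_pred = ["icing", "normal", "normal", "normal", "normal", "icing"]
counts = confusion_counts(y_true, y_pred)
acc, coefficient = accuracy(*counts), mcc(*counts)

# A model that always predicts "normal" on data with 8% icing samples.
trivial_acc = accuracy(tp=0, tn=92, fp=0, fn=8)
trivial_mcc = mcc(tp=0, tn=92, fp=0, fn=8)
```

The last two lines illustrate why accuracy alone is misleading on unbalanced data: the trivial model reaches an accuracy of 0.92 while its MCC is 0.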

| Case introduction
The experimental data come from the first Industrial Big Data Innovation Competition in China and contain the SCADA system data of two WTs from one wind farm in north China. With the SCADA system, real-time environmental data, WT operating data, and fault data can be collected. The blade icing detection system in the SCADA system gives an alarm when the ice accretion reaches a certain extent, and the deicing system is then started. Unfortunately, the icing detection system does not work well in practice, since it raises an alarm only after the ice accretion has become very serious.
The supplied SCADA data of WT in this contest include 26 attributes, which can be roughly classified into environmental data, WT motion data, and WT state data, as shown in Table 3. The sampling interval of the SCADA system is 7 seconds, and the time span of this experimental dataset is 1 month. In addition, the "icing" or "normal" labels of the data sample have been tagged by the organizer based on the fault records. Figure 6 displays the time series of part of these attributes. All the supplied data have been encrypted by the contest organizer, and therefore, the magnitudes of these attributes are different from their values in real condition.

| Approaches implementation
To evaluate the performance of the proposed approaches, the SCADA data and icing conditions from two WTs, namely WT A and WT B, are utilized. In practice, historical data are usually available for older WTs, while the operating state of new WTs has to be identified. To reflect this situation, one WT is used as the training dataset and the other as the testing dataset in this paper. Accordingly, the dataset of WT A is chosen as the training data, which consists of 186 626 samples, 5% of which belong to the icing class. The dataset of WT B is chosen as the testing data, which consists of 187 521 samples, 8% of which belong to the icing class. Both datasets are therefore highly class unbalanced.
In each dataset, the SCADA data first went through data cleaning. According to Section 2.2, data points which obviously did not follow the control strategy of the WT are shown in Figure 7. These data are categorized as outliers and are therefore eliminated. It can be seen from Figure 6 that the rotation angle, Ng5 temperature, and Ng5 direct current of the 3 blades and pitch systems have very similar trends. Further calculation indicates that they are highly correlated. This arises from the symmetrical mechanical structure of the WT. Therefore, the mean and standard deviation of these signals are calculated and taken as new attributes. There is also a strong correlation between the environmental temperature and the nacelle temperature. Therefore, their difference is also calculated as a new attribute. Thereafter, all the attributes went through segmentation. In each segment, a set of statistical features is extracted according to Section 2.2. Then, all the features are normalized. The target range of normalization is selected as [−1, 1] because the output of the tanh activation function lies between −1 and 1. For the training data, the "icing" class samples among these feature vectors were oversampled using SMOTE.
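The construction of the new attributes can be sketched as follows. The field names are hypothetical (the actual attribute names are listed in Table 3), and the use of the population standard deviation, the sign of the temperature difference, and keeping the original per-blade attributes alongside the new ones are assumptions of this illustration.

```python
import statistics


def blade_summary(values):
    """Mean and standard deviation of one signal across the three blades."""
    return statistics.fmean(values), statistics.pstdev(values)


def add_derived_attributes(record):
    """Return a copy of the record extended with the derived attributes."""
    out = dict(record)
    for name in ("pitch_angle", "ng5_temp", "ng5_dc"):
        per_blade = [record[f"{name}_{i}"] for i in (1, 2, 3)]
        out[f"{name}_mean"], out[f"{name}_std"] = blade_summary(per_blade)
    # Difference between environmental and nacelle temperature.
    out["temp_diff"] = record["environment_temp"] - record["nacelle_temp"]
    return out


record = {
    "pitch_angle_1": 1.0, "pitch_angle_2": 2.0, "pitch_angle_3": 3.0,
    "ng5_temp_1": 20.0, "ng5_temp_2": 20.0, "ng5_temp_3": 20.0,
    "ng5_dc_1": 0.5, "ng5_dc_2": 0.7, "ng5_dc_3": 0.6,
    "environment_temp": -3.0, "nacelle_temp": 5.0,
}
extended = add_derived_attributes(record)
```

Summarizing the three highly correlated per-blade signals by their mean and spread keeps the shared trend while exposing any asymmetry between blades.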
After data preprocessing, the balanced training data were used to train the proposed En-DAEs model. Several parameters of the En-DAEs model have to be determined, namely the number of hidden layers and the number of units in each hidden layer, so the selection of these parameters is investigated first. In the parameter selection procedure, 5-fold cross-validation on the training dataset was conducted. Accuracy was selected as the model evaluation metric since the dataset had been balanced. The average result of the 5-fold cross-validation was taken as the final performance of the model. The parameter selection experiments were conducted with the number of hidden layers set to 3, 4, 5, and 6, and three different strategies were applied to decide the unit numbers of the hidden layers, following previous work. 24,25,32 The number of input layer units is determined by the dimension of the input features, which is 230 in this research, and the number of output layer units is 2, since the WT blade has two states, "icing" and "normal." In the ensemble icing detection model, the softmax function is employed as the classifier and is attached on top of each hidden layer. During the training process of the ensemble icing detection model, the weights of the DNNs are initialized randomly and the biases are initialized to zero. After several comparison tests, the maximum number of training epochs in the feature extraction stage is set to 1000 with a learning rate of 0.01, and the number of epochs in the ensemble model construction stage is set to 200 with a learning rate of 0.01. The final detection results were determined by plurality vote.
The final results of parameter selection are shown in Table 4. It can be seen that the best performance is obtained when the network structure is selected as [173-115-59-30-16-10]. Therefore, an eight-layer DNN (input layer, six hidden layers, and output layer) with the unit numbers of the hidden layers set as [173-115-59-30-16-10] is applied in this paper.
The proposed method is written in MATLAB R2017a and run on Windows 10 x64 with the Core-7300 CPU and 8G

| Results comparisons and discussion
To evaluate the performance of the proposed icing detection model, elaborate tests and comparisons were conducted. The tests and corresponding purposes are listed in Table 5.

| Model stability test
To evaluate the stability of the proposed En-DAEs model, 5-fold cross-validation was conducted to test the performance of the model on different combinations of training data and validation data. Figure 8 shows the 5-fold cross-validation results of the proposed En-DAEs model on the training dataset. It can be seen from Figure 8 that all the accuracy values exceed 0.98 and that the differences between folds are small. This result reveals that the proposed model is stable across different combinations of training data and validation data on the training dataset.

| Accuracy contrast test
To compare the performance of the proposed model, four other machine learning models, namely a shallow neural network (NN), DNN, SVM, and RF, were selected. According to the proposed network structure, there are six base classifiers in the En-DAEs network. Among these, the 1st base classifier (BC1), which is composed of the first hidden layer and its classification layer, can be regarded as a shallow NN model. The 6th base classifier (BC6) is composed of all six hidden layers and its classification layer. Therefore, BC6 can be regarded as a DNN model with 6 hidden layers. For the sake of objectivity, the parameters of both the SVM and RF models were optimized through experiments, including the penalty factor and the parameter of the RBF function for SVM and the number of trees and the maximum depth for RF. All the above models, including En-DAEs, BC6, BC1, SVM, and RF, were tested on the training dataset (WT A). The average result of 5-fold cross-validation was taken as the final accuracy of each model. Figure 9 shows the average accuracy of the different icing detection models on the training dataset. Among the variants of the proposed method, the En-DAEs model achieves the highest accuracy, BC6 performs slightly worse, and BC1 performs worse still. This may be because BC1 is only a shallow neural network, whereas BC6 is a DNN and the ensemble model effectively synthesizes the results of all base classifiers. The performance of RF is similar to that of BC1, and SVM has the lowest accuracy among all the models. The above results reveal that the deep network and the ensemble strategy both help to improve the performance of the model.

TABLE 4 Results of parameter selection (columns: number of hidden layers, unit numbers of hidden layers, accuracy)

| Generalization capability contrast test
To further verify the generalization capability of the proposed method, all the training data were used to train the different icing detection models, and the testing data were used to evaluate their performance. Considering that the testing dataset is class unbalanced, the MCC was employed in addition to accuracy. As shown in Figure 10, the accuracy of every model on the testing dataset is lower than its accuracy on the training dataset (Figure 9). One important reason is that the testing dataset is from WT B, while the training dataset is from WT A, and their data distributions differ considerably because of differences in mechanical, electrical, environmental, and control parameters. However, the En-DAEs model still achieves the highest accuracy and the smallest accuracy loss. The En-DAEs model also has the highest MCC. The MCC of RF is slightly lower than that of BC6, but its accuracy drops severely. Based on the above analysis, the detection performance and generalization capability of the proposed ensemble DAEs model are validated.

| CONCLUSIONS
An intelligent WT blade icing detection method based on SCADA data combining deep learning and ensemble learning is proposed in this paper. Deep learning was applied to automatically extract useful deep features from SCADA data. Ensemble learning was applied to improve model accuracy and generalization capability. The performance of the proposed method was verified using operating data of WTs from actual wind farms. Comparison results with other conventional machine learning methods, like SVM, RF, ANN, and individual DNN, reveal that the proposed method has higher detection accuracy and generalization capability.
Some conclusions could be summarized as follows: 1. Deep autoencoders is an effective unsupervised learning method to extract discriminative features adaptively from the SCADA data. After simple preprocessing of the original data, more abstract and useful features can