Data‐driven models for predicting remaining useful life of high‐speed shaft bearings in wind turbines using vibration signal analysis and sparrow search algorithm

Wind turbine bearings play a crucial role in ensuring the safe and efficient operation of wind turbines. Accurate estimation of the remaining useful life (RUL) of bearings can significantly reduce operating and maintenance costs. In this paper, we propose three advanced data‐driven models to predict the RUL of high‐speed shaft bearings in wind turbines. These models combine the sparrow search algorithm (SSA) with three different regression models, namely support vector machine, random forest (RF) regression and Gaussian process regression. The models are based on features extracted from the vibration signal analysis, and the features are selected based on their monotonicity to evaluate bearing degradation. To optimize the performance of the regression models, all model parameters are tuned using the SSA algorithm. The proposed models are validated using vibration data collected from a real 2 MW commercial wind turbine. Our results demonstrate that the proposed models are effective in predicting the RUL of wind turbine bearings, and the SSA algorithm improves the accuracy of the predictions.


| INTRODUCTION
Wind power is a renewable and clean source of energy, and with the ongoing climate crisis, many countries are turning to wind power generation facilities. The wind power industry has experienced significant growth in recent years, with wind turbines getting larger in size and installed at greater heights to capture stronger winds. According to data provided by the World Wind Energy Association (WWEA), 1 the installed wind power capacity reached 650.8 GW by the end of 2019. However, most of these turbines are located in remote mountainous areas or offshore, which poses significant challenges in transporting and hoisting the large components. The traditional method of waiting for system failure before replacing failed parts results in long downtime and puts enormous pressure on maintenance staff and spare-parts inventory, increasing operating costs and ultimately affecting the Levelized Cost of Energy (LCoE). 2 Operation and maintenance (O&M) is a significant portion of the total annual wind turbine cost, accounting for 20%-25% of the LCoE. 3,4 To address this issue, improving the reliability of key components and predicting failures through better maintenance strategies are currently the main ways to reduce O&M costs and lower the LCoE of wind power.
The growing size of wind turbines, their remote location, and the unpredictability of wind patterns necessitate efficient maintenance strategies. Maintenance strategies for wind turbines typically include corrective and proactive maintenance, the latter being further classified into scheduled and predictive maintenance. 5 Proactive maintenance is essential to prevent failures, and deciding when maintenance is required calls for knowledge of the current state of system degradation and an estimate of the Remaining Useful Life (RUL) of the system. RUL represents the useful life of an asset during a specific period of operation. Predicting it is a critical concept in Prognostics and Health Management (PHM) for ensuring high system availability over the life cycle, and it has gained considerable attention in recent years in addressing wind turbine challenges. 6 The data used in wind turbine RUL estimation can be categorized as recorded run-to-failure data and condition monitoring (CM) data.
CM data is particularly crucial for machines that cannot be allowed to run to failure. Any data associated with the degradation process, such as vibration, temperature, loading, rotating speed, current and voltage data, can theoretically be used to predict RUL. For example, Cheng et al. 7 predicted gearbox faults and RUL by extracting features from generator current signals, while Sivalingam et al. 8 used temperature variation signals to predict wind turbine power converter RUL. Hu et al. 9 showed the distribution of temperature characteristic parameters of critical bearings in a wind turbine, which could be used to predict RUL after processing rotor speed and bearing temperature data obtained through the SCADA system.
Currently, most literature focuses on predicting the remaining life of wind turbine gearbox bearings. Vibration analysis is the most commonly used technique for monitoring wind turbines and diagnosing bearing faults. Vibration signals carry essential dynamic information related to fault characteristics and can be analyzed using time domain, frequency domain and time-frequency methods. Saidi et al. 10 proposed a technique to predict the RUL of wind turbine high-speed shaft bearings by extracting time domain indices from the spectral kurtosis of vibration signals. Furthermore, Djeziri et al. 11 developed a physical model of wind turbines and collected data during normal and faulty operation for clustering. In their approach, RUL is calculated in three steps: (1) projecting data in the remaining space using principal component analysis of normal operating data, (2) projecting all data into this remaining space, resulting in normal and faulty clusters, and (3) calculating RUL as the ratio of Euclidean positions between the clusters and degradation velocity. Rommel et al. 12 estimated wind turbine remaining life by evaluating load profiles at blade roots, rotor hub centre, and tower head using physical models.
Due to the availability of data and the rise of AI techniques, data-driven models have also been applied in recent years to estimate the RUL of wind turbines. These include pattern recognition methods (e.g., exponential degradation, stochastic filtering-based, Markov deterioration and so on) and machine learning methods [e.g., artificial neural networks (ANN), support vector machines (SVM), Convolutional Neural Network (CNN) and so on]. Pan et al. 13 used Gaussian process (GP) and Bayesian inference methods to predict bearing failures 1 month in advance by abstracting the state of bearing temperature residuals. Teng et al. 14 used an artificial neural network to train data-driven models and then fitted a polynomial curve to reflect the long-term degradation process of bearings. Hu et al. 9 further established a performance degradation model using temperature characteristic parameters and a Wiener process. Cheng et al. 7 used an ANFIS-based PF method to predict RUL using the one-phase stator current of the generator. Nielsen et al. 15 calibrated a Markov deterioration model using past inspection data and updated it using real-time data. Rezamand et al. 16 proposed a real-time prediction approach for wind turbine bearings in varying operating conditions using vibration signals and an adaptive Bayesian algorithm.
Wind turbines are complex systems with many components and structures, making it difficult to build an accurate physics-based model for predicting failures. Wang et al. 17 proposed a hybrid prognostic method for the RUL of rolling element bearings, using relevance vector machine regression with different kernel parameters for sparse representation of the bearing regression data, and then employing an exponential degradation model combined with the Fréchet distance for adaptive estimation. Sun et al. 18 proposed a method for predicting the remaining service life of hybrid cutting tools based on empirical mode decomposition and back-propagation neural networks. SVMs have been shown to be effective in solving problems that consider structural risk and have better generalisation capabilities than traditional machine learning methods such as ANN. 19-22

1.1 | Novelty and contribution to knowledge
Wind turbines are known to experience high-speed shaft bearing failures that result in significant downtime and maintenance costs, ultimately leading to increased Levelized Cost of Energy (LCoE). To mitigate these costs, it is imperative to improve the reliability of key components or predict failures through better maintenance strategies. This study proposes a novel regression model-based approach for predicting the RUL of high-speed shaft bearings in wind turbines. We use a sparrow search algorithm (SSA) to optimize the regression model's parameters, thus improving the model's prediction accuracy. A series of time and frequency domain parameters are generated from the analysis of the vibration signal of a wind turbine high-speed shaft, and the regression model is trained after feature processing. The trained model is then used to predict the RUL of the high-speed shaft bearing. The effectiveness of the proposed method is confirmed through the analysis of vibration signals obtained from a real high-speed shaft bearing in a 2 MW wind turbine. To the best of our knowledge, this is the first study to apply the SSA to predict the RUL of wind turbine bearings. The proposed models can help wind farm operators optimize maintenance schedules and reduce downtime, ultimately leading to increased profitability and sustainability of wind energy production.

| DATA DESCRIPTION
This section describes the data used to construct the RUL model for the wind turbine high-speed shaft bearing. The technical framework for the model is presented in Figure 1, which includes data collection, processing, feature selection, machine learning model training and RUL estimation. The data set used in this study was collected from the high-speed shaft of a 2 MW wind turbine in the United States. 23 Specifically, 50 vibration signals, each 6 s in length, were recorded from 7 March to 25 April. These signals were collected by a monitoring system after an inner race fault caused the failure of a high-speed shaft bearing that was driven by a 20-tooth pinion gear. Figure 2 shows the 50 raw vibration signals that were used in the analysis.

| Preprocessing
The vibration signal carries the basic dynamic information related to the fault characteristics. Time and frequency domain analysis techniques reveal the underlying characteristics of the signal. In addition to the traditional statistical features in the time domain, such as root mean square and kurtosis, statistical features in the frequency domain are also widely used for predicting the RUL of bearings. Spectral kurtosis (SK), for example, is considered a powerful tool that has found successful applications in vibration-based CM. 24 It can indicate the presence of a series of transients and their locations in the frequency domain. For a signal $x(t)$, the SK value $K(f)$ can be calculated from the short-time Fourier transform (STFT) of the signal using the following equations. 25

$$S(t,f) = \int_{-\infty}^{+\infty} x(\tau)\, w(\tau - t)\, e^{-2\pi i f \tau}\, \mathrm{d}\tau$$

$$K(f) = \frac{\langle |S(t,f)|^{4} \rangle}{\langle |S(t,f)|^{2} \rangle^{2}} - 2 \qquad (1)$$

where $w(t)$ is the window function used in the STFT and $\langle \cdot \rangle$ denotes averaging over time. The SK is then itself treated as a signal, and its mean, standard deviation, skewness and kurtosis are calculated.
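The SK computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the STFT-based estimator $K(f) = \langle |S|^4 \rangle / \langle |S|^2 \rangle^2 - 2$, a Hann window, and NumPy being available; the function names `spectral_kurtosis` and `sk_features` and the window length and hop size are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def spectral_kurtosis(x, fs, nperseg=256, hop=128):
    """Estimate K(f) = <|S|^4> / <|S|^2>^2 - 2 from a Hann-windowed STFT."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, hop)
    # One row per time frame, one column per sample inside the frame.
    frames = np.array([x[s:s + nperseg] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    m2 = power.mean(axis=0)          # time average of |S|^2 per frequency
    m4 = (power ** 2).mean(axis=0)   # time average of |S|^4 per frequency
    sk = m4 / np.maximum(m2, 1e-30) ** 2 - 2.0
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, sk

def sk_features(sk):
    """Mean, standard deviation, skewness and kurtosis of the SK curve."""
    mu, sigma = sk.mean(), sk.std()
    z = (sk - mu) / sigma
    return {"SKMean": mu, "SKStd": sigma,
            "SKSkewness": (z ** 3).mean(), "SKKurtosis": (z ** 4).mean()}
```

For stationary Gaussian noise the estimate stays close to zero away from the first and last frequency bins, whereas repeated transients raise it in the frequency bands they excite, which is why the SK statistics are useful degradation features.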

| Feature selection
F I G U R E 1 Proposed framework for wind turbine high-speed shaft bearing RUL prediction. RUL, Remaining Useful Life.
PANDIT and XIE | 4559
The process of feature selection was carried out on the ensemble of 15 features extracted in the preceding section. The time and frequency domain features are listed in Table 1 together with their mathematical expressions.
To improve the prediction accuracy, features that did not contribute significantly to the RUL prediction model were filtered out. A feature selection algorithm evaluated each feature by its contribution to the prediction model, ranked the features by relevance and selected the most important ones for further analysis. Eliminating the less important features improved the prediction accuracy of the model. Figure 3 shows the evaluation of these features: some, such as SK kurtosis and SK skewness, were less effective predictors than anticipated, whereas kurtosis proved to be a particularly good marker of the overall degradation process.
To improve the accuracy of RUL estimation, it is important to carefully select and identify suitable indicators.Indicators that are suitable for RUL estimation should have certain characteristics, such as monotonicity, prognosability and trendability.
Monotonicity describes an indicator that consistently increases or decreases in value as the machinery degrades. A monotonic indicator exhibits a distinct and coherent degradation trend, which makes RUL predictions more accurate. This characteristic is formally defined by the following equation:

$$\mathrm{Monotonicity}(x) = \frac{1}{n-1}\left| \#\{k : x_{k+1} - x_k > 0\} - \#\{k : x_{k+1} - x_k < 0\} \right|$$

where $\#\{\cdot\}$ counts the positive or negative differences between consecutive measurements and n is the number of measurement points, in this case n = 50.
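As an illustration, the monotonicity score can be computed as the absolute difference between the numbers of positive and negative consecutive differences, divided by n − 1. The sketch below assumes that common form of the metric; `monotonicity` and `select_features` are hypothetical helper names, and the default threshold of 0.2 is the value quoted in this section.

```python
def monotonicity(values):
    """|#positive diffs - #negative diffs| / (n - 1) for one feature series."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurement points")
    diffs = [b - a for a, b in zip(values, values[1:])]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    return abs(n_pos - n_neg) / (n - 1)

def select_features(scores, threshold=0.2):
    """Keep the names of features whose monotonicity exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]
```

A strictly increasing or strictly decreasing series scores 1, while a series that alternates up and down scores 0, so the threshold separates trending indicators from noisy ones.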
F I G U R E 2 The vibration signals.
T A B L E 1 Relevant time and frequency domain features. 26

Features          Expression
Skewness
Crest factor      $x_{\max} / \mathrm{RMS}$
Shape factor
Impulse factor
Margin factor

The degradation of the bearings is an irreversible process, so monotonicity is chosen as the feature selection criterion in this study. Figure 4 shows the monotonicity metric of the 15 features obtained in the previous section. To obtain the best prediction results, features with a score larger than 0.2 are selected for the next step.
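For illustration, several of the time domain features named in Table 1 can be computed as follows. This is a sketch under stated assumptions: the peak is taken as the maximum absolute value, moments use the population (divide by n) convention, and the shape and impulse factors follow their usual textbook definitions (RMS and peak divided by the mean absolute value). The margin factor is omitted because its definition varies between sources. The function name `time_domain_features` is hypothetical.

```python
import math

def time_domain_features(x):
    """Common time-domain statistics of one vibration record."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)  # population std
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)                         # assumed x_max
    mean_abs = sum(abs(v) for v in x) / n
    return {
        "mean": mean,
        "std": std,
        "skewness": sum((v - mean) ** 3 for v in x) / n / std ** 3,
        "kurtosis": sum((v - mean) ** 4 for v in x) / n / std ** 4,
        "peak_to_peak": max(x) - min(x),
        "rms": rms,
        "crest_factor": peak / rms,
        "shape_factor": rms / mean_abs,
        "impulse_factor": peak / mean_abs,
    }
```

Applying the function to each of the 50 records gives one time series per feature, which is the input the monotonicity criterion operates on.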

| METHODS THEORETICAL DESCRIPTIONS
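This paper employs three distinct regression models to predict the RUL of high-speed shaft bearings in wind turbines: SVM regression, random forest regression (RFR) and Gaussian process regression (GPR). Furthermore, an SSA is implemented to fine-tune the model parameters during the training process. The following sections provide a detailed description of these regression models and the optimization algorithm.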

| SVM
SVM is a type of generalized linear classifier that uses supervised learning for binary data classification. 27 The decision boundary is the maximum-margin hyperplane over the learned samples. SVMs have gained popularity in engineering because they handle nonlinear mapping problems effectively and determine the final decision function from only a few support vectors. They are particularly useful when only a limited number of samples are available, and are therefore increasingly employed for predicting RUL. The principle of SVM regression is as follows. Given a sample set $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i$ is the input variable and $y_i$ is the corresponding output value, the regression function $f(x)$ is

$$f(x) = w^{\top} x + b$$

where $w$ is the weight vector, $w \in \mathbb{R}^{n}$; $b$ is the bias threshold, $b \in \mathbb{R}$. $w$ and $b$ can be obtained by solving the following optimization problem:

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} (\xi_i + \xi_i^{*})$$

$$\text{s.t.} \quad y_i - w^{\top} x_i - b \le \varepsilon + \xi_i, \quad w^{\top} x_i + b - y_i \le \varepsilon + \xi_i^{*}, \quad \xi_i,\ \xi_i^{*} \ge 0 \qquad (5)$$

where $C$ is the penalty factor, $\xi_i$ and $\xi_i^{*}$ are the relaxation factors and $\varepsilon$ is the insensitivity factor.
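When the data exhibit a nonlinear relationship, a kernel function is introduced into the SVM to map the original data into a high-dimensional space, transforming the nonlinear problem into a linear one. 28 The RBF kernel function is the most commonly used, and its expression is:

$$K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2p^{2}}\right)$$

where $p$ is the kernel function parameter. The Lagrangian function is introduced to transform the optimization problem into a convex quadratic programming problem 27,28:

$$\max_{a,\,a^{*}} \; -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(a_i - a_i^{*})(a_j - a_j^{*})K(x_i, x_j) - \varepsilon\sum_{i=1}^{m}(a_i + a_i^{*}) + \sum_{i=1}^{m} y_i (a_i - a_i^{*})$$

$$\text{s.t.} \quad \sum_{i=1}^{m}(a_i - a_i^{*}) = 0, \quad 0 \le a_i,\ a_i^{*} \le C$$

where $a$, $a^{*}$ are the Lagrange multipliers and $m$ is the number of samples.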
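The building blocks of kernel SVM regression can be illustrated with a short sketch. It assumes the RBF kernel is parameterised as $\exp(-\lVert x_i - x_j \rVert^2 / (2p^2))$ (other parameterisations are equally common) and that the dual solution is already known; `rbf_kernel`, `svr_predict` and `eps_insensitive_loss` are hypothetical helper names, and `dual_coefs` stands for the differences $a_i - a_i^{*}$.

```python
import math

def rbf_kernel(xi, xj, p):
    """RBF kernel exp(-||xi - xj||^2 / (2 p^2)); p is the assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * p * p))

def svr_predict(x, support_vectors, dual_coefs, b, p):
    """f(x) = sum_i (a_i - a_i*) K(x_i, x) + b for a solved dual problem."""
    return sum(c * rbf_kernel(sv, x, p)
               for sv, c in zip(support_vectors, dual_coefs)) + b

def eps_insensitive_loss(y, f, eps):
    """Errors inside the eps-tube cost nothing; outside they grow linearly."""
    return max(0.0, abs(y - f) - eps)
```

Only samples whose multipliers are non-zero (the support vectors) enter the prediction, which is why the method remains practical with small data sets such as the 50 records used here.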

| Random forest
The RFR model comprises several regression trees that are uncorrelated with one another. The output of the model is decided collectively by all the decision trees in the forest. The randomness of the random forest is manifested in two distinct ways 29:
○ The randomness of the samples, where a specific number of samples are randomly selected from the training set to serve as the root node samples of each regression tree.
○ The randomness of the features, where a specific number of candidate features are randomly chosen to construct each regression tree. The most appropriate features are then selected as split nodes.
The principle of the algorithm is described as follows. 30 First, randomly draw m sample points from the training sample set $S$ to obtain the new sub-training sets $S_1, \ldots, S_n$. Second, train a CART regression tree (decision tree) on each sub-training set; during training, the split rule for each node is to select k features at random from all features and then choose the optimal split point among these k features before dividing into left and right subtrees. Note that the decision trees obtained here are all binary trees. The second step generates many CART regression tree models, and the prediction of each CART regression tree for a sample point is the mean value of the leaf node reached by that sample point. The construction of regression trees in a random forest therefore rests on two random sampling procedures and a full splitting approach. Specifically, the random forest algorithm samples both the rows (samples) and the columns (features) of the input data. Rows are sampled with replacement, so the sampled subset may contain duplicates.
If there are N input samples, then N samples are drawn. Because each tree does not see all the samples during training, over-fitting is relatively easy to avoid. The features are then sampled by selecting $m$ of the $M$ features ($m \ll M$), and a regression tree is grown on the sampled data using a full split. Each regression tree is a specialist in a narrow field (because it learns from only m of the M features), so the random forest contains many experts proficient in different fields. A new problem (new input data) can thus be viewed from different perspectives, and the results of the individual experts are averaged. The base learner of a random forest is not a weak learner but a strong one: a decision tree grown to a very high depth.
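The two sources of randomness and the final averaging can be sketched as follows. This is only an illustration of the sampling steps, not a full random forest; the helper names `bootstrap_sample`, `random_feature_subset` and `forest_predict` are hypothetical, and the tree-growing step itself is left out.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (duplicates allowed)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, m, rng):
    """Choose m distinct feature (column) indices out of n_features."""
    return sorted(rng.sample(range(n_features), m))

def forest_predict(tree_predictions):
    """The forest output is the average of the individual tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)
```

Because rows are drawn with replacement, each tree typically sees only a subset of the distinct samples, which is the decorrelating effect described above.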
CART regression trees are based on the principle of minimum mean squared error (MSE). That is, for any feature A and any split point s that divides the data into the sets $D_1$ and $D_2$, find the feature and split point that minimise the squared error within each of $D_1$ and $D_2$ and, at the same time, the sum of the two. The expression is

$$\min_{A,\,s}\left[\min_{c_1}\sum_{x_i \in D_1(A,s)} (y_i - c_1)^{2} + \min_{c_2}\sum_{x_i \in D_2(A,s)} (y_i - c_2)^{2}\right]$$

where $c_1$ is the mean of the sample output for the $D_1$ data set and $c_2$ is the mean of the sample output for the $D_2$ data set. The prediction of the CART regression tree is based on the mean of the leaf nodes, so the prediction of the random forest is the average of the predicted values of all the trees. 29,30
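A minimal sketch of the split search for a single feature is given below. It assumes candidate split points are the midpoints between consecutive distinct feature values and scores each split by the summed squared error around the two subset means; `sse` and `best_split` are hypothetical names.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    c = sum(values) / len(values)
    return sum((v - c) ** 2 for v in values)

def best_split(xs, ys):
    """Return (split_point, cost) minimising SSE(D1) + SSE(D2) on one feature."""
    pairs = sorted(zip(xs, ys))
    best_point, best_cost = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # identical feature values cannot be separated
        s = (pairs[k - 1][0] + pairs[k][0]) / 2.0
        cost = sse([y for _, y in pairs[:k]]) + sse([y for _, y in pairs[k:]])
        if cost < best_cost:
            best_point, best_cost = s, cost
    return best_point, best_cost
```

A full tree repeats this search over the randomly chosen candidate features at every node and recurses on the two resulting subsets.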

| GPR
GPR is a powerful nonparametric model that performs regression analysis through a GP prior. The GPR model assumes noise (regression residuals) and a GP prior, and is solved through Bayesian inference. 31 Solving GPR, also referred to as hyper-parameter learning, involves determining the hyper-parameters of the kernel function from the training samples in accordance with the Bayesian approach. From Bayes' theorem, the hyper-parameter posterior of GPR is expressed in terms of the likelihood and the prior distribution of the hyper-parameters as follows 31:

$$p(\theta \mid X, y) = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$$

where $\theta$ denotes the hyper-parameters of the GPR, including the hyper-parameters of the kernel function and the variance of the residuals $\sigma_n^{2}$. $p(y \mid X, \theta)$ is the likelihood, which, for a non-parametric model, is the marginal likelihood obtained by marginalising the output $f$ of the GPR:

$$p(y \mid X, \theta) = \int p(y \mid f, X, \theta)\, p(f \mid X, \theta)\, \mathrm{d}f$$

When the residuals follow an iid normal distribution, the GPR likelihood is also normal (written here for a zero-mean GP prior, with $K$ the kernel matrix evaluated on the inputs $X$):

$$p(y \mid X, \theta) = \mathcal{N}\!\left(y \mid 0,\; K + \sigma_n^{2} I\right)$$

The optimal solution of GPR is commonly obtained by maximum likelihood estimation, which involves nonlinear optimization. 4
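The quantity maximised during hyper-parameter learning can be sketched as the log marginal likelihood of a zero-mean GP. The example assumes an RBF kernel with hypothetical hyper-parameter names `length_scale`, `signal_var` and `noise_var`, and uses NumPy's Cholesky factorisation for numerical stability; it is an illustration, not the authors' implementation.

```python
import numpy as np

def rbf_gram(X, length_scale, signal_var):
    """Kernel matrix K with an RBF kernel evaluated on the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq / length_scale ** 2)

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """log N(y | 0, K + noise_var * I) for a zero-mean GP prior."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Ky = rbf_gram(X, length_scale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                      # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))            # equals 0.5 * log|Ky|
            - 0.5 * n * np.log(2.0 * np.pi))
```

Any optimiser, including a swarm method such as the SSA, can then search over the three hyper-parameters by maximising this value (or minimising its negative).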

| SSA
The SSA is a swarm intelligence optimisation technique proposed in 2020, primarily motivated by the foraging and anti-predatory behaviour of sparrows. 32 In this study, SSA was employed to find the most suitable hyperparameters for the proposed machine learning models, namely SVM, GPR and RF. The primary goal was to fine-tune the hyperparameters so that the structure of each model matches the characteristics of the data, yielding RUL prediction models with high accuracy.
In the foraging behaviour of sparrows, there are two distinct roles, the finder and the follower. Finders usually have high energy reserves and are responsible for finding food-rich areas; when they find food, they provide foraging directions for the entire population. Followers forage based on the information provided by the finders. In SSA, finders with better fitness values have priority access to food during the search. During each iteration, the finder's position is updated as described below 33:

$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\!\left(\dfrac{-i}{\alpha \cdot iter_{\max}}\right), & R_2 < ST \\[2ex] X_{i,j}^{t} + Q \cdot L, & R_2 \ge ST \end{cases}$$

where $t$ is the current iteration and $iter_{\max}$ is a constant indicating the maximum number of iterations. $X_{i,j}$ represents the position of the i-th sparrow in the j-th dimension. $\alpha \in (0, 1]$ is a random number. $R_2 \in [0, 1]$ and $ST \in [0.5, 1]$ denote the warning value and the safety value, respectively. $Q$ is a random number that obeys a normal distribution. $L$ denotes a $1 \times d$ matrix in which every element is 1. When $R_2 < ST$, there are no predators around the foraging environment and the finder can perform an extensive search. When $R_2 \ge ST$, some sparrows in the population have spotted a predator and alerted the rest of the population, and all sparrows need to fly quickly to other safe places to forage. The follower's position is updated as follows:

$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\!\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\[2ex] X_{P}^{t+1} + \left| X_{i,j}^{t} - X_{P}^{t+1} \right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases}$$

where $X_P$ is the optimal position currently occupied by the finders and $X_{worst}$ denotes the current global worst position. $A$ denotes a $1 \times d$ matrix in which each element is randomly assigned a value of 1 or −1, and $A^{+} = A^{\top}(AA^{\top})^{-1}$. Here $n$ is the number of sparrows in the population. When $i > n/2$, the i-th follower has a low fitness value, is not getting food and is in a very hungry state, so it needs to fly elsewhere to forage for more energy. 32 When aware of danger, sparrow populations engage in anti-predatory behaviour, which is mathematically expressed as follows:

$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left| X_{i,j}^{t} - X_{best}^{t} \right|, & f_i > f_g \\[2ex] X_{i,j}^{t} + K \cdot \left( \dfrac{\left| X_{i,j}^{t} - X_{worst}^{t} \right|}{(f_i - f_w) + \varepsilon} \right), & f_i = f_g \end{cases}$$

where $X_{best}$ is the current global optimum. $\beta$, the step control parameter, is a random number drawn from a normal distribution with mean 0 and variance 1. $K \in [-1, 1]$ is a random number and $f_i$ is the current fitness value of the individual sparrow. $f_g$ and $f_w$ are the current global best and worst fitness values, respectively. $\varepsilon$ is a small constant that avoids a zero denominator.
When $f_i > f_g$, the sparrow is at the edge of the population and is extremely vulnerable to predators. $f_i = f_g$ means that sparrows in the middle of the population are aware of the danger and need to move closer to other sparrows to minimise their risk of predation. $K$ indicates the direction of movement of the sparrow and is also the step control parameter.
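The three update rules (finders, followers, and danger-aware sparrows) can be combined into a compact optimisation loop. The sketch below is one possible reading of the basic SSA for minimisation over box bounds, not the authors' implementation: the population size, the fractions of finders and scouts, the safety value `ST`, the clipping to the bounds and the use of a catch-all `else` branch for the case $f_i \le f_g$ are all assumptions, and `ssa_minimize` is a hypothetical name.

```python
import math
import random

def ssa_minimize(f, bounds, n_sparrows=20, max_iter=100,
                 finder_frac=0.2, scout_frac=0.1, ST=0.8, seed=0):
    """Minimise f over the box `bounds` with a basic sparrow search loop."""
    rng = random.Random(seed)
    d, n, eps = len(bounds), n_sparrows, 1e-12

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    X = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    fit = [f(x) for x in X]
    n_find = max(1, int(finder_frac * n))
    n_scout = max(1, int(scout_frac * n))
    k = min(range(n), key=fit.__getitem__)
    best_x, best_f = X[k][:], fit[k]

    for _ in range(max_iter):
        # Rank sparrows so that index 0 holds the lowest (best) fitness.
        order = sorted(range(n), key=fit.__getitem__)
        X = [X[i] for i in order]
        fit = [fit[i] for i in order]
        x_best, x_worst, f_g, f_w = X[0][:], X[-1][:], fit[0], fit[-1]

        # Finders: contract when safe (R2 < ST), otherwise take a random step.
        R2 = rng.random()
        for i in range(n_find):
            if R2 < ST:
                alpha = 1.0 - rng.random()  # random number in (0, 1]
                scale = math.exp(-(i + 1) / (alpha * max_iter))
                X[i] = clip([v * scale for v in X[i]])
            else:
                Q = rng.gauss(0.0, 1.0)
                X[i] = clip([v + Q for v in X[i]])
            fit[i] = f(X[i])
        x_p = X[min(range(n_find), key=fit.__getitem__)][:]

        # Followers: the hungriest half relocates, the rest crowd around x_p.
        for i in range(n_find, n):
            if i + 1 > n / 2:
                Q = rng.gauss(0.0, 1.0)
                X[i] = [Q * math.exp((w - v) / (i + 1) ** 2)
                        for v, w in zip(X[i], x_worst)]
            else:
                # A+ = A^T (A A^T)^-1 = A^T / d when A holds only +1 and -1.
                A = [rng.choice((-1, 1)) for _ in range(d)]
                step = sum(abs(v - c) * a for v, c, a in zip(X[i], x_p, A)) / d
                X[i] = [c + step for c in x_p]
            X[i] = clip(X[i])
            fit[i] = f(X[i])

        # Danger-aware sparrows (scouts), chosen at random.
        for i in rng.sample(range(n), n_scout):
            if fit[i] > f_g:
                beta = rng.gauss(0.0, 1.0)
                X[i] = [b + beta * abs(v - b) for v, b in zip(X[i], x_best)]
            else:
                K = rng.uniform(-1.0, 1.0)
                denom = (fit[i] - f_w) + eps
                if denom == 0.0:
                    denom = eps
                X[i] = [v + K * abs(v - w) / denom for v, w in zip(X[i], x_worst)]
            X[i] = clip(X[i])
            fit[i] = f(X[i])

        k = min(range(n), key=fit.__getitem__)
        if fit[k] < best_f:
            best_x, best_f = X[k][:], fit[k]
    return best_x, best_f
```

For hyperparameter tuning, `f` would map a candidate parameter vector (for example the SVM penalty factor and kernel parameter) to a validation error such as the MAE, and `bounds` would hold the search range of each parameter.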
The concept behind hyperparameter tuning is that adjusting certain settings or parameters of a machine learning model can significantly impact its performance and ability to make accurate predictions.The SSA, in this case, served as the mechanism for systematically exploring and determining the optimal combination of hyperparameters for the SVM, GP and RF models.
To achieve improved training outcomes, the training set should be drawn as randomly as possible while ensuring that it covers all phases of the degradation process. Furthermore, the model parameters are optimized using an SSA, and the parameter settings of the SSA itself were obtained after several debugging iterations. 30,31 In this study, two splits were used, with 80% and 70% of the available data selected as the training set, and the results are elaborated upon in the subsequent section.
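One possible way to draw a random training set that still covers every phase of the degradation process is to split the time-ordered records into consecutive blocks and sample the same fraction from each block. This is an assumed procedure for illustration, not necessarily the one used by the authors; `phase_stratified_split` and the number of phases `n_phases` are hypothetical.

```python
import random

def phase_stratified_split(n_samples, train_frac, n_phases=5, seed=0):
    """Randomly split time-ordered sample indices, phase by phase."""
    rng = random.Random(seed)
    edges = [round(k * n_samples / n_phases) for k in range(n_phases + 1)]
    train, test = [], []
    for lo, hi in zip(edges, edges[1:]):
        block = list(range(lo, hi))      # one consecutive degradation phase
        rng.shuffle(block)
        cut = round(train_frac * len(block))
        train += block[:cut]
        test += block[cut:]
    return sorted(train), sorted(test)
```

With 50 records and five phases, a training fraction of 0.8 yields 40 training and 10 test records, and a fraction of 0.7 yields 35 and 15.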

| RESULT AND DISCUSSIONS
Mean Absolute Error (MAE) is used to evaluate the training process. MAE is the average of the absolute error between the true value and the predicted value; it reflects the actual magnitude of the prediction error and serves as an evaluation metric for the trained model's performance. Its expression is:

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m} \left| y_i - p(y_i) \right|$$

where $m$ is the number of training samples, $y_i$ is the expected RUL of the i-th sample and $p(y_i)$ is the predicted RUL of the i-th sample. The final training results are shown in Table 2.
Here, MAE is not a criterion for comparing the three models with one another. It is only used to compare training runs of the same model: the smaller the MAE, the better the training result. Since the GPR model is different from the other two, its training results are not evaluated here. After selecting the best training results, the three trained regression models were tested using the same test set to compare the predicted remaining lifespan with the actual lifespan. The results are shown below. Figure 5 shows the prediction results of the model trained with 70% of the data as the training set, and the following three figures show the prediction results of the model trained with 80% of the data as the training set.
Figure 5 demonstrates that although each model deviates somewhat from the true RUL curve, they can all forecast the remaining life to some extent. To assess each regression model's performance more accurately, we use four metrics: MSE, RMSE, MAE and $R^2$. MAE has already been defined in the evaluation of the model training results; the other three metrics are defined below.
MSE is a popular error metric for regression models that measures the variance of the residuals. It represents the mean of the squared differences between the original and predicted values in the data set and is calculated as:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} \left( y_i - p(y_i) \right)^{2}$$

where $m$ is the number of samples, $y_i$ is the expected RUL of the i-th sample and $p(y_i)$ is the predicted RUL of the i-th sample.
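Root Mean Squared Error (RMSE) is the square root of the MSE. It measures the standard deviation of the residuals and is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left( y_i - p(y_i) \right)^{2}}$$

The R-squared ($R^2$), also known as the coefficient of determination, is the square of the correlation coefficient R. It is an important statistical parameter in a linear regression model. It indicates how well the regression model fits the actual values, with $R^2$ close to 1 indicating a better fit, and is expressed as:

$$R^{2} = 1 - \frac{\sum_{i=1}^{m} \left( y_i - p(y_i) \right)^{2}}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^{2}}$$

where $y_i$ is the expected RUL of the i-th sample, $p(y_i)$ is the predicted RUL of the i-th sample and $\bar{y}$ is the mean value of $y$.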
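The four evaluation metrics can be computed together from the expected and predicted RUL values. The sketch below assumes the usual definitions (MAE and MSE as means of the absolute and squared errors, RMSE as the square root of MSE, and $R^2$ as one minus the ratio of residual to total sum of squares); `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R^2 for paired expected and predicted RUL values."""
    m = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e * e for e in errors) / m
    y_bar = sum(y_true) / m
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}
```

All four values are computed on the same test set so that the three regression models can be compared on an equal footing.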
The results indicate that the SVM model achieved the highest accuracy among the three, suggesting that the hyperparameters fine-tuned by the SSA allowed the SVM model to capture the underlying patterns and relationships in the data more effectively. The RF model achieved the second-best accuracy, while the GPR model achieved comparatively lower accuracy. Tables 3 and 4 show the evaluation parameters for the models trained with 80% of the data and with 70% of the data as the training set, respectively. The models trained on 80% of the data are more accurate than those trained on 70%. When 80% of the data was used as the training set, the coefficients of determination of all models were greater than 0.9, and the SVM model reached 0.97, indicating that the SVM can predict the remaining life of this bearing very well. In both cases, the SVM regression model performs better than the other two regression models, followed by RF. When the regression curves of the three models are placed in the same graph, the SVM regression curve is also seen to be closer to the true value. SVM's strengths in handling complex data distributions, regularization, high-dimensional data and the bias-variance trade-off likely contributed to its superior performance in our study. Furthermore, the synergy between the SSA's optimization capabilities and the SVM's sensitivity to hyperparameters likely explains why the SVM outperformed RF and GPR. While all three algorithms benefited from the SSA's optimization, the SVM's characteristics allowed it to derive the greatest advantage from the fine-tuned hyperparameters, as shown in Figures 6 and 7 and Tables 3 and 4.

| CONCLUSION
In conclusion, predicting the RUL of wind turbine high-speed shaft bearings has always been a daunting task. This paper addresses this challenge by proposing three promising methods for RUL prediction based on vibration signal analysis, with feature selection based on monotonicity to evaluate the degradation of high-speed shaft bearings. Our results demonstrate that the proposed feature extraction method is highly effective. Furthermore, we use an SSA to optimize the parameters of the SVM, RFR and GPR models, and MAE values to evaluate the model training outcomes. These models are validated using vibration data collected from a real 2 MW commercial wind turbine. Our findings reveal that the proposed models combined with the SSA are efficient in predicting the RUL, and that the performance of the models improves as the proportion of the data used for training increases. Among the three models proposed, the SVM model, after optimization of its parameters using the SSA, outperforms the others. This research demonstrates the potential of these methods for RUL prediction of wind turbine high-speed shaft bearings and can have significant implications for the wind power industry.

F I G U R E 6 RUL estimation models performance comparison when 80% of the data is the training set. GPR, Gaussian process regression; RF, random forest; RUL, Remaining Useful Life; SVM, support vector machines.
F I G U R E 7 RUL estimation models performance comparison when 70% of the data is the training set. GPR, Gaussian process regression; RF, random forest; RUL, Remaining Useful Life; SVM, support vector machines.

F I G U R E 3 Evolution of 15 features over the 50 days.

F I G U R E 5 RUL estimation results. RUL, Remaining Useful Life.
T A B L E 3 Evaluation parameters for models when the training set is 80% of the data.
T A B L E 2 Models training results.