Predictive maintenance for offshore oil wells by means of deep learning features extraction

Nowadays, the wide diffusion of the Internet of Things and the improvements in Artificial Intelligence techniques have boosted the development and application of data‐driven approaches to Predictive Maintenance, with the aim of reducing the costs linked to the maintenance of industrial machinery. Owing to its wide range of real‐life applications and the growing interest from industry, this field is highly attractive for both academics and practitioners, and constructing efficient frameworks to address the Predictive Maintenance problem is still an open debate. In this work, we propose a Deep Learning approach for feature extraction in the context of offshore oil well monitoring, exploiting the public 3W dataset, which is well known in the literature. The dataset is made up of about 2000 multivariate time series labelled according to the corresponding functioning of the well. The resulting classification task has eight classes, each related to a particular machinery condition. Thanks to the peculiarities of the labels, the proposed framework is valid both for diagnostics and prognostics. In more detail, we compare two different approaches to feature extraction. The first is a statistical approach, widely used in the literature related to the considered dataset; the second is based on a 1D Convolutional AutoEncoder. The extracted features are then used as input for several Machine Learning algorithms, namely Random Forest, Nearest Neighbours, Gaussian Naive Bayes and Quadratic Discriminant Analysis. Experiments on various time horizons prove the worthiness of the Convolutional AutoEncoder.


| INTRODUCTION
Predictive Maintenance is a type of maintenance management based on time series that tries to predict industrial machinery malfunctions, so as to effectively address possible problems and optimize machinery functioning by reducing the number of useless interventions and unexpected faults.
There exist three main types of maintenance techniques. Run to Failure consists in using a piece of machinery until it breaks, with no useless interventions but high reparation costs; Preventive Maintenance consists of scheduled maintenance interventions, with a large number of useless interventions; Predictive Maintenance (PdM) consists in trying to predict whenever a machinery is about to break. Sullivan et al. (2002) point out that a proper PdM programme can significantly cut down on production costs, minimize breakdowns, reduce inventory parts, and improve production quality, with a return that can reach 10 times the investment. So, as stressed by Mobley (2002), implementing a proper PdM programme can definitively affect industries' competitiveness and their chances to survive in the global market. It is therefore not surprising that many projects related to this field have been developed, causing an increasing interest also from an academic point of view, with the consequent publication of a variety of papers and the growth of a large literature.
As reported by Jardine et al. (2006), there are two main scopes in PdM: diagnostics and prognostics. The former regards the detection and classification of faults, while the latter is concerned with their prediction and the estimation of the machinery's remaining useful life. Furthermore, the approaches proposed in the context of PdM can in turn be divided into knowledge-based, model-based and data-driven ones. The first use previous experience to infer rules for detecting faults; the second use physical knowledge or statistical estimation methods to provide a representation of the model describing the phenomenon under observation; the third use historical and real-time data to understand when a fault is about to occur. For a long time the PdM approaches relied only on knowledge-based (Majstorović & Milačić, 1990; Vingerhoeds et al., 1995) and model-based techniques (Marjanović et al., 2011; Richalet, 1993). However, this type of approach has several drawbacks, as pointed out by different works such as Nguyen and Medjaher (2019) and Ditzler et al. (2015). In particular, it is not easy to find a theoretical model able to explain mechanisms as complex as those related to industrial machinery and, at the same time, a simplified model can be too shallow to adequately process the information given by the data. Furthermore, knowledge-based and model-based approaches are usually not well suited for online updating, which is a common requirement when working in PdM frameworks. On the other side, the main issue with employing data-driven PdM approaches is the need for a great amount of data and an efficient way to process them. So, only in the last years, with the growth of the Internet of Things (IoT) and recent advances in Artificial Intelligence (AI), has an approach of this type become possible. Currently, although the research in this field is still at an early stage, a consistent literature has been developed due to the great attractiveness of the subject, as reported, for example, by Xu et al. (2018).
In this work, we propose the application of a deep learning (DL) approach based on 1D convolutional neural networks (CNN) to the 3W dataset. This dataset was proposed by Vargas et al. (2019), from a joint project with Petróleo Brasileiro S.A. (Petrobras), one of the biggest petroleum industries in the world. In more detail, the dataset contains about 2000 multivariate time series related to the health status of offshore oil wells. The time series are divided into three main groups: real, simulated and hand-drawn. The first type regards real observations of working wells; the second type is made up of time series simulated with specific software, namely the OLGA Dynamic Multiphase Flow Simulator; and the third type consists of time series designed by experts. However, the third type contains just 20 series, and their contribution is quite narrow, so it is neglected.
The proposed task is a multi-class classification. In particular, the dataset contains nine classes: one for the correct functioning and the other eight related to as many types of faults. However, one of the fault types is unusable because it contains insufficient examples, so we deal only with the remaining seven fault types. Each time series contains second-by-second observations related to the temperature and pressure caught by different sensors. Furthermore, these observations are individually labelled, so we are allowed to split the time series into temporal windows of fixed length to build a framework that can provide prompt alarm signals, trying to predict a fault both when it is already occurring and when it is about to occur. Thus, our framework deals with both diagnostics and prognostics of the errors in well functioning.
According to different previous works, a two-stage approach is proposed for the classification: firstly, a feature engineering step is applied to extract relevant features from the data, obtaining a low-dimensional representation that discards the noise and preserves useful information; then, a machine learning (ML) classifier is applied to the features computed in the previous stage to provide the final output of the pipeline, that is, the classification. Figure 1 summarizes the procedure.
In more detail, in the feature engineering step we compare two different approaches for feature extraction. On one side, there is the statistical approach employed in several works such as Turan and Jäschke (2021); on the other side, there is the DL one, which employs an AutoEncoder (AE), as in Li et al. (2020), made up of two encoding 1D convolutional layers and two decoding 1D deconvolutional layers. The experimental stage shows the superiority of the latter. As for the second stage, four different ML algorithms are used, namely the Random Forest Classifier (RFC), k-Nearest Neighbours (KNN), Gaussian Naive Bayes (GNB) and Quadratic Discriminant Analysis (QDA). In this case, the experiments suggest a predominance of RFC. Moreover, we point out that the hyperparameters of both the AE and the classifiers are chosen via a genetic approach, namely the Biased Random Key Genetic Algorithm (BRKGA).

[Figure 1: The typical two-step approach when working with machine learning algorithms for classification. The feature engineering step is separated from the classification task. Starting from the raw time series, some techniques are applied to obtain a set of features (1) whose dimension is lower than that of the original data, to avoid the so-called curse of dimensionality and to make the problem computationally tractable, (2) that discard the noise in the raw data and (3) that catch as much of the variance in the original information as possible. Finally, the extracted features are fed into a classifier whose aim is to provide the classification and, according to this prediction, the eventual error alarm.]

Finally, the contributions of this paper can be summarized as follows:
• we propose the application of a DL approach for feature extraction on a well-known dataset in the literature where, to the best of our knowledge, no similar approaches have been developed;
• we assess the worthiness of the CNN approach by comparing it with the widely used statistical approach.
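The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the exact implementation: `extract_features` is a placeholder stand-in (per-variable mean and standard deviation) for the statistical or AutoEncoder extractor, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(windows):
    """Stage 1 (placeholder): map each raw window to a low-dimensional
    feature vector. Here: per-variable mean and std as a stand-in for the
    statistical or AutoEncoder extractor described in the text."""
    # windows: (n_windows, window_length, n_variables)
    return np.concatenate([windows.mean(axis=1), windows.std(axis=1)], axis=1)

# Toy data: 100 windows of 301 seconds, 7 variables, 8 classes.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 301, 7))
y = rng.integers(0, 8, size=100)

features = extract_features(X_raw)   # stage 1: feature engineering
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, y)                 # stage 2: classification
pred = clf.predict(features)
```

The point of the split is that any extractor producing a fixed-size vector per window can be paired with any of the classifiers compared later.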
The rest of this paper is organized as follows: Section 2 proposes a brief overview of state-of-the-art research in PdM to contextualize our work, with a particular focus on the 3W dataset; Section 3 provides a quick description of the main data analysis tools exploited in this work; Section 4 describes the dataset used and the experimental stage, showing and commenting on the results obtained; finally, Section 5 contains the conclusion.

| RELATED WORKS
In recent years, there has been a great growth of the literature regarding data-driven approaches in PdM. This can be seen from the large number of surveys that try to identify a common direction in research, highlighting the weak points of the existing methodologies and possible future directions of improvement. Some examples are Zhang et al. (2019), Jimenez et al. (2020), Lei et al. (2020) and Zhao et al. (2019). Zhang et al. (2019) stress that many existing papers in PdM work with public datasets and, after providing a broad overview of pre-processing methodologies, divide existing methods according to the techniques exploited, distinguishing between ML and DL approaches. Jimenez et al. (2020) focus the attention on the so-called multi-modal approaches, which combine different techniques, exploiting the strong points of each one to improve the overall accuracy. Lei et al. (2020) is related to Intelligent Fault Diagnosis (IFD) systems, which are applications of data analysis tools such as ML and DL techniques to machinery diagnosis and prognosis frameworks, particularly stressing the importance of AE applications. Zhao et al. (2019) focuses on the applications of DL methodologies in the PdM context, performing an accurate study for each type of neural network and giving particular relevance to the CNN.
Different works in the PdM literature exploit tree-based models, in particular the RFC. For example, Shrivastava et al. (2017) employ the RFC in a multi-classification problem for the diagnostics of bioreactor faults. Instead, Krishnakumari et al. (2017) exploit a tree-based model to perform feature selection. The work is based on a simulated dataset for the online diagnostics of machinery gearbox faults. Several statistical features are extracted, and then a decision tree is exploited to understand which features are important for the classification of faults. Also, Santos et al. (2018) is related to fault diagnostics for gearboxes in industrial contexts. In this case, the authors propose a strategy for automatically detecting multiple types of faults. The peculiarity of their dataset is the strong imbalance of classes, with almost sixty good-functioning observations for each fault, which provides a more realistic environment for developing PdM techniques. Their proposal is an ensemble of trees modified to be better suited to the imbalance in the data. Furthermore, they compare their proposal with other classification algorithms, particularly KNN and GNB. Li et al. (2018) is concerned with the diagnostics of refrigerant flow systems. In more detail, the authors propose a three-step procedure founded on the Classification and Regression Tree to tackle the diagnostics of three different types of faults, measuring the performance of their strategy both in offline and online contexts and comparing it with other tree-based models.
KNN, in its different forms, has been moderately used in PdM problems. For example, Yao et al. (2017) employ a three-step procedure for classifying faults in the rolling bearing context, exploiting the Swiss Roll and Swiss Hole datasets. Starting from raw vibration data, the authors perform feature extraction and then feature selection to prepare a feature set to be used as input for KNN. As stated by the authors, the strategy obtains satisfying results, confirming the system's effectiveness. Glowacz and Glowacz (2017) work with an induction motor. The acoustic data are transformed with statistical techniques to obtain the KNN inputs. The authors also claim that their results apply to other electrical machines. Vanraj, Dhami and Pabla (2018) propose a hybrid approach for diagnosing fixed-axis gearbox faults. In particular, they exploit statistical techniques to extract a set of features from acoustic signals to feed KNN classifiers, with satisfying results. Finally, Wang et al. (2020) study the faults of rolling bearings using KNN in an online framework, exploiting the vibration signal and comparing their proposal with other well-known strategies.
As for the application of neural networks and DL in PdM, much literature has been developed in recent years. Chen et al. (2017) work with rotating machinery by comparing three different DL architectures, including an AE, to predict faults in rolling bearings. The inputs of the algorithms are features extracted from the raw data with the statistical approach. The results show that the best accuracy is obtained via the AE. Also, Shao et al. (2017) apply an AE with three layers to fault diagnosis for rotating machinery. The novelty of this approach is related to the loss function, designed to clean the results from noise. This task is achieved by exploiting the maximum correntropy and a softmax classifier. Finally, Diez-Olivan et al. propose a CNN approach designed to merge the feature extraction and classification steps. The test is carried out on real-time data and, through the comparison with different methods widely used in the literature, shows the effectiveness of the proposed strategy.
One meaningful example of a diagnostics application is the hydraulic system dataset provided by Helwig et al. (2015). The dataset contains a sequence of simulated working conditions of a complex hydraulic system in different simulated fault scenarios. Then, several sensors extract different features, which form the dataset. The task is to predict the working condition of four different components: the cooler, the valve, the pump and the accumulator. Although the first work exploits simple ML models, many other works have subsequently employed this dataset, obtaining even higher accuracy. In particular, the best results are those obtained with the employment of CNN by Huang et al. (2021) and Yuan et al. (2020), with an average accuracy above 99%.
Regarding the prognosis of hydraulic systems similar to that mentioned above, we should note the works of Bedotti et al. (2018) and Ghini and Vacca (2018). The former exploits a model-based approach built on thermodynamic considerations, working with a proprietary dataset, while the latter proposes a regression task exploiting a data-driven approach to estimate the remaining useful life (RUL) of some components of a hydraulic system.
Although the 3W dataset is quite recent, many works already employ it, such as Li and Ge (2021). This paper exploits the same statistical features as Marins et al. (2021). Moreover, in Marins et al. (2021) a hybrid model based on statistical feature engineering and an RFC classifier is proposed. The test set is made up of real and simulated instances, and the results confirm the robustness of their proposal. Another noteworthy work based on the 3W dataset is Turan and Jäschke (2021), where a comparison between different classifiers exploiting statistical features is carried out. The comparison shows a predominance of tree-based models, in accordance with the proposal of Marins et al. (2021). Finally, Carvalho et al. (2021) compare many pre-processing techniques and hyperparameter optimization strategies, showing the ineffectiveness of grid search in performing hyperparameter optimization. Furthermore, the work in its conclusion mentions the possibility of applying a DL approach to this dataset.

| DATA ANALYSIS TOOLS
In this section, we provide an overview of the data analysis tools we exploit, explaining the instruments used for the feature extraction and those used for the classification.

| Features extraction
In this subsection, we describe the two different approaches for features extraction we compare, namely the statistical approach and the AE one.

| Statistical features extraction
Statistical feature extraction consists in extracting features from raw data by applying different statistical techniques. It is an approach widespread in the literature; see, for example, DeCarlo (1997) and Hoaglin et al. (2000). In particular, we are interested in the same features exploited by Marins et al. (2021). So, from each time window and for each variable, we extract 9 statistical indicators: mean, median, standard deviation, skewness, kurtosis, maximum, minimum, first quartile and third quartile. The values obtained are then standardized to put all the values on the same scale, making it simpler for the ML algorithm to catch patterns in the data. The final result is a matrix in ℝ^(N×9V), where N is the number of time windows extracted from the time series and V is the number of variables in the time series.
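This extraction step can be sketched as follows; a minimal numpy implementation assuming windows stored as an array of shape (N, T, V). The moment and quartile conventions (Fisher skewness, non-excess kurtosis, linear-interpolation percentiles) are our assumptions and may differ in detail from the original implementation.

```python
import numpy as np

def statistical_features(windows):
    """Compute the 9 statistical indicators per variable for each window.
    windows: array of shape (N, T, V) -> standardized features (N, 9*V)."""
    mean = windows.mean(axis=1)
    median = np.median(windows, axis=1)
    std = windows.std(axis=1)
    centered = windows - mean[:, None, :]
    skew = (centered ** 3).mean(axis=1) / (std ** 3)   # Fisher skewness
    kurt = (centered ** 4).mean(axis=1) / (std ** 4)   # (non-excess) kurtosis
    mx = windows.max(axis=1)
    mn = windows.min(axis=1)
    q1 = np.percentile(windows, 25, axis=1)
    q3 = np.percentile(windows, 75, axis=1)
    feats = np.concatenate([mean, median, std, skew, kurt, mx, mn, q1, q3],
                           axis=1)
    # Standardize each feature column to zero mean and unit variance.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 301, 7))   # 50 windows, 301 s, 7 variables
F = statistical_features(X)         # (50, 63): 9 indicators x 7 variables
```

With V = 7 usable variables this yields 63 features per window, matching the ℝ^(N×9V) dimension stated above.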

| Convolutional neural network
The CNN (LeCun et al., 1998) is a kind of neural network created to find spatial patterns in the input. It is founded on weight sharing, that is, one weight is used for more than one neuron in a layer. So, the output of a layer, or feature map, can be calculated as the convolutional product between the layer input and a particular matrix, or convolutional filter. Therefore, in the specific case of a 1D-CNN with a single filter w of length L applied to an input x ∈ ℝ^N, the feature map M ∈ ℝ^K can be written as a non-linear transformation f of a linear combination of the elements in the input layer weighted by the filter:

M_k = f( Σ_{l=1}^{L} w_l · x_{(k−1)·s+l} + b ),  k = 1, …, K,

where s is the stride of the convolution and the dimension of the feature map is K = ⌈(N − L + 1)/s⌉.
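The feature-map size formula can be checked numerically. The sketch below is a hand-written single-filter 1D convolution for illustration only (ReLU activation, no padding), not the implementation used in the experiments.

```python
import math

def conv1d_output_len(N, L, s):
    # K = ceil((N - L + 1) / s): number of valid filter positions with stride s
    return math.ceil((N - L + 1) / s)

def conv1d(x, w, s, f=lambda z: max(z, 0.0)):
    """Single-filter 1D convolution with stride s and activation f (ReLU)."""
    N, L = len(x), len(w)
    K = conv1d_output_len(N, L, s)
    return [f(sum(w[l] * x[k * s + l] for l in range(L))) for k in range(K)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [-1.0, 1.0]          # difference filter, L = 2
m = conv1d(x, w, s=2)     # N=6, L=2, s=2 -> K = ceil(5/2) = 3
```

Here each output element is the difference of two consecutive inputs, so m = [1.0, 1.0, 1.0].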
Another type of convolutional layer is the transposed convolutional layer, sometimes referred to as the deconvolutional layer (Zeiler et al., 2010). It is often used as a kind of inverse operation of the convolutional layer, to recover the original dimension of the input and, at the same time, preserve the spatial relationships among the information. So, using the same notation as above, the feature map M ∈ ℝ^K of a transposed convolution has dimension equal to K = (N − 1)·s + L.
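The dimension bookkeeping of the two layer types can be verified with a short sketch (illustrative sizes; when (N − L + 1) is divisible by the stride, the transposed convolution exactly recovers the input length):

```python
def conv_out_len(N, L, s):
    # forward convolution: K = ceil((N - L + 1) / s)
    return -(-(N - L + 1) // s)

def deconv_out_len(K, L, s):
    # transposed convolution: N' = (K - 1) * s + L
    return (K - 1) * s + L

# With N = 301 (one of the window sizes used later), L = 5, s = 4:
N, L, s = 301, 5, 4
K = conv_out_len(N, L, s)           # compressed length
N_back = deconv_out_len(K, L, s)    # length recovered by the deconvolution
```

This is why choosing the Decoder symmetric to the Encoder lets the AutoEncoder reconstruct an output of the same length as its input.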

| AutoEncoder
An AE is a neural network designed to reduce the dimensionality of the feature space. This task is accomplished by projecting the raw information into a subspace known as the latent space. The projection is designed to preserve the important information, eliminating noise and secondary information.
An AE is composed of two components: the Encoder and the Decoder. The former is used to map the inputs on the latent space, while the latter is used to reconstruct the features from the encoded input. Although only the Encoder is useful for our purpose, the Decoder is necessary to train the entire AE. In fact, in order to extract a significant latent space, the network is trained by exploiting as loss function the l2 norm of the difference between the original input vector and its reconstruction provided by the AE.
In our application, the AE is constructed by exploiting the CNN. In particular, the Encoder is made up of two convolutional layers, while the Decoder is made up of two deconvolutional layers. The kernel sizes and strides of the encoder layers are chosen by trying (1) to obtain a latent space of dimension similar to that of the statistical approach and (2) to keep the dimensionality drop smooth across the layers. Instead, the feature maps of the deconvolutional layers are designed by attempting to preserve the symmetry between Encoder and Decoder.
Finally, as for the activation functions of the AE, we use the well-known ReLU. Other choices have been tested, such as sigmoid or tanh. However, they do not seem to significantly improve the results obtained with ReLU, so we work with this activation function. Figure 2 describes the kind of AE exploited, while Figure 3 clarifies the AE structure used in the three experiments.
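A minimal numpy sketch of the Encoder forward pass is shown below. The weights are random and the kernel sizes, strides and channel counts are illustrative placeholders, not the values selected by the procedure described above; the point is only to show how two strided 1D convolutions with ReLU compress a window into a low-dimensional latent representation.

```python
import numpy as np

def conv1d_layer(x, W, b, stride):
    """x: (length, ch_in); W: (kernel, ch_in, ch_out); b: (ch_out,).
    Returns the ReLU feature map of shape (K, ch_out)."""
    L, _, ch_out = W.shape
    K = -(-(x.shape[0] - L + 1) // stride)   # ceil((len - L + 1) / stride)
    out = np.zeros((K, ch_out))
    for k in range(K):
        patch = x[k * stride : k * stride + L]        # (L, ch_in)
        out[k] = np.einsum("lc,lco->o", patch, W) + b
    return np.maximum(out, 0.0)                        # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=(301, 7))                          # one window, 7 variables
# Encoder: two convolutional layers with illustrative hyperparameters.
W1, b1 = rng.normal(size=(7, 7, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(5, 16, 4)) * 0.1, np.zeros(4)
h1 = conv1d_layer(x, W1, b1, stride=3)   # (99, 16): first compression
z = conv1d_layer(h1, W2, b2, stride=4)   # (24, 4): latent representation
```

With these placeholder sizes the 301×7 window is compressed to a 24×4 latent map (96 values), of the same order as the 63 statistical features, consistent with criterion (1) above.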

| Classifiers
In this subsection, we show the classifiers used for the comparison by providing a description of the hyperparameters that must be set.

| Random Forest classifier
[Figure 2: The AutoEncoder is made up of two pieces: an Encoder and a Decoder. The former is formed by two convolutional layers and has the task of compressing the input into a lower-dimensional latent space. The latter is made up of two deconvolutional layers and is designed to reconstruct the original information from the representation provided in the latent space.]

The RFC is a classification approach related to decision trees. A tree is constructed by repeatedly splitting the feature space according to a specific feature. The feature to split on and the cut value are selected each time by optimizing the chosen metric. The result of this stage is a partition of the feature space; then, a class is associated with each element of the partition. As for the loss function, we select the popular Gini index, that is,

G_i = Σ_{c=1}^{C} p̂_ic (1 − p̂_ic),

where i is the selected set, the classes are indicated with c ∈ {1, …, C} and p̂_ic is the percentage of class c elements which are in the set P_i.
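The Gini index of a node can be computed as in this short sketch (equivalent form 1 − Σ_c p_c²):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set: sum_c p_c * (1 - p_c) = 1 - sum_c p_c^2."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((m / n) ** 2 for m in counts.values())

pure = gini([0, 0, 0, 0])    # a pure node has zero impurity
mixed = gini([0, 0, 1, 1])   # a 50/50 two-class node is maximally impure
```

A split is chosen to minimize the impurity of the resulting children, weighted by their sizes.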
Trees have an essential advantage: the interpretability of the obtained results. However, it is usually found that an ensemble of trees is more accurate than a single one. The RFC (Breiman, 2001) is constructed as a bagging of trees. It consists in creating a set of trees, each one fed with a random subset of the original data; furthermore, at each split only a subset of the features is considered. Then, the predictions of all the trees are combined via majority voting to obtain the final classification.
When applying the RFC, different hyperparameters have to be set. In particular, we work with the following: n-estimators, which is the number of trees in the forest; max-depth, that is, the maximum depth of each tree; max-features, that is, the maximum number of features to be considered at each split; and class-weight, which is useful to handle class imbalance.
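In scikit-learn these hyperparameters map directly onto `RandomForestClassifier` arguments. The values below are illustrative placeholders on synthetic data, not the combinations selected by BRKGA.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the extracted features.
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

rfc = RandomForestClassifier(
    n_estimators=100,          # number of trees in the forest
    max_depth=10,              # maximum depth of each tree
    max_features="sqrt",       # features considered at each split
    class_weight="balanced",   # reweight classes to handle imbalance
    random_state=0,
)
rfc.fit(X, y)
acc = rfc.score(X, y)          # accuracy on the training data, for illustration
```

The `class_weight` argument matters here because, as discussed later, faulty windows are far less frequent than normal ones.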

| K-nearest neighbours
KNN (Fix & Hodges, 1989) is a widely used ML approach for regression and classification. It consists in representing the data in an n-dimensional space. Then, a metric is used to establish the closeness between points. A fixed number of points, k, is used at each step. To evaluate which class a point belongs to, the k points nearest to it are considered, and a voting procedure selects the class, weighting the contribution of each neighbour either with equal weights or with weights decreasing with the distance from the objective point.
For the good functioning of this algorithm, it is important to accurately set both k (also known as n-neighbours) and the weights to associate with each neighbour according to its distance from the considered point.
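A minimal scikit-learn sketch of the two hyperparameters just mentioned (the values of `n_neighbors` and `weights` are illustrative, not the ones selected by BRKGA):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated 1D clusters, classes 0 and 1.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# weights="distance" weighs each of the k neighbours inversely to its
# distance; "uniform" would give every neighbour an equal vote.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X, y)
pred = knn.predict([[0.05], [5.05]])
```

Each query point is assigned the class of its three nearest training points.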

| Gaussian naive Bayes
The GNB model is based on the assumption that the features are independent. Thus, the density function of the input vector can be decomposed as the product of the density functions for each component, which is supposed to be a Gaussian one. Then, the logarithm of the ratio between the membership probabilities to two different classes can be obtained as the sum between a constant term and as many functions as the dimension of the feature space, each one depending on just a single component of the input vector, allowing easy computation of the final class.
When implementing this algorithm, we tried to optimize var-smoothing, which is the hyperparameter representing the percentage of the maximum variance to be added for stability in the computation.

| Quadratic discriminant analysis
QDA is an approach for classification based on the Bayes theorem and Gaussian density functions. The approach consists in estimating the logarithm of the ratio between class probabilities. Assuming that the covariance matrix for the class c is indicated with Σ_c, QDA establishes the membership class of a given point as the argmax of the so-called discriminant functions, defined by

δ_c(x) = −(1/2) log|Σ_c| − (1/2)(x − μ_c)ᵀ Σ_c⁻¹ (x − μ_c) + log π_c,

where x is the input vector, μ_c is its mean among the elements in class c and π_c is the prior probability of class c. Finally, what we obtain is a partition of the feature space, whose boundaries are defined by the solutions of the equations δ_c(x) = δ_c′(x) for pairs of classes c ≠ c′. To improve the results provided by this classifier, we have optimized the hyperparameter representing the regularization of the covariance estimates, also known as reg-param.
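Both classifiers and their regularization hyperparameters are available in scikit-learn; the sketch below uses illustrative values on synthetic, well-separated clusters.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in 3 dimensions.
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# var_smoothing: fraction of the largest per-feature variance added
# to all variances for numerical stability.
gnb = GaussianNB(var_smoothing=1e-9)
gnb.fit(X, y)

# reg_param: shrinkage of the per-class covariance estimates.
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
qda.fit(X, y)

acc_gnb = gnb.score(X, y)
acc_qda = qda.score(X, y)
```

On data this well separated both models classify almost perfectly; the hyperparameters matter on noisier, higher-dimensional features like the ones extracted here.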

| Hyperparameters optimisation: Biased random key genetic algorithm
This subsection describes the strategy used for hyperparameter optimization, namely the BRKGA (Ericsson et al., 2002; Gonçalves and de Almeida, 2002), a genetic approach widely used in the recent literature; see, for example, Biajoli et al. (2019) and Carrabs (2021). The main idea consists in simulating an evolutionary framework in which each combination of hyperparameters is regarded as an individual of a population, and each hyperparameter is regarded as a gene of the individual. The first generation is randomly generated, and the fitness function is computed for each individual to assess the goodness of the proposed solutions. Then, the new generation of individuals is created in three different ways: reproduction consists in copying the best individuals of the previous generation; crossover consists in generating offspring by mixing genes from two randomly chosen parents; and mutation consists in randomly mutating some of the generated individuals to avoid local minima. The generation evaluation and creation process is repeated a fixed number of times. Finally, the best individual in terms of fitness value is selected.
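The loop just described can be sketched as follows. This is a minimal illustration: population sizes, rates and the toy fitness are placeholders, and a real BRKGA decodes each vector of random keys in [0, 1] into actual hyperparameter values before evaluating it.

```python
import random

def brkga(fitness, n_genes, pop=30, elite=6, mutants=6, gens=40,
          rho=0.7, seed=0):
    """Minimize `fitness` over vectors of random keys in [0, 1]^n_genes."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness)
        elites = population[:elite]
        # Reproduction: copy the elite individuals unchanged.
        nxt = [e[:] for e in elites]
        # Mutation: inject fresh random individuals to escape local minima.
        nxt += [[rng.random() for _ in range(n_genes)] for _ in range(mutants)]
        # Biased crossover: each gene is taken from the elite parent
        # with probability rho, otherwise from the non-elite parent.
        while len(nxt) < pop:
            e = rng.choice(elites)
            o = rng.choice(population[elite:])
            nxt.append([e[g] if rng.random() < rho else o[g]
                        for g in range(n_genes)])
        population = nxt
    return min(population, key=fitness)

# Toy fitness: squared distance of the keys from the point (0.3, 0.8).
best = brkga(lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.8) ** 2, n_genes=2)
```

The "biased" part is exactly the rho-weighted crossover: offspring inherit most genes from the elite parent, which speeds up convergence compared with unbiased mixing.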

| EXPERIMENTAL RESULTS
In this section, the experimental results are shown. In particular, after describing the exploited dataset, we compare the different feature extraction techniques on different sliding window sizes.

| The dataset
Our work is related to the 3W dataset provided by Vargas et al. (2019) and downloadable from the UCI website at the link https://archive.ics.uci.edu/ml/datasets/3W+dataset. The dataset, containing observations related to offshore oil wells, comprises 1984 time series divided into nine classes: good functioning (class 0) and eight error classes. However, one of the errors is poorly represented in the dataset, so we consider only the error types from 1 to 6 and 8. Furthermore, as in Marins et al. (2021), we also delete the hand-drawn time series, in that they are not useful for training an ML algorithm. So, the final result is a dataset made up of 1960 time series of different lengths. Furthermore, we use a train-validation-test split, as shown in Figure 4. The splitting is performed in this way: all the simulated time series are used only for the train set; the real time series are sorted by their file name, then the first 60% are used for the train set, and the remaining series are equally split into validation and test sets.
Each time series contains multiple observations. In particular, the observations are taken with a 1 Hz sampling rate, that is, second by second. There are 8 variables for each series, representing temperature and pressure caught by sensors placed at different points of the system. However, the feature T-JUS-CKGL contains almost only NaN values, so we discard it. Regarding the classes, we consider 8 of them, whose meaning is the following:
• 0 Normal Functioning: no system faults occur.
• 1 Abrupt Increase of BSW: the Basic Sediment and Water (BSW) is defined as the ratio between the flow rates of water and sediment on one side and oil on the other side. Although this number naturally increases over time, a strong increase can cause several problems, so it is important to check this value.
• 2 Spurious Closure of DHSV: the Downhole Safety Valve (DHSV) is a safety valve created to prevent serious damage to the workers and the environment. However, there can be unmotivated closures, which lead to unproductive periods.
• 3 Severe Slugging: it is a kind of cyclical flow instability that could seriously harm the well.
• 4 Flow Instability: it is another type of flow instability, less severe than Severe Slugging. However, it could degenerate into class 3.
• 5 Rapid Productivity Loss: it consists of a strong reduction in the flow, which can be caused by different factors.
• 6 Quick Restriction in PCK: the Production Choke (PCK) is a control valve. Sometimes, there could be unwanted fast restrictions in the presence of operational problems.
• 8 Hydrate in Production Line: hydrate is a crystalline compound that can seriously harm oil system production up to the stop of the well.
Each fault time series is made up of three periods. The initial period is characterized by the normal functioning of the well: no errors are occurring, and this piece of the time series can be considered a standard functioning case. This period is also known as the normal period. Then, something starts to go wrong, and the indicators start to exhibit a particular behaviour. In this period, also known as the faulty transient period, the well performances are still at an acceptable level, but they are falling, and the fault starts to become evident. Detecting a fault during this period amounts to prognostics of the well. Finally, during the last period, called the faulty period, the error is evident, and the fault seriously harms the performance of the well. Figure 5 shows an example of a fault series divided into the three periods.
Although the labels provided by the dataset publishers consider the three periods, we do not make differences between the faulty transient and faulty periods, in that not all the series contain all the periods (as reported in Turan and Jäschke (2021)).
Our approach to the classification problem is the following. Firstly, we divide each time series into sliding windows. The length of each window is equal to the stride, so the windows do not overlap. Experiments with horizons of 301, 451 and 601 seconds are carried out. Note that each observation of each time series is individually labelled with a value equal to 0 (normal period), 100 + fault type (faulty transient period) or fault type (faulty period); and we do not make distinctions between faulty transient and faulty periods. So, to label each time window, we simply use a voting procedure: if the majority of the observations in the window are of type 0, then the entire window is labelled as 0; otherwise, it is labelled with the corresponding fault type. Furthermore, this approach does not harm the prognostics task, because the normal period always occurs before the faulty periods.
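The windowing and labelling can be sketched as follows. This is a simplified illustration: the transient label 100 + k is mapped onto fault type k as described above, and a plain argmax over label counts stands in for the voting rule (tie-breaking in mixed windows is an assumption of this sketch).

```python
import numpy as np

def make_windows(values, labels, horizon):
    """Split a series into non-overlapping windows of `horizon` seconds and
    label each window by majority vote (transient label 100+k counts as k)."""
    labels = np.where(labels >= 100, labels - 100, labels)  # merge transient
    n = len(values) // horizon
    X, y = [], []
    for i in range(n):
        seg_v = values[i * horizon : (i + 1) * horizon]
        seg_l = labels[i * horizon : (i + 1) * horizon]
        votes = np.bincount(seg_l)       # count occurrences of each label
        X.append(seg_v)
        y.append(int(votes.argmax()))    # majority label of the window
    return np.array(X), np.array(y)

# Toy series: 600 normal seconds, then 300 transient (101) and 300 faulty (1).
vals = np.arange(1200.0)
labs = np.array([0] * 600 + [101] * 300 + [1] * 300)
X, y = make_windows(vals, labs, horizon=300)
```

The first two windows fall in the normal period and get label 0; the last two, covering the transient and faulty periods, both get the fault label 1, which is what makes the same labels usable for prognostics as well as diagnostics.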
The train-validation-test splitting is as follows. All the simulated time series are used only for the train set. The real time series are sorted by file name and then divided in this way: 60% for the train set, 20% for the validation set and 20% for the test set. The train set is used to train the models; the validation set is used to select the best hyperparameter vector; the test set is used for the final comparison.

F I G U R E 5 The pressure measured by the Permanent Downhole Gauge (PDG) for a class 1 time series. Faulty time series are characterized by three periods. In the normal period (green), there are no problems and the system works well. The problem starts to appear in the faulty transient period (blue), where the production performance is slightly damaged. In the faulty period (red), the fault is evident and the damage seriously harms productivity.

Tables 1 and 2 contain the number of time series and time windows, respectively, divided into train, validation and test sets for all the considered horizons. Regarding the time windows, the first number is related to the 301 s window, the second to the 451 s window and the third to the 601 s window. Observe that, although the validation and test sets have the same number of time series, they have different numbers of time windows.
This is due to two different factors: the time series, despite being equal in number, do not have the same length, and so they are not split into the same number of windows; moreover, the initial period of each fault series is classified as class 0, and the length of this initial period is not constant across time series.

| Features extraction comparison
For the comparison, we exploit two different feature extraction methods and four classifiers. In particular, we compare feature extraction performed with the statistical approach and with the AE approach. The AE kernel sizes and strides are manually chosen to preserve the symmetry between Encoder and Decoder, while the remaining hyperparameters are optimized via BRKGA.
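As an illustration, a symmetric Conv1D AutoEncoder of this kind can be sketched in Keras as follows. The kernel sizes, strides, filter counts and latent dimension below are placeholders chosen for the sketch, not the BRKGA-optimized values reported in Table 4.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_conv1d_autoencoder(horizon=301, n_channels=8, latent_dim=32,
                             l2=1e-4, lr=1e-3):
    """Symmetric Conv1D AutoEncoder: the encoder compresses each
    window into a latent feature vector; the decoder mirrors it
    with transposed convolutions."""
    len1 = -(-horizon // 2)  # length after first stride-2 conv
    len2 = -(-len1 // 2)     # length after second stride-2 conv

    # Encoder: two strided Conv1D layers, then a Dense latent layer.
    inp = keras.Input(shape=(horizon, n_channels))
    x = layers.Conv1D(16, 7, strides=2, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(l2))(inp)
    x = layers.Conv1D(32, 5, strides=2, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, name="features")(x)

    # Decoder: mirror the encoder, then crop the stride-padding excess.
    y = layers.Dense(len2 * 32, activation="relu")(latent)
    y = layers.Reshape((len2, 32))(y)
    y = layers.Conv1DTranspose(16, 5, strides=2, padding="same",
                               activation="relu")(y)
    y = layers.Conv1DTranspose(n_channels, 7, strides=2, padding="same")(y)
    y = layers.Cropping1D((0, len2 * 4 - horizon))(y)

    autoencoder = keras.Model(inp, y)
    autoencoder.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
    encoder = keras.Model(inp, latent)  # used for feature extraction
    return autoencoder, encoder
```

After training the autoencoder on the raw windows, the encoder alone is used to produce the feature vectors fed to the classifiers.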
Then, the extracted features are used as inputs for different classifiers, namely RFC, KNN, GNB and QDA. For their hyperparameter optimization, we adopt BRKGA, training the algorithms on the train set and evaluating their performance on the validation set. Table 3 summarizes the hyperparameter space for the AE and for each classifier, while Figure 6 summarizes the proposed pipeline.
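A minimal Scikit-learn sketch of this classification stage is given below; the hyperparameter values are plain defaults for illustration, not the BRKGA-optimized ones.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def fit_classifiers(X_train, y_train):
    """Fit the four compared classifiers on the extracted features
    (statistical or AE latent vectors)."""
    models = {
        "RFC": RandomForestClassifier(n_estimators=100, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "GNB": GaussianNB(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }
    for m in models.values():
        m.fit(X_train, y_train)
    return models
```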
Table 4 shows the AEs exploited for the feature extraction, highlighting the dimension of each layer. The learning rate is reported under the column Lr. The regularization is constant among layers and is an l2 regularization whose strength is reported under the column Reg. For each convolutional and deconvolutional layer, reported under the columns Conv and Deconv respectively, the kernel dimension (K), the stride (s) and the number of filters (F) are reported below the dimension of the output layer.
Furthermore, we add two benchmarks (B) to the comparison: a statistical one and a DL one. The former consists of predicting class 0 for every sample. The latter is a CNN classifier made up of two 1D convolutional layers followed by two fully connected layers, with hyperparameters optimized via BRKGA.
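The DL benchmark can be sketched in Keras as follows; the filter counts, kernel sizes and dense-layer width are illustrative placeholders, since the actual values are chosen by BRKGA.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_benchmark_cnn(horizon=301, n_channels=8, n_classes=8):
    """DL benchmark: two 1D convolutional layers followed by
    two fully connected layers, ending in a softmax over the
    eight classes."""
    model = keras.Sequential([
        keras.Input(shape=(horizon, n_channels)),
        layers.Conv1D(16, 7, strides=2, activation="relu"),
        layers.Conv1D(32, 5, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```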
As for the metrics used for the comparison, they are four: the overall accuracy (ACC), plus the precision (PREC), recall (REC) and F1-score for class 0, that is, the good-functioning class. Denoting by CS the overall number of correctly predicted samples, TS the overall number of samples in the test set, TP the number of true positives, FP the number of false positives and FN the number of false negatives, the considered metrics can be written as:

ACC = CS / TS, PREC = TP / (TP + FP), REC = TP / (TP + FN), F1 = 2 · PREC · REC / (PREC + REC).

The results are shown in Table 5 for all the considered time horizons. It is straightforward to see that the features extracted via AE are more easily usable by the ML algorithms than the statistical ones. Furthermore, the best algorithm in terms of accuracy is the RFC when trained with AE features.
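Under these definitions, the metrics can be computed directly; in this NumPy sketch, class 0 plays the role of the positive class.

```python
import numpy as np

def class0_metrics(y_true, y_pred):
    """Overall accuracy, plus precision, recall and F1 for class 0
    (normal functioning), treating class 0 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cs = np.sum(y_true == y_pred)               # correctly predicted samples
    ts = y_true.size                            # total samples
    tp = np.sum((y_pred == 0) & (y_true == 0))  # true positives
    fp = np.sum((y_pred == 0) & (y_true != 0))  # false positives
    fn = np.sum((y_pred != 0) & (y_true == 0))  # false negatives
    acc = cs / ts
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```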

| Discussion
According to the results provided in Table 5, several conclusions can be drawn. Firstly, we can observe how the different classifiers react to different time window lengths. The classifiers fed with statistical features show a drop in performance, particularly in accuracy and F1 score, as the window size grows. In particular, QDA exhibits a dramatic drop as the length increases. On the other hand, with just a few exceptions, AE features show the opposite behaviour: the metrics improve as the time windows get bigger. This can be a first indication when choosing which type of predictor to use in this context.

The optimization is done using only the train and validation sets; the test set is then used to compare the methods. All the algorithms are implemented in Python. In particular, for the AutoEncoder the Keras library (Chollet et al., 2015) is used, while the classifiers are implemented using the Scikit-learn library (Pedregosa et al., 2011).

F I G U R E 6 The proposed pipeline. From the time series, we obtain the sliding windows, which are exploited to obtain the set of features by applying statistical measurements or the AutoEncoder. Then, each set of features is fed into the machine learning algorithms, which return the final predictions as output.

T A B L E 4 The network structures for the AutoEncoders used to extract the features.
Then, it is worthwhile to observe how the different classifiers react to the different inputs. The KNN is the only one that exhibits good performance with both input types, even if AE features seem slightly better. Instead, RFC shows its potential in identifying valuable patterns only when fed with AE input. QDA exhibits similar behaviour, although smaller in magnitude. Finally, GNB reacts to the input oppositely to QDA, that is, its performance is slightly better when fed with statistical features.
A final observation regards the comparison with the benchmarks. If we look mainly at the accuracy and the F1 score (the other metrics are strongly affected by the nature of the statistical benchmark, so they cannot be used for this comparison), we observe that only the AE + RFC approach overcomes the benchmark. In fact, in the case of 301 s time windows, the performances of the two methods are very close, but as the time window length increases, the performance of AE + RFC improves and becomes significantly better than the benchmark's.

| CONCLUSIONS
This work presents a DL approach for offshore oil wells diagnostics and prognostics. The public 3W dataset is exploited, and this work is the first attempt to use DL techniques on this dataset. In more detail, a 1D CNN is used as an AE to extract features from raw data. Then, the features are fed into different ML algorithms. This feature extraction approach is compared with the statistical one, which is commonly used for this dataset.
The results highlight the strengths and weaknesses of each proposed classifier. Although the classifiers react differently to the two proposed feature extraction approaches, the underlying trend seems to reveal a predominance of the DL approach. Regarding the considered classifiers, the best results are obtained by the RFC, which is coherent with previous works that exploit and compare ML approaches for the 3W dataset. To summarize, on the one side, our work confirms the capability of RFC in handling this specific problem; on the other side, the hybrid approach with DL shows superior performance with respect to the standard approaches to this problem.
In the future, the performance of hybrid DL and ML approaches for this dataset could be analysed in more depth by working with different feature extraction techniques and classification methodologies. Furthermore, it could be worth studying the performance of a DL classifier that contains both the feature extraction and the classification steps in a unique framework, optimizing the weights to provide the latent space that best fits the associated classifier rather than the one that best describes the raw data.
Finally, as already pointed out by several works, another helpful direction of improvement could be the introduction of an Explainable AI technique to visualize the obtained results and better understand which features are significant for predicting the faults. In fact, in the PdM framework, the intelligibility of the proposed classifiers is highly valued, as important decisions have to be taken according to their predictions.

Note: The benchmark performances are reported in the B rows (stat is the always-class-0 predictor and AE is the CNN classifier). The metrics used for the comparison are the accuracy (ACC), precision (PREC), recall (REC) and F1 for class 0.
(2021) tries to solve one of the most significant issues regarding data-driven approaches, that is, the non-stationarity of the data, which can seriously harm the performance of the model. The authors propose the application of a Deep Neural Network supported by the Dendritic Cell Algorithm (DCA), a biology-inspired methodology that aims to find changes in the data distribution and to set the network hyperparameters accordingly. Lu et al. (2017) use a slightly modified version of the AE for predicting faults in a rotary machinery context; the work also provides a thorough comparison among several ML and DL architectures to assess the robustness of the proposal. Eren et al. (2019) employ a 1D CNN to predict bearing faults, comparing it with other ML algorithms and different inputs. Ince et al. (2016) use a 1D CNN to classify motor faults.

F I G U R E 3 Summary of the AEs used in the experimental stage. The left summary refers to the AE used in the experiment with 301-s-long time windows; the centre summary is related to the 451 s experiment; the right-side summary contains details about the 601 s experiment.

3.2.4 | Quadratic discriminant analysis

T A B L E 5 Comparison of the results obtained with the several feature extraction techniques and classifiers in the three considered time horizons (301, 451 and 601 s time windows).

T A B L E 1 Time series present in the train, train simulated, validation and test sets, for each possible class.

T A B L E 2 Time windows present in the train, train simulated, validation and test sets, for each possible class. Note: The first number of each cell represents the number of 301 s windows, the second refers to 451 s windows and the last one to 601 s windows.

T A B L E 3 The hyperparameters optimized, via BRKGA, for the AutoEncoder and for each classifier.