A hybrid one-class approach for detecting anomalies in industrial systems

The significant advance of the Internet of Things in industrial environments has made it possible to monitor the different variables that come into play in an industrial process. This circumstance allows the supervision of the current state of an industrial plant and, consequently, enables decision making. Hence, anomaly detection techniques are presented as a powerful tool to determine unexpected situations. The present research is based on the implementation of one-class classifiers to detect anomalies in two industrial systems. The proposal is validated using two real datasets registered at different operating points of two industrial plants. To ensure better performance, a clustering process is developed prior to the classifier implementation. Then, local classifiers are trained over each cluster, leading to successful results when they are tested with both real and artificial anomalies. Validation results present, in all cases, AUC values above 90%.

To tackle the anomaly detection problem, it is important to start with the definition of what is considered an anomaly. According to Chalapathy et al. (2018) and Chandola et al. (2009), all events that do not follow the expected behaviour are considered anomalous. This concept, which is very general, can be applied over a wide range of fields, such as medical diagnosis, fraud detection, cybersecurity and, of course, industry (Huang et al., 2012; Kou et al., 2004). This last field is the focus of this research, where the source of the abnormal situation can be diverse, ranging from fault detection in actuators and instruments, to structural damage detection or wrong network performance (Gomes et al., 2019).
Despite the simplicity of the definition presented before, the anomaly detection problem must face several important issues (Chandola et al., 2009). First, it is worth mentioning that, in many applications, it is not possible to know in advance the real behaviour of an anomalous situation before it happens. To overcome this concern, the use of one-class classifiers is a common approach (Tax, 2001). This consists of using different techniques to model the dataset behaviour, where the dataset is commonly comprised of instances registered during normal operation. Then, test instances that differ from the implemented model are considered anomalies, since they do not belong to the so-called target class (Rodríguez-Ruiz et al., 2020; Tax, 2001). A critical issue related to the modelling of the target set is the selection of a decision boundary between what is considered anomalous and what is not. Another typical issue that must be tackled is the presence of noise in the initial dataset.
In this case, instances that might be considered as normal, could present noise, especially in industrial environments (Jove et al., 2019).
Although the one-class approach can deal with the problems detailed before, there are several circumstances where one-class techniques suffer a decrease in performance. In many systems, especially in industrial processes, the target set, understood as the set of points that belong to correct operation, can be scattered across different groups. This situation is the result of considering different working points, and it is especially concerning when the one-class classifier is implemented using boundary methods to model the target set shape. In these cases, the appearance of anomalies between the different target sub-classes can lead to misclassification. To illustrate the motivation of the work, an example of this undesirable situation is depicted in Figures 1 and 2, where the green dots belong to the target class and the yellow triangle is an anomaly. The consideration of a unique class, represented by a blue boundary, would lead to considering the anomaly as target class. Otherwise, selecting a proper number of sub-classes and applying a one-class technique over each one would avoid the misclassification. It is important to remark that these figures represent the need of finding data groups in the training set; they are not a description of how all one-class classifiers operate.
This issue, added to the idea of contributing to the effective application of artificial intelligence to perform anomaly detection in the smart industry, is the main motivation for this work. Hence, the present research work proposes the combination of clustering algorithms and one-class techniques to carry out the anomaly detection in two industrial plants with a wide range of operating points.
To validate the proposal, two real applications are considered. The first one consists of an industrial system whose main goal is to control the liquid level in a tank. As the idea is to take into account different working points, several tank levels are considered as correct plant behaviour.
FIGURE 1 Anomaly detection using a one-class approach

FIGURE 2 Anomaly detection using sub-classes

The second plant has the purpose of manufacturing wind generator blades made of carbon fibre material, obtained by mixing a resin and a catalyst. In this case, the different flows needed to make the blade result in different operating points.
Eight one-class techniques have been applied to model the normal operation of both plants but, prior to the classification phase, a clustering procedure is followed. The classifier is tested using real anomalies in the first set, generated by sudden changes in the valve operating status. In the second set, an artificial anomaly generation procedure is followed.
As the registered data can only reflect normal operation of the process, one-class techniques are suitable to accomplish anomaly classification (Tax, 2001). Then, the initial datasets are structured in normal and abnormal sets, taking into consideration only the normal data to carry out the classifier implementation.
The clustering technique used is DBSCAN, which is combined with eight one-class techniques: NCBoP, Autoencoder, Gauss, K-Centers, Minimum Spanning Trees, Parzen Density Estimator, Principal Component Analysis and Support Vector Data Description. The proposal is trained, tested and validated, achieving successful results in the anomaly detection task. This research work is structured as follows: Section 2 explains the case of study and Section 3 details the proposal to carry out the anomaly detection. Then, Section 4 describes the techniques used to perform clustering and classification. Sections 5 and 6 detail the experiments and results, respectively, and, finally, Section 7 exposes the conclusions and future works.

| CASE OF STUDY
The present section describes the industrial plants used to evaluate the performance of the proposal and the datasets used to implement and validate the classifier proposal.

| General description
As stated in the introduction section, this research proposes a one-class anomaly detection classifier combined with a clustering technique. A three-dimensional replication of the first plant where the proposal is applied is depicted in Figure 3. The goal of the plant is to control the water level of an objective tank by means of a feeding pipe through which the liquid is boosted from a lower tank. The actuator group consists of a three-phase motor coupled to a SACI K5T rotary pump and powered by an Altivar312-Schneider Variable Frequency Drive. Furthermore, the objective tank presents two different discharge valves, one manual and one proportional, the latter of which will be used to generate anomalies. All pipes have a 1 in diameter.
The tank level is measured using a Banner™ S18UUA ultrasonic sensor, whose working principle consists of sending an ultrasonic wave to the liquid surface which, when reflected, travels back to the sensor. The distance can then be calculated by measuring the elapsed time.
FIGURE 3 Scheme of the control level plant

As shown in Figure 4, the control loop is implemented using MATLAB software, where the process value is the level percentage of the objective tank and the control signal represents the pump speed. To tackle the nonlinearity of the plant, it is identified on-line as a second-order transfer function by means of the Recursive Least Squares (RLS) algorithm, following Equation (1) (Calvo-Rolle et al., 2014), where b0 is the open loop gain, k is the system delay and a1, a2 are the first- and second-order coefficients, respectively.
After this calculation, an adaptive PID is tuned to face the nonlinear behaviour according to Equation (2), where Tc is the critical period.

| Description of dataset
The dataset used for validating the proposal has been divided into what is considered correct operation and anomalous situations. An example of the different sets taken into account and how they are labelled is shown in Figure 5. In this case, three different tank levels with three output valve configurations are considered as correct operation. Hence, the target class presents several working points, which is one of the main weaknesses of some one-class techniques. On the other hand, six other plant operations are considered anomalous. These consist of opening and closing the electric draining valve. Each set is comprised of 5400 instances, for a total of 16,200 target class samples and 32,400 anomalies.

FIGURE 4 Conceptual representation of the control loop
Furthermore, five different features are taken into consideration. These features, recorded with a sample rate of 1 Hz, contain information about the measured tank level, the PID coefficients c0, c1 and c2, and the control signal. Then, each one of these five variables is configured as a classifier input to determine the state of the plant.

| General description
During the fabrication of a turbine blade, the shape of the blade is created by combining a resin with a reinforcement of fibre glass or carbon. In order to produce such a resin, an industrial machine is necessary to mix two primary fluids, the epoxy and the catalyst, to obtain a final product which presents high tensile and compressive strengths and great chemical resistance (Bank et al., 2018; Mishnaevsky et al., 2017). This industrial mixing machine is schematized in Figure 6.
As can be seen in Figure 6, the system is monitored by means of nine sensors. Each fluid line has one pressure sensor at the output of the pumps (P E1 , P C1 ) and another at the input of the mixing valve (P E2 , P C2 ) and also one flow meter per line (F E2 , F C2 ). Each pump has a speed sensor (S E , S C ) and finally the flow of mixed material at the output line is also measured (F M ).
The workflow is as follows: 1. Both fluids are stored in separate tanks and sent to the mixing valve by centrifugal pumps controlled through three-phase variable frequency drives (VFD). To visualize how the variables evolve in time, four of the measures during 8 min are presented in Figure 7.

FIGURE 5 Dataset distribution of target class and anomaly classes
Due to the fact that all data samples represent correct working points, the faults have been obtained according to the procedure shown in Figure 8, where j anomalies are generated by modifying by ±p% one randomly selected variable in each of j random samples from an M×N dimensional dataset.
From the initial dataset, 1708 samples (20% of the instances) are subjected to the anomaly generation method, modifying by ±15% one of the nine original variables.
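The anomaly generation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the function name, the random generator and the synthetic dataset dimensions (matching the 8540-sample, nine-variable wind turbine case) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_anomalies(X, fraction=0.20, p=0.15, rng=rng):
    """Perturb one randomly chosen variable of a random subset of rows by +/-p."""
    X_anom = X.copy()
    n_rows = X.shape[0]
    j = int(round(fraction * n_rows))           # number of anomalous samples
    rows = rng.choice(n_rows, size=j, replace=False)
    cols = rng.integers(0, X.shape[1], size=j)  # one variable per chosen sample
    signs = rng.choice([-1.0, 1.0], size=j)     # +p% or -p%
    X_anom[rows, cols] *= (1.0 + signs * p)
    return X_anom, rows

# Hypothetical M x N dataset with the same shape as the wind turbine set
X = rng.normal(loc=10.0, scale=1.0, size=(8540, 9))
X_anom, rows = generate_anomalies(X)
```

With fraction = 0.20 over 8540 samples, exactly 1708 rows are perturbed, matching the figure quoted in the text.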

| CLASSIFIER PROPOSAL
As mentioned in the introduction section, one-class techniques are defined following the idea of grouping the data belonging to normal operation, also known as the target set. However, this definition may result in wrong performance, especially when geometric boundaries are used to establish the limits of the target class. A simple example using a boundary one-class technique in two dimensions is shown in Figure 9. If any anomaly lies in between those clusters, it would be classified as a target object, leading to a misclassification. Hence, it would be desirable to divide the data before applying a one-class technique over each group, as depicted in Figure 10.
From this general idea, the classifier training is separated into two different steps, represented in Figure 11. Once the training is finished and the N one-class classifiers are implemented, the test phase also follows two steps, depicted in Figure 12.
1. In the first step, the test instance is assigned to one of the N clusters determined in the training phase.
2. In a second step, the test instance is sent to the corresponding classifier to label it and determine if it is anomalous.
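The two-step test phase can be sketched in a few lines. This is only a toy illustration of the dispatch logic: the distance-threshold local model stands in for the paper's one-class techniques, and the cluster assignment by nearest centroid is an assumed, simple rule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic target clusters (normal operation at two working points)
c1 = rng.normal([0.0, 0.0], 0.3, size=(200, 2))
c2 = rng.normal([5.0, 5.0], 0.3, size=(200, 2))
clusters = [c1, c2]
centroids = [c.mean(axis=0) for c in clusters]

# Training: one local one-class model per cluster; here each "model" is simply
# a distance threshold around the cluster centroid (a stand-in, not one of the
# eight techniques compared in the paper)
thresholds = [np.linalg.norm(c - m, axis=1).max()
              for c, m in zip(clusters, centroids)]

def classify(x):
    # Step 1: assign the test instance to the nearest cluster
    k = int(np.argmin([np.linalg.norm(x - m) for m in centroids]))
    # Step 2: the corresponding local classifier labels the instance
    is_anomaly = np.linalg.norm(x - centroids[k]) > thresholds[k]
    return k, is_anomaly

# A point lying between the two clusters is anomalous for both local models
cluster_id, anomaly = classify(np.array([2.5, 2.5]))
```

A single global boundary enclosing both clusters would have accepted the in-between point; the per-cluster models reject it.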

| TECHNIQUES TO ACHIEVE THE CLASSIFIER
The techniques used to implement the approach detailed in the previous section are described below.

| Clustering technique DBSCAN
The density-based spatial clustering of applications with noise (DBSCAN) algorithm is considered the first density-based clustering algorithm (Khan et al., 2014). Besides its proven good performance on many datasets, it received the SIGKDD test-of-time award (Schubert et al., 2017), which gives an idea of its relevance.
This well-known clustering technique was designed to work with data of arbitrary shapes that may present noise (Khan et al., 2014). Its main idea is that, for every instance of a cluster, the neighbourhood of a given radius ϵ must contain a minimum amount of instances MinPnts. This implies that the neighbourhood cardinality should be above a threshold (Khan et al., 2014). This neighbourhood Nϵ of a point a is determined by Equation (3):

Nϵ(a) = {b ∈ D | dist(a, b) ≤ ϵ}     (3)

where D is the database of instances. To consider a point p a core point, it must have at least MinPnts points inside its neighbourhood, as shown in Equation (4); it is considered non-core otherwise (Khan et al., 2014):

|Nϵ(p)| ≥ MinPnts     (4)
A new cluster is created each time a new point satisfies this condition of having at least MinPnts neighbours. From these core points, more points are searched until no more objects can be included in the cluster (Khan et al., 2014).
The DBSCAN clustering technique presents the main advantage of discovering groups with arbitrary shapes, both linear and nonlinear. It also has the advantage of not having to determine beforehand the number of clusters into which the dataset is divided. Furthermore, it has been widely used with good performance, especially with large datasets (Birant & Kut, 2007).
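The behaviour described above can be illustrated with scikit-learn's DBSCAN implementation; the synthetic three-group dataset and the parameter values below are assumptions for the example, not the ones used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)

# Three groups of operating points plus two isolated noisy readings
g1 = rng.normal([0.0, 0.0], 0.1, size=(100, 2))
g2 = rng.normal([3.0, 0.0], 0.1, size=(100, 2))
g3 = rng.normal([0.0, 3.0], 0.1, size=(100, 2))
noise = np.array([[10.0, 10.0], [-10.0, -10.0]])
X = np.vstack([g1, g2, g3, noise])

# eps plays the role of the radius epsilon, min_samples of MinPnts
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# The label -1 marks noise; the number of clusters is found automatically
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that the number of clusters is not passed in: DBSCAN discovers it from the density structure, which is the property the proposal exploits.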

| Non convex boundary over projections
The Non Convex Boundary over Projections (NCBoP) algorithm is based on the idea of modelling the shape of the target set by means of the calculation of a non-convex hull (Jove et al., 2021). This novel one-class classification algorithm overcomes the weaknesses presented by the well-known Convex Hull over random projections (Casale et al., 2014). The main basis of this method is to approximate the boundaries of a dataset S ⊂ ℝⁿ using the non-convex hull over π random projections on 2D planes and then determine the non-convex limits on each plane, reducing in this way the complexity of calculating the non-convex limits over ℝⁿ.
The NCBoP calculates one non-convex polygon for each of the π 2D random projections. A starting point is selected as the point with the lowest y coordinate among all points projected onto the p_i plane. Then, its k-nearest points are calculated and ordered by polar angle, keeping only the furthest one from the starting point. Once this first pair of points is obtained, a stack structure is created including the next, third point. It is then checked whether the next point in the list turns left (it is pushed onto the stack) or right (the point on the top is removed). This process is repeated until the starting point is reached again. When the training process finishes, all the points lie within the non-convex polygon created by the algorithm. Once the training phase is over, the criterion to determine whether a new test point is anomalous is the following: if the point lies outside at least one of the π projected hulls, it is considered anomalous.

| Autoencoder
The Autoencoder technique is based on the use of an Artificial Neural Network (ANN) to determine the appearance of anomalies. The most common ANN configuration is structured in one input layer, one or more hidden layers and an output layer, whose neurons are connected with weighted links (Sakurada & Yairi, 2014). The main idea of this technique is to reconstruct the input p at the output p̂ by means of a nonlinear dimensional reduction in the hidden layer. Hence, the number of neurons in the hidden layer is lower than the number of inputs.
Once the network is trained using only information from the training set, data with different behaviour is expected to present a significantly different representation in the hidden layer subspace. This implies that an anomalous test instance q should have a large reconstruction error, defined as ‖q − q̂‖. This value is the criterion used to label the data as anomalous (Tax, 2001).
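The reconstruction-error criterion can be sketched with a small bottleneck network. As an assumption for the example, a linear autoencoder is built with scikit-learn's MLPRegressor (identity activation, two hidden neurons) rather than the nonlinear architecture the paper may use; the synthetic data lies near a plane so that the bottleneck suffices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Normal data lies close to the plane x3 = x1 + x2, so a two-neuron
# bottleneck is enough to encode it
x12 = rng.normal(0.0, 1.0, size=(300, 2))
X = np.column_stack([x12, x12.sum(axis=1) + rng.normal(0.0, 0.05, 300)])

# Autoencoder sketch: 3 inputs -> 2-neuron bottleneck -> 3 outputs,
# trained to reconstruct the input at the output
ae = MLPRegressor(hidden_layer_sizes=(2,), activation='identity',
                  solver='lbfgs', max_iter=1000, random_state=0)
ae.fit(X, X)

def reconstruction_error(q):
    q = np.atleast_2d(q)
    return np.linalg.norm(q - ae.predict(q), axis=1)

normal_err = reconstruction_error(X).mean()
# An off-manifold point reconstructs poorly
anomaly_err = reconstruction_error(np.array([0.0, 0.0, 5.0]))[0]
```

The anomaly's reconstruction error clearly exceeds the mean training error, which is what the threshold-based labelling relies on.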

| Gaussian model
A special approach to face anomaly detection using one-class techniques is based on the idea of using density functions. Although more complex topologies based on Gaussians exist (Oza & Patel, 2018), the most direct way to follow this approach consists of fitting a normal or Gaussian distribution function over the target set (Tax, 2001). This function is calculated from the target set, which is the same as the training set.
Once the mean vector and covariance matrix are known, the criterion to determine the anomalous nature of a test sample is based on the value of the Gaussian function at this instance. The simplicity of this idea entails a correspondingly low computational cost, which is a significant advantage in large datasets with normal shapes.
A simplified example of a Gaussian function (blue line) over a one-dimensional set is depicted in Figure 13, where the criterion to determine the anomalous nature of a test instance is given by the threshold (red lines). This value is established during the training phase.
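A minimal sketch of the Gaussian model follows; the 5% density quantile used as threshold and the synthetic two-dimensional target set are assumptions for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Target (training) set: fit a single Gaussian to it
X = rng.normal([1.0, 2.0], [0.5, 0.5], size=(500, 2))
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
model = multivariate_normal(mean=mu, cov=cov)

# Threshold fixed during training: 5% of training densities fall below it
threshold = np.quantile(model.pdf(X), 0.05)

def is_anomaly(x):
    # Low density under the fitted Gaussian -> anomalous
    return model.pdf(x) < threshold

normal_flag = is_anomaly([1.0, 2.0])    # at the centre of the target set
anomaly_flag = is_anomaly([6.0, -3.0])  # far from the target set
```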

| K-centers
The K-centers method aims to obtain the model of the target set by covering all the points with K hyper-spheres, all of them with equal radius.
These hyper-spheres are set by minimising the maximum distance between each instance and its centre (Tax, 2001). This process can be divided into two steps. First, each training instance is assigned to the cluster whose centre is the closest; these centres are randomly selected. Then, the position of each centre is adjusted to minimise the distance between it and the rest of the points of its cluster. To avoid local minima, these two steps are repeated several times and the configuration with the optimum solution is chosen (Japkowicz, 1999).
When the final K-Centers configuration is chosen, a new test instance is labelled as target class data if it lies inside one of the hyper-spheres implemented during the training phase (Japkowicz, 1999).
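The two-step procedure with restarts can be sketched as follows. This is a simplified implementation under assumptions: centres are restricted to training points, and the restart/iteration counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

def k_centers(X, k, n_restarts=10, n_iters=10, rng=rng):
    """Cover X with k equal-radius balls, minimising the largest
    instance-to-centre distance (a simple sketch of the procedure)."""
    best_radius, best_centers = np.inf, None
    for _ in range(n_restarts):                       # repeated to avoid local minima
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = d.argmin(axis=1)                 # step 1: nearest centre
            for j in range(k):                        # step 2: recentre each cluster
                pts = X[assign == j]
                if len(pts):
                    dd = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
                    # member point with the smallest maximum distance
                    centers[j] = pts[dd.max(axis=1).argmin()]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        radius = d.min(axis=1).max()                  # common radius covering X
        if radius < best_radius:
            best_radius, best_centers = radius, centers.copy()
    return best_centers, best_radius

# Two well-separated target groups
X = np.vstack([rng.normal([0.0, 0.0], 0.2, (100, 2)),
               rng.normal([5.0, 5.0], 0.2, (100, 2))])
centers, radius = k_centers(X, k=2)

def is_target(x):
    # Inside any hyper-sphere -> target class
    return bool((np.linalg.norm(centers - np.asarray(x), axis=1) <= radius).any())
```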

| Minimum spanning tree
This one-class method is based on modelling the target set by means of the structure obtained by a Minimum Spanning Tree (MST). It relies on the assumption that two points p_i, p_j ∈ ℝⁿ that belong to the target class should be neighbours in the ℝⁿ representation (Juszczak et al., 2009). Then, a linear transformation can be found for these points and for all points considered target class. As this set commonly contains more than two objects, more than one transformation can be considered. Hence, for a dataset D with n instances, (n − 1) linear transformations can be found (Juszczak et al., 2009).
Then, the MST consists of a set of edges e_ij that specify the linear transformations of each point. A graph is implemented ensuring the absence of loops and minimising the total length of the edges. Once the MST is trained with data from the target set, the distance between a test object and its nearest edge projection is the criterion to determine an anomalous situation (Juszczak et al., 2009).
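A sketch of this criterion follows, building the MST with SciPy and measuring the distance of a test point to its projection on the nearest edge. The brute-force edge search and the one-dimensional synthetic target set are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)

# Target set roughly along a line segment in the plane
t = rng.uniform(0.0, 1.0, 80)
X = np.column_stack([t, t]) + rng.normal(0.0, 0.02, (80, 2))

# MST over the complete distance graph of the training set
mst = minimum_spanning_tree(cdist(X, X)).tocoo()
edges = list(zip(mst.row, mst.col))   # exactly (n - 1) edges

def point_to_segment(q, a, b):
    """Distance from q to its projection on segment ab."""
    ab = b - a
    s = np.clip(np.dot(q - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(q - (a + s * ab))

def mst_distance(q):
    # Distance to the nearest edge projection: the anomaly criterion
    return min(point_to_segment(q, X[i], X[j]) for i, j in edges)

near = mst_distance(np.array([0.5, 0.5]))   # close to the tree
far = mst_distance(np.array([0.0, 1.0]))    # far from every edge
```

Thresholding `mst_distance` then separates points that follow the target structure from those that do not.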

| Parzen density estimator
The idea previously mentioned in the Gaussian model can also be applied using the non-parametric Parzen density estimator (PDE) (Parzen, 1962), whose performance has proven successful on many UCI repository datasets (Casale et al., 2014). Although it is based on the idea of using density functions, it presents the advantage of performing well even when the data do not have a Gaussian shape (Mazhelis, 2006). This method can be considered a mixture of Gaussian functions centred on the individual training instances, with diagonal covariance matrices, where the optimal width of the kernel is adjusted during training using a maximum likelihood solution (Tax, 2001). This technique shows better results as the training set size grows (Cohen et al., 2008).
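A Parzen estimator can be sketched with scikit-learn's KernelDensity; the fixed bandwidth and the 5% log-density quantile threshold are assumptions for the example (the paper tunes the kernel width by maximum likelihood).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(7)

# Bimodal, clearly non-Gaussian target set: a single Gaussian would fail here
X = np.vstack([rng.normal([0.0, 0.0], 0.3, (200, 2)),
               rng.normal([4.0, 0.0], 0.3, (200, 2))])

# Gaussian kernel centred on every training instance
pde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(X)

# Threshold on the 5th percentile of the training log-densities
threshold = np.quantile(pde.score_samples(X), 0.05)

def is_anomaly(x):
    return pde.score_samples(np.atleast_2d(x))[0] < threshold

mode_flag = is_anomaly([4.0, 0.0])     # inside one of the modes
between_flag = is_anomaly([2.0, 0.0])  # low-density gap between the modes
```

Unlike the single-Gaussian model, the mixture assigns low density to the gap between the two modes, so the in-between point is flagged.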

| Principal component analysis
The Principal Component Analysis (PCA) technique has been commonly used for dimensional reduction problems (Chiang et al., 2000; Wu & Zhang, 2001). It is based on the concept of calculating the Principal Components of the training set, which represent the directions along which the data have the greatest variability. These vectors are the eigenvectors of the covariance matrix, which follows a relatively simple calculation. Then, using these principal components, the dataset can be projected onto a subspace of lower dimension. This dimensional reduction process can be exploited, as happens with the Autoencoder configuration, by means of the reconstruction error, which is computed as the distance between the original and the projected data.
Hence, the number of eigenvectors, known as components, can be at most the same as the number of variables. The criterion followed to decide whether a test instance belongs to the target class is based on the reconstruction error, computed as the difference between the original point and the point projected onto the subspace. This technique offers good results when the subspace is clearly linear (Tax, 2001). In the example of Figure 14, when a new test sample (red dot) presents a greater distance to the first component projection than the distances of the training set, it is considered anomalous.
FIGURE 13 Example of Gaussian model for one-class in ℝ¹
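The PCA reconstruction-error criterion can be sketched as follows; the one-component model, the 95th-percentile threshold and the synthetic data close to a line are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)

# Target data close to the one-dimensional subspace y = 2x
t = rng.normal(0.0, 1.0, 300)
X = np.column_stack([t, 2 * t]) + rng.normal(0.0, 0.05, (300, 2))

pca = PCA(n_components=1).fit(X)

def reconstruction_error(q):
    # Distance between the original point and its projection on the subspace
    q = np.atleast_2d(q)
    return np.linalg.norm(q - pca.inverse_transform(pca.transform(q)), axis=1)

# Threshold taken from the training reconstruction errors
threshold = np.quantile(reconstruction_error(X), 0.95)

on_subspace = reconstruction_error([[1.0, 2.0]])[0]    # near the component
off_subspace = reconstruction_error([[2.0, -1.0]])[0]  # far from the component
```

The point lying on the learned subspace falls below the threshold; the off-subspace point exceeds it and would be labelled anomalous.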

| Support vector data description
The Support Vector Data Description (SVDD) technique was created in (Tax, 2001) as a boundary one-class technique. This method is derived from the well-known Support Vector Machine (SVM) (Miao et al., 2018), a supervised algorithm whose main procedure consists of mapping the training data onto a high dimensional feature space by means of a kernel function; then, a hyper-plane is constructed to perform classification (Rebentrost et al., 2014).
The SVDD technique follows an analogous approach but, in this case, instead of a hyper-plane, a hyper-sphere is implemented to delimit the target class shape (Sanchez-Hernandez et al., 2007). The radius and the centre are tuned during the training phase, and anomalous points should then lie outside the implemented hyper-volume. Although this technique presents successful performance over a wide range of sets and applications, it requires a greater computational effort compared with most of the previously detailed techniques.
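As a sketch, scikit-learn's One-Class SVM with an RBF kernel is used here as a close relative of SVDD (with this kernel, both enclose the target class with a sphere in feature space); the gamma and nu values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(9)

# Target class: a single compact cloud
X = rng.normal([0.0, 0.0], 0.5, size=(400, 2))

# nu bounds the fraction of training points left outside the boundary
svdd = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.05).fit(X)

# decision_function: positive inside the learned boundary, negative outside
inlier_score = svdd.decision_function([[0.0, 0.0]])[0]
outlier_score = svdd.decision_function([[5.0, 5.0]])[0]
```

Points far from the target cloud receive negative scores and the label -1, reflecting the "outside the hyper-volume" criterion described above.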

| EXPERIMENTS
As the main proposal of this research is to improve the performance of one-class classifiers over systems with different operating points, a comparative analysis of eight different one-class techniques combined with a clustering stage is carried out. Furthermore, the achieved classifiers are compared with the ones obtained without the prior clustering stage. The different hyper-parameters tested for each classification technique are shown in Table 1. Besides all the hyper-parameters tested for each technique, the data is considered under three different configurations: • Type 0: Data without pre-processing.
• Type 1: Each variable is normalized using a 0 to 1 scaling range (see Equation (5), where min and max are the minimum and maximum values registered in a variable).
• Type 2: Each variable is normalized using a Z-Score conversion (see Equation (6), where μ is the mean and σ is the standard deviation of a variable).
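The two normalizations of Equations (5) and (6) can be written directly in vectorized form; the synthetic three-variable matrix is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.uniform(5.0, 50.0, size=(100, 3))   # three raw process variables

# Type 1: 0-1 min-max scaling, Equation (5)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Type 2: Z-score conversion, Equation (6)
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```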
• Initially, with 90% of the data, a k-fold cross validation method is followed with k = 5, as depicted in Figure 15. This procedure consists of randomly dividing the target set into five different groups. Then, five classifiers are implemented, each using 80% of the data and leaving 20% for the test phase. The use of five folds ensures that all instances are considered for both the training and test phases. Then, each classifier is tested using the non-target set and 20% of the target data.
FIGURE 14 Example of PCA for one-class tasks in ℝ²

• Once the best configuration for each cluster is known, the classifiers are trained with the 90% of the data left from the previous phase. The performance of each technique is validated using the 10% of the data that was not used to train or test the classifier. This procedure is depicted in Figure 16.
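The 90/10 split with 5-fold cross validation on the 90% can be sketched with scikit-learn; the placeholder dataset and random seeds are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder target-class instances

# Hold out 10% for the final validation phase
X_dev, X_val = train_test_split(X, test_size=0.10, random_state=0)

# 5-fold cross validation over the remaining 90%: each fold trains on 80%
# of X_dev and tests on the other 20%, so every instance plays both roles
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev))
```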
The performance of each classifier configuration is evaluated using the well-known Area Under the Receiver Operating Characteristic Curve (AUC) measure (Fawcett, 2006). The AUC combines the true positive and false positive rates, obtaining a unique measure of the classifier performance. From a statistical point of view, this value represents the probability of ranking a random positive instance above a random negative one (Fawcett, 2006).
Furthermore, in contrast to other measures like sensitivity, precision or recall, AUC is not sensitive to class distribution, which is a significant advantage especially in one-class tasks (Bradley, 1997).
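Both properties can be checked in a few lines with scikit-learn; the toy labels and scores are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = target class, 0 = anomaly; scores are the classifier outputs
y_true = np.array([1, 1, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.1])  # perfectly separated
auc_perfect = roc_auc_score(y_true, scores)

# AUC is insensitive to class distribution: duplicating the anomaly
# instances leaves the value unchanged
auc_imbalanced = roc_auc_score(np.concatenate([y_true, [0, 0]]),
                               np.concatenate([scores, [0.4, 0.1]]))
```

Perfect separation yields an AUC of 1.0 in both the balanced and the imbalanced case, illustrating why AUC suits one-class tasks.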

| Control level plant
An initial approach without a clustering process is followed to evaluate the importance of data grouping. The results of these experiments are presented in Table 2. Then, a clustering stage is applied using DBSCAN, resulting in three different clusters. Hence, the highest AUC achieved by each technique and its corresponding configuration for clusters 1, 2 and 3 are represented in Tables 3-5. The NCBoP configuration is accepted for the first cluster. In case of critical sampling times, other techniques such as SVDD, K-Centers or MST could be considered.
With respect to cluster 2, MST is the algorithm that achieves the greatest performance. This technique is significantly better than the rest, with a difference of at least 1.5% compared to the other techniques.
Finally, in the third cluster, almost all techniques achieve their best results. Although PCA, K-Centers and MST are tied, the technique selected is PCA, since it presents the lowest computational cost and labelling time. These values correspond to a configuration with a preprocessing stage, one principal component and a training rejection rate of 0.
According to the achieved results, the selected topology is structured using NCBoP for cluster 1, MST for cluster 2 and K-Centers for cluster 3. Then, the data left unused in the training/test phase is used to validate the classifier. The configuration and results for this phase are represented in Table 6.
TABLE 7 Performance without clustering for the wind turbine set. Global classifier

| Wind turbine plant
The results achieved without the use of DBSCAN are presented in Table 7. They are compared with the ones obtained with the prior clustering process, which organized the dataset into 14 different groups. Then, taking into account that eight different techniques are applied to each cluster, 112 classifiers are implemented. To summarize the results, the technique with the greatest AUC for each cluster is represented in Table 8. The best results using DBSCAN always outperform the results achieved by the global classifier. Furthermore, it is important to emphasize that MST outperforms the rest of the classifiers in 9 out of 14 clusters. K-Centers and NCBoP achieve the best results in two clusters each, and Gauss in one.
Using the selected topology of Table 8, the data unused in the training/test stage is used to validate the classifier. The configuration and results of this validation are shown in Table 9.

| CONCLUSIONS AND FUTURE WORKS
This work faces the problem of anomaly detection in two different industrial environments where the systems work at different operating points.
The proposal is evaluated using two plants with different features, considering both real and artificial anomalies. The proposal takes advantage of the DBSCAN clustering algorithm to divide the target set prior to the classifier implementation. This clustering process improves the classifiers' results in terms of AUC and also training times. The results obtained reveal that the normal operation is clustered into three different groups for the first plant and 14 clusters for the second one.
Focusing on the control level plant, the final topology leads to significantly high classification rates, with validation results of 95.88% in cluster 1, 96.78% in cluster 2 and 96.42% in cluster 3. Regarding the wind turbine plant, the validation results are above 90% in all clusters, which is a significantly good performance.
This approach can be used to improve classification when the target set is scattered across different clusters or groups. Using this idea, an early detection of anomalous situations becomes more feasible, increasing system optimisation. This proposal could also help to avoid the propagation of wrong sensor measurements or actuator failures, reducing corrective maintenance costs and assisting the predictive maintenance schedule.
An interesting research line to continue this work could consist of training the classifier on-line. This would have the strength of training the anomaly detection system during its operation, with the possibility of learning its evolution. However, this idea has the weakness of expanding the target class boundaries, running the risk of assuming that anomalies are normal operation points.
Finally, different anomalies in other plant components, such as sensors and actuators, could be considered to implement a distributed topology. Training the classifier with prior knowledge of the generated anomalies would help to identify the source of the wrong performance and isolate that part of the facility.