Early failure detection of paper manufacturing machinery using nearest neighbor‐based feature extraction

In a paper manufacturing system, it is substantially important to detect machine failure before it occurs and take necessary maintenance actions to prevent an unexpected breakdown of the system. Multiple sensor data collected from a machine provide useful information on the system's health condition. However, it is hard to predict the system condition ahead of time due to the lack of clear ominous signs of future failures, the rare occurrence of failure events, and a wide range of sensor signals which might be correlated with each other. We present two versions of feature extraction techniques based on the nearest neighbor, combined with machine learning algorithms, to detect a failure of the paper manufacturing machinery earlier than its occurrence from the multistream system monitoring data. First, for each sensor stream, the time series data are transformed into a binary form by extracting the class label of the nearest neighbor. We feed these transformed features into a decision tree classifier for failure classification. Second, expanding the idea, the relative distance to the local nearest neighbor is measured, which results in real-valued features, and a support vector machine is used as the classifier. Our proposed algorithms are applied to the dataset provided by the Institute of Industrial and Systems Engineers 2019 data competition, and the results show better performance than other state-of-the-art machine learning techniques.

tus. These sensors generate large amounts of multistream measurements. For instance, the motivating dataset for this research contains system monitoring measurements captured by 61 different sensors located in a paper manufacturing machinery. 2 These raw measurements ought to be processed and analyzed appropriately to obtain useful information regarding the system's health condition. The general purpose of this article is to develop a practical pipeline to process, analyze, and interpret the system monitoring data, given in the form of multistream time series (MSTS), to detect a system failure that may occur in the near future.
One challenging problem that we aim to resolve through this project is that the machine failure has to be prognosed ahead of its physical occurrence. Traditional system monitoring tools, such as control chart-based quality control techniques, focus on detecting the assignable causes of an abnormal system status as soon after it occurs as possible. As such, the average run length has been used as the main performance metric for comparing various types of control charts. 3 In the paper manufacturing process, however, once the machine failure occurs, the system instantly stops, and there is no benefit in detecting the failure afterward. Therefore, it is important to perceive any symptomatic signal preceding a machine breakdown even a few seconds earlier. To achieve this goal, we define our problem as a binary classification task where we aim to distinguish precursory signs from normal signals. This problem definition motivated us to use the terminology "MSTS" rather than "multivariate time series," as multivariate data implies multiple responses in the statistical literature.
Other difficulties for this task may be attributed to the multistream nature of the given data and a lack of failure-labeled observations. Although several feature-based classification algorithms exist for time series data, it is problematic to generate and select a proper set of features from high-dimensional multistream data. As an alternative, deep learning techniques are emerging as competent tools for handling such data. 4,5 However, these techniques require a substantially large amount of data, which is not the case for our problem: the dataset only includes 124 machine breakdown points among more than 18 000 time points, so the failure-labeled data points constitute only 0.67% of the whole dataset. Such an extremely imbalanced dataset makes it even harder to build a model with high performance since we do not have enough labeled data to train the model.
To solve the aforementioned problems, we rely on machine learning algorithms, which have been recognized as more powerful techniques for predictive tasks than traditional approaches that do not incorporate them, 5 with properly processed variables and informative features. Specifically, for each sensor or variable, we transform the time series instance into a scalar extracted from its nearest neighbor and feed the transformed variables into a proper off-the-shelf machine learning algorithm to make a classification. The nearest neighbor-based algorithm has been recognized as one of the most effective classification methods for time series data. 6 In this article, we exploit the advantages of the 1-nearest neighbor (1-NN) but extend the method to MSTS data. The objective of these algorithms is to extract suitable features for MSTS classification. First, we extract the class label of the nearest neighbor considering only a single variable, which results in a binary feature for each variable. Second, the relative distance to the nearest neighbor is measured, which is anticipated to provide more useful information on an instance's nearest neighbor. In this research, we demonstrate how to predict the paper machine failure before it occurs (ie, early detection) and how to find the variables which have a significant effect on causing failures using these nearest neighbor-based features.
The rest of this article is organized as follows. Section 2 reviews related work in time series classification. Section 3 shows the overall implementation procedure and describes the dataset, preprocessing, and the two versions of algorithms we propose in this article. In Section 4, we evaluate the performance of the proposed algorithms with the real-world dataset of paper manufacturing sensor signals. Finally, we conclude our research in Section 5.

RELATED WORK
A wide range of algorithms have been used and proposed to solve classification problems with univariate time series data. Sykacek and Roberts 7 propose an approach with a latent feature representation by applying Bayesian theory to hierarchical time series processing. Esmael et al 8 suggest a hybrid approach to improve the accuracy of time series classifiers with hidden Markov models. Jović et al 9 examine the capability of four common decision tree ensembles on biomedical time series datasets. Eads et al 10 employ a support vector machine (SVM) for time series classification with features extracted from the time series data. Cui et al 11 demonstrate convolutional neural networks for time series classification problems, incorporating feature extraction and classification in a single framework. These algorithms have been employed as a single classifier or as a combination of multiple methods, sometimes called an ensemble, to improve classification performance. 12 Although ensemble-based classifiers are known as prominent algorithms for time series classification tasks, 13 they require much computation for training, which may not be suitable for large datasets. Meanwhile, Tan et al 14 describe that the nearest neighbor classifier based on the Euclidean distance is a fast and promising classification algorithm when it comes to big datasets. Recently, MSTS data has gained great attention, and many researchers have proposed new methods to solve multistream-based problems. Orsenigo and Vercellis 15 describe a classification method based on a temporal extension of discrete SVMs with the notions of warping distance and softened variable margin in the set of multivariate input sequences. Weng and Shen 16 implement a new approach for MSTS classification in which the eigenvectors of row-row and column-column covariance matrices of MSTS samples are calculated to extract features, and a 1-NN classifier is used for the classification.
The authors show that distance-based methods with 1-NNs are an effective way to classify MSTS. Other algorithms have also been used to deal with MSTS. Zhang et al 17 address the challenges of MSTS data by presenting a real-time multiple profiles sensor-based process monitoring system.
Feature extraction is considered one of the popular techniques for MSTS classification. Rodríguez and Alonso 18 use the boosting algorithm to generate new features, and an SVM is applied with these metafeatures. Kadous and Sammut 19 seek to generate classifiers that are comprehensible and accurate with metafeatures; the authors describe applications to sign language recognition and electrocardiogram signal classification. Li et al 20 suggest feature vector selection approaches for MSTS classification using singular value decomposition.
Profile monitoring techniques using the principal component analysis (PCA) method are another way to manage MSTS. Kim et al 21 develop a method to detect profile changes of multistream tonnage signals for forging process monitoring and to classify fault patterns, while Chang and Yadama 22 propose a statistical process control framework to monitor nonlinear profiles to identify mean shifts in a profile with discrete wavelet transformation and B-splines. Paynabar et al 23 suggest a multiway extension of the PCA technique to classify multistream profile data. Grasso et al 24 suggest multiway PCA to deal with the reduction of data dimensionality and the fusion of all the sensor outputs; their article carries out two main multiway extensions of traditional PCA to handle MSTS.
Deep learning has provided prominent results for this application with the popularity of neural networks. Zheng et al 25 propose a deep learning framework for MSTS classification using features extracted by a 1-NN with dynamic time warping (DTW). Karim et al 4 utilize the long short-term memory fully convolutional network (LSTM-FCN) and attention LSTM-FCN for MSTS classification. Wang et al 5 utilize a recurrent neural network and an adaptive differential evolution algorithm for the same task. Despite the popularity of deep learning, this technique requires a high volume of data and is not suitable for our problem due to a lack of labeled data.
An imbalanced classification problem, where the distribution of class labels is severely skewed, needs to be managed carefully because learning algorithms perform poorly in the presence of underrepresented data. This is because most algorithms assume that the class distribution of the dataset is balanced. 26 Sampling methods, which consist of oversampling and undersampling techniques, are commonly used to improve classifier accuracy by providing a balanced distribution. 27 The cost-sensitive method is an alternative for the imbalanced learning problem that uses different cost matrices outlining the cost of misclassifying data instances. 28 However, failures in the paper machine occur so rarely that traditional techniques have difficulty training models effectively. Active learning is one of the most prominent methods applied to handle extremely imbalanced data. To deal with highly imbalanced classes, Attenberg et al 29 propose guided learning, an alternative technique where the agent queries humans to find training examples representing the different classes. Kazerouni et al 30 suggest an active learning algorithm to learn a binary classifier on a highly imbalanced dataset where most data have negative labels with a very small number of positive ones; their hybrid active learning leverages an explore-exploit trade-off to improve on margin sampling. Moreover, active learning techniques have been combined with state-of-the-art deep learning techniques to improve performance. Fang et al 31 reformulate active learning as a reinforcement learning problem where the policy plays the role of the active learning heuristic; an agent in the environment tries to find the data to be labeled in a validation set based on the deep Q-network. Haussmann et al, 32 however, choose a deep Bayesian neural network for both the base predictor and the policy network to effectively incorporate the input distribution.

Dataset description
The dataset was provided by the Institute of Industrial and Systems Engineers (IISE) 2019 data competition, which recorded real sensor observations from a paper manufacturing process. 2 Many different types of data are collected over a period of time using a variety of sensors located on the machines. Some sensors measure raw materials (eg, amount of pulp fiber, chemicals, and so on) and the others represent process variables (eg, blade type, couch vacuum, rotor speed, and so on). Overall, 61 different sensor signals are collected, and 1 month of monitoring data is recorded every 2 minutes for a paper manufacturing machine, which results in a dataset of 61 streaming signals at 18 398 time points. In addition, for each time point, the system condition (ie, normal or break) has been recorded in a binary response variable. Despite such a large number of measurements, failures only occur at 124 time points (0.67% of total observations) during operation, and this rare-event characteristic makes it hard to predict a failure before it occurs. Table 1 summarizes the dataset. A data-driven approach is used for this problem instead of incorporating physical models since no sensor specifications or domain knowledge were given.
Predicting failures for a pulp-and-paper mill is critical because a break has a significant impact on the entire process. Even though paper breaks rarely take place during operation, a single failure causes a significant loss of time and labor for identifying the cause of the failure and replacing any broken parts. Once the machine fails, the entire process must be stopped, since the operation needs to be halted until the problem is found and fixed. This maintenance procedure can take more than an hour and incur a substantial cost. This implies that even a small reduction in failures through early detection could yield significant cost savings for industries.

Procedure
The overall procedure of the proposed algorithms in this article is presented in Figure 1, consisting of preprocessing, the class label of the local nearest neighbor (CL-LNN), and the relative distance of the local nearest neighbor (RD-LNN) with corresponding machine learning techniques. The original MSTS dataset is preprocessed before carrying out the two types of feature extraction methods, and the resulting features are fed into a decision tree or an SVM, based on the extracted data types, for early failure detection. More detailed information is described in the following sections.

Data preprocessing
The MSTS data obtained from the paper manufacturing machinery are given with a system condition for each time point of measurement (ie, c_t = 0 for normal and c_t = 1 for break). This sensor information is preprocessed to implement the classification algorithms. First, the entire data needs to be split into training and test datasets before data standardization is conducted for each variable, since the test dataset should be unknown during the modeling. For the experiments in Section 4, we divide the whole dataset into 90% for training and 10% for testing. The training dataset is standardized first, and then the mean and SD from the training dataset are applied to the standardization of the test dataset. For standardization, each measurement is scaled by subtracting the corresponding mean and then dividing by the SD so that the mean becomes 0 and the SD 1, as follows:

s_{t,j} ← (s_{t,j} − mean(s_j)) / std(s_j),
where the notation ← indicates that the variable on the left-hand side is replaced with the new value on the right-hand side, and mean(s_j) and std(s_j) are the mean and SD, respectively, of the original measurement data from the jth sensor. Standardization scales the data to mean 0 and SD 1, which usually improves algorithm performance. The derivative is then applied to sense sudden changes in the sensor signals. The derivative of a time series is the difference between neighboring points in one dimension. That is,

s′_{t,j} = s_{t,j} − s_{t−1,j},   s″_{t,j} = s′_{t,j} − s′_{t−1,j},

where s′_{t,j} and s″_{t,j} represent the first and second derivatives of s_{t,j}, respectively. The first derivative relates to gradual changes in the time series, which may not be sensitive to a sudden machine breakdown, while the second derivative is more useful for detecting sharp changes in the streaming signals. For the rest of this article, we use the second derivative to seize precursors of imminent failure.
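As an illustration, standardization with training-set statistics followed by second-order differencing can be sketched in Python as below; the function and variable names are our own, and `numpy.diff` with `n=2` stands in for the repeated differencing described above.

```python
import numpy as np

def preprocess(train, test):
    """Standardize with training statistics, then take second derivatives.

    train, test: 2-D arrays of shape (time points, sensors).
    A minimal sketch; names are illustrative, not from the original code.
    """
    mean = train.mean(axis=0)          # mean(s_j) per sensor
    std = train.std(axis=0)            # std(s_j) per sensor
    train_z = (train - mean) / std     # training statistics only,
    test_z = (test - mean) / std       # also applied to the test split

    # First derivative: difference of neighboring time points;
    # second derivative: difference of the first derivatives.
    train_dd = np.diff(train_z, n=2, axis=0)
    test_dd = np.diff(test_z, n=2, axis=0)
    return train_dd, test_dd
```

Note that the test split is never used to estimate the scaling statistics, matching the leakage-avoidance argument above.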
In this project, we aim to detect the failure before it occurs. One simple way to achieve this goal is to use the class label of k time points ahead as the current instance's class label, so that classifiers learn to predict c_{t+k}, the system condition at k time units ahead. 2 We set k = 1, which implies that we build a model to detect a failure 2 minutes earlier than its occurrence. Figure 2 depicts this process.
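The label shift can be sketched as a two-line helper (a minimal illustration; the names are ours, not from the original implementation):

```python
import numpy as np

def shift_labels(signals, labels, k=1):
    """Align each time point t with the label at t + k so that a classifier
    trained on the features at t learns to predict the condition at t + k
    (k = 1 corresponds to 2 minutes for this dataset).  The last k rows
    have no future label and are dropped."""
    return signals[:-k], labels[k:]
```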
In classification problems with streaming data, temporal sequence data can normally secure more information compared with a data point sampled at a single time step. 33 Accordingly, we extract small fragments of sequences by conducting what we call time window processing. For a given window size m, a window instance consists of the last m sensor measurements up to time t, which correspond to the rows of the MSTS data given in Equation (1) with time indices t − m + 1, … , t. The class label of the window instance is given as c_t so that it represents the system condition at the last time point of the window. These window instances provide features to be used in a machine learning algorithm. In addition, we address the problem of severely imbalanced class labels of the original MSTS data while constructing the window instances by balancing the two labels to some extent. That is, for time window processing, we select all the time points t where c_t = 1 and only randomly select t where c_t = 0, such that the difference between the numbers of the two class labels does not become too large. The constructed window instances and their class labels are given in the following form:
W = [w_{i,j}], i = 1, … , n, j = 1, … , p,   y = (y_1, … , y_n)ᵀ,

where each row of W represents a window instance. That is, w_{i,j} is the sequence of length m of the second derivatives of the jth sensor signal, and y_i is the class label of the ith window instance. Note that the row index i = 1, … , n merely distinguishes window instances and does not necessarily imply a time point. The time-window-processed training dataset (W_train, y_train) and test dataset (W_test) are used as input to Algorithms 1 and 2.
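A sketch of the time window processing with class rebalancing might look as follows. The `neg_ratio` parameter, which controls how many normal windows are kept per break window, is our own assumption, since the text only states that the imbalance is reduced "to some extent."

```python
import numpy as np

def make_windows(s, c, m=5, neg_ratio=3, seed=0):
    """Build window instances of length m per sensor from the
    second-derivative signals s (shape (T, p)), labeled by the condition c
    at the window's last time point.  All break windows are kept; normal
    windows are randomly subsampled.  A minimal sketch."""
    rng = np.random.default_rng(seed)
    t_pos = [t for t in range(m - 1, len(s)) if c[t] == 1]
    t_neg = [t for t in range(m - 1, len(s)) if c[t] == 0]
    n_keep = min(len(t_neg), neg_ratio * max(len(t_pos), 1))
    keep = rng.choice(len(t_neg), size=n_keep, replace=False)
    times = sorted(t_pos + [t_neg[i] for i in keep])
    W = np.stack([s[t - m + 1 : t + 1, :] for t in times])  # (n, m, p)
    y = np.array([c[t] for t in times])
    return W, y
```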

1-NN for time series classification
In the field of data mining and machine learning, one of the most frequently studied problems is classification. 34 The classification process evaluates similarities in a dataset to classify instances into designated classes. One of the differences between time series classification problems and traditional classification problems is that the attributes are arranged in order and input features may be correlated. The 1-NN is a popular classifier for time series classification, as its performance can compete with the most complex classifiers. 6 When a newly observed time series instance arrives, the 1-NN classifier looks for the instance in the training dataset which has the shortest distance to the new instance and predicts the class of the new instance as the class label of that closest instance. A distance measure such as the Euclidean distance is used to compare two time series instances. For one-dimensional time series data, the Euclidean distance between two time series instances w_i and w_k is measured by

D_ED(w_i, w_k) = sqrt( Σ_{t=1}^{m} (w_{i,t} − w_{k,t})² ),   (6)

where w_i and w_k are window instances with t = 1, … , m measurements to be compared with each other. The other renowned distance measure for time series data is DTW, which is a method to find the optimal alignment between two time-dependent sequences. It has been widely used in the field of pattern recognition and broadly tested on benchmark time series data. DTW was originally designed to compare different speech patterns for the purpose of automatic speech recognition, to solve the problem of distortions in the time axis. 35 It makes one time series stretched and realigned to better match the other time series. 36 To find the DTW distance, a matrix M is built where the (t, t′) element is the squared difference between the tth measurement of one sequence and the t′th measurement of the other. Then a warping path p = (p_1, … , p_H) is defined as a monotonically increasing sequence of matrix elements connecting (1, 1) to (m, m). The DTW distance is given by the warping path that has the minimum cumulative distance between the two sequences.
D_DTW(w_i, w_k) = min_p sqrt( Σ_{h=1}^{H} M_h ),

where H is the length of the warping path and M_h is the matrix element corresponding to the hth element of warping path p. 37 Figure 3 depicts how Euclidean matching and DTW matching compare similarities between two time series instances. In brief, the Euclidean distance measures the distance between two signals regardless of their shapes, while DTW measures the distance by taking into account the shapes of the two sequences. However, due to the computational complexity of DTW, distance measurement with DTW may not be suitable for real-time sensor streaming data, in which the nearest neighbor instance must be found quickly. To select the appropriate distance measure between the Euclidean distance and the DTW distance, a separate experiment is conducted to compare their performance, the result of which is shown in Section 4. For our proposed algorithms, the Euclidean distance is used to measure the distance between two time series instances, as the experiment shows that the Euclidean distance requires much less time than DTW without a significant difference in performance between the two methods.
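The two distance measures can be sketched as follows; the DTW routine is a plain O(m²) dynamic program without the band constraints or lower bounds a production implementation would add.

```python
import numpy as np

def euclidean(w1, w2):
    # Equation (6): pointwise distance between two windows of equal length.
    return float(np.sqrt(((w1 - w2) ** 2).sum()))

def dtw(w1, w2):
    """Classic dynamic-programming DTW distance between two 1-D sequences:
    M is the matrix of squared differences, and D accumulates the cheapest
    monotone warping path from (1, 1) to (m, n)."""
    m, n = len(w1), len(w2)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (w1[i - 1] - w2[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[m, n]))
```

Because DTW may match one point to several points of the other sequence, it can report zero distance between sequences the Euclidean measure cannot even compare (different lengths), which illustrates the shape-awareness discussed above.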

Nearest neighbor-based feature extraction

One possible way to extend the 1-NN for single-stream time series data to the case of multistream signals could be to use the sum of the Euclidean distances measured by Equation (6) for each variable to measure the similarity between two multistream window instances. In this case, however, the information of all variables is aggregated, which results in the loss of each variable's information and the relationships between variables. Instead, we look for the nearest neighbor considering each variable only, which we call the local nearest neighbor (ie, the nearest neighbor in an embedded space of a single stream), and extract scalar features from it. These features are fed into different classification algorithms depending on the types of extracted features.
Algorithm 1. CL-LNN feature extraction

1: Input: Multistream window instances W_train for training and W_test for testing, class labels of training instances y_train, index set of training data Train, index set of test data Test
2: Output: Binary feature matrices X_train with elements x_ij, i ∈ Train, and X_test with elements x_ij, i ∈ Test
3: for i ∈ Train do
4:   for j ∈ {1, … , p} do
5:     d* ← L                        ▹ L is a large number used for initialization
6:     for k ∈ Train ∖ i do          ▹ all instances in training data except itself (LOO-CV)
7:       d ← D_ED(w_ij, w_kj)
8:       if d ≤ d* then k* ← k, d* ← d
9:       end if
10:    end for
11:    x_ij ← y_k*                   ▹ store class label of the local nearest neighbor as a feature
12:  end for
13: end for
14: for i ∈ Test do
15:   for j ∈ {1, … , p} do
16:     d* ← L                       ▹ L is a large number used for initialization
17:     for k ∈ Train do             ▹ all instances in training data
18:       d ← D_ED(w_ij, w_kj)
19:       if d ≤ d* then k* ← k, d* ← d
20:       end if
21:     end for
22:     x_ij ← y_k*                  ▹ store class label of the local nearest neighbor as a feature
23:   end for
24: end for

The first feature we propose is the CL-LNN, which is given as 0 or 1 for each variable. Algorithm 1 outlines the procedure of the CL-LNN feature extraction, in which the MSTS data are converted into the binary feature matrices X_train and X_test. The local nearest neighbor is found by leave-one-out cross-validation (LOO-CV) for each variable on the training dataset W_train. That is, for an instance of the training dataset, LOO-CV searches all the other instances in the training dataset except the instance itself and chooses the one that gives the closest match, which is simple but effective for 1-NN. 38

FIGURE 4. Three different cases based on nearest neighbor-based feature extraction
On the other hand, for an instance of the test dataset, the algorithm simply searches for the nearest neighbor in the training dataset. The nearest neighbor is found by computing the Euclidean distance based on the time window for each variable as in Equation (6). Note that, for a given instance, the CL-LNN features for different variables may vary because the nearest neighbor for each variable could be different. These binary features are fed into the decision tree classifier for model training and prediction, which is described in more detail in Section 3.6. The features can keep the original information of each sensor signal by considering each variable separately, and correlations between different variables are expected to be handled by the decision tree algorithm.
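A vectorized sketch of the CL-LNN extraction (training features by leave-one-out search, test features by search over the whole training set) might look like this; the array shapes and names are our own assumptions.

```python
import numpy as np

def cl_lnn(W_train, y_train, W_test):
    """CL-LNN features: for each variable j, replace each instance by the
    class label of its local 1-NN under Euclidean distance.
    W_train, W_test: arrays of shape (n, m, p).  A minimal sketch."""
    n_tr, _, p = W_train.shape
    X_train = np.empty((n_tr, p), dtype=int)
    X_test = np.empty((len(W_test), p), dtype=int)
    for j in range(p):
        A = W_train[:, :, j]
        # Pairwise squared distances between training windows of variable j.
        d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)           # exclude self (LOO-CV)
        X_train[:, j] = y_train[d2.argmin(axis=1)]
        B = W_test[:, :, j]
        d2t = ((B[:, None, :] - A[None, :, :]) ** 2).sum(-1)
        X_test[:, j] = y_train[d2t.argmin(axis=1)]
    return X_train, X_test
```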
Another feature we propose in this article is called the RD-LNN. While the CL-LNN can be thought of as a feature of hard classification, where the outcome is given strictly as 0 or 1, the RD-LNN provides features of soft classification, which can be seen as probability-like features. Although the binary feature extracted by the CL-LNN indicates which class is closest to the instance under consideration, it is not able to measure the degree of significance or strength of the extracted feature. Let us consider the three cases for classifying the label of instances with nearest neighbor-based feature extraction in Figure 4. In the first case, there is a clear decision boundary which makes it easy to separate the two distinct groups, where the CL-LNN might show superior performance. However, outliers in the second example make it more challenging to classify the target instance. Suppose the CL-LNN for a given instance is, say, 1, a break signal. To build a robust prediction model, we may also want to know how reliable and accurate this signal is. In the second case, even if the nearest neighbor is a break signal, this nearest neighbor is an outlier with respect to the majority of the other break signals, so relying solely on the class label of the nearest neighbor may be risky. To compensate for this pitfall of binary features, we may consider measuring the distances from the nearest neighbor to the other instances of the same class. If the distance values are large, the nearest neighbor is thought to be located far from the majority of its class and does not provide reliable information; whereas if the distances are small, the nearest neighbor is thought to represent the group of the same class and the information provided by this instance is more accurate. In lieu of a direct distance measure, we use a probability measure, which is similar to the computation of a P-value in statistical hypothesis testing.
Rare events (ie, breaks) in our dataset, however, appear to be indistinguishable from the others, which makes it ambiguous to differentiate the two groups, as in the third case. In this situation, we found that it is more effective to measure the relative distance for each group, respectively, instead of applying the same nearest neighbor to both groups. Specifically, given an instance whose class label has to be predicted, the Euclidean distances to all the other instances in the training dataset are computed. For each class label (y = 0, y = 1), the nearest neighbors are found. Let d*_0 and d*_1 be the distances to the nearest neighbors with class labels 0 and 1, respectively. We can also find an approximated normal distribution for each class. Let X_0 and X_1 be random variables with these approximated normal distributions. The RD-LNN features are computed as P(X_0 ≤ d*_0) and P(X_1 ≤ d*_1) for each class, which can be interpreted as the probability that an observation is located farther than the nearest neighbor from the center of each class. That is,

P(X_i ≤ d*_i) = Φ((d*_i − m̃_i) / s_i),   i = 0, 1,

where Φ is the cumulative distribution function of the standard normal random variable, and m̃_i and s_i are the median and SD of the distances between the target instance and all the training instances with label i. The smaller the RD-LNN value is, the less reliable the label found by the nearest neighbor is.
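For one class, the RD-LNN feature can be sketched as below; the guard against zero spread is our own addition for degenerate inputs, and the median/IQR-based scaling follows the robustness choice described in the text.

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def rd_lnn_feature(dists):
    """Relative distance of the local nearest neighbor for one class:
    P(X <= d*), with the normal approximation centered at the median of the
    distances and scaled by the SD of the distances inside the
    interquartile range, to limit the influence of outliers."""
    d_star = dists.min()                  # nearest neighbor of this class
    q1, q3 = np.percentile(dists, [25, 75])
    core = dists[(dists >= q1) & (dists <= q3)]
    s = core.std() if core.std() > 0 else 1e-12   # guard against zero spread
    return phi((d_star - np.median(dists)) / s)
```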

Algorithm 2. RD-LNN feature extraction

1: Input: Multistream window instances W_train for training and W_test for testing, class labels of training instances y_train, index set of training data Train, index set of test data Test
2: Output: Numeric feature matrices X_train with elements x⁰_ij and x¹_ij, i ∈ Train, and X_test with elements x⁰_ij and x¹_ij, i ∈ Test
3: for i ∈ Train do
4:   for j ∈ {1, … , p} do
5:     d_0 ← ∅, d_1 ← ∅              ▹ initialize arrays to store distance values
6:     for k ∈ Train ∖ i do          ▹ all instances in training data except itself (LOO-CV)
7:       d ← D_ED(w_ij, w_kj)
8:       if y_k = 0 then append d to d_0
9:       else append d to d_1
10:      end if
11:    end for
12:    d*_0 ← min(d_0)               ▹ distance to the nearest neighbor with label 0
13:    x⁰_ij ← Φ((d*_0 − m̃_0)/s_0)   ▹ extract feature from distances with label 0
14:    d*_1 ← min(d_1)               ▹ distance to the nearest neighbor with label 1
15:    x¹_ij ← Φ((d*_1 − m̃_1)/s_1)   ▹ extract feature from distances with label 1
16:  end for
17: end for
18: for i ∈ Test do
19:   for j ∈ {1, … , p} do
20:     d_0 ← ∅, d_1 ← ∅             ▹ initialize arrays to store distance values
21:     for k ∈ Train do             ▹ all instances in training data
22:       d ← D_ED(w_ij, w_kj)
23:       if y_k = 0 then append d to d_0
24:       else append d to d_1
25:       end if
26:     end for
27:     d*_0 ← min(d_0)              ▹ distance to the nearest neighbor with label 0
28:     x⁰_ij ← Φ((d*_0 − m̃_0)/s_0)  ▹ extract feature from distances with label 0
29:     d*_1 ← min(d_1)              ▹ distance to the nearest neighbor with label 1
30:     x¹_ij ← Φ((d*_1 − m̃_1)/s_1)  ▹ extract feature from distances with label 1
31:   end for
32: end for

Algorithm 2 describes the procedure in which the algorithm generates numeric features by measuring the probability representing the relative position of the nearest neighbor of each class compared with the other instances of the same class label. We found that this feature extraction technique improves classification performance when the features extracted by RD-LNN are fed into the SVM. Note that two features are extracted from each variable, corresponding to the two class labels (y = 0, y = 1). The normal distribution is fitted to the distance data, with the mean set to the sample median and the SD set to the SD of the distances included in the interquartile range (ie, data ranging from the first quartile Q1 to the third quartile Q3), to minimize the effect of outliers.

Training model
Different machine learning techniques are used for CL-LNN and RD-LNN, respectively, to train the model and predict failures 2 minutes in advance, depending on the data type that each algorithm produces. First, the C5.0 decision tree algorithm, an improved version of its predecessor C4.5, is applied to the CL-LNN features for the classification between normal and abnormal conditions. To improve model performance, we implemented adaptive boosting, a process in which many trees are built and the trees vote for the best class; we set the number of boosting iterations to 10. A cost matrix is also employed, assigning a penalty to different types of errors: a cost of 1 is assigned to a false positive and 5 to a false negative, since failing to detect a break is the more expensive mistake. Second, the numeric feature matrix (X_train) from the training dataset (W_train) is fed into the SVM to generate the model, and the other matrix (X_test), produced by RD-LNN from the test dataset (W_test), is used to evaluate the performance of the trained model. To train the SVM, the kernel function, which takes the data as input and transforms it into the required form for training and prediction, is chosen to be radial. The cost parameter, which trades off correct classification of training examples against maximization of the decision function's margin, is set to 1. The gamma parameter, which defines how far the influence of a single training example reaches, is set to 0.5. These parameters were selected heuristically by experiments on our dataset.
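A rough scikit-learn analogue of the two classifiers is sketched below. Note that C5.0 with boosting and a cost matrix is an R-specific implementation, so AdaBoost over cost-weighted decision trees is only an approximation of it, while the SVM mirrors the stated radial kernel, cost = 1, and gamma = 0.5.

```python
# Sketch only: scikit-learn stand-ins, not the original C5.0 setup.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Class weight {0: 1, 1: 5} mimics the cost matrix (false negatives 5x).
tree_clf = AdaBoostClassifier(
    DecisionTreeClassifier(class_weight={0: 1, 1: 5}),
    n_estimators=10,          # 10 boosting iterations, as in the text
)
# Radial kernel, cost C = 1, gamma = 0.5, as stated above.
svm_clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
```

Both objects are then fitted on the extracted feature matrices (`X_train` from CL-LNN for the tree, from RD-LNN for the SVM) and evaluated on `X_test`.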

Performance analysis
We compare our methods with four other approaches, which include two types of artificial neural networks and general machine learning models without the feature extraction techniques we propose in this article. The first method is an Autoencoder, which comprises an encoder and a decoder, for extremely rare event classification. 1 The encoder learns the features of the input data, normally in a reduced dimension, while the decoder regenerates the original data from the encoder output. This method uses a dense-layer Autoencoder, which selects instances at random without considering the correlation among instances. The second approach is an improved version of the first that constructs an LSTM (long short-term memory) Autoencoder, which takes temporal features into account. 2 Both methods also attempt to detect failures 2 minutes in advance with the same dataset we use in this article. In addition, we compare the methods without the feature extraction techniques (ie, decision tree without CL-LNN, SVM without RD-LNN) in order to show the benefit of the proposed algorithms. Table 2 shows the prediction results in the form of confusion matrices to compare the performance of the six methods. As we can see from these results, all six methods appear comparable, and it is hard to tell which method provides better performance. The results also show the trade-off between true positives/negatives and false positives/negatives. RD-LNN, however, shows the lowest number of false positives among the methods. Table 3 provides other metrics to compare the performance of the six methods; four metrics are used to evaluate the performance of the proposed classification algorithms. Precision (also known as the positive predictive value) is defined as the proportion of true positive instances over the total number of instances predicted as positive.
Recall (also known as sensitivity, or the true positive rate) is the number of true positives divided by the number of true positives plus the number of false negatives. The false positive rate (1 - specificity) refers to the probability of falsely rejecting the null hypothesis for a particular test. However, since the distribution of class labels is highly skewed, another performance metric, the F-measure, has been adopted.

¹The implementation of the Autoencoder refers to https://github.com/cran2367/autoencoder_classifier/blob/master/autoencoder_classifier.ipynb
²The implementation of the LSTM Autoencoder refers to https://github.com/cran2367/lstm_autoencoder_classifier/blob/master/lstm_autoencoder_classifier.ipynb

TABLE 2 Prediction results of the six methods in the form of confusion matrices (TN, FN, FP, and TP counts)

TABLE 3 Performance comparison with four metrics used to measure the performance of a rare classification problem

F-measure (also sometimes called the F1 score or F-score) combines precision and recall using the harmonic mean, a type of average suited to rates. Based on the table, RD-LNN shows the best performance in precision, false positive rate, and F-measure among the six methods, while the LSTM Autoencoder performs better only in true positive rate. Note that RD-LNN shows outstanding performance compared with the others in F-measure, which is well suited to represent performance on a highly imbalanced dataset. Another metric used to measure performance is the receiver operating characteristic curve, or ROC curve, which represents the diagnostic ability of a binary classifier. This tool is suitable for visualizing and comparing the performance of our proposed algorithms. The true positive rate (TPR, or sensitivity) is plotted in the ROC curve against the false positive rate (FPR, or 1 - specificity) at different threshold settings to exhibit how well a model distinguishes the classes.
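The four metrics can be computed directly from confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the formulas, not taken from Table 2:

```python
# Hypothetical confusion-matrix counts for a rare-event classifier.
TP, FP, TN, FN = 3, 173, 2726, 22

precision = TP / (TP + FP)                  # positive predictive value
recall = TP / (TP + FN)                     # sensitivity / true positive rate
fpr = FP / (FP + TN)                        # false positive rate = 1 - specificity
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"FPR={fpr:.4f} F-measure={f_measure:.3f}")
```

Because the harmonic mean is pulled toward the smaller of precision and recall, the F-measure punishes a classifier that trades many false positives for a few extra true positives, which is why it suits this imbalanced setting.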
In Figure 5, the ROC curves of the six methods are plotted to compare performance using the area under the ROC curve (AUC), which represents the degree of separability. The LSTM Autoencoder, which considers temporal features, performs better than the Autoencoder, and the AUC of RD-LNN is higher than that of CL-LNN owing to its use of the relative distance to detect failures. The decision tree and SVM without our proposed feature extraction also perform worse than RD-LNN. Overall, RD-LNN attains the largest AUC, 0.724, and we reach the same conclusion that RD-LNN outperforms the other five methods. Figure 6 summarizes the comparison based on F-measure and AUC: the LSTM Autoencoder and RD-LNN appear better than the others in AUC, while RD-LNN is the only method with outstanding performance in F-measure.
An additional experiment was conducted to choose between Euclidean distance and DTW (dynamic time warping) distance, considering that the 1-NN method requires the demanding calculation of distances between the target data point and all points in the training set. In this experiment (Table 4), we found that DTW takes much longer to complete the same task than Euclidean distance, which requires only about 4.6 minutes, while the two show almost the same performance. Euclidean distance performs well relative to DTW here because DTW is particularly suited to applications such as automatic speech recognition, in which speaking speed varies over time, whereas the time series used here is sampled at a fixed interval.

Effects of window size and the number of normal instances
In this subsection, the key parameters that highly influence the performance of RD-LNN are examined. First, the window size m = 20 was determined experimentally, considering F-measure as well as running time, which is also an important factor when the method is deployed in a real-life application. Figure 7 shows how F-measure and running time³ vary with the window size m. F-measure shows a downward trend as the window size increases, while running time grows almost linearly because a larger window demands more computation to estimate the distance. The relationship between F-measure and window size indicates that we need to find the optimal window size to capture the patterns that precede failures. We substitute zero for the F-measure when the algorithm fails to detect any true failure 2 minutes in advance. A window size of 20 offers good performance at a reasonable computational cost, and it is used in the proposed algorithms throughout this article.
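The time-window processing that m controls can be sketched as below. The function name and the toy series are illustrative, not from the article; each row of the result is one instance whose distance to the training windows is measured, so a larger m directly increases the arithmetic per distance computation:

```python
import numpy as np

def to_windows(series, m=20):
    """Slice a 1-D sensor stream into overlapping length-m windows.

    A stream of length T yields T - m + 1 windows; each window is one
    instance for the nearest-neighbor feature extraction.
    """
    T = len(series)
    return np.stack([series[i:i + m] for i in range(T - m + 1)])

x = np.arange(10.0)           # toy stand-in for one (preprocessed) sensor stream
W = to_windows(x, m=4)
print(W.shape)                # 7 windows of length 4
```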

TABLE 4 Comparison between Euclidean and DTW distance
Another parameter we need to determine carefully is the number of normal instances randomly selected for the training dataset. We examined its effect on performance, as depicted in Figure 8. Note that 99 failures are included in the training dataset, so the class distribution between failures and normal instances needs to be balanced to handle the imbalanced dataset. F-measure increases as the number of normal instances for training grows from 100 to 200, and then decreases significantly after 200, while running time keeps rising with the number of normal instances. This indicates that 200 randomly selected normal instances in the training dataset provide better performance than other choices.

Root cause analysis
Root cause analysis is implemented by measuring the importance of each variable, based on the decision tree algorithm, to find the critical variables that cause failures of the paper manufacturing machinery. The variable importance is estimated from the percentage of training dataset samples that fall into all the terminal nodes after a split on that variable. Table 5 lists the resulting importance values. The decision tree, which consists of three types of nodes (ie, root nodes, decision nodes, and terminal (or leaf) nodes) and branches, shows similar results: as we can expect from the variable importance, the most important variable appears at the root node, located at the top of the tree.
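The idea can be illustrated with a scikit-learn decision tree on synthetic data. Note the caveat: sklearn's `feature_importances_` is impurity-based, which differs from the terminal-node-percentage measure used by C5.0, and the data here is a fabricated stand-in in which one sensor (index 7) is made informative on purpose:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 61-sensor feature matrix; only feature 7
# carries the class signal, so it should dominate the importance ranking
# and be chosen for the root split.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 61))
y = (X[:, 7] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

ranking = np.argsort(tree.feature_importances_)[::-1]
print("most important sensor index:", ranking[0])
```

In the real pipeline, reading the top of this ranking (or the root node of the fitted tree) points maintenance engineers to the sensors most associated with impending breaks.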

Cost benefit analysis
Based on the experiment, RD-LNN is able to detect three failures 2 minutes in advance among 25 paper breaks. In this section, we analyze how much the proposed algorithm could contribute to industry, even though its performance is not high enough to detect every failure before it occurs. Table 6 shows that even the small reduction in failures achieved by this algorithm can save a significant amount of cost every year. The gain is calculated from the recall of 12%, and the loss caused by false alarms is estimated to find the total cost saved throughout a year. Ranjan et al² indicate that a break costs more than 10 000 dollars. Based on our dataset, we assume that failures occur 124 times per month. Since the classification algorithm can detect 12% of failures, almost 1.7 million dollars can be saved per year by preventing 179 possible failures. However, we must also consider the negative side: a false alarm warns operators even though the machine is in a normal state. We assume that a false alarm costs 100 dollars, because workers might stop and check the machine status to find the problem. Given that data is captured every 2 minutes, the 1488 failures occurring each year are subtracted from the total number of measurements to obtain the base of normal instances. Owing to the FPR of 1.9%, the total loss caused by false alarms would be less than 1 million dollars. Considering both the positive and negative factors together, we conclude that the proposed algorithm can save more than 700 thousand dollars in total per year.
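The arithmetic behind Table 6 can be reproduced as a back-of-the-envelope sketch using the figures quoted in the text; the per-measurement alarm base and the $100 false-alarm cost are the article's stated assumptions, and the exact figures in Table 6 may differ slightly from this rounding:

```python
cost_per_break = 10_000          # dollars per break (Ranjan et al.)
failures_per_year = 124 * 12     # 124 breaks per month -> 1488 per year
recall = 0.12                    # fraction of breaks detected 2 min in advance
fpr = 0.019                      # false positive rate of RD-LNN
cost_per_false_alarm = 100       # assumed dollars per false alarm

# Gain: breaks prevented by early detection.
gain = recall * failures_per_year * cost_per_break

# Loss: one measurement every 2 minutes; failure instances are subtracted
# from the yearly measurement count to get the base of normal instances.
measurements_per_year = 365 * 24 * 60 // 2
normal_measurements = measurements_per_year - failures_per_year
loss = fpr * normal_measurements * cost_per_false_alarm

print(f"gain ~ ${gain:,.0f}, loss ~ ${loss:,.0f}, net ~ ${gain - loss:,.0f}")
```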

DISCUSSION AND CONCLUSION
It is crucial to detect failures early to save cost and labor in a paper manufacturing facility. However, detecting machine failure in advance is challenging because the data comprise MSTS, and failures rarely occur during operation and show no clear symptom, a situation we call an extremely rare event problem. In this research, two methods based on the nearest neighbor, CL-LNN and RD-LNN, are proposed to extract proper features for early failure detection in paper manufacturing machinery. The data is preprocessed in several steps: splitting the data, standardization, moving the class label, taking the second derivative, and time-window processing. CL-LNN measures the Euclidean distance to extract the class label of the nearest neighbor, which is fed into a decision tree classifier for failure classification. The other algorithm, RD-LNN, extracts the relative distance, generating numerical values suitable for training an SVM. Experiments on the dataset provided by the IISE 2019 data competition, alongside other machine learning techniques, show the competitiveness of our proposed methods. Through the experiments, we find that RD-LNN extracts features effectively to detect abnormal conditions in the MSTS dataset, which could make a considerable contribution to industry by saving cost.
Considering that sensor measurements are collected every 2 minutes and our algorithm takes less than 20 seconds to analyze one measurement, the method is a feasible solution in a real-world environment, giving enough advance warning for technicians to take appropriate actions to prevent a breakdown. Nevertheless, the computational complexity could be handled more efficiently when deploying to a real-world environment. One possible improvement for real-time application follows from the fact that the Euclidean distance is computed from squared differences between two instances at m time points (see Equation (6)): if we store these squared differences from, say, t = 1 to t = m, the distance can be updated cheaply when a new signal is measured at t = m + 1 by dropping the term at t = 1 and adding the one at t = m + 1. By reusing the previously computed terms at t = 2, … , m, only the term for t = m + 1 must be computed, which saves much of the time needed to calculate the distance.
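The update above can be sketched for the case it applies to directly, namely two synchronized signals compared over their trailing windows; the class name is hypothetical:

```python
import numpy as np
from collections import deque

class RollingDistance:
    """Maintain the Euclidean distance between the trailing length-m windows
    of two synchronized signals. Instead of recomputing all m squared
    differences at each step, the oldest term is dropped and the newest
    added, so each update costs O(1)."""

    def __init__(self, m):
        self.m = m
        self.sq = deque()      # squared differences inside the current window
        self.total = 0.0       # running sum of the window's squared differences

    def update(self, x, y):
        d2 = (x - y) ** 2
        self.sq.append(d2)
        self.total += d2
        if len(self.sq) > self.m:
            self.total -= self.sq.popleft()  # drop the oldest term
        return self.total ** 0.5

# The incremental result matches a direct recomputation over the last window.
query = [1.0, 2.0, 3.0, 4.0, 5.0]
train = [0.0, 2.0, 2.0, 4.0, 7.0]
rd = RollingDistance(m=3)
for x, y in zip(query, train):
    d = rd.update(x, y)
direct = np.sqrt(sum((a - b) ** 2 for a, b in zip(query[-3:], train[-3:])))
assert abs(d - direct) < 1e-9
```

One practical caveat of the running-sum design: repeated add/subtract operations accumulate floating-point drift over very long streams, so a periodic full recomputation is a common safeguard.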
It should also be noted that the test dataset is standardized with the mean and SD obtained from the training dataset, since these parameters of the test dataset are not available during model training. This could negatively affect performance if new measurements differ significantly from the previous ones (the training dataset). Although we assume in this article that future examples will have a similar mean and SD to the training dataset, this limitation can be alleviated by updating those parameters as new measurements are gained.
Even though the cost-benefit analysis shows promising results, further research on the rare event situation is still necessary, since performance improvement is limited by the insufficient labeled data from which most machine learning algorithms suffer. More effort is needed to overcome the lack of failure data commonly encountered when collecting industrial data on events such as machine failures, spam email, and fraudulent credit card transactions. The concept of active learning could provide a possible solution to the extremely rare event problem, where the dataset is severely imbalanced (skewed) and only a small amount of initial training data is available. The basic idea of active learning is that a machine learning algorithm can achieve better performance with fewer labeled training examples if it is allowed to choose the data from which it learns. We may therefore obtain better performance by adopting active learning algorithms in our future research.

ACKNOWLEDGEMENT
We are very grateful to the two anonymous reviewers and the Editor-in-Chief for their comments on the article.

PEER REVIEW INFORMATION
Engineering Reports thanks Giovanna Martinez Arellano and other anonymous reviewer(s) for their contribution to the peer review of this work.

CONFLICT OF INTEREST
The authors have no potential conflict of interest to declare.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in arXiv.org at https://arxiv.org, reference number arXiv:1809.10717.