Explainable fault prediction using learning fuzzy cognitive maps

IoT sensors capture different aspects of the environment and generate high-throughput data streams. Besides capturing these data streams and reporting the monitoring information, there is significant potential for adopting deep learning to identify valuable insights for predictive preventive maintenance. One specific class of applications involves using Long Short-Term Memory networks (LSTMs) to predict faults happening in the near future. However, despite their remarkable performance, LSTMs can be very opaque. This paper addresses this issue by applying Learning Fuzzy Cognitive Maps (LFCMs) to develop simplified auxiliary models that provide greater transparency. An LSTM model for predicting faults of industrial bearings based on readings from vibration sensors is developed to evaluate the idea. An LFCM is then used to imitate the performance of the baseline LSTM model. Through static and dynamic analyses, we demonstrate that an LFCM can highlight (i) which members in a sequence of readings contribute to the prediction result and (ii) which values could be controlled to prevent possible faults. Moreover, we compare LFCMs with state-of-the-art methods reported in the literature, including decision trees and SHAP values. The experiments show that LFCMs offer some advantages over these methods. In addition, by conducting a what-if analysis, an LFCM can provide more information about the black-box model. To the best of our knowledge, this is the first time LFCMs have been used to simplify a deep learning model to offer greater explainability.

IoT applications have widely applied data-driven models to cope with pervasive sensors and the high throughput of generated data streams (Ghosh et al., 2018). These models have been extensively adopted in predictive preventive maintenance applications because building functional analytical models is overwhelming for complex mechanical systems, and IoT sensors facilitate gathering health-condition data (Zhao et al., 2020; Zhao, Jia, Bin et al., 2021). Data-driven fault detection methods take advantage of machine learning, particularly deep learning, to build solutions based on condition data covering various states (Long et al., 2022). To this end, two phases are considered: (i) recognizing fault patterns based on different extracted features, and (ii) building a predictor using the extracted features and machine learning/deep learning models (Hasan et al., 2021).
In conventional machine learning, a primary challenge is selecting important features to train a model. Most feature selection techniques reported in previous fault detection research are either metaheuristic (Oreski & Oreski, 2014) or filter-based (Ambusaidi et al., 2016). Moreover, data compression methods such as PCA (Xie et al., 2018) and manifold learning techniques have been adopted for dimension reduction in fault detection (Refahi Oskouei et al., 2012). The majority of previous research also combines wavelet transformation with other techniques such as empirical mode decomposition and self-organizing maps (Hong et al., 2014), neural networks (Narendiranath et al., 2017; Zhou et al., 2019) and multiclass SVMs (Rahnama et al., 2019). Evolutionary strategies have also been deployed to tackle fault prediction (Wang, Kang et al., 2019).
Deep learning models have shown more flexibility in task learning and take advantage of embedded feature transformation to reach a more separable space. Convolutional Neural Networks (CNNs) are used in many fault prediction applications (Cheng et al., 2021; Mehdiyev & Fettke, 2021; Sun et al., 2020). Moreover, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), and Gated Recurrent Units (GRU) (Cho et al., 2014) have become popular time-series processing methods that can effectively encode temporal information (Guo et al., 2021; Okubo et al., 2017; Wang, Yan et al., 2019).
Despite the success of deep learning, it is often criticized as a black-box approach that lacks transparency (Arrieta et al., 2019; Guidotti et al., 2019; Lin et al., 2020; Wang et al., 2021; Yoon et al., 2019; Mansouri & Vadera, 2022), and this opacity does not allow users to digest and trust the output of deep learning models (Kok et al., 2022). In a black-box model, the internal processes are either unknown or, if known, cannot be understood by a human being. Explainable Artificial Intelligence techniques are any effort through which further explanation can be provided to shed light on such opaque machine learning models (Schlegel et al., 2019). Whilst some models are interpretable by nature, the explanation process can assist in a more in-depth understanding of a black-box model (Doshi-Velez & Kim, 2017; Guidotti et al., 2019; Kok et al., 2022).
Several ways of providing explanations include explanations by text, visualizations, local explanations, explanations by example, explanations by simplification, and feature relevance (Arrieta et al., 2019; Kok et al., 2022; Mansouri & Vadera, 2022). Explanations by text learn to generate texts that help interpret the results. Visualizations give a visual understanding of a model's output. Local explanations split the solution space and describe simpler solution subspaces associated with the main model. Explanations by example extract data instances that reflect the results of a given model. Explanations by simplification use a simplified yet interpretable auxiliary model whose output is loyal to the baseline black-box model (Arrieta et al., 2019). Finally, feature relevance addresses the relationship between the model output and the most important inputs (Arrieta et al., 2019; Chen et al., 2018; Mansouri & Vadera, 2022; Yoon et al., 2019).
As LSTMs are widely adopted in fault analysis (Guo et al., 2021; Huang et al., 2021; Mansouri & Vadera, 2022; Nwakanma et al., 2021; Zhao et al., 2017; Zheng et al., 2017), we consider them the baseline model for fault prediction in industrial bearings using vibration sensor readings. LSTMs are also black boxes in which the internal operation of the gates is very hard to understand. Although gaining an understanding of an LSTM's internal gates is quite interesting, it is unlikely to be suitable for users without in-depth knowledge of the model. In this paper, we explore the use of the Learning Fuzzy Cognitive Map (LFCM) (Salmeron et al., 2019) as an auxiliary model that is built to be loyal to the output of the trained LSTM for interpreting its results. Fuzzy Cognitive Maps (FCMs) are flexible and robust models for system state prediction and interpretable knowledge representation. FCMs are also capable of supporting what-if analysis, which can help bring out the causal relationships between the input and output variables. To the best of our knowledge, this work is the first to take advantage of FCMs for explaining deep learning models through model simplification. The paper includes a comparison of the use of LFCMs with very recent methods for interpreting deep learning models (Senoner et al., 2022); this model extracts the most important features of a black-box model by calculating its SHAP values (Lundberg et al., 2017, 2020). SHAP (SHapley Additive exPlanations) is an explanation technique based on game theory for improving the interpretability of machine learning models. To this end, the technique decomposes the output of the base model into the contribution of each feature, named "SHAP values" (see Lundberg et al., 2017, 2020 for more details). In other words, the SHAP value method can be viewed as a cooperative game in which a payoff is assigned to each feature based on its contribution (Senoner et al., 2022).
The remainder of this paper is organized as follows. Section 2 reviews related research on model simplification techniques, FCM applications, and previous efforts in equipment fault diagnosis. Section 3 presents the framework for this research, and Section 4 describes the model, the dataset, and the experimental results of the prediction and explanation tasks. Finally, Section 5 concludes the paper and provides new directions for research.

| PREVIOUS WORKS
Since the approach of this work is to adopt an LFCM to carry out an explanation by simplification of the baseline model, which is an LSTM, this section reviews related research on simplifying black-box models and the use of FCMs. Explanation by simplification means building an interpretable auxiliary model based on a trained black-box model to be explained (Arrieta et al., 2019). The supplemental model is usually much simpler than the baseline and tries to imitate its behaviour by reducing complexity.
Almost all simplification models extract interpreting rules (Arrieta et al., 2019). Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al., 2016) is a typical model in this category. LIME builds locally linear models around the outputs of a black-box network to interpret it. G-REX (Konig et al., 2008) is another method that learns auxiliary rules. Bastani et al. (2018) implemented a model extraction process to approximate the opaque model with an interpretable one. Tan et al. (2018) took a different approach by combining two methods: a model distillation and comparison method to investigate the baseline model's risk, and a statistical test to investigate whether the data is missing key training features.
There are several studies on simplifying deep learning models. Zilke et al. (2016) developed DeepRED, which uses an extended decomposition approach to rule extraction for deep neural networks. Che et al. (2017) presented Interpretable Mimic Learning using gradient-boosting trees. Thiagarajan et al. (2016) proposed a hierarchical partitioning of the features that displays the iterative elimination of improbable class labels until the association is predicted.
FCMs combine ANNs and fuzzy logic to introduce an interpretable representation of complex systems (Salmeron et al., 2019). FCMs can be built through experts' judgements or a data-driven learning process; an FCM developed through a learning method is called a Learning FCM, or LFCM (Salmeron et al., 2019). There are many applications of this technique in the literature, such as in medicine (Nápoles et al., 2014), customer behaviour analysis (Nasserzadeh et al., 2008), students' performance estimation, computer science (Osei-Bryson, 2004), project management (Kahvandi et al., 2018; Kordestani Ghaleenoei et al., 2021), and other domains (Poomagal et al., 2021).
There are also some reported works in failure/fault modelling using FCMs. In one of the first efforts, Pelaez (1996) used FCMs to carry out failure mode analysis; the model was applied to estimate the effects of component faults on system operation. Ravasan and Mansouri (2014, 2016) studied expert-created FCMs that evaluate failures associated with ERP implementation. In another work, Liang et al. (2019) deployed an LFCM for fault prediction in part of a railway signalling system; they used a real-coded genetic algorithm to train the LFCM and reported effective performance.
FCMs have been applied in modelling large-scale problems such as Gene Regulatory Networks (Hecker et al., 2009) containing a few thousand concepts (Salmeron et al., 2019). However, as with other interpretable methods, the performance of the model is affected when there are many features. In FCMs, this side effect would lead to a large adjacency matrix that is hard to interpret. To this end, the number of features should be managed through feature selection techniques.

| Fault diagnosis methods
This section describes previous studies on equipment fault diagnosis through data-driven approaches in more depth. As mentioned in the introduction, machine learning and deep learning are widely adopted in this field (Ben Ali et al., 2015). In the case of vibration analysis, LSTM networks can analyse the internal correlation of vibration signals in time-series data (Guo et al., 2021; Huang et al., 2021; Nwakanma et al., 2021; Zhao et al., 2017; Zheng et al., 2017). In terms of explanation, there have been some recent efforts. Sun et al. (2020) developed a CNN for equipment fault detection; the authors added an extra layer, Class Activation Maps, into the model for a visual explanation. Another work on visualization was conducted by Chen and Lee (2020), in which a CNN for classification is proposed. They applied gradient class activation mapping to generate heat maps by calculating the weights of each feature map according to the classification scores.
Decision trees are widely applied to interpret black-box deep learning models through the explanation-by-simplification approach (Christou et al., 2020; Mehdiyev & Fettke, 2021; Senoner et al., 2022). To this end, Senoner et al. (2022) proposed gradient boosting with decision trees to improve process quality and used SHAP values to obtain the importance of the features. Mehdiyev and Fettke (2021) proposed a technique for predictive maintenance; they also proposed a model-agnostic explanation approach called Surrogate Decision Trees. Christou et al. (2020) used a rule-based model to explain the results from a model used to estimate the remaining useful life of industrial equipment. Brito et al. (2021) used a number of machine learning techniques along with SHAP and Local Depth-based Feature Importance for the Isolation Forest; the model is applied to bearing and mechanical fault datasets. Isolation Forests are similar to Random Forests and build on decision trees. In this technique there are no labels, so it is unsupervised: samples are processed in a tree based on randomly selected features.
Samples with shorter branches in the tree are possibly isolated ones and anomalies (Liu et al., 2008). Hasan et al. (2021) proposed an explainable fault diagnosis model for bearings, including five steps where they considered data preprocessing, feature selection, and feature importance. They also used an additive Shapley explanation followed by k-NN to diagnose and explain each decision of the k-NN. In another work, Li et al. (2022) proposed an adversarial domain generalization network based on class boundary feature detection to diagnose faults. Wang et al. (2020) applied a multi-headed attention mechanism for optimizing CNNs. In another work, Mansouri and Vadera (2022) proposed an instance-wise feature selection technique to highlight the most contributing features in a deep learning model aiming at fault prediction.

| FRAMEWORK
The above summarizes various studies in the field of explainable AI for fault detection. In this paper, we present an alternative approach that simplifies and explains an LSTM for fault detection by using LFCMs. First, we build an LSTM deep neural network with a few fully connected layers to predict a fault happening in degraded bearings within the next few hours. To this end, dataset $D$ contains $n$ tuples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{T \times m}$ is the input sequence of length $T$ with $m$ features, and $y_i \in \{0, 1\}$ is a binary label in which 0 represents a normal condition and 1 denotes a faulty one. In this research, we treat the task as binary classification, in which a fault either happens or does not, even though both the LSTM and the LFCM auxiliary model can conduct categorical classification to distinguish different types of faults. The prediction model $f_\gamma : x \rightarrow [0, 1]$, parametrized by $\gamma$, is the LSTM that undertakes the classification task.
After building the predictor, we aim to explain it with a Learned FCM, which is interpretable. An FCM models a system containing several interacting concepts as a weighted directed graph, where the vertices indicate components of the system ($C_i$) and the connecting weights ($W_{ij}$) show the interactions between those components (Figure 1).
The sign of $W_{ij}$ implies the type of relationship between concepts $C_i$ and $C_j$, and its value shows the intensity of this relationship. A combination of concepts captures a snapshot of the system at any time as a state vector $A^t = (a_1^t, \ldots, a_n^t)$; this state vector holds the values associated with each concept at time $t$ and can be updated by Equation (1):

$$a_i^{t+1} = f\left(\sum_{j=1, j \neq i}^{n} W_{ji}\, a_j^t\right) \qquad (1)$$

In Equation (1), $a_i^t$ denotes the value of concept $C_i$ at time $t$, $W_{ji}$ is the weight between input concept $C_j$ and $C_i$, $n$ is the number of concepts, and $f$ is an activation function, for which mainly the unipolar sigmoid (Equation (2)) is used (Salmeron et al., 2019):

$$f(x) = \frac{1}{1 + e^{-\alpha x}} \qquad (2)$$

where $x$ is the input and $\alpha$ is the function slope, estimated as a hyperparameter. Whether FCMs are created by experts' judgements or through a learning method, they are interpretable. This interpretability is achieved by using the extracted weight matrix to represent the modelled system, facilitating static and dynamic analyses.
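As a concrete illustration, the update rule of Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under our own naming conventions, not the implementation used in the paper:

```python
import math

def sigmoid(x, alpha=1.0):
    """Unipolar sigmoid activation of Equation (2); alpha is the slope."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def fcm_step(state, W, alpha=1.0):
    """One update of Equation (1): each concept aggregates the weighted
    values of all other concepts, then is squashed by the sigmoid."""
    n = len(state)
    return [
        sigmoid(sum(W[j][i] * state[j] for j in range(n) if j != i), alpha)
        for i in range(n)
    ]

# A tiny 3-concept map: C1 drives C2 (weight 0.8), C2 drives C3 (weight 0.5).
W = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0]]
state = fcm_step([1.0, 0.0, 0.0], W)
```

Concepts with no incoming influence settle at sigmoid(0) = 0.5 in this formulation, which is why the state vector is usually iterated for several steps before being read off.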
Causal effects among the concepts are derived in static analysis by finding the maximum value among the several paths connecting an input concept to an output one (Ravasan & Mansouri, 2014). In this analysis, a partially ordered set $P$ of causal values is first taken into consideration.
Let $\zeta$ be a causal concept space and $e : \zeta \times \zeta \rightarrow P$ a causal edge function. The simplest abstract operations are then achieved by interpreting the indirect-effect operator $I$ as some minimum operator and the total-effect operator $T$ as some maximum operator; these operators can be a simple min and a simple max, respectively. Let there be $m$ paths from $C_i$ to $C_j$: $C_i, C_{k_1^l}, \ldots, C_{k_{n_l}^l}, C_j$ for $1 \leq l \leq m$. Let $I_l(C_i, C_j)$ denote the indirect effect of concept $C_i$ on $C_j$ along the $l$th path, and $T(C_i, C_j)$ the total effect of $C_i$ on $C_j$ over all $m$ causal paths (Equations (3) and (4)):

$$I_l(C_i, C_j) = \min \{\, e(C_p, C_{p+1}) : (p, p+1) \text{ an edge on the } l\text{th path} \,\} \qquad (3)$$

$$T(C_i, C_j) = \max_{1 \leq l \leq m} I_l(C_i, C_j) \qquad (4)$$
where $p$ and $p+1$ are successive left-to-right path indices (Kosko, 1986). This analysis highlights the fundamental importance of each concept for the target concept(s). Figure 2 illustrates the static analysis in an FCM. $C_1$ is a given input concept, and $C_4$ is the output concept; there are four possible paths connecting them. The $I$ operator finds the minimum among all weights on each path. For instance, the minimum weight on the path $C_1 \rightarrow C_2 \rightarrow C_4$ is the weight connecting $C_1$ to $C_2$, whose value is 0.1. After all possible paths are evaluated by the $I$ operator, the $T$ operator selects the path with the maximum impact, which in this example is $C_1 \rightarrow C_3 \rightarrow C_4$, with a value of 0.3.
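The min/max operators of Equations (3) and (4) are short enough to transcribe directly. The sketch below reproduces part of the Figure 2 example; the weights on the edges into $C_4$ from $C_2$ and $C_3$ are assumptions, since the text only states the resulting path values:

```python
def indirect_effect(path, weights):
    """I operator (Equation (3)): the minimum edge weight along one causal path."""
    return min(weights[(a, b)] for a, b in zip(path, path[1:]))

def total_effect(paths, weights):
    """T operator (Equation (4)): the maximum indirect effect over all paths."""
    return max(indirect_effect(p, weights) for p in paths)

# Edge weights loosely following the Figure 2 example; 0.7 and 0.6 are
# assumed values, the rest come from the text.
weights = {(1, 2): 0.1, (2, 4): 0.7,
           (1, 3): 0.3, (3, 4): 0.6,
           (1, 4): 0.2}
paths = [[1, 2, 4], [1, 3, 4], [1, 4]]

total = total_effect(paths, weights)   # the C1 -> C3 -> C4 path wins with 0.3
```

As in the text, the bottleneck edge of each path determines its indirect effect, and the strongest path determines the total effect of $C_1$ on $C_4$.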
F I G U R E 2 Static analysis in FCM, in this figure the direct effect of C 1 on C 4 is 0.2, whilst its indirect effect through C 1 to C 3 to C 4 is 0.3.
F I G U R E 1 Fuzzy cognitive maps can be represented through two different approaches. For small to medium-sized maps, displaying a visualized graph where nodes are concepts and edges are weights is informative (left), whereas for larger maps, an adjacency matrix is a more straightforward way to represent an FCM, in which each cell shows the relationship between the concepts located in the associated row and column (right).
The dynamic analysis commences with an initial state vector, such as $A^0 = (a_1^0, \ldots, a_n^0)$, indicating the corresponding values of all concepts, and keeps updating it with Equations (1) and (2). To train the LFCM, the in-sample error is defined by Equation (5), and the learning algorithm aims to minimize this error through a novel evolutionary strategy:

$$\text{Error} = \frac{1}{K N} \sum_{t=1}^{K} \sum_{n=1}^{N} \left( C_n(t) - \hat{C}_n(t) \right)^2 \qquad (5)$$
where $C_n(t)$ denotes the actual value of concept $n$ at time $t$, $\hat{C}_n(t)$ is the value estimated by the LFCM, $K$ is the number of samples, and $N$ is the number of concepts. Algorithm 1 provides the pseudocode of the FCM-MARO algorithm (Salmeron et al., 2019).
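The in-sample error of Equation (5) amounts to a mean squared error over all samples and concepts. A direct transcription is shown below; the normalization by $K \cdot N$ is our reading of the equation:

```python
def in_sample_error(actual, estimated):
    """Equation (5): mean squared difference between actual concept
    values C_n(t) and the LFCM estimates, over K samples and N concepts."""
    K = len(actual)          # number of samples
    N = len(actual[0])       # number of concepts
    return sum(
        (actual[k][n] - estimated[k][n]) ** 2
        for k in range(K) for n in range(N)
    ) / (K * N)

# A single two-concept sample: errors of 0.2 and 0.1 average to 0.025.
err = in_sample_error([[1.0, 0.0]], [[0.8, 0.1]])
```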
This algorithm follows an evolutionary strategy, namely ARO (Farasat et al., 2010; Mansouri et al., 2011). A solution in ARO is a vector of decision variables $X \in \mathbb{R}^n$, where $n$ is the length of the vector. The algorithm starts with a randomly initialized solution, named the parent, and reproduces an offspring, named the bud, through a specific reproduction mechanism. In the original ARO, the bud replaces the parent once it is better; the modified algorithm adds another acceptance criterion that can accept a new offspring even if it is not better than the parent, by considering the effect of falling into local minima using Equation (6), where local is the number of times the search has been trapped in a local optimum and $t$ is the number of iterations. FCM-MARO, which can train a data-driven FCM, or LFCM, is fully described in Salmeron et al. (2019).
The reproduction mechanism combines the mutation and crossover operators of other evolutionary algorithms. A subset of the parent vector is selected and mutated randomly to constitute an interim vector named the larva; the crossover operator is then applied to the larva and the parent to build the ultimate offspring (the bud). The length of the larva determines the probability of selecting each part of the bud from this vector, where $g$ is the length of the mutated part of the parent vector.
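The reproduction mechanism described above can be sketched as follows. This is an illustrative sketch only: the mutation range, the FCM weight bounds of [-1, 1], and the way the larva length $g$ maps to a crossover probability are our assumptions, not the exact FCM-MARO operators:

```python
import random

def reproduce(parent, g, rng=None):
    """One ARO-style reproduction step: mutate a random segment of
    length g to form the larva, then cross larva and parent to form
    the bud. A longer larva makes larva genes more likely to be kept."""
    rng = rng or random.Random(0)
    n = len(parent)
    start = rng.randrange(n - g + 1)
    larva = list(parent)
    for i in range(start, start + g):        # mutation of g consecutive genes
        larva[i] = rng.uniform(-1.0, 1.0)    # FCM weights lie in [-1, 1]
    p_larva = g / n                          # assumed crossover probability
    bud = [larva[i] if rng.random() < p_larva else parent[i]
           for i in range(n)]                # uniform crossover
    return bud

bud = reproduce([0.0] * 10, g=3)
```

In FCM training, the solution vector holds the flattened adjacency matrix, so each gene is one candidate weight $W_{ij}$.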
In this paper, the initial state is an input sequence $x^{t-1}$ with window size $T$, along with a slack variable $\hat{y}^0 = 0$ representing the probability of being faulty at the beginning. The output is the following sequence, along with the result of the LSTM, $\hat{y}^t = f_\gamma(x^{t-1})$. Therefore, the LFCM is a mapping $\text{LFCM} : \mathbb{R}^{T+1} \rightarrow \mathbb{R}^{T+1}$ that learns to accept the input and generate the fault-probability output.

Algorithm 1 Pseudocode for FCM-MARO

initialize a random parent solution
while the stopping criterion is not met do
    generate a random offspring from the parent
    calculate the error of the offspring by Equation (5)
    measure the extra acceptance interval by Equation (6)
    if the error of the offspring is less than the error of the parent, or falls within the extra acceptance interval, then
        replace the parent with the offspring
    end if
end while
return the parent as the trained LFCM weight matrix

| Bearing dataset
We used the real-world dataset collected in Mansouri and Vadera (2022), which contains four months of vibration readings from four bearings of the same size, class, and category. One of the bearings degraded and produced several faults, whilst the others remained in better condition. The dataset contains sequential data where each row consists of 20 vibration readings and one class label, in which 0 means a normal situation and 1 means a fault. All bearings operated in real situations, and faults were captured through installed sensors. There are 29,646 sequences. Figure 3 shows the distribution of normal and fault samples in this dataset. We used 80% of this dataset for training and the rest for testing the models.
Although the data is clearly imbalanced, the accuracy of the models was good and there was therefore no need to use sampling methods to balance the data.
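The dataset rows pair a window of 20 readings with the label of the condition that follows. The exact windowing used to build the dataset is not specified in the text, but such sequence/label pairs are typically formed with a sliding window, sketched below (function names and the label convention are our assumptions):

```python
def make_sequences(readings, labels, T=20):
    """Slice a stream of readings into overlapping windows of length T,
    each paired with the label of the step that follows the window."""
    data = []
    for i in range(len(readings) - T):
        data.append((readings[i:i + T], labels[i + T]))
    return data

stream = list(range(25))        # toy stand-in for vibration values
labels = [0] * 24 + [1]         # a fault at the final step
pairs = make_sequences(stream, labels, T=20)
```

With 25 readings and T = 20, this yields 5 training pairs, the last of which carries the fault label.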

| Baseline model
An LSTM network was selected as the baseline model because it has shown good performance in time-series analysis. As a deep learning model containing extra gates, an LSTM makes the relationship between an input and its associated output quite complex and opaque. To select the topology of the LSTM, different numbers of neurons for the LSTM layer, different numbers of fully connected layers and their numbers of neurons, optimizers including Adam, SGD, and RMSProp, as well as different learning rates were tested through grid search (Liashchynskyi & Liashchynskyi, 2019). This led to a baseline model with a 60-node LSTM layer followed by a 32-node fully connected layer with a ReLU activation function and an output layer with one node with a sigmoidal activation function. The selected optimizer was Adam, and 0.01 was the best learning rate. Figure 4 shows the result of training the LSTM on the bearing dataset.
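The grid search described above can be written generically. In the sketch below, the grid values other than the reported winning configuration are hypothetical, and `evaluate()` stands in for training and scoring one LSTM configuration:

```python
from itertools import product

# Hypothetical grid mirroring the search described in the text.
grid = {
    "lstm_units": [30, 60, 120],
    "dense_units": [16, 32, 64],
    "optimizer": ["adam", "sgd", "rmsprop"],
    "learning_rate": [0.1, 0.01, 0.001],
}

def grid_search(grid, evaluate):
    """Exhaustively score every hyperparameter combination and
    return the best-scoring configuration."""
    keys = list(grid)
    best, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)
        if score > best_score:
            best, best_score = config, score
    return best
```

In practice, `evaluate` would build the model with the given configuration, train it on the training split, and return a validation metric.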
As Figure 5 shows, the confusion matrix of the trained LSTM on the test dataset is promising. The number of false positives and false negatives is low even though the dataset is highly imbalanced. Therefore, the LSTM captures this dataset's complexity without being overfitted or underfitted. Although the LSTM is a powerful model for fault prediction, as a deep learning model it is a black box and uninterpretable, so explaining this otherwise acceptable but obscure deep learning model is worthwhile. To provide some interpretability, an LFCM is developed and used to carry out the analysis. A combination of decision trees and SHAP values is then used to provide a contrasting approach.
F I G U R E 3 The distribution of normal and fault samples in the bearing dataset.

| Explanation by simplification
To create the LFCM model, we used the same bearing dataset with which we trained the LSTM. The LFCM receives a sequence containing 20 readings and one null variable and creates the next sequence along with the prediction result. Therefore, the input and output each have 21 concepts (the sequence members and the slack variable denoting the probability of a fault), and the generated adjacency matrix is 21 by 21. As the current problem is binary classification, we used one output concept; in multi-class problems, one can add as many output concepts as there are classes, so the final adjacency matrix is (T + m) by (T + m), where T is the number of inputs and m is the number of classes. Figure 6 displays the LFCM and its related input and output structure. During training, the LFCM learns to accept an input vector and predict an output vector of the same size containing a fault prediction close to the LSTM's output.
Since Algorithm 1 is an evolutionary algorithm with no hyperparameters, we simply set the number of iterations to 100 and ran it to build an LFCM using the bearing dataset. As Figure 7 shows, the LFCM converges to a low error rate (as defined by Equation (5)).
The trained LFCM accepts an input and estimates the fault probability during simulation and testing. The resulting LFCM was tested on the same test dataset with which the LSTM was tested. Figure 8 displays the confusion matrix of the LFCM on the test subset of the bearing dataset.
Based on this result, the performance of the LFCM is very close to that of the trained LSTM. However, because the LFCM is much simpler than the LSTM, its capacity to capture the nonlinearity in the data is less than the LSTM's (Wang et al., 2021).
F I G U R E 4 The loss function resulting from training the LSTM for fault detection on the bearing dataset.
F I G U R E 5 Confusion matrix resulting from running the trained LSTM on the test dataset. The error portion of this matrix is low.
F I G U R E 6 The structure of the LFCM and its input and output.
F I G U R E 7 Convergence of the training error (Equation (5)) of the LFCM on the bearing dataset.
F I G U R E 8 Confusion matrix resulting from running the trained LFCM on the test dataset. The error portion of this matrix is also quite low and close to the LSTM's performance.
Once a trained LFCM can predict the bearing condition using a sequence of vibration readings, it is time to conduct static and dynamic analysis to extract more insight into the baseline model.

| Static analysis
According to the weight matrix obtained from the trained LFCM, each input variable directly affects the target concept, which denotes the probability of being faulty in the future. On the other hand, since FCMs are dynamic systems, these variables also have interconnected effects, so their total effects should be considered. Therefore, Equations (3) and (4) are applied to estimate the total effect of each variable on the target. Figure 9 shows the direct and indirect effects obtained by the static analysis.
As Figure 9 implies, almost all concepts have greater indirect effects than direct ones. Considering only the direct effect, the starting members do not affect the target variable, but after going through the static analysis, their indirect effect becomes considerably higher. Based on the static analysis, it can be concluded that the members indexed 1, 3, 8, 10, 16, 17, 18, and 19 have greater effects on the target variable, meaning more fluctuation in the target is expected when their values change. It should be noted that FCM static analysis draws the causal relationships between input and output concepts throughout the dataset; it is not an instance-wise feature selection technique. Nevertheless, it is expected that changing the values of the more essential concepts leads to more output change than the other concepts.

| Dynamic analysis

F I G U R E 1 0 A normal reading, which is predicted as normal by the LFCM.

For the dynamic analysis, four scenarios were conducted on a randomly chosen normal sequence (Figure 10), manipulating one quarter of its members in turn. Figure 11 shows the results of these scenarios on the chosen sequence. In the first scenario, the first five members are manipulated, and the other quarters are modified similarly in the successive scenarios. The fourth scenario highlights that changing only the fourth quarter's values led to a faulty condition, while the other scenarios could not change the classification result.
The next set of experiments relates to the faulty condition. Figure 12 shows a random faulty sample. Here the question is which members one could control to prevent the fault from happening. As before, there are four experiments, in each of which one group's values are decreased to 0.1. The effect of this manipulation, obtained through LFCM dynamic analysis, is shown in Figure 13.
As Figure 13 shows, one can prevent the fault by controlling the values of the fourth quarter, while controlling the other groups has no direct influence on the condition. Knowing this, we can conduct a what-if analysis. Therefore, the dynamic analysis of an FCM can provide more valuable information about the behaviour of a black-box model.
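The what-if analysis above can be sketched by re-running the FCM simulation with selected concepts set to a low value before iterating. This is a sketch under our assumptions: the function names and the toy weight matrix are illustrative, and the last concept is taken to be the fault-probability output:

```python
import math

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * x))

def simulate(state, W, steps=10, alpha=1.0):
    """Iterate the FCM update of Equation (1); by convention here the
    last concept holds the fault probability."""
    n = len(state)
    for _ in range(steps):
        state = [sigmoid(sum(W[j][i] * state[j] for j in range(n) if j != i), alpha)
                 for i in range(n)]
    return state

def what_if(state, W, indices, value=0.1, steps=10):
    """Re-run the simulation after setting the chosen concepts to a low
    value, mirroring the quarter-wise scenarios described in the text."""
    modified = list(state)
    for i in indices:
        modified[i] = value
    return simulate(modified, W, steps)[-1]
```

Comparing the final fault-probability concept with and without the intervention reveals which input positions actually drive the prediction, which is exactly the question posed for the faulty sample above.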

| Validation
To check the results of the LFCM, we took the idea proposed in Senoner et al. (2022) for fault detection with explainable models. They calculated the feature importance of all parameters of their base model with the tree implementation of the SHAP value method (Lundberg et al., 2020). However, the base model in their work is gradient boosting with decision trees (Ke et al., 2017), whereas in our work it is an LSTM. Therefore, to use the tree implementation of SHAP values, a decision tree was built on the same training set with which the LSTM and LFCM had been trained. To tune the hyperparameters of this decision tree, a grid search over its most important parameters was conducted. Table 1 summarizes the results of the grid search. These values were selected from many combinations for building a decision tree, including different split criteria such as gini and entropy, as well as other hyperparameters.
F I G U R E 1 1 Four scenarios conducted on a normal sequence. In the first scenario, the first five members are manipulated randomly, and in the successive scenarios the second, third, and fourth groups of five members are manipulated.
F I G U R E 1 2 A fault reading, which is predicted as fault by the LFCM.
F I G U R E 1 3 Four scenarios conducted on a faulty sequence. In the first scenario, the first five members are manipulated randomly, and in the successive scenarios the second, third, and fourth groups of five members are manipulated.

F I G U R E 1 4 Confusion matrix resulting from running the trained decision tree on the test dataset. The error portion of this matrix is also low and close to the LSTM's performance; however, it is not as good as the LFCM's performance.
After building the decision tree with the designated parameters, it was tested on the test subset of the bearing dataset. Figure 14 shows the resulting confusion matrix. Although the decision tree's output is still close to the LSTM's, the LFCM outperforms the decision tree.
To select the most contributing features, the average of the SHAP values of the trained decision tree was calculated; the result is presented in Figure 15.
Based on Figure 15, the most important features are indexed 19, 2, 10, 8, 16, 17, 18, 4, and 3; among them, 3, 8, 10, 16, 17, 18, and 19 are shared with the LFCM, which validates its results. The feature ranking from the averaged SHAP values is close to the LFCM's, yet the LFCM is more accurate than the decision tree and not only highlights the most important features but also carries out what-if analysis to estimate the effect of changing each feature on the whole system.

| CONCLUSION
This paper tackles the problem of producing a transparent model for predicting faults from vibration sensor readings. The article first develops an LSTM model and then illustrates how an LFCM can carry out static and dynamic analyses. To validate the performance of the proposed method as a simplified model, a recent method combining decision trees and SHAP values is used. A dataset consisting of sequences of vibration readings and their labels is used; after pre-processing, an LSTM neural network was trained on the training subset and examined on the test set.

F I G U R E 1 5 The average of the SHAP values on the bearing dataset and the trained decision tree. This result shows that the most important feature was the last one; nevertheless, features belonging to the second part of the sequences contributed more.
Different parameters and topologies were tried to tune the hyperparameters using a grid search. The trained LSTM, an opaque deep learning model, was selected as the baseline model. Then an LFCM was trained to illustrate some parts of the baseline model. To this end, the LFCM was trained on the same data to generate outputs close to the baseline's outputs. Experiments show that the performance of the LFCM is similar to that of the baseline model.
Since an LFCM can carry out static and dynamic analyses, it can interpret some parts of the baseline model. The static analysis can highlight the effects of the input concepts on the probability of being faulty. In the example used in this paper, the concepts belonging to the fourth quarter had more effect on the prediction results. This analysis also showed that even features with no or low direct impact on the target affected the results through indirect paths. Moreover, the dynamic analysis showed that changing the values of the fourth-quarter variables could lead to a fault being predicted for normal readings. A recent method proposed by Senoner et al. (2022) shows that decision trees coupled with SHAP values can also help make LSTM models more transparent. The LFCM's performance was better than the decision tree's, the result of the SHAP values was close to the LFCM's, and, since the LFCM can carry out dynamic and static analyses, it can conduct what-if analysis in addition to reporting feature importance.
In conclusion, this paper demonstrates that LFCMs can capture most of the LSTM's performance, provide added capabilities that aid interpretability, and outperform cutting-edge methods in several respects. For future research, one could improve the performance of the LFCM by adding more nonlinearity without compromising interpretability, and one could explore whether there are benefits in using LFCMs for other deep learning models and tasks. Although the use case investigated in this paper was binary classification, applying the proposed model to multi-class problems would also be interesting. Moreover, combining the proposed method with local feature selection methods when there are many features has the potential to shed further light on how to make black-box models more transparent.

ACKNOWLEDGEMENTS
The work presented in this paper was carried out as part of a Knowledge Transfer Partnership project between the University of Salford and Invisible Systems Ltd (ISL). We are grateful to ISL for providing the data and assisting with the problem definition.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in stream at https://github.com/tahamsi/stream.