BERT‐TriF: An inductive short text classification model for power equipment defect records

The descriptions in power equipment defect records are often colloquial short texts. Standardized classification of large numbers of colloquial defect descriptions lays a solid foundation for building a power equipment knowledge graph and raising the level of intelligence in the field of power inspection. Using deep learning and natural language processing technology, this paper proposes a text classification model for power equipment defect records named Bidirectional Encoder Representations from Transformers with Family Feature Fusion (BERT‐TriF). The model first leverages BERT to semantically represent the input text. To extract family history as well as implicit text information, we propose a novel family feature fusion algorithm for training. An improved multi‐head attention mechanism is then developed to enhance text semantic category features and strengthen the learning ability of the model. Comparisons of BERT‐TriF against baseline models such as TextCNN, TextRNN, and fastText on a domain-specific dataset and generic text datasets demonstrate that it achieves better performance, robustness, and universality for short text classification.


INTRODUCTION
However, the related defect records are mainly entered into the system manually. As first-hand materials, these records are not only the foundation of defect classification and elimination but are also directly related to the accuracy of health condition evaluation and power equipment maintenance decisions.4 The manual methods have two main shortcomings. First, defect records entered by different operators may be inconsistent, so subsequent maintenance personnel cannot accurately locate the problem or derive defect processing methods from the description, which delays defect handling. Second, defect classification by different operators is subject to human cognitive bias. Some operators are unfamiliar with the defect classification rules, and descriptions of the same defect may deviate greatly between personnel, which can cause incorrect defect classification, incorrectly filled elimination times, and so forth.5,6 Since the power equipment management department has established a set of standard knowledge bases of defect descriptions corresponding to the different types of equipment, every kind of text used to describe a defect can be mapped to a unique defect type and elimination time. Therefore, it is only necessary to match a manually entered defect description to the corresponding defect type in the knowledge base; the defect's severity level and elimination time are then clarified simultaneously.
A manually entered defect description generally does not exceed 50 words. Short text classification based on deep learning and natural language processing can therefore establish a one-to-one correspondence between manually entered defect records and standardized defect types, improving the efficiency, standardization, and intelligence of defect handling and laying a foundation for subsequent research on failure analysis, equipment status evaluation, and auxiliary decision-making.
With the growing application of machine learning algorithms, represented by deep learning, scholars at home and abroad have combined deep learning with traditional natural language processing technology to improve the processing of short texts. Current research on power equipment defect records mainly covers text mining, text retrieval, and text classification. Shao et al.7 proposed a method for pruning, segmenting, and reconstructing the dependency syntax tree of power equipment texts using the characteristics of the defect text, but this association-rule-based method required considerable manual intervention during early model construction. Liu et al.8 proposed a method for classifying power equipment defect records based on convolutional neural networks. Similarly, Lu et al.9 proposed a text classification method for power equipment defect records based on a multi-head attention recurrent convolutional neural network (MHA-RCNN), which combined multi-head attention with a recurrent convolutional network to form a classifier that improved both classification accuracy and training speed. Feng et al.10 proposed a bi-directional long short-term memory network with an attention mechanism (BiLSTM-Attention) to classify power equipment defect records, improving feature extraction and classification for long texts containing confusing, meaningless information. A retrieval method for power equipment defect records was developed by Liu et al.11 based on knowledge graph technology, which presented a knowledge graph of power equipment defect records. However, it did not show how colloquial descriptions relate to the standardized defect types that would be used to construct the properties of the graph.
Most text classification methods for power equipment defects first pre-process the text before classification. Pre-processing cleans or standardizes the text data and mainly includes three steps: Chinese word segmentation, stop-word removal, and normalization of objects. After this process, the defect text is divided into several phrase tags, which are represented as word vectors in a low-dimensional feature space to extract the feature vectors of the corresponding phrases. For knowledge in specific fields, these methods can improve classification accuracy to a certain extent. However, pre-processing not only increases processing time and the cumulative error of the model but also limits its general applicability. In addition, most current short-text classification methods for power equipment defects use relatively few classes, typically classifying defect descriptions into only three levels according to defect severity: urgent, significant, and general.
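For illustration, a minimal sketch of such a conventional pre-processing pipeline is given below, assuming the jieba segmenter and a small inline stop-word set; BERT-TriF itself skips this stage entirely.

```python
# Minimal sketch of the conventional pre-processing pipeline described above.
# Assumptions: jieba for Chinese word segmentation and a tiny inline
# stop-word set; production systems load a full stop-word list instead.
import jieba

STOPWORDS = {"的", "了", "在", "有", "和"}  # illustrative subset

def preprocess(text):
    # Step 1: Chinese word segmentation.
    tokens = jieba.lcut(text)
    # Step 2: stop-word removal.
    tokens = [t for t in tokens if t not in STOPWORDS and not t.isspace()]
    # Step 3: normalization of objects (here simply lower-casing Latin
    # characters; real systems also map equipment-name synonyms to a
    # canonical form).
    return [t.lower() for t in tokens]

# Example defect description ("slight oil seepage near the transformer's
# gas relay"); the resulting phrase tags would then be mapped to word vectors.
print(preprocess("变压器气体继电器附近有轻微渗油"))
```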
In summary, to address the heavy pre-processing workload of power equipment defect records, the poor adaptability of existing defect text methods, and the small number of classification categories, this paper proposes a Bidirectional Encoder Representations from Transformers with Family Feature Fusion model (BERT-TriF). The contributions of this study can be summarized as follows:

• Instead of complex pre-processing, the short defect text is fed directly into BERT for semantic representation, which shortens processing time and improves general applicability.
• A family feature fusion algorithm is proposed to make full use of the historical family features of each defect type during training.
• An improved multi-head attention mechanism, which replaces the scaled dot-product attention unit with an additive attention unit, is developed to enhance semantic category features and speed up model convergence.

RELATED WORK

Power equipment defect record
An equipment defect refers to an abnormal quality condition of equipment during manufacturing, transportation, construction, installation, operation, or maintenance. It also covers non-compliance with national laws and regulations or national (industry) mandatory provisions, violation of enterprise standards or "countermeasures" requirements, failure to meet design or technical agreement requirements, unexpected appearance or performance, and threats to personal safety, equipment safety, or power grid safety. To guide and improve equipment life cycle management, the raw defect records must be organized and analyzed. Currently, defect management of power equipment generally circulates through defect sheets, which record detailed information on equipment defects, including defect level, equipment category, equipment name, defect description, defect type, defect cause, and treatment measures. The defect records are entered manually by the operator after a defect is found. In particular, the defect description is colloquial, with no fixed format or structure, and has prominent short text characteristics. For example, for "There is a slight oil leak near the gas relay of the transformer, no more than 12 drops per minute", the corresponding defect type is oil leakage. On one hand, current defect descriptions often omit critical information, making it impossible to grasp the defect accurately. On the other hand, defect descriptions also suffer from redundancy.12

Short text classification
Short text classification, as a kind of multi-label text classification, is the process of associating a given short text with one or more labels according to the text's characteristics (content or attributes) under a predefined classification system. The ultimate goal is to find a practical mapping function that accurately maps the input text to a given label; this mapping function is what we call a classifier. There are therefore two critical issues in text classification: the representation of the text and the design of the classifier. Early multi-label text classification methods were based on traditional machine learning; their implementation is relatively simple, but the results are still not ideal.13 Traditional machine learning approaches pre-process the text, extract features, vectorize the processed text, and finally model the training dataset with standard classification algorithms such as K-nearest neighbors, naive Bayes, decision trees, and support vector machines. It is difficult for traditional methods to handle short text classification because the limited words in a short text cannot represent the feature space and the relationship between words and documents.14 The quality of text feature extraction significantly impacts classification accuracy. In contrast, methods based on deep learning have sparked a flurry of research in recent years. They train on the data directly through deep models without manual feature extraction; classification accuracy instead depends more on the amount of data and the number of training iterations. Minaee et al.15 grouped these models into several categories based on their architectures, such as recurrent neural network (RNN)-based models, convolutional neural network (CNN)-based models, capsule neural networks, and so on. Several baseline deep learning models for short text classification are described below for subsequent comparison with BERT-TriF.
• TEXT-CNN: In 2014, Kim et al.16 proposed TEXT-CNN. This method obtains n-gram feature representations of a sentence through one-dimensional convolution and uses pre-trained word vectors as the hidden layer, in contrast to CNNs as conventionally used for image processing (a minimal sketch follows this list). However, the method has two shortcomings. First, positional information of features is lost during pooling: although the convolutional layer preserves feature positions, keeping only the single maximum value discards them. Second, for strong, high-frequency features, max pooling retains only one maximum value, so feature intensity information is not reflected after processing.
• TEXT-RCNN: Built on a bidirectional recurrent structure, the Recurrent Convolutional Neural Network (RCNN) combines the characteristics of RNNs and CNNs.17 TextRCNN replaces TextCNN's convolutional feature extraction with an RNN. First, a bidirectional RNN captures the contextual semantic and grammatical information of the input text; max pooling then automatically filters out the most critical features; finally, a fully connected layer outputs the category probabilities. However, its accuracy improvement over TEXT-CNN is not significant, while it is relatively time-consuming.
• fastText: As a short text classification algorithm, fastText has two significant advantages over neural network-based classification algorithms. One is faster training and testing while maintaining high precision. The other is that fastText trains word vectors itself rather than requiring pre-trained word vectors.18 The core idea is to superimpose the word and n-gram vectors of an entire document to obtain the document vector, which is then used for softmax multi-classification. To this end, fastText introduces character-level n-gram features to represent a word and uses hierarchical softmax classification, which reduces the complexity from O(N) to O(log N) by constructing a Huffman tree according to category frequencies instead of using standard softmax. However, since it follows the bag-of-words idea, its model structure is simple and the semantic information it captures is limited. It is also limited by the context length: for distances larger than the context window, word-order information cannot be captured.
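As a concrete reference for the TEXT-CNN baseline above, the following is a minimal PyTorch sketch in the spirit of Kim's design; the vocabulary size, dimensions, filter widths, and class count are illustrative assumptions.

```python
# Minimal TEXT-CNN sketch (Kim, 2014 style): parallel 1-D convolutions over
# word embeddings, max pooling per filter, then a linear classifier.
# Vocabulary size, dimensions, and filter widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=300,
                 num_filters=100, filter_sizes=(2, 3, 4), num_classes=17):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per n-gram width.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        # Max pooling keeps only the strongest activation per filter, which
        # is exactly where position and intensity information is lost.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

logits = TextCNN()(torch.randint(0, 30000, (4, 50)))   # 4 texts, 50 tokens
```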

Current combination methods
However, single self-attention cannot capture the order of a sequence, because it mainly focuses on local consecutive word sequences.20,21 As a pre-trained language model, BERT leverages multi-layer multi-head self-attention (the transformer) together with positional word embeddings.22 It is able to capture representations at both the word and sentence levels. Unlike OpenGPT,23,24 which predicts words based on previous predictions, BERT is trained with the masked language modeling task, which randomly masks some tokens in a text sequence and then independently recovers the masked tokens conditioned on the encoding vectors produced by a bidirectional Transformer. However, plain BERT masks individual tokens: for words composed of two or more consecutive tokens, random masking splits the correlation between consecutive tokens, making it difficult for the model to learn semantic information across whole words.25,26 As BERT has recently become one of the state-of-the-art models for text classification and sentence embedding, there have been many attempts to improve it, such as RoBERTa,27 ALBERT,28 DistilBERT,29 and SpanBERT.30 In addition, some recent studies have combined graph convolutional networks (GCN) with existing baselines such as attention mechanisms and BERT for text classification. Ding et al.31 proposed the HyperGAT model, based on a dual attention mechanism, to model text documents with document-level hypergraphs, which improves the model's expressive power while reducing computational consumption. Lin et al.32 introduced a transductive model named BertGCN, which initializes node embeddings with pre-trained BERT representations and uses a GCN for classification. Yao et al.33 constructed a whole corpus as a heterogeneous graph and jointly learned word and document embeddings with graph neural networks in the TextGCN model. However, these GCN-based methods are better suited to large-corpus processing or long text classification than to short text classification.
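To illustrate the masked language modeling behavior described above, here is a small example using the Hugging Face transformers library; the model checkpoint and the example sentence are assumptions chosen for demonstration.

```python
# Sketch of BERT's masked language modeling: the model recovers a randomly
# masked token from bidirectional context. The checkpoint is an assumption.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-chinese")
# "There is slight [MASK] oil near the transformer's gas relay": BERT
# predicts each masked position independently, which is why correlations
# inside multi-character words can be lost.
for pred in unmasker("变压器气体继电器附近出现轻微[MASK]油")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```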

Model architecture
The BERT-TriF model is an inductive learning method that enhances the semantic representation of the input text by fusing the historical family features of each defect type, thereby improving text classification accuracy. Unlike the complex pipelines mentioned above, this paper sends the short power equipment defect text directly into the BERT model for semantic representation to obtain the semantic feature vector, which effectively shortens text processing time. The model mainly comprises a semantic representation module (BERT), a family feature fusion module, an improved multi-head attention module, and a softmax classifier. Since the BERT model has been widely studied, only the family feature fusion module and the improved multi-head attention module proposed in this paper are introduced in detail below.
Our model has different structures in the training stage and the application stage. The training stage includes all the above modules, while the application stage only leverages the output of the family feature fusion module rather than iterating on it continuously. In other words, the green dotted arrows in Figure 1 will not appear during the application stage. In the training stage, the loss function of the classification result is computed, and the relevant model parameters are updated through backpropagation. In the application stage, BERT first semantically represents the input text to obtain semantic feature vectors; then two identical pre-trained family feature matrices fused with historical information, together with the semantic feature vectors, are passed to the improved multi-head attention module; finally, after average pooling and nonlinear activation, we obtain the defect type probability.
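The application-stage flow just described can be summarized in the following high-level sketch. The class and parameter names are our own stand-ins, and toy modules replace BERT and the attention head so that the wiring runs end to end; the real modules are detailed (and sketched) in the next two subsections.

```python
# High-level sketch of the BERT-TriF application-stage flow; all names,
# shapes, and stand-in modules are illustrative assumptions.
import torch
import torch.nn as nn

d_m, N = 768, 17  # BERT hidden size, number of defect types

class BertTriFPipeline(nn.Module):
    """Application-stage wiring: encoder -> (x_m, F_m, F_m) -> fusion -> softmax."""
    def __init__(self, encoder, fusion_head, family_matrix_m):
        super().__init__()
        self.encoder = encoder          # BERT in the real model
        self.fusion_head = fusion_head  # improved multi-head attention module
        # Pre-trained, dimension-reduced family matrix F_m, frozen here:
        # the application stage only reads it and never iterates on it.
        self.register_buffer("F_m", family_matrix_m)

    def forward(self, token_ids):
        x_m = self.encoder(token_ids)                       # (batch, d_m)
        # Two identical copies of F_m act as K and Q; x_m acts as V.
        logits = self.fusion_head(x_m, self.F_m, self.F_m)
        return logits.softmax(dim=-1)                       # defect probabilities

# Runnable toy stand-ins (not the real modules):
encoder = nn.EmbeddingBag(21128, d_m)       # plays the role of BERT pooling
head = nn.Linear(d_m, N)
model = BertTriFPipeline(encoder, lambda x, k, q: head(x), torch.randn(d_m, N))
probs = model(torch.randint(0, 21128, (4, 50)))  # 4 texts of 50 tokens
```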

Family feature fusion module
Texts of the same category generally have similar semantic features. In the semantic space, text feature vectors of the same category tend to cluster together, forming family clusters distinguished by category. The family feature refers to the center of the semantic features of texts of the same type, that is, the center of the family cluster. Based on this observation, effectively utilizing the family features of the training set samples can aggregate similar text features, thereby enhancing the semantic representation of the input text and improving the accuracy of the classification task.
In the training stage, we first obtain the semantic representation of the text directly from the BERT model. As shown in Figure 1, the semantically represented feature vector, denoted x_m ∈ R^{d_m}, is output synchronously to the family feature module and the improved multi-head attention module. A fully connected (FC) layer is used to enhance the feature, producing a new semantic feature vector x_h ∈ R^{d_h}:

x_h = LeakyReLU(W_h x_m + b_h),

where W_h ∈ R^{d_h × d_m} is the weight matrix, b_h ∈ R^{d_h} is the bias vector, and LeakyReLU is the activation function.
We denote the family feature matrix as F ∈ R^{d_h × N} and initialize it to the zero matrix O ∈ R^{d_h × N} in the first round. Here N is the number of text categories, and each column f_n ∈ R^{d_h} is the family feature of the n-th text category. During training, average pooling is performed on the new text semantic feature x_h and the family feature f_n of the corresponding category, so that the latest category information is integrated into the historical family features. The updated family feature, as part of the latest family feature matrix, is then used in the next iteration.
Since F has been fully fused with the historical family features of the dataset after training, we use it directly in the application stage to shorten processing time. The iterative update of the family feature f_n is performed only in the training stage, which is why we call BERT-TriF an inductive learning method.

Whether in the training stage or the application stage, the current family feature matrix F must be reduced in dimension after integrating the historical family features. We denote the output of the fully connected layer as the final family feature matrix F_m:

F_m = W_m F + B_m,

where W_m ∈ R^{d_m × d_h} is the weight matrix and B_m ∈ R^{d_m × N} is the bias matrix.
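The following PyTorch sketch assembles the formulas of this subsection. The elementwise averaging update f_n ← (x_h + f_n)/2 is our reading of the average-pooling fusion, and the shared-bias FC layer stands in for the bias matrix B_m; both are interpretive assumptions rather than the authors' exact implementation.

```python
# Sketch of the family feature fusion module under our reading of the text;
# names, the update rule, and the bias simplification are assumptions.
import torch
import torch.nn as nn

class FamilyFeatureFusion(nn.Module):
    def __init__(self, d_m=768, d_h=1024, num_classes=17):
        super().__init__()
        self.fc_h = nn.Linear(d_m, d_h)   # W_h, b_h
        self.act = nn.LeakyReLU(0.2)
        # Stands in for F_m = W_m F + B_m (bias shared across columns here,
        # a simplification of the bias matrix B_m).
        self.fc_m = nn.Linear(d_h, d_m)
        # Family feature matrix F (d_h x N), zero-initialized; updated only
        # during training ("inductive": frozen at application time).
        self.register_buffer("F", torch.zeros(d_h, num_classes))

    def forward(self, x_m, labels=None):
        x_h = self.act(self.fc_h(x_m))    # x_h = LeakyReLU(W_h x_m + b_h)
        if self.training and labels is not None:
            with torch.no_grad():
                for x, n in zip(x_h, labels):
                    # Average-pool the new feature with the stored family
                    # feature of the same category.
                    self.F[:, n] = 0.5 * (self.F[:, n] + x)
        # Dimension reduction, applied columnwise: F_m has shape (d_m, N).
        F_m = self.fc_m(self.F.t()).t()
        return x_h, F_m
```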

Improved multi-head attention module
Ideally, the semantic features of texts in the same category should cluster together, while those of different categories should be separated by a considerable spatial distance, exhibiting the characteristics of "similar aggregation, heterogeneous discrete, and disjoint with each other". In practical semantic models, however, the semantic features of many categories are often intertwined, and finding a suitable separating hyperplane is impossible, which reduces the accuracy of classification tasks. If a multi-head attention mechanism is adopted, the family features containing the semantic information of similar historical texts can be integrated into the semantic representation to enhance the semantic category features, which helps eliminate fuzzy category decision boundaries and improves the separating hyperplane's classification effect. The essence of multi-head attention is multiple independent attention computations, which act as an ensemble to prevent overfitting. The improved multi-head attention mechanism in this paper draws on the idea of graph attention networks (GAT): V is the text semantic feature vector x_m output by BERT, while K and Q are two identical copies of the dimension-reduced final family feature matrix F_m. The specific algorithm structure is shown in Figure 2.
To enhance the feature representation of the input text, a linear layer with shared weight matrix W_l ∈ R^{d_l × d_m} linearly transforms V, K, and Q, yielding the transformed text feature x_l ∈ R^{d_l} and family feature matrices F_l ∈ R^{d_l × N}, where d_l is the feature transformation dimension. Each attention head is responsible for one subspace of the final output sequence, independently of the others.
The improvement of this model is, on one hand, to replace the scaled dot-product attention unit of the standard multi-head attention mechanism with an additive attention unit, which makes it easier to extract the implicit category features f_h related to the text semantics from the family features:

f_h = ELU( Σ_{i=1}^{N} α_i f_l^i ),

where f_l^i ∈ R^{d_l} is the i-th column of the family feature matrix F_l, that is, the family feature of the i-th category, ELU is the activation function, and α_i is the attention correlation coefficient of the i-th category feature, computed in GAT style as

α_i = exp( LeakyReLU( a^T [x_l || f_l^i] ) ) / Σ_{j=1}^{N} exp( LeakyReLU( a^T [x_l || f_l^j] ) ),

where a ∈ R^{2d_l} is the weight vector and || is the concatenation operator. On the other hand, after the additive attention operation, the category features are connected residually with the linearly transformed text semantic features to alleviate vanishing gradients and speed up model convergence. In addition, to strengthen the extraction of hidden category features, the text semantics and family features are re-fused across the heads of the multi-head attention mechanism. Finally, the enhanced text semantic representation is generated by concatenation and passed through a linear transformation:

x_c = W_o [ (f_h^1 + x_l^1) || (f_h^2 + x_l^2) || ⋯ || (f_h^{h_f} + x_l^{h_f}) ],

where W_o is the weight matrix and f_h^k and x_l^k are the fusion features of the k-th head. To improve the stability of the family feature fusion, h_f different sets of family feature information are aggregated into the text semantic representation, and the output feature vector x_o ∈ R^N is obtained after average pooling and nonlinear activation:

x_o = ELU( AvgPool( x_c ) ).

After the classifier, the softmax layer converts x_o into category probabilities:

softmax(x_o)_n = exp(x_{o,n}) / Σ_{j=1}^{N} exp(x_{o,j}).

The similarity of the two types of labels is then calculated; when the set threshold is reached, the colloquial defect description is considered to match the corresponding standardized defect type.
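A minimal PyTorch sketch of the improved attention module, under the GAT-style reconstruction above, is given below. The LeakyReLU inside the attention coefficient, the concat-then-linear output path, and all shapes are our interpretive assumptions, not the authors' exact implementation.

```python
# Sketch of the improved (additive, GAT-style) multi-head attention module;
# wiring and shapes follow our reconstruction of the formulas above.
import torch
import torch.nn as nn

class AdditiveAttentionHead(nn.Module):
    def __init__(self, d_m=768, d_l=512):
        super().__init__()
        self.W_l = nn.Linear(d_m, d_l, bias=False)   # shared linear transform
        self.a = nn.Parameter(torch.randn(2 * d_l))  # attention weight vector
        self.leaky = nn.LeakyReLU(0.2)               # GAT-style (assumption)
        self.elu = nn.ELU(alpha=1.0)

    def forward(self, x_m, F_m):
        # x_m: (batch, d_m) text semantics (V); F_m: (d_m, N) family matrix (K, Q).
        x_l = self.W_l(x_m)                          # (batch, d_l)
        F_l = self.W_l(F_m.t())                      # (N, d_l), rows are f_l^i
        # alpha_i = softmax_i( LeakyReLU( a^T [x_l || f_l^i] ) )
        pair = torch.cat([x_l.unsqueeze(1).expand(-1, F_l.size(0), -1),
                          F_l.unsqueeze(0).expand(x_l.size(0), -1, -1)], dim=-1)
        alpha = torch.softmax(self.leaky(pair @ self.a), dim=1)   # (batch, N)
        # f_h = ELU( sum_i alpha_i f_l^i ), then residual with x_l.
        f_h = self.elu(alpha @ F_l)                  # (batch, d_l)
        return f_h + x_l

class ImprovedMultiHeadAttention(nn.Module):
    def __init__(self, d_m=768, d_l=512, num_classes=17, heads=6):
        super().__init__()
        self.heads = nn.ModuleList(
            AdditiveAttentionHead(d_m, d_l) for _ in range(heads))
        self.W_o = nn.Linear(heads * d_l, num_classes)  # concat -> linear

    def forward(self, x_m, F_m):
        fused = torch.cat([h(x_m, F_m) for h in self.heads], dim=-1)
        x_o = nn.functional.elu(self.W_o(fused))     # nonlinear activation
        return x_o.softmax(dim=-1)                   # defect type probability

probs = ImprovedMultiHeadAttention()(torch.randn(4, 768), torch.randn(768, 17))
```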

Datasets
To verify the classification effect of the BERT-TriF model on actual power equipment defect records, we collected a total of 6763 power inspection defect reports from 2015 to 2022. The PowerFault dataset was obtained by sorting and induction, dividing the records into 17 defect types. Corresponding examples of colloquial defect descriptions are shown in Table 1.
To further verify the robustness of the BERT-TriF model, this paper introduces two Chinese benchmark datasets, THUCNews and TouTiao, and five English benchmark datasets, 20NG, R8, R52, Ohsumed, and MR. Details of the datasets are shown in Tables 2 and 3.

Model parameter settings
In the BERT semantic representation module, we applied the BERT-base pre-trained model, where the number of network layers L is set to 12, the hidden layer dimension H = d_m is set to 768, the number of self-attention heads A is set to 12, and the total parameter count reaches 110M. In the family feature fusion module, the dimension of the text semantic feature vector d_h is set to 1024. In the multi-head attention module, the feature transformation dimension d_l is set to 512, the number of attention layers stacked in parallel h_a is set to 6, and the negative half-plane parameter of the ELU activation function is set to 1.
In addition, the learning rate is initialized to 1e-4, the batch size is set to 128, and a total of h_f = 6 family feature fusion modules are stacked in parallel with the corresponding multi-head attention modules. The negative half-plane parameter of the LeakyReLU activation function is set to 0.2.
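For convenience, the hyperparameters listed above can be collected in a single configuration object; the following is simply a restatement in code form, with key names of our own choosing.

```python
# The hyperparameters listed above, gathered in one place (names are ours).
config = {
    # BERT-base semantic representation module
    "bert_layers_L": 12,
    "hidden_dim_d_m": 768,          # H = d_m
    "bert_attention_heads_A": 12,   # ~110M parameters in total
    # Family feature fusion module
    "family_dim_d_h": 1024,
    "leakyrelu_negative_slope": 0.2,
    # Improved multi-head attention module
    "transform_dim_d_l": 512,
    "attention_layers_h_a": 6,
    "elu_alpha": 1.0,
    # Training
    "learning_rate": 1e-4,
    "batch_size": 128,
    "family_fusion_heads_h_f": 6,
}
```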

Experimental results
To verify the classification effect, we compared the BERT-TriF model against baseline models including TextCNN, TextRNN, fastText, and BERT on two Chinese benchmark datasets and our PowerFault dataset. In addition, three integrated state-of-the-art inductive models were used for supplementary comparative experiments on the five English benchmark datasets. The classification results of each model are shown in Tables 4 and 5, where each value is the average accuracy over ten runs and has passed Student's t-test (p < 0.05). In the PowerFault experiment, thanks to its large-scale pre-trained semantic representation, even the single BERT model outperforms the other baseline models. Building on BERT, the BERT-TriF model further fuses the historical family features of each category to enhance the semantic representation of the input text, thereby improving classification accuracy (average accuracy of 85.7, compared with 84.8 for BERT alone, 84.3 for fastText, and 82.2 for TextRNN). Moreover, in the multilingual general text classification experiments, the BERT-TriF model also achieved better accuracy, verifying the model's robustness and universality. Consequently, it can be applied not only in power inspection scenarios with colloquial, highly sparse text, but also in other common English and Chinese short text scenarios.

Visual representation of classifications
T-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm and nonlinear dimensionality reduction technique well suited to embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization.34 Using the t-SNE method,35 we reduce the high-dimensional semantic feature vectors output by TextCNN, TextRNN, fastText, BERT, and our model in the PowerFault experiment to two dimensions and plot them in a plane coordinate system. Each of Figure 3A-D contains the 17 defect type labels in different colors. It can be seen that the 17 categories cannot be well distinguished and some categories are mixed together; in other words, the classification results of the baseline models appear indistinct and show the characteristics of "similar discrete and heterogeneous irregular". Since the number of attention layers stacked in parallel is set to six, each image in Figure 3E represents the output of one layer of the improved multi-head attention module. As above, each image has 17 type labels; to better differentiate from Figure 3A-D, star markers are superimposed for each label, with each color of star marker representing a defect type. Unlike the results from the above models, the defect record features from the BERT-TriF model show a pronounced aggregation effect and clear classification boundaries, verifying again, from the perspective of visualization, the superiority of the model's classification performance.
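A minimal sketch of this visualization step with scikit-learn and matplotlib is shown below; the random features stand in for the models' actual semantic feature vectors.

```python
# Sketch of the t-SNE visualization used above: project output features to
# 2-D and color by defect type. Random data stands in for model features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(680, 768))   # stand-in for semantic feature vectors
labels = rng.integers(0, 17, size=680)   # 17 defect types

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
sc = plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
plt.colorbar(sc, label="defect type")
plt.title("t-SNE of semantic feature vectors")
plt.show()
```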

CONCLUSION
This paper proposes BERT-TriF, an inductive model for short text classification in the power inspection field. Instead of complex pre-processing, the model uses BERT directly to represent the text semantics. Through the proposed family feature fusion module, historical information from the training text can be fully utilized. In addition, we develop an improved multi-head attention mechanism, which replaces the scaled dot-product attention unit with an additive attention unit and integrates implicit category features with text semantics to further speed up model convergence. The experimental results indicate that the model achieves better classification accuracy, robustness, and general applicability than benchmark deep learning models on both domain-specific and generic datasets. Visual representations of the learned features further verify this conclusion.

F I G U R E 1
Model architecture.

F I G U R E 3
Visual comparisons among different classification methods.
TA B L E 1 PowerFault dataset with 17 defect types.
TA B L E 2 Chinese text classification dataset statistics.
TA B L E 3 English text classification dataset statistics.
TA B L E 4 Comparison of average accuracy in Chinese datasets.
TA B L E 5 Comparison of average accuracy in English datasets.