Object recognition for power equipment via human-level concept learning

Inspection robots are becoming popular in substations due to the lack of personnel for operation and maintenance. However, these inspection robots remain at the level of perceptual intelligence rather than cognitive intelligence. To enable a robot to automatically detect defects of power equipment, object recognition is a critical step because the criteria of infrared diagnosis vary with the type of equipment. Since this task is not a big-sample learning problem, prior knowledge needs to be added to improve existing methods. Here, an object recognition model based on human-level concept learning is proposed, which utilizes the relationships between equipment. The proposed method is composed of three parts: Mask RCNN, a Bayesian Context Network and human-level concept learning. As the backbone network, Mask RCNN, a pixel-wise segmentation network, gives preliminary recognition results. Then, based on the object relationship graph of the Bayesian Context Network, human-level concept learning corrects the results in sequence by maximizing the conditional probability of an object given its neighbourhood. Experiments show that the accuracy of the proposed method increases by 9.7% compared with its backbone network, making industrial application of this technology possible.


INTRODUCTION
The technology of replacing humans with robots has become an important issue in the electric power industry. Inspection robots in substations, as a typical example of this technology, can considerably reduce the workload of operation and maintenance personnel [1][2][3]. However, a lot of manual labour is still required because these robots are not yet smart enough. For example, defect detection based on RGB and infrared images of power equipment is not automatic. To identify defects that cause a temperature rise, the type and position of the equipment need to be determined, since the temperature distribution of equipment is unobtainable without its position in the image, and the normal temperature range of equipment varies with its type. As a result, automatically recognizing the type of power equipment and relating it to different temperature alarm criteria become essential characteristics of next-generation inspection robot technology. However, power equipment recognition is not a big-sample learning problem, which makes it difficult at present. To our knowledge, there is no available dataset of annotated power equipment images. Since the appearances of power apparatus of a given type are quite similar due to insulation requirements, annotating images of power equipment requires professional knowledge of electrical engineering, which blocks the collection work. Although many object recognition models have been proposed recently [4][5][6][7][8], they generally achieve good performance on large datasets, e.g. the 200K-labelled COCO dataset [9] and the 11.5K-labelled Pascal VOC dataset [10]. To improve their performance on the task of power equipment recognition, more information within the images needs to be added as prior knowledge.
For a scene of power equipment, strong relationships between objects exist because of the functional connections between them. For example, bushings tend to appear in a picture together with tubular bus-bars, since they are electrically connected. Similarly, tubular bus-bars and the column insulators which support them are often found together in a picture. This abundant and useful context information is ignored in object recognition models [4][5][6][7][8], which classify candidate boxes independently. As a result, adding the relationships between objects is a natural way of improving these models on the power equipment recognition task.
Object relationship modelling can be divided into two types: parametric and non-parametric. A parametric model tends to map features into a higher-dimensional space, where they are then processed by complex non-linear functions. Benefiting from a large number of parameters, this kind of model has strong expressivity; in other words, it performs quite well on large datasets. However, parametric models are still a black box at present, whose mechanism is totally different from human perception. A non-parametric model, on the other hand, learns object relationships through probabilities, which makes it less dependent on data by introducing uncertainty measurements and prior knowledge. This kind of learning mechanism is better understood by humans, making the combination of algorithm and human knowledge possible.
Parametric object relationship modelling has attracted the attention of researchers for its good performance on large datasets. In [11], object detection is formulated as graph structure inference, with nodes and edges denoting objects and object relationships respectively. The proposed Structure Inference Network (SIN) is based on Recurrent Neural Networks [12], where object features are taken as the initial state, and messages from scene features and the features of other objects are taken as input. The authors of [13] proposed a relation network (RN) based on the attention module [14,15], which simulates a human focusing on local information instead of the whole image. In RN, object relationships consist of appearance features and geometric features which are embedded into the same high-dimensional space for subsequent operations. These methods tend to include large numbers of learnable parameters, giving them strong expressivity but requiring a lot of data for learning.
Power equipment relationship modelling should be non-parametric for the following reasons. First, compared with parametric models, a non-parametric model is less dependent on data. Besides, a non-parametric model is similar to human perception, making the introduction of human knowledge and logical rules possible. A suitable framework for non-parametric object relationship modelling is introduced below. In [16], Hierarchical Bayesian Program Learning (HBPL), based on three important principles of the human learning process (compositionality, causality, and learning to learn [17]), is designed for small-sample learning problems. It decomposes a complex concept into a combination of simple parts which are suitable for transfer learning. Besides, causal relationships between these simple parts are helpful for the introduction of human experience.
Inspired by the framework of HBPL, a non-parametric object relationship model is proposed. Detection results of Mask R-CNN [18], a pixel-wise segmentation network, are used as the input of the Bayesian Context Network (BCN), which constructs the relationship graph of the detection results. Then the human-level concept learning module corrects these detection results, in order of the number of reliable neighbours, by maximizing the probability of a device given its neighbourhood. Compared with parametric models, the proposed model is more suitable for this task, making industrial application of this technology possible.

NETWORK STRUCTURE
In Figure 1, the whole structure of the proposed method is given, where BCN is used to model object relationships and Human-level Concept Learning (HLCL) is utilized to correct detection results based on these relationships. Since the object relationships are modelled at the level of objects, detection results of Mask R-CNN are used as the input of BCN. Mask R-CNN, a deep convolutional neural network, is elaborated for the task of object recognition but ignores the relationships between objects [18]. BCN is a graph representation of the scene in which nodes denote equipment and edges denote equipment relationships. HLCL is a generative method based on the framework of HBPL. By maximizing the probability of a node given its neighbourhood, HLCL is able to infer the accurate equipment types.
Characteristics of Mask R-CNN, which is the foundation of the object relationship model, are discussed below. Mask R-CNN first feeds the input image into a convolutional backbone to generate regions of interest (RoIs) via the Region Proposal Network (RPN) and RoI features via the RoI Align layer. Then an R-CNN head and a mask head are utilized respectively to classify RoIs and to generate the boxes and masks which represent the positions of objects in the picture. The classification of RoIs is independent, without consideration of their relationships. The class, box and mask compose the detection results.

OBJECT RELATIONSHIP MODELLING
The main contribution of this paper is to model object relationships at the level of objects, which can greatly improve the performance of object recognition in scenes like power equipment in substations. These scenes have two obvious characteristics: (1) few labelled samples are available for training; (2) strong relationships exist between objects. The modelling process is divided into two parts, BCN and HLCL, which are introduced in detail below.

Bayesian context network
Before modelling object relationships, a scene composed of many objects and their relationships needs to be represented clearly and efficiently. As a kind of representation, a graph contains rich relational information among elements [19]. As a result, a graph of the scene constructed from the detection results of Mask R-CNN is proposed, named the Bayesian Context Network. The construction process is described in detail here, with the illustration shown in Figure 2.
A BCN is composed of nodes and edges, which respectively denote objects and the relationships between objects. It is easy to understand that a single object is represented by one node. As for the relationships between objects, some explanations need to be given. A relationship, i.e. an edge in the graph, is defined by the two nodes it connects. From a mathematical perspective, an edge could exist between every two nodes. However, this strict rule makes the network quite complicated, and the processes of learning and prediction become quite time-consuming. Besides, for a single object, not all relationships are equally important: the closer two objects are, the stronger the connection between them. When learning with few labelled samples, it is necessary to focus on the stronger and more helpful relationships rather than all of them. Consequently, a node is only connected with its neighbours in the BCN.
After the definition of BCN, the attributes of the nodes are discussed. Suppose there are N objects d^(1), d^(2), …, d^(N) in the detection result of a single image. Each detected object d^(i) has three attributes: class d_c, bounding box d_b and score d_s. Bbox refers to the bounding box of the object and can be written as d_b = (x, y, w, h), where (x, y), w, h are the central point, width and height of the box respectively. Score refers to the confidence of the detected object given by Mask R-CNN. Obviously, d_s ∈ (0, 1), and a detected object with higher confidence is more likely to be correct.
Based on these attributes, the definition of neighbourhood is given. A single detected object is called an instance for convenience. In an image, the neighbourhood ε(d^(i)) of an instance d^(i) is the set of instances located close to it, with a hyper-parameter controlling the size of the neighbourhood.
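The neighbourhood construction can be sketched as follows. The centre-distance criterion and the scale factor k are hypothetical illustrations: the paper only states that closer objects are more strongly connected and that a hyper-parameter controls the neighbourhood size.

```python
import math

def neighbourhood(detections, i, k=1.5):
    """Return the indices of detections whose centres lie within a radius
    of detection i. The radius scales with the reference box size via the
    hypothetical hyper-parameter k (the neighbourhood-size parameter)."""
    xi, yi, wi, hi = detections[i]["bbox"]
    radius = k * max(wi, hi)
    neighbours = []
    for j, d in enumerate(detections):
        if j == i:
            continue
        xj, yj, _, _ = d["bbox"]
        if math.hypot(xj - xi, yj - yi) <= radius:
            neighbours.append(j)
    return neighbours

# Toy scene: a bus with a nearby insulator and a distant switch.
dets = [
    {"cls": "bus",       "bbox": (100, 50, 200, 20), "score": 0.95},
    {"cls": "insulator", "bbox": (110, 90, 20, 60),  "score": 0.55},
    {"cls": "switch",    "bbox": (800, 400, 40, 80), "score": 0.92},
]
print(neighbourhood(dets, 0))  # → [1]: the nearby insulator, not the distant switch
```

Only these local edges enter the BCN, which keeps the graph sparse and the later probability estimates tractable with few samples.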

Human-level concept learning
After the representation of a scene, the correction of the detection results from the perspective of object relationships is done with Human-level Concept Learning, a generative model based on HBPL.
Inferring the correct classes of all instances at once is difficult, so it is done step by step in the proposed method. Suppose there is an instance d with neighbourhood ε(d). From the perspective of object relationships, the instance and its neighbours are most probable given its correct class g_c, which can be written as:

ĝ_c = argmax_{g_c} P(ε(d), g_c). (3)

The order of correction within a single picture is discussed below. When inferring the correct class of a node, the inference is more likely to be accurate if more instances in its neighbourhood ε(d) are reliable. Reliable instances come from two sources: (1) instances with a high detection confidence; (2) instances already corrected by the model. Unreliable neighbouring nodes are not used in correction. Figures 3 and 4 give, respectively, an illustration of the inference process and the correction process of HLCL.
Suppose there are N instances d^(1), d^(2), …, d^(N). Figure 3 illustrates the inference process of Human-level Concept Learning for a single picture: the inference contains five steps, ordered by Roman numerals. In all the steps, white nodes denote raw results of Mask RCNN, blue nodes denote reliable or already-corrected instances, and red nodes denote instances that need to be corrected. From Step I to Step II, the confidence of each instance is used to judge the reliability of the nodes: an instance d^(j) with d_s^(j) ≥ threshold is judged reliable and no correction is made to it; for d_s^(j) < threshold, d^(j) will be corrected in sequence. Now there are a set of reliable instances d^(r_1), d^(r_2), …, d^(r_m) and a sequence of unreliable instances d^(c_1), d^(c_2), …, d^(c_n), which are corrected in order:

For j = 1, 2, …, n,  ĝ_c^(c_j) = argmax_{g_c} P(S_j ∩ ε(d^(c_j)), g_c), (4)

where S_j is the set of reliable instances at step j, composed of the original reliable instances d^(r_1), d^(r_2), …, d^(r_m) and the instances corrected before step j; the intersection S_j ∩ ε(d^(c_j)) in Equation (4) reflects the idea that only reliable instances in the neighbourhood of d^(c_j) are used for inference. The order of the sequence d^(c_1), d^(c_2), …, d^(c_n) is determined by the number of reliable instances in the neighbourhood:

For j = 1, 2, …, n,  c_j = argmax_t #(S_j ∩ ε(d^(t))),

where S_j is the set of reliable instances at step j; #(S_j ∩ ε(d^(t))) is the number of reliable instances in the neighbourhood of d^(t); and d^(t) ranges over all the remaining instances that need correction. Directly handling Equation (3) is unfeasible for two reasons: (1) the probability distribution cannot be assumed; (2) the probability distribution lives in a very high-dimensional space, which is quite unfriendly to small-sample learning problems. As a result, Equation (3) is decomposed and simplified.
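The ordering rule above (at each step, pick the unreliable instance with the most reliable neighbours, then treat the corrected instance as reliable) can be sketched as a greedy loop. The adjacency layout below is a hypothetical illustration, not the paper's implementation:

```python
def correction_order(adj, reliable):
    """Greedy correction order: at each step pick the unreliable instance
    with the most reliable neighbours, mark it reliable, and repeat.
    `adj` maps an instance index to the set of its neighbour indices."""
    S = set(reliable)                       # current set of reliable instances
    pending = [i for i in adj if i not in S]
    order = []
    while pending:
        nxt = max(pending, key=lambda t: len(S & adj[t]))
        order.append(nxt)
        S.add(nxt)                          # a corrected instance becomes reliable
        pending.remove(nxt)
    return order

# Instances 0 and 2 are high-confidence; 1 and 3 need correction.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(correction_order(adj, reliable={0, 2}))  # → [1, 3]
```

Instance 1 is corrected first because it has two reliable neighbours, after which it becomes a reliable neighbour for instance 3.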
To make this clear, Step III in Figure 4 is used as an example to explain the decomposition. For this step, the joint probability in Equation (3) can be written by the chain rule as:

P(d_5, d_6, d_8, d_9, g_c) = P(d_9, g_c) P(d_5 | d_9, g_c) P(d_6 | d_5, d_9, g_c) P(d_8 | d_5, d_6, d_9, g_c).

Note that d_6 and d_8 are not connected, so P(d_8 | d_5, d_6, d_9, g_c) can be simplified to P(d_8 | d_5, d_9, g_c). In fact, since not all nodes are connected, every conditional probability in the correction process can be simplified in a similar way.
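This edge-based simplification of the chain rule can be mimicked symbolically. The function below is an illustrative sketch: each conditional keeps only the previously listed nodes that share an edge with the current one (plus the class variable g_c).

```python
def factorise(nodes, edges, target="g_c"):
    """Build the chain-rule factorisation of P(nodes, target), dropping
    from each conditional any earlier node not adjacent to the current one,
    mirroring the graph-based simplification in the correction process."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    factors, seen = [], []
    for n in nodes:
        cond = [m for m in seen if m in adj[n]] + [target]
        factors.append(f"P({n} | {', '.join(cond)})" if seen else f"P({n}, {target})")
        seen.append(n)
    return " ".join(factors)

# Same toy graph as the Step III example: d6 and d8 are not connected.
print(factorise(["d9", "d5", "d6", "d8"],
                [("d9", "d5"), ("d9", "d6"), ("d9", "d8"),
                 ("d5", "d6"), ("d5", "d8")]))
# → P(d9, g_c) P(d5 | d9, g_c) P(d6 | d9, d5, g_c) P(d8 | d9, d5, g_c)
```

As in the text, d_6 drops out of the conditional for d_8 because the two nodes share no edge.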
In the correction process, the instance attributes class d_c and bbox d_b are used. The effect of changes in camera location and distance is eliminated by normalising d_b. Through a translation that moves (x^(i), y^(i)) to the origin of coordinates and a scaling that takes w^(i) and h^(i) as the units of the x-axis and y-axis, the normalised bbox of a neighbour d^(j) with respect to d^(i) is obtained:

d'_b = ((x^(j) − x^(i)) / w^(i), (y^(j) − y^(i)) / h^(i), w^(j) / w^(i), h^(j) / h^(i)).

To make the model agree with human cognition, the bbox d'_b = (x', y', w', h') is divided into two parts: position d'_p = (x', y') and size d'_size = (w', h'). Therefore, for an instance, three attributes are considered in the correction process: d_c, d'_p, d'_size. Simultaneously using three attributes raises a problem: the joint probability distribution is hard to assume and therefore unfriendly to small-sample learning. This problem is solved according to human cognition of the relationships between power equipment. Suppose there is a picture of tubular buses and the column insulators which support them; their attributes can be described hierarchically: (1) Class: tubular buses and column insulators tend to appear together; (2) Position: if two devices are a tubular bus and a column insulator, the column insulator is very likely to be under the tubular bus; (3) Size: if two devices are a tubular bus and a column insulator, their sizes are very likely to be similar. The equation of this process is given here. For an instance d^(i), the probability of another instance d^(j) appearing in its neighbourhood can be written as:

P(d^(j) | g_c) = P(d_c^(j) | g_c) P(d'_p^(j) | d_c^(j), g_c) P(d'_size^(j) | d_c^(j), g_c),

where the three factors correspond respectively to which classes appear together, where a neighbour of that class tends to be, and how large it tends to be.
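The bbox normalisation can be sketched as follows, assuming the neighbour's centre offsets are divided by the reference box's width and height (the exact scaling convention is an assumption, since the extracted text only states that w and h become the axis units):

```python
def normalise(ref, other):
    """Express `other`'s centre-format box (x, y, w, h) in the frame of
    `ref`: translate the reference centre to the origin and scale by the
    reference width and height."""
    x0, y0, w0, h0 = ref
    x, y, w, h = other
    return ((x - x0) / w0, (y - y0) / h0, w / w0, h / h0)

# A horizontal bus and an insulator slightly right of and below it.
bus = (100.0, 50.0, 200.0, 20.0)
insulator = (110.0, 90.0, 20.0, 60.0)
print(normalise(bus, insulator))  # → (0.05, 2.0, 0.1, 3.0)
```

The first two components give the position part d'_p and the last two the size part d'_size, both now independent of where the camera stood.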

EXPERIMENTS
In this section, the proposed model is utilized to recognize power equipment in substations to demonstrate its feasibility and reliability. Besides, Mask R-CNN and the relation network [13] are also evaluated on this task for comparison with the proposed method.

Data source
A dataset of pictures taken by inspection robots at a substation in Shanghai, China in August 2018 is created for the experiments. The robots' camera is an RGB camera with a resolution of 1920 × 1080 pixels. To guarantee the performance of image recognition, out-of-focus, back-lit, night-shot and severely occluded pictures are not selected. Eventually 751 pictures of power equipment are adopted and annotated by professionals. Typical scenes of power equipment in substations are shown in Figure 5. The annotations contain 1918 insulators, 330 bushings, 66 arresters, 1326 capacitors, 258 buses, 306 PT/CTs and 210 switches. Generally speaking, using higher-resolution pictures as input brings more accurate recognition results at the expense of processing time.
However, there is no increase in accuracy once the resolution exceeds a certain threshold. In our experiments, the pictures are scaled to a resolution of 640 × 448 pixels, which has been tested as the resolution threshold for our dataset. To test the model performance under the condition of small-sample learning, the size of the training dataset is varied. Table 1 gives the configurations. Taking Config. No.1 as an example, 600 pictures are randomly chosen from the dataset for training, and the rest are used for testing. Five-fold cross validation is adopted, meaning 80% of the samples in the training set are used for model training while the other 20% are used for validation. To eliminate the effect of sample choice, experiments are conducted ten times for each configuration, and all the indexes used for evaluation are averaged.
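The sampling protocol above (a repeated random train/test split, with the training set carved into five folds of 80% training / 20% validation) can be sketched as follows; the dataset is represented only by indices, since the actual images are not public:

```python
import random

def splits(n_total, n_train, repeats=10, folds=5, seed=0):
    """Yield (train, test, cv) index sets: a random train/test split,
    repeated `repeats` times, with the training indices divided into
    `folds` cross-validation folds of (train_part, val_part)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_total))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        fold = len(train) // folds
        cv = [(train[:k * fold] + train[(k + 1) * fold:],   # 80% for training
               train[k * fold:(k + 1) * fold])              # 20% for validation
              for k in range(folds)]
        yield train, test, cv

# Config. No.1: 600 of the 751 pictures for training, the rest for testing.
train, test, cv = next(splits(751, 600))
print(len(train), len(test), len(cv[0][1]))  # → 600 151 120
```

Averaging the evaluation indexes over the ten repeats then removes the effect of any single lucky or unlucky split.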

Human-level concept learning
Weights of Mask R-CNN, which is at the bottom level of the proposed method, are learned with the training and validation sets. Since the dataset of power equipment images is too small to train all layers, only the RPN, R-CNN head and mask head of the network are trained with our data. For the other layers, weights pretrained on the COCO dataset are used. This is feasible because these layers are at the bottom of Mask R-CNN and are used to extract bottom-level features like edges, colours, curves and their combinations. The backbone network is ResNet-101 [20], a deep neural network proven effective at extracting features for object recognition. Feature pyramid networks [21] are adopted to help detect objects on different scales. Non-Maximum Suppression [22] is adopted to remove repeated detections. Conditional probabilities in the correction process are obtained from the training set. The validation set is used to obtain the hyper-parameters of BCN and HLCL, such as the one controlling the size of the neighbourhood and the threshold of confidence (determining whether an instance is reliable). The values of the hyper-parameters are chosen when the model has the best performance on the validation set.
An example of the correction process of HLCL is given in Figure 6 and discussed below in detail. Step I shows the original detection results of Mask RCNN. Step II shows the detection confidence, where devices in green have a high confidence and those in red have a low confidence. The threshold of confidence is set to 0.9. From Step II, it can be found that three instances need correction. In order of the number of reliable neighbours, the leftmost instance is corrected first. The type of this instance given by Mask RCNN is insulator. Considering its neighbourhood, which is marked by a red box in Step III, there is an instance of bushing completely overlapping it. Statistics show that this situation seldom happens. Therefore, HLCL removes this instance, as shown in Step IV. Similarly, in Step V, the wrong instance of insulator is corrected to bushing based on the instance of bushing near it; in Step VII, the wrong instance of CT is removed based on the instance of bushing above it.
A comparison of detection results between Mask R-CNN (left column) and Human-level concept learning (right column) is shown in Figure 7, where correct detections are in green, wrong detections in red and missing objects in white. Although almost half of the detection results of Mask R-CNN are correct, quite a few wrong detections lower the quality of object recognition, making automatic fault detection difficult. With human-level concept learning, these wrong detections are corrected based on the relationships between power equipment.

Comparison with other methods
In this section, HLCL is compared with the relation network (RN) [13] and the Det59-R128 Network [23], which are parametric and non-parametric object relationship models respectively. Interactions between the appearance features and geometry features of objects are considered in RN. In the experiment, two object relation modules are embedded in Mask R-CNN to improve its performance, as shown in Figure 9. This module, which is essentially a scaled dot-product operation, includes four large weight matrices of learnable parameters, impeding its performance on small datasets. Besides, human knowledge cannot be added into the module due to these inexplicable parameters. To be specific, the structure of Mask R-CNN here is the same as that used in human-level concept learning. The two object relation modules are added after the two fully connected layers in the R-CNN head, respectively. Non-maximum suppression [22] is adopted to remove repeated detections. The authors of the Det59-R128 Network, on the other hand, proposed a non-parametric model for object relationships. In this network, a knowledge database of probabilities and a region-to-region graph are built to indicate object relationships; then the features of the original region of an object and those of related regions are fused for recognition. Figure 10 shows the network structure used in our experiment. Even though the non-parametric module reduces the dependence on data, some weaknesses still exist. First, only the category of objects is considered when modelling object relationships, while other useful information such as position and size is ignored. The proposed model, however, includes all useful attributes of objects, making the object relationships more accurate. Besides, the reliability of the category of the detection results is not discussed. Once the category of a detection result is wrong, the object relationships based on this information become unreliable.
This problem is solved in the proposed method by correcting object categories in sequence, where unreliable results are not used in a single correction step. Specifically, the structure of the Det59-R128 Network in the experiment is organized as follows. The DetNet-59 [24] network is adopted as the backbone. Based on category-level knowledge learnt from the training data, a region-to-region graph is used to find related regions from the RoIs for each object. Then the features of the original region of the object and those of the related regions are fused for further classification and localization. It should be noted that only the top 128 RoIs are included in the region-to-region graph. Detection results of RN and the Det59-R128 Network are shown in Figure 8, where correct detections are in green, wrong detections in red and missing objects in white. It can be seen from Figures 7 and 8 that the detection results of RN and the Det59-R128 Network are better than those of Mask R-CNN but still worse than those of the proposed method. This is because the proposed method, as a non-parametric probability-learning-based model, is more suitable for small-sample learning than parametric models like RN. Even though the Det59-R128 Network is also non-parametric, it ignores useful attributes such as the position and size of objects and suffers from unreliable category information, which the proposed method solves by iterative correction.
To further compare these models, the accuracy of the results under Config. No.1 is given in Table 2, where the F1 score is adopted as the evaluation metric. For object recognition, the objective includes two aspects: class and position. While class is easy to evaluate, position is evaluated in a carefully designed way. For both the ground truth and the detection result, the position can be represented by the bounding box, and the difference between them can be evaluated by Intersection over Union (IoU) [25]. A detection result is judged as accurate if its class is correct and the IoU between its bounding box and that of the ground truth is greater than a threshold, commonly set as 50% [18]. The F1 score is then defined as the harmonic mean of precision and recall [26,27].
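The evaluation criterion above can be sketched as follows; note that, purely for convenience, the boxes here use corner coordinates (x1, y1, x2, y2) rather than the centre format used earlier:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def f1(tp, fp, fn):
    """F1 score: the harmonic mean of precision and recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Two boxes overlapping in half their width: IoU = 50 / 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # → 0.333
print(round(f1(8, 2, 2), 2))                          # → 0.8
```

A detection counts as a true positive only when its class matches and its IoU with the matched ground-truth box exceeds the 50% threshold; F1 is then computed per class from the resulting counts.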
Some conclusions can be drawn from Table 2. First, the total F1 score of HLCL is 9.7% higher than that of Mask RCNN. Compared with Mask RCNN, HLCL considers the relationships between power equipment. Power equipment recognition is a small-sample learning problem with two characteristics: (1) there are far fewer types of power equipment than of common objects; (2) the probability of certain types of equipment appearing together is high. As a result, HLCL can be used in this task and achieves good performance.
Second, the total F1 score of HLCL is 6.4% higher than that of RN and 5.1% higher than that of Det59-R128. Compared with the parametric model RN, HLCL, based on probability learning, models equipment relationships in a human-like way, making it possible to introduce human knowledge as a prior. In contrast, the large number of inexplicable parameters of RN, which are completely learnt from data, become inaccurate with insufficient training samples. As a result, HLCL is more suitable for power equipment recognition than RN. On the other hand, HLCL is better than the non-parametric Det59-R128 Network, since it includes more attributes of objects, such as position and size, and overcomes the challenge of unreliable category information by iterative correction.
Third, the F1 score of Mask RCNN varies with equipment type. In fact, two factors largely affect the accuracy of recognition: (1) the number of training samples; (2) the degree of overlapping and occlusion. It can be found from Table 2 that the F1 scores of Arrester and Bus are comparatively low. This is because the number of training samples for Arrester is quite small, and overlapping and occlusion are quite severe for Bus. It should be noted that the accuracy of all classes is substantially improved by the proposed method, meaning that object relationship modelling is helpful to the recognition of all classes rather than specific ones.
Finally, the performance of the four models as the training-set size changes is shown in Figure 11. The reliability of the proposed method is demonstrated, since its performance is much more stable than that of the other three models. A good model for small-sample learning problems should be little affected by training-set size within a certain range. Besides, its performance can be further improved with more training samples.

CONCLUSION
HLCL is proposed to model power equipment relationships, which is necessary to improve the performance of power equipment recognition with few training samples. Inspired by the human recognition mechanism, HLCL corrects detection results in sequence based on object relationships, implemented by maximizing the conditional probability of an object given its neighbourhood. Compared with other object relationship models, HLCL has these advantages: it is non-parametric, less data-dependent, and makes it easy to introduce human experience as a prior as well as logical rules. Experiments on our dataset show that HLCL outperforms the other three models on this task, making industrial application of this technology possible.