An insulator inspection method based on deep learning applicable to multi-scale and occlusion conditions

As important items of equipment in power systems, insulators must operate normally to ensure the safe operation of the power system. Insulator positioning and identification technology based on machine vision can quickly and accurately complete the inspection of insulators on site and effectively save operation and maintenance costs. This paper proposes an insulator inspection method based on region-based convolutional neural networks (R-CNNs). First, the insulator image dataset is preprocessed by means of data expansion. Then, feature extraction from insulator images is realised using the Zeiler-Fergus (ZF) network. The k-means clustering method is used to optimise the selection of anchors. Meanwhile, the non-maximum suppression post-processing algorithm is improved, and a non-linear penalty factor is introduced to adapt to multi-scale and overlapping, occluded insulator inspection. Experimental results show that the improved Faster R-CNN insulator inspection method can accurately obtain the bounding box and the corresponding probability value of the insulator object and improve the average precision by 10.43%, achieving accurate inspection of the insulator object.


INTRODUCTION
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. The Journal of Engineering published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

As important items of equipment in power systems, insulators are widely used in all links of transmission lines. Insulators are mainly used to support, connect and insulate transmission lines, and insulator deterioration has an important impact on the safety of the power system. The traditional method of insulator inspection is manual inspection in the field. With the rapid development of unmanned aerial vehicle (UAV) technology, robot technology and computer image processing technology, imaging equipment can now be mounted on UAVs and inspection robots to realise automatic inspection of insulators, thus saving a large amount of human cost. In recent years, object detection and location methods based on the convolutional neural network (CNN) have been widely studied, and great progress has been made in face recognition, speech recognition, traffic sign recognition and other fields. With the maturity and rapid development of artificial neural network technology in image object positioning, machine vision recognition methods for power equipment have also received the attention of scholars. In [1], a detection method for insulators based on local features and spatial sequence features is proposed. This method uses multi-scale, multi-feature descriptors to describe the local features of insulators in aerial images, improves the robustness of the algorithm through spatial sequential features, and improves the matching performance of feature points through a coarse-to-fine matching strategy.
In [2], an insulator string detection method based on feature point detection and binary local feature detection is proposed. This method can effectively handle object detection in low-quality insulator infrared images and has a high classification accuracy. In [3], a double odd-even morphological gradient edge detection algorithm is proposed to complete the image segmentation of insulator strings. Reference [4] proposes an insulator inspection method based on deep learning. In this method, the single shot multibox detector (SSD) algorithm is adopted, and the recognition accuracy for porcelain and composite insulators reaches 94.1% and 86.7%, respectively. In [5], you only look once version 3 (YOLOv3) is used to realise a detection algorithm framework for insulation equipment. Under the condition of high recognition accuracy, the average detection time per object is 0.0182 s. In [6], an algorithm is proposed to decompose the object into multiple variable parts with an intersection and then aggregate them after detection so as to improve the detection accuracy for aerial insulators. In [7], the insulator is extracted from the detection image taken by UAV for image segmentation; it is used for fault detection to determine, from the distance between the insulators, whether the insulator string has self-explosion defects. Reference [8] proposes an automatic positioning method for aerial-image insulators based on the binarised normalised gradient feature and a CNN, which is used to detect insulators in aerial photographs.
In this paper, the algorithm is improved and optimised for problems such as different sizes, different length-width ratios, overlapping occlusion and missing image regions in the detection of insulators. Appropriate length-width ratios and image scales are found by clustering the enhanced dataset, and the post-processing algorithm of non-maximum suppression (NMS) is improved and optimised, which effectively solves the above problems and improves the detection accuracy.

Development of object detection methods
Object detection based on traditional handcrafted features needs diversified detection algorithms and ingenious computing methods to accelerate the model. The Viola-Jones (VJ) algorithm [9], a classical face detection algorithm, was proposed at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2001. VJ uses the most conservative sliding-window detection, which is computationally expensive, so it uses an integral image for acceleration, over-complete Haar features, and the adaptive boosting (AdaBoost) algorithm to select features. The histogram of oriented gradients (HOG) [10] in 2005 adopts the original multi-scale pyramid and sliding window for detection: the size of the detector window is fixed, and a multi-scale image pyramid is constructed by scaling the image successively. To balance speed and performance, the classifier used with HOG is usually a linear classifier or a cascaded decision classifier; to balance invariance and discriminability, the concept of a block is introduced into the histogram of gradient directions. The 2008 deformable part model (DPM) [11] represents the peak of classic handcrafted features. DPM splits the whole-object detection problem of traditional object detection into detection problems for each part of the model and then aggregates the part detections to obtain the final result, a whole-part-whole process. There are two main problems in traditional handcrafted-feature insulator inspection. First, the region selection strategy based on the sliding window is not targeted, the time complexity is high, and the windows are redundant. Second, handcrafted features do not have good robustness to diverse variations; especially when dealing with multi-object images, the problem is more prominent.
In 2013, the deep learning-based region-CNN (R-CNN) series of object-proposal detection methods was proposed. R-CNN [12] uses a very simple detection strategy. First, the selective search (SS) algorithm is used to generate object proposals in the image based on information such as texture, and each proposal is scaled to the same size. Second, a network model trained on ImageNet is used for feature extraction. Finally, a support vector machine (SVM) classifier determines the category, the candidate boxes are sorted by score, the excluded candidate boxes are removed according to the rules, and the final correct box is returned. Although R-CNN made great progress, its shortcomings are also obvious: because the fully connected layer must be accessed, the image size must be fixed, and the pre-selected box is cropped or stretched, which deforms the picture and causes distortion. In 2014, spatial pyramid pooling (SPP) Net [13] added an SPP layer between the convolutional layer and the fully connected layer: the three-level pyramid layers are pooled separately, and the pooled results are concatenated and sent to the fully connected layer, but it is still slow. In 2015, Fast R-CNN [14] performed feature extraction on the full image only and used the SS algorithm to generate the candidate boxes. The features of each candidate box are obtained by mapping its position in the original image onto the feature map and selecting the corresponding area. However, because the candidate boxes selected by SS are of different sizes, a fixed-size pooling layer (region of interest (ROI) pooling) is added to extract fixed-length features for each region proposal. At the same time, a multi-task loss function is used, connected to a softmax classifier and a bounding-box regressor, respectively, to integrate border regression into the R-CNN network.
Fast R-CNN adopts the advantages of R-CNN and SPP Net but still needs to use external algorithms to extract candidate boxes. Also in 2015, Faster R-CNN [15] achieved end-to-end processing. From R-CNN to Fast R-CNN to Faster R-CNN, the speed and accuracy of detection have been continuously improved.
In 2015, YOLO [16], an object detector based on a single integrated convolutional network, was proposed. YOLO directly takes the entire image as the network input and obtains the object bounding-box locations and categories through only one forward pass. Although YOLO has a very fast detection speed, its accuracy is somewhat lower than that of Faster R-CNN, especially for small objects, and its positioning accuracy is slightly insufficient. The SSD [17] algorithm proposed in 2016 improved on these problems by absorbing the advantages of YOLO's fast speed and the accurate positioning of the region proposal network (RPN), combining YOLO's regression idea with Faster R-CNN's anchor-box mechanism. The multi-reference-window technique of the RPN is adopted, and detection on multi-resolution feature maps is further proposed; thanks to this multi-level feature classification, SSD achieved accuracy on visual object classes 2007 (VOC07) only slightly lower than Faster R-CNN. Nevertheless, it remains difficult for SSD to detect small objects: the last convolution layer has a large receptive field, which makes the features of small objects indistinct. In 2017, the deconvolutional single shot detector (DSSD) [18] changed the backbone network of SSD from the visual geometry group network (VGG) to ResNet-101 in order to solve the problem of poor robustness on small objects, which enhanced the feature-extraction ability and added a large amount of context information through deconvolution layers. The prior-box selection by multi-dimensional clustering in YOLOv2 [19] plays a useful role in improving the aspect ratios. YOLOv3 [20] at CVPR 2018 adopts a residual network structure, forming a deeper network, as well as multi-scale detection, and improves the mean average precision (AP) and the detection of small objects, but its detection of medium and large objects is still unsatisfactory.
M2Det [21] in CVPR 2019 introduces a multi-level feature pyramid network on the basis of SSD, which improves the detection of multi-scale objects.
Although detection models based on a single integrated convolutional network are obviously faster than detection algorithms based on object proposals, their detection accuracy remains slightly inferior to the latter's.

Faster R-CNN
Faster R-CNN is composed of an RPN and Fast R-CNN. Figure 1 shows the structure of Faster R-CNN. The two parts share the convolutional layers. After the image to be inspected is normalised, it enters a CNN model for feature extraction, and the RPN is then responsible for finding proposals; the RPN greatly reduces the time needed to extract region proposals. After the last shared convolution layer, two branches are used for regression and classification, respectively, and an 'anchor' (multi-reference window) mechanism is designed in it. Fast R-CNN is responsible for further refining the results of the RPN and finally outputs the probability value and coordinates of the object. The loss function expresses the difference between the model's predictions and the ground truth, which plays a vital role in the evaluation and adjustment of the model parameters. It consists of two parts: a classification loss and a regression loss. From [14,15], we can get

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

where i indexes an anchor and p_i represents the probability of being predicted as the object. If the anchor is positive, p_i^* is equal to 1; otherwise it is 0 for the classification judgment. t_i is the vector of four coordinate parameters (x, y, w, h) of the proposal predicted by the network, and t_i^* represents the regression target (x^*, y^*, w^*, h^*), the coordinate parameters of the ground-truth box. N_{cls} and N_{reg} are used for normalisation. L_{cls} and L_{reg} represent the classification and regression losses, respectively: L_{cls} is a logarithmic loss, and R in L_{reg} is the smooth L1 function. The balance parameter λ adjusts the weights of the two losses so that they are approximately the same; by default, we set λ = 10. With (x_a, y_a, w_a, h_a) denoting the coordinate parameters of the anchor box, the regression targets follow the parameterisation of [15]:

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)

and analogously for t_x^*, t_y^*, t_w^*, t_h^* with the ground-truth coordinates (x^*, y^*, w^*, h^*). The difference between the ground-truth box and the anchor box is used as the learning target, and the position is regressed and refined.
R is the smooth L1 function, used to reduce the difference between prediction and ground truth. The regression term is multiplied by p_i^* so that it is only activated for positive anchors: p_i^* is 1 when there is an insulator and 0 when there is none, so only object regions contribute to the regression loss, and unrelated background regions do not. Again, λ adjusts the weighting.
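The smooth L1 function used for R above has a standard closed form in Fast R-CNN [14]: quadratic near zero and linear in the tails, so large regression errors are penalised less harshly than with a squared loss. A minimal sketch:

```python
def smooth_l1(x):
    """Smooth L1 loss from Fast R-CNN: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

# Quadratic region: smooth_l1(0.5) -> 0.125; linear region: smooth_l1(2.0) -> 1.5
```

The kink at |x| = 1 is where the two pieces meet with matching value and slope, which keeps gradients bounded for outlier boxes.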

Dataset enhancement and processing
By transforming the training images, a network with stronger generalisation ability can be obtained, which adapts better to the application scene; adding noisy data also improves the robustness of the model. The method in [22] can effectively enhance the dataset. The initial dataset consists of 500 pictures collected from the training base of our school and is used for data enhancement.
In this paper, supervised single-sample data enhancement is adopted. First, a horizontal flip is carried out to increase the diversity of images through a mirror operation while enlarging the dataset. Then, the sliding-window method is used to expand the dataset on a large scale. This method first normalises the size of all images to 1500*1125 or 1125*1500 pixels. A window of width and height W*H is slid over each picture, and the region generated by each slide is an ROI region; the number of regions produced per image follows Equation (8) of [22]. The 500 pictures are divided into two parts, one of 400 and the other of 100 pictures. Data enhancement is applied to both parts, yielding 7200 pictures (set A) and 1800 pictures (set B), respectively; A is the training set and B is the test set. In order to speed up inspection and training and reduce CPU time consumption, datasets in a format practical for Caffe should be produced. All images in the sample library were manually labelled with the visual image calibration tool LabelImg to produce XML label files conforming to the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) standard format. Through Excel, the processed label files give quick and intuitive access to important information such as image width, height, category name and the bounding-box coordinate pairs xmin, ymin, xmax, ymax. The completed dataset is then converted to the lightning memory-mapped database (LMDB) format.
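The flip-and-slide expansion described above can be sketched as follows. The 750*750 window and 375-pixel stride are illustrative assumptions, since the text leaves W, H symbolic; only the crop geometry and the mirror operation come from the text:

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Enumerate (left, top, right, bottom) ROI crops produced by sliding
    a win_w x win_h window over a size-normalised image with a fixed stride."""
    boxes = []
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            boxes.append((left, top, left + win_w, top + win_h))
    return boxes

def hflip(image):
    """Horizontal mirror of an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

# Assumed example: a 1500x1125 image, 750x750 window, stride 375
rois = sliding_windows(1500, 1125, 750, 750, 375)  # 6 crops
```

Each returned box would be cropped out as one new training image; together with the mirrored originals this multiplies the dataset without new field collection.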

Lightweight network model
The neural network model is the most basic and important part of Faster R-CNN. Generally, the VGG16 network is selected, with 13 convolutional layers and three fully connected layers. However, because the network is so deep, it takes a long time; the requirements on the graphics processing unit (GPU) are also high, and the network is better suited to many-class classification tasks.
Considering the GPU configuration of industrial devices in insulator inspection, the singleness of the insulator inspection category, and the need for fast inspection, it is more appropriate to choose the lightweight network structure model ZF. Figure 2 shows the main network structures, including data layers, convolution layers, pooling layers, ReLU, fully connected layers, and softmax. Since a larger convolution kernel and stride would lead to unsatisfactory extracted features, convolutional layer 1 uses a 7*7 convolution kernel with a stride of 2, which helps to improve the classification performance. The activation function ReLU is then used to increase the non-linearity and sparseness of the network, reduce the interdependence between parameters, and alleviate overfitting. Next, 3*3 max pooling downsamples the image while preserving translation invariance and trying not to lose image features; the advantage is that the strongest features are retained and the number of parameters is reduced. The operation is repeated on convolution layers 2, 3, 4 and 5 to continuously extract features. After two fully connected layers, local features are integrated through a weight matrix. Finally, classification by the softmax function yields the output. Because the network depth is greatly reduced, the inspection speed can be increased, and higher performance in industrial settings may be achievable.
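The spatial sizes produced by layers such as the 7*7 stride-2 convolution and the 3*3 max pooling follow the usual output-size formula; the sketch below computes it. The 224*224 input size and the padding of 1 are assumptions for illustration, not values stated in the text:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer:
    floor((size - kernel + 2*pad) / stride) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# ZF-style first stage on an assumed 224x224 input with padding 1:
s = conv_out(224, 7, 2, pad=1)  # conv1: 7x7 kernel, stride 2 -> 110
s = conv_out(s, 3, 2)           # pool1: 3x3 max pooling, stride 2 -> 54
```

Running the same formula over the remaining stages would give the feature-map size seen by the RPN's 3*3 sliding window.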

K-means clustering selected anchor
The RPN is the core component of Faster R-CNN. The feature map output after convolution and pooling is convolved with a 3*3 kernel and then connected to two 1*1 convolutions, respectively: one performs a two-class classification (object/background) with a softmax function, and the other performs regression to determine coordinates. Every time the 3*3 sliding window slides, an anchor is generated, and nine anchor boxes with three scales and three aspect ratios are mapped onto the original image. As the sizes and aspect ratios are preset by hand, although the network can learn to adjust the bounding boxes, in practice the inspection of some special insulators is inaccurate. If we choose a better prior for the network, that is, pre-select on the specific dataset for the specific purpose, the network can learn more easily in complex situations and give better predictions and inspection. k-means clustering is therefore run on the training-set bounding boxes to find good aspect ratios and a good number of anchors. According to [19,20], using the standard Euclidean distance in k-means would cause larger errors for larger bounding boxes than for smaller ones, whereas what we want is a high intersection-over-union (IOU) score, so the distance is defined as

d(b_1, b_2) = 1 - IOU(b_1, b_2)

Here, b_1 and b_2 are two boxes, and IOU is the ratio of the intersection to the union of their areas. With this metric, the smaller d is, the higher the IOU. In Figure 3, k-means is run with different values of k, and the average IOU with the closest cluster centre is plotted. It can be seen that although k = 5 balances complexity and a high recall rate, the recall rate still increases as the dimension increases. Therefore, in order to pursue higher inspection accuracy, we further increase k.
When k = 15, the slope approaches 0, indicating that inspection accuracy no longer increases with clustering complexity, so we set k = 15. Figure 4 shows the anchors before and after the improvement; the number of anchors has increased from 9 to 15. The clustering proceeds as follows: k cluster centres are initialised randomly, each point is assigned to the nearest cluster, each cluster centre is updated to the mean of its current members, and iteration continues until the cluster centres are almost unchanged. In the end, we obtain the aspect ratios {0.303, 0.389, 1, 2.57, 3.30} and the scales {64, 128, 256}.
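The clustering procedure above, with the 1 - IOU distance, can be sketched as follows. This is a simplified illustration that represents each box by its (width, height) pair anchored at a common corner, as in [19]; the function names are ours:

```python
import random

def iou_wh(a, b):
    """IOU of two boxes given as (w, h) pairs, anchored at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_iou(boxes, k, iters=100, seed=0):
    """k-means on (w, h) pairs using the d = 1 - IOU distance:
    assign each box to the centroid of highest IOU, then update centroids
    to the cluster means, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            i = max(range(k), key=lambda j: iou_wh(box, centroids[j]))
            clusters[i].append(box)
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

On a labelled dataset, the resulting centroids play the role of the anchor aspect ratios and scales quoted in the text.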

Non-linear NMS
Because of the anchor mechanism, a large number of anchor boxes are generated, so multiple anchor boxes for the same insulator appear in the post-processing stage, that is, repeated inspection. NMS processing is therefore required to ensure that each insulator has only one inspection frame. Greedy NMS first selects the anchor box M with the highest confidence. The IOU between M and every other anchor box is then computed, and all anchor boxes whose IOU with M exceeds the set threshold are deleted directly, removing all duplicate anchor boxes considered to belong to the same object. However, when multiple objects overlap, that is, when the IOU is greater than the set threshold but the boxes belong to different objects, those anchor boxes are also deleted, resulting in missed inspections. In view of this situation, reference [23] proposed improvements to the NMS method.
In this paper, the non-linear NMS algorithm (Nl-NMS) selects a penalty factor according to the IOU between any anchor box b_i and the highest-confidence anchor box M and a preset threshold, and uses it to reduce the confidence S_i of the anchor box: the larger the IOU between the windows, the heavier the penalty; the smaller the IOU, the lighter the penalty. Instead of directly excluding boxes that overlap with the retained box by more than a threshold, correctly positioned boxes in crowded situations are not deleted excessively. The penalty factor varies dynamically with the IOU: the larger the IOU, the larger the penalty-factor value. Figure 5 shows the function curves of several penalty factors, where Nl is the non-linear penalty factor of this paper, Gaussian is a Gaussian curve with mean 0 and variance 1, and linear is a linear curve. As can be seen from the figure, the penalty factor of the linear curve keeps increasing, and the Gaussian curve is still very high when the IOU is 1. The non-linear penalty factor of this paper is more reasonable, and its inspection effect is better.
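The exact non-linear penalty of Nl-NMS is shown in the paper's Figure 5 rather than reproduced in this text, so the sketch below uses the Gaussian decay from Soft-NMS [23], on which the method builds; the idea of decaying rather than deleting overlapping boxes is the same:

```python
import math

def iou(a, b):
    """IOU of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS [23]: repeatedly take the highest-confidence box M, then
    decay (rather than delete) the confidence of every remaining box by a
    penalty that grows with its IOU with M (Gaussian variant here)."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        best_box = boxes.pop(m)
        keep.append((best_box, scores.pop(m)))
        # The larger the overlap with M, the heavier the confidence decay.
        scores = [s * math.exp(-iou(best_box, b) ** 2 / sigma)
                  for b, s in zip(boxes, scores)]
        pairs = [(b, s) for b, s in zip(boxes, scores) if s >= score_thresh]
        boxes = [b for b, _ in pairs]
        scores = [s for _, s in pairs]
    return keep
```

An overlapping second insulator thus survives with a reduced score instead of being removed outright, which is exactly the occlusion case the text describes.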

Experimental process
Faster R-CNN adopted ZF as the training model. The VOC 2007 format commonly used with the deep learning framework Caffe was used as the training dataset format. Table 1 shows the training parameters, where base_lr represents the base learning rate, lr_policy the adjustment strategy of the learning rate, and gamma the learning-rate adjustment factor. The choice of these three parameters is based on the empirical settings of many classic papers, and the experiments in this article also confirm that the choice is appropriate.
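Caffe's 'step' lr_policy combines base_lr, gamma and stepsize as lr = base_lr * gamma^floor(iter / stepsize). A small sketch; the numeric values below are illustrative assumptions, since Table 1 is not reproduced in this text:

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Caffe 'step' policy: lr = base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

# With assumed values base_lr=0.001, gamma=0.1, stepsize=50000, an 80,000-iteration
# run would train at 0.001 up to iteration 49,999 and at 0.0001 afterwards.
```

The stepsize thus controls when the decay kicks in, which matches the text's remark that it is tuned from the pre-training effect.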
Stepsize represents the learning-rate adjustment step size; its setting is determined according to the previous pre-training effect. Batch_size stands for batch capacity. A large batch size can reduce training time and improve stability, but at the same time it causes the generalisation ability of the model to decrease; as a compromise, we chose 16. Display represents the parameter display frequency; setting it to 800 lets us observe the loss value without waiting too long. Iterations represents the number of iterations; from the loss curve we observe that the curve fully converges at 80,000 iterations, so we stop the training there. Snapshot represents the model's save interval, which prevents a sudden break in training from forcing a full retrain. The test-set data are used to verify the model, and finally the test model is generated. The hardware environment is an Intel processor with a CPU of 3.0 GHz and 16 GB of memory, with an NVIDIA GTX 1080 GPU; the operating system is Ubuntu 16.04 (Linux), and the model is trained in Caffe. Figure 6 shows the visualisation of the convolution features, from which we can see the step-by-step process of feature extraction by convolution and pooling. The ZF network obtains the visual images by unpooling, rectification and deconvolution on the feature maps. The lower layers mainly learn low-level features: convolution 1 mainly learns the physical colour and similar characteristics of insulators, convolution 2 mainly learns edge and contour characteristics, and convolution 3 mainly learns texture characteristics. From pool 1 and pool 2 we can clearly see the retained, enhanced features. The higher layers mainly learn abstract features related to insulators: convolution 4 mainly learns significant distinguishing features.
Convolution 5 mainly learns complete and discriminative key features, which characterise the entire insulator. In the output, the basic characteristics of the insulator are fully displayed. Low-level features converge when the number of iterations is still relatively small, while high-level features require more iterations; the higher the feature level, the better its performance for classification.

Actual inspection effect

Figure 7 shows the results of the Faster R-CNN detector before and after the improvement. In each sub-picture, the confidence levels marked in blue on the left show the inspection result before the improvement, and the confidence levels marked in green on the right show the improved result. At the top of the left picture in (a) and (b), on the right of the left picture in (c), and at the bottom of the left picture in (d), a small part of an insulator has not been detected. Similarly, in the left picture of (e), an insulator half-blocked by equipment has not been successfully detected. This is because the insulator information is incomplete and the object is too small, so missed inspection occurs. In addition, in each of the left images of (f)-(h), several insulators have not been detected correctly, because the insulators are too far away and the objects are too small. The improved k-means anchor scheme optimises the selection of anchors according to the ratios and sizes of the insulators and at the same time increases the number of anchors, which solves the above problems: there is no missed inspection in the right pictures of (a)-(h). The three insulators in the left picture of (i) have four inspection frames, a false inspection that should be caused by an improper NMS threshold setting. The insulators on the left in (j)-(l) all show missed inspections caused by overlapping insulators. Our improved Nl-NMS algorithm effectively solves these problems, so there is no false or missed inspection in the right pictures of (i)-(l).

The evaluation index
In this paper, three indexes, the loss curve, the P-R curve and AP, are used to evaluate the inspection effect and verify the improvement.
Precision and recall are defined as

Precision = TP / (TP + FP), \quad Recall = TP / (TP + FN)

Here, TP (true positive) is a sample retrieved as a positive example that is indeed positive; FP (false positive) is a sample retrieved as positive but actually negative; FN (false negative) is a positive sample that was not retrieved. Figure 9 shows the P-R curves before and after the algorithm improvement. It can be clearly seen that the area under the curve in the improved figure on the right is larger, which means the inspection effect is better.
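The precision and recall definitions above, and AP as the area under the P-R curve, can be sketched as follows; the area computation is a simplified rectangle-rule approximation, not the exact PASCAL VOC interpolation:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Approximate area under the P-R curve by the rectangle rule,
    visiting the operating points in order of increasing recall."""
    ap, prev_r = 0.0, 0.0
    for p, r in sorted(zip(precisions, recalls), key=lambda pr: pr[1]):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

Sweeping the detector's confidence threshold produces the list of (precision, recall) operating points that this function integrates.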

AP comparison
AP is a performance metric for predicting object location and category: the larger the AP, the higher the inspection accuracy. The highest AP value is 0.9188, reached after 70,000 and 80,000 iterations; the model has converged after 66,000 iterations, and the inspection effect improves as the number of iterations increases. The inspection images in the section on the actual inspection effect use the model generated after 80,000 iterations. Since the average IOU at k = 5 selected by k-means performs similarly to the preset 9 anchor boxes, we set k = 15. Table 2 compares the AP of Faster R-CNN after using the k-means algorithm to select the scales and ratios with the AP before the improvement: AP rises from 0.8145 to 0.8776, an increase of 6.31%. This is because the k-means anchor scheme selects suitable aspect ratios and increases their number, which is better suited to insulator inspection and detects smaller insulator images that were previously missed. Table 3 compares the AP of Faster R-CNN with NMS post-processing using different penalty factors with the AP before the improvement: AP increases from 0.8145 to 0.8903, an increase of 7.58%. This is because the improved Nl-NMS algorithm handles overlapping, occluded insulators and reduces the missed and false inspections caused by improper threshold selection, thereby improving inspection accuracy. Table 4 lists the AP after applying the improved k-means anchor scheme and the Nl-NMS algorithm together: AP increases from the original 0.8145 to 0.9188, an increase of 10.43%. Because the two improvements are applied at the same time, the inspection accuracy reaches its best. The same improvement can be seen from the curves in Figure 9.
In addition, in the actual inspection of Figure 7, the effect of algorithm improvement can also be seen by comparing the inspection of insulators.

CONCLUSION
An insulator inspection algorithm based on Faster R-CNN is proposed in this paper. The algorithm first expands the dataset. Then, a lightweight network structure is selected. Meanwhile, the k-means algorithm is used to adjust the scales and the length-width ratios of the anchors, increasing the number of anchor boxes at each anchor point to 15. A non-linear penalty factor is added to improve the NMS post-processing algorithm. The experimental results after adjusting and optimising the training parameters show that the improved Faster R-CNN has obvious advantages in field-image inspection compared with the original: the increase in AP exceeds 10%, and the method effectively detects insulators while reducing the false-inspection and missed-inspection rates.