Coal gangue detection and recognition method based on multiscale fusion lightweight network SMS‐YOLOv3

Aiming at the problems of large memory footprint, low detection speed, and low detection accuracy for small and overlapping targets in current coal gangue target detection algorithms, a real‐time detection method for coal gangue based on a multiscale fusion lightweight network (SMS‐YOLOv3) is proposed. MobileNetv3 is taken as the feature extraction network, with all of its SE modules replaced by SKNet, improving image feature extraction and making more effective use of parameters. A shallow detection scale is added to form a detection structure fusing four scales, improving the detection accuracy for small targets. Spatial pyramid pooling is added after the backbone network to convert feature maps of different sizes into fixed feature maps, improving the detection accuracy of the algorithm. The CIoU bounding box regression loss and K‐means++ clustering of anchor boxes are used to further improve detection accuracy. Experimental equipment was built, and coal gangue datasets covering small size, large size, dim light, mutual occlusion, and large numbers of coal gangue under multiple conditions were constructed. Experimental results demonstrate that the proposed algorithm detects small and overlapping coal gangue targets accurately and quickly, with mAP reaching 98.97%. Compared with the original YOLOv3, the algorithm improves mAP by 0.37% and fps by 119.04%, with memory only 1/24 of the original.


| INTRODUCTION
Coal is not only the principal form of energy in current society but also the most economical and safe form of energy that can be used cleanly and efficiently. [1][2][3] Coal mixed with gangue during mining tends to cause environmental pollution and reduced combustion quality. Therefore, coal gangue separation is an important measure to promote the clean and efficient utilization of coal. 4,5 Although a variety of methods, such as manual gangue removal, 6 dense medium separation, 7 flotation separation, 8,9 jigging separation, 10 and selective crushing, 11 are currently used for coal separation, they generally suffer from low recognition accuracy, large occupied space, high investment cost, and severe environmental pollution. With the development of emerging technologies such as artificial intelligence and intelligent manufacturing, intelligent green production of coal has become an irresistible trend in the coal industry, highlighting the importance of intelligent and efficient separation of coal gangue. [12][13][14] Currently, the X-ray method and the image method are commonly used in the automatic separation of coal and gangue. 15,16 The X-ray method, however, carries a risk of ray leakage, which can harm the human body and the environment. Coal gangue image recognition and detection methods based on machine learning have also been widely studied and applied. Specifically, feature parameters such as grayscale and texture of coal and gangue are extracted from images, feature parameter models are established by algorithms to analyze and optimize segmentation parameters, and coal and gangue recognition is finally achieved. 17,18 Li and Sun 19 put forward a coal gangue classification method based on the least squares support vector machine (LS-SVM), characterized by grayscale and texture: by analyzing the grayscale and texture of coal gangue, coal and gangue can be distinguished.
In the study of Dou et al., 20 the topography-SVM method was used to identify the optimal features and construct the optimal classifier, realizing the recognition and classification of coal gangue. Guo et al. 21 identified coal and gangue by exploiting their differences in dielectric properties and proposed a recognition method combining an SVM with these dielectric properties, achieving a coal classification accuracy of up to 100%. However, such methods require manual selection of features and determination of thresholds, which inevitably leads to problems such as a single detection dimension, slow speed, and low accuracy.
With the development of deep learning technology in recent years, 22,23 deep learning-based target detection has been applied in the field of gangue detection, [24][25][26] with advantages over traditional machine learning methods: no manual feature selection, excellent feature expression capability, and high detection accuracy. 27,28 Deep learning can automatically acquire and learn image features through a convolutional neural network, accelerating the extraction and detection of feature information in coal gangue images. 29 Initially, research on deep learning-based gangue detection overwhelmingly aimed at improving recognition accuracy, with little consideration given to the computational volume and complexity of the model. Cao et al. 30 improved the AlexNet feature extraction network by transfer learning and combined it with a region proposal network to obtain the classification information and pixel coordinates of coal and gangue, with a detection accuracy of more than 90%. Li et al. 31 proposed a YOLOv3 algorithm based on deformable convolution, using an adaptive receptive field to improve the detection accuracy of coal gangue, especially for small targets.
The above research improved the detection accuracy of coal gangue to different degrees, but at the cost of high model complexity. Algorithms with high complexity and large memory footprints are difficult to deploy on mobile platforms with limited storage and computing resources; there is therefore an urgent need for a lightweight coal gangue detection network. Pu et al. 32 built a custom convolutional neural network (CNN) model based on the VGG16 (Visual Geometry Group) network by introducing transfer learning, with a view to solving the problems of a large number of trainable parameters and limited computational power; unfortunately, the model achieved an accuracy of only 82.5%. Xu et al. 33 established coal gangue image recognition models based on classical networks such as ResNet and advanced lightweight networks such as SqueezeNet, analyzed the training convergence of each model, and realized model compression. Du et al. 34 improved the single shot multibox detector (SSD) model by using a lightweight network, a self-attention mechanism, and an anchor frame optimization method to build a Ghost-SSD model, and then proposed a lightweight coal gangue target detection method. In all of the research mentioned above, the performance of the gangue detection algorithm was improved to varying degrees, and the models were lightened to reduce computational effort. Nevertheless, detection accuracy needs to be improved further.
Overall, aiming at the problems existing in current coal gangue target detection algorithms, such as large memory occupation and low detection accuracy for small and overlapping targets, this paper proposes a multiscale fusion lightweight network (SMS-YOLOv3). In addition, an experimental device is set up, and datasets are built for experiments. It is verified that the algorithm meets the accuracy and real-time requirements of detecting small and overlapping coal gangue targets. Table 1 compares coal and gangue separation methods.

| YOLOv3 algorithm
The YOLOv3 algorithm 35 differs from two-stage target detection algorithms: it divides the image into numerous grids, each responsible for the corresponding objects, and supports multiclass target detection. As a commonly used algorithm for target detection in industry, YOLOv3 offers fast detection, high accuracy, and comprehensiveness compared with many other algorithms. As shown in Figure 1, YOLOv3 is an end-to-end real-time target detection framework, mainly comprising the Darknet-53 feature extraction network and a multiscale prediction network. To improve detection of small targets, YOLOv3 performs multiscale feature fusion by borrowing the idea of the feature pyramid network (FPN), 36 fusing features at three different scales by adding upsample and concat layers, and making independent detections on the fused feature maps at multiple scales. As a result, YOLOv3 significantly improves the detection of small targets. See Figure 1 35 for the YOLOv3 network structure.

| MobileNetv3
The MobileNetv3 37 network was put forward by Howard et al. in 2019. Based on the previous MobileNet networks, its structure is optimized so that detection can run in real time on a mobile phone CPU, achieving higher accuracy and faster speed. The main improvement is the addition of squeeze-and-excitation networks (SENet) 38 after the depthwise separable convolution 39 of MobileNetv2, 40 which automatically learns the importance of each feature channel and suppresses useless feature information. In addition, MobileNetv3 integrates characteristics of MobileNetv1 and MobileNetv2; its backbone unit, Bneck, is shown in Figure 2. 40 First, the dimension is raised by a 1 × 1 convolution, introducing the inverted residual structure of MobileNetv2. Subsequently, a 3 × 3 depthwise separable convolution is carried out to reduce the computation of the network. Then, a lightweight SENet attention mechanism makes the network focus on more useful channel information by adjusting each channel's weight. Finally, the h-swish activation function replaces the swish function to reduce computation and improve performance.
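As a rough illustration, the h-swish activation and the SE-style channel gating described above can be sketched in NumPy. The weight names (`w1`, `b1`, `w2`, `b2`) and shapes are illustrative assumptions, not the paper's implementation; only the ReLU6/h-swish/hard-sigmoid formulas follow MobileNetv3's design:

```python
import numpy as np

def relu6(x):
    # ReLU6: min(max(x, 0), 6), the clipped activation used in MobileNet
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a cheap approximation of swish
    return x * relu6(x + 3.0) / 6.0

def se_reweight(feat, w1, b1, w2, b2):
    """SE-style channel reweighting for a (C, H, W) feature map.

    w1/b1 and w2/b2 stand in for the squeeze (C -> C/r) and excite
    (C/r -> C) fully connected layers; the gate uses the hard sigmoid
    employed in MobileNetv3.
    """
    c = feat.shape[0]
    squeeze = feat.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze + b1, 0.0)  # reduction FC + ReLU
    gate = relu6(w2 @ hidden + b2 + 3.0) / 6.0   # hard sigmoid in [0, 1]
    return feat * gate.reshape(c, 1, 1)          # per-channel rescaling
```

The hard sigmoid and h-swish are preferred on mobile CPUs because they replace the exponential of a true sigmoid/swish with cheap clipping.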

| SKNet network
The human visual cortex dynamically adjusts the receptive field of its neurons according to the stimulation received. Inspired by this, Li et al. 41 proposed an attention mechanism that can adaptively select the size of a selective kernel. The mechanism integrates the ideas of group convolution, dilated convolution, and the SENet channel attention mechanism, applying attention to selective kernels of different sizes. SKNet is a lightweight network structure: using a nonlinear method, the size of the receptive field changes automatically with the stimulus, so it adapts to differences in input scale. The SKNet network consists of multiple stacked SK convolution units, in which the SK convolution operation is divided into three parts: split, fuse, and select, as shown in Figure 3. 42 SKNet convolution is no longer an attention mechanism limited to the channel or spatial level; instead, attention is applied to convolution kernels of different sizes, allowing the network to adjust its structure adaptively. Furthermore, SK convolution is a lightweight plug-and-play module that brings accuracy gains without adding much computation to the network.

T A B L E 1 Comparison of coal and gangue separation methods.

F I G U R E 3 SKNet convolution operation schematic diagram.
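A minimal NumPy sketch of the select step, assuming the attention logits have already been produced by the fuse step (global pooling plus fully connected layers, omitted here); all shapes are illustrative:

```python
import numpy as np

def sk_select(branches, attn_logits):
    """Select step of SK convolution (simplified sketch).

    branches    : list of M feature maps, each (C, H, W), e.g. the outputs
                  of 3x3 and dilated 5x5 group convolutions (the split step).
    attn_logits : (M, C) per-branch, per-channel attention scores produced
                  by the fuse step.
    Returns the attention-weighted sum of the branches.
    """
    logits = np.asarray(attn_logits, dtype=float)
    # softmax across the M branches for every channel
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)            # (M, C)
    stacked = np.stack(branches)                          # (M, C, H, W)
    return (weights[:, :, None, None] * stacked).sum(axis=0)
```

Because the softmax is taken across kernel branches rather than channels alone, the unit effectively chooses, per channel, how much of each receptive-field size to use.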

| Space pyramid pooling
In a general CNN structure, the convolutional layers are followed by fully connected layers, and the number of inputs to a fully connected layer is fixed; therefore the input image size must be fixed. In practice, input images rarely have the required size, so they must be stretched or cropped, which distorts the image. To address this, He et al. 43 proposed spatial pyramid pooling (SPP), shown in Figure 4A. SPP removes the fixed-input-size constraint of CNNs so that the aspect ratio and size of the input image can be arbitrary.
In this paper, the SPP module shown in Figure 4B is used to max-pool local areas of the feature map. This multiscale spatial pyramid pooling consists of three max pooling layers with windows of 9 × 9, 5 × 5, and 3 × 3, respectively, each with a stride of 1.
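Assuming each pooling layer uses stride 1 with "same" padding, so that all outputs keep the input's spatial size and can be concatenated along the channel axis (the usual YOLO-style SPP arrangement), the module can be sketched in NumPy as:

```python
import numpy as np

def max_pool_same(feat, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) float map."""
    pad = k // 2
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)),
                    constant_values=-np.inf)
    out = np.empty_like(feat)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(feat, windows=(9, 5, 3)):
    """SPP block: concatenate the input with its stride-1 max pools.

    Because every pool uses stride 1 and same padding, all outputs keep the
    input's spatial size, so they can be stacked along the channel axis.
    """
    pools = [max_pool_same(feat, k) for k in windows]
    return np.concatenate([feat] + pools, axis=0)
```

Concatenating pools of several window sizes lets the subsequent convolutions see multiscale context without changing the feature map's spatial dimensions.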

| SMS-YOLOv3 network
The YOLOv3 network uses the Darknet-53 structure as the backbone and detects objects at three different scales with three prediction layers. Darknet-53 has many layers and high model complexity, and is difficult to train, so a modified MobileNetv3 network is used as the feature extraction network instead. Specifically, the SENet attention module in MobileNetv3 is replaced by SKNet, which, compared with SENet, attends not only to target feature information in the channel dimension but also in the spatial dimension. In general, SKNet can increase the weight of target feature information, effectively reduce the number of parameters, and reduce the model size. To improve feature extraction for small targets, this paper takes the 104 × 104 feature layer as an input of the feature fusion network and obtains a new scale detection layer of size 104 × 104, finally forming a four-scale detection structure. In this way, accurate detection is achieved even when the scale of targets in the image changes greatly, improving the network's accuracy for coal gangue detection. The feature extraction network structure of SMS-YOLOv3 is shown in Table 2.
The overall structure of the improved SMS-YOLOv3 is shown in Figure 5. The improved MobileNetv3 is selected as the backbone feature extraction network to improve detection speed; SPP is added after the backbone to convert feature maps of different sizes into fixed feature maps, improving detection accuracy. In the detection layers, SMS-YOLOv3 adopts multiscale prediction: after images are input into the network, feature maps at four different scales are output through successive downsampling. At the same time, two DBL (Depthwiseconv2D_BN_Leaky) blocks are added to each detection layer to improve its ability to extract target feature information. Finally, the four feature maps of different scales are adjusted by a DBL (3 × 3) convolution and a common Conv (3 × 3) convolution and then input into the YOLO head for prediction, forming a four-scale detection network.

F I G U R E 4 The spatial pyramid pool structure.

| Complete intersection over the union loss function
The loss function of the original YOLOv3 uses intersection over union (IoU). IoU indicates the degree of coincidence between the predicted frame and the real frame:

IoU(P, G) = |P ∩ G| / |P ∪ G|  (1)

In Formula (1), P and G represent the areas of the predicted frame and the real frame, respectively, and IoU(P, G) is the ratio of the intersection and union of the two frames. The larger the IoU, the closer the predicted target position is to the real position. Nevertheless, IoU still has a drawback: when the predicted frame and the real frame do not intersect, Formula (1) gives IoU = 0. The frame regression loss is then zero, and the gradient cannot be updated during backpropagation. This paper therefore uses the complete intersection over union (CIoU) 44 loss instead of the original IoU:

L_CIoU = 1 − IoU + ρ²(p, g)/c² + αv  (2)

The CIoU function avoids the case where the value of the IoU loss is 0.
In Formula (2), p and g represent the coordinate centers of the predicted frame and the real frame, respectively; ρ(p, g) is the Euclidean distance between the two centers; c is the diagonal length of the smallest rectangle that can contain both the predicted frame and the real frame; and αv is a penalty term, where α is a balance parameter and v measures the consistency of the aspect ratio between the predicted frame and the real frame. They are calculated as:

α = v / ((1 − IoU) + v)  (3)

v = (4/π²) × (arctan(w_gt/h_gt) − arctan(w/h))²  (4)

where w_gt and h_gt are the width and height of the real frame, and w and h are the width and height of the predicted frame.
Because CIoU simultaneously considers the center distance, overlap, and aspect ratio of the predicted and real frames, model training converges faster, bounding box regression is more accurate, and the detection effect is better.
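Formulas (1)-(4) can be transcribed directly. The sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates, a convention chosen here for illustration:

```python
import math

def iou(box_p, box_g):
    """IoU of two (x1, y1, x2, y2) boxes, Formula (1)."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    return inter / (area_p + area_g - inter)

def ciou_loss(box_p, box_g):
    """CIoU loss = 1 - IoU + rho^2/c^2 + alpha*v, Formulas (2)-(4)."""
    u = iou(box_p, box_g)
    # squared distance between box centers (rho^2)
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    # squared diagonal of the smallest enclosing rectangle (c^2)
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency v (Formula 4) and balance parameter alpha (Formula 3)
    w_p, h_p = box_p[2] - box_p[0], box_p[3] - box_p[1]
    w_g, h_g = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - u) + v + 1e-9)
    return 1 - u + rho2 / c2 + alpha * v
```

Note that for two disjoint boxes `iou` returns 0 but `ciou_loss` stays positive through the center-distance term, which is exactly the property that restores a useful gradient.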

| Selection of anchor frame based on K-means++
To speed up convergence during training and locate targets more accurately, the sizes of the anchor boxes need to be redesigned to match the experimental model before training. After the network improvement in this paper, the number of default anchor boxes is changed from 9 to 12, assigned to the four detection layers of different scales. YOLOv3 generally uses the K-means clustering algorithm to select anchor boxes, but it has drawbacks: with different initial centroids, the clustering results may differ considerably. Therefore, this paper uses K-means++ 45 instead of K-means to solve this problem.
When using the K-means++ algorithm to calculate the anchor boxes, IoU is selected as the distance metric D(x): the higher the coincidence between a real box x and its cluster-center anchor box C, the smaller the distance. The distance D(x), the selection probability P(x), and the cluster center C_i are calculated as:

D(x) = 1 − IoU(x, C)  (5)

P(x) = D(x)² / Σ_{x∈X} D(x)²  (6)

C_i = (1/|S_i|) Σ_{x∈S_i} x  (7)

where X is the set of real boxes and S_i is the set of boxes assigned to the ith cluster. In this paper, a detection layer at the (104 × 104) scale is added, so the number of cluster centers K is set to 12, and 12 groups of prior boxes of different sizes are obtained with K-means++. The smallest 13 × 13 feature map has the largest receptive field, so the largest anchors are adopted, suitable for large target detection. Conversely, the largest 104 × 104 feature map has the smallest receptive field, so the smallest anchors are used to detect small targets. The 52 × 52 and 26 × 26 feature maps are then used to detect medium-small and medium-large targets, respectively. The final distribution is shown in Table 3.
The K-means++ clustering algorithm alleviates, to a certain extent, the original algorithm's dependence on the initial centers and finds better cluster centers during clustering: each new center is chosen to be as far as possible from the centers already determined.
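A compact sketch of this anchor clustering, using 1 − IoU as the distance (Formula (5)) and the squared-distance probabilities of Formula (6); box widths and heights are compared with all boxes aligned at a common corner, the usual anchor-clustering convention, and the Lloyd-style mean update stands in for Formula (7):

```python
import random
import numpy as np

def wh_iou(box, centers):
    """IoU between one (w, h) box and an array of (w, h) centers, with all
    boxes aligned at the same corner."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeanspp_anchors(boxes, k=12, iters=50, seed=0):
    """K-means++ anchor clustering with D(x) = 1 - IoU as the distance."""
    rng = random.Random(seed)
    boxes = np.asarray(boxes, dtype=float)
    centers = [boxes[rng.randrange(len(boxes))]]          # first center: random
    while len(centers) < k:
        c = np.asarray(centers)
        # D(x): distance from each box to its nearest already-chosen center
        d = np.array([1.0 - wh_iou(b, c).max() for b in boxes])
        p = d ** 2 / max((d ** 2).sum(), 1e-12)           # P(x), Formula (6)
        idx = min(np.searchsorted(np.cumsum(p), rng.random()), len(boxes) - 1)
        centers.append(boxes[idx])
    centers = np.asarray(centers)
    for _ in range(iters):                                # standard Lloyd updates
        assign = np.array([np.argmax(wh_iou(b, centers)) for b in boxes])
        for i in range(k):
            if (assign == i).any():
                centers[i] = boxes[assign == i].mean(axis=0)  # Formula (7)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

A call such as `kmeanspp_anchors(dataset_whs, k=12)` would return the 12 anchors sorted by area, ready to be split across the four detection scales, smallest anchors to the 104 × 104 layer.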
T A B L E 3 Anchorbox allocation.

| Experimental platform

In this paper, a lightweight YOLOv3 online recognition method for coal gangue based on the improved MobileNetv3 is studied. According to the requirements of online recognition of coal gangue, a real-time detection platform is set up, as shown in Figure 6. The experimental device mainly includes a computer, a camera, an adjustable light source, coal and gangue samples, and a belt conveyor. An industrial camera and an adjustable light source are arranged above the belt conveyor; the camera collects images of coal gangue in real time and inputs them into the computer, where the proposed algorithm performs recognition and marking, providing reliable information for the final coal gangue separation. Figure 7 shows the experimental steps of the proposed method.

| Dataset
There is currently no public dataset of coal gangue. To obtain a sufficient coal gangue dataset, this paper collected images both on the experimental platform and at a coal preparation yard. The laboratory portion of the dataset used coal and gangue from the same mine as the coal preparation yard, both from the Zhangjiamao Coal Mine in Shaanxi Province. A total of 1000 images were collected, covering small size, large size, dim light, mutual occlusion, large numbers of coal gangue, and so on. See Figure 8 for details.
The collected coal gangue datasets were preprocessed, and data augmentation was used to expand the images collected under the various conditions, including random horizontal flipping, random vertical flipping, random cropping, random brightness change, and noise reduction of the input images. After expansion, a total of 2808 coal gangue images were obtained. To avoid interference from sample imbalance and to eliminate redundant sample images, 1320 images of coal and gangue were selected. Finally, a labeling tool was used to annotate the images, generating one label file per image containing the categories and bounding box positions of all coal and gangue targets in that image.
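The augmentation pipeline above can be sketched as follows. The crop ratio and brightness range are illustrative guesses, not the paper's settings, and in a real detection pipeline the bounding-box labels must of course be transformed together with the image:

```python
import random
import numpy as np

def augment(image, rng=random):
    """Random flip / crop / brightness sketch for an (H, W, 3) uint8 image."""
    img = image.copy()
    if rng.random() < 0.5:
        img = img[:, ::-1]                       # random horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                       # random vertical flip
    if rng.random() < 0.5:                       # random crop to 80% of each side
        h, w = img.shape[:2]
        ch, cw = int(h * 0.8), int(w * 0.8)
        y, x = rng.randrange(h - ch + 1), rng.randrange(w - cw + 1)
        img = img[y:y + ch, x:x + cw]
    if rng.random() < 0.5:                       # random brightness change
        scale = 0.6 + 0.8 * rng.random()
        img = np.clip(img.astype(float) * scale, 0, 255).astype(np.uint8)
    return img
```

Each input image can be passed through `augment` several times to produce the expanded dataset, after which near-duplicates are filtered out as described above.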

| Experimental environment and training
All experiments are run on the Ubuntu system. The specific experimental environment is shown in Table 4.
F I G U R E 6 Real-time detection platform for coal gangue.

F I G U R E 7 Experimental steps of the proposed method.

The corresponding training parameters of the network are shown in Table 5. Using the idea of transfer learning, weights pretrained on the VOC2012 dataset from the official YOLO site are used as the initial weights for training, so as to speed up network convergence. In each iteration, 32 images were loaded and divided into 16 batches to complete forward propagation (FP); backpropagation was performed to update the parameters after the FP of all 32 images was completed. A weight decay regularization term (decay) was introduced to prevent over-fitting during training. A multiscale training strategy was adopted, randomly selecting the image input size from {320, 352, 384, 416, 448, 480, 512, 544, 576, 608} every 10 iterations. The model was trained for 5000 iterations, taking a total of 2 h. Figure 9 shows the loss curve of the SMS-YOLOv3 training process. The loss decreases rapidly in the first 900 iterations and converges steadily after 1200 iterations, indicating that the model is well trained and the hyperparameter settings are reasonable.
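The multiscale input-size schedule described above (a new size drawn every 10 iterations, held fixed in between) can be sketched as:

```python
import random

# candidate square input sizes used by the multiscale training strategy
SCALES = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]

def multiscale_sizes(num_iters, period=10, rng=None):
    """Yield the input size for each training iteration, re-drawing a random
    size from SCALES every `period` iterations and holding it in between."""
    rng = rng or random.Random(0)
    size = SCALES[0]
    for it in range(num_iters):
        if it % period == 0:
            size = rng.choice(SCALES)
        yield size
```

Holding the size for a window of iterations amortizes the cost of reallocating buffers while still exposing the network to the full range of input scales.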

| Evaluation
In this experiment, the performance of the trained model is evaluated using Precision, Recall, mean Average Precision (mAP), Average Precision (AP), frames per second (fps), and memory occupied. The multiclass mAP is commonly used to evaluate detection accuracy and is defined as:

mAP = (1/n) Σ_{i=1}^{n} AP_i  (8)

In Formula (8), AP_i is the area under the precision-recall curve for the ith class, i ∈ [1, n]. Precision and recall are defined, respectively, as:

Precision = TP / (TP + FP)  (9)

Recall = TP / (TP + FN)  (10)

where TP, FN, and FP are the numbers of correctly detected targets, missed targets, and falsely detected targets, respectively. Formula (11) gives the calculation of fps:

fps = Num-Figure / Total-Time  (11)

where Num-Figure is the total number of detected images and Total-Time is the total detection time.
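Formulas (8)-(11) are simple enough to transcribe directly; a minimal sketch:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP), Formula (9)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN), Formula (10)."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP = (1/n) * sum of AP_i over the n classes, Formula (8)."""
    return sum(ap_per_class) / len(ap_per_class)

def fps(num_images, total_time_s):
    """fps = Num-Figure / Total-Time, Formula (11)."""
    return num_images / total_time_s
```

Computing each AP_i itself requires integrating the per-class precision-recall curve, which depends on the IoU threshold and ranking of detections and is omitted here.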

| Results and analysis of the SMS-YOLOv3 algorithm
Through the training of SMS-YOLOv3, the evaluation index values of the algorithm are obtained, as shown in Table 6. From the table, the detection accuracy of the improved model for coal and gangue is above 98%, the detection speed is 92 fps, and the memory occupation is only 10.7 M. The trained SMS-YOLOv3 model is then used in coal gangue detection tests on the experimental device; Figure 10 shows the detection and identification results under different conditions. Coal and gangue samples with diameters of 20-50, 60-90, and 110-140 mm, both stacked on each other and tiled over a large area, are tested. The light intensity from left to right is 211, 402, 563, 732, and 945 lux. As can be seen from the figure, the proposed SMS-YOLOv3 correctly identifies coal gangue under different lighting, at different sizes, with overlapping targets, and in large quantities, verifying its effectiveness.

| Ablation experiment
To accurately demonstrate the detection performance of the proposed algorithm mentioned above, this paper compares and analyzes the improvement points one by one. In the module ablation experiment, the same test set and image input size are adopted. Table 7 shows the results of the ablation experiment, which reveals that the improvement strategy of each module in SMS-YOLOv3 is helpful in improving detection performance.
In Table 7, Model A is the original YOLOv3; Model B replaces the feature extraction network of YOLOv3 with the improved lightweight SKNet-MobileNetv3; Model C adds a fourth detection output scale to Model B; Model D adds the SPP module to Model C; Model E uses CIoU on the basis of Model D; Model F applies the K-means++ algorithm on the basis of Model E. The details are as follows. Model A → Model B: The Darknet-53 feature extraction network of YOLOv3 is first replaced by SKNet-MobileNetv3 to reduce network parameters and improve detection speed. The results show that although the mAP is reduced to 96.22%, the fps reaches 101, an increase of 140.4%, and the memory decreases to 9.2 M. Model B → Model C: One detection scale is added, giving four detection output scales; the mAP increases from 96.54% to 97.55%. However, with the added feature layer the amount of network computation also increases, so the detection speed drops slightly, to 94 fps. Model E → Model F: With K-means++ clustering, the mAP reaches 98.97%, an increase of 0.54 percentage points, without changing the memory or detection speed. The main reason is that the K-means++ algorithm obtains anchor boxes of the best size, which reduces some false and missed detections.

| Comparison with other target detection algorithms
To comprehensively analyze the detection performance of the improved SMS-YOLOv3 algorithm, this paper selects several classical detection algorithms, YOLOv3, YOLOv3-Tiny, YOLOv4, 46 and YOLOv4-Tiny, and conducts training and detection experiments on the same datasets. The comparison of detection performance is shown in Figure 11. Compared with the original YOLOv3 and YOLOv4, the mAP is increased by 0.37% and 0.05%, respectively, the fps is increased by 119.04% and 104.7%, and the memory is only 1/24 and 1/25 of the originals. Compared with YOLOv3-Tiny and YOLOv4-Tiny, although the detection speed decreases, the mAP is 2.17% and 1.57% higher and the memory is reduced by 24…; the overall performance of this algorithm is better than YOLOv3, YOLOv4, YOLOv3-Tiny, and YOLOv4-Tiny. Furthermore, to intuitively show that the improved SMS-YOLOv3 algorithm reduces false and missed detections in special scenes (small targets, overlapping targets, large numbers, etc.), four representative pictures are selected for testing and analyzed with the different algorithms. The test results are shown in Figure 12. Figure 12A shows a small target under low illumination, Figure 12B a large number of small targets, Figure 12C overlapping targets, and Figure 12D a large number of targets. As can be seen, YOLOv3 produces repeated detections for large numbers of small targets; YOLOv4 produces repeated detections for overlapping targets; YOLOv3-Tiny produces false and repeated detections for large numbers of small targets and missed detections for large numbers of targets, with low detection confidence; YOLOv4-Tiny misses detections for large numbers of small targets and large numbers of targets, also with low confidence. The SMS-YOLOv3 proposed in this paper correctly detects coal gangue in all of these situations, with significantly improved confidence.
To sum up, the experimental results show that the SMS-YOLOv3 proposed in this paper can show a good detection effect, whether it is a large number of targets or overlapping targets. At the same time, compared with other models, SMS-YOLOv3 also has advantages in model size and parameter quantity, which can better balance model size with detection accuracy and real-time.

| CONCLUSION
Aiming at the problems of large memory footprint, low detection speed, and low detection accuracy for small and overlapping targets in current coal gangue target detection algorithms, a real-time detection method for coal gangue based on a multiscale fusion lightweight network (SMS-YOLOv3) is proposed. Taking the YOLOv3 model as the basic framework, the algorithm replaces Darknet-53 with the lightweight MobileNetv3 feature extraction network, in which all SE modules are replaced with SKNet, improving image feature extraction and making more effective use of parameters. A shallow detection scale is added to form a detection structure fusing four scales, improving the detection accuracy for small targets. SPP is added after the backbone network to improve the detection accuracy of the algorithm. The CIoU bounding box regression loss and K-means++ clustering are used to further improve detection accuracy. In addition, an experimental device is set up, and datasets are built for experiments.
1. The experimental results demonstrate that the proposed algorithm detects small and overlapping coal gangue targets accurately and quickly, with mAP reaching 98.97%. The algorithm improves mAP by 0.37% and fps by 119.04% compared with the original YOLOv3, with memory only 1/24 of the original.
2. The proposed algorithm shows more significant performance improvement as well as superior environmental robustness and practicality compared with other algorithms, providing theoretical and technical references for gangue detection and recognition.
By balancing model size, detection accuracy, and speed, the proposed lightweight SMS-YOLOv3 model achieves effective detection of coal gangue. However, due to the influence of dust and light in the field environment, images of coal gangue often show low contrast and serious color deviation, which adversely affects detection. In future research, an image enhancement algorithm will be used to improve the image quality of the datasets and further improve detection accuracy in complex environments.