Detail texture detection based on Yolov4-tiny combined with attention mechanism and bicubic interpolation

Crack detection on aero-engine blades is one of the important tasks in daily ground maintenance. A crack is a kind of texture feature, and because cracks are randomly distributed, irregular in shape and vague in appearance, automatic detection in the working environment remains a challenging task. A detection model based on Yolov4-tiny is proposed that is universal, focuses more on the characteristics of cracks, and is implemented on an embedded device. First, in order to distinguish cracks from noise, an improved attention module is introduced into the backbone of Yolov4-tiny to enhance the model's capability to focus on crack areas; second, in order to improve the effect of multi-scale feature fusion, bicubic interpolation is implemented in the upsampling module; finally, in order to remove redundant bounding boxes in crack areas, an optimized non-maximum suppression method is proposed so that the detection results correspond better to the ground truth. The robustness of the proposed detection model was demonstrated by evaluating it on images with varying lighting and noise. The average precision on the integrated datasets is 81.6%, which outperforms the original Yolov4-tiny by 12.3%.


INTRODUCTION
Texture is an inherent feature of objects and represents the appearance of a surface. A crack is a kind of texture feature in intuitive vision with the physical structure of straight lines or curves, so crack detection can generally be regarded as line detection on images. Whether viewed from the global or the local perspective, a crack can be regarded as a pixel area with a width of several pixels and a certain length; in addition, compared with other areas, a crack area usually shows an abrupt change in pixel value. Surface crack detection of aero-engine blades is an important task in aviation maintenance, since timely and effective discovery of defects can avoid disasters caused by damaged mechanical components. At present, crack inspection mainly relies on ground crew observing the blades and touching them inch by inch with their hands. This is boring and labour-intensive work, and the quality of the observation is easily affected by subjective factors. Considering the safety and reliability requirements of the aero industry, at the software level we optimized and improved the detection algorithm, and at the hardware level we plan to provide the detection results to maintenance personnel through a monitor and save them frame by frame so that the personnel can review them later. Since automatic crack detection systems are safer, more objective and less time-consuming, their research has received attention from both academia and industry [1].
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
In traditional image processing, crack detection can be roughly divided into two kinds of methods, one based on edge detection and one based on morphology. In the ideal situation, images of crack areas have obvious contrast and little background noise, and the crack can be detected with high precision, for example by the detector based on Canny [2], the methods based on the Sobel mask [3] and the LoG operator [4]. However, these methods are affected by background noise such as scratches on the actual blade surface. In addition, the threshold segmentation method [5] can separate the pixels of crack areas from the global background, but this type of method is sensitive to changes in illumination in the actual environment, because varying illumination changes the contrast between the crack areas and the background, which increases the difficulty of detection. In recent years, deep convolutional networks have shown excellent performance in various tasks such as classification, object detection and segmentation; furthermore, there are also proposals for edge detection [6,7], contour detection [8,9] and boundary segmentation [10]. These models start from the low-level features of the image and extract high-level features through cascaded convolution layers and pooling layers. In practical applications, however, there are many scratches on the blade surface (generally, scratches are not considered defects in daily maintenance), which form strong noise for detection, so the common conventional methods are no longer applicable.
We collected some real engine blades and, through observation, found that the crack areas are darker and the scratch areas are brighter under certain light conditions. This is because scratches do not completely destroy the chemical bonds between metal molecules, while cracks do. As a result, light can be totally reflected in the dent of a scratch but not in a crack, as shown in Figure 1.
In the working environment, the image background is usually very complex and the illumination conditions change greatly. In that case, convolutional neural networks extract wrong texture information while automatically extracting features. We used ResNet50 to observe the activation of feature maps in different layers through feature visualization; the strongest feature maps of the eighth activation layer and the 35th activation layer were extracted and visualized respectively, as shown in Figure 2. ResNet50 highlights noise and blurs the crack areas, because the convolution kernels are more inclined to extract regions with large pixel changes during training. Generally, the shallow convolution layers extract low-level features such as edges, corners and curves. The deep convolution layers combine the low-level features into shapes such as semicircles and quadrilaterals; in this process, the neural network model iteratively obtains the contours of objects and gradually ignores the detailed texture features. Moreover, compared with other cracked surfaces, the cracked areas of an engine blade have low contrast and a small proportion of crack pixels. Under certain illumination directions, the texture features of crack areas are especially faint. When these crack areas are passed through a deep convolutional neural network, their quality becomes rougher and rougher with the cascaded calculation, and the characteristics of the crack areas gradually become blurred or even lost, which brings greater difficulties and challenges to the detection work.
During our early research, we also found that the recall of crack detection is always lower than that of other object detection tasks, and that the crack detection model suffers from higher bounding-box loss and vanishing gradients, which means that the crack areas cannot be effectively confirmed. In the visualization results, the detections on sample images usually miss the crack areas or generate a large number of bounding boxes on a single crack, which is very different from the ground truth. In order to meet the requirements of actual working conditions, we are committed to developing handheld devices. We propose an optimized detection model based on the one-stage object detection model Yolov4-tiny [11], and we designed the experiments with the Jetson Xavier NX Development Kit. All the experiments are implemented and compared on the embedded device.
Combining the above analysis, we mainly faced the following problems: (1) noise (such as scratches and edges) and cracks have similar spatial shapes, so how can they be distinguished, as shown in Figure 1(a); (2) because of the inherent properties of neural networks, the image loses some detail information after forward calculation, so how can we ensure that more crack information is restored in upsampling when feature fusion is carried out, as seen in Figure 2(b,c); (3) based on our early experimental results, a single crack area very easily produces redundant detection bounding boxes, so how can redundant detection results be eliminated while keeping a low false alarm rate, as shown in Figure 3(c).
The contributions of this paper are the following: 1) An improved attention module is introduced into the backbone network of Yolov4-tiny. We optimized channel attention and spatial attention, and the weighted feature vectors replace the original feature vectors for residual fusion, which enhances the expression of crack features and reduces the influence of scratches and other background noise; 2) Bicubic interpolation is implemented in the upsampling module of Yolov4-tiny. We changed the upsampling algorithm so that more pixels of the feature maps participate in the calculation when the feature maps are upsampled, which enhances the feature fusion of the shallow and deep layers and reduces crack information loss; 3) An optimized non-maximum suppression (NMS) method is implemented during detection. We provide a dynamic IOU threshold to implement dynamic NMS, which solves the problem of redundant detection bounding boxes caused by the fixed threshold and reduces the false alarm rate; 4) We propose an integrated and systematic crack detection model with engineering feasibility, implement it on an embedded device, and compare it with different detection models to verify the effectiveness of the proposed model.

Machine learning
With the development of machine learning, a variety of crack detection methods based on feature extraction and pattern recognition have been proposed. Considering that a crack is mainly a texture feature, the precondition for crack detection is that the model has a good ability to learn texture features. Li et al. [12] used the local binary pattern (LBP) to represent the physical properties of cracks. Oliveira et al. [13] proposed an unsupervised learning method based on the mean and standard deviation to distinguish between cracked and non-cracked areas. Cord et al. [14] used AdaBoost with texture descriptors to represent the crack areas. The performance of these methods mainly relies on manual feature extraction, which may limit the universality and robustness of the algorithms. For the special material of aero-engine blade surfaces, it is difficult to generalize effective features that can adapt to complex application scenarios. These traditional feature extraction or feature description methods lack consideration of global information, and different adaptation methods often need to be considered for different datasets.

Deep learning
At present, deep neural networks occupy the dominant position in the field of computer vision, especially in object detection and classification, where algorithms based on CNNs have made breakthrough progress [15]. Some researchers have applied deep learning to fault and defect detection. Cha et al. [16] used a CNN with a sliding window mechanism to predict whether there are cracks in the current window. Fan et al. [17] used a CNN to determine whether the current area is a crack area and to generate a binary image that separates crack and background. The two-branch (crack-component-aware and crack-region-aware) CNN detection models [18] are used for the estimation of crack category and crack region, respectively. Because crack areas usually occupy a small pixel area, crack samples and non-crack samples are imbalanced; Bayesian data fusion [19] is one method to balance the positive and negative samples. Considering the physical characteristics of the crack area, some researchers realized crack detection based on image segmentation algorithms. In [20,21], researchers proposed pixel-level crack segmentation methods based on fully convolutional networks (FCN) [22]. DeepCrack [23] is an encoder-decoder network based on SegNet, in which the feature maps generated in the encoder and decoder are combined in pairs at the same scale to achieve pixel-level crack detection. Song et al. [24] proposed a tunnel crack pixel-level segmentation network based on DeepLabv4 [25], in which Atrous Spatial Pyramid Pooling is used to fully acquire the multi-scale features of the image. Mandal et al. [26] proposed a crack detection model based on YOLOv2 for road crack detection, which can realize real-time detection. Academic researchers have published many results on crack detection and, generally speaking, the results of deep learning are better than those of traditional image processing.
We plan to port this technology to embedded devices. Therefore, constructing an end-to-end model for aero-engine blade surface crack detection is still a problem that needs further research.

Adaptive NMS
In Yolov4-tiny, the model assigns pre-set bounding boxes to each grid cell. Generally, many detection boxes correspond to one object when an image passes through the neural network, but one object should have one bounding box, so the NMS method is needed to suppress the inaccurate boxes and preserve the most accurate one. In the NMS process, the detection results whose score is less than the confidence threshold are deleted first and the remaining results are sorted by score in descending order; then the IOU between the first detection result and each subsequent detection result is computed, and the results whose IOU is greater than the IOU threshold are deleted. This process is repeated until no similar results remain. Usually, the IOU threshold is a fixed number (such as 0.5), which is not suitable for crack detection: the feature information of crack areas is relatively uniform, so multiple detection results can be obtained in the same area, as shown in Figure 3. The multiple overlapping detection results cannot be called wrong, but they are still quite different from the ground truth, which affects the average precision (AP). The reason for the phenomenon shown in Figure 3 is that the fixed IOU threshold does not filter out the surplus detection results. For detection reliability, we therefore propose the adaptive NMS method, which offers a dynamic IOU threshold. When the multiple detection results are similar, a larger IOU threshold is provided to eliminate the duplicate detection results, and when the multiple detection results are different, a smaller IOU threshold is provided to retain more detection results. Suppose that the set of detection results is D and that T is the bounding box with the highest detection score. We then compute the IOU between T and each d_i in D and calculate the mean value.
The formula is as follows:
$$th_m = \frac{1}{n}\sum_{i=1}^{n} \mathrm{IOU}(T, d_i)$$
where $th_m$ is the mean value of the IOU between the highest-scoring detection box and each detection result $d_i$, $n$ is the number of detection results, $th_{fixed}$ is the default IOU threshold (the default value is usually 0.5), and $th$ is the final IOU threshold, derived from $th_m$ and $th_{fixed}$.
The following steps are consistent with the original NMS method. When large IOU values account for a relatively high proportion of the detection results, the threshold is also higher, and the duplicate detection results are eliminated. When small IOU values account for a relatively high proportion, the threshold is also smaller, and more detection results are retained. The proposed adaptive NMS method is therefore more suitable for crack detection.
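The NMS procedure and the mean-IOU statistic described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the mean excludes the top box T itself, and the rule that turns the mean IOU and the fixed threshold into the final threshold is passed in as a hypothetical `combine` function, because its exact form is not reproduced here. The default confidence threshold of 0.35 is the value reported later for inference.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, conf_thresh, iou_thresh):
    """Standard NMS: drop low scores, then keep the best box and remove overlaps."""
    order = sorted((i for i in range(len(boxes)) if scores[i] >= conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        # Remove every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def mean_iou_with_top(boxes, scores):
    """th_m: mean IOU between the top-scoring box T and the other boxes d_i."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    if len(order) < 2:
        return 0.0
    top, others = order[0], order[1:]
    return sum(iou(boxes[top], boxes[i]) for i in others) / len(others)


def adaptive_nms(boxes, scores, combine, conf_thresh=0.35, th_fixed=0.5):
    """NMS with a per-image threshold th = combine(th_m, th_fixed).

    `combine` is a placeholder (assumption) for the dynamic-threshold rule.
    """
    kept = [i for i in range(len(boxes)) if scores[i] >= conf_thresh]
    sub_boxes = [boxes[i] for i in kept]
    sub_scores = [scores[i] for i in kept]
    th = combine(mean_iou_with_top(sub_boxes, sub_scores), th_fixed)
    return [kept[i] for i in greedy_nms(sub_boxes, sub_scores, conf_thresh, th)]
```

For example, `adaptive_nms(boxes, scores, combine=lambda th_m, th_fixed: th_fixed)` reduces to ordinary NMS with the fixed threshold, which makes it easy to compare a candidate dynamic rule against the baseline.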

Bicubic interpolation
Image downsampling and upsampling reduce and increase the number of pixels respectively, and an interpolation method is usually used when images need to be upsampled. Crack images usually contain a lot of edge information, and different interpolation methods produce different upsampling effects, which affects the detection accuracy. Spatially, a proper interpolation method makes the upsampled results smoother and more continuous, and also reduces unnecessary noise. This section discusses the different interpolation methods in the upsampling module of Yolov4-tiny. The backbone network of Yolov4-tiny is CSPdarknet53_tiny, and the main structure of the detection model is shown in Figure 4. Addressing the shortcomings of the previous model (Yolov3-tiny), Yolov4-tiny improves the accuracy of multi-object detection. The detection of small objects often requires a finer receptive field to better display object features [27]. In deep neural network models, the deep layers extract the abstract semantic information in images, and the shallow layers extract the texture, edge and other features of images by means of convolutional kernels. The fusion of feature maps from different layers of the model often constitutes a more accurate detection system [28]. In Yolov4-tiny, a feature fusion strategy is adopted to integrate high-level features with low-level features. Specifically, the (19,19,512) feature map is upsampled and integrated with the (38,38,256) feature map, so that a detection path with a larger receptive field is obtained; it is then combined with the detection path with the smallest receptive field, which is output directly by the backbone network. The detection paths work together to complete the detection task and ensure multi-scale perception.
FIGURE 4 Detection paths of Yolov4-tiny
However, for crack detection, we always hope that the model can extract more texture features during training, so we optimized the interpolation algorithm to preserve more crack information before concatenation. In Yolov4-tiny, bilinear interpolation is used in upsampling. The core concept is to perform linear interpolation in two directions, along the x-axis and the y-axis respectively.
Given the pixel values of four points a, b, c and d, first perform linear interpolation in the x-axis direction to obtain the interpolation results of $m_1$ and $m_2$, then perform linear interpolation in the y-axis direction to obtain the interpolation result of point p.
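The two-step procedure can be sketched in a few lines of Python. The corner layout is an assumption made for illustration (a top-left, b top-right, c bottom-left, d bottom-right), and `tx`, `ty` are hypothetical names for the fractional position of p inside the cell, each in [0, 1].

```python
def lerp(v0, v1, t):
    """Linear interpolation between v0 (t = 0) and v1 (t = 1)."""
    return v0 + (v1 - v0) * t


def bilinear(a, b, c, d, tx, ty):
    """Bilinear interpolation inside one cell of four pixel values.

    Assumed layout: a = top-left, b = top-right, c = bottom-left, d = bottom-right.
    """
    m1 = lerp(a, b, tx)      # interpolate along x on the top edge
    m2 = lerp(c, d, tx)      # interpolate along x on the bottom edge
    return lerp(m1, m2, ty)  # interpolate along y between the two results
```

At the centre of the cell (`tx = ty = 0.5`) the result is simply the mean of the four corner values, which illustrates why the method behaves like a small low-pass filter.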
In image processing, bilinear interpolation only considers the influence of the four directly adjacent points around the interpolation point and does not consider the influence of the wider neighbourhood, so it has the properties of a low-pass filter. We adopt bicubic interpolation in the upsampling module. Bicubic interpolation considers not only the four adjacent points but also the pixel changes in the neighbourhood, which retains finer detailed texture information:
$$p(x, y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i, y_j)\, W(x - x_i)\, W(y - y_j)$$
$$W(x) = \begin{cases} (a+2)|x|^3 - (a+3)|x|^2 + 1, & |x| \le 1 \\ a|x|^3 - 5a|x|^2 + 8a|x| - 4a, & 1 < |x| < 2 \\ 0, & \text{otherwise} \end{cases}$$
where $a = -0.5$, $p(x, y)$ represents the interpolation result, $W(x)$ represents the bicubic function, $f(x_i, y_j)$ is the pixel value at $(x_i, y_j)$, and $x_i$ and $y_j$ are the coordinates of the nearest 16 pixels. Bicubic interpolation uses the nearest 16 pixels of the point p as parameters and obtains the interpolation result through a weighted sum. This method obtains more semantic information and more fine-grained information from the shallow feature maps, which is very important for crack detection.
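As an illustration of the weighted sum over 16 neighbours, the sketch below evaluates the standard cubic convolution kernel with a = -0.5 and uses it to interpolate a single point. It is a plain-Python sketch under stated assumptions: the image is a list of rows, coordinates are (column x, row y), and pixels outside the image are clamped to the border. In a PyTorch model the same effect would normally be obtained from the framework's built-in bicubic upsampling rather than hand-written loops.

```python
import math


def cubic_weight(x, a=-0.5):
    """Cubic convolution kernel W(x) with free parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def bicubic(img, x, y, a=-0.5):
    """Interpolate img at real-valued (x, y) from its 4 x 4 neighbourhood."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for j in range(y0 - 1, y0 + 3):        # four rows around y
        for i in range(x0 - 1, x0 + 3):    # four columns around x
            ci = min(max(i, 0), w - 1)     # clamp to the image border (assumption)
            cj = min(max(j, 0), h - 1)
            total += img[cj][ci] * cubic_weight(x - i, a) * cubic_weight(y - j, a)
    return total
```

The four weights along each axis sum to one, so a constant region stays constant, while the negative outer weights are what allow the kernel to preserve sharper transitions than bilinear interpolation.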

Attention module
The human visual attention mechanism can focus on the areas of interest in images, while the attention mechanism in deep learning removes redundant information and selects the information that is more important to the current object. This section discusses how to apply the attention mechanism to crack detection. Generally, the purpose of the attention mechanism is to effectively learn the weight distribution of different parts of the feature maps and, at the same time, reduce the influence of background information and improve the recognition accuracy and robustness of the model. For example, the residual attention network [29] uses the residual mechanism to construct the network, which introduces the attention structure while preserving the depth of the network. The convolutional block attention module (CBAM) [30] uses both the channel information and the spatial information of the feature map to design an attention module that focuses the model on more useful information and further enhances its feature extraction capacity. We designed our attention module with reference to CBAM and added it to the backbone network, which makes the detection model pay more attention to the crack areas.
Following CBAM, the overall attention process can be summarized as:
$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$
where ⊗ denotes element-wise multiplication, F denotes the input feature map, $M_c$ and $M_s$ denote the channel and spatial attention maps, F′ denotes the refined feature map, and F″ denotes the final refined output.
Channel attention: We exploit the inter-channel relationship of the feature maps. Each convolutional kernel can be regarded as a feature detector, so the feature map produced by each channel represents a kind of object feature, and the purpose of channel attention is to focus on the most meaningful of all channels. First, we squeeze the spatial dimension of the feature map by passing it through a max pooling layer and an average pooling layer respectively, which output two feature descriptors; the max pooling layer emphasizes the important features of the object and the average pooling layer effectively computes the extent of the object. Second, the two feature descriptors are both forwarded to a shared network composed of cascaded layers: one input layer, one output layer and three hidden layers. After the two descriptors have passed through the shared network, we use element-wise summation to merge the output feature vectors. Finally, the merged feature vector is activated by the sigmoid function, and the channel attention map is obtained.
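The channel attention steps above (pool, shared network, sum, sigmoid, reweight) can be sketched without any framework as follows. This is a simplified illustration under stated assumptions: the feature map is a list of C channels, each a list of rows; the shared network is given as a hypothetical list of bias-free weight matrices `layers` with a ReLU after each one; and the function names are not taken from the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_network(vec, layers):
    """Apply bias-free fully connected layers, each followed by ReLU."""
    for weights in layers:
        vec = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]
    return vec


def channel_attention(feature, layers):
    """Return (per-channel weights, reweighted feature map)."""
    # Squeeze each channel to one number with average and max pooling.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]
    # Both descriptors go through the same (shared) network.
    out_avg = shared_network(avg, layers)
    out_max = shared_network(mx, layers)
    # Element-wise sum followed by sigmoid gives the channel attention map.
    weights = [sigmoid(a + m) for a, m in zip(out_avg, out_max)]
    refined = [[[v * w for v in row] for row in ch]
               for ch, w in zip(feature, weights)]
    return weights, refined
```

A channel whose pooled responses are both zero receives the neutral weight 0.5, while channels with strong responses are pushed towards 1, which is the reweighting effect the module relies on.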
Spatial attention: We use spatial feature relationships to generate the spatial attention map. When an image enters the convolutional neural network, every pixel in the image participates in the calculation. Similar to channel attention, spatial attention focuses on the areas of the image that contribute the most to the objects. First, the refined feature map, computed from the channel attention map and the input feature map, is passed through a max pooling layer and an average pooling layer respectively to obtain two feature descriptors. Second, the two feature descriptors are concatenated, and we apply two convolutional layers to emphasize the relevant areas of the descriptors. Finally, the result is activated by the sigmoid function, and the spatial attention map is obtained.
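A spatial attention map of this kind can be sketched as below. The sketch follows the CBAM-style formulation as an assumption: pooling is taken across channels at each pixel, the two pooled maps are stacked, and a single zero-padded k × k convolution (instead of the two layers mentioned above) produces the map; `kernel` is a hypothetical 2 × k × k weight array with odd k.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def spatial_attention(feature, kernel):
    """Return (H x W attention map, reweighted feature map).

    feature: list of C channels, each a list of H rows of W values.
    kernel:  2 x k x k convolution weights (k odd), zero padding, no bias.
    """
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    # Pool across channels at every pixel: one average map and one max map.
    avg_map = [[sum(feature[n][i][j] for n in range(c)) / c for j in range(w)]
               for i in range(h)]
    max_map = [[max(feature[n][i][j] for n in range(c)) for j in range(w)]
               for i in range(h)]
    stacked = [avg_map, max_map]
    k = len(kernel[0])
    pad = k // 2
    att = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:  # zero padding outside
                            s += kernel[m][di][dj] * stacked[m][y][x]
            att[i][j] = sigmoid(s)
    refined = [[[feature[n][i][j] * att[i][j] for j in range(w)]
                for i in range(h)] for n in range(c)]
    return att, refined
```

With an all-zero kernel every pixel gets the neutral weight 0.5; trained kernel weights are what let the map rise over crack pixels and fall over background.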
After channel attention and spatial attention, the weights of the feature maps are optimized, and the final feature map carries more crack information. Let $F_{avg}$ and $F_{max}$ denote the average pooling and max pooling operations, and let $Att_{avg}$ and $Att_{max}$ denote their outputs, with $Att_{avg} \in \mathbb{R}^{1\times1\times C}$ and $Att_{max} \in \mathbb{R}^{1\times1\times C}$. $Att_{avg}$ filters out the global background information of the object well, and $Att_{max}$ highlights the salient features of cracks. Let $X = [x_1, x_2, \ldots, x_n]$, where $x_n$ represents the weight of the nth convolution kernel. The formulas for $Att_{avg}$ and $Att_{max}$ are as follows:
$$Att_{avg} = F_{avg}(X), \qquad Att_{max} = F_{max}(X) = \max(X)$$
After the shared network, the output of channel attention can be represented as:
$$output_{avg} = \mathrm{relu}(FC \times Att_{avg}) \quad (14)$$
$$output_{max} = \mathrm{relu}(FC \times Att_{max}) \quad (15)$$
$$output_{channel} = \sigma(output_{avg} + output_{max}) \quad (16)$$
where $\sigma$ is the sigmoid function. The weighted features are obtained by multiplication, and the filtered channel features are $W = [\omega_1, \omega_2, \ldots, \omega_n]$, which can be represented as:
$$\omega_n = x_n \times output_{channel} \quad (17)$$
After the channel feature filtering, W is input into the spatial attention module.
First, the feature vectors are passed through the average pooling layer and the max pooling layer respectively; then, along the channel dimension, the features are concatenated to get $C_{conv} \in \mathbb{R}^{1\times1\times2C}$. In order to get the feature weight information, a convolution operation is needed. Let $F^{5\times5}$ represent the convolution operation whose input channel is 2, output channel is 1 and kernel size is 5 × 5. The final filtered weight is denoted $output_{ch\&sp}$. The output of the whole attention module is $output_{ch\&sp} + X$, so the proportions of different parts of the original input vector are recalculated. Through this structure, the model can selectively

Data preparation
Deep learning models usually rely on a large amount of input data to generalize, so it is very important to obtain sufficient real data for model training and verification. This section introduces the acquisition and processing of the crack image data. Cracks have some physical characteristics, such as narrow width and varying length, and they appear in the image as narrow and irregular texture features. Generally speaking, the pixels in the crack area occupy a small proportion of the total number of pixels in the image. At present, most pre-trained models are trained on ImageNet [31]. Since AlexNet, the input image resolution has typically been below 256 × 256 pixels; in addition, the inherent structure of the neural network performs downsampling, so after layer-by-layer processing the image size is greatly reduced. After multiple downsampling steps, object features become very abstract and some important detailed features may be ignored [32]. Because texture information is very important for crack detection, low-resolution images have a negative impact on detection. Figure 8 displays an image of a crack in cement at different resolutions. It can be seen that a low-resolution image loses many detailed features of the crack, which affects the accuracy of detection. Based on the above, and considering the requirements of image semantic information and detection speed, we set the resolution of the input image to 608 × 608.
The dataset used for training and testing comes from aero-engine blade images, and we also draw on two public datasets: magnetic tile defects [33] and crack Forest [34]. In the training process, we first verified the proposed method on our own dataset; second, we verified the model on magnetic tile defects and crack Forest, respectively; finally, the three types of crack image data were combined into a comprehensive dataset on which we did comparative testing. For the integrated dataset, the total number of images is 10889. The images are divided into a training set and a test set according to the ratio of 8:2. We also used data augmentation, such as random brightness, random contrast, random cropping, flipping and rotation.
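The 8:2 division can be reproduced with a shuffled split such as the sketch below. The fixed seed and the rounding rule (truncating the training count) are assumptions made for illustration; the exact per-set counts are not stated in the text.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle items reproducibly and split them into (train, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)    # seeded so the split is repeatable
    cut = int(len(items) * train_ratio)   # truncation is an assumed rounding rule
    return items[:cut], items[cut:]


# Example with image indices standing in for the 10889 image files.
train_ids, test_ids = split_dataset(range(10889))
```

Under these assumptions the split yields 8711 training images and 2178 test images.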

EXPERIMENTAL DETAILS
In multi-label classification, the categories are not mutually exclusive and there can be multiple high-confidence outputs in a single classification result, so categorical cross entropy is usually used as the loss function for this type of problem. Crack detection, in contrast, is a binary classification problem, so we used binary cross entropy as the category loss function in the experiments, which is defined as:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i$ and $\hat{y}_i$ are the label and prediction of the ith output unit respectively, and n is the number of labels. L2 regularization is implemented. For evaluation, precision (Pr), recall (Re) and F1 score (F1) are introduced; they are widely used in classification and detection tasks. The definitions are as follows:
$$Pr = \frac{TP}{TP + FP}, \qquad Re = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Pr \times Re}{Pr + Re}$$
where TP, FP and FN represent the true positives, false positives and false negatives, respectively. As for the experimental details, the proposed crack detection network was implemented in PyTorch 1.6; the optimizer was Adam with a learning rate of 0.001, and the momentum factor and weight decay were set to 0.9 and 0.0005, respectively. In addition, in order to improve the recall of the bounding boxes, the prior anchors were clustered with k-means using 9 clusters. The proposed crack detection model is implemented on the Jetson Xavier NX, and the experimental equipment also includes a light that provides fixed-angle illumination and a USB camera that collects image data. Since the proposed adaptive NMS only works during inference, we integrate it into the detection model, and we set up two comparative experiments (including Yolov3-tiny) to train multiple models in turn and compare the detection results of the interpolation algorithm and the attention module.
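The loss and the evaluation metrics named above are standard and can be written out directly. The following plain-Python sketch is for illustration only; the small `eps` clamp that avoids log(0) and the zero-division fallbacks are assumptions, not details from the experiments.

```python
import math


def binary_cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```

For instance, 8 true positives with 2 false positives and 2 missed cracks give precision, recall and F1 of 0.8 each; lowering the confidence threshold typically trades precision for recall, which is the trade-off discussed in the attention experiments.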
The experiments on the embedded device are carried out according to the following steps:
Step 1: We collected crack image data and labelled all of it; we also draw on two public datasets (magnetic tile defects and crack Forest) to construct our training and testing datasets;
Step 2: In order to solve the problem of similar features (cracks and scratches) and noise interference (pits and edges), we introduced an improved attention module into Yolov4-tiny;
Step 3: In order to enhance the feature fusion effect of Yolov4-tiny, we introduced bicubic interpolation into the upsampling module;
Step 4: In order to solve the problem of redundant detection results on a single crack, we optimized the NMS method;
Step 5: We integrated the above improvements and trained the models on a server;
Step 6: We installed the Ubuntu operating system on the embedded device and configured the deep learning development environment (the same as on the server);
Step 7: We repeatedly tested the models on the embedded device and collected the experimental results;
Step 8: According to the experimental results, we returned the models to the server for modification and tested them again.
Steps 7 and 8 are repeated until the expected experimental results are obtained.

Upsample comparison experiment
In order to compare the effects of different upsampling modules on the detection results, we changed only the interpolation algorithm and obtained detection results on our own dataset. The methods are shown in Table 1, and the test results of Yolov4-tiny+Bilinear and Yolov4-tiny+Bicubic are shown in Figure 9. The Yolov4-tiny model that uses bicubic interpolation for upsampling has improved precision and recall, so the average precision and F1 are correspondingly improved. All of the methods are tested on our own dataset.
Better detection results are achieved by the Yolov4-tiny+Bicubic model. The precision, recall, F1 and AP increased by 2.6%, 2.3%, 2.4% and 1.8%, respectively. Compared with the detection model using bilinear interpolation, the detection model using bicubic interpolation not only integrates the low-level and high-level features better, but also has better bounding-box regression capabilities.
The recall of Yolov4-tiny+Bicubic in (a) and (b) is higher than that of Yolov4-tiny+Bilinear, and the results are closer to the ground truth. For the Yolov4-tiny+Bicubic detection results: (a) the whole crack area is detected more completely; (b) the model is not disturbed by noise and does not falsely detect the pit as a crack; (c) in the weak-contrast environment with low illumination, the model extracts details better and perceives subtle changes better. The detection model using bicubic interpolation in the upsampling module can better restore the image details, which allows the deep features to maintain a better

Attention module comparison experiment
Woo et al. [30] have discussed the connection sequence of the channel attention and spatial attention modules. We followed the experimental results in that paper, using the connection sequence of channel attention first and spatial attention second.
Channel attention is used to emphasize the parts that need to be given high weight in each convolution channel. Spatial attention is used to emphasize the spatial proportion of the crack area and the proportion of crack-area pixels in the calculation process. Through these two attention maps, the information of the crack areas is emphasized and the information of the non-crack areas is suppressed; they also contribute to the information flow within the neural network model. We compared our attention module with another attention module (SE, squeeze-and-excitation) to illustrate the effectiveness of our method. All of the methods are tested on our own dataset.
Similar to the previous section, to compare the influence of the attention mechanism on the detection results, we used Yolov4-tiny+Bicubic and Yolov4-tiny+Bicubic+Attention as the comparative models. The methods are shown in Table 2, and the test results of Yolov4-tiny+Bicubic and Yolov4-tiny+Bicubic+Attention are shown in Figure 11.
As Table 2 shows, our method achieves better performance: precision, recall, F1 and AP increase by 5.6%, 3.7%, 4.7% and 5.1% over Yolov4-tiny+Bicubic, respectively. Compared with the SE attention module (Yolov4-tiny+Bicubic+SE), the precision, recall, F1 and AP of our method increase by 4.7%, 2.1%, 3.5% and 3.3%, respectively. Our improved attention module has an advantage in recall, which means that our model focuses more on crack areas.
In actual aero-engine maintenance, missed inspections are more harmful than false inspections. Therefore, we appropriately lowered the confidence threshold (0.35) during inference so that more cracks are detected and unnecessary hazards caused by omissions are avoided. In (a), both Yolov4-tiny+Bicubic and Yolov4-tiny+Bicubic+Attention successfully detected the crack in the noisy environment and avoided the interference of edges and pits. Under the complex background in (b), Yolov4-tiny+Bicubic+Attention focuses better on the crack area. In (c), the model also covers the crack area better and is closer to the groundTruth. The detection model with the attention module has a stronger regional focusing capability.
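The trade-off behind lowering the confidence threshold can be sketched with a few lines of code. The detections below are hypothetical and serve only to show the mechanism; they are not data from the experiments, and the comparison threshold of 0.50 and the function names are likewise assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def evaluate_at_threshold(detections, num_ground_truth, threshold):
    """detections: (confidence, is_true_positive) pairs, already matched
    against the groundTruth boxes."""
    kept = [hit for conf, hit in detections if conf >= threshold]
    tp = sum(kept)           # True counts as 1
    fp = len(kept) - tp
    fn = num_ground_truth - tp
    return precision_recall_f1(tp, fp, fn)

# Hypothetical detections for an image set containing 5 groundTruth cracks.
detections = [(0.92, True), (0.81, True), (0.55, False),
              (0.48, True), (0.40, True), (0.37, False)]

for threshold in (0.50, 0.35):
    p, r, f1 = evaluate_at_threshold(detections, num_ground_truth=5,
                                     threshold=threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

In this toy case the lower threshold admits two additional true cracks and one additional false alarm, so recall rises while precision stays the same; in general, lowering the threshold can only keep or raise recall, at the possible expense of precision.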
The pixel values that contribute more to the classification are given higher attention, which improves the feature expression of the region of interest and reduces the influence of useless information such as background noise. The attention module effectively improves the distribution and processing of

Comparison with other models
After the above comparison experiments, we also compared our method with other popular deep learning detection models to illustrate its effectiveness, including Yolov3 [35], SSD [36] and EfficientDet [37] (multiple versions). Because our experimental environment is an embedded platform and detection speed matters, we have not added Faster-RCNN, Mask-RCNN and other state-of-the-art detection models to our comparative test, since it is difficult for them to ensure real-time inference on the embedded platform.
The proposed crack detection model and the other models were evaluated on two public datasets (magnetic tile defects and CrackForest) and on the integrated datasets, as shown in Tables 3-5. Crack detection in the actual working environment must run in real time, so Table 6 compares the detection speed of the models whose detection results are good enough. It is generally considered that real-time performance is achieved only when the detection speed exceeds 20 FPS.
As shown in the table above, according to the F1 ranking, the top four models are Yolov4, EfficientDet D1, our method and Mask-RCNN-Resnet101. Compared with models with a large number of parameters (such as CenterNet, Faster-RCNN and Mask-RCNN), our method has similar performance; compared with the lightweight models, it performs better. As shown in Table 4, CrackForest contains road-crack data whose cracks are long and diverse in shape. Therefore, most models did not obtain good detection results on this dataset, whereas our method and Yolov4 still perform well.
Combining the results in the tables above, EfficientDet D1, Yolov4 and our model achieve the best performance. Among the compared models, Yolov3-tiny, Yolov4-tiny, SSD and our model are lightweight; except for our method, their AP scores are lower than those of the large-parameter models. Our method optimizes the lightweight model so that its performance is close to that of the large-parameter models. On the integrated datasets, Yolov4 achieved better performance, 4.4%, 4.6% and 4.0% higher than our model, respectively. The reason is that Yolov4 has three detection paths (the same as Yolov3), as shown in Figure 12, whereas Yolov4-tiny has only two (the same as Yolov3-tiny). The three detection paths greatly improve Yolo's accuracy on small objects, especially the rightmost path, which has the smallest receptive field of all the detection paths; in Yolo-tiny, one detection path is removed for faster detection. Therefore, the AP of Yolov4 is 4% higher than that of Yolov4-tiny, but its inference speed is far slower than that of our model.

FIGURE 12
Three detection paths in Yolo

For practical industrial application and the demands of mobile handheld devices, the model has to meet the requirements of real-time detection and a small number of parameters. Generally speaking, the minimum requirement for real-time detection is 20-25 fps, which means that the model needs to infer a detection result in 0.04 to 0.05 s. Based on the above experiments, we implemented several models with better average precision and compared their detection speed and model parameters on embedded devices. Considering the demands of practical application, our method is more feasible, as shown in Table 6.
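The relation between frame rate and per-frame time budget quoted above is a simple reciprocal, and a speed measurement on a device can be sketched in the same spirit. The helper names, the warm-up pass and the stand-in inference function are assumptions for illustration; this is not the benchmarking procedure used for Table 6.

```python
import time

def frame_budget_seconds(fps):
    """Maximum time available per frame at a given frame rate."""
    return 1.0 / fps

def is_real_time(latency_seconds, min_fps=20.0):
    """True if one inference fits within the budget implied by `min_fps`."""
    return latency_seconds <= frame_budget_seconds(min_fps)

def measure_fps(infer, frames, warmup=1):
    """Average throughput of `infer` over `frames`, after a short warm-up."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")

print(frame_budget_seconds(25), frame_budget_seconds(20))  # budgets at 25 and 20 fps
print(is_real_time(0.045), is_real_time(0.060))

# Stand-in for a detector: sleeps 10 ms per frame (hypothetical workload).
fps = measure_fps(lambda frame: time.sleep(0.01), frames=list(range(5)))
print(f"measured throughput: {fps:.1f} fps")
```

Measuring average throughput over many frames, rather than timing a single call, reduces the influence of one-off costs such as model loading or cache warm-up.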
As Table 6 shows, models with a large number of parameters (such as Faster-RCNN+Resnet101 and Mask-RCNN+Resnet101) cannot run efficiently on embedded devices. The three models with the best performance are EfficientDet D1, Yolov4 and our method. Admittedly, EfficientDet D1 and Yolov4 have better average precision, but they struggle to meet the requirement of real-time detection. Our optimized method keeps the performance close to that of the large-parameter models while keeping the number of parameters small and ensuring real-time performance. Through the improved attention module, the optimized upsampling module and the NMS method, our method achieves both better performance and real-time operation.
For qualitative analysis, some detection samples are shown in Figure 13. The lightweight models have difficulty detecting cracks in the low-contrast environment and also cover the crack areas incompletely, whereas our model performs better. Relatively speaking, EfficientDet D1 and Yolov4 both have good performance. They maintain relatively high precision for cracks and good coverage of crack areas. However, their focus on the crack areas is relatively low, which means that the models are prone to drawing multiple bounding boxes on a single crack, so the groundTruth cannot be restored well.

FIGURE 13
Qualitative comparison results (from left to right: groundTruth, Yolov3-tiny, SSD+inception, EfficientDet D1, Yolov4, Yolov4-tiny+Bicubic+Attention)
Yolov4-tiny is a one-stage detection model that directly extracts crack features to predict object category and object location. We optimized the upsampling module and the attention module and added the adaptive NMS method, while preserving both the AP and the detection speed. The bicubic interpolation and the attention module do increase the parameters, which slows down the proposed detection model; but for aero-engine blade crack detection, safety is the greater concern, so some additional time consumption is acceptable as long as the accuracy is ensured.
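For readers unfamiliar with non-maximum suppression, the following sketch shows the standard greedy procedure with a fixed IOU threshold, which removes lower-scoring boxes that overlap a kept box too strongly. The rule by which the adaptive NMS method derives its dynamic threshold is not restated in this section, so it is not reproduced here; the `iou_threshold` argument merely marks where such a rule would supply its value. The boxes, scores and function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already kept box by more than `iou_threshold`.
    Returns the indices of the kept boxes, in order of descending score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes on one elongated crack, plus a separate crack.
boxes = [(10, 10, 110, 40), (15, 12, 115, 42), (200, 50, 260, 90)]
scores = [0.9, 0.6, 0.8]
print(nms(boxes, scores, iou_threshold=0.5))
```

With a fixed threshold, a long crack covered by several moderately overlapping boxes may keep more than one of them, which is the kind of redundancy a threshold that adapts to the current detections is meant to reduce.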

CONCLUSION
Aero-engine blade crack detection is an important part of daily maintenance. To reduce labour costs and avoid the negative influence of subjective factors on crack detection, we studied detail-aware texture perception based on Yolov4-tiny. We first analysed the characteristics of traditional image processing techniques: edge-based and threshold-based methods have difficulty with crack detection in complex environments and are susceptible to changes in illumination and background noise. Through further analysis of the feature visualization experiment and the pre-experiment results, we summarized the main problems in aero-engine blade crack detection: 1) how to distinguish noise from cracks of similar spatial shape under illumination changes and noise interference; 2) given the loss of crack feature information after forward calculation, how to ensure that more crack information is restored by upsampling during feature fusion; 3) how to eliminate redundant detection results caused by the uncertainty and spatial continuity of the crack distribution. To address these three problems, respectively: 1) the improved attention module was added to the backbone network, which enhances the expression of crack features and the ability of the detection model to focus on crack areas; 2) bicubic interpolation was implemented in the upsampling module, so that the fusion of deep and shallow features was optimized and the feature expression of regions of interest was improved; 3) the adaptive NMS method was proposed, which outputs a dynamic IOU threshold based on the current detection results to remove redundant detection results. With these optimizations, the crack detection model not only approaches the accuracy of large-parameter models, but also meets the real-time requirements in the same way as lightweight models.
In the future, we will further study the laws governing the formation and distribution of various forms of cracks and develop aero-engine blade crack detection equipment.