A coarse to fine network for fast and accurate object detection in high-resolution images

Because of the popularisation of high-resolution images, detecting objects in these images quickly and accurately has attracted increasing attention in recent studies. Current convolutional neural network (CNN)-based detection methods have limitations in detecting small objects owing to the interference of scale variation. In this work, we propose an improved generic framework based on YOLOv3. Equipped with multiresolution supervision for training and multiresolution aggregation for inference, this method can deal with the challenge of scale variation in high-resolution images. First, we move up the multiscale prediction positions and add a dilated convolution module to YOLOv3 to improve detection accuracy, especially for small objects. Then, we present a coarse to fine method to reduce the detection time. Experiments on the COCO dataset show that our approach achieves 2.8% better accuracy than the previous YOLOv3. On the Dataset for Object deTection in Aerial images (a high-resolution remote sensing dataset), our approach outperforms YOLOv3 by nearly three percentage points in mean average precision, while being up to three times faster and two times smaller.


| INTRODUCTION
With the development of high-resolution (HR) surveillance devices, vision tasks using HR images have grown in popularity for both research groups and industrial communities. Object detection and image recognition on HR images have a variety of potential applications, such as remote sensing (RS) [1] and autonomous vehicles [2].
Nevertheless, interpreting details in HR images is still a huge challenge for existing detection methods. First, HR images usually contain a large number of pixels (e.g. a 4k picture with a resolution of 3840 × 2160 pixels), which means that they have greater scale variation than ordinary images that are less than 500 pixels on each side. Directly down-sampling HR pictures results in loss of information and undetectable tiny objects. Second, processing HR images in a patch-by-patch manner requires plenty of time as well as memory resources. It seems feasible to select parts of HR images for detailed analysis, but how to perform this selection automatically and efficiently remains unsolved.
To deal with complex scale variations, existing studies generally adopt multiscale (MS) techniques or data augmentation.
MS techniques include different methods such as image pyramids [3], feature pyramid networks [4] and MS anchors [5]. To avoid training objects with extreme scales, SNIP [6] and SNIPER [7] propose a scale normalisation method that selectively trains objects of appropriate size in each image scale. All of these MS techniques will increase the inference time [8] and may ignore the imbalance among samples of different sizes.
To address this issue, data augmentation has been proposed to improve the detection performance on small objects [9]: images with small objects are oversampled, and each image is augmented by copying and pasting small objects multiple times. However, the method still needs to resize the input to a fixed size and is designed for low-resolution images such as MS-COCO (about 600 × 400) [10] and Pascal visual object classes (VOC) (from 2005 to 2012, about 500 × 400) [11,12].
To reduce processing time on HR images, previous research [13–16] tried to simplify the network architecture or restrict the input image size to reduce GPU memory consumption and speed up detection. However, these approaches are suitable only for particular detection tasks, because a shallow network weakens feature discrimination and smaller objects may become too tiny to be detected in the down-sampled images. Because the general idea is to divide each image into subimages to be detected in a patch-by-patch manner, selecting semantically critical patches for intensive processing is important. However, simple strategies can be suboptimal. For instance, randomly dropping patches of HR images from the Functional Map of the World [17] dataset will significantly reduce the accuracy of a trained network. Reinforcement learning has also been introduced for patch selection in detection [18], but it requires complicated training and introduces nontrivial overhead on the order of seconds per image.
In this article, we propose an improved generic framework based on YOLOv3 [19] (Figure 1) to handle detection in HR images. First, to address the scale variation challenge, especially to retain details of small objects, we move up the positions of the MS predictions compared with the original YOLOv3. The basic idea is that low convolutional layers extract more precise location information, especially for small objects [4], whereas high convolutional layers extract semantic features. Because larger receptive fields can seize more context information and handle scale variation better, we add a dilated convolution (DC) module, which includes three branches with different dilation rates, after feature extraction. The branches have the same network structure and share the same parameters except for the dilation rates. Finally, we introduce a coarse detection module (CM) that shares feature extraction with the early layers of the fine detection module to filter out subimages without objects. Owing to this coarse to fine manner, many convolution operations can be omitted. As a result, our approach significantly reduces inference time and improves detection accuracy for HR images.
The highlights of our study are: (i) We introduce a novel network to deal with the difficulty of detecting small objects. By carefully designing the positions of MS detection and the DC module with a multibranch structure, our network can obtain accurate small object locations without sacrificing semantic information. (ii) We develop a coarse to fine strategy in the advanced network to filter out subimages without objects and greatly improve detection efficiency. (iii) We validate the effectiveness of our approach with ablation studies. Experiments on the COCO and HR Dataset for Object deTection in Aerial images (DOTA) [20] datasets demonstrate that our method achieves decent performance, with a mean average precision (mAP) of 80.3 and an average detection time of 0.78 s, compared with state-of-the-art methods.

| Convolutional neural network detectors
Object detection has evolved in both accuracy and speed since the pioneering work of Girshick et al., who proposed Regions with convolutional neural network (CNN) features (R-CNN) for object detection [21]. As one of the principal genres, two-stage detectors [4, 21–26] first generate a set of region proposals and then refine them with CNN networks. On the other hand, one-stage detectors, which remove the proposal generation stage and formulate detection as a regression problem, have received much more attention for their real-time performance. SSD [27], YOLO [19,28,29] and RetinaNet [30] are popular methods of this kind. Despite their high speed and simplicity on images with relatively low resolution, one-stage detectors cannot generalise well to large images with objects of variable sizes [18]. SSD [27] uses MS feature maps and detects objects at different layers of the network. Owing to the independent detection between feature maps of each scale, SSD cannot achieve high detection accuracy. We employ an MS feature fusion strategy based on YOLOv3 to improve detection performance.

FIGURE 1 Proposed network structure for object detection. For simplicity, the depth of convolutions in the figure is not represented. The input of size 608 × 608 is a subimage of an original image. As the coarse detection module, the first conv block after 16 times down-sampling calculates a confidence to predict whether the subimage contains foregrounds. If the confidence is greater than the predefined threshold, the CNN models continue to extract features. Otherwise, the subimage is omitted. The conv block is a concatenation of several conv layers. CNN, convolutional neural networks
For CNN detectors, the receptive field and the feature resolution are two significant characteristics: the former refers to the spatial range of input pixels that contribute to the calculation of a single output pixel, and the latter corresponds to the down-sampling rate between the input and the feature map. As shown by Li et al. [31], a network with a larger receptive field can capture a larger scale of context information, whereas one with a smaller receptive field may concentrate more on local details. On the other hand, a lower feature resolution makes small objects harder to detect. Removing pooling layers or reducing the convolution down-sampling rate is not feasible because the receptive field would become too small. A practical method to increase both the receptive field and the feature resolution is to introduce DC (also known as atrous convolution). Originally proposed for semantic segmentation tasks [32] in 2015, DC enlarges the convolutional kernel with its original weights by performing convolution at sparsely sampled locations, increasing the receptive field without additional parameter cost. It has been widely used in object detection [27,31,33,34] owing to its convenience, but it cannot generalise as well on large images. We employ a DC module with three branches of different dilation rates to concatenate more features in both small and large images.
To obtain better performance, Gao et al. [18] and Růžička and Franchetti [35] employ a coarse to fine manner that performs coarse detection and fine detection iteratively. Although the method tends to improve the accuracy of detection, it usually sacrifices efficiency. We introduce a CM without iteration to achieve better efficiency.

| RS object detection
RS images, representative of relatively high resolution, have received increasing attention in the deep learning era. Because of the differences in sensor positions and angles between RS images and regular images, Penatti et al. [36] conducted experiments on the effectiveness of deep CNN features for RS images and confirmed the applicability of CNN detectors to these images. Following this idea, deep CNN networks have gradually been applied to RS object detection [37,38]. General object detection frameworks such as Faster R-CNN and SSD have also attracted growing attention for RS images [39,40]. However, these methods use deeper CNN networks to improve detection accuracy, which leads to long detection times. YOLT [1] performs well on inference speed thanks to its simple network structure, but without MS feature reuse, its detection accuracy is limited. Some methods [41–43] are proposed for specific detection targets. Despite achieving better accuracy and efficiency, these networks do not have excellent generalisation capabilities.
We use the RS dataset DOTA [20] to verify the improvement of the proposed network on HR images. As a publicly available and commonly used object detection dataset, DOTA contains a large number of HR images with reasonably distributed MS objects.

| THE PROPOSED FRAMEWORK
We propose an improved YOLOv3-based network that employs a coarse to fine strategy to improve the detection performance of both HR and regular images, as shown in Figure 2.

| HR image preprocessing
The original YOLOv3 network is designed to detect objects in relatively small images. As is typical of HR images, most RS images possess more than 1024 × 1024 pixels. If we take a resized RS image directly as input, too much down-sampling of the feature maps may cause small object features to vanish. A feasible method is to crop a large image into subimages. Considering that the side length of the input image suitable for YOLOv3 is 320–608, we crop each large image into subimages of 608 × 608 pixels while retaining the features of small objects as much as possible. In addition, we allow some overlap between adjacent subimages to avoid missing targets at subimage boundaries.
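The cropping scheme above can be sketched as follows. This is a minimal sketch, not the authors' code; in particular, the 96-pixel overlap is an illustrative assumption, as the exact overlap is not stated in the text.

```python
# Tile a large image into 608x608 subimages with overlap between
# neighbours, so objects on subimage boundaries appear whole in at
# least one tile. The overlap value (96) is an assumption.
def tile_origins(length, tile=608, overlap=96):
    """Top-left coordinates along one axis so tiles of size `tile`
    cover [0, length) with at least `overlap` shared pixels."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile flush with the border
    return origins

def crop_boxes(width, height, tile=608, overlap=96):
    """All (x1, y1, x2, y2) crop windows for a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]
```

For a 1500 × 1024 image this yields a 3 × 2 grid of tiles, with the last row and column shifted inwards so no tile exceeds the image border.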

| Original YOLOv3
As a widely used method, the YOLOv3 [19] network divides the image into grid cells and simultaneously predicts bounding boxes and class probabilities for each cell. In addition, it introduces a feature pyramid network to achieve MS prediction and uses a deep residual network to extract as many features as possible.

| MS prediction
As shown in Figure 2, we move the scale 2 and scale 3 prediction positions up to the eight times and two times down-sampling points, respectively, to obtain earlier feature maps and merge them with our upsampled features using concatenation. This allows us to get finer-grained information from the earlier feature maps compared with the original YOLOv3, which performs scale 2 and scale 3 detection at positions 4 and 3. Information on small objects becomes increasingly unavailable as network depth and down-sampling increase. Therefore, such an improvement plays a significant role in detecting small objects while adding only acceptable computational complexity. Details and ablation studies on each component are discussed in Section 4.2.
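The fusion step above can be sketched as follows. This is a minimal numpy illustration under assumed shapes (a 19 × 19 deep map fused with a 38 × 38 earlier map); the real network wraps this step in learned convolutions.

```python
import numpy as np

# Upsample a deep, semantically rich feature map with nearest-neighbour
# interpolation and concatenate it with an earlier, higher-resolution
# map along the channel axis. The earlier map carries finer-grained
# localisation information, which helps small objects.
def upsample_nearest(x, factor=2):
    """x: (C, H, W) feature map -> (C, H*factor, W*factor)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(deep, shallow):
    """Concatenate the upsampled deep map with a shallow map whose
    spatial size is an integer multiple of the deep map's size."""
    up = upsample_nearest(deep, shallow.shape[1] // deep.shape[1])
    return np.concatenate([up, shallow], axis=0)
```

Concatenation (rather than element-wise addition) keeps both feature sets intact and lets the following convolutions learn how to weight them.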

| DC module
We add a DC module after the whole feature extraction. This module consists of three branches that contain convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5; each is followed by a 3 × 3 convolution with a different dilation rate. We set the dilation rates δ to 1, 3 and 5, respectively, to match the sizes of the convolution kernels. Finally, we add a 1 × 1 conv to keep the number of channels the same as before, and we merge the concatenated features of the three branches with the whole features in an element-wise manner.
The DC module enriches the features and improves the detection accuracy.
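A minimal sketch of a single dilated convolution may clarify why the receptive field grows without extra parameters; the function names and 'valid' padding here are our assumptions, not the paper's code. With a 3 × 3 kernel, dilation rates of 1, 3 and 5 give effective kernel sizes of 3, 7 and 11.

```python
import numpy as np

def effective_kernel(k, rate):
    """Effective kernel size of a k x k convolution at dilation `rate`."""
    return k + (k - 1) * (rate - 1)

# Naive 'valid' dilated 2D convolution: the kernel is applied at input
# positions spaced `rate` apart, so the same k*k weights cover a larger
# spatial window without any extra parameters.
def dilated_conv2d(x, w, rate=1):
    """x: (H, W) input, w: (k, k) kernel."""
    k = w.shape[0]
    eff = effective_kernel(k, rate)
    H, W = x.shape
    out = np.zeros((H - eff + 1, W - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eff:rate, j:j + eff:rate]  # sparse sampling
            out[i, j] = (patch * w).sum()
    return out
```

In the DC module, the three branches apply this operation with δ = 1, 3 and 5 in parallel, so their outputs see context at three different scales before being concatenated.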

| Coarse to fine strategy
An immediate problem with our strategy of cropping the original image into subimages is the substantial increase in processing time. In this section, we propose a novel coarse to fine strategy to reduce inference time. Empirically, most objects are distributed over a few subimages rather than the entire image. This sparsity motivates a coarse to fine strategy that adds a CM to the feature extraction module to filter out subimages without objects.
Similar to the fine detection module, we construct the CM, as shown in Figure 3. However, unlike the fine detection, in which four coordinates, the confidence and the categories of objects are calculated, we need only obtain the confidence of the bounding boxes. In the rough selection phase, we can filter out some subimages without objects by using the indicator of confidence alone. The confidence score p_i is defined as the sum of the confidences p_r of all bounding boxes on a grid cell:

p_i = Σ_j p_r(i, j),

where i denotes a grid cell of the feature map and j the index of a bounding box within that cell. Given a subimage of an original image, the subimage may contain objects to be detected when

max_i p_i > 0.5,

where the threshold 0.5 is an experimental value. In that case, the subimage is subjected to the subsequent feature extraction and fine detection processes. If the global confidence p of a subimage is less than 0.5, it is regarded as objectless and will be ignored. In the fine detection stage, we only need to detect the remaining subimages. During training, we use the confidence loss and the loss of the fine detection module to train jointly. Similar to Redmon et al. [28], we use a target confidence p̂_r(i, j), set to 1 if bounding box j has the highest intersection-over-union with the ground-truth box in grid cell i and to 0 otherwise. Thus, the cross-entropy between the predicted and target confidences can be calculated as

CE(p_r, p̂_r) = -[p̂_r log p_r + (1 - p̂_r) log(1 - p_r)].

The confidence loss L of the CM can then be computed as

L = Σ_i Σ_j 1_obj(i, j) CE(p_r(i, j), p̂_r(i, j)) + λ_noobj Σ_i Σ_j 1_noobj(i, j) CE(p_r(i, j), p̂_r(i, j)),

where the parameter λ_noobj is used to decrease the loss from confidence predictions for boxes without objects. We set λ_noobj to 0.5 in our experiments. Finally, the total loss is the sum of the coarse detection loss and the fine detection loss, whose setting is the same as in Redmon and Farhadi [19]. The last crucial issue is the suitable position of the CM; a discussion of this issue may be found in Section 4.2.3.

FIGURE 2 Given a preprocessed image as input, the coarse detection module filters out some subimages without objects after several convolution operations for feature extraction. The remaining subimages continue through feature extraction and are detected by the final detection module. Circled numbers one to five mark the candidate positions of the coarse detection module. Parameter δ denotes the dilation rate
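The gating and loss described above can be sketched as follows. This is our reading of the text with assumed shapes (a 38 × 38 coarse map with three boxes per cell); `keep_subimage` and `confidence_loss` are hypothetical names, not the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Coarse gating: per-cell confidence p_i is the sum of the three box
# confidences; the subimage is kept when any cell exceeds the 0.5
# threshold, otherwise fine detection is skipped entirely.
def keep_subimage(conf_logits, threshold=0.5):
    """conf_logits: (38, 38, 3) raw confidences from the coarse head."""
    p_cell = sigmoid(conf_logits).sum(axis=-1)   # p_i = sum_j p_r(i, j)
    return p_cell.max() > threshold

# Confidence loss of the CM: binary cross-entropy over all boxes, with
# the no-object boxes down-weighted by lambda_noobj = 0.5.
def confidence_loss(p, p_hat, obj_mask, lam_noobj=0.5):
    eps = 1e-7
    ce = -(p_hat * np.log(p + eps) + (1 - p_hat) * np.log(1 - p + eps))
    return (ce * obj_mask).sum() + lam_noobj * (ce * (1 - obj_mask)).sum()
```

Because the gate only needs a confidence map, the skipped subimages never enter the deeper (and far more expensive) layers of the network, which is where the inference-time savings come from.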

| EXPERIMENTS
In this section, we perform experiments on the MS-COCO dataset, the Pascal VOC dataset and the large-scale DOTA dataset, which includes plenty of HR images. Datasets such as COCO and VOC are chosen only to validate our improvements to the YOLOv3 network structure in Section 3.2.

| The PASCAL VOCs and MS-COCO datasets
The VOC Challenge was one of the most important competitions in the early computer vision community. Twenty classes of objects are annotated in the VOC dataset. Because larger datasets such as COCO have since been released, VOC now serves mainly as a test bed for new detectors. Compared with VOC, MS-COCO has 80 categories and contains more small objects and more densely located objects. MS-COCO has become the de facto standard for object detection.

| DOTA images
Image resolution in the COCO and VOC datasets is low. To demonstrate our approach better, we take the DOTA dataset for training and test images. The images in DOTA are mainly collected from Google Earth; there are 1411 images in the training set and 937 in the test set. According to the authors, the side-length ranges of objects at the small, medium and large scales are [10, 50], [50, 300] and [300, ∞] pixels, respectively (∞ indicates that there is no upper bound on the side length of an object). Because the resolution of these images varies greatly, we collect 800 test images whose resolutions are more than 1500 pixels on the longer side and choose all 15 categories given by the authors. We crop all training images and the 800 test images into subimages of 608 × 608 pixels during both training and testing as the input of the network.

FIGURE 3 Position of the coarse and fine detection modules in the network. The notation 38 × 38 × 3 means that the feature map after coarse detection has size 38 × 38 and each cell has three bounding boxes. Metric p is a global confidence that is the sum of the three bounding box confidences in every cell

| Implementation details
We implement YOLOv3 as our baseline method with the default parameter settings in the code released by the authors. Like previous standard detectors, we employ ImageNet-pretrained models as the network backbones on both the COCO and DOTA datasets, using horizontal bounding boxes. We crop the training and test images into subimages of 608 × 608 pixels to fit the network. Data augmentation is applied to avoid overfitting by adjusting saturation and hue. Both the baseline and our approach are trained in an end-to-end manner with a batch size of 4 on one GPU. We train 20 and 30 epochs on the COCO and DOTA training sets, respectively, with the learning rate starting from 0.001 and decreased by a factor of 0.1 after every 10 epochs. We set the threshold of the global confidence p to 0.5 in the experiments.
For evaluation, we use the standard metric mAP and the average detection time per image to measure detection accuracy and efficiency. We also report AP_s, AP_m and AP_l on objects of small, medium and large size. To measure model efficiency, we use billions of floating-point operations (BFLOPs). We use nonmaximum suppression to filter out redundant bounding boxes, setting the IoU threshold to 0.5. All experiments are conducted on a K-40 GPU.
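The nonmaximum suppression step mentioned above can be sketched as the standard greedy procedure. This minimal numpy version (with the IoU threshold of 0.5 used in our experiments) is an illustration, not the released code.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it exceeds `thr`, and repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = np.array(
            [j for j in order[1:] if iou(boxes[i], boxes[j]) <= thr],
            dtype=int)
    return keep
```

Because subimages overlap, the same object can be detected in two adjacent tiles; running NMS on the merged, full-image coordinates removes such duplicates as well.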

| Improvement by MS
We compare the baseline method with a variant that improves only the MS prediction module. The results are reported in Table 1. We find that the improved MS prediction is beneficial for detection accuracy, especially for small objects, on both the COCO and DOTA datasets. This indicates that the improvement lets us obtain more useful features.

We then evaluate the five candidate positions of the CM circled in Figure 2. The results in Figure 4 show that as the position of the module deepens, the average detection time per image becomes longer, but the average detection accuracy improves. Deeper networks facilitate feature extraction but also increase computational complexity.
As a compromise, we add the CM at the third down-sampling position. Then, we compare our approach with the original YOLOv3 and the improved YOLOv3 (YOLOv3 + MS + DC) on the same test set in Table 1. The results indicate that our proposed approach outperforms the other two methods.

| Comparison with state-of-the-art
In this section, we evaluate our proposed approach by extensively comparing it against other state-of-the-art methods on the DOTA test set. For a fair comparison, we train and test these methods one by one on the preprocessed DOTA images with default parameters and keep the number of epochs consistent. Like COCO, objects in DOTA are divided into three scales: small, medium and large. In addition to the overall mAP, we evaluate AP_s, AP_m and AP_l on the three scales. The results in Table 2 show that our method achieves 80.3 mAP on the DOTA test set and outperforms state-of-the-art methods. Furthermore, we compare our method with the baseline and YOLT in terms of the number of network parameters and floating-point operations, as reported in Table 3. Our approach requires 6.2 times fewer floating-point operations, has nearly two times fewer parameters and provides better accuracy as well as 3.3 times less detection time than the baseline method.