Integration of gradient guidance and edge enhancement into super-resolution for small object detection in aerial images

Detecting small objects is difficult because of their small size and poor-quality appearance, and these issues are especially pronounced in aerial images, where small objects are of great importance. To address the small object detection (SOD) problem, we use a unified architecture that upsamples small objects into super-resolved versions with characteristics similar to those of large objects, thereby enabling more discriminative detection. For this purpose, a new end-to-end multi-task generative adversarial network (GAN) is proposed. In this architecture, the generator is a super-resolution (SR) network, and the discriminator is a multi-task network. In the generator, a gradient guide and an edge-enhancement strategy are introduced to alleviate structural distortions. In the discriminator, a faster region-based convolutional neural network (FRCNN) is incorporated for the task of object detection. Specifically, the discriminator outputs a distribution rather than a single scalar to measure realness. Each super-resolved image thus passes through the discriminator and yields a realness distribution, classification scores, and bounding-box regression offsets. Furthermore, the losses of the detection task are backpropagated into the generator during training rather than being optimized independently. Extensive experiments on the challenging cars overhead with context (COWC), detectIon in optical remote sensing images (DIOR), vision meets drones (VisDrone), and object detection in aerial images (DOTA) datasets demonstrate the effectiveness of the proposed method in reconstructing structures while generating natural super-resolved images, and show its superiority in detecting small objects over state-of-the-art detectors.

Therefore, we propose a new GAN-like architecture to detect small objects. In the generator, a super-resolution (SR) network is introduced to upsample a small-object image to a larger scale. In the discriminator, an FRCNN is incorporated for the task of object detection. The SR network includes a basic network, a gradient guidance network (GGN), and an edge-enhancement network (EEN). The GGN provides a structural prior that guides the basic network to pay attention to regions with sharpness and structure. The EEN further reduces the dirty edges of the SR image. Subsequently, the discriminator distinguishes real/fake images and detects objects simultaneously. Different from standard GANs [24-27, 30-39], where the realness of a super-resolved sample is estimated by a single scalar (such as 0 (fake) or 1 (real)), our discriminator outputs a distribution as the measure of realness. A single scalar can be viewed as an abstraction or summarization of multiple measures, which together reflect the overall realness of a super-resolved image. Compared to the single scalar, the distribution naturally evaluates a super-resolved image against multiple criteria from more than one angle. More importantly, the detection losses are backpropagated into the generator for an overall optimization, so that the SR network pays more attention to target regions.
In summary, this work makes the following contributions: (1) proposes a novel end-to-end jointly trainable GAN-like model, which offers a multi-tasking paradigm by handling both SR and detection for aerial imagery; (2) introduces two novel strategies to obtain clear edge information by constructing different subnetworks and optimizing each branch with special objectives; (3) introduces a new discriminator that outputs a distribution as the measure of realness, which provides more insight for the generator instead of merely differentiating between fake and real; (4) achieves state-of-the-art performance on representative aerial image datasets.
The remainder of this paper is organized as follows. Section 2 reviews GAN-based SR, SOD, and simultaneous GAN-based SR with object detection. Section 3 provides details of our proposed framework. Section 4 presents the experimental results. Section 5 conducts ablation studies, and Section 6 concludes our work.

RELATED WORK
Our primary focus is on enhancing the detection accuracy of small objects by producing SR versions of the input images. Notable related works are discussed in this section.

GAN-based SR
Super-resolution generative adversarial network (SRGAN) [30] is a landmark study in perception-driven SR that applies a GAN-based framework to traditional SR. Its generator is composed of residual blocks (RBs). Enhanced SRGAN (ESRGAN) [31] introduces a residual-in-residual dense block (RRDB) to replace the original RB. ESRGAN+ [32], a further improvement of ESRGAN, uses a residual-in-residual dense block residual (RRDBR) to improve the generative capacity of ESRGAN. The GAN-based edge-enhancement network (EEGAN) [33] adds an edge-enhancement subnetwork to the GAN-based SR framework to guide the generator. Structure-preserving SR (SPSR) [34] applies gradient maps (GMs) to the GAN-based SR model as structural guidance to reduce geometric distortions. Spatial feature transformation GAN (SFTGAN) [35] utilizes prior category information to address unrealistic textures. SR GAN with ranker (RankSRGAN) [36] introduces a rank-content loss to improve perceptual quality. SR residual convolutional GAN (SRResCGAN) [37] proposes a downsampling GAN that follows real-world degradation settings. Single natural image GAN (SinGAN) [38] introduces a pyramid of fully convolutional GANs that can solve the joint deblurring and SR task. The progressive perception-oriented network (PPON) [39] adopts a progressive upsampling method to create high-quality SR results.

SOD
Small objects occupy only a small part of a large scene. 'Small' here has two meanings: small size and low resolution. In general, small objects carry little information, so few discriminative features can be extracted from them. Common SOD solutions include data augmentation [16], multi-scale feature fusion [12-15, 17, 20, 22], and multi-scale training [18]. Recently, researchers have proposed several tailored sampling strategies for detecting small objects [19, 21, 28, 29]. Furthermore, many other meaningful works have been proposed to detect small objects in aerial images [21-23]. However, these detectors still do not meet the requirements of SOD.
We show that a GAN-based SR method is an effective means to improve the detection performance for small objects.

Simultaneous GAN-based SR with object detection
Since GAN-based SR is beneficial for SOD, there have been several efforts [24-27] to apply it to improve object detection. The edge-enhanced SR GAN with object detector network (EESRGAN) [24] exploits super-resolved images directly to detect objects at low scales. SOD-MTGAN [25] designs a multi-task loss with an SR loss for object proposals. Perceptual GAN [26] narrows the representation difference between small objects and large ones. Joint SR and vehicle detection [27] proposes a multi-scale GAN to create super-resolved versions of the original images. The methods mentioned above improve SOD performance to some extent. Compared with these methods, ours differs in three respects: the distribution outputs, the structure-preserving strategies, and the overall optimization.

METHOD
Previous studies usually use a pre-trained SR network to synthesize a super-resolved version of small objects before conducting SOD. Such methods perform one task without utilizing the advantages of the other. In addition, the SR results often contain structural distortions and blurry edges. To address these problems, we propose a new network that simultaneously generates super-resolved aerial images and locates objects in the super-resolved versions. Our framework is composed of a GAN-based SR network and an FRCNN-based detector, as shown in Figure 1. In our architecture, the generator is an SR network, and the discriminator is a multi-task network. The generator includes an SR branch, a gradient branch, and an edge branch. The discriminator includes an FRCNN-based detector and a discriminator branch. More importantly, the discriminator outputs a distribution rather than a single scalar as the measure of realness, and the classification and regression losses are backpropagated to further guide the generator to produce super-resolved images that are easier to classify and localize.

Distribution GAN
In a standard GAN, the learning process is

\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))],   (1)

where V(G, D) is a value function, p_d is the real data distribution, p_g is the generator's distribution, and 0 and 1 are the two single scalars used as fake/real labels. The single scalar can be viewed as a summarization of multiple measures, which together reflect the overall realness of a sample [40]. Such a measurement may guide the generator from only one perspective.
Here, we utilize a distribution p_realness to replace the single scalar in Equation (1). Given an input sample x, we have

D(x) = p_{realness}(\cdot \mid x), \quad \sum_{u \in \Omega} p_{realness}(u \mid x) = 1,   (2)

where \Omega represents the set of outcomes of p_realness and u represents a potential realness value. Similarly, we introduce A_1 (real) and A_0 (fake), defined on \Omega, as virtual ground-truth distributions representing the realness distributions of real and fake images. Accordingly, the difference between the two scalars is replaced with the Kullback-Leibler (KL) divergence. The learning process thus becomes

\max_G \min_D V(G, D) = \mathbb{E}_{x \sim p_d}[D_{KL}(A_1 \,\|\, D(x))] + \mathbb{E}_{x \sim p_g}[D_{KL}(A_0 \,\|\, D(x))],   (3)

where D_{KL}(\cdot \| \cdot) is the KL divergence. In our implementation, we define a discrete distribution over N outcomes \Omega = \{u_0, u_1, \ldots, u_{N-1}\} to characterize p_realness. Given an input sample x, the discrete p_realness is obtained as

p_{realness}(u_i \mid x) = \frac{\exp(\theta_i(x))}{\sum_{j=0}^{N-1} \exp(\theta_j(x))},   (4)

where \theta = (\theta_0, \theta_1, \ldots, \theta_{N-1}) are the outputs of the discriminator. Similarly, A_1 (real) and A_0 (fake) are defined over \Omega.
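As a concrete illustration, the realness head and the KL objective in Equations (3)-(4) can be sketched in PyTorch as follows. The particular anchor distributions A_1 and A_0 (here simply peaked toward opposite ends of the outcome set) are illustrative assumptions, not the exact settings used in this work; the outcome count N = 30 follows the ablation study.

```python
import torch
import torch.nn.functional as F

N = 30  # number of outcomes u_0..u_{N-1}; the ablation study chooses 30

def realness(logits):
    """Discriminator head: turn N raw scores theta into p_realness (Eq. (4))."""
    return F.softmax(logits, dim=-1)

def kl_realness_loss(logits, anchor):
    """D_KL(anchor || p_realness) averaged over a batch of outputs."""
    log_p = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_p, anchor.expand_as(log_p), reduction="batchmean")

# Virtual ground-truth distributions A_1 (real) and A_0 (fake): here mass is
# shifted toward opposite ends of the outcome set (an illustrative choice).
A1 = F.softmax(torch.linspace(-1.0, 1.0, N), dim=-1)
A0 = F.softmax(torch.linspace(1.0, -1.0, N), dim=-1)

logits_real = torch.randn(4, N)   # D(x) for x ~ p_d
logits_fake = torch.randn(4, N)   # D(x) for x ~ p_g
d_loss = kl_realness_loss(logits_real, A1) + kl_realness_loss(logits_fake, A0)
```

Because the discriminator now outputs N scores instead of one, its feedback to the generator reflects multiple realness criteria simultaneously.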

Gradient branch
Previous SR networks ignore the importance of structural information. In this section, we exploit it. Since the gradient map (GM) reveals the sharpness of each region in an image, we introduce a gradient guidance strategy to obtain GMs. For this purpose, we design a gradient branch to reconstruct the SR GM. The SR gradients can be integrated into the SR branch to provide a structural prior for SR. In addition, the gradients can explicitly highlight the regions where sharpness and structure should receive more attention, guiding high-quality generation. We utilize the difference between two diagonal pixels to calculate the gradient of the input LR image:

GM(I)(v) = I(x + 1, y + 1) - I(x - 1, y - 1),

where GM(\cdot) represents the extraction function used to obtain the GMs and v = (x, y) represents the pixel coordinates in the GM. In practice, the operation to obtain the gradients can be easily implemented by a convolution layer with a fixed kernel [34]. As shown in Figure 2, the gradient branch incorporates several intermediate-level representations from the SR branch. The reason is that the SR branch carries structural information that is pivotal to the recovery of the GM. Hence, we utilize the feature maps from the first and last blocks of the SR branch as a strong prior to accelerate the recovery of the SR GM. A connecting block is used to fuse the features. Then, we feed the fused feature maps to the gradient block to extract higher-level features. Once the SR GMs are obtained, we integrate the GM features produced by the next-to-last layer of the gradient branch into the SR branch, guiding the SR reconstruction via a matrix multiplication operation. Meanwhile, we utilize a 1×1 convolution layer to generate the output GM, as shown in Figure 2.

FIGURE 2 Illustration of our SR network. The brick-red region represents the basic SR process, the gray region represents the gradient map SR process, and the blue region represents the edge-enhancement operation.
The magnitude of the GM can implicitly reflect whether a recovered region is sharp or smooth.
We formulate the gradient loss by minimizing the distance between the GM extracted from the SR image and the GM extracted from the corresponding HR image. With supervision from the gradient domain, the generator creates high-quality images without geometric distortions. We use a pixel-wise loss to achieve this goal:

L_{GM} = \mathbb{E}\left[ \| GM(G(I^{LR})) - GM(I^{HR}) \|_1 \right],   (5)

where G(I^{LR}) represents a super-resolved image and I^{HR} represents the ground-truth image.
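The fixed-kernel convolution that extracts the diagonal-difference GM, together with the gradient loss of Equation (5), can be sketched as follows. The exact 3×3 kernel layout is our assumption based on the description above.

```python
import torch
import torch.nn.functional as F

# 3x3 kernel whose response at v = (x, y) is I(x+1, y+1) - I(x-1, y-1),
# i.e. the difference between two diagonal pixels.
diag_kernel = torch.tensor([[[-1.0, 0.0, 0.0],
                             [ 0.0, 0.0, 0.0],
                             [ 0.0, 0.0, 1.0]]]).unsqueeze(0)  # (1, 1, 3, 3)

def gradient_map(img):
    """img: (B, 1, H, W) grayscale tensor -> absolute diagonal gradient map."""
    g = F.conv2d(img, diag_kernel, padding=1)
    return g.abs()

def gradient_loss(sr, hr):
    """Pixel-wise L1 distance between SR and HR gradient maps (Eq. (5))."""
    return (gradient_map(sr) - gradient_map(hr)).abs().mean()

# A constant image has zero gradient in its interior (borders are affected
# by zero padding).
flat = torch.full((1, 1, 8, 8), 0.5)
gm_flat = gradient_map(flat)
```

Because the kernel is fixed, the operation is differentiable but has no learnable parameters, so the supervision flows straight back into the SR branch.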

SR branch
In this subsection, we use the SR branch to obtain the intermediate SR (ISR) results. The SR branch provides the fundamental SR operation and consists of two parts. The first part is a regular SR network similar to the generator of ESRGAN; compared to ESRGAN, we replace the RRDB with the proposed residual-in-residual residual dense block (RRRDB). The RRDB has a residual-in-residual structure with dense blocks in the main path, as presented in Figure 3(a). We add an additional level of residual learning inside the dense blocks, as presented in Figure 3(b), to augment the network capacity without increasing its complexity. The RRRDB thus combines a multi-level residual network with dense connections, as depicted in Figure 3(c). This new architecture benefits from both feature exploitation and exploration, resulting in images of superior perceptual quality. Since we use 23 RRRDBs in the SR branch, we feed the features from the 5th, 10th, 15th, and 20th blocks into the gradient branch to enhance the GM. The second part of the SR branch fuses the GM features; we fuse the structure information by matrix multiplication. Finally, we use two convolutional layers to reconstruct the final ISR features, as shown in Figure 2.
The generator loss consists of a perceptual loss (L_percep) and a content loss (L_1). The content loss evaluates the 1-norm distance between the super-resolved image G(I^{LR}) and the ground-truth image I^{HR}:

L_1 = \mathbb{E}\left[ \| G(I^{LR}) - I^{HR} \|_1 \right].   (6)

The perceptual loss was proposed in [41] to improve the visual quality of the recovered versions. It minimizes the Euclidean distance between the features of the HR and SR images:

L_{percep} = \mathbb{E}\left[ \| \phi_i(I^{HR}) - \phi_i(G(I^{LR})) \|_2^2 \right],   (7)

where \phi_i denotes the features of the i-th layer of the visual geometry group (VGG) model.

Edge-enhanced branch
The ISR achieves satisfactory results, but its edges are not detailed enough. Edge information is a critical factor in reconstructing the final SR images, so the edge branch aims to further enhance the edges of the ISR image. First, a Laplacian operator [33, 42] is used to extract edges from the ISR. The extracted edge information is then passed through the projection block, RRRDBs, and upsampling blocks. Finally, the enhanced edges are added back to the input image, from which the edges extracted by the Laplacian operator have been subtracted.
To enhance the consistency between the ISR and HR images, we use two loss terms: a consistency loss for the images (L_img_cst) and a consistency loss for the edges (L_edge_cst). L_img_cst applies the Charbonnier penalty function between the ISR and HR images to enforce consistency of image contents, and L_edge_cst applies the Charbonnier penalty between the SR edges and HR edges to enforce consistency of image edges:

L_{img\_cst} = \frac{1}{r^2 w h} \sum_{x=1}^{rw} \sum_{y=1}^{rh} \rho\left( I^{ISR}_{x,y} - I^{HR}_{x,y} \right),   (8)

L_{edge\_cst} = \frac{1}{r^2 w h} \sum_{x=1}^{rw} \sum_{y=1}^{rh} \rho\left( I^{edg\_SR}_{x,y} - I^{edg\_HR}_{x,y} \right),   (9)

where \rho(\cdot) is the Charbonnier penalty function, I^{HR} is the ground-truth image, I^{ISR} is the ISR image, r is the scaling factor, w and h represent the width and height of the LR image, and I^{edg\_HR} and I^{edg\_SR} are the edges of the ground-truth image and the SR image, respectively.
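The Laplacian edge extraction and the two Charbonnier consistency losses of Equations (8)-(9) can be sketched as follows. The 3×3 Laplacian kernel layout and the Charbonnier epsilon (1e-3) are our assumptions.

```python
import torch
import torch.nn.functional as F

# Standard 4-neighbour Laplacian kernel (an assumed layout).
lap_kernel = torch.tensor([[[ 0.0, -1.0,  0.0],
                            [-1.0,  4.0, -1.0],
                            [ 0.0, -1.0,  0.0]]]).unsqueeze(0)  # (1, 1, 3, 3)

def edges(img):
    """img: (B, 1, H, W) -> Laplacian edge map."""
    return F.conv2d(img, lap_kernel, padding=1)

def charbonnier(x, eps=1e-3):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), averaged over pixels."""
    return torch.sqrt(x * x + eps * eps).mean()

def consistency_losses(isr, hr):
    l_img = charbonnier(isr - hr)                  # L_img_cst, Eq. (8)
    l_edge = charbonnier(edges(isr) - edges(hr))   # L_edge_cst, Eq. (9)
    return l_img, l_edge

isr = torch.rand(1, 1, 16, 16)
hr = torch.rand(1, 1, 16, 16)
l_img, l_edge = consistency_losses(isr, hr)
```

Note that the Charbonnier penalty evaluates to eps (not zero) for identical inputs, which keeps its gradient smooth near zero.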

Discriminator
Since the discriminator provides valuable feedback to the generator, we use it to judge whether the SR result is useful for improving SOD accuracy. Taking a super-resolved image as input, the discriminator passes it into two branches, that is, the adversarial branch and the detection branch, so that it outputs a realness distribution, classification scores, and bounding-box regression offsets. The adversarial branch is similar to the discriminator of ESRGAN, which uses the VGG-19 [42] architecture. To achieve more realistic results, we design a new discriminator loss based on the distribution GAN:

L_{adv} = \mathbb{E}_{x \sim p_d}\left[ D_{KL}(A_1 \,\|\, D(x)) \right] + \mathbb{E}_{x \sim p_g}\left[ D_{KL}(A_0 \,\|\, D(x)) \right].   (10)

We use FRCNN [6] with a ResNet-50-FPN backbone as our detection branch. FRCNN is an outstanding detector because it replaces the selective search of earlier R-CNN detectors with a region proposal network (RPN). We add two fully connected (FC) layers behind the last average-pooling layer of the ResNet-50-FPN to classify the categories of detected objects and to regress bounding boxes. The classification loss (L_clc) and regression loss (L_reg) of the FRCNN are given as follows:

L_{clc} = \mathbb{E}\left[ -\log D_{cls}(G(I^{LR})) \right],   (11)

L_{reg} = \mathbb{E}\left[ \mathrm{smooth}_{L_1}\left( D_{reg}(G(I^{LR})) - t \right) \right],   (12)

where D_cls and D_reg are the classifier and regressor of the FRCNN, respectively, smooth_{L_1} is the smooth L1 loss [12], and t represents the ground-truth bounding-box coordinates.
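The detection-branch losses of Equations (11)-(12) are the standard Faster R-CNN head losses, which can be sketched as follows. The number of classes (20 plus background) matches the DIOR setting; the random tensors stand in for per-ROI predictions and targets.

```python
import torch
import torch.nn.functional as F

cls_scores = torch.randn(8, 21)          # per-ROI class logits (20 classes + background)
cls_labels = torch.randint(0, 21, (8,))  # per-ROI ground-truth class indices
box_preds = torch.randn(8, 4)            # per-ROI predicted box offsets
box_targets = torch.randn(8, 4)          # ground-truth offsets t

# Eq. (11): cross-entropy over class scores; Eq. (12): smooth-L1 over offsets.
l_clc = F.cross_entropy(cls_scores, cls_labels)
l_reg = F.smooth_l1_loss(box_preds, box_targets)
```

During training these two losses are not kept inside the discriminator: they are backpropagated through the super-resolved image into the generator, which is what ties SR quality to detection quality.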

Overall objective function
Based on the above analysis, we combine the adversarial loss in Equation (10), classification loss in Equation (11), regression loss in Equation (12), pixel-wise gradient loss in Equation (5), content loss in Equation (6), perceptual loss in Equation (7), and consistency losses in Equations (8) and (9). Our GAN network is trained by optimizing the objective function

L = \alpha L_{adv} + \beta L_{clc} + \gamma L_{reg} + \delta L_{GM} + \epsilon L_1 + k L_{percep} + \lambda (L_{img\_cst} + L_{edge\_cst}),   (13)

where \alpha, \beta, \gamma, \delta, \epsilon, k, and \lambda denote the balancing parameters.
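As a concrete illustration of Equation (13), the weighted combination can be sketched as below. The weight values follow those reported in the implementation section; the loss values themselves are placeholders, and the mapping of weights to terms is our assumption.

```python
# Placeholder loss values for one training step (illustrative only).
losses = {
    "adv": 0.5, "clc": 0.7, "reg": 0.4,   # adversarial, classification, regression
    "gm": 0.2, "l1": 0.3, "percep": 0.6,  # gradient, content, perceptual
    "cst": 0.1,                           # image/edge consistency
}

# Balancing weights (alpha, beta, gamma, delta, epsilon, k, lambda),
# as reported in the implementation section.
weights = {"adv": 0.03, "clc": 0.03, "reg": 0.005,
           "gm": 0.005, "l1": 0.005, "percep": 0.001, "cst": 0.001}

total = sum(weights[name] * value for name, value in losses.items())
```

In the real training loop each entry of `losses` is a differentiable tensor, so backpropagating `total` simultaneously updates the SR, gradient, and edge branches.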

Implementation strategies and training parameters
We use the generator and the discriminator of ESRGAN as our SR branch and adversarial branch, respectively. In our network, we design the RRRDB to replace the RRDB used by the original ESRGAN. We use 23 RRRDBs for the SR branch, 3 RRRDBs for the edge-enhanced branch, and 4 RRRDBs for the gradient branch.
We perform all experiments with a 4× upsampling factor between low- and high-resolution images. To obtain LR images, a bicubic kernel is used to downscale the HR images by a factor of 4. We train our architecture jointly in an end-to-end fashion. During training, we use the Adam optimizer for the generator and the SGD optimizer for the discriminator. The learning rate for Adam is initially set to 0.0001 and reduced by a factor of 10 every 25k iterations with batch size 16. The learning rate for SGD is initially set to 0.01 and reduced by a factor of 10 every 40k iterations with batch size 16. Training is terminated after a maximum of 80k iterations. Our system is implemented in the PyTorch framework and trained on three NVIDIA GTX 1080 Ti GPUs, and the source code will be made publicly available. Moreover, the hyperparameters (\alpha, \beta, \gamma, \delta, \epsilon, k, \lambda) in Equation (13) are set to [0.03, 0.03, 0.005, 0.005, 0.005, 0.001, 0.001]. The hyperparameters are analysed in the ablation study.
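The optimizer and learning-rate schedule described above can be sketched as follows. The tiny stand-in modules are placeholders for the real generator and discriminator, and the use of `StepLR` for the step decay is our assumption.

```python
import torch
import torch.nn as nn

generator = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the SR network
discriminator = nn.Linear(16, 30)          # stand-in for the multi-task discriminator

# Adam for the generator (lr 1e-4, /10 every 25k iterations);
# SGD for the discriminator (lr 1e-2, /10 every 40k iterations).
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-2)

sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=25_000, gamma=0.1)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=40_000, gamma=0.1)
```

In the training loop, `sched_g.step()` and `sched_d.step()` are called once per iteration after the corresponding `optimizer.step()`, and training stops at 80k iterations.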

Training dataset
We evaluate our method on four publicly available benchmark datasets [28, 29, 43-45]: the cars overhead with context (COWC) dataset, the detectIon in optical remote sensing images (DIOR) dataset, the vision meets drones (VisDrone) dataset, and the dataset for object detection in aerial images (DOTA). The COWC dataset was collected by Lawrence Livermore National Laboratory and includes 58,247 annotated images. The size of the labeled cars ranges from 10 × 24 to 20 × 48 pixels. Therefore, the cars in the obtained LR images can be viewed as small objects.
The DIOR dataset was collected by Northwestern Polytechnical University and contains 23,463 images covering 20 object classes. DIOR is a large-scale detection benchmark for remote sensing image processing. The collected images are 800 × 800, and the labeled objects occupy only a few pixels relative to the background and display diverse orientations. The VisDrone dataset was collected by Tianjin University and consists of 10,209 images acquired from various bird's-eye views of streets. The images, captured in a wide range of situations, contain varying numbers of targets.
The DOTA dataset was collected by Wuhan University and contains 2806 images. In designing this dataset, the researchers' idea was that objects of the same category appearing at different sizes and in diverse orientations would effectively test a detector. Therefore, in contrast to other datasets, DOTA contains many oriented objects. It lists 15 categories, with object sizes ranging from 30 × 30 to 1200 × 1200 pixels. We select images with small objects as our testing images.

Validation of SR
We compare our method with other GAN-based SR methods, including ESRGAN [31], EEGAN [33], SFTGAN [35], and PPON [39]. For a fair comparison, we retrain these models on the same datasets. The perceptual index (PI), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) scores are presented in Table 1. A lower PI score indicates higher-quality generated images, and higher PSNR and SSIM scores indicate that the generated data distribution better approaches the real data distribution. Table 1 shows that the PI of SFTGAN on DOTA and VisDrone is better than that of our model, but with worse PSNR and SSIM. The main reason is that SFTGAN utilizes semantic segmentation maps as a categorical prior to guide SR, while our work uses GMs as a structural prior. Thus, SFTGAN generates more visually pleasing textures, while our model generates sharper edges. We also conduct a visual comparison, as shown in Figure 4. The comparison shows that plain bicubic interpolation is unable to recover detailed textures. PPON obtains more natural textures than bicubic interpolation but produces dirty content. ESRGAN and SFTGAN produce less blur but more ambiguous edges. EEGAN produces sharp edges but undesired structural distortions. As expected, our approach significantly improves SR performance with pleasing visual quality. As shown in Figure 5, applying the gradient guidance and edge-enhancement strategies in our method mitigates distortions and artefacts well. Table 2 reports the performance of our SR model trained with and without the gradient and edge branches. The results show that the gradient guide provides more meaningful information for SR and that SR is significantly improved by combining the two branches.

Detection with SR
We run FRCNN to document the object detection performance on LR, HR, other GAN-based SR, and our SR images. Table 3 provides the comparison in terms of AP 50 and F1. The results clearly show that our method achieves the best performance on all datasets. Specifically, our method achieves on average 27.9% higher AP 50 and 38.6% higher F1-score than on the LR images of the COWC dataset. We also compare the detection performance on the LR, HR, and SR images with different detectors, as listed in Table 4. The results indicate a considerable performance gap among these detectors. In addition, in terms of AP 50, our model achieves an average result that is 3.25% higher than EESRGAN and 1.1% higher than Joint-SRVDNet, as presented in Table 5.
To validate that backpropagating the detection loss into the SR network focuses the SR on generating details that are beneficial for detection, we compare the detection performance of SR optimization with and without the detection loss, as listed in Table 6. From Table 6, we see that the AP 50 performance increases by approximately 1.9% on COWC when the detection loss is incorporated. Clearly, this validates the claim that the detection loss promotes the generator to recover finer details for better detection.

FIGURE 6
Detection results on COWC. The first column represents the low-resolution image, and the second column represents the high-resolution image. The third column represents the ground-truth detections, and the fourth column represents the detections on the SR image.

Furthermore, several examples of detection results for small objects are visualized in Figures 6 to 9. Our method successfully finds almost all the objects, even though some are very small, which demonstrates the effectiveness of our detector on the SOD problem. Figure 9 shows that our model is limited on rotated objects. In future work, we will investigate detecting small and rotated objects.

ABLATION STUDY
To achieve the best version of our proposed model, we perform several experiments to assess the impact of the number of outcomes, the balancing parameters of the loss functions, the basic RRRDB component, and the optimization strategy. All experiments are performed on the COWC dataset.

Distribution GAN analysis
We further study the effect of the number of outcomes in the realness distribution p_realness. As shown in Table 7, the SR model produces higher-quality images as the number of outcomes grows. However, increasing the number of outcomes substantially increases training time. We therefore set the number of outcomes to 30.

Hyperparameter analysis
In this section, we analyse the impact of the balancing parameters \alpha, \beta, \gamma, \delta, \epsilon, k, and \lambda. Since there is no rule for choosing their optimum values, we conduct a series of experiments to determine them. We observe that the optimal values lead the SR network to generate realistic-looking images with clear textures and sharp regions. Tables 8 and 9 show the performance of our model when varying these hyperparameters on COWC.

The effectiveness of RRRDB
We use the RRRDB and RRDB as basic components to reconstruct images. We find that the RRRDB further improves the recovered textures, as shown in Figure 10, and Table 10 confirms this improvement quantitatively.

Detection results on the DIOR dataset. The first row represents the low-resolution image, and the second row represents the detection results on the low-resolution image. The third row represents the high-resolution image, and the fourth row represents the detection results on the SR image.

Comparisons using reference-free indexes
To further demonstrate the effectiveness of the proposed SR method on real-world low-resolution samples without simulated downsampling, we conduct additional comparison experiments. Unlike the above setting, in which the LR image is downsampled from the HR image, here we follow real-world degradation settings and directly feed the LR image into the SR network. In addition, because HR ground truth is unavailable, two reference-free image quality evaluation indexes (average gradient (AG) [46] and natural image quality evaluator (NIQE) [47]) are introduced for comparison. The results are reported in Table 11. Note that the larger the AG and the smaller the NIQE, the clearer the image.
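The AG index is the mean magnitude of local horizontal and vertical differences, a common reference-free sharpness measure; a NumPy sketch is given below. The exact normalization (the factor of 2 under the root) follows one common convention and is our assumption.

```python
import numpy as np

def average_gradient(img):
    """img: 2-D grayscale array; a larger AG indicates a sharper image."""
    gx = np.diff(img.astype(np.float64), axis=1)[:-1, :]  # horizontal differences
    gy = np.diff(img.astype(np.float64), axis=0)[:, :-1]  # vertical differences
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

flat = np.full((32, 32), 0.5)                              # uniform image: AG = 0
ramp = np.tile(np.arange(32, dtype=np.float64), (32, 1))   # linear ramp: constant AG
```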
In addition, we use the wavelet transform-based features to compare our method with other GAN-based SR methods. First, we use the wavelet transform to obtain the horizontal component of an image. Then, we use the entropy of the horizontal component to compare the performance of different methods. The comparison results are reported in Table 12. Note that the smaller the entropy value, the clearer the image.
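The wavelet-based comparison above can be sketched with a one-level 2-D Haar transform whose horizontal-detail subband is summarized by its histogram entropy. The Haar implementation, subband convention, and 256-bin histogram are our assumptions.

```python
import numpy as np

def haar_horizontal_detail(img):
    """img: 2-D array with even dimensions -> horizontal-detail subband."""
    img = img.astype(np.float64)
    # High-pass along rows (differences of row pairs) ...
    hi_rows = (img[0::2, :] - img[1::2, :]) / 2.0
    # ... then low-pass along columns (averages of column pairs).
    return (hi_rows[:, 0::2] + hi_rows[:, 1::2]) / 2.0

def subband_entropy(coeffs, bins=256):
    """Shannon entropy of the coefficient histogram, in bits."""
    hist, _ = np.histogram(coeffs, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

flat = np.zeros((16, 16))                                  # no detail: entropy 0
noisy = np.random.default_rng(0).normal(size=(16, 16))     # spread detail: entropy > 0
```

Under this convention, a cleaner edge structure concentrates the detail coefficients into fewer histogram bins, yielding the lower entropy the text associates with clearer images.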

Influence of the classification loss
From Table 13, we see that the AP 50 performance increases by approximately 2.7% when the classification loss is incorporated.

FIGURE 9
Examples of detection results on VisDrone. The first row represents the low-resolution image, and the second row represents the detection results on the low-resolution image. The third row represents the high-resolution image, and the fourth row represents the detection results on the SR image. The results indicate that our method successfully locates small objects.

Clearly, this validates the claim that the classification loss promotes the generator to recover finer details for better classification. In doing so, the discriminator can exploit the fine details to predict the correct category of the region-of-interest (ROI) images.

Influence of the regression loss
As shown in Table 13, the AP 50 performance increases by nearly 1.9% by using the regression loss to train the generator network. Similar to the classification loss, the regression loss drives the generator to recover some fine details for better localization. The increased AP 50 and F1 demonstrate the necessity of regression loss in the generator loss function.

Occluded SOD
To investigate whether our method can detect occluded objects, we conduct a group of experiments, as shown in Figure 11. The experimental results show that our detector can detect partially occluded objects but fails to detect fully occluded objects.

CONCLUSION
We propose a new GAN-based SOD framework that takes LR imagery as input and yields detection results on the corresponding super-resolved version. In the generator, we introduce two strategies to recover detailed information. In the discriminator, we introduce FRCNN for the task of object detection. Specifically, the discriminator outputs a distribution rather than a single scalar as the measure of realness. During training, the losses of the detection task are backpropagated to the generator. The experimental results show that the proposed SR network with FRCNN yields satisfactory results.