A feature-optimized Faster regional convolutional neural network for complex background objects detection

In recent years, convolutional neural networks have played an increasingly important role in the field of object detection. However, the complex background of the detected image, the limited receptive field imposed by the fixed geometry of the convolution kernel, and the positioning and pooling deviation of the region of interest remain important factors that affect detection accuracy. In this paper, an improved object detection algorithm based on the Faster regional convolutional neural network is proposed. In the bounding box positioning phase, an improved interpolation algorithm, Newton's parabolic interpolation, is used instead of bilinear interpolation; after ROI size normalization, a parallel tensor branch is extended to weaken the negative impact of complex backgrounds on the foreground; and in the feature extraction phase, the recently popular CBAM attention mechanism and deformable convolution are introduced. Without bells and whistles, a series of experiments shows that our method achieves higher object detection accuracy on the PASCAL VOC2007, VOC2012, COCO 2014 and DIOR datasets. Hence, the method is effective for real target recognition tasks in complex background environments.


INTRODUCTION
When the performance of the computer is improved enough to process large-scale data, such as images, computer vision has received formal attention and development. Nowadays, computer vision has been widely used in image processing and many other fields [1,2]. Among them, object detection is the core problem of computer vision research, and convolutional neural networks have become mainstream algorithms in the field of object detection in recent years.
In 1980, Yann LeCun first applied the BP algorithm to the training of a neural network structure containing convolutional layers and pooling layers proposed by Kunihiko Fukushima, forming the prototype of contemporary convolutional neural networks, LeNet. However, its performance was not as good as SVM and Boosting algorithms in practical tasks, and training was difficult. It was not until 2012 that AlexNet [1], proposed by Hinton's group, introduced a new deep network structure and used the Dropout algorithm, reducing the error rate from 25% to 15%, which caused a sensation in the field of convolutional neural networks for target recognition. In 2014, Ross Girshick et al. proposed the regional convolutional neural network (R-CNN) [3], which achieved good results in recognition accuracy. The network first used a selective search method to generate a series of candidate regions on each image. Due to the abundance of candidate regions (about 2000 per picture), R-CNN is time-consuming to train and expensive to compute [4]. Limited by the fully connected layer in the feature extraction stage of the CNN, the size of the input image is fixed, which inevitably produces information distortion. Following this, Kaiming He et al. proposed SPP-net [5], which adopts a spatial pyramid pooling algorithm to eliminate the normalization step of the input image size. In 2015, Ross Girshick proposed Fast R-CNN [6].

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
Compared with previous work, a deep convolutional neural network is used, and the output end uses two parallel fully connected layers to replace the original SVM for classification and bounding box regression positioning, respectively, avoiding the independent training of the SVM and the spatial pyramid pooling algorithm and improving training and testing speed while also increasing detection accuracy. However, there are still many defects in the network. The whole network cannot be trained end-to-end and still adopts selective search to find region proposals, which is time-consuming, and the back-propagation algorithm cannot improve the extraction of region proposals. In 2016, Ross Girshick et al. proposed Faster R-CNN [7], an object detection network that remains highly competitive. In recent years, the performance of many detection algorithms in the field of object detection has been improved based on Faster R-CNN [8][9][10][11]. Faster R-CNN introduces a region proposal network (RPN) that shares deep convolutional features with the detection network; the RPN is trained end-to-end to generate high-quality region proposals, which Fast R-CNN uses for detection, thus significantly improving the speed and performance of target detection. The two most popular later end-to-end models are the Single Shot MultiBox Detector (SSD) [12] and You Only Look Once (YOLO) [13]. SSD uses multi-scale feature maps for detection and then establishes the relationship between the prior box, the default box, and the ground truth box. The model can not only meet the needs of real-time detection, but also achieve accuracy similar to Faster R-CNN on PASCAL VOC. Unfortunately, SSD does not fully consider the connections between different multi-scale prediction layers. The lack of semantic information in low-level feature maps makes it difficult to obtain sufficient feature information for small objects, which are susceptible to interference from complex backgrounds. Therefore, SSD shows poor performance on small object detection [14].

IET Image Process. 2021;15:378-392. wileyonlinelibrary.com/iet-ipr
YOLO series algorithms [15][16][17] innovatively integrate the candidate region proposal module and the detection mechanism of the R-CNN series into the same network, which improves detection speed while maintaining high recognition accuracy. Driven by the needs of practical tasks, and because the negative impact of complex backgrounds on foreground object detection remains an important open issue, many scholars have achieved good detection results on large, complex datasets by optimizing Faster R-CNN. Based on the implementation details of the convolutional neural network, some researchers have proposed new solutions and introduced flexible algorithms. Sanghyun Woo et al. proposed CBAM [18], which fine-tunes the extracted features by adjusting the weight information over the spatial and channel dimensions of the convolution, focusing "attention" on the foreground object instead of the background, thereby reducing the background effect and achieving performance better than SE-Net. Jifeng Dai et al. proposed deformable convolutional networks [19,20], which concentrate on target information by learning sampling offsets, breaking the conventional convolution rules. Later, some researchers built on this work to further improve detection accuracy, and scholars are gradually paying attention to the flexibility of convolutional systems [19][20][21][22]. By analysing the structure of Faster R-CNN, this paper proposes an improved method based on Faster R-CNN for complex background object detection. In the feature extraction phase, we use deformable convolutions to replace the last three convolutional layers of VGG16, focusing the sampling on target information, and introduce CBAM for optimization and fine-tuning.
In the ROI pooling layer, Newton's parabolic interpolation algorithm is used for region proposal border positioning to solve the problem of positioning deviation. The ROI region is pooled to 14 × 14 instead of the 7 × 7 size of previous work, which reduces, to a certain extent, the information loss caused by the maximum pooling operation. Our method achieves strong performance on the PASCAL VOC2007, VOC2012, COCO2014 and DIOR datasets, greatly improved compared to Faster R-CNN. At the same time, the algorithm remains strongly competitive with the advanced SSD and YOLO v3 on specific problems. Figure 1 illustrates the overall structure of Faster R-CNN. This object detection system consists of: (i) shared convolution, (ii) region proposal, and (iii) region classification. It uses an efficient positioning method to index into the feature map, which greatly reduces the time consumed by convolution calculations and yields faster detection.

Anchors
We use a 3 × 3 sliding window on the output feature map of the deep convolution layer to perform a traversal convolution from the upper left corner to the lower right corner, obtaining a feature map of the same size that merges the spatial information of the surrounding 3 × 3 positions. Generally, we use 3 scales and 3 aspect ratios (128², 256², 512²; 1:1, 1:2, 2:1), yielding k = 9 anchors at each sliding position. Therefore, the regression layer has 4k outputs encoding the coordinates of the k boxes, and the classification layer outputs 2k scores that estimate the probability of object or not-object for each proposal. For a convolution feature map of size W × H × C, there are a total of W × H × k anchors. The anchor generation algorithm is described in Figure 2. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scale. Two 1 × 1 convolutions are used in parallel to perform channel compression, producing the classification layer and the regression layer simultaneously. The classification layer solves a binary classification problem: it assigns two confidence scores to the k anchors at each sliding position according to the intersection over union (IOU) between the prediction box and the ground truth, which are used to judge whether each anchor belongs to the foreground or the background. The coordinate information of the boxes in the regression layer is a 4D vector (x, y, w, h), where x and y are the centre coordinates of the proposed box, and w and h are its width and height, respectively. The region proposal box is refined to obtain the ROI.
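As a concrete illustration, the anchor enumeration described above can be sketched in plain Python; the scale and ratio values follow the text, while the function names and the base cell size of 16 (the typical VGG16 feature stride) are our assumptions:

```python
from math import sqrt

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) = 9 anchors (x1, y1, x2, y2)
    for one sliding-window position, centred on a base_size cell.
    With base_size 16, the scales give areas of 128^2, 256^2 and 512^2."""
    cx = cy = (base_size - 1) / 2.0
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            side = base_size * scale
            w, h = side / sqrt(ratio), side * sqrt(ratio)
            anchors.append((cx - (w - 1) / 2, cy - (h - 1) / 2,
                            cx + (w - 1) / 2, cy + (h - 1) / 2))
    return anchors

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the k base anchors over every position of a feat_h x feat_w
    feature map, yielding feat_h * feat_w * k anchors in total."""
    base = generate_anchors()
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            dy, dx = i * stride, j * stride
            out.extend((x1 + dx, y1 + dy, x2 + dx, y2 + dy)
                       for (x1, y1, x2, y2) in base)
    return out
```

For a W × H feature map this produces exactly the W × H × k anchors counted in the text.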

2.2 The ROI pooling layer
Without bells and whistles, the aligned ROI pooling layer uses bilinear interpolation to accurately locate the region of interest on the feature map according to the region proposal, and then uses maximum pooling to transform ROIs of differing scale into a fixed-size H × W (e.g., 7 × 7) feature map, where H and W are spatial hyperparameters independent of any particular ROI. The region of interest is a local rectangular window of the feature map containing the main information of the object. It is defined as a four-tuple (l, t, h, w), where (l, t) are the coordinates of the top-left corner of the ROI and (h, w) are its height and width. The pooling operation further extracts the characteristic information of the deep network and reduces the number of parameters, saving computational resources, but it incurs a significant loss of information; this issue is discussed in Section 2.3.2.

Bilinear interpolation
Bilinear interpolation is an excellent algorithm for solving the positioning deviation caused by rounding when the region proposal coordinates are not integers [23]. Given the coordinates and function values of the four points Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2) and Q22 = (x2, y2), we want to estimate the value at point P = (x, y). On the x-axis, we let the linear interpolation of Q11 and Q21 be R1 = (x, y1), and likewise the linear interpolation of Q12 and Q22 be R2 = (x, y2):

f(R1) ≈ ((x2 − x)/(x2 − x1)) f(Q11) + ((x − x1)/(x2 − x1)) f(Q21),
f(R2) ≈ ((x2 − x)/(x2 − x1)) f(Q12) + ((x − x1)/(x2 − x1)) f(Q22).

In the y-axis direction, linear interpolation is applied again, and the value at P is obtained using R1 and R2 as known conditions:

f(P) ≈ ((y2 − y)/(y2 − y1)) f(R1) + ((y − y1)/(y2 − y1)) f(R2).

The purple dotted line in the bottom figure of Figure 3 represents the feature map, whose intersection points represent pixels; the arrows indicate bilinear interpolation, the solid black line represents the region proposal generated by the RPN network, and the 16 black points at the 4 sample points represent pixels. Since the obtained proposal coordinates are decimal values, using the method described above to obtain pixel values at fractional locations is a reasonable way to avoid the positioning deviation introduced by rounding. Essentially, this method is the composition of linear interpolation in two dimensions.
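The two-step interpolation above can be written directly in code; a minimal sketch, assuming a 2D feature map stored as nested lists and a query point strictly inside the grid:

```python
from math import floor

def bilinear_interpolate(feat, x, y):
    """feat: 2D nested list of pixel values; (x, y) a fractional location
    strictly inside the grid. Returns the bilinearly interpolated value."""
    x1, y1 = floor(x), floor(y)
    x2, y2 = x1 + 1, y1 + 1
    # linear interpolation along x at the two integer rows (R1 and R2)
    r1 = feat[y1][x1] * (x2 - x) + feat[y1][x2] * (x - x1)
    r2 = feat[y2][x1] * (x2 - x) + feat[y2][x2] * (x - x1)
    # linear interpolation along y between R1 and R2 gives the value at P
    return r1 * (y2 - y) + r2 * (y - y1)
```

Because x2 − x1 = y2 − y1 = 1 on a pixel grid, the denominators from the formulas above drop out.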

ROI pooling
The region proposals are dissimilar, and the shapes and sizes of the ROIs are not identical. Because of the limitations of the network structure and the need to reduce parameters, we must normalize the size. Max pooling works by dividing the h × w ROI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. Multi-grid pooling loses more information and is also more susceptible to perturbations from complex backgrounds. For example, for an ROI with H × W bins, if there are 4 feature pixels in each bin, the operation discards 3 of them; similarly, if there are 6 feature pixels in each bin, 5 pixels are lost. Maximum pooling is therefore more likely to pick up complex background feature pixels. This is the emphasis of our research in this paper: simply put, the loss of unexpected information is the key consideration.
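A minimal sketch of the bin-wise max pooling just described, which makes the information loss explicit (with 4 pixels per bin, only 1 survives):

```python
def roi_max_pool(roi, out_h, out_w):
    """Divide an h x w ROI (2D nested list) into an out_h x out_w grid of
    bins and keep only the maximum value in each bin; every other pixel in
    the bin is discarded."""
    h, w = len(roi), len(roi[0])
    pooled = []
    for i in range(out_h):
        # integer bin boundaries, bins of approximate size h/out_h x w/out_w
        y0, y1 = i * h // out_h, (i + 1) * h // out_h
        row = []
        for j in range(out_w):
            x0, x1 = j * w // out_w, (j + 1) * w // out_w
            row.append(max(roi[y][x] for y in range(y0, y1)
                                     for x in range(x0, x1)))
        pooled.append(row)
    return pooled
```

Pooling a 4 × 4 ROI to 2 × 2 keeps 4 of the 16 values; the remaining 12 (including any sub-optimal foreground pixels) are lost, which is exactly the deficiency the later sections address.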

OUR APPROACH
On the basis of Faster R-CNN, we make four flexible changes and improvements to increase the accuracy of object detection. In the image feature extraction phase, we adopt deformable convolution, with its flexible geometric transformations, to replace the last three ordinary convolution layers conv5_1, conv5_2 and conv5_3 in VGG16, and then use the attention mechanism to optimize the output features of the previous layer to form a shared feature map. In the ROI pooling phase, we use Newton's parabolic interpolation instead of bilinear interpolation for region proposal frame positioning. At the same time, in order to reduce the loss of elements caused by ROI max pooling, the output size is a bin of 14 × 14 instead of 7 × 7. In parallel, a weight fine-tuning network is applied to the bin to adjust the pixel value of each sampling point in the module, and detection accuracy is improved through the strategies of target information enhancement and complex background attenuation. Extensive experiments validate the performance of our approach. Without bells and whistles, the network can easily be trained end-to-end by standard back-propagation. Figure 4 illustrates the overall structure of our method.

Improved interpolation algorithm
In this paper, we propose a new interpolation method to solve the problem of bounding box positioning deviation, namely Newton's parabolic interpolation. To facilitate a deeper understanding, Figure 5 shows the details of the operation. The pixel values and coordinates of the points {Qij | i = 1, 2, 3; j = 1, 2, 3} are known, and our purpose is to approximate the value at point P (with decimal coordinates) by interpolation.
In the x-axis direction, parabolic interpolation of Q11, Q21 and Q31 gives an approximate pixel value at R1 = (x, y1). With the same operation, the values of R2 = (x, y2) and R3 = (x, y3) are obtained from Q12, Q22, Q32 and Q13, Q23, Q33, respectively. In Newton's form, for j = 1, 2, 3:

f(Rj) ≈ f(Q1j) + f[Q1j, Q2j](x − x1) + f[Q1j, Q2j, Q3j](x − x1)(x − x2),

where f[·,·] and f[·,·,·] denote the first- and second-order divided differences. When the pixel values of R1, R2 and R3 have been acquired, they are used as known conditions and parabolic interpolation is applied again in the y-axis direction:

f(P) ≈ f(R1) + f[R1, R2](y − y1) + f[R1, R2, R3](y − y1)(y − y2).

Compared with bilinear interpolation, our algorithm adds only an insignificant amount of computation while increasing accuracy. It is worth noting that when we select the three points, a nearest-neighbour selection step is added. In this step, the interpolation points are selected by judging the absolute differences between the target point and the nearby points (Figure 6). First, we calculate the absolute differences ξ and ϒ in the 1D direction between Pi and the neighbouring points in a parabolic interpolation operation, and choose one of the two triples (Q1, Q2, Q3) and (Q2, Q3, Q4). For example, for i = 1, corresponding to point P1, we have ξ = |x′ − x2| and ϒ = |x′ − x3|. When the condition ξ < ϒ is triggered, P1 is more relevant to (Q1, Q2, Q3), which is used for the interpolation calculation instead of (Q2, Q3, Q4); when ξ > ϒ is satisfied, P1 is more related to (Q2, Q3, Q4) than to (Q1, Q2, Q3), and the latter triple (Q2, Q3, Q4) is used for the interpolation calculation. The same applies to the case of i = 2. Therefore, having Q1, Q2, Q3 participate in the operation for P1 is the more ideal result, and having Q2, Q3, Q4 participate in the operation for P2 is likewise sound. The neighbouring points with higher pixel correlation to the target point are preferentially selected. This process further avoids the loss of detail in the network, which is very beneficial for detection tasks on complex background images containing multiple targets.
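A minimal sketch of second-order Newton interpolation and the nearest-neighbour triple selection described above; the function names are ours, and the grid is assumed to be one-dimensional for clarity (the 2D case applies the same routine along x and then along y):

```python
def parabolic_interpolate(xs, fs, x):
    """Newton's parabolic (second-order) interpolation through the three
    points (xs[i], fs[i]); returns the interpolated value at x."""
    x0, x1, x2 = xs
    f01 = (fs[1] - fs[0]) / (x1 - x0)      # first divided difference
    f12 = (fs[2] - fs[1]) / (x2 - x1)
    f012 = (f12 - f01) / (x2 - x0)         # second divided difference
    return fs[0] + f01 * (x - x0) + f012 * (x - x0) * (x - x1)

def choose_triple(x, grid):
    """Nearest-neighbour selection step: given four consecutive grid
    coordinates (q1, q2, q3, q4) bracketing x, pick (q1, q2, q3) when
    |x - q2| < |x - q3| (x is closer to the left triple), otherwise
    pick (q2, q3, q4)."""
    q1, q2, q3, q4 = grid
    return (q1, q2, q3) if abs(x - q2) < abs(x - q3) else (q2, q3, q4)
```

For samples taken from f(x) = x², the parabola is reproduced exactly, which is the accuracy gain over piecewise-linear (bilinear) interpolation.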
After improving the ROI Pooling layer, the back-propagation formula is revised at the same time.

Feature adjustment network
In the previous work of the ROI pooling layer, the region of interest is usually transformed to a size of 7 × 7 by the maximum pooling operation, which retains the maximum feature pixel value in each sampling bin.

FIGURE 7 Detailed schematic diagram of the feature adjustment network

Since it is generally accepted that the maximum pixel best expresses the main information in the bin, pixels of other values, such as the sub-optimal value, are ignored; the information loss and precision decline caused by this practice can be ignored in image processing tasks with large targets and simple backgrounds. However, for visual tasks with small objects, multiple targets and complex backgrounds, complex background pixels are more likely to be acquired during maximum pooling, thus inhibiting the collection of foreground information. This negative loss is gradually amplified until it exceeds acceptable limits and affects the final detection accuracy. In order to solve the above problems, we first transform the ROI to a size of 14 × 14 using maximum pooling instead of the original 7 × 7 (VGG16 is used as the deep convolutional network, and the feature map depth is unified to 512), which initially reduces the loss of one pooling step. At the same time, we connect a feature adjustment network in parallel to output a modified feature map with spatial adjustment capability and the same resolution as the original. We multiply the modified map and the original map to combine their spatial information. A maximum pooling operation with a size of 2 × 2 and a stride of 2 is then used to obtain a 7 × 7 × 512 feature map for the subsequent fully connected layers for classification and bounding box regression. Figure 7 shows the schematic of our proposed feature adjustment network, which is described in detail below. As shown in Figure 7, we first apply maximum pooling along the channel dimension of the ROI to obtain an intermediate feature map F containing the main information of the region of interest, with a size of 14 × 14 × 1 and a total of 196 elements.
Then we traverse F from the upper left corner to the lower right corner, flattening it as the input of the subsequent fully connected layers. There are 196 input neurons, 196 × r hidden-layer neurons (r is a proportional parameter, e.g. r = 0.5) and 196 output neurons, together constituting two fully connected layers. The purpose of the proportional parameter r is to reduce the network's computational parameters and increase the nonlinear fitting capacity of the model; it can be set freely according to the actual situation. The output of the fully connected layers is reshaped into a map F′ with the same resolution as the ROI. This map is rich in the spatial adjustment parameters of the ROI. F′ then passes through a sigmoid function to limit the parameter range. Finally, a very small parameter factor is obtained for the complex background in the ROI through a multiplication operation; conversely, the foreground object receives a large parameter factor, thus achieving the effect of weakening background disturbance. The network as a whole is lightweight and can be trained easily.
The calculation process can be simply expressed as:

y(i, j, k) = ω(i, j) · x(i, j, k),

where (i, j) are the coordinates of the pixel grid, k is the channel index of the pixel, ω is the parameter value at the corresponding position in the feature map F′, and x and y are the pixel values of the corresponding point in the ROI and in the adjusted feature map, respectively. A series of ablation experiments proves the effectiveness of our method. The main results of the experiments are presented in Section 4.
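The whole branch can be sketched in plain Python under stated assumptions: the paper does not specify the hidden-layer activation, so we assume ReLU, and the weight shapes and function names are illustrative (the `size` parameter stands in for the 14 × 14 ROI so the sketch stays small):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(m, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def feature_adjust(roi, w1, w2, size=14):
    """Sketch of the feature adjustment branch.
    roi: size x size x C nested lists (14 x 14 x 512 in the paper).
    w1:  (size*size*r) rows of size*size weights  (196 -> 98 for r = 0.5).
    w2:  (size*size)   rows of size*size*r weights (98 -> 196)."""
    # channel-wise max pool -> flattened intermediate map F (196 values)
    f = [max(roi[i][j]) for i in range(size) for j in range(size)]
    hidden = [max(h, 0.0) for h in matvec(w1, f)]        # FC + assumed ReLU
    weights = [sigmoid(o) for o in matvec(w2, hidden)]   # F', each in (0, 1)
    # multiply: every channel at (i, j) is scaled by the factor in F'
    return [[[c * weights[i * size + j] for c in roi[i][j]]
             for j in range(size)] for i in range(size)]
```

Because the sigmoid bounds every factor in (0, 1), background positions can only be attenuated, never amplified, which matches the intended weighting behaviour.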

Deformable convolution networks
In the field of target recognition, deep convolutional networks are usually used to extract the features of the input image and have achieved acceptable results. However, in complex object recognition tasks, the fixed geometric structure of the traditional convolution model produces indiscriminate sampling, which is not what we expect; deformable convolution overcomes this problem through concentrated sampling. Feature maps and the convolutional layers in the network live in a 3D spatial domain, but the specific convolution calculation is performed in 2D. Without loss of generality, we analyse this module in 2D space for clarity; the analysis extends concisely to 3D. Convolution in 2D space generally includes two steps: (i) using a regular grid convolution kernel R to sample pixel values on the input feature map x, and (ii) summing the products of the sampled values and the weights at the corresponding positions. The spatial positions of the grid convolution kernel R are expressed as:

R = {(−1, −1), (−1, 0), …, (0, 1), (1, 1)},

which defines a 3 × 3 kernel. Let the output feature map be y, and let p0 be any location on it; then the following equation holds:

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn),

where pn enumerates the locations in the grid R, such as (−1, −1) and (−1, 0). In the deformable convolution model, an offset parameter {Δpn | n = 1, 2, 3, …, N}, where N = |R|, is attached to each grid kernel position, and the equation becomes:

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn + Δpn).

Because the offset coordinates are usually not integers, we use Newton's parabolic interpolation to obtain the pixel value at the fractional coordinates. The process is shown graphically in Figure 8. The offsets are represented by an intermediate offset field output by a conventional convolution layer. The intermediate offset field is a feature map whose spatial resolution is consistent with the input feature map and whose number of channels is 2N, where N corresponds to the receptive field size of a single convolution kernel in the convolutional layer (for example, for a 3 × 3 convolution kernel, N = 9). Each convolution sampling point is associated with two offset parameters (x, y). It is important to note that the convolution kernel and the offsets in the convolutional layer are trained simultaneously; the model learns the offsets by back-propagating gradients through the Newton's parabolic interpolation algorithm, without additional supervision.

FIGURE 9 Comprehensive description of CBAM, where ⊗ denotes element-wise multiplication
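The two sampling equations can be sketched as follows. For brevity this sketch samples the fractional positions with bilinear interpolation, as in the original deformable convolution [19], whereas the paper substitutes Newton's parabolic interpolation; the function names are ours:

```python
def bilinear_sample(x, py, px):
    """Sample 2D nested list x at fractional (py, px), bilinearly."""
    h, w = len(x), len(x[0])
    y0, x0 = int(py), int(px)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ay, ax = py - y0, px - x0
    return ((1 - ay) * ((1 - ax) * x[y0][x0] + ax * x[y0][x1])
            + ay * ((1 - ax) * x[y1][x0] + ax * x[y1][x1]))

def deform_conv_at(x, w, p0, offsets):
    """One output value of a 3x3 deformable convolution at location p0:
    y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n),
    where offsets[n] = (dy, dx) are the learned fractional displacements."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # the grid R
    return sum(w[n] * bilinear_sample(x, p0[0] + dy + offsets[n][0],
                                         p0[1] + dx + offsets[n][1])
               for n, (dy, dx) in enumerate(grid))
```

With all offsets set to zero the expression reduces term-by-term to the standard convolution y(p0) = Σ w(pn) · x(p0 + pn).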

Convolutional block attention module
CBAM is an attention mechanism module that combines the spatial and channel dimensions. Assume an input feature map F of size H × W × C. The model sequentially derives from F a channel attention map Mc of size 1 × 1 × C and a spatial attention map Ms of size H × W × 1. We show the overall structure of the module in Figure 9, and briefly explain the details of the spatial and channel modules in Figure 10. We first use maximum pooling and average pooling over the spatial dimensions to obtain two different descriptors Fcmax and Fcavg, which aggregate the main spatial information. Both descriptors are forwarded through a shared multi-layer perceptron (MLP) to produce the desired channel attention map Mc. The size of the hidden layer in the MLP is set to 1 × 1 × C/2 in this article to improve the nonlinear fitting ability of the model. After the shared network, element-wise summation merges the two output feature vectors. In general, the calculation process is as follows:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(Fcavg)) + W1(W0(Fcmax))),

where σ is the sigmoid activation function, and W0 and W1 denote the weights of the first and second fully connected layers in the MLP, respectively. One obvious difference between the spatial attention module and the channel attention module is that the former pays more attention to "where" the object information lies, which is a complement to channel attention. We use max pooling and average pooling along the channel axis to obtain two feature maps Fsmax and Fsavg of size H × W × 1, and concatenate them into a new feature map that serves as the input of a convolution layer. Research results show that applying pooling operations along the channel axis can highlight information-rich areas [24]. Following the previous step, we employ a convolutional layer to squeeze the feature dimensions and finally generate the spatial attention map Ms (H × W × 1).
In general, the calculation process is as follows:

Ms(F) = σ(f7×7([Fsavg; Fsmax])),

where σ is the sigmoid activation function and f7×7 denotes a convolution operation with a filter size of 7 × 7.
The overall process of CBAM can be briefly summarized as:

F′ = Mc(F) ⊗ F,  F″ = Ms(F′) ⊗ F′,

where ⊗ denotes element-wise multiplication. We add the CBAM module after the deformable convolutional layers in VGG16 to focus the network's attention on target information along both dimensions, enriching the discriminative ability of the network. The deformable convolution and the attention module are perfectly compatible; they can easily be combined and trained with standard back-propagation.
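A plain-Python sketch of the two attention maps and their sequential application, with illustrative weight shapes (a C/2 hidden layer with an assumed ReLU, and a 7 × 7 spatial kernel with zero padding); this is a sketch of the CBAM computation, not the trained module:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(f, w0, w1):
    """Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), shared MLP.
    f: H x W x C nested lists; w0: C/2 rows of C weights; w1: C rows of C/2."""
    c = len(f[0][0])
    pixels = [f[i][j] for i in range(len(f)) for j in range(len(f[0]))]
    f_avg = [sum(p[k] for p in pixels) / len(pixels) for k in range(c)]
    f_max = [max(p[k] for p in pixels) for k in range(c)]
    def mlp(v):
        hidden = [max(sum(wi * vi for wi, vi in zip(row, v)), 0.0) for row in w0]
        return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w1]
    return [sigmoid(a + b) for a, b in zip(mlp(f_avg), mlp(f_max))]

def spatial_attention(f, kernel):
    """Ms(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)] along channels)).
    kernel: 7 x 7 x 2 nested lists; zero padding outside the map."""
    h, w, c = len(f), len(f[0]), len(f[0][0])
    avg = [[sum(f[i][j]) / c for j in range(w)] for i in range(h)]
    mx = [[max(f[i][j]) for j in range(w)] for i in range(h)]
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            s = 0.0
            for di in range(-3, 4):
                for dj in range(-3, 4):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        s += kernel[di + 3][dj + 3][0] * avg[y][x]
                        s += kernel[di + 3][dj + 3][1] * mx[y][x]
            row.append(sigmoid(s))
        out.append(row)
    return out

def cbam(f, w0, w1, kernel):
    """F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    mc = channel_attention(f, w0, w1)
    f1 = [[[v * mc[k] for k, v in enumerate(f[i][j])]
           for j in range(len(f[0]))] for i in range(len(f))]
    ms = spatial_attention(f1, kernel)
    return [[[v * ms[i][j] for v in f1[i][j]]
             for j in range(len(f1[0]))] for i in range(len(f1))]
```

Note that the channel step rescales whole channels and the spatial step rescales whole positions, matching the "what" and "where" division of labour described in the text.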

Loss function
Faster R-CNN can be seen as a combination of the RPN and Fast R-CNN. During the training and testing phases, the network loss mainly includes the classification loss and the bounding box regression loss, and these two losses are further divided into the loss of the RPN and the loss of Fast R-CNN according to where they arise. The RPN and Fast R-CNN have similar loss functions; the only difference is that the former uses a two-class cross-entropy loss in the classification process, while the latter uses a multi-class cross-entropy loss. For clarity, we focus on the RPN loss in this section. Among the large number of anchor boxes, if the IOU of a box with the ground-truth box is higher than 0.7, a positive label is assigned; if the IOU is lower than 0.3, a negative label is attached; the remaining anchor boxes are left unlabelled. Note that non-positive samples do not contribute to the bounding box regression loss and do not participate in its calculation. The loss function is as follows:

L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, pi*) + λ (1/Nreg) Σi pi* Lreg(ti, ti*),

where i is the index of an anchor in the mini-batch, and pi is the predicted probability that anchor i contains an object. If the anchor is a positive sample, then pi* = 1; if it is a negative sample, then pi* = 0. ti is a vector representing the coordinate offsets between the anchor box and the prediction box, ti* is related to the offsets between the anchor box and the ground-truth box, and each contains 4 parameters measuring the degree of coordinate change. Lcls is a binary classification log loss over target versus non-target. R in the Lreg formula, with Lreg(ti, ti*) = R(ti − ti*), is the smooth L1 robust loss function.
The detailed expressions are as follows:

tx = (x − xa)/wa,  ty = (y − ya)/ha,  tw = log(w/wa),  th = log(h/ha),

and we have:

tx* = (x* − xa)/wa,  ty* = (y* − ya)/ha,  tw* = log(w*/wa),  th* = log(h*/ha).

Among them, the four-tuple (x, y, w, h) describes the prediction box obtained after correcting the anchor box, representing the horizontal and vertical coordinates of the centre point and the width and height of the border. The four-tuple (xa, ya, wa, ha) relates to the anchor box, and the four-tuple (x*, y*, w*, h*) relates to the ground-truth box.
The smooth L1 function is defined as:

smoothL1(x) = 0.5 σ² x²  if |x| < 1/σ²,  and  |x| − 0.5/σ²  otherwise.

In the actual training process, we set λ = 10 by default and assign the value 3 to the constant σ in smooth L1. Ncls is equal to the mini-batch size of the network (for example, Ncls = 256), and Nreg is the number of anchor positions (i.e. Nreg ≈ 2400). In the loss function, the two terms are normalized by Ncls and Nreg and weighted by the balancing parameter λ. After these operations, the classification and regression losses have approximately equal weight.
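The piecewise definition with the scaling constant σ can be sketched directly:

```python
def smooth_l1(x, sigma=3.0):
    """Smooth L1 loss with the in-network scaling constant sigma:
      0.5 * sigma^2 * x^2        if |x| < 1 / sigma^2
      |x| - 0.5 / sigma^2        otherwise
    The two branches meet at |x| = 1/sigma^2, so the loss is continuous."""
    s2 = sigma * sigma
    if abs(x) < 1.0 / s2:
        return 0.5 * s2 * x * x
    return abs(x) - 0.5 / s2
```

With sigma = 1 this reduces to the familiar smooth L1 of Fast R-CNN (quadratic inside |x| < 1, linear outside); sigma = 3 narrows the quadratic region, making the loss behave like L1 for all but very small regression errors.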

Implementation details and network parameters
We train and test the network on multi-scale pictures to meet the needs of actual detection tasks, limiting the input picture to a size range instead of a fixed size, and use image flipping to augment the dataset. The longest side is limited to l = 1000 and the shortest side to s = 600. When a training image exceeds the size limit, proportional scaling is used to meet the input requirements. We performed ablation experiments on different datasets, for example PASCAL VOC 2007, 2012 and COCO2014, to verify the accuracy improvement brought by our changes. We adopt a VGG16 model pre-trained on the ImageNet dataset as the basic deep convolutional feature extraction network. The RPN generates about 2000 region proposals per image during the training phase and 300 during the test phase, where IOUs greater than 0.7 are considered positive samples and IOUs below 0.3 are considered negative samples (except on the COCO2014 dataset, where IOU thresholds between 0.5 and 0.95 are used). Because the same object corresponds to multiple bounding boxes after the fully connected layer's bounding box regression, the threshold of the non-maximum suppression algorithm used to remove redundant bounding boxes is set to 0.7. We use a zero-mean Gaussian distribution with a standard deviation of 0.01 to initialize the weights of all layers. The initial learning rate α is set to 0.001; after 40K iterations on PASCAL VOC, the learning rate is dropped to one-tenth of its previous value, and on COCO2014 this is adjusted to 80K iterations. The momentum of mini-batch gradient descent is set to 0.9 with a weight decay of 0.0005. Dropout, used to prevent overfitting, is set to a random culling probability of 0.5.
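The proportional scaling rule above can be sketched as follows (the function name and rounding choice are ours):

```python
def scale_image(h, w, min_side=600, max_side=1000):
    """Proportional rescaling: bring the shortest side up/down to min_side,
    unless that would push the longest side past max_side, in which case
    the longest side is capped at max_side instead. Aspect ratio is kept."""
    scale = min_side / min(h, w)
    if max(h, w) * scale > max_side:
        scale = max_side / max(h, w)
    return round(h * scale), round(w * scale)
```

For example, a 500 × 1200 image cannot reach a 600-pixel short side without its long side exceeding 1000, so the long side is capped instead.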
After feature extraction by the first ten convolutional layers of the deep network (VGG16), the information is sampled intensively by the three deformable convolutional layers. Finally, the CBAM module fine-tunes the spatial and channel dimensions to form the shared feature map. The shared feature map is used as the input of the RPN while sharing the network parameters. The RPN generates a large number of region proposals and crops the shared feature map to obtain the ROIs. After the size normalization and feature adjustment operations, the fully connected layers adopt NMS to perform bounding box regression and object classification.

Experiments on PASCAL VOC
We evaluate our method through comparative experiments on the PASCAL VOC2007 and PASCAL VOC2012 datasets, each containing 20 object categories. PASCAL VOC2007 is composed of approximately 5K trainval images and 5K test images. We used the 5K trainval images for 60K iterations of training and report the object detection results on the standard VOC2007 test set. We performed similar training on the PASCAL VOC2012 dataset with its 20 detection categories, and also provide detection results on the VOC2012 test set. To further demonstrate the advantages of the improved algorithm, we applied joint training: training on the union of VOC2007 and VOC2012 and testing on the respective test sets of the two datasets. The results of a series of ablation experiments are presented in the following section. The accuracy of object detection is measured by the mean average precision (mAP), a well-known performance index in target recognition tasks. We note that the standard Faster R-CNN in [7] achieves a strong mAP with VGG-16 (as shown in Table 1). As shown in Table 1, we performed a large number of comparison experiments between Faster R-CNN and our network on the PASCAL VOC 2007 test set. Our experiments use the VGG-16 model. The standard Faster R-CNN, trained on VOC2007 and on VOC2007 + VOC2012, reaches mAPs of 69.9% and 73.2% on the VOC2007 test set, respectively; we use these as reference baselines. In contrast, when using PASCAL VOC2007 as the training set, we first tested deformable convolution alone, using only three deformable convolution layers to replace conv5_1, conv5_2 and conv5_3 in the backbone (VGG-16). The mAP of the network is 0.2% higher than Faster R-CNN (70.1% vs. 69.9%).
This comparison empirically verifies the importance of the geometric transformation ability of the convolution kernel. A similar experimental phenomenon was reported in ref. [19]. For a fair comparison, we also used PASCAL VOC2007 + PASCAL VOC2012 as the training set and carried out the detection task on the PASCAL VOC2007 test set, reaching a higher mAP than Faster R-CNN (also see Table 1, 73.5% vs. 73.2%).

FIGURE 11 Schematic effect of the attention mechanism on the shared feature map
Next, we disentangle the influence of CBAM on network detection accuracy. We make only a few changes to the original Faster R-CNN architecture: CBAM is embedded as a feature fine-tuner between VGG-16 and the shared feature maps and shares parameters. After training on the two training sets, the network reached 70.0% and 73.7% mAP on the VOC2007 test set, respectively, a better recognition accuracy than Faster R-CNN (Table 1). This result shows that CBAM, by fine-tuning in the spatial and channel dimensions, makes the network pay more attention to the target itself rather than the background. To observe the effect of this strategy more intuitively, we further visualize the shared feature map and show some experimental results in Figure 11. A series of controlled experiments indicated that the proposed adjustment network can successfully extract the spatial variation factor of the pixel values at each feature sampling point. In fact, the map containing spatial adjustment information in this module is similar to a weight tensor, and the final multiplication step is equivalent to a weighting operation: the pixels of the background sampling points that interfere with the object gradually move closer to the general background pixels.
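To make the weighting interpretation above concrete, the following is a heavily simplified sketch of CBAM-style channel and spatial attention. The learned shared MLP and the 7 × 7 convolution of the real CBAM module are deliberately omitted here, so this only illustrates the pool-gate-multiply pattern, not the trained module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Simplified channel attention on a (C, H, W) map: squeeze the spatial
    dims by average and max pooling, then gate each channel with a sigmoid.
    (CBAM's learned shared MLP between pooling and gating is omitted.)"""
    avg = feat.mean(axis=(1, 2))          # (C,)
    mx = feat.max(axis=(1, 2))            # (C,)
    w = sigmoid(avg + mx)                 # per-channel weights in (0, 1)
    return feat * w[:, None, None]

def spatial_attention(feat):
    """Simplified spatial attention: pool across channels, then gate each
    spatial position. (CBAM's 7x7 conv over the pooled maps is omitted.)"""
    avg = feat.mean(axis=0)               # (H, W)
    mx = feat.max(axis=0)                 # (H, W)
    w = sigmoid(avg + mx)                 # per-position weights in (0, 1)
    return feat * w[None, :, :]

def cbam(feat):
    """CBAM applies channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(feat))
```

Because both gates are sigmoids, the output is always an elementwise reweighting of the input feature map, which is exactly the "weight tensor" behaviour described above.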
Following the experiments introduced above, our fused method achieved the highest mAP when trained on the PASCAL VOC2007 trainval set and evaluated on the PASCAL VOC2007 test set. The mAP improves from 69.9% (standard Faster R-CNN) to 70.6% (our system). We further train the fusion network on the union of the PASCAL VOC2007 trainval and 2012 trainval sets, where its mAP is 74.0%. Similar improvements are observed on the PASCAL VOC2012 test set: with the single VOC2007 trainval training set and the combined VOC2007 trainval + VOC2012 trainval training set, our improvement strategy raises mAP by 0.8% and 0.7%, respectively, compared with the original Faster R-CNN. Table 2 shows the experimental comparison in detail, and Figure 12 shows some detection results on the PASCAL VOC2007 test set.
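The mAP figures above follow the PASCAL VOC protocol. For reference, the following is a minimal sketch of the 11-point interpolated average precision used by the VOC2007 evaluation; mAP is then simply the mean of this AP over the 20 categories.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP (PASCAL VOC2007 metric): at each recall
    threshold t in {0.0, 0.1, ..., 1.0}, take the maximum precision
    attained at recall >= t, then average the 11 sampled values."""
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:
        p = max((p for r, p in zip(recalls, precisions) if r >= t),
                default=0.0)
        ap += p / 11.0
    return ap
```

A perfect detector (precision 1.0 at recall 1.0) scores an AP of 1.0 under this metric.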

On the impact of ROI output size
In this section we discuss the influence of different ROI output sizes on the accuracy of the entire network. Table 3 shows the main experimental results. The studies show that the ROI output size has a positive effect on system performance: the mAP at k = 14 is 0.5% higher than at k = 7 (70.4% vs. 69.9%), and the minimum mAP, at k = 1, is 63.2%. The analysis suggests that increasing the ROI output size reduces the loss of detail caused by maximum pooling, but this operation significantly increases the number of parameters in the subsequent fully connected layer, thereby reducing network execution efficiency. To solve this problem, a maximum pooling operation with a size of 2 × 2 and a stride of 2 is applied to the 14 × 14 feature map after feature adjustment in the resized ROI, producing a 7 × 7 output. On the whole, we add no extra computation in the fully connected layer, apart from a few additional parameters in the feature adjustment network.
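The 14 × 14 to 7 × 7 reduction above is ordinary 2 × 2 max pooling with stride 2; a minimal sketch on a nested-list feature map (one channel, even height and width assumed):

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2: each output cell is the maximum of a
    non-overlapping 2x2 window, shrinking a k x k map to (k/2) x (k/2),
    e.g. the 14 x 14 adjusted ROI features down to the usual 7 x 7."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]
```

Because the fully connected layer still receives a 7 × 7 map, its parameter count is unchanged relative to the baseline.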

On the impact of the number of convolutional layers inside the DCN
As introduced earlier, the deformable convolution obtains the offsets through an ordinary convolution layer and then combines them with the original sampling point coordinates, achieving good results. However, we questioned whether a single convolution layer can fully extract the offsets. Table 4 shows our detection results on the VOC2007 test set using PASCAL VOC2007 trainval as the training set. Compared with a single layer, the mAP of the network with two ordinary convolution layers decreases by 0.1% to 70.0%, and the mAP with three convolution layers is 70.1%, the same as with one layer. This result indicates that the learning capacity of a single ordinary convolution layer for the offsets has not reached saturation, so the offset features can already be fully extracted. Applying more internal layers does not significantly improve the detection accuracy of the system but introduces additional parameters, which is not what we desire.
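To clarify how the learned offsets enter the computation, the sketch below gathers the nine 3 × 3 sampling values of a deformable convolution at fractional, offset positions via bilinear interpolation. The offset-predicting convolution itself is omitted; offsets are passed in as given, and coordinates are assumed to stay inside the map.

```python
import math

def bilinear_sample(fmap, y, x):
    """Bilinear interpolation at a fractional location (y, x) of a 2-D map;
    deformable convolution reads features at offset points this way."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = max(int(math.floor(y)), 0), max(int(math.floor(x)), 0)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deformable_sample(fmap, py, px, offsets):
    """Gather the 3x3 deformable sampling values around (py, px): each of
    the nine regular grid points is shifted by its learned (dy, dx)."""
    grid = [(-1, -1), (-1, 0), (-1, 1),
            (0, -1),  (0, 0),  (0, 1),
            (1, -1),  (1, 0),  (1, 1)]
    return [bilinear_sample(fmap, py + gy + dy, px + gx + dx)
            for (gy, gx), (dy, dx) in zip(grid, offsets)]
```

With all offsets at zero, this degenerates to the ordinary 3 × 3 sampling grid, which is why the deformable layer is a strict generalization of the regular convolution.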

Experiments on MS COCO2014
To demonstrate the general validity of our method, we also report results on the MS COCO2014 dataset. The MS COCO2014 dataset contains about 160 K images of 80 object categories: the train set has about 80 K images, the validation set has 40 K images, and the remaining 40 K images form the test set. For this dataset, we made small changes to the system parameters. The network was trained for 240 K iterations with a learning rate of 0.003, after which the learning rate was dropped to one tenth of the original (0.0003) for another 80 K iterations; the batch size is set to 128. We amended the learning rate (using 0.003 instead of the original 0.001 as the initial learning rate) because the batch size changed; the larger learning rate makes it possible to learn the network weights efficiently and improves the convergence rate. On the other hand, the negative samples were previously defined with reference to IOUs in the interval (0.1, 0.5); we now modify the lower bound from 0.1 to 0.0, giving (0.0, 0.5). On closer analysis, the negative samples defined by the interval (0.1, 0.5) are naturally used for network fine-tuning; as the structure of Faster R-CNN itself omits the SVM step, the negative samples in (0.0, 0.1) would never be accessed, whereas samples in (0.0, 0.5) can all be used as background. Moreover, the samples in (0.0, 0.1) increase the mAP on the MS COCO dataset, although this improvement is negligible on VOC; the rest of the settings are the same as for PASCAL VOC by default. We use the average mAP@(0.5, 0.95) over various IOUs (IoU ∈ [0.5:0.05:0.95]) and mAP@0.5 at IOU = 0.5 to evaluate system performance, where the former is COCO's standard metric and the latter is the PASCAL VOC metric.
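The negative-sample rule above can be sketched as follows: a proposal counts as background when its best IoU against all ground-truth boxes falls in the given interval. Whether the interval endpoints are open or closed is an implementation detail; a half-open `[lo, hi)` interval is used here for illustration, with boxes as `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_negatives(proposals, gt_boxes, lo=0.0, hi=0.5):
    """Keep proposals whose best IoU with any ground truth lies in [lo, hi);
    lowering lo from 0.1 to 0.0 admits the near-zero-overlap proposals
    that the original Faster R-CNN interval discarded."""
    negatives = []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if lo <= best < hi:
            negatives.append(p)
    return negatives
```

With `lo=0.0`, a proposal that barely touches any ground truth still contributes as a background sample; with the original `lo=0.1` it would be ignored entirely.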
Noting that detection on the MS COCO dataset must cope with complex backgrounds, small objects and multiple targets at the same time, we expect our improved method to outperform Faster R-CNN. The following experiments confirm this hypothesis.
In Table 5, we study the positive effect of each module on network performance and finally present the recognition results of our complete system. Each ablation experiment corresponds to two training sets (COCO train and trainval) and is evaluated on the test set. The detection accuracy of the standard Faster R-CNN is reported first in Table 5. The system in which deformable convolution replaces the corresponding layers is denoted "Faster R-CNN + DCN"; it reaches 42.2% mAP@0.5 and 21.5% mAP@(0.5, 0.95) on the COCO test set, 0.1% higher than the baseline for mAP@0.5 and with no obvious improvement for mAP@(0.5, 0.95). This phenomenon indicates that the improvement available from the deformable convolution alone is limited, and that it saturates more easily on complex tasks. Among the individual modules, the feature adjustment network alone brought the highest accuracy improvement, 0.6% for mAP@0.5 and 0.3% for mAP@(0.5, 0.95), followed by the network using CBAM. These results indicate that the feature adjustment network is well suited to multi-target detection tasks with complex backgrounds: the network branch extracts the spatial change information of the ROI well and, similar to a weighting operation, successfully weakens the influence of unrelated pixels. We then combined these modules and used Newton's parabolic interpolation algorithm to achieve the desired result. Trained on the COCO train set, our method achieved 43.2% mAP@0.5 and 22.2% mAP@(0.5, 0.95); trained on trainval, it reached 43.9% mAP@0.5 and 22.6% mAP@(0.5, 0.95) (see Table 5). We also evaluated the system after removing the feature adjustment network: as shown in Table 5, with the COCO train set, the network after removal is 0.5% lower than our full method for mAP@0.5 and 0.6% lower for mAP@(0.5, 0.95).
In addition, when using the COCO trainval set, we observed a similar decline, which directly illustrates the importance of the feature adjustment network. These experimental results fully indicate that the algorithm has a better detection effect on objects with complex backgrounds. Figure 13 shows test samples on the COCO2014 test set.
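For reference, Newton's parabolic interpolation mentioned above fits a quadratic through three samples using divided differences. The one-dimensional sketch below only illustrates the mathematical form; how it is wired into the ROI positioning stage in place of bilinear interpolation follows the paper's earlier description.

```python
def newton_parabolic(xs, ys, x):
    """Newton's parabolic (second-order) interpolation through the three
    points (x0, y0), (x1, y1), (x2, y2), via divided differences:
        p(x) = y0 + f[x0,x1](x - x0) + f[x0,x1,x2](x - x0)(x - x1)."""
    x0, x1, x2 = xs
    y0, y1, y2 = ys
    d01 = (y1 - y0) / (x1 - x0)          # first divided difference
    d12 = (y2 - y1) / (x2 - x1)
    d012 = (d12 - d01) / (x2 - x0)       # second divided difference
    return y0 + d01 * (x - x0) + d012 * (x - x0) * (x - x1)
```

Unlike linear (and hence bilinear) interpolation, this form reproduces any quadratic exactly, which is the motivation for using it to reduce positioning deviation.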

Performance evaluation for multi-scale targets
A series of experiments in the previous two parts proved the effective performance improvement of our method over the base network. To further illustrate the competitiveness of the entire system, we also compared the detection results of this method with the state-of-the-art object detectors SSD and YOLO v3 on the COCO2014 dataset for multi-scale targets. In the COCO2014 dataset, more small objects than large objects need to be considered. Specifically, about 41% of objects are very small (area < 32²); targets with an area in the (32², 96²) range belong to the medium size category (34%), while area > 96² is empirically considered a large object (24%), where area is the number of pixels in the segmentation mask. For the fairness of the experiment, all models use VGG-16 as the feature extraction network. Figure 14 shows the details of the evaluation between the different models.
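The scale buckets above can be written down directly; the boundary handling (whether 32² itself counts as small or medium) follows one common convention and is an illustrative choice here.

```python
def scale_category(area):
    """COCO-style object-scale buckets: small (< 32^2 px), medium
    (32^2 to 96^2 px), large (> 96^2 px), where area is the number of
    pixels in the segmentation mask."""
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"
```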
We know that the main reason small targets are difficult to detect is the unfriendly size constraint of the last layer of the convolutional network: the decrease in spatial dimensions leads to the loss of important details. Interestingly, our deformable convolution and CBAM modules capture the details of small-scale targets and slow down this loss of information. At the same time, as part of the auxiliary network, the fine-tuning network with large-scale pooling and weight adjustment pays more attention to enhancing the features of small targets. In the end, our network obtains more features of small objects, while the fine-tuning network reduces the degree to which those features are overwhelmed by perturbing signals. Experimental results show that our method reaches a mAP of 22.6%, lower than YOLO v3's 26.1% mAP, but it has similar recognition accuracy to the latter on the small-object detection task. This illustrates the strong competitiveness of our model on this particular issue. As for the SSD network, apart from its high accuracy on large objects, its results at the remaining two scales are lower than those of our algorithm. However, it is mentioned in ref. [12] that the network has a significant speed advantage and is more suitable for real-time detection tasks; of course, that is not our concern here.

Experiments on remote sensing data sets
The interference of complex backgrounds is the core issue of this paper. Because aerial remote sensing images feature high background complexity and small targets, we also present the results of our model on such a dataset to further show its actual execution performance. DIOR is a large-scale benchmark dataset for target detection in optical remote sensing images released by Northwestern Polytechnical University. It consists of 23,463 images and 190,288 target instances, covering 20 object categories. In the experiment, the dataset is divided into three parts: 70% of the remote sensing images are randomly selected for network training, 20% constitute the validation set and the remaining 10% are used as the test set. The experiments were trained for 60 K iterations, all adopting VGG-16 as the feature extraction network. It is worth noting that the remaining network parameters of our model are the same as in the PASCAL VOC2012 experiments. The actual performance of the model is evaluated by mAP. Figure 15 compares the original Faster R-CNN and the algorithm of this paper: the first column is the original image, the second column shows the detection results of the traditional Faster R-CNN and the third column is generated by our model. Through this comparison, it can be observed that the missed detection rate is significantly reduced. In Table 6, our network achieves 80.2% mAP, significantly higher than the 78.3% mAP of the original Faster R-CNN. In other words, our model shows good effectiveness and stability in practical tasks with complex backgrounds and can meet the accuracy requirements of remote sensing target recognition tasks.
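The 70/20/10 split described above can be sketched as a simple seeded shuffle; the seed and function name are illustrative, not taken from the DIOR release.

```python
import random

def split_dior(image_ids, seed=0):
    """Random 70% / 20% / 10% split into train / validation / test lists,
    as used for the DIOR experiments (the fixed seed is illustrative and
    only makes the split reproducible)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```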

CONCLUSION AND FUTURE WORK
Based on the advanced target recognition network Faster R-CNN, this paper proposes an improved parabolic interpolation algorithm to solve the problem of bounding box positioning deviation. A feature adjustment network added in the ROI pooling stage, together with a changed ROI size normalization, corrects the spatial information of the feature map and thereby weakens the interference of complex backgrounds. The deformable convolution and the convolutional attention module are applied to a state-of-the-art image classification backbone, VGG-16, which greatly improves the accuracy of target recognition. Our system has been tested on multiple datasets, and the results prove that our method has better detection performance than Faster R-CNN. We note that our method has particular value for the detection of small objects, multiple targets and complex backgrounds, and it can also be extended to semantic segmentation. We expect our algorithm to easily enjoy the benefits of progress in the field of machine vision: it can be applied to more advanced networks, such as YOLO v3, to improve their performance. Directions for future work include the use of alternative neural network architectures such as YOLO v3, combining theoretical knowledge of few-shot learning, and further research on methods for identifying similar targets under complex backgrounds.