Toward reliable fusion object detection based on dilated pyramid and semantic attention

Object detection on fused images of visible and infrared modalities is of great importance for many applications, for example, surveillance and rescue in low-light conditions. However, current detectors struggle with robust detection on fused images for two main reasons. First, objects appear in various shapes and sizes, so some hard samples cannot be localized accurately. Second, the same object category in fused images can have different appearances due to changing weather conditions, temperature, and intrinsic heat. Such a contradiction degrades the classification task of a detection network, since it cannot merge commonalities and distinguish differences well. In this paper, we propose to reconstruct the detection pipeline of current detectors and enhance the detection of difficult samples in fused images. Specifically, a Dilated Pyramid Network (DPN) is designed at the lateral connections to generate and aggregate features of various receptive fields, without increasing the number of pyramid layers. To strengthen classification, a Semantic Category Attention Module (SCAM) is proposed to capture attention centers of semantics in fused images, rather than object centers. Extensive experiments on two fusion datasets show that the proposed method achieves satisfying performance, and both modules can greatly improve current generic detectors on fused images.


INTRODUCTION
Detecting objects in fused images of visible and infrared modalities is a significant task for many applications like traffic surveillance and military reconnaissance. However, most research in this area focuses on proposing better image fusion methods 1,2 or designing better infrared object detection. 3,4 In other words, the significance of object detection on fused images is underrated. Object detection consists of two major tasks, that is, object localization and classification. Object localization involves detecting the position and shape of an object in an image. Typically, we look for clues like the object's contour and position to determine its location. For example, when detecting humans, the shape of a person can help the detector perceive the location. Object classification, on the other hand, focuses more on the appearance of objects. Characteristics like color, texture, and shape are useful for identifying and distinguishing between different categories of objects. For instance, a detector can differentiate a horse from a zebra by examining their unique color patterns and textures, even if their shapes are similar. Generic object detectors 5,6 satisfy the above tasks, that is, localization and classification, in most regular application scenarios. But detection on visible-infrared fused images is more challenging and requires more tailored designs.
The first challenge comes from the various object sizes and shapes. The object contour and position information are easier to perceive in infrared images due to the thermal energy distribution of the object. However, extremely small thermal objects tend to be preserved in fused images, rather than fading into the background as in visible images. This leads to a wider distribution of object sizes compared with a regular visible scenario. In other words, the detector needs to handle both large and extremely small objects.
The second challenge comes from the classification task. As mentioned above, detailed information like object texture and color is important for classification. But due to the poor detail preservation of the infrared modality, its participation in the fused image sometimes (e.g., in low-light conditions) inevitably overrides part of this information. This causes poor classification performance. Based on these analyses, we make the following requests on a detection network for fusion object detection:
• Having multi-scale feature representation. Detection on fused images is usually applied to scenarios like traffic or security surveillance, where objects have various sizes and shapes. To recall more objects and improve precision, it is necessary to generate multi-scale feature representations based on object characteristics.
• Utilizing known classification-aware semantics as much as possible. A detector for fused image object detection should explicitly learn specific semantics that help classification, so that classification can still be completed when the infrared modality dominates the fused image.
In current object detection methods, the network pipeline is usually made up of a backbone, a neck, and a head. First, a deep feature representation is extracted by a backbone network, for example, ResNet. Then a neck structure, which usually consists of multiple convolution layers, is applied to aggregate features of various resolutions from the backbone. At last, the feature map from each layer of the neck is followed by a detection head to predict object locations and categories.
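To make the backbone-neck-head pipeline concrete, the following is a minimal, hypothetical PyTorch sketch; the layer sizes, names, and two-level layout are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Toy backbone-neck-head pipeline with two pyramid levels."""
    def __init__(self, num_classes=2, channels=64):
        super().__init__()
        # Backbone: two downsampling stages producing two feature scales
        self.stage1 = nn.Sequential(nn.Conv2d(3, channels, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        # Neck: 1x1 lateral projections (no top-down fusion in this toy version)
        self.lateral1 = nn.Conv2d(channels, channels, 1)
        self.lateral2 = nn.Conv2d(channels, channels, 1)
        # Shared head: one conv for class scores, one for box regression
        self.cls_head = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.reg_head = nn.Conv2d(channels, 4, 3, padding=1)

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        feats = [self.lateral1(c1), self.lateral2(c2)]
        # Each pyramid level gets its own dense class/box predictions
        return [(self.cls_head(f), self.reg_head(f)) for f in feats]

model = TinyDetector()
outs = model(torch.randn(1, 3, 64, 64))
```

Each element of `outs` pairs a classification map with a box-regression map for one pyramid level, mirroring the per-level heads described above.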
The feature pyramid network (FPN), 7 as Figure 1 shows, is one of the most famous neck structures, and is a suitable technique to achieve multi-scale feature representation for two reasons.
• FPN naturally generates multiple feature layers with different scales, which allows it to perform detection on the right layer according to object sizes.
• FPN's input features come from different depths of the backbone. Low-level information like texture and contour, and high-level information like contextual semantics, are fused in the top-down path.
However, the current FPN structure still has some defects. Specifically, the upsampling operation after the 1 × 1 convolutions enlarges the spatial resolution of feature maps, but the multi-scale features still heavily rely on the backbone. Although the subsequent 3 × 3 convolution refines the features of different pyramid layers, it contributes nothing further to feature fusion. In other words, the effect of the FPN's feature fusion process is limited by the representation ability of the selected backbone.
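For reference, the standard FPN fusion path discussed above (1 × 1 lateral projection, top-down upsample-and-add, then a 3 × 3 smoothing convolution) can be sketched as follows; channel counts are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Simplified FPN neck: laterals + top-down fusion + 3x3 smoothing."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [C3, C4, C5], coarser maps are smaller
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down path: upsample the coarser map and add it to the finer lateral
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # Per-level 3x3 convolution refines each fused map
        return [s(l) for s, l in zip(self.smooth, laterals)]

fpn = SimpleFPN()
c3, c4, c5 = (torch.randn(1, 256, 32, 32),
              torch.randn(1, 512, 16, 16),
              torch.randn(1, 1024, 8, 8))
p3, p4, p5 = fpn([c3, c4, c5])
```

Note how the 3 × 3 convolutions sit after the additions: they refine each level but, as the text observes, add no further cross-scale fusion.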
To enhance the feature representation of the neck structure, we propose a Dilated Pyramid Network (DPN) to further generate features of various receptive fields without adding more pyramid levels. Specifically, the DPN contains an Expander to increase the receptive field of each lateral layer of an FPN, which is implemented by a sequence of dilated convolution blocks with increasing dilation rates. The outputs of the dilation blocks are residually added together to collect semantics of different depths. In this way, the multi-scale feature representation not only depends on the selected backbone, but can also be diversified and enhanced inside the neck.
To meet the second requirement mentioned above, that is, utilizing known classification-aware semantics, we propose a Semantic Category Attention Module (SCAM) to refine the original classification features at the detection head. Specifically, the SCAM first explicitly collects k kinds of semantic centers from the classification branch, where k is an empirically defined parameter based on the object categories. After that, the original classification features are refined by adjusting their intensity according to the importance of the corresponding semantics, which is contained in the attention map of SCAM.
Based on the above contents, we summarize the contributions of this paper as follows:
• A novel feature pyramid structure is proposed to achieve multi-scale feature collection and integration in each lateral layer of the pyramid, without increasing the number of pyramid layers and corresponding detection heads.
• A targeted Semantic Category Attention Module is proposed to achieve semantic-level instead of object-level attention on the classification branch of the detection head. Therefore, classification will not be confused by the varied presentation of the same object category in fused images.
• Extensive experiments show that the DPN and SCAM modules can effectively improve current detectors with simple modifications, which makes them plug-and-play techniques for object detection in fused images.
The rest of the paper is organized as follows: Section 2 provides related work on feature pyramid networks, attention mechanisms, and fused image detection, in that order. Section 3 discusses the necessity of detection on fused images with some simple experiments preceding our work. Section 4 presents the detailed implementation of the proposed FusionDet. Section 5 provides experimental results on two fusion image datasets, and Section 6 concludes the paper.

RELATED WORK

Feature pyramid structures
Scale-invariant feature representation has always been an important goal in pattern recognition and computer vision tasks, and constructing pyramidal image features has proven useful and effective. Before the prevalence of deep learning, SIFT 8 (Scale-Invariant Feature Transform) was proposed to achieve rotation- and scale-invariant feature description.
In the deep learning era, the feature pyramid structure is widely used to extract multi-scale feature representations. By aggregating features of different scales with lateral and top-down connections, the FPN 7 has improved the detection of objects at various scales, especially small objects. To build a more effective aggregation path in the pyramid structure, NAS-FPN 9 automatically searches for the best connections between multiple basic modules, which makes the pyramid structure adaptive to different tasks as well as datasets. However, the search results cannot be precisely controlled, and the parameter count doubles from FPN to NAS-FPN. Similarly, BiFPN 10 optimizes the aggregation path of NAS-FPN. Specifically, it repeats its basic blocks several times and applies longer-range skip connections on the lateral layers to achieve more efficient feature fusion, while the number of repetitions is learned by the network. There are also many detectors that do not adopt an explicit pyramid structure, but achieve similar effects with other techniques. For example, CenterNet 11 adopts an Hourglass 12 network to generate multi-scale feature representations and learns object center points on it. YOLOv3 13 takes three feature maps of different scales from the DarkNet backbone and uses three corresponding heads for detection.
Besides, some research shows that pyramid structures also have great potential beyond generic object detection. In the task of weakly supervised object detection, P-MIDN 14 implements a multiple instance detection network (MIDN) in a pyramid manner, thereby avoiding getting stuck at locally discriminative regions. In salient object detection, PAGE-Net 15 builds multiple attention layers based on multi-scale features and stacks them together. To facilitate video salient object detection, Song et al. 16 design a Pyramid Dilated Convolution (PDC) module to explicitly extract spatial saliency features. They also propose a Pyramid Dilated Bidirectional ConvLSTM to learn abundant spatiotemporal information.

Attention mechanisms for vision tasks
In computer vision, attention mechanisms are used to enable a deep model to better focus on the most relevant parts of the input image, and have been widely used in various vision tasks like object detection, image segmentation, and image enhancement. Squeeze-and-Excitation Networks (SENet) 17 propose a channel attention model that learns and recalibrates feature responses using global average pooling and fully connected layers. Selective Kernel Networks (SKNet) 18 implement channel attention by designing several branches with different convolution kernel sizes, by which diverse spatial contexts are extracted. To better exploit frequency-domain correlations between different channels, FcaNet 19 upgrades the global average pooling (GAP) operation to a more general form of the 2D discrete cosine transform. Besides channel attention, CBAM 20 further designs a spatial attention module to refine the feature twice by multiplying attention maps with the feature map. GAM 21 inherits the pipeline of CBAM and redesigns the channel and spatial attention modules to capture global attention from a given feature map. VAN 22 proposes a linear attention called large kernel attention (LKA) to enhance the capture of long-range correlations in features. Many of these methods are helpful for generic vision tasks, but they do not specifically consider the semantic category of objects.
There are also some notable attention mechanisms targeted at the semantic segmentation task. ACFNet 23 achieves attention on different object categories by learning the class centers of objects, but the number of object categories in a fusion dataset is small, which degrades the function of the attention module. COSNet 24 improves zero-shot video object segmentation with an efficiently implemented co-attention that leverages information from multiple reference frames. A follow-up work 25 proposes contrastive co-attention, which aims to discover more object patterns and better ground semantics in image regions.

Image fusion and recognition
Image fusion is an important research direction in the field of computer vision and has made significant progress in recent years. One of the recent developments in image fusion is the use of deep learning techniques. AT-GAN 2 proposes a generative adversarial network (GAN) with intensity attention modules and semantic transition modules to explore key information in the infrared and visible modalities. Han et al. 1 achieve targeted-level image fusion with a scene texture attention module. Moreover, some research has been done to facilitate high-level vision tasks like object detection and semantic segmentation. SeAFusion 26 cascades the image fusion module and the semantic segmentation module, and optimizes the fusion process under the constraint of the segmentation loss. TarDAL 27 proposes a bi-level optimization formulation for the joint problem of fusion and detection, which aims to achieve high-quality image fusion as well as facilitating the object detection task.

NECESSITY
In recent years, significant progress has been made in the field of infrared target detection, with various traditional and deep learning-based methods achieving state-of-the-art performance. One might doubt the necessity of performing object detection on fused images, since the infrared modality presents clear object contours under many challenging conditions. However, in adverse conditions like cold and fog, the infrared modality may not reveal the object appearance well. For example, the infrared image in the first row of Figure 2 only highlights the people and some specific parts of vehicles, which contain more heat. More rigorously, we conduct an experimental investigation to provide solid support for our view. Specifically, we conduct experiments on the M3FD dataset, 27 an urban transportation dataset consisting of infrared and visible images. We adopt SeAFusion 26 to fuse them and acquire fused images. We take the mAP (mean Average Precision) and mAR (mean Average Recall) metrics to evaluate performance. Precision measures the proportion of detected objects that are correctly classified, while recall measures the proportion of all actual objects that are detected. By combining both measures, mAP and mAR provide a comprehensive evaluation of the overall performance of object detection models.
The experimental results are provided in Figure 3. We first notice that the infrared images do not benefit the detection task as we assumed, yielding the lowest mAP and mAR. For the visible images, the mAP and mAR are both improved by about 6% over the infrared images. These two sets of results reveal that infrared images cannot always provide reliable contour information in complex urban environments. For example, in the second set of images in Figure 2, the buildings in the background of the infrared image are activated even more strongly than the moving vehicles. Then we focus on the results of the fused images. One can see that the mAR changes little compared with the visible images, while the mAP increases again by 2.9%, a significant improvement considering the baseline. The improvement can be attributed to the suppression of visible-image noise, for example, strong light, during the fusion process. In general, fused images can serve detection tasks better than both infrared and visible images.

METHOD
As the structure of the ResNet backbone in Figure 4 shows, convolution blocks C2 to C5 output feature maps at four different scales: C2, C3, C4, and C5 downsample the input by factors of 4, 8, 16, and 32, respectively. We first propose a Dilated Pyramid Network, which contains an Expander to collect features of various receptive fields. Then we introduce the implementation of the Semantic Category Attention Module.
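The downsampling factors above fix the spatial size of each backbone output; a small helper makes the arithmetic explicit:

```python
def backbone_scales(input_hw, strides=(4, 8, 16, 32)):
    """Spatial sizes of the C2-C5 feature maps for a ResNet-style backbone
    that downsamples by factors of 4, 8, 16, and 32."""
    h, w = input_hw
    return [(h // s, w // s) for s in strides]

# An 800 x 800 input yields C2..C5 maps of 200, 100, 50, and 25 pixels per side.
sizes = backbone_scales((800, 800))
```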

Dilated pyramid network
Based on the above observations, we propose a Dilated Pyramid Network (DPN) to achieve a refine-and-fuse pipeline in the pyramid network. As the Dilated Pyramid in Figure 4 shows, we design a receptive field Expander to process the output backbone features. The Expander consists of two dilated convolution blocks with different dilation rates, which facilitates detection in two respects. On the one hand, the backbone features are refined early, before the upsampling and element-wise addition, so more scale-aware features are represented and propagated during fusion. On the other hand, the Expander is implemented with dilated convolutions, whose larger receptive field enables the lateral layer to capture wider spatial information. Therefore, multi-scale feature extraction and fusion are also achieved in the lateral layers.
For the backbone, we use the default ResNet with 101 layers for basic feature extraction. For the detection head, we adopt the common implementation of dense detectors like ATSS. 5 However, we add the SCAM module to the classification branch to refine features according to the semantic category. In Figure 4, the Cls_tower and Reg_tower are used to refine features from the DPN to suit the corresponding tasks; both consist of four 3 × 3 convolutions, while Conv_cls and Conv_reg are 3 × 3 convolutions that produce the prediction results.
The implementation of the Expander is shown in Figure 5. Given the backbone feature map X_i from C_i, we keep the 1 × 1 convolution of the original FPN to project the backbone feature map to the corresponding lateral layer, which is denoted as

P_i = Conv_1×1(X_i).

FIGURE 4 Detection pipeline of a generic dense detector with DPN and SCAM.

FIGURE 5 An Expander consists of a 1 × 1 convolution projector, a residual dilation block, and a 3 × 3 convolution.
Then, the Residual Dilation is constructed from two dilated convolution blocks with dilation rates 2 and 4, respectively. The output of the first Residual Dilation is calculated as

D_1 = P_i + DConv_2(P_i),

while the output of the second Residual Dilation is calculated as

D_2 = D_1 + DConv_4(D_1),

where DConv_r denotes a dilated convolution block with dilation rate r. Then, we compute the output of the whole Expander as

E_i = Conv_3×3(D_2),

which is then followed by the upsampling and 3 × 3 convolutions of the original FPN. By leveraging dilated convolutions to expand the receptive field, the dilated pyramid network not only enables the fusion of features across different backbone layers, but also achieves multi-scale feature fusion within the same layer. This helps capture more contextual information and enhances the network's ability to handle objects of varying sizes and shapes, making the subsequent detection heads more efficient at localization and classification.
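A minimal PyTorch sketch of the Expander under our reading of the text (1 × 1 projector, two residual dilated blocks with dilation rates 2 and 4, and a final 3 × 3 convolution); the channel counts and block internals are assumptions:

```python
import torch
import torch.nn as nn

class Expander(nn.Module):
    """Sketch of the DPN Expander: 1x1 projection, two residual dilated
    3x3 blocks (dilation 2 and 4, padding = dilation preserves the spatial
    size), and a final 3x3 convolution."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.project = nn.Conv2d(in_channels, out_channels, 1)

        def dilated_block(rate):
            return nn.Sequential(
                nn.Conv2d(out_channels, out_channels, 3,
                          padding=rate, dilation=rate),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True))

        self.dilate2 = dilated_block(2)
        self.dilate4 = dilated_block(4)
        self.refine = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x):
        p = self.project(x)          # lateral 1x1 projection
        d1 = p + self.dilate2(p)     # first residual dilation
        d2 = d1 + self.dilate4(d1)   # second residual dilation
        return self.refine(d2)       # fed into the FPN top-down path

expander = Expander(in_channels=512)
out = expander(torch.randn(1, 512, 16, 16))
```

Because each dilated convolution uses padding equal to its dilation rate, the output keeps the input's height and width, so the lateral layer still aligns with the FPN's upsample-and-add path.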
For a regular convolution, assume the input feature map has size W_in × H_in, the kernel size is K × K, the stride is S, and the padding size is P. The output width is computed as

W_out = ⌊(W_in − K + 2P) / S⌋ + 1.

For a dilated convolution with kernel size K × K and dilation rate D, the output width is computed as

W_out = ⌊(W_in − D × (K − 1) − 1 + 2P) / S⌋ + 1.

Here, P represents the padding size, S represents the stride, and ⌊⋅⌋ denotes the floor operation. It should be noted that for a dilated convolution, the effective kernel size is D × (K − 1) + 1. We can see that the equivalent kernel size of a dilated convolution is larger than that of a regular one. Therefore, when the same padding size and stride are used, the output feature map of the Expander would shrink. This would cause irregular down-sampling strides of the feature maps at the detection head. To handle this problem, we set the padding size to be the same as the dilation rate, that is, 2 and 4, so the Expander does not change the size (width and height) of the feature maps.
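The size formulas above can be checked directly; the snippet below evaluates the output width for both the regular and dilated cases and confirms that setting the padding equal to the dilation rate preserves the size for a 3 × 3 kernel:

```python
import math

def conv_out(w_in, k, s=1, p=0, d=1):
    """Output width of a (dilated) convolution; d=1 gives the regular case."""
    k_eff = d * (k - 1) + 1  # effective kernel size of a dilated convolution
    return math.floor((w_in + 2 * p - k_eff) / s) + 1

# With padding equal to the dilation rate, a 3x3 dilated conv keeps the size:
assert conv_out(100, k=3, s=1, p=2, d=2) == 100
assert conv_out(100, k=3, s=1, p=4, d=4) == 100
# With the regular padding of 1, the same dilated conv would shrink the map:
assert conv_out(100, k=3, s=1, p=1, d=2) == 98
```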

Semantic category attention mechanism
From the perspective of feature expression, the same category of objects in fused images may have multiple different feature expressions due to weather, environment, and other factors. Specifically, the fused image is sometimes more similar to the visible modality and sometimes more similar to the infrared modality. Such contradictions cause inconsistent feature representations of the same object category, which further leads to unstable learning in the training stage. When the fused image is more like the infrared modality, the classification-aware details are weakened.
To handle such challenges, we propose a semantic category attention mechanism (SCAM) to capture semantic categories instead of object categories in the classification branch of FusionDet. The number of semantic categories is set larger than the number of object categories, because we hope the detector explicitly learns multiple kinds of semantics to identify an object.
At the classification branch, the output feature map X_cls ∈ R^(b×c×h×w) from the DPN neck is sent into the SCAM, where b, c, h, and w represent the batch size, channel number, height, and width of the feature map. The first component of SCAM, called the Collector, collects k kinds of semantic categories and represents them by k vectors, as shown in Figure 6. As for the choice of k, we conduct a set of experiments to select a proper value. Here, the Collector is implemented with a convolution block, and the semantic category S_c is computed as

S_c = ReLU(BN(Conv(X_cls))),       (9)

followed by a 1 × 1 convolution projector to acquire the attention Query Q as

Q = Conv_1×1(S_c).       (10)

In Equation (9), Conv, BN, and ReLU represent convolution, batch normalization, and the rectified linear unit activation function, respectively. The convolution used in Equation (9) has 256 output channels, which means it does not change the input channel number. The convolution used in Equation (10) has k output channels, which enables Q to contain k kinds of semantics.
The second ConvBlock component of SCAM is used to compute the attention Key, which is implemented with another convolution block. But as ACFNet does in semantic segmentation, we adopt a dropout layer that randomly drops neurons with probability 0.2. The process is written as

K = Dropout(ReLU(BN(Conv(X_cls)))).       (11)

Here, the convolution does not change the number of feature map channels, as in Equation (9). After the Query and Key are acquired, we can compute the attention map A with a softmax function as

A = Softmax(Q K^T),       (12)

where Q and K are reshaped to R^(b×k×n) and R^(b×c×n) by flattening their spatial dimensions (n = h × w), so that A ∈ R^(b×k×c).

FIGURE 6 Semantic category attention module (SCAM).
To compute the attention-refined feature map of X_cls, the attention map needs to be multiplied with a Value, which is usually a linear transformation of X_cls. Following ACFNet, we directly apply a shape transformation to S_c to generate the Value as

V = Reshape(S_c).       (13)

Here, Reshape merges the last two dimensions (h and w) of S_c, so that V ∈ R^(b×c×n), where n = h × w. Finally, the semantic category attention feature map X_ca is calculated as

X_ca = Reshape⁻¹(A V),       (14)

where Reshape⁻¹ restores the spatial dimensions h and w. Generally, it is better to fuse an attention-refined feature map with the original feature map to avoid unstable training and information loss. Therefore, we further design a simple fusion operation as

X_out = Conv(Concat(X_cls, X_ca)).       (15)

Here, Concat denotes feature map concatenation along the channel dimension. For the Conv, we adopt a 3 × 3 convolution with padding size 1 to reduce the channels of the concatenated feature map, rather than the 1 × 1 convolution used in other attention methods. The following steps are shown in the detection head of Figure 4. By changing the target class to a semantic class in the class attention mechanism, we can achieve stable classification even when the feature expression of the target changes. This effectively reduces the impact of interference information on target detection and improves the robustness of the detection model.
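A PyTorch sketch of SCAM under our reading of the text: the tensor shapes follow the description (Collector, k-channel Query, dropout Key, reshaped Value, channel-wise concatenation, 3 × 3 fusion convolution), but the exact attention computation is our assumption, not a verified reproduction of the paper:

```python
import torch
import torch.nn as nn

class SCAM(nn.Module):
    """Sketch of the semantic category attention module (shapes per the text;
    the attention product is an assumed Q x K^T form)."""
    def __init__(self, channels=256, k=8):
        super().__init__()
        self.collector = nn.Sequential(            # semantic categories S_c
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.to_query = nn.Conv2d(channels, k, 1)  # k output channels
        self.to_key = nn.Sequential(               # key branch with dropout 0.2
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True), nn.Dropout(0.2))
        self.fuse = nn.Conv2d(channels + k, channels, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        s_c = self.collector(x)
        q = self.to_query(s_c).flatten(2)          # b x k x n
        key = self.to_key(x).flatten(2)            # b x c x n
        attn = torch.softmax(q @ key.transpose(1, 2), dim=-1)  # b x k x c
        value = s_c.flatten(2)                     # b x c x n (reshaped S_c)
        x_ca = (attn @ value).view(b, -1, h, w)    # b x k x h x w
        # Concatenate along channels, then reduce with a 3x3 convolution
        return self.fuse(torch.cat([x, x_ca], dim=1))

scam = SCAM(channels=256, k=8)
refined = scam(torch.randn(2, 256, 16, 16))
```

The fusion step returns a map with the original channel count, so the module can be dropped into the classification branch without changing the surrounding tower.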

EXPERIMENTS

Datasets, settings, and metrics
We conduct experiments on two visible-infrared fusion image datasets, namely RGB-Thermal and M3FD. 27 In our study, the RGB-Thermal dataset 28 contains 2 categories, that is, Person and Car. The M3FD dataset contains 6 categories, namely People, Car, Bus, Motorcycle, Lamp, and Truck. As the examples in Figure 7 show, both datasets contain diverse challenges like dazzling light, darkness, and heat interference. The image numbers of the two datasets are provided in Table 1.
As for the training strategy, we adopt different schedules for the RGB-Thermal and M3FD datasets due to their different scales. As the Schedule column in Table 1 shows, the RGB-Thermal dataset is trained with batch size 4 at image scale 1000 × 600; we train for 30 epochs with learning rate 0.001, decayed by a factor of 10 at epoch 24. The M3FD dataset is trained with batch size 8 at image scale 1333 × 800; we train for 12 epochs with learning rate 0.004, decayed by a factor of 10 at epochs 8 and 11, respectively.
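For clarity, the two schedules can be expressed as hypothetical configuration dictionaries (the key names are ours, only the numeric values come from the text), together with the resulting step learning-rate rule:

```python
# Hypothetical schedule configs mirroring the settings reported in the text.
schedules = {
    "RGB-Thermal": dict(batch_size=4, image_scale=(1000, 600), epochs=30,
                        lr=0.001, lr_decay_factor=0.1, decay_epochs=[24]),
    "M3FD":        dict(batch_size=8, image_scale=(1333, 800), epochs=12,
                        lr=0.004, lr_decay_factor=0.1, decay_epochs=[8, 11]),
}

def lr_at_epoch(cfg, epoch):
    """Step schedule: multiply the base lr by the decay factor at each decay epoch."""
    lr = cfg["lr"]
    for e in cfg["decay_epochs"]:
        if epoch >= e:
            lr *= cfg["lr_decay_factor"]
    return lr
```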
As for the evaluation metrics, we select both mean Average Precision (mAP) and mean Average Recall (mAR) to evaluate the detection accuracy and recall ratio, respectively. Average Precision (AP) and Average Recall (AR) are computed as

AP = TP / (TP + FP),    AR = TP / (TP + FN),

where TP represents True Positives, FP represents False Positives, and FN represents False Negatives. In object detection, a detection result is considered a True Positive if its Intersection over Union (IoU) with a ground-truth object is greater than a certain threshold; if its IoU with all ground-truth objects is less than the threshold, it is considered a False Positive; and if it matches a ground-truth object that has already been matched by another detection, it is considered a duplicate and not included in the calculation. For speed, we report Frames Per Second (FPS) as the metric. To balance recall (mAR) and precision (mAP), we further adopt the F1 score as a comprehensive indicator, which is computed as

F1 = 2 × mAP × mAR / (mAP + mAR).
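The three metrics reduce to simple counting arithmetic, which the following helper makes explicit (the counts are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 80 correct detections, 20 false alarms, 40 missed objects.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```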

Ablation study on SCAM
The SCAM module needs a choice for the number of semantic categories k, as shown in Figure 6. In this section, we compare 4 different values of k on the RGB-Thermal dataset. The results are presented in Table 2. We first notice that the value of k has little effect on the speed. Then we can see that the smallest k (2) and the largest k (16) are both worse than the middle ones. k = 8 and k = 4 have very close F1 scores and FPS. In our study, we give more weight to precision than to recall. Therefore, we set k to 8 in the SCAM module.

Contribution to generic object detectors
The proposed DPN and SCAM modules are universally effective for current generic object detectors. To evaluate the contribution of these two modules, we select 7 popular detection algorithms as baselines and report the improvements on them. Specifically, we first select 3 popular two-stage detectors, that is, Faster R-CNN, Grid R-CNN, and Libra R-CNN. Since a two-stage method utilizes both a region proposal head (RPN head) and a region of interest head (RoI head), it is difficult to determine where to place the SCAM module. Therefore, we only add the DPN module to the two-stage detectors. PISA (PrIme Sample Attention) is a learning strategy that selectively weights training samples according to their importance, and we apply it to Faster R-CNN in the experiments. Then, we select 3 popular one-stage dense detectors, that is, PAA, ATSS, and VFNet, and equip them with both the DPN and SCAM modules.
We first focus on the 3 two-stage detectors in Table 3. One can see that when the DPN is added, all detectors and metrics are improved, except that the mAR of Faster R-CNN is reduced by 2.4%. We also notice that the accuracy indicator mAP of Faster R-CNN is increased from 73.6% to 78.1%, the highest among all methods. For Libra R-CNN, the mAR metric is increased from 85.8% to 90.9%, also the highest, while the F1 score is increased from 79.2% to 82.2%. These results show that the DPN module is very effective for two-stage detectors. For one-stage detectors, the DPN and SCAM modules are added to the baseline one by one. We notice that the two modules have different effectiveness for different baselines. For PAA and VFNet, the DPN module brings greater improvement to their mAPs, while the SCAM decreases their mAR by 2.1% and 0.1%, respectively. But for ATSS, the SCAM brings greater improvement on both mAR (+4.4%) and mAP (+0.9%) compared with the DPN module.
From the experimental results in Table 3, we can conclude that the proposed DPN and SCAM can effectively improve current generic detectors, especially in accuracy (mAP). Occasionally, they slightly reduce the recall rate (mAR). Based on the comprehensive F1 score metric, the DPN and SCAM are beneficial for detection on fused images.

Results on M3FD dataset
In this section, we expand the range of detectors to anchor-free (FSAF, FCOS, and FoveaBox), anchor-based (the others), two-stage (Faster R-CNN, Grid R-CNN, Libra R-CNN, and PISA), and one-stage (the others) detectors for a comprehensive comparison.
The results are presented in Table 4.
The performance of anchor-free detectors is generally worse than that of anchor-based ones, which is mainly caused by the removal of anchors. Meanwhile, based on the FPS metric, anchor-based detectors are usually faster. For two-stage detectors, Faster R-CNN and Grid R-CNN are superior to Libra R-CNN. Recently published one-stage detectors are generally better than both anchor-free and two-stage detectors, with ATSS and VFNet being the best two among them. As we can see, ATSS achieves first place in mAR (67.1%) and second place in both mAP (45.1%) and F1 score (54.3%). VFNet achieves second place in mAR (67.0%) and third place in mAP (45.0%) and F1 score (53.9%). Our FusionDet, with both DPN and SCAM modules, achieves first place in mAP (46.4%) and F1 score (54.5%).
Some detection examples are provided in Figure 8. The images come from the RGB-Thermal and M3FD datasets. The FusionDet is equipped with both DPN and SCAM modules on the basis of ATSS. In extreme low-light conditions, our method can discover more weak objects that are small and dim. In fused images similar to infrared ones, FusionDet can successfully find the targets.

CONCLUSION
In this article, we have proposed two effective modules, namely DPN and SCAM, to enhance detection on fused images at the neck and head of a detection pipeline, respectively. The DPN extracts multi-receptive-field features for objects, and the SCAM collects robust attention on semantic categories rather than object categories. Extensive experiments show that the proposed modules can greatly improve current detectors on fused images and achieve satisfying performance compared with more than ten popular object detection methods.

FIGURE 1 Original structure of the feature pyramid network (FPN).

FIGURE 2 Images of three different modalities.

FIGURE 3 Detection performance (%) of different modality images.

FIGURE 7 Image examples of the RGB-Thermal and M3FD datasets.

TABLE 1 Dataset and training schedules.

TABLE 2 Performance (%) of SCAM with different semantic categories k.
TABLE 3 Performance (%) comparison and component contribution on the RGB-Thermal dataset. Rank 1st: red, Rank 2nd: blue, Rank 3rd: green. FPS is tested on a single Nvidia GeForce RTX 3090 GPU at the inference stage.

TABLE 4 Experiment results (%) on the M3FD dataset.
FIGURE 8 Detection examples of ATSS and our method (ATSS with DPN and SCAM).