Local bi-directional funnel network for salient object detection

Existing deep-learning-based saliency detection methods mainly design a sophisticated architecture to integrate multi-level convolutional features, for example, a recurrent network or a bi-directional message passing model, and achieve gratifying performance. However, the direct transmission of information with a large layer span, for example, between the deepest layer and the shallowest layer, may lead to antagonistic problems. To address this problem, we propose a local bi-directional funnel network (LBDFN) to effectively integrate multi-level features for salient object detection. In the proposed network, a local bi-directional feature integration (LBDFI) module is designed to fuse the features with similar properties from adjacent convolution layers, which alleviates the mismatch between fused features. Extensive experiments on five saliency detection datasets clearly demonstrate that the proposed method outperforms state-of-the-art approaches.

✉ Email: junxiali99@163.com
Introduction: Salient object detection, the task of locating the parts of natural images or videos that most attract human attention, has drawn extensive research interest in computer vision and underpins many applications, such as object detection, recognition and tracking, and image/video compression and retargeting.
Existing salient object detection methods can be divided into two categories: traditional methods and deep-learning-based methods. Most traditional methods perform saliency detection with hand-crafted models and rely heavily on heuristic saliency priors, for example, the contrast prior and the boundary prior. Although these priors have proved effective in some cases, they are not robust enough to discover salient objects or to accurately depict object boundaries in complex scenes. In addition, prior-based approaches mainly depend on low-level hand-crafted features, which are incapable of capturing the semantic knowledge of objects. Benefiting from the hierarchical structure of convolutional neural networks, deep methods can extract multi-scale and multi-level features that contain both low-level local details and high-level global semantics. To make use of both detailed and semantic information, existing methods mainly focus on designing a sophisticated architecture to integrate multi-level convolutional features [1][2][3]. Hou et al. [1] introduce short connections to the skip-layer structures within the holistically nested edge detection architecture, in which deeper-layer features can be transmitted to shallower layers. However, this ignores that, in addition to the deeper layers guiding the shallower ones, the information from the shallower layers can also be used to guide the deeper layers during network optimization. He et al. [2] and Zhang et al. [3] observe that fusing semantic and detail features can improve detection performance, and both exploit a bi-directional structure for this fusion. However, the direct transmission of information with a large layer span, for example, between the deepest layer and the shallowest layer, may lead to antagonistic problems. In addition, directly fusing features of different properties (e.g. detail and semantic) introduces noise.
We address this problem by designing a local bi-directional funnel network (LBDFN) for saliency detection. Considering that the features from adjacent convolution layers have similar properties, we can improve detection performance by fusing such similar features with a bi-directional structure. Hence, we design a novel local bi-directional feature integration (LBDFI) module to fuse the features with similar properties from adjacent convolution layers, which alleviates the mismatch between fused features. In LBDFI, the features of adjacent convolution layers are optimized through bi-directional transmission, that is, the salient semantic information of the deeper layer is transferred to the shallower layer, while the salient detail information of the shallower layer is transferred to the deeper layer.

Proposed method: Our goal is to fuse the features with similar properties from adjacent convolution layers so as to avoid the information mismatch caused by a large layer span, and thus to fully exploit the multi-layer and multi-scale salient features. Different layers of the network capture different properties of the object: features from the shallow layers are dominated by salient detail information, while features from the deeper layers contain rich salient semantic information. Considering these issues, we design a novel framework, named LBDFN, to effectively fuse multi-level features for salient object detection and to optimize the features globally. Specifically, in LBDFN, we introduce the LBDFI module to fuse the features from adjacent convolution layers with a local bi-directional structure.
The overall architecture is illustrated in Figure 1. In our network, we use ResNet-50 [5] as the backbone to extract local image features; it contains five convolutional blocks, each of which has several convolutional layers. Let Layer 1 to Layer 5 denote the side-output features of the five convolutional blocks. Next, adjacent side-output feature maps are fed into the LBDFI module to obtain the local bi-directional feature maps. The details of LBDFI are shown in the blue box in Figure 1, and visualizations of the feature maps are shown in Figure 2.
For ease of exposition, we take Layer 1 and Layer 2, which are adjacent convolution layers, as an example. Each layer's features are divided into two groups: the original feature map and the new feature map. The new feature map is generated from the original feature map by a threshold function (see the second column in Figure 2). The threshold function used here is the sigmoid activation, which avoids repeated information fusion and produces information fusion rates in the range [0, 1]. It can be formulated as

F_new = σ(F_original),

where F_new and F_original denote the new feature map and the original feature map, respectively, and σ represents the sigmoid threshold function. Next, the new feature map of Layer 1 and the original feature map of Layer 2 are fused by concatenation to obtain the feature map BD-1. Meanwhile, the new feature map of Layer 2 and the original feature map of Layer 1 are fused by concatenation to obtain the feature map BD-2. Last, the feature maps BD-1 and BD-2 are fused by element-wise summation to obtain the local bi-directional feature map:

F_LBDFI1-1 = (F_new-1 © F_original-2) ⊕ (F_new-2 © F_original-1),

where F_LBDFI1-1 denotes the feature map of LBDFI1-1, F_new-1 and F_original-1 denote the new and original feature maps of Layer 1, respectively, F_new-2 and F_original-2 denote those of Layer 2, © represents the concatenation operation, and ⊕ denotes element-wise summation. In general, we fuse adjacent convolution layers by LBDFI to obtain complementary feature maps; the adjacent complementary feature maps are then fed into LBDFI again to generate new complementary feature maps. Repeating this process builds a funnel-shaped network that fuses multi-scale and cross-layer salient features. Note that we supervise only the last feature map with the ground truth, and our loss function is the binary cross-entropy loss, which treats all pixels equally.
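To make the fusion concrete, the steps above can be sketched in pure Python, modelling each feature map as a flat list of floats. This is only an illustrative sketch: the actual implementation uses PyTorch tensors and convolutional layers, and the function names `lbdfi` and `funnel` are our own labels, not identifiers from the paper's code.

```python
import math

def sigmoid(x):
    # Threshold function sigma: squashes activations into [0, 1] fusion rates.
    return 1.0 / (1.0 + math.exp(-x))

def lbdfi(f1, f2):
    """Local bi-directional feature integration for two adjacent layers.

    Feature maps are modelled as flat lists; real code would use tensors
    and convolutions to align channel dimensions before fusing.
    """
    new1 = [sigmoid(v) for v in f1]           # new feature map of Layer 1
    new2 = [sigmoid(v) for v in f2]           # new feature map of Layer 2
    bd1 = new1 + f2                           # BD-1: concat(new-1, original-2)
    bd2 = new2 + f1                           # BD-2: concat(new-2, original-1)
    # Element-wise sum of BD-1 and BD-2 gives the local bi-directional map.
    return [a + b for a, b in zip(bd1, bd2)]

def funnel(features):
    """Fuse adjacent maps repeatedly: 5 -> 4 -> 3 -> 2 -> 1 (funnel shape)."""
    while len(features) > 1:
        features = [lbdfi(features[i], features[i + 1])
                    for i in range(len(features) - 1)]
    return features[0]
```

Each funnel round shortens the list of side-output maps by one, so five backbone side-outputs collapse to a single fused map after four rounds, matching the funnel shape described above.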
Experimental results: All experiments are conducted on a single NVIDIA 1080Ti GPU. The proposed network is implemented in the popular PyTorch framework and trained on the DUTS-TR dataset. We evaluate our approach on five public saliency detection benchmarks: SOD, ECSSD, PASCAL-S, DUTS-TE and HKU-IS, which contain 300, 1000, 850, 5019 and 4447 images, respectively. To evaluate the performance of the proposed method, we report two metrics: the F-measure score and the mean absolute error (MAE) score. Specifically, the F-measure is an overall performance metric computed as the weighted harmonic mean of precision and recall, while the MAE denotes the average absolute per-pixel difference between a predicted saliency map and its ground truth. Table 1 gives the quantitative results of the proposed approach against five state-of-the-art methods in terms of F-measure and MAE on the five datasets. From Table 1, we can see that our method significantly outperforms the state-of-the-art approaches on all datasets. Figure 3 shows some saliency maps generated by the proposed method and four state-of-the-art approaches [3, 4, 6, 7] for a subjective comparison. As can be seen, compared to the other methods, our method generates more accurate saliency maps in various challenging cases. For images with low contrast between objects and background (e.g. rows 1-4 in Figure 3) or with complex surroundings (e.g. row 5), our method still generates accurate saliency maps, while most of the existing methods cannot effectively highlight the entire salient objects. Moreover, for images with reflections in water (e.g. row 6) or fine structure (e.g. row 7), our method can also detect the whole salient objects with well-defined boundaries.
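The two evaluation metrics can be sketched as follows in pure Python on flat lists of pixel values. The function names are our own, and β² = 0.3 is assumed here as it is the common convention in the saliency detection literature (the text above does not state the value used).

```python
def f_measure(precision, recall, beta2=0.3):
    # Weighted harmonic mean of precision and recall; beta^2 = 0.3 is the
    # usual choice in saliency work, emphasising precision over recall.
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(pred, gt):
    # Mean absolute per-pixel difference between a predicted saliency map
    # and its ground truth, both flattened to lists of values in [0, 1].
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
```

In practice, precision and recall are computed by binarizing the predicted saliency map at a sweep of thresholds and comparing against the binary ground-truth mask.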

Conclusion: We propose a novel LBDFN to fully exploit the multi-layer and multi-scale convolutional features for salient object detection. In LBDFN, an LBDFI module is designed to fuse the features with similar properties from adjacent convolution layers so as to avoid the information mismatch caused by a large layer span. The experimental evaluation on five datasets demonstrates that the proposed approach produces more accurate saliency maps than state-of-the-art saliency detection methods. We hope our work provides valuable insights into enhancing the discriminative ability of feature representations for saliency detection.