Article

Hybrid-Attention Network for RGB-D Salient Object Detection

1 School of Information and Electronic Engineering, Zhejiang University of Science & Technology, Hangzhou 310023, China
2 College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(17), 5806; https://doi.org/10.3390/app10175806
Submission received: 4 July 2020 / Revised: 11 August 2020 / Accepted: 18 August 2020 / Published: 21 August 2020
(This article belongs to the Special Issue Advances in Image Processing, Analysis and Recognition Technology)

Abstract

Depth information has been widely used to improve RGB-D salient object detection by extracting attention maps that determine the position of objects in an image. However, non-salient objects may be close to the depth sensor and thus present high pixel intensities in the depth map. Such cases inevitably lead to erroneous emphasis on non-salient areas and may negatively affect the saliency results. To mitigate this problem, we propose a hybrid attention neural network that fuses middle- and high-level RGB features with depth features to generate a hybrid attention map that removes background information. The proposed network extracts multilevel features from RGB images using the Res2Net architecture and then integrates high-level features from depth maps using the Inception-v4-ResNet2 architecture. The mixed high-level RGB and depth features generate the hybrid attention map, which is then multiplied with the low-level RGB features. After decoding by several convolutions and upsampling operations, we obtain the final saliency prediction, achieving state-of-the-art performance on the NJUD and NLPR datasets. Moreover, the proposed network has good generalization ability compared with other methods. An ablation study demonstrates that the proposed network performs saliency prediction effectively even when non-salient objects interfere with detection. In fact, after removing the branch with high-level RGB features, the RGB attention map that guides the network for saliency prediction is lost, and all performance measures decline. The prediction maps from the ablation study show the effect of non-salient objects close to the depth sensor, an effect that is absent when using the complete hybrid attention network. Therefore, RGB information can correct and supplement depth information, and the corresponding hybrid attention map is more robust than a conventional attention map constructed only from depth information.

1. Introduction

Saliency detection extracts relevant objects with pixel-level details from an image. It has been widely used in many fields such as object segmentation [1], region proposal [2], object recognition [3], image quality assessment [4], and video analysis [5]. When the background has colors similar to those of a salient object, when the background is highly complex, or when salient objects are very large or small, saliency detection based solely on RGB images often fails to provide accurate results. Therefore, depth information is increasingly used as a supplement to RGB information for saliency detection [6,7,8]. RGB-D salient object detection based on handcrafted features generally uses depth maps to determine edges, textures, and histogram statistics, and then applies bottom-up [9] or top-down [10] approaches to predict whether a pixel belongs to a salient object. Various methods consider the rarity of pixels in local and global regions of an image [11], while others use prior knowledge to support prediction and obtain accurate detection [12]. However, these methods rely on handcrafted features, empirical parameter settings, and statistical prediction, which limit their performance. In fact, such methods cannot fully extract representative features due to inadequate parameter settings, subjective factors, and redundant or erroneous information. In addition, models of the human visual system may be incomplete and misleading. Alternatively, deep learning methods have emerged in recent years and improved the accuracy of salient object detection [13,14,15,16]. By combining the advantages of deep learning with the features in depth maps, several stereoscopic saliency detection methods based on neural networks have achieved great leaps in accuracy. For instance, DF combines RGB images and depth maps into a deep learning framework [17], and encoder–decoder networks, such as PDNet [18], provide high accuracy and robustness. Subsequent studies further improved the results by proposing hidden structure conversion [19], complementary fusion [20], a dilated convolutional model [21], and modified loss functions [22] for highly accurate salient object detection. On the other hand, methods based on attention mechanisms can quickly identify the position of objects and then reconstruct their edges to improve salient object detection. Wang et al. proposed a residual attention network [23], and Fu et al. proposed DANet [24], which achieves accurate results using channel and spatial attention maps.
Current stereoscopic salient object detection based on deep learning usually adopts networks such as VGG [25], ResNet [26], and Inception [27] as the backbone and the U-Net encoder–decoder structure [28] as the framework. However, this is not an ideal solution for saliency detection. As the depth map (disparity map) reflects the distances to objects, many networks use it to generate an attention map that distinguishes objects from the background. However, depth maps have two major limitations. First, the depth map reflects the distance to all objects, and some non-salient objects are the closest to the camera and therefore have the lowest (or highest) pixel intensities. Thus, the underlying network may consider such objects salient, a phenomenon that we call the depth principle error. Second, data acquisition limitations may degrade the accuracy of edge information in the depth map.
Overall, neural networks that determine the location of objects using only depth information to construct the attention map may be biased. Using the RGB image to discard the closest non-salient objects in depth maps may improve the detection accuracy. Based on spatial attention maps, we propose stereoscopic salient object detection using a hybrid attention network (HANet). Before processing features for saliency detection, high-level features extracted from the RGB image are encoded into an attention map, which is then mixed with the depth attention map for subsequent joint processing with the saliency features. Experimental results show that this method prevents the interference caused by non-salient objects in depth maps. In addition, unlike many symmetric neural networks, the proposed asymmetric network has fewer parameters, because the depth map contains less information and a large network is unnecessary. Thus, we use a simplified Inception-v4-ResNet2 [29] architecture with fewer parameters to extract the depth attention map and a Res2Net [30] architecture to extract features for the RGB attention map, which contains more complex information. The proposed asymmetric HANet prevents the depth principle error by filtering features with cross-modal attention maps obtained separately from RGB and depth data.

2. Proposed Method

The proposed HANet architecture achieves salient object detection while preventing the depth principle error. The processing pipeline of HANet is shown in Figure 1. HANet can be divided into two main parts. The first part extracts features through eight neural network blocks (shown in blue in Figure 1) for the RGB attention map and through two blocks (shown in green) for the depth attention map. The second part consists of six blocks (shown in orange in Figure 1) that fuse the two types of features to generate a hybrid attention map, and one block (shown in pink) that generates the saliency prediction map by filtering features with the hybrid attention map.
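For orientation, the two-part pipeline can be expressed as a top-level module that wires the sub-networks together. This is only a sketch: the module and argument names are illustrative, and the elementwise product used to form the hybrid attention map is an assumption rather than the exact fusion in Figure 1.

```python
import torch.nn as nn

class HANetSketch(nn.Module):
    """Illustrative wiring of the two-part HANet pipeline (module names are assumptions)."""

    def __init__(self, rgb_encoder, depth_encoder, attention_decoder, prediction_head):
        super().__init__()
        self.rgb_encoder = rgb_encoder              # blue blocks: multilevel RGB features
        self.depth_encoder = depth_encoder          # green blocks: depth attention map
        self.attention_decoder = attention_decoder  # orange blocks: RGB attention map
        self.prediction_head = prediction_head      # pink block: saliency prediction

    def forward(self, rgb, depth):
        low, mid1, mid2, high = self.rgb_encoder(rgb)       # four RGB feature levels
        depth_att = self.depth_encoder(depth)               # depth attention map
        rgb_att = self.attention_decoder(high, mid2, mid1)  # decoded RGB attention map
        hybrid_att = rgb_att * depth_att                    # hybrid attention map (assumed product)
        return self.prediction_head(low * hybrid_att)       # filter low-level features and predict
```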

2.1. Feature Extraction

We adopt two popular backbone networks for feature extraction. Specifically, Res2Net [30] extracts RGB features, and a simplified Inception-v4-ResNet2 [29] extracts depth features. The latter can handle the relatively limited information in depth maps while preventing overfitting and reducing computation time by omitting unnecessary parameters. Therefore, we establish an asymmetric architecture for this two-stream network.
For RGB images, the Res2Net backbone has been used to extract multilevel features for different tasks and is widely used in semantic segmentation, key-point estimation, and salient object detection, with comprehensive experiments on many datasets and benchmarks verifying its excellent generalization ability [30]. For salient object detection, we remove all fully connected layers of Res2Net so that the output is an image. To preserve feature information, we delete the first max pooling layer of the network and set the stride of the first convolution to 1 (instead of 2) to prevent excessive downsampling, which would otherwise cause severe information loss and prevent the reconstruction of object details after saliency detection. As we obtain features at each downsampling stage, Res2Net provides four outputs: low-level features extracted by Layer1, middle-level features extracted by Layer2 and Layer3, and high-level features extracted by Layer4. In [27], 1 × 1 convolutions serve a dual purpose: most critically, they act as dimension-reduction modules that remove computational bottlenecks which would otherwise limit network size, allowing both the depth and the width of a network to increase without a significant performance penalty. Inspired by [27], we use four 1 × 1 convolutions to reduce the number of channels to one-eighth of the original number, which is otherwise high and leads to long computation times during both training and inference.
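A minimal sketch of such a backbone adaptation is given below, assuming a res2net50 implementation with a torchvision-style layout; the import path, attribute names, and channel counts are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch.nn as nn
from res2net import res2net50  # hypothetical import; any Res2Net implementation with a
                               # torchvision-style layout (conv1, layer1..layer4) would do

class RGBFeatureExtractor(nn.Module):
    """Res2Net trunk with the FC head removed, reduced early downsampling,
    and 1x1 convolutions shrinking each output to one-eighth of its channels."""

    def __init__(self):
        super().__init__()
        net = res2net50(pretrained=True)
        net.conv1.stride = (1, 1)                 # stride 2 -> 1 to limit downsampling
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # first max pooling removed
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # 1x1 convolutions reduce channels to one-eighth (assumed widths 256..2048)
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, c // 8, kernel_size=1) for c in (256, 512, 1024, 2048)
        )

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)    # low-level features
        f2 = self.layer2(f1)   # middle-level features
        f3 = self.layer3(f2)   # middle-level features
        f4 = self.layer4(f3)   # high-level features
        return [r(f) for r, f in zip(self.reduce, (f1, f2, f3, f4))]
```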
For depth maps, we use a simplified Inception-v4-ResNet2. To reduce the computational complexity, we adopt only its Stem part and five Inception-ResNet-A blocks. In addition, we follow the same procedure as for RGB images to ensure that the output is an image: we delete the first max pooling layer, set the stride of the convolution to 1, and use 3 × 3 convolutions to construct the depth attention map.
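A corresponding sketch of the depth branch is shown below, where the Stem and Inception-ResNet-A blocks of [29] are assumed to be available as modules; their constructors, the channel width, and the final sigmoid are illustrative assumptions.

```python
import torch.nn as nn

class DepthAttentionBranch(nn.Module):
    """Simplified Inception-v4-ResNet2 branch: Stem + five Inception-ResNet-A blocks,
    followed by a 3x3 convolution producing a single-channel depth attention map."""

    def __init__(self, stem: nn.Module, block_a, channels: int = 384):
        super().__init__()
        self.stem = stem                                       # hypothetical Stem module from [29]
        self.blocks = nn.Sequential(*(block_a() for _ in range(5)))
        self.to_attention = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),  # width depends on the chosen blocks
            nn.Sigmoid(),                                      # assumed: attention values in [0, 1]
        )

    def forward(self, depth):
        return self.to_attention(self.blocks(self.stem(depth)))
```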

2.2. Hybrid Attention Predictor

The depth principle error described above arises because the objects closest to the depth sensor have either the lowest or the highest intensities in a disparity map, regardless of whether they are salient. When a neural network searches for salient objects in depth maps, it can be misled by such objects. Therefore, a single-modal attention map containing only depth information is biased. By leveraging the complementarity between RGB and depth information, we can eliminate the depth principle error by constructing a hybrid attention map. This map combines the RGB and depth modes to obtain a weighted attention map in which each pixel carries information on its likelihood of belonging to a salient object.
To obtain the hybrid attention map, we devise a decoder network (orange blocks in Figure 1) that consists of 3 × 3 convolutions and bilinear interpolation upsampling. After each upsampling, we concatenate the lower-level and current features. The decoder blocks can be represented by the following formula:
R^n = \sum_{k=1}^{C} U\Big( F\big( \big( R_k^{n-1} \oplus r_k^{n-1} \big) \cdot W \big) \Big),
where F denotes convolution, U denotes upsampling, k indexes the feature channels, R_k^{n-1} is the k-th channel of the (n − 1)-th RGB attention features extracted by the corresponding block in the decoder network, r_k^{n-1} is the k-th channel of the (n − 1)-th RGB features extracted by Res2Net after channel reduction by the 1 × 1 convolutions, ⊕ denotes concatenation, and W is the convolution parameter.
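A minimal sketch of one decoder step in the spirit of this formula (concatenate, convolve, upsample) is given below, assuming bilinear upsampling for U and a ReLU-activated 3 × 3 convolution for F(· W); the activation and the factor-of-two upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One decoder step: concatenate the previous attention features R^{n-1} with the
    channel-reduced RGB features r^{n-1}, convolve, then upsample."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, R_prev: torch.Tensor, r_prev: torch.Tensor) -> torch.Tensor:
        x = torch.cat([R_prev, r_prev], dim=1)       # channel-wise concatenation (⊕)
        x = F.relu(self.conv(x))                     # 3x3 convolution with parameters W
        return F.interpolate(x, scale_factor=2,      # bilinear upsampling U
                             mode="bilinear", align_corners=False)
```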
When the RGB attention map is obtained after decoding, we aggregate it with the depth attention map to generate the hybrid attention map. This cross-modal attention map provides accurate localization of objects in the image. We then multiply the hybrid attention map with the low-level RGB features, and several convolutions and upsampling operations lead to the final prediction map for salient object detection.
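A sketch of this fusion step follows; the elementwise product used to aggregate the two attention maps, the head layers, and the final sigmoid are assumptions, since the text specifies only that the maps are combined and that the result multiplies the low-level RGB features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttentionHead(nn.Module):
    """Combine the RGB and depth attention maps, filter the low-level RGB features,
    and decode the result into a single-channel saliency prediction."""

    def __init__(self, low_channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(low_channels, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, low_feat, rgb_att, depth_att, out_size):
        hybrid_att = rgb_att * depth_att              # assumed elementwise aggregation
        x = low_feat * hybrid_att                     # filter the low-level RGB features
        x = self.refine(x)                            # several convolutions
        x = F.interpolate(x, size=out_size,           # upsample to the input resolution
                          mode="bilinear", align_corners=False)
        return torch.sigmoid(x)                       # saliency prediction in [0, 1]
```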

2.3. Loss Function

We use binary cross-entropy as the loss function for HANet:
L(Y, G) = -\sum_{h}\sum_{w} \Big[ G(h, w) \log Y(h, w) + \big( 1 - G(h, w) \big) \log\big( 1 - Y(h, w) \big) \Big],
where (h, w) indexes the pixel position in the image, Y is the prediction map, and G is the ground truth. Thus, L(Y, G) provides the final loss value between the prediction and the label map.
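In PyTorch, this per-pixel loss corresponds to the built-in binary cross-entropy criterion. The sketch below assumes the network output Y already lies in (0, 1) (e.g., after a sigmoid) and uses the default mean reduction, which only rescales the summed loss.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # per-pixel binary cross-entropy, averaged over all pixels

# Y: predicted saliency map in (0, 1); G: binary ground truth; shape (batch, 1, H, W)
Y = torch.rand(2, 1, 224, 224)
G = (torch.rand(2, 1, 224, 224) > 0.5).float()
loss = criterion(Y, G)
print(loss.item())
```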

3. Evaluation Measures and Implementation Details

3.1. Evaluation Measures

To comprehensively evaluate the detection performance of various saliency methods, we adopt five evaluation measures: precision–recall curve, maximum and mean F-measure, mean absolute error, and area under the precision–recall curve [31,32].
The binary saliency map Y_b corresponding to a threshold is compared to the ground truth G, and precision P and recall R are computed as
P = \frac{\sum_{h}\sum_{w} Y_b(h, w)\, G(h, w)}{\sum_{h}\sum_{w} Y_b(h, w)},
R = \frac{\sum_{h}\sum_{w} Y_b(h, w)\, G(h, w)}{\sum_{h}\sum_{w} G(h, w)}.
The average precision and recall for images in each dataset are plotted in a precision–recall curve. An adaptive threshold is applied to the grayscale saliency map to obtain the corresponding binary saliency map. For each saliency map, the precision and recall are computed using (3) and (4). Then, Fβ is defined as
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},
where β is a positive parameter specifying the relative importance of precision and recall. For consistency while comparing the performance of the proposed network with that of other methods, we set β = 0.3.
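A minimal NumPy sketch of these measures for a single saliency map follows; the adaptive threshold of twice the mean saliency value is a common choice and is an assumption here, as is the small epsilon added to avoid division by zero.

```python
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray, beta: float = 0.3) -> float:
    """Precision, recall, and F-measure for one saliency map.
    pred: grayscale saliency map in [0, 1]; gt: binary ground truth in {0, 1}."""
    threshold = min(2.0 * pred.mean(), 1.0)          # adaptive threshold (assumed: 2x mean)
    binary = (pred >= threshold).astype(np.float64)  # binary saliency map Y_b
    tp = (binary * gt).sum()                         # overlap with the ground truth
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-8)
```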
The mean absolute error reflects the average absolute pixelwise difference between the predicted saliency maps and corresponding ground truth. Thus, it is an important measure to evaluate the proposed HANet, and it is given by
MAE = \frac{1}{H \times W} \sum_{w=1}^{W} \sum_{h=1}^{H} \big| Y(h, w) - G(h, w) \big|,
where H and W are the numbers of rows and columns in the saliency map, respectively.
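The mean absolute error reduces to a one-line computation; a sketch assuming both maps are normalized to [0, 1] and share the same shape:

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a predicted saliency map and its ground truth,
    both normalized to [0, 1] and of identical shape (H, W)."""
    return float(np.abs(pred - gt).mean())
```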

3.2. Implementation Details

We implement HANet using the PyTorch 1.2.0 library in Python. We apply Adam optimization with a learning rate of 0.001, which is reduced by a factor of 2 if no improvement is observed in the validation performance over five consecutive epochs. The NJUD dataset [31], containing more than 2000 images, and the NLPR dataset [32], containing 1000 images, both with corresponding pixel-level ground truths, are used to evaluate the proposed HANet. We follow the dataset splitting scheme proposed in [18,21]: 80% of the images are used for training and the remaining 20% for testing. All images are resized to 224 × 224 pixels. The network is trained over 100 epochs with early stopping, and a minibatch of 2 images is used at every training iteration. In this study, HANet was trained on a computer equipped with an Intel i7-7750H CPU at 2.21 GHz and an NVIDIA GeForce GTX TITAN Xp GPU.
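A sketch of this optimization setup is shown below, assuming the model and data loaders already exist; ReduceLROnPlateau with factor 0.5 and patience 5 reproduces the rule of halving the learning rate after five epochs without validation improvement, and early stopping is omitted for brevity.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# model, train_loader, and val_loader are assumed to exist
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)
criterion = torch.nn.BCELoss()

for epoch in range(100):                        # up to 100 epochs (early stopping omitted)
    model.train()
    for rgb, depth, gt in train_loader:         # minibatches of 2 images at 224x224
        optimizer.zero_grad()
        loss = criterion(model(rgb, depth), gt)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(rgb, depth), gt).item()
                       for rgb, depth, gt in val_loader) / len(val_loader)
    scheduler.step(val_loss)                    # halve the LR after 5 stagnant epochs
```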

4. Results and Discussion

4.1. Comparison with State-of-the-Art Methods

We compared the proposed method with seven state-of-the-art methods: ACSD [31], CDCP [33], DCMC [34], DF [17], MBP [21], PDNet [18], and SFP [35]. Table 1 and Figure 2 show that the proposed HANet outperforms the other evaluated methods. Figure 3 shows the saliency maps obtained by each method in typical scenarios. In the first and second rows, the closest objects are non-salient and have the highest pixel intensities; the comparison methods misjudge these two images due to the depth principle error. In contrast, HANet correctly detects the salient objects by using the information in the hybrid attention map.
To further demonstrate the effectiveness of HANet, we conducted an ablation study by removing the RGB attention map. The results are shown in the 12th column of Figure 3, where the miscalculation due to the depth principle error appears. The third and fourth rows show the saliency maps obtained by HANet in scenes with multiple and large salient objects, confirming the effectiveness of the proposed method.

4.2. Ablation Study

To analyze the effectiveness of the proposed hybrid attention mechanism and of the RGB attention map in correcting mistakes caused by the depth principle error, we removed Layer2, Layer3, and Layer4 and their corresponding 1 × 1 convolutions from HANet. In addition, we removed the upsampling and convolution used during fusion and omitted the RGB attention map and thus its combination with the depth attention map. Table 2 and Figure 4 show that the saliency results deteriorate substantially, as illustrated in the 12th column of Figure 3, where the depth principle error is evident. Therefore, the complete HANet accurately predicts salient objects and eliminates interference caused by the depth principle error.

4.3. Computational Complexity

The computational complexity of the proposed HANet and the other methods was estimated from tests on the NJUD dataset. It takes approximately 4 h to train HANet using an Intel i5-7500 CPU at 3.4 GHz and an NVIDIA GeForce GTX TITAN Xp GPU. HANet achieves saliency detection at 11.6 fps for images of 224 × 224 pixels. Therefore, our model has low computational complexity and can be applied to real-time image processing systems.
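Throughput figures of this kind are typically measured as in the sketch below; the model handle and the warm-up and iteration counts are illustrative, and CUDA synchronization before reading the clock is needed for accurate GPU timing.

```python
import time
import torch

# model is a trained network in eval mode on the GPU; shapes follow the 224x224 input size
dummy_rgb = torch.randn(1, 3, 224, 224, device="cuda")
dummy_depth = torch.randn(1, 1, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(dummy_rgb, dummy_depth)
    torch.cuda.synchronize()                 # wait for pending GPU work before timing
    start = time.time()
    for _ in range(100):
        model(dummy_rgb, dummy_depth)
    torch.cuda.synchronize()
    fps = 100 / (time.time() - start)

print(f"{fps:.1f} fps")
```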

5. Conclusions

We propose HANet, a hybrid network based on an attention mechanism for stereoscopic salient object detection. HANet uses a novel attention method that fuses RGB and depth attention maps to filter the original saliency features. Combined with an encoder–decoder network, HANet provides higher performance than the compared methods on the NJUD and NLPR datasets. Furthermore, an ablation study confirms that the HANet performance decreases when the RGB attention map is removed, indicating the effectiveness of the proposed hybrid attention mechanism. The RGB attention map helps resolve interference caused by the depth principle error, which occurs when non-salient objects are close to the depth sensor. Moreover, HANet performs well in scenes containing multiple objects, large objects, and other complex information.

Author Contributions

Y.C. conceived and designed the experiments, analyzed the data, and wrote the paper. W.Z. supervised the work, helped design the conceptual framework, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (Grant No. 61502429), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY18F020012), and the China Postdoctoral Science Foundation (Grant No. 2015M581932).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, W.; Yuan, J.; Lei, J.; Luo, T. TSNet: Three-stream self-attention network for RGB-D indoor semantic segmentation. IEEE Intell. Syst. 2020. [Google Scholar] [CrossRef]
  2. Bogdan, A.; Thomas, D.; Vittorio, F. Measuring the objectness of image windows. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2189–2202. [Google Scholar]
  3. Zhang, H.; Cao, X.; Wang, R. Audio visual attribute discovery for fine-grained object recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  4. Zhou, W.; Lei, J.; Jiang, Q.; Yu, L.; Luo, T. Blind binocular visual quality predictor using deep fusion network. IEEE Trans. Comput. Imaging 2020, 6, 883–893. [Google Scholar] [CrossRef]
  5. Liu, H.; Jiang, S.; Huang, Q.; Xu, C. A generic virtual content insertion system based on visual attention analysis. In Proceedings of the 16th ACM international conference on Multimedia, Vancouver, BC, Canada, 27–31 October 2008. [Google Scholar]
  6. Zhou, W.; Lv, Y.; Lei, J.; Yu, L. Global and local-contrast guides content-aware fusion for RGB-D saliency prediction. IEEE Trans. on Syst. Man Cybern. Syst. 2019, 1–9. [Google Scholar] [CrossRef]
  7. Desingh, K.; Krishna, K.M.; Rajan, D.; Jawahar, C.V. Depth really matters: Improving visual salient region detection with depth. In Proceedings of the BMVC 2013—British Machine Vision Conference, Bristol, UK, 9–13 September 2013. [Google Scholar]
  8. Lang, C.; Nguyen, T.V.; Katti, H.; Yadati, K.; Kankanhalli, M.; Yan, S. Depth matters: Influence of depth cues on visual saliency. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  9. Zhou, Q.; Zhang, L.; Zhao, W.; Liu, X.; Chen, Y.; Wang, Z. Salient object detection using coarse-to-fine processing. J. Opt. Soc. Am. A 2017, 34, 370–383. [Google Scholar] [CrossRef] [PubMed]
  10. Liu, D.; Chang, F.; Liu, C. Salient object detection fusing global and local information based on nonsubsampled contourlet transform. J. Opt. Soc. Am. A 2016, 33, 1430–1441. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H. Salient object detection in the deep learning era: An in-depth survey. Available online: https://arxiv.org/pdf/1904.09146 (accessed on 18 August 2020).
  12. Li, C.Y.; Guo, J.C.; Cong, R.M.; Pang, Y.W.; Wang, B. Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE Trans. Image Process. 2016, 25, 5664–5677. [Google Scholar] [CrossRef] [PubMed]
  13. Keren, F.; Fan, D.P.; Ji, G.P.; Zhao, Q. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  14. Zhou, W.; Chen, Y.; Liu, C.; Yu, L. GFNet: Gate fusion network with Res2Net for detecting salient objects in RGB-D images. IEEE Signal Process. Lett. 2020, 27, 800–804. [Google Scholar] [CrossRef]
  15. Fan, D.P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.M. Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2020. [Google Scholar] [CrossRef]
  16. Zhang, J.; Fan, D.P.; Dai, Y.; Anwar, S.; Saleh, F.S.; Zhang, T.; Barnes, N. UC-net: Uncertainty inspired rgb-d saliency detection via conditional variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  17. Qu, L.; He, S.; Zhang, J.; Tian, J.; Tang, Y.; Yang, Q. RGBD salient object detection via deep fusion. IEEE Trans. Image Process. 2017, 26, 2274–2285. [Google Scholar] [CrossRef]
  18. Zhu, C.; Cai, X.; Huang, K.; Li, T.H.; Li, G. PDNet: Prior-model guided depth-enhanced network for salient object detection. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 199–204. [Google Scholar] [CrossRef] [Green Version]
  19. Han, J.; Chen, H.; Liu, N.; Yan, C.; Li, X. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Cybern. 2018, 48, 3171–3183. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, H.; Li, Y. Progressively complementarity-aware fusion network for RGB-D salient object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3051–3060. [Google Scholar] [CrossRef]
  21. Zhu, C.; Li, G. A multilayer backpropagation saliency detection algorithm and its applications. Multimed. Tools Appl. 2018, 77, 25181–25197. [Google Scholar] [CrossRef] [Green Version]
  22. Huang, P.; Shen, C.H.; Hsiao, H.F. RGBD salient object detection using spatially coherent deep learning framework. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Pudong, Shanghai, China, 19–21 November 2018. [Google Scholar]
  23. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef] [Green Version]
  24. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef] [Green Version]
  25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. Available online: https://arxiv.org/pdf/1409.1556.pdf (accessed on 18 August 2020).
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef] [Green Version]
  27. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical image computing and computer-assisted intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  29. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-first AAAI conference on artificial intelligence, San Francisco, CA, USA, 4–10 February 2017. [Google Scholar]
  30. Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P.H.S. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 2938758. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Ju, R.; Ge, L.; Geng, W.; Ren, T.; Wu, G. Depth saliency based on anisotropic center-surround difference. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 1115–1119. [Google Scholar] [CrossRef]
  32. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD salient object detection: A benchmark and algorithms. In Proceedings of the 2014 European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  33. Zhu, C.; Li, G.; Wang, W.; Wang, R. An innovative salient object detection using center-dark channel prior. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 1509–1515. [Google Scholar] [CrossRef]
  34. Cong, R.; Lei, J.; Zhang, C.; Huang, Q.; Cao, X.; Hou, C. Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE Signal Process. Lett. 2016, 23, 819–824. [Google Scholar] [CrossRef] [Green Version]
  35. Guo, J.; Ren, T.; Bei, J.; Zhu, Y. Salient object detection in RGB-D image based on saliency fusion and propagation. In Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, Zhangjiajie, Hunan, China, 19–21 August 2015. [Google Scholar] [CrossRef]
Figure 1. Framework of our HANet. The RGB-D image is selected from Ref. [31].
Figure 2. Precision–recall curves of different methods on the testing set of the NJUD and NLPR datasets.
Figure 3. Examples of salient object detection from the testing set. (a) Original image, (b) depth map, and (c) ground truth. Saliency maps obtained from (d) ACSD, (e) CDCP, (f) DCMC, (g) DF, (h) MBP, (i) PDNet, (j) SFP, (k) the proposed HANet, and (l) HANet without the RGB attention map (ablation study). The RGB-D images are selected from Ref. [33].
Figure 4. Precision–recall curves obtained from the ablation study applied to images from the NJUD (left) and NLPR (right) datasets.
Table 1. Saliency Detection Performance of Different Methods on the Testing Set of the NJUD and NLPR Datasets.

Dataset  Criteria  ACSD   CDCP   SFP    DCMC   DF     MBP    PDNet  Ours
NJUD     AUC       0.923  0.822  0.871  0.926  0.928  0.703  0.952  0.964
         MeanF     0.551  0.572  0.482  0.601  0.654  0.479  0.719  0.834
         MaxF      0.733  0.594  0.655  0.740  0.782  0.557  0.796  0.866
         MAE       0.190  0.204  0.202  0.154  0.154  0.207  0.129  0.065
NLPR     AUC       0.837  0.895  0.864  0.931  0.841  0.781  0.957  0.982
         MeanF     0.461  0.600  0.426  0.590  0.660  0.547  0.610  0.827
         MaxF      0.615  0.654  0.562  0.703  0.745  0.598  0.720  0.869
         MAE       0.156  0.126  0.180  0.120  0.112  0.117  0.119  0.055
Table 2. Performance of HANet During Ablation Study on the NJUD and NLPR Datasets.

Dataset  Criteria  Single-Attention  Multi-Attention
NJUD     AUC       0.935             0.964
         MeanF     0.670             0.834
         MaxF      0.755             0.866
         MAE       0.150             0.065
NLPR     AUC       0.959             0.982
         MeanF     0.721             0.827
         MaxF      0.783             0.869
         MAE       0.091             0.055
