**1. Introduction**

Saliency detection extracts relevant objects with pixel-level details from an image. It has been widely used in many fields such as object segmentation [1], region proposal [2], object recognition [3], image quality assessment [4], and video analysis [5]. It has been found that when the background has colors similar to those of a salient object, or when the background is highly complex and the salient objects are very large or small, saliency detection based solely on RGB images often fails to provide accurate results. Therefore, depth information is increasingly used as a supplement to RGB information for saliency detection [6–8]. RGB-D salient object detection based on handcrafted features generally uses depth maps to determine edges, textures, and histogram statistics, and then bottom-up [9] or top-down [10] approaches are used to predict whether a pixel belongs to a salient object. Various methods consider the rarity of pixels in local and global regions of an image [11], while others use prior knowledge to support prediction and obtain accurate detection [12]. However, these methods rely on handcrafted features, empirical parameter setting, and statistical prediction, which limit their performance. In fact, such methods cannot fully extract representative features due to inadequate parameter setting, subjective factors, and redundant or erroneous information. In addition, the underlying models of the human visual system may be incomplete and misleading.

Alternatively, deep learning methods have emerged in recent years, improving the accuracy of salient object detection [13–16]. By combining the advantages of deep learning with the features in depth maps, several stereoscopic saliency detection methods based on neural networks have achieved large gains in accuracy. For instance, DF combines RGB images and depth maps into a deep learning framework [17]. Encoder–decoder networks, such as PDNet [18], provide high accuracy and robustness. Chen et al. further improved the results by proposing hidden structure conversion [19], complementary fusion [20], a dilated convolutional model [21], and modifications to loss functions [22] for highly accurate salient object detection. On the other hand, methods based on attention mechanisms can quickly identify the position of objects and then reconstruct their edges to improve salient object detection. Wang et al. proposed a residual network with an attention mechanism [23] and then DANet [24], which achieves accurate results by using channel and spatial attention maps.

Current stereoscopic salient object detection based on deep learning usually adopts networks such as VGG [25], ResNet [26], and Inception [27] as its backbone and the U-Net encoder–decoder structure [28] as the framework. However, this is not an ideal solution for saliency detection. As the depth map (disparity map) is an image reflecting the distances to objects, many networks use it to generate an attention map that distinguishes objects from the background. However, depth maps have two major limitations. First, the depth map reflects the distance to all objects, and some non-salient objects may be closest to the camera and thus have the lowest (or highest) pixel intensities. The underlying network may then consider such objects salient, a phenomenon that we call the depth principle error. Second, data acquisition limitations may degrade the accuracy of edge information in the depth map.

Overall, neural networks that locate objects using an attention map constructed from depth information alone may be biased. Using the RGB image to discard the closest non-salient objects in depth maps can improve the detection accuracy. Based on spatial attention maps, we propose stereoscopic salient object detection using a hybrid attention network (HANet). Before the features are processed for saliency detection, high-level features extracted from the RGB image are encoded into an attention map, which is then mixed with the depth attention map for subsequent joint processing with the saliency features. Experimental results show that this method prevents interference from the non-salient objects present in depth maps. In addition, unlike many symmetric neural networks, the proposed asymmetric network has fewer parameters, because the depth map carries less information and a large network is unnecessary. Thus, we use a simplified Inception-v4-ResNet2 [29] architecture with fewer parameters to extract the depth attention map, and a Res2Net [30] architecture to extract features and construct the RGB attention map, which contains more complex information. The proposed asymmetric HANet prevents the depth principle error by filtering features with cross-modal attention maps obtained separately from RGB and depth data.

### **2. Proposed Method**

The proposed HANet architecture achieves salient object detection and prevents the depth principle error. The processing pipeline of HANet is shown in Figure 1. HANet can be divided into two main parts. The first part extracts features through eight neural network blocks (shown in blue in Figure 1) for the RGB attention map and through two blocks (shown in green) for the depth attention map. The second part consists of six blocks (shown in orange in Figure 1) that fuse the two types of features to generate a hybrid attention map, and one block (shown in pink) that generates the saliency prediction map according to feature filtering based on the hybrid attention map.

**Figure 1.** Framework of our HANet. The RGB-D image is selected from Ref. [31].

### *2.1. Feature Extraction*

We adopt two popular backbone networks for feature extraction. Specifically, Res2Net [30] extracts RGB features, and a simplified Inception-v4-ResNet2 [29] extracts depth features. The latter can handle the relatively limited information in depth maps while preventing overfitting and reducing the computation time by omitting unnecessary parameters. Therefore, we establish an asymmetric architecture for this two-stream network.

For RGB images, the Res2Net backbone has been used to extract multilevel features for different tasks and is widely used in semantic segmentation, key-point estimation, and salient object detection. We have conducted comprehensive experiments on many datasets and benchmarks and verified the excellent generalization ability of Res2Net. For salient object detection, we remove all the fully connected layers of Res2Net to ensure that the output is an image. To preserve the feature information, we delete the first max pooling layer of the network and set the stride of the convolution to 1 (instead of 2) to prevent excessive downsampling. This prevents severe information loss and the failure to reconstruct object details after saliency detection. As we obtain the features at each downsampling step, Res2Net provides four outputs: low-level features extracted by Layer1, middle-level features extracted by Layer2 and Layer3, and high-level features extracted by Layer4. In [27], 1 × 1 convolutions serve a dual purpose: most critically, they act as dimension-reduction modules that remove computational bottlenecks which would otherwise limit the network size, allowing both the depth and the width of the network to be increased without a significant performance penalty. Inspired by [27], we use four 1 × 1 convolutions to reduce the number of channels to one-eighth of the original number, which is otherwise high and requires a long computation time during both training and inference.
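
For concreteness, the following is a minimal PyTorch sketch of this RGB branch: a stand-in backbone with a stride-1 stem, no initial max pooling, four feature levels, and four 1 × 1 convolutions that reduce each output to one-eighth of its channels. The stand-in stages, class names, and channel widths are illustrative assumptions, not the exact Res2Net configuration.

```python
# Minimal sketch of the RGB feature-extraction branch (stand-in for Res2Net).
import torch
import torch.nn as nn


class RGBBranch(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512)):
        super().__init__()
        # Stem with stride 1 and no initial max pooling,
        # to avoid excessive downsampling of fine details.
        self.stem = nn.Sequential(
            nn.Conv2d(3, widths[0], kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(widths[0]),
            nn.ReLU(inplace=True),
        )
        # Four stages standing in for Res2Net Layer1..Layer4.
        self.layers = nn.ModuleList()
        in_ch = widths[0]
        for w in widths:
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            in_ch = w
        # Four 1x1 convolutions reducing each output to 1/8 of its channels.
        self.reduce = nn.ModuleList(
            nn.Conv2d(w, max(w // 8, 1), kernel_size=1) for w in widths
        )

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for layer, red in zip(self.layers, self.reduce):
            x = layer(x)
            feats.append(red(x))
        # feats[0]: low-level, feats[1:3]: middle-level, feats[3]: high-level
        return feats


if __name__ == "__main__":
    for f in RGBBranch()(torch.randn(1, 3, 224, 224)):
        print(f.shape)
```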

For depth maps, we use a simplified Inception-v4-ResNet2. To reduce the computational complexity, we only adopt its Stem part and five Inception-ResNet-A blocks. In addition, we follow the same procedure as for RGB images to ensure that the output is an image. Likewise, we delete the first max pooling layer, set the stride of the convolution to 1, and use 3 × 3 convolutions to construct the depth attention map.
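
A similar minimal sketch of the depth branch is given below, assuming a stem followed by five residual Inception-style blocks and a 3 × 3 convolution that outputs a single-channel depth attention map; the block layout and channel widths are illustrative assumptions, not the published Inception-v4-ResNet2 design.

```python
# Minimal sketch of the depth branch (simplified Inception-ResNet-style stem + blocks).
import torch
import torch.nn as nn


class InceptionResBlockA(nn.Module):
    """Simplified Inception-ResNet-A style block: parallel 1x1 and 3x3 branches
    merged by a 1x1 convolution and added back to the input (residual)."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, 32, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
        )
        self.merge = nn.Conv2d(64, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch3(x)], dim=1)
        return self.act(x + self.merge(y))


class DepthBranch(nn.Module):
    def __init__(self, channels=64, num_blocks=5):
        super().__init__()
        # Stem with stride 1 and no max pooling, as in the RGB branch.
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(*[InceptionResBlockA(channels)
                                      for _ in range(num_blocks)])
        # 3x3 convolution producing the single-channel depth attention map.
        self.attn = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, depth):
        x = self.blocks(self.stem(depth))
        return torch.sigmoid(self.attn(x))


if __name__ == "__main__":
    print(DepthBranch()(torch.randn(1, 1, 224, 224)).shape)  # (1, 1, 224, 224)
```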

### *2.2. Hybrid Attention Predictor*

As described above, the depth principle error arises because the objects closest to the depth sensor have either the lowest or highest intensities in a disparity map. When a neural network searches for salient objects in depth maps, it can be misled by such objects. Therefore, a single-modal attention map containing only depth information is biased. By leveraging the complementarity between RGB and depth information, we can eliminate the depth principle error by constructing a hybrid attention map. This map combines the RGB and depth modes to obtain a weighted attention map in which each pixel carries information on its likelihood of belonging to a salient object.

To obtain the hybrid attention map, we devise a decoder network (orange blocks in Figure 1) that consists of 3 × 3 convolutions and bilinear interpolation upsampling. After each upsampling, we concatenate the lower-level features with the current features. The decoder blocks can be represented by the following formula:

$$R^n = \sum_{k=1}^{C} \mathcal{U}\left(F\left(R_k^{n-1} \oplus r_k^{n-1}; \mathcal{W}\right)\right),\tag{1}$$

where $F$ denotes convolution, $\mathcal{U}$ denotes upsampling, $k$ is the feature channel index, $C$ is the number of channels, $R_k^{n-1}$ is the $k$-th channel of the $(n-1)$-th RGB attention features extracted by the corresponding block in the decoder network, $r_k^{n-1}$ is the $k$-th channel of the $(n-1)$-th RGB features extracted by Res2Net after channel reduction by the $1 \times 1$ convolutions, $\oplus$ denotes concatenation, and $\mathcal{W}$ denotes the convolution parameters.
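
As a concrete illustration, the following PyTorch sketch implements one decoder block in the spirit of Equation (1): the previous decoder output is concatenated with the channel-reduced encoder features, passed through a 3 × 3 convolution, and bilinearly upsampled. The class name, channel counts, and normalization choice are illustrative assumptions.

```python
# Minimal sketch of one decoder block corresponding to Equation (1).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    def __init__(self, in_decoder, in_skip, out_channels):
        super().__init__()
        # F(.; W): 3x3 convolution applied to the concatenated features.
        self.conv = nn.Sequential(
            nn.Conv2d(in_decoder + in_skip, out_channels,
                      kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, r_prev, skip):
        # r_prev: R^{n-1} (previous decoder features); skip: r^{n-1} (Res2Net
        # features after 1x1 channel reduction). Match spatial sizes first.
        if r_prev.shape[-2:] != skip.shape[-2:]:
            r_prev = F.interpolate(r_prev, size=skip.shape[-2:],
                                   mode="bilinear", align_corners=False)
        x = torch.cat([r_prev, skip], dim=1)  # concatenation (⊕)
        x = self.conv(x)                      # F(.; W)
        # U(.): bilinear upsampling toward the next (finer) decoder level.
        return F.interpolate(x, scale_factor=2,
                             mode="bilinear", align_corners=False)
```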

Once the RGB attention map is obtained after decoding, we aggregate it with the depth attention map to generate the hybrid attention map. This cross-modal attention map provides accurate localization of objects in the image. Then, we multiply this map with the low-level RGB features, and several convolutions and upsampling operations produce the prediction map for salient object detection.
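
The sketch below illustrates this final stage under simple assumptions: the two single-channel attention maps are fused by an elementwise sum followed by a sigmoid, the result gates the low-level RGB features, and a small convolutional head with upsampling produces the saliency map. The fusion rule, channel counts, and module names are illustrative, not the exact HANet configuration.

```python
# Minimal sketch of the hybrid-attention filtering and prediction head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridAttentionHead(nn.Module):
    def __init__(self, low_level_channels=8):
        super().__init__()
        self.predict = nn.Sequential(
            nn.Conv2d(low_level_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, rgb_attn, depth_attn, low_level_feat, out_size):
        # Fuse the RGB and depth attention maps (assumed fusion rule).
        hybrid = torch.sigmoid(rgb_attn + depth_attn)
        # Filter the low-level RGB features with the hybrid attention map.
        filtered = low_level_feat * hybrid
        logits = self.predict(filtered)
        # Upsample to the input resolution and map to [0, 1].
        logits = F.interpolate(logits, size=out_size,
                               mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)


if __name__ == "__main__":
    head = HybridAttentionHead()
    rgb_attn = torch.randn(1, 1, 112, 112)
    depth_attn = torch.randn(1, 1, 112, 112)
    low = torch.randn(1, 8, 112, 112)
    print(head(rgb_attn, depth_attn, low, (224, 224)).shape)  # (1, 1, 224, 224)
```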
