Article

Cross-Modal Adaptive Interaction Network for RGB-D Saliency Detection

1 College of Computer Science and Technology, Changchun University, Changchun 130022, China
2 Ministry of Education Key Laboratory of Intelligent Rehabilitation and Barrier-Free Access for the Disabled, Changchun 130022, China
3 Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun 130022, China
4 Jilin Rehabilitation Equipment and Technology Engineering Research Center for the Disabled, Changchun 130022, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7440; https://doi.org/10.3390/app14177440
Submission received: 15 July 2024 / Revised: 16 August 2024 / Accepted: 19 August 2024 / Published: 23 August 2024
(This article belongs to the Special Issue Advances in Computer Vision and Semantic Segmentation, 2nd Edition)

Abstract

The salient object detection (SOD) task aims to automatically detect the most prominent regions that the human eye observes in an image. Since RGB images and depth images contain different information, effectively integrating cross-modal features remains a major challenge in RGB-D SOD. This paper therefore proposes a cross-modal adaptive interaction network (CMANet) for the RGB-D salient object detection task, which consists of a cross-modal feature integration module (CMF) and an adaptive feature fusion module (AFFM). These modules integrate and enhance multi-scale features from both modalities, improve the fusion of the complementary cross-modal information in RGB and depth images, and generate richer and more representative feature maps. Extensive experiments on four RGB-D datasets verify the effectiveness of CMANet. Compared with 17 RGB-D SOD methods, our model accurately detects salient regions in images and achieves state-of-the-art performance across four evaluation metrics.

1. Introduction

Salient object detection (SOD) is a fundamental but challenging task in computer vision. It aims to automatically detect the most distinctive regions in an image and has been widely applied in fields such as video segmentation [1], image retrieval [2], video detection [3], and visual tracking [4]. Saliency detection tasks can be roughly divided into two basic types: RGB salient object detection, which takes only RGB images as input, and RGB-D salient object detection, which feeds both RGB and depth images into the network to obtain the result.
With the development of deep learning and neural networks, RGB salient object detection has made significant progress, and many high-quality models have been proposed [5]. However, due to challenges such as similar image textures and complex backgrounds, detection performance using only RGB images remains unsatisfactory. Generally speaking, humans not only perceive color, texture, and other appearance cues in a scene but also capture depth information through the binocular visual system. Depth information is not affected by complex backgrounds or lighting intensity, and depth images are one way of expressing it. With the rise of depth cameras, corresponding depth images can be obtained at low cost. Introducing them into salient object detection can simulate the visual perception of the human eye and improve the computer's recognition of targets, which has motivated researchers to develop RGB-D SOD methods. Many experiments have verified that using RGB-D images for salient object detection achieves better results in many difficult environments [6]. However, how to fully explore the correlation between the two modalities and exploit their respective advantages to integrate RGB and depth information remains an open issue [7,8].
As a pixel-level prediction task, RGB-D salient object detection based on deep learning mostly adopts an encoder–decoder architecture. The encoder typically uses a pre-trained convolutional neural network to map the input image to hidden representations and extract multi-level features; the decoder then restores the image resolution and generates the saliency map. According to the stage at which RGB and depth information are combined in the network, fusion methods are commonly divided into three types: early fusion, mid-term fusion, and late fusion [9], as shown in Figure 1. In early fusion, the RGB image and the depth image are simply concatenated along the channel dimension at the input stage, forming a four-channel image that is then processed by the encoder and decoder [10]. However, this fusion method ignores the inherent differences between RGB and depth images, which may lead to inaccurate detection results. Late fusion uses two independent networks to predict saliency maps from the RGB and depth images, respectively, and then simply multiplies or adds the two maps to obtain the final result [11]. This strategy loses much of the deep information in the image and cannot exploit the intrinsic relationships between RGB and depth images, often producing blurred edges or incompletely predicted regions [12]. To address these issues, researchers have explored mid-term fusion, which investigates how to fuse multi-level features to extract complementary information and thereby balance accuracy and efficiency.
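To make the fusion strategies concrete, the following PyTorch-style sketch illustrates the early-fusion and late-fusion baselines described above; the tensor shapes and layer names are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

# Hypothetical inputs: a batch of RGB images and single-channel depth maps.
rgb = torch.randn(2, 3, 352, 352)    # (B, 3, H, W)
depth = torch.randn(2, 1, 352, 352)  # (B, 1, H, W)

# Early fusion: concatenate along the channel dimension into a four-channel input
# that a single encoder-decoder network then processes end to end.
early_input = torch.cat([rgb, depth], dim=1)       # (B, 4, H, W)
stem = nn.Conv2d(4, 64, kernel_size=3, padding=1)  # first layer must accept 4 channels
features = stem(early_input)

# Late fusion: two independent networks each predict a saliency map, and the two
# maps are combined at the very end (e.g., element-wise product or sum).
sal_rgb = torch.sigmoid(torch.randn(2, 1, 352, 352))    # stand-in for the RGB branch output
sal_depth = torch.sigmoid(torch.randn(2, 1, 352, 352))  # stand-in for the depth branch output
late_fused = sal_rgb * sal_depth
```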
Ji et al. [13] proposed ICNet to mine the complementary relationship between the two modalities and the coherence between layers, fusing high-level RGB and depth features in an interactive and adaptive manner, distinguishing the information from the two modalities, and using multiple levels of deep features to express the RGB features in more detail. Liu et al. [14] proposed a selective self-mutual attention module (S2MA) to fuse the features learned from the two modalities. In addition, several methods use features extracted from depth images to enhance RGB features for saliency detection. Zhu et al. [15] processed depth images with an auxiliary sub-network, encoded the resulting depth features, concatenated them with the RGB image along the channel dimension, and obtained the final saliency map through operations such as upsampling. Chen et al. [16] aimed at a high-accuracy detection network: they designed a dedicated feature extraction network for depth images and proposed an alternating correction strategy in the decoding stage, in which depth features are gradually added to the decoding process for prediction. Although the above models have made undeniable progress and improved the accuracy of RGB-D saliency detection, several obvious problems remain. First, these methods generally treat depth information as an auxiliary cue to refine RGB features while ignoring the attribute dependence between the two modalities. Second, they mainly extract shared features by fusing the single-modal representations of RGB and depth images and then obtain the predicted saliency map through the decoder; this focuses excessively on cross-modal fusion while neglecting the complementary properties of RGB and depth images at different stages. Moreover, if the RGB-D cross-modal information is simply concatenated or added during fusion, the truly useful information may be drowned in a large amount of redundant data.
To address the above problems, we propose a cross-modal adaptive interaction network (CMANet). This network follows an encoder–decoder architecture. In the encoder part, we first use the Res2Net-50 [17] network with the PSA polarized attention mechanism [18] to extract multi-scale features from RGB images and depth images. This combination not only retains the multi-scale feature extraction capability of Res2Net-50 [17] but also utilizes PSA to enhance the global contextual information of the features. We propose a cross-modal feature integration module (CMF) to fully integrate the multi-scale features of RGB and depth images. In the decoder part, we employ the U-Net structure [19], which directly transfers the high-resolution features from the encoder to the corresponding layers of the decoder, aiding the restoration of spatial information and details. To fuse feature maps of different scales into comprehensive full-scale feature maps and improve the model's multi-scale perception ability, we propose an adaptive feature fusion module (AFFM) to integrate these features within the decoder. With this structure, the model can capture global information while retaining the details and boundary information of the image, thereby generating high-quality output results. Figure 2 compares our model with other models.
This paper mainly proposes the following contributions:
  • A cross-modal adaptive interaction network (CMANet) is proposed to address the RGB-D SOD problem. This network integrates and enhances multi-scale features from both modalities, propagating them to subsequent layers while adaptively preserving crucial information at each level. Simultaneously capturing the global context, it preserves image details and spatial information, thereby yielding more accurate output results;
  • We propose a cross-modal feature integration module (CMF), which mainly fuses information from different modalities (such as RGB and depth) to capture the complementary characteristics between the two modalities. Through this module, the model can simultaneously utilize the visual features such as color and texture of RGB images, as well as the spatial structure information of depth images, to generate more discernible and representative feature maps;
  • We propose an adaptive feature fusion module (AFFM) that utilizes the intrinsic features of both modalities and adaptively integrates them with the outcomes from the previously shared network layer. This module merges features of different scales into the full-scale feature map, enhancing the model’s ability to perceive multi-scale information;
  • Extensive experiments show that our cross-modal adaptive interaction network (CMANet) model outperforms 17 previous state-of-the-art methods on four public RGB-D SOD datasets.
The organization of this paper is as follows: In Section 2, we review two related works, namely RGB salient object detection and RGB-D salient object detection. Subsequently, in Section 3, we provide a detailed description of the proposed cross-modal adaptive interaction network (CMANet). In Section 4, extensive experiments are conducted to verify the effectiveness of CMANet, along with detailed experimental setups and results. Finally, Section 5 concludes this paper and discusses future research directions.

2. Related Work

In this section, two types of related work are reviewed, namely salient object detection based on RGB images and salient object detection based on RGB-D images.

2.1. RGB Salient Object Detection

In the early stages of salient object detection, researchers mainly relied on hand-designed features and heuristics. Itti et al. [20] proposed a biologically inspired saliency detection model in 1998, which uses features such as color, intensity, and orientation of images to calculate saliency maps through multi-scale feature fusion. This method is simple and intuitive and simulates the attention perception of the human eye, but it often performs poorly when processing complex scenes.
As a result, researchers developed image contrast and saliency map generation methods. Achanta et al. [21] proposed a saliency detection method based on global contrast in 2009, which generates saliency maps by calculating the global contrast of each pixel in the image. Cheng et al. [22] proposed the regional contrast method in 2014, which generates saliency maps by segmenting the image into superpixel regions and calculating the contrast between regions. These methods improve the performance of saliency detection to a certain extent but still rely on hand-designed features and heuristic rules.
With the advent of deep learning, convolutional neural networks became the dominant approach to salient object detection. Deep learning methods use an end-to-end framework to automatically extract high-level features from massive amounts of data, greatly improving the accuracy of saliency detection. Li et al. [23] proposed the DeepSaliency model in 2016, which uses a multi-task deep neural network to learn saliency detection and related tasks simultaneously, thereby improving detection performance. In recent years, researchers have continued to explore more efficient deep learning models and network architectures to further improve detection accuracy, and RGB salient object detection continues to make significant progress in model architecture and training strategies. Zhuge et al. [24] proposed an Integrity Consensus Network (ICON) in 2023, aiming to improve integrity learning for salient object detection, which effectively improved the accuracy and robustness of salient object detection at both micro and macro levels. Li et al. [25] proposed a cross-modal coordinate attention network (CCAFusion) in 2024 for the fusion of infrared and visible light images; through feature-aware fusion and feature-enhancement fusion modules, combined with multi-scale skip connections, effective information fusion is achieved. Xia et al. [26] proposed a hierarchical attention-related context-driven network (RCNet) for salient object detection in 2024; through multi-receptive-field dilated convolutions and multi-source mixed channel attention mechanisms in a diamond hierarchical structure, RCNet effectively improves the accuracy and robustness of feature representation. Although RGB-based salient object detection methods perform well in many applications, they still struggle with complex backgrounds, occlusions, and low-contrast targets. Relying only on RGB information, it is difficult in some cases to fully distinguish the foreground from the background, which limits detection performance.

2.2. RGB-D Salient Object Detection

In order to overcome the limitations of RGB saliency object detection, RGB-D saliency object detection emerged. RGB-D saliency object detection not only utilizes the color and texture information of RGB images but also combines the geometric and structural information provided by depth images, enabling effective detection of salient objects in more complex environments. Depth information can provide additional clues about the three-dimensional structure of the scene, helping to more accurately locate salient objects in the presence of complex backgrounds and occlusions.
Early RGB-D saliency object detection methods mainly focused on how to effectively fuse RGB and depth information. The approach of feeding RGB images and depth images into two separate networks for feature extraction and fusion and then passing the high-level features to a decoder to obtain the final saliency map [27,28] does not fully exploit the unique properties of depth information. In response, Zhao et al. [29] proposed the CPFP model, which introduces contrast enhancement for depth information within the CNN architecture and then integrates the enhanced depth with RGB features to generate the saliency map. However, this approach does not fully utilize the advantages of the dual-stream network and may result in blurred edges in the generated saliency map. To address this, Wang et al. [30] proposed an adaptive fusion scheme that implements efficient detection through the design of a dual-stream convolutional neural network and an adaptive fusion module. Zhang et al. [31] designed the SSF model, which allows the RGB and depth information to complement each other in detail, refining the features and sharpening the boundaries by adding an edge loss function. However, these methods generally perform a simple fusion of the two modalities without fully leveraging the unique properties of modality-specific features. To tackle this issue, researchers began exploring the connections between modality-specific features. Wang et al. [32] proposed a novel deep neural network that learns discriminative cross-modal features, establishing a connection between the modalities of RGB and depth images. Liao et al. [33] introduced the MMNet method, which consists of a multi-stage fusion module and a multi-scale decoder, enhancing RGB information with depth information to improve performance, but it neglects the important spatial features inherent in depth information. Piao et al. [34] combined depth information with multi-scale contextual information and added an attention module to enhance model robustness. Additionally, several models have emerged that deeply explore the connections between modality-specific features, allowing the features of the two modalities to refine each other and enhance the model’s generalization ability [9,35]. However, this approach often results in significant data redundancy, making it crucial to reduce model complexity and make the model more lightweight. Consequently, Ling et al. [36] designed a lightweight model suitable for mobile devices that relies solely on RGB images without utilizing depth information.
Recently, Wei et al. [37] proposed a novel network named EGA-Net aimed at improving edge quality and emphasizing the main features in RGB-D salient object detection. This network innovatively addresses the issues of predicting structurally complete but blurred edges and effectively separating salient objects in complex backgrounds. Chen et al. [38] were the first to propose the use of 3D convolutional neural networks for RGB-D salient object detection, which performs early fusion in the encoder stage and deep fusion in the decoder stage to effectively promote the comprehensive integration of RGB and depth streams. Additionally, Lee et al. [39] proposed a superpixel prototype sampling network (SPSN), which, through the design of a prototype sampling network and a dependency selection module, addresses the challenges in RGB-D salient object detection caused by the significant domain gap between RGB and depth images and the low quality of depth maps.

3. Methodology

In this section, we first introduce the overall framework of CMANet in Section 3.1. Section 3.2 and Section 3.3, respectively, discuss the modality-specific learning network and the modality integration network. Finally, Section 3.4 covers the overall loss function.

3.1. Overview

Figure 3 illustrates the overall structure of the cross-modal adaptive interaction network (CMANet). In the encoder part, the RGB and depth images are preprocessed, and then two sub-networks are used to perform feature extraction on the two modalities to obtain their primary features at multiple scales. Within the integrated encoder subnetwork, the CMF module is utilized to learn integrated features. Moving to the decoder section, the U-Net’s [19] skip connection mechanism integrates corresponding multi-level features from the encoder to generate saliency maps. In the integrated decoder subnet, the AFFM module adaptively fuses features from the modality-specific subnetwork and integrated features from the upper layer, enhancing integrated features and enriching cross-modal information. The DeFR module combines feature concatenation and feature enhancement operations from the RFB [40] module. Finally, the saliency maps generated by the integrated decoder subnet serve as the final output.

3.2. Single-Modality Dedicated Network

As depicted in Figure 3, we employ the Res2Net-50 [17] network pre-trained on the ImageNet [41] dataset, enhanced with PSA polarized attention [18], as the modality-specific encoder network. This network extracts multi-scale features from RGB images and depth images separately. Each image yields five multi-level features, denoted as $f_r^i$ and $f_d^i$, where $i \in \{1, 2, 3, 4, 5\}$. After the high-level features $f_r^5$ and $f_d^5$ are obtained from the RGB and depth inputs through the modality-specific encoder network, they are passed to the corresponding RGB and depth branches in the decoder. We then construct the decoder using the U-Net structure [19], in which the multi-level features produced by the modality-specific encoder are skip-connected to the corresponding decoder levels to further enhance the features at each level. Furthermore, the input features $f_r^5$ and $f_d^5$, together with the features combined via skip connections, pass through the DeFR module. This module extracts information across different scales using multi-scale convolution and dilated convolution and integrates this information to enhance the feature-map representation, improving the model's ability to detect features at various scales and in complex patterns. Finally, these features are fed into the modal integration network to improve the accuracy of saliency detection.
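As an illustration of this feature-extraction step, the sketch below shows how five multi-level features per modality could be obtained from a Res2Net-50 backbone. It assumes the timm implementation 'res2net50_26w_4s' as a stand-in, omits the PSA attention and the DeFR module, and assumes the depth map is fed as a single channel; none of these details are specified in exactly this form by the paper.

```python
import timm
import torch

# Assumption: timm's 'res2net50_26w_4s' stands in for the Res2Net-50 backbone.
# features_only=True returns the stem and the four stage outputs, i.e., five levels.
backbone_rgb = timm.create_model('res2net50_26w_4s', pretrained=True, features_only=True)
backbone_depth = timm.create_model('res2net50_26w_4s', pretrained=True, features_only=True,
                                   in_chans=1)  # assumes the depth map is fed as one channel

rgb = torch.randn(1, 3, 352, 352)
depth = torch.randn(1, 1, 352, 352)

f_r = backbone_rgb(rgb)      # [f_r^1, ..., f_r^5] at strides 2, 4, 8, 16, 32
f_d = backbone_depth(depth)  # [f_d^1, ..., f_d^5]
for i, (fr, fd) in enumerate(zip(f_r, f_d), start=1):
    print(i, tuple(fr.shape), tuple(fd.shape))
```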

3.3. Modal Integration Network

In the modal integration network, the cross-modal features of RGB images and depth images are fused to learn their integrated representations. Using skip connections, the hierarchical features of the encoder and decoder are fused to enhance the integrated features and enrich cross-modal information.

3.3.1. Cross-Modal Feature Integration Module

Because RGB data focus on the expression of color and depth data focus on showing the spatial information of objects, they cannot be simply integrated. Therefore, we propose the CMF module to integrate and enhance cross-modal features. As shown in Figure 4, 1 × 1 convolutions are applied to the two inputs, respectively, reducing the number of channels to half of the original to speed up the model. To reduce the redundancy of single-modal features and learn the complementarity of the two modalities, we first use a cross-refinement enhancement strategy, employing attention mechanisms to automatically select and cross-refine important image features. Specifically, combining the unique characteristics of the two modalities, we apply channel attention and spatial attention to them, respectively, to highlight the characteristic responses of salient regions. This process can be described as
$$f_{out}^{r} = f_{in}^{r} \otimes \sigma\left(conv_1\left(GAP\left(f_{in}^{r}\right)\right) + conv_1\left(GMP\left(f_{in}^{r}\right)\right)\right)$$
$$f_{out}^{d} = f_{in}^{d} \otimes \sigma\left(conv_2\left(concat\left(GAP\left(f_{in}^{d}\right), GMP\left(f_{in}^{d}\right)\right)\right)\right)$$
Here, $f_{in}^{r}$ and $f_{in}^{d}$ represent the input features of the RGB branch and the depth branch after the 1 × 1 convolution, respectively. Global Average Pooling is denoted by $GAP$, and Global Maximum Pooling is denoted by $GMP$. $conv_i$ (i = 1, 2) denotes a convolution layer, $\otimes$ denotes element-wise multiplication, $\sigma$ denotes the ReLU activation function, $concat$ denotes concatenation along the channel dimension, and $f_{out}^{r}$ and $f_{out}^{d}$ are the RGB and depth features after attention selection (i.e., the refined $f_r^i$ and $f_d^i$), respectively.
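A minimal PyTorch sketch of this attention-selection step is given below. It follows the two equations above, assuming channel attention for the RGB branch and CBAM-style spatial attention (with a 7 × 7 convolution) for the depth branch; the kernel size and layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class CrossRefineAttention(nn.Module):
    """Sketch of the attention-selection step in the CMF module: channel attention
    for the RGB branch and spatial attention for the depth branch."""
    def __init__(self, channels):
        super().__init__()
        # conv_1: 1x1 conv applied to the pooled channel descriptors (RGB branch)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        # conv_2: conv applied to the concatenated spatial descriptors (depth branch)
        self.conv2 = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.act = nn.ReLU(inplace=True)  # sigma in the equations (stated to be ReLU)

    def forward(self, f_r, f_d):
        # Channel attention on the RGB feature: GAP and GMP over spatial positions.
        gap = torch.mean(f_r, dim=(2, 3), keepdim=True)   # (B, C, 1, 1)
        gmp = torch.amax(f_r, dim=(2, 3), keepdim=True)   # (B, C, 1, 1)
        w_r = self.act(self.conv1(gap) + self.conv1(gmp))
        f_r_out = f_r * w_r

        # Spatial attention on the depth feature: pool along the channel dimension
        # and concatenate the two maps before the convolution.
        avg_map = torch.mean(f_d, dim=1, keepdim=True)    # (B, 1, H, W)
        max_map = torch.amax(f_d, dim=1, keepdim=True)    # (B, 1, H, W)
        w_d = self.act(self.conv2(torch.cat([avg_map, max_map], dim=1)))
        f_d_out = f_d * w_d
        return f_r_out, f_d_out
```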
Inspired by the success of the self-attention mechanism [42], and as shown in Figure 5, we designed two submodules, DA and RA, to cross-enhance and refine modal features from a cross-modal perspective. Taking DA as an example: since the depth image provides spatial and positional information for the RGB image, the depth features are used to generate weights that are multiplied with the RGB features to enhance them. Specifically, 1 × 1 convolutions first map $f_d^i$ to $W_q \in \mathbb{R}^{C_1 \times H \times W}$ and $W_k \in \mathbb{R}^{C_1 \times H \times W}$ and map $f_r^i$ to $W_v \in \mathbb{R}^{C \times H \times W}$, where C, H, and W denote the number of channels, height, and width of the input feature map. The value of $C_1$ is set to 1/6 of C to speed up the model. The attention weight matrix $W_a$ is obtained by multiplying $W_q$ and $W_k$ and applying the $softmax$ function; its size is $(HW) \times (HW)$, corresponding to the attention weights between positions of the input feature map. Next, $W_a$ is multiplied by the projected $W_v$ matrix to obtain the enhanced feature $f_{da}$. Finally, $f_{da}$ is reshaped back to the same shape as the original input feature map, i.e., $C \times H \times W$. The process can be described as
$$W_a = softmax\left(W_q^{T} W_k\right)$$
$$f_{da} = W_v W_a$$
The RA submodule is complementary to DA: it uses RGB image features to generate spatial weights that provide rich texture and color information for the depth image features. Additionally, to retain the information of the original modality images, we use residual connections to further combine the enhanced features with the original features. The residual connections of the two modalities are formulated as
$$f_R^i = f_r^i + f_r^i \otimes sigmoid\left(f_{da}\right)$$
$$f_D^i = f_d^i + f_d^i \otimes sigmoid\left(f_{ra}\right)$$
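The following sketch implements the DA submodule and its residual gate under the shapes stated above ($C_1 = C/6$, an attention matrix of size $(HW) \times (HW)$); batch handling and the exact reshape order are assumptions.

```python
import torch
import torch.nn as nn

class DepthAttention(nn.Module):
    """Sketch of the DA submodule: depth features produce query/key maps that
    re-weight RGB features, followed by a residual gate."""
    def __init__(self, channels):
        super().__init__()
        c1 = max(channels // 6, 1)                 # C1 = C/6 as stated in the text
        self.q = nn.Conv2d(channels, c1, 1)        # W_q from the depth feature
        self.k = nn.Conv2d(channels, c1, 1)        # W_k from the depth feature
        self.v = nn.Conv2d(channels, channels, 1)  # W_v from the RGB feature

    def forward(self, f_r, f_d):
        b, c, h, w = f_r.shape
        q = self.q(f_d).flatten(2)                 # (B, C1, HW)
        k = self.k(f_d).flatten(2)                 # (B, C1, HW)
        v = self.v(f_r).flatten(2)                 # (B, C, HW)

        # Attention weights over spatial positions: one (HW, HW) matrix per sample.
        w_a = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, HW, HW)
        f_da = torch.bmm(v, w_a).view(b, c, h, w)                     # enhanced feature

        # Residual gating: keep the original RGB feature and add the gated version.
        return f_r + f_r * torch.sigmoid(f_da)
```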
After obtaining the enhanced features, the most important task is to fuse them. Specifically, $f_R^i$ and $f_D^i$ are first smoothed with a 3 × 3 convolution. The two features are then fused by element-wise multiplication. Finally, the result is combined with the output $f_S^{i-1}$ of the previous CMF and passed through a second 3 × 3 convolution to enhance the network's representation capability. The output is $f_S^i$, the output of the i-th CMF. The specific process is as follows:
$$f_S^i = Sconv\left(concat\left(Sconv\left(f_R^i\right) \otimes Sconv\left(f_D^i\right), f_S^{i-1}\right)\right)$$
where $Sconv$ denotes the 3 × 3 convolution described above.
In summary, our CMF module initially employs a cross-refinement enhancement strategy to capture feature dependencies between the two modalities, reducing redundancy in single-modal features while learning their complementarity. Subsequently, it effectively integrates these features and passes them to the next layer to acquire multi-level information.

3.3.2. Adaptive Feature Fusion Module

As illustrated in Figure 6, we introduce an adaptive feature fusion module to dynamically compute weights for different modalities and facilitate effective fusion. Specifically, leveraging the skip connection structure of the U-Net network, we derive modality-specific features $g_r^i$ and $g_d^i$ (i = 1, 2, 3, 4, 5) after integrating skip connections. Notably, when i = 1, $g_r^1$ and $g_d^1$ correspond to the high-level features $f_r^5$ and $f_d^5$ generated by the modality-specific encoder. Similarly, $g_s^1$ represents the high-level feature $f_s^5$ generated by the modality integration encoder. Initially, these three features undergo a 3 × 3 convolutional layer to reduce channel dimensions and smooth the features. Subsequently, we devise an attention weight layer, GSA, which integrates Global Average Pooling (GAP), a 1 × 1 convolutional layer, and a sigmoid function to compute weights for the RGB and depth branches, thereby enhancing their respective features. Finally, leveraging the features generated by the preceding modality integration network, we further amplify and fuse the features of both modalities. The detailed procedure is outlined as follows:
$$g_{r1}^i = conv_3\left(g_s^{i-1}\right) \otimes conv_3\left(g_r^i\right) \otimes GSA\left(conv_3\left(g_r^i\right)\right)$$
$$g_{d1}^i = conv_3\left(g_s^{i-1}\right) \otimes conv_3\left(g_d^i\right) \otimes GSA\left(conv_3\left(g_d^i\right)\right)$$
$$g_s^i = conv_3\left(g_s^{i-1}\right) + conv_3\left(concat\left(g_{r1}^i, g_{d1}^i\right)\right)$$
Here, $\otimes$ represents element-wise multiplication, $conv_3$ denotes a 3 × 3 convolutional layer, and $concat$ represents the concatenation of two features. After the fused result $g_{s1}^i$ is obtained, a 3 × 3 convolutional layer smooths the feature representation; the result is then added to $g_s^{i-1}$ to generate an enhanced feature with rich cross-modal information. This enhances the model's ability to perceive multi-scale features. It is particularly important to note that the input to every AFFM module is preprocessed through the RFB module [40] to enhance the model's perception and detection of features across different scales and complexities.
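A possible PyTorch realization of the AFFM and its GSA weight layer is sketched below, following the equations above; the channel counts and the placement of the 3 × 3 convolutions are assumptions.

```python
import torch
import torch.nn as nn

class GSA(nn.Module):
    """Attention weight layer: Global Average Pooling -> 1x1 conv -> sigmoid."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # (B, C, 1, 1) weight, broadcast when multiplied with the feature map.
        return torch.sigmoid(self.conv(torch.mean(x, dim=(2, 3), keepdim=True)))

class AFFM(nn.Module):
    """Sketch of the adaptive feature fusion module."""
    def __init__(self, channels):
        super().__init__()
        self.smooth_r = nn.Conv2d(channels, channels, 3, padding=1)
        self.smooth_d = nn.Conv2d(channels, channels, 3, padding=1)
        self.smooth_s = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.gsa_r = GSA(channels)
        self.gsa_d = GSA(channels)

    def forward(self, g_r, g_d, g_s_prev):
        s = self.smooth_s(g_s_prev)
        r = self.smooth_r(g_r)
        d = self.smooth_d(g_d)
        # Re-weight each modality with its GSA weight and the previous integrated feature.
        g_r1 = s * r * self.gsa_r(r)
        g_d1 = s * d * self.gsa_d(d)
        # Fuse the two branches and add the previous integrated feature (residual path).
        return s + self.fuse(torch.cat([g_r1, g_d1], dim=1))
```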

3.4. Loss Function

Our loss function consists of three components: SSIMLoss (Structural Similarity Index) [43], BCELoss (Binary Cross-Entropy) [44], and IOULoss (Intersection Over Union) [45]. These components are weighted as 0.8, 1, and 1, respectively, and combined to form $S_L$. $F_r$ and $F_d$ denote the saliency maps of the RGB branch and the depth branch; $F_s$ and $GT$ represent the output generated by the modality integration network and the ground-truth saliency map, respectively. Thus, the loss function is expressed in detail as
$$Loss = S_L\left(F_r, GT\right) + S_L\left(F_d, GT\right) + S_L\left(F_s, GT\right)$$
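The loss can be assembled as in the sketch below, which assumes the network outputs logits and that an SSIM-based loss function (e.g., 1 − SSIM from an external package) is supplied by the caller; the weights 0.8/1/1 follow the text.

```python
import torch
import torch.nn.functional as F

def iou_loss(logits, target, eps=1e-6):
    """Soft IoU loss on the predicted saliency map."""
    pred = torch.sigmoid(logits)
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1.0 - inter / (union + eps)).mean()

def sl_loss(logits, target, ssim_loss_fn):
    """S_L = 0.8 * SSIM loss + 1.0 * BCE + 1.0 * IoU (weights from the text).
    ssim_loss_fn is assumed to be provided externally."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    iou = iou_loss(logits, target)
    ssim = ssim_loss_fn(torch.sigmoid(logits), target)
    return 0.8 * ssim + 1.0 * bce + 1.0 * iou

def total_loss(logits_r, logits_d, logits_s, gt, ssim_loss_fn):
    """Loss = S_L(F_r, GT) + S_L(F_d, GT) + S_L(F_s, GT)."""
    return (sl_loss(logits_r, gt, ssim_loss_fn)
            + sl_loss(logits_d, gt, ssim_loss_fn)
            + sl_loss(logits_s, gt, ssim_loss_fn))
```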

4. Experimental Results and Analysis

Section 4.1 explains the specific details of the experiments, and Section 4.2 compares the performance of our model with that of other models, including qualitative and quantitative evaluations as well as the performance of the model with different backbone networks. Finally, Section 4.3 conducts ablation studies to investigate the importance and effectiveness of each module.

4.1. Experimental Settings

4.1.1. Datasets

We evaluate the experimental results on four datasets, namely NJU2K [46], NLPR [47], SIP [8], and STERE [48]. To obtain universal results, we use 2185 samples as the training set, of which 1485 images are from NJU2K [46] and 700 images are from NLPR [47]. Other images from the NJU2K and NLPR datasets and all samples from SIP and STERE are used as test sets.

4.1.2. Evaluation Metrics

We use four commonly used evaluation metrics to assess the effectiveness of our network, namely the F-measure ($F_\beta$) [21], Mean Absolute Error ($MAE$) [49], S-measure ($S_\lambda$) [50], and E-measure ($E_\gamma$) [51]. The detailed definitions of these metrics are as follows:
The F-measure ($F_\beta$) is a widely used comprehensive evaluation metric that considers both precision and recall. It is defined as follows:
$$F_\beta = \frac{\left(1 + \beta^2\right) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$
Here, $Precision$ and $Recall$ denote the precision and recall scores, respectively, and $\beta^2$ is set to 0.3 to emphasize precision.
The Mean Absolute Error ($MAE$) represents the average difference between the saliency map $S$ and the ground truth $Y$; the detailed formula is as follows:
$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - Y(x, y) \right|$$
where $W$ and $H$ represent the width and height of the saliency map, respectively.
The S-measure ($S_\lambda$), also known as the structure measure, evaluates the spatial structural similarity between the saliency map and the ground truth, combining object-aware similarity $S_o$ and region-aware similarity $S_r$. It is defined as follows:
$$S_\lambda = \alpha S_o + \left(1 - \alpha\right) S_r$$
where $\alpha \in [0, 1]$ is the balance parameter, set to 0.5 by default [50].
The E-measure ($E_\gamma$) is an evaluation metric based on the enhanced alignment between the prediction map and the saliency label map. It is defined as follows:
$$E_\gamma = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi_{FM}(x, y)$$
where $W$ and $H$ denote the width and height of the saliency map, respectively, and $\phi_{FM}$ represents the enhanced alignment matrix [51].
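For reference, a NumPy sketch of the two simplest metrics ($MAE$ and $F_\beta$) is given below; the adaptive threshold (twice the mean saliency value) is a common convention and an assumption here, since the exact thresholding protocol is not restated in the text.

```python
import numpy as np

def mae(sal, gt):
    """Mean Absolute Error between a saliency map and the ground truth, both in [0, 1]."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(sal, gt, beta2=0.3):
    """F-measure with an adaptive threshold (twice the mean saliency value);
    other works report the maximum F-measure over all thresholds instead."""
    thr = min(2 * sal.mean(), 1.0)
    pred = (sal >= thr).astype(np.float64)
    tp = (pred * gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# Example on random data (gt is a binary mask, sal a continuous map in [0, 1]):
sal = np.random.rand(352, 352)
gt = (np.random.rand(352, 352) > 0.5).astype(np.float64)
print(mae(sal, gt), f_measure(sal, gt))
```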

4.1.3. Experimental Details

The model was implemented with the PyTorch library and trained on an L20 graphics card with 100 GB of memory. The backbone is the Res2Net-50 network [17] pre-trained on the ImageNet dataset [41], combined with the PSA (Polarized Self-Attention) mechanism [18]. We use the Adam optimizer with an initial learning rate of 1 × 10−4, which is decreased to one tenth of its value every 60 epochs. During training, we apply random flipping, cropping, and rotation for data augmentation to enhance the robustness of the model. The input images are resized to 352 × 352, and the model is trained for 200 epochs. In the testing phase, the RGB and depth images are first resized to 352 × 352 and then fed into the network, which produces a saliency map of the same size as the input image. The model size is 145 MB, training takes about 7 h, and inference takes 31 ms per image. It should be emphasized that the output of the modality integration network branch is used as our final predicted saliency map.
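The optimizer, learning-rate schedule, and augmentation described above could be configured as in the following sketch; the augmentation ranges and the stand-in module are illustrative assumptions.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# model = CMANet()  # assumed to be defined elsewhere; the name is illustrative
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in module so the sketch runs

# Adam with lr = 1e-4, decayed to one tenth every 60 epochs, 200 epochs in total.
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=60, gamma=0.1)

# Data augmentation: random flipping, cropping, and rotation, inputs at 352 x 352
# (the exact ranges below are assumptions, not values stated in the paper).
train_transform = transforms.Compose([
    transforms.Resize((352, 352)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(352, padding=16),
    transforms.ToTensor(),
])

for epoch in range(200):
    # ... one pass over the training set would go here ...
    scheduler.step()
```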

4.2. Performance Comparison

4.2.1. Comparison with RGB-D SOD Models

To verify the effectiveness of the model, we compared CMANet with 17 RGB-D saliency object detection models, namely CTMF18 [27], PCF18 [28], AFNet19 [30], CPFP19 [29], SSF20 [31], CoNet20 [13], D3Net21 [8], DCMF22 [32], DMRA22 [34], MMNet22 [33], CMINet22 [52], DSU22 [53], CFIDNet22 [35], SPNet23 [9], DAL23 [36], EGA-Net23 [37], and RD3D24 [38].

4.2.2. Quantitative Evaluation

Our model is compared with 17 state-of-the-art RGB-D SOD methods. As shown in Table 1, it achieves the best results on all four evaluation metrics for both the NJU2K and NLPR datasets. Higher values of $S_\lambda$, $F_\beta$, and $E_\gamma$ indicate better performance, while lower $MAE$ values indicate better performance. Our model also outperforms most RGB-D saliency detection models on the SIP and STERE datasets; its results are comparable to those of RD3D on SIP and to those of CMINet on STERE. Overall, the CMANet model demonstrates excellent performance on RGB-D salient object detection tasks across various indoor and outdoor scenes. Furthermore, we compared the performance of the model with different backbone networks. As shown in Table 2, performance remains strong when Res2Net-101 is used as the backbone, but in general the model detects salient objects better with Res2Net-50.

4.2.3. Qualitative Evaluation

We conducted a qualitative evaluation of the proposed model by comparing the generated saliency maps with those of six representative models, covering scenarios such as low scene brightness, cluttered objects, low contrast, high interference, and numerous objects. As shown in Figure 7, the proposed CMANet model can accurately detect salient objects under various challenging conditions. In contrast, other models tend to lose accuracy in such scenarios.

4.2.4. Model Size and Inference Time

We tested the inference time of our model on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, comparing model size and inference time against state-of-the-art models (DMRA [34], DSU [53], CFIDNet [35], DAL [36], SPNet [9], EGA-Net [37]). As shown in Table 3, unlike lightweight networks such as DAL, our model is larger because it uses two modality-specific networks and one modality integration network, which separately learn unimodal features and integrated modality features. Nevertheless, among these models, ours achieves the best inference time while also delivering superior performance on the four evaluation metrics. In the future, we aim to make the model more lightweight, thereby reducing complexity, saving time, and achieving higher efficiency.

4.3. Ablation Studies

We performed ablation studies on different modules of our proposed model to verify their effectiveness.

4.3.1. Effectiveness of the PSA Polarized Attention Mechanism

The PSA polarized attention mechanism enhances the quality of feature representation by modeling spatial and channel attention on feature maps. To evaluate its effectiveness, we removed the PSA attention mechanism and used Res2Net-50 directly, represented as “A” in Table 4. Comparing the results, it is evident that the PSA polarized attention mechanism improves the model’s performance. Combining Res2Net-50 with PSA further enhances performance in pixel-level regression tasks.

4.3.2. Effectiveness of the CMF Module

To verify the effectiveness of the CMF module, we first removed it, allowing the modality-specific network outputs to be directly connected and passed to the next layer, resulting in "B1". Since the DA and RA sub-modules in the CMF module are responsible for cross-enhancing and refining modal features, we then deleted these sub-modules while keeping the rest of the CMF module intact, resulting in "B2". Additionally, because the CMF module aggregates the complementary properties of the two modalities and transfers this complementary information to subsequent layers, we removed the connections between CMF modules so that the output of one CMF is not passed to the next, resulting in "B3". The results of "B1", "B2", and "B3" are shown in Table 4. The comparison indicates that saliency detection performance declines in all three cases, confirming the effectiveness of the DA and RA sub-modules and the importance of the connections between CMF modules.

4.3.3. Effectiveness of the AFFM Module

The AFFM module integrates features of different scales and modes, enhancing the model’s ability to understand different modes and scales. Unlike the CMF module, which enhances feature representation and understanding across modalities, the AFFM module uses integrated features to derive enhanced features rich in cross-modal information. To verify the effectiveness of the AFFM module, we deleted it, resulting in “C1”. Additionally, we deleted the attention weight layer GSA within the module to verify its effectiveness, resulting in “C2”. Similarly, we removed the information transfer between AFFM modules to test its necessity, resulting in “C3”. The results of “C1”, “C2”, and “C3” are presented in Table 4. Comparing “C1” shows that the AFFM module improves overall model accuracy. Comparing “C2” and “C3” with the CMANet model demonstrates that the GSA sub-module and the information transfer between AFFM modules are effective and enhance model accuracy.

5. Conclusions

This paper proposes a cross-modal adaptive interaction network (CMANet) to address the RGB-D salient object detection task. Specifically, we introduce a cross-modal feature integration module (CMF) and an adaptive feature fusion module (AFFM), which effectively aggregate the complementary information between the RGB and depth modalities by integrating RGB and depth features. Subsequently, features at different scales are integrated into a full-scale feature map to enhance the model's multi-scale perception capability. Experiments on four widely used datasets demonstrate that the proposed CMANet achieves state-of-the-art performance compared to 17 other methods. However, there are some limitations, such as the high cost of acquiring and annotating RGB-D data, which restricts the scale and diversity of available datasets. Future work could focus on developing new datasets with increased diversity and annotation quality. Additionally, exploring xAI-based detection methods [54,55] and multi-scale convolution techniques [56] could further optimize the algorithm to improve performance.
We would like to emphasize that the research presented in this paper is conducted with the aim of advancing the field of computer vision and improving technology for positive societal applications. We explicitly disclaim any intent to use or contribute to the use of this technology for harmful purposes, such as automated drones or other forms of autonomous weapons. Our work is guided by ethical principles, and we encourage its use only for beneficial, constructive purposes.

Author Contributions

Conceptualization, Q.D. and J.Z.; Methodology, Q.D. and Y.B.; Software, J.W. and S.Z.; Validation, Q.D., Y.B. and J.Z.; Visualization, Q.D. and Y.B.; Writing—Original Draft, Y.B.; Writing—Review and Editing, Q.D., Y.B. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Development Plan Project of the Jilin Provincial Science and Technology Department (2022JB405L05).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. These data can be found here: https://github.com/taozh2017/RGBD-SODsurvey (accessed on 18 August 2024).

Acknowledgments

We would like to express our deepest gratitude to all those who have contributed to the completion of this research and the writing of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, W.; Shen, J.; Yang, R.; Porikli, F. Saliency-Aware Video Object Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 20–33. [Google Scholar] [CrossRef] [PubMed]
  2. Feng, X.; Zhou, S.; Zhu, Z.; Wang, L.; Hua, G. Local to Global Feature Learning for Salient Object Detection. Pattern Recognit. Lett. 2022, 162, 81–88. [Google Scholar] [CrossRef]
  3. Huang, J.; Yan, W.; Li, T.H.; Liu, S.; Li, G. Learning the Global Descriptor for 3-D Object Recognition Based on Multiple Views Decomposition. IEEE Trans. Multimed. 2022, 24, 188–201. [Google Scholar] [CrossRef]
  4. Ma, C.; Miao, Z.; Zhang, X.-P.; Li, M. A Saliency Prior Context Model for Real-Time Object Tracking. IEEE Trans. Multimed. 2017, 19, 2415–2424. [Google Scholar] [CrossRef]
  5. Ji, Y.; Zhang, H.; Zhang, Z.; Liu, M. CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Inf. Sci. 2021, 546, 835–857. [Google Scholar] [CrossRef]
  6. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual Saliency Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 4702–4712. [Google Scholar] [CrossRef]
  7. Li, G.; Liu, Z.; Chen, M.; Bai, Z.; Lin, W.; Ling, H. Hierarchical Alternate Interaction Network For Rgb-D Salient Object Detection. IEEE Trans. Image Process. 2021, 30, 3528–3542. [Google Scholar] [CrossRef]
  8. Fan, D.-P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.-M. Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2075–2089. [Google Scholar] [CrossRef]
  9. Zhou, T.; Fan, D.P.; Chen, G.; Zhou, Y.; Fu, H. Specificity-preserving RGB-D saliency detection. Comput. Vis. Media 2023, 9, 297–317. [Google Scholar] [CrossRef]
  10. Ren, J.; Gong, X.; Yu, L.; Zhou, W.; Yang, M.Y. Exploiting Global Priors for Rgb-D Saliency Detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  11. Guo, J.; Ren, T.; Bei, J. Salient Object Detection For Rgb-D Image Via Saliency Evolution. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar] [CrossRef]
  12. Song, H.; Liu, Z.; Du, H.; Sun, G.; Le Meur, O.; Ren, T. Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE Trans. Image Process. 2017, 26, 4204–4216. [Google Scholar] [CrossRef]
  13. Ji, W.; Li, J.; Zhang, M.; Piao, Y.; Lu, H. Accurate RGB-D Salient Object Detection via Collaborative Learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer International Publishing: Cham, Switzerland, 2020; pp. 52–69. [Google Scholar]
  14. Liu, N.; Zhang, N.; Han, J. Learning selective self-mutual attention for RGB-D saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13756–13765. [Google Scholar]
  15. Zhu, C.; Cai, X.; Huang, K.; Li, T.H.; Li, G. PDNet: Prior-model Guided Depth-enhanced Network for Salient Object Detection. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2018. [Google Scholar] [CrossRef]
  16. Chen, S.; Fu, Y. Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 520–538. [Google Scholar]
  17. Gao, S.; Cheng, M.-M.; Zhao, K.; Zhang, X.; Yang, M.-H.; Torr, P.H.S. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized Self-Attention: Towards High-quality Pixel-wise Regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
  19. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient Object Detection in the Deep Learning Era: An In-Depth Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259. [Google Scholar] [CrossRef] [PubMed]
  20. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  21. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
  22. Cheng, M.-M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.-M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef] [PubMed]
  23. Li, X.; Zhao, L.; Wei, L.; Yang, M.-H.; Wu, F.; Zhuang, Y.; Ling, H.; Wang, J. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE Trans. Image Process. 2016, 25, 3919–3930. [Google Scholar] [CrossRef]
  24. Zhuge, M.; Fan, D.-P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient Object Detection via Integrity Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3738–3752. [Google Scholar] [CrossRef]
  25. Li, X.; Li, Y.; Chen, H.; Peng, Y.; Pan, P. CCAFusion: Cross-Modal Coordinate Attention Network for Infrared and Visible Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 866–881. [Google Scholar] [CrossRef]
  26. Xia, C.; Sun, Y.; Li, K.-C.; Ge, B.; Zhang, H.; Jiang, B.; Zhang, J. RCNet: Related Context-Driven Network with Hierarchical Attention for Salient Object Detection. Expert Syst. Appl. 2024, 237, 121441. [Google Scholar] [CrossRef]
  27. Han, J.; Chen, H.; Liu, N.; Yan, C.; Li, X. CNNs-Based RGB-D Saliency Detection via Cross-View Transfer and Multiview Fusion. IEEE Trans. Cybern. 2018, 48, 3171–3183. [Google Scholar] [CrossRef]
  28. Chen, H.; Li, Y. Progressively Complementarity-Aware Fusion Network for RGB-D Salient Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3051–3060. [Google Scholar] [CrossRef]
  29. Zhao, J.-X.; Cao, Y.; Fan, D.-P.; Cheng, M.-M.; Li, X.-Y.; Zhang, L. Contrast Prior and Fluid Pyramid Integration for Rgbd Salient Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3927–3936. [Google Scholar] [CrossRef]
  30. Wang, N.; Gong, X. Adaptive Fusion for RGB-D Salient Object Detection. IEEE Access 2019, 7, 55277–55284. [Google Scholar] [CrossRef]
  31. Zhang, M.; Ren, W.; Piao, Y.; Rong, Z.; Lu, H. Select, Supplement And Focus For Rgb-D Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3469–3478. [Google Scholar] [CrossRef]
  32. Wang, F.; Pan, J.; Xu, S.; Tang, J. Learning Discriminative Cross-Modality Features for RGB-D Saliency Detection. IEEE Trans. Image Process. 2022, 31, 1285–1297. [Google Scholar] [CrossRef]
  33. Liao, G.; Gao, W.; Jiang, Q.; Wang, R.; Li, G. MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual Event, 12–16 October 2020; pp. 2436–2444. [Google Scholar] [CrossRef]
  34. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-Induced Multi-Scale Recurrent Attention Network for Saliency Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; Volume 2019, pp. 7253–7262. [Google Scholar]
  35. Chen, T.; Hu, X.; Xiao, J.; Zhang, G.; Wang, S. CFIDNet: Cascaded feature interaction decoder for RGB-D salient object detection. Neural Comput. Appl. 2022, 34, 7547–7563. [Google Scholar] [CrossRef]
  36. Ling, L.; Wang, Y.; Wang, C.; Xu, S.; Huang, Y. Depth-aware lightweight network for RGB-D salient object detection. IET Image Process. 2023, 17, 2350–2361. [Google Scholar] [CrossRef]
  37. Wei, L.; Zong, G. EGA-Net: Edge feature enhancement and global information attention network for RGB-D salient object detection. Inf. Sci. 2023, 626, 223–248. [Google Scholar] [CrossRef]
  38. Chen, Q.; Zhang, Z.; Lu, Y.; Fu, K.; Zhao, Q. 3-D Convolutional Neural Networks for RGB-D Salient Object Detection and Beyond. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 4309–4323. [Google Scholar] [CrossRef]
  39. Lee, M.; Park, C.; Cho, S.; Lee, S. SPSN: Superpixel Prototype Sampling Network for RGB-D Salient Object Detection. In Lecture Notes in Computer Science Computer Vision—ECCV 2022; Springer Nature: Cham, Switzerland, 2022; pp. 630–647. [Google Scholar] [CrossRef]
  40. Wu, Z.; Su, L.; Huang, Q. Cascaded Partial Decoder for Fast and Accurate Salient Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar] [CrossRef]
  41. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  42. Chen, Z.; Cong, R.; Xu, Q.; Huang, Q. DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection. IEEE Trans. Image Process. 2021, 30, 7012–7024. [Google Scholar] [CrossRef]
  43. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar]
  44. Cong, R.; Yang, N.; Li, C.; Fu, H.; Zhao, Y.; Huang, Q.; Kwong, S. Global-and-Local Collaborative Learning for Co-Salient Object Detection. IEEE Trans. Cybern. 2023, 53, 1920–1931. [Google Scholar] [CrossRef]
  45. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  46. Ju, R.; Ge, L.; Geng, W.; Ren, T.; Wu, G. Depth saliency based on anisotropic center-surround difference. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 1115–1119. [Google Scholar]
  47. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. Rgbd Salient Object Detection: A Benchmark And Algorithms. Lect. Notes Comput. Sci. 2014, 8691, 92–109. [Google Scholar] [CrossRef]
  48. Niu, Y.; Geng, Y.; Li, X.; Liu, F. Leveraging stereopsis for saliency analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; Volume 2012, pp. 454–461. [Google Scholar]
  49. Borji, A.; Cheng, M.-M.; Jiang, H.; Li, J. Salient object detection: A benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722. [Google Scholar] [CrossRef]
  50. Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  51. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421. [Google Scholar]
  52. Cong, R.; Lin, Q.; Zhang, C.; Li, C.; Cao, X.; Huang, Q.; Zhao, Y. CIR-Net: Cross-Modality Interaction and Refinement for RGB-D Salient Object Detection. IEEE Trans. Image Process. 2022, 31, 6800–6815. [Google Scholar] [CrossRef]
  53. Ji, W.; Li, J.; Bi, Q.; Guo, C.; Liu, J.; Cheng, L. Promoting Saliency From Depth: Deep Unsupervised RGB-D Saliency Detection. arXiv 2022, arXiv:2205.07179. [Google Scholar] [CrossRef]
  54. Ieracitano, C.; Mammone, N.; Spagnolo, F.; Frustaci, F.; Perri, S.; Corsonello, P.; Morabito, F.C. An explainable embedded neural system for on-board ship detection from optical satellite imagery. Eng. Appl. Artif. Intell. 2024, 133, 108517. [Google Scholar] [CrossRef]
  55. Chen, L.; Cai, X.; Li, Z.; Xing, J.; Ai, J. Where is my attention? An explainable AI exploration in water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103878. [Google Scholar] [CrossRef]
  56. Duda, D.; Uruba, V. Spatial spectrum from PIV data. J. Nucl. Eng. Radiat. Sci. 2019, 5, 030912. [Google Scholar] [CrossRef]
Figure 1. (a) shows early fusion. Specifically, the RGB image and the depth image first undergo feature extraction and are then sent to the encoder–decoder network to integrate the features (usually by simply concatenating their representations) and finally predict the saliency result map. (b) represents late fusion. Specifically, two independent network branches are used to extract features, and then the two features are fused at the end of the encoder; finally, the saliency result map is output through the decoder. Compared with early fusion, late fusion can effectively improve detection accuracy. (c) represents one type of mid-term fusion, which is also the method we use. By adding a new fusion branch in addition to the original RGB branch and depth branch, deeper processing of RGB features and depth features can be performed, which has strong complementarity.
Figure 2. Comparison of our model with other models.
Figure 3. This figure illustrates the CMANet network, which adopts an encoder–decoder architecture comprising a modality-specific network and a modality integration network. The modality-specific network extracts features from RGB images and depth images, while the modality integration network integrates and fuses complementary features from both modalities, generating rich cross-modal information. The encoder and decoder are connected using U-Net’s skip connection method, and the saliency map output by the modality integration network in the decoder section serves as the final detection result.
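As a rough illustration of the U-Net-style skip connections mentioned above, the sketch below wires one encoder level to the matching decoder level. The two-level depth, channel sizes, and single-branch layout are simplifying assumptions; the actual CMANet uses deeper backbones and three branches.

```python
# Toy illustration of U-Net skip connections (assumed sizes; not CMANet itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True))
        # The decoder stage consumes the upsampled deep feature concatenated with the
        # same-resolution encoder feature delivered by the skip connection.
        self.dec1 = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(16, 1, 1)   # saliency prediction head

    def forward(self, x):
        e1 = self.enc1(x)                          # high-resolution encoder feature
        e2 = self.enc2(F.max_pool2d(e1, 2))        # downsampled, deeper feature
        d2 = self.dec2(e2)
        d2_up = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d2_up, e1], dim=1))   # skip connection from e1
        return torch.sigmoid(self.head(d1))

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```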
Figure 4. This figure depicts the structure of the CMF module. “CH” denotes the channel attention mechanism, “SP” denotes the spatial attention mechanism, “×” and “+” denote element-wise multiplication and addition, respectively, and “C” denotes feature concatenation.
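The sketch below illustrates, in PyTorch, the kinds of operations named in the figure: channel attention, spatial attention, element-wise multiplication and addition, and concatenation. It follows a generic CBAM-style formulation under assumed channel sizes, with the two modalities sharing attention modules purely for brevity; the precise CMF wiring is the one defined in the paper, not this code.

```python
# Generic channel/spatial attention and cross-modal mixing (illustrative assumptions only).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.mlp(x)          # reweight each channel

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * self.conv(torch.cat([avg, mx], dim=1))   # reweight each location

class CrossModalFusionSketch(nn.Module):
    """Attend to each modality, exchange the enhanced features by element-wise
    multiplication and addition, then concatenate (a simplified stand-in for CMF)."""
    def __init__(self, channels=32):
        super().__init__()
        self.ca = ChannelAttention(channels)   # shared across modalities here for brevity
        self.sa = SpatialAttention()
        self.out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_depth):
        r = self.sa(self.ca(f_rgb))
        d = self.sa(self.ca(f_depth))
        r = r + r * d                    # depth cues enhance the RGB stream
        d = d + d * r                    # and vice versa
        return self.out(torch.cat([r, d], dim=1))

f_rgb, f_depth = torch.randn(1, 32, 56, 56), torch.randn(1, 32, 56, 56)
print(CrossModalFusionSketch()(f_rgb, f_depth).shape)  # torch.Size([1, 32, 56, 56])
```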
Figure 5. This figure illustrates the structure of the DA submodule; “×” denotes element-wise multiplication.
Figure 6. This figure depicts the structural diagram of the AFFM module, where “GAP” signifies the Global Average Pooling operation, “×” and “+” denote element-wise multiplication and addition, respectively, and “C” represents feature concatenation.
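A minimal sketch of GAP-driven adaptive fusion in the spirit of this figure is given below: Global Average Pooling produces channel weights that adaptively rescale the two inputs before they are combined and concatenated. The weighting scheme, layer names, and channel sizes are assumptions made for illustration; they are not the exact AFFM design.

```python
# GAP-based adaptive weighting of two feature maps (illustrative, not the paper's AFFM).
import torch
import torch.nn as nn

class AdaptiveFusionSketch(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # GAP: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(nn.Conv2d(2 * channels, 2 * channels, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_a, f_b):
        w = self.fc(self.gap(torch.cat([f_a, f_b], dim=1)))    # adaptive channel weights
        w_a, w_b = w.chunk(2, dim=1)                           # split weights per input
        fused = f_a * w_a + f_b * w_b                          # weighted element-wise sum
        return self.out(torch.cat([fused, f_a + f_b], dim=1))  # concatenate with a plain sum path

f_a, f_b = torch.randn(1, 32, 28, 28), torch.randn(1, 32, 28, 28)
print(AdaptiveFusionSketch()(f_a, f_b).shape)  # torch.Size([1, 32, 28, 28])
```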
Figure 7. Visual comparison of the saliency maps produced by our method and six state-of-the-art methods.
Table 1. Using the above four metrics (S_λ, F_β, E_γ, MAE, abbreviated M), CMANet is compared with 17 models on four public RGB-D datasets. “↑” and “↓” indicate that higher and lower values are better, respectively; the two-digit publication year follows each model name.

| Model | NJU2K S_λ↑ | F_β↑ | E_γ↑ | M↓ | NLPR S_λ↑ | F_β↑ | E_γ↑ | M↓ | SIP S_λ↑ | F_β↑ | E_γ↑ | M↓ | STERE S_λ↑ | F_β↑ | E_γ↑ | M↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CTMF18 [27] | 0.849 | 0.845 | 0.913 | 0.085 | 0.860 | 0.825 | 0.929 | 0.056 | 0.716 | 0.694 | 0.829 | 0.139 | 0.848 | 0.831 | 0.912 | 0.086 |
| PCF18 [28] | 0.877 | 0.872 | 0.924 | 0.059 | 0.874 | 0.841 | 0.925 | 0.044 | 0.842 | 0.838 | 0.901 | 0.071 | 0.875 | 0.860 | 0.925 | 0.064 |
| AFNet19 [30] | 0.772 | 0.775 | 0.853 | 0.100 | 0.799 | 0.771 | 0.879 | 0.058 | 0.720 | 0.712 | 0.819 | 0.118 | 0.825 | 0.823 | 0.887 | 0.075 |
| CPFP19 [29] | 0.878 | 0.877 | 0.923 | 0.053 | 0.888 | 0.867 | 0.932 | 0.036 | 0.850 | 0.851 | 0.903 | 0.064 | 0.879 | 0.874 | 0.925 | 0.051 |
| SSF20 [31] | 0.899 | 0.896 | 0.935 | 0.043 | 0.914 | 0.896 | 0.953 | 0.026 | 0.876 | 0.882 | 0.922 | 0.052 | 0.893 | 0.890 | 0.936 | 0.044 |
| CoNet20 [13] | 0.895 | 0.893 | 0.937 | 0.046 | 0.908 | 0.887 | 0.945 | 0.031 | 0.858 | 0.867 | 0.913 | 0.063 | 0.908 | 0.905 | 0.949 | 0.040 |
| D3Net21 [8] | 0.900 | 0.900 | 0.950 | 0.041 | 0.912 | 0.897 | 0.953 | 0.030 | 0.860 | 0.861 | 0.909 | 0.063 | 0.899 | 0.891 | 0.938 | 0.046 |
| DCMF22 [32] | 0.913 | 0.922 | 0.925 | 0.043 | 0.922 | 0.914 | 0.937 | 0.029 | 0.875 | 0.774 | 0.844 | 0.115 | 0.910 | 0.906 | 0.914 | 0.043 |
| DMRA22 [34] | 0.886 | 0.897 | 0.921 | 0.051 | 0.899 | 0.888 | 0.942 | 0.031 | 0.806 | 0.852 | 0.863 | 0.085 | 0.886 | 0.895 | 0.934 | 0.047 |
| MMNet22 [33] | 0.910 | 0.919 | 0.922 | 0.038 | 0.925 | 0.918 | 0.955 | 0.023 | 0.824 | 0.860 | 0.871 | 0.080 | 0.901 | 0.910 | 0.941 | 0.040 |
| CMINet22 [52] | 0.916 | 0.924 | 0.945 | 0.034 | 0.922 | 0.919 | 0.955 | 0.024 | 0.892 | 0.916 | 0.930 | 0.043 | 0.905 | 0.912 | 0.946 | 0.038 |
| DSU22 [53] | 0.900 | 0.909 | 0.926 | 0.036 | 0.919 | 0.907 | 0.957 | 0.022 | 0.870 | 0.895 | 0.918 | 0.050 | 0.886 | 0.895 | 0.927 | 0.037 |
| CFIDNet22 [35] | 0.914 | 0.923 | 0.913 | 0.038 | 0.922 | 0.915 | 0.950 | 0.026 | 0.864 | 0.891 | 0.905 | 0.060 | 0.901 | 0.908 | 0.924 | 0.043 |
| SPNet23 [9] | 0.924 | 0.934 | 0.953 | 0.028 | 0.927 | 0.925 | 0.959 | 0.021 | 0.893 | 0.916 | 0.929 | 0.043 | 0.906 | 0.914 | 0.943 | 0.037 |
| DAL23 [36] | 0.904 | 0.909 | 0.940 | 0.044 | 0.930 | 0.918 | 0.966 | 0.023 | 0.898 | 0.898 | 0.929 | 0.047 | 0.902 | 0.904 | 0.941 | 0.040 |
| EGA-Net23 [37] | - | 0.883 | 0.922 | 0.033 | - | 0.912 | 0.966 | 0.020 | - | 0.877 | 0.922 | 0.048 | - | 0.870 | 0.924 | 0.042 |
| RD3D24 [38] | 0.916 | 0.914 | 0.947 | 0.036 | 0.930 | 0.920 | 0.965 | 0.022 | 0.885 | 0.889 | 0.924 | 0.048 | 0.911 | 0.906 | 0.947 | 0.037 |
| CMANet (Ours) | 0.925 | 0.936 | 0.954 | 0.028 | 0.933 | 0.932 | 0.966 | 0.019 | 0.893 | 0.920 | 0.930 | 0.043 | 0.907 | 0.917 | 0.943 | 0.036 |
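For readers reproducing the evaluation, the two simplest of the four metrics can be stated directly: MAE is the mean pixel-wise absolute difference between the saliency map and the ground truth, and the F-measure combines precision and recall with β² = 0.3, the usual SOD setting. The NumPy snippet below is an illustrative implementation under these common conventions (adaptive threshold at twice the mean saliency value), not the evaluation code used for the table; S-measure and E-measure are omitted because their definitions are longer.

```python
# Illustrative MAE and F-measure for saliency evaluation (common conventions assumed).
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary ground truth (both in [0, 1])."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure at an adaptive threshold of twice the mean saliency value."""
    thr = min(2 * pred.mean(), 1.0)
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

pred = np.random.rand(224, 224)                          # stand-in saliency map
gt = (np.random.rand(224, 224) > 0.5).astype(float)      # stand-in ground truth
print(f"MAE={mae(pred, gt):.3f}, F-measure={f_measure(pred, gt):.3f}")
```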
Table 2. This table provides a quantitative analysis of the results obtained when the model uses different backbone networks.

| Model | NJU2K S_λ↑ | F_β↑ | E_γ↑ | M↓ | NLPR S_λ↑ | F_β↑ | E_γ↑ | M↓ | SIP S_λ↑ | F_β↑ | E_γ↑ | M↓ | STERE S_λ↑ | F_β↑ | E_γ↑ | M↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours*-Res2Net-101 | 0.924 | 0.933 | 0.951 | 0.029 | 0.930 | 0.929 | 0.961 | 0.020 | 0.896 | 0.917 | 0.932 | 0.041 | 0.910 | 0.918 | 0.944 | 0.035 |
| Ours-Res2Net-50 | 0.925 | 0.936 | 0.954 | 0.028 | 0.933 | 0.932 | 0.966 | 0.019 | 0.892 | 0.920 | 0.930 | 0.043 | 0.907 | 0.917 | 0.943 | 0.036 |
Table 3. Comparison of model size and inference time of different methods.
| Model | Ours | DMRA22 | DSU22 | CFIDNet22 | DAL23 | SPNet23 | EGANet23 |
|---|---|---|---|---|---|---|---|
| Model Size (MB) | 145 | 187 | 96 | 55 | 61 | 74 | 170 |
| Inference Time (ms) | 31 | 50 | 48 | 46 | 38 | 84 | 33 |
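The kind of numbers reported in Table 3 can be approximated as follows: model size from the float32 parameter count, and inference time as the average latency of repeated forward passes after warm-up. The snippet below uses a stand-in torchvision backbone, so the absolute values will differ from the table, which reflects the actual networks on the authors' hardware.

```python
# Estimate model size (fp32 parameters) and average inference latency for a stand-in model.
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()              # stand-in for a SOD network
size_mb = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(5):                                     # warm-up runs
        model(x)
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    ms = (time.perf_counter() - start) / runs * 1000

print(f"model size ~ {size_mb:.0f} MB, inference time ~ {ms:.0f} ms per image")
```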
Table 4. This table presents the quantitative evaluation of the ablation studies.

| Model | NJU2K S_λ↑ | M↓ | NLPR S_λ↑ | M↓ | SIP S_λ↑ | M↓ | STERE S_λ↑ | M↓ |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.925 | 0.028 | 0.933 | 0.019 | 0.892 | 0.043 | 0.907 | 0.036 |
| A | 0.924 | 0.028 | 0.929 | 0.020 | 0.883 | 0.049 | 0.902 | 0.040 |
| B1 | 0.924 | 0.030 | 0.933 | 0.020 | 0.890 | 0.043 | 0.906 | 0.036 |
| B2 | 0.924 | 0.029 | 0.930 | 0.022 | 0.892 | 0.044 | 0.905 | 0.039 |
| B3 | 0.920 | 0.029 | 0.926 | 0.021 | 0.889 | 0.045 | 0.902 | 0.040 |
| C1 | 0.925 | 0.029 | 0.931 | 0.020 | 0.896 | 0.042 | 0.908 | 0.036 |
| C2 | 0.923 | 0.030 | 0.931 | 0.020 | 0.891 | 0.045 | 0.908 | 0.037 |
| C3 | 0.922 | 0.030 | 0.930 | 0.021 | 0.953 | 0.030 | 0.906 | 0.037 |