Review

Advancing in RGB-D Salient Object Detection: A Survey

by Ai Chen 1,2, Xin Li 2, Tianxiang He 2, Junlin Zhou 1,2 and Duanbing Chen 1,2,*
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Chengdu Union Big Data Technology Incorporation, Chengdu 610041, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 8078; https://doi.org/10.3390/app14178078
Submission received: 7 August 2024 / Revised: 30 August 2024 / Accepted: 5 September 2024 / Published: 9 September 2024
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision and Object Detection)

Abstract

The human visual system can rapidly focus on prominent objects in complex scenes, significantly enhancing information processing efficiency. Salient object detection (SOD) mimics this biological ability, aiming to identify and segment the most prominent regions or objects in images or videos. This reduces the amount of data that needs to be processed while enhancing the accuracy and efficiency of information extraction. In recent years, SOD has made significant progress driven by deep learning, multi-modal fusion, and attention mechanisms, and it has expanded into real-time detection, weakly supervised learning, and cross-domain applications. Depth images can provide three-dimensional structural information of a scene, aiding in a more accurate understanding of object shapes and distances. In SOD tasks, depth images enhance detection accuracy and robustness by providing additional geometric information, which is particularly crucial in complex scenes and under occlusion. This survey reviews the substantial advancements in the field of RGB-Depth SOD, with a focus on the critical roles played by attention mechanisms and cross-modal fusion methods. It summarizes the existing literature, provides a brief overview of mainstream datasets and evaluation metrics, and quantitatively compares the discussed models.

1. Introduction

Salient object detection (SOD) [1] is a critical task in computer vision that aims to identify and segment the most prominent targets in images. This process not only emulates human visual perception by automatically locating and extracting the most salient and important targets, but also serves as a foundational technology for applications such as medical imaging [2], video surveillance [3], and content-aware image editing [4], enabling efficient and accurate target detection across different scenarios. With the proliferation of depth sensors, RGB-D SOD has become a hot research topic in computer vision. RGB-D SOD leverages both visual appearance and depth cues to provide more comprehensive scene information. Compared to using only RGB data, it offers a more reliable understanding of scene content, particularly when visual information is insufficient or blurred. SOD can be traced back to the work of Itti et al. in 1998 [5], who introduced the first saliency model, which detects salient regions in images by emulating the attention mechanism of the human visual system. In 2012, Lang et al. [6] were the first to incorporate RGB-D data into SOD. By integrating depth information, their model achieved higher robustness and accuracy in handling complex scenes, advancing the development of multimodal saliency detection.
Moreover, SOD has developed several branches over the years. RGB-Thermal (RGB-T) image salient object detection enhances target recognition in low-light and occluded environments by integrating thermal imaging data [7]. Light field image salient object detection [8] improves detail resolution by utilizing the angular information of light field data. The detection of salient objects in optical remote-sensing images adapts to the characteristics of remote-sensing data through specialized algorithms, improving the detection efficiency in complex and vast backgrounds [9]. Lightweight real-time object detection optimizes computational efficiency and speed to meet the demands of real-time applications [10].
Although RGB data provides rich color and texture information, it is limited in its ability to distinguish between the foreground and background in complex environments. Therefore, the three-dimensional structural information from depth data is also crucial for object detection. It can effectively help segment objects that are near or far from the camera, opening up new possibilities for the detection of salient objects. In RGB-D salient object detection, cross-modal fusion is a technique that integrates the information from both RGB and depth data to enhance the precision and robustness of object detection [11]. Various fusion strategies, such as cascade fusion [12], parallel fusion [13], and recurrent fusion [14], have been employed to extract and combine features from these heterogeneous data sources, thus more accurately identifying and highlighting salient objects in a scene. The challenge for all RGB-D SOD tasks lies in effectively utilizing depth information, as its quality often varies and is not consistently reliable across different scenes. The primary role of cross-modal fusion is therefore to leverage the strengths of each modality by accurately integrating them, without allowing the weaknesses of any one modality to degrade overall detection performance. As demonstrated by Xiao et al. [15], a depth-guided fusion network with depth attention mechanisms can effectively encode spatial information, ensuring that the advantages of each modality are maximized.
Attention mechanisms play a critical role in optimizing the fusion of RGB and depth data [16]. By selectively focusing on important features and suppressing less relevant information, attention mechanisms dynamically adjust the contributions of each modality to the task of SOD [17]. This adaptability is particularly crucial when dealing with potentially noisy or incomplete depth data. By focusing computational resources on the richest informational features within each modality, attention models not only optimize the integration of RGB and depth information, but also enhance overall detection accuracy.
The purpose of this survey is to systematically explore and analyze recent advances in, and the limitations of, cross-modal fusion strategies and attention mechanisms, and how they are applied in the field of RGB-D SOD. By examining a wide range of approaches, this survey aims to highlight innovations and assess various fusion and attention strategies. Through this analysis, it identifies promising directions for future research that can address the remaining challenges in RGB-D SOD, thereby aiding in the development of more capable visual systems.
The remainder of this survey is organized as follows. Section 2 explores the classification and comparison of various cross-modal fusion strategies. Section 3 discusses the classification and optimization of attention mechanisms, as well as recent attention models for RGB-D SOD tasks. Section 4 introduces commonly used datasets and popular evaluation metrics and provides quantitative comparisons of several models. Section 5 discusses challenges and future directions. Finally, conclusions are drawn in Section 6.

2. Classification and Comparison of Cross-Modal Fusion Strategies

The evolution of SOD has marked a significant progression from simple feature-based methods to complex deep learning architectures. Initially, SOD predominantly relied on contrast-based techniques that used color and brightness to distinguish the salient regions from their surroundings [4]. Itti et al. [5] utilized basic feature contrasts to create saliency maps. With the advent of machine learning, particularly deep learning, methods evolved to incorporate more advanced features and semantic information, significantly enhancing the accuracy and robustness of detection models [18].
In RGB-D SOD, the challenge of cross-modal fusion involves effectively combining complementary information from RGB and depth modalities to enhance detection performance. Key issues in this process include the following: (1) balancing the contribution of each modality during fusion, and (2) designing fusion strategies to extract more effective salient features. Research and improvement of these strategies are crucial for advancing the field.

2.1. Classification of Fusion Strategies

Cross-modal fusion strategies in RGB-D SOD are crucial for maximizing the complementary advantages of RGB and depth modalities. These strategies can be categorized based on the timing and methods of fusion.
Early Fusion. Early fusion methods [19,20] integrate features at the input level: RGB and depth data are combined before or at the early stages of feature extraction and then fed into the detection model, as shown in Figure 1a. This approach allows the model to take advantage of the spatial consistency between modalities from the very beginning of the processing pipeline. Early fusion is usually achieved by concatenating feature maps or channels so that all subsequent layers in the network process the combined data. In the initial stage, the information from multiple data sources is utilized to maximize the complementary advantages of each modality (a minimal sketch contrasting early and late fusion is given at the end of this subsection).
Middle Fusion [14]. In this strategy, integration occurs in the middle stages of the network, typically after the initial feature extraction, with the combination taking place in the intermediate layers of the neural network, as shown in Figure 1b. This strategy allows the network to initially process the RGB and depth data independently, using dedicated networks customized for each modality before merging their features. At this stage, strategies such as feature concatenation, weighted sums, and more sophisticated methods are used to integrate features [21].
Late Fusion [22]. In this strategy, the RGB and depth features undergo completely independent feature extraction and processing and are combined in the final stage, as shown in Figure 1c. For example, as discussed in Hu et al. [23], features from different stages of the network are integrated so as to efficiently utilize both low-level and high-level semantic information. As seen in the work of Sun et al. [24], late fusion allows each network to independently form an opinion before a final decision is made, thus potentially improving robustness to noise in any single modality.
Hybrid Fusion [25]. This approach combines two or more of the above strategies to fully utilize the advantages of each strategy at different processing stages, integrating features from different stages of the network to efficiently utilize different semantic information and dynamically weighting the importance of each modal feature based on context. As in the work of Luo et al. [26], models may use early fusion to quickly integrate basic color and depth information and then use intermediate attention-guided fusion to refine feature integration based on context-specific cues.
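To make the contrast concrete, the following is a minimal PyTorch sketch of early fusion (input-level concatenation of RGB and depth) versus late fusion (independent streams whose predictions are combined at the end). The module names, layer widths, and the simple averaging used for late fusion are illustrative assumptions for this survey, not the architecture of any specific model cited above.

```python
# Minimal sketch, assuming PyTorch is available; layer widths and module names
# are illustrative and do not reproduce any specific published model.
import torch
import torch.nn as nn


class EarlyFusionSOD(nn.Module):
    """Early fusion: RGB (3 channels) and depth (1 channel) are concatenated at the input."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # a single shared encoder sees 4 channels
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, 1, 1)           # per-pixel saliency logits

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)        # fuse before any feature extraction
        return self.head(self.backbone(x))


class LateFusionSOD(nn.Module):
    """Late fusion: two independent streams, combined only at the decision stage."""

    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, 1),
            )
        self.rgb_stream = stream(3)
        self.depth_stream = stream(1)

    def forward(self, rgb, depth):
        # Each modality forms its own saliency estimate; a simple average combines
        # them here, although learned or confidence-based weighting is common.
        return 0.5 * (self.rgb_stream(rgb) + self.depth_stream(depth))


if __name__ == "__main__":
    rgb = torch.randn(2, 3, 224, 224)
    depth = torch.randn(2, 1, 224, 224)
    print(EarlyFusionSOD()(rgb, depth).shape)     # torch.Size([2, 1, 224, 224])
    print(LateFusionSOD()(rgb, depth).shape)      # torch.Size([2, 1, 224, 224])
```

Middle and hybrid fusion fall between these two extremes, exchanging features at intermediate layers of the two streams rather than at the input or output alone.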

2.2. Comparison of Fusion Strategies

The performance of fusion strategies in RGB-D SOD varies according to the specific requirements and constraints of the detection tasks. In image-processing tasks where spatial alignment is crucial, early fusion is often an effective strategy. This method maximizes the interaction between RGB and depth data at the initial stage of feature extraction, thereby fully leveraging spatial consistency. However, as noted by Silberman et al. [27], early fusion, despite its advantages, may exhibit lower robustness when dealing with noisy or incomplete depth data. Mid-level fusion typically achieves superior performance in terms of object detection accuracy and detail preservation. For example, Zhao et al. [28] introduced an attention-guided fusion model that dynamically adjusts the focus on the salient features of each modality, effectively enhancing the overall detection performance in complex scenes. In contrast, late fusion generally offers greater robustness by allowing separate modality paths to mitigate the impact of noisy data before fusion. Shotton et al. [29] exemplify this approach, demonstrating how late fusion enhances detection reliability when dealing with varying quality across modalities in RGB-D human pose estimation. Hybrid fusion strategies combine elements of early, mid, or late fusion, integrating RGB and depth data more intricately to maximize the strengths of each fusion type while compensating for their weaknesses. This approach is seen in the work of Han et al. [30], who utilized hybrid fusion to improve detection accuracy while managing computational efficiency. A comparison of these methods is shown in Table 1.

3. Attention Mechanisms in RGB-D SOD

Attention mechanisms have significantly impacted natural language processing [31] and computer vision [32] by focusing model resources on relevant features and suppressing less useful information. The attention mechanism has been a transformative force in various areas of machine learning, especially in computer vision tasks such as RGB-D salient object detection, where it helps to enhance feature discrimination and improve the robustness of models against noisy or irrelevant depth cues [33]. At its core, the attention mechanism allows the model to dynamically focus on the most informative parts of the input data, thereby enhancing its ability to discriminate relevant features while suppressing irrelevant information. In the context of RGB-D SOD, utilizing the attention mechanism to accurately control the fusion of RGB and depth data provides fine-grained modulation of the specific impact of each modal feature on the final detection result [34].

3.1. Classification of Attention Mechanisms

In the field of RGB-D SOD, attention mechanisms have evolved from basic spatial attention to more complex forms, such as channel attention and multidimensional attention, refining the processes of feature selection and cross-modal integration.
The application of attention in RGB-D SOD is particularly useful for addressing the challenges posed by differences in modal quality. For example, by weighting the features derived from each modality when the depth data are noisy or incomplete, the attention mechanism mitigates the effects of low-quality data and enhances the model’s ability to accurately detect salient objects [35,36].
Self-attention [24,37]. Operating on a single data stream (for example, RGB or depth data alone), self-attention refines feature representations by evaluating the relevance of input features within their context. When dealing with noisy depth data, self-attention effectively captures global dependencies within the input, identifies and highlights critical features in complex geometric structures, and suppresses noise interference.
Mutual or co-attention [38]. Operating across multiple modalities, mutual attention aligns and synchronizes feature representations. This is particularly crucial in RGB-D data, where the correspondence between RGB and depth features directly impacts detection performance. When addressing the challenges posed by noisy depth data, mutual attention ensures an accurate alignment between RGB and depth features, thereby reducing the error propagation caused by noise.
Channel attention [33]. Channel attention improves model detection performance by assigning different weights to each feature channel, enhancing important features while suppressing less relevant ones. In RGB-D SOD tasks, channel attention effectively distinguishes critical features from redundant information in both RGB and depth data. When noise is present in depth data, the channel attention mechanism automatically reduces the weights of noisy channels, highlighting more reliable ones. By optimizing the representation of feature channels, this mechanism enables the model to more accurately process varying depth information while maintaining high sensitivity to key visual details, thereby enhancing overall detection performance.
Multi-head attention [39]. Multi-head attention captures multiple representations of input data by computing several self-attention mechanisms in parallel, allowing it to focus on different aspects of the input from various perspectives, leading to a more comprehensive feature representation. In RGB-D SOD tasks, multi-head attention effectively handles noisy depth data and enhances spatial resolution. By processing multiple attention heads simultaneously, it identifies and emphasizes meaningful information in different feature spaces while suppressing irrelevant features associated with noise. This enables the model to capture and focus on key features in RGB and depth data from different perspectives, enhancing feature representation across scales and improving both spatial resolution and target detection accuracy.
Spatial attention [40]. Spatial attention assigns different weights to features at various locations within an image, emphasizing important regions. In RGB-D SOD tasks, spatial attention effectively identifies and highlights key areas within the image. When dealing with noisy depth data, spatial attention assists the model in ignoring regions heavily affected by noise, allowing it to focus on parts that contain significant spatial information. A minimal sketch of channel and spatial attention is given at the end of this subsection.
Hybrid attention [25,41,42]. Hybrid attention is a comprehensive strategy that combines multiple attention mechanisms, leveraging the strengths of channel attention, spatial attention, and self-attention to capture salient features across different levels and dimensions. In RGB-D SOD tasks, hybrid attention effectively addresses challenges related to noisy depth data and spatial resolution. By incorporating self-attention, it enhances the richness of depth features; through channel attention, it amplifies important features while suppressing noisy channels; and with spatial attention, it focuses on the most relevant regions within the image, thereby improving spatial resolution.
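As a concrete reference for the mechanisms described above, the following is a minimal PyTorch sketch of channel attention (squeeze-and-excitation style) and spatial attention (CBAM style) applied to a fused RGB-D feature map. The class names, reduction ratio, and kernel size are illustrative assumptions; the models cited in this section use their own, typically more elaborate, variants.

```python
# Minimal sketch, assuming PyTorch; generic SE-style channel attention and
# CBAM-style spatial attention, not the exact modules of any cited paper.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """One learned weight per feature channel: noisy channels can be down-weighted."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w


class SpatialAttention(nn.Module):
    """One learned weight per spatial location: salient regions are emphasized."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                         # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)        # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w


if __name__ == "__main__":
    fused = torch.randn(2, 64, 56, 56)            # e.g., a fused RGB-D feature map
    out = SpatialAttention()(ChannelAttention(64)(fused))
    print(out.shape)                              # torch.Size([2, 64, 56, 56])
```

Hybrid attention schemes chain or parallelize such components, while multi-head and mutual attention replace the pooling-based weighting with learned query-key interactions within or across modalities.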

3.2. Attention Models in SOD

In RGB-D SOD, attention mechanisms are crucial for enhancing the integration and processing of RGB and depth data. Recent studies have demonstrated the successful application of various attention models in the RGB-D SOD domain.
Unified unsupervised salient target detection via knowledge transfer. Yuan et al. [43] discussed the role of attention networks in improving RGB-D SOD by enhancing cross-modal feature integration, highlighting the impact of self-attention and mutual attention models on performance. However, the computational complexity is high due to the feature fusion at different levels.
RGB-D salient target detection based on cross-modal and cross-level feature fusion. Peng et al. [11] employed an attention mechanism to enhance the different types of features derived from the RGB and depth modalities, which helps to mitigate the problems caused by differences in modal quality and improves the accuracy of salient target detection. However, the approach relies heavily on the performance of pre-trained models, making it difficult to transfer to different tasks. Wang et al. [44] showed how different types of attention mechanisms can be used effectively to synchronize feature representations between RGB and depth data to improve the overall detection process. Chen et al. [45] used a context-based learning attention scheme to dynamically adjust the contributions of the RGB and depth modalities, employed an attention model to assess the quality and relevance of each modality, and optimized the fusion process to improve the robustness of cross-scene detection.
Deep perception attention networks distinguish salient objects. Sun et al. [46] introduced a depth-aware attention model, DSAM, which prioritizes depth information to augment RGB features for background suppression, and achieved superior results in distinguishing salient objects from complex backgrounds by using attention to selectively emphasize the depth features that are most reliably and consistently associated with RGB features.
Progressive attention network. Hu et al. [23] used an incremental learning strategy to progressively refine the integration of RGB and depth data using a cross-modal attention fusion module and an incremental decoder, with each stage focusing on increasingly fine-grained features. Progressive decoding strategies make feature fusion more detailed, enhancing detection accuracy, but they also increase model complexity and computation time.

4. Performance Evaluation and Comparison

4.1. RGB-D Dataset

RGB-D data, captured through different sensors, provides RGB and depth information. Depth data offers spatial cues about object structure and the relative distance between objects and the camera, which is particularly valuable in complex and cluttered environments. Evaluating RGB-D SOD models, especially those enhanced by attention mechanisms, requires meticulously designed experiments to assess their effectiveness and robustness in diverse settings. Comprehensive evaluation necessitates the selection of datasets that represent a range of typical challenges encountered in real-world applications, including variations in object scale, background complexity, and quality of depth information.
The key datasets utilized in these evaluations include the following: LFSD [8], STERE [47], NJUD [48], NLPR [19], SIP [49], Holo50K [12], SSD [49], DUT [50], DES [49], and GIT [51], as shown in Figure 2.
These datasets are widely used for their diversity and complexity, making them ideal for testing the robustness of RGB-D SOD models. Using these datasets ensures that the developed models are not only theoretically sound, but also practically viable in various real-world settings.

4.2. Evaluation Criteria

The performance of RGB-D SOD models is typically quantified using multiple metrics that reflect different aspects of model accuracy and reliability; a minimal code sketch computing two of them is provided at the end of this subsection.
F index $F_\beta$ [52,53]: The F-measure or F-score is a harmonic mean of precision and recall, providing a single score that balances false positives and false negatives. This metric is particularly useful in situations where a balance needs to be struck between missing salient objects and mislabeling non-salient regions:
$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}, $$
where $\beta$ is the weight used to balance precision and recall.
E index $E_\phi$ [54]: The E-measure combines local pixel matching and image-level statistics to provide a comprehensive evaluation of both local and global errors in the saliency map:
$$ E_\phi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi(x, y), $$
where $\phi(x, y)$ denotes the local pixel matching function at position $(x, y)$, and $W$ and $H$ represent the width and height of the saliency map, respectively.
S index $S_\alpha$ [55]: The S index, also known as the S-measure, evaluates the structural similarity between the predicted saliency map and the ground truth. It considers both region-aware and object-aware structural similarities:
$$ S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r, $$
where $\alpha$ is a balance parameter between the object-aware structural similarity $S_o$ and the region-aware structural similarity $S_r$.
Mean Absolute Error (MAE) [53,56]: MAE measures the average magnitude of errors between the predicted saliency map and the ground truth, without considering their direction:
$$ \mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right|, $$
where $P(x, y)$ and $G(x, y)$ denote the values of the predicted saliency map and the ground truth at position $(x, y)$, respectively, and $W$ and $H$ represent the width and height of the saliency map, respectively.
In practical applications, the metrics $F_\beta$, $E_\phi$, and $S_\alpha$ may interact with each other. In high-precision scenarios, such as autonomous driving and medical diagnostics, $F_\beta$ should be prioritized to ensure a balance between precision and recall. In aesthetic and design contexts, the visual consistency and appeal of the salient regions are more critical, thus placing greater emphasis on $E_\phi$ and $S_\alpha$. In real-time systems, a comprehensive consideration of all metrics is required: $F_\beta$ must be maintained at an acceptable level to ensure overall detection efficiency, while $E_\phi$ helps the model maintain detection accuracy in rapidly changing scenarios, and $S_\alpha$ ensures the structural integrity of the target.
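For reference, the following is a minimal NumPy sketch that computes two of the metrics defined above, $F_\beta$ and MAE, for a single predicted saliency map and its binary ground truth. The adaptive binarization threshold (twice the mean prediction, capped at 1) and $\beta^2 = 0.3$ are common conventions rather than requirements of the definitions; $S_\alpha$ and $E_\phi$ involve structural and alignment terms and are omitted for brevity.

```python
# Minimal sketch, assuming NumPy; the threshold choice and beta^2 = 0.3 are
# conventional settings, not prescribed by the formulas above.
import numpy as np


def f_beta(pred, gt, beta2=0.3, eps=1e-8):
    """F-measure from the precision/recall of a thresholded saliency map."""
    thresh = min(2.0 * pred.mean(), 1.0)          # adaptive binarization threshold
    binary = pred >= thresh
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)


def mae(pred, gt):
    """Mean absolute error between the prediction and the ground truth."""
    return np.abs(pred - gt.astype(np.float64)).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((256, 256))                 # predicted saliency map in [0, 1]
    gt = rng.random((256, 256)) > 0.7             # binary ground-truth mask
    print(f"F_beta = {f_beta(pred, gt):.4f}, MAE = {mae(pred, gt):.4f}")
```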

4.3. Model Comparison

This survey compared several prominent models from the past seven years, with detailed information provided in Table 2. Attention mechanisms, by emulating the human visual system’s attention allocation, effectively enhance the model’s ability to capture key information. Multimodal information fusion techniques integrate data from different sensors, providing richer contextual information for SOD tasks. The introduction of attention mechanisms and multimodal information fusion allows for balancing the importance of different regions or domains, thereby improving salient object detection performance.
Various methods have demonstrated their strengths and limitations through different network structures and feature fusion strategies. For instance, SSRCNN performs well in dynamic scene analyses using recurrent neural networks but may face training complexity; PDNet enhances the detection of objects of varying sizes through multi-scale feature fusion but may increase computational burden; the contrast-prior feature pooling in CPFP is effective in high-contrast images but relies on image contrast; DCFNet introduces depth calibration and fusion strategies that effectively mitigate the noise and blur in raw depth images, leading to more accurate salient object detection; the cross-dimensional fusion network structure in C2DFNet effectively combines RGB and depth information, improving detection accuracy, but requires high computational resources; the superpixel prototype sampling technique in SPSN reduces computational load while maintaining accuracy but is sensitive to initial segmentation quality; and the hierarchical dense attention mechanism in HiDAnet offers more refined feature learning and region localization but may lead to complex and time-consuming training and inference.
Early models like PCF (2018) and SSRCNN (2019) made significant advances in feature fusion, but they may lack the ability to handle complex scenes compared to later models like CATNet (2023) or HFMDNet (2024), which incorporate more sophisticated depth calibration techniques to reduce the noise in depth data and improve detection accuracy in challenging environments. The evolution from models like TANet (2019) to EMTrans (2024) highlights the increasing complexity of handling noisy depth data. While TANet integrates depth information with limited robustness to noise, EMTrans leverages a transformer architecture to enhance spatial resolution and feature alignment between RGB and depth modalities, resulting in a more resilient detection process, particularly in environments with low-quality depth data. This evolution underscores how transformer-based models are reshaping the SOD landscape.
As seen from Table 2, recent strategies in attention mechanisms and multimodal fusion include end-to-end deep learning models, attention mechanisms based on graph neural networks, multi-scale and multi-view fusion strategies, and the application of reinforcement learning in multimodal fusion. In terms of backbone architectures, there is a significant trend toward transitioning from traditional backbones like VGG to more advanced networks like ResNet, as well as transformer-based models such as Swin-B and PVT. These new architectures offer improved feature extraction capabilities and enhance overall model performance, particularly in terms of precision and recall, while reducing MAE. Additionally, the integration of attention mechanisms has become a standard approach for enhancing feature selection and increasing the model’s focus on salient objects. These mechanisms work by dynamically directing the model’s attention to the most relevant parts of the input data, thereby improving the model’s ability to distinguish salient objects from background noise.
The effective integration of RGB and depth data remains a key area of innovation. Technologies that excel in this domain, such as C2DFNet and MVSalNet, demonstrate superior performance by leveraging the strengths of both modalities. Attention mechanisms significantly enhance feature selection and model interpretability by focusing on the most relevant aspects of the data, leading to more accurate detection. However, these mechanisms may also introduce computational overhead and the risk of overfitting, particularly in resource-intensive models. Additionally, the effectiveness of cross-modal fusion techniques varies depending on the quality of the depth data and the model’s ability to balance RGB and depth information. While some models exhibit exceptional robustness, this often comes at the cost of increased complexity. The future of SOD in RGB-D may see the integration of more advanced technologies, such as self-supervised learning, quantum computing for complex data processing, and further enhancements in multimodal fusion techniques. These innovations will aim to address current limitations, such as the reliance on large annotated datasets and the high computational costs associated with advanced models.

4.4. Comparing with Prior Art

State-of-the-art approaches for RGB-D salient object detection focus on fusion techniques, interactive networks, and feature enhancement. Moreover, hybrid attention models, through attention-driven weighting and selection mechanisms, allow for the integration of features from multiple layers or modalities. This provides a more discriminative and relevant feature set for saliency detection tasks, enhancing the model’s ability to discern salient objects against complex backgrounds. The dynamic nature of these models allows them to adjust the focus on salient features based on the context present within the scene, effectively handling the inherent variability and complexity of real-world scenes.
In contrast, models without an attention mechanism often experience a decline in accuracy when processing small objects or objects that blend into the background. The fixed feature processing strategies of these models fail to adapt to the varying prominence of salient features in complex scenes. As highlighted in the research by Hu et al. [104], the application of self-attention has proven effective in mitigating these issues by enhancing feature representation and discriminability, significantly improving the capability of models to detect salient objects under diverse environmental conditions.

5. Challenges and Future Directions

The field of RGB-D SOD has greatly benefited from advancements in cross-modal fusion and attention mechanisms. While these technologies have significantly pushed the boundaries of what is achievable, they also introduce challenges and limitations primarily related to computational complexity and handling varying data quality across modalities. These challenges include ensuring robust detection under diverse conditions and effectively managing the increased computational load introduced by complex attention mechanisms [103]. We present a comparison of the attention mechanisms and cross-modal fusion methods used in the RGB-D SOD task in Table 3. These challenges directly impact the application of RGB-D SOD methods in practical systems. For instance, in autonomous vehicles, accurate object detection is critical for safety, but noisy depth data and varied lighting conditions in complex traffic environments may affect detection accuracy and reliability. Similarly, in robotic navigation or medical imaging analysis, precise target identification is crucial for task completion, yet current methods may struggle when dealing with complex backgrounds or occluded targets. Future studies will need to address these issues to fully capitalize on the potential of RGB-D data in SOD and other related applications, thereby developing more efficient and adaptive detection systems.
In the field of RGB-D SOD, attention mechanisms and cross-modal fusion methods face several challenges.
Quality of depth data. Depth maps captured by different sensors often contain noise and blur, and their quality varies considerably across scenes, which directly affects cross-modal fusion and overall detection performance. Strategies such as depth calibration, depth feature filtering, and attention-based re-weighting of modalities are therefore needed to prevent unreliable depth information from degrading the fused representation.
Feature redundancy and performance degradation. Additional modules, such as multi-layer feature enhancement and edge generation, may introduce feature redundancy that increases computational complexity and degrades model performance in complex scenarios [23,105]. In high-dimensional data, redundant features not only waste computational resources, but also interfere with the model’s ability to capture critical information. Therefore, designing fusion strategies that effectively avoid redundancy, such as selective attention or feature filtering modules, is crucial for enhancing model performance, improving computational efficiency, and achieving precise and efficient object detection.
Harsh environmental conditions. In RGB-D SOD tasks, model performance is highly sensitive to changes in environmental conditions, particularly under varying lighting conditions, background complexity, and fluctuations in depth data quality, which can affect the brightness and contrast of RGB images and subsequently impact salient object detection. For example, under low-light conditions, images may become blurred, making the edges and features of salient objects difficult to discern. Additionally, when the scene contains multiple objects or elements similar to the salient target, the difficulty of the SOD task increases significantly.
Small target size. When the salient object is small, the model may struggle to distinguish it from the background. This issue is particularly prominent when the resolution is low or the target is distant from the camera. A small target size can prevent the model from effectively extracting salient features, resulting in missed detections or false positives. Research on this problem is limited, with most efforts focusing on improving small target detection through multi-scale feature extraction. Future research could explore high-resolution enhancement techniques and target-specific focusing strategies to enhance the model’s ability to detect small targets. Additionally, developing models that can dynamically adjust feature extraction scales is an effective approach to handling targets of varying sizes.
Foreground object interference. When foreground objects coexist with salient background objects, depth positional information can cause interference, leading to erroneous model detection. For example, during detection, if the depth information of a foreground object is more prominent than that of the background, but the foreground object is not the target, the model may mistakenly identify the foreground object as the salient target. Foreground interference often leads to misjudgments, particularly when the foreground and background have similar saliency. If this interference is not effectively addressed, it can significantly reduce the model’s accuracy and reliability. By incorporating depth feature filtering or contextual information aggregation modules, the model can better separate the foreground from the background, thereby improving the accuracy of salient target localization.
Development of lightweight attention models. As computational resource constraints and the demand for real-time applications increase, the development of lightweight attention models becomes crucial. These models are essential for reducing computational complexity and resource consumption while maintaining high performance, particularly for deployment in real-time applications and on devices with limited processing power. Future research could focus on improving the attention mechanism itself, designing lightweight attention modules and developing new network architectures to enhance efficiency while exploring various techniques to reduce the computational burden of models.
Adaptive and dynamic fusion strategies. Future models could benefit from more sophisticated adaptive fusion strategies that effectively aggregate multimodal features. Such strategies dynamically adjust their parameters based on the specific characteristics of the input data, enhancing the flexibility and suitability of the model in different scenes [106] and maintaining strong performance across various scenarios. By fusing mid-level features in a single step, a model can reduce the number of parameters while improving fusion efficiency, effectively capturing complementary information across modalities and ensuring a rational fusion of different modal features. Static fusion strategies may underperform in dynamic scenarios, whereas adaptive fusion strategies can dynamically respond to changes in the input data, optimizing the feature fusion process and enhancing the model’s adaptability and generalization in different environments. This is crucial for the widespread application and adaptability of SOD technology; a minimal sketch of such a gated fusion module is given below.
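As an illustration of such an adaptive strategy, the following is a minimal PyTorch sketch of a gated fusion module in which a small gating network predicts per-pixel, per-modality weights from the input features themselves, so the relative contribution of RGB and depth adapts to each scene. The module name and layer sizes are illustrative assumptions and do not correspond to a specific published model.

```python
# Minimal sketch, assuming PyTorch; an input-conditioned gate that produces a
# pixel-wise convex combination of RGB and depth features.
import torch
import torch.nn as nn


class AdaptiveGatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The gate sees both modalities and outputs two spatial weight maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 1),
        )

    def forward(self, f_rgb, f_depth):            # both: (B, C, H, W)
        weights = torch.softmax(self.gate(torch.cat([f_rgb, f_depth], dim=1)), dim=1)
        w_rgb, w_depth = weights[:, 0:1], weights[:, 1:2]
        # Convex combination per pixel: unreliable depth regions receive low weight.
        return w_rgb * f_rgb + w_depth * f_depth


if __name__ == "__main__":
    f_rgb = torch.randn(2, 64, 56, 56)
    f_depth = torch.randn(2, 64, 56, 56)
    print(AdaptiveGatedFusion(64)(f_rgb, f_depth).shape)   # torch.Size([2, 64, 56, 56])
```

In practice, such gates are often combined with the channel and spatial attention mechanisms discussed in Section 3, so that the predicted weights reflect both channel reliability and spatial location.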
In RGB-D SOD research, although existing achievements have provided valuable theoretical and practical foundations, future exploration requires a deeper understanding of, and response to, emerging trends and challenges in this field. First, advances in hardware technology, particularly the rapid development of depth sensors, provide data with higher precision and a broader dynamic range, which will significantly enhance detection accuracy and robustness. Additionally, integrating RGB-D SOD with other imaging technologies, such as thermal imaging, may open new research directions and application scenarios. Multimodal data fusion can offer a more comprehensive perspective, significantly improving the model’s adaptability to different environmental conditions, especially in low-visibility situations such as at night.

6. Conclusions

This survey provides a detailed analysis of the theoretical foundations and key technologies in RGB-D salient object detection, with a particular focus on the roles of attention mechanisms and cross-modal fusion. Through a thorough investigation of the literature, the contributions of attention mechanisms and cross-modal fusion in optimizing RGB-D data were identified, addressing various challenges related to feature alignment, noise reduction, and information prioritization. This fosters a more detailed and dynamic fusion process, enabling SOD systems to adaptively focus on the most informative features in both modalities. The existing studies in the literature were summarized, with a brief introduction of the mainstream datasets and evaluation metrics and a quantitative comparison of the discussed models. Additionally, future challenges and research directions were outlined. In conclusion, attention mechanisms and cross-modal fusion are powerful tools for enhancing the performance of RGB-D SOD, and the development of more advanced, efficient, and robust saliency detection technologies can be anticipated in the future.

Author Contributions

Investigation, A.C., X.L., T.H., J.Z. and D.C.; Resources, A.C.; Writing—original draft, A.C. and X.L.; Writing—review & editing, A.C., X.L., T.H. and J.Z.; Visualization, A.C.; Supervision, D.C.; Funding acquisition, J.Z. and D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Project of Sichuan under Grant No 24ZDYF0004 and by the Major Program of National Natural Science Foundation of China under Grant No T2293771.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

Authors Ai Chen, Xin Li, Tianxiang He, Junlin Zhou and Duanbing Chen were employed by the company Chengdu Union Big Data Technology Incorporation.

References

  1. Liu, T.; Yuan, Z.; Sun, J.; Wang, J.; Zheng, N.; Tang, X.; Shum, H.Y. Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 353–367. [Google Scholar]
  2. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 4010–4019. [Google Scholar]
  4. Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef]
  5. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  6. Lang, C.; Nguyen, T.V.; Katti, H.; Yadati, K.; Kankanhalli, M.; Yan, S. Depth Matters: Influence of Depth Cues on Visual Saliency. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part II 12; Springer: Berlin/Heidelberg, Germany, 2012; pp. 101–115. [Google Scholar]
  7. Liu, Z.; Tan, Y.; He, Q.; Xiao, Y. SwinNet: Swin transformer drives edge-aware RGB-D and RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4486–4497. [Google Scholar] [CrossRef]
  8. Li, N.; Ye, J.; Ji, Y.; Ling, H.; Yu, J. Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2806–2813. [Google Scholar]
  9. Li, G.; Bai, Z.; Liu, Z.; Zhang, X.; Ling, H. Salient object detection in optical remote sensing images driven by transformer. IEEE Trans. Image Process. 2023, 32, 5257–5269. [Google Scholar] [CrossRef]
  10. Liu, J.J.; Hou, Q.; Cheng, M.M.; Feng, J.; Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
  11. Peng, Y.; Zhai, Z.; Feng, M. RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion. IEEE Access 2024, 12, 45134–45146. [Google Scholar] [CrossRef]
  12. Zhang, J.; Fan, D.P.; Dai, Y.; Yu, X.; Zhong, Y.; Barnes, N.; Shao, L. RGB-D saliency detection via cascaded mutual information minimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 4338–4347. [Google Scholar]
  13. Huang, N.; Yang, Y.; Zhang, D.; Zhang, Q.; Han, J. Employing bilinear fusion and saliency prior information for RGB-D salient object detection. IEEE Trans. Multimed. 2021, 24, 1651–1664. [Google Scholar] [CrossRef]
  14. Chen, H.; Li, Y. Progressively complementarity-aware fusion network for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3051–3060. [Google Scholar]
  15. Xiao, F.; Pu, Z.; Chen, J.; Gao, X. DGFNet: Depth-guided cross-modality fusion network for RGB-D salient object detection. IEEE Trans. Multimed. 2023, 26, 2648–2658. [Google Scholar] [CrossRef]
  16. Liu, N.; Zhang, N.; Han, J. Learning selective self-mutual attention for RGB-D saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 13756–13765. [Google Scholar]
  17. Liu, N.; Han, J. Dhsnet: Deep hierarchical saliency network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 678–686. [Google Scholar]
  18. Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 136–145. [Google Scholar]
  19. Ren, J.; Gong, X.; Yu, L.; Zhou, W.; Ying Yang, M. Exploiting global priors for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 25–32. [Google Scholar]
  20. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD Salient Object Detection: A Benchmark and Algorithms. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part III 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 92–109. [Google Scholar]
  21. Zhang, Q.; Qin, Q.; Yang, Y.; Jiao, Q.; Han, J. Feature calibrating and fusing network for RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1493–1507. [Google Scholar] [CrossRef]
  22. Desingh, K.; Krishna, K.M.; Rajan, D.; Jawahar, C. Depth really Matters: Improving Visual Salient Region Detection with Depth. In Proceedings of the BMVC, Bristol, UK, 9–13 September 2013; pp. 1–11. [Google Scholar]
  23. Hu, X.; Sun, F.; Sun, J.; Wang, F.; Li, H. Cross-modal fusion and progressive decoding network for RGB-D salient object detection. Int. J. Comput. Vis. 2024, 132, 3067–3085. [Google Scholar] [CrossRef]
  24. Sun, F.; Ren, P.; Yin, B.; Wang, F.; Li, H. CATNet: A cascaded and aggregated transformer network for RGB-D salient object detection. IEEE Trans. Multimed. 2023, 26, 2249–2262. [Google Scholar] [CrossRef]
  25. Li, H.; Han, Y.; Li, P.; Li, X.; Shi, L. Hybrid Attention Mechanism And Forward Feedback Unit for RGB-D Salient Object Detection. IEEE Access 2023, 26, 2249–2262. [Google Scholar] [CrossRef]
  26. Luo, Y.; Shao, F.; Xie, Z.; Wang, H.; Chen, H.; Mu, B.; Jiang, Q. HFMDNet: Hierarchical Fusion and Multi-Level Decoder Network for RGB-D Salient Object Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5012115. [Google Scholar] [CrossRef]
  27. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from Rgbd Images. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part V 12; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760. [Google Scholar]
  28. Zhao, R.; Ouyang, W.; Li, H.; Wang, X. Saliency detection by multi-context deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1265–1274. [Google Scholar]
  29. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  30. Ha, H.; Im, S.; Park, J.; Jeon, H.G.; Kweon, I.S. High-quality depth from uncalibrated small motion clip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5413–5421. [Google Scholar]
  31. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  32. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  33. Zhang, X.; Xu, Y.; Wang, T.; Liao, T. Multi-prior driven network for RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2023. early access. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; NIPS: La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  35. Zhang, Z.; Lin, Z.; Xu, J.; Jin, W.D.; Lu, S.P.; Fan, D.P. Bilateral attention network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 1949–1961. [Google Scholar] [CrossRef] [PubMed]
  36. Sun, P.; Zhang, W.; Wang, H.; Li, S.; Li, X. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1407–1417. [Google Scholar]
  37. Lv, P.; Yu, X.; Wang, J.; Wu, C. HierNet: Hierarchical Transformer U-Shape Network for RGB-D Salient Object Detection. In Proceedings of the 2023 35th Chinese Control and Decision Conference (CCDC), Yichang, China, 20–22 May 2023; pp. 1807–1811. [Google Scholar]
  38. Gao, X.; Cui, J.; Meng, J.; Shi, H.; Duan, S.; Xia, C. JALNet: Joint attention learning network for RGB-D salient object detection. Int. J. Comput. Sci. Eng. 2024, 27, 36–47. [Google Scholar] [CrossRef]
  39. Jiang, B.; Zhou, Z.; Wang, X.; Tang, J.; Luo, B. CmSalGAN: RGB-D salient object detection with cross-view generative adversarial networks. IEEE Trans. Multimed. 2020, 23, 1343–1353. [Google Scholar] [CrossRef]
  40. Wu, Z.; Allibert, G.; Meriaudeau, F.; Ma, C.; Demonceaux, C. Hidanet: Rgb-d salient object detection via hierarchical depth awareness. IEEE Trans. Image Process. 2023, 32, 2160–2173. [Google Scholar] [CrossRef]
  41. Wang, N.; Gong, X. Adaptive fusion for RGB-D salient object detection. IEEE Access 2019, 7, 55277–55284. [Google Scholar] [CrossRef]
  42. Chen, Y.; Zhou, W. Hybrid-attention network for RGB-D salient object detection. Appl. Sci. 2020, 10, 5806. [Google Scholar] [CrossRef]
  43. Yuan, Y.; Liu, W.; Gao, P.; Dai, Q.; Qin, J. Unified Unsupervised Salient Object Detection via Knowledge Transfer. arXiv 2024, arXiv:2404.14759. [Google Scholar]
  44. Wang, F.; Su, Y.; Wang, R.; Sun, J.; Sun, F.; Li, H. Cross-modal and cross-level attention interaction network for salient object detection. IEEE Trans. Artif. Intell. 2023, 5, 2907–2920. [Google Scholar] [CrossRef]
  45. Chen, H.; Shen, F.; Ding, D.; Deng, Y.; Li, C. Disentangled cross-modal transformer for RGB-d salient object detection and beyond. IEEE Trans. Image Process. 2024, 33, 1699–1709. [Google Scholar] [CrossRef] [PubMed]
  46. Sun, P.; Zhang, W.; Li, S.; Guo, Y.; Song, C.; Li, X. Learnable depth-sensitive attention for deep rgb-d saliency detection with multi-modal fusion architecture search. Int. J. Comput. Vis. 2022, 130, 2822–2841. [Google Scholar] [CrossRef]
  47. Niu, Y.; Geng, Y.; Li, X.; Liu, F. Leveraging stereopsis for saliency analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 454–461. [Google Scholar]
  48. Ju, R.; Ge, L.; Geng, W.; Ren, T.; Wu, G. Depth saliency based on anisotropic center-surround difference. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 1115–1119. [Google Scholar]
  49. Fan, D.P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.M. Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 2075–2089. [Google Scholar] [CrossRef]
  50. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7254–7263. [Google Scholar]
  51. Ciptadi, A.; Hermans, T.; Rehg, J.M. An In Depth View of Saliency. In Proceedings of the BMVC, Bristol, UK, 9–13 September 2013; pp. 1–11. [Google Scholar]
  52. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212. [Google Scholar]
  53. Borji, A.; Cheng, M.M.; Jiang, H.; Li, J. Salient object detection: A benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722. [Google Scholar] [CrossRef]
  54. Fan, D.P.; Ji, G.P.; Qin, X.; Cheng, M.M. Cognitive vision inspired object segmentation metric and loss function. Sci. Sin. Informationis 2021, 6, 5. [Google Scholar]
  55. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  56. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  57. Huang, P.; Shen, C.H.; Hsiao, H.F. RGBD salient object detection using spatially coherent deep learning framework. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 1–5. [Google Scholar]
  58. Liu, Z.; Shi, S.; Duan, Q.; Zhang, W.; Zhao, P. Salient object detection for RGB-D image by single stream recurrent convolution neural network. Neurocomputing 2019, 363, 46–57. [Google Scholar] [CrossRef]
  59. Zhu, C.; Cai, X.; Huang, K.; Li, T.H.; Li, G. PDNet: Prior-model guided depth-enhanced network for salient object detection. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 199–204. [Google Scholar]
  60. Chen, H.; Li, Y.; Su, D. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognit. 2019, 86, 376–385. [Google Scholar] [CrossRef]
  61. Zhao, J.X.; Cao, Y.; Fan, D.P.; Cheng, M.M.; Li, X.Y.; Zhang, L. Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3927–3936. [Google Scholar]
  62. Chen, H.; Li, Y. Three-stream attention-aware network for RGB-D salient object detection. IEEE Trans. Image Process. 2019, 28, 2825–2835. [Google Scholar] [CrossRef] [PubMed]
  63. Zhao, X.; Zhang, L.; Pang, Y.; Lu, H.; Zhang, L. A single stream network for robust and real-time RGB-D salient object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 646–662. [Google Scholar]
  64. Zhao, Y.; Zhao, J.; Li, J.; Chen, X. RGB-D salient object detection with ubiquitous target awareness. IEEE Trans. Image Process. 2021, 30, 7717–7731. [Google Scholar] [CrossRef] [PubMed]
  65. Ji, W.; Li, J.; Zhang, M.; Piao, Y.; Lu, H. Accurate RGB-D salient object detection via collaborative learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XVIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 52–69. [Google Scholar]
  66. Li, G.; Liu, Z.; Ye, L.; Wang, Y.; Ling, H. Cross-modal weighting network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 665–681. [Google Scholar]
  67. Li, G.; Liu, Z.; Ling, H. ICNet: Information conversion network for RGB-D based salient object detection. IEEE Trans. Image Process. 2020, 29, 4873–4884. [Google Scholar] [CrossRef]
  68. Piao, Y.; Rong, Z.; Zhang, M.; Ren, W.; Lu, H. A2dele: Adaptive and attentive depth distiller for efficient RGB-D salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9060–9069. [Google Scholar]
  69. Chen, S.; Fu, Y. Progressively guided alternate refinement network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 520–538. [Google Scholar]
  70. Pang, Y.; Zhang, L.; Zhao, X.; Lu, H. Hierarchical dynamic filtering network for RGB-D salient object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXV 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 235–252. [Google Scholar]
  71. Fu, K.; Fan, D.P.; Ji, G.P.; Zhao, Q.; Shen, J.; Zhu, C. Siamese network for RGB-D salient object detection and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5541–5559. [Google Scholar] [CrossRef] [PubMed]
  72. Zhang, M.; Zhang, Y.; Piao, Y.; Hu, B.; Lu, H. Feature reintegration over differential treatment: A top-down and adaptive fusion network for RGB-D salient object detection. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 4107–4115. [Google Scholar]
  73. Zhang, M.; Fei, S.X.; Liu, J.; Xu, S.; Piao, Y.; Lu, H. Asymmetric two-stream architecture for accurate RGB-D saliency detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXVIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 374–390. [Google Scholar]
  74. Li, C.; Cong, R.; Piao, Y.; Xu, Q.; Loy, C.C. RGB-D salient object detection with cross-modality modulation and selection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part VIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 225–241. [Google Scholar]
  75. Li, C.; Cong, R.; Kwong, S.; Hou, J.; Fu, H.; Zhu, G.; Zhang, D.; Huang, Q. ASIF-Net: Attention steered interweave fusion network for RGB-D salient object detection. IEEE Trans. Cybern. 2020, 51, 88–100. [Google Scholar] [CrossRef] [PubMed]
  76. Wang, X.; Li, S.; Chen, C.; Fang, Y.; Hao, A.; Qin, H. Data-level recombination and lightweight fusion scheme for RGB-D salient object detection. IEEE Trans. Image Process. 2020, 30, 458–471. [Google Scholar] [CrossRef]
  77. Chen, Q.; Liu, Z.; Zhang, Y.; Fu, K.; Zhao, Q.; Du, H. RGB-D salient object detection via 3D convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 1063–1071. [Google Scholar]
  78. Huang, Z.; Chen, H.X.; Zhou, T.; Yang, Y.Z.; Liu, B.Y. Multi-level cross-modal interaction network for RGB-D salient object detection. Neurocomputing 2021, 452, 200–211. [Google Scholar] [CrossRef]
  79. Zhang, W.; Jiang, Y.; Fu, K.; Zhao, Q. BTS-Net: Bi-directional transfer-and-selection network for RGB-D salient object detection. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  80. Wen, H.; Yan, C.; Zhou, X.; Cong, R.; Sun, Y.; Zheng, B.; Zhang, J.; Bao, Y.; Ding, G. Dynamic selective network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 9179–9192. [Google Scholar] [CrossRef]
  81. Zhou, T.; Fu, H.; Chen, G.; Zhou, Y.; Fan, D.P.; Shao, L. Specificity-preserving RGB-D saliency detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4681–4691. [Google Scholar]
  82. Zhang, C.; Cong, R.; Lin, Q.; Ma, L.; Li, F.; Zhao, Y.; Kwong, S. Cross-modality discrepant interaction network for RGB-D salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2094–2102. [Google Scholar]
  83. Zhou, W.; Zhu, Y.; Lei, J.; Wan, J.; Yu, L. CCAFNet: Crossflow and cross-scale adaptive fusion network for detecting salient objects in RGB-D images. IEEE Trans. Multimed. 2021, 24, 2192–2204. [Google Scholar] [CrossRef]
84. Zhang, M.; Yao, S.; Hu, B.; Piao, Y.; Ji, W. C2DFNet: Criss-cross dynamic filter network for RGB-D salient object detection. IEEE Trans. Multimed. 2022, 25, 5142–5154. [Google Scholar] [CrossRef]
  85. Wang, F.; Pan, J.; Xu, S.; Tang, J. Learning discriminative cross-modality features for RGB-D saliency detection. IEEE Trans. Image Process. 2022, 31, 1285–1297. [Google Scholar] [CrossRef] [PubMed]
86. Zhou, J.; Wang, L.; Lu, H.; Huang, K.; Shi, X.; Liu, B. MVSalNet: Multi-view augmentation for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 270–287. [Google Scholar]
87. Lee, M.; Park, C.; Cho, S.; Lee, S. SPSN: Superpixel prototype sampling network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 630–647. [Google Scholar]
  88. Xia, W.; Zhou, D.; Cao, J.; Liu, Y.; Hou, R. CIRNet: An improved RGBT tracking via cross-modality interaction and re-identification. Neurocomputing 2022, 493, 327–339. [Google Scholar] [CrossRef]
  89. Yang, Y.; Qin, Q.; Luo, Y.; Liu, Y.; Zhang, Q.; Han, J. Bi-directional progressive guidance network for RGB-D salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5346–5360. [Google Scholar] [CrossRef]
  90. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170. [Google Scholar]
  91. Liang, Y.; Qin, G.; Sun, M.; Qin, J.; Yan, J.; Zhang, Z. Multi-modal interactive attention and dual progressive decoding network for RGB-D/T salient object detection. Neurocomputing 2022, 490, 132–145. [Google Scholar] [CrossRef]
  92. Bi, H.; Wu, R.; Liu, Z.; Zhu, H.; Zhang, C.; Xiang, T.Z. Cross-modal hierarchical interaction network for RGB-D salient object detection. Pattern Recognit. 2023, 136, 109194. [Google Scholar] [CrossRef]
  93. Wei, L.; Zong, G. EGA-Net: Edge feature enhancement and global information attention network for RGB-D salient object detection. Inf. Sci. 2023, 626, 223–248. [Google Scholar] [CrossRef]
  94. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. CAVER: Cross-modal view-mixed transformer for bi-modal salient object detection. IEEE Trans. Image Process. 2023, 32, 892–904. [Google Scholar] [CrossRef]
  95. Zhuge, M.; Fan, D.P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient object detection via integrity learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3738–3752. [Google Scholar] [CrossRef]
96. Wang, H.; Wan, L.; Tang, H. LeNo: Adversarial robust salient object detection networks with learnable noise. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2537–2545. [Google Scholar]
  97. Wang, F.; Wang, R.; Sun, F. DCMNet: Discriminant and cross-modality network for RGB-D salient object detection. Expert Syst. Appl. 2023, 214, 119047. [Google Scholar] [CrossRef]
  98. Wu, Z.; Wang, J.; Zhou, Z.; An, Z.; Jiang, Q.; Demonceaux, C.; Sun, G.; Timofte, R. Object segmentation by mining cross-modal semantics. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3455–3464. [Google Scholar]
  99. Huo, F.; Liu, Z.; Guo, J.; Xu, W.; Guo, S. UTDNet: A unified triplet decoder network for multimodal salient object detection. Neural Netw. 2024, 170, 521–534. [Google Scholar] [CrossRef] [PubMed]
  100. Chen, G.; Wang, Q.; Dong, B.; Ma, R.; Liu, N.; Fu, H.; Xia, Y. EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2024. early access. [Google Scholar] [CrossRef] [PubMed]
  101. Feng, Z.; Wang, W.; Li, W.; Li, G.; Li, M.; Zhou, M. MFUR-Net: Multimodal feature fusion and unimodal feature refinement for RGB-D salient object detection. Knowl.-Based Syst. 2024, 299, 112022. [Google Scholar] [CrossRef]
  102. Fang, X.; Jiang, M.; Zhu, J.; Shao, X.; Wang, H. GroupTransNet: Group transformer network for RGB-D salient object detection. Neurocomputing 2024, 594, 127865. [Google Scholar] [CrossRef]
  103. Gao, H.; Su, Y.; Wang, F.; Li, H. Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–24. [Google Scholar] [CrossRef]
104. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention-based network to exploit complementary features for RGBD semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444. [Google Scholar]
  105. Yu, M.; Liu, J.; Liu, Y.; Yan, G. Feature interaction and two-stage cross-modal fusion for RGB-D salient object detection. J. Intell. Fuzzy Syst. 2024, 46, 4543–4556. [Google Scholar] [CrossRef]
  106. Sun, H.; Wang, Y.; Ma, X. An adaptive guidance fusion network for RGB-D salient object detection. Signal Image Video Process. 2024, 18, 1683–1693. [Google Scholar] [CrossRef]
Figure 1. Typical schemes for RGB-D SOD.
Figure 2. Typical RGB-D saliency detection datasets. The RGB image, depth map, and annotation are shown from left to right for each dataset.
Table 1. Comparison of fusion strategies for multimodal data integration.
Fusion Strategy | Advantages | Disadvantages
Early Fusion | Maximizes cross-modal data interaction | Less flexible
Middle Fusion | Balances independent modality processing with cross-modal interaction | Multiple modality-specific extractors and intermediate-layer fusion increase model complexity and training difficulty
Late Fusion | Robust to modality-specific noise | Limited ability to capture beneficial cross-modal interactions at early stages
Hybrid Fusion | Integrates the advantages of all stages | Complex to implement and optimize
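To make the trade-offs in Table 1 concrete, the following minimal sketch contrasts early and late fusion with toy single-layer encoders. It assumes PyTorch, and the module names (EarlyFusion, LateFusion) are illustrative only; it is not drawn from any of the surveyed models.

```python
# Minimal sketch (illustrative only): early vs. late fusion of RGB and depth.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate RGB and depth at the input and share one encoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),  # 3 RGB + 1 depth channel
            nn.Conv2d(32, 1, 1),                        # saliency logits
        )

    def forward(self, rgb, depth):
        return self.encoder(torch.cat([rgb, depth], dim=1))

class LateFusion(nn.Module):
    """Run modality-specific encoders and merge their predictions at the end."""
    def __init__(self):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 1))
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 1))

    def forward(self, rgb, depth):
        # Simple averaging of the two single-modality saliency maps.
        return 0.5 * (self.rgb_enc(rgb) + self.depth_enc(depth))

if __name__ == "__main__":
    rgb = torch.randn(2, 3, 64, 64)
    depth = torch.randn(2, 1, 64, 64)
    print(EarlyFusion()(rgb, depth).shape)  # torch.Size([2, 1, 64, 64])
    print(LateFusion()(rgb, depth).shape)   # torch.Size([2, 1, 64, 64])
```

Middle and hybrid fusion would instead exchange features between the two streams at intermediate layers, which is exactly where the additional design and training complexity noted in Table 1 arises.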
Table 2. Comparison of evaluation metrics on the NJUD dataset. "-" indicates that the corresponding metric was not reported for that model.
Year | Model | Backbone | F_β | S_α | E_ϕ | MAE
2018 | PCF [57] | VGG-16 | 0.872 | 0.877 | 0.924 | 0.059
2019 | SSRCNN [58] | VGG-16 | 0.876 | 0.890 | 0.912 | 0.047
2019 | PDNet [59] | VGG-16, VGG-19 | 0.823 | 0.877 | 0.899 | 0.071
2019 | MMCI [60] | - | 0.868 | 0.859 | 0.882 | 0.079
2019 | CPFP [61] | VGG-16 | 0.890 | 0.878 | 0.900 | 0.053
2019 | TANet [62] | VGG-16 | 0.888 | 0.878 | 0.909 | 0.061
2019 | DMRA [50] | VGG-19 | 0.896 | 0.885 | 0.920 | 0.051
2020 | DANet [63] | VGG-16, VGG-19 | 0.905 | 0.897 | 0.926 | 0.046
2020 | DASNet [64] | ResNet-50 | 0.911 | 0.902 | 0.927 | 0.042
2020 | CoNet [65] | ResNet | 0.872 | 0.894 | 0.912 | 0.047
2020 | CMWNet [66] | VGG-16 | 0.902 | 0.903 | 0.936 | 0.046
2020 | ICNet [67] | VGG-16 | 0.891 | 0.894 | 0.926 | 0.052
2020 | S2MA [16] | VGG-16 | 0.889 | 0.894 | 0.930 | 0.053
2020 | A2dele [68] | VGG-16 | 0.873 | 0.869 | 0.916 | 0.051
2020 | PGAR [69] | VGG-16 | 0.893 | 0.909 | 0.916 | 0.042
2020 | HDFNet [70] | VGG-16, VGG-19 | 0.922 | 0.908 | 0.932 | 0.038
2020 | JL-DCF [71] | VGG-16, ResNet-101 | 0.903 | 0.903 | 0.944 | 0.043
2020 | FRDT [72] | VGG-19 | 0.879 | 0.898 | 0.917 | 0.048
2020 | CmSalGAN [39] | VGG-16 | 0.897 | 0.903 | 0.940 | 0.046
2020 | ATSA [73] | VGG-19 | 0.893 | 0.901 | 0.921 | 0.040
2020 | cmMS [74] | VGG-16 | 0.897 | 0.900 | 0.936 | 0.044
2020 | D3Net [49] | VGG-16 | 0.887 | 0.893 | 0.930 | 0.051
2020 | ASIFNet [75] | VGG-16 | 0.901 | 0.889 | - | 0.047
2020 | DRLF [76] | VGG-16 | 0.883 | 0.886 | 0.926 | 0.550
2021 | RD3D [77] | ResNet-50 | 0.914 | 0.916 | 0.947 | 0.036
2021 | MCINet [78] | ResNet-50 | 0.902 | 0.906 | 0.939 | 0.039
2021 | SwinNet [7] | ResNet-50, ResNet-101 | 0.922 | 0.935 | 0.934 | 0.027
2021 | BiANet [35] | VGG-16 | 0.920 | 0.915 | 0.948 | 0.039
2021 | BTSNet [79] | ResNet-50 | 0.924 | 0.921 | 0.954 | 0.036
2021 | DSNet [80] | ResNet-50 | 0.895 | 0.921 | 0.945 | 0.034
2021 | SPNet [81] | Res2Net-50 | 0.935 | 0.925 | 0.954 | 0.028
2021 | EBFNet [13] | VGG-16 | 0.895 | 0.907 | 0.936 | 0.038
2021 | CMINet [12] | ResNet-50 | 0.925 | 0.939 | 0.956 | 0.032
2021 | CDINet [82] | VGG-16 | 0.922 | 0.919 | - | 0.036
2021 | CCAFNet [83] | - | 0.911 | 0.910 | 0.920 | 0.037
2021 | DSA2F [36] | VGG-19 | 0.892 | 0.918 | 0.950 | 0.024
2022 | C2DFNet [84] | ResNet-50 | 0.937 | 0.922 | 0.948 | 0.020
2022 | DCMF [85] | VGG-16, ResNet-50 | 0.913 | 0.922 | 0.940 | 0.029
2022 | MVSalNet [86] | ResNet-50 | 0.942 | 0.937 | 0.973 | 0.019
2022 | SPSN [87] | ResNet-50 | 0.942 | 0.937 | 0.973 | 0.017
2022 | CIRNet [88] | RepVGG | 0.927 | 0.925 | 0.925 | 0.035
2022 | BPGNet [89] | VGG-16 | 0.926 | 0.923 | 0.953 | 0.034
2022 | ZoomNet [90] | ResNet-50 | 0.926 | 0.914 | 0.940 | 0.037
2022 | MIADPD [91] | ResNet-50 | 0.916 | 0.914 | 0.951 | 0.036
2023 | HINet [92] | ResNet-50 | 0.914 | 0.915 | 0.945 | 0.039
2023 | HiDAnet [40] | VGG-16, ResNet-50 | 0.927 | 0.928 | 0.962 | 0.021
2023 | CATNet [24] | Swin-B | 0.929 | 0.937 | 0.933 | 0.025
2023 | EGANet [93] | - | 0.883 | - | 0.922 | 0.033
2023 | CAVER [94] | ResNet-50, ResNet-101 | 0.928 | 0.926 | 0.958 | 0.030
2023 | ICON [95] | VGG, PVT, ResNet | 0.891 | 0.893 | 0.937 | 0.051
2023 | LeNo [96] | ResNet-50 | 0.824 | 0.838 | 0.888 | 0.073
2023 | DCMNet [97] | Res2Net | 0.899 | - | 0.920 | 0.036
2023 | MPDNet [33] | ResNet-50 | 0.912 | 0.912 | 0.937 | 0.041
2023 | XMSNet [98] | PVT | 0.942 | 0.931 | 0.960 | 0.025
2024 | UTDNet [99] | VGG-16, ResNet-50 | 0.925 | 0.923 | 0.948 | 0.036
2024 | DCMTrans [45] | T2T-14 | 0.934 | 0.932 | 0.959 | 0.031
2024 | EMTrans [100] | PVTv2-B2 | 0.920 | 0.903 | 0.944 | 0.039
2024 | MFURNet [101] | ResNet-50 | 0.937 | 0.923 | - | 0.035
2024 | GroupTransNet [102] | ResNet-50 | 0.923 | 0.924 | 0.928 | 0.027
2024 | CPNet [23] | Swin-B | 0.933 | 0.935 | 0.935 | 0.025
2024 | HFILNet [103] | Swin-Trans | 0.931 | 0.936 | 0.939 | 0.025
2024 | HFMDNet [26] | VGG-16, ResNet-50, Swin-Trans | 0.944 | 0.937 | 0.966 | 0.023
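For reference, the sketch below shows one common way to compute the MAE and F_β values reported in Table 2 from a predicted saliency map and a binary ground-truth mask. Averaging and thresholding protocols vary across the surveyed papers, so this is a simplified illustration with hypothetical helper names (mae, f_beta); S_α and E_ϕ follow the structure-based and alignment-based formulations of [55] and [54], respectively, and are omitted here.

```python
# Simplified illustration of two standard SOD metrics; protocols differ per paper.
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and the ground truth (both in [0, 1])."""
    return float(np.mean(np.abs(pred - gt)))

def f_beta(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-measure with beta^2 = 0.3, using the common adaptive threshold of
    twice the mean saliency value (thresholding details vary across papers)."""
    thresh = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = (rng.random((64, 64)) > 0.7).astype(float)               # toy ground-truth mask
    pred = np.clip(gt * 0.8 + 0.1 * rng.random((64, 64)), 0, 1)   # toy prediction
    print(f"MAE={mae(pred, gt):.3f}  F_beta={f_beta(pred, gt):.3f}")
```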
Table 3. Comparison of attention mechanisms and cross-modal fusion in the RGB-D SOD task.
Technique | Performance Metric | Scenario | Key Strengths | Key Limitations
Attention Mechanisms | Accuracy (e.g., F-measure) | Complex scenes with clutter | Enhances feature focus; improves object localization | High computational cost; may overfit in complex scenarios
Cross-Modal Fusion | Precision, recall | Low-light environments, occlusions | Combines complementary data; robust to sensor noise | Requires high-quality depth data; integration complexity
Attention + Cross-Modal Fusion | Accuracy, precision, recall | Complex and diverse scenarios, real-time applications | Maximizes feature extraction and data integration; improves robustness and accuracy | Higher computational demand; requires sophisticated model design
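As a rough illustration of the "Attention + Cross-Modal Fusion" row in Table 3, the sketch below gates depth features with channel-attention weights computed from pooled RGB and depth statistics before merging the two streams. It is a generic module written for illustration (the class name AttentiveCrossModalFusion is hypothetical) and does not reproduce any specific method discussed in this survey.

```python
# Schematic example: channel attention used to gate depth features before fusion.
import torch
import torch.nn as nn

class AttentiveCrossModalFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Channel attention driven by globally pooled RGB + depth statistics.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_rgb.shape
        stats = torch.cat([f_rgb.mean(dim=(2, 3)), f_depth.mean(dim=(2, 3))], dim=1)
        w = self.gate(stats).view(b, c, 1, 1)   # per-channel weights in (0, 1)
        f_depth = f_depth * w                   # down-weight unreliable depth channels
        return self.merge(torch.cat([f_rgb, f_depth], dim=1))

if __name__ == "__main__":
    f_rgb = torch.randn(2, 64, 32, 32)
    f_depth = torch.randn(2, 64, 32, 32)
    print(AttentiveCrossModalFusion(64)(f_rgb, f_depth).shape)  # torch.Size([2, 64, 32, 32])
```

Such gating is one simple way to suppress noisy depth channels while still letting the fused representation benefit from complementary cues, reflecting the strengths and limitations summarized in Table 3.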
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
