Article

Depth-Quality Purification Feature Processing for Red Green Blue-Depth Salient Object Detection

1
Key Laboratory of Intelligent Informatics for Safety & Emergency of Zhejiang Province, Wenzhou University, Wenzhou 325035, China
2
The College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
3
The College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
4
Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, The College of Computer and Information, China Three Gorges University, Yichang 443002, China
*
Authors to whom correspondence should be addressed.
Electronics 2024, 13(1), 93; https://doi.org/10.3390/electronics13010093
Submission received: 25 November 2023 / Revised: 12 December 2023 / Accepted: 19 December 2023 / Published: 25 December 2023
(This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications)

Abstract

With the advances in deep learning technology, Red Green Blue-Depth (RGB-D) Salient Object Detection (SOD) based on convolutional neural networks (CNNs) is gaining more and more attention. However, improving the accuracy of current models remains challenging. It has been found that the quality of the depth features profoundly affects the accuracy. Several current RGB-D SOD techniques do not consider the quality of the depth features and directly fuse the original depth features and Red Green Blue (RGB) features for training, which limits the precision of the model. To address this issue, we propose a depth-quality purification feature processing network for RGB-D SOD, named DQPFPNet. First, we design a depth-quality purification feature processing (DQPFP) module to filter the depth features in a multi-scale manner and fuse them with RGB features in a multi-scale manner. This module can explicitly control and enhance the depth features during cross-modal fusion, avoiding the injection of noisy or misleading depth features. Second, to prevent overfitting and avoid neuron inactivation, we utilize the RReLU activation function in the training process. In addition, we introduce the pixel position adaptive importance (PPAI) loss, which integrates local structure information to assign different weights to each pixel, thus better guiding the network’s learning process and producing clearer details. Finally, a dual-stage decoder is designed to utilize contextual information to improve the modeling ability of the model and enhance the efficiency of the network. Extensive experiments on six RGB-D datasets demonstrate that DQPFPNet outperforms recent efficient models and delivers cutting-edge accuracy.

1. Introduction

Visual saliency refers to the mechanism by which the human visual system locates prominent regions in a scene, and computational saliency models use algorithms to simulate this mechanism and identify the most conspicuous areas of an image. Salient Object Detection (SOD) aims to find the most visually attention-grabbing objects in an image. It has developed rapidly and is widely used in many fields, including object tracking [1], object detection [2,3], object segmentation [4,5], and pre-processing for other computer vision tasks [6]. Deep learning has advanced considerably over the past few years, and many SOD methods have been proposed. However, the majority of current SOD models can only handle RGB images.
Park et al. proposed a unique surface-defect detection method [7] that utilizes a deep nested convolutional neural network (NC-NET) with attention and guidance modules to precisely segment defect regions from complicated backgrounds and adaptively refine features. To overcome the inherent limitations of convolution, SwinE-Net [8] effectively combines EfficientNet, driven by a CNN, and the Vision Transformer (ViT)-based Swin Transformer for segmentation. This combination preserves global semantics while maintaining low-level characteristics, demonstrating good generalization and scalability. CoEg-Net [9] employs a shared attention projection technique to facilitate fast learning from public information, utilizing vast SOD datasets to significantly enhance the model’s scalability and stability. DRFI [10] autonomously integrates high-dimensional regional saliency features and selects the most discriminative cues. However, relying on color alone inevitably creates challenges for SOD in intricate scenes, for example, cluttered or low-contrast backgrounds where color provides few clues.
To address the aforementioned problem, combining RGB and depth features for RGB-D SOD has received increasing attention. To learn transferable representations for RGB-D segmentation tasks, Bowen et al. [11] proposed an RGB-D framework, DFormer, which encodes RGB and depth information through a series of RGB-D blocks. The model is pre-trained on ImageNet-1K, so DFormer has the ability to encode RGB-D representations. To build a better global long-range dependence model with self-modality and cross-modality, Cong et al. [12] introduced the transformer architecture to create a new RGB-D SOD network called point-aware interaction and CNN-induced refinement (PICR-Net). The network explores the interaction of characteristics under different modules and alleviates the block effects and detail destruction problems caused by transformers. Wu et al. [13] designed HiDAnet, which includes a granularity-based attention strategy to enhance the fusion of RGB and depth features. Note that the accuracy depends greatly on the quality of the depth information, as suggested by previous work. Cong et al. [14] suggested a method for assessing the reliability of depth maps and utilized it to minimize the impact of inferior depth maps on saliency detection. DPA-Net [15] can recognize the potential value of depth information through a learning-based approach, preventing contamination by accounting for depth potentiality. Although BBS-Net [16] employs a depth-enhanced module to selectively extract informative regions of depth cues from both channel and spatial viewpoints, the quality of the depth features is still limited, so the prediction accuracy remains inadequate. Although the above models consider the quality of depth features, they only perform single-scale filtering and fuse RGB and depth features at the coarsest filtering level without considering multi-scale filtering and fusion. This may lead to coarse features and insufficient feature utilization and fusion. In addition, Cong et al. [14] adopted a top-down UNet [17] architecture, which performs well in extracting and integrating local information but cannot effectively capture global information and therefore has some limitations.
The above facts indicate that multi-scale filtering of depth features and multi-scale fusion with RGB features can improve feature utilization and fusion, thereby enhancing a model’s accuracy. In addition, a decoder that can capture both global and local information has a significant impact on the performance of a model. Based on this, we propose a depth-quality purification feature processing (DQPFP) network for RGB-D SOD in this paper. Figure 1 shows the overall network architecture. The DQPFP module consists of three key sub-modules, namely a depth denoising module (DDM), a depth-quality purification weighting (DQPW) module, and a depth purification-enhanced attention (DPEA) module. The DDM filters multi-scale depth features through a channel attention mechanism and a spatial attention mechanism to achieve the initial filtering of the depth features. The DQPW module supplements the color features with purified depth features in a residual-connected manner to enhance feature characterization and then learns the weight factor α from the depth features and RGB features; by assigning smaller weights to poor-quality depth features, different weight factors are obtained at different scales. The DPEA module learns global attention maps β from the purified depth features, which enhances the quality of the depth features in the spatial dimension. Then, α and β are integrated to obtain the final high-quality depth features, which are fused with the RGB features in a multi-scale manner, and the final saliency map is generated through a two-stage decoder. In addition, after experimental analysis, we utilize the Randomized Leaky Rectified Linear Unit (RReLU) activation function, which introduces randomness into the training process, to prevent overfitting and avoid neuron inactivation. Furthermore, we introduce the pixel position adaptive importance (PPAI) loss, which integrates local structure information to assign different weights to each pixel, thus better guiding the network’s learning process and resulting in clearer details.
Our contributions can be summarized as follows:
  • We propose a DQPFP module, consisting of three sub-modules: DDM, DQPW, and DPEA. This module filters the depth features in a multi-scale manner and fuses them with RGB features in a multi-scale manner. It can also control and enhance the depth features explicitly in the process of cross-modal fusion, avoiding injecting noise or misleading depth features, which improves the feature utilization, fusion, and accuracy rates of the model.
  • We design a dual-stage decoder as one of DQPFPNet’s essential elements, which can fully utilize contextual information to improve the modeling ability of the model and enhance the efficiency of the network.
  • We introduce the RReLU activation function to prevent overfitting and avoid neuron inactivation, thereby introducing randomness into the training process. Furthermore, the pixel position adaptive importance (PPAI) loss is utilized to integrate local structure information to assign different weights to each pixel, thus better guiding the network’s learning process and resulting in clearer details.
  • Extensive experiments on six RGB-D datasets demonstrate that DQPFPNet outperforms recent efficient models.
The remainder of this paper is structured as follows. The related research on general RGB-D SOD, effective RGB-D SOD, and depth-quality analysis in RGB-D SOD is covered in Section 2. Section 3 describes the proposed DQPFPNet in detail. Section 4 presents the experimental results, performance evaluation, and ablation analysis. Finally, some conclusions are provided in Section 5.

2. Related Works

For many years, researchers have been investigating the use of RGB-D data for SOD. Considering the objective of this paper, this section reviews common techniques for RGB-D SOD as well as previous work on efficient methods and depth-quality analysis.

2.1. Common RGB-D SOD Techniques

The effectiveness of traditional methods [18,19] mostly relies on the quality of the hand-crafted features; the first traditional RGB-D SOD method was proposed in 2012. Recently, deep learning-based techniques [20,21,22,23,24] have made great progress and gradually become mainstream, with the first deep learning-based RGB-D SOD method appearing in 2017. To investigate whether and how visual saliency is influenced by depth features, Lang et al. [18] presented the first RGB-D SOD work in 2012, where seven experimenters performed eye-movement experiments on 500 images, recording observation points. A Gaussian mixture model was used to simulate the distribution of depth-induced saliency and observe the relationship between 2D saliency and 3D saliency. To investigate the efficacy of global priors for RGB-D data, Peng et al. [19] developed a multi-background contrast model, including local, global, and background contrast, to detect salient targets using depth maps. In addition, the first substantial RGB-D dataset for SOD was provided by this work. In order to accelerate inference and improve training efficiency, GSCINet [21] was proposed with a series of carefully designed convolutions of different scales and attention weight matrices, introducing a cyclic cooperation technique to reduce computing costs while optimizing compressed features, thereby achieving rapid and precise inference for Salient Object Detection. To explore how to combine low-level salient cues to generate master saliency maps, DF [20] was created with a new convolutional neural network (CNN) that aggregates many low-level saliency indicators into hierarchical features to effectively find saliency regions in RGB-D images. Published in 2017, it was the first model to incorporate deep learning techniques into RGB-D SOD tasks. In order to make better use of complementary information in multi-modal data and reduce the negative effect of ambiguity between different modes, A2TPNet [24] was proposed to fuse cross-modal features, employing a cooperative technique that combines channel attention and spatial attention mechanisms to lessen the interference of irrelevant information and unimportant aspects in the interaction process. To apply uncertainty to RGB-D Salient Object Detection, UCNet [22], a probability-based RGB-D SOD network that simulates the uncertainty of human annotations through conditional variational autoencoders, was proposed. In order to fully mine the information of cross-modal complementarity and cross-level continuity, ICNet [23] was proposed, offering an information conversion module for interactive high-level feature transformation.
As this research direction has flourished, other promising techniques have recently been applied to RGB-D SOD tasks, for instance, multi-modal integration frameworks that combine RGB images with bottom-up and top-down depth maps [25], co-attention mechanisms [26,27], model compression [28,29], shared networks [30], weakly and semi-supervised learning [31,32], and self-mutual attention modules [33]. A relatively comprehensive RGB-D SOD survey can be found in [34].
Although the above-mentioned RGB-D SOD methods can improve detection accuracy, most of the models do not consider the impact of multi-scale depth quality on model accuracy.

2.2. RGB-D SOD Depth-Quality Assessments

As depth quality often affects the performance of a model, some researchers have considered assessing the depth quality in RGB-D SOD to lessen the impact of low-quality depth maps. In EF-Net [35], a color hint map module based on RGB images was first employed to forecast a hint map; the issue of poor-quality depth maps was then resolved, and the saliency detection process was improved, thanks to the use of a depth-enhanced module. After removing the depth stream’s feature encoder and creating a lightweight model, the authors of SSN [36] employed the depth map directly to guide the pre-fusion of RGB and depth features. The authors of A2dele [37] used network prediction and attention methods as conduits for transferring depth data from the depth stream to the RGB stream. In JL-DCF [30], depth adjustment and fusion mechanisms were used to explicitly address depth quality issues; based on this, the adjusted depth map was able to estimate the original depth map. In SPSN [38], component prototypes were created from superpixels of the input RGB image and depth map. In addition, a reliability selection module was proposed to detect the quality of RGB feature maps and depth feature maps and weigh them adaptively according to the quality of the feature maps.

3. Proposed Method

3.1. Overview

Figure 1 presents the proposed DQPFPNet structure, consisting of the encoder, decoder, and supervision module. Our encoder adopts the architecture in [16], where the RGB module is in charge of both cross-modal fusion between RGB and depth features and feature extraction for RGB to achieve great performance. To create the final saliency map, the decoder performs a dual-stage fusion, namely the first fusion and second fusion. The encoder itself is made up of an RGB-related module, whose backbone network is MobileNet-v2 [2]; a depth-related module, which is an efficient backbone; and the proposed DQPFP. The depth module and RGB module comprise five feature hierarchies, each with an output stride of 2, with the last one having an output stride of 1. The depth features extracted within a given hierarchy are passed through the DQPFP threshold, added to the RGB module through simple element-wise addition, and then sent to the next hierarchy. Moreover, a PPM (pyramid pooling module [39]) is introduced toward the end of the RGB module to acquire multi-scale semantic data. In the implementation, the DQPFP threshold consists of two operations: depth-quality purification weighting (DQPW) and depth purification-enhanced attention (DPEA). To facilitate a better understanding of the overall workflow of the network, Figure 2 shows the pipeline of the entire network.
The features extracted from the five depth/RGB hierarchies are represented as $f_n^i$ ($n \in \{rgb, dep\}$, $i = 1, \ldots, 5$), the fused features are represented as $f_u^i$ ($i = 1, \ldots, 5$), and the features from the PPM are represented as $f_u^6$. This multi-modal feature fusion can be written as:
$$f_u^i = f_{rgb}^i + \left( \alpha_i \cdot \beta_i \otimes f_{dep}^i \right) \tag{1}$$
where $\alpha_i$ and $\beta_i$ are calculated by DQPW and DPEA, respectively, to control the fusion of the depth features $f_{dep}^i$, and $\otimes$ indicates element-by-element multiplication. After the encoding process shown in Figure 1, $f_u^i$ ($i = 1, \ldots, 5$) and $f_u^6$ are transferred to the next decoder module.
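To make the fusion in Equation (1) concrete, the following minimal PyTorch sketch shows how a single hierarchy level could be fused. The function and tensor names (e.g., fuse_level, f_rgb, f_dep) are illustrative assumptions rather than the authors’ released code; only the weighting-and-gating pattern follows Equation (1).

```python
import torch

def fuse_level(f_rgb, f_dep, alpha_i, beta_i):
    """Cross-modal fusion at one hierarchy level, following Eq. (1).

    f_rgb, f_dep : (B, C, H, W) RGB and depth features of level i
    alpha_i      : (B, 1, 1, 1) scalar quality weight from DQPW
    beta_i       : (B, 1, H, W) spatial attention map from DPEA
    """
    # Scale the depth features globally (alpha) and spatially (beta),
    # then inject them into the RGB stream by element-wise addition.
    return f_rgb + alpha_i * beta_i * f_dep

# toy usage
if __name__ == "__main__":
    B, C, H, W = 2, 16, 32, 32
    f_rgb = torch.randn(B, C, H, W)
    f_dep = torch.randn(B, C, H, W)
    alpha = torch.rand(B, 1, 1, 1)   # in (0, 1), one scalar per image
    beta = torch.rand(B, 1, H, W)    # spatial attention map
    print(fuse_level(f_rgb, f_dep, alpha, beta).shape)  # torch.Size([2, 16, 32, 32])
```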

3.2. Depth-Quality Purification Feature Processing (DQPFP)

DQPFP includes two crucial modules: DQPW (depth-quality purification weighting) and DPEA (depth purification-enhanced attention). These two modules calculate $\alpha_i$ and $\beta_i$ in Equation (1), respectively. $\alpha_i \in \mathbb{R}$ is a scalar that determines "how many" depth features are used, whereas $\beta_i \in \mathbb{R}^{s \times s}$ (where $s$ is the feature size at level $i$) is a spatial attention map that determines "which regions" to focus on within the depth features. The internal structures of the DQPW and DPEA modules are described below.

3.2.1. Depth-Quality Purification Weighting (DQPW)

The paired color features and depth features in RGB-D data are two different representations of the same object: color images provide visual cues, and depth images provide 3D information. Considering the inadequate quality of depth maps, this paper proposes a depth de-noising module (DDM). The DDM first purifies the depth features using the attention mechanism, then complements the color features through a residual connection [40], and uses the shortcut connection to retain more of the original color cues.
In the DDM, as shown in Figure 3, the RGB features are merged with the depth features and transmitted to the channel attention module to obtain the attention channel mask, which is employed to purify the depth features. Subsequently, the purified depth features are input into the spatial attention module to produce the attention space mask, purifying the depth features on a spatial level. This process can be represented as:
$$F_i^r = f_i^d \times SA\left( f_i^d \times CA\left( Cat\left( f_i^d, f_i^r \right) \right) \right) + f_i^r \tag{2}$$
where $f_i^r$ and $f_i^d$ represent the low-level color and depth features, respectively; $Cat(\cdot)$ represents the concatenation and subsequent convolution operations; $CA(\cdot)$ and $SA(\cdot)$ are the channel and spatial attention operations proposed in CBAM [41], respectively; "$\times$" denotes the element-wise multiplication operation; and "$+$" denotes the element-by-element addition operation. This process purifies poor-quality depth features and then merges them into the RGB features to produce a more accurate representation $F_i^r$.
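A minimal sketch of the DDM following Equation (2) is given below. The simplified CBAM-style channel/spatial attention blocks, the channel widths, and the reduction ratio are assumptions for illustration; only the purify-then-residual-merge pattern follows the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Simplified CBAM-style channel attention (average pooling only)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.mlp(x)                       # (B, C, 1, 1) channel mask

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention from channel-wise mean and max."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)

class DDM(nn.Module):
    """Depth de-noising module sketch following Eq. (2)."""
    def __init__(self, channels):
        super().__init__()
        self.cat_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
    def forward(self, f_r, f_d):
        # channel mask learned from the concatenated depth/RGB features
        ca_mask = self.ca(self.cat_conv(torch.cat([f_d, f_r], dim=1)))
        f_d_c = f_d * ca_mask                    # channel-purified depth
        # spatial mask from the purified depth, applied back to the depth features,
        # then residually merged with the RGB features
        return f_d * self.sa(f_d_c) + f_r

# usage: DDM(32)(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64))
```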
As shown in Figure 4, the low-level features $f_{rgb}^1$ and $f_{dep}^1$ first yield $f_{rgb}^{1\_en}$ through the DDM, and DQPW adaptively learns the weighting term $\alpha_i$ from the features $f_{rgb}^{1\_en}$ and $f_{dep}^1$. We apply convolution to $f_{rgb}^{1\_en}$/$f_{dep}^1$ to obtain the transformed features $f_{rt}$/$f_{dt}$, which are expected to contain more edge-related activations:
$$f_{rt} = \mathrm{BRRConv}_{1\times1}\left( f_{rgb}^{1\_en} \right), \quad f_{dt} = \mathrm{BRRConv}_{1\times1}\left( f_{dep}^1 \right) \tag{3}$$
where $\mathrm{BRRConv}_{1\times1}(\cdot)$ represents a $1\times1$ convolution with a BatchNorm layer and the RReLU activation. To assess the alignment of the low-level features, given the edge activations $f_{rt}$ and $f_{dt}$, the alignment feature vector $V_{BA}$, which encodes the alignment between $f_{rt}$ and $f_{dt}$, is computed as follows:
$$V_{BA} = \frac{\mathrm{GAP}\left( f_{rt} \otimes f_{dt} \right)}{\mathrm{GAP}\left( f_{rt} + f_{dt} \right)} \tag{4}$$
where $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation that aggregates element-level details and $\otimes$ represents element-level multiplication.
Additionally, to make $V_{BA}$ robust to minor edge movements, this paper calculates $V_{BA}$ on multiple scales and concatenates the results to produce the strengthened vector. Figure 4 shows that this multi-level computation is realized by downsampling the original features $f_{rt}$/$f_{dt}$ by max-pooling with a stride of 2, and then $V_{BA}^1$ and $V_{BA}^2$ are calculated in the same way as in Equation (4). Given that $V_{BA}$, $V_{BA}^1$, and $V_{BA}^2$ are the alignment feature vectors calculated at the three scales shown in Figure 4, the strengthened vector $V_{BA}^{ms}$ is calculated as follows:
$$V_{BA}^{ms} = \left[ V_{BA}, V_{BA}^1, V_{BA}^2 \right] \tag{5}$$
where $[\cdot]$ represents channel-wise concatenation. Then, two fully connected layers are used to calculate $\alpha \in \mathbb{R}^5$ from $V_{BA}^{ms}$ as follows:
$$\alpha = \mathrm{MLP}\left( V_{BA}^{ms} \right) \tag{6}$$
where $\mathrm{MLP}(\cdot)$ represents a two-layer perceptron with a Sigmoid function at the end. Then, $\alpha_i \in (0, 1)$ ($i = 1, 2, \ldots, 5$) is one element of the obtained vector $\alpha$. Note that this paper uses different weighting factors for different levels, and the effectiveness of this multivariable approach is verified in Section 4.4.
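The following PyTorch sketch summarizes the DQPW computation of Equations (3)-(6). The ratio form of the alignment vector follows the reconstruction of Equation (4), and the epsilon term, hidden width, and module/parameter names (DQPW, hidden, eps) are assumptions added for a runnable example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def brr_conv1x1(in_c, out_c):
    """1x1 convolution + BatchNorm + RReLU ('BRRConv' in the text)."""
    return nn.Sequential(nn.Conv2d(in_c, out_c, 1), nn.BatchNorm2d(out_c), nn.RReLU())

class DQPW(nn.Module):
    """Depth-quality purification weighting sketch (Eqs. (3)-(6))."""
    def __init__(self, channels, hidden=32, num_levels=5):
        super().__init__()
        self.conv_r = brr_conv1x1(channels, channels)
        self.conv_d = brr_conv1x1(channels, channels)
        # two fully connected layers ending with a Sigmoid
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_levels),
            nn.Sigmoid(),
        )

    @staticmethod
    def alignment(f_rt, f_dt, eps=1e-6):
        """V_BA = GAP(f_rt * f_dt) / GAP(f_rt + f_dt), per channel (Eq. (4))."""
        num = F.adaptive_avg_pool2d(f_rt * f_dt, 1)
        den = F.adaptive_avg_pool2d(f_rt + f_dt, 1)
        return (num / (den + eps)).flatten(1)          # (B, C)

    def forward(self, f_rgb1_en, f_dep1):
        f_rt, f_dt = self.conv_r(f_rgb1_en), self.conv_d(f_dep1)
        v = [self.alignment(f_rt, f_dt)]
        for _ in range(2):                             # two extra scales (Eq. (5))
            f_rt = F.max_pool2d(f_rt, 2, stride=2)
            f_dt = F.max_pool2d(f_dt, 2, stride=2)
            v.append(self.alignment(f_rt, f_dt))
        v_ms = torch.cat(v, dim=1)                     # strengthened vector
        return self.mlp(v_ms)                          # alpha in (0, 1)^5 (Eq. (6))

# usage: DQPW(16)(torch.randn(2, 16, 64, 64), torch.randn(2, 16, 64, 64))  # -> (2, 5)
```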

3.2.2. Depth Purification-Enhanced Attention (DPEA)

The DPEA enhances the depth features in the spatial dimension by deriving a global attention map $\beta_i$ from the depth channel. As shown in Figure 5, the purified features $f_{dep}^{5\_en}$ are first obtained from $f_{rgb}^1$ and $f_{dep}^5$ through the DDM to locate the coarse-grained salient areas (with supervision cues shown in Figure 1). In order to simplify the subsequent pixel-by-pixel operations, $f_{dep}^{5\_en}$ is compressed and then upsampled into $f_{dht}$ with the same dimensions as $f_{rgb}^1$/$f_{dep}^1$, as shown in the following formula:
$$f_{dht} = F_{UP8}\left( \mathrm{BRRConv}_{1\times1}\left( f_{dep}^{5\_en} \right) \right) \tag{7}$$
where $F_{UP8}(\cdot)$ represents $8\times$ bilinear upsampling. $f_{dht}$ is then re-calibrated with the primary RGB and depth features. As in the DQPW computation, this paper first transforms $f_{rgb}^1$/$f_{dep}^1$ into $f_{rt}$/$f_{dt}$. Element-level multiplication then generates the features $f_{ec}$, which emphasize the general edge-related activations. The max-pooling operation and dilated convolution operation are used to rapidly expand the receptive field to better model long-term relationships between the low- and high-level information (i.e., $f_{ec}$ and $f_{dht}$) while preserving the effectiveness of the DPEA. This re-calibration process is represented as:
$$F_{rec}\left( f_{dht} \right) = F_{UP2}\left( \mathrm{DConv}_{3\times3}\left( F_{DN2}\left( f_{dht} + f_{ec} \right) \right) \right) \tag{8}$$
where $F_{rec}(\cdot)$ denotes the re-calibration process; $\mathrm{DConv}_{3\times3}(\cdot)$ represents the $3\times3$ dilated convolution with a stride of 1 and a dilation rate of 2, followed by a BatchNorm layer and the RReLU activation; and $F_{UP2}(\cdot)$/$F_{DN2}(\cdot)$ indicates the bilinear upsampling/downsampling operation to $2$/$\frac{1}{2}$ times the initial dimensions. To achieve a balance between effectiveness and efficiency, the following two re-calibrations are performed:
$$f_{dht}' = F_{rec}\left( f_{dht} \right), \quad f_{dht}'' = F_{rec}\left( f_{dht}' \right) \tag{9}$$
where $f_{dht}'$ and $f_{dht}''$ are the features re-calibrated once and twice, respectively. Finally, $f_{dht}''$ is combined with $f_{ec}$ to obtain the global attention map:
$$\beta = \mathrm{BRRConv}_{3\times3}\left( f_{ec} + f_{dht}'' \right) \tag{10}$$
Note that the RReLU activation in $\mathrm{BRRConv}_{3\times3}$ is replaced with the Sigmoid activation to obtain the attention map $\beta$. Eventually, by downsampling $\beta$, five depth global attention maps $\beta_1, \beta_2, \ldots, \beta_5$ are obtained, which serve as spatial enhancement factors for the corresponding depth levels. In general, background clutter in the depth features that is unrelated to the salient objects can be suppressed by multiplying the depth features with the attention maps $\beta_1$-$\beta_5$.
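A sketch of the DPEA pipeline of Equations (7)-(10) is given below. Bilinear resizing is used for $F_{DN2}$/$F_{UP2}$ as defined after Equation (8); the channel widths, the exact way the five attention maps are produced by downsampling, and the class/parameter names (DPEA, mid_c) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPEA(nn.Module):
    """Depth purification-enhanced attention sketch (Eqs. (7)-(10))."""
    def __init__(self, low_c, high_c, mid_c=16):
        super().__init__()
        self.compress = nn.Sequential(nn.Conv2d(high_c, mid_c, 1),
                                      nn.BatchNorm2d(mid_c), nn.RReLU())
        self.conv_r = nn.Sequential(nn.Conv2d(low_c, mid_c, 1),
                                    nn.BatchNorm2d(mid_c), nn.RReLU())
        self.conv_d = nn.Sequential(nn.Conv2d(low_c, mid_c, 1),
                                    nn.BatchNorm2d(mid_c), nn.RReLU())
        # 3x3 dilated conv (stride 1, dilation 2) used inside the re-calibration
        self.dconv = nn.Sequential(nn.Conv2d(mid_c, mid_c, 3, padding=2, dilation=2),
                                   nn.BatchNorm2d(mid_c), nn.RReLU())
        # final 3x3 conv with Sigmoid producing the attention map (Eq. (10))
        self.head = nn.Sequential(nn.Conv2d(mid_c, 1, 3, padding=1),
                                  nn.BatchNorm2d(1), nn.Sigmoid())

    def recalibrate(self, f_dht, f_ec):
        # F_UP2( DConv3x3( F_DN2(f_dht + f_ec) ) ), Eq. (8)
        x = F.interpolate(f_dht + f_ec, scale_factor=0.5,
                          mode='bilinear', align_corners=False)
        x = self.dconv(x)
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, f_rgb1, f_dep1, f_dep5_en):
        # Eq. (7): compress the purified high-level depth features and upsample 8x
        f_dht = F.interpolate(self.compress(f_dep5_en), scale_factor=8,
                              mode='bilinear', align_corners=False)
        f_ec = self.conv_r(f_rgb1) * self.conv_d(f_dep1)   # edge-related activation
        for _ in range(2):                                 # two re-calibrations, Eq. (9)
            f_dht = self.recalibrate(f_dht, f_ec)
        beta = self.head(f_ec + f_dht)                     # (B, 1, H, W)
        # attention maps for the five levels, obtained by downsampling beta
        return [F.interpolate(beta, scale_factor=1 / 2 ** i, mode='bilinear',
                              align_corners=False) for i in range(5)]
```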

3.3. Dual-Stage Decoder

This work suggests a simpler two-phase decoder that comprises first fusion and second fusion stages to further increase efficiency, in contrast to the well-known UNet [17], which uses a hierarchical top-down decoding technique. Hierarchical grouping is used, denoted in Figure 1 as “G”. The first fusion aims to cut down on the feature channels and hierarchies. Based on the outputs of the first fusion stage, the low-level and high-level hierarchical structures are further aggregated to generate the final salient map. Note that in our decoder, instead of ordinary convolutions, separable depth-wise convolutional filters are mainly used with many input channels.

3.3.1. First Fusion Stage

This paper first uses a $3\times3$ depth-wise separable convolution [42] with a BatchNorm layer and the RReLU activation, represented as $\mathrm{DSConv}_{3\times3}(\cdot)$, to compress the encoder's features $f_u^i$ ($i = 1, 2, \ldots, 6$) into a unified channel size of 16. Then, the popular channel attention operator [43], $F_{CA}(\cdot)$, is used to improve the features through channel weighting. The procedure described above can be expressed as:
$$cf_u^i = F_{CA}\left( \mathrm{DSConv}_{3\times3}\left( f_u^i \right) \right) \tag{11}$$
where $cf_u^i$ represents the features after the compression and enhancement processes. This work, motivated by [16], splits the six feature hierarchies into high-level and low-level hierarchies as follows:
$$cf_u^{low} = \sum_{i=1}^{3} F_{UP2^{i-1}}\left( cf_u^i \right), \quad cf_u^{high} = \sum_{i=4}^{6} cf_u^i \tag{12}$$
where $F_{UPi}(\cdot)$ denotes bilinear upsampling to $i$ times the original size.
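A minimal sketch of the first fusion stage (Equations (11)-(12)) is shown below. The encoder channel list, the SE-style channel attention simplification, and the module names (FirstFusion, SEAttention) are assumptions; the compress-enhance-group pattern follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dsconv3x3(in_c, out_c):
    """3x3 depth-wise separable convolution + BatchNorm + RReLU."""
    return nn.Sequential(
        nn.Conv2d(in_c, in_c, 3, padding=1, groups=in_c),  # depth-wise
        nn.Conv2d(in_c, out_c, 1),                         # point-wise
        nn.BatchNorm2d(out_c), nn.RReLU())

class SEAttention(nn.Module):
    """Squeeze-and-Excitation channel attention [43] (simplified)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)

class FirstFusion(nn.Module):
    """First fusion stage sketch (Eqs. (11)-(12))."""
    def __init__(self, enc_channels, mid_c=16):
        super().__init__()
        self.compress = nn.ModuleList([dsconv3x3(c, mid_c) for c in enc_channels])
        self.enhance = nn.ModuleList([SEAttention(mid_c) for _ in enc_channels])

    def forward(self, feats):                 # feats: [f_u^1, ..., f_u^6]
        cf = [e(c(f)) for f, c, e in zip(feats, self.compress, self.enhance)]
        # levels 1-3 are upsampled to the level-1 resolution and summed
        low = sum(F.interpolate(cf[i], scale_factor=2 ** i, mode='bilinear',
                                align_corners=False) if i > 0 else cf[i]
                  for i in range(3))
        # levels 4-6 share one resolution and are summed directly
        high = sum(cf[3:])
        return low, high
```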

3.3.2. Second Fusion Stage

Since the number of channels and hierarchies have been reduced in the first fusion phase, the high-level and low-level hierarchies are directly concatenated in the second fusion phase and then provided to a prediction head to acquire the ultimate full-resolution prediction map, which is expressed as follows:
$$S_c = F_{pc}\left( \left[ cf_u^{low}, F_{UP8}\left( cf_u^{high} \right) \right] \right) \tag{13}$$
where $S_c$ represents the final saliency prediction, and $F_{pc}(\cdot)$ represents the prediction head, which consists of two $3\times3$ depth-wise separable convolutions (each followed by a BatchNorm layer and the RReLU activation function), a $3\times3$ convolution with Sigmoid activation, and a $2\times$ bilinear upsampling to restore the original input resolution.
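The second fusion stage of Equation (13) can be sketched as follows; the internal channel width and the class name (SecondFusion) are assumptions, while the concatenate-and-predict structure follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondFusion(nn.Module):
    """Second fusion stage sketch (Eq. (13)): concatenate the two hierarchies
    and run a lightweight prediction head."""
    def __init__(self, mid_c=16):
        super().__init__()
        def dsconv(in_c, out_c):
            return nn.Sequential(
                nn.Conv2d(in_c, in_c, 3, padding=1, groups=in_c),
                nn.Conv2d(in_c, out_c, 1),
                nn.BatchNorm2d(out_c), nn.RReLU())
        self.head = nn.Sequential(
            dsconv(2 * mid_c, mid_c),           # two depth-wise separable convs
            dsconv(mid_c, mid_c),
            nn.Conv2d(mid_c, 1, 3, padding=1),  # plain 3x3 conv
            nn.Sigmoid())

    def forward(self, cf_low, cf_high):
        # bring the high-level features to the low-level resolution (8x up)
        cf_high = F.interpolate(cf_high, scale_factor=8, mode='bilinear',
                                align_corners=False)
        s = self.head(torch.cat([cf_low, cf_high], dim=1))
        # 2x bilinear upsampling restores the original input resolution
        return F.interpolate(s, scale_factor=2, mode='bilinear', align_corners=False)

# usage: SecondFusion()(torch.randn(2, 16, 128, 128), torch.randn(2, 16, 16, 16))
```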

3.4. RReLU Activation Function

The activation function plays an important role in computer vision tasks such as object segmentation, object tracking, and object detection. An important aspect of neural network design is the selection of the activation functions to be used in the different layers of the network. The activation function is used to introduce nonlinearity into the neural network calculation, and the correct selection of the activation function is very important for the effective performance of the network.
Common activation functions, such as Sigmoid, Tanh, and so on, have good properties, but with the advent of deep neural architectures, it became difficult to train very deep neural networks because these activation functions saturate. To solve this problem, the ReLU activation function was introduced, as shown in Figure 6. Although ReLU is not differentiable at zero, it does not saturate, and it keeps the gradient constant in the positive interval. This effectively alleviates the problem of vanishing gradients in the neural network, thereby speeding up training. However, when the input is negative, ReLU produces dead neurons, whose corresponding weights are no longer updated, which may result in the loss of model information.
To address the problems with the ReLU activation function, in Section 4.4, we conduct a number of experiments to determine the optimal activation function for this model: RReLU. As shown in Figure 7, RReLU is a variant of ReLU that prevents overfitting by introducing randomness during model training while helping to resolve the issue of neuron inactivation. When the input is positive, the output equals the input; when the input is negative, the output is scaled by a small positive slope, so the gradient does not vanish. This negative-side slope is randomly sampled during training and fixed in subsequent tests.
The beauty of RReLU is that, during the training process, the slope $a_{ji}$ is randomly drawn from a uniform distribution $U(l, u)$, which helps increase the robustness of the model and reduce the dependence on specific input patterns, thereby mitigating the risk of overfitting. By introducing randomness, RReLU allows the activation values of neurons to vary within a range, even with negative inputs, thus avoiding complete neuron inactivation.
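The behavior described above can be observed directly with PyTorch's built-in RReLU module. The $(l, u)$ range used in this paper is not stated, so the sketch below uses PyTorch's default bounds of $1/8$ and $1/3$.

```python
import torch
import torch.nn as nn

# nn.RReLU samples the negative slope a ~ U(lower, upper) during training and
# uses the fixed value (lower + upper) / 2 at test time.
rrelu = nn.RReLU(lower=1.0 / 8, upper=1.0 / 3)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

rrelu.train()
print(rrelu(x))   # negative inputs scaled by a random slope in [1/8, 1/3]

rrelu.eval()
print(rrelu(x))   # negative inputs scaled by the fixed slope (1/8 + 1/3) / 2
```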

3.5. Pixel Position Adaptive Importance (PPAI) Loss

Despite being the most popular loss function for RGB and RGB-D SOD, binary cross-entropy (BCE) has three flaws. First, it disregards the image's overall structure and calculates each pixel's loss separately. Second, the loss of foreground pixels is diluted in images where the background predominates. Third, it gives each pixel the same treatment. In actuality, pixels in cluttered or constrained regions (e.g., a pole or a horn) are more likely to result in incorrect predictions and require additional effort, whereas pixels in regions such as roadways and trees require less focus. Therefore, this paper introduces the pixel position adaptive importance (PPAI) loss, which consists of two components, namely the weighted binary cross-entropy (wBCE) loss and the weighted IoU (wIoU) loss. The wBCE loss is shown in Equation (14):
$$L_{wbce}^s = -\frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \left( 1 + \gamma \alpha_{ij}^s \right) \sum_{l=0}^{1} \mathbf{1}\left( g_{ij}^s = l \right) \log \Pr\left( p_{ij}^s = l \mid \Psi \right)}{\sum_{i=1}^{H} \sum_{j=1}^{W} \gamma \alpha_{ij}^s} \tag{14}$$
where $\mathbf{1}(\cdot)$ is the indicator function and $\gamma$ is a hyperparameter. The symbol $l \in \{0, 1\}$ denotes the two types of labels. $p_{ij}^s$ and $g_{ij}^s$ are the prediction and the ground truth of the pixel at location $(i, j)$ in an image. $\Psi$ denotes all the parameters of the model, and $\Pr(p_{ij}^s = l \mid \Psi)$ represents the predicted probability.
In $L_{wbce}^s$, each pixel is assigned a weight $\alpha$. A hard pixel corresponds to a larger $\alpha$, whereas a simple pixel is assigned a smaller weight. $\alpha$, which is determined by the disparity between the central pixel and its surroundings, can be used as a measure of pixel significance, as shown in Equation (15):
$$\alpha_{ij}^s = \left| \frac{\sum_{m,n \in A_{ij}} gt_{mn}^s}{\sum_{m,n \in A_{ij}} 1} - gt_{ij}^s \right| \tag{15}$$
where $A_{ij}$ denotes the area surrounding the pixel $(i, j)$. For all pixels, $\alpha_{ij}^s \in [0, 1]$. If $\alpha_{ij}^s$ is large, the pixel at $(i, j)$ is significant (e.g., an edge or hole) and stands out from its surroundings; therefore, it warrants extra attention. In contrast, if $\alpha_{ij}^s$ is small, the pixel is an ordinary pixel and is not worth special attention.
$L_{wbce}^s$ places greater emphasis on hard pixels than BCE. Meanwhile, the local structural information is encoded into $L_{wbce}^s$ so that the model focuses on a larger receptive field rather than a single pixel. To further make the network focus on the overall structure, the weighted IoU (wIoU) loss is introduced, as shown in Equation (16):
$$L_{wiou}^s = 1 - \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \left( gt_{ij}^s \cdot p_{ij}^s \right) \left( 1 + \gamma \alpha_{ij}^s \right)}{\sum_{i=1}^{H} \sum_{j=1}^{W} \left( gt_{ij}^s + p_{ij}^s - gt_{ij}^s \cdot p_{ij}^s \right) \left( 1 + \gamma \alpha_{ij}^s \right)} \tag{16}$$
The IoU loss is frequently employed in image segmentation. It is not affected by the uneven distribution of pixels, and it targets the optimization of the global structure, which overcomes the limitation of focusing on single pixels. In recent years, it has been incorporated into SOD to address BCE's deficiencies. However, it still treats each pixel equally and ignores the differences between pixels. In contrast to the IoU loss, our wIoU loss gives harder pixels a higher weight to indicate their significance.
The pixel position adaptive importance (PPAI) loss is shown in Equation (17). It combines the information on local structures to assign different weights to each pixel and provides both a pixel-level restriction ($L_{wbce}^s$) and a global restriction ($L_{wiou}^s$), thus better guiding the network learning process and resulting in clearer details.
$$L_{ppai}^s = L_{wbce}^s + L_{wiou}^s \tag{17}$$
Eventually, the PPAI loss on the saliency prediction, $L_{ppai}^{s_c}$, and the deep supervision loss of the depth branch, $L_d$, make up the total loss $L$, which is formulated as follows:
$$L = L_{ppai}^{s_c}\left( S_c, G \right) + L_d\left( S_d, G \right) \tag{18}$$
where $G$ represents the ground truth (GT), and $L_{ppai}^{s_c}$ and $L_d$ denote the PPAI loss and the standard BCE loss, respectively.
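A compact sketch of Equations (14)-(18) is given below. The local area $A_{ij}$ is realized as a fixed average-pooling window, the hyperparameters (gamma = 5, a 31 x 31 window, eps) are assumptions, and the weighted BCE is normalized by the full weight sum, which is a common implementation choice; the paper's exact settings are not stated here.

```python
import torch
import torch.nn.functional as F

def ppai_loss(pred, gt, gamma=5, kernel=31, eps=1e-8):
    """PPAI loss sketch (Eqs. (14)-(17)).

    pred : (B, 1, H, W) predicted saliency probabilities in [0, 1]
    gt   : (B, 1, H, W) binary ground truth
    """
    # pixel importance: difference between each pixel and its surroundings (Eq. (15))
    alpha = torch.abs(F.avg_pool2d(gt, kernel, stride=1, padding=kernel // 2) - gt)
    weight = 1 + gamma * alpha

    # weighted BCE (Eq. (14)), normalized by the total weight
    bce = F.binary_cross_entropy(pred, gt, reduction='none')
    wbce = (weight * bce).sum(dim=(2, 3)) / (weight.sum(dim=(2, 3)) + eps)

    # weighted IoU (Eq. (16))
    inter = (weight * pred * gt).sum(dim=(2, 3))
    union = (weight * (pred + gt - pred * gt)).sum(dim=(2, 3))
    wiou = 1 - inter / (union + eps)

    return (wbce + wiou).mean()               # Eq. (17), averaged over the batch

def total_loss(s_c, s_d, gt):
    """Eq. (18): PPAI loss on the saliency output plus BCE deep supervision on
    the depth branch output (both assumed to be probabilities)."""
    return ppai_loss(s_c, gt) + F.binary_cross_entropy(s_d, gt)
```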

4. Experiments and Results

This section introduces the datasets and metrics, the details of the implementation, and comparisons to SOTAs. The experiments include both quantitative and qualitative experiments. Ablation experiments are also conducted to demonstrate the effectiveness of our proposed module.

4.1. Datasets and Metrics

Experiments were performed on six public datasets, including LFSD [44] (100 samples), NJU2K [45] (1996 samples), NLPR [46] (1023 samples), RGBD135 [47] (142 samples), SIP [48] (910 samples), and STERE [49] (1000 samples).
Meanwhile, four widely used metrics were employed for evaluation, including the S-measure ($S_\alpha$) [50], maximum F-measure ($F_\beta^{m}$) [51], maximum E-measure ($E_\varepsilon^{m}$) [52,53], and mean absolute error (MAE, $M$) [48]. Higher $S_\alpha$, $F_\beta^{m}$, and $E_\varepsilon^{m}$ values and a lower $M$ value indicate better performance.
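For reference, the two simplest of these metrics can be sketched as below; the S-measure and E-measure involve structural and alignment terms and are omitted here. The threshold count and epsilon terms are assumptions, while $\beta^2 = 0.3$ is the conventional setting from [51].

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and the GT, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over a sweep of binarization thresholds."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, num_thresholds):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```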

4.2. Details of the Implementation

The experiments were carried out on a personal computer equipped with an Intel(R) Xeon(R) Gold 6248 CPU and an NVIDIA Tesla V100-SXM2 32 GB GPU. DQPFPNet was implemented in PyTorch [54], and the RGB and depth inputs were both scaled to 256 × 256. To compensate for the limited number of training examples, following [16], this paper adopted a variety of data augmentation techniques, such as horizontal flipping, random cropping, color enhancement, etc. DQPFPNet was trained on a single Tesla V100 GPU for 300 epochs. The initial learning rate of the Adam optimizer [55] was set to $1 \times 10^{-4}$ with a batch size of 10. A poly learning rate decay strategy was used, with the power set to 0.9.
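The training recipe above can be summarized with the short sketch below. The placeholder model and the exact poly-decay formula are assumptions; the learning rate, epoch count, and power follow the values stated in this subsection.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)        # placeholder standing in for DQPFPNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

total_epochs, power = 300, 0.9

def poly_lr(epoch, base_lr=1e-4):
    """Poly decay: lr = base_lr * (1 - epoch / total_epochs) ** power."""
    return base_lr * (1 - epoch / total_epochs) ** power

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(epoch)
    # ... one training epoch with batch size 10 over the 256x256 inputs ...
```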

4.3. Comparison to SOTAs

A total of 1700 samples from NJU2K and 800 samples from NLPR were used for training, and tests were performed on STERE, SIP, NLPR, LFSD, NJU2K, and RGBD135. The results of DQPFPNet were compared to those of 16 state-of-the-art (SOTA) models, including C2DF [56], S2MA [33], JL-DCF [30], CoNet [57], UCNet [22], CIRNet [58], SSLSOD [59], cmMS [60], DANet [36], DCF [61], ATSA [62], DSA2F [63], PGAR [64], A2dele [37], MSal [65], and DFMNet [66], as shown in Table 1. The salient maps for the other models were derived from their released predictions, if available, or produced from their public code.
As shown in Table 1, DQPFPNet outperformed some existing efficient models in terms of detection accuracy, e.g., MSal [65], A2dele [37], and PGAR [64]. Additionally, it is evident that DQPFPNet achieved SOTA performance, indicating that the method of filtering the depth features in a multi-scale manner, fusing the filtered depth features with RGB features in a multi-scale manner, and finally, obtaining the salient graph through a two-stage decoder is of practical significance, thereby proving the effectiveness of our model. Validation of the functionality of each module is performed in Section 4.4. Figure 8 presents a visual comparison of the results of our proposed method and those of the SOTA methods, and our results are closer to the GT.

4.4. Ablation Experiments

Thorough ablation experiments were performed on six classical datasets, including STERE, SIP, RGBD135, NLPR, LFSD, and NJU2K, by changing or deleting parts of the DQPFPNet implementation.

4.4.1. Effectiveness of DQPFP

DQPFP is made up of two essential components: DQPW and DPEA. Table 2 displays several configurations with DQPW/DPEA disabled. Specifically, configuration #1 represents the baseline model with DQPW and DPEA removed from DQPFPNet. Configurations #2 and #3 each introduce one of the components, whereas configuration #4 represents the complete model of DQPFPNet. It can be seen from Table 2 that merging DQPW and DPEA into the baseline model resulted in consistent improvements on almost all datasets. Meanwhile, when comparing configurations #2/#3 to #4, it can be seen that using DQPW and DPEA together further improved the results, demonstrating a synergistic effect between DQPW and DPEA. The possible reason is that although DPEA can enhance potential salient areas in the deep dimension, it is inevitable that certain errors (for example, emphasizing the wrong areas) will occur, especially in the case of poor depth quality. Fortunately, DQPW mitigates some of these mistakes because it allocates lower global weights to the depth features in this case. Hence, the two elements can cooperate to increase network resiliency, as mentioned in Section 3.2.
Figure 9 shows visual examples of configuration #3 (without DQPW) and configuration #4. Figure 9a,b illustrate that incorporating DQPW contributes to improved detection accuracy. In the first good-quality example (row 1, Figure 9a), it is challenging to discern between the shadow and the person's legs in the RGB view, but this is simple in the depth view. The addition of DQPW enhances the depth features and makes it easier to distinguish the full human body from the shadow. In the first bad-quality example (row 1, Figure 9b), although the boy on the skateboard appears much more blurry in the depth view, the impact of the incorrect depth is lessened, and precise detection of the entire object is still possible.
Table 3 presents the results of the modular ablation experiments, demonstrating the positive effect of each module on detection accuracy. The baseline is the original model, with its precision as the benchmark. The modules are presented in order from the second to the fourth columns, with all other conditions remaining unchanged. In addition, all the experimental parameter configurations remained the same. Based on the detection outcomes, it is evident that the combination of the DQPFP module, RReLU activation function, and PPAI loss can greatly increase the model’s detection accuracy.

4.4.2. DQPFP Threshold Strategy

As described in Section 3.2, a multivariable strategy was used for α i and β i . To verify this strategy, it was compared to the single-variable strategy that uses the same (only one) α i and β i . Table 4 shows the results, and it is evident that the multi-factor approach used in this paper is better because it adds flexibility to the network, enabling it to render at different levels with different quality heuristic weights and attention maps.

4.4.3. Effectiveness of Loss and Activation

The loss function is one of the core components of deep learning, measuring the difference between the predicted results of the model and the true labels. By minimizing the value of the loss function, the model can gradually improve its performance during the training process. The loss function provides a clear optimization objective for neural networks and is an important bridge connecting data and model performance. It is necessary to choose a suitable loss function. Thus, we utilized the DQPFPNet to conduct comparative experiments on six datasets using the widely used BCE with the Sigmoid loss, MSE loss, Hinge loss, BCE loss, and PPAI loss to validate the effectiveness of PPAI loss, and the results are shown in Table 5. All other experimental settings remained the same, with only the loss function transformed each time. From the experimental results, it can be seen that the detection accuracy of the model was improved to some extent after using PPAI loss. This indicates that the PPAI loss can accelerate the convergence of the model and drive it toward better performance.
The activation function plays an important role in the backpropagation of neural networks. It introduces nonlinearity into the network, enabling it to learn complex patterns and make accurate predictions. Some activation functions have the problem of gradient disappearance during training, which leads to slow convergence and hinders the learning process. Therefore, the performance and training speed of neural networks can be greatly affected by choosing the appropriate activation function. We conducted ablation experiments and trained the DQPFPNet model using the ReLU, Sigmoid, Tanh, ELU, and RReLU activations, and the results are presented in Table 6. All other experimental configurations remained the same, with only the activation function changed for training each time. The experimental results show that compared with other activations, the RReLU activation enables the model to achieve higher accuracy. This may be related to the introduction of randomness in RReLU, which reduces the occurrence of neuronal “death” through a certain proportion of negative values, improves the stability of the network, and enhances its rich nonlinear expression ability.

4.4.4. Effectiveness of Dual-Stage Decoder

In Table 7, we present the results of ablation experiments on the decoder, where we used a single-stage decoder and a dual-stage decoder. All other conditions remained the same, with only the decoder architecture changing each time. Based on the outcomes of the experiment, it is evident that the resulting metrics when using the dual-stage decoder are better compared to the single-stage decoder across all six datasets, proving that the two-stage decoder is practical and effective. This may be due to the architectural advantages of the dual-stage decoder itself. The first fusion stage reduces the feature channel and hierarchical structure, and the second fusion stage further aggregates the low-level structure and the high-level structure to produce the final salient graph. This two-stage design can make full use of the context information and improve the modeling ability of the model.

5. Conclusions

This paper proposed DQPFPNet, an RGB-D SOD model with high efficiency and good performance. The method builds an efficient RGB-D SOD framework around DQPFP processing, greatly improving detection accuracy. DQPFP consists of three sub-modules: DDM, DQPW, and DPEA. The DDM filters multi-scale depth features through a channel attention mechanism and a spatial attention mechanism to achieve the initial filtering of the depth features. The DQPW module weights the depth features based on the alignment between the enhanced RGB features of the DDM module and the depth features, whereas the DPEA module focuses on the depth features spatially using multiple enhanced attention maps originating from the DDM-enhanced depth features refined with low-level RGB features. Additionally, the framework is built on a dual-stage decoder, which helps further increase efficiency. The pixel position adaptive importance (PPAI) loss is utilized to better explore the structural information in the features, making the network attach significance to detailed areas. In addition, the RReLU activation is used to solve the problem of neuronal "necrosis". Experiments conducted on six RGB-D datasets demonstrate that DQPFPNet performs well in terms of both metric values and visualizations. A limitation of the current model is that, in the comparison experiments with existing models, it did not achieve the best performance across all metrics and datasets, indicating that the network structure can still be improved. Furthermore, the behavior of the model on mobile or embedded devices is unknown. Hence, we will continue to explore new network architectures to optimize performance on common datasets in the future. In addition, we will attempt to deploy the DQPFP in embedded/mobile systems that handle RGB-D and video data and continue to optimize the model based on its performance metrics.

Author Contributions

Conceptualization, S.F., L.Z. and S.C.; investigation, S.F., L.Z., J.H., X.Z. and S.C.; methodology, S.F., L.Z., X.Z. and S.C.; code and validation, S.F., L.Z., J.H. and S.C.; writing—original draft preparation, S.F. and S.C.; writing—review and editing, S.F., L.Z., J.H., X.Z. and S.C.; data curation, S.F., L.Z., J.H. and X.Z.; funding acquisition, S.C., J.H. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61906168, 62101387, 62201400, and 62272267), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY23F020023, LZ23F020001), the Construction of Hubei Provincial Key Laboratory for Intelligent Visual Monitoring of Hydropower Projects (Grant No. 2022SDSJ01), the Hangzhou AI major scientific and technological innovation project (Grant No. 2022AIZD0061), the Project of Science and Technology Plans of Wenzhou City (Grant No. H20210001) and the Quzhou Science and Technology Projects (2022k91).

Data Availability Statement

This study did not report any data. We used public data for research. The URL and accessed date of the dataset are as follows: https://pan.baidu.com/s/1ckNlS0uEIPV-iCwVzjutsQ, training data, 2022-04-19 (Extracted code: eb2z). https://pan.baidu.com/s/1wI-bxarzdSrOY39UxZaomQ, test data, 2021-08-07 (Extracted code: 940i).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Deep learning (DL)
Red Green Blue (RGB)
Red Green Blue-Depth (RGB-D)
Salient Object Detection (SOD)
Convolutional neural network (CNN)
Depth-quality purification feature processing (DQPFP)
Rectified Linear Unit (ReLU)
Random ReLU (RReLU)
Pixel position adaptive importance (PPAI)
Depth-quality purification weighting (DQPW)
Depth purification-enhanced attention (DPEA)
Consumer Electronics (CE)
Software-Defined Networking (SDN)
Pyramid pooling module (PPM)
Depth de-noising module (DDM)
Channel attention (CA)
Spatial attention (SA)
Binary cross-entropy (BCE)
Intersection over Union (IoU)
Weighted binary cross-entropy (wBCE)
Weighted IoU (wIoU)
Ground truth (GT)
State of the art (SOTA)
Mean-square error (MSE)
Hyperbolic tangent function (Tanh)
Exponential Linear Unit (ELU)

References

  1. Chan, S.; Tao, J.; Zhou, X.; Bai, C.; Zhang, X. Siamese implicit region proposal network with compound attention for visual tracking. IEEE Trans. Image Process. 2022, 31, 1882. [Google Scholar] [CrossRef] [PubMed]
  2. Chan, S.; Yu, M.; Chen, Z.; Mao, J.; Bai, C. Regional Contextual Information Modeling for Small Object Detection on Highways. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  3. Dilshad, N.; Khan, T.; Song, J.S. Efficient Deep Learning Framework for Fire Detection in Complex Surveillance Environment. Comput. Syst. Sci. Eng. 2023, 46, 749–764. [Google Scholar] [CrossRef]
  4. Chan, S.; Wang, Y.; Lei, Y.; Cheng, X.; Chen, Z.; Wu, W. Asymmetric Cascade Fusion Network for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18. [Google Scholar] [CrossRef]
  5. Javeed, D.; Saeed, M.S.; Ahmad, I.; Kumar, P.; Jolfaei, A.; Tahir, M. An Intelligent Intrusion Detection System for Smart Consumer Electronics Network. IEEE Trans. Consum. Electron. 2023, 1. [Google Scholar] [CrossRef]
  6. Yar, H.; Ullah, W.; Ahmad Khan, Z.; Wook Baik, S. An Effective Attention-based CNN Model for Fire Detection in Adverse Weather Conditions. ISPRS J. Photogramm. Remote Sens. 2023, 206, 335–346. [Google Scholar] [CrossRef]
  7. Park, K.B.; Lee, J.Y. Novel industrial surface-defect detection using deep nested convolutional network with attention and guidance modules. J. Comput. Des. Eng. 2022, 9, 2466–2482. [Google Scholar] [CrossRef]
  8. Park, K.B.; Lee, J.Y. SwinE-Net: Hybrid deep learning approach to novel polyp segmentation using convolutional neural network and Swin Transformer. J. Comput. Des. Eng. 2022, 9, 616–632. [Google Scholar] [CrossRef]
  9. Fan, D.P.; Li, T.; Lin, Z.; Ji, G.P.; Zhang, D.; Cheng, M.M.; Fu, H.; Shen, J. Re-Thinking Co-Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4339–4354. [Google Scholar] [CrossRef]
  10. Jiang, H.; Wang, J.; Yuan, Z.; Wu, Y.; Zheng, N.; Li, S. Salient Object Detection: A Discriminative Regional Feature Integration Approach. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2083–2090. [Google Scholar] [CrossRef]
  11. Yin, B.; Zhang, X.; Li, Z.; Liu, L.; Cheng, M.M.; Hou, Q. DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation. arXiv 2023, arXiv:2309.09668. [Google Scholar]
  12. Cong, R.; Liu, H.; Zhang, C.; Zhang, W.; Zheng, F.; Song, R.; Kwong, S. Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 406–416. [Google Scholar] [CrossRef]
  13. Wu, Z.; Allibert, G.; Meriaudeau, F.; Ma, C.; Demonceaux, C. HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth Awareness. IEEE Trans. Image Process. 2023, 32, 2160–2173. [Google Scholar] [CrossRef] [PubMed]
  14. Cong, R.; Lei, J.; Zhang, C.; Huang, Q.; Cao, X.; Hou, C. Saliency Detection for Stereoscopic Images Based on Depth Confidence Analysis and Multiple Cues Fusion. IEEE Signal Process. Lett. 2016, 23, 819–823. [Google Scholar] [CrossRef]
  15. Chen, Z.; Cong, R.; Xu, Q.; Huang, Q. DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection. IEEE Trans. Image Process. 2020, 30, 7012–7024. [Google Scholar] [CrossRef] [PubMed]
  16. Fan, D.P.; Yingjie, Z.; Ali, B.; Jufeng, Y.; Ling, S. BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020. [Google Scholar]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer Assisted Intervention; Springer: Cham, Switzerland, 2015. [Google Scholar]
  18. Lang, C.; Nguyen, T.V.; Katti, H.; Yadati, K.; Kankanhalli, M.S.; Yan, S. Depth Matters: Influence of Depth Cues on Visual Saliency. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  19. Ren, J.; Gong, X.; Yu, L.; Zhou, W.; Yang, M.Y. Exploiting global priors for RGB-D saliency detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  20. Qu, L.; He, S.; Zhang, J.; Tian, J.; Tang, Y.; Yang, Q. RGBD Salient Object Detection via Deep Fusion. IEEE Trans. Image Process. 2017, 26, 2274–2285. [Google Scholar] [CrossRef] [PubMed]
  21. Sun, Y.; Gao, X.; Xia, C.; Ge, B.; Duan, S. GSCINet: Gradual Shrinkage and Cyclic Interaction Network for Salient Object Detection. Electronics 2022, 11, 1964. [Google Scholar] [CrossRef]
  22. Zhang, J.; Fan, D.P.; Dai, Y.; Anwar, S.; Saleh, F.S.; Zhang, T.; Barnes, N. UCNet: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  23. Li, G.; Liu, Z.; Ling, H. ICNet: Information Conversion Network for RGB-D Based Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 4873–4884. [Google Scholar] [CrossRef] [PubMed]
  24. Duan, S.; Gao, X.; Xia, C.; Ge, B. A2TPNet: Alternate Steered Attention and Trapezoidal Pyramid Fusion Network for RGB-D Salient Object Detection. Electronics 2022, 11, 1968. [Google Scholar] [CrossRef]
  25. Chen, H.; Li, Y.; Su, D. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognit. 2019, 86, 376–385. [Google Scholar] [CrossRef]
  26. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  27. Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep Modular Co-Attention Networks for Visual Question Answering. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  28. He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; Han, S. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  29. Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv 2017, arXiv:1710.09282. [Google Scholar]
  30. Fu, K.; Fan, D.P.; Ji, G.P.; Zhao, Q. JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection. arXiv 2020, arXiv:2004.08515. [Google Scholar]
  31. Zeng, Y.; Zhuge, Y.; Lu, H.; Zhang, L.; Qian, M.; Yu, Y. Multi-source weak supervision for saliency detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. Zhang, D.; Meng, D.; Zhao, L.; Han, J. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. In Proceedings of the International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016. [Google Scholar]
  33. Liu, N.; Zhang, N.; Han, J. Learning Selective Self-Mutual Attention for RGB-D Saliency Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  34. Zhou, T.; Fan, D.P.; Cheng, M.M.; Shen, J.; Shao, L. RGB-D salient object detection: A survey. Comput. Vis. Media 2021, 7, 37–69. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, Q.; Fu, K.; Liu, Z.; Chen, G.; Du, H.; Qiu, B.; Shao, L. EF-Net: A novel enhancement and fusion network for RGB-D saliency detection. Pattern Recognit. 2021, 112, 107740. [Google Scholar] [CrossRef]
  36. Zhao, X.; Zhang, L.; Pang, Y.; Lu, H.; Zhang, L. A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  37. Piao, Y.; Rong, Z.; Zhang, M.; Ren, W.; Lu, H. A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  38. Sun, F.; Xu, Y.; Sun, W. SPSN: Seed Point Selection Network in Point Cloud Instance Segmentation. In Proceedings of the International Joint Conference on Neural Network, Glasgow, UK, 19–24 July 2020. [Google Scholar]
  39. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  42. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  43. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  44. Li, N.; Ye, J.; Ji, Y.; Ling, H.; Yu, J. Saliency Detection on Light Field. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  45. Ju, R.; Ge, L.; Geng, W.; Ren, T.; Wu, G. Depth saliency based on anisotropic center-surround difference. In Proceedings of the International Conference on Image Processing, Paris, France, 27–30 October 2014. [Google Scholar]
  46. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD Salient Object Detection: A Benchmark and Algorithms. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  47. Cheng, Y.; Fu, H.; Wei, X.; Xiao, J.; Cao, X. Depth Enhanced Saliency Detection Method. In Proceedings of the International Conference on Internet Multimedia Computing and Service, Xiamen, China, 10–12 July 2014. [Google Scholar]
  48. Fan, D.P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.M. Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2075–2089. [Google Scholar] [CrossRef] [PubMed]
  49. Niu, Y.; Geng, Y.; Li, X.; Liu, F. Leveraging stereopsis for saliency analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  50. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A New Way to Evaluate Foreground Maps. arXiv 2017, arXiv:1708.00786. [Google Scholar]
  51. Achanta, R.; Hemami, S.S.; Estrada, F.J.; Süsstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  52. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment Measure for Binary Foreground Map Evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018. [Google Scholar]
  53. Fan, D.P.; Ji, G.P.; Qin, X.; Cheng, M.M. Cognitive vision inspired object segmentation metric and loss function. Sci. Sin. Inf. 2021, 51, 1475–1489. [Google Scholar] [CrossRef]
  54. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  55. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  56. Zhang, M.; Yao, S.; Hu, B.; Piao, Y.; Ji, W. C2DFNet: Criss-Cross Dynamic Filter Network for RGB-D Salient Object Detection. IEEE Trans. Multimed. 2022, 25, 5142–5154. [Google Scholar] [CrossRef]
  57. Ji, W.; Li, J.; Zhang, M.; Piao, Y.; Lu, H. Accurate RGB-D Salient Object Detection via Collaborative Learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  58. Cong, R.; Lin, Q.; Zhang, C.; Li, C.; Cao, X.; Huang, Q.; Zhao, Y. CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection. IEEE Trans. Image Process. 2022, 31, 6800–6815. [Google Scholar] [CrossRef]
  59. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Ruan, X. Self-Supervised Pretraining for RGB-D Salient Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 3463–3471. [Google Scholar] [CrossRef]
  60. Li, C.; Cong, R.; Piao, Y.; Xu, Q.; Loy, C.C. RGB-D Salient Object Detection with Cross-Modality Modulation and Selection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  61. Ji, W.; Li, J.; Yu, S.; Zhang, M.; Piao, Y.; Yao, S.; Bi, Q.; Ma, K.; Zheng, Y.; Lu, H.; et al. Calibrated RGB-D Salient Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  62. Zhang, M.; Fei, S.X.; Liu, J.; Xu, S.; Piao, Y.; Lu, H. Asymmetric Two-Stream Architecture for Accurate RGB-D Saliency Detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  63. Sun, P.; Zhang, W.; Wang, H.; Li, S.; Li, X. Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  64. Chen, S.; Fu, Y. Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  65. Wu, Y.H.; Liu, Y.; Xu, J.; Bian, J.W.; Gu, Y.C.; Cheng, M.M. MobileSal: Extremely Efficient RGB-D Salient Object Detection. arXiv 2020, arXiv:2012.13095. [Google Scholar] [CrossRef]
  66. Zhang, W.; Ji, G.P.; Wang, Z.; Fu, K.; Zhao, Q. Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection. arXiv 2021, arXiv:2107.01779. [Google Scholar]
Figure 1. The overall structure of the proposed DQPFPNet.
Figure 2. The pipeline of the network architecture.
Figure 3. The structure of the DDM.
Figure 4. The organization of the DQPW module. The red arrows show the Equation (4) calculation process. The dashed lines indicate max-pooling with a stride of 2.
Figure 5. The structure of the DPEA (depth purification-enhanced attention) module.
Figure 6. The ReLU activation function.
Figure 7. The RReLU activation function.
Figure 8. Qualitative comparison of DQPFPNet with SOTA RGB-D SOD methods.
Figure 9. Visual examples of configuration #3 (without DQPW) and configuration #4 (with DQPW) for good (a) and bad (b) depth-quality cases.
Table 1. Quantitative benchmark results. ↑/↓ for a metric denotes that a larger/smaller value is better. Our results are highlighted in bold. The best scores are shown in red. The second-best scores are shown in blue.
Dataset | Metric | C2DF (TMM 2022) | JL-DCF (CVPR 2020) | UCNet (CVPR 2020) | SSLSOD (AAAI 2022) | S2MA (CVPR 2020) | CoNet (ECCV 2020) | cmMS (ECCV 2020) | DANet (ECCV 2020) | ATSA (ECCV 2020) | DCF (CVPR 2022) | DSA2F (CVPR 2021) | A2dele (CVPR 2020) | PGAR (ECCV 2020) | MSal (TPAMI 2021) | DFMNet (CVPR 2022) | CIRNet (TIP 2022) | DQPFPNet (Ours)
SIP | S_α↑ | 0.871 | 0.879 | 0.875 | 0.870 | 0.878 | 0.858 | 0.867 | 0.878 | 0.864 | 0.876 | 0.862 | 0.829 | 0.875 | 0.873 | 0.873 | 0.861 | 0.885
SIP | F_β^m↑ | 0.865 | 0.885 | 0.879 | 0.862 | 0.884 | 0.867 | 0.871 | 0.884 | 0.873 | 0.884 | 0.875 | 0.834 | 0.877 | 0.883 | 0.878 | 0.840 | 0.896
SIP | E_ε^m↑ | 0.912 | 0.923 | 0.919 | 0.900 | 0.920 | 0.913 | 0.910 | 0.920 | 0.911 | 0.922 | 0.912 | 0.889 | 0.914 | 0.920 | 0.919 | 0.886 | 0.943
SIP | M↓ | 0.053 | 0.051 | 0.051 | 0.059 | 0.054 | 0.063 | 0.061 | 0.054 | 0.058 | 0.052 | 0.057 | 0.070 | 0.059 | 0.053 | 0.055 | 0.069 | 0.046
NLPR | S_α↑ | 0.927 | 0.925 | 0.920 | 0.914 | 0.915 | 0.908 | 0.915 | 0.915 | 0.907 | 0.924 | 0.919 | 0.890 | 0.918 | 0.920 | 0.923 | 0.920 | 0.931
NLPR | F_β^m↑ | 0.904 | 0.916 | 0.903 | 0.881 | 0.902 | 0.887 | 0.896 | 0.903 | 0.876 | 0.912 | 0.906 | 0.875 | 0.898 | 0.908 | 0.907 | 0.881 | 0.930
NLPR | E_ε^m↑ | 0.955 | 0.962 | 0.956 | 0.941 | 0.950 | 0.945 | 0.949 | 0.953 | 0.945 | 0.963 | 0.952 | 0.937 | 0.948 | 0.961 | 0.956 | 0.937 | 0.961
NLPR | M↓ | 0.021 | 0.022 | 0.025 | 0.027 | 0.030 | 0.031 | 0.027 | 0.029 | 0.028 | 0.022 | 0.024 | 0.031 | 0.028 | 0.025 | 0.026 | 0.028 | 0.022
NJU2K | S_α↑ | 0.908 | 0.903 | 0.897 | 0.902 | 0.894 | 0.895 | 0.900 | 0.891 | 0.901 | 0.904 | 0.895 | 0.868 | 0.906 | 0.905 | 0.904 | 0.901 | 0.906
NJU2K | F_β^m↑ | 0.898 | 0.903 | 0.895 | 0.887 | 0.889 | 0.892 | 0.897 | 0.880 | 0.893 | 0.906 | 0.897 | 0.872 | 0.905 | 0.905 | 0.905 | 0.880 | 0.910
NJU2K | E_ε^m↑ | 0.936 | 0.944 | 0.936 | 0.929 | 0.930 | 0.937 | 0.936 | 0.932 | 0.921 | 0.950 | 0.936 | 0.914 | 0.940 | 0.942 | 0.945 | 0.917 | 0.947
NJU2K | M↓ | 0.038 | 0.043 | 0.043 | 0.043 | 0.053 | 0.047 | 0.044 | 0.048 | 0.040 | 0.040 | 0.044 | 0.052 | 0.045 | 0.041 | 0.041 | 0.047 | 0.036
RGBD135 | S_α↑ | 0.898 | 0.929 | 0.934 | 0.905 | 0.941 | 0.910 | 0.932 | 0.904 | 0.907 | 0.905 | 0.917 | 0.884 | 0.894 | 0.929 | 0.932 | 0.900 | 0.941
RGBD135 | F_β^m↑ | 0.885 | 0.919 | 0.930 | 0.883 | 0.935 | 0.896 | 0.922 | 0.894 | 0.885 | 0.894 | 0.916 | 0.873 | 0.879 | 0.924 | 0.924 | 0.888 | 0.942
RGBD135 | E_ε^m↑ | 0.946 | 0.968 | 0.976 | 0.941 | 0.973 | 0.945 | 0.970 | 0.957 | 0.952 | 0.951 | 0.954 | 0.920 | 0.929 | 0.970 | 0.969 | 0.927 | 0.976
RGBD135 | M↓ | 0.031 | 0.022 | 0.019 | 0.025 | 0.021 | 0.029 | 0.020 | 0.029 | 0.024 | 0.024 | 0.023 | 0.030 | 0.032 | 0.021 | 0.020 | 0.051 | 0.019
LFSD | S_α↑ | 0.863 | 0.862 | 0.864 | 0.859 | 0.837 | 0.862 | 0.849 | 0.845 | 0.865 | 0.842 | 0.883 | 0.834 | 0.833 | 0.847 | 0.863 | 0.822 | 0.871
LFSD | F_β^m↑ | 0.859 | 0.866 | 0.864 | 0.867 | 0.835 | 0.859 | 0.869 | 0.846 | 0.862 | 0.842 | 0.889 | 0.832 | 0.831 | 0.841 | 0.864 | 0.803 | 0.871
LFSD | E_ε^m↑ | 0.897 | 0.901 | 0.905 | 0.900 | 0.873 | 0.907 | 0.896 | 0.886 | 0.905 | 0.883 | 0.924 | 0.874 | 0.893 | 0.888 | 0.902 | 0.834 | 0.906
LFSD | M↓ | 0.065 | 0.071 | 0.066 | 0.066 | 0.094 | 0.071 | 0.074 | 0.083 | 0.064 | 0.075 | 0.055 | 0.077 | 0.093 | 0.078 | 0.071 | 0.096 | 0.065
STERE | S_α↑ | 0.899 | 0.905 | 0.903 | 0.893 | 0.890 | 0.908 | 0.895 | 0.892 | 0.897 | 0.902 | 0.898 | 0.885 | 0.903 | 0.903 | 0.898 | 0.835 | 0.904
STERE | F_β^m↑ | 0.891 | 0.901 | 0.899 | 0.890 | 0.882 | 0.904 | 0.891 | 0.881 | 0.884 | 0.901 | 0.900 | 0.885 | 0.893 | 0.895 | 0.891 | 0.847 | 0.901
STERE | E_ε^m↑ | 0.938 | 0.946 | 0.944 | 0.936 | 0.932 | 0.948 | 0.937 | 0.930 | 0.921 | 0.945 | 0.942 | 0.935 | 0.936 | 0.940 | 0.942 | 0.911 | 0.947
STERE | M↓ | 0.046 | 0.042 | 0.039 | 0.044 | 0.051 | 0.040 | 0.042 | 0.048 | 0.039 | 0.039 | 0.039 | 0.043 | 0.044 | 0.041 | 0.044 | 0.066 | 0.040
Table 2. Ablation analysis of DQPFP to validate the effectiveness of DQPW and DPEA. A check mark (✓) below a module indicates that the model uses that module; otherwise, it does not. The best results are shown in red.
# | DQPW | DPEA | SIP | NLPR | NJU2K | RGBD135 | LFSD | STERE
Metrics per dataset (left to right): S_α↑, F_β^m↑, E_ε^m↑, M↓.
1 | – | – | 0.873 0.879 0.919 0.054 | 0.912 0.899 0.954 0.027 | 0.898 0.903 0.941 0.042 | 0.926 0.931 0.971 0.017 | 0.850 0.853 0.891 0.075 | 0.885 0.883 0.938 0.047
2 | ✓ | – | 0.877 0.885 0.923 0.051 | 0.916 0.905 0.958 0.025 | 0.941 0.902 0.898 0.042 | 0.941 0.941 0.968 0.016 | 0.853 0.857 0.895 0.074 | 0.885 0.887 0.940 0.046
3 | – | ✓ | 0.876 0.883 0.923 0.051 | 0.914 0.901 0.954 0.025 | 0.897 0.903 0.941 0.043 | 0.934 0.931 0.976 0.018 | 0.855 0.856 0.895 0.073 | 0.889 0.886 0.940 0.045
4 | ✓ | ✓ | 0.885 0.896 0.923 0.046 | 0.922 0.916 0.961 0.023 | 0.904 0.910 0.947 0.039 | 0.930 0.942 0.976 0.019 | 0.870 0.869 0.906 0.068 | 0.902 0.898 0.947 0.041
Table 3. Quantitative module results. ↑/↓ for a metric denotes that a larger/smaller value is better. The best scores are shown in red.
Dataset | Metric | Baseline | Baseline + DQPFP | Baseline + DQPFP + RReLU | Baseline + DQPFP + RReLU + PPAI
SIP | S_α↑ | 0.8732 | 0.8751 | 0.8796 | 0.8850
SIP | F_β^m↑ | 0.8779 | 0.8816 | 0.8874 | 0.8960
SIP | E_ε^m↑ | 0.9191 | 0.9249 | 0.9372 | 0.9425
SIP | M↓ | 0.0552 | 0.0515 | 0.0506 | 0.0460
NLPR | S_α↑ | 0.9233 | 0.9265 | 0.9277 | 0.9311
NLPR | F_β^m↑ | 0.9074 | 0.9078 | 0.9111 | 0.9300
NLPR | E_ε^m↑ | 0.9562 | 0.9577 | 0.9583 | 0.9612
NLPR | M↓ | 0.0258 | 0.0249 | 0.0244 | 0.0221
NJU2K | S_α↑ | 0.9041 | 0.9042 | 0.9051 | 0.9066
NJU2K | F_β^m↑ | 0.9052 | 0.9061 | 0.9075 | 0.9100
NJU2K | E_ε^m↑ | 0.9456 | 0.9458 | 0.9455 | 0.9467
NJU2K | M↓ | 0.0418 | 0.0411 | 0.0406 | 0.0364
RGBD135 | S_α↑ | 0.9321 | 0.9325 | 0.9340 | 0.9411
RGBD135 | F_β^m↑ | 0.9241 | 0.9262 | 0.9277 | 0.9423
RGBD135 | E_ε^m↑ | 0.9690 | 0.9715 | 0.9738 | 0.9761
RGBD135 | M↓ | 0.0207 | 0.0205 | 0.0202 | 0.0190
LFSD | S_α↑ | 0.8639 | 0.8654 | 0.8700 | 0.8710
LFSD | F_β^m↑ | 0.8645 | 0.8652 | 0.8663 | 0.8710
LFSD | E_ε^m↑ | 0.9026 | 0.9032 | 0.9055 | 0.9063
LFSD | M↓ | 0.0708 | 0.0734 | 0.0684 | 0.0654
STERE | S_α↑ | 0.8986 | 0.8994 | 0.9011 | 0.9042
STERE | F_β^m↑ | 0.8916 | 0.8922 | 0.8937 | 0.9013
STERE | E_ε^m↑ | 0.9426 | 0.9425 | 0.9427 | 0.9472
STERE | M↓ | 0.0439 | 0.0433 | 0.0427 | 0.0403
Table 4. DQPFP threshold strategy: using identical (only one) α_i and β_i vs. using multiple α_i and β_i (five different values). The best scores are shown in red.
# | Strategy | SIP | NLPR | NJU2K | RGBD135 | LFSD | STERE
Metrics per dataset (left to right): S_α↑, F_β^m↑, E_ε^m↑, M↓.
5 | Identical | 0.876 0.884 0.923 0.051 | 0.916 0.905 0.955 0.025 | 0.900 0.902 0.941 0.041 | 0.931 0.927 0.968 0.019 | 0.853 0.852 0.895 0.074 | 0.890 0.891 0.941 0.044
4 | Multiple | 0.885 0.896 0.923 0.046 | 0.922 0.916 0.961 0.023 | 0.904 0.910 0.947 0.039 | 0.930 0.942 0.976 0.019 | 0.870 0.869 0.906 0.068 | 0.902 0.898 0.947 0.041
Table 5. Ablation analysis of DQPFPNet to validate the effectiveness of the PPAI loss. A check mark (✓) below a loss indicates that the model is trained with that loss; otherwise, it is not. The best results are shown in red.
# | BCE-Logits | MSE | Hinge | BCE | PPAI | SIP | NLPR | NJU2K | RGBD135 | LFSD | STERE
Metrics per dataset (left to right): S_α↑, F_β^m↑, E_ε^m↑, M↓.
6 | ✓ | – | – | – | – | 0.8730 0.8790 0.9190 0.0540 | 0.9120 0.8990 0.9540 0.0270 | 0.8980 0.9030 0.9410 0.0420 | 0.9260 0.9310 0.9710 0.0170 | 0.8500 0.8530 0.8910 0.0750 | 0.8850 0.8830 0.9380 0.0470
7 | – | ✓ | – | – | – | 0.5926 0.6462 0.5545 0.3250 | 0.7325 0.6517 0.6607 0.1140 | 0.6764 0.6801 0.7251 0.1260 | 0.7288 0.6603 0.6585 0.1195 | 0.7362 0.6134 0.7549 0.1225 | 0.7884 0.7531 0.6949 0.0980
8 | – | – | ✓ | – | – | 0.4991 0.6450 0.5250 0.3420 | 0.6394 0.7517 0.6325 0.1324 | 0.7826 0.5801 0.6250 0.2684 | 0.7351 0.6684 0.7250 0.1107 | 0.6684 0.7134 0.6822 0.2463 | 0.6948 0.6531 0.7120 0.2310
9 | – | – | – | ✓ | – | 0.8685 0.8715 0.9154 0.0578 | 0.9170 0.8976 0.9562 0.0270 | 0.8982 0.9011 0.9429 0.0424 | 0.9213 0.9084 0.9610 0.0248 | 0.8547 0.8469 0.8908 0.0746 | 0.8983 0.8919 0.9421 0.0443
10 | – | – | – | – | ✓ | 0.8740 0.8810 0.9323 0.0532 | 0.9211 0.9048 0.9564 0.0254 | 0.9029 0.9040 0.9456 0.0405 | 0.9312 0.9411 0.9716 0.0217 | 0.8520 0.8558 0.8949 0.0721 | 0.9015 0.8919 0.9442 0.0423
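The PPAI loss itself is defined in the main text; purely as an illustration of how a pixel-position-adaptive weight can be layered on top of the BCE-with-logits baseline compared in Table 5, the sketch below derives per-pixel weights from local structure in the ground truth. The 31 × 31 averaging window and the weighting factor of 5 are assumed values for this example, not settings taken from the paper.

```python
# A hedged sketch of a pixel-position-adaptive BCE loss; the weighting scheme
# below (border-aware weights from a local average of the ground truth) is an
# illustrative assumption, not the exact PPAI formulation used in DQPFPNet.
import torch
import torch.nn.functional as F

def adaptive_weighted_bce(logits: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Pixels whose neighborhood disagrees with their own label (object boundaries,
    # thin structures) receive larger weights than flat background/foreground.
    local_mean = F.avg_pool2d(gt, kernel_size=31, stride=1, padding=15)
    weight = 1.0 + 5.0 * torch.abs(local_mean - gt)
    bce = F.binary_cross_entropy_with_logits(logits, gt, reduction="none")
    # Weighted per-image average, shape (B, 1).
    return (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

if __name__ == "__main__":
    logits = torch.randn(2, 1, 224, 224)                 # raw network outputs
    gt = (torch.rand(2, 1, 224, 224) > 0.5).float()      # dummy ground truth
    print(adaptive_weighted_bce(logits, gt).mean().item())
```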
Table 6. Ablation analysis of DQPFPNet to validate the effectiveness of the RReLU activation. A check mark (✓) below an activation indicates that the model uses that activation; otherwise, it does not. The best results are shown in red.
# | ReLU | Sigmoid | Tanh | ELU | RReLU | SIP | NLPR | NJU2K | RGBD135 | LFSD | STERE
Metrics per dataset (left to right): S_α↑, F_β^m↑, E_ε^m↑, M↓.
11 | ✓ | – | – | – | – | 0.8730 0.8790 0.9190 0.0540 | 0.9070 0.8990 0.9540 0.0207 | 0.8693 0.9030 0.9410 0.0420 | 0.9260 0.9310 0.9710 0.0270 | 0.8500 0.8530 0.8910 0.0750 | 0.8850 0.8830 0.9380 0.0470
12 | – | ✓ | – | – | – | 0.3921 0.2462 0.2573 0.2052 | 0.4316 0.2517 0.3655 0.1043 | 0.4752 0.5631 0.6581 0.0852 | 0.4279 0.4603 0.5653 0.1008 | 0.6605 0.6134 0.6569 0.065 | 0.5387 0.5624 0.6374 0.0837
13 | – | – | ✓ | – | – | 0.4825 0.3462 0.4954 0.1196 | 0.6182 0.4517 0.6599 0.0725 | 0.6651 0.6801 0.6746 0.0638 | 0.5119 0.6603 0.6537 0.0730 | 0.5524 0.6334 0.5789 0.0863 | 0.6755 0.6531 0.7975 0.0625
14 | – | – | – | ✓ | – | 0.8760 0.8830 0.9210 0.0510 | 0.9020 0.8891 0.9523 0.0360 | 0.8970 0.9030 0.9410 0.0430 | 0.9296 0.9320 0.9660 0.0180 | 0.8550 0.8560 0.8970 0.0739 | 0.8890 0.8860 0.9400 0.0450
15 | – | – | – | – | ✓ | 0.8842 0.8816 0.9249 0.0506 | 0.9120 0.8992 0.9542 0.1013 | 0.8980 0.9036 0.9457 0.0411 | 0.9340 0.9429 0.9738 0.0170 | 0.8564 0.8621 0.8997 0.0734 | 0.8817 0.8901 0.9425 0.0413
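Since the framework is implemented in PyTorch [54], switching among the activations compared in Table 6 amounts to a one-line change. The block below is a minimal, hypothetical example in which nn.RReLU (with PyTorch's default lower/upper bounds of 1/8 and 1/3) replaces ReLU; the conv_block helper is an illustrative construct, not code from DQPFPNet.

```python
# Minimal sketch (not DQPFPNet code): a conv block whose activation can be
# switched among the functions compared in Table 6. nn.RReLU samples a random
# negative slope in [lower, upper] during training and uses their mean at
# inference time, which helps keep negative inputs from producing dead neurons.
import torch
import torch.nn as nn

def make_activation(name: str) -> nn.Module:
    return {
        "relu": nn.ReLU(inplace=True),
        "sigmoid": nn.Sigmoid(),
        "tanh": nn.Tanh(),
        "elu": nn.ELU(),
        "rrelu": nn.RReLU(lower=1.0 / 8, upper=1.0 / 3),  # PyTorch defaults
    }[name]

def conv_block(in_ch: int, out_ch: int, act: str = "rrelu") -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        make_activation(act),
    )

if __name__ == "__main__":
    block = conv_block(3, 64, act="rrelu")
    x = torch.randn(1, 3, 64, 64)
    print(block(x).shape)  # torch.Size([1, 64, 64, 64])
```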
Table 7. Ablation analysis of DQPFPNet to validate the effectiveness of the dual-stage decoder. A check mark (✓) below a decoder indicates that the model uses that decoder; otherwise, it does not. The best results are shown in red.
# | Single-Stage | Dual-Stage | SIP | NLPR | NJU2K | RGBD135 | LFSD | STERE
Metrics per dataset (left to right): S_α↑, F_β^m↑, E_ε^m↑, M↓.
16 | ✓ | – | 0.8685 0.8715 0.9154 0.0588 | 0.9211 0.9088 0.9565 0.0371 | 0.8979 0.9011 0.9326 0.0424 | 0.9213 0.9084 0.9610 0.0324 | 0.8547 0.8469 0.8908 0.0746 | 0.8983 0.8919 0.9269 0.0542
17 | – | ✓ | 0.8850 0.8960 0.9425 0.0460 | 0.9311 0.9300 0.9612 0.0221 | 0.9066 0.9100 0.9467 0.0364 | 0.9411 0.9423 0.9761 0.0190 | 0.8710 0.8710 0.9063 0.0654 | 0.9042 0.9013 0.9472 0.0403
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
