1. Introduction
Visual saliency modeling uses algorithms that simulate characteristics of the human visual system to locate the prominent regions in an image. Salient Object Detection (SOD) aims to identify the most visually conspicuous objects in an image. It has developed rapidly and is widely used in many fields, including object tracking [1], object detection [2,3], object segmentation [4,5], and as a pre-processing step for other computer vision tasks [6]. Deep learning has advanced considerably over the past few years, and many SOD methods have been proposed. However, the majority of current SOD models can only handle RGB images.
Park et al. proposed a surface-defect detection method [7] that utilizes a deep nested convolutional neural network (NC-NET) with attention and guiding modules to precisely segment defect regions from complicated backgrounds and adaptively refine features. To overcome the inherent limitations of convolution, SwinE-Net [8] effectively combines the CNN-based EfficientNet with the Vision Transformer (ViT)-based Swin Transformer for segmentation; this combination preserves global semantics while maintaining low-level characteristics and demonstrates a certain degree of generalization and scalability. CoEg-Net [9] employs a shared attention projection technique to facilitate fast learning from common information, utilizing large SOD datasets to significantly enhance the model’s scalability and stability. DRFI [10] automatically integrates high-dimensional regional saliency features and selects the most discriminative cues. However, relying on color information alone inevitably creates challenges for SOD in intricate scenes, for example, cluttered or low-contrast backgrounds where color provides few cues.
To address the aforementioned problem, combining RGB and depth features for RGB-D SOD has received increasing attention. To learn transferable representations for RGB-D segmentation tasks, Yin et al. [11] proposed an RGB-D framework, DFormer, which encodes RGB and depth information through a series of RGB-D blocks; because the model is pre-trained on ImageNet-1K, DFormer can encode RGB-D representations effectively. To better model global long-range dependencies within and across modalities, Cong et al. [12] introduced the transformer architecture to create a new RGB-D SOD network called point-aware interaction and CNN-induced refinement (PICR-Net). The network explores the interaction of features across modalities and alleviates the block effect and detail destruction problems caused by transformers. Wu et al. [13] designed HiDAnet, which includes a granularity-based attention strategy to enhance the fusion of RGB and depth features. Note that, as suggested by previous work, accuracy depends greatly on the quality of the depth information. Cong et al. [14] suggested a method for assessing the reliability of depth maps and utilized it to minimize the impact of inferior depth maps on saliency detection. DPA-Net [15] recognizes the potential value of depth information through a learning-based approach, preventing contamination by accounting for depth potentiality. Although BBS-Net [16] employs a depth-enhanced module to selectively extract informative regions of depth cues from both channel and spatial viewpoints, the quality of the depth features is still limited, so the prediction accuracy remains inadequate. Moreover, although the above models consider the quality of depth features, they only perform single-scale filtering and fuse RGB and depth features at the coarsest level, without considering multi-scale filtering and fusion. This may lead to coarse features and insufficient feature utilization and fusion. In addition, Cong et al. [14] adopted a top-down UNet [17] architecture, which performs well in extracting and integrating local information but cannot effectively capture global information, which is a notable limitation.
The above facts indicate that multi-scale filtering of depth features and multi-scale fusion with RGB features can improve feature utilization and fusion rates, thereby enhancing a model’s accuracy. In addition, a decoder that can capture both global and local information has a significant impact on the performance of a model. Based on this, we propose a depth-quality purification feature processing (DQPFP) network for RGB-D SOD in this paper.
Figure 1 shows the overall network architecture. The DQPFP module consists of three key sub-modules, namely a depth denoising module (DDM), a depth-quality purification weighting (DQPW) module, and a depth purification-enhanced attention (DPEA) module. The DDM filters multi-scale depth features through a channel attention mechanism and a spatial attention mechanism to achieve the initial filtering of the depth features. The DQPW module supplements the color features with purified depth features in a residual-connected manner to enhance feature characterization and then learns a weight factor from the depth and RGB features; by assigning smaller weights to poor-quality depth features, different weight factors are obtained at different scales. The DPEA module learns global attention maps from the purified depth features, which enhance the quality of the depth features along the spatial dimension. The outputs of the DQPW and DPEA modules are then integrated to obtain the final high-quality depth features. These high-quality depth features and the RGB features are fused in a multi-scale manner, and the final saliency map is generated through a two-stage decoder. In addition, after experimental analysis, we adopt the Randomized Leaky Rectified Linear Unit (RReLU) activation function, which introduces randomness into the training process to prevent overfitting and avoid neuron inactivation. Furthermore, we introduce the pixel position adaptive importance (PPAI) loss, which integrates local structure information to assign different weights to each pixel, thus better guiding the network’s learning process and producing clearer details.
Our contributions can be summarized as follows:
We propose a DQPFP module consisting of three sub-modules: DDM, DQPW, and DPEA. This module filters the depth features and fuses them with the RGB features in a multi-scale manner. It also explicitly controls and enhances the depth features during cross-modal fusion, avoiding the injection of noisy or misleading depth features, which improves the feature utilization, fusion quality, and accuracy of the model.
We design a dual-stage decoder as one of DQPFPNet’s essential elements, which can fully utilize contextual information to improve the modeling ability and efficiency of the network.
We introduce the RReLU activation function to prevent overfitting and avoid neuron inactivation, thereby introducing randomness into the training process. Furthermore, the pixel position adaptive importance (PPAI) loss is utilized to integrate local structure information to assign different weights to each pixel, thus better guiding the network’s learning process and resulting in clearer details.
Extensive experiments on six RGB-D datasets demonstrate that DQPFPNet outperforms recent efficient models.
The remainder of this paper is structured as follows. Section 2 reviews related research on general RGB-D SOD, efficient RGB-D SOD, and depth-quality analysis in RGB-D SOD. Section 3 describes the proposed DQPFPNet in detail. Section 4 presents the experimental results, performance evaluation, and ablation analysis. Finally, conclusions are provided in Section 5.
3. Proposed Method
3.1. Overview
Figure 1 presents the proposed DQPFPNet structure, which consists of the encoder, the decoder, and the supervision module. Our encoder adopts the architecture in [16], where the RGB module is in charge of both feature extraction for RGB and cross-modal fusion between RGB and depth features to achieve strong performance. To create the final saliency map, the decoder performs a dual-stage fusion, namely the first fusion and the second fusion. The encoder itself is made up of an RGB-related module, whose backbone network is MobileNet-v2 [2]; a depth-related module, which is an efficient backbone; and the proposed DQPFP. The depth module and the RGB module each comprise five feature hierarchies; each hierarchy has an output stride of 2, except the last one, which has an output stride of 1. The depth features extracted within a given hierarchy are passed through the DQPFP gate, added to the RGB stream through simple element-wise addition, and then sent to the next hierarchy. Moreover, a pyramid pooling module (PPM) [39] is introduced at the end of the RGB module to acquire multi-scale semantic information. In practice, the DQPFP gate consists of two operations: depth-quality purification weighting (DQPW) and depth purification-enhanced attention (DPEA). To facilitate a better understanding of the overall workflow of the network, Figure 2 shows the pipeline of the entire network.
The features extracted from the five depth/RGB hierarchies are denoted $f_i^{d}$/$f_i^{r}$ ($i = 1, \dots, 5$), the fused features are denoted $f_i^{cm}$, and the features from the PPM are denoted $f_6^{r}$. This multi-modal feature fusion can be written as:

$$f_i^{cm} = f_i^{r} + \lambda_i \cdot \left( A_i \otimes f_i^{d} \right), \quad i = 1, \dots, 5, \tag{1}$$

where $\lambda_i$ and $A_i$ are calculated by DQPW and DPEA, respectively, to control the fusion of the depth features $f_i^{d}$, and $\otimes$ indicates element-wise multiplication. After the encoding process shown in Figure 1, the fused features $f_i^{cm}$ and the PPM features $f_6^{r}$ are transferred to the decoder.
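For readers who prefer code, the following minimal PyTorch sketch illustrates the gated fusion of Equation (1) at a single hierarchy level. The tensor shapes and the names `lam` and `attn` are assumptions introduced here for illustration, not part of the original implementation.

```python
import torch

def fuse_level(f_rgb, f_depth, lam, attn):
    """Gated RGB-D fusion at one hierarchy level (sketch of Equation (1)).

    f_rgb, f_depth: (B, C, H, W) feature maps from the RGB and depth streams.
    lam:            scalar (or (B, 1, 1, 1) tensor) produced by DQPW.
    attn:           (B, 1, H, W) spatial attention map produced by DPEA.
    """
    # Depth features are spatially re-weighted, globally scaled,
    # and then added element-wise to the RGB features.
    return f_rgb + lam * (attn * f_depth)

# Toy usage with random tensors.
f_rgb = torch.randn(2, 64, 56, 56)
f_depth = torch.randn(2, 64, 56, 56)
lam = torch.rand(2, 1, 1, 1)
attn = torch.sigmoid(torch.randn(2, 1, 56, 56))
fused = fuse_level(f_rgb, f_depth, lam, attn)
```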
3.2. Depth-Quality Purification Feature Processing (DQPFP)
DQPFP includes two crucial modules: DQPW (depth-quality purification weighting) and DPEA (depth purification-enhanced attention). These two modules calculate $\lambda_i$ and $A_i$ in Equation (1), respectively. $\lambda_i$ is a scalar that determines "how many" depth features are used, whereas $A_i \in \mathbb{R}^{s \times s}$ ($s$ is the feature size at level $i$) is a spatial attention map that determines "which regions" to focus on within the depth features. The internal structures of the DQPW and DPEA modules are described below.
3.2.1. Depth-Quality Purification Weighting (DQPW)
The paired color and depth features in RGB-D data are two different representations of the same object: color images provide visual appearance cues, and depth images provide 3D geometric information. Considering the often inadequate quality of depth maps, this paper proposes a depth denoising module (DDM). The DDM first purifies the depth features using attention mechanisms, then complements the color features through a residual connection [40], where the shortcut branch retains more of the original color cues.
In the DDM, as shown in Figure 3, the RGB features are concatenated with the depth features and fed to the channel attention module to obtain a channel attention mask, which is employed to purify the depth features. Subsequently, the purified depth features are passed to the spatial attention module to produce a spatial attention mask, which purifies the depth features at the spatial level. Denoting the low-level color and depth features as $f_r$ and $f_d$, this process can be represented as:

$$f_d' = \mathrm{CA}\!\left(\mathrm{Cat}(f_r, f_d)\right) \times f_d, \qquad f_d^{p} = \mathrm{SA}(f_d') \times f_d', \qquad f^{e} = f_r + f_d^{p}, \tag{2}$$

where Cat(·) represents the concatenation and subsequent convolution operations; CA(·) and SA(·) are the channel and spatial attention operations proposed in CBAM [41], respectively; "×" denotes element-wise multiplication; and "+" denotes element-wise addition. This process purifies the poor-quality depth features and then merges them into the RGB features to produce a more accurate representation $f^{e}$.
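As a concrete illustration, the following PyTorch sketch implements a DDM-style block under the above description. The channel and spatial attention follow common CBAM simplifications (average pooling only for the channel branch), and all layer sizes are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Simplified CBAM-style channel attention (average pooling only)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
    def forward(self, x):
        return torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)))

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over channel-wise mean/max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.conv(pooled))

class DDM(nn.Module):
    """Depth denoising: channel mask from Cat(rgb, depth), then spatial mask,
    then residual merge of the purified depth into the RGB features."""
    def __init__(self, channels):
        super().__init__()
        self.cat_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
    def forward(self, f_rgb, f_depth):
        mixed = self.cat_conv(torch.cat([f_rgb, f_depth], dim=1))
        d = self.ca(mixed) * f_depth   # channel-level purification
        d = self.sa(d) * d             # spatial-level purification
        return f_rgb + d               # residual merge into the RGB cues
```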
In Figure 4, the low-level features $f_r$ and $f_d$ first produce $f^{e}$ through the DDM, and DQPW adaptively learns the weighting term from $f^{e}$ and $f_d$. We apply convolution to $f^{e}$/$f_d$ to obtain the transformed features $\tilde{f}^{e}$/$\tilde{f}^{d}$, which are expected to contain more activations associated with edges:

$$\tilde{f}^{e} = \mathcal{B}(f^{e}), \qquad \tilde{f}^{d} = \mathcal{B}(f_d), \tag{3}$$

where $\mathcal{B}(\cdot)$ represents a convolution with BatchNorm layers and the RReLU activation. To assess the alignment of the low-level features, an alignment feature vector $v$, encoding the alignment between $f^{e}$ and $f_d$, is computed from the edge activations $\tilde{f}^{e}$ and $\tilde{f}^{d}$ as follows:

$$v = \mathrm{GAP}\!\left(\tilde{f}^{e} \otimes \tilde{f}^{d}\right), \tag{4}$$

where GAP(·) denotes the global average pooling operation aggregating element-level details and $\otimes$ represents element-wise multiplication.
Additionally, to make the weighting robust to minor edge movements, this paper calculates $v$ on multiple scales and concatenates the results to produce a strengthened vector. Figure 4 shows that this multi-scale computation is realized by downsampling the original features by max-pooling with a stride of 2, after which the alignment vectors are calculated in the same way as in Equation (4). Assuming that $v^{1}$, $v^{2}$, and $v^{3}$ are the alignment feature vectors calculated at the three scales shown in Figure 4, the strengthened vector $v_s$ is calculated as:

$$v_s = \left[v^{1}, v^{2}, v^{3}\right], \tag{5}$$

where [·] represents channel-wise concatenation. Then, two fully connected layers are used to calculate the weighting vector $\boldsymbol{\lambda}$ from $v_s$:

$$\boldsymbol{\lambda} = \mathrm{MLP}(v_s), \tag{6}$$

where MLP(·) represents a two-layer perceptron with a Sigmoid function at the end. The level-wise weight $\lambda_i$ is then taken as one element of the obtained vector $\boldsymbol{\lambda}$. Note that this paper uses different weighting factors for different levels, and the effectiveness of this multi-variable approach is verified in Section 4.4.
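To make the above procedure concrete, the following PyTorch sketch implements a DQPW-like module under the description above. The channel widths, kernel sizes, and the choice to output one weight per encoder level are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQPW(nn.Module):
    """Sketch of depth-quality weighting: multi-scale alignment between the
    DDM-enhanced features and the depth features, pooled into a vector and
    mapped to per-level weights by a small MLP ending in a Sigmoid."""
    def __init__(self, channels, num_levels=5, num_scales=3):
        super().__init__()
        self.transform_e = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.RReLU())
        self.transform_d = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.RReLU())
        self.mlp = nn.Sequential(
            nn.Linear(channels * num_scales, channels),
            nn.RReLU(),
            nn.Linear(channels, num_levels),
            nn.Sigmoid())
        self.num_scales = num_scales

    def forward(self, f_e, f_d):
        e, d = self.transform_e(f_e), self.transform_d(f_d)
        vectors = []
        for _ in range(self.num_scales):
            # Alignment vector: GAP over the element-wise product (Eq. (4)).
            vectors.append((e * d).mean(dim=(2, 3)))
            # Downsample by max-pooling (stride 2) for the next scale.
            e, d = F.max_pool2d(e, 2), F.max_pool2d(d, 2)
        lam = self.mlp(torch.cat(vectors, dim=1))  # one weight per level
        return lam                                  # shape: (B, num_levels)
```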
3.2.2. Depth Purification-Enhanced Attention (DPEA)
The DPEA enhances the depth features in the spatial dimension by deriving a global attention map from the depth branch. As shown in Figure 5, purified features are first obtained from the low-level RGB and depth features through the DDM to locate the coarse-grained salient areas (with the supervision cues shown in Figure 1). To simplify the subsequent pixel-wise operations, the purified features are compressed along the channel dimension and then upsampled (by 8× bilinear upsampling) to the same spatial dimension as the low-level features.
The upsampled features are then re-calibrated with the primary RGB and depth features. As in DQPW, the low-level features are first transformed by convolution, and element-wise multiplication generates features that emphasize the activations associated with edges. Max-pooling and dilated convolution are then used to rapidly expand the receptive field so as to better model long-range relationships between the low- and high-level information while preserving the efficiency of the DPEA. Each re-calibration applies a dilated convolution with a stride of 1 and a dilation rate of 2, followed by BatchNorm layers and the RReLU activation, together with bilinear upsampling/downsampling to match the feature dimensions. To achieve a balance between accuracy and efficiency, two such re-calibrations are performed, and the once- and twice-re-calibrated features are finally combined with the compressed features to obtain the global attention map.
Note that in the final convolution the RReLU activation is replaced with the Sigmoid activation to obtain the attention map. Eventually, by downsampling this map, five global depth attention maps $A_i$ are obtained and used as spatial enhancement factors for the corresponding depth levels. In general, background clutter in the depth features can be suppressed by multiplying them with the attention maps $A_i$.
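The following PyTorch sketch gives one possible reading of this attention branch. The compression ratio, the number of re-calibrations, and the way features are combined are assumptions based on the description above, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPEA(nn.Module):
    """Sketch of the attention branch: compress the DDM-purified features,
    re-calibrate them with max-pooling and dilated convolutions (dilation 2,
    stride 1), and squash the result with a Sigmoid to obtain a global
    spatial attention map."""
    def __init__(self, channels):
        super().__init__()
        self.compress = nn.Conv2d(channels, channels // 4, 1)
        self.recalib = nn.ModuleList([
            nn.Sequential(
                nn.MaxPool2d(2),
                nn.Conv2d(channels // 4, channels // 4, 3,
                          padding=2, dilation=2),
                nn.BatchNorm2d(channels // 4), nn.RReLU())
            for _ in range(2)])                      # two re-calibrations
        self.to_attn = nn.Sequential(
            nn.Conv2d(channels // 4, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, f_purified, out_size):
        x = self.compress(f_purified)
        y = x
        for block in self.recalib:
            y = block(y)
        # Combine the re-calibrated features with the compressed ones.
        y = F.interpolate(y, size=x.shape[2:], mode='bilinear',
                          align_corners=False)
        attn = self.to_attn(x + y)
        # Resize to the requested level resolution (one map per depth level).
        return F.interpolate(attn, size=out_size, mode='bilinear',
                             align_corners=False)
```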
3.3. Dual-Stage Decoder
In contrast to the well-known UNet [17], which uses a hierarchical top-down decoding scheme, this work proposes a simpler two-stage decoder comprising a first fusion stage and a second fusion stage to further increase efficiency. Hierarchical grouping is used, denoted as "G" in Figure 1. The first fusion stage aims to reduce the number of feature channels and hierarchies. Based on the outputs of the first fusion stage, the low-level and high-level hierarchies are further aggregated to generate the final saliency map. Note that in our decoder, depth-wise separable convolutions are used instead of ordinary convolutions, mainly for layers with many input channels.
3.3.1. First Fusion Stage
This paper first uses a depth-wise separable convolution [42] with BatchNorm layers and the RReLU activation, denoted $\mathcal{D}(\cdot)$, to compress the encoder features to a unified channel size of 16. Then, the popular channel attention operator [43], denoted $\mathrm{CA}^{*}(\cdot)$, is used to enhance the features through channel weighting. The procedure described above can be expressed as:

$$f_i^{ce} = \mathrm{CA}^{*}\!\left(\mathcal{D}(f_i)\right), \quad i = 1, \dots, 6,$$

where $f_i$ denotes the $i$-th encoder output (the five fused features $f_i^{cm}$ and the PPM features $f_6^{r}$) and $f_i^{ce}$ represents the features after the compression and enhancement processes. Motivated by [16], this work then splits the six compressed hierarchies into a low-level group and a high-level group, and the features within each group are upsampled to a common resolution and merged, where $\mathrm{Up}_{\times i}$ denotes bilinear upsampling to $i$ times the original size.
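A minimal PyTorch sketch of one compression-and-enhancement unit is shown below; the 16-channel target and the SE-style channel attention follow the text, while the kernel size and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class CompressEnhance(nn.Module):
    """Per-hierarchy compression and enhancement used in the first fusion:
    a depth-wise separable conv (BatchNorm + RReLU) reduces each hierarchy
    to 16 channels, followed by SE-style channel attention [43]."""
    def __init__(self, in_channels, out_channels=16, reduction=4):
        super().__init__()
        self.dsconv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1,
                      groups=in_channels),            # depth-wise
            nn.Conv2d(in_channels, out_channels, 1),  # point-wise
            nn.BatchNorm2d(out_channels), nn.RReLU())
        self.se = nn.Sequential(
            nn.Linear(out_channels, out_channels // reduction), nn.RReLU(),
            nn.Linear(out_channels // reduction, out_channels), nn.Sigmoid())

    def forward(self, f):
        f = self.dsconv(f)
        w = self.se(f.mean(dim=(2, 3)))               # channel weighting
        return f * w.unsqueeze(-1).unsqueeze(-1)
```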
3.3.2. Second Fusion Stage
Since the number of channels and hierarchies has already been reduced in the first fusion stage, the low-level and high-level hierarchies are directly concatenated in the second fusion stage and then fed to a prediction head to obtain the final full-resolution prediction map:

$$S = \mathrm{PH}\!\left(\mathrm{Cat}(f^{low}, f^{high})\right),$$

where $S$ represents the final saliency prediction and $\mathrm{PH}(\cdot)$ represents the prediction head, consisting of two separable depth-wise convolutions (each followed by BatchNorm layers and the RReLU activation function), a convolution with Sigmoid activation, and a bilinear upsampling that restores the original input resolution.
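The prediction head could look roughly like the sketch below; the 3×3/1×1 kernel sizes and the 4× upsampling factor are assumptions, since the section leaves the exact values unspecified.

```python
import torch
import torch.nn as nn

def depthwise_separable(channels):
    """3x3 depth-wise + 1x1 point-wise conv with BatchNorm and RReLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        nn.Conv2d(channels, channels, 1),
        nn.BatchNorm2d(channels), nn.RReLU())

class PredictionHead(nn.Module):
    """Two separable convolutions, a 1x1 conv with Sigmoid, and a bilinear
    upsampling back to the input resolution. `channels` must equal the
    channel count of the concatenated low-/high-level features."""
    def __init__(self, channels, upscale=4):
        super().__init__()
        self.body = nn.Sequential(
            depthwise_separable(channels),
            depthwise_separable(channels),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid(),
            nn.Upsample(scale_factor=upscale, mode='bilinear',
                        align_corners=False))
    def forward(self, low, high):
        # Second fusion: concatenate the two groups, then predict.
        return self.body(torch.cat([low, high], dim=1))
```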
3.4. RReLU Activation Function
The activation function plays an important role in computer vision tasks such as object segmentation, object tracking, and object detection. Selecting the activation functions used in the different layers of a network is an important aspect of neural network design: the activation function introduces nonlinearity into the computation, and choosing it correctly is essential for the network to perform effectively.
Common activation functions, such as Sigmoid and Tanh, have useful properties, but with the advent of deep architectures it became difficult to train very deep neural networks because these functions saturate. To solve this problem, the ReLU activation function was adopted, as shown in Figure 6. Although ReLU is not differentiable at zero, it does not saturate and keeps the gradient constant over the positive interval. This effectively alleviates the vanishing-gradient problem and thereby speeds up training. However, when the input is negative, ReLU produces dead neurons whose weights are no longer updated, which may cause a loss of model information.
To address these problems with the ReLU activation function, in Section 4.4 we conduct a number of experiments to determine the optimal activation function for this model: RReLU. As shown in Figure 7, RReLU is a variant of ReLU that prevents overfitting by introducing randomness during model training while helping to resolve the issue of neuron inactivation. When the input is positive, RReLU behaves like ReLU; when the input is negative, the output is scaled by a small slope that is sampled randomly during training and fixed during testing.
The appeal of RReLU is that, during training, the slope of the negative part is randomly drawn from a uniform distribution, which helps increase the robustness of the model and reduce its dependence on specific input patterns, thereby mitigating the risk of overfitting. By introducing this randomness, RReLU allows the activation values of neurons to vary within a range even for negative inputs, thus avoiding complete neuron inactivation.
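For reference, PyTorch provides this activation as `nn.RReLU`. The bounds shown below are PyTorch's defaults; whether DQPFPNet uses these exact values is not stated in this section.

```python
import torch
import torch.nn as nn

# nn.RReLU samples the negative-half slope from a uniform distribution
# U(lower, upper) during training and uses the fixed value
# (lower + upper) / 2 at inference time.
act = nn.RReLU(lower=1.0 / 8, upper=1.0 / 3)

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(act(x))   # negative inputs are scaled by a randomly sampled slope
act.eval()
print(act(x))   # slope fixed to (lower + upper) / 2 ≈ 0.229
```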
3.5. Pixel Position Adaptive Importance (PPAI) Loss
Binary cross-entropy (BCE) is the most popular loss function for RGB and RGB-D SOD, but it has three flaws. First, it calculates each pixel's loss independently and disregards the global structure of the image. Second, in images where the background dominates, the loss of the foreground pixels is diluted. Third, it treats every pixel equally. In reality, pixels in cluttered or narrow regions (e.g., a pole or a horn) are more likely to produce incorrect predictions and require additional effort, whereas pixels located in places such as roadways and trees require less focus. Therefore, this paper introduces the pixel position adaptive importance (PPAI) loss, which consists of two components, namely the weighted binary cross-entropy (wBCE) loss and the weighted IoU (wIoU) loss. The wBCE loss is defined as:

$$L_{wBCE} = -\frac{\sum_{i,j} (1 + \gamma \alpha_{ij}) \sum_{l=0}^{1} \mathbf{1}(g_{ij} = l)\, \log \Pr(p_{ij} = l \mid \Psi)}{\sum_{i,j} (1 + \gamma \alpha_{ij})},$$

where $\mathbf{1}(\cdot)$ is the indicator function and $\gamma$ is a hyperparameter. The symbol $l \in \{0, 1\}$ denotes the two types of labels. $p_{ij}$ and $g_{ij}$ are the prediction and the ground truth of the pixel at location $(i, j)$ in an image, $\Psi$ denotes all the parameters of the model, and $\Pr(p_{ij} = l \mid \Psi)$ represents the predicted probability.
In $L_{wBCE}$, each pixel is assigned a weight $\alpha_{ij}$: a hard pixel receives a larger $\alpha_{ij}$, whereas a simple pixel receives a smaller one. $\alpha_{ij}$, which is determined by the disparity between the central pixel and its surroundings, can be used as a measure of pixel significance:

$$\alpha_{ij} = \left| \frac{\sum_{(m, n) \in A_{ij}} g_{mn}}{\sum_{(m, n) \in A_{ij}} 1} - g_{ij} \right|,$$

where $A_{ij}$ denotes the area around the pixel $(i, j)$. For all pixels, $\alpha_{ij} \in [0, 1]$. If $\alpha_{ij}$ is large, the pixel at $(i, j)$ is significant (e.g., an edge or a hole) and stands out from its surroundings; therefore, it warrants extra attention. In contrast, if $\alpha_{ij}$ is small, the pixel is an ordinary pixel and requires little attention.
Compared with BCE, $L_{wBCE}$ places greater emphasis on hard pixels. Meanwhile, the local structural information is encoded into $\alpha_{ij}$, so that the model focuses on a larger receptive field rather than on a single pixel. To further make the network focus on the overall structure, the weighted IoU (wIoU) loss is introduced:

$$L_{wIoU} = 1 - \frac{\sum_{i,j} (g_{ij}\, p_{ij}) (1 + \gamma \alpha_{ij})}{\sum_{i,j} (g_{ij} + p_{ij} - g_{ij}\, p_{ij}) (1 + \gamma \alpha_{ij})}.$$

The IoU loss is frequently employed in image segmentation. It is not affected by the uneven distribution of pixels and optimizes the global structure, thereby overcoming the limitation of operating on single pixels; in recent years, it has been adopted in SOD to address BCE's deficiencies. However, it still treats each pixel equally and ignores the differences between pixels. In contrast to the plain IoU loss, our wIoU loss gives harder pixels a higher weight to indicate their significance.
The pixel position adaptive importance (PPAI) loss combines the two terms:

$$L_{PPAI} = L_{wBCE} + L_{wIoU}.$$

It integrates local structure information to assign different weights to each pixel and provides both a pixel-level restriction ($L_{wBCE}$) and a global restriction ($L_{wIoU}$), thus better guiding the network's learning process and resulting in clearer details.
Eventually, the PPAI loss on the final prediction and the deeply supervised loss on the depth branch make up the total loss, which is formulated as follows:

$$L_{total} = L_{PPAI}(S, G) + L_{BCE}(S_{d}, G),$$

where $S$ and $S_{d}$ denote the final saliency map and the coarse prediction of the depth branch, respectively; $G$ represents the ground truth (GT); and $L_{PPAI}$ and $L_{BCE}$ denote the PPAI loss and the standard BCE loss, respectively.
5. Conclusions
This paper proposed DQPFPNet, an RGB-D SOD model with high efficiency and good performance. The method builds an efficient RGB-D SOD framework around DQPFP processing, greatly improving detection accuracy. DQPFP consists of three sub-modules: DDM, DQPW, and DPEA. The DDM filters multi-scale depth features through a channel attention mechanism and a spatial attention mechanism to achieve the initial filtering of the depth features. The DQPW module weights the depth features based on the alignment between the DDM-enhanced RGB features and the depth features, whereas the DPEA module attends to the depth features spatially using multiple enhanced attention maps derived from the DDM-enhanced depth features refined with low-level RGB features. Additionally, the framework is built on a dual-stage decoder, which helps further increase efficiency. The pixel position adaptive importance (PPAI) loss is utilized to better exploit the structural information in the features, making the network attach importance to detailed areas. In addition, the RReLU activation is used to solve the problem of neuron "necrosis". Experiments conducted on six RGB-D datasets demonstrate that DQPFPNet performs well in terms of both metric values and visualizations. A limitation of the current model is that, in the comparison experiments with existing models, it did not achieve the best performance across all metrics and datasets, indicating that the network structure can still be improved. Furthermore, the behavior of the model on mobile or embedded devices is unknown. Hence, we will continue to explore new network architectures to optimize performance on common datasets in the future. In addition, we will attempt to deploy DQPFP in embedded/mobile systems that handle RGB-D and video data and continue to optimize the model based on its performance metrics.