Article

DBCAN: DFormer-Based Cross-Attention Network for RGB Depth Semantic Segmentation

College of Information Engineering, Shanghai Maritime University, Shanghai 201308, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8329; https://doi.org/10.3390/app14188329
Submission received: 14 August 2024 / Revised: 12 September 2024 / Accepted: 12 September 2024 / Published: 16 September 2024

Abstract

Existing RGB-depth semantic segmentation methods primarily rely on symmetric two-stream Convolutional Neural Networks (CNNs) to extract RGB and spatial features separately. However, these architectures have limitations in incorporating spatial features and efficiently fusing RGB and depth information. In this study, we propose a novel architecture called the DFormer-Based Cross-Attention Network (DBCAN), which utilizes DFormer as an encoder for feature extraction and integrates several modifications to address these challenges. While DFormer is leveraged for its strong feature extraction capabilities, our modifications in the decoder focus on improving cross-modal fusion and spatial feature incorporation. We introduce three modules in the decoding process: the Object-Region Generated Module (ORGM), the Feature-Region Relation Module (FRRM), and the Spatial-Semantic Fusion Module (SSFM), which enhance feature interaction and segmentation accuracy. Experimental results on the NYUDepthv2 and SUN-RGBD datasets demonstrate that DBCAN achieves state-of-the-art performance, highlighting the effectiveness of our architectural enhancements in overcoming the limitations of existing models.

1. Introduction

The goal of RGB-D semantic segmentation is to perform pixel-level classification of objects within a scene by leveraging both RGB and depth information. This approach is increasingly significant due to the advancements of sensor technology, which have made it easier to obtain high-quality RGB-D data. RGB-D semantic segmentation is crucial for applications such as autonomous driving [1], robotic navigation [2], and medical imaging [3].
Compared to RGB images, RGB-D images provide additional depth features. Depth features contain the spatial information of various objects, which can improve semantic segmentation performance. However, the differences between RGB and depth data make it challenging to integrate them effectively. A good RGB-D semantic segmentation model therefore requires not only a suitable architecture but also an effective feature fusion method. Several studies [4,5,6] have shown that encoder–decoder structures are effective for extracting and combining RGB and depth features. Some models [7,8] concatenate RGB and depth images as input to a single Convolutional Neural Network (CNN), treating the depth channel similarly to the RGB channels (one-stream network). In contrast, other models [9,10] use two parallel networks to process RGB and depth images separately, combining their features at a later stage (two-stream network). DFormer [10] adopts a two-stream network with an encoder–decoder structure, in which RGB and depth features are processed separately during encoding and fused at multiple stages; in the decoding phase, a lightweight decoder called Hamburger [11] is used to process only the RGB features. Although this model has fewer parameters, it relies on a deep encoder and a shallow decoder, which can lead to inaccuracies in object boundary segmentation and the loss of spatial features during decoding.
To ensure that the model can also obtain sufficient spatial features during the decoding process and achieve accurate pixel-level boundary segmentation, we propose an RGB-D semantic segmentation model named DBCAN, which uses the encoder of DFormer to extract features, but with an innovative decoder. Although DFormer effectively fuses and separates RGB and depth features during the encoding process, the lightweight ham-head decoder [11], used for preliminary RGB feature decoding, is shallow and neglects depth information, limiting its ability to capture spatial features. To address this, we designed a series of modules that work together to ensure proper integration of RGB and depth features, improve spatial feature representation, and achieve more accurate segmentation results. First, we introduce the Object-Region Generated Module (ORGM), which processes only RGB features and performs coarse classification through a deep supervision mechanism. ORGM enhances the semantic representation of the RGB features, allowing the model to better utilize the RGB information in subsequent stages for further processing and fusion. Next, we propose the Feature-Region Relation Module (FRRM), which separately fuses RGB and depth features with classification features using attention mechanisms. This enables the parallel processing of both modalities, allowing the model to focus on regions that are more effectively classified. By enhancing the expressive power of both RGB and depth features, FRRM improves the model’s ability to capture important scene details. Finally, the Spatial-Semantic Fusion Module (SSFM) refines the integration of RGB and depth features by adjusting their channel weights. SSFM distinguishes the importance of different features, selectively enhancing relevant ones while suppressing less important information, leading to a unified and optimized feature representation. Extensive experiments on the NYUDepthv2 [12] and SUN-RGBD [13] datasets demonstrate that DBCAN achieves state-of-the-art performance in terms of mIoU and prediction accuracy.
Our main contributions can be summarized as follows:
  • A novel DFormer-based network is proposed for RGB-D semantic segmentation tasks. It leverages the strengths of CNNs in capturing local features and the attention mechanisms for modeling global dependencies, enabling the network to effectively fuse and separate rich features from the original RGB-D images.
  • The Feature-Region Relation Module is designed to enhance the attention of multimodal features (RGB and depth). This method can improve the accuracy of semantic and spatial information.
  • The Spatial-Semantic Fusion Module is designed to adjust the weights of each feature point and to fuse RGB and depth features. This ensures more effective feature fusion, leading to a richer and more balanced integration of semantic and spatial information.

2. Related Work

2.1. Encoder–Decoder-Based Methods

Recently, researchers (e.g., Chen et al. [14] and Yuan et al. [15]) have utilized encoder–decoder architecture for semantic segmentation, aiming to enhance prediction performance by leveraging richer features extracted through the encoder and decoder. These methods have demonstrated impressive results in semantic segmentation tasks. Likewise, in more intricate RGB-D segmentation tasks, this architecture is commonly employed to handle RGB and depth features.
In encoder structures, a crucial challenge is to extract and integrate features across different modalities. Recent studies [4] have discussed the encoder design implemented in either one-stream or two-stream networks. Chen et al. [16] introduced Spatial Information Guided Convolution (S-Conv) in a one-stream network. Initially, their encoder resembled an RGB semantic segmentation encoder, but S-Conv was employed in each stage to incorporate spatial features from the depth image. The inclusion of S-Conv enhances spatial perception at each stage, thus improving semantic segmentation results. Zhang et al. [17] proposed the Multi-modality Non-local Aggregation Module (MNAM) for a two-stream network. MNAM captures spatial and channel dependencies across encoder stages, enabling the extraction of richer features at multiple levels. Similarly, Yin et al. [10] incorporated multiple Global Awareness Attention (GAA) and Local Enhancement Attention (LEA) modules in each stage of the DFormer encoder. They concatenate the features from the last few stages after the encoding stage, enhancing the expression of RGB and spatial features. The design of DFormer allows the encoder to output more effective features that encompass both cross-stage features and inherent semantics.
Another challenge lies in the decoder structure: making effective predictions from fused or two-modal features. Zhou et al. [18] proposed an up-sampled ResNet decoder for decoding fused features, but it overlooked the role of early encoder features. Fooladgar et al. [9] designed the Multi-Modal Multi-Resolution Fusion (MRF) module to extract encoder features from four stages, resulting in more accurate predictions. Chen et al. [19] introduced L-CFM and G-CFM, applied after the final encoder stage, progressively fusing features from earlier stages. This approach ensures effective feature utilization at each encoder stage and incorporates attention mechanisms in the decoder to enhance prediction accuracy. In this paper, we leverage a subset of encoder stage features in the decoder and introduce FRRM and SSFM to enhance feature expressiveness and inter-modality connections.

2.2. Attention-Based Methods

Attention mechanisms have been widely used to improve the performance of RGB-D semantic segmentation [20]. Typically, attention mechanisms are combined with deep-learning methods to adjust feature map weights. Deep neural networks can learn to identify important areas and assign higher weights to them while suppressing unimportant features. Hu et al. [7] introduced the Attention Complementary Module (ACM) to focus on different regions and extract weighted features from RGB and depth branches. ACM assigns weights to feature channels, enhancing the expression of relevant features. ACNet [21] leverages this mechanism to dynamically integrate RGB and depth information, which helps the network to learn cross-modal dependencies and improve semantic segmentation by emphasizing spatial and depth information at crucial locations. Fooladgar et al. [9] utilized two sequential channel-and-spatial-wise attention (AFB) mechanisms to adjust feature weights based on channels and regions. Zhou et al. [18] designed a co-attention fusion module to effectively fuse RGB and depth features. This approach showcases the strength of attention mechanisms in segmenting complex scenes by dynamically attending to both RGB and depth inputs. In our work, we propose FRRM to enhance attention towards RGB (spatial) features for improved prediction results. Additionally, we introduce SSFM in the decoding stage to fuse RGB and spatial features, enabling re-weighting of global features based on channel and spatial weights.

3. Methodology

3.1. Overview

The architecture of the RGB-D semantic segmentation model DBCAN consists of an encoder and a decoder, as illustrated in Figure 1. Specifically, the encoder is built upon the DFormer [10], which is a powerful and efficient approach designed to effectively integrate semantic and depth features. In the decoder, we employ the lightweight Hamburger head [11] to aggregate the multi-scale features from the last three stages of the encoder. These aggregated features serve as preliminary prediction maps, which are then utilized as input for our decoding process.
Encoder: In the feature encoding stage, two parallel backbone networks with similar structures are utilized to extract RGB and depth features. These features are denoted as f_i and f_{di}, respectively, where i ∈ {0, 1, 2, 3, 4} (i = 0 represents the output of the stem layer, and the remaining values of i represent the output features of each block). To simplify notation, we use a single symbol to represent multiple symbols: for example, f_{(d)0} represents f_0 and f_{d0}, while f_{0-1} represents f_0 and f_1.
Decoder: The encoder output features f_i and f_{di} (i ∈ {2, 3, 4}) are employed in the decoder. First, f_{3-4} and f_{d3-d4} are upsampled separately to match the spatial size of f_{(d)2}. We then obtain f_{r1} = Concat([f_2, f_3, f_4]) and f_{s1} = Concat([f_{d2}, f_{d3}, f_{d4}]), where Concat(·) denotes the concatenation of features along the channel dimension. The Hamburger module [11] aggregates the features of f_{r1} and outputs f_h. The Object-Region Generated Module aggregates the features of f_h and produces the deeply supervised result f_{org} [22]. Subsequently, the input features f_{r1} and f_{s1} are passed through the parallel Feature-Region Relation Modules, resulting in f_{r2-r3} and f_{s2-s3}. Finally, f_{r4} = Concat([f_{r2}, f_{r3}]) and f_{s4} = Concat([f_{s2}, f_{s3}]) are fed into the Spatial-Semantic Fusion Module to obtain f_{last}.
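The multi-scale concatenation step described above can be illustrated with a short PyTorch sketch. The channel counts and the bilinear upsampling choice are assumptions for illustration; only the data flow (upsample the deeper maps to the resolution of stage 2, then concatenate per modality) follows the text.

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder outputs for one image; channel counts are illustrative only.
f2,  f3,  f4  = torch.randn(1, 128, 60, 80), torch.randn(1, 256, 30, 40), torch.randn(1, 512, 15, 20)
fd2, fd3, fd4 = torch.randn(1,  64, 60, 80), torch.randn(1, 128, 30, 40), torch.randn(1, 256, 15, 20)

def upsample_to(x, ref):
    # Resize x to the spatial size of the reference map (bilinear is an assumption).
    return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

# f_r1 = Concat([f2, f3, f4]) after matching spatial sizes to f2 (RGB branch).
f_r1 = torch.cat([f2, upsample_to(f3, f2), upsample_to(f4, f2)], dim=1)
# f_s1 = Concat([fd2, fd3, fd4]) after matching spatial sizes to fd2 (depth branch).
f_s1 = torch.cat([fd2, upsample_to(fd3, fd2), upsample_to(fd4, fd2)], dim=1)

print(f_r1.shape, f_s1.shape)  # torch.Size([1, 896, 60, 80]) torch.Size([1, 448, 60, 80])
```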

3.2. Object-Region Generated Module

The initial coarse segmentation achieved through deep supervision in the Object-Region Generated Module (ORGM) is critical as it provides a preliminary segmentation result that guides the refinement process in subsequent stages. By incorporating deep supervision early in the decoder, the model ensures that essential features are preserved and enhanced throughout the network, reducing the risk of gradient disappearance. This early-stage supervision acts as a foundation for subsequent refinement, enabling the model to maintain strong gradient flow and better feature representation. This approach is rooted in the principle that coarse-to-fine strategies can significantly improve the model’s ability to capture complex patterns, laying a solid groundwork for accurate final predictions.
The ORGM applies multiple nonlinear transformations to f_h, the features aggregated by the Hamburger module from the encoder output f_{r1}, and incorporates deep supervision, generating the semantic segmentation features f_{org}. This operation can be expressed as follows:
f_{org} = Conv_{1×1}(δ(BatchNorm(Conv_{1×1}(f_h))))
The operation Conv_{k×k}(·) denotes a convolution with a k × k kernel and a stride of 1. BatchNorm refers to batch normalization [23], and δ denotes the Gaussian Error Linear Unit (GELU) [24]. Applying the complete δ(BatchNorm(Conv_{1×1}(·))) operation projects the input features into a new feature space, ensuring better convergence of the model during training. The second Conv_{1×1}(·) operation adjusts the number of channels to match the number of categories. The input map f_h has size f_h ∈ R^{c_ham × h × w}, and after processing, the output f_{org} has size R^{k × h × w}, where k represents the number of categories.
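A minimal PyTorch sketch of the ORGM as described by the equation above; the channel width of f_h and the class count are placeholder values, and the module and argument names are ours, not taken from the released code.

```python
import torch
import torch.nn as nn

class ORGM(nn.Module):
    """Object-Region Generated Module sketch: Conv1x1 -> BatchNorm -> GELU -> Conv1x1,
    mapping the Hamburger output f_h to k coarse class maps f_org (deeply supervised)."""
    def __init__(self, c_ham: int, num_classes: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(c_ham, c_ham, kernel_size=1),
            nn.BatchNorm2d(c_ham),
            nn.GELU(),                                    # delta(.) in the text
            nn.Conv2d(c_ham, num_classes, kernel_size=1)  # adjust channels to k categories
        )

    def forward(self, f_h: torch.Tensor) -> torch.Tensor:
        return self.proj(f_h)                             # f_org in R^{k x h x w}

f_h = torch.randn(1, 512, 60, 80)              # hypothetical Hamburger output
f_org = ORGM(c_ham=512, num_classes=40)(f_h)   # 40 classes as on NYUDepthv2
print(f_org.shape)                             # torch.Size([1, 40, 60, 80])
```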

3.3. Feature-Region Relation Module

The Feature-Region Relation Module (FRRM) is designed to enhance the representation of RGB and depth features in the target classification regions by integrating features and incorporating attention mechanisms. Initially, the module projects multi-stage features into a unified feature space to ensure effective fusion. Attention mechanisms are then applied to model the relationship between features and target regions, dynamically adjusting feature weights to improve classification accuracy. Finally, through further fusion, the FRRM strengthens the connection between semantic features and classification regions, resulting in more precise feature representation.
FRRM consists of three components: the Feature Projector, Feature-Region Attention, and Feature-Region Fusion. In this model, the FRRM independently processes RGB and spatial features in parallel. The Feature Projector is responsible for projecting the concatenated features from multiple stages into a new feature space. The Feature-Region Attention and Feature-Region Fusion components work together to enhance the attention of RGB (or spatial) features towards the classification region, thereby improving the predictive capability of the features. When specifically referring to this module for RGB or spatial features, we use the terms RFRRM (RGB Feature-Region Relation Module) or SFRRM (Spatial Feature-Region Relation Module), respectively.
The specific details of this module are illustrated in Figure 2. Initially, we input f_{r1} and f_{s1} into two parallel Feature Projectors to obtain new feature maps. Subsequently, multiple attention operations are performed on these new feature maps and f_{org} to strengthen the relationship between semantic (or spatial) features and the object regions. Since RGB and depth features are processed separately in these two modules, for convenience, we will focus on explaining them using RGB features only.
The feature f_{r1} is generated through concatenation, so its information is distributed across three different feature spaces. To integrate this information into a unified feature space, we propose the Feature Projector Module. This operation can be expressed as follows:
f_{r2} = φ(f_{r1})
where φ(·) represents a transformation function and f_{r2} ∈ R^{512 × h × w}. This step projects the features of f_{r1}, which originate from different feature spaces, into a unified space, facilitating their integration.
Following this, f_{r2} contains rich fused semantic information. However, we further consider enhancing the connection between features and object classification regions. To address this, the Feature-Region Attention Module has been designed. The operation can be expressed as:
f_{r2} = View(f_{r2}),  f_{r2} ∈ R^{512 × hw}
f_{org} = Softmax(Permute(View(f_{org}))),  f_{org} ∈ R^{hw × k}
F_{attn} = Permute(f_{org} · f_{r2}),  F_{attn} ∈ R^{512 × k}
The View(·) operation is used to reshape the features f_{r2} and f_{org} from R^{512 × h × w} and R^{k × h × w} to R^{512 × hw} and R^{k × hw}, respectively. In the subsequent steps, f_{org} is permuted to R^{hw × k}, and a Softmax operation is applied to f_{org} to obtain the attention score of each feature point for each object region. The attention scores are then used in a matrix multiplication with f_{r2} to obtain the attention given by the RGB features to each object region. Finally, the resulting feature is permuted to obtain F_{attn} with dimensions R^{512 × k}. The Feature-Region Attention Module enhances the connection between the RGB features and the object regions, resulting in the classification-region feature F_{attn}.
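A short PyTorch sketch of this Feature-Region Attention step, following the shapes stated above. Batch handling and the softmax axis (here, over the spatial positions of each region map) are our assumptions; the function name is ours.

```python
import torch

def feature_region_attention(f_r2: torch.Tensor, f_org: torch.Tensor) -> torch.Tensor:
    """f_r2: (B, 512, H, W) projected RGB features; f_org: (B, k, H, W) coarse class maps.
    Returns F_attn: (B, 512, k), one 512-d descriptor per object region."""
    b, c, h, w = f_r2.shape
    k = f_org.shape[1]
    f_r2_flat = f_r2.view(b, c, h * w)                 # View: (B, 512, HW)
    scores = f_org.view(b, k, h * w).permute(0, 2, 1)  # View + Permute: (B, HW, k)
    scores = scores.softmax(dim=1)                     # pixel-to-region attention weights
    f_attn = torch.bmm(f_r2_flat, scores)              # matrix product with f_r2: (B, 512, k)
    return f_attn

f_attn = feature_region_attention(torch.randn(2, 512, 60, 80), torch.randn(2, 40, 60, 80))
print(f_attn.shape)  # torch.Size([2, 512, 40])
```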
F_{attn} captures the important relationship between the RGB features and the classification regions. To further strengthen the connection between f_{r2} and F_{attn}, the Feature-Region Fusion Module is proposed. This operation can be expressed as:
Q = Permute(Conv_{1×1}(f_{r2}))
K = Conv_{1×1}(Unsqueeze(F_{attn}))
V = Permute(Conv_{1×1}(Unsqueeze(F_{attn})))
f_{r3} = Softmax(QK / √d_k) V
The Unsqueeze(·) operation expands the feature by one dimension. We apply Conv_{1×1} operations on f_{r2} and on the dimension-adjusted F_{attn} to obtain Q, K, and V, and then permute Q and V. Utilizing the generated Q ∈ R^{hw × 512}, K ∈ R^{512 × k}, and V ∈ R^{k × 512}, we integrate the semantic and object-region features using Equation (9), resulting in the feature f_{r3}.
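The following PyTorch sketch mirrors this cross-attention between pixel features (query) and region descriptors (key/value). Modelling the 1 × 1 convolutions on the unsqueezed descriptors with Conv1d, and reshaping the output back to an H × W map, are our assumptions.

```python
import torch
import torch.nn as nn

class FeatureRegionFusion(nn.Module):
    """Sketch of Feature-Region Fusion: Q from f_r2 (pixels), K and V from F_attn (regions)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.k_proj = nn.Conv1d(dim, dim, kernel_size=1)
        self.v_proj = nn.Conv1d(dim, dim, kernel_size=1)
        self.scale = dim ** -0.5                                   # 1 / sqrt(d_k)

    def forward(self, f_r2: torch.Tensor, f_attn: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_r2.shape
        q = self.q_proj(f_r2).view(b, c, h * w).permute(0, 2, 1)   # Q: (B, HW, 512)
        k = self.k_proj(f_attn)                                    # K: (B, 512, k)
        v = self.v_proj(f_attn).permute(0, 2, 1)                   # V: (B, k, 512)
        attn = torch.softmax(torch.bmm(q, k) * self.scale, dim=-1) # Softmax(QK / sqrt(d_k))
        f_r3 = torch.bmm(attn, v)                                  # (B, HW, 512)
        return f_r3.permute(0, 2, 1).reshape(b, c, h, w)           # back to a (B, 512, H, W) map

fusion = FeatureRegionFusion(dim=512)
f_r3 = fusion(torch.randn(1, 512, 60, 80), torch.randn(1, 512, 40))
print(f_r3.shape)  # torch.Size([1, 512, 60, 80])
```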

3.4. Spatial-Semantic Fusion Module

The Spatial-Semantic Fusion Module is designed to effectively combine spatial and semantic information by leveraging both local and global attention mechanisms. This module enhances feature extraction by preserving the original features while also incorporating attention-based features, ensuring a richer and more comprehensive representation. By integrating these features, the module improves the model’s ability to make accurate predictions, highlighting the importance of both spatial and semantic cues in understanding complex visual scenes. Consequently, the fused features f_{last} offer a comprehensive representation that combines both spatial and semantic features, thereby contributing to the final predictions of the model.
To enhance the effectiveness of the features and integrate RGB and spatial information, we introduce the Spatial-Semantic Fusion Module. However, relying solely on the attention features f_{r3} (f_{s3}) output from the FRRM may not provide sufficient richness in representation. Taking inspiration from ResNet [25], which effectively preserves original features while enhancing feature extraction, we employ similar techniques to enhance the semantic expression of f_{r3} and f_{s3}. By applying these operations, we aim to enrich the semantic information embedded within these features. Specifically, these operations can be described as follows:
f_{r4} = Concat(f_{r2}, f_{r3})
f_{s4} = Concat(f_{s2}, f_{s3})
The Concat(·) operation represents concatenation along the channel dimension. Following these two steps, f_{r4} and f_{s4} preserve both the original features and the attention-based features.
Subsequently, we propose the Local Attention Module (LAM) and the Global Attention Module (GAM) to enhance feature extraction from both the spatial and channel perspectives. These operations can be expressed as follows:
LAM(x) = Conv_{7×7}(Concat(AvgPool_c(x), MaxPool_c(x)))
GAM(x) = δ(Conv_{1×1}(δ(Conv_{1×1}(AvgPool(x)))))
LAM is designed to capture fine-grained, localized information by applying average pooling (AvgPool_c(·)) and max pooling (MaxPool_c(·)) along the channel dimension of the input feature map x. These operations generate two feature maps that, when concatenated and processed through a 7 × 7 convolution, allow the module to emphasize important local details across channels. This approach ensures that the model pays attention to both dominant and subtle features within each spatial region, thus preserving local contextual information.
GAM, on the other hand, is constructed to capture global context by performing global average pooling (AvgPool(·)) across the H × W dimensions of the input feature map x. The module then employs 1 × 1 convolutions (Conv_{1×1}) and activation functions (δ(·)) to highlight significant features across the entire image. By emphasizing global patterns, GAM ensures that the model retains an understanding of the broader context, which is essential for accurate semantic interpretation.
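A compact PyTorch sketch of the two attention modules as described above. The channel reduction ratio inside GAM and the use of GELU for δ are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LAM(nn.Module):
    """Local Attention Module sketch: channel-wise avg/max pooling -> 7x7 conv,
    yielding a single-channel spatial weight map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)             # AvgPool_c: (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)            # MaxPool_c: (B, 1, H, W)
        return self.conv(torch.cat([avg, mx], dim=1))

class GAM(nn.Module):
    """Global Attention Module sketch: global average pooling over HxW, then two
    1x1 convolutions with GELU activations, yielding per-channel weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)        # reduction ratio is an assumption
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # AvgPool over HW: (B, C, 1, 1)
            nn.Conv2d(channels, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, channels, 1), nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)                            # (B, C, 1, 1)

x = torch.randn(1, 1024, 60, 80)
print(LAM()(x).shape, GAM(1024)(x).shape)  # (1, 1, 60, 80) and (1, 1024, 1, 1)
```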
In the Spatial-Semantic Fusion Module, these operations can be represented as follows:
f_l = LAM(f_{r4})
f_g = GAM(f_{r4})
s_{ls} = LAM(f_{s4})
s_{gs} = GAM(f_{s4})
As illustrated in Figure 3, the Local Attention Module and the Global Attention Module adjust the weights of the RGB features (f_{r4}) and the spatial features (f_{s4}) based on channels and local feature regions, resulting in local weight features (f_l and s_{ls}) and global weight features (f_g and s_{gs}).
The local weight features are added to the global weight features, and the summed features are passed through a Sigmoid(·) operation. The result is then multiplied element-wise with the original input features. Finally, the re-weighted features of the two modalities are concatenated, and a convolution operation produces the final output. These operations can be expressed as follows:
f_{rfin} = f_{r4} ⊙ Sigmoid(f_l ⊕ f_g)
f_{sfin} = f_{s4} ⊙ Sigmoid(s_{ls} ⊕ s_{gs})
f_{last} = Conv_{1×1}(Concat(f_{rfin}, f_{sfin}))
The symbol ⊕ denotes element-wise addition between two matrices of the same size, while ⊙ represents the Hadamard product operation. Sigmoid refers to the sigmoid activation function. The variables f_{rfin} and f_{sfin} represent the final results of the semantic and spatial features, respectively. The fused features of semantic and spatial information are denoted by f_{last}, which provides comprehensive features for the final predictions of the model. This final fusion step is grounded in the principle that integrating both semantic and spatial dimensions creates a more holistic feature set, crucial for accurate and robust model predictions.
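Putting the pieces together, the sketch below reproduces this fusion step, reusing the LAM and GAM sketches above. Broadcasting the (B, 1, H, W) local map against the (B, C, 1, 1) global weights, and the output width of the final 1 × 1 convolution, are our assumptions.

```python
import torch
import torch.nn as nn

def ssfm_fuse(f_r4, f_s4, lam_r, gam_r, lam_s, gam_s, out_conv):
    """Final SSFM fusion: add local and global weights, squash with a sigmoid,
    re-weight each modality, then concatenate and mix with a 1x1 convolution."""
    f_rfin = f_r4 * torch.sigmoid(lam_r(f_r4) + gam_r(f_r4))  # f_r4 (Hadamard) Sigmoid(f_l + f_g)
    f_sfin = f_s4 * torch.sigmoid(lam_s(f_s4) + gam_s(f_s4))  # f_s4 (Hadamard) Sigmoid(s_ls + s_gs)
    return out_conv(torch.cat([f_rfin, f_sfin], dim=1))       # f_last

f_r4, f_s4 = torch.randn(1, 1024, 60, 80), torch.randn(1, 1024, 60, 80)
out_conv = nn.Conv2d(2048, 512, kernel_size=1)
f_last = ssfm_fuse(f_r4, f_s4, LAM(), GAM(1024), LAM(), GAM(1024), out_conv)
print(f_last.shape)  # torch.Size([1, 512, 60, 80])
```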

3.5. Model Architecture and Variants

Based on the DFormer backbone, we developed DBCAN and constructed four model variants of different scales by adjusting the hyperparameters. The detailed architectures of these models are provided in Table 1, and the corresponding hyperparameters are listed below (an illustrative configuration sketch follows the list).
  • C = (C_rgb, C_depth): This parameter represents the channel numbers of the RGB features (C_rgb) and the depth features (C_depth) in different stages of the model. It specifies the dimensionality of the feature maps in each stage.
  • N_i = n: This parameter indicates the number of building blocks (n) in stage N_i. It determines the depth of the model by specifying the number of repeated blocks in each stage.
  • E_i = e: The expansion ratio (e) is the expansion factor for the number of channels in the MLPs within each building block. It controls the growth of channel dimensions within the blocks.
  • Decoder channels: This parameter denotes the channel dimension of the RGB and depth features in the decoder part of the model. It determines the dimensionality of the feature maps in the decoder, which is crucial for generating the final predictions.
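For reference, the variant hyperparameters from Table 1 can be grouped as follows; the values are copied from Table 1, while the dictionary layout and names are purely illustrative and not part of the released code.

```python
# Per-stage (C_rgb, C_depth), blocks N_i, expansion E_i, and decoder channels, as listed in Table 1.
DBCAN_VARIANTS = {
    "DBCAN-T": dict(stem=(16, 8),  channels=[(32, 16), (64, 32), (128, 64), (256, 128)],
                    blocks=[3, 3, 5, 2],  expansion=[8, 8, 4, 4], decoder=(512, 256)),
    "DBCAN-S": dict(stem=(32, 16), channels=[(64, 32), (128, 64), (256, 128), (512, 256)],
                    blocks=[2, 2, 4, 2],  expansion=[8, 8, 4, 4], decoder=(512, 256)),
    "DBCAN-B": dict(stem=(32, 16), channels=[(64, 32), (128, 64), (256, 128), (512, 256)],
                    blocks=[3, 3, 12, 2], expansion=[8, 8, 4, 4], decoder=(512, 256)),
    "DBCAN-L": dict(stem=(48, 24), channels=[(96, 48), (192, 96), (288, 144), (276, 288)],
                    blocks=[3, 3, 12, 3], expansion=[8, 8, 4, 4], decoder=(512, 256)),
}
```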

4. Results

4.1. Datasets

To assess the effectiveness of DBCAN, we conducted experiments on two publicly available RGB-D datasets:
  • NYUDv2 [12]: The NYUDv2 dataset consists of 1449 RGB-depth images with corresponding pixel-wise labels. The images have a unified resolution of 480 × 640 . The dataset is divided into a training set and a testing set, with 795 image-depth pairs used for training and the remaining 654 pairs utilized for testing. It encompasses 40 object categories, thereby providing a diverse range of scenes and objects for evaluation.
  • SUN-RGBD [13]: The SUN-RGBD dataset is a large-scale RGB-D dataset comprising 10,335 RGB-depth images. The images have a resolution of 530 × 730 pixels. The dataset encompasses a wide variety of indoor scenes and objects, with objects categorized into 37 classes. For our experiments, we split the dataset into a training set of 5285 images and a testing set of 5050 images.
By using these two publicly available datasets, our objective is to evaluate the performance of our model on different scenes and object categories. The NYUDv2 dataset offers a relatively smaller yet more diverse set of scenes, while the SUN-RGBD dataset provides a larger-scale dataset with a wide range of object categories. The combination of these datasets enables a comprehensive assessment of the generalization and effectiveness of our proposed approach in various real-world scenarios.
In the experimental section, we will present comprehensive statistics about the datasets used, including the number of images, resolution, and the number of object categories. Furthermore, we will elaborate on the data preprocessing steps and augmentation techniques employed to enhance the training process and improve the model’s performance.

4.2. Evaluation Metrics

We use several common metrics to assess the performance of our model, including mean intersection over union (mIoU), pixel accuracy (Acc), and mean accuracy (mAcc). These metrics offer quantitative measures of the alignment between predicted segmentation masks and ground truth masks. The formulas for calculating these metrics are as follows:
mIoU = (1 / p_c) Σ_i [ p_ii / (g_i + Σ_j p_ji − p_ii) ]
The mean intersection over union (mIoU) is a widely used metric for assessing the overall accuracy of semantic segmentation models. It calculates the average intersection over union for each class. Here, p_c represents the total number of classes, g_i indicates the number of pixels with ground truth label i, and p_ij represents the number of pixels predicted as class j with ground truth label i. The mIoU score ranges from 0 to 1, with higher values indicating better segmentation performance.
Acc = (1 / g) Σ_i p_ii
Pixel accuracy (Acc) is a straightforward metric that measures the ratio of correctly predicted pixels to the total number of pixels in the image. It is calculated by summing the diagonal elements of the confusion matrix (p_ii) and dividing by the total number of ground truth pixels (g). Acc provides a global measure of segmentation accuracy but does not consider the per-class performance.
mAcc = (1 / p_c) Σ_i Acc_i
Mean accuracy (mAcc) is the average accuracy across all classes. It is calculated by dividing the sum of the per-class pixel accuracies (Acc_i) by the total number of classes (p_c). mAcc takes into account the performance of the model on individual classes and provides a more comprehensive evaluation of segmentation accuracy.
mIoU, Acc, and mAcc are commonly used evaluation metrics in semantic segmentation tasks. They provide different perspectives on the performance of segmentation models, allowing us to assess accuracy at both the pixel and class levels. These metrics will be utilized to evaluate the effectiveness of our proposed approach in the experimental results section.
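As a concrete illustration of these formulas, the following sketch computes Acc, mAcc, and mIoU from a confusion matrix; the confusion-matrix convention (rows = ground truth, columns = predictions) and the toy numbers are our assumptions.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] counts pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)          # p_ii: correctly classified pixels per class
    gt = conf.sum(axis=1)                     # g_i: ground-truth pixels per class
    pred = conf.sum(axis=0)                   # sum_j p_ji: pixels predicted as each class
    acc = tp.sum() / conf.sum()               # pixel accuracy
    macc = (tp / np.maximum(gt, 1)).mean()    # mean per-class accuracy
    iou = tp / np.maximum(gt + pred - tp, 1)  # IoU_i = p_ii / (g_i + sum_j p_ji - p_ii)
    return acc, macc, iou.mean()

conf = np.array([[50, 2, 0],
                 [3, 30, 1],
                 [0, 4, 10]])
print(segmentation_metrics(conf))  # approximately (0.900, 0.853, 0.775)
```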

4.3. Implementation Details

The entire model is implemented using the PyTorch framework and trained on two RTX 3090 GPUs, each equipped with 24 GB of memory. We employ the DFormer-T, DFormer-S, DFormer-B, and DFormer-L models, directly using the pretrained weights provided by the DFormer authors [10] from their open-source repository, which were pretrained on ImageNet [26], as the backbone networks for feature extraction. As shown in Table 2, the default batch size used during the training of the DBCAN-T, DBCAN-S, DBCAN-B, and DBCAN-L models is set to 8/16 (NYUDepthv2/SUN-RGBD). To augment the training data and mitigate overfitting, we apply random horizontal flipping and random scaling. We use the AdamW optimizer [27] with an initial learning rate of 6 × 10⁻⁵ or 8 × 10⁻⁵ and a poly decay schedule. The weight decay is set to 0.01. The number of epochs for training DBCAN on the NYUDepthv2 and SUN-RGBD datasets is 500 and 300, respectively. The optimizer momentum parameters β_1 and β_2 are set to 0.9 and 0.999, respectively.
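A hypothetical training-setup snippet reflecting these settings (AdamW, weight decay 0.01, betas 0.9/0.999, poly learning-rate decay); the stand-in model, the iteration count, and the poly power of 0.9 are assumptions, not values taken from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 40, kernel_size=1)          # stand-in for DBCAN
optimizer = AdamW(model.parameters(), lr=6e-5,
                  weight_decay=0.01, betas=(0.9, 0.999))

max_iters, power = 100_000, 0.9                        # poly schedule: lr * (1 - it / max_iters) ** power
scheduler = LambdaLR(optimizer, lambda it: (1 - it / max_iters) ** power)

for it in range(3):                                    # toy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(2, 3, 64, 64)).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(it, scheduler.get_last_lr()[0])
```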

4.4. RGB-D Semantic Segmentation Model for Comparison

For the sake of comparison, we refer to the complete model proposed by Yin et al. [10] (DFormer + Hamburger [11]) as TopFormer. We evaluate DBCAN against recent RGB-D semantic segmentation methods on the NYUDepthv2 [12] and SUN-RGBD [13] datasets. Our selection criteria for these methods are as follows: (a) they are representative, (b) they have been recently published, and (c) their source code is publicly available. As shown in Table 3 and Table 4, DBCAN achieves new state-of-the-art performance on these benchmark datasets. The metrics in the last three columns correspond to Acc, mAcc, and mIoU, respectively.

4.4.1. NYUDepthv2 Semantic Segmentation

As presented in Table 3, DBCAN is compared with several previous methods [7,10,16,28,29,30,31,32,33,34,35] on the NYUDepthv2 validation set. We selected various ResNet-based and efficient Convolutional Neural Network (CNN)-based methods as baselines.
Table 3. Results on the NYUDepthv2 val dataset.
Method | Backbone | Params | Input Size | Flops | Acc | mAcc | mIoU
ACNet [7] | ResNet-50 | 116.6 M | 480 × 640 | 126.7 G | - | - | 48.3
SGNet [16] | ResNet-101 | 58.3 M | 480 × 640 | 108.5 G | 76.8 | 63.1 | 51.0
SA-Gate [28] | ResNet-101 | 110.9 M | 480 × 640 | 193.7 G | 77.9 | - | 52.4
CEN [29] | ResNet-101 | 118.2 M | 480 × 640 | 618.7 G | 77.2 | 63.7 | 51.7
CEN [29] | ResNet-152 | 133.9 M | 480 × 640 | 664.4 G | 77.7 | 65.0 | 52.5
ShapeConv [30] | ResNext-101 | 86.8 M | 480 × 640 | 124.6 G | 76.4 | 63.5 | 51.3
ESANet [31] | ResNet-34 | 31.2 M | 480 × 640 | 34.9 G | - | - | 50.3
FRNet [32] | ResNet-34 | 85.5 M | 480 × 640 | 115.6 G | 77.6 | 66.5 | 53.6
PGDENet [33] | ResNet-34 | 100.7 M | 480 × 640 | 178.8 G | 78.1 | 66.7 | 53.7
EMSANet [36] | ResNet-34 | 46.9 M | 480 × 640 | 45.4 G | - | - | 51.0
CMX [34] | MiT-B2 | 66.6 M | 480 × 640 | 67.6 G | - | - | 54.4
CMX [34] | MiT-B4 | 139.9 M | 480 × 640 | 134.3 G | - | - | 56.3
CMX [34] | MiT-B5 | 181.1 M | 480 × 640 | 167.8 G | - | - | 56.9
TopFormer [10] | DFormer-T | 6.0 M | 480 × 640 | 11.8 G | 77.1 | 65.8 | 51.8
TopFormer [10] | DFormer-S | 18.7 M | 480 × 640 | 25.6 G | 78.1 | 66.4 | 53.5
TopFormer [10] | DFormer-B | 29.5 M | 480 × 640 | 41.9 G | 79.4 | 69.4 | 55.5
TopFormer [10] | DFormer-L | 39.0 M | 480 × 640 | 65.7 G | 79.9 | 70.2 | 56.9
DBCAN | DFormer-T | 22.7 M | 480 × 640 | 87.3 G | 77.3 | 67.3 | 52.2
DBCAN | DFormer-S | 38.1 M | 480 × 640 | 113.7 G | 78.6 | 68.4 | 54.5
DBCAN | DFormer-B | 48.8 M | 480 × 640 | 129.9 G | 79.5 | 69.9 | 55.8
DBCAN | DFormer-L | 59.2 M | 480 × 640 | 158.1 G | 80.3 | 70.5 | 57.4
Table 4. Results on the SUN-RGBD val dataset.
Method | Backbone | Params | Input Size | Flops | Acc | mAcc | mIoU
ACNet [7] | ResNet-50 | 116.6 M | 530 × 730 | 163.9 G | - | - | 48.1
SGNet [16] | ResNet-101 | 64.7 M | 530 × 730 | 151.5 G | - | - | 48.6
SA-Gate [28] | ResNet-101 | 110.9 M | 530 × 730 | 250.1 G | - | - | 49.4
CEN [29] | ResNet-101 | 118.2 M | 530 × 730 | 618.7 G | 82.8 | 61.9 | 50.2
CEN [29] | ResNet-152 | 133.9 M | 530 × 730 | 664.4 G | 83.3 | 63.0 | 50.9
ShapeConv [30] | ResNext-101 | 86.8 M | 530 × 730 | 124.6 G | 82.2 | 59.2 | 48.6
ESANet [31] | ResNet-34 | 31.2 M | 480 × 640 | 34.9 G | - | - | 48.2
FRNet [32] | ResNet-34 | 46.9 M | 530 × 730 | 45.4 G | - | - | 48.4
CMX [34] | MiT-B2 | 66.6 M | 530 × 730 | 86.3 G | - | - | 49.7
CMX [34] | MiT-B4 | 139.9 M | 530 × 730 | 173.8 G | - | - | 52.1
CMX [34] | MiT-B5 | 181.1 M | 530 × 730 | 217.6 G | - | - | 52.4
TopFormer [10] | DFormer-T | 6.0 M | 530 × 730 | 15.1 G | 82.5 | 60.1 | 48.7
TopFormer [10] | DFormer-S | 18.7 M | 530 × 730 | 33.0 G | 82.9 | 61.6 | 50.0
TopFormer [10] | DFormer-B | 29.5 M | 530 × 730 | 54.0 G | 83.1 | 62.9 | 51.0
TopFormer [10] | DFormer-L | 39.0 M | 530 × 730 | 83.3 G | 83.6 | 63.9 | 52.2
DBCAN | DFormer-T | 22.7 M | 530 × 730 | 87.3 G | 82.6 | 60.3 | 48.9
DBCAN | DFormer-S | 38.1 M | 530 × 730 | 113.7 G | 83.0 | 62.0 | 50.3
DBCAN | DFormer-B | 48.8 M | 530 × 730 | 129.9 G | 83.4 | 63.0 | 51.2
DBCAN | DFormer-L | 59.2 M | 530 × 730 | 158.1 G | 83.6 | 64.9 | 52.6
Among these ResNet-based baselines, PGDENet [33] achieves good results with Acc (78.1), mAcc (66.7), and mIoU (53.7). DBCAN-L achieves higher Acc (80.3), mAcc (70.5), and mIoU (57.4) compared to PGDENet. The superiority of DBCAN over ResNet-based models can be attributed to the multiple attention operations added in each stage of DFormer, which enable more effective feature extraction. Additionally, the utilization of multiple efficient attention operations in the decoder of DBCAN contributes to more efficient decoding and better prediction results.
Among the hybrid-based baselines, TopFormer’s [10] results for Acc (79.9), mAcc (70.2), and mIoU (56.9) are lower than the results achieved by DBCAN. The multiple attention mechanisms in DBCAN enable the model to focus on more effective features during the decoding stage. As a result, DBCAN achieves the best performance across multiple evaluation metrics.

4.4.2. SUN-RGBD Semantic Segmentation

We also conducted experiments on the SUN-RGBD dataset, and DBCAN outperformed previous approaches, as shown in Table 4. DBCAN surpasses previous encoder–decoder-based methods in terms of performance. Using the same backbone networks, DBCAN achieves better evaluation metrics than the method proposed by Yin et al. [10] in terms of Acc, mAcc, and mIoU. For instance, DBCAN-L matches TopFormer on Acc (83.6 vs. 83.6) while achieving higher mAcc (64.9 vs. 63.9) and mIoU (52.6 vs. 52.2). Additionally, DBCAN demonstrates the best performance across these evaluation metrics when compared to other methods.

4.5. Ablation Studies

To assess the effectiveness of each component in our proposed model, we performed ablation studies on the NYUDepthv2 validation dataset. Our baseline model consists of the DFormer-S encoder with the Hamburger (ham) decoder. In this section, we provide the results of these ablation studies.

4.5.1. Effectiveness of Different Components

As shown in Figure 4, we conducted ablation experiments to understand the impact of different modules on the model, considering the following five main models: (1) the baseline (BL) model (DFormer-S encoder and ham decoder structure, No. 1); (2) the ORGM and SFRRM modules added to the BL model (No. 2); (3) the ORGM and RFRRM modules added to the BL model (No. 3); (4) the ORGM, SFRRM, and RFRRM modules integrated with the BL model (No. 4); and (5) the proposed full model (No. 5). The results of these experiments are summarized in Table 5; all ablation experiments in this section were conducted on the NYUDepthv2 dataset. The BL model achieves 78.24 Acc, 66.6 mAcc, and 53.5 mIoU. Compared with the BL model, after adding ORGM and SFRRM, Acc slightly decreases by 0.14, but mAcc increases by 0.65 and mIoU by 0.3. After integrating the ORGM and RFRRM modules, the No. 3 model improves by 0.04, 0.88, and 0.57 on these three metrics compared to the BL model. When using the ORGM, SFRRM, and RFRRM modules simultaneously, the No. 4 model performs 0.11, 1.17, and 0.69 better than the BL model on these three metrics. Notably, the No. 5 model with all modules improves on the BL model by 0.31, 1.79, and 0.95 on these three metrics, which clearly verifies the effectiveness of each proposed component. Furthermore, it should be noted that the RFRRM improves segmentation performance more than the SFRRM module, indicating that the RGB features extracted by the two-stream network encoder are more abundant and effective. However, the No. 4 experiment, which uses both SFRRM and RFRRM simultaneously, shows that using both modules is more effective than using either one alone. This suggests that relying on a single module for extracting either RGB or spatial features is insufficient, and richer features can be extracted by employing both modules simultaneously. Note that the prediction results obtained after adding the SSFM module are closer to the ground truth (GT), indicating that the SSFM module can suppress noise features and enhance effective features through attention.

4.5.2. Effectiveness of Spatial-Semantic Fusion Module

The effectiveness of the proposed Spatial-Semantic Fusion Module (SSFM) is validated through ablation experiments. Additionally, three other feature fusion strategies are considered in this study for comparison with SSFM. The first strategy involves the simple concatenation of RGB and spatial features, the second strategy employs the Local Attention Module (LAM) on RGB and spatial features, and the third strategy utilizes the Global Attention Module (GAM) on RGB and spatial features. The experimental results of these three strategies are listed in rows 1, 2, and 3 of Table 6.
Considering the first row, the Acc, mAcc, and mIoU values are lower than those of the SSFM model on the NYUDepthv2 dataset, indicating that using the Spatial-Semantic Fusion Module effectively improves the segmentation results. Furthermore, considering the second and third rows, the results are higher than those of the first row, but their mAcc and mIoU remain lower than those of the fourth row, indicating that the simultaneous utilization of the Local Attention Module and the Global Attention Module enhances the effectiveness of cross-modal fusion features.

4.5.3. Effectiveness of Decoder Channels

To verify the effectiveness of decoder channels, we conducted three sets of experiments, all utilizing the complete decoding part with only differences in the decoding channel. The decoding channels for the three experiments were set to 256, 512, and 1028, and the results are presented in Table 7.
The relatively poor performance in the first row can be attributed to the relatively small number of channels, which leads to incomplete feature extraction and affects the final results. Considering the second row, the experiment with 512 channels yields the best results in terms of mAcc and mIoU, indicating that using 512 as the channel number for the decoder is the most effective choice. Although the third row’s experiment can extract the most features in the decoder, its overall performance is lower than that of the second row, suggesting that 1028 channels are not suitable for the decoder.

4.6. Visualization

As illustrated in Figure 5, we compare the predictions of DBCAN with the ground truth (GT) and TopFormer [10]. It is evident that DBCAN excels in segmenting object boundaries in the third, fourth, fifth, and last rows. In the third row, DBCAN accurately segments the boundaries of the person and paper on the table, whereas TopFormer fails to do so. In the fourth row, DBCAN precisely segments the boundary between the sofa and the carpet, while TopFormer falls short in fully segmenting them. In the fifth row, only our method (DBCAN) successfully segments the boundary of the sofa. TopFormer struggles to accurately segment the sofa. In the last row, only DBCAN is able to completely segment the left chair, while TopFormer only captures part of the chair.

5. Discussion

In this work, we proposed a novel DFormer-Based Cross-Attention Network (DBCAN) for RGB-D semantic segmentation, which addresses several key challenges in integrating RGB and depth modalities. Our asymmetric network design effectively utilizes DFormer as an encoder to extract both RGB and spatial features. By introducing a cross-modal feature fusion module and a symmetric hybrid decoder, our approach improves feature integration and segmentation performance, achieving state-of-the-art results on the NYUDepthv2 and SUN-RGBD datasets.
One of the key innovations of our model is the cross-modal feature fusion module, which promotes effective interaction between the RGB and spatial modalities. Previous methods have often treated these two modalities separately, leading to suboptimal performance. In contrast, our approach leverages the complementary nature of RGB and spatial features, thereby enhancing the model’s ability to capture fine-grained details and object boundaries.
Despite these advancements, some limitations remain. First, the computational complexity of the proposed method, especially in terms of training and inference time, could be a challenge when deploying the model in real-time applications. While our experiments demonstrate significant performance gains, further research is needed to optimize the model for resource-constrained environments, such as the embedded systems used in autonomous driving. Additionally, although our model shows strong performance on RGB-D datasets, extending this approach to other modalities may require re-pretraining of the backbone networks to handle different types of inputs effectively.
Future work could explore techniques to reduce the model’s complexity without sacrificing performance, as well as expand its applicability to a wider range of input modalities and real-world tasks. Moreover, the integration of self-supervised learning techniques could further enhance the model’s generalization capabilities, especially in scenarios with limited labeled data.

6. Conclusions

In this paper, we present the DFormer-Based Cross-Attention Network (DBCAN), a novel architecture that builds upon the DFormer framework by introducing specific modifications aimed at enhancing RGB-D semantic segmentation. Our approach addresses key limitations in the DFormer architecture by improving the integration of spatial features during decoding and enhancing the alignment between RGB and depth modalities. To achieve this, we introduce three novel modules in the decoding stage: the Object-Region Generated Module (ORGM) to refine RGB feature representation, the Feature-Region Relation Module (FRRM) to align and weigh relevant RGB and spatial features, and the Spatial-Semantic Fusion Module (SSFM) to effectively fuse features from both modalities. These modifications enhance the performance of DBCAN, as demonstrated through extensive experiments. Our method achieves state-of-the-art results on key RGB-D datasets, validating the effectiveness of our approach. Additionally, ablation studies highlight the contributions of each module, providing further insight into how these modifications improve the overall architecture. By building on the strengths of DFormer while addressing its limitations, DBCAN proves to be a robust solution for RGB-D semantic segmentation tasks.

Author Contributions

A.W., conceptualization, funding acquisition, supervision, writing—review and editing; L.F., conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this research are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  2. Ruan, C.; Zang, Q.; Zhang, K.; Huang, K. DN-SLAM: A Visual SLAM with ORB Features and NeRF Mapping in Dynamic Environments. IEEE Sens. J. 2024, 24, 5279–5287. [Google Scholar] [CrossRef]
  3. Bonaldi, L.; Menti, E.; Ballerini, L.; Ruggeri, A.; Trucco, E. Automatic Generation of Synthetic Retinal Fundus Images: Vascular Network. Procedia Comput. Sci. 2016, 90, 54–60. [Google Scholar] [CrossRef]
  4. Wang, C.; Wang, C.; Li, W.; Wang, H. A brief survey on RGB-D semantic segmentation using deep learning. Displays 2021, 70, 102080. [Google Scholar] [CrossRef]
  5. Fooladgar, F.; Kasaei, S. A survey on indoor RGB-D semantic segmentation: From hand-crafted features to deep convolutional neural networks. Multimed. Tools Appl. 2020, 79, 4499–4524. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Sidibé, D.; Morel, O.; Mériaudeau, F. Deep multimodal fusion for semantic image segmentation: A survey. Image Vis. Comput. 2021, 105, 104042. [Google Scholar] [CrossRef]
  7. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444. [Google Scholar] [CrossRef]
  8. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 213–228. [Google Scholar]
  9. Fooladgar, F.; Kasaei, S. Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images. arXiv 2019, arXiv:1912.11691. [Google Scholar]
  10. Yin, B.; Zhang, X.; Li, Z.; Liu, L.; Cheng, M.M.; Hou, Q. DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation. arXiv 2023, arXiv:2309.09668. [Google Scholar]
  11. Geng, Z.; Guo, M.H.; Chen, H.; Li, X.; Wei, K.; Lin, Z. Is Attention Better Than Matrix Decomposition? In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  12. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the Computer Vision–ECCV 2012, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760. [Google Scholar]
  13. Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar] [CrossRef]
  14. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
  15. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 173–190. [Google Scholar]
  16. Chen, L.Z.; Lin, Z.; Wang, Z.; Yang, Y.L.; Cheng, M.M. Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation. IEEE Trans. Image Process. 2021, 30, 2313–2324. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, G.; Xue, J.H.; Xie, P.; Yang, S.; Wang, G. Non-Local Aggregation for RGB-D Semantic Segmentation. IEEE Signal Process. Lett. 2021, 28, 658–662. [Google Scholar] [CrossRef]
  18. Zhou, H.; Qi, L.; Huang, H.; Yang, X.; Wan, Z.; Wen, X. CANet: Co-attention network for RGB-D semantic segmentation. Pattern Recognit. 2022, 124, 108468. [Google Scholar] [CrossRef]
  19. Chen, S.; Zhu, X.; Liu, W.; He, X.; Liu, J. Global-Local Propagation Network for RGB-D Semantic Segmentation. arXiv 2021, arXiv:2101.10801. [Google Scholar]
  20. Ning, X.; Gong, K.; Li, W.; Zhang, L. JWSAA: Joint weak saliency and attention aware for person re-identification. Neurocomputing 2021, 453, 801–811. [Google Scholar] [CrossRef]
  21. Hu, J.; Yang, L.; Wang, G. ACNet: Attention Complementary Network for RGB-D Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 6798–6807. [Google Scholar]
  22. Lee, C.Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; Lebanon, G., Vishwanathan, S.V.N., Eds.; Proceedings of Machine Learning Research. Volume 38, pp. 562–570. [Google Scholar]
  23. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  24. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  26. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  27. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Chen, X.; Lin, K.Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 561–577. [Google Scholar]
  29. Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep Multimodal Fusion by Channel Exchanging. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 4835–4845. [Google Scholar]
  30. Cao, J.; Leng, H.; Lischinski, D.; Cohen-Or, D.; Tu, C.; Li, Y. ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 7068–7077. [Google Scholar]
  31. Seichter, D.; Köhler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.M. Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13525–13531. [Google Scholar] [CrossRef]
  32. Zhou, W.; Yang, E.; Lei, J.; Yu, L. FRNet: Feature Reconstruction Network for RGB-D Indoor Scene Parsing. IEEE J. Sel. Top. Signal Process. 2022, 16, 677–687. [Google Scholar] [CrossRef]
  33. Zhou, W.; Yang, E.; Lei, J.; Wan, J.; Yu, L. PGDENet: Progressive Guided Fusion and Depth Enhancement Network for RGB-D Indoor Scene Parsing. IEEE Trans. Multimed. 2023, 25, 3483–3494. [Google Scholar] [CrossRef]
  34. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
  35. Liu, Y.; Zang, Y.; Zhou, D.; Cao, J.; Nie, R.; Hou, R.; Ding, Z.; Mei, J. An Improved Hybrid Network with a Transformer Module for Medical Image Fusion. IEEE J. Biomed. Health Inform. 2023, 27, 3489–3500. [Google Scholar] [CrossRef] [PubMed]
  36. Seichter, D.; Fischedick, S.B.; Köhler, M.; Groß, H.M. Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–10. [Google Scholar] [CrossRef]
Figure 1. Network structure of the method in this study.
Figure 2. Feature-Region Relation Module.
Figure 3. Spatial-Semantic Fusion Module.
Figure 4. Simplified diagram of different component models.
Figure 5. Segmentation comparison: DBCAN vs. TopFormer.
Table 1. Model architecture and variants.
Stage | Output Size | DBCAN-T | DBCAN-S | DBCAN-B | DBCAN-L
Stem | H/4 × W/4 | C = (16, 8) | (32, 16) | (32, 16) | (48, 24)
1 | H/4 × W/4 | C = (32, 16), N_1 = 3, E_1 = 8 | (64, 32), 2, 8 | (64, 32), 3, 8 | (96, 48), 3, 8
2 | H/8 × W/8 | C = (64, 32), N_2 = 3, E_2 = 8 | (128, 64), 2, 8 | (128, 64), 3, 8 | (192, 96), 3, 8
3 | H/16 × W/16 | C = (128, 64), N_3 = 5, E_3 = 4 | (256, 128), 4, 4 | (256, 128), 12, 4 | (288, 144), 12, 4
4 | H/32 × W/32 | C = (256, 128), N_4 = 2, E_4 = 4 | (512, 256), 2, 4 | (512, 256), 2, 4 | (276, 288), 3, 4
Decoder channels | - | C = (512, 256) | (512, 256) | (512, 256) | (512, 256)
Table 2. DBCAN fine-tuning settings on NYUDepthv2/SUN-RGBD. The input size, batch size, base learning rate, epochs, and stochastic depth are specified for the NYUDepthv2/SUN-RGBD datasets. All other hyperparameters, such as the optimizer, weight decay, optimizer momentum, learning rate schedule, and warmup epochs, are shared across both datasets.
Config | DBCAN-T | DBCAN-S | DBCAN-B | DBCAN-L
input size | 480 × 480 / 480 × 480 | 480 × 480 / 480 × 480 | 480 × 480 / 480 × 480 | 480 × 480 / 480 × 480
optimizer | AdamW | AdamW | AdamW | AdamW
batch size | 8/16 | 8/16 | 8/16 | 8/16
base learning rate | 6 × 10⁻⁵ / 8 × 10⁻⁵ | 6 × 10⁻⁵ / 8 × 10⁻⁵ | 6 × 10⁻⁵ / 8 × 10⁻⁵ | 6 × 10⁻⁵ / 8 × 10⁻⁵
weight decay | 0.01 | 0.01 | 0.01 | 0.01
epochs | 500/300 | 500/300 | 500/300 | 500/300
optimizer momentum | β_1, β_2 = 0.9, 0.999 | β_1, β_2 = 0.9, 0.999 | β_1, β_2 = 0.9, 0.999 | β_1, β_2 = 0.9, 0.999
learning rate schedule | linear decay | linear decay | linear decay | linear decay
warmup epochs | 10 | 10 | 10 | 10
stochastic depth | 0.1/0.1 | 0.1/0.1 | 0.1/0.1 | 0.15/0.15
Table 5. Ablation studies on our proposed modules.
Model | ORGM | SFRRM | RFRRM | SSFM | Acc | mAcc | mIoU
No. 1 | - | - | - | - | 78.24 | 66.6 | 53.5
No. 2 | ✓ | ✓ | - | - | 78.1 | 67.25 | 53.8
No. 3 | ✓ | - | ✓ | - | 78.28 | 67.48 | 54.07
No. 4 | ✓ | ✓ | ✓ | - | 78.35 | 67.77 | 54.19
No. 5 | ✓ | ✓ | ✓ | ✓ | 78.55 | 68.39 | 54.45
Table 6. Ablation studies on the Spatial-Semantic Fusion Module.
Methods | Acc | mAcc | mIoU
concatenation | 78.42 | 67.20 | 54.19
LAM | 78.83 | 68.18 | 54.30
GAM | 78.60 | 67.86 | 54.34
SSFM | 78.55 | 68.39 | 54.45
Table 7. Ablation studies on different decoder channels.
Decoder Channels | Acc | mAcc | mIoU
256 | 78.42 | 67.20 | 53.79
512 | 78.55 | 68.39 | 54.45
1028 | 78.68 | 67.42 | 54.11
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
