Article

RSM-Optimizer: Branch Optimization for Dual- or Multi-Branch Semantic Segmentation Networks

1 School of Software, Henan Polytechnic University, Jiaozuo 454000, China
2 School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(6), 1109; https://doi.org/10.3390/electronics14061109
Submission received: 5 February 2025 / Revised: 7 March 2025 / Accepted: 10 March 2025 / Published: 11 March 2025

Abstract

Semantic segmentation is a crucial task in the field of computer vision, with important applications in areas such as autonomous driving, medical image analysis, and remote sensing image analysis. Dual-branch and multi-branch semantic segmentation networks that leverage deep learning technologies can enhance both segmentation accuracy and speed. These networks typically contain a semantic branch and a detail branch. However, the feature maps in the detail branch have only a single type of receptive field, which limits models' abilities to perceive objects at different scales. During feature map fusion, the low-resolution feature maps from the semantic branch are upsampled by a large factor to match the feature maps in the detail branch, and these upsampling operations inevitably introduce noise. To address these issues, we propose several improvements to optimize the detail and semantic branches. We first design a receptive field-driven feature enhancement module to enrich the receptive fields of feature maps in the detail branch. Then, we propose a stepwise upsampling and fusion module to reduce the noise introduced during the upsampling process of feature fusion. Finally, we introduce a pyramid mixed pooling module (PMPM) to improve models' abilities to perceive objects of different shapes. Considering the diversity of objects in terms of scale, shape, and category in urban street scene data, we carried out experiments on the Cityscapes and CamVid datasets. The experimental results on both datasets validate the effectiveness and efficiency of the proposed improvements.

1. Introduction

Semantic segmentation is a vital task in computer vision [1]. It involves assigning a category label to each pixel in an image, enabling the semantic classification of objects and facilitating visual scene understanding. This technique can be applied in various domains, including autonomous driving [2,3], medical image analysis [4,5], and remote sensing image analysis [6,7]. In the field of autonomous driving, semantic segmentation can be utilized to accurately identify key elements, such as road boundaries, obstacles, and traffic signs, thereby enabling safe driving in complex traffic environments. With the advent of deep learning, convolutional neural networks (CNNs) have been employed to enhance the accuracy of semantic segmentation. Notable networks, such as PSPNet [8] and DeepLab [9], have been proposed. However, these architectures often encounter challenges in achieving satisfactory results for both accuracy and speed simultaneously.
In recent years, Yu et al. [10] proposed the dual-branch network known as BiSeNet, which comprises a semantic branch and a detail branch. BiSeNet significantly enhances inference speed while preserving accuracy. Building on this framework, Pan et al. [11] introduced DDRNet, further improving both speed and accuracy. However, during the feature fusion process, the detail features from the detail branch are often overshadowed by the contextual features from the semantic branch. To mitigate this issue, Xu et al. [12] introduced a boundary branch alongside the existing semantic and detail branches, resulting in the proposed PIDNet. Compared to DDRNet and BiSeNet, PIDNet demonstrates superior advantages in both segmentation accuracy and speed.
In dual- or multi-branch semantic segmentation networks, such as BiSeNet, DDRNet, and PIDNet, the detail branch contains only high-resolution feature maps with a limited receptive field, which restricts the model’s capacity to perceive objects of varying sizes. Conversely, the semantic branch consists of feature maps at multiple resolutions, with the lowest-resolution maps conveying the richest semantic information. To obtain feature maps that integrate both semantic and detail information, the lowest-resolution maps are upsampled by a large factor to merge with the detail branch feature maps. However, this upsampling introduces noise into the fused feature maps, ultimately undermining segmentation accuracy.
To address these issues, we have enhanced both the detail and semantic branches in dual- and multi-branch segmentation networks. First, we designed a receptive field-driven feature enhancement module (RF-FEM) to expand the receptive fields of the feature maps in the detail branch. Next, we implemented a stepwise upsampling and fusion module (SUFM) to mitigate the noise introduced during upsampling of low-resolution feature maps from the semantic branch. Finally, we introduced a pyramid mixed pooling module (PMPM) to improve the network’s ability to perceive objects of varying shapes. The main contributions of this work are as follows:
  • We propose a receptive field fusion block (RFFB) to extract feature maps with diverse receptive fields. Building upon RFFB, we develop the receptive field-driven feature enhancement module (RF-FEM) to augment the receptive fields of these feature maps.
  • We design a stepwise upsampling and fusion module (SUFM) that systematically upsamples and fuses the lowest-resolution feature maps from the semantic branch with those from the detail branch.
  • We introduce a mixed pooling module (MPM) that leverages pooling operations with both regular and square kernels. Based on MPM, we develop a pyramid mixed pooling module (PMPM) to better capture objects of varying shapes.
The organization of this paper is as follows: Section 2 introduces related works, while Section 3 elaborates on the proposed improvements. Section 4 presents the experimental results, and finally, Section 5 concludes the paper.

2. Related Work

This section highlights key advancements in semantic segmentation and deep learning techniques. First, Section 2.1 reviews the current state of research on high-accuracy semantic segmentation. Second, Section 2.2 examines the research landscape of real-time semantic segmentation. Finally, Section 2.3 discusses the advantages and evolution of the pyramid pooling module (PPM).

2.1. High-Accuracy Semantic Segmentation

Traditional semantic segmentation methods primarily rely on handcrafted features and classical machine learning models [13]. While these methods are computationally efficient, their feature representation capabilities are limited, making it challenging to capture high-level semantic information in complex scenes. Moreover, their segmentation results are unstable under varying illumination, occlusions, and complex backgrounds. In contrast, deep learning-based semantic segmentation methods can capture richer semantic information, significantly overcoming the limitations of traditional approaches [14,15].
Long et al. [16] were the first to apply deep learning to semantic segmentation. They replaced the fully connected layers with convolution layers, enabling pixel-level semantic segmentation for images of arbitrary sizes. Building on the work of Long et al., Zhao et al. [8] proposed PSPNet and PPM, which capture rich contextual information through multi-scale pooling operations. However, these pooling operations lead to the loss of some fine-grained details. To address this issue, Liang et al. [9] designed atrous convolution, which can effectively expand the receptive field without decreasing the resolution of feature maps. Atrous convolution has been applied to DeepLab and its variants [17,18,19].
In order to improve segmentation accuracy, Wang et al. [20] proposed HRNet, which utilizes a parallel multi-resolution branch architecture to extract high-level semantic features while preserving spatial details. Attention mechanisms have also been utilized to improve segmentation accuracy [21,22,23]. Fu et al. [24] exploited channel and spatial attention modules in DANet, and Xie et al. [25] employed a transformer encoder in their proposed SegFormer.
Although the aforementioned methods have made significant advancements in segmentation accuracy, their inference speeds do not adequately meet the demands of real-time scenarios such as autonomous driving and real-time video processing.

2.2. Real-Time Semantic Segmentation

To achieve a balance between inference speed and accuracy, researchers have proposed various semantic segmentation solutions suitable for real-time scenarios [26,27]. These solutions primarily utilize encoder–decoder architectures or dual-branch network architectures [28].
The encoder–decoder architecture employs a lightweight backbone network and an efficient decoder to reduce computational complexity while improving segmentation accuracy. ShuffleSeg [29] adopts ShuffleNet [30], which combines channel shuffling with group convolution, to reduce computational costs. SwiftNet [31] introduces lightweight lateral connections to the decoder to enhance its ability to recover details. ESPNet [32] utilizes spatial pyramid convolution modules to construct its backbone, reducing computational load while enhancing the capability to capture multi-scale features. However, the encoder–decoder architecture is prone to losing spatial information during the downsampling processes, adversely affecting segmentation accuracy.
BiSeNet is a typical dual-branch network, consisting of a high-resolution detail branch and a low-resolution semantic branch. BiSeNetV2 [33] further optimizes BiSeNet by adopting a more lightweight backbone and a bilateral guided aggregation layer to effectively fuse detail and semantic features. DDRNet enhances the dual-branch framework by incorporating a bilateral fusion module to strengthen information exchange between branches. PIDNet introduces an additional boundary branch specifically designed to retain object boundary information, further improving segmentation performance.

2.3. Pyramid Pooling Module

The pyramid pooling module captures local and global features through multi-level pooling operations with kernels of varying sizes, enabling models to achieve accurate segmentation of complex objects. Chen et al. [17] proposed the atrous spatial pyramid pooling (ASPP) module, replacing some of the pooling operations with atrous convolutions at varying dilation rates to enhance the ability to capture multi-scale contextual information. Pan et al. [11] designed the deep aggregation pyramid pooling module (DAPPM) by introducing efficient feature fusion strategies into the PPM. Xu et al. [12] improved DAPPM by fusing its branches in parallel.
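For reference, the sketch below illustrates the basic PPM idea in PyTorch: pool the input to several bin sizes, project each pooled map, upsample it back, and concatenate everything with the input. The bin sizes and channel widths here are illustrative assumptions; PSPNet, DAPPM, and the parallel variant used in PIDNet all differ in their exact configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePPM(nn.Module):
    """Minimal pyramid pooling: pool to several bin sizes, project, upsample, concatenate."""
    def __init__(self, in_ch, branch_ch=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                        # context at bin size b (1x1 is global)
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])
        self.project = nn.Conv2d(in_ch + branch_ch * len(bins), in_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return self.project(torch.cat(feats, dim=1))
```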

3. Proposed Approach

3.1. Architecture of the Improved PIDNet

In order to expand the receptive fields of feature maps in the detail branch and reduce the noise introduced when fusing feature maps from the detail and semantic branches, we propose several improvements to optimize both branches: we expand the receptive fields of feature maps in the detail branch with RF-FEM, reduce the noise introduced during fusion with SUFM, and strengthen the semantic branch's ability to perceive objects of various shapes with PMPM. Figure 1 shows the improved PIDNet incorporating these modules.

3.2. Receptive Field-Driven Feature Enhancement Module

In the detail branch, feature maps exhibit high resolution but have a limited receptive field, which undermines their capacity to represent features across different scales. To enhance the receptive field of the feature maps in this branch, DDRNet and PIDNet fuse them with feature maps from the semantic branch, which possess larger receptive fields. Although this fusion improves the receptive field in the detail branch, it remains inadequate for scenes with significant variability in target scales, such as urban scenes that include both small traffic signs and large buildings. To achieve feature maps with diverse receptive fields and enhance the network’s ability to perceive objects at varying scales, we propose RF-FEM and integrate it into the detail branch.
RF-FEM comprises a residual branch and a receptive field fusion branch. It merges the outputs of these two branches through an element-wise addition operation. The receptive field fusion branch employs a receptive field fusion block (RFFB) to extract and combine multiple feature maps with varying receptive fields. Specifically, RFFB begins with a 1 × 1 convolution to reduce the dimensionality of the input feature maps, thereby decreasing computational costs. It then applies three cascaded 3 × 3 convolutions to extract feature maps with different receptive fields from the reduced-dimensionality results. Following this, four dilated convolutions with rates of 1, 3, 5, and 7 are utilized to expand the receptive fields of the feature maps produced by both the dimensionality reduction and the cascaded convolutions. Finally, concatenation and further dimensionality reduction operations are performed to yield the final output. Figure 2 illustrates the structure of RF-FEM.
Let X_r be the input to RF-FEM. The output of RF-FEM can be expressed as f_RF-FEM(X_r) and calculated according to Equation (1). In this equation, r(·) denotes the dimensionality reduction operation, f_RFFB(·) represents RFFB, and ⊕ indicates the element-wise addition operation. The computation of f_RFFB(·) is described in Equation (2), where DConv_{r=1}(·) indicates the 3 × 3 dilated convolution with a dilation rate of 1 and DConv_{r=2i+1}(·) with 1 ≤ i ≤ 3 corresponds to the dilated convolutions with dilation rates of 3, 5, and 7, respectively. The term Conv_{3×3}^i(·) represents the output of the ith convolution in the cascaded sequence, which is defined in Equation (3).

$$f_{\mathrm{RF\text{-}FEM}}(X_r) = \mathrm{Conv}_{1\times 1}(X_r) \oplus f_{\mathrm{RFFB}}(X_r) \tag{1}$$

$$f_{\mathrm{RFFB}}(X_r) = r\left(\mathrm{Concat}\left(\mathrm{DConv}_{r=1}\big(r(X_r)\big),\ \left\{\mathrm{DConv}_{r=2i+1}\left(\mathrm{Conv}_{3\times 3}^{i}(X_r)\right)\right\}_{1\le i\le 3}\right)\right) \tag{2}$$

$$\mathrm{Conv}_{3\times 3}^{i}(X_r) = \begin{cases}\mathrm{Conv}_{3\times 3}\big(r(X_r)\big), & i = 1\\[2pt] \mathrm{Conv}_{3\times 3}\left(\mathrm{Conv}_{3\times 3}^{i-1}(X_r)\right), & 1 < i \le 3\end{cases} \tag{3}$$
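The following PyTorch sketch shows one way to realize RFFB and RF-FEM as described by Equations (1)–(3). The channel reduction ratio, the use of batch normalization, and the exact channel widths are assumptions that the text does not fix.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k=3, dilation=1):
    pad = dilation * (k // 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class RFFB(nn.Module):
    """Receptive field fusion block (Eqs. (2) and (3)): reduce channels, run cascaded
    3x3 convolutions, widen receptive fields with dilated convolutions, then
    concatenate and reduce again."""
    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid = mid_channels or channels // 2              # assumed reduction ratio
        self.reduce = conv_bn_relu(channels, mid, k=1)   # r(.)
        self.cascade = nn.ModuleList([conv_bn_relu(mid, mid) for _ in range(3)])
        self.dilated = nn.ModuleList([conv_bn_relu(mid, mid, dilation=r) for r in (1, 3, 5, 7)])
        self.fuse = conv_bn_relu(4 * mid, channels, k=1)

    def forward(self, x):
        r = self.reduce(x)
        feats, y = [self.dilated[0](r)], r               # DConv with rate 1 on the reduced map
        for conv3, dconv in zip(self.cascade, self.dilated[1:]):
            y = conv3(y)                                 # cascaded 3x3 convolutions
            feats.append(dconv(y))                       # dilation rates 3, 5, 7
        return self.fuse(torch.cat(feats, dim=1))

class RFFEM(nn.Module):
    """RF-FEM (Eq. (1)): a residual 1x1 branch added element-wise to the RFFB branch."""
    def __init__(self, channels):
        super().__init__()
        self.residual = conv_bn_relu(channels, channels, k=1)
        self.rffb = RFFB(channels)

    def forward(self, x):
        return self.residual(x) + self.rffb(x)
```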

3.3. Stepwise Upsampling and Fusion Module

The semantic branch utilizes downsampling operations to produce low-resolution feature maps rich in semantic information, such as those at 1/64 of the input resolution. To generate feature maps that are rich in both semantic content and detail, the lowest-resolution feature maps from the semantic branch—characterized by their highest semantic content—are fused with high-resolution feature maps from the detail branch. To address the resolution disparity, the low-resolution feature maps are upsampled significantly to match the resolution of the high-resolution maps before fusion. However, this upsampling process may introduce noise, adversely affecting segmentation performance.
To reduce such noise, we draw inspiration from the PAG module in PIDNet to develop a weighted fusion block (WFB) and propose SUFM based on this block. SUFM consists of two WFBs. The first WFB fuses feature maps at 1/64 and 1/32 of the initial resolution, while the second WFB fuses the output of the first with feature maps at 1/16 of the initial resolution. Finally, the result of the second WFB is upsampled to match the feature maps from the detail branch. Figure 3 illustrates the structures of SUFM (top) and WFB (bottom).
WFB processes input feature maps at two different resolutions. It initially applies the Content-Aware ReAssembly of FEatures (CARAFE) algorithm [34] and a 1 × 1 convolution separately to upsample the lower-resolution feature maps and reduce the dimensionality of the higher-resolution feature maps, resulting in feature maps with the same resolution and channel dimensions. An element-wise addition operation is then performed to fuse these generated feature maps. Subsequently, an attention mechanism is employed to compute weights for the fused results. Finally, these weights are utilized to merge the outputs from the CARAFE algorithm and the 1 × 1 convolution, producing the final output of WFB.
$$f_{\mathrm{WFB}}(X_h, X_l) = f_{am}(X_h' \oplus X_l') \times X_h' + \left(1 - f_{am}(X_h' \oplus X_l')\right) \times X_l' \tag{4}$$

Given the input feature maps X_h and X_l, where X_h has a higher resolution than X_l, f_WFB(X_h, X_l) represents the output of WFB, as defined by Equation (4). In this equation, X_h' and X_l' denote the results of applying a 1 × 1 convolution to X_h and performing an upsampling operation on X_l, respectively. Additionally, f_am(·) represents the output obtained by applying an attention mechanism to the fused X_h' and X_l'.

Based on Equation (4), the output of SUFM can be computed using Equation (5). In this equation, X_{1/16}, X_{1/32}, and X_{1/64} represent feature maps at different resolutions, and f_up(·) denotes the upsampling operation.

$$f_{\mathrm{SUFM}}(X_{1/16}, X_{1/32}, X_{1/64}) = f_{up}\left(f_{\mathrm{WFB}}\left(X_{1/16},\ f_{\mathrm{WFB}}(X_{1/32}, X_{1/64})\right)\right) \tag{5}$$
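A minimal PyTorch sketch of WFB and SUFM follows. For brevity, bilinear interpolation stands in for the CARAFE upsampler [34] and a simple global-pooling gate stands in for the attention mechanism f_am (coordinate attention in our experiments); the extra 1 × 1 projection on the low-resolution input, the channel widths, and the final upsampling factor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WFB(nn.Module):
    """Weighted fusion block (Eq. (4)). Bilinear interpolation replaces CARAFE and a
    global-pooling gate replaces f_am in this sketch."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, out_ch, 1, bias=False)   # 1x1 conv on X_h
        self.reduce_low = nn.Conv2d(low_ch, out_ch, 1, bias=False)     # align X_l channels
        self.attn = nn.Sequential(                                     # placeholder for f_am
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x_high, x_low):
        x_h = self.reduce_high(x_high)
        x_l = F.interpolate(self.reduce_low(x_low), size=x_high.shape[-2:],
                            mode="bilinear", align_corners=False)      # upsample X_l
        w = self.attn(x_h + x_l)                                       # fusion weight in [0, 1]
        return w * x_h + (1.0 - w) * x_l

class SUFM(nn.Module):
    """Stepwise upsampling and fusion (Eq. (5)): fuse 1/64 into 1/32, fuse the result
    into 1/16, then upsample towards the detail-branch resolution."""
    def __init__(self, ch16, ch32, ch64, out_ch):
        super().__init__()
        self.wfb1 = WFB(ch32, ch64, out_ch)
        self.wfb2 = WFB(ch16, out_ch, out_ch)

    def forward(self, x16, x32, x64):
        y = self.wfb1(x32, x64)                  # 1/64 fused into 1/32
        y = self.wfb2(x16, y)                    # result fused into 1/16
        # final upsampling; a factor of 2 assumes the detail branch runs at 1/8 resolution
        return F.interpolate(y, scale_factor=2, mode="bilinear", align_corners=False)
```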

3.4. Pyramid Mixed Pooling Module

PPM employs pooling operations with square kernels of various sizes to extract contextual information at different scales. It and its variants have been integrated into several semantic segmentation networks, including BiSeNet, DDRNet, and PIDNet, to enhance feature extraction capabilities. However, segmentation scenarios often involve elongated objects, for example, a traffic signpost or a lamppost in an urban street scene. These objects differ significantly in shape from the square kernels used in PPM. As a result, these pooling operations may not effectively cover all regions of elongated objects, limiting the ability of PPM to extract contextual information from them.
To address this limitation, we propose a mixed pooling module (MPM) that combines pooling operations with rectangular and square kernels to improve coverage of elongated objects. Additionally, we introduce PMPM, which replaces the average pooling operations in PAPPM, the parallel aggregation pyramid pooling module of PIDNet, with MPM. Figure 4 illustrates the structure of MPM.
MPM consists of one average pooling branch and two stripe pooling branches. These stripe pooling branches are specifically designed to extract contextual features of elongated objects in the horizontal and vertical directions, respectively. Each stripe pooling branch includes a pooling operation with a k × 1 or 1 × k kernel, followed by an upsampling operation to match the resolution of the average pooling branch. In MPM, input feature maps are first divided into three parts along the channel dimension, with each part processed by one of the three pooling branches. Finally, the outputs from all branches are concatenated and passed through a 1 × 1 convolution to facilitate channel interactions and produce the final outputs of MPM.
Let X_m represent the input to MPM, which can be expressed as (X_m^0, X_m^1, …, X_m^{c−1}), where c denotes the total number of channels in X_m. The output of MPM is denoted as f_MPM(X_m) and calculated according to Equation (6). In this equation, X_m^0, X_m^1, and X_m^2 represent the three segments derived from X_m. The notation P_{k×k}(·) indicates the average pooling operation, while P_{k×1}(·) and P_{1×k}(·) refer to the (k × 1) and (1 × k) stripe pooling operations, respectively.

$$f_{\mathrm{MPM}}(X_m) = \mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}\left(P_{k\times k}(X_m^{0}),\ P_{k\times 1}(X_m^{1}),\ P_{1\times k}(X_m^{2})\right)\right) \tag{6}$$
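A PyTorch sketch of MPM in the spirit of Equation (6), assuming the channels split evenly into three groups, a single shared kernel size k, and that every branch is upsampled back to the input resolution before the 1 × 1 convolution; the kernel sizes actually used at each pyramid level of PMPM are not fixed by this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPM(nn.Module):
    """Mixed pooling module (Eq. (6)): split channels into three groups, apply square
    average pooling and horizontal/vertical stripe pooling, then concatenate and mix
    the branches with a 1x1 convolution."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.k = k                                 # shared pooling kernel size (assumed)
        self.mix = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        x0, x1, x2 = torch.chunk(x, 3, dim=1)      # split along the channel dimension
        # square (k x k) average pooling for the first group
        p0 = F.avg_pool2d(x0, kernel_size=self.k, stride=self.k, ceil_mode=True)
        # stripe pooling (k x 1 and 1 x k) for vertically / horizontally elongated objects
        p1 = F.avg_pool2d(x1, kernel_size=(self.k, 1), stride=(self.k, 1), ceil_mode=True)
        p2 = F.avg_pool2d(x2, kernel_size=(1, self.k), stride=(1, self.k), ceil_mode=True)
        # upsample every branch back to the input resolution before concatenation
        feats = [F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False)
                 for p in (p0, p1, p2)]
        return self.mix(torch.cat(feats, dim=1))
```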

4. Experiments

In this section, we conduct experiments on two public benchmark datasets to evaluate our improvements. First, we present the details of the datasets, training settings, and evaluation metrics in Section 4.1. Second, we analyze the ablation experiments in Section 4.2. Finally, we discuss the comparative experiments in Section 4.3.

4.1. Experimental Settings

Dataset. We exploited two urban street scene datasets to evaluate our improvements because these datasets include objects of various sizes, shapes, and categories, especially some elongated objects. These two datasets are Cityscapes [35] and CamVid [36].
  • Cityscapes is a high-resolution dataset focused on urban street scenes. It contains 30 categories, but only 19 of them are used for semantic segmentation. Cityscapes includes 5000 finely annotated images and 20,000 coarsely annotated images. In our experiments, we only utilized the finely annotated images. Images in Cityscapes were collected from 50 different cities in Germany with an automotive-grade stereo camera system mounted on vehicles at resolutions of 2048 × 1024 pixels.
  • CamVid is a dataset designed for scene understanding and semantic segmentation tasks. It contains 32 categories, of which 11 are used for semantic segmentation. CamVid consists of 701 images captured with a Panasonic HVX200 digital camera (Panasonic, Osaka, Japan) at a resolution of 960 × 720 pixels.
Training. All training was carried out on the same server, equipped with five NVIDIA GeForce RTX 4090 GPUs (NVIDIA, Santa Clara, CA, USA) and 504 GB of Kingston memory (Kingston, Fountain Valley, CA, USA). Our improvements and all comparative solutions were implemented in Python 3.8.19 on the PyTorch 2.5.0 framework. During training, multiple data augmentations were employed, including random cropping, scaling, and horizontal flipping. For the Cityscapes dataset, 2975 images were used for training, 500 for validation, and 1525 for testing. For the CamVid dataset, 367 images were designated for training, 101 for validation, and 233 for testing.
We used the stochastic gradient descent optimizer with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The initial learning rate was set to 0.01 for Cityscapes and 0.001 for CamVid, and it followed a polynomial decay (poly) strategy on both datasets. The batch size was set to 6 for both Cityscapes and CamVid, with training completed after 500 epochs and 200 epochs, respectively. In addition, we used coordinate attention to implement f_am(·) in Equation (4).
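A sketch of this training configuration in PyTorch, assuming a decay power of 0.9 for the poly schedule (the exact power is not stated above) and placeholder names for the network, dataloader, and loss:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, max_iters, base_lr=0.01, power=0.9):
    """SGD with momentum 0.9 and weight decay 5e-4, plus a poly learning-rate decay:
    lr = base_lr * (1 - iter / max_iters) ** power."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = LambdaLR(optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** power)
    return optimizer, scheduler

# Usage sketch (net, loader, and criterion are placeholders); step the scheduler per iteration:
# optimizer, scheduler = build_optimizer_and_scheduler(net, max_iters=len(loader) * 500)
# for images, labels in loader:
#     loss = criterion(net(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```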
Evaluation metrics. We used the mean intersection over union (mIoU) and frames per second (FPS) to evaluate the accuracy and inference speed of all solutions, respectively. mIoU was calculated according to Equation (7), where TP_i, FP_i, and FN_i denote the true positives, false positives, and false negatives for class i, and k + 1 is the number of classes. Additionally, we used the number of parameters (Params) to evaluate model complexity.

$$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FP_i + FN_i} \tag{7}$$
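For completeness, a minimal sketch of how the mIoU in Equation (7) can be computed from a confusion matrix accumulated over the validation set:

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Compute mIoU from a (k+1) x (k+1) confusion matrix whose entry [i, j] counts
    pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf_matrix)                # true positives per class
    fp = conf_matrix.sum(axis=0) - tp        # false positives per class
    fn = conf_matrix.sum(axis=1) - tp        # false negatives per class
    iou = tp / np.maximum(tp + fp + fn, 1)   # guard against classes absent from the data
    return float(iou.mean())
```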

4.2. Ablation Experiments

To analyze the effectiveness of the improvements in this work, we conducted a series of ablation experiments.
We designed five different models to evaluate the effectiveness of each improvement and recorded the corresponding experiment results, as shown in Table 1. All results were obtained on the Cityscapes dataset. In the table, Model0 denotes the original version of PIDNet-S, while Model1 and Model2 describe the variants with RF-FEM and PMPM, respectively. According to the results in the table, Model1 and Model2 achieved 0.5% and 0.8% improvements, respectively, in mIoU metrics compared to Model0, which indicates the effectiveness of RF-FEM and PMPM.
Model3 is the variant of PIDNet-S that incorporates both RF-FEM and PMPM. Compared to Model0, Model1, and Model2, Model3 achieved improvements of 1.1%, 0.6%, and 0.3% in mIoU, respectively. These results verify the effectiveness of combining RF-FEM and PMPM in enhancing semantic segmentation accuracy. Model4 represents the variant of PIDNet-S that incorporates all the improvements. Model4 achieved the best performance among all five models, improving mIoU by 1.5% compared to the baseline. This highlights the effectiveness of the proposed improvements and their collaborative impact on model performance.

4.3. Performance Comparison

We conducted experiments to evaluate our improvements. In these experiments, we carefully selected several dual- or multi-branch segmentation networks that balance segmentation accuracy and speed, including BiSeNetV2, DDRNet-23-slim, and PIDNet-S. We integrated our improvements into these networks, recording them as BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S*. We evaluated our improvements by comparing BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S* against a variety of semantic segmentation solutions, including BiSeNetV2, BiSeNetV2-L, STDC1-Seg75 [37], STDC2-Seg75, HyperSeg [38], DDRNet, and PIDNet.

4.3.1. Comparison on Cityscapes Dataset

Table 2 shows the results of all the solutions on the Cityscapes dataset. According to the table, BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S* improved mIoU by 2.4%, 0.5%, and 1.5% compared to BiSeNetV2, DDRNet-23-slim, and PIDNet-S, respectively. PIDNet-S* also achieved higher mIoU than STDC1-Seg75, STDC2-Seg75, HyperSeg-M, and HyperSeg-S, exhibiting the highest mIoU among these solutions; it improved mIoU by 5.3%, 2.8%, 3.6%, and 1.6% compared with these four solutions, respectively. These results validate the effectiveness of our improvements in terms of segmentation accuracy.
We utilized the FPS metric to evaluate the impact of our improvements on segmentation speed. We compared the FPS of BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S* with those of other solutions. According to the results in Table 2, BiSeNetV2* exhibited the highest FPS among all the solutions, with the exception of BiSeNetV2. Compared to BiSeNetV2, BiSeNetV2* demonstrated a 2.4% improvement in mIoU while experiencing a drop of 15 FPS. Nonetheless, BiSeNetV2* achieved the same accuracy as BiSeNetV2-L while being over 90 FPS faster in inference speed. Although DDRNet-23-slim* and PIDNet-S* performed at lower FPS than DDRNet-23-slim and PIDNet-S, they were still fast enough to run in real-time scenarios. Additionally, while DDRNet-23-slim* and PIDNet-S* showed lower FPS than STDC1-Seg75 and STDC2-Seg75, they exhibited higher mIoU than both.
We utilized the Params (M) metric to evaluate the computational complexity of each model. We did not compare our solution with HyperSeg-M, HyperSeg-S, STDC1-Seg75, and STDC2-Seg75 because of their disadvantages in accuracy, inference time, or computational complexity. We divided all models, except for the four excluded models, into three groups. The first group included PIDNet-S*, PIDNet-S, and PIDNet-M. The second group included BiSeNetV2, BiSeNetV2*, and BiSeNetV2-L, and the third group included DDRNet-23-slim* and DDRNet-23-slim. Within each group, our models struck a good balance in terms of accuracy, inference speed, and computational complexity.

4.3.2. Comparison on CamVid Dataset

To further validate the effectiveness of our improvements, we conducted experiments on the CamVid dataset. In these experiments, we compared BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S* with BiSeNetV2, BiSeNetV2-L, HyperSeg-S, DDRNet-23, DDRNet-23-slim, and PIDNet-S. The corresponding results are recorded in Table 3. According to the table, BiSeNetV2*, DDRNet-23-slim*, and PIDNet-S* all achieved improvements in mIoU compared to BiSeNetV2, DDRNet-23-slim, and PIDNet-S, respectively. Notably, PIDNet-S* exhibited the highest mIoU among all solutions, improving mIoU by 2.7% despite experiencing a drop of 3 FPS compared to PIDNet-S. These results further verify the effectiveness of our enhancements.

4.3.3. Analysis of Visualization Results

Figure 5 presents the segmentation results of PIDNet-S and PIDNet-S* for some images from the Cityscapes dataset. Figure 5a,b display the input images along with their corresponding labeled images, while Figure 5c,d illustrate the segmentation outcomes of PIDNet-S and PIDNet-S*, respectively. Red boxes have been employed to emphasize the differences in the segmentation outcomes.
According to the results for image 1, PIDNet-S* demonstrated a more complete and accurate segmentation for sidewalks. In contrast, PIDNet-S misclassified the sidewalks as other categories, resulting in blurred boundaries and incomplete structures. In image 2, PIDNet-S* achieved higher accuracy in segmenting traffic signs, whereas PIDNet-S produced coarser results with noticeable boundary displacements and omissions. From images 2 to 4, PIDNet-S* showed significant advantages in segmenting thin and elongated objects, such as traffic lights, poles, and pedestrians. Conversely, PIDNet-S exhibited noticeable omissions and incomplete structures when segmenting these objects.
We calculated the IoU for each class in the Cityscapes dataset and recorded the results in Table 4. Cityscapes comprises 19 categories; among these, PIDNet-S* outperformed PIDNet-S in 16 classes. Notably, Wall, Fence, Pole, Terrain, Rider, and Motorcycle present significant challenges for segmentation. In these classes, PIDNet-S* achieved marked accuracy improvements relative to PIDNet-S: IoU increased by 9.5% for Wall, 3% for Pole, and 1.8% for both Rider and Motorcycle, highlighting the effectiveness of our improvements in challenging segmentation scenarios.
In summary, compared to PIDNet-S, PIDNet-S* demonstrated higher robustness and accuracy in segmentation tasks involving thin and elongated objects and complex scenes.

5. Conclusions

In this work, we focused on optimizing the detail and semantic branches of dual- and multi-branch semantic segmentation networks. We proposed RF-FEM to expand the receptive fields of feature maps in the detail branch. Additionally, we introduced SUFM to fuse the lowest-resolution feature maps from the semantic branch with those in the detail branch, effectively reducing the noise introduced during fusion. Furthermore, we presented PMPM to enhance the model's ability to perceive objects of varying shapes. We incorporated our improvements into PIDNet-S, BiSeNetV2, and DDRNet-23-slim. Experimental results on Cityscapes demonstrate that our modifications improve mIoU by 1.5%, 2.4%, and 0.5% for the three models, respectively. On the CamVid dataset, they yield mIoU gains of 2.7%, 2.7%, and 3.9%, respectively. In the future, we plan to further address the challenges of semantic segmentation in real-time scenarios.

Author Contributions

Conceptualization, X.Z. and W.Z.; methodology, X.Z., W.Z. and Y.J.; formal analysis, X.Z., W.Z. and Y.J.; investigation, W.Z. and Y.J.; resources, W.Z. and Y.J.; data curation, W.Z.; writing—original draft preparation, X.Z. and W.Z.; writing—review and editing, X.Z., W.Z. and Y.J.; visualization, W.Z.; supervision, X.Z. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62472145).

Data Availability Statement

The datasets used in this article are publicly available. The Cityscapes dataset can be accessed at https://www.cityscapes-dataset.com (accessed on 24 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  2. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  3. Elhassan, M.A.; Zhou, C.; Khan, A.; Benabid, A.; Adam, A.B.; Mehmood, A.; Wambugu, N. Real-time semantic segmentation for autonomous driving: A review of CNNs, Transformers, and Beyond. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102226. [Google Scholar] [CrossRef]
  4. Asgari Taghanaki, S.; Abhishek, K.; Cohen, J.P.; Cohen-Adad, J.; Hamarneh, G. Deep semantic segmentation of natural and medical images: A review. Artif. Intell. Rev. 2021, 54, 137–178. [Google Scholar] [CrossRef]
  5. Delmoral, J.C.; RS Tavares, J.M. Semantic Segmentation of CT Liver Structures: A Systematic Review of Recent Trends and Bibliometric Analysis: Neural Network-based Methods for Liver Semantic Segmentation. J. Med. Syst. 2024, 48, 97. [Google Scholar] [CrossRef] [PubMed]
  6. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  7. Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and datasets on semantic segmentation for Unmanned Aerial Vehicle remote sensing images: A review. ISPRS J. Photogramm. Remote Sens. 2024, 211, 1–34. [Google Scholar] [CrossRef]
  8. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  9. Liang-Chieh, C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  10. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  11. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3448–3460. [Google Scholar] [CrossRef]
  12. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19529–19539. [Google Scholar]
  13. Liu, X.; Deng, Z.; Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev. 2019, 52, 1089–1106. [Google Scholar] [CrossRef]
  14. Csurka, G.; Volpi, R.; Chidlovskii, B. Semantic image segmentation: Two decades of research. Found. Trends Comput. Graph. Vis. 2022, 14, 1–162. [Google Scholar] [CrossRef]
  15. Thisanke, H.; Deshan, C.; Chamith, K.; Seneviratne, S.; Vidanaarachchi, R.; Herath, D. Semantic segmentation using Vision Transformers: A survey. Eng. Appl. Artif. Intell. 2023, 126, 106669. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  18. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  20. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  21. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems, Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  22. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  23. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  24. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  25. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems, Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; Curran Associates Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  26. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  27. Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-scnn: Fast semantic segmentation network. arXiv 2019, arXiv:1902.04502. [Google Scholar]
  28. Takos, G. A survey on deep learning methods for semantic image segmentation in real-time. arXiv 2020, arXiv:2009.12942. [Google Scholar]
  29. Gamal, M.; Siam, M.; Abdel-Razek, M. Shuffleseg: Real-time semantic segmentation network. arXiv 2018, arXiv:1803.03816. [Google Scholar]
  30. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  31. Orsic, M.; Kreso, I.; Bevandic, P.; Segvic, S. In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12599–12608. [Google Scholar]
  32. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 552–568. [Google Scholar]
  33. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  34. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  35. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  36. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97. [Google Scholar] [CrossRef]
  37. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  38. Nirkin, Y.; Wolf, L.; Hassner, T. Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4061–4070. [Google Scholar]
Figure 1. PIDNet with our improvements.
Figure 2. The structure of RF-FEM.
Figure 3. The structure of SUFM (top) and WFB (bottom).
Figure 4. The structure of MPM.
Figure 5. Visual segmentation results of the Cityscapes dataset.
Table 1. Effectiveness study of all the improvements.

Model | RF-FEM | PMPM | SUFM | mIoU (%) | FPS | Params (M) | GFLOPs
Model0 | – | – | – | 78.3 | 125 | 7.62 | 47.6
Model1 | ✓ | – | – | 78.8 | 120 | 7.92 | 57.2
Model2 | – | ✓ | – | 79.1 | 103 | 8.41 | 47.9
Model3 | ✓ | ✓ | – | 79.4 | 98 | 8.71 | 57.7
Model4 | ✓ | ✓ | ✓ | 79.8 | 94 | 9.92 | 69.3
Table 2. Results on the Cityscapes dataset.

Methods | mIoU (%) | FPS | Resolution (Pixels) | Params (M) | GFLOPs
HyperSeg-M | 76.2 | 36.9 | 1024 × 512 | 10.1 | 7.5
HyperSeg-S | 78.2 | 16.1 | 1536 × 768 | 10.2 | 17.0
PIDNet-S* | 79.8 | 94 | 2048 × 1024 | 9.9 | 69.3
PIDNet-S | 78.3 | 125 | 2048 × 1024 | 7.6 | 47.6
PIDNet-M | 80.1 | 39.8 | 2048 × 1024 | 34.4 | 197.4
BiSeNetV2 | 73.4 | 156 | 1024 × 512 | 3.4 | 24.8
BiSeNetV2* | 75.8 | 141 | 1024 × 512 | 4.5 | 30.5
BiSeNetV2-L | 75.8 | 47.3 | 1024 × 512 | – | 118.5
STDC1-Seg75 | 74.5 | 126.7 | 1536 × 768 | 11.4 | 38.1
STDC2-Seg75 | 77.0 | 97 | 1536 × 768 | 15.4 | 53.0
DDRNet-23-slim | 77.3 | 138.3 | 2048 × 1024 | 5.7 | 36.3
DDRNet-23-slim* | 77.8 | 91.6 | 2048 × 1024 | 8.6 | 53.6
Table 3. Results on the CamVid dataset.

Methods | mIoU (%) | FPS | Resolution (Pixels)
PIDNet-S | 80.1 | 153 | 720 × 960
PIDNet-S* | 82.8 | 150 | 720 × 960
BiSeNetV2 | 72.4 | 124 | 720 × 960
BiSeNetV2* | 75.1 | 102 | 720 × 960
BiSeNetV2-L | 73.2 | 33 | 720 × 960
HyperSeg-S | 78.4 | 38 | 720 × 960
DDRNet-23 | 76.3 | 94 | 720 × 960
DDRNet-23-slim | 74.7 | 230 | 720 × 960
DDRNet-23-slim* | 78.6 | 97 | 720 × 960
Table 4. Class IoU on the Cityscapes dataset.

Class | PIDNet-S | PIDNet-S* | Class | PIDNet-S | PIDNet-S*
Road | 98.3 | 98.3 | Sky | 94.7 | 94.9
Sidewalk | 85.9 | 85.9 | Person | 82.7 | 83.6
Building | 92.5 | 93.1 | Rider | 64.4 | 66.2
Wall | 49.7 | 59.2 | Car | 95.5 | 96.4
Fence | 62.1 | 62.5 | Truck | 83.3 | 81.8
Pole | 66.1 | 69.1 | Bus | 87.9 | 89.5
Traffic light | 72.6 | 73.9 | Train | 75.8 | 79.2
Sign | 79.1 | 79.8 | Motorcycle | 63.5 | 65.3
Vegetation | 92.5 | 92.7 | Bicycle | 77.3 | 78.2
Terrain | 63.9 | 64.7 | mIoU | 78.3 | 79.8
