RSM-Optimizer: Branch Optimization for Dual- or Multi-Branch Semantic Segmentation Networks
Abstract
1. Introduction
- We propose a receptive field fusion block (RFFB) to extract feature maps with diverse receptive fields. Building upon RFFB, we develop a receptive field-driven feature enhancement module (RF-FEM) to enlarge the receptive fields of these feature maps.
- We design a stepwise upsampling and fusion module (SUFM) that systematically upsamples and fuses the lowest-resolution feature maps from the semantic branch with those from the detail branch.
- We introduce a mixed pooling module (MPM) that leverages pooling operations with both regular and square kernels. Based on MPM, we develop a pyramid mixed pooling module (PMPM) to better capture objects of varying shapes.
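
The stepwise idea behind SUFM in the list above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-channel feature maps stored as nested lists, nearest-neighbour 2× upsampling, and element-wise addition as the fusion rule, and the function names `upsample2x`, `fuse`, and `stepwise_upsample_and_fuse` are hypothetical. The point is only the control flow: upsample the coarsest map one step, fuse it with the next finer map, and repeat.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map as rows of floats


def upsample2x(fmap: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: every value becomes a 2x2 block."""
    out: Grid = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def fuse(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two equally sized maps (assumed fusion rule)."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "shape mismatch"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stepwise_upsample_and_fuse(lowest: Grid, finer_maps: List[Grid]) -> Grid:
    """Upsample by 2x, fuse with the next finer map, and repeat.

    `finer_maps` is ordered coarse to fine; each map is assumed to be
    exactly twice the height and width of the previous stage.
    """
    current = lowest
    for finer in finer_maps:
        current = fuse(upsample2x(current), finer)
    return current


# Toy inputs: a 1x1 "semantic" map and two finer "detail" maps.
semantic_lowest = [[1.0]]
mid_map = [[0.0, 1.0], [2.0, 3.0]]
detail_map = [[0.5] * 4 for _ in range(4)]

fused = stepwise_upsample_and_fuse(semantic_lowest, [mid_map, detail_map])
for row in fused:
    print(row)
```

Compared with a single large interpolation from the coarsest map to the finest resolution, fusing at every intermediate scale lets each step correct the upsampled features with finer information before the next enlargement.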
2. Related Work
2.1. High-Accuracy Semantic Segmentation
2.2. Real-Time Semantic Segmentation
2.3. Pyramid Pooling Module
3. Proposed Approach
3.1. Architecture of the Improved PIDNet
3.2. Receptive Field-Driven Feature Enhancement Module
3.3. Stepwise Upsampling and Fusing Module
3.4. Pyramid Mixed Pooling Module
4. Experiments
4.1. Experimental Settings
- Cityscapes is a high-resolution dataset focused on urban street scenes. It contains 30 categories, of which 19 are used for semantic segmentation. Cityscapes includes 5000 finely annotated images and 20,000 coarsely annotated images; in our experiments, we used only the finely annotated images. The images were collected in 50 cities, primarily in Germany, with an automotive-grade stereo camera system mounted on vehicles, at a resolution of 2048 × 1024 pixels.
- CamVid is a dataset designed for scene understanding and semantic segmentation tasks. It contains 32 categories, of which 11 are used for semantic segmentation. CamVid consists of 701 images captured with a Panasonic HVX200 digital camera (Panasonic, Osaka, Japan) at a resolution of 960 × 720 pixels.
4.2. Ablation Experiments
4.3. Performance Comparison
4.3.1. Comparison on Cityscapes Dataset
4.3.2. Comparison on CamVid Dataset
4.3.3. Analysis of Visualization Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Model | RF-FEM | PMPM | SUFM | mIoU (%) | FPS | Params (M) | GFLOPs |
---|---|---|---|---|---|---|---|
Model0 | | | | 78.3 | 125 | 7.62 | 47.6 |
Model1 | ✓ | | | 78.8 | 120 | 7.92 | 57.2 |
Model2 | | ✓ | | 79.1 | 103 | 8.41 | 47.9 |
Model3 | ✓ | ✓ | | 79.4 | 98 | 8.71 | 57.7 |
Model4 | ✓ | ✓ | ✓ | 79.8 | 94 | 9.92 | 69.3 |
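
To read the ablation table as a cost-benefit trade-off, it can help to express every variant as a difference from Model0. The sketch below only re-uses the numbers printed in the table; the dictionary layout and the helper name `delta_vs_baseline` are illustrative choices, not part of the paper.

```python
# Rows copied from the ablation table: (mIoU %, FPS, Params in M, GFLOPs).
ablation = {
    "Model0": (78.3, 125, 7.62, 47.6),
    "Model1": (78.8, 120, 7.92, 57.2),
    "Model2": (79.1, 103, 8.41, 47.9),
    "Model3": (79.4, 98, 8.71, 57.7),
    "Model4": (79.8, 94, 9.92, 69.3),
}


def delta_vs_baseline(name, baseline="Model0"):
    """Return (d_mIoU, d_FPS, d_Params, d_GFLOPs) of `name` minus the baseline."""
    base = ablation[baseline]
    row = ablation[name]
    # Round to two decimals to hide floating-point noise in the differences.
    return tuple(round(r - b, 2) for r, b in zip(row, base))


for name in ablation:
    if name == "Model0":
        continue
    d_miou, d_fps, d_params, d_gflops = delta_vs_baseline(name)
    print(
        f"{name}: mIoU {d_miou:+.1f}, FPS {d_fps:+.0f}, "
        f"Params {d_params:+.2f} M, GFLOPs {d_gflops:+.1f}"
    )
```

For the full configuration (Model4), the differences come to +1.5 mIoU for 31 fewer frames per second, 2.30 M more parameters, and 21.7 more GFLOPs.
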
Methods | mIoU (%) | FPS | Resolution (pixels) | Params (M) | GFLOPs |
---|---|---|---|---|---|
HyperSeg-M | 76.2 | 36.9 | | 10.1 | 7.5 |
HyperSeg-S | 78.2 | 16.1 | | 10.2 | 17.0 |
PIDNet-S* | 79.8 | 94 | | 9.9 | 69.3 |
PIDNet-S | 78.3 | 125 | | 7.6 | 47.6 |
PIDNet-M | 80.1 | 39.8 | | 34.4 | 197.4 |
BiSeNetV2 | 73.4 | 156 | | 3.4 | 24.8 |
BiSeNetV2* | 75.8 | 141 | | 4.5 | 30.5 |
BiSeNetV2-L | 75.8 | 47.3 | | - | 118.5 |
STDC1-Seg75 | 74.5 | 126.7 | | 11.4 | 38.1 |
STDC2-Seg75 | 77.0 | 97 | | 15.4 | 53.0 |
DDRNet-23-slim | 77.3 | 138.3 | | 5.7 | 36.3 |
DDRNet-23-slim* | 77.8 | 91.6 | | 8.6 | 53.6 |
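
FPS figures are easier to compare once converted to per-frame latency (1000 / FPS, in milliseconds). The sketch below pairs each baseline in the Cityscapes comparison with its starred counterpart and reports the accuracy gain next to the added latency. It assumes that the asterisk marks the variant equipped with the proposed modules, which the matching PIDNet-S* and Model4 figures suggest but this section does not state; the names `cityscapes`, `latency_ms`, and `compare` are illustrative.

```python
# (mIoU %, FPS) pairs copied from the Cityscapes comparison table.
cityscapes = {
    "PIDNet-S": (78.3, 125.0),
    "PIDNet-S*": (79.8, 94.0),
    "BiSeNetV2": (73.4, 156.0),
    "BiSeNetV2*": (75.8, 141.0),
    "DDRNet-23-slim": (77.3, 138.3),
    "DDRNet-23-slim*": (77.8, 91.6),
}


def latency_ms(fps):
    """Average time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


def compare(base):
    """Return (mIoU gain, added latency in ms) of `base*` relative to `base`."""
    base_miou, base_fps = cityscapes[base]
    star_miou, star_fps = cityscapes[base + "*"]
    gain = round(star_miou - base_miou, 1)
    extra = round(latency_ms(star_fps) - latency_ms(base_fps), 2)
    return gain, extra


for base in ("PIDNet-S", "BiSeNetV2", "DDRNet-23-slim"):
    gain, extra = compare(base)
    print(f"{base}: {gain:+.1f} mIoU for {extra:+.2f} ms per frame")
```

Viewed this way, a drop from 125 to 94 FPS corresponds to roughly 2.6 ms of extra processing time per frame.
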
Methods | mIoU (%) | FPS | Resolution (pixels) |
---|---|---|---|
PIDNet-S | 80.1 | 153 | 720 × 960 |
PIDNet-S* | 82.8 | 150 | 720 × 960 |
BiSeNetV2 | 72.4 | 124 | 720 × 960 |
BiSeNetV2* | 75.1 | 102 | 720 × 960 |
BiSeNetV2-L | 73.2 | 33 | 720 × 960 |
HyperSeg-S | 78.4 | 38 | 720 × 960 |
DDRNet-23 | 76.3 | 94 | 720 × 960 |
DDRNet-23-slim | 74.7 | 230 | 720 × 960 |
DDRNet-23-slim* | 78.6 | 97 | 720 × 960 |
Class | PIDNet-S | PIDNet-S* | Class | PIDNet-S | PIDNet-S* |
---|---|---|---|---|---|
Road | 98.3 | 98.3 | Sky | 94.7 | 94.9 |
Sidewalk | 85.9 | 85.9 | Person | 82.7 | 83.6 |
Building | 92.5 | 93.1 | Rider | 64.4 | 66.2 |
Wall | 49.7 | 59.2 | Car | 95.5 | 96.4 |
Fence | 62.1 | 62.5 | Truck | 83.3 | 81.8 |
Pole | 66.1 | 69.1 | Bus | 87.9 | 89.5 |
Traffic light | 72.6 | 73.9 | Train | 75.8 | 79.2 |
Sign | 79.1 | 79.8 | Motorcycle | 63.5 | 65.3 |
Vegetation | 92.5 | 92.7 | Bicycle | 77.3 | 78.2 |
Terrain | 63.9 | 64.7 | mIoU | 78.3 | 79.8 |
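
The mIoU row in the per-class table is the unweighted mean of the 19 class IoU values. The short check below recomputes it from the printed entries; the list names are illustrative. Because each table entry is rounded to one decimal, the recomputed mean can differ from the reported mIoU in the last digit.

```python
from statistics import mean

# Per-class IoU (%) copied from the table, left block first, then right block.
pidnet_s = [
    98.3, 85.9, 92.5, 49.7, 62.1, 66.1, 72.6, 79.1, 92.5, 63.9,  # Road .. Terrain
    94.7, 82.7, 64.4, 95.5, 83.3, 87.9, 75.8, 63.5, 77.3,        # Sky .. Bicycle
]
pidnet_s_star = [
    98.3, 85.9, 93.1, 59.2, 62.5, 69.1, 73.9, 79.8, 92.7, 64.7,
    94.9, 83.6, 66.2, 96.4, 81.8, 89.5, 79.2, 65.3, 78.2,
]

reported = {"PIDNet-S": 78.3, "PIDNet-S*": 79.8}

for name, ious in (("PIDNet-S", pidnet_s), ("PIDNet-S*", pidnet_s_star)):
    recomputed = mean(ious)  # mIoU = (1 / C) * sum of per-class IoU, C = 19
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported[name]}")
```

For PIDNet-S the recomputed mean matches the reported 78.3. For PIDNet-S* the rounded entries average to about 79.7, one tenth below the reported 79.8; rounding of the 19 entries alone can shift the mean by at most 0.05, so this gap is at the edge of what rounding can explain.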
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, X.; Zong, W.; Jiang, Y. RSM-Optimizer: Branch Optimization for Dual- or Multi-Branch Semantic Segmentation Networks. Electronics 2025, 14, 1109. https://doi.org/10.3390/electronics14061109