Faster SCDNet: Real-Time Semantic Segmentation Network with Split Connection and Flexible Dilated Convolution
Abstract
1. Introduction
- We propose a split connection structure that supplies additional low-level features to the output while introducing parallelism to improve inference speed. We also propose a three-level hierarchical module to fuse features at three resolutions;
- We propose a flexible dilated convolution that adjusts the receptive field of the network and diversifies the receptive field sizes available to the output;
- We refine the flexible and lightweight decoder to improve computation speed and segmentation accuracy by fusing multi-scale information within a lightweight structure;
- We verify the effectiveness of the method on the Cityscapes and CamVid datasets. Specifically, we achieve a 36% improvement in FPS and a 0.7-point improvement in mIoU on the Cityscapes test set.
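
To make the dilated-convolution idea in the contributions above concrete, here is a minimal 1-D sketch in plain Python. It is not the paper's flexible dilated convolution; it only illustrates how a dilation rate widens the input span covered by a fixed number of weights. The function name `dilated_conv1d` and the toy signal are assumptions made for illustration.

```python
def dilated_conv1d(signal, weights, dilation=1):
    """Valid-mode 1-D cross-correlation whose taps are spaced `dilation` apart.

    Returns the output sequence and the input span covered by one output value.
    """
    span = dilation * (len(weights) - 1) + 1  # input samples seen by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(weights)))
    return out, span


signal = list(range(10))  # toy input: 0, 1, ..., 9
weights = [1, 1, 1]       # three taps, like one row of a 3x3 kernel

dense, dense_span = dilated_conv1d(signal, weights, dilation=1)
wide, wide_span = dilated_conv1d(signal, weights, dilation=2)

print(dense_span, wide_span)  # span grows from 3 to 5 with the same 3 weights
print(wide)
```

The number of weights is unchanged between the two calls; only the spacing of the taps differs, which is why dilation can enlarge the receptive field without adding parameters.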
2. Related Work
2.1. Semantic Segmentation
2.2. Real-Time Semantic Segmentation
3. Methods
3.1. Network Overview
3.2. Split Connection
3.3. Three-Level Hierarchical Module
3.4. Flexible Dilated Convolution
3.5. Refined Flexible and Lightweight Decoder
3.5.1. Simple Pyramid Pooling Module
3.5.2. Unified Attention Fusion Module
4. Results
4.1. Datasets and Metrics
4.2. Training Settings
4.3. Ablation Study
4.3.1. Effectiveness of Flexible Dilated Convolution
4.3.2. Effectiveness of Other Modules
4.4. Results
4.4.1. Comparison on Cityscapes
4.4.2. Comparison on CamVid
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Chao, P.; Kao, C.Y.; Ruan, Y.S.; Huang, C.H.; Lin, Y.L. Hardnet: A low memory traffic network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3552–3561.
2. Tao, A.; Sapra, K.; Catanzaro, B. Hierarchical multi-scale attention for semantic segmentation. arXiv 2020, arXiv:2005.10821.
3. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
4. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
6. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
7. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068.
8. Chen, W.; Gong, X.; Liu, X.; Zhang, Q.; Li, Y.; Wang, Z. Fasterseg: Searching for faster real-time semantic segmentation. arXiv 2019, arXiv:1912.10917.
9. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147.
10. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
12. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725.
13. Zhu, L.; Deng, R.; Maire, M.; Deng, Z.; Mori, G.; Tan, P. Sparsely aggregated convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 186–201.
14. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869.
15. Peng, J.; Liu, Y.; Tang, S.; Hao, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Yu, Z.; Du, Y.; et al. PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model. arXiv 2022, arXiv:2204.02681.
16. Lai, X.; Tian, Z.; Xu, X.; Chen, Y.C.; Liu, S.; Zhao, H.; Wang, L.; Jia, J. DecoupleNet: Decoupled Network for Domain Adaptive Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 369–387.
17. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
19. Liu, C.; Chen, L.C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A.L.; Fei-Fei, L. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 82–92.
20. Gao, R. Rethink dilated convolution for real-time semantic segmentation. arXiv 2021, arXiv:2111.09957.
21. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085.
22. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
23. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25.
25. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97.
26. Zhang, Y.; Qiu, Z.; Liu, J.; Yao, T.; Liu, D.; Mei, T. Customizable architecture search for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11641–11650.
27. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
28. Li, X.; You, A.; Zhu, Z.; Zhao, H.; Yang, M.; Yang, K.; Tan, S.; Tong, Y. Semantic flow for fast and accurate scene parsing. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 775–793.
29. Xiong, J.; Po, L.M.; Yu, W.Y.; Zhou, C.; Xian, P.; Ou, W. CSRNet: Cascaded Selective Resolution Network for real-time semantic segmentation. Expert Syst. Appl. 2023, 211, 118537.
30. Lu, Z.; Cheng, R.; Huang, S.; Zhang, H.; Qiu, C.; Yang, F. Surrogate-assisted Multiobjective Neural Architecture Search for Real-time Semantic Segmentation. IEEE Trans. Artif. Intell. 2022. Early Access.
Network (RF) | mIoU | Road | Swalk | Build. | Wall | Fence | Pole | Tlight | Sign | Veg | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motor | Bike |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline (1007/4096) | 76.9 | 97.75 | 82.90 | 92.32 | 58.41 | 62.32 | 60.78 | 70.20 | 77.29 | 91.90 | 61.16 | 94.37 | 79.99 | 60.68 | 94.26 | 75.33 | 84.93 | 80.17 | 60.02 | 75.31 |
Ours (3503/4096) | 77.3 | 97.83 | 82.90 | 92.28 | 56.96 | 61.53 | 62.26 | 70.95 | 77.26 | 91.67 | 59.64 | 94.38 | 79.75 | 61.47 | 94.66 | 79.42 | 87.20 | 79.83 | 62.03 | 76.17 |
Ours (3887/4096) | 77.5 | 97.87 | 83.06 | 92.31 | 58.54 | 61.74 | 62.32 | 69.99 | 77.09 | 91.76 | 61.81 | 94.57 | 79.85 | 60.51 | 94.72 | 81.24 | 87.43 | 79.06 | 62.02 | 76.26 |
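
The mIoU column in the table above should be the unweighted mean of the 19 per-class IoU values, which is the usual Cityscapes convention. The short check below recomputes that mean from the rounded table entries. The dictionary layout and the helper name `mean_iou` are assumptions for illustration, and small gaps are expected because the per-class values are already rounded to two decimals.

```python
from statistics import mean

# (reported mIoU, per-class IoU values) copied from the table rows above.
rows = {
    "Baseline (1007/4096)": (76.9, [
        97.75, 82.90, 92.32, 58.41, 62.32, 60.78, 70.20, 77.29, 91.90, 61.16,
        94.37, 79.99, 60.68, 94.26, 75.33, 84.93, 80.17, 60.02, 75.31]),
    "Ours (3503/4096)": (77.3, [
        97.83, 82.90, 92.28, 56.96, 61.53, 62.26, 70.95, 77.26, 91.67, 59.64,
        94.38, 79.75, 61.47, 94.66, 79.42, 87.20, 79.83, 62.03, 76.17]),
    "Ours (3887/4096)": (77.5, [
        97.87, 83.06, 92.31, 58.54, 61.74, 62.32, 69.99, 77.09, 91.76, 61.81,
        94.57, 79.85, 60.51, 94.72, 81.24, 87.43, 79.06, 62.02, 76.26]),
}


def mean_iou(per_class):
    """Unweighted mean over classes."""
    return mean(per_class)


recomputed = {name: mean_iou(values) for name, (_, values) in rows.items()}
gaps = {name: abs(recomputed[name] - rows[name][0]) for name in rows}

for name in rows:
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {rows[name][0]}")
```

Each recomputed mean agrees with the reported mIoU to within 0.1 point.
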
Backbone | Dilation Rate Combination | RF | mIoU | FPS | GFLOPs |
---|---|---|---|---|---|
STDC1446 | - | 1199/4096 | 77.0 | 50.2 | 94.3 |
Ours | (1,1,1)(1,1,1)(1,1,1) | 1007/4096 | 76.9 | 60.0 | 98.2 |
Ours | (2,4,10)(2,4,10)(2,4,10) | 3503/4096 | 77.3 | 60.0 | 98.2 |
Ours | (2,2,2)(2,4,4)(10,14,14) | 3887/4096 | 77.5 | 60.0 | 98.2 |
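
As a rough illustration of why the dilation-rate combinations in the table above change the receptive field, the sketch below computes the theoretical receptive field of nine 3×3 convolutions applied in sequence at stride 1. This is a simplification and an assumption: it ignores strides, downsampling, and every other layer of the real backbone, so it is not expected to reproduce the RF column. The helper names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Input span covered by one dilated kernel along a single axis."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel_size, d) - 1
    return rf


# Dilation rates flattened from the three combinations listed in the table.
combos = {
    "(1,1,1)(1,1,1)(1,1,1)": [1, 1, 1, 1, 1, 1, 1, 1, 1],
    "(2,4,10)(2,4,10)(2,4,10)": [2, 4, 10, 2, 4, 10, 2, 4, 10],
    "(2,2,2)(2,4,4)(10,14,14)": [2, 2, 2, 2, 4, 4, 10, 14, 14],
}

rf = {name: stacked_receptive_field(d) for name, d in combos.items()}
for name, value in rf.items():
    print(name, value)
```

Under these simplifying assumptions the three combinations rank in the same order as the RF column of the table, although the absolute values are not comparable.
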
Network (w/o TensorRT) | Modules enabled (of FDC, Split Connection, THM, RFLD) | mIoU | FPS | GFLOPs |
---|---|---|---|---|
STDC(424) | - | 76.3 | 43.1 | 85.5 |
Ours(224) | √ | 76.8 | 44.7 | 76.6 |
Ours(224) | √ √ | 77.3 | 38.1 | 95.1 |
Ours(224) | √ √ √ | 77.5 | 35.4 | 98.2 |
Ours(224) | √ √ | 76.8 | 60.5 | 59.6 |
Ours(224) | √ √ √ | 77.4 | 50.1 | 78.1 |
Ours(224) | √ √ √ √ | 77.6 | 50.1 | 81.2 |
STDC1446(453) | - | 77.0 | 37.4 | 94.3 |
Network | Val mIoU | Test mIoU | FPS | Resolution | Params(M) | GFLOPs | GPU |
---|---|---|---|---|---|---|---|
CAS [26] | 71.6 | 70.5 | 108 | 768 × 1536 | - | - | - |
FasterSeg * [8] | 73.1 | 71.5 | 163.9 | 1024 × 2048 | 4.4 | 28.2 | GTX 1080Ti |
MobileNetV3 [27] | 72.4 | 72.6 | - | 1024 × 2048 | 1.51 | 9.74 | GTX 1080Ti |
BiSeNet * [6] | 69.0 | 68.4 | 105.8 | 768 × 1536 | 5.8 | 14.8 | GTX 1080Ti |
BiSeNetV2-L * [7] | 75.8 | 75.3 | 47.3 | 512 × 1024 | 47.3 | 118.5 | GTX 1080Ti |
SFNet(DF1) [28] | - | 74.5 | 74 | 1024 × 2048 | 9.03 | - | GTX 1080Ti |
SFNet(DF2) [28] | - | 77.8 | 53 | 1024 × 2048 | 10.53 | - | GTX 1080Ti |
CSRNet-heavy [29] | 77.3 | 76.0 | 36.3 | 1024 × 2048 | - | - | GTX 1080Ti |
MoSegNet-large [30] | 78.2 | - | 50.1 | 1024 × 2048 | - | 42 | Titan RTX |
FC-HarDNet-70 * [1] | 77.7 | 76.0 | 53 | 1024 × 2048 | 4.12 | 35.6 | Titan V |
STDC2-Seg100 * [12] | 77.0 | 76.9 | 50.2 | 1024 × 2048 | 16.1 | 94.3 | RTX 2080Ti |
Ours * | 77.7 | 77.6 | 68.1 | 1024 × 2048 | 17.8 | 81.2 | RTX 2080Ti |
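
The headline figures of a 36% FPS gain and a 0.7-point mIoU gain can be checked against the comparison table above. The sketch assumes the reference point is the STDC2-Seg100 row, which appears to be the only row consistent with both figures and shares the RTX 2080Ti GPU. The helper names are hypothetical, and FPS values measured on different GPUs in other rows are not directly comparable.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def latency_ms(fps):
    """Per-frame latency implied by a throughput in frames per second."""
    return 1000.0 / fps


# Values copied from the Cityscapes comparison table (1024 x 2048 input).
stdc2 = {"test_miou": 76.9, "fps": 50.2}
ours = {"test_miou": 77.6, "fps": 68.1}

fps_gain = relative_gain(ours["fps"], stdc2["fps"])
miou_gain = ours["test_miou"] - stdc2["test_miou"]

print(f"FPS gain: {fps_gain:.1f}%")
print(f"mIoU gain: {miou_gain:.1f} points")
print(f"Latency: {latency_ms(stdc2['fps']):.1f} ms -> {latency_ms(ours['fps']):.1f} ms")
```

The mIoU difference is an absolute gain in percentage points, whereas the FPS figure is a relative gain.
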
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tian, S.; Yao, G.; Chen, S. Faster SCDNet: Real-Time Semantic Segmentation Network with Split Connection and Flexible Dilated Convolution. Sensors 2023, 23, 3112. https://doi.org/10.3390/s23063112