Multi-Scale Grid-Based Semantic Surface Point Generation for 3D Object Detection
Abstract
1. Introduction
2. Related Work
2.1. Point Cloud Feature Extraction and 3D Object Detection Methods
2.2. Point Generation Methods
2.3. PG-RCNN
3. Methodology
3.1. System Architecture
3.2. Multi-Scale Grid Attention Module
3.2.1. Multi-Scale Grid Mechanism
3.2.2. Feature Attention Mechanism
- SE-Net performs global average pooling over the entire feature map to extract global statistics and generate channel-wise weights, whereas the proposed method directly produces individual weights for each Transformer feature vector, without requiring global pooling.
- SE-Net applies weights at the channel level for the entire feature map, while the proposed method performs independent weighting for each position’s feature vector, enabling finer-grained feature selection.
- The proposed module uses only a single linear transformation and a Sigmoid function to generate weights, resulting in significantly fewer parameters and lower computational cost than the two-layer fully connected network in SE-Net. This makes it especially suitable for point cloud structures where computational efficiency is critical; a minimal sketch is given after this list.
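The following is a minimal PyTorch-style sketch of this position-wise weighting. It is illustrative only: the module name, the exact weight dimensionality (here one weight per channel at every grid position), and the tensor shapes are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PositionWiseFeatureAttention(nn.Module):
    """Lightweight feature attention: a single linear layer followed by a
    Sigmoid produces weights for every grid-point feature vector
    independently, with no global pooling (unlike SE-Net)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)  # one linear transformation
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_rois, num_grid_points, channels) Transformer feature vectors
        weights = self.gate(self.fc(x))  # independent weights per position
        return x * weights               # re-weighted multi-scale features


# Usage with illustrative sizes: 128 RoIs, 6^3 grid points per RoI, 96-dim features.
feats = torch.randn(128, 6 ** 3, 96)
attn = PositionWiseFeatureAttention(96)
out = attn(feats)  # same shape; each position's vector is rescaled on its own
```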
4. Experiment
4.1. Dataset
4.2. Evaluation Metrics
4.3. Experiment Platform
4.4. Model Performance Comparison
4.5. Ablation Study
5. Conclusions
- Proposed a multi-scale grid mechanism that enables the model to capture local and global geometric structure simultaneously, significantly improving detection performance for objects of different sizes. Two grids of different resolutions extract spatial features at different scales through parallel pathways and are subsequently fused; an illustrative sketch of this two-resolution sampling is given after this list. This design preserves fine-grained detail while enhancing the model's understanding of large-scale objects, overcoming the limitations of traditional single-resolution grids.
- Introduced a lightweight feature attention module that dynamically adjusts the importance of multi-scale features, improving the efficiency of feature fusion and recognition capability. To address potential information redundancy and interference during multi-scale feature fusion, we incorporate a feature attention module that learns and assigns weights to emphasize scale-specific information most relevant to the detection task. This module dynamically adapts feature importance based on scene and object characteristics, effectively enhancing the model’s decision-making ability.
- The proposed architecture can be directly applied to the existing PG-RCNN without modifying its backbone network, preserving its training stability and efficiency. Both the multi-scale grid and the feature attention modules are modularly designed and integrated into PG-RCNN without altering the original backbone structure, making them easily adaptable to other 3D detection frameworks.
- Validated through experiments and ablation analysis on the KITTI validation set, demonstrating stable performance improvements under Moderate and Hard difficulty levels. Comparative experiments with the original PG-RCNN on the KITTI 3D dataset confirm the effectiveness of the proposed modules in improving the 3D AP metric. Furthermore, ablation studies were conducted by individually removing the multi-scale grid or feature attention modules, clearly showing each component’s contribution to overall performance and verifying the rationality and necessity of the design.
- The method proposed in this study demonstrates greater recognition stability for sparse point clouds and occluded objects. Compared with a single grid resolution, it maintains superior 3D detection performance even for long-range, occluded, or partially missing point clouds.
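As referenced in the first bullet above, the sketch below illustrates sampling RoI grid points at two resolutions in parallel. It is a hedged illustration only: the function name, the canonical centre-origin frame, and the example grid sizes (6 and 8) are assumptions for exposition, not the exact PG-RCNN implementation.

```python
import torch

def roi_grid_points(roi_lwh: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Uniform grid-point offsets inside an axis-aligned RoI, expressed in the
    RoI's canonical (centre-origin) frame. Returns (grid_size**3, 3) offsets."""
    idx = torch.arange(grid_size, dtype=torch.float32)
    centres = (idx + 0.5) / grid_size - 0.5          # cell centres in [-0.5, 0.5)
    xs, ys, zs = torch.meshgrid(centres, centres, centres, indexing="ij")
    offsets = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)
    return offsets * roi_lwh                         # scale by RoI length/width/height


# Parallel pathways at two resolutions; their features are later fused and
# re-weighted by the feature attention module.
roi_lwh = torch.tensor([3.9, 1.6, 1.5])   # illustrative car-sized RoI (l, w, h)
coarse_pts = roi_grid_points(roi_lwh, 6)  # 216 grid points: larger-scale context
fine_pts = roi_grid_points(roi_lwh, 8)    # 512 grid points: fine-grained detail
```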
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Rusu, R.B.; Cousins, S. 3D is here: Point cloud library PCL. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1–4. [Google Scholar]
- Ye, Y.; Yang, X.; Ji, S. APSNet: Attention based point cloud sampling. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022. [Google Scholar]
- Han, J.-W.; Synn, D.-J.; Kim, T.-H.; Chung, H.-C.; Kim, J.-K. Feature based sampling: A fast and robust sampling method for tasks using 3D point cloud. IEEE Access 2022, 10, 58062–58070. [Google Scholar] [CrossRef]
- Wu, C.; Zheng, J.; Pfrommer, J.; Beyerer, J. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5333–5343. [Google Scholar]
- Wu, W.; Qi, Z.; Li, F. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9613–9622. [Google Scholar]
- Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
- Shi, W.; Rajkumar, R. Point-GNN: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
- He, C.; Zeng, H.; Huang, J.; Hua, X.-S.; Zhang, L. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11870–11879. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); NeurIPS: La Jolla, CA, USA, 2017. [Google Scholar]
- Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Lyu, Y.; Huang, X.; Zhang, Z. Learning to segment 3D point clouds in 2D image space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12255–12264. [Google Scholar]
- Luo, Z.; Ma, J.; Zhou, Z.; Xiong, G. PCPNet: An efficient and semantic-enhanced transformer network for point cloud prediction. IEEE Robot. Autom. Lett. 2023, 8, 4267–4274. [Google Scholar] [CrossRef]
- Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11779–11788. [Google Scholar]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Liu, L.; He, J.; Ren, K.; Xiao, Z.; Hou, Y. A LiDAR–camera fusion 3D object detection algorithm. Information 2022, 13, 169. [Google Scholar] [CrossRef]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
- Nabati, R.; Qi, H. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1526–1535. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
- Yuan, W.; Khot, T.; Held, D.; Mertz, C.; Hebert, M. Pcn: Point completion network. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 728–737. [Google Scholar]
- Tchapmi, L.P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; Savarese, S. Topnet: Structural point cloud decoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 383–392. [Google Scholar]
- Xie, H.; Yao, H.; Zhou, S.; Mao, J.; Zhang, S.; Sun, W. Grnet: Gridding residual network for dense point cloud completion. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 365–381. [Google Scholar]
- Yu, X.; Rao, Y.; Wang, Z.; Liu, Z.; Lu, J.; Zhou, J. PoinTr: Diverse point cloud completion with geometry-aware transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 12498–12507. [Google Scholar]
- Xiang, P.; Wen, X.; Liu, Y.; Cao, Y.; Wan, P.; Zheng, W.; Han, Z. Snowflakenet: Point cloud completion by snowflake point deconvolution with skip-transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5499–5509. [Google Scholar]
- Yang, G.; Huang, X.; Hao, Z.; Liu, M.-Y.; Belongie, S.; Hariharan, B. PointFlow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4541–4550. [Google Scholar]
- Liu, X.; Kong, X.; Liu, L.; Chiang, K. TreeGAN: Syntax aware sequence generation with generative adversarial networks. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 1140–1145. [Google Scholar]
- Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
- You, Y.; Wang, Y.; Chao, W.-L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Zhang, Y.; Huang, D.; Wang, Y. PC-RGNN: Point cloud completion and graph neural network for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3430–3437. [Google Scholar] [CrossRef]
- Koo, I.; Lee, I.; Kim, S.H.; Kim, H.S.; Jeon, W.J.; Kim, C. PG-RCNN: Semantic Surface Point Generation for 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2009, 88, 303–338. [Google Scholar] [CrossRef]
- Simonelli, A.; Bulo, S.R.; Porzi, L.; Lopez-Antequera, M.; Kontschieder, P. Disentangling monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1991–1999. [Google Scholar]
Method | Description | Advantage | Limitation
---|---|---|---
Point-based (PointNet [9], PointNet++ [10], PointRCNN [11], 3DSSD [12]) | Directly process the raw point cloud data while preserving its sparsity and irregular structure; extract features through neighbor aggregation and local geometric structure learning | |
Projection-based (PointPillars [16]) | Project the 3D point cloud onto a 2D plane and use a 2D CNN to extract features | |
Voxel-based (VoxelNet [18], SECOND [19], PV-RCNN [20]) | Convert the point cloud into a regular voxel grid, then use a 3D CNN for feature extraction | |
Multi-module (Frustum PointNets [27]) | Fuse features by combining point clouds with other modalities (such as camera images) | |
Method | Architecture
---|---
PCN [28] | Encoder–decoder architecture
TopNet [29], GRNet [30] | Tree-structured decoder/voxel-based latent space
PoinTr [31], SnowflakeNet [32] | Transformer with hierarchical refinement
PointFlow [33] | Flow-based probabilistic generative model
Tree-GAN [34] | Tree-structured Generative Adversarial Network (GAN)
Pseudo-LiDAR [35,36] | Depth prediction from monocular/stereo images + projection
PC-RGNN [37] | Pretrained point cloud completion network with RoI
Class | Difficulty | Ours (4, 6) | Ours (6, 8) | PG-RCNN | SECOND | PointPillars | PV-RCNN
---|---|---|---|---|---|---|---
Car | Easy | 92.40 | 92.61 | 92.26 | 90.55 | 87.75 | 92.10
Car | Mod. | 83.20 | 85.13 | 83.35 | 81.61 | 78.41 | 84.36
Car | Hard | 82.56 | 82.78 | 82.63 | 78.56 | 75.19 | 82.48
Pedestrian | Easy | 69.47 | 67.46 | 64.30 | 55.94 | 57.30 | 64.26
Pedestrian | Mod. | 61.35 | 59.99 | 57.64 | 51.15 | 51.42 | 56.67
Pedestrian | Hard | 56.08 | 55.17 | 52.97 | 46.17 | 46.87 | 51.91
Cyclist | Easy | 91.29 | 90.74 | 91.52 | 82.97 | 81.57 | 88.88
Cyclist | Mod. | 71.86 | 73.77 | 71.11 | 66.74 | 62.93 | 71.95
Cyclist | Hard | 67.16 | 69.13 | 66.59 | 62.78 | 58.98 | 66.78
Class | Difficulty | Grid Size 2 | Grid Size 4 | Grid Size 6 | Grid Size 8 | Grid Size 11
---|---|---|---|---|---|---
Car | Easy | 90.95 | 92.51 | 92.26 | 92.19 | 91.16
Car | Mod. | 81.10 | 83.28 | 83.35 | 83.30 | 82.02
Car | Hard | 78.48 | 82.49 | 82.63 | 82.76 | 79.77
Pedestrian | Easy | 61.02 | 64.07 | 64.30 | 63.44 | 62.04
Pedestrian | Mod. | 55.18 | 56.99 | 57.64 | 58.08 | 54.97
Pedestrian | Hard | 50.17 | 52.00 | 52.97 | 52.97 | 49.58
Cyclist | Easy | 88.12 | 91.54 | 91.52 | 91.84 | 84.60
Cyclist | Mod. | 67.29 | 71.64 | 71.11 | 73.12 | 67.65
Cyclist | Hard | 63.11 | 66.98 | 66.59 | 68.57 | 63.29
Class | Difficulty | Grid Size 6 | Grid Sizes (4, 6) | Grid Sizes (4, 8) | Grid Sizes (6, 8) | Grid Sizes (4, 6, 8)
---|---|---|---|---|---|---
Car | Easy | 92.26 | 92.48 | 92.44 | 92.85 | 92.28
Car | Mod. | 83.35 | 84.90 | 84.92 | 85.41 | 84.92
Car | Hard | 82.63 | 82.57 | 82.68 | 82.99 | 82.68
Pedestrian | Easy | 64.30 | 65.41 | 64.43 | 65.59 | 62.51
Pedestrian | Mod. | 57.64 | 58.07 | 57.80 | 58.89 | 56.36
Pedestrian | Hard | 52.97 | 53.09 | 53.10 | 53.98 | 53.25
Cyclist | Easy | 91.52 | 92.21 | 91.00 | 89.16 | 88.38
Cyclist | Mod. | 71.11 | 72.39 | 72.65 | 69.31 | 70.52
Cyclist | Hard | 66.59 | 67.88 | 68.09 | 64.87 | 65.84
Class | Difficulty | w/o Feature Attention | With Feature Attention
---|---|---|---
Car | Easy | 92.26 | 92.37
Car | Mod. | 83.35 | 84.96
Car | Hard | 82.63 | 82.74
Pedestrian | Easy | 64.30 | 68.56
Pedestrian | Mod. | 57.64 | 61.41
Pedestrian | Hard | 52.97 | 56.10
Cyclist | Easy | 91.52 | 90.01
Cyclist | Mod. | 71.11 | 72.76
Cyclist | Hard | 66.59 | 68.09
Class | Difficulty | Scale 0.375× | Scale 0.5× | Scale 0.75× | Scale 1.0× | Scale 1.5× | Scale 2×
---|---|---|---|---|---|---|---
Car | Easy | 91.60 | 92.74 | 92.72 | 92.61 | 92.56 | 92.55
Car | Mod. | 83.12 | 85.48 | 83.34 | 85.13 | 83.59 | 82.89
Car | Hard | 80.68 | 83.04 | 82.62 | 82.78 | 80.95 | 80.34
Pedestrian | Easy | 61.90 | 63.34 | 64.60 | 67.46 | 65.83 | 64.88
Pedestrian | Mod. | 55.55 | 57.55 | 58.81 | 59.99 | 58.79 | 58.46
Pedestrian | Hard | 51.75 | 52.90 | 53.55 | 55.17 | 53.98 | 53.26
Cyclist | Easy | 89.61 | 89.63 | 92.32 | 90.74 | 89.81 | 89.05
Cyclist | Mod. | 72.38 | 70.70 | 73.54 | 73.77 | 71.86 | 71.48
Cyclist | Hard | 67.83 | 66.16 | 69.01 | 69.13 | 67.41 | 66.87