Article

Multi-Scale Grid-Based Semantic Surface Point Generation for 3D Object Detection

1 Department of Computer Science & Information Engineering, National Central University, Taoyuan City 320317, Taiwan
2 Department of Applied Informatics, Fo Guang University, Yilan 26247, Taiwan
3 Department of Information and Computer Engineering, Chung Yuan Christian University, Taoyuan City 320314, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3492; https://doi.org/10.3390/electronics14173492
Submission received: 25 July 2025 / Revised: 27 August 2025 / Accepted: 28 August 2025 / Published: 31 August 2025
(This article belongs to the Special Issue Digital Signal and Image Processing for Multimedia Technology)

Abstract

3D object detection is a crucial technology in fields such as autonomous driving and robotics. As a direct representation of the 3D world, point cloud data plays a vital role in feature extraction and geometric representation. However, in real-world applications, point cloud data often suffers from occlusion, resulting in incomplete observations and degraded detection performance. Existing methods, such as PG-RCNN, generate semantic surface points within each Region of Interest (RoI) using a single grid size. However, a fixed grid scale cannot adequately capture multi-scale features. A grid that is too small may miss fine structures—especially problematic when dealing with small or sparse objects—while a grid that is too large may introduce excessive background noise, reducing the precision of feature representation. To address this issue, we propose an enhanced PG-RCNN architecture with a Multi-Scale Grid Attention Module as the core contribution. This module improves the expressiveness of point features by aggregating multi-scale information and dynamically weighting features from different grid resolutions. Using a simple linear transformation, we generate attention weights to guide the model to focus on regions that contribute more to object recognition, while effectively filtering out redundant noise. We evaluate our method on the KITTI 3D object detection validation set. Experimental results show that, compared to the original PG-RCNN, our approach improves performance on the Cyclist category by 2.66% and 2.54% in the Moderate and Hard settings, respectively. Additionally, our approach shows more stable performance on small object detection tasks, with an average improvement of 2.57%, validating the positive impact of the Multi-Scale Grid Attention Module on fine-grained geometric modeling, and highlighting the efficiency and generalizability of our model.

1. Introduction

As technology continues to advance rapidly, numerous intelligent tools have emerged with the goal of enhancing people’s quality of life and gradually simulating and replicating human sensory capabilities. Among these, vision—one of the primary means through which humans perceive the external world—has been widely adopted in the fields of artificial intelligence and machine perception. In addition to cameras, which emulate human eyes and capture image data so that machines can “see” and perform visual analysis, depth sensors such as LiDAR have gained increasing attention. The point cloud data generated by these sensors provides depth information and 3D structural details of objects, which enhances a machine’s spatial understanding of its environment.
For example, robot vacuums and delivery robots can analyze point cloud data to assess scene layouts and plan optimal paths. Similarly, autonomous vehicles can more accurately detect pedestrians, vehicles, and obstacles, thereby improving driving safety. As a result, many researchers have begun exploring the application of point cloud data in the field of computer vision, where it is gradually playing a more significant role.
However, in point cloud-based 3D object detection tasks, data sparsity and occlusion remain major challenges. Since point cloud acquisition is affected by sensor viewpoints and occluding objects, it often results in missing or incomplete surface regions of objects. This can lead to errors in object recognition and localization by detection models. Therefore, how to compensate for missing geometric information during 3D object detection—so as to help models more accurately determine the size and position of objects—has become a critical and active area of research.
In recent years, many studies have focused on addressing the issues of sparsity and incompleteness in point cloud data for 3D object detection—particularly in cases where objects are partially occluded or the sensing angle is limited, which often leads to difficulties in recognition. The aim of this research is to enhance the model’s ability to understand incomplete point clouds by introducing a multi-scale grid mechanism that generates semantically meaningful and evenly distributed surface points, capturing geometric features at various levels.
To further improve the effectiveness of feature selection, we also design a feature attention mechanism that adaptively integrates information from multiple scales while filtering out non-representative points. Through this approach, we aim to effectively compensate for missing geometric information, thereby improving the accuracy of object localization and recognition in 3D space. The proposed method is evaluated on public datasets to demonstrate its practicality and superior performance.

2. Related Work

2.1. Point Cloud Feature Extraction and 3D Object Detection Methods

A point cloud is a collection of numerous 3D coordinate points, each representing a sampled location on the surface of an object [1], as illustrated in Figure 1. Point cloud data can be acquired from various sources, such as LiDAR scanners and depth cameras (e.g., Kinect). Due to its simple structure and flexible data volume, point cloud data has been widely applied in tasks such as 3D object detection and object classification [2,3,4]. However, compared to traditional, structured 2D image data, point clouds are irregular and sparse in nature. This makes it a key challenge to effectively learn and extract features from point clouds prior to performing 3D object detection or scene understanding. Existing point cloud feature learning methods can be mainly categorized into three types: Point-based methods, Projection-based methods, and Voxel-based methods.
Point-based methods [5,6,7,8] directly take the raw point cloud as input, fully preserving its sparsity and irregular structure. These approaches typically utilize point coordinates and features, extracting information through neighbor point aggregation or local geometric structure learning. Representative methods include PointNet [9] and PointNet++ [10]. PointNet was the first to propose performing independent feature transformation on each point, followed by a max pooling operation to aggregate global features—effectively handling unordered point cloud inputs. Its successor, PointNet++, introduced a hierarchical feature aggregation mechanism to further enhance the learning of local structures. Features extracted using these methods can be used for object detection while preserving the sparse and irregular nature of point clouds. Representative detection frameworks include PointRCNN [11] and 3DSSD [12]. PointRCNN adopts a two-stage detection architecture: a Region Proposal Network first generates candidate regions from the point cloud, and then each proposal is refined through classification and regression to achieve high-precision detection. 3DSSD, on the other hand, proposes a single-stage architecture that does not rely on predefined region proposals. It uses hierarchical sampling and feature learning to speed up inference and reduce latency. Although Point-based methods can finely capture local geometric details, the point-wise feature computation can lead to high computational cost and slow inference when applied to large-scale scenes.
Projection-based methods project 3D point cloud data onto 2D planes [13,14,15], such as Bird’s Eye View (BEV) or depth maps, allowing the use of well-established 2D convolutional neural networks (2D CNNs) to extract features for object detection. A representative method is PointPillars [16], which divides the point cloud into pillar-like structures and uses a 2D CNN to efficiently extract features, thereby improving inference speed and supporting real-time applications. While these methods perform well in terms of inference speed and memory efficiency, and can quickly leverage existing image processing techniques, the projection process may result in the loss of certain spatial information, leading to trade-offs in detail preservation and 3D spatial understanding.
Voxel-based methods convert point cloud data into a regular voxel grid structure through a process called voxelization, enabling the use of 3D convolutional neural networks (3D CNNs) for feature extraction and detection. This regular grid structure allows efficient processing of large-scale point cloud data and accelerates the inference process [17]. Representative methods include VoxelNet [18], SECOND [19], and PV-RCNN [20]: VoxelNet pioneered end-to-end learning of features directly from voxelized data, laying the foundation for voxel-based approaches. However, voxelization inevitably leads to information loss and redundant computation on empty voxels, which affects the model’s ability to capture fine details. To address this, SECOND introduced sparse convolution techniques, improving computational efficiency while retaining as much detail as possible. PV-RCNN combines voxel features with point-wise features, adopting a hybrid architecture for proposal generation and refined detection, striking a balance between efficiency and detection accuracy.
In addition, many recent approaches attempt to fuse multiple data modalities [21,22,23]—such as camera images and point clouds [24,25,26]—by designing more effective feature fusion mechanisms to further enhance 3D object detection performance. For instance, Frustum PointNets [27] leverage 2D camera images to detect initial bounding boxes, then filter the corresponding regions from the point cloud for more precise object detection.
At present, existing 3D object detection methods each have their own strengths and weaknesses in terms of accuracy, computational efficiency, and handling of incomplete data. However, under sparse or heavily occluded scenes, these methods may still suffer from misidentification or localization errors. Therefore, developing techniques that can compensate for missing geometric information and improve detection robustness has become one of the key research directions today.
Table 1 summarizes the advantages and disadvantages of the methods discussed above.

2.2. Point Generation Methods

To address the sparsity and incompleteness commonly found in real-world LiDAR point clouds, many studies have proposed various point cloud generation methods that enrich spatial structure representations by generating additional 3D points, aiming to recover complete geometric shapes from partially observed point clouds.
Early approaches such as PCN [28] adopted an encoder–decoder architecture, generating coarse and fine point clouds in stages. Later methods like TopNet [29] and GRNet [30] introduced tree structures and voxel latent spaces, respectively, to enhance structural consistency and resolution in the completion results. In recent years, methods such as PoinTr [31] and SnowflakeNet [32] have integrated Transformer architectures with hierarchical refinement strategies, enabling more effective construction of point clouds that retain both global semantics and geometric details.
Another direction focuses on generative models, which typically generate full point cloud structures from latent vectors or cross-modal information: PointFlow [33] uses a flow-based reversible model to simulate continuous point distributions. TreeGAN [34] employs a Generative Adversarial Network (GAN) to generate point clouds with geometric constraints, making it well-suited for 3D modeling and shape generation tasks.
Additionally, diffusion-based generative models have recently been applied to 3D point cloud synthesis, demonstrating strong capabilities in distribution modeling. The Pseudo-LiDAR family of methods [35,36] predicts depth from monocular or stereo images, then projects the depth into dense pseudo point clouds to supplement the original LiDAR input. PC-RGNN [37] introduces a pretrained point cloud completion network to augment point clouds within Regions of Interest (RoIs), thereby improving both detection accuracy and shape understanding.
A comparison of the architectures of various models is summarized in Table 2.

2.3. PG-RCNN

PG-RCNN: Semantic Surface Point Generation for 3D Object Detection [38] is a two-stage 3D object detection framework designed to enhance detection accuracy in sparse point cloud data. The overall architecture consists of two main stages: Region Proposal Generation and Proposal Refinement.
PG-RCNN adopts a Region Proposal Network (RPN) based on 3D voxel feature extraction networks to generate initial 3D bounding box proposals from the input point cloud. It also introduces the RoI Point Generation (RPG) module, which generates semantic surface point clouds for each initial proposal. Finally, the Detection Head is inspired by PointRCNN and is mainly built upon PointNet++. This module learns from the entire set of generated points and their corresponding features to produce refined RoI features, which are then used by the subsequent classification and bounding box regression branches to generate the final detection results. By integrating both spatial and semantic information, PG-RCNN is able to more effectively recognize object boundaries and categories, thereby enhancing the overall performance of 3D object detection.
PG-RCNN combines the computational efficiency of voxel-based features with the fine-grained representation of point-based features, and shows excellent detection performance, particularly in sparse or heavily occluded scenes. Therefore, this research adopts PG-RCNN as the baseline architecture and further extends and improves it to enhance 3D object detection under conditions of missing geometric information.

3. Methodology

3.1. System Architecture

This study proposes an enhanced 3D object detection system, built upon the two-stage detection framework adopted by PG-RCNN, with a particular focus on improving the feature extraction and information selection mechanisms in the proposal refinement stage. The overall system flow is illustrated in Figure 2 and can be divided into the following key modules:
First, during the region proposal generation stage, this study adopts SECOND as the region proposal network. The input point cloud data is voxelized, and global spatial features are extracted using a 3D convolutional backbone network to generate initial 3D bounding boxes. These bounding boxes represent potential object locations within the scene and serve as input to the second-stage module.
Next, in the proposal refinement stage, a Multi-Scale Grid Attention Module is designed, which includes two key improvements aimed at enhancing the model’s ability to recognize object features within RoI regions and to use information more efficiently.
After processing through the multi-scale grid attention module, the semantic features within each RoI are transformed into a set of high-quality surface points, where each point carries its spatial position and high-level semantic features. These surface points serve as the foundational input for the subsequent refined detection.
Finally, in the detection output stage, these surface point clouds are passed into the Detection Head for bounding box regression and object classification, yielding the final 3D detection results.

3.2. Multi-Scale Grid Attention Module

The Multi-Scale Grid Attention Module consists of the following two mechanisms:
Multi-Scale Grid Mechanism: For each region of interest (RoI), grid structures with different resolutions are constructed. By combining multi-level aggregated backbone features at various scales, the model can simultaneously capture coarse and fine geometric structures, improving its representational capability.
Feature Attention Mechanism: During feature propagation, a feature attention module is introduced to adaptively adjust the information flow based on the importance of features. This strengthens critical structural features and filters out less relevant information, further improving recognition performance.

3.2.1. Multi-Scale Grid Mechanism

Although PG-RCNN demonstrates solid performance in 3D object detection tasks, its feature extraction strategy in the region refinement stage still has limitations. PG-RCNN constructs a single-resolution 3D grid within each region of interest (RoI), and feature aggregation is only conducted at that single scale. While this simplifies computation and captures semantic structure to some extent, its performance is limited when dealing with objects of varying sizes or structural complexities.
For small or geometrically complex objects, a single-scale grid may be too sparse to effectively capture fine-grained local structures and edge details. For instance, when processing pedestrians or bicycles, the semantic surface point clouds generated by PG-RCNN often lack sufficient discriminative detail, resulting in inaccurate bounding box regression and degraded detection precision.
Conversely, for large or spatially extensive objects such as trucks or buses, overly dense grids may theoretically offer higher spatial resolution, but in practice, they may increase feature redundancy and lead to feature smoothing. This weakens the model’s ability to highlight key structural features and can cause over-blurred features, ultimately harming its capacity to perceive object contours and shape variations.
Moreover, a single grid resolution cannot flexibly adapt to changes in the RoI size or shape, which is commonly encountered in real-world scenes. Without scale-adaptive feature extraction, the model’s ability to capture semantic information from the surrounding context is limited, reducing the generalization and robustness of 3D object detection.
To address these issues, this study proposes a Multi-Scale Grid Mechanism as an enhancement to PG-RCNN. By constructing multiple grid resolutions within each RoI and aggregating features across these scales, the model not only compensates for missing fine details but also improves understanding of large object structures. This enables better accuracy and robustness across diverse scenes, allowing the model to simultaneously capture coarse and fine geometric features within RoIs—balancing global contours and local details.
As illustrated in Figure 3, multiple cubic grid structures are constructed within each RoI, with resolutions such as 4 × 4 × 4, 6 × 6 × 6, 8 × 8 × 8, etc. These grids divide the interior space of the bounding box uniformly, ensuring spatial consistency and alignment.
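A minimal sketch of how such uniform grids could be constructed is given below. The function and tensor names are hypothetical, the RoI is treated as axis-aligned (the yaw rotation applied to real KITTI proposals is omitted for brevity), and this is not the exact grid-generation code of PG-RCNN or OpenPCDet.

```python
import torch

def dense_grid_points(roi_center: torch.Tensor,
                      roi_size: torch.Tensor,
                      grid_size: int) -> torch.Tensor:
    """Uniformly place grid_size^3 cell-center points inside an axis-aligned RoI.

    roi_center: (3,) box center; roi_size: (3,) box dimensions (l, w, h).
    """
    # Normalized cell centers, e.g. 1/8, 3/8, 5/8, 7/8 for grid_size = 4.
    steps = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size
    zz, yy, xx = torch.meshgrid(steps, steps, steps, indexing="ij")
    unit = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)   # (grid_size^3, 3) in [0, 1]
    return roi_center + (unit - 0.5) * roi_size               # map into the box

# One point set per resolution, e.g. 4^3 = 64, 6^3 = 216 and 8^3 = 512 points per RoI.
center, size = torch.tensor([10.0, 2.0, -1.0]), torch.tensor([3.9, 1.6, 1.5])
multi_scale_grids = {g: dense_grid_points(center, size, g) for g in (4, 6, 8)}
print({g: tuple(p.shape) for g, p in multi_scale_grids.items()})  # {4: (64, 3), 6: (216, 3), 8: (512, 3)}
```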
Each grid point’s feature is aggregated based on its 3D coordinate location, enhancing adaptability to spatial variations. This also helps maintain object structure representation even in sparse or occluded point cloud scenarios.
The process begins by constructing the grid within the candidate region. Each grid point uses its center coordinate as a query to locate nearby voxels. We adopt the Voxel Query method from Voxel R-CNN [39] to efficiently retrieve neighboring voxels from voxelized data. This method uses Manhattan Distance, where fixed offsets are used to locate neighbors without distance sorting or comparison. If querying K neighbors from N voxels, Voxel Query has O(K) time complexity, while the traditional Ball Query (based on Euclidean Distance) requires O(N) due to sorting—making Voxel Query computationally cheaper while maintaining spatial query quality, especially valuable for real-time or large-scale point cloud processing.
Next, a PointNet++-like architecture is used to aggregate features for each grid point $g_i$ and its neighboring voxels $\mathcal{V}_i = \{v_i^1, v_i^2, v_i^3, \ldots, v_i^K\}$. For each neighboring voxel, its feature is concatenated with its relative position $(v_i^k - g_i)$, passed through a Multi-Layer Perceptron (MLP), and aggregated via max pooling to obtain the initial feature $f_{g_i}$, as shown in Equation (1):

$f_{g_i} = \mathrm{MaxPool}\left(\left\{\mathrm{MLP}_{\mathrm{agg}}\left(\left[v_i^k - g_i;\ f_{v_i^k}\right]\right)\right\}_{k=1}^{K}\right)$   (1)
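A compact PyTorch sketch of the aggregation in Equation (1) is shown below. It assumes the K neighboring voxel centers and features have already been gathered for each grid point (for example by the voxel query described above); the class name, feature dimensions, and MLP depth are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GridPointAggregation(nn.Module):
    """Eq. (1): f_{g_i} = MaxPool({ MLP_agg([v_i^k - g_i ; f_{v_i^k}]) } for k = 1..K)."""

    def __init__(self, voxel_feat_dim: int, out_dim: int):
        super().__init__()
        self.mlp_agg = nn.Sequential(
            nn.Linear(3 + voxel_feat_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, grid_points, neighbor_xyz, neighbor_feats):
        # grid_points:    (M, 3)     grid point coordinates g_i
        # neighbor_xyz:   (M, K, 3)  centers of the K voxels returned by the voxel query
        # neighbor_feats: (M, K, C)  their voxel features f_{v_i^k}
        rel = neighbor_xyz - grid_points.unsqueeze(1)   # relative positions v_i^k - g_i
        x = torch.cat([rel, neighbor_feats], dim=-1)    # (M, K, 3 + C)
        x = self.mlp_agg(x)                             # shared MLP applied to every neighbor
        return x.max(dim=1).values                      # max-pool over K neighbors -> (M, out_dim)

# Example: 64 grid points, 16 neighbors each, 32-dim voxel features.
agg = GridPointAggregation(voxel_feat_dim=32, out_dim=96)
f_gi = agg(torch.rand(64, 3), torch.rand(64, 16, 3), torch.rand(64, 16, 32))
print(f_gi.shape)  # torch.Size([64, 96])
```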
To enable grid features to capture higher-level spatial structures, we follow PG-RCNN and introduce a Transformer [40] encoder after feature aggregation for global modeling. Given the geometric nature of 3D space, each grid point also receives a positional encoding $p_{g_i}$, as shown in Equation (2), designed to help the Transformer understand relative positions between points:

$p_{g_i} = \mathrm{MLP}_{\mathrm{pos}}\left(\left[g_i - r_c;\ g_i - r_1;\ \ldots;\ g_i - r_8\right]\right)$   (2)

Here, $r_c$ is the RoI center coordinate, and $r_1, \ldots, r_8$ are the coordinates of the eight box corners. The Transformer encoder then takes the initial feature $f_{g_i}$ and positional encoding $p_{g_i}$ as input, producing a refined feature $f'_{g_i}$, as shown in Equation (3):

$f'_{g_i} = \mathrm{Transformer}(f_{g_i},\ p_{g_i})$   (3)
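The positional encoding and global modeling of Equations (2) and (3) could be sketched as follows. A standard nn.TransformerEncoder stands in for the encoder, and summing the positional encoding with the aggregated feature before the encoder is one plausible reading of "takes both as input"; all dimensions and layer counts are illustrative.

```python
import torch
import torch.nn as nn

class GridTransformerEncoder(nn.Module):
    """Eqs. (2)-(3): positional encoding from the RoI center/corners, then global self-attention."""

    def __init__(self, dim: int = 96, heads: int = 4, layers: int = 2):
        super().__init__()
        # MLP_pos consumes the 9 offsets (center + 8 corners), i.e. 27 scalars per grid point.
        self.mlp_pos = nn.Sequential(nn.Linear(27, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, f_gi, grid_points, roi_center, roi_corners):
        # f_gi: (B, M, dim), grid_points: (B, M, 3), roi_center: (B, 3), roi_corners: (B, 8, 3)
        ref = torch.cat([roi_center.unsqueeze(1), roi_corners], dim=1)   # (B, 9, 3)
        offsets = grid_points.unsqueeze(2) - ref.unsqueeze(1)            # (B, M, 9, 3)
        p_gi = self.mlp_pos(offsets.flatten(start_dim=2))                # Eq. (2)
        return self.encoder(f_gi + p_gi)                                 # Eq. (3): refined f'_{g_i}

enc = GridTransformerEncoder(dim=96)
refined = enc(torch.rand(2, 64, 96), torch.rand(2, 64, 3), torch.rand(2, 3), torch.rand(2, 8, 3))
print(refined.shape)  # torch.Size([2, 64, 96])
```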
Finally, the features from multiple grid resolutions are concatenated and fed into the subsequent Feature Attention Module. This preserves geometric information from various scales, providing rich semantic cues for the RoI. By integrating multi-scale grid features, the proposed approach enhances both the expressiveness and discriminability of RoI representations, leading to improved accuracy and robustness in 3D object detection—particularly in sparse or scale-varying scenes.

3.2.2. Feature Attention Mechanism

After being processed by the Transformer encoder, each grid point generates a feature vector $f'_{g_i}$ that already contains a certain level of semantic correlation and spatial structural information. Since the Transformer effectively captures long-range dependencies and global context among different grid points, it provides clear advantages in terms of semantic richness and spatial consistency. However, despite the comprehensive description of the RoI region in the Transformer-encoded features, a critical challenge remains in the subsequent semantic point generation process: the model cannot automatically distinguish which feature components contribute more significantly to the final 3D detection task.
To address this limitation, this study introduces a feature attention mechanism as a bridge between the Transformer module and the semantic point generation module. The core objective of this attention module is to reweight each grid point’s feature vector, enhancing those with higher relevance to the detection task while suppressing noisy or redundant information, thereby improving the model’s ability to recognize regional semantics and geometric structures.
For each feature vector $f'_{g_i}$ from the Transformer, the feature attention mechanism generates a weight vector $w_{g_i} \in (0, 1)$ of the same dimension based on its content. This weight vector is used to reweight the original feature via element-wise multiplication, as calculated in Equation (4):

$f_{g_i}^{\mathrm{attention}} = f'_{g_i} \odot w_{g_i}$   (4)

Here, the weight $w_{g_i}$ is computed by applying a linear transformation to $f'_{g_i}$, followed by a Sigmoid function to constrain the values within the range (0, 1), as shown in Equation (5):

$w_{g_i} = \sigma(\mathrm{Linear}(f'_{g_i}))$   (5)
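A minimal sketch of this gating step, corresponding to Equations (4) and (5), is given below; the class name and feature dimension are illustrative.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Eqs. (4)-(5): element-wise gating of each Transformer-refined feature vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)        # a single linear layer, no global pooling

    def forward(self, f_prime):                  # f_prime: (..., dim), the refined features f'_{g_i}
        w = torch.sigmoid(self.linear(f_prime))  # Eq. (5): weights constrained to (0, 1)
        return f_prime * w                       # Eq. (4): element-wise reweighting

gate = FeatureAttention(dim=96)
f_attention = gate(torch.rand(2, 64, 96))        # same shape as the input features
```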
This process allows the model to automatically adjust the importance of each feature based on its content, highlighting those with stronger discriminative power. The design of this method is similar in spirit to the Squeeze-and-Excitation Network (SE-Net) [41], which also uses learned weights to reweight features. However, there are several key differences between the proposed method and SE-Net:
  • SE-Net performs global average pooling over the entire feature map to extract global statistics and generate channel-wise weights, whereas the proposed method directly produces individual weights for each Transformer feature vector $f'_{g_i}$, without requiring global pooling.
  • SE-Net applies weights at the channel level for the entire feature map, while the proposed method performs independent weighting for each position’s feature vector, enabling finer-grained feature selection.
  • The proposed module uses only a single linear transformation and Sigmoid function to generate weights, resulting in significantly fewer parameters and lower computational cost compared to the two-layer fully connected network in SE-Net. This makes it especially suitable for point cloud structures where computational efficiency is critical.
The proposed feature attention mechanism retains the core idea of SE-Net’s weight adjustment but is specifically tailored for non-image features, offering a structure more appropriate for this application. It effectively strengthens key semantic features and enhances overall recognition performance.
The attended feature $f_{g_i}^{\mathrm{attention}}$, produced by the feature attention mechanism, serves as the input foundation for the semantic point generation module. It is used to predict both the position offset and the semantic feature. Through this process, the model can focus more effectively on discriminative features, improving the quality of the generated semantic points and enhancing the accuracy of the final RoI features. Once the refined features are obtained, an MLP is further used to simultaneously predict the position offset $o_i$ and the semantic feature $f_{\hat{g}_i}$ of the generated point $\hat{g}_i$ for each grid point, as shown in Equation (6):

$\left[o_i;\ f_{\hat{g}_i}\right] = \mathrm{MLP}_{\mathrm{gen}}\left(f_{g_i}^{\mathrm{attention}}\right)$   (6)

The position offset $o_i$ is then added to the grid point’s center coordinate $g_i$ to determine the actual spatial location of the generated point, as shown in Equation (7):

$\hat{g}_i = g_i + o_i$   (7)
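The generation step in Equations (6) and (7) could be sketched as follows; the output feature dimension and MLP depth are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class SemanticPointGenerator(nn.Module):
    """Eqs. (6)-(7): predict an offset and a semantic feature, then shift the grid point."""

    def __init__(self, in_dim: int, point_feat_dim: int):
        super().__init__()
        self.mlp_gen = nn.Sequential(
            nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
            nn.Linear(in_dim, 3 + point_feat_dim),   # 3 offset values + the generated point's feature
        )

    def forward(self, f_attention, grid_points):
        # f_attention: (M, in_dim) attended features; grid_points: (M, 3) grid centers g_i
        out = self.mlp_gen(f_attention)
        offset, point_feat = out[:, :3], out[:, 3:]  # o_i and the semantic feature
        generated_xyz = grid_points + offset         # Eq. (7): generated point location
        return generated_xyz, point_feat

gen = SemanticPointGenerator(in_dim=96, point_feat_dim=32)
xyz, feats = gen(torch.rand(64, 96), torch.rand(64, 3))
print(xyz.shape, feats.shape)  # torch.Size([64, 3]) torch.Size([64, 32])
```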
This feature attention mechanism, proposed as a bridge between the Transformer and semantic point generation modules, not only enhances feature selectivity but also improves the model’s recognition ability and detection stability in point cloud scenes.

4. Experiment

4.1. Dataset

In this study, the KITTI 3D Object Detection Benchmark is adopted as the dataset for training and evaluation. KITTI is one of the most representative publicly available datasets in the autonomous driving field, containing multi-sensor data collected from real-world road scenes, including stereo cameras and LiDAR. It provides high-quality 3D object annotations and is widely used for tasks such as 3D object detection, tracking, and semantic segmentation.
For the 3D detection task, KITTI provides annotations for three object categories: Car, Pedestrian, and Cyclist. The dataset also classifies samples into three difficulty levels—Easy, Moderate, and Hard—based on the degree of object occlusion and observation difficulty. The primary evaluation metric is the 3D Average Precision (3D AP).
Following common research practices, this study adopts the standard data split convention by dividing the original training set of 7481 samples into 3712 samples for training and 3769 samples for validation. Performance is measured using the official evaluation metrics provided by the benchmark.

4.2. Evaluation Metrics

In the KITTI 3D object detection task, the official benchmark categorizes each annotated 3D bounding box into three difficulty levels—Easy, Moderate, and Hard—based on its level of occlusion and visibility. This classification enables a comprehensive evaluation of a model’s detection capability under different conditions. For each difficulty level, the Average Precision (AP) is calculated separately as the evaluation metric for detection performance.
Traditionally, the KITTI evaluation metric adopts the 11-point Interpolated Average Precision proposed in PASCAL VOC [42]. This metric computes the maximum precision at 11 fixed recall levels ($R_{11} = \{0, 0.1, 0.2, \ldots, 1.0\}$) and then takes the average, as defined by Equation (8):

$AP|_{R} = \frac{1}{|R|} \sum_{r \in R} \rho(r)$   (8)

where $\rho(r)$ denotes the maximum precision achieved at a recall level greater than or equal to $r$. However, the $R_{11}$ metric has two major drawbacks. First, since recall starts from 0, a single correct prediction per difficulty level yields 100% precision at recall = 0, inflating the AP score to at least 1/11 ≈ 0.0909 even when the model performs poorly. Second, approximating the overall precision-recall curve using only 11 points lacks resolution, especially in the mid-to-high recall regions, resulting in insufficient evaluation granularity.
To address these limitations, the paper Disentangling Monocular 3D Object Detection [43] introduced an improved metric: the 40-point Interpolated Average Precision ($R_{40}$ AP). This method divides the recall interval into 40 evenly spaced points ($R_{40} = \{1/40, 2/40, 3/40, \ldots, 1.0\}$) and avoids starting from recall = 0, thus reducing evaluation bias from the initial recall point and enabling a more precise estimation of the area under the precision-recall curve. This metric not only enhances evaluation stability but also provides a more accurate reflection of the model’s overall detection capability.
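For reference, a minimal NumPy sketch of the interpolated AP described above is shown below. It assumes a precision-recall curve has already been computed for one class and difficulty level; it is not KITTI's official evaluation code.

```python
import numpy as np

def interpolated_ap(recall: np.ndarray, precision: np.ndarray, num_points: int = 40) -> float:
    """Average of rho(r), the maximum precision at recall >= r, over evenly spaced recall points."""
    # R_40 = {1/40, 2/40, ..., 1.0}; note that it deliberately skips recall = 0.
    sample_points = np.arange(1, num_points + 1) / num_points
    ap = 0.0
    for r in sample_points:
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / num_points

# Toy precision-recall curve (recall sorted in increasing order).
rec = np.linspace(0.02, 0.9, 50)
prec = np.clip(1.0 - 0.6 * rec, 0.0, 1.0)
print(f"R40 AP = {interpolated_ap(rec, prec):.4f}")
```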
Currently, the $R_{40}$ AP metric has been widely adopted by mainstream approaches on the KITTI benchmark and has become the de facto standard for evaluating 3D object detection performance. Many recent models and studies utilize this improved metric due to its higher evaluation precision and reduced bias compared to the traditional 11-point method. To maintain consistency with existing works and enable fair comparison, this study also employs the $R_{40}$ AP metric as the primary evaluation criterion. Accordingly, all experimental results reported in this paper are based on $R_{40}$ AP, with detection performance separately evaluated across the Easy, Moderate, and Hard difficulty levels. This comprehensive evaluation approach ensures a thorough assessment of the proposed model’s effectiveness under various levels of occlusion and visibility conditions.

4.3. Experiment Platform

This section outlines the experimental environment and training configurations used in this study. All experiments are conducted based on the implementation framework provided by OpenPCDet, with the proposed multi-scale grid attention module integrated into the PG-RCNN architecture to verify its effectiveness in 3D object detection tasks.
Unlike the original PG-RCNN, which was trained using four NVIDIA GeForce RTX 3090 GPUs, all training and inference in this study were performed on a single NVIDIA GeForce RTX 3090 Ti GPU (Nvidia, Santa Clara, CA, USA). The Adam optimizer was used, following the settings of PG-RCNN, along with a One-Cycle Learning Rate Policy for 80 training epochs. The initial learning rate was set to 0.01.
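The optimization setup can be reproduced in plain PyTorch roughly as follows. The framework's own one-cycle implementation may differ in detail, and every value other than the stated 80 epochs and 0.01 initial learning rate (weight decay, warm-up fraction, iterations per epoch) is an illustrative assumption.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 10)            # stand-in for the detector network
epochs, iters_per_epoch = 80, 928          # 80 epochs as stated; iterations per epoch is illustrative

optimizer = Adam(model.parameters(), lr=0.01, weight_decay=0.01)   # weight decay is an assumption
scheduler = OneCycleLR(optimizer,
                       max_lr=0.01,                         # peak LR matching the stated 0.01
                       total_steps=epochs * iters_per_epoch,
                       pct_start=0.4)                       # warm-up fraction is illustrative

for _ in range(5):                         # a few demo iterations; real training loops over the dataset
    optimizer.zero_grad()
    loss = model(torch.rand(4, 10)).sum()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                       # the one-cycle schedule is stepped every iteration
```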
Data augmentation strategies also followed the original design of PG-RCNN, employing various common techniques to enhance the model’s robustness to spatial variations. These techniques include random flipping along the X-axis, rotation along the Z-axis, global scaling, and Ground Truth Sampling.
In addition, this study explored two grid size configurations: (4, 6) and (6, 8). These configurations help the model capture object details and structures at different scales, contributing to improved detection performance.

4.4. Model Performance Comparison

To validate the practical effectiveness of the proposed multi-scale grid attention module in 3D object detection tasks, we conducted a comprehensive evaluation on the KITTI validation set, comparing the improved PG-RCNN (integrated with our module) against its original version within the OpenPCDet framework. The experimental results are analyzed from two perspectives: overall detection performance and adaptability across different difficulty levels.
This study adopts the official evaluation metrics provided by KITTI, which categorizes samples into Easy, Moderate, and Hard levels based on difficulty. For calculating average precision, we use the more recent and widely adopted R40 metric, which averages precision over 40 evenly spaced recall intervals. This approach enhances both the stability and representativeness of the evaluation results.
Table 3 presents a comparison between our proposed method and several other 3D detection approaches on the KITTI validation set. We evaluate performance under two different grid size configurations: (4, 6) and (6, 8), and compare the results against PG-RCNN, SECOND, PointPillars, and PV-RCNN. The results for SECOND, PointPillars, and PV-RCNN are directly taken from the PG-RCNN paper to ensure a fair and consistent baseline for comparison.
From the results of the car category, it can be observed that the proposed method slightly outperforms PG-RCNN across all difficulty levels. This demonstrates that incorporating the multi-scale grid attention module into the original architecture enhances the model’s object recognition capability. Notably, under the Moderate difficulty, the proposed method (with grid sizes 6 and 8) achieves 85.13% accuracy, surpassing PG-RCNN’s 83.35%, indicating a stronger ability to recognize more challenging samples.
For the pedestrian category, although the overall accuracy is generally lower than that of the car category, the proposed method still outperforms PG-RCNN and other models across all difficulty levels. This shows that the method maintains high recognition capability even when dealing with smaller objects and greater variation in pose. For example, under the Easy difficulty, the proposed method (with grid sizes 4 and 6) achieves 69.47%, showing a clear improvement over PG-RCNN’s 64.3%.
In the bicycle category, both multi-scale grid configurations perform well, especially under the Moderate and Hard difficulty levels, where the proposed method clearly outperforms other models. For instance, under Moderate difficulty, the method (6 and 8) achieves 73.77%, significantly higher than PG-RCNN’s 71.11%, SECOND’s 66.74%, and PointPillars’ 62.93%. Likewise, the results under Hard difficulty also show noticeable improvement, indicating that the proposed mechanism generalizes better to highly variable object shapes like cyclists.
In terms of training time, the original PG-RCNN model takes 6 h, 58 min, and 5 s. For the proposed multi-scale grid models, the (4, 6) configuration takes 8 h, 25 min, and 50 s, while the (6, 8) configuration requires 16 h, 13 min, and 28 s. Although training time increases—particularly for the larger grid configuration—the substantial improvement in recognition performance, especially for medium- and high-difficulty samples and highly variable objects, reflects a favorable trade-off between accuracy and computational cost.
Moreover, during inference, dynamic selection can be applied based on the characteristics of the input data to further optimize performance. For instance, different trained configurations can be selected based on the initial bounding box size: for small or less distinguishable objects, the more stable (4, 6) configuration can be used; for larger or more detailed objects, the more accurate (albeit more computationally expensive) (6, 8) configuration is preferred. This weight selection strategy based on bounding box size enhances recognition efficiency and accuracy in practical applications, highlighting the flexibility and practicality of the proposed method.
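As an illustration, such a size-based dispatch could look like the sketch below; the diagonal threshold and the mapping to the (4, 6) and (6, 8) weights are hypothetical values rather than settings evaluated in this study.

```python
from typing import Sequence, Tuple

SMALL_BOX_DIAGONAL = 2.5  # hypothetical threshold (meters) separating small from large proposals

def select_grid_config(box_dims: Sequence[float]) -> Tuple[int, int]:
    """Pick a trained grid configuration from the size of the first-stage proposal.

    box_dims: (length, width, height) of the initial bounding box.
    Returns (4, 6) for small or ambiguous objects and (6, 8) for larger, more detailed ones.
    """
    l, w, h = box_dims
    diagonal = (l * l + w * w + h * h) ** 0.5
    return (4, 6) if diagonal < SMALL_BOX_DIAGONAL else (6, 8)

print(select_grid_config((0.8, 0.6, 1.7)))  # pedestrian-sized proposal -> (4, 6)
print(select_grid_config((3.9, 1.6, 1.5)))  # car-sized proposal -> (6, 8)
```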
Across all categories and difficulty levels, the results show that integrating multi-scale grid and feature attention mechanisms effectively enhances the model’s recognition performance in 3D detection tasks—especially in moderate and hard scenarios, where it consistently demonstrates a stable advantage. These results confirm that the proposed approach maintains robust and outstanding detection performance even in the presence of sparse point clouds, object occlusion, or dramatic scale changes. We attribute this improvement to:
The multi-scale grid mechanism, which enables the model to capture object features at various resolutions, enhancing its ability to represent objects of different sizes and distances.
The feature attention module, which adjusts the flow of information based on feature importance, reducing redundancy and preventing noisy features from disrupting predictions, thereby improving the model’s generalization capability.
To further quantify the individual contributions of each module, an ablation study is conducted in the next section. This dissects the specific impact of the multi-scale grid and feature attention designs from various perspectives. The experiments in this section clearly demonstrate that the proposed method significantly strengthens 3D object detection capabilities on top of the existing PG-RCNN architecture, with a marked advantage in medium- and high-difficulty scenarios, indicating its potential for application in complex and realistic autonomous driving environments.
Figure 4 shows the results after visualizing the 3D object detection. In the figure, yellow bounding boxes indicate bicycles, while green bounding boxes indicate cars. The areas highlighted with red circles emphasize the differences between PG-RCNN detections, the ground truth annotations, and our proposed method’s results.

4.5. Ablation Study

To comprehensively verify the practical contributions and benefits of the proposed multi-scale grid attention module in 3D object detection tasks, we designed and conducted a series of systematic ablation experiments. These experiments perform controlled variable analysis on key design components of the model architecture, examining the impact of individual modules and settings on detection performance. The ablation study focuses on the following three main aspects:
Grid Size Design: By adjusting the 3D grid resolution constructed within the proposal regions, we explore how different grid densities affect feature extraction capability and model performance, aiming to identify the most suitable resolution combination.
Module Contribution: Function-switching experiments are conducted separately for the multi-scale grid mechanism and the feature attention module. By removing each module one at a time, we observe the performance change in its absence to evaluate the importance and contribution of each component within the overall architecture.
Feature Extraction Range: Using the original range settings in PG-RCNN as a baseline, we scale the radius used for feature extraction at each grid point within the multi-scale grid, and analyze the impact of aggregation radius on detection accuracy across different difficulty levels. This helps identify the optimal balance point for feature integration.
Through this multi-dimensional ablation design, our study clearly clarifies the concrete contributions of each enhancement strategy and validates the rationality and scalability of the proposed architecture, further supporting the overall effectiveness of the methodology.
First, we analyzed the use of different single grid sizes in the original PG-RCNN architecture. As shown in Table 4, we evaluated detection performance with grid sizes of 2, 4, 6, 8, and 11. The results reveal that the choice of grid size has a significant impact on performance. Grid sizes 4, 6, and 8 strike the best balance between feature detail and computational efficiency.
Next, we conducted experiments on the PG-RCNN base architecture with Table 5 and Table 6 introducing the multi-scale grid mechanism and the feature attention module, respectively. These experiments compare the effect of each module under different configurations. We observed that introducing either the multi-scale grid mechanism or the feature attention module individually led to notable improvements in accuracy. When both mechanisms were applied together, the model performed even better, confirming that the two modules are complementary in design and mutually reinforce the model’s spatial understanding of objects.
Finally, in the feature extraction part, we conducted comparison experiments in Table 7 using different feature aggregation radii to analyze their effect on detection performance when aggregating point cloud features within the grid. Using the (6, 8) grid size configuration as the base, we varied the radius corresponding to grid size 8 by scaling it to 0.375×, 0.5×, 0.75×, 1.0× (default in PG-RCNN), 1.5×, and 2.0×. The results show that the feature aggregation radius has a significant effect on model performance.
In general, medium-sized radii (0.75×–1.0×) yielded the best results. For cars, pedestrians, and cyclists, peak accuracy was achieved under Easy and Moderate difficulty with these settings, suggesting this range effectively balances local and global feature information. For example, in the car category under Moderate difficulty, the best accuracy was observed at 0.5× and 1.0×. In contrast, performance dropped significantly when using too small (0.375×) or too large (2.0×) radii. A similar trend was observed in the pedestrian and cyclist categories, where 1.0× consistently produced higher accuracy across difficulties, indicating that this setting captures sufficient features within the target region. On the other hand, 0.375× and 2.0× performed poorly overall, likely due to insufficient information or interference from excessive irrelevant features.
In summary, these experiments confirm the practical effectiveness of both the multi-scale grid mechanism and the feature attention module in the architecture. By analyzing various grid size and feature aggregation radius settings, we gained deeper insight into how design choices affect detection performance, providing concrete and evidence-based references for future 3D object detection architecture design.

5. Conclusions

This study addresses the issue of poor recognition performance in existing 3D object detection models when dealing with objects of varying scales. We propose an improved PG-RCNN architecture enhanced with a Multi-Scale Grid Attention Module, which effectively strengthens feature extraction and fusion capabilities. The key contributions are as follows:
  • Proposed a multi-scale grid mechanism that enables the model to capture both local and global geometric structure information simultaneously, significantly improving detection performance for multi-scale objects. Two different resolution grids are designed in this study, extracting spatial features at different scales through parallel pathways and subsequently fusing them. This design preserves fine-grained details while enhancing the model’s understanding of large-scale objects, overcoming the limitations of traditional single-resolution grids.
  • Introduced a lightweight feature attention module that dynamically adjusts the importance of multi-scale features, improving the efficiency of feature fusion and recognition capability. To address potential information redundancy and interference during multi-scale feature fusion, we incorporate a feature attention module that learns and assigns weights to emphasize scale-specific information most relevant to the detection task. This module dynamically adapts feature importance based on scene and object characteristics, effectively enhancing the model’s decision-making ability.
  • The proposed architecture can be directly applied to the existing PG-RCNN without modifying its backbone network, preserving its training stability and efficiency. Both the multi-scale grid and the feature attention modules are modularly designed and integrated into PG-RCNN without altering the original backbone structure, making them easily adaptable to other 3D detection frameworks.
  • Validated through experiments and ablation analysis on the KITTI validation set, demonstrating stable performance improvements under Moderate and Hard difficulty levels. Comparative experiments with the original PG-RCNN on the KITTI 3D dataset confirm the effectiveness of the proposed modules in improving the 3D AP metric. Furthermore, ablation studies were conducted by individually removing the multi-scale grid or feature attention modules, clearly showing each component’s contribution to overall performance and verifying the rationality and necessity of the design.
  • The method proposed in this study demonstrates greater recognition stability for sparse point clouds and occluded objects. Compared with a single grid resolution, it maintains superior 3D detection performance even when dealing with long-range, occluded, or partially missing point clouds.

Author Contributions

Conceptualization, C.-C.L., J.-H.L. and K.-C.F.; Methodology, X.-F.C. and C.-H.C.; Validation, C.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council under Grant No. NSTC 113-2221-E-008-088-MY3.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rusu, R.B.; Cousins, S. 3D is here: Point cloud library PCL. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1–4. [Google Scholar]
  2. Ye, Y.; Yang, X.; Ji, S. APSNet: Attention based point cloud sampling. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022. [Google Scholar]
  3. Han, J.-W.; Synn, D.-J.; Kim, T.-H.; Chung, H.-C.; Kim, J.-K. Feature based sampling: A fast and robust sampling method for tasks using 3D point cloud. IEEE Access 2022, 10, 58062–58070. [Google Scholar] [CrossRef]
  4. Wu, C.; Zheng, J.; Pfrommer, J.; Beyerer, J. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5333–5343. [Google Scholar]
  5. Wu, W.; Qi, Z.; Li, F. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9613–9622. [Google Scholar]
  6. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  7. Shi, W.; Rajkumar, R. Point-GNN: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
  8. He, C.; Zeng, H.; Huang, J.; Hua, X.-S.; Zhang, L. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11870–11879. [Google Scholar]
  9. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  10. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); NeurIPS: La Jolla, CA, USA, 2017. [Google Scholar]
  11. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  12. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  13. Lyu, Y.; Huang, X.; Zhang, Z. Learning to segment 3D point clouds in 2D image space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12255–12264. [Google Scholar]
  14. Luo, Z.; Ma, J.; Zhou, Z.; Xiong, G. PCPNet: An efficient and semantic-enhanced transformer network for point cloud prediction. IEEE Robot. Autom. Lett. 2023, 8, 4267–4274. [Google Scholar] [CrossRef]
  15. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11779–11788. [Google Scholar]
  16. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; BeijBom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  17. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  18. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  19. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  20. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  21. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  22. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  23. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  24. Liu, L.; He, J.; Ren, K.; Xiao, Z.; Hou, Y. A LiDAR–camera fusion 3D object detection algorithm. Information 2022, 13, 169. [Google Scholar] [CrossRef]
  25. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  26. Nabati, R.; Qi, H. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1526–1535. [Google Scholar]
  27. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  28. Yuan, W.; Khot, T.; Held, D.; Mertz, C.; Hebert, M. Pcn: Point completion network. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 728–737. [Google Scholar]
  29. Tchapmi, L.P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; Savarese, S. Topnet: Structural point cloud decoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 383–392. [Google Scholar]
  30. Xie, H.; Yao, H.; Zhou, S.; Mao, J.; Zhang, S.; Sun, W. Grnet: Gridding residual network for dense point cloud completion. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 365–381. [Google Scholar]
  31. Yu, X.; Rao, Y.; Wang, Z.; Liu, Z.; Lu, J.; Zhou, J. Pointr: Diverse point cloud completion with geometry-aware transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12498–12507. [Google Scholar]
  32. Xiang, P.; Wen, X.; Liu, Y.; Cao, Y.; Wan, P.; Zheng, W.; Han, Z. Snowflakenet: Point cloud completion by snowflake point deconvolution with skip-transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5499–5509. [Google Scholar]
  33. Yang, G.; Huang, X.; Hao, Z.; Liu, M.-Y.; Belongie, S.; Hariharan, B. PointFlow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4541–4550. [Google Scholar]
  34. Liu, X.; Kong, X.; Liu, L.; Chiang, K. TreeGAN: Syntax aware sequence generation with generative adversarial networks. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 1140–1145. [Google Scholar]
  35. Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  36. You, Y.; Wang, Y.; Chao, W.-L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  37. Zhang, Y.; Huang, D.; Wang, Y. PC-RGNN: Point cloud completion and graph neural network for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3430–3437. [Google Scholar] [CrossRef]
  38. Koo, I.; Lee, I.; Kim, S.H.; Kim, H.S.; Jeon, W.J.; Kim, C. PG-RCNN: Semantic Surface Point Generation for 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  39. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
  40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 2017. [Google Scholar]
  41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  42. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2009, 88, 303–338. [Google Scholar] [CrossRef]
  43. Simonelli, A.; Bulo, S.R.; Porzi, L.; Lopez-Antequera, M.; Kontschieder, P. Disentangling monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1991–1999. [Google Scholar]
Figure 1. Illustration of 3D Point Cloud Data.
Figure 2. The architecture of the network.
Figure 3. Illustration of Multi-Scale Grid.
Figure 4. Visualization of 3D detection results on the KITTI validation set (from top to bottom: 2D image, PG-RCNN, Ground Truth, Proposed Method).
Table 1. Comparison of point cloud feature learning methods.

Method | Description | Advantages | Limitations
Point-based (PointNet [9], PointNet++ [10], PointRCNN [11], 3DSSD [12]) | Directly processes the raw point cloud while preserving its sparsity and irregular structure; extracts features through neighbor aggregation and local geometric structure learning | Preserves point cloud details; accurately captures local geometric features | High computational cost; slow inference in large-scale scenarios
Projection-based (PointPillars [16]) | Projects the 3D point cloud onto a 2D plane and extracts features with a 2D CNN | Fast inference; high memory efficiency; can leverage existing 2D image processing techniques | Projection may lose spatial information; limited preservation of fine details
Voxel-based (VoxelNet [18], SECOND [19], PV-RCNN [20]) | Converts the point cloud into a regular voxel grid and extracts features with a 3D CNN | Suitable for large-scale data; supports efficient inference | Voxelization leads to information loss; empty voxels result in redundant computation
Multi-modal (Frustum PointNets [27]) | Fuses point cloud features with other modalities (such as camera images) | Improves detection accuracy; effectively exploits multi-source information | Complex fusion mechanism; increases the difficulty of system design and training
Table 2. Overview of point cloud completion and generation methods.

Method | Architecture
PCN [28] | Encoder–decoder architecture
TopNet [29], GRNet [30] | Tree-structured decoder / voxel-based latent space
PoinTr [31], SnowflakeNet [32] | Transformer with hierarchical refinement
PointFlow [33] | Flow-based probabilistic generative model
Tree-GAN [34] | Tree-structured Generative Adversarial Network (GAN)
Pseudo-LiDAR [35,36] | Depth prediction from monocular/stereo images + projection
PC-RGNN [37] | Pretrained point cloud completion network with RoI
Table 3. The 3D detection results of each method on the KITTI validation set are shown, with bold indicating the best performance and underlines marking the second-best performance. (Some results are cited from the PG-RCNN paper).

Method | Ours (4, 6) | Ours (6, 8) | PG-RCNN | SECOND | PointPillars | PV-RCNN
Car Easy | 92.40 | 92.61 | 92.26 | 90.55 | 87.75 | 92.10
Car Mod. | 83.20 | 85.13 | 83.35 | 81.61 | 78.41 | 84.36
Car Hard | 82.56 | 82.78 | 82.63 | 78.56 | 75.19 | 82.48
Pedestrian Easy | 69.47 | 67.46 | 64.30 | 55.94 | 57.30 | 64.26
Pedestrian Mod. | 61.35 | 59.99 | 57.64 | 51.15 | 51.42 | 56.67
Pedestrian Hard | 56.08 | 55.17 | 52.97 | 46.17 | 46.87 | 51.91
Cyclist Easy | 91.29 | 90.74 | 91.52 | 82.97 | 81.57 | 88.88
Cyclist Mod. | 71.86 | 73.77 | 71.11 | 66.74 | 62.93 | 71.95
Cyclist Hard | 67.16 | 69.13 | 66.59 | 62.78 | 58.98 | 66.78
Table 4. The 3D detection results of models using different single grid sizes on the KITTI validation set, where bold indicates the best performance and underlines indicate the second-best.

Grid Size | 2 | 4 | 6 | 8 | 11
Car Easy | 90.95 | 92.51 | 92.26 | 92.19 | 91.16
Car Mod. | 81.10 | 83.28 | 83.35 | 83.30 | 82.02
Car Hard | 78.48 | 82.49 | 82.63 | 82.76 | 79.77
Pedestrian Easy | 61.02 | 64.07 | 64.30 | 63.44 | 62.04
Pedestrian Mod. | 55.18 | 56.99 | 57.64 | 58.08 | 54.97
Pedestrian Hard | 50.17 | 52.00 | 52.97 | 52.97 | 49.58
Cyclist Easy | 88.12 | 91.54 | 91.52 | 91.84 | 84.60
Cyclist Mod. | 67.29 | 71.64 | 71.11 | 73.12 | 67.65
Cyclist Hard | 63.11 | 66.98 | 66.59 | 68.57 | 63.29
Table 5. Three-dimensional detection performance of models using different multi-scale grid size combinations on the KITTI validation set, with bold indicating the best result and underlines the second-best.

Grid Size | 6 | 4, 6 | 4, 8 | 6, 8 | 4, 6, 8
Car Easy | 92.26 | 92.48 | 92.44 | 92.85 | 92.28
Car Mod. | 83.35 | 84.90 | 84.92 | 85.41 | 84.92
Car Hard | 82.63 | 82.57 | 82.68 | 82.99 | 82.68
Pedestrian Easy | 64.30 | 65.41 | 64.43 | 65.59 | 62.51
Pedestrian Mod. | 57.64 | 58.07 | 57.80 | 58.89 | 56.36
Pedestrian Hard | 52.97 | 53.09 | 53.10 | 53.98 | 53.25
Cyclist Easy | 91.52 | 92.21 | 91.00 | 89.16 | 88.38
Cyclist Mod. | 71.11 | 72.39 | 72.65 | 69.31 | 70.52
Cyclist Hard | 66.59 | 67.88 | 68.09 | 64.87 | 65.84
Table 6. Comparison of 3D detection accuracy for single-scale grids with and without the feature attention module, highlighting best results in bold.

Method | w/o Feature Attention | With Feature Attention
Car Easy | 92.26 | 92.37
Car Mod. | 83.35 | 84.96
Car Hard | 82.63 | 82.74
Pedestrian Easy | 64.30 | 68.56
Pedestrian Mod. | 57.64 | 61.41
Pedestrian Hard | 52.97 | 56.10
Cyclist Easy | 91.52 | 90.01
Cyclist Mod. | 71.11 | 72.76
Cyclist Hard | 66.59 | 68.09
Table 7. Influence of different feature extraction radius settings on 3D detection performance, where the best values are highlighted in bold.

Scale | 0.375× | 0.5× | 0.75× | 1.0× | 1.5× | –
Car Easy | 91.6 | 92.74 | 92.72 | 92.61 | 92.56 | 92.55
Car Mod. | 83.12 | 85.48 | 83.34 | 85.13 | 83.59 | 82.89
Car Hard | 80.68 | 83.04 | 82.62 | 82.78 | 80.95 | 80.34
Pedestrian Easy | 61.90 | 63.34 | 64.60 | 67.46 | 65.83 | 64.88
Pedestrian Mod. | 55.55 | 57.55 | 58.81 | 59.99 | 58.79 | 58.46
Pedestrian Hard | 51.75 | 52.90 | 53.55 | 55.17 | 53.98 | 53.26
Cyclist Easy | 89.61 | 89.63 | 92.32 | 90.74 | 89.81 | 89.05
Cyclist Mod. | 72.38 | 70.70 | 73.54 | 73.77 | 71.86 | 71.48
Cyclist Hard | 67.83 | 66.16 | 69.01 | 69.13 | 67.41 | 66.87
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
