Article

Moving-Least-Squares-Enhanced 3D Object Detection for 4D Millimeter-Wave Radar

1 School of Automotive Studies, Tongji University, Shanghai 201804, China
2 Shanghai Motor Vehicle Inspection Certification & Tech Innovation Center Co., Ltd., Shanghai 201805, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(8), 1465; https://doi.org/10.3390/rs17081465
Submission received: 12 February 2025 / Revised: 7 April 2025 / Accepted: 17 April 2025 / Published: 20 April 2025

Abstract

Object detection is a critical task in autonomous driving. Current 3D object detection methods for autonomous driving rely primarily on stereo cameras and LiDAR, which are susceptible to adverse weather and low lighting, resulting in limited robustness. In contrast, automotive mmWave radar offers resilience to complex weather, independence from lighting conditions, and a low cost, making it a widely studied sensor type. Modern 4D millimeter-wave (mmWave) radar provides spatial dimensions (x, y, z) as well as Doppler information, meeting the requirements for 3D object detection. However, the point cloud density of 4D mmWave radar is significantly lower than that of LiDAR, especially at short range, and existing point cloud object detection methods struggle to adapt to such sparse data. To address this challenge, we propose a novel 4D mmWave radar point cloud object detection framework. First, we employ moving least squares (MLS) to densify multi-frame fused point clouds, effectively increasing the point cloud density. Next, we construct a 3D object detection network based on point pillar encoding and use an SSD detection head for detection on the feature maps. Finally, we validate our method on the VoD dataset. Experimental results demonstrate that the proposed framework outperforms comparative methods, and the MLS-based point cloud densification significantly enhances the object detection performance.

1. Introduction

As an advanced sensing technology, 4D mmWave radar has emerged as a critical component in autonomous driving systems, offering combined measurements of range, velocity, azimuth, and elevation. Currently, stereo cameras, LiDAR, and mmWave radar are widely used to detect 3D objects [1]. Unlike traditional radar systems, which are limited to three measurement dimensions (range, velocity, and azimuth), 4D mmWave radar adds elevation information, allowing more comprehensive and accurate perception of the surrounding environment.
The 4D mmWave radar stands out due to its ability to operate effectively in adverse conditions where other sensors may struggle. In complex, highly dynamic, and adverse environments, such as rainy or snowy weather, strong winds, heavy fog, and intricate road structures, research on perception systems becomes particularly crucial [2,3,4]. For example, cameras can be affected by low light or glare, while LiDAR’s performance can be degraded in fog, rain, or snow. In contrast, 4D mmWave radar is less affected by such environmental factors, making it a reliable choice for all-weather operation. Furthermore, its ability to measure the velocity directly through the Doppler effect provides a significant advantage in tracking moving objects, such as vehicles and pedestrians, which is critical for path planning. When designing a deep learning network, it is necessary to consider the Doppler information of the object from the 4D mmWave radar [5,6]. As a result, 4D mmWave radar has become an indispensable tool to enhance the safety, reliability, and efficiency of autonomous vehicles by providing robust environmental awareness.
However, despite its numerous advantages, 4D mmWave radar faces a significant limitation: the sparsity of its point cloud data. Compared to LiDAR, which generates dense and high-resolution point clouds, the point cloud produced by 4D mmWave radar is relatively sparse, especially at short range. This sparsity arises from the radar’s inherent physical limitations, such as a lower angular resolution and fewer channels, which result in fewer detected points per object. As a consequence, the representation of the environment is less detailed, making it challenging to accurately identify and classify objects, particularly at longer distances or for smaller targets. This limitation can hinder the performance of downstream tasks such as object detection, tracking, and scene understanding, which are essential for autonomous driving systems to operate safely and effectively.
To address these challenges, researchers have focused on developing advanced techniques for point cloud densification and object detection. Point cloud densification strategies aim to enhance the spatial resolution and measurement fidelity of radar returns through data completion mechanisms, enabling the precise reconstruction of the environmental topology. By leveraging interpolation, machine learning, or time fusion methods, densification techniques can fill in the gaps in sparse radar data, improving the overall perception capabilities of the system. On the other hand, object detection algorithms tailored to 4D mmWave radar data are designed to improve the recognition and localization of objects, even in sparse point cloud scenarios. These algorithms often incorporate deep learning models, such as convolutional neural networks (CNNs), to extract meaningful features from the radar data and achieve robust detection performance.
This paper introduces a novel two-stage approach to address the challenges of sparse point cloud data in 4D mmWave radar, focusing on point cloud densification and object detection. In the first stage, we propose an MLS-based method to enhance the point cloud density. This process involves three key steps, namely a k-d tree-accelerated nearest neighbor search to identify relevant points, MLS upsampling to reconstruct the surface, and point cloud densification to interpolate and generate additional points, thereby creating a more detailed and spatially consistent representation of the environment. By leveraging spatial correlations in radar point clouds, our method effectively addresses the sparsity issue, producing a denser and more reliable point cloud for downstream perception tasks.
In the second stage, we design a specialized object detection network that integrates a point pillar architecture with a single shot detector (SSD) detection head. The pillarization framework transforms unstructured 3D point clouds into discretized 2D grid representations through pillar-based gridding, enabling computationally efficient feature abstraction while preserving the geometric integrity and object-level spatial hierarchies inherent in the original point distribution. This structured format reduces the computational overhead and facilitates compatibility with standard convolutional neural network (CNN) operations. The SSD detection head, renowned for its real-time performance and multi-scale detection capabilities, is then employed to localize and classify objects within the densified point cloud. By combining the strengths of point pillar encoding and the SSD’s efficiency, our network achieves robust object detection performance even in sparse and noisy radar data scenarios.
Our two-stage framework addresses the limitations of 4D mmWave radar systems by synergizing physics-based densification with data-driven deep learning. The experimental results demonstrate that our method significantly improves the object detection accuracy compared to conventional approaches, particularly for sparse point clouds. This advancement not only bridges the gap between sparse radar data and dense perception requirements but also underscores the potential of 4D mmWave radar as a cost-effective and reliable sensor for autonomous driving systems operating in complex environments.
Our main contributions are as follows.
  • We introduce a novel 4D mmWave radar point cloud densification method based on the MLS method to enhance sparse radar point clouds. By aggregating temporal data and reconstructing surfaces through MLS upsampling, our method provides richer input for downstream tasks.
  • We use an architecture integrating point pillar encoding and an SSD detection head to perform object detection. This network enables multi-scale detection, achieving an mAP of 48.44, and is particularly effective for objects in the driving corridor area.
  • We validate the proposed method through rigorous experiments and ablation studies. Our experiments demonstrate that the MLS-based method improves the mAP of downstream object detection tasks by 1.37, while the Doppler dimension, which is unique to 4D mmWave radar, significantly enhances the object detection performance.

2. Related Works

2.1. Object Detection

Object detection is an important computer vision task that involves detecting instances of visual objects of a certain class (such as humans, animals, or cars) in digital images [7]. This task encompasses both classification and precise localization, making it one of the most challenging problems in computer vision. In recent years, the advent of deep learning has significantly advanced object detection, yielding remarkable improvements in performance.
Before the deep learning era, object detection primarily relied on sliding window methods and hand-crafted features (such as HOG), combined with classifiers like SVM. These methods faced limitations in terms of computational efficiency and detection accuracy. The Histogram of Oriented Gradients (HOG) feature descriptor [8] was proposed by N. Dalal and B. Triggs in 2005. To balance feature invariance and nonlinearity, the HOG descriptor was designed to be computed on a dense grid of uniformly spaced cells and to use overlapping local contrast normalization.
Detectors based on deep learning can be divided into two categories: two-stage detectors and one-stage detectors.
Two-stage object detectors are a class of algorithms that are widely used in the field of computer vision. Their primary characteristic is the division of the detection process into two consecutive stages: first generating candidate regions and then classifying and refining the localization of these regions. Representative works in this field include R-CNN and Fast R-CNN. R-CNN [9] was proposed by Girshick R et al. It first uses selective search to extract candidate regions from the image and then scales each candidate region to a fixed size, inputs them into a CNN to extract features, and finally uses support vector machines (SVM) for classification and a regressor to refine the bounding box positions. Based on R-CNN, Fast R-CNN [10] performs CNN feature extraction on the entire image once and then directly operates on candidate regions on the feature map, avoiding repeated feature computation for each candidate region, thereby significantly improving the detection speed.
Single-stage detectors are a class of object detection methods that directly predict object classes and bounding boxes from an input image without a separate candidate region generation step. These methods perform end-to-end detection by regressing and classifying on a dense grid of predefined anchor boxes or feature map locations. Representative works include the YOLO series and SSD.
The You Only Look Once (YOLO) series was proposed by Redmon J et al. [11]. It formulates detection as a single regression problem by dividing the image into a fixed grid, where each grid cell predicts bounding boxes and class probabilities, leading to very fast detection speeds. Proposed by Liu W et al. [12], the single shot multibox detector (SSD) detects objects at multiple scales by utilizing feature maps of different resolutions, thereby balancing speed and accuracy for objects of various sizes. CenterPoint [13] is a center-based method for 3D object detection and tracking, primarily designed for processing LiDAR point cloud data. Its center-based detection method replaces the traditional anchor-based or bounding box detection approaches. By directly predicting the object center and regressing its 3D attributes (such as dimensions, orientation, velocity, etc.), the detection process is significantly simplified and the efficiency is enhanced.

2.2. LiDAR Point-Cloud-Based Object Detection

Prior to the emergence of 4D mmWave radar technology, the computer vision community had pioneered foundational architectures for 3D object detection utilizing LiDAR point clouds. These seminal frameworks, originally developed for high-density LiDAR data, have not only served as critical architectural frameworks but also provided substantial methodological guidance for the adaptation of detection pipelines to 4D mmWave radar’s sparse data characteristics. Contemporary deep learning approaches for 3D perception from point cloud data are broadly categorized into three principal paradigms: (1) voxel-based representations that discretize a 3D space into structured grids, (2) projection-based techniques employing multi-view transformations to 2D feature spaces, and (3) point-based approaches operating directly on raw point clouds [14].

2.2.1. Voxel-Based Methods

Voxel-based methods convert irregular point clouds into compact voxel representations and use 3D convolutional neural networks (CNNs) to effectively extract point features for object detection. Li [15,16] proposed a detection technique based on fully convolutional networks, extended to 3D and applied to point cloud data. Traditional 3D CNNs struggle to learn local features at different scales effectively.
Zhou et al. [17] eliminated the need for manual feature engineering on 3D point clouds and proposed VoxelNet, a general-purpose 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. VoxelNet divides the point cloud data uniformly into 3D spatial voxels and uses an innovative voxel feature encoding (VFE) layer to map the set of points within each voxel to a consistent feature vector. This approach effectively converts discrete point cloud data into continuous volumetric descriptions, enhancing the spatial structural information of the data.
Yan et al. [18] proposed SECOND, an improved sparse convolution method based on voxel-based 3D convolutional networks. This method significantly enhances the speed of both training and inference. It introduces a novel form of angular loss regression to improve the directional estimation performance, as well as a new data augmentation technique to accelerate convergence and improve its performance. The proposed network achieved state-of-the-art results on the KITTI 3D object detection benchmark while maintaining fast inference speeds.
To accelerate inference, Lang et al. [19] used point pillar encoding instead of the original voxel encoding. The proposed model runs two to four times faster than the previous state of the art.

2.2.2. Projection and Multi-View Methods

This type of method converts sparse point clouds into dense representations in the front view or bird’s eye view (BEV), allowing for the direct application of convolutional networks or 2D object detection methods. Li et al. [20] proposed the VeloFCN method. This method is applied to object detection tasks based on distance data from the Velodyne 64E LiDAR, and the data are presented in the form of 2D point maps. A single 2D end-to-end fully convolutional network is used to simultaneously predict the object confidence and boundaries. By designing bounding box encoding, this method enables the prediction of complete 3D bounding boxes with a 2D convolutional network [21]. Beltrán et al. [22] used BEV images as input for pedestrian and cyclist detection with the BirdNet method, introducing a new encoding scheme for the BEV projection of 3D point clouds. This method adapts to the latest convolutional neural network (CNN) framework to process point data in real time.

2.2.3. Raw-Point-Cloud-Based Methods

In the field of object detection, a common approach is to directly process the raw point cloud data, which typically involves two main backbone network architectures: PointNet [23] and its extended version, PointNet++ [24], or graph-based neural networks. The goal of these methods is to preserve the original geometric structure of the point cloud data, but they are less efficient than voxel-based methods when retrieving points from a 3D space. PointNet is the first network architecture to directly process raw point clouds for object detection. It extracts features from the point cloud by using a max pooling operation and symmetric functions, addressing the issue of the lack of inherent order in point cloud data. PointNet primarily focuses on global features and individual point features, while lacking the ability to perceive the detailed features of local regions. To overcome this limitation, inspired by 2D CNNs, Qi et al. introduced PointNet++, which is composed of the basic feature extraction blocks of PointNet. It extracts the deep semantic features of the target by cascading and combining sampling layers, grouping layers, and feature extraction modules. The network uses two strategies, multi-scale grouping (MSG) and multi-resolution grouping (MRG), to ensure the extraction of local point cloud features. Shi et al. [25] proposed PointRCNN, which uses PointNet++ as the point cloud encoder. It then generates 3D proposals based on the extracted semantic and geometric features and refines the proposed 3D bounding boxes in the second stage. Yang et al. [26] proposed a two-stage 3D object detection framework called Sparse-to-Dense 3D Object Detector for Point Cloud (STD). The first stage is a bottom-up proposal generation network that uses raw point clouds as input. In the second stage of proposal box prediction, to improve the localization accuracy, the model employs parallel intersection over union (IoU) branches, further enhancing its performance.

2.3. Object Detection with 4D mmWave Radar

Research on object detection and tracking using 4D mmWave radar primarily focuses on addressing issues related to noise interference and sparsity. Paek et al. [27] constructed a new large-scale object detection dataset, KAIST-Radar (K-Radar), and provided a 4DRT-based object detection benchmark neural network. This work demonstrated that 4D mmWave radar is a more robust sensor under harsh weather conditions. Lee et al. [28], in an effort to overcome the scarcity of labeled radar data, transformed rich LiDAR data into radar-like point cloud data and introduced novel augmentation techniques for radar-only learning problems to accelerate model convergence. Palmer et al. [29] overcame the sparsity of point clouds by accumulating radar point clouds over consecutive time steps. They demonstrated that vehicle motion estimation and dynamic motion correction methods could improve the object detection performance.
Moreover, the current research faces the challenge of the insufficient utilization of features. Yan et al. [30] proposed an end-to-end, anchor-free, single-stage 3D object detection framework. By reweighting foreground and background points, the model enhanced feature learning and introduced Doppler velocity and reflectivity data, significantly improving the detection performance for small moving objects. Bai et al. [31] combined vector attention and scalar attention to fully leverage the spatial, Doppler, and reflectivity information in the radar point cloud, achieving the fusion of local and global features.
To overcome the sparsity of 4D mmWave radar point clouds, some researchers have also adopted multi-sensor fusion methods for object detection. Cui et al. [32] proposed a convolutional neural network with a cross-fusion strategy for 3D road vehicle detection. This method integrates features extracted from the front view and BEV representations generated from images and 4D mmWave radar, achieving comprehensive perception. Xiong et al. [33] explored a view transformation strategy for 3D object detection based on the fusion of camera and 4D mmWave radar. Specifically, they voxelized the input radar point cloud into cylinders and extracted multi-scale BEV features from the voxelized cylinders. Some of these BEV features were fused with the BEV features from the image, while the other part was used to generate a fused BEV feature map. Zheng et al. [34] proposed RCFusion, a simple and effective 3D object detection network for camera and 4D radar fusion. Using the proposed image encoder, orthographic feature transform (OFT), and the radar encoder, radar PillarNet, radar–camera fusion is achieved.

3. Method

The overall architecture of the proposed method is shown in Figure 1. The multi-frame fusion point cloud is input into a point cloud upsampling module based on the MLS method to obtain a densified multi-frame 4D mmWave radar point cloud. Then, the point cloud is pillarized and input into a pillar feature network. The pseudo-images generated by the pillar feature network are input into the backbone network, and, after feature extraction, multi-scale feature maps are generated. We use an SSD detection head to perform object detection on the feature maps, outputting the predicted bounding boxes.

3.1. Point Cloud Densification

Compared to LiDAR, 4D mmWave radar exhibits significant differences in point cloud density. LiDAR point clouds are characterized by their high density, enabling the clear representation of object shapes. In contrast, 4D mmWave radar point clouds are notably sparse, making it challenging to perform object detection tasks. Therefore, densifying 4D mmWave radar point clouds is crucial in accurately estimating information such as the direction and shape of objects.
This paper proposes an upsampling method for 4D mmWave radar point clouds based on the MLS method. We conducted a point cloud density enhancement experiment using multi-frame point cloud data from the VoD dataset. The multi-frame point cloud data are obtained by fusing point clouds from multiple adjacent frames based on inertial navigation component measurements, which can preliminarily increase the point cloud density, enhance the object feature saliency, and facilitate subsequent processing. A comparison between single-frame and multi-frame point cloud data is shown in Figure 2.
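As a rough illustration, the multi-frame fusion step can be sketched as follows, assuming that a 4 × 4 ego pose from the inertial navigation system is available for each frame; the function and argument names are ours for illustration and do not correspond to the VoD toolkit.

```python
import numpy as np

def fuse_frames(frames, poses_to_world, num_frames=5):
    """Fuse the last `num_frames` radar frames into the newest frame's coordinates.

    frames: list of (N_i, D) arrays, ordered oldest to newest; columns 0:3 are x, y, z.
    poses_to_world: list of (4, 4) ego poses mapping each frame into a common world frame.
    Returns an (N, D + 1) array whose last column is a relative-time channel
    (0 for the current frame, -1 for the previous frame, and so on).
    """
    frames, poses = frames[-num_frames:], poses_to_world[-num_frames:]
    world_to_current = np.linalg.inv(poses[-1])

    fused = []
    for offset, (pts, pose) in enumerate(zip(frames, poses)):
        rel_time = offset - (len(frames) - 1)            # ..., -2, -1, 0
        xyz1 = np.c_[pts[:, :3], np.ones(len(pts))]      # homogeneous coordinates
        xyz = (world_to_current @ pose @ xyz1.T).T[:, :3]
        fused.append(np.c_[xyz, pts[:, 3:], np.full((len(pts), 1), rel_time)])
    return np.vstack(fused)
```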
The MLS method is a technique for fitting a curve or surface to a set of discrete data points. It addresses the issues of traditional least squares fitting, such as the need for piecewise fitting, complex data processing workflows, and the resulting discontinuities between adjacent segments. In addition to its simplicity and smoothness, MLS has several other advantages, including its ability to handle noisy inputs and the ease of computing differential geometric properties of surfaces, such as normals and curvature. Assuming that the true value of the curve or surface at a node $x$ is $u_{\text{node}}$, the goal of MLS is to approximate this value within a local neighborhood of the node. The approximation is based on data collected from nearby sampling points $x_i$ and their corresponding values $u_i$, and is expressed as a set of basis polynomials multiplied by their coefficients, as shown below:
u_{\text{node}}(x) = \sum_{i=1}^{m} \mathrm{poly}_i^{T}(x)\, a_i(x)
$\mathrm{poly}_i$ represents the $i$-th term of the fitted polynomial, while $a_i$ denotes the coefficient associated with the $i$-th term.
MLS differs from traditional least squares in that its fitting function consists of a set of coefficients $a_j(x_{\text{node}})$ and a set of polynomials $\mathrm{poly}_j$, rather than a single polynomial function fitted to all points. This allows MLS to focus only on nearby sampling points, eliminating the need to consider points far from the node, with the contribution of each sample point weighted according to its distance (closer points contribute more). In addition to polynomial functions, researchers have also explored trigonometric and other basis functions for fitting, but polynomials remain the most commonly used.
Table 1 lists commonly used polynomial functions for different dimensions and orders.
For surface fitting problems, a two-dimensional, second-order polynomial function is used. The specific mathematical expression is as follows:
z = ax^2 + by^2 + cxy + dx + ey + f
The parameters $a_i$ of the fitting function can be determined by minimizing the weighted norm of the difference between the estimated values at the node and the values at the sampling points $u_i$. The optimization model $J$ can be expressed as
J_x(a) = \frac{1}{2} \sum_i w_i(x_i, x) \left[ \mathrm{poly}^{T}(x_i)\, a - u_i \right]^2
Each sampling point’s contribution to the approximation is controlled by the weighting function $w_i(x_i, x)$, which ensures that closer sampling points contribute more to the objective function $J$. Solving this optimization problem yields the fitted surface. For each point, once the fitted surface is obtained, a new point is added to the point cloud. The $(x, y)$ coordinates of the newly generated point are drawn at random within a circular domain of radius $R_{\text{sample}}$ centered at the original coordinates. The pseudo-code of our algorithm is shown in Appendix A, Algorithm A1.
The effectiveness of the MLS algorithm is illustrated in Figure 3.
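To make the procedure concrete, the following is a minimal sketch of the MLS upsampling step, with a k-d tree used for the neighbor search as in the first stage of our pipeline; the Gaussian weighting function, neighborhood size k, and the way features are copied to new points are illustrative assumptions, not the exact settings of Algorithm A1.

```python
import numpy as np
from scipy.spatial import cKDTree

def mls_upsample(points, k=16, r_sample=0.3, h=1.0, rng=None):
    """Densify a radar point cloud by fitting a local quadratic surface (MLS)
    around each point and inserting one new point per fit.

    points: (N, D) array; columns 0:3 are x, y, z, and the remaining columns
    (RCS, velocities, time, ...) are copied from the query point to the new point.
    """
    rng = np.random.default_rng() if rng is None else rng
    tree = cKDTree(points[:, :3])
    new_points = []

    for p in points:
        _, idx = tree.query(p[:3], k=min(k, len(points)))
        nb = points[idx, :3]
        x, y, z = nb[:, 0] - p[0], nb[:, 1] - p[1], nb[:, 2]

        # Gaussian weights: closer neighbours contribute more to the objective J.
        w = np.exp(-(x**2 + y**2) / h**2)
        sw = np.sqrt(w)
        # Design matrix for z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f,
        # solved as a weighted least squares problem.
        A = np.stack([x**2, y**2, x * y, x, y, np.ones_like(x)], axis=1)
        coeff, *_ = np.linalg.lstsq(A * sw[:, None], sw * z, rcond=None)

        # Sample a new (x, y) offset uniformly inside a disc of radius r_sample
        # and evaluate the fitted surface there to obtain the new z value.
        ang = rng.uniform(0.0, 2.0 * np.pi)
        rad = r_sample * np.sqrt(rng.uniform())
        dx, dy = rad * np.cos(ang), rad * np.sin(ang)
        z_new = np.array([dx**2, dy**2, dx * dy, dx, dy, 1.0]) @ coeff
        new_points.append(np.r_[p[0] + dx, p[1] + dy, z_new, p[3:]])

    return np.vstack([points, np.array(new_points)])
```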

3.2. Pillarization and Backbone

The input point cloud has 7 dimensions: x, y, z, rcs, vr, vrcomp, and time, representing the spatial coordinates, radar cross-section, relative radial velocity, relative radial velocity compensated for the ego-vehicle speed, and temporal information. The backbone network is divided into three parts: point-pillar-based pseudo-image generation, a 2D convolutional backbone network, and an SSD detection head. In point-pillar-based pseudo-image generation, the xy-plane is first divided into a grid and point pillars are constructed. Unlike voxel-based methods, no hyperparameters are required to divide the point cloud along the z-axis. Then, each point in a pillar is augmented with six additional dimensions, $x_c$, $y_c$, $z_c$, $x_p$, $y_p$, and $z_p$, where the subscript $c$ indicates the offset from the arithmetic mean of all points in the pillar, and the subscript $p$ indicates the offset from the center of the point pillar. The dimension of each point is denoted as D. Due to the sparsity of 4D mmWave radar point clouds, most pillars are empty. To address this, a constraint is applied on the number of non-empty pillars per sample (P) and the number of points per pillar (N) to create a dense tensor. If the number of points in a pillar exceeds N, random sampling reduces it to N; if it is fewer than N, the remainder is padded with zeros.
Assuming that there are P non-empty pillars in each sample and N points in each pillar, the sample can be represented as a tensor of size (D, P, N). After tensorization, a simplified version of PointNet is used to process the tensorized point cloud data and extract features. For each point, a linear layer is applied, followed by BatchNorm and ReLU. After feature extraction, the dimension of each point is reduced from D to C, resulting in a tensor of size (C, P, N).
Next, max pooling is performed over the points within each pillar, producing a feature map of size (C, P). Finally, the (C, P) features are scattered back to the positions of the point pillars to form pseudo-image features: the P dimension is unfolded into (H, W), yielding a feature representation of shape (C, H, W), which is similar to an image. After generating the pseudo-image, it is input into the backbone network. The structure of the backbone network is shown in Figure 4. The backbone network consists of two sub-networks: a top-down network that gradually reduces the resolution of the feature map while increasing its dimensionality, and a second network that upsamples three feature maps to the same size.
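A minimal PyTorch sketch of the pillar feature extraction and pseudo-image scatter described above is given below. The module and argument names are ours, and the input dimension of 13 (the 7 raw channels plus the 6 offset channels) follows the description above rather than an exact implementation.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet over pillars: (B, D, P, N) -> pseudo-image (B, C, H, W)."""

    def __init__(self, in_dim=13, out_dim=64):
        super().__init__()
        # A 1x1 convolution acts as a per-point linear layer shared across pillars.
        self.linear = nn.Conv2d(in_dim, out_dim, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, pillars, coords, grid_hw):
        # pillars: (B, D, P, N) padded/sampled points; coords: (B, P, 2) pillar
        # indices (row, col) on the BEV grid; grid_hw: (H, W).
        b = pillars.shape[0]
        h, w = grid_hw
        feats = self.relu(self.bn(self.linear(pillars)))   # (B, C, P, N)
        feats = feats.max(dim=3).values                    # max over points -> (B, C, P)

        canvas = feats.new_zeros(b, feats.shape[1], h, w)  # empty pseudo-image
        for i in range(b):
            canvas[i, :, coords[i, :, 0], coords[i, :, 1]] = feats[i]
        return canvas
```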
The top-down backbone can be represented by a series of blocks (S, L, F). Each block has a stride of S relative to the original input. It consists of L 3 × 3 2D convolutional layers, with an output dimension of F. Each convolutional layer is followed by BatchNorm and a ReLU activation function. Due to the different receptive fields of feature maps at different resolutions, this architecture can detect objects of varying sizes. After upsampling the three feature maps, 2D convolution is applied to output the final features.
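The top-down blocks and the upsampling branch can likewise be sketched as follows; the specific strides, layer counts, and channel widths shown are common PointPillars-style defaults and should be read as assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_layers, stride):
    """One (S, L, F) block: a strided 3x3 conv followed by L - 1 further 3x3 convs,
    each with BatchNorm and ReLU."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

class Backbone2D(nn.Module):
    """Top-down backbone plus upsampling of the three maps to a common resolution."""

    def __init__(self, in_ch=64):
        super().__init__()
        self.down1 = conv_block(in_ch, 64, 4, stride=2)     # stride 2
        self.down2 = conv_block(64, 128, 6, stride=2)       # stride 4
        self.down3 = conv_block(128, 256, 6, stride=2)      # stride 8
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 128, 1, stride=1),
                                 nn.BatchNorm2d(128), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128, 128, 2, stride=2),
                                 nn.BatchNorm2d(128), nn.ReLU())
        self.up3 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, stride=4),
                                 nn.BatchNorm2d(128), nn.ReLU())
        # Final 2D convolution applied to the concatenated, upsampled maps.
        self.out = nn.Sequential(nn.Conv2d(3 * 128, 3 * 128, 3, padding=1),
                                 nn.BatchNorm2d(3 * 128), nn.ReLU())

    def forward(self, x):                                   # x: pseudo-image (B, C, H, W)
        f1 = self.down1(x)
        f2 = self.down2(f1)
        f3 = self.down3(f2)
        # Upsample all three maps to the stride-2 resolution and concatenate.
        fused = torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)
        return self.out(fused)
```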

3.3. Detection Head

We utilize the single shot detector (SSD) detection head for 3D object detection on feature maps. The SSD is a one-stage detection method, which is simpler compared to two-stage methods. It eliminates the need for region proposal generation and feature resampling. All computations are encapsulated within a single network. The SSD is easy to train and can be integrated into other detection systems as a detection head. For each layer of multi-scale feature maps and existing feature layers from the base network, the detector uses a set of convolution kernels to generate a fixed set of predictions. For feature layers, convolution can produce a class score or offsets relative to anchors.
In the multi-feature map component of the detector, each feature map cell is assigned a series of anchors. Within each feature map cell, the detector predicts offsets relative to the shapes of the anchor in the cell and provides class scores or confidences for objects within each anchor. For three different object categories—cars, pedestrians, and bicycles—anchors of varying sizes are set, with two anchors placed for each object category per cell.
The dimensions of the ground truth boxes and anchors are $(x, y, z, w, l, h, \theta)$. The localization regression residuals between the ground truth and anchors are defined as
\Delta x = \frac{x_{gt} - x_a}{d_a}, \quad \Delta y = \frac{y_{gt} - y_a}{d_a}, \quad \Delta z = \frac{z_{gt} - z_a}{d_a}
\Delta w = \log \frac{w_{gt}}{w_a}, \quad \Delta l = \log \frac{l_{gt}}{l_a}, \quad \Delta h = \log \frac{h_{gt}}{h_a}
\Delta \theta = \sin(\theta_{gt} - \theta_a)
$x_{gt}$ and $x_a$ denote the ground truth and anchor values, respectively (the same convention applies to the other dimensions), and $d_a$ is the diagonal of the anchor box:
d_a = \sqrt{w_a^2 + l_a^2}
The total localization loss is
L_{\text{loc}} = \sum_{b \in (x, y, z, w, l, h, \theta)} \text{SmoothL1}(\Delta b)
Because the angle loss alone cannot distinguish flipped bounding boxes, a softmax classification loss over discretized orientations, $L_{\text{dir}}$, is added so that the network can learn the heading direction. For object classification, a focal loss is used:
L_{\text{cls}} = -\alpha_a \left(1 - p^a\right)^{\gamma} \log p^a
The total loss function is
L = \frac{1}{N_{\text{pos}}} \left( \beta_{\text{loc}} L_{\text{loc}} + \beta_{\text{cls}} L_{\text{cls}} + \beta_{\text{dir}} L_{\text{dir}} \right)
$N_{\text{pos}}$ is the number of positive anchor boxes, and the loss weights are $\beta_{\text{loc}} = 2$, $\beta_{\text{cls}} = 1$, and $\beta_{\text{dir}} = 0.2$.
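For reference, a minimal PyTorch sketch of the box encoding and the combined loss defined above is shown below; the simplified single-$\alpha$ focal loss and the function names are our own illustrative choices rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def box_residuals(gt, anchors):
    """Encode (x, y, z, w, l, h, theta) ground-truth boxes against anchors
    using the residuals defined above."""
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(-1)
    xa, ya, za, wa, la, ha, ta = anchors.unbind(-1)
    da = torch.sqrt(wa**2 + la**2)                      # anchor diagonal d_a
    return torch.stack([(xg - xa) / da, (yg - ya) / da, (zg - za) / da,
                        torch.log(wg / wa), torch.log(lg / la), torch.log(hg / ha),
                        torch.sin(tg - ta)], dim=-1)

def detection_loss(pred_res, gt_res, cls_logits, cls_targets, dir_logits, dir_targets,
                   num_pos, alpha=0.25, gamma=2.0, b_loc=2.0, b_cls=1.0, b_dir=0.2):
    # Localization: SmoothL1 over the encoded residuals.
    loc = F.smooth_l1_loss(pred_res, gt_res, reduction="sum")
    # Classification: focal loss (single alpha used here for simplicity).
    p = torch.sigmoid(cls_logits)
    pt = torch.where(cls_targets > 0, p, 1 - p)
    cls = (-alpha * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum()
    # Direction: softmax cross-entropy over discretized heading bins.
    direc = F.cross_entropy(dir_logits, dir_targets, reduction="sum")
    return (b_loc * loc + b_cls * cls + b_dir * direc) / max(num_pos, 1)
```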

4. Experiment

We conducted comparative experiments between the method proposed in this paper and other methods, as well as ablation experiments. Figure 5 shows the results for our method.

4.1. Dataset

The data used for network training come from the View of Delft (VoD) dataset. The VoD dataset was collected and released by Palffy et al. [35] from TU Delft. It contains 8693 frames of synchronized and calibrated 64-line LiDAR, stereo camera, and 4D mmWave radar data captured in complex urban traffic scenarios. The dataset includes annotations of 123,106 3D bounding boxes for both moving and static objects, comprising 26,587 pedestrian, 10,800 cyclist, and 26,949 vehicle labels.
The dataset features a rich set of sensors, including stereo cameras, LiDAR, and 4D mmWave radar. The ZF FRGen21 4D mmWave radar has a sampling frequency of approximately 13 Hz and is mounted behind the front bumper. The stereo camera is mounted on the windshield; it has a resolution of 1936 × 1216 px and a sampling frequency of approximately 30 Hz. An HDL-64 LiDAR is mounted on the roof. The vehicle is also equipped with an RTK sensor, GPS, IMU, and wheel odometer for inertial navigation, with a sampling frequency of about 100 Hz. All sensors are jointly calibrated.
The VoD dataset features temporal synchronization of the sensors. Using the LiDAR sensor timestamp as the reference, the closest corresponding camera, radar, and speed information is selected, with the maximum time difference set to 0.05 s. The time difference between adjacent frames is approximately 0.1 s. After temporal synchronization, the data are organized into continuous frame segments, each with an average length of about 40 s. Additionally, for the LiDAR and 4D mmWave radar point clouds, the dataset applies ego-motion compensation to eliminate errors caused by time differences between the sensors during data collection.
The VoD dataset provides detailed annotations for the objects. For LiDAR and camera, the dataset uses 6-degree-of-freedom 3D bounding boxes to annotate object classes. For each object, two types of occlusion, “spatial occlusion” and “illumination occlusion”, as well as activity attributes, including “stopped”, “moving”, “parked”, and “waiting”, are labeled. Furthermore, each object is assigned a unique object ID across data frames, making the dataset suitable for tracking and prediction tasks.

Implementation Details

We evaluated the object detection performance for three object categories: cars, pedestrians, and cyclists. Following the KITTI benchmark, we used the average precision (AP) as the metric for object detection performance. The intersection over union (IoU) between the predicted 3D bounding boxes and the annotated ground truth boxes was calculated. Predictions with an IoU greater than 50% were considered positive for the car category, while an IoU threshold of 25% was used for the pedestrian and cyclist categories. The mean average precision (mAP) was calculated by averaging the AP values over the categories.
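As an illustration, the sketch below computes the AP sampled at 40 recall positions (the Recall 40 Position metric used in Section 4.2), assuming that detections have already been matched to the ground truth at the class-specific IoU threshold; the function name and inputs are ours.

```python
import numpy as np

def average_precision_r40(scores, is_tp, num_gt):
    """AP sampled at 40 recall positions, given per-detection confidence scores
    and true-positive flags obtained after IoU matching."""
    order = np.argsort(-scores)                     # sort detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)

    ap = 0.0
    for r in np.linspace(1 / 40, 1.0, 40):          # 40 equally spaced recall points
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 40
    return ap
```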
We took both static and moving objects into consideration. The dataset was split into train, validation, and test sets with a ratio of 59%/15%/26%. Objects from the same frame appeared only in one of these splits. This ensured that the annotated objects of the three categories (including both static and moving objects) were proportionally distributed across the training, validation, and test sets.
The learning rate for the experiment was set at 0.003, with a total of 20 epochs and a batch size of 4.
We designed the experiment based on the characteristics of the 4D mmWave radar: it has higher angular accuracy in the center of its field of view and lower angular accuracy, with higher noise, toward both sides. Therefore, we conducted separate statistical analyses of the detection accuracy for all labeled objects and for adjacent-lane objects within a lateral distance of 4 m and a longitudinal distance of 25 m.
Specifically, the experimental items included single-frame point clouds without radar cross-sections (RCS), single-frame point clouds without velocity information, original single-frame point clouds, five-frame fused point clouds, and upsampled point clouds using the MLS method.
The single-frame point cloud served as the raw data, possessing six dimensions, including the spatial coordinates, relative radial velocity, velocity after ego-vehicle speed compensation, and RCS. Point clouds without RCS retain the spatial coordinates, relative radial velocity, and velocity after ego-vehicle speed compensation. Point clouds without velocity information retain the spatial coordinates and RCS. Five-frame fused point clouds were obtained by projecting the previous four frames onto the current frame. These point clouds possess temporal information, which is input into the network as an additional point cloud dimension; the temporal value is 0 for the current frame, −1 for the previous frame, and decreases as the distance from the current frame increases. Point clouds upsampled with the MLS method have the spatial coordinates, RCS, velocity information, and temporal information.
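For clarity, the sketch below shows how the input feature columns could be selected for each ablation variant; the column ordering is an assumption for illustration and may differ from the actual data loader.

```python
import numpy as np

# Column layout assumed for illustration: x, y, z, rcs, v_r, v_r_comp, time.
FEATURES = {
    "full":        [0, 1, 2, 3, 4, 5, 6],   # MLS / five-frame point clouds
    "no_rcs":      [0, 1, 2, 4, 5],         # drop the radar cross-section
    "no_velocity": [0, 1, 2, 3],            # drop both radial velocity channels
    "single":      [0, 1, 2, 3, 4, 5],      # single frame: no time channel
}

def select_features(points, variant):
    """Build the network input for one ablation variant."""
    return points[:, FEATURES[variant]]
```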

4.2. Performance Comparison on the VoD Dataset

Currently, methods suitable for 4D mmWave radar-based single-modal object detection include SECOND, LXL, and CenterPoint [13]. For multi-modal detection methods, there are approaches like RCFusion, which integrates 4D mmWave radar with vision-based methods.
We compare the proposed algorithm with other algorithms on the VoD dataset and calculate the average precision. The calculation of the average precision uses the Recall 40 Position method. The comparison results are shown in Table 2.
The experiments demonstrate that, when the training data are multi-frame fused point clouds, the mean average precision (mAP) of the method proposed in this paper is 47.07, outperforming other single-modal methods in terms of detection performance. When training is conducted using point clouds after upsampling, the mAP reaches 48.44, approaching the performance of RCFusion, which is a fusion method.
Figure 6 illustrates the relationship between the average precision of the model trained on upsampled point clouds and the IoU threshold.
Pedestrians and cyclists have smaller volumes and fewer points in their point clouds, leading to a faster decline in AP as the IoU threshold increases. Cars, on the other hand, have larger volumes and more points, resulting in a slower decline in AP with increasing IoU thresholds. Therefore, in our experimental settings, we adopt an IoU threshold of 0.5 for cars and 0.25 for pedestrians and cyclists.

4.3. Ablation Study

4.3.1. Multi-Frame Point Cloud Fusion and MLS

The experimental results for single-frame point clouds, multi-frame fused point clouds, and point clouds upsampled using the MLS method are shown in Table 3.
In the ablation studies, the mAP for the entire area of the fused five-frame point cloud reached 47.07, representing an improvement of 4.7 over the 42.37 mAP achieved with a single-frame point cloud. Meanwhile, the AP for all object categories was also higher than that of the single-frame point cloud results, with improvements of 4.47 for cars, 4.35 for cyclists, and 5.29 for pedestrians. These results indicate that multi-frame point cloud fusion can effectively increase the point cloud density and enhance the network’s feature extraction capabilities.
In complex traffic scenarios, vehicles typically travel at high speeds, and the long time intervals between frames can introduce errors, affecting the detection accuracy. To systematically investigate the relationship between the number of frames in multi-frame point clouds and the object detection performance, we conducted a series of experiments, whose results are presented in Figure 7. Our findings reveal two distinct phases of performance improvement: when accumulating up to three frames, the mean average precision (mAP) shows a rapid improvement with increasing frame numbers, demonstrating the significant benefits of multi-frame integration. However, this enhancement progressively slows between three and five frames, indicating that the benefits of multi-frame accumulation begin to diminish beyond this threshold. This empirical evidence suggests that a five-frame approach presents an optimal balance between computational efficiency and detection accuracy.
Based on multi-frame point cloud fusion, we further increased the point cloud density using MLS. After MLS-based point cloud upsampling, the mAP for the entire area improved by 1.37, reaching 48.44.
To assess the statistical significance of the moving least squares (MLS) methodology, we performed hypothesis testing, which evaluates whether the observed data provide sufficient evidence to reject a null hypothesis ($H_0$) in favor of an alternative hypothesis ($H_1$). The null hypothesis represents the default assumption, and the p-value is the probability of observing the current sample data under the assumption that the null hypothesis is true; when the p-value falls below the significance threshold of $\alpha = 0.05$, the observed data are considered sufficiently inconsistent with the null hypothesis. Here, the null hypothesis is that there is no significant difference between the results obtained with and without MLS, while the alternative hypothesis is that the MLS method yields a statistically significant improvement in object detection. The test yielded $p = 0.0098$, well below the threshold of $\alpha = 0.05$, indicating that the improvement brought by MLS is statistically significant.
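The sketch below illustrates such a test, assuming matched AP measurements (e.g., per class or per run) obtained with and without MLS; the article does not prescribe the exact test, so the paired two-sided t-test here is an assumption.

```python
from scipy import stats

def mls_significance(ap_without_mls, ap_with_mls, alpha=0.05):
    """Two-sided paired t-test on matched AP measurements obtained with and
    without MLS densification. H0: the two methods perform the same."""
    t_stat, p_value = stats.ttest_rel(ap_with_mls, ap_without_mls)
    return p_value, p_value < alpha   # reject H0 when p < alpha
```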
Among the object categories, the AP for pedestrians and cyclists was higher than in the five-frame fused point cloud results, with increases of 0.37 and 4.21, respectively, while the AP for cars declined slightly. Point cloud upsampling further increased the point cloud density. For objects such as cyclists and pedestrians, which originally had fewer points and less pronounced spatial features, the MLS-based upsampling enhanced their spatial features and improved the detection performance of the network.

4.3.2. The Influence of the Detection Head

The detection head constitutes a critical component in object detection networks, significantly impacting the detection performance. To validate the effectiveness of the SSD detection head, we conducted comparative experiments between the SSD detection head and the centerhead, with the experimental results detailed in Table 4.
The experiments show that the mAP of the SSD is 5.6 higher than that of the centerhead. The SSD’s anchor boxes, configured through preset parameters, cover the potential positions and sizes of targets. In scenarios where 4D millimeter-wave radar point clouds are sparse and exhibit significant noise, the explicit spatial priors provided by the anchor boxes effectively constrain the search space and reduce false detections. The centerhead’s goal is to produce a heatmap peak at the center location of any detected object [13]. However, the sparsity and noise issues of 4D millimeter-wave radar weaken the features of the center points in the heatmaps, making it difficult to predict the center points from the heatmaps. Moreover, the centerhead must simultaneously regress multiple attributes, such as the center point, dimensions, orientation, and velocity, and the positional noise in radar point clouds can amplify regression errors, especially in low-density regions, leading to an overall performance decline.

4.3.3. The Influence of the Input Dimensions

Due to differences in the measurement principles and data processing methods, the data dimensions of 4D mmWave radar differ from those of LiDAR point clouds and stereo camera point clouds. Specifically, 4D mmWave radar provides objects’ velocity information (Doppler information) and RCS information. Adding this information to the three-dimensional coordinates allows for the better characterization of object features, which is beneficial in improving the object detection performance. To verify the role of Doppler information and RCS information in object detection, we designed ablation experiments, with the results presented in Table 5.
The results of the ablation study indicate that removing the velocity information, i.e., both the relative radial velocity and the velocity compensated for the ego-vehicle's speed, significantly degrades the performance for the pedestrian and cyclist categories. Specifically, the AP for pedestrians drops from 31.38 to 22.62, while the AP for cyclists decreases from 60.36 to 41.29.
The experimental results indicate that removing RCS has a limited impact on network performance. The removal of RCS primarily affects the AP of cars, causing a decline from 35.37 to 29.86. Although there is an overall accuracy decrease after removing RCS, the detection accuracy for pedestrians and cyclists improved. This could be attributed to the fact that the metal material of cars makes their RCS characteristics prominent, whereas pedestrians and cyclists have similar RCS ranges and are more likely to be influenced by factors such as clothing materials.

4.3.4. Entire Area and Driving Corridor

In autonomous driving scenarios, the object information from adjacent lanes to the ego-vehicle plays a crucial role in decision making. We define the area with a lateral distance of 4 m from the ego-vehicle’s coordinate system (−4 < x < 4) as the driving corridor. The object detection performance within the entire area and the driving corridor has been statistically analyzed, with the results presented in Table 6 and Table 7.
Compared to the object detection performance across the entire region, the object detection performance of each method within the driving corridor area showed a significant improvement. The mAP of each method across the entire region was 42.37, 41.73, 31.69, 47.07, and 48.44, whereas, for objects in the driving corridor, the mAP was 56.73, 55.85, 48.90, 60.32, and 62.07.

4.3.5. Generalization Performance

To further verify the generalization performance of our method, we conducted experiments using TJ4DRadSet and compared the mAP values of various methods. The results are shown in Table 8.
The experimental results show that our method exhibits good generalization capabilities. Our proposed MLS-based point cloud upsampling method improves the mAP by 1.79, reaching 30.04. Due to the inclusion of truck category annotations in TJ4DRadSet, which increases the classification difficulty, our method exhibits a performance degradation compared to its results on the VoD dataset. On TJ4DRadSet, our approach outperforms SECOND and CenterPoint, achieves comparable performance to LXL-R, and underperforms compared to RCFusion.

5. Discussion

5.1. Features of MLS

While the MLS surface reconstruction framework effectively enforces geometric smoothness constraints through local polynomial approximations, this methodology fundamentally lacks mechanisms for the preservation of radiometric consistency and semantic feature coherence. For objects such as pedestrians and cyclists, which have fewer points and simpler features and are less affected by the rotation angles, surface constraints can ensure that the features of the newly added points approximate those of the original point cloud. However, for objects with numerous points, complex features, and significant rotation angle impacts, such as cars, a single surface constraint struggles to ensure similarity at the feature level. Therefore, in the ablation experiments for MLS, a slight decrease in the detection performance was observed for cars. We believe that, to improve the quality of upsampled point clouds and the object detection accuracy, it would be helpful to introduce neural-network-based point cloud upsampling methods in the future, learning the features of the original point cloud through data-driven approaches.

5.2. Comparison Between MLS and Neural Network

Prior to the proposal of our method, numerous studies had explored deep-learning-based radar point cloud upsampling. Notable works include RadarHD [36] and the “Towards Dense and Accurate Radar Perception via Efficient Cross-Modal Diffusion Model” proposed by Ruibin Zhang et al. [37]. For more general point cloud upsampling techniques, representative approaches include PU-Net [38].
RadarHD is a deep-learning-based radar point cloud upsampling method designed to enhance the resolution and density of radar sensor data, thereby providing more refined environmental perceptions for applications such as autonomous driving and smart transportation. It uses a TI AWR1843 single-chip millimeter-wave radar to enhance the point cloud resolution by integrating multimodal signal processing with deep learning. The radar’s raw ADC signals are converted into tensors, from which Doppler shift and reflection power information are extracted. Subsequently, a convolutional neural network (CNN) is designed to upsample the low-resolution radar images, producing high-density point clouds. Similarly, Cross-Modal Diffusion utilizes raw data collected by 3D radar and achieves high-quality point clouds by incorporating LiDAR supervision and employing a diffusion model for point cloud upsampling.
However, these approaches are primarily designed for 3D millimeter-wave radar systems and suffer from critical limitations: the absence of elevation information makes them difficult to effectively apply to object detection tasks in traffic scenarios. Moreover, these methods typically rely on radar tensors obtained after FFT, with the idea of replacing the traditional CFAR algorithm to achieve higher-quality point clouds. However, most mass-produced radar systems currently output point clouds using the CFAR algorithm, imposing significant limitations on the previously mentioned methods.
PU-Net is a deep-learning-based method for point cloud upsampling, aiming to generate dense and uniformly distributed point clouds from sparse inputs, addressing the challenges caused by occlusion or sensor limitations in 3D scanning. By using a four-stage framework including patch extraction, feature embedding, feature expansion, and coordinate reconstruction, a dense point cloud can be generated. However, PU-Net demonstrates better suitability for processing single-object scenarios and exhibits limited effectiveness when dealing with complex traffic environment point clouds. Furthermore, 4D millimeter-wave radar systems provide multidimensional features, including RCS and relative radial velocity measurements, which pose significant challenges in loss function design while being crucial for detection accuracy. Consequently, directly applying PU-Net to millimeter-wave radar point cloud data fails to achieve performance improvements in detection tasks.
Based on the reasons above, we opted to use MLS for point cloud densification. The RCS and relative radial velocity features of the newly generated points can be assigned based on the values of nodes, thereby preserving the original RCS and velocity features to the greatest extent possible.

5.3. Input Dimensions and Detection Performance

Doppler information has a significant impact on object detection, and we propose the following explanations for this phenomenon. Different object classes exhibit noticeable differences in velocity, leading to significant variations in their Doppler features. In particular, bicycle wheels have unique rotational characteristics, and their micro-Doppler features contribute to the strong influence of Doppler information on the detection performance. In contrast, the impact of RCS on the detection performance is limited. The primary reason is that RCS values vary significantly with the object's aspect angle, making it difficult to discern object types from RCS alone. We suggest that future research focus on improving the stability of RCS measurements to enhance the object detection performance.

5.4. Entire Area and Driving Corridor

In the ablation experiments, the mAP within the driving corridor area was significantly higher than that of the entire area. We believe that this may be attributed to the following three aspects. Firstly, the inherent measurement principle of 4D mmWave radar dictates that the angular measurement accuracy decreases as objects approach the maximum field of view, leading to reduced accuracy in the spatial coordinates of the point cloud. Within the driving corridor area, the lateral distance between objects and the radar is small, resulting in a smaller field of view and more precise measurements, which enhances the mAP. Secondly, there are more objects located within the driving corridor in the VoD dataset, allowing the neural network to better learn the object features within this area during training, thereby improving the mAP. Lastly, the point cloud generated by 4D mmWave radar is relatively sparse, and the interference caused by noise is significant. Obstacles such as guardrails and walls in the driving corridor are notably less numerous than those in the entire area, and the noise generated by these obstacles can easily interfere with object detection.

5.5. Limitations and Future Enhancement Pathways

Despite our method achieving a certain level of performance improvement, there are still some shortcomings. Our method is unable to filter out ghost points generated by multiple reflections of echoes from walls, guardrails, and similar structures. Moreover, these ghost points are also upsampled by our method, which enhances their features and interferes with object detection. When small targets such as bicycles and pedestrians approach strong reflectors like walls or guardrails, our method fails to recognize them effectively, potentially leading to false detections or missed detections.
In terms of point cloud noise, LiDAR and cameras are often compared with millimeter-wave radar. At the hardware level, LiDAR holds certain advantages. Its multipath effect is not significant, and LiDARs with more than 64 lines generate point clouds with a much higher density at close range compared to 4D millimeter-wave radar. Point cloud noise from depth cameras primarily originates from errors in depth estimation. Unlike LiDAR and 4D millimeter-wave radar, which measure distances directly, depth cameras calculate point coordinates based on disparities from multiple cameras and their intrinsic parameters. Although depth cameras do not suffer from interference caused by multipath effects and tend to have relatively low noise, the accuracy of their point coordinates is lower. Additionally, under harsh weather conditions and in low-light scenarios, the quality of the point clouds obtained from cameras is poor.
For deployment in real-world autonomous driving scenarios, there are still some challenges, including generalization and real-time performance. Multiple 4D mmWave radar brands, such as Continental, Arbe, and ZF, are currently available in the market. Due to differences in their signal processing algorithms, these products exhibit variations in terms of the point cloud density and noise levels. For example, many detection models exhibit performance degradation on TJ4DRadSet compared to the results on the VoD dataset. This imposes stringent demands on the model’s generalization capabilities, requiring robust adaptation to sensor discrepancies and environmental dynamics. Additionally, real-time performance remains a critical challenge in autonomous driving deployment. Our method primarily operates on CPU architectures with low memory and VRAM consumption. When executed on an Intel i9-10900K processor, the average upsampling time per point cloud frame is 162 ms. Compared to the Cross-Modal Diffusion Model (181.6 ms/frame), our approach achieves comparable latency without requiring training.
Due to the limited quality of 4D millimeter-wave radar point clouds, neural-network-based methods supervised by LiDAR information might yield better results. LiDAR offers a higher angular resolution than millimeter-wave radar, resulting in denser point clouds. Although LiDAR also experiences multipath interference, such interference primarily occurs in rainy or foggy conditions; in clear weather, its noise is much lower than that of 4D millimeter-wave radar. These characteristics make LiDAR an excellent ground truth. Utilizing LiDAR-supervised point cloud upsampling methods or employing LiDAR as a diffusion model for the original point cloud may be an effective approach to improving the quality of millimeter-wave radar point clouds.

6. Conclusions

In this paper, we propose a point cloud densification method and an object detection network. The densification method uses multi-frame fusion point clouds and employs the MLS method to fit neighboring points to obtain a surface, upon which new points are added. We validate our proposed method through downstream object detection tasks. In the object detection tasks, both multi-frame point cloud fusion and MLS significantly improve the mAP, demonstrating the effectiveness of our proposed method. To investigate the impact of other factors on the object detection task, we design ablation experiments. The results indicate that the Doppler information, the RCS information, and the lateral distances of objects from the 4D mmWave radar have certain effects on the detection performance. Among them, Doppler information and the lateral distances of objects exert greater influences, while RCS information has a minor impact.
Our research presents a method for the densification of 4D mmWave radar point clouds and highlights the factors that affect point-cloud-based object detection. We hope that this study will contribute to improving the performance of object detection using 4D mmWave radar.

Author Contributions

Conceptualization, X.B.; methodology, X.B.; software, W.S. and P.T.; validation, W.S. and P.T.; formal analysis, W.S. and P.T.; investigation, W.S. and P.T.; resources, W.S. and P.T.; data curation, W.S. and P.T.; writing—original draft preparation, W.S.; writing—review and editing, W.S. and P.T.; visualization, W.S. and P.T.; supervision, W.S. and P.T.; project administration, W.S. and P.T.; funding acquisition, X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under Grant 2022YFE0117100.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Weigang Shi was employed by the company Shanghai Motor Vehicle Inspection Certification & Tech Innovation Center Co., Ltd., Shanghai 201805, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Algorithm A1: MLS Point Cloud Upsampling
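The listing below is a minimal Python sketch of the MLS upsampling procedure summarized in Algorithm A1, written with parameter names of our own choosing (k nearest neighbours, Gaussian bandwidth h, and the number of samples generated per point). It illustrates the idea rather than reproducing the released implementation, and it handles only the spatial coordinates, so Doppler and RCS attributes of new points would have to be assigned separately (for example, copied from the nearest original point).

import numpy as np
from scipy.spatial import cKDTree

def mls_upsample(points, k=16, h=0.5, samples_per_point=4, seed=0):
    """Densify an (N, 3) point cloud by sampling new points on local MLS surfaces.

    For every input point: gather its k nearest neighbours, build a local frame
    via PCA (the eigenvector with the smallest eigenvalue approximates the
    surface normal), fit a second-order polynomial w = f(u, v) with
    Gaussian-weighted least squares, then lift randomly drawn (u, v) offsets
    onto the fitted surface.
    """
    rng = np.random.default_rng(seed)
    tree = cKDTree(points)
    new_points = []
    for p in points:
        _, idx = tree.query(p, k=k)
        nbrs = points[idx]
        centroid = nbrs.mean(axis=0)
        local = nbrs - centroid
        _, vecs = np.linalg.eigh(np.cov(local.T))
        u_axis, v_axis, normal = vecs[:, 2], vecs[:, 1], vecs[:, 0]
        u, v, w = local @ u_axis, local @ v_axis, local @ normal
        # Gaussian weights favour neighbours close to the neighbourhood centre.
        sw = np.sqrt(np.exp(-(u ** 2 + v ** 2) / h ** 2))
        A = np.stack([np.ones_like(u), u, v, u * v, u ** 2, v ** 2], axis=1)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], w * sw, rcond=None)
        # Draw offsets around the centre and project them onto the MLS surface.
        for _ in range(samples_per_point):
            du, dv = rng.normal(scale=h / 2, size=2)
            dw = np.array([1.0, du, dv, du * dv, du ** 2, dv ** 2]) @ coef
            new_points.append(centroid + du * u_axis + dv * v_axis + dw * normal)
    return np.vstack([points, np.asarray(new_points)])

# Example: dense = mls_upsample(np.random.rand(200, 3))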

References

  1. Chen, Q.; Xie, Y.; Guo, S.; Bai, J.; Shu, Q. Sensing system of environmental perception technologies for driverless vehicle: A review of state of the art and challenges. Sens. Actuators A Phys. 2021, 319, 112566. [Google Scholar] [CrossRef]
  2. Huang, S.C.; Hoang, Q.V.; Le, T.H. SFA-Net: A selective features absorption network for object detection in rainy weather conditions. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 5122–5132. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, R.W.; Lu, Y.; Guo, Y.; Ren, W.; Zhu, F.; Lv, Y. AiOENet: All-in-one low-visibility enhancement to improve visual perception for intelligent marine vehicles under severe weather conditions. IEEE Trans. Intell. Veh. 2023, 9, 3811–3826. [Google Scholar] [CrossRef]
  4. Fan, L.; Zeng, C.; Liu, H.; Liu, J.; Li, Y.; Cao, D. Sea-net: Visual cognition-enabled sample and embedding adaptive network for sar image object classification. IEEE Trans. Intell. Veh. 2023; Early Access. [Google Scholar]
  5. Schumann, O.; Wöhler, C.; Hahn, M.; Dickmann, J. Comparison of random forest and long short-term memory network performances in classification tasks using radar. In Proceedings of the 2017 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 10–12 October 2017; pp. 1–6. [Google Scholar]
  6. Lombacher, J.; Hahn, M.; Dickmann, J.; Wöhler, C. Object classification in radar using ensemble methods. In Proceedings of the 2017 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), Nagoya, Japan, 19–21 March 2017; pp. 87–90. [Google Scholar]
  7. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  8. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  13. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  14. Wang, Y.; Mao, Q.; Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Li, H.; Zhang, Y. Multi-modal 3d object detection in autonomous driving: A survey. Int. J. Comput. Vis. 2023, 131, 2122–2152. [Google Scholar] [CrossRef]
  15. Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
  16. Li, B. 3d fully convolutional network for vehicle detection in point cloud. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1513–1518. [Google Scholar]
  17. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  18. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  19. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  20. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  21. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  22. Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; Garcia, F.; De La Escalera, A. Birdnet: A 3d object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar]
  23. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  24. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; MIT: Cambridge, UK, 2017; pp. 5099–5108. [Google Scholar]
  25. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  26. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
  27. Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Adv. Neural Inf. Process. Syst. 2022, 35, 3819–3829. [Google Scholar]
  28. Lee, S. Deep learning on radar centric 3D object detection. arXiv 2020, arXiv:2003.00851. [Google Scholar]
  29. Palmer, P.; Krueger, M.; Altendorfer, R.; Bertram, T. Ego-motion estimation and dynamic motion separation from 3D point clouds for accumulating data and improving 3D object detection. In Proceedings of the AmE 2023—Automotive Meets Electronics; 14. GMM Symposium, Dortmund, Germany, 15–16 June 2023; pp. 86–91. [Google Scholar]
  30. Yan, Q.; Wang, Y. Mvfan: Multi-view feature assisted network for 4d radar object detection. In Neural Information Processing; Springer: Singapore, 2023; pp. 493–511. [Google Scholar]
  31. Bai, J.; Zheng, L.; Li, S.; Tan, B.; Chen, S.; Huang, L. Radar transformer: An object classification network based on 4D MMW imaging radar. Sensors 2021, 21, 3854. [Google Scholar] [CrossRef] [PubMed]
  32. Cui, H.; Wu, J.; Zhang, J.; Chowdhary, G.; Norris, W.R. 3D detection and tracking for on-road vehicles with a monovision camera and dual low-cost 4D mmWave radars. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2931–2937. [Google Scholar]
  33. Xiong, W.; Liu, J.; Huang, T.; Han, Q.L.; Xia, Y.; Zhu, B. LXL: LiDAR excluded lean 3D object detection with 4D imaging radar and camera fusion. IEEE Trans. Intell. Veh. 2023, 9, 79–92. [Google Scholar] [CrossRef]
  34. Zheng, L.; Li, S.; Tan, B.; Yang, L.; Chen, S.; Huang, L.; Bai, J.; Zhu, X.; Ma, Z. Rcfusion: Fusing 4-d radar and camera with bird’s-eye view features for 3-d object detection. IEEE Trans. Instrum. Meas. 2023, 72, 8503814. [Google Scholar] [CrossRef]
  35. Palffy, A.; Pool, E.; Baratam, S.; Kooij, J.F.; Gavrila, D.M. Multi-class road user detection with 3+ 1D radar in the View-of-Delft dataset. IEEE Robot. Autom. Lett. 2022, 7, 4961–4968. [Google Scholar] [CrossRef]
  36. Prabhakara, A.; Jin, T.; Das, A.; Bhatt, G.; Kumari, L.; Soltanaghai, E.; Bilmes, J.; Kumar, S.; Rowe, A. High resolution point clouds from mmwave radar. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4135–4142. [Google Scholar]
  37. Zhang, R.; Xue, D.; Wang, Y.; Geng, R.; Gao, F. Towards dense and accurate radar perception via efficient cross-modal diffusion model. IEEE Robot. Autom. Lett. 2024, 9, 7429–7436. [Google Scholar] [CrossRef]
  38. Yu, L.; Li, X.; Fu, C.W.; Cohen-Or, D.; Heng, P.A. Pu-net: Point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2790–2799. [Google Scholar]
Figure 1. Overall structure of our method. P represents the number of pillars, D represents the dimensions of a single point, N represents the number of points in each pillar, C represents the dimensions after MLP, W represents the width of the pseudo-image, and H represents the height of the pseudo-image.
Figure 2. Comparison of single-frame point cloud and multi-frame fused point cloud density. (a) Single-frame point cloud; (b) multi-frame fused point cloud.
Figure 3. Comparison of multi-frame point cloud and MLS upsampled point cloud density. The multi-frame point cloud is on the left side, and the MLS upsampled point cloud is on the right side.
Figure 4. The structure of our backbone network.
Figure 5. Detection results on the View of Delft dataset. Bounding boxes with different colors represent vehicles (green), pedestrians (red), and cyclists (orange).
Figure 6. The relationship between the average precision and IOU thresholds (using upsampled point clouds for training).
Figure 7. The relationship between the average precision and the number of frames in a multi-frame point cloud.
Table 1. Different forms of fitting functions.
Spatial Dimension | First-Order Form | Second-Order Form
x = (x) | 1, x | 1, x, x²
x = (x, y) | 1, x, y | 1, x, y, xy, x², y²
x = (x, y, z) | 1, x, y, z | 1, x, y, z, xy, xz, yz, x², y², z²
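For reference, the basis terms in Table 1 can be generated programmatically; the short Python helper below (the function name polynomial_basis is ours, not from the released code) produces the first- or second-order form for a point with one, two, or three spatial coordinates.

import numpy as np

def polynomial_basis(p, order=2):
    """Basis terms of Table 1 for a point p with 1, 2, or 3 spatial coordinates."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    terms = [1.0, *p]                                # first-order form: 1, x, (y, z)
    if order == 2:
        n = len(p)
        terms += [p[i] * p[j] for i in range(n) for j in range(i + 1, n)]  # cross terms
        terms += list(p ** 2)                        # squared terms
    return np.array(terms)

# polynomial_basis([x, y], order=2) yields [1, x, y, xy, x², y²], matching the second row.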
Table 2. Performance comparison on the View of Delft dataset. “R” and “C” stand for different modalities, “R” is 4D mmWave radar, and “C” is camera.
Method | Modality | Car | Pedestrian | Cyclist | mAP
SECOND | R | 40.40 | 30.64 | 62.51 | 44.52
CenterPoint | R | 32.74 | 38.00 | 65.51 | 45.42
LXL-R | R | 32.75 | 39.65 | 68.13 | 46.84
RCFusion | R + C | 41.70 | 38.95 | 68.31 | 49.65
Ours | R | 39.84 | 35.73 | 65.65 | 47.07
Ours with MLS | R | 39.36 | 36.10 | 69.86 | 48.44
Table 3. Ablation study on multi-frame point cloud fusion and MLS method. All methods are evaluated on the View of Delft dataset.
Method | Car | Pedestrian | Cyclist | mAP
Single-frame | 35.37 | 31.38 | 60.36 | 42.37
Multi-frame | 39.84 | 35.73 | 65.65 | 47.07
MLS | 39.36 | 36.10 | 69.86 | 48.44
Table 4. Ablation study on detection head. All methods are evaluated on the View of Delft dataset.
Method | Car | Pedestrian | Cyclist | mAP
Centerhead | 35.02 | 36.94 | 52.45 | 41.47
SSD | 39.84 | 35.73 | 65.65 | 47.07
Table 5. Ablation study on the influence of the input dimensions on the detection performance. All methods are evaluated on the View of Delft dataset.
Method | Car | Pedestrian | Cyclist | mAP
Single-frame | 35.37 | 31.38 | 60.36 | 42.37
No RCS | 29.86 | 32.34 | 63.00 | 41.73
No Doppler | 31.17 | 22.62 | 41.29 | 31.69
Table 6. Object detection performance on the entire area. Point clouds without RCS information and point clouds without Doppler information are generated based on single-frame point clouds. “MLS” refers to the densified multi-frame point cloud.
Method | Car | Pedestrian | Cyclist | mAP
Single-frame | 35.37 | 31.38 | 60.36 | 42.37
No RCS | 29.86 | 32.34 | 63.00 | 41.73
No Doppler | 31.17 | 22.62 | 41.29 | 31.69
Multi-frame | 39.84 | 35.73 | 65.65 | 47.07
MLS | 39.36 | 36.10 | 69.86 | 48.44
Table 7. Object detection performance on the driving corridor.
Method | Car | Pedestrian | Cyclist | mAP
Single-frame | 56.38 | 38.93 | 74.87 | 56.73
No RCS | 52.87 | 38.85 | 75.82 | 55.85
No Doppler | 56.91 | 32.47 | 57.33 | 48.90
Multi-frame | 59.62 | 44.45 | 76.89 | 60.32
MLS | 58.87 | 45.68 | 81.66 | 62.07
Table 8. Performance comparison on the TJ4DRadSet dataset. “R” and “C” stand for different modalities, “R” is 4D mmWave radar, and “C” is camera.
Method | Modality | mAP
SECOND | R | 28.85
CenterPoint | R | 29.07
LXL-R | R | 30.79
RCFusion | R + C | 33.85
Ours | R | 28.25
Ours with MLS | R | 30.04
