Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception

Yan, Hang; Li, Yongji; Wang, Luping; Chen, Shichao

doi:10.3390/rs16224256

Open AccessArticle

Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception

by

Hang Yan

¹,

Yongji Li

¹,

Luping Wang

¹ and

Shichao Chen

^2,*

¹

School of Electronic and Communication Engineering, Sun Yat-sen University, Shenzhen 518107, China

²

School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(22), 4256; https://doi.org/10.3390/rs16224256

Submission received: 21 October 2024 / Revised: 10 November 2024 / Accepted: 13 November 2024 / Published: 15 November 2024

Download

Browse Figures

Versions Notes

Abstract

:

Reliable environmental perception capabilities are a prerequisite for achieving autonomous driving. Cameras and LiDAR are sensitive to illumination and weather conditions, while millimeter-wave radar avoids these issues. Existing models rely heavily on image-based approaches, which may not be able to fully characterize radar sensor data or efficiently further utilize them for perception tasks. This paper rethinks the approach to modeling radar signals and proposes a novel U-shaped multilayer perceptron network (U-MLPNet) that aims to enhance the learning of omni-dimensional spatio-temporal dependencies. Our method involves innovative signal processing techniques, including a 3D CNN for spatio-temporal feature extraction and an encoder–decoder framework with cross-shaped receptive fields specifically designed to capture the sparse and non-uniform characteristics of radar signals. We conducted extensive experiments using a diverse dataset of urban driving scenarios to characterize the sensor’s performance in multi-view semantic segmentation and object detection tasks. Experiments showed that U-MLPNet achieves competitive performance against state-of-the-art (SOTA) methods, improving the mAP by 3.0% and mDice by 2.7% in RD segmentation and AR and AP by 1.77% and 2.03%, respectively, in object detection. These improvements signify an advancement in radar-based perception for autonomous vehicles, potentially enhancing their reliability and safety across diverse driving conditions.

Keywords:

autonomous driving; environmental perception; multi-view semantic segmentation; millimeter-wave radar sensors; multilayer perceptron

Graphical Abstract

1. Introduction

Millimeter-wave radar stands as a critical sensor type in fields such as autonomous driving [1]. It has the advantages of low cost, all-weather operation, and adaptability to complex environments [2,3,4]. Leveraging high-frequency electromagnetic waves, radar sensors obtain velocity, azimuth, and distance information of detected objects economically, facilitating object classification [5], recognition [6], and tracking [7]. Raw radar data often face challenges such as large data scales and high computational complexity, commonly preventing direct application. As shown in Figure 1, radar echo signals are digitized, then transformed into different types of radar maps using Fast Fourier Transform (FFT) in different dimensions. In the FFT process, using padding and window functions can improve frequency resolution and reduce spectral leakage [8]. After obtaining the radio frequency (RF) signal, techniques such as pulse compression, low-pass filtering, or averaging are first applied to enhance the signal quality [9]. Targets are subsequently extracted by constant false-alarm rate detection [10]. Traditional detection and classification primarily rely on manual feature extraction (e.g., motion features [5], RCS features [6], and micro-Doppler features [11]) and classification techniques (e.g., decision trees [12], SVMs [13], KNN [14], and statistical models [15]). However, it is difficult for these methods to fully represent and exploit the rich information in sensor data. With the development of deep learning technology [16] and the emergence of high-resolution radar [17], radar sensors have been increasingly applied in tasks including object detection, semantic segmentation, and interference robustness. In contrast, radar RF images exhibit significantly different characteristics compared to traditional images. The former are characterized by high levels of noise and clutter, blurred object boundaries, and inter-target interference, whereas the latter have clear edges and shapes. Therefore, image feature extraction methods need to be optimized for these differences. These challenges result in radar semantic segmentation and object detection still being in the early stages, but their great potential deserves further exploration [18].

Existing algorithms can be roughly divided into spatial prediction models and spatio-temporal joint prediction methods. The former rely only on a single radar frame, such as range-angle (RA) or range-Doppler (RD) view information, while the latter use multi-frame data. Two popular frameworks for extracting spatial features are convolutional neural networks (CNNs) [19] and deep neural networks [20]. However, single-frame data input fails to make full use of the inter-frame temporal relationship. Multiple frames contain micro-motion features of the object, and their absence significantly reduces detection performance. This problem can be effectively solved by the use of multi-frame input and time information extraction. Time series-based methods such as long short-term memory (LSTM) networks are adept at capturing temporal dependencies, which can effectively extract spatio-temporal information when combined with CNNs [21]. Similar methods include the cascade framework of gate recurrent units (GRUs) and CNNs [22]. However, these designs introduce recursive structures, which are not suitable for parallel computation. 3D CNNs effectively capture spatio-temporal information within the data and allow for high levels of parallelization [23], thereby achieving good results in radar object detection tasks [24]. This suggests the significant role of 3D convolution in radar perception tasks. Nevertheless, the global feature extraction capability of 3D CNN technology is constrained, and increasing the network depth offers marginal performance improvements while greatly increasing computational costs. On the other hand, transformer models [25], due to their strong global dependency modeling capabilities and high degree of parallelism, have achieved significant results in tasks such as video segmentation [26,27], restoration [28,29,30], and tracking [31,32,33]. By leveraging the attention mechanism, transformer models effectively capture complex relationships between distant features, thereby enhancing the extraction and utilization of spatio-temporal information. These capabilities are particularly beneficial for the modeling of radar spatio-temporal data. Nonetheless, while transformers are powerful in global modeling, they tend to overlook local details, a limitation that can be filled by incorporating 3D CNNs. The latest research has indicated that the combination of transformers and 3D CNNs significantly improves the local and global spatio-temporal representation capability, resulting in outstanding performance in radar tasks [34]. In contrast, attention-based models often face optimization difficulties and model complexity, which present challenges for real-time applications. Hence, there is still a need for further research into more tailored and efficient feature representation frameworks for radar data. Notably, attention-free frameworks enhance optimization and acceleration at the hardware level, making them more suitable for deployment in autonomous driving systems.

This paper aims to efficiently learn omni-dimensional spatio-temporal dependencies from radar data without increasing computational complexity. Three main insights guide our approach. First, the success of combining 3D CNNs and transformers benefits from the strong ability of 3D CNNs to extract local spatio-temporal features and the capacity of transformers to aggregate long-range spatio-temporal relationships. However, the computational burden associated with the attention mechanism is quadratic in both time and space. In contrast, Metaformer [35] retains the full encoder–decoder architecture of the transformer model and employs an attention-free structure, reducing computational costs while maintaining robust spatio-temporal modeling capabilities for complex scenarios [36,37,38,39]. Motivated by this trend, our work introduces Metaformer in the fusion stage as a supplement to the global feature extraction capabilities of 3D CNNs. Secondly, Metaformer splits the transformer into a meta-architecture and a token mixer, where the latter drives efficient feature mixing. Notably, Metaformers with multilayer perceptron (MLP) mixers [40,41] have shown competitive performance and efficient low-level optimization. However, while MLP-based methods perform well in many tasks, they may not be suitable for radar data. In fact, several studies indicate a close relationship between the cross-shaped receptive field and radar characteristics. Cenkeramaddi et al. [42] and Gupta et al. [43] showed that the targets in RA images of FMCW radar typically distribute along the horizontal or vertical axis, presenting a striped pattern. Bi et al. [44] further explained that a single target generates multiple scattering echo points along the coordinate axes. Additionally, Nguyen et al. [45] confirmed the existence of this striped feature through the visualization of radar signals. Inspired by the cross-shaped receptive field of TransRadar, which enhances radar feature extraction, and the demonstration by PeakCom [46] that dilation factors create guard units to reduce interference, we implemented AS-MLP [47] as a radar feature extraction module with a cross-shaped dilation receptive field in the latent space encoder. In the decoder, a CNN structure is employed to reduce parameters while restoring the signal. The framework aligns well with the characteristics of the radar radio-frequency (RF) images while maintaining the robust spatio-temporal modeling capacity of the transformer. Finally, the interaction between features from different views and levels enables the model to achieve a more thorough comprehension of radar data, improving the accuracy of perception tasks.

Based on the findings presented above, this paper proposes a novel U-shaped multilayer perceptron network (U-MLPNet), a radar semantic segmentation and detection model. The proposed model leverages the advantages of transformer architectures while avoiding attention mechanisms and positional encoding, thereby reducing model complexity. Specifically, the overall model consists of three parts: an encoder, a latent space, and a decoder. The encoder and decoder adopt the CNN structure. However, the latent space integrates our U-MLP multi-scale feature fusion module, a novel design that successfully addresses the limitation of CNNs in capturing global features. The U-MLP module possesses a cross-shaped receptive field, allowing it to utilize features from both axes simultaneously. Additionally, the use of dilation factors in radar feature extraction leads to a further enhancement in performance as the guard band facilitated by the cross-dilated receptive field takes effect. We introduce shortcut connections between different resolutions and views, which promote the propagation of low-level detail information from the encoder to the decoder. This enhances the fusion of omni-dimensional spatio-temporal information and the diversity of the multi-view method. Experimental results demonstrate the outstanding performance of the U-MLP module in radar semantic segmentation and object detection tasks. The main contributions of this work are summarized as follows.

We propose a new radar perception framework named U-MLPNet. It combines the global capabilities of Metaformer and the local advantages of 3D CNNs to effectively achieve omni-dimensional spatio-temporal representation.
A plug-and-play multi-scale feature fusion module in the latent space called U-MLP is designed for radar signals, offering new insights into radar perception tasks.
Our approach effectively integrates and preserves the diversity between the RD and RA views, which is crucial for a comprehensive evaluation of object attributes.
Experimental results demonstrate that our proposed approach achieves competitive performance compared to state-of-the-art (SOTA) methods without increasing computational complexity.

The rest of this paper is organized as follows. Section 2 provides an overview of related works in the context of our model, covering methods based on CNNs, attention, and MLP. Section 3 provides a detailed description of the proposed multi-scale feature fusion framework. Subsequently, Section 4 delves into experimental analysis and ablation studies. Finally, Section 5 concludes the paper with a summary and discussion of the proposed approach.

2. Related Works

2.1. CNN-Based Tasks

As deep learning algorithms have achieved success in radar tasks, there has been a large body of research on this topic [48,49,50,51]. 2D CNN models have demonstrated their validity in extracting radar features and have been widely employed in semantic segmentation [52], object detection [53], and classification [54]. Research has found that replacing 2D CNNs with 3D CNNs can achieve superior performance. By exploiting the temporal correlations between radar frames, this method achieves enhanced object recognition performance. The utilization of a 3D CNN in RTCnet [55] showcases superior performance in tasks related to mobile road user detection. Several works [24,56] involved the construction of temporal RA networks based on a 3D CNN on the CRUW dataset [56], demonstrating that spatio-temporal features play a crucial role in improving detection accuracy. In addition, the multi-frame, multi-view fusion method fully aggregates the object information in a higher dimensional feature space by jointly analyzing the different view information of RA, RD, and AD, which, in turn, improves the accuracy of the model. Gao et al. [57] sliced the range-angle Doppler (RAD) tensor and applied a 3D CNN to extract multi-view features, which improved the detection performance of RA. Additionally, Ouaknine et al. [58] successfully tackled the multi-view semantic segmentation task on the CARRADA dataset [59] using a similar method. Nevertheless, the global feature extraction capability of 3D CNNs is restricted, and enlarging the convolutional kernels or adding layers substantially increases training complexity. In ref. [56], the RODNet-HWGI model required up to 10 days of training, with computational complexity close to 6000 G floating point of operations (FLOPs), which poses considerable constraints in real-world applications. Although 3D CNNs have many shortcomings, their ability to extract local spatio-temporal information contributes to enhanced performance in radar tasks.

Thus, we integrate the strengths of 3D CNNs with the latest global fusion methods, presenting a novel network architecture capable of capturing both local and global spatio-temporal features.

2.2. Attention-Based Tasks

Multi-scale fusion models have been widely adopted in natural image segmentation [60]. Representative works include DeepLabv3+ [61], which is based on the Atrous Spatial Pyramidal Pooling (ASPP) module, and networks such as UNet [62] and FCN [63], which rely on downsampling techniques. Unfortunately, extending the above models to the radar domain does not produce the desired outcomes, mainly due to the irregular shape and data sparsity of radar RF images. TMVANet [58] provides a new paradigm for radar semantic segmentation, consisting of an encoder, a latent space, and a decoder. The encoder utilizes a 3D CNN model for feature extraction, while the latent space serves as a fusion module. The latent space of this model adopts an ASPP fusion module, achieving SOTA performance on the CARRADA multi-view dataset. Recent work by Dalbah et al. [64] demonstrated the applicability of the aforementioned framework to the single-view CRUW dataset. To better fuse features, an increasing number of researchers have introduced transformer models into the radar domain. The original Vision Transformer (ViT) [65] model is computationally complex and difficult to train, posing challenges for practical applications. Presently, the dominant strategy involves the adoption of lightweight structures for better performance, such as Swin Transformer [66], cross-attention [67,68], axial-attention [69], and MaXViT [70] models. Jiang et al. [71] presented a groundbreaking approach by introducing the Swin Transformer into the radar feature fusion stage, resulting in a significant reduction in model complexity and improved detection performance. Dalbah et al. [72] applied axial attention in the latent space, resulting in the best performance.

Unlike the attention-based techniques mentioned above, our model harnesses the benefits of transformer architectures through the use of MLP and CNNs, achieving comparable performance with lower computational costs.

2.3. MLP-Based Tasks

Integrating transformers with CNNs augments the global modeling abilities of CNNs. Nonetheless, attention mechanisms are computationally expensive, which can be overcome by lightweight MLP models. MLP-Mixer [40] achieves competitive classification performance with minimal computational overhead, relying solely on straightforward matrix transposition operations. Tu et al. [73] proposed the multi-axis MLP model, employing a U-shaped hierarchical structure, which achieves SOTA performance across various downstream tasks with a lower computational burden, including denoising, deblurring, and enhancement. Yu et al. [74] introduced the spatial-shift MLP model by replacing the transpose operation in the MLP-Mixer model with spatial shifts, aiming to better suit downstream tasks. The proposed method attains higher computational efficiency without incurring additional parameter overhead. Lian et al. further proposed a more lightweight axial shifted MLP (AS-MLP) architecture [47], extending the MLP model to dense prediction tasks. This framework achieves flexible receptive fields through basic channel projection and shift operations, enabling the extraction of features along different axes. Experimental findings suggest that the AS-MLP framework outperforms other MLP models, attaining performance comparable to that of attention-based models while exhibiting lower FLOPs.

Overall, the MLP model is regarded as simple and efficient, displaying remarkable performance across multiple domains. This work pioneers the exploration of MLP structures in radar segmentation and detection tasks, with the goal of developing novel, lightweight models suitable for radar signals.

3. Proposed Method

3.1. Overall Framework

Learning omni-dimensional spatio-temporal dependencies is key to enhancing radar perception. CNNs excel at local modeling but fall short in global modeling, a limitation addressed by combining them with transformer-based methods. However, the attention mechanism increases compute demands and may not align well with radar RF images. To overcome these challenges, we propose U-MLPNet, a novel fusion framework that effectively learns local and global spatio-temporal features from radar data while avoiding excessive computational requirements. The overall framework for multi-view radar semantic segmentation comprises three components: an encoder, a decoder, and a latent space, as illustrated in Figure 2. In this scenario, RA, RD, and AD represent different views of the radar cube at time t. First, the encoder employs a 3D CNN to process

q + 1

consecutive frames (from time

t - q

to t) of three views in parallel, effectively extracting local spatio-temporal details of radar objects. This design allows the model to capture dynamic changes over short time intervals. Then, in the shared latent space, we introduce the U-MLP fusion structure. This structure combines MLP and CNN components in a manner similar to the encoder–decoder architecture of the transformer architecture. U-MLP serves as an enhanced multi-scale feature fusion module capable of effectively integrating long-range spatio-temporal relationships and acquiring information from different axes, better aligning with radar features (detailed in Figure 3). This structure enables U-MLPNet to more accurately capture and exploit the global spatio-temporal dependencies in radar data. Finally, the decoder completes semantic segmentation for both RA and RD views simultaneously. Notably, the long skip connections between the encoder and decoder form a U-shaped architecture, enabling information exchange between them directly and mitigating gradient explosion and vanishing issues. They also preserve diversity among different views, and experiments demonstrate that this design benefits the balance of prediction performance across views. Therefore, our approach demonstrates excellent performance in both multi-view segmentation and single-view radar object detection tasks. For multi-view segmentation, it integrates features from multiple views of consecutive frames through parallel processing, achieving multi-view, multi-scale fusion in the latent space. Meanwhile, the U-MLP module effectively fuses multi-scale features for detection. Unlike transformer models, MLP architectures do not include attention mechanisms, which simplifies the structure and facilitates computational acceleration and optimization at the hardware level. Substituting transformers and CNNs fusion with MLP yields competitive outcomes while demanding fewer computational resources. Moreover, although Batch Normalization (BN) is typically more suitable for CNNs, the inclusion of multi-view, multi-frame inputs presents challenges for radar networks, particularly regarding the feasibility of large batch sizes. Group Normalization (GN) [75] outperforms batch normalization (BN) [76] in scenarios with small batch sizes. Hence, a hybrid approach is adopted in this paper, harnessing the advantages of both normalization techniques, with GN applied in the U-MLP module and BN utilized for CNN.

3.2. Encoder and Decoder Architectures

In U-MLPNet, the main role of the encoder is to transform the input data into a low-dimensional feature representation through downsampling, thereby effectively extracting the key information closely related to the task. This paper employs a 3D CNN to capture inter-frame temporal relationships between frames and local details. It can be expressed as

\begin{matrix} {\hat{x}}_{i} = \otimes (x_{i}), \\ {\hat{μ}}_{i} = \frac{1}{m} \sum_{k = 1}^{m} {\hat{x}}_{k}, \\ {\hat{σ}}_{i} = \sqrt{\frac{1}{m} \sum_{k = 1}^{m} {({\hat{x}}_{k} - {\hat{μ}}_{i})}^{2} + ϵ}, \\ {\tilde{x}}_{i} = Φ (\frac{1}{{\hat{σ}}_{i}} ({\hat{x}}_{i} - {\hat{μ}}_{i})), \end{matrix}

(1)

where

i = (i_{B}, i_{C}, i_{T}, i_{H}, i_{W})

represents a 5D tensor composed of indices for batch, channel, time, height, and width axes. m denotes the total number of samples in a batch dataset;

{\hat{μ}}_{i}

and

{\hat{σ}}_{i}

represents the mean and standard deviation, respectively; and

ϵ

is a small positive number used for numerical stability.

\otimes (\cdot)

denotes the 3D convolution operation, and

Φ (\cdot)

represents the Leaky Rectified Linear Unit (LeakyReLU) activation function, which effectively alleviates the issue of “neuron death”. The terms

x_{i}

and

{\tilde{x}}_{i}

represent the input and output of the convolutional layer, respectively, with the output features comprising rich, local spatial and temporal information.

The decoder in U-MLPNet is responsible for gradually restoring the low-dimensional features to the original resolution. It consists of upsampling and long skip connections. Upsampling expands the low-dimensional features produced by the encoder to match the original data size, while long skip connections allow early-stage information to flow directly into the decoder, preserving fine-grained spatial information that might be lost. U-MLPNet supports a variety of prediction tasks, including the prediction of segmentation results of RA and RD views in multi-view segmentation and the generation of multi-frame detection outcomes in detection tasks. In addition, the introduced long skip connections facilitate the integration of semantic information across different resolutions and maintain diversity among views in multi-view tasks. To summarize, U-MLPNet comprehensively integrates local and global spatio-temporal information by combining and encoder, decoder, and the U-MLP feature fusion module.

3.3. U-MLP

In this work, we propose a novel multi-scale fusion module called U-MLP, which comprises an encoder and a decoder (as shown in Figure 4). The encoder is based on MLP, while the decoder utilizes transposed convolution and linear transformation operations. The original MLP structure results in excessive model complexity. However, the combination of local information fusion and a multi-scale framework in AS-MLP effectively reduces the computational load, rendering it conducive to segmentation and object detection tasks. It utilizes feature projection and shift operations to achieve axial information interaction, rendering it highly suitable for radar object feature extraction. Given the lower resolution of radar systems, integrating high-level semantic information is crucial. Therefore, our module utilizes a U-shaped structure for stepwise fusion. Moreover, to optimize the U-MLP module for radar tasks, we introduce the dilation rate. This design expands the receptive field and also serves as a guard band, as depicted in Figure 5. The relationship between the input (X), displacement size (s), dilation rate (d), and output (

Y_{i, j}

) of

(i, j)

is expressed as follows:

Y_{i, j}^{as} = \sum_{c = 1}^{C} X_{i + k, j, c} W_{c}^{a s - h} + \sum_{c = 1}^{C} X_{i, j + k, c} W_{c}^{a s - v},

(2)

where

k = ⌊\frac{c}{⌈C / s⌉}⌋ - ⌊\frac{s}{2}⌋ \cdot d

.

W_{c}^{a s - h}

and

W_{c}^{a s - v}

denote the learnable weights for horizontal and vertical channel projections, respectively, without considering the activation function and bias terms. In summary, the U-MLP module adopts a U-shaped structure, maintaining consistent feature input and output dimensions. This modular design can be plugged into any existing network. Subsequently, we provide a detailed description of its individual components.

There are four stages in the U-MLP encoder. Except for the feature input stage, which utilizes identity mapping, all subsequent stages employ downsampling by a factor of

[(2, 2), (2, 2), (2, 2)]

. Assuming the input feature size is

(B, C, H, W)

, the corresponding output feature sizes are

(B, 2 C, \frac{H}{2}, \frac{W}{2})

,

(B, 4 C, \frac{H}{4}, \frac{W}{4})

, and

(B, 8 C, \frac{H}{8}, \frac{W}{8})

. At each stage of the encoder, feature extraction is performed by two concatenated AS-MLP blocks. The procedure of AS-MLP begins with channel-wise feature projection of the input (X), followed by partitioning of the projected tensor (

T

) into g groups. Lastly, shift operations are performed along the horizontal direction (w) and the vertical direction (h). Taking a horizontal shift of 3 as an example, it is evident that the displacement of each group is

\{- 1, 0, 1\}

. Therefore, the horizontal shift operation (SW) can be formalized as

\begin{matrix} T_{1} [1 : w, :, :] \leftarrow T_{1} [0 : w - 1, :, :], \\ T_{3} [0 : w - 1, :, :] \leftarrow T_{3} [1 : w, :, :] . \end{matrix}

(3)

The vertical shift operation (SH) can be implemented in the same way. Thus, the computation process of axial shifts can be summarized as follows:

\begin{matrix} {\hat{z}}^{l} = G E L U (C P (L N (z^{l - 1}))), \\ {\hat{z}}_{w}^{l} = G E L U (C P (S W ({\hat{z}}^{l}))), \\ {\hat{z}}_{h}^{l} = G E L U (C P (S H ({\hat{z}}^{l}))), \\ z^{l} = C P (L N ({\hat{z}}_{w}^{l} + {\hat{z}}_{h}^{l})), \end{matrix}

(4)

where

{\hat{z}}_{w}^{l}

,

{\hat{z}}_{h}^{l}

, and

z^{l}

represent the outputs of SW, SH, and AS, respectively. GELU denotes the Gaussian Error Linear Unit activation function,

L N (\cdot)

indicates layer normalization, and CP refers to channel projection. Therefore, the AS-MLP block can be represented as follows:

\begin{matrix} \hat{x} = A S (L N (x)) + x, \\ X = M L P (L N (\hat{x})) + \hat{x}, \end{matrix}

(5)

where

A S (\cdot)

denotes the axial shift operation and

M L P (\cdot)

stands for multilayer perceptron.

The primary objective of the decoder in U-MLP is multi-scale feature fusion. Skip connections are introduced to integrate diverse semantic information across resolutions. Furthermore, dimensionality reduction is employed alongside the skip connections, facilitating effective feature fusion while reducing computational complexity. The relationship between features in layers l and

l + 1

can be expressed using the following formula:

\begin{matrix} {\hat{y}}^{l + 1} = F_{s k i p} Θ H_{u p} (y^{l}), \\ y^{l + 1} \leftarrow {\hat{y}}^{l + 1}, \end{matrix}

(6)

where the term

F_{s k i p}

indicates the skip connection from the encoder feature corresponding to layer

l + 1

, while

Θ

represents the concatenation operation used for feature fusion.

H_{u p} (\cdot)

stands for the transpose convolutional upsampling operation, and the ← symbol denotes dimensional reduction.

In conclusion, U-MLP demonstrates efficient multi-scale information extraction in the encoding phase and progressive fusion of radar features during decoding. Moreover, its modular design is advantageous for investigating its performance across various radar tasks.

3.4. Loss Function

This study adopts various loss functions to accommodate different radar tasks. For radar object detection, a simple binary cross-entropy (BCE) loss function is utilized:

\begin{matrix} L_{B C E} = - (y \cdot l o g (y_{p r e d}) + (1 - y) \cdot l o g (1 - y_{p r e d})), \end{matrix}

(7)

where y represents the ground truth (GT) and

y_{p r e d}

denotes the predicted values.

For multi-view radar semantic segmentation tasks, the richness of available information exceeds that of detection tasks. To fully exploit the benefits of different views, this work adopts a combination of multiple loss functions, as suggested in [72], as the final loss function. The details are outlined as follows.

3.4.1. Object-Centric (OC) Focal Loss

This loss function emphasizes the foreground by weighting the binary cross-entropy of the predicted background and foreground, formulated as follows:

L_{O C} = (1 - y_{p r e d}) (δ L_{B C E_{F G}} + (1 - δ) L_{B C E_{B G}}),

(8)

where

δ

is a weight term. FG and BG represent the foreground and the background, respectively.

3.4.2. Class-Agnostic Object Localization (CL) Loss

Aiming to enhance localization accuracy, this loss penalizes mislocalization of objects, defined as follows:

L_{C L} = 1 - \frac{T P}{T P + F N + F P},

(9)

where TP is true positive, FN is false negative, and FP is false positive.

3.4.3. Soft Dice (SD) Loss

Soft Dice loss serves to evaluate the performance of multi-class semantic segmentation, with its formulation expressed as follows:

L_{S D} = \frac{1}{K} \sum_{k = 1}^{K} [1 - \frac{2 \sum yp}{\sum y^{2} + p^{2}}],

(10)

where

y

and

p

represent the GT and probability map output of the model, respectively, while K indicates the class.

3.4.4. Multi-View (MV) Range-Matching Loss

Multi-view range-matching loss is instituted to maintain prediction consistency across disparate views, expressed as follows:

L_{M V} = \{\begin{matrix} \frac{1}{2} {(R D_{m} - R A_{m})}^{2} & | R D_{m} - R A_{m} | < 1, \\ | R D_{m} - R A_{m} | - \frac{1}{2} & o t h e r w i s e, \end{matrix}

(11)

where the

R D_{m}

and

R A_{m}

are the maximum poolings of the corresponding views along the non-common axis.

Therefore, the weighted combination of the above loss functions constitutes the loss for multi-view radar semantic segmentation:

L_{t o t a l} = α_{1} (L_{O C} + L_{C L}) + α_{2} L_{S D} + α_{3} L_{M V},

(12)

where

α_{1}

,

α_{2}

, and

α_{3}

are the weighted terms.

4. Experiments

4.1. Datasets and Evaluation Metrics

We evaluate the millimeter-wave radar object detection task on two widely-used 77-GHz frequency modulated continuous wave (FMCW) millimeter-wave radar datasets. The detailed configuration information is presented in Table 1. The CARRADA dataset [59] and the CRUW dataset [56] are utilized for multi-view semantic segmentation and object detection tasks, respectively.

4.1.1. CARRADA

This dataset was collected on a test track located in Canada, with annotations generated through a semi-automatic pipeline. Initially, object velocities and ranges were estimated using synchronized video object tracking and radar object detection algorithms. All objects were then annotated via cooperative multi-sensor tracking and clustering based on radar and video data. The dataset includes 30 sequences, totaling 21.1 min of data, with each scene comprising 1–2 moving objects. Although smaller in scale with simpler scenes compared to CRUW, CARRADA offers more precise annotations in three formats: sparse points, boxes, and dense masks. These formats allow for evaluation of object detection, semantic segmentation, and classification algorithms. Additionally, it provides RD and RA dual-view annotations, along with RAD tensors.

The evaluation metrics for the CARRADA dataset employ standard semantic segmentation measures, including intersection over union (IOU) and the Dice similarity coefficient (Dice). They are defined as follows:

\begin{matrix} IoU = |\frac{G \cap P}{G \cup P}|, \\ Dice = \frac{2 | G \cap P |}{| G | + | P |}, \end{matrix}

(13)

where G represents the GT and P represents the predicted values. We also calculated the precision and recall metrics for the RD and RA views, as well as the global performance, to provide a more comprehensive evaluation of our model’s overall effectiveness.

4.1.2. CRUW

This dataset realizes labeling of the RA view through cross-modal supervision and fusion of stereo camera and radar data. The annotations are represented by confidence maps, which consist of Gaussian distributions jointly formed by position (mean) and scale (variance) information. The data collection process involves the synchronous operation of the camera and radar systems at a frame rate of 30 frames per second (FPS) over a duration of 3.5 h, resulting in a total of 260 k objects. The dataset provides recordings from two different views: front and side views of the driver. Its scenes cover various environments, including parking lots, campus roads, urban streets, and highways, with objects categorized as vehicles, pedestrians, bicycles, and backgrounds. Additionally, the dataset offers diverse lighting conditions and rich background clutter, providing opportunities to improve the performance of millimeter-wave radar object detection in challenging weather conditions.

The evaluation metric for CRUW is defined as the object location similarity (OLS) [56], which measures the similarity between the predictions of detection algorithms and the GT. It is expressed as follows:

OLS (i, j) = exp \{\frac{- d_{i j}^{2}}{2 {(s_{j} k_{cls})}^{2}}\},

(14)

where

d_{i j}

denotes the distance between points i and j in the RF image (measured in meters),

s_{j}

represents the distance between the sensor and object j, and

k_{cls}

is the tolerance for class cls, which is associated with the average size of the object. The evaluation methodology employed in this paper is consistent with that of the CRUW dataset. Initially, the OLS metric is computed between the predicted values and the GT for each frame. Subsequently, OLS thresholds are selected incrementally in steps of 0.05 from 0.5 to 0.9, and the corresponding average precision (AP;

A P_{OLS}

) and average recall (AR;

A R_{OLS}

) are computed. Finally, the mean AP and AR across different OLS thresholds serve as the final evaluation criteria.

4.2. Implementation Details

4.2.1. CARRADA

In the training phase, this paper utilizes four past frames to assist in the semantic segmentation of the current frame, with a single input channel. The input size for the RA view is

1 \times 5 \times 256 \times 256

, while the RD and AD view sizes are both

1 \times 5 \times 256 \times 64

. The batch size is set to 6, and the chosen optimization algorithm is the Adam optimizer, with an initial learning rate of 0.0001. The learning-rate adjustment strategy adopts exponential decay.

4.2.2. CRUW

Due to the lack of RD data in the CRUW dataset, we use the RadarFormer [64] framework to tackle this issue. However, its model is replaced by the proposed U-MLP module, forming U-MLPNet. This adaptation enables our model to be applicable for single-view object detection tasks. Apart from RadarFormer and the proposed model, which accept 32 radar frames as input with 4 chips, 2 input channels, and a size of

2 \times 32 \times 4 \times 128 \times 128

, the other algorithms employ 16 input frames with 1 chip and a size of

2 \times 16 \times 128 \times 128

. The hyperparameters utilized in the experiments, such as batch size, initial learning rate, and optimizer selection, are all in accordance with those of the CARRADA dataset, while a cosine annealing algorithm is applied for learning-rate adaptation. To evaluate the detection performance of various algorithms, 36 sequences from the CRUW dataset are used for training, while 4 are reserved for testing. The training process for all networks is conducted using PyTorch (version 1.12.1) on an NVIDIA RTX 4090 24 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA).

4.3. Comparisons with SOTA Models

In the testing phase, due to the differences between the CRUW and CARRADA datasets, relevant evaluation metrics are employed for assessment and compared with SOTA models to validate the effectiveness of the proposed model.

4.3.1. CARRADA

The performances of various multi-view radar semantic segmentation algorithms are evaluated, including FCN-8s [63], U-Net [62], DeepLabv3+ [61], RSS-Net [52], RAMP-CNN [57], MVNet (baseline) [58], TMVA-Net, T-RODNet [71], PeakConv [46], SS-RODNet [34], LQCANet [68], and TransRadar (SOTA) [72]. The models are trained and tested under identical conditions. The analysis of model complexity in terms of FLOPs employs third-party Python libraries fvcore and thop. It is noteworthy that, unlike the CRUW dataset, the mentioned algorithms calculate the metrics for both RD and RA views in parallel to evaluate the performance of various models.

Table 2 summarizes the performance of the models, with U-MLPNet outperforming the others in RD segmentation. It improves mIoU by 3.0% and mDice by 2.7% over TransRadar (SOTA). In the RA view, its mIoU matches that of TransRadar, while it mDice is 0.3% higher, both of which are superior to those of other algorithms. This is because the baseline algorithm, MVNet, lacks a latent space fusion module, resulting in the worst performance. In contrast, FCN-8s, U-Net, and DeepLabv3+ utilize multi-scale fusion processes but do not leverage inter-frame temporal information. TMVA-Net further leverages a CNN to extract and integrate spatio-temporal information, resulting in a notable performance enhancement. Nevertheless, the CNN exhibits weak global modeling capabilities, prompting TransRadar to employ attention mechanisms to enhance its global feature extraction. Additionally, SS-RODNet and LQCANet, utilizing cross-attention and the combination of pre-training with attention, achieve success in radar segmentation tasks, demonstrating the advantages of integrating transformer with CNNs. Moreover, none of the algorithms mentioned above consider the characteristics of radar signals. PeakConv integrates traditional signal processing algorithms into the convolution process, solving the issue of effectively extracting features from radar signals. Finally, our model combines the advantages of TransRadar and PeakConv, effectively extracting and fusing the local and global spatio-temporal dependencies of radar, achieving the best results.

The precision and recall of different algorithms are shown in Table 3. U-MLPNet achieves the best levels of global precision and recall, significantly outperforming other methods. In the RD view, U-MLPNet achieves the best results in both precision and recall, outperforming the second-ranked TransRadar by 2.0% in precision and PeakConv by 0.3% in recall, which are significantly better than the results achieved by the other methods. In the RA view, U-MLPNet has the highest recall (3.3% higher than that of TransRadar), while its precision is slightly lower than that of TransRadar (0.6%) but still clearly superior to that of others. In summary, the improvement in recall for U-MLPNet is particularly significant, indicating its ability to detect correct targets more comprehensively and reduce missed detections. As for precision, U-MLPNet also performs stably, especially in the RD view. This balance between precision and recall makes U-MLPNet more robust and practical in real-world scenarios.

The results for the RA view are illustrated in Figure 6, showing that the U-MLPNet and TransRadar algorithms have a considerable margin of improvement over other algorithms. Apart from MV-Net, the remaining algorithms exhibit high recognition rates for the vehicle class. Distinguishing between pedestrians and bicycles presents a significant challenge for other algorithms due to the close similarity in their RF images. In contrast, our model is better at distinguishing pedestrians and cyclists (rows 4–7). PeakConv and TMVA-Net perform well in pedestrian recognition but tend to be confused in multi-object scenarios (row 8).

The visual effects of RD view segmentation using different algorithms are depicted in Figure 7. It can be observed that our U-MLPNet demonstrates greater accuracy in identifying car categories compared to other algorithms (rows 1–3), which is attributed to the ability of the proposed model to fully integrate global spatio-temporal information. When pedestrians cross the sensor at close range, U-MLPNet accurately provides segmentation results (row 4), which is crucial for autonomous driving tasks. Additionally, our model maintains reliable segmentation performance in multi-object scenarios (rows 5–6). Finally, we conduct a comprehensive evaluation of the performance of different algorithms across various categories using polar plot visualization. As shown in Figure 8, each line in the polar plot corresponds to a specific algorithm, with the distance from the center indicating the segmentation performance of the algorithm in the respective category. It is evident from the plot that our proposed algorithm excels in all categories. In conclusion, the proposed model extracts more effective semantic features and exhibits higher robustness in segmentation tasks under various conditions, outperforming other algorithms.

4.3.2. CRUW

We compare several SOTA radar object detection algorithms, including RODNet-CDC (baseline) [56], RODNet-HG, RODNet-HWGI, SS-RODNet [34], T-RODNet [71], and RadarFormer [64]. The training and testing of all models are performed on the same platform. As the CRUW data dimensions differ from those of CARRADA and lack the RD view, our detection model opts for the architecture of RadarFormer. However, we replace only the model part with the U-MLP module while maintaining consistency in other parameters. The effectiveness of the detection algorithm is evaluated through the AP and AR values, as well as the visual performance of different models.

Table 4 presents the quantitative results of different algorithms on the CRUW dataset. It is clear that our model outperforms the others in overall metrics, with a 1.77% improvement in AR over SS-RODNet and a 2.03% advantage in AP over RadarFormer. This is because SS-RODNet, T-RODNet, RadarFormer, and U-MLPNet are all based on transformer architectures. While the first three models enhance performance through attention mechanisms, U-MLPNet employs a U-MLP module specifically designed for radar, thereby achieving the best results. This demonstrates that our model not only inherits the powerful global modeling capability of the transformer architecture but also more effectively integrates omni-dimensional radar features. Furthermore, the feature fusion stage of U-MLPNet integrates a multi-scale MLP structure, consuming less GPU memory during training compared to RadarFormer while achieving better detection results. This validates the advantage and effectiveness of the proposed method, demonstrating the feasibility of introducing an MLP structure for radar object detection tasks.

Figure 9 presents the visual effects of different algorithms. All algorithms exhibit good performance in recognizing larger objects (rows 5 and 6). On the other hand, U-MLPNet outperforms other algorithms in recognizing small objects, such as bicycles (rows 1 and 2). Additionally, in multi-class recognition tasks, other algorithms tend to misclassify or ignore small objects at longer distances, whereas U-MLPNet shows more robust detection performance (row 3). This is attributed to the introduction of the multi-scale, cross-dilated receptive field, which allows the model to extract radar semantic information more effectively. It is noteworthy that by exploiting the paradigm of latent space fusion, RadarFormer and the proposed algorithm perform better in pedestrian recognition compared to the other algorithms (row 4). In summary, the proposed single-view object detection model exhibits enhanced suitability for radar signal modeling, thereby displaying SOTA comprehensive performance metrics.

4.4. Ablation Studies

In the ablation study section, the CARRADA dataset is chosen to emphasize the effectiveness of each component. This choice is motivated by its provision of labels for both RD and RA views, as well as RAD tensors, offering a richer annotation corpus. Such comprehensive annotations aid in a more thorough exploration of the potential of radar perception algorithms. The results are shown in Table 5.

4.4.1. Effectiveness of Cross-Dilated Receptive Field

We set the dilation rate to 1 to validate the effectiveness of the cross-dilated receptive field. The results in Table 5 demonstrate that adopting cross-dilated receptive fields led to substantial improvements of 1.9% in

mIo U_{RD}

and 2.3% in

mIo U_{RA}

. This is because the receptive field proposed in this paper extracts information that is more aligned with radar semantics. Furthermore, our design indirectly introduces guard units between radar objects, mitigating inter-object interference, which is consistent with the conclusion of PeakConv. Concurrently, our lightweight MLP architecture leads to reduced memory requirements during training and offers superior performance relative to PeakConv.

4.4.2. Effectiveness of Multi-Scale Fusion

To validate the effectiveness of the multi-scale fusion module, we conduct experiments by removing the downsampling process in each stage of U-MLP. As indicated in Table 5, the removal results in a decrease of 0.7% in the

mIoU

metric for the RD view and a decrease of 1.4% for the RA view. The significant decrease in the RA view demonstrates that the extraction and fusion of semantic features at different scales are crucial for radar perception. This is due to the fact that radar sensors have lower sensitivity to object sizes, making it difficult to capture size features at a single scale. Therefore, employing multi-scale semantic features can more effectively assist radar systems in identifying different types of objects.

4.4.3. Effectiveness of Multi-Scale, Multi-View Fusion

To validate the effectiveness of multi-scale, multi-view feature fusion, we separate the RD and RA branches in the decoding stage of the U-MLP. The results presented in Table 5 indicate that, compared to the integrated model, the segregated model experiences a notable decrease in mIoU, with a reduction of 2.0% for the RD view and 2.9% for the RA view. This decline is primarily attributed to the incomplete fusion of radar information by a segregated model. In contrast, the proposed multi-scale, multi-view fusion structure fully integrates more complementary information, achieving better performance in radar tasks.

4.4.4. Effectiveness of Skip Connections

The removal of skip connections between the RD and RA views is conducted to evaluate their impact on model performance. From the results presented in Table 6, it is observed that the removal causes a 2.0% decrease in mIoU for the RD view and only a 0.3% decrease for the RA view. This indicates that skip connections significantly enhance the segmentation accuracy of the RD view while minimally affecting the RA view. In the subsequent analysis, we retain the skip connections for RD and RA separately to further investigate this issue. The results from Table 6 indicate that retaining the former leads to a 1.0% increase in the RD view, while the RA view decreases by 0.9%. Conversely, the latter results in a 3.6% decrease in the RD-view mIoU and a 1.4% decrease in the RA view. It is evident that the RD view plays a dominant role in segmentation tasks, while the connections from the RA view contribute to the balancing of the performance of the model. This is because the RD view contains velocity information with higher resolution, while the RA view has lower-resolution angle information. Overall, the skip connections introduced in the prediction stage preserve the diversity between different views, thereby improving the segmentation performance of the model for both views.

4.4.5. Effectiveness of Dilation Factor

We explore the effects of different dilation factors on model performance. The results show that their selection has a significant impact on radar perception and interference suppression capabilities. As shown in Table 7, when the dilation factor is set to 1, the model exhibits basic feature extraction capabilities. Increasing the dilation factor to 2 results in a decline in performance due to an excessively large receptive field. However, at a dilation factor of 3, the model achieves optimal mIoU values of 63.5% and 44.5% for the RD and RA views, respectively. This is attributed to the fact that a smaller dilation factor may be insufficient to differentiate between target and interference signals, while a larger dilation factor may lead to insufficient fine feature extraction. In summary, an appropriate dilation factor not only enhances perceptual accuracy but also strengthens the model’s robustness against interference.

4.5. Nighttime Test

To comprehensively assess the model’s performance, we test U-MLPNet in a challenging nighttime street scenario. Although the lack of GT labels presents difficulties for quantitative analysis, qualitative visual assessments still demonstrate the advantages of the proposed model. As shown in Figure 10, U-MLPNet performs well with radar data under low-light nighttime conditions, maintaining high detection performance, even in dense vehicle environments. This finding is critical, as traditional visual sensors often struggle to provide reliable results under such conditions. Our experiments confirm that U-MLPNet, based on millimeter-wave radar, can maintain high detection accuracy when cameras cannot operate effectively due to insufficient lighting, highlighting the potential value of our approach in multi-sensor fusion systems.

4.6. Model Complexity Analysis

To evaluate model complexity, we utilize a third-party Python library to compute the FLOPs for different algorithms, with all evaluations conducted in an identical environment. The findings on the CARRADA dataset, as depicted in Table 8, reveal that the computational overhead of U-MLPNet is marginally lower than that of PeakConv but slightly higher than that of TransRadar and TMVA-Net. Nevertheless, it attains the best overall performance. In terms of inference time, U-MLPNet and TransRadar show similar speeds and significantly outperform the other models. Furthermore, as shown in Table 9, we report the average computational complexity of different models when predicting the same number of frames on the CRUW dataset. Remarkably, the proposed method achieves SOTA detection performance with a lower computational burden compared to RadarFormer and T-RODNet. Although the inference time of U-MLPNet is slightly higher than that of RODNet-CDC and T-RODNet, it still shows higher efficiency relative to RODNet-HG and RadarFormer. In summary, our token mixer method strikes a good balance between performance and complexity, outperforming attention-based approaches (e.g., T-RODNet, TransRadar, and RadarFormer), particularly on the larger CRUW dataset.

5. Conclusions

This paper proposes a novel U-MLPNet framework to learn omni-dimensional spatio-temporal dependencies for radar perception. Specifically, we introduce the U-MLP module into the latent space to capture global spatio-temporal information. This overcomes the limitations of CNNs in long-range modeling. U-MLP consists of long skip connections and multi-scale MLPs, facilitating the interaction and fusion of semantic information across different layers. It also implements a cross-shaped receptive field with guard units, preventing interference signals from affecting radar feature extraction. Notably, the MLP structure exhibits lower computational costs and is more conducive to low-level optimization compared to the attention mechanism. In the multi-view task, the long skip connections between the RA and RD encoders and decoders not only enhance the diversity between different views but also balance model performance. Extensive experimentation confirms that U-MLPNet achieves SOTA results on both the CARRADA and CRUW datasets. Particularly on CRUW, our model exhibits remarkable superiority over other models, with lower computational resource consumption compared to attention-based methods.

Despite achieving some progress, our algorithm still requires improvement. The annotation information in radar datasets may contain errors or omissions, which could lead to the model learning incorrect knowledge. Adopting self-supervised learning methods [34] may alleviate this issue. Furthermore, the computational costs and parameters of our proposed U-MLP model are still higher than those of some attention-based methods. We will explore more efficient feature extraction techniques in future work to reduce computational overhead further. Consequently, we will explore more efficient and lightweight radar perception models in future work. This will involve utilizing model pruning [77], quantization [78,79], and knowledge distillation [80,81,82], techniques to further mitigate computational overhead. Finally, we acknowledge the current limitations of our study in terms of scalability and generalization. In future work, we will address these limitations through more extensive experiments:

Expanded Datasets: We plan to test U-MLPNet across a wider range of radar tasks to fully evaluate its scalability.
Real-world Testing: We will collaborate with industry partners to test U-MLPNet on private radar datasets to promote its application in real-world scenarios.
Generalization Enhancement: To improve the generalization capabilities of U-MLPNet, we will introduce techniques such as domain adaptation and transfer learning to enhance the performance of the model in different application scenarios.

Author Contributions

conceptualization and summary, H.Y., L.W. and S.C.; Investigation, H.Y. and Y.L.; Validation, H.Y. and Y.L.; Writing—original draft, H.Y.; Writing—review and editing, H.Y., Y.L., L.W. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62271408, and in part by the Fundamental Research Funds for the Central Universities under Grant G2024KY05104.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sun, R.; Suzuki, K.; Owada, Y.; Takeda, S.; Umehira, M.; Wang, X.; Kuroda, H. A millimeter-wave automotive radar with high angular resolution for identification of closely spaced on-road obstacles. Sci. Rep. 2023, 13, 3233. [Google Scholar] [CrossRef] [PubMed]
Yao, S.; Guan, R.; Huang, X.; Li, Z.; Sha, X.; Yue, Y.; Lim, E.G.; Seo, H.; Man, K.L.; Zhu, X.; et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Trans. Intell. Veh. 2023, 9, 2094–2128. [Google Scholar] [CrossRef]
Li, P.; Wang, P.; Berntorp, K.; Liu, H. Exploiting temporal relations on radar perception for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17071–17080. [Google Scholar]
Yoneda, K.; Suganuma, N.; Yanase, R.; Aldibaja, M. Automated driving recognition technologies for adverse weather conditions. IATSS Res. 2019, 43, 253–262. [Google Scholar] [CrossRef]
Tait, P. Introduction to Radar Target Recognition; IET: London, UK, 2005; Volume 18. [Google Scholar]
Cao, X.; Yi, J.; Gong, Z.; Wan, X. Automatic target recognition based on RCS and angular diversity for multistatic passive radar. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 4226–4240. [Google Scholar] [CrossRef]
Zhang, Z.; Wang, X.; Huang, D.; Fang, X.; Zhou, M.; Zhang, Y. MRPT: Millimeter-wave radar-based pedestrian trajectory tracking for autonomous urban driving. IEEE Trans. Instrum. Meas. 2021, 71, 1–17. [Google Scholar] [CrossRef]
Richards, M.A. Fundamentals of Radar Signal Processing; Mcgraw-Hill: New York, NY, USA, 2005; Volume 1. [Google Scholar]
Wang, Y.; Wang, W.; Zhou, M.; Ren, A.; Tian, Z. Remote monitoring of human vital signs based on 77-GHz mm-wave FMCW radar. Sensors 2020, 20, 2999. [Google Scholar] [CrossRef]
Scharf, L.; Demeure, C. Statistical Signal Processing: Detection, Estimation, and Time Series Analysis; Addison-Wesley Series in Electrical and Computer Engineering; Addison-Wesley Publishing Company: Boston, MA, USA, 1991. [Google Scholar]
Chen, V.C.; Li, F.; Ho, S.S.; Wechsler, H. Analysis of micro-Doppler signatures. IEE Proc.-Radar, Sonar Navig. 2003, 150, 271–276. [Google Scholar] [CrossRef]
Zhou, H.; Jiang, T. Decision tree based sea-surface weak target detection with false alarm rate controllable. IEEE Signal Process. Lett. 2019, 26, 793–797. [Google Scholar] [CrossRef]
Li, Y.; Xie, P.; Tang, Z.; Jiang, T.; Qi, P. SVM-based sea-surface small target detection: A false-alarm-rate-controllable approach. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1225–1229. [Google Scholar] [CrossRef]
Guo, Z.X.; Shui, P.L. Anomaly based sea-surface small target detection using K-nearest neighbor classification. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4947–4964. [Google Scholar] [CrossRef]
Du, L.; Liu, H.; Bao, Z. Radar HRRP statistical recognition: Parametric model and model selection. IEEE Trans. Signal Process. 2008, 56, 1931–1944. [Google Scholar] [CrossRef]
Feng, D.; Harakeh, A.; Waslander, S.L.; Dietmayer, K. A review and comparative study on probabilistic object detection in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9961–9980. [Google Scholar] [CrossRef]
Paek, D.H.; Kong, S.H.; Wijaya, K.T. Enhanced k-radar: Optimal density reduction to improve detection performance and accessibility of 4d radar tensor-based object detection. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–6. [Google Scholar]
Venon, A.; Dupuis, Y.; Vasseur, P.; Merriaux, P. Millimeter Wave FMCW RADARs for Perception, Recognition and Localization in Automotive Applications: A Survey. IEEE Trans. Intell. Veh. 2022, 7, 533–555. [Google Scholar] [CrossRef]
Wang, C.X.; Chen, X.; Zou, H.Y.; He, S.; Tang, X. Automatic target recognition of millimeter-wave radar based on deep learning. J. Phys. Conf. Ser. 2021, 2031, 12031. [Google Scholar] [CrossRef]
Orr, I.; Cohen, M.; Zalevsky, Z. High-resolution radar road segmentation using weakly supervised learning. Nat. Mach. Intell. 2021, 3, 239–246. [Google Scholar] [CrossRef]
Angelov, A.; Robertson, A.; Murray-Smith, R.; Fioranelli, F. Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar Sonar Navig. 2018, 12, 1082–1089. [Google Scholar] [CrossRef]
Wang, J.; Guo, J.; Shao, X.; Wang, K.; Fang, X. Road targets recognition based on deep learning and micro-Doppler features. In Proceedings of the 2018 International Conference on Sensor Networks and Signal Processing (SNSP), Xi’an, China, 28–31 October 2018; pp. 271–276. [Google Scholar]
Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef]
Wang, Y.; Jiang, Z.; Gao, X.; Hwang, J.N.; Xing, G.; Liu, H. Rodnet: Radar object detection using cross-modal supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 504–513. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163. [Google Scholar] [CrossRef]
Li, P.; Zhang, Y.; Yuan, L.; Xu, X. Fully transformer-equipped architecture for end-to-end referring video object segmentation. Inf. Process. Manag. 2024, 61, 103566. [Google Scholar] [CrossRef]
Liang, J.; Cao, J.; Fan, Y.; Zhang, K.; Ranjan, R.; Li, Y.; Timofte, R.; Van Gool, L. Vrt: A video restoration transformer. IEEE Trans. Image Process. 2024, 33, 2171–2182. [Google Scholar] [CrossRef]
Li, D.; Shi, X.; Zhang, Y.; Cheung, K.C.; See, S.; Wang, X.; Qin, H.; Li, H. A simple baseline for video restoration with grouped spatial-temporal shift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9822–9832. [Google Scholar]
Xu, K.; Xu, L.; He, G.; Yu, W.; Li, Y. Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer. arXiv 2024, arXiv:2404.13640. [Google Scholar]
Jiao, L.; Zhang, X.; Liu, X.; Liu, F.; Yang, S.; Ma, W.; Li, L.; Chen, P.; Feng, Z.; Guo, Y.; et al. Transformer meets remote sensing video detection and tracking: A comprehensive survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1–45. [Google Scholar] [CrossRef]
Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple-object tracking with transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 659–675. [Google Scholar]
Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. Videotrack: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22826–22835. [Google Scholar]
Zhuang, L.; Jiang, T.; Wang, J.; An, Q.; Xiao, K.; Wang, A. Effective mmWave Radar Object Detection Pre-Training Based on Masked Image Modeling. IEEE Sens. J. 2023, 24, 3999–4010. [Google Scholar] [CrossRef]
Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
Wang, J.; Zhang, S.; Liu, Y.; Wu, T.; Yang, Y.; Liu, X.; Chen, K.; Luo, P.; Lin, D. Riformer: Keep your vision backbone effective but removing token mixer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14443–14452. [Google Scholar]
Kang, B.; Moon, S.; Cho, Y.; Yu, H.; Kang, S.J. MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 434–443. [Google Scholar]
Lu, Z.; Kang, L.; Huang, J. Depthwise Convolution with Channel Mixer: Rethinking MLP in MetaFormer for Faster and More Accurate Vehicle Detection. In Proceedings of the International Conference on Artificial Neural Networks, Heraklion, Crete, Greece, 26–29 September 2023; Proceedings, Part X. Springer: Cham, Switzerland, 2023; pp. 136–147. [Google Scholar]
Chen, J.; Luo, R. MetaCNN: A New Hybrid Deep Learning Image-based Approach for Vehicle Classification Using Transformer-like Framework. In Proceedings of the 5th International Conference on Computer Science and Software Engineering, Guilin, China, 21–23 October 2022; pp. 517–521. [Google Scholar]
Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
Bozic, V.; Dordevic, D.; Coppola, D.; Thommes, J.; Singh, S.P. Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers. arXiv 2023, arXiv:2311.10642. [Google Scholar]
Cenkeramaddi, L.R.; Rai, P.K.; Dayal, A.; Bhatia, J.; Pandya, A.; Soumya, J.; Kumar, A.; Jha, A. A novel angle estimation for mmWave FMCW radars using machine learning. IEEE Sens. J. 2021, 21, 9833–9843. [Google Scholar] [CrossRef]
Gupta, S.; Rai, P.K.; Kumar, A.; Yalavarthy, P.K.; Cenkeramaddi, L.R. Target classification by mmWave FMCW radars using machine learning on range-angle images. IEEE Sens. J. 2021, 21, 19993–20001. [Google Scholar] [CrossRef]
Bi, X. Environmental Perception Technology for Unmanned Systems; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
Nguyen, M.Q.; Feger, R.; Wagner, T.; Stelzer, A. High Angular Resolution Method Based on Deep Learning for FMCW MIMO Radar. IEEE Trans. Microw. Theory Tech. 2023, 71, 5413–5427. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, X.; Zhang, Y.; Guo, Y.; Chen, Y.; Huang, X.; Ma, Z. Peakconv: Learning peak receptive field for radar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17577–17586. [Google Scholar]
Lian, D.; Yu, Z.; Sun, X.; Gao, S. As-mlp: An axial shifted mlp architecture for vision. arXiv 2021, arXiv:2107.08391. [Google Scholar]
Zhang, A.; Nowruzi, F.E.; Laganiere, R. Raddet: Range-azimuth-doppler based radar object detection for dynamic road users. In Proceedings of the 2021 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021; pp. 95–102. [Google Scholar]
Abdu, F.J.; Zhang, Y.; Fu, M.; Li, Y.; Deng, Z. Application of deep learning on millimeter-wave radar signals: A review. Sensors 2021, 21, 1951. [Google Scholar] [CrossRef]
Jiang, W.; Wang, Y.; Li, Y.; Lin, Y.; Shen, W. Radar target characterization and deep learning in radar automatic target recognition: A review. Remote Sens. 2023, 15, 3742. [Google Scholar] [CrossRef]
van Berlo, B.; Elkelany, A.; Ozcelebi, T.; Meratnia, N. Millimeter wave sensing: A review of application pipelines and building blocks. IEEE Sens. J. 2021, 21, 10332–10368. [Google Scholar] [CrossRef]
Kaul, P.; De Martini, D.; Gadd, M.; Newman, P. Rss-net: Weakly-supervised multi-class semantic segmentation with fmcw radar. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 431–436. [Google Scholar]
Dong, X.; Wang, P.; Zhang, P.; Liu, L. Probabilistic oriented object detection in automotive radar. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 102–103. [Google Scholar]
Patel, K.; Rambach, K.; Visentin, T.; Rusev, D.; Pfeiffer, M.; Yang, B. Deep learning-based object classification on automotive radar spectra. In Proceedings of the 2019 IEEE Radar Conference (RadarConf), Boston, MA, USA, 22–26 April 2019; pp. 1–6. [Google Scholar]
Palffy, A.; Dong, J.; Kooij, J.F.; Gavrila, D.M. CNN based road user detection using the 3D radar cube. IEEE Robot. Autom. Lett. 2020, 5, 1263–1270. [Google Scholar] [CrossRef]
Wang, Y.; Jiang, Z.; Li, Y.; Hwang, J.N.; Xing, G.; Liu, H. RODNet: A real-time radar object detection network cross-supervised by camera-radar fused object 3D localization. IEEE J. Sel. Top. Signal Process. 2021, 15, 954–967. [Google Scholar] [CrossRef]
Gao, X.; Xing, G.; Roy, S.; Liu, H. Ramp-cnn: A novel neural network for enhanced automotive radar object recognition. IEEE Sens. J. 2020, 21, 5119–5132. [Google Scholar] [CrossRef]
Ouaknine, A.; Newson, A.; Pérez, P.; Tupin, F.; Rebut, J. Multi-view radar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15671–15680. [Google Scholar]
Ouaknine, A.; Newson, A.; Rebut, J.; Tupin, F.; Pérez, P. Carrada dataset: Camera and automotive radar with range-angle-doppler annotations. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5068–5075. [Google Scholar]
Yu, Y.; Wang, C.; Fu, Q.; Kou, R.; Huang, F.; Yang, B.; Yang, T.; Gao, M. Techniques and challenges of image segmentation: A review. Electronics 2023, 12, 1199. [Google Scholar] [CrossRef]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Dalbah, Y.; Lahoud, J.; Cholakkal, H. RadarFormer: Lightweight and accurate real-time radar object detection model. In Proceedings of the Scandinavian Conference on Image Analysis, Sirkka, Finland, 18–21 April 2023; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2023; pp. 341–358. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Kothari, R.; Kariminezhad, A.; Mayr, C.; Zhang, H. Object detection and heading forecasting by fusing raw radar data using cross attention. arXiv 2022, arXiv:2205.08406. [Google Scholar]
Zhuang, L.; Jiang, T.; Jiang, H.; Wang, A.; Huang, Z. LQCANet: Learnable-Query-Guided Multi-Scale Fusion Network based on Cross-Attention for Radar Semantic Segmentation. IEEE Trans. Intell. Veh. 2023, 9, 3330–3344. [Google Scholar] [CrossRef]
Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 459–479. [Google Scholar]
Jiang, T.; Zhuang, L.; An, Q.; Wang, J.; Xiao, K.; Wang, A. T-rodnet: Transformer for vehicular millimeter-wave radar object detection. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
Dalbah, Y.; Lahoud, J.; Cholakkal, H. TransRadar: Adaptive-Directional Transformer for Real-Time Multi-View Radar Semantic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 353–362. [Google Scholar]
Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5769–5780. [Google Scholar]
Yu, T.; Li, X.; Cai, Y.; Sun, M.; Li, P. S 2-MLPv2: Improved spatial-shift MLP architecture for vision. arXiv 2021, arXiv:2108.01072. [Google Scholar]
Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10558–10578. [Google Scholar] [CrossRef]
Kuzmin, A.; Nagel, M.; Van Baalen, M.; Behboodi, A.; Blankevoort, T. Pruning vs. quantization: Which is better? In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 7–14 December 2024; Volume 36. [Google Scholar]
Tang, H.; Sun, Y.; Wu, D.; Liu, K.; Zhu, J.; Kang, Z. Easyquant: An efficient data-free quantization algorithm for llms. arXiv 2024, arXiv:2403.02775. [Google Scholar]
Sun, S.; Ren, W.; Li, J.; Wang, R.; Cao, X. Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15731–15740. [Google Scholar]
Pham, C.; Nguyen, V.A.; Le, T.; Phung, D.; Carneiro, G.; Do, T.T. Frequency attention for knowledge distillation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 2277–2286. [Google Scholar]
Wang, J.; Chen, Y.; Zheng, Z.; Li, X.; Cheng, M.M.; Hou, Q. CrossKD: Cross-head knowledge distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16520–16530. [Google Scholar]

Figure 1. The complete millimeter-wave radar signal collection and preprocessing pipeline. First, the received and transmitted signals are mixed to generate raw ADC data. These signals are then subjected to various forms of FFT algorithms, resulting in the RA view, RD view, and RAD tensor, which are the RF signals prepared for further processing.

Figure 2. Overall framework of our U-MLPNet. The left part represents the multi-view encoder, the middle part is the latent space, and the right part is the dual-view decoder. The skip connections between the encoder and decoder effectively maintain the disparities between different perspectives and balance model performance. The latent space contains the U-MLP module, which can efficiently fuse multi-scale, multi-view global and local spatio-temporal features.

Figure 3. Radar RF features. The top row illustrates the CARRADA dataset with RGB images and RA, RD, and AD views arranged from left to right. The bottom row shows the echo of the CRUW dataset, with RGB images on the left and RA images on the right.

Figure 4. Overall framework of our U-MLP. The left side the encoder, while the right side represents the decoder. The encoder employs a lightweight MLP to extract meaningful radar features. The decoder progressively integrates these features and restores resolution in a stepwise manner.

Figure 5. The receptive field of U-MLP. The original receptive field, the receptive field proposed in this paper, and the equivalent guard band are displayed from left to right. Feature points, the guard band, and feature regions are distinguished by orange, a blue diagonal grid, and light blue, respectively.

Figure 6. Visual comparison of RA views for various algorithms on the CARRADA dataset. The pedestrian category is annotated in red, the car category in blue, and the cyclist category in green.

Figure 7. Visual comparison of RD views for various algorithms on the CARRADA dataset. The pedestrian category is highlighted in red, the car category in blue, and the cyclist category in green. (a–h) RGB images, RF images, ground truth (GT), U-MLPNet, TransRadar, PeakConv, TMVA-Net, and MVNet, respectively.

Figure 8. Polar plot of RD views for various algorithms on the CARRADA dataset across different categories. Each line represents the mIoU of a specific algorithm across these categories, with higher values indicating superior performance.

Figure 9. Visual comparison of RA views for various algorithms on the CRUW dataset. The pedestrian category is annotated in red, the car category in blue, and the cyclist category in green.

Figure 10. To evaluate the performance and robustness of U-MLPNet in complex environments, we conduct qualitative testing using a nighttime dataset.

Table 1. Sensor configurations for the CARRADA and CRUW datasets.

Dataset	CARRADA	CRUW
Frequency	77 GHz	77 GHz
Sweep Bandwidth	4 GHz	-
Maximum Range	50 m	-
Range Resolution	0.20 m	0.23 m
Maximum Radial Velocity	13.43 m/s	-
Radial Velocity Resolution	0.42 m/s	-
Field of View	180°	180°
Angle Resolution	0.70°	∼15°
Number of Chirps per Frame	64	255
Number of Samples per Chirp	256	-

Table 2. Comparison of IoU and Dice metrics of different algorithms on the CARRADA dataset, with the best result highlighted in bold.

View	Method	Params (M)	IoU (%)					Dice (%)
View	Method	Params (M)	Bkg.	Ped.	Cycl.	Car	mIoU	Bkg.	Ped.	Cycl.	Car	mDice
RD	FCN-8s [63]	134.3	99.7	47.7	18.7	52.9	54.7	99.8	24.8	16.5	26.9	66.3
	U-Net [62]	17.3	99.7	51.1	33.4	37.7	55.4	99.8	67.5	50.0	54.7	68.0
	DeepLabv3+ [61]	59.3	99.7	43.2	11.2	49.2	50.8	99.9	60.3	20.2	66.0	61.6
	RSS-Net [52]	10.1	99.3	0.1	4.1	25.0	32.1	99.7	0.2	7.9	40.0	36.9
	RAMP-CNN [57]	106.4	99.7	48.8	23.2	54.7	56.6	99.9	65.6	37.7	70.8	68.5
	MVNet [58]	2.4	98.0	0.0	3.8	14.1	29.0	99.0	0.0	7.3	24.8	32.8
	TMVA-Net [58]	5.6	99.7	52.6	29.0	53.4	58.7	99.8	68.9	45.0	69.6	70.9
	PeakConv [46]	6.3	99.7	55.0	28.6	58.3	60.4	99.9	71.0	44.4	73.6	72.2
	TransRadar [72]	4.8	99.8	56.5	28.7	57.0	60.5	99.9	72.2	44.6	72.6	72.3
	U-MLPNet	17.9	99.8	59.3	32.4	62.4	63.5	99.9	74.5	49.0	76.9	75.0
RA	FCN-8s [63]	134.3	99.8	14.8	0.0	23.3	34.5	99.9	25.8	0.0	37.8	40.9
	U-Net [62]	17.3	99.8	22.4	8.8	0.0	32.8	99.9	25.8	0.0	37.8	40.9
	DeepLabv3+ [61]	59.3	99.9	3.4	5.9	21.8	32.7	99.9	6.5	11.1	35.7	38.3
	RSS-Net [52]	10.1	99.5	7.3	5.6	15.8	32.1	99.8	13.7	10.5	27.4	37.8
	RAMP-CNN [57]	106.4	99.8	1.7	2.6	7.2	27.9	99.9	3.4	5.1	13.5	30.5
	MVNet [58]	2.4	98.8	0.1	1.1	6.2	26.8	99.0	0.0	7.3	24.8	28.5
	TMVA-Net [58]	5.6	99.8	26.0	8.6	30.7	41.3	99.9	41.3	15.9	47.0	51.0
	T-RODNet [71]	162.0	99.9	25.4	9.5	39.4	43.5	99.9	40.5	17.4	56.6	53.6
	PeakConv [46]	6.3	99.8	24.3	11.8	35.5	42.8	99.9	39.1	21.1	52.4	53.1
	SS-RODNet [34]	33.1	99.9	26.7	8.9	37.2	43.2	99.9	42.2	16.3	54.2	53.2
	LQCANet [68]	148.3	99.9	25.3	11.3	39.5	44.0	99.9	40.4	20.5	56.6	54.4
	TransRadar [72]	4.8	99.9	28.9	14.3	34.9	44.5	99.9	44.9	25.0	51.8	55.4
	U-MLPNet	17.9	99.9	28.2	17.5	32.4	44.5	99.9	44.0	29.8	48.9	55.7

Table 3. Comparison of precision and recall metrics of different algorithms on the CARRADA dataset, with the best result highlighted in bold.

View	Metric	Method
View	Metric	MVNet	TMVA-Net	PeakConv	TransRadar	U-MLPNet
RD	Precision (%)	29.6	65.0	66.7	69.4	71.4
RD	Recall (%)	65.3	78.4	79.8	77.6	80.1
RA	Precision (%)	56.4	48.1	50.5	58.9	58.3
RA	Recall (%)	26.9	55.1	56.4	55.3	58.6
Global	Precision (%)	43.0	56.5	58.6	64.1	64.9
Global	Recall (%)	46.1	66.8	68.1	66.4	69.4

Table 4. Comparison of AP and AR metrics of different algorithms on the CRUW dataset, with the best result highlighted in bold.

Model	All		Pedestrian		Cyclist		Car
Model	AP (%)	AR (%)	AP (%)	AR (%)	AP (%)	AR (%)	AP (%)	AR (%)
RODNet-CDC [56]	75.20	77.84	76.13	77.98	67.38	68.05	82.46	88.59
RODNet-HG [56]	77.04	79.50	77.93	79.75	68.49	69.06	85.18	90.79
RODNet-HWGI [56]	78.06	81.07	79.47	81.85	70.35	71.40	84.39	90.65
SS-RODNet [34]	83.07	86.43	81.37	84.61	83.34	84.34	85.55	90.86
T-RODNet [71]	80.74	86.12	79.76	83.59	79.87	85.18	83.29	91.27
RadarFormer [64]	82.63	86.56	83.08	86.55	82.52	83.54	82.03	89.94
U-MLPNet	84.84	88.59	85.41	88.30	85.44	86.96	83.22	90.89

Table 5. Analysis of the effectiveness of cross-dilated receptive fields; multi-scale fusion; multi-scale, multi-view fusion; and skip connections. The best result is highlighted in bold.

Cross-Dilated Receptive Fields	Multi-Scale Fusion	Multi-Scale, Multi-View Fusion	mIoU_RD (%)	mIoU_RA (%)
✗	✓	✓	61.6	42.2
✓	✗	✓	62.8	43.1
✓	✓	✗	61.5	41.6
✓	✓	✓	63.5	44.5

Table 6. Analysis of the effectiveness of RA and RD skip connections. The best result is highlighted in bold.

RA-Skip	RD-Skip	$mIo U_{RD}$ (%)	$mIo U_{RA}$ (%)
✗	✗	61.5	44.2
✗	✓	64.5	43.6
✓	✗	59.9	43.1
✓	✓	63.5	44.5

Table 7. Analysis of the effectiveness of various dilation factor values. The best result is highlighted in bold.

Dilation Factor	1	2	3
$mIo U_{RD}$ (%)	61.6	60.0	63.5
$mIo U_{RA}$ (%)	42.2	39.8	44.5

Table 8. Computational complexity and inference time (in milliseconds) on the CARRADA dataset. The best result is highlighted in bold.

Model	MVNet	TMVA-Net	PeakConv	TransRadar	U-MLPNet
FLOPs (G)	53.57	96.11	112.21	91.08	108.17
Infer. Time (ms)	146.04	149.82	2718.56	25.93	23.68

Table 9. Computational complexity and inference time (in milliseconds) on the CRUW dataset. The best result is highlighted in bold.

Model	RODNet-CDC	RODNet-HG	T-RODNet	RadarFormer	U-MLPNet
FLOPs (G)	280.03	129.19	182.53	150.12	138.58
Infer. Time (ms)	7.78	119.56	27.71	106.05	44.35

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yan, H.; Li, Y.; Wang, L.; Chen, S. Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception. Remote Sens. 2024, 16, 4256. https://doi.org/10.3390/rs16224256

AMA Style

Yan H, Li Y, Wang L, Chen S. Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception. Remote Sensing. 2024; 16(22):4256. https://doi.org/10.3390/rs16224256

Chicago/Turabian Style

Yan, Hang, Yongji Li, Luping Wang, and Shichao Chen. 2024. "Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception" Remote Sensing 16, no. 22: 4256. https://doi.org/10.3390/rs16224256

APA Style

Yan, H., Li, Y., Wang, L., & Chen, S. (2024). Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception. Remote Sensing, 16(22), 4256. https://doi.org/10.3390/rs16224256

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception

Abstract

1. Introduction

2. Related Works

2.1. CNN-Based Tasks

2.2. Attention-Based Tasks

2.3. MLP-Based Tasks

3. Proposed Method

3.1. Overall Framework

3.2. Encoder and Decoder Architectures

3.3. U-MLP

3.4. Loss Function

3.4.1. Object-Centric (OC) Focal Loss

3.4.2. Class-Agnostic Object Localization (CL) Loss

3.4.3. Soft Dice (SD) Loss

3.4.4. Multi-View (MV) Range-Matching Loss

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. CARRADA

4.1.2. CRUW

4.2. Implementation Details

4.2.1. CARRADA

4.2.2. CRUW

4.3. Comparisons with SOTA Models

4.3.1. CARRADA

4.3.2. CRUW

4.4. Ablation Studies

4.4.1. Effectiveness of Cross-Dilated Receptive Field

4.4.2. Effectiveness of Multi-Scale Fusion

4.4.3. Effectiveness of Multi-Scale, Multi-View Fusion

4.4.4. Effectiveness of Skip Connections

4.4.5. Effectiveness of Dilation Factor

4.5. Nighttime Test

4.6. Model Complexity Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI