1. Introduction
The accurate perception and real-time tracking of traffic flow on highways play an important role in transportation system management. The diversification of roadside sensors, such as lidar, millimeter-wave (mmWave) radar and cameras, has expanded the range of data sources available for vehicle tracking. An information-based, systematic, and all-weather traffic flow perception system can enhance the safety and efficiency of highway operation. In practice, building such a traffic flow perception system is challenging because it needs technical support from edge computing devices and 5G communication technology. Hence, to provide these hardware and technical conditions, it is necessary to develop a realistic simulation framework with an accurate tracking algorithm for roadside perception.
The most widely deployed sensors on highways are currently cameras paired with object detection algorithms, and supervised deep learning is the main method for training these machine vision models. In recent years, the emergence of neural networks has promoted the development of cutting-edge object detection algorithms. Many mature algorithms have been employed for vehicle detection on the road, such as YOLO-V7 [1], YOLO-V8 [2] and PP-YOLOE [3]. These algorithms are competent at identifying the bounding box of each vehicle and can even extract vehicle features, including vehicle model and color, from video. However, video-based detection still has many drawbacks, because poor illumination conditions, abnormal weather and local feature occlusion can significantly impair the accuracy and continuity of vehicle trajectory detection.
Additionally, collecting traffic flow data is necessary for roadside perception [4]. It mainly relies on roadside traffic sensors, which provide raw data for the roadside perception network. Traditional sensors, such as induction coils, section radar and geomagnetic nails, can only provide information on vehicle lane and speed at a specific moment or cross-section, and the data they collect cannot meet the high-precision requirements for traffic target identification and detection [5]. In recent years, the millimeter-wave (mmWave) radar industry for roadside traffic has developed rapidly. It is becoming the mainstream perception sensor due to its high accuracy, low susceptibility to weather conditions, and cost-effectiveness, and it is widely adopted in various scenarios including urban traffic, expressways and vehicle–road coordination. MmWave radar utilizes frequency modulation to generate a continuous wave signal with a linearly varying frequency. It shows robust performance in abnormal lighting and severe weather conditions, offers advantages in speed and distance measurement, and is widely used in autonomous vehicles due to its small size and low cost. The 3D mmWave radar is the traditional and widely used mode. However, it suffers from poor height resolution due to antenna layout limitations, and clutter, noise and low resolution can largely affect target classification accuracy. To address this problem, multiple-input multiple-output (MIMO) antenna technology has been proposed to improve the elevation angle resolution of traditional 3D millimeter-wave radar. This has led to the development of 4D mmWave radar technology, which provides range, velocity, azimuth and elevation information. In this way, 4D imaging radar extracts specific target attributes, including radar cross section (RCS) and micro-Doppler information, and then uses machine learning models to achieve target classification [6]. The acquired information from each dimension is critical for achieving accurate target classification with 4D imaging radar. However, there is a significant lack of research on 4D mmWave radar in highway roadside scenarios.
Studies find that radar–camera fusion is beneficial in overcoming the challenges posed by adverse weather and lighting conditions, and it can improve the accuracy and robustness of highway vehicle tracking. Yao et al. [7] point out that radar–vision fusion is a cost-effective way to achieve complementary effects in perception. Because of its ability to resist abnormal weather and light interference, mmWave radar is suitable for outdoor conditions. However, radar lacks the capability to discern distinct object types, which limits comprehensive object classification by color and appearance features [8]. Hence, combining mmWave radar with camera data contributes to enhancing the accuracy of vehicle trajectory observations [9]. This integration facilitates more precise and robust monitoring of vehicle movements, and thus improves overall tracking capabilities. Spatial visualization based on multi-sensor data fusion has shown significant improvements in both efficiency and precision. Novel fusion methods, such as keypoint analysis on classical imagery and sonar data, point cloud augmentation and reduction, as well as data fitting and mesh generation, have provided new avenues for advancing multi-sensor fusion approaches [10].
Building a simulation platform is significant for verifying radar–camera fusion algorithms. Existing mmWave radar and vision equipment is often deployed for single-source data usage, which results in separate detection and insufficient perception data for fusion algorithms. Moreover, inappropriate deployment of radar–vision devices cannot ensure data quality, and the collected perception data are insufficient to support the development of higher-precision data fusion algorithms. Hence, roadside mmWave radar and cameras have been adopted in many cases, which can achieve more accurate vehicle perception on highways [11]. Developing robust vehicle tracking algorithms and properly placing mmWave radars and cameras are therefore significant tasks. Establishing a simulation platform can help in the development of fine-tuned algorithms without the need for costly field experiments. SUMO (v1.19.0) is an open-source software for microscopic traffic simulation [12]. It can provide traffic flow simulation and vehicle kinematics simulation through state-of-the-art driver behavior models and car-following models. Carla is a 3D physics engine-based autonomous driving simulator [13]. It offers high-fidelity and customizable 3D road networks, diverse lighting and weather conditions, as well as realistic vehicle physics and collision models, enabling the testing and evaluation of autonomous driving technologies. Furthermore, Carla leverages ray tracing techniques to provide a rich set of customizable sensor types. The joint simulation of SUMO and Carla can provide a platform for testing roadside fusion algorithms.
This paper proposes a framework for testing 4D mmWave radar–camera fusion algorithms in a simulated highway roadside environment. It is developed based on the Simulation of Urban MObility (SUMO) software and the autonomous driving simulator Carla. The framework consists of a simulation method for 4D millimeter-wave radar, a roadside multi-sensor perception dataset, and a baseline fusion method. Firstly, we propose a simulation method for 4D millimeter-wave radar that generates high-resolution radar point clouds based on radar hardware parameters. Secondly, a roadside multi-sensor perception dataset is generated in a realistic 3D environment through co-simulation, and deep learning object detection models are trained with vehicle annotations under varying weather and lighting conditions. Finally, a machine learning-based fusion baseline method is proposed and evaluated with different vehicle motion models. Results show that the method proposed in this paper can provide a realistic virtual environment for testing radar–camera fusion algorithms for roadside traffic perception.
The following shows the major contributions of this paper:
- (1)
We propose an innovative approach to simulate 4D mmWave radar in roadside scenarios. A 4D mmWave radar model is implemented in Carla. Using SUMO and Carla co-simulation, we collect data from the 4D mmWave radar and camera, annotate them, and generate a roadside dataset for radar–camera fusion. The dataset comprises challenging lighting and weather conditions, upon which the performance of the deep learning visual object detection model is evaluated.
- (2)
This study proposes a machine learning-based algorithm that fuses radar and vision data to extract vehicle trajectories. The algorithm synchronizes the radar and video data spatiotemporally through motion models and generates robust vehicle trajectories. We also extend the algorithm’s application to detecting stop-and-go waves on highways.
- (3)
To the best of the authors’ knowledge, this paper is the first to explore a testing framework for radar–camera algorithms using a simulated environment. Firstly, road and traffic flow information is extracted from an existing highway scene, and vehicle operations are simulated using SUMO. Then, high-definition 3D simulation scenarios are constructed using RoadRunner (R2023b) and Carla (v0.9.15). Data collection is performed after determining the deployment positions of the radar and camera. Finally, fusion algorithms are developed in the simulation scenarios through calibration, synchronization and parameter tuning.
The rest of this paper is organized as follows: Section 2 summarizes related works. Section 3 illustrates the methodology, including the 4D mmWave radar simulation, the co-simulation method for roadside radar–camera fusion, and the fusion tracking algorithm. Section 4 describes the case study. Finally, Section 5 concludes this paper.
2. Related Works
High-quality point clouds are provided by 4D mmWave radar, which can largely extend the applications of mmWave technology. The existing technology is mostly used in vehicle-side applications for autonomous driving [14], while 4D technology can also support roadside vehicle–infrastructure cooperation. This section introduces the research on 4D mmWave radar, datasets and simulators for radar–camera fusion, and fusion tracking methods.
2.1. MmWave Radar Technology
The core technologies of 4D imaging radar include waveform design, antenna layout design and optimization, super-resolution direction-of-arrival (DOA) estimation algorithms, and point cloud clustering and tracking. The design and optimization of array antenna layouts involves uniform linear arrays (ULA), non-uniform linear arrays (NLA), minimum redundancy arrays (MRA), nested arrays (NA), and coprime arrays (CPA). For 4D imaging radar, the azimuth and elevation angle resolutions both need to be within one degree. Waveform design is applied in multiple-input multiple-output (MIMO) signal processing, mainly for channel separation, including time division multiple access (TDMA), binary phase modulation (BPM), Doppler division multiple access (DDMA), code division multiple access (CDMA), combined TDMA and DDMA MIMO (TDMA+DDMA MIMO), and pseudorandom modulated continuous wave (PMCW) waveforms. Super-resolution DOA estimation algorithms include regularized orthogonal matching pursuit (ROMP), the iterative adaptive approach (IAA), the dominant mode rejection algorithm (DMRA), compressive sensing (CS), neural network DOA estimation, etc., which are used to solve the low-resolution problem of traditional DOA estimation algorithms such as the fast Fourier transform (FFT), digital beamforming (DBF), Capon’s method, multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT). Traditional millimeter-wave radar uses the “ULA + FFT” approach, which is insufficient for 4D imaging radar. In general, to meet real-time requirements, most engineering solutions use the “ULA (sparse array) + FFT coarse search + super-resolution fine search” solution. Antenna layout and optimization, MIMO signal processing, and super-resolution DOA estimation are usually considered together and require a systematic approach.
After generating point clouds with mmWave radar, clustering and tracking are required to identify targets. For target clustering, the most commonly used algorithm is DBSCAN [15]. However, it cannot reliably distinguish vehicles from pedestrians in practice. Hence, to improve clustering performance, many revised clustering algorithms have been proposed, such as iterative closest point for multi-frame joint clustering [16] and Kd-tree-based accelerated clustering algorithms [17]. Target tracking mainly includes track initialization, tracking filtering and data association. When tracking traffic participants, road clutter, interference and multipath effects [18] can largely impact detection accuracy. Thus, additional temporal and spatial filtering is necessary in complex road environments. Among spatial filtering methods, the Kalman filter (KF), extended Kalman filter (EKF) and unscented Kalman filter (UKF) are widely used [19]. Additionally, temporal filtering is used to remove outliers and interpolate missing values. For data association, nearest neighbor (NN) and probabilistic data association (PDA) are used for single-target tracking, while joint probabilistic data association (JPDA) [20] and Hungarian matching are used for multi-object tracking.
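As a concrete illustration of the clustering step, the following sketch groups a toy 2D radar point cloud into object candidates with DBSCAN; the eps/min_samples values and the random points are illustrative assumptions, not parameters from the cited works.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy radar point cloud: x (longitudinal) and y (lateral) coordinates in meters.
rng = np.random.default_rng(0)
points = rng.uniform(low=[0.0, -8.0], high=[120.0, 8.0], size=(400, 2))

# Density-based clustering; label -1 marks noise points that belong to no object.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(points)
clusters = [points[labels == k] for k in sorted(set(labels)) if k != -1]
centroids = np.array([c.mean(axis=0) for c in clusters])
print(f"{len(clusters)} clusters, centroids:\n{centroids}")
```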
Studies show that 4D radars have unique advantages in obtaining information in the elevation dimension [21]. Different fusion algorithms require different types of 4D radars and parameters. This means that device selection and parameter tuning require extensive field experiments, and spatiotemporal synchronization and sensor calibration also require extensive work. It is therefore necessary to build a simulation platform that includes a comprehensive library of realistic radar hardware simulation models for verifying and testing fusion algorithms.
2.2. Datasets and Simulators
The currently available and widely used multi-sensor fusion datasets are large-scale autonomous driving datasets. They contain rich sensor data, including lidar, radar and camera inputs, to promote self-driving research and development [22,23]. One of the main methods for collecting multi-sensor fusion datasets is to conduct mmWave radar and camera fusion experiments; however, this requires a huge deployment cost. Hence, establishing a multi-sensor fusion simulation tool would be helpful. Developing datasets and simulation tools can facilitate the development of more precise and robust sensor fusion algorithms for roadside perception, since such tools can accurately simulate roadside scenarios and encompass diverse sensor modalities. Studies find that CARLA is a suitable choice [13]. It is an open-source simulator for autonomous driving research that provides a wide range of sensors, including cameras, lidars and radars, to generate realistic data for sensor fusion and perception algorithms. Therefore, it is necessary to conduct research on scene construction in Carla and roadside data collection, especially focusing on roadside scenarios.
In the context of mmWave radar and camera fusion, one of the major challenges in highway vehicle tracking is the lack of roadside datasets and simulation tools [24]. SUMO, as a traffic simulator, can accurately simulate various traffic events, such as stop-and-go waves and vehicles pulling over on highways [12]. Additionally, CARLA provides high-quality 3D visual simulation functionality. Connecting SUMO and CARLA is therefore a feasible method. However, the built-in mmWave radar simulator in CARLA only simulates traditional 3D mmWave radar and has limited resolution in terms of the pitch angle. This limitation makes it challenging to simulate radar point clouds for specific device parameters.
2.3. Radar–Camera Fusion Methods
The data flow process for radar–camera fusion used in extracting vehicle trajectories on highways mainly includes six stages: sensor calibration and data acquisition, spatiotemporal synchronization, data preprocessing, feature extraction, object detection and tracking, and trajectory post-processing. Feature fusion and semantic fusion are the two main strategies for fusing radar–camera data [25]. The feature-fusion strategy combines data from different modalities within a single feature space; it merges two feature vectors to produce a unified feature vector. Semantic fusion involves integrating data within a semantic space, which consolidates pixel semantic classes from multiple data sources. The semantic labels of the image can be used for end-to-end semantic segmentation of the radar point cloud.
Radar–camera fusion methods can be classified into three stages of processing: data level, decision level and feature level [9]. The data-level fusion method generates a region of interest based on radar points, extracts the corresponding region of the image, and then uses a feature extractor and classifier for object detection. Feature-level fusion involves extracting radar depth information and fusing it with image features. Decision-level fusion integrates the detection results of the mmWave radar and the camera; it can effectively utilize state-of-the-art detection algorithms that are designed for a single data source.
One of the remaining challenges is the lack of robust roadside radar and vision fusion multi-object tracking algorithms under abnormal lighting and severe weather conditions. The main method for multiple object tracking (MOT) is tracking by detection, which includes two parts: detection and embedding. The detection component identifies potential targets in each frame of the video, while the embedding component assigns and updates the detected targets to their corresponding trajectories. To balance detection accuracy and speed, recent studies focus on simultaneously learning object detection and appearance-embedding tasks in a shared neural network [26,27]. During the MOT process, spatiotemporal synchronization of sensors is critical, as it directly impacts the accuracy of data fusion. Fu et al. [28] propose a multi-threaded time synchronization method that utilizes buffering and real-time updates to synchronize camera and radar data. In contrast, Ma et al. [29] employ a combination of Kalman filtering and Lagrange interpolation to achieve time synchronization between the two sensors. Zhang et al. [30] associate detected bounding boxes without score filtering and use similarities between the tracked object trajectories to recover true object detections and filter out background detections. Cao et al. [31] compute a virtual trajectory based on object observations and incorporate it into the linear state estimation process to fix the noise accumulated during occlusion. Aharon et al. [32] apply motion compensation computed using appearance information to improve the estimation of the filter state vector. Zhang et al. [33] propose a detector-replaceable and non-parametric data association strategy to establish mapping relationships from different perspectives. Xiao et al. [34] integrate different levels of granularity in motion features to model temporal dynamics and facilitate motion prediction. Liu et al. [35] propose a pseudo-depth estimation method and a depth-cascading matching strategy in the data association process. However, the existing fusion tracking algorithms struggle to effectively leverage the precise velocity information available from radar sensors. These algorithms also face challenges in handling asynchronous sensor data in complex roadside environments, since it is arduous to establish time synchronization conditions in the actual deployment environment.
3. Methodology
The methodology section comprises three main parts: 4D mmWave radar simulation, the co-simulation method for roadside radar–camera fusion, and the fusion tracking algorithm. The methodology framework is shown in Figure 1. Firstly, the signal processing flow, parameter estimation, and 4D mmWave radar model in Carla are explained. Secondly, the co-simulation, data collection, and annotation method are illustrated. Finally, the vehicle motion models, filter algorithm, and trajectory generation process are explained.
3.1. Simulation of 4D mmWave Radar in a 3D Scene
This section explains the simulation of a 4D mmWave radar. We construct a 4D mmWave radar model in Carla based on the signal processing flow.
3.1.1. Signal Processing Flow
Radar signal processing is the key part of roadside 4D radar simulation system. This section provides an overview of the typical signal processing flow of a roadside 4D mmWave radar. The roadside 4D mmWave radar uses a series of linear frequency-modulated continuous wave (FMCW) signals to simultaneously measure distance, angle, speed and height.
Table 1 lists the acronyms used in radar signal processing. FMCW signals are characterized by their starting frequency $f_0$, sweep bandwidth $B$, chirp duration $T_c$ and slope $S = B/T_c$. Chirp refers to the linear frequency modulation of the radar signal over time, and a single FMCW waveform is known as a chirp. Figure 2a depicts the chirp components of a typical FMCW radar signal. During one chirp cycle, the Tx/Rx chirp sweeps linearly in frequency from $f_0$ to $f_0 + B$ within time $T_c$. The frequency of the signal denotes the number of wave cycles that pass per second, measured in Hertz (Hz); the signal is referred to as mmWave because its wavelength is in the millimeter range. Signal bandwidth is defined as the difference between the highest and lowest frequency components in a continuous frequency band. The radar transmits a frame comprising $N_c$ chirps with equal spacing of the chirp cycle time $T_c$, and the total time $T_f = N_c T_c$ is known as the frame time.
Different chirp configurations are used for different applications. For instance, when detecting fast-moving vehicles at a distance, small-slope chirps are used for long-range detection, and long chirp sequences are accumulated to increase the signal-to-noise ratio; short chirps, in turn, are used to increase the maximum detectable speed and improve velocity resolution. When detecting vulnerable road users near vehicles, a higher sweep bandwidth is used to achieve high range resolution at the cost of detection distance [36]. By sending chirps that switch between different configurations, a multi-mode radar can operate in different modes simultaneously.
The received and transmitted signals are combined using a frequency mixer. The general working principle of the radar system is shown in Figure 2c. Two signals are produced, one at the sum frequency $f_T + f_R$ and one at the difference frequency $f_T - f_R$. A low-pass filter is used to remove the sum-frequency component, resulting in the intermediate frequency (IF) signal. This allows the FMCW radar to achieve GHz-level performance with only MHz-level sampling. In practice, an orthogonal frequency mixer is utilized to improve the noise coefficient, which results in a complex exponential IF signal represented by the following equation:
$$s_{IF}(t) = A\,e^{j\left(2\pi f_b t + \phi\right)}$$
where $A$ represents the signal's amplitude, measured in dB/dBm, which is relevant when configuring the radar's output power and detecting the received signal. A higher radar signal amplitude corresponds to increased radar visibility. Compared to automotive radar, roadside radar can utilize larger output power. $f_b$ is the beat frequency, and $\phi$ is the phase of the IF signal.
The IF signal is sampled $N_s$ times using an analog-to-digital converter (ADC) to obtain a discrete-time complex signal. Multiple chirp signals are combined into a two-dimensional matrix, with the dimension of the sample points within each chirp referred to as the fast time and the dimension of the chirp indices within a frame referred to as the slow time. Assuming an object is moving at a distance $r$ with a velocity $v$, the frequency and phase of the IF signal can be calculated using the following equations:
$$f_b = \frac{2S\,(r + v t)}{c}, \qquad \phi = \frac{4\pi\,(r + v t)}{\lambda}$$
where $\lambda$ represents the wavelength of the chirp signal and $c$ is the speed of light. These equations show that distance and Doppler velocity are coupled. First, the distance change within the slow time caused by the target motion can be ignored due to the short frame time. Second, the Doppler frequency within the fast time can be neglected compared to that obtained using a wideband waveform. Distance can therefore be estimated from the beat frequency as $r = \frac{c f_b}{2S}$, while Doppler velocity can be estimated from the phase shift between two chirps as $v = \frac{\lambda \Delta\phi}{4\pi T_c}$. The frequency change is then resolved using a discrete Fourier transform (DFT) in the fast time dimension, followed by a Doppler DFT in the slow time dimension. The result is a two-dimensional complex-valued data matrix known as the range–Doppler (RD) map. In practical applications, a window function is applied before the DFT to reduce sidelobes. The distance and Doppler velocity of the RD map cells are given by
$$r_k = \frac{k\,c}{2 B_{IF}}, \qquad v_l = \frac{l\,\lambda}{2 T_f}$$
where $k$ and $l$ represent the indices of the DFT, $B_{IF}$ is the IF bandwidth, and $T_f$ is the frame time. The fast Fourier transform (FFT) is used in practical applications due to its computational efficiency. If necessary, the sequence is zero-padded to the nearest power of two.
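The 2D DFT processing described above can be prototyped in a few lines; the frame dimensions, the Hanning windows and the random ADC samples below are illustrative assumptions rather than parameters of the simulated radar.

```python
import numpy as np

# Toy range-Doppler processing of one radar frame of IF samples.
num_chirps, num_samples = 128, 256                       # slow time x fast time (assumed)
adc = (np.random.randn(num_chirps, num_samples)
       + 1j * np.random.randn(num_chirps, num_samples))  # stand-in for simulated IF data

# Window both dimensions before the DFTs to suppress sidelobes.
adc_win = adc * np.hanning(num_samples)[None, :] * np.hanning(num_chirps)[:, None]

# Range DFT along fast time, then Doppler DFT along slow time (zero velocity centered).
range_fft = np.fft.fft(adc_win, axis=1)
rd_map = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
rd_power_db = 20.0 * np.log10(np.abs(rd_map) + 1e-12)    # magnitude of the RD map in dB
```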
Azimuth information can be obtained by using multiple receive or transmit channels. Figure 2b shows the antenna principle of DOA estimation, using an example of 1 Tx and 4 Rx antennas. Suppose all Rx antennas receive reflections from the same target. The received reflections differ in phase but not in amplitude. This phase difference is due to the slightly different distances between each Rx antenna and the target, as a function of the azimuth angle $\alpha$. The azimuth angle can be calculated as $\alpha = \arcsin\left(\frac{\lambda \Delta\phi}{2\pi d}\right)$, where $d$ is the antenna spacing and $\Delta\phi$ is the phase difference between adjacent channels. The 4D mmWave radar commonly employs a multiple-input multiple-output (MIMO) antenna design, which operates on both transmit and receive channels. A MIMO radar with $N_{Tx}$ Tx antennas and $N_{Rx}$ Rx antennas can synthesize a virtual array with $N_{Tx} \times N_{Rx}$ channels. To distinguish the transmitted signals at the receiving end, the signals from different Tx antennas must be orthogonal. By comparing the phase shifts between different virtual channels, the range difference between different pairs of channels to a common target can be calculated. Further consideration of the geometric relationship along the Tx and Rx antennas provides the DOA of the target.
Figure 2e outlines two common point cloud representation methods, and this paper utilizes the approach based on the 4D tensor with FFT and CFAR. First, performing the fast Fourier transform across the different transmit–receive channel pairs generates the azimuth and elevation angle information, resulting in a 4D range–azimuth–elevation–Doppler tensor. To enhance the signal-to-noise ratio, integrated coherent accumulation is applied to the range–Doppler map along the virtual receive dimension. A constant false alarm rate (CFAR) detector is employed to identify peaks in the range–Doppler map. Finally, the DOA estimation method is utilized for angle estimation, and true targets in point cloud format are obtained for downstream tasks. The annotated point cloud generated in bird's eye view is shown in Figure 2d.
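A minimal cell-averaging CFAR over one range profile of the RD map is sketched below; the training/guard cell counts and the threshold scale are assumptions chosen for illustration.

```python
import numpy as np

def ca_cfar(profile, num_train=16, num_guard=4, scale=4.0):
    """Cell-averaging CFAR on a 1D power profile (linear units).
    Returns a boolean detection mask over the cells under test."""
    n = profile.size
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for cut in range(half, n - half):
        leading = profile[cut - half: cut - num_guard]
        lagging = profile[cut + num_guard + 1: cut + half + 1]
        noise = np.mean(np.concatenate([leading, lagging]))  # local noise estimate
        detections[cut] = profile[cut] > scale * noise
    return detections

# Example: a synthetic noise profile with two injected targets.
profile = np.abs(np.random.randn(256)) ** 2
profile[[60, 180]] += 50.0
print(np.nonzero(ca_cfar(profile))[0])
```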
3.1.2. MmWave Radar Model in Carla
The detailed parameters of the roadside radar for simulation can be estimated from its measurement performance specifications. The theoretical maximum detection range can be calculated using the following formula:
$$R_{max} = \sqrt[4]{\frac{P_t\,G^2\,\lambda^2\,\sigma}{(4\pi)^3\,P_{min}}}$$
where $P_t$ represents the transmitted power, $P_{min}$ is the minimum detectable signal or receiver sensitivity, $\lambda$ is the wavelength of the transmitted signal, $\sigma$ is the radar cross section (RCS) of the target, and $G$ is the antenna gain. RCS is a statistical quantity that varies with viewing angle and target material, and it measures a target's ability to reflect radar signals back to the receiving antenna. Experimental results indicate that the average RCS value for small objects is approximately 2–3 dBsm, while the average RCS value for typical vehicles is around 10 dBsm, and for large vehicles approximately 20 dBsm [37].
The measurable distance range and speed range can be expressed as follows:
$$R_{max}^{IF} = \frac{c\,f_{IF,max}}{2S}, \qquad v_{max} = \pm\frac{\lambda}{4 T_c}$$
where $f_{IF,max}$ is the maximum IF frequency supported by the ADC. For MIMO radar, the azimuth angle range is determined by the antenna spacing $d$. When $d = \lambda/2$, the theoretical maximum field of view (FoV) is 180°. However, in practical applications, the FoV is influenced by the antenna gain pattern. The azimuth angle range, or field of view, is evaluated using the following equation:
$$\theta_{FoV} = \pm \arcsin\!\left(\frac{\lambda}{2d}\right)$$
Achieving high range resolution requires a large sweep bandwidth $B$, while high Doppler resolution necessitates a longer integration time (frame time) $T_f$. The range resolution and velocity resolution are determined using the following formulas:
$$\Delta r = \frac{c}{2B}, \qquad \Delta v = \frac{\lambda}{2 T_f}$$
The azimuth resolution of a radar system depends on several factors, including the number of virtual receiver elements $N$, the target angle $\theta$, and the antenna spacing $d$:
$$\Delta\theta = \frac{\lambda}{N\,d\,\cos\theta}$$
When $d = \lambda/2$ and $\theta = 0$, the angular resolution simplifies to $\Delta\theta = \frac{2}{N}$ (in radians). Additionally, from the perspective of antenna theory, the half-power beamwidth, which is a function of the array aperture $D$, can also be used to characterize angular resolution. The 3 dB beamwidth is approximated as
$$\theta_{3dB} \approx \frac{0.886\,\lambda}{D}$$
and the achievable azimuth angle accuracy is a fraction of this beamwidth that improves with the signal-to-noise ratio.
After estimating the radar parameters, we use Carla to generate the 4D mmWave radar model. First, a radar actor is declared, which includes the sensor definition, owner, and tick methods. The methods for serializing and deserializing sensor data are then defined according to Section 3.1.1. Next, a class that contains the sensor data to output is created. Finally, the radar sensor model is registered in the Carla sensor library.
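On the client side, the registered radar can then be spawned and subscribed to like any other Carla sensor. The sketch below illustrates that workflow using the built-in `sensor.other.radar` blueprint as a stand-in for the custom 4D radar; the blueprint id, attribute values and roadside mounting pose are assumptions for illustration.

```python
import carla

client = carla.Client("localhost", 2000)
world = client.get_world()

# Stand-in blueprint; the custom 4D radar would expose its own id and attributes.
bp = world.get_blueprint_library().find("sensor.other.radar")
bp.set_attribute("horizontal_fov", "120")
bp.set_attribute("vertical_fov", "30")
bp.set_attribute("range", "300")
bp.set_attribute("points_per_second", "4000")

# Roadside mounting: a few meters above the shoulder, pitched slightly downward.
mount = carla.Transform(carla.Location(x=0.0, y=0.0, z=6.0),
                        carla.Rotation(pitch=-5.0))
radar = world.spawn_actor(bp, mount)

def on_frame(measurement):
    # Each detection carries range (depth), azimuth, elevation (altitude) and Doppler velocity.
    for det in measurement:
        print(det.depth, det.azimuth, det.altitude, det.velocity)

radar.listen(on_frame)
```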
3.2. Co-Simulation Method for Roadside Radar–Camera Fusion
This section explains the procedures involved in the co-simulation method for roadside radar–camera fusion. It includes static scene construction, traffic input, weather simulation, radar–camera data collection and annotation generation, and SUMO-Carla integrated simulation.
Simulation of Urban MObility (SUMO) is an open-source traffic simulator designed to model and simulate traffic flows. The road network structure in SUMO mainly consists of edges, junctions, connections and lanes, with attributes such as width, speed limits, and access rights integrated into the road–network model. Detailed information regarding possible movements at intersections and the corresponding right-of-way rules is included in the networks to capture dynamic behavior. OpenStreetMap is an open-source, globally oriented world map widely used in traffic studies as high-quality map data and officially recommended by SUMO as a data source for the construction of simulated road networks. NETCONVERT and NETEDIT are two convenient tools used to create the network in SUMO. NETCONVERT is a command-line tool that imports road networks from different data sources, and NETEDIT is a graphical network editor used to create, analyze, and edit network files.
RoadRunner is an interactive 3D scene editor that can be used to design FilmBox-format 3D road scenes. It allows for the creation of road signs and markings to customize road scenes and provides the ability to insert signs, signals, barriers, road damage, greenery, buildings and other 3D models. CARLA is an open-source autonomous driving simulation tool with sensor models and high scalability, which can be used for 3D traffic scenario simulations.
3.2.1. Static Scene Construction
The input source for the road network data is OpenStreetMap. Firstly, the road network data is parsed and preprocessed, with the following main steps: road section extraction, connectivity check, topology check, subnetwork deletion, merging of redundant intersections and completion of missing road information. Next, the matching between the simulation coordinate system and the real-world coordinate system is established to generate an XML-format road network file. The lane speed limit model is adjusted based on the actual road conditions. NETEDIT is used to refine the road lane shape and lane details in the 2D plane.
NETCONVERT is utilized to convert the SUMO road network into an XODR file in OpenDRIVE format. The XODR file is imported into RoadRunner to enable the conversion of the 2D road network into 3D. However, as OpenStreetMap lacks information on the height of roads, bridges, and tunnels, a 3D refinement of the road network is essential in RoadRunner to include the elevation data and enhance the modeling of the road segments. Moreover, static 3D assets, such as roadside greenery, traffic signs and markings, camera gantries, intersection/ramp guide lines, and road shoulders, are added.
Once a basic multi-lane road model is constructed in RoadRunner based on the provided lane information, the modified XODR file, the FilmBox-format FBX 3D model file and the road network element materials are exported. These exported files are subsequently imported into the Unreal Engine of Carla to generate the 3D mesh. First, Carla generates static road assets from the FBX file. Second, it generates road center points and vehicle spawn points based on the modified XODR file. Third, it connects the road center points with a route planner that corresponds to the route control points of the road network in SUMO. The materials of the various road network components have distinct effects on radar echoes. By controlling the material textures of different roads and the surrounding vegetation, authentic visual and radar echo reflection effects are implemented. Additionally, components that influence the reflection of radar echoes, such as trees and metal lampposts, are directly imported into CARLA to ensure the accuracy of the radar model's echo reception. The static scene construction process is illustrated in Figure 3.
3.2.2. Traffic Input
We utilize SUMO to provide realistic traffic flow characteristics for our vehicle simulation. TraCI employs a TCP-based client–server architecture to enable access to SUMO, allowing researchers to synchronize, control, and modify SUMO in real time during simulations. Using the TraCI interface, we can manipulate specific vehicles in the simulation by synchronizing their positions and speeds. Traffic event scenarios based on realistic traffic flow characteristics are converted into TraCI control commands and injected into the SUMO simulation. Subscribed values are returned if any subscriptions have been made. The simulation progresses to the next step once all clients have issued the simulation step command. This feature enables the microscopic traffic flow models to accurately simulate interactions between traffic participants on the road. When using vehicle trajectories captured from road video as input, we use TraCI to synchronize the actual positions and velocity measurements of vehicles at each simulation time step to ensure accuracy, as sketched below. When utilizing NGSIM trajectory data, we precisely mark vehicle positions based on their trajectory data at each moment and generate the rou.xml file. Vehicle colors and models are then associated with their trajectories, and the required 3D vehicle models are imported into Carla. By associating vehicle types between SUMO and CARLA, we achieve consistency in the vehicle models used within the simulation.
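A minimal sketch of this trajectory-replay control via TraCI is shown below; the SUMO configuration path, step length, and the `replay_states` lookup are placeholders for illustration, not the actual project files.

```python
import traci

def replay_states(step):
    """Placeholder: return {veh_id: (x, y, speed, angle)} for this time step,
    e.g., loaded from video-derived or NGSIM trajectories. Empty here so the
    sketch stays self-contained."""
    return {}

# Drive SUMO externally and overwrite selected vehicle states each step.
traci.start(["sumo", "-c", "highway.sumocfg", "--step-length", "0.1"])
for step in range(3000):
    traci.simulationStep()
    for veh_id, (x, y, speed, angle) in replay_states(step).items():
        # Snap the vehicle to the recorded position/heading; keepRoute=2 allows
        # placement anywhere in the network regardless of the current route.
        traci.vehicle.moveToXY(veh_id, "", -1, x, y, angle, keepRoute=2)
        traci.vehicle.setSpeed(veh_id, speed)
traci.close()
```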
3.2.3. Weather Simulation
Lighting and weather conditions are crucial factors that impact traffic operations. In CARLA, rainy, cloudy and different lighting conditions can be simulated by adjusting parameters. To better simulate the effects of artificial light sources on nighttime object detection, this study includes two nighttime vehicle operation settings: low-beam headlights for vehicles with streetlights and high-beam headlights for vehicles without streetlights.
3.2.4. Radar–Camera Data Collection and Annotation Generation
The location and rotation settings of the camera and radar are adjusted based on the actual road conditions and the parameters of the sensor hardware. Sensor data are collected using the Carla client, along with the ground-truth world coordinates of the vertices of each target's 3D box. To create visual target annotations, the coordinates of the target vertices in the camera coordinate system are first calculated from their world coordinates using the camera extrinsic matrix. Next, the camera intrinsic matrix is used to convert the camera-frame 3D coordinates into 2D coordinates within the image. Finally, after the depth camera of Carla is utilized to determine whether the target is occluded, the annotated box within the camera's field of view and the attributes of the target are collected.
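The projection pipeline for annotation generation can be summarized as below; the function assumes points are already expressed in a conventional camera frame (Carla's Unreal coordinates require an additional axis permutation), and all names are illustrative.

```python
import numpy as np

def project_to_image(points_world, world_to_cam, K):
    """Project Nx3 world-frame 3D box vertices into pixel coordinates.
    world_to_cam: 4x4 extrinsic matrix; K: 3x3 intrinsic matrix."""
    pts_h = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    pts_cam = (world_to_cam @ pts_h.T)[:3]     # 3xN points in the camera frame
    uv_h = K @ pts_cam
    uv = (uv_h[:2] / uv_h[2]).T                # perspective division -> pixel coords
    return uv, pts_cam[2]                      # also return depths for visibility checks

def bbox_from_vertices(uv):
    """Axis-aligned 2D annotation box from the eight projected 3D box vertices."""
    (u_min, v_min), (u_max, v_max) = uv.min(axis=0), uv.max(axis=0)
    return [float(u_min), float(v_min), float(u_max), float(v_max)]
```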
3.2.5. Traffic Flow Restoration Based on Co-Simulation
During the co-simulation process, SUMO is utilized to simulate the dynamics of all vehicles running in the road network. The vehicles generated by SUMO can produce real-time synchronized mapping of vehicles in the CARLA environment, and the physical feedback generated by the vehicle simulation in CARLA can also be synchronized in the 2D scene of SUMO. During each time step of the co-simulation, vehicle positions, velocities, and signal light conditions in SUMO are retrieved and updated vehicle by vehicle in Carla.
In addition to synchronizing the temporal–spatial physical attributes of vehicles in the original co-simulation code, this study features fine-grained handling of lateral speed and adaptive simulation of vehicle and roadside lighting. Depending on the lighting and weather conditions of different Carla simulation scenarios, the intensity of vehicle and road lighting can be adjusted.
Vehicle dynamics simulation involves highly detailed vehicle models that are used in automated driving system research and development. Examples of these models include axle kinematics and steering models that represent the driving dynamics of the vehicle [38]. Dynamic integration and co-simulation enable vehicles to operate in a simulated traffic environment that is responsive to the vehicle's driving behavior. In each simulation time step, we use the TraCI protocol to communicate between Carla and SUMO. We employ an enhanced Krauss model as the car-following model and a sub-lane model as the lane-changing model within SUMO. Additionally, we subscribe to vehicle attributes (vehicle class, color, length, width, height), vehicle position and velocity information (transform, pitch-yaw-roll, speed vector), and the traffic environment (signals), and then convert them to a unified 3D simulation coordinate system. We simulate the vehicle's dynamic behavior and perform collision detection in Carla, and synchronize the vehicle's speed, position and yaw back to SUMO, as sketched below. Figure 4 shows the co-simulation and dataset generation methodology.
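The per-step synchronization can be illustrated with the rough sketch below. It assumes a bookkeeping dictionary mapping SUMO vehicle ids to already-spawned Carla actors and uses a simplified SUMO-to-Unreal coordinate conversion; the official Carla–SUMO bridge performs a more careful transformation that accounts for network offsets and vehicle extents.

```python
import carla
import traci
import traci.constants as tc

def subscribe_vehicle(veh_id):
    # Ask SUMO to push position and heading for this vehicle at every step.
    traci.vehicle.subscribe(veh_id, [tc.VAR_POSITION, tc.VAR_ANGLE])

def sync_step(world, sumo_to_carla):
    """Advance SUMO one step and mirror the subscribed states onto Carla actors."""
    traci.simulationStep()
    for veh_id, actor in sumo_to_carla.items():
        state = traci.vehicle.getSubscriptionResults(veh_id)
        x, y = state[tc.VAR_POSITION]
        heading = state[tc.VAR_ANGLE]              # degrees, clockwise from north
        transform = actor.get_transform()
        transform.location = carla.Location(x=x, y=-y, z=transform.location.z)
        transform.rotation.yaw = heading - 90.0    # simplified SUMO-to-UE heading
        actor.set_transform(transform)
    world.tick()                                   # advance Carla in synchronous mode
```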
In this section, we propose a roadside 3D radar and camera fusion simulator with refined vehicle dynamic characteristics. The simulator is capable of utilizing highly realistic road environments and spatiotemporal trajectories of vehicles to construct annotated datasets for validating the radar–camera fusion algorithm.
3.3. Asynchronous Input Supported Radar–Vision Fusion Algorithm
3.3.1. Algorithm Framework
We develop the baseline radar–vision fusion algorithm based on an improved unscented Kalman filter (UKF). Radar object detection is obtained using SMURF [39], and we employ YOLO-V8 for vision object detection. Figure 5 displays the baseline method's trajectory generation process. First, we set up the UKF matrices. When a new camera detection arrives, we try to match it to existing tracks by computing visual and motion feature distances in the matching cascade. If the distance exceeds the threshold, IoU matching is applied. When a new radar detection arrives, we use the Mahalanobis distance to match it with an existing track in the spatial coordinate system. If no track is matched, we create a new track and use the detection to initialize the track's state and covariance matrices. A track is regarded as confirmed if detections match it in three successive frames. If a track has no matching detection for 100 frames, it is deleted. Finally, we benchmark different motion models in the filter predict and update process.
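The track confirmation and deletion rules described above can be captured in a few lines of bookkeeping; the class below mirrors the 3-consecutive-frame confirmation and 100-frame deletion thresholds from the text, while the field and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TrackState:
    track_id: int
    consecutive_hits: int = 0
    frames_since_update: int = 0
    confirmed: bool = False

    def mark_matched(self):
        """Called when a camera or radar detection is associated with the track."""
        self.consecutive_hits += 1
        self.frames_since_update = 0
        if self.consecutive_hits >= 3:          # confirmed after 3 successive frames
            self.confirmed = True

    def mark_missed(self):
        """Called for every frame in which no detection matches the track."""
        self.consecutive_hits = 0
        self.frames_since_update += 1

    def should_delete(self):
        return self.frames_since_update >= 100  # deleted after 100 unmatched frames
```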
3.3.2. Vehicle Motion Models
This section examines three vehicle motion models for the fusion algorithm and explains the filter update process. Figure 6 illustrates the symbols representing vehicle dynamics.
The state vector of the Constant Turn Rate and Velocity (CTRV) model is
$$\mathbf{x}_{CTRV} = \left[x, y, v, \theta, \omega\right]^T$$
where $x$ and $y$ denote the vehicle position, $v$ the speed, $\theta$ the yaw angle, and $\omega$ the turn rate. The process noise is assumed to originate from the first derivatives of velocity and turn rate, defined as the acceleration noise $\nu_a = \dot{v}$ and the yaw acceleration noise $\nu_{\dot{\omega}} = \dot{\omega}$, with $\nu_a \sim \mathcal{N}\left(0, \sigma_a^2\right)$ and $\nu_{\dot{\omega}} \sim \mathcal{N}\left(0, \sigma_{\dot{\omega}}^2\right)$. The process model over a time step $\Delta t$ is
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \begin{bmatrix} \frac{v}{\omega}\left[\sin(\theta + \omega\Delta t) - \sin\theta\right] \\ \frac{v}{\omega}\left[\cos\theta - \cos(\theta + \omega\Delta t)\right] \\ 0 \\ \omega\Delta t \\ 0 \end{bmatrix} + \begin{bmatrix} \frac{1}{2}\Delta t^2\cos\theta\cdot\nu_a \\ \frac{1}{2}\Delta t^2\sin\theta\cdot\nu_a \\ \Delta t\,\nu_a \\ \frac{1}{2}\Delta t^2\,\nu_{\dot{\omega}} \\ \Delta t\,\nu_{\dot{\omega}} \end{bmatrix}$$
where the last term is the process noise vector.
The state vector of the Constant Turn Rate and Acceleration (CTRA) model is
$$\mathbf{x}_{CTRA} = \left[x, y, v, a, \theta, \omega\right]^T$$
where $a$ denotes the longitudinal acceleration. The process noise is assumed to originate from the first derivatives of turn rate and acceleration, defined as $\nu_{\dot{\omega}}$ and $\nu_{\dot{a}}$, with $\nu_{\dot{a}} \sim \mathcal{N}\left(0, \sigma_{\dot{a}}^2\right)$ and $\nu_{\dot{\omega}} \sim \mathcal{N}\left(0, \sigma_{\dot{\omega}}^2\right)$. The deterministic part of the process model over a time step $\Delta t$ is
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \begin{bmatrix} \frac{1}{\omega^2}\left[(v + a\Delta t)\,\omega\sin(\theta + \omega\Delta t) + a\cos(\theta + \omega\Delta t) - v\omega\sin\theta - a\cos\theta\right] \\ \frac{1}{\omega^2}\left[-(v + a\Delta t)\,\omega\cos(\theta + \omega\Delta t) + a\sin(\theta + \omega\Delta t) + v\omega\cos\theta - a\sin\theta\right] \\ a\Delta t \\ 0 \\ \omega\Delta t \\ 0 \end{bmatrix}$$
with the process noise entering through $\nu_{\dot{a}}$ and $\nu_{\dot{\omega}}$ analogously to the CTRV model.
The Constant Curvature and Acceleration (CCA) model assumes that the curvature $c = 1/R$, where $R$ denotes the radius of the vehicle motion, remains constant; the turn rate is therefore coupled to the speed through $\omega = c\,v$. Using this definition of curvature, the state vector is
$$\mathbf{x}_{CCA} = \left[x, y, v, a, \theta, c\right]^T$$
The process noise is assumed to originate from the first derivatives of curvature and acceleration, defined as $\nu_{\dot{c}}$ and $\nu_{\dot{a}}$, with $\nu_{\dot{c}} \sim \mathcal{N}\left(0, \sigma_{\dot{c}}^2\right)$ and $\nu_{\dot{a}} \sim \mathcal{N}\left(0, \sigma_{\dot{a}}^2\right)$. The process model follows by integrating the speed $v(t) = v + a t$ and the heading $\theta(t) = \theta + c\left(v t + \frac{1}{2} a t^2\right)$ over the time step, analogously to the CTRV and CTRA models.
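For reference, the CTRV and CTRA transitions above can be written as simple state-propagation functions; this is an illustrative sketch (with a straight-line fallback for near-zero turn rates), not the authors' implementation.

```python
import numpy as np

def ctrv_step(state, dt):
    """CTRV transition: state = [x, y, v, yaw, yaw_rate]."""
    x, y, v, yaw, wz = state
    if abs(wz) > 1e-4:
        x += v / wz * (np.sin(yaw + wz * dt) - np.sin(yaw))
        y += v / wz * (np.cos(yaw) - np.cos(yaw + wz * dt))
    else:                                   # straight-line limit
        x, y = x + v * np.cos(yaw) * dt, y + v * np.sin(yaw) * dt
    return np.array([x, y, v, yaw + wz * dt, wz])

def ctra_step(state, dt):
    """CTRA transition: state = [x, y, v, a, yaw, yaw_rate]."""
    x, y, v, a, yaw, wz = state
    v_new, yaw_new = v + a * dt, yaw + wz * dt
    if abs(wz) > 1e-4:
        # Closed-form integration of an accelerating, turning point mass.
        x += (v_new * wz * np.sin(yaw_new) + a * np.cos(yaw_new)
              - v * wz * np.sin(yaw) - a * np.cos(yaw)) / wz**2
        y += (-v_new * wz * np.cos(yaw_new) + a * np.sin(yaw_new)
              + v * wz * np.cos(yaw) - a * np.sin(yaw)) / wz**2
    else:
        x += (v * dt + 0.5 * a * dt**2) * np.cos(yaw)
        y += (v * dt + 0.5 * a * dt**2) * np.sin(yaw)
    return np.array([x, y, v_new, a, yaw_new, wz])
```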
3.3.3. Fusion Filter Algorithm
We use a sample-based method to compute the sigma points from the probability distribution: the unscented Kalman filter (UKF) utilizes the unscented transformation to deterministically sample them. By adding and subtracting each column of the lower-triangular Cholesky square root of the weighted covariance matrix $(n + \lambda)\mathbf{P}$ to the mean, the sigma points are obtained as follows:
$$\mathcal{X}_0 = \bar{\mathbf{x}}, \qquad \mathcal{X}_i = \bar{\mathbf{x}} + \left(\sqrt{(n + \lambda)\mathbf{P}}\right)_i, \qquad \mathcal{X}_{i+n} = \bar{\mathbf{x}} - \left(\sqrt{(n + \lambda)\mathbf{P}}\right)_i, \qquad i = 1, \dots, n$$
where $\alpha$, $\beta$ and $\kappa$ are scaling parameters. The value of $\alpha$ determines the spread of the sigma points around the mean, and it is typically set to a small value [17]. $\lambda$ is calculated as $\lambda = \alpha^2(n + \kappa) - n$, where $n$ is the dimension of the state vector.
After applying the state transition of the vehicle motion model to the sigma points, the predicted mean and covariance of the state vector can be calculated using the following equations:
$$\bar{\mathbf{x}}_{k+1|k} = \sum_{i=0}^{2n} W_i^{(m)} f\left(\mathcal{X}_i\right), \qquad \mathbf{P}_{k+1|k} = \sum_{i=0}^{2n} W_i^{(c)} \left[f\left(\mathcal{X}_i\right) - \bar{\mathbf{x}}_{k+1|k}\right]\left[f\left(\mathcal{X}_i\right) - \bar{\mathbf{x}}_{k+1|k}\right]^T + \mathbf{Q}$$
where $f(\cdot)$ denotes the vehicle motion model transition, $i$ indexes the columns of the sigma point matrix, and $\mathbf{Q}$ is the process noise covariance. The weights for each sigma point are obtained through the following equations:
$$W_0^{(m)} = \frac{\lambda}{n + \lambda}, \qquad W_0^{(c)} = \frac{\lambda}{n + \lambda} + \left(1 - \alpha^2 + \beta\right), \qquad W_i^{(m)} = W_i^{(c)} = \frac{1}{2(n + \lambda)}, \quad i = 1, \dots, 2n$$
The radar measurement vector collects the quantities reported by a radar detection (the measured target position and Doppler velocity). After a radar measurement is obtained, the predicted measurement mean and covariance can be calculated using the following equations:
$$\bar{\mathbf{z}} = \sum_{i=0}^{2n} W_i^{(m)} h\left(\mathcal{X}_i\right), \qquad \mathbf{S} = \sum_{i=0}^{2n} W_i^{(c)} \left[h\left(\mathcal{X}_i\right) - \bar{\mathbf{z}}\right]\left[h\left(\mathcal{X}_i\right) - \bar{\mathbf{z}}\right]^T + \mathbf{R}$$
where $h(\cdot)$ maps a sigma point from the state space to the radar measurement space and $\mathbf{R}$ is the measurement noise covariance.
The Kalman gain is computed by multiplying the cross-correlation between the sigma point estimates in the state space and the measurement space by the inverse of the predicted measurement covariance:
$$\mathbf{P}_{xz} = \sum_{i=0}^{2n} W_i^{(c)} \left[f\left(\mathcal{X}_i\right) - \bar{\mathbf{x}}_{k+1|k}\right]\left[h\left(\mathcal{X}_i\right) - \bar{\mathbf{z}}\right]^T, \qquad \mathbf{K} = \mathbf{P}_{xz}\,\mathbf{S}^{-1}$$
Finally, the state vector and the covariance matrix are updated using the following equations, as in other Kalman filters:
$$\mathbf{x}_{k+1} = \bar{\mathbf{x}}_{k+1|k} + \mathbf{K}\left(\mathbf{z} - \bar{\mathbf{z}}\right), \qquad \mathbf{P}_{k+1} = \mathbf{P}_{k+1|k} - \mathbf{K}\,\mathbf{S}\,\mathbf{K}^T$$
The fusion track update process is depicted in Algorithm 1.

Algorithm 1: Radar–Vision Fusion Track Update
Input: current track state. Output: updated track state.
1: SET the initial state estimate $\mathbf{x}_0$ and covariance $\mathbf{P}_0$
2: IF the track starts with a radar detection THEN
3:   SET the vehicle speed in $\mathbf{x}_0$ with the Doppler velocity
4: ELSE IF the track starts with a camera detection THEN
5:   SET the vehicle speed in $\mathbf{x}_0$ with the lane average speed
6: SET the process noise covariance $\mathbf{Q}$ and the measurement noise covariance $\mathbf{R}$
7: SET the motion model case
8: FOR each detection matched to the track at timestamp $t$ DO
9:   CALCULATE the sigma points from $\mathbf{x}$ and $\mathbf{P}$
10:  IF the detection is a radar detection THEN
11:    TRANSFORM the detection into the measurement space
12:  END IF
13:  APPLY the state transition to the sigma points and obtain the predicted state using:
14:    CASE CTRV: the CTRV process model
15:    CASE CTRA: the CTRA process model
16:    CASE CCA: the CCA process model
17:  COMPUTE the mean and covariance of the predicted state
18:  COMPUTE the innovation covariance $\mathbf{S}$
19:  COMPUTE the Kalman gain $\mathbf{K}$
20:  COMPUTE the updated state $\mathbf{x}$
21:  COMPUTE the updated covariance $\mathbf{P}$
22: RETURN $\mathbf{x}$, $\mathbf{P}$
4. Case Study and Results
In this section, we conduct ablation experiments to evaluate the proposed simulated test method and the fusion baseline algorithm. The simulation of abnormal lighting and severe weather based on real-world scenarios provides the necessary conditions for testing the robustness of the radar–camera fusion algorithm. The proposed testing framework for the fusion algorithm is based on real-world scenarios and traffic data inputs to recreate the actual traffic operating status. Additionally, by controlling the driving behavior of specific vehicles, we construct infrequent traffic events that are difficult to capture. Traffic events such as stop-and-go waves simulated for testing purposes can provide inputs for event recognition algorithms based on radar–camera fusion.
We simulate the sensors in Carla, including a 1080p camera (60 Hz) and an ARS548 radar (20 Hz). The simulated sensor hardware parameters are shown in Table 2. Firstly, we evaluate the vision model's performance in simulated 3D scenes under diverse weather and lighting conditions. Next, we reconstruct real-world scenes and generate varying traffic flow scenarios. We obtain asynchronous inputs from radar and camera detection results and iteratively adjust the algorithm parameters through the simulation process. Additionally, we benchmark different motion models within the proposed baseline method. Finally, we evaluate the stop-and-go wave detection performance in rainy nighttime environments, comparing the vision-only method and the radar–vision fusion baseline method. This case study selects two actual scenarios in China: the G92 Expressway and an intersection in Suzhou.
4.1. Vision Model Performance Evaluation in Simulated 3D Scene
In this section, we construct static scenes using the methods described in Section 3.2.1 and create datasets for varying combinations of weather and lighting conditions. Subsequently, we train a deep learning detection model based on YOLO-V8 [2] on an NVIDIA 3090 graphics card, using a dataset of 3000 images per scene with 64 vehicle models included. The results after 100 epochs are presented in Table 3.
The results indicate that lighting has a greater impact on visual object detection than weather conditions. Specifically, in nighttime scenes, the accuracy of video vehicle detection is significantly affected by the streetlights and low beam headlights of vehicles in urban scenes, as well as the high beam headlights of vehicles in highway scenes. Solely relying on visual detection is inadequate for achieving high-precision trajectory extraction.
4.2. Vehicle Motion Model Evaluation
In this section, we select the rainy nighttime condition and apply the three different vehicle motion models to test the radar–camera fusion algorithm in both urban and highway scenarios. Within the effective detection distance of the video, we use the radar–camera fusion algorithm to extract 5 min of trajectory data for evaluation. The sensor sampling frequency is 30 frames per second for video and 10 frames per second for radar. The results are presented in Table 4.
The evaluation metrics include the root mean square error for the Euclidean distance, lateral distance, and longitudinal distance between the estimated vehicle position and the ground truth. The results show that the camera provides higher accuracy for lateral distance measurement, while the 4D millimeter-wave radar provides higher accuracy for longitudinal distance measurement and velocity estimation. In the expressway scenario with significant vehicle acceleration, the fusion algorithm using the CTRA model has a better performance. There is no significant difference between the CCA model and the CTRA model, but the CCA model requires more computation. Overall, the fusion tracking algorithm using the CTRA model brings the best performance.
4.3. Data Sampling Frequencies
Existing radar–vision fusion datasets typically provide 20 Hz point cloud data but only 2 Hz annotations [22]. This discrepancy has led many common object detectors and multi-object tracking algorithms to operate at the lower 2 Hz frame rate, even though they may be capable of utilizing the full 20 Hz data stream and running at higher frequencies. Tracking performance generally benefits from higher data frame rates, as the prediction error tends to accumulate over successive frames. By leveraging the annotation tool developed in Section 3.2.4, we are able to assess the trajectory tracking capabilities under varying data sampling frequencies. Therefore, we adopt a 20 Hz data annotation frequency, while also evaluating the tracking performance using higher frame rates.
In Table 5, we investigate the impact of data sampling frequency on tracking performance. We consider four settings: 2 Hz, 10 Hz, 20 Hz, and 10 Hz with interpolation. We employ the AMOTA (average multiple object tracking accuracy) and AMOTP (average multiple object tracking precision) metrics to quantify the tracking performance under different recall thresholds [40]. AMOTA and AMOTP combine the detection, association, and localization aspects of multi-object tracking.
Compared to the 2 Hz setting, simply using 10 Hz frames does not evidently improve the performance. This is because the deviation of detections on the high-frequency frames may disturb the trackers and subsequently reduce the performance on the sampled evaluation frames. The inclusion of motion prediction-based interpolation in the 10 Hz setting further explores the potential benefits of motion estimations in intermediate frames. To improve the recall, we output motion model predictions for frames and trajectories lacking corresponding detection bounding boxes and assign these predictions a lower confidence score than any other detected objects. Compared to the non-interpolated 10 Hz setting, this approach resulted in a 4.94% improvement in MOTA.
We further increase the frame rate to 20 Hz, but this barely results in any significant additional performance improvements. We found that excessively high frame rates can amplify the noise generated by the target object’s motion. For instance, when the displacement of the target object between consecutive image frames is only a few pixels, the distance of this movement may be comparable to the estimation noise, resulting in large variances in the filter’s predictions.
The evaluation across different data frequencies provides insights into the optimal settings for reliable and high-precision tracking applications. Therefore, our fusion baseline method uses the 10 Hz with the motion prediction interpolated setting.
4.4. Filter-Based Vehicle Tracking Method Performance
To evaluate the effectiveness of the baseline method for multi-object tracking, we use several higher-order metrics, including MOTA (multi-object tracking accuracy), AssA (association accuracy), DetA (detection accuracy), IDF1 (identification F1-score), and HOTA (higher order tracking accuracy) [41]. MOTA captures the overall tracking quality, considering both detection and association performance; it is determined by the number of false negatives, false positives, and identity switches. The IDF1 score indicates the quality of the object-to-track associations. HOTA provides a balance between the performance of accurate object detection, object-to-track association, and object localization.
We compare our tracking method against several state-of-the-art tracking algorithms, including ByteTrack [30], OC-SORT [31], BoT-SORT [32], ByteTrackv2 [33], MotionTrack [34], and SparseTrack [35], on the dataset described in Section 3.2. The dataset includes various weather and lighting conditions, as outlined in Table 3.
The evaluation results are presented in Table 6. Our baseline method outperforms existing methods in adverse weather and lighting conditions for roadside radar–camera fusion scenarios. Under abnormal lighting conditions, false positive and false negative detections can lead to a faster accumulation of errors. When occlusion or weather/lighting interference occurs, the state estimation errors in the tracking process accumulate further. This causes model-based filtering approaches to become less robust when targets are lost, as the filter's reliance on state prediction increases.
Our baseline method efficiently utilizes the velocity and target attribute information from radar and camera, effectively reducing the accumulation of filtering-induced errors. Additionally, by incorporating a proper motion model during the filtering process, we improve the target orientation alignment between radar and camera perception, enabling more accurate object-to-track association. This allows the state covariance matrix and associated variables to recover more quickly from the issues of false positive/negative detections. Furthermore, the support for asynchronous inputs provides flexibility for our algorithm.
4.5. Simulation-Based Stop-and-Go Wave Detection
In this case, we utilize SUMO to simulate stop-and-go waves on a congested expressway by applying driving behavior control to each vehicle. The static scenario is constructed based on roadside video of the G92 expressway. We then employ the radar–camera fusion method to extract the vehicle trajectories and apply interpolation-based post-processing to refine them. For each trajectory, we smooth the speed and turn rate using a sliding window, as sketched below.
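The sliding-window smoothing is a simple centered moving average; the window length below is an assumption for illustration.

```python
import numpy as np

def sliding_window_smooth(signal, window=15):
    """Centered moving-average smoothing applied to per-trajectory speed and
    turn-rate series before stop-and-go wave analysis."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")
```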
Figure 7a,b show the stop-and-go wave detection results on the expressway using the vision-only method and the radar–vision fusion method, respectively. The results show a 5 min vehicle trajectory diagram of the stop-and-go wave on the expressway, with a time granularity of 0.1 s. The congestion periods are marked between the blue and green lines, and the variation in vehicle speed is indicated by the color of each point on the trajectory. Traditional video-based macroscopic traffic event detection, such as the detection of stop-and-go waves, is heavily influenced by the continuity of vehicle trajectories. By increasing the accuracy of vehicle trajectory estimation, the proposed radar–camera fusion method effectively improves stop-and-go wave detection. Also, using the SUMO traffic simulator, we can modify the input simulation parameters to extend the scenario, for example by adjusting the ramp flow or evaluating the effect of road expansion. This framework can offer a realistic virtual environment for the selection of roadside perception devices, algorithm testing, and parameter tuning.
5. Conclusions and Discussion
Costly field experiments and the lack of a simulation platform pose great challenges for developing and testing radar–camera fusion algorithms. This paper proposes a framework for testing 4D mmWave radar–camera fusion algorithms in a simulated highway roadside environment. First, we propose a 4D millimeter-wave radar simulation method, and a roadside multi-sensor perception dataset is generated in a 3D environment through co-simulation. Then, deep-learning object detection models are trained under different weather and lighting conditions. Finally, a self-adaptive fusion tracking algorithm is proposed as a baseline and validated in both urban intersection and highway scenarios.
Case studies show that the method proposed in this paper provides a realistic virtual environment for testing radar–camera fusion algorithms for roadside traffic perception. A visual vehicle detection model was trained under different lighting and weather conditions, achieving a minimum mAP50–95 of 85.63 in rainy night scenarios. Compared to the camera-only tracking method, the radar–vision fusion method with the CTRA model significantly improves tracking performance in rainy night scenarios: the trajectory RMSE is improved by 68.61% in expressway scenarios and 67.45% in urban scenarios. This method can also be applied to improve the detection of stop-and-go waves on congested expressways.
Although this study serves as a foundation for adding new types of mmWave radar with standardized simulation methods, other high-value scenarios should be addressed. In the future, we plan to investigate its applications, including identifying traffic incidents, analyzing sensor deployment, and developing methods for cross-sensor tracking.