Article

A Near-Field Area Object Detection Method for Intelligent Vehicles Based on Multi-Sensor Information Fusion

College of Mechanical and Electrical Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2022, 13(9), 160; https://doi.org/10.3390/wevj13090160
Submission received: 12 July 2022 / Revised: 12 August 2022 / Accepted: 15 August 2022 / Published: 24 August 2022

Abstract
In order to address the difficulty intelligent vehicles face in detecting near-field targets, this paper proposes a near-field object detection method based on multi-sensor information fusion. Firstly, the F-CenterFusion method is proposed to fuse the information from LiDAR, millimeter wave (mmWave) radar, and camera to fully obtain target state information in the near-field area. Secondly, multi-attention modules are constructed in the image and point cloud feature extraction networks, respectively, to locate the targets' class-dependent features and suppress the expression of useless information. Then, a dynamic connection mechanism is used to fuse image and point cloud information to enhance feature expression capabilities. The fusion results are input into the predictive inference head network to obtain target attributes, locations, and other data. The method is verified on the nuScenes dataset. Compared with the CenterFusion method, which fuses mmWave radar and camera information, the NDS and mAP values of our method are improved by 5.1% and 10.9%, respectively, and the average accuracy score of multi-class detection is improved by 22.7%. The experimental results show that the proposed method enables intelligent vehicles to realize near-field target detection with high accuracy and strong robustness.

1. Introduction

The development of intelligent vehicles has the potential to reduce traffic accidents caused by poor driving habits while also alleviating traffic congestion [1]. Accurate perception of the external world is crucial for driving safety and reliability [2,3]. As shown in Figure 1, targets in the near-field area are close to the intelligent vehicle, so the perception scale increases and the targets' outlines extend beyond the sensor's field of view, changing from continuous regular shapes to discontinuous, irregular shapes, which makes target detection more difficult. As a result, a key task of environmental perception for autonomous driving is the high-precision recognition of near-field targets.
Currently, cameras, LiDARs, and mmWave radars are the most prevalent sensors used for autonomous driving environment perception. Cameras capture 2D visual information about the targets; LiDARs obtain 3D shape information; and mmWave radars obtain target location and velocity information [4]. The sensing principle of each sensor inevitably has limitations: cameras cannot measure the targets' distance, LiDARs cannot detect the targets' color, and mmWave radars cannot determine the targets' category [5]. It is therefore challenging to achieve adequate results in the near-field target detection task using a single sensor. When data from multiple sensors of different modalities are integrated, more usable information can be obtained, boosting the precision of near-field target detection. For this reason, a growing number of researchers are studying multi-sensor information fusion technology, which overcomes a single sensor's inability to obtain multiple types of state information about target obstacles and is effective for close-range target detection tasks. Consequently, the current research hotspot is employing multi-sensor information fusion technology to achieve high-precision near-field target recognition.
Near-field target detection algorithms based on traditional methods are usually realized by manually extracting target features and applying various classifiers, such as HOG + SVM [6,7] and SIFT + AdaBoost [8,9]. The biggest difficulty in detecting near-field targets is that only local information about the targets can be obtained, so detecting targets based on known local information is the most common research approach. Markus Enzweiler et al. proposed a mixture-of-experts architecture for the analysis of localized information [10]. The target boundary was initially determined using discontinuous depth and motion feature data, followed by the calculation of the expert visibility weights; based on the weight ratio, the mixture-of-experts classifier focused on the targets' local feature information. In addition, Mathias et al. proposed a novel classifier based on the notion of training a classifier specifically for a certain target [11]. The classifier was separated into two parts: one for the global detection of all image data and the other for the partial detection of local data only. The concept of spatially biased feature selection was developed to focus the classifier's attention during training on small-area characteristics without affecting global detection. This method separates global and local information, enhancing the accuracy of target detection in the near-field region without affecting global detection. In general, near-field object detection methods based on classical vision have a low computational cost and require few training samples, but they rely heavily on manual design and deliver limited, imprecise detection performance.
Deep learning-based object detection methods can be divided into two categories: anchor-based and anchor-free object detection. The R-CNN is a common example of the anchor-based strategy [12]. DeepParts, a detector based on R-CNN, was proposed by Tian et al. [13]: dedicated local convolutional network classifiers detect targets using only local input, without the parts needing to be predefined. This method can train specific local classifiers according to the input information, is highly robust, and reduced the false detection rate by 11.89% on the Caltech dataset. However, anchor-based target detection relies on manual design, suffers from imbalanced positive and negative samples, and has an inefficient training and prediction procedure. Anchor-free methods predict in a per-point manner and no longer use the anchor box mechanism. Qi et al. enhanced the FCOS network architecture by integrating the SE-block attention module into the backbone network [14,15,16]. The SE block was trained to learn different weights for the feature maps, enhance the expression capability of feature information, and emphasize the local feature information of near-field targets, improving detection accuracy by 15% on the CrowdHuman dataset. In conclusion, within deep learning-based target detection, the strategy of adding attention mechanisms to exploit local information is simple and computationally light, making it well suited to near-field target detection.
Although various effective optimization methods for near-field target detection based on a single sensor have been developed in recent years, they do not meet the reliability requirements of intelligent driving environment perception systems. The viewing angle of vehicle-mounted sensors limits the perception of near-field targets: only local information can be obtained, and the perceived shape is no longer regular and continuous. Using data from a single sensor therefore makes it difficult for the detector to correctly estimate the size, position, and state of the targets. Cameras collect image information in the form of two-dimensional pixels; LiDARs collect point cloud information represented as disordered, unstructured three-dimensional coordinates; and mmWave radars collect point cloud information or data blocks. According to the level at which information is fused, multi-sensor information fusion technology is mainly divided into data-level fusion, feature-level fusion, and decision-level fusion [17].
As shown in Figure 2a, data-level fusion refers to fusing the information gathered directly by the sensors, making full use of the content of the raw data. Reference [18] employed a two-step strategy to combine data from cameras and LiDARs, with fusion at the data level as the initial step. Because the data from the cameras and LiDARs have different distributions and features, a cross-view spatial feature fusion technique was devised: interpolation projection with a corrected spatial offset was used to map the camera-view feature to a dense and smooth BEV feature through automatic calibration feature projection, minimizing information loss as much as possible. However, this method is complicated in practical use because it requires accurate extrinsic sensor calibration, and precise time calibration also requires compensating for the sensor's own motion. In addition, because so much raw information is retained, the computational load and hardware requirements increase significantly. As shown in Figure 2b, decision-level fusion uses a dedicated network to fuse the detection results of each sensor. Compared with data-level fusion, it is more lightweight, modular, and robust. A decision-level fusion mechanism was devised in reference [19]. Initially, a modified SSD algorithm was employed to identify and recognize visual targets. After preprocessing the mmWave radar data, the mmWave radar tracking results were projected onto the visual image using the AEKF approach, and the position state of the preceding vehicle was calculated using the IoU method. By merging visual detection information with mmWave radar information, the AEKF algorithm effectively reduces radar missed and false detections while accelerating the recognition of visual targets. Decision-level fusion facilitates algorithm implementation and debugging while decreasing sensor dependence. It can effectively fuse information in different data formats, but it loses a substantial amount of critical information and considerably reduces the accuracy of target recognition. As shown in Figure 2c, feature-level fusion first extracts features from the collected data to construct feature vectors and then fuses the information contained within these feature vectors. Chen et al. introduced the multi-view 3D object detection method (MV3D), a feature-level fusion method [20]. Image and point cloud data are fed into the network concurrently; the sparse 3D point cloud is encoded using a multi-view representation together with its color information and spatial distribution information, and the fused features are then used to predict targets and obtain their three-dimensional information. Feature-level fusion is widely used and can ensure the consistency of the fused information; the key is to choose suitable feature vectors for fusion, such as size, scale, and velocity. Additionally, its requirements for computing hardware and transmission bandwidth are substantially lower.
In conclusion, both single-sensor target detection and present fusion detection systems exhibit deficiencies in near-field detection and are unable to gather comprehensive target state data [21]. This paper provides a near-field target detection approach based on multi-sensor information fusion to improve the capability of target detection and adapt to the complicated environment. The following are the principal contributions of this paper:
1. In order to address the lack of target information obtained by intelligent vehicles in the near-field area, the F-PointPillars method is proposed to increase the amount of target feature information by integrating the position and velocity information of LiDAR and millimeter-wave radar point clouds.
2. The attention modules are introduced into the image detection network and the point cloud detection network to help the detector eliminate redundant feature information and learn about related features.
3. We provide a dynamic connection method to establish connections between features of different modalities and abstraction levels, allowing image and point cloud information to be combined efficiently and in a learnable manner, hence enhancing the performance of the detector.
4. This paper proposes a near-field target detection method based on multi-sensor information fusion. Figure 3 shows the system architecture, which includes the image detection module, the point cloud information processing module, and the information fusion module. Figure 4 shows the network architecture for detection.
5. Experiments on the challenging nuScenes dataset demonstrate that the proposed method has a better detection effect in the near-field region than existing detection methods.
Figure 3. Multi-sensor information fusion system.
Figure 4. Multi-sensor information fusion network architecture.

2. Method of This Paper

2.1. Image Target Detection Based on CenterNet

In this paper, the CenterNet detection network is used for the initial detection of image information [22]. The DLA network serves as the backbone to extract image features. The initial regression head predicts the object center point on the image, producing an accurate 2D bounding box and a tentative 3D bounding box for objects in the near-field region.
CenterNet takes an image $I \in \mathbb{R}^{W \times H \times 3}$ as input. The predicted center keypoint heatmap is shown in Formula (1):
$$\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C} \tag{1}$$
where W and H are the width and height of the image, R is the down-sampling ratio, and C is the number of target categories.
If the predicted value $\hat{Y}_{x,y,c}$ is 1, a target of class $c$ is centered at coordinates $(x, y)$ on the image; if $\hat{Y}_{x,y,c}$ is 0, there is no target of class $c$ at $(x, y)$. Figure 5 is the schematic diagram of the CenterNet network architecture.
For a certain category in the label, its real center point is first calculated for training. The calculation formula of the center point is shown in Formula (2):
$$p = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right) \tag{2}$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are the corner coordinates of the target's regression box.
The down-sampled coordinates are set as $\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$, where $R$ is the down-sampling factor, so the calculated center point corresponds to a center point in the low-resolution map. A Gaussian kernel is then used to distribute the down-sampled points onto the feature map $Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ [23]. The Gaussian kernel is defined as Formula (3):
$$Y_{xyc} = \exp\left( -\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2 \sigma_p^2} \right) \tag{3}$$
where $\sigma_p$ is a standard deviation related to the target size $(w, h)$.
The loss function of the keypoint heatmap is trained by the focal loss [24], as indicated in Equation (4):
$$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\ \left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise} \end{cases} \tag{4}$$
where $\alpha$ and $\beta$ are hyperparameters of the focal loss, and $N$ is the number of target center points.
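As a concrete illustration, the following is a minimal PyTorch sketch of how the ground-truth heatmap of Formula (3) and the focal loss of Formula (4) could be implemented; the function names, the per-pixel maximum used when splatting overlapping targets, and the choice α = 2, β = 4 are illustrative assumptions rather than details taken from the paper.

```python
import torch

def draw_gaussian(heatmap, center, sigma):
    """Splat one target center onto a single class channel of the ground-truth
    heatmap using the Gaussian kernel of Formula (3)."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return torch.maximum(heatmap, g)  # keep the strongest response per pixel

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    """Penalty-reduced focal loss of Formula (4); pred and gt have shape
    (batch, C, H/R, W/R), and pred holds probabilities in (0, 1)."""
    pred = pred.clamp(1e-6, 1.0 - 1e-6)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```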
In the near-field area scene, there are a large number of targets of the same category but with different information features. These same-category targets have similar features, and these class-dependent traits can be used to identify similar targets even when little feature information is available. Therefore, we first construct a spatial attention module to capture the spatial dependence between two similar features in the feature map. Based on the whole target information, it helps to detect the class-dependent characteristics of near-field targets for which only local information is available. The spatial attention weights are proportional to the similarity between two features, so similar features mutually reinforce each other. After locating the category features, we construct the channel attention module, which makes the detector focus on the weights of channel feature maps with similar dependencies, highlights the feature representation, and enhances the feature expression ability.
The spatial attention module selectively aggregates features through a weighted sum and improves the expression of similar features. This strategy can improve the feature expression of objects that extend beyond the field of view by building a dependence relationship with the features of other similar objects within the field of view. The spatial attention module is shown in Figure 6.
In Figure 6, R stands for reshape, T for transpose, and S for Softmax. ⊗ represents matrix multiplication, and ⊕ represents element addition.
The output of the input feature map after spatial attention can be expressed as:
$$E_j^s = \alpha \sum_{i=1}^{N} (M_{ji} D_j) + A_j \tag{5}$$
where $\alpha$ is the scale coefficient, $M_{ji}$ is the influence of the $i$-th feature map on the spatial attention of the $j$-th feature map, $A_j$ is the $j$-th input feature map, $E_j^s$ is the $j$-th output feature map, and $D_j$ is the feature map reshaped from $A_j$.
The operation of the spatial attention module is as follows: given a feature map $A \in \mathbb{R}^{C \times H \times W}$, feature maps $B \in \mathbb{R}^{C \times H \times W}$, $C \in \mathbb{R}^{C \times H \times W}$, and $D \in \mathbb{R}^{C \times H \times W}$ are obtained by convolution. $B$ is reshaped and transposed into an $N \times C$ matrix ($N = H \times W$), and $C$ is reshaped into a $C \times N$ matrix. Multiplying $B$ and $C$ and feeding the result into the Softmax activation function yields the spatial attention map $M \in \mathbb{R}^{N \times N}$. $D$, reshaped to $C \times N$, is multiplied by $M$ to obtain a $C \times N$ result, which is scaled by the coefficient $\alpha$ and restored to $C \times H \times W$; finally, the input $A$ is added element-wise to yield the output $E$. Here, $\alpha$ is initialized to 0 and its value is gradually learned.
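The following PyTorch sketch shows one way the spatial attention module described above could be written; the 1 × 1 convolutions used to produce B, C, and D and the channel-reduction factor of 8 are assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position attention sketch following Figure 6 and Formula (5)."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv_b = nn.Conv2d(in_channels, in_channels // 8, 1)  # produces B
        self.conv_c = nn.Conv2d(in_channels, in_channels // 8, 1)  # produces C
        self.conv_d = nn.Conv2d(in_channels, in_channels, 1)       # produces D
        self.alpha = nn.Parameter(torch.zeros(1))  # scale coefficient, learned from 0

    def forward(self, a):                                    # a: (batch, C, H, W)
        bs, c, h, w = a.shape
        n = h * w
        b = self.conv_b(a).view(bs, -1, n).permute(0, 2, 1)  # (batch, N, C')
        k = self.conv_c(a).view(bs, -1, n)                   # (batch, C', N)
        m = torch.softmax(torch.bmm(b, k), dim=-1)           # (batch, N, N) spatial map M
        d = self.conv_d(a).view(bs, c, n)                    # (batch, C, N)
        out = torch.bmm(d, m.permute(0, 2, 1)).view(bs, c, h, w)
        return self.alpha * out + a                          # E = alpha * (weighted sum) + A
```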
The channel attention module selectively emphasizes useful features by integrating relevant features between all channel maps, allowing the detector to better focus on key class-dependent features in the complex background containing numerous interfering targets. The channel attention module is shown in Figure 7.
After the input feature map passes through the channel attention module, the output can be expressed as:
$$E_j^c = \beta \sum_{i=1}^{C} (L_{ji} A_j) + A_j \tag{6}$$
where $\beta$ is the scale coefficient, $L_{ji}$ is the influence of the $i$-th feature map on the channel attention of the $j$-th feature map, $A_j$ is the $j$-th input feature map, and $E_j^c$ is the $j$-th output feature map.
The operation of the channel attention module is as follows: the feature map $A$ is reshaped into a $C \times N$ matrix and transposed into an $N \times C$ matrix; multiplying the two matrices yields a $C \times C$ matrix, which is passed through the Softmax function to obtain the channel attention map $L \in \mathbb{R}^{C \times C}$. $L$ is then matrix-multiplied with the reshaped $C \times N$ feature map to obtain a $C \times N$ result, which is scaled by the coefficient $\beta$ and restored to $C \times H \times W$. Finally, the output $E$ is obtained by adding the input $A$ element-wise. $\beta$ is initialized to 0 and its value is gradually learned.
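A matching PyTorch sketch of this channel attention branch is given below; as with the spatial branch, this is only an assumed implementation of the operations described in the text. In the parallel arrangement described in the next paragraph, the outputs of the two branches would simply be added element-wise.

```python
import torch
import torch.nn as nn

class ChannelAttention2D(nn.Module):
    """Channel attention sketch following Figure 7 and Formula (6)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scale coefficient, learned from 0

    def forward(self, a):                                   # a: (batch, C, H, W)
        bs, c, h, w = a.shape
        x = a.view(bs, c, -1)                               # reshape A to (batch, C, N)
        l = torch.softmax(torch.bmm(x, x.permute(0, 2, 1)), dim=-1)  # (batch, C, C) map L
        out = torch.bmm(l, x).view(bs, c, h, w)
        return self.beta * out + a                          # E = beta * (weighted sum) + A
```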
After the original image passes through the backbone network, the input feature map is obtained. Then, the feature map is used to capture similar features from the spatial and channel dimensions through two parallel attention modules. Finally, the information from the two attention modules is fused by element addition to improve the feature representation.

2.2. Point Cloud Detection Based on F-PointPillars and Attention Mechanism

2.2.1. LiDAR and mmWave Radar Feature Information Fusion

LiDAR has difficulty measuring target velocity, whereas millimeter-wave radar has difficulty measuring object size, which makes the target information acquired in the near-field area unreliable for either sensor alone. Combining the size and depth information of LiDAR with the depth and velocity information of millimeter-wave radar can effectively complement both and enhance detection precision in the near-field area. The information collected by both LiDAR and mmWave radar takes the form of disordered, sparse point cloud data. Therefore, we present F-PointPillars, a pillar-encoding-based information fusion approach for LiDAR and mmWave radar. Our method tensorizes the point cloud data for feature extraction, allowing accurate and real-time fusion of the high-dimensional, high-volume point cloud information from LiDAR and mmWave radar.
The $xy$ plane is first rasterized into a $W \times H$ grid, forming $W \times H = P$ pillars; each pillar samples $N$ points, and each point $Z$ is then encoded as a $D$-dimensional vector, as follows:
$$Z = \{x, y, z, x_0, y_0, z_0, x_m, y_m\} \tag{7}$$
where $(x, y, z)$ are the true 3D coordinates of the point, $(x_0, y_0, z_0)$ are the true 3D coordinates of the geometric center of the points within the pillar, and $(x_m, y_m)$ is the offset from the point to the pillar center in the $X$ and $Y$ directions. The points finally form a $(D, P, N)$-dimensional tensor, where $D$ is the pillar feature dimension, $P$ is the number of non-empty pillars, and $N$ is the number of points in a single pillar.
In this paper, the 3D coordinates $(x_1, y_1, z_1)$ of the original LiDAR points plus the 5 additional dimensions $(x_0, y_0, z_0, x_m, y_m)$ make up a total of 8 dimensions: $\{x_1, y_1, z_1, x_0, y_0, z_0, x_m, y_m\}$. Based on the coordinate relationship between the LiDAR and the mmWave radar, let the LiDAR translation matrix be $T_l$ and its rotation matrix $R_l$, and let the mmWave radar translation matrix be $T_r$ and its rotation matrix $R_r$. The mmWave radar point cloud coordinates are converted to the LiDAR coordinate system by Equation (8), and the converted mmWave radar coordinates are denoted $(x_{rl}, y_{rl}, z_{rl})$.
$$\begin{bmatrix} x_{rl} \\ y_{rl} \\ z_{rl} \end{bmatrix} = R \begin{bmatrix} x_r \\ y_r \\ z_r \end{bmatrix} + T \tag{8}$$
where $R = R_l \times R_r$ and $T = T_l - T_r$.
Since the Z value of every mmWave radar point is zero, only the 2D coordinates $(x_{rl}, y_{rl})$ of the original points, the pillar center coordinates $(x_0, y_0)$, the offsets $(x_m, y_m)$ in the $x$ and $y$ directions, and the compensated velocities $(v_{x,comp}, v_{y,comp})$ are retained, again making up 8 dimensions: $\{x_{rl}, y_{rl}, x_0, y_0, x_m, y_m, v_{x,comp}, v_{y,comp}\}$. The LiDAR and mmWave radar feature channels are stacked to obtain a $(D_{lr}, P, N)$-dimensional tensor.
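To make the encoding concrete, the sketch below builds the 8-D per-point vector of Formula (7) for one pillar and applies a radar-to-LiDAR transform in the spirit of Equation (8); the exact composition of the extrinsics (R = R_l R_r, T = T_l − T_r) and the use of the arithmetic mean as the pillar center are assumptions drawn from the text, not verified implementation details.

```python
import numpy as np

def radar_to_lidar(radar_xyz, R_l, T_l, R_r, T_r):
    """Transform mmWave radar points (N, 3) into the LiDAR frame, following the
    spirit of Equation (8)."""
    R = R_l @ R_r          # combined rotation (assumed composition)
    T = T_l - T_r          # combined translation (assumed composition)
    return (R @ radar_xyz.T).T + T

def encode_pillar(points):
    """Build the 8-D per-point vector (x, y, z, x0, y0, z0, xm, ym) of
    Formula (7) for the points (N, 3) falling into one pillar."""
    center = points.mean(axis=0)                  # geometric center of the pillar's points
    offset_xy = points[:, :2] - center[:2]        # offset to the pillar center in x, y
    return np.hstack([points,
                      np.tile(center, (len(points), 1)),
                      offset_xy])                 # shape (N, 8)
```

For the radar branch, the same layout would keep the 2D coordinates, pillar center, offsets, and the two compensated velocity components instead of the z-related entries.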
The F-PointPillars method accomplishes the tensorization of point cloud data by transforming point clouds into cylinders and integrating LiDAR and mmWave radar data to offer sufficient feature information for further detection.

2.2.2. Attention Mechanism of Point Cloud Module

The point cloud returned by a radar scanning a target in the near-field region is typically broad and dense. However, owing to the characteristics of the radar scanning beam, the spacing between points at the periphery of the target grows, producing redundant information that is sparse and uninformative. Moreover, because the acquired target information lacks a complete contour, detecting targets in the point cloud becomes more difficult. In order to eliminate redundant information in the fused point cloud features and strengthen the key target point cloud features, this paper employs a multi-attention strategy and designs a point attention module and a channel attention module to further improve the performance of the point cloud detector.
Given an input feature map $F \in \mathbb{R}^{N_1 \times (d + C_1)}$ and following the CBAM attention model [25], the attention applied to $F$ is divided into a point attention $M_P \in \mathbb{R}^{N_1 \times 1}$ and a channel attention $M_C \in \mathbb{R}^{1 \times (d + C_1)}$. The whole attention process is as follows:
$$F' = M_p(F) \otimes F \tag{9}$$
$$F'' = M_c(F') \otimes F' \tag{10}$$
where $F'$ is the data after point attention processing, $F''$ is the data after channel attention processing, and $\otimes$ denotes element-by-element multiplication.

Point Attention Mechanism

The point attention module reweights each downsampling and upsampling point by establishing similar dependencies between points, which then selects and enhances keypoint features so that they can also receive sufficient attention in incomplete contours and cluttered scenes.
Given a frame of point cloud, the data is first partitioned. The pillar feature network evenly distributes the points over a grid in the $Oxy$ plane, and these grid cells form a set of pillars, finally producing a $(D, P, N)$-dimensional tensor, where $D$ is the point feature dimension, $P$ is the number of non-empty pillars, and $N$ is the number of points within a single pillar.
As shown in Figure 8, the input point set $P \in \mathbb{R}^{N \times d}$ is composed of $N$ randomly sampled points and is mapped by the backbone network to $F \in \mathbb{R}^{N_1 \times (d + C_1)}$, which contains $N_1$ points, each consisting of $d$-dimensional coordinates and an abstract $C_1$-dimensional feature vector. First, global max pooling aggregates $F$ over the coordinate and feature dimension $d + C_1$ to generate the point descriptor $F_{max} \in \mathbb{R}^{N_1 \times 1}$. $F_{max}$ is then passed to a multi-layer perceptron to generate the point attention map $M_P \in \mathbb{R}^{N_1 \times 1}$. A two-layer perceptron with nonlinear functions keeps the module simple and versatile: the first layer's output is followed by the ReLU activation to increase the nonlinearity of the model, while the second layer's output is followed by the Sigmoid function to compute the attention weights. Finally, the point attention map $M_P$ is multiplied element-wise with $F$ as a mask to obtain the optimized output. The working process of the point attention module can be summarized as follows:
$$M_p(F) = \sigma(W_1(\delta(W_0(F_{max})))) \tag{11}$$
where the activation functions Sigmoid and ReLU are represented by $\sigma$ and $\delta$, respectively; $W_0$ and $W_1$ represent the weight parameters of the first and second layers of the multi-layer perceptron, respectively; and $F_{max}$ represents the feature map after global max pooling.
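A minimal sketch of this point attention branch is shown below; the hidden width of the two-layer perceptron is an assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

class PointAttention(nn.Module):
    """Point attention sketch (Figure 8, Formula (11)): a two-layer MLP applied to
    the max-pooled per-point descriptor yields one weight per point."""
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),    # W0 followed by ReLU
            nn.Linear(hidden_dim, 1), nn.Sigmoid()  # W1 followed by Sigmoid
        )

    def forward(self, f):                           # f: (N1, d + C1)
        f_max = f.max(dim=1, keepdim=True).values   # (N1, 1) per-point descriptor F_max
        m_p = self.mlp(f_max)                       # (N1, 1) point attention map M_P
        return m_p * f                              # element-wise reweighting of F
```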

Channel Attention Mechanism

The channel attention module functions as a channel feature selector, emphasizing useful information and suppressing interference through the correlation between feature channels, directing the feature extractor to capture more discriminative geometric features and refining the target's bounding box, as shown in Figure 9. Since the feature map of each channel can be regarded as a feature detector, channel attention focuses primarily on locating the significant portions of the input data. The spatial dimension of the input data is compressed to make computing the channel attention more efficient. To summarize the spatial information, the channel attention aggregates the spatial information of the feature map using max pooling and average pooling, represented by $F_{max}^c$ and $F_{avg}^c$, respectively. This information is then passed to a multi-layer perceptron composed of two fully connected layers, whose processing is consistent with the point attention mechanism. After this coding-decoding processing, the channel attention $M_c$ is obtained, and an element-by-element addition outputs the merged feature vector. The working process of the channel attention module can be summarized as follows:
$$M_c(F) = \sigma(W_1(\delta(W_0(F_{max})))) + \sigma(W_1(\delta(W_0(F_{avg})))) \tag{12}$$
where the activation functions Sigmoid and ReLU are represented by $\sigma$ and $\delta$, respectively; $W_0$ and $W_1$ represent the weight parameters of the first and second layers of the multi-layer perceptron, respectively; $F_{max}$ represents the feature map after global max pooling, and $F_{avg}$ represents the feature map after global average pooling.
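The corresponding channel attention branch could look like the sketch below; the shared two-layer MLP and the reduction ratio of 4 are assumptions, and the two branches are added after the Sigmoid exactly as written in Formula (12).

```python
import torch
import torch.nn as nn

class PointChannelAttention(nn.Module):
    """Channel attention sketch (Figure 9, Formula (12)) for point features."""
    def __init__(self, num_channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared W0 / W1
            nn.Linear(num_channels, num_channels // reduction), nn.ReLU(),
            nn.Linear(num_channels // reduction, num_channels)
        )

    def forward(self, f):                                # f: (N1, C) point features
        f_max = f.max(dim=0).values                      # (C,) channel-wise max pooling
        f_avg = f.mean(dim=0)                            # (C,) channel-wise average pooling
        m_c = torch.sigmoid(self.mlp(f_max)) + torch.sigmoid(self.mlp(f_avg))
        return m_c * f                                   # reweight every channel of F
```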

2.3. Feature Information Fusion

In order to solve the problem of different modalities of information fusion between image color texture and point cloud geometric space features, we propose a dynamic connection learning mechanism to establish and learn the connection between feature representations from different forms and abstract levels. The dynamic connection mechanism can autonomously learn how to establish the relationship between different features of the image and point cloud and fuse the information from different learning stages to obtain a unified representation of the image and point cloud.
As shown in Figure 10, each square in the grid represents a PointPillar feature and carries the mean coordinates of the points accumulated into that pillar. Given a set of RGB feature maps $R_i$, $i \in \{0, 1, \ldots, B\}$, where $B$ is the total number of feature maps in the image detection network, the point cloud feature map is mapped into the image feature space for geometric alignment, yielding a set of feature vectors $F = \{f_i\}$, $i \in \{0, 1, \ldots, B\}$. An initial weight $\omega$, a $B$-dimensional vector, is processed with the Softmax activation function, and $(\omega \times F)$ is computed to obtain the final feature vector. $\omega$ learns the connection weights that determine which image feature layer is fused into which point cloud feature layer. Here, $\omega$ is not a simple scalar but a linear layer whose output, applied to the PointPillar feature $M_i(P_x, P_y)$, generates the weights over the RGB feature maps. The Softmax activation applied after $\omega$ permits the network to select the fusion layer dynamically. Since the dynamic connection learning method is applied to each pillar independently, the network is able to select the fusion mode and location based on the input properties.
In Figure 10, the grey frames represent static connection weights, while the green ones represent dynamic connections whose values are determined by the point cloud's attributes; together, they determine how the features generated by the model at different abstraction levels are combined.
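The sketch below illustrates one plausible form of this dynamic connection mechanism: a per-pillar linear layer followed by Softmax produces the weights used to mix the geometrically aligned image feature levels into each pillar. The concatenation at the end and all tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicConnection(nn.Module):
    """Dynamic connection sketch (Figure 10): for every pillar, a linear layer
    (the learned weight omega) predicts B connection weights over the image
    feature levels, and a Softmax selects how the aligned RGB features are mixed."""
    def __init__(self, pillar_dim, num_levels):
        super().__init__()
        self.omega = nn.Linear(pillar_dim, num_levels)   # omega as a linear layer

    def forward(self, pillar_feat, aligned_rgb_feats):
        # pillar_feat: (P, pillar_dim); aligned_rgb_feats: (B, P, C_img)
        w = torch.softmax(self.omega(pillar_feat), dim=-1)        # (P, B) weights
        fused = torch.einsum('pb,bpc->pc', w, aligned_rgb_feats)  # weighted sum over levels
        return torch.cat([pillar_feat, fused], dim=-1)            # unified representation
```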
The center-point image detection network generates a frustum that bounds the object's 3D search space, and all points inside the frustum are processed by PointPillars. Frustums oriented in different directions produce point cloud pillars with inconsistent orientations, which increases the point cloud position error. Therefore, we normalize the frustum by rotating it so that its center axis is perpendicular to the image plane. This normalization helps to improve the algorithm's rotation invariance, as illustrated in Figure 11.
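A simple way to express this normalization is a yaw rotation that aligns the frustum's center axis with the forward direction, as in the sketch below; the assumption that x is forward, y is lateral, and z is up in the sensor frame is ours, not stated in the paper.

```python
import numpy as np

def normalize_frustum(points, frustum_center):
    """Rotate frustum points (N, 3) about the vertical axis so that the frustum's
    center axis points straight ahead (perpendicular to the image plane), as in
    Figure 11."""
    heading = np.arctan2(frustum_center[1], frustum_center[0])  # yaw of the center axis
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T
```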
Fine regression derives a more precise three-dimensional bounding box from a group of candidate boxes, which is then used to precisely characterize the target's position, size, motion direction, and speed. The head network consists of three 1 × 1 convolution layers and one mask prediction branch. As shown in Figure 12, the prediction branch performs inference and optimizes the target size according to the target category in order to generate the appropriate output.

3. Experimental Analysis

In order to prove the efficacy of the proposed method, it is evaluated and tested on the nuScenes dataset, which has gained widespread recognition in the field of 3D target detection [26]. This work employs the PyTorch framework. The network is trained using the default parameters of the Adam method, with the learning rate set to 1 × 10−4 and the number of iterations set to 60 × 100 [27]. By adjusting the model parameters over many training sessions, we found that the model converges best when the momentum is set to 0.90, the weight decay is 0.0001, the learning rate is 1.25 × 10−4, and the number of iterations is 60 × 100. The experiments were carried out on a single CPU core of an Intel Xeon Silver 4110 server equipped with 32 GB of RAM and two NVIDIA Tesla P4 GPUs (accelerated computing).
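For reference, a minimal optimizer configuration consistent with the settings reported above might look as follows; the model is only a placeholder, and interpreting the reported momentum of 0.90 as Adam's first-moment coefficient is our assumption.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual detection network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1.25e-4,           # learning rate that converged best in the text
    betas=(0.90, 0.999),  # momentum 0.90 read as Adam's beta1 (assumption)
    weight_decay=1e-4,    # weight decay reported in the text
)
```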

3.1. Dataset

The nuScenes dataset is the first large-scale dataset comprising synchronized images, LiDAR point clouds, and mmWave radar point clouds recorded under real road conditions. The full dataset consists of 1000 scenes, 1.4 million image frames, 390,000 frames of LiDAR point cloud data, 23 object categories, and 1.4 million 3D annotation boxes, making it significantly larger and more challenging than the earlier autonomous driving dataset KITTI [28]. In this work, we chose a subset of the nuScenes near-field area scene data as the test set for model validation and the remaining data as the training set for model training. Six target types (vehicle, truck, bus, pedestrian, motorbike, and traffic cone) are selected for target detection, and the results are analyzed by comparing them with other approaches.

3.2. Evaluation Methods

In the detection field, Average Precision (AP) is the most widely used evaluation indicator; in general, the greater the AP value, the better. Instead of IoU, threshold matching is computed using the 2D center distance $d$ on the ground plane, which decouples the effect of object size and orientation from the AP calculation. The threshold $d$ is taken from the set $\mathbb{D} = \{0.5, 1, 2, 4\}$ m. The mAP is then calculated over the set of classes $\mathbb{C}$ and the set of distance thresholds $\mathbb{D}$:
$$\text{mAP} = \frac{1}{|\mathbb{C}||\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} AP_{c,d} \tag{13}$$
In addition to the mAP index, the nuScenes benchmark also defines the nuScenes Detection Score (NDS), a weighted combination of the mAP and the mean true-positive (TP) error metrics; the higher the NDS, the better the detection performance. It is calculated from the set of TP metrics $\mathbb{TP}$ as:
$$\text{NDS} = \frac{1}{10} \left[ 5\,\text{mAP} + \sum_{\text{mTP} \in \mathbb{TP}} \left( 1 - \min(1, \text{mTP}) \right) \right] \tag{14}$$
In addition, the detection accuracy increases when the average error-index value of speed, size, and attribute decreases. The average error is used to measure the algorithm’s performance in this work. Table 1 explains the meaning of each mean error.
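As a worked check of Formulas (13) and (14), the sketch below computes the mAP average and the NDS from the five mean TP errors listed in Table 1; the dictionary-based data layout is purely illustrative.

```python
import numpy as np

def nuscenes_mAP(ap_table):
    """ap_table[c][d] holds the AP of class c at center-distance threshold d
    (d in {0.5, 1, 2, 4} m); mAP averages over both, as in Formula (13)."""
    return float(np.mean([[ap_table[c][d] for d in sorted(ap_table[c])]
                          for c in ap_table]))

def nuscenes_NDS(mAP, tp_errors):
    """NDS of Formula (14); tp_errors maps the five mean TP errors
    (mATE, mASE, mAOE, mAVE, mAAE) to their values."""
    tp_score = sum(1.0 - min(1.0, e) for e in tp_errors.values())
    return (5.0 * mAP + tp_score) / 10.0

# Example with the CenterFusion row of Table 2:
errors = {"mATE": 0.649, "mASE": 0.263, "mAOE": 0.535, "mAVE": 0.540, "mAAE": 0.142}
print(round(nuscenes_NDS(0.332, errors), 3))  # 0.453, matching the reported NDS
```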

3.3. Results Analysis

On the test set of the nuScenes dataset, the performance of this method is evaluated and compared with other existing state-of-the-art single-sensor and multi-sensor fusion methods, as shown in Table 2. Our method's NDS and mAP indicators exceed those of all other methods. Compared with the single-camera detection methods MonoDIS and CenterNet and the single-LiDAR detection method InfoFocus, the NDS score is enhanced by 12%, 17.6%, and 10.9%, respectively [29,30]. Compared with the mmWave radar and LiDAR fusion method RL-PointPillars, the camera and LiDAR fusion method F-PointNet, and the camera and mmWave radar fusion method CenterFusion, the improvement is 19.2%, 5.5%, and 5.1%, respectively [23,31,32]. Compared with MonoDIS and CenterNet, the proposed approach improves the mAP score by 13.7% and 13.5%, respectively; in comparison with RL-PointPillars, F-PointNet, and CenterFusion, the increase is 15.5%, 11.5%, and 10.9%, respectively. In addition, our method outperforms previous methods in terms of average translation error, average angle error, and average attribute error, but not average scale error and average velocity error. When targets in the near-field area are very close to the intelligent vehicle, their speed changes rapidly and it is difficult to collect enough proximity information for the calculation, leading to increased size and speed errors.
Table 3 summarizes the detection accuracy values for each category in the nuScenes test results. The detection accuracy of Car has reached 76%, and the detection accuracy of Pedestrian has reached 69.3%; these are the two categories with the greatest improvement in detection accuracy. This paper hypothesizes that, firstly, in various traffic scenarios, Car and Pedestrian account for the largest proportion of targets that can appear in the near-field area of smart vehicles, and secondly, because in the nuScenes dataset, the number of samples of cars and pedestrians is also the largest, the optimization method has the most pronounced effect on the detection of Car and Pedestrian.
Figure 13 shows a portion of the test results on the nuScenes dataset. As can be seen, the detector accurately detects vehicles and pedestrians in the near-field region of the intelligent vehicle. In the first picture, the car on the right side of the intelligent vehicle shows only a small part of its front, and in the second picture, the car on the left side shows only a small part of its rear; both can be effectively detected by the method in this paper. These detection results demonstrate that the method can recognize targets within the near-field range of 1 to 10 m from the intelligent vehicle and has a good detection effect on the objects mentioned.
Figure 14 shows the multi-target detection results of this paper's method in general scenarios of the nuScenes dataset. The figure demonstrates that the proposed method does not miss detections and obtains accurate results for vehicles, pedestrians, trucks, and traffic cones in the same scene. This shows that the method is also capable of achieving good 3D target detection results in general scenarios and has excellent scene robustness.

3.4. Ablation Study

By conducting ablation tests on the nuScenes test set, we further evaluate the efficacy of each strategy suggested in this study. We use the CenterFusion method as the baseline to examine the efficacy of fusing LiDAR data, adding an attention mechanism, and employing a dynamic connection mechanism on detection performance, respectively. Table 4 shows the overall performance test results of the ablation experiment.
To verify the efficacy of the method presented in this work, the fusion of LiDAR point cloud information is compared with the baseline detection results. The significant improvement in overall detection performance demonstrates that combining the information from each sensor is effective. Furthermore, on top of the multi-sensor feature fusion, we investigate the respective contributions of the attention mechanism and the dynamic connection mechanism to the detector's performance. As is evident from Figure 15, compared with the F-CenterFusion method, adding the attention modules significantly increases the target detection accuracy mAP, and the mAOE and mAAE decrease by 11.2% and 18.8%, respectively. This demonstrates that the attention modules are of great help to the attribute and pose detection abilities for targets in the near field of the intelligent vehicle. After combining feature fusion with the dynamic connection mechanism, both NDS and mATE are greatly improved, indicating that the dynamic connection mechanism succeeds in enhancing the detector's performance. The results demonstrate that the attention mechanism and dynamic connection mechanism optimize the backbone network's feature extraction and feature expression capabilities. Whether by promoting similar features, removing useless data, or enhancing the expression of fused features, the performance of the multi-sensor fusion detector is improved. However, the average scale error and average velocity error metrics have not improved; since enhancing the deep learning detector only from the feature perspective cannot satisfy all conditions, other optimization strategies should be considered for the velocity index.
In addition to the overall accuracy and error indicators, Table 5 and Figure 16 present the per-class detection performance of this method in the ablation experiments on the nuScenes test set. The statistics indicate that LiDAR information fusion, the attention mechanism, and the dynamic connection mechanism all contribute to improving the per-class detection performance, with the detection accuracy of cars and pedestrians being enhanced the most.

4. Conclusions

This paper presents an object detection method based on multi-sensor fusion for target detection in the near-field area of intelligent vehicles. The proposed method combines the information from the camera, LiDAR, and mmWave radar to fully acquire the target information in the near field. Attention modules are built into the image and point cloud detection networks, respectively, to capture class-dependent features and eliminate useless feature information. Then, the dynamic connection mechanism is used to fully fuse the image and point cloud feature information to improve the performance of the detector. The fusion results are input into the predictive inference head network to obtain the final detection results. We evaluated our proposed method on the nuScenes dataset. Compared with other methods, our method has achieved better results in the near-field object detection task.
In future work, one of the most important research directions will be determining how to adapt to different and unusual driving scenarios. At the same time, at the level of perception mode, it is envisioned that complete urban road traffic information could be obtained more precisely and rapidly through a cross-domain vehicle-road collaborative perception mode, thereby reducing hardware and software costs and enhancing the practical application value.

Author Contributions

Conceptualization, S.Y. and G.C.; methodology, S.Y.; software, S.Y.; validation, Y.X. and G.C.; formal analysis, L.Y.; investigation, G.C.; resources, Y.X.; data curation, W.Z.; writing—original draft preparation, S.Y.; writing—review and editing, W.Z. and S.Y.; visualization, S.Y.; supervision, Z.F.; project administration, G.C.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2019 Young Top Talents Project under grant number ZYQR201912162, the National Natural Science Foundation of China under grant number 51805490, the Henan Province Tackling Key Scientific and Technological Problems under grant number 222102220013, and the Major Science and Technology Innovation Project in Zhengzhou under grant number 2020CXZX0046.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Brummelen, J.V.; O'Brien, M.; Gruyer, D.; Najjaran, H. Autonomous vehicle perception: The technology of today and tomorrow. Transp. Res. Part C Emerg. Technol. 2018, 89, 384–406.
2. Campbell, S.; O'Mahony, N.; Krpalcova, L.; Riordan, D.; Walsh, J.; Murphy, A.; Ryan, C. Sensor technology in autonomous vehicles: A review. In Proceedings of the Irish Signals and Systems Conference (ISSC), Belfast, UK, 21–22 June 2018; pp. 1–4.
3. Gu, Q. Research on Moving and Multi-scaled Object Detection and Tracking. Ph.D. Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2018.
4. Dai, D.; Chen, Z.; Bao, P.; Wang, J. A Review of 3D Object Detection for Autonomous Driving of Electric Vehicles. World Electr. Veh. J. 2021, 12, 139.
5. Di, F.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360.
6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
7. Chandra, M.A.; Bedi, S.S. Survey on SVM and their application in image classification. Int. J. Inf. Tecnol. 2021, 13, 1–11.
8. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
9. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
10. Enzweiler, M.; Eigenstetter, A.; Schiele, B.; Gavrila, D.M. Multi-cue pedestrian classification with partial occlusion handling. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 990–997.
11. Mathias, M.; Benenson, R.; Timofte, R.; Van Gool, L. Handling occlusions with franken-classifiers. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 1505–1512.
12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
13. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Deep learning strong parts for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1904–1912.
14. Qi, P.; Wang, H.; Zhang, J.; Zhu, F.; Xu, Z. Crowded pedestrian detection algorithm based on improved FCOS. CAAI Trans. Intell. Technol. 2021, 16, 811–818.
15. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
16. Lee, Y.; Hwang, J.W.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 752–760.
17. Gong, T.; Yan, H. Multi-sensor information fusion and application. Appl. Mech. Mater. 2014, 602, 2623–2626.
18. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. Tanet: Robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11677–11684.
19. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 720–736.
20. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
21. Wang, Z.; Zang, L.; Tang, Y.; Shen, Y.; Wu, Z. An Intelligent Networked Car-Hailing System Based on the Multi Sensor Fusion and UWB Positioning Technology under Complex Scenes Condition. World Electr. Veh. J. 2021, 12, 135.
22. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
23. Nabati, R.; Qi, H. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1527–1536.
24. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
26. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. Nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631.
27. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
28. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
29. Simonelli, A.; Bulo, S.R.; Porzi, L.; López-Antequera, M.; Kontschieder, P. Disentangling monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1991–1999.
30. Wang, J.; Lan, S.; Gao, M.; Davis, L.S. Infofocus: 3d object detection for autonomous driving with dynamic information modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 405–420.
31. Li, C.; Lan, H.; Wei, X. Attention-based object detection with millimeter wave radar-lidar fusion. J. Comput. Appl. 2021, 41, 2137–2144.
32. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927.
Figure 1. Typical scene of intelligent vehicles near field area.
Figure 2. Fusion method of different information levels: (a) data-level fusion; (b) decision-level fusion; (c) feature-level fusion.
Figure 5. CenterNet network architecture.
Figure 6. Spatial attention module.
Figure 7. Channel attention module.
Figure 8. Point cloud attention module.
Figure 9. Channel attention module.
Figure 10. Dynamic linkage mechanism.
Figure 11. The frustum rotates in the direction.
Figure 12. The head network model.
Figure 13. Detection results of targets in the near field area: (a) multi objective situation; (b) small target situation.
Figure 14. Detection results of targets in general scenes: (a) multi objective situation; (b) multi scale target situation.
Figure 15. Histogram of overall performance ablation experiment on the nuScenes test set.
Figure 16. Histogram of class-based ablation experiment on the nuScenes test set.
Table 1. Meaning of each average error index.

| Average Error | Abbreviation | Representative Meaning |
|---|---|---|
| Mean average translation error | mATE | Two-dimensional Euclidean center distance in meters |
| Mean average scale error | mASE | 1 - IoU, where IoU is the 3D intersection-over-union after angular alignment |
| Mean average angle error | mAOE | Minimum yaw angle difference between predicted and true values |
| Mean average velocity error | mAVE | L2 norm of the 2D velocity difference (m/s) |
| Mean average attribute error | mAAE | 1 - acc, where acc is the category classification accuracy |
Table 2. Performance comparison for 3D object detection on the nuScenes dataset. "C", "R", and "L" specify camera, mmWave radar, and LiDAR modalities, respectively.

| Method | C | R | L | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|---|---|---|
| MonoDIS | ✓ | | | 0.384 | 0.304 | 0.738 | 0.263 | 0.546 | 1.533 | 0.134 |
| InfoFocus | | | ✓ | 0.395 | 0.395 | 0.363 | 0.265 | 1.132 | 1.000 | 0.395 |
| CenterNet | ✓ | | | 0.328 | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 |
| RL-PointPillars | | ✓ | ✓ | 0.312 | 0.286 | 0.820 | 0.360 | 0.850 | 1.730 | 0.480 |
| F-PointNet | ✓ | | ✓ | 0.449 | 0.326 | 0.631 | 0.261 | 0.516 | 0.614 | 0.115 |
| CenterFusion | ✓ | ✓ | | 0.453 | 0.332 | 0.649 | 0.263 | 0.535 | 0.540 | 0.142 |
| ADF-CenterFusion | ✓ | ✓ | ✓ | 0.504 | 0.441 | 0.388 | 0.249 | 0.384 | 0.808 | 0.102 |
Table 3. Per-class performance comparison (mAP) for 3D object detection on the nuScenes dataset.

| Method | C | R | L | Car | Bus | Truck | Pedestrian | Motor | Barrier | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| MonoDIS | ✓ | | | 0.478 | 0.188 | 0.220 | 0.370 | 0.290 | 0.487 | 0.339 |
| InfoFocus | | | ✓ | 0.461 | 0.385 | 0.218 | 0.405 | 0.114 | 0.159 | 0.290 |
| CenterNet | ✓ | | | 0.484 | 0.340 | 0.231 | 0.377 | 0.249 | 0.550 | 0.372 |
| RL-PointPillars | | ✓ | ✓ | 0.492 | 0.395 | 0.225 | 0.406 | 0.103 | 0.169 | 0.298 |
| F-PointNet | ✓ | | ✓ | 0.524 | 0.362 | 0.265 | 0.389 | 0.305 | 0.563 | 0.401 |
| CenterFusion | ✓ | ✓ | | 0.509 | 0.234 | 0.258 | 0.370 | 0.314 | 0.575 | 0.376 |
| ADF-CenterFusion | ✓ | ✓ | ✓ | 0.760 | 0.541 | 0.490 | 0.693 | 0.460 | 0.674 | 0.603 |
Table 4. Overall performance ablation experiment results on the nuScenes test set.

| Method | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| Baseline | 0.453 | 0.332 | 0.649 | 0.263 | 0.535 | 0.540 | 0.142 |
| F | 0.478 | 0.339 | 0.636 | 0.214 | 0.506 | 0.599 | 0.307 |
| AF | 0.479 | 0.412 | 0.572 | 0.245 | 0.394 | 0.845 | 0.119 |
| DF | 0.503 | 0.368 | 0.452 | 0.218 | 0.496 | 0.562 | 0.290 |
| ADF | 0.504 | 0.441 | 0.388 | 0.249 | 0.384 | 0.808 | 0.102 |
Table 5. Class-based ablation study results on the nuScenes test set.

| Method | Car | Bus | Truck | Pedestrian | Motor | Barrier | Average |
|---|---|---|---|---|---|---|---|
| Baseline | 0.509 | 0.234 | 0.258 | 0.370 | 0.314 | 0.575 | 0.376 |
| F | 0.541 | 0.543 | 0.448 | 0.467 | 0.317 | 0.644 | 0.493 |
| AF | 0.684 | 0.549 | 0.485 | 0.597 | 0.415 | 0.657 | 0.564 |
| DF | 0.617 | 0.541 | 0.454 | 0.563 | 0.362 | 0.661 | 0.533 |
| ADF | 0.760 | 0.541 | 0.490 | 0.693 | 0.460 | 0.674 | 0.603 |