1. Introduction
The development of intelligent vehicles has the potential to reduce traffic accidents caused by poor driving habits while also alleviating traffic congestion [1]. Accurate perception of the external world is crucial for driving safety and reliability [2,3]. As shown in Figure 1, targets in the near-field area are close to the intelligent vehicle: the perception scale increases and the target outline extends beyond the sensor's field of view, changing from a continuous regular shape to a discontinuous natural shape, which makes detection more difficult. A key task of environmental perception for autonomous driving is therefore the high-precision recognition of near-field targets.
Currently, cameras, LiDARs, and mmWave radars are the most prevalent sensors used for autonomous driving environment perception. Cameras capture 2D visual information about targets, LiDARs obtain 3D shape information, and mmWave radars obtain target location and velocity information [4]. The sensing principle of each sensor inevitably has limitations: cameras cannot measure a target's distance, LiDARs cannot detect a target's color, and mmWave radars cannot determine a target's category [5]. Consequently, it is difficult to achieve adequate results in near-field target detection with a single sensor. When data from multiple sensors of different modalities are integrated, more usable information can be obtained, boosting the precision of near-field target detection. A growing number of researchers are therefore studying multi-sensor information fusion. Multi-sensor information fusion can overcome a single sensor's inability to obtain multiple types of state information about target obstacles and is effective for close-range detection tasks. Achieving high-precision near-field target recognition through multi-sensor information fusion is thus a current research hotspot.
Near-field target detection algorithms based on traditional methods are usually realized with manually extracted target features and various classifiers, such as HOG + SVM [6,7] and SIFT + Adaboost [8,9]. The greatest difficulty in detecting near-field targets is that only local information about the targets can be obtained, so detecting targets from known local information is the most common research approach. Markus Enzweiler et al. proposed a mixture-of-experts architecture for the analysis of local information [10]. The target boundary was first determined using discontinuous depth and motion features, after which the visibility weight of each expert was calculated; based on the weight ratio, the mixture-of-experts classifier focused on the targets' local feature information. In addition, Mathias et al. proposed a novel classifier based on the idea of training a classifier specifically for a certain target [11]. The classifier was separated into two parts: one for global detection over all image data and the other for partial detection of local data only. The concept of spatially biased feature selection was developed to focus the classifier's attention during training on small-area characteristics without affecting global detection. This method separates global and local information, enhancing the accuracy of target detection in the near-field region without degrading global detection. Overall, near-field detection methods based on classic vision approaches have low computational cost and require few training samples, but they demand a great deal of manual design, and their detection performance is poor and imprecise.
Deep learning-based object detection methods can be divided into two categories: anchor-based and anchor-free detection. R-CNN is a common example of the anchor-based strategy [12]. DeepParts, a detector based on R-CNN, was proposed by Tian et al. [13]. A specific local convolutional network classifier was used to detect targets from local input only, without the classifier needing to be predefined. This method can train specific local classifiers according to the input information, is highly robust, and reduced the false detection rate by 11.89% on the Caltech dataset. However, anchor-based detection relies on manual anchor design, suffers from imbalanced positive and negative samples, and has an inefficient training and prediction procedure. Anchor-free methods predict in a point-to-point manner and no longer use the anchor mechanism. Qi et al. enhanced the FCOS architecture by integrating the SE attention block into the backbone network [14,15,16]. The SE block learns different weights for the feature maps, enhances the expression of feature information, and emphasizes the local feature information of near-field targets; this method improved detection accuracy by 15% on the CrowdHuman dataset. In conclusion, adding attention mechanisms based on local information is simple and computationally cheap in deep learning-based detectors, making it well suited to near-field target detection.
Although various effective optimization methods for single-sensor near-field target detection have been developed in recent years, they do not meet the reliability requirements of intelligent driving environment perception systems. The limited perspective of vehicle-mounted sensors means that only local information can be obtained when perceiving targets in the near field, and the observed shape is no longer regular and continuous. Using data from a single sensor makes it difficult for the detector to correctly estimate the size, position, and state of targets. Cameras collect image information in the form of two-dimensional pixels; LiDARs collect point clouds, represented as disordered and unstructured three-dimensional coordinates; and mmWave radars collect point clouds or data blocks. According to the level at which information is fused, multi-sensor information fusion technology is mainly divided into data-level fusion, feature-level fusion, and decision-level fusion [17].
As shown in Figure 2a, data-level fusion combines the information gathered directly by the sensors, making full use of the original data. The literature [18] employed a two-step strategy to combine data from cameras and LiDARs, with fusion at the data level as the first step. Because the camera and LiDAR data have different distributions and features, a cross-view spatial feature fusion technique was devised: auto-calibrated feature projection with interpolation of the corrected spatial offset maps the camera-view features to a dense and smooth BEV feature, minimizing information loss. However, this method is complicated in practice because it requires accurate extrinsic sensor calibration, and precise time synchronization requires compensating for the sensor's own motion. In addition, because so much information is retained, the computational load and hardware requirements increase significantly.

As shown in Figure 2b, decision-level fusion uses a dedicated network to fuse the detection results of each sensor. Compared with data-level fusion, it is more lightweight, modular, and robust. A decision-level fusion mechanism was devised in reference [19]. First, a modified SSD algorithm was employed to detect and recognize visual targets. After preprocessing the mmWave radar data, the radar tracking results were projected onto the visual image using the AEKF approach, and the position of the vehicle ahead was estimated using the IoU method. By merging visual detection information with mmWave radar information, the AEKF algorithm effectively reduces radar missed and false detections while accelerating the recognition of visual targets. Decision-level fusion simplifies algorithm implementation and debugging and decreases sensor dependence. It can effectively fuse information in different data formats, but it discards a substantial amount of critical information and considerably reduces recognition accuracy.

As shown in Figure 2c, feature-level fusion first extracts features from the collected data to construct feature vectors and then fuses the information contained in these feature vectors. Chen et al. introduced the multi-view 3D object detection method (MV3D), a feature-level fusion method [20]. Image and point cloud data are fed to the network concurrently; the sparse 3D point cloud is encoded with a multi-view representation together with its color and spatial distribution information, and the fused features are then used to predict targets and their three-dimensional information. Feature-level fusion is widely used and can ensure the consistency of the fused information; the key is to choose suitable feature vectors for fusion, such as size, scale, and velocity. In addition, the requirements on computer hardware and connection speed are substantially lower.
In conclusion, both single-sensor detection and existing fusion detection systems exhibit deficiencies in near-field detection and cannot gather comprehensive target state information [21]. This paper presents a near-field target detection approach based on multi-sensor information fusion to improve detection capability and adapt to complex environments. The principal contributions of this paper are as follows:
1. In order to address the lack of target information obtained by intelligent vehicles in the near-field area, the F-PointPillars method is proposed to increase the amount of target feature information by integrating the position and velocity information of LiDAR and millimeter-wave radar point clouds.
2. Attention modules are introduced into the image detection network and the point cloud detection network to help the detector eliminate redundant feature information and learn relevant features.
3. We propose a dynamic connection mechanism to establish connections between features of different modalities and abstraction levels, allowing image and point cloud information to be combined efficiently and in a learned manner, thereby enhancing the performance of the detector.
4. This paper proposes a near-field target detection method based on multi-sensor information fusion.
Figure 3 shows the system architecture, which includes the image detection module, the point cloud information processing module, and the information fusion module.
Figure 4 shows the network architecture for detection.
5. Experiments on the challenging nuScenes dataset demonstrate that the proposed method has a better detection effect in the near-field region than existing detection methods.
Figure 3. Multi-sensor information fusion system.
Figure 4. Multi-sensor information fusion network architecture.
2. Method of This Paper
2.1. Image Target Detection Based on CenterNet
In this paper, the CenterNet detection network is used for the initial detection of image information [22]. The DLA network serves as the backbone to extract image features. The first regression head predicts the object center point on the image, producing an accurate 2D bounding box and a tentative 3D bounding box for objects in the near-field region.
CenterNet takes an image $I \in \mathbb{R}^{W \times H \times 3}$ as input. The predicted center keypoint heatmap is shown in Formula (1):

$$\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C} \tag{1}$$

where $W$ and $H$ are the width and height of the image, $R$ is the down-sampling ratio, and $C$ is the number of target categories. If the predicted value $\hat{Y}_{x,y,c} = 1$, a target of category $c$ is centered at coordinates $(x, y)$ on the image; if $\hat{Y}_{x,y,c} = 0$, there is no target of category $c$ at $(x, y)$.
Figure 5 is the schematic diagram of the CenterNet network architecture.
For each ground-truth object of a given category, its true center point is first calculated for training. The center point is given by Formula (2):

$$p = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right) \tag{2}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the corner coordinates of the target's regression box. The down-sampled coordinates are set as $\tilde{p} = \lfloor p / R \rfloor$, where $R$ is the down-sampling factor, so the calculated center point corresponds to a low-resolution center point. A Gaussian kernel is then used to spread the points in the down-sampled image onto the feature map $Y \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ [23]. The Gaussian kernel is defined as Formula (3):

$$Y_{xyc} = \exp\!\left( -\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2} \right) \tag{3}$$

where $\sigma_p$ is a standard deviation related to the target size.
The keypoint heatmap is trained with the focal loss [24], as indicated in Equation (4):

$$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\ \left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise} \end{cases} \tag{4}$$

where $\alpha$ and $\beta$ are the hyperparameters of the focal loss, and $N$ is the number of target center points.
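As an illustration of Formulas (3) and (4), the following is a minimal PyTorch-style sketch of the Gaussian keypoint heatmap construction and the penalty-reduced focal loss. The function names and the choice $\alpha = 2$, $\beta = 4$ are our assumptions, not taken from the paper's implementation.

```python
import torch

def draw_gaussian(heatmap, center, sigma):
    """Spread one down-sampled center point onto a heatmap channel (Formula (3))."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # Keep the element-wise maximum so overlapping objects do not erase each other.
    torch.maximum(heatmap, g, out=heatmap)
    return heatmap

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise focal loss over the predicted heatmap (Formula (4))."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```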
In near-field scenes there are many targets of the same category but with different amounts of observable information. Such same-category targets share similar features, and these class-dependent traits can be used to identify similar targets even when little feature information is available. Therefore, we first construct a spatial attention module to capture the spatial dependence between any two similar features in the feature map. Based on the information of complete targets, it helps detect the class-dependent characteristics of partially observed targets in the near-field area. The spatial attention weight is proportional to the similarity between two features, so similar features mutually reinforce each other. After locating the category features, we construct the channel attention module, which makes the detector focus on the channel feature maps with similar dependencies, highlights the feature representation, and enhances the feature expression ability.
The spatial attention module selectively aggregates features through a weighted sum and improves the expression of similar features. This strategy can improve the feature expression of objects that extend beyond the field of view by building a dependence relationship with the features of other similar objects within the field of view. The spatial attention module is shown in Figure 6. In Figure 6, R stands for reshape, T for transpose, and S for Softmax; ⊗ represents matrix multiplication, and ⊕ represents element-wise addition.
The output of the input feature map after spatial attention can be expressed as:

$$E_j = \alpha \sum_{i=1}^{N} \left( s_{ji} D_i \right) + A_j \tag{5}$$

where $\alpha$ is the scale coefficient, $s_{ji}$ is the influence of the $i$-th feature map on the spatial attention of the $j$-th feature map, $A_j$ is the $j$-th input feature map, $E_j$ is the $j$-th output feature map, and $D_i$ is the feature map reshaped from $D$.
The operation of the spatial attention module is as follows: assuming the input feature map is $A \in \mathbb{R}^{C \times H \times W}$, the feature maps $B$, $C$, and $D$ are obtained by convolution. $B$ is reshaped to $\mathbb{R}^{N \times C}$ (with $N = H \times W$) through matrix transformation and transposition, and $C$ is reshaped to $\mathbb{R}^{C \times N}$. Multiplying matrices $B$ and $C$ and feeding the result into the Softmax activation function yields the spatial attention map $M \in \mathbb{R}^{N \times N}$. After matrix transformation, $M$ is multiplied by $D$ to obtain a $\mathbb{R}^{C \times N}$ matrix, which is multiplied by the scale coefficient $\alpha$ and restored to $\mathbb{R}^{C \times H \times W}$ after matrix transformation; the input $A$ is then added element-wise to yield the output $E$, where $\alpha$ is initialized to 0 and its value is gradually learned.
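The following is a minimal PyTorch sketch of this spatial (position) attention operation; the reduction ratio for the query/key convolutions and all variable names are our assumptions, intended only to mirror the reshape–multiply–Softmax steps described above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position attention: weights each spatial location by its similarity to all others."""
    def __init__(self, in_channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)  # B
        self.key = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)    # C
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)               # D
        self.alpha = nn.Parameter(torch.zeros(1))  # scale coefficient, learned from 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                    # a: (B, C, H, W)
        b_, c, h, w = a.size()
        n = h * w
        q = self.query(a).view(b_, -1, n).permute(0, 2, 1)   # (B, N, C')
        k = self.key(a).view(b_, -1, n)                      # (B, C', N)
        m = self.softmax(torch.bmm(q, k))                    # (B, N, N) spatial map M
        v = self.value(a).view(b_, c, n)                     # (B, C, N)
        out = torch.bmm(v, m.permute(0, 2, 1)).view(b_, c, h, w)
        return self.alpha * out + a                          # element-wise residual add
```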
The channel attention module selectively emphasizes useful features by integrating the relevant features among all channel maps, allowing the detector to focus on key class-dependent features in complex backgrounds containing numerous interfering targets. The channel attention module is shown in Figure 7.
After the input feature map passes through the channel attention module, the output can be expressed as:

$$E_j = \beta \sum_{i=1}^{C} \left( x_{ji} A_i \right) + A_j \tag{6}$$

where $\beta$ is the scale coefficient, $x_{ji}$ is the influence of the $i$-th feature map on the channel attention of the $j$-th feature map, $A_j$ is the $j$-th input feature map, and $E_j$ is the $j$-th output feature map.
The operation of the channel attention module is as follows: the feature map $A$ is reshaped to $\mathbb{R}^{C \times N}$ and, after transposition, to $\mathbb{R}^{N \times C}$. Multiplying the two matrices and applying the Softmax activation yields the channel attention map $X \in \mathbb{R}^{C \times C}$. The attention map $X$ is multiplied by the reshaped $A$ to obtain a $\mathbb{R}^{C \times N}$ matrix, which is multiplied by the scale coefficient $\beta$ and restored to $\mathbb{R}^{C \times H \times W}$ after matrix transformation. Finally, the output $E$ is obtained by element-wise addition of $A$. $\beta$ is initialized to 0, and the corresponding weight is gradually learned.
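A corresponding PyTorch sketch of the channel attention operation is given below; again, the variable names are our assumptions, and only the reshape–multiply–Softmax–rescale steps described above are reproduced.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: reweights each channel map by its similarity to all other channels."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scale coefficient, learned from 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                         # a: (B, C, H, W)
        b_, c, h, w = a.size()
        n = h * w
        a_flat = a.view(b_, c, n)                 # (B, C, N)
        x = self.softmax(torch.bmm(a_flat, a_flat.permute(0, 2, 1)))  # (B, C, C) map X
        out = torch.bmm(x, a_flat).view(b_, c, h, w)
        return self.beta * out + a                # element-wise residual add
```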
After the original image passes through the backbone network, the input feature map is obtained. The feature map is then processed by the two parallel attention modules, which capture similar features in the spatial and channel dimensions. Finally, the outputs of the two attention modules are fused by element-wise addition to improve the feature representation.
2.2. Point Cloud Detection Based on F-PointPillars and Attention Mechanism
2.2.1. LiDAR and mmWave Radar Feature Information Fusion
LiDAR has difficulty measuring target velocity, whereas millimeter-wave radar has disadvantages in measuring object size, so neither sensor alone provides reliable target information in the near-field area. Combining the size and depth information of LiDAR with the depth and velocity information of millimeter-wave radar allows the two sensors to complement each other effectively and enhances near-field detection precision. Both LiDAR and mmWave radar collect disordered, sparse point cloud data. Therefore, we present F-PointPillars, an information fusion approach for LiDAR and mmWave radar based on point cloud pillar encoding. Our method tensorizes the point cloud data for feature extraction, allowing accurate and real-time fusion of the high-dimensional, high-volume point cloud information from LiDAR and mmWave radar.
The point cloud is first rasterized in the $x$–$y$ plane with an evenly spaced grid to form a set of pillars; each pillar samples $N$ points, and each point $Z$ is then encoded as a $D$-dimensional vector, as follows:

$$Z = \left( x, y, z, x_c, y_c, z_c, x_p, y_p \right) \tag{7}$$

where $(x, y, z)$ is the true 3D coordinate of the point, $(x_c, y_c, z_c)$ is the 3D coordinate of the geometric center of the points within the pillar, and $(x_p, y_p)$ is the offset from the point to the pillar center in the $x$ and $y$ directions. This finally forms a $(D, P, N)$-dimensional tensor, where $D$ is the pillar feature dimension, $P$ is the number of non-empty pillars, and $N$ is the number of points in a single pillar.
In this paper, the 3D coordinates $(x, y, z)$ of the original LiDAR point cloud plus 5 additional dimensions $(x_c, y_c, z_c, x_p, y_p)$ make up a total of 8 dimensions: $(x, y, z, x_c, y_c, z_c, x_p, y_p)$. Based on the coordinate relationship between the LiDAR and the mmWave radar, let the LiDAR translation matrix be $T_l$ and its rotation matrix $R_l$, and let the mmWave radar translation matrix be $T_r$ and its rotation matrix $R_r$. The mmWave radar point cloud coordinates are converted to the LiDAR coordinate system by Equation (8), and the converted mmWave radar coordinates are denoted $(x', y', z')$:

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = R \begin{bmatrix} x_r \\ y_r \\ z_r \end{bmatrix} + T \tag{8}$$

where $R = R_l^{-1} R_r$ and $T = R_l^{-1}\left( T_r - T_l \right)$.
Because the Z-coordinate of all mmWave radar points is zero, only the 2D coordinates $(x', y')$ of the original point, the pillar center coordinates $(x_c, y_c)$, the offsets $(x_p, y_p)$ in the $x$ and $y$ directions, and the compensated velocity $(v_x, v_y)$ are retained, which make up a total of 8 dimensions: $(x', y', x_c, y_c, x_p, y_p, v_x, v_y)$. The LiDAR and mmWave radar feature channels are then stacked to obtain a $(2D, P, N)$-dimensional tensor.
The F-PointPillars method accomplishes the tensorization of point cloud data by transforming point clouds into pillars and integrating LiDAR and mmWave radar data to offer sufficient feature information for subsequent detection.
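To make the encoding concrete, the following is a minimal NumPy sketch of how a single pillar's LiDAR and radar points might be decorated and stacked under the assumptions above; the array layouts, padding scheme, and the assumed form of Equation (8) ($R = R_l^{-1} R_r$, $T = R_l^{-1}(T_r - T_l)$) are ours, not the paper's reference implementation.

```python
import numpy as np

def radar_to_lidar(pts_r, R_l, T_l, R_r, T_r):
    """Transform mmWave radar points (K, 3) into the LiDAR frame (assumed form of Eq. (8))."""
    R = np.linalg.inv(R_l) @ R_r
    T = np.linalg.inv(R_l) @ (T_r - T_l)
    return (R @ pts_r.T).T + T

def encode_pillar(lidar_pts, radar_pts, radar_vel, pillar_center_xy, n_points=32):
    """Build the 8-dim LiDAR and 8-dim radar features for one pillar and stack the channels."""
    def pad(a):
        out = np.zeros((n_points, a.shape[1]), dtype=np.float32)
        out[: min(len(a), n_points)] = a[:n_points]
        return out

    # LiDAR features: (x, y, z, x_c, y_c, z_c, x_p, y_p)
    c = lidar_pts.mean(axis=0)                      # geometric center of points in the pillar
    lidar_feat = np.hstack([lidar_pts, np.tile(c, (len(lidar_pts), 1)),
                            lidar_pts[:, :2] - pillar_center_xy])
    # Radar features (z is always 0): (x', y', x_c, y_c, x_p, y_p, v_x, v_y)
    cr = radar_pts[:, :2].mean(axis=0)
    radar_feat = np.hstack([radar_pts[:, :2], np.tile(cr, (len(radar_pts), 1)),
                            radar_pts[:, :2] - pillar_center_xy, radar_vel])
    return np.concatenate([pad(lidar_feat), pad(radar_feat)], axis=1)  # (N, 16)
```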
2.2.2. Attention Mechanism of Point Cloud Module
The point cloud obtained by scanning a target in the near-field region is typically large and dense. However, owing to the characteristics of the scanning beams, the spacing between points grows toward the periphery of the target, producing redundant information that is sparse and ineffective. Moreover, because the acquired target information lacks a complete contour, detecting targets in the point cloud becomes more difficult. To eliminate redundant information in the fused point cloud features and enhance the key point features of targets, this paper employs a multi-attention strategy and designs a point attention module and a channel attention module to further improve the performance of the point cloud detector.
Assuming an input feature map $F$, and referring to the CBAM attention model, the attention on $F$ is divided into point attention $M_p$ and channel attention $M_c$ [25]. The whole attention process is as follows:

$$F' = M_p(F) \otimes F, \qquad F'' = M_c(F') \otimes F'$$

where $F'$ is the data after point attention processing, $F''$ is the data after channel attention processing, and $\otimes$ denotes element-by-element multiplication.
Point Attention Mechanism
The point attention module reweights each downsampled and upsampled point by establishing similarity dependencies between points, and then selects and enhances keypoint features so that they receive sufficient attention even with incomplete contours and in cluttered scenes.
Given a frame of point cloud data, the points are first partitioned. The pillar feature network evenly distributes the points into a grid in the $x$–$y$ plane, and these grid cells form a set of pillars, finally yielding a $(D, P, N)$-dimensional tensor, where $D$ is the point feature dimension, $P$ is the number of non-empty pillars, and $N$ is the number of points within a single pillar.
As shown in Figure 8, the input point set $P$ is composed of $N$ randomly sampled points and is mapped to a feature map $F$ by the backbone network. $F$ contains $N'$ points, and each point consists of $d$-dimensional coordinates and an abstract $C$-dimensional feature vector. First, a global max pooling operation aggregates $F$ over the coordinate and feature dimensions $(d + C)$ to generate the point descriptor $F_{max}$. Then, $F_{max}$ is passed to a multi-layer perceptron to generate the point attention map $M_p$. A two-layer perceptron with nonlinear functions is used to keep the module simple and versatile: the output of the first layer is followed by the ReLU activation to increase the nonlinearity of the model, while the output of the second layer is followed by the Sigmoid function to compute the attention weights. Finally, the point attention map $M_p$ is multiplied element-wise with $F$ as a mask to obtain the optimized output. The working process of the point attention module can be summarized as follows:

$$M_p(F) = \sigma\!\left( W_1\left( \delta\left( W_0\left( F_{max} \right) \right) \right) \right)$$

where the Sigmoid and ReLU activation functions are represented by $\sigma$ and $\delta$, respectively; $W_0$ and $W_1$ represent the weight parameters of the first and second layers of the multi-layer perceptron, respectively; and $F_{max}$ represents the feature map after global max pooling.
Channel Attention Mechanism
The channel attention module functions as a channel feature selector, emphasizing useful information and suppressing interference through the correlations between feature channels, directing the feature extractor to capture more discriminative geometric features and refining the target's bounding box, as shown in Figure 9. Since the feature map of each channel can be regarded as a feature detector, channel attention focuses primarily on locating the significant parts of the incoming data. The spatial dimension of the input is compressed to increase the efficiency of computing channel attention. To summarize the spatial information, channel attention aggregates the spatial information of the feature map using max pooling and average pooling, represented by $F_{max}^{c}$ and $F_{avg}^{c}$, respectively. This information is then passed to a multi-layer perceptron composed of two fully connected layers and processed in the same way as in the point attention mechanism. After this encoding-decoding process, the channel attention $M_c$ is obtained, and an element-by-element addition is used to output the merged feature vector. The working process of the channel attention module can be summarized as follows:

$$M_c(F) = \sigma\!\left( W_1\left( \delta\left( W_0\left( F_{max}^{c} \right) \right) \right) + W_1\left( \delta\left( W_0\left( F_{avg}^{c} \right) \right) \right) \right)$$

where the Sigmoid and ReLU activation functions are represented by $\sigma$ and $\delta$, respectively; $W_0$ and $W_1$ represent the weight parameters of the first and second layers of the multi-layer perceptron, respectively; $F_{max}^{c}$ represents the feature map after global max pooling, and $F_{avg}^{c}$ represents the feature map after global average pooling.
2.3. Feature Information Fusion
To address the difficulty of fusing information of different modalities, namely image color and texture and point cloud geometric features, we propose a dynamic connection learning mechanism that establishes and learns connections between feature representations from different modalities and abstraction levels. The dynamic connection mechanism autonomously learns how to relate the different features of the image and point cloud and fuses information from different learning stages to obtain a unified representation of the image and point cloud.
As shown in Figure 10, each square in the grid represents a PointPillar feature and holds the mean coordinate of the points accumulated into that pillar. Given a set of RGB feature maps $\{F_1, \ldots, F_B\}$, where $B$ is the total number of feature maps in the image detection network, the point cloud feature map is mapped into the image feature space for geometric alignment to obtain a set of feature vectors, expressed as $\{G_1, \ldots, G_B\}$. An initial weight $w$, a $B$-dimensional vector, is set; the Softmax activation function is applied to it, and the final feature vector is computed as $\sum_{b=1}^{B} \mathrm{Softmax}(w)_b\, G_b$. The weight $w$ learns the connection weights that determine which image feature layer is fused with which point cloud feature layer. Here, $w$ is not a simple scalar but the output of a linear layer applied to the PointPillar feature, which generates the weights over the RGB feature maps. Applying the Softmax activation after $w$ permits the network to select the fusion layer dynamically. Since the dynamic connection learning method is applied independently to each pillar, the network can select the fusion mode and location based on the properties of the input.
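The following is a minimal sketch of such a dynamic connection layer under the description above: a linear layer maps each pillar feature to $B$ connection logits, a Softmax turns them into weights, and the aligned image feature vectors are combined as a weighted sum. All names and the exact tensor layout are our assumptions.

```python
import torch
import torch.nn as nn

class DynamicConnection(nn.Module):
    """Per-pillar learned weighting over B geometrically aligned image feature layers."""
    def __init__(self, pillar_dim, num_layers):
        super().__init__()
        self.to_weights = nn.Linear(pillar_dim, num_layers)  # w: logits over the B layers

    def forward(self, pillar_feat, aligned_img_feats):
        # pillar_feat: (P, pillar_dim) pillar features
        # aligned_img_feats: (P, B, C) image features aligned to each pillar
        w = torch.softmax(self.to_weights(pillar_feat), dim=-1)   # (P, B)
        fused = (w.unsqueeze(-1) * aligned_img_feats).sum(dim=1)  # (P, C) weighted sum over B
        return fused
```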
In Figure 10, the grey boxes represent static connection weights, while the green boxes represent the dynamic connections whose values are determined by the attributes of the point cloud. These connections therefore determine how the features generated by the model at various abstraction levels are combined.
The center-point image detection network generates a frustum that bounds the object's 3D search space, and all points within the frustum are encoded by PointPillars. Frustums generated at different image positions are oriented in different directions, which increases the point cloud position error. Therefore, we normalize each frustum by rotating it so that its central axis is perpendicular to the image plane. This normalization improves the algorithm's rotational invariance, as illustrated in Figure 11.
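A small NumPy sketch of this rotation normalization is given below, assuming the frustum's central axis direction is taken from the frustum center point; the angle convention and axis order are our assumptions.

```python
import numpy as np

def normalize_frustum(points, frustum_center):
    """Rotate frustum points about the vertical axis so the frustum's central axis
    aligns with the forward (image-plane-perpendicular) direction."""
    # Heading angle of the frustum center in the horizontal plane.
    angle = np.arctan2(frustum_center[1], frustum_center[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T   # points: (N, 3) -> rotated (N, 3)
```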
Fine regression produces a more precise three-dimensional bounding box from a group of candidate boxes, which is then used to precisely characterize the target's position, size, motion direction, and speed. The head network consists of three 1 × 1 convolution layers and one mask prediction branch. As shown in Figure 12, the prediction branch generates inferences and optimizes the target size according to the target category in order to generate the appropriate output.
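As an illustration only, the sketch below shows one plausible layout of such a head: three 1 × 1 convolutions regressing center offset, size, and direction/velocity, plus a mask prediction branch. The channel counts and output splits are our assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class FusionHead(nn.Module):
    """Three 1x1 convolution regression branches plus a mask prediction branch."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.offset = nn.Conv2d(in_channels, 2, kernel_size=1)   # center offset (dx, dy)
        self.size = nn.Conv2d(in_channels, 3, kernel_size=1)     # box size (w, l, h)
        self.motion = nn.Conv2d(in_channels, 3, kernel_size=1)   # heading + velocity (yaw, vx, vy)
        self.mask = nn.Sequential(                                # mask prediction branch
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return {"offset": self.offset(x), "size": self.size(x),
                "motion": self.motion(x), "mask": self.mask(x)}
```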