1. Introduction
In recent years, with the development of image processing technology [1] and deep learning [2], target detection based on computer vision [3] has improved dramatically in both accuracy and speed. However, in the field of autonomous driving [4], relying only on two-dimensional information in the image plane cannot effectively recover the position and structure of a target in three-dimensional space. Therefore, 3D target detection is crucial for the perception and autonomy systems of self-driving cars [5].
Three-dimensional target detection has gradually become a research hotspot in the field of target detection, and many methods have emerged. Many mature algorithms rely on point cloud information [6] or on the fusion of point cloud and image information, and most existing methods depend heavily on LiDAR data for accurate depth information [7]. However, the LiDAR equipment used to generate point clouds is usually expensive and costly to maintain [8]. Compared with other sensors, using images from monocular cameras for 3D detection is an economical and convenient solution. Unlike LiDAR, a monocular camera does not itself sense spatial depth and cannot directly provide the position or structural contours of a target in 3D space [9], so relying on a monocular camera alone is more challenging. Since no depth information is available in monocular scenarios, most algorithms first acquire 2D information about the target in image space and then use neural networks, geometric constraints, or 3D model matching to predict the 3D bounding box of the target [10].
However, in autonomous driving scenarios [11], changing road conditions and complex traffic mean that the targets in the images or video ahead of the vehicle may occlude one another and vary in scale, which poses a significant challenge to 3D target detection algorithms that rely only on monocular RGB images. At the same time, an autonomous driving system needs to detect surrounding targets in real time in order to judge and warn of dangerous situations promptly. The algorithm must therefore ensure accuracy while respecting the limited computing power of the equipment on which it runs, as well as its real-time requirements.
In recent years, many monocular 3D target detection algorithms have been proposed by experts and scholars at home and abroad, which sufficiently validates the research prospects of this field. Some of these algorithms require only images as input, while others must be coupled with additional labeled data or rely on the outputs of other, independent network models.
Methods based on additional auxiliary data are common. Mono3D [12], proposed by Chen et al. in 2016, applies image segmentation to 3D prediction: it constructs a 3D candidate box generation network, assumes that the target to be detected always lies on the ground, and then uses segmentation results, object contours, position priors, and other intuitive features to score the 3D candidate boxes projected onto the image plane, ultimately obtaining high-quality detections. As an improvement, the authors of Mono3D proposed 3DOP in 2017 [13], adding 3D point cloud features, estimated from stereo camera pairs, to score the candidate boxes. Both algorithms are inefficient because of the many candidate region search operations required in 3D space. In addition, some methods use CAD models or object shape information as auxiliary data; for example, DeepMANTA [14], proposed by Chabot et al. in 2017, uses a coarse-to-fine process to generate accurate 2D object proposals, which are then matched against 3D CAD models from an externally labeled dataset. In 2019, He et al. proposed Mono3D++ [15], which uses a deformable wireframe model to estimate a vehicle's 3D shape and pose and optimizes a projection consistency loss between the generated 3D hypotheses and the corresponding 2D pseudo-measurements. Approaches that depend on such additional data are difficult to apply in practice, because the extra data required for network modeling leads to insufficient real-time performance.
Methods based on depth information or pseudo-LiDAR point clouds improve 3D detection accuracy with the help of depth maps or point clouds automatically generated from images. For example, Pseudo-LiDAR [16], proposed in 2019, combines pre-computed depth maps with the original RGB images to predict a 3D point cloud and then employs a point cloud algorithm for subsequent inference. MonoPSR [17] builds on bounding-box proposal and shape reconstruction ideas: using the fundamental relationships of the pinhole camera model and a mature 2D target detector, it generates a 3D proposal box for each target in the scene; these proposed 3D positions prove to be very accurate, reducing the difficulty of regressing the final 3D bounding box. Meanwhile, MonoPSR predicts the point cloud in a target-centered coordinate system and enhances 3D localization accuracy by learning local size and shape information. The multi-task network AM3D [18] converts the 2D image into a 3D point cloud with the help of a depth map and then uses PointNet [19] to estimate 3D dimensions, positions, and orientations. In 2020, ForeSeE [20], proposed by Wang et al., first separates the foreground and background of an image, estimates their depths separately with monocular depth estimation, and finally uses the foreground depth estimate for feature enhancement. These methods require depth prediction or pseudo-LiDAR generation before 3D target detection, which increases the overall computational overhead of the model; although high accuracy can be achieved, real-time performance cannot be guaranteed.
Methods relying solely on RGB images are more streamlined than those that introduce additional data or network models. Mousavian et al. proposed Deep3DBox in 2017 [21], which estimates the local orientation of each object with a bin-based discretization method and exploits the constraint relationships between the 2D and 3D bounding boxes to recover the complete 3D pose. Deep3DBox attaches a pose regression network built from fully connected layers to an arbitrary 2D target detector and completes the spatial coordinate prediction with a 3D centroid solver module. The effectiveness of Deep3DBox depends heavily on the performance of the underlying 2D detector. Shift R-CNN [22], proposed by Naiden et al., and FQNet [23], proposed by Liu et al., add fine-tuning of the first-stage pose regression results to Deep3DBox, increasing the accuracy of 3D pose prediction. MonoDIS [24], proposed by Simonelli et al. in 2019, employs a two-stage architecture for monocular 3D detection. Because its training process regresses target center, size, and orientation simultaneously, the losses of the different parts have different magnitudes; a decoupled regression loss is therefore designed that divides the regression into several groups, where each group learns only its own parameters while the other parts are replaced by labels, making training more stable. MonoGRNet [25], proposed by Qin et al., subdivides 3D object localization into four tasks—2D detection, instance depth estimation, 3D object position estimation, and local corner estimation—and then stacks these components to refine the 3D bounding box using global context. MonoGRNet must train the network module by module, first in stages and then end-to-end, which makes training time-consuming and requires careful control of the subtasks. GS3D [26], proposed by Li et al., starts from reliable 2D detections and first predicts a rough 3D bounding box, which provides a reliable approximation of the target's position, size, and orientation and serves as guidance for refinement. Since the visible surfaces of an object in an image convey information about its underlying 3D structure, GS3D projects the rough 3D box onto the image plane, extracts the visible surface features, and fuses them with the 2D features to refine the rough 3D bounding box into a fine one. M3D-RPN [27], proposed by Brazil et al., uses a standalone 3D region proposal network that predicts multi-class 3D bounding boxes from the geometric relationship between the 2D and 3D perspective views using global convolution and local depth-aware convolution. To reduce tedious 3D parameter estimation, the method further designs a depth-aware convolutional layer that extracts location-specific features, improving the algorithm's understanding of the 3D scene; however, M3D-RPN relies on a large backbone network to improve results, leading to many parameters and high computational complexity. Recently, related work has introduced key point detection into monocular 3D target detection, which usually yields a more streamlined model structure and faster inference. In 2020, Liu et al. proposed SMOKE [28], an end-to-end network that predicts the projection of the center of the target's 3D bounding box onto the image plane, then predicts the target's local orientation and dimensions, and finally recovers the 3D bounding box from the camera parameters. Since SMOKE detects only a single key point, its error is significant. These methods accomplish 3D target detection from RGB images alone, which gives them broader research significance and application value than models that introduce auxiliary data or additional prediction tasks.
However, methods relying solely on RGB images still have room for improvement in detection accuracy under multiple scales and mutual occlusion.
To address the shortcomings of the above networks, this paper proposes the IAE-KM3D network, based on the KM3D network, with the following main contributions:
The ResNet-V2 network is introduced, and the residual module is redesigned so that the new residual module is easier to train and generalizes better.
IBN-Net is introduced, carefully integrating Instance Normalization and Batch Normalization as building blocks to improve performance without increasing computational cost.
SimAM's parameter-free attention mechanism is introduced, allowing the network to focus on more key features without increasing network complexity.
The Gaussian kernel is improved so that the heat map generated by KM3D changes from a fixed circle to an ellipse that varies with the width and height of the 3D target, enhancing the algorithm's ability to detect 3D targets (see the sketch after this list).
A key point loss function based on the predicted values of key points is proposed to increase the weight of complex (hard) samples during training.
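To make the fourth contribution concrete, the following is a minimal, illustrative sketch of how an elliptical Gaussian heat map target can be generated, in the style of CenterNet-like keypoint supervision; the function name and the sigma scaling factor k are our illustrative assumptions, not the exact formulation used by IAE-KM3D.

```python
import numpy as np

def draw_elliptical_gaussian(heatmap, center, box_w, box_h, k=6.0):
    """Splat an elliptical Gaussian peak onto `heatmap` (H x W).

    Unlike a fixed circular kernel, the spread along x and y follows the
    projected width and height of the target, so wide or tall objects get
    correspondingly shaped supervision. `k` (sigma = size / k) is an
    illustrative choice, not the paper's exact value.
    """
    cx, cy = int(center[0]), int(center[1])
    sigma_x = max(box_w / k, 1e-3)
    sigma_y = max(box_h / k, 1e-3)
    h, w = heatmap.shape
    y, x = np.ogrid[:h, :w]
    g = np.exp(-((x - cx) ** 2 / (2 * sigma_x ** 2) +
                 (y - cy) ** 2 / (2 * sigma_y ** 2)))
    np.maximum(heatmap, g, out=heatmap)  # keep the strongest peak on overlap
    return heatmap

# Example: a 40 x 20 pixel target centered at (160, 48) on a 96 x 320 map
hm = np.zeros((96, 320), dtype=np.float32)
draw_elliptical_gaussian(hm, center=(160, 48), box_w=40, box_h=20)
```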
This paper is organized into four main sections. Section 1 describes the key issues and challenges of current 3D target detection methods and outlines the approaches taken to address them. Section 2 presents the components of the KM3D network and of the improved IAE-KM3D network and describes each improved module of IAE-KM3D in detail. Section 3 analyzes the experiments: each improved module is subjected to ablation experiments, and the results are evaluated and compared with current mainstream algorithms. Section 4 summarizes the paper, discusses its societal impact, and looks forward to future research directions.
3. Experimental Results and Analysis
3.1. Experimental Environment
The experiments in this paper are carried out under Ubuntu 20.04, PyTorch 1.7.0, and Python 3.6.0; the hardware configuration and parameter settings of the platform are shown in Table 5.
Since the KM3D model is deep and has a large number of parameters, a relatively powerful GPU is used. The optimal weight file of the model is obtained after 200 training iterations.
3.2. The KITTI Dataset
KITTI [34] is currently the most widely used dataset for autonomous driving scenarios and is used to evaluate the performance of a wide range of vision technologies in an in-vehicle environment. Two high-resolution color cameras and two grayscale cameras are mounted on a standard station wagon, together with a Velodyne laser scanner and a GPS positioning system that provide accurate ground truth. Recordings are made on the road in multiple scenarios to obtain high-quality front-view images and additional 3D annotations. In this paper, all experiments are evaluated on the KITTI 3D detection benchmark, which contains 7481 images with known labels and 7518 images with unknown labels, providing 2D and 3D annotations for targets such as vehicles, pedestrians, and cyclists—80,256 labeled objects in total, with at most 15 cars and 30 pedestrians per image. The labels contain the target type, 2D bounding box, 3D scale, 3D coordinates in the camera coordinate system, the orientation angle of the target, and the parameters of the capturing camera, among other information.
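For reference, each line of a KITTI label file follows the devkit's fixed 15-field layout; a minimal parsing sketch (the `KittiLabel` container is our own naming) is shown below.

```python
from dataclasses import dataclass

@dataclass
class KittiLabel:
    type: str          # e.g. 'Car', 'Pedestrian', 'Cyclist'
    truncated: float   # 0 (fully visible) .. 1 (fully truncated)
    occluded: int      # 0 fully visible .. 3 unknown
    alpha: float       # observation angle, [-pi, pi]
    bbox: tuple        # 2D box (x1, y1, x2, y2) in pixels
    dimensions: tuple  # 3D size (h, w, l) in meters
    location: tuple    # 3D position (x, y, z) in camera coordinates
    rotation_y: float  # yaw around the camera Y axis, [-pi, pi]

def parse_kitti_label(line: str) -> KittiLabel:
    """Parse one line of a KITTI object label file."""
    f = line.split()
    return KittiLabel(
        type=f[0], truncated=float(f[1]), occluded=int(f[2]),
        alpha=float(f[3]), bbox=tuple(map(float, f[4:8])),
        dimensions=tuple(map(float, f[8:11])),
        location=tuple(map(float, f[11:14])), rotation_y=float(f[14]))
```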
3.3. Evaluation Metrics
Following the standard evaluation methods in 3D target detection, this paper uses the AP3D and APBEV metrics to evaluate and compare the algorithms on the KITTI dataset. The calculation of AP3D and APBEV is described below.
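For completeness, the AP itself follows KITTI's interpolated average precision. In its original 11-point form (the benchmark later moved to 40 recall points), it is

\[ \mathrm{AP} = \frac{1}{11} \sum_{r \in \{0,\,0.1,\,\ldots,\,1\}} \max_{\tilde{r} \geq r} p(\tilde{r}), \]

where \(p(\tilde{r})\) denotes the precision at recall \(\tilde{r}\).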
To calculate the AP score, it is first necessary to determine which detection results count as correct predictions. The Intersection over Union (IoU) measures the overlap between the predicted candidate box and the ground truth box; when the overlap ratio is greater than a set threshold, the candidate box is recognized as a correct prediction. For the APBEV used in this paper, the 3D detection box of the target is mapped to the bird's-eye view, where the target is represented by a rotated 3D bounding box. The IoU is then defined as the intersection area of the two rotated rectangles divided by the area of their union, computed as shown in Equation (9):

\[ \mathrm{IoU_{BEV}} = \frac{\mathrm{area}(B_A \cap B_B)}{\mathrm{area}(B_A \cup B_B)} \tag{9} \]
The other metric, AP3D, measures the prediction accuracy of the 3D bounding box of the target. Since targets in autonomous driving scenarios usually rest on the ground, the 3D bounding box can be represented by coordinates, angle, length, width, and height. Introducing the overlap along the height direction extends the BEV IoU straightforwardly to 3D, as computed in Equation (10):

\[ \mathrm{IoU_{3D}} = \frac{V(B_A \cap B_B)}{V(B_A) + V(B_B) - V(B_A \cap B_B)} \tag{10} \]
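As a concrete illustration of these two definitions, the sketch below computes the BEV IoU of two rotated ground-plane rectangles and extends it to 3D via the height-direction overlap. It assumes upright (gravity-aligned) boxes as in KITTI; the use of the shapely library for polygon intersection is our choice for illustration, not the paper's implementation.

```python
from shapely.geometry import Polygon

def bev_iou(corners_a, corners_b):
    """BEV IoU (cf. Equation (9)): intersection area of two rotated
    rectangles divided by the area of their union. `corners_*` are
    4x2 sequences of ground-plane corners in order around the box."""
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter)

def iou_3d(corners_a, ya, corners_b, yb):
    """3D IoU (cf. Equation (10)): BEV overlap times the overlap along
    the height direction, assuming upright boxes. `ya`, `yb` are
    (y_min, y_max) extents along the vertical axis."""
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    h_overlap = max(0.0, min(ya[1], yb[1]) - max(ya[0], yb[0]))
    inter = pa.intersection(pb).area * h_overlap
    vol_a = pa.area * (ya[1] - ya[0])
    vol_b = pb.area * (yb[1] - yb[0])
    return inter / (vol_a + vol_b - inter)
```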
3.4. Comparison Experiments on Three-Dimensional Object Detection Networks
To select the benchmark model, we train and test the mainstream algorithms on the KITTI dataset. Based on the comparison results shown in Table 6, we choose KM3D as the benchmark model for our experiments.
As can be seen from Table 6, the detection performance of KM3D is better than that of the other detection algorithms; for example, the AP3D and APBEV of the KM3D algorithm are 45% and 62% higher than those of the FQNet algorithm. Although the detection accuracy of the 3DOP algorithm is higher than that of KM3D, its runtime is 106 times that of KM3D. Based on this analysis, we choose KM3D, which is both more accurate and less time-consuming, as the benchmark network in our experiments.
3.5. Ablation Experiment and Analysis
In this paper, ablation experiments are used to verify the effect of each improvement module while keeping the environment and parameters consistent. The ablation experiments are divided into ten groups, with KM3D as the benchmark model; "√" indicates that the corresponding improvement module is adopted. Each improvement module is first applied individually and then combined sequentially in ablation experiments on the KM3D benchmark network. Model 1 is the KM3D benchmark network of this paper, and Model 10 is the improved IAE-KM3D network. The experimental results are shown in Table 7.
Based on Table 7, the following conclusions can be drawn:
Model 1 is the KM3D benchmark network and serves as the comparison baseline for subsequent experiments. Its AP2D is 82.95%, its AP3D is 36.39%, its APBEV is 42.31%, and its inference time is 0.030 s.
Model 2 replaces the residual module with that of ResNet-V2, which is easier to train and generalizes better than Model 1; its accuracy is also improved.
Model 3 adds the IN module to Model 1, which improves performance without increasing computational cost; on this basis, the accuracy improves noticeably.
Model 4 introduces the SimAM mechanism into Model 1, allowing the network to focus on more key features without increasing network complexity; on this basis, the accuracy improves noticeably.
Model 5 improves the circular Gaussian kernel of Model 1 to an elliptical Gaussian kernel, enhancing the algorithm's ability to detect targets; on this basis, the accuracy improves noticeably.
Model 6 adds to Model 1 a key point loss function based on the predicted values of the key points, increasing the training weight of complex samples; on this basis, the accuracy improves noticeably.
Model 7 applies the ResNet-V2 residual module and the IN module to Model 1 in sequence. This model improves on the evaluation indexes of Models 1 and 2, with a further gain in accuracy.
Model 8 applies the ResNet-V2 residual module, the IN module, and the SimAM mechanism to Model 1 in sequence. This model improves on the evaluation indexes of Models 1, 2, and 3, with a further gain in accuracy.
Model 9 adopts, in order, the ResNet-V2 residual module, the IN module, and the SimAM mechanism on top of Model 1, and improves the circular Gaussian kernel to an elliptical Gaussian kernel. This model improves on the evaluation indexes of Models 1, 2, 3, and 4, with a significant gain in accuracy.
Model 10 is the final improved algorithm of this paper: it adopts the ResNet-V2 residual module, adds the IN module, introduces the SimAM mechanism, improves the circular Gaussian kernel to an elliptical Gaussian kernel, and incorporates a key point loss function based on the predicted values of key points.
Compared with Model 1, it improves by 5%, 12.5%, and 8.3% in AP2D, AP3D, and APBEV, respectively. The experiments show that, compared with the original KM3D algorithm, the proposed algorithm improves dramatically in all evaluation indexes, increases detection accuracy, and satisfies the demand for real-time detection.
Relative to Model 2, there is an improvement of 4.3%, 12%, and 7.6% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 2 by adding the IN module, introducing the SimAM mechanism, improving the circular Gaussian kernel to an elliptical Gaussian kernel, and adding a key point loss function based on the predicted values of key points. Although this increases the computational complexity of the model and slightly reduces the detection speed (FPS), the detection accuracy is significantly improved.
Relative to Model 3, there is an improvement of 3.8%, 9.7%, and 5.7% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 3 by using the ResNet-V2 residual module, introducing the SimAM mechanism, improving the circular Gaussian kernel to an elliptical Gaussian kernel, and adding a key point loss function based on the predicted values of key points. The experimental results show that the evaluation indexes of the model are improved, demonstrating the necessity of these improvements.
Relative to Model 4, there is an improvement of 1.2%, 6.7%, and 3.3% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 4 by using the ResNet-V2 residual module, adding the IN module, improving the circular Gaussian kernel to an elliptical Gaussian kernel, and adding a key point loss function based on the predicted values of key points. Experiments show that, while detection accuracy is guaranteed, the model keeps its size and number of parameters small enough to meet real-time detection requirements.
Relative to Model 5, there is an improvement of 3.9%, 9.2%, and 6.4% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 5 by using the ResNet-V2 residual module, adding the IN module, introducing the SimAM mechanism, and adding a key point loss function based on the predicted values of key points. The test results show that the model's evaluation indexes are improved, demonstrating the necessity of these improvements.
Relative to Model 6, there is an improvement of 3.8%, 9.4%, and 5.7% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 6 by using the ResNet-V2 residual module, adding the IN module, introducing the SimAM mechanism, and improving the circular Gaussian kernel to an elliptical Gaussian kernel. The test results show that the evaluation indexes of the model are improved, demonstrating the necessity of these improvements.
Relative to Model 7, there is an improvement of 3.7%, 8.7%, and 5.2% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 7 by introducing the SimAM mechanism, improving the circular Gaussian kernel to an elliptical Gaussian kernel, and adding a key point loss function based on the predicted values of key points. The experimental results again demonstrate the necessity of these improvements.
Relative to Model 8, there is an improvement of 1.6%, 2.4%, and 3.6% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 8 by improving the circular Gaussian kernel to an elliptical Gaussian kernel and adding a key point loss function based on the predicted values of key points. The experimental results show that the evaluation indexes of the model are improved, demonstrating the necessity of these improvements.
Relative to Model 9, there is an improvement of 0.8%, 1.7%, and 1.6% in AP2D, AP3D, and APBEV, respectively. Model 10 builds on Model 9 by adding a key point loss function based on the predicted values of key points. The experimental results again demonstrate the necessity of this improvement.
Overall, the improved algorithm maximizes detection accuracy while maintaining detection speed.
3.6. Comparative Experiments and Analysis of IAE-KM3D and Baseline Model KM3D
The performance metrics of the proposed IAE-KM3D model and of the baseline KM3D model are compared on the KITTI dataset, as shown in Table 8.
As can be seen from Table 8, the IAE-KM3D model achieves better AP2D, AP3D, and APBEV than the KM3D model, with improvements of 5%, 12.5%, and 8.3%, respectively.
Figure 7 shows the decrease in total loss before and after the network improvement, and Figure 8 compares the test results before and after the network improvement.
Observing the decrease in overall loss during model training in Figure 7 shows that IAE-KM3D achieves better overall convergence. As shown by the test results in Figure 8, compared with the baseline KM3D, the redesigned ResNet-V2 residual module makes the network easier to train and more generalizable than before. IBN-Net is then introduced to carefully integrate IN and BN as building blocks, improving performance without increasing computational cost. SimAM's parameter-free attention mechanism allows the network to focus on more key features without increasing network complexity; a sketch of such a module is given below. The Gaussian kernel is improved so that the heat map generated by KM3D changes from a fixed circle to an ellipse that varies with the shape of the 3D target, enhancing the algorithm's ability to detect 3D targets. Finally, a key point loss function based on the predicted values of key points increases the training weight of complex samples, which significantly improves the detection accuracy of the model and effectively reduces the missed detection rate.
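For reference, the following is a minimal sketch of a SimAM-style parameter-free attention module, following the published formulation; the code here is illustrative, not this paper's exact implementation.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: weights each activation by an
    energy-based saliency score; adds no learnable parameters."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        n = h * w - 1
        # squared deviation from the per-channel spatial mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: more distinctive neurons get larger weights
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```

Because the attention weights are computed in closed form from the feature statistics, the module adds no learnable parameters, which is why it leaves the network complexity unchanged.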
3.7. Comparison with Mainstream Experiments
To demonstrate the superiority of the improved IAE-KM3D algorithm, it is compared with other current monocular 3D target detection algorithms under the same environment and parameters. Table 9 shows the results of the comparison tests of Mono3D, 3DOP, GS3D, FQNet, and the improved KM3D (IAE-KM3D).
IAE-KM3D simultaneously outperforms the other related monocular 3D detection algorithms on several metrics, even though those algorithms typically use feature extraction networks with more parameters than ResNet, which further advances 3D target detection in monocular vision.
In terms of overall performance, the proposed algorithm is more advantageous in speed and number of model parameters. Comparing AP3D, APBEV, and time, IAE-KM3D is ahead of the other methods in all indexes, confirming that the proposed IAE-KM3D achieves leading 3D detection performance in monocular scenes.