1. Introduction
Traffic congestion is a common occurrence in cities. It is related both to urban road design and to human driving. Drivers on the road rely entirely on their driving experience, and driving for long periods causes visual fatigue and, in turn, traffic accidents. It is therefore of great significance to study a method that assists or even replaces the human eye and reliably performs automatic vehicle recognition and detection.
Vehicle detection and recognition is a popular research direction in computer vision with broad application prospects in automatic driving. However, road images acquired in real time by onboard cameras are affected by the camera angle and the distance between vehicles, leading to problems such as occlusion, blur, dim lighting, and small target size, so the recognition rate is low. To improve the recognition rate, Zha et al. [1] studied image information of vehicles in parking lots, trained a classifier on manually extracted features, and matched the vehicle features of interest to obtain better recognition results. Amit et al. [2] proposed a strong classifier based on a machine learning algorithm that uses more features to form a better decision boundary and fewer features to exclude a large number of negative samples; it trains a discriminative weak classifier with Haar features and a generative weak classifier with HOG features. With the AdaBoost algorithm as a bridge, the recognition rate reached 95.7%. However, the above methods are machine learning algorithms that require a high level of manual processing, such as manual feature extraction and classifier design, and their maximum processing speed does not exceed 40 FPS, which is too low for urban roads with faster driving speeds.
With the development of deep learning, the way image features are extracted has changed. Unlike traditional manual feature extraction, deep-learning-based methods extract features and learn autonomously [3,4]. Based on the idea of deep learning, Cao Shiyu et al. [5] used a selective search algorithm to obtain candidate regions of the sample as input for training the network. Convolutional features successfully replaced traditional manual features and achieved good results for the problem that traditional vehicle target detection must manually select appropriate features for different scenes. I.O.D. Oliveira et al. [6,7] used a two-stream convolutional neural network to train on the high-resolution Vehicle-Rear dataset, solving the problem of identifying vehicles across non-overlapping cameras. All the above studies show that convolutional feature extraction completely avoids traditional manual feature extraction. Object detection algorithms in deep learning have gradually divided into two categories. One is the two-stage detection algorithm, which selects region proposals with a selective search method. The basic idea of this kind of algorithm follows the R-CNN algorithm [8,9,10]. First, selective search selects about 2000 region proposals per image. Since the network structure can only accept proposals of the same size, each proposal is warped to 227 × 227. A CNN extracts features from each proposal, and these features are used to train an SVM classifier to obtain the corresponding category score [11,12,13,14,15]. NMS (non-maximum suppression) then removes redundant candidate boxes, and bounding box regression fine-tunes the remaining boxes; the detection task of the two-stage algorithm is then complete. Two-stage detectors are mainly characterized by high detection accuracy, but their running speed rarely exceeds 30 FPS, and they consume a lot of storage space [16,17,18,19,20,21]. The representative two-stage algorithms are R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN [22,23,24,25]. Noting that the object detectors of the time performed well in general object detection but only moderately in vehicle detection, Quanfu Fan et al. [15] presented a vehicle detection technique using Faster R-CNN at the 2016 IEEE Intelligent Vehicles Symposium, which gave a research direction for vehicle detection. Li Wang et al. [26] improved Faster R-CNN, raising recognition accuracy by 9.5% over the original network with a detection speed of 9–13 FPS. The other category is end-to-end detection. Its core idea is to transform object classification into a regression problem over the detection box and bounding box. The algorithm does not need to select candidate boxes, and the result is obtained by direct detection, omitting the complex candidate-box selection and calculation steps of the two-stage algorithm. The running speed of such algorithms is therefore far faster than that of two-stage detectors [27,28,29]. The representative algorithms are the YOLO algorithm and the SSD algorithm. Because it is difficult to improve the speed of two-stage vehicle detection, researchers have turned their attention to single-stage detection algorithms. Jun Sang et al. [30] proposed the improved YOLOv2_V algorithm, selecting appropriate anchor boxes for the collected dataset with the K-means++ algorithm; its accuracy reached 94.78%. Fukai Zhang et al. [31] proposed DP-SSD by improving the SSD algorithm, reaching 75.43% accuracy at a running speed of 50.47 FPS. In 2016, Joseph Redmon of the University of Washington proposed the YOLOv1 algorithm [32], and the SSD algorithm was proposed by Wei Liu at ECCV 2016 [33,34]. SSD's basic framework improves on the YOLO algorithm and draws on the anchor mechanism of Faster R-CNN: prior boxes are generated on the feature map for prediction and, similar to YOLOv3, anchors are generated on feature maps at multiple different scales [35]. At present, the YOLO algorithm is widely used in industrial production and has become the mainstream target detection algorithm; therefore, the framework used in this paper is the YOLO algorithm. The best-performing version of this algorithm at present is YOLOv7, proposed by Wang et al. in 2022 [36]. However, there is still room for improvement in its detection accuracy.
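The NMS step mentioned above is shared by both detector families and is simple to state: keep the highest-scoring box, discard every remaining box that overlaps it beyond a threshold, and repeat. A minimal pure-Python sketch (box format `(x1, y1, x2, y2)` and the 0.5 threshold are illustrative assumptions, not the implementation of any cited detector):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two overlapping detections of the same vehicle plus one distant detection:
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (200, 200, 240, 240)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the redundant second box is suppressed
```

Production detectors run the same greedy loop, only vectorized over tensors rather than Python lists.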
This paper proposes improvements along three directions. The first is the network model architecture. Most of the literature on high-speed architecture design mainly considers the model's parameter count, computation amount, and computation density. Ma et al. [37] additionally analyzed, from the perspective of memory access cost, the influence of the input–output channel ratio, the number of architecture branches, and element-wise operations on network inference speed. Dollar P et al. [38] gave extra consideration to activations when scaling the model, that is, to the number of elements in the output tensors of the convolution layers. These gradient analysis methods enable faster and more accurate inference, and the design of the ELAN network leads to the conclusion that a deeper network can learn and converge effectively by controlling the shortest and longest gradient paths. Based on this conclusion, the Res3Unit module is proposed in this paper.
The second is the addition of attention mechanisms. Traditional attention mechanisms, such as CBAM and SENet, are usually used to enhance convolutional modules [39,40]. Recently, self-attention modules such as SAN and BoTNet have been proposed to replace traditional convolutions. However, the relationship between self-attention and convolution had not been uncovered and exploited. Xuran Pan et al. [41] found that the two modules depend heavily on the same 1 × 1 convolution operations, so they proposed the ACmix module, which reuses and shares the features obtained by the two modules and aggregates the intermediate features. Compared with pure convolution or pure self-attention modules, this module has less overhead. Hence, this paper adds ACmix to the network structure.
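The 1 × 1 convolution that ACmix identifies as the shared stage is just a per-pixel linear projection of the channel vector. The toy pure-Python sketch below (illustrative shapes and weights; not the ACmix implementation) shows why computing that projection once lets both branches reuse it:

```python
def conv1x1(feat, weight):
    """1x1 convolution: project each pixel's channel vector with `weight`.
    feat: C_in x H x W nested lists; weight: C_out x C_in matrix."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][y][x] for c in range(c_in))
              for x in range(w)] for y in range(h)]
            for o in range(len(weight))]

# Toy 2-channel 2x2 feature map and a projection to 3 channels.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

shared = conv1x1(feat, weight)  # the costly projection is computed once...
conv_branch = shared            # ...then fed to the convolution path
attn_branch = shared            # ...and to the self-attention path
print(shared[2][0][0])  # → 6.0 (channel 0 + channel 1 at pixel (0, 0))
```

In ACmix, each branch then applies its own lightweight aggregation on top of these shared projections, which is where the overhead saving comes from.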
Finally, to enhance the network's receptive field for distant small targets, the RFLAGauss module is introduced. An earlier approach to enlarging the receptive field for target objects is the RFBNet model proposed by Songtao Liu et al. [42], first published at ECCV 2018, which mainly adds dilated convolutional layers on the basis of Inception to effectively increase the receptive field. In this paper, we introduce a newer module, RFLAGauss, proposed by Chang Xu et al. [43] in 2022, which targets the characteristics of small objects in perspective views: they occupy few pixels in the whole image, and the features that can be collected are limited. The ground-truth box overlaps with almost no anchor boxes (that is, IoU = 0) and contains no anchor points, resulting in a lack of positive samples for small objects. RFLA introduces new prior knowledge based on a Gaussian distribution and establishes a label assignment strategy based on the Gaussian receptive field, which solves the problem of small object recognition. Therefore, this paper introduces this structure to improve the model's detection of small targets.
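The label-assignment failure described above is easy to reproduce: a tiny ground-truth box can have IoU = 0 with its nearest anchor, whereas modeling both boxes as 2D Gaussians still yields a graded similarity. The sketch below uses a simplified Wasserstein-style distance between diagonal Gaussians as an illustration in the spirit of RFLA, not the paper's exact receptive-field measure (the scaling constant is arbitrary):

```python
import math

def iou(a, b):
    """IoU of boxes (x1, y1, x2, y2); 0.0 when they do not overlap."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return iw * ih / (area(a) + area(b) - iw * ih)

def as_gaussian(box):
    """Model a box as a 2D Gaussian: mean = center, std = half extent."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, (x2 - x1) / 2, (y2 - y1) / 2)

def gaussian_similarity(a, b):
    """exp(-W2) for diagonal Gaussians: nonzero even for disjoint boxes."""
    (mx1, my1, sx1, sy1) = as_gaussian(a)
    (mx2, my2, sx2, sy2) = as_gaussian(b)
    w2 = math.sqrt((mx1 - mx2) ** 2 + (my1 - my2) ** 2
                   + (sx1 - sx2) ** 2 + (sy1 - sy2) ** 2)
    return math.exp(-w2 / 8.0)  # 8.0: arbitrary demo scaling constant

tiny_gt = (100, 100, 104, 104)  # a 4 x 4-pixel distant vehicle
anchor = (104, 104, 136, 136)   # nearest 32 x 32 anchor box
print(iou(tiny_gt, anchor))                      # → 0.0: no positive sample
print(gaussian_similarity(tiny_gt, anchor) > 0)  # → True: graded signal remains
```

Because the Gaussian measure decays smoothly with distance and scale mismatch instead of clipping to zero, small objects can still rank anchors and receive positive samples.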
In this paper, the YOLOv7 algorithm is improved and the YOLOv7-RAR algorithm is proposed. To address the insufficient fusion of nonlinear features in the network, the Res3Unit structure is proposed to reconstruct the backbone of YOLOv7. To address the network's weakness in vehicle target localization, the plug-and-play ACmix module is added between the backbone and the detection head. The RFLAGauss module is used to counter the shrinking of the original network's receptive field as the model deepens, and four groups of ablation experiments are conducted to compare the performance of the improved model.
5. Summary and Conclusions
In this paper, an accurate, real-time detection algorithm, YOLOv7-RAR, is proposed, and four groups of ablation experiments were conducted. The experiments proved that YOLOv7-RAR achieves vehicle detection with high accuracy and speed.
Through four groups of ablation experiments, this paper draws the following conclusions:
A structure with multiple fusion branches reduces the reuse of repeated features in the network and fuses the features collected by the upper layers in a more fine-grained way.
Separating the attention-mechanism module from the convolution module extracts image features as fully as possible, while aggregating the two shares the collected feature information to the greatest extent.
Enhancing the receptive field for small distant targets reduces the model's miss rate for vehicles in the distant view.
By combining the three improvement mechanisms, the final accuracy of the model is 2.4% higher than that of the original model, and the AP50:90 performance is improved by 12.6% compared with the original algorithm.
The image data collected by the camera has a key impact on the model's predictions: where lighting is poor, the model performs worse. Solving this input acquisition problem would make the application more effective.
In the future, we hope to introduce faster and more accurate vehicle recognition algorithms, use larger datasets, and enable the algorithm to recognize more types of vehicles, in the hope of contributing to the field of vehicle detection. In addition, the transformer mechanism has ushered in another wave of research and has potential research value in vehicle detection applications.