Article

VV-YOLO: A Vehicle View Object Detection Model Based on Improved YOLOv4

1 China FAW Corporation Limited, Global R&D Center, Changchun 130013, China
2 School of Vehicle and Energy, Yanshan University, Qinhuangdao 066000, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(7), 3385; https://doi.org/10.3390/s23073385
Submission received: 1 March 2023 / Revised: 13 March 2023 / Accepted: 14 March 2023 / Published: 23 March 2023
(This article belongs to the Special Issue Intelligent Perception for Autonomous Driving in Specific Areas)

Abstract: Vehicle view object detection technology is key to the environment perception modules of autonomous vehicles and is crucial for driving safety. In view of the characteristics of complex vehicle-view scenes, such as dim light, occlusion, and long distance, an improved YOLOv4-based vehicle view object detection model, VV-YOLO, is proposed in this paper. The VV-YOLO model adopts an anchor-box-based implementation. In anchor box clustering, an improved K-means++ algorithm is used to reduce the instability of clustering results caused by the random selection of cluster centers, so that the model obtains reasonable initial anchor boxes. Firstly, the CA-PAN network was designed by adding a coordinate attention mechanism and used as the neck network of the VV-YOLO model; it realizes multidimensional modeling of the channel relationships of image features and improves the extraction of complex image features. Secondly, in order to ensure sufficient model training, the loss function of the VV-YOLO model was reconstructed based on focal loss, which alleviates the training imbalance caused by the unbalanced distribution of training data. Finally, the KITTI dataset was selected as the test set for quantitative experiments. The results show that the precision and average precision of the VV-YOLO model are 90.68% and 80.01%, respectively, which are 6.88% and 3.44% higher than those of the YOLOv4 model, while the model's computation time on the same hardware platform does not increase significantly. In addition to the KITTI dataset, the BDD100K dataset and typical complex traffic scene data collected in the field were used for a visual comparison of detection results, which verified the validity and robustness of the VV-YOLO model.

1. Introduction

As a key technology that can effectively alleviate typical traffic problems and improve traffic safety, an intelligent transportation system has been fully developed around the world [1,2]. The large-scale application of autonomous driving technology has become an inevitable choice for the development of modern transportation [3]. Environmental awareness technology is the key to realizing autonomous driving and the basis for subsequent path planning and decision control of autonomous vehicles. As an important branch of environmental perception technology, object detection from the vehicle perspective is tasked with predicting the position, size, and category of objects in the area of interest in front of the vehicle [4], which directly affects the performance of the environmental perception system of autonomous vehicles.
In terms of sensors used for vehicle-mounted visual angle object detection, visual sensors have become the most used sensors for object detection due to their ability to obtain abundant traffic information, low cost, easy installation, and high stability [5,6,7]. With the continuous development of hardware systems, such as graphics cards and computing units, object detection based on deep learning is the mainstream of current research [8,9]. With its advantages of high robustness and good portability, object detection of four-wheeled vehicles, two-wheeled vehicles, and pedestrians has been realized in many scenes.
In the field of object detection, deep learning-based models can be divided into two categories, two-stage and single-stage, according to their implementation logic. A two-stage object detection model is usually composed of two parts: region-of-interest generation and candidate box regression. The R-CNN series [10,11,12,13], R-FCN [14], SPP [15], and other structures are representatives of the two-stage approach. Two-stage models have made great breakthroughs in precision, but they are difficult to deploy on embedded platforms with limited computing power, such as roadside units and domain controllers, which motivated the birth of single-stage object detection models. A single-stage model treats object detection as a regression problem: an end-to-end network extracts features from the input image directly and outputs the prediction results. Early single-stage models mainly include YOLO [16] and SSD [17]. Such models have great advantages in inference speed, but their detection precision is lower than that of two-stage models. As a result, balancing detection precision and inference speed has become the focus of single-stage object detection research, which has developed rapidly in recent years and produced excellent models such as RetinaNet [18], YOLOv4 [19], CornerNet [20], and YOLOv7 [21].
Table 1 summarizes representative work in the field of vehicle view object detection in recent years. Although these studies solve the problem of object detection in complex vehicle-view scenes to a certain extent, they usually need to introduce additional large modules, such as the GAN [22] network and its variants, or study only a single object class, such as pedestrians or vehicles. However, an autonomous vehicle needs to attend to three object classes—four-wheel vehicles, two-wheel vehicles, and pedestrians—from the onboard perspective at the same time, and the computing power of its platform is limited, so precision and real-time performance cannot both be satisfied by these approaches.
Inspired by the above research results and the remaining problems, this paper proposes a vehicle view object detection model, VV-YOLO, based on improved YOLOv4. The model adopts the end-to-end design idea and optimizes the YOLOv4 baseline model in three aspects: the anchor box clustering algorithm, the loss function, and the neck network. Firstly, an improved K-means++ [28] algorithm is used to achieve more accurate and stable anchor box clustering on the experimental dataset, which is a prerequisite for an anchor-based detection model to perform well. Secondly, the focal loss [18] function is introduced in the training stage to improve the model's feature extraction for objects of interest in complex scenes. Finally, combined with the coordinate attention module [29], the CA-PAN neck network is proposed to model the channel relationships of image features, which greatly improves the model's attention to the regions of interest.

2. Related Works

2.1. Structure of the YOLOv4 Model

In 2020, Alexey Bochkovskiy et al. [19] improved YOLOv3 [30] with a number of clever optimizations and proposed YOLOv4. Figure 1 shows its network structure. The design idea of YOLOv4 is consistent with that of YOLO: it is also a single-stage model and can be divided into three parts: backbone network, neck network, and detection network. The backbone network is called CSPDarkNet53 [19]. Different from the DarkNet53 [30] used in YOLOv3, it uses a cross-stage hierarchical structure for network connections, which reduces the amount of computation while preserving the feature extraction effect. The neck network of YOLOv4 is constructed using the PAN [31] path aggregation network, which improves the fusion of multilevel features compared to the FPN [32] feature pyramid network. In addition, YOLOv4 uses an SPP network in front of the neck network to enrich the receptive field of image features. After the output features of the neck network are obtained, they are decoded by prediction heads at three scales to perceive large-, medium-, and small-scale objects.
YOLOv4 still applies the prior (anchor) box strategy and batch normalization from YOLOv2 [33] to ensure the regularity of the model training parameters. Meanwhile, the Mish [34] activation function was introduced in YOLOv4 to make the gradient descent during training smoother; compared with the ReLU [35] activation function, it reduces the possibility of the loss falling into a local minimum. In addition, YOLOv4 uses Mosaic [19] data augmentation and DropBlock [36] regularization to reduce overfitting.

2.2. Loss Function of the YOLOv4 Model

The loss function of YOLOv4 is composed of regression loss, confidence loss, and classification loss. Different from earlier YOLO models, YOLOv4 uses the CIoU [37] function to construct the bounding box regression (intersection-over-union) loss. It uses the diagonal distance of the minimum enclosing box to formulate a penalty strategy, which further reduces the false detection rate for small-scale objects. The classification loss, however, still adopts the cross-entropy function.
$$
\begin{aligned}
L ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \, (2 - w_i \times h_i)(1 - \mathrm{CIoU}) \\
&- \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\
&- \lambda_{\mathrm{noobj}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{noobj}} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\
&- \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[ \hat{p}_i(c) \log(p_i(c)) + (1 - \hat{p}_i(c)) \log(1 - p_i(c)) \right]
\end{aligned} \tag{1}
$$
In Equation (1), K × K represents the grid size, which can be 19 × 19, 38 × 38 or 76 × 76. M represents the number of predicted boxes per grid cell, whose value is 3. λ_coord represents the positive sample weight coefficient, whose value is generally 1. The values of I_ij^obj and I_ij^noobj are either 0 or 1 and indicate whether the sample is positive or negative. Ĉ_i and C_i represent the ground-truth and predicted confidence values, respectively. The term (2 − w_i × h_i) is used to penalize smaller prediction boxes, where w_i and h_i denote the width and height of the prediction box. The CIoU equation is shown below.
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \beta\nu \tag{2}$$
In Equation (2), ρ²(b, b^gt) represents the squared Euclidean distance between the center points of the prediction box and the ground-truth box, and c represents the diagonal length of the minimum enclosing region that contains both the prediction box and the ground-truth box. β is the trade-off parameter, and ν measures the consistency of the aspect ratio. Their calculation equations are shown in Equations (3) and (4), respectively.
$$\beta = \frac{\nu}{1 - \mathrm{IoU} + \nu} \tag{3}$$
$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{4}$$
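For readers who want to reproduce Equations (2)–(4), the following minimal Python sketch computes CIoU for two axis-aligned boxes given in (x1, y1, x2, y2) form. The function name and the small epsilon terms are our own additions for illustration and are not part of the original formulation.

```python
import math

def ciou(box_pred, box_gt):
    """CIoU between two boxes in (x1, y1, x2, y2) form; a minimal sketch of Equations (2)-(4)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # Plain IoU
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + 1e-9)

    # rho^2: squared distance between the two box centers
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + 1e-9

    # nu: aspect-ratio consistency (Equation (4)); beta: trade-off weight (Equation (3))
    w_p, h_p = px2 - px1, py2 - py1
    w_g, h_g = gx2 - gx1, gy2 - gy1
    nu = (4 / math.pi ** 2) * (math.atan(w_g / (h_g + 1e-9)) - math.atan(w_p / (h_p + 1e-9))) ** 2
    beta = nu / (1 - iou + nu + 1e-9)

    return iou - rho2 / c2 - beta * nu
```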

2.3. Discussion on YOLOv4 Model Detection Performance

As an advanced single-stage object detection model, YOLOv4 has a great advantage over two-stage detectors in detection speed. It can balance precision and speed in conventional scenarios and meet the basic requirements of an automated driving system. Figure 2 shows typical scenes from the vehicle view. As can be seen from the figure, complex situations such as dark light, occlusion, and long distance frequently occur from the vehicle view, and multiple types of traffic objects are often present at the same time. In such scenarios, the YOLOv4 model's ability to learn and extract effective features of the objects is reduced, often resulting in missed and false detections. The problem that urgently needs to be solved is therefore object detection under the unfavorable conditions of the vehicle view. Starting from the model structure and training strategy, this paper uses targeted designs to improve the image feature modeling ability of the YOLOv4 model, improve the learning and extraction of effective features in occlusion, dark light, and other scenes, and proposes the vehicle view object detection model VV-YOLO.

3. Materials and Methods

3.1. Improvements to the Anchor Box Clustering Algorithm

For object detection models based on anchor box regression, the anchor box sizes are usually set by a clustering algorithm; the YOLOv4 model uses the K-means clustering algorithm [38]. First, the original anchor boxes are randomly selected from the ground-truth boxes; then, the anchor box sizes are adjusted by comparing the IoU between each original anchor box and the ground-truth boxes, yielding new anchor box sizes. These steps are repeated until the anchor boxes no longer change. According to the positional relationship between the anchor box and the bounding box in Figure 3, the IoU is calculated as shown in Equation (5).
$$\mathrm{IoU} = \frac{|\,\mathrm{Anchor\ box} \cap \mathrm{Bounding\ box}\,|}{|\,\mathrm{Anchor\ box} \cup \mathrm{Bounding\ box}\,|} \tag{5}$$
The clustering result of the YOLOv4 anchor boxes depends on the random initialization of the original anchor boxes, which introduces great uncertainty, cannot guarantee the clustering quality, and usually requires multiple runs to obtain the optimal anchor box sizes. In order to avoid the bias and instability caused by random initialization, the VV-YOLO model uses an improved K-means++ clustering algorithm to set the anchor box sizes for the experimental data; its implementation logic is shown in Figure 4.
The essential difference between the improved K-means++ algorithm and the K-means algorithm lies in the initialization of the anchor box sizes and the way new anchor boxes are selected. The former first randomly selects a ground-truth box as the original anchor box; then, for each ground-truth box, the difference from the current anchor box is calculated using Equation (6):
$$d(\mathrm{box}, \mathrm{centroid}) = 1 - \mathrm{IoU}(\mathrm{box}, \mathrm{centroid}) \tag{6}$$
In Equation (6), centroid represents the current anchor box (cluster center), and box represents a data sample (ground-truth box); IoU represents the intersection-over-union between the data sample and the current anchor box.
After the difference values are calculated, a new sample is selected as the next anchor box using the roulette method, until all anchor boxes are selected. The principle of selection is that samples that differ greatly from the previous anchor boxes have a higher probability of being selected as the next anchor box. The following mathematical explanation is given:
Suppose the minimum difference values of the N samples to the current anchor boxes are {D_1, D_2, D_3, …, D_N}; Equation (7) then gives the sum of these minimum differences. Next, a random value r not exceeding Sum is selected, and Equation (8) is applied iteratively over the samples; the iteration stops when r drops below 0, and the corresponding sample gives the new anchor box size.
$$\mathrm{Sum} = D_1 + D_2 + \cdots + D_N \tag{7}$$
$$r = r - D_i \tag{8}$$
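As a concrete illustration of the seeding procedure in Equations (6)–(8), the sketch below clusters ground-truth (width, height) pairs with the 1 − IoU distance and roulette selection, then refines the anchors with standard K-means updates. It is an illustrative reconstruction under our own naming, not the authors' released code; in particular, the mean-based centroid update is one common choice and may differ from the exact implementation used in the paper.

```python
import random
import numpy as np

def wh_iou(boxes, anchor):
    """IoU between ground-truth (w, h) pairs and one anchor (w, h), both placed at the origin."""
    inter = np.minimum(boxes[:, 0], anchor[0]) * np.minimum(boxes[:, 1], anchor[1])
    union = boxes[:, 0] * boxes[:, 1] + anchor[0] * anchor[1] - inter
    return inter / (union + 1e-9)

def init_anchors_kmeanspp(boxes, k):
    """Roulette-style seeding described above, using d = 1 - IoU (Equation (6))."""
    anchors = [boxes[random.randrange(len(boxes))]]        # first anchor: a random ground-truth box
    while len(anchors) < k:
        # minimum difference of every sample to the anchors chosen so far
        d = np.min([1.0 - wh_iou(boxes, a) for a in anchors], axis=0)
        r = random.uniform(0, d.sum())                     # random value not exceeding Sum (Equation (7))
        for i, di in enumerate(d):                         # iteratively subtract differences (Equation (8))
            r -= di
            if r < 0:
                anchors.append(boxes[i])
                break
    return np.array(anchors)

def cluster_anchors(boxes, k=9, iters=100):
    """Standard K-means refinement with the 1 - IoU distance after the improved seeding."""
    anchors = init_anchors_kmeanspp(boxes, k)
    for _ in range(iters):
        dists = np.stack([1.0 - wh_iou(boxes, a) for a in anchors], axis=1)   # shape (N, k)
        assign = dists.argmin(axis=1)
        new_anchors = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else anchors[j]
                                for j in range(k)])
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors[np.argsort(anchors.prod(axis=1))]       # sort by area for the three detection scales
```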
Figure 5 shows a comparison of the average results of multiple clustering runs of K-means, K-means++, and the improved K-means++ on the KITTI dataset [39]. The abscissa represents the number of iterations of the clustering algorithm, and the ordinate represents the average IoU between the obtained anchor boxes and all ground-truth boxes. Figure 6 shows the anchor box clustering results of the improved K-means++ algorithm. The results show that the improved K-means++ algorithm obtains a better clustering effect, with an average IoU of 72%, which is better than the K-means and K-means++ algorithms and verifies its effectiveness.

3.2. Optimization of the Model Loss Function Based on Sample Balance

For the definition of samples in the YOLOv4 model, the concepts of the four sample types (positive, negative, easy, and hard) are explained as follows:
  • In the YOLOv4 model, the essence of object detection is dense sampling: a large number of prior boxes are generated over an image, and the ground-truth boxes are matched with some of them. A prior box that is successfully matched is a positive sample, and one that cannot be matched is a negative sample.
  • Suppose there is a binary classification problem, and both Sample 1 and Sample 2 belong to Category 1. In the model's predictions, the probability that Sample 1 belongs to Category 1 is 0.9, and the probability that Sample 2 belongs to Category 1 is 0.6; the former is predicted accurately and is an easy sample, while the latter is predicted inaccurately and is a hard sample.
For deep learning models, sample balance is very important. A large number of negative samples will affect the model's judgment of positive samples and thus its accuracy, and a dataset will inevitably contain imbalances between positive and negative samples and between easy and hard samples for objective reasons. In order to alleviate the sample imbalance caused by the dataset distribution, this paper uses the focal loss function to reconstruct the loss function of the model and control the training weights of the samples.
From Equation (1) above, it can be seen that the confidence loss function of the YOLOv4 model is constructed using the cross-entropy function, which can be simplified to the following equation:
$$L_{\mathrm{conf}} = \sum_{i=0}^{K \times K}\sum_{j=0}^{M} I_{ij}^{\mathrm{obj}}\left[-\log(C_i)\right] + \sum_{i=0}^{K \times K}\sum_{j=0}^{M} I_{ij}^{\mathrm{noobj}}\left[-\log(C_i)\right] \tag{9}$$
The confidence loss of YOLOv4 is reconstructed using the focal loss function, yielding the loss function of the VV-YOLO model, as shown in Equation (10).
$$
\begin{aligned}
L ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \, (2 - w_i \times h_i)(1 - \mathrm{CIoU}) \\
&- \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \left[ \alpha_t (1 - C_i)^{\gamma}\log(C_i) \right] - \lambda_{\mathrm{noobj}} \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{noobj}} \left[ \alpha_t (1 - C_i)^{\gamma}\log(C_i) \right] \\
&- \sum_{i=0}^{K \times K} \sum_{j=0}^{M} I_{ij}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[ \hat{p}_i(c) \log(p_i(c)) + (1 - \hat{p}_i(c)) \log(1 - p_i(c)) \right]
\end{aligned} \tag{10}
$$
In Equation (10), α t is the balance factor, which is used to balance the positive and negative sample weights; γ is the regulator, which is used to adjust the proportion of difficult and easy sample loss. In particular, when γ is 0, Equation (10) is the loss function of the YOLOv4 model.
In order to verify the validity of α_t and γ in the loss function of the VV-YOLO model, the following mathematical derivation is carried out in this section. To reduce the effect of negative samples, a balance factor α_t is added to Equation (9); leaving aside the parameters that do not affect the result, Equation (11) is obtained.
$$\mathrm{CE}(C_i) = -\alpha_t \log(C_i) \tag{11}$$
In Equation (11), α_t ranges from 0 to 1; it equals α when the sample is positive and 1 − α when the sample is negative, as shown in Equation (12). It can be seen that by setting the value of α, the contribution of positive and negative samples to the loss function can be controlled.
$$\alpha_t = \begin{cases} \alpha & \text{if the sample is positive} \\ 1 - \alpha & \text{otherwise} \end{cases} \tag{12}$$
For verification of the effect of regulator γ , a part of Equation (10) can be taken and rewritten as the following equation:
$$L_{fl} = -\hat{C}_i (1 - C_i)^{\gamma}\log(C_i) - (1 - \hat{C}_i)\, C_i^{\gamma}\log(1 - C_i) \tag{13}$$
In the training of deep learning models, the gradient descent method is used to search for the optimal solution of the loss function. The gradient indicates the training weight of different samples during training and is related to the first-order partial derivative of the loss function; taking the first-order partial derivative of Equation (13) with respect to C_i gives Equation (14).
$$\frac{\partial L_{fl}}{\partial C_i} = \hat{C}_i\,\gamma (1 - C_i)^{\gamma-1}\log(C_i) - \hat{C}_i\,\frac{(1 - C_i)^{\gamma}}{C_i} - (1 - \hat{C}_i)\,\gamma C_i^{\gamma-1}\log(1 - C_i) + (1 - \hat{C}_i)\,\frac{C_i^{\gamma}}{1 - C_i} \tag{14}$$
Suppose there are two sample points for which Ĉ_i is 0 and the values of C_i are 0.1 and 0.4, respectively. When γ is 0, that is, when the loss function is the cross-entropy function, the values of the partial derivative are 1.11 and 1.66, respectively; when γ is 2, they are 0.032 and 0.67, respectively. It can be seen that setting a suitable value for γ greatly increases the ratio of hard-to-classify to easy-to-classify samples, which increases the weight of hard samples in network training and effectively alleviates the insufficient training caused by uneven data distribution.
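The weighting behavior of α_t and γ described above can be checked numerically. The snippet below implements one confidence term in the focal form of Equations (11)–(13) and reproduces the gradient magnitudes quoted for the two example points (with α_t factored out); it is a didactic sketch, not the training code of VV-YOLO.

```python
import math

def focal_confidence_term(c_pred, c_true, alpha=0.25, gamma=2.0):
    """One confidence term in focal form (Equations (11)-(13)).
    c_pred: predicted confidence in (0, 1); c_true: 1 for a positive sample, 0 for a negative one."""
    alpha_t = alpha if c_true == 1 else 1.0 - alpha         # Equation (12): balance positives/negatives
    p_t = c_pred if c_true == 1 else 1.0 - c_pred           # probability assigned to the true label
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-9))

# Numerical check of the worked example in the text (negative sample, alpha_t factored out):
def grad(c_pred, gamma, eps=1e-6):
    loss = lambda c: -(c ** gamma) * math.log(1.0 - c)      # Equation (13) with C_hat = 0
    return (loss(c_pred + eps) - loss(c_pred - eps)) / (2 * eps)

for gamma in (0.0, 2.0):
    print(gamma, round(grad(0.1, gamma), 3), round(grad(0.4, gamma), 3))
# gamma = 0 -> about 1.111 and 1.667; gamma = 2 -> about 0.032 and 0.675,
# so the hard sample (0.4) keeps a much larger share of the gradient.
```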

3.3. Neck Network Design Based on Attention Mechanism

The attention mechanism in convolutional neural networks is a design that simulates human visual attention; it can be introduced into many computer vision tasks and serves to judge the importance of image features. The most classic attention network is SENet [40], whose structure is shown in Figure 7; it uses a global average pooling strategy and fully connected layers to model the interrelationship between channels and effectively extract the importance of different channels.
However, SENet only models channel relationships to weigh the importance of each channel and ignores the influence of feature location information on feature extraction. Considering the influence of accurate feature position information on detection accuracy, this paper introduces the coordinate attention network as a module in the neck network; its structure is shown in Figure 8. In order to build an interaction model with accurate capture ability, each channel is encoded along the horizontal and vertical coordinates, respectively. The coding formulas are shown below.
$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j) \tag{15}$$
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \tag{16}$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{17}$$
In the above equations, x is the input. z_c^h(h) and z_c^w(w) are obtained by encoding each channel along the horizontal and vertical coordinates using pooling kernels of size (H, 1) and (1, W), respectively. This parallel modeling structure allows the attention module to capture dependencies along one spatial direction while preserving precise location information along the other, which helps the network mine the objects of interest more accurately. After the location information is modeled, the attention weights along the horizontal and vertical directions are obtained through convolution operations and the sigmoid function. The output feature map is calculated as follows:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{18}$$
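A compact PyTorch sketch of a coordinate attention block implementing Equations (16)–(18) is given below. The reduction ratio, the ReLU non-linearity, and the way the block is attached to the neck are assumptions made for illustration; the authors' CA-PAN may differ in these details.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of a coordinate attention block (Equations (16)-(18)); details are illustrative."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.size()
        x_h = self.pool_h(x)                            # direction-wise pooling, Equations (16)-(17)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (B, C, W, 1) so both maps can be concatenated
        y = torch.cat([x_h, x_w], dim=2)                # joint encoding of the two directions
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(y_h))           # attention weights along the height axis
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # and along the width axis
        return x * g_h * g_w                            # Equation (18)

# Usage idea: wrap the features handed from the backbone to the neck, e.g.
# feat = CoordinateAttention(512)(feat)
```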
According to the analysis of the YOLOv4 model in Section 2.3, and building on the two improvements above, a third improvement is proposed to address the decline of the model's feature extraction ability in complex scenes. The coordinate attention module is introduced into the neck network of the YOLOv4 model, which improves the model's attention to effective features by modeling features along two spatial dimensions and thereby improves its image feature extraction ability.
Considering that image features are transmitted differently in the backbone network and neck network, this paper hopes that the model can adaptively provide more training weight to effective features when the feature transfer mode changes, so as to reduce the impact of invalid features on the model’s training. Therefore, the coordinate attention module is inserted between the backbone network and the neck network, the CA-PAN neck network is designed and the VV-YOLO model shown in Figure 9 is finally formed.

4. Results and Discussion

4.1. Test Dataset

The KITTI dataset [39], the world's largest computer vision evaluation dataset for autonomous driving scenarios, was jointly released by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago in 2012. The dataset can be used to evaluate multiple computer vision tasks, including object detection, object tracking, and visual odometry. The object detection portion of the KITTI dataset contains nearly 10,000 images in eight categories—car, van, truck, pedestrian, person (sitting), cyclist, tram, and misc—with more than 200,000 labeled objects in total. The data distribution is shown in Figure 10.
Figure 11 shows the proportion of each object category in the object detection data. The number of car instances far exceeds that of the other categories, accounting for 52%, which is a serious sample imbalance. From the point of view of model hyperparameter tuning, a highly unbalanced data distribution will seriously affect the fitting effect. According to the characteristics of traffic scenes from the vehicle view and the objects of interest studied in this paper, a Python script was written to merge the eight KITTI categories into Vehicle, Pedestrian, and Cyclist [41]; a possible form of this merging step is sketched below. The Vehicle class is composed of car, van, truck, tram, and misc, and the Pedestrian class consists of pedestrian and person (sitting).
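The KITTI label format (one object per line, class name first) is standard, but the directory layout, the in-place rewriting, and the decision to drop DontCare regions in the sketch below are assumptions of this illustration rather than details given in the paper.

```python
import glob

# Mapping of the eight KITTI categories onto the three classes used in this paper
CLASS_MAP = {
    "Car": "Vehicle", "Van": "Vehicle", "Truck": "Vehicle", "Tram": "Vehicle", "Misc": "Vehicle",
    "Pedestrian": "Pedestrian", "Person_sitting": "Pedestrian",
    "Cyclist": "Cyclist",
}

def merge_labels(label_dir="training/label_2"):
    """Rewrite KITTI label files in place with the merged class names (illustrative path)."""
    for path in glob.glob(f"{label_dir}/*.txt"):
        merged = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                new_cls = CLASS_MAP.get(fields[0])
                if new_cls is None:          # e.g. DontCare regions are dropped in this sketch
                    continue
                merged.append(" ".join([new_cls] + fields[1:]))
        with open(path, "w") as f:
            f.write("\n".join(merged) + "\n")
```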

4.2. Index of Evaluation

In order to evaluate different object detection algorithms reasonably and comprehensively, it is usually necessary to quantify their performance in terms of both real-time operation and precision. A reasonable evaluation provides important guidance for selecting an appropriate object detection algorithm for different scenarios. For the object detection task from the vehicle view, this paper focuses on precision, recall, average precision, and real-time performance.

4.2.1. Precision and Recall

In the field of machine learning, there are four basic relationships between predicted and true labels. TP (True Positive) means a positive sample is correctly identified as positive. FP (False Positive) means a negative sample is incorrectly identified as positive. FN (False Negative) means a positive sample is incorrectly identified as negative, i.e., missed. TN (True Negative) means a negative sample is correctly identified as negative.
The confusion matrix of the classical evaluation system of machine learning can be formed by arranging the above four positive and negative sample relations in matrix form, as shown in Figure 12.
According to the confusion matrix, the commonly used metrics Precision and Recall can be defined. Precision represents the proportion of correct predictions among all results predicted as positive samples. The formula is shown in Equation (19).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{19}$$
Recall, also known as sensitivity, represents the proportion of correct model prediction among all the results whose true value is a positive sample, as shown in Equation (20).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{20}$$

4.2.2. Average Precision

From the above formulas, it can be seen that precision and recall are contradictory: if the improvement of one is pursued in isolation, the other will often be sacrificed. Therefore, in order to comprehensively evaluate an object detection algorithm under different usage scenarios, the PR curve is introduced.
The vertical coordinate of the PR curve is the precision at different confidence thresholds of the detection boxes, and the horizontal coordinate is the recall at the corresponding thresholds. The average precision is defined as the area under the PR curve, and its formula is shown in Equation (21).
$$AP = \int_0^1 P(R)\,\mathrm{d}R \tag{21}$$
When evaluating an object detection model, the average precision of each object class is averaged to obtain the mAP. The mAP is one of the most commonly used evaluation metrics, and its value lies between 0 and 1; generally, the larger the mAP, the better the performance of the object detection algorithm on the data. Its formula is shown in Equation (22).
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{22}$$
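As a reference for Equations (21) and (22), the following sketch computes AP as the area under a monotonized PR curve (the common all-point interpolation) and averages per-class APs into mAP. The interpolation convention is an assumption of this sketch; the paper does not state which variant it uses.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the PR curve (Equation (21)) with all-point interpolation."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions)[order], [0.0]))
    # make precision monotonically decreasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum rectangle areas where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Equation (22): mAP is simply the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```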

4.3. VV-YOLO Model Training

Before model training, configuration files and hyperparameters need to be set. The configuration files mainly include a category file and a prior (anchor) box file stored in txt format. The category file stores the names of the object classes to be trained, and the prior box file stores the sizes of the anchor boxes obtained from clustering; an illustrative sketch of these settings is given after the list below.
The hyperparameters of the model training in this paper are set as follows:
  • Input image size: 608 × 608;
  • Number of iterations: 300;
  • Initial learning rate: 0.001;
  • Optimizer: Adam;
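For illustration, a possible layout of the configuration described above is sketched here; the file names, their contents, and the dictionary keys are placeholders of this sketch, while the numeric values mirror the hyperparameter list above.

```python
# classes.txt — one class name per line:
#   Vehicle
#   Pedestrian
#   Cyclist

# anchors.txt — nine width,height pairs from the improved K-means++ clustering, e.g.
#   w1,h1, w2,h2, ..., w9,h9

TRAIN_CONFIG = {
    "input_size": (608, 608),                # input image size
    "epochs": 300,                           # number of iterations
    "initial_lr": 1e-3,                      # initial learning rate
    "optimizer": "Adam",
    "pretrained_weights": "yolov4.weights",  # transfer learning from the published YOLOv4 weights
}
```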
In order to avoid weak feature extraction caused by overly random initial weights, a transfer learning strategy was adopted during VV-YOLO training; that is, the pre-trained weights provided by the YOLOv4 developers were loaded before training so as to obtain a stable training effect. The curves of the loss value and the training average precision during model training are shown in Figure 13 and Figure 14, respectively. The loss value and the training average precision eventually converge to about 0.015 and 0.88, achieving the desired training effect.

4.4. Discussion

4.4.1. Discussion on Average Precision of VV-YOLO Model

The YOLOv4 model and the VV-YOLO model were tested on the KITTI dataset [39], and the precision, recall, and average precision obtained are shown in Table 2. According to these results, the average precision of the VV-YOLO model is 80.01%, which is 3.44% higher than that of the YOLOv4 model. In terms of precision and recall, the VV-YOLO model is lower than the YOLOv4 model only in the recall of the pedestrian class and leads on all other indicators. Figure 15 shows the average precision of the three object classes for the two models, and the results show that the VV-YOLO model is superior to the YOLOv4 model.
To verify the effectiveness of each improved module of VV-YOLO, multiple rounds of ablation experiments were performed on the KITTI dataset, and the results are shown in Table 3. From these results, it can be concluded that the precision of the proposed model is improved by 6.88% and the average precision by 3.44%, with only a slight increase in the number of parameters. Table 3 also shows the results of comparing the proposed module with several advanced attention mechanisms, which further proves the effectiveness of the improvement.
In addition, six mainstream object detection models were selected for comparative testing, and Table 4 shows the precision, recall, and average precision of the VV-YOLO model and these models. From the results in the table, it can be concluded that the VV-YOLO model takes the lead on the other indicators, except that its precision and recall are slightly lower than those of YOLOv5 and YOLOv4 in some cases.

4.4.2. Discussion on the Real-Time Performance of VV-YOLO Model

The weight file of the VV-YOLO model is 245.73 MB, only 1.29 MB larger than that of the YOLOv4 model. On an NVIDIA GeForce RTX 3070 Laptop GPU, the VV-YOLO model and seven mainstream object detection models were used to run inference on images from the KITTI dataset. Before inference, all test images were resized to the same resolution.
After 100 inference runs, the average inference time and frame rate are shown in Table 5. The data transmission frame rate of an autonomous driving perception system is usually 15 FPS, and it is generally accepted that an object detection model should exceed 25 FPS to meet the real-time requirements of the system. The inference time of the VV-YOLO model is 37.19 ms, only 0.7 ms more than that of the YOLOv4 model, and its inference frame rate is 26.89 FPS. Compared with the YOLOv3 and YOLOv5 models, the inference time of the VV-YOLO model is higher, but combined with the precision results, its comprehensive performance is the best.
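The real-time figures in Table 5 can be reproduced with a generic timing loop of the following form, assuming a PyTorch model and a CUDA device; the warm-up count and helper name are arbitrary choices of this sketch, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def measure_inference(model, image, runs=100):
    """Average per-image inference time (ms) and the implied frame rate, assuming a CUDA device."""
    model = model.cuda().eval()
    x = image.cuda()
    for _ in range(10):                  # warm-up so one-off CUDA initialization is not timed
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / runs
    return elapsed_ms, 1000.0 / elapsed_ms   # e.g. 37.19 ms corresponds to about 26.9 FPS
```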

4.4.3. Visual Analysis of VV-YOLO Model Detection Results

Figure 16 shows inference heat maps of the YOLOv4 model and the VV-YOLO model in multiple vehicle-view scenes. The results show that, compared with YOLOv4, VV-YOLO pays more attention to distant, occluded, and other difficult objects. Figure 17 shows the detection results of YOLOv4 and VV-YOLO on the KITTI test data. It can be seen that VV-YOLO detects distant and occluded objects well.
In order to verify the generalization performance of the VV-YOLO model, this paper also selected the BDD100K dataset and self-collected data from typical traffic scenes for a comparison of detection results, shown in Figure 18 and Figure 19. As can be seen from the figures, the VV-YOLO model correctly detects objects that the YOLOv4 model falsely detects or misses. The positive performance of the VV-YOLO model in real scenarios is attributable to the targeted design of the clustering algorithm, network structure, and loss function in this paper.

5. Conclusions

Based on the end-to-end design idea, this paper proposes a vehicle view object detection model, VV-YOLO. Through the improved K-means++ clustering algorithm, fast and stable anchor box generation is realized on the data side. In the VV-YOLO training stage, the focal loss function is used to construct the model loss, which alleviates the training imbalance caused by the unbalanced data distribution. At the same time, the coordinate attention mechanism is introduced into the model, and the CA-PAN neck network is designed to improve the model's ability to learn the features of interest. In addition to the experiments on the benchmark datasets, this study also collected complex real-road scene data in China for detection comparison tests, and the visualization results confirm the superiority of the VV-YOLO model. The experimental results in this paper confirm that the improved VV-YOLO model can better realize object detection from the vehicle view while balancing precision and inference speed, which provides a new implementation idea for the perception module of autonomous vehicles and has both theoretical and engineering significance.

Author Contributions

Conceptualization, Y.W.; methodology, H.L.; software, H.L. and B.G.; validation, Z.Z.; formal analysis, Y.W.; investigation, Y.G.; resources, L.J. and H.L.; data curation, X.L. and Y.W.; writing—original draft preparation, Y.G.; writing—review and editing, Z.Z.; visualization, Z.Z.; supervision, Y.W.; project administration, H.L.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Major Scientific and Technological Special Projects in Jilin Province and Changchun City (20220301008GX), the National Natural Science Foundation of China (52072333, 52202503), the Hebei Natural Science Foundation (F2022203054), and the Science and Technology Project of Hebei Education Department (BJK2023026).

Data Availability Statement

Data and models are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Saleemi, H.; Rehman, Z.U.; Khan, A.; Aziz, A. Effectiveness of Intelligent Transportation System: Case study of Lahore safe city. Transp. Lett. 2022, 14, 898–908. [Google Scholar] [CrossRef]
  2. Kenesei, Z.; Ásványi, K.; Kökény, L.; Jászberényi, M.; Miskolczi, M.; Gyulavári, T.; Syahrivar, J. Trust and perceived risk: How different manifestations affect the adoption of autonomous vehicles. Transp. Res. Part A Policy Pract. 2022, 164, 379–393. [Google Scholar] [CrossRef]
  3. Hosseini, P.; Jalayer, M.; Zhou, H.; Atiquzzaman, M. Overview of Intelligent Transportation System Safety Countermeasures for Wrong-Way Driving. Transp. Res. Rec. 2022, 2676, 243–257. [Google Scholar] [CrossRef]
  4. Zhang, H.; Bai, X.; Zhou, J.; Cheng, J.; Zhao, H. Object Detection via Structural Feature Selection and Shape Model. IEEE Trans. Image Process. 2013, 22, 4984–4995. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Rabah, M.; Rohan, A.; Talha, M.; Nam, K.-H.; Kim, S.H. Autonomous Vision-based Object Detection and Safe Landing for UAV. Int. J. Control. Autom. Syst. 2018, 16, 3013–3025. [Google Scholar] [CrossRef]
  6. Tian, Y.; Wang, K.; Wang, Y.; Tian, Y.; Wang, Z.; Wang, F.-Y. Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection. Neurocomputing 2020, 411, 32–44. [Google Scholar] [CrossRef]
  7. Shirmohammadi, S.; Ferrero, A. Camera as the Instrument: The Rising Trend of Vision Based Measurement. IEEE Instrum. Meas. Mag. 2014, 17, 41–47. [Google Scholar] [CrossRef]
  8. Noh, S.; Shim, D.; Jeon, M. Adaptive Sliding-Window Strategy for Vehicle Detection in Highway Environments. IEEE Trans. Intell. Transp. Syst. 2016, 17, 323–335. [Google Scholar] [CrossRef]
  9. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  11. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1440–1448. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Machine Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  13. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Machine Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  14. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  18. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [Green Version]
  19. Bochkovskiy, A.; Wang, C.-Y.; Mark Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  20. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  21. Wang, C.-Y.; Bochkovskiy, A.; Mark Liao, H.-Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696v1. [Google Scholar]
  22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  23. Hassaballah, M.; Kenk, M.; Muhammad, K.; Minaee, S. Vehicle Detection and Tracking in Adverse Weather Using a Deep Learning Framework. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4230–4242. [Google Scholar] [CrossRef]
  24. Lin, C.-T.; Huang, S.-W.; Wu, Y.-Y.; Lai, S.-H. GAN-Based Day-to-Night Image Style Transfer for Nighttime Vehicle Detection. IEEE Trans. Intell. Transp. Syst. 2021, 22, 951–963. [Google Scholar] [CrossRef]
  25. Tian, D.; Lin, C.; Zhou, J.; Duan, X.; Cao, D. SA-YOLOv3: An Efficient and Accurate Object Detector Using Self-Attention Mechanism for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4099–4110. [Google Scholar] [CrossRef]
  26. Zhang, T.; Ye, Q.; Zhang, B.; Liu, J.; Zhang, X.; Tian, Q. Feature Calibration Network for Occluded Pedestrian Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4151–4163. [Google Scholar] [CrossRef]
  27. Wang, L.; Qin, H.; Zhou, X.; Lu, X.; Zhang, F. R-YOLO: A Robust Object Detector in Adverse Weather. IEEE Trans. Instrum. Meas. 2023, 72, 1–11. [Google Scholar] [CrossRef]
  28. Arthur, D.; Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  29. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13708–13717. [Google Scholar]
  30. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  32. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  33. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 6517–6525. [Google Scholar]
  34. Misra, D. Mish: A Self Regularized Non-Monotonic Activation Function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  35. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  36. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. DropBlock: A regularization method for convolutional networks. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018; pp. 10727–10737. [Google Scholar]
  37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  38. Franti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759. [Google Scholar] [CrossRef]
  39. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  41. Cai, Y.; Wang, H.; Sotelo, M.A.; Li, Z. YOLOv4-5D: An Effective and Efficient Object Detector for Autonomous Driving. IEEE Trans. Instrum. Meas. 2021, 70, 4503613. [Google Scholar] [CrossRef]
Figure 1. YOLOv4 model structure.
Figure 2. Typical scene from vehicle view.
Figure 3. Illustration of the IoU calculation.
Figure 4. Improved K-means++ algorithm logic.
Figure 5. The clustering effect of two clustering algorithms on KITTI dataset.
Figure 6. Cluster results of the improved K-means++ algorithm on the KITTI dataset.
Figure 7. SENet model structure.
Figure 8. Coordinate attention module structure.
Figure 9. VV-YOLO model structure.
Figure 10. KITTI dataset data distribution.
Figure 11. The proportion of various objects in the KITTI dataset.
Figure 12. Confusion matrix structure.
Figure 13. Training loss curve of VV-YOLO model.
Figure 14. Training average precision change curve of VV-YOLO model.
Figure 15. Schematic diagram of average precision: (a) YOLOv4; (b) VV-YOLO.
Figure 16. Object detection model inference heat map: (a) YOLOv4; (b) VV-YOLO.
Figure 17. Object detection results of KITTI dataset: (a) YOLOv4; (b) VV-YOLO.
Figure 18. Object detection results of BDD100K dataset: (a) YOLOv4; (b) VV-YOLO.
Figure 19. Object detection results of collected data: (a) YOLOv4; (b) VV-YOLO.
Table 1. Summary of literature survey on vehicle view object detection models.

| Year | Title | Method | Limitation | Reference |
|---|---|---|---|---|
| 2021 | Vehicle Detection and Tracking in Adverse Weather Using a Deep Learning Framework | A visual enhancement mechanism was proposed and introduced into the YOLOv3 model to realize vehicle detection in snowy, foggy, and other scenarios. | Larger modules are introduced, and only vehicle objects are considered. | [23] |
| 2021 | GAN-Based Day-to-Night Image Style Transfer for Nighttime Vehicle Detection | The AugGAN network was proposed to enhance vehicle targets in dark-light images, and the data generated by this strategy were used to train Faster R-CNN and YOLO, which improved the performance of the object detection models under dark-light conditions. | GAN networks are introduced, multiple models need to be trained, and only vehicle objects are considered. | [24] |
| 2022 | SA-YOLOv3: An Efficient and Accurate Object Detector Using Self-Attention Mechanism for Autonomous Driving | The SA-YOLOv3 model is proposed, in which dilated convolution and a self-attention module (SAM) are introduced into YOLOv3, and the GIoU loss function is used during training. | There are few test scenarios to validate the model. | [25] |
| 2022 | Feature Calibration Network for Occluded Pedestrian Detection | A fusion module of SA and FC features is designed, and FC-Net is further proposed to realize pedestrian detection in occlusion scenes. | Only pedestrian targets are considered, and there are few verification scenarios. | [26] |
| 2023 | R-YOLO: A Robust Object Detector in Adverse Weather | QTNet and FCNet adaptive networks were proposed to learn image features without labels and applied to YOLOv3, YOLOv5 and YOLOX, which improved the precision of object detection in foggy scenarios. | Additional large networks are introduced, and multiple models need to be trained. | [27] |
Table 2. Test results of the YOLOv4 model and the VV-YOLO model on the KITTI dataset.

| Evaluation Indicator | Class | YOLOv4 | VV-YOLO |
|---|---|---|---|
| Precision | Vehicle | 95.01% | 96.87% |
|  | Cyclist | 81.97% | 93.41% |
|  | Pedestrian | 74.43% | 81.75% |
| Recall | Vehicle | 80.79% | 82.21% |
|  | Cyclist | 55.87% | 55.75% |
|  | Pedestrian | 56.58% | 52.24% |
| Average precision | — | 76.57% | 80.01% |
Table 3. Ablation experiment results of the VV-YOLO model on the KITTI dataset.

| Test Model | Precision | Recall | Average Precision |
|---|---|---|---|
| Baseline | 83.80% | 64.41% | 76.57% |
| + Improved K-means++ | 89.83% | 60.70% | 77.49% |
| + Focal loss | 90.24% | 61.79% | 78.79% |
| + SENet (attention mechanism) | 89.47% | 62.99% | 78.61% |
| + CBAM (attention mechanism) | 89.83% | 60.69% | 78.49% |
| + ECA (attention mechanism) | 89.66% | 61.96% | 78.48% |
| VV-YOLO | 90.68% | 63.40% | 80.01% |
Table 4. Comparative test results of the VV-YOLO model and mainstream object detection models.

| Test Model | Precision | Recall | Average Precision |
|---|---|---|---|
| RetinaNet | 90.43% | 37.52% | 66.38% |
| CenterNet | 87.79% | 34.01% | 60.60% |
| YOLOv5 | 89.71% | 61.08% | 78.73% |
| Faster R-CNN | 59.04% | 76.54% | 75.09% |
| SSD | 77.59% | 26.13% | 37.99% |
| YOLOv3 | 77.75% | 32.07% | 47.26% |
| VV-YOLO | 90.68% | 63.40% | 80.01% |
Table 5. Real-time comparison between the VV-YOLO model and mainstream object detection models.

| Test Model | Inference Time (ms) | Inference Frame Rate (FPS) |
|---|---|---|
| RetinaNet | 31.57 | 31.67 |
| YOLOv4 | 36.53 | 27.37 |
| CenterNet | 16.49 | 60.64 |
| YOLOv5 | 26.65 | 37.52 |
| Faster R-CNN | 62.47 | 16.01 |
| SSD | 52.13 | 19.18 |
| YOLOv3 | 27.32 | 36.60 |
| VV-YOLO | 37.19 | 26.89 |
