1. Introduction
With the continuous growth of public travel demand, automobile ownership has risen steadily in recent years, leading to more traffic accidents and congestion [1]. If accidents and congestion are not handled in a timely manner, they seriously threaten public safety. In traffic scenarios, the speed and precision of vehicle detection are pivotal factors in traffic management. With the spread of surveillance cameras and the rapid development of Intelligent Traffic Systems (ITSs), the use of computer technology for the rapid processing of video and image data from road surveillance has become essential to realizing ITS objectives [2,3].
Machine vision technology has a powerful video data processing capability, from which it can extract key information such as vehicle color, model, brand, and license plate number [4,5,6]. This information enables the transportation department to grasp road conditions in real time; e.g., the supervisory department can use this information to accurately identify the various sorts of motor vehicles on the road, thus enhancing the monitoring of dangerous vehicles. In addition, machine vision technology can help accurately identify and locate these specific vehicles, providing strong support for the prevention of traffic accidents or criminal behavior.
To gain deeper insights, many researchers have employed various methodologies for vehicle detection and classification. Traditional vehicle detection methods traverse and scan the image with a fixed window and determine whether the current window contains a target vehicle based on manually designed features and classifiers. For example, Wang et al. [7] utilized a pseudo-visual search mechanism to eliminate environmental interference in the image and integrated histograms of oriented gradients with local binary pattern fusion to enhance vehicle feature extraction. However, manually engineering features is complicated and lacks real-time performance. As deep learning continues to advance, numerous researchers have applied deep learning techniques to vehicle detection. Currently, deep-learning-based target detection can be categorized into two types: two-stage detection algorithms and single-stage detection algorithms. Most two-stage detection algorithms rely on a region proposal network to generate candidate boxes, and a convolutional neural network extracts features from each candidate box [8]. For example, to address the lack of interdependence in the feature information transfer process of traditional deep learning networks, Ke et al. [9] proposed a dense attention network structure that introduces dense connections and an attention module to enhance the detection ability of the model. Gu et al. [10] introduced an enhanced Faster RCNN vehicle detection algorithm that improves detection accuracy for various vehicle types by constructing receptive fields of distinct scales for the concurrent detection of image targets. Such algorithms demonstrate elevated detection accuracy but limited real-time performance. Single-stage detection algorithms, on the other hand, discard region selection and directly recognize the targets in the image; representative algorithms include the Single-Shot MultiBox Detector (SSD) [11], the You Only Look Once (YOLO) series [12,13,14], and EfficientDet [15]. In contrast to two-stage detection algorithms, single-stage detection algorithms exhibit superior real-time performance, but their detection accuracy is slightly lower [16].
Many scholars have conducted extensive and in-depth research on lightweight networks and vehicle detection [17,18,19]. Chen et al. [20] proposed an efficient detection network that achieves three times the detection speed of YOLOv3 by fusing the advantages of densely connected networks and separable convolutions. Dong et al. [21] devised an advanced approach for vehicle detection, leveraging the C3Ghost module within the YOLOv5 neck network to streamline model parameters; additionally, they bolstered detection accuracy by integrating the CBAM attention mechanism and optimized the loss function to expedite model training. Zhang et al. [22] enhanced the YOLOv8 model by augmenting its feature fusion capabilities through multi-scale fusion within the backbone network and introduced a TA attention mechanism in the feature extraction phase to bolster model accuracy. Luo et al. [23] introduced an enhanced real-time detection model based on YOLOv5s to address the high complexity and computational demands of vehicle detection models, incorporating a large-scale convolution to amalgamate information from various feature maps and optimizing the original spatial pyramid structure to bolster the model's information extraction capabilities. However, as model accuracy increased, detection time also rose gradually.
The above research has promoted the development of vehicle detection. However, the demanding real-time constraints within traffic scenarios pose a challenge, as existing algorithms struggle to strike a balance between detection speed and precision. To address this issue, the present paper introduces an enhanced real-time detection algorithm utilizing You Only Look Once version 7 (YOLOv7). This algorithm effectively reduces model parameters while ensuring recognition accuracy, thereby enabling deployment on edge devices.
The primary advancements delineated in this paper include the following:
- ⮚ Lightweight Modules: This paper employs the lightweight MobileNetV3 architecture to replace the backbone network of YOLOv7 and changes the spatial pyramid pooling structure from parallel to serial pooling to speed up detection. Furthermore, it utilizes the lightweight Generalized Sparse Convolution (GSConv) module to replace the standard convolution in the neck network; in combination with the Spatial Pyramid Pooling Fast Cross-Stage Partial Channel (SPPFCSPC) module, this forms the SPPFCSPC-GS module, reducing the number of parameters in the model.
- ⮚ Attention Mechanism Module: To counter the decrease in feature extraction ability after the model is lightweighted, this paper incorporates the coordinate attention (CA) mechanism into different feature layers, enhancing the detection accuracy of the model without substantially increasing the number of parameters.
- ⮚ Minimum Point Distance Intersection over Union (MPDIoU) Loss Function: To refine the detection speed of the model and reduce the bounding-box regression loss, the original complete intersection over union (CIoU) loss function is replaced with the MPDIoU loss function.
2. Materials and Methods
In recent years, image detection algorithms of all kinds have performed extremely well on the metric of accuracy but have neglected the problem of excessive model parameters. Aside from accuracy, real-time performance is also a significant metric for evaluating models. Overly complex network models are hard to deploy on mobile devices with restricted computational resources and are also difficult to apply to scenarios with high real-time requirements. This paper aims to devise a lightweight, readily deployable network model that prioritizes reducing model parameters while preserving the algorithm's accuracy.
2.1. YOLOv7 Network Structure
After multiple iterations, the YOLO network model has given rise to the YOLOv7 model, which primarily comprises three modules: the input, backbone, and head modules. The input module resizes the input image to a specified dimension to align with the input requirements of the backbone network. The backbone incorporates CBS, E-ELAN, and MPConv modules. E-ELAN is an efficient layer aggregation network that continuously improves the model's learning ability without changing the structure of the transition layer. A CBS module stacks a convolutional layer, batch normalization, and the SiLU activation function to improve training efficiency and performance, while the MPConv module incorporates a maxpool layer into its structure, forming upper and lower branches that effectively retain the most significant features. To adapt to multi-scale inputs, the head network uses a Spatial Pyramid Pooling (SPP) structure. To integrate features across levels, an aggregated feature pyramid network structure passes bottom-layer information to higher layers. Lastly, the reparameterized convolution (RepConv) structure adjusts the channel counts of features at varying scales, enabling efficient feature representation.
2.2. YOLOv7 Improvements
The original model's backbone network utilizes the DarkNet53 architecture, whose numerous residual structures escalate the model's complexity and computational demands. Consequently, to address the high parameter count and computational complexity of the original YOLOv7 network, which hinder deployment on terminal devices, this study undertakes a lightweight redesign of the network architecture [24]. The lightweight MobileNetV3 backbone network is employed instead of DarkNet53 to extract feature information from input images efficiently. Drawing inspiration from the Spatial Pyramid Pooling-Fast (SPPF) concept, the SPP module in the neck network is enhanced by reducing the number of convolutional kernel sizes that must be specified: the original parallel pooling structure is transformed into a serial pooling structure, accelerating data processing and enhancing training speed and feature extraction capability while keeping the receptive field intact. Furthermore, the conventional 3 × 3 convolutional layers in SPPFCSPC are replaced with lightweight GSConv layers, forming the Spatial Pyramid Pooling Fast Cross-Stage Partial Channel-Generalized Sparse Convolution (SPPFCSPC-GS) module and further refining the model's real-time performance. To counteract the potential accuracy loss resulting from these lightweight modifications, the CA mechanism is integrated into various feature extraction layers. Finally, the MPDIoU loss function is employed to refine the model's representation of target features and enhance target detection accuracy.
Figure 1 illustrates the enhanced architecture of the YOLOv7 network.
2.2.1. Lightweight MobileNetV3 Module
MobileNet is a series of Convolutional Neural Network (CNN) architectures for image classification proposed by a team of Google researchers [25]. Across its versions, MobileNet introduces a series of innovations whose main goal is to decrease the number of model parameters, and thereby increase operational efficiency on mobile devices, while maintaining high classification accuracy. MobileNetV3 not only retains the inverted residual module and depthwise-separable convolution of MobileNetV1 and MobileNetV2 to optimize the network parameters but also introduces the Squeeze-and-Excitation (SE) structure. It replaces the original swish activation function with the h-swish activation function to decrease the number of operations and optimizes the network structure to enhance model performance. The structure of the bneck module of the MobileNetV3 model is presented in Figure 2.
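For concreteness, the following is a minimal PyTorch sketch of an inverted-residual bneck block of the kind described above, with an SE branch and h-swish activation. The channel sizes, reduction ratio, and class names are illustrative assumptions, not the exact MobileNetV3 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    """Channel attention branch used inside the bneck block."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # global average pooling
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))         # hard variant keeps operations cheap
        return x * s

class Bneck(nn.Module):
    """Inverted residual: 1x1 expand -> depthwise conv -> SE -> 1x1 project."""
    def __init__(self, c_in, c_exp, c_out, kernel=3, stride=1):
        super().__init__()
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_exp, 1, bias=False),              # expand channels
            nn.BatchNorm2d(c_exp),
            nn.Hardswish(),                                     # h-swish activation
            nn.Conv2d(c_exp, c_exp, kernel, stride,
                      padding=kernel // 2, groups=c_exp, bias=False),  # depthwise
            nn.BatchNorm2d(c_exp),
            nn.Hardswish(),
            SqueezeExcite(c_exp),
            nn.Conv2d(c_exp, c_out, 1, bias=False),             # linear projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```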
Figure 3 shows the structure of depthwise-separable convolution, which consists of depthwise (DW) convolution and pointwise (PW) convolution. DW convolution operates on each channel independently, while PW convolution combines the per-channel features produced by DW convolution using a 1 × 1 convolution kernel. The relationship between its computational cost and that of ordinary convolution is shown below:

$$\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2} \tag{1}$$

where $M$ and $N$ denote the numbers of input and output channels, respectively; $D_K$ is the convolution kernel size; $D_F$ is the dimension of the output feature map; the denominator $D_K \times D_K \times M \times N \times D_F \times D_F$ is the computation of a single ordinary convolution; and the two numerator terms are the computation of DW and PW convolution, respectively.
According to Equation (1), when the convolution kernel is 3 × 3, the computational cost of ordinary convolution is roughly nine times that of depthwise-separable convolution. This substitution not only decreases storage space and computational requirements but also lowers the hardware demands of the algorithm.
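The ratio in Equation (1) can be checked directly by counting parameters. The sketch below, with arbitrary illustrative channel counts, contrasts an ordinary 3 × 3 convolution with its depthwise-plus-pointwise equivalent.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 64, 64, 3   # illustrative channel counts and kernel size

ordinary = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # DW: per-channel
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # PW: 1x1 channel mixing
)

print(n_params(ordinary))    # 3*3*64*64      = 36864
print(n_params(separable))   # 3*3*64 + 64*64 = 4672
print(n_params(ordinary) / n_params(separable))  # ~7.9x, approaching 9x as c_out grows
```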
2.2.2. SPPFCSPC-GS Module
The SPPFCSPC module draws inspiration from the SPPF concept [26,27], which structurally reduces the number of times the convolution kernel size must be specified. While SPP requires the dimensions of the pooling kernel to be specified three times to pool and splice the output of the CBS module, SPPF requires only one kernel: each pooling operation's output is used as the input of the subsequent pooling, accelerating data processing. This allows the model to enhance feature extraction while keeping the receptive field unchanged. At the same time, because depthwise-separable convolution convolves channel by channel, it loses a good deal of cross-channel information, which weakens feature extraction. Therefore, this paper introduces a lightweight convolution layer, GSConv, which fuses depthwise-separable convolution with ordinary convolution; it effectively decreases the model parameters without affecting detection precision and further improves the generalization ability of the model.
Figure 4 illustrates the structure of the SPPFCSPC-GS.
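The serial-pooling idea can be sketched as follows. This is a simplified PyTorch illustration of the chained 5 × 5 max-pooling at the core of SPPF, not the full SPPFCSPC-GS module; the hidden channel width and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Serial pooling: one 5x5 maxpool applied three times in a chain.
    Chained stride-1 5x5 pools cover the same receptive fields as the
    parallel 5/9/13 pools of SPP, but each pool reuses the previous output."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # receptive field equivalent to a 5x5 pool
        y2 = self.pool(y1)   # equivalent to a 9x9 pool
        y3 = self.pool(y2)   # equivalent to a 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```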
In the figure, the GSConv module [28] first performs a 1 × 1 convolution on the input image to reduce the channel count to half of the intended output C2 and subsequently performs a 5 × 5 depthwise-separable convolution on the resulting feature map, so that each branch carries half of the channels intended for the final output. Ultimately, it obtains a feature image with C2 output channels by concatenating and rearranging the two feature maps.
Figure 5 illustrates the GSConv architecture.
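Based on the description above, a minimal PyTorch sketch of a GSConv-style layer might look as follows; the exact kernel sizes and shuffle order are assumptions drawn from that description, not the authors' verified implementation.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Half the output channels come from an ordinary conv; the other half
    from a depthwise conv applied on top of it; a channel shuffle then
    mixes information between the two branches."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False)
        self.dw = nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False)

    def forward(self, x):
        x1 = self.conv(x)                 # standard convolution branch
        x2 = self.dw(x1)                  # depthwise branch
        y = torch.cat([x1, x2], dim=1)    # c_out channels in total
        # channel shuffle: interleave the two halves so information mixes
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```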
If lightweight GSConv replaced all the ordinary convolutions in the model, it would increase the number of network layers, impede the flow of data, and slow model inference. Since the channel dimension reaches its maximum in the neck network and no further transformation is needed there, this paper replaces only the ordinary 3 × 3 convolutions in the neck network with the lightweight GSConv convolutional layer, which lowers the computational burden and further enhances the generalization capacity of the model.
2.2.3. CA Module
The integration of lightweight modules like MobileNetV3 reduces the model's computational load and parameter count, albeit at the cost of decreased feature extraction capability. To amplify the network's feature extraction prowess, an attention mechanism module is incorporated into the convolutional network. Among the commonly used lightweight attention mechanisms are SE and the Convolutional Block Attention Module (CBAM). The SE mechanism focuses solely on channel attention and neglects the spatial dimension, whereas the CBAM mechanism considers attention in both the spatial and channel dimensions [29]. However, CBAM's practical complexity and computational resource consumption are relatively high [30]. Hence, when selecting an attention mechanism, it is essential to strike a balance based on the task requirements and resource constraints.
In order to amplify the original model's ability to perceive target locations within the feature map, this study employs the CA mechanism to strengthen the detector's feature extraction aptitude for vehicles [31]. This is achieved by embedding position information into channel attention, attaining lightweight global attention.
Figure 6 illustrates the structure of the CA mechanism.
As shown in the figure, the CA mechanism first pools the input feature map globally along the height and width directions, producing a pair of direction-aware feature maps of sizes C × H × 1 and C × 1 × W, with each channel encoded per direction. The two feature maps are then concatenated, and a 1 × 1 convolution reduces the channel dimension to C/r; the result is batch-normalized and passed through a nonlinear activation, yielding a feature map of dimensions C/r × 1 × (W + H). Next, this feature map is split along the spatial dimension, and an additional 1 × 1 convolution per direction restores the channel dimension and derives weights in both directions. Ultimately, after the Sigmoid activation function, the attention weights are applied to the original image features, yielding the final output feature map.
By generating weights in different directions for the feature map, the CA mechanism can focus on more important feature information. It not only grasps positional details across channels but also extracts position-specific information, assigning higher weights to significant pixel coordinates. This approach is instrumental in enhancing the precision of detection.
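To make the sequence of operations concrete, here is a simplified PyTorch sketch of a CA block following the steps described above; the reduction ratio and choice of nonlinearity are assumptions.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: pool along H and W separately, encode jointly,
    then split back into per-direction attention weights."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        c_mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, c_mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(c_mid, channels, 1)
        self.conv_w = nn.Conv2d(c_mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                  # (b, c, h, 1): pool along W
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (b, c, w, 1): pool along H
        y = torch.cat([x_h, x_w], dim=2)                   # (b, c, h+w, 1)
        y = self.act(self.bn1(self.conv1(y)))              # reduce channels to c/r
        y_h, y_w = torch.split(y, [h, w], dim=2)           # split per direction
        a_h = torch.sigmoid(self.conv_h(y_h))                   # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))   # (b, c, 1, w)
        return x * a_h * a_w                               # apply directional weights
```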
2.2.4. MPDIoU Loss Function
The loss function quantifies the disparity between the predicted and actual values of a model, with a lower value indicating greater robustness. The initial YOLOv7 architecture employs the CIoU loss function, an extension of the Distance Intersection over Union (DIoU) loss, to gauge the bounding-box regression loss [32,33]. This method considers three geometric factors, i.e., the overlap area, the center-point distance, and the aspect ratio, aiming to refine box predictions toward the actual dimensions. However, it fails to balance complex and straightforward samples, which potentially increases the computational overhead during training and slows the model's convergence.
The CIoU calculation formula is as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{2}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} \tag{3}$$

where $v$ measures the consistency of the aspect ratio and $\alpha$ is its weighting coefficient; $w$ and $h$ denote the width and height of the prediction frame, respectively; $w^{gt}$ and $h^{gt}$ denote the width and height of the real frame, respectively; $\rho\left(b, b^{gt}\right)$ represents the Euclidean distance between the center points of the prediction frame and the actual frame; and $c$ is the diagonal length of the smallest box enclosing both frames.
To accelerate the training process and enhance the model's classification accuracy, CIoU is supplanted with the MPDIoU loss function [34]. MPDIoU measures the similarity of bounding boxes through the minimum point distance between their corner points. Figure 7 illustrates the principle of MPDIoU: the red box indicates the actual bounding box, while the yellow box signifies the predicted bounding box. By minimizing the distances between the corresponding top-left and bottom-right corner points of the two boxes, the loss is reduced. This method effectively resolves the issue whereby existing loss functions struggle to optimize when the predicted and ground-truth bounding boxes have identical aspect ratios but vastly different widths and heights. It retains the advantages of existing IoU-based and norm-based losses while compensating for their shortcomings, effectively reducing the localization loss and further improving prediction accuracy.
MPDIoU is calculated as follows:

$$d_1^2 = \left(x_1^{prd} - x_1^{gt}\right)^2 + \left(y_1^{prd} - y_1^{gt}\right)^2, \qquad d_2^2 = \left(x_2^{prd} - x_2^{gt}\right)^2 + \left(y_2^{prd} - y_2^{gt}\right)^2 \tag{4}$$

$$MPDIoU = IoU - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2} \tag{5}$$

where $d_1$ represents the distance between the top-left corner of the predicted frame and that of the actual frame, $d_2$ represents the distance between the corresponding bottom-right corners, $\left(x_1, y_1\right)$ and $\left(x_2, y_2\right)$ denote the top-left and bottom-right corner coordinates, and $w$ and $h$ denote the width and height of the input image.
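Equations (4) and (5) translate directly into code. The following is a small PyTorch sketch of the MPDIoU computation, assuming corner-format boxes; the function name is an illustrative choice.

```python
import torch

def mpdiou(pred, gt, img_w, img_h, eps=1e-7):
    """pred, gt: (..., 4) boxes as (x1, y1, x2, y2).
    MPDIoU = IoU - d1^2/(w^2+h^2) - d2^2/(w^2+h^2), where d1 and d2 are the
    distances between matching top-left and bottom-right corners."""
    # intersection rectangle
    ix1 = torch.max(pred[..., 0], gt[..., 0])
    iy1 = torch.max(pred[..., 1], gt[..., 1])
    ix2 = torch.min(pred[..., 2], gt[..., 2])
    iy2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)

    # squared corner distances, normalized by the squared image diagonal
    d1 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    d2 = (pred[..., 2] - gt[..., 2]) ** 2 + (pred[..., 3] - gt[..., 3]) ** 2
    diag = img_w ** 2 + img_h ** 2

    return iou - d1 / diag - d2 / diag

# The regression loss is then L_MPDIoU = 1 - mpdiou(pred, gt, w, h).
```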
5. Conclusions
Due to the extensive number of parameters in target detection models and the complexity of computation in traffic scenes, this paper proposes a lightweight redesign of the original YOLOv7 model. The backbone network is lightweighted by incorporating the MobileNetV3 module, and conventional convolutions in the neck structure are then substituted with GSConv, further reducing the model's parameter count. To mitigate the decrease in model accuracy caused by lightweighting, the model's feature extraction capability is optimized by integrating the CA mechanism into the feature layers; this enhancement improves the model's overall detection performance. The CIoU loss function is substituted with MPDIoU to further refine the model's training process and enhance classification accuracy. The experimental results indicate that, in contrast to the original YOLOv7 and other detection models, the improved model reduces the number of parameters and improves detection speed without compromising detection accuracy. The model achieves an excellent equilibrium between detection accuracy and speed, rendering it highly suitable for deployment in resource-limited embedded-device scenarios, providing strong support for performance optimization in practical applications, and laying a foundation for intelligent traffic management.
Although the vehicle detection model designed in this paper balances accuracy and real-time requirements well, there are still aspects to improve. For instance, the current dataset contains few vehicle categories; if an unlabeled vehicle category is encountered, the model may misjudge it. Subsequent research will focus on expanding the data categories to cover a broader range of experimental scenarios.
If the model is applied to autonomous driving, it can help improve the reaction speed and accuracy of autonomous vehicles. However, in actual application scenarios, numerous limitations remain: complex lighting conditions, severe vehicle occlusion, and differing monitoring perspectives all affect the detection performance of the model. Therefore, it is crucial to consider a range of complex or extreme environmental factors prior to deployment to ensure that the model can handle unexpected situations and complex traffic environments.