Article

Real-Time Monitoring Method for Traffic Surveillance Scenarios Based on Enhanced YOLOv7

1 Navigation College, Jimei University, Xiamen 361012, China
2 Navigation College, Xiamen Ocean Vocational College, Xiamen 361012, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7383; https://doi.org/10.3390/app14167383
Submission received: 3 July 2024 / Revised: 19 August 2024 / Accepted: 20 August 2024 / Published: 21 August 2024
(This article belongs to the Section Transportation and Future Mobility)

Abstract

Due to the impact of scale variation of vehicle targets and changes in traffic environments in large-scale traffic monitoring systems, vehicle target detection methods often face challenges. To improve the adaptability of detection methods to these variations, we propose an enhanced YOLOv7 for traffic systems (ETS-YOLOv7). To mitigate the effects of complex environments, we introduced the convolutional block attention module (CBAM) into the YOLOv7 framework, which filters important features in both channel and spatial dimensions, thereby enhancing the model’s capability to recognize traffic object features. To address the influence of aspect ratio variations in vehicle targets, we replaced the original complete intersection over union (CIoU) loss with wise intersection over union v3 (WIoUv3), eliminating the aspect ratio consistency loss and enhancing the model’s ability to generalize and its overall performance. Additionally, we employed the compact layer aggregation networks (CLAN) module to replace the efficient layer aggregation networks (ELAN) module, reducing redundant computations and improving computational efficiency without compromising model accuracy. The proposed method was validated on the large-scale traffic monitoring dataset UA-DETRAC, achieving a mean average precision (mAP0.5–0.95) of 90.2%, which is a 3% improvement over the original YOLOv7. The frames per second (FPS) reached 149, demonstrating that the proposed model is highly competitive in terms of detection efficiency and vehicle detection accuracy compared to other advanced object detection methods.

1. Introduction

As artificial intelligence continues to evolve, the necessity for reliable, timely, and robust traffic monitoring methods has become increasingly apparent in the perception layer of intelligent transportation systems (ITS). These methods can assist in alleviating traffic congestion and preventing accidents [1]. While some researchers focus on vehicle target detection and depth estimation using on-board monitoring [2], our study emphasizes perception methods based on roadside monitoring, which more closely aligns with the development of integrated traffic solutions. The accurate detection of vehicle targets in surveillance video has emerged as a pivotal research area within the field of intelligent transportation systems. Nevertheless, a number of practical limitations impede the advancement of object detection methods for traffic surveillance. These challenges include variations in object scale within traffic scenes, different viewpoints of surveillance cameras, occlusion between different vehicle targets, the effects of weather and lighting conditions, and the computational limitations of edge devices.
In large-scale traffic monitoring systems, both straight lanes and intersections are monitored. The varying positions of vehicle targets within these scenes, along with the complex backgrounds influenced by weather and lighting conditions, may adversely impact the effectiveness of traffic object detection. To address these challenges, current research focuses on several key strategies: improving feature pyramids [3], introducing attention mechanisms, modifying loss functions [4], and incorporating small object detection layers [5].
Another challenge in detecting traffic in surveillance video is the computational limitations of roadside units (RSUs). The high costs of installing, maintaining, and managing RSUs limit their widespread use. As a result, each RSU covers a large service area, which makes deploying large models impractical [6]. Reducing the computational load of models often leads to a decrease in model accuracy, which affects detection performance. Common approaches to address this issue include replacing the backbone network with lightweight alternatives [7], using lightweight feature extraction structures [8], or reducing the width and depth of the network [9]. Most mainstream object detection algorithms proposed in recent years improve detection accuracy in difficult scenarios by stacking additional modules, which increases the depth and width of the network. In this paper, we seek to strike a balance between model accuracy and computational efficiency. To this end, we propose a model that addresses the aforementioned challenges and improves the performance of traffic object detection in large-scale traffic surveillance systems. The main contributions of this research are as follows:
(1) We have embedded the CBAM module into the YOLOv7 network structure. The spatial and channel attention mechanisms within the module enable the network to extract the distinctive features of vehicle targets and effectively eliminate the interference from complex backgrounds.
(2) In terms of the loss function, we have replaced YOLOv7’s CIoU with WIoUv3. CIoU incorporates the aspect ratio similarity of the bounding boxes into the penalty term. However, in large-scale traffic monitoring systems, the same vehicle can have different aspect ratios from different angles and states of motion, which can reduce the accuracy of the model when using CIoU. WIoUv3 avoids this problem and addresses the imbalance between easy and hard samples.
(3) Considering the computational limitations of edge devices in ITS, we proposed CLAN to reduce redundant computations. This module replaces the ELAN module in the YOLOv7 backbone network, reducing computational load and model size without compromising model accuracy.
The remainder of the paper is organized as follows: Section 2 presents a review of pertinent studies. Section 3 provides a detailed account of the framework of this study and the methodology employed in each step. Section 4 presents the experimental data and methodology and offers a discussion of the results obtained from the study. The final section offers a summary of the study and makes recommendations for future research.

2. Related Works

2.1. General Object Detection

Image-based object detection algorithms represent fundamental yet challenging research issues in the field of computer vision. These algorithms can be classified into traditional object detection methods [10,11,12,13] and deep learning-based object detection methods [14,15,16]. Traditional object detection techniques suffer from poor robustness, high time complexity, and redundant detection windows, which limit the improvement of detection performance. In 2014, Girshick et al. proposed R-CNN, which uses convolutional networks for feature extraction, and demonstrated excellent performance on the PASCAL VOC Challenge datasets [17]. Since then, research on image-based object detection has rapidly progressed towards deep learning. In the development of deep learning-based object recognition algorithms, two types of detection frameworks have emerged: two-stage object detection and one-stage object detection [18]. In 2020, a new branch of research emerged in object detection. Facebook introduced the Transformer mechanism into object detection and proposed the DETR method [19]. However, DETR struggles with high resolution images and small object detection, which has limited its application in engineering.
The two-stage object detection process begins with the identification of potential regions of interest, followed by the extraction, classification, and localization of the targets. Although the detection performance has been significantly enhanced in comparison to traditional methods, the original two-stage detection algorithm, R-CNN, had a complex training process with high time and space complexity [17]. SPP-Net proposed the use of spatial pyramid pooling layers to address the issue of target loss or distortion due to scaling, as well as to handle images of arbitrary sizes and scales [20]. This method achieved detection accuracy comparable to that of R-CNN while simultaneously enhancing the speed of both training and detection. However, both methods shared a common limitation: they necessitated multi-stage training to achieve the object detection task. Subsequently, Fast R-CNN [21] and Faster R-CNN [22] were proposed as solutions to address these problems in a sequential manner. Faster R-CNN, as a region proposal network with a fully convolutional network structure (RPN), overcame the limitations of selective search and effectively accelerated the generation of candidate boxes. Subsequent algorithms, such as FPN [23] and Mask R-CNN [24], further augmented the components of Faster R-CNN and enhanced the performance of two-stage object detection methods. Currently, Faster R-CNN remains a highly competitive algorithm in the field of object detection.
One-stage object detection algorithms differ from two-stage algorithms in that they do not include a candidate region search step. This eliminates the need for a separate search phase, which is crucial for real-time detection and is well-suited to a variety of scenarios. The first one-stage object detection algorithm, You Only Look Once (YOLO), was proposed in 2015 and applied a single neural network to the entire image. This further improved the real-time performance of detection, although the accuracy slightly decreased [16]. This method divided the input image into a grid of cells for prediction, but due to the limited number of targets within each cell, it struggled to effectively detect dense and small objects. Subsequent versions, such as SSD [25], RetinaNet [26], and later versions of YOLO [27,28,29], focused more on addressing these issues.

2.2. Vehicle Object Detection

The detection of vehicles based on computer vision represents a fundamental element of intelligent transportation systems. Conventional computer vision techniques rely on the differentiation of vehicle objects from a fixed background through the analysis of their motion. These methods can be divided into three main categories [30]: background subtraction, which uses differences between the current image and background images [31]; frame differencing, which relies on pixel intensity differences between consecutive images [32]; and optical flow methods, which use the motion direction and speed of pixels in an image sequence [33]. However, these methods are unable to detect vehicle types. In addition, under low-light conditions, it is difficult to extract vehicle edges or detect moving vehicles, resulting in low detection accuracy and affecting the further use of detection results. The effectiveness of conventional vehicle detection methods relies on the quality of the background model; when there are changes in illumination, periodic movements in the background, slow-moving vehicles, or complex application scenarios, these methods do not perform well. With the development of deep learning, many researchers have conducted studies on traffic object detection in complex traffic scenes. Huang et al. proposed DC-SPP-YOLO based on dense connections and spatial pyramid pooling [34]. This model demonstrated excellent performance on the UA-DETRAC vehicle dataset, with a 2.25% improvement in mAP, though the FPS decreased by 8.3%. Xu et al. restructured the YOLOv3 backbone network and designed a lightweight traffic object detection algorithm [35]. Although the model is still larger than YOLOv3-tiny, its mAP on the KITTI dataset outperforms both YOLOv3-tiny and YOLOv3. Li et al. introduced SCD-YOLO [36], which reduces the model’s parameters by 44.4% through pruning and knowledge distillation, with only a 0.5% drop in mAP on the BIT-vehicle dataset. Li et al. improved the YOLOv3-tiny algorithm [37]. To meet real-time requirements, they significantly increased detection speed through model compression and simplification. However, the detection accuracy on their proprietary vehicle detection dataset dropped by 9.1%, and the model’s generalization ability decreased, making accurate detection difficult in complex traffic scenarios such as shadows, rain, and fog. These methods are significant for vehicle detection, but they still have the following issues: (1) Although most approaches improve detection accuracy in certain cases, they often involve large model parameters and high computational complexity. (2) Some lightweight models effectively reduce model parameters but fail to achieve a balance between accuracy and speed.
Compared with the existing methods, this paper simultaneously addresses the challenges of target scale variation and complex backgrounds in traffic surveillance by adopting the WIoU loss and introducing attention modules into the backbone network. A compact layer aggregation network is proposed for feature extraction to improve the computational efficiency of the model. This approach achieves a balance between model size and accuracy.

3. ETS-YOLOv7 Algorithm

We propose ETS-YOLOv7 for traffic surveillance, building on the YOLOv7 model due to its advantages of fast detection speed, high accuracy, and ease of deployment on edge devices. In this section, we present a comprehensive description of the ETS-YOLOv7 algorithm from four perspectives: algorithm overview, multidimensional attention mechanisms, loss function enhancement, and compact layer aggregation networks.

3.1. Algorithm Overview

The framework of the proposed ETS-YOLOv7 algorithm is shown in Figure 1, where YOLOv7 [29] is chosen as the backbone network. To reduce the influence of complex backgrounds and improve the model’s ability to extract vehicle features, channel attention and spatial attention mechanisms are applied sequentially to enhance the focus on different dimensional channel information and vehicle scale features, forming the CBAM attention mechanism [38]. This improved feature is then fed into the neck network. In addition, in large-scale traffic monitoring systems, the aspect ratio of the same vehicle in surveillance videos may vary significantly due to its driving state and monitoring angle. To avoid the larger aspect ratio penalty inherent in the standard CIoU [39] loss function, we adopt the WIoUv3 [40] loss function, which features a dynamic non-monotonic focusing mechanism (FM) and uses outlier degree as a criterion. This loss function also mitigates the impact of high- and low-quality anchor boxes and increases the network’s attention to medium-quality anchor boxes. Finally, we replace the original ELAN module [41] with the CLAN module to adapt to edge devices, reduce redundant computation, and improve network robustness and computation efficiency. The CBAM, the WIoUv3 loss function, and the CLAN module are discussed in detail in the following sections.

3.2. Multidimensional Attention Mechanism

To enhance vehicle feature representation in complex traffic environments and improve the saliency of these features, attention mechanisms have been integrated into the backbone network. For example, the channel attention mechanism of the SE module only considers the saliency of each channel while ignoring the features of the spatial region [42]. On the other hand, the spatial attention mechanism optimizes the interest representation in neural networks by exploiting the different spatial relationships between feature maps and allocating computational resources to more important spatial feature locations [43]. Current research indicates that both channel and spatial attention mechanisms can enhance the feature extraction capability of the method. To comprehensively consider the semantic information from both aspects, the CBAM attention module is introduced. This module serially generates feature map information in both channel and spatial dimensions, thus enhancing the feature extraction capability of the network.
As shown in Figure 2, the first step is to generate the channel attention feature $M_C$, whose learned weights are multiplied by the original feature map, enhancing effective channel features and weakening ineffective ones. The second step is the generation of the spatial attention feature $M_S$, which is multiplied element-wise with the channel-refined feature map $M_C(F) \times F$; this enhances the saliency of the vehicle features at the spatial level and outputs the final attention-enhanced feature map. The channel attention module first aggregates spatial information through global average pooling and global max pooling operations, yielding the corresponding pooled features. These two channel-level features are then fed into a shared network consisting of a single-hidden-layer multi-layer perceptron (MLP). The results are summed to obtain the channel attention features. The formula is as follows:
$M_C(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$
where F is the input feature map, MLP is the multi-layer perceptron, and σ is the sigmoid function.
The spatial attention module performs global average pooling and global max pooling operations along the channel dimension to derive corresponding features. These two features are then fused by a convolution operation to obtain the final features. The formula is as follows:
$M_S(F) = \sigma\left(f^{7\times 7}\left(\left[\mathrm{AvgPool}(M_C(F) \times F);\ \mathrm{MaxPool}(M_C(F) \times F)\right]\right)\right)$
where σ is the sigmoid function and $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel.
We inserted the CBAM modules after the second, third, and fourth CLAN modules in the backbone network. Thus, before the feature maps are fed into the neck network, we assign weights to both high-dimensional and low-dimensional features from both channel and spatial perspectives, allowing the model to focus more on vehicle characteristics and reduce the influence of complex backgrounds.
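To make this concrete, below is a minimal PyTorch sketch of a CBAM block implementing the two formulas above. The channel-reduction ratio of 16 and the use of 1 × 1 convolutions for the shared MLP are common CBAM defaults and are assumptions here, not necessarily the exact configuration used in ETS-YOLOv7.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(  # shared single-hidden-layer MLP as 1x1 convolutions
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling
        return torch.sigmoid(avg + mx)                            # channel weights, shape (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """M_S(F') = sigmoid(f7x7([AvgPool(F'); MaxPool(F')])), pooling along the channel axis."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)                      # (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)                     # (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial weights, (B, 1, H, W)

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied multiplicatively."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ca(x) * x      # channel-refined feature M_C(F) x F
        return self.sa(x) * x   # spatially refined output
```

In a YOLOv7-style backbone, such a block is simply applied to the output of the selected stages; it leaves the tensor shape unchanged, so it can be inserted without altering downstream layers.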

3.3. Loss Function Optimization

By default, YOLOv7 uses CIoU as the loss function, which comprehensively considers the intersection over union (IoU), the distance between the center points, and the consistency of the aspect ratio. However, it does not consider the balance between hard and easy samples, and its aspect ratio term can introduce significant penalties in traffic monitoring systems. Therefore, we introduce the WIoUv3 loss function with a dynamic non-monotonic FM. This loss function employs the outlier degree, rather than IoU, to assess the quality of anchor boxes and introduces a non-monotonic gradient gain allocation strategy. This reduces the impact of both hard and easy samples on the results, giving more attention to samples of moderate difficulty. IoU is employed to quantify the degree of overlap between the anchor box and the ground truth box in object detection tasks. It eliminates the influence of the bounding box size in a proportional way, allowing the model to effectively balance the learning of large and small objects when using IoU as the bounding box regression loss. The $L_{\mathrm{IoU}}$ formula is as follows:
$L_{\mathrm{IoU}} = 1 - \mathrm{IoU}.$
In order to reduce the penalty introduced by the geometric parameter metrics when the anchor box and the ground truth box overlap significantly, Tong et al. [40] constructed a distance attention mechanism based on the distance between the center coordinates, forming WIoUv1.
$L_{\mathrm{WIoUv1}} = R_{\mathrm{WIoU}} \cdot L_{\mathrm{IoU}},$
$R_{\mathrm{WIoU}} = \exp\left(\dfrac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right).$
where $W_g$ and $H_g$ represent the width and height of the smallest box enclosing the anchor box and the ground truth box. To prevent $W_g$ and $H_g$ in the denominator from producing negative gradients and hindering convergence during gradient computation, they are treated as constants and do not participate in backpropagation (indicated by the superscript *). Here, $x$ and $y$ denote the center coordinates of the anchor box, $x_{gt}$ and $y_{gt}$ represent the center coordinates of the ground truth box, and exp(·) denotes the exponential operation.
The outlier degree $\beta$ of the anchor box is given by the ratio of $L_{\mathrm{IoU}}$ to $\overline{L_{\mathrm{IoU}}}$. An excessively large or small outlier degree indicates low-quality or high-quality anchor boxes, respectively. Smaller gradient gains are assigned to these anchor boxes to prevent them from having a significant negative effect on the gradients.
$\beta = \dfrac{L_{\mathrm{IoU}}}{\overline{L_{\mathrm{IoU}}}},$
$r = \dfrac{\beta}{\delta \alpha^{\beta - \delta}}.$
where $\overline{L_{\mathrm{IoU}}}$ is the moving average of $L_{\mathrm{IoU}}$ with momentum, $r$ is the non-monotonic focusing coefficient, and $\alpha$ and $\delta$ are hyperparameters. Applying this to WIoUv1 yields WIoUv3:
$L_{\mathrm{WIoUv3}} = r \cdot L_{\mathrm{WIoUv1}}.$
As a non-monotonic focusing coefficient, $r$ can adaptively adjust the gradient gain strategy while $\overline{L_{\mathrm{IoU}}}$ changes dynamically. Therefore, the criterion for judging the quality of the anchor boxes also changes dynamically. This allows WIoUv3 to select the most appropriate gradient gain strategy in different situations.
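For reference, the following is a simplified PyTorch sketch of the WIoUv3 computation described above, for boxes given in (x1, y1, x2, y2) format. The values α = 1.9, δ = 3, and the momentum used for the running mean of $L_{\mathrm{IoU}}$ are plausible defaults and are assumptions, not the exact settings of this work.

```python
import torch

class WIoUv3Loss:
    """Sketch of WIoUv3 for boxes in (x1, y1, x2, y2) format.
    alpha, delta, and the running-mean momentum are assumed values."""

    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of L_IoU, updated each call

    def __call__(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # IoU and L_IoU = 1 - IoU
        lt = torch.max(pred[:, :2], target[:, :2])
        rb = torch.min(pred[:, 2:], target[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        iou = inter / (area_p + area_t - inter + 1e-7)
        l_iou = 1.0 - iou

        # Distance attention R_WIoU: squared center distance over the squared diagonal
        # of the smallest enclosing box; the denominator is detached, acting as a constant (*).
        cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
        enclose_lt = torch.min(pred[:, :2], target[:, :2])
        enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
        wg, hg = (enclose_rb - enclose_lt).unbind(dim=1)
        r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2)
                           / (wg ** 2 + hg ** 2 + 1e-7).detach())

        # Outlier degree beta and non-monotonic focusing coefficient r
        beta = l_iou.detach() / self.iou_loss_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))

        # Update the running mean of L_IoU with momentum
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * l_iou.mean().item())

        # L_WIoUv3 = r * R_WIoU * L_IoU, averaged over the batch of boxes
        return (r * r_wiou * l_iou).mean()
```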

3.4. Compact Layer Aggregation Networks

In the field of vehicle target detection, due to the computational constraints of edge devices, it is often necessary to limit the size of models to improve their universality. Therefore, considering both model accuracy and computational complexity, the CLAN lightweight module is proposed, as shown in Figure 3.
In the original YOLOv7 network framework, the ELAN module serves as a feature extraction unit, consisting of 2D convolutions, batch normalization, and the SiLU activation function. While this module effectively enhances the network’s feature extraction capability, it also incurs significant computational overhead. In response to this problem, this paper proposes a compact layer aggregation network that replaces certain 2D convolutions with depthwise separable convolution (DSConv) [44], optimizes network gradients, and preserves network width, thereby balancing computational efficiency and model accuracy. DSConv is a combination of depthwise convolution and pointwise convolution. As shown in Figure 4a, depthwise convolution extracts feature maps for each channel using a dedicated convolution kernel, thus maintaining the same number of feature maps as the channels in the input layer. However, this operation does not take into account the spatial correlation between different channels at the same location, so pointwise convolution is required to combine these feature maps into a new one. Pointwise convolution, shown in Figure 4b, is similar to 2D convolution but uses a kernel of size 1 × 1 × Cin, where Cin represents the number of input feature map channels, and can expand the feature maps along the channel dimension. This operation weighs the feature maps along the channel direction, ultimately producing feature maps with the same dimensions as those from 2D convolution.
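Before turning to the complexity analysis, here is a minimal PyTorch sketch of the DSConv block on which CLAN is built. Wrapping it with batch normalization and SiLU mirrors the Conv-BN-SiLU blocks of ELAN but is an assumption; the full CLAN topology of Figure 3 is not reproduced here.

```python
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: per-channel depthwise conv followed by a
    1x1 pointwise conv that mixes channels. BatchNorm + SiLU are assumed wrappers."""
    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size, stride,
                                   padding=kernel_size // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```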
When performing convolution operations with Cin input channels, Cout output channels, and an output feature map of size H × W, the 2D convolution layer consists of Cout filters, where each filter contains Cin convolution kernels of size k × k. At this point, the computational complexity of 2D convolution is as follows:
$\mathrm{FLOPs}_{\mathrm{Conv}} = 2k^2 \times H \times W \times C_{in} \times C_{out}.$
The number of parameters in 2D convolution at this point is as follows:
$P_{\mathrm{Conv}} = k^2 \times C_{in} \times C_{out}.$
In depthwise separable convolution, when handling the aforementioned convolution operations, the first step involves depthwise convolution, which conducts convolution operations separately for each channel. The computational complexity is the sum of the computational complexities of each input channel:
$\mathrm{FLOPs}_{\mathrm{DepthwiseConv}} = 2k^2 \times 1 \times H \times W \times C_{in}.$
The number of parameters is as follows:
$P_{\mathrm{DepthwiseConv}} = k^2 \times C_{in} \times 1.$
The second step is pointwise convolution, which performs convolution on the output feature map obtained from the first step on a per-pixel basis. The kernel size is 1 × 1 × Cin and the computational complexity is as follows:
$\mathrm{FLOPs}_{\mathrm{PointwiseConv}} = 2 \times 1^2 \times C_{in} \times H \times W \times C_{out}.$
The number of parameters is as follows:
$P_{\mathrm{PointwiseConv}} = 1^2 \times C_{in} \times C_{out}.$
Thus, the total computational complexity of the depthwise separable convolution is as follows:
$\mathrm{FLOPs}_{\mathrm{DSConv}} = 2(k^2 + C_{out}) \times H \times W \times C_{in}.$
The total number of parameters is as follows:
$P_{\mathrm{DSConv}} = C_{in} \times (k^2 + C_{out}).$
When the output channel Cout = 256 and the kernel size k = 3, the computational complexity and the number of parameters of depthwise separable convolution are both only (k² + Cout)/(k² × Cout) = 265/2304, i.e., roughly 11.5%, of those of 2D convolution. From the above analysis, replacing 2D convolution with DSConv can effectively reduce both computational complexity and parameter count. Meanwhile, the new CLAN module shortens the network depth while maintaining the original network width. All of these operations significantly reduce the computational burden and memory requirements while ensuring model accuracy, thus improving the feasibility of deployment on edge devices.
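As a quick sanity check of the ratio quoted above, the short script below evaluates the FLOPs and parameter formulas from this subsection for k = 3 and Cout = 256; the spatial size and input channel count cancel out of the ratios, so placeholder values are used.

```python
def conv_flops(k, h, w, c_in, c_out):
    return 2 * k**2 * h * w * c_in * c_out

def dsconv_flops(k, h, w, c_in, c_out):
    depthwise = 2 * k**2 * h * w * c_in          # one k x k kernel per input channel
    pointwise = 2 * 1**2 * c_in * h * w * c_out  # C_out kernels of size 1 x 1 x C_in
    return depthwise + pointwise

def conv_params(k, c_in, c_out):
    return k**2 * c_in * c_out

def dsconv_params(k, c_in, c_out):
    return k**2 * c_in + c_in * c_out

# Placeholder spatial size and input channels; they cancel in the ratios.
k, h, w, c_in, c_out = 3, 80, 80, 128, 256
print(dsconv_flops(k, h, w, c_in, c_out) / conv_flops(k, h, w, c_in, c_out))  # 265/2304 ≈ 0.115
print(dsconv_params(k, c_in, c_out) / conv_params(k, c_in, c_out))            # 265/2304 ≈ 0.115
```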

4. Experimental Analysis

4.1. Experimental Platform and Dataset

All of our detection experiments were conducted on the Windows platform using an i7-12700F CPU, 32 GB of memory, and an NVIDIA GeForce RTX 3060 GPU processor with 12 GB of memory. The experimental framework was built using the Python programming language. The deep learning development environment includes Python 3.9, PyCharm 2023, Pytorch-GPU 2.0.0, CUDA 11.8, and cuDNN 8.0.
We validated the vehicle detection algorithm using the UA-DETRAC [45] dataset. This dataset belongs to large-scale traffic monitoring system datasets, where all the images of 100 challenging video clips are captured by low-position ordinary surveillance cameras in real scenes. The dataset covers four vehicle categories (car, bus, van, and others), with a total of 8250 vehicles. The scenes cover daytime, nighttime, sunny, cloudy, and rainy weather conditions, with a total of 1.21 million annotated bounding boxes and 138,252 images. The dataset is randomly divided into training, validation, and test sets in a 6:2:2 ratio, with 82,950 images in the training set and 27,651 images each in the validation and test sets. The distribution of weather conditions in each group is shown in Figure 5a, and the distribution of vehicle targets is shown in Figure 5b. The proportions of different weather conditions and vehicle targets in the training, validation, and test sets are consistent, which improves the model’s generalization capability and ensures algorithm robustness.
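For illustration, a minimal sketch of a 6:2:2 random split over a list of image paths is shown below. It is a naive random split under the stated ratio and does not reproduce the weather-stratified distributions of Figure 5 or the exact split used in this study.

```python
import random

def split_dataset(image_paths, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split a list of image paths into train/val/test subsets in a 6:2:2 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```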

4.2. Experimental Setup and Evaluation Metrics

Based on the computing power of the GPU and the parameter adjustment process, the initial parameters of the YOLOv7 model are set as follows: epoch = 100, batch_size = 16, momentum = 0.937, first learning rate = 0.01, last learning rate = 0.1, warmup_epochs = 10, and weight_decay = 0.005. During the training process, the learning rate gradually increases from 0.01 to 0.1 in a linear fashion to help the model adapt to the training process. To compare with other advanced methods, we used the mAP@50-95 and FPS as the evaluation metrics of the model performance.
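For reference, the reported settings map roughly onto a YOLOv7-style hyperparameter dictionary as sketched below; the key names and the interpretation of the first/last learning rates as lr0 and lrf are assumptions.

```python
# Assumed mapping of the reported training settings onto YOLOv7-style hyperparameter names.
train_cfg = {
    "epochs": 100,
    "batch_size": 16,
    "momentum": 0.937,
    "lr0": 0.01,           # reported "first learning rate"
    "lrf": 0.1,            # reported "last learning rate"
    "warmup_epochs": 10,
    "weight_decay": 0.005,
}
```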

4.3. Experiments

4.3.1. Ablation Study

To validate the detection improvement effect of the above methods, experiments based on the YOLOv7 network structure were conducted using the UA-DETRAC dataset.
We trained four different models and analyzed the effects of different combinations of improvements. The experimental results are presented in Table 1. After integrating the CBAM attention mechanism, mAP increased by 1.7%. This suggests that the addition of channel and spatial attention modules can generate more specific features. However, the increase in computational complexity reduced the output by 21 FPS. After replacing the ELAN structure with the CLAN structure, the mAP increased further by 0.1% and the FPS increased by 14, demonstrating that this structure can significantly reduce computational complexity without significantly affecting model accuracy. Finally, substituting the CIoU with WIoUv3 resulted in a 3% increase in mAP over the original YOLOv7, with a decrease of only 7 FPS.
We also compared the performance of the original model and the enhanced model under different weather and lighting conditions, as well as their precision, recall, number of parameters, and computational complexity. The results, as shown in Table 2, demonstrate that ETS-YOLOv7 reduces the number of parameters by approximately 14% and the computational complexity by about 23%, while improving both precision and recall. Under four different weather and lighting conditions, the mAP of ETS-YOLOv7 consistently outperforms that of the original model, with the most significant improvements observed under cloudy and sunny conditions, showing increases of 4.4% and 3.6%, respectively. This experiment demonstrates that, regardless of image blurring caused by windy weather or brightness changes due to lighting, the proposed ETS-YOLOv7 can effectively detect and classify vehicle targets, showing robustness to varying lighting intensity and weather conditions.

4.3.2. Comparative Experiments

To verify the superiority of the proposed algorithm, we conducted a detection performance comparison among several advanced detection frameworks, namely Faster R-CNN [22] and RT-DETR [46]. Since the proposed ETS-YOLOv7 is based on the YOLO architecture, we also compared it with recent YOLO variants: YOLOv5s (r7.0) [28], YOLOv8l [47], and YOLOv9s [48]. The experimental results are shown in Table 3. All the listed detection frameworks were trained on the same UA-DETRAC dataset; the hyperparameters and experimental environment were kept consistent for a fair comparison.
As shown in Table 3, ETS-YOLOv7 achieves higher mAP and FPS compared to other networks such as Faster R-CNN, YOLOv5s, YOLOv8l, YOLOv9s, and RT-DETR-L. The YOLO series models perform particularly well in detecting the bus category. While the two-stage algorithm Faster R-CNN excels in detecting the car category, its performance in detecting the others category lags significantly behind the YOLO series, and its FPS is lower than that of the YOLO models. RT-DETR-L, a transformer-based object detection algorithm, does not perform well in vehicle detection and has a low FPS, indicating room for significant improvement in future research. Although ETS-YOLOv7’s precision for the car category is 1.6% lower than that of Faster R-CNN, it achieves the highest precision in the bus, van, and others categories, as well as the highest average precision and FPS. This balance between detection speed and accuracy makes it well-suited for large-scale traffic surveillance systems.
The detection results on the UA-DETRAC dataset for different models are shown in Figure 6. As seen in the first and second columns, Faster R-CNN, YOLOv5s, YOLOv8l, YOLOv9s, and RT-DETR-L exhibit varying degrees of missed detections for distant small objects. For the large bus target in the second column, RT-DETR-L shows a confidence score of only 0.3. In the third column, only YOLOv9s, RT-DETR-L, and ETS-YOLOv7 successfully detected the truncated vehicle target at the edge of the image, with RT-DETR-L detecting the most truncated targets but misclassifying a car as a bus. In the fourth column, despite differences in confidence scores, Faster R-CNN, YOLOv5s, and YOLOv8l all incorrectly identified the front part of a bus as a car. While some missed detections of small objects remain, the proposed algorithm effectively reduces both false alarm and missed detection rates and improves the detection of vehicle targets in dark and occluded conditions, confirming its accuracy and practicality.

5. Conclusions

In response to the robustness requirements of object detection in large-scale traffic monitoring systems, we propose ETS-YOLOv7 based on YOLOv7. By integrating three CBAM attention modules into the network, the model focuses on more effective spatial and channel features, thus enhancing feature extraction capabilities in intricate traffic scenarios. The use of WIoUv3 mitigates the influence of aspect ratio variations of vehicle targets on the loss calculation. In addition, we use the CLAN structure in the backbone network to reduce redundant computations while sustaining model accuracy, making the model more suitable for edge devices. Experiments on the UA-DETRAC dataset show that ETS-YOLOv7 outperforms the original YOLOv7 and other advanced methods, showing promise for applications in large-scale traffic monitoring systems.

This research has broader implications for the advancement of intelligent transportation systems, particularly in the realms of traffic management, safety, and operational efficiency. The proposed ETS-YOLOv7 model, with its optimized computational efficiency and adaptability to edge devices, demonstrates a high degree of potential scalability across various real-time monitoring systems. This model’s suitability for edge devices is particularly advantageous in scenarios where low latency and real-time processing are essential, such as autonomous vehicles and smart city infrastructure, which require the swift and accurate detection of traffic targets.

The present study predominantly focuses on the detection of four-wheeled vehicles and above. Future work will aim to enhance the model’s detection capabilities for smaller and more dynamic targets, such as motorcycles, non-motorized vehicles, and pedestrians. These smaller targets often present unique challenges due to their variability in size, speed, and behavior, as well as their frequent presence in congested urban environments. Improving the model’s ability to accurately detect these entities, especially under adverse conditions such as poor lighting, heavy rain, or fog, is critical for ensuring comprehensive traffic surveillance. By extending detection capabilities to a wider range of targets and improving performance under challenging conditions, this research could lead to the development of more robust and reliable traffic monitoring systems. Such systems would not only help in reducing accidents and improving traffic flow but could also contribute to the design of advanced driver assistance systems and fully autonomous vehicles. Moreover, these advancements are expected to significantly enhance traffic safety and vehicle avoidance strategies, which are important fields of intelligent transportation systems.

Author Contributions

Conceptualization, D.Y. and Z.Y.; methodology, Z.Y.; formal analysis, D.Y. and Z.Y.; investigation, Z.Y., Y.W., and X.W.; resources, D.Y. and Y.W.; data curation, Z.Y.; writing—original draft preparation, Z.Y.; writing—review and editing, D.Y. and Z.Y.; visualization, Z.Y.; supervision, D.Y. and Z.Y.; project administration, X.L.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Program of the National Social Science Foundation of China, grant number 23&ZD138.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors express their gratitude to Yu Yang and Wanli Peng for their invaluable support during the preparation of the manuscript and data processing.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bin, T.; Tran, M.B.; Ming, T.; Yuqiang, L.; Yanjie, Y.; Chao, G.; Dayong, S.; Shaohu, T. Hierarchical and Networked Vehicle Surveillance in ITS: A Survey. IEEE Trans. Intell. Transp. Syst. 2017, 18, 25–48. [Google Scholar]
  2. Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Lu, P.; Yang, K. A review of vehicle detection techniques for intelligent vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3811–3831. [Google Scholar]
  3. Zheng, H.; Liu, J.; Ren, X. Dim target detection method based on deep learning in complex traffic environment. J. Grid Comput. 2022, 20, 8. [Google Scholar]
  4. Wang, Z.; Zhang, X.; Li, J.; Luan, K. A YOLO-based target detection model for offshore unmanned aerial vehicle data. Sustainability 2021, 13, 12980. [Google Scholar] [CrossRef]
  5. Sun, S.; Mo, B.; Xu, J.; Li, D.; Zhao, J.; Han, S. Multi-YOLOv8: An infrared moving small object detection model based on YOLOv8 for air vehicle. Neurocomputing 2024, 588, 127685. [Google Scholar]
  6. Ni, Y.; He, J.; Cai, L.; Pan, J.; Bo, Y. Joint roadside unit deployment and service task assignment for Internet of Vehicles (IoV). IEEE Internet Things J. 2018, 6, 3271–3283. [Google Scholar]
  7. Zhao, X.; Zhang, W.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles. Drones 2024, 8, 161. [Google Scholar] [CrossRef]
  8. Yu, C.; Zhou, L.; Liu, B.; Zhao, Y.; Zhu, P.; Chen, L.; Chen, B. G-YOLO: A YOLOv7-based target detection algorithm for lightweight hazardous chemical vehicles. PLoS ONE 2024, 19, e0299959. [Google Scholar]
  9. Wang, R.; Wang, Z.; Xu, Z.; Wang, C.; Li, Q.; Zhang, Y.; Li, H. A real-time object detector for autonomous vehicles based on YOLOv4. Comput. Intell. Neurosci. 2021, 2021, 9218137. [Google Scholar] [PubMed]
  10. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar]
  11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar]
  12. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; p. 1984. [Google Scholar]
  13. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar]
  18. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar]
  19. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar]
  21. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
  23. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  24. He, K.M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37. [Google Scholar]
  26. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  27. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv preprint 2018, arXiv:1804.02767. [Google Scholar]
  28. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D. ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo 2022. [Google Scholar] [CrossRef]
  29. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  30. Abdulrahim, K.; Salam, R.A. Traffic surveillance: A review of vision based vehicle detection, recognition and tracking. Int. J. Appl. Eng. Res. 2016, 11, 713–726. [Google Scholar]
  31. Manikandan, R.; Ramakrishnan, R. Video object extraction by using background subtraction techniques for sports applications. Digit. Image Process. 2013, 5, 435–440. [Google Scholar]
  32. Baker, S.; Scharstein, D.; Lewis, J.P.; Roth, S.; Black, M.J.; Szeliski, R. A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 2011, 92, 1–31. [Google Scholar]
  33. Liu, Y.; Lu, Y.; Shi, Q.; Ding, J. Optical flow based urban road vehicle tracking. In Proceedings of the 9th International Conference on Computational Intelligence and Security (CIS), Emeishan, China, 14–15 December 2013; pp. 391–395. [Google Scholar]
  34. Huang, Z.; Wang, J.; Fu, X.; Yu, T.; Guo, Y.; Wang, R. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci. 2020, 522, 241–258. [Google Scholar]
  35. Xu, H.; Guo, M.; Nedjah, N.; Zhang, J.; Li, P. Vehicle and pedestrian detection algorithm based on lightweight YOLOv3-promote and semi-precision acceleration. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19760–19771. [Google Scholar]
  36. Li, H.; Zhuang, X.; Bao, S.; Chen, J.; Yang, C. SCD-YOLO: A lightweight vehicle target detection method based on improved YOLOv5n. J. Electron. Imaging 2024, 33, 023041. [Google Scholar]
  37. Li, L.; Liang, Y. Deep learning target vehicle detection method based on YOLOv3-tiny. In Proceedings of the 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021. [Google Scholar]
  38. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11211, pp. 3–19. [Google Scholar]
  39. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar]
  40. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv preprint 2023, arXiv:2301.10051. [Google Scholar]
  41. Wang, C.-Y.; Liao, H.-Y.M.; Yeh, I.-H. Designing network design strategies through gradient path analysis. arXiv preprint 2022, arXiv:2211.04800. [Google Scholar]
  42. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  43. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An Empirical Study of Spatial Attention Mechanisms in Deep Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6687–6696. [Google Scholar]
  44. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  45. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.-C.; Qi, H.; Lim, J.; Yang, M.-H.; Lyu, S. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [Google Scholar]
  46. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  47. Glenn, J.; Ayush, C.; Jing, Q. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 2 July 2024).
  48. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
Figure 1. ETS-YOLOv7 network framework diagram.
Figure 2. Convolutional block attention module (CBAM). (a) Channel attention module; (b) spatial attention module.
Figure 3. Comparison between ELAN and CLAN. (a) ELAN structure; (b) CLAN structure.
Figure 4. Depthwise separable convolution structure. (a) Depthwise convolution; (b) pointwise convolution.
Figure 5. Results of dataset distribution visualization. (a) Distribution of different weather conditions; (b) distribution of different detection targets.
Figure 6. Comparison of different detection methods. (a) Faster R-CNN; (b) YOLOv5s; (c) YOLOv8l; (d) YOLOv9s; (e) RT-DETR-L; (f) ETS-YOLOv7.
Table 1. Impact of different components on detection effectiveness.

CBAM | CLAN | WIoUv3 | mAP@0.5:0.95 (%) | FPS
–    | –    | –      | 87.2             | 156
✓    | –    | –      | 88.9             | 135
✓    | ✓    | –      | 89.0             | 149
✓    | ✓    | ✓      | 90.2             | 149
Table 2. Performance comparison between YOLOv7 and ETS-YOLOv7.

Methods    | mAP@0.5:0.95 (%)               | Precision (%) | Recall (%) | Parameters | GFLOPs
           | Rainy | Cloudy | Sunny | Night |               |            |            |
YOLOv7     | 89.3  | 85.5   | 82.9  | 86.1  | 93.77         | 97.69      | 36,795,723 | 103.7
ETS-YOLOv7 | 91.7  | 88.9   | 86.5  | 88.2  | 94.29         | 98.93      | 31,658,443 | 80.7
Table 3. Comparison of different detection methods.

Methods        | AP@0.5:0.95 (%)              | mAP@0.5:0.95 (%) | FPS
               | Car  | Bus  | Van  | Others |                  |
Faster R-CNN   | 86.9 | 88.3 | 83.8 | 79.1   | 84.5             | 50
YOLOv5s (r7.0) | 81.4 | 87.5 | 82.3 | 83.6   | 83.7             | 144
YOLOv8l        | 83.8 | 91.2 | 85.1 | 87.7   | 86.9             | 81
YOLOv9s        | 83.8 | 91.5 | 85.2 | 89.1   | 87.4             | 136
RT-DETR-L      | 74.2 | 81.1 | 73.2 | 74.6   | 75.3             | 48
ETS-YOLOv7     | 85.3 | 93.9 | 87.6 | 92.8   | 90.2             | 149
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yu, D.; Yuan, Z.; Wu, X.; Wang, Y.; Liu, X. Real-Time Monitoring Method for Traffic Surveillance Scenarios Based on Enhanced YOLOv7. Appl. Sci. 2024, 14, 7383. https://doi.org/10.3390/app14167383

