Article

SE-Lightweight YOLO: Higher Accuracy in YOLO Detection for Vehicle Inspection

School of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13052; https://doi.org/10.3390/app132413052
Submission received: 27 October 2023 / Revised: 3 December 2023 / Accepted: 5 December 2023 / Published: 7 December 2023

Abstract

Against the backdrop of ongoing urbanization, issues such as traffic congestion and accidents are becoming increasingly prominent, and practical interventions to enhance the efficiency and safety of transportation systems are urgently needed. A paramount challenge lies in realizing real-time vehicle monitoring, flow management, and traffic safety control within the transportation infrastructure in order to mitigate congestion, optimize road utilization, and curb traffic accidents. In response to this challenge, the present study applies advanced computer vision technology and deep learning algorithms to vehicle detection and tracking. The resulting recognition outcomes provide the traffic management domain with actionable insights for optimizing traffic flow management and signal light control through real-time data analysis. The proposed SE-Lightweight YOLO algorithm achieves a noteworthy 95.7% accuracy in vehicle recognition, and this research can serve as a reference for urban traffic management, laying the groundwork for a more efficient, secure, and streamlined transportation system in the future. Existing vehicle detection methods for vehicle type recognition still suffer from limited recognition and detection accuracy as well as slow detection speed. In this paper, we make innovative changes based on the YOLOv7 framework: we add the SE attention mechanism to the backbone module, and the model achieves better results, with a 1.2% improvement compared with the original YOLOv7. Meanwhile, we replace the SPPCSPC module with the SPPFCSPC module, which enhances the feature extraction of the model. We then apply SE-Lightweight YOLO to the field of traffic monitoring, where it can assist transportation personnel in traffic monitoring and contribute to building transportation big data. This research therefore has good application prospects.

1. Introduction

An intelligent transportation system (ITS) represents a crucial form of technology that efficiently addresses common traffic issues and enhances overall traffic safety [1]. It can be divided into four aspects: reducing travel time, ensuring traffic safety, relieving traffic congestion, and reducing traffic pollution. The ITS has the undeniable potential to ensure safer and smoother traffic on the road [2].
As the global population continues to grow and urbanization accelerates, problems such as road congestion and increased pollution emissions are becoming more serious, and the need for intelligent transportation is becoming more and more urgent. According to market monitoring data, the global intelligent transport market is expected to exceed USD 250 billion by 2025 [3]. The intelligent transportation industry concentration is relatively low, and the financial strength and scale of enterprises in the same industry are generally small. The market competition is also mainly concentrated in a certain regional scope and a small number of enterprises. The emerging Internet and IT giants having cross-border access to the intelligent transportation industry will greatly impact the competition and development of the industry; moreover, it is generally expected that the market concentration of the intelligent transportation industry will further increase in the future.
Ultimately, traffic congestion results from disruptions to the typical flow of vehicles, particularly those moving in a straight direction. A pivotal factor contributing to these disruptions is the switching of traffic light signals [4], which significantly hinders the smooth movement of most vehicles; congestion arises precisely when traffic light signals are not configured reasonably. Traffic signals are a silent commander on the road, playing an irreplaceable role in ensuring traffic safety, improving driving efficiency, and preventing congestion; in practice, however, their performance is often unsatisfactory.
Currently, most cities use “timed” traffic lights with a fixed red–green interval [5]. Because the lights are timed, their durations cannot be adjusted in real time according to changing road conditions. At intersections with heavy or fluctuating traffic flows, timed traffic lights therefore often lead to vehicle congestion. Given the complexity and variability of traffic conditions, only by adjusting traffic lights in real time according to the specific situation can driving safety and the efficient use of road resources both be taken into account [6].
With the rapid development of the Internet of Things and artificial intelligence, it has become possible to use information technology to adjust traffic signals according to real-time road conditions. Most traffic signals are equipped with high-definition cameras that can dynamically analyze the vehicles at an intersection from the images and videos they capture and automatically adjust the lights according to the real-time needs of different road sections. In this way, traffic congestion can be minimized while traffic safety is maintained. However, current recognition of vehicles and other road users at intersections still cannot reliably identify all types of vehicles and pedestrians, and recognition accuracy needs to be improved. For this reason, this paper proposes the SE-Lightweight YOLO model, which addresses the limitations of current research, namely low classification accuracy and a small number of classification categories. At the same time, improving the accuracy with which cameras recognize vehicles is of great significance for further ITS development and for improving traffic smoothness.
This paper proposes an effective method for vehicle recognition to address the problems based on the above analysis. The contributions of this paper are as follows:
  • To propose a target detection algorithm with higher accuracy that can also detect different types of vehicles.
  • To improve the spatial pyramid pooling module in the original framework of YOLOv7.
  • To verify the practical value of the proposed method by experimental results on real datasets, which can greatly improve recognition accuracy. It can also support the expectation of the model’s application in the future.
The rest of the paper is organized as follows. Section 2 describes the related background knowledge and provides the background to set the stage for the new, improved algorithm proposed later. Section 3 describes the related work of other researchers with regard to improving the YOLO algorithm. Section 4 provides a detailed description of the improved SE-Lightweight YOLOv7 framework. Section 5 applies the improved YOLO model in a real traffic scenario which achieves the desired results, confirming the practical application value of the model. Section 6 concludes the paper.

2. Brief Introduction to YOLOv7

2.1. YOLOv7 Object Detection Algorithm

As is well known, the YOLOv7 series has eight variants: YOLOv7-tiny, YOLOv7, YOLOv7-W6, YOLOv7-X, YOLOv7-E6, YOLOv7-E6E, and two further models that use leaky ReLU and SiLU, respectively, as their activation functions. YOLOv7 contains four modules: input, backbone, neck, and head.

2.1.1. Input

The input side of YOLOv7 follows the overall logic of YOLOv5 and does not introduce new tricks. It mainly calculates the scaling ratio between the native size of the image and the input size, resizes the image accordingly, and finally performs adaptive padding to obtain the final input image. After the image is input into the model, it is normalized and converted into a 640 × 640 × 3 tensor, which is then fed into the backbone.
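The scaling-and-padding logic described above can be sketched as follows; letterbox is an illustrative helper written for this description, not the exact function from the YOLOv7 codebase, and the gray padding value of 114 is an assumption.

```python
import cv2
import numpy as np

def letterbox(img, new_size=640, pad_value=114):
    """Scale an image to fit a new_size x new_size canvas while keeping its
    aspect ratio, then pad the remainder with a constant gray value."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)              # scaling ratio
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    top = (new_size - resized.shape[0]) // 2             # adaptive padding
    left = (new_size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale

# Usage (with a synthetic frame standing in for a camera image):
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, s = letterbox(frame)   # 640 x 640 x 3; normalized later and fed to the backbone
```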

2.1.2. Backbone

In YOLOv7, the primary function of the backbone network is to extract features from the input image. Feature extraction is a critical step that transforms the original image into a series of feature maps with high-level semantic information that will be used in subsequent target detection tasks.

2.1.3. Neck

The neck acts as a bridge between the backbone and head modules and plays a crucial role in target detection: it performs multi-scale feature fusion and further feature processing. One of its main functions is to fuse feature maps from different layers of the backbone network. Since the backbone usually produces multiple feature maps at different layers, the neck fuses them so that the model can process information at multiple scales simultaneously, which is important for detecting targets of different sizes.

2.1.4. Head

The detection head of the target detection model is a vital component of the entire YOLOv7 architecture and is responsible for generating the final target detection results. The detailed functions of the detection head in YOLOv7 are as follows:
  • Bounding Box Regression
    The target’s position in the input image is usually represented by bounding box coordinates. For each detection box, the head generates four values representing the coordinates of the bounding box’s upper left and lower right corners.
  • Class Classification
    Another critical function is categorizing the detected targets. The detection head generates a category probability distribution for each detection box, indicating the probability that the target belongs to each possible category. Usually, the model selects the category with the highest probability as the predicted category.
  • Object Confidence Estimation
    In addition to location and category, the detection head generates a target confidence score, which indicates the model’s confidence that each bounding box contains a valid target. This score is typically used to filter out low-confidence detections.
  • Other functions
In multi-scale target detection, the head must process information from feature maps of different layers and sizes and merge this information to detect both small and large targets. The head also applies activation functions for the classification and regression tasks, such as sigmoid for confidence score generation and softmax for category probability calculation. Finally, the head computes the loss functions for the target detection task, which typically include location regression loss, category classification loss, and confidence loss; these are used to train the model to optimize detection accuracy. The detailed framework of YOLOv7 is shown in Figure 1 below.
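As a rough sketch of the post-processing described above, the following assumes a hypothetical tensor layout of 4 box values, 1 objectness logit, and 6 class logits per prediction; the confidence score uses a sigmoid and the class probabilities a softmax, as stated in the text.

```python
import torch

def decode_predictions(raw, conf_thresh=0.25):
    """raw: (N, 11) tensor per image -- 4 box values, 1 objectness logit,
    and 6 class logits (person, car, bus, motorbike, bicycle, truck)."""
    boxes = raw[:, :4]                             # bounding-box regression output
    obj_conf = torch.sigmoid(raw[:, 4])            # objectness confidence
    cls_prob = torch.softmax(raw[:, 5:], dim=1)    # per-class probabilities
    scores, labels = cls_prob.max(dim=1)           # most likely category
    keep = obj_conf * scores > conf_thresh         # drop low-confidence boxes
    return boxes[keep], labels[keep], (obj_conf * scores)[keep]

# Example with random logits standing in for head output:
boxes, labels, scores = decode_predictions(torch.randn(100, 11))
```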

3. Related Work

A review of papers related to vehicle recognition shows that more in-depth research is still needed in this field. Wang et al. [7] presented the YOLOv7 model for general object detection. Li et al. proposed RSI-YOLO, an improved YOLO for remote sensing [8], adding the CBAM attention mechanism to the original model. Yang et al. [9] proposed an improved YOLOv5 for vulnerable road user detection, combining Complete Intersection over Union Loss (CIoU-Loss) and Distance Intersection over Union Non-Maximum Suppression (DIoU-NMS) to speed up the model’s convergence; however, there is no practical application because the possibility of model instability caused by the fast convergence rate has not been considered. Sang et al. [10] proposed YOLOv2_Vehicle, a vehicle detection model based on YOLOv2 that uses the k-means++ clustering algorithm to cluster the vehicle bounding boxes of the training data. Although its accuracy reaches 94.78%, the dataset has a single viewpoint and may lack corresponding application value. Wang et al. [11] presented the VV-YOLO model with a precision of 90.68%, which leaves room for improvement. Song et al. proposed the TSR-YOLO model for detecting Chinese traffic signs [12], with an mAP of 92.77%. Wu et al. proposed YOLOv3-spp to detect different types of vehicles [13]; because the model does not include an attention mechanism, its feature extraction ability is weaker, and it handles small targets and complex backgrounds less well. The SPP module was presented by He et al. [14]; it effectively avoids the image distortion caused by cropping and scaling operations and removes the repeated feature extraction a convolutional neural network would otherwise perform, which significantly speeds up candidate-box generation and saves computational cost. Wang et al. designed a YOLO-based model to detect offshore unmanned aerial vehicle data [15]; its precision reaches 92.7%, but there is still potential for improvement. Qiu et al. proposed YOLO-GNS [16], which introduces the Single-Stage Headless (SSH) context structure and replaces the complex convolution of GhostNet with a linear transform; its precision also has room for improvement. Zarei et al. [17] produced a model named YOLO-Rec to detect vehicles with a precision of 94%, but they did not consider the scale of the parameters, which may lead to low efficiency. Liao et al. [18] proposed the Eagle-YOLO model, which improves the IoU to increase the convergence speed of the loss function; however, the application scenario is limited to UAVs and cannot be transferred to the traffic-light setting because of the significant difference in viewpoint. Li et al. [19] designed RES-YOLO, but the model recognizes only a single vehicle type and therefore has limited application value. Daniel Padilla Carrasco et al. [20] used a multi-layer neural network to improve YOLO and proposed a T-YOLO model, but the application scenario was limited to car park vehicles and the recognition categories were too homogeneous.
In addition, many studies [21,22,23,24,25,26,27] that improve general recognition algorithms cover only a small variety of recognition scenarios or are unsuitable for application at traffic intersections. Some of these studies were conducted from the perspective of a driverless car [28,29,30,31] rather than an actual signalized intersection, so the training viewpoint differs significantly. Others focus on feature extraction without promoting the results [32,33,34] to specific application scenarios where they would have application value; they improve an individual component but do not form a complete system. Compared with other methods that add attention mechanisms [35,36,37], our SPPFCSPC improvement keeps the receptive field size constant.

4. Approach to the New Change of the Improved YOLO Network Model

We replaced the SPPCSPC module in the original YOLOv7 framework with the SPPFCSPC module. In addition, we added the Squeeze-and-Excitation attention module to the original network. The improved YOLOv7 framework is shown in Figure 2 below.

4.1. Replace SPPCSPC Module with the SPPFCSPC Module

The role of SPP is to increase the receptive field so that the algorithm adapts to images of different resolutions, which is achieved through maximum pooling with different receptive fields. As shown in the figure, the first branch passes through four max-pooling branches with kernel sizes of 5, 9, 13, and 1; the purpose of using max pooling of different sizes is to enable the module to handle objects of different sizes. In other words, the four scales of maximum pooling provide four receptive fields that are used separately depending on the target size.
The spatial pyramid pooling (SPP) module in YOLOv7 is shown in Figure 3, which gives the detailed configuration of SPPCSPC. We improve on the principle of this spatial pyramid pooling module; the improved SPP module, which we call SPPFCSPC, is shown in Figure 4. By converting the parallel maximum pooling layers into the cascaded form used in SPPFCSPC, we improve the model’s feature extraction ability while keeping the receptive field unchanged, further improving the ability to localize objects in the image and the corresponding recognition accuracy. At the same time, this change improves the generalization ability and robustness of the model.
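The essence of the change is the standard SPP-to-SPPF rearrangement: three cascaded 5 × 5 max-pooling layers reproduce the 5/9/13 receptive fields of the parallel branches while reusing intermediate results. The following is a minimal sketch of that pooling equivalence only, not the full SPPFCSPC module (which also retains the CSP-style convolution branches).

```python
import torch
import torch.nn as nn

class SPPFLite(nn.Module):
    """Cascaded 5x5 max pools: y1 covers a 5x5 field, y2 an effective 9x9,
    and y3 an effective 13x13 -- the same fields as parallel 5/9/13 pooling."""
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return torch.cat([x, y1, y2, y3], dim=1)   # concatenate along channels

# The parallel-branch SPP with pool sizes 5, 9, 13 produces the same tensor:
x = torch.randn(1, 8, 32, 32)
parallel = torch.cat([x,
                      nn.MaxPool2d(5, 1, 2)(x),
                      nn.MaxPool2d(9, 1, 4)(x),
                      nn.MaxPool2d(13, 1, 6)(x)], dim=1)
assert torch.allclose(SPPFLite()(x), parallel)
```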

4.2. Squeeze-and-Excitation Attention

The SE attention mechanism can be divided into two steps: squeeze and excitation. The brief algorithm architecture is shown in Figure 5 below.
Standard convolution by itself makes it difficult to model channel relationships. It extracts features by sliding convolution kernels over the input data, and therefore captures only local features, making it hard to capture long-range channel dependencies and global information. Explicitly modeling channel interdependencies increases the model’s sensitivity to informative channels and improves robustness. In addition, global average pooling helps the model capture global information by averaging the values of each channel across the feature map, giving aggregated information about the entire input, which is precisely what convolution lacks. Given an input X, the squeeze step for the c-th channel is shown in the following equation:
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)    (1)
Here, x_c(i, j) denotes the value of the c-th channel of the input X at spatial position (i, j), and H and W are the height and width of the feature map. Next, the excitation step, which aims to fully capture channel dependencies, can be expressed as
\hat{X} = X \cdot \mathrm{Sigmoid}(\hat{z})    (2)
where \cdot denotes channel-wise multiplication and Sigmoid refers to the sigmoid function; \hat{z} is calculated using the transformation below:
\hat{z} = T_2(\mathrm{ReLU}(T_1(z)))    (3)
In Equation (3), T_1 and T_2 are two learnable linear transformations that capture the importance of each channel.
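A minimal PyTorch sketch of the SE block defined by Equations (1)–(3); the two linear transformations T_1 and T_2 are implemented as fully connected layers, and the reduction ratio of 16 is an assumption taken from the original SE design rather than from this work.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.t1 = nn.Linear(channels, channels // reduction)   # T1 in Eq. (3)
        self.t2 = nn.Linear(channels // reduction, channels)   # T2 in Eq. (3)

    def forward(self, x):                        # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                   # squeeze: global average pooling, Eq. (1)
        z_hat = self.t2(torch.relu(self.t1(z)))  # excitation, Eq. (3)
        weights = torch.sigmoid(z_hat)           # per-channel attention weights
        return x * weights[:, :, None, None]     # channel-wise reweighting, Eq. (2)

# Example: recalibrate a 64-channel feature map
out = SEBlock(64)(torch.randn(2, 64, 40, 40))
```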
As a critical element for reaching state-of-the-art performance, the SE block has been widely used in recent mobile networks. However, it only reweights each channel by modeling channel relationships and neglects positional information.
Feature maps capture different features at different locations in the input image and reflect the model’s understanding of the input data. They play an essential role in convolutional neural networks; their generation and propagation help the network understand and extract features from the input data, enabling the effective learning of complex tasks. Feature maps also reveal which feature information is most important, which we can then enhance. The specific locations where the primary feature information is concentrated can be seen in Figure 6. Comparing Figure 6 and Figure 7 below, it is clear that the essential features associated with the vehicle have been significantly enhanced. Taking the third picture in the first row of Figure 7 as an example, the outline of the vehicle is significantly enhanced compared with the photo at the corresponding position in Figure 6.

5. Experiment

5.1. Dataset

The project dataset contains images from Turkish cities such as Bursa, İstanbul, and Konya, as well as photos from other countries. The images were captured in diverse environments under various lighting conditions and at different distances and viewpoints. In addition, the resolution of each image is at least 1920 × 1080; the images are of high quality and originate from real scenes rather than vehicle models, which is conducive to building a high-quality model. Since the dataset originates from actual traffic intersection scenes, it is highly representative of the complex scenes found at real-life intersections. By training on this dataset, we obtain a model that can detect persons, cars, buses, motorbikes, bicycles, and trucks.
The relevant parameters of the dataset and its division into training and test sets are shown in Table 1. Some images from the dataset are displayed in Figure 8.

5.2. Image Enhancement

To improve the robustness and generalization ability of the model, the training samples were expanded using offline data augmentation. First, the images in the dataset were randomly divided into training and testing sets in an 8:2 ratio. The images in the training set were then augmented offline using random rotation, affine transformation, and random cropping.
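A sketch of the offline augmentation step using torchvision; the rotation angle, affine parameters, and crop size below are illustrative assumptions, and transforming the corresponding bounding-box labels (required for detection training) is omitted.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Random rotation, affine transformation, and random cropping, applied offline.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),
])

# Synthetic image standing in for a training photo; in practice, Image.open(...) is used.
img = Image.fromarray(np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8))
for k in range(3):                     # save several augmented copies per image
    augment(img).save(f"train_0001_aug{k}.jpg")
```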
The offline-enhanced training set is further augmented with Mosaic data enhancement: a batch-size number of randomly selected images are randomly cropped and merged into a single image, which is used as newly generated training data and passed into the neural network for training. This increases the number of vehicle images passed into the network, enriches the detection dataset, makes the network more robust, and reduces GPU memory usage. Figure 9 shows the flow of Mosaic data enhancement.
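A minimal sketch of the Mosaic idea: four randomly chosen training images are cropped/resized and tiled into a single 640 × 640 image around a random center point. Label merging is omitted, and the helper below is illustrative rather than the exact implementation used in training.

```python
import random
import numpy as np
import cv2

def mosaic(images, out_size=640):
    """Tile four images into one out_size x out_size mosaic around a random center."""
    cx = random.randint(out_size // 4, 3 * out_size // 4)    # random split point (x)
    cy = random.randint(out_size // 4, 3 * out_size // 4)    # random split point (y)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas

# Pick four images at random from a batch-sized pool and merge them:
pool = [np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8) for _ in range(16)]
new_sample = mosaic(random.sample(pool, 4))
```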

5.3. Environment Configuration

This project was developed in Python, and the vehicle detection task was run on a laptop with the Windows 10 operating system. However, an ordinary Windows 10 computer cannot meet the minimum requirements of the high-performance training task. The training was therefore carried out on a server at the Supercomputing Center of Shandong Agricultural University. The specific experimental environment of the server is shown in Table 2.
In the project, the training learning rate is 0.01, the batch size is 16, and the maximum number of epochs is 100. The images in the dataset are 640 × 640 pixels.
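For reference, the stated hyperparameters correspond to a setup like the following plain-PyTorch sketch; the placeholder module, the SGD momentum, and the weight decay are assumptions, since they are not reported in the paper.

```python
import torch

hyperparameters = {
    "learning_rate": 0.01,   # initial learning rate
    "batch_size": 16,
    "epochs": 100,
    "img_size": 640,         # images are resized to 640 x 640
}

# `model` would be the SE-Lightweight YOLO network; a tiny module stands in here.
model = torch.nn.Conv2d(3, 16, kernel_size=3)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=hyperparameters["learning_rate"],
    momentum=0.937,          # assumed value, a common YOLO default
    weight_decay=5e-4,       # assumed value, not reported in the paper
)
```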

5.4. Evaluation Method

5.4.1. Confusion Matrix

We use a confusion matrix to show the correspondence between actual and predicted categories (Figure 10). The confusion matrix has four components: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). TP is the number of positive samples correctly predicted as positive. FP is the number of negative samples incorrectly predicted as positive. TN is the number of negative samples correctly predicted as negative. FN is the number of positive samples incorrectly predicted as negative.
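These four quantities can be read off a confusion matrix directly; a small sketch with scikit-learn and made-up binary labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary example: 1 = "vehicle present", 0 = "no vehicle"
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")    # TP=4, FP=1, TN=2, FN=1
```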

5.4.2. Evaluation Index

Precision represents the proportion of correctly detected objects among all detected objects, as given in Equation (4):
\mathrm{Precision} = \frac{TP}{TP + FP}    (4)
Recall is the proportion of correctly recognized targets among all actual targets, calculated with Equation (5):
\mathrm{Recall} = \frac{TP}{TP + FN}    (5)
The F1-score, also known as the balanced score, is the harmonic mean of precision and recall. It can be calculated using Equation (6):
F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (6)
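Equations (4)–(6) in code form, using counts taken from the confusion matrix; the numbers in the example are placeholders.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                             # Eq. (4)
    recall = tp / (tp + fn)                                # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)     # Eq. (6)
    return precision, recall, f1

# e.g. 949 correct detections, 51 false detections, 58 missed targets (made-up numbers)
p, r, f1 = precision_recall_f1(tp=949, fp=51, fn=58)
```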
The P–R curve is drawn with precision on the vertical axis and recall on the horizontal axis; the area under the P–R curve is the average precision (AP), calculated as shown in Equation (7):
AP = \int_{0}^{1} P(R)\,dR    (7)
The AP value of each category is derived separately, and the mean of all AP values gives the mean average precision (mAP); the higher the mAP, the better the generalization ability of the model. It is calculated as shown in Equation (8):
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i    (8)
In the above equation, N is the total number of target categories; in this paper, N = 6 (i.e., the six categories person, car, bus, motorbike, bicycle, and truck), and AP_i is the average precision of the i-th category.
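Equations (7) and (8) amount to integrating each class's precision–recall curve and averaging the per-class results. A numerical sketch with NumPy follows; the P–R samples are synthetic and stand in for the curves produced during evaluation.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under a sampled P-R curve (Eq. (7)), by trapezoidal integration."""
    r, p = np.asarray(recall), np.asarray(precision)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

# Synthetic per-class P-R samples for the six categories
rng = np.random.default_rng(0)
ap_per_class = []
for _ in range(6):                     # person, car, bus, motorbike, bicycle, truck
    recall = np.linspace(0, 1, 50)
    precision = np.clip(1 - recall + rng.normal(0, 0.02, 50), 0, 1)
    ap_per_class.append(average_precision(recall, precision))

mAP = float(np.mean(ap_per_class))     # Eq. (8)
```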
mAP@0.5 denotes the mean average precision at an IoU (Intersection over Union) threshold of 0.5. In target detection, IoU measures the degree of overlap between the detected bounding box and the ground-truth bounding box; mAP@0.5 is the mean average precision computed under that threshold.
mAP@0.5:0.95 denotes the mean average precision averaged over IoU thresholds from 0.5 to 0.95. It is typically used to evaluate the performance of a target detection algorithm more comprehensively, since it accounts for performance under a range of IoU thresholds; a higher mAP@0.5:0.95 indicates that the algorithm can still detect targets accurately under stringent IoU thresholds.
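The IoU referred to above is computed from box coordinates as in the following sketch, with boxes given as (x1, y1, x2, y2) corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive for mAP@0.5 only if its IoU with a
# same-class ground-truth box is at least 0.5.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))    # ~0.14
```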

5.5. Results

After training, the evaluation metrics of the two algorithms were compared on the validation set of 805 images, as shown in Table 3.
After 100 epochs of training, we use the generated best.pt file to run recognition on the dataset and obtain the results shown in Figure 11. Figure 11 shows the detection results of the SE-Lightweight YOLO model. The model performs well in both single-vehicle and multi-vehicle detection, and its ability to locate vehicles and recognize their type is unaffected on sunny and cloudy days, demonstrating strong adaptability to weather conditions.
In addition, in the three images in the third column of Figure 11, some vehicles appear incomplete because of occlusion. The actual detection results show that such incomplete vehicles do not reduce the vehicle detection accuracy of the model. This indicates that SE-Lightweight YOLO can localize and recognize the type of all kinds of vehicles even under occlusion, reflecting the effectiveness of the multilayer feature fusion strategy.
From the detection results above, it is clear that the model reaches high accuracy in detecting the six classes of objects. We conclude that the model is well suited to real transportation applications: it can significantly help related workers, saving their time and providing them with numerical data to support informed decisions. The model therefore has the potential to be put into application.
In addition, the model maintains a degree of recognition accuracy in the presence of noise in the image, reflecting its robustness and generalization ability, as shown in Figure 12, which compares recognition on noisy and noise-free images.

5.6. Analysis of the Experiment Outcome

The confusion matrix obtained from the experiment is shown in Figure 13, where the columns represent the predicted categories and the rows represent the actual categories. The values on the diagonal are the proportions of correct predictions; the off-diagonal elements are the proportions of incorrect predictions. Higher values on the diagonal indicate more correct predictions.
Especially when the amount of data across categories is unbalanced, the confusion matrix gives a visual indication of the classification accuracy for each category. Figure 13 shows that the model maintains a high level of recognition accuracy for every category.
The comparison in Figure 14 shows that the improved model has higher recognition accuracy than the original model in all categories, as can be seen from the detection accuracy for heavily occluded vehicles in the second picture of each row. The improvement for clearly visible targets is less pronounced. Nevertheless, for overlapping targets (based on the farthest vehicle in the picture), the recognition accuracy still improves by 5%, from 89% to 94%, showing that the improved model outperforms the traditional model on overlapping targets.
From the results in Figure 15, it can be seen that the model training achieved good results by the 85th epoch; the model converges and remains stable after 85 epochs.
The F1 curve plots the F1-score against the confidence threshold (x-axis). The F1-score is a classification measure defined as the harmonic mean of precision and recall, ranging from 0 to 1; the larger it is, the better.
In general, when the confidence threshold (the probability above which a sample is judged to belong to a specific category) is low, many low-confidence samples are accepted, giving high recall but low precision. When the confidence threshold is high, only high-confidence samples are accepted, so the detected categories are more accurate and precision is higher, but fewer samples are accepted. Hence, the F1-scores at both ends of the curve are relatively small.
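The curve itself is produced by sweeping the confidence threshold and recomputing the F1-score at each step; the sketch below uses synthetic detection scores and match flags purely for illustration.

```python
import numpy as np

# Synthetic detections: a confidence score per detection and a flag saying
# whether it matches a ground-truth object (purely illustrative data).
scores = np.random.default_rng(1).uniform(0, 1, 500)
is_correct = np.random.default_rng(2).uniform(0, 1, 500) < scores
num_ground_truth = 400

f1_curve = []
for t in np.linspace(0.0, 1.0, 101):          # sweep the confidence threshold
    kept = scores >= t
    tp = int(np.sum(is_correct & kept))
    fp = int(np.sum(~is_correct & kept))
    precision = tp / max(tp + fp, 1)
    recall = tp / num_ground_truth
    f1_curve.append(2 * precision * recall / max(precision + recall, 1e-9))
# f1_curve now traces the F1-score against the confidence threshold, as in Figure 16.
```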
As can be seen in Figure 16, the F1 curve is very “spacious” and its top is close to 1, indicating that there is a wide range of confidence thresholds over which the model performs well on the training dataset (detection is both complete and accurate). Combined with the figure above, it can be seen that the model works best for trucks and recognizes people less accurately than the other categories.
The P curve shows the precision of each category when the determination probability exceeds the confidence threshold. When the confidence level is higher, category detection is more accurate, but there is a risk of missing actual samples with a lower determination probability. From Figure 17 below, we draw the same conclusion as from the F1 curve in Figure 16: the model recognizes trucks best and people relatively weakly.
Recall values range from 0 to 1 and are usually expressed as percentages. A high recall indicates that the model finds most or all true targets, while a low recall indicates that the model misses many of them. In YOLO, recall curves are often used to visualize the model’s recall at different thresholds; the model uses a confidence threshold to decide whether a detection is valid, and adjusting this threshold controls the recall. There is often a trade-off between recall and precision in target detection: high recall means finding more real targets but can lead to more false positives. Figure 18 supports the same conclusion as the P curve in Figure 17.
The PR curve in Figure 19 shows the relationship between precision and recall at different confidence thresholds, with recall plotted on the horizontal axis and precision on the vertical axis. Typically, as the threshold decreases, recall increases and precision decreases. The closer the PR curve stays to the upper right corner, the better the model maintains high precision at high recall, which is an excellent performance indicator. The area under the PR curve is called AUC-PR (Area Under the Precision–Recall Curve) and is used to quantify the model’s performance; the higher the AUC-PR, the better the model performs across different levels of precision and recall. Unlike the previous curves, the PR curves show that the model is best at recognizing buses and relatively weak at recognizing people compared with the other categories.
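AUC-PR can be computed directly from detection scores and match labels, for example with scikit-learn; the scores and labels below are synthetic.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 1000)                           # 1 = correct detection
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 1000), 0, 1)

precision, recall, _ = precision_recall_curve(y_true, y_score)
auc_pr = auc(recall, precision)                             # area under the P-R curve
print(f"AUC-PR = {auc_pr:.3f}")
```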

5.7. Comparison of the Different Models

As can be seen from Table 4 and Figure 20, the algorithm proposed in this paper achieves the highest accuracy in vehicle image detection among the compared models, along with the best values on most other metrics, and can undertake the related detection tasks well.

6. Conclusions

This research focuses on the vehicle detection task, with a particular application to traffic camera object detection. It mainly discusses how to improve the accuracy of vehicle detection and the algorithm’s performance metrics. We added the SE attention mechanism to the original YOLOv7 framework and replaced SPPCSPC with SPPFCSPC in the backbone of the network, which improved the feature extraction ability of the original YOLOv7 algorithm and effectively enlarged the receptive field of the model, thereby improving detection accuracy. While achieving more efficient and effective feature fusion, the network’s robustness is also enhanced, and the Mosaic data augmentation technique further strengthens the model’s feature extraction. Experimental results demonstrate that SE-Lightweight YOLO significantly improves detection precision and recall compared with YOLOv7 and other conventional object detection algorithms. However, the added modules and detection layers increase the number of model parameters, leading to more floating-point operations (FLOPs) and consequently higher model complexity.
In future transportation applications, there is a growing inclination to deploy real-time monitoring models on embedded systems, which places higher demands on accuracy and speed for small object detection. Because current machine-learning models tend to be large, seamless deployment and fast monitoring on embedded systems remain a challenge. Future research should therefore expand the training datasets, explore lightweight network technologies, and apply model pruning and knowledge distillation to reduce model size; this is essential for accelerating the application of research outcomes in transportation. At the same time, effort can be put into model compression to facilitate the deployment and real-time monitoring of models on embedded systems, enhancing the practical utility of this research and accelerating the application of its results in various domains.

Author Contributions

Drafting of the original manuscript and data preparation were conducted by C.N.; X.Z. took charge of the review and editing process; Y.S. provided oversight and leadership for planning and executing the research activities. All authors have read and agreed to the published version of the manuscript.

Funding

Shandong Provincial Natural Science Foundation, China: ZR2020MF146, and the Innovation and Entrepreneurship Training Program for Students of Shandong Agricultural University: XJZD2022062.

Data Availability Statement

The selected datasets in this paper are public, and they can be freely downloaded at Kaggle (https://www.kaggle.com/datasets/yusufberksardoan/traffic-detection-project/, accessed on 30 September 2023).

Acknowledgments

We thank the Supercomputing Center at Shandong Agricultural University for technical support.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ITS	Intelligent Transportation System
mAP	Mean Average Precision
P	Precision
R	Recall
F1	F1-Score
SPP	Spatial Pyramid Pooling
YOLO	You Only Look Once
SE	Squeeze-and-Excitation
SPPCSPC	Spatial Pyramid Pooling Connected Spatial Pyramid Convolution
SPPFCSPC	Spatial Pyramid Pooling Fully Connected Spatial Pyramid Convolution
AUC-PR	Area Under the Precision–Recall Curve
SSD	Single-Shot Multibox Detector
CNN	Convolutional Neural Network
IoT	Internet of Things
SSH	Single-Stage Headless
SIFT	Scale-Invariant Feature Transform

References

  1. Dimitrakopoulos, G.; Demestichas, P. Intelligent Transportation Systems. IEEE Veh. Technol. Mag. 2010, 3, 77–84. [Google Scholar] [CrossRef]
  2. Zeng, Y. Optimal Control and Application of Traffic Light Timing Based on Fuzzy Control. Master’s Thesis, Changsha University of Technology, Changsha, China, 2020. [Google Scholar] [CrossRef]
  3. Issues Report: Smart Transportation Market. Manufacturing Close-Up. 2020. Available online: https://www.researchandmarkets.com/ (accessed on 26 October 2023).
  4. Cao, Z.W. Research on highway congestion mitigation technology based on intelligent transport. Intell. Build. Smart City 2023, 168–170. [Google Scholar] [CrossRef]
  5. Xu, Z.; Cao, Y.; Kang, Y.; Zhao, Z. Vehicle emission control on road with temporal traffic information using deep reinforcement learning. IFAC-PapersOnLine 2020, 53, 14960–14965. [Google Scholar] [CrossRef]
  6. Cao, Z.; Jiang, S.; Zhang, J.; Guo, H. A unified framework for vehicle rerouting and traffic light control to reduce traffic congestion. IEEE Trans. Intell. Transp. Syst. 2016, 18, 1958–1973. [Google Scholar] [CrossRef]
  7. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Comput. Vis. Pattern Recognit. 2023, 7, 7464–7475. [Google Scholar]
  8. Li, Z.; Yuan, J.; Li, G.; Wang, H.; Li, X.; Li, D.; Wang, X. RSI-YOLO: Object Detection Method for Remote Sensing Images Based on Improved YOLO. Sensors 2023, 23, 6414. [Google Scholar] [CrossRef] [PubMed]
  9. Yang, W.; Tang, X.; Jiang, K.; Fu, Y.; Zhang, X. An Improved YOLOv5 Algorithm for Vulnerable Road User Detection. Sensors 2023, 23, 7761. [Google Scholar] [CrossRef]
  10. Sang, J.; Wu, Z.; Guo, P.; Hu, H.; Xiang, H.; Zhang, Q.; Cai, B. An Improved YOLOv2 for Vehicle Detection. Sensors 2018, 18, 4272. [Google Scholar] [CrossRef]
  11. Wang, Y.; Guan, Y.; Liu, H.; Jin, L.; Li, X.; Guo, B.; Zhang, Z. VV-YOLO: A Vehicle View Object Detection Model Based on Improved YOLOv4. Sensors 2023, 23, 3385. [Google Scholar] [CrossRef]
  12. Song, W.; Suandi, S.A. TSR-YOLO: A Chinese Traffic Sign Recognition Algorithm for Intelligent Vehicles in Complex Scenes. Sensors 2023, 23, 749. [Google Scholar] [CrossRef]
  13. Wu, J.-D.; Chen, B.-Y.; Shyr, W.-J.; Shih, F.-Y. Vehicle Classification and Counting System Using YOLO Object Detection Technology. Trait. Signal 2021, 38, 1087–1093. [Google Scholar] [CrossRef]
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, Z.; Zhang, X.; Li, J.; Luan, K. A YOLO-Based Target Detection Model for Offshore Unmanned Aerial Vehicle Data. Sustainability 2021, 13, 12980. [Google Scholar] [CrossRef]
  16. Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones 2023, 7, 117. [Google Scholar] [CrossRef]
  17. Zarei, N.; Moallem, P.; Shams, M. Fast-Yolo-Rec: Incorporating Yolo-Base Detection and Recurrent-Base Prediction Networks for Fast Vehicle Detection in Consecutive Images. IEEE Access 2022, 10, 120592–120605. [Google Scholar] [CrossRef]
  18. Liao, L.; Luo, L.; Su, J.; Xiao, Z.; Zou, F.; Lin, Y. Eagle-YOLO: An Eagle-Inspired YOLO for Object Detection in Unmanned Aerial Vehicles Scenarios. Mathematics 2023, 11, 2093. [Google Scholar] [CrossRef]
  19. Li, Y.; Wang, J.; Huang, J.; Li, Y. Research on Deep Learning Automatic Vehicle Recognition Algorithm Based on RES-YOLO Model. Sensors 2022, 22, 3783. [Google Scholar] [CrossRef]
  20. Carrasco, D.P.; Rashwan, H.A.; García, M.Á.; Puig, D. T-YOLO: Tiny Vehicle Detection Based on YOLO and Multi-Scale Convolutional Neural Networks. IEEE Access 2021, 11, 22430–22440. [Google Scholar] [CrossRef]
  21. Zapletal, D.; Herout, A. Vehicle Re-Identification for Automatic Video Traffic Surveillance. Comput. Vis. Pattern Recognit. 2016, 3, 25–31. [Google Scholar]
  22. Khana, S.D.; Ullah, H. A survey of advances in vision-based vehicle re-identification. Comput. Vis. Image Underst. 2019, 182, 50–63. [Google Scholar] [CrossRef]
  23. El-gayar, M.M.; Soliman, H.; Meky, N. A comparative study of image low level feature extraction algorithms. Egypt. Inform. J. 2013, 14, 175–181. [Google Scholar] [CrossRef]
  24. Niu, C.; Wang, W.; Li, N. A Mathematical Model for Analyzing and Identifying the Composition of Ancient Glass Objects and Its Application. J. Mater. Process. Des. 2022, 6, 86–95. [Google Scholar] [CrossRef]
  25. Niu, C.; Hou, H.; Shen, Y.; Zhou, Z. The listing price prediction of used sailboats based on LM-BP Neural Network. In Proceedings of the 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 15–17 September 2023; Volume 7, pp. 1931–1935. [Google Scholar]
  26. Ye, J.; Yuan, Z.; Qian, C.; Li, X. CAA-YOLO: Combined-Attention-Augmented YOLO for Infrared Ocean Ships Detection. Sensors 2022, 22, 3782. [Google Scholar] [CrossRef] [PubMed]
  27. Zhou, L.; Deng, G.; Li, W.; Mi, J.; Lei, B. A lightweight SE-YOLOv3 network for multi-scale object detection in remote sensing imagery. Int. J. Pattern Recognit. Artif. Intell. 2021, 35, 2150037. [Google Scholar] [CrossRef]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. Comput. Vis. Pattern Recognit. 2021, 13713–13722. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  30. Song, Y.; Hong, S.; Hu, C.; He, P.; Tao, L.; Tie, Z.; Ding, C. MEB-YOLO: An Efficient Vehicle Detection Method in Complex Traffic Road Scenes. Comput. Mater. Contin. 2023, 75, 5761–5784. [Google Scholar] [CrossRef]
  31. Tai, S.-K.; Dewi, C.; Chen, R.-C.; Liu, Y.-T.; Jiang, X.; Yu, H. Deep Learning for Traffic Sign Recognition Based on Spatial Pyramid Pooling with Scale Analysis. Appl. Sci. 2020, 10, 6997. [Google Scholar] [CrossRef]
  32. Jasem, N.H.; Mohammad, F.G. Iraqi License Plate Recognition System Using (YOLO) with SIFT and SURF Algorithm. J. Mech. Contin. Math. Sci. 2020, 15, 545–561. [Google Scholar] [CrossRef]
  33. Widyastuti, R.; Yang, C.K. Cat’s nose recognition using you only look once (YOLO) and scale-invariant feature transform (SIFT). In Proceedings of the 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), Nara, Japan, 9–12 October 2018; pp. 55–56. [Google Scholar]
  34. Lu, P.; Ding, Y.; Wang, C. Multi-small target detection and tracking based on improved YOLO and SIFT for drones. Int. J. Innov. Comput. Inf. Control 2021, 17, 205–224. [Google Scholar]
  35. Wu, C.; Ye, M.; Zhang, J.; Ma, Y. YOLO-LWNet: A Lightweight Road Damage Object Detection Network for Mobile Terminal Devices. Sensors 2023, 23, 3268. [Google Scholar] [CrossRef] [PubMed]
  36. Wan, F.; Sun, C.; He, H.; Lei, G.; Xu, L.; Xiao, T. YOLO-LRDD: A lightweight method for road damage detection based on improved YOLOv5s. EURASIP J. Adv. Signal Process. 2022, 2022, 98. [Google Scholar] [CrossRef]
  37. Huang, Z.; Wu, J.; Su, L.; Xie, Y.; Li, T.; Huang, X. SP-YOLO-Lite: A Lightweight Violation Detection Algorithm Based on SP Attention Mechanism. Electronics 2023, 12, 3176. [Google Scholar] [CrossRef]
Figure 1. YOLOv7 modeling framework.
Figure 2. Improved YOLOv7 modeling framework.
Figure 3. Architecture of SPPCSPC module.
Figure 4. Architecture of SPPFCSPC module.
Figure 5. Algorithm of SE block.
Figure 6. Initial feature map display.
Figure 7. Improved feature map display.
Figure 8. Some images in the dataset.
Figure 9. Process of Mosaic data enhancement.
Figure 10. Diagram of the confusion matrix.
Figure 11. Detection results of SE-Lightweight YOLO using the dataset.
Figure 12. Comparison of noise identification with and without noise.
Figure 13. Confusion matrix.
Figure 14. Detection of YOLOv7 and SE-Lightweight YOLO in the dataset.
Figure 15. Demonstration of SE-Lightweight YOLO training process.
Figure 16. F1 curve.
Figure 17. P curve.
Figure 18. R curve.
Figure 19. PR curve.
Figure 20. Comparison of SE-Lightweight YOLO and YOLOv7 in the dataset.
Table 1. Dataset training and validation distribution.
Data Scale	Image Scale
Train	7566
Validation	805
Test	Real-time validation of functional models
Total	8371
Table 2. The training environment configuration for the vehicle experiment.
Configuration	Specific Version
CPU	Intel(R) Xeon(R) Silver 4208 CPU @ 2.10 GHz
Memory	160 GB
GPU	NVIDIA Quadro RTX 5000
Operating system	Windows 10 (64-bit)
Programming language	Python 3.8
CUDA version	CUDA 11.7
Deep learning framework	PyTorch 2.0.1
Table 3. Evaluation of the two algorithms on the selected dataset.
Class	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
bicycle	94.5 / 95.1	88.2 / 91.4	92.2 / 93.4	74.3 / 75.2
bus	91.8 / 92.3	96.7 / 98.1	97.4 / 97.6	88.8 / 89.1
car	93.9 / 94.6	93.3 / 96.3	96.1 / 94.3	75.9 / 77.4
motorbike	94.2 / 95.1	84.8 / 89.1	92.3 / 96.6	59.4 / 61.6
person	88.6 / 92.7	85.7 / 90.3	89.7 / 92.4	63.1 / 64.4
truck	99.2 / 99.6	100 / 100	99.5 / 99.9	96.2 / 97.3
all	93.7 / 94.9	92.6 / 94.2	94.5 / 95.7	76.3 / 77.5
Each cell gives the value for the initial YOLOv7 / the improved model.
Table 4. Comparison between SE-Lightweight YOLO and other traditional models.
Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Detection Speed per Frame (s)
SSD	78.1	78.6	79.4	55.9	0.43
Faster R-CNN	80.4	86.4	84.2	61.8	0.84
SPPNet	89.1	91.6	90.4	69.7	0.68
SE-YOLOv5	92.5	92.3	93.4	72.6	0.39
YOLOv7	93.7	92.6	94.5	76.3	0.37
Our model	94.9	94.2	95.7	77.5	0.41
The above indicators are expressed as percentages; detection speed is expressed in seconds.
