Article

Vehicle Multi-Object Detection and Tracking Algorithm Based on Improved You Only Look Once 5s Version and DeepSORT

1  School of Physics and Electronics, Nanning Normal University, Nanning 530100, China
2  Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning 530001, China
*  Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2690; https://doi.org/10.3390/app14072690
Submission received: 9 February 2024 / Revised: 15 March 2024 / Accepted: 20 March 2024 / Published: 22 March 2024

Abstract

The increasing popularity of vehicles has led to traffic congestion and frequent traffic accidents. Intelligent transportation technology is an effective solution to this problem. In order to improve the accuracy and effectiveness of vehicle detection and tracking, this paper combines an improved YOLOv5s model with an optimized DeepSORT tracking algorithm to detect and track vehicles on traffic roads. Firstly, in the YOLOv5s detection model, the Attention-based Intra-scale Feature Interaction (AIFI) module is introduced to detect vehicles more quickly and accurately. Secondly, the Kalman filtering (KF) algorithm of DeepSORT is optimized to improve the accuracy of the predicted vehicle state by using the width to replace the length-to-width ratio of the vehicle prediction box in the original KF algorithm. Finally, in the re-recognition network of DeepSORT, the original Convolutional Neural Network (CNN) model is replaced by an improved ResNet36 as the backbone network for feature extraction. The experimental results show that, compared with the original algorithm, in terms of target detection performance, the recall rate, mean average precision (mAP), and detection speed are increased by 7.7%, 15.5%, and 14.2%, respectively; in terms of multi-object tracking performance, multi-object tracking precision (MOTP) and multi-object tracking accuracy (MOTA) improve by 14.84% and 9.62%, respectively, and the total number of trajectory fragmentations (Frag) is reduced by 32.52%. These results indicate that the proposed algorithm can meet the requirements of accurate, real-time, and stable vehicle detection and tracking on traffic roads.

1. Introduction

At present, vehicles are a necessity; they greatly facilitate people’s travel and also greatly increase the productivity of society. According to statistics from Hedges & Company, an automotive market research and digital marketing consulting firm [1], as shown in Table 1, there were about 1.446 billion cars in the world in 2022. The three continents with the most cars are Asia, Europe, and North America, and the most populous, Asia, has the most vehicles, with a total of approximately 531 million. The two countries with the highest levels of vehicle ownership are China and the United States, with approximately 310 million and 280 million vehicles, respectively.
However, the continuous growth in the means of transportation has led the transportation system to face many problems, such as traffic congestion and traffic accidents. To solve these problems, it is necessary to track the vehicles on the road in real time to determine the traffic flow and adjust the traffic system in a timely manner [2,3].
Object tracking has important application value in the field of computer vision and artificial intelligence [4,5,6,7], and DeepSORT, as a deep-learning-based object tracking algorithm, has attracted widespread attention from scholars worldwide [8,9,10,11,12,13,14,15]. In recent research, deep learning algorithms have made great progress in pedestrian target detection, but the accuracy of vehicle tracking algorithms still needs to be improved, especially in complex scenes, such as those containing occlusion, lighting changes and scale changes.
In the research field based on DeepSORT, Li et al. introduced the channel relation-aware global attention (RGA-C) and spatial relation-aware global attention (RGA-S) mechanisms into the network structure and applied hard-negative mining to the basic triplet loss to improve the accuracy of DeepSORT [7]. Wang et al. combined gray and RGB features using Iterative Deep Aggregation (IDA) to reduce the error rate of the model [10]. Liu et al. selected the OSNet full-scale network to optimize the shallow residual network and improve the appearance feature extraction ability, and used the Complete Intersection over Union (CIoU) matching method to judge the matching degree between the detection box and boundary regression [11]. Zhang et al. designed a dual-track prediction mechanism comprising a kernel correlation filter and a Kalman filter and formed a prediction-track-calibration system with cascade matching in DeepSORT to make data association more reliable [12]. In the process of vehicle feature matching, He et al. extracted Haar-like features to match the light and dark change information of vehicles and improve the object-matching accuracy; based on the DeepSORT re-recognition network, the improved ResNet13 was adopted as the backbone network for feature extraction, and SENet was added to adjust the feature weights of different channel dimensions [13]. Zheng used GhostNetV1 to replace the re-recognition network in DeepSORT to generate the appearance features of pedestrians and improve the performance of the pedestrian re-recognition network; the Hungarian algorithm was then adopted to perform the optimal matching between the detection box and the prediction box, and DIoU was adopted to replace IoU (Intersection over Union) in the second matching of unmatched detection boxes, improving the tracking performance of the DeepSORT network [14].
This work aims to increase the detection speed of the target detection model You Only Look Once 5s version (YOLOv5s) and improve the detection accuracy of this model and the target tracking accuracy of the multi-target tracking algorithm DeepSORT by improving the backbone network of the YOLOv5s model, optimizing the KF target location prediction algorithm, and enhancing DeepSORT’s appearance feature extraction network. This is applied to vehicle multi-object detection and tracking. The overall structure flow of the algorithm in this paper is shown in Figure 1.
  • Video reading: The functions in the OpenCV library are used to read the video file and process it frame by frame;
  • Target detection: The improved YOLOv5s model is used for target detection in each frame of the image. To address the interference caused by occlusion in complex environments, the AIFI module with position-encoded information is used to replace the Spatial Pyramid Pooling—Fast (SPPF) module in the backbone network of the original YOLOv5s. Then, all detection boxes, together with their coordinates, confidence scores, and other information, are added to the detection target array;
  • Target tracking: The feature vectors of the detection target array are fed into the tracking algorithm. The optimized Kalman filter is used to predict the target position, with the width replacing the length-to-width ratio used in the original KF, and the Mahalanobis distance and minimum cosine distance are combined to associate the target motion information according to the uncertainty of the target motion. In view of the shallow feature extraction network in the DeepSORT algorithm and its insufficient tracking ability, the original CNN model is replaced by an improved ResNet36 as the backbone network for feature extraction in the DeepSORT re-recognition network. The enhanced feature extraction network is used to extract the appearance features of the target, and the Hungarian algorithm is used to match the target similarity. If a target does not match, it is deleted; if it matches, it is put into the tracking target array. A minimal sketch of this overall pipeline is given below.
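To make the pipeline above concrete, the following Python sketch shows the read-detect-track loop. The `detector` callable and the `tracker` object with an `update()` method are illustrative stand-ins, not the exact interfaces used in this paper.

```python
# Minimal sketch of the detection-tracking pipeline described above.
# `detector` and `tracker` are assumed, generic interfaces for illustration.
import cv2

def run_pipeline(video_path, detector, tracker, conf_threshold=0.5):
    cap = cv2.VideoCapture(video_path)          # read the video file
    while True:
        ok, frame = cap.read()                  # process it frame by frame
        if not ok:
            break
        # 1) Target detection: boxes as (x1, y1, x2, y2, confidence, class_id)
        detections = [d for d in detector(frame) if d[4] >= conf_threshold]
        # 2) Target tracking: the tracker predicts with the Kalman filter,
        #    associates detections (Hungarian algorithm), and returns tracks
        #    as (x1, y1, x2, y2, track_id)
        tracks = tracker.update(frame, detections)
        for x1, y1, x2, y2, track_id in tracks:
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {track_id}", (int(x1), int(y1) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.imshow("tracking", frame)
        if cv2.waitKey(1) == 27:                # press ESC to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```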

2. Materials and Methods

2.1. Selection of Data Set

The key to vehicle detection is to train a good classification model, which requires the use of a large training data set. At present, there are several excellent data sets in the field of image recognition, such as the CALTECH 101 dataset, PascalVOC dataset, MSCOCO dataset, and CIFAR-10 dataset, which have made great contributions to the field of deep learning object detection [15]. In vehicle tracking, the main object of identification is a vehicle on an urban road. Therefore, a variety of vehicle images were used to create the dataset, and the PascalVOC dataset format was used to support the training and testing of the YOLO network.
PascalVOC is a commonly used data set in object recognition. VOC2012 is one of the more complete versions, which contains 11,530 images and 20 marked object types, and can complete object classification, object detection, object segmentation, and other tasks. The data storage rule of each file is shown in Figure 2.
VOC2012 contains rich types of objects for detection, but, on average, only a small amount of data is available for each object, and the image scenes in the data set are not rich enough. Therefore, a dataset that contains more vehicle images of urban road or highway scenes is needed to better detect vehicles on urban roads or highways. In this paper, the VisDrone data set was selected to train the vehicle detection model. The data set is introduced and processed as follows.

2.2. VisDrone Data Set

2.2.1. Introduction of VisDrone Data Set

VisDrone is a dataset of large-scale aerial images for computer vision tasks such as object tracking, object detection, object counting, and scene analysis. The dataset consists of multiple aerial videos taken at different angles and heights, covering a variety of urban, rural, and suburban scenes, including dense crowds of people, vehicles, buildings, and natural environments. A part of the VisDrone dataset is shown in Figure 3.
The VisDrone dataset contains 288 video clips, consisting of 261,908 video frames and 10,209 still images, captured by a wide range of drone cameras in 14 different cities in China, thousands of kilometers apart, in different environments such as rural and urban areas. The images contain pedestrians, vehicles, bicycles, and other objects, and scenes of different densities, including sparse and crowded scenes. The images also cover a variety of traffic scenes, including city streets, highways, and parks, some of which have severe traffic congestion and complicated road structures. It is important to note that the dataset was collected using different models of drone platforms in different scenarios and under different weather and lighting conditions. In addition to video frames, the VisDrone dataset provides detailed annotation information, including important attributes such as scene visibility, object categories, and occlusion conditions, to get the most out of the data. This annotation information helps to improve the accuracy of vehicle detection and tracking and can be used to evaluate the performance of the algorithm.
In conclusion, the VisDrone dataset is a very valuable vehicle detection and tracking dataset, which provides researchers with rich, diverse, and real data, and helps to promote the research and development of vehicle detection and tracking algorithms.

2.2.2. VisDrone Data Set Processing

After downloading the VisDrone dataset, the vehicles were annotated with image annotation software, generating corresponding XML files. These XML files contain information such as the image name, image path, object tag name, and object location coordinates. The process of vehicle labeling is shown in Figure 4. The XML files containing annotation information cannot be directly used to train the YOLO network and need to be converted into YOLO-supported txt files, as sketched below. After annotating all images, the original images and all generated files are stored according to the VOC data file structure to train the vehicle recognition model [16], and the training and test data are stored according to the VOC2012 data set folder structure, which supports the training and testing of the YOLO network.
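As an illustration of this conversion step, the sketch below parses a PascalVOC-style XML file and writes a YOLO-format txt label. The class list and the function name are assumptions for the example, not the exact script used by the authors.

```python
# Sketch: convert one PascalVOC XML annotation into a YOLO txt label.
import xml.etree.ElementTree as ET

CLASSES = ["car", "bus", "truck", "van"]        # assumed class list for this example

def voc_xml_to_yolo_txt(xml_path, txt_path):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class_id, normalized center x/y, normalized width/height
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{CLASSES.index(name)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```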

3. Object Detection Algorithm

3.1. Object Detection Algorithm Based on YOLOv5s

YOLO stands for You Only Look Once [17], meaning that detection is performed in a single pass over the image; it is characterized by the ability to detect objects in images or videos in real time with high accuracy and efficiency. Compared with the previous versions, YOLOv5 has significantly improved in terms of accuracy and speed, becoming an important breakthrough in the field of object detection. The YOLOv5 algorithm adopts a single-stage detection method, which transforms the object detection task into a regression problem. It divides the input image into multiple grids, and each grid predicts the probability, category, and location information of the object. Compared with traditional object detection algorithms, YOLOv5 adopts a more efficient model structure and loss function design, which greatly improves detection speed while maintaining accuracy. YOLOv5s is the smallest version of the YOLOv5 series, with a fast detection speed and small model size. It is an object detection algorithm suitable for resource-constrained environments such as mobile devices.
As shown in Figure 5, the network structure of YOLOv5s is divided into four parts: the input, the backbone network, the neck, and the detection head. The input stage includes Mosaic data enhancement, adaptive anchor box calculation, and adaptive image scaling. The backbone of YOLOv5s adopts the CSPDarknet53 structure, including multiple convolutional layers, pooling layers, and residual blocks; CSPDarknet53 is a lightweight network structure that can extract high-level features from images, which is one of the features of YOLOv5s. The neck of YOLOv5s adopts the Feature Pyramid Network (FPN) structure, which fuses features with different resolutions to improve detection accuracy; FPN includes multiple up-sampling and down-sampling operations, as well as feature-fusion operations, which can fuse features of different levels into a unified feature representation. The head of YOLOv5s adopts the YOLOv3 prediction structure, including multiple convolution layers, up-sampling layers, and the final output layer. In the prediction stage, YOLOv5s divides the image into multiple grids and predicts the category and position of the object on each grid, while using the anchor mechanism to locate the object.

3.2. Improved YOLOv5s Based on AIFI

In order to improve the effect of object detection, this paper uses the AIFI module [18] to replace the Spatial Pyramid Pooling—Fast (SPPF) module in the backbone network of YOLOv5s. The improved YOLOv5s network structure is shown in Figure 6.
As shown in Figure 6, the AIFI module only processes the high-level S3 feature layer, which greatly reduces the computing load and improves the computing speed without damaging the performance, and sometimes even improves it. The reason for this is that the self-attention mechanism can capture the connections between different entities in the layer with rich semantic information, which helps to detect and identify the object. AIFI can be divided into two parts, namely the Multi-Head Attention module and the Feedforward Neural Network (FFN). The query is represented as $q \in \mathbb{R}^{d_q}$, the key as $k \in \mathbb{R}^{d_k}$, and the value as $v \in \mathbb{R}^{d_v}$; $d_q$, $d_k$, and $d_v$ denote their respective lengths; $\mathbb{R}^{d_q}$, $\mathbb{R}^{d_k}$, and $\mathbb{R}^{d_v}$ are the spaces in which they lie; and $W_i^q$, $W_i^k$, and $W_i^v$ denote the corresponding weight matrices. The mathematical process of AIFI is as follows:
$q = k = v = \mathrm{Flatten}(S_3)$ (1)
$F = \mathrm{Reshape}(\mathrm{Attn}(q, k, v))$ (2)
where Flatten represents the fusion of adjacent features into a new feature, Attn represents multi-head self-attention, and Reshape represents the restoration of the shape of the feature to the same as S3. The multi-head attention mechanism can process multiple groups of self-attention on an input sequence. This processing allows for the better capture of information at different locations in the input sequence and can improve the model’s effectiveness when dealing with long-distance dependencies. In the multi-head attention mechanism, the input sequence is divided into multiple heads after a linear transformation and each head calculates an attention distribution [19]. Finally, these attention distributions are spliced and undergo another linear transformation to produce the final output.
Each attention head $h_i$ $(i = 1, 2, \ldots)$ is calculated as follows:
$h_i = f(W_i^q q, W_i^k k, W_i^v v) \in \mathbb{R}^{p_v}$ (3)
where the learnable parameters are $W_i^q \in \mathbb{R}^{p_q \times d_q}$, $W_i^k \in \mathbb{R}^{p_k \times d_k}$, and $W_i^v \in \mathbb{R}^{p_v \times d_v}$, and the attention pooling function is denoted as $f$. Through the multi-group self-attention processing of the input sequences, information at different locations can interact and be integrated more effectively, thus improving the representation and generalization ability of the model.
The FFN is shown in Figure 7, which is composed of an input layer, hidden layer, and output layer. The data are transmitted from the input layer to the output layer in one direction through the hidden layer without forming a loop. In each layer, neurons receive the output data of the previous layer and process them via weighted summation through activation function to obtain the output data and then pass them to the next layer. This structure enables the FFN to learn the nonlinear mapping relationship of the input data. The FFN is trained by the back-propagation algorithm. The partial derivative of the loss function to the network parameters is calculated by using the idea of gradient descent, and then the parameters are updated along the opposite direction of the gradient so that the loss function is gradually reduced.
During the training process, by constantly adjusting the weight and bias, the FFN can gradually learn the mapping relationship between the input and output, thus improving its performance in various tasks.
The combination of a multi-head attention module and FFN plays a positive role, making up for the lack of a learning ability for linear transformation. The activation function in FFN is used to carry out nonlinear mapping, strengthening the part with a large value and inhibiting the part with a small value, to learn more abstract features. The two parts complement each other, so that the AIFI module can greatly reduce the calculation amount and speed up the calculation while not compromising the model performance.
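The PyTorch sketch below illustrates the flatten, multi-head self-attention, FFN, and reshape sequence of Equations (1)-(3). It omits the positional encoding and uses illustrative hyper-parameters, so it should be read as a simplified stand-in rather than the authors' exact AIFI implementation.

```python
# Simplified AIFI-style block: flatten -> multi-head self-attention -> FFN -> reshape.
import torch
import torch.nn as nn

class AIFIBlock(nn.Module):
    def __init__(self, channels, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, channels))
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, s3):                           # s3: (B, C, H, W) high-level feature map
        b, c, h, w = s3.shape
        x = s3.flatten(2).permute(0, 2, 1)           # Flatten: (B, H*W, C); q = k = v = x
        x = self.norm1(x + self.attn(x, x, x)[0])    # multi-head self-attention + residual
        x = self.norm2(x + self.ffn(x))              # feed-forward network + residual
        return x.permute(0, 2, 1).reshape(b, c, h, w)  # Reshape back to (B, C, H, W)

feat = torch.randn(1, 256, 20, 20)                   # e.g. a 20x20 feature map with 256 channels
out = AIFIBlock(256)(feat)                           # output has the same shape as the input
```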

4. Improved DeepSORT Algorithm

4.1. DeepSORT Tracking Algorithm

Deep Simple Online and Realtime Tracking (DeepSORT) [20] is an improvement on the Simple Online and Realtime Tracking (SORT) [21] target tracking algorithm and introduces a deep learning model for pedestrian re-recognition. A CNN and a recurrent neural network (RNN) are combined to realize the real-time tracking of multiple objects in video [22]. In the real-time object tracking process of DeepSORT, the core process has three parts: prediction/track, detection, and update.
Starting from the first frame in which the object appears, the obtained information is stored in a bounding box (Bbox), and the track Bbox group at each moment forms a set of tracks. After KF prediction, a trajectory Bbox is predicted for the current frame. There are two kinds of results after prediction, Confirmed and Unconfirmed, which are used to distinguish whether the tracked candidate is a real object or not. Then, the current frame is observed, and the detection Bboxes are matched with the predicted Confirmed track Bboxes using the Hungarian algorithm. After a successful match, the Bbox predicted by the KF is updated at the same time. If no match is successful or a missed detection occurs, the unmatched tracks are matched with the unmatched detections; if the match fails again, the object is lost and a new track is created. After the update, the prediction, observation, and update steps are repeated for the next frame until the tracking of the object is finished. The flow of the DeepSORT tracking algorithm is shown in Figure 8.

4.1.1. Kalman Filter (KF) Algorithm

The KF algorithm is a classical state estimation algorithm, which is known for its high efficiency and stability in state estimation and prediction. The basic principle of the KF algorithm is to predict the future state of the system and optimize the state estimation by estimating the current state of the system, combining a dynamic model of the system with the observed data. The KF works with three values of the state quantity: the true value of the state $x_k$, the predicted (prior) value of the state $\hat{x}_k^-$, and the optimal (posterior) estimate $\hat{x}_k$. The predicted value $\hat{x}_k^-$ can be obtained via the state prediction equation:
$\hat{x}_k^- = A \hat{x}_{k-1} + B u_k$ (4)
where $A$ is the state transition matrix, $B$ is the control input matrix, and $u_k$ is the control input quantity. The optimal state estimate $\hat{x}_k$ can be obtained from the state update equation:
$\hat{x}_k = \hat{x}_k^- + K (z_k - H \hat{x}_k^-)$ (5)
where $z_k$ is the observed measurement of the true state, $H$ is the state observation matrix, and $K$ is the Kalman gain matrix. The Kalman gain represents the ratio of the model prediction error to the measurement error in the estimation of the optimal state. Let $Q$ be the process noise matrix, $R$ the measurement noise matrix, $P_k$ the covariance matrix between the true value and the optimal estimate, and $P_k^-$ the covariance matrix between the true value and the predicted value; the Kalman gain under the optimal estimation condition is as follows:
$K = P_k^- H^T (H P_k^- H^T + R)^{-1}$ (6)
The estimation principle of the KF is to minimize the covariance of the optimal state estimate so that it approaches the true value. The estimation error covariance matrix is as follows:
$P_k = (I - K H) P_k^-$ (7)
The prediction covariance matrix is as follows:
$P_{k+1}^- = A P_k A^T + Q$ (8)
Formulas (4)–(8) are the core of the Kalman filtering algorithm, which dynamically adjusts the state estimation and prediction to minimize the error between the estimated and actual system state.
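For concreteness, a minimal NumPy sketch of one predict/update cycle following Equations (4)-(8) is given below; the system matrices A, B, H, Q, and R are assumed to be supplied by the caller.

```python
# One Kalman-filter predict/update cycle, following Equations (4)-(8).
import numpy as np

def kf_predict(x, P, A, B, u, Q):
    x_pred = A @ x + B @ u                               # Eq. (4): prior state estimate
    P_pred = A @ P @ A.T + Q                             # Eq. (8): prior covariance
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)                  # Eq. (6): Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)                # Eq. (5): posterior state estimate
    P_new = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred   # Eq. (7): posterior covariance
    return x_new, P_new
```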

4.1.2. Hungarian Matching Algorithm

The Hungarian matching algorithm, also known as the Hungarian algorithm, is a classical algorithm that can solve the problem of the maximum matching of bipartite graphs. In bipartite graphs, the goal of the Hungarian matching algorithm is to find a maximum match so that as many vertices in the graph as possible are connected to the vertices on the other side.
The principle of the Hungarian matching algorithm is to repeatedly find augmenting paths until the maximum matching is reached. The algorithm starts at an unmatched vertex on one side and searches along alternating paths until an augmenting path is found. An augmenting path is a special path that starts at an unmatched vertex, alternately passes through unmatched and matched edges, and finally reaches an unmatched vertex on the other side. By repeatedly finding augmenting paths, unmatched vertices are connected to unmatched vertices on the other side, and the maximum matching is finally obtained.
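In practice, the Hungarian algorithm is rarely re-implemented by hand; the example below uses SciPy's linear_sum_assignment on a toy cost matrix purely for illustration.

```python
# Minimum-cost one-to-one assignment (Hungarian algorithm) on a toy cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.2, 0.9, 0.7],                # rows: tracks, columns: detections
                 [0.8, 0.1, 0.6],
                 [0.5, 0.4, 0.3]])
rows, cols = linear_sum_assignment(cost)         # optimal pairing of tracks and detections
print(list(zip(rows, cols)), cost[rows, cols].sum())   # pairs (0,0), (1,1), (2,2); total cost 0.6
```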

4.2. Improved KF Algorithm

The core of the DeepSORT multi-object tracking algorithm is the use of a recursive KF and frame-by-frame Hungarian data association. Let $(u, v)$ be the center coordinates of the bounding box, $\gamma$ the length-to-width ratio, $h$ the height, and $\dot{u}, \dot{v}, \dot{\gamma}, \dot{h}$ the corresponding velocities in the image coordinate system. To describe the state of motion, the trajectory processing and state estimation of the algorithm use these eight parameters $(u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$. DeepSORT predicts the motion state of the object by using the linear observation model and constant velocity model of the standard KF. The predicted result is as follows:
$x = [u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}]^T$ (9)
This parameterization estimates the length-to-width ratio of the box rather than its width, which results in an inaccurate estimate of the width. Furthermore, the prediction box does not always fully enclose the vehicle in the real prediction process, as shown in Figure 9.
Through the experimental results, this paper finds that a correctly predicted width can better match the vehicle and greatly improve the IoU in tracking matching. Therefore, this paper uses the width $w$ to replace the length-to-width ratio $\gamma$ in the KF, and the prediction result is obtained as follows:
$x_w = [u, v, w, h, \dot{u}, \dot{v}, \dot{w}, \dot{h}]^T$ (10)
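The difference between the two parameterizations can be made explicit with a small helper that builds the eight-dimensional state from a detection box. The functions below are illustrative and assume the aspect ratio is defined as width divided by height, as in the original DeepSORT.

```python
# Contrast of the original state (u, v, gamma, h) and the modified state (u, v, w, h),
# as in Equations (9) and (10); velocities are initialized to zero.
import numpy as np

def bbox_to_state_original(x1, y1, x2, y2):
    w, h = x2 - x1, y2 - y1
    u, v = x1 + w / 2, y1 + h / 2
    return np.array([u, v, w / h, h, 0.0, 0.0, 0.0, 0.0])   # aspect ratio gamma = w / h

def bbox_to_state_width(x1, y1, x2, y2):
    w, h = x2 - x1, y2 - y1
    u, v = x1 + w / 2, y1 + h / 2
    return np.array([u, v, w, h, 0.0, 0.0, 0.0, 0.0])       # width tracked directly
```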
After $x_w$ is obtained, it is used as the tracker’s prediction result for the object trajectory, including coordinate information and velocity information. In this work, a combination of the minimum cosine distance and the Mahalanobis distance is used to represent the correlation between the motion information contained in the Kalman prediction results and the detection results of moving objects. When the uncertainty of the target motion is low, the Mahalanobis distance is used to associate the motion information, as shown in Equation (11):
$d_w^M(m, n) = (d_n - r_m)^T C_m^{-1} (d_n - r_m)$ (11)
where $d_n$ is the parameter of the $n$-th object detection box, the matrix $C_m$ is the covariance between the average tracking position and the detection position, and $r_m$ is the parameter of the prediction box of the $m$-th tracker. $k_w^M$ is set as the specified threshold: the association of the motion state is successful if the distance does not exceed it; otherwise, it is unsuccessful, as shown in Formula (12):
$\begin{cases} d_w^M(m, n) \le k_w^M, & \mathrm{success}, \\ \mathrm{others}, & \mathrm{unsuccess}. \end{cases}$ (12)
When the camera is in motion, in order to reduce the probability of an ID switch, this work uses the minimum cosine distance to associate the appearance information, as shown in Equation (13):
$d_w^Y(m, n) = \min\{1 - y_n^T y_i^{(m)} \mid y_i^{(m)} \in S_m\}$ (13)
where $y_n$ is the appearance description vector of the $n$-th detection, $y_i^{(m)}$ is the corresponding vector of an object successfully tracked by the $m$-th tracker, and $S_m$ represents the set of all such successfully tracked vectors. $k_w^Y$ is set as the specified threshold. If $d_w^Y(m, n) \le k_w^Y$, the association succeeds; otherwise, it fails, as shown in Formula (14):
$\begin{cases} d_w^Y(m, n) \le k_w^Y, & \mathrm{success}, \\ \mathrm{others}, & \mathrm{unsuccess}. \end{cases}$ (14)
As the uncertainty of the object motion increases, in order to capture the directionality and similarity of the data without considering the absolute size of the vectors and thereby improve the association effect, this work takes the fusion of the two metric distances as the final metric. The association result is shown in Formula (15):
$\begin{cases} R_{m,n} \in (k_w^M \cap k_w^Y), & \mathrm{success}, \\ \mathrm{others}, & \mathrm{unsuccess}, \end{cases}$ (15)
that is, the association succeeds only when both threshold conditions in Formulas (12) and (14) are satisfied. With the weight set as $z \in [0, 1]$, the joint distance is given in Equation (16):
$R_{m,n} = z\, d_w^M(m, n) + (1 - z)\, d_w^Y(m, n)$ (16)
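A compact sketch of this gated association cost is given below; the threshold values and the weight z are illustrative, not the values tuned in this paper.

```python
# Gated association cost following Equations (11)-(16): squared Mahalanobis distance
# for motion, minimum cosine distance for appearance, and a weighted fusion.
import numpy as np

def mahalanobis_sq(det, track_mean, track_cov):           # Eq. (11)
    diff = det - track_mean
    return float(diff.T @ np.linalg.inv(track_cov) @ diff)

def min_cosine_distance(det_feature, track_features):     # Eq. (13)
    det_feature = det_feature / np.linalg.norm(det_feature)
    feats = track_features / np.linalg.norm(track_features, axis=1, keepdims=True)
    return float(np.min(1.0 - feats @ det_feature))

def association_cost(d_M, d_Y, k_M=9.4877, k_Y=0.2, z=0.5):
    # Gate: both metrics must fall below their thresholds (Eqs. (12), (14), (15)).
    if d_M > k_M or d_Y > k_Y:
        return None                                        # association fails
    return z * d_M + (1.0 - z) * d_Y                       # Eq. (16): joint distance
```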

4.3. Enhancing the Feature Extraction Network of DeepSORT

The feature extraction network in the original DeepSORT algorithm is a relatively shallow convolutional neural network, consisting of two convolutional layers, one max pooling layer, six residual blocks, and one average pooling layer, with a feature dimension of 128, as shown in Table 2. The feature extraction network plays an important role, as it is responsible for extracting a recognizable feature vector from the image of the target to achieve the unique identification and tracking of the target. However, in current multi-target tracking tasks, there are occlusions and mutual interference between targets, and the motion trajectories are also complicated and changeable. The simple feature extraction network in DeepSORT is increasingly overwhelmed by these heavier tracking tasks. In order to solve these problems, the residual structure is improved in this paper. Increasing the depth of the network allows the neural network to learn more complex feature representations. Through multi-layer nonlinear transformations, neural networks can gradually transform input data into more abstract and high-level feature representations that better distinguish between different categories, thus improving the classification accuracy of the network. At the same time, increasing the number of layers also increases the network’s representation ability. Deep networks can learn more complex function mappings, so they can better adapt to complex data distributions and relationships among features, improving the generalization ability and adaptability of the network. The network structure designed in this paper to extract more in-depth vehicle information and complete the vehicle target tracking task is shown in Table 3.
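As a rough illustration of what a deeper residual re-identification backbone looks like, the sketch below stacks standard basic residual blocks and ends with adaptive average pooling and a fully connected embedding layer. The layer counts and channel widths are illustrative and do not reproduce Table 3 exactly.

```python
# Illustrative deeper residual backbone for appearance feature extraction.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:           # match dimensions for the shortcut
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)             # residual connection

class ReIDBackbone(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.BatchNorm2d(32), nn.ReLU())
        self.layers = nn.Sequential(
            BasicBlock(32, 64, 2), BasicBlock(64, 64),
            BasicBlock(64, 128, 2), BasicBlock(128, 128),
            BasicBlock(128, 256, 2), BasicBlock(256, 256),
            BasicBlock(256, 512, 2), BasicBlock(512, 512))
        self.pool = nn.AdaptiveAvgPool2d(1)          # adaptive pooling handles any input size
        self.fc = nn.Linear(512, embed_dim)

    def forward(self, x):                            # x: (B, 3, H, W) vehicle crop
        f = self.pool(self.layers(self.stem(x))).flatten(1)
        return nn.functional.normalize(self.fc(f), dim=1)   # L2-normalized appearance feature

emb = ReIDBackbone()(torch.randn(2, 3, 64, 128))     # (2, 512) appearance embeddings
```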

5. Experimental Results and Analysis

5.1. Overall Optimization Process of Tracking Algorithm

In the process of vehicle driving, due to distance changes and possible occlusion, the width and height of the YOLOv5s detection box change with occlusion and target movement. Therefore, when DeepSORT’s feature extraction network extracts vehicle features, the image size needs to be re-adjusted. The original network, adapted to the characteristics of pedestrians, defines the height of the input crop as 128 and the width as 64; for vehicles, in contrast, the height of the rectangular box should be 64 and the width should be 128. In this paper, the minimum detection confidence of YOLOv5s is increased to 0.5 to remove excess interference. The maximum IOU distance is reduced to 0.5; a smaller value makes matching stricter and reduces ID switches. Increasing the number of consecutive track confirmations to six helps to reduce the occurrence of spurious new IDs. In addition, the average pooling layer is changed to adaptive average pooling to adapt to input images of different sizes.
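Expressed as configuration values, the adjustments above might look like the following; the parameter names mirror common DeepSORT implementations and are assumptions for illustration.

```python
# Illustrative tracker settings corresponding to the tuning described above.
TRACKER_CONFIG = {
    "min_detection_confidence": 0.5,   # raised to 0.5 to filter out weak detections
    "max_iou_distance": 0.5,           # lowered to 0.5 to make matching stricter
    "n_init": 6,                       # six consecutive confirmations before a track is created
    "reid_input_size": (128, 64),      # (width, height) of vehicle crops, vs. (64, 128) for pedestrians
}
```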

5.2. Experimental Environment

The hardware and software environments used in the experiment are shown in Table 4 and Table 5.

5.3. Experiments of Object Detection Algorithm

5.3.1. Comparison and Analysis of Model Evaluation Indicators

The VisDrone2019 data set was used in this experiment. It consists of 10,209 static images and 11 object categories. The VisDrone dataset was captured using drone cameras, with data collected using different drone platforms in different scenarios and under different weather and lighting conditions, offering high-definition, diverse scenes and rich target labeling. In this paper, only vehicle objects were detected, so pictures of four object types, namely car, bus, truck, and van, were extracted for training.
After processing the data set into the directory structure required by YOLOv5s, the YOLOv5s configuration file was modified, the number of iterations was set to 100, and the original YOLOv5s and the improved YOLOv5s network models were trained, outputting the corresponding indicators and indicator graphs. The recall rate and the mAP value derived from the PR curve were analyzed to judge the quality and usability of the models. Finally, YOLOv5s and the improved YOLOv5s network model were used to detect the vehicles in the video, and the detection effect was judged.
The recall rate and mAP of the original YOLOv5s and the improved YOLOv5s model trained on the VisDrone2019 vehicle detection data set are shown in Figure 10 and Figure 11, respectively. According to the experimental results, the addition of the AIFI module significantly improved the detection accuracy of the improved YOLOv5s model. Compared to the original YOLOv5s model, the recall rate increased by 7.7%, while the mean average precision (mAP) for cars, buses, trucks, and vans improved by 6.3%, 18.6%, 29.2%, and 11%, respectively.
YOLOv8 introduced a new State-of-the-Art (SOTA) model, including P6 target detection networks at 640 and 1280 resolution and a YOLACT-based instance segmentation model. Therefore, in order to obtain a more objective evaluation of the algorithm presented in this paper, the number of parameters, the computational cost (FLOPs), and the average precision of the improved YOLOv5s are compared with those of the original YOLOv5s and YOLOv8s models.
The VisDrone data set was used to extract car, bus, truck, and van information for training. A total of 6192 pictures were collected in the training set and 519 pictures in the verification set. The results are shown in Table 6. Compared with the original YOLOv5s, the number of parameters of the improved YOLOv5s presented in this paper increased by 2.7%, the FLOPs remained unchanged, and the average precision values $AP^{val}_{50:95}$ and $AP^{val}_{50}$ increased by 16.2% and 15.5%, respectively. Compared to YOLOv8s, the number of parameters and FLOPs of the improved YOLOv5s decreased by 33.93% and 42.3%, while $AP^{val}_{50:95}$ and $AP^{val}_{50}$ increased by 7.5% and 16.5%, respectively. This means that, compared with the YOLOv8s SOTA model, the improved YOLOv5s presented in this paper is more suitable for vehicle detection.

5.3.2. Comparison and Analysis of the Actual Detection Effect of the Models

In order to verify the usability of the improved model, this experiment used YOLOv5s, the improved YOLOv5s, and other versions of the YOLOv5 network model to evaluate object detection performance on vehicles in pictures and videos and to judge the actual detection effect before and after optimization. A total of 750 internet vehicle pictures were randomly selected for the actual performance test. The actual detection speed of each network model is shown in Table 7, and the detection effect is shown in Figure 12; the number above each detection box represents the confidence of the model when detecting the object, and the closer it is to 1, the more accurate the detection. As can be seen from the experimental results, the improved YOLOv5s can detect more vehicles with higher accuracy and a lower detection delay (the average detection speed for each image is 10.9 ms, second only to YOLOv5n), and its average performance is the best. However, except for YOLOv5n, the width and depth of the networks of the other models are greater than those of the improved YOLOv5s, which means that the volume of the model produced by training is also larger. Such models contain more parameters and their overall accuracy is further improved, but their detection speed is sacrificed. The truck detection performance of the improved YOLOv5s presented in this paper is slightly worse than that of the larger YOLOv5m, YOLOv5l, and YOLOv5x models, but it differs little from that of the original YOLOv5s model and falls within the normal error range.

5.4. Experiments Using the Multi-Object Tracking Algorithm

After the input of the collected video material, the results were sent to the multi-target tracking algorithm for target tracking after target detection. In this paper, vehicle images taken during the daytime, night, and rainy days were tracked, and three evaluation indexes, MOTA, MOTP, and Frag, were used to compare the algorithms’ performance. The tracking performance indexes obtained from the experiments are shown in Table 8, and the tracking effects are shown in Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18.
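For reference, MOTA and MOTP follow the standard CLEAR MOT definitions (not restated in the paper): MOTA penalizes misses, false positives, and identity switches relative to the number of ground-truth objects, while MOTP averages the localization score over matched detection-track pairs. A minimal sketch:

```python
# Standard CLEAR MOT metric definitions, shown here only for reference.
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects

def motp(total_overlap, num_matches):
    return total_overlap / num_matches      # e.g. summed IoU over all matched boxes
```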
Compared with the original algorithm, after combining YOLOv5s with the improved DeepSORT, MOTA increased by 6.97%, MOTP increased by 6.91%, and Frag decreased by 24.76%. MOTA increased by 4.56%, MOTP increased by 11.16%, and Frag decreased by 23.79% when combining the improved YOLOv5s with the original DeepSORT. Finally, MOTA increased by 9.62%, MOTP increased by 14.84%, and Frag decreased by 32.52% when combining the improved YOLOv5s with the improved DeepSORT, i.e., the proposed algorithm. On the whole, the improved YOLOv5s has a greater impact on the improvement of MOTP and can locate objects more accurately, while the improved DeepSORT has a greater impact on the improvement of MOTA and can better maintain the tracking trajectory.
Internet materials were used to conduct the actual tracking effect tests. The test effects of the original YOLOv5s and DeepSORT multi-target tracking algorithms and the proposed algorithm, used during the daytime, are shown in Figure 13 and Figure 14, respectively. It can be seen from the observation results that the original algorithm failed to detect distant vehicles due to environmental occlusion, the great distance, and other problems in frames 10 and 20. The ID of vehicle No. 8 also changed, and the overall actual tracking effect is mediocre. Compared with the original algorithm, the proposed algorithm can detect more vehicles in the same frame and can correctly detect and track vehicles with severe occlusion without ID hopping.
By observing Figure 15 and Figure 16, it can be seen that the original algorithm still fails to detect some vehicles at night. Vehicle 9 in frame 30 was not tracked due to the occlusion caused by vehicle 6, and vehicle 5 in frame 40 is lost, indicating the poor tracking stability of the original algorithm. Compared with the original algorithm, the improved algorithm presented in this paper detects more vehicles at night, and the tracking results of frames 30 and 40 remain unchanged. In frame 30, the vehicle in the lower right corner occupies a small area since only its left front wheel is exposed; neither the original algorithm nor the algorithm presented in this paper can detect it.
In Figure 17 and Figure 18, different degrees of reflection appear in the environment due to the rain. For vehicle 22 in frame 27 of the original algorithm, multiple vehicles are enclosed in a single detection box, and some vehicles could not be detected because they were blocked by traffic lights, so the detection effect is not good. The algorithm presented in this paper does not enclose multiple vehicles in a single box, and vehicles can be correctly detected and tracked in images with severe occlusion, being less affected by the weather. However, light reflected in the upper left corner seriously interferes with the detection of vehicles, there are obstacles blocking the vehicles in the left corner of Figure 17 and Figure 18, and, in frame 27, only the left rear wheel of the vehicle in the upper right corner is exposed and its area is small. Hence, neither the original algorithm nor the proposed algorithm could detect these vehicles.

6. Conclusions

In this paper, a vehicle multi-object tracking method based on improved YOLOv5s and DeepSORT was introduced in detail. By adding an AIFI module to the backbone network of the YOLOv5s object detection algorithm, the speed of processing complex information and the vehicle detection accuracy under different levels of occlusion were improved. By improving the KF algorithm and the network structure of the DeepSORT multi-object tracking algorithm, the feature extraction ability of the vehicle was improved, the correlation between the data was enhanced, and the tracking effect of the vehicle was strengthened.
The experimental data and results show that, compared with the original algorithm, the proposed algorithm can reduce the occurrence of missed detection and false detection when facing complicated environmental interference and blocked vehicles, and was more stable when continuously tracking the vehicles.
It is important to note that the evaluation of performance depends on the selected metrics and specific application scenarios, and different tasks and data sets may obtain different results. Compared with other CNN architecture models, the improved YOLOv5s presented in this work has a lower volume and fewer parameters, which makes it more suitable for resource-constrained environments, such as mobile devices or embedded systems. The proposed algorithm also has some inherent limitations, such as inaccuracy when locating the object boundary and poor multi-scale target detection.
In future work, we aim to create a more lightweight, faster traffic road vehicle detection model, while ensuring its accuracy and detection speed. We will try to use the SOTA models, in addition to YOLOv5, to further improve the stability and robustness of the vehicle detecting algorithm and to ensure its effectiveness in various complex traffic scenarios. This will enable the detection model to be deployed on embedded devices and edge computing platforms and run more quickly and efficiently, which is critical for real-time decision making in intelligent transportation systems.

Author Contributions

Conceptualization, T.B., G.W. (Guihao Wang) and G.W. (Geng Wei); methodology, T.B., G.W. (Guihao Wang), G.W. (Geng Wei) and Q.Z.; software, T.B. and G.W. (Guihao Wang); validation, T.B., G.W. (Guihao Wang) and G.W. (Geng Wei); formal analysis, T.B., G.W. (Guihao Wang) and G.W. (Geng Wei); investigation, T.B., G.W. (Guihao Wang), G.W. (Geng Wei) and Q.Z.; resources, T.B. and G.W. (Guihao Wang); data curation, T.B. and G.W. (Guihao Wang); writing—original draft preparation, T.B. and G.W. (Guihao Wang); writing—review and editing, T.B., G.W. (Guihao Wang), G.W. (Geng Wei) and Q.Z.; visualization, T.B. and G.W. (Guihao Wang); supervision, T.B.; project administration, T.B.; funding acquisition, T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Program (Grant No. Guike AD21238038), the National Natural Science Foundation of China (Grant No. 62161031), and the Natural Science Foundation of Guangxi Province (Grant No. 2020GXNSFAA297184).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from Bui at [email protected] upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hedges Company. How Many Cars Are There in the World. 2023. Available online: https://hedgescompany.com/blog/2021/06/how-many-cars-are-there-in-the-world/ (accessed on 15 April 2023).
  2. Gao, X.W.; Shen, Z.; Xu, G.Y.; Feng, L. Traffic anomaly detection based on multi-target tracking. Appl. Res. Comput. 2021, 38, 1879–1883. [Google Scholar]
  3. Qiao, P. Research on Traffic Flow Detection Based on Deep Learning and Edge Task Offloading. Master’s Thesis, Xidian University, Xi’an, China, 2019. [Google Scholar]
  4. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  5. Liu, Y.J.; Dou, C.H.; Zhao, Q.L.; Li, Z.M. Online Multiple Object Tracking Based on State Prediction and Motion Structure. J. Comput.-Aided Des. Graph. 2018, 30, 289–297. [Google Scholar] [CrossRef]
  6. Bashar, M.K.; Islam, S.; Hussain, K.K.; Hasan, B.; Rahman, A.; Kabir, H. Multiple Object Tracking in Recent Times: A Literature Review. arXiv 2022, arXiv:2209.04796. [Google Scholar]
  7. Li, X.; Fang, G.; Rao, L.; Zhang, T. Multi-target tracking of person based on deep learning. Comput. Syst. Sci. Eng. 2023, 47, 2671–2688. [Google Scholar] [CrossRef]
  8. Yu, H. Multi-objective algorithm based on YOLOv5+DeepSort. Inf. Technol. Informatiz. 2023, 6, 87–90. [Google Scholar]
  9. Bao, J.; Dong, Y.C.; Liu, H.Z. Survey of Object Tracking Algorithms Based on Deep-Sort. In Proceedings of the 23rd Annual Conference on New Network Technology and Application of the Network Application Branch of the China Computer Users Association, Enshi, China, 1–2 August 2019; Beijing Key Laboratory of Information Service Engineering, Beijing Union University: Beijing, China, 2019. [Google Scholar]
  10. Wang, R.; Lin, Z.J.; Chen, P.P. Research on Pedestrian Tracking Method Based on Improved DeepSort. Radio Commun. Technol. 2023, 49, 1117–1124. [Google Scholar]
  11. Liu, B.; Wang, S.Q.; Gao, M.; Liu, W. An improved DeepSORT mine personnel tracking algorithm. TV Technol. 2023, 47, 15–19. [Google Scholar]
  12. Zhang, L.J.; Zhang, Z.W.; Jiang, Y.T.; Li, T.M.; Hu, M.D.; Liu, Y.X. Stable and real-time pedestrian tracking method based on improved DeepSORT under complex background. Liq. Cryst. Disp. 2023, 38, 1128–1138. [Google Scholar] [CrossRef]
  13. He, W.K.; Peng, Y.H.; Huang, W.; Yao, Y.J.; Chen, Z.H. Research on Dynamic Vehicle Multi-Object Tracking Method Based on DeepSort. Automot. Technol. 2023, 2023, 27–33. [Google Scholar]
  14. Zheng, F.T.; Xing, G.S. Pedestrian multi-target tracking algorithm based on improved DeepSort. Mod. Electron. Technol. 2023, 46, 40–46. [Google Scholar]
  15. Liu, Z.B. Road Target Tracking Algorithm Based on Improved YOLOv5 and DeepSort. Automot. Appl. Technol. 2022, 47, 40–44. [Google Scholar]
  16. Liu, L. Research on Traffic Flow Statistics of Intelligent Transportation Based on YOLO Network. Master’s Thesis, Xi’an University of Science and Technology, Xi’an, China, 2019. [Google Scholar]
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  18. Lv, W.Y.; Xu, S.L.; Zhao, Y.; Wang, G.Z.; Wei, J.M.; Cui, C.; Du, Y.N.; Dang, Q.Q.; Liu, Y. DETRs Beat YOLOs on real-time object detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  20. Wojke, N.; Bewley, A.; Paulus, D. Simple online and real time tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  21. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and real time tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016. [Google Scholar]
  22. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall structure of the algorithm in this work.
Figure 2. VOC2012 directory structure.
Figure 3. A part of the images of the VisDrone data set.
Figure 4. The process diagram of vehicle labeling.
Figure 5. YOLOv5s network structure.
Figure 6. Improved YOLOv5s network structure.
Figure 7. Feedforward neural network structure.
Figure 8. Workflow of the DeepSORT tracking algorithm.
Figure 9. Vehicle prediction box.
Figure 10. Recall rate of object detection models.
Figure 11. PR curve performance of object detection models: (a) car; (b) bus; (c) truck; (d) van.
Figure 12. Comparison of detection results of different YOLOv5s models in the same image.
Figure 13. Tracking results of frames 10 and 20 of the original algorithm in daytime.
Figure 14. Tracking results of frames 10 and 20 of the proposed algorithm in daytime.
Figure 15. Tracking results of frames 30 and 40 of the original algorithm at night.
Figure 16. Tracking results of frames 30 and 40 of the proposed algorithm at night.
Figure 17. Tracking results of frames 27 and 37 of the original algorithm, taken during a rainy day.
Figure 18. Tracking results of frames 27 and 37 of the improved algorithm, taken during a rainy day.
Table 1. Number of vehicles (hundred million).

Region | Number of Vehicles (Hundred Million)
Asia | 5.31
Europe | 4.053
North America | 3.51
South America | 0.83
Middle East | 0.49
Africa | 0.26
Table 2. Feature extraction network of the original DeepSORT.

Network Layer | Output | Network Layer | Output
Conv 1 | 32 × 128 × 64 | Residual 6 | 64 × 32 × 16
Conv 2 | 32 × 128 × 64 | Residual 7 | 64 × 32 × 16
Max Pool 3 | 32 × 64 × 32 | Residual 8 | 128 × 16 × 8
Residual 4 | 32 × 64 × 32 | Residual 9 | 128 × 16 × 8
Residual 5 | 32 × 64 × 32 | Dense 10 | 128
Table 3. Network structure of improved ResNet36.

Network Layer | Output | Network Layer | Output
Conv 1 | 32 × 128 × 64 | Residual 8 | 128 × 32 × 32
Conv 2 | 32 × 128 × 64 | Residual 9 | 256 × 16 × 16
Max Pool 3 | 32 × 128 × 128 | Residual 10 | 256 × 16 × 16
Residual 4 | 32 × 128 × 128 | Residual 11 | 512 × 8 × 8
Residual 5 | 64 × 64 × 64 | Residual 12 | 512 × 8 × 8
Residual 6 | 64 × 64 × 64 | Dense 13 | 512
Residual 7 | 128 × 32 × 32 | |
Table 4. Experimental hardware environment.

Hardware Configuration | Version
CPU | Intel(R) Core(TM) 12700KF 3.60 GHz, Intel, Shanghai, China
Memory | 16 GB
GPU | NVIDIA GeForce RTX 3080, NVIDIA, Shenzhen, China
Operating System | Windows 10, 64-bit
Table 5. Experimental software environment.

Software Configuration | Version
Language | Python 3.8.5
Deep Learning Framework | PyTorch 1.8
GPU Engine | CUDA 12.1 and cuDNN 7.6
Virtual Environment | Miniconda3
Table 6. Detection performance of the improved YOLOv5s and other models.

YOLO Algorithms | #Params (M) | FLOPs (G) | $AP^{val}_{50:95}$ (%) | $AP^{val}_{50}$ (%)
YOLOv5s | 7.2 | 16.5 | 35.9 | 56.3
YOLOv8s | 11.2 | 28.6 | 38.8 | 55.8
Improved YOLOv5s | 7.4 | 16.5 | 41.7 | 65.0
Table 7. Average detection speed per image (in ms).

YOLO Algorithms | Pre-Process | Inference | Non-Maximum Suppression (NMS) | ALL
YOLOv5l | 0.6 | 30.3 | 1.5 | 31.8
YOLOv5m | 0.7 | 20.2 | 1.5 | 22.4
YOLOv5n | 0.7 | 5.2 | 1.3 | 7.2
YOLOv5s | 0.7 | 10.5 | 1.5 | 12.7
YOLOv5x | 0.6 | 31.5 | 1.5 | 33.6
Improved YOLOv5s | 0.6 | 9.4 | 0.9 | 10.9
Table 8. Comparison of evaluation indicators of multi-object tracking algorithms.

Algorithms | MOTA (%) | MOTP (%) | Frag
YOLOv5s and DeepSORT | 67.57 | 73.27 | 206
YOLOv5s and Improved DeepSORT | 72.78 | 78.33 | 155
Improved YOLOv5s and DeepSORT | 70.65 | 81.45 | 157
Proposed Algorithm | 74.07 | 84.14 | 139
