Article

Multi-Object Vehicle Detection and Tracking Algorithm Based on Improved YOLOv8 and ByteTrack

School of Electronic Information Engineering, China West Normal University, Nanchong 637009, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3033; https://doi.org/10.3390/electronics13153033
Submission received: 11 July 2024 / Revised: 29 July 2024 / Accepted: 30 July 2024 / Published: 1 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Vehicle detection and tracking technology plays a crucial role in Intelligent Transportation Systems. However, due to factors such as complex scenarios, diverse scales, and occlusions, issues like false detections, missed detections, and identity switches frequently occur. To address these problems, this paper proposes a multi-object vehicle detection and tracking algorithm based on CDS-YOLOv8 and improved ByteTrack. For vehicle detection, the Context-Guided (CG) module is introduced during the downsampling process to enhance feature extraction capabilities in complex scenarios. The Dilated Reparam Block (DRB) is reconstructed to tackle multi-scale issues, and Soft-NMS replaces the traditional NMS to improve performance in densely populated vehicle scenarios. For vehicle tracking, the state vector and covariance matrix of the Kalman filter are improved to better handle the nonlinear movement of vehicles, and Gaussian Smoothed Interpolation (GSI) is introduced to fill in trajectory gaps caused by detection misses. Experiments conducted on the UA-DETRAC dataset show that the improved algorithm increases detection performance, with mAP@0.5 and mAP@0.5:0.95 improving by 9% and 8.8%, respectively. In terms of tracking performance, mMOTA improves by 6.7%. Additionally, comparative experiments with mainstream detection and two-stage tracking algorithms demonstrate the superior performance of the proposed algorithm.

1. Introduction

Vehicle detection and tracking technology plays a key role in Intelligent Transportation Systems [1]. Utilizing this technology enables the real-time monitoring of road traffic flow, assisting traffic management authorities in making informed decisions to effectively alleviate traffic congestion. Additionally, vehicle detection and tracking technology can identify traffic violations, reducing the likelihood of traffic accidents. This technology also provides intelligent perception of vehicles and road environments, offering smarter and more efficient solutions for traffic management. In recent years, AI technology has played a significant role in enhancing the capabilities of vehicular networks. For example, in vehicular edge network computing, Y. Ju et al. [2] have conducted research on secure offloading using non-orthogonal multiple access (NOMA) technology and asynchronous deep reinforcement learning. This not only improves the security and efficiency of data transmission but also supports the real-time requirements for vehicle detection and tracking.
However, several challenges are often encountered when performing vehicle detection and tracking. Factors such as complex lighting variations, weather impacts, the diversity of vehicle types and sizes, and high-density traffic scenes can significantly reduce the effectiveness of detection and tracking. In response to these challenges, numerous research efforts have been undertaken. For instance, D. Roy et al. [3] improved vehicle detection and tracking performance by introducing multi-modal information. M. Humayun et al. [4], based on YOLOv4, addressed variations in various weather conditions by adding a Spatial Pyramid Pooling Block. P. Deshmukh et al. [5] used the Swin Transformer to extract multi-scale features and utilized a Bidirectional Feature Pyramid Network to handle the diversity in vehicle scales. J. Wang et al. [6], based on YOLOv5, introduced an attention mechanism to enhance the detection performance of small vehicles and employed the JDE algorithm for vehicle tracking. T. Bui et al. [7], also based on YOLOv5, introduced an Attention-based Intra-scale Feature Interaction module to accelerate the convergence of the detection network and used an improved ResNet36 as the DeepSORT feature extraction network for vehicle tracking. In addition to improving object detection models and vehicle tracking methods based on the Kalman filter, several technological solutions have been proposed to address these challenges. Y. Wu et al. [8] proposed a technique using a generative adversarial network (GAN) to convert daytime vehicle datasets into nighttime datasets, enhancing nighttime detection performance from the perspective of dataset expansion. B. Xu et al. [9] introduced a monocular vision framework that uses an image inpainting model to restore occluded vehicles through masking, improving performance in congested traffic scenarios. Additionally, B. Xu et al. [10] employed a method combining key point detection and multiple deep learning models, optimizing the detection and tracking of vehicle axle load points by integrating the YOLO-v5 and HourglassNet models with a dynamics model. H. Xu et al. [11] proposed a Cross-Domain Car Detection Model with an integrated convolutional block attention mechanism (CDCDMA), which builds a cross-domain object detection framework and emphasizes headlight features to handle changes in complex scenes. C. Zhang et al. [12] introduced a system framework that addresses cross-view vehicle re-identification challenges by learning high-order relationships and topological information. This framework effectively improves recognition accuracy across different views through the feature extraction module, graph convolution module, and graph matching module.
In the realm of vehicle detection, most methods are typically optimized for single scenarios. To address this, this paper proposes an improved algorithm based on YOLOv8 [13]—CDS-YOLOv8. This algorithm introduces the Context-Guided module [14] during the downsampling process to enhance feature extraction capabilities, thereby adapting to complex scene variations. Simultaneously, the Dilated Reparam Block [15] structure is reconstructed and integrated with the C2f module to fully extract multi-scale and sparse features, enhancing the model’s performance in multi-scale object detection. Additionally, in the post-processing stage, the algorithm adopts Soft-NMS to replace traditional NMS, preserving more valid candidate boxes in densely populated vehicle scenes through smooth confidence decay. In terms of vehicle tracking, most tracking algorithms typically focus only on high-confidence detection boxes while ignoring low-confidence ones, leading to missed detections of some true targets. To address this, this paper selects ByteTrack [16] as the base tracker and makes adjustments to the state vector and covariance matrix in the Kalman filter. Furthermore, Gaussian Smoothed Interpolation [17] is introduced to post-process vehicle trajectories, compensating for missing trajectories.
The main contributions of this paper are summarized as follows:
  • We proposed an improved vehicle detection model, CDS-YOLOv8: To address the limitations of YOLOv8 in urban environments for vehicle detection, the CDS-YOLOv8 model introduces a CG module during downsampling to enhance feature extraction capabilities in response to complex scene changes. It reconstructs the DRB to better handle multi-scale issues caused by varying camera distances and vehicle sizes. Additionally, Soft-NMS replaces traditional NMS to retain more valid candidate boxes, improving performance in densely populated vehicle scenes.
  • We improved the ByteTrack tracker: ByteTrack was selected as the base tracker. To address the limitations of the original Kalman filter in capturing vehicle motion, improvements were made to the state vector and covariance matrix to better handle the nonlinear changes in vehicle movement. To mitigate track interruptions due to detection loss, GSI was used to fill in tracking gaps, improving track continuity and reducing identity switches and track loss.
  • The integration of the improved detector and tracker was achieved: CDS-YOLOv8 and the improved ByteTrack were combined and validated on public datasets. Experimental results indicate that this method demonstrates advanced performance and superiority across multiple metrics.

2. Related Work

2.1. Vehicle Detection

In recent years, with the rapid development of deep learning technology, significant progress has been made in the field of object detection. Deep learning-based object detection methods can be divided into one-stage and two-stage approaches. The advantage of one-stage methods lies in their fast processing speed, as they directly generate object categories and bounding boxes through a single network. In contrast, two-stage methods first generate candidate regions and then perform fine detection, with the advantage of higher detection accuracy. Representative one-stage methods include the YOLO series, SSD [18], and RetinaNet [19], while representative two-stage methods include Faster R-CNN [20] and Cascade R-CNN [21]. The working principle of the YOLO series is to divide the input image into a fixed-size grid, with each grid directly predicting the location and category probability of the object. SSD extracts the bounding boxes and categories of objects on feature maps of different scales, achieving multi-scale detection. RetinaNet employs a Feature Pyramid Network for feature extraction and introduces Focal Loss to address the issue of class imbalance. Faster R-CNN first generates candidate regions through a Region Proposal Network and then performs fine object classification and bounding box regression on these regions. Cascade R-CNN, building on Faster R-CNN, introduces a multi-stage regression and classification structure that progressively refines candidate regions, further improving detection accuracy.
To meet the practical needs of vehicle detection, several research efforts have further optimized general detectors. Z. Chen et al. [22] proposed an improved method based on SSD by introducing channel attention mechanisms and deconvolution modules for feature fusion, thereby enhancing vehicle detection performance. L. Kang et al. [23] presented a Type-1 Fuzzy Attention method based on YOLOv5 to improve vehicle detection. This approach optimizes detection performance by focusing the detector on the target center. M. Bie et al. [24] proposed a Bidirectional Feature Pyramid Network based on YOLOv5 to enhance the model’s feature extraction capabilities. B. Wang et al. [25] introduced an improved method based on YOLOv8 by incorporating the BiFormer attention mechanism and BiFPN network to enhance feature fusion capabilities.
In terms of selecting a detection algorithm, YOLOv8 achieves a good balance between detection speed, accuracy, and model size. Therefore, this paper proposes the CDS-YOLOv8 algorithm, based on YOLOv8, to address vehicle detection issues under complex weather conditions, scale diversity, and dense vehicle environments. Without increasing the computational load, this algorithm effectively improves vehicle detection performance.

2.2. Multi-Target Vehicle Tracking

Detection-based multi-target tracking algorithms can be divided into single-stage and two-stage methods. Single-stage methods generally use an end-to-end structure, merging object detection and association into a single step to directly produce detection and tracking results, with the advantage of strong real-time performance. Two-stage methods first perform object detection and then carry out object association, offering higher accuracy. Typical single-stage methods include JDE [26] and FairMOT [27]. JDE completes object detection and embedding in the same network, with the detection and embedding parts sharing most of the network parameters. FairMOT integrates object detection and ReID networks into one network but processes detection and re-identification tasks through independent branches.
Typical two-stage tracking algorithms include the SORT series. SORT [28], proposed by A. Bewley et al., uses Faster R-CNN as the detector in the detection stage to output object location and category information. The tracking stage completes trajectory association by combining the Kalman filter and the Hungarian algorithm. Subsequently, N. Wojke et al. proposed the DeepSORT [29] algorithm, which, compared to SORT, introduces the ReID method, using a feature extraction network to obtain the appearance features of objects and performing cascaded matching of appearance features and Intersection Over Union (IoU) during data association. StrongSORT [17] further enhances appearance feature extraction, motion noise handling, and camera motion compensation based on DeepSORT, and supplements and optimizes trajectories with missing detections and associations. Y. Zhang et al. [16] proposed the ByteTrack algorithm, which improves tracking performance by associating almost all detection boxes. Following this, N. Aharon et al. proposed the BoT-SORT [30] algorithm, which improves the Kalman filter on the basis of ByteTrack, adding camera motion compensation and the ReID method.
In vehicle tracking scenarios, Z. Zhao et al. [31] designed a new joint scoring strategy based on the JDE framework and introduced CBAM and Transformer encoders to enhance detection capabilities, thereby improving tracking performance. K. Zhang et al. [32] incorporated the CAENet attention mechanism into YOLOX to improve vehicle detection and used DeepSORT for vehicle tracking. G. Han et al. [33] tracked vehicles based on FairMOT, optimized the DLA-34 network to enhance detection performance, and introduced a joint loss function to reduce identity switching issues.
Compared to single-stage tracking algorithms, two-stage tracking algorithms are more flexible and achieve higher accuracy in complex scenarios. Therefore, this paper opts to improve ByteTrack based on CDS-YOLOv8 for vehicle detection and tracking. Unlike most tracking methods that ignore low-score detection boxes, ByteTrack fully considers low-score detection boxes in trajectory association, effectively addressing the issue of missing trajectory fragments. For instance, when partial occlusion occurs, the detection box scores may vary from high to low. By considering low-confidence score boxes, ByteTrack can identify partially occluded targets, reducing the incidence of missed detections.

3. Methods

In the process of vehicle detection and tracking, the CDS-YOLOv8 detector first outputs the detection boxes for the current frame. The improved Kalman filter then predicts the bounding boxes for the current frame through tracklet prediction. Subsequently, the detection boxes and predicted boxes are matched using the Hungarian Algorithm, with IoU serving as the basis for association. The specific process of secondary matching is illustrated in Figure 1.
The detection boxes output using CDS-YOLOv8 are divided into high-confidence boxes and low-confidence boxes based on their confidence scores, followed by two rounds of association processing. In the first round of association, the improved Kalman filter predicts the tracked tracklets and the lost tracklets and then matches the high-confidence boxes with the predicted boxes. If the match is successful, the tracklet state is updated; if not, the tracked tracklets that are not matched with high-confidence boxes undergo a second round of association with low-confidence boxes. In the second round of association, if the match is successful, the tracklet state is updated; if not, the tracklet enters the lost state. In tracklet management, tracklets that remain unmatched for more than 30 frames are deleted. New tracklets are assigned to unmatched high-confidence detection boxes. Finally, after vehicle detection and tracking, GSI is used to fill in gaps in the trajectories caused by missing detections, optimizing the continuity and completeness of the trajectories.
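To make the two-round association above concrete, the following is a minimal Python sketch. It assumes a hypothetical tracklet class exposing predict() and update() methods around the Kalman filter and uses SciPy's Hungarian solver for matching; the confidence and IoU thresholds are illustrative rather than the exact values used in our experiments, and tracklet bookkeeping (the 30-frame deletion rule and new-track creation) is only indicated in comments.

```python
# Simplified sketch of the two-round association; detections are dicts with
# "box" ([x1, y1, x2, y2]) and "score". The tracklet class is a placeholder.
import numpy as np
from scipy.optimize import linear_sum_assignment

HIGH_THRESH, LOW_THRESH, MATCH_IOU = 0.6, 0.1, 0.3   # illustrative values

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    a = np.asarray(boxes_a, dtype=float)[:, None, :]   # (N, 1, 4)
    b = np.asarray(boxes_b, dtype=float)[None, :, :]   # (1, M, 4)
    lt = np.maximum(a[..., :2], b[..., :2])
    rb = np.minimum(a[..., 2:], b[..., 2:])
    inter = np.clip(rb - lt, 0, None).prod(-1)
    area_a = (a[..., 2:] - a[..., :2]).prod(-1)
    area_b = (b[..., 2:] - b[..., :2]).prod(-1)
    return inter / (area_a + area_b - inter + 1e-9)

def associate(pred_boxes, det_boxes):
    """Hungarian matching on IoU; returns matched pairs and leftovers."""
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(pred_boxes))), list(range(len(det_boxes)))
    cost = 1.0 - iou_matrix(pred_boxes, det_boxes)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - MATCH_IOU]
    matched_r = {r for r, _ in matches}
    matched_c = {c for _, c in matches}
    return (matches,
            [r for r in range(len(pred_boxes)) if r not in matched_r],
            [c for c in range(len(det_boxes)) if c not in matched_c])

def byte_step(tracklets, detections):
    """One frame of two-round association (tracklet bookkeeping omitted)."""
    high = [d for d in detections if d["score"] >= HIGH_THRESH]
    low = [d for d in detections if LOW_THRESH <= d["score"] < HIGH_THRESH]
    preds = [t.predict() for t in tracklets]          # Kalman prediction per tracklet

    # Round 1: high-confidence boxes vs. all predicted tracklets.
    m1, un_trk, un_high = associate(preds, [d["box"] for d in high])
    for r, c in m1:
        tracklets[r].update(high[c]["box"])

    # Round 2: remaining tracklets vs. low-confidence boxes.
    rem_preds = [preds[r] for r in un_trk]
    m2, still_un, _ = associate(rem_preds, [d["box"] for d in low])
    for r, c in m2:
        tracklets[un_trk[r]].update(low[c]["box"])

    lost = [tracklets[un_trk[r]] for r in still_un]   # mark as lost; delete after 30 frames
    new = [high[c] for c in un_high]                  # unmatched high boxes start new tracklets
    return lost, new
```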
The CDS-YOLOv8 model is composed of three parts: the Backbone, Neck, and Head, as illustrated in Figure 2:
In Figure 2, the Backbone is used for feature extraction, extracting multi-level features from the input image. The Neck is responsible for feature fusion, combining deep and shallow features using FPN and PAN. The Head contains three detection heads, which handle feature maps of different scales and output the position and category information of the detection boxes. The CDS-YOLOv8 addresses the shortcomings of YOLOv8 in urban vehicle detection with the following improvements: The CG module is introduced during downsampling, which enhances feature extraction in complex environments by combining local and surrounding contextual information and utilizing global information for weighting. In the Backbone and Neck, the DRB structure is restructured and integrated with the C2f module to better address scale diversity issues. In the post-processing stage, Soft-NMS replaces traditional NMS, avoiding the aggressive elimination of candidate boxes and improving detection performance under partial occlusion.

3.1. Context-Guided Downsampling

The environments where vehicles operate include complex scenarios such as nights, cloudy weather, sunny days, and rainy days. To enhance feature extraction capabilities in these complex environments, the Context-Guided module is introduced during the downsampling process. The CG Module comprises several key components: The Local Feature Extractor uses standard convolution to extract local features of the image, providing fine-grained features and boundary information. The Surrounding Context Extractor employs Dilated Convolution to expand the receptive field of local features, thereby capturing a wider range of contextual information. The Joint Feature Extractor concatenates local features and surrounding contextual features and processes them with Batch Normalization (BN) and SiLU activation, forming joint features that enhance nonlinear expression capabilities. The Global Context Extractor aggregates global context information of the entire image through Global Average Pooling and further processes it using a Multi-Layer Perceptron. This applies weights to the joint features, amplifying useful information while suppressing irrelevant information. The structure of the CG Module is illustrated in Figure 3:
In detail, first, the input feature map undergoes 3 × 3 convolution to extract local features, and 3 × 3 convolution with a dilation rate of 2 to extract surrounding contextual features. Next, the Joint Feature Extractor fuses the local features and surrounding contextual features along the channel dimension, followed by the BN layer and SiLU activation. Finally, the Global Context Extractor leverages the image’s global information, using Global Average Pooling to generate a global feature vector for each channel. This vector is then subjected to nonlinear transformation through two fully connected layers, producing channel weights. These weights are applied to the joint features for weighted optimization, enhancing important features while suppressing unimportant ones, thus boosting the model’s feature extraction capability.
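The data flow of the CG module can be illustrated with the following minimal PyTorch sketch. The channel split between the two extractors, the reduction ratio of the global-context MLP, and the stride-2 downsampling are illustrative assumptions rather than the exact configuration used in CDS-YOLOv8.

```python
# Minimal sketch of a Context-Guided downsampling block as described above.
import torch
import torch.nn as nn

class CGDown(nn.Module):
    def __init__(self, in_ch, out_ch, reduction=16):
        super().__init__()
        half = out_ch // 2
        # Local feature extractor: standard 3x3 conv (stride 2 for downsampling).
        self.local = nn.Conv2d(in_ch, half, 3, stride=2, padding=1, bias=False)
        # Surrounding context extractor: 3x3 dilated conv (dilation 2).
        self.context = nn.Conv2d(in_ch, half, 3, stride=2, padding=2, dilation=2, bias=False)
        # Joint feature extractor: BN + SiLU on the concatenated features.
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()
        # Global context extractor: GAP + two FC layers producing channel weights.
        self.fc = nn.Sequential(
            nn.Linear(out_ch, out_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_ch // reduction, out_ch),
            nn.Sigmoid(),
        )

    def forward(self, x):
        loc = self.local(x)
        sur = self.context(x)
        joint = self.act(self.bn(torch.cat([loc, sur], dim=1)))
        # Global average pooling -> channel weights -> reweight the joint features.
        w = self.fc(joint.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        return joint * w

# Example: downsample a 160x160 feature map while going from 64 to 128 channels.
if __name__ == "__main__":
    y = CGDown(64, 128)(torch.randn(1, 64, 160, 160))
    print(y.shape)  # torch.Size([1, 128, 80, 80])
```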

3.2. Reconstructed Dilated Reparam Block

In vehicle detection, significant scale variations arise due to differences in camera distances and vehicle sizes. To address this issue, we have reconstructed the Dilated Reparam Block and integrated it into the C2f structure. The DRB structure comprises a large-kernel convolution layer and multiple dilated small-kernel convolution layers. The large-kernel convolution layer captures broader spatial features, while the dilated small-kernel convolution layers capture sparse spatial features through dilation operations. The outputs of these layers are processed through the BN layer and then summed up. Using reparameterization techniques, they are equivalently represented as a single large-kernel convolution layer. This design has two primary advantages: The small-kernel convolution layers, by incorporating dilation, capture features over a larger range. When combined with the large-kernel convolution layer, the receptive field is effectively expanded, eliminating the need to increase the model depth or design larger convolution kernels. During inference, reparameterization techniques merge multiple convolution layers into an equivalent single convolution kernel, maintaining the kernel size and computational complexity while significantly enhancing the model’s feature extraction capability. Figure 4 illustrates the DRB structure.
From a parameter perspective, the DRB consists of a 7 × 7 large-kernel convolution layer and three small-kernel convolution layers with different dilation rates. The outputs of all convolution layers are processed through BN layers and are equivalently represented as a single 7 × 7 large-kernel convolution layer during inference, without significantly increasing the computational load.
Structurally, Figure 4 demonstrates how each convolution layer corresponds to areas based on kernel size and dilation rate. For large objects, the DRB effectively extends the receptive field by combining large-kernel convolution layers with dilated convolution layers. For small objects, the dilated convolution layers in the DRB, using small kernels and varying dilation rates, efficiently capture fine-grained features, thereby improving multi-scale vehicle detection performance in complex backgrounds.
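The reparameterization step can be sketched as follows. This is an illustrative sketch rather than the exact DRB implementation: each BN is folded into its convolution, each dilated small kernel is expanded into an equivalent dense kernel by zero insertion, and all branches are summed into a single 7 × 7 kernel; the branch sizes and dilation rates shown are assumptions.

```python
# Sketch of the reparameterization idea behind the DRB.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(weight, bn):
    """Fold a BatchNorm2d (in eval mode) into the preceding bias-free conv weight."""
    std = (bn.running_var + bn.eps).sqrt()
    w = weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def dilate_kernel(weight, dilation):
    """Insert zeros so a dilated kxk kernel becomes an equivalent dense kernel."""
    if dilation == 1:
        return weight
    out_ch, in_ch, k, _ = weight.shape
    size = dilation * (k - 1) + 1
    dense = weight.new_zeros(out_ch, in_ch, size, size)
    dense[:, :, ::dilation, ::dilation] = weight
    return dense

def merge_drb(branches, target=7):
    """Sum (weight, bias) pairs from all branches into one target x target kernel."""
    w_sum, b_sum = 0.0, 0.0
    for w, b in branches:
        pad = (target - w.shape[-1]) // 2
        w_sum = w_sum + F.pad(w, [pad] * 4)   # center small kernels inside the 7x7 kernel
        b_sum = b_sum + b
    return w_sum, b_sum

# Usage sketch: a 7x7 branch plus 3x3 branches with dilations 2 and 3, each followed
# by BN (randomly initialized BN statistics, for illustration only).
if __name__ == "__main__":
    ch = 32
    convs = [nn.Conv2d(ch, ch, k, padding=p, dilation=d, bias=False)
             for k, p, d in [(7, 3, 1), (3, 2, 2), (3, 3, 3)]]
    bns = [nn.BatchNorm2d(ch).eval() for _ in convs]
    x = torch.randn(1, ch, 40, 40)
    with torch.no_grad():
        branches = [fuse_conv_bn(dilate_kernel(c.weight, d), bn)
                    for (c, bn), d in zip(zip(convs, bns), (1, 2, 3))]
        w_eq, b_eq = merge_drb(branches)
        y_multi = sum(bn(c(x)) for c, bn in zip(convs, bns))   # multi-branch output
        y_single = F.conv2d(x, w_eq, b_eq, padding=3)          # merged single 7x7 conv
    print(torch.allclose(y_multi, y_single, atol=1e-4))  # True: the two are equivalent
```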

3.3. Soft-NMS

In the process of vehicle detection and tracking, Soft-NMS is used to replace traditional NMS to improve performance in densely populated vehicle scenarios. The traditional NMS steps are as follows:
  • Sort the candidate boxes in descending order based on their confidence scores.
  • Select the candidate box with the highest confidence score and add it to the final result set.
  • Compute the IoU between this candidate box and the other boxes. If a candidate box’s IoU exceeds a set threshold, it is deleted.
  • From the remaining candidate boxes, select the one with the highest confidence score, and repeat steps 2 and 3 until all candidate boxes have been processed.
Soft-NMS optimizes the third step of traditional NMS: in Soft-NMS, after calculating the IoU between the candidate box with the highest confidence score and the other boxes, instead of deleting those boxes with an IoU exceeding the threshold, a soft handling method is used by decaying their confidence scores. Specifically, we adopt a Gaussian decay method, as shown in Equation (1):
$$ s_i = s_i \cdot \exp\!\left( -\frac{\mathrm{IoU}^2}{\sigma} \right) $$
where s_i represents the confidence score of the i-th candidate box, and σ is the decay factor.
If the decayed confidence score is below the set threshold, the candidate box will be removed. In scenarios with dense vehicles, the overlapping of candidate boxes between targets is severe. Compared to traditional NMS, Soft-NMS can retain more valid candidate boxes, handle adjacent targets more effectively, and reduce the occurrence of missed detections.
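A minimal NumPy sketch of the Gaussian Soft-NMS procedure described above is given below; the decay factor σ and the final score threshold are illustrative values.

```python
# Gaussian Soft-NMS sketch: decay scores of overlapping boxes instead of deleting them.
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        # Pick the highest-scoring remaining box.
        top = idxs[np.argmax(scores[idxs])]
        keep.append(int(top))
        idxs = idxs[idxs != top]
        if len(idxs) == 0:
            break
        # IoU between the selected box and the remaining boxes.
        x1 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_top = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
        area_rest = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_top + area_rest - inter + 1e-9)
        # Gaussian decay instead of hard suppression (Equation (1)).
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        # Remove boxes whose decayed score falls below the threshold.
        idxs = idxs[scores[idxs] >= score_thresh]
    return keep
```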

3.4. Improved Kalman Filter

In two-stage tracking algorithms, the output of the bounding box is usually estimated using the Kalman filter (KF) to determine the target’s motion state. In ByteTrack, the state vector of the Kalman filter is shown in Equation (2):
$$ x = [x_c, y_c, a, h, \dot{x}_c, \dot{y}_c, \dot{a}, \dot{h}]^T $$
where (x_c, y_c) are the center coordinates, a is the aspect ratio, and h is the height.
However, when dealing with partial occlusion and perspective changes, the aspect ratio struggles to adapt. Therefore, directly estimating the width and height is a better choice. The improved state vector is shown in Equation (3):
$$ x_k = [x_c(k), y_c(k), w(k), h(k), \dot{x}_c(k), \dot{y}_c(k), \dot{w}(k), \dot{h}(k)]^T $$
where (x_c(k), y_c(k)) are the center coordinates, w(k) is the width, and h(k) is the height.
Compared to using the aspect ratio, directly estimating the width and height has the following advantages: on one hand, it can more accurately reflect the actual size of the object and better adapt to changes in vehicle dimensions; on the other hand, changes in the shape and size of the vehicle during motion are complex, and using the aspect ratio increases nonlinear errors. Directly estimating the width and height can reduce the distortion of the bounding box caused by the accumulation of errors, thus maintaining better stability in complex scenes.
In the classical Kalman filter, the process noise covariance matrix Q and the measurement noise covariance matrix R are set as fixed values, which remain unchanged throughout the tracking process. To enhance the performance of the Kalman filter in dynamic environments, Q and R are dynamically adjusted according to real-time conditions, Q as shown in Equation (4) and R as shown in Equation (5).
$$ Q_k = \mathrm{diag}\!\left( (\sigma_p \hat{w}_{k-1|k-1})^2,\; (\sigma_p \hat{h}_{k-1|k-1})^2,\; (\sigma_p \hat{w}_{k-1|k-1})^2,\; (\sigma_p \hat{h}_{k-1|k-1})^2,\; (\sigma_v \hat{w}_{k-1|k-1})^2,\; (\sigma_v \hat{h}_{k-1|k-1})^2,\; (\sigma_v \hat{w}_{k-1|k-1})^2,\; (\sigma_v \hat{h}_{k-1|k-1})^2 \right) $$
where σ_p is the standard deviation of the position noise and σ_v is the standard deviation of the velocity noise.
$$ R_k = \mathrm{diag}\!\left( (\sigma_m \hat{w}_{k|k-1})^2,\; (\sigma_m \hat{h}_{k|k-1})^2,\; (\sigma_m \hat{w}_{k|k-1})^2,\; (\sigma_m \hat{h}_{k|k-1})^2 \right) $$
where σ_m is the standard deviation of the measurement noise.
Time-varying Q and R can more flexibly respond to nonlinear changes in the environment, such as the deceleration or acceleration of a vehicle. Compared to fixed covariance matrices, time-varying Q and R can be adjusted based on the current prediction and measurement states, more effectively addressing issues of noise and error accumulation. This allows for better adaptation to dynamic environments and improves the overall performance of the system.
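As an illustration of Equations (4) and (5), the following sketch builds the time-varying Q and R for the eight-dimensional state (center, width, height, and their velocities); the noise factors σ_p, σ_v, and σ_m shown are placeholder values, not the settings used in our experiments.

```python
# Sketch of building the time-varying process and measurement noise covariances.
import numpy as np

SIGMA_P, SIGMA_V, SIGMA_M = 1.0 / 20, 1.0 / 160, 1.0 / 20   # illustrative factors

def process_noise(w_prev, h_prev):
    """Q_k scaled by the previous posterior width/height estimate (Equation (4))."""
    pos = [SIGMA_P * w_prev, SIGMA_P * h_prev, SIGMA_P * w_prev, SIGMA_P * h_prev]
    vel = [SIGMA_V * w_prev, SIGMA_V * h_prev, SIGMA_V * w_prev, SIGMA_V * h_prev]
    return np.diag(np.square(pos + vel))        # 8x8 diagonal matrix

def measurement_noise(w_pred, h_pred):
    """R_k scaled by the predicted (prior) width/height estimate (Equation (5))."""
    m = [SIGMA_M * w_pred, SIGMA_M * h_pred, SIGMA_M * w_pred, SIGMA_M * h_pred]
    return np.diag(np.square(m))                # 4x4 diagonal matrix

# Usage inside one predict/update cycle of a constant-velocity Kalman filter:
#   Q = process_noise(w_hat, h_hat)              # from the last posterior estimate
#   x_pred = F @ x_post;  P_pred = F @ P_post @ F.T + Q
#   R = measurement_noise(x_pred[2], x_pred[3])  # from the prior estimate
#   K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
```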

3.5. Gaussian Smoothed Interpolation

To address the issue of trajectory loss caused by detection omissions in vehicle tracking, the Gaussian Smoothed Interpolation method is employed as a post-processing step for optimization. Traditional linear interpolation methods typically rely only on the positions of two endpoints, ignoring the complexity of vehicle motion, which results in less accurate outcomes. GSI captures the nonlinear motion trajectory of the vehicle through Gaussian Process Regression (GPR), thereby providing more precise results.
For the ith vehicle trajectory, the GSI model is represented by Equation (6):
$$ p_t = f^{(i)}(t) + \epsilon $$
where t is the frame ID, p_t is the vehicle position coordinate at frame t, f^{(i)}(t) is the function fitted by GPR used to describe the vehicle trajectory, and ϵ ∼ N(0, σ²) represents Gaussian noise.
The Gaussian Process is a powerful non-parametric Bayesian method that can capture the nonlinear relationships between data points, assuming the function f ( i ) follows a Gaussian Process, as shown in Equation (7):
$$ f^{(i)} \sim \mathcal{GP}\!\left( 0, k(x, x') \right) $$
where the covariance function k(x, x') is the Radial Basis Function (RBF) kernel, which is particularly sensitive to local changes. The closer the data points are, the higher their correlation, and the response of the RBF kernel quickly diminishes as the distance between data points increases. This characteristic makes the RBF kernel well suited to describing nonlinear changes in vehicle motion, such as starting, turning, and accelerating or decelerating. It is specifically represented as shown in Equation (8):
$$ k(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\lambda^2} \right) $$
The process of trajectory position prediction and interpolation is as follows: for frames with missing detections, the smoothed position P* is predicted as shown in Equation (9):
$$ P^{*} = K(F^{*}, F)\left[ K(F, F) + \sigma^2 I \right]^{-1} P $$
where K(·,·) is the covariance matrix built from the covariance function k(·,·), I is the identity matrix, F* denotes the frames to be interpolated, F denotes the frames with known detections, and P represents the known vehicle positions of the trajectory.
In addition, for the RBF kernel function, the scale parameter λ determines its sensitivity to data changes. During vehicle tracking, different vehicle trajectory lengths may correspond to different motion patterns. Therefore, by adjusting λ to accommodate the length l of the vehicle trajectory, the trajectories can be smoothed, thereby reducing noise and achieving smoother and more accurate tracking. The setting of λ is shown in Equation (10):
$$ \lambda = \tau \cdot \log\!\left( \tau^3 / l \right) $$
where τ is a constant used to control the smoothing degree, which has been set to 18 based on experimental results obtained through grid search.
Through the above steps, GSI can effectively fill in missing trajectories in vehicle tracking, improving the accuracy and stability of trajectory predictions.
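A possible realization of GSI with an off-the-shelf Gaussian Process Regression library is sketched below. The use of scikit-learn, the per-dimension fitting of x_c, y_c, w, and h, and the observation-noise value are assumptions for illustration; λ is set from the trajectory length as in Equation (10).

```python
# GSI sketch: fit a GPR with an RBF kernel over frame IDs and predict the missing boxes.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

TAU = 18  # smoothing constant from the grid search described above

def gsi_fill(track, all_frames):
    """track: (l, 5) array of observed [frame, xc, yc, w, h] rows.
    Returns smoothed boxes for every frame ID in all_frames."""
    l = len(track)
    lam = TAU * np.log(TAU ** 3 / l)            # Equation (10)
    t_obs = track[:, :1].astype(float)          # frame IDs as the GPR input
    t_new = np.asarray(all_frames, dtype=float).reshape(-1, 1)
    out = [t_new.ravel()]
    for dim in range(1, 5):                     # fit xc, yc, w, h independently
        gpr = GaussianProcessRegressor(
            kernel=RBF(length_scale=lam, length_scale_bounds="fixed"),
            alpha=1e-2,                          # assumed observation noise sigma^2
            normalize_y=True)
        gpr.fit(t_obs, track[:, dim])
        out.append(gpr.predict(t_new))
    return np.stack(out, axis=1)                # (len(all_frames), 5)

# Example: frames 3 and 4 were missed by the detector and get interpolated.
if __name__ == "__main__":
    obs = np.array([[1, 100, 50, 40, 30], [2, 104, 52, 40, 30],
                    [5, 116, 58, 41, 31], [6, 120, 60, 41, 31]], dtype=float)
    print(gsi_fill(obs, range(1, 7)))
```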

4. Experiments and Results

4.1. Dataset and Experimental Configuration

The UA-DETRAC dataset [34], jointly released by the University of Alberta and the University of Science and Technology of China, serves as a benchmark dataset specifically designed for multi-object vehicle detection and tracking. This dataset comprises 10 h of video captured using a Canon EOS 550D camera (Canon USA, Inc., Lake Success, NY, USA). The videos are recorded at a frame rate of 25 frames per second, encompassing over 140,000 frames and 1.21 million annotated object detection boxes. The training set consists of 60 video sequences, totaling 83,791 images, while the test set includes 40 video sequences, totaling 56,340 images. The UA-DETRAC dataset presents various challenges, with representative scenarios illustrated in Figure 5. The dataset covers four vehicle categories: cars, buses, vans, and others. Additionally, it includes four weather conditions: sunny, night, cloudy, and rainy, as well as scenes with high vehicle density and occlusion rates.
In the vehicle detection experiments, images without labels were first excluded, resulting in a training set of 82,085 images and a test set of 56,167 images. To avoid overfitting, every 10th frame was sampled, ultimately obtaining 8209 images for training and 5617 images for evaluation. For the vehicle tracking experiments, 40 video sequences from the test set were used for evaluation.
The GPU used in these experiments is an NVIDIA GeForce RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA), and the CPU is an Intel Core i9-9900K @ 3.60 GHz (Intel Corporation, Chengdu, China). The experiments were conducted on a Windows 10 system with a deep learning environment based on PyTorch 1.13.1, CUDA 11.6, and Python 3.8. The training parameters were set as follows: the number of training epochs was 200, the batch size was 32, and no pre-trained weights were used during the training process.

4.2. Evaluation Metrics

In vehicle detection, Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95 are used to evaluate the experimental results.
P: the proportion of true-positive samples among all samples detected as positive, as shown in Equation (11):
$$ P = \frac{TP}{TP + FP} $$
R: the proportion of correctly detected positive samples among all actual positive samples, as shown in Equation (12):
$$ R = \frac{TP}{TP + FN} $$
where TP is the number of true-positive samples, FP is the number of false-positive samples, and FN is the number of actual positive samples that were incorrectly detected as negative samples.
Average Precision (AP): the area under the P-R curve for a certain class of samples, as shown in Equation (13):
$$ AP = \int_0^1 P \, \mathrm{d}R $$
mAP@0.5 is the mean Average Precision for all classes with an IoU threshold of 0.5.
mAP@0.5:0.95 is the mean Average Precision for all classes with IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
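For reference, the following is a minimal NumPy sketch of computing P, R, and AP for a single class from ranked detections; the matching of detections to ground truth (the true-positive flags) is assumed to have been done beforehand at the chosen IoU threshold.

```python
# AP as the area under the P-R curve (Equations (11)-(13)), with the usual
# monotonic-precision envelope.
import numpy as np

def average_precision(scores, tp_flags, num_gt):
    """scores: detection confidences; tp_flags: 1 if the detection matches a GT box."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(tp_flags, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                 # Equation (12)
    precision = cum_tp / (cum_tp + cum_fp + 1e-9)    # Equation (11)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]   # make precision monotonically decreasing
    return float(np.sum(np.diff(mrec) * mpre[1:]))   # Equation (13)

# mAP@0.5 averages this AP over all classes at IoU threshold 0.5;
# mAP@0.5:0.95 additionally averages over IoU thresholds 0.5, 0.55, ..., 0.95.
```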
In vehicle tracking, MOTA, IDF1, and IDS are used to evaluate the experimental results.
MOTA measures the accuracy of the tracking algorithm, considering various errors that occur during the tracking process, such as false positives, missed detections, and identity switches. This is shown in Equation (14):
$$ MOTA = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t GT_t} $$
where FN_t represents the number of missed detections at time t, FP_t represents the number of false positives at time t, IDSW_t represents the number of identity switches at time t, and GT_t represents the number of ground-truth targets at time t.
IDF1: IDF1 measures the identity consistency and continuity of the tracking algorithm, as shown in Equation (15):
$$ IDF1 = \frac{2 \cdot IDP \cdot IDR}{IDP + IDR} $$
where IDP represents the proportion of correctly identified targets among the detected targets. IDR represents the proportion of correctly detected target identities to the actual target identities.
IDS: The number of identity switches that occur during the tracking process, meaning the same target is assigned different identities. A lower value indicates better performance of the tracker in maintaining identity consistency.
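Both tracking metrics can be computed from accumulated counts, as in the following sketch; the per-frame counting and identity-level matching are assumed to be provided by an evaluation toolkit.

```python
# MOTA (Equation (14)) and IDF1 (Equation (15)) from accumulated counts.
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """Per-frame lists of missed detections, false positives, ID switches, and GT counts."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / max(sum(gt_per_frame), 1)

def idf1(idtp, idfp, idfn):
    """idtp/idfp/idfn: identity-level true positives, false positives, and false negatives."""
    idp = idtp / max(idtp + idfp, 1)   # identity precision (IDP)
    idr = idtp / max(idtp + idfn, 1)   # identity recall (IDR)
    return 2 * idp * idr / max(idp + idr, 1e-9)
```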

4.3. Detector Ablation Experiments

To verify the effectiveness of the proposed improvement modules in vehicle detection, ablation experiments were conducted using YOLOv8-S as the baseline model. The results are shown in Table 1:
As seen from the data in Table 1, the introduction of the CG module in the downsampling process increased P by 6.6%, R by 4.7%, mAP@0.5 by 5.7%, and mAP@0.5:0.95 by 4.4%, verifying the effectiveness of the CG module in extracting vehicle features. The introduction of the C2f_DRB module increased P by 9.7%, R by 4.0%, mAP@0.5 by 7.2%, and mAP@0.5:0.95 by 5%, demonstrating the significant advantage of C2f_DRB in handling multi-scale issues. When both the CG and C2f_DRB modules were introduced simultaneously, P increased by 8.9%, R by 5.3%, mAP@0.5 by 8.7%, and mAP@0.5:0.95 by 5.6%, proving that the combination of these two modules can better extract and integrate vehicle features. Finally, replacing traditional NMS with Soft-NMS for post-processing slightly increased mAP@0.5 by 0.3 percentage points, but significantly improved mAP@0.5:0.95 by 3.2%, laying a foundation for more precise vehicle tracking in subsequent stages. Compared to the baseline model, with a reduction of 2.2 G in Flops, P increased by 10.6%, R by 3.7%, mAP@0.5 by 9%, and mAP@0.5:0.95 by 8.8%, fully verifying the superiority of CDS-YOLOv8 in handling vehicle detection scenarios.
To further demonstrate the advancements of the improved model in vehicle detection, a visual analysis of the experimental results is presented, as shown in Figure 6. The images from left to right represent the original image, the prediction results of the baseline model YOLOv8-S, the prediction results of the improved model CDS-YOLOv8, and the adjustments made by CDS-YOLOv8 compared to YOLOv8-S, which include the reduction in missed vehicles. The optimized parts are marked with pink circles. Additionally, the black areas in the images are ignored regions, as the subsequent vehicle tracking only focuses on moving vehicles on the main road.
In Figure 6, we selected three representative scenarios: sunny, nighttime, and rainy. In the sunny scenario, as shown in the pink circled area, CDS-YOLOv8 detected a bus that YOLOv8 missed, as well as cars farther from the camera. In the nighttime scenario, CDS-YOLOv8 detected almost all the targets that YOLOv8 missed and accurately identified a partially occluded white car that YOLOv8 did not recognize as a valid target. In the rainy scenario, CDS-YOLOv8 detected almost all the small targets that were far from the camera and identified a taxi that was partially occluded and not recognized as a valid target by YOLOv8.
In these scenarios, CDS-YOLOv8 demonstrated significant improvements over YOLOv8-S in feature extraction capabilities and multi-scale processing. Additionally, it retained more valid targets in dense vehicle scenes by handling highly overlapping candidate boxes using Soft-NMS with confidence decay rather than simply eliminating them through an IoU threshold. In summary, CDS-YOLOv8 achieved excellent performance under various lighting and weather conditions, effectively handling multi-scale issues and dense vehicle scenarios.

4.4. Comparison Experiment of Detectors

To demonstrate the advancement of CDS-YOLOv8 in vehicle detection, we compared it with mainstream detection algorithms. The results are shown in Table 2:
From the comparison results, it is evident that CDS-YOLOv8 exhibits the most advanced performance compared to these mainstream two-stage and one-stage detection methods. Its performance not only surpasses the newer YOLOv9-S but also outperforms the large version of YOLOv8, achieving improvements of 1.6% and 4% in mAP@0.5 and mAP@0.5:0.95, respectively. CDS-YOLOv8 strikes a better balance between accuracy and computational complexity, further proving its superiority in vehicle detection scenarios.

4.5. Tracker Ablation Experiment

To verify the effectiveness of the improved detector and the proposed improvement modules in vehicle tracking, an ablation experiment was conducted as shown in Table 3. The experimental results are the averages based on 40 test sequences:
Compared to the original detector, the introduction of the improved detector CDS-YOLOv8 increases mIDF1 by 1.3%, reduces IDS by 70, and significantly enhances mMOTA by 3.4%. This reflects the critical role of the detector in the tracking system. Subsequently, after improving the tracker by optimizing the Kalman filter, mIDF1 increases by 1.5%, IDS decreases by 68, and mMOTA improves by 2.5%. This indicates that the improved Kalman filter can more flexibly handle the nonlinear motion of vehicles. On this basis, introducing GSI to process the tracking trajectories further increases mIDF1 by 0.6%, significantly reduces IDS by 187, and enhances mMOTA by 0.8%, demonstrating the effectiveness of this method in filling trajectory gaps. Using the improved detector and tracker, compared to before the improvements, mIDF1 increases by 3.4%, IDS decreases by 325, and mMOTA improves by 6.7%, significantly enhancing vehicle tracking performance.
To more intuitively demonstrate the effectiveness of the improved detector and tracker in vehicle tracking, a visualization analysis of the tracking results was performed, as shown in Figure 7. The four columns of images from left to right are the original image, the tracking results using the original detector and tracker, the tracking results using the improved detector and tracker, and the optimization effects of the improved model compared to the original detector and tracker.
In Figure 7, we also selected three representative scenarios: sunny, nighttime, and rainy. In the sunny scenario, as shown in the pink circled area, CDS-YOLOv8—compared to YOLOv8—not only detected more missed targets but also provided more accurate bounding box detection. For example, CDS-YOLOv8 was able to more comprehensively identify a white van and a taxi at the edge of the image, whereas YOLOv8 only identified a partial region as a bounding box. Additionally, the tracking ID of the white van was 1281 with the original detector and tracker but it significantly dropped to 1099 after improvements, indicating a notable enhancement in tracking consistency. In the nighttime scenario, the improved detector identified more small targets, and the tracking ID of the bus on the left decreased significantly from 3060 to 2529. In the rainy scenario, YOLOv8 identified the front of a police car as a separate vehicle, while CDS-YOLOv8 correctly recognized the entire bounding box of the police car. Furthermore, the tracking ID of the police car significantly decreased from 5898 to 5026 after the improvements.
Overall, the improved detector not only reduced the number of missed detections but also provided more accurate bounding box detections. The significant reduction in ID numbers for the same target indicates that the improved tracker maintained high accuracy and consistency in vehicle tracking. The enhanced detector and tracker are better equipped to handle complex scenarios and nonlinear changes in vehicles in urban environments.

4.6. Comparison Experiments of Trackers

To validate the advancement of the improved tracker, comparative experiments were conducted with mainstream two-stage object tracking algorithms. To ensure fairness, all experiments used CDS-YOLOv8 for the detector. The tracking results are shown in Table 4:
From the comparison results, it can be seen that the improved BYTETrack demonstrates state-of-the-art performance. Analyzing the results of BYTETrack and BOT-SORT tracking algorithms, it is evident that considering low-confidence score boxes can significantly reduce IDS, proving its effectiveness in identity consistency. Additionally, DeepSORT, StrongSORT, DeepOCSORT, and BOT-SORT all employ the ReID method, which greatly reduces the network’s operating speed. In contrast, the improved BYTETrack presented in this paper does not use the ReID method and still achieves the highest performance across all three tracking metrics, achieving a good balance between tracking speed and accuracy.

5. Conclusions

This paper addresses the challenges encountered in vehicle detection and tracking, such as complex scenarios, scale diversity caused by varying vehicle sizes and camera angles, and partial occlusions in dense scenes. A series of improvements were proposed to tackle these issues.
In terms of vehicle detection, we introduce the CG module during the downsampling process, which combines local and surrounding features to form joint features. These joint features are then weighted using global information, enhancing the feature extraction capability of the detector. Additionally, the DRB module was reconstructed and integrated into the C2f module. Using large-kernel convolution layers and multiple small-kernel convolutions with dilations, different scales and sparse features were extracted. Reparameterization technology was employed to generate equivalent convolution kernels, allowing for better comprehensive feature extraction during the inference stage, effectively addressing multi-scale problems. We also replaced the traditional NMS with Soft-NMS, which better handles partial occlusion scenarios through smooth confidence decay. In the aspect of vehicle tracking, we improved the state vector of the Kalman filter and introduced a time-related covariance matrix to better cope with dynamic vehicle movements. Additionally, the GSI method was incorporated to fill in trajectory gaps caused by missed detections. We conducted experiments on the UA-DETRAC dataset, using ablation studies to demonstrate the role of each module and employing visual analysis to intuitively show the optimization effects of the improved model. Comparative experiments showcased the advanced performance of the CDS-YOLOv8 and the improved BYTETrack algorithm.
In the future, we will focus more on vehicle detection and tracking from different perspectives, while exploring more diverse modalities, such as infrared images, radar, and other methods, to enhance the effectiveness of vehicle detection and tracking.

Author Contributions

Conceptualization, L.Y.; methodology, L.Y.; software, L.Y.; validation, L.Y. and Y.C.; formal analysis, C.S.; investigation, C.S. and R.L.; data curation, L.Y. and C.X.; writing—review and editing, L.Y.; visualization, L.Y.; supervision, Y.C.; All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Talent Fund project of China West Normal University (No. 463177).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Boukerche, A.; Tao, Y.; Sun, P. Artificial intelligence-based vehicular traffic flow prediction methods for supporting intelligent transportation systems. Comput. Netw. 2020, 182, 107484. [Google Scholar] [CrossRef]
  2. Ju, Y.; Cao, Z.; Chen, Y.; Liu, L.; Pei, Q.; Mumtaz, S.; Dong, M.; Guizani, M. NOMA-Assisted Secure Offloading for Vehicular Edge Computing Networks with Asynchronous Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 2627–2640. [Google Scholar] [CrossRef]
  3. Roy, D.; Li, Y.; Jian, T.; Tian, P.; Chowdhury, K.; Ioannidis, S. Multi-Modality Sensing and Data Fusion for Multi-Vehicle Detection. IEEE Trans. Multimed. 2023, 25, 2280–2295. [Google Scholar] [CrossRef]
  4. Humayun, M.; Ashfaq, F.; Jhanjhi, N.Z.; Alsadun, M.K. Traffic Management: Multi-Scale Vehicle Detection in Varying Weather Conditions Using YOLOv4 and Spatial Pyramid Pooling Network. Electronics 2022, 11, 2748. [Google Scholar] [CrossRef]
  5. Deshmukh, P.; Satyanarayana, G.S.R.; Majhi, S.; Sahoo, U.K.; Das, S.K. Swin Transformer Based Vehicle Detection in Undisciplined Traffic Environment. Expert Syst. Appl. 2023, 213, 118992. [Google Scholar] [CrossRef]
  6. Wang, J.; Dong, Y.; Zhao, S.; Zhang, Z. A High-Precision Vehicle Detection and Tracking Method Based on the Attention Mechanism. Sensors 2023, 23, 724. [Google Scholar] [CrossRef] [PubMed]
  7. Bui, T.; Wang, G.; Wei, G.; Zeng, Q. Vehicle Multi-Object Detection and Tracking Algorithm Based on Improved You Only Look Once 5s Version and DeepSORT. Appl. Sci. 2024, 14, 2690. [Google Scholar] [CrossRef]
  8. Wu, Y.; Wang, T.; Gu, R.; Liu, C.; Xu, B. Nighttime vehicle detection algorithm based on image translation technology. J. Intell. Fuzzy Syst. 2024, 46, 5377–5389. [Google Scholar] [CrossRef]
  9. Xu, B.; Liu, X.; Feng, G.; Liu, C. A Monocular-Based Framework for Accurate Identification of Spatial-Temporal Distribution of Vehicle Wheel Loads under Occlusion Scenarios. Eng. Appl. Artif. Intell. 2024, 133, 107972. [Google Scholar] [CrossRef]
  10. Xu, B.; Liu, C. Keypoint Detection-Based and Multi-Deep Learning Model Integrated Method for Identifying Vehicle Axle Load Spatial-Temporal Distribution. Adv. Eng. Inform. 2024, 62, 102688. [Google Scholar] [CrossRef]
  11. Xu, H.; Lai, S.; Li, X.; Yang, Y. Cross-Domain Car Detection Model with Integrated Convolutional Block Attention Mechanism. Image Vis. Comput. 2023, 140, 104834. [Google Scholar] [CrossRef]
  12. Zhang, C.; Yang, C.; Wu, D.; Dong, H.; Deng, B. Cross-View Vehicle Re-Identification Based on Graph Matching. Appl. Intell. 2022, 52, 14799–14810. [Google Scholar] [CrossRef]
  13. Ultralytics. YOLOv8: Real-Time Object Detection and Image Segmentation. GitHub Repository. Available online: https://github.com/ultralytics/ultralytics (accessed on 28 July 2024).
  14. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2021, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  15. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. arXiv 2024, arXiv:2311.15599. [Google Scholar]
  16. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  17. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  19. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  20. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  21. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  22. Chen, Z.; Guo, H.; Yang, J.; Jiao, H.; Feng, Z.; Chen, L.; Gao, T. Fast Vehicle Detection Algorithm in Traffic Scene Based on Improved SSD. Measurement 2022, 201, 111655. [Google Scholar] [CrossRef]
  23. Kang, L.; Lu, Z.; Meng, L.; Gao, Z. YOLO-FA: Type-1 Fuzzy Attention Based YOLO Detector for Vehicle Detection. Expert Syst. Appl. 2024, 237, 121209. [Google Scholar] [CrossRef]
  24. Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-Time Vehicle Detection Algorithm Based on a Lightweight You-Only-Look-Once (YOLOv5n-L) Approach. Expert Syst. Appl. 2023, 213, 119108. [Google Scholar] [CrossRef]
  25. Wang, B.; Li, Y.-Y.; Xu, W.; Wang, H.; Hu, L. Vehicle–Pedestrian Detection Method Based on Improved YOLOv8. Electronics 2024, 13, 2149. [Google Scholar] [CrossRef]
  26. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 107–122. [Google Scholar]
  27. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  28. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  29. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  30. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  31. Zhao, Z.; Ji, Z.; Yao, Y.; He, Z.; Du, C. Enhanced Detection Model and Joint Scoring Strategy for Multi-Vehicle Tracking. IEEE Access 2023, 11, 30807–30818. [Google Scholar] [CrossRef]
  32. Zhang, K.; Wu, F.; Sun, H.; Cai, M. Monocular Vehicle Speed Detection Based on Improved YOLOX and DeepSORT. Neural Comput. Appl. 2024, 36, 9643–9660. [Google Scholar] [CrossRef]
  33. Han, G.; Jin, Q.; Rong, H.; Jin, L.; Zhang, L. Vehicle Tracking Algorithm Based on Deep Learning in Roadside Perspective. Sustainability 2023, 15, 1950. [Google Scholar] [CrossRef]
  34. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.C.; Qi, H.; Lim, J.; Yang, M.-H.; Lyu, S. UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [Google Scholar] [CrossRef]
  35. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  36. Ultralytics. YOLOv5: Family of Object Detection Architectures and Models Pretrained on the COCO Dataset. GitHub Repository. Available online: https://github.com/ultralytics/yolov5 (accessed on 28 July 2024).
  37. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  38. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  39. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  40. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar]
  41. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-Sort: Multi-Pedestrian Tracking by Adaptive Re-Identification. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3025–3029. [Google Scholar]
Figure 1. Vehicle Detection and Tracking Flowchart.
Figure 2. Structure of CDS-YOLOv8.
Figure 3. Structure of the Context-Guided module.
Figure 4. Structure of the Dilated Reparam Block.
Figure 5. Representative scenarios in the UA-DETRAC dataset.
Figure 6. Visualization of Detection Results.
Figure 7. Visualization of Tracking Results.
Table 1. Ablation experiment results on the UA-DETRAC dataset.

Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Flops
YOLOv8-S | 71.8% | 58.6% | 64.2% | 46.6% | 28.8 G
YOLOv8-S + CG_Down | 78.4% | 63.3% | 69.9% | 51.0% | 30.7 G
YOLOv8-S + C2f_DRB | 81.5% | 62.6% | 71.4% | 51.6% | 24.7 G
YOLOv8-S + CG_Down + C2f_DRB | 80.7% | 63.9% | 72.9% | 52.2% | 26.6 G
YOLOv8-S + CG_Down + C2f_DRB + Soft-NMS | 82.4% | 62.3% | 73.2% | 55.4% | 26.6 G
Table 2. Comparative Experimental Results on the UA-DETRAC Dataset. (The best results are highlighted in bold.)

Model | mAP@0.5 | mAP@0.5:0.95 | Flops
Faster-RCNN [20] | 66.3% | 38.8% | 60.5 G
Cascade-RCNN [21] | 65.2% | 41.1% | 88.1 G
RetinaNet [19] | 56.6% | 32.1% | 45.2 G
ATSS [35] | 58.8% | 32.1% | 49.2 G
YOLOv5-S [36] | 62.0% | 44.9% | 16 G
YOLOv6-S [37] | 59.0% | 43.0% | 44.9 G
YOLOv7 [38] | 71.1% | 53.9% | 105.2 G
YOLOv8-S [13] | 64.2% | 46.6% | 28.8 G
YOLOv8-M [13] | 68.2% | 50.6% | 79.3 G
YOLOv8-L [13] | 71.6% | 51.4% | 165.7 G
YOLOv9-S [39] | 65.1% | 48.0% | 39.6 G
L. Kang et al. [23] | 70.0% | 50.3% | 26.7 G
CDS-YOLOv8 (ours) | 73.2% | 55.4% | 26.6 G
Table 3. Ablation Experiment Results on the UA-DETRAC Dataset.

Detector + Tracker | mIDF1 | IDS | mMOTA
YOLOv8-S + BYTETrack | 76.9% | 855 | 67.2%
CDS-YOLOv8 + BYTETrack | 78.2% | 785 | 70.6%
CDS-YOLOv8 + BYTETrack + Improved KF | 79.7% | 717 | 73.1%
CDS-YOLOv8 + BYTETrack + Improved KF + GSI | 80.3% | 530 | 73.9%
Table 4. Comparison Results on UA-DETRAC Dataset. (Best results are highlighted in bold.)

Tracker | mIDF1 | IDS | mMOTA
DeepSORT [29] | 73.5% | 2731 | 70.0%
StrongSORT [17] | 76.8% | 2994 | 71.3%
OCSORT [40] | 76.8% | 2191 | 71.9%
DeepOCSORT [41] | 77.1% | 1830 | 72.1%
BYTETrack [16] | 78.2% | 785 | 70.6%
BOT-SORT [30] | 78.6% | 727 | 72.9%
Improved BYTETrack (ours) | 80.3% | 530 | 73.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
