Article

A Robust Multi-Camera Vehicle Tracking Algorithm in Highway Scenarios Using Deep Learning

1 School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 Shanghai Smart Vehicle Cooperating Innovation Center Co., Ltd., Shanghai 201805, China
3 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
4 School of Education and Foreign Languages, Wuhan Donghu University, Wuhan 430212, China
5 School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7071; https://doi.org/10.3390/app14167071
Submission received: 9 July 2024 / Revised: 2 August 2024 / Accepted: 8 August 2024 / Published: 12 August 2024
(This article belongs to the Special Issue Unmanned Vehicle and Industrial Sensors for Internet of Everything)

Abstract

In intelligent traffic monitoring systems, the significant distance between cameras and their non-overlapping fields of view leads to several issues. These include incomplete tracking results from individual cameras, difficulty in matching targets across multiple cameras, and the complexity of inferring the global trajectory of a target. In response to the challenges above, a deep learning-based vehicle tracking algorithm called FairMOT-MCVT is proposed. This algorithm considers the vehicles’ characteristics as rigid targets from a roadside perspective. Firstly, a Block-Efficient module is designed to enhance the network’s ability to capture and characterize image features across different layers by integrating a multi-branch structure and depth-separable convolutions. Secondly, the Multi-scale Dilated Attention (MSDA) module is introduced to improve the feature extraction capability and computational efficiency by combining multi-scale feature fusion and attention mechanisms. Finally, a joint loss function is crafted to better distinguish between vehicles with similar appearances by combining the trajectory smoothing loss and velocity consistency loss, thereby considering both position and velocity continuity during the optimization process. The proposed method was evaluated on the public UA-DETRAC dataset, which comprises 1210 video sequences and over 140,000 frames captured under various weather and lighting conditions. The experimental results demonstrate that the FairMOT-MCVT algorithm significantly enhances multi-target tracking accuracy (MOTA) to 79.0, IDF1 to 84.5, and FPS to 29.03, surpassing the performance of previous algorithms. Additionally, this algorithm expands the detection range and reduces the deployment cost of roadside equipment, effectively meeting the practical application requirements.

1. Introduction

In recent years, multi-target tracking (MOT) has emerged as a critical application in computer vision, driven by the increasing demand for surveillance and security in urban areas. According to recent reports, the global video surveillance market is expected to reach USD 86 billion by 2027, highlighting the growing reliance on MOT systems in various domains, including video surveillance, urban security, and autonomous vehicle systems [1,2,3,4,5,6,7,8]. The primary task of MOT is to identify the positions of multiple specific targets and simultaneously track and record their trajectories [9,10,11,12]. With technological advancements, traditional rule-based tracking methods have gradually become insufficient for handling complex environments.
Bewley et al. proposed the SORT algorithm, which effectively combines Kalman filtering and the Hungarian algorithm to associate targets and enhance real-time tracking performance [13]. Building on this concept, Wojke et al. introduced the DeepSORT algorithm, which retains the core idea of integrating Kalman filtering with the Hungarian algorithm from SORT while adding a deep appearance association metric [14]. However, these methods often struggle with occlusion and overlapping targets.
With advancements in deep learning, several approaches have been developed to address the limitations of traditional methods. Wang et al. introduced the Jointly Learns the Detector and Embedding Model (JDE), integrating the detection and tracking models for joint learning, which proves more effective than separate models like DeepSORT [15]. Recognizing that the detection and tracking models in DeepSORT operate independently, Huang et al. enhanced DeepSORT by integrating it with YOLOv5, achieving end-to-end target detection and tracking and significantly improving tracking efficiency [16]. Zhou et al. devised a method for target constraint learning to further enhance tracking efficiency [17]. Boragule et al. proposed a pixel-guided approach that integrates joint detection and tracking tasks in MOT, enhancing the algorithm accuracy and mitigating the occlusion and overlap challenges [18]. Building on the JDE model, Zhou et al. [19] introduced a multi-target tracking algorithm utilizing the CenterNet [20] target detection network, transforming the problem into a center-point-based tracking method. Despite its advancements, this method has limitations in extracting target identification features, leading to more frequent ID switches when targets are lost for extended periods. To address these issues, Zhang et al. proposed the anchor-free FairMOT, similar to CenterNet, which aims to improve positioning accuracy under complex environmental conditions [21]. Meimetis et al. introduced a deep learning-based real-time target-tracking method, significantly improving tracking accuracy and robustness by incorporating a multi-scale feature fusion strategy [22]. Luo et al. developed a novel multi-target tracking algorithm by integrating convolutional neural networks with spatio-temporal feature extraction techniques to handle occlusion and target overlap [23]. Ren et al. proposed a new multi-target tracking framework that combines deep reinforcement learning and multi-task learning, significantly improving tracking performance in complex scenes [24]. Additionally, Yang J et al. proposed a transformer-based approach for MOT that addresses long-range dependencies and sequence modeling by leveraging the self-attention mechanism of transformers, significantly enhancing the model’s ability to track multiple targets over long sequences and in crowded scenes [25]. Kosaraju V et al. introduced a self-supervised learning method for MOT that utilizes contrastive learning to improve the feature representation of targets, reducing ID switches and enhancing tracking robustness in diverse and complex environments [26].
In traffic surveillance systems, cameras are typically spaced far apart with non-overlapping fields of view to reduce costs. This camera arrangement leads to several unresolved challenges: (1) Background clutter and object occlusion can result in incomplete local tracking results for a single camera; (2) Significant differences in the viewpoints of different cameras cause substantial variations in the images and surroundings, making cross-camera local tracking matching extremely difficult; (3) The number of times each target appears in different cameras and the total number of targets in the entire multi-camera network are unknown, complicating the inference of each target’s global trajectory, as illustrated in Figure 1.
An end-to-end vehicle tracking algorithm, FairMOT-MCVT, is proposed for complex traffic scenarios from the roadside perspective. By optimizing the feature extraction network, the algorithm significantly enhances the ability to extract features and detect vehicles that are occluded at a distance. The optimized feature extraction network introduces the Swish activation function, incorporates a multi-branch structure, and integrates the Multi-scale Dilated Attention (MSDA) mechanism [27] and the Ghost module [28] to enhance network feature extraction performance. These enhancements aim to maintain the tracking accuracy while enabling real-time operation in complex traffic scenarios over long distances, striking a balance between computational accuracy and speed.
The contributions made include:
  • An optimized feature extraction system has been proposed, accompanied by the design of a Block-Efficient module specifically tailored to detect small targets from a distance, enhancing its suitability for vehicle tracking algorithms.
  • For complex tunnel traffic scenes observed from the roadside monitoring perspective, a joint loss function is designed within a framework that differentiates similar vehicles from the target vehicle. This framework enables the algorithm to learn more discriminative features, thereby enhancing the matching success rate, reducing ID switching, and improving the overall effectiveness of vehicle tracking.
  • The FairMOT-MCVT algorithm has been validated using the UA-DETRAC dataset [29]. The results show its effectiveness and robustness in terms of the MOTA, IDF1, and FPS metrics, with superior performance compared with other prominent techniques.

2. Related Work

The vehicle tracking algorithm is paramount in intelligent transportation, city surveillance, and autonomous driving. Recent advancements in computer vision techniques have significantly enhanced the capabilities of vehicle tracking systems.

2.1. Multiple Target Tracking

Traditional multiple object tracking (MOT) methods rely heavily on detection and tracking frameworks, where subsequent tracking is processed through candidate regions obtained from the detection results. For instance, Kalman and particle filtering are commonly employed in multi-target tracking methods, achieving a target state estimation through a combination of prediction and update steps.
With the advent of deep learning technology, methods for tracking multiple targets using deep neural networks have witnessed significant advancements. Recent breakthroughs in object recognition, including Faster R-CNN [30], YOLO [31], and SSD [32], have enabled the accurate detection of multiple targets in single-frame images, thereby facilitating target re-identification through the utilization of deep features. Among these methods, DeepSORT [14] has emerged as one of the most effective multi-target tracking approaches. This method integrates object detection with depth matching and leverages Kalman Filter prediction to enhance accuracy and robustness.
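To make the predict/update cycle concrete, the following is a minimal constant-velocity Kalman filter sketch in Python; the state layout and noise values are illustrative assumptions rather than the exact settings used by SORT or DeepSORT.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal Kalman filter for a 2D point target with state [x, y, vx, vy]."""
    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)   # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)     # only position is measured
        self.Q = np.eye(4) * process_var                    # process noise
        self.R = np.eye(2) * meas_var                       # measurement noise
        self.x = np.zeros(4)                                # state estimate
        self.P = np.eye(4)                                  # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                   # predicted position

    def update(self, z):
        y = z - self.H @ self.x                             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```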

2.2. Data Association

Data association poses a pivotal challenge in multi-target tracking, as it necessitates accurately linking detected targets with existing trajectories. Various techniques have been employed to address this challenge, including the Hungarian algorithm, Joint Probabilistic Data Association (JPDA), and Multiple Hypothesis Tracking (MHT). However, conventional methods often rely heavily on the targets’ movement patterns and visual features for association, an assumption that can be challenged by occlusions and significant similarities among targets.
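As an illustration of such association, the sketch below matches track boxes to detection boxes with the Hungarian algorithm over an IoU cost matrix; the IoU threshold is an assumed value, not one taken from the cited methods.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Match track boxes to detection boxes; return matched (track, detection) index pairs."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_thresh]
```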
Deep learning methods have introduced innovative approaches to the problem of data association. Tang et al. [33] proposed a graph neural network-based association method that enhances robustness by constructing a relationship graph among the targets. Concurrently, Bergmann et al. [34] introduced the Tracktor algorithm, simplifying the association process by integrating detectors and trackers, thereby enabling direct data association on detection outputs. These advancements demonstrate the potential of deep learning techniques to improve multi-target tracking systems.
Furthermore, multi-sensor fusion technology is crucial in data association, facilitating information integration from diverse sensors, such as cameras, LiDAR, and millimeter-wave radar. This integration leads to a more detailed and accurate understanding of the environment, thereby enhancing the precision and stability of the data relationships. For instance, Guan et al. [35] developed a vehicle tracking technique that integrates cameras and LiDAR. This approach ensures robust vehicle surveillance in complex traffic scenarios by harnessing the precise distance measurements from LiDAR and the rich visual details captured by cameras.

2.3. Current Challenges and Future Directions

Despite significant advancements in vehicle tracking algorithms, practical applications face numerous challenges. Key hurdles include occlusion, dynamic backgrounds, and lighting variations, all of which significantly impact tracking performance. Furthermore, there is an urgent need to optimize real-time performance and computational efficiency while maintaining high accuracy. To overcome these limitations and advance the effectiveness of vehicle tracking systems, future research should focus on the following directions: (1) Enhancing algorithm robustness in complex scenes; (2) Developing more efficient deep learning models; (3) Exploring innovative multi-sensor fusion technologies; and (4) Investigating lightweight tracking algorithms suitable for resource-constrained environments. These research avenues aim to address the current challenges and propel the development of robust vehicle tracking systems that are capable of operating in diverse and demanding scenarios.

3. Methods

3.1. Overall Network Architecture Pipeline

In the current landscape of multi-target tracking algorithms, a prevalent two-step approach involves initially detecting the targets utilizing a dedicated target-detection algorithm, followed by target matching using a Re-ID model. However, this approach has demonstrated an inconsistent correlation between image embeddings and actual targets, leading to suboptimal tracking precision. Notably, object tracking has undergone significant improvements with the advancement of object detection and Re-ID techniques. Recently, the One-Shot Approach, which concurrently performs object detection and Re-ID, has gained prominence for its efficiency in reducing the computation time.
Our approach introduces a comprehensive tracking network structure that feeds images from video sequences into two parallel feature branches: one dedicated to detection and the other to re-identification. The detection branch leverages an anchor-free detector that predicts heatmaps, bounding-box dimensions, and center-point offsets. Concurrently, the Re-ID branch enhances object differentiation by exploiting high-dimensional features. As depicted in Figure 2, this integrated framework facilitates feature-sharing across multiple branches, thereby minimizing parameter redundancy, enhancing algorithm inference efficiency, and catering to real-time processing requirements in traffic scenarios.
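A minimal PyTorch sketch of this parallel head structure is given below; the channel widths, the 128-dimensional embedding, and the placeholder feature map are illustrative assumptions rather than the exact FairMOT-MCVT configuration.

```python
import torch
import torch.nn as nn

class TrackingHeads(nn.Module):
    """Anchor-free detection heads plus a Re-ID embedding head on a shared feature map."""
    def __init__(self, in_channels=64, num_classes=1, emb_dim=128):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, 1))
        self.heatmap = head(num_classes)   # center heatmap
        self.wh = head(2)                  # bounding-box width/height
        self.offset = head(2)              # center-point offset
        self.reid = head(emb_dim)          # per-pixel Re-ID embedding

    def forward(self, feat):
        return {
            "hm": torch.sigmoid(self.heatmap(feat)),
            "wh": self.wh(feat),
            "off": self.offset(feat),
            "id": self.reid(feat),
        }

# usage on a dummy shared feature map (batch 1, 64 channels, 152 x 272 grid)
# feat = torch.randn(1, 64, 152, 272); out = TrackingHeads()(feat)
```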

3.2. DLA-Efficient

DLA (Deep Layer Aggregation) [36] serves as a highly adaptable framework that seamlessly integrates into existing CNN (convolutional neural network) architectures, empowering them for a diverse range of computer vision tasks. CenterNet harnesses the 34-layer DLA-34 convolutional network as its foundational feature extraction backbone. Nevertheless, DLA-34’s complexity necessitates substantial computational resources and a considerable number of parameters. To address this, the DLA-efficient backbone was crafted upon the foundations of Inception [37] and the Ghost module, aiming to devise an efficient network architecture that mitigates computational demands and enhances the overall efficiency and scalability. The cornerstone innovations in DLA-efficient include:
  • Referring to the combination of a multi-branch structure and depth-separable convolution proposed in Inception for achieving efficient multi-scale feature extraction, we designed a new block module.
  • After the DLA-34 network layer, the MSDA module is introduced, which autonomously learns and identifies the most crucial feature regions to prioritize in images or videos. This capability enables the network to concentrate more on features that profoundly impact the final task, thereby enhancing processing efficiency and performance by prioritizing critical areas.
  • Replacing the ReLU [38] activation function with the smoother Swish [39] activation function enhances the network’s accuracy and generalization. The formula is provided below:
    $$f(x) = x \cdot \frac{1}{1 + e^{-x}} = x\,\sigma(x)$$
  • Referring to the GhostNet module, the Ghost module is employed instead of standard 2D convolution to generate an equivalent number of feature maps. The GhostNet module is shown in Figure 3.
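To make the Ghost idea concrete, the following is a minimal PyTorch sketch of a Ghost-style convolution, in which a cheap depthwise branch generates additional feature maps from the output of a smaller primary convolution; the channel split ratio, kernel sizes, and use of SiLU (Swish) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost-style convolution: a cheap depthwise branch generates extra ("ghost")
    feature maps from the output of a smaller primary convolution."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, dw_size=3):
        super().__init__()
        primary_ch = out_ch // ratio               # channels from the primary conv
        cheap_ch = out_ch - primary_ch             # channels from the cheap branch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, dw_size, padding=dw_size // 2,
                      groups=primary_ch, bias=False),   # depthwise, hence "cheap"
            nn.BatchNorm2d(cheap_ch), nn.SiLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```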
The Block-Efficient module is designed to be a simple, lightweight architecture with six convolutional layers. It employs parallel processing paths to extract multiple features from the input, where each submodule specializes in capturing distinct characteristics that are subsequently aggregated along the channel dimension. This design philosophy aims to comprehensively capture diverse characteristics from the input data, thereby enhancing the overall performance. The module efficiently captures local feature information by integrating multiple convolution kernels and pooling operations of varying sizes. Specifically, Conv1 combines the 1 × 1 and 3 × 3 convolutions to extract and process local features from the input image. Following this, Conv2 incorporates the Ghost module, which significantly reduces the computational requirements and parameters while maintaining the ability to extract multiple feature maps efficiently. Conv3 further enhances performance through the addition of additional convolutional layers. Lastly, Conv4 utilizes pooling and 1 × 1 convolutions to extract features, achieving a balance between reducing the computation time and preserving critical information. This architecture is illustrated in Figure 4.
The pseudo-code flow of the Block-Efficient module is outlined in Algorithm 1. Initially, 1 × 1 and 3 × 3 convolutions are applied to the input feature map X, resulting in feature maps F1_1 and F1_3, respectively. These two feature maps are then concatenated to form the joint feature map F1. Following this, the Ghost module is utilized on F1 to obtain the feature map F2. Subsequently, two additional 3 × 3 convolutional layers are applied to F2, generating the augmented feature map F3. A 2 × 2 pooling operation is then performed, followed by a 1 × 1 convolution, yielding the feature map F4_1. Finally, all the feature maps (F1, F2, F3, and F4_1) are concatenated to create the final augmented feature map Y, which is returned as the module’s output.
Algorithm 1: Block-Efficient Module
input: Feature map X
output: Enhanced feature map Y
1: function BlockEfficient(X)
2:   // Step 1: Apply Conv1 (1 × 1 and 3 × 3 convolutions)
3:   F1_1 ← Conv(X, kernel_size = 1 × 1)
4:   F1_3 ← Conv(X, kernel_size = 3 × 3)
5:   F1 ← Concat(F1_1, F1_3)
6:   // Step 2: Apply Conv2 (Ghost module)
7:   F2 ← GhostModule(F1)
8:   // Step 3: Apply Conv3 (additional convolutional layers)
9:   F3 ← Conv(F2, kernel_size = 3 × 3)
10:   F3 ← Conv(F3, kernel_size = 3 × 3)
11:   // Step 4: Apply Conv4 (pooling and 1 × 1 convolution)
12:   F4_pool ← Pool(F3, pool_size = 2 × 2)
13:   F4_1 ← Conv(F4_pool, kernel_size = 1 × 1)
14:   // Combine the features
15:   Y ← Concat(F1, F2, F3, F4_1)
16:   return Y
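To ground Algorithm 1, a minimal PyTorch sketch of the same flow is shown below. It reuses the GhostModule sketched earlier, all channel widths are illustrative assumptions, and the pooling branch uses padded stride-1 pooling so that all branches can be concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class BlockEfficient(nn.Module):
    """Sketch of Algorithm 1: parallel 1x1/3x3 convs, a Ghost module, stacked 3x3
    convs, and a pooling branch, all concatenated along the channel dimension."""
    def __init__(self, in_ch=64, branch_ch=32):
        super().__init__()
        act = nn.SiLU(inplace=True)                                    # Swish activation
        self.conv1_1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), act)
        self.conv1_3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1), act)
        self.ghost = GhostModule(2 * branch_ch, 2 * branch_ch)        # Conv2
        self.conv3 = nn.Sequential(                                    # Conv3
            nn.Conv2d(2 * branch_ch, 2 * branch_ch, 3, padding=1), act,
            nn.Conv2d(2 * branch_ch, 2 * branch_ch, 3, padding=1), act)
        self.conv4 = nn.Sequential(                                    # Conv4
            nn.ZeroPad2d((0, 1, 0, 1)),      # pad so 2x2 stride-1 pooling keeps H x W
            nn.MaxPool2d(2, stride=1),
            nn.Conv2d(2 * branch_ch, branch_ch, 1), act)

    def forward(self, x):
        f1 = torch.cat([self.conv1_1(x), self.conv1_3(x)], dim=1)     # Step 1
        f2 = self.ghost(f1)                                            # Step 2
        f3 = self.conv3(f2)                                            # Step 3
        f4 = self.conv4(f3)                                            # Step 4
        return torch.cat([f1, f2, f3, f4], dim=1)                      # enhanced feature map Y
```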

3.3. DLA-MSDA

The MSDA attention mechanism is introduced after the DLA-34 network base. This mechanism effectively combines multi-scale features with attention weights to enhance the model’s capability to capture and utilize feature information across various levels and scales, thereby improving the detection performance. The MSDA attention mechanism is shown in Figure 5.
The pseudo-code of the MSDA attention mechanism is shown in Algorithm 2. First, three multi-scale convolutional kernels with different dilation rates (1, 2, and 3) are initialized. Then, for each position in the input feature map (with shape B × C × H × W), these three convolution kernels are applied separately to obtain the multi-scale feature maps F1, F2, and F3. These three feature maps are concatenated to form the joint feature map F_concat, and a channel attention mechanism is applied to obtain the weighted feature map F_weighted. Finally, the weighted feature map is normalized to produce the enhanced feature map Y. The entire process iterates through all batches, channels, heights, and widths, ultimately returning the final enhanced feature map Y.
Algorithm 2: MSDA Attention Mechanism
input: Feature map X of size B × C × H × W
output: Enhanced feature map Y
1: Initialize multi-scale convolutional kernels K1, K2, K3 with dilation rates of 1, 2, and 3.
2: for i ← 1 to B do
3:   for j ← 1 to C do
4:   for k ← 1 to H do
5:    for l ← 1 to W do
6:     F1 ← Conv(X[i, j, k, l], K1);
7:      F2 ← Conv(X[i, j, k, l], K2);
8:     F3 ← Conv(X[i, j, k, l], K3);
9:      F_concat ← Concat(F1, F2, F3);
10:      F_weighted ← ChannelAttention(F_concat);
11:      Y[i, j, k, l] ← Normalize(F_weighted);
12:     end for
13:    end for
14:   end for
15: end for
16: return Y;
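Algorithm 2 is written position by position for clarity; in practice, the same computation is applied as whole-tensor convolutions. The following vectorized PyTorch sketch illustrates the idea; the channel counts, reduction ratio, SE-style channel attention, and 1 × 1 fusion projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MSDABlock(nn.Module):
    """Multi-scale dilated convolutions (dilation 1, 2, 3) followed by channel attention."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 3)])
        self.attn = nn.Sequential(                       # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3 * channels, 3 * channels // reduction, 1), nn.SiLU(inplace=True),
            nn.Conv2d(3 * channels // reduction, 3 * channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(3 * channels, channels, 1)  # project back to the input width
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)   # F1, F2, F3 concatenated
        weighted = feats * self.attn(feats)                        # channel-wise reweighting
        return self.norm(self.fuse(weighted))                      # normalized enhanced map Y
```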
The FairMOT-MCVT formed by combining the above methods is shown in Figure 6. The Block-Efficient module can capture high-level semantic features, such as color, shape, size, and the spatial attributes of rigid objects. Employing the MSDA module reduces the computational load associated with additional convolutional layers, enhances the generalization capabilities, and maintains computational efficiency.

3.4. Joint Loss Function

The method introduces deep cosine metric learning for calculating the similarity between two feature vectors, which is crucial in human re-identification. However, this approach alone is not sufficient to track multiple vehicles effectively. Moving vehicles may appear similar while varying in viewing angle, and modern cars often share similar colors, resulting in high feature similarity. To address these challenges, a novel joint loss function is proposed, combining the Smooth L1 loss function and cross-entropy loss, formulated as:
$$L_{reg}(b, g) = \sum_{i \in \{x, y, w, h\}} \mathrm{Smooth}_{L_1}(b_i - g_i)$$
where $b = \{b_x, b_y, b_w, b_h\}$ is the predicted bounding box (center coordinates, width, and height), and $g = \{g_x, g_y, g_w, g_h\}$ is the corresponding ground-truth bounding box.
$\mathrm{Smooth}_{L_1}$ is defined as:
$$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & |x| \ge 1 \end{cases}$$
The cross-entropy loss measures the disparity between the predicted probability distributions and the actual labels. For the binary classification problem, the expression is as follows:
$$L_{cls}(p, y) = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$
where p is the probability that the model predicts that the target is a vehicle, and y is the real label, 1 for vehicles and 0 for non-vehicles.
The Re-ID branch of the FairMOT-MCVT algorithm integrates a triplet loss function and a temporal continuity loss function. This combination aims to tackle issues related to the uniqueness and consistency of target tracking and challenges in predicting abrupt changes in motion. Doing so prevents tracker failures caused by variations in the target appearance and reduces instances of trajectory discontinuity and prediction errors.
The triplet loss function optimizes the feature space by minimizing the distance between samples of the same class and maximizing the difference between samples of different classes. This enhancement aids the network in learning to track the same target consistently across different frames. The equation is as follows:
$$L_{reid}(f(a), f(p), f(n)) = \max\left(0,\; m + \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2\right)$$
where $f(\cdot)$ represents the feature extraction function, which converts an input image into a feature vector; $\|f(a) - f(p)\|_2^2$ is the squared Euclidean distance between the anchor and the positive sample; $\|f(a) - f(n)\|_2^2$ is the squared Euclidean distance between the anchor and the negative sample; and $m$ is a margin used to ensure that the anchor-negative distance exceeds the anchor-positive distance by at least $m$.
The temporal continuity loss function integrates trajectory smoothing loss [40] and velocity continuity loss [41] to enhance the tracker’s predictive capabilities. It emphasizes the target’s consistent motion between consecutive frames, ensuring a stable trajectory even in complex scenarios. The formula of L c o n t is as follows:
$$L_{cont}(t) = \alpha \sum_{i=t}^{t+N-1} \|p_{i+1} - p_i\|_2^2 + \beta \sum_{i=t}^{t+N-1} \|v_{i+1} - v_i\|_2^2$$
where $p_i$ and $p_{i+1}$ are the predicted positions of the vehicle in two consecutive frames, $v_i$ and $v_{i+1}$ are the corresponding velocities, computed as $v_i = p_i - p_{i-1}$, and $N$ is the number of frames over which continuity is considered, usually adjusted dynamically based on the video frame rate and scene. $\alpha$ and $\beta$ are weight parameters that balance the importance of position smoothness and velocity consistency.
Ultimately, by combining the $\mathrm{Smooth}_{L_1}$ loss, the cross-entropy loss, the triplet loss, and the temporal continuity loss, the joint loss function $L$ is defined as follows:
$$L = \lambda_1 L_{cls}(p, y) + \lambda_2 L_{reg}(b, g) + \lambda_3 L_{reid}(f(a), f(p), f(n)) + \lambda_4 L_{cont}(t)$$
$$\;\; = -\lambda_1 \left[ y \log p + (1 - y) \log(1 - p) \right] + \lambda_2 \sum_{i \in \{x, y, w, h\}} \mathrm{Smooth}_{L_1}(b_i - g_i) + \lambda_3 \max\left(0,\; m + \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2\right) + \lambda_4 \left( \alpha \sum_{i=t}^{t+N-1} \|p_{i+1} - p_i\|_2^2 + \beta \sum_{i=t}^{t+N-1} \|v_{i+1} - v_i\|_2^2 \right)$$
where $L_{cls}(p, y)$ optimizes the classification task, $L_{reg}(b, g)$ refines the precise location of the vehicle bounding box, $L_{reid}(f(a), f(p), f(n))$ preserves the vehicle’s ID across consecutive video frames [42], and $L_{cont}(t)$ is an optional temporal continuity loss penalizing tracking inconsistencies.
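A minimal PyTorch sketch of this joint loss is given below; the loss weights, margin, and tensor shapes are illustrative assumptions rather than the values used in training.

```python
import torch
import torch.nn.functional as F

def joint_loss(p, y, box_pred, box_gt, f_a, f_p, f_n, positions,
               lambdas=(1.0, 1.0, 1.0, 0.5), margin=0.3, alpha=1.0, beta=1.0):
    """L = l1*L_cls + l2*L_reg + l3*L_reid + l4*L_cont, following the joint loss above."""
    l1, l2, l3, l4 = lambdas
    # classification: binary cross-entropy on the predicted vehicle probability p with labels y
    l_cls = F.binary_cross_entropy(p, y)
    # regression: Smooth L1 over the (x, y, w, h) box parameters
    l_reg = F.smooth_l1_loss(box_pred, box_gt)
    # re-identification: triplet loss on anchor / positive / negative embeddings
    d_ap = (f_a - f_p).pow(2).sum(dim=1)
    d_an = (f_a - f_n).pow(2).sum(dim=1)
    l_reid = torch.clamp(margin + d_ap - d_an, min=0).mean()
    # temporal continuity: positions has shape (T, 2) for one track over T >= 3 frames
    dp = positions[1:] - positions[:-1]        # p_{i+1} - p_i (also serves as velocity)
    dv = dp[1:] - dp[:-1]                      # v_{i+1} - v_i
    l_cont = alpha * dp.pow(2).sum(-1).mean() + beta * dv.pow(2).sum(-1).mean()
    return l1 * l_cls + l2 * l_reg + l3 * l_reid + l4 * l_cont
```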

4. Experimental Results and Discussion

4.1. Datasets and Metrics

The FairMOT-MCVT vehicle tracking system is trained and evaluated using the UA-DETRAC dataset, which serves as a comprehensive benchmark for detecting and monitoring multiple objects in various real-world traffic scenarios. This dataset includes 60 video sequences depicting diverse weather and lighting conditions, such as sunny, cloudy, dark, and rainy. The videos are recorded at 960 × 540 pixel resolution and 25 frames per second (FPS), totaling 144,000 frames across varying durations from 36 to 128 s. The object classes within the dataset encompass cars, buses, trucks, and others.
Multiple object tracking accuracy (MOTA) reflects the tracking algorithm’s effectiveness in accurately identifying targets and maintaining their trajectories, serving as a crucial metric for evaluation. A higher MOTA value indicates a more effective tracking algorithm. The formula is as follows:
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t GT_t}$$
where $FN_t$ is the number of missed targets (false negatives) at time $t$, $FP_t$ is the number of erroneous detections (false positives) at time $t$, $IDSW_t$ is the number of identity switches at time $t$, and $GT_t$ is the number of ground-truth targets at time $t$.
IDS denotes the total number of ID switches throughout the video sequence; MT (mostly tracked) denotes the proportion of ground-truth (GT) trajectories that are matched for at least 80% of their length.
$$\mathrm{MT} = \frac{\text{Number of mostly tracked targets}}{\text{Total number of targets}}$$
Mostly lost (ML) denotes the proportion of ground-truth trajectories that are matched for less than 20% of their length among all tracking targets.
$$\mathrm{ML} = \frac{\text{Number of mostly lost targets}}{\text{Total number of targets}}$$
The IDF1 (identification F1 score) measures the ratio of correctly identified detections to the average of the numbers of ground-truth and computed detections.
$$\mathrm{IDF1} = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN}$$
where IDTP represents the number of correctly identified targets, IDFP represents the number of falsely identified targets, and IDFN represents the number of missed targets. IDF1 measures the combined performance of the precision and recall of the correctly identified target IDs, with higher values indicating better performance.
Together, MOTA, IDF1, MT, ML, and IDS provide a comprehensive picture of a tracker’s detection accuracy, identity preservation, and trajectory coverage, and they are used as the evaluation metrics in the following experiments.
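For reference, the sketch below shows how MOTA and IDF1 are computed once the per-frame counts have been accumulated; the matching step that produces those counts is omitted.

```python
def mota(fn, fp, idsw, gt):
    """MOTA from per-frame lists of false negatives, false positives, ID switches, and GT counts."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / max(sum(gt), 1)

def idf1(idtp, idfp, idfn):
    """IDF1 from identity-level true positives, false positives, and false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# example: counts accumulated over a short two-frame sequence
# print(mota(fn=[2, 1], fp=[0, 1], idsw=[0, 0], gt=[10, 10]))  # -> 0.8
```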

4.2. Implementation Details

Our implementation used Ubuntu 20.04 and three RTX 3060 GPUs. A virtual environment was created with Anaconda 3, with Python 3.7 and PyTorch 1.6, and the FairMOT-MCVT vehicle tracking system was trained and evaluated on UA-DETRAC. Training used the following settings: 60 epochs, a batch size of 6, and an initial learning rate of 1 × 10−4, which was reduced to 1/10 of its initial value at the thirtieth epoch. The Adam optimizer was used with a momentum parameter of 0.8. Standard data augmentation strategies, including blurring, translation, and shearing, were also applied.
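The optimizer and learning-rate schedule described above can be set up as in the sketch below; the model and data loader are placeholders, and mapping the momentum parameter of 0.8 to Adam’s first-moment coefficient is an assumption.

```python
import torch

# model and train_loader are placeholders for the FairMOT-MCVT network and the UA-DETRAC loader
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.8, 0.999))
# drop the learning rate to 1/10 of its initial value at epoch 30, training for 60 epochs in total
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

for epoch in range(60):
    for batch in train_loader:          # batch size 6
        optimizer.zero_grad()
        loss = model(batch)             # assumed to return the joint loss described in Section 3.4
        loss.backward()
        optimizer.step()
    scheduler.step()
```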

4.3. Ablation Studies

Firstly, an ablation experiment on the DLA-efficient network was performed to illustrate the various enhancements to the model. This experiment analyzed the variations in inference speed and precision of each refinement on 6000 images from the COCO [43] test set. The changes in training loss are depicted in Figure 7. The total loss curve shows that the optimized feature extraction network achieves a steadier training process, faster convergence, and better training outcomes. Comparing the loss trajectories of the different branches shows that the DLA-efficient network trains effectively on the center-point offset, height, width, and heat map predictions while fitting the data well. A total of four ablation experiments were conducted to compare the FPS and AP evaluation indicators. The experimental data in Table 1 show that, compared with the original network, the DLA-efficient network achieves significant improvements and is more suitable for vehicle tracking application scenarios.
First, by replacing the original activation function with the Swish activation function, while the FPS decreased by 0.7%, the other indicators were significantly improved: AP increased by 1.2%, A P 50 increased by 1.5%, A P 75 increased by 1.0%, A P s increased by 0.8%, A P M increased by 1.0%, and A P L increased by 0.7%. Subsequently, the MSDA module was integrated into the front end of the network, leading to improvements in several indicators: AP increased by 0.2%, and A P M increased by 0.3%, indicating that the module positively impacts small targets, such as vehicles. By incorporating the designed Block-Efficient module, the computation time is reduced, increasing the frame rate from 30.4 to 30.8 FPS and improving the remaining six indicators. Compared with the original feature extraction network DLA-34, the improved DLA-efficient network enhances the detection of small, occluded targets at a distance and effectively addresses the issue of target loss due to occlusion or distance. In summary, the DLA-efficient network developed in this article adeptly harmonizes inference speed with precision, serving as an applicable feature extraction network in advanced vehicle monitoring algorithms.
To exhibit how the DLA-efficient network enhances vehicle detection skills, a segment of the UA-DETRAC dataset was depicted. Data were gathered using a heat map illustrating the test outcomes, as shown in Figure 8. The results demonstrate that the DLA-efficient network exhibits robust feature extraction capabilities across various scenes, shooting angles, occlusion levels, and lighting conditions. The network effectively reduces missed vehicle detections and minimizes center point deviation.
The model size comparison is shown in Figure 9. It can be seen that DLA-efficient (max-Det = 100) had a 2.59% increase in recall and an 8% increase in accuracy compared with DLA-34. At the same time, the comparative analysis of the model volume and parameters showed that while maintaining the detection accuracy, the model volume was reduced by about 18% and the model parameters were reduced by 15%.
The FairMOT-MCVT algorithm for vehicle tracking was employed in an ablation experiment to evaluate the inference speed per enhancement and the influence of five essential multi-tracking evaluation factors (FPS, MOTA, IDF1, IDS, and ML) on the model. The experimental results are shown in Table 2.
Firstly, baseline-based ablation experiments were performed using the optimized DLA-efficient feature extraction network. Compared with the initial network, DLA-efficient increases the number of layers, causing a minor reduction in FPS but improving MOTA by 1.7% and IDF1 by 0.4%. The results show that the tracking effect is improved while the real-time requirements are still met. However, because the improved feature extraction network does not fully recognize small target vehicles within a specific range, the number of identity switches increases when vehicles occlude each other. Furthermore, introducing the joint loss function into the baseline results in a 0.9% rise in MOTA, a 0.2% increase in IDF1, no change in MT and ML, a decrease in IDS, and a minor increase in FPS. This shows that the optimized feature extraction network and the joint loss function together improve the accuracy of vehicle discrimination and tracking from the roadside perspective.
For a more compelling illustration of FairMOT-MCVT’s efficacy, our proposed strategy is contrasted with four standard techniques and two of the most advanced techniques on the same testing video sequences of the UA-DETRAC dataset, as depicted in Table 3. Our algorithm improves MOTA to 79.0%, achieving the best result among the compared algorithms. The SORT algorithm’s MOTA reaches only 70.3; since its data association step relies solely on IOU, its real-time performance is suboptimal, and its simple association produces a large number of ID switches, which reduces tracking accuracy. DeepSORT enhances the accuracy of data association and reduces ID switches through cascade matching; nevertheless, its evaluation scores remain low because the performance of its detector weighs heavily on the results. FairMOT ranks just below the method proposed in this article on the MOTA indicator. It efficiently tracks slow-moving objects and pedestrians that differ significantly in appearance, but it overlooks issues linked to rapidly moving targets such as vehicles, making it less effective than our method. CenterTrack’s superior detection performance indirectly enhances its tracking precision; however, the algorithm is less robust because its data association process is simple and similar to that of SORT. Our method instead optimizes the network structure for vehicles and other large rigid objects in roadside traffic scenes, and the joint loss function strengthens the ability to distinguish similar vehicles, reducing identity switching between vehicles that resemble each other. The RobMOT algorithm excels in multi-target tracking, achieving an IDF1 of 83.0; however, its complex computational process significantly impacts real-time performance, with an FPS of only 27.00, making it less efficient than the FairMOT-MCVT algorithm, and it also exhibits deficiencies in tracking fast-moving targets. Conversely, the MTracker algorithm performs stably on long video sequences, but its high dependence on the detector leads to performance degradation, resulting in a MOTA of only 75.5. In contrast, the FairMOT-MCVT algorithm significantly enhances tracking robustness and accuracy by optimizing the network structure and joint loss function, achieving a superior MOTA of 79.0. In summary, the proposed algorithm shows an excellent tracking effect compared with the six mainstream tracking algorithms.
During the evaluation process, we also encountered some limitations and challenges. First, the algorithm’s stability under extreme weather conditions has yet to be verified. Severe weather conditions, such as heavy rain, snow, or fog, may affect the performance of the sensors, thereby reducing the tracking accuracy. Additionally, the tracking accuracy may decrease when dealing with high-density traffic scenarios. Dense traffic flow leads to more occlusion and target overlap, increasing the difficulty of tracking. To overcome these challenges, future research can explore several improvement directions: the fusion of multi-modal data, dynamic weight adjustment, online learning mechanisms, and the integration of contextual information. Developing more advanced occlusion handling techniques will ensure that high-accuracy tracking can be maintained even in prolonged occlusion situations. Through these improvements, the FairMOT-MCVT algorithm is expected to demonstrate stronger robustness and accuracy in more complex and dynamic traffic environments, providing more reliable technical support for intelligent transportation systems.
The enhancements introduced by the DLA-efficient network and the FairMOT-MCVT algorithm are substantial and statistically significant. The gains in crucial metrics, including a 1.2% increase in AP and a 1.7% boost in MOTA, exceed what could be attributed to random fluctuation or noise and indicate genuine improvements in model performance. The consistency of these improvements across multiple experimental runs further supports the reliability and validity of the findings. To substantiate their statistical significance, we conducted a series of t-tests comparing the performance metrics of the original and enhanced models; the resulting p-values consistently fall below the 0.05 threshold, confirming that the improvements are statistically meaningful.

4.4. Visualizing Results

To demonstrate the performance of the algorithms more intuitively, we performed a visual validation on self-selected data and compared the algorithms in highway and tunnel scenarios. As shown in Figure 10, compared with the baseline algorithm, the FairMOT-MCVT algorithm can detect and track targets at longer distances and under more severe occlusion, and it maintains stable tracking of the same target across two consecutive cameras. The tracking results of the FairMOT-MCVT algorithm are shown in Figure 11. In the video of Sequence 1, there are fewer vehicles from frame 1 to frame 560; the small target vehicles in the distance are tracked well with stable ID numbers, and even the vehicle with ID 1 is detected and tracked under long-distance occlusion. This indicates that FairMOT-MCVT achieves superior tracking results in dense scenes, demonstrating that the weighted summation of the joint loss function effectively facilitates intra-class aggregation and inter-class separation. In frames 1030 to 1360, even when the human eye can no longer recognize the vehicle with ID 1, our algorithm continues tracking the target with strong robustness. In the complex, congested tunnel scene of Sequence 2, from frame 240 to frame 680, vehicle ID 22 is successfully detected even though it is heavily obscured by strong reflections; in frame 1120, the target is still not lost even though it is no longer visible to the naked eye. The visualization results demonstrate that the proposed algorithm effectively tracks distant vehicles and densely occluded traffic while maintaining stable identities.

5. Conclusions

A vehicle tracking algorithm called FairMOT-MCVT is proposed for complex traffic scenarios from the roadside view, focusing on the task of multi-vehicle tracking in computer vision with deep learning as the underlying technology. For target detection, the Block-Efficient module designed for rigid vehicle objects enhances the feature extraction and detection capability, particularly for small objects. Additionally, the MSDA module is introduced to improve feature extraction efficiency by combining multi-scale feature fusion with attention mechanisms. For target tracking, a joint loss function combining a weighted Smooth L1 loss, cross-entropy loss, triplet loss, and temporal continuity loss is designed to account for the continuity of position and speed during optimization and to improve the ability to distinguish vehicles with similar appearances. On the UA-DETRAC public vehicle tracking dataset, the FairMOT-MCVT algorithm achieves the best overall tracking results compared with current popular tracking algorithms: MOTA reaches 79.0, IDF1 reaches 84.5, and FPS reaches 29.03, improving MOTA by 1.3%, IDF1 by 0.3%, and FPS by 0.68% compared with the baseline algorithm. The results show that FairMOT-MCVT maintains continuous vehicle tracking in long-range, occluded traffic scenarios, showcasing robustness and accuracy. In practical applications, the algorithm reduces equipment deployment costs while enhancing system efficiency and performance, maximizing cost-effectiveness and meeting operational requirements.
Despite its commendable performance in multi-target tracking tasks, the proposed algorithm still has limitations. Notably, its stability under extreme weather conditions needs validation, and its tracking accuracy may diminish in high-density traffic scenarios. To address these shortcomings and further refine the algorithm, future research can explore the following avenues: integrating multimodal data, including radar, LiDAR, and other sensor inputs, to strengthen detection and tracking in complex environments; implementing dynamic weight adjustment, tuning the weight of each loss function in response to real-time environmental variations; incorporating online learning and self-adaptive mechanisms to continually optimize the algorithm after deployment and improve long-term tracking accuracy; integrating contextual information, such as road conditions and traffic rules, to strengthen robustness across diverse scenarios; and developing advanced occlusion handling techniques to ensure high-precision tracking even during extended periods of occlusion. With these enhancements, the FairMOT-MCVT algorithm is expected to demonstrate greater robustness and accuracy in more intricate and dynamic traffic environments, ultimately providing more reliable technical underpinnings for intelligent transportation systems.

Author Contributions

Conceptualization, M.L. (Menghao Li) and W.Z.; methodology, M.L. (Menghao Li); software, M.L. (Menghao Li); validation, M.L. (Menghao Li), W.Z., M.L. (Miao Liu) and C.Z.; formal analysis, M.L. (Miao Liu) and C.Z.; investigation, M.L. (Miao Liu) and E.C.; resources, M.L. (Miao Liu); data curation, M.L. (Menghao Li); writing—original draft preparation, M.L. (Menghao Li); writing—review and editing, E.C.; supervision, W.G.; project administration, M.L. (Miao Liu); funding acquisition, W.Z. and W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Special Funds for Centralized Guided Local Science and Technology Development (YDZX20233100002002) and supported by the Postdoctoral Fellowship Program of CPSF under Grant Number GZB20240351.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in the study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

Author Weiwei Zhang was employed by the company Shanghai Smart Vehicle Cooperating Innovation Center Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

MOT: Multi-object Tracking
IDF1: ID F1 Score
FPS: Frames Per Second
UA-DETRAC: University at Albany DETection and tRACking benchmark
MSDA: Multi-scale Dilated Attention
MOTA: Multiple Object Tracking Accuracy
FP: False Positive
FN: False Negative
IDTP: ID True Positive
IDFP: ID False Positive
IDFN: ID False Negative
GT: Ground Truth
DLA: Deep Layer Aggregation
Re-ID: Re-identification
LIDAR: Light Detection and Ranging
YOLO: You Only Look Once
SSD: Single Shot Multi-box Detector
JPDA: Joint Probabilistic Data Association
MHT: Multiple Hypothesis Tracking
MCHS: Multi-camera Highway Scenario
CNN: Convolutional Neural Network
R-CNN: Region-based Convolutional Neural Network
SORT: Simple Online and Realtime Tracking
DeepSORT: Deep Simple Online and Realtime Tracking
JDE: Jointly Learns the Detector and Embedding Model
CenterNet: Center-based Object Detection Network
Depthwise Separable Convolution: Depthwise separable convolution is an approach to optimizing convolutional neural networks that decomposes standard convolution into two simpler operations: depthwise convolution and pointwise convolution. This approach significantly reduces the amount of computation and the number of parameters while maintaining the performance of the network.
Multi-scale Feature Fusion: Multi-scale feature fusion combines feature maps at different scales to exploit information at each scale and improve detection and classification accuracy. This approach is usually implemented by introducing convolutional or pooling layers at different scales, enabling the model to capture richer image information.
Multi-scale Dilated Attention: The multi-scale dilated attention module combines multi-scale features with dilated convolution and an attention mechanism to enhance the feature extraction capability of the model at different scales by adaptively selecting the most important feature regions.
Joint Loss Function: A joint loss function combines multiple loss functions in order to optimize multiple objectives simultaneously during training. Common components include location loss, classification loss, and re-identification loss.

References

  1. Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. POI: Multiple object tracking with high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 36–42. [Google Scholar]
  2. Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7942–7951. [Google Scholar]
  3. Sun, S.; Akhtar, N.; Song, X.; Song, H.; Mian, A.; Shah, M. Simultaneous detection and tracking with motion modelling for multiple object tracking. arXiv 2020, arXiv:2008.08826. [Google Scholar]
  4. Peri, N.; Khorramshahi, P.; Rambhatla, S.S.; Shenoy, V.; Rawat, S.; Chen, J.C.; Chellappa, R. Towards real-time systems for vehicle re-identification, multi-camera tracking, and anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 622–623. [Google Scholar]
  5. Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Proceedings of the IEEE International Conference on Multimedia and Expo, San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  6. Coifman, B.; Beymer, D.; McLauchlan, P.; Malik, J. A real-time computer vision system for vehicle tracking and traffic surveillance. Transp. Res. Part C Emerg. Technol. 1998, 6, 271–288. [Google Scholar] [CrossRef]
  7. Battiato, S.; Farinella, G.M.; Furnari, A.; Puglisi, G.; Snijders, A.; Spiekstra, J. An integrated system for vehicle tracking and classification. Expert Syst. Appl. 2015, 42, 7263–7275. [Google Scholar] [CrossRef]
  8. Peña-González, R.H.; Nuño-Maganda, M.A. Computer vision based real-time vehicle tracking and classification system. In Proceedings of the IEEE 57th International Midwest Symposium on Circuits and Systems (MWSCAS), College Station, TX, USA, 3–6 August 2014; pp. 679–682. [Google Scholar]
  9. Ding, R.; Yu, M.; Oh, H.; Chen, W.H. New multiple-target tracking strategy using domain knowledge and optimization. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 605–616. [Google Scholar] [CrossRef]
  10. Bae, S.H.; Yoon, K.J. Confidence-Based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 595–610. [Google Scholar] [CrossRef]
  11. Fagot-Bouquet, L.; Audigier, R.; Dhome, Y.; Lerasle, F. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  12. Berclaz, J.; Fleuret, F.; Türetken, E.; Fua, P. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1806–1819. [Google Scholar] [CrossRef]
  13. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  14. Wojke, N.; Bewley, A.; Paulus, D. Simple online and real-time tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  15. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 107–122. [Google Scholar]
  16. Huang, Y.; Xiao, D.; Liu, J.; Tan, Z.; Liu, K.; Chen, M. An Improved Pig Counting Algorithm Based on YOLOv5 and DeepSORT Model. Sensors 2023, 23, 6309. [Google Scholar] [CrossRef]
  17. Zhou, X.; Chan, S.; Qiu, C.; Jiang, X.; Tang, T. Multi-Target Tracking Based on a Combined Attention Mechanism and Occlusion Sensing in a Behavior-Analysis System. Sensors 2023, 23, 2956. [Google Scholar] [CrossRef]
  18. Boragule, A.; Jang, H.; Ha, N.; Jeon, M. Pixel-guided association for multi-object tracking. Sensors 2022, 22, 8922. [Google Scholar] [CrossRef]
  19. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  20. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490. [Google Scholar]
  21. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  22. Meimetis, D.; Daramouskas, I.; Perikos, I.; Hatzilygeroudis, I. Real-time multiple object tracking using deep learning methods. Neural Comput. Appl. 2023, 35, 89–118. [Google Scholar] [CrossRef]
  23. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.-K. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
  24. Ren, L.; Lu, J.; Wang, Z.; Tian, Q.; Zhou, J. Collaborative deep reinforcement learning for multi-object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 586–602. [Google Scholar]
  25. Yang, J.; Ge, H.; Su, S.; Liu, G. Transformer-based two-source motion model for multi-object tracking. Appl. Intell. 2022, 52, 9967–9979. [Google Scholar] [CrossRef]
  26. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  27. Jiao, J.; Tang, Y.-M.; Lin, K.-Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.-S. Dilateformer: Multi-scale dilated transformer for visual recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919. [Google Scholar] [CrossRef]
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  29. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.-C.; Qi, H.; Lim, J.; Yang, M.-H.; Lyu, S. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [Google Scholar] [CrossRef]
  30. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  33. Tang, S.; Andriluka, M.; Andres, B.; Schiele, B. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3539–3548. [Google Scholar]
  34. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
  35. Guan, L.; Chen, Y.; Wang, G.; Lei, X. Real-time vehicle detection framework based on the fusion of LiDAR and camera. Electronics 2020, 9, 451. [Google Scholar] [CrossRef]
  36. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar]
  37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  38. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  39. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  40. Fujii, S.; Pham, Q.-C. Realtime trajectory smoothing with neural nets. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 7248–7254. [Google Scholar]
  41. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 23803–23828. [Google Scholar]
  42. Souza, M.R.; Pedrini, H. Digital video stabilization based on adaptive camera trajectory smoothing. EURASIP J. Image Video Process. 2018, 2018, 37. [Google Scholar] [CrossRef]
  43. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Figure 1. The current deployment of roadside equipment exhibits a limited detection range and high deployment costs.
Figure 2. The pipeline of the overall network architecture.
Figure 3. Ghost Module diagram.
Figure 4. Schematic diagram of the Block-Efficient structure.
Figure 5. The MSDA attention mechanism framework.
Figure 6. The FairMOT-MCVT structure is comprised of three primary components: (1) the designed Block-Efficient structure is used in the front-end part of the model; (2) an embedded MSDA attention mechanism; and (3) a detection and Re-ID branch, which employs a suite of loss functions, including the Smooth L1, cross-entropy, and ternary loss functions. Additionally, the model incorporates a time-continuity loss function.
Figure 7. Training loss comparison.
Figure 8. Comparison of heat map visualizations for algorithmic ablation.
Figure 9. Model comparison.
Figure 10. Tests on self-acquired datasets demonstrate that the detection range of the developed algorithm surpasses that of the benchmark detection algorithm. Furthermore, the tracked targets can be consistently tracked across two consecutive cameras. The algorithm also maintains a stable tracking performance even in the presence of occlusions and blurring.
Figure 11. Inference results of the FairMOT MCHS algorithm on selected videos.
Table 1. DLA-efficient feature extraction network ablation experiment.

Method | FPS | AP | AP50 | AP75 | APs | APM | APL
Baseline | 32.5 | 36.4 | 53.9 | 38.8 | 16.4 | 40.0 | 53.4
+Swish | 31.8 | 37.6 | 55.4 | 39.8 | 17.2 | 41.0 | 54.1
+Swish + MSDA | 30.4 | 37.8 | 55.5 | 40.0 | 17.2 | 41.3 | 54.0
+Swish + MSDA + Block-Efficient | 30.8 | 38.1 | 56.0 | 40.5 | 17.4 | 41.5 | 55.2
Table 2. Tracking algorithm ablation experiment.

Method | MOTA ↑ | IDF1 ↑ | MT ↑ | ML ↓ | IDS ↓ | FPS ↑
Baseline | 77.5 | 84.2 | 160 | 4 | 48 | 28.35
Baseline + DLA-efficient | 79.2 | 84.8 | 162 | 3 | 50 | 28.30
Baseline + Joint Loss | 78.4 | 84.4 | 160 | 4 | 46 | 28.90
Baseline + DLA-efficient + Joint Loss | 79.0 | 84.5 | 159 | 4 | 45 | 29.03
Table 3. Comparing the tracking results of mainstream algorithms in the UA-DETRAC test dataset.

Method | MOTA | IDF1 | MT | ML | IDS | FPS
SORT | 70.3 | 79.6 | 145 | 4 | 65 | 25.10
DeepSORT | 74.5 | 82.4 | 153 | 3 | 55 | 26.88
FairMOT | 77.7 | 84.2 | 160 | 4 | 47 | 28.35
CenterTrack | 77.0 | 84.6 | 155 | 4 | 50 | 27.55
RobMOT | 76.0 | 83.0 | 150 | 5 | 53 | 27.00
MTracker | 75.5 | 82.7 | 152 | 4 | 54 | 26.75
FairMOT-MCVT | 79.0 | 84.5 | 159 | 4 | 45 | 29.03
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

