4.1. Datasets and Metrics
The FairMOT-MCVT vehicle tracking system is trained and evaluated using the UA-DETRAC dataset, which serves as a comprehensive benchmark for detecting and monitoring multiple objects in various real-world traffic scenarios. This dataset includes 60 video sequences depicting diverse weather and lighting conditions, such as sunny, cloudy, dark, and rainy. The videos are recorded at 960 × 540 pixel resolution and 25 frames per second (FPS), totaling 144,000 frames across varying durations from 36 to 128 s. The object classes within the dataset encompass cars, buses, trucks, and others.
Multi-object tracking accuracy (MOTA) reflects how well the tracking algorithm identifies targets and maintains their trajectories, and it serves as a key evaluation metric. A higher MOTA value indicates better tracking performance. The formula is as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_{t}\mathrm{GT}_t}$$

where $\mathrm{FN}_t$ is the number of missed detections (false negatives) at time $t$, $\mathrm{FP}_t$ is the number of erroneous detections (false positives) at time $t$, $\mathrm{IDS}_t$ is the number of ID switches at time $t$, and $\mathrm{GT}_t$ is the actual number of targets (ground truth) at time $t$.
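As a minimal sketch of the MOTA definition (one minus the summed false negatives, false positives, and ID switches divided by the summed ground-truth counts), the metric can be computed from per-frame counts as follows. The function name `mota` and the three-frame numbers are hypothetical and serve only as an illustration; they are not results from this study.

```python
def mota(fn, fp, ids, gt):
    """Compute MOTA from per-frame counts.

    fn, fp, ids, gt are equal-length sequences holding, for each frame,
    the false negatives, false positives, ID switches, and ground-truth
    object counts.
    """
    if not (len(fn) == len(fp) == len(ids) == len(gt)):
        raise ValueError("per-frame sequences must have equal length")
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(fn) + sum(fp) + sum(ids)
    return 1.0 - errors / total_gt


# Hypothetical three-frame sequence (illustrative numbers only).
fn = [1, 0, 2]
fp = [0, 1, 0]
ids = [0, 0, 1]
gt = [10, 10, 10]
score = mota(fn, fp, ids, gt)  # 1 - 5/30
```

Because the error terms are summed without an upper bound, MOTA can become negative when the tracker makes more errors than there are ground-truth objects.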
IDs denotes the total number of identity switches over the whole video sequence. Mostly tracked (MT) is the proportion of ground-truth (GT) trajectories that are correctly matched for at least 80% of their length.
Mostly lost (ML) is the proportion of GT trajectories that are correctly matched for less than 20% of their length.
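The 80% and 20% thresholds can be illustrated with a short sketch. It assumes that, for every ground-truth trajectory, the number of correctly matched frames and the total number of frames are already known; the function name `mt_ml` and the sample counts are hypothetical.

```python
def mt_ml(matched_frames, total_frames):
    """Return (MT, ML) as fractions of all ground-truth trajectories.

    matched_frames[i] is the number of frames in which trajectory i was
    correctly matched; total_frames[i] is its length in frames.
    Integer arithmetic avoids rounding problems at the 80%/20% limits.
    """
    if len(matched_frames) != len(total_frames) or not total_frames:
        raise ValueError("need one matched/total pair per trajectory")
    pairs = list(zip(matched_frames, total_frames))
    n = len(pairs)
    mostly_tracked = sum(5 * m >= 4 * t for m, t in pairs)  # m/t >= 0.8
    mostly_lost = sum(5 * m < t for m, t in pairs)          # m/t < 0.2
    return mostly_tracked / n, mostly_lost / n


# Hypothetical example: four trajectories of 100 frames each.
mt, ml = mt_ml([90, 50, 10, 80], [100, 100, 100, 100])
```

Trajectories that fall between the two thresholds (here the one matched in 50 of 100 frames) count toward neither MT nor ML.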
The identification F1 score (IDF1) is the ratio of correctly identified detections to the average of the number of ground-truth detections and the number of computed detections:

$$\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}$$

where IDTP represents the number of correctly identified targets, IDFP represents the number of falsely identified targets, and IDFN represents the number of missed targets. IDF1 combines the precision and recall of the correctly identified target IDs, with higher values indicating better performance.
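The relationship between IDF1 and identification precision and recall can be checked with a small sketch. It assumes the standard definition IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN); the function names and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 from identity true positives, false positives, false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def id_precision(idtp, idfp):
    """Fraction of computed detections that carry the correct identity."""
    return idtp / (idtp + idfp) if (idtp + idfp) else 0.0


def id_recall(idtp, idfn):
    """Fraction of ground-truth detections that are correctly identified."""
    return idtp / (idtp + idfn) if (idtp + idfn) else 0.0


# Hypothetical counts (illustrative only).
idtp, idfp, idfn = 80, 10, 30
f1 = idf1(idtp, idfp, idfn)      # 160 / 200
p = id_precision(idtp, idfp)     # 80 / 90
r = id_recall(idtp, idfn)        # 80 / 110
harmonic = 2 * p * r / (p + r)   # harmonic mean of precision and recall
```

The harmonic mean of the two ratios agrees with the direct formula, which is why IDF1 penalizes a tracker that is strong on only one of precision or recall.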
4.3. Ablation Studies
First, an ablation experiment on the DLA-efficient network was performed to illustrate the effect of each enhancement to the model. This experiment analyzed the variations in inference speed and precision across 6000 images for each refined COCO [43] test collection. The changes in training loss are depicted in Figure 7. The total loss curve shows that the optimized feature extraction network trains more stably, converges faster, and reaches a better final result. Comparing the loss curves of the individual branches shows that the DLA-efficient network trains effectively on the center point offset, height, width, and heat map branches and fits the data well. A total of four ablation experiments were conducted to compare the evaluation indicators of FPS and AP. The experimental data in Table 1 show that, compared with the original network, the DLA-efficient network achieves significant improvements and is better suited to vehicle tracking scenarios.
First, replacing the original activation function with the Swish activation function decreased the FPS by 0.7% but clearly improved the other indicators: AP increased by 1.2%, and the other five accuracy indicators in Table 1 increased by 1.5%, 1.0%, 0.8%, 1.0%, and 0.7%, respectively. Subsequently, the MSDA module was integrated into the front end of the network, which improved several indicators: AP increased by 0.2% and the AP for small targets increased by 0.3%, indicating that the module benefits small targets, such as vehicles. Incorporating the designed Block-Efficient module reduced the computation time, raising the frame rate from 30.4 to 30.8 FPS while also improving the remaining six indicators. Compared with the original feature extraction network DLA-34, the improved DLA-efficient network detects small, occluded, and distant targets more reliably and effectively addresses target loss caused by occlusion or distance. In summary, the DLA-efficient network developed in this article balances inference speed and precision and is a suitable feature extraction network for advanced vehicle monitoring algorithms.
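For reference, Swish is commonly defined as $x \cdot \sigma(\beta x)$, where $\sigma$ is the logistic sigmoid. The following scalar sketch assumes $\beta = 1$ as the default, since the value used in the network is not stated here; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid for a scalar input."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish activation x * sigmoid(beta * x); beta = 1 is an assumption."""
    return x * sigmoid(beta * x)


# Unlike a hard cut-off at zero, Swish is smooth and lets small negative
# values pass through attenuated.
samples = [swish(x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

The extra exponential per activation is one plausible reason for the small FPS cost reported above, although the text does not attribute the slowdown to a specific cause.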
To show how the DLA-efficient network improves vehicle detection, a subset of the UA-DETRAC dataset was visualized. The detection results are presented as heat maps in Figure 8. The results demonstrate that the DLA-efficient network exhibits robust feature extraction capabilities across various scenes, shooting angles, occlusion levels, and lighting conditions. The network effectively reduces missed vehicle detections and minimizes center point deviation.
The model size comparison is shown in Figure 9. DLA-efficient (max-Det = 100) achieved a 2.59% increase in recall and an 8% increase in accuracy compared with DLA-34. At the same time, the comparison of model size and parameter count showed that, while detection accuracy was maintained, the model size was reduced by about 18% and the number of parameters by 15%.
An ablation experiment was then performed on the FairMOT-MCVT vehicle tracking algorithm to evaluate how each enhancement affects five key multi-object tracking indicators (FPS, MOTA, IDF1, IDS, and ML), including inference speed. The experimental results are shown in Table 2.
First, ablation experiments were performed on the baseline using the optimized DLA-efficient feature extraction network. Compared with the original network, DLA-efficient has more layers, which causes a minor reduction in FPS but yields a 1.7% improvement in MOTA and a 0.4% improvement in IDF1. The results show that tracking performance improves while the real-time requirements are still met. However, because the improved feature extraction network does not sufficiently recognize small target vehicles within a specific range, the number of ID switches increases when such vehicles occlude each other. Furthermore, introducing the joint loss function into the baseline results in a 0.9% rise in MOTA, a 0.2% increase in IDF1, no change in MT and ML, a decrease in IDS, and a minor increase in FPS. This shows that the optimized feature extraction network and the joint loss function together improve the accuracy of vehicle discrimination and tracking, and that vehicle detection and tracking from a roadside perspective are improved.
To further illustrate the efficacy of FairMOT-MCVT, the proposed method is compared with four standard techniques and two state-of-the-art techniques on the same test video sequences of the UA-DETRAC dataset, as shown in Table 3. Our algorithm improves MOTA to 79.0%, the best result among the compared algorithms. The SORT algorithm reached a MOTA of 70.3. Since the SORT algorithm’s data association step relies solely on IOU matching, its real-time performance is suboptimal. In addition, its simple data association method produces a large number of ID switches, which reduces the tracking accuracy. DeepSORT improves the accuracy of data association and reduces ID switches through cascade matching. Nevertheless, its evaluation scores remain low because its results depend heavily on the detector, whose performance is only average. FairMOT ranks just below the proposed method in MOTA. It efficiently tracks slow-moving objects and pedestrians that differ significantly in appearance. Nonetheless, it does not address the issues raised by rapidly moving objects such as vehicles, which makes it less effective than the proposed method. CenterNet’s superior detection performance indirectly enhances the precision of its tracking process. However, the algorithm is less robust because its data association process is simple and similar to that of SORT. The proposed method optimizes the network structure for vehicles and other large objects in traffic scenes viewed from the roadside. At the same time, the joint loss function improves discrimination between similar vehicles, reduces identity switches between vehicles that resemble each other, and thus optimizes the vehicle tracking algorithm from several perspectives. The RobMOT algorithm excels in multi-target tracking, achieving an IDF1 of 83.0. However, its complex computation significantly impacts real-time performance: its FPS of only 27.00 makes it less efficient than the FairMOT-MCVT algorithm. Additionally, RobMOT exhibits deficiencies in tracking fast-moving targets.
Conversely, the MTracker algorithm performs stably on long video sequences. However, its strong dependence on the detector degrades its performance, resulting in a MOTA of only 75.5. In contrast, the FairMOT-MCVT algorithm significantly enhances the robustness and accuracy of tracking by optimizing the network structure and the joint loss function, achieving a superior MOTA of 79.0. In summary, the proposed algorithm achieves better tracking performance than the six compared tracking algorithms.
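To make the IOU-only data association discussed for SORT more concrete, the sketch below computes the intersection over union of two boxes and pairs tracks with detections. It is a simplification under stated assumptions: SORT itself solves the assignment with the Hungarian algorithm, whereas a greedy highest-IOU-first matching is swapped in here for brevity, and the function names, box coordinates, and threshold of 0.3 are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, iou_threshold=0.3):
    """Pair track boxes with detection boxes, highest IOU first.

    Greedy stand-in for the Hungarian assignment; returns a list of
    (track_index, detection_index) pairs.
    """
    candidates = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, ti, di in candidates:
        if score < iou_threshold:
            break  # remaining candidates overlap even less
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return matches


# Hypothetical boxes: each detection is its track shifted by one pixel.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
matches = greedy_match(tracks, detections)
```

Because the match relies on box overlap alone, two similar vehicles that cross or occlude each other can easily swap identities, which is consistent with the large number of ID switches noted for SORT.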
During the evaluation process, we also encountered some limitations and challenges. First, the algorithm’s stability under extreme weather conditions has yet to be verified. Severe weather conditions, such as heavy rain, snow, or fog, may affect the performance of the sensors, thereby reducing the tracking accuracy. Additionally, the tracking accuracy may decrease when dealing with high-density traffic scenarios. Dense traffic flow leads to more occlusion and target overlap, increasing the difficulty of tracking. To overcome these challenges, future research can explore several improvement directions: the fusion of multi-modal data, dynamic weight adjustment, online learning mechanisms, and the integration of contextual information. Developing more advanced occlusion handling techniques will ensure that high-accuracy tracking can be maintained even in prolonged occlusion situations. Through these improvements, the FairMOT-MCVT algorithm is expected to demonstrate stronger robustness and accuracy in more complex and dynamic traffic environments, providing more reliable technical support for intelligent transportation systems.
The improvements introduced by the DLA-efficient network and the FairMOT-MCVT algorithm are statistically significant. The gains in key metrics, including a 1.2% increase in AP and a 1.7% increase in MOTA, exceed what random fluctuation or noise would explain. The consistency of these improvements across multiple experimental runs further supports the reliability of the findings. To assess statistical significance, we conducted t-tests comparing the performance metrics of the original and enhanced models. The resulting p-values consistently fall below the 0.05 threshold, confirming that the improvements of the enhanced algorithm are statistically significant.
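As an illustration of how such a comparison can be set up, the sketch below computes a paired t-statistic from per-run scores of two models. It rests on several assumptions: the text does not state whether the tests were paired or how many runs were used, and the scores below are made-up placeholders, not measurements from this study. Instead of a p-value, the statistic is compared with 2.776, the two-sided 5% critical value of the t-distribution with 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(baseline, improved):
    """Return (t, degrees_of_freedom) for paired per-run scores."""
    if len(baseline) != len(improved) or len(baseline) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    if std_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Made-up placeholder scores for five runs (NOT results from the paper).
baseline_runs = [70.2, 71.0, 69.8, 70.5, 70.9]
improved_runs = [71.9, 72.6, 71.7, 72.1, 72.8]

t_value, dof = paired_t_statistic(baseline_runs, improved_runs)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
is_significant = abs(t_value) > T_CRITICAL_DF4
```

A paired design fits the case where both models are evaluated on the same runs or sequences; with independent runs, an unpaired test would be the more appropriate choice.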