1. Introduction
Visual Object Tracking and Segmentation (VOTS) and multiple object tracking (MOT) are pivotal tasks within computer vision, with applications ranging from surveillance and autonomous navigation to augmented reality and human–computer interaction. These tasks involve accurately tracking objects across video sequences, which is challenging due to varying environmental conditions, occlusions, and object deformations [1,2].
Traditional approaches to VOTS and MOT often rely on single tracking algorithms, which can struggle to maintain robustness and precision across diverse scenarios [3,4,5]. Recent advancements have shifted towards fusion strategies, which combine multiple tracking algorithms to leverage their strengths and mitigate weaknesses. These fusion strategies aim to enhance tracking performance by exploiting the complementarity between different algorithms [6,7,8].
The main contributions of this work are as follows:
Fusion strategy using SVM: Drawing inspiration from [9], we introduce a fusion strategy that enhances tracking performance by leveraging a support vector machine (SVM) to merge the outputs of multiple trackers [10]. Our approach improves speed through parallelization while maintaining high-quality results, showing that an established method such as the SVM remains highly effective for fusing tracker outputs.
Versatility across datasets: Our fusion strategy demonstrates strong versatility, achieving impressive results on the MOTChallenge 2017 and 2020 datasets [3,4]. Specifically, our method achieved 73.3 MOTA on MOT17 public detections and 82.8 MOTA on MOT17 private detections. On the MOT20 dataset, it reached 68.6 MOTA on public detections and 79.7 MOTA on private detections, setting new benchmarks in multi-object tracking.
Improved tracking accuracy through SVM-based fusion: Our method capitalizes on the strengths of two state-of-the-art tracking algorithms: DMAOT (Decoupling Memory AOT) [11,12] and HQTrack (High-Quality Track) [13]. By training an SVM to learn from the outputs of these trackers, we exploit their complementary behaviors, resulting in improved tracking accuracy and robustness. Despite limited experimental resources, our approach achieved a Q metric of 0.65 on the VOTS2023 benchmark [5], surpassing the performance of the individual trackers.
The remainder of this paper is organized as follows: we review related work in the field of multiple object tracking, detail the materials and methodology employed in our experiments, and present the results together with a comprehensive discussion of the findings. Through our contributions, we aim to advance the state of the art in VOTS and MOT, offering a robust and versatile solution for diverse real-world applications.
2. Related Work
The fields of multiple object tracking (MOT) and Video Object Segmentation (VOS) have seen significant advancements over the years, driven by both classical methods and modern deep learning approaches. This section provides an overview of key developments and highlights recent trends that inform our proposed fusion strategy for enhancing tracking performance.
2.1. Classical MOT Techniques
Initial advancements in multiple object tracking (MOT) relied on classical methods that laid the foundation for the development of more complex algorithms:
Kalman filter: Introduced by Kalman, this method provided a linear filtering and prediction mechanism, which was foundational for early tracking systems [14].
Hungarian algorithm: The Hungarian method by Kuhn solved the assignment problem in tracking, efficiently associating detections with existing tracks [15].
2.2. Deep Learning-Based MOT Approaches
With the rise of deep learning, MOT approaches were significantly advanced. Key contributions include the following:
DeepSORT: Wojke et al. introduced DeepSORT, which employed a deep association metric for online and real-time tracking, greatly enhancing accuracy and robustness [16].
Faster R-CNN: Ren et al.’s work on Faster R-CNN revolutionized object detection, influencing subsequent tracking methods [17].
TrackRCNN and FairMOT: These methods further advanced tracking by integrating Siamese networks for better re-identification and addressing biases in detection and re-identification, leading to fairer tracking performance [18,19].
ByteTrack: A robust tracking method that associates every detection box, including low-confidence ones, during multi-object tracking [20].
2.3. Transformer-Based MOT Methods
Recent approaches in MOT have incorporated transformer architectures, which were originally designed for natural language processing, to handle the complexity of tracking:
TransCenter/MOTer: Transformer-based architectures in methods like TransCenter/MOTer have helped improve tracking accuracy by modeling global correlations and attention in parallel [21].
2.4. Video Object Segmentation (VOS) Approaches
In the VOS domain, deep learning-based techniques have become the predominant approach:
SegmentAnything [22]: Many methods in VOS, such as OSVOS [23] and MoNet [24], fine-tune pre-trained segmentation networks at test time to focus on the target object.
MaskTrack: Methods like MaskTrack use optical flow for mask propagation to track objects in video [25].
Transformer-based VOS methods: Transformer architectures, such as AOT, have introduced hierarchical attention mechanisms to improve segmentation accuracy [12].
2.5. Recent Innovations in MOT and VOS
Recent advancements have focused on long-term tracking, multi-person tracking, and unsupervised learning:
UTM and pixel-guided association: Methods like UTM [26] and pixel-guided association [27] have made strides in these areas.
NCT and other techniques: Other methods, such as NCT [28], have pushed the boundaries of tracking performance, especially in challenging scenarios like occlusion and re-identification.
2.6. Ensemble Methods for MOT
Ensemble methods have been explored to combine the strengths of multiple trackers, leading to improved performance in MOT:
CoCoLoT and MixFormer: These frameworks demonstrated the potential of combining multiple models for enhanced tracking performance [6,29].
Tracker fusion: Dunnhofer et al. demonstrated that combining complementary trackers could improve long-term tracking performance [6]. The Chained Tracker by Peng et al. introduced a chaining mechanism for paired attentive regression results, facilitating joint detection and tracking [30].
ReTracker and Ensemble3ORT: Other works, such as ReTracker [31] and Ensemble3ORT [8], have applied the ensemble approach to improve tracking performance.
EnsembleMOT: Du et al. explored ensemble strategies further by developing a robust framework for MOT, showing significant improvements [32].
Inspired by these ensemble strategies, our approach simplifies the fusion process by employing a support vector machine (SVM) to combine the outputs of multiple trackers.
2.7. Fusion of Trackers via SVM
Using a support vector machine (SVM) to fuse the outputs of multiple trackers is not new, with earlier works exploring similar strategies [9]. However, due to differences in datasets, publication times, features, and algorithms, direct comparisons between these approaches and our method are not feasible. This work expands on previous research by demonstrating the effectiveness of SVM-based fusion in the context of modern tracking datasets, with mathematical justification presented in Section 3.
3. Materials and Methods
The methodology employed in this study revolves around the concept illustrated in Figure 1, which serves as the foundation of our framework. Our approach involves creating an ensemble of trackers, with the outputs of each tracker serving as input to a learner. In this study, a support vector machine (SVM) acts as the learner. For the sake of simplicity and clarity, we utilize two distinct trackers for each experimental setup. The SVM is employed as the classifier to discern the optimal performance among the trackers in various scenarios. Following the methodology section, a succinct overview of each tracker utilized in our experiments will be provided. Subsequently, a post-processing phase is implemented to address the discrepancies in ID mapping among the tracklets generated by different trackers. This post-processing step involves interpolating the tracklets at each frame by remapping the IDs from the current frame to the corresponding IDs in the previous frame using data association techniques.
3.1. SVM-Based Ensemble Optimization
The integration of a support vector machine (SVM) [10] into ensemble methods can significantly enhance the performance of visual object tracking by leveraging the strengths of multiple trackers. Ensemble methods combine the predictions of several models to improve overall accuracy and robustness. For an ensemble of $M$ trackers with predictions $\hat{y}_1, \dots, \hat{y}_M$, the ensemble prediction $\hat{y}_{\mathrm{ens}}$ is given by

$$\hat{y}_{\mathrm{ens}} = \frac{1}{M} \sum_{i=1}^{M} \hat{y}_i.$$

The mean squared error (MSE) for this prediction is

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{1}{M^2} \sum_{i=1}^{M} \sigma_i^2 + \frac{1}{M^2} \sum_{i \neq j} \mathrm{Cov}(\hat{y}_i, \hat{y}_j),$$

where $\sigma_i^2$ is the variance of tracker $i$ and $\mathrm{Cov}(\hat{y}_i, \hat{y}_j)$ is the covariance between the predictions of trackers $i$ and $j$. Assuming the trackers are uncorrelated, the covariance term becomes zero, simplifying the MSE to

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{1}{M^2} \sum_{i=1}^{M} \sigma_i^2.$$

Instead of simply averaging the predictions, an SVM can be used to select the best tracker for each frame based on the features of the resulting masks or bounding boxes. The SVM’s decision function is

$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term. The SVM aims to minimize the objective function

$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\|\mathbf{w}\|^2,$$

subject to

$$y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \geq 1, \quad i = 1, \dots, N.$$

By classifying the features of the predictions from individual trackers, the SVM determines the optimal tracker $t$ for each frame. The SVM prediction $\hat{y}_{\mathrm{svm}}$ is a weighted combination

$$\hat{y}_{\mathrm{svm}} = \sum_{t=1}^{M} w_t\, \hat{y}_t,$$

where $w_t$ are the weights assigned by the SVM. Assuming zero covariance between the trackers, the MSE for this weighted prediction is

$$\mathrm{MSE}(\hat{y}_{\mathrm{svm}}) = \sum_{t=1}^{M} w_t^2\, \sigma_t^2.$$

This approach minimizes the MSE more effectively than a simple average, exploiting the complementarity of the trackers and enhancing the accuracy and robustness of the overall tracking system.
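As a brief worked illustration (the variances here are hypothetical, not values measured in our experiments): consider $M = 2$ uncorrelated trackers with $\sigma_1^2 = 0.04$ and $\sigma_2^2 = 0.16$. The simple average yields

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{0.04 + 0.16}{4} = 0.05,$$

whereas weights proportional to the inverse variances, $w_1 = 0.8$ and $w_2 = 0.2$, yield

$$\mathrm{MSE}(\hat{y}_{\mathrm{svm}}) = 0.8^2 \cdot 0.04 + 0.2^2 \cdot 0.16 = 0.032,$$

a lower error, since the weighting favors the more reliable tracker.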
3.2. Tracking Pipeline
The tracking pipeline (see Figure 2) integrates an arbitrary number of individual trackers with a support vector machine (SVM) acting as a controller to select the optimal tracker for each frame. The workflow is as follows:
Frame processing: Each frame is processed by all trackers, and their outputs are passed to the SVM.
Feature extraction: From the results of each tracker, features such as bounding box/mask dimensions (W and H), the number of IDs (#ID), and the average detection score (S) are extracted.
SVM classification: The SVM, trained to predict the best tracker for each frame based on these features, assigns an integer label corresponding to one of the trackers.
Output selection: The results from the selected tracker are used as the final output for that frame.
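The loop below is a minimal sketch of this workflow, not our full implementation: the tracker interface, the dictionary-based result format with `boxes`, `ids`, and `scores`, and the xywh box convention are illustrative assumptions, while the SVM is a fitted scikit-learn classifier, as in our setup.

```python
import numpy as np

def extract_features(result):
    """Per-tracker features fed to the SVM: mean box width and height
    (W, H), the number of predicted IDs (#ID), and the mean detection
    score (S). `result` is a dict with `boxes` (N x 4, xywh), `ids`,
    and `scores` arrays (an illustrative interface)."""
    w = result["boxes"][:, 2].mean()
    h = result["boxes"][:, 3].mean()
    return [w, h, len(result["ids"]), result["scores"].mean()]

def run_pipeline(frames, trackers, svm):
    """Per-frame selection: every tracker processes the frame, features
    are extracted from each result, and the classifier picks which
    tracker's output to keep."""
    outputs = []
    for frame in frames:
        results = [t.track(frame) for t in trackers]                    # 1. frame processing
        feats = np.concatenate([extract_features(r) for r in results])  # 2. feature extraction
        best = int(svm.predict(feats.reshape(1, -1))[0])                # 3. SVM classification
        outputs.append(results[best])                                   # 4. output selection
    return outputs
```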
A significant challenge arises when different trackers are selected for consecutive frames, since their tracklet IDs do not correspond. To address this, a post-processing ID re-mapping step is performed.
3.2.1. ID Re-Mapping
Given the chosen tracker for the current frame and its results, we compute the Intersection over Union (IoU) matrix between the targets of the current frame and the previous frame. The IoU matrix is defined as follows:

$$\mathrm{IoU}_{ij} = \frac{\left| B_i^{\mathrm{cur}} \cap B_j^{\mathrm{prev}} \right|}{\left| B_i^{\mathrm{cur}} \cup B_j^{\mathrm{prev}} \right|},$$

where $B_i^{\mathrm{cur}}$ and $B_j^{\mathrm{prev}}$ are the bounding boxes of target $i$ in the current frame and target $j$ in the previous frame, respectively.
Using this IoU matrix, the Hungarian algorithm is applied to assign the previous frame’s target IDs to the current frame’s targets [15].
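A minimal sketch of this assignment is given below, using `scipy.optimize.linear_sum_assignment` on the negated IoU matrix; the (x1, y1, x2, y2) box format and the 0.3 matching threshold are illustrative assumptions (the threshold is not specified here). Unmatched targets receive fresh IDs, which also implements the global ID counter update of the next subsection.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(cur_boxes, prev_boxes):
    """Pairwise IoU between current (N x 4) and previous (M x 4) boxes,
    given in (x1, y1, x2, y2) format."""
    ious = np.zeros((len(cur_boxes), len(prev_boxes)))
    for i, a in enumerate(cur_boxes):
        for j, b in enumerate(prev_boxes):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def remap_ids(cur_boxes, prev_boxes, prev_ids, next_id, iou_thresh=0.3):
    """Hungarian assignment on the negated IoU matrix re-maps previous
    IDs to current targets; unmatched targets get fresh IDs, which
    increments the global ID counter (Section 3.2.2)."""
    ious = iou_matrix(cur_boxes, prev_boxes)
    rows, cols = linear_sum_assignment(-ious)   # maximizes total IoU
    new_ids = [None] * len(cur_boxes)
    for i, j in zip(rows, cols):
        if ious[i, j] >= iou_thresh:            # accept only plausible matches
            new_ids[i] = prev_ids[j]
    for i, v in enumerate(new_ids):
        if v is None:                           # new object: assign a new ID
            new_ids[i] = next_id
            next_id += 1
    return new_ids, next_id
```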
3.2.2. Global ID Counter Update
If no tracker change occurs and no new targets appear, the IDs remain unchanged. If a different tracker is chosen, the ID re-mapping is applied. In both cases, if new objects appear, the global ID counter increments for each new object:

$$ID_{\mathrm{global}} \leftarrow ID_{\mathrm{global}} + N_{\mathrm{new}},$$

where $N_{\mathrm{new}}$ is the number of newly appearing objects in the current frame.
As shown in Figure 2, since tracklet IDs may change during the process due to tracker switching, the ID re-mapping post-processing functions as an agglomeration algorithm within an IoU-based distance topological space. In this space, tracklets are represented as points, and closer points can be grouped and associated with a specific object, as illustrated in the figure with the bus and the car. This method assumes temporal and spatial locality conditions, working under the assumption of a stationary interval of frames, and is effective only when there is no background clutter or out-of-view occurrence in the sequence.
The final running time for the entire ensemble system can be written as follows:

$$T_{\mathrm{total}} = \sum_{i=1}^{M} T_i + T_{\mathrm{SVM}} + T_{\mathrm{remap}},$$

where $T_i$ is the running time of the $i$-th tracker of the system, $T_{\mathrm{SVM}}$ is the SVM inference running time, $T_{\mathrm{remap}}$ is the ID re-mapping post-processing time, and $T_{\mathrm{total}}$ is the total running time. If the individual trackers are executed in parallel, the summation over the trackers reduces to $\max_i T_i$.
3.2.3. SVM Training Phase
The SVM is trained using video sequences where each tracker has processed the frames. For each frame, features such as bounding box dimensions are extracted. Before determining the ground truth for training the SVM, the Hungarian algorithm is applied to assign IDs between the tracker predictions and the ground truth. This algorithm ensures an optimal assignment by minimizing the overall cost, thereby effectively matching predicted objects to the ground truth objects.
After applying the Hungarian algorithm, the Intersection over Union (IoU) is computed for each tracker based on the bounding box dimensions. The tracker with the highest total IoU is selected as follows:

$$t^{*} = \arg\max_{t} \sum_{i} \mathrm{IoU}\left(B_i^{t},\, B_i^{\mathrm{gt}}\right),$$

where $B_i^{t}$ is the bounding box of tracker $t$ and $B_i^{\mathrm{gt}}$ is the ground truth bounding box for target $i$. The index $t^{*}$ of the selected tracker is then used as the label for SVM training.
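A minimal sketch of this label-generation step, reusing `iou_matrix` from the earlier sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_label(per_tracker_boxes, gt_boxes):
    """Training label for one frame: the index of the tracker whose
    Hungarian-matched predictions achieve the highest total IoU
    against the ground truth boxes."""
    totals = []
    for boxes in per_tracker_boxes:
        ious = iou_matrix(boxes, gt_boxes)
        rows, cols = linear_sum_assignment(-ious)  # optimal prediction-to-GT matching
        totals.append(ious[rows, cols].sum())
    return int(np.argmax(totals))
```

The per-frame feature vectors and these labels then train the classifier, e.g., `sklearn.svm.SVC().fit(X, y)`; the kernel and hyperparameters are not specified here, so scikit-learn defaults are an assumption.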
3.3. Benchmarks and Metrics
The MOTChallenge benchmarks, MOT17 [3] and MOT20 [4], provide different characteristics for evaluating tracking algorithms. MOT17 focuses on human crowd tracking with accurate ground truth annotations, while MOT20 presents more challenging scenes with higher pedestrian density. Both benchmarks include training and test sets, with test scenarios extending beyond those seen in training, providing a comprehensive evaluation framework. Both also distinguish between private and public detections: in the private setting an algorithm supplies its own pedestrian detections, while in the public setting it uses the default detections provided by the challenge organizers.
The main evaluation metric used to assess tracking performance is the Multiple Object Tracking Accuracy (MOTA). Additional secondary metrics are Identification F1 (IDF1), Higher Order Tracking Accuracy (HOTA), False Positives (FP), False Negatives (FN), Precision, and Recall. Each metric provides different insights into various aspects of tracking performance, ensuring a thorough evaluation of the models [33,34,35].
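For reference, MOTA aggregates false negatives, false positives, and identity switches (IDSW) over all frames $t$, as introduced with the CLEAR MOT metrics:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},$$

where $\mathrm{GT}_t$ is the number of ground truth objects in frame $t$.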
The VOTS2023 dataset [5] was created specifically for this benchmark, containing 144 sequences with a total of 341 targets. It includes challenging scenarios such as visually similar objects, significant changes in appearance, cluttered backgrounds, partial occlusions, and objects exiting and re-entering the field of view. Four additional sequences, part of VOTS2023 but not included in the test set, were used to train the model, ensuring robustness and generalization across different scenarios.
The VOTS2023 benchmark evaluates multi-target trackers by assessing their ability to reliably track individual targets throughout video sequences. Five main scenarios are considered: successful localization (sc1), tracker drift (sc2), incorrect target predictions when the target is present (sc3), and when the target is absent (sc4, sc5). The success of the tracking is quantified using the intersection-over-union (IoU) metric with a binarization threshold $\theta$. The primary performance measure, tracking quality ($Q$), summarizes overall tracking performance by calculating the average overlap normalized over the sequence. Additional secondary metrics include accuracy (Acc) and robustness (Rob), as well as the not-reported error (NRE), drift-rate error (DRE), and absence-detection quality (ADQ) [5].
4. Experiments
For our experiments with the MOT benchmark, we utilized several state-of-the-art algorithms as individual trackers within our framework. These include ImprAsso [36] and BrinqTraq_v2 [37], both leading on the MOT20 private detections benchmark; again, BrinqTraq_v2 and NvMOT_DS2305 [38], both leading on the MOT17 private detections benchmark; OUTrack [39] and ByteTrack [20], both leading on the MOT20 public detections benchmark; and PermaTrack [40] and TransCenter/MOTer, both leading on the MOT17 public detections benchmark. Concerning our experiments with the VOTS2023 benchmark, we instead utilized the two top-ranked trackers, DMAOT [12] and HQTrack [13], as individual trackers. These individual trackers, when combined within our SVM-based ensemble, demonstrated improved performance across various tracking scenarios.
Experiments have been carried out as described in Table 4, where the training and test sets are given in the first two columns, the individual trackers in the third and fourth columns, and finally the features. Features are expressed as follows: $\#ID$ is the number of IDs in a prediction, where a subscript 1 refers to the first tracker only and a subscript 12 to both trackers; $S_1$ denotes the prediction scores of the first tracker; $W$ and $H$ are the width and height of the bounding box/mask (for a mask, they are defined as the maximum width and height). The order of the first and second tracker is given by the tracker columns of Table 4. Data augmentation was not employed in the experiments.
The experiments in this study were conducted using Python, with the Scikit-Learn library used for the implementation of the support vector machine (SVM) classifier. For computational resources, an AMD Ryzen processor with 8 GB of RAM was utilized to execute the experiments efficiently. In addition, benchmark evaluation servers were used to measure the tracking performance for each benchmark, ensuring precise and consistent results. This setup provided a robust environment for testing and evaluating the tracking algorithms across various datasets and configurations.
5. Results and Discussion
According to Table 5, on the MOT17 public detections benchmark, the proposed framework achieved a MOTA of 73.3, which is slightly higher than PermaTrack [40] at 73.1 and MOTer [21] at 71.9. The IDF1 score for our model is 62.4, which is lower than PermaTrack’s 67.2 but comparable to MOTer’s 62.3. In terms of HOTA, our model scored 54.1, matching the performance of MOTer but slightly underperforming compared to PermaTrack at 54.2. Our model exhibited 25,265 False Positives (FP) and 120,282 False Negatives (FN), which are better than PixelGuide [27] and ByteTrack [20] but higher than MOTer in terms of FP. The precision and recall of our model were 94.6 and 78.7, respectively, indicating a balanced performance between identifying correct positives and avoiding false positives.
In the MOT20 public detections benchmark (Table 6), our model achieved a MOTA of 68.6, which is slightly better than kalman_pub [20] at 67.0 and significantly better than RETracker [7] at 62.4. The IDF1 score of 69.5 is competitive, with SUSHI [41] achieving the highest at 71.6. Our model’s HOTA of 56.2 is the highest among all trackers compared, indicating strong overall tracking performance. The false positives (FP) and false negatives (FN) for our model were 30,499 and 129,937, respectively, which are competitive with the other top-performing models. Our model’s recall (Rcll) of 74.9 and precision (Prcn) of 92.7 also indicate a strong balance between sensitivity and specificity. The visual results are shown in Figure 3.
The framework also achieved good results on the MOT17 private detections benchmark (Table 7), reaching the highest MOTA of 82.8, slightly surpassing NvMOT_DS2305 [38] at 82.7 and BrinqTraq_v2 [37] at 82.3. The IDF1 score for our model was 79.0, competitive with BrinqTraq_v2 at 80.6 and NvMOT_DS2305 at 79.2. With a HOTA score of 64.8, our model demonstrated robust tracking abilities. Our model’s FP and FN were 27,935 and 67,282, respectively, showing effective performance in reducing false detections and misses. Recall and precision were balanced at 88.1 and 94.7, respectively, outperforming most competitors.
Concerning the MOT20 private detections benchmark (Table 8), the framework achieved a MOTA of 79.7, marginally higher than BrinqTraq_v2 [37] at 79.5 and significantly better than SuppTrack [44] at 78.2. The IDF1 score was 77.7, with BrinqTraq_v2 scoring 77.4. Our HOTA score of 63.4 was the highest, indicating robust tracking performance. The FP and FN were 26,539 and 77,222, respectively, showing strong detection capabilities. Recall was 85.1, and precision was 94.3, both indicating a high level of accuracy and reliability.
For the VOTS2023 benchmark, as shown in Table 9, the framework achieved a quality (Q) score of 0.65, slightly outperforming DMAOT [12] at 0.64. Accuracy (Acc) was 0.77, similar to HQTrack [13]. Robustness (Rob) was 0.80, matching the top-performing trackers. The not-reported error (NRE) and drift-rate error (DRE) were 0.14 and 0.06, respectively, indicating strong stability and minimal drift. The absence-detection quality (ADQ) of 0.76 was the highest among the compared models, showing effective handling of target absence. Visual results are shown in Figure 4.
6. Conclusions
The proposed framework optimizes the fusion of multiple trackers by effectively exploiting their complementary strengths, combining the best features of individual trackers to significantly enhance overall tracking performance. In this study, we focused on integrating two trackers and achieved promising results, consistently outperforming the individual trackers.
Our approach leverages the unique properties of each tracker, such as long-term robustness and the ability to work with masks or bounding boxes, effectively handling both single and multiple objects. Additionally, the ensemble method developed can potentially be parallelized and executed in a single run, which could lead to further improvements in runtime efficiency.
However, there are several limitations to our current framework. First, incorporating more trackers into the fusion process could further enhance performance by taking advantage of a wider range of complementary strengths. Furthermore, while we employed a support vector machine (SVM) in this study, exploring alternative and potentially more efficient learners, such as deep learning-based classifiers or gradient boosting methods, could lead to better optimization and tracking accuracy.
In addition, exploring a broader set of features could improve model performance. While we have focused on a certain set of attributes in this study, investigating additional features, such as object motion patterns, appearance descriptors, or temporal information, may yield higher accuracy and robustness.
Moreover, testing our method on a broader set of benchmark datasets could provide more comprehensive validation of its performance across diverse scenarios, helping to assess its robustness in real-world applications. These extensions will be addressed in future work, and we are excited to explore these avenues to further advance the field of multi-object tracking.
This study demonstrates that combining trackers is a powerful strategy for improving tracking performance and opens up new possibilities for future research and development in this field.