1. Introduction
Visual Object Tracking and Segmentation (VOTS) and multiple object tracking (MOT) are pivotal tasks within computer vision, with applications ranging from surveillance and autonomous navigation to augmented reality and human–computer interaction. These tasks involve accurately tracking objects across video sequences, which is challenging due to varying environmental conditions, occlusions, and object deformations [1,2].
Traditional approaches to VOTS and MOT often rely on single tracking algorithms, which can struggle to maintain robustness and precision across diverse scenarios [3,4,5]. Recent advancements have shifted towards fusion strategies, which combine multiple tracking algorithms to leverage their strengths and mitigate weaknesses. These fusion strategies aim to enhance tracking performance by exploiting the complementarity between different algorithms [6,7,8].
The main contributions of this work are as follows:
Fusion strategy using SVM: Drawing inspiration from [9], we introduce a fusion strategy that enhances tracking performance by leveraging a support vector machine (SVM) to merge the outputs of multiple trackers [10]. Our approach improves speed through parallelization while maintaining high-quality results, showing that an established method such as the SVM remains highly effective for fusing tracker outputs.
Versatility across datasets: Our fusion strategy demonstrates strong versatility, achieving impressive results on the MOTChallenge 2017 and 2020 datasets [3,4]. Specifically, our method achieved 73.3 MOTA on MOT17 public detections and 82.8 MOTA on MOT17 private detections. On the MOT20 dataset, it reached 68.6 MOTA on public detections and 79.7 MOTA on private detections, setting new benchmarks in multi-object tracking.
Improved tracking accuracy through SVM-based fusion: Our method capitalizes on the strengths of two state-of-the-art tracking algorithms: DMAOT (Decoupling Memory AOT) [11,12] and HQTrack (High-Quality Track) [13]. By training an SVM to learn from the outputs of these trackers, we exploit their complementary behaviors, resulting in improved tracking accuracy and robustness. Despite limited experimental resources, our approach achieved a Q metric of 0.65 on the VOTS2023 benchmark [5], surpassing the performance of the individual trackers.
The remainder of this paper is organized as follows: we review related work in the field of multiple object tracking, detail the materials and methodology employed in our experiments, and present the results together with a comprehensive discussion of the findings. Through our contributions, we aim to advance the state of the art in VOTS and MOT, offering a robust and versatile solution for diverse real-world applications.
2. Related Work
The fields of multiple object tracking (MOT) and Video Object Segmentation (VOS) have seen significant advancements over the years, driven by both classical methods and modern deep learning approaches. This section provides an overview of key developments and highlights recent trends that inform our proposed fusion strategy for enhancing tracking performance.
2.1. Classical MOT Techniques
Initial advancements in multiple object tracking (MOT) relied on classical methods that laid the foundation for the development of more complex algorithms:
Kalman filter: Introduced by Kalman, this method provided a linear filtering and prediction mechanism, which was foundational for early tracking systems [14].
Hungarian algorithm: The Hungarian method by Kuhn solved the assignment problem in tracking, efficiently associating detections with existing tracks [15].
2.2. Deep Learning-Based MOT Approaches
With the rise of deep learning, MOT approaches were significantly advanced. Key contributions include the following:
DeepSORT: Wojke et al. introduced DeepSORT, which employed a deep association metric for online and real-time tracking, greatly enhancing accuracy and robustness [16].
Faster R-CNN: Ren et al.’s work on Faster R-CNN revolutionized object detection, influencing subsequent tracking methods [17].
TrackRCNN and FairMOT: These methods further advanced tracking by integrating Siamese networks for better re-identification and addressing biases in detection and re-identification, leading to fairer tracking performance [18,19].
ByteTrack: A robust tracking method that associates every detection box, including low-confidence ones, during multi-object tracking [20].
2.3. Transformer-Based MOT Methods
Recent approaches in MOT have incorporated transformer architectures, which were originally designed for natural language processing, to handle the complexity of tracking:
TransCenter/MOTer: Transformer-based architectures in methods like TransCenter/MOTer have helped improve tracking accuracy by modeling global correlations and attention in parallel [21].
2.4. Video Object Segmentation (VOS) Approaches
In the VOS domain, deep learning-based techniques have become the predominant approach:
SegmentAnything [22]: Many methods in VOS, such as OSVOS [23] and MoNet [24], fine-tune pre-trained segmentation networks at test time to focus on the target object.
MaskTrack: Methods like MaskTrack use optical flow for mask propagation to track objects in video [25].
Transformer-based VOS methods: Transformer architectures, such as AOT, have introduced hierarchical attention mechanisms to improve segmentation accuracy [12].
2.5. Recent Innovations in MOT and VOS
Recent advancements have focused on long-term tracking, multi-person tracking, and unsupervised learning:
UTM and pixel-guided association: Methods like UTM [26] and pixel-guided association [27] have made strides in these areas.
NCT and other techniques: Other methods, such as NCT [28], have pushed the boundaries of tracking performance, especially in challenging scenarios like occlusion and re-identification.
2.6. Ensemble Methods for MOT
Ensemble methods have been explored to combine the strengths of multiple trackers, leading to improved performance in MOT:
CoCoLoT and MixFormer: These frameworks demonstrated the potential of combining multiple models for enhanced tracking performance [6,29].
Tracker fusion: Dunnhofer et al. demonstrated that combining complementary trackers could improve long-term tracking performance [6]. The Chained Tracker by Peng et al. introduced a chaining mechanism for paired attentive regression results, facilitating joint detection and tracking [30].
ReTracker and Ensemble3ORT: Other works, such as ReTracker [31] and Ensemble3ORT [8], have applied the ensemble approach to improve tracking performance.
EnsembleMOT: Du et al. explored ensemble strategies further by developing a robust framework for MOT, showing significant improvements [32].
Inspired by these ensemble strategies, our approach simplifies the fusion process by employing a support vector machine (SVM) to combine the outputs of multiple trackers.
2.7. Fusion of Trackers via SVM
Using a support vector machine (SVM) to fuse the outputs of multiple trackers is not new, with earlier works exploring similar strategies [9]. However, due to differences in datasets, publication times, features, and algorithms, direct comparisons between these approaches and our method are not feasible. This work expands on previous research by demonstrating the effectiveness of SVM-based fusion in the context of modern tracking datasets, with mathematical justification presented in Section 3.
3. Materials and Methods
The methodology employed in this study revolves around the concept illustrated in Figure 1, which serves as the foundation of our framework. Our approach involves creating an ensemble of trackers, with the outputs of each tracker serving as input to a learner. In this study, a support vector machine (SVM) acts as the learner. For the sake of simplicity and clarity, we utilize two distinct trackers for each experimental setup. The SVM is employed as the classifier to discern the optimal performance among the trackers in various scenarios. Following the methodology section, a succinct overview of each tracker utilized in our experiments will be provided. Subsequently, a post-processing phase is implemented to address the discrepancies in ID mapping among the tracklets generated by different trackers. This post-processing step involves interpolating the tracklets at each frame by remapping the IDs from the current frame to the corresponding IDs in the previous frame using data association techniques.
3.1. SVM-Based Ensemble Optimization
The integration of a support vector machine (SVM) [10] into ensemble methods can significantly enhance the performance of visual object tracking by leveraging the strengths of multiple trackers. Ensemble methods combine the predictions of several models to improve overall accuracy and robustness. For an ensemble of $M$ trackers with predictions $\hat{y}_1, \dots, \hat{y}_M$, the ensemble prediction $\hat{y}_{\mathrm{ens}}$ is given by

$$\hat{y}_{\mathrm{ens}} = \frac{1}{M} \sum_{i=1}^{M} \hat{y}_i.$$

The mean squared error (MSE) for this prediction is

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{1}{M^2} \sum_{i=1}^{M} \sigma_i^2 + \frac{1}{M^2} \sum_{i \neq j} \mathrm{Cov}(\hat{y}_i, \hat{y}_j),$$

where $\sigma_i^2$ is the variance of tracker $i$ and $\mathrm{Cov}(\hat{y}_i, \hat{y}_j)$ is the covariance between the predictions of trackers $i$ and $j$. Assuming the trackers are uncorrelated, the covariance term becomes zero, simplifying the MSE to

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{1}{M^2} \sum_{i=1}^{M} \sigma_i^2.$$

Instead of simply averaging the predictions, an SVM can be used to select the best tracker for each frame based on the features of the resulting masks or bounding boxes. The SVM’s decision function is

$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias term. The SVM aims to minimize the objective function

$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\|\mathbf{w}\|^2,$$

subject to

$$y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \geq 1, \quad i = 1, \dots, N.$$

By classifying the features of the predictions from individual trackers, the SVM determines the optimal tracker $t$ for each frame. The SVM prediction $\hat{y}_{\mathrm{svm}}$ is a weighted combination

$$\hat{y}_{\mathrm{svm}} = \sum_{t=1}^{M} w_t\, \hat{y}_t,$$

where $w_t$ are the weights assigned by the SVM. Assuming zero covariance between the trackers, the MSE for this weighted prediction is

$$\mathrm{MSE}(\hat{y}_{\mathrm{svm}}) = \sum_{t=1}^{M} w_t^2\, \sigma_t^2.$$

This approach minimizes the MSE more effectively than a simple average, exploiting the complementarity of the trackers and enhancing the accuracy and robustness of the overall tracking system.
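As a brief worked illustration (the variances here are hypothetical, not values measured in our experiments): consider $M = 2$ uncorrelated trackers with $\sigma_1^2 = 0.04$ and $\sigma_2^2 = 0.16$. The simple average yields

$$\mathrm{MSE}(\hat{y}_{\mathrm{ens}}) = \frac{0.04 + 0.16}{4} = 0.05,$$

whereas weights proportional to the inverse variances, $w_1 = 0.8$ and $w_2 = 0.2$, yield

$$\mathrm{MSE}(\hat{y}_{\mathrm{svm}}) = 0.8^2 \cdot 0.04 + 0.2^2 \cdot 0.16 = 0.032,$$

a lower error, since the weighting favors the more reliable tracker.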
3.2. Tracking Pipeline
The tracking pipeline (see Figure 2) integrates an arbitrary number of individual trackers with a support vector machine (SVM) acting as a controller to select the optimal tracker for each frame. The workflow is as follows:
Frame processing: Each frame is processed by all trackers, and their outputs are passed to the SVM.
Feature extraction: From the results of each tracker, features such as bounding box/mask dimensions (W and H), the number of IDs (#ID), and the average detection score (S) are extracted.
SVM classification: The SVM, trained to predict the best tracker for each frame based on these features, assigns an integer label corresponding to one of the trackers.
Output selection: The results from the selected tracker are used as the final output for that frame.
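The loop below is a minimal sketch of this workflow, not our full implementation: the tracker interface, the dictionary-based result format with `boxes`, `ids`, and `scores`, and the xywh box convention are illustrative assumptions, while the SVM is a fitted scikit-learn classifier, as in our setup.

```python
import numpy as np

def extract_features(result):
    """Per-tracker features fed to the SVM: mean box width and height
    (W, H), the number of predicted IDs (#ID), and the mean detection
    score (S). `result` is a dict with `boxes` (N x 4, xywh), `ids`,
    and `scores` arrays (an illustrative interface)."""
    w = result["boxes"][:, 2].mean()
    h = result["boxes"][:, 3].mean()
    return [w, h, len(result["ids"]), result["scores"].mean()]

def run_pipeline(frames, trackers, svm):
    """Per-frame selection: every tracker processes the frame, features
    are extracted from each result, and the classifier picks which
    tracker's output to keep."""
    outputs = []
    for frame in frames:
        results = [t.track(frame) for t in trackers]                    # 1. frame processing
        feats = np.concatenate([extract_features(r) for r in results])  # 2. feature extraction
        best = int(svm.predict(feats.reshape(1, -1))[0])                # 3. SVM classification
        outputs.append(results[best])                                   # 4. output selection
    return outputs
```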
A significant challenge arises when different trackers are selected for consecutive frames, since their tracklet IDs do not correspond. To address this, a post-processing ID re-mapping step is performed.
3.2.1. ID Re-Mapping
Given the chosen tracker for the current frame and its results, we compute the Intersection over Union (IoU) matrix between the targets of the current frame and the previous frame. The IoU matrix is defined as follows:

$$\mathrm{IoU}_{ij} = \frac{\left| B_i^{\mathrm{cur}} \cap B_j^{\mathrm{prev}} \right|}{\left| B_i^{\mathrm{cur}} \cup B_j^{\mathrm{prev}} \right|},$$

where $B_i^{\mathrm{cur}}$ and $B_j^{\mathrm{prev}}$ are the bounding boxes of target $i$ in the current frame and target $j$ in the previous frame, respectively.
Using this IoU matrix, the Hungarian algorithm is applied to assign the previous frame’s target IDs to the current frame’s targets [15].
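A minimal sketch of this assignment is given below, using `scipy.optimize.linear_sum_assignment` on the negated IoU matrix; the (x1, y1, x2, y2) box format and the 0.3 matching threshold are illustrative assumptions (the threshold is not specified here). Unmatched targets receive fresh IDs, which also implements the global ID counter update of the next subsection.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(cur_boxes, prev_boxes):
    """Pairwise IoU between current (N x 4) and previous (M x 4) boxes,
    given in (x1, y1, x2, y2) format."""
    ious = np.zeros((len(cur_boxes), len(prev_boxes)))
    for i, a in enumerate(cur_boxes):
        for j, b in enumerate(prev_boxes):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def remap_ids(cur_boxes, prev_boxes, prev_ids, next_id, iou_thresh=0.3):
    """Hungarian assignment on the negated IoU matrix re-maps previous
    IDs to current targets; unmatched targets get fresh IDs, which
    increments the global ID counter (Section 3.2.2)."""
    ious = iou_matrix(cur_boxes, prev_boxes)
    rows, cols = linear_sum_assignment(-ious)   # maximizes total IoU
    new_ids = [None] * len(cur_boxes)
    for i, j in zip(rows, cols):
        if ious[i, j] >= iou_thresh:            # accept only plausible matches
            new_ids[i] = prev_ids[j]
    for i, v in enumerate(new_ids):
        if v is None:                           # new object: assign a new ID
            new_ids[i] = next_id
            next_id += 1
    return new_ids, next_id
```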
3.2.2. Global ID Counter Update
If no tracker change occurs and no new targets appear, the IDs remain unchanged. If a different tracker is chosen, the ID re-mapping is applied. In both cases, if new objects appear, the global ID counter increments for each new object:

$$ID_{\mathrm{global}} \leftarrow ID_{\mathrm{global}} + N_{\mathrm{new}},$$

where $N_{\mathrm{new}}$ is the number of newly appearing objects in the current frame.
As shown in Figure 2, since tracklet IDs may change during the process due to tracker switching, the ID re-mapping post-processing functions as an agglomeration algorithm within an IoU-based distance topological space. In this space, tracklets are represented as points, and closer points can be grouped and associated with a specific object, as illustrated in the figure with the bus and the car. This method assumes temporal and spatial locality conditions, working under the assumption of a stationary interval of frames, and is effective only when there is no background clutter or out-of-view occurrence in the sequence.
The final running time for the entire ensemble system can be written as follows:

$$T_{\mathrm{total}} = \sum_{i=1}^{M} T_i + T_{\mathrm{SVM}} + T_{\mathrm{remap}},$$

where $T_i$ is the running time of the $i$-th tracker of the system, $T_{\mathrm{SVM}}$ is the SVM inference running time, $T_{\mathrm{remap}}$ is the ID re-mapping post-processing time, and $T_{\mathrm{total}}$ is the total running time. If the individual trackers are executed in parallel, the summation over the trackers reduces to $\max_i T_i$.
3.2.3. SVM Training Phase
The SVM is trained using video sequences where each tracker has processed the frames. For each frame, features such as bounding box dimensions are extracted. Before determining the ground truth for training the SVM, the Hungarian algorithm is applied to assign IDs between the tracker predictions and the ground truth. This algorithm ensures an optimal assignment by minimizing the overall cost, thereby effectively matching predicted objects to the ground truth objects.
After applying the Hungarian algorithm, the Intersection over Union (IoU) is computed for each tracker based on the bounding box dimensions. The tracker with the highest total IoU is selected as follows:

$$t^{*} = \arg\max_{t} \sum_{i} \mathrm{IoU}\left(B_i^{t},\, B_i^{\mathrm{gt}}\right),$$

where $B_i^{t}$ is the bounding box of tracker $t$ and $B_i^{\mathrm{gt}}$ is the ground truth bounding box for target $i$. The index $t^{*}$ of the selected tracker is then used as the label for SVM training.
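A minimal sketch of this label-generation step, reusing `iou_matrix` from the earlier sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_label(per_tracker_boxes, gt_boxes):
    """Training label for one frame: the index of the tracker whose
    Hungarian-matched predictions achieve the highest total IoU
    against the ground truth boxes."""
    totals = []
    for boxes in per_tracker_boxes:
        ious = iou_matrix(boxes, gt_boxes)
        rows, cols = linear_sum_assignment(-ious)  # optimal prediction-to-GT matching
        totals.append(ious[rows, cols].sum())
    return int(np.argmax(totals))
```

The per-frame feature vectors and these labels then train the classifier, e.g., `sklearn.svm.SVC().fit(X, y)`; the kernel and hyperparameters are not specified here, so scikit-learn defaults are an assumption.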
3.3. Benchmarks and Metrics
The MOTChallenge benchmarks, MOT17 [3] and MOT20 [4], provide different characteristics for evaluating tracking algorithms. MOT17 focuses on human crowd tracking with accurate ground truth annotations, while MOT20 presents more challenging scenes with higher pedestrian density. Both benchmarks include training and test sets, with test scenarios extending beyond those seen in training, providing a comprehensive evaluation framework. Both also distinguish between private and public detections: in the private setting an algorithm supplies its own pedestrian detections, while in the public setting it uses the default detections provided by the challenge organizers.
The main evaluation metric used to assess tracking performance is the Multiple Object Tracking Accuracy (MOTA). Additional secondary metrics are Identification F1 (IDF1), Higher Order Tracking Accuracy (HOTA), False Positives (FP), False Negatives (FN), Precision, and Recall. Each metric provides different insights into various aspects of tracking performance, ensuring a thorough evaluation of the models [33,34,35].
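For reference, MOTA aggregates false negatives, false positives, and identity switches (IDSW) over all frames $t$, as introduced with the CLEAR MOT metrics:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},$$

where $\mathrm{GT}_t$ is the number of ground truth objects in frame $t$.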
The VOTS2023 dataset [5] was created specifically for this benchmark, containing 144 sequences with a total of 341 targets. It includes challenging scenarios such as visually similar objects, significant changes in appearance, cluttered backgrounds, partial occlusions, and objects exiting and re-entering the field of view. Four additional sequences, part of VOTS2023 but not included in the test set, were used to train the model, ensuring robustness and generalization across different scenarios.
The VOTS2023 benchmark evaluates multi-target trackers by assessing their ability to reliably track individual targets throughout video sequences. Five main scenarios are considered: successful localization (sc1), tracker drift (sc2), incorrect target predictions when the target is present (sc3), and when the target is absent (sc4, sc5). The success of the tracking is quantified using the intersection-over-union (IoU) metric with a binarization threshold $\theta$. The primary performance measure, tracking quality ($Q$), summarizes overall tracking performance by calculating the average overlap normalized over the sequence. Additional secondary metrics include accuracy (Acc) and robustness (Rob), as well as the not-reported error (NRE), drift-rate error (DRE), and absence-detection quality (ADQ) [5].
4. Experiments
For our experiments with the MOT benchmark, we utilized several state-of-the-art algorithms as individual trackers within our framework. These include ImprAsso [36] and BrinqTraq_v2 [37], both leading on the MOT20 private detections benchmark; again, BrinqTraq_v2 and NvMOT_DS2305 [38], both leading on the MOT17 private detections benchmark; OUTrack [39] and ByteTrack [20], both leading on the MOT20 public detections benchmark; and PermaTrack [40] and TransCenter/MOTer, both leading on the MOT17 public detections benchmark. Concerning our experiments with the VOTS2023 benchmark, we instead utilized the two top-ranked trackers, DMAOT [12] and HQTrack [13], as individual trackers. These individual trackers, when combined within our SVM-based ensemble, demonstrated improved performance across various tracking scenarios.
Experiments have been carried out as described in Table 4, where the training and test sets are given in the first two columns, the individual trackers in the third and fourth columns, and finally the features. Features are expressed as follows: $\#ID$ is the number of IDs in a prediction, where a subscript 1 refers to the first tracker only and a subscript 12 to both trackers; $S_1$ denotes the prediction scores of the first tracker; $W$ and $H$ are the width and height of the bounding box/mask (for a mask, they are defined as the maximum width and height). The order of the first and second tracker is given by the tracker columns of Table 4. Data augmentation was not employed in the experiments.
The experiments in this study were conducted using Python, with the Scikit-Learn library used for the implementation of the support vector machine (SVM) classifier. For computational resources, an AMD Ryzen processor with 8 GB of RAM was utilized to execute the experiments efficiently. In addition, benchmark evaluation servers were used to measure the tracking performance for each benchmark, ensuring precise and consistent results. This setup provided a robust environment for testing and evaluating the tracking algorithms across various datasets and configurations.
5. Results and Discussion
According to Table 5, on the MOT17 public detections benchmark, the proposed framework achieved a MOTA of 73.3, which is slightly higher than PermaTrack [40] at 73.1 and MOTer [21] at 71.9. The IDF1 score for our model is 62.4, which is lower than PermaTrack’s 67.2 but comparable to MOTer’s 62.3. In terms of HOTA, our model scored 54.1, matching the performance of MOTer but slightly underperforming compared to PermaTrack at 54.2. Our model exhibited 25,265 False Positives (FP) and 120,282 False Negatives (FN), which are better than PixelGuide [27] and ByteTrack [20] but higher than MOTer in terms of FP. The precision and recall of our model were 94.6 and 78.7, respectively, indicating a balanced performance between identifying correct positives and avoiding false positives.
In the MOT20 public detections benchmark (Table 6), our model achieved a MOTA of 68.6, which is slightly better than kalman_pub [20] at 67.0 and significantly better than RETracker [7] at 62.4. The IDF1 score of 69.5 is competitive, with SUSHI [41] achieving the highest at 71.6. Our model’s HOTA of 56.2 is the highest among all trackers compared, indicating strong overall tracking performance. The false positives (FP) and false negatives (FN) for our model were 30,499 and 129,937, respectively, which are competitive with the other top-performing models. Our model’s recall (Rcll) of 74.9 and precision (Prcn) of 92.7 also indicate a strong balance between sensitivity and specificity. The visual results are shown in Figure 3.
The framework also achieved good results on the MOT17 private detections benchmark (Table 7), reaching the highest MOTA of 82.8, slightly surpassing NvMOT_DS2305 [38] at 82.7 and BrinqTraq_v2 [37] at 82.3. The IDF1 score for our model was 79.0, competitive with BrinqTraq_v2 at 80.6 and NvMOT_DS2305 at 79.2. With a HOTA score of 64.8, our model demonstrated robust tracking abilities. Our model’s FP and FN were 27,935 and 67,282, respectively, showing effective performance in reducing false detections and misses. Recall and precision were balanced at 88.1 and 94.7, respectively, outperforming most competitors.
Concerning the MOT20 private detections benchmark (Table 8), the framework achieved a MOTA of 79.7, marginally higher than BrinqTraq_v2 [37] at 79.5 and significantly better than SuppTrack [44] at 78.2. The IDF1 score was 77.7, with BrinqTraq_v2 scoring 77.4. Our HOTA score of 63.4 was the highest, indicating robust tracking performance. The FP and FN were 26,539 and 77,222, respectively, showing strong detection capabilities. Recall was 85.1, and precision was 94.3, both indicating a high level of accuracy and reliability.
For the VOTS2023 benchmark, as shown in Table 9, the framework achieved a quality (Q) score of 0.65, slightly outperforming DMAOT [12] at 0.64. Accuracy (Acc) was 0.77, similar to HQTrack [13]. Robustness (Rob) was 0.80, matching the top-performing trackers. The not-reported error (NRE) and drift-rate error (DRE) were 0.14 and 0.06, respectively, indicating strong stability and minimal drift. The absence-detection quality (ADQ) of 0.76 was the highest among the compared models, showing effective handling of target absence. Visual results are shown in Figure 4.
6. Conclusions
The proposed framework optimizes the fusion of multiple trackers by effectively exploiting their complementary strengths, combining the best features of individual trackers to significantly enhance overall tracking performance. In this study, we focused on integrating two trackers and achieved promising results, consistently outperforming the individual trackers.
Our approach leverages the unique properties of each tracker, such as long-term robustness and the ability to work with masks or bounding boxes, effectively handling both single and multiple objects. Additionally, the ensemble method developed can potentially be parallelized and executed in a single run, which could lead to further improvements in runtime efficiency.
However, there are several limitations to our current framework. First, incorporating more trackers into the fusion process could further enhance performance by taking advantage of a wider range of complementary strengths. Furthermore, while we employed a support vector machine (SVM) in this study, exploring alternative and potentially more efficient learners, such as deep learning-based classifiers or gradient boosting methods, could lead to better optimization and tracking accuracy.
In addition, exploring a broader set of features could improve model performance. While we have focused on a certain set of attributes in this study, investigating additional features, such as object motion patterns, appearance descriptors, or temporal information, may yield higher accuracy and robustness.
Moreover, testing our method on a broader set of benchmark datasets could provide more comprehensive validation of its performance across diverse scenarios, helping to assess its robustness in real-world applications. These extensions will be addressed in future work, and we are excited to explore these avenues to further advance the field of multi-object tracking.
This study demonstrates that combining trackers is a powerful strategy for improving tracking performance and opens up new possibilities for future research and development in this field.