Article

ReTrackVLM: Transformer-Enhanced Multi-Object Tracking with Cross-Modal Embeddings and Zero-Shot Re-Identification Integration

by
Ertugrul Bayraktar
Department of Mechatronics Engineering, Yildiz Technical University, 34349 Istanbul, Turkey
Appl. Sci. 2025, 15(4), 1907; https://doi.org/10.3390/app15041907
Submission received: 14 December 2024 / Revised: 15 January 2025 / Accepted: 27 January 2025 / Published: 12 February 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Multi-object tracking (MOT) is an important task in computer vision, particularly in complex, dynamic environments with crowded scenes and frequent occlusions. Traditional tracking methods often suffer from identity switches (IDSws) and fragmented tracks (FMs), which limit their ability to maintain consistent object trajectories. In this paper, we present a novel framework, called ReTrackVLM, that integrates multimodal embeddings from a visual language model (VLM) with a zero-shot re-identification (ReID) module to enhance tracking accuracy and robustness. ReTrackVLM leverages the rich semantic information from VLMs to distinguish objects more effectively, even under challenging conditions, while the zero-shot ReID mechanism enables robust identity matching without additional training. The system also includes a motion prediction module, powered by Kalman filtering, to handle object occlusions and abrupt movements. We evaluated ReTrackVLM on several widely used MOT benchmarks, including MOT15, MOT16, MOT17, MOT20, and DanceTrack. Our approach achieves state-of-the-art results, with an improvement of 1.5% in MOTA and a reduction of 10.3% in IDSws compared to existing methods. ReTrackVLM also excels in tracking precision, recording a precision of 91.7% on MOT17. However, in extremely dense scenes, the framework faces challenges with slight increases in IDSws. Despite the computational overhead of using VLMs, ReTrackVLM demonstrates the ability to track objects effectively in diverse scenarios.

1. Introduction

Multi-object tracking is a fundamental problem in computer vision, with widespread applications ranging from autonomous driving and surveillance systems to robotics and sports analytics. The ability to accurately track multiple objects simultaneously in a dynamic environment is crucial to understanding complex scenes, making real-time decisions, and ensuring the safety and efficiency of systems that rely on visual input. In real-world scenarios, MOT presents several challenges, including occlusions, changes in object appearance, interactions between objects, and variations in lighting and viewpoint. These challenges are exacerbated by the need to maintain consistent identities of objects over time, even when objects leave the field of view or are temporarily obstructed. Traditional approaches to MOT often struggle with these issues, leading to identity changes, track fragmentation, and inaccuracies in localization. Moreover, the diversity of environments and the complexity of human behaviors necessitate robust tracking systems that can be generalized across different domains. As a result, the development of effective MOT algorithms has become a critical area of research aimed at bridging the gap between theoretical advances and practical deployment in real-world systems. The introduction of large-scale, diverse datasets such as MOT15 [1], MOT16/17 [2], MOT20 [3], DanceTrack [4], and WildTrack [5] has spurred significant progress in this field, enabling the training and evaluation of more sophisticated models. However, despite these advancements, achieving reliable and scalable multi-object tracking remains a formidable challenge, requiring innovative solutions that can adapt to the complexities of real-world scenarios.
The choice of datasets is crucial for developing and evaluating MOT algorithms, providing diverse benchmarks that reflect real-world complexities. The MOT Challenge series (MOT15, MOT16, MOT17, MOT20) is central to MOT research, capturing urban scenes with varying pedestrian densities and challenges like occlusions, camera motion, and lighting. While MOT15 laid the foundation, later versions introduced higher-quality annotations and more complex scenarios. DanceTrack tests algorithms on fast, intertwined movements in dance sequences, emphasizing identity maintenance in dynamic situations. WildTrack adds complexity with its multi-camera setup, which is essential for wide-area coverage and testing scalability in applications like sports analytics and surveillance.
The diverse and challenging environments captured in the MOT15, MOT16, MOT17, MOT20, DanceTrack, and WildTrack datasets are illustrated in Figure 1, further highlighting the importance of using diverse datasets to evaluate the performance of MOT algorithms since they encompass a broad spectrum of challenges, from standard urban environments to extreme conditions.
They serve as invaluable tools for researchers to develop and validate MOT algorithms that are not only effective in controlled settings but are also robust and generalizable to real-world applications.
Despite notable advancements in MOT, current methods still face limitations in real-world scenarios. A key issue is IDSws, where trackers fail to maintain object identities, especially in crowded or cluttered settings due to occlusions or objects leaving and re-entering the field of view. Conventional approaches, relying on motion models and appearance-based ReID, struggle with significant appearance changes or unpredictable motions. Many methods also rely on handcrafted features or shallow models that lack robustness and generalization, requiring fine-tuning across datasets. The increasing scale of data and demand for real-time processing highlight the need for more efficient algorithms, but the high computational complexity limits their deployment in real-time systems. Additionally, many MOT frameworks under-use rich contextual information like keypoints and descriptors, leading to suboptimal performance and challenges with track consistency and false positives. Addressing these limitations requires advanced embedding techniques, improved use of contextual data, and more scalable algorithms to enhance performance across diverse applications. Figure 2 illustrates these issues, showing tracking failures due to IDSws, occlusions, and inefficiencies in current systems.
In the first frame (t), three individuals (ID1, ID2, and ID3) are correctly detected. In the next frame (t + 1), ID1 and ID3 are still tracked (TP), but ID2 is occluded or exits, causing a false negative (FN). A new individual (ID4) enters but is mistakenly identified as ID2, leading to an IDSw. By the third frame (t + 2), ID1 is missed (FN), ID3 remains correctly tracked (TP), and ID2 reappears but is confused with ID5, causing another IDSw. Additionally, ID5 is falsely introduced (FP), while ID4 is correctly tracked, reflecting a true negative (TN). This highlights how IDSws, FNs, and FPs degrade tracking, especially in dynamic, crowded scenes.
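The effect of these error types on a summary score can be made concrete with the MOTA metric reported in this paper. The following minimal sketch uses the standard CLEAR MOT definition, MOTA = 1 − Σ(FN + FP + IDSw) / Σ GT. The per-frame counts are hypothetical values loosely modeled on the three frames described above (three ground-truth objects per frame is an assumption), not figures taken from the experiments.

```python
def mota(frames):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSw) / GT, summed over frames.

    `frames` is a list of dicts with keys "gt", "fn", "fp", "idsw".
    """
    total_gt = sum(f["gt"] for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    return 1.0 - errors / total_gt


# Hypothetical counts, assuming three ground-truth objects in every frame.
example_frames = [
    {"gt": 3, "fn": 0, "fp": 0, "idsw": 0},  # frame t
    {"gt": 3, "fn": 1, "fp": 0, "idsw": 1},  # frame t + 1
    {"gt": 3, "fn": 1, "fp": 1, "idsw": 1},  # frame t + 2
]
score = mota(example_frames)  # 1 - 5/9
```

With these counts, five errors over nine ground-truth objects give a MOTA of roughly 0.44, which illustrates how a handful of IDSws and missed detections in a short sequence already removes more than half of the score.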
Traditional MOT methods often suffer from IDSws and FM, which degrade tracking accuracy and continuity, particularly in crowded and complex environments. ReTrackVLM aims to tackle these challenges by leveraging the advanced detection capabilities of YOLOv8, the robust feature representation of VLMs, and the zero-shot ReID capability for associating tracks across frames. By integrating these components into a unified pipeline, the proposed method reduces IDSws and FM, offering a robust and scalable solution for MOT. The embedding of ReTrackVLM, which is generated using XFeat [6] and LightGlue [7], encapsulates spatial and appearance features essential for accurate tracking. Integrated with fine-tuned VLMs like Detectron2 (DETR2) [8], Florence-2 [9], and CLIP [10], ReTrackVLM evaluates whether a query bbx belongs to an existing track, improving accuracy by reducing IDSw. The system’s track management module, combined with confident track storage and a zero-shot re-identification module, ensures consistent long-term tracking, even when objects reappear after occlusion. ReTrackVLM delivers a robust, scalable tracking system for real-world applications. We can summarize the main contributions of this work as follows:
  • Novel Embedding Structure: We propose a novel embedding structure that combines bbxes, keypoints, descriptors, overlap ratios, and confidence scores, providing a comprehensive representation of detected objects in MOT tasks.
  • Integration with VLMs: ReTrackVLM leverages fine-tuned VLMs, such as DETR2, Florence-2, CLIP, and OpenCLIP, to improve the accuracy and robustness of object tracking by determining the association between bbxes and existing tracks.
  • Enhanced Track Management: We introduce a track management module that utilizes a confident track storage system, ensuring consistent track maintenance and reducing the occurrence of IDSws in challenging scenarios.
  • Zero-Shot ReID: A zero-shot ReID module is incorporated, allowing the system to match newly detected tracks with previously established ones without requiring extensive retraining, thereby enhancing the adaptability of the system to varying environments.
  • Extensive Evaluation on Diverse Datasets: We validate our approach through extensive experiments on diverse datasets, including MOT15, MOT16, MOT17, DanceTrack, and WildTrack, demonstrating its superior performance compared to existing state-of-the-art methods.
The remainder of this paper is organized as follows. In Section 1, we review related studies in the field of MOT, highlighting both traditional approaches and recent advancements. Section 2 details the proposed methodology, including our embedding structure and its integration with fine-tuned VLMs for robust object tracking. We also elaborate on the track management module and the zero-shot re-identification system, which collectively enhance tracking consistency and accuracy. In Section 3, we describe the datasets used for training and evaluation, including MOT15, MOT16, MOT17, DanceTrack, and WildTrack, and discuss the experimental setup. Section 4 presents our results, comparing the performance of our approach against state-of-the-art methods across these diverse datasets. Finally, Section 5 concludes the paper with a summary of our findings, potential applications of our work, and directions for future research.

2. Related Works

Object occlusion poses a significant challenge in multi-object tracking (MOT), often causing mismatches when occluded objects reappear. Consistently maintaining object IDs, even amidst varying occlusion scenarios, is essential for tracking accuracy. Many solutions have been proposed to enhance robustness and accuracy in MOT, with our methodology integrating advanced techniques targeting occlusion complexities. Key challenges include the following: (i) Temporary object occlusions disrupt the tracking process, making it difficult to maintain accurate tracks as objects obscure each other. (ii) Changes in illumination, pose, or scale result in significant appearance variations, requiring robust algorithms capable of adapting to these changes. (iii) Real-time processing is essential, as tracking multiple objects in high-resolution video streams demands computational efficiency. (iv) In dense environments with closely interacting objects, maintaining track identity becomes particularly challenging.
To address these, various approaches have emerged. For example, a dynamic representation-based tracker [11] uses an adaptive representation network and pose supervision for long-term tracking with occlusions. The SiamFEA tracker [12] combines visible and infrared modalities using self-attention mechanisms. iReIDNet [13] enhances person ReID through spatial feature transforms and coordinate attention. A transformer-based dual-branch model [14] improves performance via global–local feature interaction, while contextual relation networks [15] tackle similar local feature issues.
Advancements also include deep convolutional architectures [16], color descriptor-based ReID methods [17], hierarchical clustering frameworks [18], graph-based approaches [19], and recurrent neural networks [20] to enhance data association and occlusion handling.
Recent work on feature embedding aims to reduce dimensionality while retaining key characteristics, with strategies like feature combination from multiple DCNNs [21] and supervised embedding methods [22] improving classification and metric learning. The integration of these techniques into tracking systems [23] has enhanced MOT performance through short tracklets and tracklet–plane matching. Furthermore, structured prediction optimizes feature embedding [24]. Although models like ReDeformTR [25] effectively track animals across cameras, they lack the versatility of ReTrackVLM, which tracks various objects without complex feature fusion. Additionally, YOLOv5+DeepSORT [26] enhances underwater tracking accuracy for marine creatures, while [27] employs YOLOv5 with recurrent networks for real-time object tracking in challenging conditions. Furthermore, a triplet-based MOT method exploiting an attention-based ReID module is presented in [28] to enhance object association, particularly in challenging occlusion scenarios. The method proposed in [29], which is based on grayscale spatial–temporal features, aims to improve tracking speed and efficiency, especially on devices with limited computing power. It uses a grayscale mapping technique to acquire spatial–temporal features, allowing direct target localization in previous frames and reducing the computational burden associated with ID matching.
In multi-camera tracking, the end-to-end approach in [30] utilizes probabilistic association and detection embeddings to manage scenarios effectively, though it struggles with complex occlusions. The semi-online tracking refinement method in [31] corrects IDSws by monitoring appearance similarity changes over time. Similarly, the multi-class tracking approach in [32] achieves predictable execution times by class-splitting the Hungarian matrix, though this may sacrifice accuracy in dense scenarios. A framework in [33] employs weakly supervised multi-object tracking and segmentation for improved mask consistency, but our method enhances this with a novel embedding structure and track management module.
The online multi-object tracker in [34] combines various appearance features with a ReID network to reduce IDSws, while our method leverages visual language models (VLMs) and advanced track management for superior IDSw and occlusion handling. The denoising diffusion strategy in [35] enhances tracking by jointly detecting and associating objects, yet ReTrackVLM offers a more comprehensive solution with its track management module and zero-shot ReID system. The approach in [36] uses weak cues alongside strong spatial and appearance information, whereas our method ensures better generalization through hybrid VLMs and detailed embeddings. The SMILETrack method in [37] combines an efficient detector with a Siamese network for similarity learning; our approach surpasses it with a more advanced embedding structure and track management module. AMDDATrack in [38] is designed to improve tracking accuracy by addressing trajectory breaks caused by dropped low-scoring detection frames utilizing an improved CenterNet [39] detection network incorporating a feature pyramid network, a high-resolution feature map, and a spatial attention mechanism to enhance detection accuracy. Similarly, REACTrack, proposed in [40], also employs CenterNet, but with a focus on enhancing ReID robustness and correcting tracking association errors, especially in complex scenarios like occlusion. The work in [41] integrates temporal and spatial features using a Kalman filter and Hungarian algorithm for tracking in self-driving cars, but struggles with similar-looking or closely blocked objects. Fast re-OBJ [42] improves performance by tightly coupling instance segmentation and embedding generation for a more discriminative representation.
The sports-focused Deep-EIoU method [43] replaces the Kalman filter with an iterative scale-up approach, while our work broadens applicability through fine-tuning on diverse datasets and a novel ReID module. The MG-MOT algorithm in [44] integrates UAV metadata for maritime ReID, demonstrating strong performance, but our approach generalizes better across various environments.
ReTrackVLM distinguishes itself by integrating VLMs and a novel embedding structure specifically designed for MOT in complex environments characterized by severe occlusions. Recent research supports the effectiveness of this approach, as shown in [45,46], which introduce robust MOT methods for sports scenarios and real-time solutions, respectively, emphasizing the importance of the embedding strategies and track management that are central to our methodology.
Additionally, methods like OR-SORT [47] and OneTracker [48] address challenges related to camera motion and multi-modality through a Foundation Tracker, highlighting the potential of VLMs and advanced embeddings in MOT. Similarly, refs. [49,50] propose innovative MOT paradigms incorporating natural language descriptions and multi-modality tracking. Our method builds upon these advancements by refining embedding and track management techniques specifically tailored for human tracking across diverse datasets.
Furthermore, the perspective disentanglement framework for ReID in [51] aligns with our focus on robust embedding strategies for track management. Cross-domain ReID methods [52] highlight ongoing efforts to improve domain adaptation in MOT. While other MOT methods have incorporated VLMs, ReTrackVLM distinguishes itself through its integration of zero-shot ReID for track association and its enhanced track management module. Unlike existing methods that rely solely on fine-tuned embeddings or pre-trained models, ReTrackVLM effectively utilizes cross-modal embeddings to improve robustness in challenging scenarios, such as occlusion and crowded scenes. The novelty lies in its ability to seamlessly integrate these components into a unified framework, which is validated across diverse datasets.

3. Methodology

The proposed MOT flow diagram in Figure 3 addresses challenges in real-world tracking scenarios. It begins with a robust object detection phase using a fine-tuned model to accurately identify and localize humans in images or video streams, generating precise bbx coordinates. Next, a feature extraction process employs XFeat and LightGlue to derive keypoints, descriptors, and confidence scores from detected bbxes. These features are input into a VLM that assesses whether a detected human matches an existing track, producing a Boolean response. The track management module then handles motion prediction and data association, ensuring accurate track continuity across frames despite challenges such as occlusions and sudden motion changes. To enhance reliability, the system incorporates confident track storage, archiving established tracks to reduce IDSws and improve long-term tracking fidelity. Finally, the ReID module uses stored information to reassociate newly detected humans with existing tracks, ensuring robust ID preservation and minimizing errors in dynamic environments. This integrated flow diagram not only overcomes the limitations of traditional MOT systems but also offers a scalable and efficient solution for diverse and challenging conditions, making it well suited to modern large-scale applications.
ReTrackVLM is designed to overcome the persistent challenges encountered in real-world tracking scenarios, integrating VLMs and a novel embedding structure. As can be seen in Figure 4, the operation procedure of ReTrackVLM starts with fine-tuned YOLOv8 [53] detecting pedestrians in input images, followed by feature extraction using XFeat and LightGlue, which generate robust descriptors and keypoints for each bbx. These features are processed through an image encoder and text embedder, then fused using a fine-tuned transformer to predict motion and manage data association. The track management module ensures accurate tracking by predicting movements and refining appearance descriptors. High-confidence tracks are stored in a dedicated database for ReID, allowing the system to maintain track continuity even in challenging scenarios such as occlusions and complex motions. This combination of modules improves the accuracy of the tracking and reduces IDSws across various datasets.
The ReTrackVLM pipeline, which is given in Algorithm 1, processes video frames for multi-object tracking through object detection, track prediction, and re-identification. It starts by initializing an empty track set and confident track storage (CTS) for managing tracks across frames. For each frame, objects are detected using a pre-trained YOLOv8, with keypoints, descriptors, and confidence scores forming embeddings. A Kalman filter predicts object locations based on previous tracks, and detected objects are associated with these predictions using IoU and appearance similarity scores. The Hungarian algorithm handles associations, updating matched tracks, initializing new ones, and maintaining unmatched tracks.
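As an illustration of the motion prediction step, the following minimal sketch implements the predict stage of a constant-velocity Kalman filter in plain Python. The state layout [cx, cy, vx, vy] (box center and velocity), the time step, and the process-noise value are assumptions made for the example; the state vector and noise settings actually used in ReTrackVLM are not specified here, and the measurement update is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def kalman_predict(state, cov, dt=1.0, process_noise=1e-2):
    """Predict step of a constant-velocity Kalman filter.

    state: [cx, cy, vx, vy] (assumed layout: box center and velocity)
    cov:   4 x 4 state covariance matrix
    """
    F = [[1.0, 0.0, dt, 0.0],
         [0.0, 1.0, 0.0, dt],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    # x' = F x
    new_state = [sum(F[i][k] * state[k] for k in range(4)) for i in range(4)]
    # P' = F P F^T + Q, with Q = process_noise * I (hypothetical choice)
    fpf = matmul(matmul(F, cov), transpose(F))
    new_cov = [[fpf[i][j] + (process_noise if i == j else 0.0)
                for j in range(4)] for i in range(4)]
    return new_state, new_cov


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
pred_state, pred_cov = kalman_predict([100.0, 50.0, 2.0, -1.0], identity)
```

In this example, a track centered at (100, 50) moving by (2, −1) pixels per frame is predicted at (102, 49), and the position uncertainty grows because the velocity uncertainty is propagated into it.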
Unmatched tracks are further processed using zero-shot ReID, querying the CTS for identity matches without retraining. High-confidence tracks are added to CTS for future use. After processing all frames, low-confidence tracks are removed, and trajectories are smoothed for output. The final tracks are saved in a format compatible with the MOT Challenge for evaluation.
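The CTS query can be pictured as a nearest-neighbor lookup over stored track embeddings. The sketch below is a simplified stand-in under stated assumptions: the CTS is modeled as a dictionary from track ID to a single embedding vector, cosine similarity is used as the similarity score, and the 0.8 threshold is a hypothetical value. In ReTrackVLM the match decision is produced by the VLM, so this code only illustrates the control flow.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def query_cts(query_embedding, cts, threshold=0.8):
    """Return the stored track ID most similar to the query, or None.

    cts: dict mapping track ID -> stored embedding (assumed structure).
    threshold: minimum similarity for a re-identification (hypothetical).
    """
    best_id, best_score = None, threshold
    for track_id, stored in cts.items():
        score = cosine(query_embedding, stored)
        if score >= best_score:
            best_id, best_score = track_id, score
    return best_id


cts = {7: [0.9, 0.1, 0.0], 12: [0.0, 1.0, 0.2]}
match = query_cts([0.8, 0.2, 0.0], cts)      # close to stored track 7
no_match = query_cts([0.0, 0.0, 1.0], cts)   # nothing similar enough
```

When no stored track exceeds the threshold, the function returns None and the detection would keep its newly assigned identity.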
Algorithm 1 The algorithm for the complete ReTrackVLM framework.
Input: Video frames $F = \{f_1, f_2, \ldots, f_T\}$
Output: Object tracks $T = \{t_1, t_2, \ldots, t_N\}$
 1: Initialize:
 2:     $T_0 \leftarrow \emptyset$                ▹ Initial tracks
 3:     $CTS \leftarrow \emptyset$                ▹ Confident Track Storage
 4: for each frame $f_t \in F$ do
 5:     Object Detection:
 6:         Apply YOLOv8 detector to $f_t$.
 7:         Extract bounding boxes $\text{bbxes}_t = \{b_1, b_2, \ldots, b_M\}$.
 8:     for each bounding box $b_i$ do
 9:         Extract keypoints and descriptors using XFeat.
10:         Compute overlap ratios and confidence scores.
11:         Construct embeddings $e_i = \{b_i, k_i, d_i, o_i, c_i\}$.
12:     end for
13:     Track Prediction:
14:         Predict positions of existing tracks $T_{t-1}$ using a Kalman filter.
15:         Generate predicted states $P = \{p_1, p_2, \ldots, p_N\}$.
16:     Association:
17:         Compute association scores between $\text{bbxes}_t$ and $P$ using:
18:           (a) IoU for spatial alignment.
19:           (b) Appearance similarity from VLM embeddings.
20:         Solve the association problem using the Hungarian algorithm.
21:     Track Update:
22:         Update $T_t$ based on associations:
23:           - Matched bounding boxes are assigned to tracks.
24:           - Unmatched bounding boxes initialize new tracks.
25:           - Unmatched tracks are updated with predicted states.
26:     Zero-Shot ReID:
27:         Query $CTS$ for unmatched tracks:
28:           - Compare embeddings of unmatched tracks with $CTS$.
29:           - Re-identify tracks based on similarity scores.
30:           - Update track identities if a match is found.
31:     Confident Track Storage Update:
32:         Add high-confidence tracks $t$ to $CTS$:
33:           - Tracks with stable identities over multiple frames are stored.
34: end for
35: Post-Processing:
36:     Remove fragmented or low-confidence tracks.
37:     Smooth track trajectories for visualization.
38: Output Results:
39:     Save tracks $T$ in MOT-compliant format for evaluation.
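The association and track update steps of Algorithm 1 can be sketched as follows. This is an illustrative simplification rather than the implementation used in the experiments: the blending weight alpha and the acceptance threshold min_score are hypothetical, the appearance similarities are passed in as a precomputed matrix, and an exhaustive search over assignments stands in for the Hungarian algorithm (it returns the same optimal assignment but is only practical for a handful of objects). The sketch also assumes that there are at least as many detections as predicted tracks.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes in [x_min, y_min, w, h] format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def associate(pred_boxes, det_boxes, appearance, alpha=0.5, min_score=0.3):
    """Match predicted tracks to detections by maximizing a blended score.

    appearance[i][j]: similarity in [0, 1] between track i and detection j.
    alpha, min_score: hypothetical weighting and acceptance threshold.
    Assumes len(pred_boxes) <= len(det_boxes).
    """
    m = len(det_boxes)
    score = [[alpha * iou(p, d) + (1 - alpha) * appearance[i][j]
              for j, d in enumerate(det_boxes)]
             for i, p in enumerate(pred_boxes)]
    # Exhaustive search over assignments (stand-in for the Hungarian algorithm).
    best = max(permutations(range(m), len(pred_boxes)),
               key=lambda perm: sum(score[i][j] for i, j in enumerate(perm)))
    matches = [(i, j) for i, j in enumerate(best) if score[i][j] >= min_score]
    matched = {j for _, j in matches}
    new_tracks = [j for j in range(m) if j not in matched]
    return matches, new_tracks


preds = [[10, 10, 20, 40], [100, 20, 20, 40]]
dets = [[102, 21, 20, 40], [11, 12, 20, 40], [300, 300, 20, 40]]
app = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.3]]
matches, new_tracks = associate(preds, dets, app)
```

Here track 0 is matched to detection 1 and track 1 to detection 0, while detection 2, which overlaps no prediction and has low appearance similarity, initializes a new track.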

3.1. Data Preprocessing and Input Representation

We utilize several well-known datasets in the MOT domain, including MOT15, MOT16/17 (which contains refined annotations), MOT20, DanceTrack, and WildTrack. Additionally, we incorporated CrowdHuman [54], which focuses on extremely crowded scenes with severe occlusions and dense pedestrian groupings, and is crucial for applications like public safety and surveillance. Each dataset offers a rich set of annotated frames focusing on pedestrian detection, which are preprocessed to ensure consistency and compatibility with the YOLOv8 framework and subsequent ReID processes. The preprocessing steps involve the following key operations:
  • Resizing: All input images are resized to a uniform size of 640 × 640 pixels to match the input requirements of the YOLOv8 model. This resizing ensures that the aspect ratios of the pedestrians are preserved, minimizing distortion.
  • Normalization: Pixel values are normalized to the range [0, 1] to facilitate faster convergence during model training. This normalization is performed using the mean and standard deviation of the ImageNet dataset [55], which align with the pre-trained weights used in the model.
  • Data Augmentation: Data augmentation techniques such as horizontal flipping, random cropping, rotation, scaling, and color jittering are applied to enhance the model generalization capability. The augmentation parameters are carefully selected to maintain the integrity of the bbxes and the corresponding pedestrian identities.
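The resizing and normalization steps can be sketched as below. Two points are assumptions made for the example: aspect ratio is preserved by letterboxing (scaling the longer side to 640 pixels and padding the remainder), and normalization is applied as scaling to [0, 1] followed by standardization with the commonly used ImageNet channel means and standard deviations. The helper names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def letterbox_params(width, height, target=640):
    """Scale and padding that fit an image into target x target undistorted."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return scale, (new_w, new_h), (pad_x, pad_y)


def map_box(box, scale, pad):
    """Move an [x_min, y_min, w, h] box into letterboxed coordinates."""
    x, y, w, h = box
    return [x * scale + pad[0], y * scale + pad[1], w * scale, h * scale]


def normalize_pixel(rgb):
    """Map 8-bit RGB to [0, 1], then standardize with ImageNet statistics."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


scale, size, pad = letterbox_params(1920, 1080)
box_640 = map_box([960, 540, 96, 192], scale, pad)
pixel = normalize_pixel((255, 128, 0))
```

For a 1920 × 1080 frame, the image content occupies 640 × 360 pixels with 140 rows of padding above and below, and every bbx is scaled and shifted by the same parameters so that annotations stay aligned with the resized image.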
We used the original training and validation splits provided with each dataset, with the training set used to optimize the model weights and the validation set used to monitor performance and prevent overfitting. Let $B = \{b_1, b_2, b_3, \ldots, b_N\}$ denote the set of bbxes in an image, where each bbx $b_i$ is represented as $b_i = [x_i^{\min}, y_i^{\min}, w, h]$. Here, $x_i^{\min}$ and $y_i^{\min}$ are the coordinates of the top-left corner of the bbx, and $w$ and $h$ are the width and height of the bbx, respectively. The content of the bbxes is further processed to extract features using the XFeat and LightGlue methods, which are critical for cross-modal embeddings and ReID tasks.
The YOLOv8 model is fine-tuned on the preprocessed datasets to optimize pedestrian detection performance by following the fine-tuning procedure explained in [56]. The training process involves the following steps:
  • [1] Model Initialization: The YOLOv8 model is initialized with pre-trained weights from the COCO dataset [57]. The model is designed to detect pedestrians with high accuracy and is trained by minimizing the loss in Equation (1):
    $$L = L_{bbx} + L_{conf} + L_{class} \tag{1}$$
    where $L_{bbx}$ is the regression loss of the bbx, $L_{conf}$ is the confidence loss, and $L_{class}$ is the classification loss.
  • [2] Training Configuration: The model is trained using a custom configuration defined in a YAML file, specifying training parameters such as batch size, learning rate, augmentation settings, and the number of epochs.
  • [3] Training Procedure: The model is trained for 100 epochs with early stopping based on validation performance. The multi-scale training strategy is employed, where input images are randomly resized during training to improve the robustness of the model to varying scales.
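The training procedure above combines two generic mechanisms, multi-scale input sampling and early stopping, which the following sketch illustrates in isolation. The scale range (0.5 to 1.5 times the base size), the stride of 32, and the patience value are hypothetical choices for the example, and the validation scores are made-up numbers; none of them are taken from the training configuration of the paper.

```python
import random


def pick_train_size(base=640, low=0.5, high=1.5, stride=32, rng=random):
    """Draw a random square input size, rounded to a multiple of the stride."""
    size = rng.uniform(low, high) * base
    return max(stride, int(round(size / stride)) * stride)


def early_stopping(val_scores, max_epochs=100, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Training stops once the score has not improved for `patience` epochs.
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, min(len(val_scores), max_epochs) - 1


# Made-up validation scores: the best value is reached at epoch 2.
scores = [0.50, 0.60, 0.65, 0.64, 0.63, 0.65, 0.62]
best_epoch, stop_epoch = early_stopping(scores, patience=3)
train_size = pick_train_size(rng=random.Random(0))
```

With a patience of three epochs, training in this example stops at epoch 5 and the weights from epoch 2 would be kept.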
The fine-tuned YOLOv8 model serves as the backbone for the tracking pipeline, providing accurate detections that are fed into the cross-modal embeddings and track management modules.
In the feature extraction stage, we use XFeat and LightGlue to obtain robust features from the bbxes. These features are employed as inputs to the VLMs and play an important role in track management and ReID. XFeat captures deep semantic features, while LightGlue focuses on point-based features that are useful for matching and association tasks. The extracted features, denoted as $F_i$ for each bbx, form a high-dimensional vector representing the appearance and spatial characteristics of the detected pedestrian. LightGlue is a feature matching technique that facilitates the alignment of bbxes across consecutive frames. It computes descriptors that are used to associate detections between frames, thus enabling accurate tracking. Let $D_i$ represent the descriptor vector extracted by LightGlue for bbx $b_i$. The similarity between descriptors from different frames is calculated using a distance metric, typically the cosine similarity given in Equation (2):
$$\mathrm{sim}(D_i, D_j) = \frac{D_i \cdot D_j}{\lVert D_i \rVert \, \lVert D_j \rVert} \tag{2}$$
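Equation (2) translates directly into code. The sketch below computes the cosine similarity between two descriptor vectors and a full similarity matrix between the descriptors of two frames. Returning 0.0 for an all-zero descriptor is a convention assumed for the example, since the ratio is undefined in that case.

```python
import math


def cosine_similarity(d_i, d_j):
    """Cosine similarity of two descriptor vectors, as in Equation (2)."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumed convention for zero descriptors
    return dot / (norm_i * norm_j)


def similarity_matrix(descs_a, descs_b):
    """Pairwise similarities between descriptors of two frames."""
    return [[cosine_similarity(a, b) for b in descs_b] for a in descs_a]


frame_a = [[1.0, 0.0], [1.0, 1.0]]
frame_b = [[1.0, 0.0], [0.0, 1.0]]
sims = similarity_matrix(frame_a, frame_b)
```

Identical directions give a similarity of 1, orthogonal descriptors give 0, and the descriptor [1, 1] is equally similar (about 0.707) to both descriptors of the second frame.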
The keypoint (kp) detection structure of XFeat is designed to achieve a balance between accuracy and computational efficiency, making it suitable for deployment on hardware-constrained devices such as mobile robots and embedded systems. The kp detection is handled by a dedicated branch within the network, a design choice that deviates from traditional approaches where kp detection and descriptor extraction are typically coupled. This decoupling allows XFeat to independently optimize each task, resulting in faster and more accurate kp detection. The kp detection branch in XFeat processes the input image by first transforming it into a grid structure, with each grid cell representing an 8 × 8 pixel region. The image is then reshaped into a 64-dimensional feature vector for each grid cell, preserving spatial granularity. A series of rapid 1 × 1 convolutions are applied to this representation to regress the kp coordinates efficiently. The final output of this branch is a kp embedding as given in Equation (3):
$$K \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times (64 + 1)} \tag{3}$$
where each entry $k_{ij} \in \mathbb{R}^{65}$ is classified into 1 of 64 possible positions within its corresponding cell, with an additional “dustbin” option for cases where no kp is detected. The dustbin is discarded during inference, and the remaining heatmap is interpreted as an 8 × 8 cell.
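The cell-wise classification in Equation (3) can be decoded into pixel coordinates as sketched below. This is a simplified hard-argmax decoding written for illustration: it assumes that indices 0 to 63 enumerate the positions inside a cell in row-major order and that index 64 is the dustbin, and it skips the thresholding that a heatmap-based decoder would normally apply after discarding the dustbin.

```python
def decode_keypoints(logits, cell=8):
    """Decode per-cell 65-way logits into pixel keypoints (x, y).

    logits[r][c] holds 65 scores for grid cell (r, c): indices 0..63 are
    positions inside the 8 x 8 cell (row-major order is an assumption),
    and index 64 is the dustbin, meaning "no keypoint in this cell".
    """
    dustbin = cell * cell
    keypoints = []
    for r, row in enumerate(logits):
        for c, scores in enumerate(row):
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == dustbin:
                continue  # no keypoint detected in this cell
            dy, dx = divmod(best, cell)
            keypoints.append((c * cell + dx, r * cell + dy))
    return keypoints


def one_hot(index, size=65):
    """Toy logits with a single dominant class."""
    return [5.0 if k == index else 0.0 for k in range(size)]


# A 2 x 2 grid of cells, i.e., a 16 x 16 pixel image.
grid = [[one_hot(10), one_hot(64)],
        [one_hot(63), one_hot(0)]]
kps = decode_keypoints(grid)
```

In the toy grid, the top-right cell selects the dustbin and therefore contributes no keypoint, while the other three cells each yield one pixel location.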
The descriptor extraction process in XFeat focuses on generating a dense feature map $F$ with compact 64-dimensional descriptors. This map is built using a multi-scale feature merging strategy, which enhances the robustness of the network to variations in viewpoint and illumination, both of which are critical for real-world applications such as mobile robotics. The dense feature map is given in Equation (4):
$$F \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 64} \tag{4}$$
This is obtained by merging features from different scales, $\{1/8, 1/16, 1/32\}$, of the image. The merging process involves bilinear upsampling of intermediate representations to match the resolution of the final map, followed by an element-wise summation. This strategy leverages the benefits of feature pyramids to increase the receptive field of the network while maintaining the compactness of the descriptors. A convolutional fusion block, consisting of three basic layers, combines these representations into the final feature map. An additional convolutional block is employed to generate the reliability map given in Equation (5):
$$R \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8}} \tag{5}$$
which models the unconditional probability $R_{i,j}$ that a given local feature $F_{i,j}$ can be confidently matched. This map plays a crucial role in filtering out unreliable features, further enhancing the accuracy of the matching process. For dense matching, XFeat introduces a lightweight module that enables semi-dense matching while controlling memory and computational footprints. This module selects the top-$K$ image regions based on their reliability scores $R_{i,j}$ and caches them for future matching. The matching process employs a simple Multi-Layer Perceptron (MLP) for coarse-to-fine matching, avoiding the need for high-resolution feature maps. Given the dense feature map $\mathbf{F}$ or its subset $\mathbf{F}_s \subseteq \mathbf{F}$, the MLP predicts pixel-level offsets $\mathbf{o}$ between matching features from an image pair $(I_1, I_2)$. The prediction of the offsets $\mathbf{o}$ is conditioned on the matched feature pair $(f_a, f_b)$ and is formulated as in Equation (6):
$$(x, y) = \arg\max_{i \in \{1, 2, \dots, 8\},\; j \in \{1, 2, \dots, 8\}} \mathbf{o}(i, j) \tag{6}$$
where $\mathbf{o} \in \mathbb{R}^{8 \times 8}$ represents the logits of a probability distribution over the possible offsets. This refinement strategy allows for efficient pixel-level matching by reducing the search space, making it particularly suitable for resource-constrained settings. We therefore employed XFeat for its focus on both accuracy and efficiency. The decoupling of keypoint detection and descriptor extraction, the use of multi-scale feature merging, and the integration of reliability maps all contribute to its robust performance in real-world applications. Because it can perform real-time inference on limited hardware, XFeat is also well suited to further applications such as mobile robotics and augmented reality.
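As an illustration of the two selection steps described above, the following minimal sketch picks the top-K cells of a reliability map and then applies the arg max of Equation (6) to an 8 × 8 grid of offset logits. The function names and the nested-list data layout are assumptions made for this example; they are not part of XFeat's actual interface.

```python
from typing import List, Tuple


def top_k_regions(reliability: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Return the (row, col) cells with the k highest reliability scores."""
    cells = [
        (score, i, j)
        for i, row in enumerate(reliability)
        for j, score in enumerate(row)
    ]
    # Sort by score only, highest first.
    cells.sort(key=lambda c: c[0], reverse=True)
    return [(i, j) for _, i, j in cells[:k]]


def refine_offset(logits: List[List[float]]) -> Tuple[int, int]:
    """Equation (6): 1-based (i, j) index of the largest offset logit."""
    best = max(
        ((logits[i][j], i, j)
         for i in range(len(logits))
         for j in range(len(logits[0]))),
        key=lambda c: c[0],
    )
    return best[1] + 1, best[2] + 1


# Hypothetical inputs: a 2 x 2 reliability map and an 8 x 8 logit grid.
reliability_map = [[0.1, 0.9], [0.8, 0.2]]
offset_logits = [[0.0] * 8 for _ in range(8)]
offset_logits[2][5] = 3.0
selected = top_k_regions(reliability_map, k=2)
offset = refine_offset(offset_logits)
```

Restricting the refinement to an 8 × 8 grid is what keeps the search space small: each coarse match only needs 64 logits rather than a full-resolution correlation map.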
ReTrackVLM processes input video frames by detecting objects using YOLOv8, extracting bbxes, keypoints, and descriptors for each detected object during the stage given in Algorithm 2. Additional features, such as overlap ratios and confidence scores, are calculated. These components are combined into embeddings, normalized for compatibility with YOLOv8, and serve as input for the subsequent stages.
Algorithm 2 Data Preprocessing and Representation
Input: Video frame $f_t$, Pre-trained YOLOv8 detector, Keypoint extractor (XFeat)
Output: Set of embeddings $E_t = \{e_1, e_2, \dots, e_n\}$ for frame $f_t$
1: Detect objects in frame $f_t$ using YOLOv8, obtaining bounding boxes $\mathrm{bbxes}_t$
2: for each bounding box $b_i \in \mathrm{bbxes}_t$ do
3:     Extract keypoints $k_i$ and descriptors $d_i$ using XFeat
4:     Compute overlap ratios $o_i$ with other bounding boxes
5:     Compute confidence score $c_i$
6:     Construct embedding $e_i = [b_i, k_i, d_i, o_i, c_i]$
7:     Add $e_i$ to $E_t$
8: end for
9: Normalize bounding box parameters to $[0, 1]$ for compatibility with YOLOv8
10: return $E_t$
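The loop of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the author's implementation: detections are assumed to arrive as dictionaries with hypothetical keys (`bbox`, `keypoints`, `descriptors`, `confidence`), the detector and XFeat are assumed to have run already, and the overlap ratio is taken to be the maximum IoU with any other box in the frame, which is one plausible reading of step 4. Normalization is applied inside the loop rather than as a separate final step.

```python
from typing import Dict, List, Sequence

Box = Sequence[float]  # (x, y, w, h) in pixels, top-left origin


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def build_embeddings(detections: List[Dict], img_w: float, img_h: float) -> List[Dict]:
    """Assemble e_i = [b_i, k_i, d_i, o_i, c_i] for every detection."""
    boxes = [d["bbox"] for d in detections]
    embeddings = []
    for idx, det in enumerate(detections):
        # Assumed definition of the overlap ratio: max IoU with other boxes.
        others = [iou(det["bbox"], b) for j, b in enumerate(boxes) if j != idx]
        x, y, w, h = det["bbox"]
        embeddings.append({
            "bbox": [x / img_w, y / img_h, w / img_w, h / img_h],  # in [0, 1]
            "keypoints": det["keypoints"],
            "descriptors": det["descriptors"],
            "overlap": max(others, default=0.0),
            "confidence": det["confidence"],
        })
    return embeddings


# Hypothetical frame with two overlapping detections.
frame_detections = [
    {"bbox": (0, 0, 10, 10), "keypoints": [], "descriptors": [], "confidence": 0.9},
    {"bbox": (5, 0, 10, 10), "keypoints": [], "descriptors": [], "confidence": 0.8},
]
E_t = build_embeddings(frame_detections, img_w=100, img_h=50)
```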

3.2. Cross-Modal Embeddings and Fine-Tuning of VLMs

To create a robust cross-modal embedding for multi-object tracking (MOT), our approach integrates visual and textual representations derived from image data. These embeddings are generated from different datasets, including MOT15, MOT16, MOT17, DanceTrack, CrowdHuman, and WildTrack, and are designed to be compatible with several VLM architectures, such as Florence-2, CLIP, OpenCLIP, and DETR2.
The embeddings for each VLM are constructed by associating the extracted visual features with corresponding textual or categorical descriptions. This process is repeated for each image in the dataset, resulting in structured data stored in various formats suitable for the target VLMs:
  • Florence-2 Embedding incorporates visual features with bbxes, keypoints, descriptors, overlap ratios, and confidence scores, labeling each entity as a “pedestrian”.
  • CLIP and OpenCLIP Embeddings convert visual features into textual descriptions, detailing the position, keypoints, descriptors, and confidence scores of each human figure within the frame.
  • DETR2 Embedding organizes features in a format that includes bbxes, keypoints, descriptors, and overlap ratios, compatible with the DETR2 framework. These data are stored in a DETR2-specific JSON format, allowing for seamless integration with downstream tasks.
To create effective cross-modal embeddings for MOT tasks, we first define a multi-dimensional vector representation for each tracklet. The embedding vector $\mathbf{e}_i$ for tracklet $i$ is composed of various features as given in Equation (7):
$$\mathbf{e}_i = [\mathbf{v}_i, \mathbf{k}_i, \mathbf{d}_i, \mathbf{b}_i, c_i, r_i] \tag{7}$$
where
  • $\mathbf{v}_i = [f_n, h, w]$ represents the visual features, including the image file name $f_n$, image height $h$, and image width $w$.
  • $\mathbf{k}_i$ is the keypoint vector extracted using XFeat, representing the spatial configuration of the detected human’s keypoints.
  • $\mathbf{d}_i$ is the 64-dimensional descriptor vector derived from LightGlue, encoding the appearance characteristics.
  • $\mathbf{b}_i$ represents the bbx $[x, y, w, h]$, capturing the spatial extent of the detected object.
  • $c_i$ is the confidence score associated with the detection.
  • $r_i$ is the overlap ratio of bbxes across frames, used for temporal consistency. It measures the degree of spatial overlap between different bbxes. This information is crucial for managing occlusions and ensuring accurate tracking across frames.
These features are concatenated to form a unified embedding e i that captures both visual and spatial–temporal information.
To bridge the gap between different modalities (e.g., image and textual descriptions), we project the embeddings into a shared latent space $L$. Let $W_v$ and $W_t$ be the projection matrices for visual and textual features, respectively. The projection can be formulated as given in Equation (8):
$$\mathbf{z}_i^{v} = W_v \cdot \mathbf{e}_i^{v}, \qquad \mathbf{z}_j^{t} = W_t \cdot \mathbf{e}_j^{t} \tag{8}$$
where
  • $\mathbf{z}_i^{v}$ is the projected visual embedding for tracklet $i$.
  • $\mathbf{z}_j^{t}$ is the projected textual embedding for the corresponding textual description $j$.
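The loss function for aligning the modalities is typically based on contrastive learning as given in Equation (9), where the sum in the denominator runs over all candidate textual embeddings $k$:
$$\mathcal{L}_{contrast} = -\log \frac{\exp\left(\cos(\mathbf{z}_i^{v}, \mathbf{z}_j^{t}) / \tau\right)}{\sum_{k} \exp\left(\cos(\mathbf{z}_i^{v}, \mathbf{z}_k^{t}) / \tau\right)} \tag{9}$$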
where
  • $\cos(\cdot)$ represents the cosine similarity between the projected embeddings.
  • $\tau$ is a temperature parameter that controls the sharpness of the similarity distribution.
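To make the contrastive objective of Equation (9) concrete, the sketch below evaluates the loss for one visual embedding against a small set of textual embeddings. It assumes the embeddings are already projected into the shared space; the function names and the default temperature of 0.07 are illustrative choices, since the text does not state the value used.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def contrastive_loss(z_v: Sequence[float],
                     z_texts: List[Sequence[float]],
                     positive: int,
                     tau: float = 0.07) -> float:
    """-log softmax of the positive pair's similarity over all candidates.

    tau=0.07 is an assumed default, not a value taken from the paper.
    """
    logits = [cosine(z_v, z_t) / tau for z_t in z_texts]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[positive] - log_denominator)


# Hypothetical 2-D embeddings: the visual vector matches the first text.
z_visual = [1.0, 0.0]
z_text_candidates = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(z_visual, z_text_candidates, positive=0, tau=1.0)
loss_mismatched = contrastive_loss(z_visual, z_text_candidates, positive=1, tau=1.0)
```

The loss is small when the positive pair has the highest similarity and grows when a negative candidate is closer, which is the behavior the alignment objective relies on.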
For fine-tuning VLMs, we leverage the cross-modal embeddings to refine the decision boundaries in the joint embedding space. Given a query bbx $b_q$, we compute its embedding $\mathbf{e}_q$ and compare it against the embeddings of existing tracks as given in Equation (10):
$$\mathrm{Score}(b_q, \mathbf{e}_i) = \cos(W_v \cdot \mathbf{e}_q, W_v \cdot \mathbf{e}_i) \tag{10}$$
The query bounding box is assigned to the track $i^{*}$ with the highest score as given in Equation (11):
$$i^{*} = \arg\max_{i} \mathrm{Score}(b_q, \mathbf{e}_i) \tag{11}$$
We then perform the fine-tuning by minimizing a cross-entropy loss over the track assignments during the training process.
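The scoring and assignment rule of Equations (10) and (11) can be sketched as follows. The projection matrix is passed in as a plain nested list and the track embeddings as a dictionary keyed by track ID; both are assumptions made for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def matvec(W: List[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Matrix-vector product W . v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def assign_track(e_q: Sequence[float],
                 track_embeddings: Dict[int, Sequence[float]],
                 W_v: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (track id, score) maximizing cos(W_v e_q, W_v e_i)."""
    z_q = matvec(W_v, e_q)
    scores = {tid: cosine(z_q, matvec(W_v, e_i))
              for tid, e_i in track_embeddings.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Hypothetical example with an identity projection and two tracks.
W_identity = [[1.0, 0.0], [0.0, 1.0]]
tracks = {1: [1.0, 0.0], 2: [0.0, 1.0]}
best_track, best_score = assign_track([0.9, 0.1], tracks, W_identity)
```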
The extracted embeddings are processed by a VLM as given in Algorithm 3 to generate rich cross-modal embeddings. These embeddings encode both visual and semantic information, enhancing the model’s capacity for accurate object representation. Fine-tuning the VLM further aligns its outputs to the MOT task.
Algorithm 3 Cross-Modal Embeddings and Fine-Tuning of VLMs
Input: Set of embeddings $E_t$, Pre-trained VLM (e.g., Florence-2, CLIP)
Output: Set of cross-modal embeddings $V_t = \{v_1, v_2, \dots, v_n\}$
1: for each embedding $e_i \in E_t$ do
2:     Extract visual and semantic features $v_i$ using the VLM
3:     Add $v_i$ to $V_t$
4: end for
5: Fine-tune the VLM on the extracted features to enhance representation for MOT
6: return $V_t$
Figure 5 illustrates our approach for generating cross-modal embeddings and fine-tuning VLMs in MOT. The process begins with the sequence of previously detected pedestrians, each cropped to its bbx. These cropped regions are processed to compute visual features, including keypoints and descriptors, which are structured into a data format containing fields such as frame_id, person_id, bbx_coordinates, keypoints, descriptors, overlap_ratio, and confidence_score. For each individual, a textual description is generated in natural language using these features, aligning the visual information with the textual inputs required by VLMs. Both the visual and textual data are projected into a shared embedding space via learned projection matrices, enabling cross-modal alignment. The embeddings for each individual are stored as $V_t$ vectors for track management and undergo fine-tuning using a contrastive learning objective to enhance identity association. This process ensures that embeddings of the same identity are brought closer together while those of different identities are separated. The resulting multi-modal embeddings improve track association accuracy by leveraging both visual and textual cues in a unified representation.
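As a rough illustration of how a structured record could be turned into the textual input described above, the sketch below formats one record into a sentence. The field names follow the text, but the sentence template itself is hypothetical; the paper does not specify the exact wording used for the descriptions.

```python
from typing import Dict


def describe(record: Dict) -> str:
    """Build a natural-language description from one detection record.

    The template is an assumed example, not the wording used in the paper.
    """
    x, y, w, h = record["bbx_coordinates"]
    return (
        f"Frame {record['frame_id']}: pedestrian {record['person_id']} at "
        f"({x:.0f}, {y:.0f}) with size {w:.0f}x{h:.0f}, "
        f"{len(record['keypoints'])} keypoints, "
        f"overlap ratio {record['overlap_ratio']:.2f}, "
        f"confidence {record['confidence_score']:.2f}."
    )


# Hypothetical record using the field names listed in the text.
example_record = {
    "frame_id": 12,
    "person_id": 3,
    "bbx_coordinates": [100, 50, 40, 120],
    "keypoints": [[110, 60], [120, 90]],
    "descriptors": [[0.0] * 64, [0.0] * 64],
    "overlap_ratio": 0.25,
    "confidence_score": 0.9,
}
description = describe(example_record)
```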

3.3. Track Management and Zero-Shot ReID Integration

Effective MOT requires robust track management and ReID mechanisms to address object occlusions, sudden appearances, and IDSws. ReTrackVLM incorporates a track management module with zero-shot ReID capabilities to support tracking across diverse scenarios. This module uses motion prediction, distance calculation, and data association strategies to maintain track consistency and re-identify objects after interruptions, which together improve tracking reliability and accuracy in challenging environments.
Figure 6 shows the track management and zero-shot ReID integration of ReTrackVLM. The module first obtains the bbxes, keypoints, and descriptors for each detected pedestrian in the current frame. Motion estimation, implemented via a Kalman filter, predicts the future positions of tracks based on their previous states, aiding in association with new detections. For similarity evaluation, the cosine similarity metric is employed to calculate a distance matrix between the embedding vectors $V_t$ of existing tracks and those of new detections. This matrix quantifies the matching likelihood between tracks and detections. Following distance calculation, data association is performed using a bipartite matching algorithm to optimally pair the detections (Det1 to DetN) with tracks (Track1 to TrackM), taking into account both motion predictions and embedding similarities. After association, zero-shot ReID is applied by leveraging VLM embeddings, allowing for re-identification across frames without additional training. Confident tracks are stored in an array containing their associated embeddings, indexed by unique IDs such as ID1, ID2, …, IDk, along with confidence scores. This array serves as a gallery for ReID, facilitating identity retrieval for new detections and enhancing tracking consistency across occlusions or reappearances. The framework thus integrates motion, similarity assessment, and ReID for robust MOT.
Tracks from the previous frame are predicted using a Kalman filter in this stage as shown in Algorithm 4. Detected objects in the current frame are associated with these predicted tracks based on IoU and appearance similarity scores, using the Hungarian algorithm. For unmatched tracks, zero-shot ReID compares them with previously stored embeddings in the confident track storage, allowing for robust identity preservation without retraining.
Algorithm 4 Track Management and Zero-Shot ReID Integration
Input: Predicted tracks $T_{t-1}$, Current embeddings $V_t$, Confident Track Storage (CTS)
Output: Updated tracks $T_t$
1: Predict new locations of tracks $T_{t-1}$ using a Kalman filter, obtaining $P$
2: for each predicted track $p_j \in P$ do
3:     Compute IoU and appearance similarity scores with $v_i \in V_t$
4: end for
5: Solve the association problem using the Hungarian algorithm
6: for each association result do
7:     Update matched tracks in $T_t$
8:     Initialize new tracks for unmatched $v_i$
9:     Handle unmatched tracks in $T_{t-1}$ by updating their predicted states
10: end for
11: Perform zero-shot ReID by comparing unmatched tracks with embeddings in CTS
12: return $T_t$

3.3.1. Motion Prediction and Similarity Assessment

Motion Prediction Mechanism: The Kalman filter is employed to predict the motion of detected objects across consecutive frames. Each detected pedestrian’s state is represented by a state vector $\mathbf{x}_k = [x, y, \dot{x}, \dot{y}, w, h]^{T}$, where $(x, y)$ represents the center of the bbx, $(\dot{x}, \dot{y})$ represents the velocity, and $w$ and $h$ are the width and height of the bbx. The state transition model $F$ is defined as given in Equation (12):
$$\mathbf{x}_{k+1} = F \mathbf{x}_k + \mathbf{w}_k \tag{12}$$
where $F$ is the state transition matrix and $\mathbf{w}_k$ is the process noise, assumed to follow a Gaussian distribution with covariance $Q$. The Kalman filter predicts the next state $\mathbf{x}_{k+1}$ and updates it with the observed measurements.
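The prediction step of Equation (12) can be sketched in a few lines. The text does not spell out the entries of $F$, so the example assumes a standard constant-velocity model in which the center moves by velocity times the time step while width and height stay fixed; the covariance propagation $P' = F P F^{T} + Q$ is the usual Kalman prediction and is included for completeness.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def transition_matrix(dt: float = 1.0) -> Matrix:
    """Assumed constant-velocity F for the state [x, y, vx, vy, w, h]."""
    F = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
    F[0][2] = dt  # x <- x + vx * dt
    F[1][3] = dt  # y <- y + vy * dt
    return F


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def predict(x: List[float], P: Matrix, Q: Matrix,
            dt: float = 1.0) -> Tuple[List[float], Matrix]:
    """Kalman prediction: x' = F x and P' = F P F^T + Q."""
    F = transition_matrix(dt)
    x_pred = [sum(F[i][j] * x[j] for j in range(6)) for i in range(6)]
    F_t = [list(col) for col in zip(*F)]
    FPFt = matmul(matmul(F, P), F_t)
    P_pred = [[FPFt[i][j] + Q[i][j] for j in range(6)] for i in range(6)]
    return x_pred, P_pred


# Hypothetical track: center (10, 20), velocity (2, -1), size 5 x 8.
identity = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
zeros = [[0.0] * 6 for _ in range(6)]
x_next, P_next = predict([10.0, 20.0, 2.0, -1.0, 5.0, 8.0], identity, zeros)
```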
Distance Calculation: The distance between predicted tracks and new detections is calculated using a combination of spatial and appearance-based metrics. For spatial distance, the Mahalanobis distance $d_M$ is utilized to measure the difference between the predicted bounding box and the detected bounding box. The Mahalanobis distance is given by Equation (13):
$$d_M(\mathbf{z}_k, H\mathbf{x}_k) = (\mathbf{z}_k - H\mathbf{x}_k)^{T} S_k^{-1} (\mathbf{z}_k - H\mathbf{x}_k) \tag{13}$$
where $\mathbf{z}_k$ is the detected measurement, $H$ is the observation model, and $S_k$ is the innovation covariance.
Appearance Feature Descriptors and Similarity Score: Feature locations obtained via XFeat within the bbxes of pedestrians detected by the fine-tuned YOLOv8 are fed into the Kalman filter. These feature locations are further used to compute appearance descriptors, enhancing the reliability of track association. The similarity score between the detected and tracked objects is calculated using the VLM that we also fine-tuned for this purpose in a similar manner as [58]. This score, $S_{VLM}$, is integrated into the overall distance calculation, combining both spatial and appearance-based distances as in Equation (14):
$$d_{total} = \alpha \cdot d_M + \beta \cdot (1 - S_{VLM}) \tag{14}$$
where α and β are weights that balance the influence of spatial and appearance-based metrics, respectively. This integrated distance metric ensures robust track management by leveraging both motion prediction and appearance similarity.
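A small numerical sketch of Equations (13) and (14) follows. To stay dependency-free it assumes a diagonal innovation covariance, so that the inverse reduces to element-wise division; a full covariance would need a proper matrix inverse. The weights of 0.5 are placeholders, since the values of α and β are not stated in the text.

```python
from typing import Sequence


def mahalanobis_diag(z: Sequence[float], hx: Sequence[float],
                     s_diag: Sequence[float]) -> float:
    """Equation (13) under the assumption that S_k is diagonal."""
    return sum((zi - hi) ** 2 / si for zi, hi, si in zip(z, hx, s_diag))


def total_distance(d_m: float, s_vlm: float,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Equation (14): alpha * d_M + beta * (1 - S_VLM).

    alpha and beta are placeholder weights, not values from the paper.
    """
    return alpha * d_m + beta * (1.0 - s_vlm)


# Hypothetical measurement [x, y, w, h] against a predicted box.
d_spatial = mahalanobis_diag(
    z=[12.0, 19.0, 5.0, 8.0],
    hx=[10.0, 20.0, 5.0, 8.0],
    s_diag=[4.0, 1.0, 1.0, 1.0],
)
d_combined = total_distance(d_spatial, s_vlm=0.8)
```

A high VLM similarity shrinks the appearance term, so two candidates at the same spatial distance are ranked by how alike they look.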
Cost Matrix Construction: To apply the Hungarian method, we first construct a cost matrix $C$ that quantifies the dissimilarity between each detected object and each existing track. The elements of the cost matrix, $C_{ij} = d_{total}(\mathbf{z}_i, \mathbf{x}_j)$, are computed as a weighted combination of spatial and appearance-based distances as defined earlier in Equation (14), where $d_M(\mathbf{z}_i, \mathbf{x}_j)$ is the Mahalanobis distance between the $i$-th detection and the $j$-th track, and $S_{VLM}(\mathbf{z}_i, \mathbf{x}_j)$ is the similarity score derived from the VLM.
Optimal Assignment: The Hungarian method solves the assignment problem by minimizing the total cost of assigning detections to tracks. Given the cost matrix $C$, the algorithm finds the optimal assignment $A^{*}$, which minimizes the sum of the selected costs as given in Equation (15):
$$A^{*} = \arg\min_{A} \sum_{i,j} A_{i,j} C_{i,j} \tag{15}$$
where $A_{i,j}$ is a binary variable that equals 1 if the $i$-th detection is assigned to the $j$-th track, and 0 otherwise. The solution $A^{*}$ ensures that each detection is assigned to at most one track and each track is matched to at most one detection, thereby optimizing the association process.
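The objective in Equation (15) can be illustrated on a small cost matrix. Note that the sketch below is not the Hungarian algorithm: it enumerates every one-to-one matching by brute force, which reaches the same minimum cost on tiny inputs but scales factorially. It is meant only to show what the optimal assignment looks like.

```python
from itertools import permutations
from typing import List, Tuple


def optimal_assignment(cost: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Exhaustive search for the min-cost matching (rows = detections,
    columns = tracks). Brute-force stand-in for the Hungarian method."""
    n_det = len(cost)
    n_trk = len(cost[0]) if cost else 0
    best_pairs: List[Tuple[int, int]] = []
    best_cost = float("inf")
    if n_det <= n_trk:
        # Every detection gets a distinct track.
        for perm in permutations(range(n_trk), n_det):
            total = sum(cost[i][perm[i]] for i in range(n_det))
            if total < best_cost:
                best_cost = total
                best_pairs = [(i, perm[i]) for i in range(n_det)]
    else:
        # More detections than tracks: every track gets a distinct detection.
        for perm in permutations(range(n_det), n_trk):
            total = sum(cost[perm[j]][j] for j in range(n_trk))
            if total < best_cost:
                best_cost = total
                best_pairs = sorted((perm[j], j) for j in range(n_trk))
    return best_pairs, best_cost


# Hypothetical 3 x 3 cost matrix built from d_total values.
C = [[4.0, 1.0, 3.0],
     [2.0, 0.0, 5.0],
     [3.0, 2.0, 2.0]]
pairs, min_cost = optimal_assignment(C)
```

Greedy matching would pair detection 1 with track 1 first (cost 0) and end with a total of 6, whereas the optimal assignment reaches 5, which is why a global solver is used.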

3.3.2. Data Association

Data association is essential in MOT, aiming to match detected objects in the current frame with existing tracks. We employ the Hungarian method, a combinatorial optimization algorithm, to efficiently solve the assignment problem. Afterward, unassigned detections may signify new objects entering the scene, prompting the initialization of new tracks. Conversely, tracks that remain unassigned are terminated based on the number of consecutive frames they have gone without a valid match.
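The bookkeeping that follows the assignment step can be sketched as below. The `Track` class, the `misses` counter, and the `max_age` limit of 30 frames are all hypothetical; the text states only that termination depends on the number of consecutive unmatched frames, not the limit used.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Track:
    track_id: int
    misses: int = 0  # consecutive frames without a valid match


def update_lifecycle(tracks: List[Track],
                     matched_ids: Set[int],
                     n_unmatched_detections: int,
                     next_id: int,
                     max_age: int = 30) -> Tuple[List[Track], int]:
    """Reset or age existing tracks, drop stale ones, start new ones.

    max_age is an assumed threshold, not a value from the paper.
    """
    survivors = []
    for track in tracks:
        if track.track_id in matched_ids:
            track.misses = 0
        else:
            track.misses += 1
        if track.misses <= max_age:
            survivors.append(track)
    # Each unmatched detection is treated as a new object entering the scene.
    for _ in range(n_unmatched_detections):
        survivors.append(Track(track_id=next_id))
        next_id += 1
    return survivors, next_id


# Hypothetical frame: track 1 matched, track 2 missed again, one new detection.
active, following_id = update_lifecycle(
    [Track(1), Track(2, misses=2)],
    matched_ids={1},
    n_unmatched_detections=1,
    next_id=3,
    max_age=2,
)
```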

3.3.3. Zero-Shot ReID Integration

The zero-shot ReID module is crucial for maintaining object identity continuity, especially when objects exit and re-enter the scene or when tracks are temporarily lost. Unlike traditional ReID methods that require extensive training on labeled data [42], our zero-shot ReID module uses pre-trained VLMs to match reappearing objects to their previous identities without additional fine-tuning. We treat the confident track storage as a dynamic gallery, comparing new queries from the current frame against this gallery using pre-trained embeddings from the VLM. This allows for seamless identification of potential matches based on stored descriptors and VLM similarities, eliminating the need for specialized training on additional ReID datasets.
Feature Extraction and Embedding Comparison: In the zero-shot ReID module, the appearance features of each detected object are extracted using the same embedding structure described in earlier sections. The embedding vector $\mathbf{e}_d$ for a new detection is compared against the stored embeddings $\mathbf{e}_t$ of all previously tracked objects in the database. The similarity between the embeddings is computed using a cosine similarity measure as given in Equation (16):
$$S_{cos}(\mathbf{e}_d, \mathbf{e}_t) = \frac{\mathbf{e}_d \cdot \mathbf{e}_t}{\|\mathbf{e}_d\| \, \|\mathbf{e}_t\|} \tag{16}$$
where $\mathbf{e}_d$ is the embedding vector of the current detection, and $\mathbf{e}_t$ is the embedding vector of a previously tracked object. A high similarity score indicates that the detection is likely to correspond to a previously tracked object, facilitating the re-assignment of the same identity.
Zero-Shot Matching Process: When a detection $\mathbf{z}_i$ cannot be matched to an existing track during the data association step (as described in Section 3.3.2), the zero-shot ReID module is activated. The module searches for the highest similarity score $S_{cos}$ between the current detection and the stored embeddings of lost tracks. If the similarity score exceeds a predefined threshold $\tau_{ReID}$, the detection is reassociated with the corresponding track, effectively recovering the identity of the object as given in Equation (17):
$$\mathrm{ReIDMatch} = \arg\max_{t} S_{cos}(\mathbf{e}_d, \mathbf{e}_t) \quad \text{if } S_{cos}(\mathbf{e}_d, \mathbf{e}_t) > \tau_{ReID} \tag{17}$$
Integration into Track Management: The zero-shot ReID module is integrated into the track management system, offering a fallback for identity preservation when standard data association fails. This allows the tracking system to maintain consistent object identities during occlusions, exits, and re-entries without the need for manual labeling or retraining. By leveraging the generalization capabilities of VLMs, the zero-shot ReID module enhances the overall robustness and reliability of the tracking pipeline.
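The matching rule of Equations (16) and (17) reduces to a thresholded nearest-neighbor search over the gallery. In the sketch below the gallery is a dictionary from track ID to embedding, and the threshold of 0.8 is a placeholder because the value of the ReID threshold is not given in the text.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation (16); returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def reid_match(e_d: Sequence[float],
               gallery: Dict[int, Sequence[float]],
               tau_reid: float = 0.8) -> Optional[int]:
    """Equation (17): best gallery ID if its similarity exceeds tau_reid.

    tau_reid=0.8 is an assumed value for illustration only.
    """
    best_id, best_sim = None, -1.0
    for track_id, e_t in gallery.items():
        sim = cosine(e_d, e_t)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id if best_sim > tau_reid else None


# Hypothetical gallery of two lost tracks.
lost_tracks = {7: [1.0, 0.0], 9: [0.0, 1.0]}
recovered = reid_match([0.9, 0.1], lost_tracks)   # close to track 7
ambiguous = reid_match([1.0, 1.0], lost_tracks)   # equally far from both
```

Returning `None` below the threshold matters as much as the arg max: an ambiguous detection starts a new track rather than inheriting a wrong identity.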

3.4. Confident Track Storage

The confident track storage module is vital to the track management pipeline, ensuring the reliability of tracked identities over time. It selectively stores tracks with high confidence, enabling more effective management of active tracks and minimizing erroneous associations.
Track Confidence Calculation: To determine whether a track should be stored as “confident”, we introduce a confidence score $C_t$ for each track $t$, computed as a weighted combination of factors such as detection consistency, association reliability, and ReID match quality. The confidence score is defined as given in Equation (18):
$$C_t = \alpha \cdot C_{det} + \beta \cdot C_{assoc} + \gamma \cdot C_{ReID} \tag{18}$$
where
  • $C_{det}$ represents the consistency of detections associated with the track. It is calculated as the ratio of successful detections to the total number of frames for which the track has been active.
  • $C_{assoc}$ represents the reliability of data associations for the track, measured by the inverse of the number of IDSws.
  • $C_{ReID}$ represents the quality of ReID matches, quantified by the average cosine similarity score $S_{cos}$ of ReID matches over time.
The parameters $\alpha$, $\beta$, and $\gamma$ are empirically determined weights that balance the contribution of each factor, set to 0.4, 0.3, and 0.3, respectively, reflecting the relative importance of detection consistency, association reliability, and ReID match quality in establishing track confidence.
Empirical Threshold for Confidence: A track is classified as confident if its confidence score $C_t$ exceeds a predefined threshold $\tau_{conf}$, empirically set at 0.7 based on validation experiments, as given in Equation (19). This threshold ensures that only tracks with a high level of reliability are stored, reducing the risk of storing incorrect or noisy tracks.
$$\mathrm{ConfidentTrack}_t \quad \text{if } C_t > \tau_{conf} \tag{19}$$
Storage Mechanism and Interaction with Other Modules: Once a track is classified as confident, it is stored in the confident track storage module, which acts as a repository for high-quality tracks. This allows for the reactivation of tracks if the same object is detected later, enhancing identity management across long sequences. The interaction between the confident track storage and other modules is twofold:
  1. Data Association: During the data association process, confident tracks in storage are prioritized for matching with new detections. This reduces the likelihood of identity switches and enhances track continuity.
  2. Zero-Shot ReID Integration: In cases where a track has been lost and subsequently reappears, the zero-shot ReID module can utilize the embeddings of confident tracks to reassociate the object with its previous identity, further ensuring consistent identity tracking across the entire sequence.
Incorporating the confident track storage module improves the overall tracking system’s reliability and stability, as these confident tracks serve as anchor points, reducing errors and enhancing long-term identity preservation. Empirical tuning of confidence thresholds and weighting parameters ensures effective adaptation to diverse tracking scenarios, maintaining a balance between precision and recall.
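A worked example of Equations (18) and (19) is given below, using the stated weights (0.4, 0.3, 0.3) and threshold (0.7). One detail is an assumption: the text defines the association term as the inverse of the number of IDSws, which is undefined for a track with no switches, so the sketch uses 1 / (1 + IDSw) instead. The function and parameter names are illustrative.

```python
from typing import Sequence


def track_confidence(n_detected: int,
                     n_active_frames: int,
                     n_id_switches: int,
                     reid_similarities: Sequence[float],
                     alpha: float = 0.4,
                     beta: float = 0.3,
                     gamma: float = 0.3) -> float:
    """Equation (18) with the weights reported in the text."""
    c_det = n_detected / n_active_frames if n_active_frames else 0.0
    # Assumed form: 1 / (1 + IDSw) avoids division by zero for clean tracks.
    c_assoc = 1.0 / (1 + n_id_switches)
    c_reid = (sum(reid_similarities) / len(reid_similarities)
              if reid_similarities else 0.0)
    return alpha * c_det + beta * c_assoc + gamma * c_reid


def is_confident(c_t: float, tau_conf: float = 0.7) -> bool:
    """Equation (19): confident if C_t exceeds the threshold."""
    return c_t > tau_conf


# Track A: detected in 45 of 50 frames, no switches, ReID sims 0.8 and 0.9.
# 0.4 * 0.9 + 0.3 * 1.0 + 0.3 * 0.85 = 0.915
score_a = track_confidence(45, 50, 0, [0.8, 0.9])
# Track B: same detections and ReID quality, but three identity switches.
# 0.4 * 0.9 + 0.3 * 0.25 + 0.3 * 0.85 = 0.69
score_b = track_confidence(45, 50, 3, [0.8, 0.9])
```

With these numbers, three identity switches are enough to push an otherwise well-detected track just below the 0.7 threshold.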
Tracks with high confidence and stable identity are stored in the confident track storage (CTS) for future reference as given in Algorithm 5. Low-confidence or outdated tracks are removed to maintain efficient storage and improve tracking reliability in dynamic and complex scenes.
Algorithm 5 Confident Track Storage
Input: Updated tracks $T_t$, Confident Track Storage (CTS)
Output: Updated CTS
1: for each track $t_k \in T_t$ do
2:     if $t_k$ has high confidence and stable identity then
3:         Add $t_k$ to CTS
4:     end if
5: end for
6: Remove obsolete or low-confidence tracks from CTS
7: return CTS
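Algorithm 5 can be sketched with a dictionary acting as the storage. Several details here are assumptions: tracks are plain dictionaries with hypothetical keys, "high confidence" is taken to mean exceeding the 0.7 threshold, and "obsolete" is interpreted as not having been refreshed for more than `max_idle` frames, a limit the text does not specify.

```python
from typing import Dict, List


def update_cts(cts: Dict[int, Dict],
               tracks: List[Dict],
               frame_idx: int,
               tau_conf: float = 0.7,
               max_idle: int = 100) -> Dict[int, Dict]:
    """Add confident tracks, drop low-confidence and stale entries.

    max_idle is an assumed staleness limit, not a value from the paper.
    """
    for track in tracks:
        if track["confidence"] > tau_conf:
            cts[track["id"]] = {
                "embedding": track["embedding"],
                "confidence": track["confidence"],
                "last_seen": frame_idx,
            }
        else:
            # A track that fell below the threshold leaves the gallery.
            cts.pop(track["id"], None)
    stale = [tid for tid, entry in cts.items()
             if frame_idx - entry["last_seen"] > max_idle]
    for tid in stale:
        del cts[tid]
    return cts


# Hypothetical state at frame 150: entry 1 is stale, track 2 is confident,
# track 3 is not confident enough to be stored.
storage = {1: {"embedding": [1.0, 0.0], "confidence": 0.9, "last_seen": 0}}
current_tracks = [
    {"id": 2, "embedding": [0.0, 1.0], "confidence": 0.9},
    {"id": 3, "embedding": [0.5, 0.5], "confidence": 0.5},
]
storage = update_cts(storage, current_tracks, frame_idx=150)
```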

4. Experiments and Results

We evaluated the ReTrackVLM performance across various datasets and metrics. This section begins with an overview of the datasets and evaluation metrics used, followed by implementation details. Key experiments, including an ablation study that highlights contributions from different components, are presented. We conclude by comparing ReTrackVLM with state-of-the-art methods, showcasing its effectiveness, especially in challenging scenarios involving cross-modal embeddings and zero-shot ReID integration.

4.1. Datasets and Evaluation Metrics

We selected diverse MOT datasets—MOT15, MOT16, MOT17, MOT20, DanceTrack, and WildTrack—to evaluate the proposed framework comprehensively. These datasets span a wide range of real-world scenarios, including urban environments, indoor and outdoor scenes, and varying levels of crowd density, as provided in Table 1 in detail.
MOT15 includes indoor and outdoor environments with moderate difficulty due to occlusions and varying densities. MOT16 and MOT17 share content but differ in annotation styles, allowing for an assessment of the impact of annotation on performance in crowded urban settings. MOT20 features extremely crowded scenes with prolonged occlusions, ideal for dense urban applications. DanceTrack focuses on dynamic, fast-paced indoor movements, challenging the model with rapid motion and complex interactions. WildTrack emphasizes multi-camera setups and varying angles, posing synchronization challenges. These datasets were selected for a comprehensive evaluation, covering scenarios from crowded urban streets to controlled environments.
We converted annotations to YOLOv8 format, concentrating on the person class for consistency across datasets. Each image corresponds to a label file formatted with one row per object ([class, x_center, y_center, width, height]), normalized between 0 and 1 to meet YOLOv8’s input requirements, facilitating efficient fine-tuning. To evaluate tracking performance, we used a comprehensive set of metrics, including accuracy, precision, identity preservation, and robustness. The CLEAR MOT metrics, focusing on detection quality, track continuity, and identity preservation, are standard for assessing MOT performance.
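The annotation conversion mentioned above amounts to turning a pixel-space box into a normalized, center-based label row. The sketch below assumes the source box is given as top-left corner plus width and height in pixels, and that the person class has index 0; the function name is hypothetical.

```python
def to_yolo_row(left: float, top: float, width: float, height: float,
                img_w: float, img_h: float, class_id: int = 0) -> str:
    """Convert a pixel box (top-left + size) to a normalized YOLO label row.

    class_id=0 for the person class is an assumption of this example.
    """
    x_center = (left + width / 2.0) / img_w
    y_center = (top + height / 2.0) / img_h
    return (f"{class_id} {x_center:.6f} {y_center:.6f} "
            f"{width / img_w:.6f} {height / img_h:.6f}")


# Hypothetical 200 x 100 px box at (100, 50) in a 1000 x 500 image.
label_row = to_yolo_row(100, 50, 200, 100, img_w=1000, img_h=500)
```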
We employed the following metrics to assess the performance of MOT algorithms:
  • MOTA (Multiple-Object Tracking Accuracy): MOTA is a widely used metric that accounts for three types of errors: FP, FN, and IDSw. It is calculated as given in Equation (20):
    $$\mathrm{MOTA} = 1 - \frac{FN + FP + IDSw}{GT} \tag{20}$$
    where $GT$ is the total number of ground truth (GT) objects. MOTA provides a general overview of tracking performance by counting how many errors are made by the tracker in total, with higher values indicating better performance.
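  • MOTP (Multiple-Object Tracking Precision): MOTP measures the alignment precision between the GT and the tracker’s output. It is the average distance between the predicted bbxes and the GT across all matches, where lower values indicate higher precision, as given in Equation (21):
    $$\mathrm{MOTP} = \frac{\sum_{t} \sum_{i} d_{i,t}}{\sum_{t} c_t} \tag{21}$$
    where $d_{i,t}$ is the distance between the $i$-th matched prediction and its GT box in frame $t$, and $c_t$ is the number of matches in frame $t$.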
  • IDSw: IDSw occurs when a tracked object is mistakenly given a different ID than it had in previous frames, and counts the number of times an object’s predicted identity changes in the tracker output. Fewer IDSws indicate better tracking performance in maintaining object identities over time.
  • FM (Fragmentation): FM counts how often an object’s trajectory is interrupted, i.e., the tracker loses the object and later re-identifies it, so that one correct trajectory is split into two or more separate tracks. Lower fragmentation indicates better continuity in tracking.
  • MT (Mostly Tracked): MT refers to the number of ground truth trajectories that are successfully tracked for at least 80% of their length. A higher MT value indicates better tracking performance.
  • PT (Partially Tracked): PT refers to the number of ground truth trajectories that are tracked for 20% to 80% of their length. PT provides additional insight into tracking performance for partially visible objects.
  • ML (Mostly Lost): ML refers to the number of ground truth trajectories that are tracked for less than 20% of their length. A lower ML value is desirable as it indicates fewer instances of lost tracks.
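The metric definitions above translate directly into a few small functions. The sketch below computes MOTA from error counts, MOTP from per-frame match distances, and the MT/PT/ML label for one ground truth trajectory; the input layouts (plain counts and nested lists) are assumptions chosen for brevity.

```python
from typing import List


def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """Equation (20): 1 - (FN + FP + IDSw) / GT."""
    return 1.0 - (fn + fp + idsw) / gt


def motp(distances_per_frame: List[List[float]]) -> float:
    """Equation (21): total match distance divided by total match count."""
    total_distance = sum(sum(frame) for frame in distances_per_frame)
    total_matches = sum(len(frame) for frame in distances_per_frame)
    return total_distance / total_matches if total_matches else 0.0


def classify_trajectory(tracked_frames: int, total_frames: int) -> str:
    """MT if tracked for at least 80% of its length, ML if below 20%, else PT."""
    ratio = tracked_frames / total_frames
    if ratio >= 0.8:
        return "MT"
    if ratio < 0.2:
        return "ML"
    return "PT"


# Hypothetical counts: 100 misses, 50 false positives, 10 switches, 1000 GT.
example_mota = mota(fn=100, fp=50, idsw=10, gt=1000)   # 1 - 160/1000
# Two frames with matched-box distances [0.2, 0.4] and [0.3].
example_motp = motp([[0.2, 0.4], [0.3]])               # 0.9 / 3
labels = [classify_trajectory(n, 100) for n in (80, 50, 19)]
```

Because MOTA subtracts all three error types from 1, it can become negative when the tracker makes more errors than there are ground truth objects.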

4.2. Implementation Details

Experiments were conducted on a desktop PC with Windows 11, WSL2 (Ubuntu 22.04 LTS), a 24 GB GPU, 32 GB RAM, and a 1 TB SSD. We fine-tuned YOLOv8 for person detection across the selected datasets. Each dataset was converted to a YOLOv8-compatible format, retaining only the person class, and organized into training and validation directories with corresponding annotation files.
The fine-tuning utilized a custom Python training script with the PyTorch framework [59] and the Ultralytics YOLOv8 library, configured for 100 epochs, 640 × 640 pixel images, and a batch size of 16. Data augmentation techniques, including multi-scale training and horizontal flipping, were applied to enhance model robustness.
For the VLM module of ReTrackVLM, we developed a data processing pipeline to convert datasets for Florence-2, CLIP, and DETR2. This pipeline processes annotations by integrating bbx information, image dimensions, keypoints, descriptors, overlap ratios, and confidence scores into a unified format. The bbx coordinates and image size are extracted directly from YOLOv8 outputs, ensuring accurate localization of objects within each frame. Keypoints and descriptors, which complement image-level metadata (e.g., image dimensions) with additional semantic and structural details, are derived using XFeat and LightGlue, respectively. These features are then normalized to ensure consistency across frames and datasets.
The structured embedding representation uses a fixed format that includes image-specific information (e.g., dimensions and bbxes) along with semantic details (e.g., keypoints and descriptors). This fixed structure ensures compatibility across modules and improves capacity management and computational efficiency. Keypoints represent specific positions on detected objects, such as joints or other salient features, with each keypoint described by two coordinates indicating its position. To optimize memory usage, we limited the number of keypoints based on their confidence scores, retaining only the highest-confidence points. Descriptors, extracted alongside keypoints, are 64-dimensional vectors describing the local image regions around each keypoint, and they provide contextual information for fine-grained identification and tracking. By standardizing the embedding structure, the framework reduces memory usage and increases processing speed, which enables deployment in resource-constrained environments. This compact organization was pivotal in achieving timely performance without compromising accuracy, particularly in dense or complex scenes, while preserving the semantic and spatial integrity of the data. The combination of YOLOv8, XFeat, and LightGlue provides a foundation for generating embeddings that are both computationally efficient and effective for ReID and tracking tasks. These optimizations support scalability and practical applicability, making the ReTrackVLM framework suitable for both research and real-world scenarios.

4.3. Ablation Study

In this ablation study, we analyzed the contributions of key components within the ReTrackVLM framework. By evaluating the impact of cross-modal embeddings, zero-shot ReID integration, track management optimizations, and fine-tuning, we identified which elements most significantly enhance tracking accuracy and robustness. This analysis provides insights into our modules’ effectiveness and informs future improvements.
In Table 2, we present the detection performance of YOLOv8 after fine-tuning on several datasets, optimizing it for person detection across diverse scenes. Since YOLOv8 natively processes fixed-size inputs ($640 \times 640$), we followed this convention; with this setting, the fine-tuned YOLOv8 model achieved an average processing speed of 98 frames per second (fps) on our setup. Evaluation metrics include Box Precision, Box Recall, mAP50 (mean Average Precision at IoU threshold of 0.50), and mAP50-95 (mean Average Precision across IoU thresholds from 0.50 to 0.95).
The results establish the ReTrackVLM’s performance, showing that YOLOv8 fine-tuned on datasets like MOT15 and DanceTrack achieves high detection rates, with MOT15 achieving 94.8% Box Precision and 95.2% mAP50, while DanceTrack reaches 91.8% mAP50. In contrast, crowded datasets such as MOT20 exhibit lower precision (67.6%) and recall (69.4%), highlighting the challenges of densely populated scenes that require further refinement. When fine-tuned across all datasets, the model achieves a Box Precision of 71.9% and an mAP50 of 76.7%, offering a generalized performance that balances adaptability without overfitting, making it suitable for diverse MOT applications.
For fine-tuning the models in our framework, the time required was closely tied to the complexity of the architecture and the dataset size. DETR2, being the least complex architecture among the tested models, required approximately 15 h for fine-tuning on the smallest dataset. In contrast, Florence-2, recognized as one of the most complex VLMs due to its multimodal architecture and large model size, required nearly 30 h for fine-tuning on the largest dataset. These time differences align with the relative complexities of the models: Florence-2, with its focus on integrating image–text pairs and object detection data, exhibits significantly higher computational demands than DETR2, a transformer-based object detection model that is less resource-intensive.
In terms of runtime performance, our framework achieved a processing speed of 18 fps when using DETR2, reflecting its moderate complexity and suitability for real-time applications. Conversely, Florence-2 delivered a speed of 12 fps, reflecting the trade-off between its advanced capabilities and computational overhead.
CLIP demonstrates superior tracking performance, as can be seen in Table 3, particularly in MOTA and MOTP, across most datasets, indicating better accuracy and track continuity. DETR2 also performs well, especially on challenging datasets like MOT16/17 and MOT20, but has a higher IDSw, suggesting challenges in identity preservation compared to CLIP. OpenCLIP generally shows lower performance across metrics, while Florence-2 strikes a balance between detection accuracy and identity handling, though it trails behind CLIP and DETR2. CLIP excels in challenging scenarios like WildTrack and DanceTrack, leading in MOTA, MOTP, and FM.
In Table 4, zero-shot ReID performance with confident tracks from a non-fine-tuned YOLOv8 detector is provided. CLIP consistently outperforms other VLMs, achieving the highest MOTA and MOTP, especially on DanceTrack (61.8%) and WildTrack (69.9%), along with the lowest IDSw and FM, demonstrating superior track consistency. DETR2 closely follows, particularly on MOT16/17 and MOT20, balancing higher MT values with lower ML rates. OpenCLIP and Florence-2 generally lag behind, with higher IDSw and FM; however, Florence-2 performs reasonably on MOT16/17 and WildTrack. These findings highlight CLIP’s robustness in maintaining identity consistency in zero-shot ReID scenarios.
In Figure 7, we provide a detailed visualization of the tracking performance of ReTrackVLM across five benchmark datasets: DanceTrack, MOT15, MOT16, MOT17, and MOT20. These datasets represent a variety of challenging scenarios, including dynamic and static pedestrians, different camera movements, and varying levels of crowd density. The tracking results for each dataset are shown for specific frames: the first frame (t = 1), the 100th frame (t = 100), the 200th frame (t = 200), and the final frame (t = N). For MOT20, however, the last frame is excluded due to extreme occlusion and track overlap. Beneath the visualizations, graphs of the number of tracks per frame offer a temporal view of trajectory continuity and dataset complexity.
In DanceTrack, ReTrackVLM performs well, effectively handling the uniform motion patterns of the dancers from t = 1 to t = 200. At t = N, however, trajectory intersections pose challenges: rapid dancer crossings, combined with fast appearance changes caused by the short distance between the dancers and the camera, lead to IDSws. This indicates a need for improvement in managing synchronized movements and abrupt trajectory overlaps. For MOT15, which features a moderately dense urban setting, ReTrackVLM effectively tracks both static and dynamic individuals across frames. While track consistency is generally strong, a slight dip in the number of active tracks in the metric plot suggests possible track termination due to occlusions.
MOT16 and MOT17 highlight both strengths and limitations of the framework. In MOT16, the tracker performs reliably under static camera conditions, maintaining consistent track identities from t = 1 to t = 200. In MOT17, which involves moving cameras and frequent occlusions, the framework experiences increased IDSws; by t = N, there are clear signs of frequent IDSws and partial track losses, suggesting a need for enhanced appearance-based association and occlusion-handling mechanisms. In MOT20, one of the most densely populated datasets, ReTrackVLM struggles with trajectory overlaps and significant occlusions. The tracking metrics for MOT20 reveal a steady increase in the number of tracks over time, indicating the dynamic influx of new pedestrians but also hinting at fragmented tracking due to occlusion challenges.

4.4. Comparison with State-of-the-Art Techniques

In this section, we compare the performance of ReTrackVLM against state-of-the-art MOT methods across benchmark datasets, including MOT15, MOT16/17, MOT20, DanceTrack, and WildTrack. We evaluate the effectiveness of the cross-modal embeddings module, zero-shot ReID integration, and the performance enhancements from fine-tuning YOLOv8 as the detector. Key metrics such as MOTA, MOTP, and IDSw are assessed to gauge our approach against recent advancements. Distinct fine-tuned weights were employed for the YOLOv8 detector and CLIP models, optimized for the specific datasets to ensure peak performance in detection and ReID tasks.
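As a reminder of how identity switches enter the headline metric, the sketch below evaluates MOTA using the standard CLEAR MOT definition, MOTA = 1 − (FN + FP + IDSw) / GT, where GT is the total number of ground-truth boxes over all frames. The function name and the counts in the example are hypothetical and are not taken from the tables in this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multi-Object Tracking Accuracy under the CLEAR MOT definition.

    All arguments are counts accumulated over every frame of a sequence.
    The result can be negative when the errors outnumber the ground truth.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical counts, for illustration only.
score = mota(false_negatives=1500, false_positives=400,
             id_switches=100, num_gt=10_000)
print(f"MOTA = {100 * score:.1f}%")
```

Because IDSw counts are usually small relative to FN and FP, a tracker can lead in MOTA while still recording more identity switches than a competitor, which is why both are reported side by side.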
We selected various methods for comparison—DeepSort, ByteTrack, Deep-OC-SORT, StrongSORT, BoostTrack, SFSORT, Hybrid-SORT, UCMCTrack, TrackFormer, MOTR, and TransTrack—due to their strong performance across the datasets and their compatibility with our evaluation. These methods are recognized for robust detection and track management, with advanced features like occlusion handling and hybrid strategies, aligning with our goal of a comprehensive comparison.
We present a comprehensive comparison of MOT performance across benchmark datasets in Table 5, evaluating several state-of-the-art trackers alongside our proposed ReTrackVLM model. For MOT15, ReTrackVLM achieves competitive results, scoring highest in PT (286) and solidly in MOTA (84.0) and MOTP (88.6), though it has a higher IDSw (1147) than top methods like Hybrid-SORT and SFSORT. In MOT16/17, ReTrackVLM leads in MOTA (79.7) and PT (428) but records a higher IDSw (1889) than some competitors. For MOT20, it excels in MOTA (78.0) and shows balanced MT (727) and low ML (234), demonstrating robustness in challenging environments. While other methods like Deep-OC-SORT and BoostTrack score higher in DanceTrack, ReTrackVLM maintains strong performance across metrics, indicating its versatility.
Overall, ReTrackVLM delivers consistently competitive results across multiple datasets, particularly excelling in MOTA and PT metrics, showcasing its strength in track consistency and accuracy. However, it shows potential for improvement in identity matching, especially in crowded scenes like MOT20, and could benefit from further optimization in complex motion tasks, as evidenced by its performance on DanceTrack. These findings suggest that ReTrackVLM strikes a balanced trade-off between accuracy and robustness but requires further refinement to address IDSw challenges.

5. Discussion and Conclusions

MOT in complex environments with dense crowds and unpredictable motion presents significant challenges, including maintaining consistent identities across frames amidst occlusions and abrupt movements. Traditional methods often struggle with IDSws and association failures due to inadequate modeling of appearance and motion. To address these issues, we propose ReTrackVLM, a novel tracking framework that employs cross-modal embeddings and zero-shot ReID to enhance performance.
ReTrackVLM combines a VLM with an appearance-based ReID system, effectively modeling both visual and semantic information to distinguish objects during ambiguous detections and occlusions. The zero-shot ReID capability enables accurate object matching without additional fine-tuning, relying on previously stored track data. Additionally, a Kalman filter-based motion prediction module estimates object positions between frames, reducing track fragmentation.
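The gallery lookup behind zero-shot ReID can be sketched as follows: a query embedding from a new detection is compared with the stored embeddings of confident tracks using cosine similarity, and the best match above a threshold recovers the identity. This is a minimal illustration under stated assumptions, not the implementation used in ReTrackVLM; the function names, the dictionary-based gallery, the three-dimensional embeddings, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def reidentify(query, gallery, min_similarity=0.8):
    """Return the stored track ID most similar to `query`, or None.

    `gallery` maps track IDs to stored embeddings of confident tracks.
    `min_similarity` is an assumed threshold below which no identity
    is recovered and the detection would start a new track.
    """
    best_id, best_sim = None, min_similarity
    for track_id, embedding in gallery.items():
        sim = cosine_similarity(query, embedding)
        if sim >= best_sim:
            best_id, best_sim = track_id, sim
    return best_id


# Toy gallery with hypothetical 3-D embeddings (real VLM embeddings are much larger).
gallery = {7: [1.0, 0.0, 0.0], 12: [0.0, 1.0, 0.0]}
print(reidentify([0.9, 0.1, 0.0], gallery))  # close to track 7
print(reidentify([0.0, 0.0, 1.0], gallery))  # no stored track is similar
```

No additional training is involved in this step: only stored embeddings are compared, which is what makes the matching zero-shot.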
Extensive evaluations on benchmark datasets (MOT15, MOT16/17, MOT20, DanceTrack, and WildTrack) demonstrate that ReTrackVLM achieves competitive performance, notably excelling in MOTA with 79.7% on MOT16/17. It shows strong results in minimizing IDSws on DanceTrack (684) and WildTrack (761) and in minimizing PT on MOT15 (286) and MOT16/17 (428). However, it has limitations, such as more IDSws on dense and occluded datasets like MOT20, indicating that the zero-shot ReID struggles in such scenarios. Additionally, the motion prediction model could be improved for complex motion patterns, and the reliance on VLMs and cross-modal embeddings introduces computational overhead that may impact real-time tracking applications.
While the current study demonstrates the efficacy of the proposed framework on MOT benchmarks and controlled environments, its applicability to unstructured scenes and diverse object types remains unexplored. Future studies will incorporate datasets like TAO [70] and KITTI [71] to evaluate its generalizability in these contexts. We will also analyze the effect of temporal smoothing and adaptive reinitialization techniques on track stability, together with temporal occlusion reasoning, such as interpolation-based ReID, to improve track recovery after occlusions. Further key improvements include enhancing the ReID module to better handle identity switches in dense scenes, integrating advanced motion models for tracking complex behaviors, and optimizing computational efficiency for real-time applications through efficient architectures or hardware accelerators. Finally, we will explore advanced optimization algorithms to search the parameter space of the motion prediction and ReID modules more effectively. Recently developed methods, such as the Mayfly Optimization Algorithm [72] and the Modified Gorilla Troops Optimizer [73], offer opportunities to refine module configurations further, especially in handling diverse tracking scenarios.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv 2015, arXiv:1504.01942.
  2. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
  3. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003.
  4. Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20993–21002.
  5. Chavdarova, T.; Baqué, P.; Bouquet, S.; Maksai, A.; Jose, C.; Bagautdinov, T.; Lettry, L.; Fua, P.; Gool, L.V.; Fleuret, F. Wildtrack: A multi-camera HD dataset for dense unscripted pedestrian detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5030–5039.
  6. Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E.R. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691.
  7. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17627–17638.
  8. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 11 February 2024).
  9. Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829.
  10. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
  11. Yang, Z.; Huang, Z.; He, D.; Zhang, T.; Yang, F. Dynamic representation-based tracker for long-term pedestrian tracking with occlusion. J. Vis. Commun. Image Represent. 2023, 90, 103710.
  12. Feng, L.; Song, K.; Wang, J.; Yan, Y. Exploring the Potential of Siamese Network for RGBT Object Tracking. J. Vis. Commun. Image Represent. 2023, 95, 103882.
  13. Zhang, G.; Chen, C.; Chen, Y.; Zhang, H.; Zheng, Y. Transformer-based global–local feature learning model for occluded person re-identification. J. Vis. Commun. Image Represent. 2023, 95, 103898.
  14. Liu, Y.; Liang, Y.; Chen, Z. LRHW-AP: Using ranking-based metric as loss for Person Re-Identification. J. Vis. Commun. Image Represent. 2022, 85, 103517.
  15. Wang, D.; Chen, Y.; Wang, W.; Tie, Z.; Fang, X.; Ke, W. Uncertainty-guided joint attention and contextual relation network for person re-identification. J. Vis. Commun. Image Represent. 2023, 93, 103822.
  16. Ahmed, E.; Jones, M.; Marks, T.K. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3908–3916.
  17. Yang, Y.; Yang, J.; Yan, J.; Liao, S.; Yi, D.; Li, S.Z. Salient Color Names for Person Re-identification. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8695, pp. 536–551.
  18. Babaee, M.; Athar, A.; Rigoll, G. Multiple People Tracking Using Hierarchical Deep Tracklet Re-identification. arXiv 2018, arXiv:1811.04091.
  19. Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Ke, W.; Xiong, Z. Multiplex Labeling Graph for Near-Online Tracking in Crowded Scenes. IEEE Internet Things J. 2020, 7, 7892–7902.
  20. Sadeghian, A.; Alahi, A.; Savarese, S. Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies. arXiv 2017, arXiv:1701.01909.
  21. Akilan, T.; Wu, Q.M.J.; Jiang, W. A Feature Embedding Strategy for High-level CNN representations from Multiple ConvNets. arXiv 2017, arXiv:1705.04301.
  22. Kan, S.; Cen, Y.; He, Z.; Zhang, Z.; Zhang, L.; Wang, Y. Supervised Deep Feature Embedding With Handcrafted Feature. IEEE Trans. Image Process. 2019, 28, 5809–5823.
  23. Peng, J.; Wang, T.; Lin, W.; Wang, J.; See, J.; Wen, S.; Ding, E. TPM: Multiple object tracking with tracklet-plane matching. Pattern Recognit. 2020, 107, 107480.
  24. Song, H.O.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep Metric Learning via Lifted Structured Feature Embedding. arXiv 2015, arXiv:1511.06452.
  25. Li, Z.; Yan, Z.; Tian, W.; Zeng, D.; Liu, Y.; Li, W. ReDeformTR: Wildlife Re-identification based on Light-weight Deformable Transformer with Multi-image Feature Fusion. IEEE Access 2024, 12, 106321–106332.
  26. Liu, J.; Li, Q.; Song, S.; Kulyash, K. Detection, tracking and enumeration of marine benthic organisms using an improved YOLO+ DeepSORT network. IEEE Access 2024, 12, 113867–113877.
  27. Alameri, M.; Memon, Q. YOLOv5 Integrated with Recurrent Network for Object Tracking: Experimental Results from a Hardware Platform. IEEE Access 2024, 12, 119733–119742.
  28. Ahn, W.; Ko, K.; Lim, M.; Pae, D.; Kang, T. Multiple object tracking using re-identification model with attention module. Appl. Sci. 2023, 13, 4298.
  29. Xu, L.; Wu, G. Multi-Object Tracking with Grayscale Spatial-Temporal Features. Appl. Sci. 2024, 14, 5900.
  30. Wan, J.; Qian, S.; Tian, Z.; Zhao, Y. An effective framework of multi-class product counting and recognition for automated retail checkout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3282–3290.
  31. Wang, M.; Liu, R.; Lina, S.; Abe, N.; Yamada, S. An effective method for semi-online multi-object tracking refinement. IEEE Access 2024, 12, 60656–60667.
  32. Cittadini, E.; Siena, A.D.; Buttazzo, G. CORT: Class-Oriented Real-time Tracking for Embedded Systems. arXiv 2024, arXiv:2407.17521.
  33. Cheng, W.; Wu, Y.; Wu, Z.; Ling, H.; Hua, G. Towards High Quality Multi-Object Tracking and Segmentation without Mask Supervision. IEEE Trans. Image Process. 2024, 33, 3369–3384.
  34. Hashempoor, H.; Koikara, R.; Hwang, Y.D. FeatureSORT: Essential Features for Effective Tracking. arXiv 2024, arXiv:2407.04249.
  35. Luo, R.; Song, Z.; Ma, L.; Wei, J.; Yang, W.; Yang, M. DiffusionTrack: Diffusion Model for Multi-Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3991–3999.
  36. Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-sort: Weak cues matter for online multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6504–6512.
  37. Wang, Y.-H.; Hsieh, J.-W.; Chen, P.-Y.; Chang, M.-C.; So, H.-H.; Li, X. SmileTrack: Similarity Learning for Occlusion-Aware Multiple Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5740–5748.
  38. Li, C.; Wang, Y.; Liu, X. A Multi-Pedestrian tracking algorithm for dense scenes based on an attention mechanism and dual data association. Appl. Sci. 2022, 12, 9597.
  39. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference On Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
  40. Li, J.; Piao, Y. Multi-Object Tracking Based on Re-Identification Enhancement and Associated Correction. Appl. Sci. 2023, 13, 9528.
  41. Li, H.; Xu, Z.; Ma, C.; Tang, X. Multi-Object Tracking Algorithm for Unmanned Vehicle Autonomous Driving Scene Based on Online Spatiotemporal Feature Correlation. IEEE Access 2024, 12, 116489–116497.
  42. Bayraktar, E.; Wang, Y.; DelBue, A. Fast re-OBJ: Real-time object re-identification in rigid scenes. Mach. Vis. Appl. 2022, 33, 97.
  43. Huang, H.-W.; Yang, C.-Y.; Sun, J.; Kim, P.-K.; Kim, K.-J.; Lee, K.; Huang, C.-I.; Hwang, J.-N. Iterative Scale-Up Expansion-IoU and Deep Features Association for Multi-Object Tracking in Sports. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 163–172.
  44. Yang, C.-Y.; Huang, H.-W.; Jiang, Z.; Kuo, H.-C.; Mei, J.; Huang, C.-I.; Hwang, J.-N. Sea You Later: Metadata-Guided Long-Term Re-Identification for UAV-Based Multi-Object Tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 805–812.
  45. Psalta, A.; Tsironis, V.; Karantzalos, K. Transformer-Based Assignment Decision Network for Multiple Object Tracking. Comput. Vis. Image Underst. 2024, 241, 103957.
  46. Stanojevic, V.D.; Todorovic, B.T. BoostTrack: Boosting the similarity measure and detection confidence for improved multiple object tracking. Mach. Vis. Appl. 2024, 35, 53.
  47. Min, Z.; Hassan, G.M.; Jo, G.-S. Rethinking Motion Estimation: An Outlier Removal Strategy in SORT for Multi-Object Tracking with Camera Moving. IEEE Access 2024, 12, 142819–142837.
  48. Hong, L.; Yan, S.; Zhang, R.; Li, W.; Zhou, X.; Guo, P.; Jiang, K.; Chen, Y.; Li, J.; Chen, Z.; et al. OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19079–19091.
  49. Nguyen, P.; Quach, K.G.; Kitani, K.; Luu, K. Type-to-Track: Retrieve Any Object via Prompt-Based Tracking. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 36.
  50. Wu, Z.; Zheng, J.; Ren, X.; Vasluianu, F.-A.; Ma, C.; Paudel, D.P.; Gool, L.V.; Timofte, R. Single-Model and Any-Modality for Video Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 19156–19166.
  51. Li, Z.; Shi, Y.; Ling, H.; Chen, J.; Liu, B.; Wang, R.; Zhao, C. Viewpoint Disentangling and Generation for Unsupervised Object Re-ID. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–23.
  52. Dai, Y.; Sun, Y.; Liu, J.; Tong, Z.; Duan, L.-Y. Bridging the Source-to-Target Gap for Cross-Domain Person Re-Identification with Intermediate Domains. Int. J. Comput. Vis. 2024, 133, 410–434.
  53. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO (Version 8.0.0) [Computer Software]. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 February 2024).
  54. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123.
  55. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  56. Bayraktar, E.; Yigit, C.B. Conditional-pooling for improved data transmission. Pattern Recognit. 2024, 145, 109978.
  57. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
  58. Suljagic, H.; Bayraktar, E.; Celebi, N. Similarity based person re-identification for multi-object tracking using deep Siamese network. Neural Comput. Appl. 2022, 34, 18171–18182.
  59. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
  60. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
  61. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
  62. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification. arXiv 2023, arXiv:2302.11813.
  63. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737.
  64. Morsali, M.M.; Sharifi, Z.; Fallah, F.; Hashembeiki, S.; Mohammadzade, H.; Shouraki, S.B. SFSORT: Scene features-based simple online real-time tracker. arXiv 2024, arXiv:2404.07553.
  65. Yi, K.; Luo, K.; Luo, X.; Huang, J.; Wu, H.; Hu, R.; Hao, W. UCMCTrack: Multi-object tracking with uniform camera motion compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6702–6710.
  66. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
  67. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-end multiple-object tracking with transformer. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 659–675.
  68. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460.
  69. Yu, E.; Liu, S.; Li, Z.; Yang, J.; Li, Z.; Han, S.; Tao, W. Generalizing multiple object tracking to unseen domains by introducing natural language representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 37, pp. 3304–3312.
  70. Dave, A.; Khurana, T.; Tokmakov, P.; Schmid, C.; Ramanan, D. Tao: A large-scale benchmark for tracking any object. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part V 16; Springer International Publishing: Cham, Switzerland, 2020; pp. 436–454.
  71. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
  72. Zervoudakis, K.; Tsafarakis, S. A mayfly optimization algorithm. Comput. Ind. Eng. 2020, 145, 106559.
  73. Wu, T.; Wu, D.; Jia, H.; Zhang, N.; Almotairi, K.H.; Liu, Q.; Abualigah, L. A Modified Gorilla Troops Optimizer for Global Optimization Problem. Appl. Sci. 2022, 12, 10144.
Figure 1. Diverse and challenging environments captured in the MOT15, MOT16, MOT17, MOT20, DanceTrack, and WildTrack datasets.
Figure 2. Key challenges in MOT systems, including IDSws, occlusions, and ReID failures, as well as issues related to computational complexity and data integration across consecutive frames. The figure highlights the difficulties in maintaining consistent object identities due to occlusions, object interactions, and changes in the scene.
Figure 3. The proposed MOT flow diagram, depicting the integration of object detection, feature extraction, visual language modeling, track management, confident track storage, and ReID modules to achieve robust and accurate tracking of humans in real-world scenarios. The flow diagram addresses key challenges such as IDSws, occlusions, and the need for robust feature utilization, resulting in a more reliable and scalable MOT solution.
Figure 4. Overview of the proposed MOT framework. ReTrackVLM integrates VLMs and a new embedding structure to enhance MOT performance in challenging environments. Fine-tuned YOLOv8 detects pedestrians in input images, generating bbxes that are further processed by XFeat and LightGlue for feature extraction. These features, including descriptors, keypoints, and overlap ratios, are fed into an image encoder and text embedder. The resulting embeddings are fused through a fine-tuned transformer, aiding in motion prediction and data association within the Track Management module. High-confidence tracks are stored in a dedicated database for ReID, ensuring robust tracking continuity across diverse scenarios.
Figure 5. The cross-modal embedding generation and fine-tuning process for the VLMs. Bbxes, keypoints, and descriptors are extracted from detected objects, yielding embeddings that align visual and textual modalities. Fine-tuning the VLMs ensures task-specific optimization for multi-object tracking, enabling robust association of objects across frames based on semantic similarity and appearance features.
Figure 6. Overview of the track management and zero-shot ReID modules, which include motion estimation using a Kalman filter, cosine similarity-based distance calculation, optimal data association via bipartite matching, and zero-shot ReID integration utilizing VLM embeddings for maintaining consistent identities across frames. Confident tracks are stored as an array indexed by unique IDs, supporting robust tracking and ReID. A detailed depiction of the confident track storage mechanism highlights the array-based storage of tracks. The stored data facilitate efficient ReID by comparing gallery embeddings with query detections for identity retrieval.
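The association step summarized in Figure 6 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the dictionary-based gallery, the `max_distance` gate of 0.4, and the toy three-dimensional embeddings are all assumptions made for illustration. The brute-force search over assignments stands in for a Hungarian-type solver and is only practical because the example is tiny. Kalman-filter motion prediction is omitted.

```python
import math
from itertools import permutations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector is treated as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def associate(gallery, queries, max_distance=0.4):
    """Assign query embeddings to gallery track IDs at minimum total cosine distance.

    gallery: dict mapping track ID -> stored embedding (hypothetical layout)
    queries: list of embeddings for the current detections
    Returns (matches, unmatched): matches maps query index -> track ID,
    unmatched lists the query indices left without a track.
    """
    track_ids = list(gallery)
    cost = [[cosine_distance(q, gallery[t]) for t in track_ids] for q in queries]
    n_q, n_t = len(queries), len(track_ids)

    # Enumerate every one-to-one assignment of the smaller side to the larger one.
    # This is exact but factorial in cost; it replaces a Hungarian solver here.
    if n_q <= n_t:
        candidates = ([(i, p[i]) for i in range(n_q)]
                      for p in permutations(range(n_t), n_q))
    else:
        candidates = ([(p[j], j) for j in range(n_t)]
                      for p in permutations(range(n_q), n_t))

    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total

    # Gate the optimal assignment: pairs that are too dissimilar stay unmatched.
    matches = {i: track_ids[j] for i, j in best_pairs if cost[i][j] <= max_distance}
    unmatched = [i for i in range(n_q) if i not in matches]
    return matches, unmatched


# Toy example: two stored tracks (IDs 7 and 9) and three query detections.
gallery = {7: [1.0, 0.0, 0.0], 9: [0.0, 1.0, 0.0]}
queries = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches, unmatched = associate(gallery, queries)
print(matches, unmatched)
```

In this toy run, the third query matches no stored track within the gate, so a tracker would typically start a new track for it.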
Figure 7. Qualitative trajectory tracking results of ReTrackVLM on DanceTrack, MOT15, MOT16, MOT17, and MOT20 datasets. For each dataset, trajectories are visualized at t = 1, t = 100, t = 200, and t = N, with N representing the total frame count (except for MOT20, where t = N is excluded due to excessive overlap). Below the visualizations, the number of tracks per frame is plotted, reflecting the session’s complexity and total duration.
Table 1. Overview of datasets used in ReTrackVLM experiments.
| Dataset | # Seqs | # Frames | Resolution | Annotations | Scenarios | Challenges |
| --- | --- | --- | --- | --- | --- | --- |
| MOT15 | 22 (11/11) | ∼11,200 | 640 × 480 to 1920 × 1080 | Bbxes | Urban Streets, Indoor Scenes | Occlusions, Low Resolution, Camera Motion |
| MOT16/17 | 14 (7/7) | ∼14,200 | 1080p | Bbxes | Crowded Urban Areas | Dense Crowds, Varying Illumination, Occlusions |
| MOT20 | 12 (8/4) | ∼13,400 | 1173 × 880 to 3384 × 2710 | Bbxes | Urban Crowds | Heavy Occlusions, High Density |
| DanceTrack | 100 (40/25/35) | ∼106,000 | 1280 × 720 to 1920 × 1080 | Bbxes | Indoor Dance Sequences | Rapid Motions, Frequent Occlusions, Pose Changes |
| WildTrack | 14 (7/7) | ∼2800 | 1080p | Bbxes, 3D Localization | Campus Scenes, Outdoor | Cross-Camera Occlusions, Varying Lighting |
Notes: # Seqs: Number of sequences in train/test splits. For DanceTrack only, it follows a (train/validation/test) format, e.g., “100 (40/25/35)” means 40 sequences for training, 25 for validation, and 35 for testing. Approximate frame counts (# Frames) are marked with “∼”. Only pedestrian Bbxes are used for evaluations unless otherwise specified. WildTrack includes 3D localization as well. Challenges are specific to the dataset and include common issues like occlusions, crowd density, and varying environmental conditions.
Table 2. Detection performance of fine-tuned YOLOv8 on various datasets.
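| Dataset | Box Precision (%) | Box Recall (%) | mAP50 (%) | mAP50-95 (%) |
| --- | --- | --- | --- | --- |
| MOT15 | 94.8 | 90.1 | 95.2 | 70.4 |
| MOT16/17 | 84.3 | 80.9 | 86.6 | 64.7 |
| MOT20 | 67.6 | 69.4 | 69.5 | 44.5 |
| DanceTrack | 87.9 | 85.2 | 91.8 | 62.4 |
| WildTrack | 89.3 | 81.5 | 90.0 | 58.1 |
| CrowdHuman | 86.5 | 77.7 | 87.3 | 59.3 |
| All Datasets | 71.9 | 75.7 | 76.7 | 48.7 |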
Table 3. Baseline performance metrics for different VLMs in ReTrackVLM. The detector is YOLOv8 for all configurations, no ReID is applied, and track management is handled by a standard DeepSort configuration. Arrows indicate whether a higher or lower value is better for each metric.
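| Dataset | VLM | MOTA ↑ | MOTP ↑ | IDSw ↓ | FM ↓ | MT ↑ | PT ↑ | ML ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOT15 | OpenCLIP | 53.7 | 56.3 | 2654 | 2418 | 144 | 63 | 251 |
| MOT15 | Florence-2 | 57.2 | 59.6 | 1461 | 2365 | 152 | 64 | 248 |
| MOT15 | DETR2 | 58.2 | 60.9 | 1520 | 2388 | 150 | 60 | 240 |
| MOT15 | CLIP | 56.3 | 59.2 | 1211 | 2303 | 181 | 54 | 233 |
| MOT16/17 | OpenCLIP | 49.6 | 50.4 | 2088 | 2714 | 707 | 172 | 693 |
| MOT16/17 | Florence-2 | 52.2 | 56.8 | 1954 | 2839 | 725 | 170 | 609 |
| MOT16/17 | DETR2 | 54.5 | 57.9 | 1876 | 2792 | 816 | 163 | 555 |
| MOT16/17 | CLIP | 58.3 | 59.2 | 1934 | 2817 | 804 | 168 | 542 |
| MOT20 | OpenCLIP | 45.7 | 47.1 | 2526 | 2816 | 230 | 188 | 307 |
| MOT20 | Florence-2 | 48.2 | 47.6 | 2699 | 2054 | 266 | 129 | 295 |
| MOT20 | DETR2 | 50.1 | 53.3 | 1561 | 2234 | 312 | 120 | 324 |
| MOT20 | CLIP | 48.5 | 50.5 | 1453 | 1961 | 334 | 132 | 273 |
| DanceTrack | OpenCLIP | 51.6 | 52.3 | 1725 | 1511 | 304 | 24 | 5 |
| DanceTrack | Florence-2 | 50.2 | 54.7 | 1881 | 1532 | 327 | 27 | 5 |
| DanceTrack | DETR2 | 50.4 | 53.1 | 1486 | 1567 | 336 | 35 | 4 |
| DanceTrack | CLIP | 54.0 | 55.5 | 1261 | 1389 | 349 | 36 | 3 |
| WildTrack | OpenCLIP | 57.3 | 59.1 | 1322 | 925 | 342 | 34 | 269 |
| WildTrack | Florence-2 | 60.8 | 63.6 | 1294 | 889 | 308 | 47 | 278 |
| WildTrack | DETR2 | 64.4 | 67.2 | 1236 | 1007 | 343 | 38 | 253 |
| WildTrack | CLIP | 67.9 | 70.8 | 1285 | 902 | 306 | 41 | 284 |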
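For reading the MOTA, MT, PT, and ML columns, the sketch below assumes the standard CLEAR-MOT and MOTChallenge definitions, which this section does not restate: `MOTA = 1 - (FN + FP + IDSw) / GT`, and a ground-truth trajectory counts as mostly tracked (MT) when it is covered for at least 80% of its lifespan, mostly lost (ML) when covered for at most 20%, and partially tracked (PT) otherwise. The function names and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR-MOT accuracy as a fraction; num_gt is the total number of
    ground-truth boxes summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def classify_trajectory(tracked_frames, total_frames):
    """Label one ground-truth trajectory as MT, PT, or ML by its coverage ratio."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    ratio = tracked_frames / total_frames
    if ratio >= 0.8:   # covered for at least 80% of its lifespan
        return "MT"
    if ratio <= 0.2:   # covered for at most 20% of its lifespan
        return "ML"
    return "PT"


# Hypothetical counts: 300 misses, 150 false alarms, 50 ID switches, 2000 GT boxes.
score = mota(300, 150, 50, 2000)
print(f"MOTA = {100 * score:.1f}%")
print([classify_trajectory(n, 100) for n in (90, 50, 10)])
```

Because identity switches enter the numerator alongside misses and false alarms, a tracker can lower its IDSw count and still score a lower MOTA if its detections degrade, which is why the tables report these columns separately.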
Table 4. Zero-shot ReID performance metrics with no fine-tuned YOLOv8 and confident tracks (no fine-tuned VLMs). The track management was handled via DeepSort. Arrows indicate whether higher or lower values are better for each metric.
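| Dataset | VLM | MOTA ↑ | MOTP ↑ | IDSw ↓ | FM ↓ | MT ↑ | PT ↑ | ML ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOT15 | OpenCLIP | 56.4 | 58.7 | 1328 | 1420 | 169 | 65 | 296 |
| MOT15 | Florence-2 | 58.9 | 60.2 | 1188 | 1241 | 175 | 70 | 283 |
| MOT15 | DETR2 | 60.5 | 63.3 | 1173 | 1201 | 178 | 54 | 241 |
| MOT15 | CLIP | 60.3 | 64.0 | 994 | 1103 | 201 | 61 | 208 |
| MOT16/17 | OpenCLIP | 52.5 | 54.8 | 1712 | 1687 | 257 | 107 | 488 |
| MOT16/17 | Florence-2 | 54.2 | 54.7 | 1633 | 1654 | 283 | 113 | 542 |
| MOT16/17 | DETR2 | 55.8 | 58.9 | 1574 | 1699 | 272 | 97 | 512 |
| MOT16/17 | CLIP | 56.7 | 58.6 | 1607 | 1545 | 302 | 122 | 447 |
| MOT20 | OpenCLIP | 49.5 | 53.3 | 2741 | 2903 | 346 | 202 | 392 |
| MOT20 | Florence-2 | 52.3 | 51.9 | 2344 | 2742 | 382 | 236 | 376 |
| MOT20 | DETR2 | 53.7 | 57.1 | 1914 | 2458 | 397 | 275 | 348 |
| MOT20 | CLIP | 55.4 | 56.8 | 1826 | 2217 | 391 | 257 | 361 |
| DanceTrack | OpenCLIP | 52.1 | 57.2 | 1423 | 1850 | 401 | 33 | 7 |
| DanceTrack | Florence-2 | 54.6 | 59.4 | 1417 | 1788 | 442 | 30 | 7 |
| DanceTrack | DETR2 | 57.9 | 62.4 | 1474 | 1762 | 439 | 47 | 6 |
| DanceTrack | CLIP | 61.8 | 66.3 | 1508 | 1733 | 473 | 41 | 4 |
| WildTrack | OpenCLIP | 60.2 | 63.4 | 883 | 1023 | 364 | 52 | 253 |
| WildTrack | Florence-2 | 62.3 | 65.7 | 809 | 1007 | 381 | 48 | 238 |
| WildTrack | DETR2 | 64.8 | 71.2 | 803 | 977 | 404 | 57 | 267 |
| WildTrack | CLIP | 69.9 | 73.5 | 864 | 943 | 397 | 61 | 220 |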
The values are the zero-shot ReID performance metrics for the various visual language models, obtained without a fine-tuned YOLOv8 in the DeepSort framework.
Table 5. Comparison of multi-object tracking performance across benchmark datasets. Arrows indicate whether higher or lower values are better for each metric, with the best results highlighted in bold.
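| Dataset | Method | MOTA ↑ | MOTP ↑ | IDSw ↓ | FM ↓ | MT ↑ | PT ↑ | ML ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOT15 | DeepSort [60] | 73.7 | 78.5 | 1071 | 1954 | 255 | 193 | 441 |
| MOT15 | ByteTrack [61] | 83.9 | 88.1 | 826 | 1277 | 388 | 203 | 197 |
| MOT15 | Deep-OC-SORT [62] | 83.6 | 86.8 | 803 | 1261 | 379 | 217 | 204 |
| MOT15 | StrongSORT [63] | 84.1 | 89.2 | 821 | 1224 | 408 | 187 | 231 |
| MOT15 | BoostTrack [46] | 84.8 | 89.6 | 792 | 1289 | 411 | 204 | 218 |
| MOT15 | SFSORT [64] | 85.2 | 90.7 | 734 | 1086 | 432 | 198 | 205 |
| MOT15 | Hybrid-SORT [36] | 85.5 | 90.4 | 753 | 1109 | 447 | 208 | 182 |
| MOT15 | UCMCTrack [65] | 84.2 | 89.3 | 776 | 1175 | 403 | 221 | 211 |
| MOT15 | TrackFormer [66] | 84.3 | 87.4 | 917 | 1033 | 329 | 269 | 246 |
| MOT15 | TransTrack [67] | 80.8 | 85.6 | 884 | 956 | 465 | 204 | 289 |
| MOT15 | MOTR [68] | 82.4 | 88.7 | 1238 | 907 | 364 | 197 | 173 |
| MOT15 | LTrack [69] | 84.2 | 89.0 | 913 | 887 | 430 | 226 | 208 |
| MOT15 | ReTrackVLM (Ours) | 84.0 | 88.6 | 1147 | 968 | 374 | 286 | 229 |
| MOT16/17 | DeepSort [60] | 67.3 | 72.9 | 1822 | 2130 | 261 | 209 | 1047 |
| MOT16/17 | ByteTrack [61] | 77.5 | 84.6 | 1328 | 2288 | 748 | 361 | 757 |
| MOT16/17 | Deep-OC-SORT [62] | 76.9 | 83.1 | 1277 | 2143 | 1739 | 324 | 778 |
| MOT16/17 | StrongSORT [63] | 73.2 | 81.7 | 1412 | 2271 | 1354 | 308 | 904 |
| MOT16/17 | BoostTrack [46] | 77.9 | 85.8 | 1293 | 2053 | 1785 | 356 | 806 |
| MOT16/17 | SFSORT [64] | 78.2 | 86.4 | 1301 | 1985 | 1647 | 342 | 769 |
| MOT16/17 | Hybrid-SORT [36] | 76.8 | 75.3 | 1239 | 2164 | 1521 | 380 | 792 |
| MOT16/17 | UCMCTrack [65] | 78.3 | 85.2 | 1586 | 2179 | 1662 | 319 | 783 |
| MOT16/17 | TrackFormer [66] | 72.8 | 70.6 | 2537 | 2440 | 1326 | 254 | 401 |
| MOT16/17 | TransTrack [67] | 75.6 | 81.6 | 3536 | 2312 | 2296 | 386 | 998 |
| MOT16/17 | MOTR [68] | 74.0 | 80.9 | 2673 | 2097 | 1592 | 407 | 696 |
| MOT16/17 | LTrack [69] | 77.4 | 80.2 | 2104 | 2085 | 1809 | 397 | 632 |
| MOT16/17 | ReTrackVLM (Ours) | 79.7 | 78.1 | 1889 | 2103 | 1867 | 428 | 942 |
| MOT20 | DeepSort [60] | 63.3 | 68.7 | 1256 | 2517 | 369 | 157 | 398 |
| MOT20 | ByteTrack [61] | 76.4 | 82.7 | 1168 | 1682 | 576 | 182 | 323 |
| MOT20 | Deep-OC-SORT [62] | 73.9 | 81.2 | 1071 | 1573 | 535 | 193 | 307 |
| MOT20 | StrongSORT [63] | 70.4 | 78.9 | 983 | 1449 | 549 | 164 | 363 |
| MOT20 | BoostTrack [46] | 75.8 | 84.1 | 1097 | 1513 | 591 | 185 | 266 |
| MOT20 | SFSORT [64] | 74.8 | 72.6 | 1204 | 1538 | 557 | 190 | 314 |
| MOT20 | Hybrid-SORT [36] | 74.9 | 81.4 | 1058 | 1632 | 605 | 174 | 272 |
| MOT20 | UCMCTrack [65] | 74.3 | 80.9 | 1201 | 1564 | 550 | 166 | 331 |
| MOT20 | TrackFormer [66] | 68.2 | 80.1 | 1385 | 1796 | 612 | 185 | 231 |
| MOT20 | TransTrack [67] | 70.7 | 79.5 | 1447 | 1735 | 506 | 210 | 354 |
| MOT20 | MOTR [68] | 69.2 | 74.7 | 1675 | 1749 | 466 | 204 | 301 |
| MOT20 | LTrack [69] | 76.0 | 79.2 | 1230 | 1612 | 524 | 189 | 298 |
| MOT20 | ReTrackVLM (Ours) | 75.1 | 78.5 | 1163 | 1650 | 637 | 202 | 263 |
| DanceTrack | DeepSort [60] | 82.2 | 84.5 | 1288 | 903 | 355 | 40 | 257 |
| DanceTrack | ByteTrack [61] | 88.7 | 84.8 | 911 | 878 | 389 | 47 | 288 |
| DanceTrack | Deep-OC-SORT [62] | 90.3 | 95.4 | 776 | 803 | 393 | 44 | 264 |
| DanceTrack | StrongSORT [63] | 88.6 | 83.1 | 809 | 867 | 346 | 52 | 227 |
| DanceTrack | BoostTrack [46] | 89.5 | 94.7 | 852 | 905 | 367 | 54 | 249 |
| DanceTrack | SFSORT [64] | 89.1 | 94.9 | 964 | 824 | 404 | 42 | 234 |
| DanceTrack | Hybrid-SORT [36] | 90.8 | 96.0 | 726 | 812 | 411 | 50 | 253 |
| DanceTrack | UCMCTrack [65] | 87.9 | 94.2 | 823 | 699 | 473 | 36 | 201 |
| DanceTrack | TrackFormer [66] | 78.4 | 81.6 | 1137 | 984 | 428 | 32 | 302 |
| DanceTrack | TransTrack [67] | 87.1 | 96.7 | 1522 | 963 | 509 | 61 | 214 |
| DanceTrack | MOTR [68] | 81.2 | 85.3 | 1439 | 1086 | 374 | 46 | 276 |
| DanceTrack | LTrack [69] | 84.9 | 83.4 | 1320 | 693 | 344 | 58 | 199 |
| DanceTrack | ReTrackVLM (Ours) | 91.2 | 91.6 | 684 | 277 | 257 | 60 | 111 |
| WildTrack | DeepSort [60] | 59.2 | 52.5 | 1422 | 1006 | 259 | 180 | 425 |
| WildTrack | ByteTrack [61] | 68.7 | 75.3 | 1037 | 924 | 389 | 187 | 338 |
| WildTrack | Deep-OC-SORT [62] | 65.3 | 59.8 | 1615 | 1064 | 307 | 196 | 438 |
| WildTrack | StrongSORT [63] | 66.9 | 68.9 | 1240 | 1124 | 365 | 185 | 427 |
| WildTrack | BoostTrack [46] | 64.4 | 69.1 | 962 | 1082 | 287 | 222 | 495 |
| WildTrack | SFSORT [64] | 67.8 | 72.2 | 807 | 978 | 295 | 237 | 504 |
| WildTrack | Hybrid-SORT [36] | 65.8 | 70.6 | 789 | 861 | 436 | 239 | 539 |
| WildTrack | UCMCTrack [65] | 61.2 | 61.8 | 973 | 1246 | 488 | 325 | 441 |
| WildTrack | TrackFormer [66] | 60.3 | 63.3 | 1379 | 1177 | 379 | 164 | 386 |
| WildTrack | TransTrack [67] | 67.9 | 73.4 | 1023 | 936 | 502 | 192 | 363 |
| WildTrack | MOTR [68] | 62.4 | 66.7 | 1154 | 1103 | 474 | 178 | 352 |
| WildTrack | LTrack [69] | 63.3 | 65.8 | 939 | 1048 | 421 | 284 | 461 |
| WildTrack | ReTrackVLM (Ours) | 69.1 | 71.0 | 761 | 946 | 561 | 204 | 509 |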
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
