1. Introduction
Multi-object tracking (MOT) is a fundamental problem in computer vision, with widespread applications ranging from autonomous driving and surveillance systems to robotics and sports analytics. The ability to accurately track multiple objects simultaneously in a dynamic environment is crucial to understanding complex scenes, making real-time decisions, and ensuring the safety and efficiency of systems that rely on visual input. In real-world scenarios, MOT presents several challenges, including occlusions, changes in object appearance, interactions between objects, and variations in lighting and viewpoint. These challenges are exacerbated by the need to maintain consistent identities of objects over time, even when objects leave the field of view or are temporarily obstructed. Traditional approaches to MOT often struggle with these issues, leading to identity switches, track fragmentation, and inaccuracies in localization. Moreover, the diversity of environments and the complexity of human behaviors necessitate robust tracking systems that can be generalized across different domains. As a result, the development of effective MOT algorithms has become a critical area of research aimed at bridging the gap between theoretical advances and practical deployment in real-world systems. The introduction of large-scale, diverse datasets such as MOT15 [
1], MOT16/17 [
2], MOT20 [
3], DanceTrack [
4], and WildTrack [
5] has spurred significant progress in this field, enabling the training and evaluation of more sophisticated models. However, despite these advancements, achieving reliable and scalable multi-object tracking remains a formidable challenge, requiring innovative solutions that can adapt to the complexities of real-world scenarios.
The choice of datasets is crucial for developing and evaluating MOT algorithms, providing diverse benchmarks that reflect real-world complexities. The MOT Challenge series (MOT15, MOT16, MOT17, MOT20) is central to MOT research, capturing urban scenes with varying pedestrian densities and challenges like occlusions, camera motion, and lighting. While MOT15 laid the foundation, later versions introduced higher-quality annotations and more complex scenarios. DanceTrack tests algorithms on fast, intertwined movements in dance sequences, emphasizing identity maintenance in dynamic situations. WildTrack adds complexity with its multi-camera setup, which is essential for wide-area coverage and testing scalability in applications like sports analytics and surveillance.
The diverse and challenging environments captured in the MOT15, MOT16, MOT17, MOT20, DanceTrack, and WildTrack datasets are illustrated in
Figure 1, further highlighting the importance of using diverse datasets to evaluate the performance of MOT algorithms since they encompass a broad spectrum of challenges, from standard urban environments to extreme conditions.
They serve as invaluable tools for researchers to develop and validate MOT algorithms that are not only effective in controlled settings but are also robust and generalizable to real-world applications.
Despite notable advancements in MOT, current methods still face limitations in real-world scenarios. A key issue is identity switches (IDSws), where trackers fail to maintain object identities, especially in crowded or cluttered settings due to occlusions or objects leaving and re-entering the field of view. Conventional approaches, relying on motion models and appearance-based re-identification (ReID), struggle with significant appearance changes or unpredictable motions. Many methods also rely on handcrafted features or shallow models that lack robustness and generalization, requiring fine-tuning across datasets. The increasing scale of data and demand for real-time processing highlight the need for more efficient algorithms, but high computational complexity limits their deployment in real-time systems. Additionally, many MOT frameworks under-use rich contextual information such as keypoints and descriptors, leading to suboptimal performance and challenges with track consistency and false positives. Addressing these limitations requires advanced embedding techniques, improved use of contextual data, and more scalable algorithms to enhance performance across diverse applications.
Figure 2 illustrates these issues, showing tracking failures due to IDSws, occlusions, and inefficiencies in current systems.
In the first frame (t), three individuals (ID1, ID2, and ID3) are correctly detected. In the next frame (t + 1), ID1 and ID3 are still tracked (TP), but ID2 is occluded or exits, causing a false negative (FN). A new individual (ID4) enters but is mistakenly identified as ID2, leading to an IDSw. By the third frame (t + 2), ID1 is missed (FN), ID3 remains correctly tracked (TP), and ID2 reappears but is confused with ID5, causing another IDSw. Additionally, ID5 is falsely introduced (FP), while ID4 is correctly tracked, reflecting a true negative (TN). This highlights how IDSws, FNs, and FPs degrade tracking, especially in dynamic, crowded scenes.
Traditional MOT methods often suffer from IDSws and track fragmentation (FM), which degrade tracking accuracy and continuity, particularly in crowded and complex environments. ReTrackVLM aims to tackle these challenges by leveraging the advanced detection capabilities of YOLOv8, the robust feature representation of vision–language models (VLMs), and a zero-shot ReID capability for associating tracks across frames. By integrating these components into a unified pipeline, the proposed method reduces IDSws and FM, offering a robust and scalable solution for MOT. The embedding of ReTrackVLM, which is generated using XFeat [
6] and LightGlue [
7], encapsulates spatial and appearance features essential for accurate tracking. Integrated with fine-tuned VLMs like Detectron2 (DETR2) [
8], Florence-2 [
9], and CLIP [
10], ReTrackVLM evaluates whether a query bounding box (bbx) belongs to an existing track, improving accuracy by reducing IDSws. The system’s track management module, combined with confident track storage and a zero-shot re-identification module, ensures consistent long-term tracking, even when objects reappear after occlusion. ReTrackVLM delivers a robust, scalable tracking system for real-world applications. We can summarize the main contributions of this work as follows:
Novel Embedding Structure: We propose a novel embedding structure that combines bbxes, keypoints, descriptors, overlap ratios, and confidence scores, providing a comprehensive representation of detected objects in MOT tasks.
Integration with VLMs: ReTrackVLM leverages fine-tuned VLMs, such as DETR2, Florence-2, CLIP, and OpenCLIP, to improve the accuracy and robustness of object tracking by determining the association between bbxes and existing tracks.
Enhanced Track Management: We introduce a track management module that utilizes a confident track storage system, ensuring consistent track maintenance and reducing the occurrence of IDSws in challenging scenarios.
Zero-Shot ReID: A zero-shot ReID module is incorporated, allowing the system to match newly detected tracks with previously established ones without requiring extensive retraining, thereby enhancing the adaptability of the system to varying environments.
Extensive Evaluation on Diverse Datasets: We validate our approach through extensive experiments on diverse datasets, including MOT15, MOT16, MOT17, DanceTrack, and WildTrack, demonstrating its superior performance compared to existing state-of-the-art methods.
The remainder of this paper is organized as follows. In
Section 2, we review related studies in the field of MOT, highlighting both traditional approaches and recent advancements.
Section 3 details the proposed methodology, including our embedding structure and its integration with fine-tuned VLMs for robust object tracking. We also elaborate on the track management module and the zero-shot re-identification system, which collectively enhance tracking consistency and accuracy. In
Section 4, we describe the datasets used for training and evaluation, including MOT15, MOT16, MOT17, DanceTrack, and WildTrack, discuss the experimental setup, and present our results, comparing the performance of our approach against state-of-the-art methods across these diverse datasets. Finally,
Section 5 concludes the paper with a summary of our findings, potential applications of our work, and directions for future research.
2. Related Works
Object occlusion poses a significant challenge in multi-object tracking (MOT), often causing mismatches when occluded objects reappear. Consistently maintaining object IDs, even amidst varying occlusion scenarios, is essential for tracking accuracy. Many solutions have been proposed to enhance robustness and accuracy in MOT, with our methodology integrating advanced techniques targeting occlusion complexities. Key challenges include the following: (i) Temporary object occlusions disrupt the tracking process, making it difficult to maintain accurate tracks as objects obscure each other. (ii) Changes in illumination, pose, or scale result in significant appearance variations, requiring robust algorithms capable of adapting to these changes. (iii) Real-time processing is essential, as tracking multiple objects in high-resolution video streams demands computational efficiency. (iv) In dense environments with closely interacting objects, maintaining track identity becomes particularly challenging.
To address these, various approaches have emerged. For example, a dynamic representation-based tracker [
11] uses an adaptive representation network and pose supervision for long-term tracking with occlusions. The SiamFEA tracker [
12] combines visible and infrared modalities using self-attention mechanisms. iReIDNet [
13] enhances person ReID through spatial feature transforms and coordinate attention. A transformer-based dual-branch model [
14] improves performance via global–local feature interaction, while contextual relation networks [
15] tackle similar local feature issues.
Advancements also include deep convolutional architectures [
16], color descriptor-based ReID methods [
17], hierarchical clustering frameworks [
18], graph-based approaches [
19], and recurrent neural networks [
20] to enhance data association and occlusion handling.
Recent work on feature embedding aims to reduce dimensionality while retaining key characteristics, with strategies like feature combination from multiple DCNNs [
21] and supervised embedding methods [
22] improving classification and metric learning. The integration of these techniques into tracking systems [
23] has enhanced MOT performance through short tracklets and tracklet–plane matching. Furthermore, structured prediction optimizes feature embedding [
24]. Although models like ReDeformTR [
25] effectively track animals across cameras, they lack the versatility of ReTrackVLM, which tracks various objects without complex feature fusion. Additionally, YOLOv5+DeepSORT [
26] enhances underwater tracking accuracy for marine creatures, while [
27] employs YOLOv5 with recurrent networks for real-time object tracking in challenging conditions. Furthermore, a triplet-based MOT method exploiting an attention-based ReID module is presented in [
28] to enhance object association, particularly in challenging occlusion scenarios. A method proposed in [
29] exploits grayscale spatial–temporal features to improve tracking speed and efficiency, especially for devices with limited computing power; it uses a grayscale mapping technique to acquire spatial–temporal features, allowing for direct target localization in previous frames and reducing the computational burden associated with ID matching.
In multi-camera tracking, the end-to-end approach in [
30] utilizes probabilistic association and detection embeddings to manage scenarios effectively, though it struggles with complex occlusions. The semi-online tracking refinement method in [
31] corrects IDSws by monitoring appearance similarity changes over time. Similarly, the multi-class tracking approach in [
32] achieves predictable execution times by class-splitting the Hungarian matrix, though this may sacrifice accuracy in dense scenarios. A framework in [
33] employs weakly supervised multi-object tracking and segmentation for improved mask consistency, but our method enhances this with a novel embedding structure and track management module.
The online multi-object tracker in [
34] combines various appearance features with a ReID network to reduce IDSws, while our method leverages visual language models (VLMs) and advanced track management for superior IDSw and occlusion handling. The denoising diffusion strategy in [
35] enhances tracking by jointly detecting and associating objects, yet ReTrackVLM offers a more comprehensive solution with its track management module and zero-shot ReID system. The approach in [
36] uses weak cues alongside strong spatial and appearance information, whereas our method ensures better generalization through hybrid VLMs and detailed embeddings. The SMILETrack method in [
37] combines an efficient detector with a Siamese network for similarity learning; our approach surpasses it with a more advanced embedding structure and track management module. AMDDATrack in [
38] is designed to improve tracking accuracy by addressing trajectory breaks caused by dropped low-scoring detections; it utilizes an improved CenterNet [
39] detection network incorporating a feature pyramid network, a high-resolution feature map, and a spatial attention mechanism to enhance detection accuracy. Similarly, REACTrack, proposed in [
40], also employs CenterNet, but with a focus on enhancing ReID robustness and correcting tracking association errors, especially in complex scenarios like occlusion. The work in [
41] integrates temporal and spatial features using a Kalman filter and Hungarian algorithm for tracking in self-driving cars, but struggles with similar-looking or closely blocked objects. Fast re-OBJ [
42] improves performance by tightly coupling instance segmentation and embedding generation for a more discriminative representation.
The sports-focused Deep-EIoU method [
43] replaces the Kalman filter with an iterative scale-up approach, while our work broadens applicability through fine-tuning on diverse datasets and a novel ReID module. The MG-MOT algorithm in [
44] integrates UAV metadata for maritime ReID, demonstrating strong performance, but our approach generalizes better across various environments.
ReTrackVLM distinguishes itself by integrating VLMs and a novel embedding structure specifically designed for MOT in complex environments characterized by severe occlusions. Recent research supports the effectiveness of this approach, as shown in [
45,
46], which introduce robust MOT methods for sports scenarios and real-time solutions, respectively, emphasizing the importance of the embedding strategies and track management central to our methodology.
Additionally, methods like OR-SORT [
47] and OneTracker [
48] address challenges related to camera motion and multi-modality through a Foundation Tracker, highlighting the potential of VLMs and advanced embeddings in MOT. Similarly, refs. [
49,
50] propose innovative MOT paradigms incorporating natural language descriptions and multi-modality tracking. Our method builds upon these advancements by refining embedding and track management techniques specifically tailored for human tracking across diverse datasets.
Furthermore, the perspective disentanglement framework for ReID in [
51] aligns with our focus on robust embedding strategies for track management. Cross-domain ReID methods [
52] highlight ongoing efforts to improve domain adaptation in MOT. While other MOT methods have incorporated VLMs, ReTrackVLM distinguishes itself through its integration of zero-shot ReID for track association and its enhanced track management module. Unlike existing methods that rely solely on fine-tuned embeddings or pre-trained models, ReTrackVLM effectively utilizes cross-modal embeddings to improve robustness in challenging scenarios, such as occlusion and crowded scenes. The novelty lies in its ability to seamlessly integrate these components into a unified framework, which is validated across diverse datasets.
3. Methodology
The proposed MOT flow diagram in
Figure 3 addresses challenges in real-world tracking scenarios. It begins with a robust object detection phase using a fine-tuned model to accurately identify and localize humans in images or video streams, generating precise bbx coordinates. Next, an intricate feature extraction process employs advanced techniques such as XFeat and LightGlue to derive keypoints, descriptors, and confidence scores from detected bbxes. These features are input into a VLM that assesses whether a detected human matches an existing track, producing a Boolean response. The track management module then handles motion prediction and data association, ensuring accurate track continuity across frames despite challenges such as occlusions and sudden motion changes. To enhance reliability, the system incorporates confident track storage, archiving established tracks to reduce IDSws and improve long-term tracking fidelity. Finally, the ReID module uses stored information to reassociate newly detected humans with existing tracks, ensuring robust ID preservation and minimizing errors in dynamic environments. This integrated flow diagram not only overcomes the limitations of traditional MOT systems but also offers a scalable and efficient solution for diverse and challenging conditions, making it ideal for modern large-scale applications.
ReTrackVLM is designed to overcome the persistent challenges encountered in real-world tracking scenarios, integrating VLMs and a novel embedding structure. As can be seen in
Figure 4, the operation procedure of ReTrackVLM starts with fine-tuned YOLOv8 [
53] detecting pedestrians in input images, followed by feature extraction using XFeat and LightGlue, which generate robust descriptors and keypoints for each bbx. These features are processed through an image encoder and text embedder, then fused using a fine-tuned transformer to predict motion and manage data association. The track management module ensures accurate tracking by predicting movements and refining appearance descriptors. High-confidence tracks are stored in a dedicated database for ReID, allowing the system to maintain track continuity even in challenging scenarios such as occlusions and complex motions. This combination of modules improves the accuracy of the tracking and reduces IDSws across various datasets.
The ReTrackVLM pipeline, which is given in Algorithm 1, processes video frames for multi-object tracking through object detection, track prediction, and re-identification. It starts by initializing an empty track set and confident track storage (CTS) for managing tracks across frames. For each frame, objects are detected using a pre-trained YOLOv8, with keypoints, descriptors, and confidence scores forming embeddings. A Kalman filter predicts object locations based on previous tracks, and detected objects are associated with these predictions using IoU and appearance similarity scores. The Hungarian algorithm handles associations, updating matched tracks, initializing new ones, and maintaining unmatched tracks.
Unmatched tracks are further processed using zero-shot ReID, querying the CTS for identity matches without retraining. High-confidence tracks are added to CTS for future use. After processing all frames, low-confidence tracks are removed, and trajectories are smoothed for output. The final tracks are saved in a format compatible with the MOT Challenge for evaluation.
Algorithm 1 The complete ReTrackVLM framework.
Input: Video frames $\{f_t\}_{t=1}^{N}$; Output: Object tracks $T$
 1: Initialize:
 2:   $T \leftarrow \emptyset$   ▹ Initial tracks
 3:   $\mathrm{CTS} \leftarrow \emptyset$   ▹ Confident Track Storage
 4: for each frame $f_t$ do
 5:   Object Detection:
 6:     Apply the YOLOv8 detector to $f_t$.
 7:     Extract bounding boxes $B_t$.
 8:   for each bounding box $b_i \in B_t$ do
 9:     Extract keypoints and descriptors using XFeat.
10:     Compute overlap ratios and confidence scores.
11:     Construct embeddings $E_t$.
12:   end for
13:   Track Prediction:
14:     Predict positions of existing tracks $T$ using a Kalman filter.
15:     Generate predicted states $P$.
16:   Association:
17:     Compute association scores between $E_t$ and $P$ using:
18:       (a) IoU for spatial alignment.
19:       (b) Appearance similarity from VLM embeddings.
20:     Solve the association problem using the Hungarian algorithm.
21:   Track Update:
22:     Update $T$ based on associations:
23:       - Matched bounding boxes are assigned to tracks.
24:       - Unmatched bounding boxes initialize new tracks.
25:       - Unmatched tracks are updated with predicted states.
26:   Zero-Shot ReID:
27:     Query $\mathrm{CTS}$ for unmatched tracks:
28:       - Compare embeddings of unmatched tracks with $\mathrm{CTS}$.
29:       - Re-identify tracks based on similarity scores.
30:       - Update track identities if a match is found.
31:   Confident Track Storage Update:
32:     Add high-confidence tracks to $\mathrm{CTS}$:
33:       - Tracks with stable identities over multiple frames are stored.
34: end for
35: Post-Processing:
36:   Remove fragmented or low-confidence tracks.
37:   Smooth track trajectories for visualization.
38: Output Results:
39:   Save tracks $T$ in MOT-compliant format for evaluation.
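As a concrete illustration of the final output step of Algorithm 1, the following minimal Python sketch writes finished tracks in the MOTChallenge submission format (frame, id, bb_left, bb_top, bb_width, bb_height, conf, -1, -1, -1); the dictionary-based track container is a hypothetical simplification of the framework's internal track objects.

```python
def save_mot_results(tracks, output_path):
    """Write tracks in the MOTChallenge submission format.

    `tracks` is assumed to be a list of dicts with keys
    'frame', 'track_id', 'bbox' (x, y, w, h), and 'conf';
    this container is a simplified stand-in for the internal track objects.
    """
    rows = []
    for t in tracks:
        x, y, w, h = t["bbox"]
        rows.append((t["frame"], t["track_id"], x, y, w, h, t["conf"]))
    rows.sort()  # evaluation tools expect rows ordered by frame number
    with open(output_path, "w") as f:
        for frame, tid, x, y, w, h, conf in rows:
            # The last three fields (world x, y, z) are unused and set to -1.
            f.write(f"{frame},{tid},{x:.2f},{y:.2f},{w:.2f},{h:.2f},{conf:.3f},-1,-1,-1\n")
```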
3.1. Data Preprocessing and Input Representation
We utilize several well-known datasets in the MOT domain, including MOT15, MOT16/17 (which contains refined annotations), MOT20, DanceTrack, and WildTrack. Additionally, we incorporated CrowdHuman [
54], which focuses on extremely crowded scenes with severe occlusions and dense pedestrian groupings, and is crucial for applications like public safety and surveillance. Each dataset offers a rich set of annotated frames focusing on pedestrian detection, which are preprocessed to ensure consistency and compatibility with the YOLOv8 framework and subsequent ReID processes. The preprocessing steps involve the following key operations:
Resizing: All input images are resized to a uniform size of 640 × 640 pixels to match the input requirements of the YOLOv8 model. This resizing ensures that the aspect ratios of the pedestrians are preserved, minimizing distortion.
Normalization: Pixel values are normalized to the range [0, 1] to facilitate faster convergence during model training. This normalization is performed using the mean and standard deviation of the ImageNet dataset [
55], which align with the pre-trained weights used in the model.
Data Augmentation: Data augmentation techniques such as horizontal flipping, random cropping, rotation, scaling, and color jittering are applied to enhance the model generalization capability. The augmentation parameters are carefully selected to maintain the integrity of the bbxes and the corresponding pedestrian identities.
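A minimal sketch of these preprocessing steps, assuming a torchvision-based pipeline, is shown below; the augmentation magnitudes are illustrative rather than the values used in our training configuration, and in practice the 640 × 640 resize is handled by YOLOv8's own letterbox loader so that aspect ratios are preserved.

```python
# Illustrative preprocessing pipeline (image-level only; box-aware augmentation
# is handled separately by the detector's data loader).
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transforms = transforms.Compose([
    transforms.Resize((640, 640)),                          # uniform input size for YOLOv8
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flipping
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),            # color jittering
    transforms.RandomAffine(degrees=5, scale=(0.9, 1.1)),   # small rotation and scaling
    transforms.ToTensor(),                                  # pixel values scaled to [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),      # ImageNet statistics
])
```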
We used the training and validation splits originally provided by the datasets, with the training set used to optimize the model weights and the validation set used to monitor performance and prevent overfitting. Let $B = \{b_1, b_2, \ldots, b_n\}$ denote the set of bbxes in an image, where each bbx is represented as $b_i = (x_i, y_i, w_i, h_i)$. Here, $x_i$ and $y_i$ are the coordinates of the top-left corner of the bbx, and $w_i$ and $h_i$ are the width and height of the bbx, respectively. The content of the bbxes is further processed to extract features using the XFeat and LightGlue methods, which are critical for cross-modal embeddings and ReID tasks.
The YOLOv8 model is fine-tuned on the preprocessed datasets to optimize pedestrian detection performance by following the fine-tuning procedure explained in [
56]. The training process involves the following steps:
[1] Model Initialization: The YOLOv8 model is initialized with pre-trained weights from the COCO dataset [
57]. The model is designed to detect pedestrians with high accuracy, utilizing the features extracted from the bbxes, and is optimized with the composite loss in Equation (1):

$$\mathcal{L} = \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{conf}} + \mathcal{L}_{\text{cls}},$$

where $\mathcal{L}_{\text{box}}$ is the regression loss of the bbx, $\mathcal{L}_{\text{conf}}$ is the confidence loss, and $\mathcal{L}_{\text{cls}}$ is the classification loss.
[2] Training Configuration: The model is trained using a custom configuration defined in a YAML file, specifying training parameters such as batch size, learning rate, augmentation settings, and the number of epochs.
[3] Training Procedure: The model is trained for 100 epochs with early stopping based on validation performance. The multi-scale training strategy is employed, where input images are randomly resized during training to improve the robustness of the model to varying scales.
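The fine-tuning step can be reproduced with the Ultralytics training API, as in the hedged sketch below; the dataset YAML path and model size are placeholders, and only the hyperparameters stated in the text (100 epochs, 640 × 640 inputs, batch size 16, flip and scale augmentation, early stopping) are shown.

```python
# Sketch of the YOLOv8 fine-tuning step using the Ultralytics API.
# "mot_person.yaml" is a placeholder for the dataset configuration file that
# lists the converted train/val splits and the single "person" class.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")     # COCO pre-trained weights; the model size is an assumption
model.train(
    data="mot_person.yaml",
    epochs=100,                # early stopping monitors validation performance
    imgsz=640,
    batch=16,
    fliplr=0.5,                # horizontal flip augmentation
    scale=0.5,                 # random rescaling, approximating multi-scale training
    patience=20,               # early-stopping patience (assumed value)
)
```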
The fine-tuned YOLOv8 model serves as the backbone for the tracking pipeline, providing accurate detections that are fed into the cross-modal embeddings and track management modules.
In the feature extraction stage using XFeat and LightGlue, we obtain robust features from the bbxes. These features are employed as inputs to the VLMs and play an important role in track management and ReID. XFeat captures deep semantic features, while LightGlue focuses on point-based features that are useful for matching and association tasks. The extracted features, denoted as $f_i$ for each bbx $b_i$, form a high-dimensional vector representing the appearance and spatial characteristics of the detected pedestrian. LightGlue is a feature matching technique that facilitates the alignment of bbxes across consecutive frames. It computes descriptors that are used to associate detections between frames, thus enabling accurate tracking. Let $d_i$ represent the descriptor vector extracted by LightGlue for $b_i$. The similarity between descriptors from different frames is calculated using a distance metric, typically the cosine similarity given in Equation (2):

$$\mathrm{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}.$$
The keypoint (kp) detection structure of XFeat is designed to achieve a balance between accuracy and computational efficiency, making it suitable for deployment on hardware-constrained devices such as mobile robots and embedded systems. The kp detection is handled by a dedicated branch within the network, a design choice that deviates from traditional approaches where kp detection and descriptor extraction are typically coupled. This decoupling allows XFeat to independently optimize each task, resulting in faster and more accurate kp detection. The kp detection branch in XFeat processes the input image by first transforming it into a grid structure, with each grid cell representing an 8 × 8 pixel region. The image is then reshaped into a 64-dimensional feature vector for each grid cell, preserving spatial granularity. A series of rapid 1 × 1 convolutions are applied to this representation to regress the kp coordinates efficiently. The final output of this branch is a kp embedding as given in Equation (
3):

$$\mathbf{K} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 65},$$

where each keypoint is classified into 1 of 64 possible positions within its corresponding 8 × 8 cell, with an additional “dustbin” option for cases where no kp is detected. The dustbin is discarded during inference, and the remaining heatmap is interpreted as an 8 × 8 cell.
The descriptor extraction process in XFeat focuses on generating a dense feature map $\mathbf{F}$ with compact 64-dimensional descriptors. This map is built using a multi-scale feature merging strategy, which enhances the robustness of the network to variations in viewpoint and illumination, both critical aspects for real-world applications such as mobile robotics. The dense feature map is given in Equation (4):

$$\mathbf{F} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 64}.$$

It is obtained by merging features from different scales of the image. The merging process involves bilinear upsampling of intermediate representations to match the resolution of the final map, followed by an element-wise summation. This strategy leverages the benefits of feature pyramids to increase the receptive field of the network while maintaining the compactness of the descriptors. A convolutional fusion block, consisting of three basic layers, combines these representations into the final feature map. An additional convolutional block is employed to generate the reliability map given in Equation (5):

$$\mathbf{R} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8}},$$

which models the unconditional probability $R_{ij}$ that a given local feature $\mathbf{F}_{ij}$ can be confidently matched. This map plays a crucial role in filtering out unreliable features, further enhancing the accuracy of the matching process. For dense matching, XFeat introduces a lightweight module that enables semi-dense matching while controlling memory and computational footprints. This module selects the most reliable image regions based on their reliability scores $\mathbf{R}$ and caches them for future matching. The matching process employs a simple Multi-Layer Perceptron (MLP) for coarse-to-fine matching, avoiding the need for high-resolution feature maps. Given the dense feature map $\mathbf{F}$ or a cached subset of it, the MLP predicts pixel-level offsets $\mathbf{o}$ between matching features from an image pair $(I_a, I_b)$. The prediction of the offsets $\mathbf{o}$ is conditioned on the matched feature pair $(\mathbf{f}_a, \mathbf{f}_b)$ and is formulated as in Equation (6):

$$\mathbf{o} = \mathrm{MLP}\!\left([\mathbf{f}_a, \mathbf{f}_b]\right),$$

where $\mathbf{o}$ represents the logits of a probability distribution over the possible offsets. This refinement strategy allows for efficient pixel-level matching by reducing the search space, making it particularly suitable for resource-constrained settings. We employ XFeat due to its focus on both accuracy and efficiency: the decoupling of keypoint detection and descriptor extraction, the use of multi-scale feature merging, and the integration of reliability maps all contribute to its robust performance in real-world applications. With its ability to perform real-time inference on limited hardware, XFeat is also well suited to further applications such as mobile robotics and augmented reality.
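To make the role of the descriptors concrete, the following numpy sketch matches 64-dimensional descriptors between two consecutive frames using the cosine similarity of Equation (2) and mutual nearest-neighbour filtering; it abstracts away the actual XFeat/LightGlue inference and assumes the descriptor arrays are already available.

```python
import numpy as np

def cosine_similarity_matrix(desc_a, desc_b, eps=1e-8):
    """Pairwise cosine similarity between two sets of 64-D descriptors (Eq. 2)."""
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + eps)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + eps)
    return a @ b.T

def mutual_nearest_matches(desc_a, desc_b, min_sim=0.82):
    """Associate descriptors across consecutive frames by mutual nearest neighbours.

    The 0.82 threshold is an illustrative value, not the one used in the paper.
    """
    sim = cosine_similarity_matrix(desc_a, desc_b)
    best_b_for_a = sim.argmax(axis=1)
    best_a_for_b = sim.argmax(axis=0)
    matches = []
    for i, j in enumerate(best_b_for_a):
        if best_a_for_b[j] == i and sim[i, j] >= min_sim:
            matches.append((i, j, float(sim[i, j])))
    return matches

# Example with random stand-ins for XFeat/LightGlue descriptors.
rng = np.random.default_rng(0)
prev_desc = rng.normal(size=(50, 64))
curr_desc = rng.normal(size=(48, 64))
print(len(mutual_nearest_matches(prev_desc, curr_desc, min_sim=0.0)))
```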
ReTrackVLM processes input video frames by detecting objects using YOLOv8, extracting bbxes, keypoints, and descriptors for each detected object during the stage given in Algorithm 2. Additional features, such as overlap ratios and confidence scores, are calculated. These components are combined into embeddings, normalized for compatibility with YOLOv8, and serve as input for the subsequent stages.
Algorithm 2 Data Preprocessing and Representation
Input: Video frame $f_t$, pre-trained YOLOv8 detector, keypoint extractor (XFeat); Output: Set of embeddings $E_t$ for frame $f_t$
 1: Detect objects in frame $f_t$ using YOLOv8, obtaining bounding boxes $B_t$
 2: for each bounding box $b_i \in B_t$ do
 3:   Extract keypoints $k_i$ and descriptors $d_i$ using XFeat
 4:   Compute overlap ratios $o_i$ with other bounding boxes
 5:   Compute confidence score $c_i$
 6:   Construct embedding $e_i = (b_i, k_i, d_i, o_i, c_i)$
 7:   Add $e_i$ to $E_t$
 8: end for
 9: Normalize bounding box parameters to $[0, 1]$ for compatibility with YOLOv8
10: return $E_t$
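A simplified sketch of the per-detection embedding assembled in Algorithm 2 is given below; the record layout follows the description above, while the IoU-based overlap computation is a plain stand-in for the module's actual implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DetectionEmbedding:
    bbox: np.ndarray          # (x, y, w, h), normalized to [0, 1]
    keypoints: np.ndarray     # (K, 2) keypoint coordinates from XFeat
    descriptors: np.ndarray   # (K, 64) local descriptors
    confidence: float         # detector confidence score
    overlap: float = 0.0      # max IoU with the other boxes in the frame

def iou(box_a, box_b):
    """IoU of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def build_embeddings(boxes, keypoints, descriptors, scores):
    """Assemble per-detection embeddings and their overlap ratios."""
    embeddings = []
    for i, box in enumerate(boxes):
        overlaps = [iou(box, other) for j, other in enumerate(boxes) if j != i]
        embeddings.append(DetectionEmbedding(
            bbox=np.asarray(box, dtype=float),
            keypoints=np.asarray(keypoints[i], dtype=float),
            descriptors=np.asarray(descriptors[i], dtype=float),
            confidence=float(scores[i]),
            overlap=max(overlaps) if overlaps else 0.0,
        ))
    return embeddings
```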
3.2. Cross-Modal Embeddings and Fine-Tuning of VLMs
To create a robust cross-modal embedding for multi-object tracking (MOT), our approach integrates visual and textual representations derived from image data. These embeddings are generated from different datasets, including MOT15, MOT16, MOT17, DanceTrack, CrowdHuman, and WildTrack, and are designed to be compatible with several VLM architectures, such as Florence-2, CLIP, OpenCLIP, and DETR2.
The embeddings for each VLM are constructed by associating the extracted visual features with corresponding textual or categorical descriptions. This process is repeated for each image in the dataset, resulting in structured data stored in various formats suitable for the target VLMs:
Florence-2 Embedding incorporates visual features with bbxes, keypoints, descriptors, overlap ratios, and confidence scores, labeling each entity as a “pedestrian”.
CLIP and OpenCLIP Embeddings convert visual features into textual descriptions, detailing the position, keypoints, descriptors, and confidence scores of each human figure within the frame.
DETR2 Embedding organizes features in a format that includes bbxes, keypoints, descriptors, and overlap ratios, compatible with the DETR2 framework. These data are stored in a DETR2-specific JSON format, allowing for seamless integration with downstream tasks.
To create effective cross-modal embeddings for MOT tasks, we first define a multi-dimensional vector representation for each tracklet. The embedding vector $E_i$ for tracklet $i$ is composed of various features, as given in Equation (7):

$$E_i = \left[ V_i, \; K_i, \; D_i, \; B_i, \; C_i, \; O_i \right],$$

where $V_i$ represents the visual features, including the image file name, image height $h$, and image width $w$; $K_i$ is the keypoint vector extracted using XFeat, representing the spatial configuration of the detected human’s keypoints; $D_i$ is the 64-dimensional descriptor vector derived from LightGlue, encoding the appearance characteristics; $B_i$ represents the bbx coordinates $(x, y, w, h)$, capturing the spatial extent of the detected object; $C_i$ is the confidence score associated with the detection; and $O_i$ is the overlap ratio of bbxes across frames, used for temporal consistency. It measures the degree of spatial overlap between different bbxes, which is crucial for managing occlusions and ensuring accurate tracking across frames.
These features are concatenated to form a unified embedding that captures both visual and spatial–temporal information.
To bridge the gap between different modalities (e.g., image and textual descriptions), we project the embeddings into a shared latent space $L$. Let $W_v$ and $W_t$ be the projection matrices for visual and textual features, respectively. The projection can be formulated as given in Equation (8):

$$z_i^{v} = W_v E_i^{v}, \qquad z_i^{t} = W_t E_i^{t},$$

where $z_i^{v}, z_i^{t} \in L$ are the projected visual and textual embeddings of tracklet $i$. The loss function for aligning the modalities is typically based on contrastive learning, as given in Equation (9):

$$\mathcal{L}_{\text{con}} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i^{v}, z_i^{t})/\tau\right)}{\sum_{j} \exp\!\left(\mathrm{sim}(z_i^{v}, z_j^{t})/\tau\right)},$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature parameter. For fine-tuning VLMs, we leverage the cross-modal embeddings to refine the decision boundaries in the joint embedding space. Given a query bbx $b_q$, we compute its embedding $e_q$ and compare it against the embeddings of existing tracks as given in Equation (10):

$$s_j = \mathrm{sim}(e_q, e_j), \quad j = 1, \ldots, M,$$

where $e_j$ is the embedding of the $j$-th existing track. The query bounding box is assigned to the track $t^{*}$ with the highest score, as given in Equation (11):

$$t^{*} = \arg\max_{j} \, s_j.$$
We consequently perform the fine-tuning by minimizing a cross-entropy loss over the track assignments during the training process.
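The projection of Equation (8) and the contrastive alignment objective of Equation (9) can be sketched in PyTorch as follows; the embedding dimensions and temperature are illustrative assumptions, and the symmetric InfoNCE-style formulation is one common instantiation of the contrastive loss rather than the exact objective used for fine-tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalProjection(nn.Module):
    """Project visual and textual embeddings into a shared latent space (Eq. 8)."""
    def __init__(self, visual_dim=512, text_dim=512, latent_dim=256):
        super().__init__()
        self.w_v = nn.Linear(visual_dim, latent_dim, bias=False)
        self.w_t = nn.Linear(text_dim, latent_dim, bias=False)

    def forward(self, e_visual, e_text):
        z_v = F.normalize(self.w_v(e_visual), dim=-1)
        z_t = F.normalize(self.w_t(e_text), dim=-1)
        return z_v, z_t

def contrastive_alignment_loss(z_v, z_t, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning matching visual/text pairs (Eq. 9)."""
    logits = z_v @ z_t.T / temperature            # (N, N) cosine similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random tensors standing in for VLM outputs.
proj = CrossModalProjection()
z_v, z_t = proj(torch.randn(8, 512), torch.randn(8, 512))
loss = contrastive_alignment_loss(z_v, z_t)
```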
The extracted embeddings are processed by a VLM as given in Algorithm 3 to generate rich cross-modal embeddings. These embeddings encode both visual and semantic information, enhancing the model’s capacity for accurate object representation. Fine-tuning the VLM further aligns its outputs to the MOT task.
Algorithm 3 Cross-Modal Embeddings and Fine-Tuning of VLMs
Input: Set of embeddings $E_t$, pre-trained VLM (e.g., Florence-2, CLIP); Output: Set of cross-modal embeddings $C_t$
 1: for each embedding $e_i \in E_t$ do
 2:   Extract visual and semantic features using the VLM
 3:   Add the resulting cross-modal embedding to $C_t$
 4: end for
 5: Fine-tune the VLM on the extracted features to enhance representation for MOT
 6: return $C_t$
Our approach for generating cross-modal embeddings and fine-tuning VLMs in MOT is illustrated in
Figure 5 in detail, which begins with the sequence of previously detected and cropped pedestrians within bbxes that are extracted for each individual. These cropped regions are processed to compute visual features, including keypoints and descriptors, which are structured into a data format containing fields such as the image file name, image height, image width, keypoints, descriptors, the bbx, and the confidence score. For each individual, a textual description is generated in natural language using these features, aligning the visual information with the textual inputs required by VLMs. Both the visual and textual data are projected into a shared embedding space via learned projection matrices, enabling cross-modal alignment. The embeddings for each individual are stored as $E_i$ vectors for track management and undergo fine-tuning using a contrastive learning objective to enhance identity association. This process ensures that embeddings of the same identity are brought closer together while those of different identities are separated. The resulting multi-modal embeddings effectively improve track association accuracy by leveraging both visual and textual cues in a unified representation.
3.3. Track Management and Zero-Shot ReID Integration
Effective MOT requires robust track management and ReID mechanisms to address object occlusions, sudden appearances, and IDSws. ReTrackVLM incorporates a sophisticated track management module with zero-shot ReID capabilities, ensuring seamless tracking across diverse scenarios. This module utilizes motion prediction, distance calculation, and data association strategies to maintain track consistency and ReID objects after interruptions, collectively enhancing tracking reliability and accuracy in challenging environments.
The track management and zero-shot ReID integration of ReTrackVLM are demonstrated in
Figure 6, which initially obtains the bbxes, keypoints, and descriptors for each detected pedestrian from the current frame. Motion estimation, implemented via a Kalman filter, predicts the future positions of tracks based on their previous states, aiding in association with new detections. For similarity evaluation, the cosine similarity metric is employed to calculate a distance matrix between the embedding vectors of existing tracks
and new detections. This matrix quantifies the matching likelihood between tracks and detections. Following distance calculation, data association is performed using a bipartite matching algorithm to optimally pair the detections (Det1 to DetN) with tracks (Track1 to TrackM), taking into account both motion predictions and embedding similarities. After association, zero-shot ReID is applied by leveraging VLM embeddings, allowing for re-identification across frames without additional training. Confident tracks are stored in an array containing their associated embeddings, indexed by unique IDs such as ID1, ID2, …, IDk, along with confidence scores. This array serves as a gallery for ReID, facilitating identity retrieval for new detections and enhancing tracking consistency across occlusions or reappearances. The framework ensures seamless integration of motion, similarity assessment, and ReID for robust MOT.
Tracks from the previous frame are predicted using a Kalman filter in this stage as shown in Algorithm 4. Detected objects in the current frame are associated with these predicted tracks based on IoU and appearance similarity scores, using the Hungarian algorithm. For unmatched tracks, zero-shot ReID compares them with previously stored embeddings in the confident track storage, allowing for robust identity preservation without retraining.
Algorithm 4 Track Management and Zero-Shot ReID Integration
Input: Predicted tracks $T$, current embeddings $E_t$, Confident Track Storage (CTS); Output: Updated tracks $T$
 1: Predict new locations of tracks using a Kalman filter, obtaining $P$
 2: for each predicted track $p_j \in P$ do
 3:   Compute IoU and appearance similarity scores with $E_t$
 4: end for
 5: Solve the association problem using the Hungarian algorithm
 6: for each association result do
 7:   Update matched tracks in $T$
 8:   Initialize new tracks for unmatched detections
 9:   Handle unmatched tracks in $T$ by updating their predicted states
10: end for
11: Perform zero-shot ReID by comparing unmatched tracks with embeddings in CTS
12: return $T$
3.3.1. Motion Prediction and Similarity Assessment
Motion Prediction Mechanism: The Kalman filter is employed to predict the motion of detected objects across consecutive frames. Each detected pedestrian’s state is represented by a state vector $\mathbf{x} = [c_x, c_y, v_x, v_y, w, h]^{\top}$, where $(c_x, c_y)$ represents the center of the bbx, $(v_x, v_y)$ represents the velocity, and $w$ and $h$ are the width and height of the bbx. The state transition model is defined as given in Equation (12):

$$\mathbf{x}_{k+1} = F \mathbf{x}_k + \mathbf{w}_k,$$

where $F$ is the state transition matrix and $\mathbf{w}_k$ is the process noise, assumed to follow a Gaussian distribution with covariance $Q$. The Kalman filter predicts the next state $\hat{\mathbf{x}}_{k+1 \mid k}$ and updates it with the observed measurements.
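A compact numpy sketch of this constant-velocity prediction and update step is shown below; the noise covariances are illustrative values, and the state layout follows the definition above.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal Kalman filter over the state x = [cx, cy, vx, vy, w, h] (Eq. 12)."""
    def __init__(self, initial_box, dt=1.0, process_var=1e-2, meas_var=1e-1):
        cx, cy, w, h = initial_box
        self.x = np.array([cx, cy, 0.0, 0.0, w, h], dtype=float)
        self.P = np.eye(6)
        # State transition F: the centre moves with constant velocity, the size is static.
        self.F = np.eye(6)
        self.F[0, 2] = dt
        self.F[1, 3] = dt
        # Observation model H: only (cx, cy, w, h) are measured.
        self.H = np.zeros((4, 6))
        self.H[0, 0] = self.H[1, 1] = self.H[2, 4] = self.H[3, 5] = 1.0
        self.Q = process_var * np.eye(6)   # process noise covariance Q
        self.R = meas_var * np.eye(4)      # measurement noise covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x.copy()

    def update(self, measured_box):
        z = np.asarray(measured_box, dtype=float)      # (cx, cy, w, h)
        y = z - self.H @ self.x                        # innovation
        S = self.H @ self.P @ self.H.T + self.R        # innovation covariance (Eq. 13)
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x.copy()
```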
Distance Calculation: The distance between predicted tracks and new detections is calculated using a combination of spatial and appearance-based metrics. For spatial distance, the Mahalanobis distance $d_M$ is utilized to measure the difference between the predicted bounding box and the detected bounding box. The Mahalanobis distance is given by Equation (13):

$$d_M = \sqrt{\left(\mathbf{z} - H\hat{\mathbf{x}}\right)^{\top} S^{-1} \left(\mathbf{z} - H\hat{\mathbf{x}}\right)},$$

where $\mathbf{z}$ is the detected measurement, $H$ is the observation model, and $S$ is the innovation covariance.
Appearance Feature Descriptors and Similarity Score: Feature locations obtained via XFeat within the bbxes of pedestrians detected by the fine-tuned YOLOv8 are fed into the Kalman filter. These feature locations are further used to compute appearance descriptors, enhancing the reliability of track association. The similarity score between the detected and tracked objects is calculated using the VLM that we also fine-tuned for this purpose in a similar manner as [
58]. This score, $s_{\text{sim}}$, is integrated into the overall distance calculation, combining both spatial and appearance-based distances as in Equation (14):

$$d_{i,j} = \alpha \, d_M(i, j) + \beta \left(1 - s_{\text{sim}}(i, j)\right),$$

where $\alpha$ and $\beta$ are weights that balance the influence of the spatial and appearance-based metrics, respectively. This integrated distance metric ensures robust track management by leveraging both motion prediction and appearance similarity.
Cost Matrix Construction: To apply the Hungarian method, we first construct a cost matrix $C$ that quantifies the dissimilarity between each detected object and each existing track. The elements of the cost matrix, $C_{ij}$, are computed as the weighted combination of spatial and appearance-based distances defined earlier in Equation (14), where $d_M(i, j)$ is the Mahalanobis distance between the $i$-th detection and the $j$-th track, and $s_{\text{sim}}(i, j)$ is the similarity score derived from the VLM.
Optimal Assignment: The Hungarian method solves the assignment problem by minimizing the total cost of assigning detections to tracks. Given the cost matrix $C$, the algorithm finds the optimal assignment $A^{*}$, which minimizes the sum of the selected costs as given in Equation (15):

$$A^{*} = \arg\min_{A} \sum_{i} \sum_{j} C_{ij} \, a_{ij},$$

where $a_{ij}$ is a binary variable that equals 1 if the $i$-th detection is assigned to the $j$-th track, and 0 otherwise. The solution $A^{*}$ ensures that each detection is assigned to at most one track and each track is matched to at most one detection, thereby optimizing the association process.
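A sketch of the cost matrix construction and the Hungarian assignment, using scipy's linear_sum_assignment, is given below; the weights and the gating threshold are illustrative assumptions rather than the tuned values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def build_cost_matrix(mahalanobis_dist, vlm_similarity, alpha=0.6, beta=0.4):
    """C_ij = alpha * d_M(i, j) + beta * (1 - s_sim(i, j))  (Eqs. 14 and 15).

    `mahalanobis_dist` and `vlm_similarity` are (num_detections, num_tracks)
    arrays; alpha and beta are illustrative weights, not the tuned values.
    """
    return alpha * mahalanobis_dist + beta * (1.0 - vlm_similarity)

def associate(mahalanobis_dist, vlm_similarity, max_cost=5.0):
    """Solve detection-to-track assignment with the Hungarian algorithm."""
    cost = build_cost_matrix(mahalanobis_dist, vlm_similarity)
    det_idx, trk_idx = linear_sum_assignment(cost)
    matches = []
    unmatched_dets = set(range(cost.shape[0]))
    unmatched_trks = set(range(cost.shape[1]))
    for i, j in zip(det_idx, trk_idx):
        if cost[i, j] <= max_cost:          # gate implausible pairs
            matches.append((i, j))
            unmatched_dets.discard(i)
            unmatched_trks.discard(j)
    return matches, sorted(unmatched_dets), sorted(unmatched_trks)
```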
3.3.2. Data Association
Data association is essential in MOT, aiming to match detected objects in the current frame with existing tracks. We employ the Hungarian method, a combinatorial optimization algorithm, to efficiently solve the assignment problem. Afterward, unassigned detections may signify new objects entering the scene, prompting the initialization of new tracks. Conversely, unassigned tracks that lack matches are terminated based on the number of consecutive frames without valid matches.
3.3.3. Zero-Shot ReID Integration
The zero-shot ReID module is crucial for maintaining object identity continuity, especially when objects exit and re-enter the scene or when tracks are temporarily lost. Unlike traditional ReID methods that require extensive training on labeled data [
42], our zero-shot ReID module uses pre-trained VLMs to match reappearing objects to their previous identities without additional fine-tuning. We treat the confident track storage as a dynamic gallery, comparing new queries from the current frame against this gallery using pre-trained embeddings from the VLM. This allows for seamless identification of potential matches based on stored descriptors and VLM similarities, eliminating the need for specialized training on additional ReID datasets.
Feature Extraction and Embedding Comparison: In the zero-shot ReID module, the appearance features of each detected object are extracted using the same embedding structure described in earlier sections. The embedding vector $e_q$ for a new detection is compared against the stored embeddings $e_k$ of all previously tracked objects in the database. The similarity between the embeddings is computed using a cosine similarity measure, as given in Equation (16):

$$\mathrm{sim}(e_q, e_k) = \frac{e_q \cdot e_k}{\|e_q\| \, \|e_k\|},$$

where $e_q$ is the embedding vector of the current detection, and $e_k$ is the embedding vector of a previously tracked object. A high similarity score indicates that the detection is likely to correspond to a previously tracked object, facilitating the re-assignment of the same identity.
Zero-Shot Matching Process: When a detection $b_q$ cannot be matched to an existing track during the data association step (as described in Section 3.3.2), the zero-shot ReID module is activated. The module searches for the highest similarity score between the current detection and the stored embeddings of lost tracks. If the similarity score exceeds a predefined threshold $\tau_{\text{reid}}$, the detection is reassociated with the corresponding track, effectively recovering the identity of the object, as given in Equation (17):

$$k^{*} = \arg\max_{k} \, \mathrm{sim}(e_q, e_k), \qquad \text{reassociate } b_q \text{ with track } k^{*} \text{ if } \mathrm{sim}(e_q, e_{k^{*}}) > \tau_{\text{reid}}.$$
Integration into Track Management: The zero-shot ReID module is integrated into the track management system, offering a fallback for identity preservation when standard data association fails. This allows the tracking system to maintain consistent object identities during occlusions, exits, and re-entries without the need for manual labeling or retraining. By leveraging the generalization capabilities of VLMs, the zero-shot ReID module enhances the overall robustness and reliability of the tracking pipeline.
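In practice, the zero-shot matching step of Equations (16) and (17) reduces to a cosine-similarity search over the confident track storage, as in the sketch below; the similarity threshold shown is an assumed value.

```python
import numpy as np

def zero_shot_reid(query_embedding, gallery_embeddings, gallery_ids, threshold=0.75):
    """Return the stored track ID whose embedding best matches the query (Eqs. 16-17).

    `gallery_embeddings` is an (N, D) array from the confident track storage and
    `gallery_ids` the corresponding track identities; 0.75 is an assumed threshold.
    """
    if len(gallery_ids) == 0:
        return None
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    g = gallery_embeddings / (np.linalg.norm(gallery_embeddings, axis=1, keepdims=True) + 1e-8)
    sims = g @ q                       # cosine similarities against the gallery
    best = int(np.argmax(sims))
    return gallery_ids[best] if sims[best] > threshold else None
```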
3.4. Confident Track Storage
The confident track storage module is vital to the track management pipeline, ensuring the reliability of tracked identities over time. It selectively stores tracks with high confidence, enabling more effective management of active tracks and minimizing erroneous associations.
Track Confidence Calculation: To determine whether a track should be stored as “confident”, we introduce a confidence score $C_t$ for each track $t$, computed as a weighted combination of factors such as detection consistency, association reliability, and ReID match quality. The confidence score is defined as given in Equation (18):

$$C_t = w_1 D_t + w_2 A_t + w_3 R_t,$$

where $D_t$ represents the consistency of detections associated with the track, calculated as the ratio of successful detections to the total number of frames for which the track has been active; $A_t$ represents the reliability of data associations for the track, measured by the inverse of the number of IDSws; and $R_t$ represents the quality of ReID matches, quantified by the average cosine similarity score of ReID matches over time.
The parameters $w_1$, $w_2$, and $w_3$ are empirically determined weights that balance the contribution of each factor, reflecting the relative importance of detection consistency, association reliability, and ReID match quality in establishing track confidence.
Empirical Threshold for Confidence: A track is classified as confident if its confidence score $C_t$ exceeds a predefined threshold $\tau_c$, which is set empirically based on validation experiments. This threshold ensures that only tracks with a high level of reliability are stored, reducing the risk of storing incorrect or noisy tracks, as given in Equation (19):

$$t \text{ is confident} \iff C_t > \tau_c.$$
Storage Mechanism and Interaction with Other Modules: Once a track is classified as confident, it is stored in the confident track storage module, which acts as a repository for high-quality tracks. This allows for the reactivation of tracks if the same object is detected later, enhancing identity management across long sequences. The interaction between the confident track storage and other modules is twofold:
1. Data Association: During the data association process, confident tracks in storage are prioritized for matching with new detections. This reduces the likelihood of identity switches and enhances track continuity.
2. Zero-Shot ReID Integration: In cases where a track has been lost and subsequently reappears, the zero-shot ReID module can utilize the embeddings of confident tracks to reassociate the object with its previous identity, further ensuring consistent identity tracking across the entire sequence.
Incorporating the confident track storage module improves the overall tracking system’s reliability and stability, as these confident tracks serve as anchor points, reducing errors and enhancing long-term identity preservation. Empirical tuning of confidence thresholds and weighting parameters ensures effective adaptation to diverse tracking scenarios, maintaining a balance between precision and recall.
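The confidence score of Equation (18) and the storage decision of Equation (19) can be expressed compactly as follows; the weights and threshold in the sketch are placeholders for the empirically tuned values, and the smoothed inverse-IDSw term is one simple way to implement the association-reliability factor.

```python
def track_confidence(num_detections, num_frames_active, num_id_switches,
                     reid_similarities, w1=0.4, w2=0.3, w3=0.3):
    """Weighted confidence score C_t (Eq. 18); w1-w3 are placeholder weights."""
    detection_consistency = num_detections / max(num_frames_active, 1)
    # Smoothed inverse of the IDSw count, so zero switches give full reliability.
    association_reliability = 1.0 / (1.0 + num_id_switches)
    reid_quality = (sum(reid_similarities) / len(reid_similarities)
                    if reid_similarities else 0.0)
    return w1 * detection_consistency + w2 * association_reliability + w3 * reid_quality

def is_confident(track_stats, tau_c=0.8):
    """Store a track in the CTS if C_t exceeds the threshold tau_c (Eq. 19)."""
    return track_confidence(**track_stats) > tau_c
```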
Tracks with high confidence and stable identity are stored in the confident track storage (CTS) for future reference as given in Algorithm 5. Low-confidence or outdated tracks are removed to maintain efficient storage and improve tracking reliability in dynamic and complex scenes.
Algorithm 5 Confident Track Storage
Input: Updated tracks $T$, Confident Track Storage (CTS); Output: Updated CTS
 1: for each track $t \in T$ do
 2:   if $t$ has high confidence and stable identity then
 3:     Add $t$ to CTS
 4:   end if
 5: end for
 6: Remove obsolete or low-confidence tracks from CTS
 7: return CTS
4. Experiments and Results
We evaluated the ReTrackVLM performance across various datasets and metrics. This section begins with an overview of the datasets and evaluation metrics used, followed by implementation details. Key experiments, including an ablation study that highlights contributions from different components, are presented. We conclude by comparing ReTrackVLM with state-of-the-art methods, showcasing its effectiveness, especially in challenging scenarios involving cross-modal embeddings and zero-shot ReID integration.
4.1. Datasets and Evaluation Metrics
We selected diverse MOT datasets—MOT15, MOT16, MOT17, MOT20, DanceTrack, and WildTrack—to evaluate the proposed framework comprehensively. These datasets span a wide range of real-world scenarios, including urban environments, indoor and outdoor scenes, and varying levels of crowd density, as provided in
Table 1 in detail.
MOT15 includes indoor and outdoor environments with moderate difficulty due to occlusions and varying densities. MOT16 and MOT17 share content but differ in annotation styles, allowing for an assessment of the impact of annotation on performance in crowded urban settings. MOT20 features extremely crowded scenes with prolonged occlusions, ideal for dense urban applications. DanceTrack focuses on dynamic, fast-paced indoor movements, challenging the model with rapid motion and complex interactions. WildTrack emphasizes multi-camera setups and varying angles, posing synchronization challenges. These datasets were selected for a comprehensive evaluation, covering scenarios from crowded urban streets to controlled environments.
We converted annotations to YOLOv8 format, concentrating on the person class for consistency across datasets. Each image corresponds to a label file formatted with one row per object ([class, x_center, y_center, width, height]), with coordinates normalized between 0 and 1 to meet YOLOv8’s input requirements, facilitating efficient fine-tuning. To evaluate tracking performance, we used a comprehensive set of metrics, including accuracy, precision, identity preservation, and robustness. The CLEAR MOT metrics, focusing on detection quality, track continuity, and identity preservation, are standard for assessing MOT performance.
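As an illustration of this conversion, the sketch below maps a single MOTChallenge ground-truth row to a YOLO label line; the assumed gt.txt column order is (frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility), and directory handling and per-dataset class filtering are omitted.

```python
def mot_row_to_yolo(row, img_w, img_h, person_class_id=0):
    """Convert one MOTChallenge gt.txt row to a YOLO label line.

    The MOT row layout assumed here is:
    frame, track_id, bb_left, bb_top, bb_width, bb_height, conf, cls, visibility.
    """
    _, _, left, top, w, h = (float(v) for v in row[:6])
    x_center = (left + w / 2.0) / img_w
    y_center = (top + h / 2.0) / img_h
    return f"{person_class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# Example: a 1920x1080 frame with one pedestrian box.
print(mot_row_to_yolo(["1", "3", "850", "420", "60", "160", "1", "1", "1.0"], 1920, 1080))
```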
We employed the following metrics to assess the performance of MOT algorithms:
MOTA (Multiple-Object Tracking Accuracy): MOTA is a widely used metric that accounts for three types of errors: FP, FN, and IDSw. It is calculated as given in Equation (20):

$$\text{MOTA} = 1 - \frac{\sum_{t}\left(\text{FN}_t + \text{FP}_t + \text{IDSw}_t\right)}{\sum_{t} \text{GT}_t},$$

where $\sum_{t}\text{GT}_t$ is the total number of ground truth (GT) objects. MOTA provides a general overview of tracking performance by counting how many errors are made by the tracker in total, with higher values indicating better performance.
MOTP (Multiple-Object Tracking Precision): MOTP measures the alignment precision between the GT and the tracker’s output. It is the average distance between the predicted bbxes and the GT across all matches, where lower values indicate higher precision, as given in Equation (21):

$$\text{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_{t} c_t},$$

where $d_{i,t}$ is the localization error of matched pair $i$ in frame $t$ and $c_t$ is the number of matches in frame $t$.
IDSw: IDSw occurs when a tracked object is mistakenly given a different ID than it had in previous frames, and counts the number of times an object’s predicted identity changes in the tracker output. Fewer IDSws indicate better tracking performance in maintaining object identities over time.
FM (Fragmentation): FM counts the number of times a ground truth trajectory is interrupted in the tracker output, i.e., when the tracker loses an object and later re-identifies it, so that a single correct identity is split into two or more separate track segments. Lower fragmentation indicates better continuity in tracking.
MT (Mostly Tracked): MT refers to the number of ground truth trajectories that are successfully tracked for at least 80% of their length. A higher MT value indicates better tracking performance.
PT (Partially Tracked): PT refers to the number of ground truth trajectories that are tracked for 20% to 80% of their length. PT provides additional insight into tracking performance for partially visible objects.
ML (Mostly Lost): ML refers to the number of ground truth trajectories that are tracked for less than 20% of their length. A lower ML value is desirable as it indicates fewer instances of lost tracks.
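For reference, MOTA as defined in Equation (20) can be computed directly from per-frame error counts, as in the short sketch below.

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """Multiple-Object Tracking Accuracy (Eq. 20) from per-frame error counts."""
    total_errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    total_gt = sum(gt_per_frame)
    return 1.0 - total_errors / total_gt if total_gt > 0 else 0.0

# Example: 3 frames with a handful of errors against 60 ground-truth boxes.
print(mota(fn_per_frame=[2, 1, 0], fp_per_frame=[1, 0, 1], idsw_per_frame=[0, 1, 0],
           gt_per_frame=[20, 20, 20]))  # -> 0.9
```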
4.2. Implementation Details
Experiments were conducted on a desktop PC with Windows 11, WSL2 (Ubuntu 22.04 LTS), a 24 GB GPU, 32 GB RAM, and 1TB SSD. We fine-tuned YOLOv8 for person detection across the selected datasets. Each dataset was converted to a YOLOv8-compatible format, retaining only the person class, and organized into training and validation directories with corresponding annotation files.
The fine-tuning utilized a custom Python training script with the PyTorch framework [
59] and the Ultralytics YOLOv8 library, configured for 100 epochs, 640 × 640 pixel images, and a batch size of 16. Data augmentation techniques, including multi-scale training and horizontal flipping, were applied to enhance model robustness.
For the VLM module of ReTrackVLM, we developed a data processing pipeline to convert datasets for Florence-2, CLIP, and DETR2. This pipeline processes annotations by integrating bbx information, image dimensions, keypoints, descriptors, overlap ratios, and confidence scores into a unified format. The bbx coordinates and image size are extracted directly from YOLOv8 outputs, ensuring accurate localization of objects within each frame. Keypoints and descriptors, which provide additional semantic and structural detail, are derived using XFeat and LightGlue, respectively, and are complemented by image-level metadata (e.g., image dimensions). These features are then normalized to ensure consistency across frames and datasets.
The structured embedding representation leverages a fixed format that includes image-specific information (e.g., dimensions and bbxes) along with semantic details (e.g., keypoints and descriptors). This fixed structure not only ensures compatibility across modules but also significantly improves capacity management and computational efficiency. Keypoints represent specific positions on detected objects, such as joints or other salient features, with each keypoint described by two coordinates. To optimize memory usage, we limited the number of keypoints based on their confidence scores, retaining only the highest-confidence points. Descriptors, extracted alongside keypoints, are 64-dimensional vectors describing the local image regions around each keypoint and provide rich contextual information for fine-grained identification and tracking. By standardizing the embedding structure, the framework reduces memory usage and enhances processing speed, enabling deployment in resource-constrained environments; this efficient organization was pivotal in achieving timely performance without compromising accuracy, particularly in dense or complex scenes. The combination of YOLOv8, XFeat, and LightGlue provides a robust foundation for generating embeddings that are both computationally efficient and effective for ReID and tracking tasks. These optimizations support scalability and practical applicability, making the ReTrackVLM framework suitable for both research and real-world scenarios.
4.3. Ablation Study
In this ablation study, we analyzed the contributions of key components within the ReTrackVLM framework. By evaluating the impact of cross-modal embeddings, zero-shot ReID integration, track management optimizations, and fine-tuning, we identified which elements most significantly enhance tracking accuracy and robustness. This analysis provides insights into our modules’ effectiveness and informs future improvements.
In
Table 2, we present the detection performance of YOLOv8 after fine-tuning on several datasets, optimizing it for person detection across diverse scenes. Since YOLOv8 originally processes fixed-size inputs (640 × 640 pixels), we followed this convention; the fine-tuned YOLOv8 model achieved an average processing speed of 98 frames per second (fps) on our setup. Evaluation metrics include Box Precision, Box Recall, mAP50 (mean Average Precision at an IoU threshold of 0.50), and mAP50-95 (mean Average Precision across IoU thresholds from 0.50 to 0.95).
These results establish the detection baseline for ReTrackVLM: YOLOv8 fine-tuned on datasets such as MOT15 and DanceTrack achieves high detection rates, with MOT15 reaching 94.8% Box Precision and 95.2% mAP50 and DanceTrack reaching 91.8% mAP50. In contrast, crowded datasets such as MOT20 exhibit lower precision (67.6%) and recall (69.4%), highlighting the challenges of densely populated scenes that require further refinement. When fine-tuned on all datasets jointly, the model achieves a Box Precision of 71.9% and an mAP50 of 76.7%, a generalized configuration that trades some per-dataset accuracy for adaptability without overfitting, making it suitable for diverse MOT applications.
For fine-tuning the models in our framework, the time required was closely tied to the complexity of the architecture and the dataset size. DETR2, being the least complex architecture among the tested models, required approximately 15 h for fine-tuning on the smallest dataset. In contrast, Florence-2, recognized as one of the most complex VLMs due to its multimodal architecture and large model size, required nearly 30 h for fine-tuning on the largest dataset. These time differences align with the relative complexities of the models: Florence-2, with its focus on integrating image–text pairs and object detection data, exhibits significantly higher computational demands than DETR2, a transformer-based object detection model that is less resource-intensive.
In terms of runtime performance, our framework achieved a processing speed of 18 fps when using DETR2, consistent with its moderate complexity and suitability for real-time applications. Florence-2, by contrast, delivered 12 fps, reflecting the trade-off between its advanced capabilities and computational overhead.
As shown in Table 3, CLIP demonstrates superior tracking performance, particularly in MOTA and MOTP, across most datasets, indicating better accuracy and track continuity. DETR2 also performs well, especially on challenging datasets like MOT16/17 and MOT20, but records a higher IDSw, suggesting difficulties in identity preservation compared to CLIP. OpenCLIP generally shows lower performance across metrics, while Florence-2 strikes a balance between detection accuracy and identity handling, though it trails behind CLIP and DETR2. CLIP also excels in challenging scenarios such as WildTrack and DanceTrack, leading in MOTA, MOTP, and FM.
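For reference, the MOTA and MOTP values reported throughout this section follow the standard CLEAR MOT definitions:

    \mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},
    \qquad
    \mathrm{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t},

where FN_t, FP_t, IDSW_t, and GT_t denote the false negatives, false positives, identity switches, and ground-truth objects in frame t, d_{i,t} is the localization error of matched object i in frame t, and c_t is the number of matches in frame t.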
In Table 4, we report zero-shot ReID performance using confident tracks from a non-fine-tuned YOLOv8 detector. CLIP consistently outperforms the other VLMs, achieving the highest MOTA and MOTP, especially on DanceTrack (61.8%) and WildTrack (69.9%), along with the lowest IDSw and FM, demonstrating superior track consistency. DETR2 follows closely, particularly on MOT16/17 and MOT20, balancing higher MT values with lower ML rates. OpenCLIP and Florence-2 generally lag behind, with higher IDSw and FM; however, Florence-2 performs reasonably on MOT16/17 and WildTrack. These findings highlight CLIP’s robustness in maintaining identity consistency in zero-shot ReID scenarios.
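As a rough illustration of how zero-shot ReID association can be performed in this setting, the sketch below matches CLIP image embeddings of the current detections against stored track embeddings using cosine similarity and the Hungarian algorithm; the embedding dimensionality, similarity threshold, and function name are assumptions for illustration, not the exact implementation.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_detections_to_tracks(det_emb, track_emb, sim_threshold=0.6):
        """Associate detections with existing tracks via cosine similarity.

        det_emb:   (D, 512) image embeddings of the current detections
        track_emb: (T, 512) stored embeddings of confirmed tracks
        Returns a list of (detection_index, track_index) matches.
        """
        # L2-normalize so that the dot product equals cosine similarity.
        det = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
        trk = track_emb / np.linalg.norm(track_emb, axis=1, keepdims=True)
        sim = det @ trk.T

        # Hungarian assignment on the negated similarity (cost) matrix.
        rows, cols = linear_sum_assignment(-sim)

        # Keep only sufficiently similar pairs; unmatched detections spawn
        # new tracks or are handled by the track-management module.
        return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]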
In Figure 7, we provide a detailed visualization of the tracking performance of ReTrackVLM across five benchmark datasets: DanceTrack, MOT15, MOT16, MOT17, and MOT20. These datasets represent a variety of challenging scenarios, including dynamic and static pedestrians, different camera movements, and varying levels of crowd density. The tracking results for each dataset are shown at specific frames: the first frame (t = 1), the 100th frame (t = 100), the 200th frame (t = 200), and the final frame (t = N); for MOT20, the last frame is excluded due to extreme occlusion and track overlap. Beneath the visualizations, tracking metric graphs illustrate the number of tracks per frame, offering a temporal analysis of trajectory continuity and dataset complexity. In DanceTrack, ReTrackVLM performs well, effectively handling the uniform motion patterns of dancers from t = 1 to t = 200. By t = N, however, trajectory intersections pose challenges: rapid dancer crossings and fast appearance changes, caused by the short distance between the dancers and the camera, lead to IDSws. This indicates a need for improvement in managing synchronized movements and abrupt trajectory overlaps. For MOT15, which features a moderately dense urban setting, ReTrackVLM effectively tracks both static and dynamic individuals across frames. While track consistency is generally strong, a slight dip in the number of active tracks in the metric plot suggests occasional track termination due to occlusions. MOT16 and MOT17 highlight both strengths and limitations of the framework. In MOT16, the tracker performs reliably under static camera conditions, maintaining consistent track identities from t = 1 to t = 200. In MOT17, which involves moving cameras and frequent occlusions, the framework experiences increased IDSws; by t = N, there are clear signs of frequent IDSws and partial track losses, suggesting a need for enhanced appearance-based association and occlusion handling. In MOT20, one of the most densely populated datasets, ReTrackVLM struggles with trajectory overlaps and significant occlusions. The tracking metrics for MOT20 reveal a steady increase in the number of tracks over time, indicating the dynamic influx of new pedestrians but also hinting at fragmented tracking caused by occlusion.
4.4. Comparison with State-of-the-Art Techniques
In this section, we compare ReTrackVLM’s performance against state-of-the-art MOT methods across benchmark datasets, including MOT15, MOT16/17, MOT20, DanceTrack, and WildTrack. We evaluate the effectiveness of the cross-modal embeddings module, zero-shot ReID integration, and the performance gains from fine-tuning YOLOv8 as the detector. Key metrics such as MOTA, MOTP, and IDSw are used to gauge our approach against recent advancements. Distinct fine-tuned weights were employed for the YOLOv8 detector and CLIP models, each optimized for the specific dataset to ensure peak performance in detection and ReID tasks.
We selected various methods for comparison—DeepSort, ByteTrack, Deep-OC-SORT, StrongSORT, BoostTrack, SFSORT, Hybrid-SORT, UCMCTrack, TrackFormer, MOTR, and TransTrack—owing to their strong performance across the datasets and their compatibility with our evaluation protocol. These methods are recognized for robust detection and track management, with advanced features such as occlusion handling and hybrid strategies, aligning with our goal of a comprehensive comparison.
We present a comprehensive comparison of MOT performance across benchmark datasets in Table 5, evaluating several state-of-the-art trackers alongside our proposed ReTrackVLM model. On MOT15, ReTrackVLM achieves competitive results, scoring highest in PT (286) and solidly in MOTA (84.0) and MOTP (88.6), though it has a higher IDSw (1147) than top methods such as Hybrid-SORT and SFSORT. On MOT16/17, ReTrackVLM leads in MOTA (79.7) and PT (428) but records a higher IDSw (1889) than some competitors. On MOT20, it excels in MOTA (78.0) and shows a balanced MT (727) and low ML (234), demonstrating robustness in challenging environments. While methods such as Deep-OC-SORT and BoostTrack score higher on DanceTrack, ReTrackVLM maintains strong performance across metrics, indicating its versatility.
Overall, ReTrackVLM delivers consistently competitive results across multiple datasets, particularly excelling in the MOTA and PT metrics, which underscores its strength in track consistency and accuracy. However, it leaves room for improvement in identity matching, especially in crowded scenes such as MOT20, and could benefit from further optimization for complex motion, as evidenced by its performance on DanceTrack. These findings suggest that ReTrackVLM strikes a balanced trade-off between accuracy and robustness but requires further fine-tuning to address IDSw challenges.
5. Discussion and Conclusions
MOT in complex environments with dense crowds and unpredictable motion presents significant challenges, including maintaining consistent identities across frames amidst occlusions and abrupt movements. Traditional methods often struggle with IDSws and association failures due to inadequate modeling of appearance and motion. To address these issues, we propose ReTrackVLM, a novel tracking framework that employs cross-modal embeddings and zero-shot ReID to enhance performance.
ReTrackVLM combines a VLM with an appearance-based ReID system, effectively modeling both visual and semantic information to distinguish objects during ambiguous detections and occlusions. The zero-shot ReID capability enables accurate object matching without additional fine-tuning, relying on previously stored track data. Additionally, a Kalman filter-based motion prediction module estimates object positions between frames, reducing track fragmentation.
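A minimal constant-velocity predictor of the kind referred to above can be sketched as follows; the state layout and noise magnitudes are illustrative assumptions rather than the tuned values used in the framework.

    import numpy as np

    class ConstantVelocityKalman:
        """Constant-velocity Kalman filter over a bounding-box center.

        State x = [cx, cy, vx, vy]; measurement z = [cx, cy].
        """
        def __init__(self, cx, cy, dt=1.0):
            self.x = np.array([cx, cy, 0.0, 0.0])
            self.P = np.eye(4) * 10.0                        # state covariance
            self.F = np.array([[1, 0, dt, 0],
                               [0, 1, 0, dt],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=float)   # motion model
            self.H = np.array([[1, 0, 0, 0],
                               [0, 1, 0, 0]], dtype=float)   # measurement model
            self.Q = np.eye(4) * 0.01                        # process noise (illustrative)
            self.R = np.eye(2) * 1.0                         # measurement noise (illustrative)

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]                                # predicted center

        def update(self, cx, cy):
            z = np.array([cx, cy], dtype=float)
            y = z - self.H @ self.x                          # innovation
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P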
Extensive evaluations on benchmark datasets (MOT15, MOT16/17, MOT20, and DanceTrack) demonstrate that ReTrackVLM achieves competitive performance, notably excelling in MOTA on MOT16/17 (79.7). It shows strong results in minimizing IDSws on DanceTrack (684) and WildTrack (761) and in PT on MOT15 (286) and MOT16/17 (428). However, it has limitations, such as more IDSws on dense and occluded datasets like MOT20, indicating that the zero-shot ReID struggles in such scenarios. Additionally, the motion prediction model could be improved for complex motion patterns, and the reliance on VLMs and cross-modal embeddings introduces computational overhead that may affect real-time tracking applications.
While the current study demonstrates the efficacy of the proposed framework on MOT benchmarks and controlled environments, its applicability to unstructured scenes and diverse object types remains unexplored. Future studies will incorporate datasets such as TAO [70] and KITTI [71] to evaluate its generalizability in these contexts. We will also analyze the effect of temporal smoothing and adaptive reinitialization techniques on stabilizing tracks, as well as temporal occlusion reasoning, such as interpolation-based ReID, to improve track recovery after occlusions. Further planned improvements include enhancing the ReID module to better handle identity switches in dense scenes, integrating advanced motion models for tracking complex behaviors, and optimizing computational efficiency for real-time applications through efficient architectures or hardware accelerators. Finally, we will explore advanced optimization algorithms to expand the parameter search space of the motion prediction and ReID modules; recently developed methods, such as the Mayfly Optimization Algorithm [72] and the Improved Gorilla Troops Optimizer [73], offer promising opportunities to further refine module configurations, especially for diverse tracking scenarios.