1. Introduction
Multi-object tracking (MOT) has been a long-standing problem in computer vision, the aim being to predict the trajectories of objects in a video. It is one of the fundamental yet challenging tasks in the field [1] and forms the basis for important applications ranging from video surveillance [2,3] to autonomous driving [4,5].
Many modern multi-object tracking systems follow the tracking-by-detection paradigm, consisting of a detector followed by a method for associating detections into tracks. The displacement of objects of interest provides important cues for association, and many works track objects through motion estimation. SORT [6] uses the Kalman filter [7] as its motion model, a recursive Bayes filter that follows the typical predict–update cycle. The Kalman filter's simplicity and effectiveness make it widely used in tracking algorithms [6,8,9]. However, as a hand-crafted motion model, the Kalman filter struggles with the diverse motion in DanceTrack [10]. OC-SORT [11] pointed out the limitations that SORT [6] inherits from the Kalman filter [7] and improved robustness against occlusion and nonlinear motion. CenterTrack [12] built on the CenterNet [13] detector to learn a 2D offset between two adjacent frames and associates objects based on center distance, but its association performance is poor. Recently, MOTR [14], which extended DETR [15] and introduced track queries to model tracked instances across an entire video, has shown the potential of transformers for data association. However, MOTR [14] uses the same queries for both detection and tracking, which results in poor detection performance.
DanceTrack [10] is a large-scale multi-object tracking dataset in which objects have uniform appearance and diverse motion patterns. It focuses on situations where multiple objects move over a relatively large range, occluded areas change dynamically, and objects frequently cross over one another. Such cases are common in the real world, but naive motion models cannot handle them effectively. We conclude that the ability to analyze complex motion patterns is necessary for building a more comprehensive and intelligent tracker.
We aimed to develop a strong motion model capable of handling complex movements. Inspired by MOTR [14], we utilize transformers to analyze cross-frame motion patterns. Specifically, an object detector generates detection results and track queries, and a transformer architecture then takes the track queries and the image features as input to predict the current locations of the tracked objects. In our method, we obtain the track queries directly from the detections of each frame, so the accuracy of motion prediction is highly influenced by detection quality. While the detector is trained to locate object positions, its performance may fall short in certain scenes. In MOT tasks, occlusion or blurring can result in less accurate detection bounding boxes than expected, as illustrated in Figure 1. This, in turn, renders the track queries less representative and leads to erroneous predictions. We point out that the confidence score can help address this issue, and we have therefore designed a hybrid strategy that makes motion estimates based on the confidence score. For objects with a high confidence score, we adopt a transformer to predict their future locations; for objects with a low confidence score, we employ a simple linear model to estimate the position. Although the world does not move with constant velocity, many short-term movements, such as those between two consecutive frames, can be well approximated by a linear model under a constant-velocity assumption. Additionally, the linear model predicts position from the historical velocity of the trajectory, reducing the impact of the unreliable current state. In summary, TLtrack adopts a novel hybrid strategy that makes motion estimates not only by considering the historical information of a trajectory but also by analyzing the latest movements of each object.
To push forward the development of motion-based MOT algorithms, we propose a novel motion model, named TLtrack. TLtrack adopts a hybrid strategy for motion estimation, utilizing transformers to predict the locations of high-confidence detections and employing a linear model for low-confidence detections. Our experimental results on the DanceTrack dataset show that our method achieves the best performance compared with other motion models.
3. Methodology
In this section, we present the proposed tracking framework, as illustrated in Figure 2. The overall structure of the encoder and decoder is shown in Figure 3.
3.1. Architecture
Following the tracking-by-detection paradigm, our model is built upon an object detector, with an extra transformer architecture employed to leverage motion cues. Given a frame $I_{t-1}$, it is first fed into the detector to generate the detection results $D_{t-1} \in \mathbb{R}^{N \times 5}$ ($N$ represents the number of detected objects; the 5 values per object are the bounding box and the confidence score) and the track queries $Q$, which are the features corresponding to each detected object. The backbone of the transformer takes two consecutive frames, $I_{t-1}$ and $I_t$, as input and produces the stacked feature map $F$. The transformer encoder consists of a self-attention block and a feed-forward block, taking $F$ as the query to generate the enhanced feature $E$ for the decoder. The transformer decoder, comprising a cross-attention block and a feed-forward block, utilizes the track queries $Q$ and the enhanced feature $E$ as the query and key, respectively. An MLP after the decoder produces the prediction results $P \in \mathbb{R}^{N \times 4}$ (the 4 values are the bounding box). For each object detected in frame $t-1$ and represented by a track query, the prediction results give its predicted position in frame $t$. The Hungarian algorithm is employed to achieve bipartite matching: the assignment is determined by a cost matrix that compares new detections with the tracks obtained in previous frames. We discuss later how the prediction results are selectively used to populate the cost matrix.
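As a concrete illustration of the matching step (a sketch, not the authors' implementation), the following code builds a cost matrix from $1 - \mathrm{IoU}$ between predicted track boxes and new detections and solves the bipartite assignment with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`; the box format (x1, y1, x2, y2) is assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detections):
    """Match predicted track positions to new detections (cost = 1 - IoU)."""
    cost = np.array([[1.0 - iou(p, d) for d in detections]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(int(r), int(c)) for r, c in zip(rows, cols)]
```

In practice, trackers usually also gate the assignment (e.g., reject matches whose cost exceeds a threshold), which is omitted here for brevity.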
3.2. Transformers and Linear Track
We have designed a hybrid strategy based on confidence scores to make motion estimates. Let $B_{t-1} = \{b^1_{t-1}, \dots, b^N_{t-1}\}$ be the locations of the detections in frame $t-1$; our goal is to predict their locations in frame $t$.
For high confidence score detections, we first turn their feature maps into the track queries $Q$. Then, $Q$ goes through a self-attention block, which can be expressed as
$$Q_{sa} = \mathrm{softmax}\!\left(\frac{QQ^{\top}}{\sqrt{d_k}}\right)Q,$$
where $d_k$ is the dimension of the key vector and $Q_{sa}$ is the output of the self-attention block. $Q_{sa}$ is then fed into the cross-attention block, which can be expressed as
$$Q_{ca} = \mathrm{softmax}\!\left(\frac{Q_{sa}E^{\top}}{\sqrt{d_k}}\right)E,$$
where $Q_{ca}$ is the output of the cross-attention block and $E$ represents the enhanced feature generated by the encoder. Finally, a feed-forward network and an MLP generate the final predictions:
$$P = \mathrm{MLP}(\mathrm{FFN}(Q_{ca})),$$
where $P \in \mathbb{R}^{N \times 4}$ (the 4 values are the bounding box) are the predicted locations in frame $t$.
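The attention operations above can be sketched in NumPy as follows; this is an illustrative sketch with assumed shapes, and the residual connections, multi-head splitting, and layer normalization of a full transformer block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = key.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)
    return softmax(scores) @ value

# N track queries of dimension d (illustrative sizes).
N, d = 4, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))       # track queries
Q_sa = attention(Q, Q, Q)             # self-attention block
E = rng.standard_normal((100, d))     # enhanced feature from the encoder
Q_ca = attention(Q_sa, E, E)          # cross-attention block
```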
For low confidence score detections, we estimate their locations with a simple linear model. Let $b_{t-1}$ be the location of one low confidence score detection in frame $t-1$; its location in frame $t$ is predicted by
$$\hat{b}_t = b_{t-1} + v,$$
where $v$ is the mean velocity of this object over the last $M$ frames. The appropriate number of frames over which to compute the mean velocity is determined by further experiments; we set $M$ to be 1.
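A minimal sketch of the linear prediction, assuming the trajectory stores one box per frame; with $M = 1$ the mean velocity reduces to the most recent displacement.

```python
import numpy as np

def linear_predict(history, M=1):
    """Extrapolate the next box from the mean velocity over the last M steps.

    history: array-like of shape (T, 4), one (x1, y1, x2, y2) box per frame.
    """
    history = np.asarray(history, dtype=float)
    M = min(M, len(history) - 1)       # cannot look back further than we have
    if M < 1:
        return history[-1]             # no motion information yet
    v = (history[-1] - history[-1 - M]) / M  # mean velocity over M steps
    return history[-1] + v
```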
The whole hybrid strategy can be represented by
$$\hat{b}^i_t = \begin{cases} T(b^i_{t-1}), & s^i \ge \tau, \\ b^i_{t-1} + v^i, & s^i < \tau, \end{cases}$$
where $b^i_{t-1}$ represents the location of the $i$-th detection in frame $t-1$ and $s^i$ represents its confidence score. $T(\cdot)$ represents the transformer-based processing for high score detections discussed above, and $\tau$ is the threshold for the confidence score. We set $\tau$ to 0.9 based on further experiments.
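Putting the two branches together, the hybrid strategy is a confidence-gated dispatch. In this sketch, `transformer_predict` is a placeholder standing in for the transformer branch, and the linear branch uses the last displacement (the $M = 1$ case).

```python
TAU = 0.9  # confidence threshold (0.9 in the paper)

def hybrid_predict(box, score, history, transformer_predict, tau=TAU):
    """Route a detection to the transformer or the linear motion model.

    box:     current (x1, y1, x2, y2) detection.
    score:   its confidence score.
    history: list of this track's boxes, most recent last.
    transformer_predict: callable for the high-confidence branch (placeholder).
    """
    if score >= tau:
        return transformer_predict(box)              # high-confidence branch
    # low-confidence branch: constant-velocity extrapolation (M = 1)
    v = [c - p for c, p in zip(history[-1], history[-2])]
    return [c + d for c, d in zip(box, v)]
```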
3.3. Training
Following the same settings as in TransTrack [24], we use static images as training data; the adjacent frame is simulated by randomly scaling and translating the static image. First, a trained detector generates detections and track queries from the original frame. Second, the track queries and the adjacent frame are fed into the transformer to obtain the prediction results. We apply a set prediction loss to supervise the prediction results; the set-based loss produces an optimal bipartite matching between the predictions and the ground-truth objects. The matching cost is defined as
$$\mathcal{L} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{giou}\mathcal{L}_{giou},$$
where $\mathcal{L}_{cls}$ is the focal loss, $\mathcal{L}_{L1}$ denotes the L1 loss, $\mathcal{L}_{giou}$ is the generalized IoU loss, and $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{giou}$ are the corresponding weight coefficients. The training loss is the same as the matching cost except that it is only computed on matched pairs.
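Of the three cost terms, the generalized IoU is the least standard, so a sketch may help; boxes are assumed to be axis-aligned in (x1, y1, x2, y2) format, and the corresponding loss term is $1 - \mathrm{GIoU}$.

```python
def giou(a, b):
    """Generalized IoU: IoU - |C \\ (A ∪ B)| / |C|, where C is the
    smallest box enclosing both A and B. Ranges over (-1, 1]."""
    # intersection
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c_area - union) / c_area
```

Unlike plain IoU, GIoU still provides a gradient signal when the boxes do not overlap, since the enclosing-box penalty grows as they move apart.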
5. Conclusions
This paper introduces TLtrack, a novel hybrid strategy that makes motion estimates based on confidence scores. For detections with a high confidence score, TLtrack employs transformers to predict locations; for detections with a low confidence score, it resorts to a straightforward linear model. In this way, TLtrack considers not only the past direction of the trajectory but also the latest movements of each object. Its strength lies in its simplicity, real-time processing capability, and effectiveness. An empirical evaluation on the DanceTrack dataset shows that our method achieves the best performance compared with other motion models.