SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking

Ding, Yan; Ling, Yuchen; Zhang, Bozhi; Li, Jiaxin; Guo, Lingxi; Yang, Zhe

doi:10.3390/s24186015

Open AccessArticle

SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking

by

Yan Ding

¹,

Yuchen Ling

¹

,

Bozhi Zhang

^1,*

,

Jiaxin Li

²,

Lingxi Guo

² and

Zhe Yang

²

¹

Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China

²

Science and Technology on Space Physics Laboratory, Beijing 100076, China

^*

Author to whom correspondence should be addressed.

Sensors 2024, 24(18), 6015; https://doi.org/10.3390/s24186015

Submission received: 25 July 2024 / Revised: 11 September 2024 / Accepted: 16 September 2024 / Published: 17 September 2024

(This article belongs to the Section Sensing and Imaging)

Download

Browse Figures

Versions Notes

Abstract

:

Multi-object tracking tasks aim to assign unique trajectory codes to targets in video frames. Most detection-based tracking methods use Kalman filtering algorithms for trajectory prediction, directly utilizing associated target features for trajectory updates. However, this approach often fails, with camera jitter and transient target loss in real-world scenarios. This paper rethinks state prediction and fusion based on target temporal features to address these issues and proposes the SimpleTrackV2 algorithm, building on the previously designed SimpleTrack. Firstly, to address the poor prediction performance of linear motion models in complex scenes, we designed a target state prediction algorithm called LSTM-MP, based on long short-term memory (LSTM). This algorithm encodes the target’s historical motion information using LSTM and decodes it with a multilayer perceptron (MLP) to achieve target state prediction. Secondly, to mitigate the effect of occlusion on target state saliency, we designed a spatiotemporal attention-based target appearance feature fusion (TSA-FF) target state fusion algorithm based on the attention mechanism. TSA-FF calculates adaptive fusion coefficients to enhance target state fusion, thereby improving the accuracy of subsequent data association. To demonstrate the effectiveness of the proposed method, we compared SimpleTrackV2 with the baseline model SimpleTrack on the MOT17 dataset. We also conducted ablation experiments on TSA-FF and LSTM-MP for SimpleTrackV2, exploring the optimal number of fusion frames and the impact of different loss functions on model performance. The experimental results show that SimpleTrackV2 handles camera jitter and target occlusion better, achieving improvements of 1.6%, 3.2%, and 6.1% in MOTA, IDF1, and HOTA, respectively, compared to the SimpleTrack algorithm.

Keywords:

multiple object tracking; timing characteristics; state prediction; state fusion

1. Introduction

Multi-object tracking is a challenging task in computer vision that aims to estimate the position and identity of multiple objects in a video sequence [1,2]. Detection-based tracking is a common paradigm for these tasks. It completely decouples the object detection process from the multi-object tracking process. As a result, the tracking process can be viewed as a post-processing step of the detection results. This separation allows for independent optimization of detection and tracking algorithms.

In target-intensive scenarios, tracking becomes challenging due to occlusion between targets and irregular camera jitter caused by external forces (such as wind or mechanical vibrations). When tracking relies solely on the position information the detector provides, it often leads to frequent jumps in the target’s identity. To address this issue, a target reidentification feature is introduced after detection to mitigate environmental interference. In this paper, the “state” in target state prediction refers to the target’s position, represented as (x, y, w, h) in a 2D image. This includes the pixel coordinates of the target’s center relative to the upper-left corner of the image, along with the width and height of the detection frame. On the other hand, the “state” in target state fusion refers to the target’s reidentification feature, which is typically a high-dimensional vector containing more identity information than the position feature.

In recent years, many detection-tracking paradigms [3,4,5,6,7,8] have relied on a combination of target state prediction from the previous frame and target state detection from the current frame to determine the state of the target. The results of the target state fusion of the prior frame are then correlated with the target state of the current frame, allowing for updates to the trajectory encoded with the corresponding identity. The Kalman filter algorithm is the most commonly used method for target state prediction. For target state fusion, the exponential sliding average fusion algorithm is frequently employed, or alternatively, the state of the current frame is used directly as the latest state of the trajectory [9,10,11,12,13]. We found several limitations upon re-evaluating the state prediction and fusion methods in detection-tracking paradigms. Existing target state prediction methods often perform poorly in complex scenes, such as those with frequent occlusions between targets or camera jitter. The Kalman filter motion prediction in particular shows significant bias when the target’s prior knowledge is missing or inaccurate. Moreover, most existing target state fusion methods focus on spatial relationships between different state cues, with limited consideration of temporal dimension fusion. According to our experiments, the performance of these methods degrades significantly in environments with frequent target occlusions.

In pedestrian-dense scenes, we analyzed the difference between the target’s motion state in consecutive frames and the prediction results obtained using a linear Kalman filter. The analysis results are presented in Figure 1.

Before the 100th frame, when the target is not yet occluded, the Kalman filter’s predictions closely match the actual results, and the calculated intersection over union (IoU) between consecutive frames remains consistently high. This high IoU supports accurate identity assignment during subsequent target association. However, starting from frame 100, as the target begins to experience occlusion, the absence of new detection boxes for updating the Kalman filter causes its predictions to deviate from the true trajectory gradually. This deviation eventually leads to errors in the target’s identity assignment. Thus, it can be concluded that the linear Kalman filter has limited accuracy when predicting target motion states over extended periods. Furthermore, Figure 1 shows that during pedestrian movement, there is a periodic swinging of the arms and changes in stride. This causes the frame difference data for the pedestrian’s motion state to exhibit distinct periodic characteristics.

To address this problem, we developed an LSTM-MP model to extract temporal features of the target state for improved prediction. We further incorporated an attention mechanism and designed a spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm to compute fusion coefficients from these temporal features. To demonstrate the effectiveness of our proposed prediction and fusion methods, we integrated them into the SimpleTrack tracking framework [1]. As shown in Section 4.3, our approach improved multiple MOT metrics, including MOTA, IDF1, and HOTA.

To further explore the advantages of target state prediction and fusion methods based on temporal features, we propose the SimpleTrackV2 tracking framework, an extension of SimpleTrack. This framework uses a two-stage network to acquire the target state: YOLOX [14] extracts positional features, while SBSResNest [15] captures appearance features. SimpleTrackV2 incorporates a new target state fusion mechanism, TSA-FF, and a novel LSTM-based target state prediction model, LSTM-MP. Experimental results demonstrate that SimpleTrackV2 performs better than the original SimpleTrack algorithm in terms of MOTA, IDF1, and HOTA.

Our main contributions are as follows.

We propose a new target state prediction model, LSTM-MP, which leverages the historical information of the target state to enhance prediction accuracy. This model remains effective even when state observations are correct or prior knowledge of the target is missing or inaccurate, improving the algorithm’s tracking robustness.
We design a novel target state fusion algorithm, TSA-FF, which incorporates both temporal and spatial attention mechanisms. This approach enhances the ability to distinguish target identities, providing more reliable state information for the subsequent data association stage.

The remainder of the paper is arranged as follows. Section 2 summarizes related work, including target reidentification feature extraction, target state prediction, and target state fusion. Section 3 introduces the methodology of SimpleTrackV2, including the target state prediction model LSTM-MP and the target state fusion model TSA-FF. In Section 4, experimental results are provided to verify the performance of the proposed SimpleTrackV2. Section 5 briefly summarizes the work and discusses future directions.

2. Related Work

Multi-target tracking technology primarily involves target detection, feature extraction, motion prediction, data association, and trajectory management. The target state prediction method introduced in this paper can be considered an approach for motion prediction, while target state fusion serves as an enhancement to target reidentification features. Therefore, this paper reviews relevant research in three key areas: target reidentification feature extraction, target state prediction, and target state fusion.

2.1. Target Reidentification Feature Extraction

Target reidentification feature extraction seeks to obtain robust target features under varying observation viewpoints, image resolutions, lighting conditions, poses, occlusions, modalities, camera environments, and backgrounds [16,17,18,19,20,21,22]. These models typically treat the same objects in different frames as distinct classes and learn embeddings of cropped detection images using cross-entropy loss or triplet loss for ID classification [19,23,24,25,26,27,28,29,30]. Many algorithms incorporate additional modules in detection tracking paradigms to post-process network inference results. For example, the MOT with multiple cues algorithm employs GoogLeNet to extract reidentification features and then uses the SAC module for classification [31]. Similarly, the multiplex labeling graph MOT algorithm uses a detection multiplexing method to handle target reidentification features and address occlusion issues [32]. While these methods improve performance on specific datasets, they also introduce significant computational overhead, increasing execution time and limiting the model’s portability and reusability.

To address this problem, SimpleTrackV2 employs the FastReID [15] framework to construct a target reidentification feature extraction network and directly uses its outputs for subsequent state fusion. FastReID provides a modular framework for end-to-end feature extraction in ReID tasks. Unlike traditional applications of FastReID, which focus solely on reidentification, SimpleTrackV2 leverages these features to improve multi-target tracking. By using target reidentification features as the foundation for state fusion, SimpleTrackV2 enhances overall tracking performance.

2.2. Target State Prediction Methods

Most multi-target tracking algorithms use a linear motion model to predict the target state [9,10,11,12,25,33,34,35,36,37,38]. However, this approach assumes linear target motion and struggles to predict the target’s subsequent state during nonlinear motion accurately.

SimpleTrackV2 incorporates the LSTM-MP model to address this limitation, leveraging the long short-term memory (LSTM) network’s ability to capture and maintain information over long time spans [39,40,41,42,43,44,45]. Unlike the linear motion model, LSTM-MP integrates the temporal characteristics of the target state to better model its motion patterns, providing significantly greater robustness.

2.3. Target State Fusion Methods

Data association based on target states is typically performed using single-frame target states [11,12,46,47,48,49,50]. Methods like DeepSORT [12] and ByteTrack [13] associate the target state of the previous frame with the current frame. While these approaches have relatively simple pipelines, they struggle to produce accurate results during occlusion when relying solely on single-frame target states.

BoT-SORT [9] sought to address this issue, employing an exponential averaging method to determine state fusion coefficients. However, in complex scenarios, the importance of target states does not always follow a simple exponential relationship with time. Unlike BoT-SORT, SimpleTrackV2 uses the TSA-FF algorithm to compute state fusion coefficients based on the temporal characteristics of target states. This approach allows SimpleTrackV2 to model the fusion mechanism of target states over time, achieving stable data association.

3. SimpleTrackV2

In this section, we present the details of SimpleTrackV2. As illustrated below, SimpleTrackV2 inherits the feature decoupling and association components from SimpleTrack. The improvements introduced in SimpleTrackV2, highlighted by the red boxes in Figure 2, include the design of LSTM-MP for target state prediction and the spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm for target state fusion.

3.1. Target State Prediction

Current multi-target tracking algorithms typically employ a linear motion model, with the target’s motion state modeled using a linear Kalman filter. During tracking, a Kalman filter is initialized for each target. Before the target association begins, the Kalman filter predicts the target’s position in the next frame. After completing the target association, the filter is updated using information from the matched detection frame. The target state

{\hat{x}}_{k}

of the current frame can be obtained from the target state

{\hat{x}}^{'}_{k - 1}

of the previous frame, and the process can be expressed as:

{\hat{x}}_{k} = F_{k} {\hat{x}}^{'}_{k - 1} + B_{k} {\vec{u}}_{k}

(1)

where

F

denotes the state transfer matrix,

B

denotes the control matrix, and

{\vec{u}}_{k}

denotes the control input.

The process of predicting the error covariance

P_{k}

of the current time frame based on the error covariance of the target state

{P^{'}}_{k - 1}

in the previous frame and the process noise

Q_{k}

can be expressed as:

P_{k} = F_{k} {P^{'}}_{k - 1} F_{k}^{T} + Q_{k}

(2)

In the update phase, the predicted error covariance

P_{k}

and measurement noise

R_{k}

are first used to calculate the Kalman gain

K^{'}

, which can be expressed as:

K^{'} = P_{k} H_{k}^{T} {(H_{k} P_{k} H_{k}^{T} + R_{k})}^{- 1}

(3)

where

H

is the observation matrix that describes the observation model.

Updating the state estimate for the current frame based on the observed and predicted states can be expressed as:

{\hat{x}}_{k}^{'} = {\hat{x}}_{k} + K^{'} ({\vec{z}}_{k} - H_{k} {\hat{x}}_{k})

(4)

where

{\hat{x}}_{k}^{'}

denotes the updated target state of the current frame and

{\vec{z}}_{k}

denotes the observed value of the target of the current frame.

Finally, the error covariance

P_{k}^{'}

of the current frame is updated based on the Kalman gain

K^{'}

, which can be expressed as:

P_{k}^{'} = P_{k} - K^{'} H_{k} P_{k}

(5)

The process noise

Q

in the Kalman filtering algorithm reflects the uncertainty in the system model, with a larger

Q

indicating greater uncertainty in the system model, and the measurement noise covariance matrix

R

reflects the uncertainty in the sensor measurements, with a large

R

indicating greater measurement uncertainty. The dependence of the filter on the measured and predicted values can be adjusted by varying the size of

Q

and

R

. Although it is possible to mitigate the effect of the predicted values on the filter by reducing

Q

, this approach also suffers from the problem of not being able to correct by the predicted values when there is a large uncertainty in the detected values, and also the adjustment of

Q

and

R

needs to be continually experimented with.

To address the above problems, this paper designs an LSTM-MP module based on the LSTM structure. The LSTM-MP uses the LSTM as an encoder and the MLP as a decoder to predict the target’s motion, as shown in Figure 3.

The overall structure of LSTM-MP is shown in Figure 4. The encoder uses an LSTM network, and the decoder uses an MLP (multilayer perceptron) network, where the LSTM of the encoder consists of an oblivion gate, a memory gate, a cellular state, and an output gate.

The input of the forgetting gate consists of the target state

x_{t}

of the current frame and the

h_{t - 1}

of the target state of the previous frame computed by the LSTM-MP, and the output of the forgetting gate can be expressed as:

f_{t} = σ (W_{f} h_{t - 1} + U_{f} x_{t - 1} + b_{f})

(6)

where

f_{t}

denotes the moment t oblivion gate output,

σ (\cdot)

denotes the sigmoid function,

W_{f}

and

U_{f}

denote the weight parameters of the oblivion gate, and

b

denotes the bias parameter of the oblivion gate.

LSTM-MP has two memory gates, which use sigmoid and tanh functions to compute the memorization degree of different data, respectively. The data computation of the two memory gates can be expressed as:

i_{t} = σ (W_{i} h_{t - 1} + U_{i} x_{t} + b_{i})

(7)

a_{t} = \tanh (W_{a} h_{t - 1} + U_{a} x_{t} + b_{a})

(8)

where

i_{t}

and

a_{t}

denote the memory gate output at moment t,

σ (\cdot)

denotes the sigmoid function,

\tanh (\cdot)

denotes the tanh function,

W_{i}

,

U_{i}

,

W_{a}

, and

U_{a}

denote the weight parameter of the memory gate input data, and

b_{i}

and

b_{a}

denote the bias parameter of the memory gate input data, respectively.

The cell state

C_{t}

in the LSTM-MP model denotes the memory state of the encoder for the data at moment t, which can be represented by a combination of the forgetting gate and memory gate:

C_{t} = C_{t - 1} ⊙ f_{t} + i_{t} ⊙ a_{t}

(9)

where

⊙

denotes the Hadamard product, i.e., element-by-element multiplication.

The output gate

h_{t}

in the encoder is computed from the combination of the cell state

C_{t}

, the output state

h_{t - 1}

of the previous frame, and the input combination

x_{t}

of the current frame, and the operation can be expressed as:

h_{t} = o_{t} ⊙ \tanh (C_{t})

(10)

o_{t} = σ (W_{o} h_{t - 1} + U_{o} x_{t} + b_{o})

(11)

The LSTM-MP model linearly maps the output of the last frame of data of the LSTM structure and computes the final predicted output through two linear layers. The parameters of the designed linear layers are shown in Table 1, where the output dimension of Linear#2 is 20, which indicates the prediction of the next five frames of target motion data. The length of each data frame is 4, which indicates the prediction parameters of the target’s center point position, width, and height.

3.2. Target State Fusion

Traditional methods for fusing appearance features over time assume that more recent target information is more valuable, modeling fusion coefficients as an exponential function of the time interval between the historical and current frames. However, in complex scenarios, the relationship between a target’s appearance features across different time points is more complex than an exponential function can capture. Thus, the fusion coefficient should account not only for the time interval but also for the information extracted from the appearance features. Our approach enhances the appearance features of the target in the current frame, improving the differentiation between similar targets and maintaining feature consistency for the same target during tracking.

To capture the state-implicit correlations of the target’s historical frames, we propose a spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm, which computes adaptive fusion coefficients. TSA-FF combines historical state information to enhance the robustness of the current frame states. The algorithm employs an attention mechanism to calculate weighted fusion coefficients for target appearance features, as illustrated in Figure 5. Given the challenges posed by complex backgrounds and target motion, TSA-FF incorporates features specific to multi-target tracking and introduces self-spatial and interactive spatial attention weights within the attention mechanism.

The self-spatial attention weighting evaluates the fusion weights of appearance features from different historical frames by calculating their similarity. If a historical frame has a higher similarity to all other historical frames, it is considered a key frame and is assigned a greater fusion weight. The process can be represented as:

a t t e n t i o n_{s e l f}^{n \times n} = f_{h i s t o r y}^{n \times d} \otimes {f_{h i s t o r y}^{n \times d}}^{T}

(12)

w_{s - s e l f}^{n \times 1} = m e a n (a t t e n t i o n_{h i s t o r y}^{n \times n})

(13)

where

a t t e n t i o n_{s e l f}^{n \times n}

denotes the self-attention among the target history frame appearance features,

f_{h i s t o r y}^{n \times d}

denotes the number of history frames,

d

denotes the target appearance feature dimensions, and

\otimes

denotes matrix multiplication.

w_{s - s e l f}^{n \times 1}

denotes the self-spatial attention weight, and

m e a n (\cdot)

denotes the mean of the matrix by row.

To extract the degree of influence that appearance features of a target’s history frames have on the current frame, we propose the concept of interaction space attention weight. This method involves calculating the similarity coefficient between the appearance features of the target’s current frame and those of all its history frames.

Interactive spatial attention weighting uses this similarity coefficient to determine the interactive spatial attention fusion weights of the history frames, a process that can be expressed as:

a t t e n t i o n_{c r o s s}^{1 \times n} = f_{c u r r e n t}^{1 \times d} \otimes {f_{h i s t o r y}^{n \times d}}^{T}

(14)

w_{s - c r o s s}^{n \times 1} = {(a t t e n t i o n_{c r o s s}^{1 \times n})}^{T}

(15)

where

a t t e n t i o n_{c r o s s}^{1 \times n}

denotes the interaction attention between the appearance features of the target’s current frame and the history frames,

f_{c u r r e n t}^{1 \times d}

denotes the appearance features of the target’s current frame,

f_{h i s t o r y}^{n \times d}

denotes the appearance features of the target’s history frames,

n

denotes the number of history frames,

d

denotes the dimension of the target’s appearance features,

\otimes

denotes the matrix multiplication, and

w_{s - c r o s s}^{n \times 1}

denotes the interaction spatial attention weight. If a history frame has a higher similarity to the target’s current frame, it is considered to contribute more to feature fusion. Consequently, it is assigned a larger fusion weight.

Instead of modeling time intervals through manual design, self-space attention weights and interaction space attention weights extract complex temporal dependencies from the perspective of target historical appearance features in higher dimensions. This approach determines the temporal fusion method of target appearance features in complex scenes. Based on self-space attention weights and interaction space attention weights, the target appearance feature fusion weights are represented as:

w_{c o m b i n e}^{n \times 1} = (w_{s - c r o s s}^{n \times 1} + w_{s - s e l f}^{n \times 1}) ⊙ w_{t}^{n \times 1}

(16)

where

⊙

denotes the Hadamard product, i.e., the multiplication of elements between matrices.

w_{c o m b i n e}^{n \times 1}

denotes the fusion weight result, obtained by summing the self-space attention weights and the interaction space attention weights, followed by an arithmetic operation with the temporal attention weights.

Further, the fused target appearance feature may be represented as:

f_{c o m b i n e}^{1 \times d} = n o r m (n o r m ({w_{c o m b i n e}^{n \times 1}}^{T} \otimes f_{h i s t o r y}^{n \times d}) + f_{c u r r e n t}^{1 \times d})

(17)

The fused target appearance feature

f_{c o m b i n e}^{1 \times d}

contains the appearance feature information

f_{h i s t o r y}^{n \times d}

of the target in the previous n frames. The attention weight

w_{c o m b i n e}^{n \times 1}

is no longer the traditional weight designed based on artificial a priori knowledge, but the result of the combined consideration of self-space attention weight and interaction space attention weight, and

n o r m (\cdot)

denotes the feature vector normalization operation.

4. Experiments

4.1. Datasets and Metrics

4.1.1. Datasets

We validate the proposed algorithm using the MOTChallenge2017 pedestrian multi-target tracking dataset, which consists of 14 video sequences, 7 for training and 7 for testing. For experimental purposes, each training sequence is divided in half, with the first half used for training and the second half for validation. This dataset poses significant challenges to tracking performance due to its high crowd density, diverse environmental conditions, and varying camera perspectives.

4.1.2. Evaluation Metrics

To evaluate tracking performance, we employed TrackEval to assess all metrics, including MOTA [51], IDF1 [52], and HOTA [53]. MOTA is calculated based on false positives (FPs), false negatives (FNs), and identity switches (IDs), primarily reflecting detection performance, as the number of FPs and FNs exceeds that of IDs. IDF1 measures the capacity for identity retention, focusing more on association performance. HOTA, a metric introduced recently, explicitly balances the effectiveness of accurate detection, association, and localization [13].

4.2. Implementation Details

4.2.1. Tracker

We used SimpleTrack [1] as the baseline model, and in the tracking phase, SimpleTrack defaults to a high detection score threshold

τ_{high}

of 0.3, a low detection score threshold

τ_{low}

of 0.2, a trajectory initialization score

ε

of 0.6, and a trajectory retrieval score

ε_{r}

of 0.1. In the linear assignment step, the assignment threshold is 0.8 for high-confidence detection and 0.4 for low-confidence detection.

4.2.2. Target Motion Trajectory Dataset Construction

The target motion state prediction model, LSTM-MP, requires the target’s historical frame motion states as inputs and uses the future frame motion states as supervised data for training. We construct the target motion trajectory dataset from the MOT17 validation set. Ground truth annotations are first loaded from the MOT17 dataset. For each video sequence, we extracted all motion states of targets with the same identity ID, computed the frame differences between consecutive motion states, arranged them chronologically, and saved the results for each identity ID.

We implemented a data loading strategy to train the model, which handles varying lengths of historical frame inputs and future frame outputs, as outlined in Algorithm 1.

Algorithm: Target state prediction model training loading data strategy

4.3. Ablation Studies

4.3.1. Comparison with Other Methods

We used the team’s previous work, SimpleTrack [1], as a baseline model, which has demonstrated superior performance compared to algorithms such as RelationTrack [54], Semi-TCL [29], CSTrack [55], FairMOT [10], TransTrack [56], and others. As shown in Table 2, due to the overprediction of densely packed small targets in the MOT17 dataset and the mutual suppression caused by the joint query mechanism for tracking and detection in the transformer decoder, especially when new targets are close to already tracked ones, this conflict limits the ability of Transformer-based methods to detect new targets. Our method, SimpleTrackV2, employs the JDE architecture, integrating detection and feature extraction into a single network while decoupling them from subsequent motion prediction and data association. Leveraging the temporal memory mechanism of the LSTM-MP model and the effective feature enhancement capability of the TSA-FF, SimpleTrackV2 achieves higher detection and tracking performance. It surpasses SimpleTrack by 6.1% in HOTA, 3.2% in IDF1, and 1.6% in MOTA. These experimental results further validate the effectiveness of our proposed method.

4.3.2. Target Appearance Feature Fusion Model Ablation Experiment

We compared the performance of single-frame appearance features without fusion, the exponential sliding average fusion method, and the spatiotemporal attention feature fusion method on the MOT17 dataset. The evaluation metrics used were MOTA, IDF1, and HOTA. The results are shown in Table 3.

Our results indicate that the exponential sliding average method has no significant advantage over single-frame appearance features, outperforming them by only 0.004 percentage points in the MOTA metric.

In contrast, our proposed appearance feature fusion method based on spatiotemporal attention demonstrates higher tracking accuracy and stability. Compared to the exponential sliding average method, our feature fusion method improves the MOTA metric by 0.34 percentage points, the IDF1 metric by 0.85 percentage points, and the HOTA metric by 0.289 percentage points.

4.3.3. Analysis of Fusion Frame Count in the Target Appearance Feature Fusion Model

To verify the effect of different numbers of history frames on the fusion of target appearance features, we tested the tracking performance of the model under various conditions: fusing 10, 20, 30, 40, and 50 history frames. The results are summarized in Table 4.

The results indicate that the tracking performance improves gradually as the number of fused history frames increases, reaching its peak when fusing 40 history frames. At this point, the tracking algorithm achieves an MOTA metric of 76.872, an IDF1 metric of 79.097, and an HOTA metric of 67.528.

4.3.4. Comparative Experiments on Target Motion State Prediction Models

To evaluate the effectiveness of LSTM-MP, we used SimpleTrack [1] as the baseline model. The appearance feature fusion algorithms utilize a spatiotemporal attention fusion method with 40 historical frames. We compared the tracking performance of the LSTM-MP algorithm using two different input configurations, as presented in Table 5.

Configuration #1: The inputs to the target motion state prediction network include the target center point coordinates, target width and height, and their respective frame differences, totaling 8 data points.

Configuration #2: The inputs consist of only the frame differences of the target center point coordinates and width–height, totaling 4 data points.

For both configurations, the outputs are the center point and width–height frame differences of the target’s future frames. The results show that LSTM-MP #2 outperforms LSTM-MP #1, achieving a 0.402 percentage point increase in the MOTA metric, a 0.762 percentage point increase in the IDF1 metric, and a 0.843 percentage point increase in the HOTA metric.

Due to the different supervisory effects of various loss functions on LSTM-MP training, we compared the training results of LSTM-MP models under the supervision of the Smooth L1 and MSE loss functions. The results are presented in the following table. The findings indicate that, under the supervision of the MSE loss function, the LSTM-MP #2 model achieves higher MOTA and IDF1 metrics compared to the Smooth L1 loss function. However, the HOTA metric is slightly lower for the MSE-trained model. The results are shown in Table 6.

To assess the effectiveness of LSTM-MP in enhancing tracking performance, we compared it with the GRU (gated recurrent unit) and Kalman filter algorithms. The experimental results are presented in Table 7. The GRU model performed worse in all tracking metrics compared to the LSTM-based model. The LSTM-MP + KF model, which employs LSTM for motion prediction in mobile camera scenarios and Kalman filtering for fixed camera scenarios, outperformed the Kalman filter alone in terms of MOTA, IDF1, and HOTA. However, due to the additional computational overhead introduced by LSTM-MP, the algorithm’s runtime efficiency decreases by 5.36 fps, resulting in a final performance of 12.44 fps.

5. Conclusions and Future Work

We propose an adaptive target state fusion method, spatiotemporal attention-based target appearance feature fusion (TSA-FF), within the SimpleTrack framework. TSA-FF utilizes historical state information to compute fusion coefficients, mitigating the impact of occlusion on target state estimation. Additionally, we introduce a target state prediction module, LSTM-MP, to model nonlinear motion patterns of the target. LSTM, a specialized recurrent neural network (RNN), excels at capturing and retaining dependencies within long sequences. Its unique memory cells effectively manage the flow of past information, allowing LSTM to leverage historical motion states when predicting future states, critical for multi-object tracking under occlusion and jitter. LSTM’s ability to learn nonlinear patterns in data enables more accurate predictions of future target positions, particularly in scenarios involving complex and nonlinear motion. This stands in contrast to linear Kalman filtering (LKF), which relies on process noise models and can misinterpret irregular or sudden jitters, leading to prediction errors. Finally, as a data-driven model, LSTM can autonomously adapt to various complex scenarios through extensive training, unlike LKF, which is limited by its reliance on preset models and noise parameters. Together, these components constitute SimpleTrackV2, which achieves 67.7 HOTA and 76.9 MOTA on the MOT17 dataset, surpassing SimpleTrack’s performance of 61.6 HOTA and 76.3 MOTA, and demonstrating superior adaptability and robustness in dynamic tracking environments.

Future work will focus on reducing the space complexity of the target state fusion algorithm and enhancing the generalization capabilities of the state prediction module.

Author Contributions

Conceptualization, Y.L. and J.L.; investigation, Y.L., Y.D., B.Z. and L.G.; methodology, Y.L. and Y.D.; project administration, Y.D.; software, Y.L. and J.L.; supervision, Y.D.; validation, Y.L.; visualization, Y.L., B.Z. and L.G.; writing—original draft, Y.L., Y.D. and B.Z.; writing—review and editing, Y.D., B.Z., J.L. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets in the experiments are available at https://motchallenge.net (accessed on 8 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, J.; Ding, Y.; Wei, H.L.; Zhang, Y.; Lin, W. SimpleTrack: Rethinking and Improving the JDE Approach for Multi-Object Tracking. Sensors 2022, 22, 5863. [Google Scholar] [CrossRef] [PubMed]
Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-Task Learning for Dense Prediction Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633. [Google Scholar]
Yang, B.; Nevatia, R. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1918–1925. [Google Scholar]
Huang, C.; Li, Y.; Nevatia, R. Multiple Target Tracking by Learning-Based Hierarchical Association of Detection Responses. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 898–910. [Google Scholar]
Riahi, D.; Bilodeau, G.A. Online multi-object tracking by detection based on generative appearance models. Comput. Vis. Image Underst. 2016, 152, 88–102. [Google Scholar]
Mhalla, A.; Chateau, T.; Essoukri Ben Amara, N. Spatio-temporal object detection by deep learning: Video-interlacing to improve multi-object tracking. Image Vis. Comput. 2019, 88, 120–131. [Google Scholar]
Yang, B.; Nevatia, R. Multi-Target Tracking by Online Learning a CRF Model of Appearance and Motion Patterns. Int. J. Comput. Vis. 2014, 107, 203–217. [Google Scholar]
Wang, Z.; Zheng, L.; Liu, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 107–122. [Google Scholar]
Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar]
Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649. [Google Scholar]
Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
He, L.; Liao, X.; Liu, W.; Liu, X.; Cheng, P.; Mei, T. Fastreid: A pytorch toolbox for general instance re-identification. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9664–9667. [Google Scholar]
Li, S.; Xiao, T.; Li, H.; Yang, W.; Wang, X. Identity-Aware Textual-Visual Matching with Latent Co-attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1908–1917. [Google Scholar]
Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C.H. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar]
Karanam, S.; Li, Y.; Radke, R.J. Person Re-Identification with Discriminatively Trained Viewpoint Invariant Dictionaries. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 4516–4524. [Google Scholar]
Li, X.; Zheng, W.S.; Wang, X.; Xiang, T.; Gong, S. Multi-Scale Learning for Low-Resolution Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3765–3773. [Google Scholar]
Wang, Y.; Wang, L.; You, Y.; Zou, X.; Chen, V.; Li, S.; Huang, G.; Hariharan, B.; Weinberger, K.Q. Resource Aware Person Re-identification Across Multiple Resolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8042–8051. [Google Scholar]
Sarfraz, M.S.; Schumann, A.; Eberle, A.; Stiefelhagen, R. A Pose-Sensitive Embedding for Person Re-identification with Expanded Cross Neighborhood Re-ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 420–429. [Google Scholar]
Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. VRSTC: Occlusion-free video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7183–7192. [Google Scholar]
Baisa, N.L. Occlusion-robust online multi-object visual tracking using a GM-PHD filter with CNN-based re-identification. J. Vis. Commun. Image Represent. 2021, 80, 103279. [Google Scholar]
Wang, G.; Song, M.; Hwang, J.N. Recent advances in embedding methods for multi-object tracking: A survey. arXiv 2022, arXiv:2205.10766. [Google Scholar]
Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
Yang, F.; Chang, X.; Sakti, S.; Wu, Y.; Nakamura, S. ReMOT: A model-agnostic refinement for multiple object tracking. Image Vis. Comput. 2021, 106, 104091. [Google Scholar]
Ristani, E.; Tomasi, C. Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6036–6046. [Google Scholar]
Shen, H.; Huang, L.; Huang, C.; Xu, W. Tracklet association tracker: An end-to-end learning-based association approach for multi-object tracking. arXiv 2018, arXiv:1808.01562. [Google Scholar]
Li, W.; Xiong, Y.; Yang, S.; Xu, M.; Wang, Y.; Xia, W. Semi-tcl: Semi-supervised track contrastive representation learning. arXiv 2021, arXiv:2107.02396. [Google Scholar]
Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
Feng, W.; Li, B.; Ouyang, W. Multi-object tracking with multiple cues and switcher-aware classification. In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, 28–30 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–10. [Google Scholar]
Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Ke, W.; Xiong, Z. Multiplex Labeling Graph for Near-Online Tracking in Crowded Scenes. IEEE Internet Things J. 2020, 7, 7892–7902. [Google Scholar]
Chu, Q.; Ouyang, W.; Li, H.; Wang, X.; Liu, B.; Yu, N. Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4836–4845. [Google Scholar]
Zhu, J.; Yang, H.; Liu, N.; Kim, M.; Zhang, W.; Yang, M.-H. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 366–382. [Google Scholar]
Breitenstein, M.D.; Reichlin, F.; Leibe, B.; Koller-Meier, E.; Van Gool, L. Robust tracking-by-detection using a detector confidence particle filter. In Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1515–1522. [Google Scholar]
Xing, J.; Ai, H.; Lao, S. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1200–1207. [Google Scholar]
Chen, J.; Sheng, H.; Zhang, Y.; Xiong, Z. Enhancing Detection Model for Multiple Hypothesis Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2143–2152. [Google Scholar]
Rosello, P.; Kochenderfer, M.J. Multi-agent reinforcement learning for multi-object tracking. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 10–15 July 2018; pp. 1397–1404. [Google Scholar]
Jiang, X.; Li, P.; Li, Y.; Zhen, X. Graph neural based end-to-end data association framework for online multiple-object tracking. arXiv 2019, arXiv:1907.05315. [Google Scholar]
Weng, X.; Wang, Y.; Man, Y.; Kitani, K. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning. arXiv 2020, arXiv:2006.07327. [Google Scholar]
Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13708–13715. [Google Scholar]
Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Li, F.-F.; Savarese, S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 961–971. [Google Scholar]
Gupta, A.; Johnson, J.; Li, F.-F.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2255–2264. [Google Scholar]
Kim, C.; Li, F.; Rehg, J.M. Multi-object Tracking with Neural Gating Using Bilinear LSTM. In Computer Vision–ECCV 2018: Volume 11212; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 208–224. [Google Scholar]
Sadeghian, A.; Alahi, A.; Savarese, S. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 300–311. [Google Scholar]
Bochinski, E.; Eiselein, V.; Sikora, T. High-Speed tracking-by-detection without using image information. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
Bochinski, E.; Senst, T.; Sikora, T. Extending IOU based multi-object tracking by visual information. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
Bae, S.H.; Yoon, K.J. Robust Online Multi-object Tracking Based on Tracklet Confidence and Online Discriminative Appearance Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1218–1225. [Google Scholar]
Brasó, G.; Leal-Taixé, L. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6247–6257. [Google Scholar]
Bernardin, K.; Elbs, A.; Stiefelhagen, R. Multiple object tracking performance metrics and evaluation in a smart room envi-ronment. In Proceedings of the Sixth IEEE International Workshop on Visual Surveillance, Graz, Austria, 13 May 2006; Kingston University: London, UK, 2006. [Google Scholar]
Ristani, E.; Solera, F.; Zou, R.; Rita, C.; Carlo, T. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 17–35. [Google Scholar]
Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar]
Yu, E.; Li, Z.; Han, S.; Wang, H. Relationtrack: Relation-aware multiple object tracking with decoupled representation. IEEE Trans. Multimed. 2022, 25, 2686–2697. [Google Scholar]
Liang, C.; Zhang, Z.; Zhou, X.; Li, B.; Zhu, S.; Hu, W. Rethinking the competition between detection and reid in multiobject tracking. IEEE Trans. Image Process. 2022, 31, 3182–3196. [Google Scholar]
Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
Gao, R.; Wang, L. MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9901–9910. [Google Scholar]
Zhou, X.; Yin, T.; Koltun, V.; Krähenbühl, P. Global tracking transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8771–8780. [Google Scholar]

Figure 1. Motion state differences and linear Kalman filter predictions for the pedestrian target with video sequence ID 23 in MOT17-09. The linear Kalman filter predicts the center point coordinates and the width and height of the target’s bounding box based on a linear motion model. The black curve represents the frame-to-frame differences in the Kalman filter’s predictions, the blue and green curves show the differences in the horizontal and vertical coordinates of the target’s center point, respectively, the purple and orange curves illustrate the differences in the target’s width and height, and the red and cyan curves depict the intersection over union (IoU) of the predicted bounding boxes and the degree of target occlusion.

Figure 2. The target state prediction and fusion pipeline in SimpleTrackV2. SimpleTrackV2 inherits the feature decoupling and association components from SimpleTrack, with the improvements highlighted in the red boxes, which primarily include the design of LSTM-MP for target state prediction and TSA-FF for target state fusion. In the diagram, the blue output lines from feature decoupling represent the extracted positional features, while the purple arrows indicate the extracted appearance features. The LSTM-MP model utilizes the differences in the target’s motion state between consecutive frames as input to predict future motion states. The TSA-FF algorithm computes temporal and spatial attention weights to determine fusion weight coefficients, facilitating the effective fusion of target states.

Figure 3. The LSTM-MP model begins by applying a feature transformation method to convert the motion features of the target’s historical frames into the difference between its front and back frame motion states. This difference is then used as input to the model. The LSTM-MP model encodes and reduces the dimensionality of these features through a multilayer perceptron (MLP) network. Finally, an inverse feature transformation is applied to predict the target’s future motion state, thereby achieving effective motion state prediction.

Figure 4. The overall structure of the LSTM-MP model.

Figure 5. The TSA-FF algorithm first computes the temporal and spatial attention weights for the appearance feature vectors of different time frames for the same target. It then sums the self-spatial attention coefficients of each historical frame with the interaction spatial attention coefficients and multiplies this sum by the temporal attention weights to determine the final fusion coefficients. Finally, the appearance features of each historical frame are weighted and averaged using these fusion coefficients, followed by normalization to obtain the fused target appearance features.

Table 1. Decoder MLP parameter setting.

Linear Layer	Input Dimension	Output Dimension
Linear#1	128	64
Linear#2	64	20

Table 2. Comparison with existing methods. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

Tracker	HOTA↑	IDF1↑	MOTA↑
SimpleTrack [1]	61.6	76.3	75.3
FairMOT [10]	59.3	72.3	73.7
Semi-TCL [29]	59.8	73.2	73.3
RelationTrack [54]	61.0	74.7	73.8
CSTrack [55]	59.3	72.6	74.9
TransTrack [56]	54.1	63.5	75.2
MeMOTR [57]	58.8	71.5	72.8
GTR [58]	59.1	71.5	75.3
SimpleTrackV2	67.7	79.5	76.9

Table 3. We conducted ablation experiments on the target state fusion model, comparing state fusion methods using only single-frame states, exponential sliding average fusion, and the TSA-FF fusion algorithm proposed in this paper. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

Tracker	State Fusion Method	MOTA↑	IDF1↑	HOTA↑
SimpleTrackV2	Single Frame	76.528	78.247	67.239
	EMA	76.532	78.247	67.239
	TSA-FF	76.872	79.097	67.528

Table 4. Hyperparameter experiments with TSA-FF in MOT17, where the number following TSA-FF indicates the number of historical frames fused. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

Tracker	Fusion Frames	MOTA↑	IDF1↑	HOTA↑
SimpleTrackV2	TSA-FF (10)	76.521	78.318	67.309
	TSA-FF (20)	76.51	78.323	67.308
	TSA-FF (30)	76.523	78.341	67.314
	TSA-FF (40)	76.872	79.097	67.528
	TSA-FF (50)	76.867	79.094	67.527

Table 5. Performance of LSTM-MP tracking with different input data. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

State Fusion	State Prediction	MOTA↑	IDF1↑	HOTA↑
TSA-FF (40)	LSTM-MP #1	75.684	77.414	65.664
TSA-FF (40)	LSTM-MP #2	76.086	78.176	66.507

Table 6. Performance of LSTM-MP tracking with different loss function supervision. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

State Fusion	State Prediction	Loss	MOTA↑	IDF1↑	HOTA↑
TSA-FF (40)	LSTM-MP #2	Smooth L1	76.049	78.395	66.591
TSA-FF (40)	LSTM-MP #2	MSE	76.113	78.711	66.468

Table 7. Performance of tracking effects for different state prediction models. The symbol “↑” indicates that the higher the value of the metric, the better the tracking performance of the algorithm.

State Fusion	State Prediction	MOTA↑	IDF1↑	HOTA↑
TSA-FF (40)	GRU #2	75.958	78.539	66.206
	LSTM-MP #2	76.113	78.711	66.468
	Kalman Filter (KF)	76.872	79.097	67.528
	LSTM-MP&KF	76.981	79.526	67.705

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ding, Y.; Ling, Y.; Zhang, B.; Li, J.; Guo, L.; Yang, Z. SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking. Sensors 2024, 24, 6015. https://doi.org/10.3390/s24186015

AMA Style

Ding Y, Ling Y, Zhang B, Li J, Guo L, Yang Z. SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking. Sensors. 2024; 24(18):6015. https://doi.org/10.3390/s24186015

Chicago/Turabian Style

Ding, Yan, Yuchen Ling, Bozhi Zhang, Jiaxin Li, Lingxi Guo, and Zhe Yang. 2024. "SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking" Sensors 24, no. 18: 6015. https://doi.org/10.3390/s24186015

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking

Abstract

1. Introduction

2. Related Work

2.1. Target Reidentification Feature Extraction

2.2. Target State Prediction Methods

2.3. Target State Fusion Methods

3. SimpleTrackV2

3.1. Target State Prediction

3.2. Target State Fusion

4. Experiments

4.1. Datasets and Metrics

4.1.1. Datasets

4.1.2. Evaluation Metrics

4.2. Implementation Details

4.2.1. Tracker

4.2.2. Target Motion Trajectory Dataset Construction

4.3. Ablation Studies

4.3.1. Comparison with Other Methods

4.3.2. Target Appearance Feature Fusion Model Ablation Experiment

4.3.3. Analysis of Fusion Frame Count in the Target Appearance Feature Fusion Model

4.3.4. Comparative Experiments on Target Motion State Prediction Models

5. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI