Article

Multi-Object Tracking with Predictive Information Fusion and Adaptive Measurement Noise

1 School of Computer Science and Engineering, Guilin University of Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541004, China
3 School of Physics and Electronic Information Engineering, Guilin University of Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(2), 736; https://doi.org/10.3390/app15020736
Submission received: 10 November 2024 / Revised: 18 December 2024 / Accepted: 8 January 2025 / Published: 13 January 2025

Abstract

Multi-object tracking (MOT) aims to detect objects in video sequences and associate them across frames. Currently, the mainstream research direction regarding MOT is the tracking-by-detection (TBD) framework. Tracking results are highly sensitive to detection outputs, and challenges from object occlusion and complex motion present significant obstacles in the field of MOT. To reduce dependence on detection outputs, we propose a method that integrates predictive information to improve Non-Maximum Suppression (NMS). By applying secondary modulation to the suppression scores and dynamically adjusting the suppression threshold using tracking information, our method better retains candidate boxes for occluded objects. Furthermore, to track occluding and overlapping objects more effectively, we introduce an adaptive measurement noise method that adjusts the measurement noise to mitigate the impact of object occlusion or overlap on tracking accuracy. Additionally, we enhance the affinity matrix in the association algorithm by incorporating height information, thereby improving the stability of complex moving objects. Our method outperforms the baseline model ByteTrack on the DanceTrack dataset, increasing Higher Order Tracking Accuracy (HOTA), Multi-Object Tracking Accuracy (MOTA), and the ID F1 Score (IDF1) by 10.2%, 3.0%, and 4.8%, respectively.

1. Introduction

Multi-object tracking (MOT) is a critical task in computer vision, aiming to detect objects in video sequences frame by frame while maintaining the association of the objects across different frames. MOT technology plays a crucial role in various fields such as autonomous driving [1], traffic surveillance [2], and military reconnaissance [3]. Currently, the most popular approach to MOT is the tracking-by-detection (TBD) paradigm [4,5,6,7,8,9,10], which decomposes the MOT task into two subtasks: object detection and data association. Leveraging well-established object detection techniques [11,12,13] has significantly advanced the research and applications of MOT.
The TBD-based MOT method is a two-stage tracking model. Its primary advantage lies in the independence between the detection and tracking processes, which provides strong robustness. When an object is lost or occluded, the system can more easily recapture it and recover the tracklet. Moreover, it is well suited to multi-object scenarios, since the goal of detection is essentially to locate all objects in the image. Additionally, even if the object’s appearance changes significantly, the detection results are not greatly affected, and the tracking stage can still associate objects through the motion model. However, TBD-based MOT also has limitations. Because the method relies on mature object detection technologies, it benefited greatly in the early stages of development, but as the technology matures, the highly specialized detection component can begin to constrain the final tracking results. This is because object detection aims to find all objects in a single image, whereas object tracking builds object trajectories across a sequence of images. As a result, researchers have begun exploring the incorporation of additional information to further improve the effectiveness of TBD-based MOT.
The workflow of TBD-based MOT methods begins with the detection phase, where images are processed to detect objects and output detection boxes. These detection boxes are then passed to the tracking phase, where the system associates and matches the prediction boxes (generated from historical data) with the current detection boxes, ultimately generating object trajectories. To improve tracking performance, researchers have attempted to enhance the output of the detection phase by introducing additional information. For instance, TraDes [14] integrates tracking information into the current frame to resolve occlusion issues in detection. MOTDT [15] generates candidate boxes by predicting the object trajectory in the current frame, supplementing detection candidate boxes and thereby optimizing the detection phase’s output. However, as TBD-based MOT methods rely on mature object detection technologies, these improvements have limited effectiveness in further enhancing results.
At the same time, researchers have recognized that there is still considerable room for improvement in the tracking phase and have begun to optimize the prediction and association algorithms by introducing additional information. For example, DeepSORT [5] introduces deep visual features [16] and recalculates the affinity matrix using the Mahalanobis distance, significantly improving the accuracy of data association. GHOST [17] extracts object appearance features via an independent ReID module [18,19,20], enabling precise object association. FairMOT [7] adopts an anchor-free CenterNet [12] architecture and designs a dual-branch network, balancing object detection and appearance feature extraction tasks, greatly enhancing association performance. CSTrack [21] introduces a cross-correlation network to improve the output of network branches. QDTrack [22] combines similarity learning to enhance long-term ReID performance. FineTrack [9] explores different fine-grained representations to improve discriminative representation. ByteTrack [6] enhances tracking accuracy and robustness by performing secondary matching on low-threshold detection boxes, helping to locate occluded objects, reduce missed detections, and effectively filter the background to minimize the impact of False Positives on tracking results. OC-SORT [10] reduces the accumulation of prediction errors from the Kalman filter (KF) during occlusion by constructing a virtual tracklet and considers the motion direction when calculating the affinity matrix, thereby improving association accuracy. Additionally, some research focuses on improving the KF, such as the NSA-KF [8,23], which optimizes the calculation of measurement noise and dynamically adjusts the weighting ratio between measurements and predictions during the update step to enhance filter performance. Other studies [1,3,24] aim to mitigate the adverse effects of camera motion on tracking performance by using Enhanced Correlation Coefficient Maximization (ECC) [25] or feature matching with ORB [26] for camera motion compensation, thereby reducing rigid displacement bias caused by camera movement.
We aimed to improve both the detection and tracking stages by introducing additional information to achieve better performance. Since the input of the tracking phase relies on the output of the detection phase, tracking results are highly sensitive to detection outputs. Some studies have proposed end-to-end MOT methods to address this issue, such as CenterTrack [27], which merges the detection and tracking stages by locating objects and predicting their displacement from the previous frame, and MOTR [28], which employs a Transformer architecture [29] to model the relationships between objects using global context information, building an end-to-end MOT model. However, their accuracy has not yet reached an ideal level. During the object detection process, the detector aims to identify all objects within a single image. As a result, low-confidence detection boxes for background or occluded objects are often discarded by Non-Maximum Suppression (NMS), leading to tracking failures. In contrast, tracking operates on a video sequence composed of multiple images, allowing it to leverage historical information to implicitly distinguish between background and occluded objects. Based on this insight, we improved the TBD-based MOT method by delaying the execution of NMS to the tracking phase and integrating predictive information. This approach enables effective differentiation between a low-confidence background and occluded objects, providing higher-quality detection information for the tracking phase and improving overall tracking performance.
Low-confidence objects can be beneficial for improving tracking performance when passed to the tracking phase. However, these low-confidence detection boxes often have poor quality, typically due to positioning errors caused by foreground occlusion or overlapping objects. To address these issues, we proposed an adaptive measurement noise method to enhance the motion prediction model in the tracking phase. For low-confidence detection boxes caused by foreground occlusion, the positioning errors are significant, leading to higher uncertainty. Based on this observation, after successful trajectory association, we adjusted the measurement noise covariance in the KF update step to be inversely proportional to the confidence level—meaning that the lower the confidence, the greater the measurement noise. Additionally, for positioning noise caused by overlapping objects, we found that as the degree of overlap increases, the mutual interference also increases, resulting in greater uncertainty. Therefore, we calculated the overlap rate between objects and adjusted the measurement noise covariance to be inversely proportional to this overlap rate. By updating the measurement noise covariance in this way, we improved tracking performance.
Data association is another critical subtask in TBD-based MOT. Association methods can be based on motion models, deep visual feature models, or a combination of both [30]. When tracked objects exhibit complex motion, non-linear movement patterns can arise, posing significant challenges for matching algorithms that use the Intersection over Union (IoU) distance as the affinity matrix, since IoU is based on area calculations. We observed that, compared to area-based IoU, using height as a feature provides better distinctiveness for tracked objects. Height is less prone to change than other simple positional features, offering greater stability and compensation ability. Therefore, we used the height feature to adjust the IoU distance, which enhances the stability of associations and reduces the occurrence of identity switches (IDSW).
To sum up, our key contributions can be outlined as follows:
  • To improve the quality of detection boxes and reduce the impact of detection results on tracking performance, we proposed a method that utilizes predictive information to filter candidate boxes. NMS is not applied during the detection phase but is instead delayed until the tracking phase to better support the tracking task, significantly enhancing the overall performance of MOT.
  • To mitigate the impact of localization errors on the motion model, we proposed an adaptive measurement noise method that dynamically adjusts the ratio between prediction and measurement weights, significantly enhancing overall tracking performance.
  • To mitigate the impact of complex motion on tracking stability, we proposed a method that adjusts the IoU distance using height information to enhance the overall association stability.
  • Our proposed method underwent comprehensive experimental validation on the DanceTrack dataset, and competitive results were also achieved on the MOT20 dataset.

2. Materials and Methods

In this section, we describe our model, which builds upon the ByteTrack strategy of performing two rounds of association for high-score and low-score boxes, and we propose three improvements to enhance it. First, we introduce a Candidate Boxes Enhanced Filtering (CEF) method, which optimizes the output of the detector to effectively address the sensitivity of tracking results to detection outputs. Second, we propose an adaptive measurement noise (AMN) method, which adaptively adjusts the measurement noise covariance by incorporating prediction information and detection confidence, making the update step of the Kalman filter (KF) more flexible and improving the robustness of the model. Lastly, we refine the calculation of the affinity matrix in the association task, enabling more effective data association for objects undergoing complex motion. The workflow of the proposed methods is illustrated in Figure 1.

2.1. Candidate Boxes Enhanced Filtering (CEF)

NMS is based on a greedy algorithm for selecting the best candidate boxes for objects. It filters multiple candidate boxes, retaining the one with the highest confidence while suppressing others that have a high overlap with it, thereby finding the optimal candidate box. In most existing works, NMS is applied during the detection stage, while the tracking stage relies on the detection boxes after NMS has been executed. Liang [31] proposed the application of NMS in the tracking stage to assist in matching occluded object tracklets. We believe that NMS in the detection stage does not fully utilize the information generated during the tracking stage, leading to resource waste. Filtering out low-scoring and redundant boxes may hinder the effective use of information in the tracking stage, as detection aims to identify and locate all relevant objects in a single frame, while tracking involves continuously following the object’s position and maintaining its identity in a video sequence. Transmitting more information to the tracking stage could potentially improve tracking performance. Therefore, we attempted to delay the NMS work until the tracking stage, using the information generated during tracking to filter detection candidate boxes, thereby enhancing information utilization and further improving tracking results.

2.1.1. Score Update

The video frames were input into the detection algorithm, resulting in m detection boxes B = {b_1, b_2, ..., b_m} along with their corresponding scores S = {s_1, s_2, ..., s_m}; meanwhile, the tracking algorithm generated n prediction boxes P = {p_1, p_2, ..., p_n} using the KF, where b_i denotes the i-th detection box of the current frame, s_i denotes the score of the i-th detection box, and p_j denotes the j-th prediction box of the current frame. Next, we calculated the IoU values between the detection boxes and the prediction boxes. To simplify the computation, we only considered the prediction box that has the greatest impact on each detection box b_i, resulting in a vector X of length m. X can be formulated as in Equation (1):
X = \{ x_i \mid x_i = \max_{1 \le j \le n} \mathrm{IoU}(b_i, p_j),\; i = 1, 2, \ldots, m \} \quad (1)
S_{new} = \lambda S + (1 - \lambda)\,(S \odot X) \quad (2)
As shown in Equation (2), we updated the original score S to obtain S_new using the vector X, which contains the maximum IoU value between each detection box and all prediction boxes (here ⊙ denotes element-wise multiplication). This process integrates predictive information to adjust the scores of the detection candidate boxes, so that boxes closer to the predicted boxes receive higher weights. Figure 2a shows that the weight λ = 0.7 is the best parameter.
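As a concrete sketch, the score update of Equations (1) and (2) can be written in a few lines of NumPy. The helper names (iou_matrix, update_scores) are ours for illustration and are not taken from the released code.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two arrays of (x1, y1, x2, y2) boxes."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def update_scores(det_boxes, det_scores, pred_boxes, lam=0.7):
    """Score modulation of Equations (1)-(2)."""
    if len(pred_boxes) == 0:
        return det_scores                                   # first frame: S_new = S
    x = iou_matrix(det_boxes, pred_boxes).max(axis=1)       # Eq. (1)
    return lam * det_scores + (1.0 - lam) * det_scores * x  # Eq. (2)
```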

2.1.2. Dynamic Threshold

While updating the scores of detection candidate boxes can improve the handling of foreground occlusion, it still does not effectively address the issue of overlapping objects. When two objects significantly overlap, the classic NMS algorithm often suppresses one of the objects, as the suppression threshold is typically heuristically set based on the entire video, making it insufficient for infrequent overlapping situations. On the other hand, predictive information can indirectly reflect the overlap of objects in the next frame through simple IoU calculations. Therefore, we further utilized predictive information to dynamically adjust the suppression threshold for each detection box, enhancing our ability to handle occlusion issues.
For the prediction boxes represented by the set P, we calculated the IoU values between the prediction boxes to reflect their overlapping extent. To simplify the computation, we focused only on the maximum IoU value between each prediction box and all other prediction boxes, resulting in a vector O of length n. O represents the overlapping extent of each prediction box in the predictive information, as reflected in Equation (3):
O = \{ o_i \mid o_i = \max_{1 \le j \le n,\, j \ne i} \mathrm{IoU}(p_i, p_j),\; i = 1, 2, \ldots, n \} \quad (3)
After obtaining the overlap rates of the tracked-object predictions, we used this information to set the threshold for filtering each detection candidate box dynamically. First, we identified the closest prediction box for each detection box and then assigned that prediction box's overlap rate to the threshold vector. Specifically, if the overlap rate of the prediction box closest to detection box b_i exceeded 0.7, we set threshold_i to that overlap rate; otherwise, we assigned a fixed threshold, and Figure 3 shows that a fixed threshold of 0.7 is the optimal parameter. As a result, we obtained a threshold vector of length m, and the calculation process can be expressed by Equations (4) and (5):
\mathrm{Index} = \{ \mathrm{index}_i \mid \mathrm{index}_i = \arg\max_{1 \le j \le n} \mathrm{IoU}(b_i, p_j),\; i = 1, 2, \ldots, m \} \quad (4)
threshold_i = \begin{cases} o_{\mathrm{index}_i}, & o_{\mathrm{index}_i} > 0.7 \\ 0.7, & \text{otherwise} \end{cases} \quad (5)
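Continuing the previous sketch (and reusing the illustrative iou_matrix helper), the dynamic thresholds of Equations (3)–(5) can be computed as follows; the function name is again hypothetical.

```python
def dynamic_thresholds(det_boxes, pred_boxes, fixed_thr=0.7):
    """Per-detection suppression thresholds of Equations (3)-(5)."""
    m, n = len(det_boxes), len(pred_boxes)
    if n < 2:
        return np.full(m, fixed_thr)                # no meaningful prediction overlap
    pred_iou = iou_matrix(pred_boxes, pred_boxes)
    np.fill_diagonal(pred_iou, 0.0)                 # enforce j != i in Eq. (3)
    o = pred_iou.max(axis=1)                        # overlap rate of each prediction
    index = iou_matrix(det_boxes, pred_boxes).argmax(axis=1)   # Eq. (4)
    overlap = o[index]
    return np.where(overlap > fixed_thr, overlap, fixed_thr)   # Eq. (5)
```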

2.1.3. Initializing and Executing

When processing the first frame of the video, the predicted bounding boxes from the tracking algorithm are not yet available, so the algorithm needs to be initialized. We set the updated score S_new directly to the original score S and initialized the threshold vector to a fixed value of 0.7. The CEF algorithm was then executed with these initial values. Next, the filtered detection boxes obtained through this process were input into the tracking algorithm to generate the initial tracking boxes, defined as those detection boxes in the first frame of the video with a confidence score greater than 0.6.
After processing the first frame, the tracking algorithm can output the prediction boxes. Based on the prediction boxes, we obtained the updated score vector S_new and the threshold vector. At this point, the suppression threshold for each candidate detection was no longer fixed, so the NMS algorithm needed to be improved to accommodate dynamically adjusted thresholds. We enhanced the suppression step in the traditional NMS algorithm, allowing the suppression threshold to be dynamically adjusted based on the threshold vector. By dynamically adjusting the suppression thresholds for the neighborhoods of the detection candidates, this approach reduces missed detections caused by overlaps or occlusions. This method effectively utilizes predictive information and better preserves overlapping object detection boxes. The workflow of our CEF algorithm is shown in Figure 4 and Algorithm 1.
Algorithm 1: CEF
Input:   B = {b_1, b_2, ..., b_m} is the list of initial detection boxes;
    S = {s_1, s_2, ..., s_m} is the corresponding detection score after the update;
    T = {t_1, t_2, ..., t_m} is the corresponding NMS threshold after the update.
Output: D = {d_1, d_2, ..., d_n} is the set of detection boxes kept after CEF is executed.
  • D = ∅
  • while B ≠ ∅ do
  •    m = arg max S
  •    M = b_m
  •    D = D ∪ {M}
  •    B = B − {M}; S = S − {s_m}
  •   for b_i ∈ B do
  •     if IoU(M, b_i) ≥ t_m then
  •        B = B − {b_i}
  •        S = S − {s_i}
  •     end
  •   end
  • end
return D
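Algorithm 1 can be sketched in NumPy as below; cef_nms is a hypothetical name, and the sketch assumes the iou_matrix helper defined earlier together with the per-box thresholds from Equations (4) and (5).

```python
def cef_nms(boxes, scores, thresholds):
    """Greedy NMS following Algorithm 1: the kept box M suppresses its
    neighbours using the dynamically assigned threshold t_m of M."""
    order = scores.argsort()[::-1]                 # candidates by descending score
    alive = np.ones(len(boxes), dtype=bool)
    keep = []
    for idx in order:
        if not alive[idx]:
            continue
        keep.append(idx)
        alive[idx] = False                         # B <- B - {M}
        ious = iou_matrix(boxes[idx:idx + 1], boxes)[0]
        alive &= ious < thresholds[idx]            # suppress when IoU(M, b_i) >= t_m
    return boxes[keep], scores[keep]
```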

2.2. Adaptive Measurement Noise (AMN)

The classic tracking algorithm ByteTrack utilizes the KF as its motion model. The state vector includes the object's center coordinates x_c, y_c, aspect ratio a, height h, and their corresponding velocity components, as shown in Equation (6):
x = [x_c, y_c, a, h, x_v, y_v, a_v, h_v]^T \quad (6)
This state vector typically assumes that the object’s contour does not undergo significant deformation, meaning that the aspect ratio of the tracked object remains relatively constant. However, this assumption does not always hold. In certain cases, the object may exhibit nonlinear movements, such as a pedestrian falling, a dancer dancing, or an athlete competing. In these scenarios, the traditional state space setup may struggle to effectively handle the object’s morphological changes, impacting tracking accuracy.
However, our experiments show that directly estimating the width and height of the bounding box yields better tracking of objects with complex motion. Therefore, we improved the state space of the Kalman filter by replacing the aspect ratio a and its velocity component with the width w and its velocity component. The complete state vector of the Kalman filter is then shown in Equation (7):
x = [x_c, y_c, w, h, x_v, y_v, w_v, h_v]^T \quad (7)
The KF consists of prediction and update steps. The prediction steps, shown in Equations (8) and (9), predict the current state based on historical information, resulting in an a priori estimate. Here, \hat{x}_k^- represents the a priori state estimate at the current time step, \hat{x}_{k-1} is the a posteriori estimate from the previous time step, and F is the state transition matrix that describes the evolution of the system state from the previous to the current time step, as shown in Equation (10). Additionally, we calculated the a priori state covariance matrix, where P_k^- represents the a priori state covariance at the current time step, P_{k-1} is the a posteriori state covariance from the previous time step, and Q is the process noise covariance matrix.
\hat{x}_k^- = F \hat{x}_{k-1} \quad (8)
P_k^- = F P_{k-1} F^T + Q \quad (9)
F = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \quad (10)
The update steps are shown in Equations (11)–(13). First, we calculated the Kalman gain K. Then, using the observation matrix H and the observation vector z_k, we updated the a posteriori estimate \hat{x}_k and the a posteriori estimate covariance P_k, where R is the measurement noise covariance matrix. The observation matrix H is shown in Equation (14).
K = P_k^- H^T (H P_k^- H^T + R)^{-1} \quad (11)
\hat{x}_k = \hat{x}_k^- + K (z_k - H \hat{x}_k^-) \quad (12)
P_k = (I - K H) P_k^- \quad (13)
H = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{bmatrix} \quad (14)
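For reference, the matrices F and H and the two KF steps above can be expressed as the following NumPy sketch; the function names are ours and not those of the released implementation.

```python
import numpy as np

DIM_X, DIM_Z = 8, 4

# Constant-velocity transition matrix F of Eq. (10): position part += velocity part.
F = np.eye(DIM_X)
F[:4, 4:] = np.eye(4)

# Observation matrix H of Eq. (14): only [x_c, y_c, w, h] are measured.
H = np.zeros((DIM_Z, DIM_X))
H[:4, :4] = np.eye(4)

def kf_predict(x, P, Q):
    """A priori state and covariance, Eqs. (8)-(9)."""
    x_prior = F @ x
    P_prior = F @ P @ F.T + Q
    return x_prior, P_prior

def kf_update(x_prior, P_prior, z, R):
    """Kalman gain and a posteriori state/covariance, Eqs. (11)-(13)."""
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(DIM_X) - K @ H) @ P_prior
    return x_post, P_post
```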
As the state vector is modified, the covariance matrices Q and R in the KF also need to be adjusted. DeepSORT [5] suggests using information from the state vector for this purpose. Here, Q is the process noise covariance matrix, reflecting the uncertainty of the model state as it evolves over time, and R is the measurement noise covariance matrix, indicating the potential uncertainties and errors that may occur during measurement. The calculations follow Equations (15) and (16), where w_{k-1} represents the a posteriori estimate of the width state in the previous time step and w_k^- denotes the a priori estimate of the width state at the current time step, with the remaining variables defined similarly, and σ_p = 0.05, σ_v = 0.00625, and σ_m = 0.05.
Q = \mathrm{diag}\big( (\sigma_p w_{k-1})^2, (\sigma_p h_{k-1})^2, (\sigma_p w_{k-1})^2, (\sigma_p h_{k-1})^2, (\sigma_v w_{k-1})^2, (\sigma_v h_{k-1})^2, (\sigma_v w_{k-1})^2, (\sigma_v h_{k-1})^2 \big) \quad (15)
R = \mathrm{diag}\big( (\sigma_m w_k^-)^2, (\sigma_m h_k^-)^2, (\sigma_m w_k^-)^2, (\sigma_m h_k^-)^2 \big) \quad (16)
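A sketch of how Q and R can be built per time step under Equations (15) and (16), assuming the width/height estimates are read from the filter state (the helper names are illustrative):

```python
SIGMA_P, SIGMA_V, SIGMA_M = 0.05, 0.00625, 0.05

def process_noise(w_prev, h_prev):
    """Q of Eq. (15), scaled by the previous a posteriori width and height."""
    std = [SIGMA_P * w_prev, SIGMA_P * h_prev, SIGMA_P * w_prev, SIGMA_P * h_prev,
           SIGMA_V * w_prev, SIGMA_V * h_prev, SIGMA_V * w_prev, SIGMA_V * h_prev]
    return np.diag(np.square(std))

def measurement_noise(w_prior, h_prior):
    """R of Eq. (16), scaled by the current a priori width and height."""
    std = [SIGMA_M * w_prior, SIGMA_M * h_prior, SIGMA_M * w_prior, SIGMA_M * h_prior]
    return np.diag(np.square(std))
```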
We observed that during the update process of the KF, the fixed measurement noise covariance R does not accurately reflect the measurement uncertainty, which impacts the model’s precision. GiaoTrack [23] addressed the influence of detection quality on the KF update by proposing the NSA-KF method, which utilizes detection confidence c to reflect noise during measurement, as shown in Equation (17). However, the NSA method only considers detection confidence in a simplistic manner. While it effectively compensates for measurement noise in cases of foreground occlusion, its performance regarding measurement noise caused by overlapping objects is less satisfactory. Therefore, we proposed a method that allows the measurement noise covariance R to adaptively adjust based on overlap information, reflecting the uncertainty in the observations. This method is referred to as adaptive measurement noise (AMN), as shown in Equation (18). We used the overlap rate o as a state to reflect the overlapping phenomena of tracked objects and combined it with the NSA method to more accurately represent the observation quality and uncertainty of the objects, with the calculation of o , as described in Equation (3).
R_{new} = (1 - c) R \quad (17)
R_{new} = \frac{1 - c}{1 - o} R \quad (18)
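Under this reading of Equation (18), the AMN scaling can be sketched as follows; the function name and the small epsilon guard against o approaching 1 are our additions.

```python
def amn_measurement_noise(R, confidence, overlap, eps=1e-6):
    """AMN scaling of Eq. (18): low detection confidence c or heavy overlap o
    inflates R, so the KF update trusts the prediction more than the measurement."""
    return (1.0 - confidence) / max(1.0 - overlap, eps) * R
```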

2.3. Dilatation Height IoU (DHIoU)

Data association is a crucial task in MOT, typically involving the generation of an affinity matrix and using an assignment algorithm to match objects. We observed that while the IoU distance between bounding boxes can implicitly address occlusion to some extent, allowing for the correction of continuous object movements through frame-by-frame associations, it falls short when dealing with rapid deformations of bounding boxes caused by complex object motion. Our analysis revealed that changes in an object’s height are generally smaller compared to its width, suggesting that height possesses a certain degree of stability and compensatory capability. Based on this observation, we proposed the combination of height Intersection over Union with IoU distance to calculate the affinity matrix, aiming to achieve a more comprehensive representation of information and improve the accuracy and robustness of data association.
However, both height Intersection over Union and area IoU range from (0,1), and the result of their multiplication tends to be 0, which limits the utility of the affinity matrix and affects the model’s accuracy and robustness. To address this, we introduced a dilatation coefficient to counteract this effect, thereby enhancing model performance.
Specifically, we defined two boxes, b_1 = (x_1^1, y_1^1, x_2^1, y_2^1) and b_2 = (x_1^2, y_1^2, x_2^2, y_2^2), where h_1 = (y_1^1, y_2^1) and h_2 = (y_1^2, y_2^2) represent the vertical extents (heights) of the two boxes and B_1 and B_2 denote their respective areas, with d being the dilatation coefficient. By multiplying the dilatation coefficient, the height IoU, and the area IoU, we obtained the Dilatation Height IoU (DHIoU), as shown in Equation (19). Figure 2b shows that the dilatation coefficient d = 1.2 is the optimal parameter.
\mathrm{DHIoU} = d \cdot \frac{\mathrm{Intersection}(h_1, h_2)}{\mathrm{Union}(h_1, h_2)} \cdot \frac{\mathrm{Intersection}(B_1, B_2)}{\mathrm{Union}(B_1, B_2)} \quad (19)
We calculated the DHIoU between each pair of detection boxes filtered by CEF and prediction boxes predicted by AMN-KF, which results in an affinity matrix. This affinity matrix was then used for data association via the Hungarian algorithm, yielding the tracking results.
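A sketch of the DHIoU affinity and the subsequent Hungarian matching is given below, using SciPy's linear_sum_assignment and the iou_matrix helper from the earlier sketch; the function names and the use of 1 − DHIoU as the assignment cost are illustrative assumptions.

```python
from scipy.optimize import linear_sum_assignment

def dhiou_matrix(dets, preds, d=1.2):
    """DHIoU of Eq. (19): area IoU modulated by height IoU and a dilatation coefficient."""
    iou = iou_matrix(dets, preds)
    top = np.maximum(dets[:, None, 1], preds[None, :, 1])
    bottom = np.minimum(dets[:, None, 3], preds[None, :, 3])
    inter_h = np.clip(bottom - top, 0, None)
    union_h = ((dets[:, 3] - dets[:, 1])[:, None]
               + (preds[:, 3] - preds[:, 1])[None, :] - inter_h)
    hiou = inter_h / (union_h + 1e-9)
    return d * hiou * iou

def associate(dets, preds, d=1.2):
    """Hungarian assignment on 1 - DHIoU used as the association cost."""
    cost = 1.0 - dhiou_matrix(dets, preds, d)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```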

3. Results

3.1. Datasets and Metrics

3.1.1. Datasets

Our study is based on two publicly available MOT datasets, DanceTrack [32] and MOT20 [33], with experiments conducted under the “private detection” protocol. The DanceTrack dataset is a large-scale MOT dataset, mainly featuring group dance videos. Compared to other datasets, the objects in DanceTrack have highly similar appearances, more complex motion patterns, and frequent occlusion and overlap, posing significant challenges for motion modeling. The DanceTrack dataset contains 100 video sequences, divided into training, validation, and test sets, with 40 training sequences, 25 validation sequences, and 35 test sequences. The ablation study was conducted on the validation set of DanceTrack.
The MOT20 datasets are provided by the MOT Challenge as benchmark datasets, primarily featuring pedestrians in various busy street scenes, and are among the most widely used public datasets for MOT tasks. The MOT20 dataset comprises eight video sequences, including four training sequences and four test sequences, with no separate validation set.

3.1.2. Evaluation Metrics

We used the following metrics to evaluate the performance of the tracking algorithm: Multiple Object Tracking Accuracy (MOTA) [34], Identification F1 Score (IDF1) [35], Higher Order Tracking Accuracy (HOTA) [36], Detection Accuracy (DetA) [36], and association accuracy (AssA) [36].
MOTA measures tracking accuracy by evaluating three types of tracking errors: False Negatives (FN) and False Positives (FP) in object detection, and Identity Switches (IDSW) in association. MOTA focuses more on detection performance; its value is at most 1 (and can be negative), with 1 indicating the best case, and a higher score signifies better tracking accuracy. The calculation formula for MOTA is shown in Equation (20), where t is the frame index in the video sequence, GT_t is the number of ground truth instances in the t-th frame, FN_t is the number of False Negatives in the t-th frame, FP_t is the number of False Positives in the t-th frame, and IDSW_t is the number of identity switches in the t-th frame.
\mathrm{MOTA} = 1 - \frac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t)}{\sum_t \mathrm{GT}_t} \quad (20)
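For instance, with hypothetical totals of Σ_t GT_t = 1000 ground truth instances, Σ_t FN_t = 60, Σ_t FP_t = 30, and Σ_t IDSW_t = 10, Equation (20) gives MOTA = 1 − (60 + 30 + 10)/1000 = 0.90.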
IDF1 assesses tracking continuity and re-identification accuracy by calculating the ratio of correctly identified detections to the average of the ground truth detections and the computed detections. IDF1 emphasizes association performance, with values ranging from 0 to 1, where 1 represents the best case. A higher score reflects better tracking association. The calculation formula for IDF1 is shown in Equation (21), where IDTP (True Positive ID) represents the correctly matched object identities, i.e., the cases where the ground truth objects are correctly matched with the tracking results. IDFP (False Positive ID) represents the incorrectly matched object identities, i.e., the cases where the ground truth objects are incorrectly matched with the tracking results. IDFN (False Negative ID) represents the ground truth object identities that are not matched, i.e., the cases where ground truth objects are not matched with any tracking results.
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} \quad (21)
DetA is a metric used to measure the accuracy of object localization in detection. It primarily evaluates the overlap between detected bounding boxes and ground truth boxes by calculating their Intersection over Union (IoU). The score ranges from 0 to 1, with 1 representing the best case. The calculation formula for DetA is shown in Equation (22), where TP represents a True Positive, FN represents a False Negative, and FP represents a False Positive.
\mathrm{DetA} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}} \quad (22)
AssA evaluates the accuracy of data association in tracking algorithms, reflecting the algorithm's ability to avoid IDSW and association errors. Its values range from 0 to 1, with 1 indicating the best performance. The calculation formula for AssA is shown in Equation (23), where TP is the set of True Positive detections, TPA(c) is the number of True Positive associations for a True Positive c, FNA(c) is the number of False Negative associations, and FPA(c) is the number of False Positive associations.
\mathrm{AssA} = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}} \frac{\mathrm{TPA}(c)}{\mathrm{TPA}(c) + \mathrm{FNA}(c) + \mathrm{FPA}(c)} \quad (23)
HOTA explicitly balances the effects of detection, association, and localization into a unified metric. HOTA scores better reflect human visual evaluation of tracking performance. Its values range from 0 to 1, with 1 indicating the best performance. A higher score represents better overall tracking results. The calculation formula for HOTA is shown in Equation (24), where α represents the localization threshold.
\mathrm{HOTA} = \int_{0}^{1} \mathrm{HOTA}_\alpha \, d\alpha \approx \frac{1}{19} \sum_{\alpha \in \{0.05, 0.10, \ldots, 0.95\}} \mathrm{HOTA}_\alpha = \frac{1}{19} \sum_{\alpha \in \{0.05, 0.10, \ldots, 0.95\}} \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha} \quad (24)
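Assuming the per-threshold DetA_α and AssA_α values have already been computed by an existing evaluation toolkit, the final HOTA score in Equation (24) reduces to a simple average; the function name below is ours.

```python
import numpy as np

def hota(det_a_per_alpha, ass_a_per_alpha):
    """Eq. (24): HOTA_alpha = sqrt(DetA_alpha * AssA_alpha), averaged over the
    19 localization thresholds alpha = 0.05, 0.10, ..., 0.95."""
    det_a = np.asarray(det_a_per_alpha, dtype=float)
    ass_a = np.asarray(ass_a_per_alpha, dtype=float)
    return float(np.sqrt(det_a * ass_a).mean())
```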

3.1.3. Implementation Details

Our model follows the ByteTrack structure. In the detection phase, we used the YOLOX-x model as the detector, pre-trained on the COCO dataset. For the DanceTrack dataset, we trained the detector on the DanceTrack training set for eight epochs, with an input image size of 1440 × 800. For the MOT20 dataset, we incorporated the CrowdHuman [37] dataset into the training process; training was conducted on a combination of the training sequences from the MOT20 and CrowdHuman datasets. The model was trained for 80 epochs with an input image size of 1600 × 896. The optimizer used for training was SGD, with an initial learning rate of 0.001 and a cosine annealing strategy to adjust the learning rate dynamically. The high-confidence detection box threshold was set to 0.5, the low-confidence detection box threshold to 0.1, and the minimum detection box area to 100 pixels.
In the tracking phase, we first applied the CEF method to filter candidate detection boxes by incorporating predictive information, setting the score update weight to 0.7. Next, we optimized the state space of the Kalman filter and used the AMN method to dynamically adjust the measurement noise covariance during the Kalman filter update steps. Finally, we employed the DHIoU method to calculate the affinity matrix more accurately for trajectory association, with the dilatation coefficient set to 1.2.
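For reference, the tracking-stage hyperparameters listed above can be gathered in one place as follows; the dictionary layout is illustrative only and is not the authors' configuration file.

```python
TRACK_CONFIG = {
    "high_score_thr": 0.5,       # high-confidence detection box threshold
    "low_score_thr": 0.1,        # low-confidence boxes kept for the second association
    "min_box_area": 100,         # minimum detection box area in pixels
    "score_update_weight": 0.7,  # lambda in Eq. (2)
    "fixed_nms_thr": 0.7,        # fallback threshold in Eq. (5)
    "dilatation_coef": 1.2,      # d in Eq. (19)
}
```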

3.2. Comparison Results

In this section, we present the benchmark results on the DanceTrack and MOT20 datasets. Our baseline model was ByteTrack, and compared to the baseline model, our method only added a small amount of additional computational cost.

3.2.1. Comparison on DanceTrack

As shown in Table 1, we compared our proposed method with SORT [4], DeepSORT [5], CenterTrack [27], FairMOT [7], QDTrack [22], TransTrack [38], TraDes [14], ByteTrack [6], and OC-SORT [10] on the DanceTrack dataset. The results show that our method outperforms others, achieving the best performance across all evaluation metrics. Compared to the baseline model, HOTA improves by 10.2%, MOTA by 3.0%, IDF1 by 4.8%, DetA by 10.7%, and AssA by 8.9%. This highlights the effectiveness of our method in handling missed detections and the complex motions of tracking objects. The results were directly obtained from the official evaluation server of DanceTrack.

3.2.2. Comparison on MOT20

As shown in Table 2, we compared our proposed method with FairMOT [7], TransMOT [39], ByteTrack [6], and OC-SORT [10] on the MOT20 dataset. The results were directly obtained from the official evaluation server of MOT Challenge.
Our method achieved the best performance in HOTA, IDF1, and AssA, while also achieving near-optimal results in MOTA and DetA. The improvement on the MOT20 dataset is relatively limited compared to the DanceTrack dataset. We believe there are two main reasons for this. First, the scale of the MOT20 dataset is much smaller than that of the DanceTrack dataset, which may not fully assess the performance of the method. Second, the MOT20 dataset primarily comes from surveillance scenarios, where the tracked objects are usually pedestrians whose behavior patterns are closer to simple linear motion, causing the method’s performance to become saturated. In contrast, the DanceTrack dataset consists of videos of people dancing, where the behavior patterns of the objects are complex and nonlinear, allowing our method to better demonstrate its advantages.
We found that on the MOT20 dataset, our method achieved the best performance in HOTA, IDF1, and AssA, indicating significant improvements in object association. However, MOTA and DetA show slight declines compared to the baseline model ByteTrack, suggesting that our method performs slightly worse in object localization. We believe this may be due to the fact that we introduced more historical information, and the high density of people in the MOT20 dataset led to more localization errors during the detection stage. As the historical information accumulates, these errors may gradually propagate over time, ultimately affecting the accuracy of object localization.

3.3. Ablation Study

3.3.1. Component Ablation

To validate the effectiveness of each module in our proposed method, we conducted ablation experiments on the DanceTrack validation set. Using the ByteTrack tracking algorithm as the baseline, we evaluated the performance of different methods across five metrics: HOTA, MOTA, IDF1, DetA, and AssA. Table 3 summarizes the progression from the ByteTrack baseline to our proposed method, showcasing the improvements achieved at each stage.
As shown in Table 3, the experimental results highlight the effectiveness of each module. Comparing the first and second rows, we see that the CEF method significantly enhances HOTA, MOTA, IDF1, and DetA metrics, indicating that the CEF method, which integrates predictive information, improves tracking performance. When we compare the first and third rows, we observe substantial gains across all metrics, confirming that adjustments to the KF state space combined with adaptive measurement noise positively impact the tracking of objects exhibiting complex motion. Similarly, the comparison between the first and fourth rows reveals significant improvements across all metrics, demonstrating that the enhanced affinity matrix facilitates more effective matching in MOT scenarios. Finally, when comparing the other rows with the last row, we note that all metrics achieve their highest values, providing strong evidence that our proposed improvements collectively contribute to enhanced tracking performance in multi-object tracking.

3.3.2. Measurement Noise

We compared different approaches for adjusting measurement noise in the KF within our method, including the original Constant method, the NSA method, and our proposed AMN method. As shown in Table 4, the results on the DanceTrack validation set clearly demonstrate that our method exhibits superior performance across all metrics compared to other methods. In complex scenarios characterized by frequent occlusions and nonlinear motion, our AMN method exhibits a distinct advantage.

3.3.3. Affinity Matrix

We compared the use of different affinity matrices for data association in our method, including the traditional IoU calculation, the DHIoU method we proposed, and the calculation with a dilatation coefficient of 1 in the DHIoU method. As shown in Table 5, our method demonstrates good performance across most metrics. While the introduction of the dilatation coefficient may result in slight decreases in some metrics, overall, it enhances the overall performance. The dilatation coefficient can effectively alleviate the issue of excessively small multiplicative results, thereby improving tracking performance, and it is also very straightforward to implement.
Since our model performs associations not only on high-confidence detection boxes but also conducts a second association on low-confidence boxes, each association requires the calculation of an affinity matrix. Therefore, we compared three scenarios: both associations using the IoU distance to calculate the affinity matrix, the first association using our DHIoU method with the IoU distance for the second, and both associations using DHIoU. As shown in Table 6, when both associations utilize our DHIoU method, all metrics achieve the best results. These results highlight that using the DHIoU method significantly enhances data association and overall tracking effectiveness.
We believe that, compared to the width state, the height state is more advantageous for object association. Similar to DHIoU, we propose Dilatation Width IoU (DWIoU), which calculates the IoU using width instead of height. To ensure a fair comparison, we set the dilatation coefficient d to 1.0. As shown in Table 7, the width state negatively impacts association performance, while the height state has a positive effect. We believe this is because changes in the height state typically occur between actions such as crouching, standing, and jumping, whose motion patterns are relatively predictable and continuous, allowing the Kalman filter to model them effectively. In contrast, changes in the width state usually occur between limb movements or posture changes, with more complex and unpredictable motion patterns that exhibit non-linear behavior, posing a significant challenge for Kalman filter modeling.

4. Discussion

Our proposed method enhances tracking performance through three key aspects: improving the handling of low-confidence boxes, reducing the impact of occlusion and overlap noise on the motion model, and optimizing the computation of the affinity matrix to enhance the association capability for objects with complex motion. These improvements work together to significantly boost the overall tracking effectiveness of the model.
As shown in Figure 5 and Figure 6, the tracking visualization results validate the effectiveness of our proposed improvements. The object distribution in the MOT20 dataset is very dense, with significant overlap and occlusion between objects, presenting a substantial challenge for tracking tasks. In this context, our method effectively detects and successfully tracks objects, demonstrating its superiority in handling overlapping and occluded scenarios. The visualization results for MOT20-04 illustrate that our method can cope with the challenging environment of numerous pedestrians on a busy nighttime street, where an object with ID 19 can still be successfully tracked despite being occluded by other tracked objects. Furthermore, the visualization results for MOT20-08 further confirm that our method performs exceptionally well in crowded situations.
In the DanceTrack dataset, the objects exhibit complex motion patterns with frequent occlusions and overlaps, posing significant challenges for MOT. In such cases, our method can maintain stable tracking of objects over extended periods without loss, demonstrating its effectiveness. The visualization results for the DanceTrack-09 sequence show that, even when faced with the complex movements of dancing, we can achieve stable tracking over long durations; after 100 frames, our method still accurately tracks each object. Additionally, the visualization results for the DanceTrack-09 sequence further corroborate this point.
Our proposed method performs excellently in handling object overlap in images, showing significant advantages over the baseline model. As shown in Figure 7, based on the ground truth labels provided by the DanceTrack validation set, we calculated the Intersection over Union (IoU) of object bounding boxes for each frame. When the overlap rate (IoU) exceeded 0.7, we counted and accumulated the number of objects in that frame. In total, we identified 20,338 objects involved in overlap situations. Among these objects, our model successfully tracked 18,497, achieving a success rate of 91.0%. In comparison, the baseline model only successfully tracked 18,160 objects, with a success rate of 89.2%. This result indicates that our method can more effectively maintain tracking accuracy and robustness in complex object overlap scenarios. Figure 8 further presents the experimental results for specific video sequences from the DanceTrack dataset, where we compare the success tracking rates of our model and the baseline model in handling object overlap scenarios. The charts clearly illustrate the performance differences between the two models in complex environments, highlighting the advantages of our method in object tracking.

5. Limitations

Our method still has certain limitations. In high-density scenarios, where objects are densely packed, a significant amount of localization errors can occur. Since our method utilizes more historical information, these errors may accumulate, potentially degrading tracking performance. Additionally, compared to the baseline model, our method introduces additional computational complexity. In video sequences without object overlap, the performance improvement over the baseline model is not very pronounced, while the increased computational complexity leads to slower processing speeds. Finally, when objects are completely occluded and exhibit complex non-linear movements, our method struggles to handle such cases effectively. Future improvements could involve incorporating advanced features such as appearance features into the association algorithm to address these challenges.

6. Conclusions

This study focuses on key challenges in MOT, particularly the high sensitivity of tracking to detection outputs, object occlusion and overlap, and complex motion. We propose an innovative approach that integrates predictive information to improve NMS, thereby reducing reliance on detection results. By applying secondary modulation to suppression scores and dynamically adjusting suppression thresholds based on tracking information, our method significantly enhances the retention of candidate boxes for occluded objects. To track occluded or overlapping objects more effectively, we introduce an adaptive measurement noise method. This approach adjusts measurement noise to not only reduce the impact of foreground occlusion on tracking but also mitigate the effects of object overlap on tracking accuracy. Additionally, for objects with complex motion, we incorporate height information into the association algorithm to enhance the calculation of the affinity matrix, thereby improving the stability of object association. Experimental results demonstrate that these improvements significantly enhance overall MOT performance, especially in busy and complex scenarios. These findings provide new research insights for the MOT field and open up new directions for future studies. We hope our work will contribute positively to the development of multi-object tracking.

Author Contributions

Conceptualization, X.C., H.Z. and Y.D.; methodology, X.C. and H.Z.; software, H.Z.; validation, X.C., H.Z. and S.S.; formal analysis, Y.D. and S.S.; resources, X.C. and S.S.; data curation, H.Z. and Y.D.; writing—original draft preparation, X.C. and H.Z.; writing—review and editing, X.C. and S.S.; visualization, H.Z.; project administration, X.C. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by the Key R&D Projects of the Guangxi Science and Technology Program (GuikeAB24010338), the Central Government Guided Local Science and Technology Development Fund Projects (GuikeZY22096012), the National Natural Science Foundation of China (32360374), and an independent research project (GXRDCF202307-01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this study are accessible via the following links: https://github.com/DanceTrack/ and https://motchallenge.net/data/MOT20, accessed on 8 September 2024.

Acknowledgments

We are very grateful to the volunteers from Guilin University of Technology for their assistance in the experimental part of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bergmann, P.; Meinhardt, T.; Leal-Taixé, L. Tracking Without Bells and Whistles. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
  2. Luo, X.; Wang, Y.; Zhang, X. A Violation Analysis Method of Traffic Targets Based on Video and GIS. Geomat. Inf. Sci. Wuhan Univ. 2023, 48, 647–655. [Google Scholar]
  3. Zhang, Y.; Da, F. A Multi-object Tracking Method Based on Dilatation Region Matching and Adaptive Trajectory Management Strategy. Geomat. Inf. Sci. Wuhan Univ. 2024, 49, 572–581. [Google Scholar] [CrossRef]
  4. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  5. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  6. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object Tracking by Associating Every Detection Box. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
  7. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  8. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  9. Ren, H.; Han, S.; Ding, H.; Zhang, Z.; Wang, H.; Wang, F. Focus On Details: Online Multi-Object Tracking with Diverse Fine-Grained Representation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 11289–11298. [Google Scholar]
  10. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 9686–9696. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  12. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  13. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  14. Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12352–12361. [Google Scholar]
  15. Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  17. Seidenschwarz, J.; Brasó, G.; Serrano, V.C.; Elezi, I.; Leal-Taixé, L. Simple Cues Lead to a Strong Multi-Object Tracker. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13813–13823. [Google Scholar]
  18. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of Tricks and a Strong Baseline for Deep Person Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1487–1495. [Google Scholar]
  19. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, New York, NY, USA, 22–26 October 2018; pp. 274–282. [Google Scholar]
  20. Tang, Q.; Jo, K.-H. Unsupervised Object Re-identification via Instances Correlation Loss. In Proceedings of the 2022 IEEE 20th International Conference on Industrial Informatics (INDIN), Perth, Australia, 25–28 July 2022; pp. 135–139. [Google Scholar]
  21. Liang, C.; Zhang, Z.; Zhou, X.; Li, B.; Zhu, S.; Hu, W. Rethinking the competition between detection and reid in multiobject tracking. IEEE Trans. Image Process. 2022, 31, 3182–3196. [Google Scholar] [CrossRef] [PubMed]
  22. Pang, J.; Qiu, L.; Li, X.; Chen, H.; Li, Q.; Darrell, T.; Yu, F. Quasi-Dense Similarity Learning for Multiple Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 164–173. [Google Scholar]
  23. Du, Y.; Wan, J.; Zhao, Y.; Zhang, B.; Tong, Z.; Dong, J. Giaotracker: A comprehensive framework for mcmot with global information and optimizing strategies in visdrone 2021. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2809–2819. [Google Scholar]
  24. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  25. Evangelidis, G.D.; Psarakis, E.Z. Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1858–1865. [Google Scholar] [CrossRef] [PubMed]
  26. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  27. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490. [Google Scholar]
  28. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple-object tracking with transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675. [Google Scholar]
  29. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  30. Huang, X.; Zhan, Y. Multi-object tracking with adaptive measurement noise and information fusion. Image Vis. Comput. 2024, 144, 104964. [Google Scholar] [CrossRef]
  31. Liang, H.; Wu, T.; Zhang, Q.; Zhou, H. Non-maximum suppression performs later in multi-object tracking. Appl. Sci. 2022, 12, 3334. [Google Scholar] [CrossRef]
  32. Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20993–21002. [Google Scholar]
  33. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. MOT20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
  34. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP J. Image Video Process. 2007, 1, 246309. [Google Scholar] [CrossRef]
  35. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 17–35. [Google Scholar]
  36. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A Higher Order Metric for Evaluating Multi-object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
  37. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar]
  38. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  39. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4859–4869. [Google Scholar]
Figure 1. The workflow of our method: video frames are fed into a detector with NMS disabled, and the raw candidate boxes are passed, together with the Kalman filter (KF) predictions, into the CEF module to obtain the filtered detection boxes. These boxes are then matched with the KF predictions in two association stages using DHIoU. Finally, based on the matching results, the tracklets are managed and KF-AMN updates are performed.
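To make the workflow in Figure 1 concrete, the following is a minimal Python sketch of the per-frame loop it describes. The helper names `detect_without_nms`, `cef_filter`, `dhiou_match`, and the `Tracklet` methods are hypothetical placeholders standing in for the components named in the caption; this is an illustration of the data flow, not the authors' released implementation.

```python
# Hypothetical per-frame tracking loop following the workflow of Figure 1.
# All helper functions below are placeholder names for the components
# described in the caption; the real implementation may differ.

def track_frame(frame, tracklets):
    # 1. Run the detector with NMS disabled so all candidate boxes are kept.
    candidates = detect_without_nms(frame)

    # 2. Predict the current position of every tracklet with its Kalman filter.
    predictions = [t.kf_predict() for t in tracklets]

    # 3. CEF module: fuse KF predictions with the candidates to filter detections.
    detections = cef_filter(candidates, predictions)

    # 4. Two association stages between detections and predictions using DHIoU.
    matches, unmatched_dets, unmatched_trks = dhiou_match(detections, predictions)

    # 5. Manage tracklets and update matched ones with adaptive measurement noise.
    for det_idx, trk_idx in matches:
        tracklets[trk_idx].kf_update_amn(detections[det_idx])
    return tracklets
```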
Figure 2. The selection of weight parameters. In panel (a), the x-axis represents the parameter value and the y-axis shows the HOTA evaluation metric; a higher HOTA value indicates better performance for the selected parameter. HOTA peaks at λ = 0.7, indicating that this is the optimal parameter. Similarly, in panel (b), HOTA peaks at d = 1.2, indicating that this is the optimal dilation coefficient. The best results are highlighted with a red circle.
Figure 3. Selection of the fixed threshold. When the threshold = 0.7, the HOTA result reaches its peak, indicating that this is the optimal parameter. The best result is highlighted with a red circle.
Figure 4. The figure illustrates the workflows of two different methods. The yellow boxes in (2) represent the detection candidate boxes before NMS is applied. The first row demonstrates the processing flow of classical NMS: the yellow box in (3) indicates the detection box resulting from applying NMS to the candidate boxes in (2). Due to significant overlap between the boxes, (3) ultimately retains only one detection box for the frame. The white boxes in (4) show the object positions predicted by the Kalman filter. The green box in (5) represents the result of associating the detection box from (3) with the prediction boxes from (4), successfully associating only one object trajectory. The second row shows the workflow of our method: the yellow boxes in (6) indicate the detection boxes resulting from applying our method to the candidate boxes in (2), successfully detecting two objects. The green boxes in (7) represent the result of associating the detection boxes from (6) with the prediction boxes from (4), successfully associating both objects.
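As a reference point for the contrast drawn in Figure 4, the sketch below shows classical score-sorted IoU NMS, which discards any candidate that overlaps a higher-scoring box beyond a fixed threshold; this is the behaviour that leaves only a single surviving box in panel (3). It is a generic illustration of standard NMS, not the proposed CEF module, whose score modulation and dynamically adjusted threshold are described in the main text; the default threshold of 0.7 is illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def classical_nms(boxes, scores, thresh=0.7):
    """Keep the highest-scoring box, drop every remaining candidate whose IoU
    with it exceeds `thresh`, and repeat. Heavily occluded objects whose boxes
    overlap a stronger detection are often discarded at this step."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= thresh]
    return keep
```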
Figure 5. A visualization of the tracking results on the MOT20 test dataset. Boxes of the same color and the numbers in the upper-left corner of the boxes indicate the same tracking object.
Figure 6. A visualization of the tracking results on the DanceTrack test dataset. Boxes of the same color and the numbers in the upper-left corner of the boxes indicate the same tracking object. The frame number in the lower-left corner shows the frame index of the image in the video.
Figure 7. We counted the total number of objects appearing in frames with object overlaps (IoU > 0.7) in the DanceTrack dataset, totaling 20,338 objects. In these overlapping scenarios, our method successfully tracked 18,497 objects, achieving a success rate of 91.0%; in contrast, the baseline model successfully tracked 18,160 objects, with a success rate of 89.2%.
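For clarity, the overlap statistic reported in Figure 7 can be reproduced with a routine along the following lines. The per-frame box format and the counting convention (all objects in a frame that contains at least one overlapping pair) are assumptions made for illustration; the 0.7 IoU threshold is the one stated in the caption.

```python
def pair_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def count_objects_in_overlapping_frames(frames, iou_thresh=0.7):
    """Count every object in frames that contain at least one pair of boxes
    with IoU above `iou_thresh`; `frames` is a list of per-frame box lists."""
    total = 0
    for boxes in frames:
        has_overlap = any(
            pair_iou(boxes[i], boxes[j]) > iou_thresh
            for i in range(len(boxes)) for j in range(i + 1, len(boxes))
        )
        if has_overlap:
            total += len(boxes)
    return total
```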
Figure 8. We present a comparison of the tracking success rates between our method and the baseline model for specific video sequences in the DanceTrack dataset. In the DanceTrack-04 sequence, our method tracked 83.6% of the objects, while the baseline method tracked only 78.1%. In the DanceTrack-26 sequence, our method tracked 97.9% of the objects, while the baseline method tracked 94.1%. In the DanceTrack-63 sequence, our method tracked 94.2% of the objects, while the baseline method tracked only 87.5%. In the DanceTrack-73 sequence, our method tracked 85.1% of the objects, whereas the baseline method tracked 82.3%. Finally, in the DanceTrack-97 sequence, our method tracked 80.7% of the objects, while the baseline method tracked 78.1%.
Table 1. A comparison with state-of-the-art methods on the DanceTrack test set. The best results are shown in bold. ByteTrack, OC-SORT, and our method share detection results.
Method       | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
SORT         | 47.9     | 91.8     | 50.8     | 72.0     | 31.2
DeepSORT     | 45.6     | 87.8     | 47.9     | 71.0     | 29.7
CenterTrack  | 41.8     | 86.8     | 35.7     | 78.1     | 22.6
FairMOT      | 39.7     | 82.2     | 40.8     | 66.7     | 23.8
QDTrack      | 45.7     | 83.0     | 44.8     | 72.1     | 29.2
TransTrack   | 45.5     | 88.4     | 45.2     | 75.9     | 27.5
TraDes       | 43.3     | 86.2     | 41.2     | 74.5     | 25.4
ByteTrack    | 47.3     | 89.5     | 52.5     | 71.6     | 31.4
OC-SORT      | 54.6     | 89.6     | 54.6     | 80.4     | 40.2
Our method   | 57.5     | 92.5     | 57.3     | 82.3     | 40.3
Table 2. A comparison with state-of-the-art methods on the MOT20 test set. The best results are shown in bold. ByteTrack, OC-SORT, and our method share detection results.
Method       | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
FairMOT      | 54.6     | 61.8     | 67.3     | 54.7     | 54.7
TransMOT     | 61.9     | 77.5     | 75.2     | 64.0     | 60.1
ByteTrack    | 61.3     | 77.8     | 75.2     | 63.4     | 59.6
OC-SORT      | 62.1     | 75.5     | 75.9     | 62.4     | 62.0
Our method   | 62.1     | 75.1     | 76.0     | 62.0     | 62.3
Table 3. Ablation experiments on the DanceTrack validation set. The best results are shown in bold.
CEF | AMN | DHIoU | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
    |     |       | 46.8     | 88.3     | 51.6     | 70.5     | 31.2
    |     |       | 47.0     | 88.5     | 51.9     | 70.6     | 31.2
    |     |       | 56.9     | 90.3     | 56.3     | 79.7     | 40.8
    |     |       | 57.4     | 90.3     | 56.7     | 79.7     | 41.5
    |     |       | 55.6     | 90.2     | 56.2     | 79.1     | 39.2
    |     |       | 56.0     | 90.3     | 56.7     | 78.7     | 40.1
    |     |       | 57.5     | 90.2     | 57.0     | 79.6     | 41.7
    |     |       | 58.0     | 90.4     | 57.3     | 79.8     | 42.3
Table 4. Comparison results of different dynamic measurement noise methods on the DanceTrack validation set. The best results are shown in bold.
Method    | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
Constant  | 55.6     | 90.3     | 56.0     | 78.7     | 39.5
NSA       | 56.6     | 90.2     | 56.1     | 79.8     | 40.3
AMN       | 58.0     | 90.4     | 57.3     | 79.8     | 42.3
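As a reference point for the comparison in Table 4, the sketch below illustrates the general idea of confidence-dependent measurement noise in a Kalman update: the NSA baseline scales the measurement noise covariance by (1 − detection confidence), so low-confidence (typically occluded) detections are trusted less than the motion prediction. This is a generic illustration of the NSA-style scaling, not the exact AMN formulation proposed in this paper, which is described in the methods section.

```python
import numpy as np

def nsa_measurement_noise(base_R, det_confidence):
    """NSA-style scaling: the less confident the detection, the larger the
    measurement noise, so the Kalman update trusts the prediction more."""
    return (1.0 - det_confidence) * base_R

def kalman_update(x, P, z, H, base_R, det_confidence):
    """Standard Kalman measurement update with confidence-scaled noise.
    x, P: predicted state mean/covariance; z: measurement; H: observation matrix."""
    R = nsa_measurement_noise(base_R, det_confidence)
    y = z - H @ x                               # innovation
    S = H @ P @ H.T + R                         # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)              # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(P.shape[0]) - K @ H) @ P
    return x_new, P_new
```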
Table 5. Comparison results of different affinity matrix methods on the DanceTrack validation set. The best results are shown in bold.
Method         | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
IoU            | 55.6     | 90.1     | 56.2     | 79.1     | 39.2
DHIoU (d = 1)  | 56.7     | 90.4     | 56.3     | 80.0     | 40.4
DHIoU          | 58.0     | 90.4     | 57.3     | 79.8     | 42.3
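To illustrate how height information can enter the affinity matrix, the following is a minimal sketch of one plausible height-aware, dilated IoU. It assumes DHIoU combines the IoU of boxes dilated by the coefficient d with a height-overlap term; this is an illustrative reading of the name and of the ablations in Tables 5–7, not the exact formula, which is defined in the methods section.

```python
# Illustrative sketch only: one plausible dilated, height-aware IoU.
# The exact DHIoU used in this paper is defined in the methods section.

def _pair_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def _dilate(box, d):
    """Scale a box about its centre by the dilation coefficient d."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = (box[2] - box[0]) * d, (box[3] - box[1]) * d
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def _height_overlap(a, b):
    """Vertical overlap of two boxes divided by the union of their height ranges."""
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = max(a[3], b[3]) - min(a[1], b[1])
    return inter / (union + 1e-9)

def dhiou_sketch(a, b, d=1.2):
    """Hypothetical DHIoU: IoU of d-dilated boxes weighted by height overlap."""
    a_d, b_d = _dilate(a, d), _dilate(b, d)
    return _pair_iou(a_d, b_d) * _height_overlap(a_d, b_d)
```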
Table 6. A comparison of different methods for calculating the affinity matrix in the two association stages on the DanceTrack validation set. The best results are shown in bold.
First Association | Second Association | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
IoU               | IoU                | 55.6     | 90.1     | 56.2     | 79.1     | 39.2
DHIoU             | IoU                | 57.7     | 90.2     | 56.8     | 79.8     | 41.9
DHIoU             | DHIoU              | 58.0     | 90.4     | 57.3     | 79.8     | 42.3
Table 7. A comparison of the affinity matrix computed from the height state (DHIoU) with the one computed from the width state (DWIoU) on the DanceTrack validation set. The best results are shown in bold.
Method         | HOTA (%) | MOTA (%) | IDF1 (%) | DetA (%) | AssA (%)
IoU            | 55.6     | 90.1     | 56.2     | 79.1     | 39.2
DWIoU (d = 1)  | 51.4     | 90.0     | 49.4     | 79.3     | 33.5
DHIoU (d = 1)  | 56.7     | 90.4     | 56.3     | 80.0     | 40.4