Article

Multi-Object Tracking Model Based on Detection Tracking Paradigm in Panoramic Scenes

School of Automation, Beijing Information Science and Technology University, Beijing 100192, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4146; https://doi.org/10.3390/app14104146
Submission received: 27 March 2024 / Revised: 4 May 2024 / Accepted: 9 May 2024 / Published: 14 May 2024

Featured Application

With the rapid advancements in artificial intelligence, Multi-Object Tracking algorithms play a crucial role in domains such as intelligent transportation, urban surveillance, and wildlife conservation. Investigating these algorithms in panoramic scenarios can significantly broaden their applicability, offering substantial societal value.

Abstract

Multi-Object Tracking (MOT) technology is dedicated to continuously tracking multiple targets of interest in a sequence of images and accurately identifying their specific positions at different times. This technology is crucial in key application areas such as autonomous driving and security surveillance. However, practical applications often require coordinating cameras from multiple angles for tracking. Directly studying Multi-Object Tracking algorithms in panoramic scenes is an effective way to address this issue. Panoramic scenes, however, introduce abrupt target position changes at the image boundaries and continuous changes in target scale, both of which make tracking difficult. To ensure the accuracy of target tracking, this study explores a detection-based tracking method built on an improved YOLOX detector and an adjusted DeepSORT algorithm. Firstly, YOLOX_s was chosen as the detector because its simple network structure ensures a fast computational speed. During the feature extraction stage, we used the Polarized Self-Attention (PSA) mechanism to capture more feature information, thereby improving the tracking performance on small-scale targets. Secondly, the tracker was improved by adding a camera motion compensation module before predicting the target’s position to mitigate the impact of camera shake on tracking. Finally, to address the difficulty of continuously tracking targets in specific areas of panoramic scenes, this study proposes dedicated tracking strategies that effectively resolve the tracking failures caused by target position changes at the boundaries. Experimental results show that the improved algorithm outperforms other algorithms in the field on multiple evaluation metrics. Compared to the original algorithm, the improved algorithm exhibits a 6% increase in the quantitative metric MOTA, a 7% increase in IDF1, and a 40% decrease in IDSWs, demonstrating its leading performance.

1. Introduction

Object tracking is essential in computer vision, emerging immediately after object detection [1]. To accomplish object tracking, the target must be located within each frame and assigned a unique ID. The same object across consecutive frames then forms a trajectory. Here, objects can vary from pedestrians, vehicles, and athletes in motion to birds in the sky. When the task involves tracking multiple objects across frames, it is referred to as Multi-Object Tracking (MOT). In Multi-Object Tracking, the algorithm must detect all targets in every video frame, match newly detected targets with those already assigned trajectories, and classify matched targets according to existing trajectories. Unmatched targets are considered new and are assigned a new ID along with a new trajectory. Targets that leave the video area are no longer tracked, and their trajectories are removed from the collection. With the advancements in science and technology, object-tracking technology is becoming increasingly vital in everyday life [2].
Panoramic Multi-Object Tracking represents a particularly challenging research domain. It encompasses not only the traditional issues faced by Multi-Object Tracking, such as object occlusion, identity switches, and the need for real-time processing, but also introduces unique challenges due to the panoramic viewpoint. Panoramic object tracking techniques, which provide a 360-degree view, offer a more comprehensive perspective for understanding dynamic information in complex scenes, significantly expanding their potential and impact in various practical applications.
Current methods in Multi-Object Tracking (MOT) can be categorized into detection-based tracking and joint detection and tracking methods. Detection-based tracking, which involves detecting an object and then tracking its movements on each video frame, separates the detection and tracking stages. This approach often achieves better detection metrics due to its sequential process. On the other hand, the joint detection and tracking approach integrates the tracking phase into the object detection process, incorporating elements such as appearance feature extraction and motion offset prediction. This integration allows for the partial parallel processing of detection and tracking, thereby reducing the algorithm’s time complexity. To ensure tracking accuracy, this study focuses on the detection-based tracking method.
In 2017, the DeepSORT [3] tracking algorithm was introduced, building on the SORT algorithm by incorporating deep learning feature representations to enhance tracking performance. It primarily addresses identity switches through appearance information and introduces a more complex motion model to handle occlusions.
In 2020, Zhou et al. proposed the CenterTrack [4] algorithm, which predicts the movement of targets directly between video frames without the need for explicit motion models or object detection. By tracking changes in the center points and movements, this method effectively handles target tracking in static and dynamic scenes.
In 2021, Zhang et al. introduced a plug-and-play tracking module, ByteTrack [5], which improves tracking performance by adding a simple, yet efficient, byte-tracking strategy on top of the YOLOX detector. Its core idea is to utilize low-score but high-quality detection boxes to assist tracking, significantly enhancing tracking accuracy and robustness in complex scenes. In the same year, Zeng et al. presented an end-to-end object detection and tracking method using transformers. MOTR [6] leverages the transformer architecture to address the Multi-Object Tracking problem, enhancing the tracking performance by establishing long-term temporal dependencies. Its use of the transformer’s self-attention mechanism effectively encodes the relationships between objects and their historical information.
In 2023, Cao et al. improved the SORT algorithm and proposed observation-centric SORT [7]. This study rethinks and refines the classic SORT algorithm, introducing an observation-centric approach to SORT. This method enhances the robustness of Multi-Object Tracking by optimizing the matching process between the detected and tracked objects. The core idea is to emphasize observational data during the tracking phase, using improved data association strategies to reduce identity switch issues, thereby achieving a more stable tracking performance in complex dynamic environments. In the same year, Chu et al. proposed the TransMOT [8] algorithm, which models the spatial and temporal interactions between objects using a graph transformer model. By organizing object trajectories in the form of sparse weighted graphs, the algorithm effectively captures interactions between a large number of objects. Additionally, in 2023, Zhang et al. introduced the MOTRv2 [9] algorithm, an improvement on MOTR. By introducing additional object detectors and employing a Query Anchor Box strategy, along with proposing Proposal Anchor Boxes, the algorithm optimizes the conflict between detection and association tasks.
This study conducted an in-depth investigation of current Multi-Object Tracking (MOT) methodologies and analyzed their key challenges. Subsequently, this study delved into Multi-Object Tracking issues within panoramic scenes using deep learning-based algorithms. Given the limitations of current tracking methods in panoramic settings, this study has made several improvements. Specifically, the main contributions of this study are as follows:
  • To address the challenges of significant target scale variation and changing lighting conditions in current datasets, this study proposes a Multi-Object Tracking algorithm incorporating an improved detector. In terms of model architecture, this study introduces a new small object detection layer, specifically designed to identify targets that are far from the capturing device and small in size, thus compensating for the deficiencies of traditional detection methods in such scenarios. Furthermore, to capture and analyze target features more accurately, this study integrates the Polarized Self-Attention (PSA) [10] mechanism during the object detection phase. This mechanism, by adjusting the model’s focus on features, significantly enhances the effectiveness of feature extraction, ensuring that the algorithm maintains efficient and accurate tracking performance in scenes with dim lighting or significant changes in lighting. These optimizations and improvements enhance the adaptability and robustness of the tracking algorithm and provide robust support for high-precision target tracking under various environmental conditions, significantly improving the algorithm’s reliability and effectiveness in practical applications.
  • The improvement efforts in the tracker aspect are based on the distinguished multi-object tracker, ByteTrack, which introduces an efficient data association technique known as BYTE, easily integrated into existing tracking systems for consistent performance enhancement. ByteTrack conducts object tracking based on the object detector YOLOX [11] and the appearance feature extraction module FastReID [12] SBS-S50, demonstrating strong occlusion robustness through its precise detection capabilities and optimized low-score detection box association strategy. However, ByteTrack exhibits limitations when dealing with panoramic scenes, leading to performance degradation. To address this issue, this study deeply refines the tracking strategy, imposing restrictions on specific areas within the scene and employing ReID technology for object matching. Additionally, this study introduces a camera motion estimation module to overcome the target distortion caused by camera shake. These enhancements significantly improve the algorithm’s tracking performance in panoramic scenes.
  • This study has developed a comprehensive dataset for Multi-Object Tracking in panoramic scenes, encompassing indoor and outdoor campus environments. This dataset is meticulously curated to provide diverse scenarios, addressing the challenges specific to panoramic Multi-Object Tracking, such as wide-area coverage, the varying scales of objects, and complex interactions among multiple subjects. Including indoor and outdoor scenes adds to the dataset’s robustness, offering researchers a rich resource for testing and improving Multi-Object Tracking algorithms under varied environmental conditions. This dataset aims to bridge the gap in existing resources for panoramic scene tracking, facilitating advancements in the field by providing a solid foundation for algorithm development and evaluation.
After the aforementioned improvements, compared to the original algorithm, the improved algorithm shows significant enhancements in quantitative metrics: MOTA increased by 6%, IDF1 increased by 7%, and IDSWs decreased by 40%, indicating that the revised algorithm is at a leading level.

2. Materials and Methods

2.1. Introduction of Datasets

To delve into and analyze the application of panoramic images in object detection, this study utilized the Insta360 ONE R panoramic camera to collect a series of video data within a school environment. Specifically, this data collection included videos of 3 indoor scenes and 9 outdoor scenes, each lasting approximately one minute, covering a variety of environments and lighting conditions within the campus. These carefully selected scenes aimed to simulate the diverse environments encountered in the real world, providing rich test data for subsequent research.
Given that panoramic images, with their 360-degree field of view, structurally differ from traditional two-dimensional images, utilizing these panoramic images directly for deep learning training and testing presented certain technical challenges. Therefore, this study converted the panoramic images into two-dimensional rectangular images to facilitate processing these images using existing deep learning networks. This conversion process not only preserved the key information in the images but also enhanced the network’s learnability and recognizability of the image data. The comparison of images before and after conversion demonstrated how effectively panoramic images are adapted to the deep learning framework, as illustrated in Figure 1.
Following the conversion of the images, this study subjected the collected video data to downsampling, extracting images at a frame rate of 15 frames per second, thereby obtaining a total of 10,295 images for experimental use. These images encompassed 220 distinct pedestrian targets, covering a wide range of scenarios from single pedestrians to complex crowd scenes, providing an improved object detection algorithm with a wealth of test samples. This diversified dataset was instrumental in evaluating and refining the performance of improved algorithms under various conditions, ensuring its robustness and accuracy in real-world applications.
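As an illustration of this downsampling step, a minimal sketch using OpenCV is shown below; the output naming scheme and the handling of the source frame rate are assumptions made here for clarity, not details taken from the paper.

```python
import os
import cv2

def sample_frames(video_path: str, out_dir: str, target_fps: int = 15) -> int:
    """Extract frames from a converted panorama video at roughly 15 fps,
    mirroring the downsampling step described above."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or float(target_fps)  # fall back if unknown
    step = max(1, round(src_fps / target_fps))                # keep every step-th frame
    frame_idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    cap.release()
    return saved
```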

2.2. Dataset Production

To accurately evaluate the performance of the improved object detection algorithm in handling panoramic images, this study conducted a comprehensive annotation of these images. Figure 2 demonstrates how the DarkLabel annotation tool was used to precisely annotate the pedestrian targets in each frame and assign each a unique target ID. This step was pivotal to the experiment’s success, as it provided the necessary real-label information for training deep learning models and ensured the accuracy and reliability of the experimental outcomes.
After the annotation process, a dataset in the Multi-Object Tracking (MOT) format, consisting of ‘det’ and ‘gt’ directories, was generated. The ‘det’ directory contains a single file named ‘det.txt’, where each line represents a detected object. The fields in ‘det.txt’ are as follows:
  • The first field indicates the frame number;
  • The second field represents the trajectory ID, which is always −1 in this file;
  • The next four fields, prefixed with ‘bb’, denote the coordinates of the top-left corner of the bounding box and its width and height;
  • The ‘conf’ field indicates the confidence level of the detection;
  • The last three fields are used in MOT3D and are always set to −1 for 2D detections.
The ‘gt’ directory contains a ‘gt.txt’ file, where:
  • The first field indicates the frame number;
  • The second field is the ID of the object’s motion trajectory;
  • The third to sixth fields represent the coordinates of the top-left corner of the bounding box and its dimensions;
  • The seventh field indicates whether the object trajectory is within the consideration range, with 0 meaning ignore and 1 meaning active;
  • The eighth field denotes the category of the object corresponding to the trajectory (the category label ID correspondence is provided in a table below);
  • The ninth field is the visibility ratio of the box, reflecting the extent to which the moving target is covered by other object boxes or clipped at the box edges.
This structured format facilitates the use of the dataset for training and evaluating object detection and tracking algorithms, providing a standardized way to represent each object’s spatial and temporal information within the scenes.
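To make the field layout above concrete, a minimal parser for a gt.txt file might look as follows; the class and attribute names are descriptive labels chosen for this sketch and are not part of the MOT specification.

```python
from dataclasses import dataclass

@dataclass
class GTEntry:
    frame: int          # frame number
    track_id: int       # ID of the object's motion trajectory
    x: float            # top-left x of the bounding box
    y: float            # top-left y of the bounding box
    w: float            # box width
    h: float            # box height
    active: int         # 0 = ignore, 1 = within the consideration range
    class_id: int       # category of the object
    visibility: float   # visibility ratio of the box

def load_gt(path):
    """Read a MOT-format gt.txt, one comma-separated entry per line."""
    entries = []
    with open(path) as f:
        for line in f:
            v = line.strip().split(',')
            entries.append(GTEntry(int(v[0]), int(v[1]),
                                   float(v[2]), float(v[3]),
                                   float(v[4]), float(v[5]),
                                   int(v[6]), int(v[7]), float(v[8])))
    return entries
```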
Through this series of meticulous data preparation efforts, this study has laid a solid foundation for exploring the application of panoramic images in object detection tasks. This broadens and deepens our research and provides valuable experience and data resources for future researchers.

2.3. Evaluation Metrics

The evaluation metrics for Multi-Object Tracking were used to assess the performance of tracking algorithms [13]. These metrics reflected the algorithm’s performance regarding tracking accuracy, persistence, and identification capabilities. Below, we introduce the MOT evaluation metrics used in this study, along with their meanings and formulas:
  1. Multi-Object Tracking Accuracy (MOTA)
MOTA [14] comprehensively considers errors such as false positives, missed detections, and ID switches, indicating overall tracking accuracy.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t GT_t}$$
In this context, FNs (false negatives) represent the number of missed detections, indicating the number of targets that exist but were not detected by the system. FPs (false positives) represent the number of false positives, indicating the number of non-targets the system mistakenly marked as targets. IDSWs (identity switches) represent the number of identity switches, with a smaller number indicating better performance. GT represents the number of real targets, that is, the number of targets present in each frame.
  2. Identity F1 Score (IDF1)
IDF1 is specifically used to measure the performance of tracking systems in maintaining the consistency of target identities. IDF1 [15] was introduced by Ristani et al. in 2016 and is used to assess the ability of tracking systems to correctly identify and associate target identities throughout an entire video sequence. Unlike the MOTA metric, which focuses on the accuracy and precision of tracking, IDF1 concentrates more on the accuracy of identity preservation. The IDF1 score is based on the harmonic mean of precision and recall, but it calculates the precision and recall of tracking identities rather than tracking locations. Here, precision is defined as the ratio of times the system correctly associates target identities to the total number of times the system associates target identities. Recall is defined as the ratio of times the system correctly associates target identities to the total number of ground-truth target identities in the video sequence. The calculation of IDF1 can be divided into three steps:
(1) Precision
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
In this context, TPs (true positives) represent the number of correctly associated target identities and FPs (false positives) represent the number of incorrectly associated target identities.
(2) Recall
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
In this context, FNs (false negatives) represent the number of missed identity associations, indicating the number of times target identities were changed but not detected by the system.
(3) Calculating IDF1
$$\mathrm{IDF1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The IDF1 score reflects a tracking system’s ability to maintain the consistency of target identities throughout the tracking process. A high IDF1 score means that the tracking system can effectively track individual targets and maintain their identities correctly during interactions or occlusions among targets. This is particularly important in applications where long-term tracking and consistency of identity are required, such as video surveillance, sports analysis, and social behavior research.
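A minimal numeric sketch of the two metrics above, assuming the per-frame error counts (for MOTA) and the identity-level true/false positive and false negative counts (for IDF1) have already been produced by a matching step:

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """MOTA = 1 - (sum_t FN_t + FP_t + IDSW_t) / (sum_t GT_t)."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / sum(gt_per_frame)

def idf1(id_tp, id_fp, id_fn):
    """IDF1 as the harmonic mean of identity precision and identity recall."""
    precision = id_tp / (id_tp + id_fp) if (id_tp + id_fp) else 0.0
    recall = id_tp / (id_tp + id_fn) if (id_tp + id_fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```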
These evaluation metrics provide a comprehensive performance assessment for multi-object tracking algorithms, helping researchers to identify the strengths and weaknesses of the algorithms and guiding the optimization and improvement of these algorithms.

2.4. Detector Improvements

YOLOX represents the latest advancement in the YOLO [16] series of object detection algorithms, inheriting the family’s hallmark of balancing speed and accuracy while enhancing detection performance through innovative technologies. Especially distinguished in real-time object detection, YOLOX has achieved unprecedented performance levels on multiple standard datasets through optimization and innovation. Consequently, this study selected YOLOX as the detector component and improved it. This choice was motivated by YOLOX’s ability to deliver high-speed, accurate detections, critical for applications requiring real-time processing and analysis. The enhancements aimed to tailor the YOLOX model to better suit specific requirements and challenges, especially in handling the complexities of panoramic image data, thereby maximizing the efficiency and effectiveness of object detection tasks.
This study enhanced the tracking accuracy for small-scale targets by adding a detection layer specifically designed for small objects. Moreover, to address the challenges posed by low-light conditions in complex environments, this study incorporated the Polarized Self-Attention (PSA) mechanism during the detection phase, aiming to capture the features of targets more effectively, thereby significantly boosting tracking performance. These key technological innovations optimized the algorithm’s performance under challenging conditions and significantly increased its value and applicability in real-world scenarios. Introducing a dedicated layer for small-scale object detection allows for a finer recognition of the details often missed by conventional layers, making the proposed solution particularly effective in scenarios where such targets are prevalent. By enhancing the model’s ability to focus on relevant features even in poor lighting, the PSA mechanism ensured that the improved algorithm remained robust and reliable, further extending its usability across a broader range of applications. The improved YOLOX structure is shown in Figure 3; the parts shown in red are the improved components.
When the input image dimensions were 640 × 640 × 3, the optimized YOLOX network first utilized the backbone network (CSPDarknet) for feature extraction. It then enhanced the feature maps processed by the CSP module of the backbone network using the Polarized Self-Attention (PSA) mechanism. Following this, the PANet network performed feature fusion on the feature maps of sizes 160 × 160, 80 × 80, 40 × 40, and 20 × 20 produced by the backbone network. Finally, YoloHead predicted the object categories and performed position regression (decoupled head) on the fused scales of feature maps (corresponding to the 4×, 8×, 16×, and 32× downsampled feature maps of the original input image).
Specifically, this dedicated layer for detecting small targets deeply analyzes and learns from the high-resolution features within the input image, effectively extracting detailed information about small-scale targets. Combined with the existing multi-scale feature fusion mechanism of the YOLOX_s model, this layer enhances the model’s responsiveness to small targets and maintains the efficient detection of medium- and large-scale targets, thereby further improving the model’s overall performance in Multi-Object Tracking tasks. This comprehensive approach ensures that the model is well-equipped to handle the challenges of detecting and tracking objects of different sizes across diverse scenarios.
In the design of the small object detection layer, a dense structure of small convolutional kernels was adopted to capture the fine details of small-sized objects more precisely. This layer was positioned early in the network to act directly on high-resolution feature maps. This is crucial because high-resolution feature maps contain a wealth of raw detail information essential for recognizing small objects in the early stages of image processing. For instance, traditional detection layers might only identify a vague human shape for a distant pedestrian, whereas an optimized small object detection layer can recognize details like the head and limbs, significantly improving recognition accuracy.
The feature fusion strategy utilized an attention mechanism to weigh and integrate feature maps from different scales, including those from the small object detection layer. This method allows the model to focus on the fine details of small objects while still effectively recognizing medium- to large-sized objects. Practically, this can be achieved by calculating the contribution of each feature map to the final detection task and dynamically adjusting its weight in the fusion process. This dynamic adjustment mechanism ensures that the model extracts the most crucial features for different sized targets, optimizing the overall detection performance.
Applying this approach has significantly improved the metrics for Multi-Object Tracking algorithms in handling small-scale objects, including enhancing the accuracy of object detection and tracking stability, effectively boosting Multi-Object Tracking performance in complex environments. This improvement demonstrates the possibility of refining functional layers within traditional object detection and tracking frameworks to adapt to specific scenario needs and provides new ideas and methods for future research in Multi-Object Tracking.

2.5. Tracking Method

2.5.1. Analysis of the Problem

Faced with the challenge that panoramic images cannot be directly processed, this study adopted a strategy of unfolding the panoramic images into two-dimensional plane images for further processing. Although effective, this method introduces a new issue: some targets may disappear at one end of the scene and suddenly reappear at the other end, as demonstrated in Figure 4. This behavior of crossing scene boundaries can lead to tracking failures, assigning different IDs to the same target, and poses significant difficulties for existing algorithms in accurately tracking targets across large spatial intervals.

2.5.2. Camera Motion Compensation

Tracking methods based on detection largely depend on associating predicted target positions with the detected bounding boxes. However, due to certain objective factors, such as camera shake or slight shifts caused by wind, there can be displacement in the detection boxes, leading to switches in the target IDs. To address this issue and improve the algorithm, this study employed a camera motion compensation method. Drawing from the global motion compensation (GMC) technique used in the video stabilization module with affine transformation provided in OpenCV [17], a camera motion compensation approach was designed. This image registration method is particularly suited for scenes with significant background shifts.
The process begins with extracting key points and feature tracking using sparse optical flow. The affine matrix is then solved using RANSAC [18]. By utilizing sparse registration techniques, it is possible to ignore other moving objects in the scene, thus more accurately determining the background. This method enhances the stability and accuracy of object tracking by compensating for camera motion, thereby minimizing the risk of incorrect target ID assignments due to background displacement. It is a critical improvement for ensuring the robustness of the detection and tracking systems, especially in outdoor environments where camera movement is more likely.
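As a sketch of this pipeline, the function below estimates the frame-to-frame affine matrix with OpenCV’s sparse optical flow and a RANSAC affine fit; the corner count and RANSAC threshold are illustrative values, not the settings used in the paper.

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Estimate the 2x3 affine matrix between two consecutive grayscale frames."""
    # Sparse key points on the previous frame (background-dominated corners).
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    if pts_prev is None:
        return np.eye(2, 3, dtype=np.float32)
    # Track the key points into the current frame with sparse optical flow.
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]
    if len(good_prev) < 4:
        return np.eye(2, 3, dtype=np.float32)
    # Robust affine fit; RANSAC discards points on independently moving objects.
    A, _ = cv2.estimateAffinePartial2D(good_prev, good_curr,
                                       method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    return A if A is not None else np.eye(2, 3, dtype=np.float32)
```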
To transform the predicted bounding boxes from frame $k-1$ to the coordinates of the next frame using the affine matrix $A_{k-1}^{k}$, the specific calculation method shown below was used:
$$A_{k-1}^{k} = \left[\, M_{2\times 2} \mid T_{2\times 1} \,\right] = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$$
In this context, $M \in \mathbb{R}^{2\times 2}$ is the submatrix of the affine matrix $A$ containing the scale and rotation components, while $T \in \mathbb{R}^{2}$ contains the translation component.
$$\tilde{M}_{k-1}^{k} = \begin{bmatrix} M & 0 & 0 & 0 \\ 0 & M & 0 & 0 \\ 0 & 0 & M & 0 \\ 0 & 0 & 0 & M \end{bmatrix}, \qquad \tilde{T}_{k-1}^{k} = \begin{bmatrix} a_{13} & a_{23} & 0 & \cdots & 0 \end{bmatrix}^{\mathrm{T}}$$
Here, two matrix parameters, $\tilde{M}_{k-1}^{k} \in \mathbb{R}^{8\times 8}$ and $\tilde{T}_{k-1}^{k} \in \mathbb{R}^{8}$, are defined to control the rotation/scale and translation transformations of the predicted state.
$$\hat{x}'_{k|k-1} = \tilde{M}_{k-1}^{k}\,\hat{x}_{k|k-1} + \tilde{T}_{k-1}^{k}$$
$$P'_{k|k-1} = \tilde{M}_{k-1}^{k}\,P_{k|k-1}\,\left(\tilde{M}_{k-1}^{k}\right)^{\mathrm{T}}$$
where $\hat{x}_{k|k-1}$ and $\hat{x}'_{k|k-1}$ denote the predicted state vectors of the Kalman filter at time $k$ before and after camera motion compensation, and $P_{k|k-1}$ and $P'_{k|k-1}$ denote the corresponding covariance matrices before and after correction, respectively. Subsequently, this study used $\hat{x}'_{k|k-1}$ and $P'_{k|k-1}$ in the Kalman filter state update process as follows:
$$K_k = P'_{k|k-1} H_k^{\mathrm{T}} \left( H_k P'_{k|k-1} H_k^{\mathrm{T}} + R_k \right)^{-1}$$
$$\hat{x}_{k|k} = \hat{x}'_{k|k-1} + K_k \left( z_k - H_k \hat{x}'_{k|k-1} \right)$$
$$P_{k|k} = \left( I - K_k H_k \right) P'_{k|k-1}$$
In the case of high-speed target motion, the correction of the state vector, including the velocity term, is essential. When the camera offset changes slowly relative to the target frame rate, the correction of Equation (4) can be omitted. This correction method makes the proposed tracker more robust against camera offsets.
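The following sketch shows how the estimated affine matrix can be applied to the Kalman prediction according to the equations above, assuming an eight-dimensional state ordered as [x, y, w, h, vx, vy, vw, vh]; this ordering is an assumption made for illustration, since the paper does not specify it.

```python
import numpy as np

def apply_gmc_to_kalman(mean, cov, A):
    """Warp the Kalman-filter prediction (mean in R^8, cov in R^{8x8}) with the
    estimated 2x3 affine matrix A, following the equations above."""
    M = A[:, :2]                   # 2x2 scale/rotation part
    t = A[:, 2]                    # translation part (a13, a23)
    M8 = np.kron(np.eye(4), M)     # block-diagonal M-tilde in R^{8x8}
    t8 = np.zeros(8)
    t8[:2] = t                     # translation only affects the position terms
    mean = M8 @ mean + t8
    cov = M8 @ cov @ M8.T
    return mean, cov
```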

2.5.3. Tracking Model

An improved model is shown in Figure 5.
This study proposed a new tracking method for tracking targets in panoramic scenes, and the tracking flow is shown in Algorithm 1. The inputs to the algorithm were a video sequence (V), a target detector (Det), an appearance feature extractor (Enc), a high-score detection threshold (τ), a new track threshold (η), and a corner-region threshold (cor). The outputs were the tracks (T) of the video, each containing the bounding box and the unique ID of the object in each frame.
Algorithm 1: Tracking Algorithm Pseudocode
      Input: Video sequence (V); target detector (Det); appearance feature extractor (Enc); high-score detection box threshold (τ); new track establishment threshold (η); corner range judgment threshold (cor).
      Output: Target tracks.
      1 Initialization: T ← ∅
      2 For frame fk in V:
      3            Dk ← Det(fk), Dhigh ← ∅, Dlow ← ∅, Fhigh ← ∅ /*Classify detection results*/
      4            For d in Dk:
      5                   If d.score > τ, then
      6                     Dhigh ← Dhigh ∪ {d} /*Store high-score detection boxes*/
      7                     Fhigh ← Fhigh ∪ Enc(fk, d.box) /*Extract target appearance features*/
      8                   Otherwise
      9                     Dlow ← Dlow ∪ {d} /*Store low-score detection boxes*/
                     /*Predict new locations of tracks*/
      10            A(k−1→k) ← findMotion(fk−1, fk) /*Calculate the affine matrix from frame k−1 to k*/
      11           For t in T: /*Predict the position of each track in the next frame*/
      12                 t ← KalmanFilter(t), t ← MotionCompensation(t, A(k−1→k))
                      /*First association*/
      13            Ciou ← IOUDist(T.boxes, Dhigh)
      14            Cemb ← FusionDist(T.features, Fhigh, Ciou)
      15            Chigh ← min(Ciou, Cemb)
      16            Associate using the Hungarian algorithm
      17            Dremain ← remaining detection boxes from Dhigh
      18            Tremain ← remaining tracks from T
                      /*Second association*/
      19            Clow ← IOUDist(Tremain.boxes, Dlow)
      20            Associate using the Hungarian algorithm
      21            Tre-remain ← remaining tracks from Tremain
                       /*Third association*/
      22            If Dremain ∈ Dcor, then
      23               Fcor ← Fcor ∪ Enc(fk, Dcor.box)
      24               Cemb ← FusionDist(Tre-remain.features, Fcor)
      25            Associate using the Hungarian algorithm
      26            If D ∈ Dcor, then /*Update Kalman parameters*/
      27                Initialize Kalman filter
      28            Otherwise
      29                Update the Kalman filter of matched tracklets
      30            T ← T \ Tre-remain /*Delete unmatched tracks*/
      31            For d in Dremain: /*Initialize new tracks*/
      32                If d.score > η, then
      33                     T ← T ∪ {d}
      34 Return: T
For each frame in the input video, the algorithm used a detector to obtain the bounding boxes of all targets appearing in the frame, together with the confidence score of each box. Then, according to the set threshold, the detection boxes were categorized by score: boxes with scores higher than the threshold were placed in Dhigh and their target features were extracted and stored in Fhigh, while boxes with lower scores were placed in Dlow.
After distinguishing the high-score and low-score detection boxes, we first obtained the affine matrix $A_{k-1}^{k}$ by calculating the data of frame k−1 and frame k, then used the Kalman filtering method to predict the new position of each trajectory from the previous frame in the current frame. Finally, we used the calculated affine matrix $A_{k-1}^{k}$ to correct these predicted boxes and obtain the final prediction boxes.
Next, the first association was started between the high-score detection boxes and all trajectories, using a combination of motion and appearance information. First, the IoU between the high-score detection boxes and the trajectory boxes was calculated to obtain Ciou. The features extracted from the high-score detection boxes were compared with the trajectory features to obtain Cemb, and finally, the Hungarian algorithm was used to perform the matching. Detection boxes that were not matched successfully were saved in Dremain, and unmatched trajectories were stored in Tremain. Using the set threshold, we checked whether the detection boxes in Dremain were in the lower-left or lower-right part of the video, and if so, stored them in Dcor.
The second association was performed between the low-score detection boxes and the trajectories in Tremain that were not associated in the first round. Since low-score detection boxes are usually caused by occlusion or motion blur, their appearance features are unreliable. Therefore, the second matching was based on IoU only, and the matching method was still the Hungarian algorithm. Unmatched tracks were placed in Tre-remain for the next association.
The third association compared the trajectories remaining after the second association (Tre-remain) with the detection boxes in the corner regions of the video (Dcor). Because targets in the panoramic scene can disappear from the bottom-left corner and reappear from the bottom-right corner, or vice versa, the motion model cannot predict the correct position in these cases, so this matching used only the ReID method for association. At the end of the third association, the trajectories not successfully matched with a detection box were removed from the trajectory list. Finally, we examined the detection boxes still not associated with any trajectory after the three matching rounds; those with scores higher than the set threshold (η) generated new trajectories.
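To illustrate the boundary handling used in the third association, the helper below flags detection boxes that lie in the left or right boundary strips of the unwrapped panorama; the strip-width ratio stands in for the corner threshold (cor) in Algorithm 1 and is an assumed value, not the paper’s setting.

```python
def in_boundary_region(box, frame_width, cor_ratio=0.08):
    """Return True if a detection box (x1, y1, x2, y2) touches the left or
    right boundary strip of the unwrapped panoramic frame. Targets leaving
    one strip may reappear in the opposite strip, so only ReID features are
    used to associate them there."""
    x1, _, x2, _ = box
    strip = cor_ratio * frame_width
    return x1 < strip or x2 > frame_width - strip

# Example: a 3840-pixel-wide unwrapped panorama.
print(in_boundary_region((10, 200, 60, 400), 3840))      # True  (left strip)
print(in_boundary_region((1800, 200, 1900, 400), 3840))  # False (middle of the frame)
```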

3. Results and Discussion

To provide a clear understanding of the context in which experiments were conducted, it is essential to outline the software and hardware environments utilized. These environments are crucial for replicating the experiments and understanding the performance metrics of the proposed algorithm. Below are the software and hardware settings details, structured as Table 1 and Table 2, respectively.

3.1. Experiments of Object Detection Algorithm

To thoroughly demonstrate the effectiveness of the improved method, we meticulously selected half of the MOT17 [19] dataset and half of the MOT20 [20] dataset, supplemented with a custom panoramic dataset, for training over 100 epochs. This combination of data not only covers a variety of scenes, but also ensures the comprehensiveness of the testing. This study conducted training sessions for both the original and optimized models, followed by a detailed performance comparison based on a series of quantitative metrics. This process is aimed at visually presenting the specific contributions of improvements to enhancing model performance. The comparative results are illustrated in Figure 6.
As illustrated in Figure 6, the blue curve represents the training results of the original YOLOX detector, while the orange curve showcases improved detector performance throughout the training process. By comparing these two curves, we can clearly observe that, although the proposed model exhibits a slightly slower fitting speed during training, it achieves a significant improvement in accuracy compared to the original model. The related quantitative metrics are detailed in Table 3 for a more in-depth performance analysis.
To conduct a more thorough performance analysis, Table 3 compares the quantitative metrics of two detectors after training. Observing the results in the table, it is evident that, after improvements, the detector’s AP50 metric increased by three percentage points, and the AP50_95 metric increased by two percentage points, providing a strong foundation for the subsequent tracker. AP50 refers to the Average Precision (AP) when the Intersection Over Union (IOU) threshold is set at 50%. AP50_95, on the other hand, refers to the Average Precision calculated by averaging the AP values obtained across a range of IOU thresholds from 50% to 95%. This approach gives a broader evaluation of the detector’s performance across varying degrees of overlap between the predicted and ground-truth bounding boxes.
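For reference, the IoU underlying these thresholds can be computed as in the short sketch below, assuming boxes in (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# AP50 counts a prediction as correct when IoU >= 0.50;
# AP50_95 averages AP over the thresholds 0.50, 0.55, ..., 0.95.
```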
This comparison highlights the effectiveness of improvements in enhancing the detector’s accuracy. The slower fitting speed of the proposed model could be attributed to the additional complexities introduced by modifications, such as more sophisticated attention mechanisms or enhanced feature extraction layers. However, these enhancements contribute to a more robust model capable of achieving higher precision in object detection tasks, particularly in challenging environments or scenarios presented by the datasets used.

3.2. Experiments Using the Multi-Object Tracking Algorithm

In this study, we utilized an improved version of YOLOX_s as the detector for the proposed model. YOLOX stands out among the YOLO series algorithms due to its novelty, rich model weights, excellent real-time detection speed, precise detection performance, and unique approach to decoupled head processing. Before deploying it, we conducted mixed training on the MOT17, MOT20, and custom panoramic datasets to obtain a pre-trained model. For the re-identification (ReID) model used for pedestrian re-identification, the SBS-S50 model from FastReID was selected and trained on the MOT17 dataset to acquire a pre-trained model. The tracking association model employed a newly proposed method in this study. All parameter settings were chosen through experimentation.
This approach leverages the strengths of YOLOX_s in handling diverse and challenging scenarios by enriching its training with a mix of standard and custom datasets, enhancing its adaptability and performance in real-world applications. Including a robust ReID model further supports the tracking process by improving the accuracy of identity recognition across different frames, which is particularly crucial in dense scenes or scenarios with frequent occlusions.
After the collected video material was input, target detection was performed and the detections were then tracked by the Multi-Object Tracking algorithm. This study selected pedestrian targets captured indoors and outdoors on campus as the focus of the investigation. The algorithm’s performance was assessed using three evaluation metrics: MOTA (Multiple Object Tracking Accuracy), IDF1 (Identity F1 Score), and IDSWs (Identity Switches).
MOTA measures the overall tracking accuracy, considering false positives, false negatives, and ID switches.
IDF1 evaluates the identification accuracy by considering the precision and recall of correctly identified detections over the total number of ground-truths and computed detections.
IDSWs count the number of times the algorithm incorrectly changes the ID of a tracked object.
The experimental outcomes are presented in Table 4. To demonstrate the advancement of this study, the ByteTrack algorithm will be used as a control group for experimental comparison and result visualization. The experimental results are shown in Figure 7, Figure 8, Figure 9 and Figure 10.
Figure 7 and Figure 8 showcase the tracking results of the comparative and proposed algorithms in an outdoor setting, respectively. Similarly, Figure 9 and Figure 10 illustrate the tracking outcomes of the comparative algorithm and proposed algorithm in indoor environments. Figure 7 and Figure 9 show that targets passing near scene boundaries often lead to tracking failures due to large positional spans. For instance, in Figure 7, the targets numbered 12 and 15 (indicated by red arrows) experience tracking failure after 30 frames, with their IDs changing to 20 and 21. In contrast, in Figure 8 and Figure 10, the targets indicated by green arrows are correctly tracked throughout. These results demonstrate that the improved algorithm effectively addresses the issue of tracking failures caused by targets crossing boundaries in panoramic scenes, achieving continuous tracking results.
This clear distinction in performance underscores the effectiveness of proposed algorithmic enhancements, particularly in handling the complexities of tracking across scene boundaries. The proposed algorithm’s ability to maintain consistent tracking of targets through these transitional spaces significantly improves its utility for real-world applications, where subjects frequently move in and out of the camera’s field of view or across distinct spatial zones. This advancement not only contributes to the field of Multi-Object Tracking by addressing a well-known limitation, but also sets a new standard for robustness and reliability in varied settings.
Table 4 demonstrates the effectiveness of individual improvements to both the detector and the tracker, with the enhancements to the tracking strategy, in particular, resulting in a significant improvement in the algorithm’s tracking performance. The improvements across various metrics are notable. Specifically, the proposed algorithm increased by nearly six percentage points in the MOTA index, nearly seven percentage points in the IDF1 index, and reduced by 40% in IDSWs.
Upon analyzing the principles behind the improvements, it can be seen that the modified YOLOX_s detector has enhanced the algorithm’s ability to recognize small-scale objects and targets in dimly lit environments. Meanwhile, the improved DeepSORT algorithm provides a solution to the issue of abrupt changes in target positions at boundary areas in panoramic scenes. The combination of these two methods has resulted in an even better performance overall.
These quantitative metrics underscore the success of modifications in enhancing the algorithm’s overall tracking accuracy and reliability. The increase in MOTA highlights the proposed algorithm’s enhanced ability to accurately track multiple objects without losing their identities. The improvement in the IDF1 score reflects better identity matching over time, indicating that the proposed algorithm is more effective at consistently tracking individual targets across frames. Lastly, the significant reduction in ID switches (IDSWs) points to the proposed algorithm’s improved capability to correctly maintain target identities without erroneously swapping them, which is crucial for applications requiring precise individual tracking over extended periods or through complex scenes.

4. Conclusions

This study introduced significant improvements to the Multi-Object Tracking (MOT) algorithm, particularly within panoramic scenes, which have historically posed unique challenges due to the extensive spatial coverage and the possibility of targets crossing scene boundaries. Building upon the robust foundation of YOLOX_s for object detection, and incorporating advanced tracking strategies, the optimized algorithm has notably improved performance metrics such as MOTA, IDF1, and IDSWs.
The comparison of the proposed algorithm against conventional methods, as illustrated in Figure 7, Figure 8, Figure 9 and Figure 10, underscores its enhanced ability to maintain continuous tracking of targets even as they pass through challenging scene transitions. This represents a substantial advancement over previous studies, where tracking failures at scene boundaries were a common limitation, leading to loss in target tracking or incorrect identity assignments. Our findings, supported by quantitative improvements in key performance indicators, align with the working hypothesis that targeted enhancements in detection and tracking mechanisms can significantly mitigate the issues associated with panoramic Multi-Object Tracking.
From a broader perspective, this study contributes to the ongoing evolution of MOT technologies by addressing a critical gap in panoramic scene tracking. It enriches the toolkit available to researchers and practitioners in the field and opens new avenues for real-world applications where seamless object tracking across wide spatial extents is required, such as in urban surveillance, wildlife monitoring, and autonomous navigation systems.
Looking forward, several directions for future research emerge from this study. Firstly, exploring the integration of semantic segmentation with tracking algorithms could offer deeper insights into scene context, potentially enhancing tracking accuracy in densely populated or highly dynamic environments. Secondly, applying deep learning techniques for adaptive model tuning in real-time tracking scenarios presents a promising area for improving algorithmic efficiency and responsiveness. Lastly, extending this methodology to include aerial or drone-based panoramic footage could broaden the scope of MOT applications, addressing an even wider range of surveillance and monitoring challenges.

Author Contributions

Conceptualization, J.S.; methodology, J.S. and H.Y.; software, H.Y.; validation, J.S.; formal analysis, J.S.; investigation, J.S. and H.Y.; resources, H.Y.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, H.Y.; visualization, J.S.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y. Improving Indoor Pedestrian Detection and Tracking in Crowded Environments: Deep Learning Based Multimodal Approaches. Ph.D. Thesis, The University of Sydney, Sydney, NSW, Australia, 2024. [Google Scholar]
  2. Zhao, X.; Sun, M.; Zhao, Q. Sensors for Robots. Sensors 2024, 24, 1854. [Google Scholar] [CrossRef] [PubMed]
  3. Li, X.; Fang, G.; Rao, L.; Zhang, T. Multi-Target Tracking of Person Based on Deep Learning. Comput. Syst. Sci. Eng. 2023, 47, 2671–2688. [Google Scholar] [CrossRef]
  4. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 474–490. [Google Scholar]
  5. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-Object Tracking by associating every detection box. In European Conference on Computer Vision; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  6. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple-object tracking with transformer. In European Conference on Computer Vision; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 659–675. [Google Scholar]
  7. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar]
  8. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. Transmot: Spatial-temporal graph transformer for multiple object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4870–4880. [Google Scholar]
  9. Zhang, Y.; Wang, T.; Zhang, X. Motrv2: Bootstrapping end-to-end Multi-Object Tracking by pretrained object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22056–22065. [Google Scholar]
  10. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
  11. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  12. He, L.; Liao, X.; Liu, W.; Liu, X.; Cheng, P.; Mei, T. Fastreid: A pytorch toolbox for general instance re-identification. arXiv 2020, arXiv:2006.02631. [Google Scholar]
  13. Prakash, H.; Chen, Y.; Rambhatla, S.; Clausi, D.A.; Zelek, J. VIP-HTD: A Public Benchmark for Multi-Player Tracking in Ice Hockey. J. Comput. Vis. Imaging Syst. 2023, 9, 22–25. [Google Scholar]
  14. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 1–10. [Google Scholar] [CrossRef]
  15. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 17–35. [Google Scholar]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Bradski, G.; Kaehler, A. Learning OpenCV, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2008. [Google Scholar]
  18. Derpanis, K.G. Overview of the RANSAC Algorithm. Image Rochester NY 2010, 4, 2–3. [Google Scholar]
  19. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  20. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
Figure 1. Image comparison before and after conversion. (a) Panoramic image; (b) converted 2D matrix image.
Figure 2. Dataset labeling method.
Figure 3. The improved YOLOX structure.
Figure 4. Tracking results for the same target at different frames.
Figure 5. The flow of the proposed method.
Figure 6. Detector training results.
Figure 7. Tracking results of ByteTrack outdoors.
Figure 8. Tracking results of the proposed algorithm outdoors.
Figure 9. Tracking results of ByteTrack indoors.
Figure 10. Tracking results of the proposed algorithm indoors.
Table 1. Hardware system environment.

Hardware Name    Hardware Configuration
CPU              Intel(R) Core(TM) i7-10700K CPU, 3.80 GHz
GPU              NVIDIA GeForce RTX 2080 Ti
RAM              16 GB
Table 2. Software environment system.

Software Environment System    Description                        Version
Operating system               Operating system                   Windows 11
Anaconda3                      Package environment manager        2.3.1
torch                          Deep learning framework            2.0.0
torchvision                    Computer vision graphics library   0.15.0
Python                         Programming language environment   3.8.1
Table 3. Model metrics before and after YOLOX_s optimization.

Model                 AP50    AP50_95
YOLOX_s               0.88    0.64
YOLOX_s (improved)    0.91    0.66
Table 4. Comparison of the evaluation indicators of Multi-Object Tracking algorithms.

Algorithms                        MOTA (%)    IDF1 (%)    IDSWs
YOLOX_s and DeepSORT              78.8        73.5        900
YOLOX_s and improved DeepSORT     81.7        78.1        820
Improved YOLOX_s and DeepSORT     79.3        73.8        875
Proposed algorithm                84.7        80.3        542
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

