Article

Bidirectional Tracking Method for Construction Workers in Dealing with Identity Errors

School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(8), 1245; https://doi.org/10.3390/math12081245
Submission received: 2 March 2024 / Revised: 2 April 2024 / Accepted: 18 April 2024 / Published: 19 April 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract

Online multi-object tracking (MOT) techniques are instrumental in monitoring workers’ positions and identities in construction settings. Traditional approaches, which employ deep neural networks (DNNs) for detection followed by body similarity matching, often overlook the significance of clear head features and stable head motions. This study presents a novel bidirectional tracking method that integrates intra-frame processing, which combines head and body analysis to minimize false positives, and inter-frame matching, which controls ID assignment. By leveraging head information for enhanced body tracking, the method generates smoother trajectories with reduced ID errors. The proposed method achieved a state-of-the-art (SOTA) performance, with a multiple-object tracking accuracy (MOTA) of 95.191%, a higher-order tracking accuracy (HOTA) of 78.884% and an identity switch (IDSW) count of 0, making it a strong baseline for future research.

1. Introduction

The efficacy of worker tracking in construction safety management, both on-site [1] and off-site [2], has been widely acknowledged. The real-time monitoring of workers’ locations and identities enables managers to monitor working conditions closely, facilitating prompt interventions to prevent potential accidents [3], such as ensuring helmet usage, maintaining safe distances and avoiding collisions with vehicles [4]. By analyzing tracking data, valuable insights can be derived into workers’ productivity, intensity, movement patterns and other critical aspects, ultimately optimizing construction processes and labor allocation, enhancing efficiency, minimizing resource waste and reducing labor-intensive tasks for workers [5].
Nonetheless, complex construction scenarios pose challenges in feature extraction and identity matching, leading to identity-tracking errors. Four types of errors are illustrated in Figure 1: identity missing, identity increase, identity switch (IDSW) and identity transfer (IDTR) [6]. Identity missing occurs when false negatives (FNs) in deep learning models result in missed identifications, as demonstrated by worker ID = 2 in Figure 1a, who was occluded by “Bar shop” in frame #78. Identity increase is due to false positives (FPs), as exemplified by worker ID = 7 in Figure 1b, where an incorrect bounding box detection led to an additional ID. IDSW refers to a single person being assigned different IDs consecutively, as in Figure 1c, where worker ID = 5 had different IDs in frames #99 and #130. Conversely, IDTR occurs when two individuals are mistakenly assigned the same ID, as shown in Figure 1d, where worker ID = 3 switched to ID = 4 between frames #140 and #259. Both IDTR and IDSW stem from issues with association algorithms, which can be interpreted as “accepting false ID” and “rejecting truth ID”.
Current MOT methods have some intrinsic limitations:
(1) Relying on detection DNNs [7] limits the number of tracked individuals to the detected count per frame. Missed detections lead to the loss of a person’s re-identification (ReID) [8] information and, subsequently, of the motion estimation for the corresponding sequence, resulting in tracking ID loss.
(2) Current motion estimation techniques are confined to the Kalman filter (KF) [9], which may not converge mathematically. In cases of divergence, the KF’s estimates become meaningless and can cause incorrect motion predictions.
(3) Worker tracking, unlike pedestrian tracking, encounters more occlusion challenges in complex scenarios [10], posing higher ID error risks. While pedestrian tracking algorithms typically overlook ID-related metrics, worker tracking, which is crucial for safety management, necessitates their consideration, particularly in MOT challenge-like contexts [11,12].
(4) Head information, which is vital for worker identification, has often been overlooked [2]. Its use is pivotal due to workers’ upright posture and less obstructed heads, which can enhance whole-body tracking by leveraging bidirectional matching algorithms. Head tracking, as it aids in body detection and tracking, exploits this relationship to improve robustness.
The main innovation of this article is to provide a simple bidirectional method for tracking workers under complex construction scenarios. This bidirectional interaction improves the accuracy and robustness of body tracking while also addressing issues such as occlusion and environmental changes. The proposed method can significantly improve the safety and efficiency of construction sites by enabling the tracking of worker positions and movements with fewer ID errors.
The contributions of this research are as follows:
(1) Adopting head-tracking-to-body-tracking models in the construction field for the first time and using head motion state to correct whole-body movement.
(2) In the case where KF cannot be abandoned for movement estimation, modifying the body speed into head speed and making it comply with the basic “near-big, far-small” perspective principle.
(3) Changing the unidirectional method to bidirectional and letting tracking outputs correct current frame detections.
(4) Demonstrating the effectiveness of this method from a mathematical perspective.

2. Related Works

Vision-based MOT methods aim to identify target IDs (humans, vehicles, etc.) across successive frames. MOT usually consists of three subtasks [13]: object detection, feature extraction and data association.
  • Object detection (observation model): using a DNN such as Faster R-CNN [14], YOLO [15], CenterNet [16], or Transformer [17] to obtain the region of interest (ROI) [18], also called the bounding box.
  • Feature extraction (appearance model): employing a person re-identification (ReID) [8] network to extract a unidimensional vector from the ROI.
  • Data association (motion estimation and linear assignment): matching IDs [7] between the targets detected by the DNN in the current frame and those estimated by the KF from the former frame, by solving the similarity cost matrix with the Hungarian [19] or Jonker–Volgenant [7] algorithm.
According to the literature [2], instance segmentation has been posited to enhance detection performance, as exemplified by Xiao et al.’s [2] adaptation of the Mask R-CNN [20] for worker tracking. However, their study relies on datasets without annotations for heavily occluded individuals (occlusion rate above 70%), which precludes the possibility of improving the data association process, as previously noted by [10].
There are two basic paradigms for MOT [21]: tracking by detection (TBD) and joint detection and tracking (JDT).
  • TBD treats object detection as a separate detector, while feature extraction and data association are considered trackers [8]. TBD offers the advantage of flexibility in replacing modules with better DNNs or association methods. However, the detector and the tracker cannot enhance each other’s performance. If the detector produces missing or false bounding boxes, this will result in the tracker’s failure to track or accurately identify the target.
  • JDT integrates the detector and tracker into one unified network that can be trained end-to-end, such as Siamese [22] or Transformer networks [23]. JDT relies exclusively on appearance features, but the training of DNNs often demands better GPUs and takes significant time. For instance, TransTrack [24] demands 16.1 GB of GPU memory for inference, making it incompatible with the NVIDIA Tesla T4 (15 GB) on Google Colaboratory [25]. To provide a simple and training-free tracking method, TBD is the better choice.
Matching algorithms in TBD use various cues, including motion information [9], appearance features [26], velocity direction, confidence and height state [27], to calculate the different similarity distances (the cosine distance, the squared Mahalanobis distance, the intersection over union (IoU) distance, etc.) for the cost matrix in the linear assignment problem.
However, unlike pedestrian tracking on roads, the motion of workers often features long stays or frequent crossings [28], which may lead to the continuous growth of the covariance matrix in the KF. In practical scenarios, a large covariance matrix can lead to the acceptance of false IDs. Conversely, a small covariance matrix can result in the rejection of the true ID. Hence, it is essential to carefully consider the covariance matrix size to ensure accurate identification. Convergence failure of the covariance matrix implies a mismatch between the selected motion model of the KF and the actual motion behavior, yet the literature does not seem to offer an alternative to this linear estimator [9].
To provide a comprehensive overview of data association techniques in the MOT domain, we present the technical aspects in Table 1. Contemporary methods focus on retraining detection DNNs for improved input accuracy, while others refine Kalman filter state vectors. Given our emphasis on data association, object detection and feature extraction details are not elaborated upon in this section.
The SOTA tracking-by-detection methods, as depicted in Table 1, are unidirectional, relying solely on the outputs of detection DNNs. Consequently, their tracking capacity is inherently limited to the number of detected objects, and the tracking ID count never exceeds the detection ID count. Our proposed bidirectional tracking approach encompasses two types of interactions, namely, head-to-body and detector–tracker, which enable the mutual correction of tracking and detection results, enhancing the overall performance.

3. Motion Estimation with KF

3.1. Basic KF Formula

Motion data, including position and velocity, can be effectively extracted using a linear Kalman filter (KF) [10]. The KF estimates the state of a dynamic system from noisy measurements, employing a state equation and an observation equation, as depicted in Equations (1) and (2).
In Equation (1), the prior state vector estimate (denoted by $\bar{x}_t$) is predicted based on the posterior distribution from the previous frame (denoted by $\hat{x}_{t-1}$). The state vector $x_t = [x_c, y_c, a, h, v_x, v_y, v_a, v_h]^T$ is (8 × 1) in frame t and consists of the bounding box center $(x_c, y_c)$, the aspect ratio $a = w/h$, the height $h$ and the corresponding velocities. The transition matrix $F_t$ governs the system dynamics, while $Q_t$ (8 × 8) represents process noise, following a normal distribution.
$$\bar{x}_t = F_t \hat{x}_{t-1} + \omega_t, \qquad \omega_t \sim N(0, Q_t) \tag{1}$$
Equation (2) accounts for imperfect estimation, where the measurement $z_{t-1} = [x_c, y_c, a, h]^T$ (a 4 × 1 vector in frame t − 1) and the projected posterior distribution are subject to observation noise (denoted by $R_t$, a 4 × 4 matrix). The observation matrix H facilitates dimensional transformation between state and measurement vectors.
$$z_{t-1} = H_t \bar{x}_{t-1} + \nu_t, \qquad \nu_t \sim N(0, R_t) \tag{2}$$
The KF’s prediction for the covariance matrix Pt is shown in Equation (3). Since elements in the vector can vary widely, a covariance matrix is used to quantify the variation between elements. Pt (8 × 8) is updated regardless of the presence of a tracking ID, with a “dummy update” [9] occurring during tracking, which can lead to error accumulation and non-convergence.
$$\bar{P}_t = F_t \hat{P}_{t-1} F_t^{T} + Q_t \tag{3}$$
The Kalman gain, denoted in Equation (4), determines the influence of the current observation on state estimation. A higher value of $K_t$ indicates greater confidence in the detection, while a lower value suggests greater confidence in the prediction. Elements of $K_t$ (an 8 × 4 matrix) are confined to the range [0, 1] during normal operation.
$$K_t = \bar{P}_t H_t^{T}\left(H_t \bar{P}_t H_t^{T} + R_t\right)^{-1} = \hat{P}_t H_t^{T} R_t^{-1} \tag{4}$$
Finally, Equations (5) and (6) describe the posterior update of the state vector and covariance matrix, respectively, incorporating the prior estimate and current observation data.
$$\hat{x}_t = \bar{x}_t + K_t\left(z_t - H_t \bar{x}_t\right) = F_t \hat{x}_{t-1} + K_t\left(z_t - H_t F_t \hat{x}_{t-1}\right) \tag{5}$$
$$\hat{P}_t = \left(I - K_t H_t\right) \bar{P}_t \tag{6}$$
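For reference, a minimal NumPy sketch of one predict/update cycle corresponding to Equations (1)–(6) is given below; the matrix names mirror the symbols above, and the functions are a generic linear KF rather than the tracker’s exact implementation. The example values at the end are illustrative only.

```python
import numpy as np

def kf_predict(x_hat, P_hat, F, Q):
    """Equations (1) and (3): propagate the posterior state and covariance."""
    x_bar = F @ x_hat
    P_bar = F @ P_hat @ F.T + Q
    return x_bar, P_bar

def kf_update(x_bar, P_bar, z, H, R):
    """Equations (4)-(6): correct the prior with the measurement z."""
    S = H @ P_bar @ H.T + R                              # innovation covariance
    K = P_bar @ H.T @ np.linalg.inv(S)                   # Kalman gain, Eq. (4)
    x_hat = x_bar + K @ (z - H @ x_bar)                  # posterior state, Eq. (5)
    P_hat = (np.eye(P_bar.shape[0]) - K @ H) @ P_bar     # posterior covariance, Eq. (6)
    return x_hat, P_hat

# Example: 8-d constant-velocity state, 4-d measurement (illustrative values)
F = np.eye(8); F[:4, 4:] = np.eye(4)
H = np.eye(4, 8)
Q, R = np.eye(8) * 1e-2, np.eye(4) * 1e-1
x, P = np.zeros(8), np.eye(8)
x, P = kf_predict(x, P, F, Q)
x, P = kf_update(x, P, np.array([1.0, 2.0, 0.4, 3.0]), H, R)
```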

3.2. KF Divergence Proof

This section elucidates the mechanism behind the non-convergence of the covariance matrix in the KF. When the employed motion estimation model in the KF mismatches the actual dynamics, iterative updates lead to divergence. Given the practical challenge of acquiring a precise motion model for humans, the KF often encounters the following issue:
  • If the KF diverges → the elements in the covariance matrix $P_t$ become larger → the elements in the inverse covariance matrix $P_t^{-1}$ become smaller → the Mahalanobis distance between two different persons’ IDs becomes smaller than the threshold → leading to “accepting false ID” errors.
Consequently, addressing the KF divergence problem is crucial for reducing such identity errors. From the perspective of Bayesian probabilities, (1) can be expressed as $p(\bar{x}_t) = N(\mu, \bar{P}_t)$, and, given $\bar{x}_t$, (2) can be expressed as $p(z_t \mid \bar{x}_t) = N(H_t \mu, R)$; then, $p(\hat{x}_t \mid z_t) \propto p(z_t \mid \bar{x}_t) \times p(\bar{x}_t)$, so the corresponding equation for the covariance matrix is as follows (7):
$$\hat{P}_t^{-1} = H_t^{T} R^{-1} H_t + \bar{P}_t^{-1} \tag{7}$$
Assuming a simple constant-velocity model, the ground truth motion yields (8):
$$x_t^{*} = x_{t-1}^{*} + v^{*} = x_0^{*} + t v^{*}, \qquad z_t^{*} = x_t^{*} + \varepsilon_t = x_0^{*} + t v^{*} + \varepsilon_t \tag{8}$$
Assuming the estimation model is not the same as the ground truth yields (9):
$$x_t = x_{t-1} + v = x_0 + t v, \quad v \neq v^{*} \ \&\ x_0 = x_0^{*}, \qquad z_t = x_t + \varepsilon_t = x_0 + t v + \varepsilon_t \tag{9}$$
For one element $p_t$ in the covariance matrix, assuming there is no relevant prior knowledge of the noise and initial value at t = 0, let $r_0 = \sigma^2$ and $p_0 = \infty$. For the constant-velocity model, $f_t = h_t = 1$, and the following is obtained (10):
$$p_t^{-1} = \left(f_t\, p_{t-1}\, f_t^{T}\right)^{-1} + \left(h_t\, r\, h_t^{T}\right)^{-1} = p_{t-1}^{-1} + \sigma^{-2} = p_{t-2}^{-1} + \sigma^{-2} + \sigma^{-2} = \cdots = p_0^{-1} + t\,\sigma^{-2} = \frac{1}{\infty} + t\,\sigma^{-2} = t\,\sigma^{-2} \tag{10}$$
Equation (4) can be rewritten for one element of the vector, which yields (11):
$$k_t = \hat{p}_t\, h_t\, r_t^{-1} = \frac{\sigma^2}{t} \cdot \frac{1}{\sigma^2} = \frac{1}{t} \tag{11}$$
Substituting (2) into (5) as one element of the posterior state vector yields (12):
$$\begin{aligned}
\hat{x}_t &= f_t \hat{x}_{t-1} + k_t\left(z_t - h_t f_t \hat{x}_{t-1}\right) = (\hat{x}_{t-1} + v) + \frac{1}{t}\left(z_t - \hat{x}_{t-1} - v\right) = \frac{t-1}{t}(\hat{x}_{t-1} + v) + \frac{1}{t} z_t \\
&= \cdots = \frac{1}{t}\left(z_1 + z_2 + z_3 + \cdots + z_t\right) = \frac{1}{t}\left[(x_0 + v + \varepsilon_1) + (x_0 + 2v + \varepsilon_2) + \cdots + (x_0 + tv + \varepsilon_t)\right] \\
&= x_0 + \frac{1}{t}\cdot\frac{(1+t)\,t}{2}\,v + \frac{1}{t}\sum_{i=1}^{t}\varepsilon_i = x_0 + \frac{(t+1)}{2}\,v + \frac{1}{t}\sum_{i=1}^{t}\varepsilon_i
\end{aligned} \tag{12}$$
Under the constant-velocity assumption, the discrepancy between the ground truth and posterior estimates (13) exhibits an increasing trend with time. This observation implies that the error grows over time, leading to the divergence of the KF under constant perturbations. Consequently, when the employed motion estimation model deviates from the true dynamics, KF performance is significantly compromised, as it is prone to divergence issues. Given the complexity of worker movements, current research efforts, as exemplified by [7], have not been able to establish accurate motion models, so mitigating KF divergence still lacks a satisfactory mathematical solution. Therefore, it is necessary to find ways to bypass the divergence problem.
$$\begin{aligned}
\Delta &= x_t^{*} - \hat{x}_t = x_0^{*} + t v^{*} - x_0 - \frac{(1+t)}{2}v - \frac{1}{t}\sum_{i=1}^{t}\varepsilon_i = \frac{t-1}{2}\left(v^{*} - v\right) - \frac{1}{t}\sum_{i=1}^{t}\varepsilon_i \\
E[\Delta] &= \frac{t-1}{2}\left(v^{*} - v\right), \qquad D[\Delta] = \left[\frac{t-1}{2}\left(v^{*} - v\right)\right]^2 + \frac{\sigma^2}{t}
\end{aligned} \tag{13}$$
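The divergence argument can also be checked numerically. The sketch below runs the scalar filter of Equations (10)–(12) against measurements generated with a different true velocity; the velocities and noise level are arbitrary placeholders, and the discrepancy grows roughly linearly with t, as predicted by Equation (13).

```python
import numpy as np

rng = np.random.default_rng(0)
v_true, v_model, sigma = 2.0, 1.0, 1.0     # true velocity, assumed velocity, noise std
T = 200
x_true = np.arange(1, T + 1) * v_true      # x_t* = x0* + t v*, with x0* = 0
z = x_true + rng.normal(0.0, sigma, T)     # measurements, Eq. (8)

x_hat, p = z[0], sigma**2                  # initialise from the first measurement (p0 -> inf)
errors = []
for t in range(1, T):
    x_bar = x_hat + v_model                # predict with the (wrong) constant-velocity model
    p = 1.0 / (1.0 / p + 1.0 / sigma**2)   # information-form recursion, Eq. (10)
    k = p / sigma**2                       # Kalman gain, Eq. (11): k_t = 1/t
    x_hat = x_bar + k * (z[t] - x_bar)
    errors.append(x_true[t] - x_hat)

print(errors[9], errors[99], errors[198])  # discrepancy grows roughly linearly with t
```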

3.3. KF Processing Example

In this section, we demonstrate that despite the inherent limitations of the KF’s motion estimation, achieving convergence can still be accomplished through continuous and accurate detection measurements.
The state equation initialization requires understanding the motion model of the object, such as the uniformly moving, two-dimensional object in (14). Since the videos provide no physical frame interval time, Δt was assigned the value 1.0; any scale difference can be absorbed into the velocity terms.
$$x_t = x_{t-1} + v_{t-1}^{x}\,\Delta t, \qquad y_t = y_{t-1} + v_{t-1}^{y}\,\Delta t, \qquad v_t^{x} = v_{t-1}^{x}, \qquad v_t^{y} = v_{t-1}^{y} \tag{14}$$
Similar to (14), Equation (1), rewritten into the matrix form, can be presented as (15). At this point, our estimation motion model will be fixed.
$$x_t = \begin{bmatrix} x_t \\ y_t \\ a_t \\ h_t \\ v_t^x \\ v_t^y \\ v_t^a \\ v_t^h \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ a_{t-1} \\ h_{t-1} \\ v_{t-1}^x \\ v_{t-1}^y \\ v_{t-1}^a \\ v_{t-1}^h \end{bmatrix} = F_{8\times 8} \times x_{t-1} \tag{15}$$
The initialization of the observation Equation (16):
$$z_{t-1} = \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ a_{t-1} \\ h_{t-1} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix} \times \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ a_{t-1} \\ h_{t-1} \\ v_{t-1}^x \\ v_{t-1}^y \\ v_{t-1}^a \\ v_{t-1}^h \end{bmatrix} = H_{4\times 8} \times x_{t-1} \tag{16}$$
The initial noise covariance is empirical and exhibits a weight–height dependency. Vanilla DeepSORT assigns a position noise of 1/20 and a velocity noise of 1/160, assuming higher positional accuracy compared to velocity. Both the Q and R matrices are diagonal, with their values being determined by the height of each detection. It is commonly observed that larger bounding boxes imply higher noise levels, but concurrently, they indicate closer proximity to the camera, leading to less noisy images. This aspect has not been previously addressed in the literature. Consequently, we have revised the noise equation to achieve a more uniform and simplified treatment, as detailed in Equations (17) and (18).
Let a bounding box in frame t = 1 be
$$z_{t=1} = [x_c, y_c, a, h]^T = \mathrm{np.array}([1062.161303,\ 316.998036,\ 0.405503,\ 273.269825])$$
Then, the height is 273.269825, so $(273.269825 \times 1/20)^2 = 186.690993$.
Writing $w_p$ for weight_position and $w_v$ for weight_velocity, the process noise is
$$Q_{8\times 8} = \operatorname{diag}\!\big(\mathrm{np.square}([\,w_p h_t,\ w_p h_t,\ 10^{-2},\ w_p h_t,\ w_v h_t,\ w_v h_t,\ 10^{-5},\ w_v h_t\,])\big) \approx \operatorname{diag}(186.69,\ 186.69,\ 0.0001,\ 186.69,\ 2.92,\ 2.92,\ 0,\ 2.92) \tag{17}$$
$$R_{4\times 4} = \operatorname{diag}\!\big(\mathrm{np.square}([\,w_p h_t,\ w_p h_t,\ 10^{-2},\ w_p h_t\,])\big) \approx \operatorname{diag}(186.69,\ 186.69,\ 0.01,\ 186.69) \tag{18}$$
Once initialized, the F and H will not be changed in any frame during the calculations (independent of t). However, Q and R will be changed in each frame for the height difference of the bounding box. The noise can be adjusted according to the actual scenarios to achieve better effects.
For the covariance matrix P, we follow the convention presented in (19): P is also weight–height dependent, but with a 2-times magnification for position and a 10-times magnification for velocity.
$$P_{8\times 8} = \operatorname{diag}\!\big(\mathrm{np.square}([\,2 w_p h_t,\ 2 w_p h_t,\ 10^{-2},\ 2 w_p h_t,\ 10 w_v h_t,\ 10 w_v h_t,\ 10^{-5},\ 10 w_v h_t\,])\big), \qquad P_{8\times 8}^{t=1} \approx \operatorname{diag}(746.76,\ 746.76,\ 0.0001,\ 746.76,\ 291.70,\ 291.70,\ 0,\ 291.70) \tag{19}$$
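The noise initialization in Equations (17)–(19) can be reproduced with a few lines of NumPy. The sketch below assumes the DeepSORT-style weights of 1/20 and 1/160 quoted above and the example height h = 273.269825; it is meant only to verify the diagonal values.

```python
import numpy as np

weight_position, weight_velocity = 1.0 / 20, 1.0 / 160
h = 273.269825                                    # height of the example bounding box

std_pos = [weight_position * h, weight_position * h, 1e-2, weight_position * h]
std_vel = [weight_velocity * h, weight_velocity * h, 1e-5, weight_velocity * h]

Q = np.diag(np.square(std_pos + std_vel))                                      # Eq. (17)
R = np.diag(np.square(std_pos))                                                # Eq. (18)
P = np.diag(np.square([2 * s for s in std_pos] + [10 * s for s in std_vel]))   # Eq. (19)

print(Q[0, 0], Q[4, 4])   # ~186.69 and ~2.92
print(P[0, 0], P[4, 4])   # ~746.76 and ~291.70
```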
When the person’s detection is missing for 100 frames without an update of state, the same state vector [1062.161303, 316.998036, 0.405503, 273.269825] remains unchanged, while the elements in the covariance matrix can be magnified by more than 5000 times in (20) (3,894,274.937286/746.763973 = 5214.867).
$$\mathrm{only\_predict}:\ P_{8\times 8}^{t=100} = \begin{bmatrix} 3894274.93 & 0 & 0 & 0 & 43609.84 & 0 & 0 & 0 \\ 0 & 3894274.93 & 0 & 0 & 0 & 43609.84 & 0 & 0 \\ 0 & 0 & 0.010134 & 0 & 0 & 0 & 0.000001 & 0 \\ 0 & 0 & 0 & 3894274.93 & 0 & 0 & 0 & 43609.84 \\ 43609.84 & 0 & 0 & 0 & 583.40 & 0 & 0 & 0 \\ 0 & 43609.84 & 0 & 0 & 0 & 583.40 & 0 & 0 \\ 0 & 0 & 0.000001 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 43609.84 & 0 & 0 & 0 & 583.40 \end{bmatrix} \tag{20}$$
Once a subject’s detection is consistently stable and stationary at a specific position [1062.161303, 316.998036, 0.405503, 273.269825], the covariance matrix can be updated in each frame. The reduction in matrix elements, as observed in (21), is approximately 50% (363.719862/746.763973 = 0.48706), indicating that convergence is not reliant on motion and can occur even when the subject is completely stationary. The KF gain also exhibits a converging pattern in (22). Consequently, the non-convergence issue lies in the prediction stage, where an inadequate update occurs due to missing detections.
Each worker has two KFs, one for the head and the other for the body; likewise, different workers maintain different KFs. However, following the above calculations, with the presence of continuous detection values, even if the motion model is not good enough, the convergent P can also be achieved.
$$\mathrm{predict{+}update}:\ P_{8\times 8}^{t=100} = \begin{bmatrix} 363.71 & 0 & 0 & 0 & 40.06 & 0 & 0 & 0 \\ 0 & 363.71 & 0 & 0 & 0 & 40.06 & 0 & 0 \\ 0 & 0 & 0.001052 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 363.71 & 0 & 0 & 0 & 40.06 \\ 40.06 & 0 & 0 & 0 & 29.39 & 0 & 0 & 0 \\ 0 & 40.06 & 0 & 0 & 0 & 29.39 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 40.06 & 0 & 0 & 0 & 29.39 \end{bmatrix} \tag{21}$$
$$K_{8\times 4}^{t=2} = \begin{bmatrix} 0.86 & 0 & 0 & 0 \\ 0 & 0.86 & 0 & 0 \\ 0 & 0 & 0.019 & 0 \\ 0 & 0 & 0 & 0.86 \\ 0.21 & 0 & 0 & 0 \\ 0 & 0.21 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.21 \end{bmatrix}, \qquad K_{8\times 4}^{t=100} = \begin{bmatrix} 0.66 & 0 & 0 & 0 \\ 0 & 0.66 & 0 & 0 \\ 0 & 0 & 0.095 & 0 \\ 0 & 0 & 0 & 0.66 \\ 0.073 & 0 & 0 & 0 \\ 0 & 0.073 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.073 \end{bmatrix} \tag{22}$$
Therefore, it is possible to maintain the correct motion if there is a method to retain detections. The following proposed method can utilize head cues to ensure the approximate accuracy of the whole-body detections even with heavy occlusions.
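To reproduce the behaviour summarised in Equations (20) and (21), the sketch below runs a simplified 8-dimensional KF for 100 frames, once with prediction only and once with a prediction–update cycle on the same stationary measurement. It is a stand-in for the tracker’s actual filter, so the exact numbers may differ slightly from (20)–(22).

```python
import numpy as np

def build_kf(h):
    F = np.eye(8); F[:4, 4:] = np.eye(4)          # constant-velocity model, Eq. (15)
    H = np.eye(4, 8)                              # observation matrix, Eq. (16)
    wp, wv = 1.0 / 20, 1.0 / 160
    std_pos = [wp * h, wp * h, 1e-2, wp * h]
    std_vel = [wv * h, wv * h, 1e-5, wv * h]
    Q = np.diag(np.square(std_pos + std_vel))
    R = np.diag(np.square(std_pos))
    P = np.diag(np.square([2 * s for s in std_pos] + [10 * s for s in std_vel]))
    return F, H, Q, R, P

z = np.array([1062.161303, 316.998036, 0.405503, 273.269825])   # stationary measurement
F, H, Q, R, P0 = build_kf(z[3])

# Case 1: prediction only for 100 frames (missing detections) -> covariance blows up
P = P0.copy()
for _ in range(100):
    P = F @ P @ F.T + Q
print("only_predict  P[0,0]:", P[0, 0])        # grows by thousands of times, cf. Eq. (20)

# Case 2: prediction + update with the same detection each frame -> covariance shrinks
P = P0.copy()
for _ in range(100):
    P = F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    P = (np.eye(8) - K @ H) @ P
print("predict+update P[0,0]:", P[0, 0])       # settles to a few hundred, cf. Eq. (21)
```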

4. Methods

The bidirectional method can make the detector and tracker interact with each other, instead of only using the outputs of the previous step without changes.
As depicted in Figure 2, the proposed method consists of the detector, imputers, refiners and tracker; among them, the first three items are detailed and discussed in [10] while the tracker is our concern in this paper. Whereas the unidirectional method only contains the body_tracker, our tracker contains four parts:
  • Head_tracker for tracking heads (in Section 4.1).
  • Body_tracker for tracking bodies (in Section 4.1).
  • Intra-frame processing to delete false positives of heads and bodies (in Section 4.2).
  • Inter-frame matching to find the pairing relationship between heads and bodies (in Section 4.3).

4.1. Head Tracker and Body Tracker

The head tracker and body tracker are responsible for associating the head or body with detections, respectively.
The implementation of the head_tracker is basically the same as that of the body_tracker, which contains two matching steps (shown in Algorithm 1), although the body_tracker adopts the velocity of the head rather than its own:
(1) The first match: the cascade matching with the appearance vector’s cosine distance and the bounding box center’s Euclidean distance.
(2) The second match: the bounding box’s IoU distance matching for the remaining unmatched detections and tracks.
As shown in Figure 3, there are seven tracks of frame t − 1 and five detections of frame t; after the first match, (1-B), (2-A) and (4-C) were matched; the remaining (3-D) and (6-E) were successfully obtained in the second match. The tracking ID and detection ID are represented by the rows and columns in the cosine distance cost matrix. To match the tracking ID with the current frame, the Hungarian algorithm (Algorithm 1) is used. If the value is below the cosine similarity threshold of 0.2 and meets the tracking ID criteria, it is considered a successful match.
We modify the trackers in four ways:
(1) The Mahalanobis distance is modified to the Euclidean distance (23) in the KF, to avoid the influence of covariance matrix divergence. $x_t = [x_c, y_c, a, h, v_x, v_y, v_a, v_h]^T$ is the predicted vector of frame t over the state distribution (8-d) from frame t − 1, and $y_t = [x_c, y_c, a, h, 0, 0, 0, 0]^T$ refers to the detector’s measurements in frame t.
$$\mathrm{Mahalanobis\ distance} = \sqrt{\left(x_t - y_t\right)^{T} P^{-1} \left(x_t - y_t\right)}, \qquad \mathrm{Euclidean\ distance} = \sqrt{\left(x_t - y_t\right)^{T} \left(x_t - y_t\right)} \tag{23}$$
(2) The head trajectory can be used as a reference to improve body accuracy. Because the head has limited motion and a fixed size, its use helps reduce errors in direction detection, ensuring that the tracker and imputer are not misled by incorrect detection results.
(3) The cost matrix uses a variable threshold (2 × head width), which reduces the number of for-loops to two. The vanilla DeepSORT cost matrix utilizes a fixed threshold of 9.4877, but variable thresholds maintain the visual principle of “near-big, far-small”. Therefore, a smaller threshold is crucial when a worker is far from the camera, has a small head size, and shows minimal movement.
(4) Additional constraints to add a new body ID. The unmatched detections of vanilla DeepSORT were directly added as new IDs, which might result in ID inflation. In our method, the confidence level must be at least approximately 0.8, or no less than 0.95 for detections in the regions near the image edges. This effectively filters out most false positives.
Algorithm 1: Two Stages of the Matching Algorithm for the Trackers
Input: M ← number of workers; A ← number of frames;
Two detection sets:
D = {d_i^j | 1 ≤ i ≤ M; 1 ≤ j ≤ A}; D_remain = {d_{i,remain}^j | 1 ≤ i ≤ M_remain; 1 ≤ j ≤ A};
Output: Two track sets:
T_matched = {t_i^j | 1 ≤ i ≤ N_matched; 1 ≤ j ≤ A}; T_un_matched = {t_i^j | 1 ≤ i ≤ N_un_matched; 1 ≤ j ≤ A};
1: for frame j in A do:
2:    D^j = {d_1^j, d_2^j, d_3^j, …, d_M^j}    /*observations from the detector and ReID at j*/
      x^j = {x_1^j, x_2^j, x_3^j, …, x_N^j}    /*posterior state from KF by j − 1*/
3:    /*The first match: assignment to lists of matches, unmatched_tracks, unmatched_detections:*/
      C^j ← C_cosine(D^j, x^j) + C_Euclidean(D^j, x^j)    /*cost matrix at j*/
      Use the linear Hungarian algorithm to solve C^j.
      if (c_cosine > 0.2 and c_Euclidean > 2 × head_width):
         T_matched ← t_i^j that matched with d_i^j
         T_un_matched ← t_i^j that is not matched with d_i^j
         D_remain ← d_i^j that is not matched with t_i^j
4:    /*The second match: for the remaining detection IDs in D_remain*/
      C^j ← C_cosine(D_remain^j, x_un_matched^j) + C_{1−IoU}(D_remain^j, x_un_matched^j)
      Use the linear Hungarian algorithm to solve C^j.
      if (c_{1−IoU} > 0.7) or (match_times = 0 and 0.1 ≤ c_cosine ≤ 0.2):
         T_matched ← t_i^j that matched with d_{i,remain}^j
         T_un_matched ← t_i^j that is not matched with d_{i,remain}^j
         D_remain ← d_{i,remain}^j that is not matched with t_i^j
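A compact sketch of the first matching stage, using SciPy’s Hungarian solver (linear_sum_assignment), is given below. The gating rule (cosine distance below 0.2 and center distance within 2 × head width) follows the description above, while the array shapes, variable names and the 512-d feature dimension are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def first_match(track_feats, det_feats, track_centers, det_centers, head_widths,
                cosine_thresh=0.2):
    """First matching stage: cosine distance of appearance vectors plus Euclidean
    distance of box centers, solved with the Hungarian algorithm. A pair is accepted
    only if both distances stay below their thresholds."""
    # cosine distance between L2-normalised appearance vectors
    tf = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    df = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cos_cost = 1.0 - tf @ df.T
    # Euclidean distance between predicted track centers and detection centers
    euc_cost = np.linalg.norm(track_centers[:, None, :] - det_centers[None, :, :], axis=2)

    rows, cols = linear_sum_assignment(cos_cost + euc_cost)
    matches = []
    unmatched_tracks, unmatched_dets = set(range(len(track_feats))), set(range(len(det_feats)))
    for r, c in zip(rows, cols):
        # variable gate: 2 x head width instead of a fixed Mahalanobis threshold
        if cos_cost[r, c] <= cosine_thresh and euc_cost[r, c] <= 2 * head_widths[r]:
            matches.append((r, c))
            unmatched_tracks.discard(r)
            unmatched_dets.discard(c)
    return matches, sorted(unmatched_tracks), sorted(unmatched_dets)
```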
In this study, we utilized a Market-1501-based [34] 512-d vector for appearance information, which performed better than the original 128-d ReID vector of DeepSORT (see Appendix A), because a larger feature dimension is appropriate for small objects such as heads.

4.2. Intra-Frame Processing

Intra-frame processing is used to reduce false positives and keep true positive detections, which can be treated as the pre-processing stage of the tracker’s inputs, as shown in Figure 2. Intra-frame processing contains the delete process and store process.
First, we need to filter out false positives of detected heads. Compared to height, human heads have smaller differences in size, so heads closer to the camera usually have clearer visual features and greater confidence scores than those that are farther away. Therefore, following the perspective principle of “near-big, far-small” and the detection confidence score, it is possible to filter out most of the false positive detected heads, as shown in Figure 4.
The head closest to the camera has the highest priority, and any heads with a confidence score of no less than 99% will be considered as true positives directly. After sorting by center vertical coordinates, we determine whether to delete the corresponding item based on the difference in height values and confidence scores between adjacent head boxes, as shown in Algorithm 2.
Algorithm 2: Filter out False Positives of Detected Heads
Input: M ← number of detected heads; N ← number of remaining heads; j = frame;
      D_head = {d_i^j | 1 ≤ i ≤ M; 1 ≤ j ≤ A}; d_i^j = (x_c, y_c, w, h, confidence, class_id = 1), i = 1 … M
Output: D_head_del = {d_i^j | 1 ≤ i ≤ N; 1 ≤ j ≤ A}; D_head_remain = {d_i^j | 1 ≤ i ≤ N; 1 ≤ j ≤ A};
1: D_head ← sort(d_i^j, y_c);          /*sorted by head center y_c*/
2: while i + 1 < M − 1 do:
      Δ_i ← d_i^j − d_{i+1}^j;
      if Δ_i < −5:                      /*two heads are very close, maybe FP here*/
         if d_i^j[4] ≥ 99%:             /*confidence ≥ 0.99*/
            D_head_del ← d_i^j          /*delete i*/
         elif d_{i+1}^j[4] ≥ d_i^j[4]:
            D_head_del ← d_{i+1}^j      /*delete i + 1*/
         else:
            D_head_del ← d_i^j and d_{i+1}^j    /*delete i and i + 1*/
      i += 1
3: D_head_remain ← D_head − D_head_del
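A possible Python rendering of the head-filtering idea (sort by the vertical center, then resolve near-duplicate heads by confidence) is sketched below. The array layout, thresholds and tie-breaking rule are illustrative simplifications of Algorithm 2, not the exact implementation.

```python
import numpy as np

def filter_head_fps(heads, close_thresh=5.0, keep_conf=0.99):
    """Simplified reading of Algorithm 2. `heads` is an (M, 6) array of
    (xc, yc, w, h, confidence, class_id). Heads are sorted by their center y
    (heads lower in the image are closer to the camera); when two adjacent
    heads have nearly the same height, the lower-confidence one is treated
    as a false positive, unless both are highly confident."""
    heads = heads[np.argsort(heads[:, 1])]        # sort by center yc
    deleted = set()
    for i in range(len(heads) - 1):
        if abs(heads[i, 3] - heads[i + 1, 3]) < close_thresh:    # near-identical heights
            if heads[i, 4] >= keep_conf and heads[i + 1, 4] >= keep_conf:
                continue                                          # both trusted, keep both
            deleted.add(i if heads[i, 4] < heads[i + 1, 4] else i + 1)
    keep = [i for i in range(len(heads)) if i not in deleted]
    return heads[keep]
```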
Second, we need to filter out false positives of detected bodies. Body detections are prone to instability when encountering occlusions or blurring. IoU and keypoints will be taken as conditions to delete unstable bodies in Algorithm 3; first, calculate the IoU between each pair of bounding boxes and if the value is greater than the given threshold (iou_threshold = 0.6), the lower confidence box is deleted; second, check the number of keypoints to ensure at least two body joints are in the box.
Algorithm 3: Filter out False Positives of Detected Bodies
Input: M ← number of detected bodies; N ← number of remaining bodies; j = frame;
      D_body = {d_i^j | 1 ≤ i ≤ M; 1 ≤ j ≤ A}; d_i^j = (x_c, y_c, w, h, confidence, class_id = 0), i = 1 … M
Output: D_body_del = {d_i^j | 1 ≤ i ≤ N; 1 ≤ j ≤ A}; D_body_remain = {d_i^j | 1 ≤ i ≤ N; 1 ≤ j ≤ A};
1: for i, k in M do:
      IoU_i ← Compute_IoU(d_i^j, d_k^j);
      if IoU_i ≥ 60%:                               /*two bodies are very close, maybe FPs here*/
         D_body_del ← the one of (d_i^j, d_k^j) with the lower confidence   /*lower confidence deleted*/
      i += 1
      k += 1
2: for i in length(keypoints) do:
      /*if the body has no more than two effective keypoints in total*/
      if torch.sum(one_key[:, −1] ≥ 0.05) < 2:
         D_body_del ← d_i^j                          /*delete body i*/
      i += 1
3: D_body_remain ← D_body − D_body_del
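The body-filtering step can be sketched as follows. The box format (x1, y1, x2, y2, confidence, class_id) and the keypoint array layout are assumptions for illustration, while the thresholds (IoU ≥ 0.6, keypoint score ≥ 0.05, at least two joints) follow the description above.

```python
import numpy as np

def filter_body_fps(bodies, keypoints, iou_threshold=0.6, kp_conf=0.05):
    """Sketch of Algorithm 3: drop the lower-confidence box of any highly
    overlapping pair, then drop boxes with fewer than two confident keypoints.
    `bodies` is (M, 6) in (x1, y1, x2, y2, confidence, class_id);
    `keypoints` is (M, K, 3) in (x, y, score)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    deleted = set()
    for i in range(len(bodies)):
        for k in range(i + 1, len(bodies)):
            if iou(bodies[i], bodies[k]) >= iou_threshold:
                deleted.add(i if bodies[i, 4] < bodies[k, 4] else k)   # lower confidence deleted
    for i in range(len(bodies)):
        if np.sum(keypoints[i][:, -1] >= kp_conf) < 2:                 # fewer than two joints
            deleted.add(i)
    keep = [i for i in range(len(bodies)) if i not in deleted]
    return bodies[keep]
```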
After these filtering steps, the remaining heads and bodies are passed to the store process. To handle potential crowd scenarios where multiple heads may occupy a single body location, we employed Python dictionaries (head_sequence_set and body_sequence_set) to store data. This approach allows for persistent tracking even when individuals enter or exit the field of view. Additionally, this method mitigates indexing issues that may arise due to variable head–body pairings, such as the example provided:
body_head_matched_ids = [(3, 0), (0, 1), (2, 2), (None, 3), (4, None), (1, 4)]
By deferring matching to the inter-frame tracking stage, our method can utilize appearance features in addition to intra-frame location data, thus avoiding the potential identity confusion caused by false positives, as shown in Figure 4.
In summary, the intra-frame processing stage employs a combination of confidence scores, size principles and keypoint analysis to refine head and body detections, reducing false positives and preparing the data for more accurate inter-frame tracking using the ResNet-50-based [10] head-integrated keypoint R-CNN model.

4.3. Inter-Frame Matching

Since head detection can aid in body detection [10], head tracking can likewise assist in body tracking, as described in this section. Because a less occluded head has an inclusion relationship with the entire body, the corresponding tracking can be achieved through bidirectional matching algorithms.
Inter-frame matching refers to associating the heads and bodies in consecutive frames. Unlike the traditional matching procedure of “head to head” or “body to body” in vanilla DeepSORT, we have expanded the scope of association to “head to body”.
Because the head and body do not belong to the same recognition category (class_id = 0 for body; class_id = 1 for head), it is meaningless to directly use the appearance features of the head to match a body; however, this relationship is very useful for filtering newly added tracking IDs, as shown in Algorithm 4. This process provides additional checking conditions for new IDs. In the case of mixed tracking, the correspondence between the body and head does not need to be calculated in every frame but only during the track ID initialization frame.
The head feature is the critical factor in identifying a worker in practical construction scenarios, which may lead to dilemmas in the annotation of tracking IDs during manual labeling. Because the head is relatively small and fixed in size, we used the Euclidean distance of the bounding box center (3 × width) and a higher confidence threshold (95%) to restrict newly added head tracking IDs in Algorithm 4. Based on the assumption that head tracking is more accurate than body tracking, a newly added body should not lie around already-existing heads, which is enforced with an IoU threshold of 0.8.
Whenever there is a newly added ID, we calculate the elements z of the cost matrix as (24) and use Algorithm 4 to find the corresponding head or body ID.
$$z = 1 - \mathrm{modified\_IoU} \times \mathrm{confidence} \tag{24}$$
$$\mathrm{modified\_IoU} = \frac{\left|\mathrm{body} \cap \mathrm{head}\right|}{\min\left(\left|\mathrm{body}\right|,\ \left|\mathrm{head}\right|\right)} \tag{25}$$
where the modified_IoU in (25) is no longer focused on the union area of the body and head, but on the smaller area. This type of calculation tends to retain a high-overlap “head to body” pair, and the low-confidence pair will lose priority to an extent.
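A direct translation of Equations (24) and (25) into Python is sketched below; the (x1, y1, x2, y2) box format is assumed for illustration.

```python
def modified_iou(body, head):
    """Equation (25): intersection over the *smaller* area, so a head fully
    contained in a body box scores 1.0 regardless of the body's size.
    Boxes are (x1, y1, x2, y2)."""
    x1, y1 = max(body[0], head[0]), max(body[1], head[1])
    x2, y2 = min(body[2], head[2]), min(body[3], head[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_body = (body[2] - body[0]) * (body[3] - body[1])
    area_head = (head[2] - head[0]) * (head[3] - head[1])
    return inter / (min(area_body, area_head) + 1e-9)

def cost_element(body, head, confidence):
    """Equation (24): cost z used in the head-to-body assignment matrix."""
    return 1.0 - modified_iou(body, head) * confidence
```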
Algorithm 4: Matching of “Head to Body” across Frames
Input:
MT ← number of tracked heads; NT ← number of tracked bodies; j = frame;
T_body = {t_{i,body}^j | 1 ≤ i ≤ NT; 1 ≤ j ≤ A}; t_{i,body}^j = (x_c, y_c, w, h, confidence, class_id = 0), i = 1 … NT
T_head = {t_{i,head}^j | 1 ≤ i ≤ MT; 1 ≤ j ≤ A}; t_{i,head}^j = (x_c, y_c, w, h, confidence, class_id = 1), i = 1 … MT
L_{body,head} = {(t_{i,body}^j, t_{a,head}^j) | 1 ≤ i ≤ NT; 1 ≤ j ≤ A; 1 ≤ a ≤ NT};
Output: L_{body,head}
1: if ID_head ∉ T_head:
      for k in L_{body,head} do:
         /*find the closest head, and calculate the Euclidean distance of the center*/
         if Euclidean(ID_head, t_{a,head}^j) > 3 × w and ID_head[4] > 0.95:
            T_head ← ID_head
2: if ID_body ∉ T_body:
      for k in L_{body,head} do:
         /*find the closest body, and calculate the IoU*/
         if IoU(ID_body, t_{a,head}^j) > 0.8 and ID_{a,head}^j ≠ None:
            T_body ← ID_body
3: if T_head ← ID_head or T_body ← ID_body:      /*matching of newly added head–body pairs*/
      C^j ← cost_matrix, C^j ← 1 − modified_IoU × confidence   when C^j ≤ 1.0;
                          C^j ← 100000                          when C^j > 1.0;
      Use the linear Hungarian algorithm to solve C^j
      # row_indices, col_indices = linear_assignment(cost_matrix)
      L_{body,head} ← (ID_body, ID_head)
By adding head tracking, the weakness of relying solely on body tracking can be overcome. Figure 5 (the frame #37 in video-3) shows the effective processing of inter-frame matching from left to right images: the left image is the detector’s outputs, while the middle image is the tracker’s outputs and the right image is the final results. Head and body IDs can obtain the pair-wise relationship and stable trajectories successfully. The newly added head ID = 17 in the middle image (confidence = 92.2% in the left image) was deleted as a false positive association and did not affect the final tracking outputs.
In vanilla DeepSORT, the matching algorithm is restricted to two frames (t − 1 and t), following the Markov property of the KF in (2), whereas earlier frames (1, 2, …, t − 2) are forgotten. Unmatched head or body detections can then be directly treated as new track IDs in the next frame; consequently, if the ReID feature vector of the current frame is degraded by occlusion or blurring, the better feature vectors from previous frames are overwritten and cannot be recovered. Using the head as a benchmark maintains correspondence with past frames when the body features of the current frame are poorly updated; the “head to body” matching thus reduces possible ID errors effectively.
Note: to clearly understand how head tracking aids in body tracking, please refer to Appendix B for the calculation details.

4.4. Evaluation Metrics

CLEAR MOT [35] proposed the IDF1, MOTA and MOTP metrics in (26)–(28).
$$IDF1 = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN} \tag{26}$$
$$MOTA = 1 - \frac{\sum_i \left(IDFP_i + IDFN_i + IDSW_i\right)}{\sum_i GT_i} \in \left(-\infty,\ 1\right] \tag{27}$$
$$MOTP = \frac{\sum_{i,t} d_{i,t}}{\sum_i c_i} \in \left[0,\ 1\right] \tag{28}$$
where True Positive ID (IDTP) is the number of correctly assigned IDs throughout the entire video; False Positive ID (IDFP) is the number of incorrect IDs; False Negative ID (IDFN) is the number of missed IDs; IDSW is the number of identity switches of one object; GT is the number of manually labeled ground truth boxes; i is one frame in the video set; t is a worker identity; and d refers to the distance between the prediction box and the ground truth box, which may be assigned as the IoU distance.
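To make Equations (26) and (27) concrete, the short sketch below computes IDF1 and MOTA from hypothetical ID-level counts; the numbers are illustrative only and do not correspond to any experiment in this paper.

```python
def idf1(idtp, idfp, idfn):
    """Equation (26)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

def mota(idfp, idfn, idsw, gt):
    """Equation (27), with the counts already summed over all frames."""
    return 1.0 - (idfp + idfn + idsw) / gt

# Hypothetical counts for a short sequence (for illustration only)
print(idf1(idtp=950, idfp=30, idfn=50))          # ~0.96
print(mota(idfp=30, idfn=50, idsw=0, gt=1000))   # 0.92
```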
IDTP does not have the same value as TP. The former emphasizes maintaining ID consistency, whereas TP primarily focuses on bounding box IoU. IDTP can only be obtained after the tracking association calculation is complete, whereas TP can be obtained during the earlier detection stage. The same is true for IDFP and FP, and IDFN and FN.
IDF1 gauges the proportion of correctly identified detections to the mean count of ground truth and computed detections. It offers a fair assessment of all trackers based on their identification precision and recall through their harmonic mean [36].
MOTA measures the combined errors of IDFNs, IDFPs and IDSW. However, IDFPs have a greater influence than IDFNs and IDSW. For instance, if the GT tracking target has an ID error, it results in an increment of IDFP + 1. On the other hand, if the GT does not have a corresponding detection, it is considered missing, resulting in IDFN + 1. Changes in IDFPs will lead to an increase in IDFNs but not vice versa. It is important to note that MOTA does not include a measure of localization error.
To compensate for this, MOTP measures a tracker’s ability to accurately estimate object positions, regardless of its proficiency in recognizing object configurations and maintaining consistent trajectories. It primarily evaluates positional accuracy, which is more relevant to the detector’s performance than the overall performance of the tracker. As a result, MOTChallenge removes MOTP as an evaluation metric.
HOTA [6] balances the effect of three accuracies: detection, association and localization. Equations (29) and (30) decompose HOTA into separate DetA and AssA scores.
$$HOTA_\alpha = \sqrt{\frac{\sum_{c \in \{TP\}} A(c)}{\left|TP\right| + \left|FN\right| + \left|FP\right|}} = \sqrt{DetA_\alpha \cdot AssA_\alpha} \tag{29}$$
$$HOTA = \int_0^1 HOTA_\alpha \, d\alpha \approx \frac{1}{19} \sum_{\alpha \in \{0.05,\ 0.1,\ \dots,\ 0.95\}} HOTA_\alpha \tag{30}$$
where α is the localization threshold of IoU, with 19 distinct values (0.05 to 0.95 in 0.05 intervals) for the integral approximation, and c is one worker ID. More details of the HOTA formula can be found in [6].
HOTA identified two errors in ID distinction for tracking: ID switches (IDSWs) and ID transfers (IDTRs). IDSW is designed for continuous trajectories, while IDTR is designed for intermittent trajectories.
IDSW calculates the frequency at which a tracked trajectory switches its matched GT identity. It is worth mentioning that this definition is applicable only while the target is within the field of view. IDSW occurs when, in consecutive frames, the same worker has varying IDs, but it is restricted to TP and not FN. If a trajectory abruptly vanishes, it will not be counted as IDSW.
IDTR happens when one worker exits the frame and another worker enters the subsequent frame, but with the former’s ID. IDTR is common in scenarios involving frequent entry and exit. When worker A leaves frame-i, worker B enters frame-i + 1. If worker B is mistakenly recognized as worker A in frame-i + 1, worker A’s ID will be transferred to worker B, but IDSW will not be added since worker A was interrupted. Therefore, IDTR is also considered an undesirable behavior during tracking, which MOTA could not consider.
Fortunately, HOTA can effectively resolve IDTR errors [6]. The IDSW in the CLEAR MOT evaluation only measures short-term tracking and cannot accurately assess long-term tracking. HOTA can evaluate the global long-term tracking by comparing FN and FP in the matched trajectories.

5. Results and Discussion

The proposed tracking method achieved an IDF1 of 97.609%, an MOTA of 95.191% and an HOTA of 78.884% on the testing dataset of nine videos [2] with a total of 41 workers. Additionally, the method achieved an IDSW of 0, even in scenarios with heavy occlusions. To the best of our knowledge, this is the first time an IDSW value has been reported in research on tracking workers.
The model is implemented using Python 3.8.8 and PyTorch 1.8.0. The method is based on the GitHub code [37], containing KF and Hungarian algorithms. The computer has one NVIDIA RTX 3080 GPU, an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz and one 32 GB RAM with Win10. The testing dataset of nine videos is the same as [2,38], with a resolution of 1920 × 1080 and 30 FPS; more details can be found in [2].
The ground truth data with frame-ID, track-ID and bounding box in a four-tuple (x, y, width, height) were manually annotated as follows: <frame>, <id>, <bb_x>, <bb_y>, <bb_width>, <bb_height>, <conf = 1>, <x = −1>, <y = −1>, <z = −1>. Evaluation was performed with TrackEval [39].

5.1. Quantitative Results

Table 2 presents a comprehensive analysis of the performance evaluation outcomes for the nine videos across seven metrics: MOTA, IDF1, HOTA, AssA, AssRe, AssPr and LocA. It is important to note that the aggregated value in the final row does not result from a simple average of individual video scores due to variations in total frames; instead, the cumulative frames of all videos are considered.
The method consistently demonstrates stability and reliability across diverse scenarios and conditions, as shown in Figure 6. The challenge for tracking algorithms, particularly in video 3, with the highest number of workers, is exemplified by HOTA reaching 81.446%, which is marginally higher than the combined score of 78.884%, thereby highlighting the effectiveness of the crowd-tracking method.

5.2. Comparison of Other SOTA Methods

Pedestrian tracking on roads presents a relatively straightforward task from a programming standpoint, despite some inherent complexities and uncertainties in traffic environments. In contrast, construction sites pose a greater challenge due to their dynamic nature. The simplicity of pedestrians’ movement patterns further contributes to the relative ease of tracking on roads. Table 3 highlights eight state-of-the-art pedestrian tracking methods employed in recent years, particularly in the context of worker tracking. However, these methods exhibit considerable disparities in application domains, tracked objects and technical challenges. Consequently, selecting the most suitable tracking techniques for practical applications is crucial, as it ensures alignment with the diverse needs and requirements of different scenarios.
The codes of ByteTrack [7], Deep OC_SORT [26], BoTSORT [29], OC_SORT [9] and Strong_SORT [30] are from [40]. These five methods were retrained on 26.6 K images of CrowdHuman, MOT17, Cityperson and ETHZ datasets [7]. TransTrack [24] and UniTrack [31] are in the authors’ original codes, and DeepSORT uses an implementation from [41]. These three methodologies rely on vanilla YOLOX-x (yolox_x.pth, 756 MB).
As displayed in Table 3, YOLOX-based DeepSORT achieved the highest HOTA and MOTA scores, at 68.418% and 69.914%, respectively, but exhibited the lowest FPS (1.77) and the worst IDSW score (21). In contrast, ByteTrack demonstrated the best performance, with an IDF1 score of 81.829%, the fastest FPS (9.09) and the lowest IDSW score (5). Despite these achievements, none of these pedestrian tracking methods surpasses our proposed bidirectional tracking method.

5.3. Discussion

Compared with the unidirectional method, the bidirectional method has the following advantages:
(1)
Low dependency of detection DNN
The bidirectional approach allows the tracker to modify the results of the detector in reverse with imputers and refiners. Therefore, poor detections (ResNet-50 [10]) in the current frame will be modified into acceptable outputs and saved into head or body sequence sets for the next frame’s prediction. Thus, these poor detections will not greatly change the value of the covariance matrix.
(2)
Application of head tracking aid in body tracking
Head cues’ distinctive attributes, including low variation, visibility and reduced occlusion, contribute to their effectiveness in tracking tasks. In tracking algorithms, the utilization of head appearance features for body tracking marks a novel approach. Worker tracking, far from being a mere metric comparison, demonstrates its practicality in complex situations like the occlusions depicted in Figure 7 and Figure 8. Despite individual differences in head shape, the relatively consistent size of heads compared to other body parts ensures more reliable appearance features compared to limbs.
In Figure 7, from frames #4 to #14, workers ID = 2 and ID = 4, belonging to the same category, move in opposite directions. In frame #16, ID = 2 loses the detection of the head and body until frames #19 and #22, respectively. After calculation, ID = 2 maintained continuous tracklets during the overlap with ID = 4, demonstrating the method’s effectiveness in handling intra-class occlusions.
In Figure 8, due to the obstruction of the fence category, only a few appearance features are visible in the worker category ID = 0. Frame #79 demonstrates that the scaled-down bounding box of ID = 0 was repaired to a normal height and width, indicating that inter-class occlusions can be fixed even during bad detections.
(3)
Avoidance of the impact of KF divergence issues
The prediction of the whole body’s speed can now be replaced by the speed of the head, instead of solely relying on KF prediction. The head’s speed can serve as a reference to refine body velocity predictions in the KF. If body bounding box coordinates are inaccurate, the correct head coordinates can be employed to correct them, preventing the tracker and imputer from being misled by faulty detections.
For instance, in the top row of Figure 9, the body detector’s output for the left worker’s full-body box in frames 112 to 114 shows an unexpected increase in the x-coordinate (ID 4). The expected trend is a decrease as the worker moves leftward, but the detection error pushes the x-coordinates to the right (upper row: 667.91 → 668.17 → 666.25). Conversely, the head’s x-coordinate accurately reflects the leftward movement (lower row: 668.50 → 653.50 → 648.00). By utilizing head-based linear interpolation, the x-coordinate of the full-body bounding box can be corrected to the correct value (lower row: 666.65 → 665.5 → 660.5), ensuring accurate tracking.
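The following minimal sketch illustrates the idea of advancing the body coordinate with the head’s displacement when the body detection drifts. It is a simplified stand-in for the head-based linear interpolation in the actual pipeline (which also blends the KF state), so it will not reproduce the exact corrected values quoted above; the numbers used are generic placeholders.

```python
def correct_body_x(prev_body_x, prev_head_x, curr_head_x):
    """Advance the body x-coordinate by the head's displacement instead of
    trusting a drifting body detection (head and body move rigidly together)."""
    return prev_body_x + (curr_head_x - prev_head_x)

# Illustrative values only: a worker moving left, with the body detector drifting
# right while the head moves left as expected.
print(correct_body_x(prev_body_x=300.0, prev_head_x=320.0, curr_head_x=310.0))  # 290.0
```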
(4)
Focus on metrics performance with ID errors
Addressing identity errors can be efficiently achieved with meticulous handling of new head or body IDs, as outlined in the intra-frame processing and inter-frame matching sections. By not immediately assigning unmatched detections to new identities, these methods can significantly reduce false positives (FPs) and false negatives (FNs), as demonstrated by Algorithms 2–4. These heuristic algorithms streamline the process by avoiding the need for retraining and effectively leveraging head information for identity association. By attentively managing new IDs, they contribute to enhanced tracking system performance, presenting a simple, effective and cost-efficient solution. This approach is particularly suitable for practical applications due to its wide applicability.
The main limitation of the proposed method is the fact that many innovations are heuristic and difficult to fully prove using mathematical formulas. Another limitation is that it is not possible to completely abandon the use of KF to estimate velocity values.
A potential future research direction is to track worker movement data, obtain movement characteristics and make trajectory predictions.

6. Conclusions

This article introduced a training-free tracking method for workers. The novel method has a bidirectional interaction mechanism between the detector and tracker, which allows head information to supply more stable and precise guidance for body tracking. We have also analyzed in detail the mathematical reasons for the non-convergence phenomenon, and the proposed method successfully prevents the non-convergence problem of the KF that affects traditional one-way tracking while significantly reducing identity errors for workers through the designed newly-added-ID checking algorithms. During testing, our method achieved an HOTA of 78.884%, an MOTA of 95.191% and an IDSW of 0 across nine video datasets, demonstrating that the method remains highly effective even in scenarios with serious occlusions or non-convergence issues.

Author Contributions

Conceptualization, Y.L. and Y.W.; methodology, Y.L.; software, Y.L.; validation, Y.L. and Z.Z.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and Z.Z.; visualization, Y.L.; supervision, Y.W.; project administration, Y.L. and Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Some studies [42,43] suggest that the similar high-visibility apparel (Hi-Vis [38]) worn by workers must lead to indistinguishable appearance features, which would require additional retraining on a new dataset of construction scenarios. Hi-Vis is a special type of clothing commonly used to make workers more visible in busy or low-light environments, thereby increasing their safety. Hi-Vis clothing typically uses eye-catching colors and pattern designs. Because Hi-Vis clothing introduces appearance features that differ from ordinary pedestrian clothing, a previously trained model may not accurately recognize or process these new features. Therefore, for the model to adapt to these changes and work properly in new construction scenarios, it would be necessary to retrain it. Retraining typically involves using datasets containing the new features (in this case, Hi-Vis clothing) so that the model can learn and adapt to these new appearance features.
However, our calculations demonstrate that this assumption is incorrect: the appearance features of workers remain distinguishable. For instance, in Figure A1, which captures frames #18–#19 of video 3, featuring nine workers, our experiments revealed that even in scenarios where workers wear helmets and similar Hi-Vis clothing, a standard human detection ReID model (ckpt.t7/43.9 MB [44]) can still differentiate individuals accurately.
Figure A1. Frames #18 (left) and #19 (right) of a construction video.
The following can be observed from the two cost matrices equations below:
(1) For the smaller heads, the successfully matched elements are roughly 1~10 times smaller than the other values; e.g., in the first row of the Cosine_cost matrix of heads below:
(0.272943 + 0.280426 + 0.268017 + 0.318667 + 0.237262 + 0.289022 + 0.245184 + 0.309502) ÷ 8 ÷ 0.026227 = 10.58.
(2) For the larger bodies, the ratio is 10~100 times. Larger objects are easier to distinguish than smaller ones; the matched body elements are approximately 10 times smaller than the matched head elements, e.g., in the first row of the Cosine_cost matrix of bodies below:
(0.17495 + 0.220428 + 0.207515 + 0.245856 + 0.277771) ÷ 5 ÷ 0.005517 = 40.8.
When using the default cosine threshold of 0.2, the ReID for pedestrian detection can accurately identify even the smallest heads of distinct workers. Such simple calculations provide evidence that the ReID model for pedestrians can be applied to workers without retraining.
$$\text{Cosine\_cost matrix of bodies} = \begin{bmatrix}
0.005517 & 0.17495 & 0.220428 & 0.207515 & 0.245856 & 0.277771 \\
0.228444 & 0.269655 & 0.002582 & 0.235785 & 0.305651 & 0.189078 \\
0.218299 & 0.175027 & 0.221523 & 0.162441 & 0.193348 & 0.256855 \\
0.272631 & 0.173115 & 0.321291 & 0.181186 & 0.00269 & 0.33439 \\
0.207658 & 0.174372 & 0.196977 & 0.011115 & 0.182533 & 0.2122 \\
0.164177 & 0.009771 & 0.234517 & 0.231651 & 0.197819 & 0.286635 \\
0.246577 & 0.224236 & 0.212872 & 0.196327 & 0.241916 & 0.029123 \\
0.129127 & 0.131971 & 0.134357 & 0.116552 & 0.138499 & 0.185706 \\
0.129708 & 0.127821 & 0.15255 & 0.127272 & 0.138029 & 0.196159
\end{bmatrix}$$
$$\text{Cosine\_cost matrix of heads} = \begin{bmatrix}
0.026227 & 0.272943 & 0.280426 & 0.268017 & 0.318667 & 0.237262 & 0.289022 & 0.245184 & 0.309502 \\
0.266658 & 0.202771 & 0.255982 & 0.014943 & 0.358701 & 0.245981 & 0.343662 & 0.292406 & 0.375493 \\
0.303788 & 0.271091 & 0.223952 & 0.29911 & 0.271148 & 0.016107 & 0.321069 & 0.189749 & 0.279182 \\
0.334715 & 0.023847 & 0.310186 & 0.227934 & 0.251481 & 0.233631 & 0.280038 & 0.286123 & 0.336056 \\
0.314554 & 0.29344 & 0.046261 & 0.27025 & 0.300371 & 0.197424 & 0.355366 & 0.271777 & 0.173631 \\
0.328805 & 0.260196 & 0.280138 & 0.371353 & 0.250941 & 0.253999 & 0.037608 & 0.292974 & 0.356252 \\
0.343413 & 0.300676 & 0.344513 & 0.434718 & 0.320221 & 0.268226 & 0.312919 & 0.031341 & 0.190585 \\
0.32014 & 0.216402 & 0.172925 & 0.28769 & 0.112796 & 0.174532 & 0.293966 & 0.220967 & 0.250534 \\
0.383001 & 0.340808 & 0.301298 & 0.429668 & 0.447917 & 0.341719 & 0.396959 & 0.19489 & 0.021867
\end{bmatrix}$$
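As a rough illustration of how such a cosine cost matrix is computed from ReID features, the sketch below builds one from placeholder 512-d vectors; only the pattern (a small value for the matched pair in each row, larger values elsewhere) is meaningful, since the features are random rather than real ReID outputs.

```python
import numpy as np

def cosine_cost(track_feats, det_feats):
    """Cosine distance matrix between L2-normalised feature vectors
    (512-d in this work); small values indicate the same identity."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T

rng = np.random.default_rng(0)
feats_18 = rng.normal(size=(9, 512))                       # 9 workers in frame #18
feats_19 = feats_18 + 0.05 * rng.normal(size=(9, 512))     # slightly perturbed in frame #19
print(np.round(cosine_cost(feats_18, feats_19), 3))        # small diagonal, larger off-diagonal
```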

Appendix B

To clearly show the details of how head tracking aids in body tracking, we choose one output example (shown in Table A1) of our source code, executed in video 7 from frames #1 to #8. Each frame contains five procedures:
  • detector calculation → Intra-frame processing → head_tracker → body_tracker → Inter-frame matching.
Table A1. The process details of the head tracking aid in body tracking.
Table A1. The process details of the head tracking aid in body tracking.
No. | Calculation Descriptions
Input frame #1
1
  • The output of Detector (head_integrated Keypoint R-CNN) as:
Bounding boxes in (x1, y1, x2, y2, confidence, class_id) = tensor([
[1102.4229, 275.0947, 1219.5381, 563.0864, 0.9966, 0],    #body ID = 0
[733.7047, 1.0617, 789.8318, 97.3516, 0.9864, 0],    #body ID = 1
[1312.4270, 197.8513, 1434.9780, 470.1823, 0.9817, 0],    #body ID = 2
[601.3785, 0.0000, 654.7200, 104.9832, 0.9561, 0],    #body ID = 3
[1148.3512, 274.9141, 1189.9109, 322.3585, 0.9955, 1],    #head ID = 0
[1382.5782, 198.2840, 1422.7518, 249.7969, 0.9886, 1] ])    #head ID = 1
Number of Body IDs = 4, Number of Head IDs = 2.
2
  • Intra-frame processing: delete the false positives:
Algorithm 2: delta_del_head = [],
Algorithm 3: delta_del_body = [3].
Then the detection = [601.3785, 0.0000, 654.7200, 104.9832, 0.9561, 0] is deleted, and the left bounding box is:
boxes_xyxy = [
[1102.422852, 275.094666, 1219.538086, 563.086426, 0.996561, 0],
[733.704712, 1.061707, 789.831787, 97.351639, 0.986442, 0],
[1312.427002, 197.851257, 1434.978027, 470.182251, 0.981670, 0],
[1148.351318, 274.914093, 1189.910889, 322.358490, 0.995536, 1],
[1382.578247, 198.283997, 1422.751831, 249.796875, 0.988596, 1] ]
3
  • Head_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [] [] [0, 1]
the second match: matches_b, unmatched_tracks_b, unmatched_detections = [] [] [0, 1]
matches, unmatched_tracks, unmatched_detections = [] [] [0, 1]
The tracking ID in the head sequence set is: 0, 1.
4
  • Body_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [] [] [0, 1, 2]
the second match: matches_b, unmatched_tracks_b, unmatched_detections = [] [] [0, 1, 2]
The tracking ID in the body sequence set is: 0, 1, 2.
5
  • Inter-frame matching in Algorithm 4:
body_id_list = [0, 2]; head_id_list = [0, 1]
cost_matrix = [
[0.003819, 100000.],
[100000., 100000.],
[100000., 0.]]
row_indices (bodies), col_indices (heads): [0 2] [0 1]
The matched body-IDs and head-IDs are self.match_body_head = [(0, 0), (2, 1), (1, None)] (a code sketch of this gated assignment is given after Table A1).
Input frame #2
1
Bounding boxes in (x1, y1, x2, y2, confidence, class_id) = tensor([
1100.2812, 274.1615, 1218.8927, 563.6195, 0.9977, 0
1317.1239, 199.0990, 1435.9489, 468.3447, 0.9848, 0
733.1741, 0.5491, 783.7123, 96.3040, 0.9834, 0
598.9022, 0.0000, 644.2981, 103.0769, 0.9701, 0
1148.8118, 275.4363, 1190.7026, 322.5742, 0.9915, 1
1383.2902, 199.0062, 1423.4681, 248.5701, 0.9893, 1])
Number of Body IDs = 4, Number of Head IDs = 2.
2
  • Intra-frame processing: delete the false positives:
Algorithm 2: delta_del_head = [],
Algorithm 3: delta_del_body = [3].
[598.9022, 0.0000, 644.2981, 103.0769, 0.9701, 0] is deleted.
3
  • Head_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [(0, 0), (1, 1)] [] []
the second match: matches_b, unmatched_tracks_b, unmatched_detections = [] [] []
4
  • Body_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [(0, 0), (1, 2), (2, 1)] [] []
the second match: matches_b, unmatched_tracks_b, unmatched_detections= [] [] []
5
  • Inter-frame matching in Algorithm 4:
The matched body-IDs and head-IDs are: self.match_body_head = [(0, 0), (2, 1), (1, None)]
body-np_xyxy_final = [
1100.495617, 274.254816, 1218.957032, 563.566199, 0.997711, 0
731.006493, 0.616912, 786.759376, 96.442545, 0.983394, 1
1315.511496, 198.934008, 1436.811807, 468.587705, 0.984758, 2]

Input frame #7
1
Bounding boxes in (x1, y1, x2, y2, confidence, class_id) = tensor([
1103.0753, 278.1667, 1218.2045, 573.1316, 0.9963
1327.7861, 199.1935, 1435.1885, 466.8471, 0.9849
576.0401, 0.9442, 636.5553, 104.2498, 0.9785
721.5332, 2.7390, 797.3731, 105.7765, 0.9664
646.1354, 0.9752, 769.4316, 103.6911, 0.8158
1153.6139, 279.1069, 1196.0039, 327.2707, 0.9968
1388.5093, 198.7742, 1425.3273, 243.2225, 0.9838])
Number of Body IDs = 5, Number of Head IDs = 2.
2
  • Intra-frame processing: delete the false positives:
Algorithm 2: delta_del_head = [],
Algorithm 3: delta_del_body = [4].
[646.1354, 0.9752, 769.4316, 103.6911, 0.8158] is deleted.
3
  • Head_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [(0, 0), (1, 1)] [] []
the second match: matches_b, unmatched_tracks_b, unmatched_detections = [] [] []
4
  • Body_tracker in Algorithm 1:
the first match: matches_a, unmatched_tracks_a, unmatched_detections = [(0, 0), (1, 2), (2, 1)] [3] []
the second match: matches_b, unmatched_tracks_b, unmatched_detections = [] [3] []
5
  • Inter-frame matching in Algorithm 4:
Find a newly added body-ID: initiate_np_xyxy = [[721.533203, 2.738953, 797.373108, 105.776482, 0.966378, 3]]
head-candidates_tlwh = [
1153.614397, 278.977509, 42.391931, 48.180364
1388.411052, 198.606596, 37.066294, 44.685888]
body_id_list = [0, 2, 1]; head_id_list = [0, 1, None]
#match between one newly added body-ID and two existing head-IDs
cost_matrix = [[100000., 100000.]]
#100000 is larger than the threshold, so body ID = 3 is paired with head-ID = None
self.match_body_head = [(0, 0), (2, 1), (1, None), (3, None)]
#the newly added track ID’s bounding box is not drawn in the current frame, but will be shown in the next frame.
body-np_xyxy_final = [
1102.914528, 278.100212, 1218.567574, 572.995262, 0.996286, 0
570.531019, 0.777104, 631.442964, 104.358195, 0.978527, 1
1322.21706, 198.937752, 1440.727778, 466.620414, 0.984871, 2]
Input frame #8
5
body-np_xyxy_final = [
1104.565028, 282.556513, 1212.796813, 574.978288, 0.99675, 0
563.08197, 1.296683, 623.345229, 103.507251, 0.983013, 1
1323.317717, 198.276943, 1441.398572, 466.717254, 0.983189, 2
722.836121, 2.213905, 787.554428, 106.905021, 0.941132, 3] #newly added body-ID = 3
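The “first match / second match” lines printed for the head_tracker and body_tracker above follow the two-stage association used by DeepSORT-style trackers [8]: an appearance-and-motion matching cascade first, then an IoU match over the remaining candidates. The sketch below reproduces only that control flow; the two matching callables are placeholders rather than the exact functions of Algorithm 1.

def two_stage_match(tracks, detections, cascade_match, iou_match):
    # Illustrative two-stage association, mirroring the matches_a / matches_b pattern above.
    matches_a, unmatched_tracks_a, unmatched_detections = cascade_match(tracks, detections)
    matches_b, unmatched_tracks_b, unmatched_detections = iou_match(unmatched_tracks_a, unmatched_detections)
    return matches_a + matches_b, unmatched_tracks_b, unmatched_detections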
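The inter-frame matching steps above (frames #1 and #7) can be illustrated with the gated assignment sketch below: infeasible body-head pairs receive the large constant 100,000, the Hungarian algorithm is run on the rectangular cost matrix, and any assignment whose cost exceeds a threshold is rejected, so that the corresponding body keeps head = None. The threshold value of 1.0 here is an assumption for illustration, not the exact setting of Algorithm 4.

import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 100000.0  # large constant used above for infeasible body-head pairs

def gated_body_head_match(cost_matrix, body_ids, head_ids, threshold=1.0):
    # Rows are body tracks, columns are head tracks; gated pairs stay unmatched (head = None).
    rows, cols = linear_sum_assignment(cost_matrix)
    pairs = {b: None for b in body_ids}
    for r, c in zip(rows, cols):
        if cost_matrix[r, c] < min(threshold, GATE):
            pairs[body_ids[r]] = head_ids[c]
    return sorted(pairs.items())

# Frame #1 of Table A1: three body tracks (0, 1, 2) against two head tracks (0, 1).
cost = np.array([[0.003819, GATE],
                 [GATE,     GATE],
                 [GATE,     0.0]])
print(gated_body_head_match(cost, body_ids=[0, 1, 2], head_ids=[0, 1]))
# -> [(0, 0), (1, None), (2, 1)], i.e., the same pairing as self.match_body_head above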

References

  1. Teizer, J. Status quo and open challenges in vision-based sensing and tracking of temporary resources on infrastructure construction sites. Adv. Eng. Inform. 2015, 29, 225–238. [Google Scholar] [CrossRef]
  2. Xiao, B.; Xiao, H.; Wang, J.; Chen, Y. Vision-based method for tracking workers by integrating deep learning instance segmentation in off-site construction. Autom. Constr. 2022, 136, 104148. [Google Scholar] [CrossRef]
  3. Golizadeh, H.; Hon, C.K.H.; Drogemuller, R.; Hosseini, M.R. Digital engineering potential in addressing causes of construction accidents. Autom. Constr. 2018, 95, 284–295. [Google Scholar] [CrossRef]
  4. Freimuth, H.; Koenig, M. Planning and executing construction inspections with unmanned aerial vehicles. Autom. Constr. 2018, 96, 540–553. [Google Scholar] [CrossRef]
  5. Guo, H.; Yu, Y.; Skitmore, M. Visualization technology-based construction safety management: A review. Autom. Constr. 2017, 73, 135–144. [Google Scholar] [CrossRef]
  6. Luiten, J.; Ošep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
  7. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2022, arXiv:2110.06864. [Google Scholar] [CrossRef]
  8. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 24th IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  9. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. arXiv 2022, arXiv:2203.14360. [Google Scholar] [CrossRef]
  10. Liu, Y.; Zhou, Z.; Wang, Y.; Sun, C. Head-Integrated Detecting Method for Workers under Complex Construction Scenarios. Buildings 2024, 14, 859. [Google Scholar] [CrossRef]
  11. Dendorfer, P.; Ošep, A.; Milan, A.; Schindler, K.; Cremers, D.; Reid, I.; Roth, S.; Leal-Taixé, L. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. arXiv 2020, arXiv:2010.07548. [Google Scholar] [CrossRef]
  12. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv 2015, arXiv:1504.01942. [Google Scholar] [CrossRef]
  13. Ciaparrone, G.; Sánchez, F.L.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep Learning in Video Multi-Object Tracking: A Survey. arXiv 2019, arXiv:1907.12740. [Google Scholar] [CrossRef]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar] [CrossRef]
  16. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. arXiv 2019, arXiv:1904.08189. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  18. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  19. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 23rd IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar] [CrossRef]
  20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar] [CrossRef]
  21. Bashar, M.; Islam, S.; Hussain, K.K.; Hasan, M.B.; Rahman, A.B.M.A.; Kabir, M.H. Multiple Object Tracking in Recent Times: A Literature Review. arXiv 2022, arXiv:2209.04796. [Google Scholar] [CrossRef]
  22. Shuai, B.; Berneshawi, A.; Li, X.; Modolo, D.; Tighe, J. SiamMOT: Siamese Multi-Object Tracking. arXiv 2021, arXiv:2105.11595. [Google Scholar] [CrossRef]
  23. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. arXiv 2021, arXiv:2101.02702. [Google Scholar] [CrossRef]
  24. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple Object Tracking with Transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar] [CrossRef]
  25. Google. Google Colaboratory. 2023. Available online: https://colab.research.google.com/ (accessed on 29 August 2023).
  26. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification. arXiv 2023, arXiv:2302.11813. [Google Scholar] [CrossRef]
  27. Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking. arXiv 2023, arXiv:2308.00783. [Google Scholar] [CrossRef]
  28. Duan, P.; Zhou, J.; Goh, Y.M. Spatial-temporal analysis of safety risks in trajectories of construction workers based on complex network theory. Adv. Eng. Inform. 2023, 5, 101990. [Google Scholar] [CrossRef]
  29. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  30. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. arXiv 2022, arXiv:2202.13514. [Google Scholar] [CrossRef]
  31. Wang, Z.; Zhao, H.; Li, Y.L.; Wang, S.; Torr, P.; Bertinetto, L. Do Different Tracking Tasks Require Different Appearance Models? arXiv 2021, arXiv:2107.02156. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. arXiv 2020, arXiv:2004.01888. [Google Scholar] [CrossRef]
  33. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking. arXiv 2021, arXiv:2104.00194. [Google Scholar] [CrossRef]
  34. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Bu, J.; Tian, Q. Person Re-identification Meets Image Search. arXiv 2015, arXiv:1502.02171. [Google Scholar] [CrossRef]
  35. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. Eurasip J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  36. Ristani, E.; Solera, F.; Zou, R.S.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. arXiv 2016, arXiv:1609.01775. [Google Scholar] [CrossRef]
  37. KubaRurak. Detectron2-Deepsort-Repo. 2021. Available online: https://github.com/KubaRurak/detectron2-deepsort-repo (accessed on 31 August 2023).
  38. Konstantinou, E.; Lasenby, J.; Brilakis, I. Adaptive computer vision-based 2D tracking of workers in complex environments. Autom. Constr. 2019, 103, 168–184. [Google Scholar] [CrossRef]
  39. JonathonLuiten. TrackEval. 2021. Available online: https://github.com/JonathonLuiten/TrackEval (accessed on 31 August 2023).
  40. Mikel-Brostrom. YOLO_Tracking. 2023. Available online: https://github.com/mikel-brostrom/yolo_tracking#real-time-multi-object-segmentation-and-pose-tracking-using-yolov8--yolo-nas--yolox-with-deepocsort-and-lightmbn (accessed on 31 August 2023).
  41. pmj110119. YOLOX_Deepsort_Tracker. 2021. Available online: https://github.com/pmj110119/YOLOX_deepsort_tracker (accessed on 31 August 2023).
  42. Xiao, B.; Kang, S.-C. Vision-Based Method Integrating Deep Learning Detection for Tracking Multiple Construction Machines. J. Comput. Civ. Eng. 2021, 35, 04020071. [Google Scholar] [CrossRef]
  43. Xiao, B.; Lin, Q.; Chen, Y. A vision-based method for automatic tracking of construction machines at nighttime based on deep learning illumination enhancement. Autom. Constr. 2021, 127, 13. [Google Scholar] [CrossRef]
  44. Drive, G. Deepsort_Parameters. 2023. Available online: https://drive.google.com/drive/folders/1xhG0kRH1EX5B9_Iz8gQJb7UNnn_riXi6 (accessed on 31 August 2023).
Figure 1. Types of identity errors in tracking workers.
Figure 2. Overview of the proposed tracking method.
Figure 3. Framework of the head_tracker or body_tracker.
Figure 4. Examples of deleted false positives during intra-frame processing.
Figure 5. Process images of the inter-frame matching.
Figure 6. Testing videos’ tracking outputs at their final frames.
Figure 7. Example of intra-class occlusions in video 3.
Figure 8. Example of inter-class occlusions in video 7.
Figure 9. Example of body’s reverse movement for ID = 4.
Table 1. Strengths and limitations of SOTA methods.
No. | SOTA Methods | Year | Information Types | Advantages (A) & Shortcomings (S)
1 | SORT [19] | 2016 | O: Faster R-CNN;
A: None;
M: KF + IoU + Hungarian.
A: presented as the baseline; KF state is x_t = (x_c, y_c, s, r, v_x, v_y, v_s)^T, where s = scale (area) and r = aspect ratio.
S: highly dependent on detection performance and many IDSWs; no occlusion-solving considerations.
2 | DeepSORT [8] | 2017 | O: Faster R-CNN;
A: ReID (128-d);
M: KF + Cosine distance + IoU + Hungarian.
A: presented as the baseline that integrates appearance information. KF state is x_t = (x_c, y_c, γ, h, v_x, v_y, v_γ, v_h)^T, where γ = aspect ratio and h = height.
S: detection performance dependency; occlusion-related IDSWs reduced but still frequent; constant-velocity model.
3 | ByteTrack [7] | 2022 | O: re-trained YOLOX-x on 1400 videos;
A: ReID (1024-d);
M: KF + Cosine distance + IoU + Hungarian + low scores re-match.
A: best performance on MOT20, with publicly released code. KF state is x_t = (x_c, y_c, a, h, v_x, v_y, v_a, v_h)^T, where a = aspect ratio and h = height.
S: highly dependent on detection performance and many IDSWs; no occlusion considerations; constant-velocity model.
4 | OC_SORT [9] | 2022 | O: baseline detections in MOTChallenge;
A: None;
M: KF + IoU + Hungarian + re-update of KF + motion direction difference.
A: first to explain the accumulation of KF prediction errors in detail; the motion direction difference is added to the association cost matrix. KF state is x_t = (x_c, y_c, a, s, v_x, v_y, v_a)^T, where a = area and s = aspect ratio.
S: no real online method for KF update; needs future frame; constant-velocity assumption during occlusion, cannot remain effective during long-term occlusions; detection performance dependency; constant-velocity model.
5 | Deep OC_SORT [26] | 2023 | O: YOLOX;
A: ReID (SBS50, 287MB) + Camera Motion Compensation + Dynamic Appearance;
M: KF + IoU + Hungarian + re-update of KF + motion direction difference.
A: applies Camera Motion Compensation to correct the KF state for better bounding-box locations; applies the detection confidence to modify the ReID output vectors. KF state is x_t = (x_c, y_c, a, s, v_x, v_y, v_a)^T, where a = area and s = aspect ratio.
S: the same as the OC_SORT; constant-velocity model.
6 | BoTSORT [29] | 2022 | O: Faster R-CNN;
A: ReID + Camera Motion Compensation;
M: KF + cosine distance + IoU + Hungarian.
A: modifies the KF state to x_t = (x_c, y_c, s, a, v_x, v_y, v_s)^T, where s = area and a = aspect ratio; applies Camera Motion Compensation to reduce the errors of moving cameras; applies a new cost matrix with weighted appearance and motion costs.
S: the same as the OC_SORT; constant-velocity model; slow when working with sparse optical flow.
7 | Strong_SORT [30] | 2022 | O: YOLOX-x;
A: ReID (BoT) + Camera Motion Compensation;
M: NSA-KF + cosine distance + IoU + Hungarian.
A: applies a new cost matrix with weighted appearance and motion costs. KF state is x_t = (x_c, y_c, a, h, v_x, v_y, v_a, v_h)^T, where a = aspect ratio and h = height.
S: MOTA is slightly lower, mainly due to the high detection score threshold leading to many missing detections; working speed is not high.
8 | TransTrack [24] | 2020 | O: re-trained transformer;
A: None;
M: None.
A: Self-Attention Mechanism and Query-Key pipeline.
S: hard to train; no motion information utilization; JDT not better than TBD in performance.
9 | UniTrack [31] | 2021 | O: ResNet-50;
A: ImageNet-supervised appearance model;
M: KF + cosine distance + IoU + Hungarian.
A: can support different tracking tasks and leverage many existing general appearance models. KF state is x_t = (x_c, y_c, a, h, v_x, v_y, v_a, v_h)^T, where a = aspect ratio and h = height.
S: not better in terms of metrics performance.
10 | FairMOT [32] | 2020 | O: encoder–decoder;
A: encoder–decoder;
M: KF + cosine distance + IoU + Hungarian.
A: one encoder–decoder network to obtain observation and appearance at the same time, with no need for an independent ReID model;
S: need training for about 30 h on two RTX 2080 Ti GPUs; still a SORT-related method.
11 | TransMOT [33] | 2021 | O: spatial–temporal graph Transformer;
A: None;
M: KF + cosine distance + IoU + Hungarian.
A: A cascaded association structure to handle low confidence detection and long-term occlusion.
S: relatively large computing resources and data; no public codes.
12 | Hybrid-SORT [27] | 2023 | O: YOLOX-x;
A: ReID + Camera Motion Compensation;
M: KF + cosine distance + IoU + Hungarian + weak cues.
A: applies a new cost matrix with weights for the appearance cost, motion cost, four corners’ velocity directions, and height-modulated IoU. KF state is x_t = (x_c, y_c, s, c, r, v_x, v_y, v_s, v_c)^T, where r = aspect ratio, s = area and c = confidence score.
S: detection performance dependency.
13 | Xiao et al. [2] | 2023 | O: Mask R-CNN;
A: ReID (128-d);
M: KF + cosine distance + IoU + Hungarian.
A: baseline in worker tracking;
S: needs retraining on a new dataset; does not address severe occlusions; no public codes.
Observation DNN: O; appearance DNN: A; motion estimation model and assignment algorithm: M.
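Most of the rows above share the same constant-velocity Kalman prediction step; the sketch below writes it out for the eight-dimensional state x_t = (x_c, y_c, a, h, v_x, v_y, v_a, v_h)^T used by DeepSORT [8]. It is illustrative only: the process and measurement noise matrices and the update step are omitted.

import numpy as np

def constant_velocity_predict(state: np.ndarray, dt: float = 1.0) -> np.ndarray:
    # One prediction step for x_t = (x_c, y_c, a, h, v_x, v_y, v_a, v_h)^T.
    # Positions advance by their velocities; velocities stay constant, which is exactly
    # the assumption criticised above for long occlusions.
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)
    return F @ state

# Example: a track moving 3 px right and 1 px down per frame keeps drifting that way.
x = np.array([1102.4, 275.1, 0.41, 288.0, 3.0, 1.0, 0.0, 0.0])
print(constant_velocity_predict(x))  # -> [1105.4, 276.1, 0.41, 288.0, 3.0, 1.0, 0.0, 0.0]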
Table 2. Performance of our tracking method on the testing dataset.
No. | Metrics | Video-1 | Video-2 | Video-3 | Video-4 | Video-5 | Video-6 | Video-7 | Video-8 | Video-9 | Combined
1 | MOTA↑ (%) | 100 | 94.08 | 93.943 | 93.582 | 94.833 | 98.859 | 95.44 | 96.843 | 93.972 | 95.191
2 | IDF1↑ (%) | 100 | 97.04 | 96.966 | 96.824 | 97.468 | 99.431 | 97.73 | 98.422 | 97.007 | 97.609
3 | HOTA↑ (%) | 93.023 | 77.601 | 81.446 | 75.7 | 74.801 | 76.911 | 78.273 | 81.799 | 77.446 | 78.884
4 | AssA↑ (%) | 93.023 | 79.784 | 86.066 | 78.962 | 81.28 | 85.451 | 80.484 | 83.341 | 78.149 | 83.296
5 | AssRe↑ (%) | 94.298 | 84.026 | 89.916 | 84.525 | 87.219 | 89.145 | 85.352 | 86.762 | 84.28 | 87.83
6 | AssPr↑ (%) | 94.298 | 84.026 | 90.052 | 83.247 | 84.142 | 88.895 | 84.634 | 86.689 | 83.016 | 86.976
7 | LocA↑ (%) | 92.568 | 83.922 | 86.915 | 82.642 | 82.643 | 81.798 | 84.326 | 85.815 | 85.614 | 84.725
8 | IDSW↓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
9 | Frag↓ | 0 | 4 | 7 | 21 | 2 | 24 | 6 | 10 | 12 | 86
10 | Hz↑ | 9.05 | 6.01 | 5.69 | 5.96 | 7.11 | 6.69 | 6.62 | 5.33 | 6.82 | 6.58
11 | MT | 1 | 3 | 9 | 5 | 5 | 4 | 4 | 6 | 4 | 41
(‘↑’ means higher is better, ‘↓’ means lower is better).
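The MOTA values in row 1 follow the standard CLEAR MOT definition [35]; the snippet below is a minimal reference implementation, and the counts in the example are purely illustrative rather than the raw error counts behind Table 2.

def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    # CLEAR MOT accuracy: MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames.
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Illustrative counts only: 400 missed boxes, 81 false boxes, 0 ID switches over 10,000 GT boxes.
print(round(100 * mota(400, 81, 0, 10000), 3))  # -> 95.19 (%)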
Table 3. Performance of classical pedestrian tracking methods on the testing dataset.
No. | Metrics | DeepSORT | ByteTrack | Deep OC_SORT | BoTSORT | OC_SORT | Strong_SORT | TransTrack | UniTrack
1 | MOTA↑ (%) | 69.914 | 62.515 | 56.429 | 60.699 | 66.002 | 56.698 | 39.345 | 27.441
2 | IDF1↑ (%) | 79.771 | 81.829 | 74.471 | 77.365 | 80.633 | 77.065 | 70.459 | 61.887
3 | HOTA↑ (%) | 68.418 | 66.915 | 62.425 | 63.208 | 66.596 | 63.893 | 57.607 | 49.201
4 | AssA↑ (%) | 74.585 | 77.995 | 71.394 | 73.429 | 76.166 | 74.7 | 72.685 | 63.686
5 | AssRe↑ (%) | 78.896 | 82.717 | 75.462 | 76.777 | 80.316 | 78.679 | 76.712 | 68.208
6 | AssPr↑ (%) | 84.488 | 83.637 | 83.043 | 85.385 | 86.221 | 84.348 | 82.495 | 74.389
7 | LocA↑ (%) | 85.353 | 80.694 | 82.01 | 82.833 | 82.657 | 82.058 | 80.222 | 69.791
8 | IDSW↓ | 21 | 5 | 5 | 18 | 10 | 12 | 7 | 15
9 | Frag↓ | 101 | 107 | 178 | 180 | 168 | 140 | 161 | 268
10 | Hz↑ | 1.77 | 9.09 | 8.33 | 7.14 | 8.47 | 7.69 | 5.13 | 4.30
(‘↑’ means higher is better, ‘↓’ means lower is better).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
