Article

Intra-Frame Graph Structure and Inter-Frame Bipartite Graph Matching with ReID-Based Occlusion Resilience for Point Cloud Multi-Object Tracking

1 School of Electronic and Information Engineering, Changchun University of Science and Technology, Changchun 130022, China
2 Hong Kong Applied Science and Technology Research Institute, Hong Kong 999077, China
3 Xi’an Key Laboratory of Active Photoelectric Imaging Detection Technology, Xi’an Technological University, Xi’an 710021, China
4 Jinhua Campus, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2968; https://doi.org/10.3390/electronics13152968
Submission received: 19 June 2024 / Revised: 22 July 2024 / Accepted: 25 July 2024 / Published: 27 July 2024
(This article belongs to the Section Computer Science & Engineering)

Abstract: Three-dimensional multi-object tracking (MOT) using lidar point cloud data is crucial for applications in autonomous driving, smart cities, and robotic navigation. It involves identifying objects in point cloud sequence data and consistently assigning unique identities to them throughout the sequence. Occlusions can lead to missed detections, resulting in incorrect data associations and ID switches. To address these challenges, we propose a novel point cloud multi-object tracker called GBRTracker. Our method integrates an intra-frame graph structure into the backbone to extract and aggregate spatial neighborhood node features, significantly reducing detection misses. We construct an inter-frame bipartite graph for data association and design a sophisticated cost matrix based on the center, box size, velocity, and heading angle. A minimum-cost flow algorithm is then used to achieve globally optimal matching, thereby reducing ID switches. For unmatched detections, we design a motion-based re-identification (ReID) feature embedding module, which uses velocity and the heading angle to calculate similarity and association probability, reconnecting them with their corresponding trajectory IDs or initializing new tracks. Our method maintains high accuracy and reliability, significantly reducing ID switches and trajectory fragmentation, even in challenging scenarios. We validate the effectiveness of GBRTracker through comparative and ablation experiments on the NuScenes and Waymo Open Datasets, demonstrating its superiority over state-of-the-art methods.

1. Introduction

Three-dimensional multiple object tracking (MOT) aims to detect objects of interest in point cloud data and assign each object a unique identity, forming a track of objects throughout the sequence. This technology is essential in many applications, including autonomous driving, smart cities, and robotic navigation and interaction [1,2]. The tracking-by-detection paradigm splits the problem into two main tasks: detecting objects in each frame (localization) and performing data association (linking detections over time to form object trajectories). Object localization is typically achieved using powerful detectors, while data association remains a complex problem. The primary challenge in data association arises from occlusions, which can lead to missed detections and, subsequently, ID switches. Therefore, improving detection accuracy and data association correctness under occlusion is crucial for effective 3D MOT.
As can be seen in Figure 1, the tracking-by-detection paradigm uses a detector, in the stage named Before Data Association, to extract the object’s spatial and contextual features. Most methods design sparse convolutions for voxel-based approaches [3] and accurate down-sampling algorithms [4] to extract object-oriented features for point-based methods. Point-voxel-based detectors [5] leverage both techniques to improve detection performance. Recently, graph-based detectors [6,7,8] and trackers [9,10], which can be used as plug-and-play modules for these three types of detectors, have emerged. By building a K-Nearest Neighbor (KNN) graph structure, these detectors can extract neighborhood features and aggregate them at the center object, enhancing spatial feature embedding. Our tracker pipeline follows a similar approach, incorporating a graph structure to improve the accuracy of the detector.
Data Association in 3D MOT is crucial for accurately linking detected objects across frames to maintain consistent object identities. A common approach to this problem is Bipartite Graph Matching. In this paradigm, detected objects in the current frame are matched to tracklets from the previous frame, which can be visualized as a bipartite graph, where nodes represent detections and tracklets, and edges represent potential matches with association costs. Traditionally, methods like the Hungarian algorithm [11,12,13,14] and greedy approaches [15,16] have been employed to solve this matching problem. However, these methods often do not account for the weights of edges between nodes across frames, which can lead to suboptimal, local solutions. To address this limitation, we construct a comprehensive cost matrix and formulate the problem as a minimum-cost flow optimization to find a globally optimal solution for data association, thereby improving the accuracy and robustness of 3D MOT under conditions of occlusion.
After Data Association comes the track management stage, which involves updating and maintaining trajectories. Various methods exist for track updates and trajectory management, such as trajectory similarity [17], cross-frame object intersection-over-union (IoU) association [18], outlier removal [19], and trajectory hypotheses [20]. ReID [21,22] is commonly used in video-based multi-object tracking for pedestrian detection. In this context, the detector extracts features and encodes them using an MLP to obtain ReID features for re-identifying objects in subsequent frames. However, in point-cloud-based multi-object tracking, leveraging image appearance features is not feasible due to the unordered nature of points and the lack of color information. Therefore, we utilize velocity- and angle-based ReID features to re-match unmatched tracks with their corresponding trajectory IDs in the event of missed detections or occlusions.
We propose a point cloud multi-object tracking-by-detection tracker: GBRTracker. For detection, we design an intra-frame graph geometrical structure to extract neighborhood features and aggregate them for each object. Unlike many methods that solely rely on object location or motion information and often view each object as an independent entity, neglecting the geometrical relationships between them, our detector enhances spatial and contextual information, thereby reducing missed detections and indirectly decreasing errors in data association. For data association, we simplify the adjacency matrix formed by tracking and detection associations into a bipartite graph matching problem. We view bipartite graph matching as an optimization problem and define the objective function to find the optimal matching with the minimum global cost. We design a sophisticated cost matrix using object attributes such as the center location, bounding box size, velocity, and heading angle. A minimum-cost flow optimization algorithm is then applied to achieve globally optimal matching at the minimum cost. After data association, we use dynamic motion-based ReID embeddings, incorporating velocity and heading angle, to calculate the similarity between current detections and trajectory detections. This helps reconnect unmatched detections to their proper ID trajectories, reducing fragmented trajectories and correctly initializing new trajectories to obtain accurate tracking results.
Our tracker effectively alleviates occlusion-induced missed detections and improves object association accuracy. Furthermore, unmatched detections can reconnect to the proper trajectory ID, helping to decrease ID switches and avoiding the wrong initialization of new tracks for more effective tracking.
In summary, our work makes the following contributions:
  • We propose an intra-frame graph structure that leverages adaptive graph convolution to aggregate edge features into central nodes, thereby enhancing the robust representation of each node.
  • We view data association as inter-frame bipartite graph matching and define the objective function to minimize the global optimal matching cost. By designing a sophisticated cost matrix and applying a minimum-cost flow optimization algorithm, we achieve globally optimal matching for accurate data association in complex scenarios, thereby reducing ID switches.
  • For unmatched objects, we propose a motion-based ReID layer that uses similarity scores and association probabilities to accurately re-associate objects with their previous fragmented trajectory IDs, thereby reducing ID switches and avoiding the wrong initialization of new tracks.

2. Related Works

2.1. Three-Dimensional Multi-Object Tracking

Model-based tracking methods such as PMRA-PMBM [23] and GNN-PMB [24] utilize probabilistic graphical models, making them suitable for handling high-dimensional data with complex relationships and dependencies. These methods offer flexibility and global optimality but suffer from high computational complexity, making them less suitable for real-time applications.
Filter-based tracking methods such as Simpletrack [25] and AB3DMOT [26] rely on predefined motion models to describe the system’s dynamics and measurement processes. However, these methods struggle when multiple objects exhibit nonlinear and diverse motions simultaneously. Recently, graph-based methods such as PolarMOT [9], LOGR [27], and LEGO [28] have built graph structures with neural message passing for online object detection and tracking. BP-Tracker [29] employs a factor graph and a belief propagation algorithm to compute the marginal association probability, enhancing tracking performance.
Our work also leverages graph-based approaches utilizing neighborhood information aggregation and data association to improve tracking performance.

2.2. Data Association in Multi-Object Tracking

The data association performed by the tracking-by-detection paradigm views each detection as a node in a graph, with edges linking nodes over the temporal domain to form trajectories. For instance, PolarMOT [9] constructs node graphs, inter-frame graphs, and intra-frame graphs directly on targets without using an object detector. TrackMPNN [30] utilizes dynamic undirected graphs to tackle data association challenges across multiple timesteps. AGTSSD [8] uses an adaptive graph transformer within the backbone for graphical object detection. STBGTracker leverages a 4D graph spatio-temporal backbone [31] for single-object tracking, showing that objects are not independent and that their relationships need to be explored. Similarly, LearnableTrack [27] employs a graph-based structure to manage detection and track data association simultaneously using graph neural message passing.
The determination of connected nodes can then be addressed using the Hungarian algorithm [11,13,14] or greedy assignment [16,32]. Compared to minimum-cost flow and maximum-cost flow methods, these approaches are more prone to local optima, potentially missing some matchings. Occlusions or undetected objects can lead to fragmented trajectories over several frames. To mitigate this, various methods employ trajectory update mechanisms, such as trajectory interpolation [33], the elimination of outliers in tracklets [19], or the use of proposals to directly generate and associate tracklets [34].
ReID [21,35] is commonly employed in pedestrian tracking within crowded scenes to identify individuals. Typically, detection and ReID tasks are decoupled. Our method uses intra-frame graphs, a sophisticated cost matrix with a minimum-cost flow algorithm for bipartite graph matching for data association, and motion-enhanced ReID to properly reconnect previously unmatched trajectories or initialize new tracks. This approach helps reduce ID switches and improve tracking performance by avoiding the direct assignment of new track IDs.

3. Methods

Within this section, as shown in Figure 2, we propose the 3D point cloud multi-object tracker GBRTracker, which comprises several modules: object detection feature embedding, bipartite graph matching for data association, and trajectory management and track update.

3.1. Intra-Frame Graph Structure

To address the challenge of sparse convolution in voxel representation feature maps, which may lack spatial information in voxel-based detectors, we modified our previous work [8] by designing a voxel graph layer. As shown in Figure 3, this enhancement integrates an intra-frame graph structure within the backbone, effectively improving geometric feature embedding.

3.1.1. Voxel Feature Extraction

The input point cloud $P$ is first voxelized to create a structured grid of voxels. Each voxel aggregates the points within its spatial boundaries, forming initial voxel features $V$. These voxel features are then processed through a Sparse Convolutional Network to obtain the final voxel feature representation $F$:
$$F = \mathrm{SparseConvNet}(V),$$
where $F$ represents the feature tensor for each voxel. This approach leverages the efficiency of sparse convolutions to handle the high-dimensional and sparse nature of point cloud data, ensuring that important spatial information is preserved and effectively utilized in the detection process. We define the voxel feature $F$ as follows:
$$F \in \mathbb{R}^{N \times C \times H \times W},$$
where $N$ is the batch size, $C$ is the number of feature channels, and $H$ and $W$ are the height and width, respectively.
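To make the construction of $V$ concrete, the following minimal PyTorch sketch builds initial voxel features by mean-pooling the features of all points that fall into the same voxel; the voxel size and feature layout are illustrative assumptions, and the subsequent SparseConvNet stage is only indicated, not implemented.

```python
import torch

def voxelize(points, voxel_size=0.1):
    """Average per-point features into voxels (the V in F = SparseConvNet(V)).

    points: (P, 3 + C) tensor of xyz coordinates plus per-point features.
    Returns integer voxel coordinates and mean-pooled voxel features;
    a sparse 3D CNN would then consume these to produce F.
    """
    coords = torch.floor(points[:, :3] / voxel_size).long()
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    feats = torch.zeros(uniq.size(0), points.size(1) - 3)
    counts = torch.zeros(uniq.size(0), 1)
    feats.index_add_(0, inv, points[:, 3:])      # sum features per voxel
    counts.index_add_(0, inv, torch.ones(points.size(0), 1))
    return uniq, feats / counts.clamp(min=1.0)   # mean-pool
```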

3.1.2. Graph Construction and Adaptive Graph Convolution

For each node feature, we build a KNN graph for each feature channel of the voxels in the 3D Euclidean feature space instead of the position space, where $K$ is the number of nearest neighbors considered for each node, forming the set $\mathcal{N}(i)$. We then apply adaptive GraphConv [8] to each edge between pairwise-similar features. Finally, each center node feature is updated by max-pooling over its aggregated neighborhood:
$$F_i' = \max_{j \in \mathcal{N}(i)} \sigma\left(w \cdot \left[F_i,\; F_i - F_j\right]\right),$$
where $w$ is a shared learnable parameter, $\sigma$ represents a nonlinear activation function such as ReLU, $\mathcal{N}(i)$ denotes the set of neighbors of node $i$, and $F_i$ and $F_j$ are the feature embeddings of the center node and its neighbors, respectively.
Our adaptive graph voxel feature layer effectively captures and aggregates spatial neighborhood features, enhancing the robustness of the 3D point cloud features extracted by the backbone network. This improvement significantly reduces missed detections and enhances overall tracking accuracy, particularly in challenging environments.
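For illustration, a minimal PyTorch sketch of Equation (3) follows, under the assumption that voxel features have been flattened to an $(N, C)$ set of nodes; the channel sizes and $k$ are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Sketch of Eq. (3): KNN graph in feature space, a shared linear map
    on edge features [F_i, F_i - F_j], then max-pooling over N(i)."""
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        self.w = nn.Linear(2 * in_dim, out_dim)   # shared learnable w
        self.act = nn.ReLU()                      # sigma

    def forward(self, feats):                     # feats: (N, C), N > k
        dist = torch.cdist(feats, feats)          # pairwise feature distance
        idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]  # drop self
        nbrs = feats[idx]                         # (N, k, C)
        center = feats.unsqueeze(1).expand_as(nbrs)
        edge = torch.cat([center, center - nbrs], dim=-1)   # (N, k, 2C)
        return self.act(self.w(edge)).max(dim=1).values     # max over N(i)
```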

3.2. Bipartite Graph Matching for Data Association

In 3D multi-object tracking (MOT), data association is vital for linking detected objects to their corresponding track IDs. Common approaches in 3D MOT involve optimizing an overall association cost matrix. This optimization often involves calculating the overlap or distance between the predicted and detected 3D bounding boxes using algorithms like the global nearest neighbor or the Hungarian algorithm. To improve the effectiveness of tracking-by-detection data association, we define the comprehensive and robust cost of data association in MOT as shown in Figure 4.

3.2.1. Objective Function

We treat the matching of detected objects in the current frame with tracklets from the previous frame as a bipartite graph matching problem for data association. Each object is defined as a node. Let $O = \{o_i\}$ be a set of object detections, where $o_i = (p_i, t_i)$, with $p_i = (x, y, z)$ representing the position and features and $t_i$ representing the timestamp. A trajectory is defined as a list of object detections $T_k = \{o_k^1, o_k^2, \ldots, o_k^N\}$, where $k$ is the index of the trajectory.
The objective function for bipartite graph matching aims to find the minimum cost, defined as
$$\min_{M_{ij}} \sum_{i=1}^{N_t} \sum_{j=1}^{N_r} \left(1 - CM_{i,j}\right) M_{ij},$$
subject to
$$\sum_{i=1}^{N_t} M_{ij} = 1 \quad \forall j, \qquad \sum_{j=1}^{N_r} M_{ij} = 1 \quad \forall i,$$
where $M_{ij} = 1$ if the $i$-th detection is assigned to the $j$-th track in the previous frame; otherwise, $M_{ij} = 0$.
Here, $N_t$ is the number of detections in the current frame, $N_r$ is the number of tracks in the previous frame, and $CM_{i,j}$ is the total cost matrix entry representing the cost of matching detection $i$ in the current frame with track $j$ in the previous frame. The values of $CM_{i,j}$ are normalized to fall between 0 and 1 so that the objective function also ranges between 0 and 1.
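For illustration, the sketch below solves the assignment defined by Equations (4)–(6) as a minimum-cost flow on a source–detections–tracks–sink graph using networkx. The node naming, the integer cost scaling (the network simplex solver expects integer data), and forcing $\min(N_t, N_r)$ matches are our assumptions; a deployed tracker would additionally gate matches whose cost exceeds a threshold.

```python
import networkx as nx
import numpy as np

def min_cost_flow_matching(cost, scale=1000):
    """Globally optimal detection-track assignment via min-cost flow.

    cost: (N_t, N_r) array; per Equation (4), one would pass 1 - CM.
    """
    n_det, n_trk = cost.shape
    n_match = min(n_det, n_trk)            # units of flow = matches made
    G = nx.DiGraph()
    G.add_node("s", demand=-n_match)       # source supplies n_match units
    G.add_node("t", demand=n_match)        # sink absorbs them
    for i in range(n_det):
        G.add_edge("s", f"d{i}", capacity=1, weight=0)
    for j in range(n_trk):
        G.add_edge(f"r{j}", "t", capacity=1, weight=0)
    for i in range(n_det):
        for j in range(n_trk):
            G.add_edge(f"d{i}", f"r{j}", capacity=1,
                       weight=int(round(cost[i, j] * scale)))
    flow = nx.min_cost_flow(G)             # network simplex, global optimum
    return [(i, j) for i in range(n_det) for j in range(n_trk)
            if flow[f"d{i}"].get(f"r{j}", 0) == 1]

# Example: three detections, two previous-frame tracks.
matches = min_cost_flow_matching(np.array([[0.1, 0.9],
                                           [0.8, 0.2],
                                           [0.5, 0.6]]))
print(matches)  # [(0, 0), (1, 1)] -- detection 2 is left unmatched
```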

3.2.2. Association Cost Matrix

Bounding Box EIoU Cost To address challenges in tracking, especially during occlusions and varying object sizes that result in negligible IoUs, we utilize an enhanced intersection-over-union (EIoU) [36] metric that considers both positional and dimensional discrepancies between detected objects and trajectory predictions. The following describes the difference between IoU and EIoU:
$$IoU_{3d} = \frac{\left|B_i \cap B_j\right|}{\left|B_i \cup B_j\right|},$$
$$EIoU_{3d} = IoU_{3d} - \frac{\rho^2(b_i, b_j)}{c^2} - \frac{\rho^2(l_i, l_j)}{(c^l)^2} - \frac{\rho^2(w_i, w_j)}{(c^w)^2} - \frac{\rho^2(h_i, h_j)}{(c^h)^2},$$
where $B_i$ and $B_j$ denote the 3D bounding boxes of the detection and the trajectory prediction, respectively, $\rho(b_i, b_j)$ is the center distance between $B_i$ and $B_j$, $\rho(l_i, l_j) = |l_i - l_j|$, $\rho(w_i, w_j) = |w_i - w_j|$, $\rho(h_i, h_j) = |h_i - h_j|$, and $c$, $c^l$, $c^w$, $c^h$ represent the diagonal length, length, width, and height of the smallest enclosing box $C$, respectively. EIoU thus captures both positional and geometric correlations, including the center distance and dimension differences.
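As a hedged sketch of Equation (8), the function below computes EIoU for axis-aligned 3D boxes; real detections are rotated, so this is a simplification that keeps the example short.

```python
import numpy as np

def eiou_3d(b1, b2):
    """Axis-aligned 3D EIoU sketch; boxes given as (cx, cy, cz, l, w, h)."""
    b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
    # intersection and union volumes
    lo = np.maximum(b1[:3] - b1[3:] / 2, b2[:3] - b2[3:] / 2)
    hi = np.minimum(b1[:3] + b1[3:] / 2, b2[:3] + b2[3:] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(b1[3:]) + np.prod(b2[3:]) - inter
    iou = inter / max(union, 1e-9)
    # smallest enclosing box C
    elo = np.minimum(b1[:3] - b1[3:] / 2, b2[:3] - b2[3:] / 2)
    ehi = np.maximum(b1[:3] + b1[3:] / 2, b2[:3] + b2[3:] / 2)
    diag2 = np.sum((ehi - elo) ** 2)             # c^2
    cl, cw, ch = ehi - elo                       # enclosing l, w, h
    center2 = np.sum((b1[:3] - b2[:3]) ** 2)     # rho^2(b_i, b_j)
    dl, dw, dh = (b1[3:] - b2[3:]) ** 2          # rho^2 of l, w, h
    return iou - center2 / diag2 - dl / cl**2 - dw / cw**2 - dh / ch**2
```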
Center Point Distance Cost The distance between the center points of objects in two different frames is computed based on the Euclidean distance. The Euclidean distance $d(i, j)$ is computed as follows:
$$d(i, j) = \sqrt{\left(x_i^1 - x_j^2\right)^2 + \left(y_i^1 - y_j^2\right)^2},$$
where $(x_i^1, y_i^1)$ are the coordinates of the center point of object $i$ in the trajectory frame, $(x_j^2, y_j^2)$ are the coordinates of the center point of object $j$ in the detection frame, and $d(i, j)$ is the Euclidean distance between object $i$ and object $j$.
Motion Cost We design the motion cost calculation, which comprises angle prediction, velocity prediction, and a combined motion cost metric:
  • Angle Prediction The angle prediction is derived from the predicted values of $\sin(\theta)$ and $\cos(\theta)$. The angle $\theta$ is then determined using the formula
    $$\theta = \operatorname{atan2}\left(\sin(\theta), \cos(\theta)\right).$$
    The $\operatorname{atan2}$ function computes the angle $\theta$ from the sine and cosine values, accommodating positive or negative inputs for both functions and ensuring accurate angle determination.
  • Velocity Prediction The velocity prediction provides the two-dimensional speed components, $v_x$ and $v_y$, of the detected objects:
    $$V = (v_x, v_y),$$
    where $v_x$ and $v_y$ represent the velocity components in the eastward and northward directions, respectively. These contribute to the comprehensive motion state of the detected objects, encompassing both linear and angular velocities.
  • Velocity and Angle-based Motion Cost The $(i, j)$-th entry of the motion cost matrix $M_{i,j}$ is formulated as follows (a minimal implementation sketch follows this list):
    $$M_{i,j} = w_\theta \cdot \frac{1 - \cos(\theta_i - \theta_j)}{2} + w_v \cdot \left\| v_i - v_j \right\|^2,$$
    where $\theta_i$ and $\theta_j$ are the heading angles of the predicted object state in the current and previous frames, respectively, $v_i$ and $v_j$ represent the corresponding velocities, $w_\theta$ denotes the weight for heading angle similarity, and $w_v$ denotes the weight for velocity similarity.
    Angle Similarity The cosine similarity $\cos(\theta_i - \theta_j)$ is utilized to measure the similarity between two angles. Its value ranges from −1 to 1, with 1 indicating complete similarity and −1 indicating complete dissimilarity. To keep the result within $[0, 1]$, we use $1 - \cos(\theta_i - \theta_j)$ and divide the expression by 2, normalizing the cost to a more intuitive scale of 0 to 1.
    Velocity Difference Squaring the velocity difference simplifies the computation and amplifies larger discrepancies while remaining less sensitive to smaller variations.
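The sketch below implements Equation (12); the weights $w_\theta$ and $w_v$ shown are placeholders, not the values used in our experiments.

```python
import numpy as np

def motion_cost(theta_i, theta_j, v_i, v_j, w_theta=0.5, w_v=0.5):
    """Velocity- and angle-based motion cost, Eq. (12)."""
    angle_term = (1 - np.cos(theta_i - theta_j)) / 2     # in [0, 1]
    vel_term = np.sum((np.asarray(v_i) - np.asarray(v_j)) ** 2)
    return w_theta * angle_term + w_v * vel_term
```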
Total Cost Matrix The graph backbone extracts features and produces the predicted bounding box information. We design a comprehensive cost matrix based on the EIoU, the center-point positional information, and the heading-angle- and velocity-based motion cost. The total cost matrix $CM_{ij}$ is composed of the center-point cost, the motion cost, and the bounding box EIoU cost:
$$CM_{ij} = \alpha \cdot EIoU_{3d} + \beta \cdot d(i, j) + \gamma \cdot M_{ij},$$
where $\alpha$, $\beta$, and $\gamma$ are the corresponding weights for the EIoU cost, the center-point distance cost, and the motion cost, respectively.
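Combining the three terms of Equation (13), together with the normalization required in Section 3.2.1, might look as follows; the weights and the min-max normalization scheme are illustrative assumptions, and the signs of $\beta$ and $\gamma$ can be chosen so that a larger $CM$ consistently indicates a better match.

```python
import numpy as np

def total_cost_matrix(eiou, dist, motion, alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. (13): combine EIoU, center-distance, and motion terms,
    then min-max normalize the matrix into [0, 1]."""
    cm = alpha * eiou + beta * dist + gamma * motion
    return (cm - cm.min()) / (cm.max() - cm.min() + 1e-9)

# Per Eq. (4), the assignment cost passed to the flow solver is 1 - CM,
# e.g. using the min_cost_flow_matching sketch from Section 3.2.1:
# matches = min_cost_flow_matching(1.0 - total_cost_matrix(E, D, M))
```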

3.3. Trajectory Management and Track Update

As shown in Figure 5, in frame $t$, we define $D = \{D_1, D_2, \ldots, D_{N_D}\}$ as the set of multi-object detections in the current frame and $T = \{T_1, T_2, \ldots, T_{N_T}\}$ as the set of trajectories obtained from past frames. $N_D$ and $N_T$ denote the number of objects detected in frame $t$ and the number of candidate tracklets, respectively. A detection is denoted by $D_i = (C_i, B_i, M_i)$, where $C_i$ is the center coordinates of the detection obtained from the feature extraction network, $B_i = (x_i, y_i, z_i)$ represents the size of the bounding box, and $M_i$ includes the rotation around the y-axis and the velocities along the x- and y-coordinates. Each tracklet contains a sequence of detection objects sharing the same tracklet ID. Detections that are successfully associated are added to the set of tracks, represented by $T_{id}^{t-1} \cup D_{(id)}^{t-1} \rightarrow T_{id}^{t}$. Since the detection ID does not always correspond to the associated tracklet ID, we use parentheses $(id)$ to indicate the detection ID that aligns with the $id$-th tracklet. After global short-term bipartite graph matching for data association, we address occluded objects that result in missing matches by reconnecting them to the appropriate trajectories using an MLP-based ReID module that encodes motion information (velocity and heading angle).

3.3.1. Association Probability Calculation

The detection output features $z_i^d$, which include motion information (velocity and heading angle), are passed through an MLP to generate ReID feature vectors $r_i^d$. Similarly, a tracklet’s last-frame feature vector $z^T$ is used to generate the track ReID feature vector $r^T$.
Compute Similarity Scores Compute the cosine similarity between the ReID feature vectors of the tracklet’s last frame and the current frame’s detections:
$$s(r_i^d, r^T) = \frac{(r_i^d)^\top r^T}{\left\| r_i^d \right\| \left\| r^T \right\|},$$
where $s(r_i^d, r^T)$ is the similarity score based on the motion-information-encoded ReID features.
Compute Association Probability We use a softmax function to compute the association probability between tracks and detections based on the combined similarity scores:
$$p(d_i \mid T) = \frac{\exp\left(s(r_i^d, r^T)\right)}{\sum_j \exp\left(s(r_j^d, r^T)\right)}.$$
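A minimal PyTorch sketch of the motion-based ReID head and of Equations (14) and (15) follows; the input layout (vx, vy, sin θ, cos θ) and the layer widths are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionReID(nn.Module):
    """Hypothetical ReID head: maps motion features (vx, vy, sin, cos)
    to a unit-norm embedding; dimensions are illustrative."""
    def __init__(self, in_dim=4, embed_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, embed_dim))

    def forward(self, z):
        return F.normalize(self.mlp(z), dim=-1)  # unit norm: dot = cosine

reid = MotionReID()
r_det = reid(torch.randn(5, 4))   # unmatched-detection embeddings r_i^d
r_trk = reid(torch.randn(3, 4))   # tracklet last-frame embeddings r^T
sim = r_det @ r_trk.T             # cosine similarities, Eq. (14)
prob = sim.softmax(dim=0)         # Eq. (15): normalized over detections
```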

3.3.2. Track Creation and Deletion Strategy

As shown in Algorithm 1, to manage the initialization and termination of tracks, we propose a simplified track management strategy:
  • Refinement of Scores: For the matched pairs and new tracks, we refine the confidence scores using the embeddings obtained from the architecture head module. This refinement step, which follows a methodology similar to CenterPoint’s [15] second stage, ensures that the final tracking results are accurate and reliable.
  • Handling Unmatched Detections: Unmatched detections are reconnected with their corresponding trajectory IDs using the ReID features of Equations (14) and (15). If these unmatched detections are not matched within the next three frames, they are considered noise and removed.
  • New Track Creation: In each frame t, if a detection does not match any existing tracklet or unmatched detection and has a high confidence score, a new track is initialized. Each new track is assigned a unique track ID, and the corresponding detection is added to this new tracklet. This step ensures that newly appearing objects are properly tracked from the moment they are first detected.
  • Track Deletion: A tracklet is deleted if it has no matching detection for three consecutive frames. This parameter ensures that tracks are not immediately discarded when an object is missed for a few frames, allowing for temporary occlusions or missed detections without losing the track.
Algorithm 1 Trajectory management and track update
Require: Detection set $D$, trajectory set $T$, maximum age $max\_age$
Ensure: Updated trajectory set $T$
  Input:
    Detection set $D = \{D_1, D_2, \ldots, D_{N_D}\}$, where each detection $D_i = M_i$
      $M_i$: motion information (velocity and angle)
    Trajectory set $T = \{T_1, T_2, \ldots, T_{N_T}\}$, where each tracklet $T_j$ contains a sequence of detections with motion information
    Maximum age $max\_age$
  Output:
    Updated trajectory set $T$
  Hyperparameters:
    Similarity score threshold $\lambda_s$
    Association probability threshold $\lambda_p$
  Step 1: Refinement of Scores
  for each matched pair of detection $D_i$ and tracklet $T_j$ do
    Refine confidence scores using embeddings from the architecture head module.
  end for
  Step 2: Handle Unmatched Detections
  for each unmatched detection $D_u \in D$ do
    Compute similarity score $s(D_u, T_j)$ using Equation (14)
    Compute association probability $p(d_u \mid T_j)$ using Equation (15)
    if $s(D_u, T_j) \ge \lambda_s$ and $p(d_u \mid T_j) \ge \lambda_p$ then
      Reconnect $D_u$ to the corresponding trajectory ID
      Update the state of $T_j$ with $D_u$
    else
      Store $D_u$ as unmatched for the next frame
    end if
  end for
  Step 3: New Track Creation
  for each detection $D_u \in D$ do
    if $D_u$ does not match any existing tracklet or unmatched detection and has a high confidence score then
      Create a new tracklet $T_{new}$
      Assign a new track ID to $D_u$
      Add $D_u$ to $T_{new}$
    end if
  end for
  Step 4: Track Deletion
  for each tracklet $T_k \in T$ do
    if tracklet $T_k$ has no matching detection for a dynamic number of frames based on confidence and disappearance time then
      Delete tracklet $T_k$
    end if
  end for
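A compact Python sketch of Steps 2–4 of Algorithm 1 is given below, under simplifying assumptions: the ReID re-association test of Step 2 is elided, detections are plain dictionaries with an illustrative score field, and a fixed max_age replaces the dynamic deletion criterion.

```python
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    track_id: int
    detections: list = field(default_factory=list)
    missed: int = 0                       # consecutive unmatched frames

def manage_tracks(tracklets, matches, unmatched_dets, next_id,
                  score_thresh=0.5, max_age=3):
    """Sketch of Algorithm 1, Steps 2-4 (ReID re-association elided)."""
    matched_ids = set()
    for det, trk in matches:              # extend matched tracklets
        trk.detections.append(det)
        trk.missed = 0
        matched_ids.add(trk.track_id)
    for trk in tracklets:                 # age tracks that found no match
        if trk.track_id not in matched_ids:
            trk.missed += 1
    for det in unmatched_dets:            # Step 3: confident births only
        if det["score"] >= score_thresh:
            tracklets.append(Tracklet(next_id, [det]))
            next_id += 1
    alive = [t for t in tracklets if t.missed < max_age]  # Step 4
    return alive, next_id
```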

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

In this study, we evaluate the proposed method on the popular large-scale autonomous driving public datasets NuScenes and Waymo. Both datasets provide LiDAR point clouds and 3D bounding box labels. NuScenes [37] is a large dataset that contains 1000 driving sequences, each spanning 20 s. LiDAR data in NuScenes are provided at 20 Hz, but 3D labels are only given at 2 Hz. We used the validation set, which includes 150 scenes, 6019 frames, and 140k instances of object annotations. Our algorithm tracks the seven categories specified by NuScenes. The Waymo Open Dataset comprises 1150 sequences, with 798 training, 202 validation, and 150 testing sequences, each containing 20 s of continuous driving data within the range of [−75 m, 75 m]. It provides 3D labels for three classes: vehicle, pedestrian, and cyclist. We used LEVEL 2 as the default performance setting. Following the official evaluation metrics specified in [21], we report Multiple Object Tracking Accuracy (MOTA), false positives (FPs), misses, and mismatches for objects at the L2 difficulty.

4.1.2. Evaluation Metrics

We use the official evaluation metrics: Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP) [38], Higher-Order Tracking Accuracy (HOTA) [39], ID switches (IDSs), fragmentation (Frag), Average Multiple Object Tracking Accuracy (AMOTA), and Average Multiple Object Tracking Precision (AMOTP) given by the multi-object tracking challenge to evaluate this work. Among them, MOTA is defined as follows:
$$MOTA = 1 - \frac{FP + FN + IDS}{num_{gt}},$$
where $FP$, $FN$, and $IDS$ are the numbers of false positives, false negatives, and ID switches, respectively. The definition of AMOTA is as follows:
$$AMOTA = \frac{1}{L} \sum_{r \in \left\{\frac{1}{L}, \frac{2}{L}, \frac{3}{L}, \ldots, 1\right\}} MOTA_r,$$
where $r$ is the recall at specific thresholds and $L$ is the number of confidence thresholds. $MOTA_r$ represents $MOTA$ computed at recall $r$. MOTP is defined as
$$MOTP = \frac{\sum_{i,t} dis_{i,t}}{\sum_t c_t},$$
where $c_t$ represents the number of successful matches between the predicted detections and the corresponding ground truth in the $t$-th frame, and $dis_{i,t}$ is the distance between matched object $i$ and its ground truth in frame $t$. On the NuScenes benchmark, $dis_{i,t}$ is the Euclidean distance, which means that a smaller MOTP is better.
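As an illustrative calculation with assumed counts: for a sequence with $num_{gt} = 2000$ ground-truth objects, $FP = 100$, $FN = 200$, and $IDS = 10$, Equation (16) gives $MOTA = 1 - (100 + 200 + 10)/2000 = 0.845$.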

4.2. Implementation Details

In our experiments, we followed the experimental settings of CenterPoint. The method was run on a computer with an Intel(R) Xeon(R) Platinum 8358P CPU (15 vCPU @ 2.60 GHz) and an RTX 3090 GPU (24 GB). The parameters of all compared methods were set according to their best performances. In addition, the hyperparameter settings presented in this work are shown in Table 1. The loss functions for the relevant modules are detailed as follows:
ReID Loss
$$L_{reid} = \sum_{i \in \text{unmatched}} \left(1 - \frac{1}{|T|} \sum_{j \in T} p(d_i \mid T_j)\right),$$
where $T$ is the set of all trajectories in the current frame, $p(d_i \mid T_j)$ is the association probability between detection $d_i$ and trajectory $T_j$, and $|T|$ is the number of trajectories in the current frame.
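A one-function sketch of Equation (19) follows, assuming the association probabilities have already been arranged into a (num_unmatched, |T|) tensor.

```python
import torch

def reid_loss(assoc_prob):
    """Eq. (19): assoc_prob holds p(d_i | T_j); each unmatched detection
    is penalized by one minus its mean association probability."""
    return (1.0 - assoc_prob.mean(dim=1)).sum()
```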
By integrating the comprehensive cost matrix and ReID-based re-association in the data association stage, our method effectively enhances the robustness and accuracy of multi-object tracking in 3D point cloud data.

4.3. Comparison with Different Methods

4.3.1. NuScenes Open Dataset

We evaluated our approach on the NuScenes [37] validation set, and the results are presented in Table 2. Our method achieves an AMOTA score of 67.0, surpassing CenterPoint’s 66.5, indicating superior overall tracking accuracy. This improvement is largely attributed to the sophisticated cost matrix optimization and effective ReID-based matching, which reduce false positives (FPs) and false negatives (FNs). Our AMOTP score of 56.6 demonstrates competitive tracking precision, comparable to CenterPoint’s 56.7. The integration of detailed motion features ensures precise tracking even in challenging scenarios. Although the improvement in MOTA is marginal, with our score at 57.3 compared to CenterPoint’s 56.2, this still reflects consistent tracking accuracy. Our MOTA score benefits from the accurate motion model and the minimum-cost flow optimization, which collectively enhance the overall tracking accuracy by reducing errors in track association. One slight improvement in our method is the reduction in identity switches (IDSs) to 543, compared to CenterPoint’s 562. This reduction is primarily due to the combined effects of our backbone aggregating neighborhood information, the sophisticated cost matrix calculation in bipartite graph matching, and the final dynamic feature-based ReID, which collectively minimize instances of identity switching.
Comparison with Methods in Seven Classes Quantitative results using the NuScenes validation dataset are presented in Figure 6. This bar chart displays the performance of four competing methods in terms of the key evaluation metric, AMOTA, for the NuScenes evaluation sequences. Our proposed method achieves the highest overall AMOTA value, excelling in all classes, especially for several specific classes, including pedestrian, bicycle, and motorcycle. The AMOTA metric reflects the overall tracking performance of an MOT method by considering FPs, FNs, and IDSs across all recall values. Therefore, the superior AMOTA value attained by our method indicates that it surpasses its competitors, delivering an enhanced and comprehensive 3D tracking performance.

4.3.2. Waymo Open Dataset

In Table 3, we compare our method with other 3D MOT methods on the validation set of the Waymo [41] Open Dataset, where our method exhibits superior performance. Specifically, it outperforms the highest previously reported performance by 0.8%, 3.8%, and 1.8% in terms of the MOTA metric for vehicle, pedestrian, and cyclist, respectively, and exceeds the adopted CenterPoint baseline by 1.4%, 7.1%, and 3.5%. More specifically, our method shows significant improvement in the miss metric compared to the baseline, indicating that our approach can successfully recover objects missed by the detector. We attribute this success to our intra-frame graph geometrical structure and bipartite graph data association strategy, which captures object features, especially for small objects, from their neighborhood, allowing our model to better capture the spatial features of nearby objects. Additionally, our bipartite graph data association method can associate objects with more motion information through global optimization. Moreover, for pedestrians, our method achieves lower false positive (FP) values compared to other methods, indicating the higher quality of our trajectories. Pedestrian trajectories are more complex and crowded than those of other categories, making it challenging for the network to generate correct associations. Therefore, the lower FP rate for pedestrians indicates that our method can handle associations in complex scenarios. This is largely due to our ReID module, which uses the predicted velocity and heading angle as motion information to compute similarity and association probabilities between unmatched detections and trajectories, allowing our model to better associate correct track IDs during temporary occlusions. The results demonstrate that our model can achieve even higher performance due to the module for handling short-term occlusions. After bipartite graph matching, the unmatched detections are re-sorted based on their confidence scores and are matched again within the current frame and across the next three frames. This strategy ensures that short-term occlusions are handled effectively, enabling our model to maintain accurate tracking even in challenging scenarios.
As shown in Figure 7, our method’s robustness and accuracy improvements are visually demonstrated.
In Figure 7, (a) depicts a tracking scenario in the Waymo dataset involving vehicles, pedestrians, and bicycles. Unmatched detections (marked in pink) lead to the issue shown in (b). Occlusion causes an ID switch, as illustrated in (c). Our algorithm successfully re-associates the data and matches the correct IDs. Finally, after a brief period of tracking, our algorithm can re-track the targets even after short-term occlusions, as demonstrated in (e). This process illustrates the capability of our method to maintain high tracking accuracy and robustness, even in complex scenarios involving temporary occlusions.

4.3.3. Comparison with Advanced Tracking Methods

  • AB3DMOT [40]: This method employs a 3D Kalman filter for state estimation and the Hungarian algorithm for data association, providing robust tracking performance in 3D space.
  • SimpleTrack [42]: This method uses non-maximum suppression for detection preprocessing, a Kalman filter for motion modeling, 3D Generalized IoU for association, and trajectory interpolation to achieve object tracking.
  • ImmortalTrack [43]: This method uses a simple Kalman filter for trajectory prediction to maintain tracklets when the target is not visible, effectively preventing premature tracklet termination and reducing ID switches and track fragmentation.
  • SpOT [44]: SpOT proposes a multi-frame spatio-temporal tracking method that utilizes 4D refinement for frame-by-frame detection data association, achieving efficient object tracking.
  • CenterPoint [15]: This method first detects the centers of objects using a keypoint detector and regresses their 3D size, 3D orientation, and velocity. In the second stage, it refines these estimates using additional point features on the object.

4.3.4. Comparative Analysis

The primary differences between our method and the others lie in both the motion model and data association strategies. The Kalman filter (KF) utilized by AB3DMOT, SimpleTrack, and ImmortalTrack leverages historical information to predict the target’s state, providing smoother results when dealing with low-quality detections. However, the KF requires careful parameter initialization, and its robustness can be limited if the parameters are not appropriately set. Our method, on the other hand, employs a Constant Velocity (CV) model for motion prediction, which uses explicit speed predictions to handle abrupt and unpredictable motions more effectively. The simplicity of the CV model allows for straightforward implementation without the need for parameter tuning.
  • Advantages and Disadvantages of the KF The Kalman filter can utilize information from multiple frames, resulting in smoother outcomes in scenarios with low-quality detections. However, it necessitates careful parameter initialization, and improper parameter settings can significantly impact its robustness.
  • Advantages and Disadvantages of the CV Model The Constant Velocity model better handles abrupt and unpredictable motions with explicit speed predictions and is simpler to implement without requiring parameter tuning. Nevertheless, its effectiveness in motion smoothing is limited, and it may not perform as well in scenarios with low-quality detections.
Overall, while the KF offers smoother results by leveraging historical information, it requires the careful tuning of parameters, which can limit its robustness. In contrast, our CV-based method simplifies the tracking process, providing better handling of abrupt motions but with some limitations in motion smoothing.
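For contrast, a CV-model prediction step is a single line of code with no parameters to tune; in the sketch below, dt stands for the sensor's inter-frame interval.

```python
import numpy as np

def cv_predict(center, velocity, dt):
    """Constant Velocity prediction: propagate a track's center by its
    estimated velocity over one frame interval. No state covariance or
    gain tuning is involved, unlike a Kalman filter."""
    return np.asarray(center) + np.asarray(velocity) * dt
```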

4.3.5. Strengths and Weaknesses of Our Method

Our method, based on the CV model, includes intra-frame graph construction, inter-frame bipartite matching, and ReID-based trajectory management. These innovations enhance data association and tracking robustness. As shown in Figure 7, our method can handle short-term occlusions and ID switches by re-matching and tracking the objects effectively. In Figure 10(1), even when detection confidence is occasionally low, our robust data association strategy ensures accurate tracking and matching. Weaknesses are discussed in the failure case analysis section.

4.4. Ablation Studies

4.4.1. Ablation Study for GBRTracker

As shown in Table 4, our ablation study on the NuScenes validation set evaluated the impact of various components of GBRTracker on its overall performance. Integrating the graph backbone alone improved AMOTA by +0.1, indicating enhanced feature extraction through a KNN graph structure. Adding the bipartite graph matching module resulted in a +0.26 AMOTA increase, validating its advantage in data association. Incorporating the ReID ReTrack module led to an AMOTA improvement of +0.14, demonstrating its effectiveness in reconnecting lost objects or initiating new tracks.
Combining the graph backbone with bipartite graph matching increased AMOTA by +0.3, highlighting their complementary effects in feature extraction and data association. The combination of the graph backbone and ReID ReTrack improved AMOTA by +0.18, showing better target re-identification. Combining bipartite graph matching with ReID ReTrack resulted in a +0.33 AMOTA increase, indicating improvements in data association and re-identification.
Finally, integrating all three modules, GBRTracker achieved a +0.5 AMOTA improvement, demonstrating that these components work well together and provide the best overall multi-object tracking performance.

4.4.2. Influence of Object Detection Module

Effectiveness of Backbone on Tracker As can be seen in Table 5, for CenterPoint, our designed graph VoxelNet backbone shows slight improvements in AMOTA (from 0.665 to 0.667) and MOTA (from 0.562 to 0.563), indicating a marginal positive impact of EdgeConv integration. For GBRTracker, the improvements are more pronounced. AMOTA increases from 0.669 to 0.670 and MOTA from 0.581 to 0.583, suggesting that our tracking approach benefits more from the enhanced feature representation provided by our designed backbone. Overall, our graph-based VoxelNet provides slight performance gains across both tracking frameworks, with our GBRTracker showing a better utilization of the improved backbone features.
Effectiveness of K Value in KNN Graphs
As shown in Figure 8, the ablation study on KNN k values demonstrates that using k = 20 yields the highest AMOTA scores for both cars and pedestrians, indicating optimal performance. Increasing k beyond 20 results in decreased AMOTA, suggesting that larger neighborhood sizes may introduce noise, thereby reducing tracking accuracy.

4.4.3. Effectiveness of Aggregated Pairwise Cost

As shown in Table 6, the ablation study demonstrates the effectiveness of incorporating different costs into the bipartite graph matching process. Notably, the introduction of motion-based costs (velocity and angle) in combination with IoU and center distance metrics yields the highest improvement in performance. The baseline results for CenterPoint are $MOTA = 0.562$, $IDS = 562$, and $FRAG = 424$. Specifically, our tracker with all costs (IoU, center, and motion) achieves a 1.96% increase in MOTA, a 4.27% reduction in IDSs, and a 4.48% reduction in FRAG compared to the baseline. This indicates that while IoU and center distance are essential, the motion information significantly enhances the model’s ability to maintain accurate and stable object tracks, leading to fewer identity switches and fragmentation events. Thus, the motion cost matrix plays a crucial role in improving tracking robustness and overall performance.

4.4.4. Evaluation of Similarity Score and Association Probability for Track Update

Based on the heatmap in Figure 9, the analysis of the effect of different thresholds $\lambda_D$ for similarity scores and $\lambda_S$ for association probabilities on AMOTA demonstrates significant impacts on tracking performance. As the similarity score threshold $\lambda_D$ increases, AMOTA improves, reaching a peak value of 67.0 at a similarity threshold of 0.75 and an association probability threshold $\lambda_S$ of 0.8. This indicates that a moderate level of similarity and association probability optimizes tracking performance. Thresholds that are either too low or too high lead to decreased AMOTA values. Specifically, lower similarity thresholds (e.g., 0.6) and higher association probability thresholds (e.g., 0.9) result in lower AMOTA scores, indicating that overly lenient or overly strict matching criteria reduce tracking accuracy. It is therefore important to carefully tune the similarity score and association probability thresholds to achieve optimal tracking performance.

4.5. Discussion of Failure Cases and Future Challenges

4.5.1. Object Suddenly Appears with Delayed Tracking

Cause Analysis As shown in Figure 10(1), the object suddenly appeared at a distance in (b) with low confidence due to point cloud detection limitations, resulting in no trajectory initialization (as the confidence was too low to start a new track). However, in (c), even with low confidence, it still performed correct tracking due to our bipartite graph matching for data association. This indicates that our current trajectory initialization method is not well suited for handling low-confidence detections of newly appearing distant objects.
Future Challenges To address this, future work should focus on integrating multimodal data, such as utilizing RGB information from images to enhance detection confidence for distant objects. This can improve the robustness of the model in initializing tracks for newly appearing objects.

4.5.2. Complete Occlusion

Cause Analysis As shown in Figure 10(2), complete occlusion by other objects leads to tracking algorithm failure. In (b), the tracking-by-detection-based detector cannot predict proposals for the occluded object, preventing data association between the current detection and last-frame tracking results, resulting in failure, as seen in (c).
Future Challenges Future work could involve using the extended Kalman filter (EKF) or Bayesian estimation methods to predict the object’s state during complete occlusion. By utilizing historical trajectory and motion models, the algorithm can roughly estimate the target’s position. However, optimizing filter parameters for various datasets remains an intrinsic problem, as the robustness of these methods heavily depends on proper parameter initialization. Additionally, complete occlusion cannot be fully addressed by filtering methods alone, and it remains a challenging problem in the field of multi-object tracking that requires further research.

5. Conclusions

In this work, we proposed GBRTracker, a novel approach to 3D multi-object tracking (MOT) in point cloud data. By integrating an intra-frame graph structure into the backbone, our method effectively reduced detection misses through enhanced spatial feature aggregation. Additionally, we constructed an inter-frame bipartite graph for data association, leveraging a customized cost matrix incorporating center, box size, velocity, and heading angle information. Consequently, our method, coupled with a minimum-cost flow algorithm, achieved globally optimal matching and minimized ID switches. For unmatched detections, we designed a motion-based re-identification (ReID) feature embedding module to reconnect lost objects or initiate new tracks, maintaining high tracking accuracy and reliability. The results demonstrated that GBRTracker outperformed state-of-the-art techniques on the NuScenes and Waymo Open Datasets. However, it faced challenges in re-associating objects lost during several frames of short-term tracking, particularly when objects made sudden turns or stopped abruptly, affecting the effectiveness of the motion cost matrix. This highlights a potential limitation in our approach. In future work, we will focus on multi-frame Kalman filter-based methods to use historical information to improve tracking performance or integrate multimodal information through the incorporation of image data for multi-object tracking.

Author Contributions

Conceptualization, R.S.; methodology, writing, visualization, experiments, S.S.; formal analysis and coding, S.S. and C.S.; investigation, Q.Z.; resources, Y.D.; writing—review and editing, C.W.; visualization, G.X.; supervision, C.S.; project administration, B.X.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under grant number 2022YFC3803702.

Data Availability Statement

The NuScenes Open Dataset utilized in this work is openly available at https://www.nuscenes.org/ (accessed on 1 June 2024). The Waymo Open Dataset is openly available at https://waymo.com/open/ (accessed on 1 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, C.; Chen, J.; Li, J.; Peng, Y.; Mao, Z. Large language models for human-robot interaction: A review. Biomim. Intell. Robot. 2023, 3, 100131. [Google Scholar] [CrossRef]
  2. Peng, Y.; Funabora, Y.; Doki, S. An Application of Transformer based Point Cloud Auto-encoder for Fabric-type Actuator. In Proceedings of the JSME Annual Conference on Robotics and Mechatronics (Robomec), Nagoya, Japan, 28 June–1 July 2023; The Japan Society of Mechanical Engineers: Tokyo, Japan, 2023; p. 2P1-E12. [Google Scholar]
  3. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  4. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
  5. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  6. Wang, L.; Song, Z.; Zhang, X.; Wang, C.; Zhang, G.; Zhu, L.; Li, J.; Liu, H. SAT-GCN: Self-attention graph convolutional network-based 3D object detection for autonomous driving. Knowl.-Based Syst. 2023, 259, 110080. [Google Scholar] [CrossRef]
  7. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
  8. Sun, S.; Shi, C.; Wang, C.; Liu, X. A Novel Adaptive Graph Transformer For Point Cloud Object Detection. In Proceedings of the 2023 7th International Conference on Communication and Information Systems (ICCIS), Chongqing, China, 20–22 October 2023; pp. 151–155. [Google Scholar]
  9. Kim, A.; Brasó, G.; Ošep, A.; Leal-Taixé, L. Polarmot: How far can geometric relations take us in 3d multi-object tracking? In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022; pp. 41–58. [Google Scholar]
  10. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. Transmot: Spatial-temporal graph transformer for multiple object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4870–4880. [Google Scholar]
  11. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  12. Xu, Y.; Osep, A.; Ban, Y.; Horaud, R.; Leal-Taixé, L.; Alameda-Pineda, X. How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6787–6796. [Google Scholar]
  13. Wang, L.; Zhang, X.; Qin, W.; Li, X.; Gao, J.; Yang, L.; Li, Z.; Li, J.; Zhu, L.; Wang, H.; et al. Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11981–11996. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  15. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  16. Chiu, H.k.; Wang, C.Y.; Chen, M.H.; Smith, S.F. Probabilistic 3D Multi-Object Cooperative Tracking for Autonomous Driving via Differentiable Multi-Sensor Kalman Filter. arXiv 2023, arXiv:2309.14655. [Google Scholar]
  17. Ma, S.; Duan, S.; Hou, Z.; Yu, W.; Pu, L.; Zhao, X. Multi-object tracking algorithm based on interactive attention network and adaptive trajectory reconnection. Expert Syst. Appl. 2024, 249, 123581. [Google Scholar] [CrossRef]
  18. Liu, H.; Ma, Y.; Hu, Q.; Guo, Y. CenterTube: Tracking multiple 3D objects with 4D tubelets in dynamic point clouds. IEEE Trans. Multimed. 2023, 25, 8793–8804. [Google Scholar] [CrossRef]
  19. Wang, L.; Zhang, J.; Cai, P.; Lil, X. Towards Robust Reference System for Autonomous Driving: Rethinking 3D MOT. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8319–8325. [Google Scholar]
  20. Chen, X.; Shi, S.; Zhang, C.; Zhu, B.; Wang, Q.; Cheung, K.C.; See, S.; Li, H. Trajectoryformer: 3D object tracking transformer with predictive trajectory hypotheses. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18527–18536. [Google Scholar]
  21. Chen, S.; Yu, E.; Li, J.; Tao, W. Delving into the Trajectory Long-tail Distribution for Muti-object Tracking. arXiv 2024, arXiv:2403.04700. [Google Scholar]
  22. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  23. Ding, G.; Liu, J.; Xia, Y.; Huang, T.; Zhu, B.; Sun, J. LiDAR Point Cloud-based Multiple Vehicle Tracking with Probabilistic Measurement-Region Association. arXiv 2024, arXiv:2403.06423. [Google Scholar]
  24. Liu, J.; Bai, L.; Xia, Y.; Huang, T.; Zhu, B.; Han, Q.L. GNN-PMB: A simple but effective online 3D multi-object tracker without bells and whistles. IEEE Trans. Intell. Veh. 2022, 8, 1176–1189. [Google Scholar] [CrossRef]
  25. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  26. Weng, X.; Wang, J.; Held, D.; Kitani, K. 3d multi-object tracking: A baseline and new evaluation metrics. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10359–10366. [Google Scholar]
  27. Zaech, J.N.; Liniger, A.; Dai, D.; Danelljan, M.; Van Gool, L. Learnable online graph representations for 3d multi-object tracking. IEEE Robot. Autom. Lett. 2022, 7, 5103–5110. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Liu, J.; Xia, Y.; Huang, T.; Han, Q.L.; Liu, H. LEGO: Learning and graph-optimized modular tracker for online multi-object tracking with point clouds. arXiv 2023, arXiv:2308.09908. [Google Scholar]
  29. Meyer, F.; Kropfreiter, T.; Williams, J.L.; Lau, R.; Hlawatsch, F.; Braca, P.; Win, M.Z. Message passing algorithms for scalable multitarget tracking. Proc. IEEE 2018, 106, 221–259. [Google Scholar] [CrossRef]
  30. Rangesh, A.; Maheshwari, P.; Gebre, M.; Mhatre, S.; Ramezani, V.; Trivedi, M.M. Trackmpnn: A message passing graph neural architecture for multi-object tracking. arXiv 2021, arXiv:2101.04206. [Google Scholar]
  31. Sun, S.; Wang, C.; Liu, X.; Shi, C.; Ding, Y.; Xi, G. Spatio-Temporal Bi-directional Cross-frame Memory for Distractor Filtering Point Cloud Single Object Tracking. arXiv 2024, arXiv:2403.15831. [Google Scholar]
  32. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 474–490. [Google Scholar]
  33. Han, S.; Huang, P.; Wang, H.; Yu, E.; Liu, D.; Pan, X. Mat: Motion-aware multi-object tracking. Neurocomputing 2022, 476, 75–86. [Google Scholar] [CrossRef]
  34. Wu, H.; Li, Q.; Wen, C.; Li, X.; Fan, X.; Wang, C. Tracklet Proposal Network for Multi-Object Tracking on Point Clouds. In Proceedings of the IJCAI, Virtual Event, 19–26 August 2021; pp. 1165–1171. [Google Scholar]
  35. Yu, E.; Li, Z.; Han, S.; Wang, H. Relationtrack: Relation-aware multiple object tracking with decoupled representation. IEEE Trans. Multimed. 2022, 25, 2686–2697. [Google Scholar] [CrossRef]
  36. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  37. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  38. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 1–10. [Google Scholar] [CrossRef]
  39. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
  40. Weng, X.; Wang, J.; Held, D.; Kitani, K. Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics. arXiv 2020, arXiv:2008.08063. [Google Scholar]
  41. Wang, Y.; Chen, S.; Huang, L.; Ge, R.; Hu, Y.; Ding, Z.; Liao, J. 1st Place Solutions for Waymo Open Dataset Challenges–2D and 3D Tracking. arXiv 2020, arXiv:2006.15506. [Google Scholar]
42. Pang, Z.; Li, Z.; Wang, N. SimpleTrack: Understanding and rethinking 3D multi-object tracking. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 680–696. [Google Scholar]
  43. Wang, Q.; Chen, Y.; Pang, Z.; Wang, N.; Zhang, Z. Immortal tracker: Tracklet never dies. arXiv 2021, arXiv:2111.13672. [Google Scholar]
44. Stearns, C.; Rempe, D.; Li, J.; Ambruş, R.; Zakharov, S.; Guizilini, V.; Yang, Y.; Guibas, L.J. SpOT: Spatiotemporal modeling for 3D object tracking. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 639–656. [Google Scholar]
Figure 1. Our tracker pipeline.
Figure 2. Overview of the GBRTracker architecture. First, we design a graph-enhanced detector to strengthen spatial features and reduce occlusion-related detection misses. Then, we design a pairwise cost matrix to represent the bipartite graph matching between tracks and detections, minimizing ID switches. Finally, for unmatched detections, we design motion-based ReID and track features to reconnect, initialize, or terminate trajectories, handling temporary occlusions effectively.
Figure 3. KNN voxel graph visualization. (a) The VoxelNet detector extracts voxel features; (b) the KNN graph is constructed in feature space, not 3D Euclidean space; (c) an adaptive GraphConv is applied on each edge, and features are aggregated by max-pooling.
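To make the aggregation in Figure 3 concrete, the sketch below builds a k-nearest-neighbor graph in feature space and max-pools EdgeConv-style edge features per node. It is a minimal NumPy illustration under our own assumptions (a random linear map standing in for the learned adaptive GraphConv, and example shapes); it is not the paper's implementation.

```python
# Minimal sketch of KNN-graph feature aggregation (Figure 3), assuming an
# EdgeConv-style edge feature [x_i, x_j - x_i] and a random linear edge map
# in place of the learned adaptive GraphConv.
import numpy as np

def knn_graph_aggregate(feats: np.ndarray, k: int = 20, seed: int = 0) -> np.ndarray:
    """feats: (N, C) voxel features. Returns (N, C) aggregated node features."""
    rng = np.random.default_rng(seed)
    n, c = feats.shape
    w = rng.standard_normal((2 * c, c)) * 0.01           # stand-in for a learned weight
    # Pairwise squared distances in *feature* space -> k nearest neighbors per node.
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                          # exclude self-loops
    idx = np.argsort(d2, axis=1)[:, :k]                   # (N, k) neighbor indices
    # EdgeConv-style edge features [x_i, x_j - x_i], then max-pool over the k edges.
    x_i = np.repeat(feats[:, None, :], k, axis=1)         # (N, k, C)
    x_j = feats[idx]                                      # (N, k, C)
    edge = np.concatenate([x_i, x_j - x_i], axis=-1)      # (N, k, 2C)
    return np.maximum(edge @ w, 0).max(axis=1)            # ReLU, then max-pooling

# Example: 100 voxels with 16-dim features, K = 20 as in Table 1.
out = knn_graph_aggregate(np.random.rand(100, 16), k=20)
```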
Figure 4. Bipartite graph matching for data association. The current frame is processed by the detector, and features from the previous frame's tracks are used for bipartite graph matching, which reduces computational redundancy compared to a full adjacency matrix. We design an objective function and a cost matrix and apply the minimum-cost flow algorithm to obtain the final matching results.
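As a compact illustration of the matching step in Figure 4, the sketch below forms a toy pairwise cost matrix from center distance, velocity difference, and wrapped heading-angle difference, then solves the assignment. SciPy's Hungarian solver (linear_sum_assignment) is used as a stand-in for the paper's minimum-cost flow formulation; the box-size term is omitted, and the weights (echoing w_v = 0.45 and w_θ = 0.55 from Table 1) and the gating threshold are illustrative choices, not the paper's exact values.

```python
# Hedged sketch of bipartite data association (Figure 4): toy cost matrix over
# center, velocity, and heading angle, solved with the Hungarian algorithm as a
# stand-in for the minimum-cost flow solver used in the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_cost(tracks, dets, w_cen=1.0, w_vel=0.45, w_ang=0.55):
    """tracks/dets: arrays of rows [x, y, vx, vy, yaw]; returns (T, D) costs."""
    cen = np.linalg.norm(tracks[:, None, :2] - dets[None, :, :2], axis=-1)
    vel = np.linalg.norm(tracks[:, None, 2:4] - dets[None, :, 2:4], axis=-1)
    dyaw = tracks[:, None, 4] - dets[None, :, 4]
    ang = np.abs(np.arctan2(np.sin(dyaw), np.cos(dyaw)))  # wrapped angle difference
    return w_cen * cen + w_vel * vel + w_ang * ang

tracks = np.array([[0.0, 0.0, 1.0, 0.0, 0.00], [5.0, 5.0, 0.0, 1.0, 1.57]])
dets   = np.array([[5.2, 5.1, 0.0, 1.1, 1.50], [0.9, 0.1, 1.0, 0.0, 0.05]])
cost = pairwise_cost(tracks, dets)
rows, cols = linear_sum_assignment(cost)                  # globally optimal 1-to-1
matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 2.0]  # simple gate
# -> track 0 matches detection 1, track 1 matches detection 0.
```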
Figure 5. Trajectory management and track update. The ranked score selects which detections enter matching. Computing ReID embedding-feature similarity between unmatched detections and the last frame of each trajectory improves tracking accuracy. Specifically, softmax normalizes the similarities into association probabilities within a maximum-age window of frames.
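The re-tracking step in Figure 5 can be sketched as follows: build a motion feature from velocity and heading angle, compare an unmatched detection against the last-frame features of live tracks with cosine similarity, normalize with a softmax, and reconnect only if both thresholds are met. The embedding below and the softmax temperature tau are toy stand-ins we introduce for illustration; the similarity and probability thresholds echo Table 1 (0.75 and 0.8).

```python
# Minimal sketch of motion-based ReID re-tracking (Figure 5). The embedding and
# temperature are our assumptions; thresholds follow Table 1.
import numpy as np

def motion_embedding(vx, vy, yaw):
    """Toy motion feature from velocity and heading angle."""
    return np.array([np.hypot(vx, vy), np.cos(yaw), np.sin(yaw)])

def reconnect(det_emb, track_embs, sim_thr=0.75, prob_thr=0.8, tau=0.1):
    """Return the index of the track to reconnect to, or None to start a new track."""
    sims = np.array([
        t @ det_emb / (np.linalg.norm(t) * np.linalg.norm(det_emb) + 1e-8)
        for t in track_embs
    ])
    probs = np.exp(sims / tau) / np.exp(sims / tau).sum()  # softmax normalization
    best = int(np.argmax(probs))
    if sims[best] >= sim_thr and probs[best] >= prob_thr:
        return best                                        # reconnect to this track ID
    return None                                            # initialize a new track

det = motion_embedding(1.0, 0.1, 0.05)
tracks = [motion_embedding(1.0, 0.0, 0.0), motion_embedding(0.0, 1.2, 1.5)]
print(reconnect(det, tracks))  # -> 0: reconnect to the first track's ID
```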
Figure 6. Comparison of AMOTA results overall and for seven classes, namely, bicycle, bus, car, motorcycle, pedestrian, trailer, and truck, on the NuScenes validation set.
Figure 7. Temporal occlusion tracking. (a) A tracking scenario with two objects, marked by yellow stars. (b) The pink object receives an unmatched low detection score, and its ID switches to the blue pedestrian. (c) An ID switch from a vehicle to a pedestrian. (d) The correct ID. (e) A temporary ID switch followed by re-tracking of the object with the correct ID.
Figure 8. Comparison of AMOTA results for cars and pedestrians with different K values on the NuScenes validation set.
Figure 9. Ablation study of the impact of the detection-to-track score on track assignment and confidence score.
Figure 10. Failure cases. (1) An object suddenly appears with delayed tracking: the black bounding box represents the ground truth, pink indicates low detection confidence, red indicates high detection confidence, and blue represents invalid states. The object appears with low confidence and is not immediately initialized as a new track, delaying the tracking response. (2) Tracking failure due to complete occlusion: the object is completely occluded by other objects, so the detector cannot produce a proposal for it and the tracker fails.
Table 1. All hyperparameters set in this study.

Hyperparameter | Source | Value
K | Section 3.1.2 | 20
F_i | Equation (3) | 256
w_θ | Equation (12) | 0.55
w_v | Equation (12) | 0.45
α | Equation (13) | 0.5
β | Equation (13) | 0.3
γ | Equation (13) | 0.2
s(r_id, r_D) | Equation (14) | 0.75
p(d_i|T) | Equation (15) | 0.8
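For reference, the Table 1 values can be gathered into a single configuration dictionary, as one might lay them out in a training or inference script. The key names are our own shorthand, not identifiers from the authors' code.

```python
# Table 1 hyperparameters as one config dict; key names are hypothetical shorthand.
GBRTRACKER_HPARAMS = {
    "knn_k": 20,             # K, Section 3.1.2: neighbors per voxel node
    "feat_dim": 256,         # F_i, Equation (3): node feature dimension
    "w_theta": 0.55,         # w_θ, Equation (12): heading-angle cost weight
    "w_v": 0.45,             # w_v, Equation (12): velocity cost weight
    "alpha": 0.5,            # α, Equation (13): cost-term weight
    "beta": 0.3,             # β, Equation (13): cost-term weight
    "gamma": 0.2,            # γ, Equation (13): cost-term weight
    "reid_sim_thr": 0.75,    # s(r_id, r_D), Equation (14): ReID similarity threshold
    "assoc_prob_thr": 0.8,   # p(d_i|T), Equation (15): association probability threshold
}
```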
Table 2. Performance on the NuScenes 3D tracking validation set. All methods listed are lidar-only, without multimodal extension.

Method | AMOTA | AMOTP | MOTA | IDS
AB3DMOT [40] | 57.8 | 80.7 | 51.4 | 1275
Probabilistic [16] | 56.1 | 80.0 | 48.3 | 679
MPN-Baseline [27] | 59.3 | 83.2 | 51.4 | 1079
CenterPoint [15] | 66.5 | 56.7 | 56.2 | 562
Ours | 67.0 | 56.6 | 57.3 | 543
Table 3. Tracking performance on the Waymo Open Dataset LEVEL 2 validation split. Arrows indicate whether a higher (↑) or lower (↓) value is better.

Method | Vehicle MOTA↑ | FP%↓ | Miss%↓ | IDS%↓ | Pedestrian MOTA↑ | FP%↓ | Miss%↓ | IDS%↓ | Cyclist MOTA↑ | FP%↓ | Miss%↓ | IDS%↓
AB3DMOT [40] | 55.7 | - | 30.2 | 0.40 | 52.2 | - | - | 2.74 | - | - | - | -
CenterPoint [15] | 55.1 | 10.8 | 33.9 | 0.26 | 54.9 | 10.0 | 34.0 | 1.13 | 57.4 | 13.7 | 28.1 | 0.83
SimpleTrack [42] | 56.1 | 10.4 | 33.4 | 0.08 | 57.8 | 10.9 | 30.9 | 0.42 | 56.9 | 11.6 | 30.9 | 0.56
ImmortalTrack [43] | 56.4 | 10.2 | 33.4 | 0.01 | 58.2 | 11.3 | 30.5 | 0.26 | 59.1 | 11.8 | 28.9 | 0.10
SpOT [44] | 55.7 | 11.0 | 33.2 | 0.18 | 60.5 | 11.3 | 27.6 | 0.56 | - | - | - | -
Ours | 56.5 | 10.1 | 32.2 | 0.18 | 62.0 | 10.7 | 27.4 | 0.52 | 60.9 | 11.3 | 27.1 | 0.26
Table 4. Multi-object tracking ablation study on the NuScenes validation set. The rows correspond to different configurations of our network, GBRTracker, based on CenterPoint. The components tested are the graph backbone, bipartite graph matching, and the ReID ReTrack module. Arrows indicate whether a higher (↑) value is better.

Method | AMOTA↑
CenterPoint | 66.5
CenterPoint + graph backbone | 66.6
CenterPoint + ReID ReTrack | 66.64
CenterPoint + graph backbone + ReID ReTrack | 66.68
CenterPoint + bipartite graph matching | 66.76
CenterPoint + graph backbone + bipartite graph matching | 66.8
CenterPoint + bipartite graph matching + ReID ReTrack | 66.83
CenterPoint + graph backbone + bipartite graph matching + ReID ReTrack (GBRTracker) | 67.0
Table 5. Results of using different backbones with the same tracker on the NuScenes validation set.

Tracker | Backbone | AMOTA | AMOTP | MOTA
CenterPoint | VoxelNet | 0.665 | 0.567 | 0.562
CenterPoint | Ours | 0.667 | 0.565 | 0.563
GBRTracker (ours) | VoxelNet | 0.669 | 0.558 | 0.581
GBRTracker (ours) | Ours | 0.670 | 0.557 | 0.583
Table 6. Results of using different affinity models on the NuScenes validation set, where IoU, Cen., and Motion denote the EIoU-based BBox distance, center distance, and combined velocity-and-angle pairwise cost matrix, respectively. The arrows indicate the direction of improvement or degradation: ↑ for increase and ↓ for decrease.

IoU | Cen. | Motion | MOTA↑ | IDS↓ | FRAG↓
✓ | ✓ | – | 0.53% | 2.14% | 0.94%
✓ | – | ✓ | 0.89% | 3.03% | 2.12%
– | ✓ | ✓ | 1.42% | 3.91% | 3.31%
✓ | ✓ | ✓ | 1.96% | 4.27% | 4.48%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
