The effectiveness of our method is demonstrated by benchmarking against SORT-style baseline models on three large-scale datasets: KITTI, NuScenes, and Waymo. In addition, we perform an ablation study on the NuScenes dataset to better understand the impact of each component on our system's overall performance.
4.1. Tuning the Hyperparameters
There are three hyperparameters in our data association pipeline: the confidence threshold, the detection model accuracy in Equation (3), and the affinity threshold.
The confidence threshold is set to 0.5 according to [22]. It is worth noticing that [22] suggests that this parameter does not have any significant effect on the tracking performance. The value of the detection model accuracy is chosen empirically such that a high-confidence tracklet becomes low-confidence after being undetected for three frames.
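For intuition only, this criterion pins down a per-frame decay factor under a simplifying assumption of ours (not necessarily the form of Equation (3)): that a tracklet's confidence is multiplied by a constant factor for every frame it goes undetected. A minimal sketch in Python:

```python
# Back-of-the-envelope sketch, assuming (our assumption, not Equation (3))
# that a tracklet's confidence is multiplied by a constant factor p for
# every frame it goes undetected.

def decay_factor(high_conf: float, low_conf: float, n_frames: int) -> float:
    """Per-frame factor p such that high_conf * p**n_frames == low_conf."""
    return (low_conf / high_conf) ** (1.0 / n_frames)

# e.g., a tracklet at confidence 0.9 should fall to the 0.5 threshold
# after three undetected frames
p = decay_factor(0.9, 0.5, 3)
print(f"per-frame decay factor: {p:.3f}")  # ~0.822
```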
As observed from experiments, the position affinity is the dominant component in the tracklet-to-detection and tracklet-to-tracklet affinity. Since the position affinity, which is the Mahalanobis distance between the expected detection and the real detection, is χ²-distributed, the affinity threshold in Equation (11) is chosen according to the percentile of the χ² distribution into which the position affinity resulting from a correct association is expected to fall. Notice that the degree of freedom of the χ² distribution of our interest is 4 due to the dimension of the measurement vector in Equation (6).
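For illustration, the threshold for a given percentile can be read off the χ² distribution with four degrees of freedom using SciPy; the 0.95 percentile below is an assumed placeholder rather than the value selected in this work, and the halving reflects the implementation detail discussed below:

```python
from scipy.stats import chi2

PERCENTILE = 0.95  # illustrative placeholder, not the tuned value
DOF = 4            # dimension of the measurement vector in Equation (6)

# chi2.ppf returns the chi-squared value below which PERCENTILE of the
# probability mass lies, i.e., where a correct association is expected to fall.
table_value = chi2.ppf(PERCENTILE, df=DOF)

# The motion affinity is scaled by half in our implementation (see below),
# so the threshold is half the table value.
affinity_threshold = table_value / 2.0
print(affinity_threshold)  # chi2.ppf(0.95, 4) ~ 9.49, so threshold ~ 4.74
```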
Intuitively, the affinity threshold determines how conservative our tracking algorithm is. A small threshold makes our algorithm more skeptical by rejecting detections that are close, but not close enough, to the predictions of the tracks' states. This works well in scenarios where a large number of false-positive detections is present (e.g., the Waymo dataset). However, a threshold that is too small can reject correct detections, thus deteriorating the tracking performance. The method used to search for a good value of the affinity threshold is as follows (a sketch is given after the list):
1. Perform a coarse grid search over a set of expected percentiles of the χ² distribution, which induces a corresponding set of threshold values, while keeping the rest of the hyperparameters unchanged. Please note that the value of the threshold is just half of the corresponding value in the χ² distribution table; this is because the motion affinity is scaled by half in our implementation to reduce its dominance over the size affinity.
2. Once a performance peak is identified, perform a fine grid search on a refined set around the peak.
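A minimal sketch of this coarse-to-fine procedure, assuming a hypothetical `evaluate_amota` helper that runs the tracker on a validation split and returns AMOTA; the candidate percentile grids are caller-supplied placeholders, not the sets used in our experiments:

```python
from scipy.stats import chi2

def to_threshold(q: float, dof: int = 4) -> float:
    """Map a chi-squared percentile to an affinity threshold (half the table value)."""
    return chi2.ppf(q, df=dof) / 2.0

def search_threshold(evaluate_amota, coarse_percentiles, fine_step=0.01):
    """Coarse-to-fine grid search over the expected chi-squared percentile."""
    # Coarse pass: pick the percentile whose threshold maximizes AMOTA
    best_q = max(coarse_percentiles, key=lambda q: evaluate_amota(to_threshold(q)))
    # Fine pass: refine around the coarse peak, staying inside (0, 1)
    fine_grid = [q for q in (best_q - fine_step, best_q, best_q + fine_step)
                 if 0.0 < q < 1.0]
    best_q = max(fine_grid, key=lambda q: evaluate_amota(to_threshold(q)))
    return to_threshold(best_q)
```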
The resulting values of the affinity threshold on KITTI, NuScenes, and Waymo are , , and , respectively.
4.2. Tracking Results
Evaluation Metrics: Classically, MOT systems are evaluated by the CLEAR MOT metrics [29], which compute tracking performance based on three core quantities: the number of False Positives, False Negatives, and ID Switches (the definitions of these quantities can be found in Section 1). Intuitively, this set of metrics aims at evaluating a tracker's precision in estimating tracks' states as well as its consistency (i.e., keeping a unique ID for each object even in the presence of occlusion). As pointed out by [30] and later by [6], there is a linear relation between MOTA and an object detector's recall rate; as a result, MOTA does not provide a well-rounded evaluation of a tracker's performance. To remedy this, [6] proposes to average MOTA and MOTP over a range of recall rates, resulting in two integral metrics, AMOTA and AMOTP, which have become the norm in recent benchmarks.
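As a sketch, AMOTA can be computed by sweeping detection-confidence operating points, evaluating the CLEAR MOT counts at each resulting recall level, and averaging; clamping negative MOTA values to zero follows the common benchmark convention (this is a simplified illustration, not the exact benchmark evaluation code):

```python
# Simplified sketch of the integral metric: average MOTA over a sweep of
# recall operating points obtained by thresholding detection confidence.
# per_recall_counts holds one (fp, fn, ids) tuple per operating point,
# produced by a standard CLEAR MOT evaluation at that recall level.

def amota(per_recall_counts, num_gt: int) -> float:
    motas = []
    for fp, fn, ids in per_recall_counts:
        mota = 1.0 - (fp + fn + ids) / num_gt
        motas.append(max(0.0, mota))  # clamp negative MOTA, per convention
    return sum(motas) / len(motas)

# e.g., three operating points from high to low confidence threshold
print(amota([(10, 300, 5), (40, 150, 9), (120, 60, 14)], num_gt=1000))
```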
Datasets: To verify the effectiveness of our method, we benchmark it on three popular autonomous driving datasets that offer a 3D MOT benchmark: KITTI, NuScenes, and Waymo. These datasets are collections of driving sequences recorded in various environments using a multi-modal sensor suite including LiDAR. The KITTI tracking benchmark is concerned with two classes of objects: cars and pedestrians. Initially, KITTI tracking was designed for MOT in 2D images; recently, [6] adapted it to 3D MOT. NuScenes concerns a larger set of objects, comprising cars, bicycles, buses, trucks, pedestrians, motorcycles, and trailers. Waymo shares the same interest as NuScenes but groups car-like vehicles into a meta class.
Public Detection: As can be seen in Table 1, AMOTA highly depends on the precision of object detectors. Therefore, to have a fair comparison, the baseline detection results made publicly available by the benchmarks are used as the input to our tracking system. Specifically, we use Point-RCNN detections for the KITTI dataset, MEGVII detections for NuScenes, and PointPillars with PPBA detections for Waymo.
The performance of our model compared to the SORT-style baseline model on the three popular benchmarks is shown in Table 2.
As can be seen, our model consistently outperforms the baseline model in terms of the primary metric AMOTA, thus proving the effectiveness of the two-stage data association. Specifically, the improvements are , , and for KITTI, NuScenes, and Waymo, respectively. It is worth noticing that our approach has more track fragmentations (FRAG) in KITTI: 259 compared to 93 for the baseline. The reason for this is that, at each time step, tracklets that have no matched detections are not reported by our approach, while the baseline predicts their poses using the constant velocity (CV) model and reports these predictions.
The runtime comparison of our tracking algorithm against AB3DMOT [6] on the KITTI dataset is shown in Table 3. Despite the additional complexity added by the second stage of the data association (i.e., the Global Association step), our approach achieves a runtime close to that of AB3DMOT on KITTI and exceeds real-time speed by a large margin. On more challenging datasets, the object detector generates a significantly larger number of detections per frame on average ( on NuScenes and on Waymo, compared to on KITTI). This large number of detections enlarges the cost matrices of the Local and Global Association steps, thus making the LAPs they represent more costly to solve. Therefore, the runtime of our approach drops to frames per second (fps) on NuScenes and fps on Waymo. This runtime could be greatly improved if our approach were re-implemented in a compiled language such as C++.
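To make the cost argument concrete, here is a hedged sketch of one association step solved as an LAP with SciPy; the matrix sizes are illustrative stand-ins for the per-frame detection counts mentioned above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost: np.ndarray, threshold: float):
    """Solve the LAP on a tracklets-by-detections cost matrix and gate weak pairs."""
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style solver
    # keep only pairs whose affinity (distance-like cost) clears the gate
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= threshold]

rng = np.random.default_rng(0)
# A KITTI-like frame has far fewer detections than a Waymo-like frame,
# so its cost matrix, and hence its LAP, is much cheaper to solve.
small = rng.random((10, 12))   # illustrative sparse frame
large = rng.random((80, 100))  # illustrative dense frame
print(len(associate(small, 0.5)), len(associate(large, 0.5)))
```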
The qualitative performance on NuScenes is illustrated by drawing the bird's-eye view of a scene with the tracking result, the ground-truth objects, and the detection results accumulated through time, as in Figure 3 and Figure 4. The difficulty of 3D MOT can be appreciated from the noisy detections: there are several false positives, denoted by the clutter at the top of Figure 4-Detection, and false negatives, as shown by the absence of one trajectory in the top-left corner of Figure 3-Detection.